refactoring to work with the annotated plain text #18

Open
wants to merge 248 commits into
base: master


Contributor

@tpeng tpeng commented May 26, 2014

Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.

The annotated input text is similar to GATE's format, e.g. this is a <NER>test</NER>. Entities are surrounded by <> tags. The rest of the change just moves the generic code to a more appropriate place.
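As a rough illustration of the annotation format described above, here is a minimal sketch of turning such text into (token, label) pairs. It is not code from this pull request or from webstruct: the function name `parse_annotated_text`, the whitespace tokenization, and the IOB labelling scheme are all assumptions made for the example.

```python
import re

# Hypothetical helper, NOT part of webstruct or of this pull request.
# Assumptions: entities look like <TAG>...</TAG>, are not nested,
# tokens are split on whitespace, and labels follow the IOB scheme.
TAG_RE = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)


def parse_annotated_text(text):
    """Split GATE-like annotated text into (token, label) pairs."""
    pairs = []
    pos = 0
    for match in TAG_RE.finditer(text):
        # Text before the entity is outside any annotation: label "O".
        pairs.extend((tok, "O") for tok in text[pos:match.start()].split())
        # The first token of an entity gets B-<tag>, the rest get I-<tag>.
        for i, tok in enumerate(match.group("body").split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, "%s-%s" % (prefix, match.group("tag"))))
        pos = match.end()
    # Remaining text after the last entity.
    pairs.extend((tok, "O") for tok in text[pos:].split())
    return pairs


print(parse_annotated_text("this is a <NER>test</NER>"))
# [('this', 'O'), ('is', 'O'), ('a', 'O'), ('test', 'B-NER')]
```

The resulting pairs can be split into a token sequence and a label sequence, which is the shape sequence-labelling code such as a CRF typically expects.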

@@ -3,7 +3,7 @@
:mod:`webstruct.feature_extraction` contains classes that help
with:

-- converting HTML pages into lists of feature dicts and
+- converting annnotated data into lists of feature dicts and
Member

The data is not necessarily annotated: HtmlLoader is used to load raw data.

@kmike
Member

kmike commented May 26, 2014

My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill: why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.

5 participants