Compute character index mapping back to the text before preprocess.normalize_whitespace
#121
Comments
Hi @betatim, I understand the problem, although I don't know of a "good" way to solve it. The preprocessing functions are destructive and one-way, so not a lot of thought has been given to recovering the changes.
Basic question: do you need to normalize the whitespace before using spaCy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case I'd just skip the normalization.
The only solution that comes to mind is iterating over the resulting entities and re-locating them in the original text, a process which can be made more efficient than the simplest implementation but not, like, great.
This reminds me of annotating, say, keyterms visually in a PDF document while using the extracted/processed text in the analysis. It's definitely a thing I've seen done. (Unfortunately, my google-fu failed me; I couldn't find a concrete example.) Might be worth trying to track down...
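The "re-locate the entities in the original text" idea could be sketched roughly as below. This is only an illustration under assumptions: relocate_entities is a hypothetical helper (not part of textacy or spaCy), the entity strings are assumed to arrive in document order, and the normalization is assumed to have changed nothing but whitespace.

```python
import re


def relocate_entities(original, entity_texts):
    """Find each entity string in `original`, tolerating whitespace changes.

    `entity_texts` are entity strings taken from the normalized text, in
    document order. Returns one (start, end) span per entity, or None when
    the entity cannot be found after the previous match.
    """
    spans = []
    cursor = 0  # search resumes here, so repeated entities map in order
    for ent_text in entity_texts:
        tokens = ent_text.split()
        if not tokens:
            spans.append(None)
            continue
        # Any whitespace run in the original may stand between the tokens.
        pattern = re.compile(r"\s+".join(re.escape(tok) for tok in tokens))
        match = pattern.search(original, cursor)
        if match is None:
            spans.append(None)
            continue
        spans.append(match.span())
        cursor = match.end()
    return spans


original = "Call on 07\n  Feb 2017 or 08   Feb 2017"
spans = relocate_entities(original, ["07 Feb 2017", "08 Feb 2017"])
print([original[start:end] for start, end in spans])
```

The moving cursor keeps this to a single forward pass, but an entity that is missed or out of order simply comes back as None, which is the "not, like, great" part.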
It seems to help with things like "07\n Feb 2017" being found as a single DATE and not as a CARDINAL and a DATE. I was hoping you had found a nice way to map things back to the original text. I'll think about whether we can solve it by tweaking the UI a bit, and I'll see if I can find something on the PDFs.
Currently I have the following process:
preprocess.normalize_whitespace
However, doing (4) is kind of hard, as the character coordinates (doc = nlp(text); doc.ents[0].start) match up with the normalised text. Any bright ideas on how to transform the coordinates back to the original string? It would be nice not to have to reformat the text the user typed in ("Hey user, we reformatted your text for you, you better like it!").
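One way to carry coordinates back is to record, while normalizing, which original index each output character came from. The sketch below is a minimal illustration, not textacy's implementation: normalize_with_offsets and to_original_span are hypothetical helpers, and the normalization is a simplified stand-in that collapses every whitespace run to a single space and trims the ends, which may differ from what preprocess.normalize_whitespace actually does. With spaCy, the character offsets to feed in would be an entity's start_char and end_char (its start attribute is a token index).

```python
import re


def normalize_with_offsets(text):
    """Collapse whitespace runs to one space, trim the ends, keep a map.

    Returns (normalized, offsets), where offsets[i] is the index in `text`
    of the character (or start of the whitespace run) behind normalized[i].
    """
    chars = []
    offsets = []
    for match in re.finditer(r"\s+|\S", text):
        piece = match.group()
        if piece.isspace():
            if not chars:
                continue  # drop leading whitespace
            chars.append(" ")
        else:
            chars.append(piece)
        offsets.append(match.start())
    if chars and chars[-1] == " ":
        # drop the trailing space left by a final whitespace run
        chars.pop()
        offsets.pop()
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Map a non-empty half-open [start, end) span back to the original."""
    if not 0 <= start < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the text")
    return offsets[start], offsets[end - 1] + 1


original = "Due  on 07\n  Feb 2017,   thanks"
normalized, offsets = normalize_with_offsets(original)
start = normalized.index("07 Feb 2017")
end = start + len("07 Feb 2017")
orig_start, orig_end = to_original_span(offsets, start, end)
print(repr(original[orig_start:orig_end]))
```

Mapping the end through the last character of the span, rather than through offsets[end], keeps whitespace that follows the entity in the original out of the result.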