Poor performance of German NER #2774
Comments
Thanks for the report – and yeah, I've noticed similar issues as well 😞 We'd love to have better models and support a more diverse annotation scheme for other languages, to make it consistent with the English models. The problem at the moment is that we need to make do with the existing datasets that are available – or produce our own annotations (which we're planning for the future, using Prodigy).
The German entity recognizer is trained on Wikipedia data, which works okay for some cases – but it also has its limitations, especially for texts that are very different from Wikipedia texts. German also doesn't really allow using capitalisation as an indicator for an entity (like English etc.), so the model currently seems to produce a lot of false positives for nouns.
That said, it's also important to keep in mind that the pre-trained models distributed with the library are baseline models that were tuned for the best possible compromise of speed, size, and accuracy and make it easy to get started building your own systems. You almost always want to adjust the model to your specific domain if extracting named entities is important to you. You can find more details on this in the documentation on training and updating models.
Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
code:
file contents:
Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren. Sie haben zwei Kinder.
result:
(u'Anna', u'PER')
(u'2.Oktober', u'ORG')
(u'Uwe', u'LOC')
(u'4.Oktober', u'LOC')
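To make the mismatch explicit, here is a minimal sketch that compares the predictions listed above against hand-written expected labels. The expected labels are an assumption and not part of the original report: both names are taken to be persons, and the two dates are taken to need no tag at all, since the output only shows PER, ORG and LOC. The `compare` helper is hypothetical and uses plain Python only, not the spaCy API.

```python
# Predictions copied from the result listed above.
predicted = [
    ("Anna", "PER"),
    ("2.Oktober", "ORG"),
    ("Uwe", "LOC"),
    ("4.Oktober", "LOC"),
]

# ASSUMPTION: hand-written expected labels, not taken from the report.
# None means "should not be tagged as an entity at all".
expected = {
    "Anna": "PER",
    "Uwe": "PER",
    "2.Oktober": None,
    "4.Oktober": None,
}


def compare(predicted, expected):
    """Split predictions into correct ones and wrong ones.

    Wrong entries carry the expected label as a third element.
    """
    correct, wrong = [], []
    for text, label in predicted:
        gold = expected.get(text)
        if label == gold:
            correct.append((text, label))
        else:
            wrong.append((text, label, gold))
    return correct, wrong


correct, wrong = compare(predicted, expected)
print("correct:", correct)
for text, label, gold in wrong:
    print(f"wrong: {text!r} predicted as {label}, expected {gold}")
```

Under these assumed labels, only the first of the four predicted spans matches; the second person name gets a location label and both dates are tagged as entities.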
Your Environment