Poor performance of German NER #2774

Closed
vostrova opened this issue Sep 19, 2018 · 3 comments
Labels
lang / de German language data and models perf / accuracy Performance: accuracy

Comments

@vostrova

vostrova commented Sep 19, 2018

How to reproduce the behaviour

code:

import io
import spacy

nlp = spacy.load('de_core_news_sm')

# Read the whole file; io.open with an encoding already returns
# unicode on Python 2, so no explicit unicode() conversion is needed.
file_contents = ''
with io.open("test.txt", mode="r", encoding="utf-8") as f:
    for line in f:
        file_contents = file_contents + line

doc = nlp(file_contents)
sents = list(doc.sents)
for ent in doc.ents:
    print(ent.text, ent.label_)

file contents:
Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren. Sie haben zwei Kinder.
(English: "Anna was born on October 2nd and Uwe was born on October 4th. They have two children.")

result:
(u'Anna', u'PER')
(u'2.Oktober', u'ORG')
(u'Uwe', u'LOC')
(u'4.Oktober', u'LOC')
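
Worth checking: the dates are written without a space ("2.Oktober"), and the tokenizer may treat this differently from "2. Oktober". A quick way to inspect the tokenization and the per-token entity annotations (a minimal diagnostic sketch, assuming the same de_core_news_sm model as above):

import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u'Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren.')
for token in doc:
    # ent_iob_ marks whether the token begins/continues/is outside an
    # entity; ent_type_ is the predicted label.
    print(token.text, token.ent_iob_, token.ent_type_)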

Your Environment

  • Operating System: Windows 10 Pro
  • Python Version Used: 2.7
  • spaCy Version Used: 2.0.12
@honnibal honnibal added lang / de German language data and models perf / accuracy Performance: accuracy labels Sep 25, 2018
@ines
Member

ines commented Sep 27, 2018

Thanks for the report – and yeah, I've noticed similar issues as well 😞 We'd love to have better models and a more diverse annotation scheme for other languages, to make them consistent with the English models.

The problem at the moment is that we need to make do with the existing datasets that are available – or produce our own annotations (which we're planning for the future, using Prodigy).

The German entity recognizer is trained on Wikipedia data, which works okay for some cases – but it also has its limitations, especially on texts that are very different from Wikipedia. German also doesn't really allow using capitalisation as an indicator for an entity the way English does: all German nouns are capitalised, not just proper names, so the model currently produces a lot of false positives on common nouns.

That said, it's also important to keep in mind that the pre-trained models distributed with the library are baseline models: they're tuned for the best possible compromise of speed, size, and accuracy, and they're meant to make it easy to get started building your own systems. If extracting named entities is important to you, you'll almost always want to adapt the model to your specific domain. You can find more details in the documentation on training and updating models.
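
For reference, here's a minimal sketch of what such an update loop can look like with the spaCy 2.x training API (nlp.update). The example texts, entity offsets, and number of iterations are invented for illustration; see the training docs for the full workflow.

import random
import spacy

nlp = spacy.load('de_core_news_sm')

# Hypothetical examples: (text, {'entities': [(start_char, end_char, label)]})
TRAIN_DATA = [
    (u'Anna ist am 2. Oktober geboren.', {'entities': [(0, 4, 'PER')]}),
    (u'Uwe ist am 4. Oktober geboren.', {'entities': [(0, 3, 'PER')]}),
]

# Update only the NER component, and create an optimizer from the
# existing pipe instead of re-initialising its weights.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
optimizer = nlp.get_pipe('ner').create_optimizer()
with nlp.disable_pipes(*other_pipes):
    for i in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer,
                       drop=0.35, losses=losses)
        print(losses)

With only a handful of examples this will overfit badly; in practice you'd want a few hundred domain-specific annotations and a held-out set to evaluate against.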

@ines
Member

ines commented Dec 14, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019