Poor performance of German NER #2774

Closed
vostrova opened this issue Sep 19, 2018 · 3 comments
Labels
lang / de German language data and models perf / accuracy Performance: accuracy

Comments

@vostrova

vostrova commented Sep 19, 2018

How to reproduce the behaviour

code:

import io
import spacy

nlp = spacy.load('de_core_news_sm')

# Read the whole file; io.open with an encoding already returns
# unicode on Python 2, so no explicit unicode() conversion is needed.
file_contents = ''
with io.open("test.txt", mode="r", encoding="utf-8") as f:
    for line in f:
        file_contents = file_contents + line

doc = nlp(file_contents)
sents = list(doc.sents)
for ent in doc.ents:
    print(ent.text, ent.label_)

file contents:
Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren. Sie haben zwei Kinder.
(English: "Anna was born on October 2nd and Uwe was born on October 4th. They have two children.")

result:
(u'Anna', u'PER')
(u'2.Oktober', u'ORG')
(u'Uwe', u'LOC')
(u'4.Oktober', u'LOC')
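
Worth checking: the dates are written without a space ("2.Oktober"), and the tokenizer may treat this differently from "2. Oktober". A quick way to inspect the tokenization and the per-token entity annotations (a minimal diagnostic sketch, assuming the same de_core_news_sm model as above):

import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u'Anna ist am 2.Oktober geboren und Uwe ist am 4.Oktober geboren.')
for token in doc:
    # ent_iob_ marks whether the token begins/continues/is outside an
    # entity; ent_type_ is the predicted label.
    print(token.text, token.ent_iob_, token.ent_type_)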

Your Environment

  • Operating System: Windows 10 Pro
  • Python Version Used: 2.7
  • spaCy Version Used: 2.0.12
@honnibal honnibal added lang / de German language data and models perf / accuracy Performance: accuracy labels Sep 25, 2018
@ines
Member

ines commented Sep 27, 2018

Thanks for the report – and yeah, I've noticed similar issues as well 😞 We'd love to have better models and a more diverse annotation scheme for other languages, to make them consistent with the English models.

The problem at the moment is that we need to make do with the existing datasets that are available – or produce our own annotations (which we're planning for the future, using Prodigy).

The German entity recognizer is trained on Wikipedia data, which works okay for some cases – but it also has its limitations, especially on texts that are very different from Wikipedia. German also doesn't really allow using capitalisation as an indicator for an entity the way English does: all German nouns are capitalised, not just proper names, so the model currently produces a lot of false positives on common nouns.

That said, it's also important to keep in mind that the pre-trained models distributed with the library are baseline models: they're tuned for the best possible compromise of speed, size, and accuracy, and they're meant to make it easy to get started building your own systems. If extracting named entities is important to you, you'll almost always want to adapt the model to your specific domain. You can find more details in the documentation on training and updating models.
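
For reference, here's a minimal sketch of what such an update loop can look like with the spaCy 2.x training API (nlp.update). The example texts, entity offsets, and number of iterations are invented for illustration; see the training docs for the full workflow.

import random
import spacy

nlp = spacy.load('de_core_news_sm')

# Hypothetical examples: (text, {'entities': [(start_char, end_char, label)]})
TRAIN_DATA = [
    (u'Anna ist am 2. Oktober geboren.', {'entities': [(0, 4, 'PER')]}),
    (u'Uwe ist am 4. Oktober geboren.', {'entities': [(0, 3, 'PER')]}),
]

# Update only the NER component, and create an optimizer from the
# existing pipe instead of re-initialising its weights.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
optimizer = nlp.get_pipe('ner').create_optimizer()
with nlp.disable_pipes(*other_pipes):
    for i in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer,
                       drop=0.35, losses=losses)
        print(losses)

With only a handful of examples this will overfit badly; in practice you'd want a few hundred domain-specific annotations and a held-out set to evaluate against.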

@ines
Member

ines commented Dec 14, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019