False Positives with URLS #18

Open
max-otto opened this issue Sep 1, 2020 · 2 comments

@max-otto
Contributor

max-otto commented Sep 1, 2020

I just wanted to make you aware that framework names such as 'VB.NET' or 'ASP.Net' are classified as URLs during tokenization and are thus not split (which is probably good). The same happens for some abbreviations such as 'L/S/R' and for SAP versions such as R/3. Unfortunately, this can't be prevented by adding them to 'single_token_abbreviations_de.txt', since abbreviations are checked after URLs. (R/3 is even included in 'single_token_abbreviations_de.txt'.)

@tsproisl
Owner

tsproisl commented Sep 9, 2020

The two names vb.net and asp.net are indeed working URLs (though only one is registered by Microsoft). While they are probably used much more frequently as proper names, recognizing them as URLs is technically correct. In either case, they should not be split.

L/S/R and R/3 puzzled me at first. The explanation is that they are recognized as Reddit links. Reddit links take the form "/r/subreddit" or "/u/user". The leading slash is often omitted and the German Reddit community also uses "l" instead of "r".
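To illustrate why such strings get caught, here is a minimal sketch with a simplified, hypothetical pattern for Reddit links. It is an assumption made up for this example and not copied from SoMaJo's source; the names `REDDIT_LINK` and `looks_like_reddit_link` are likewise invented here.

```python
import re

# Hypothetical, simplified Reddit-link pattern (NOT SoMaJo's actual regex):
# optional leading slash, one of r/l/u, then one or more "/word" segments.
REDDIT_LINK = re.compile(r"(?<!\w)/?[rlu](?:/\w+)+/?(?!\w)", re.IGNORECASE)


def looks_like_reddit_link(text: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return REDDIT_LINK.fullmatch(text) is not None


# Genuine Reddit links and the false positives from this issue.
for candidate in ["/r/de", "u/someuser", "l/de", "R/3", "L/S/R", "r/l", "VB.NET"]:
    print(candidate, looks_like_reddit_link(candidate))
```

Under this pattern, R/3, L/S/R and r/l are indistinguishable from real Reddit links because they share the same "letter, slash, word" shape. VB.NET is not matched here; it is picked up by URL recognition instead.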

If the token class (URL vs. abbreviation) is important for your use case, you could either try to correct it in a postprocessing step or, in the case of Reddit links, try to get rid of reddit_links. Reddit links should only rarely occur outside Reddit posts, so a very quick'n'dirty hack would be:

import re
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
tokenizer._tokenizer.reddit_links = re.compile(r"\s{10}")

When the regex for reddit_links is applied, there are only single spaces in the text, i.e. the modified regex will never match.
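As a side note, the hack depends on whitespace having been normalized before the pattern is applied. A variant that does not depend on this is an empty negative lookahead, `(?!)`, which fails at every position. The sketch below uses only Python's `re` module; the names `SPACE_RUN`, `NEVER`, `normalized` and `raw` are made up for illustration.

```python
import re

SPACE_RUN = re.compile(r"\s{10}")  # pattern from the hack above
NEVER = re.compile(r"(?!)")        # empty negative lookahead: can never match

normalized = "see r/de for details"  # single spaces only
raw = "see" + " " * 12 + "r/de"      # a run of 12 spaces, not normalized

# With single spaces, the space-run pattern finds nothing.
print(SPACE_RUN.search(normalized) is None)
# On non-normalized text it would match the run of spaces.
print(SPACE_RUN.search(raw) is not None)
# The lookahead variant finds nothing in either case.
print(NEVER.search(normalized) is None, NEVER.search(raw) is None)
```

Either pattern could be assigned to reddit_links in the same way; whether the extra robustness matters depends on when the tokenizer normalizes whitespace.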

Of course, a cleaner solution would be to either have an option for enabling/disabling the recognition of Reddit links or, even better, to have an option for user specified special cases that are processed relatively early.
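The postprocessing route mentioned above could look roughly like the following sketch. To keep it self-contained, it uses a stand-in `Token` dataclass with `text` and `token_class` attributes instead of real tokenizer output. The class labels "URL", "abbreviation" and "regular", the `NOT_URLS` set and the `relabel` helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Token:
    """Stand-in for a tokenizer token: surface text plus a class label."""
    text: str
    token_class: str


# Hypothetical list of strings that should never be labelled as URLs.
NOT_URLS = {"r/3", "l/s/r", "r/l", "l/r"}


def relabel(tokens: Iterable[Token], new_class: str = "abbreviation") -> List[Token]:
    """Relabel known false positives in place; other tokens are left untouched."""
    result = []
    for token in tokens:
        if token.token_class == "URL" and token.text.lower() in NOT_URLS:
            token.token_class = new_class
        result.append(token)
    return result


tokens = [
    Token("SAP", "regular"),
    Token("R/3", "URL"),
    Token("https://example.org", "URL"),
]
fixed = relabel(tokens)
print([(t.text, t.token_class) for t in fixed])
```

This only changes the label, not the segmentation, which is fine here because the affected tokens are not split in the first place.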

@max-otto
Contributor Author

You are obviously right concerning the first two. You might consider changing the regex so that it no longer matches 'r/l' or 'l/r' literally, because in a technical context this often means "rechts/links" or "links/rechts" (right/left, left/right). But I don't know how this would be handled in a competitive scenario.

I'm already doing a lot of preprocessing: I replace substrings that I don't want to be split and reintroduce them afterwards, pretty much like you did in the pre-2.0 versions.
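For anyone who wants to try the same approach, here is a minimal sketch of the replace-and-reintroduce idea. It is not my actual code: the `PROTECTED` list, the placeholder format and the `mask`/`unmask` helpers are made up for illustration, and a plain `split()` stands in for the real tokenizer.

```python
from typing import Dict, List, Tuple

# Hypothetical list of strings that must survive tokenization unchanged.
PROTECTED = ["VB.NET", "ASP.Net", "L/S/R", "R/3"]


def mask(text: str, protected: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace protected substrings with purely alphanumeric placeholders."""
    mapping: Dict[str, str] = {}
    # Longest first, so a short entry cannot clobber part of a longer one.
    for i, term in enumerate(sorted(protected, key=len, reverse=True)):
        placeholder = f"PLACEHOLDER{i}X"
        if term in text:
            text = text.replace(term, placeholder)
            mapping[placeholder] = term
    return text, mapping


def unmask(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Put the original strings back after tokenization."""
    return [mapping.get(token, token) for token in tokens]


masked, mapping = mask("We use SAP R/3 and VB.NET.", PROTECTED)
tokens = masked.replace(".", " .").split()  # stand-in for the real tokenizer
restored = unmask(tokens, mapping)
print(restored)
```

The plain substring replacement is deliberately naive: it would also hit a protected string inside a longer word, so a real version probably needs boundary checks.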
