
All preprocessing functions to receive as input TokenSeries #145

Open · jbesomi opened this issue Aug 7, 2020 · 4 comments

jbesomi (Owner) commented Aug 7, 2020

The aim of this issue is to discuss and understand when tokenize should happen in the pipeline.

The current solution is to apply tokenize after the text has already been cleaned, either with clean or with a custom pipeline. In general, the cleaning phase also removes punctuation symbols.
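For reference, the current order looks roughly like this (a minimal sketch using the existing public API; the exact steps of the default clean pipeline may differ):

import pandas as pd
import texthero as hero

s = pd.Series(["Texthero is fun!!", "Punctuation, gone."])

# Current order: clean first (punctuation is dropped at this stage),
# then tokenize the already-cleaned text.
s = hero.clean(s)
s = hero.tokenize(s)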

The problem with this approach is that, especially for non-Western languages (#18 and #128), the tokenization operation might actually need the punctuation to execute correctly.

The natural question is: wouldn't it be better to have tokenize as the very first operation?

In this scenario, all preprocessing functions would receive a TokenSeries as input. As we care about performance, one question is whether we can develop a remove_punctuation that is efficient enough on a TokenSeries. The current version of the tokenize function is quite efficient because it relies on regex. The first task would be to develop the new variant and benchmark it against the current one. An advantage of the non-regex approach is that, since the input is a list of lists, we might be able to exploit parallelization.
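As a starting point for that benchmark, a token-level remove_punctuation could look like the sketch below. The function name and signature are hypothetical, not part of the current API, and a real version would also need to handle Unicode punctuation, which matters for the non-Western languages mentioned above.

import string
import pandas as pd

# ASCII-only; a real version should also cover Unicode punctuation.
PUNCT = set(string.punctuation)

def remove_punctuation_tokens(s: pd.Series) -> pd.Series:
    # Assumes every cell is a list of str (a TokenSeries); drops tokens
    # made up entirely of punctuation characters.
    return s.apply(lambda tokens: [t for t in tokens
                                   if not all(c in PUNCT for c in t)])

Applied to pd.Series([["Hello", ",", "world", "!"]]) this returns a one-row Series containing ["Hello", "world"]. Because each row is processed independently, the per-row comprehension is also easy to parallelize.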

Could we move tokenize to the very first step while keeping performance high? Which solution offers the best performance?

The other question is: is there a scenario where preprocessing functions should deal with TextSeries rather than TokenSeries?


Extra crunch:

The current tokenize version uses a very naive regex-based approach that works only for Western languages. The main advantage is that it's quite fast compared to NLTK or other solutions. An alternative we should seriously consider is replacing the regex version with the spaCy tokenizer (#131). The question is: how can we tokenize with spaCy in a very efficient fashion?
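One option worth benchmarking (a sketch, not a measured solution): load a blank spaCy pipeline so that only the tokenizer runs, and feed the texts through nlp.pipe in batches; n_process > 1 uses spaCy's own multiprocessing (available from spaCy v2.2.2 onwards). The function name and defaults below are assumptions, not existing texthero code.

import pandas as pd
import spacy

def tokenize_spacy(s: pd.Series, lang: str = "en", n_process: int = 1) -> pd.Series:
    # Blank pipeline: tokenizer only, no tagger/parser/NER overhead.
    nlp = spacy.blank(lang)
    docs = nlp.pipe(s.astype(str).tolist(), batch_size=1000, n_process=n_process)
    return pd.Series([[token.text for token in doc] for doc in docs], index=s.index)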

jbesomi added the discussion label Aug 7, 2020

Iota87 commented Aug 7, 2020

To keep in mind: one advantage of the "clean" part of preprocessing is the possibility to clean small strings (e.g. names, addresses, etc.) in a dataset in a uniform way. Although this is minor compared to the overall benefit of running the whole pipeline on big chunks of text, it could be an interesting pre-step for string-matching operations. These are very common in some research contexts, where you have to merge different datasets based, for instance, on company names or scientific-publication authors. Would moving "tokenize" earlier in the pipeline prevent this use of Texthero?

jbesomi (Owner) commented Aug 8, 2020

Very interesting observation.

For the case you mentioned, we can tokenize, clean (probably with a custom pipeline and normalization) and then join the tokens back. Do you see any drawbacks?

>>> s = pd.Series(["Madrid", "madrid, the", "Madrid!"])
>>> s = hero.tokenize(s)
>>> s = hero.clean(s)
>>> s = s.str.join("")
>>> s
0    madrid
1    madrid
2    madrid
dtype: object

(Out of this discussion) Sooner or later we will have to think about how to add a universal hero.merge/hero.join function to merge DataFrames with string columns (Pandas merge only works on exactly equal strings). A (naive) approach might be to tokenize (probably at the sub-level, to be implemented), compute embeddings (section 4 of #85 with flair) and merge cells that share very similar vectors (somehow related to #45).
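To make the aside concrete, here is a toy sketch of the "merge cells with very similar vectors" idea, using TF-IDF over character n-grams as a stand-in for the flair embeddings of #85 (the function name, threshold, and vectorizer choice are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fuzzy_match(left: pd.Series, right: pd.Series, threshold: float = 0.8) -> pd.DataFrame:
    # Vectorize both columns with character n-grams, then keep the pairs
    # whose cosine similarity is above the threshold.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    vec.fit(pd.concat([left, right]))
    sim = cosine_similarity(vec.transform(left), vec.transform(right))
    rows, cols = np.where(sim >= threshold)
    return pd.DataFrame({"left": left.iloc[rows].values,
                         "right": right.iloc[cols].values,
                         "similarity": sim[rows, cols]})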

jbesomi added the version2 label and removed the discussion label Jan 12, 2021
jbesomi (Owner) commented Jan 12, 2021

@henrifroese, would you mind helping us with this? You are already familiar with the Series subject.

henrifroese (Collaborator) commented

@jbesomi which part do you need help with? Or do you mean in general?

I think that overall, as described in #131, the spaCy version without parallelization is too slow to be useful for texthero. With spaCy's own parallelization it's still a lot slower than the regex version, but usable, and with the parallelization from #162 it's pretty fast and usable.

However, I'm not 100% convinced we should always tokenize first. I think the point mentioned by @Iota87 is valid: there are users who mainly use the cleaning functions, and it would be a little annoying and counterintuitive for them to have to tokenize, clean, and then join again.

Additionally, this would of course be a pretty big development effort, since a lot of functionality in the preprocessing module and its tests would need to change, so I want to make sure this really is necessary 🥵
