
All preprocessing functions to receive as input TokenSeries #145

Open · jbesomi opened this issue Aug 7, 2020 · 4 comments

jbesomi (Owner) commented Aug 7, 2020

The aim of this issue is to discuss and understand when tokenize should happen in the pipeline.

The current solution is to apply tokenize after the text has already been cleaned, either with clean or with a custom pipeline. In general, the cleaning phase also removes punctuation symbols.
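For reference, the current order looks roughly like this (a minimal sketch using the existing public API; the exact steps of the default clean pipeline may differ):

import pandas as pd
import texthero as hero

s = pd.Series(["Texthero is fun!!", "Punctuation, gone."])

# Current order: clean first (punctuation is dropped at this stage),
# then tokenize the already-cleaned text.
s = hero.clean(s)
s = hero.tokenize(s)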

The problem with this approach is that, especially for non-Western languages (#18 and #128), the tokenization operation might actually need the punctuation to execute correctly.

The natural question is: wouldn't it be better to have tokenize as the very first operation?

In this scenario, all preprocessing functions would receive a TokenSeries as input. As we care about performance, one question is whether we can develop a remove_punctuation that is efficient enough on a TokenSeries. The current version of the tokenize function is quite efficient because it relies on regex. The first task would be to develop the new variant and benchmark it against the current one. An advantage of the non-regex approach is that, since the input is a list of lists, we might be able to exploit parallelization.
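As a starting point for that benchmark, a token-level remove_punctuation could look like the sketch below. The function name and signature are hypothetical, not part of the current API, and a real version would also need to handle Unicode punctuation, which matters for the non-Western languages mentioned above.

import string
import pandas as pd

# ASCII-only; a real version should also cover Unicode punctuation.
PUNCT = set(string.punctuation)

def remove_punctuation_tokens(s: pd.Series) -> pd.Series:
    # Assumes every cell is a list of str (a TokenSeries); drops tokens
    # made up entirely of punctuation characters.
    return s.apply(lambda tokens: [t for t in tokens
                                   if not all(c in PUNCT for c in t)])

Applied to pd.Series([["Hello", ",", "world", "!"]]) this returns a one-row Series containing ["Hello", "world"]. Because each row is processed independently, the per-row comprehension is also easy to parallelize.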

Could we move tokenize to the very first step while keeping performance high? Which solution offers the best performance?

The other question is: is there a scenario where preprocessing functions should deal with TextSeries rather than TokenSeries?


Extra crunch:

The current tokenize version uses a very naive regex-based approach that works only for Western languages. The main advantage is that it's quite fast compared to NLTK or other solutions. An alternative we should seriously consider is replacing the regex version with the spaCy tokenizer (#131). The question is: how can we tokenize with spaCy in a very efficient fashion?
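One option worth benchmarking (a sketch, not a measured solution): load a blank spaCy pipeline so that only the tokenizer runs, and feed the texts through nlp.pipe in batches; n_process > 1 uses spaCy's own multiprocessing (available from spaCy v2.2.2 onwards). The function name and defaults below are assumptions, not existing texthero code.

import pandas as pd
import spacy

def tokenize_spacy(s: pd.Series, lang: str = "en", n_process: int = 1) -> pd.Series:
    # Blank pipeline: tokenizer only, no tagger/parser/NER overhead.
    nlp = spacy.blank(lang)
    docs = nlp.pipe(s.astype(str).tolist(), batch_size=1000, n_process=n_process)
    return pd.Series([[token.text for token in doc] for doc in docs], index=s.index)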

jbesomi added the discussion label Aug 7, 2020

Iota87 commented Aug 7, 2020

To keep in mind: one advantage of the "clean" part of preprocessing is the possibility to clean small strings (e.g. names, addresses, etc.) in a dataset in a uniform way. Although this is minor compared to the overall benefit of running the whole pipeline on big chunks of text, it could be an interesting pre-step for string-matching operations. These are very common in some research contexts, where you have to merge different datasets based, for instance, on company names or scientific-publication authors. Would moving "tokenize" earlier in the pipeline prevent this use of Texthero?

jbesomi (Owner) commented Aug 8, 2020

Very interesting observation.

For the case you mentioned, we can tokenize, clean (probably with a custom pipeline and normalization) and then join the tokens back. Do you see any drawbacks?

>>> s = pd.Series(["Madrid", "madrid, the", "Madrid!"])
>>> s = hero.tokenize(s)
>>> s = hero.clean(s)
>>> s = s.str.join("")
>>> s
0    madrid
1    madrid
2    madrid
dtype: object

(Out of this discussion) Sooner or later we will have to think about how to add a universal hero.merge/hero.join function to merge DataFrames with string columns (Pandas merge only works on exactly equal strings). A (naive) approach might be to tokenize (probably at the sub-level, to be implemented), compute embeddings (section 4 of #85 with flair) and merge cells that share very similar vectors (somehow related to #45).
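To make the aside concrete, here is a toy sketch of the "merge cells with very similar vectors" idea, using TF-IDF over character n-grams as a stand-in for the flair embeddings of #85 (the function name, threshold, and vectorizer choice are all illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fuzzy_match(left: pd.Series, right: pd.Series, threshold: float = 0.8) -> pd.DataFrame:
    # Vectorize both columns with character n-grams, then keep the pairs
    # whose cosine similarity is above the threshold.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    vec.fit(pd.concat([left, right]))
    sim = cosine_similarity(vec.transform(left), vec.transform(right))
    rows, cols = np.where(sim >= threshold)
    return pd.DataFrame({"left": left.iloc[rows].values,
                         "right": right.iloc[cols].values,
                         "similarity": sim[rows, cols]})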

jbesomi added the version2 label and removed the discussion label Jan 12, 2021
jbesomi (Owner) commented Jan 12, 2021

@henrifroese, would you mind helping us with this? You are already familiar with the Series subject.

henrifroese (Collaborator) commented

@jbesomi which part do you need help with? Or do you mean in general?

I think that overall, as described in #131, the spaCy version without parallelization is too slow to be useful for texthero. With spaCy's own parallelization it's still a lot slower than the regex version, but usable, and with the parallelization from #162 it's pretty fast and usable.

However, I'm not 100% convinced we should always tokenize first. I think the point mentioned by @Iota87 is valid: there are users who mainly use the cleaning functions, and it would be a little annoying and counterintuitive for them to have to tokenize, clean, and then join again.

Additionally, this would of course be a pretty big development effort, since a lot of functionality in the preprocessing module and its tests would need to change, so I want to make sure this really is necessary 🥵
