punctuation not being removed correctly using preprocessing.clean
#207
Comments
Hi, could you paste the actual data you're using? (Just one of the texts would probably help.) For me, with the beginning of your first text, the punctuation is removed successfully:

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Honestly people don't know about the fact ..."])
>>> hero.clean(s)
0    honestly people know fact
dtype: object

The issue is probably that some punctuation in your text is not "standard" punctuation (texthero internally uses
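To check whether non-standard punctuation is the cause, a rough sketch like the following may help. It uses only the Python standard library, and it assumes (this thread does not confirm it) that the leftover characters are typographic marks such as curly quotes or an ellipsis character, which fall outside `string.punctuation`. The helper names `non_ascii_punctuation` and `strip_all_punctuation` are hypothetical and are not part of texthero.

```python
import string
import unicodedata


def non_ascii_punctuation(text):
    """Return punctuation characters in `text` that are NOT in string.punctuation."""
    return {
        ch
        for ch in text
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P") and ch not in string.punctuation
    }


def strip_all_punctuation(text):
    """Replace ASCII and Unicode punctuation with spaces, then collapse whitespace."""
    cleaned = "".join(
        " "
        if ch in string.punctuation or unicodedata.category(ch).startswith("P")
        else ch
        for ch in text
    )
    return " ".join(cleaned.split())


# Hypothetical sample: a curly apostrophe (U+2019) and an ellipsis character (U+2026).
sample = "Honestly people don\u2019t know about the fact\u2026"

print(non_ascii_punctuation(sample))   # the characters an ASCII-only filter would miss
print(strip_all_punctuation(sample))
```

If the first call reports any characters for one of your texts, those are likely the ones that survive cleaning. One option would then be to apply something like `strip_all_punctuation` to the column (for example with `Series.apply`) before calling `hero.clean`.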
Thank you @henrifroese. @aliforgetti, do you have any updates?
This is my code; I was trying to clean a large dataset.
According to the documentation, this is the default pipeline for the clean functionality:
But my output does not reflect this, as some of the punctuation remained in the text.
Original text column
Preprocessed text column