Use bigram for spell checking #110

Open
ierezell opened this issue Jun 2, 2021 · 12 comments

@ierezell

ierezell commented Jun 2, 2021

Hi, first of all, thanks for this very nice piece of software!

I'm using the symspellpy port and it's working perfectly.

However, in some cases (in French, for example) I have chat messages like
"randé vs" instead of "rendez vous", or even "je suit" instead of "je suis".

In each pair, the latter is always in my bigram dictionary and the former is not.

So I was thinking about checking against all bigrams to get better spell checking than checking only single words against the unigram list. Is similar behaviour already in SymSpell, or is it planned?

Thanks again,
Have a wonderful day

@wolfgarbe
Owner

wolfgarbe commented Jun 2, 2021

SymSpell.LookupCompound should do exactly this. It uses the optional bigram dictionary (load with symSpell.LoadBigramDictionary) in order to use sentence level context information for selecting the best spelling correction for multiple input terms. But I haven't tested it for French.
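
As a rough illustration of what using sentence-level context can mean, the sketch below picks, among candidate corrections for each word, the sequence whose adjacent word pairs have the highest total bigram count. It is not SymSpell's LookupCompound implementation: the function name `pick_by_bigrams`, the candidate lists, and the counts are all invented for illustration, and the exhaustive search only suits short inputs.

```python
from itertools import product


def pick_by_bigrams(candidates_per_word, bigram_counts):
    """Return the candidate sequence with the highest total bigram count.

    candidates_per_word: one list of candidate spellings per input word.
    bigram_counts: dict mapping (word1, word2) to a frequency count.
    """
    def score(sequence):
        # Sum the counts of every adjacent pair; unseen pairs count as 0.
        return sum(bigram_counts.get(pair, 0)
                   for pair in zip(sequence, sequence[1:]))

    # Exhaustive search over all combinations (exponential, toy inputs only).
    return max(product(*candidates_per_word), key=score)


# Toy counts, invented for this example (not taken from any real corpus).
bigram_counts = {
    ("je", "peux"): 120,
    ("peux", "pas"): 80,
    ("peut", "pas"): 60,
}

# Hypothetical candidates: "peut" is a valid word, "peux" is one edit away.
candidates = [["je"], ["peut", "peux"], ["pas"]]

best = pick_by_bigrams(candidates, bigram_counts)
print(" ".join(best))
```

With these toy counts, "je peux pas" scores 200 against 60 for "je peut pas", so the context-aware choice wins even though "peut" is itself a valid unigram.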

@ierezell
Author

ierezell commented Jun 2, 2021

Hi @wolfgarbe, thanks a lot for the fast answer!

I did exactly that (symSpell.LoadBigramDictionary) with symspellpy (maybe the implementation differs?).
I created my own bigram dictionary from the Google n-grams (by the way, I can share the Python code if needed).

However, some chatbot sentences (really, really badly written) are not corrected properly.

Here is an example; I hope it helps.

je peut pas recevoir mes 3 enfants avec leurs enfants cecqui fait 3 bukbes perce wue ils sont plus que 8 pas logique ni justeo
        |                                               |             |       |   |
je peut pas recevoir mes 3 enfants avec leurs enfants ce qui fait 3 bulbes perce que ils sont plus que 8 pas logique ni juste
        |                                               |             |       |   |
        |                                               |             |   "perce" exists, "que" exists but "perce que" is 
        |                                               |             |   not a bigram in the dict it should be "parce que"
        |                                               |             |
        |                                               |       Not the good word but it's ok, i will check with custom logic
        |                                          Perfect
      "je" and "peut" are valid unigrams but "je peut" is not a bigram, it should be "je peux" which is in the bigrams.   

Thanks again for your time,

Have a great day

@wolfgarbe
Owner

If you attach the French frequency dictionary and the bigram dictionary files to the issue in plain text format, I could have a look at what goes wrong (in SymSpell and/or the port).

@ierezell
Author

ierezell commented Jun 2, 2021

Here are the bigram and unigram dictionaries, generated with the script below.
bigram.txt
unigram.txt

Note that the script's extension is .txt because GitHub doesn't allow posting .py files.
I kept only the most recent count for each unigram or bigram. I also limited the output to the 80,000 most frequent unigrams and 160,000 bigrams.
google_ngrams.txt
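
The two steps described above (keep the most recent count per n-gram, then keep the most frequent entries) could be sketched as follows. This is not the attached script. It assumes tab-separated rows of the form n-gram, year, match count, volume count, which is an assumption about the input layout, and the sample rows and function names are invented.

```python
import heapq


def latest_counts(lines):
    """Keep, for each n-gram, the match count from its most recent year.

    Assumes rows of the form: ngram<TAB>year<TAB>match_count<TAB>volume_count
    """
    latest = {}  # ngram -> (year, match_count)
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
        year, match_count = int(year), int(match_count)
        if ngram not in latest or year > latest[ngram][0]:
            latest[ngram] = (year, match_count)
    return {ngram: count for ngram, (_year, count) in latest.items()}


def top_n(counts, n):
    """Return the n most frequent (term, count) pairs, most frequent first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


# Invented sample rows, for illustration only.
sample = [
    "je suis\t2008\t500\t40",
    "je suis\t2009\t650\t45",
    "rendez vous\t2009\t90\t12",
]

counts = latest_counts(sample)
top = top_n(counts, 1)

# Emit one "term count" line per entry (assumed dictionary layout).
for term, count in top:
    print(f"{term} {count}")
```

In the setup described in this comment, `top_n` would be called with 80,000 for unigrams and 160,000 for bigrams.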

Thanks again for your help!

@ierezell
Author

ierezell commented Jun 2, 2021

Sorry to bump again, but I played with it some more to make it work like the English version.

I tried putting spaces in random places and realized that my first version of the bigrams and unigrams was too loose, so I made it stricter: no bigrams of the form space + word or word + space, and only words of at least one character that are in a French dictionary.
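
A minimal sketch of the stricter filtering described above, assuming the French dictionary is available as a set of words. The helper names and the tiny `lexicon` are hypothetical and stand in for whatever word list was actually used.

```python
def keep_unigram(word, lexicon):
    """Keep a word only if it is non-empty and present in the lexicon."""
    return len(word) >= 1 and word in lexicon


def keep_bigram(bigram, lexicon):
    """Keep a bigram only if it is exactly two valid words joined by one space.

    Splitting on a single space turns " vous" into ["", "vous"] and
    "vous " into ["vous", ""], so the empty part is rejected.
    """
    parts = bigram.split(" ")
    return len(parts) == 2 and all(keep_unigram(p, lexicon) for p in parts)


# Hypothetical stand-in for a real French word list.
lexicon = {"rendez", "vous", "je", "suis"}

for candidate in ["rendez vous", " vous", "vous ", "je xyz"]:
    print(repr(candidate), keep_bigram(candidate, lexicon))
```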

Even with that, I cannot get the example above to work, but it fixed most of the random-space and random-splitting errors.

Once everything is fixed, I will replicate this for the other languages in the Google n-grams and give you the files so that this framework can support more languages built-in.

Have a great week

@wolfgarbe
Owner

> Once everything is fixed, I will replicate this for the other languages in the Google n-grams and give you the files so that this framework can support more languages built-in.

That's great.

I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

@ierezell
Author

ierezell commented Jun 3, 2021

Hi @wolfgarbe,

> I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

Don't worry, and thanks a lot for your time and dedication, it's really nice!

I will think a bit more about how to clean and collect data for other languages, such as Russian, which has one-character words, or Chinese...

Here is another related sentence:

"When will she arrive": "quand va-t-elle arriver" was written as "quand va telleariver" and corrected to "quand va telle river".
My problem is that the word "arriver" is more frequent than "river", so I expected it to be chosen instead.
Also, "elle arriver" is a bigram and "telle river" is not, so correcting to "quand va elle arriver" would be perfect.

Otherwise, it's harder to recover the real sentence (with a language model, for example).
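
To put numbers on this example, the sketch below computes the plain Levenshtein distance from the typed token to each of the two candidates. Plain Levenshtein is an assumption made for simplicity, and SymSpell's own distance measure and settings may differ. The result does not establish why the library chose its output; it only shows how many raw edits each candidate needs.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


typed = "telleariver"
for candidate in ("telle river", "elle arriver"):
    print(candidate, levenshtein(typed, candidate))
# telle river 1
# elle arriver 3
```

"telle river" is a single substitution away ("a" replaced by a space), while "elle arriver" needs three edits (drop the "t", add a space, add an "r"). Frequency and bigram evidence can therefore only help if the edit-distance limit in use lets the three-edit candidate be considered at all.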

For now, SymSpell is the most complete spellchecker for my needs, but I may add a phonetic or POS layer (chatbot text is really awfully spelled). Do you plan to add this kind of improvement?

Have a great day

@wolfgarbe
Owner

wolfgarbe commented Jun 5, 2021

> I may add a phonetic or POS layer. Do you plan to add this kind of improvement?

Implementing a weighted edit distance giving a higher rank to character pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound) would certainly be a good improvement. But I don't think that I will find time to implement this near-term.
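
A minimal sketch of the keyboard-proximity idea, assuming a simplified, unstaggered QWERTY grid and an arbitrary cost of 0.5 for substituting neighbouring keys. The layout, the cost value, and the function names are illustrative assumptions, not taken from SymSpell or from the ports listed below.

```python
# Simplified QWERTY rows; real keyboards are staggered, and French users
# often type on AZERTY, so this grid is only an illustrative assumption.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def sub_cost(a, b):
    """Substitution cost: 0 if equal, 0.5 if neighbouring keys, else 1."""
    if a == b:
        return 0.0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def weighted_distance(a, b):
    """Levenshtein distance with keyboard-aware substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                   # delete ca
                curr[j - 1] + 1.0,               # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(weighted_distance("wue", "que"))  # 0.5: "w" and "q" are neighbours
print(weighted_distance("wue", "rue"))  # 1.0: "w" and "r" are not
```

With a uniform cost both candidates would tie at distance 1; the weighting ranks "que" ahead of "rue" for the typo "wue" from the example sentence earlier in this thread.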

But there are at least two SymSpell ports that have already implemented a weighted edit distance:

https://github.com/MighTguY/customized-symspell
https://github.com/searchhub/preDict

@ierezell
Author

Hi @wolfgarbe, sorry to bump this thread again... Do you have any news about possible improvements to the bigram corrections?

I would be glad to help if some contribution is possible or needed.

Have a great day

@wolfgarbe
Owner

Unfortunately, I have not yet found the time, but it is still on my mind.

@ierezell
Author

Hi @wolfgarbe,
I'm really sorry to bump again... I'm sure you have tons on your plate, so could you point me to the right place in the code so I can debug this and open a PR with a fix?

It's becoming urgent for me, so I will put some hours into it :)

Thanks in advance and thanks again for this great library !
Have a great day

@ierezell
Author

ierezell commented Aug 1, 2022

Hello @wolfgarbe, as I also raised the issue on symspellpy, we might have found where it comes from, and there could be a fix.

mammothb/symspellpy#107
