more documentation. package still maintained? #7

Open
randomgambit opened this issue Jul 27, 2016 · 7 comments
Comments

@randomgambit

Hi,

First of all, congratulations on this amazing package, which is way faster than fuzzymatch when dealing with large datasets of strings.

Do you have more documentation about the matching algorithm used here? In particular, I am matching whole sentences (not only words), such as "this is a sentence", and I wanted to know whether your default settings are appropriate in that case (ngrams=2, for instance).

How can I change them?

Many thanks for your help

@iarroyof

Hi @randomgambit, it seems nobody is giving feedback on this amazing package. I'm trying to use it, but there is no documentation. Can you tell me whether you found more information, or something similar to this package? fuzzywuzzy has good features (although its documentation is also poor), but some efficiency issues have been reported with it.

@axiak
Owner

axiak commented Jan 24, 2019

I can write up some more documentation if you care about it :)

@randomgambit
Author

that would be great, thanks!

@nodice73

I've been using your package and it is working very well for me. However, I'm afraid I don't completely understand how it works based on your description. Is there a primary published reference for this algorithm?

In particular, the passage

Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements.

is not clear to me.

we create a list of any element in the set that has at least one occurrence of a trigram listed above

Is this a reference to the reference trigram (both the reference and the query are "listed above")?

For each of these matched elements, we compute the cosine similarity between each element and the query string.

Does "these matched elements" refer to the query or the reference? I think it only makes sense if you are talking about the cosine similarity between the reference trigram and the query string, but I could be wrong. In either case, if they match, won't the cosine similarity be perfect by definition? Additionally, you seem to be implying that you are comparing a string with 3 characters to a string with more characters. How do you calculate the cosine similarity of two strings of different lengths?

Based on the current description, I'm not seeing how you distinguish between different matches.

Thanks again for your efforts, and if these questions can be answered by a reference, please point me to it.

@axiak
Owner

axiak commented Mar 11, 2019

I don't have a paper, but it's inspired by full-text search. In some circles you might see this called trigram or shingle indexing. @Glench wrote a wonderful intuitive description of how it works here: https://github.com/glench/fuzzyset

@axiak
Owner

axiak commented Mar 11, 2019

Eh. I meant here: http://glench.github.io/fuzzyset.js/
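
For anyone who wants something concrete to poke at, here is a minimal sketch of the trigram-indexing idea from the passage quoted above. It is an illustration only, not this package's actual code: the names `grams`, `cosine`, and `GramIndex` are hypothetical, and the lowercasing and "-" padding are assumptions made for the example. It shows that the dictionary lookup only selects candidate entries, and that cosine similarity is then computed between gram-count vectors of the whole strings, so the two strings can have different lengths.

```python
import math
from collections import Counter, defaultdict


def grams(text, n=3):
    """Count character n-grams of a string (lowercased, padded with '-')."""
    padded = "-" + text.lower() + "-"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two gram-count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GramIndex:
    """Hypothetical inverted index: gram -> ids of the entries containing it."""

    def __init__(self, n=3):
        self.n = n
        self.entries = []               # stored strings
        self.vectors = []               # gram-count vector of each stored string
        self.index = defaultdict(set)   # gram -> set of entry ids

    def add(self, text):
        entry_id = len(self.entries)
        vector = grams(text, self.n)
        self.entries.append(text)
        self.vectors.append(vector)
        for gram in vector:
            self.index[gram].add(entry_id)

    def search(self, query):
        query_vector = grams(query, self.n)
        # Step 1: one dictionary lookup per query gram collects every
        # stored entry that shares at least one gram with the query.
        candidates = set()
        for gram in query_vector:
            candidates |= self.index.get(gram, set())
        # Step 2: score each candidate's full vector against the query's
        # full vector, then sort so the most similar entries come first.
        scored = [(cosine(query_vector, self.vectors[i]), self.entries[i])
                  for i in candidates]
        return sorted(scored, reverse=True)


idx = GramIndex()
for name in ["michael axiak", "michael jordan", "this is a sentence"]:
    idx.add(name)

for score, text in idx.search("micael axiak"):
    print(f"{score:.2f}  {text}")
```

In this sketch, sharing a single gram only makes an entry a candidate. The score depends on how many grams the two whole strings have in common relative to their sizes, which is what separates a close match from a weak one.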

@nodice73

Thanks!
