more documentation. package still maintained? #7

randomgambit · 2016-07-27T11:13:51Z

Hi,

First of all, congratulations on this amazing package, which is way faster than fuzzymatch when dealing with large datasets of strings.

Do you have more documentation about the matching algorithm used here? In particular, I am matching whole sentences (not only single words), such as "this is a sentence", and I wanted to know whether your default settings are appropriate in that case (ngrams=2, for instance).

How can I change them?

Many thanks for your help
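
To make the gram-size question concrete, here is a minimal sketch of how a sentence breaks into character n-grams of size 2 and 3. The helper `char_ngrams` is hypothetical and written only for illustration; it is not the package's API, and the lowercasing and `-` padding are assumptions.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text (hypothetical helper, not the package API)."""
    # Assumption: lowercase the text and pad it with '-' so that the first
    # and last characters also appear at a gram boundary.
    padded = "-" + text.lower() + "-"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


sentence = "this is a sentence"
bigrams = char_ngrams(sentence, 2)
trigrams = char_ngrams(sentence, 3)

# Show how many grams each size produces and what the first few look like.
print(len(bigrams), bigrams[:4])
print(len(trigrams), trigrams[:4])
```

With sentences, some grams span word boundaries (for example `"is "`), so they capture a little local word order as well as spelling. Smaller grams tend to tolerate typos better but are shared by more strings.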

iarroyof · 2017-05-10T23:10:28Z

Hi @randomgambit, it seems nobody is giving feedback on this amazing package. I'm trying to use it, but there is no documentation. Can you tell me whether you found more information, or something similar to this? fuzzywuzzy has good features (although its documentation is also poor), but some efficiency issues have been reported with it.

axiak · 2019-01-24T17:21:43Z

I can write up some more documentation if you care about it :)

randomgambit · 2019-02-06T14:54:47Z

that would be great, thanks!

nodice73 · 2019-03-11T23:00:03Z

I've been using your package and it is working very well for me. However, I'm afraid I don't completely understand how it works based on your description. Is there a primary published reference for this algorithm?

In particular, the passage

"Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements."

is not clear to me.

"we create a list of any element in the set that has at least one occurrence of a trigram listed above"

Is this a reference to the reference trigram (both the reference and the query are "listed above")?

"For each of these matched elements, we compute the cosine similarity between each element and the query string."

Does "these matched elements" refer to the query or the reference? I think it only makes sense if you are talking about the cosine similarity between the reference trigram and the query string, but I could be wrong. In either case, if they match, won't the cosine similarity be perfect by definition? Additionally, you seem to be implying that you are comparing a string with 3 characters to a string with more characters. How do you calculate the cosine similarity of two strings of different lengths?

Based on the current description, I'm not seeing how you distinguish between different matches.

Thanks again for your efforts, and if these questions can be answered by a reference, please point me to it.

axiak · 2019-03-11T23:39:03Z

I don't have a paper, but it's inspired by full-text search. In some circles you might see this called trigram or shingle indexing. @Glench wrote a wonderful, intuitive description of how it works here: https://github.com/glench/fuzzyset

axiak · 2019-03-11T23:39:53Z

Eh. I meant here: http://glench.github.io/fuzzyset.js/
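
The lookup-then-rank idea quoted earlier in the thread can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the package's implementation: the class name `TinyGramIndex`, the helper names, the `-` padding, and the single fixed gram size are all hypothetical. Each string is turned into a vector of gram counts, which is why strings of different lengths can still be compared with cosine similarity.

```python
import math
from collections import Counter, defaultdict


def gram_counts(text, n=3):
    """Count the character n-grams of text, padded with '-' (an assumption)."""
    padded = "-" + text.lower() + "-"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two gram-count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyGramIndex:
    """Hypothetical toy index mapping each gram to the stored strings containing it."""

    def __init__(self, n=3):
        self.n = n
        self.by_gram = defaultdict(set)
        self.vectors = {}

    def add(self, text):
        counts = gram_counts(text, self.n)
        self.vectors[text] = counts
        for gram in counts:
            self.by_gram[gram].add(text)

    def get(self, query):
        query_counts = gram_counts(query, self.n)
        # One dictionary lookup per query gram collects the candidate strings.
        candidates = set()
        for gram in query_counts:
            candidates |= self.by_gram.get(gram, set())
        # Score each candidate *string* (not each gram) against the query.
        scored = [(cosine(query_counts, self.vectors[c]), c) for c in candidates]
        return sorted(scored, reverse=True)


index = TinyGramIndex()
for name in ["michael axiak", "michael jordan", "this is a sentence"]:
    index.add(name)

results = index.get("micael asiak")
print(results)
```

In this sketch the "matched elements" are the stored strings that share at least one gram with the query. The similarity is computed between the full gram vectors of the query and of each candidate, so sharing a gram only makes a string a candidate; it does not imply a perfect score.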

nodice73 · 2019-03-12T00:39:50Z

Thanks!

abubelinha mentioned this issue Mar 5, 2024

which package is the maintained one? alpae/fuzzyset#24

Open
