Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beyond fuzzy search #936

Open
bovine3dom opened this issue Aug 21, 2018 · 11 comments · May be fixed by #4989
Open

Beyond fuzzy search #936

bovine3dom opened this issue Aug 21, 2018 · 11 comments · May be fixed by #4989

Comments

@bovine3dom
Copy link
Member

Fuzzy search is a bit pants, really.

It'd be pretty cool if we could use some newfangled machine learning model in JavaScript so that when I'm looking through my history I can find results when I slightly misremember a word, e.g. "All the king's mules". Essentially, we want a metric on sentence fragments.

I know these models exist but I'm struggling to find them right now.

@saulrh
Copy link
Contributor

saulrh commented Aug 22, 2018

Essentially, we want a metric on sentence fragments.

Yep. You'll find information about this technique in the NLP literature using the keywords "vector space model", "document similarity", and "word embedding". Basic technique is a vector space with one dimension per word in the dictionary and each document is a vector in that space. Advanced versions apply dimensionality reduction to help with the horse/mule thing (learn a transformation from the vector space where the dictionary forms an orthonormal basis to a reduced-dimensionality space that is a semantically-meaningful efficient encoding for the documents in the corpus) or add fancy hand-crafted feature extraction algorithms (original space is n-grams and skip-grams instead of single words, ever-fancier stop-word detection and stemming, etc).

word2vec is one of the gold standards for this from the engineering perspective, but I think it's unix-only and there isn't a pure-JS solution. FastText and GloVe also seem to have mind-share and may be more portable. You'll probably want to find a pre-trained model for English (or the union of our target locales) and ship it with tridactyl, since I don't think that users will have enough history to learn a good embedding and you'd end up just doing the bog-standard vector space model thing. It looks like GloVe and FastText ship models. If you really want to try hard you might be able to find a tensorflow implementation of word2vec and then run it on one of the "tensorflow in js" solutions, but that sounds like it'd have a crippling impedance mismatch.

Another option would be to skip the fancy NLP stuff and use a "dumb" hand-tuned fuzzy search library.

Link dump:
https://en.wikipedia.org/wiki/Word2vec
https://www.tensorflow.org/tutorials/representation/word2vec
https://github.com/loretoparisi/fasttext.js/
https://en.wikipedia.org/wiki/Word_embedding
https://github.com/krisk/Fuse
https://github.com/mattyork/fuzzy

@bovine3dom
Copy link
Member Author

Yeah, I wanted a pretrained model. I don't think training on device would do wonders for battery life...

Thanks for the links. Lots to look through. (Although we do already use Fuse).

@bovine3dom
Copy link
Member Author

bovine3dom commented Feb 17, 2020

I'm wondering if we can use a naïve approach, e.g. https://github.com/FinNLP/synonyms and maybe https://github.com/spencermountain/compromise, to improve our results. I suppose missing completion of half finished words might be annoying.

@mozbugbox
Copy link
Contributor

We can do an exact matching first and set the matched options aside with a score 0. Then feed the rest of the options into Fuse matcher.

@YodaEmbedding
Copy link

YodaEmbedding commented Feb 19, 2021

What about just using something similar to fzf or fzy? fzf works quite well for me in the terminal and is quite fast. fzy claims to both have better accuracy and be even faster. There also appears to be a fzy.js implementation by the original author. Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

On a similar note, tridactyl's :tab fuzzy search is quite slow with hundreds of tabs and usually requires 10x as many keypresses as fzf probably would.

@bovine3dom
Copy link
Member Author

Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

Trained ML models are usually very quick for prediction. The main downside is file size. The idea is that it would allow sorting based on distance in meaning rather than string distance, so e.g. loose search could match fuzzy search.

Looking at the fzy algorithm is a good idea. I think the speed of searching is constrained mostly by drawing the results rather than the speed of sorting them.

@YodaEmbedding
Copy link

YodaEmbedding commented Feb 19, 2021

Do you have an example test case in mind where the ML models has better accuracy than the classical fuzzy matching methods?

I think the speed of searching is constrained mostly by drawing the results rather than the speed of sorting them.

Ah, I see that for my 1000 tabs there are 1000 different .BufferCompletionOption nodes being displayed and recreated on each keystroke. It might be more responsive to only create .BufferCompletionOption nodes for only the top N=20 matches.

@bovine3dom
Copy link
Member Author

Do you have an example test case in mind where the ML models has better accuracy than the classical fuzzy matching methods?

Google, for example. Any modern search engine really.

On performance: that does sound sensible : )

@meinzer1899
Copy link

meinzer1899 commented Jan 18, 2024

Hey @bovine3dom , I love tridactyl, thank you so much for the development. I just searched all tabs (B for ´taball´) and found the tab I was looking for -> awed 🏆

I noticed that I am limited when fuzzy searching: tridactyl was not able to find the tab named gitlab ... feature/event-type ... by searching gitlab event.

As I am using fzf all the time for a couple of years, I found this thread where I want to highlight @YodaEmbedding 's post

What about just using something similar to fzf or fzy? fzf works quite well for me in the terminal and is quite fast. fzy claims to both have better accuracy and be even faster. There also appears to be a fzy.js implementation by the original author. Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

https://github.com/junegunn/fzf/wiki/Related-projects#javascript mentions a javascript alternative.

@bovine3dom
Copy link
Member Author

Thanks @meinzer1899 :)

I agree that our fuzzy search is a bit odd and could be improved, but this isn't really the issue to discuss it in - this issue is about adding semantic search to Tridactyl where you could search for e.g. "tiger" and have it match "leopard". I was thinking about it recently because it could work particularly well for our help pages / apropos...

but back to your topic: we have quite a few issues open already about our fuzzy matching. It would be helpful if there was a meta issue opened that linked all of them so we could get a better idea about whether it is worth replacing our current library totally and to have something to measure success against if we do replace it. I might open one, but if you beat me to it I'll be happy :)

@bovine3dom
Copy link
Member Author

This looks doable now:

https://huggingface.co/docs/transformers.js/en/index

small model like this one https://huggingface.co/Xenova/all-MiniLM-L6-v2

in tridactyl e.g. mlenable downloads model and calculates embeddings for help. Could make model configurable for nerds

mlhelp [search query] embeds query and does vector search to find best match

if it works remotely well we could expand to tab titles, tab contents, history etc

just in case any AI sceptics see this: it would be on-device, optional and (probably) useless/bad

@bovine3dom bovine3dom linked a pull request Jun 1, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants