Beyond fuzzy search #936

bovine3dom · 2018-08-21T22:09:05Z

Fuzzy search is a bit pants, really.

It'd be pretty cool if we could use some newfangled machine learning model in JavaScript so that when I'm looking through my history I can find results when I slightly misremember a word, e.g. "All the king's mules". Essentially, we want a metric on sentence fragments.

I know these models exist but I'm struggling to find them right now.

saulrh · 2018-08-22T06:01:26Z

Essentially, we want a metric on sentence fragments.

Yep. You'll find information about this technique in the NLP literature using the keywords "vector space model", "document similarity", and "word embedding". Basic technique is a vector space with one dimension per word in the dictionary and each document is a vector in that space. Advanced versions apply dimensionality reduction to help with the horse/mule thing (learn a transformation from the vector space where the dictionary forms an orthonormal basis to a reduced-dimensionality space that is a semantically-meaningful efficient encoding for the documents in the corpus) or add fancy hand-crafted feature extraction algorithms (original space is n-grams and skip-grams instead of single words, ever-fancier stop-word detection and stemming, etc).

word2vec is one of the gold standards for this from the engineering perspective, but I think it's unix-only and there isn't a pure-JS solution. FastText and GloVe also seem to have mind-share and may be more portable. You'll probably want to find a pre-trained model for English (or the union of our target locales) and ship it with tridactyl, since I don't think that users will have enough history to learn a good embedding and you'd end up just doing the bog-standard vector space model thing. It looks like GloVe and FastText ship models. If you really want to try hard you might be able to find a tensorflow implementation of word2vec and then run it on one of the "tensorflow in js" solutions, but that sounds like it'd have a crippling impedance mismatch.

Another option would be to skip the fancy NLP stuff and use a "dumb" hand-tuned fuzzy search library.

Link dump:
https://en.wikipedia.org/wiki/Word2vec
https://www.tensorflow.org/tutorials/representation/word2vec
https://github.com/loretoparisi/fasttext.js/
https://en.wikipedia.org/wiki/Word_embedding
https://github.com/krisk/Fuse
https://github.com/mattyork/fuzzy

bovine3dom · 2018-08-22T08:06:06Z

Yeah, I wanted a pretrained model. I don't think training on device would do wonders for battery life...

Thanks for the links. Lots to look through. (Although we do already use Fuse).

bovine3dom · 2020-02-17T16:45:34Z

I'm wondering if we can use a naïve approach, e.g. https://github.com/FinNLP/synonyms and maybe https://github.com/spencermountain/compromise, to improve our results. I suppose missing completion of half finished words might be annoying.

mozbugbox · 2020-02-17T17:46:38Z

We can do an exact matching first and set the matched options aside with a score 0. Then feed the rest of the options into Fuse matcher.

YodaEmbedding · 2021-02-19T08:28:47Z

What about just using something similar to fzf or fzy? fzf works quite well for me in the terminal and is quite fast. fzy claims to both have better accuracy and be even faster. There also appears to be a fzy.js implementation by the original author. Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

On a similar note, tridactyl's :tab fuzzy search is quite slow with hundreds of tabs and usually requires 10x as many keypresses as fzf probably would.

bovine3dom · 2021-02-19T08:52:58Z

Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

Trained ML models are usually very quick for prediction. The main downside is file size. The idea is that it would allow sorting based on distance in meaning rather than string distance, so e.g. loose search could match fuzzy search.

Looking at the fzy algorithm is a good idea. I think the speed of searching is constrained mostly by drawing the results rather than the speed of sorting them.

YodaEmbedding · 2021-02-19T09:03:17Z

Do you have an example test case in mind where the ML models has better accuracy than the classical fuzzy matching methods?

I think the speed of searching is constrained mostly by drawing the results rather than the speed of sorting them.

Ah, I see that for my 1000 tabs there are 1000 different .BufferCompletionOption nodes being displayed and recreated on each keystroke. It might be more responsive to only create .BufferCompletionOption nodes for only the top N=20 matches.

bovine3dom · 2021-02-19T09:06:01Z

Do you have an example test case in mind where the ML models has better accuracy than the classical fuzzy matching methods?

Google, for example. Any modern search engine really.

On performance: that does sound sensible : )

meinzer1899 · 2024-01-18T12:17:24Z

Hey @bovine3dom , I love tridactyl, thank you so much for the development. I just searched all tabs (B for ´taball´) and found the tab I was looking for -> awed 🏆

I noticed that I am limited when fuzzy searching: tridactyl was not able to find the tab named gitlab ... feature/event-type ... by searching gitlab event.

As I am using fzf all the time for a couple of years, I found this thread where I want to highlight @YodaEmbedding 's post

What about just using something similar to fzf or fzy? fzf works quite well for me in the terminal and is quite fast. fzy claims to both have better accuracy and be even faster. There also appears to be a fzy.js implementation by the original author. Wouldn't using ML models here be a bit heavy? Do they offer a significant benefit?

https://github.com/junegunn/fzf/wiki/Related-projects#javascript mentions a javascript alternative.

bovine3dom · 2024-01-20T01:25:30Z

Thanks @meinzer1899 :)

I agree that our fuzzy search is a bit odd and could be improved, but this isn't really the issue to discuss it in - this issue is about adding semantic search to Tridactyl where you could search for e.g. "tiger" and have it match "leopard". I was thinking about it recently because it could work particularly well for our help pages / apropos...

but back to your topic: we have quite a few issues open already about our fuzzy matching. It would be helpful if there was a meta issue opened that linked all of them so we could get a better idea about whether it is worth replacing our current library totally and to have something to measure success against if we do replace it. I might open one, but if you beat me to it I'll be happy :)

bovine3dom · 2024-05-29T21:16:59Z

This looks doable now:

https://huggingface.co/docs/transformers.js/en/index

small model like this one https://huggingface.co/Xenova/all-MiniLM-L6-v2

in tridactyl e.g. mlenable downloads model and calculates embeddings for help. Could make model configurable for nerds

mlhelp [search query] embeds query and does vector search to find best match

if it works remotely well we could expand to tab titles, tab contents, history etc

just in case any AI sceptics see this: it would be on-device, optional and (probably) useless/bad

bovine3dom added the enhancement label Aug 21, 2018

maricn mentioned this issue Jan 10, 2021

New tab (Ctrl+T) - text typed in the address bar disappears #2840

Open

bovine3dom added the P4 label Jan 10, 2021

bovine3dom linked a pull request Jun 1, 2024 that will close this issue

wip messing around with sentence similarity #4989

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Beyond fuzzy search #936

Beyond fuzzy search #936

bovine3dom commented Aug 21, 2018

saulrh commented Aug 22, 2018

bovine3dom commented Aug 22, 2018

bovine3dom commented Feb 17, 2020 •

edited

Loading

mozbugbox commented Feb 17, 2020

YodaEmbedding commented Feb 19, 2021 •

edited

Loading

bovine3dom commented Feb 19, 2021

YodaEmbedding commented Feb 19, 2021 •

edited

Loading

bovine3dom commented Feb 19, 2021

meinzer1899 commented Jan 18, 2024 •

edited

Loading

bovine3dom commented Jan 20, 2024

bovine3dom commented May 29, 2024

Beyond fuzzy search #936

Beyond fuzzy search #936

Comments

bovine3dom commented Aug 21, 2018

saulrh commented Aug 22, 2018

bovine3dom commented Aug 22, 2018

bovine3dom commented Feb 17, 2020 • edited Loading

mozbugbox commented Feb 17, 2020

YodaEmbedding commented Feb 19, 2021 • edited Loading

bovine3dom commented Feb 19, 2021

YodaEmbedding commented Feb 19, 2021 • edited Loading

bovine3dom commented Feb 19, 2021

meinzer1899 commented Jan 18, 2024 • edited Loading

bovine3dom commented Jan 20, 2024

bovine3dom commented May 29, 2024

bovine3dom commented Feb 17, 2020 •

edited

Loading

YodaEmbedding commented Feb 19, 2021 •

edited

Loading

YodaEmbedding commented Feb 19, 2021 •

edited

Loading

meinzer1899 commented Jan 18, 2024 •

edited

Loading