-
-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Beyond fuzzy search #936
Comments
Yep. You'll find information about this technique in the NLP literature using the keywords "vector space model", "document similarity", and "word embedding". Basic technique is a vector space with one dimension per word in the dictionary and each document is a vector in that space. Advanced versions apply dimensionality reduction to help with the horse/mule thing (learn a transformation from the vector space where the dictionary forms an orthonormal basis to a reduced-dimensionality space that is a semantically-meaningful efficient encoding for the documents in the corpus) or add fancy hand-crafted feature extraction algorithms (original space is n-grams and skip-grams instead of single words, ever-fancier stop-word detection and stemming, etc). word2vec is one of the gold standards for this from the engineering perspective, but I think it's unix-only and there isn't a pure-JS solution. FastText and GloVe also seem to have mind-share and may be more portable. You'll probably want to find a pre-trained model for English (or the union of our target locales) and ship it with tridactyl, since I don't think that users will have enough history to learn a good embedding and you'd end up just doing the bog-standard vector space model thing. It looks like GloVe and FastText ship models. If you really want to try hard you might be able to find a tensorflow implementation of word2vec and then run it on one of the "tensorflow in js" solutions, but that sounds like it'd have a crippling impedance mismatch. Another option would be to skip the fancy NLP stuff and use a "dumb" hand-tuned fuzzy search library. Link dump: |
Yeah, I wanted a pretrained model. I don't think training on device would do wonders for battery life... Thanks for the links. Lots to look through. (Although we do already use Fuse). |
I'm wondering if we can use a naïve approach, e.g. https://github.com/FinNLP/synonyms and maybe https://github.com/spencermountain/compromise, to improve our results. I suppose missing completion of half finished words might be annoying. |
We can do an exact matching first and set the matched options aside with a score 0. Then feed the rest of the options into Fuse matcher. |
What about just using something similar to On a similar note, tridactyl's |
Trained ML models are usually very quick for prediction. The main downside is file size. The idea is that it would allow sorting based on distance in meaning rather than string distance, so e.g. Looking at the |
Do you have an example test case in mind where the ML models has better accuracy than the classical fuzzy matching methods?
Ah, I see that for my 1000 tabs there are 1000 different |
Google, for example. Any modern search engine really. On performance: that does sound sensible : ) |
Hey @bovine3dom , I love tridactyl, thank you so much for the development. I just searched all tabs (B for ´taball´) and found the tab I was looking for -> awed 🏆 I noticed that I am limited when fuzzy searching: tridactyl was not able to find the tab named gitlab ... feature/event-type ... by searching gitlab event. As I am using
https://github.com/junegunn/fzf/wiki/Related-projects#javascript mentions a javascript alternative. |
Thanks @meinzer1899 :) I agree that our fuzzy search is a bit odd and could be improved, but this isn't really the issue to discuss it in - this issue is about adding semantic search to Tridactyl where you could search for e.g. "tiger" and have it match "leopard". I was thinking about it recently because it could work particularly well for our help pages / apropos... but back to your topic: we have quite a few issues open already about our fuzzy matching. It would be helpful if there was a meta issue opened that linked all of them so we could get a better idea about whether it is worth replacing our current library totally and to have something to measure success against if we do replace it. I might open one, but if you beat me to it I'll be happy :) |
This looks doable now: https://huggingface.co/docs/transformers.js/en/index small model like this one https://huggingface.co/Xenova/all-MiniLM-L6-v2 in tridactyl e.g.
if it works remotely well we could expand to tab titles, tab contents, history etc just in case any AI sceptics see this: it would be on-device, optional and (probably) useless/bad |
Fuzzy search is a bit pants, really.
It'd be pretty cool if we could use some newfangled machine learning model in JavaScript so that when I'm looking through my history I can find results when I slightly misremember a word, e.g. "All the king's mules". Essentially, we want a metric on sentence fragments.
I know these models exist but I'm struggling to find them right now.
The text was updated successfully, but these errors were encountered: