first cut at llama.cpp encoding #292
base: main
Conversation
Hey @alecf!
Thank you for your contribution! It's really valuable, as a lot of people are complaining about the accuracy of the results. Right now I'm working on creating some kind of benchmark so that we can compare different options!
btw I think that the model might not be the only reason for low accuracy, for example this could also be a problem: #162
and this could also be related: #10
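One simple metric for such a benchmark (and the one discussed further down in this thread) is the average position at which the expected result appears. A minimal sketch, with names of my own choosing rather than anything from SeaGOAT's benchmark code:

```python
def average_position(queries: list[tuple[list[str], str]]) -> float:
    """Mean 1-based rank of the expected result across queries.

    Each item pairs a ranked result list with the expected entry;
    a miss is penalised with rank len(results) + 1.
    """
    total = 0
    for results, expected in queries:
        if expected in results:
            total += results.index(expected) + 1
        else:
            total += len(results) + 1
    return total / len(queries)


# Lower is better: here the expected file ranks 1st in one query
# and 2nd in the other, giving an average position of 1.5.
score = average_position([
    (["a.py", "b.py"], "a.py"),
    (["a.py", "b.py"], "b.py"),
])
```

If this number stays bit-for-bit identical after swapping models, something upstream of the metric is likely wrong, which is exactly the situation debugged later in this thread.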
MAXIMUM_VECTOR_DISTANCE = 1.5

def initialize(repository: Repository):
    cache = Cache("chroma", Path(repository.path), {})
    embedding_function = get_embedding_function(repository)
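The `MAXIMUM_VECTOR_DISTANCE` constant in the diff above suggests a distance cutoff on query results. A minimal sketch of that idea, assuming results carry a `distance` field (the dict shape here is my assumption, not SeaGOAT's actual result type):

```python
MAXIMUM_VECTOR_DISTANCE = 1.5

def filter_by_distance(results: list[dict], maximum: float = MAXIMUM_VECTOR_DISTANCE) -> list[dict]:
    """Drop results whose embedding distance exceeds the cutoff."""
    return [r for r in results if r["distance"] <= maximum]


# Only the close match survives the 1.5 cutoff.
hits = filter_by_distance([
    {"path": "a.py", "distance": 0.4},
    {"path": "b.py", "distance": 1.9},
])
```

Note that the right cutoff depends on the embedding model and the distance metric, so a constant tuned for one model may not transfer to another.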
I think we could get the config directly here, instead of passing the data down through the Repository class.
The purpose of the Repository class is to hold any logic that does not depend on the query (which basically means the calculation of the frecency score).
get_config_values can be used for this; it just needs the repo path, which is already provided. But maybe a "config object" or something like that could be created so that we can pass more data to data sources. Anyway, for now I think it's OK to add a little bit of redundancy and get the config here.
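The suggestion above might look roughly like the following sketch: the data source reads its own settings by merging user config over defaults. Here `get_config_values` is passed in as a callable, and the key names and defaults are my assumptions, not SeaGOAT's actual config schema:

```python
from pathlib import Path
from typing import Callable, Optional

def load_embedding_config(
    repo_path: Path,
    get_config_values: Callable[[Path], Optional[dict]],
) -> dict:
    """Merge user-provided config over defaults for the chroma source.

    Hypothetical keys: the real SeaGOAT config may differ.
    """
    defaults = {"embedding-model": "all-MiniLM-L6-v2", "use-gpu": False}
    user = get_config_values(repo_path) or {}
    return {**defaults, **user}


# User overrides only one key; defaults fill in the rest.
cfg = load_embedding_config(Path("."), lambda _path: {"use-gpu": True})
```

This keeps the Repository class free of query-independent plumbing while still letting users customize the model or GPU usage, as discussed below.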
Also, I would look at this comment, as it seems relevant: it looks like there are people interested in configuring the model, and maybe it's relevant to also let users customize the embedding function in general, like even choosing whether to use the GPU and such 🤔 Not necessarily a requirement for merging.
If you are interested, we can jump on a video call and discuss ways to improve the efficiency as well as benchmarking.
Note: it should now be possible to measure the accuracy against other models using these tools: https://github.com/kantord/SeaGOAT/blob/main/benchmark/benchmark.ipynb
Did anyone run benchmarks on the embeddings available in ChromaDB? Sorry for hijacking the issue; I may open a different one to discuss this.
That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of results? If it's just the latter, it might suggest a different bug 🤔
All other metrics are also the same; this one was just the easiest to check. Embeddings were definitely regenerated and the server served from a fresh DB.
That is weird. It could mean that you are not actually changing the model. If you change the model, you should observe other differences: for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one being used by default. If they seem just as fast, that probably means you are still using the same model.
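One direct sanity check for "did the model actually change?" is to embed the same text with both setups and compare the vectors: a different model should produce a different dimensionality or at least different values. A minimal sketch (function name and tolerance are my own):

```python
def embeddings_differ(vec_a: list[float], vec_b: list[float], tolerance: float = 1e-6) -> bool:
    """True if two embeddings of the same input text plausibly come
    from different models: either the dimensionality differs, or at
    least one coordinate differs by more than the tolerance."""
    if len(vec_a) != len(vec_b):
        return True
    return any(abs(x - y) > tolerance for x, y in zip(vec_a, vec_b))


# Identical vectors for the same input suggest the model was not swapped.
same_model = not embeddings_differ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```

Running this on a handful of sample strings before a full benchmark can save a long re-indexing run.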
Yes, of course, embedding processing is definitely slower with
That is very unusual; it might be a huge bug.
@akhavr what happens if you disable the regex-based results here? It's enough to comment out the ripgrep source before you run the benchmark. This would mean that all results come from ChromaDB, so the results should change heavily if the model changes. But I suspect the problem might be with the way the benchmarks are being run, or maybe there is something confusing, as the benchmark part is not really documented (yet).
Only now could I return to the topic. I'm trying to run it with the change you've suggested, but while doing that I noticed something else. Guess I'd better open a separate issue/pull request for that.
NOTE: This is not ready to be merged but I wanted to share in case anyone wants to help with this
SeaGOAT is pretty cool, but I've been disappointed by the results, so I thought I'd try using codellama for embeddings via llama_cpp_python.
It works reasonably well with a few caveats:
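For context, generating embeddings with llama-cpp-python looks roughly like the sketch below (the model path is a placeholder, and `embedding=True` must be set when loading the model). Cosine similarity is one common way to compare the resulting vectors; this is a sketch of the approach, not the PR's actual code:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed_texts(model_path: str, texts: list[str]) -> list[list[float]]:
    """Embed texts with a local GGUF model via llama-cpp-python."""
    from llama_cpp import Llama  # pip install llama-cpp-python
    llm = Llama(model_path=model_path, embedding=True, verbose=False)
    return [llm.embed(text) for text in texts]


# Identical vectors have cosine similarity 1.0; orthogonal ones, 0.0.
cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

A caveat worth noting: code-oriented LLMs like codellama are trained for generation, not retrieval, so their raw hidden-state embeddings aren't guaranteed to outperform a purpose-built embedding model on search tasks; benchmarking (as discussed above) is the way to find out.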