
first cut at llama.cpp encoding #292

Draft · wants to merge 5 commits into main

Conversation

@alecf commented Oct 4, 2023

NOTE: This is not ready to be merged, but I wanted to share it in case anyone wants to help with this.

SeaGOAT is pretty cool, but I've been disappointed by the results, so I thought I'd try using codellama for embeddings via llama_cpp_python.

It works reasonably well, with a few caveats:

  1. Embedding without a GPU is slow - initially it was around 7-10s per chunk on my M2 Mac, but once you turn on GPU support it is more like 300ms. I haven't tried it on an Intel machine or with a non-Metal GPU.
  2. You need to download a llama model and set up your .seagoat.yml to point to the model (a rough sketch of the embedding call follows this list):

     server:
       model_path: "/path/to/codellama-7b-instruct.Q5_K_M.gguf"

  3. I think the chunking could also be improved by doing some kind of syntactic/hierarchical chunking, but that is pretty slow too... however, I wonder whether bigger chunks might actually be better here, because I think embedding small fragments of source probably loses a lot of semantic information.
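
For context, this is roughly how the embeddings can be produced with llama_cpp_python; the path and parameter values below are only examples (and constructor arguments can vary a bit between llama-cpp-python versions), not the exact code in this PR:

```python
# Rough sketch, not this PR's exact implementation.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/codellama-7b-instruct.Q5_K_M.gguf",  # the path from .seagoat.yml
    embedding=True,    # we only need embeddings, not text generation
    n_gpu_layers=-1,   # offload layers to the GPU (Metal on an M2); 0 means CPU-only
    verbose=False,
)

# llama-cpp-python returns an OpenAI-style embedding response:
# a dict with a "data" list containing the embedding vectors.
result = llm.create_embedding("def initialize(repository): ...")
vector = result["data"][0]["embedding"]
print(len(vector))
```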

@kantord (Owner) left a comment

Hey @alecf!

Thank you for your contribution! It's really valuable, as a lot of people are complaining about the accuracy of the results. Right now I'm actually working on creating some kind of benchmark so that we can compare different options!

Btw, I think the model might not be the only reason for low accuracy; for example, this could also be a problem: #162

And this could also be related: #10

MAXIMUM_VECTOR_DISTANCE = 1.5


def initialize(repository: Repository):
    cache = Cache("chroma", Path(repository.path), {})
    embedding_function = get_embedding_function(repository)
@kantord (Owner) left a review comment:

I think we could get the config directly here, instead of passing the data down through the repository class.

Actually, the purpose of the Repository class is to hold any logic that is not dependent on the query (which basically means the calculation of the frecency score).

@kantord (Owner) commented:

get_config_values can be used for this; it just needs the repo path, which is already provided. But maybe a "config object" or something like that could be created so that we can pass more data to data sources. Anyway, for now I think it's OK to add a little bit of redundancy and get the config here.
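
A minimal sketch of that suggestion, assuming get_config_values takes the repository path and returns a nested dict mirroring .seagoat.yml (the import path, signature, and key names here are assumptions, not the project's actual API):

```python
# Hypothetical sketch -- import path, signature, and config shape are assumptions.
from pathlib import Path

from seagoat.utils.config import get_config_values  # actual module path may differ


def initialize(repository: Repository):  # Repository, Cache, etc. come from the surrounding module
    cache = Cache("chroma", Path(repository.path), {})

    # Read the config directly here instead of threading it through Repository
    config = get_config_values(Path(repository.path))
    model_path = config.get("server", {}).get("model_path")

    # get_embedding_function would then take the model path instead of the whole repository
    embedding_function = get_embedding_function(model_path)
```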

@kantord (Owner) commented:

Also, I would look at this comment, as it seems to be relevant: it looks like there are people interested in configuring the model, and maybe it's relevant to also allow users to customize the embedding function in general, even things like whether to use the GPU 🤔 Not necessarily a requirement for merging.

@kantord (Owner) commented Oct 5, 2023

If you are interested, we can jump on a video call and discuss ways to improve the efficiency as well as benchmarking.

@kantord (Owner) commented Oct 18, 2023

Note: it should now be possible to measure the accuracy against other models using these tools: https://github.com/kantord/SeaGOAT/blob/main/benchmark/benchmark.ipynb

@akhavr commented Nov 28, 2023

Did anyone run benchmarks on the embeddings available in ChromaDB? I've run several, and they give me an identical "Average position of result", which doesn't feel correct.

Sorry for hijacking the issue; I may open a different one to discuss this.

@kantord (Owner) commented Dec 5, 2023

> Did anyone run benchmarks on the embeddings available in ChromaDB? I've run several, and they give me an identical "Average position of result", which doesn't feel correct.
>
> Sorry for hijacking the issue; I may open a different one to discuss this.

That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔

@akhavr commented Dec 5, 2023

> That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔

All other metrics are also the same; this one was just the easiest to check.

Embeddings were definitely regenerated, and the servers served from a fresh DB.

@kantord (Owner) commented Dec 5, 2023

> That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔
>
> All other metrics are also the same; this one was just the easiest to check.
>
> Embeddings were definitely regenerated, and the servers served from a fresh DB.

That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.

@akhavr commented Dec 6, 2023

> That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.

Yes, of course: embedding processing is definitely slower with all-mpnet-base-v2 than with all-MiniLM-L6-v2, but the benchmark results are the same. I can attach my run of benchmark.ipynb here if that helps.
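
One way to sanity-check that the two models are really being picked up is to embed the same query with both of them directly via sentence-transformers and compare the outputs; they do not even share a vector dimension (384 vs. 768), so identical downstream results would be surprising. A small sketch (the query string is just an example):

```python
from sentence_transformers import SentenceTransformer

query = "function that parses the configuration file"

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    vector = model.encode(query)
    # all-MiniLM-L6-v2 produces 384-dimensional vectors and all-mpnet-base-v2 768-dimensional ones,
    # so the two models cannot be returning identical embeddings.
    print(name, vector.shape, vector[:5])
```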

@kantord (Owner) commented Dec 6, 2023

> That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.
>
> Yes, of course: embedding processing is definitely slower with all-mpnet-base-v2 than with all-MiniLM-L6-v2, but the benchmark results are the same. I can attach my run of benchmark.ipynb here if that helps.

That is very unusual; it might be a huge bug.

@kantord (Owner) commented Dec 6, 2023

@akhavr what happens if you disable the regex-based results here:

https://github.com/kantord/SeaGOAT/blob/bb1086ced94bb83f6e63cbc613016625d3e15883/seagoat/engine.py#L64-L71

It's enough to comment out the ripgrep source before you run the benchmark. This would mean that all results come from chromadb, so the results should change heavily if the model changes. But I suspect that the problem might be with the way the benchmarks are being run, or maybe there is something confusing, as the benchmark part is not really documented (yet).

@akhavr commented Jan 11, 2024

> @akhavr what happens if you disable the regex-based results here:
>
> https://github.com/kantord/SeaGOAT/blob/bb1086ced94bb83f6e63cbc613016625d3e15883/seagoat/engine.py#L64-L71
>
> It's enough to comment out the ripgrep source before you run the benchmark. This would mean that all results come from chromadb, so the results should change heavily if the model changes. But I suspect that the problem might be with the way the benchmarks are being run, or maybe there is something confusing, as the benchmark part is not really documented (yet).

Only now could I return to the topic.

I'm trying to run it with the change you suggested, but while doing that, I noticed that seagoat/repository.py:Repository.get_file_object_id tries to run git ls-tree with the --object-only option, which is absent in my git 2.34.1.

I guess I'd better open a separate issue/pull request for that.
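
For reference, --object-only appears to have been added in Git 2.36, so older Gits need the object id parsed out of the regular ls-tree output. A hypothetical fallback sketch (the real get_file_object_id in seagoat/repository.py is implemented differently):

```python
# Hypothetical fallback for Git < 2.36, where `git ls-tree --object-only` is unavailable.
import subprocess


def get_file_object_id(repo_path: str, file_path: str) -> str:
    # `git ls-tree HEAD -- <path>` prints "<mode> <type> <object>\t<path>",
    # so the object id is the third whitespace-separated field.
    output = subprocess.check_output(
        ["git", "-C", repo_path, "ls-tree", "HEAD", "--", file_path],
        text=True,
    )
    return output.split()[2]
```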
