
first cut at llama.cpp encoding #292

Draft · wants to merge 5 commits into main

Conversation

@alecf commented Oct 4, 2023

NOTE: This is not ready to be merged, but I wanted to share it in case anyone wants to help with this.

SeaGOAT is pretty cool, but I've been disappointed by the results, so I thought I'd try using codellama for embeddings via llama_cpp_python.

It works reasonably well, with a few caveats:

  1. Embedding without a GPU is slow - initially it was around 7-10s per chunk on my M2 Mac, but once you turn on GPU support it is more like 300ms. I haven't tried it on an Intel machine or with a non-Metal GPU.
  2. You need to download a llama model and set up your .seagoat.yml to point to the model (a rough sketch of the embedding call follows this list):

     server:
       model_path: "/path/to/codellama-7b-instruct.Q5_K_M.gguf"

  3. I think the chunking could also be improved by doing some kind of syntactic/hierarchical chunking, but that is pretty slow too... however, I wonder whether bigger chunks might actually be better here, because I think embedding small fragments of source probably loses a lot of semantic information.
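
For context, this is roughly how the embeddings can be produced with llama_cpp_python; the path and parameter values below are only examples (and constructor arguments can vary a bit between llama-cpp-python versions), not the exact code in this PR:

```python
# Rough sketch, not this PR's exact implementation.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/codellama-7b-instruct.Q5_K_M.gguf",  # the path from .seagoat.yml
    embedding=True,    # we only need embeddings, not text generation
    n_gpu_layers=-1,   # offload layers to the GPU (Metal on an M2); 0 means CPU-only
    verbose=False,
)

# llama-cpp-python returns an OpenAI-style embedding response:
# a dict with a "data" list containing the embedding vectors.
result = llm.create_embedding("def initialize(repository): ...")
vector = result["data"][0]["embedding"]
print(len(vector))
```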

@kantord (Owner) left a comment

Hey @alecf!

Thank you for your contribution! It's really valuable, as a lot of people are complaining about the accuracy of the results. Right now I'm actually working on creating some kind of benchmark so that we can compare different options!

Btw, I think the model might not be the only reason for low accuracy; for example, this could also be a problem: #162

And this could also be related: #10

MAXIMUM_VECTOR_DISTANCE = 1.5


def initialize(repository: Repository):
    cache = Cache("chroma", Path(repository.path), {})
    embedding_function = get_embedding_function(repository)
@kantord (Owner) left a review comment:

I think we could get the config directly here, instead of passing the data down through the repository class.

Actually, the purpose of the Repository class is to hold any logic that is not dependent on the query (which basically means the calculation of the frecency score).

@kantord (Owner) commented:

get_config_values can be used for this; it just needs the repo path, which is already provided. But maybe a "config object" or something like that could be created so that we can pass more data to data sources. Anyway, for now I think it's OK to add a little bit of redundancy and get the config here.
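
A minimal sketch of that suggestion, assuming get_config_values takes the repository path and returns a nested dict mirroring .seagoat.yml (the import path, signature, and key names here are assumptions, not the project's actual API):

```python
# Hypothetical sketch -- import path, signature, and config shape are assumptions.
from pathlib import Path

from seagoat.utils.config import get_config_values  # actual module path may differ


def initialize(repository: Repository):  # Repository, Cache, etc. come from the surrounding module
    cache = Cache("chroma", Path(repository.path), {})

    # Read the config directly here instead of threading it through Repository
    config = get_config_values(Path(repository.path))
    model_path = config.get("server", {}).get("model_path")

    # get_embedding_function would then take the model path instead of the whole repository
    embedding_function = get_embedding_function(model_path)
```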

@kantord (Owner) commented:

Also, I would look at this comment, as it seems to be relevant: it looks like there are people interested in configuring the model, and maybe it's relevant to also allow users to customize the embedding function in general, even things like whether to use the GPU 🤔 Not necessarily a requirement for merging.

@kantord (Owner) commented Oct 5, 2023

If you are interested, we can jump on a video call and discuss ways to improve the efficiency as well as benchmarking.

@kantord (Owner) commented Oct 18, 2023

Note: it should now be possible to measure the accuracy against other models using these tools: https://github.com/kantord/SeaGOAT/blob/main/benchmark/benchmark.ipynb

@akhavr commented Nov 28, 2023

Did anyone run benchmarks on the embeddings available in ChromaDB? I've run several, and they give me an identical "Average position of result", which doesn't feel correct.

Sorry for hijacking the issue; I may open a different one to discuss this.

@kantord (Owner) commented Dec 5, 2023

> Did anyone run benchmarks on the embeddings available in ChromaDB? I've run several, and they give me an identical "Average position of result", which doesn't feel correct.
>
> Sorry for hijacking the issue; I may open a different one to discuss this.

That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔

@akhavr commented Dec 5, 2023

> That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔

All other metrics are also the same; this one was just the easiest to check.

Embeddings were definitely regenerated, and the servers served from a fresh DB.

@kantord (Owner) commented Dec 5, 2023

> That definitely doesn't sound correct, but are the other metrics also identical, or just the average position of result? If it's just the latter, it might suggest a different bug 🤔
>
> All other metrics are also the same; this one was just the easiest to check.
>
> Embeddings were definitely regenerated, and the servers served from a fresh DB.

That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.

@akhavr commented Dec 6, 2023

> That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.

Yes, of course: embedding processing is definitely slower with all-mpnet-base-v2 than with all-MiniLM-L6-v2, but the benchmark results are the same. I can attach my run of benchmark.ipynb here if that helps.
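
One way to sanity-check that the two models are really being picked up is to embed the same query with both of them directly via sentence-transformers and compare the outputs; they do not even share a vector dimension (384 vs. 768), so identical downstream results would be surprising. A small sketch (the query string is just an example):

```python
from sentence_transformers import SentenceTransformer

query = "function that parses the configuration file"

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    vector = model.encode(query)
    # all-MiniLM-L6-v2 produces 384-dimensional vectors and all-mpnet-base-v2 768-dimensional ones,
    # so the two models cannot be returning identical embeddings.
    print(name, vector.shape, vector[:5])
```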

@kantord (Owner) commented Dec 6, 2023

> That is weird. It could mean that you are not actually changing the model successfully. If you change the model, you should observe other differences; for example, different models might be a lot slower when initially analyzing the files. At least I think most models are slower than the one that is used by default. If they seem just as fast, that would suggest you are probably still using the same model.
>
> Yes, of course: embedding processing is definitely slower with all-mpnet-base-v2 than with all-MiniLM-L6-v2, but the benchmark results are the same. I can attach my run of benchmark.ipynb here if that helps.

That is very unusual; it might be a huge bug.

@kantord (Owner) commented Dec 6, 2023

@akhavr what happens if you disable the regex-based results here:

https://github.com/kantord/SeaGOAT/blob/bb1086ced94bb83f6e63cbc613016625d3e15883/seagoat/engine.py#L64-L71

It's enough to comment out the ripgrep source before you run the benchmark. This would mean that all results come from chromadb, so the results should change heavily if the model changes. But I suspect that the problem might be with the way the benchmarks are being run, or maybe there is something confusing, as the benchmark part is not really documented (yet).

@akhavr commented Jan 11, 2024

> @akhavr what happens if you disable the regex-based results here:
>
> https://github.com/kantord/SeaGOAT/blob/bb1086ced94bb83f6e63cbc613016625d3e15883/seagoat/engine.py#L64-L71
>
> It's enough to comment out the ripgrep source before you run the benchmark. This would mean that all results come from chromadb, so the results should change heavily if the model changes. But I suspect that the problem might be with the way the benchmarks are being run, or maybe there is something confusing, as the benchmark part is not really documented (yet).

Only now could I return to the topic.

I'm trying to run it with the change you suggested, but while doing that, I noticed that seagoat/repository.py:Repository.get_file_object_id tries to run git ls-tree with the --object-only option, which is absent in my git 2.34.1.

I guess I'd better open a separate issue/pull request for that.
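
For reference, --object-only appears to have been added in Git 2.36, so older Gits need the object id parsed out of the regular ls-tree output. A hypothetical fallback sketch (the real get_file_object_id in seagoat/repository.py is implemented differently):

```python
# Hypothetical fallback for Git < 2.36, where `git ls-tree --object-only` is unavailable.
import subprocess


def get_file_object_id(repo_path: str, file_path: str) -> str:
    # `git ls-tree HEAD -- <path>` prints "<mode> <type> <object>\t<path>",
    # so the object id is the third whitespace-separated field.
    output = subprocess.check_output(
        ["git", "-C", repo_path, "ls-tree", "HEAD", "--", file_path],
        text=True,
    )
    return output.split()[2]
```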
