[Bug]: Client & Persistent Client are retrieving different documents #2134

Open
sparshbhawsar opened this issue May 4, 2024 · 4 comments
Labels
bug Something isn't working

sparshbhawsar commented May 4, 2024

What happened?

Hi Team,

I noticed that I get different docs when using Client versus Persistent Client. I have cross-checked the indexes, the embeddings, and the number of docs; all are exactly the same.

Saving to and loading from the Persistent Client works fine; I get the same results after reloading.

The problem is that the Persistent Client returns different docs than the normal Client.

I am attaching example here:

Docs from Normal Client k=4

[['This provides a daily snapshot of the ...',
'This is the description of....',
'Table1',
'Table2']]

Docs from Persistent Client k=4

[['Table1',
'Table2',
'Table3',
'Table4']]

So when I run with the Persistent Client, it somehow drops the top 2 docs that I get from the normal Client.

I checked the local files, and the docs and embeddings for these top 2 are stored.

Could you please help me find where exactly the issue is coming from?

Thanks,
Sparsh

Versions

Chroma: 0.4.17

Relevant log output

No response

@sparshbhawsar sparshbhawsar added the bug Something isn't working label May 4, 2024
@tazarov
Contributor

tazarov commented May 4, 2024

@sparshbhawsar, thanks for raising this. Do you have a short snippet of your add/query with some sample data to help with reproducing this?

Side note: Is the bug reproducible in Chroma 0.5.0?

@sparshbhawsar
Author

sparshbhawsar commented May 4, 2024

Hi @tazarov, yes, the issue is still present in version 0.5.0.

I can't provide the data because it's confidential, but I can share the code with which you can reproduce this.

import chromadb

### Using Normal Client 
chroma_client = chromadb.Client()

from chromadb import Documents, EmbeddingFunction, Embeddings 

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # Your Embeddings
        return embeddings

collection = chroma_client.create_collection( name="test", embedding_function=MyEmbeddingFunction(), metadata={"hnsw:space": "cosine"} ) 

# docs = Your Document 

collection.add(ids=[str(i) for i in range(len(docs))], documents=[d.page_content for d in docs], metadatas=[d.metadata for d in docs])

collection.query(query_embeddings=[query_vector], n_results=3)  # query_vector = Your Query Vector


### Using Persistent Client (Saving to disk)
persistent_client = chromadb.PersistentClient(path="/path/to/save/to")

from chromadb import Documents, EmbeddingFunction, Embeddings 

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        embeddings = ...  # Your Embeddings
        return embeddings

persistent_collection = persistent_client.create_collection( name="test", embedding_function=MyEmbeddingFunction(), metadata={"hnsw:space": "cosine"} ) 

# docs = Your Document 

persistent_collection.add(ids=[str(i) for i in range(len(docs))], documents=[d.page_content for d in docs], metadatas=[d.metadata for d in docs])

persistent_collection.query(query_embeddings=[query_vector], n_results=3)  # query_vector = Your Query Vector
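
To pin down where the two clients diverge, it can help to compare the two query results rank by rank. The following is a minimal sketch of a hypothetical helper (`diff_query_results` is not part of Chroma). It assumes only that each result is a dict whose "ids" entry holds one list of ids per query, and the ids used here are made up for illustration.

```python
def diff_query_results(result_a, result_b, query_index=0):
    """Compare the ids returned for one query by two clients.

    Hypothetical helper: assumes each result is a dict whose "ids" entry
    is a list with one list of ids per query.
    """
    ids_a = result_a["ids"][query_index]
    ids_b = result_b["ids"][query_index]
    return {
        "only_in_a": [i for i in ids_a if i not in ids_b],
        "only_in_b": [i for i in ids_b if i not in ids_a],
        # (rank, id_a, id_b) for every rank where the two lists disagree
        "rank_mismatches": [
            (rank, a, b)
            for rank, (a, b) in enumerate(zip(ids_a, ids_b))
            if a != b
        ],
    }


# Made-up ids for illustration only.
in_memory = {"ids": [["0", "1", "2", "3"]]}
persistent = {"ids": [["2", "3", "4", "5"]]}
report = diff_query_results(in_memory, persistent)
print(report["only_in_a"])  # ['0', '1']
print(report["only_in_b"])  # ['4', '5']
```

Passing the two real query results instead of the made-up dicts would show which ids are missing from each side and at which rank the lists first disagree.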

@sparshbhawsar
Author

Hi @tazarov, any update on the issue?

@tazarov
Contributor

tazarov commented May 7, 2024

@sparshbhawsar,

I've tried with:

import chromadb

### Using Normal Client 
chroma_client = chromadb.Client()


collection = chroma_client.create_collection( name="test123", metadata={"hnsw:space": "cosine"} )

docs = ["This provides a daily snapshot of the ...", "This is the description of....","Table1","Table2"] 

collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])

qr = collection.query( query_texts=["description of snapshot table"], n_results=4)

print(qr)

### Using Persistent Client (Saving to disk)
persistent_client = chromadb.PersistentClient(path="./2134")


persistent_collection = persistent_client.create_collection( name="test", metadata={"hnsw:space": "cosine"} )

# docs = Your Document 

persistent_collection.add(ids=[str(i) for i in range(len(docs))], documents=[d for d in docs])

qr1 = persistent_collection.query( query_texts=["description of snapshot table"], n_results=4 )

print(qr1)

A few things to note about the above code: it relies on the default embedding function (it is not great with cosine, but it works), and it yields consistent results for both clients. We do a lot of testing around consistency, so I wonder under what conditions you see this problem. I have two suspects:

  • Data
  • Custom Embedding functions
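
One way to rule the second suspect in or out is to check whether the custom embedding function returns the same vectors when called twice on the same input. This is a rough sketch: `is_deterministic` and `toy_embed` are hypothetical names, and `toy_embed` is only a stand-in for a real embedding function.

```python
def is_deterministic(embed_fn, docs, tol=1e-9):
    """Return True if embed_fn yields the same vectors on two calls."""
    first = embed_fn(docs)
    second = embed_fn(docs)
    if len(first) != len(second):
        return False
    for vec_a, vec_b in zip(first, second):
        if len(vec_a) != len(vec_b):
            return False
        if any(abs(x - y) > tol for x, y in zip(vec_a, vec_b)):
            return False
    return True


def toy_embed(docs):
    # Stand-in for a real embedding function: deterministic by construction.
    return [[float(len(d)), float(sum(map(ord, d)) % 97)] for d in docs]


print(is_deterministic(toy_embed, ["Table1", "Table2"]))  # True
```

If this returned False for the real function, the two collections could end up holding different vectors for the same documents, which would be enough to explain differing results.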

I think the next step is for me to work on the first by getting a more "decent" dataset than just 4 docs. You mentioned that your dataset is private, but can you give me an indication of how many records (embeddings) you add to Chroma and whether your topK results have small or large distances between each other?
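
For the distance question, a brute-force ranking over the stored embeddings gives an exact reference to compare both clients against, and the gaps between consecutive distances show how close the top results are to each other. The sketch below is pure Python with hypothetical helper names and made-up 2-D vectors. It takes cosine distance to be 1 minus cosine similarity, which is an assumption about how the "cosine" space is scored.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cosine similarity, for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_top_k(query, embeddings, k):
    """Brute-force top-k: list of (id, distance), nearest first."""
    scored = [
        (doc_id, cosine_distance(query, vec))
        for doc_id, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


def distance_gaps(top_k):
    """Differences between consecutive distances in a ranked list."""
    return [b[1] - a[1] for a, b in zip(top_k, top_k[1:])]


# Made-up 2-D vectors for illustration only.
embeddings = {
    "0": [1.0, 0.0],
    "1": [0.0, 1.0],
    "2": [-1.0, 0.0],
}
top = exact_top_k([1.0, 0.0], embeddings, k=3)
print([doc_id for doc_id, _ in top])  # ['0', '1', '2']
print(distance_gaps(top))             # [1.0, 1.0]
```

Very small gaps would suggest that near-ties are being ordered differently, while large gaps would point toward the stored vectors or the index itself.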
