What happened?
Chroma v0.5.0 removes newline characters from documents before generating embeddings, even though this preprocessing is now unnecessary (post-V1 models), negatively impacts similarity search results, and makes outputs harder to predict (openai issue 418, langchain issue 3853).
In openai issue 418, BorisPower explains that the newline preprocessing should be removed because it is no longer needed for models like "text-embedding-ada-002". However, if you run the code below, you will see that chroma still replaces newline characters with spaces before generating embeddings, producing embeddings that differ from those generated by the openai package.
Also, could someone please confirm that the replacement of newline characters is the only preprocessing chroma applies to text before embedding? We do not feel comfortable using a chroma embedding function for our DB unless the preprocessing is transparent.
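For context, the preprocessing in question appears to be a simple newline-to-space replacement applied to each document before it is sent to the embeddings endpoint. A minimal sketch of that behavior (our approximation, not Chroma's exact code; the function name `strip_newlines` is ours):

```python
from typing import List

def strip_newlines(texts: List[str]) -> List[str]:
    # Approximation of the preprocessing we believe OpenAIEmbeddingFunction applies:
    # every newline is replaced by a space, so "Chroma\nRocks!!!" and
    # "Chroma Rocks!!!" end up being embedded as the same string.
    return [t.replace("\n", " ") for t in texts]

# strip_newlines(["Chroma\nRocks!!!"])  ->  ["Chroma Rocks!!!"]
```

The repro script below compares the embeddings Chroma stores against embeddings obtained directly from the openai package for the same two documents.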
```python
import chromadb.utils.embedding_functions as embedding_functions
from openai import AzureOpenAI
import numpy as np
import os
import chromadb
from typing import List

deployment_name_embeddings = "text-embedding-ada-002"

chroma_embedding_api_creds = dict(
    api_type=os.getenv('OPENAI_API_TYPE_EMB'),
    api_base=os.getenv('OPENAI_API_BASE_EMB'),
    api_version="2024-02-01",
    api_key=os.getenv('OPENAI_API_KEY_EMB'),
)

chroma_embedding_function = embedding_functions.OpenAIEmbeddingFunction(
    model_name=deployment_name_embeddings, **chroma_embedding_api_creds
)

openai_client = AzureOpenAI(
    api_key=os.getenv('OPENAI_API_KEY_EMB'),
    api_version="2024-02-01",
    azure_endpoint=os.getenv('OPENAI_API_BASE_EMB'),
)


def get_embedding(text, model=deployment_name_embeddings):
    # Embed the text directly with the openai package (no preprocessing).
    return openai_client.embeddings.create(input=[text], model=model).data[0].embedding


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


client = chromadb.PersistentClient(path='chroma_collection')
collection = client.create_collection(name='chroma_collection', embedding_function=chroma_embedding_function)

# Two documents that differ only by a newline character.
chunks = [
    {
        "page_content": "Chroma Rocks!!!",
        "metadata": {
            "source": "chunk1",
            "token_count": 15
        },
    },
    {
        "page_content": "Chroma\nRocks!!!",
        "metadata": {
            "source": "chunk2",
            "token_count": 15
        },
    },
]

collection.add(
    documents=[chunk_dict['page_content'] for chunk_dict in chunks],
    metadatas=[chunk_dict['metadata'] for chunk_dict in chunks],
    ids=[chunk_dict['metadata']['source'] for chunk_dict in chunks])

# Retrieve the embeddings Chroma stored for each document.
collection_output = collection.get(include=['embeddings'])

for chunk, chroma_collection_embedding in zip(chunks, collection_output['embeddings']):
    chunk['openai_embedding'] = get_embedding(chunk['page_content'])
    chunk['chroma_collection_embedding'] = chroma_collection_embedding
    # chunk['chroma_fn_embedding'] = chroma_embedding_function([chunk['page_content']])

# compare embeddings from chroma and openai
client.delete_collection('chroma_collection')

print('First chunk...')
print(f"text from first chunk: {chunks[0]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[0]['openai_embedding'], chunks[0]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[0]['openai_embedding'][0]}")
print(f"First number of `chroma` collection embedding: {chunks[0]['chroma_collection_embedding'][0]}")
print('Second chunk...')
print(f"text from second chunk: {chunks[1]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[1]['openai_embedding'], chunks[1]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[1]['openai_embedding'][0]}")
print(f"First number of `chroma` collection embedding: {chunks[1]['chroma_collection_embedding'][0]}")
```
```
# output
First chunk...
text from first chunk: 'Chroma Rocks!!!'
cosine similarity: 1.0
First number of `openai` embedding: 0.01531514897942543
First number of `chroma` collection embedding: 0.01531514897942543
Second chunk...
text from second chunk: 'Chroma\nRocks!!!'
cosine similarity: 0.97234
First number of `openai` embedding: 0.023534949868917465
First number of `chroma` collection embedding: 0.01531514897942543
```
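As a stopgap, we are considering a custom embedding function that sends the raw text straight to Azure OpenAI, so no preprocessing can happen on the Chroma side. A minimal sketch under that assumption (the class name is ours, not part of Chroma's API; it reuses the `openai_client` and deployment name from the repro above):

```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class RawAzureOpenAIEmbeddingFunction(EmbeddingFunction):
    """Workaround sketch: embed documents exactly as given, newlines included."""

    def __init__(self, client, model):
        self._client = client
        self._model = model

    def __call__(self, input: Documents) -> Embeddings:
        # Forward the documents unchanged to the embeddings endpoint.
        response = self._client.embeddings.create(input=list(input), model=self._model)
        return [item.embedding for item in response.data]

# usage:
# collection = client.create_collection(
#     name='chroma_collection',
#     embedding_function=RawAzureOpenAIEmbeddingFunction(openai_client, deployment_name_embeddings),
# )
```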
Versions
Chroma v0.5.0, Python 3.11.7, Debian 12
Relevant log output
No response