
[Bug]: Chroma v0.5.0 unnecessarily replaces newline characters with spaces before generating embeddings #2129

Open
dasheffie opened this issue May 3, 2024 · 1 comment · May be fixed by #2125
Labels
bug Something isn't working

Comments

@dasheffie

What happened?

Chroma v0.5.0 replaces newline characters with spaces before generating embeddings, even though this is now unnecessary (post-V1 models), negatively impacts similarity search results, and makes outputs harder to predict (openai issue 418, langchain issue 3853).

In openai issue 418, BorisPower explains that the newline preprocessing should be removed because it is no longer needed for models like "text-embedding-ada-002". However, if you run the code below, you will see that Chroma still replaces newline characters with spaces before generating embeddings, producing embeddings that differ from those generated by the openai package.

Also, could someone please confirm that the replacement of newline characters is the only preprocessing of text that happens in Chroma before embedding? We do not feel comfortable using a Chroma embedding function for our DB unless the preprocessing is transparent.
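For reference, here is a minimal sketch of the substitution we believe is happening inside the embedding function before the API call. The `preprocess` helper and its name are our illustration of the suspected behavior, not Chroma's actual code:

```python
# Hypothetical sketch of the preprocessing we suspect Chroma applies
# before calling the embeddings endpoint; the helper name is illustrative.
def preprocess(texts):
    # Newlines become spaces, so "Chroma\nRocks!!!" and "Chroma Rocks!!!"
    # collapse to the same input string.
    return [t.replace("\n", " ") for t in texts]

print(preprocess(["Chroma\nRocks!!!"]))  # ['Chroma Rocks!!!']
```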

import chromadb.utils.embedding_functions as embedding_functions
from openai import AzureOpenAI
import numpy as np
import os
import chromadb

deployment_name_embeddings = "text-embedding-ada-002"

chroma_embedding_api_creds = dict(
    api_type = os.getenv('OPENAI_API_TYPE_EMB'),
    api_base = os.getenv('OPENAI_API_BASE_EMB'),
    api_version = "2024-02-01",
    api_key = os.getenv('OPENAI_API_KEY_EMB'),
)
chroma_embedding_function = embedding_functions.OpenAIEmbeddingFunction(model_name=deployment_name_embeddings, **chroma_embedding_api_creds)


openai_client = AzureOpenAI(
  api_key = os.getenv('OPENAI_API_KEY_EMB'),  
  api_version = "2024-02-01",
  azure_endpoint = os.getenv('OPENAI_API_BASE_EMB')
)

def get_embedding(text, model=deployment_name_embeddings):
    return openai_client.embeddings.create(input = [text], model=model).data[0].embedding


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

client = chromadb.PersistentClient(path='chroma_collection')
collection = client.create_collection(name='chroma_collection', embedding_function=chroma_embedding_function)

chunks = [
    {
        "page_content": "Chroma Rocks!!!",
        "metadata": {
            "source": "chunk1",
            "token_count": 15
        },
    },
    {
        "page_content": "Chroma\nRocks!!!",
        "metadata": {
            "source": "chunk2",
            "token_count": 15
            },
    },
]

collection.add(
    documents = [chunk_dict['page_content'] for chunk_dict in chunks],
    metadatas = [chunk_dict['metadata'] for chunk_dict in chunks],
    ids = [chunk_dict['metadata']['source'] for chunk_dict in chunks])

collection_output = collection.get(include=['embeddings'])
for chunk, chroma_collection_embedding in zip(chunks, collection_output['embeddings']):
    chunk['openai_embedding'] = get_embedding(chunk['page_content'])
    chunk['chroma_collection_embedding'] = chroma_collection_embedding
    # chunk['chroma_fn_embedding'] = chroma_embedding_function([chunk['page_content']])

# compare embeddings from chroma and openai
client.delete_collection('chroma_collection')
print('First chunk...')
print(f"text from first chunk: {chunks[0]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[0]['openai_embedding'], chunks[0]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[0]['openai_embedding'][0]}")
print(f"First number of `chroma` collection embedding: {chunks[0]['chroma_collection_embedding'][0]}")
print('Second chunk...')
print(f"text from second chunk: {chunks[1]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[1]['openai_embedding'], chunks[1]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[1]['openai_embedding'][0]}")
print(f"First number of `chroma` collection embedding: {chunks[1]['chroma_collection_embedding'][0]}")

# output
First chunk...
text from first chunk: 'Chroma Rocks!!!'
cosine similarity: 1.0
First number of `openai` embedding: 0.01531514897942543
First number of `chroma` collection embedding: 0.01531514897942543
Second chunk...
text from second chunk: 'Chroma\nRocks!!!'
cosine similarity: 0.97234
First number of `openai` embedding: 0.023534949868917465
First number of `chroma` collection embedding: 0.01531514897942543
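The identical first coordinates in the output are consistent with the newline being replaced before embedding: after the substitution, both chunks are the same string, so Chroma would send the same input twice and get the same vector back. A quick standalone check (pure Python, no API calls; the substitution shown is our assumption about Chroma's behavior):

```python
chunk_texts = ["Chroma Rocks!!!", "Chroma\nRocks!!!"]
# Apply the suspected newline-to-space substitution.
collapsed = [t.replace("\n", " ") for t in chunk_texts]
# After the substitution, the two inputs are indistinguishable, which would
# explain why both Chroma embeddings begin with the same value.
print(collapsed[0] == collapsed[1])  # True
```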

Versions

Chroma v0.5.0, Python 3.11.7, Debian 12

Relevant log output

No response

@dasheffie dasheffie added the bug Something isn't working label May 3, 2024
@tazarov
Contributor

tazarov commented May 4, 2024

@dasheffie, linking the PR for this - #2125

@tazarov tazarov linked a pull request May 4, 2024 that will close this issue