[Bug]: Embeddings Deletion Causes "Delete of nonexisting embedding ID" #989

Open
mickey-lyx opened this issue Aug 15, 2023 · 15 comments
Labels: bug, to-discuss

@mickey-lyx

What happened?

Hi there, I tried to upload two PDF files to a persistent collection and then delete one of them, but I received warning messages: "Delete of nonexisting embedding ID". The warning only appears when I upload multiple files and delete one of them. Here are my test files and code.

alphabet-2023-q1-10q.pdf
Apple Inc.-10K.pdf

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


def main():
    # create collection
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

    # load document_1
    loader_1 = PyPDFLoader("alphabet-2023-q1-10q.pdf")
    documents1 = loader_1.load()
    docs_1 = text_splitter.split_documents(documents1)
    ids_1 = [str(i) for i in range(1, len(docs_1) + 1)]
    texts_1 = [split.page_content for split in docs_1]
    metadatas_1 = [split.metadata for split in docs_1]
    collection.add(ids=ids_1, metadatas=metadatas_1, documents=texts_1)

    # load document_2
    loader_2 = PyPDFLoader("Apple Inc.-10K.pdf")
    documents_2 = loader_2.load()
    docs_2 = text_splitter.split_documents(documents_2)
    ids_2 = [str(i) for i in range(47, len(docs_2) + 47)]
    texts_2 = [split.page_content for split in docs_2]
    metadatas_2 = [split.metadata for split in docs_2]
    collection.add(ids=ids_2, metadatas=metadatas_2, documents=texts_2)

    print(f"ids_1: {ids_1}")
    print(f"ids_2: {ids_2}")

    print("count before", collection.count())
    # delete document_1
    collection.delete(ids_1)
    print("count after", collection.count())


if __name__ == '__main__':
    main()

Versions

chromadb==0.4.5
langchain==0.0.264
python==3.10.12
MacOS==13.3.1

Relevant log output

ids_1: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46']
ids_2: ['47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107']
count before 107
Delete of nonexisting embedding ID: 1
Delete of nonexisting embedding ID: 2
Delete of nonexisting embedding ID: 3
Delete of nonexisting embedding ID: 4
Delete of nonexisting embedding ID: 5
Delete of nonexisting embedding ID: 6
Delete of nonexisting embedding ID: 7
Delete of nonexisting embedding ID: 8
Delete of nonexisting embedding ID: 9
Delete of nonexisting embedding ID: 10
Delete of nonexisting embedding ID: 11
Delete of nonexisting embedding ID: 12
Delete of nonexisting embedding ID: 13
Delete of nonexisting embedding ID: 14
Delete of nonexisting embedding ID: 15
Delete of nonexisting embedding ID: 16
Delete of nonexisting embedding ID: 17
Delete of nonexisting embedding ID: 18
Delete of nonexisting embedding ID: 19
Delete of nonexisting embedding ID: 20
Delete of nonexisting embedding ID: 21
Delete of nonexisting embedding ID: 22
Delete of nonexisting embedding ID: 23
Delete of nonexisting embedding ID: 24
Delete of nonexisting embedding ID: 25
Delete of nonexisting embedding ID: 26
Delete of nonexisting embedding ID: 27
Delete of nonexisting embedding ID: 28
Delete of nonexisting embedding ID: 29
Delete of nonexisting embedding ID: 30
Delete of nonexisting embedding ID: 31
Delete of nonexisting embedding ID: 32
Delete of nonexisting embedding ID: 33
Delete of nonexisting embedding ID: 34
Delete of nonexisting embedding ID: 35
Delete of nonexisting embedding ID: 36
Delete of nonexisting embedding ID: 37
Delete of nonexisting embedding ID: 38
Delete of nonexisting embedding ID: 39
Delete of nonexisting embedding ID: 40
Delete of nonexisting embedding ID: 41
Delete of nonexisting embedding ID: 42
Delete of nonexisting embedding ID: 43
Delete of nonexisting embedding ID: 44
Delete of nonexisting embedding ID: 45
Delete of nonexisting embedding ID: 46
count after 61

Process finished with exit code 0
@mickey-lyx mickey-lyx added the bug label Aug 15, 2023
@mickey-lyx mickey-lyx changed the title from "[Bug]: Embeddings deletion problem in persistent Chromadb" to '[Bug]: Embeddings Deletion Causes "Delete of nonexisting embedding ID"' Aug 16, 2023
@qyzhizi

qyzhizi commented Aug 19, 2023

I have this problem too.

@mickey-lyx
Author

@tazarov Hi, could you please look at this problem? Thank you for your time!

@tazarov
Contributor

tazarov commented Aug 21, 2023

@mickey-lyx, thanks for reporting this. I'll take a look at this soon. At a glance, the code looks fine, and the actual result seems to be fine too: you have 61 docs once you remove 46 from the starting 107. All in all, this seems like a warning, not an actual bug. I will have a look and let you know.

@mickey-lyx
Author

@tazarov Really appreciate it. The result is right. I'm just wondering why there are warnings about deleting nonexistent embeddings. Is it because the embeddings were deleted multiple times?
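
One way to confirm that "the result is right" beyond comparing counts is to ask the collection which of the deleted IDs it still holds. The helper below is a minimal sketch, not part of Chroma: `ids_still_present` is a hypothetical name, and it assumes only that `collection.get(ids=...)` returns a mapping with an `"ids"` key listing the records that were found.

```python
def ids_still_present(collection, ids):
    """Return the subset of `ids` that the collection still holds.

    `collection` is assumed to expose get(ids=...) returning a mapping
    with an "ids" key, as a Chroma collection does.
    """
    ids = list(ids)
    if not ids:
        # Nothing to look up; avoid sending an empty query.
        return []
    found = collection.get(ids=ids)
    return list(found["ids"])
```

In the script from the original report, calling `ids_still_present(collection, ids_1)` after `collection.delete(ids_1)` should give an empty list if the deletion really went through despite the warnings.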

@guyko81

guyko81 commented Aug 31, 2023

I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where filter (no IDs were given) and remove them one by one:

my_collection.delete(
    where={"file_id": str(file_id)}
)

Since then the warning is shown every time I query it.

@becklabs

becklabs commented Sep 3, 2023

I'm having the same issue. This seems to occur even when an empty list is passed as ids to Collection.delete.
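
For the empty-list case, a caller-side guard can skip the call entirely. This is a minimal sketch under assumptions: `delete_if_any` is a hypothetical helper, not a Chroma API, and the thread does not confirm that skipping the call silences the warning in every version. It only guarantees that Chroma is never asked to delete an empty set of IDs.

```python
def delete_if_any(collection, ids):
    """Call collection.delete(ids=...) only when there is something to delete.

    Returns the number of IDs passed on to the collection.
    """
    ids = list(ids)
    if not ids:
        # Empty request: do not touch the collection at all.
        return 0
    collection.delete(ids=ids)
    return len(ids)
```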

@jeffchuber
Contributor

We'd love to get this fixed - is anyone able to help post a minimal repro?

@mickey-lyx
Author

@jeffchuber

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


def main():
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())

    num_1 = 47
    num_2 = 70

    texts_1 = [f"text_1.{i}" for i in range(num_1)]
    ids_1 = [f"1.{i}" for i in range(num_1)]
    texts_2 = [f"text_2.{i}" for i in range(num_2)]
    ids_2 = [f"2.{i}" for i in range(num_2)]

    collection.add(ids=ids_1, documents=texts_1)
    collection.add(ids=ids_2, documents=texts_2)

    print("count before", collection.count())
    collection.delete(ids_1)
    print("count after", collection.count())


if __name__ == '__main__':
    main()

@timothymugayi

timothymugayi commented Sep 16, 2023

I'm seeing similar warnings, but I'm unsure whether I should be concerned, since they are only warnings. It would be good to get some insight into why this occurs: even after uploading just a few PDF files, and while the FastAPI server is idle, it keeps logging.

112-49d5-a776-2c02c03897e8:77661df1-86bc-4f33-9119-a90d77f7c24e
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:484a228b-de38-4674-8f14-078f4f218afd
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:51c75801-6ecd-4490-941e-8ee6f2229476
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:282cb350-257b-49ef-ae55-ab3997099d58
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:fe9d8119-b72a-44c1-9bc5-f5c173621a4b
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:c92f759d-f0e7-46e9-9156-e5c47e917de7
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5be4bf1c-7c02-4815-9c25-de4463b0231f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:32500766-ceb7-4b12-8e8d-04b34306f30f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:7e5d60fd-cb8a-4ecf-adf3-8d86694458e8
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5cfbdc44-cc08-4749-8d5d-d628f6aa4676
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-

package versions

chromadb==0.4.10
langchain==0.0.225

Running Chroma in client/server mode with the latest Docker image

  chroma:
    container_name: chroma
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - index_data:/chroma/chroma
    environment:
      - IS_PERSISTENT=true
      - CHROMA_SERVER_HTTP_PORT=8000
    restart: unless-stopped
    ports:
      - '8000:8000'
    networks:
      - mynetwork
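
If the warnings are only noise, one option is to filter them out at the logging layer. The sketch below uses the logger name that appears in the log lines above (`chromadb.segment.impl.vector.brute_force_index`) and assumes Chroma emits these messages through Python's standard `logging` module under that name. The class and function names are hypothetical. It has to run in the process that hosts the index, so it applies to an embedded client. With the Docker setup above it would have to be applied inside the server.

```python
import logging

# Logger name copied from the log output above.
BF_INDEX_LOGGER = "chromadb.segment.impl.vector.brute_force_index"
WARNING_PREFIX = "Delete of nonexisting embedding ID"


class DropNonexistingDeleteWarnings(logging.Filter):
    """Drop only the 'Delete of nonexisting embedding ID' records."""

    def filter(self, record):
        # Returning False discards the record; everything else passes.
        return not record.getMessage().startswith(WARNING_PREFIX)


def install_filter():
    """Attach the filter to the brute-force index logger and return it."""
    logger = logging.getLogger(BF_INDEX_LOGGER)
    logger.addFilter(DropNonexistingDeleteWarnings())
    return logger
```

A filter is narrower than raising the logger's level: other warnings from the same module still get through.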

@chrispangg

> I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where filter (no IDs were given) and remove them one by one:
>
> my_collection.delete(
>     where={"file_id": str(file_id)}
> )
>
> Since then the warning is shown every time I query it.

I am having this exact issue too.

@tazarov
Contributor

tazarov commented Sep 18, 2023

@jeffchuber, @chrispangg, @timothymugayi, @mickey-lyx, as I mentioned above, the issue is benign. Chroma keeps a temporary index of embeddings and flushes it to disk once it reaches a certain threshold. In your example the threshold (100) is reached, so the temporary index is flushed and cleared, and subsequent entries are appended to it. When a delete comes right after the add, Chroma attempts to remove every requested embedding from the temporary index, including ones that are no longer in it, which leads to the warning you see. I have made a fix that checks whether the IDs to be removed are actually in the temporary index; if they are not, Chroma will not attempt the deletion there.

A PR is on the way.
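
The mechanism described above can be illustrated with a toy model. This is a sketch only, not Chroma's implementation: `TempIndexSketch`, its attributes, and the per-item flush check are all assumptions made for illustration, and warnings are collected in a list rather than logged. It replays the minimal repro (47 IDs, then 70 IDs, then deleting the first 47) with and without the membership check.

```python
class TempIndexSketch:
    """Toy model of a flush-on-threshold temporary index (not Chroma's code)."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size  # flush threshold (100 in the comment above)
        self.pending = {}             # id -> embedding, still in the temp index
        self.flushed = {}             # id -> embedding, stands in for the on-disk index
        self.warnings = []            # collected instead of logged, for inspection

    def count(self):
        return len(self.pending) + len(self.flushed)

    def add(self, ids):
        for id_ in ids:
            self.pending[id_] = f"embedding-of-{id_}"
            if len(self.pending) >= self.batch_size:
                # Threshold reached: move everything to "disk" and clear.
                self.flushed.update(self.pending)
                self.pending.clear()

    def _remove_pending(self, id_):
        if id_ in self.pending:
            del self.pending[id_]
        else:
            self.warnings.append(f"Delete of nonexisting embedding ID: {id_}")

    def delete(self, ids, check_membership):
        for id_ in ids:
            # Without the check, every ID is sent to the temp index,
            # even those that were already flushed out of it.
            if not check_membership or id_ in self.pending:
                self._remove_pending(id_)
            self.flushed.pop(id_, None)


def demo(check_membership):
    """Replay the minimal repro; return (count after delete, warning count)."""
    index = TempIndexSketch(batch_size=100)
    ids_1 = [f"1.{i}" for i in range(47)]
    ids_2 = [f"2.{i}" for i in range(70)]
    index.add(ids_1)
    index.add(ids_2)
    index.delete(ids_1, check_membership=check_membership)
    return index.count(), len(index.warnings)
```

In this model `demo(False)` returns `(70, 47)` and `demo(True)` returns `(70, 0)`: the final count is the same either way, and only the spurious warnings disappear, which matches the "benign" reading of the issue.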

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 18, 2023
- When the BF index overflows (batch_size) upon insertion of a large batch, it is cleared. If a subsequent delete request targets IDs that were in the cleared BF index, a warning is raised for a non-existent embedding. The issue was resolved by separately checking whether the record exists in the BF index and conditionally executing the BF removal.

Refs: chroma-core#989
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 18, 2023
- Remove ternary expression

Refs: chroma-core#989
HammadB pushed a commit that referenced this issue Sep 19, 2023
Refs: #989

## Description of changes

*Summarize the changes made by this PR.*
 - Improvements & Bug fixes
- When the BF index overflows (batch_size) upon insertion of a large batch,
it is cleared. If a subsequent delete request targets IDs that were in the
cleared BF index, a warning is raised for a non-existent embedding. The
issue was resolved by separately checking whether the record exists in the
BF index and conditionally executing the BF removal.

## Test plan
*How are these changes tested?*

- [x] Tests pass locally with `pytest` for python

## Documentation Changes
N/A
tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 19, 2023
…ore#1150)

Refs: chroma-core#989

tazarov added a commit to amikos-tech/chroma-core that referenced this issue Sep 21, 2023
…ore#1150)

Refs: chroma-core#989

@tazarov
Contributor

tazarov commented Oct 25, 2023

@HammadB I think we can close this now.

@s-peryt

s-peryt commented Apr 14, 2024

I think this issue is still present. I've just stumbled upon it in my application, and I'm using the latest version of Chroma (0.4.24), so the fix from #1150 should already be included.

@running-frog

I updated to chromadb==0.5.0, but I still have this problem.
I run the update with threading:
t = threading.Thread(target=mydb.add_collection_from_file, args=[local_f], daemon=True)
t.start()

@tazarov
Contributor

tazarov commented May 12, 2024

@running-frog, @s-peryt, we have a bug in the HNSW binary index that, under certain conditions, can result in the above errors. There is a PR, #2062, that should resolve this.
