[Bug]: Fix number of documents logging when using chromadb #13074

Closed
yoonsch217 opened this issue Apr 24, 2024 · 3 comments · Fixed by #13238
Labels
bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@yoonsch217
Contributor

Bug Description

https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/vector_stores/llama-index-vector-stores-chroma/llama_index/vector_stores/chroma/base.py

    def _query(
        self, query_embeddings: List["float"], n_results: int, where: dict, **kwargs
    ) -> VectorStoreQueryResult:
        results = self._collection.query(
            query_embeddings=query_embeddings,
            n_results=n_results,
            where=where,
            **kwargs,
        )

        logger.debug(f"> Top {len(results['documents'])} nodes:")
        nodes = []
        similarities = []
        ids = []
        for node_id, text, metadata, distance in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):

logger.debug(f"> Top {len(results['documents'])} nodes:")

The number of retrieved nodes should be len(results['documents'][0]): results['documents'] is a list of lists, and the loop above reads the documents from its first element.

Version

0.10.30

Steps to Reproduce

Using chromadb, retrieve the nearest nodes for a query. Even when 10 nodes are retrieved (the default setting), the log reports the number of nodes as 1, as shown below.

2024-04-22 16:49:52,678 - llama_index.vector_stores.chroma.base - DEBUG - > Top 1 nodes:
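The mismatch can be illustrated without a running Chroma instance. The sketch below uses a hand-built dictionary that is assumed to mirror the nested shape of a single-query result, as implied by the `[0]` indexing in `_query`; the IDs, texts, and distances are made up for illustration.

```python
# Hand-built stand-in for a query result (assumption: one query embedding,
# so every field holds exactly one inner list). All values are invented.
results = {
    "ids": [[f"node-{i}" for i in range(10)]],
    "documents": [[f"text {i}" for i in range(10)]],
    "metadatas": [[{} for _ in range(10)]],
    "distances": [[0.1 * i for i in range(10)]],
}

outer_count = len(results["documents"])     # number of query embeddings
inner_count = len(results["documents"][0])  # number of retrieved nodes

print(f"> Top {outer_count} nodes:")  # count used by the current log line
print(f"> Top {inner_count} nodes:")  # count used by the proposed fix
```

The outer length counts query embeddings rather than nodes, which is why the log stays at 1 regardless of how many nodes come back.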

Relevant Logs/Tracebacks

No response

yoonsch217 added the bug and triage labels Apr 24, 2024

dosubot bot commented Apr 24, 2024

It looks like the issue you're encountering with the logging statement logger.debug(f"> Top {len(results['documents'])} nodes:") might indeed be a misunderstanding of how the results data is structured. If results['documents'] directly contains the documents as a list, then len(results['documents']) should accurately reflect the number of retrieved nodes.

However, if you're observing that the log always reports the number of nodes as 1, despite expecting more based on the query, it's crucial to:

  1. Verify the structure and content of the results dictionary returned by self._collection.query to ensure it matches expectations.
  2. Check that results['documents'] is properly populated with a list of documents corresponding to the queried nodes.

If after these checks the issue persists, providing more details about the query and the data could help further diagnose the problem.


@yoonsch217
Contributor Author

@dosu
The list of documents is in results['documents'][0], so the correct count is len(results['documents'][0]).
The loop that builds the nodes also reads only from results['documents'][0], as shown below:

        for node_id, text, metadata, distance in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):


dosubot bot commented Apr 25, 2024

To address the logging discrepancy you've identified, you should update the logging statement in the _query method to accurately reflect the number of documents retrieved. Since the documents are indeed contained within results['documents'][0], the logging statement should be modified as follows:

logger.debug(f"> Top {len(results['documents'][0])} nodes:")

This adjustment ensures the log message correctly reports the actual number of documents returned by the query, aligning with the structure of the results dictionary where the documents are nested within the first element of the results['documents'] list.

It's important to note that this change is specific to the logging functionality and is based on the current structure of the results dictionary as returned by the ChromaDB query operation. The structure is designed to accommodate batched query results, where each batch corresponds to a list of results for a single query embedding, hence the documents being accessed via results['documents'][0].
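As a rough illustration of the batched shape described above, the sketch below counts hits per query embedding. The helper name `hits_per_query` and the sample data are hypothetical and not part of llama_index or chromadb; the dictionary is hand-built under the assumption that each field holds one inner list per query embedding.

```python
from typing import Any, Dict, List


def hits_per_query(results: Dict[str, Any]) -> List[int]:
    """Return how many documents were retrieved for each query embedding."""
    # Hypothetical helper: one inner list per query embedding is assumed.
    return [len(docs) for docs in results["documents"]]


# Invented result for two query embeddings with three and two hits.
batched = {
    "ids": [["a", "b", "c"], ["d", "e"]],
    "documents": [["doc a", "doc b", "doc c"], ["doc d", "doc e"]],
}

counts = hits_per_query(batched)
print(counts)
```

Because `_query` reads only index 0, the single-query case reduces to the first entry of this list, which matches the corrected log statement.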

If you encounter any further issues or have additional questions, feel free to ask.

