[Question]: How do you correctly load MultiModalVectorStoreIndex from persist directory? #13102

Closed
1 task done
emilia-szymanska opened this issue Apr 25, 2024 · 9 comments
Labels
question Further information is requested

Comments

@emilia-szymanska

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am using the following code to create a MultiModalVectorStoreIndex only from images:

if persist_dir and persist_dir.is_dir():
    storage_context: StorageContext = StorageContext.from_defaults(
        persist_dir=persist_dir
    )
    image_index = load_index_from_storage(
        storage_context, embed_model=self._embed_model
    )

elif image_dir and image_dir.is_dir():
    img_documents = load_image_documents(image_dir)  # my function that returns a list of Documents

    client = qdrant_client.QdrantClient(":memory:")
    image_vector_store = QdrantVectorStore(
        client=client, collection_name="image_collection"
    )
    storage_context = StorageContext.from_defaults(
        image_store=image_vector_store
    )

    image_index = MultiModalVectorStoreIndex(
        img_documents,
        embed_model=self._embed_model,
        storage_context=storage_context,
        is_text_vector_store_empty=True,
        show_progress=True,
    )

    if persist_dir:
        image_index.storage_context.persist(persist_dir=persist_dir)

retriever = image_index.as_retriever(embed_model=self._embed_model)
img_nodes = retriever.retrieve(query)

When the persist directory does not exist, everything is okay and the retrieved img_nodes are not empty. However, when I run the code with the same query, but the persist directory exists (created from the previous run), img_nodes is always empty. What am I doing wrong? I went through the documentation and the issues and cannot figure out what I should change. I tried adding service context in load_index_from_storage, but it did not help.
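One way to narrow this down is to check what the persist call actually wrote to disk. Below is a minimal sketch, assuming the persist directory contains JSON files; `summarize_persist_dir` is a hypothetical helper name, not a LlamaIndex API, and it makes no assumption about which file names to expect.

```python
import json
from pathlib import Path


def summarize_persist_dir(persist_dir: Path) -> dict[str, dict]:
    """Map each JSON file in persist_dir to its size and top-level keys.

    Hypothetical diagnostic helper, not part of LlamaIndex.
    """
    summary = {}
    for path in sorted(persist_dir.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            data = None  # not valid JSON; only the size is reported
        keys = sorted(data) if isinstance(data, dict) else []
        summary[path.name] = {
            "bytes": path.stat().st_size,
            "top_level_keys": keys,
        }
    return summary
```

Calling `summarize_persist_dir(persist_dir)` after the first run and printing the result shows which stores were saved and how large they are. A very small or missing file for the image store would suggest that the image embeddings never reached the persist directory.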

@emilia-szymanska emilia-szymanska added the question Further information is requested label Apr 25, 2024

dosubot bot commented Apr 25, 2024

It looks like the issue you're encountering with empty results from your MultiModalVectorStoreIndex when loading from a persist directory might be related to how the storage context is being handled. To ensure that your index is correctly loaded, including all embeddings and associated metadata, follow these steps:

  1. Correctly Persist the Storage Context: After initializing and populating your index, make sure to persist the entire storage context to disk. This is done using the persist method of the StorageContext class, specifying the persist_dir where your storage context, including the vector store, will be saved.

  2. Reload the Storage Context: When you need to reload your index, first recreate the storage context from the persist directory using StorageContext.from_defaults(persist_dir="your_persist_directory").

  3. Use load_index_from_storage to Load the Index: With the storage context reloaded, use the load_index_from_storage function, passing in the newly loaded storage context. This ensures that the index is fully operational and can return results as expected.

Make sure that the persist_dir is correctly pointing to the location where the storage context was saved. This process should help in avoiding the issue of getting empty results when querying the store after it has been created in a previous run.

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

@emilia-szymanska
Author

@dosu that's exactly what I did, and for an index storing text it works well. With this multimodal index, however, I run into problems.


dosubot bot commented Apr 25, 2024

The issue with empty retrieval results when loading from a persist directory might be due to several nuanced factors related to the storage and loading process of your MultiModalVectorStoreIndex. Here are some targeted steps and considerations to troubleshoot and ensure your index loads correctly:

  1. Ensure Correct Persistence and Loading: Confirm that the storage context is accurately persisted and then loaded from the persist directory. This involves using StorageContext.from_defaults() correctly for both saving and reloading the context, ensuring the storage context is initialized consistently.

  2. Check Index ID Consistency: Make sure the index ID is consistent between sessions. This is crucial for confirming that the correct index is being loaded for queries.

  3. Verify Retrieval Results Before and After Persistence: Test the retrieval results before and after persistence and loading to ensure they match. This direct comparison can help identify if the issue arises during the loading process.

  4. Maintain Service Context Consistency: The service context used by your query engine should remain consistent across sessions. This consistency is vital for ensuring that the index functions correctly after being reloaded.

  5. Review Handling of Multiple Indices: If your application uses multiple indices, ensure that the correct index or indices are being loaded from the storage context.

Regarding the use of ":memory:" for the QdrantClient, it's important to remember that this setting means data is only stored in memory during the application's runtime and lost upon termination. For persistent storage across sessions, consider using a file path or database connection string for the QdrantClient.

These steps are derived from practices in the LlamaIndex tests and should help diagnose and resolve the issue with empty retrieval results.
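The before/after comparison from step 3 can be sketched as a small helper that works on plain lists of node IDs. The name `compare_retrievals` is hypothetical and not part of LlamaIndex; how the IDs are collected from the retrieved nodes is left to the caller.

```python
def compare_retrievals(before_ids, after_ids):
    """Report which node IDs were lost or gained after reloading an index.

    Hypothetical helper: both arguments are plain iterables of ID strings.
    """
    before, after = set(before_ids), set(after_ids)
    return {
        "missing_after_reload": sorted(before - after),
        "new_after_reload": sorted(after - before),
        "identical": before == after,
    }
```

If every ID ends up under `missing_after_reload`, as in the empty `img_nodes` case described in this issue, the reloaded index most likely has no image vectors at all, rather than ranking them differently.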


@logan-markewich
Collaborator

Hmm, I think it should be

index = load_index_from_storage(storage_context, image_store=storage_context.vector_stores['image'])

@emilia-szymanska
Author

emilia-szymanska commented Apr 26, 2024

Hmm, I think it should be

index = load_index_from_storage(storage_context, image_store=storage_context.vector_stores['image'])

@logan-markewich
I get a KeyError. The only key in storage_context.vector_stores is 'default', and even if I use that one, my retrieved image documents are still empty.

@emilia-szymanska
Author

@logan-markewich bumping this. For now I have to rebuild the index from scratch every time; it would be very convenient to be able to load it :)

@logan-markewich
Collaborator

@emilia-szymanska one thing to note: if you use memory mode in Qdrant, it doesn't actually save anywhere, so it's not persistable 😅

If I start the Qdrant Docker image, I can run this and it works OK:

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(host="localhost", port=6333)

text_store = QdrantVectorStore(
    "text_collection", client=client
)
image_store = QdrantVectorStore(
    "image_collection", client=client
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# `documents` stands for your own list of Document objects
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

nodes = index.as_retriever().retrieve("test")
print(len(nodes))

index.storage_context.persist(persist_dir="./storage")

text_store = QdrantVectorStore(
    "text_collection", client=client
)
image_store = QdrantVectorStore(
    "image_collection", client=client
)

loaded_storage_context = StorageContext.from_defaults(
    persist_dir="./storage", vector_store=text_store, image_store=image_store
)

index = load_index_from_storage(loaded_storage_context)

nodes = index.as_retriever().retrieve("test")
print(len(nodes))
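If running the Docker image is not an option, qdrant-client also has a local on-disk mode selected with `QdrantClient(path=...)`. The following is a minimal sketch under that assumption; the helper name `open_local_image_store` and the data directory are hypothetical, and this variant was not tried with MultiModalVectorStoreIndex anywhere in this thread.

```python
from pathlib import Path


def open_local_image_store(data_dir: Path, collection_name: str = "image_collection"):
    """Return a QdrantVectorStore backed by files under data_dir (no server).

    Hypothetical helper; assumes qdrant-client's local mode is acceptable.
    """
    # Imported lazily so the module still imports where the optional
    # qdrant packages are not installed.
    import qdrant_client
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    data_dir.mkdir(parents=True, exist_ok=True)
    # path=... keeps the collection on disk; ":memory:" would keep it in RAM
    # only, so nothing would survive a restart.
    client = qdrant_client.QdrantClient(path=str(data_dir))
    return QdrantVectorStore(client=client, collection_name=collection_name)
```

As in the example above, the store would presumably still have to be passed again as `image_store=` to `StorageContext.from_defaults(persist_dir=...)` when reloading, since the vectors live in Qdrant rather than in the persist directory.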

@emilia-szymanska
Author

emilia-szymanska commented May 7, 2024

@logan-markewich Oh, fair point. Is there a specific reason why in all MultiModalVectorStoreIndex-related tutorials Qdrant is used? Could I use something that does not require any Docker or API key (apart from Azure OpenAI key), but simply runs? Please consider that I am looking for a beginner's solution, the simpler the better :)

Edit: I store only images, no text, so I cannot use from_vector_store method to recreate a VectorStoreIndex, since it "Cannot initialize from a vector store that does not store text.".

@emilia-szymanska
Author

I will include the solution that @logan-markewich suggested on discord. It works for me, so I am closing the issue! :)

if persist_dir and persist_dir.is_dir():
    # Reload the index that a previous run persisted
    storage_context: StorageContext = StorageContext.from_defaults(
        persist_dir=persist_dir
    )

    image_index = load_index_from_storage(
        storage_context, embed_model=self._embed_model
    )

elif image_dir and image_dir.is_dir():
    img_documents = load_image_documents(image_dir)

    # No explicit Qdrant store here: the index falls back to the default stores
    image_index = MultiModalVectorStoreIndex.from_documents(
        img_documents,
        embed_model=self._embed_model,
        is_text_vector_store_empty=True,
        show_progress=True,
    )
    if persist_dir:
        image_index.storage_context.persist(persist_dir=persist_dir)

retriever = image_index.as_retriever(embed_model=self._embed_model)
img_nodes = retriever.retrieve(query)
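The control flow of this solution (load if the directory exists, otherwise build and persist) can also be written as a small generic helper. This is only a sketch: `load_or_build` and its callable parameters are hypothetical names, not LlamaIndex APIs, and the LlamaIndex-specific calls are passed in by the caller.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: Optional[Path],
    load: Callable[[Path], T],
    build: Callable[[], T],
    persist: Callable[[T, Path], None],
) -> T:
    """Load an index from persist_dir if it exists, else build and persist it."""
    if persist_dir is not None and persist_dir.is_dir():
        return load(persist_dir)
    index = build()
    # Persist only when a target directory was configured
    if persist_dir is not None:
        persist(index, persist_dir)
    return index
```

In terms of the code above, `load` would wrap `StorageContext.from_defaults` plus `load_index_from_storage`, `build` would wrap `MultiModalVectorStoreIndex.from_documents`, and `persist` would call `index.storage_context.persist`. Keeping the three steps separate makes it easy to test the branching without building a real index.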
