[Question]: How do you correctly load MultiModalVectorStoreIndex from persist directory? #13102

emilia-szymanska · 2024-04-25T08:52:18Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

I am using the following code to create a MultiModalVectorStoreIndex only from images:

if persist_dir and persist_dir.is_dir():
    storage_context: StorageContext = StorageContext.from_defaults(
        persist_dir=persist_dir
    )
    image_index = load_index_from_storage(
        storage_context, embed_model=self._embed_model
    )

elif image_dir and image_dir.is_dir():
    img_documents = load_image_documents(image_dir)  # my function that returns a list of Documents

    client = qdrant_client.QdrantClient(":memory:")
    image_vector_store = QdrantVectorStore(
        client=client, collection_name="image_collection"
    )
    storage_context = StorageContext.from_defaults(
        image_store=image_vector_store
    )

    image_index = MultiModalVectorStoreIndex(
        img_documents,
        embed_model=self._embed_model,
        storage_context=storage_context,
        is_text_vector_store_empty=True,
        show_progress=True,
    )

    if persist_dir:
        image_index.storage_context.persist(persist_dir=persist_dir)

retriever = image_index.as_retriever(embed_model=self._embed_model)
img_nodes = retriever.retrieve(query)

When the persist directory does not exist, everything is okay and the retrieved img_nodes are not empty. However, when I run the code with the same query, but the persist directory exists (created from the previous run), img_nodes is always empty. What am I doing wrong? I went through the documentation and the issues and cannot figure out what I should change. I tried adding service context in load_index_from_storage, but it did not help.

dosubot · 2024-04-25T08:52:35Z

It looks like the issue you're encountering with empty results from your MultiModalVectorStoreIndex when loading from a persist directory might be related to how the storage context is being handled. To ensure that your index is correctly loaded, including all embeddings and associated metadata, follow these steps:

1. Correctly persist the storage context: After initializing and populating your index, persist the entire storage context to disk. This is done with the persist method of the StorageContext class, passing the persist_dir where the storage context, including the vector store, will be saved.
2. Reload the storage context: When you need to reload your index, first recreate the storage context from the persist directory using StorageContext.from_defaults(persist_dir="your_persist_directory").
3. Use load_index_from_storage to load the index: With the storage context reloaded, call load_index_from_storage and pass in the newly loaded storage context. This ensures that the index is fully operational and can return results as expected.

Make sure that the persist_dir is correctly pointing to the location where the storage context was saved. This process should help in avoiding the issue of getting empty results when querying the store after it has been created in a previous run.
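The persist and reload steps above amount to a load-or-build pattern. The sketch below shows only that control flow, using plain callables so it runs without LlamaIndex; `load_or_build` and its parameters are hypothetical names, not library API. In this thread, `load` would correspond to `StorageContext.from_defaults(persist_dir=...)` followed by `load_index_from_storage(...)`, and `persist` to `index.storage_context.persist(persist_dir=...)`.

```python
from pathlib import Path


def load_or_build(persist_dir, load, build, persist):
    """Load an object from persist_dir if the directory exists; otherwise build and persist it.

    load(path) -> obj, build() -> obj, and persist(obj, path) are supplied by the caller.
    """
    path = Path(persist_dir)
    if path.is_dir():
        # A previous run already wrote something here, so reuse it.
        return load(path)
    obj = build()
    path.mkdir(parents=True, exist_ok=True)
    persist(obj, path)
    return obj
```

Note that the pattern only helps if `persist` really writes everything `load` needs; a store that keeps its data in memory only would make the second run load an empty result.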

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

emilia-szymanska · 2024-04-25T09:34:05Z

@dosu that's exactly what I did, and for an index that stores text it works well. However, with this multimodal index I run into problems.

dosubot · 2024-04-25T10:19:19Z

The issue with empty retrieval results when loading from a persist directory might be due to several nuanced factors related to the storage and loading process of your MultiModalVectorStoreIndex. Here are some targeted steps and considerations to troubleshoot and ensure your index loads correctly:

1. Ensure correct persistence and loading: Confirm that the storage context is accurately persisted and then loaded from the persist directory. This involves using StorageContext.from_defaults() correctly for both saving and reloading the context, so that the storage context is initialized consistently.
2. Check index ID consistency: Make sure the index ID is consistent between sessions. This is crucial for confirming that the correct index is being loaded for queries.
3. Verify retrieval results before and after persistence: Test the retrieval results before and after persistence and loading to ensure they match. This direct comparison can help identify whether the issue arises during the loading process.
4. Maintain service context consistency: The service context used by your query engine should remain consistent across sessions. This consistency is vital for ensuring that the index functions correctly after being reloaded.
5. Review handling of multiple indices: If your application uses multiple indices, ensure that the correct index or indices are being loaded from the storage context.
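
The before/after comparison suggested in the list above can be sketched as a small helper that compares the IDs of the retrieved nodes. This is a minimal sketch, not LlamaIndex API: `compare_retrievals` is a hypothetical name, and it takes plain ID strings so that it does not depend on any particular node class.

```python
def compare_retrievals(ids_before, ids_after):
    """Compare node IDs retrieved before persisting with those retrieved after reloading."""
    before, after = set(ids_before), set(ids_after)
    return {
        # IDs found by the freshly built index but not by the reloaded one
        "missing_after_reload": sorted(before - after),
        # IDs found only by the reloaded index
        "new_after_reload": sorted(after - before),
        "identical": before == after,
    }
```

A possible call, assuming each retrieved item exposes a `node_id` attribute, would be `compare_retrievals([n.node_id for n in nodes_before], [n.node_id for n in nodes_after])`. An empty result after reload shows up as every ID listed under `missing_after_reload`.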

Regarding the use of ":memory:" for the QdrantClient, it's important to remember that this setting means data is only stored in memory during the application's runtime and lost upon termination. For persistent storage across sessions, consider using a file path or database connection string for the QdrantClient.

These steps are derived from practices in the LlamaIndex tests and should help diagnose and resolve the issue with empty retrieval results.

logan-markewich · 2024-04-25T14:55:00Z

Hmm, I think it should be

index = load_index_from_storage(storage_context, image_store=storage_context.vector_stores['image'])

emilia-szymanska · 2024-04-26T07:03:52Z

> Hmm, I think it should be
>
> index = load_index_from_storage(storage_context, image_store=storage_context.vector_stores['image'])

@logan-markewich
I get a KeyError. The only key in storage_context.vector_stores is 'default', and even if I use that one, my retrieved image documents are still empty.
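
A quick way to see what actually reached disk is to list the files in the persist directory. The sketch below is a hypothetical stdlib-only helper (`summarize_persist_dir` is not part of LlamaIndex) and assumes that stores which persist to disk write `.json` files directly into that directory; a store that keeps its data elsewhere would leave no file there.

```python
from pathlib import Path


def summarize_persist_dir(persist_dir):
    """Return {file name: size in bytes} for every .json file directly inside persist_dir.

    Returns an empty dict if the directory does not exist.
    """
    root = Path(persist_dir)
    if not root.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(root.glob("*.json"))}
```

Calling `summarize_persist_dir(persist_dir)` after the first run and finding no file that corresponds to the image store would suggest that the image embeddings were never written, which would be consistent with the empty retrieval results.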

emilia-szymanska · 2024-05-02T07:04:53Z

@logan-markewich bumping this up. For now I need to rebuild the index from scratch every time; it would be very convenient to be able to load it :)

logan-markewich · 2024-05-06T17:25:35Z

@emilia-szymanska one thing to note: if you use memory mode in Qdrant, it doesn't actually save anywhere, so it's not persistable 😅

If I start the Qdrant Docker image, I can run this and it works OK:

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# Connect to a running Qdrant server instead of ":memory:", so the vectors outlive this process
client = qdrant_client.QdrantClient(host="localhost", port=6333)

text_store = QdrantVectorStore(
    "text_collection", client=client
)
image_store = QdrantVectorStore(
    "image_collection", client=client
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

index = MultiModalVectorStoreIndex.from_documents(
    documents,  # placeholder: your own list of documents
    storage_context=storage_context
)

nodes = index.as_retriever().retrieve("test")
print(len(nodes))

index.storage_context.persist(persist_dir="./storage")

# Reload: recreate the store objects against the same collections,
# then pass them in together with the persist directory
text_store = QdrantVectorStore(
    "text_collection", client=client
)
image_store = QdrantVectorStore(
    "image_collection", client=client
)

loaded_storage_context = StorageContext.from_defaults(
    persist_dir="./storage", vector_store=text_store, image_store=image_store
)

index = load_index_from_storage(loaded_storage_context)

nodes = index.as_retriever().retrieve("test")
print(len(nodes))

emilia-szymanska · 2024-05-07T07:24:05Z

@logan-markewich Oh, fair point. Is there a specific reason why in all MultiModalVectorStoreIndex-related tutorials Qdrant is used? Could I use something that does not require any Docker or API key (apart from Azure OpenAI key), but simply runs? Please consider that I am looking for a beginner's solution, the simpler the better :)

Edit: I store only images, no text, so I cannot use from_vector_store method to recreate a VectorStoreIndex, since it "Cannot initialize from a vector store that does not store text.".

emilia-szymanska · 2024-05-08T09:00:41Z

I am including the solution that @logan-markewich suggested on Discord. It works for me, so I am closing the issue! :)

if persist_dir and persist_dir.is_dir():
    storage_context: StorageContext = StorageContext.from_defaults(
        persist_dir=persist_dir
    )

    image_index = load_index_from_storage(
        storage_context, embed_model=self._embed_model
    )

elif image_dir and image_dir.is_dir():
    img_documents = load_image_documents(image_dir)

    image_index = MultiModalVectorStoreIndex.from_documents(
        img_documents,
        embed_model=self._embed_model,
        is_text_vector_store_empty=True,
        show_progress=True,
    )
    if persist_dir:
        image_index.storage_context.persist(persist_dir=persist_dir)

retriever = image_index.as_retriever(embed_model=self._embed_model)
img_nodes = retriever.retrieve(query)

emilia-szymanska added the "question" label Apr 25, 2024

emilia-szymanska closed this as completed May 8, 2024
