[Question]: Why is this llama-index query to Ollama (llama3) always timing out? #13106

Open · komal-SkyNET opened this issue Apr 25, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

komal-SkyNET commented Apr 25, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

So far, everything I've tried with llama-index has failed due to timeouts. In contrast, I run multiple projects in LangChain where responses take under 10 seconds against the same LLM, Ollama (llama3).

Here's my llama-index trace.

**********
Trace: query
    |_query -> 120.166777 seconds
      |_templating -> 7.7e-05 seconds
      |_llm -> 0.0 seconds
**********

Running on an M1 MacBook Air.

Code:

import nest_asyncio, llama_index, chromadb
nest_asyncio.apply()
import logging
import sys
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
llama_index.core.set_global_handler("simple")
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])


Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.chunk_overlap = 0
Settings.chunk_size = 128
Settings.callback_manager = callback_manager
# %%

# Load the 10-K filings (this step was omitted from the snippet as posted;
# the input paths below are illustrative, not from the original issue)
lyft_docs = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["./data/10k/uber_2021.pdf"]).load_data()

lyft_index = VectorStoreIndex.from_documents(lyft_docs, embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"))
uber_index = VectorStoreIndex.from_documents(uber_docs, embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"))

# %%
lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)
uber_engine = uber_index.as_query_engine(similarity_top_k=3)


query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(
            name="lyft_10k",
            description=(
                "Provides information about Lyft financials for year 2021"
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(
            name="uber_10k",
            description=(
                "Provides information about Uber financials for year 2021"
            ),
        ),
    ),
]



s_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools
)

# %%
response = s_engine.query(
    "Compare and contrast the customer segments and geographies that grew the"
    " fastest"
)
komal-SkyNET added the question label on Apr 25, 2024

dosubot bot commented Apr 25, 2024

The timeout issue you're experiencing with llama-index queries to Ollama (llama3) might not be solely due to the request_timeout setting, especially since you've already set it to 120 seconds, which is quite generous. Given that your queries work fine in langchain with less than 10 seconds response time against the same LLM, it suggests that the problem might lie elsewhere.

Considering the complexity of your query and the operations involved in processing it, there are a few areas you might want to investigate:

  1. Query Complexity and Server Load: The complexity of your query or a high load on the Ollama server could be contributing to the timeouts. If possible, try simplifying your query or running it at a different time to see if the issue persists.

  2. Batch Size and Embedding Model Efficiency: The VectorStoreIndex.from_documents method you're using is computationally intensive. Adjusting the embedding batch size or opting for a more efficient embedding model could help; in particular, using ONNX models through the OptimumEmbedding class for faster inference might improve performance (see the sketch after this list).

  3. Network Latency: Although less likely, network latency between your machine and the Ollama server could also be a factor. This is harder to diagnose but worth considering if the above adjustments don't resolve the issue.
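
For point 2, here is a rough sketch of what the ONNX route could look like. This is not from the thread: it assumes the optional llama-index-embeddings-huggingface-optimum package and its OptimumEmbedding class, and the output folder name is illustrative.

# Hedged sketch: export the same BGE model used in the issue to ONNX, then embed with it.
# Assumes: pip install llama-index-embeddings-huggingface-optimum
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.huggingface_optimum import OptimumEmbedding

# One-time export of the model to an ONNX folder ("./bge_onnx" is an illustrative path).
OptimumEmbedding.create_and_save_optimum_model("BAAI/bge-small-en-v1.5", "./bge_onnx")

# Use the exported ONNX model for indexing/querying.
embed_model = OptimumEmbedding(folder_name="./bge_onnx")

# Alternatively, keep HuggingFaceEmbedding and just tune the batch size
# (embed_batch_size is assumed from the base embedding interface):
# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=32)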

If after exploring these areas the problem remains, it might be worth reaching out to the Ollama support or community forums with specific details about your setup and the issues you're facing for more targeted assistance.


logan-markewich (Collaborator)

Set the request timeout larger; I usually do something like request_timeout=3000.0.
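
For concreteness, a minimal sketch of this suggestion applied to the snippet in the issue; only the timeout value changes, everything else is the same Ollama setup from the question.

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# The trace above stops right around the original request_timeout=120 seconds,
# so give the local llama3 server far more headroom before the client gives up.
Settings.llm = Ollama(model="llama3", request_timeout=3000.0)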
