[Question]: on querying responses are totally incorrect #13130

elvisbegovic · 2024-04-26T19:52:09Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

Hi there, sorry to make noise here.

I am surprised at how completely incorrect the responses are for a simple "structured" CSV file. Am I doing things correctly?

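Step 1, persist the CSV file with load.py:

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.embeddings import resolve_embed_model
from llama_index.llms.ollama import Ollama

documents = SimpleDirectoryReader("./data").load_data()
Settings.embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./persist/")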

Step 2, chat with the data with chat.py:

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.core.embeddings import resolve_embed_model
from llama_index.llms.ollama import Ollama

Settings.embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
storage_context = StorageContext.from_defaults(persist_dir="./persist/")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

while True:
    user_input = input("Ask your question: ")
    if user_input.lower() == "quit":
        break
    response = query_engine.query(user_input)
    print(response)

So code-wise everything works perfectly (llama-index is awesome), BUT for every question I ask about this CSV file, the responses are completely wrong. What am I doing wrong here?

dosubot · 2024-04-26T19:52:44Z

It sounds like you're encountering issues with the accuracy of responses from your query engine. Let's tackle this step by step to identify potential causes and solutions:

CSV File Structure and Format: Ensure your CSV file's structure is correctly interpreted. If your CSV is structured with specific key-value pairs per row, consider using PagedCSVReader instead of SimpleDirectoryReader. This might align better with your data's format, improving how documents are indexed and queried.
Embedding and LLM Models: The choice of embedding model ("local:BAAI/bge-small-en-v1.5") and LLM model ("llama3") is crucial. Verify these models are suitable for your data type and query needs. Incompatibility or issues with the model's training could lead to inaccurate responses.
Query Processing: Investigate how your queries are processed by query_engine.query(user_input). Ensure this function accurately interprets user inputs and retrieves relevant responses from the index. Misinterpretation at this stage could be a significant factor in receiving incorrect responses.
Data Persistence and Retrieval: Check the persistence and retrieval process of your index through StorageContext. Any mishaps in correctly persisting or retrieving the index could affect response accuracy.
Debugging Steps: Consider adding logging to your scripts to gain insights into each step of the process, from data loading and model selection to query processing and response generation. This can help pinpoint where things might be going awry.

By methodically reviewing these areas, you should be able to identify and resolve the issues causing inaccurate responses.


elvisbegovic · 2024-04-26T20:32:23Z

@dosu OK, but there is NO documentation or example for "PagedCSVReader", and I cannot find any readers to install. If I shouldn't use SimpleDirectoryReader, as you said, give me a simple example that persists a simple data.csv to the ./persist/ folder using PagedCSVReader.
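
No PagedCSVReader example is given in this thread. As a library-free illustration of the idea behind the suggestion (treating each CSV row as its own small key-value document instead of one text blob), here is a minimal sketch using only the standard library. The function name `rows_to_documents` and the sample data are hypothetical, and this is not the reader's actual implementation.

```python
import csv
import io
from typing import List

# Hypothetical sample data; this is NOT the CSV from this issue.
SAMPLE_CSV = "name,city,age\nAda,London,36\nLinus,Helsinki,28\n"


def rows_to_documents(csv_text: str) -> List[str]:
    """Render every CSV row as its own 'column: value' text document."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    documents = []
    for row in reader:
        # Keep the column name next to each value so a row stays
        # self-describing once it is separated from the header.
        lines = [f"{column}: {value}" for column, value in row.items()]
        documents.append("\n".join(lines))
    return documents


if __name__ == "__main__":
    for document in rows_to_documents(SAMPLE_CSV):
        print(document)
        print("---")
```

Each resulting string could then be wrapped in a document object and embedded individually, which only helps when retrieving single rows makes sense for the questions being asked.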

logan-markewich · 2024-04-26T20:44:50Z

@elvisbegovic structured data is a pretty special case. Unless it makes sense to embed and retrieve individual rows (maybe you have QA pairs, etc.), using a vector index is probably not the right approach.

Instead you probably want to put it into a sqlite db or similar and do text-to-sql, or a dataframe and do text-to-pandas

https://docs.llamaindex.ai/en/stable/use_cases/q_and_a/#qa-over-structured-data
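
To make the SQLite suggestion concrete, here is a rough sketch that uses only the standard library: it loads a CSV into an in-memory SQLite table and runs a query of the kind a text-to-SQL step might produce. The table name `sales`, the sample data, and the hand-written `GENERATED_SQL` string are all assumptions for illustration; no LLM is called here.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; this is NOT the CSV from this issue.
SAMPLE_CSV = "product,region,units\nwidget,EU,10\nwidget,US,25\ngadget,EU,7\n"

# Hand-written stand-in for the SQL a text-to-SQL step might generate
# for the question "How many widgets were sold in total?".
GENERATED_SQL = "SELECT SUM(units) FROM sales WHERE product = 'widget'"


def load_csv_into_sqlite(csv_text: str) -> sqlite3.Connection:
    """Create an in-memory 'sales' table and fill it from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, region TEXT, units INTEGER)")
    conn.executemany(
        "INSERT INTO sales (product, region, units) VALUES (?, ?, ?)",
        [(r["product"], r["region"], int(r["units"])) for r in rows],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    connection = load_csv_into_sqlite(SAMPLE_CSV)
    total = connection.execute(GENERATED_SQL).fetchone()[0]
    print(total)
```

The point of this design is that the database computes the answer exactly, so the LLM only has to translate the question into a query rather than read numbers out of retrieved text.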

elvisbegovic · 2024-04-26T21:16:56Z

Thank you @logan-markewich, I will try what you suggest, and I understand what text-to-SQL tries to achieve. Thanks.

Let me ask, to be sure I understand: the "base readers" include the CSV file extension, and as far as I know every CSV is "structured". So how is it that llama-index officially supports reading CSV as a core feature, yet you say it is a "special case"? Is there a different way from what I'm doing here to create documents[] from a CSV? In other words: what is the use case for reading a CSV with SimpleDirectoryReader? Thank you.

RussellLuo · 2024-04-27T08:50:28Z

From my observation, the built-in readers of SimpleDirectoryReader mainly focus on parsing unstructured text out of various unstructured or structured files.

As for a sample CSV as below:

a, b, c
d, e, f
g, h, i

there will be only one document, with the text "d, e, f\ng, h, i" (the first row is ignored, since pandas might treat it as a header row).

Similarly, for the following JSON:

{
  "a": "b",
  "c": [
    {
      "n": 1
    },
    {
      "n": 2
    }
  ]
}

you will also get a single document that contains the text: "{\n "a": "b",\n "c": [\n {\n "n": 1\n },\n {\n "n": 2\n }\n ]\n}\n".
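
The CSV case above can be mimicked with the standard library alone. The sketch below is not the reader's code; it only reproduces the behavior described in this comment (header row dropped, remaining rows joined into one text), and the function name `flatten_csv` is hypothetical.

```python
import csv
import io

# The sample CSV from the comment above.
SAMPLE_CSV = "a, b, c\nd, e, f\ng, h, i\n"


def flatten_csv(csv_text: str) -> str:
    """Drop the header row and join the remaining rows into one text."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the first row, treated as the header
    return "\n".join(", ".join(row) for row in reader)


if __name__ == "__main__":
    # repr() shows the embedded newline explicitly.
    print(repr(flatten_csv(SAMPLE_CSV)))
```

If the reader behaves as described, the column names never reach the index, so any question that depends on them has to be answered from bare comma-separated values.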

@elvisbegovic Returning to the current issue, I think it depends on what your CSV file contains and what Q/A behavior you expect. It would help if you could provide more information.

elvisbegovic added the question Further information is requested label Apr 26, 2024

elvisbegovic closed this as completed May 3, 2024
