[Question]: on querying responses are totally incorrect #13130

Closed
1 task done
elvisbegovic opened this issue Apr 26, 2024 · 5 comments
Labels
question Further information is requested

Comments


elvisbegovic commented Apr 26, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hi there, sorry to add noise here.

I am struck by how completely incorrect the responses are for a simple "structured" CSV file. Am I doing things correctly?

Step 1: persist the CSV file with load.py:

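from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.embeddings import resolve_embed_model
from llama_index.llms.ollama import Ollama

# Load every file under ./data (including the CSV) as documents.
documents = SimpleDirectoryReader("./data").load_data()
Settings.embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./persist/")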

Step 2: chat with the data using chat.py:

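from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.core.embeddings import resolve_embed_model
from llama_index.llms.ollama import Ollama

# Use the same embedding model and LLM as in load.py.
Settings.embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
# Reload the index persisted in step 1.
storage_context = StorageContext.from_defaults(persist_dir="./persist/")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()

while True:
    user_input = input("Ask your question ?")
    if user_input.lower() == "quit":
        break
    response = query_engine.query(user_input)
    print(response)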

So the code itself runs without errors (llama-index is awesome), BUT every answer to a question about this CSV file is completely wrong. What am I doing wrong here?

@elvisbegovic elvisbegovic added the question Further information is requested label Apr 26, 2024

dosubot bot commented Apr 26, 2024

It sounds like you're encountering issues with the accuracy of responses from your query engine. Let's tackle this step by step to identify potential causes and solutions:

  1. CSV File Structure and Format: Ensure your CSV file's structure is correctly interpreted. If your CSV is structured with specific key-value pairs per row, consider using PagedCSVReader instead of SimpleDirectoryReader. This might align better with your data's format, improving how documents are indexed and queried.

  2. Embedding and LLM Models: The choice of embedding model ("local:BAAI/bge-small-en-v1.5") and LLM model ("llama3") is crucial. Verify these models are suitable for your data type and query needs. Incompatibility or issues with the model's training could lead to inaccurate responses.

  3. Query Processing: Investigate how your queries are processed by query_engine.query(user_input). Ensure this function accurately interprets user inputs and retrieves relevant responses from the index. Misinterpretation at this stage could be a significant factor in receiving incorrect responses.

  4. Data Persistence and Retrieval: Check the persistence and retrieval process of your index through StorageContext. Any mishaps in correctly persisting or retrieving the index could affect response accuracy.

  5. Debugging Steps: Consider adding logging to your scripts to gain insights into each step of the process, from data loading and model selection to query processing and response generation. This can help pinpoint where things might be going awry.

By methodically reviewing these areas, you should be able to identify and resolve the issues causing inaccurate responses.
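
The first suggestion above, one document per CSV row with key-value pairs, can be sketched with the standard library alone. This is only an illustration of the idea, not PagedCSVReader itself or its implementation; the function name `rows_to_documents` and the sample data are hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text: str) -> list[str]:
    """Turn each CSV row into one 'column: value' text block."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# Hypothetical sample data, not taken from the thread.
SAMPLE_CSV = "name,city,age\nAda,London,36\nLinus,Helsinki,54\n"

documents = rows_to_documents(SAMPLE_CSV)
for text in documents:
    print(text)
    print("---")
```

Each resulting string carries its own column names, so a row that is retrieved on its own still says what every value means. Whether that is enough depends on the questions asked: it can help with lookups of a single row, but not with questions that need to aggregate over many rows.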


@elvisbegovic
Author

@dosu OK, but there is NO documentation or example for "PagedCSVReader", and I cannot find any reader package to install. If I shouldn't use SimpleDirectoryReader, as you said, give me a simple example that persists a simple data.csv to the ./persist/ folder using PagedCSVReader.

@logan-markewich
Collaborator

@elvisbegovic structured data is a pretty special case. Unless it makes sense to embed and retrieve individual rows (maybe you have QA pairs, etc.), a vector index is probably not the right approach.

Instead, you probably want to put it into a SQLite DB or similar and do text-to-SQL, or load it into a dataframe and do text-to-pandas.

https://docs.llamaindex.ai/en/stable/use_cases/q_and_a/#qa-over-structured-data
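
As a rough sketch of the SQLite route, the snippet below loads a CSV into an in-memory table using only the standard library and runs one query against it. The sample data, the table name `data`, and the helper `load_csv_into_sqlite` are hypothetical, and the SQL string is hand-written here; in a text-to-SQL setup the LLM would generate it from the user's question.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for data.csv.
SAMPLE_CSV = (
    "product,region,units\n"
    "widget,north,10\n"
    "widget,south,5\n"
    "gadget,north,7\n"
)


def load_csv_into_sqlite(csv_text: str, table: str = "data") -> sqlite3.Connection:
    """Create an in-memory SQLite table whose columns mirror the CSV header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    # NUMERIC affinity stores number-like text as numbers and leaves other text as is.
    columns = ", ".join(f'"{name}" NUMERIC' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    return conn


conn = load_csv_into_sqlite(SAMPLE_CSV)

# Stand-in for the SQL an LLM might produce for
# "How many widgets were sold in total?".
generated_sql = "SELECT SUM(units) FROM data WHERE product = 'widget'"
total = conn.execute(generated_sql).fetchone()[0]
print(total)
```

The database computes the answer exactly over all matching rows, which a vector index cannot do, since it only retrieves the few text chunks most similar to the question.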

@elvisbegovic
Author

Thank you @logan-markewich, I will try what you suggest, and I understand what text-to-SQL tries to achieve. Thanks.

Let me ask one more thing, to be sure I understand. The "base readers" include the CSV file extension, and as far as I know every CSV is "structured". So why does llama-index officially support reading CSV as a core feature if you say it is a "special case"? Is there a different way, compared with what I am doing here, to create documents[] from a CSV? In other words: what is the use case for reading a CSV with SimpleDirectoryReader? Thank you.

@RussellLuo
Contributor

From my observation, the built-in readers of SimpleDirectoryReader mainly focus on extracting unstructured text from various unstructured or structured files.

Take the following sample CSV:

a, b, c
d, e, f
g, h, i

there will be only one document, with the text "d, e, f\ng, h, i" (the first row is dropped, probably because pandas treats it as a header line).
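
The behavior described above can be approximated with the standard library. This is a simplified stand-in, not the library's actual reader code; the function name `csv_to_single_document` is hypothetical.

```python
import csv
import io


def csv_to_single_document(csv_text: str) -> str:
    """Drop the header row and join the remaining rows into one text blob."""
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    body = rows[1:]  # the first row is treated as a header and left out
    return "\n".join(", ".join(row) for row in body)


SAMPLE_CSV = "a, b, c\nd, e, f\ng, h, i\n"

document_text = csv_to_single_document(SAMPLE_CSV)
print(repr(document_text))
```

The column names never reach the text that gets embedded, so the model sees bare values with nothing to say what each one means.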

Similarly, for the following JSON:

{
  "a": "b",
  "c": [
    {
      "n": 1
    },
    {
      "n": 2
    }
  ]
}

you will also get a single document that contains the text: "{\n  "a": "b",\n  "c": [\n    {\n      "n": 1\n    },\n    {\n      "n": 2\n    }\n  ]\n}\n".

@elvisbegovic Returning to the current issue, I think what matters is what your CSV file contains and what Q/A behavior you expect. It would help if you could provide more information.
