
[Question]: Summarize a long document? #13071

Open · 1 task done
chigkim opened this issue Apr 24, 2024 · 3 comments
Labels
question Further information is requested

Comments


chigkim commented Apr 24, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I'm trying to summarize a long document. I loaded a PDF, which SimpleDirectoryReader split into 74 documents.
As expected, DocumentSummaryIndex.from_documents goes through each document and creates a summary via the selected LLM.
However, query_engine.query("Summarize the documents") selects only one node and sends it to the LLM to summarize.
How do I make it summarize the entire document using all the nodes?
Is there a different QueryEngine or Index I should use for this purpose?

from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader, get_response_synthesizer

# Load the PDF; the default reader produces one Document per page.
documents = SimpleDirectoryReader(input_files=path).load_data()
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize", use_async=True)
doc_summary_index = DocumentSummaryIndex.from_documents(documents, response_synthesizer=response_synthesizer)
query_engine = doc_summary_index.as_query_engine(response_mode="tree_summarize", use_async=True)
response = query_engine.query("Summarize the documents")
print(response.response)
print(len(response.source_nodes))

I'm sure there must be a tutorial on this, but I couldn't find one.
The closest I could find was "Document Summary Index" in the LlamaIndex docs.
However, it's not exactly what I'm looking for: it's more of a question-and-answer system built on embeddings and an LLM.
Also, I'm not sure what the advantage of DocumentSummaryIndex over VectorStoreIndex is for Q&A.

Thanks so much for your help!

@chigkim chigkim added the question Further information is requested label Apr 24, 2024

dosubot bot commented Apr 24, 2024

To achieve a comprehensive summary of your entire document split into 74 parts, you should consider using the QASummaryQueryEngineBuilder from LlamaIndex. This builder allows for the creation of a query engine that is capable of handling both question answering and summarization tasks across multiple document segments. Here's a streamlined approach:

  1. Prepare Your Documents: Make sure each document part is properly represented as a Document instance.

  2. Instantiate QASummaryQueryEngineBuilder: This builder is designed to accommodate customization, including the use of a specific Large Language Model (LLM), embedding models, and more.

  3. Build the Query Engine: Utilize the build_from_documents method to process your documents and create a RouterQueryEngine. This engine is configured for both QA and summarization tasks, enabling it to generate summaries across all document parts.

  4. Perform Summarization Query: With the RouterQueryEngine, you can now query for a summary that encompasses all parts of your document (see the sketch after this list).
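
A minimal sketch of these four steps, assuming v0.10-style llama_index.core imports (adjust to your installed version) and reusing the path variable from your snippet; llm and embed_model can optionally be passed to the builder to override the defaults:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.composability import QASummaryQueryEngineBuilder

# Step 1: each page of the PDF becomes its own Document instance.
documents = SimpleDirectoryReader(input_files=path).load_data()

# Steps 2-3: build a RouterQueryEngine that routes between a QA (vector)
# branch and a summarization branch built over all documents.
builder = QASummaryQueryEngineBuilder()
query_engine = builder.build_from_documents(documents)

# Step 4: the router sends this query to the summarization branch,
# which synthesizes over every node rather than a single retrieved one.
response = query_engine.query("Summarize the documents")
print(response.response)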

Regarding the difference between DocumentSummaryIndex and VectorStoreIndex: the former stores an LLM-generated summary for each document and can retrieve against those summaries, which suits summarization-flavored workloads and quick insight into large texts. VectorStoreIndex, by contrast, embeds each node and retrieves only the top-k most similar nodes for a query, which is the usual choice for straightforward Q&A.
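
For comparison, a plain Q&A setup over the same documents with VectorStoreIndex (a sketch; the top-k setting and the query string are illustrative):

from llama_index.core import VectorStoreIndex

# Embeds every node; a query retrieves only the top-k most similar
# nodes and answers from those, rather than summarizing everything.
vector_index = VectorStoreIndex.from_documents(documents)
qa_engine = vector_index.as_query_engine(similarity_top_k=3)
print(qa_engine.query("What is the main argument of the second section?"))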

For implementation details and further guidance, you might want to explore the QASummaryQueryEngineBuilder documentation and examples provided by LlamaIndex. This will give you a clearer understanding of how to set up and use the query engine for document summarization across multiple segments.


chigkim commented Apr 24, 2024

Thanks @dosu.
How do you use Ollama as the LLM for QASummaryQueryEngineBuilder? It tries to use OpenAI even though I specified the llm parameter.

from llama_index.llms.ollama import Ollama
from llama_index.core.composability import QASummaryQueryEngineBuilder

llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)
query_engine_builder = QASummaryQueryEngineBuilder(llm=llm)

This raises:

ValueError: No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys


dosubot bot commented Apr 24, 2024

To resolve the issue where specifying Ollama as the LLM for QASummaryQueryEngineBuilder results in an error requesting an OpenAI API key: some component is still falling back to the OpenAI defaults, possibly through how the ServiceContext is initialized, and likely also through the default embedding model, which is OpenAI's. Here's how you can explicitly use local models and avoid this error:

  1. Explicitly Set the LLM in ServiceContext: Ensure that your local LLM (Ollama in this case) is set in the ServiceContext. This overrides any default behavior that falls back to OpenAI's LLM. Passing embed_model="local" (which loads a local Hugging Face embedding model) likewise keeps the embedding step off OpenAI.

from llama_index.llms.ollama import Ollama
from llama_index.core import ServiceContext

# Initialize the local LLM and build a ServiceContext around it.
llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

  2. Set Global Service Context: To ensure that your entire application uses this ServiceContext, set it globally at the start of your application, so all parts of your pipeline use the specified LLM.

from llama_index.core import set_global_service_context

set_global_service_context(service_context)

By following these steps, you should be able to bypass the OpenAI API key error and use Ollama as your LLM for the QASummaryQueryEngineBuilder. This keeps your specified models in use throughout the application, avoiding unintended fallbacks to the defaults.
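
Alternatively, and this is an assumption based on the builder accepting llm and embed_model keyword arguments (worth verifying against your installed version), you may be able to pass both models straight to the builder instead of going through a ServiceContext. HuggingFaceEmbedding here is an illustrative local embedding model that requires the llama-index-embeddings-huggingface package:

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.composability import QASummaryQueryEngineBuilder

llm = Ollama(model="llama3", request_timeout=600, base_url=host, additional_kwargs=options)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Passing both models explicitly keeps the builder from ever touching
# the OpenAI defaults for either the LLM or the embeddings.
builder = QASummaryQueryEngineBuilder(llm=llm, embed_model=embed_model)
query_engine = builder.build_from_documents(documents)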

