Error when generating summary for long documents: 'ValueError: A single document was longer than the context length, we cannot handle this.' #21284

Open
Brritany opened this issue May 3, 2024 · 0 comments
Labels
  • 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
  • 🔌: huggingface (Primarily related to HuggingFace integrations)
  • Ɑ: text splitters (Related to text splitters package)

Brritany commented May 3, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain import LLMChain, HuggingFacePipeline, PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=3000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=terminators
)

llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={
        "max_new_tokens": 256,
        "temperature": 0,
        "eos_token_id": terminators,
        "pad_token_id": tokenizer.eos_token_id,
    },
)

paul_graham_essay = '/content/startupideas.txt'

with open(paul_graham_essay, 'r', encoding='utf-8') as file:
    essay = file.read()

llm.get_num_tokens(essay)
# -> Token indices sequence length is longer than the specified maximum sequence length for this model (9568 > 1024). Running this sequence through the model will result in indexing errors
# -> 9568

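# Note: chunk_size and chunk_overlap below are measured in characters (the splitter's default length function), not in model tokens.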
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "."], chunk_size=3000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])

summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce', token_max=1000)
output = summary_chain.invoke(docs)

Error Message and Stack Trace (if applicable)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
(the line above is repeated once per generation call)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-13-e791bf376fd5>](https://localhost:8080/#) in <cell line: 1>()
----> 1 output = summary_chain.invoke(docs)

6 frames
/usr/local/lib/python3.10/dist-packages/langchain/chains/combine_documents/reduce.py in split_list_of_docs(docs, length_func, token_max, **kwargs)
     48         if _num_tokens > token_max:
     49             if len(_sub_result_docs) == 1:
---> 50                 raise ValueError(
     51                     "A single document was longer than the context length,"
     52                     " we cannot handle this."

ValueError: A single document was longer than the context length, we cannot handle this.

Description

I am attempting to generate summaries for long documents using the LangChain library with the Llama-3 model, but I encounter a ValueError stating that "A single document was longer than the context length, we cannot handle this." The error occurs even after splitting the document into smaller chunks.

Expected Behavior

I expect the summary chain to generate concise summaries for each document chunk without exceeding the token limit.

Actual Behavior

The process fails with the ValueError shown above, raised from split_list_of_docs during the reduce step, which suggests that the document chunks (or the intermediate map results) still exceed the token_max configured for the summary chain.
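
To illustrate, this is the kind of check I can run (a minimal sketch, assuming the tokenizer, docs, and token_max=1000 from the example code above): count each chunk with the Llama-3 tokenizer and compare it against the limit enforced in split_list_of_docs.

token_max = 1000  # same value passed to load_summarize_chain above
for i, doc in enumerate(docs):
    # Count tokens the way the Llama-3 tokenizer sees the chunk text.
    n_tokens = len(tokenizer.encode(doc.page_content))
    status = "OVER" if n_tokens > token_max else "ok"
    print(f"chunk {i}: {n_tokens} tokens (limit {token_max}) {status}")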

Possible Solution

I suspect this is related to how the RecursiveCharacterTextSplitter handles chunking: chunk_size=3000 is measured in characters (the splitter's default length function), not in model tokens, so the character-based chunks don't line up with the token_max=1000 limit used by the chain, and I'm not sure how to adjust it correctly so that every chunk stays within the acceptable token limit.
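
One adjustment I have in mind (an untested sketch, not a confirmed fix, with placeholder chunk sizes I picked myself) is to build the splitter from the Hugging Face tokenizer so that chunk_size and chunk_overlap are counted in Llama-3 tokens rather than characters, keeping chunk_size below the token_max given to the chain:

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    separators=["\n\n", "\n", "."],
    chunk_size=800,     # now measured in tokens; illustrative value kept below token_max=1000
    chunk_overlap=100,  # tokens; illustrative value
)
docs = text_splitter.create_documents([essay])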

Additional Context

I tried reducing the chunk_size and adjusting the chunk_overlap, but these attempts did not resolve the issue. Any guidance on how to ensure that the document chunks conform to the specified token limits would be greatly appreciated.
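
Another thing I am considering (the value here is an assumption on my side, not verified): token_max=1000 is well below the max_length=3000 configured on the pipeline, so aligning token_max with what the pipeline can actually handle might also be relevant:

summary_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    token_max=3000,  # illustrative; matches the pipeline's max_length above
)
output = summary_chain.invoke(docs)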

System Info

Environment

  • LangChain version: 0.1.17
  • Transformers version: 4.40.1
  • Accelerate version: 0.30.0
  • Torch version: 2.2.1+cu121
  • Platform: Google Colab (Nvidia A100 GPU)