Error when generating summary for long documents: 'ValueError: A single document was longer than the context length, we cannot handle this.' #21284

Open
Brritany opened this issue May 3, 2024 · 0 comments
Labels
  • 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
  • 🔌: huggingface (Primarily related to HuggingFace integrations)
  • Ɑ: text splitters (Related to text splitters package)

Brritany commented May 3, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain import LLMChain, HuggingFacePipeline, PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=3000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=terminators
)

llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={
        "max_new_tokens": 256,
        "temperature": 0,
        "eos_token_id": terminators,
        "pad_token_id": tokenizer.eos_token_id,
    },
)

paul_graham_essay = '/content/startupideas.txt'

with open(paul_graham_essay, 'r', encoding='utf-8') as file:
    essay = file.read()

llm.get_num_tokens(essay)
# -> Token indices sequence length is longer than the specified maximum sequence length for this model (9568 > 1024). Running this sequence through the model will result in indexing errors
# -> 9568

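# Note: chunk_size and chunk_overlap below are measured in characters (the splitter's default length function), not in model tokens.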
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "."], chunk_size=3000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])

summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce', token_max=1000)
output = summary_chain.invoke(docs)

Error Message and Stack Trace (if applicable)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
(the line above is repeated once per generation call)
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-13-e791bf376fd5>](https://localhost:8080/#) in <cell line: 1>()
----> 1 output = summary_chain.invoke(docs)

6 frames
/usr/local/lib/python3.10/dist-packages/langchain/chains/combine_documents/reduce.py in split_list_of_docs(docs, length_func, token_max, **kwargs)
     48         if _num_tokens > token_max:
     49             if len(_sub_result_docs) == 1:
---> 50                 raise ValueError(
     51                     "A single document was longer than the context length,"
     52                     " we cannot handle this."

ValueError: A single document was longer than the context length, we cannot handle this.

Description

I am attempting to generate summaries for long documents using the LangChain library with the Llama-3 model, but I encounter a ValueError stating that "A single document was longer than the context length, we cannot handle this." The error occurs even after splitting the document into smaller chunks.

Expected Behavior

I expect the summary chain to generate concise summaries for each document chunk without exceeding the token limit.

Actual Behavior

The process fails with the ValueError shown above, raised from split_list_of_docs during the reduce step, which suggests that the document chunks (or the intermediate map results) still exceed the token_max configured for the summary chain.
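
To illustrate, this is the kind of check I can run (a minimal sketch, assuming the tokenizer, docs, and token_max=1000 from the example code above): count each chunk with the Llama-3 tokenizer and compare it against the limit enforced in split_list_of_docs.

token_max = 1000  # same value passed to load_summarize_chain above
for i, doc in enumerate(docs):
    # Count tokens the way the Llama-3 tokenizer sees the chunk text.
    n_tokens = len(tokenizer.encode(doc.page_content))
    status = "OVER" if n_tokens > token_max else "ok"
    print(f"chunk {i}: {n_tokens} tokens (limit {token_max}) {status}")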

Possible Solution

I suspect this is related to how the RecursiveCharacterTextSplitter handles chunking: chunk_size=3000 is measured in characters (the splitter's default length function), not in model tokens, so the character-based chunks don't line up with the token_max=1000 limit used by the chain, and I'm not sure how to adjust it correctly so that every chunk stays within the acceptable token limit.
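
One adjustment I have in mind (an untested sketch, not a confirmed fix, with placeholder chunk sizes I picked myself) is to build the splitter from the Hugging Face tokenizer so that chunk_size and chunk_overlap are counted in Llama-3 tokens rather than characters, keeping chunk_size below the token_max given to the chain:

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    separators=["\n\n", "\n", "."],
    chunk_size=800,     # now measured in tokens; illustrative value kept below token_max=1000
    chunk_overlap=100,  # tokens; illustrative value
)
docs = text_splitter.create_documents([essay])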

Additional Context

I tried reducing the chunk_size and adjusting the chunk_overlap, but these attempts did not resolve the issue. Any guidance on how to ensure that the document chunks conform to the specified token limits would be greatly appreciated.
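
Another thing I am considering (the value here is an assumption on my side, not verified): token_max=1000 is well below the max_length=3000 configured on the pipeline, so aligning token_max with what the pipeline can actually handle might also be relevant:

summary_chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    token_max=3000,  # illustrative; matches the pipeline's max_length above
)
output = summary_chain.invoke(docs)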

System Info

Environment

  • LangChain version: 0.1.17
  • Transformers version: 4.40.1
  • Accelerate version: 0.30.0
  • Torch version: 2.2.1+cu121
  • Platform: Google Colab (Nvidia A100 GPU)