To address the issue of the chat model being invoked twice, particularly when dealing with follow-up questions that involve the chain's memory, you can adjust the logic to bypass the initial invocation that condenses the chat history and follow-up question into a standalone question. This adjustment involves directly using the chat history and question for generating the context and synthesizing the answer, thus avoiding the unnecessary first call to the chat model. Here's a suggested approach:
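A minimal sketch of that approach, using a plain-Python counting stub in place of the real chat model so the call count is visible (all names here are illustrative assumptions, not LangChain APIs):

```python
class CountingChatModel:
    """Stub chat model that records how many times it is invoked."""
    def __init__(self):
        self.calls = 0

    def invoke(self, prompt: str) -> str:
        self.calls += 1
        return f"response to: {prompt}"

def answer_directly(model, chat_history, question, retrieve):
    """Answer a follow-up question without a condensation step.

    Retrieval runs on the raw follow-up question, and the chat history
    is passed straight into the answer prompt, so the chat model is
    invoked exactly once per turn.
    """
    context = retrieve(question)
    prompt = (
        f"History: {chat_history}\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    return model.invoke(prompt)  # the only model invocation this turn

model = CountingChatModel()
answer_directly(model, "hi / hello", "what about memory?", lambda q: "docs")
print(model.calls)  # 1
```

In a real LangChain pipeline the equivalent change is to drop the condense-question step from the chain composition and feed the chat-history variable directly into the answer prompt; the stub above only demonstrates the resulting single-call control flow.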
This approach eliminates the redundant invocation of the chat model for condensing purposes, ensuring that the model is called only once to generate the final response from the provided context and question. Implementing the fix requires adjusting your chain's logic so that no intermediate condensation step runs: by using the chat history and follow-up question directly, you avoid the second call to the chat model and handle operations involving the chain's memory more efficiently.
Example Code
Description
I am trying to implement the new way of creating a RAG chain with memory, since ConversationalRetrievalChain is deprecated. When invoked, the chain outputs the correct and expected answer.
However, when I ask something that requires some kind of operation involving the chain's memory, I get back the expected answer, but it is returned twice. Somehow the chat model is being called twice and I just can't figure out why.
@dosu suggested that the model is being invoked twice because of the logic of my implementation. I would like to bypass the chat-history condensation step and use the chat history directly for the follow-up question.
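For illustration, the double invocation can be reproduced schematically with a counting stub in place of the real chat model (names are hypothetical, not LangChain APIs): the condense-question step and the answer-synthesis step each invoke the model once per turn.

```python
class CountingChatModel:
    """Stub chat model that records how many times it is invoked."""
    def __init__(self):
        self.calls = 0

    def invoke(self, prompt: str) -> str:
        self.calls += 1
        return f"response to: {prompt}"

def condense_then_answer(model, chat_history, question, retrieve):
    """Schematic two-step pipeline, mirroring the condense-then-answer pattern."""
    # Call 1: rewrite the follow-up into a standalone question.
    standalone = model.invoke(f"Condense: {chat_history} | {question}")
    context = retrieve(standalone)
    # Call 2: synthesize the final answer from the retrieved context.
    return model.invoke(f"Answer using {context}: {standalone}")

model = CountingChatModel()
condense_then_answer(model, "hi / hello", "what about memory?", lambda q: "docs")
print(model.calls)  # 2 -- the model is invoked twice per turn
```

If streaming or logging is attached at the model level rather than the chain level, both invocations surface, which matches the answer appearing twice.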
System Info
python 3.11