keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

guidev · 2024-04-19T08:54:58Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain.js documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain.js rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

        
const text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.\nNunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.\nDonec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.\nPellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.";

const document = new Document({pageContent: text});

const textSplitter = new RecursiveCharacterTextSplitter({chunkSize: 200, chunkOverlap: 40, keepSeparator: true});

const splitDocs = await textSplitter.splitDocuments([document], {
   chunkHeader: "CHUNK HEADER",
   appendChunkOverlapHeader: true,
});

console.log(JSON.stringify(splitDocs, null, 2));

Error Message and Stack Trace (if applicable)

No response

Description

I get these chunks:

You can see that second chunks starts with CHUNK HEADER(cont'd) Nunc, but I would expect it to start with CHUNK HEADER(cont'd) Nunc OR the first chunk should end with \n.

The \n between vulputate nec.\nNunc euismod is lost.

[
   {
      "pageContent":"CHUNK HEADERLorem ipsum dolor sit amet, consectetur adipiscing elit.\nQuisque placerat nibh ex, eleifend varius dolor vulputate nec.",
      "metadata":{
         "loc":{
            "lines":{
               "from":1,
               "to":2
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Nunc euismod ligula in leo tempus, vel lobortis leo accumsan. Phasellus elementum, nisi id tincidunt imperdiet, felis massa luctus diam, id scelerisque nisi mi et libero.",
      "metadata":{
         "loc":{
            "lines":{
               "from":3,
               "to":3
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Donec pulvinar turpis eu ipsum ullamcorper scelerisque.\nMauris non nibh nisi. Cras urna nulla, maximus sed mattis vel, gravida vel sapien. Aenean in diam a libero ultrices elementum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":4,
               "to":5
            }
         }
      }
   },
   {
      "pageContent":"CHUNK HEADER(cont'd) Pellentesque ac ex nunc. Praesent nunc magna, viverra at urna vitae, aliquam placerat libero. Aenean vel turpis porta, semper velit eget, hendrerit ipsum.",
      "metadata":{
         "loc":{
            "lines":{
               "from":6,
               "to":6
            }
         }
      }
   }
]

System Info

[email protected]
macOS
node 18.20.2

The text was updated successfully, but these errors were encountered:

dosubot · 2024-04-19T08:55:14Z

It looks like the issue you're encountering with the keepSeparator option not working as expected in the RecursiveCharacterTextSplitter is due to how separators are handled during the text splitting process. Specifically, when keepSeparator is set to true, the implementation should ensure that the separator (\n in your case) is preserved between chunks. However, it seems like the current logic mistakenly omits the separator from the final output.

To address this, you can modify the part of the code where the separator is managed. The correct approach would be to include the separator in the merged text only when keepSeparator is true. This involves adjusting the logic in the mergeSplits method or similar, where the decision to append the separator to the text is made.

Here's a brief outline of what the corrected code might look like:

mergeSplits(splits: string[], separator: string): string[] {
    // Adjust the logic here to ensure separator is included based on the keepSeparator flag
    const doc = this.joinDocs(currentDoc, this.keepSeparator ? separator : "");
    // Further logic to handle the splits
}

This adjustment ensures that the separator is correctly preserved in the output when keepSeparator is true, addressing the issue you've observed.

Given this insight, I recommend applying this fix to your local copy of the LangChain.js library and testing to see if it resolves the problem. If successful, consider contributing this fix back to the LangChain.js repository to help others who might face the same issue.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

jeasonnow · 2024-04-25T08:47:45Z

In simple terms, the work of the Splitter can be summarized in the following steps:

Perform text segmentation, retaining the separating symbols if the keepSeperator flag is set, such as \n.
Iterate through the segmented strings, and if the length is less than the set chunkSize, merge them sequentially.
After merging, iterate through each merged string to trim it, removing any leading or trailing whitespace characters.
Generate each Chunk based on the merged result, attaching a Chunk Header to each.

langchainjs/libs/langchain-textsplitters/src/text_splitter.ts

Lines 179 to 182 in a8f7f5c

 private joinDocs(docs: string[], separator: string): string | null { 

 const text = docs.join(separator).trim(); 

 return text === "" ? null : text; 

 }

The cause of this issue is that when merging to generate pageContent, the method String.prototype.trim is used to filter out spaces at the beginning and end of the string. However, \n is also filtered out by trim. Therefore, when there is a split with a \n at the beginning, the \n at the beginning and end will be removed.

@guidev

jeasonnow · 2024-04-25T10:50:21Z

However, based on the following reasons, I believe this issue may not require modification:

Leading and trailing whitespace characters do not affect the semantics of the text split out.
If excessive special whitespace characters (\n) appear at the beginning and end without filtering, it may generate more chunks, leading to additional token consumption.

So I closed my PR. If you have any better suggestions, we can discuss them.

wcummings · 2024-05-02T23:37:59Z

I found this to be a problem, this is my workaround:

// Fixed version of RecursiveCharacterTextSplitter which respects keepSeparator
class FixedRecursiveCharacterTextSplitter extends RecursiveCharacterTextSplitter {
  async mergeSplits(splits: string[], separator: string): Promise<string[]> {
    const docs: string[] = [];
    const currentDoc: string[] = [];
    let total = 0;
    for (const d of splits) {
      const _len = await this.lengthFunction(d);
      if (
        total + _len + currentDoc.length * separator.length >
        this.chunkSize
      ) {
        if (total > this.chunkSize) {
          console.warn(
            `Created a chunk of size ${total}, +
which is longer than the specified ${this.chunkSize}`,
          );
        }
        if (currentDoc.length > 0) {
          const doc = this._joinDocs(currentDoc, separator);
          if (doc !== null) {
            docs.push(doc);
          }
          // Keep on popping if:
          // - we have a larger chunk than in the chunk overlap
          // - or if we still have any chunks and the length is long
          while (
            total > this.chunkOverlap ||
            (total + _len + currentDoc.length * separator.length >
              this.chunkSize &&
              total > 0)
          ) {
            total -= await this.lengthFunction(currentDoc[0]);
            currentDoc.shift();
          }
        }
      }
      currentDoc.push(d);
      total += _len;
    }
    const doc = this._joinDocs(currentDoc, separator);
    if (doc !== null) {
      docs.push(doc);
    }
    return docs;
  }

  // Do not trim if keepSeparator is enabled, or whitespace separators may be erased
  private _joinDocs(docs: string[], separator: string): string | null {
    const text = this.keepSeparator
      ? docs.join(separator)
      : docs.join(separator).trim();
    return text === "" ? null : text;
  }
}

If this is an acceptable solution, I'll make a PR.

dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Apr 19, 2024

jeasonnow added a commit to jeasonnow/langchainjs that referenced this issue Apr 25, 2024

fix: langchain-ai#5151 Reduce the content that needs trimming

59a5ede

jeasonnow added a commit to jeasonnow/langchainjs that referenced this issue Apr 25, 2024

fix: langchain-ai#5151 Reduce the content that needs trimming

aca8193

jeasonnow mentioned this issue Apr 25, 2024

fix: #5151 Reduce the content that needs trimming #5211

Closed

jeasonnow added a commit to jeasonnow/langchainjs that referenced this issue Apr 25, 2024

fix: langchain-ai#5151 Reduce the content that needs trimming

433f102

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

guidev commented Apr 19, 2024 •

edited

dosubot bot commented Apr 19, 2024 •

edited

About Dosu

jeasonnow commented Apr 25, 2024 •

edited

jeasonnow commented Apr 25, 2024 •

edited

wcummings commented May 2, 2024

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151

Comments

guidev commented Apr 19, 2024 • edited

Checked other resources

Example Code

Error Message and Stack Trace (if applicable)

Description

System Info

dosubot bot commented Apr 19, 2024 • edited

Sources

About Dosu

jeasonnow commented Apr 25, 2024 • edited

jeasonnow commented Apr 25, 2024 • edited

wcummings commented May 2, 2024

guidev commented Apr 19, 2024 •

edited

dosubot bot commented Apr 19, 2024 •

edited

jeasonnow commented Apr 25, 2024 •

edited

jeasonnow commented Apr 25, 2024 •

edited