-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keepSeparator doesn't work with RecursiveCharacterTextSplitter #5151
Comments
It looks like the issue you're encountering with the To address this, you can modify the part of the code where the separator is managed. The correct approach would be to include the separator in the merged text only when Here's a brief outline of what the corrected code might look like: mergeSplits(splits: string[], separator: string): string[] {
// Adjust the logic here to ensure separator is included based on the keepSeparator flag
const doc = this.joinDocs(currentDoc, this.keepSeparator ? separator : "");
// Further logic to handle the splits
} This adjustment ensures that the separator is correctly preserved in the output when Given this insight, I recommend applying this fix to your local copy of the LangChain.js library and testing to see if it resolves the problem. If successful, consider contributing this fix back to the LangChain.js repository to help others who might face the same issue. Sources
|
In simple terms, the work of the Splitter can be summarized in the following steps:
langchainjs/libs/langchain-textsplitters/src/text_splitter.ts Lines 179 to 182 in a8f7f5c
The cause of this issue is that when merging to generate pageContent, the method String.prototype.trim is used to filter out spaces at the beginning and end of the string. However, \n is also filtered out by trim. Therefore, when there is a split with a \n at the beginning, the \n at the beginning and end will be removed. |
However, based on the following reasons, I believe this issue may not require modification:
So I closed my PR. If you have any better suggestions, we can discuss them. |
I found this to be a problem, this is my workaround: // Fixed version of RecursiveCharacterTextSplitter which respects keepSeparator
class FixedRecursiveCharacterTextSplitter extends RecursiveCharacterTextSplitter {
async mergeSplits(splits: string[], separator: string): Promise<string[]> {
const docs: string[] = [];
const currentDoc: string[] = [];
let total = 0;
for (const d of splits) {
const _len = await this.lengthFunction(d);
if (
total + _len + currentDoc.length * separator.length >
this.chunkSize
) {
if (total > this.chunkSize) {
console.warn(
`Created a chunk of size ${total}, +
which is longer than the specified ${this.chunkSize}`,
);
}
if (currentDoc.length > 0) {
const doc = this._joinDocs(currentDoc, separator);
if (doc !== null) {
docs.push(doc);
}
// Keep on popping if:
// - we have a larger chunk than in the chunk overlap
// - or if we still have any chunks and the length is long
while (
total > this.chunkOverlap ||
(total + _len + currentDoc.length * separator.length >
this.chunkSize &&
total > 0)
) {
total -= await this.lengthFunction(currentDoc[0]);
currentDoc.shift();
}
}
}
currentDoc.push(d);
total += _len;
}
const doc = this._joinDocs(currentDoc, separator);
if (doc !== null) {
docs.push(doc);
}
return docs;
}
// Do not trim if keepSeparator is enabled, or whitespace separators may be erased
private _joinDocs(docs: string[], separator: string): string | null {
const text = this.keepSeparator
? docs.join(separator)
: docs.join(separator).trim();
return text === "" ? null : text;
}
} If this is an acceptable solution, I'll make a PR. |
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
I get these chunks:
You can see that second chunks starts with
CHUNK HEADER(cont'd) Nunc
, but I would expect it to start withCHUNK HEADER(cont'd) Nunc
OR the first chunk should end with\n
.The
\n
betweenvulputate nec.\nNunc euismod
is lost.System Info
[email protected]
macOS
node 18.20.2
The text was updated successfully, but these errors were encountered: