This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) is creating chunks longer than the specified (1000) #46

Open
sirio2013 opened this issue May 28, 2023 · 1 comment

Comments

@sirio2013

Hello,

With this piece of code:

from langchain.document_loaders import TextLoader
loader = TextLoader('cleaned_catalogue.txt')
documents = loader.load()

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

I keep getting chunks longer than the specified chunk_size (1000).
Why?

@SDcodehub

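If you do not pass a separator, CharacterTextSplitter uses its default, separator: str = '\n\n'. It only ever splits on that separator, so any stretch of text that contains no blank line stays in one piece, even when it is longer than chunk_size.

You have to specify a separator that actually occurs in your text (for example separator='\n'). Otherwise, change the text splitter: RecursiveCharacterTextSplitter, which falls back to progressively smaller separators, will work better if you only want to split plain text.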
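To make the behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's actual implementation: the function names `split_on_separator` and `split_with_fallback` and the sample text are made up for illustration. The first function mimics "split only on one separator, then merge up to chunk_size"; the second mimics the fallback strategy of trying smaller separators.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Simplified sketch: split only on `separator`, then merge pieces
    back together as long as the merged chunk fits in chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut further, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_with_fallback(text, separators=("\n\n", "\n", " "), chunk_size=1000):
    """Simplified sketch: try each separator in turn on oversized chunks,
    and hard-cut by character count only as a last resort."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for chunk in split_on_separator(text, separators[0], chunk_size):
        if len(chunk) <= chunk_size:
            chunks.append(chunk)
        else:
            chunks.extend(split_with_fallback(chunk, separators[1:], chunk_size))
    return chunks


# Hypothetical catalogue-like text: single newlines only, so "\n\n" never matches.
sample = "\n".join("item %04d: some description" % i for i in range(100))

naive = split_on_separator(sample)    # one oversized chunk
better = split_with_fallback(sample)  # several chunks, each within chunk_size

print(len(naive), max(len(c) for c in naive))
print(len(better), max(len(c) for c in better))
```

In LangChain itself, the equivalent of the first fix would be passing separator='\n' to CharacterTextSplitter, assuming your file uses single newlines between entries.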
