[BUG] FileSystemDocumentLoader#loadDocuments has a potential issue when using ApacheTikaDocumentParser #1026
@vprudnikov thank you so much for reporting!
langchain4j pushed a commit that referenced this issue on May 6, 2024.
Fixed by #1031
Describe the bug
FileSystemDocumentLoader#loadDocuments has a potential issue when using ApacheTikaDocumentParser.
As shown in the langchain4j-examples, this line of code doesn't work as expected:
List<Document> documents = loadDocuments(directoryPath, new ApacheTikaDocumentParser());
To Reproduce
Try to load at least 2 documents using the provided example.
Expected behavior
Each document is loaded independently.
Current behavior
Additional context
It looks like the ApacheTikaDocumentParser is a stateful object. That's why it cannot be reused across all documents.
A better solution would probably be to accept a Supplier, like so:
List<Document> documents = loadDocuments(directoryPath, () -> new ApacheTikaDocumentParser());
or
List<Document> documents = loadDocuments(directoryPath, filePath -> new ApacheTikaDocumentParser());
As a result, a fresh instance of the parser would be provided for each document.
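To illustrate why a supplier would help, here is a minimal, self-contained sketch. It does not use langchain4j or Tika: Parser, StatefulParser, loadWithSharedParser and loadWithSupplier are hypothetical names made up for this illustration, and the buffer-based statefulness is only an assumption, not a description of how ApacheTikaDocumentParser actually keeps state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ParserSupplierSketch {

    // Hypothetical stand-in for a document parser; NOT the langchain4j interface.
    interface Parser {
        String parse(String rawContent);
    }

    // Toy stateful parser (assumption): it keeps a buffer between calls,
    // so reusing one instance leaks earlier input into later results.
    static class StatefulParser implements Parser {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public String parse(String rawContent) {
            buffer.append(rawContent);
            return buffer.toString();
        }
    }

    // Reuses a single parser instance for every input.
    static List<String> loadWithSharedParser(List<String> rawContents, Parser sharedParser) {
        List<String> documents = new ArrayList<>();
        for (String raw : rawContents) {
            documents.add(sharedParser.parse(raw));
        }
        return documents;
    }

    // Asks the supplier for a fresh parser instance for each input.
    static List<String> loadWithSupplier(List<String> rawContents, Supplier<Parser> parserSupplier) {
        List<String> documents = new ArrayList<>();
        for (String raw : rawContents) {
            documents.add(parserSupplier.get().parse(raw));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b");

        // Shared instance: the second document also contains the first one.
        System.out.println(loadWithSharedParser(input, new StatefulParser())); // prints [a, ab]

        // Fresh instance per document: each document is parsed independently.
        System.out.println(loadWithSupplier(input, StatefulParser::new)); // prints [a, b]
    }
}
```

With a shared instance the second result is polluted by the first, while the supplier variant parses each input independently, which matches the expected behavior described above.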