[BUG] FileSystemDocumentLoader#loadDocuments has a potential issue when using ApacheTikaDocumentParser #1026

Closed
vprudnikov opened this issue Apr 26, 2024 · 2 comments
Labels: bug (Something isn't working), P1 (Highest priority)

Comments

@vprudnikov

Describe the bug
FileSystemDocumentLoader#loadDocuments returns incorrect content when a single ApacheTikaDocumentParser instance is used to load more than one file.
This line of code, taken from langchain4j-examples, doesn't work as expected:
List<Document> documents = loadDocuments(directoryPath, new ApacheTikaDocumentParser());

To Reproduce
Load at least two documents from a directory using the call above.

Expected behavior
Each document is loaded independently.

Current behavior

  • The first document contains only its own parsed content
  • The second document contains the content of the first document plus its own content
  • The third document contains the content of both previous documents plus its own content
  • ... and so on
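
The accumulation pattern described above can be sketched without any LangChain4j or Tika dependency. This is only an illustration of the symptom: AccumulatingParser and SharedParserDemo are hypothetical stand-ins, not the real ApacheTikaDocumentParser, and the internal buffer is an assumed cause chosen to reproduce the reported behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser that keeps an internal buffer between calls.
// It is NOT the real ApacheTikaDocumentParser; it only mimics the reported symptom.
class AccumulatingParser {
    private final StringBuilder buffer = new StringBuilder();

    String parse(String rawContent) {
        buffer.append(rawContent); // state survives from one call to the next
        return buffer.toString();
    }
}

class SharedParserDemo {
    // Parses every input with one shared parser instance, like the reported call does.
    static List<String> loadAllWithSharedParser(List<String> rawContents) {
        AccumulatingParser parser = new AccumulatingParser();
        List<String> documents = new ArrayList<>();
        for (String raw : rawContents) {
            documents.add(parser.parse(raw));
        }
        return documents;
    }

    public static void main(String[] args) {
        // Each "document" also carries the content of all earlier ones.
        System.out.println(loadAllWithSharedParser(List.of("A", "B", "C")));
        // prints [A, AB, ABC]
    }
}
```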

Please complete the following information:

  • LangChain4j version: 0.30.0
  • LLM(s) used: NA
  • Java version: 21
  • Spring Boot version (if applicable): NA

Additional context
It looks like ApacheTikaDocumentParser is a stateful object, so a single instance cannot be safely reused across documents.
A better solution would probably be to accept a Supplier, like so:
List<Document> documents = loadDocuments(directoryPath, () -> new ApacheTikaDocumentParser());
or a function of the file path:
List<Document> documents = loadDocuments(directoryPath, filePath -> new ApacheTikaDocumentParser());
That way, a fresh parser instance is provided for each document.
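
A minimal sketch of the Supplier idea, again with hypothetical stand-ins (TextParser, BufferingParser, and loadAll are illustration-only names, not LangChain4j APIs). It is not the implementation of the actual fix; it only shows why asking for a parser per document avoids the leak, assuming the parser holds state between calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class SupplierLoaderSketch {
    // Hypothetical parser abstraction, used only for this illustration.
    interface TextParser {
        String parse(String raw);
    }

    // Hypothetical stateful parser: it remembers everything it has parsed so far.
    static final class BufferingParser implements TextParser {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public String parse(String raw) {
            buffer.append(raw);
            return buffer.toString();
        }
    }

    // Asks the supplier for a parser once per document instead of reusing one.
    static List<String> loadAll(List<String> rawContents, Supplier<TextParser> parserSupplier) {
        List<String> documents = new ArrayList<>();
        for (String raw : rawContents) {
            TextParser parser = parserSupplier.get();
            documents.add(parser.parse(raw));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("A", "B", "C");

        // Fresh instance per document: contents stay independent.
        System.out.println(loadAll(inputs, BufferingParser::new)); // prints [A, B, C]

        // A supplier that always returns the same instance reproduces the bug.
        TextParser shared = new BufferingParser();
        System.out.println(loadAll(inputs, () -> shared)); // prints [A, AB, ABC]
    }
}
```

Note that the supplier only helps if it really creates a new instance on each call; a supplier that hands back a cached parser behaves exactly like passing the parser directly.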

@vprudnikov vprudnikov added the bug Something isn't working label Apr 26, 2024
@langchain4j
Owner

@vprudnikov thank you so much for reporting!

@langchain4j langchain4j added the P1 Highest priority label Apr 26, 2024
KaisNeffati added a commit to KaisNeffati/langchain4j that referenced this issue Apr 28, 2024
KaisNeffati added a commit to KaisNeffati/langchain4j that referenced this issue Apr 29, 2024
KaisNeffati added a commit to KaisNeffati/langchain4j that referenced this issue May 3, 2024
langchain4j pushed a commit that referenced this issue May 6, 2024
…1031)

## Issue
#1026


## General checklist
- [X] There are no breaking changes
- [X] I have added unit and integration tests for my change
- [X] I have manually run all the unit and integration tests in the module I have added/changed, and they are all green
- [X] I have manually run all the unit and integration tests in the [core](https://github.com/langchain4j/langchain4j/tree/main/langchain4j-core) and [main](https://github.com/langchain4j/langchain4j/tree/main/langchain4j) modules, and they are all green
- [X] I have added/updated the [documentation](https://github.com/langchain4j/langchain4j/tree/main/docs/docs)
- [ ] I have added an example in the [examples repo](https://github.com/langchain4j/langchain4j-examples) (only for "big" features)
@langchain4j
Owner

Fixed by #1031
