[Bug]: NodeParser Previous and Next Node Relationships Cross Document Boundaries #13095

Closed
alineberry opened this issue Apr 24, 2024 · 5 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@alineberry
Contributor

Bug Description

NodeParser.get_nodes_from_documents assigns previous and next relationships to nodes without checking whether a node is the first or last node of its source document. As a result, a node at a document boundary can have a previous and/or next relationship to a node from a different source document.

As written, the method assumes that the documents passed to NodeParser.get_nodes_from_documents are related and arrive in some meaningful order, which does not seem to be the intuitive use case.

Code source

Proposed Solution:

Check that previous and next nodes share the same source node before creating the relationship. I would like to implement the change myself if that's ok.
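
To illustrate the proposed check, here is a minimal sketch that uses a hypothetical `Chunk` dataclass instead of the library's real node classes. `Chunk`, `link_within_documents`, and the field names are assumptions made for illustration; this is not the code in NodeParser or in the eventual PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a parsed node (not a llama_index class)."""
    id_: str
    source_id: str  # id of the document this chunk was split from
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def link_within_documents(chunks: List[Chunk]) -> List[Chunk]:
    """Set prev/next links only between neighbours that share a source document."""
    for left, right in zip(chunks, chunks[1:]):
        # Skip the pair when it straddles a document boundary.
        if left.source_id == right.source_id:
            left.next_id = right.id_
            right.prev_id = left.id_
    return chunks


chunks = link_within_documents([
    Chunk("a0", "doc-a"),
    Chunk("a1", "doc-a"),
    Chunk("b0", "doc-b"),
])
```

With this guard, the last chunk of `doc-a` keeps no next link and the first chunk of `doc-b` keeps no previous link, while links inside `doc-a` are unaffected.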

Version

0.10.31

Steps to Reproduce

This basic example demonstrates the issue.

from llama_index.core import Document
from llama_index.core.node_parser import TokenTextSplitter

docs = [Document(text=f'{i} the quick brown fox jumped over the lazy dog {i}') for i in range(10)]
id_to_docs = {d.id_: d for d in docs}

parser = TokenTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separator=" ",
)

nodes = parser.get_nodes_from_documents(docs)

id_to_node = {n.id_: n for n in nodes}

test_node = nodes[1]
print('node text:', test_node.text)
print('node source node id:', test_node.source_node.node_id)
next_node = id_to_node[test_node.next_node.node_id]
print('next node:', next_node.text)
print('next node source node id:', next_node.source_node.node_id)

Output:

node text: lazy dog 0
node source node id: ec209db0-109f-49c2-8e34-ea4833c228e2
next node: 1 the quick brown fox jumped over the
next node source node id: 897ca3fd-c74a-44e6-af94-44ba47b3f79f

Note that the node and its next node do not share the same source node.
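
The same check can be run over every node rather than just one. The helper below is a sketch, assuming each node exposes `id_`, `source_node.node_id`, and `next_node.node_id` as in the example above, and that `next_node` is `None` when no relationship is set. The function name `cross_document_links` and the `fake_node` stand-ins are hypothetical and exist only so the snippet runs without the library.

```python
from types import SimpleNamespace


def cross_document_links(nodes):
    """Return (node_id, next_id) pairs whose next node has a different source."""
    by_id = {n.id_: n for n in nodes}
    bad = []
    for n in nodes:
        nxt = n.next_node
        if nxt is None or nxt.node_id not in by_id:
            continue
        if by_id[nxt.node_id].source_node.node_id != n.source_node.node_id:
            bad.append((n.id_, nxt.node_id))
    return bad


def fake_node(id_, source_id, next_id=None):
    """Hypothetical stand-in that mimics the attributes used above."""
    return SimpleNamespace(
        id_=id_,
        source_node=SimpleNamespace(node_id=source_id),
        next_node=SimpleNamespace(node_id=next_id) if next_id else None,
    )


# Two chunks per document, linked across the boundary as in the report.
fake_nodes = [
    fake_node("n0", "doc-0", next_id="n1"),
    fake_node("n1", "doc-0", next_id="n2"),  # crosses into doc-1
    fake_node("n2", "doc-1", next_id="n3"),
    fake_node("n3", "doc-1"),
]
violations = cross_document_links(fake_nodes)
```

An empty result would indicate that no next relationship crosses a document boundary; previous relationships could be checked the same way.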

Relevant Logs/Tracebacks

No response

@alineberry added the bug and triage labels on Apr 24, 2024

dosubot bot commented Apr 24, 2024

Hey there, @alineberry! 👋 I'm Dosu, a bot here to lend a hand with bugs, answer your questions, or help you dive into contributing while we wait for a human maintainer to swing by. Excited to assist you today! I'll take a closer look at the issue you've reported and get back to you shortly with some insights. Stay tuned!

@logan-markewich
Collaborator

Yea good catch. Feel free to propose a PR :)

@alineberry
Contributor Author

Will do!


dosubot bot commented Apr 24, 2024

Great! If you need any guidance or have questions as you work on your PR, feel free to reach out. Happy coding!

@alineberry
Contributor Author

PR #13109 opened
