Replies: 1 comment 2 replies
-
Hey @aiwalter! Great to see you diving into another challenge 🚀 Based on your requirements, I would suggest implementing a caching mechanism or a hash map to keep track of the chunks that have already been used. This approach would allow for constant time checks to see if a chunk has been used before, significantly reducing the overhead associated with searching through and modifying a list of chunks each time a new chunk is processed. In the context of the Here's a conceptual example of how this could be implemented in the class CitationQueryEngine(BaseQueryEngine):
def __init__(self, ...):
...
self.used_chunks_cache = set() # Initialize an empty set to store used chunk identifiers
def _create_citation_nodes(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
new_nodes: List[NodeWithScore] = []
for node in nodes:
text_chunks = self.text_splitter.split_text(
node.node.get_content(metadata_mode=self._metadata_mode)
)
for text_chunk in text_chunks:
chunk_hash = hash(text_chunk) # Create a hash of the chunk for identification
if chunk_hash not in self.used_chunks_cache: # Check if the chunk has been used
self.used_chunks_cache.add(chunk_hash) # Mark the chunk as used
text = f"Source {len(new_nodes)+1}:\n{text_chunk}\n"
new_node = NodeWithScore(
node=TextNode.parse_obj(node.node), score=node.score
)
new_node.node.text = text
new_nodes.append(new_node)
return new_nodes This example uses a set to store the hashes of used chunks, providing constant time complexity for add and check operations. This approach is significantly more efficient than removing items from a list, especially as the number of processed chunks grows. You can find more details about the Sources
|
Beta Was this translation helpful? Give feedback.
-
I am using a
CitationQueryEngine
and this is called several times in sequence to complete a task. For each task I want to make sure that a given chunk is only used/cited once. My idea was to remove the chunk from the index as soon as it is used by the query engine to avoid it will be used again. Is there a smarter way to track which chunks have been used and avoid using again? How could I implement this? @dosubotBeta Was this translation helpful? Give feedback.
All reactions