Is there a plugin that can deduplicate nearly identical chunks? My dataset contains thousands of books, some of which were scanned and OCRed several times. Such books produce chunks that are almost identical in part; the only potential differences are the heads, the tails, and OCR errors. Is there a way to scan a Qdrant collection and find such similarities?
Answered by
generall
Dec 21, 2023
Each dataset requires individual calibration, so I don't think there is an out-of-the-box solution for this. However, you can try running a similarity search against the whole dataset, duplicates included, to generate a list of candidates for further deduplication.
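To make the candidate-generation idea concrete, here is a minimal sketch in plain Python. It is not a Qdrant feature or the official client API: the function names `cosine` and `duplicate_candidates` are hypothetical, the vectors are assumed to be already exported from the collection as an `id -> vector` mapping, and the `0.95` threshold is an arbitrary placeholder that would need calibrating per dataset. The brute-force pairwise loop only stands in for the per-point similarity search you would run against the collection itself.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def duplicate_candidates(vectors, threshold=0.95):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold.

    `vectors` is a hypothetical dict mapping chunk id -> embedding vector.
    The threshold is an assumption and must be tuned for each dataset.
    """
    pairs = []
    # Brute-force O(n^2) comparison; fine for a sample, too slow for a full
    # collection, where a nearest-neighbour search per point would replace it.
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Highest-scoring pairs first, so the most likely duplicates are reviewed first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    sample = {
        1: [1.0, 0.0],
        2: [0.99, 0.05],  # nearly the same direction as chunk 1
        3: [0.0, 1.0],
    }
    for id_a, id_b, score in duplicate_candidates(sample):
        print(id_a, id_b, round(score, 4))
```

The output is only a list of candidates. Since embeddings can score high for chunks that are merely on the same topic, a second pass comparing the actual text of each pair (for example, an edit-distance or shingle-overlap check) is probably needed before deleting anything.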
Answer selected by
dimus