Is there a way to deduplicate vectors if they came from very similar sources? #3268

dimus · 2023-12-21T19:23:13Z

Is there a plugin that allows deduplicating nearly identical chunks? My dataset contains thousands of books, some of which have been scanned and OCRed several times. Such books generate chunks that are in part almost identical; the only potential differences are heads, tails, and OCR errors. Is there a way to scan a Qdrant collection and find such similarities?

Answered by generall

generall · 2023-12-21T19:38:09Z

Maintainer

Each dataset requires individual calibration, so I don't think there is an out-of-the-box solution for this. However, you can try running a similarity search against the whole dataset, duplicates included, to generate a list of candidates for further deduplication.
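To make the idea of "candidates" concrete, here is a minimal sketch in plain Python. It is not Qdrant code: an exhaustive in-memory comparison stands in for the similarity search, and the names `cosine`, `candidate_pairs`, the sample `chunks`, and the `0.98` threshold are all assumptions made for illustration. The threshold is exactly the part that would need calibrating per dataset.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(vectors, threshold):
    """Return (id_a, id_b, score) for every pair scoring at or above threshold.

    `vectors` maps a chunk id to its embedding. The result is sorted with the
    most similar pair first, so a reviewer can start with the likeliest duplicates.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Hypothetical embeddings: two scans of the same page and one unrelated chunk.
chunks = {
    "scan_a": [0.90, 0.10, 0.00],
    "scan_b": [0.89, 0.12, 0.01],
    "other": [0.00, 0.20, 0.95],
}

pairs = candidate_pairs(chunks, threshold=0.98)
for id_a, id_b, score in pairs:
    print(f"{id_a} ~ {id_b}: {score:.4f}")
```

The output is only a list of candidates. Whether a pair is a real duplicate still has to be decided afterwards, for example by comparing the chunk texts.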

2 replies

dimus Dec 21, 2023
Author

Do you mean creating a crawler that compares everything to everything, or is there a way to query globally for unusually close cosine similarity?

generall Dec 21, 2023
Maintainer

I am not sure about the crawler part, but if you run a search for each vector in the collection, it would take about as long as building the index for the collection.
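A rough sketch of that "one search per vector" loop, in plain Python. Everything here is an assumption for illustration: `collect_candidates`, `get_vector`, and `search` are hypothetical names, and `demo_search` is an in-memory stand-in for a real nearest-neighbour query. Against a Qdrant collection the stand-in would be replaced by the client's own calls for iterating over points and searching, whose exact names and parameters should be checked against the client version in use.

```python
def collect_candidates(point_ids, get_vector, search, limit=5, threshold=0.95):
    """Run one search per point and keep close, non-self hits as candidate pairs.

    `search(vector, limit)` must return (point_id, score) tuples, best first.
    Each pair is reported once, even though both members will find each other.
    """
    seen = set()
    candidates = []
    for point_id in point_ids:
        for hit_id, score in search(get_vector(point_id), limit):
            if hit_id == point_id or score < threshold:
                continue  # skip the point itself and anything not close enough
            key = tuple(sorted((point_id, hit_id)))
            if key not in seen:
                seen.add(key)
                candidates.append((key[0], key[1], score))
    return candidates


# In-memory stand-in for a collection; vectors are roughly unit length,
# so the dot product approximates cosine similarity.
store = {1: [1.0, 0.0], 2: [0.999, 0.045], 3: [0.0, 1.0]}
stats = {"searches": 0}


def demo_search(vector, limit):
    """Hypothetical search: score every stored vector and return the top `limit`."""
    stats["searches"] += 1
    scored = [
        (pid, sum(x * y for x, y in zip(vector, stored)))
        for pid, stored in store.items()
    ]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:limit]


results = collect_candidates(list(store), store.__getitem__, demo_search, limit=3)
print(results, "searches issued:", stats["searches"])
```

The loop issues exactly one search per point, so the number of queries grows linearly with the collection size. The all-pairs comparison happens only inside the stand-in; with an indexed collection each query would be an approximate nearest-neighbour lookup instead.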
