Why does it divide up into overlapping chunks instead of sentences? #58

endolith · 2023-07-15T23:51:23Z

I was under the impression that embedding models were trained on sentences. Does it matter?

https://www.sbert.net/docs/quickstart.html#comparing-sentence-similarities

freedmand (Maintainer) · 2023-07-17T02:31:34Z

Semantra splits text into fixed-size chunks instead of sentences so that every chunk is the same length. In my experience, results are better when all chunks are exactly the same size, because embeddings are sometimes affected by input length (e.g. embeddings of short sentences match more closely with other short sentences, even when a longer sentence is more relevant).

It doesn't matter much if a chunk spans several sentences or starts in the middle of one; on average the search will still find the relevant parts.
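
To make the idea of equal-size overlapping chunks concrete, here is a minimal sketch. It is not Semantra's actual code: the function name `chunk_tokens`, the `size` and `overlap` values, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for illustration.

```python
def chunk_tokens(tokens, size=8, overlap=2):
    """Split a token list into windows of `size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper for
    illustration only; a real pipeline would count tokenizer tokens.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: whitespace-separated words approximate tokens.
text = (
    "Short sentences tend to match other short sentences. "
    "Fixed-size windows avoid that bias, even though a window "
    "may start or end in the middle of a sentence."
)
for chunk in chunk_tokens(text.split(), size=8, overlap=2):
    print(" ".join(chunk))
```

Every window except possibly the last has exactly `size` tokens, so length differences cannot skew the similarity scores. The overlap means a passage cut at one window boundary still appears whole in a neighboring window.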
