In EmbeddingStore: we must choose between "id, embedding" OR "embedding, embedded" - why not all 3 params ? #212
Replies: 3 comments 1 reply
-
@gypsy-gnomad the rationale is that if you already have your data somewhere (e.g. SQL DB) and want to embed it and store in a separate place (EmbeddingStore) then you don't need to store duplicated data again but only existing ID from existing storage (to correlate) and created Embedding. If you don't have your data stored anywhere yet, then you can store it in EmbeddingStore along with the Embedding, but then you most probably don't care about ID (since you don't need to correlate it, so it can be random. Of corse there can be exceptions form this logic, but we tried to keep the API of EmbeddingStore as simple as possible and not overload all possible method combinations. Can you explain why do you want to store all 3 of them? |
Beta Was this translation helpful? Give feedback.
-
Thanks for the prompt reply.
I'm all for simplicity - respectfully, I'd argue in this case that it
complicates the system architecture and limits developer choice.
In our use cases: during prompt engineering we inject the text and for
search we use the ID for lookup in the DB/KB.
As implemented, it seems we have to store / manage the embeddings twice. Is
this the recommended approach ?
It's a great service you're all doing - but I'd vote the 3 parameter method
is also exposed whilst preserving the existing "convenience" methods.
…On Mon, 9 Oct 2023 at 17:47, LangChain4j ***@***.***> wrote:
@gypsy-gnomad <https://github.com/gypsy-gnomad> the rationale is that if
you already have your data somewhere (e.g. SQL DB) and want to embed it and
store in a separate place (EmbeddingStore) then you don't need to store
duplicated data again but only existing ID from existing storage (to
correlate) and created Embedding. If you don't have your data stored
anywhere yet, then you can store it in EmbeddingStore along with the
Embedding, but then you most probably don't care about ID (since you don't
need to correlate it, so it can be random. Of corse there can be exceptions
form this logic, but we tried to keep the API of EmbeddingStore as simple
as possible and not overload all possible method combinations. Can you
explain why do you want to store all 3 of them?
—
Reply to this email directly, view it on GitHub
<#212 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A3QGZ23WNE6BY3QATXC4UPLX6OMYPAVCNFSM6AAAAAA5YFDQ2OVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM3TEMRXGE4DC>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Awesome. Thank you very much. -GG-On 13 Oct 2023, at 19:36, LangChain4j ***@***.***> wrote:
@gypsy-gnomad no, you should not store embedding twice. We will add an extra method with (ids, embeddings and embedded) to accomodate for this use case.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
We're using an EmbeddingStore, we have/want to specify all three parameters (Id, embedding, embedded).
We need to store the TextSegment(embedded) we can only use a random ID, or if we specify the ID we can't store the TextSegment.
Is this an omission or is there some rationale we're missing here? Wouldn't good practice be to provide all, if we have them - but sensible defaults to provide a user-friendly API ?
We've be happy to fix but in
MilvusEmbeddingStore
we have a private (not even protected?) method so we can't:private void addInternal(String id, Embedding embedding, TextSegment textSegment)
Same in the Chroma, Vespa and Weaviate stores - we have similar private methods and can't see any reason for their limited scope.
We could work around it, by saving the ID in the MetaData - but the Milvus implementation doesn't save metadata (yet?).
Beta Was this translation helpful? Give feedback.
All reactions