New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#435: Add metadata support (read/write) to pinecone embedded store #955
base: main
Are you sure you want to change the base?
#435: Add metadata support (read/write) to pinecone embedded store #955
Conversation
@@ -34,6 +34,7 @@ void should_add_embedding_with_segment_with_metadata() { | |||
} | |||
|
|||
awaitUntilPersisted(); | |||
awaitUntilPersisted(embedding, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to add this function to be able to use Awaitility with pinecone client to wait for records to be retrievable.
Pinecone is eventually consistent, so records are/may not be available just after write
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! What do you think about changing existing awaitUntilPersisted()
instead of introducing new overloaded method? I guess the implementation can also be moved to the EmbeddingStoreIT
so that all implementations can benefit from using awaitility?
Map<String, Value> fields = struct.getFieldsMap(); | ||
|
||
for (Map.Entry<String, Value> entry : fields.entrySet()) { | ||
if (filterOutMetadataTextKey && isMetadataTextKey(entry.getKey())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when retreiving I am filtering out metadataTextKey from metadata, otherwise tests were failing.
I assumed metadataTextKey is only a technical thing to let us store original content in metadata, but it is not something that should be exposed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We store original embedded text under text_segment
(or whatever is defined in metadataTextKey
property) key in Pinecone metadata, and it should be returned back inside TextSegment.text()
.
The problem can happen in such case:
TextSegment textSegment = TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
Since metadata key matches key of the text, text will be overriden and no metadata when retrieveing it back:
TextSegment { text = "bye" metadata = {} }
One option is to prepend all metadata keys with metadata_
prefix and then removing this prefix when retreiving back, so this TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
will become:
text_segment -> "hello"
metadata_text_segment -> "bye"
in Pinecone's metadata.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rgrebski thank you a lot!
* @param apiKey The Pinecone API key. | ||
* @param environment The environment (e.g., "northamerica-northeast1-gcp"). | ||
* @param projectId The ID of the project (e.g., "19a129b"). This is <b>not</b> a project name. | ||
* The ID can be found in the Pinecone URL: <a href="https://app.pinecone.io/organizations/.../projects/">...</a>...:{projectId}/indexes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change required?
@@ -231,7 +246,7 @@ public Builder environment(String environment) { | |||
|
|||
/** | |||
* @param projectId The ID of the project (e.g., "19a129b"). This is <b>not</b> a project name. | |||
* The ID can be found in the Pinecone URL: https://app.pinecone.io/organizations/.../projects/...:{projectId}/indexes. | |||
* The ID can be found in the Pinecone URL: <a href="https://app.pinecone.io/organizations/.../projects/">...</a>...:{projectId}/indexes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@@ -34,6 +34,7 @@ void should_add_embedding_with_segment_with_metadata() { | |||
} | |||
|
|||
awaitUntilPersisted(); | |||
awaitUntilPersisted(embedding, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff! What do you think about changing existing awaitUntilPersisted()
instead of introducing new overloaded method? I guess the implementation can also be moved to the EmbeddingStoreIT
so that all implementations can benefit from using awaitility?
* @param metadataTextKey (Optional) The key to find the text in the metadata. If not provided, "text_segment" will be used. | ||
* @param afterUpsertAction | ||
*/ | ||
public PineconeEmbeddingStore(String apiKey, | ||
String environment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that environment
and projectId
are not used any more. Does Pinecone resolve them from api key now?
I think we should mark there params as @Deprecated
in the builder then, WDYT?
this.nameSpace = nameSpace == null ? DEFAULT_NAMESPACE : nameSpace; | ||
this.metadataTextKey = metadataTextKey == null ? DEFAULT_METADATA_TEXT_KEY : metadataTextKey; | ||
this.afterUpsertAction = afterUpsertAction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why afterUpsertAction
is needed? Seems to not be used anywhere, also no javadoc with explanation.
I would remove it if not necessary.
Map<String, Value> fields = struct.getFieldsMap(); | ||
|
||
for (Map.Entry<String, Value> entry : fields.entrySet()) { | ||
if (filterOutMetadataTextKey && isMetadataTextKey(entry.getKey())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We store original embedded text under text_segment
(or whatever is defined in metadataTextKey
property) key in Pinecone metadata, and it should be returned back inside TextSegment.text()
.
The problem can happen in such case:
TextSegment textSegment = TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
Since metadata key matches key of the text, text will be overriden and no metadata when retrieveing it back:
TextSegment { text = "bye" metadata = {} }
One option is to prepend all metadata keys with metadata_
prefix and then removing this prefix when retreiving back, so this TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
will become:
text_segment -> "hello"
metadata_text_segment -> "bye"
in Pinecone's metadata.
WDYT?
<artifactId>kotlin-stdlib</artifactId> | ||
</exclusion> | ||
</exclusions> | ||
<version>1.0.0</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for migrating to 1.0.0! 🤗
Context
Fixes #435
Change
I have added support for storing custom metadata and retreiving it.
I have added some comments to the PR describing more important changes for verification