Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#435: Add metadata support (read/write) to pinecone embedded store #955

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rgrebski
Copy link

Context

Fixes #435

Change

I have added support for storing custom metadata and retreiving it.
I have added some comments to the PR describing more important changes for verification

@@ -34,6 +34,7 @@ void should_add_embedding_with_segment_with_metadata() {
}

awaitUntilPersisted();
awaitUntilPersisted(embedding, 1);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to add this function to be able to use Awaitility with pinecone client to wait for records to be retrievable.
Pinecone is eventually consistent, so records are/may not be available just after write

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! What do you think about changing existing awaitUntilPersisted() instead of introducing new overloaded method? I guess the implementation can also be moved to the EmbeddingStoreIT so that all implementations can benefit from using awaitility?

Map<String, Value> fields = struct.getFieldsMap();

for (Map.Entry<String, Value> entry : fields.entrySet()) {
if (filterOutMetadataTextKey && isMetadataTextKey(entry.getKey())) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when retreiving I am filtering out metadataTextKey from metadata, otherwise tests were failing.
I assumed metadataTextKey is only a technical thing to let us store original content in metadata, but it is not something that should be exposed

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We store original embedded text under text_segment (or whatever is defined in metadataTextKey property) key in Pinecone metadata, and it should be returned back inside TextSegment.text().

The problem can happen in such case:
TextSegment textSegment = TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
Since metadata key matches key of the text, text will be overriden and no metadata when retrieveing it back:
TextSegment { text = "bye" metadata = {} }

One option is to prepend all metadata keys with metadata_ prefix and then removing this prefix when retreiving back, so this TextSegment.from("hello", new Metadata().put("text_segment", "bye")) will become:

text_segment -> "hello"
metadata_text_segment -> "bye"

in Pinecone's metadata.

WDYT?

@langchain4j langchain4j added the P2 High priority label Apr 23, 2024
Copy link
Owner

@langchain4j langchain4j left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rgrebski thank you a lot!

* @param apiKey The Pinecone API key.
* @param environment The environment (e.g., "northamerica-northeast1-gcp").
* @param projectId The ID of the project (e.g., "19a129b"). This is <b>not</b> a project name.
* The ID can be found in the Pinecone URL: <a href="https://app.pinecone.io/organizations/.../projects/">...</a>...:{projectId}/indexes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change required?

@@ -231,7 +246,7 @@ public Builder environment(String environment) {

/**
* @param projectId The ID of the project (e.g., "19a129b"). This is <b>not</b> a project name.
* The ID can be found in the Pinecone URL: https://app.pinecone.io/organizations/.../projects/...:{projectId}/indexes.
* The ID can be found in the Pinecone URL: <a href="https://app.pinecone.io/organizations/.../projects/">...</a>...:{projectId}/indexes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@@ -34,6 +34,7 @@ void should_add_embedding_with_segment_with_metadata() {
}

awaitUntilPersisted();
awaitUntilPersisted(embedding, 1);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! What do you think about changing existing awaitUntilPersisted() instead of introducing new overloaded method? I guess the implementation can also be moved to the EmbeddingStoreIT so that all implementations can benefit from using awaitility?

* @param metadataTextKey (Optional) The key to find the text in the metadata. If not provided, "text_segment" will be used.
* @param afterUpsertAction
*/
public PineconeEmbeddingStore(String apiKey,
String environment,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that environment and projectId are not used any more. Does Pinecone resolve them from api key now?
I think we should mark there params as @Deprecated in the builder then, WDYT?

this.nameSpace = nameSpace == null ? DEFAULT_NAMESPACE : nameSpace;
this.metadataTextKey = metadataTextKey == null ? DEFAULT_METADATA_TEXT_KEY : metadataTextKey;
this.afterUpsertAction = afterUpsertAction;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why afterUpsertAction is needed? Seems to not be used anywhere, also no javadoc with explanation.
I would remove it if not necessary.

Map<String, Value> fields = struct.getFieldsMap();

for (Map.Entry<String, Value> entry : fields.entrySet()) {
if (filterOutMetadataTextKey && isMetadataTextKey(entry.getKey())) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We store original embedded text under text_segment (or whatever is defined in metadataTextKey property) key in Pinecone metadata, and it should be returned back inside TextSegment.text().

The problem can happen in such case:
TextSegment textSegment = TextSegment.from("hello", new Metadata().put("text_segment", "bye"))
Since metadata key matches key of the text, text will be overriden and no metadata when retrieveing it back:
TextSegment { text = "bye" metadata = {} }

One option is to prepend all metadata keys with metadata_ prefix and then removing this prefix when retreiving back, so this TextSegment.from("hello", new Metadata().put("text_segment", "bye")) will become:

text_segment -> "hello"
metadata_text_segment -> "bye"

in Pinecone's metadata.

WDYT?

<artifactId>kotlin-stdlib</artifactId>
</exclusion>
</exclusions>
<version>1.0.0</version>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for migrating to 1.0.0! 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 High priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Pinecone should support storing metadata
2 participants