
feat(rust): openai embedding function #1275

Status: Open — wants to merge 13 commits into base: main

Conversation

@universalmind303 (Contributor) commented May 7, 2024:

Part of #994.

Adds the ability to use the OpenAI embedding functions.

The example can be run with:

> export OPENAI_API_KEY="sk-..."
> cargo run --example openai --features=openai

which should output:

Closest match: Winter Parka

@westonpace (Contributor) left a comment:

This is pretty cool! A few initial questions.

.execute()
.await?;

// there is no equivalent to '.search(<query>)' yet
@westonpace (Contributor):

Is the plan to have an equivalent at some point in the future?

@universalmind303 (Contributor, Author):

IIRC, there were plans to add similar functionality to the Rust SDK.

rust/lancedb/src/embeddings/openai.rs (outdated; resolved)
Comment on lines +169 to +170
// We can't use the FixedSizeListBuilder here because it always adds a null bitmap
// and we want to explicitly work with non-nullable arrays.
@westonpace (Contributor):

I see that OpenAI only supports non-nullable, but will this builder hide the nulls by dropping the nullability (so we don't get an error when we should get one)?

@universalmind303 (Contributor, Author):

No, we assert that the input array is non-null, so we should never hit a scenario where the nullability is silently dropped.

_ => unreachable!("This should not happen. We already checked the data type."),
};

let client = Client::with_config(creds);
@westonpace (Contributor):

Do we want to cache / pool this client at some point in the future? Although maybe that's more work than it's worth: given the time it takes to compute embeddings, connection establishment might be a small fraction anyway.

@universalmind303 (Contributor, Author):

That was my initial thought. We can definitely add this as an optimization later on if we need to.

Comment on lines +246 to +247
task::block_in_place(move || {
Handle::current().block_on(async {
@westonpace (Contributor):

Should compute_inner be an async function?

@universalmind303 (Contributor, Author):

I think we'd still need to spawn an async task here regardless, unless we made EmbeddingFunction::compute_source_embeddings async, and we can't do that without some significant refactoring.

We use RecordBatchReader, which isn't async; we'd likely have to refactor every use of it to an async equivalent.

However, if you'd prefer the wrapping to happen outside of compute_inner, that's totally fine and I can make that change.

@westonpace (Contributor):

We don't have to worry about it right now. I am thinking we might eventually want compute_source_embeddings to be async though. The top-level add and query methods are already async so it seems odd to go async -> sync -> async.

Also, we might want to run embeddings in parallel at some point too, although that might be best left to the embedding function, since different embeddings might have different preferences on batch size / parallel execution.

@universalmind303 (Contributor, Author):

> We don't have to worry about it right now. I am thinking we might eventually want compute_source_embeddings to be async though. The top-level add and query methods are already async so it seems odd to go async -> sync -> async.

Makes sense. Will update!


> Also, we might want to run embeddings in parallel at some point too. Although that might be best left to embedding function since different embeddings might have different preferences on batch size / parallel execution.

I made a similar comment on an earlier PR. #1259 (comment)

@github-actions bot added labels "enhancement" (New feature or request) and "Rust" (Rust related issues) on May 14, 2024.