
Vector search reference #2824

Merged · 7 commits · May 28, 2024

Conversation

guimachiavelli
Member

This PR adds an initial base reference for:

  • hybrid and vector search parameters
  • embedders index setting

@meili-bot
Collaborator

How to see the preview of this PR?

⚠️ Private link, only accessible to Meilisearch employees.

Go to this URL: https://website-git-deploy-preview-mei-16-meili.vercel.app/docs/branch:ai-search-reference-parameter

Credentials to access the page are in the company's password manager as "Docs deploy preview".

Contributor

@dureuill dureuill left a comment


Thank you for putting this reference together, that's a lot of work ☀️ There are some missing pieces; see the inline comments.

@guimachiavelli
Member Author

Thanks for the review, @dureuill!

I have updated the embedder documentation according to your feedback. I did my best for pathToEmbeddings and embeddingObject, but usage is quite complex because behaviour changes significantly depending on the value of inputType.

Also, I'm not convinced about the documentation for the query field. Could you give me a bit more context on what it's supposed to do and/or expected use cases?

@dureuill
Contributor

I did my best for pathToEmbeddings and embeddingObject, but usage is quite complex because behaviour changes significantly depending on the value of inputType.

I agree! I'm thinking of a better way of achieving this; here's what I've got:

  1. Make it so that inputType only governs whether an array or single text is sent in the query, as its name indicates.
  2. Make it so that pathToEmbeddings can be null.
  3. When pathToEmbeddings is null, then Meilisearch expects a single embedding in the response, and will look it up at the path described by embeddingObject in the response
  4. When pathToEmbeddings is not null, then Meilisearch expects an array of embeddings in the response, and will look up the array of embeddings at the path described by pathToEmbeddings in the response, and then will look up each embedding at the path described by embeddingObject in each item of the array of embeddings.
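As a hedged illustration of steps 3 and 4, the lookup could behave roughly like this (plain Python with hypothetical helper names; this is a sketch of the proposed behaviour, not Meilisearch internals):

```python
def lookup(obj, path):
    """Walk a JSON-like object along a list of keys, e.g. ["data", "embedding"]."""
    for key in path:
        obj = obj[key]
    return obj

def extract_embeddings(response, path_to_embeddings, embedding_object):
    # Illustrative sketch only; names are hypothetical, not Meilisearch internals.
    if path_to_embeddings is None:
        # Expect a single embedding, looked up directly at embeddingObject.
        return [lookup(response, embedding_object)]
    # Expect an array at pathToEmbeddings, then look up each embedding
    # at embeddingObject inside every item of that array.
    return [lookup(item, embedding_object)
            for item in lookup(response, path_to_embeddings)]
```

For instance, an OpenAI-style response `{"data": [{"embedding": [...]}, ...]}` would correspond to `pathToEmbeddings = ["data"]` and `embeddingObject = ["embedding"]`.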

Maybe for clarity we could rename these to pathToEmbeddingArray and pathToEmbeddingData or something. Similarly inputField could be pathToInput.

Or these could be EmbeddingArrayPath, EmbeddingDataPath and InputPath.

What do you think? Do you think it would make things clearer?

@dureuill
Contributor

Could you give me a bit more context on what [query is] supposed to do and/or expected use cases?

Sure, query is about sending the value of fields other than the texts to embed.

To send the embedding request, Meilisearch performs two steps:

  1. Build an object that is equal to the value of the query field.
  2. Inject the text(s) to embed at the path described by inputField.

For example, to implement the OpenAI embedder API, the final request needs to be:

{
    "input": "TEXT TO EMBED",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
}

In this example, the model and encoding_format fields will be the same in all our requests to that embedder. So we set the query field to:

{
  "model": "text-embedding-ada-002",
  "encoding_format": "float"
}

However the input field will differ for each request: this is where we should inject the text that we want to embed. So we would have "inputField": ["input"].

Final embedder configuration would be:

{
  "url": "https://api.openai.com/v1/embeddings",
  "apiKey": "OPENAI_APIKEY",
  "query": {
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
  },
  "inputField": ["input"]
  // omitting "inputType", "pathToEmbeddings" and "embeddingObject"
}

Similarly, an ollama request looks like:

{
  "model": "nomic-embed-text",
  "prompt": "TEXT TO EMBED"
}

So you'd have query = { "model": "nomic-embed-text" } and inputField = ["prompt"].
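The two-step request construction described above could be sketched like this (illustrative Python with a hypothetical function name; not Meilisearch internals):

```python
import copy

def build_request_body(query, input_field, text):
    # Illustrative sketch; the function name is hypothetical.
    # Step 1: build an object equal to the value of the `query` field.
    body = copy.deepcopy(query)
    # Step 2: inject the text to embed at the path described by `inputField`.
    target = body
    for key in input_field[:-1]:
        target = target.setdefault(key, {})
    target[input_field[-1]] = text
    return body
```

With the ollama settings above, `build_request_body({"model": "nomic-embed-text"}, ["prompt"], "TEXT TO EMBED")` would produce the request body shown earlier.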

Member

@curquiza curquiza left a comment


Ok for code samples

@guimachiavelli
Member Author

@dureuill, regarding the API, I think pathToEmbeddingArray, pathToEmbeddingData, and pathToInput are pretty good. I think the more we can "tag" fields to signal whether they refer to what Meilisearch sends or to what Meilisearch receives, the better.

Possibly stupid idea: would we gain anything from creating two main fields, input and response? So the embedder object would be something like:

{
  "default": {
    "source": "",
    "input": {
      "apiKey": "",
      "model": "",
      "revision": "",
      "dimensions": 123,
      "inputType": "",
      "pathToInput": [],
      "query": {}
    },
    "response": {
      "pathToEmbeddingArray": [],
      "pathToEmbeddingData": [],
      "distribution": {}
    }
  }
}

I was also thinking about inputType. It seems its only function is to determine whether a user should use pathToEmbeddingArray or not. Couldn't we cut inputType and infer a request is textArray from whether or not pathToEmbeddingArray has been specified? Or would that be bad API design? Or perhaps return an error if a user sets pathToEmbeddingArray, but inputType is text?


Regarding query, why do users need to specify model inside query for rest embedders, but as its own thing for openAi, huggingFace, and ollama?

@dureuill
Contributor

Hey @guimachiavelli

pathToEmbeddingArray, pathToEmbeddingData, and pathToInput are pretty good.

Nice, we might consider changing the name of these parameters then :-)

would we gain anything from creating two main fields, input and response?

I like the idea, but I'm not sure we could implement it. The fields are shared between all embedders, and the fields in response only make sense for the REST embedder.

Also, as much as I love nesting, we should avoid it as much as possible, because it becomes very unwieldy when using the API (the previous API had the parameters nested under the source, which was easier from an implementation perspective, but harder to input).

I was also thinking about inputType. It seems its only function is to determine whether a user should use pathToEmbeddingArray or not.

Not really, it determines if multiple texts can be sent as input, which allows for better performance.

Couldn't we cut inputType and infer a request is textArray from whether or not pathToEmbeddingArray has been specified?

We could, but we might have to come up with some sort of smart naming 🤔

Regarding query, why do users need to specify model inside query for rest embedders, but as its own thing for openAi, huggingFace, and ollama?

That is because the model is a first-class concept for openAi, huggingFace and ollama, but not for REST. The REST embedder configuration is just a way to tell Meilisearch how to send a POST request with a JSON body where the text to embed is injected.

You could imagine some embedders with a REST API not exposing the model at all. For instance, Hugging Face inference endpoints are created with the model already selected, so one does not pass the model with every request.

For a HF inference endpoint, the REST configuration could be something like:

{
  "url": "https://l2skjfwp9punv393.us-east-1.aws.endpoints.huggingface.cloud",
  "apiKey": "YOUR_TOKEN",
  "query": {
    "truncate": true
  },
  "inputField": ["inputs"],
  "inputType": "textArray",
  "pathToEmbeddings": [], // no idea what the answer looks like, would have to test
  "embeddingObject": []   // same
}

@guimachiavelli
Member Author

Ok, thanks for the answers, @dureuill. I'm curious about how the API will evolve before we stabilise it, especially if we manage to get more direct feedback from users (perhaps by organising a poll or a couple of interviews?). A lot of my concerns might end up being fairly academic and mostly unimportant for the majority of people actually using vector search.

In any case, I think this PR is ready for an official review. I don't think you need to re-read everything, just the new section describing each embedder option in more detail: https://github.com/meilisearch/documentation/pull/2824/files#diff-a88efc3f5697059650c8e14b221124b09e9c2eb12aadc2290bb87a71456fd64aR1999-R2193

@guimachiavelli guimachiavelli marked this pull request as ready for review May 23, 2024 14:01
@guimachiavelli guimachiavelli requested a review from a team as a code owner May 23, 2024 14:01

Other models, such as those provided by Ollama and REST embedders, may also be compatible with Meilisearch.

This field is mandatory for `openAi`, `huggingFace`, and `Ollama` embedders.
Contributor


This field has default values for openAi and huggingFace, so it is only mandatory for ollama embedders.

Member Author


What are the default values for openAi and huggingFace? text-embedding-3-small and BAAI/bge-base-en-v1.5?

Contributor


BAAI/bge-base-en-v1.5 for huggingFace and text-embedding-ada-002 for openAi

@dureuill
Contributor

Ah, I also think it would be nice to have a list of the embedder with the allowed/mandatory parameter per embedder.

Contributor

@dureuill dureuill left a comment


Thank you for this huge addition 🎈 🎉

@guimachiavelli guimachiavelli merged commit 0d7b4f1 into new-section-search May 28, 2024
1 check passed
@guimachiavelli guimachiavelli deleted the ai-search-reference-parameter branch May 28, 2024 14:37