
[Inference API] Add Azure AI Studio Embeddings and Chat Completion Support #108472

Conversation

@markjhoy (Contributor) commented May 9, 2024

This PR adds support for Azure AI Studio integration into the Inference API. Currently, it supports the text_embedding and completion task types.

Prerequisites to Model Creation

(screenshot not included)

Model Creation:

PUT _inference/{tasktype}/{model_id}
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "{api_key}",
    "target": “{deployment_target}”,
    “provider”: “(model provider}”,
    “endpoint_type”: “(endpoint type)”
  }
}
  • Valid {tasktype} values are: [text_embedding, completion]

Required Service Settings (see the example request after this list):

  • api_key: The API key can be found on your Azure AI Studio deployment's overview page
  • target: The target URL can be found on your Azure AI Studio deployment's overview page
  • provider: Valid provider types are (case insensitive):
    • openai - available for embeddings and completion
    • mistral - available for completion only
    • meta - available for completion only
    • microsoft_phi - available for completion only
    • cohere - available for embeddings and completion
    • snowflake - available for completion only (not in this MVP - we cannot currently create a deployment due to quota issues - we think this should work, but would like to test first)
    • databricks - available for completion only
  • endpoint_type: Valid endpoint types are:
    • token - a "pay as you go" endpoint (charged by token)
      • Available for OpenAI, Meta and Cohere
    • realtime - a realtime endpoint VM deployment (charged by the hour)
      • Available for Mistral, Meta, Microsoft Phi, Snowflake and Databricks
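
Putting the required settings together, a creation request for, say, a Cohere deployment on a pay-as-you-go endpoint might look like the following (all values are placeholders):

PUT _inference/text_embedding/{model_id}
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "{api_key}",
    "target": "{deployment_target}",
    "provider": "cohere",
    "endpoint_type": "token"
  }
}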

Embeddings Service Settings

  • dimensions: (optional) the number of dimensions the resulting output embeddings should have.

Embeddings Task Settings

(this is also overridable in the inference request; see the example after this list)

  • user: (optional) a string that is a unique identifier representing your end-user. This helps Azure AI Studio in the case of abuse or issues for debugging.
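
For example, an embeddings endpoint could be created with both of these set up front; a sketch with placeholder values (the dimensions and user values are only illustrative):

PUT _inference/text_embedding/{model_id}
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "{api_key}",
    "target": "{deployment_target}",
    "provider": "openai",
    "endpoint_type": "token",
    "dimensions": 1024
  },
  "task_settings": {
    "user": "my-application"
  }
}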

Completion Service Settings

(no additional service settings)

Completion Task Settings

(these are all optional and can be overridden in the inference request; see the example after this list)

  • temperature: What sampling temperature to use, between 0 and 2. Higher values mean the model takes more risks. Try 0.9 for more creative applications, and 0 (argmax sampling) for ones with a well-defined answer. Microsoft recommends altering this or top_p but not both.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. Microsoft recommends altering this or temperature but not both.
  • do_sample: whether or not the request should perform sampling
  • max_new_tokens: the maximum number of new tokens the chat completion inference should produce in the output
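
For example, a completion endpoint could store some of these task settings at creation time; a sketch with placeholder values (the temperature and max_new_tokens values are only illustrative):

PUT _inference/completion/{model_id}
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "{api_key}",
    "target": "{deployment_target}",
    "provider": "meta",
    "endpoint_type": "token"
  },
  "task_settings": {
    "temperature": 0.7,
    "max_new_tokens": 512
  }
}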

Text Embedding Inference

POST _inference/text_embedding/{model_id}
{
    "input": "The answer to the universe is"
}

Chat Completion Inference

POST _inference/completion/{model_id}
{
    "input": "The answer to the universe is"
}
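
Since the completion task settings are overridable per request, the same call can also carry a task_settings object inline; a sketch (parameter values are illustrative):

POST _inference/completion/{model_id}
{
    "input": "The answer to the universe is",
    "task_settings": {
        "temperature": 0.9,
        "max_new_tokens": 256
    }
}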

@markjhoy added the >non-issue, :ml (Machine learning), Team:ML (Meta label for the ML team), :EnterpriseSearch/Application (Enterprise Search), Team:Enterprise Search (Meta label for Enterprise Search team), and v8.15.0 labels on May 9, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine (Collaborator)

Pinging @elastic/ent-search-eng (Team:Enterprise Search)

@jonathan-buttner (Contributor) left a comment

Great work, just posting the comments I had so far. I'll post another round shortly.

ActionListener<List<ChunkedInferenceServiceResults>> listener
) {
ActionListener<InferenceServiceResults> inferListener = listener.delegateFailureAndWrap(
(delegate, response) -> delegate.onResponse(translateToChunkedResults(input, response))
Contributor

I think it might have been an accident that we didn't implement the word boundary chunker for the azure service 🤔 . An example of the chunker is here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/cohere/CohereService.java#L227-L239

I think we only support chunking for text embedding. It doesn't look like we have logic in OpenAiService to throw an exception or anything if we try to use it for other model types, though.

@davidkyle do we want chunking support for azure openai and azure studio?

If so, it's ok with me if you want to do those changes in a separate PR @markjhoy .

Member

Good catch @jonathan-buttner I'll follow up with that change

protected final AzureAiStudioEndpointType endpointType;
protected final RateLimitSettings rateLimitSettings;

protected static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(1_440);
Contributor

hmm this rate is probably going to be different depending on the target used 🤔 I tried looking for some docs on rate limits for azure studio as a whole but didn't find much. Have you seen anything?

Maybe we should have the child classes (embedding and chat completion) pass in a default and those classes can guesstimate the default to use based on the provider?

Or I suppose we could set a fairly low limit (I think the lowest so far is like 240 requests per minute from azure openai chat completions that Tim worked on) and just document that the user should change this as needed.

If/once we have dynamic rate limiting I suppose this won't be an issue.

What do you all think @maxhniebergall @davidkyle ?

Member

Personally, I would say that we should pick a low limit and make it clear that this is something users should change. As long as the error message is clear, they will understand.

@markjhoy (Contributor Author), May 9, 2024

Have you seen anything?

Unfortunately no - and I would assume it's provider specific as well... for the "realtime" deployments, I suspect there are no limits, as the VM is hosted by the user and it would be whatever the VM size can handle as well...

Or I suppose we could set a fairly low limit

I have a feeling this will probably be the best way to start. As long as we let the user know and to have them change it if they need as you mention.

Contributor

A low limit sounds good to me, and we can make it clear in the docs that it needs to be adjusted by the user 👍
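
For reference, adjusting the limit would presumably follow the rate_limit shape used in the other inference services' service settings, along these lines (the rate_limit field for this service is an assumption here, not confirmed in this PR):

PUT _inference/completion/{model_id}
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "{api_key}",
    "target": "{deployment_target}",
    "provider": "meta",
    "endpoint_type": "token",
    "rate_limit": {
      "requests_per_minute": 240
    }
  }
}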

}
dimensionsSetByUser = dims != null;
}
case PERSISTENT -> {
Contributor

Just a heads up, we've run into issues with not correctly adding backwards-compatible logic for parsing the persistent configuration, so I think we're going to stop validating the configuration when parsing it from storage. I don't think you need to change anything at the moment, but we'll probably be going through and removing the validation checks for the persistent code path.

@davidkyle added the cloud-deploy (Publish cloud docker image for Cloud-First-Testing) label on May 13, 2024
@davidkyle (Member)

@elasticmachine test this please

@davidkyle (Member)

After a successful PUT, Mistral failed with a 405 status code on inference

PUT _inference/completion/mist
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "XXX",
    "target": "https://Mistral-small-XXX",
    "provider": "mistral",
    "endpoint_type": "realtime"
  }
}
# PUT ok


POST _inference/mist?error_trace
{
  "input": "Breakfast alternatives to muesli"
}
# POST failed
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "Received an unsuccessful status code for request from inference entity id [mist] status [405]",
      }
    ]
  }
}

@davidkyle (Member)

✅ Completions with the databricks provider and databricks-dbrx-instruct-3 model

The default value of max_new_tokens is quite low for this model; if you don't increase the setting you get a very short response. Perhaps Elasticsearch can set a higher default value, or we can make it clearer somehow that the max_new_tokens setting value is coming into play.

@markjhoy (Contributor Author)

The default value of max_new_tokens is quite low for this model; if you don't increase the setting you get a very short response. Perhaps Elasticsearch can set a higher default value, or we can make it clearer somehow that the max_new_tokens setting value is coming into play.

That's interesting... About what would you say was the default from Databricks' side? And I'd opt to add this to the docs to tell the user they may need to increase the max_new_tokens rather than setting a default, as the others seem to return a decent amount of text (usually in the 100+ word range)... I'm not opposed to either way ultimately, but coding in a default might mean future maintenance if Databricks changes this in the future...

@markjhoy (Contributor Author) commented May 14, 2024

After a successful PUT, Mistral failed with a 405 status code on inference

Huh - that is really odd.

I just tried it myself without any issues... although, like Databricks, the number of returned tokens was really low:

Create model:

PUT _inference/completion/test_mistral_completion
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "########",
    "target": "https://########.eastus2.inference.ml.azure.com/score",
    "provider": "mistral",
    "endpoint_type": "realtime"
  }
}

Test infer:

POST _inference/completion/test_mistral_completion
{
  "input": "The answer to the universe is"
}

Response:

{
    "completion": [
        {
            "result": " I'm an artificial intelligence and don't have the ability to know the"
        }
    ]
}

@davidkyle - just to be sure - when you created (PUT) your model, did you include the /score in the target at the end?

@markjhoy (Contributor Author)

And I'd opt to add this to the docs to tell the user they may need to increase the max_new_tokens rather than setting a default, as the others seem to return a decent amount of text (usually in the 100+ word range)

I take that back... from more testing, with Meta and my Mistral tests -- the default number of tokens seems low, so we may want to add a default... do you have a suggestion for what is a good value?

(and btw, ✅ Meta works for chat completions)

@markjhoy (Contributor Author)

I can confirm as well that ✅ Microsoft Phi works as expected... and again, I think we do need to set a default max num tokens... the default from this yielded 12 terms in the output... :(

@markjhoy (Contributor Author)

buildkite test this

@davidkyle (Member)

@davidkyle - just to be sure - when you created (PUT) your model, did you include the /score in the target at the end?

I will test again, likely a user error

I think we do need to set a default max num tokens... the default from this yielded 12 terms in the output... :(

++
Databricks returned a short sentence (~8 words). Once I upped the token count it returned a much longer response.

@markjhoy (Contributor Author)

I think we do need to set a default max num tokens... the default from this yielded 12 terms in the output... :(

Databricks returned a short sentence (~8 words). Once I upped the token count it returned a much longer response.

From some tests - I think perhaps 64 is a decent number... thoughts?

@markjhoy (Contributor Author)

@davidkyle , @jonathan-buttner - FYI - I added in a default max_new_tokens of 64 if none is entered. This will be documented as well.
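
If 64 turns out to be too low for a given model, the per-request override sketched earlier should apply here as well, e.g. (reusing the model id from the Mistral test above; the value is only illustrative):

POST _inference/completion/test_mistral_completion
{
    "input": "The answer to the universe is",
    "task_settings": {
        "max_new_tokens": 512
    }
}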

@markjhoy (Contributor Author)

I can't for whatever reason get a Snowflake deployment working...

FYI - I still can't get a Snowflake deployment due to quota issues - and no idea how to get this working... I'm confident my implementation matches the input/output described on the model card, and it seems to be the same as the others...

I'm cool with (a) going forward with this, or (b) omitting Snowflake from this... let me know your thoughts.


ValidationException validationException = new ValidationException();

Double temperature = extractOptionalDouble(map, TEMPERATURE_FIELD);
Contributor

I was doing some testing and it looks like we allow temperature and topP to be negative values, but it results in

{
    "completion": [
        {
            "result": "None"
        }
    ]
}

Should we validate that they're in the correct range? I suppose that could be problematic if the allowable ranges change in the future 🤔 I wonder if we're getting an error response back but not passing it along.

Contributor Author

Should we validate that they're in the correct range?

Good question... I'd say yes - but there are no direct docs I can see for the valid ranges of any of these parameters... the only thing that comes close is the AzureOpenAI .DLL / SDK documentation...

I wonder if we're getting an error response back but not passing it along.

I doubt this, as we're still getting a 200 response - however, I can certainly see that if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of manual testing...

Contributor Author

I wonder if we're getting an error response back but not passing it along.

I doubt this, as we're still getting a 200 response - however, I can certainly see that if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of manual testing...

FYI - I ran a manual test - and yep - no error. Calling the .../score endpoint directly with a temperature of -2.0 yields a 200 response with the following:

{
    "output": "None"
}

@jonathan-buttner (Contributor)

✅ Microsoft Phi: phi-3-mini-128k
I did notice that we allow a negative temperature/top_p, which results in "Result: None". I'm not sure if that's the response we're getting when we pass along invalid values for those fields, or if we're not parsing the error response or something.

@markjhoy (Contributor Author)

@jonathan-buttner - FYI - just pushed up a commit that constrains top_p and temperature to the 0.0 to 2.0 range. The max_new_tokens setting was already set up to accept only positive integers. 👍

@davidkyle (Member)

@davidkyle - just to be sure - when you created (PUT) your model, did you include the /score in the target at the end?

Regarding Mistral, I tried again with and without /score and saw the same error. If it worked for you, it must be something in my configuration, but I don't know what. It is hard to debug as there is little information in the logs.

FYI - I still can't get a Snowflake deployment due to quota issues - and no idea how to get this working... I'm confident my implementation matches the input/output described on the model card, and it seems to be the same as the others...

I hit the same quota problem for Snowflake

@davidkyle (Member) left a comment

LGTM

@markjhoy (Contributor Author)

Just did a ✅ using Meta Llama 7B - and all looks good there - @jonathan-buttner - did you get a chance to test yet?

@markjhoy merged commit e87047f into elastic:main May 15, 2024
16 checks passed