An AI Solution Accelerator that speeds up the deployment of intelligent AI Solutions such as Chatbots on Azure.


An Azure AI Services API Gateway

If you're seeking the essential components for rapidly implementing and deploying AI Information Assistant (a.k.a. AI Chatbot) solutions on Azure, the AI Services API Gateway is your go-to resource for expediting development and transitioning smoothly from pilot to full-scale production.

This solution accelerator is designed to deliver 80-90% of the core functionality essential for constructing and deploying AI Solutions and Chatbots. Most notably, it facilitates the use of a shared, minimal infrastructure component set, allowing numerous AI Chatbots/Solutions to be rolled out smoothly on the same foundational infrastructure.

Recipe: AI Information Assistant

Components:
  1. Chatbot User Interface
  2. Semantic Caching
  3. State Management
  4. API Traffic Routing
  5. Azure OpenAI Service
  6. Azure AI Search (RAG/OYD)
  7. Prompt Persistence
  8. API Metrics Collection

Functional Architecture (**): (architecture diagram)

** Components marked by green circles are out-of-the-box features.

Supported Features At A Glance

• Shared Infrastructure Model (All services): The AI Services API Gateway simplifies and streamlines the deployment of multiple AI Solutions (Chatbots) by utilizing a shared infrastructure backbone. This approach allows the infrastructure to be deployed once and subsequently scaled to build and deploy numerous AI Chatbots.

• Unified Management Plane (All services): The Gateway provides a unified management plane for sharing AI Service deployment endpoints among multiple AI Applications. The gateway is AI Application aware, meaning Azure AI Service endpoints can be configured separately for each AI Application. This design not only makes it easier to share AI Service deployments among various AI Applications but also enhances metrics collection and request routing for each specific AI use case.

• Intelligent Traffic Routing (Azure OpenAI Service):
  - Circuit Breaker: The Gateway can be configured with multiple Azure AI Service deployment URIs (a.k.a. backend endpoints). When a backend endpoint is busy/throttled (returns http status code 429), the gateway functions as a circuit breaker and automatically switches to the next configured endpoint in its backend priority list. In addition, the gateway keeps track of throttled endpoints and will not direct any traffic to them until they are available again.
  - Rate Limiting: Users can set up an RPM limit for each OpenAI backend endpoint for any AI Application. When multiple AI Applications use the same endpoint, the gateway enforces rate limiting and throttles excessive requests by returning http 429 status codes. This is especially useful for distributing model processing capacity (PTU deployment) evenly across different AI Applications.
  - Traffic Splitting: The Gateway provides the flexibility to split AI application traffic between multiple Azure AI Service deployment endpoints. In the case of Azure OpenAI Service, the AI application traffic can be split among multiple Paygo (tokens per minute) and Provisioned Throughput Unit (reserved capacity) model deployments.

• Streaming API Responses (Azure OpenAI Service; Chat Completions API only): The AI Services API Gateway fully supports the response streaming feature of the Azure OpenAI Chat Completions API. This function is seamlessly integrated with the semantic caching, state management and traffic routing features.

• Semantic Caching (Azure OpenAI Service): This feature is seamlessly integrated into the AI Services API Gateway and can be used to cache OpenAI Service prompts and responses. Cache hits are evaluated based on semantic similarity and the configured algorithm. With semantic caching, the runtime performance of LLM/AI applications can be improved by up to 40%. This solution leverages the vectorization and semantic search features of the widely popular PostgreSQL open source database.

• Conversational State Management (Azure OpenAI Service; Chat Completions API only): AI Chatbots must maintain context during end-user sessions so they can reference previous user inputs, ensuring coherent and contextually relevant conversations. This feature manages conversational state and can effortlessly scale to support anywhere from 10 to hundreds of concurrent user sessions across multiple AI applications simultaneously. Additionally, it can function independently or in tandem with the Semantic Caching feature to enhance performance.

• Prompt Persistence / Chat History (Azure OpenAI Service): This optional feature can be used to persist OpenAI Service prompts (inputs) and completions (responses) in a relational database table. With this feature, customers can analyze prompts and adjust the similarity distance of the chosen vector search algorithm accordingly to maximize performance (increase throughput). It can also be used to introspect the prompt and completion tokens associated with a particular API request (Request ID) and quickly troubleshoot content filtering issues. The gateway currently supports the PostgreSQL database as the persistence provider.

• Dynamic Server Configuration (All services): The Gateway exposes a separate reconfig (/reconfig) endpoint to allow dynamic reconfiguration of backend endpoints. Backend endpoints can be reconfigured at any time, even while the server is running, thereby limiting AI application downtime.

• API Metrics Collection (All services): The Gateway continuously collects backend API metrics and exposes them through the metrics (/metrics) endpoint. Users can analyze the throughput and latency metrics and reconfigure the gateway's backend endpoint priority list to effectively route/shift the AI Application workload to the desired backend endpoints based on available and consumed capacity.

• Observability and Traceability (All services): The AI Services Gateway is instrumented with the Azure Application Insights SDK. When this setting is enabled, detailed telemetry on Azure OpenAI and dependent services is collected and sent to Azure Monitor.

• Client SDKs and AI Application (LLM) Frameworks (Azure OpenAI Service): The AI Services Gateway server supports the Azure OpenAI client SDK. The gateway has also been tested to work with the Prompt Flow and LangChain LLM frameworks.

• Robust Runtime (All services): The AI Services Gateway is powered by the tried-and-true Node.js runtime. Node.js uses a single-threaded event loop to serve requests asynchronously, is built on the Chrome V8 engine, and is extremely performant. The server can easily scale to handle tens to thousands of concurrent requests.

Usage scenarios

The AI Services Gateway can be used in the following scenarios.

  1. Rapid deployment of AI Chatbots (or AI Information Assistants)

    The AI Services Gateway solution provides core features such as Semantic Caching, State Management, Traffic Routing and API Metrics Collection right out of the box, which are crucial for implementing conversational AI applications such as AI Chatbots.

  2. Capturing Azure AI Service API usage metrics and estimating capacity for AI applications/workloads

    For each AI Application, the AI Services Gateway collects Azure AI Service endpoint usage metrics. The metrics are collected for each AI application based on pre-configured time intervals. In the case of Azure OpenAI Service, the collected metrics include Tokens per minute (TPM) and Requests per minute (RPM). These metrics can then be used to estimate Provisioned Throughput Units for each OpenAI workload.

  3. Intelligently route AI Application requests to Azure AI Service deployments/backends

    For each AI Application, the AI Services Gateway functions as an intelligent router and distributes AI Service API traffic among multiple backend endpoints. The gateway keeps track of unavailable/busy backend endpoints and automatically redirects traffic to available endpoints thereby distributing the API traffic load evenly and not overloading a given endpoint with too many requests.

    The gateway currently supports proxying requests to the following Azure AI Services.

    • Azure OpenAI Service (Full API support)
    • Azure AI Search (Full API support)
    • Azure AI Language (Limited API support - Entity Linking, Language detection, Key phrase extraction, NER, PII, Sentiment analysis and opinion mining only)
    • Azure AI Translator (Limited API support - Text Translation only)
    • Azure AI Content Safety (Limited API support - Analyze Text and Analyze Image only)

Feature/Capability Support Matrix

• Semantic Cache - Configurable: Yes. Supported for Azure OpenAI Service only (Completions API and Chat Completions API).
• State Management - Configurable: Yes. Supported for Azure OpenAI Service only (Chat Completions API).
• API Router - Configurable: Yes. Supported for all services: Azure OpenAI, Azure AI Search, Azure AI Language, Azure AI Translator and Azure AI Content Safety.
• Prompt Persistence - Configurable: Yes. Supported for Azure OpenAI Service only.
• Metrics Collection - Configurable: No. Supported for all services.

Reference Architecture

(reference architecture diagram)

AI Services API Router workflow for Azure OpenAI Service

(workflow diagram)

Bill Of Materials

The AI Services API Gateway is designed from the ground up to be a cost-effective solution and has a minimal services footprint. For details on the Azure services needed to deploy this solution, see the table below.

Development / Testing

  Azure Services:
  • Azure Linux VM (minimum 2 vCPUs; 8 GB memory)
  • Azure Database for PostgreSQL Server (2-4 vCores; 8-16 GB memory; 1,920-2,880 max IOPS)
  • Azure OpenAI Service
  • Azure AI Search
  • Azure Storage

  Notes:
  • The gateway can be run as a standalone server or can be containerized and run on the Linux VM.

Pre-Production / Production

  Azure Services:
  • Azure Linux VM (4-8 vCPUs; 8-16 GB memory)
  • Azure Database for PostgreSQL Server (4-8 vCores; 16-32 GB memory; 2,880-4,320 max IOPS)
  • Azure Kubernetes Service / Azure Container Apps
  • Azure OpenAI Service
  • Azure AI Search
  • Azure Storage

  Notes:
  • The AI Services Gateway can be deployed on AKS or Azure Container Apps. For large-scale deployments, we recommend AKS.
  • Select the appropriate deployment type(s) and OpenAI models for Azure OpenAI Service.
  • Select the appropriate pricing tier (S, S2, S3) for the Azure AI Search service to meet your data indexing and storage requirements.
  • The Linux VM can be used as a jumpbox for testing the gateway server locally, connecting to the Kubernetes cluster, managing other Azure resources, etc.

NOTE: Other Azure AI Services may be needed to implement functions specific to your use case.

Prerequisites

  1. An Azure Resource Group with Owner role permission. All Azure resources can be deployed into this resource group.
  2. A GitHub account to fork and clone this GitHub repository.
  3. Review Overview of Azure Cloud Shell. Azure Cloud Shell is an interactive, browser-accessible shell for managing Azure resources. You will be using the Cloud Shell to create the Jumpbox VM (Linux VM).
  4. This project assumes readers are familiar with Linux, Git SCM, Linux containers (docker engine) and Kubernetes. If you are new to any of these technologies, go through the resources below.
  5. (Optional) Download and install the Postman App, a REST API client used for testing the AI Services Gateway.

Readers are advised to refer to the following on-line resources as needed.

Disclaimer:

  • The software (AI Services Gateway) is provided "as is" without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and non-infringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of, or in connection with the software or the use or other dealings in the software. Use at your own risk.
  • Currently, the Gateway does not secure the exposed APIs by means of security tokens or API keys. Therefore, its use should be limited to private virtual network deployments on Azure. However, the gateway can be seamlessly integrated behind an application gateway or a firewall appliance (WAF) that offers advanced and robust security capabilities.

Deployment Options

Deploy the AI Services API Gateway in a pre-production environment first and configure the desired features by setting the configuration parameters to appropriate values. The pre-production environment should be as close as possible to an actual production environment. Rigorously validate the routing, caching, state management, and persistence features to confirm they function as expected.

The sections below describe the steps to configure and deploy the Gateway on Azure. Although there are multiple deployment options available on Azure, we only describe the two options recommended for production deployments.

Recommended for Usage Scenarios 1 and 2

  • Containerize the Gateway and deploy it on a standalone Virtual Machine. Refer to Sections A and B below.

Recommended for Usage Scenario 3

  1. Containerize the AI Services API Gateway and deploy it on a serverless container platform such as Azure Container Apps.

    We will not be describing the steps for this option here. Readers can follow the deployment instructions described in Azure Container Apps documentation here.

  2. Containerize the AI Services API Gateway and deploy it on a container platform such as Azure Kubernetes Service. Refer to Sections B and E below.

Important Notes

Please review the sections below before proceeding to Section A.

Gateway Router/Load Balancer

It is important to understand how the Gateway's load balancer distributes incoming API requests among configured Azure OpenAI backends (model deployment endpoints). Please read below.

  • The Gateway will strictly follow the priority order when forwarding OpenAI API requests to backends. Lower numeric values equate to higher priority. This means the gateway will forward requests to backends in ascending priority order: '0' first, then '1', and so on. Priorities assigned to OpenAI backends can be viewed by invoking the instanceinfo endpoint - /instanceinfo.
  • When a backend endpoint is busy or throttled (returns http status code = 429), the gateway will mark this endpoint as unavailable and record the 'retry-after' seconds value returned in the OpenAI API response header. The gateway will not forward/proxy any more API requests to this backend until retry-after seconds have elapsed, thereby ensuring the backend (OpenAI endpoint) doesn't get overloaded with too many requests.
  • When all configured backend endpoints are busy or throttled (return http status code = 429), the gateway will select the lowest 'retry-after' seconds value among the throttled OpenAI backends and return it in the Gateway response header 'retry-after'. Client applications should ideally wait the number of seconds returned in the 'retry-after' response header before making a subsequent API call.
  • For as long as all the backend endpoints are busy/throttled, the Gateway will perform global rate limiting and continue to return the lowest 'retry-after' seconds value in its response header ('retry-after').
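The snippet below sketches how a shell client might honor the 'retry-after' header; the AI Application ID, port, environment and payload are all illustrative values, not part of the gateway itself.

    # Send a request; capture the HTTP status code and the response headers.
    STATUS=$(curl -s -o response.json -D headers.txt -w "%{http_code}" \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Hello"}]}' \
      "http://localhost:8000/api/v1/dev/apirouter/lb/aichatbotapp")
    if [ "$STATUS" = "429" ]; then
      # All backends are throttled; wait for the advertised interval before retrying.
      WAIT=$(grep -i '^retry-after:' headers.txt | tr -d '\r' | awk '{print $2}')
      sleep "${WAIT:-30}"
    fi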

Semantic Caching and Retrieval

Cached completions are retrieved based on the semantic text similarity algorithm and distance threshold configured for each AI Application. Caching and retrieval of Azure OpenAI Service responses (completions) can be enabled at three levels.

  1. Global Setting

    To enable caching of OpenAI Service responses/completions, the environment variable API_GATEWAY_USE_CACHE must be set to "true". If this variable is empty or not set, caching and retrieval of OpenAI Service completions (responses) will be disabled for all configured AI Applications.

  2. AI Application

    To enable caching at the AI Application level, the configuration attribute cacheSettings.useCache must be set to "true". If this attribute is empty, not set, or set to "false", caching and retrieval of OpenAI Service completions (responses) will be disabled for that AI Application only.

  3. API Gateway (HTTP) Request

    Caching and retrieval of completions can be disabled for each individual API Gateway request by passing in the query parameter use_cache and setting its value to false (e.g., ?use_cache=false). Setting this parameter to "true" has no effect. See the example below.
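    For example, to bypass the cache for a single request (host, port, environment and AI Application ID are illustrative):

    $ curl -X POST "http://localhost:8000/api/v1/dev/apirouter/lb/aichatbotapp?use_cache=false" \
        -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"What is semantic caching?"}]}'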

Prior to turning on Semantic Caching feature for an AI Application (in Production), please review the following notes.

  • The semantic caching feature utilizes an Azure OpenAI embedding model to vectorize prompts. Any of the three embedding models offered by Azure OpenAI Service can be used to vectorize/embed prompt data. The embedding models have a request token size limit of 8K and an output dimension of 1536. This implies that any request payload containing more than 8K tokens (prompt) will likely be truncated, resulting in faulty search results.
  • By default, the pgvector extension performs exact nearest neighbor search, which provides excellent recall. However, search performance is likely to degrade as the number of records in the table grows above 1K. To trade some recall for query performance, it's recommended to add an index and use approximate nearest neighbor search. The pgvector extension supports two index types - HNSW and IVFFlat. Between the two, HNSW has better query performance. Refer to the snippet below to add an HNSW index to the apigtwycache table. Use psql to add the index.
    # Create HNSW index for cosine similarity distance function.
    #
    => CREATE INDEX ON apigtwycache USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
    #
    # To use L2 distance, set the distance function to 'vector_l2_ops'. Similarly, for IP distance function use 'vector_ip_ops'.
  • During functional tests, setting the cosine similarity score threshold to a value greater than 0.95 was found to deliver more accurate search results.
  • The Inner Product distance function has not been thoroughly tested with sample data. Prior to using this function, it is advised to run functional tests and verify results.

Invalidating Cached Entries

  • When semantic caching and retrieval is enabled at the global level (API_GATEWAY_USE_CACHE=true), the Gateway periodically runs a cache entry invalidator process on a pre-configured schedule. If no schedule is configured, the cache invalidator process is run on a default schedule every 45 minutes. This default schedule can be overridden by setting the environment variable API_GATEWAY_CACHE_INVAL_SCHEDULE as described in Section A below.
  • For each AI Application, cached entries can be invalidated (deleted) by setting the configuration attribute cacheSettings.entryExpiry. This attribute must be set to a value that conforms to PostgreSQL Interval data type. If this attribute value is empty or not set, cache entry invalidation will be skipped. Refer to the documentation here to configure the cache invalidation interval to an appropriate value.
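For illustration, the cacheSettings fragment below (values are examples only; the individual attributes are described in Section A) enables caching for chat completions and expires cached entries after one day:

    "cacheSettings": {
      "useCache": true,
      "searchType": "CS",
      "searchDistance": 0.95,
      "searchContent": {
        "term": "messages",
        "includeRoles": "user"
      },
      "entryExpiry": "1 day"
    }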

Conversational State Management

When interacting with AI Chatbots and information assistants, maintaining conversational continuity and context is crucial for generating accurate, relevant responses to users' questions. Memory management for user sessions can be configured as follows.

  1. Configure global setting

    To enable memory management at the API Gateway level, first set the environment variable API_GATEWAY_STATE_MGMT to "true". If this env variable is empty or not set, memory (state) management will be disabled for all AI Applications.

  2. Configure AI Application setting

    To enable memory management for an AI Application, the attribute memorySettings.useMemory must be set to "true" in the router configuration file. If this attribute is empty or set to "false", memory management for user sessions will be disabled.

Prior to turning on Conversational State Management feature for an AI Application (in Production), please review the following notes.

  • Conversational state management feature is only supported for Azure OpenAI Chat Completion API.

  • When memory management is enabled for an AI application, the AI Services Gateway will return a custom http header x-thread-id in the API response. This custom header will contain a unique value (GUID) representing a Thread ID. The Thread ID represents an end user's session with an AI Chatbot/Assistant application. To have the AI Services Gateway manage the conversational context for a session, client applications must send this value in the http request header x-thread-id in subsequent API calls. A client application can end a user session at any time by simply not sending this custom http header in the API request.

  • Use the memorySettings.msgCount attribute to specify the number of end user interactions (messages) to persist for each user session. Once the number of saved user interactions reaches this max. value (specified by this attribute), the memory manager component will discard the oldest message and only keep the most recent messages. For each user session, the first user interaction (message) will always be retained by the memory manager.

  • Use the memorySettings.entryExpiry attribute to specify the expiry time for user sessions. After a user's session expires, API requests containing the expired Thread ID in the http header will receive an exception stating the session has expired.

  • To quickly test the user session state management feature, you can use one of the provided standalone Node.js applications listed below, or the curl sketch that follows the list.

    • A simple chat application that uses REST API calls - ./samples/chat-client/simple-chat-app.js.
    • A chat application that uses Azure OpenAI SDK to make streaming API calls - ./samples/chat-client/stream-chat-app.js
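    Alternatively, the session round trip can be exercised with curl, as sketched below (app ID, port and environment are illustrative).

    # Turn 1: no x-thread-id header; the gateway starts a session and returns the ID in its response headers.
    $ curl -s -D headers.txt -H "Content-Type: application/json" \
        -d '{"messages":[{"role":"user","content":"My name is Jane."}]}' \
        "http://localhost:8000/api/v1/dev/apirouter/lb/aichatbotapp"
    #
    # Turn 2: echo the returned thread ID back so the gateway injects the prior context.
    $ TID=$(grep -i '^x-thread-id:' headers.txt | tr -d '\r' | awk '{print $2}')
    $ curl -s -H "Content-Type: application/json" -H "x-thread-id: $TID" \
        -d '{"messages":[{"role":"user","content":"What is my name?"}]}' \
        "http://localhost:8000/api/v1/dev/apirouter/lb/aichatbotapp"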

Invalidating Memory Entries

  • When conversational state management feature is enabled at the global level (API_GATEWAY_STATE_MGMT=true), the AI Services Gateway periodically runs a memory invalidator process on a pre-configured Cron schedule. If no schedule is configured, the memory invalidator process is run on a default schedule every 10 minutes. This default schedule can be overridden by setting the environment variable API_GATEWAY_MEMORY_INVAL_SCHEDULE as described in Section A below.
  • For each AI Application, memory entries can be invalidated (deleted) by setting the configuration attribute memorySettings.entryExpiry. This attribute must be set to a value that conforms to PostgreSQL Interval data type. If this attribute value is empty or not set, memory entry invalidation will be skipped. Refer to the documentation here to configure memory entry invalidation interval to an appropriate value.

Persisting Prompts

  • When global environment variable API_GATEWAY_PERSIST_PROMPTS is set to true, prompts and completions along with other API request related metadata will be persisted in database table apigtwyprompts.
  • API request prompts will not be persisted under the following conditions: a) all backend endpoints for a given AI Application are busy/throttled, in which case the Gateway returns HTTP status code 429; b) the API Gateway encounters an internal error while handling a request, in which case the Gateway returns HTTP status code 500.
  • The Gateway returns a unique id (GUID) in the x-request-id HTTP response header for every request. This header value, along with the user value sent in the API request (body), can be used to query the table apigtwyprompts and troubleshoot issues. For instance, these values could be used to look up a request that failed due to the application of a content filter (HTTP status = 400). An example query is sketched below.
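A query along the lines of the sketch below can pull up the persisted record. The column names ('requestid', 'userid') are assumptions for illustration only; check the table definition created by ./db-scripts/pg-test.js for the actual schema.

    # Hypothetical psql lookup of a persisted prompt/completion by request id.
    #
    => SELECT * FROM apigtwyprompts WHERE requestid = '<x-request-id value>';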

The remainder of this readme describes how to configure/enable specific features and deploy the AI Services API Gateway on Azure.

A. Configure and run the AI Services API Gateway on a standalone Virtual Machine

Before getting started, you will need a Linux Virtual Machine to run the AI Services Gateway. If you haven't already, provision a Virtual Machine with a Linux flavor of your choice on Azure.

  1. Clone or fork this GitHub repository into a directory on the VM.

    SSH login to the Virtual Machine using a terminal window. If you intend to customize the Gateway, it's best to fork this repository into your GitHub account and then clone the repository to the VM.

  2. Install Node.js.

    Refer to the installation instructions on nodejs.org for your specific Linux distribution.

  3. Install PostgreSQL database server.

    NOTE: If you do not intend to use Semantic Caching, Conversational State Management or Prompt Persistence features, you can safely skip this step and go to Step 4.

    Refer to the installation instructions here to install Azure Database for PostgreSQL. Create a new database and give it a suitable name. Note down the database name, server user name and password, and save them in a secure location; we will need this information in a subsequent step (below).

    You can connect to the database using any one of the following options: 1) Azure CLI, 2) psql, or 3) Azure Cloud Shell.

    Next, go to the root directory of this project repository. Update values for the environment variables listed below, then export them.

    • VECTOR_DB_HOST: Name of the Azure Database for PostgreSQL server. You will find this in the Overview blade/tab of the PostgreSQL Server resource in the Azure Portal (value of the field Server name).
    • VECTOR_DB_PORT: 5432 (the default PostgreSQL Server listen port)
    • VECTOR_DB_USER: Name of the database user (saved in the step above)
    • VECTOR_DB_UPWD: Password of the database user (saved in the step above)
    • VECTOR_DB_NAME: Name of the database (saved in the step above)
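    For example (server name and credentials are placeholders):

    # Set and export the database connection variables.
    $ export VECTOR_DB_HOST="<server-name>.postgres.database.azure.com"
    $ export VECTOR_DB_PORT=5432
    $ export VECTOR_DB_USER="<database-user>"
    $ export VECTOR_DB_UPWD="<database-password>"
    $ export VECTOR_DB_NAME="<database-name>"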

    Create the database tables using the script ./db-scripts/pg-test.js. See command snippet below.

    # Make sure you are in the project's root directory.
    #
    $ node ./db-scripts/pg-test.js
    #

    Connect to the database (using psql) and verify the database tables were created successfully. The following three tables should have been created.

    • apigtwycache: This table stores vectorized prompts and completions
    • apigtwyprompts: This table stores prompts and completions
    • apigtwymemory: This table stores state for user sessions (threads)
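    One way to verify, assuming the variables exported in the previous step are still set (sslmode=require is typical for Azure Database for PostgreSQL):

    # List the gateway tables using psql.
    $ psql "host=$VECTOR_DB_HOST port=$VECTOR_DB_PORT dbname=$VECTOR_DB_NAME user=$VECTOR_DB_USER password=$VECTOR_DB_UPWD sslmode=require" -c "\dt"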
  4. Update the AI Services Gateway Configuration file.

    Edit the ./api-router-config.json file.

    For each AI Application,

    • Specify a unique appId and an optional description.

    • Specify the type appType of the backend Azure AI Service. This type must be one of the following values.

      • azure_language: Denotes the Azure AI Language service
      • azure_translator: Denotes the Azure AI Translator service
      • azure_content_safety: Denotes the Azure AI Content Safety service
      • azure_search: Denotes the Azure AI Search service
      • azure_oai: Denotes the Azure OpenAI service
    • Specify Azure AI Service endpoints/URIs and corresponding API key values within the endpoints attribute. Set appropriate values for each of the attributes below.

      • uri: AI Service endpoint URI
      • apiKey: AI Service API key
      • rpm: Requests per minute (RPM) rate limit to be applied to this endpoint. This attribute is optional; when it is not specified, no rate limits are applied.
    • (Optional) To enable caching and retrieval of OpenAI Service completions (the Semantic Caching feature), specify values for the attributes contained within the cacheSettings attribute, described below.

      • useCache: The AI Services Gateway will cache OpenAI Service completions (output) based on this value. If caching is desired, set it to true. Default is false.
      • searchType: The AI Services Gateway searches for and retrieves OpenAI Service completions based on a semantic text similarity algorithm. This attribute specifies the similarity distance function/algorithm for vector search. Supported values are a) CS (Cosine Similarity; the default), b) LS (Level2 / Euclidean distance) and c) IP (Inner Product).
      • searchDistance: Specifies the search similarity threshold. For instance, if the search type is CS, this value should be set between 0 and 1.
      • searchContent.term: Specifies the attribute in the request payload which should be vectorized and used for semantic search. For the OpenAI completions API, this value should be prompt. For the chat completions API, it should be set to messages.
      • searchContent.includeRoles: This attribute should only be set for OpenAI models that expose the chat completions API. The value can be a comma-separated list. Permissible values are system, user and assistant.
      • entryExpiry: Specifies when cached entries (completions) should be invalidated. Specify any valid PostgreSQL Interval data type expression. Refer to the docs here.
    • (Optional) To enable conversational state management for user chat sessions, specify values for the attributes contained within the memorySettings attribute, described below.

      • useMemory: When this value is set to true, the AI Services Gateway will manage state for user sessions. Default is false.
      • msgCount: Specifies the number of messages (user - assistant interactions) to store and send to the LLM.
      • entryExpiry: Specifies when user sessions should be invalidated. Specify any valid PostgreSQL Interval data type expression. Refer to the docs here.
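    Putting these attributes together, a single AI Application entry might look like the sketch below. The URI and API key are placeholders, and the exact endpoint-list syntax should be verified against the sample ./api-router-config.json file shipped with the repository.

    {
      "appId": "aichatbotapp",
      "description": "A test AI Assistant / Chatbot application",
      "appType": "azure_oai",
      "cacheSettings": {
        "useCache": true,
        "searchType": "CS",
        "searchDistance": 0.95,
        "searchContent": {
          "term": "messages",
          "includeRoles": "user"
        },
        "entryExpiry": "1 day"
      },
      "memorySettings": {
        "useMemory": true,
        "msgCount": 5,
        "entryExpiry": "5 minutes"
      },
      "endpoints": [
        {
          "rpm": 10,
          "uri": "https://<your-resource>.openai.azure.com/openai/deployments/<deployment-name>/chat/completions?api-version=2024-02-01",
          "apiKey": "<your-api-key>"
        }
      ]
    }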

    After making the changes, save the ./api-router-config.json file.

    IMPORTANT:

    Azure OpenAI Service:

    The model deployment endpoints/URIs should be listed from highest to lowest priority (top down). Endpoints listed at the top will be assigned higher priority than those listed below them. For each AI Application, the Gateway server will traverse and load the deployment URIs starting at the top, in priority order. While routing requests to OpenAI API backends, the gateway will strictly follow this priority order and route requests to higher-priority endpoints first before falling back to lower-priority endpoints.

  5. Set the gateway server environment variables.

    Set the environment variables to the correct values and export them before proceeding to the next step. The environment variables are described below.

    • API_GATEWAY_KEY (Required): AI Services Gateway private key (secret) required to reconfigure backend (Azure OpenAI) endpoints. Set this value to an alphanumeric string.
    • API_GATEWAY_CONFIG_FILE (Required): The gateway configuration file location. Set the full or relative path to the AI Services Gateway configuration file from the project root directory.
    • API_GATEWAY_NAME (Required): Gateway instance name. Set a value such as 'Instance-01' ...
    • API_GATEWAY_PORT (Optional; default: 8000): Gateway server listen port.
    • API_GATEWAY_ENV (Required): Gateway environment. Set a value such as 'dev', 'test', 'pre-prod', 'prod' ...
    • API_GATEWAY_LOG_LEVEL (Optional; default: info): Gateway logging level. Possible values are debug, info, warn, error, fatal.
    • API_GATEWAY_METRICS_CINTERVAL (Required): Backend API metrics collection and aggregation interval (in minutes). Set it to a numeric value, e.g., 60 (1 hour).
    • API_GATEWAY_METRICS_CHISTORY (Required): Backend API metrics collection history count. Set it to a numeric value (<= 600).
    • APPLICATIONINSIGHTS_CONNECTION_STRING (Optional): Azure Monitor connection string. Assign the value of the Azure Application Insights resource connection string (from the Azure Portal).
    • API_GATEWAY_USE_CACHE (Optional; default: false): Global setting for enabling the semantic caching feature. This setting applies to all AI Applications.
    • API_GATEWAY_CACHE_INVAL_SCHEDULE (Optional; default: "*/45 * * * *"): Global setting for configuring the frequency of Cache Entry Invalidator runs. The schedule should be specified in GNU Crontab syntax. Refer to the docs here.
    • API_GATEWAY_STATE_MGMT (Optional; default: false): Global setting for enabling the conversational state management feature. This setting applies to all AI Applications.
    • API_GATEWAY_MEMORY_INVAL_SCHEDULE (Optional; default: "*/10 * * * *"): Global setting for configuring the frequency of Memory Invalidator runs. The schedule should be specified in GNU Crontab syntax. Refer to the docs here.
    • API_GATEWAY_PERSIST_PROMPTS (Optional; default: false): Global setting for persisting prompts and completions in a database table (PostgreSQL).
    • API_GATEWAY_VECTOR_AIAPP (Required when the semantic caching feature is enabled; default: none): Name of the AI Application that exposes endpoints for the data embedding model.
    • API_GATEWAY_SRCH_ENGINE (Optional; default: Postgresql/pgvector): The vector search engine used by the semantic caching feature.

    NOTE: You can update and run the shell script ./set-api-gtwy-env.sh to set and export the environment variables.
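    For instance, a minimal development setup might export the following (all values are illustrative):

    # Illustrative values only; adjust for your environment.
    $ export API_GATEWAY_KEY="<an-alphanumeric-secret>"
    $ export API_GATEWAY_CONFIG_FILE="./api-router-config.json"
    $ export API_GATEWAY_NAME="Instance-01"
    $ export API_GATEWAY_PORT=8000
    $ export API_GATEWAY_ENV="dev"
    $ export API_GATEWAY_METRICS_CINTERVAL=60
    $ export API_GATEWAY_METRICS_CHISTORY=168
    $ export API_GATEWAY_USE_CACHE="true"
    $ export API_GATEWAY_VECTOR_AIAPP="vectorizedata"
    $ export API_GATEWAY_STATE_MGMT="true"
    $ export API_GATEWAY_PERSIST_PROMPTS="true"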

  6. Run the AI Services Gateway.

    Switch to the project root directory. Then issue the command shown in the snippet below.

    # Use the node package manager (npm) to install the server dependencies
    $ npm install
    #
    # Start the AI Services Gateway Server
    $ npm start
    #

    You will see the API Gateway server start up message in the terminal window as shown in the snippet below.

    > [email protected] start
    > node ./src/server.js
    
    13-Jun-2024 06:51:31 [info] [server.js] Starting initialization of Azure AI Services API Gateway ...
    13-Jun-2024 06:51:31 [info] [server.js] Azure Application Insights 'Connection string' not found. No telemetry data will be sent to App Insights.
    Server(): Azure AI Services API Gateway server started successfully.
    Gateway uri: http://localhost:8080/api/v1/dev
    13-Jun-2024 06:51:31 [info] [cp-pg.js] checkDbConnection(): Postgres DB connectivity OK!
    13-Jun-2024 06:51:31 [info] [server.js] Completions will be cached
    13-Jun-2024 06:51:31 [info] [server.js] Prompts will be persisted
    13-Jun-2024 06:51:31 [info] [server.js] Conversational state will be managed
    13-Jun-2024 06:51:31 [info] [server.js] AI Application backend (Azure AI Service) endpoints:
    Application ID: language-app; Type: azure_language
      Priority: 0   Uri: https://gr-dev-lang.cognitiveservices.azure.com/language/:analyze-text?api-version=2022-05-01
    Application ID: translate-app; Type: azure_translator
      Priority: 0   Uri: https://api.cognitive.microsofttranslator.com/
    Application ID: content-safety-app; Type: azure_content_safety
      Priority: 0   Uri: https://gr-dev-cont-safety.cognitiveservices.azure.com/contentsafety/text:analyze?api-version=2023-10-01
    Application ID: search-app-ak-stip-v2; Type: azure_search
      Priority: 0   Uri: https://gr-dev-rag-ais.search.windows.net/indexes/ak-stip-v2/docs/search?api-version=2023-11-01
    Application ID: search-app-ak-stip-aisrch-iv; Type: azure_search
      Priority: 0   Uri: https://gr-dev-rag-ais.search.windows.net/indexes/ak-stip-aisrch-iv/docs/search?api-version=2023-10-01-preview
    Application ID: search-garmin-docs; Type: azure_search
      Priority: 0   Uri: https://gr-dev-rag-ais.search.windows.net/indexes/dev-garmin-idx/docs/search?api-version=2023-10-01-preview
    Application ID: vectorizedata; Type: azure_oai; useCache=false; useMemory=false
      Priority: 0   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/dev-embedd-ada-002/embeddings?api-version=2023-05-15
    Application ID: ai-doc-assistant-gpt-4t-0125; Type: azure_oai; useCache=true; useMemory=true
      Priority: 0   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-4-0125/chat/completions?api-version=2024-02-01
    Application ID: aichatbotapp; Type: azure_oai; useCache=true; useMemory=true
      Priority: 0   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-16k/chat/completions?api-version=2024-02-01
      Priority: 1   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-4-0125/chat/completions?api-version=2024-02-01
    Application ID: aidocusearchapp; Type: azure_oai; useCache=true; useMemory=false
      Priority: 0   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-instruct/completions?api-version=2023-05-15
      Priority: 1   Uri: https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-35-t-inst-01/completions?api-version=2023-05-15
    13-Jun-2024 06:51:31 [info] [server.js] Successfully loaded backend API endpoints for AI applications
    13-Jun-2024 06:51:31 [info] [server.js] Cache entry invalidate run schedule (Cron) - */5 * * * *
    13-Jun-2024 06:51:31 [info] [server.js] Memory (State) invalidate run schedule (Cron) - */2 * * * *

    Leave the terminal window open.

  7. Retrieve the AI Services API Gateway Instance info (/instanceinfo)

    Use a web browser to access the AI Services API Gateway info endpoint - /instanceinfo. Specify correct values for the gateway listen port and environment. See below.

    http://localhost:{API_GATEWAY_PORT}/api/v1/{API_GATEWAY_ENV}/apirouter/instanceinfo
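    Alternatively, use curl. With the default listen port (8000) and environment 'dev', for example:

    $ curl http://localhost:8000/api/v1/dev/apirouter/instanceinfo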

    If you get a json response similar to the one shown in the snippet below then the Gateway server is ready to accept Azure OpenAI service requests.

    {
      "serverName": "Gateway-Instance-01",
      "serverVersion": "1.8.0",
      "serverConfig": {
         "host": "localhost",
         "listenPort": 8000,
         "environment": "dev",
         "persistPrompts": "true",
         "collectInterval": 1,
         "collectHistoryCount": 2,
         "configFile": "./api-router-config-test.json"
      },
      "cacheSettings": {
         "cacheEnabled": true,
         "embeddAiApp": "vectorizedata",
         "searchEngine": "Postgresql/pgvector",
         "cacheInvalidationSchedule": "*/5 * * * *"
      },
      "memorySettings": {
         "memoryEnabled": "true",
         "memoryInvalidationSchedule": "*/10 * * * *"
      },
      "aiApplications": [
         {
             "applicationId": "language-app",
             "description": "Azure AI Language Service test application",
             "type": "azure_language",
             "cacheSettings": {
                 "useCache": false
             },
             "endpoints": {
                 "0": {
     	  "uri": "https://gr-dev-lang.cognitiveservices.azure.com/language/:analyze-text?api-version=2022-05-01"
     	}
             }
         },
         {
             "applicationId": "translate-app",
             "description": "Azure AI Translator Service test application",
             "type": "azure_translator",
             "cacheSettings": {
                 "useCache": false
             },
             "endpoints": {
                 "0": {
     	  "uri": "https://api.cognitive.microsofttranslator.com/"
     	}
             }
         },
         {
             "applicationId": "vectorizedata",
             "description": "Application that uses OAI model to generate data embeddings/vectors",
             "type": "azure_oai",
             "cacheSettings": {
                 "useCache": false
             },
             "memorySettings": {
                 "useMemory": false
             },
             "endpoints": {
                 "0": {
     	  "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-embedd-ada-002/embeddings?api-version=2023-05-15"
     	}
             }
         },
         {
             "applicationId": "ai-doc-assistant-gpt-4t-0125",
             "description": "An AI Assistant / Chatbot application",
             "type": "azure_oai",
             "cacheSettings": {
                 "useCache": true,
                 "searchType": "CS",
                 "searchDistance": 0.95,
                 "searchContent": {
                     "term": "messages",
                     "includeRoles": "user"
                 },
                 "entryExpiry": "2 minutes"
             },
             "memorySettings": {
                 "useMemory": true,
                 "msgCount": 5,
                 "entryExpiry": "5 minutes"
             },
             "endpoints": {
                 "0": {
                     "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-4-0125/chat/completions?api-version=2024-02-01"
                 }
             }
         },
         {
             "applicationId": "aichatbotapp",
             "description": "A test AI Assistant / Chatbot application",
             "type": "azure_oai",
             "cacheSettings": {
                 "useCache": true,
                 "searchType": "CS",
                 "searchDistance": 0.98,
                 "searchContent": {
                     "term": "messages",
                     "includeRoles": "user"
                 },
                 "entryExpiry": "2 minutes"
             },
             "memorySettings": {
                 "useMemory": true,
                 "msgCount": 1,
                 "entryExpiry": "5 minutes"
             },
             "endpoints": {
                 "0": {
                     "rpm": 10,
                     "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-16k/chat/completions?api-version=2024-02-01"
                 },
                 "1": {
                     "rpm": 10,
                     "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-4-0125/chat/completions?api-version=2024-02-01"
                 }
             }
         },
         {
             "applicationId": "aidocusearchapp",
             "description": "A test AI text generation application",
             "type": "azure_oai",
             "cacheSettings": {
                 "useCache": true,
                 "searchType": "CS",
                 "searchDistance": 0.95,
                 "searchContent": {
                     "term": "prompt"
                 },
                 "entryExpiry": "1 day"
             },
             "endpoints": {
                 "0": {
                     "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-instruct/completions?api-version=2023-05-15"
                 },
                 "1": {
                     "uri": "https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-35-t-inst-01/completions?api-version=2023-05-15"
                 }
             }
         }
      ],
      "containerInfo": {},
      "apiGatewayUri": "/api/v1/dev/apirouter",
      "endpointUri": "/api/v1/dev/apirouter/instanceinfo",
      "serverStartDate": "6/13/2024, 6:57:56 AM",
      "status": "OK"
    }
  8. Access the AI Services Gateway load balancer/router (/lb) endpoint

    Use Curl or Postman to send a few completion / chat completion API requests to the gateway server load balancer endpoint - /lb. Remember to substitute the correct value for AI_APPLICATION_ID in the URL below. The AI Application ID value should be one of the unique appId values specified in the Gateway configuration file ./api-router-config.json.

    http://localhost:{API_GATEWAY_PORT}/api/v1/{API_GATEWAY_ENV}/apirouter/lb/{AI_APPLICATION_ID}

    Review the OpenAI API response and log lines output by the gateway server in the respective terminal windows.
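    For example, a chat completions request for an AI Application with appId aichatbotapp might look like the sketch below. The payload follows the standard Azure OpenAI Chat Completions schema and is passed through to the backend; the host, port and environment values are illustrative.

    $ curl -X POST "http://localhost:8000/api/v1/dev/apirouter/lb/aichatbotapp" \
        -H "Content-Type: application/json" \
        -d '{
              "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What does an API gateway do?"}
              ],
              "user": "test-user-01"
            }'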

IMPORTANT:

  • For invoking the model deployment endpoints exposed by the AI Services Gateway from a LangChain LLM application/framework, two environment variables must be set. See below.

    • AZURE_OPENAI_BASE_PATH: Set the value of this variable to the Gateway load balancer / router endpoint URI (/lb). This URI can also be specified as part of the OpenAI configuration object (in code).
    • AZURE_OPENAI_API_DEPLOYMENT_NAME: Set the value of this variable to the AI Application name/ID configured in the Gateway. This value can also be specified as part of the OpenAI configuration object (in code).

    Refer to the sample script provided in ./samples/lang-chain directory for details.
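    For example (the gateway host, port and environment are illustrative, and the application ID must match a configured appId):

    $ export AZURE_OPENAI_BASE_PATH="http://localhost:8000/api/v1/dev/apirouter/lb"
    $ export AZURE_OPENAI_API_DEPLOYMENT_NAME="aidocusearchapp"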

  • For generating OpenAI API traffic and/or simulating API workload, one of the following methods can be used. See below.

    • Update and use the provided shell script ./tests/test-oai-api-gateway.sh with sample data. For an AI Application, observe how the Gateway intelligently distributes the OpenAI API requests among multiple configured backend endpoints.
    • For simulating continuous API traffic and performing comprehensive load testing (capacity planning), use Azure Load Testing PaaS service.

B. Containerize the AI Services API Gateway and deploy it on the Virtual Machine

Before getting started with this section, make sure you have installed a container runtime such as docker or containerd on the Linux VM. For installing docker engine, refer to the docs here.

  1. Build the AI Services Gateway container image.

    Review the container image build file ./Dockerfile and make any required updates to the environment variables. The environment variables can also be passed to the docker engine at build time; to do this, modify the provided container build script ./scripts/build-container.sh. After updating the build shell script, run it to build the Gateway container image. See the command snippet below.

    # Run the container image build
    $ . ./scripts/build-container.sh
    #
    # List the container images.  This command should list the images on the system.
    $ docker images
    #
  2. Run the containerized AI Services Gateway server instance.

    Run the Gateway container instance using the provided ./scripts/start-container.sh shell script. Refer to the command snippet below.

    # Run the AI Services Gateway container instance
    $ . ./scripts/start-container.sh
    #
    # Leave this terminal window open
  3. Access the Gateway server load balancer/router (/lb) endpoint

    Use Curl or Postman to send a few completion / chat completion API requests to the gateway server load balancer endpoint - /lb. See URL below.

    http://localhost:{API_GATEWAY_PORT}/api/v1/{API_GATEWAY_ENV}/apirouter/lb/{AI_APPLICATION_ID}

    Review the OpenAI API response and log lines output by the gateway server in the respective terminal windows.

    TIP: You can update and use the shell script ./tests/test-oai-api-gateway.sh with sample data to test how the Gateway intelligently distributes the OpenAI API requests among multiple configured backend endpoints.

C. Analyze Azure OpenAI endpoint(s) traffic metrics

  1. Access the AI Services Gateway metrics endpoint and analyze OpenAI API metrics.

    Use a web browser and access the Gateway metrics endpoint to retrieve the backend OpenAI API metrics information. The metrics endpoint URL - /metrics, is listed below.

    http://localhost:{API_GATEWAY_PORT}/api/v1/{API_GATEWAY_ENV}/apirouter/metrics
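    Or with curl (default port and 'dev' environment assumed):

    $ curl http://localhost:8000/api/v1/dev/apirouter/metrics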

    A sample JSON output snippet is shown below.

    {
      "listenPort": "8000",
      "instanceName": "Gateway-Instance-01",
      "collectionIntervalMins": 1,
      "historyCount": 5,
      "applicationMetrics": [
     {
             "applicationId": "language-app",
             "endpointMetrics": [
                 {
                     "endpoint": "https://gr-dev-lang.cognitiveservices.azure.com/language/:analyze-text?api-version=2022-05-01",
                     "priority": 0,
                     "metrics": {
                         "apiCalls": 6,
                         "languageDetectionApiCalls": 1,
                         "namedEntityRecognitionApiCalls": 1,
                         "keyPhraseExtractionApiCalls": 1,
                         "entityLinkingApiCalls": 1,
                         "sentimentAnalysisApiCalls": 1,
                         "piiEntityRecognitionApiCalls": 1,
                         "failedApiCalls": 0,
                         "totalApiCalls": 6,
                         "history": {}
                     }
                 }
             ]
         },
     {
             "applicationId": "vectorizedata",
             "endpointMetrics": [
                 {
                     "endpoint": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-embedd-ada-002/embeddings?api-version=2023-05-15",
                     "priority": 0,
                     "metrics": {
                         "apiCalls": 319,
                         "failedApiCalls": 0,
                         "throttledApiCalls": 0,
                         "filteredApiCalls": 0,
                         "totalApiCalls": 319,
                         "kInferenceTokens": 3.135,
                         "history": {
                             "1": {
                                 "collectionTime": "5/5/2024, 7:18:59 PM",
                                 "collectedMetrics": {
                                     "apiCalls": 239,
                                     "failedApiCalls": 0,
                                     "throttledApiCalls": 0,
                                     "filteredApiCalls": 0,
                                     "totalApiCalls": 239,
                                     "throughput": {
                                         "kTokensPerWindow": 2.318,
                                         "requestsPerWindow": 13.908000000000001,
                                         "avgTokensPerCall": 9.698744769874477,
                                         "avgRequestsPerCall": 0.05819246861924686,
                                         "tokensPerMinute": 2318,
                                         "requestsPerMinute": 239
                                     },
                                     "latency": {
                                         "avgResponseTimeMsec": 108.7489539748954
                                     }
                                 }
                             },
                             "2": {
                                 "collectionTime": "5/5/2024, 7:19:59 PM",
                                 "collectedMetrics": {
                                     "apiCalls": 268,
                                     "failedApiCalls": 0,
                                     "throttledApiCalls": 0,
                                     "filteredApiCalls": 0,
                                     "totalApiCalls": 268,
                                     "throughput": {
                                         "kTokensPerWindow": 2.647,
                                         "requestsPerWindow": 15.881999999999998,
                                         "avgTokensPerCall": 9.876865671641792,
                                         "avgRequestsPerCall": 0.05926119402985075,
                                         "tokensPerMinute": 2647,
                                         "requestsPerMinute": 268
                                     },
                                     "latency": {
                                         "avgResponseTimeMsec": 101.9776119402985
                                     }
                                 }
                             }
                         }
                     }
                 }
             ]
         },
     {
             "applicationId": "aichatbotapp",
             "cacheMetrics": {
                 "hitCount": 7,
                 "avgScore": 0.9999999148505135
             },
             "endpointMetrics": [
                 {
                     "endpoint": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-16k/chat/completions?api-version=2024-02-01",
                     "priority": 0,
                     "metrics": {
                         "apiCalls": 2,
                         "failedApiCalls": 0,
                         "throttledApiCalls": 0,
                         "filteredApiCalls": 0,
                         "totalApiCalls": 2,
                         "kInferenceTokens": 723,
                         "history": {
                             "0": {
                                 "collectionTime": "5/5/2024, 7:35:50 PM",
                                 "collectedMetrics": {
                                     "apiCalls": 0,
                                     "failedApiCalls": 0,
                                     "throttledApiCalls": 0,
                                     "filteredApiCalls": 0,
                                     "totalApiCalls": 0,
                                     "throughput": {
                                         "kTokensPerWindow": 0,
                                         "requestsPerWindow": 0,
                                         "avgTokensPerCall": 0,
                                         "avgRequestsPerCall": 0,
                                         "tokensPerMinute": 0,
                                         "requestsPerMinute": 0
                                     },
                                     "latency": {
                                         "avgResponseTimeMsec": 0
                                     }
                                 }
                             },
                             "1": {
                                 "collectionTime": "5/5/2024, 7:42:25 PM",
                                 "collectedMetrics": {
                                     "apiCalls": 9,
                                     "failedApiCalls": 0,
                                     "throttledApiCalls": 0,
                                     "filteredApiCalls": 0,
                                     "totalApiCalls": 9,
                                     "throughput": {
                                         "kTokensPerWindow": 4.367,
                                         "requestsPerWindow": 26.201999999999998,
                                         "avgTokensPerCall": 485.22222222222223,
                                         "avgRequestsPerCall": 2.9113333333333333,
                                         "tokensPerMinute": 4367,
                                         "requestsPerMinute": 9
                                     },
                                     "latency": {
                                         "avgResponseTimeMsec": 5201.666666666667
                                     }
                                 }
                             }
                         }
                     }
                 }
             ]
         },
     {
             "applicationId": "aidocusearchapp",
             "cacheMetrics": {
                 "hitCount": 1750,
                 "avgScore": 0.9999999147483241
             },
             "endpointMetrics": [
                 {
                     "endpoint": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-instruct/completions?api-version=2023-05-15",
                     "priority": 0,
                     "metrics": {
                         "apiCalls": 0,
                         "failedApiCalls": 0,
                         "throttledApiCalls": 0,
                         "filteredApiCalls": 0,
                         "totalApiCalls": 0,
                         "kInferenceTokens": 0,
                         "history": {}
                     }
                 },
                 {
                     "endpoint": "https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-35-t-inst-01/completions?api-version=2023-05-15",
                     "priority": 1,
                     "metrics": {
                         "apiCalls": 0,
                         "failedApiCalls": 0,
                         "throttledApiCalls": 0,
                         "filteredApiCalls": 0,
                         "totalApiCalls": 0,
                         "kInferenceTokens": 0,
                         "history": {}
                     }
                 }
             ]
         }
      ],
      "successApiCalls": 17,
      "cachedApiCalls": 1757,
      "failedApiCalls": 0,
      "totalApiCalls": 1774,
      "endpointUri": "/api/v1/dev/apirouter/metrics",
      "currentDate": "5/5/2024, 7:45:55 PM",
      "status": "OK"
    }

    Azure AI Service endpoint metrics collected by the Gateway server across all AI Applications are described below.

    • successApiCalls: Number of API calls successfully processed by the Gateway.
    • cachedApiCalls: Number of API call responses served from the Gateway cache.
    • failedApiCalls: Number of API calls which failed as a result of a backend exception (AI Service returned status != 200).
    • totalApiCalls: Total number of API calls received by this Gateway server instance. Does not include API calls initiated by the Gateway.

    For each AI Application, the Gateway collects multiple metrics. Descriptions of the Azure OpenAI Service metrics collected by the Gateway are provided in the tables below.

    Cache hit metrics

    | Metric name | Description |
    | ----------- | ----------- |
    | hitCount | Number of API calls where responses were served from the gateway cache |
    | avgScore | Average similarity search score for cached responses |

    AI Service (OpenAI) endpoint metrics

    | Metric name | Description |
    | ----------- | ----------- |
    | apiCalls | Number of OpenAI API calls successfully handled by this backend endpoint in the current metrics collection interval |
    | failedApiCalls | Number of OpenAI API calls which failed in the current metrics collection interval |
    | throttledApiCalls | Number of OpenAI API calls that were throttled (Service returned status = 429) |
    | filteredApiCalls | Number of OpenAI API calls that were filtered due to harmful content or a bad request (Service returned status = 400) |
    | totalApiCalls | Total number of OpenAI API calls received by this backend (OpenAI) endpoint in the current metrics collection interval |
    | kInferenceTokens | Total tokens (K) processed/handled by this backend (OpenAI) endpoint in the current metrics collection interval |

    AI Service (OpenAI) endpoint metrics history

    | Metric name | Description |
    | ----------- | ----------- |
    | collectionTime | Start time of the metrics collection interval/window |
    | apiCalls | Number of OpenAI API calls successfully handled by this backend endpoint |
    | failedApiCalls | Number of OpenAI API calls which failed (Service returned status != 200) |
    | throttledApiCalls | Number of OpenAI API calls that were throttled (Service returned status = 429) |
    | filteredApiCalls | Number of OpenAI API calls that were filtered due to harmful content or a bad request (Service returned status = 400) |
    | throughput.kTokensPerWindow | Total tokens (K) processed/handled by this OpenAI backend |
    | throughput.requestsPerWindow | Total number of requests processed/handled by this OpenAI endpoint |
    | throughput.avgTokensPerCall | Average number of tokens processed by this OpenAI backend per API call |
    | throughput.avgRequestsPerCall | Average requests processed by this OpenAI backend per API call (PTU only) |
    | throughput.tokensPerMinute | Number of tokens processed by this endpoint per minute |
    | throughput.requestsPerMinute | Number of OpenAI API calls handled by this endpoint per minute |
    | latency.avgResponseTimeMsec | Average response time (in milliseconds) of OpenAI backend API calls |
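    To spot-check the gateway-level counters from a terminal, you can query the metrics endpoint and filter the response with jq. This is a minimal sketch, assuming the gateway runs locally on port 8000 in the dev environment:

    # Fetch gateway metrics and print only the top-level API call counters.
    $ curl -s http://localhost:8000/api/v1/dev/apirouter/metrics | \
        jq '{successApiCalls, cachedApiCalls, failedApiCalls, totalApiCalls}'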
  2. Access AI Services Gateway and OpenAI API metrics using Azure Application Insights

    Log in to the Azure Portal and go to the Overview blade/tab of the Application Insights resource. Here you can get a quick summary view of the Gateway server requests, availability, server response time and failed requests over a selected time range (last 30 minutes, 1 hour, up to 30 days). See screenshot below.

    [Screenshot: Application Insights Overview blade]

    Click on the Performance tab to view the average duration for each Gateway endpoint call, call counts, request and response details. See screenshot below.

    [Screenshot: Performance tab showing endpoint durations and call counts]

    Click on the Application map tab to view the Gateway and its dependencies (Azure OpenAI). Here you can drill down into the API calls received by a) the Gateway and b) the Azure OpenAI deployment endpoint, and investigate failures and performance issues.

    [Screenshots: Application map and drill-down views]

D. Reload the AI Services API Gateway backend endpoints (Configuration)

The Gateway endpoint configuration can be updated even when the server is running. Follow the steps below.

  1. Update the Gateway endpoint configuration file.

    Open the Gateway endpoint configuration file (api-router-config.json), update the OpenAI endpoints as needed, and save the file.

  2. Reload the Gateway endpoint configuration.

    Use the curl command in a terminal window, or a web browser, to access the gateway reconfiguration endpoint (/reconfig). See the URL below. The private key of the Gateway is required to reload the endpoint configuration.

    http://localhost:{API_GATEWAY_PORT}/api/v1/{API_GATEWAY_ENV}/apirouter/reconfig/{API_GATEWAY_KEY}
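    For example, assuming the gateway runs locally on port 8000 in the dev environment (substitute your own gateway private key):

    # Optionally snapshot the current metrics first, since a reload resets them (see the note below).
    $ curl -s http://localhost:8000/api/v1/dev/apirouter/metrics > metrics-snapshot.json
    #
    # Reload the Gateway endpoint configuration.
    $ curl http://localhost:8000/api/v1/dev/apirouter/reconfig/[your-gateway-key]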

IMPORTANT:

  • A side effect of reconfiguring the Gateway endpoints is that all current and historical metric values collected and cached by the server are reset. Hence, if you want to retain the metrics history, save the output of the metrics endpoint (/metrics) prior to reloading the updated OpenAI endpoints from the configuration file.
  • It is advised to reconfigure the backend endpoints during a maintenance time window (down time) when there is minimal to no API traffic. Reconfiguring the backend endpoints when the gateway is actively serving API requests may result in undefined behavior.

E. Deploy the AI Services API Gateway on Azure Kubernetes Service

Before proceeding with this section, make sure you have installed the following services on Azure.

  • An Azure Container Registry (ACR) instance
  • An Azure Kubernetes Cluster (AKS) instance

The following command line tools should be installed on the Linux VM.

  • Azure CLI (az)
  • Kubernetes CLI (kubectl)
  • Helm CLI (helm)

Additionally, the following resources should be deployed/configured.

  • A Kubernetes ingress controller (Nginx) should be deployed and running on the AKS / Kubernetes cluster.
  • The AKS cluster should be attached to the ACR instance. The cluster's managed identity should have ACR Pull permissions so cluster nodes can pull container images from the attached ACR.
  1. Push the AI Services Gateway container image into ACR.

    Refer to the script snippet below to push the Gateway container image into ACR. Remember to substitute [acr-name] with the name of your container registry.

    # Login to the ACR instance. Substitute the correct name of your ACR instance.
    $ az acr login --name [acr-name].azurecr.io
    #
    # Tag the container image so we can push it to ACR repo.
    $ docker tag az-oai-api-gateway [acr-name].azurecr.io/az-oai-api-gateway:v1.020224
    # 
    # List container images on your VM
    $ docker images
    #
    # Push the AI Services Gateway container image to ACR repo.
    $ docker push [acr-name].azurecr.io/az-oai-api-gateway:v1.020224
    #

    Use Azure portal to verify the Gateway container image was stored in the respective repository (az-oai-api-gateway).
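    Alternatively, you can verify from the command line with the Azure CLI. A quick sketch, substituting your registry name:

    # List the image tags stored in the 'az-oai-api-gateway' repository.
    $ az acr repository show-tags --name [acr-name] --repository az-oai-api-gateway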

  2. Deploy the Azure OpenAI API endpoints configuration.

    Create a ConfigMap Kubernetes resource containing the Gateway endpoint configurations. See command snippet below.

    # First, create a namespace 'apigateway' where all Gateway resources will be deployed.
    $ kubectl create namespace apigateway
    #
    # List all namespaces.
    $ kubectl get ns
    #
    # Create a ConfigMap containing the Gateway endpoints. Substitute the correct location of the
    # Gateway configuration (json) file.
    #
    $ kubectl create configmap api-gateway-config-cm --from-file=[Path to 'api-router-config.json' file] -n apigateway
    #
    # List ConfigMaps.
    $ kubectl get cm -n apigateway
    #
  3. Review and update the Helm deployment configuration file.

    Go through the variables in the values.yaml file and update them as needed.

    Review/Update the following variable values. See the table below; an illustrative values.yaml excerpt follows it.

    | Variable Name | Description | Default Value |
    | ------------- | ----------- | ------------- |
    | replicaCount | Number of Pod instances (gateway instances) | 1 |
    | image.repository | ACR location of the AI Services Gateway container image. Specify the correct values for acr-name and api-gateway-repo-name. | [acr-name].azurecr.io/[api-gateway-repo-name] |
    | image.tag | Gateway container image tag. Specify the correct value for the image tag. | v1.xxxxxx |
    | apigateway.name | The deployment name of the Gateway server instance. | aoai-api-gateway-v1.5 |
    | apigateway.instanceName | The name of the Gateway server instance. Can be set to any string. Use the model version as a prefix or suffix to easily identify which models are served by specific gateway server instances. | aoai-api-gateway-gpt35 |
    | apigateway.configFile | Path to the Gateway configuration file | /home/node/app/files/api-router-config.json |
    | apigateway.secretKey | Gateway private key. This key is required for reconfiguring the gateway with updated endpoint info. | None |
    | apigateway.env | Gateway environment (e.g., dev, test, pre-prod, prod) | dev |
    | apigateway.metricsCInterval | Backend API metrics collection and aggregation interval (in minutes) | 60 minutes (1 hour) |
    | apigateway.metricsCHistory | Backend API metrics collection history count | 168 (Max. <= 600) |
    | apigateway.appInsightsConnectionString | (Optional) To collect API request telemetry, set this value to the Azure Application Insights resource connection string | None |
    | apigateway.useCache | Global setting to enable semantic caching | false (true/false) |
    | apigateway.cacheInvalSchedule | Run schedule for the cache entry invalidator | "*/45 * * * *" |
    | apigateway.persistPrompts | Global setting to persist prompts in a database table | false (true/false) |
    | apigateway.vectorAiApp | Name of the AI application used for embedding prompts. This value is required if the semantic caching feature is enabled. | None |
    | apigateway.searchEngine | Vector search engine used by the semantic caching feature | Postgresql/pgvector |
    | database.host | PostgreSQL server/host name. Database connection parameter values are required if the semantic caching or prompt persistence features are enabled. | None |
    | database.port | PostgreSQL server listen port | 5432 |
    | database.user | PostgreSQL database user name | None |
    | database.password | PostgreSQL database user password | None |
    | database.name | PostgreSQL database name | None |
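    As an illustration, a minimal set of overrides might look like the excerpt below. The values shown are placeholders, not recommendations, and assume the chart nests the dotted variable names as standard YAML keys.

    # values.yaml (excerpt) -- illustrative values only.
    replicaCount: 2
    image:
      repository: [acr-name].azurecr.io/az-oai-api-gateway
      tag: v1.020224
    apigateway:
      env: dev
      configFile: /home/node/app/files/api-router-config.json
      metricsCInterval: 60
      metricsCHistory: 168
      useCache: false
      persistPrompts: false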
  4. Assign required compute resources to AI Services Gateway pods.

    Review/Update the Pod compute resources as needed in the values.yaml file.
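    For example, a conservative starting point might look like this in values.yaml (illustrative numbers only; assumes the chart exposes a standard Kubernetes resources block):

    # Illustrative Pod compute resources -- tune for your expected API traffic.
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi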

  5. Deploy the AI Services Gateway on Azure Kubernetes Service.

    Refer to the command snippet below to deploy all Kubernetes resources for the Gateway.

    # Make sure you are in the project root directory!
    # Use helm chart to deploy the Gateway. Substitute the correct value for the image tag.
    #
    $ helm upgrade --install az-oai-api-gateway ./aoai-api-gtwy-chart --set image.tag=[image-tag-name] --namespace apigateway
    #
  6. Verify deployment.

    First, confirm the Gateway Pod(s) are running. Refer to the command snippet below.

    # Make sure the AI Services Gateway pods are up and running. Output is shown below the command.
    #
    $ kubectl get pods -n apigateway
    NAME                                  READY   STATUS    RESTARTS   AGE
    aoai-api-gateway-v1-7f7bf5f75-grk6p   1/1     Running   0          11h

    Get the public IP of the Nginx ingress controller (application routing system). Refer to the command snippet below.

    # Get public IP (Azure LB IP) assigned to Nginx ingress controller service. Save (copy) the IP address listed under
    # column 'EXTERNAL-IP' in the command output. See below.
    #
    $ kubectl get svc -n app-routing-system
    NAME    TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                                      AGE
    nginx   LoadBalancer   10.0.114.112   xx.xx.xx.xx   80:30553/TCP,443:32318/TCP,10254:31744/TCP   2d

    Use a web browser to access the Gateway Server instance information endpoint (/instanceinfo). Substitute the public IP of the Nginx ingress controller which you copied in the previous step. See below.

    http://{NGINX_PUBLIC_IP}/api/v1/{API_GATEWAY_ENV}/apirouter/instanceinfo
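    Alternatively, query the endpoint with curl (a sketch; substitute the public IP and your environment value):

    $ curl http://xx.xx.xx.xx/api/v1/dev/apirouter/instanceinfo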

    The JSON output is shown below.

    {
      "serverName": "AOAI-API-Gateway-01",
      "serverVersion": "1.5.0",
      "serverConfig": {
         "host": "10.244.2.6",
         "listenPort": 8000,
         "environment": "dev",
         "persistPrompts": "true",
         "collectInterval": 60,
         "collectHistoryCount": 8,
         "configFile": "/home/node/app/files/api-router-config.json"
      },
      "cacheSettings": {
         "cacheEnabled": true,
         "embeddAiApp": "vectorizedata",
         "searchEngine": "Postgresql/pgvector",
         "cacheInvalidationSchedule": "*/45 * * * *"
      },
      "appConnections": [
         {
             "applicationId": "vectorizedata",
             "cacheSettings": {
                 "useCache": false
             },
             "oaiEndpoints": {
                 "0": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-embedd-ada-002/embeddings?api-version=2023-05-15"
             }
         },
         {
             "applicationId": "aichatbotapp",
             "cacheSettings": {
                 "useCache": true,
                 "searchType": "CS",
                 "searchDistance": 0.95,
                 "searchContent": {
                     "term": "messages",
                     "includeRoles": "system,user,assistant"
                 },
                 "entryExpiry": "7 days"
             },
             "oaiEndpoints": {
                 "0": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-16k/chat/completions?api-version=2023-05-15"
             }
         },
         {
             "applicationId": "aidocusearchapp",
             "cacheSettings": {
                 "useCache": true,
                 "searchType": "CS",
                 "searchDistance": 0.95,
                 "searchContent": {
                     "term": "prompt"
                 },
                 "entryExpiry": "1 hour"
             },
             "oaiEndpoints": {
                 "0": "https://oai-gr-dev.openai.azure.com/openai/deployments/dev-gpt35-turbo-instruct/completions?api-version=2023-05-15",
                 "1": "https://oai-gr-dev.openai.azure.com/openai/deployments/gpt-35-t-inst-01/completions?api-version=2023-05-15"
             }
         }
      ],
      "containerInfo": {
         "imageID": "oaiapigateway.azurecr.io/az-oai-api-gateway:v1.5.031224",
         "nodeName": "aks-nodepool1-35747021-vmss000001",
         "podName": "aoai-api-gateway-v1-7777f5d8bd-tt7kq",
         "podNamespace": "apigateway",
         "podServiceAccount": "default"
      },
      "nodejs": {
         "node": "20.11.0",
         "acorn": "8.11.2",
         "ada": "2.7.4",
         "ares": "1.20.1",
         "base64": "0.5.1",
         "brotli": "1.0.9",
         "cjs_module_lexer": "1.2.2",
         "cldr": "43.1",
         "icu": "73.2",
         "llhttp": "8.1.1",
         "modules": "115",
         "napi": "9",
         "nghttp2": "1.58.0",
         "nghttp3": "0.7.0",
         "ngtcp2": "0.8.1",
         "openssl": "3.0.12+quic",
         "simdutf": "4.0.4",
         "tz": "2023c",
         "undici": "5.27.2",
         "unicode": "15.0",
         "uv": "1.46.0",
         "uvwasi": "0.0.19",
         "v8": "11.3.244.8-node.17",
         "zlib": "1.2.13.1-motley-5daffc7"
      },
      "apiGatewayUri": "/api/v1/dev/apirouter",
      "endpointUri": "/api/v1/dev/apirouter/instanceinfo",
      "serverStartDate": "3/12/2024, 11:45:10 PM",
      "status": "OK"
    }

    Use curl or Postman to send a few completion / chat completion API requests to the gateway server load balancer endpoint (/lb). See the URL below.

    http://{NGINX_PUBLIC_IP}/api/v1/{API_GATEWAY_ENV}/apirouter/lb/{AI_APPLICATION_ID}
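    For example, a chat completions request for the aichatbotapp application defined in the sample configuration above might look like the sketch below. Any authentication headers required will depend on how your gateway and ingress are configured.

    $ curl -X POST "http://{NGINX_PUBLIC_IP}/api/v1/dev/apirouter/lb/aichatbotapp" \
        -H "Content-Type: application/json" \
        -d '{
              "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is an API gateway?"}
              ],
              "max_tokens": 100
            }'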

    Congratulations!

    You have reached the end of this how-to for deploying and scaling an AI Services API Gateway on Azure. Please feel free to customize and use the artifacts posted in this repository to efficiently distribute AI Application traffic among multiple Azure OpenAI model deployments and elastically scale the Gateway solution.
