When we talk about NLP techniques and Large Language Models (LLMs), a common way to build a search application is to transform the text data into vector embeddings, then calculate the similarity between those embeddings.
Dedicated vector stores like Pinecone and Faiss are good choices for storing and searching embeddings, but we often still want a regular relational database because it is more widely used. The pgvector extension gives PostgreSQL the power to easily store vector data and search over it.
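As a rough sketch of what this looks like in practice, the snippet below builds pgvector-style SQL. The table and column names are hypothetical (the actual schema is created by this project's Lambda function); the `<=>` cosine-distance operator and the `vector(n)` column type are pgvector's own.

```python
# Minimal sketch of a pgvector similarity search. Table and column names are
# hypothetical; the real schema is created by this project's import Lambda.

def to_vector_literal(embedding):
    """Format a Python list of floats as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

# pgvector must be enabled once per database:
CREATE_EXTENSION_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"

# A 384-dimensional vector column matches all-MiniLM-L6-v2's output size:
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS games (
    id serial PRIMARY KEY,
    name text,
    embedding vector(384)
);
"""

# '<=>' is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT name, embedding <=> %s AS distance
FROM games
ORDER BY distance
LIMIT 5;
"""

query_vector = to_vector_literal([0.1, 0.2, 0.3])
# With psycopg2 this would be executed as:
#   cur.execute(SEARCH_SQL, (query_vector,))
```
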
This project is based on a solution delivered by AWS: "Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector" (GitHub Repo), with several improvements to make it more efficient:
- Uses CDK to deploy the whole stack instead of a CloudFormation template
- Provides a sample dataset built from IGDB data, and a Lambda function that imports it into the PostgreSQL database when the database instance is initialized, which saves a lot of time
- The data is easy to replace, so you can import your own
- If you're interested in how the sample dataset was produced, it lives in another repo
- Changes the model endpoint to a serverless one to lower inference cost
- Adds a small inference application built with Gradio
- A VPC with 2 Availability Zones, each containing 2 subnets (1 public and 1 private)
- An RDS for PostgreSQL database instance that is compatible with the pgvector extension
- A Lambda function that automatically imports the example dataset into the database after the database initialization completes
- A serverless SageMaker model endpoint serving the "all-MiniLM-L6-v2" pre-trained SentenceTransformers model, which balances performance and speed
- A SageMaker Notebook instance for making inferences and interacting with the model endpoint
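The serverless endpoint can be invoked from any Python environment with AWS credentials. The sketch below is an assumption, not code from this repo: the endpoint name is made up, and the `{"inputs": ...}` payload shape follows the SageMaker Hugging Face Inference Toolkit convention mentioned later in this README.

```python
import json

# Sketch of calling the serverless SageMaker endpoint. The endpoint name is
# hypothetical; the payload shape follows the SageMaker Hugging Face Inference
# Toolkit convention ({"inputs": [...]}).

def build_payload(texts):
    """Serialize a list of sentences into the JSON body the endpoint expects."""
    return json.dumps({"inputs": texts})

def embed(texts, endpoint_name="pgvector-embedding-endpoint"):
    """Call the endpoint and return embeddings (requires AWS credentials)."""
    import boto3  # deferred so the sketch can be read without boto3 installed
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(texts),
    )
    return json.loads(response["Body"].read())
```
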
├── README.md
├── app.py ## Entrypoint
├── assets
│ ├── nintendo_switch_games.csv ## The source dataset
│ ├── nintendo_switch_games_cls_pooling.json ## Dataset with embeddings processed with only CLS pooling
│ └── nintendo_switch_games_mean_pooling.json ## Dataset with embeddings processed with Mean pooling
├── cdk.context.json.example ## Example CDK runtime context file
├── lambda
│ ├── __init__.py
│ └── index.py ## Lambda function to import sample dataset into database
├── model
│ └── code ## Custom inference script for HuggingFace model
├── notebooks
│ ├── 1-get-embeddings-and-import.ipynb ## Example notebook to create and import embeddings
│ ├── 2-1-inference-in-notebook.ipynb ## Example notebook to make inferences inline
│ └── 2-2-inference-with-gradio.ipynb ## Example notebook to make inference with Gradio app
├── poetry.lock
├── pyproject.toml
├── requirements-layer.txt ## Lambda function's Python dependencies list
├── requirements.txt
├── scripts
│ └── get_assets.sh ## Script to archive model into a single file and download sample datasets
└── stacks
├── __init__.py
├── lambda_stack.py
├── rds_stack.py
├── s3_stack.py
├── sagemaker_stack.py
├── top_stack.py
└── vpc_stack.py
Create the config file by executing:
aws configure
Copy the example CDK runtime context file cdk.context.json.example to cdk.context.json, then fill in the Region and Stack Prefix information, for example:
{
"region": "us-east-1",
"prefix": "yet-another-cdk-project"
}
Execute the script at the root directory of this project:
./scripts/get_assets.sh
This script performs the following steps:
- Downloads the "all-MiniLM-L6-v2" pre-trained SentenceTransformers model artifact from HuggingFace, and packs it with the inference code (built with the SageMaker Hugging Face Inference Toolkit)
- Downloads the example IGDB dataset
All of them are saved in the ./assets directory.
Install the CDK toolkit then deploy by executing:
cdk deploy --all --require-approval=never
After all stacks are deployed, visit the SageMaker Notebook service page, find the launched Notebook instance with a pgvectorNotebook suffix in its name, and click the "Open Jupyter Lab" link to open Jupyter Lab.
The notebook instance will automatically clone this repo. Go into the notebooks directory and choose a notebook that fits your use case:
- 1-get-embeddings-and-import.ipynb: Create and import embeddings into the database
- 2-1-inference-in-notebook.ipynb: Make inferences inside the notebook
- 2-2-inference-with-gradio.ipynb: Make inferences with the Gradio app
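The two JSON assets in ./assets differ only in how the model's token embeddings are pooled into one sentence vector. As a rough illustration (with made-up numbers, not real transformer outputs), CLS pooling takes only the first token's embedding, while mean pooling averages all non-padding tokens:

```python
import numpy as np

# Illustrates the pooling strategies behind the two sample datasets
# (cls_pooling vs mean_pooling). The token embeddings here are made up;
# real ones come from all-MiniLM-L6-v2's transformer outputs.

def cls_pooling(token_embeddings):
    """Take the embedding of the first ([CLS]) token."""
    return token_embeddings[0]

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

tokens = np.array([
    [1.0, 0.0],   # [CLS] token
    [3.0, 2.0],   # a real token
    [0.0, 0.0],   # padding
])
attention_mask = [1, 1, 0]

print(cls_pooling(tokens))                   # [1. 0.]
print(mean_pooling(tokens, attention_mask))  # [2. 1.]
```
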
Not going to use these anymore? Remove them with:
cdk destroy --all
The default service role for SageMaker might be too broad; follow the principle of least privilege and narrow it down.
If we search for a game with a question like "One of the main characters is a man with a red hat" or "A platform game with a pink character", expecting answers involving Mario or Kirby, the results will let you down.
"all-MiniLM-L6-v2" is a good model, but looking at the datasets used to train it, I'm not sure they contain enough gaming data. Also, reasoning ability is definitely not its main strength compared to LLMs like GPT-3 and GPT-4.
Perhaps the results could be improved by incorporating a Large Language Model in the future.