Skip to content

Lab environment for demostrate building a simple search engine with LLM model and PostgreSQL database with pgvector extension

Notifications You must be signed in to change notification settings

Vivicorp-AWS/cdk-lab-pgvector-igdb

Repository files navigation

CDK Lab - AI-powered Search Solution with PostgreSQL & pgvector

When we talk about NLP techniques and Large Language Models (LLM), a common way to build a search engine application is transforming the text data into vector embeddings, then calculate the similarity between the embeddings.

Dedicated vector database like Pinecone and Faiss are good choices for storing and making search, but we still want to use regular relational databases because they are more common to be used. With pgvector, this extension gives PostgreSQL the power to easily store vector data and make search with them.

This project refers to a solution delivered by AWS: "Building AI-powered search in PostgreSQL using Amazon SageMaker and pgvector" (GitHub Repo), and makes some improvements to make it more efficient, including:

  • Use CDK to deploy the whole stack instead of deployed with CloudFormation template
  • Provide a sample dataset with IDGB's data, and use a Lambda function to import them into PostgreSQL database when initializing the database instance which saves a lot of time
    • And it's easy to replace the data so you can import your own
    • If you're interested in the way to produce the sample dataset, it's put in another repo
  • Change the model endpoint to a Serverless one to lower the inferencing cost
  • Add a little inference application built with Gradio

Components to be deployed

  • A VPC with 2 Available Zone, and each az contains 2 subnets (1 public and 1 private)
  • A RDS for PostgreSQL database instance which is compitable for pgvector extension
  • A Lambda function that can import example dataset into database automatically after the database is initialized compleleted
  • A Serverless SageMaker Model Endpoint with the "all-MiniLM-L6-v2" pre-trained SentenceTransformers model, which is balanced in performance and speed
  • A SageMaker Notebook instance to make inference or interact with the model endpoint

File Structure

├── README.md
├── app.py                                       ## Entrypoint
├── assets
│   ├── nintendo_switch_games.csv                ## The source dataset
│   ├── nintendo_switch_games_cls_pooling.json   ## Dataset with embeddings processed with only CLS pooling
│   └── nintendo_switch_games_mean_pooling.json  ## Dataset with embeddings processed with Mean pooling
├── cdk.context.json.example                     ## Example CDK runtime context file
├── lambda
│   ├── __init__.py
│   └── index.py                                 ## Lambda funciton to import sample dataset into database
├── model
│   └── code                                     ## Custom inference script for HuggingFace model
├── notebooks
│   ├── 1-get-embeddings-and-import.ipynb        ## Example notebook to create and import embeddings
│   ├── 2-1-inference-in-notebook.ipynb          ## Example notebook to make inferences inline
│   └── 2-2-inference-with-gradio.ipynb          ## Example notebook to make inference with Gradio app
├── poetry.lock
├── pyproject.toml
├── requirements-layer.txt                       ## Lambda function's Python dependencies list
├── requirements.txt
├── scripts
│   └── get_assets.sh                            ## Script to archive model into a single file and download sample datasets
└── stacks
    ├── __init__.py
    ├── lambda_stack.py
    ├── rds_stack.py
    ├── s3_stack.py
    ├── sagemaker_stack.py
    ├── top_stack.py
    └── vpc_stack.py

Usage

Step 0: Prepare the credentials

Create the config file by executing:

aws configure

Step 1: Fill in the necessary information into CDK runtime context

Copy the example CDK runtime context file cdk.context.json.example to cdk.context.json, fill in the Region and Stack Prefix information, for example:

{
  "region": "us-east-1",
  "prefix": "yet-another-cdk-project"
}

Step 2: Archive the model, and download the sample dataset files

Execute the script at the root directory of this project:

./scripts/get_assets.sh

This script will do the following steps

  1. Download The "all-MiniLM-L6-v2" pre-trained SentenceTransformers model artifact from HuggingFace, and pack with the Inference Code (Made with SageMaker Hugging Face Inference Toolkit)
  2. Example IGDB Dataset

All of them are saved in ./assets directory.

Step 3: Deploy with CDK toolkit (cdk command)

Install the CDK toolkit then deploy by executing:

cdk deploy --all --require-approval=never

Step 4: Make inferences

After all stacks deployed, visit SageMaker Notebook service page, find the launched Notebook instance with a pgvectorNotebook postfix in its name, click "Open Jupyter Lab" link to open Jupyter Lab.

The notebook will automatically clone this repo, get into the notebooks directory, choose a notebook that fits your usage:

  • 1-get-embeddings-and-import.ipynb: Create and import embeddings into database
  • *2-1-inference-in-notebook.ipynb: Make inferences inside the notebook
  • 2-2-inference-with-gradio.ipynb: Make inference with Gradio app

(Optional) Step 5: Clean all resources

Not going to use these anymore? Remove them with:

cdk destroy --all

Future Improvement Suggestions

Security

The default service role for SageMaker might be too large, follow the Least authrorities principle and narrow it down.

The Performance of the Search Function is not good enough

If we try to search the game with question like "One of the main characters is a man with a red hat" or "A platform game with a pink character" and imagine the answer will be something with Mario and Kirby, the result will let you down.

"all-MiniLM-L6-v2" is a good model, but if we look into the datasets used to train this model...... I'm not sure that if they contain enough gaming data or not. Also, the reasoning ability is definitely not the main feature of it compared to LLMs like GPT-3, GPT-4..., etc.

Maybe improve the results by implementing a Large Language Model in the future.

About

Lab environment for demostrate building a simple search engine with LLM model and PostgreSQL database with pgvector extension

Topics

Resources

Stars

Watchers

Forks