Retrieval Augmented Generation for EIC

This project, currently under development, builds a RAG-based system for the upcoming Electron-Ion Collider (EIC).

There are three main parts to the RAG pipeline.

Ingestion

Ingestion in Retrieval-Augmented Generation (RAG) is the process of preparing and organizing the data that the model will draw on. It breaks down into three main steps: chunking the information, encoding the chunks with an embedding model, and storing the encoded chunks in a vector database.

Chunking
Encoding chunked information into a vector using an embedding model (e.g. BERT, seq2seq, text2vec)
Storing the encoded information in a vector database.
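
The three steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as BERT, and `InMemoryVectorStore` is a hypothetical stand-in for a real vector database. The sample chunks are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Return a unit-length sparse bag-of-words vector (word -> weight).

    Hypothetical stand-in for a real embedding model; it only captures
    exact word overlap, not meaning.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(n * n for n in counts.values()))
    return {word: n / norm for word, n in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class InMemoryVectorStore:
    """Hypothetical minimal store: keeps (vector, chunk) pairs in a list."""

    def __init__(self) -> None:
        self._rows = []

    def add(self, chunk: str) -> None:
        """Encode a chunk and store it next to its vector."""
        self._rows.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored chunks most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = InMemoryVectorStore()
for chunk in ["the detector measures electron scattering",
              "streamlit serves the web interface"]:
    store.add(chunk)

print(store.search("electron detector"))
# ['the detector measures electron scattering']
```

A real pipeline would swap `toy_embed` for a trained model and the list for an indexed database, but the flow (chunk, encode, store, then search by similarity) stays the same.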

Chunking

This is the first step in the ingestion process. The raw data, which can come in various forms such as a large corpus of text, is divided into manageable chunks or segments. The size of these chunks can vary depending on the requirements of the task at hand. Chunking reduces the complexity of the data and makes it easier for the model to process the information.
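
As a rough illustration of chunking, the sketch below splits text into fixed-size character windows. The overlap between consecutive chunks is an assumption added here (a common way to avoid cutting context at chunk boundaries); the function name `chunk_text` and its default sizes are hypothetical and not taken from the project.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Splitting on characters is the simplest choice; splitting on sentences or tokens follows the same pattern with a different unit.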

Retrieval

Content Fusion and Generation

Types of RAG system

A very recent survey paper summarizes the types of RAG system¹. Broadly, there are three types of RAG architecture, distinguished by where the LLM is used in the pipeline.

Project Milestones

Building a Naive RAG for EIC using the 200 papers from arXiv on EIC. ✅
- The backend is a relatively straightforward RAG architecture in which data is ingested using PyPDF.
- The frontend is a simple web interface that lets the user upload a PDF and get back a list of papers relevant to the input.
- Report evaluated RAGAS metrics for the built architecture.
- Publish this in the proceedings of AI4EIC-2023. 🧑‍🏭
Going beyond Naive RAG: towards building a RAG architecture with testable evaluation metrics. 🧑‍🏭
- This requires going beyond
Multimodal output as a proof of concept.
- Storing metadata about tables etc.
- Using agents in LangChain to build a LaTeX report.

References

How tos

Running the webapp

To run the Streamlit app, do the following:

Clone the repository and enter it: `git clone https://github.com/wmdataphys/EIC-RAG-Project.git && cd EIC-RAG-Project`
It is better to use a separate Python environment in case of version conflicts with other projects. I use conda: `conda create --name env_RAG-EIC python=3.10`. This creates an environment with Python 3.10, which was stable when I started building the app. Once it is created, activate it with `conda activate env_RAG-EIC`.
Install pip before installing the other packages: `conda install pip`
Install all the requirements: `pip install -r requirements.txt`
Ask [email protected] for the `secrets.toml` and `config.toml` files.
Create a folder named `.streamlit` in the repository's top-level directory (the parent of `streamlit_app`) and move `secrets.toml` and `config.toml` into it.
Now run `streamlit run streamlit_app/AI4EIC-RAGAS4EIC.py`. The app should be served at http://localhost:8050
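
Before starting the app, it can help to confirm that the two configuration files are where Streamlit will look for them. The following is a small optional sketch, not part of the repository: the helper `missing_streamlit_files` and its `project_root` parameter are hypothetical, and it assumes the commands above are run from the repository root.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ("secrets.toml", "config.toml")


def missing_streamlit_files(project_root: str = ".") -> list[str]:
    """Return the required files that are absent from <project_root>/.streamlit."""
    config_dir = Path(project_root) / ".streamlit"
    return [name for name in REQUIRED_FILES if not (config_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_streamlit_files()
    if missing:
        print("Missing in .streamlit/:", ", ".join(missing))
    else:
        print("Both files found; the app can be started.")
```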

Updating `requirements.txt`

If the app uses a new library that has to be installed through pip, regenerate `requirements.txt` using the `--format freeze` option.
The command is `pip list --format freeze > requirements.txt`
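
The README does not say why `--format freeze` is preferred. One plausible reason (an assumption) is that it writes plain `name==version` lines, whereas plain `pip freeze` can emit `name @ file:///...` entries for locally built packages, which do not install on another machine. The sketch below is a hypothetical checker for such lines; the package versions and path in the example are made up.

```python
import re

# A pinned requirement of the form name==version, e.g. "numpy==1.26.0".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==\S+$")


def unpinned_lines(requirements_text: str) -> list[str]:
    """Return non-empty, non-comment lines that are not name==version."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not PINNED.match(line):
            bad.append(line)
    return bad


# Illustrative content only; not the project's actual requirements.
example = "streamlit==1.29.0\nnumpy @ file:///tmp/build/numpy\n"
print(unpinned_lines(example))
# ['numpy @ file:///tmp/build/numpy']
```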

¹ Types of RAG
