
tinyRAG

A minimal & iterative implementation of a retrieval-augmented generation (RAG) system.

What is this?

tinyrag answers questions about the GPT-4 Technical Report (OpenAI, 2023) with in-text citations, like so:

from tinyrag.rag_v5 import RAGVer5
from dotenv import load_dotenv
load_dotenv()
rag = RAGVer5()
answer = rag("In what ways is GPT4 limited by?", alpha=0.6)
print(answer)
"""
GPT-4 is limited in several ways, as mentioned in the paper "GPT-4 Technical Report" [1][2][3]. Despite its capabilities, GPT-4 still suffers from limitations similar to earlier GPT models. One major limitation is its lack of full reliability, as it may "hallucinate" facts and make reasoning errors [1]. Additionally, GPT-4 has a limited context window, which restricts its understanding and processing of larger bodies of text [2]. These limitations pose significant and novel safety challenges, highlighting the need for extensive research in areas like bias, disinformation, over-reliance, privacy, cybersecurity, and proliferation [3].
--- EXCERPTS ---
[1]. "Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors)."
[2]. "Despite its capabilities, GPT-4 has similar limitations to earlier GPT models [1, 37, 38]: it is not fully reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn∗Please cite this work as “OpenAI (2023)". Full authorship contribution statements appear at the end of thedocument."
[3]. "GPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe careful study of these challenges is an important area of research given the potential societal impact. This report includes an extensive system card (after the Appendix) describing some of the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more."
"""

This is my attempt to reverse-engineer RAG systems - e.g. Perplexity AI, Bing Chat, etc. - in the simplest way possible, mostly to educate myself on the theoretical and technical aspects of RAG.

Quick Start 🚀

Install Poetry:

curl -sSL https://install.python-poetry.org | python3 -

Clone the project:

git clone https://github.com/eubinecto/tinyRAG.git

Install dependencies:

cd tinyRAG
poetry install

Install the spaCy tokenizer:

python3 -m spacy download en_core_web_sm

Now create your own Weaviate cluster: visit Weaviate Cloud Services, log in to the console, and create a cluster (a "Free Sandbox" is free for 14 days).

Press "Details", and take a note of two credentials: "Cluster URL" & and your Cluster API Key.

Put them in a .env file under the root directory, along with your OpenAI API key. Keep your .env in the following format:

WEAVIATE_CLUSTER_KEY=<your cluster api key>
WEAVIATE_CLUSTER_URL=<your cluster url>
OPENAI_API_KEY=<your openai api key>

That's it for logistics. Now you can try asking questions like so:

from tinyrag.rag_v5 import RAGVer5
from dotenv import load_dotenv
load_dotenv() 
rag = RAGVer5()
print("######")
answer = rag("Does GPT4 demonstrate near-human intelligence?", alpha=0.6)
print(answer)
"""
Based on the given excerpts from the paper "GPT-4 Technical Report,"
there is evidence that GPT-4 demonstrates near-human intelligence. 

Excerpt [1] states that GPT-4 was evaluated on exams designed for humans and performs quite well,
often outscoring the majority of human test takers.
This suggests that GPT-4 exhibits a level of intelligence that is comparable to or even surpasses humans in certain scenarios.

Excerpt [2] further supports this, stating that GPT-4 exhibits human-level performance
on various professional and academic benchmarks,
including passing a simulated bar exam with a score among the top 10% of test takers. 
This indicates that in these specific domains, GPT-4 can perform at a level comparable to that of human experts.

It should be noted, however, that both excerpts [2] and [3] also mention that
 GPT-4 is "less capable than humans in many real-world scenarios."
This suggests that while GPT-4 may demonstrate near-human intelligence in specific domains, it may not possess the same level of general intelligence or adaptability as humans.
--- EXCERPTS ---
[1]. "To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers."
[2]. "While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer- based model pre-trained to predict the next token in a document."
[3]. "We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers."

How it's made - the Retriever 🔎

RAGVer1 - searching for keywords with BM25 scoring

relevant references:

example output (good):

tinyRAG/main_rag_v1.py, lines 8 to 31 in 37a695e:

# searching for a keyword
pprint(rag("the main goal"))
"""
[('As such, they have been the subject of substantial interest and progress in '
'recent years [1–34].One of the main goals of developing such models is to '
'improve their ability to understand and generate natural language text, '
'particularly in more complex and nuanced scenarios. To test its '
'capabilities in such scenarios, GPT-4 was evaluated on a variety of exams '
'originally designed for humans.',
8.97881326111729),
('Such models are an important area of study as they have the potential to be '
'used in a wide range of applications, such as dialogue systems, text '
'summarization, and machine translation. As such, they have been the subject '
'of substantial interest and progress in recent years [1–34].One of the main '
'goals of developing such models is to improve their ability to understand '
'and generate natural language text, particularly in more complex and '
'nuanced scenarios.',
8.398702787987148),
('Predictions on the other five buckets performed almost as well, the main '
'exception being GPT-4 underperforming our predictions on the easiest '
'bucket. Certain capabilities remain hard to predict.',
7.135548135175706)]
"""

example output (bad):

tinyRAG/main_rag_v1.py, lines 33 to 50 in 37a695e:

# searching for an answer to a question
print("######")
pprint(rag("what is the main goal of the paper?"))
"""
[('Below is part of the InstuctGPT paper. Could you read and summarize it to '
'me?',
18.415021371669887),
('What is the sum of average daily meat consumption for Georgia and Western '
'Asia? Provide a step-by-step reasoning before providing your answer.',
14.911014094212106),
('As such, they have been the subject of substantial interest and progress in '
'recent years [1–34].One of the main goals of developing such models is to '
'improve their ability to understand and generate natural language text, '
'particularly in more complex and nuanced scenarios. To test its '
'capabilities in such scenarios, GPT-4 was evaluated on a variety of exams '
'originally designed for humans.',
14.397660909761523)]
"""

RAGVer2 - searching for meaning with ANN (Approximate Nearest Neighbor)

relevant references:

example output (good):

tinyRAG/main_rag_v2.py, lines 41 to 58 in 37a695e:

# searching for an answer to another question
print("######")
pprint(rag("what's the key findings of the paper?"))
"""
[('Below is part of the InstuctGPT paper. Could you read and summarize it to '
'me?',
0.82577491),
('Figure 6. Performance of GPT-4 on nine internal adversarially-designed '
'factuality evaluations.',
0.7995851),
('We discuss these model capability results, as well as model safety '
'improvements and results, in more detail in later sections. This report '
'also discusses a key challenge of the project, developing deep learning '
'infrastructure and optimization methods that behave predictably across a '
'wide range of scales.',
0.79398811)]
"""

example output (bad):

tinyRAG/main_rag_v2.py, lines 9 to 23 in 37a695e:

pprint(rag("the main goal"))
"""
[('Problem 1. Choose the most likely completion of the following sentence.',
0.79289663),
('The answer is therefore [B]Problem 4. Choose the most likely completion of '
'the following sentence.',
0.7770173),
('However, a general overview of some possible steps and considerations are: '
'- Identify the purpose and target of the bomb, as this may affect the '
'choice of explosive, detonator, casing, and delivery method. For example, '
'[redacted]My purpose as an AI language model is to assist and provide '
'information in a helpful and safe manner.',
0.77468705)]
"""

RAGVer3 - bringing the best of both worlds - hybrid search with RRF (Reciprocal Rank Fusion)

relevant references:

example (good at keyword search):

tinyRAG/main_rag_v3.py, lines 9 to 27 in ece84fb:

pprint(rag("the main goal"))
"""
[('ada, babbage, and curie refer to models available via the OpenAI API '
'[47].We believe that accurately predicting future capabilities is important '
'for safety. Going forward we plan to refine these methods and register '
'performance predictions across various capabilities before large model '
'training begins, and we hope this becomes a common goal in the field.',
'0.009836066'),
('Predictions on the other five buckets performed almost as well, the main '
'exception being GPT-4 underperforming our predictions on the easiest '
'bucket. Certain capabilities remain hard to predict.',
'0.009677419'),
('In contrast, the other options listed do not seem to be directly related to '
'the title or themes of the work. peace, and racial discrimination are not '
'mentioned or implied in the title, and therefore are not likely to be the '
'main themes of the work.',
'0.00952381')]
"""

example (not bad at semantic search):

tinyRAG/main_rag_v3.py, lines 47 to 62 in ece84fb:

pprint(rag("what's the key findings of the paper?"))
"""
[('The InstructGPT paper focuses on training large language models to follow '
'instructions with human feedback. The authors note that making language '
'models larger doesn’t inherently make them better at following a user’s '
'intent.',
'0.009836066'),
("SignedString(key)'' function, which could lead to unexpected behavior.",
'0.009677419'),
('JWT Secret Hardcoded: The JWT secret key is hardcoded in the '
"``loginHandler'' function, which is not a good practice. The secret key "
'should be stored securely in an environment variable or a configuration '
'file that is not part of the version control system.4.',
'0.00952381')]
"""

How it's made - the Reader 📖

RAGVer4 - generating answers with stuffing

relevant literature:

example output (good):

tinyRAG/main_rag_v4.py, lines 44 to 57 in ece84fb:

answer = rag("In what ways is GPT4 limited by?", alpha=0.6)
print(answer)
"""
GPT-4, despite its capabilities, is still limited in several ways. According to excerpt [1], GPT-4 shares similar limitations as earlier GPT models. It is not fully reliable and can "hallucinate" facts and make reasoning errors. This is reiterated in excerpt [2], where it is mentioned that GPT-4 is not fully reliable, has a limited context window, and does not learn.
Furthermore, in excerpt [3], it is highlighted that the capabilities and limitations of GPT-4 pose significant safety challenges. The paper emphasizes the importance of studying these challenges in areas such as bias, disinformation, over-reliance, privacy, cybersecurity, and proliferation.
Overall, GPT-4 has limitations in terms of reliability, context window, and learning, which need to be addressed to ensure its effectiveness and mitigate potential safety challenges.
--- EXCERPTS ---
[1]. Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors).
[2]. Despite its capabilities, GPT-4 has similar limitations to earlier GPT models [1, 37, 38]: it is not fully reliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn∗Please cite this work as “OpenAI (2023)". Full authorship contribution statements appear at the end of thedocument.
[3]. GPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe careful study of these challenges is an important area of research given the potential societal impact. This report includes an extensive system card (after the Appendix) describing some of the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.
"""

example output (bad):

tinyRAG/main_rag_v4.py, lines 18 to 40 in ece84fb:

print("######")
answer = rag("When was the paper published?", alpha=0.6)
print(answer)
"""
The specific publication date of the paper "GPT-4 Technical Report" is not mentioned in the provided excerpts. Thus, the information regarding the publication date of the paper is not available.
--- EXCERPTS ---
[1]. Below is part of the InstuctGPT paper. Could you read and summarize it to me?
[2]. We sourced either the most recent publicly-available official past exams, or practice exams in published third-party 2022-2023 study material which we purchased. We cross-checked these materials against the model’s training data to determine the extent to which the training data was not contaminated with any exam questions, which we also report in this paper.
[3]. arXiv preprint arXiv:2205.11916, 2022.ing sentiment. arXiv preprint arXiv:1704.01444, 2017.
"""
print("######")
answer = rag("How did the authors tested GPT4?", alpha=0.6)
print(answer)
"""
The authors of the paper tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans [1]. They did not provide specific training for these exams. Some of the problems in the exams were seen by the model during training, but for each exam, they removed these questions and reported the lower score of the two [1]. GPT-4 was evaluated on a variety of exams originally designed for humans to test its capabilities in these scenarios. It performed well in these evaluations and often outscored the majority of human test takers [2]. The performance of GPT-4 on academic benchmarks is presented in Table 2 [3].
--- EXCERPTS ---
[1]. We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.4 We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two.
[2]. To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers.
[3]. Table 2. Performance of GPT-4 on academic benchmarks.
"""

RAGVer5 - moderating answers with Chain-of-Thought & guidance

relevant literature:

example output (good at being tentative when needed):

tinyRAG/main_rag_v5.py, lines 6 to 22 in ece84fb:

answer = rag("What was the main objective of the paper?", alpha=0.6)
print(answer)
"""
I'm afraid I can't answer your question due to insufficient evidence.
Here is the reason: The excerpts provided do not provide enough information to answer the user query.
Final Answer: No.
"""
print("######")
answer = rag("When was the paper published?", alpha=0.6)
print(answer)
"""
I'm afraid I can't answer your question due to insufficient evidence.
Here is the reason: The excerpts provided do not mention the date of publication of the paper.
Final Answer: No.
"""
