# Adds Unit and Integration tests for MongoDBAtlasVectorSearch #12854

**Merged**: logan-markewich merged 24 commits into run-llama:main from caseyclements:feature/mongodb-datastore on May 10, 2024. (Changes shown from 17 of the 24 commits.)

## Commits
- `7a85dd5` caseyclements: PYTHON-4160 MongoDBAtlasVectorSearch Cleanup. id -> _id. delete_one -…
- `9019ffb` caseyclements: PYTHON-4160 Created unit and integration tests
- `6511be8` caseyclements: Removed unused imports
- `0089e8c` caseyclements: Switch dependency in pyproject from llama-index-core to llama-index t…
- `a229bf5` caseyclements: Removed hardcode in test
- `a923060` caseyclements: Removed unused import
- `1e93f61` caseyclements: [PYTHON-4307] Retries query until response contains number requested
- `f774b86` caseyclements: [PYTHON-4307] assert response contains number requested AND retries
- `0d0e9ce` caseyclements: [PYTHON-4307] Loosened assertion in test
- `126fb61` caseyclements: Added markdown to describe Atlas setup.
- `e724632` caseyclements: Moved setup.md to llama_index/vector_stores/mongodb
- `8d264d7` caseyclements: Added __init__ to embeddings as it was not properly set up as a package
- `8c48295` caseyclements: Linting
- `c90490c` caseyclements: Bump micro version of llama-index-vector-stores-mongodb
- `ccd1ac9` caseyclements: Updated dependencies. llama-index-embeddings-openai is now a dev.depe…
- `d30558f` caseyclements: Added llama-index-llms-openai and +llama-index-readers-file to dev de…
- `74f3a76` caseyclements: Update stopping condition in test_vectorstore
- `f2b29ee` caseyclements: Moved setup.md into README
- `a61f8dd` logan-markewich: add build file
- `7376e7f` caseyclements: Standardized environ variable naming: MONGODB_URI
- `c0e0559` caseyclements: Skip tests if appropriate environment variable, OPENAI_API_KEY or MON…
- `1f22c8a` logan-markewich: fix tests
- `45d667c` caseyclements: Added typehints to tests
- `39f23d2` logan-markewich: fix integration tests
**pyproject.toml**

```diff
@@ -21,13 +21,15 @@ ignore_missing_imports = true
 python_version = "3.8"

 [tool.poetry]
-authors = ["Your Name <[email protected]>"]
+authors = [
+    "The MongoDB Python Team",
+]
 description = "llama-index vector_stores mongodb integration"
 exclude = ["**/BUILD"]
 license = "MIT"
 name = "llama-index-vector-stores-mongodb"
 readme = "README.md"
-version = "0.1.4"
+version = "0.1.5"

 [tool.poetry.dependencies]
 python = ">=3.8.1,<4.0"

@@ -37,6 +39,9 @@ pymongo = "^4.6.1"
 [tool.poetry.group.dev.dependencies]
 ipython = "8.10.0"
 jupyter = "^1.0.0"
+llama-index-embeddings-openai = "^0.1.5"
+llama-index-llms-openai = "^0.1.13"
+llama-index-readers-file = "^0.1.4"
 mypy = "0.991"
 pre-commit = "3.2.0"
 pylint = "2.15.10"
```
**llama-index-integrations/vector_stores/llama-index-vector-stores-mongodb/setup.md** (128 additions, 0 deletions)
# Setting up MongoDB Atlas as the Datastore Provider

MongoDB Atlas is a multi-cloud database service made by the same people that build MongoDB. Atlas simplifies deploying and managing your databases while offering the versatility you need to build resilient and performant global applications on the cloud providers of your choice.

You can perform semantic search on data in your Atlas cluster running MongoDB v6.0.11, v7.0.2, or later using Atlas Vector Search. You can store vector embeddings for any kind of data alongside the other data in your collection on the Atlas cluster.

In this section, we set up a cluster and a database, test the connection, and finally create an Atlas Vector Search index.

### Deploy a Cluster

Follow the [Getting-Started](https://www.mongodb.com/basics/mongodb-atlas-tutorial) documentation to create an account, deploy an Atlas cluster, and connect to a database.

### Retrieve the URI used by Python to connect to the Cluster

When you deploy the ChatGPT Retrieval App, this will be stored as the environment variable `MONGO_URI`. It will look something like the following. The username and password, if not provided, can be configured in _Database Access_ under Security in the left panel.

```
export MONGO_URI="mongodb+srv://<username>:<password>@chatgpt-retrieval-plugin.zeatahb.mongodb.net/?retryWrites=true&w=majority"
```
There are a number of ways to navigate the Atlas UI. Keep your eye out for "Connect" and "driver".

On the left panel, navigate to and click _Database_ under DEPLOYMENT. Click the Connect button that appears, then Drivers, and select Python. (Don't worry about the version shown; it refers to the PyMongo version, not the Python version.) Once you have the Connect window open, you will see an instruction to `pip install pymongo`. You will also see a **connection string**. This is the `uri` that a `pymongo.MongoClient` uses to connect to the database.

### Test the connection

Atlas provides a simple check. Once you have your `uri` and `pymongo` installed, try the following in a Python console.

```python
from pymongo.mongo_client import MongoClient

client = MongoClient(uri)  # Create a new client and connect to the server
try:
    # Send a ping to confirm a successful connection
    client.admin.command("ping")
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)
```

**Troubleshooting**

- You can edit a database's users and passwords on the _Database Access_ page, under Security.
- Remember to add your IP address to the access list. (Try `curl -4 ifconfig.co` to find it.)
### Create a Database and Collection

As mentioned, vector databases provide two functions. In addition to being the data store, they provide very efficient search based on natural-language queries. With Vector Search, one indexes and queries data with a powerful vector search algorithm that uses Hierarchical Navigable Small World (HNSW) graphs to find vector similarity.

The indexing runs beside the data as a separate, asynchronous service. The search index monitors changes to the collection that it applies to, so you need not upload the data first. We will create an empty collection now, which will simplify setup in the example notebook.

Back in the UI, navigate to the Database Deployments page by clicking Database on the left panel. Click the "Browse Collections" and then "+ Create Database" buttons. This will open a window where you choose the Database and Collection names. (No additional preferences are needed.) Remember these values, as they will be used as the environment variables `MONGODB_DATABASE` and `MONGODB_COLLECTION`.

### Set Datastore Environment Variables

To establish a connection to the MongoDB Cluster, Database, and Collection, and to create a Vector Search index, define the following environment variables. You can confirm that the required ones have been set like this: `assert "MONGO_URI" in os.environ`.

**IMPORTANT** It is crucial that the choices are consistent between the setup in Atlas and the Python environment(s).

| Name                 | Description       | Example                                                             |
| -------------------- | ----------------- | ------------------------------------------------------------------- |
| `MONGO_URI`          | Connection String | mongodb+srv://`<user>`:`<password>`@llama-index.zeatahb.mongodb.net |
| `MONGODB_DATABASE`   | Database name     | llama_index_test_db                                                 |
| `MONGODB_COLLECTION` | Collection name   | llama_index_test_vectorstore                                        |
| `MONGODB_INDEX`      | Search index name | vector_index                                                        |

The following is required to authenticate with OpenAI.

| Name             | Description                                                  |
| ---------------- | ------------------------------------------------------------ |
| `OPENAI_API_KEY` | OpenAI token created at https://platform.openai.com/api-keys |
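A minimal sketch of validating this configuration at startup. The `check_required` helper is illustrative, not part of the integration; the fallback values mirror the examples in the table above:

```python
import os

def check_required(env: dict) -> list:
    """Return the names of required variables missing from the given environment."""
    required = ["MONGO_URI", "OPENAI_API_KEY"]
    return [name for name in required if name not in env]

# Optional names fall back to the example values from the table above
db_name = os.environ.get("MONGODB_DATABASE", "llama_index_test_db")
collection_name = os.environ.get("MONGODB_COLLECTION", "llama_index_test_vectorstore")
index_name = os.environ.get("MONGODB_INDEX", "vector_index")

print(check_required({"MONGO_URI": "mongodb+srv://..."}))  # -> ['OPENAI_API_KEY']
```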
### Create an Atlas Vector Search Index

The final step to configure MongoDB as the Datastore is to create a Vector Search index. The procedure is described [here](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure).

Under Services on the left panel, choose Atlas Search > Create Search Index > Atlas Vector Search JSON Editor.

The plugin expects an index definition like the following. To begin, choose `numDimensions: 1536` along with the suggested embedding variables above. You can experiment with these later.

```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
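The same definition can also be built in Python. This is only a sketch: the dict mirrors the JSON above, and the commented-out calls assume a recent PyMongo with the search-index API and a live Atlas cluster, which is why they are not executed here:

```python
# The same vector index definition as the JSON above, as a Python dict
definition = {
    "fields": [
        {
            "numDimensions": 1536,
            "path": "embedding",
            "similarity": "cosine",
            "type": "vector",
        }
    ]
}

# Against a live Atlas cluster, something like the following should work
# (assumes a recent PyMongo exposing the search-index API):
#
#   from pymongo import MongoClient
#   from pymongo.operations import SearchIndexModel
#   coll = MongoClient(uri)[db_name][collection_name]
#   coll.create_search_index(SearchIndexModel(definition=definition, name="vector_index"))

print(definition["fields"][0]["numDimensions"])  # -> 1536
```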
### Running MongoDB Integration Tests

In addition to the Jupyter notebook in `examples/`, a suite of integration tests is available to verify the MongoDB integration. The test suite needs the cluster up and running and the environment variables defined above.
**llama-index-integrations/vector_stores/llama-index-vector-stores-mongodb/tests/conftest.py** (70 additions, 0 deletions)
```python
import os
import threading
from pathlib import Path

import openai
import pytest
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient

openai.api_key = os.environ["OPENAI_API_KEY"]

lock = threading.Lock()

db_name = os.environ.get("MONGODB_DATABASE", "llama_index_test_db")
collection_name = os.environ.get("MONGODB_COLLECTION", "llama_index_test_vectorstore")
index_name = os.environ.get("MONGODB_INDEX", "vector_index")
cluster_uri = os.environ["MONGO_URI"]


@pytest.fixture(scope="session")
def documents(tmp_path_factory):
    """List of documents represents data to be embedded in the datastore.

    Minimum requirements for Documents in the /upsert endpoint's UpsertRequest.
    """
    data_dir = Path(__file__).parents[4] / "docs/docs/examples/data/paul_graham"
    return SimpleDirectoryReader(data_dir).load_data()


@pytest.fixture(scope="session")
def nodes(documents):
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=1024, chunk_overlap=200),
            OpenAIEmbedding(),
        ],
    )
    return pipeline.run(documents=documents)


@pytest.fixture(scope="session")
def atlas_client():
    client = MongoClient(cluster_uri)

    assert db_name in client.list_database_names()
    assert collection_name in client[db_name].list_collection_names()
    assert index_name in [
        idx["name"] for idx in client[db_name][collection_name].list_search_indexes()
    ]

    # Clear the collection for the tests
    client[db_name][collection_name].delete_many({})

    return client


@pytest.fixture(scope="session")
def vector_store(atlas_client):
    return MongoDBAtlasVectorSearch(
        mongodb_client=atlas_client,
        db_name=db_name,
        collection_name=collection_name,
        index_name=index_name,
    )
```
**...ex-integrations/vector_stores/llama-index-vector-stores-mongodb/tests/test_integration.py** (60 additions, 0 deletions)
```python
"""Integration tests of llama-index-vector-stores-mongodb
with the MongoDB Atlas Vector Datastore and OpenAI embedding model.

As described in docs/providers/mongodb/setup.md, to run this, one must
have a running MongoDB Atlas Cluster and provide a valid OPENAI_API_KEY.
"""

import os
from time import sleep

import pytest
from llama_index.core import StorageContext, VectorStoreIndex

from .conftest import lock


def test_required_vars():
    """Confirm that the environment has all it needs."""
    required_vars = ["OPENAI_API_KEY", "MONGO_URI"]
    for var in required_vars:
        try:
            os.environ[var]
        except KeyError:
            pytest.fail(f"Required var '{var}' not in os.environ")


def test_mongodb_connection(atlas_client):
    """Confirm that the connection to the datastore works."""
    assert atlas_client.admin.command("ping")["ok"]


def test_index(documents, vector_store):
    """End-to-end example from essay and query to response,
    via NodeParser, LLM Embedding, VectorStore, and Synthesizer.
    """
    with lock:
        vector_store._collection.delete_many({})
        sleep(2)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex.from_documents(
            documents, storage_context=storage_context
        )
        query_engine = index.as_query_engine()

        question = "Who is the author of this essay?"
        no_response = True
        response = None
        retries = 5
        search_limit = query_engine.retriever.similarity_top_k
        while no_response and retries:
            response = query_engine.query(question)
            if len(response.source_nodes) == search_limit:
                no_response = False
            else:
                retries -= 1
                sleep(5)
        assert retries
        assert "Paul Graham" in response.response
```
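The retry loops in these tests poll until Atlas's asynchronous search indexing catches up with recent writes. A generic helper along these lines (a refactoring sketch, not part of the PR) could replace the hand-rolled loops:

```python
import time

def wait_for(predicate, attempts: int = 5, delay: float = 2.0) -> bool:
    """Poll predicate() until it is truthy or the attempts run out.

    Useful because Atlas Search indexes documents asynchronously, so a
    query issued immediately after an insert may return too few results.
    """
    for _ in range(attempts):
        if predicate():
            return True
        time.sleep(delay)
    return False

# Example: a predicate that only succeeds on its third call
calls = {"n": 0}

def ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ready, attempts=5, delay=0.0))  # -> True
```

In `test_index` above, the loop body would collapse to `wait_for(lambda: len(query_engine.query(question).source_nodes) == search_limit)`.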
**...ex-integrations/vector_stores/llama-index-vector-stores-mongodb/tests/test_vectorstore.py** (81 additions, 0 deletions)
```python
import os
from time import sleep

import openai
from llama_index.core.schema import Document, TextNode
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.embeddings.openai import OpenAIEmbedding

from .conftest import lock

openai.api_key = os.environ["OPENAI_API_KEY"]


def test_documents(documents: list[Document]):
    """Sanity check essay was found and documents loaded."""
    assert len(documents) == 1
    assert isinstance(documents[0], Document)


def test_nodes(nodes):
    """Test Ingestion Pipeline transforming documents into nodes with embeddings."""
    assert isinstance(nodes, list)
    assert isinstance(nodes[0], TextNode)


def test_vectorstore(nodes, vector_store):
    """Test add, query, delete API of MongoDBAtlasVectorSearch."""
    with lock:
        # 0. Clean up the collection
        vector_store._collection.delete_many({})
        sleep(2)

        # 1. Test add()
        ids = vector_store.add(nodes)
        assert set(ids) == {node.node_id for node in nodes}

        # 2. Test query()
        query_str = "Who is this author of this essay?"
        n_similar = 2
        query_embedding = OpenAIEmbedding().get_text_embedding(query_str)
        query = VectorStoreQuery(
            query_str=query_str,
            query_embedding=query_embedding,
            similarity_top_k=n_similar,
        )
        result_found = False
        query_responses = None
        retries = 5
        while retries and not result_found:
            query_responses = vector_store.query(query=query)
            if len(query_responses.nodes) == n_similar:
                result_found = True
            else:
                sleep(2)
                retries -= 1

        assert all(score > 0.89 for score in query_responses.similarities)
        assert any(
            "seem more like rants" in node.text for node in query_responses.nodes
        )
        assert all(id_res in ids for id_res in query_responses.ids)

        # 3. Test delete()
        # Remember, the current API deletes by *ref_doc_id*, not *node_id*.
        # In our case, we began with only one document,
        # so deleting the ref_doc_id from any node
        # should delete ALL the nodes.
        n_docs = vector_store._collection.count_documents({})
        assert n_docs == len(ids)
        remove_id = query_responses.nodes[0].ref_doc_id
        sleep(2)
        retries = 5
        while retries:
            vector_store.delete(remove_id)
            n_remaining = vector_store._collection.count_documents({})
            if n_remaining == n_docs:
                sleep(2)  # delete not yet reflected; retry
                retries -= 1
            else:
                retries = 0
        assert n_remaining == 0
```
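The delete-by-`ref_doc_id` semantics noted in the comments above can be illustrated with a small in-memory stand-in (purely hypothetical, not the real store): every node derived from the same source document shares one `ref_doc_id`, so deleting by that id removes them all.

```python
# Hypothetical in-memory stand-in for the store's collection:
# three nodes, all chunks of one source document ("essay-1").
store = {f"node-{i}": {"ref_doc_id": "essay-1"} for i in range(3)}

def delete(ref_doc_id: str) -> None:
    """Delete every node whose parent document matches ref_doc_id."""
    for node_id in [k for k, v in store.items() if v["ref_doc_id"] == ref_doc_id]:
        del store[node_id]

delete("essay-1")
print(len(store))  # -> 0
```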
**Review comment:** Should this just be in the readme instead?