[Feature Request]: Langchain plugin for Chroma always tries to create the collection even if the collection already exists. #2163

Closed
harshal-cuminai opened this issue May 8, 2024 · 13 comments
@harshal-cuminai

Describe the problem

Use Case:
Allow only similarity-search queries against a collection hosted on a remote Chroma server. We assume the triple (tenant, db, collection) always exists and that the client always passes values that already exist in the db. If not, we error out.

Problem:
We are trying to integrate the Chroma db server into an application. We use Chroma's LangChain plugin for client-side testing and want to support client-side integration with LangChain while giving clients only limited access to the Chroma server.

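import chromadb
from langchain_chroma import Chroma  # assuming the langchain-chroma package

# embedding_function is assumed to be a LangChain embeddings object defined elsewhere.
chroma_client = chromadb.HttpClient(host='<chroma server host>', port=443, tenant="<tenant>", database="<db>", ssl=True)

db = Chroma(
    client=chroma_client,
    collection_name="demo",
    embedding_function=embedding_function,
)

retriever = db.as_retriever(search_kwargs={"k": 3})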

The problem is that we don't want to expose every API endpoint of the Chroma server. Our app ingress rules expose only the following:

  1. Get Tenant by Name
  2. Get Database by Name
  3. Get Collection by Name
  4. Query Collection
    (Note: We are not exposing Create Collection endpoint)
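
The allowlist above could be sketched as a small route check. This is only an illustration of the idea: the route patterns are hypothetical placeholders (the thread names the four operations but not their exact paths), and `is_allowed` is an invented helper, not part of Chroma or any ingress controller.

```python
import re

# Hypothetical route patterns: the thread only names the four operations,
# so these paths are placeholders, not Chroma's documented routes.
ALLOWED_ROUTES = [
    ("GET", re.compile(r"^/tenants/[^/]+$")),             # Get Tenant by Name
    ("GET", re.compile(r"^/databases/[^/]+$")),           # Get Database by Name
    ("GET", re.compile(r"^/collections/[^/]+$")),         # Get Collection by Name
    ("POST", re.compile(r"^/collections/[^/]+/query$")),  # Query Collection
]


def is_allowed(method: str, path: str) -> bool:
    """Return True only if (method, path) matches one of the exposed routes."""
    method = method.upper()
    return any(m == method and pattern.match(path) for m, pattern in ALLOWED_ROUTES)
```

Under these assumed paths, `is_allowed("POST", "/collections")` is false, which is why a client that always issues a create call gets rejected.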

This works well with the plain chromadb client, as shown below, assuming the collection "demo" has already been created. The code uses only the four API calls listed above.

import chromadb
from chromadb.utils import embedding_functions
from chromadb import Settings

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
chroma_client = chromadb.HttpClient(host='<remote host name>', port=443, tenant="<tenant>", database="<db>", ssl=True)
demo_collection = chroma_client.get_collection(name="demo", embedding_function=embedding_function)

results = demo_collection.query(
    query_texts=["<query>"],
    n_results=2
)

However, the LangChain plugin defaults get_or_create to true, so it always tries to create the collection and fails because we do not expose the Create Collection API.

Describe the proposed solution

We should have an option to set get_or_create to false.

db = Chroma(
    client=chroma_client,
    collection_name="demo",
    embedding_function=embedding_function,
    get_or_create=False,
)

Alternatives considered

No response

Importance

i cannot use Chroma without it

Additional Information

No response

@harshal-cuminai harshal-cuminai added the enhancement New feature or request label May 8, 2024
@harshal-cuminai
Author

harshal-cuminai commented May 8, 2024

@jeffchuber @tazarov need help with this.

@tazarov
Contributor

tazarov commented May 8, 2024

@harshal-cuminai, thanks for the elaborate and deep exploration of the issue. Separating your ingestion and query/get flows makes sense for more than security reasons.

Just off the top of my head, I see two options here:

  • Small change in Langchain🦜🔗 as per the suggested approach or similar to it
  • Add auth to your API, thus rejecting anonymous (write) requests

Is auth something you can work with? If yes, then I can give you some configs to try out. It might be worth it until we figure out a more flexible solution.

@harshal-cuminai
Author

harshal-cuminai commented May 8, 2024

Hi @tazarov, sure, we are open to any temporary solution until some variant of the proposed solution becomes a first-class integration in LangChain.

We currently have auth set up as a subprocess of the nginx proxy sitting in front of the chromadb service. But our use case requires rejecting collection creation altogether, even for authenticated clients, which the current LangChain integration does not allow. So I am thinking we will probably have to redirect the POST call for collection creation (triggered by LangChain) after authentication, based on client role, as follows:

Original: POST /collections
Modified Rewrite: GET /collections/<collection name>

since both have the same response schema and output when get_or_create is set to true, as it is in the current case.

What approach are you suggesting?

@tazarov
Contributor

tazarov commented May 8, 2024

Rewriting sounds like a sensible approach. However, you'll have to read the POST payload to get the name attribute and then pass that to the GET. I think for NGINX, that translates to a bit of Lua scripting.
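
The rewrite logic itself is small. Here is a minimal Python sketch of it, independent of NGINX or Lua. It uses the path forms quoted earlier in the thread (`POST /collections`, `GET /collections/<collection name>`) and assumes the create payload is JSON with a `name` attribute. `rewrite_create_to_get` is a hypothetical helper name.

```python
import json
from urllib.parse import quote


def rewrite_create_to_get(method: str, path: str, body: bytes):
    """Map POST /collections (create) to GET /collections/<name>.

    Any other request is returned unchanged as (method, path, body).
    """
    if method.upper() != "POST" or path.rstrip("/") != "/collections":
        return method, path, body
    payload = json.loads(body or b"{}")
    name = payload.get("name") if isinstance(payload, dict) else None
    if not isinstance(name, str) or not name:
        raise ValueError("create-collection payload has no 'name' attribute")
    # Percent-encode the name so it is safe as a single path segment.
    return "GET", "/collections/" + quote(name, safe=""), b""
```

A proxy hook would call this after authentication and forward the returned triple upstream. Requests that raise `ValueError` should be rejected rather than forwarded.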

@harshal-cuminai
Author

Yes, correct. Can you suggest a cheaper alternative?

On a side note, it would be best to have this as a first-class feature in langchain-chroma. What do you think?

@tazarov
Contributor

tazarov commented May 8, 2024

I've already written up the Langchain🦜🔗 PR. I'm just adding tests, and then off it goes. However, it might take a few days to merge and release. Your problem is not uncommon, or at least shouldn't be, for public-facing products where you'd want a modicum of control over who can write to the DB.

tazarov added a commit to amikos-tech/langchain that referenced this issue May 8, 2024
Adds the ability to either get_or_create or simply get collection. This is useful when dealing with read-only Chroma instances where users can only get_collection. Targeted at Http/CloudClients mostly.

Closes chroma-core/chroma#2163
@tazarov
Contributor

tazarov commented May 8, 2024

@harshal-cuminai, PR in Langchain🦜🔗 created.

ccurme pushed a commit to langchain-ai/langchain that referenced this issue May 9, 2024
…ma constructor (#21420)

- **Description:** Adds the ability to either `get_or_create` or simply
`get_collection`. This is useful when dealing with read-only Chroma
instances where users are constrained to using `get_collection`. Targeted
at Http/CloudClients mostly.
- **Issue:** chroma-core/chroma#2163
- **Dependencies:** N/A
- **Twitter handle:** `@t_azarov`

| Collection Exists | create_collection_if_not_exists | Outcome | Test |
|-------------------|---------------------------------|---------|------|
| True | False | No errors, collection state unchanged | `test_create_collection_if_not_exist_false_existing` |
| True | True | No errors, collection state unchanged | `test_create_collection_if_not_exist_true_existing` |
| False | False | Error, `get_collection()` fails | `test_create_collection_if_not_exist_false_non_existing` |
| False | True | No errors, `get_or_create_collection()` creates the collection | `test_create_collection_if_not_exist_true_non_existing` |
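
The four rows of the table can be reproduced with a small stand-in, sketched here under stated assumptions. `FakeClient` is a hypothetical in-memory substitute for a Chroma client, and `resolve_collection` is an illustrative helper. Neither is the code from the PR, and the exception type a real client raises for a missing collection may differ.

```python
class FakeClient:
    """In-memory stand-in for a Chroma client (hypothetical, for illustration)."""

    def __init__(self, existing=()):
        self.collections = set(existing)

    def get_collection(self, name):
        # Read-only lookup: fails if the collection is missing.
        if name not in self.collections:
            raise ValueError(f"Collection {name} does not exist.")
        return name

    def get_or_create_collection(self, name):
        # Creates the collection when missing; a no-op when it exists.
        self.collections.add(name)
        return name


def resolve_collection(client, name, create_collection_if_not_exists=True):
    """Pick the client call according to the flag, mirroring the table."""
    if create_collection_if_not_exists:
        return client.get_or_create_collection(name)
    return client.get_collection(name)
```

With the flag set to `False`, only the read path is exercised, which is the behaviour a read-only deployment needs.
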
@tazarov
Contributor

tazarov commented May 9, 2024

@harshal-cuminai The PR should be in the next release.

Narapady pushed a commit to Narapady/langchain that referenced this issue May 9, 2024
…ma constructor (langchain-ai#21420)
kyle-cassidy pushed a commit to kyle-cassidy/langchain that referenced this issue May 10, 2024
…ma constructor (langchain-ai#21420)
@harshal-cuminai
Author

@tazarov, is the package auto-published on release? https://pypi.org/project/langchain-chroma/#history

@tazarov
Contributor

tazarov commented May 11, 2024

@harshal-cuminai, I think they do separate releases for partner libs. But you can always do the following:

With pip:

pip install git+https://github.com/langchain-ai/langchain.git@master#subdirectory=libs/partners/chroma

In requirements.txt:

git+https://github.com/langchain-ai/langchain.git@master#subdirectory=libs/partners/chroma

In pyproject.toml:

[tool.poetry.dependencies]
langchain-chroma = { git = "https://github.com/langchain-ai/langchain.git", branch = "master", subdirectory = "libs/partners/chroma" }

@tazarov tazarov self-assigned this May 11, 2024
@harshal-cuminai
Author

Perfect, this works. Thanks a ton, @tazarov. Closing this thread now.

@harshal-cuminai
Author

@tazarov, now that we have tested it locally, we are blocked from releasing our package until this change is rolled out in the langchain-chroma package, since we can't ship packages with direct repo-based dependencies. I have left a comment on your LangChain PR, but is there a way you folks can expedite the release?

kyle-cassidy pushed a commit to kyle-cassidy/langchain that referenced this issue May 16, 2024
…ma constructor (langchain-ai#21420)
@harshal-cuminai
Author

Closing, as 0.1.1 is released.
