- [x] I searched existing ideas and did not find a similar one
- [x] I added a very descriptive title
- [x] I've clearly described the feature request and motivation for it
Feature request
I would like to propose multithreading when initializing a VectorStore or adding texts/documents to it.
Currently, the sync and async variants of `add_texts`, `add_documents`, `from_texts`, and `from_documents` all process texts sequentially. This does not fully utilize the Embeddings API throughput and becomes a bottleneck.
The following is my workaround: it splits the documents into N groups and runs `aadd_documents` on them in parallel, which speeds up the overall embedding step.
```python
import asyncio

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.vectorstores import VectorStore
from langchain_upstage import UpstageEmbeddings

db: VectorStore = Chroma(embedding_function=UpstageEmbeddings())

async def embed_group(group: list[Document]) -> None:
    await db.aadd_documents(group)

# `docs` is the full list of Documents to embed; max(1, ...) avoids a
# zero-sized group when there are fewer than 10 documents.
n = max(1, len(docs) // 10)
doc_groups = [docs[i:i + n] for i in range(0, len(docs), n)]
await asyncio.gather(*(embed_group(group) for group in doc_groups))
```
I thought it would be a great feature if `VectorStore` supported something like this internally, so users could get it with a one-liner. One option is to add a `concurrency` parameter to `VectorStore`, defaulting to 1:

```python
Chroma.afrom_documents(docs, concurrency=10)
```
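To illustrate, here is a rough sketch of what such an internal option could look like on the async path. The helper name `aadd_documents_concurrent`, the `batch_size` parameter, and the semaphore-based bound are all hypothetical, not existing LangChain API:

```python
import asyncio

# Hypothetical sketch: bound how many batches are embedded in parallel
# with a semaphore, so `concurrency` caps in-flight Embeddings API calls.
async def aadd_documents_concurrent(db, docs, concurrency=10, batch_size=100):
    """Add `docs` to `db` in parallel batches, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def add_batch(batch):
        async with semaphore:
            await db.aadd_documents(batch)

    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    await asyncio.gather(*(add_batch(b) for b in batches))
```

With `concurrency=1` this degrades to today's sequential behavior, so it could be a backward-compatible default.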
I also noticed that `ContextThreadPoolExecutor` already exists, so we could probably leverage it in `VectorStore`.
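For the sync methods, the idea could look like the sketch below, where a plain `ThreadPoolExecutor` stands in for the LangChain-internal `ContextThreadPoolExecutor`; `add_documents_concurrent` and its parameters are again hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch for the sync path: split `docs` into batches and
# embed them on a bounded thread pool instead of sequentially.
def add_documents_concurrent(db, docs, concurrency=10, batch_size=100):
    """Add `docs` to `db` using up to `concurrency` worker threads."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list(...) consumes the iterator, re-raising any worker exception
        list(pool.map(db.add_documents, batches))
```

This assumes the underlying store's `add_documents` is safe to call from multiple threads, which would need to be verified per integration.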
Let me know if there is already a better way to achieve this!
Motivation
Adding a large number of chunks to a VectorStore currently takes a very long time and easily becomes a bottleneck. There is a workaround, but it is cumbersome to hand-write the concurrent processing.