Modifying an embedder should not always recompute all the vectors #4615

Open
1 of 5 tasks
ManyTheFish opened this issue May 2, 2024 · 0 comments
Labels
experimental feature Related to an experimental feature performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption settings diff-indexing Issues related to settings diff-indexing

Comments

@ManyTheFish
Member

ManyTheFish commented May 2, 2024

Related product team resources: PRD (internal only)

⚠️ this issue depends on #4480 to be implemented

Summary

This issue is a subset of the work implementing the settings diff-indexing enhancement.

When an embedder is added or changed, the whole vector store is erased and fully recomputed during indexing.
It should be possible to avoid erasing everything and to compute or delete only the modified data, saving computing time.

⚠️ The main goal of this issue is to minimize the number of vectors to recompute during indexing. Erasing the DB and recomputing it is sometimes the best approach, because computing a difference implies computing twice the number of vectors for the data.

Current implementation

When the settings are modified, the first step is to compute the changes in the embedding settings. In the current implementation, the database is erased if any change implies a modification of the vector store.
Later in the indexing process, the embeddings pipeline is run if any change has been made to the embeddings settings. This relies on the InnerSettingsDiff structure, which should be enhanced to compute the changes in the embedders more precisely.
This extraction pipeline only uses the new settings: because the database is currently erased, there is no need to process the old settings in order to remove part of the data. The pipeline should be reworked to support both settings versions.
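
As an illustration of what a more precise diff could look like, here is a minimal sketch that compares the old and new embedder settings per embedder instead of treating any change as a reason to erase everything. The `EmbedderSettings` struct, the `EmbedderChange` enum, and the `diff_embedders` function are hypothetical names chosen for this example; they are not the actual `InnerSettingsDiff` API, and the real settings contain more fields.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified embedder settings (assumption: the real
/// settings have more fields than these two).
#[derive(Debug, Clone, PartialEq, Eq)]
struct EmbedderSettings {
    model: String,
    document_template: String,
}

/// What happened to a single embedder between the old and new settings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EmbedderChange {
    Added,
    Removed,
    Modified,
    Unchanged,
}

/// Compares both settings versions and reports one change per embedder name.
fn diff_embedders(
    old: &BTreeMap<String, EmbedderSettings>,
    new: &BTreeMap<String, EmbedderSettings>,
) -> BTreeMap<String, EmbedderChange> {
    let mut diff = BTreeMap::new();
    // Embedders present in the old settings are removed, modified, or unchanged.
    for (name, old_settings) in old {
        let change = match new.get(name) {
            None => EmbedderChange::Removed,
            Some(new_settings) if new_settings == old_settings => EmbedderChange::Unchanged,
            Some(_) => EmbedderChange::Modified,
        };
        diff.insert(name.clone(), change);
    }
    // Any name that only exists in the new settings is an addition.
    for name in new.keys() {
        diff.entry(name.clone()).or_insert(EmbedderChange::Added);
    }
    diff
}
```

With such a diff, only the embedders reported as `Added`, `Removed`, or `Modified` would need any work on the vector store, and the extraction pipeline would have access to both settings versions for the modified ones.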

TODO

  • Changes to one embedder should not trigger the reindexing of all embedders. Only the modified embedder should be reindexed.
    • When an embedder is deleted, the code currently benefits from the "reindex everything" behavior: the index associated with each embedder after the deleted one is shifted so as not to leave a "hole" in the list of embedders. A free list is required to support incremental deletion of embedders.
  • Currently, changes to searchable, displayable, sortable, filterable, etc. do not need to trigger a reindexing. In the future, the properties of a field might be accessible to the document template, making reindexing mandatory in this case. (related to Create SettingsDiff structure and run extractions based on it #4480)
  • A change to the apiKey (for OpenAI models) should not trigger a reindexing operation of the modified embedder.
  • When modifying the documentTemplate, a reindexing is necessary, but it could be minimized by comparing the rendered versions of the old and new templates, and only regenerating embeddings for documents whose rendered version actually changed.
  • Future extensions:
    1. distributionShift: should not trigger a reindexing operation.
    2. distance: needs reindexing
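
The TODO items above can be sketched as follows. This is only an illustration under stated assumptions: `Change`, `ReindexAction`, `action_for`, `combined_action`, `docs_to_reembed`, and `EmbedderIds` are hypothetical names, the two render closures stand in for the old and new document templates, and the `u8` id type is chosen for the example. The first part maps each kind of settings change to the amount of reindexing it requires, the second keeps only the documents whose rendered template differs, and the third is a small free list that lets a deleted embedder's id be reused without shifting the ids of the others.

```rust
use std::collections::BTreeSet;

/// Hypothetical kinds of change to a single embedder's settings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Change {
    ApiKey,
    DistributionShift,
    DocumentTemplate,
    Distance,
}

/// Amount of work required, ordered from cheapest to most expensive.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ReindexAction {
    Nothing,
    ChangedPromptsOnly,
    Full,
}

/// Maps one change to the reindexing it requires, following the list above.
fn action_for(change: Change) -> ReindexAction {
    match change {
        Change::ApiKey | Change::DistributionShift => ReindexAction::Nothing,
        Change::DocumentTemplate => ReindexAction::ChangedPromptsOnly,
        Change::Distance => ReindexAction::Full,
    }
}

/// Several changes at once require the most expensive of their actions.
fn combined_action(changes: &[Change]) -> ReindexAction {
    changes
        .iter()
        .map(|change| action_for(*change))
        .max()
        .unwrap_or(ReindexAction::Nothing)
}

/// Keeps only the documents whose rendered template actually changed.
/// `render_old` and `render_new` are stand-ins for the two template versions.
fn docs_to_reembed<'a>(
    docs: &[&'a str],
    render_old: impl Fn(&str) -> String,
    render_new: impl Fn(&str) -> String,
) -> Vec<&'a str> {
    docs.iter()
        .copied()
        .filter(|doc| render_old(*doc) != render_new(*doc))
        .collect()
}

/// Free list of embedder ids: releasing an id leaves the other ids untouched,
/// and the smallest released id is reused by the next allocation.
#[derive(Debug, Default)]
struct EmbedderIds {
    next: u8,
    free: BTreeSet<u8>,
}

impl EmbedderIds {
    /// Returns a reusable id if one exists, otherwise a fresh one.
    /// Returns `None` once the fresh-id counter would overflow.
    fn allocate(&mut self) -> Option<u8> {
        if let Some(id) = self.free.pop_first() {
            return Some(id);
        }
        let id = self.next;
        self.next = self.next.checked_add(1)?;
        Some(id)
    }

    /// Marks an id as reusable after its embedder has been deleted.
    fn release(&mut self, id: u8) {
        self.free.insert(id);
    }
}
```

In this sketch, a change limited to the apiKey yields `ReindexAction::Nothing`, while a documentTemplate change yields `ReindexAction::ChangedPromptsOnly`, in which case `docs_to_reembed` decides which documents need new embeddings. Note that rendering both template versions has its own cost, which is part of the trade-off mentioned in the summary.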

Related Benchmarks:

  • settings-add-embeddings.json
  • movies-subset-hf-embeddings.json
  • a new benchmark modifying the embeddings may be added
@ManyTheFish ManyTheFish added performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption settings diff-indexing Issues related to settings diff-indexing labels May 2, 2024
@curquiza curquiza added the experimental feature Related to an experimental feature label May 14, 2024