[Feature Request]: Auto-save during indexing #695

Open
GownerCode opened this issue Apr 22, 2024 · 3 comments

Comments

@GownerCode

Feature Description

I suggest implementing an autosave feature:

embeddings.index(data, autosave={"interval": 3600, "save_path": "/home/user/index"})

Something like this would save the index to save_path every interval seconds.

Reason

When one processes a large dataset, indexing can take a long time. The naive approach:

embeddings.index(my_data)
embeddings.save(my_save_path)

Here embeddings.index(...) takes a long time. When the data comes from a database or another network-dependent source, something may go wrong during the index call. In that case the save call is never reached and all progress is lost.

Value of Feature

It would make it possible to resume an indexing run that failed for reasons other than bad data.

@davidmezzetti
Member

Thank you for this idea. I'll take a look.

Would it make sense to save every N records rather than on a time interval? Most checkpoints I'm familiar with tend to be based on data volume.

@GownerCode
Author

You're right, data volume makes more sense. Thanks for looking into it!
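
In the meantime, a record-count checkpoint can probably be approximated from the calling side. The sketch below is only an illustration: index_with_checkpoints and batch_size are made-up names, and it assumes the embeddings object exposes index, upsert and save the way txtai's Embeddings does, with a backend that supports upsert.

```python
from itertools import islice


def index_with_checkpoints(embeddings, documents, save_path, batch_size=10_000):
    """Index documents in batches, saving after every batch_size records.

    Hypothetical helper, not part of txtai. `embeddings` is assumed to
    provide index(), upsert() and save().
    """
    iterator = iter(documents)
    total = 0
    while True:
        # Pull the next batch_size records without loading everything at once
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        if total == 0:
            # First batch builds a new index
            embeddings.index(batch)
        else:
            # Later batches are added to the existing index
            embeddings.upsert(batch)
        # Checkpoint: persist the index after each batch
        embeddings.save(save_path)
        total += len(batch)
    return total
```

If the data source fails partway through, the index on disk still contains every batch that completed before the failure.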

@dustyatx

I'd "yes, and" this. It would be better if we could treat it like a checkpoint and, in the event that indexing fails, restart from the last checkpoint. I've had issues where there were non-strings in a field that I was trying to create a vector index on, and it broke after many hours of running. It would be great to be able to identify the error, correct any issue with the data, and restart the indexing from the last checkpoint position.
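
To make the restart idea concrete, here is a rough sketch of how the checkpoint position could be tracked outside the library. Everything in it is an assumption for illustration: read_offset, write_offset, resume_indexing and the JSON sidecar file are hypothetical names, the embeddings object is assumed to expose index, upsert and save like txtai's Embeddings, the data source must return records in the same order on every run, and the caller is expected to reload the saved index (for txtai, embeddings.load(save_path)) before resuming.

```python
import json
import os
from itertools import islice


def read_offset(path):
    """Return the number of records already indexed, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["offset"]


def write_offset(path, offset):
    """Record the checkpoint position, replacing the file atomically."""
    temp = path + ".tmp"
    with open(temp, "w", encoding="utf-8") as handle:
        json.dump({"offset": offset}, handle)
    os.replace(temp, path)


def resume_indexing(embeddings, documents, save_path, offset_path, batch_size=10_000):
    """Continue indexing from the last checkpoint (hypothetical helper)."""
    offset = read_offset(offset_path)
    # Skip records that were indexed and saved before the failure
    iterator = islice(iter(documents), offset, None)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        if offset == 0:
            embeddings.index(batch)
        else:
            embeddings.upsert(batch)
        # Save the index first, then record how far we got
        embeddings.save(save_path)
        offset += len(batch)
        write_offset(offset_path, offset)
    return offset
```

After a failure, the offset file points at the start of the batch that broke, so the bad record can be found in that batch, fixed, and the same call run again to pick up from there.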
