[ENH] full-text search index reader and writer #2178

beggers · 2024-05-10T15:41:02Z

Description of changes

Summarize the changes made by this PR.

New functionality
- Migrate the full-text search index to the new reader/writer pattern and add some tests.

Test plan

How are these changes tested?

Tests pass locally with pytest for Python, yarn test for JS, and cargo test for Rust.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

github-actions · 2024-05-10T15:41:14Z

Reviewer Checklist

Please use this checklist to ensure your code review is thorough before approving.

Testing, Bugs, Errors, Logs, Documentation

Can you think of any use case in which the code does not behave as intended? Have they been tested?
Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
If appropriate, are there adequate property based tests?
If appropriate, are there adequate unit tests?
Should any logging, debugging, tracing information be added or removed?
Are error messages user-friendly?
Have all documentation changes needed been made?
Have all non-obvious changes been commented?

System Compatibility

Are there any potential impacts on other parts of the system or backward compatibility?
Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

Is this code of an unexpectedly high quality (readability, modularity, intuitiveness)?

beggers · 2024-05-10T15:42:23Z

rust/worker/src/index/fulltext/types.rs

+ async fn commit_and_flush(self) -> Result<(), Box<dyn ChromaError>> {
+ // TODO should we be `await?`ing these? Or can we just return the futures?
+ self.posting_lists_blockfile_writer
+ .commit::<u32, &Int32Array>()?


Note: we no longer store the positional posting list directly as a value. Instead, the key is (token, doc_id) and the value is the list of positions. I think this is strictly better?

So
prefix: token
key: doc_id
positions: [pos]?

I think that's fine; it's functionally the same at storage. Can we remove the pos value type from storage then and only use it as a builder?

That's the fun part -- it's already not in Storage! If you're talking about in-memory storage at least
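For readers following the layout discussed above, here is a minimal sketch of the (token, doc_id) key with a positions value. It uses a std `BTreeMap` as a stand-in for the blockfile, and the names `PostingLists`, `add_position`, and `docs_for_token` are hypothetical, not taken from the PR.

```rust
use std::collections::BTreeMap;

// Hypothetical in-memory stand-in for the posting-lists blockfile:
// the key is (token, doc_id) and the value is the token's positions in that doc.
type PostingLists = BTreeMap<(String, u32), Vec<u32>>;

// Record one occurrence of `token` in document `doc_id` at position `pos`.
fn add_position(lists: &mut PostingLists, token: &str, doc_id: u32, pos: u32) {
    lists
        .entry((token.to_string(), doc_id))
        .or_default()
        .push(pos);
}

// All (doc_id, positions) pairs stored under one token prefix, in doc_id order.
fn docs_for_token<'a>(lists: &'a PostingLists, token: &str) -> Vec<(u32, &'a [u32])> {
    lists
        .range((token.to_string(), u32::MIN)..=(token.to_string(), u32::MAX))
        .map(|((_, doc_id), positions)| (*doc_id, positions.as_slice()))
        .collect()
}
```

Because the map is ordered by key, every entry for one token is contiguous, so a prefix lookup is a single range scan.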

beggers · 2024-05-10T15:43:05Z

rust/worker/src/index/fulltext/types.rs

+ // for the character have been seen/used in the matching algorithm. By
+ // leaving them ordered per the query, we can stick to the more straightforward
+ // but less efficient matching algorithm.
+ // token_frequencies.sort_by(|a, b| a.1.cmp(&b.1));


Should we do the work to sort by token frequency now? It'll take some time to get it right and test. This bug was lurking in the old implementation (yet another argument for property testing).
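One possible shape for the frequency-sorted version, as a sketch only: each token carries its own offset within the query, so reordering by frequency does not lose where the token has to appear. This is not what the PR does (it leaves tokens in query order), and `QueryToken`, `order_by_rarity`, and `doc_matches` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical query token: text, corpus frequency, and offset within the query.
#[derive(Debug, Clone, PartialEq)]
struct QueryToken {
    text: String,
    frequency: u32,
    offset: i64,
}

/// Sort rarest-first. Each token keeps its own query offset, so the order
/// of the list no longer encodes position.
fn order_by_rarity(mut tokens: Vec<QueryToken>) -> Vec<QueryToken> {
    tokens.sort_by_key(|t| t.frequency);
    tokens
}

/// Returns true if one document contains every token at its query-relative offset.
/// `doc_positions` maps token text to that token's positions in the document.
fn doc_matches(tokens: &[QueryToken], doc_positions: &HashMap<String, Vec<i64>>) -> bool {
    let Some((first, rest)) = tokens.split_first() else {
        return false;
    };
    let Some(first_positions) = doc_positions.get(&first.text) else {
        return false;
    };
    first_positions.iter().any(|&p| {
        // Where the query would have to start in the document for this anchor.
        let start = p - first.offset;
        rest.iter().all(|t| {
            doc_positions
                .get(&t.text)
                .map_or(false, |ps| ps.contains(&(start + t.offset)))
        })
    })
}
```

Anchoring on the rarest token keeps the outer loop short; the trade-off is that every comparison has to go through the stored offsets rather than the list order.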

HammadB · 2024-05-13T18:38:42Z

rust/worker/src/index/fulltext/types.rs

 }
 }
 Ok(())
 }

- fn search(&mut self, query: &str) -> Result<Vec<i32>, Box<dyn ChromaError>> {
+ async fn commit_and_flush(self) -> Result<(), Box<dyn ChromaError>> {


I don't think we want this method at the index level. If you look at the higher-level pattern, we want commit and flush as two steps with a type-safe transition.
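A minimal sketch of what a type-safe commit-then-flush transition can look like, assuming a writer that is consumed by `commit` and hands back a flusher. `IndexWriter` and `IndexFlusher` are hypothetical stand-ins, and the methods are synchronous here for brevity, whereas the real ones are async.

```rust
/// Hypothetical stand-in for an index writer that buffers entries.
struct IndexWriter {
    pending: Vec<String>,
}

/// Hypothetical stand-in for the flusher; it can only be obtained via `commit`.
struct IndexFlusher {
    committed: Vec<String>,
}

impl IndexWriter {
    fn new() -> Self {
        IndexWriter { pending: Vec::new() }
    }

    fn add(&mut self, entry: &str) {
        self.pending.push(entry.to_string());
    }

    /// Consumes the writer, so any write after commit fails to compile.
    fn commit(self) -> Result<IndexFlusher, String> {
        Ok(IndexFlusher { committed: self.pending })
    }
}

impl IndexFlusher {
    /// Consumes the flusher and returns how many entries were flushed.
    fn flush(self) -> Result<usize, String> {
        Ok(self.committed.len())
    }
}
```

Since `flush` exists only on `IndexFlusher`, the compiler enforces that commit happens first and that each step runs at most once.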

HammadB · 2024-05-13T18:40:14Z

rust/worker/src/index/fulltext/types.rs

 }
+ if res.len() > 1 {
+ panic!("Multiple tokens found in frequencies blockfile");


HammadB · 2024-05-13T18:40:38Z

rust/worker/src/index/fulltext/types.rs

- for position_for_doc_id in positions_for_doc_id.values() {
- if position_for_doc_id - token_offset == *position {
+ for pos in positions.iter() {
+ if pos.unwrap() == position + token_offset {


Is this unwrap safe?

If it's not, it's an invariant violation. Changed to a Result.
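A sketch of the unwrap-to-Result change, assuming the positions iterator yields `Option<i32>` (modelled here as a plain slice rather than an Arrow array). `PositionError` and `contains_position` are hypothetical names, not the ones in the PR.

```rust
use std::fmt;

/// Hypothetical error for a missing position, which should never occur.
#[derive(Debug, PartialEq)]
enum PositionError {
    NullPosition,
}

impl fmt::Display for PositionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Null position found in positional posting list")
    }
}

impl std::error::Error for PositionError {}

/// Returns Ok(true) if some stored position equals `position + token_offset`.
/// A None entry is reported as an error instead of panicking via unwrap.
fn contains_position(
    positions: &[Option<i32>],
    position: i32,
    token_offset: i32,
) -> Result<bool, PositionError> {
    for &pos in positions {
        let pos = pos.ok_or(PositionError::NullPosition)?;
        if pos == position + token_offset {
            return Ok(true);
        }
    }
    Ok(false)
}
```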

HammadB

Looks good overall; some minor to moderate comments.

HammadB · 2024-05-13T21:09:04Z

rust/worker/src/index/fulltext/types.rs

+
+#[derive(Error, Debug)]
+pub enum FullTextIndexError {
+ #[error("Multiple tokens found in frequencies blockfile")]


I don't understand this error? (I may just be missing something.)

The frequencies blockfile should have exactly zero or one entries for a given token. If it has multiple, then something very bad has happened (constraint violation or data corruption).
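The zero-or-one invariant described above can be sketched as follows. The PR derives its error with `#[derive(Error)]`; this version hand-writes the trait impls to stay std-only, and `FrequencyLookupError` and `single_frequency` are hypothetical names.

```rust
use std::fmt;

/// Hypothetical error mirroring the "multiple tokens" case from the PR.
#[derive(Debug, PartialEq)]
enum FrequencyLookupError {
    MultipleTokensFound,
}

impl fmt::Display for FrequencyLookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Multiple tokens found in frequencies blockfile")
    }
}

impl std::error::Error for FrequencyLookupError {}

/// `entries` is assumed to be whatever the frequencies lookup returned for one token.
/// Zero entries means the token is unknown, one entry is its frequency, and
/// anything more is treated as a constraint violation.
fn single_frequency(entries: &[u32]) -> Result<Option<u32>, FrequencyLookupError> {
    match entries {
        [] => Ok(None),
        [freq] => Ok(Some(*freq)),
        _ => Err(FrequencyLookupError::MultipleTokensFound),
    }
}
```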

HammadB · 2024-05-13T21:09:20Z

rust/worker/src/index/fulltext/types.rs

+ }
+}
+
+pub(crate) struct FullTextIndexFlusher {


[ENH] full-text search index reader and writer

cca764e

beggers commented May 10, 2024

HammadB reviewed May 13, 2024

HammadB approved these changes May 13, 2024

flusher

95d021a

HammadB reviewed May 13, 2024

merge

132af63

HammadB reviewed May 13, 2024

rust/worker/src/index/fulltext/types.rs

}

}

pub(crate) struct FullTextIndexFlusher {

HammadB May 13, 2024

thanks

beggers reacted with heart emoji

beggers merged commit a843309 into main May 13, 2024
46 checks passed
