
feat: row id index structures (experimental) #2303

Merged
15 commits merged into lancedb:main on May 22, 2024

Conversation

@wjones127 (Contributor) commented May 6, 2024

These are experimental indices to map from stable row ids to row addresses. It's possible we will make some improvements to the serialization format or performance before stabilizing, but I'd like to defer that work so we can unblock further work on stable row ids.

These row id indices are optimized for storage size (in-memory and on-disk) and access speed.
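For readers new to the feature, here is a hypothetical sketch of the kind of lookup these indices provide; the names (`RowIdIndex`, `RowAddress`) are illustrative stand-ins, not the crate's actual API.

```rust
// Hypothetical sketch of the mapping described above: a stable row id maps
// to the row's current physical address (fragment id + offset in fragment).
struct RowAddress {
    fragment_id: u32,
    row_offset: u32,
}

trait RowIdIndex {
    /// Look up the current physical location of a stable row id.
    /// Returns None if the row has been deleted.
    fn get(&self, stable_row_id: u64) -> Option<RowAddress>;
}
```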

Closes: #2308

@github-actions github-actions bot added the enhancement (New feature or request) label May 6, 2024
@codecov-commenter commented May 6, 2024

Codecov Report

Attention: Patch coverage is 86.56846%, with 155 lines in your changes missing coverage. Please review.

Project coverage is 80.01%. Comparing base (2e07d71) to head (e3db3c0).
Report is 14 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-table/src/rowids/segment.rs | 87.34% | 40 Missing and 2 partials ⚠️ |
| rust/lance-table/src/rowids/encoded_array.rs | 86.03% | 37 Missing ⚠️ |
| rust/lance-table/src/rowids.rs | 82.14% | 29 Missing and 1 partial ⚠️ |
| rust/lance-table/src/rowids/serde.rs | 82.75% | 24 Missing and 6 partials ⚠️ |
| rust/lance-table/src/rowids/bitmap.rs | 92.64% | 8 Missing and 2 partials ⚠️ |
| rust/lance-table/src/rowids/index.rs | 92.40% | 4 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2303      +/-   ##
==========================================
- Coverage   80.75%   80.01%   -0.75%     
==========================================
  Files         192      197       +5     
  Lines       56303    54302    -2001     
  Branches    56303    54302    -2001     
==========================================
- Hits        45469    43448    -2021     
- Misses       8201     8342     +141     
+ Partials     2633     2512     -121     
| Flag | Coverage Δ |
|---|---|
| unittests | 80.01% <86.56%> (-0.75%) ⬇️ |


@wjones127 (Contributor Author) commented May 20, 2024

I don't want to block further work on operations, so I'm leaving the performance testing where it is for now. We should have someone make improvements in parallel with the other work to update the query and write code paths.

Index Size

Comparison of index size (in bytes) for 1 million sorted original row ids, with some random percentage of the rows deleted.

| percent_deletions | flat size (bytes) | index size (bytes) | % change |
|---|---|---|---|
| 0% | 16,000,000 | 1,194 | 99.99% |
| 25% | 12,000,000 | 126,696 | 98.9% |
| 50% | 8,000,000 | 126,696 | 98.4% |

The zero-deletion sorted case uses a Range, while the other cases are likely using a bitmap.
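For intuition about the table above, here is a hypothetical sketch (not the PR's actual types) of why a contiguous run compresses to almost nothing while deletions force a bitmap-like representation:

```rust
use std::ops::Range;

// A fully contiguous run of row ids needs only two u64s, no matter how long
// it is; once there are holes, we need roughly one bit per covered position.
enum RowIdRun {
    Contiguous(Range<u64>),
    Sparse { start: u64, bitmap: Vec<u8> },
}

fn approx_size_bytes(run: &RowIdRun) -> usize {
    match run {
        RowIdRun::Contiguous(_) => std::mem::size_of::<Range<u64>>(), // 16 bytes
        RowIdRun::Sparse { bitmap, .. } => std::mem::size_of::<u64>() + bitmap.len(),
    }
}
```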

Access speed

Right now it's roughly in the same ballpark as a HashMap, but somewhat slower. Still, I think 100 ns is pretty decent. The only case where we are faster is when there are no deleted rows.

[access_speed benchmark chart]

Comment on lines +7 to +8
// TODO: what would it take to store this in a LanceV2 file?
// Or would flatbuffers be better for this?
Contributor Author:

Leaving this TODO for a future PR. Would appreciate input on how we want to support this. For now, protobuf seems like the easiest option.

Contributor:

This looks like an encoding for an array of u64 values. The problem is probably less with lance v2 and more with Arrow. Our file reader returns arrow arrays at the moment. I can't think of any good way to stuff this structure into an arrow Array. Maybe this could be done with a union array but I'm generally scared of those.

That being said, you can always put this in a file metadata buffer too, either as protobuf or as an encoded array. One advantage of using an encoded array, once the bit packing PR is done, is that we can pack into bits-per-value other than 16/32/64 (e.g. 23 or 12), although this would incur an encode/decode cost which might not make sense if the array is short.

We can also add to_bytes / from_bytes methods for the primitive encodings. This would let you store it anywhere you can place a buffer.

Contributor Author:

> less with lance v2 and more with Arrow

Yeah, that's what I was thinking too. I think it's likely very important that we keep the in-memory and on-disk formats aligned to minimize serialization cost.
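As a rough illustration of the to_bytes / from_bytes idea floated above (a hypothetical trait, not an existing API), the goal is an encoding that round-trips through plain bytes so it can live in a protobuf field, a file metadata buffer, or anywhere else a buffer fits:

```rust
// Hypothetical shape of a byte-level round trip for a primitive encoding.
trait BufferCodec: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> std::io::Result<Self>;
}
```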

@wjones127 wjones127 added the experimental (Features that are experimental) label May 20, 2024
@wjones127 wjones127 changed the title feat: row id index structures feat: row id index structures (experimental) May 20, 2024
@wjones127 wjones127 marked this pull request as ready for review May 20, 2024 20:44
@westonpace (Contributor) left a comment:

This looks good. A few questions, but we can find issues as we start to use these structures, so I don't think we need to find everything right now.

message U32Array {
  uint64 base = 1;
  /// The deltas are stored as 32-bit unsigned integers.
  /// (we use bytes instead of uint32 to avoid overhead of varint encoding)
Contributor:

Curious, did you actually notice this overhead?

Contributor Author:

I did not. I could try to quickly measure it.
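For context on the trade-off being discussed, here is a minimal sketch (not the PR's serde code) of encoding a sorted u64 array as a base value plus fixed-width little-endian u32 deltas, which avoids paying per-value varint overhead:

```rust
// Encode values as (base, 4-byte little-endian deltas); decode reverses it.
fn encode_deltas(values: &[u64]) -> (u64, Vec<u8>) {
    let base = values.first().copied().unwrap_or(0);
    let mut bytes = Vec::with_capacity(values.len() * 4);
    for v in values {
        let delta = u32::try_from(v - base).expect("delta fits in u32");
        bytes.extend_from_slice(&delta.to_le_bytes());
    }
    (base, bytes)
}

fn decode_deltas(base: u64, bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks_exact(4)
        .map(|c| base + u64::from(u32::from_le_bytes([c[0], c[1], c[2], c[3]])))
        .collect()
}
```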

Comment on lines +49 to +61
for row_id in iter.by_ref() {
    first_10.push(row_id);
    if first_10.len() > 10 {
        break;
    }
}

while let Some(row_id) = iter.next_back() {
    last_10.push(row_id);
    if last_10.len() > 10 {
        break;
    }
}
Contributor:

If there are 15 row ids will there be overlap?

Contributor Author:

They pull off the same double-ended iterator, so I don't think there should be any duplicates.

Contributor:

Ah, I see. I have to wrap my head around "double-ended iterator"; I'm not used to using it. I was thinking you were just starting with a forward iterator and then creating a backward iterator. I didn't realize you are reusing the same iterator.

Contributor Author:

Yeah they are interesting. I learned while writing this they are passed through a surprising number of combinators. For example, if x is a DoubleEndedIterator, then x.enumerate() and x.enumerate().cycle() are too.
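A standalone example (not the PR's code) of the property discussed in this thread: because both loops pull from the same DoubleEndedIterator, the front and back reads can never overlap.

```rust
fn main() {
    let mut iter = 0..15u64; // Range is a DoubleEndedIterator
    let first: Vec<u64> = iter.by_ref().take(10).collect();

    let mut last = Vec::new();
    while let Some(x) = iter.next_back() {
        last.push(x);
    }

    assert_eq!(first, (0..10).collect::<Vec<_>>());
    // Only the 5 values the front didn't consume remain for the back.
    assert_eq!(last, vec![14, 13, 12, 11, 10]);
}
```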

Comment on lines 91 to 93
pub fn len(&self) -> u64 {
    self.iter().count() as u64
}
Contributor:

The fact that this is O(N) (is it?) is slightly surprising. I would expect this value to be cached (and maybe computed at construction), or, worst case, to be at least O(# segments).

Contributor Author:

I'll change this to sum over the segments.
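A minimal sketch of that fix, with hypothetical type and field names standing in for the PR's actual structures: keep a cached length per segment and sum those, so len() is O(number of segments).

```rust
struct Segment {
    /// Number of row ids in this segment, computed at construction.
    len: u64,
}

struct RowIdSequence(Vec<Segment>);

impl RowIdSequence {
    pub fn len(&self) -> u64 {
        // O(#segments) instead of walking every row id.
        self.0.iter().map(|segment| segment.len).sum()
    }
}
```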

Comment on lines +101 to +103
// If the last element of this sequence and the first element of next
// sequence are ranges, we might be able to combine them into a single
// range.
Contributor:

Technically there is no guarantee that other follows self and so the reverse could be true. The last element of other could be the first element of self and we could merge those too (I guess any range in other could merge with any range in self). I'm guessing we just care about optimizing this case because it is quite common?

Contributor Author:

> The last element of other could be the first element of self and we could merge those too (I guess any range in other could merge with any range in self)

Not sure I follow. Remember, order matters in these sequences: 0..10 + 10..20 == 0..20, but 10..20 + 0..10 != 0..20. But yeah, there are probably other things we can combine. I just chose this as one common case.
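A small sketch of the boundary optimization the comment describes, treating a sequence as just a Vec of ranges (hypothetical, not the PR's types): when the left sequence ends exactly where the right one begins, the two ranges fuse into one.

```rust
use std::ops::Range;

fn concat(mut left: Vec<Range<u64>>, mut right: Vec<Range<u64>>) -> Vec<Range<u64>> {
    // 0..10 followed by 10..20 becomes 0..20; the reverse order does not merge.
    let mergeable = matches!(
        (left.last(), right.first()),
        (Some(last), Some(first)) if last.end == first.start
    );
    if mergeable {
        let first = right.remove(0);
        left.last_mut().unwrap().end = first.end;
    }
    left.extend(right);
    left
}

fn main() {
    assert_eq!(concat(vec![0..10], vec![10..20]), vec![0..20]);
    assert_eq!(concat(vec![10..20], vec![0..10]), vec![10..20, 0..10]);
}
```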

// Often, the row ids will already be provided in the order they appear.
// So the optimal way to search will be to cycle through rather than
// restarting the search from the beginning each time.
let mut segment_iter = self.0.iter().enumerate().cycle();
Contributor:

I'm a little confused but might just be missing something. You call cycle here which makes me think you plan to iterate through the segments more than once (e.g. as described by your comment). However, in your loop (row_ids.into_iter()...) it seems like you will call segment_iter.next() at most self.0.len() times which means you aren't looping through multiple times.

Contributor Author:

Perhaps the thing you are missing is we are re-using the same segment_iter in the row_ids.into_iter().for_each() loop. This means each new search for a row id will pick up right after the last one we found. The idea is this is more efficient in the common case where the row ids we are searching for are in the same order they appear in the segment. Instead of restarting our search at the beginning of the segment for each row id, we can just keep going from where we left off. This makes the sorted case O(max(n, m)) instead of O(n * m) (n being the length of segment and m being number of row ids we are searching for).

The reason we limit each search to self.0.len() is so that if we are passed a non-existent row id we don't loop forever.
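To make that explanation concrete, here is a sketch of the cycled search under hypothetical types (Segment here is just a range of row ids, and self.0 becomes a plain slice). Each lookup resumes where the previous one stopped, and is capped at one full pass so an unknown row id cannot spin forever.

```rust
use std::ops::Range;

struct Segment {
    row_ids: Range<u64>,
}

fn find_segments(segments: &[Segment], row_ids: &[u64]) -> Vec<Option<usize>> {
    let mut segment_iter = segments.iter().enumerate().cycle();
    row_ids
        .iter()
        .map(|row_id| {
            segment_iter
                .by_ref()
                .take(segments.len()) // at most one full lap per lookup
                .find(|(_, segment)| segment.row_ids.contains(row_id))
                .map(|(idx, _)| idx)
        })
        .collect()
}

fn main() {
    let segments = vec![
        Segment { row_ids: 0..100 },
        Segment { row_ids: 200..300 },
    ];
    // Sorted lookups keep moving forward; the unknown id gives up after one lap.
    assert_eq!(find_segments(&segments, &[5, 250, 999]), vec![Some(0), Some(1), None]);
}
```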


#[derive(PartialEq, Eq, Clone)]
pub struct Bitmap {
    pub data: Vec<u8>,
Contributor:

Could you use BooleanBuffer from arrow?

Contributor Author:

This is a good idea. There are some details with serialization to figure out, but I think it would work well and eliminate the need for this file. I'm going to leave this as a TODO for now.
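For reference, a sketch of what an Arrow-backed bitmap could look like, assuming the arrow-rs arrow_buffer crate; the exact method set shown here is an assumption about that API, not the final design for this module.

```rust
use arrow_buffer::{BooleanBuffer, BooleanBufferBuilder};

// Build a bitmap with Arrow's BooleanBuffer instead of a hand-rolled Vec<u8>.
fn bitmap_from_bools(values: &[bool]) -> BooleanBuffer {
    let mut builder = BooleanBufferBuilder::new(values.len());
    for &v in values {
        builder.append(v);
    }
    builder.finish()
}

fn main() {
    let bitmap = bitmap_from_bools(&[true, false, true, true]);
    assert_eq!(bitmap.len(), 4);
    assert!(bitmap.value(2));
    assert_eq!(bitmap.count_set_bits(), 3);
    // The packed bits are available as raw bytes for serialization.
    let _bytes: &[u8] = bitmap.inner().as_slice();
}
```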

rust/lance-table/src/rowids/encoded_array.rs (outdated review comments, resolved)
rust/lance-table/src/rowids/segment.rs (outdated review comments, resolved)
@wjones127 wjones127 merged commit e310ab4 into lancedb:main May 22, 2024
19 checks passed
Labels: enhancement (New feature or request), experimental (Features that are experimental)

Successfully merging this pull request may close these issues.

Create the row id index data structure
3 participants