feat: row id index structures (experimental) #2303
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #2303      +/-   ##
==========================================
- Coverage   80.75%   80.01%    -0.75%
==========================================
  Files         192      197        +5
  Lines       56303    54302     -2001
  Branches    56303    54302     -2001
==========================================
- Hits        45469    43448     -2021
- Misses       8201     8342      +141
+ Partials     2633     2512      -121
```
I don't want to block further work on operations, so I'm leaving the performance testing where it is now. We should have someone make improvements in parallel with other work to update the query and write code paths.

**Index size**

Comparison of the size (in bytes) of the index for 1 million original row ids that are sorted, with some random percentage of rows deleted. The zero-deletion sorted case is using a

**Access speed**

Right now it's roughly in the same ballpark as a hashmap, but somewhat slower. Still, I think 100 ns is pretty decent. The only case where we are faster is when there are no deleted rows.
```rust
// TODO: what would it take to store this in a LanceV2 file?
// Or would flatbuffers be better for this?
```
Leaving this TODO for a future PR. Would appreciate input on how we want to support this. For now, protobuf seems like the easiest.
This looks like an encoding for an array of u64 values. The problem is probably less with lance v2 and more with Arrow. Our file reader returns arrow arrays at the moment. I can't think of any good way to stuff this structure into an arrow Array. Maybe this could be done with a union array, but I'm generally scared of those.
That being said, you can always put this in a file metadata buffer too, either as protobuf or as an encoded array. One advantage of using an encoded array, once the bit packing PR is done, is that we can pack into bits-per-value other than 16/32/64 (e.g. 23 or 12), although this would incur an encode/decode cost which might not make sense if the array is short.
We can also add to_bytes / from_bytes methods for the primitive encodings. This would let you store it anywhere you can place a buffer.
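A minimal sketch of what such a to_bytes / from_bytes round-trip could look like. This is hypothetical, not the lance-encoding API: the trait name, the U32Array struct, and the little-endian layout are all illustrative assumptions.

```rust
// Hypothetical trait sketching the to_bytes / from_bytes idea: anything
// that can round-trip through a flat buffer can be stored in a file
// metadata buffer.
trait BufferSerializable: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Self;
}

// Illustrative base-plus-deltas encoding (names are assumptions).
struct U32Array {
    base: u64,
    deltas: Vec<u32>,
}

impl BufferSerializable for U32Array {
    fn to_bytes(&self) -> Vec<u8> {
        // Fixed-width little-endian layout: 8-byte base, then 4 bytes per delta.
        let mut out = self.base.to_le_bytes().to_vec();
        for delta in &self.deltas {
            out.extend_from_slice(&delta.to_le_bytes());
        }
        out
    }

    fn from_bytes(bytes: &[u8]) -> Self {
        let base = u64::from_le_bytes(bytes[..8].try_into().unwrap());
        let deltas = bytes[8..]
            .chunks_exact(4)
            .map(|chunk| u32::from_le_bytes(chunk.try_into().unwrap()))
            .collect();
        Self { base, deltas }
    }
}

fn main() {
    let original = U32Array { base: 1_000_000, deltas: vec![1, 5, 9] };
    let restored = U32Array::from_bytes(&original.to_bytes());
    assert_eq!(restored.base, original.base);
    assert_eq!(restored.deltas, original.deltas);
}
```

Because the layout is fixed-width, there is no varint encode/decode cost, at the price of not packing below 32 bits per delta.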
> less with lance v2 and more with Arrow
Yeah, that's what I was thinking too. I think it's likely very important we keep the in-memory and on-disk format aligned to minimize serialization.
This looks good. I have a few questions, but we can find issues as we start to use these structures too, so I don't think we need to find everything right now.
```proto
message U32Array {
  uint64 base = 1;
  /// The deltas are stored as 32-bit unsigned integers.
  /// (we use bytes instead of uint32 to avoid overhead of varint encoding)
```
Curious, did you actually notice this overhead?
I did not. I could try to quickly measure it.
```rust
for row_id in iter.by_ref() {
    first_10.push(row_id);
    if first_10.len() > 10 {
        break;
    }
}

while let Some(row_id) = iter.next_back() {
    last_10.push(row_id);
    if last_10.len() > 10 {
        break;
    }
}
```
If there are 15 row ids will there be overlap?
They pull off the same double-ended iterator, so I don't think there should be any duplicates.
Ah, I see. I have to wrap my head around "double-ended iterator"; I'm not used to using it. I was thinking you were just starting with a forward iterator and then creating a backward iterator. I didn't realize you are reusing the iterator.
Yeah, they are interesting. I learned while writing this that they are passed through a surprising number of combinators. For example, if x is a DoubleEndedIterator, then x.enumerate() and x.enumerate().cycle() are too.
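The shared-iterator point can be sketched in a few lines. This is an illustrative standalone example, not the PR's code: both loops pull from the same double-ended iterator, so with 15 row ids the front and back halves cannot overlap.

```rust
fn main() {
    // One iterator consumed from both ends: the question above was whether
    // 15 items would produce overlap. They cannot, because every call to
    // next() or next_back() removes an item from the same shared sequence.
    let mut iter = 0u64..15;

    let mut first_10 = Vec::new();
    let mut last_10 = Vec::new();

    // Take up to 10 from the front; by_ref() keeps ownership of `iter`.
    for row_id in iter.by_ref().take(10) {
        first_10.push(row_id);
    }

    // Take up to 10 from the back of whatever remains.
    while let Some(row_id) = iter.next_back() {
        last_10.push(row_id);
        if last_10.len() >= 10 {
            break;
        }
    }

    assert_eq!(first_10, (0..10).collect::<Vec<u64>>());
    // Only 5 items remained, and they come out back-to-front.
    assert_eq!(last_10, vec![14, 13, 12, 11, 10]);
}
```

The front loop drains ids 0 through 9, so the back loop only ever sees the remaining five; no id appears in both vectors.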
rust/lance-table/src/rowids.rs
```rust
pub fn len(&self) -> u64 {
    self.iter().count() as u64
}
```
The fact that this is O(N) (is it?) is slightly surprising. I would expect this value to be cached (and maybe computed at construction) or worst case at least O(# segments)
I'll change this to sum over the segments.
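A sketch of the O(#segments) version discussed above. The Segment and RowIdSequence types here are simplified stand-ins, not the real lance-table structures: only the per-segment length matters for the illustration.

```rust
// Stand-in for a segment of row ids; only its length is relevant here.
struct Segment {
    count: u64,
}

impl Segment {
    fn len(&self) -> u64 {
        self.count
    }
}

// Stand-in for the sequence type whose len() was O(number of row ids).
struct RowIdSequence(Vec<Segment>);

impl RowIdSequence {
    /// O(number of segments) instead of O(number of row ids):
    /// sum the cached per-segment lengths rather than counting every id.
    pub fn len(&self) -> u64 {
        self.0.iter().map(|segment| segment.len()).sum()
    }
}

fn main() {
    let seq = RowIdSequence(vec![Segment { count: 100 }, Segment { count: 42 }]);
    assert_eq!(seq.len(), 142);
}
```

Caching the total at construction time would make this O(1), at the cost of keeping the cached value in sync with mutations.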
```rust
// If the last element of this sequence and the first element of the next
// sequence are ranges, we might be able to combine them into a single
// range.
```
Technically there is no guarantee that other follows self, and so the reverse could be true. The last element of other could be the first element of self, and we could merge those too (I guess any range in other could merge with any range in self). I'm guessing we just care about optimizing this case because it is quite common?
> The last element of other could be the first element of self and we could merge those too (I guess any range in other could merge with any range in self)
Not sure I follow. Remember order matters in these sequences: 0..10 + 10..20 == 0..20, but 10..20 + 0..10 != 0..20. But yeah, there are probably other things we can combine. I just chose this as one common one.
```rust
// Often, the row ids will already be provided in the order they appear.
// So the optimal way to search will be to cycle through rather than
// restarting the search from the beginning each time.
let mut segment_iter = self.0.iter().enumerate().cycle();
```
I'm a little confused, but might just be missing something. You call cycle here, which makes me think you plan to iterate through the segments more than once (e.g. as described by your comment). However, in your loop (row_ids.into_iter()...) it seems like you will call segment_iter.next() at most self.0.len() times, which means you aren't looping through multiple times.
Perhaps the thing you are missing is that we are re-using the same segment_iter in the row_ids.into_iter().for_each() loop. This means each new search for a row id will pick up right after the last one we found. The idea is that this is more efficient in the common case where the row ids we are searching for are in the same order they appear in the segment. Instead of restarting our search at the beginning of the segment for each row id, we can just keep going from where we left off. This makes the sorted case O(max(n, m)) instead of O(n * m) (n being the length of the segment and m being the number of row ids we are searching for).

The reason we limit each search to self.0.len() is so that if we are passed a non-existent row id we don't loop forever.
```rust
#[derive(PartialEq, Eq, Clone)]
pub struct Bitmap {
    pub data: Vec<u8>,
```
Could you use BooleanBuffer from arrow?
This is a good idea. There are some details with serialization to figure out, but I think it would work well and eliminate the need for this file. I'm going to leave this as a TODO for now.
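For reference, a minimal sketch of what a Vec<u8>-backed bitmap like the one under review typically does. This is illustrative, not the PR's implementation (the real struct may track length differently, and BooleanBuffer would eventually replace it); bits are stored LSB-first within each byte, matching Arrow's validity-bitmap convention.

```rust
// Simplified Vec<u8>-backed bitmap, LSB-first within each byte
// (the same bit order Arrow uses for validity bitmaps).
#[derive(PartialEq, Eq, Clone)]
pub struct Bitmap {
    pub data: Vec<u8>,
    pub len: usize,
}

impl Bitmap {
    pub fn new_empty(len: usize) -> Self {
        // One byte holds eight bits; round up.
        Self { data: vec![0u8; (len + 7) / 8], len }
    }

    pub fn set(&mut self, i: usize) {
        self.data[i / 8] |= 1 << (i % 8);
    }

    pub fn get(&self, i: usize) -> bool {
        self.data[i / 8] & (1 << (i % 8)) != 0
    }
}

fn main() {
    let mut bitmap = Bitmap::new_empty(10);
    bitmap.set(3);
    bitmap.set(9);
    assert!(bitmap.get(3));
    assert!(!bitmap.get(4));
    assert!(bitmap.get(9));
}
```

Because the bit order matches Arrow's, swapping the backing storage for BooleanBuffer later should not change the serialized bytes.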
These are experimental indices that map from stable row ids to row addresses. It's possible we will make some improvements to the serialization format or performance before stabilizing, but I'd like to defer that work so we can unblock work on the stable row ids.
These row id indices are optimized for storage size (in-memory and on-disk) and access speed.
Closes: #2308