[ENH] Add log materializer v2 taking care of updates/deletes/upserts + wire compactor to use it + clean up old abstraction #2239
Conversation
Force-pushed 984e063 to dc91372
Force-pushed 35e3e13 to aab64f3
pub(crate) trait SegmentWriter {
    fn apply_materialized_log_chunk(&self, records: Chunk<MaterializedLogRecord>);
    fn apply_log_chunk(&self, records: Chunk<LogRecord>);
pub(crate) trait SegmentWriter<'a> {
Can't we let the compiler infer this lifetime? (I think that's how it was working before.) Otherwise I'd prefer explicit named lifetimes.
No, this was complaining. For the distributed hnsw segment writer it failed with `error[E0726]: implicit elided lifetime not allowed here`. I don't fully understand it yet, though.
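For illustration, a minimal sketch (hypothetical names, not the PR's actual code) of the situation that triggers E0726: once the trait carries a lifetime parameter, an impl on a type that also carries one cannot leave the lifetime elided, so naming it explicitly on trait, type, and impl is the fix:

```rust
// Trait with a named lifetime, like the SegmentWriter<'a> in this PR.
pub trait SegmentWriter<'a> {
    fn apply_chunk(&mut self, records: &'a [i64]);
    fn total(&self) -> i64;
}

// Hypothetical writer that borrows its input chunks for 'a.
pub struct HnswWriter<'a> {
    chunks: Vec<&'a [i64]>,
}

// Writing `impl SegmentWriter for HnswWriter` here would be rejected with
//   error[E0726]: implicit elided lifetime not allowed here
// Naming the lifetime everywhere makes the borrow relationship explicit.
impl<'a> SegmentWriter<'a> for HnswWriter<'a> {
    fn apply_chunk(&mut self, records: &'a [i64]) {
        self.chunks.push(records);
    }
    fn total(&self) -> i64 {
        self.chunks.iter().flat_map(|c| c.iter()).sum()
    }
}

fn main() {
    let log = [1i64, 2, 3];
    let mut w = HnswWriter { chunks: Vec::new() };
    w.apply_chunk(&log);
    assert_eq!(w.total(), 6);
}
```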
.max_offset_id
    .as_ref()
    .unwrap()
    .set("", MAX_OFFSET_ID, log_record.offset_id)
I don't remember where we landed on the offset_id reuse conversation. @Ishiihara @HammadB if a record is deleted and then a new record is added, will the new record always get a brand new (and therefore highest) offset ID or do we reuse old offset IDs?
Currently we don't track "released" ids and always increment the counter when assigning a new one.
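That "never reuse" policy can be sketched roughly like this (hypothetical names, not the actual allocator): deleting a record does not return its offset id to the allocator, so re-inserting the same user id always yields a fresh, strictly larger offset:

```rust
use std::collections::HashMap;

// Sketch of a monotonic offset-id allocator that never recycles ids.
struct OffsetAllocator {
    next: u64,
    live: HashMap<String, u64>, // user id -> offset id
}

impl OffsetAllocator {
    fn new() -> Self {
        Self { next: 1, live: HashMap::new() }
    }

    fn insert(&mut self, user_id: &str) -> u64 {
        let id = self.next; // always increment; released ids are not tracked
        self.next += 1;
        self.live.insert(user_id.to_string(), id);
        id
    }

    fn delete(&mut self, user_id: &str) {
        self.live.remove(user_id); // the old offset id is simply abandoned
    }
}

fn main() {
    let mut a = OffsetAllocator::new();
    let first = a.insert("doc-1");
    a.delete("doc-1");
    let second = a.insert("doc-1"); // same user id, brand-new (highest) offset
    assert!(second > first);
}
```

Under this policy the max offset id tracked above is always the most recently assigned one.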
impl ChromaError for WriteSegmentsOperatorError {
    fn code(&self) -> crate::errors::ErrorCodes {
        match self {
            WriteSegmentsOperatorError::LogMaterializationPreparationError(e) => e.code(),
I haven't seen this before -- usually we use these `impl`s to map crate-specific errors to one of the standard grpc error codes. Does `.code()` know how to do so correctly?
It's already a `Box<dyn ChromaError>`; `.code()` comes from that trait and just forwards the code.
#[error("Preparation for log materialization failed {0}")]
LogMaterializationPreparationError(#[from] RecordSegmentReaderCreationError),
The `LogMaterializationPreparationError` composes on top of `RecordSegmentReaderCreationError`, which has an error code that is mapped to one of the standard grpc error codes, so `e.code()` will resolve to that.
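The delegation pattern being described can be sketched without the real crate types (the names below are simplified stand-ins): the inner error owns the grpc mapping, and the outer error's `code()` just forwards to it:

```rust
// Simplified stand-ins for crate::errors::ErrorCodes and the ChromaError trait.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorCodes {
    Internal,
    NotFound,
}

trait ChromaError {
    fn code(&self) -> ErrorCodes;
}

#[derive(Debug)]
struct RecordSegmentReaderCreationError;

impl ChromaError for RecordSegmentReaderCreationError {
    fn code(&self) -> ErrorCodes {
        ErrorCodes::NotFound // the inner error owns the grpc mapping
    }
}

#[derive(Debug)]
enum WriteSegmentsOperatorError {
    LogMaterializationPreparationError(RecordSegmentReaderCreationError),
}

impl ChromaError for WriteSegmentsOperatorError {
    fn code(&self) -> ErrorCodes {
        match self {
            // No re-mapping here: forward the wrapped error's code.
            WriteSegmentsOperatorError::LogMaterializationPreparationError(e) => e.code(),
        }
    }
}

fn main() {
    let err = WriteSegmentsOperatorError::LogMaterializationPreparationError(
        RecordSegmentReaderCreationError,
    );
    assert_eq!(err.code(), ErrorCodes::NotFound);
}
```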
rust/worker/src/segment/types.rs (Outdated)
// If log has [Insert, Update, Update] then final operation is Insert.
// If log has [Update, Update] then final operation is Update.
// All Upserts are converted to Update or Insert as the final operation.
// For e.g. if log has [Insert, Upsert] then final operation is update.
In this case isn't the final operation Insert?
Yes, my bad. The code is still correct, though.
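The collapse rules under discussion can be sketched like this (a hypothetical reconstruction, not the PR's actual materializer): the first operation pins whether the record materializes as an Insert or an Update, with Upserts normalized against whether the record already exists in the segment, so [Insert, Upsert] collapses to Insert:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Insert,
    Update,
    Upsert,
    Delete,
}

// Collapse a record's log entries into one final operation, per the rules in
// the comment above. Assumes `ops` is non-empty and contains no Delete.
fn final_operation(ops: &[Operation], exists_in_segment: bool) -> Operation {
    // Upserts are normalized first: Insert if the record is new, else Update.
    let normalize = |op: Operation, new_record: bool| match op {
        Operation::Upsert if new_record => Operation::Insert,
        Operation::Upsert => Operation::Update,
        other => other,
    };
    // Everything after the first op only merges data; it never changes
    // Insert into Update or vice versa.
    normalize(ops[0], !exists_in_segment)
}

fn main() {
    use Operation::*;
    assert_eq!(final_operation(&[Insert, Update, Update], false), Insert);
    assert_eq!(final_operation(&[Update, Update], true), Update);
    assert_eq!(final_operation(&[Insert, Upsert], false), Insert); // not Update
    assert_eq!(final_operation(&[Upsert], true), Update);
}
```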
// E.g. if log has [Insert(a: h), Update(a: b, c: d), Update(a: e, f: g)] then this
// will contain (a: e, c: d, f: g). This is None if the final operation
// above is Delete.
pub(super) metadata_to_be_merged: Option<Metadata>,
How do we represent metadata deletions? For example, if I insert a record with {'a': 1, 'b': 2} and then set {'a': None} to represent clearing that metadata key on the record?
Good point. I am following up on that in the next PR
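One possible way to fold such deletions into the merge (a hypothetical sketch; the thread defers the real design to a follow-up PR) is to let updates carry `Option` values, where a present value overwrites and a `None` removes the key from the materialized metadata:

```rust
use std::collections::HashMap;

// Simplified stand-in for the crate's Metadata type (values narrowed to i64).
type Metadata = HashMap<String, i64>;
// Hypothetical update type: `None` for a key means "clear this key".
type MetadataDelta = HashMap<String, Option<i64>>;

// Apply one delta on top of the metadata materialized so far.
fn apply_delta(mut base: Metadata, delta: &MetadataDelta) -> Metadata {
    for (k, v) in delta {
        match v {
            Some(val) => {
                base.insert(k.clone(), *val); // overwrite or add the key
            }
            None => {
                base.remove(k); // a None value deletes the key outright
            }
        }
    }
    base
}

fn main() {
    let base: Metadata = HashMap::from([("a".into(), 1), ("b".into(), 2)]);
    let delta: MetadataDelta = HashMap::from([("a".into(), None)]);
    let merged = apply_delta(base, &delta);
    assert!(!merged.contains_key("a")); // 'a' was cleared
    assert_eq!(merged.get("b"), Some(&2)); // 'b' untouched
}
```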
Force-pushed 8d6cd58 to 198a2f0
Would prefer if we named the lifetimes, otherwise I believe this looks good.
Description of changes
- Adds log materializer v2 taking care of updates/deletes/upserts.
- Wires the compactor (i.e. record segment writer and hnsw index writer) to use the new materialization.
- Cleans up the old, needless abstraction.
Test plan
Rust test (`pytest` for python, `yarn test` for js, `cargo test` for rust).

Documentation Changes
None