Operator Life Cycle Overhaul #568

Open
wants to merge 1 commit into master
Conversation

jacksonrnewhouse (Contributor)

This reworks how our operators shut down, both in response to EndOfData messages and for stopping checkpoints. It affects SourceOperators, ArrowOperators, and the controller.

SourceOperator changes

There are two main changes here. First, rather than every source having its own near-identical match block on the control receiver, they all call handle_control_message(). That method calls flush_before_checkpoint(), which each source must implement. This did require moving some of the tracking structs into the main source struct.
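Roughly, the shared handling looks like the sketch below. Apart from the two method names above, everything here (the enums, fields, and DummySource) is an illustrative placeholder, not Arroyo's actual types:

// Sketch only: placeholder types standing in for Arroyo's real ones.
struct CheckpointBarrier { epoch: u32, then_stop: bool }

enum ControlMessage {
    Checkpoint(CheckpointBarrier),
    Stop { immediate: bool },
}

enum SourceFinishType { Continue, Graceful, Immediate }

struct DummySource { buffered: Vec<String> }

impl DummySource {
    // The per-source hook: flush anything in flight before the barrier.
    async fn flush_before_checkpoint(&mut self, _barrier: &CheckpointBarrier) {
        self.buffered.clear();
    }

    // The shared replacement for each source's hand-rolled match block.
    async fn handle_control_message(&mut self, msg: ControlMessage) -> SourceFinishType {
        match msg {
            ControlMessage::Checkpoint(barrier) => {
                self.flush_before_checkpoint(&barrier).await;
                if barrier.then_stop {
                    SourceFinishType::Graceful // stopping checkpoint
                } else {
                    SourceFinishType::Continue
                }
            }
            ControlMessage::Stop { immediate: true } => SourceFinishType::Immediate,
            ControlMessage::Stop { immediate: false } => SourceFinishType::Graceful,
        }
    }
}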

Second, once a source finishes producing data, it now waits for either a stopping checkpoint or an immediate stop message. This lets checkpoints still be taken when some, but not all, sources have finished; previously they would have hung.
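For example (reusing the placeholder types from the sketch above), the tail of a source's run loop might look roughly like this once it runs out of data:

// Sketch: a finished source keeps servicing control messages instead of
// returning, so checkpoints can still complete across the whole job.
async fn wait_for_shutdown(
    source: &mut DummySource,
    control_rx: &mut tokio::sync::mpsc::Receiver<ControlMessage>,
) {
    while let Some(msg) = control_rx.recv().await {
        match source.handle_control_message(msg).await {
            // A stopping checkpoint or an immediate stop ends the task.
            SourceFinishType::Graceful | SourceFinishType::Immediate => return,
            // Ordinary checkpoints still flow through even though this
            // source has no more data to emit.
            SourceFinishType::Continue => {}
        }
    }
}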

ArrowOperators

The previous on_close() is largely replaced with an on_end_of_data() method, which is called when the operator is finishing. This is mainly relevant for the WatermarkGenerator, which inserts the watermark that flushes downstream nodes. on_close() is still there, but is now only used by the PreviewSink.

Like sources, operators now wait for an explicit signal to shut down, in this case a SignalMessage::Shutdown.

Each operator also tracks whether its inputs finished because they were stopped or because they ran out of data.
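Put together, the shutdown path looks roughly like the sketch below (placeholder plumbing throughout; only on_end_of_data() and SignalMessage::Shutdown come from this PR):

// Sketch only: placeholder types around the hooks named above.
enum SignalMessage { EndOfData, Stop, Shutdown }

enum FinishReason { Stopped, NoMoreData }

async fn finish_operator(
    finish_reason: FinishReason,
    signal_rx: &mut tokio::sync::mpsc::Receiver<SignalMessage>,
) {
    // 1. Give the operator a chance to flush, e.g. the WatermarkGenerator's
    //    on_end_of_data() emits the watermark that drains downstream nodes.
    // operator.on_end_of_data(&mut ctx).await;

    // 2. Remember *why* the inputs finished; a stop is not the same as
    //    genuinely running out of data.
    let _finished_by_stop = matches!(finish_reason, FinishReason::Stopped);

    // 3. Don't exit yet: wait for an explicit Shutdown signal so stopping
    //    checkpoints can still pass through this operator.
    while let Some(signal) = signal_rx.recv().await {
        if matches!(signal, SignalMessage::Shutdown) {
            return;
        }
    }
}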

Controller

The Finishing state has been overhauled. First, it is only transitioned to once all of the operators have finished processing their data. Then, it takes a stopping checkpoint, which commits any outstanding sinks.

Because we have several states that take stopping checkpoints, I've factored that out into the take_stopping_checkpoint() method on the JobConfig. I'd appreciate a check that I translated all of the states correctly.
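The shared helper reduces each of those states to roughly the sketch below; only take_stopping_checkpoint() and StoppingCheckpointOutcome::SuccessfullyStopped appear in the actual diff, and the rest is illustrative:

// Illustrative sketch; the context type and any outcome variants other
// than SuccessfullyStopped are placeholders.
enum StoppingCheckpointOutcome {
    SuccessfullyStopped,
    // failure / force-stop outcomes elided
}

struct JobContextSketch;

impl JobContextSketch {
    // Shared by every state that stops via a final checkpoint: trigger a
    // stopping checkpoint, drive it to completion (committing any
    // outstanding sinks), and report how it ended.
    async fn take_stopping_checkpoint(&mut self) -> StoppingCheckpointOutcome {
        StoppingCheckpointOutcome::SuccessfullyStopped
    }
}

A state like Finishing can then just match on ctx.take_stopping_checkpoint().await instead of re-implementing the checkpoint-and-wait loop.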

StopAndSendStop,
Finish,
FinishedStoppingCheckpoint,
NoMoreData { end_of_data: bool },
Member:

This reads as a bit confusing to me — NoMoreData and end_of_data seem like synonyms, but we can have NoMoreData { end_of_data: false }. Is there a better name for one of them?

Stop,
StopAndSendStop,
Finish,
FinishedStoppingCheckpoint,
Member:

These names don't make sense to me for ControlOutcome, which is supposed to tell the operator what to do ("continue", "stop", "finish", etc.).

.await;
}
SourceFinishType::Immediate => {
ctx.broadcast(ArrowMessage::Signal(SignalMessage::Shutdown))
Member:

For immediate shutdown, I think the source should just exit or use try_send, instead of blocking on sending a message through the dataflow. (I believe this isn't new behavior in this PR, but the changes have made it more obvious what's happening.)

The point of immediate shutdown is that there may be problems processing messages, or a ton of backpressure, which prevents messages from flowing (and may even block sending this message).

Immediate shutdown is a message from the user that they would like to shut down immediately, without waiting on the dataflow. That's accomplished fastest by exiting and letting the queues close.
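Roughly (placeholder channel and message types, not the actual operator context API):

// Sketch: prefer a non-blocking, best-effort signal on immediate shutdown.
use tokio::sync::mpsc;

enum StopKind { Graceful, Immediate }

async fn signal_shutdown(kind: StopKind, out_tx: &mpsc::Sender<&'static str>) {
    match kind {
        // Graceful: fine to wait for the signal to flow downstream.
        StopKind::Graceful => {
            let _ = out_tx.send("shutdown").await;
        }
        // Immediate: never block on a possibly backpressured queue.
        // Best-effort try_send, then return and let the closed channels
        // propagate shutdown.
        StopKind::Immediate => {
            let _ = out_tx.try_send("shutdown");
        }
    }
}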

@@ -172,6 +172,7 @@ service ControllerGrpc {
rpc TaskStarted(TaskStartedReq) returns (TaskStartedResp);
rpc TaskCheckpointEvent(TaskCheckpointEventReq) returns (TaskCheckpointEventResp);
rpc TaskCheckpointCompleted(TaskCheckpointCompletedReq) returns (TaskCheckpointCompletedResp);
rpc TaskDataFinished(TaskFinishedReq) returns (TaskFinishedResp);
Member:

Is TaskDataFinished the right name for this? It seems to be sent whenever operators exit.

}
}
None => {
warn!("source {}-{} received None from control channel, indicating sender has been dropped",
Member:

Should this be a panic?

if job_controller.finished() {
return Ok(StoppingCheckpointOutcome::SuccessfullyStopped);
} else if !shutdown_started {
info!("Starting shutdown");
Member:

In the controller, logs should use structured logging like https://github.com/ArroyoSystems/arroyo/blob/master/crates/arroyo-controller/src/states/recovering.rs#L24; this preserves context such as the job_id and makes it easier to build alerts and monitoring around the log lines.
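That is, something along these lines (field names are illustrative):

// Sketch: job_id as a structured field rather than interpolated text.
use tracing::info;

fn log_shutdown_start(job_id: &str) {
    // The field is attached as structured context on the event, so
    // dashboards and alerts can filter on job_id instead of parsing
    // the message text.
    info!(job_id = %job_id, "starting shutdown");
}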

.rx
.recv()
.await
.expect("channel closed while receiving")
Member:

We shouldn't panic in the controller.
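For example, roughly (placeholder event type):

// Sketch: surface a closed channel to the caller instead of panicking
// inside the controller.
use tokio::sync::mpsc;

async fn next_event(rx: &mut mpsc::Receiver<String>) -> Result<String, String> {
    match rx.recv().await {
        Some(event) => Ok(event),
        // Sender dropped: let the state machine move to a failure or
        // recovery state rather than crash the controller process.
        None => Err("channel closed while waiting for events".to_string()),
    }
}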

_ => {
// ignore other messages
}
match ctx.take_stopping_checkpoint().await {
Member:

It looks like there might be some problems with rescaling. I got into a state where the operators didn't shut down cleanly, getting the controller stuck indefinitely in rescaling:
rescaling.txt.

I also wasn't able to force stop out of rescaling; the pipeline was just wedged until I restarted the controller, and then on recovery it went into CheckpointStopping, where it also got stuck.

Member:

I'm able to consistently reproduce this by doing two rescales in a row.

break;
ControlOutcome::NoMoreData { end_of_data } => {
if operator_state != OperatorState::Running {
warn!("received no more data update in operator {}-{} while in state {:?}",
Member:

I consistently see this warning when stopping pipelines. Is it an actual issue, or should we make it a debug?

Change how Finishing pipelines behave so that they take a final stopping checkpoint.
Unify how sources handle control messages.
Factor out stopping checkpoint control.