Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kill worker after action execution limit is reached #825

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zbirenbaum
Copy link
Contributor

@zbirenbaum zbirenbaum commented Apr 2, 2024

Description

Allows users to set the maximum number of action executions a worker is allowed to complete. Upon reaching this limit, the worker will no longer accept new jobs, and will exit upon completing all assigned ones.

Fixes #815

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

A test has been added to ensure the going_away signal is properly sent and received

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 5 discussions need to be resolved


nativelink-config/src/cas_server.rs line 611 at r1 (raw file):

    /// If you would like for each individual action to spin up a kubernetes
    /// pod and then exit on completion, you would set this value to 1.
    pub execution_limit: Option<u64>,

tioli: Thoughts on other names to convey the idea of fixed number of execution before worker terminates, workload_limit, action_limit_before_termination, job_completion_limit, execution_termination_limit. No strong thoughts here, do wonder if we can somehow encode the idea of termination of worker in naming


nativelink-config/src/cas_server.rs line 611 at r1 (raw file):

    /// If you would like for each individual action to spin up a kubernetes
    /// pod and then exit on completion, you would set this value to 1.
    pub execution_limit: Option<u64>,

Where should we assert that this value is > 1, if someone supplies a 0 I think we should fail to launch on bad config, maybe the serialization library has some primitive value constraints?


nativelink-worker/src/local_worker.rs line 82 at r1 (raw file):

    // on by the scheduler.
    actions_in_transit: Arc<AtomicU64>,
    // The number of actions a worker has completed.

This triplet makes me think of something I've used in the past of triplets for async counter/counting patterns request/success/failure or started/executing/completed. I wonder if other parts of the code could use that kind of information exposed as metrics.

I think we should have metrics here that align with the concept of actions "started/executing/completed", current metrics afaict don't cover that, just covers start_actions_received


nativelink-worker/src/local_worker.rs line 346 at r1 (raw file):

                // Futures will never be empty due to keep alive
                if futures.len() == 1 && assigned == limit && completed == assigned {
                    std::process::exit(0);

We should debug! or info! a message about exiting due to worker reaching termination limit, probably debug! as lots of executions that exist could be noisy for logs.


nativelink-worker/tests/local_worker_test.rs line 643 at r1 (raw file):

    #[tokio::test]
    async fn worker_action_limit_test() -> Result<(), Box<dyn std::error::Error>> {

To be pedantic we should add tests that assert a 0, 1, 3 as execution_limit values. Also a test where multiple requests happen and a previous request takes longer to complete than the others. For example request1 takes T0:T4 time, request2 T1:T2, request3 T1:T3 or something of that nature to show exit conditions are not bound by last request received

@zbirenbaum zbirenbaum force-pushed the actions-per-worker branch 2 times, most recently from 3cd731e to 28bdf91 Compare April 3, 2024 21:31
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04, and 5 discussions need to be resolved


nativelink-config/src/cas_server.rs line 611 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

tioli: Thoughts on other names to convey the idea of fixed number of execution before worker terminates, workload_limit, action_limit_before_termination, job_completion_limit, execution_termination_limit. No strong thoughts here, do wonder if we can somehow encode the idea of termination of worker in naming

I updated it to be actions_before_termination which I think is pretty clear


nativelink-config/src/cas_server.rs line 611 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Where should we assert that this value is > 1, if someone supplies a 0 I think we should fail to launch on bad config, maybe the serialization library has some primitive value constraints?

I added a check inside LocalWorkerImpl::new() that throws an error stating it has to be greater than 0. This should catch during initialization like our other validation and default value checks


nativelink-worker/src/local_worker.rs line 82 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

This triplet makes me think of something I've used in the past of triplets for async counter/counting patterns request/success/failure or started/executing/completed. I wonder if other parts of the code could use that kind of information exposed as metrics.

I think we should have metrics here that align with the concept of actions "started/executing/completed", current metrics afaict don't cover that, just covers start_actions_received

I was thinking about this too. The metrics stuff is going to undergo a full rewrite soon though, so having these vars and maybe adding a way to expose them later should be sufficient.


nativelink-worker/src/local_worker.rs line 346 at r1 (raw file):

Previously, adam-singer (Adam Singer) wrote…

We should debug! or info! a message about exiting due to worker reaching termination limit, probably debug! as lots of executions that exist could be noisy for logs.

Good idea, added

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1, 1 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved

Copy link

Copy link
Collaborator

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 3 discussions need to be resolved


nativelink-config/src/cas_server.rs line 606 at r3 (raw file):

    /// The maximum number of actions a worker can executions .
    /// After this limit is reached, the nativelink binary will exit.
    /// A value of None means no limit.

nit: Put this on the last line of the comment by itself and say Default: (see other examples in the config)


nativelink-worker/src/local_worker.rs line 139 at r3 (raw file):

            grpc_client,
            worker_id,
            actions_completed: Arc::new(AtomicU64::new(0)),

I don't think these needs to be Atomic's or Arc's. Looking at how they are used, it looks like all the places have mutable access to Self.

Allows users to set the maximum number of action executions a worker is
allowed to complete. Upon reaching this limit, the worker will no longer
accept new jobs, and will exit upon completing all assigned ones.

closes: TraceMachina#815
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved


nativelink-config/src/cas_server.rs line 606 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Put this on the last line of the comment by itself and say Default: (see other examples in the config)

Done.


nativelink-worker/src/local_worker.rs line 139 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I don't think these needs to be Atomic's or Arc's. Looking at how they are used, it looks like all the places have mutable access to Self.

I tried making them normal u64 but both run into borrow checker issues I couldn't figure out how to fix due to the spawns.

Copy link
Contributor

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks, and 2 discussions need to be resolved

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Zach Birenbaum seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

how to kill worker after the remote action execution?
4 participants