Implement GNU Make 4.4+ jobserver fifo / semaphore client support #2450
Thanks for the PR!
My suggestions:
- Declare variables when you use them, not C89-style at the beginning of the scope.
- Move all function definitions to .cc files, not in a .h
- Use {} even for one line statements
Any idea why the Windows build fails?
missing
Aah, missed that. I was referring to the other error:
27a269b
to
cc1044e
Compare
Fixed now.
May 19, 2024 19:03:13 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/build.cc[#2450 (comment)]:
> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
while (plan_.more_to_do()) {
// See if we can start any more commands.
if (failures_allowed) {
- size_t capacity = command_runner_->CanRunMore();
+ size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
while (capacity > 0) {
I think I found it: FindWork() will return null if the token could not be acquired, so this effectively limits the number of processes that Ninja will ever spawn. And the Acquire() / Release() methods are never blocking (related to my other comment in jobserver.h).
Now I am curious to understand what Ninja does where there are no spawned processes anymore, and no tokens available. Do you know if Ninja would be busy-looping in this case?
It breaks out of the inner while loop and descends into WaitForCommand() below (still in the outer while loop). That function uses ppoll()/select() to wait for any command to return, without busy-looping.
(Not at my laptop now, so details might differ slightly...)
// Martin
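The control flow described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `SimulateBuildLoop` and `LoopStats` are hypothetical names, jobs finish in the order they were started, and the blocking wait is replaced by simply retiring the oldest job. It is not Ninja's actual `Builder::Build()` code.

```cpp
#include <cstddef>
#include <queue>

struct LoopStats {
  std::size_t max_running = 0;  // peak number of concurrent jobs
  std::size_t waits = 0;        // how often the loop waited for a job
  std::size_t tokens_left = 0;  // tokens in the pool when the build ends
};

// Toy model of the outer/inner loop. The first job runs on the implicit
// slot every jobserver client owns; further jobs each need a token.
LoopStats SimulateBuildLoop(std::size_t total_jobs, std::size_t tokens) {
  LoopStats stats;
  std::size_t pending = total_jobs;
  std::size_t free_tokens = tokens;
  bool implicit_in_use = false;
  std::queue<bool> running;  // per running job: does it hold a real token?

  while (pending > 0 || !running.empty()) {
    // Inner loop: start jobs until no slot can be obtained.
    while (pending > 0) {
      if (!implicit_in_use) {
        implicit_in_use = true;
        running.push(false);
      } else if (free_tokens > 0) {
        --free_tokens;
        running.push(true);
      } else {
        break;  // no token available: stop starting jobs
      }
      --pending;
    }
    if (running.size() > stats.max_running) {
      stats.max_running = running.size();
    }
    // Stand-in for the blocking wait: the oldest job finishes.
    // At least one job is always running here, so this never spins.
    ++stats.waits;
    if (running.front()) {
      ++free_tokens;
    } else {
      implicit_in_use = false;
    }
    running.pop();
  }
  stats.tokens_left = free_tokens;
  return stats;
}
```

In this model the loop performs exactly one wait per job, because it only reaches the wait step while at least one job is running.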
May 19, 2024 19:13:55 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/jobserver.h[#2450 (comment)]:
> +
+struct Jobserver {
+ Jobserver();
+ ~Jobserver();
+ void Init();
+ bool Enabled() const;
+ bool Acquire();
+ void Release();
+
+private:
+ bool ParseJobserverAuth(const char *type);
+ bool AcquireToken();
+ void ReleaseToken();
+
+ std::string jobserver_name_;
+ size_t token_count_;
I recommend using default initialization for members here to simplify the source code, e.g.:
  size_t token_count_ = 0;
#ifdef _WIN32
  HANDLE sem_ = INVALID_HANDLE_VALUE;
#else
  int fd_ = -1;
#endif
wdyt?
All for it! Just didn't look enough around to see if it was something done elsewhere too...
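For illustration, here is a minimal sketch of how default member initializers remove the need for a hand-written constructor. The struct name `JobserverState` and the rule used in `Enabled()` are assumptions made for this example; the Windows member is a plain pointer placeholder so the sketch compiles without `<windows.h>`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the PR's Jobserver struct; only the
// initialization style matters here.
struct JobserverState {
  // Each member takes its default from the initializer next to its
  // declaration, so no user-written constructor is required.
  std::string jobserver_name_;
  std::size_t token_count_ = 0;
#ifdef _WIN32
  void* sem_ = nullptr;  // placeholder for a HANDLE in this sketch
#else
  int fd_ = -1;  // -1 marks "no descriptor open"
#endif

  // Assumption for this sketch: enabled means a name was parsed.
  bool Enabled() const { return !jobserver_name_.empty(); }
};
```

A default-constructed `JobserverState` then starts in a well-defined "disabled" state without any code in a constructor body.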
May 19, 2024 19:37:01 David Turner ***@***.***>:
***@***.**** commented on this pull request.
----------------------------------------
In src/build.cc[#2450 (comment)]:
> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
while (plan_.more_to_do()) {
// See if we can start any more commands.
if (failures_allowed) {
- size_t capacity = command_runner_->CanRunMore();
+ size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
while (capacity > 0) {
Thanks, replying here to your email answer,
My answer shows up on GitHub too :)
which mentions that the outer loop ends up blocking in SubprocessSet::DoWork(). It looks like, from the implementation, that if there are no running commands, this would either block infinitely or just busy-loop calling *perror("ppoll")*, depending on whether ppoll() returns an error when there are no fds and no timeout provided to the syscall.
There should always be at least one running command (that's why there's special casing on the _token_count == 0 in jobserver.cc).
Neither is really good, but this is probably an edge case that we should document somewhere, and shouldn't stop submitting this PR. And please don't be mistaken by my remarks, this is truly an excellent feature, so thank you for uploading it :-)
You're very much welcome. And thanks for the review!
Server mode, and client-server passthrough will probably require much more complicated changes, but this hits the sweet spot for client-only support!
The new make-4.4 approach is so excellent, because it just requires the MAKEFLAGS env to be passed on to child commands. Which makes me wonder: does ninja filter the env when starting commands?
// Martin
At the moment, Ninja always passes its environment to sub-commands, so the When using a named FIFO mechanism, either POSIX or Windows, this is enough for them to participate properly in token negotiation (Ninja taking an implicit token for each sub-command before launching it, as expected). The file descriptor-based scheme will fail though (because Ninja doesn't try to keep these open in the spawned processes), and it's probably not worth supporting, though this should be documented.
I am trying to set up some tests on top of your commits to see how we can ensure everything works as expected, and that we never regress in the future.
OT: Your answer appears in the general conversation for the PR, and not in the specific comment's thread. This loses context and can make things hard to follow. On the other hand, GitHub doesn't preserve comments when new commits are force-pushed to upload fixes (unlike Gerrit, which tracks these very well), so neither option is ideal. Feel free to use whatever you prefer :)
69bf358
to
e92f95b
Compare
@jhasse @digit-google I have fixed most of the comments, and responded to the remaining ones. Should I mark the fixed ones as resolved, or do you want to do that? Is there anything else I need to address?
Rebased on master and removed the
Thanks for the changes! The documentation is awesome :)
I've added several nitpick comments.
For the parsing, I think it might be a good idea to add a unit test which checks the error/warning cases, too (i.e. invalid MAKEFLAGS).
Should I mark the fixed ones as resolved, or do you want to do that?
Feel free to resolve the comments yourself.
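As a rough illustration of the kind of parsing such a unit test would exercise, here is a sketch that extracts a `--jobserver-auth=fifo:PATH` value from a MAKEFLAGS-style string. The names `AuthKind`, `AuthResult` and this free-function `ParseJobserverAuth` are hypothetical and do not mirror the PR's implementation. The sketch assumes that the last occurrence of the flag wins (which is what I recall the GNU Make manual advising) and treats every non-fifo value, including the old `R,W` descriptor pair, as unsupported.

```cpp
#include <sstream>
#include <string>

enum class AuthKind { kNone, kFifo, kUnsupported };

struct AuthResult {
  AuthKind kind = AuthKind::kNone;
  std::string fifo_path;  // only meaningful when kind == kFifo
};

// Scans whitespace-separated words for --jobserver-auth=VALUE.
// If the flag appears more than once, the last occurrence wins.
AuthResult ParseJobserverAuth(const std::string& makeflags) {
  const std::string kFlag = "--jobserver-auth=";
  const std::string kFifoPrefix = "fifo:";
  AuthResult result;
  std::istringstream words(makeflags);
  std::string word;
  while (words >> word) {
    if (word.compare(0, kFlag.size(), kFlag) != 0) {
      continue;  // some other flag
    }
    const std::string value = word.substr(kFlag.size());
    if (value.size() > kFifoPrefix.size() &&
        value.compare(0, kFifoPrefix.size(), kFifoPrefix) == 0) {
      result.kind = AuthKind::kFifo;
      result.fifo_path = value.substr(kFifoPrefix.size());
    } else {
      // Empty path, fd pair, or anything else this sketch ignores.
      result.kind = AuthKind::kUnsupported;
      result.fifo_path.clear();
    }
  }
  return result;
}
```

Error cases worth covering in a test include an empty string, a `fifo:` prefix with no path, and an old-style descriptor pair such as `3,4`.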
6d81a64
to
d4f279a
Compare
Sorry for the late answer, but apart from the latest nits, this looks really good. Thanks for adding the unit tests, I hope you can make them work.
Thank you, I added a few nits, but only address them if you feel like it. This is great :)
15b5aa4
to
765f294
Compare
LGTM. Haven't tested it at all though.
@jhasse Is anything holding this back from being merged? I would like to integrate jobserver functionality in Yocto / OpenEmbedded, and doing so would be "prettier" if I can do it with a pure backport of the ninja changes.
Would be great if someone could test it and comment here.
I did some tests on VLC. We build 100+ contribs at once from autotools, CMake and meson projects. The build is started from a Makefile calling ninja for CMake and meson projects. We build ninja beforehand to have jobserver support. We use a prebuilt version of the Kitware version on our Docker. This build is with the Kitware version of ninja with jobserver. This other build, on the same machine at the same time, is done with this jobserver branch. The first thing to notice is that this branch does build successfully. The other thing is that they take about the same time to build (10m51s vs 10m32s), suggesting the parallel usage (on this 48-core machine) is working as expected. It's even slightly faster, but I don't think we can really conclude it's faster.
@robUx4 thanks for testing. I'm afraid you need to tweak the build system to use the fifo style jobserver instead of the old-style pipe-fd method:
@robUx4 btw: it's probably just a matter of updating
We use whatever Debian is giving us. It seems Debian doesn't provide make 4.4 yet: https://packages.debian.org/bookworm/make, even in sid: https://packages.debian.org/sid/make
Hello, I could experiment today with this patch applied to a local Ninja binary, used to build a small subset of Fuchsia targets. Good news, I see a decent improvement in build times when all remote builds are disabled (which is not our default configuration): 13m47s -> 12m39s. For fully remote builds, we go: 5m54s -> 4m56s which is even nicer. So this PR looks really good to me. NOTE: I wrote a Python script to set up and serve the tokens, then invoke Ninja, see d6c0c1a (probably not the final version).
Tested this briefly with our build, and for whatever reason, more tokens are returned to the pool than were originally acquired. This causes more and more parallel jobs to start as the build proceeds which eventually brings the system to a crawl. At the end of our build:
I have 36 CPUs on the build machine, and specify
Is it guaranteed that there will be an equal number of
EDIT: This fixes the issue I'm seeing (applied to this PR):
diff --git a/src/build.cc b/src/build.cc
index f05e31e..a1e808e 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -170,6 +170,7 @@ Edge* Plan::FindWork() {
}
Edge* work = ready_.top();
+ work->acquired_job_server_token_ = jobserver_.Enabled();
ready_.pop();
return work;
}
@@ -207,7 +208,7 @@ bool Plan::EdgeFinished(Edge* edge, EdgeResult result, string* err) {
edge->pool()->RetrieveReadyEdges(&ready_);
// Return the token acquired for this very edge to the jobserver
- if (jobserver_.Enabled()) {
+ if (edge->acquired_job_server_token_) {
jobserver_.Release();
}
diff --git a/src/graph.h b/src/graph.h
index 314c442..f908d75 100644
--- a/src/graph.h
+++ b/src/graph.h
@@ -227,6 +227,7 @@ struct Edge {
bool deps_loaded_ = false;
bool deps_missing_ = false;
bool generated_by_dep_loader_ = false;
+ bool acquired_job_server_token_ = false;
TimeStamp command_start_time_ = 0;
const Rule& rule() const { return *rule_; }
Nice numbers.
Looks good. I've also written something similar, albeit less feature rich:
Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...
See my edit, really small diff that fixes the problem!
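The idea behind that diff, releasing a token only for work that actually acquired one, can be sketched with an in-memory pool. `TokenPool`, `Job`, `StartJob` and `FinishJob` are hypothetical names for this illustration and stand in for the real fifo/semaphore and Ninja's `Edge`. Why some edges reach the finished state without having acquired a token is not modeled here.

```cpp
#include <cstddef>

// In-memory stand-in for the external token source.
struct TokenPool {
  std::size_t available;

  bool TryAcquire() {
    if (available == 0) {
      return false;
    }
    --available;
    return true;
  }
  void Release() { ++available; }
};

struct Job {
  bool holds_token = false;  // mirrors the per-edge flag in the diff
};

// Starts a job. needs_token is false for work that runs without
// taking a token from the pool.
bool StartJob(TokenPool& pool, Job& job, bool needs_token) {
  if (needs_token) {
    if (!pool.TryAcquire()) {
      return false;  // no token: the job cannot start yet
    }
    job.holds_token = true;
  }
  return true;
}

// Finishes a job and returns its token only if it really holds one,
// so releases can never outnumber acquisitions.
void FinishJob(TokenPool& pool, Job& job) {
  if (job.holds_token) {
    pool.Release();
    job.holds_token = false;
  }
}
```

With a global "jobserver enabled" check instead of the per-job flag, finishing a job that never took a token would still add one to the pool, which matches the symptom reported above.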
Implement proper testing of the MAKEFLAGS parsing, and the token acquire/release logic in the jobserver class.
Nice.
Thanks! Pushed the change with some comments added :)
The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On POSIX systems, that entity is simply a fifo filled with N characters. On Win32 systems, it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity.
This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize CPU utilization, but if each compile job uses all available cores, the system is overloaded.
Note: this is a re-implementation of the last part[1] of the previous attempt to implement jobserver functionality. I have left out the server[2] part and the older "pipe"[3] method, as I don't need those. Doing so allows for a much simpler implementation.
Note note: I don't have Windows or Mac systems available. I would greatly appreciate anyone who can test on those for me.
[1] #2263
[2] #2260
[3] #1140