Implement GNU Make 4.4+ jobserver fifo / semaphore client support #2450

hundeboll · 2024-05-17T14:08:22Z

The principle of such a job server is rather simple: Before starting a new job (edge in ninja-speak), a token must be acquired from an external entity. On POSIX systems, that entity is simply a fifo filled with N characters. On Win32 systems it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity.

This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize CPU utilization, but if each compile job uses all available cores, the system is overloaded.
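The token protocol described above can be sketched in a few lines of C++. This is a minimal illustration, not the code from this PR: the names CreateTokenFifo and FifoTokenClient are hypothetical, the "+" token byte is arbitrary, and opening a fifo with O_RDWR is assumed to work as it does on Linux (POSIX leaves that unspecified).

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Demo-only "server" side (hypothetical helper): create a fifo at `path` and
// preload it with `tokens` bytes. The returned descriptor must stay open
// while clients run, otherwise the buffered tokens are discarded.
int CreateTokenFifo(const std::string& path, int tokens) {
  assert(tokens >= 0);
  if (mkfifo(path.c_str(), 0600) != 0) return -1;
  int fd = open(path.c_str(), O_RDWR | O_NONBLOCK | O_CLOEXEC);
  if (fd < 0) return -1;
  for (int i = 0; i < tokens; ++i) {
    if (write(fd, "+", 1) != 1) {
      close(fd);
      return -1;
    }
  }
  return fd;
}

// Hypothetical client: one byte read from the fifo is one job token.
class FifoTokenClient {
 public:
  explicit FifoTokenClient(const std::string& path)
      : fd_(open(path.c_str(), O_RDWR | O_NONBLOCK | O_CLOEXEC)) {}
  ~FifoTokenClient() {
    if (fd_ >= 0) close(fd_);
  }
  FifoTokenClient(const FifoTokenClient&) = delete;
  FifoTokenClient& operator=(const FifoTokenClient&) = delete;

  bool ok() const { return fd_ >= 0; }

  // Non-blocking: returns false when no token is currently available.
  bool TryAcquire(char* token) {
    ssize_t n;
    do {
      n = read(fd_, token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1;
  }

  // Hand the same byte back once the job has finished.
  bool Release(char token) {
    ssize_t n;
    do {
      n = write(fd_, &token, 1);
    } while (n < 0 && errno == EINTR);
    return n == 1;
  }

 private:
  int fd_;
};
```

A client would call TryAcquire() before starting each job and Release() with the same byte when the job ends. The non-blocking open means a missing token shows up as a false return rather than a stalled process.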

Note: this is a re-implementation of the last part[1] of the previous attempt to implement jobserver functionality. I have left out the server[2] part, and the older "pipe"[3] methods from here, as I don't need those. Doing so allows for a much simpler implementation.

Note note: I don't have windows or mac systems available. I would greatly appreciate anyone who can test on those for me.

[1] #2263
[2] #2260
[3] #1140

jhasse

Thanks for the PR!
My suggestions:

Declare variables when you use them, not C89-style at the beginning of the scope.
Move all function definitions to .cc files, not in a .h
Use {} even for one line statements

src/jobserver.h

src/build.h

hundeboll · 2024-05-18T07:05:47Z

Any idea why the windows build fails?

jhasse · 2024-05-18T07:32:56Z

missing #include <cassert>

hundeboll · 2024-05-18T07:42:33Z

missing #include <cassert>

Aah, missed that.

I was referring to the other error:

D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,20): error C2062: type 'unknown-type' unexpected [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]

hundeboll · 2024-05-18T07:58:09Z

Fixed now.
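The thread does not show what the fix was. MSVC error C2589 on a line that uses a "::" qualified call is the pattern usually seen when the min/max macros from <windows.h> rewrite std::min or std::max, so the sketch below assumes that cause. ClampTo is a hypothetical helper, not code from status_printer.cc. Wrapping the function name in parentheses stops a function-like macro from expanding, so it compiles with or without NOMINMAX.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: clamp `value` into [lo, hi].
// (std::min)(...) and (std::max)(...) are ordinary calls; the extra
// parentheses only prevent function-like macros named min/max from matching.
int ClampTo(int value, int lo, int hi) {
  assert(lo <= hi);
  return (std::max)(lo, (std::min)(value, hi));
}
```

The alternative is to define NOMINMAX before including <windows.h>, which has to be done consistently in every translation unit that pulls in the header.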

src/jobserver.h

src/build.cc

src/jobserver-win32.cc

hundeboll · 2024-05-19T17:06:34Z

May 19, 2024 19:03:13 David Turner ***@***.***>:

***@***.**** commented on this pull request.

In src/build.cc [#2450 (comment)]:

> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
>    while (plan_.more_to_do()) {
>      // See if we can start any more commands.
>      if (failures_allowed) {
> -      size_t capacity = command_runner_->CanRunMore();
> +      size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
>        while (capacity > 0) {

I think I found it: FindWork() will return null if the token could not be acquired, so this effectively limits the number of processes that Ninja will ever spawn. And the Acquire() / Release() methods are never blocking (related to my other comment in jobserver.h).

Now I am curious to understand what Ninja does when there are no spawned processes anymore, and no tokens available. Do you know if Ninja would be busy-looping in this case?

It breaks out of the inner while loop, and descends into WaitForCommand() below (still in the outer while loop). That function uses ppoll()/select() to wait for any command to return without busy looping. (Not at my laptop now, so details might differ slightly...) // Martin

src/jobserver-posix.cc

src/jobserver.h

hundeboll · 2024-05-19T17:15:40Z

May 19, 2024 19:13:55 David Turner ***@***.***>:

***@***.**** commented on this pull request.

In src/jobserver.h [#2450 (comment)]:

> +
> +struct Jobserver {
> +  Jobserver();
> +  ~Jobserver();
> +  void Init();
> +  bool Enabled() const;
> +  bool Acquire();
> +  void Release();
> +
> +private:
> +  bool ParseJobserverAuth(const char *type);
> +  bool AcquireToken();
> +  void ReleaseToken();
> +
> +  std::string jobserver_name_;
> +  size_t token_count_;

I recommend using default initialization for members here to simplify the source code, e.g.:

  size_t token_count_ = 0;
#ifdef _WIN32
  HANDLE sem_ = INVALID_HANDLE_VALUE;
#else
  int fd_ = -1;
#endif

wdyt?

All for it! Just didn't look enough around to see if it was something done elsewhere too...
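The suggestion above can be illustrated with a small standalone struct. This is a sketch only: JobserverState and its member names are hypothetical and simplified from the quoted header. With default member initializers, a freshly constructed object is already in a well-defined "disabled" state, so no constructor body is needed to set the fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical, simplified state holder using default member initializers.
struct JobserverState {
  std::string name;             // empty until a jobserver is detected
  std::size_t token_count = 0;  // tokens currently held by this process
#ifdef _WIN32
  HANDLE sem = INVALID_HANDLE_VALUE;
  bool Enabled() const { return sem != INVALID_HANDLE_VALUE; }
#else
  int fd = -1;
  bool Enabled() const { return fd >= 0; }
#endif

  void NoteAcquired() { ++token_count; }
  void NoteReleased() {
    assert(token_count > 0);  // releasing more than was acquired is a bug
    --token_count;
  }
};
```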

hundeboll · 2024-05-19T17:45:01Z

May 19, 2024 19:37:01 David Turner ***@***.***>:

***@***.**** commented on this pull request.

In src/build.cc [#2450 (comment)]:

> @@ -789,7 +805,7 @@ bool Builder::Build(string* err) {
>    while (plan_.more_to_do()) {
>      // See if we can start any more commands.
>      if (failures_allowed) {
> -      size_t capacity = command_runner_->CanRunMore();
> +      size_t capacity = command_runner_->CanRunMore(plan_.JobserverEnabled());
>        while (capacity > 0) {

Thanks, replying here to your email answer,

My answer shows up on GitHub too :)

which mentions that the outer loop ends up blocking in SubprocessSet::DoWork(). It looks like, from the implementation, that if there are no running commands, this would either block infinitely or just busy-loop calling perror("ppoll"), depending on whether ppoll() returns an error when there are no fds and no timeout provided to the syscall.

There should always be at least one running command (that's why there's special casing on the _token_count == 0 in jobserver.cc).

Neither is really good, but this is probably an edge case that we should document somewhere, and shouldn't stop submitting this PR. And please don't be mistaken by my remarks, this is truly an excellent feature, so thank you for uploading it :-)
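The edge case raised above (waiting with nothing to wait for) can be made explicit with a small guard. This sketch is not Ninja's SubprocessSet code: WaitForOutput and WaitResult are hypothetical names, and it uses plain poll() instead of ppoll() to stay portable. As far as I know, on Linux a poll() call with zero descriptors and a negative timeout does not fail; it sleeps until a signal arrives, which is why the guard returns early instead.

```cpp
#include <cassert>
#include <poll.h>
#include <vector>

enum class WaitResult { kReady, kTimedOut, kNothingToWaitFor, kError };

// Hypothetical wrapper: wait until one of `fds` has an event or the timeout
// expires. timeout_ms == -1 means "no timeout".
WaitResult WaitForOutput(std::vector<pollfd>& fds, int timeout_ms) {
  assert(timeout_ms >= -1);
  // No descriptors and no timeout would block with nothing able to wake us
  // except a signal, so report the situation to the caller instead.
  if (fds.empty() && timeout_ms < 0) {
    return WaitResult::kNothingToWaitFor;
  }
  int ready = poll(fds.data(), static_cast<nfds_t>(fds.size()), timeout_ms);
  if (ready < 0) {
    return WaitResult::kError;  // includes EINTR; callers may simply retry
  }
  return ready == 0 ? WaitResult::kTimedOut : WaitResult::kReady;
}
```

A build loop could treat kNothingToWaitFor as the cue to retry token acquisition after a short delay, or to report a stalled build.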

You're very much welcome. And thanks for the review!

Server mode, and client-server passthrough will probably require much more complicated changes, but this hits the sweet spot for client-only support!

The new make-4.4 approach is so excellent, because it just requires the MAKEFLAGS env to be passed on to child commands. Which makes me wonder: does ninja filter the env when starting commands? // Martin

digit-google · 2024-05-20T09:19:39Z

At the moment, Ninja always passes its environment to sub-commands, so the MAKEFLAGS value will be passed to them as well.

When using a named FIFO mechanism, either Posix or Windows, this is enough for them to participate properly in token negotiation (Ninja taking implicit token for each sub-command before launching it, as expected).

The file descriptor-based scheme will fail though (because Ninja doesn't try to keep these open in the spawned processes), and it's probably not something worthy of supporting, though this should be documented.
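To make the difference between the two schemes concrete, here is a rough sketch of how a client might classify the --jobserver-auth value found in MAKEFLAGS. It is not the parser from this PR: ParseJobserverAuth (as a free function), JobserverKind and JobserverAuth are hypothetical names. It assumes the GNU make conventions of "fifo:PATH" for the named fifo and "R,W" for the inherited pipe descriptors, and it takes the last occurrence of the flag, as the GNU make manual recommends.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class JobserverKind { kNone, kFifo, kPipeFds, kNamed };

struct JobserverAuth {
  JobserverKind kind = JobserverKind::kNone;
  std::string value;  // fifo path, "R,W" pair, or another name
};

// True for an optionally negative decimal integer such as "3" or "-2".
static bool IsInteger(const std::string& s) {
  std::size_t i = (!s.empty() && s[0] == '-') ? 1 : 0;
  if (i == s.size()) return false;
  for (; i < s.size(); ++i) {
    if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
  }
  return true;
}

JobserverAuth ParseJobserverAuth(const std::string& makeflags) {
  static const std::string kFlag = "--jobserver-auth=";
  static const std::string kFifoPrefix = "fifo:";
  JobserverAuth auth;

  // Use the last occurrence of the flag.
  std::size_t pos = makeflags.rfind(kFlag);
  if (pos == std::string::npos) return auth;

  std::size_t start = pos + kFlag.size();
  assert(start <= makeflags.size());
  std::size_t end = makeflags.find(' ', start);
  std::string arg = makeflags.substr(
      start, end == std::string::npos ? std::string::npos : end - start);
  if (arg.empty()) return auth;

  std::size_t comma = arg.find(',');
  if (arg.compare(0, kFifoPrefix.size(), kFifoPrefix) == 0) {
    auth.kind = JobserverKind::kFifo;  // usable by any child that sees the path
    auth.value = arg.substr(kFifoPrefix.size());
  } else if (comma != std::string::npos && IsInteger(arg.substr(0, comma)) &&
             IsInteger(arg.substr(comma + 1))) {
    auth.kind = JobserverKind::kPipeFds;  // only works if the fds are inherited
    auth.value = arg;
  } else {
    auth.kind = JobserverKind::kNamed;  // e.g. a named semaphore on Windows
    auth.value = arg;
  }
  return auth;
}
```

A fifo path survives being passed through the environment alone, whereas the "R,W" form is only meaningful if the parent deliberately keeps those descriptors open in the child.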

I am trying to set up some tests on top of your commits to see how we can ensure everything works as expected, and that we never regress in the future.

OT: Your answer appears in the general conversation for the PR, and not in the specific comment's thread. This loses context and can make things hard to follow. On the other hand, GitHub doesn't preserve comments when new commits are force-pushed to upload fixes (unlike Gerrit which tracks these very well), so these are not ideal either. Feel free to use whatever you prefer :)

hundeboll · 2024-05-24T08:09:04Z

@jhasse @digit-google I have fixed most of the comments, and responded to the remaining ones. Should I mark the fixed ones as resolved, or do you want to do that?

Is there anything else I need to address?

hundeboll · 2024-05-28T19:58:07Z

Rebased on master and removed the #define NOMINMAX from jobserver.h

jhasse

Thanks for the changes! The documentation is awesome :)

I've added several nitpick comments.

For the parsing I think it might be a good idea to add a unit test which checks for the error/warning cases, too (i.e. invalid MAKEFLAGS).

Should I mark the fixed ones as resolved, or do you want to do that?

Feel free to resolve the comments yourself.

src/build.cc

src/build.h

src/jobserver.cc

src/jobserver.h

src/build_test.cc

digit-google · 2024-05-30T06:39:52Z

Sorry for the late answer, but apart from the latest nits, this looks really good. Thanks for adding the unit-tests, I hope you can make them work.

src/jobserver_test.cc

digit-google · 2024-05-30T08:15:50Z

Thank you, I added a few nits, but only address them if you feel like it. This is great :)

jhasse

LGTM. Haven't tested it at all though.

src/build.cc

hundeboll · 2024-06-06T06:37:21Z

@jhasse Is anything holding this back from being merged? I would like to integrate jobserver functionality in Yocto / OpenEmbedded, and doing so would be "prettier" if I can do it with a pure backport of the ninja changes.

jhasse · 2024-06-08T20:12:04Z

Would be great if someone could test it and comment here.

robUx4 · 2024-06-11T09:28:49Z

I did some tests on VLC. We build 100+ contribs at once from autotools, CMake and meson projects. The build is started from a Makefile calling ninja for CMake and meson projects. We build ninja beforehand to have jobserver support. We use a prebuilt version of the Kitware version in our Docker image.

This build is with the Kitware version of ninja with jobserver. This other build, on the same machine at the same time, is done with this jobserver branch.

The first thing to notice is that this branch does build successfully. The other thing is that they take about the same time to build (10m51s vs 10m32s), suggesting the parallel usage (on this 48-core machine) is working as expected. It's even slightly faster but I don't think we can really conclude it's faster.

hundeboll · 2024-06-11T09:36:47Z

@robUx4 thanks for testing. I'm afraid you need to tweak the build system to use the fifo style jobserver instead of the old-style pipe-fd method:
https://code.videolan.org/robUx4/vlc/-/jobs/1800466#L1738

hundeboll · 2024-06-11T10:00:25Z

@robUx4 btw: it's probably just a matter of updating make to version 4.4 or later...

robUx4 · 2024-06-11T10:03:54Z

We use whatever Debian is giving us. It seems Debian doesn't provide make 4.4 yet: https://packages.debian.org/bookworm/make, even in sid: https://packages.debian.org/sid/make

digit-google · 2024-06-17T17:08:04Z

Hello, I could experiment today with this patch applied to a local Ninja binary, used to build a small subset of Fuchsia targets.
This subset involves launching, in the end, about 8 sub-Ninja builds in parallel, which fight for CPU resources concurrently to build around 19,000+ targets each (moderated by default with -j5 on a very powerful workstation). This includes many Rust and C++ compilation / link commands (whose toolchains support MAKEFLAGS natively).

Good news, I see a decent improvement in build times when all remote builds are disabled (which is not our default configuration): 13m47s -> 12m39s.

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

NOTE: I wrote a Python script to set up and serve the tokens, then invoke Ninja, see d6c0c1a (probably not the final version).

jdrouhard · 2024-06-17T18:41:06Z

Tested this briefly with our build, and for whatever reason, more tokens are returned to the pool than were originally acquired. This causes more and more parallel jobs to start as the build proceeds which eventually brings the system to a crawl.

At the end of our build:

make: INTERNAL: Exiting with 176 jobserver tokens available; should be 36!

I have 36 CPUs on the build machine, and specify -j 36 to the make invocation. make is 4.4.1 and the command recipe for doing the build spawns ninja with a "+" prefix. ninja logs which fifo it's using for the jobserver, so I confirmed make is passing the appropriate file down and this PR is being used.

Is it guaranteed that there will be an equal number of FindWork() calls to EdgeFinished() calls? Briefly looking through the source, I'm seeing more EdgeFinished() calls nested within other aspects of the build process, such as NodeFinished(), EdgeMaybeReady(), etc. Seems like maybe it's possible to call EdgeFinished() more times than the initial FindWork() call.

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork().

This fixes the issue I'm seeing (applied to this PR):

diff --git a/src/build.cc b/src/build.cc
index f05e31e..a1e808e 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -170,6 +170,7 @@ Edge* Plan::FindWork() {
   }
 
   Edge* work = ready_.top();
+  work->acquired_job_server_token_ = jobserver_.Enabled();
   ready_.pop();
   return work;
 }
@@ -207,7 +208,7 @@ bool Plan::EdgeFinished(Edge* edge, EdgeResult result, string* err) {
   edge->pool()->RetrieveReadyEdges(&ready_);
 
   // Return the token acquired for this very edge to the jobserver
-  if (jobserver_.Enabled()) {
+  if (edge->acquired_job_server_token_) {
     jobserver_.Release();
   }
 
diff --git a/src/graph.h b/src/graph.h
index 314c442..f908d75 100644
--- a/src/graph.h
+++ b/src/graph.h
@@ -227,6 +227,7 @@ struct Edge {
   bool deps_loaded_ = false;
   bool deps_missing_ = false;
   bool generated_by_dep_loader_ = false;
+  bool acquired_job_server_token_ = false;
   TimeStamp command_start_time_ = 0;
 
   const Rule& rule() const { return *rule_; }
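
The idea behind the diff above is that a token is returned only by an edge that actually took one. A toy model, independent of Ninja's real Plan and Edge classes (TokenPool, ToyEdge, StartEdge and FinishEdge are all hypothetical names), shows why the per-edge flag keeps the pool balanced even when "finished" is reported for edges that were never started:

```cpp
#include <cassert>

// Toy stand-in for the external jobserver: just a counter of free tokens.
struct TokenPool {
  int available;
  bool Acquire() {
    if (available == 0) return false;
    --available;
    return true;
  }
  void Release() { ++available; }
};

// Toy edge carrying the same kind of flag the diff adds to Edge.
struct ToyEdge {
  bool acquired_token = false;
};

// Called when an edge is picked to run; remembers that it holds a token.
bool StartEdge(TokenPool& pool, ToyEdge& edge) {
  assert(!edge.acquired_token);  // an edge must not be started twice
  if (!pool.Acquire()) return false;
  edge.acquired_token = true;
  return true;
}

// Called for every finished edge, including ones that never ran
// (for example edges found to be already up to date).
void FinishEdge(TokenPool& pool, ToyEdge& edge) {
  if (edge.acquired_token) {
    pool.Release();
    edge.acquired_token = false;
  }
}
```

Releasing unconditionally in FinishEdge() would grow the pool by one for every edge that was skipped, which matches the symptom reported above of make ending with more tokens than it started with.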

hundeboll · 2024-06-17T19:05:35Z

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

Nice numbers.

NOTE: I wrote a Python script to setup and serve the tokens, then invoking Ninja, see d6c0c1a (probably not the final version).

Looks good. I've also written something similar, albeit less feature rich:
https://lore.kernel.org/openembedded-core/[email protected]/

hundeboll · 2024-06-17T19:06:27Z

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

jdrouhard · 2024-06-17T19:10:36Z

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

See my edit, really small diff that fixes the problem!

Implement proper testing of the MAKEFLAGS parsing, and the token acquire/release logic in the jobserver class.

hundeboll · 2024-06-17T19:24:51Z

See my edit, really small diff that fixes the problem!

Nice.

See my edit, really small diff that fixes the problem!

Thanks! Pushed the change with some comments added :)

jhasse reviewed May 17, 2024

hundeboll force-pushed the jobserver branch from 91fda41 to 5b613ea Compare May 18, 2024 07:02

hundeboll force-pushed the jobserver branch from 5b613ea to a81fb9f Compare May 18, 2024 07:39

hundeboll force-pushed the jobserver branch 2 times, most recently from 27a269b to cc1044e Compare May 18, 2024 07:56