Implement GNU Make 4.4+ jobserver fifo / semaphore client support #2450

Open
wants to merge 2 commits into
base: master

hundeboll
Contributor

The principle of such a job server is rather simple: before starting a new job (an edge in ninja-speak), a token must be acquired from an external entity. On POSIX systems, that entity is simply a fifo filled with N characters. On win32 systems, it is a semaphore initialized to N. Once a job is finished, the token must be returned to the external entity.
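
The acquire/release cycle described above can be sketched for the POSIX fifo case roughly as follows. This is a minimal illustration, not the code from this PR: the class name `FifoTokenClient` and its methods are hypothetical, and it assumes a platform (such as Linux) where a fifo can be opened with `O_RDWR`.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Hypothetical client for a fifo-based token pool (illustration only).
class FifoTokenClient {
 public:
  ~FifoTokenClient() {
    if (fd_ >= 0) {
      close(fd_);
    }
  }

  // Opens the fifo read/write and non-blocking, so that an empty pool makes
  // read() fail immediately instead of stalling the caller.
  bool Open(const std::string& path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0 || !S_ISFIFO(st.st_mode)) {
      return false;  // missing, or not a fifo
    }
    fd_ = open(path.c_str(), O_RDWR | O_NONBLOCK);
    return fd_ >= 0;
  }

  // Tries to take one token. Returns false if the pool is currently empty.
  bool Acquire() {
    char token;
    if (read(fd_, &token, 1) != 1) {
      return false;
    }
    ++held_;
    return true;
  }

  // Puts one previously acquired token back into the pool.
  void Release() {
    assert(held_ > 0);
    const char token = '+';
    if (write(fd_, &token, 1) == 1) {
      --held_;
    }
  }

  int held() const { return held_; }

 private:
  int fd_ = -1;
  int held_ = 0;
};
```

A caller would invoke `Acquire()` before starting a job and `Release()` once that job has finished, and simply retry later when `Acquire()` returns false.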

This functionality is desired when ninja is used as part of a bigger build, such as builds with Yocto/OpenEmbedded, Buildroot and Android. Here, multiple compile jobs are executed in parallel to maximize CPU utilization, but if each compile job uses all available cores, the system is overloaded.

Note: this is a re-implementation of the last part[1] of the previous attempt to implement jobserver functionality. I have left out the server[2] part, and the older "pipe"[3] methods from here, as I don't need those. Doing so allows for a much simpler implementation.

Note note: I don't have windows or mac systems available. I would greatly appreciate anyone who can test on those for me.

[1] #2263
[2] #2260
[3] #1140

Collaborator

@jhasse jhasse left a comment

Thanks for the PR!
My suggestions:

  1. Declare variables when you use them, not C89-style at the beginning of the scope.
  2. Move all function definitions to .cc files, not in a .h
  3. Use {} even for one line statements

src/jobserver.h Outdated Show resolved Hide resolved
src/jobserver.h Outdated Show resolved Hide resolved
src/build.h Outdated Show resolved Hide resolved
@hundeboll
Contributor Author

Any idea why the windows build fails?

@jhasse
Collaborator

jhasse commented May 18, 2024

missing #include <cassert>

@hundeboll
Contributor Author

missing #include <cassert>

Aah, missed that.

I was referring to the other error:

D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,20): error C2062: type 'unknown-type' unexpected [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(125,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2589: '(': illegal token on right side of '::' [D:\a\ninja\ninja\build\libninja.vcxproj]
D:\a\ninja\ninja\src\status_printer.cc(127,25): error C2059: syntax error: ')' [D:\a\ninja\ninja\build\libninja.vcxproj]

@hundeboll hundeboll force-pushed the jobserver branch 2 times, most recently from 27a269b to cc1044e Compare May 18, 2024 07:56
@hundeboll
Contributor Author

Fixed now.

src/jobserver.h Outdated Show resolved Hide resolved
@hundeboll
Contributor Author

hundeboll commented May 19, 2024 via email

src/jobserver.h Outdated Show resolved Hide resolved
@hundeboll
Contributor Author

hundeboll commented May 19, 2024 via email

@hundeboll
Contributor Author

hundeboll commented May 19, 2024 via email

@digit-google
Contributor

At the moment, Ninja always passes its environment to sub-commands, so the MAKEFLAGS value will be passed to them as well.

When using a named FIFO mechanism, either POSIX or Windows, this is enough for them to participate properly in token negotiation (with Ninja taking an implicit token for each sub-command before launching it, as expected).

The file descriptor-based scheme will fail though (because Ninja doesn't try to keep these open in the spawned processes), and it's probably not something worthy of supporting, though this should be documented.
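
To make the distinction between the schemes concrete, here is a rough sketch of how a client might classify the `--jobserver-auth=` value found in MAKEFLAGS. The function `ParseMakeflags` and the types around it are hypothetical and not taken from this PR. The sketch assumes the three value shapes documented by GNU Make (`fifo:PATH`, an `R,W` descriptor pair, or a bare name for the Windows semaphore) and that the last occurrence of the option is the relevant one.

```cpp
#include <cassert>
#include <sstream>
#include <string>

enum class JobserverKind { kNone, kFifo, kPipeFds, kNamed };

struct JobserverSpec {
  JobserverKind kind = JobserverKind::kNone;
  std::string value;  // fifo path, "R,W" descriptor pair, or bare name
};

// Hypothetical helper: scans a MAKEFLAGS string word by word and keeps the
// last well-formed --jobserver-auth= option it sees.
JobserverSpec ParseMakeflags(const std::string& makeflags) {
  const std::string prefix = "--jobserver-auth=";
  const std::string fifo_prefix = "fifo:";
  JobserverSpec spec;
  std::istringstream words(makeflags);
  std::string word;
  while (words >> word) {
    if (word.compare(0, prefix.size(), prefix) != 0) {
      continue;  // not a jobserver option
    }
    const std::string arg = word.substr(prefix.size());
    if (arg.compare(0, fifo_prefix.size(), fifo_prefix) == 0) {
      if (arg.size() == fifo_prefix.size()) {
        continue;  // "fifo:" without a path is ignored
      }
      spec.kind = JobserverKind::kFifo;
      spec.value = arg.substr(fifo_prefix.size());
    } else if (arg.find(',') != std::string::npos) {
      spec.kind = JobserverKind::kPipeFds;  // old-style inherited descriptors
      spec.value = arg;
    } else if (!arg.empty()) {
      spec.kind = JobserverKind::kNamed;  // e.g. a Windows semaphore name
      spec.value = arg;
    }
  }
  // A recognized scheme always carries a non-empty value.
  assert((spec.kind == JobserverKind::kNone) == spec.value.empty());
  return spec;
}
```

A client that only supports the named mechanisms could then warn and fall back to its normal parallelism when it sees `kPipeFds`.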

I am trying to set up some tests on top of your commits to see how we can ensure everything works as expected, and that we never regress in the future.

OT: Your answer appears in the general conversation for the PR, and not in the specific comment's thread. This loses context and can make things hard to follow. On the other hand, GitHub doesn't preserve comments when new commits are force-pushed to upload fixes (unlike Gerrit, which tracks these very well), so these are not ideal either. Feel free to use whatever you prefer :)

@hundeboll hundeboll force-pushed the jobserver branch 3 times, most recently from 69bf358 to e92f95b Compare May 24, 2024 08:07
@hundeboll
Contributor Author

@jhasse @digit-google I have fixed most of the comments, and responded to the remaining ones. Should I mark the fixed ones as resolved, or do you want to do that?

Is there anything else I need to address?

@hundeboll
Contributor Author

Rebased on master and removed the #define NOMINMAX from jobserver.h
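
The thread does not spell out the cause of the C2589 errors quoted earlier, but that diagnostic is typical of the `min`/`max` macros from `<windows.h>` colliding with `std::min`/`std::max`. Assuming that was the cause here, one way to stay independent of `NOMINMAX` is to parenthesize the function name at the call site. The helper `ClampColumns` below is hypothetical and not taken from status_printer.cc. The two macros are stand-ins for what `<windows.h>` would define, so that the sketch behaves the same on any platform.

```cpp
#include <algorithm>
#include <cassert>

// Stand-ins for the function-like macros <windows.h> defines unless NOMINMAX
// is set. Defined here only to reproduce the situation portably.
#define min(a, b) (((a) < (b)) ? (a) : (b))
#define max(a, b) (((a) > (b)) ? (a) : (b))

// Hypothetical helper. Writing std::min(value, hi) here would be expanded by
// the macro into invalid code after "std::". With parentheses around the
// name, the next token is ')' rather than '(', so the function-like macro is
// not invoked and the real std::min / std::max templates are called.
int ClampColumns(int value, int lo, int hi) {
  assert(lo <= hi);
  return (std::max)(lo, (std::min)(value, hi));
}

// Remove the stand-in macros again so later code is unaffected.
#undef min
#undef max
```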

Collaborator

@jhasse jhasse left a comment

Thanks for the changes! The documentation is awesome :)

I've added several nitpick comments.

For the parsing, I think it might be a good idea to add a unit test which checks for the error/warning cases, too (i.e. invalid MAKEFLAGS).

Should I mark the fixed ones as resolved, or do you want to do that?

Feel free to resolve the comments yourself.

src/build.cc Outdated Show resolved Hide resolved
src/build.cc Outdated Show resolved Hide resolved
src/build.cc Outdated Show resolved Hide resolved
src/build.cc Outdated Show resolved Hide resolved
src/build.h Outdated Show resolved Hide resolved
src/jobserver.cc Outdated Show resolved Hide resolved
src/jobserver.cc Outdated Show resolved Hide resolved
src/jobserver.cc Outdated Show resolved Hide resolved
src/jobserver.h Outdated Show resolved Hide resolved
src/build_test.cc Show resolved Hide resolved
@hundeboll hundeboll force-pushed the jobserver branch 2 times, most recently from 6d81a64 to d4f279a Compare May 29, 2024 13:19
@digit-google
Contributor

Sorry for the late answer, but apart from the latest nits, this looks really good. Thanks for adding the unit-tests, I hope you can make them work.

src/jobserver_test.cc Outdated Show resolved Hide resolved
src/jobserver_test.cc Outdated Show resolved Hide resolved
@digit-google
Contributor

Thank you, I added a few nits, but only address them if you feel like it. This is great :)

@hundeboll hundeboll force-pushed the jobserver branch 2 times, most recently from 15b5aa4 to 765f294 Compare May 30, 2024 08:43
Collaborator

@jhasse jhasse left a comment

LGTM. Haven't tested it at all though.

src/build.cc Outdated Show resolved Hide resolved
src/build.cc Show resolved Hide resolved
@hundeboll
Contributor Author

@jhasse Is anything holding this back from being merged? I would like to integrate jobserver functionality in Yocto / OpenEmbedded, and doing so would be "prettier" if I can do it with a pure backport of the ninja changes.

@jhasse
Collaborator

jhasse commented Jun 8, 2024

Would be great if someone could test it and comment here.

@robUx4

robUx4 commented Jun 11, 2024

I did some tests on VLC. We build 100+ contribs at once from autotools, CMake and meson projects. The build is started from a Makefile that calls ninja for the CMake and meson projects. We build ninja beforehand to have jobserver support. We use a prebuilt version of the Kitware version on our Docker.

This build uses the Kitware version of ninja with jobserver support. This other build, run on the same machine at the same time, uses this jobserver branch.

The first thing to notice is that this branch does build successfully. The other thing is that they take about the same time to build (10m51s vs 10m32s), suggesting the parallel usage (on this 48-core machine) is working as expected. It's even slightly faster, but I don't think we can really conclude it's faster.

@hundeboll
Contributor Author

@robUx4 thanks for testing. I'm afraid you need to tweak the build system to use the fifo style jobserver instead of the old-style pipe-fd method:
https://code.videolan.org/robUx4/vlc/-/jobs/1800466#L1738

@hundeboll
Contributor Author

@robUx4 btw: it's probably just a matter of updating make to version 4.4 or later...

@robUx4

robUx4 commented Jun 11, 2024

We use whatever Debian is giving us. It seems Debian doesn't provide make 4.4 yet: https://packages.debian.org/bookworm/make, even in sid: https://packages.debian.org/sid/make

@digit-google
Contributor

Hello, I could experiment today with this patch applied to a local Ninja binary, used to build a small subset of Fuchsia targets.
This subset involves launching, in the end, about 8 sub-Ninja builds in parallel, which fight for CPU resources concurrently to build around 19,000+ targets each (moderated by default with -j5 on a very powerful workstation). This includes many Rust and C++ compilation / link commands (whose toolchains support MAKEFLAGS natively).

Good news, I see a decent improvement in build times when all remote builds are disabled (which is not our default configuration): 13m47s -> 12m39s.

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

NOTE: I wrote a Python script to set up and serve the tokens, then invoke Ninja, see d6c0c1a (probably not the final version).
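
For readers who want to see what such a setup step involves, here is a rough C++ sketch of the server side. It is not the Python script referenced above. The function names `CreateTokenPool` and `MakeflagsForPool` are hypothetical, the `'+'` token byte is an assumption (a client only needs to read and write back single bytes), and the MAKEFLAGS layout is modeled on what GNU Make 4.4 exports for its fifo mode. How many tokens to preload for a given job count is left to the caller.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Creates a fifo at `path` and preloads it with `tokens` bytes. Returns an
// open descriptor, or -1 on error. The descriptor must stay open while the
// build runs: once every descriptor is closed, the buffered tokens are lost.
int CreateTokenPool(const std::string& path, int tokens) {
  assert(tokens >= 0);
  if (mkfifo(path.c_str(), 0600) != 0) {
    return -1;
  }
  int fd = open(path.c_str(), O_RDWR | O_NONBLOCK);
  if (fd < 0) {
    unlink(path.c_str());
    return -1;
  }
  for (int i = 0; i < tokens; ++i) {
    const char token = '+';
    if (write(fd, &token, 1) != 1) {
      close(fd);
      unlink(path.c_str());
      return -1;
    }
  }
  return fd;
}

// Builds the value to export as MAKEFLAGS before launching the client.
std::string MakeflagsForPool(const std::string& path, int jobs) {
  return "-j" + std::to_string(jobs) + " --jobserver-auth=fifo:" + path;
}
```

The launcher would export the returned string as MAKEFLAGS in the environment of the ninja process, then close the descriptor and remove the fifo after the build finishes.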

@jdrouhard
Contributor

jdrouhard commented Jun 17, 2024

Tested this briefly with our build, and for whatever reason, more tokens are returned to the pool than were originally acquired. This causes more and more parallel jobs to start as the build proceeds which eventually brings the system to a crawl.

At the end of our build:

make: INTERNAL: Exiting with 176 jobserver tokens available; should be 36!

I have 36 CPUs on the build machine and pass -j 36 to the make invocation. make is 4.4.1, and the command recipe for the build spawns ninja with a + prefix. ninja logs which fifo it's using for the jobserver, so I confirmed that make is passing the appropriate file down and that this PR is being used.

Is it guaranteed that there will be an equal number of FindWork() calls and EdgeFinished() calls? Briefly looking through the source, I'm seeing more EdgeFinished() calls nested within other aspects of the build process, such as NodeFinished(), EdgeMaybeReady(), etc. It seems possible to call EdgeFinished() more times than FindWork() was called.

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork().

This fixes the issue I'm seeing (applied to this PR):

diff --git a/src/build.cc b/src/build.cc
index f05e31e..a1e808e 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -170,6 +170,7 @@ Edge* Plan::FindWork() {
   }
 
   Edge* work = ready_.top();
+  work->acquired_job_server_token_ = jobserver_.Enabled();
   ready_.pop();
   return work;
 }
@@ -207,7 +208,7 @@ bool Plan::EdgeFinished(Edge* edge, EdgeResult result, string* err) {
   edge->pool()->RetrieveReadyEdges(&ready_);
 
   // Return the token acquired for this very edge to the jobserver
-  if (jobserver_.Enabled()) {
+  if (edge->acquired_job_server_token_) {
     jobserver_.Release();
   }
 
diff --git a/src/graph.h b/src/graph.h
index 314c442..f908d75 100644
--- a/src/graph.h
+++ b/src/graph.h
@@ -227,6 +227,7 @@ struct Edge {
   bool deps_loaded_ = false;
   bool deps_missing_ = false;
   bool generated_by_dep_loader_ = false;
+  bool acquired_job_server_token_ = false;
   TimeStamp command_start_time_ = 0;
 
   const Rule& rule() const { return *rule_; }
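
The accounting problem and the idea behind the diff above can be modeled in isolation. The sketch below is a toy, not ninja's code: `ToyPlan` and `ToyEdge` are hypothetical stand-ins, and unlike the diff it also clears the flag on release so that finishing the same edge twice cannot return a second token.

```cpp
#include <cassert>

// Hypothetical stand-in for ninja's Edge: remembers whether a token was
// taken on its behalf.
struct ToyEdge {
  bool acquired_token = false;
};

// Hypothetical stand-in for the plan plus jobserver client.
class ToyPlan {
 public:
  explicit ToyPlan(int tokens) : available_(tokens) { assert(tokens >= 0); }

  // Models FindWork(): an edge is started only if a token can be taken.
  bool StartEdge(ToyEdge* edge) {
    if (available_ == 0) {
      return false;
    }
    --available_;
    edge->acquired_token = true;
    return true;
  }

  // Models EdgeFinished(): may also be reached for edges that never ran, so
  // a token is returned only if this edge actually holds one.
  void FinishEdge(ToyEdge* edge) {
    if (edge->acquired_token) {
      ++available_;
      edge->acquired_token = false;
    }
  }

  int available() const { return available_; }

 private:
  int available_;
};
```

With an unconditional release in `FinishEdge()`, every finished edge that never went through `StartEdge()` would add one token to the pool, which matches the growth reported in the make diagnostic above.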

@hundeboll
Contributor Author

For fully remote builds, we go: 5m54s -> 4m56s which is even nicer.

So this PR looks really good to me.

Nice numbers.

NOTE: I wrote a Python script to setup and serve the tokens, then invoking Ninja, see d6c0c1a (probably not the final version).

Looks good. I've also written something similar, albeit less feature rich:
https://lore.kernel.org/openembedded-core/[email protected]/

@hundeboll
Contributor Author

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

@jdrouhard
Contributor

EDIT:
This call - https://github.com/hundeboll/ninja/blob/be47d5de9312f486425c12e10b82b35d42a0d273/src/build.cc#L263 - is in EdgeMaybeReady() which is used when other nodes complete or dyndep discovery kicks in and the output (dependent edge) doesn't need to directly be built. This is an EdgeFinished() call that doesn't correlate to any FindWork()

Uf, good catch. I'll have to look into how the tokens are released again. Suggestions are welcome...

See my edit, really small diff that fixes the problem!

Implement proper testing of the MAKEFLAGS parsing, and the token
acquire/release logic in the jobserver class.
@hundeboll
Contributor Author

See my edit, really small diff that fixes the problem!

Nice.

See my edit, really small diff that fixes the problem!

Thanks! Pushed the change with some comments added :)

5 participants