Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle exit codes and grandchild processes with envkey-source #3

Open
danenania opened this issue Mar 30, 2022 · 9 comments
Open
Assignees
Labels
envkey-source https://docs-v2.envkey.com/docs/envkey-source

Comments

@danenania
Copy link
Contributor

As described in this comment: https://news.ycombinator.com/item?id=30858713

envkey-source needs to properly handle SIGCHLD signals on unix to avoid potential zombie processes.

@gmfawcett
Copy link

As another HN commenter shared, also see these (as examples, or as partner tools to recommend):

@danenania
Copy link
Contributor Author

danenania commented Apr 1, 2022

I've been looking into this a bit and it seems like exec.Cmd.Wait does indeed check exit codes, so I think there shouldn't be any issues with zombie children. But zombie grandchildren might still be an issue.

@gmfawcett
Copy link

They will still be an issue. :)

Consider this example: your process (PID1) execs a command (let's call that process PID2, though the actual ID could be different). And PID2 spawns a child process, PID3. Now, PID2 dies unexpectedly. Your exec.cmd.Wait in PID1 could synchronously handle the exit of PID2. But, PID3 is now an orphaned process. So the kernel will reassign it to PID1 as its new parent. When PID3 terminates, when/how will PID1 wait on it?

It may actually be more challenging if call a Wait method on an exec.cmd object that you own, because you'll have to compensate for those "syncrhonous" Waits in your "asynchronous" SIGCHLD counter. Think about it -- the total count of children that you need to wait on is (A) the total children you spawned (e.g. via exec.cmd) plus (B) the total count of orphans that you inherited via PID1 orphaning rules. My point is that both of these child types will send you SIGCHLD signals when they terminate, not just the orphans. You want to wait exactly the right number of times -- wait too few times, and you will have zombies; wait too many times, and your waiting thread could block indefinitely.

Aren't POSIX semantics fun? :P

@danenania
Copy link
Contributor Author

Hmm ok, that mostly makes sense. Thanks for explaining.

One thing I'm still not understanding is how I'd know when the watched process (or one of its children) spawns its own child process in order to increment the counter. SIGCHLD fires when a process ends, but there doesn't seem to be an equivalent for a process starting. Guess I could just poll the system process table?

@gmfawcett
Copy link

gmfawcett commented Apr 1, 2022

Hm, I guess you could poll the table, but it's certain that you will miss some short-lived processes (and those are pretty common). But you don't really need this information.

Here's one simple solution. I'm not suggesting it's the best approach, but I don't think it's a wrong one. My hope here is to give you a little model you could sketch on a whiteboard, and wrap your head around this crazy legacy problem you inherited. :)

Define a counter, protected by some isolation mechanism (a channel, a mutex, an atomic, etc.). Let's say by a mutex. So every counter operation requires a Lock/Unlock.

  • "You" in the sentences below refers to the PID1 process, not the children.
  • The counter starts at 0.
  • Every time you receive a SIGCHLD, you increment the counter.
  • Every time you spawn a child, you decrement the counter.
  • Every time you Wait on a child that you spawned yourself -- and you HAVE to wait on every spawned child -- you leave the counter alone. (This wait would need to be guaranteed, e.g., handled in a defer block.)
  • In a separate goroutine: at intervals, lock the mutex on the counter; if it's positive, write that number down (N := ...) and then zero the counter; otherwise, let N = 0, and leave the counter value alone. Then unlock the mutex; execute N waitpid() calls; and sleep until the next interval.

(That separate goroutine is your orphan catcher. Ideally it should be as reliable as possible, e.g. maybe a panic should be allowed to bring the whole process down.)

The totally non-intuitive part is the "do nothing when waiting on own child." The point here is: you already accounted for that wait, when you decremented the counter at spawn-time. Effectively you said, "the main goroutine just spawned a child, and it's taking responsibility right now for waitpid()'ing exactly one child to balance the books."

It follows that all the remaining waits, the ones executed by the "separate / orphan catcher" goroutine, will be orphaned processes. Play it out on paper and see. :) If there are never any orphaned processes, the counter will vary between zero and various random, negative numbers, but will never become positive. Because if you can assume there are no orphans, a SIGCHLD (increment) is always associated to the spawn (decrement) that made it possible.

It's been a long day, and I may have made a logical error here -- please reason this through with me -- but it looks right on a second read-through. Play the game a little, see what you think. :)

@danenania danenania added the envkey-source https://docs-v2.envkey.com/docs/envkey-source label Apr 1, 2022
@danenania danenania self-assigned this Apr 1, 2022
@gmfawcett
Copy link

gmfawcett commented Apr 1, 2022

I'm sorry. I should have known better than to try & be smart after a long day. :( There's a race condition (edit: really just a bug, not a race) in my algorithm. Please don't burn energy on it. I can give an example later if you're interested -- but for now, my strong advice is either to recommend a partner tool like tini, and punt on the issue; or look at how tini et al. implement it (or again the Rust article I shared).

@danenania
Copy link
Contributor Author

danenania commented Apr 2, 2022

Oh really? I was following your logic and it seems quite good! I'm curious where the bug is.

When calling waitpid in the last step, is it right that it would be called with the pid of the current process, since it will have become the parent process of any orphaned grandchildren?

Some examples here seem to be accomplishing the same thing in a few different ways: https://golang.hotexamples.com/examples/golang.org.x.sys.unix/-/Wait4/golang-wait4-function-examples.html

@gmfawcett
Copy link

gmfawcett commented Apr 2, 2022

That's the source of the problem actually. :)

You can waitpid in many different ways. Two that matter here are: asking for an explicit PID to wait on, and asking for any child process that has completed. Reading the system-call manual page is helpful, e.g., https://linux.die.net/man/2/waitpid, see the comments about non-positive PID values, esp. waitpid(-1).

cmd.exec.Wait uses the specific PID approach. But for the orphan collector, you don't want to know which PID you are reaping: you just want to wait the right number of times. So you use the second approach (any child).

The bug / race condition in my algorithm is because we are mixing the two methods. The counter method is workable when all waits are of the "don't care which" variety. You're just counting signals, and emitting the right number of waits. But when you mix this with a concurrent process (goroutine) that is doing explicit-PID waits, then under the right circumstances, this sequence of events can happen:

  1. you spawn a child process that has (e.g.) PID = 99.
  2. that process dies, resulting in a SIGCHLD signal. Your counter is incremented.
  3. ... other counter changes may happen here ...
  4. your orphan goroutine wakes up. The counter is currently positive. So it executes a waitpid(-1).
  5. the OS selects PID 99 to be reaped by that waitpid call.
  6. finally, your main goroutine calls waitpid(99).

But PID 99 has already been reaped. So the waitpid(99) in the final step will raise an ECHILD error (child doesn't exist). And now our bookkeeping has been upset.

For your own program's purposes, you may need to maintain some control over when your child processes come to an end. If you can tolerate having another goroutine (the orphan catcher) doing all the waits, even on your own child processes -- i.e., you never call exec.cmd.Wait -- then I think the variation below will work (famous last words, lol):

  • The counter starts at 0.
  • Every time you receive a SIGCHLD, you increment the counter.
  • In a goroutine: at intervals, lock the mutex on the counter. It will be either zero or positive. Let N be the number. Zero out the counter, and unlock the mutex. Execute N waitpid(-1) calls; and sleep until the next interval.

As you can see it's literally just counting SIGCHLDs and executing that number of waitpid(-1) calls.

But. This will all be WAY simpler if you let there be another process above yours, doing all this work. :) The dumb-init / tini pattern is a smart one. I worry that you're going to lose focus on your product features by getting lost in obscure process-management issues.

@danenania
Copy link
Contributor Author

Thanks @gmfawcett, I'll take your advice and suggest one of these tools for the time being.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
envkey-source https://docs-v2.envkey.com/docs/envkey-source
Projects
None yet
Development

No branches or pull requests

2 participants