Support of parallel launch? #149

Open
rongcuid opened this issue Dec 2, 2022 · 6 comments

@rongcuid

rongcuid commented Dec 2, 2022

Is it possible to add parallel invocation of commands? Something like GNU Parallel:

ls | parallel md5sum

Of course, using parallel directly is a solution... but some simple pooling in this package might also be helpful.

@bitfield
Owner

bitfield commented Dec 3, 2022

Sounds interesting! Can you suggest a suitable design?

@rongcuid
Author

rongcuid commented Dec 4, 2022

Well, I can think of a few ways. Maybe have an option to construct multiple pipelines manually and have them run in parallel. Or take an input, split based on some criteria (such as lines), then for each subpart run a pipeline in parallel. Optionally use a thread pool to limit the simultaneous processes. Then there might be a need to join their outputs. Maybe it can be done by lines, in-order or out-of-order.

There are quite a few possibilities, but I guess something like a parallel xargs might be a good start.

@bitfield
Owner

bitfield commented Dec 4, 2022

Great! Let's take a concrete example to help us think about what this might look like as code. The particular use case you mentioned can be done with:

script.ListFiles(".").SHA256Sums().Stdout()

Can you think of another reason you might want to run a bunch of parallel Exec commands?

@rongcuid
Author

rongcuid commented Dec 4, 2022

Hmmm, does it already work in parallel? I haven't checked.

Say I have a heavy, single-process program, ./a.out, which takes one argument as input and writes its output somewhere else. Then a parallel API might look like this:

script.ListFiles(".").Parallel().Exec("./a.out").Stdout()

Or, for something more complicated, where we would normally run ./a.out | ./b.out:

p := script.Exec("./a.out").Exec("./b.out")
script.ListFiles(".").Parallel(p).Stdout()

Or

script.ListFiles(".").Parallel().Exec("./a.out").Exec("./b.out").Join().Stdout()

Reordering output might not necessarily be a good idea, but if a program generates a small, finite amount of output, you can certainly use a reorder buffer so that the output appears as if you had run the program serially (using the first API for simplicity):

script.ListFiles(".").Parallel().Exec("./a.out").Reorder().Stdout()

@bitfield
Owner

bitfield commented Dec 5, 2022

Exec and SHA256Sums and other filters do run concurrently, that is to say, with each other. But SHA256Sums does not compute the file hashes concurrently; it generates each one sequentially. At the moment, no pipe stages process lines concurrently.

As you say, output ordering is the issue. If you buffer all the output so that you can reorder it in sequence, then there's no point generating it concurrently; you still have to wait for it all to be done before you can see any of it. With the existing sequential code, you do see the first line of output as soon as it is available.

On the other hand, some operations don't produce output, or it doesn't matter what order the output arrives in. For example, if you wanted to compress a bunch of large files, it would make sense to have all these compute-intensive operations running in parallel.

Parallel is a nice idea for a pipe method, but since it doesn't actually do anything, just has an effect on some hypothetical future Exec call, maybe it's better to write instead:

script.ListFiles(".").ExecParallel("./a.out").Stdout()

I think the implementation would likely be via io.Pipe, so while output writes from the parallel tasks would be interleaved, they wouldn't actually collide; writes are executed sequentially. For example, it might look like this:

result for file 2
result for file 1
result for file 3

@rongcuid
Author

rongcuid commented Dec 5, 2022

ExecParallel here would probably be the simplest API.

The second API I mentioned is for when you want to do ExecParallel("a | b"); in that case it might be nice to be able to write Parallel(script.Exec("a").Exec("b")) so that you can use a script pipeline. Of course, this would make the API look like a graph description language...

Reordering is probably a niche use case, and it requires small, finite output from the command to work well. I would not expect it to be a frequently used function.
