Replace `ExternalStream` with new `ByteStream` type #12774

IanManske · 2024-05-05T22:33:08Z

Description

This PR introduces a ByteStream type which is a Read-able stream of bytes. Internally, it has an enum over three different byte stream sources:

pub enum ByteStreamSource {
    Read(Box<dyn Read + Send + 'static>),
    File(File),
    Child(ChildProcess),
}

This is in comparison to the current RawStream type, which is an Iterator<Item = Vec<u8>> and has to allocate for each read chunk.

Currently, PipelineData::ExternalStream serves a weird dual role where it is either external command output or a wrapper around RawStream. ByteStream makes this distinction more clear (via ByteStreamSource) and replaces PipelineData::ExternalStream in this PR:

pub enum PipelineData {
    Empty,
    Value(Value, Option<PipelineMetadata>),
    ListStream(ListStream, Option<PipelineMetadata>),
    ByteStream(ByteStream, Option<PipelineMetadata>),
}

The PR is relatively large, but a decent amount of it is just repetitive changes.

This PR fixes #7017, fixes #10763, and fixes #12369.

This PR also improves performance when piping external commands. Nushell should, in most cases, have competitive pipeline throughput compared to, e.g., bash.

Command	Before (MB/s)	After (MB/s)	Bash (MB/s)
`throughput \| rg 'x'`	3059	3744	3739
`throughput \| nu --testbin relay o> /dev/null`	3508	8087	8136

User-Facing Changes

This is a breaking change for the plugin communication protocol, because the ExternalStreamInfo was replaced with ByteStreamInfo. Plugins now only have to deal with a single input stream, as opposed to the previous three streams: stdout, stderr, and exit code.
The output of describe has been changed for external/byte streams.
Temporary breaking change: bytes starts-with no longer works with byte streams. This is to keep the PR smaller, and bytes ends-with already does not work on byte streams.
If a process core dumped, then instead of having a Value::Error in the exit_code column of the output returned from complete, it now is a Value::Int with the negation of the signal number.

After Submitting

Update docs and book as necessary
Release notes (e.g., plugin protocol changes)
Adapt/convert commands to work with byte streams (high priority is str length, bytes starts-with, and maybe bytes ends-with).
Refactor the tee code, Devyn has already done some work on this.

…data` to `pipeline`

sholderbach · 2024-05-06T09:32:56Z

That is an impressive chunk of work and a nice improvement!
Resolving those incorrect buffering bugs is long overdue.

Before we run ahead just looking at the streaming side we need to talk about the error changes, to ensure we don't break our execution semantics in unintended ways.

What is the motivation for turning the exit codes into ShellError variants? (The existing separate ListStream certainly feels dirty, but the semantics change in major ways if it is an explicit error passing around but we have to look at all the relevant places, and I haven't gotten to that yet)
Can they be isolated into a separate PR in an easy way or will this become a furball here?
As externals are built in the world where bash is run without set -e exit codes are non-zero in ways that may not be errors in their logic. We treat them as such in complex executions but have to tolerate them in individual repl executions for example. See
@sophiajt 's blog post on examples
Destructuring ShellErrors in catch is still not smoothly possible, how would this affect the UX?

After Submitting
* Figure out end line trimming

What is up here? Is there a breakage we need to resolve before merging?

WindSoilder · 2024-05-06T10:28:59Z

I have the same concern about turning the exit codes into ShellError variants.
I remember @elferherrera and @sophiajt had voted against it before: #6079 (comment), #6079 (comment)

crates/nu-protocol/src/io/lines.rs

crates/nu-command/src/filters/lines.rs

crates/nu-cmd-lang/src/core_commands/do_.rs

crates/nu-plugin-protocol/src/lib.rs

crates/nu-protocol/src/io/read_iterator.rs

crates/nu-cmd-lang/src/core_commands/describe.rs

IanManske · 2024-05-10T15:46:28Z

Ok, I think this should be mostly ready for review.

Also, I stole some of your tests @devyn :P

devyn · 2024-05-14T03:52:47Z

I'll have a look again, then. Thanks for all of your hard work!

WindSoilder

Thanks for the improvement! Just leaves some thought about stderr consumer thread.

WindSoilder · 2024-05-13T22:41:24Z

crates/nu-protocol/src/pipeline/byte_stream.rs

+ (Some(out), Some(err)) => {
+ // To avoid deadlocks, we must spawn a separate thread to wait on stderr.
+ thread::scope(|s| {
+ let err_thread = thread::Builder::new()


I think it's good to name the thread, here is a reason: #7879

WindSoilder · 2024-05-13T22:54:18Z

crates/nu-protocol/src/process/child.rs

+ let stderr = self
+ .stderr
+ .take()
+ .map(|stderr| thread::Builder::new().spawn(move || consume_pipe(stderr)))


I think we can name the thread here

sholderbach · 2024-05-15T16:34:10Z

Is #12369 still part of this after removing the main error breaking changes?

IanManske · 2024-05-15T19:16:12Z

Yes, this PR still fixes #12369. If there is no exit code (i.e., the process was terminated via signal), then complete will now report the negation of the signal number in exit_code.

This is sort of a stop-gap measure until we figure out how we want to report this information.

sholderbach · 2024-05-15T20:12:35Z

crates/nu-cmd-extra/tests/commands/bytes/starts_with.rs

+// #[test]
+// fn short_stream_binary() {
+// let actual = nu!(r#"
+// nu --testbin repeater (0x[01]) 5 | bytes starts-with 0x[010101]


Pin: commented out tests

stormasm · 2024-05-16T00:03:15Z

@devyn so we agreed in the core team meeting today that once you give the green light and have reviewed the PR we will go ahead and land it...

Thank you kindly for doing this ! 😄

devyn

Looks good. I found one potential issue with bytes starts-with. A lot of the files changed are just into_value() related, with that being a Result now, so this isn't quite as big as it looks.

Didn't have any issues running it yet, and exit code is definitely working now. I'm glad you changed the each behaviour to Chunks and got rid of implicit line buffering.

Great stuff!

devyn · 2024-05-16T02:10:34Z

crates/nu-command/src/bytes/starts_with.rs

- for item in stream {
- let byte_slice = match &item {
- // String and binary data are valid byte patterns
- Ok(Value::String { val, .. }) => val.as_bytes(),


Removing this will be a breaking change, as this worked before:

^echo foo | bytes starts-with ("foo" | into binary)

In general, I think untyped byte streams should be able to be treated as binary

That was me being ~~lazy~~ making the PR smaller 😄. After this, I intend to open another PR to restore that command and also make bytes ends-with support byte streams properly.

devyn · 2024-05-16T02:15:14Z

crates/nu-command/src/filesystem/save.rs

+ // https://github.com/nushell/nushell/pull/9377 contains the reason
+ // for not using BufWriter<File>
+ //
+ // TODO: flag to specify buffered (BufWriter), line buffered (LineWriter), or no buffering (File).


I'm not sure this will ever be necessary, because the input stream is generally buffered in some way, even if by the command that produced it. If someone knows they want to use bigger buffers, it might be better to provide a command to adapt the stream with read_exact() or read_until() calls to ensure the desired chunking behaviour. And of course it should already be possible to do | lines | save if you definitely want line buffering.

I like this behaviour though, it means that the output of an external command gets written to file with the flushing behaviour it was intended to have, which is particularly good if saving to something like a FIFO where the interactivity actually matters.

Oh yeah, that's a good observation! There might be an argument to add the flags for performance reasons (e.g., to avoid the allocations), but it's probably not worth it / it won't make a noticeable difference.

IanManske · 2024-05-16T03:02:27Z

Thanks for taking the time to review this @devyn!

stormasm · 2024-05-16T05:10:48Z

@IanManske I am planning on landing this PR in the morning --- its 10pm in Oregon now...

Unless someone beats me to it...

I checked out your branch and ran all of the tests and it looks good to me...

I will await you giving me the green light to let me know if its now ready to go.

Based on Devyn's feedback it looks good to me...

Thanks for all of your hard work on this PR...

IanManske · 2024-05-16T13:17:04Z

Sounds good, feel free to land it whenever @stormasm. Thanks!

stormasm · 2024-05-16T14:26:55Z

@IanManske cool ! thanks for giving the green light 👍
I will monitor and make sure I don't see any issues now that its landed....

rgwood · 2024-05-17T04:46:17Z

Wow, it’s great to see thoughtful improvements to internals like this. Big thanks to Ian and everyone who reviewed this.

# Description Forgot that I fixed this already on my branch, but when printing without a display output hook, the implicit call to `table` gets its output mangled with newlines (since #12774). This happens when running `nu -c` or a script file. Here's that fix in one PR so it can be merged easily. # Tests + Formatting - 🟢 `toolkit fmt` - 🟢 `toolkit clippy` - 🟢 `toolkit test` - 🟢 `toolkit test stdlib`

# Description This PR introduces a `ByteStream` type which is a `Read`-able stream of bytes. Internally, it has an enum over three different byte stream sources: ```rust pub enum ByteStreamSource { Read(Box<dyn Read + Send + 'static>), File(File), Child(ChildProcess), } ``` This is in comparison to the current `RawStream` type, which is an `Iterator<Item = Vec<u8>>` and has to allocate for each read chunk. Currently, `PipelineData::ExternalStream` serves a weird dual role where it is either external command output or a wrapper around `RawStream`. `ByteStream` makes this distinction more clear (via `ByteStreamSource`) and replaces `PipelineData::ExternalStream` in this PR: ```rust pub enum PipelineData { Empty, Value(Value, Option<PipelineMetadata>), ListStream(ListStream, Option<PipelineMetadata>), ByteStream(ByteStream, Option<PipelineMetadata>), } ``` The PR is relatively large, but a decent amount of it is just repetitive changes. This PR fixes nushell#7017, fixes nushell#10763, and fixes nushell#12369. This PR also improves performance when piping external commands. Nushell should, in most cases, have competitive pipeline throughput compared to, e.g., bash. | Command | Before (MB/s) | After (MB/s) | Bash (MB/s) | | -------------------------------------------------- | -------------:| ------------:| -----------:| | `throughput \| rg 'x'` | 3059 | 3744 | 3739 | | `throughput \| nu --testbin relay o> /dev/null` | 3508 | 8087 | 8136 | # User-Facing Changes - This is a breaking change for the plugin communication protocol, because the `ExternalStreamInfo` was replaced with `ByteStreamInfo`. Plugins now only have to deal with a single input stream, as opposed to the previous three streams: stdout, stderr, and exit code. - The output of `describe` has been changed for external/byte streams. - Temporary breaking change: `bytes starts-with` no longer works with byte streams. This is to keep the PR smaller, and `bytes ends-with` already does not work on byte streams. - If a process core dumped, then instead of having a `Value::Error` in the `exit_code` column of the output returned from `complete`, it now is a `Value::Int` with the negation of the signal number. # After Submitting - Update docs and book as necessary - Release notes (e.g., plugin protocol changes) - Adapt/convert commands to work with byte streams (high priority is `str length`, `bytes starts-with`, and maybe `bytes ends-with`). - Refactor the `tee` code, Devyn has already done some work on this. --------- Co-authored-by: Devyn Cairns <[email protected]>

) # Description Forgot that I fixed this already on my branch, but when printing without a display output hook, the implicit call to `table` gets its output mangled with newlines (since nushell#12774). This happens when running `nu -c` or a script file. Here's that fix in one PR so it can be merged easily. # Tests + Formatting - 🟢 `toolkit fmt` - 🟢 `toolkit clippy` - 🟢 `toolkit test` - 🟢 `toolkit test stdlib`

YizhePKU · 2024-05-19T02:55:14Z

I'm trying to write tests for run_external.rs. After wrapping a child process into a ByteStream, how do I read from its stderr? The public APIs all seem to be reading from stdout.

I'm currently doing this, but it feels awkward:

if let PipelineData::ByteStream(byte_stream, ..) = pipeline_data {
    if let ByteStreamSource::Child(child) = byte_stream.into_source() {
        if let Some(stderr) = child.stderr.as_mut() {
            stderr.read(&mut buf);
        }
    }
}

YizhePKU · 2024-05-19T03:10:03Z

impl ByteStream {
    pub fn write_to(self, dest: &mut impl Write) -> Result<Option<ExitStatus>, ShellError> {
       // ...
    }
}

Generic reader/writer functions should take R: Read and W: Write by value (Rust API Guidelines).

IanManske · 2024-05-19T03:32:18Z

After wrapping a child process into a ByteStream, how do I read from its stderr? The public APIs all seem to be reading from stdout.

This is intentionally verbose. In fact, I should have maybe made it an unsafe API. Reading from stderr if stdout is present and vice versa could create a deadlock (unless, e.g., a separate thread is used to read stdout at the same time). In the vast majority of cases, the child process should not have a stderr. Commands that set the stderr destination to OutDest::Capture make sure to read stdout and stderr on separate threads. To "read" both stdout and stderr you can use ChildProcess::wait_with_output. Hope this helps.

Generic reader/writer functions should take R: Read and W: Write by value (Rust API Guidelines).

Oh, good to know. I guess io::copy was an unfortunate oversight?

YizhePKU · 2024-05-19T04:05:37Z

This is intentionally verbose. In fact, I should have maybe made it an unsafe API.

Rust's safety guarantees doesn't cover deadlocks, so it's probably fine.

I guess io::copy was an unfortunate oversight?

Unfortunate indeed. The Rust team tried to change it, but couldn't due to API breakage.

IanManske added 13 commits May 5, 2024 16:17

Add io and process modules to nu_protocol

e6b7d83

Create pipeline_data.rs file and change module name from `pipeline_…

adf74db

…data` to `pipeline`

Add ByteStream type and replace PipelineData::ExternalStream with it

43863fa

Fix dataframes

5806225

Apply ByteStream changes to nu-plugin

3568b01

Apply ByteStream changes to nu-cli

c7a97e6

Apply ByteStream changes to nu-cmd-extra

366e55a

Apply ByteStream changes to nu-cmd-lang

4e6677d

Apply ByteStream changes to nu-command

4744fed

Apply changes to nu-explore

28f8fa4

Apply changes to nushell main code

deccd27

Cleanup PipelineData printing

d10bdc2

Fix ByteStream support in tee

78adc49

IanManske changed the title ~~Replace ExternalStream with new Bytestream type~~ Replace ExternalStream with new ByteStream type May 5, 2024

IanManske added 5 commits May 5, 2024 18:34

Fix typo

e32d4b1

Add clippy ignore

27b3f78

Try fix windows

da23d29

Try fix windows

a2e9333

Try fix windows

40a6a47

devyn reviewed May 6, 2024

View reviewed changes

devyn reviewed May 7, 2024

View reviewed changes

crates/nu-cmd-lang/src/core_commands/describe.rs Outdated Show resolved Hide resolved

Add tests for Chunks

158fc31

IanManske marked this pull request as ready for review May 10, 2024 15:46

Merge branch 'main' into bytestream

84b99d3

WindSoilder reviewed May 14, 2024

View reviewed changes

IanManske added 2 commits May 14, 2024 10:08

Add thread names

501557b

Add thread names

a81f20b

sholderbach mentioned this pull request May 15, 2024

Show error message when processes are killed by SIGSEGV/SIGILL/etc #12874

Open

sholderbach reviewed May 15, 2024

View reviewed changes

devyn reviewed May 16, 2024

View reviewed changes

Edit comments

897d0d8

stormasm merged commit 6fd854e into nushell:main May 16, 2024
16 checks passed

hustcer added this to the v0.94.0 milestone May 17, 2024

devyn mentioned this pull request May 17, 2024

Fix the way the output of table is printed in print() #12895

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace `ExternalStream` with new `ByteStream` type #12774

Replace `ExternalStream` with new `ByteStream` type #12774

IanManske commented May 5, 2024 •

edited

sholderbach commented May 6, 2024

WindSoilder commented May 6, 2024 •

edited

IanManske commented May 10, 2024

devyn commented May 14, 2024

WindSoilder left a comment

WindSoilder May 13, 2024

WindSoilder May 13, 2024

sholderbach commented May 15, 2024

IanManske commented May 15, 2024 •

edited

sholderbach May 15, 2024

stormasm commented May 16, 2024

devyn left a comment

devyn May 16, 2024

IanManske May 16, 2024

devyn May 16, 2024

IanManske May 16, 2024

IanManske commented May 16, 2024

stormasm commented May 16, 2024 •

edited

IanManske commented May 16, 2024

stormasm commented May 16, 2024

rgwood commented May 17, 2024 •

edited

YizhePKU commented May 19, 2024 •

edited

YizhePKU commented May 19, 2024

IanManske commented May 19, 2024

YizhePKU commented May 19, 2024

Replace ExternalStream with new ByteStream type #12774

Replace ExternalStream with new ByteStream type #12774

Conversation

IanManske commented May 5, 2024 • edited

Description

User-Facing Changes

After Submitting

sholderbach commented May 6, 2024

WindSoilder commented May 6, 2024 • edited

IanManske commented May 10, 2024

devyn commented May 14, 2024

WindSoilder left a comment

Choose a reason for hiding this comment

WindSoilder May 13, 2024

Choose a reason for hiding this comment

WindSoilder May 13, 2024

Choose a reason for hiding this comment

sholderbach commented May 15, 2024

IanManske commented May 15, 2024 • edited

sholderbach May 15, 2024

Choose a reason for hiding this comment

stormasm commented May 16, 2024

devyn left a comment

Choose a reason for hiding this comment

devyn May 16, 2024

Choose a reason for hiding this comment

IanManske May 16, 2024

Choose a reason for hiding this comment

devyn May 16, 2024

Choose a reason for hiding this comment

IanManske May 16, 2024

Choose a reason for hiding this comment

IanManske commented May 16, 2024

stormasm commented May 16, 2024 • edited

IanManske commented May 16, 2024

stormasm commented May 16, 2024

rgwood commented May 17, 2024 • edited

YizhePKU commented May 19, 2024 • edited

YizhePKU commented May 19, 2024

IanManske commented May 19, 2024

YizhePKU commented May 19, 2024

Replace `ExternalStream` with new `ByteStream` type #12774

Replace `ExternalStream` with new `ByteStream` type #12774

IanManske commented May 5, 2024 •

edited

WindSoilder commented May 6, 2024 •

edited

IanManske commented May 15, 2024 •

edited

stormasm commented May 16, 2024 •

edited

rgwood commented May 17, 2024 •

edited

YizhePKU commented May 19, 2024 •

edited