stream: use Buffer.byteLength() to get the string length #52828

Closed
wants to merge 1 commit into from

@lpinca lpinca commented May 4, 2024

Use the byte length of the string when the decodeStrings option is set to false.

Fixes: #52818
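
For context, here is a minimal sketch of the behavior the linked issue describes, assuming a Node.js build without this change. The `pendingLength` helper is hypothetical and exists only for illustration: it writes one chunk to a `Writable` whose `write()` never completes, so the chunk stays counted in `writableLength`.

```javascript
const { Writable } = require('node:stream');

// Hypothetical helper (not part of Node.js): write a single chunk and report
// how much the stream thinks is pending.
function pendingLength(decodeStrings, chunk) {
  const writable = new Writable({
    decodeStrings,
    write(chunk, encoding, callback) {
      // Intentionally never call callback(), so the chunk stays accounted for.
    }
  });
  writable.write(chunk);
  return writable.writableLength;
}

// '€' is one UTF-16 code unit but three UTF-8 bytes.
const asString = pendingLength(false, '€'); // chunk is kept as a string
const asBuffer = pendingLength(true, '€');  // chunk is converted to a Buffer

console.log({ asString, asBuffer });
```

Without this change, the string case is counted by string length and the Buffer case by byte length, so the two numbers differ for the same input.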

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/streams

@nodejs-github-bot nodejs-github-bot added the needs-ci PRs that need a full CI run. label May 4, 2024
      encoding = 'buffer';
    } else {
      length = Buffer.byteLength(chunk);
Member Author

Should we use the encoding argument instead of forcing UTF-8?

Member

Use the encoding argument
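
To illustrate why the encoding matters here, a small sketch using only `Buffer.byteLength()` with its optional second argument. The variable names are illustrative.

```javascript
// The same one-character string has a different byte length per encoding.
const utf8Bytes = Buffer.byteLength('€');              // encoding defaults to 'utf8'
const utf16Bytes = Buffer.byteLength('€', 'utf16le');  // two bytes per code unit
const latin1Bytes = Buffer.byteLength('€', 'latin1');  // one byte per code unit

console.log({ utf8Bytes, utf16Bytes, latin1Bytes });
```

Calling `Buffer.byteLength(chunk)` without the encoding would therefore miscount chunks written with a non-UTF-8 encoding.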

@lpinca lpinca added the stream Issues and PRs related to the stream subsystem. label May 4, 2024
@lpinca lpinca changed the title stream: use Buffer.ByteLength() to get the string length stream: use Buffer.byteLength() to get the string length May 4, 2024
@targos
Member

targos commented May 4, 2024

What happens if the chunk is in the middle of a multi byte string?

@lpinca
Member Author

lpinca commented May 4, 2024

I think the same that happens when the chunk is converted to a Buffer.
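
As a rough sketch of what "converted to a Buffer" means for a string split in the middle of a character: this assumes the split falls between the two UTF-16 code units of a surrogate pair, and relies on `Buffer.from()` replacing each lone surrogate with U+FFFD when encoding as UTF-8.

```javascript
const emoji = '\u{1F600}'; // one character, two UTF-16 code units

const whole = Buffer.from(emoji);                  // the intact character
const firstHalf = Buffer.from(emoji.slice(0, 1));  // lone high surrogate
const secondHalf = Buffer.from(emoji.slice(1));    // lone low surrogate

// UTF-8 encoding of U+FFFD, the replacement character.
const replacement = Buffer.from([0xef, 0xbf, 0xbd]);

console.log(whole.length, firstHalf, secondHalf);
```

Each half is encoded on its own, so the bytes written for the two halves are not the bytes of the original character.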

@lpinca lpinca added the request-ci Add this label to start a Jenkins CI on a PR. label May 4, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label May 4, 2024
@benjamingr benjamingr added the needs-benchmark-ci PR that need a benchmark CI run. label May 4, 2024
@benjamingr
Member

benjamingr commented May 4, 2024

Benchmark CI https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1558/

                                                                                         confidence improvement accuracy (*)   (**)  (***)
streams/creation.js kind='duplex' n=50000000                                                             0.38 %       ±0.75% ±1.00% ±1.31%
streams/creation.js kind='readable' n=50000000                                                           0.29 %       ±0.63% ±0.83% ±1.09%
streams/creation.js kind='transform' n=50000000                                                          0.42 %       ±1.88% ±2.50% ±3.25%
streams/creation.js kind='writable' n=50000000                                                           0.05 %       ±0.64% ±0.85% ±1.11%
streams/destroy.js kind='duplex' n=1000000                                                              -0.07 %       ±0.50% ±0.67% ±0.87%
streams/destroy.js kind='readable' n=1000000                                                            -0.97 %       ±3.47% ±4.61% ±6.00%
streams/destroy.js kind='transform' n=1000000                                                            0.28 %       ±0.61% ±0.81% ±1.06%
streams/destroy.js kind='writable' n=1000000                                                            -0.44 %       ±0.66% ±0.87% ±1.14%
streams/pipe-object-mode.js n=5000000                                                                    0.31 %       ±0.33% ±0.44% ±0.58%
streams/pipe.js n=5000000                                                                                0.33 %       ±0.40% ±0.53% ±0.70%
streams/readable-async-iterator.js sync='no' n=100000                                                   -0.85 %       ±1.11% ±1.48% ±1.93%
streams/readable-async-iterator.js sync='yes' n=100000                                                   0.09 %       ±0.91% ±1.21% ±1.58%
streams/readable-bigread.js n=1000                                                                       0.21 %       ±0.75% ±1.00% ±1.31%
streams/readable-bigunevenread.js n=1000                                                                -0.30 %       ±0.62% ±0.83% ±1.08%
streams/readable-boundaryread.js type='buffer' n=2000                                                   -0.06 %       ±0.67% ±0.89% ±1.16%
streams/readable-boundaryread.js type='string' n=2000                                                    0.18 %       ±1.00% ±1.33% ±1.74%
streams/readable-from.js type='array' n=10000000                                                         0.03 %       ±1.12% ±1.49% ±1.95%
streams/readable-from.js type='async-generator' n=10000000                                               0.10 %       ±0.58% ±0.77% ±1.00%
streams/readable-from.js type='sync-generator-with-async-values' n=10000000                             -0.25 %       ±0.34% ±0.46% ±0.60%
streams/readable-from.js type='sync-generator-with-sync-values' n=10000000                               0.05 %       ±0.15% ±0.20% ±0.27%
streams/readable-readall.js n=5000                                                                       0.39 %       ±2.00% ±2.66% ±3.46%
streams/readable-uint8array.js kind='encoding' n=1000000                                                -0.39 %       ±0.79% ±1.05% ±1.37%
streams/readable-uint8array.js kind='read' n=1000000                                                     0.89 %       ±1.28% ±1.70% ±2.22%
streams/readable-unevenread.js n=1000                                                             *     -1.33 %       ±1.08% ±1.44% ±1.88%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='no' n=100000              *      2.98 %       ±2.44% ±3.26% ±4.27%
streams/writable-manywrites.js len=1024 callback='no' writev='no' sync='yes' n=100000           ***     -2.78 %       ±0.79% ±1.05% ±1.37%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='no' n=100000                    0.10 %       ±1.21% ±1.62% ±2.14%
streams/writable-manywrites.js len=1024 callback='no' writev='yes' sync='yes' n=100000          ***     -3.05 %       ±1.45% ±1.95% ±2.57%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='no' n=100000                    0.61 %       ±1.84% ±2.45% ±3.19%
streams/writable-manywrites.js len=1024 callback='yes' writev='no' sync='yes' n=100000          ***     -2.07 %       ±0.53% ±0.71% ±0.93%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='no' n=100000                  -0.53 %       ±0.83% ±1.12% ±1.48%
streams/writable-manywrites.js len=1024 callback='yes' writev='yes' sync='yes' n=100000          **     -1.83 %       ±1.19% ±1.58% ±2.06%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='no' n=100000                   -1.59 %       ±2.53% ±3.37% ±4.38%
streams/writable-manywrites.js len=32768 callback='no' writev='no' sync='yes' n=100000          ***     -2.08 %       ±0.98% ±1.31% ±1.70%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='no' n=100000                  -0.17 %       ±1.51% ±2.01% ±2.62%
streams/writable-manywrites.js len=32768 callback='no' writev='yes' sync='yes' n=100000           *     -1.14 %       ±0.94% ±1.26% ±1.64%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='no' n=100000                  -1.19 %       ±2.11% ±2.82% ±3.70%
streams/writable-manywrites.js len=32768 callback='yes' writev='no' sync='yes' n=100000         ***     -1.80 %       ±0.50% ±0.66% ±0.86%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='no' n=100000                  1.03 %       ±2.10% ±2.80% ±3.66%
streams/writable-manywrites.js len=32768 callback='yes' writev='yes' sync='yes' n=100000         **     -1.21 %       ±0.82% ±1.10% ±1.44%
streams/writable-uint8array.js kind='object-mode' n=50000000                                    ***     -1.84 %       ±0.72% ±0.97% ±1.29%
streams/writable-uint8array.js kind='write' n=50000000                                                   0.23 %       ±0.79% ±1.06% ±1.40%
streams/writable-uint8array.js kind='writev' n=50000000                                           *     -0.63 %       ±0.49% ±0.65% ±0.84%

Member

@benjamingr benjamingr left a comment

We need the same fix for Readable

@@ -463,12 +464,17 @@ function _write(stream, chunk, encoding, cb) {
  if (typeof chunk === 'string') {
    if ((state[kState] & kDecodeStrings) !== 0) {
      chunk = Buffer.from(chunk, encoding);
      length = chunk.length;
Member

since Buffer.byteLength already handles non-strings, it may be simpler to just always call it if it doesn't cause a performance regression
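
A quick sketch of that point: `Buffer.byteLength()` accepts strings as well as Buffers and typed arrays, so a single call could cover every chunk type. Whether that is fast enough on the hot path is a separate question this example does not address.

```javascript
const fromString = Buffer.byteLength('€', 'utf8');              // encoded length of the string
const fromBuffer = Buffer.byteLength(Buffer.from('€'));         // the Buffer's own length
const fromTypedArray = Buffer.byteLength(new Uint16Array(4));   // byteLength of the view

console.log({ fromString, fromBuffer, fromTypedArray });
```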

Use the byte length of the string when the `decodeStrings` option is set
to `false`.

Fixes: nodejs#52818
Member

@benjamingr benjamingr left a comment

Reapproving to indicate the 2% regression is fine for correctness IMO. This can also be optimized.

@lpinca lpinca added the request-ci Add this label to start a Jenkins CI on a PR. label May 4, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label May 4, 2024
@lpinca
Member Author

lpinca commented May 4, 2024

> We need the same fix for Readable.

I looked into it briefly and it does not seem trivial as it breaks cases like this

const assert = require('assert');
const { Readable } = require('stream');

const readable = new Readable({
  read() {}
});

readable.setEncoding('utf8');
readable.push('€');

const data = readable.read(1);

assert.strictEqual(data, '€');
assert.strictEqual(readable.readableLength, 0);

I'm fine with closing this PR and documenting the current behavior if we want to keep consistency with Readable.

@mcollina
Member

mcollina commented May 4, 2024

I think consistency is better, and having a fix for Readable would be preferable.

@benjamingr
Member

> I looked into it briefly and it does not seem trivial as it breaks cases like this

That's actually pretty significant breakage. I tend to think it's better to document that buffering length in strings is based on the string length and not byte size.

@lpinca
Member Author

lpinca commented May 5, 2024

> That's actually pretty significant breakage. I tend to think it's better to document that buffering length in strings is based on the string length and not byte size.

I agree. Fixing it on Readable is not easy, as readable.read(n) works on code units and not bytes when the chunk is a string.
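
A minimal sketch of the code-unit behavior, assuming a `Readable` with `setEncoding('utf8')` and a string pushed in the same encoding, so the chunk is buffered as a string:

```javascript
const { Readable } = require('node:stream');

const readable = new Readable({
  read() {}
});

readable.setEncoding('utf8');
readable.push('héllo'); // 5 code units, 6 UTF-8 bytes ('é' takes two bytes)

const totalBytes = Buffer.byteLength('héllo');
const buffered = readable.readableLength;  // counted in code units
const head = readable.read(2);             // takes 2 code units, not 2 bytes
const remaining = readable.readableLength;

console.log({ totalBytes, buffered, head, remaining });
```

If `readableLength` were counted in bytes while `read(n)` kept slicing by code units, the two numbers could no longer be reconciled without re-encoding the buffered strings.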

Member

@benjamingr benjamingr left a comment

Per discussion, we should document this (on both readable/writable) rather than fix it because it's a breaking change for .read

@lpinca lpinca closed this May 5, 2024
@lpinca lpinca deleted the fix/issue-52818 branch May 5, 2024 10:47
Labels
needs-benchmark-ci PR that need a benchmark CI run. needs-ci PRs that need a full CI run. stream Issues and PRs related to the stream subsystem.
Development

Successfully merging this pull request may close these issues.

Writable doesn't correctly count size of strings
7 participants