Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sed script libmode #2

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from
Draft

Conversation

mralusw
Copy link

@mralusw mralusw commented Aug 8, 2021

I've been working on a version of the sed script that caches compiled files. In the process, I reorganized the sed script to

  • be source-able as a library
  • not depend on any ./ files or cd commands (by using make -C and figuring out absolute paths names for generated_file etc).

Besides being extensible and not requiring any hacks, I think the reorganized code is somewhat clearer, so I'm opening a pull request.

FWIW, I've benchmarked (results below) sed-cached with a simple sed command. It's about 15 times faster than compiling each time, but still twice as slow as native sed (only 75% slower on busybox sh with builtin applets). Yeah, it should be written in C not sh. But it was a fun as a proof of concept :)

I'm curious what you think of this. Perhaps sed-cached can make it to contrib/. Perhaps a C version would be even faster than sed, and then maybe sed-bin is not so useless after all.

hyperfine --export-markdown /tmp/sed-bin.md -w2 -L sed /S/sed-bin/sed,/S/sed-bin/sed-cached,'busybox sh /S/sed-bin/sed-cached',sed 'for i in $(seq 1 5); do {sed} s/:/_/g /etc/passwd; done
Command Mean [ms] Min [ms] Max [ms] Relative
for i in $(seq 1 5); do /S/sed-bin/sed s/:/_/g /etc/passwd; done 397.3 ± 8.2 383.1 408.5 33.07 ± 0.84
for i in $(seq 1 5); do /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done 25.5 ± 0.2 25.0 26.0 2.12 ± 0.03
for i in $(seq 1 5); do busybox sh /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done 21.4 ± 0.3 19.4 22.0 1.78 ± 0.04
for i in $(seq 1 5); do sed s/:/_/g /etc/passwd; done 12.0 ± 0.2 11.1 12.6 1.00

@mralusw
Copy link
Author

mralusw commented Aug 8, 2021

Just to be clear, my sed-cached script is on a separate sed-cached branch; this pull request only includes changes to sed.

$0 in sh is caller's, BASH_SOURCE unavailable
@mralusw
Copy link
Author

mralusw commented Aug 9, 2021

... and, finally, sed-busting performance for shell scripts, with the latest sed-cached used in library-mode. It's about 4x faster than calling sed in a regular shell (and 2x faster than calling the builtin sed, if running in a busybox with static applets)

To use this, one has to pre-define SED_DIR, SED_LIBMODE, source the sed and sed-cached scripts from the repo, and call the entrypoint function whenever desired. It takes care of computing a hash for the script and looking up a cached binary, or compiling one.

Notes at the end of https://github.com/kstr0k/sed-bin/blob/sed-cached/sed-cached

@lhoursquentin
Copy link
Owner

I really like the idea, let me look into this

Also: don't abspath any other files since we don't 'cd' anymore
@mralusw mralusw marked this pull request as draft August 10, 2021 23:21
@mralusw
Copy link
Author

mralusw commented Aug 10, 2021

Sure, have a look and let me know if you want to do things differently. I've converted this to a draft PR.

I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (./sed 's/:/_' </etc/passwd) just hangs.

@lhoursquentin
Copy link
Owner

I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (./sed 's/:/_' </etc/passwd) just hangs.

Yeah unfortunately malformed scripts are not handled correctly for the most part, I have a small note in the README regarding this:

The translator does not handle invalid sed scripts, it will just generate invalid C code which will probably fail to compile, make sure you can run your script with an actual sed implementation before attempting to translate it.

Proper error checking in the translator is possible but it would require keeping a lot more state in memory. For instance the translator isn't even aware of curly brackets nesting depth, it just prints them right away and forgets about them.

In this specific case (s/:/_) though it cannot even finish translation an is trapped in an infinite loop, I just fixed it in commit 72f5a89 since it was straightforward and clarified in the README that infinite loop is a possibility when trying to translate invalid scripts.

@mralusw
Copy link
Author

mralusw commented Aug 11, 2021

That would be easily checked using "if system_sed script </dev/null 2>/dev/null" — compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)

@lhoursquentin
Copy link
Owner

All right, I had the chance to take a more in depth look, it's really nice!
I'm just wondering if it wouldn't even be easier to merge the two together and instead have a SED_CACHED env var to trigger the cached version, so that we can keep a single sed script (which could also be sourced), what do you think?

It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with par.sed and the cached version was roughly two times slower.

That would be easily checked using "if system_sed script </dev/null 2>/dev/null" — compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)

My main concern with this approach is that some scripts can have side effects: creating and writing to files with the w command, reading files with r (which can remove data from a FIFO for instance), or with the GNU extensions even run a shell with s/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.

@mralusw
Copy link
Author

mralusw commented Aug 16, 2021

All right, I had the chance to take a more in depth look, it's really nice!
I'm just wondering if it wouldn't even be easier to merge the two together and instead have a SED_CACHED env var to trigger the cached version, so that we can keep a single sed script (which could also be sourced), what do you think?

Thanks (it's even cleaner now). Sure; a few more changes will be needed to avoid namespace pollution (BIN). I don't know if any users rely on this "API".

It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with par.sed and the cached version was roughly two times slower.

Interesting. It turns out shell string-truncation (e.g.: s='????'; s=${1#$s}; set -- "${1%"$s"}" "$s" to generate a truncated 4-char string + rest) is heavily dependent on the #/##, %/%% combination used (even though in this context #/## and %/%% are equivalent). FWIW I found out that dash, busybox sh (and even Bash) all favor # + %% (head truncation) and % + ## (tail truncation). However even the best tail-truncation is several times slower than head-.

It just so happened that the combination I was initially using (% / #) is one of the slowest. Like, orders of magnitude slower...

I've added a hyperfine example at the end of sed-cached to test all # / ## / % / %% combinations. I would drop script-tail hashing completely, except that's the "most unique" part (e.g. your par.sed starts with a bunch of comments).

My main concern with this approach is that some scripts can have side effects: creating and writing to files with the w command, reading files with r (which can remove data from a FIFO for instance), or with the GNU extensions even run a shell with s/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.

I was afraid that might be the case. Speaking of which, I wonder, would it be feasible to add a sandbox mode like GNU sed (disable e/r/w)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants