Sed script libmode #2

mralusw · 2021-08-08T10:22:21Z

I've been working on a version of the sed script that caches compiled files. In the process, I reorganized the sed script to

- be source-able as a library
- not depend on any ./ files or cd commands (by using make -C and figuring out absolute path names for generated_file etc.).

Besides being extensible and not requiring any hacks, I think the reorganized code is somewhat clearer, so I'm opening a pull request.
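A minimal sketch of the "source-able as a library" idea, assuming a guard variable checked at the end of the script. `DEMO_LIBMODE` and `demo_main` are hypothetical names used only for illustration; the sed script in this PR uses its own variable (`SED_LIBMODE`) and entry point, and its details may differ.

```shell
#!/bin/sh
# Hypothetical illustration of a source-able script; DEMO_LIBMODE and
# demo_main are made-up names, not the ones used in the sed script.
lib=${TMPDIR:-/tmp}/libmode-demo.$$.sh

cat > "$lib" <<'EOF'
demo_main() {
  printf 'ran with: %s\n' "$*"
}
# Run immediately when executed as a script; when sourced with
# DEMO_LIBMODE set, only define the functions and return.
[ -n "${DEMO_LIBMODE:-}" ] || demo_main "$@"
EOF

DEMO_LIBMODE=1
. "$lib"                 # defines demo_main without running it
out=$(demo_main a b)     # the caller decides when to run
printf '%s\n' "$out"
rm -f "$lib"
```

Inside a sourced file, `$0` is still the caller's `$0`, and plain sh has no `BASH_SOURCE`, so a library written this way cannot locate its own directory. That is presumably why the caller has to pass the directory explicitly (`SED_DIR` in this PR).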

FWIW, I've benchmarked (results below) sed-cached with a simple sed command. It's about 15 times faster than compiling each time, but still twice as slow as native sed (only 75% slower on busybox sh with builtin applets). Yeah, it should be written in C, not sh. But it was fun as a proof of concept :)

I'm curious what you think of this. Perhaps sed-cached can make it to contrib/. Perhaps a C version would be even faster than sed, and then maybe sed-bin is not so useless after all.

hyperfine --export-markdown /tmp/sed-bin.md -w2 -L sed /S/sed-bin/sed,/S/sed-bin/sed-cached,'busybox sh /S/sed-bin/sed-cached',sed 'for i in $(seq 1 5); do {sed} s/:/_/g /etc/passwd; done'

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `for i in $(seq 1 5); do /S/sed-bin/sed s/:/_/g /etc/passwd; done` | 397.3 ± 8.2 | 383.1 | 408.5 | 33.07 ± 0.84 |
| `for i in $(seq 1 5); do /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done` | 25.5 ± 0.2 | 25.0 | 26.0 | 2.12 ± 0.03 |
| `for i in $(seq 1 5); do busybox sh /S/sed-bin/sed-cached s/:/_/g /etc/passwd; done` | 21.4 ± 0.3 | 19.4 | 22.0 | 1.78 ± 0.04 |
| `for i in $(seq 1 5); do sed s/:/_/g /etc/passwd; done` | 12.0 ± 0.2 | 11.1 | 12.6 | 1.00 |

mralusw · 2021-08-08T10:25:01Z

Just to be clear, my sed-cached script is on a separate sed-cached branch; this pull request only includes changes to sed.

mralusw · 2021-08-09T05:06:11Z

... and, finally, sed-busting performance for shell scripts, with the latest sed-cached used in library mode. It's about 4x faster than calling sed in a regular shell, and 2x faster than calling the builtin sed when running in a busybox with static applets.

To use this, one has to pre-define SED_DIR, SED_LIBMODE, source the sed and sed-cached scripts from the repo, and call the entrypoint function whenever desired. It takes care of computing a hash for the script and looking up a cached binary, or compiling one.
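The hash-then-lookup step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the sed-cached code: `script_key`, `cached_lookup`, and `cache_dir` are hypothetical names, `cksum` stands in for whatever hashing sed-cached actually uses, and the compile step is replaced by writing a placeholder file.

```shell
#!/bin/sh
# Hypothetical sketch of hash-keyed caching; none of these names come from
# sed-cached, and cksum stands in for whatever hash it actually uses.
cache_dir=${TMPDIR:-/tmp}/sed-cache-demo.$$
mkdir -p "$cache_dir"

# Derive a file-name-safe key from the script text.
script_key() {
  set -- $(printf '%s' "$1" | cksum)   # cksum prints "<crc> <size>"
  printf '%s_%s\n' "$1" "$2"
}

# Set $bin to the cached artifact for a script, building it on a miss.
cached_lookup() {
  bin=$cache_dir/$(script_key "$1")
  if [ -f "$bin" ]; then
    cache_status=hit
  else
    # Stand-in for the translate-and-compile step.
    printf '%s\n' "$1" > "$bin"
    cache_status=miss
  fi
}

cached_lookup 's/:/_/g'; first=$cache_status
cached_lookup 's/:/_/g'; second=$cache_status
printf '%s then %s\n' "$first" "$second"
rm -rf "$cache_dir"
```

The first call has to build the artifact and the second finds it by key, which is where the speedup over compiling on every invocation comes from.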

Notes at the end of https://github.com/kstr0k/sed-bin/blob/sed-cached/sed-cached

lhoursquentin · 2021-08-10T20:11:34Z

I really like the idea, let me look into this

mralusw · 2021-08-10T23:23:54Z

Sure, have a look and let me know if you want to do things differently. I've converted this to a draft PR.

I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (./sed 's/:/_' </etc/passwd) just hangs.

lhoursquentin · 2021-08-11T16:38:17Z

> I've just noticed there's a bug (also present on lhoursquentin/sed-bin) whereby a malformed expression (./sed 's/:/_' </etc/passwd) just hangs.

Yeah, unfortunately malformed scripts are not handled correctly for the most part. I have a small note in the README regarding this:

> The translator does not handle invalid sed scripts, it will just generate invalid C code which will probably fail to compile, make sure you can run your script with an actual sed implementation before attempting to translate it.

Proper error checking in the translator is possible but it would require keeping a lot more state in memory. For instance the translator isn't even aware of curly brackets nesting depth, it just prints them right away and forgets about them.

In this specific case (s/:/_), though, it cannot even finish translation and is trapped in an infinite loop. I just fixed it in commit 72f5a89 since it was straightforward, and clarified in the README that an infinite loop is a possibility when trying to translate invalid scripts.

mralusw · 2021-08-11T18:59:14Z

That would be easily checked using "if system_sed script </dev/null 2>/dev/null" — compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)
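A rough sketch of such a pre-check, assuming a system sed is available on PATH. `sed_script_ok` is a hypothetical helper name, and `SED_REAL_SED` is only the variable name floated above, not an existing setting.

```shell
#!/bin/sh
# Sketch: ask an existing sed whether a script parses, feeding it no input.
# SED_REAL_SED is the variable name floated in this thread, not an existing
# setting; it defaults to whatever "sed" is on PATH.
sed_script_ok() {
  "${SED_REAL_SED:-sed}" -e "$1" </dev/null >/dev/null 2>&1
}

for script in 's/:/_/g' 's/:/_'; do
  if sed_script_ok "$script"; then
    printf 'ok:      %s\n' "$script"
  else
    printf 'invalid: %s\n' "$script"
  fi
done
```

Note that the check is not entirely free of side effects: a script using the `w` command can still create or truncate its output file even when the input is empty.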

lhoursquentin · 2021-08-15T22:19:22Z

All right, I had the chance to take a more in-depth look, it's really nice!
I'm wondering whether it would be easier to merge the two and have a SED_CACHED env var trigger the cached version, so that we keep a single sed script (which could also be sourced). What do you think?

It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with par.sed and the cached version was roughly two times slower.

> That would be easily checked using "if system_sed script </dev/null 2>/dev/null"; compared to invoking gcc, it's practically free. But, again, this needs design decisions (have a SED_REAL_SED var?)

My main concern with this approach is that some scripts can have side effects: creating and writing to files with the w command, reading files with r (which can remove data from a FIFO for instance), or with the GNU extensions even run a shell with s/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.

mralusw · 2021-08-16T11:04:18Z

> All right, I had the chance to take a more in depth look, it's really nice!
> I'm just wondering if it wouldn't even be easier to merge the two together and instead have a SED_CACHED env var to trigger the cached version, so that we can keep a single sed script (which could also be sourced), what do you think?

Thanks (it's even cleaner now). Sure; a few more changes will be needed to avoid namespace pollution (BIN). I don't know if any users rely on this "API".

> It's also worth noting that the cached version seems slower than the non-cached one when dealing with big sed scripts, I tried with par.sed and the cached version was roughly two times slower.

Interesting. It turns out that the speed of shell string truncation (e.g. s='????'; s=${1#$s}; set -- "${1%"$s"}" "$s" to generate a truncated 4-char string + rest) depends heavily on the #/##, %/%% combination used, even though in this context #/## and %/%% are equivalent. FWIW I found out that dash, busybox sh (and even Bash) all favor # + %% (head truncation) and % + ## (tail truncation). However, even the best tail truncation is several times slower than head truncation.

It just so happened that the combination I was initially using (% / #) is one of the slowest. Like, orders of magnitude slower...

I've added a hyperfine example at the end of sed-cached to test all # / ## / % / %% combinations. I would drop script-tail hashing completely, except that's the "most unique" part (e.g. your par.sed starts with a bunch of comments).
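For reference, the two splitting idioms can be written as small helpers. This is a sketch using the # + %% and % + ## pairings mentioned above; `head4` and `tail4` are illustrative names, not functions from sed-cached, and the snippet makes no attempt to measure speed.

```shell
#!/bin/sh
# Split a string with parameter expansion only (no external commands).
# The # + %% and % + ## pairings follow the combinations reported as
# fastest in this thread; helper names are illustrative.

# head4 <string>: sets $head to the first 4 chars, $rest to the remainder.
head4() {
  pat='????'
  rest=${1#$pat}        # strip a 4-char prefix (pattern match)
  head=${1%%"$rest"}    # strip the remainder as a literal suffix
}

# tail4 <string>: sets $tail to the last 4 chars, $init to what precedes them.
tail4() {
  pat='????'
  init=${1%$pat}        # strip a 4-char suffix (pattern match)
  tail=${1##"$init"}    # strip the leading part as a literal prefix
}

head4 abcdefgh; printf '%s|%s\n' "$head" "$rest"
tail4 abcdefgh; printf '%s|%s\n' "$init" "$tail"
```

Both calls print `abcd|efgh`. For inputs shorter than four characters the pattern does not match, so the whole string ends up in `$rest` (or `$init`) and the other variable is empty.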

> My main concern with this approach is that some scripts can have side effects: creating and writing to files with the w command, reading files with r (which can remove data from a FIFO for instance), or with the GNU extensions even run a shell with s/.*/rm some-file/e. So I'd rather let the user handle this verification process manually to avoid any surprises.

I was afraid that might be the case. Speaking of which, I wonder, would it be feasible to add a sandbox mode like GNU sed (disable e/r/w)?

mralusw added 8 commits August 8, 2021 12:43

sed: + _LIBMODE (just defs, don't run); no cd's

0abdf9a

sed: further function separation

ded2edb

sed: avoid useless cat

6550f2a

sed: make main overridable (libmode)

063732f

sed: separate make step

ad13b85

sed: internal functions -> __

f17ed0b

sed: + sed_exec hook

71155b4

sed: enable 1-arg 'exec <' optimization

cac4491

mralusw added 2 commits August 8, 2021 15:37

sed: exec at end

f7a5ea3

sed: in LIB_MODE, we need SED_DIR too

f294f0e

$0 in sh is caller's, BASH_SOURCE unavailable

sed: fix $0=sed (no dir); drop mydir (use SED_DIR)

9742cfe

Also: don't abspath any other files since we don't 'cd' anymore

mralusw marked this pull request as draft August 10, 2021 23:21

sed: par.sed writes ./...-init.c, must cd $SED_DIR

c3b2dde

mralusw added 2 commits August 18, 2021 11:02

sed: __usage -> __sed_usage

b38b449

sed: BIN -> SED_BIN (can't rely on Makefile's BIN)

e3dbe83
