Skip to content

comby-tools/comby-decomposer

Repository files navigation

comby-decomposer

comby-decomposer is a tool that automatically decomposes program syntax into (1) a set of templates and (2) a set of concrete fragments extracted from the syntax. Basically, it un-substitutes some input file to create a corpus of templates and fragments.

Here's a visual example of outputs when we decompose a function with respect to parentheses (...) and braces {...}.

decompose

You can then use mix-and-match templates and fragments and substitute these back in to create new combinations of output "programs". It's been most useful for fuzzing compilers.

Running

  • Install comby
  • Install fdupes for your distribution (typically sudo apt-get install fdupes or brew install fdupes)

Copy the programs you want to decompose into the sources directory. If you don't have any programs ready, you can try it with some example programs in this repository:

cp try_example_solidity/* sources

Then, just run:

./run.sh .sol .sol

This says "decompose programs for all .sol files in the sources directory using the .sol language (defined by comby -list). You get these outputs:

  • fragments contains the extracted fragments
  • templates contains the templatized versions of the input string.

These are deduplicated and unique. You can try the same for example rust programs:

cp try_example_rust/* sources
./run.sh .rs .rs

and check out examples in templates and fragments!

Customizing decomposition

By default programs are decomposed around delimiter syntax like (...), {...} and [...]. This means that comby-decomposer looks for this syntax and un-substitutes the program around these patterns. It will also look for nested syntax inside these. You can use any comby pattern you like though! I These patterns are defined inside extraction_specifications using a simple configuration file and you'll find the default ones for parentheses, braces, and brackets in there already. But if you wanted to un-substitute around numbers in the input program, you can simply add a specification like extract_number.toml that matches the regex [0-9]+ like this:

[extract_number]
match=':[~[0-9]+]'

(Tip: the string :[~[0-9]+] is just using the comby syntax :[...] to define a regular expression [0-9]+)

Now if you delete the other specifications and ./run.sh .sol .sol again on the example program, you'll only get one decomposition (since there's only one number to un-substitute), and your templates and fragments folders contain these:

templates

function f(uint:[1] arg) public {
        g(notfound);
}

fragments

256

Smaller fragments and templates

If you have many or large programs to decompose, you can run ./postprocess.sh after ./run.sh. This will:

  • only keep templates and fragments that are 8KB to 28KB in size
  • delete templates with no holes
  • delete templates with more than 10 holes

Look inside the postprocess.sh to customize it.

On-demand input generation

If you have templates and fragments and want to randomly generate new programs by substituting fragments into templates, you can use the included server. Just run these commands:

npm install express body-parser @iarna/toml minimist

export NODE_OPTIONS="--max-old-space-size=8192"
node server.js --generate 1.0

Then in a separate terminal, request a new input:

curl -X POST http://localhost:4448/mutate

Every request generates a random input. The server picks a random template from the templates directory, and substitutes up to 10 random fragments from the fragments directory. If you want more control over server generation you'll have to dig into server.js.

See https://github.com/agroce/afl-compiler-fuzzer#01-usage to use the server with an AFL fuzzer

Customizing language grammar target

Comby supports a lot of languages (see comby -list for supported ones), but even if you just created one, you'll probably get some mileage from using the generic parser and default patterns: ./run .my-language-extension .generic.

Why comby-decomposer? Mostly for fuzzing compilers...

comby-decomposer was created to generate inputs to fuzz compilers. We can substitute and recombine templates and fragments to create new inputs.

image

If you want to use it with an AFL fuzzer that's been modified to request inputs from the server, see https://github.com/agroce/afl-compiler-fuzzer#01-usage.

The fun part is comby-decomposer is syntax-aware and works on most any language (or structured input like JSON) to some degree, and you can get started right away: You just need some example programs for your program. No need to define a grammar to generate inputs, or know exactly how the grammar of your target language works. That's because it just uses Comby to do the decomposition work.

The basic idea is that comby-decomposer creates inputs that are much more likely to be syntactically valid expressions inside previously valid (now templatized) programs. Generating inputs with templates and likely-valid expressions help nudge a feedback-directed fuzzer (like AFL) to explore "deeper" code in a compiler earlier than, say, just starting with a compiler's initial test suite. So, instead of using only an initial test corpus, comby-decomposer helps generate templates and fragments for synthesizing new test programs on demand. It banks on the idea that these inputs are unlikely to be generated early (or at all) by default input mutations in a fuzzer's feedback loop.

For example, comby-decomposer generated this Solidity program that sent the compiler into an infinite loop:

contract C {
    bytes32 constant x = x;
    function f() public pure returns (uint t) {
        assembly {
            t := x
        }
    }
}

It also generated new valid-looking programs using the Zig compiler's test suite to find crashes deeper in the analysis and code generation phases of the compiler:

export fn entry() void {
	var x = false;
    const p = @ptrCast(?*u0, &x);
    return p < null;
}

You can find more Zig examples here.