Skip to content

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries with options for generating static profile inferred by an ML-model and by heuristics. Useful for when the generation of dynamic profiles is prohibitive.

License

angelica-moreira/BOLT

 
 

Repository files navigation

BOLT

BOLT is a post-link optimizer developed to speed up large applications. It achieves the improvements by optimizing application's code layout based on execution profile gathered by sampling profiler, such as Linux perf tool. An overview of the ideas implemented in BOLT along with a discussion of its potential and current results is available in CGO'19 paper.

BOLT + VESPA

VESPA is a static profiler based on ML designed to guide binary optimizations. VESPA removes the need for dynamic profiling to enable binary optimizations, thus simplifying and broadening the applicability of binary-optimization tools. Although VESPA is a general static profiler that can be used with different binary optimizers, we explored the use of VESPA with BOLT and you can see the discussion of this combined use here OOPSLA'21 paper.

Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At the minimum, the binaries should have an unstripped symbol table, and, to get maximum performance gains, they should be linked with relocations (--emit-relocs or -q linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG) before it runs optimizations. Since this is a nontrivial task, especially when indirect branches are present, we rely on certain heuristics to accomplish it. These heuristics have been tested on a code generated with Clang and GCC compilers. The main requirement for C/C++ code is not to rely on code layout properties, such as function pointer deltas. Assembly code can be processed too. Requirements for it include a clear separation of code and data, with data objects being placed into data sections/segments. If indirect jumps are used for intra-function control transfer (e.g., jump tables), the code patterns should be matching those generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the -freorder-blocks-and-partition compiler option. Since GCC8 enables this option by default, you have to explicitly disable it by adding -fno-reorder-blocks-and-partition flag if you are compiling with GCC8.

PIE and .so support has been added recently. Please report bugs if you encounter any issues.

Installation

Docker Image For Original BOLT

You can build and use the docker image containing BOLT using our docker file. Alternatively, you can build BOLT manually using the steps below.

Docker Image For BOLT+VESPA

You can acces the docker image containing the artifact of the paper VESPA: Static Profiling for Binary Optimization at docker hub. If you have any doubts on how to run the scripts that reproduce the research questions please read the file published at zenodo.

Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of LLVM tools. The build process is not much different from a regular LLVM build. The following instructions are assuming that you are running under Linux.

Start with cloning LLVM and BOLT repos:

> git clone https://github.com/llvm-mirror/llvm llvm
> cd llvm/tools
> git checkout -b llvm-bolt f137ed238db11440f03083b1c88b7ffc0f4af65e
> git clone https://github.com/angelica-moreira/BOLT.git llvm-bolt 
> cd ..
> patch -p 1 < tools/llvm-bolt/llvm.patch

Proceed to a normal LLVM build using a compiler with C++11 support (for GCC use version 4.9 or later):

> cd ..
> mkdir build
> cd build
> cmake -G Ninja ../llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON
> ninja

llvm-bolt will be available under bin/. Add this directory to your path to ensure the rest of the commands in this tutorial work.

Note that we use a specific revision of LLVM as we currently rely on a set of patches that are not yet upstreamed.

Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of BOLT, you can improve the processing time by linking against one of memory allocation libraries with good support for concurrency. E.g. to use jemalloc:

> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

Or if you rather use tcmalloc:

> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....

Standard BOLT Usage

For a complete practical guide of using BOLT see Optimizing Clang with BOLT.

Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging code within functions) in your program, it needs a little help from the linker. Add --emit-relocs to the final link step of your application. You can verify the presence of relocations by checking for .rela.text section in the binary. BOLT will also report if it detects relocations while processing the binary.

Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke your program to run on a representative input from a command line, then check For Applications section below. If your program typically runs as a server/service, then skip to For Services section.

The version of perf command used for the following steps has to support -F brstack option. We recommend using perf version 4.5 or later.

For Applications

This assumes you can run your program from a command line with a typical input. In this case, simply prepend the command line invocation with perf:

$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

For Services

Once you get the service deployed and warmed-up, it is time to collect perf data with LBR (branch information). The exact perf command to use will depend on the service. E.g., to collect the data for all processes running on the server for the next 3 minutes use:

$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

Depending on the application, you may need more samples to be included with your profile. It's hard to tell upfront what would be a sweet spot for your application. We recommend the profile to cover 1B instructions as reported by BOLT -dyno-stats option. If you need to increase the number of samples in the profile, you can either run the sleep command for longer and use -F<N> option with perf to increase sampling frequency.

Note that for profile collection we recommend using cycle events and not BR_INST_RETIRED.*. Empirically we found it to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on a VM or on hardware that does not support it, then you can use only sample events, such as cycles. In this case, the quality of the profile information would not be as good, and performance gains with BOLT are expected to be lower.

With instrumentation (experimental)

If perf record is not available to you, you may collect profile by first instrumenting the binary with BOLT and then running it.

$ llvm-bolt <executable> -instrument -o <instrumented-executable>

After you run instrumented-executable with the desired workload, its BOLT profile should be ready for you in /tmp/prof.fdata and you can skip Step 2.

Run BOLT with the -help option and check the category "BOLT instrumentation options" for a quick reference on instrumentation knobs. Instrumentation is experimental and currently does not work for PIEs/SOs.

Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed perf.data directly to BOLT using experimental -p perf.data option.

For this step, you will need perf.data file collected from the previous step and a copy of the binary that was running. The binary has to be either unstripped, or should have a symbol table intact (i.e., running strip -g is okay).

Make sure perf is in your PATH, and execute perf2bolt:

$ perf2bolt -p perf.data -o perf.fdata <executable>

This command will aggregate branch data from perf.data and store it in a format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add -nl flag to the command line above.

Step 3: Optimize with BOLT

Once you have perf.fdata ready, you can use it for optimizations with BOLT. Assuming your environment is setup to include the right path, execute llvm-bolt:

$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=cache+ \
                         -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats

If you do need an updated debug info, then add -update-debug-sections option to the command above. The processing time will be slightly longer.

For a full list of options see -help/-help-hidden output.

The input binary for this step does not have to 100% match the binary used for profile collection in Step 1. This could happen when you are doing active development, and the source code constantly changes, yet you want to benefit from profile-guided optimizations. However, since the binary is not precisely the same, the profile information could become invalid or stale, and BOLT will report the number of functions with a stale profile. The higher the number, the less performance improvement should be expected. Thus, it is crucial to update .fdata for release branches.

Multiple Profiles

Suppose your application can run in different modes, and you can generate multiple profiles for each one of them. To generate a single binary that can benefit all modes (assuming the profiles don't contradict each other) you can use merge-fdata tool:

$ merge-fdata *.fdata > combined.fdata

Use combined.fdata for Step 3 above to generate a universally optimized binary.

BOLT + VESPA Usage

Step 0

See Step 0 from Standard BOLT Usage section.

Step 1: Extract Code Features

To generate the csv feature file (features.csv) you must do as follows:

$ llvm-bolt <executable> -gen-features

Step 2: Generate Inferred Profile

Before you go ahead make sure that you have python3 installed in your machine and then create a virtual enviroment in your desired directory ($VIRTUAL_ENV_DIR) as follows:

$ python3 -m venv ${VIRTUAL_ENV_DIR}/vespa-env

Now activate it and install the dependences needed for running the scrpits of directory ${LLVM_BOLT_DIRECTORY}/utils/vespa_scripts. The dependency files is at ${LLVM_BOLT_DIRECTORY}/utils/requirements.txt, make sure to not forget to upgrate pip before installing the dependences, as follows:

$ source ${VIRTUAL_ENV_DIR}/vespa-env/bin/activate
$ PYTHON_HOME=${VIRTUAL_ENV_DIR}/vespa-env/bin
$ ${PYTHON_HOME}/pip install --upgrade pip
$ ${PYTHON_HOME}/pip install -r ${LLVM_BOLT_DIRECTORY}/utils/requirements.txt

You can skip the installation if you had already done it before, but make sure to activate it every time you want to run the scripts preprocessing.py and generate_probfiles.py.

To generate the probabilities file that will have the prediction performed by VESPA and that will be used to guide BOLT optimizations, do as follows:

$ VESPA_SCRIPTS_HOME=${LLVM_BOLT_DIRECTORY}/utils/vespa_scripts/
$ source ${VIRTUAL_ENV_DIR}/vespa-env/bin/activate
$ PYTHON_HOME=${VIRTUAL_ENV_DIR}/vespa-env/bin
$ ${PYTHON_HOME}/python3 ${VESPA_SCRIPTS_HOME}/preprocessing.py <features_path> <stage_code> <result_path>
$ ${PYTHON_HOME}/python3 ${VESPA_SCRIPTS_HOME}/generate_probfiles.py <model_path> <data_base_path> <output_path> [bin_name]
$ deactivate

Note: The LLVM_BOLT_DIRECTORY variable represents llvm-bolt source directory the ${HOME}/llvm/tools/llvm-bolt/.

The input variable features_path represents the path where the features.csv is, the stage_code can assume values 0 (train) or 1 (prediction), and result_path is the path where the preprocessed csv file will be persisted.

The input variable model_path is the path where vespa ml model is, data_base_path represents the path where the preprocessed csv file was persisted, the output_path is where you want to persist the probabilities file and bin_name is optional, if you enter the binary name the output file will be named as bin_name_ml.pdata, otherwise the file will be generated with the default name (prob.pdata).

Step 3: Optimize with BOLT + VESPA

Once you have prob.pdata ready, you can use it for optimizations with BOLT. Assuming your environment is setup to include the right path, execute llvm-bolt:

$ llvm-bolt <executable> -o <executable>.bolt -infer-local-counts -ml-based -prob-file=prob.pdata \
                         -reorder-blocks=cache+ -split-functions=3 -split-all-cold -dyno-stats \
                         -icf=1 -use-gnu-stack

Be aware that the command above does not work well for rearrenging functions. If you intend to enable this optimization, it is advisable to use the flag -infer-global-counts instead of -infer-local-counts when executing llvm-bolt:

$ llvm-bolt <executable> -o <executable>.bolt -infer-global-counts -ml-based \
                         -prob-file=prob.pdata -reorder-blocks=cache+ \
                         -split-functions=3 -split-all-cold -dyno-stats -icf=1 \
                         -reorder-functions=hfsort -split-eh -use-gnu-stack

Note: Here we are using the pre-trained model with our 243 programs, read the OOPSLA'21 paper for a deitaled information about it, as input for training, but feel free to explore tranning with other dataset following the steps presented in the Training VESPA with your own dataset tutorial.

BOLT + Other Static Predictors Usage

We incorporated in BOLT some trivial static predictors and the Wu and Larus heuristics. Below are the steps of how to run BOLT with the Wu and Larus static heuristic-based profiler.

Step 0

See Step 0 from Standard BOLT Usage section.

Step 1: Optimize with BOLT+Other Static Predictors

Since the trivial predictors and the Wu and Larus predictor are heuristic-based, and therefore perform branch probability inference by using heuristics that analysis the program's source code, you will not need to feed it with a file, you only need to specify the type of predictor you want to use. In the exemple below we are using Wu and Larus predictor with the intra-procedural strategy for converting probabilities into frequencies, mode details about this technique can be found at Wu and Larus's paper.

$ llvm-bolt <executable> -o <executable>.bolt -infer-local-counts -heuristic-based=wularus  \
                         -reorder-blocks=cache+ -split-functions=3 -split-all-cold \
                         -dyno-stats -icf=1 -use-gnu-stack

We also implemented the inter-procedural algorithm described in the paper. Using this implementation, you may be able to perform function reordering better than using the previously inferred data. However, as mentioned earlier in the BOLT + VESPA section, it is not advisable to use this approach for large or highly optimized programs, since they usually have a large number of non-reducible graphs. This caused a combination of both intra-procedural and inter-procedural techiniques to yield worse results when compared against the binary produced using only the intra-procedural propagation.

$ llvm-bolt <executable> -o <executable>.bolt -infer-global-counts -heuristic-based=wularus \
                         -reorder-blocks=cache+ -split-functions=3 -split-all-cold -dyno-stats \
                         -icf=1 -reorder-functions=hfsort -split-eh -use-gnu-stack

For testing the trivial predictions please run BOLT with the -help option and check the category "BOLT static infered profile options" for a quick reference on trivial heuristics and its description, and replace the name of the trivial predictor in the field -heuristic-based.

License

BOLT and BOLT+VESPA are licensed under University of Illinois/NCSA Open Source License.

About

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries with options for generating static profile inferred by an ML-model and by heuristics. Useful for when the generation of dynamic profiles is prohibitive.

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 95.2%
  • Makefile 1.5%
  • Assembly 1.2%
  • Python 1.0%
  • Shell 0.7%
  • CMake 0.3%
  • Other 0.1%