Profile-Guided Optimization (PGO) benchmark report #3456

zamazan4ik · 2024-04-04T13:11:52Z

Hi!

I recently evaluated Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) on multiple projects. The results are available here. According to those tests, these optimizations improve performance in many cases across many applications, so I think enabling them for libjxl is worth trying. I read an article on Phoronix about a new JPEG encoding/decoding library, Jpegli, and decided to optimize it with PGO.

I already did some benchmarks and want to share my results here. Hopefully, they will be helpful.

Test environment

Fedora 39
Linux kernel 6.7.6
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Clang 17.0.6
libjxl version: the latest main branch at the time of writing, commit 680d0e38683b6485e39807772c579252fe91f3a4
Turbo Boost disabled (for more consistent results across runs)

Benchmark

I didn't find a good benchmark suite for evaluating performance gains on a large dataset, so I use these image samples instead. In all cases, a 30 MiB image is used, and the library is configured with cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DENABLE_JPEGLI_DEFAULT=ON ...

For the PGO training phase, the additional flag -fprofile-generate is passed to the compiler; for the PGO optimization phase, the -fprofile-use flag is passed instead. The training phase runs the following command: cjpegli Sample-png-image-30mb.png converted.jpeg -q 90, where cjpegli is Jpegli's encoder and Sample-png-image-30mb.png is the input image.
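For readers who want to reproduce the two phases, here is a minimal sketch of the sequence as a Python helper that only assembles the command lines. It assumes Clang is the active compiler (e.g. selected through CC/CXX), that the flags are injected through CMAKE_C_FLAGS/CMAKE_CXX_FLAGS, and that llvm-profdata is on PATH; the directory names and the helper names (configure, pgo_plan) are hypothetical and are not taken from the report. The trailing "..." in the cmake command above is not filled in, so only the three options shown there appear.

```python
import shlex
from pathlib import Path

# The three CMake options quoted in the report. The report's command ends
# with "...", so any further options it used are unknown and left out here.
REPORT_FLAGS = [
    "-DCMAKE_BUILD_TYPE=Release",
    "-DBUILD_TESTING=OFF",
    "-DENABLE_JPEGLI_DEFAULT=ON",
]


def configure(src, build, compiler_flags=None):
    """Build a `cmake` configure command, optionally adding compiler flags."""
    cmd = ["cmake", "-S", str(src), "-B", str(build), *REPORT_FLAGS]
    if compiler_flags:
        # Assumption: PGO flags are passed via the standard CMake flag variables.
        cmd.append(f"-DCMAKE_C_FLAGS={compiler_flags}")
        cmd.append(f"-DCMAKE_CXX_FLAGS={compiler_flags}")
    return cmd


def pgo_plan(src, work, training_cmd):
    """Return the ordered list of commands for a Clang PGO build.

    `work` is a scratch directory (hypothetical layout); `training_cmd` is the
    training workload, run with the instrumented binaries.
    """
    work = Path(work).resolve()          # absolute paths, so profiles land in one place
    raw = work / "profiles"              # raw .profraw files are written here
    merged = work / "merged.profdata"    # merged profile consumed by the second build
    gen, use = work / "build-generate", work / "build-use"
    return [
        configure(src, gen, f"-fprofile-generate={raw}"),           # 1. instrumented configure
        ["cmake", "--build", str(gen)],                             # 2. instrumented build
        list(training_cmd),                                         # 3. training run
        ["llvm-profdata", "merge", f"-output={merged}", str(raw)],  # 4. merge raw profiles
        configure(src, use, f"-fprofile-use={merged}"),             # 5. optimized configure
        ["cmake", "--build", str(use)],                             # 6. optimized build
    ]


if __name__ == "__main__":
    training = ["cjpegli", "Sample-png-image-30mb.png", "converted.jpeg", "-q", "90"]
    for step in pgo_plan("libjxl", "pgo-work", training):
        print(shlex.join(step))
```

The script prints the commands rather than running them, so they can be reviewed first. In a real run, the training command would need to point at the cjpegli binary produced by the instrumented build rather than one found on PATH.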

All tests were run on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course); the results are reproducible at least on my machine. taskset -c 0 is used for better stability across runs (to reduce the influence of the OS scheduler).
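The measurement approach described above (pin to one core, repeat, compare) can be sketched as a small timing harness. This is not the tool used to produce the linked results; the function names (pinned, time_runs, summarize) are hypothetical, and pinned assumes the taskset utility from util-linux is available.

```python
import statistics
import subprocess
import sys
import time


def pinned(cmd, cpu=0):
    """Prefix a command with `taskset -c <cpu>` to pin it to one CPU core."""
    return ["taskset", "-c", str(cpu), *cmd]


def time_runs(cmd, runs=5):
    """Run `cmd` several times and return the wall-clock time of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,  # fail loudly if the benchmarked command fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce a list of timings to median, min, and max."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Demo with a trivial command; for the report's workload one would pass
    # pinned(["cjpegli", "Sample-png-image-30mb.png", "converted.jpeg", "-q", "90"]).
    print(summarize(time_runs([sys.executable, "-c", "pass"], runs=3)))
```

Reporting the median together with the min and max makes it easier to judge whether a difference between the Release and PGO builds is larger than the run-to-run noise.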

Results

Here are the results:

Release: https://gist.github.com/zamazan4ik/95b1472be0042086c6d4646d6edab053
PGO-optimized compared to Release: https://gist.github.com/zamazan4ik/3159661ef3943f8fad270c8634cd738d
(just for reference) PGO-instrumented compared to Release: https://gist.github.com/zamazan4ik/1e9c37de3e95a95cf06c9f3135c9b7e3

I also tested the case where the training and actual workloads differ. Here is the PGO-optimized build compared to a regular Release build when a different sample image is used (not the one used during the training phase): https://gist.github.com/zamazan4ik/4750fa6424a53e83638f4ab422f901a9

At least in the simple benchmarks above, PGO delivers better performance.

Further steps

I can suggest the following action points:

Perform more PGO benchmarks on libjxl. If they show improvements, add a note to the documentation about the possible performance gains from PGO.
Provide an easier way (e.g. a build option) to build libjxl with PGO. This would help end users and maintainers, since they could optimize libjxl for their own workloads.
Optimize pre-built libjxl binaries (if any) with PGO.

Here are some examples of how PGO optimization is integrated into other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

Here are some examples of how PGO is described in other projects' documentation:

ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Vector: https://vector.dev/docs/administration/tuning/pgo/
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
Clang:
- https://llvm.org/docs/HowToBuildWithPGO.html
- https://llvm.org/docs/AdvancedBuilds.html
tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md

Please do not treat this issue as a bug report or anything like that. It's just a benchmark report with a possible improvement idea for the project.


mo271 added the enhancement (New feature or request), speedup (Performance bugs, speed improvements), and unrelated to 1.0 (Things that need not be done before the 1.0 version milestone) labels on Apr 10, 2024
