zamazan4ik opened this issue
Apr 4, 2024
· 0 comments
Labels
enhancement (New feature or request), speedup (Performance bugs, speed improvements), unrelated to 1.0 (Things that need not be done before the 1.0 version milestone)
Hi!
Recently I checked how optimizations like Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) improve performance on multiple projects. The results are available here. According to the tests, these optimizations can improve performance in many cases for many applications, so I think trying to enable them for libjxl is a good idea. I read an article on Phoronix about a new JPEG encoding/decoding library, Jpegli, and decided to optimize it with PGO.
I have already run some benchmarks and want to share my results here. Hopefully, they will be helpful.
Test environment
Fedora 39
Linux kernel 6.7.6
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Clang 17.0.6
libjxl version: the latest main branch at the time of writing, commit 680d0e38683b6485e39807772c579252fe91f3a4
Disabled Turbo boost (for better results consistency across runs)
Benchmark
I didn't find a good benchmark suite to evaluate performance gains on a large dataset. Instead, I use these image samples. In all cases, a 30 MiB image is used, and the library is configured with cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DENABLE_JPEGLI_DEFAULT=ON .. (the trailing .. is the path to the source directory). For the PGO training phase, the additional flag -fprofile-generate is passed to the compiler; for the PGO optimization phase, the -fprofile-use flag is passed instead. The PGO training phase runs the following command: cjpegli Sample-png-image-30mb.png converted.jpeg -q 90, where cjpegli is Jpegli's encoder and Sample-png-image-30mb.png is the input image.
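The two-phase build described above can be sketched as a small script. This is only an illustration under assumptions, not the exact procedure used for the report: the build directory names, the PROFILE_DIR location, the tools/cjpegli path inside the build tree, and the explicit llvm-profdata merge step are assumptions (Clang's instrumentation writes raw .profraw files that have to be merged before -fprofile-use can read them). The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of a Clang PGO build of libjxl/Jpegli. Paths and directory names
# below are hypothetical; adjust them to your checkout.
set -euo pipefail

SRC_DIR="${SRC_DIR:-$PWD}"                                # assumed: libjxl checkout
PROFILE_DIR="${PROFILE_DIR:-$PWD/pgo-profiles}"           # assumed: profile output dir
TRAIN_IMAGE="${TRAIN_IMAGE:-Sample-png-image-30mb.png}"   # training input from the report
DRY_RUN="${DRY_RUN:-1}"                                   # 1 = print commands only

# Print a command and execute it only when DRY_RUN is not 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Configure a build directory with the options from the report plus extra
# compiler flags. Usage: configure <build-dir> <extra-flags>
configure() {
  local build_dir="$1" flags="$2"
  run cmake -S "$SRC_DIR" -B "$build_dir" \
    -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DENABLE_JPEGLI_DEFAULT=ON \
    -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_C_FLAGS="$flags" -DCMAKE_CXX_FLAGS="$flags"
}

main() {
  # Phase 1: instrumented build that writes raw profiles into PROFILE_DIR.
  configure build-instrumented "-fprofile-generate=$PROFILE_DIR"
  run cmake --build build-instrumented --parallel

  # Training run: the same command as in the report (binary path is assumed).
  run build-instrumented/tools/cjpegli "$TRAIN_IMAGE" converted.jpeg -q 90

  # Merge the raw profiles into one file usable by -fprofile-use.
  run llvm-profdata merge -output="$PROFILE_DIR/merged.profdata" "$PROFILE_DIR"/*.profraw

  # Phase 2: optimized build that consumes the merged profile.
  configure build-pgo "-fprofile-use=$PROFILE_DIR/merged.profdata"
  run cmake --build build-pgo --parallel
}

main
```

Running the script as-is shows the command sequence; running it with DRY_RUN=0 from a libjxl checkout would attempt the actual builds, assuming clang, llvm-profdata, and the training image are available.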
All tests are run on the same machine, multiple times, with the same background "noise" (as far as I can guarantee, of course). The results are reproducible, at least on my machine. taskset -c 0 is used for better stability across runs (to reduce OS scheduler influence).
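The measurement approach above (repeated runs pinned to one core) can be approximated with a small harness like the following. It is a minimal sketch, not the tool used for the report: the RUNS variable and the function names are hypothetical, it relies on GNU date for sub-second timestamps, and it falls back to an unpinned run when taskset is not installed.

```shell
#!/usr/bin/env bash
# Minimal timing harness: run a command several times pinned to CPU 0 and
# report the minimum and mean wall-clock time. Names here are hypothetical.
set -euo pipefail

RUNS="${RUNS:-5}"   # number of repetitions per command

# Run a command pinned to CPU 0 when taskset is available.
pinned() {
  if command -v taskset >/dev/null 2>&1; then
    taskset -c 0 "$@"
  else
    "$@"
  fi
}

# Print the wall-clock time of one run, in seconds (requires GNU date).
time_once() {
  local start end
  start=$(date +%s.%N)
  pinned "$@" >/dev/null 2>&1
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
}

# Usage: bench <label> <command...>
bench() {
  local label="$1"
  shift
  local i
  for ((i = 1; i <= RUNS; i++)); do
    time_once "$@"
  done | awk -v label="$label" '
    { sum += $1; if (NR == 1 || $1 < min) min = $1 }
    END { printf "%s: runs=%d min=%.3fs mean=%.3fs\n", label, NR, min, sum / NR }'
}

# When called with arguments, benchmark that command.
if [ "$#" -gt 0 ]; then
  bench "cmd" "$@"
fi
```

For example, saving this as bench.sh (a hypothetical name) and running ./bench.sh path/to/cjpegli Sample-png-image-30mb.png converted.jpeg -q 90 once against the Release build and once against the PGO build gives two comparable summary lines.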
Results
I also tested the case where the training and actual workloads differ. Here are the results for the PGO-optimized build compared to a regular Release build when another sample image is used (not the same one as during the training phase): https://gist.github.com/zamazan4ik/4750fa6424a53e83638f4ab422f901a9
At least according to the simple benchmarks above, PGO improves performance.
Further steps
I can suggest the following action points:
Perform more PGO benchmarks on libjxl. If they show improvements, add a note to the documentation about the possible performance gains from PGO.
Provide an easier way to build with PGO (e.g. a build option in the build scripts). This can help end users and maintainers, since they will be able to optimize libjxl for their own workloads.
Optimize pre-built libjxl binaries (if any) with PGO.
Here are some examples of how PGO optimization is integrated into other projects:
I have some examples of how PGO information looks in the documentation:
Please do not treat this issue as a bug or something like that. It is just a benchmark report with a possible improvement idea for the project.