wuffs 0.4 significantly slower than 0.3 decoding PNGs #148
Comments
Thanks for the bug report. I don't run Windows or VS myself. Are you able to bisect a few Wuffs versions to see which commit caused the slowdown? If you're using
That's pretty coarse-grained though. If you have the time, I'd find it more helpful if you can bisect using
Also, what compiler flags are you using? |
Tangentially, if this is your code: Then an explicit "Convert from BGR to RGB" shouldn't be necessary. Instead, I think that you can change line 276 from this:
to this:
|
I'd also like to know whether any of these
|
Yeah that's my code. In addition, despite being relatively unoptimised code, my BGRA->RGBA swizzling seems faster than Wuffs:
|
Just _M_X64. I'm not building with AVX. |
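For readers trying to answer the "what compiler flags are you using?" question on their own setup, here is a minimal sketch that reports which architecture and SIMD macros the compiler predefines for a translation unit. The `BUILD_*` names and `print_simd_build_config` are hypothetical helpers invented for this illustration; they are not Wuffs macros. The predefined macros it inspects (`_M_X64`, `__x86_64__`, `__AVX__`, `__SSE4_2__`) are the compilers' own.

```c
#include <stdio.h>

// 1 if the compiler reports a 64-bit x86 target, else 0.
// MSVC predefines _M_X64; GCC and Clang predefine __x86_64__.
#if defined(_M_X64) || defined(__x86_64__)
#define BUILD_TARGETS_X86_64 1
#else
#define BUILD_TARGETS_X86_64 0
#endif

// 1 if AVX code generation is enabled for this translation unit
// (MSVC: /arch:AVX or higher; GCC and Clang: -mavx or higher).
#if defined(__AVX__)
#define BUILD_ENABLES_AVX 1
#else
#define BUILD_ENABLES_AVX 0
#endif

// 1 if GCC or Clang was asked to enable SSE4.2 (for example via -msse4.2).
#if defined(__SSE4_2__)
#define BUILD_ENABLES_SSE42 1
#else
#define BUILD_ENABLES_SSE42 0
#endif

// Hypothetical helper: writes the three flags to `out` and returns the
// number of characters written (or a negative value on error).
static int print_simd_build_config(FILE* out) {
  return fprintf(out, "x86_64=%d avx=%d sse4.2=%d\n",
                 BUILD_TARGETS_X86_64, BUILD_ENABLES_AVX, BUILD_ENABLES_SSE42);
}
```

Calling `print_simd_build_config(stdout)` from the same translation unit and with the same flags as the Wuffs build would show whether "x64 but no AVX" is really what the compiler sees.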
|
I suspect that the slow-down is due to SIMD code no longer being used. Comparing
Also, there's a
Out of curiosity (I'm not familiar with MSVC / Visual Studio), both
Did you notice that at all, when building Wuffs? |
That might not be true, though. Fortunately, there's not that many commits between
|
WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me.
Yes, and it's kind of annoying :) Looking at the code, the issue seems to be
preventing SSE from being used. As mentioned before I'm not building with AVX. |
To be clear, are you saying "WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me" for just Again, what does
Yeah, it's not ideal, but I don't know how to make it better. Wuffs ships as a "single file C library". And in gcc or clang, code can opt-in to "compile me with SIMD enabled" via an So, for VS, I'd like the single file C library to work out of the box (even if, by default, it's leaving significant performance on the table), and it does, but the Sort of tangential to the original post, but as you're concerned about PNG decode performance: if your CPUs are less than 10 years old, then I'm curious how that "8.4 milliseconds" number changes if you do pass |
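As an illustration of the gcc/clang "compile me with SIMD enabled" opt-in mentioned above, the sketch below uses the `target` function attribute so that one function is built with SSE4.2 even when the rest of the file is not. That this is the exact mechanism Wuffs relies on is an assumption, and the `crc32c_word*` function names and the `HAVE_SSE42_FUNCTION` macro are hypothetical. The CRC-32C step is only there to give the SSE4.2 function something to do; it is not taken from Wuffs.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__x86_64__)
#include <nmmintrin.h>  // SSE4.2 intrinsics such as _mm_crc32_u32.

// Built with SSE4.2 enabled for this function only, regardless of the
// command-line flags used for the rest of the translation unit.
__attribute__((target("sse4.2")))
static uint32_t crc32c_word_sse42(uint32_t crc, uint32_t word) {
  return _mm_crc32_u32(crc, word);
}
#define HAVE_SSE42_FUNCTION 1
#else
#define HAVE_SSE42_FUNCTION 0
#endif

// Portable bit-at-a-time CRC-32C step (reflected polynomial 0x82F63B78),
// with no initial or final inversion, matching the instruction's behavior.
static uint32_t crc32c_word_portable(uint32_t crc, uint32_t word) {
  crc ^= word;
  for (int i = 0; i < 32; i++) {
    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
}

// The SSE4.2 function may only be called once the CPU is known to support
// it, so the caller checks at run time and falls back otherwise.
static uint32_t crc32c_word(uint32_t crc, uint32_t word) {
#if HAVE_SSE42_FUNCTION
  if (__builtin_cpu_supports("sse4.2")) {
    return crc32c_word_sse42(crc, word);
  }
#endif
  return crc32c_word_portable(crc, word);
}
```

The `#else` branch is where a compiler without this attribute ends up: the file still builds, but only the portable path exists unless SIMD is enabled some other way.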
Maybe you could add a preprocessor option to suppress the warning? Something like |
|
It's not defined for me using v0.4.0-alpha.4. Not sure about alpha 3. |
I think I have done enough remote debugging for now, sorry. I think you need a Windows build machine :) |
OK, but in that case, I don't expect this bug to be fixed any time soon. Sorry. |
Isn't it clear what the issue is? The SSE code is only being used when AVX is defined. |
For Wuffs + Visual Studio, "SSE code is only being used when AVX is defined" was true for Wuffs v0.3, v0.4.0-alpha.3 and v0.4.0-alpha.4. All three versions have the same At least, I think it's the same across all three versions, and that's why I asked you previously if you could confirm that (for older versions not just v0.4.0-alpha.4). If so, "SSE only when AVX defined" doesn't explain why performance regressed between v0.3 and v0.4, or between v0.4.0-alpha.3 and v0.4.0-alpha.4. I don't think it's clear yet what the issue is. |
But also, I don't think "SSE code is only being used when AVX is defined" has an obvious fix. SSE isn't a single thing, it's at least six different things: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2. If all you have is On the other hand, Wuffs "SSE code" (in both version 0.3 and 0.4) uses intrinsics like |
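To make the "six different things" point concrete, here is a small sketch that asks the running CPU about each SSE level separately. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available. On other compilers the hypothetical `CPU_HAS` macro simply reports 0. The `x86_simd_levels` struct and `query_x86_simd_levels` function are made up for this example.

```c
#include <stdio.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_CPU_SUPPORTS_BUILTIN 1
// __builtin_cpu_supports needs a string literal, hence a macro, not a loop.
#define CPU_HAS(feature) (__builtin_cpu_supports(feature) ? 1 : 0)
#else
#define HAVE_CPU_SUPPORTS_BUILTIN 0
#define CPU_HAS(feature) 0
#endif

// One flag per SSE generation, plus AVX for comparison.
typedef struct {
  int sse, sse2, sse3, ssse3, sse4_1, sse4_2, avx;
} x86_simd_levels;

// Queries the CPU the program is running on, not the compiler flags.
static x86_simd_levels query_x86_simd_levels(void) {
  x86_simd_levels r;
#if HAVE_CPU_SUPPORTS_BUILTIN
  __builtin_cpu_init();
#endif
  r.sse = CPU_HAS("sse");
  r.sse2 = CPU_HAS("sse2");
  r.sse3 = CPU_HAS("sse3");
  r.ssse3 = CPU_HAS("ssse3");
  r.sse4_1 = CPU_HAS("sse4.1");
  r.sse4_2 = CPU_HAS("sse4.2");
  r.avx = CPU_HAS("avx");
  return r;
}

// Prints one line summarizing the levels; returns fprintf's result.
static int print_x86_simd_levels(FILE* out, x86_simd_levels r) {
  return fprintf(out, "sse=%d sse2=%d sse3=%d ssse3=%d sse4.1=%d sse4.2=%d avx=%d\n",
                 r.sse, r.sse2, r.sse3, r.ssse3, r.sse4_1, r.sse4_2, r.avx);
}
```

Knowing only that the target is x86-64 guarantees the first two fields; everything from SSE3 onward has to be either assumed by a build flag or checked like this.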
There's a few things going on here I think. WUFFS_BASE__CPU_ARCH__X86_FAMILY and are only defined when The second thing that is going on is that I was working around this (I guess you could call it) by defining WUFFS_BASE__CPU_ARCH__X86_FAMILY myself before including the Wuffs C file. This workaround stopped working well in Wuffs 0.4 alpha 4, when WUFFS_BASE__CPU_ARCH__X86_FAMILY became used less (in fact, not used at all). |
Well, currently, just including Wuffs by default on Windows x64 wouldn't even use SSE2. The solution, I think, is to detect building on x64, then allow SSE1 and SSE2. Or you could call it something like |
I'd say more "poorly named" than "incorrect". WUFFS_BASE__APPLY_X86_SIMD_OPTIMIZATIONS_AND_YOU_CAN_ASSUME_ALL_OF_SSE_NOT_JUST_SSE1_AND_SSE2 is more accurate, but maybe not a better name. In terms of "SIMD capability granularities" and not "which WUFFS macros trigger which SIMD code paths", I'm only looking at MSVC documentation (I'm not running MSVC myself) but for e.g. SSE4.1 code, not SSE1 or SSE2, https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?view=msvc-170 says you're going to need |
Ah, "defining it yourself" is a crucial bit of information that wasn't obvious from the earlier conversation. Having a second look at your That macro is a private implementation detail and it's not designed for library users to configure. Wuffs' documentation could certainly be better, but only the macros starting with If you need an immediate workaround for I'll think about whether to introduce a I wouldn't have done so in the past as I didn't know how MSVC would behave when you try to compile SSE4.2 intrinsics without also passing the |
If you're curious about why Wuffs' "SSE-family enabled" behavior changed, from depending on Rather than dealing with any similar-to-#145 issues in the future, it seemed simpler for Wuffs to enable its SIMD code (both SSE-family and AVX) only on 64-bit x86, not both 32-bit and 64-bit x86. That involved the |
That won't help Wuffs' PNG performance. Here's a code snippet from
That all works fine if you assume SSE4.2 (and earlier SSEs). But what if you only have SSE1 and SSE2? It turns out that e.g. the
Wuffs does choose code at run time. Furthermore, Wuffs' compiler (it takes in Currently, Wuffs' "can I call this |
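A rough sketch of what "choose code at run time" can look like in plain C follows. It is not Wuffs' generated code: the `swizzle_*` and `choose_swizzle` names are hypothetical, and the example assumes GCC or Clang on x86-64 for the SIMD path. It swaps the B and R bytes of 4-byte pixels, using `_mm_shuffle_epi8` (an SSSE3 instruction, so not available with only SSE1 and SSE2) when the CPU supports it and a scalar loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*swizzle_func)(uint8_t* pixels, size_t num_pixels);

// Scalar fallback: swaps bytes 0 and 2 of every 4-byte pixel, in place.
static void swizzle_scalar(uint8_t* pixels, size_t num_pixels) {
  for (size_t i = 0; i < num_pixels; i++) {
    uint8_t* p = pixels + (4 * i);
    uint8_t tmp = p[0];
    p[0] = p[2];
    p[2] = tmp;
  }
}

#if defined(__GNUC__) && defined(__x86_64__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8.

// SSSE3 path: handles 4 pixels (16 bytes) per iteration, then hands any
// remaining pixels to the scalar loop.
__attribute__((target("ssse3")))
static void swizzle_ssse3(uint8_t* pixels, size_t num_pixels) {
  // Arguments run from byte 15 down to byte 0; output byte k takes input
  // byte mask[k], which swaps bytes 0 and 2 within each pixel.
  const __m128i mask =
      _mm_set_epi8(15, 12, 13, 14, 11, 8, 9, 10, 7, 4, 5, 6, 3, 0, 1, 2);
  size_t i = 0;
  for (; i + 4 <= num_pixels; i += 4) {
    __m128i v = _mm_loadu_si128((const __m128i*)(pixels + (4 * i)));
    v = _mm_shuffle_epi8(v, mask);
    _mm_storeu_si128((__m128i*)(pixels + (4 * i)), v);
  }
  swizzle_scalar(pixels + (4 * i), num_pixels - i);
}
#endif

// Picks an implementation based on what the running CPU supports.
static swizzle_func choose_swizzle(void) {
#if defined(__GNUC__) && defined(__x86_64__)
  if (__builtin_cpu_supports("ssse3")) {
    return &swizzle_ssse3;
  }
#endif
  return &swizzle_scalar;
}
```

A caller would typically call `choose_swizzle()` once and reuse the returned pointer. The run-time check only helps if the SIMD function was compiled in the first place, which is the part that depends on compiler flags or attributes.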
Yeah, I'm not interested in just targeting SSE2. Targeting SSE4.2 is much more reasonable these days. |
On Windows, Visual Studio 2022, no AVX, AMD Ryzen 9 5950X.