testPackUnpackExternal alignment error on sparc64 #147

Open
drew-parsons opened this issue Nov 19, 2021 · 16 comments
@drew-parsons
Contributor

sparc64 is not the most common architecture around, but for what it's worth, mpi4py 3.1.2 has started giving a Bus Error (Invalid address alignment) in testPackUnpackExternal (test_pack.TestPackExternal):

testProbeRecv (test_p2p_obj_matched.TestP2PMatchedWorldDup) ... ok
testPackSize (test_pack.TestPackExternal) ... ok
testPackUnpackExternal (test_pack.TestPackExternal) ... [sompek:142729] *** Process received signal ***
[sompek:142729] Signal: Bus error (10)
[sompek:142729] Signal code: Invalid address alignment (1)
[sompek:142729] Failing at address: 0xffff800100ea2821
[sompek:142729] *** End of error message ***
Bus error
make[1]: *** [debian/rules:91: override_dh_auto_test] Error 1

Full log at https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=sparc64&ver=3.1.2-1&stamp=1636215944&raw=0

It previously passed with 3.1.1.

Ongoing sparc64 build logs at https://buildd.debian.org/status/logs.php?pkg=mpi4py&arch=sparc64
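
For context, here is a minimal sketch of the code path the failing test exercises, using mpi4py's Pack_external/Unpack_external wrappers around MPI's portable external32 representation (an illustration of the operation, not the exact test body):

```python
# Minimal round-trip through the "external32" packed representation,
# the code path that raises the bus error on sparc64.
from mpi4py import MPI
import array

items = array.array('d', [0.0, 1.0, 2.0, 3.0])
datatype = MPI.DOUBLE

# Buffer size needed to hold the items in external32 format
size = datatype.Pack_external_size('external32', len(items))
packed = bytearray(size)

# Pack into the portable byte stream, then unpack back out
datatype.Pack_external('external32', items, packed, 0)
result = array.array('d', [0.0] * len(items))
datatype.Unpack_external('external32', packed, 0, result)
assert result == items
```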

@dalcinl
Member

dalcinl commented Nov 19, 2021

I don't think I made any change in 3.1.2 from 3.1.1 that would explain such a failure.

Did you notice that the old logs correspond to Open MPI 4.1.1, but the new failing logs correspond to Open MPI 4.1.2? It is not the first time that an Open MPI patch release breaks the mpi4py test suite.

@jeffhammond Can you curse on my behalf?

@drew-parsons
Contributor Author

Fair point. We don't run CI tests on sparc64, so we didn't catch the Open MPI regression until these new builds.

@dalcinl
Member

dalcinl commented Dec 27, 2021

@drew-parsons Did you confirm whether this was an Open MPI regression from 4.1.1 to 4.1.2?

@drew-parsons
Contributor Author

Our sparc64 porterbox is down at the moment, so I can only judge by the past build logs at https://buildd.debian.org/status/logs.php?pkg=mpi4py&arch=sparc64

The last successful sparc64 build was mpi4py 3.1.1 with openmpi 4.1.1,
https://buildd.debian.org/status/fetch.php?pkg=mpi4py&arch=sparc64&ver=3.1.1-8&stamp=1632693743&raw=0

After that, mpi4py 3.1.2 and 3.1.3 have been failing with 4.1.2~rc1 and 4.1.2.

@dalcinl
Member

dalcinl commented May 3, 2022

@drew-parsons What should we do with this issue? Did you report the problem upstream? Is the problem still there with the Open MPI 5.0.0rc tarball? Perhaps we should disable the pack/unpack external tests when running Open MPI < 5 on sparc64? What's the output of platform.machine() on sparc64?

@drew-parsons
Contributor Author

Eh, our sparc64 porterbox is still offline. Apparently a new, bigger, better one is being commissioned. In the meantime, the Debian porters suggest requesting access to the GCC Compile Farm, https://gcc.gnu.org/wiki/CompileFarm , located at https://cfarm.tetaneutral.net/ . Their sparc64 box gcc202 is also down with hardware trouble, but their gcc102 is running fine.

"sparc64" is used in the library triplet, so there's a good chance it's what's returned by platform.machine()

Debian hasn't got a build of Open MPI 5 yet (time doesn't permit me to build it separately). I haven't reported the error upstream; would it be the same as the problem you raised at open-mpi/ompi#8918?

@dalcinl
Member

dalcinl commented May 4, 2022

would it be the same as the problem you raised at open-mpi/ompi#8918?

I'm not sure. Note, however, that you reported a Bus Error (an alignment issue), and that's a bit different from bad binary packing/unpacking in a specific binary representation (external32).
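
For what it's worth, the faulting address in the log above ends in an odd nibble, so it is misaligned for any multi-byte load; a quick check:

```python
# Address copied from the "Failing at address" line in the log
addr = 0xffff800100ea2821
for width in (2, 4, 8):
    print(f'{width}-byte aligned: {addr % width == 0}')  # all False
```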

@dalcinl
Member

dalcinl commented Nov 4, 2022

@drew-parsons Looks like the sparc64 machine is back to life, right? Any chance you can try Open MPI from git at branch v5.0.x to see whether the bus error is still there? If the issue is gone, then we can mark it as a known failure under openmpi<5.0.0 and move on.

@drew-parsons
Contributor Author

Eh, no, the sparc64 porterbox (kyoto.debian.net) is still down. The official buildd is running, but loading a build onto it is not as simple as running a manual build on the porterbox.

@dalcinl
Member

dalcinl commented Nov 7, 2022

OK, sorry for the confusion.

Anyway, why don't you decorate the failing test with @skipMPI("openmpi(<=4.1.4)", platform.machine() == 'sparc64')? By pinning the version, you will not forget about the issue at the next Open MPI version bump, and you can get the build to run to completion (or discover another issue). Pack/unpack in external32 format is not an MPI feature that is used very often. A sketch follows below.
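
A sketch of how that decoration could look in test_pack.py, assuming the module imports mpi4py's mpiunittest test harness the way the rest of the test suite does:

```python
import platform
import mpiunittest as unittest  # mpi4py's test harness, which provides skipMPI

class TestPackExternal(unittest.TestCase):
    # Skip only on sparc64 and only for the pinned Open MPI versions,
    # as suggested above.
    @unittest.skipMPI('openmpi(<=4.1.4)', platform.machine() == 'sparc64')
    def testPackUnpackExternal(self):
        ...  # original test body unchanged
```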

@drew-parsons
Contributor Author

It's a sensible workaround; I'll do that.

@drew-parsons
Contributor Author

Confirming this error still occurs on sparc64 with OpenMPI 4.1.6.
I'll review again with OpenMPI 5, which should be available in Debian soon.

@dalcinl
Member

dalcinl commented May 2, 2024

@drew-parsons Any news about Open MPI v5? Did it land in Debian?

@drew-parsons
Contributor Author

OpenMPI 5 is now available in experimental:
https://buildd.debian.org/status/package.php?p=openmpi&suite=experimental

Getting it into Debian unstable has been slowed down by Debian's decision to introduce 64-bit time_t on 32-bit arches (for the Y2K38 problem), along with the drop of 32-bit support in pmix (and openmpi 5).

We also now have a new sparc64 porterbox we can test on. I'll try to test soon, or we can request access for you if you'd like to inspect it directly yourself.

@drew-parsons
Contributor Author

Still failing with OpenMPI 5.0.3:

testPackSize (test_pack.TestPackExternal.testPackSize) ... ok
testPackUnpackExternal (test_pack.TestPackExternal.testPackUnpackExternal) ... [stadler:1768783] *** Process received signal ***
[stadler:1768783] Signal: Bus error (10)
[stadler:1768783] Signal code: Invalid address alignment (1)
[stadler:1768783] Failing at address: 0xfff8000100f8b011
[stadler:1768783] *** End of error message ***
Bus error
make[1]: *** [debian/rules:91: override_dh_auto_test] Error 1

@dalcinl
Member

dalcinl commented May 4, 2024

My wild guess is that Open MPI is somewhere implementing pack/unpack with unaligned loads/stores rather than memcpy. At this point, IMHO, this issue belongs with the Open MPI project.
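
To illustrate the distinction: external32 is a packed byte stream with no alignment padding, so a double can legitimately start at an odd offset inside the pack buffer. A byte-wise copy (what memcpy does, and what Python's struct module does internally) is safe at any offset, whereas a direct typed load in C from the same odd address is exactly what raises SIGBUS on strict-alignment hardware like sparc64:

```python
import struct

# A big-endian double placed at odd offset 1, as external32 packing may do
buf = b'\x00' + struct.pack('>d', 3.14)

# Byte-wise read (memcpy-like): safe at any offset, on any architecture
(value,) = struct.unpack_from('>d', buf, 1)
assert value == 3.14

# The suspected failing pattern, in C terms: *(double *)(buf + 1),
# a direct 8-byte load from an address that is not 8-byte aligned.
```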
