Possible initialization bug for `LORSolver<HypreBoomerAMG>` #4286

victor-decaria-nnl · 2024-05-07T14:10:35Z

On GPUs, I've experienced non-deterministic behavior in the lor_elast miniapp I submitted. This is possibly due to some objects not being initialized, since CUDA's initcheck sanitizer reports errors.

To investigate, I made an MRE, mre_lor.gz, which is a modification of ex1p.cpp and solves Poisson. The system is solved with a matrix-free (partial assembly) CGSolver, preconditioned with a LORSolver<HypreBoomerAMG> approximation of the same bilinear form. This is a simplification of the lor_elast miniapp, which additionally uses a block-diagonal preconditioner made of LORSolver<HypreBoomerAMG>s for a vector-valued PDE.

Two sample runs are:

compute-sanitizer --tool initcheck ./mre_lor -d cuda
compute-sanitizer --tool initcheck ./mre_lor -b

The first run is similar to the miniapp in that it sets the preconditioner in CGSolver AFTER setting the operator. This runs, but the sanitizer shows 5 errors, and the number of errors scales with the problem size. When MFEM is built for CPU only, I don't have issues with sanitizers with this approach.

The second run sets the preconditioner BEFORE setting the operator, which is recommended, but this results in a runtime error. The code snippet below explains why. prec is a LORSolver<HypreBoomerAMG>.

   if (!set_precond_after_op)
   {
      // This is the recommended way to set preconditioners, before setting the operator.
      // Does not work because CGSolver calls prec->SetOperator on A, which
      // calls HypreBoomerAMG::SetOperator on A, but A is not a HypreParMatrix,
      // it is a constrained partially assembled ParBilinearForm.
      cg.SetPreconditioner(prec);
   }
   cg.SetOperator(*A);
   if (set_precond_after_op)
   {
      // Runs, but "compute-sanitizer --tool initcheck" complains. Results
      // may be non-deterministic.
      cg.SetPreconditioner(prec);
   }
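
The ordering effect described in the snippet's comments can be reproduced without MFEM. The following is a minimal sketch using hypothetical stand-in classes (`AssembledMatrix`, `MatrixFreeOperator`, `MatrixOnlyPreconditioner`, `IterativeSolver`, and the helper `configure`); none of them are MFEM types. It assumes only what the comments state: setting the operator on the outer solver forwards that operator to a preconditioner that is already attached, and the preconditioner accepts only an assembled matrix.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration; these are not MFEM classes.
struct Operator { virtual ~Operator() = default; };
struct AssembledMatrix : Operator {};     // plays the role of an assembled matrix
struct MatrixFreeOperator : Operator {};  // plays the role of the partially assembled form

struct Solver {
    virtual ~Solver() = default;
    virtual void SetOperator(const Operator &op) = 0;
};

// A preconditioner that needs matrix entries and rejects anything else.
struct MatrixOnlyPreconditioner : Solver {
    const AssembledMatrix *mat;
    explicit MatrixOnlyPreconditioner(const AssembledMatrix &m) : mat(&m) {}
    void SetOperator(const Operator &op) override {
        auto *m = dynamic_cast<const AssembledMatrix *>(&op);
        if (!m) {
            throw std::invalid_argument("preconditioner needs an assembled matrix");
        }
        mat = m;
    }
};

// An outer solver that forwards its operator to an attached preconditioner.
struct IterativeSolver {
    const Operator *oper = nullptr;
    Solver *prec = nullptr;
    void SetPreconditioner(Solver &p) { prec = &p; }
    void SetOperator(const Operator &op) {
        oper = &op;
        if (prec) { prec->SetOperator(op); }  // only reached if prec was set first
    }
};

// Returns "ok" on success, or the error message if the setup is rejected.
std::string configure(bool set_precond_after_op) {
    AssembledMatrix low_order_matrix;  // what the preconditioner was built from
    MatrixFreeOperator A;              // what the outer solver iterates on
    MatrixOnlyPreconditioner prec(low_order_matrix);
    IterativeSolver cg;
    try {
        if (!set_precond_after_op) { cg.SetPreconditioner(prec); }
        cg.SetOperator(A);
        if (set_precond_after_op) { cg.SetPreconditioner(prec); }
    } catch (const std::invalid_argument &e) {
        return e.what();
    }
    return "ok";
}
```

In this sketch, `configure(true)` succeeds because the preconditioner never sees the matrix-free operator, while `configure(false)` fails as soon as the operator is forwarded.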

Am I setting up the solver wrong?

I'm not sure how accurate CUDA's sanitizers are, but I tried this with a few different combinations of compiler, MPI implementation, and CUDA version.

Thanks!

pazner · 2024-05-07T20:52:44Z

Hi @victor-decaria-nnl,

Do you get the same initcheck errors if you try an example with AMG, but without LOR preconditioning? For example:

compute-sanitizer --tool initcheck ./ex1p -fa -d cuda

(Yes, with the LOR solvers, set the preconditioner after setting the operator. Maybe the LOR solvers should just give an error if you try to call SetOperator on them; there is not really any way to properly set up an LOR solver object from a generic Operator.)
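
As a rough illustration of that suggestion, here is a minimal sketch of a solver whose SetOperator reports the misuse instead of forwarding the call. The classes `Operator`, `Solver`, and `ConstructionOnlySolver` and the helper `rejects_set_operator` are hypothetical stand-ins, not MFEM code, and a standard exception is used here purely for illustration.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for illustration; these are not MFEM classes.
struct Operator { virtual ~Operator() = default; };

struct Solver {
    virtual ~Solver() = default;
    virtual void SetOperator(const Operator &op) = 0;
};

// A preconditioner that is fully set up at construction time. It cannot be
// rebuilt from a generic Operator, so SetOperator raises a clear error.
class ConstructionOnlySolver : public Solver {
public:
    void SetOperator(const Operator &) override {
        throw std::logic_error(
            "SetOperator is not supported; attach this preconditioner "
            "after setting the operator on the outer solver");
    }
};

// Returns true if SetOperator rejects the call with std::logic_error.
bool rejects_set_operator() {
    Operator op;
    ConstructionOnlySolver prec;
    try {
        prec.SetOperator(op);
    } catch (const std::logic_error &) {
        return true;
    }
    return false;
}
```

With a guard like this, attaching the preconditioner before setting the operator would fail with a message that names the cause, rather than with an error from a nested solver.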

victor-decaria-nnl · 2024-05-07T21:11:27Z

Hi @pazner,
I do get the same errors with that sample run you provided. Here's the beginning of the output for reference. The executable still successfully completes. Does the sanitizer come up clean for you?

And that makes sense about setting it up after setting the operator. I just wanted to make sure that was the accepted use.

========= COMPUTE-SANITIZER
Options used:
   --mesh ../data/star.mesh
   --order 1
   --no-static-condensation
   --no-partial-assembly
   --full-assembly
   --device cuda
   --no-algebraic
   --visualization
Device configuration: cuda,cpu
Memory configuration: host-std,cuda
Number of finite element unknowns: 82561
========= Uninitialized __global__ memory read of size 8 bytes
=========     at 0x690 in void cusparse::csrmv_v3_partition_kernel<(int)256, cusparse::VectorScalarMultiplyPolicy, int, int, double, double, double>(const T4 *, T3, T4, int, T3 *, cusparse::KernelCoeff<T7>, T3, T6 *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x14ac961e1400
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2ef370]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x92566e]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame: [0x98587e]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame: [0x1ad367]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame: [0x1b5bc9]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame: [0x1da0d3]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame:cusparseSpMV [0xe88dc]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcusparse.so.12
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/seq_mv/csr_matvec_device.c:291:hypre_CSRMatrixMatvecCusparseNewAPI [0x898707]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/seq_mv/csr_matvec_device.c:414:hypre_CSRMatrixMatvecCusparse [0x898d42]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/seq_mv/csr_matvec_device.c:84:hypre_CSRMatrixMatvecDevice2(int, double, hypre_CSRMatrix*, hypre_Vector*, double, hypre_Vector*, int) [0x897d4f]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/seq_mv/csr_matvec_device.c:166:hypre_CSRMatrixMatvecDevice [0x8981b9]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/seq_mv/csr_matvec.c:878:hypre_CSRMatrixMatvecOutOfPlace [0x55810c]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_mv/par_csr_matvec_device.c:221:hypre_ParCSRMatrixMatvecOutOfPlaceDevice [0x8497a9]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_mv/par_csr_matvec.c:254:hypre_ParCSRMatrixMatvecOutOfPlace [0x51e724]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_mv/par_csr_matvec.c:279:hypre_ParCSRMatrixMatvec [0x51e7a1]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:linalg/hypre.cpp:1821:mfem::HypreParMatrix::Mult(double, mfem::Vector const&, double, mfem::Vector&) const [0x9952a]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:fem/../linalg/hypre.hpp:703:mfem::HypreParMatrix::Mult(mfem::Vector const&, mfem::Vector&) const [0xb09c4]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:linalg/solvers.cpp:727:mfem::CGSolver::Mult(mfem::Vector const&, mfem::Vector&) const [0xbde4f]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/spack-src/examples/ex1p.cpp:272:main [0x1a3d8]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:__libc_start_main [0x3ad85]
=========                in /lib64/libc.so.6
=========     Host Frame:_start [0x1970e]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
========= 
========= Uninitialized __global__ memory read of size 8 bytes
=========     at 0x2ea0 in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/thread/thread_load.cuh:275:std::iterator_traits<T2>::value_type cub::CUB_200200_700_NS::ThreadLoad<(cub::CUB_200200_700_NS::CacheLoadModifier)5, const unsigned long long *>(T2)
=========     by thread (2,0,0) in block (80,0,0)
=========     Address 0x14ac9da50010
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/thread/thread_load.cuh:119:void cub::CUB_200200_700_NS::IterateThreadLoad<(int)0, (int)1>::Load<(cub::CUB_200200_700_NS::CacheLoadModifier)5, unsigned long long>(const T2 *, T2 *) [0x2ea0]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/thread/thread_load.cuh:377:T1 cub::CUB_200200_700_NS::ThreadLoad<double, (int)5>(const T1 *, cub::CUB_200200_700_NS::Int2Type<T2>, cub::CUB_200200_700_NS::Int2Type<(int)1>) [0x2ea0]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/thread/thread_load.cuh:395:std::iterator_traits<T2>::value_type cub::CUB_200200_700_NS::ThreadLoad<(cub::CUB_200200_700_NS::CacheLoadModifier)5, double *>(T2) [0x2ea0]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/iterator/cache_modified_input_iterator.cuh:203:double cub::CUB_200200_700_NS::CacheModifiedInputIterator<(cub::CUB_200200_700_NS::CacheLoadModifier)5, double, long>::operator []<int>(T1) const [0x2ea0]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/block/block_load.cuh:364:void cub::CUB_200200_700_NS::LoadDirectWarpStriped<double, (int)4, cub::CUB_200200_700_NS::CacheModifiedInputIterator<(cub::CUB_200200_700_NS::CacheLoadModifier)5, double, long>>(int, T3, T1 (&)[T2]) [0x2ea0]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/block/block_load.cuh:968:void cub::CUB_200200_700_NS::BlockLoad<double, (int)128, (int)4, (cub::CUB_200200_700_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)520>::LoadInternal<(cub::CUB_200200_700_NS::BlockLoadAlgorithm)4, (int)0>::Load<cub::CUB_200200_700_NS::CacheModifiedInputIterator<(cub::CUB_200200_700_NS::CacheLoadModifier)5, double, long>>(T1, double (&)[4]) [0x2e90]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cub/block/block_load.cuh:1189:void cub::CUB_200200_700_NS::BlockLoad<double, (int)128, (int)4, (cub::CUB_200200_700_NS::BlockLoadAlgorithm)4, (int)1, (int)1, (int)520>::Load<cub::CUB_200200_700_NS::CacheModifiedInputIterator<(cub::CUB_200200_700_NS::CacheLoadModifier)5, double, long>>(T1, double (&)[4]) [0x2e90]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:382:int thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>::impl::consume_tile_impl<(bool)0, (bool)0>(int, int, int) [0x2e90]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:494:int thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>::impl::consume_tile<(bool)0>(int, int, int) [0x80]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:526:thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>::impl::impl(thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>::PtxPlan<thrust::cuda_cub::core::sm52>::TempStorage &, cub::CUB_200200_700_NS::ScanTileState<int, (bool)1> &, double *, int *, double *, is_negative<int>, int, int, int *) [0x80]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:568:thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>::entry(double *, int *, double *, is_negative<int>, int, int *, cub::CUB_200200_700_NS::ScanTileState<int, (bool)1>, int, char *) [0x20]
=========     Device Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/core/agent_launcher.h:111:void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<double *, int *, double *, is_negative<int>, int, int *>, double *, int *, double *, is_negative<int>, int, int *, cub::CUB_200200_700_NS::ScanTileState<int, (bool)1>, unsigned long>(T2, T3, T4, T5, T6, T7, T8, T9) [0x20]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2ef370]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x14fb4]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcudart.so.12
=========     Host Frame:cudaLaunchKernel [0x70aae]
=========                in /apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/lib64/libcudart.so.12
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/cuda_runtime.h:216:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x75f745]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/var/tmp/pbs.1072554.sawtoothpbs/tmpxft_000003a9_00000000-6_par_lr_interp_device.cudafe1.stub.c:46:__device_stub__ZN6thrust8cuda_cub4core13_kernel_agentINS0_9__copy_if11CopyIfAgentIPdPiS5_11is_negativeIiEiS6_EES5_S6_S5_S8_iS6_N3cub17CUB_200200_700_NS13ScanTileStateIiLb1EEEmEEvT0_T1_T2_T3_T4_T5_T6_T7_(double*, int*, double*, is_negative<int>&, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>&, unsigned long) [0x75eab9]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/var/tmp/pbs.1072554.sawtoothpbs/tmpxft_000003a9_00000000-6_par_lr_interp_device.cudafe1.stub.c:50:void thrust::cuda_cub::core::__wrapper__device_stub__kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<double*, int*, double*, is_negative<int>, int, int*>, double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long>(double*&, int*&, double*&, is_negative<int>&, int&, int*&, cub::CUB_200200_700_NS::ScanTileState<int, true>&, unsigned long&) [0x75eb19]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/core/agent_launcher.h:321:void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<double*, int*, double*, is_negative<int>, int, int*>, double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long>(double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long) [0x761043]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/core/triple_chevron_launch.h:62:cudaError thrust::cuda_cub::launcher::triple_chevron::doit_host<void (*)(double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long), double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long>(void (*)(double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long), double* const&, int* const&, double* const&, is_negative<int> const&, int const&, int* const&, cub::CUB_200200_700_NS::ScanTileState<int, true> const&, unsigned long const&) const [0x763df6]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/core/agent_launcher.h:954:void thrust::cuda_cub::core::AgentLauncher<thrust::cuda_cub::__copy_if::CopyIfAgent<double*, int*, double*, is_negative<int>, int, int*> >::launch_impl<double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long>(thrust::detail::integral_constant<bool, true>, double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long) const [0x762ca5]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/core/agent_launcher.h:1085:void thrust::cuda_cub::core::AgentLauncher<thrust::cuda_cub::__copy_if::CopyIfAgent<double*, int*, double*, is_negative<int>, int, int*> >::launch<double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long>(double*, int*, double*, is_negative<int>, int, int*, cub::CUB_200200_700_NS::ScanTileState<int, true>, unsigned long) const [0x762537]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:681:cudaError thrust::cuda_cub::__copy_if::doit_step<double*, int*, double*, is_negative<int>, int, int*>(void*, unsigned long&, double*, int*, double*, is_negative<int>, int*, int, CUstream_st*) [0x7606bd]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:752:double* thrust::cuda_cub::__copy_if::copy_if<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base>, double*, int*, double*, is_negative<int> >(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base> >&, double*, double*, int*, double*, is_negative<int>) [0x761d3d]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/system/cuda/detail/copy_if.h:825:double* thrust::cuda_cub::copy_if<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base>, double*, int*, double*, is_negative<int> >(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base> >&, double*, double*, int*, double*, is_negative<int>) [0x761454]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/apps/spack/opt/gcc-12.3.0/cuda-12.2.2-5efgibuz4aapvnxjnohsec4pjyulwvw3/include/thrust/detail/copy_if.inl:61:double* thrust::copy_if<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base>, double*, int*, double*, is_negative<int> >(thrust::detail::execution_policy_base<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::cuda_cub::execute_on_stream_base> > const&, double*, double*, int*, double*, is_negative<int>) [0x760e7f]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_ls/par_lr_interp_device.c:1062:hypre_BoomerAMGBuildExtPIInterpDevice [0x75b149]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_ls/par_lr_interp.c:5566:hypre_BoomerAMGBuildExtPIInterp [0x45b5be]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_ls/par_amg_setup.c:2205:hypre_BoomerAMGSetup [0x7003d7]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/spack-src/src/parcsr_ls/HYPRE_parcsr_amg.c:57:HYPRE_BoomerAMGSetup [0x3aac8b]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/hypre-2.29.0-m6dmfwvk4kvwuzd4jgn734xviyehsy3f/lib/libHYPRE-2.29.0.so
=========     Host Frame:linalg/hypre.cpp:3946:mfem::HypreSolver::Setup(mfem::HypreParVector const&, mfem::HypreParVector&) const [0xa37ef]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:linalg/hypre.cpp:3987:mfem::HypreSolver::Mult(mfem::HypreParVector const&, mfem::HypreParVector&) const [0xa3c7b]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:linalg/hypre.cpp:4004:mfem::HypreSolver::Mult(mfem::Vector const&, mfem::Vector&) const [0xa404d]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:linalg/solvers.cpp:738:mfem::CGSolver::Mult(mfem::Vector const&, mfem::Vector&) const [0xbdf02]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:/tmp/pbs.1072554.sawtoothpbs/decavict/spack-stage/spack-stage-mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/spack-src/examples/ex1p.cpp:272:main [0x1a3d8]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========     Host Frame:__libc_start_main [0x3ad85]
=========                in /lib64/libc.so.6
=========     Host Frame:_start [0x1970e]
=========                in /home/decavict/Repos/spack-0-21-clean/opt/spack/linux-rocky8-cascadelake/gcc-12.3.0/mfem-4.6.1.2-3iadgkcmz3iqnq6ll2ou4b54yvubw7on/share/mfem/examples/./ex1p
=========

pazner · 2024-05-07T21:19:25Z

> Hi @pazner, I do get the same errors with that sample run you provided. Here's the beginning of the output for reference. The executable still successfully completes. Does the sanitizer come up clean for you?

I also get some errors (not the same as the ones you are getting), but I don't know whether they are false positives. What this does tell us is that the errors do not appear to be caused by the LOR preconditioner, but by something in the hypre AMG setup. The errors I see are all like this one:

========= Uninitialized __global__ memory read of size 1 bytes
=========     at 0x2d0 in /usr/tce/packages/cuda/cuda-11.8.0/include/cub/thread/thread_load.cuh:408:void cub::DeviceScanKernel<cub::DeviceScanPolicy<int>::Policy600,int*,int*,cub::ScanTileState<int,bool=1>,thrust::plus<void>,cub::detail::InputValue<int,int*>,int>(cub::DeviceScanPolicy<int>::Policy600,int*,int*,int,int,bool=1,cub::ScanTileState<int,bool=1>)
=========     by thread (31,0,0) in block (0,0,0)
=========     Address 0x2000acc0227c
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x200014907fdc]
=========                in /lib64/libcuda.so.1

Any ideas, @v-dobrev, @liruipeng?

victor-decaria-nnl · 2024-05-23T17:33:58Z

@pazner I think we should close this for now until I can further isolate the problem. I changed some things in my stack and the issue went away, and I didn't save my old spack concretization, so it could have been a false positive. I can reopen the issue if I can isolate two configurations where one causes the sanitizer to trip, and the other doesn't.

pazner · 2024-05-23T17:43:52Z

Sounds good, please re-open if needed.

victor-decaria-nnl added bug GPU labels May 7, 2024

tzanio assigned pazner May 7, 2024

pazner closed this as completed May 23, 2024
