Replies: 1 comment · 7 replies
-
Yes, this is possible. But don't use host tasks for this purpose. They are broken beyond repair for it, as they are executed when the SYCL task graph is executed, not when it is submitted. So you would enqueue additional cuBLAS operations while your kernels are already running (note the additional synchronization at the end of the host task!).

AdaptiveCpp has something better, which can substantially outperform host-task-based code patterns. What you want is the custom operation extension: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md

Note that there is already AdaptiveCpp support for CUDA and HIP backends in upstream oneMKL (which then dispatches calls to cuBLAS or rocBLAS). I'm not sure whether it still works at the moment, as there were some CI issues. The code there basically does exactly what you want, and also uses our extension.
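To make the suggestion concrete, here is a minimal sketch of such a custom operation wrapping a cuBLAS GEMM (my illustration, not code from the linked doc; `handle`, `n`, and the device pointers are assumed to exist already, and error checking is omitted):

```cpp
#include <sycl/sycl.hpp>
#include <cublas_v2.h>

// Sketch: enqueue a cuBLAS SGEMM as an AdaptiveCpp custom operation.
// Unlike a host task, this runs when the SYCL DAG executes, submitting
// the cuBLAS call onto the queue's native CUDA stream without blocking.
void enqueue_gemm(sycl::queue& q, cublasHandle_t handle, int n,
                  const float* d_A, const float* d_B, float* d_C) {
    q.submit([&](sycl::handler& cgh) {
        cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle& h) {
            // Bind cuBLAS to the in-order queue's CUDA stream so the call
            // is ordered with surrounding SYCL kernels on the device.
            cublasSetStream(handle, h.get_native_queue<sycl::backend::cuda>());
            const float alpha = 1.0f, beta = 0.0f;
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, d_A, n, d_B, n, &beta, d_C, n);
        });
    });
    // No wait() here: the GEMM is asynchronous on the queue's stream.
}
```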
-
Regarding compatibility with DPC++, here's what we do in GROMACS:

```cpp
#if GMX_SYCL_HIPSYCL // use hipSYCL_enqueue_custom_operation
    queue_.submit([&](sycl::handler& cgh) {
        cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle& h) {
            callNativeLibrary(h.get_native_queue<sc_syclBackend>(), // sc_syclBackend == sycl::backend::cuda
                              /*other parameters*/);
        });
    });
#elif GMX_SYCL_DPCPP // submit directly
    callNativeLibrary(sycl::get_native<sc_syclBackend>(queue_), // sc_syclBackend == sycl::backend::ext_oneapi_cuda
                      /*other parameters*/);
#endif
```
This only works with in-order queues, and only if you don't need the returned `sycl::event`. Also keep in mind that AdaptiveCpp does all the work in a separate thread (unless instant submission mode is used), so even host-only API calls should go through `hipSYCL_enqueue_custom_operation`.
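For reference, a minimal sketch of the in-order queue the direct-submit pattern assumes (standard SYCL 2020; the selector is just an example):

```cpp
#include <sycl/sycl.hpp>

// An in-order queue guarantees submissions execute in submission order,
// so a native call enqueued on the underlying stream cannot overtake
// previously submitted SYCL kernels.
sycl::queue queue_{sycl::gpu_selector_v,
                   sycl::property::queue::in_order{}};
```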
-
Thanks for the reply. It's an elegant solution. Did I get it right that in the case of
EDIT: If this works for DPC++, would it work similarly for AdaptiveCpp? I mean, instead of using
-
Pretty much, yes. But it relies on a number of assumptions (in-order queues, no event- or buffer-based dependencies, no frequent switching between multiple devices, etc.) that might not hold in all cases, but it generally works well if you use SYCL the same way you would use CUDA. AdaptiveCpp's approach is much more robust (e.g., it guarantees that the correct device is active when using multiple devices).
CUDA (and HIP) runtime APIs use thread-based state: each host thread can have a different active CUDA context. Since AdaptiveCpp, by default, uses a separate worker thread to make CUDA API calls (unlike DPC++, which does everything from the same application thread that calls the SYCL API), there is no guarantee that the main application thread and the worker thread will have the same context, and in that case all bets are off. We had such an issue with rocFFT initialization; it was solved by putting the initialization into `hipSYCL_enqueue_custom_operation`.
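The pitfall is easy to reproduce with plain C++ threads, using a `thread_local` variable as a stand-in for CUDA's per-thread "current context" (this is an analogy of mine, not CUDA code; real `cudaSetDevice`/`cuCtxSetCurrent` similarly affect only the calling thread):

```cpp
#include <thread>

// Hypothetical stand-in for CUDA's per-thread current context:
// each thread sees its own copy, initialized to "no context bound".
thread_local int g_current_context = -1;

// Bind a context on the calling thread, then ask a fresh worker thread
// what it sees. This mirrors AdaptiveCpp submitting work from its own
// worker thread rather than from the application thread.
int context_seen_by_worker() {
    g_current_context = 0;  // "initialize the library" on the main thread
    int seen = -2;
    std::thread worker([&] { seen = g_current_context; });
    worker.join();
    return seen;  // the worker starts with its own thread_local value: -1
}
```

The worker never sees the main thread's binding, which is why even host-only initialization calls belong inside the custom operation.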
-
Thanks for the explanations. This makes more sense. I will run some tests to check the above.
Cristian
-
For anyone who might be interested, this is the final code :):
I get different results between cuBLAS and `oneapi::mkl::blas::gemm`, but that is most probably due to calling the cuBLAS library differently. I could not figure out how to get the stream information from the queue outside of the custom operation.
-
Hello,
In some cases, when porting a code, some operations like linear algebra may not have equivalents. So one is left to use, for example, cuBLAS/hipBLAS, which are non-portable.
In CUDA/HIP, the BLAS (and other library) calls are asynchronous and can be associated with a stream. So one could launch kernels, make the BLAS call, then launch more kernels, and only synchronize at the end.
In SYCL, one could do something like: launch kernels, synchronize, call BLAS, synchronize, launch the rest of the kernels. This is not optimal. I found this code in a Codeplay repository.
My understanding of the code is that the stream info can be obtained from a queue, and then the BLAS calls would be associated with the stream and consequently with the queue. So one would be able to do a series of calls: launch kernels, call BLAS, launch kernels, and they would all run in the same queue.
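For concreteness, the CUDA-side pattern described above might look like this (a sketch only: the kernel launches are left as comments, error checking is omitted, and `stream`, `handle`, `n`, and the device pointers are assumed to exist):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: kernels and the BLAS call share one stream, so everything is
// ordered on the device without intermediate host synchronization.
void pipeline(cudaStream_t stream, cublasHandle_t handle, int n,
              const float* d_A, const float* d_B, float* d_C) {
    // ... launch kernels with <<<grid, block, 0, stream>>> here ...

    cublasSetStream(handle, stream);  // associate cuBLAS with the stream
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);  // async on `stream`

    // ... launch more kernels on the same stream ...

    cudaStreamSynchronize(stream);  // single synchronization at the end
}
```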
The code from the Codeplay repository seems to be using oneAPI extensions. Is there something equivalent in AdaptiveCpp?
Best,
Cristian