Interoperability between queues and cuda/hip streams #1339

csccva · 2024-02-04T18:31:51Z

csccva
Feb 4, 2024

Hello,

In some case there is a possibility that when porting a code some operations like linear algebra do not have equivalents. SO one is left use for example cu/hipblas which are non-portable.

In cuda/hip the blas (and other libraries) calls are asynchronous and can be associated with a stream. So one could launch kernels, make the blas call, then launch more and only synchronize at the end.

In SYCL one could do some like launch kernels, synchronize, call blas, synchronize, launch the rest of the kernels. This is not optimal. I found this code in a CodePlay repository

q.submit([&](handler &h) {
     h.host_task([=](sycl::interop_handle ih) {
       // Set the correct cuda context & stream
       cuCtxSetCurrent(ih.get_native_context<backend::ext_oneapi_cuda>());
       auto cuStream = ih.get_native_queue<backend::ext_oneapi_cuda>();
       cublasSetStream(handle, cuStream);

       // Call generalised matrix-matrix multiply
       CHECK_ERROR(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WIDTH, HEIGHT,
                               WIDTH, &ALPHA, d_A, WIDTH, d_B, WIDTH, &BETA,
                               d_C, WIDTH));
       cuStreamSynchronize(cuStream);
     });
   }).wait();

My understanding of the code is that the stream info can be obtained from a queue and then the blas calls would be associated with the stream and consequently with the queue. So one would be able to do a series of calls: launch kernels, call blas, launch kernel and they would all run in the same queue.
The above code seems to be using onepi extensions. Is there something equivalent in AdaptiveCpp?

Best,

Cristian

illuhad · 2024-02-04T18:58:57Z

illuhad
Feb 4, 2024
Maintainer

Yes, this is possible. But don't use host tasks for this purpose. They are broken beyond repair for this purpose, as they are executed when the SYCL task graph is executed, not when it is submitted. So you would enqueue additional cublas operations while your kernels are already running (note the additional synchronization at the end of the host task!). AdaptiveCpp has something better which can substantially outperform host-task-based code patterns.

What you want is the custom operation extension: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md
You will also find an example on this page. Note that AdaptiveCpp uses the CUDA runtime API, not the driver API that DPC++ uses. So you don't need to worry about the CUDA context at all.

Note that there is already AdaptiveCpp support for CUDA and HIP backends in upstream oneMKL (which then dispatches calls to cuBLAS or rocBLAS). I'm not sure whether it still works at the moment, as there were some CI issues. The code there basically does exactly what you want, and also uses our extension.

7 replies

al42and Feb 5, 2024
Collaborator

Regarding compatibility with DPC++, here's what we do in GROMACS:

#if GMX_SYCL_HIPSYCL // use hipSYCL_enqueue_custom_operation
    queue_.submit([&](sycl::handler & cgh) {
        cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle& h) {
            callNativeLibrary(h.get_native_queue<[sc_syclBackend](sycl::backend::cuda)>(),
                        /*other parameters*/);
        });
    });
#elif GMX_SYCL_DPCPP // submit directly
    callNativeLibrary(sycl::get_native<[sc_syclBackend](sycl::backend::ext_oneapi_cuda)>(queue_),
                 /*other parameters*/);
#endif
}

Only works with in-order queues and only if you don't need the returned sycl::event and only if you don't worry about driver vs. runtime API (the case for cudaStream/CUstream), but if stars align that permits minimal code duplication.

Also keep in mind that AdaptiveCpp does all the work in a separate thread (unless instant submission mode is used), so even host-only API calls should go through hipSYCL_enqueue_custom_operation, to make sure the correct device is active in multi-GPU systems.

csccva Feb 6, 2024
Author

Thanks for the reply.

It's an elegant solution.

Did I get it right that in the case of GMX_SYCL_DPCPP one just make the direct call? So in my case that would be:

// Some kernels with in-order queue

// Create cublas handle
  cublasHandle_t handle;
  CHECK_ERROR(cublasCreate(&handle));
  auto cuStream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
    cublasSetStream(handle, cuStream);
       // Call generalised matrix-matrix multiply
       CHECK_ERROR(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WIDTH, HEIGHT,
                               WIDTH, &ALPHA, d_A, WIDTH, d_B, WIDTH, &BETA,
                               d_C, WIDTH));
     });
   })
// .wait() or more kernels on the same queue

 // Copy the result back to host
  q.memcpy(h_C.data(), d_C, numBytes).wait();

EDIT If this work for dpcpp, would it work similarly for AdaptiveCpp? I mean instead of using hipSYCL_enqueue_custom_operation just get the stream and call directly?

al42and Feb 6, 2024
Collaborator

Did I get it right that in the case of GMX_SYCL_DPCPP one just make the direct call? So in my case that would be:

Pretty much, yes. But it relies on the number of assumptions (in-order queues, no event- or buffer-based dependencies, no frequent switching between multiple devices etc) that might not be true in all cases, but generally work well if you use SYCL the same way you use CUDA. AdaptiveCpp's approach is much more robust (e.g., it will guarantee that the correct device is active when using multiple devices).

EDIT If this work for dpcpp, would it work similarly for AdaptiveCpp? I mean instead of using hipSYCL_enqueue_custom_operation just get the stream and call directly?

CUDA (and HIP) runtime APIs use thread-based state: each host thread can have a different active CUDA context. Since AdaptiveCpp, by default, uses a separate worker thread to do CUDA API calls (unlike DPC++, which does everything from the same application thread that calls SYCL API), there is no guarantee that one will have the same context in the main application thread and in the worker thread, and in this case, all bets are off. We had such an issue with rocFFT initialization; solved by putting it into hipSYCL_enqueue_custom_operation, even though it did not submit any work to the GPU.

csccva Feb 7, 2024
Author

Thanks for the explanations. This make more sense. I will try to make some tests to check the above.

Cristian

csccva Feb 7, 2024
Author

For anyone who might be interested this is the final code :):

//==============================================================
// Matrix Multiplication: DPC++ MKL
//==============================================================
// Copyright © 2021 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

// icpx -std=c++17 -O3 -fsycl -fsycl-targets=spir64_x86_64 -I$MKLROOT/include  -L$MKLROOT/lib/intel64/  -lmkl_sycl -lmkl_core  -lmkl_sequential -lmkl_intel_ilp64  gemm_mkl_usm.cpp
// syclcc -O3 --hipsycl-targets="cuda:sm_80"  
// Add -DMKL_LIB if oneMKL is used

// Add -DCUBLAS if cublas library is use. It needs also -isystem $CUDA_HOME/include/ -L$CUDA_HOME/lib64/ -lcublas -lcudart -lcuda
// Add the appropriate targets: -fsycl-targets=nvptx64-nvidia-cuda,spir64_x86_64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80

// Add -DACPP if cublas and Adaptive Cpp is used for compiling
// Add -DDPCPP -DCUDA_NO_HALF if oneapi is used for compiling


#include <sycl/sycl.hpp>
#include <ctime>
#include <chrono>
#include <getopt.h>
#if MKL_LIB
#include "oneapi/mkl.hpp"  //# oneMKL DPC++ interface
#endif

#if CUBLAS 
// cuda interface
#include <cublas_v2.h>
#include <cuda.h>


#define CHECK_ERROR(FUNC) checkCudaErrorMsg(FUNC, " " #FUNC)

void inline checkCudaErrorMsg(cublasStatus_t status, const char *msg) {
  if (status != CUBLAS_STATUS_SUCCESS) {
    std::cout << "ERROR CUBLAS:" << msg << " - " << status << std::endl;
    exit(EXIT_FAILURE);
  }
}
#endif

using namespace sycl;
#if MKL_LIB
using namespace oneapi::mkl;
#endif

int main(int argc, char *argv[]) {
    
    size_t N = 1024;
    size_t M = 32;
    int VERIFY = 0;
    int PRINT_OUTPUT_MATRIX = 0;
    
    int arg;
    while ((arg = getopt (argc, argv, "n:m:vp:h")) != -1)
        switch (arg){
            case 'n':
                N = std::atoi(optarg);
                break;
            case 'm':
                M = std::atoi(optarg);
                break;
            case 'v':
                VERIFY = 1;
                break;
            case 'p':
                PRINT_OUTPUT_MATRIX = 1;
                break;
            case 'h':
                std::cout << std::endl;
                std::cout << "Usage   : ./a.out -n <MATRIX_SIZE> -m <WORK_GROUP_SIZE> -v -p\n\n";
                std::cout << "          [-n] size for matrix, eg: 1024\n";
                std::cout << "          [-m] size of work_group, eg: 8/16\n";
                std::cout << "          [-v] verify output with linear computation on cpu\n";
                std::cout << "          [-p] print output matrix\n";
                std::cout << "Example : ./a.out -n 1024 -m 16 -v -p\n\n";
                std::exit(0);
        }

    //# Define vectors for matricies
    std::vector<float> matrix_a(N*N);
    std::vector<float> matrix_b(N*N);
    std::vector<float> matrix_c(N*N);
    std::vector<float> matrix_d(N*N);
    
    //# Initialize matricies with values
    float v1 = 2.f;
    float v2 = 3.f;
    for (int i=0; i<N; i++)
        for (int j=0; j<N; j++){
            matrix_a[i*N+j] = v1++;
            matrix_b[i*N+j] = v2++;
            matrix_c[i*N+j] = 0.f;
            matrix_d[i*N+j] = 0.f;
    }
    //# Define queue with default device for offloading computation
    queue q{property::queue::in_order{}};

    // First we warm-up the device
    std::cout << "Warm-up first" << "\n"; 

    {
        //# Create buffers for matrices

        buffer<float, 1> a(matrix_a.data(), range<1>(N*N));
        buffer<float, 1> b(matrix_b.data(), range<1>(N*N));
        buffer<float, 1> c(matrix_c.data(), range<1>(N*N));
       
         //# Submit command groups to execute on device
         q.submit([&](handler &h){
            //# Create accessors to copy buffers to the device
            auto A = a.get_access<access::mode::read>(h);
            auto B = b.get_access<access::mode::read>(h);
            auto C = c.get_access<access::mode::write>(h);

            //# Define size for ND-Range and work-group size
            range<2> global_size(N,N);
            range<2> work_group_size(M,M);

            //# Parallel Compute Matrix Multiplication
            h.parallel_for(nd_range<2>{global_size, work_group_size}, [=](nd_item<2> item){
                const int i = item.get_global_id(0);
                const int j = item.get_global_id(1);
                //# Use private mem to store intermediate result
                float temp=0.f;
                for (int k = 0; k < N; k++) {
                   temp += A[i*N+k] * B[k*N+j];
               }
               C[i*N+j]  = temp;
            });
        });
    } // warm-up done

    //# Initialize matrices with values
    v1 = 2.f;
    v2 = 3.f;
    for (int i=0; i<N; i++)
        for (int j=0; j<N; j++){
            matrix_a[i*N+j] = v1++;
            matrix_b[i*N+j] = v2++;
            matrix_c[i*N+j] = 0.f;
            matrix_d[i*N+j] = 0.f;
    }
    
    auto start = std::chrono::high_resolution_clock::now().time_since_epoch().count();
    
    std::cout << "Offload Device        : " << q.get_device().get_info<info::device::name>() << "\n";
    std::cout << "max_work_group_size   : " << q.get_device().get_info<info::device::max_work_group_size>() << "\n";
    std::cout << "Configuration         : MATRIX_SIZE= " << N << "x" << N << "\n";
    float* dev_a = sycl::malloc_device<float>(N*N, q);
    float* dev_b = sycl::malloc_device<float>(N*N, q);
    float* dev_c = sycl::malloc_device<float>(N*N, q); 
    q.memcpy(dev_a, matrix_a.data(), N*N*sizeof(float)).wait();
    q.memcpy(dev_b, matrix_b.data(), N*N*sizeof(float)).wait();
    
    //# scalar multipliers
    float alpha = 1.f, beta = 1.f;

#if MKL_LIB //    
        
    //# transpose status of matrices for oneMKL
    oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
    oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;
  
        
    //# Submit MKL library call to execute on device
    blas::gemm(q, transA, transB, N, N, N, alpha, dev_b, N, dev_a, N, beta, dev_c, N); 

    q.wait(); 
#endif  

#if CUBLAS

// Create cublas handle
  cublasHandle_t handle;
  CHECK_ERROR(cublasCreate(&handle));

#if ACPP
std::cout << "\n"<< "Running with ACPP interoperability. \n";
q.submit([&](handler &cgh) {
     cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle &ih) {
       // Set the correct  stream
       auto cuStream = ih.get_native_queue<sycl::backend::cuda>();
       cublasSetStream(handle, cuStream);
       // Call generalised matrix-matrix multiply
       // Call generalised matrix-matrix multiply
       CHECK_ERROR(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N,N,
                               N, &alpha, dev_a, N, dev_b, N, &beta,
                               dev_c, N));
     });
   }).wait();
#endif

#if DPCPP  
  std::cout << "\n"<< "Warning!!! " << " \n" << " The DPC++ & CUDA \n relies on the number of assumptions:\n in-order queues,\n no event- or buffer-based dependencies, \n no frequent switching between multiple devices \n stars aligning properly.\n\n"; 
  q.submit([&](handler &h) {
     h.host_task([=](sycl::interop_handle ih) {
       // Set the correct cuda context & stream
       cuCtxSetCurrent(ih.get_native_context<backend::ext_oneapi_cuda>());
       auto cuStream = ih.get_native_queue<backend::ext_oneapi_cuda>();
       cublasSetStream(handle, cuStream);

       // Call generalised matrix-matrix multiply
       CHECK_ERROR(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N,N,
                               N, &alpha, dev_a, N, dev_b, N, &beta,
                               dev_c, N));
       cuStreamSynchronize(cuStream);
     });
   }).wait();
#endif

#endif    

    auto duration = std::chrono::high_resolution_clock::now().time_since_epoch().count() - start;
    
    q.memcpy(matrix_c.data(), dev_c, N*N*sizeof(float)).wait();
    //# Print Output
    if (PRINT_OUTPUT_MATRIX){
        for (int i=0; i<N; i++){
            for (int j=0; j<N; j++){
                std::cout << matrix_c[i*N+j] << " ";
            }
            std::cout << "\n";
        }
    } else {
        std::cout << " [0][0] = " << matrix_c[0] << "\n";
    }
    
    //# Compute local and compare with offload computation
    if (VERIFY){
        int fail = 0;
        for(int i=0; i<N; i++){
            for (int j = 0; j < N; j++) {
                for(int k=0; k<N; k++){
                    matrix_d[i*N+j] += matrix_a[i*N+k] * matrix_b[k*N+j];
                }
                if(matrix_c[i*N+j] != matrix_d[i*N+j]) fail = 1;
            }
        }
        if(fail == 1){
            std::cout << "FAIL\n";
        } else {
            std::cout << "PASS\n";
        }
    }
    free(dev_a, q);
    free(dev_b, q);
    free(dev_c, q);
}

I get different results between the mkl and cublas and neapi::mkl::blas::gemm, but it is most probable in calling the cublas library differently.

I could not figure out to get the stream information from th equeue outside of the q.submit, but it is good enough at the moment.
Cristian

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interoperability between queues and cuda/hip streams #1339

{{title}}

Replies: 1 comment 7 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Interoperability between queues and cuda/hip streams #1339

csccva Feb 4, 2024

Replies: 1 comment · 7 replies

illuhad Feb 4, 2024 Maintainer

al42and Feb 5, 2024 Collaborator

csccva Feb 6, 2024 Author

al42and Feb 6, 2024 Collaborator

csccva Feb 7, 2024 Author

csccva Feb 7, 2024 Author

csccva
Feb 4, 2024

Replies: 1 comment 7 replies

illuhad
Feb 4, 2024
Maintainer

al42and Feb 5, 2024
Collaborator

csccva Feb 6, 2024
Author

al42and Feb 6, 2024
Collaborator

csccva Feb 7, 2024
Author

csccva Feb 7, 2024
Author