For guidance on porting cuda code to ispc #2302

33rupes · 2022-04-18T08:57:25Z

33rupes
Apr 18, 2022

I am trying to port some cuda c code to ispcc to run spmd program on cpu.

I am having a few doubts like (answers to any/some of these will help a lot) ;–

Does the Ispc code need error checking mechanism like CUDA_CHECK_ERRORS which is useful for checking proper memory allocation on device/host.
I believe global qualifier is for compiler to know to run function on device and __syncthreads() is for barrier synchronization among instances which are probably not needed in ispcc so is there something I am missing.?
I think threadidx in cuda c is equivalent to programIndex in ispcc as cuda uses threads in blocks whereas ispc uses program instances in gangs however I am having trouble porting this statement (const unsigned int id = 32 * blockIdx.x + threadIdx.x;). My thought on this is that 1 block contains 32 threads and each id is for unique thread but in ispcc I would use const unsigned int id = programIndex; Will this be correct??.
Should I use “uniform” keyword for declaring global variables as these will be shared across the gang . Also shared keyword is used in cuda c to to make variables reside in shared memory easing communication between threads in the same block , is some similar mechanism possible in ispcc or should I ignore it ??
atomicAdd like atomic operations used in cuda C is good for operating safely on memory without being affected by other threads. Its equivalents in ispcc are atomic_add_global and atomic_add_local. Which one should be used in porting code ??
Regarding memory allocation since there is no host/device i believe there is no need for using two variables like var1 and device_var1 for same value and then using memcopy, instead I think of using single variable and also for porting statements like "cudaMalloc (&var, 4 * sizeof(float))" i'll use "float* var = new float[4] instead'. Am I thinking in right direction?.
Statements like cudaSetDevice(0) and cudaHostAllocPortable used for device selection and allocating page-locked memory as portable are mostly useful when dealing with multiple gpu devices. Is any similar mechanism required in Ispcc or these should be ignored .?
For porting the following code(which is probably used to launch kernels using multidimensional grids of blocks and threads of device in cuda) to ispcc :--
const dim3 threads(32, 1);
const dim3 grid(1, 1);
initKernel<<<grid, threads>>>(first_var);
updatKernel<<<grid, threads>>>(device_var1,device_var2);

I plan to use these lines directly ;--
initKernell((first_var); // here initkernel is just the name of function in ispc file
updatKernel((device_var1,device_var2);

Should I be using something else corresponding to thread and block information or is this ispc function call fine??..
Instead of std::fill_n(used for assigning values in std c++) and which I believe won't be present in ispcc (please correct me if I am wrong), I am planning to use a for loop to fill the array's elements and instead of std::ofstream ( used for writing data to files ) which is used here;-
"std::ofstream file1("file1.csv")"; and then
"file1 << i << "," << array_of_unsigned_ints[ s ] << std::endl;" // s is the index
I am planning to use the C-style file handling :-- FILE *file1 = fopen("file1.csv","w") ; fputs(text,file1); fclose(file1); // where text is defined as char text[] = strcat(strcat(strcat( i , "," ) , array_of_unsigned_ints[ s ]), "\n"). Will these be fine or I may do something better ?. Also I believe for putting text which is a character array I might need to find the maximum no. of of characters possible in an unsigned_integer ,, so what is it ??

dbabokin · 2022-04-19T05:31:36Z

dbabokin
Apr 19, 2022
Maintainer

The key focus of ISPC is providing an efficient programming model to program SIMD hardware, i.e. to exploit vector parallelism (i.e. same level as CUDA warp). When it comes to managing threads, ISPC strategy is to be compatible with any external threading libraries that user may want to use.

If you'd like to target CPU (ISPC supports x86 and ARM CPUs and Intel GPUs), then it's important to emphasis the difference in job scheduling versus CUDA or ISPC for GPU. On GPU you have device/host model - host code spawns multiple device threads over some iteration space (in CUDA through <<<grid, thread>>> syntax). On CPU there are several different alternatives how to deal with / manage threading:

Manage threading on C++ side with any of available threading libraries - pthreads, TBB, OpenMP, etc. And once the thread is spawned, call ISPC routine. Calling ISPC routine is as cheap as calling any other C++ function.
Call an ISPC function from C++ in a single thread and use launch/sync statements to spawn threads over some iteration space. More details on syntax are here. Note, that under the hood this mechanism will use the same pthreads, TBB, or OpenMP library. And it's user's responsibility to provide threading mechanism that fits user requirements. More details are here.
Use ISPCRT (ISPC RunTime) to schedule jobs in GPU-like fashion with the ability to run jobs on Intel GPUs. More information is available here.

Depending on the approach you take, the answers to the questions above will vary.

Does the Ispc code need error checking mechanism like CUDA_CHECK_ERRORS which is useful for checking proper memory allocation on device/host.

If ISPC RT is not used, then error checking for memory allocation is done through standard CPU mechanisms available in C++. If ISPCRT is used, then it's similar to CUDA. Note that ISPCRT is just a wrapper over Level Zero library for GPU, which also enables scheduling to CPU.

I believe global qualifier is for compiler to know to run function on device and __syncthreads() is for barrier synchronization among instances which are probably not needed in ispcc so is there something I am missing.?

CUDA __global__ declares a kernel. In ISPC you can use export qualifier to make function callable from C++. task qualifier to make it callable using launch keyword or through ISPCRT as a kernel.

Synchronization also varies - in approach 1 it's completely on C++ side. In 2 it's a sync keyword. In 3 it's more sophisticated mechanism available in Level Zero through managing queues.

I think threadidx in cuda c is equivalent to programIndex in ispcc as cuda uses threads in blocks whereas ispc uses program instances in gangs however I am having trouble porting this statement (const unsigned int id = 32 * blockIdx.x + threadIdx.x;). My thought on this is that 1 block contains 32 threads and each id is for unique thread but in ispcc I would use const unsigned int id = programIndex; Will this be correct??.

programIndex will be a vector of values 0 to N-1, when N is target width. I.e. it's will be {0, 1, 2, 3, 4, 5, 6, 7} on standard AVX2 target. So it's always a "local" index, while threadIdx refers to global iteration space.

Should I use “uniform” keyword for declaring global variables as these will be shared across the gang . Also shared keyword is used in cuda c to to make variables reside in shared memory easing communication between threads in the same block , is some similar mechanism possible in ispcc or should I ignore it ??

When targeting CPU, any global is a regular global variable as you would have it in C++. Typically you would want to have uniform global variables, but you can have varying as well - they will be vector variable. This depends on your needs.

atomicAdd like atomic operations used in cuda C is good for operating safely on memory without being affected by other threads. Its equivalents in ispcc are atomic_add_global and atomic_add_local. Which one should be used in porting code ??

Check Atomic Operations and Memory Fences

Regarding memory allocation since there is no host/device i believe there is no need for using two variables like var1 and device_var1 for same value and then using memcopy, instead I think of using single variable and also for porting statements like "cudaMalloc (&var, 4 * sizeof(float))" i'll use "float* var = new float[4] instead'. Am I thinking in right direction?.

Correct.

Statements like cudaSetDevice(0) and cudaHostAllocPortable used for device selection and allocating page-locked memory as portable are mostly useful when dealing with multiple gpu devices. Is any similar mechanism required in Ispcc or these should be ignored .?

You don't need it, unless you'd like to use ISPCRT and target both CPU and GPU at the same time.

For porting the following code(which is probably used to launch kernels using multidimensional grids of blocks and threads of device in cuda) to ispcc :--
const dim3 threads(32, 1);
const dim3 grid(1, 1);
initKernel<<<grid, threads>>>(first_var);
updatKernel<<<grid, threads>>>(device_var1,device_var2);
I plan to use these lines directly ;--
initKernell((first_var); // here initkernel is just the name of function in ispc file
updatKernel((device_var1,device_var2);
Should I be using something else corresponding to thread and block information or is this ispc function call fine??..

This depends on how you do threading. initKernell((first_var) will just does regular function call on CPU. If you plan to use launch/sync keywords to spawn task parallelism, you will need to do that inside ISPC function. Otherwise you will need to deal with it on C++ side and then call ISPC function from every thread.

Instead of std::fill_n(used for assigning values in std c++) and which I believe won't be present in ispcc (please correct me if I am wrong), I am planning to use a for loop to fill the array's elements and instead of std::ofstream ( used for writing data to files ) which is used here;-
"std::ofstream file1("file1.csv")"; and then
"file1 << i << "," << array_of_unsigned_ints[ s ] << std::endl;" // s is the index
I am planning to use the C-style file handling :-- FILE *file1 = fopen("file1.csv","w") ; fputs(text,file1); fclose(file1); // where text is defined as char text[] = strcat(strcat(strcat( i , "," ) , array_of_unsigned_ints[ s ]), "\n"). Will these be fine or I may do something better ?. Also I believe for putting text which is a character array I might need to find the maximum no. of of characters possible in an unsigned_integer ,, so what is it ??

To fill the memory you can use memset on ISPC side (if it works for your use case) or for use foreach loop.

As for working with files, ISPC doesn't have builtin way working with files, you will need to write to memory buffers and write them to files on C++ side.

As for data type limits, ISPC data types have well-defined width. I.e. int8/int16/int32/int64 are 8/16/32/64 bit wide, same for unsigned types. int and unsigned int are 32 bit wide with corresponding minimum and maximum values.

4 replies

33rupes Apr 23, 2022
Author

Thanks @dbabokin for replying & I am willing to use the second method i.e. using launch and sync keywords.Also I have made an example cuda kernel function consisting of parts I am having trouble porting. Will help a lot if someone could port this example code to Ispc learning from which I might port the original one.
namespace{
global void updateKernel(unsigned int *d_glbCnt, unsigned int *d_glb, float *d_V,
const float *d_a, const float *d_b)
{
const unsigned int id = 32 * blockIdx.x + threadIdx.x;
shared unsigned int sh[32];
shared unsigned int shPos;
shared unsigned int shCount;

  if(id < 32) {
    if(id < 4) {
        float lV = d_V[id];
        const float la = d_a[id];  const float lb = d_b[id];
        
        float I = 0; lV+=0.6f*(0.08f*lV*lV+3.0f*lV+110.0f+I+(1.0000000000e+01f))*1.0f;
       
        if (lV >= 29.99f) {
            const unsigned int Idx = atomicAdd((unsigned int *) &shCount, 1);
            sh[Idx] = id;
        }
        d_V[id] = lV;
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        if (shCount > 0) {
            shPos = atomicAdd((unsigned int*)&d_glbCnt[0], shCount);
        }  }
    __syncthreads();
    if (threadIdx.x < shCount) {
        const unsigned int n = sh[threadIdx.x];
        d_glb[shPos + threadIdx.x] = n;
    }  }
}  }

33rupes Apr 23, 2022
Author

I believe due to some reason "__global _" got converted into global and "_shared _" converted into shared in my reply message.Please use their meanings as defined in cuda

dbabokin Apr 26, 2022
Maintainer

See here for guidelines how to properly quite your code. You can edit your post to fix it.

33rupes Apr 27, 2022
Author

Will you please help me with the below mentioned questions ??

33rupes · 2022-04-26T11:03:01Z

33rupes
Apr 26, 2022
Author

@dbabokin I am especially confused with implementing/porting the following cuda lines/logic :--

const unsigned int id = 32 * blockIdx.x + threadIdx.x;
const float la = d_a[id];
const unsigned int Idx = atomicAdd((unsigned int *) &shCount, 1);
__syncthreads();
if (threadIdx.x < shCount)

1 reply

33rupes Apr 26, 2022
Author

For 1 should it be
const unsigned int id = 32.taskindex0 + threadindex0; ??

Should 2 remain the same??

How to port 3 and 4 ??

for 5 I believe if using task parallelism it should be
cif(threadindex0< shCount)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For guidance on porting cuda code to ispc #2302

{{title}}

Replies: 2 comments 5 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

For guidance on porting cuda code to ispc #2302

33rupes Apr 18, 2022

Replies: 2 comments · 5 replies

dbabokin Apr 19, 2022 Maintainer

33rupes Apr 23, 2022 Author

33rupes Apr 23, 2022 Author

dbabokin Apr 26, 2022 Maintainer

33rupes Apr 27, 2022 Author

33rupes Apr 26, 2022 Author

33rupes Apr 26, 2022 Author

33rupes
Apr 18, 2022

Replies: 2 comments 5 replies

dbabokin
Apr 19, 2022
Maintainer

33rupes Apr 23, 2022
Author

33rupes Apr 23, 2022
Author

dbabokin Apr 26, 2022
Maintainer

33rupes Apr 27, 2022
Author

33rupes
Apr 26, 2022
Author

33rupes Apr 26, 2022
Author