
How does CuPy work? #8228

Open
tornikeo opened this issue Mar 7, 2024 · 1 comment
@tornikeo (Contributor) commented Mar 7, 2024

Description

There is no documentation on how CuPy works, end-to-end. Explanations for...

  • How does the translation of a Python function to a CUDA __global__ function occur?
  • What happens to variables referenced from the outer context of cuda.jit-ed functions? How are they made available to each thread?
  • What is the actual number and size of the parameters sent to each thread?
  • At which stage does nvcc get called, and with what arguments?
  • How do we transfer memory in and out of the kernel?

would greatly help incoming developers spot kernel issues before they arise. For example: why do I get a block size of 1024, but not 512?

Idea or request for content

No response

tornikeo added the cat:document label Mar 7, 2024
@kmaehashi (Member) commented:

Thanks for the feedback @tornikeo; indeed, it would be better to have docs covering CuPy internals. Here are quick answers:

  • How does the translation of a Python function to a CUDA __global__ function occur?

This depends on the function. Some are backed by ElementwiseKernel, which is translated like this. Some are backed by RawModule (i.e., raw CUDA code) like this. cupyx.jit translates a user's Python function to CUDA source by traversing its AST.
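For illustration, a minimal sketch using the documented cupy.ElementwiseKernel API: the C snippet supplied by the user becomes the body of the generated __global__ function, and compilation happens transparently.

```python
import cupy as cp

# User-defined elementwise kernel: CuPy generates a __global__ function around
# the C snippet, compiles it with NVRTC on first use, and caches the result.
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',   # input parameters
    'float32 z',              # output parameter
    'z = (x - y) * (x - y)',  # per-element operation
    'squared_diff')           # kernel name

x = cp.arange(10, dtype=cp.float32)
y = cp.full(10, 3, dtype=cp.float32)
z = squared_diff(x, y)        # launches the compiled kernel
```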

  • What happens to variables referenced from the outer context of cuda.jit-ed functions? How are they made available to each thread?

cupyx.jit requires all referenced variables to be given as kernel inputs.
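A minimal cupyx.jit sketch to illustrate this (the kernel name scale_copy is just for this example): everything the kernel needs, including scalars such as scale and size, is passed as an argument rather than captured from the enclosing scope.

```python
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def scale_copy(x, y, scale, size):
    # scale and size are explicit kernel arguments, not closed-over variables
    i = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    if i < size:
        y[i] = x[i] * scale

size = 1 << 20
x = cp.random.rand(size, dtype=cp.float32)
y = cp.empty_like(x)

# Numba-style launch: kernel[grid, block](arguments...)
scale_copy[(size + 127) // 128, 128](x, y, cp.float32(2.0), cp.uint32(size))
```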

  • What is the actual number and size of the parameters sent to each thread?

The block size is 128 for ElementwiseKernels and 512 for ReductionKernels.
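For context, a ReductionKernel sketch following the documented cupy.ReductionKernel API; note that the caller never specifies a launch configuration, so the block sizes mentioned above are internal defaults.

```python
import cupy as cp

l2norm = cp.ReductionKernel(
    'T x',          # input parameters
    'T y',          # output parameter
    'x * x',        # map: applied to each input element
    'a + b',        # reduce: combines mapped values pairwise
    'y = sqrt(a)',  # post-reduction map
    '0',            # identity value of the reduction
    'l2norm')       # kernel name

x = cp.arange(10, dtype=cp.float32)
l2norm(x)  # block/grid sizes are chosen by CuPy, not the caller
```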

  • At which stage does nvcc get called, and with what arguments?

For most functions, NVRTC (think of it as a library version of nvcc) is called for compilation, which happens on the first invocation.
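As a sketch of what this looks like with a RawKernel (assuming default settings): the CUDA source is only handed to NVRTC when the kernel is first launched, and the compiled binary is cached on disk (by default under ~/.cupy/kernel_cache, configurable via CUPY_CACHE_DIR), so subsequent runs skip compilation.

```python
import cupy as cp

add_one = cp.RawKernel(r'''
extern "C" __global__
void add_one(const float* x, float* y, unsigned int n) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        y[i] = x[i] + 1.0f;
    }
}
''', 'add_one')          # no compilation has happened yet

x = cp.arange(1024, dtype=cp.float32)
y = cp.empty_like(x)
add_one((8,), (128,), (x, y, cp.uint32(x.size)))  # first launch triggers NVRTC
```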
