[FEA]: Investigate if NVTX ranges in CUB algorithms support graph capture #1674

Open
1 task done
gevtushenko opened this issue Apr 29, 2024 · 3 comments
Labels
feature request New feature or request.

Comments

@gevtushenko
Collaborator

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

As of #719, we have NVTX ranges in CUB device algorithms. Most CUB device algorithms support graph capture. For now, it's not clear whether NVTX works correctly in the presence of graph capture.

Describe the solution you'd like

We need to understand whether NVTX ranges work correctly when CUB is in graph capture mode. Since all of our *_.lid_2 tests run CUB algorithms in graph capture mode, one of these tests, say cub.cpp17.test.device_select_if.lid_2, can be used as an example. If NVTX ranges do not contain the kernels they surround, I'd prefer that no NVTX ranges be reported.

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Apr 29, 2024
@bernhardmgruber
Contributor

Testing with catch2_test_device_histogram from #1695 shows:

[Image: screenshot of the NVTX ranges reported for this test run]

You can see the cudaMalloc and cub::DeviceFor::Bulk calls from the thrust::device_vector<int8> setting up temporary storage. Then graph capture begins, the kernel is launched and reported by NVTX, and graph capture ends. When the graph is instantiated, launched, and synchronized, no NVTX ranges are reported. So NVTX ranges are shown when a kernel is captured, not when it is executed.

@gevtushenko Would you like to have NVTX ranges disabled when stream capturing is active? That would require us to check the stream state on each invocation of a CUB device API.
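To make the cost of that option concrete, here is a minimal sketch of what "check the stream state on each invocation" could look like. It is not CUB's implementation. It is written in plain C++ so it compiles without the CUDA toolkit: `FakeStream`, `CaptureStatus`, `query_capture_status`, `ConditionalRange`, `range_log`, and `device_algorithm` are all hypothetical names invented for illustration. In real code, the query would presumably be a call to `cudaStreamIsCapturing(stream, &status)` and the logged push/pop would be an actual NVTX range.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for cudaStreamCaptureStatus (None / Active / Invalidated).
enum class CaptureStatus { None, Active, Invalidated };

// Hypothetical stream handle; a real implementation would use cudaStream_t.
struct FakeStream {
  CaptureStatus status = CaptureStatus::None;
};

// Stand-in for the runtime query. Assumption: real code would call
// cudaStreamIsCapturing(stream, &status) here and handle its error code.
inline CaptureStatus query_capture_status(const FakeStream& stream) {
  return stream.status;
}

// Records range events so the behavior can be inspected without a profiler.
inline std::vector<std::string>& range_log() {
  static std::vector<std::string> log;
  return log;
}

// RAII guard that opens a range only when the stream is not being captured.
class ConditionalRange {
public:
  ConditionalRange(const FakeStream& stream, const char* name)
      : active_(query_capture_status(stream) == CaptureStatus::None) {
    if (active_) {
      range_log().push_back(std::string("push:") + name);
    }
  }

  ~ConditionalRange() {
    if (active_) {
      range_log().push_back("pop");
    }
  }

  ConditionalRange(const ConditionalRange&) = delete;
  ConditionalRange& operator=(const ConditionalRange&) = delete;

private:
  bool active_;
};

// Hypothetical device-algorithm entry point: one status query per call.
inline void device_algorithm(const FakeStream& stream) {
  ConditionalRange range(stream, "cub::DeviceSelect::If");
  // ... kernel launches would go here ...
}
```

Calling `device_algorithm` with a stream whose status is `CaptureStatus::Active` leaves `range_log()` untouched, while a stream with `CaptureStatus::None` records one push and one pop. The design cost is the extra status query on every call, which is the overhead in question.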

@gevtushenko
Collaborator Author

@bernhardmgruber thank you for taking a look! The results seem to match our intuition. Regarding the action item, investigating how much overhead checking whether a stream is in capture mode would add is a non-trivial amount of work. I'd just update the NVTX section of the developer overview to clarify this behavior.

@bernhardmgruber
Contributor

I will add it after #1753 is merged.
