Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/cub

CUB 2.1.0

08 Mar 22:03
Compare
Choose a tag to compare

Breaking Changes

  • #553: Deprecate the CUB_USE_COOPERATIVE_GROUPS macro, as all supported CTK distributions provide CG. This macro will be removed in a future version of CUB.

New Features

  • #359: Add new DeviceBatchMemcpy algorithm.
  • #565: Add DeviceMergeSort::StableSortKeysCopy API. Thanks to David Wendt (@davidwendt) for this contribution.
  • #585: Add SM90 tuning policy for DeviceRadixSort. Thanks to Andy Adinets (@canonizer) for this contribution.
  • #586: Introduce a new mechanism to opt-out of compiling CDP support in CUB algorithms by defining CUB_DISABLE_CDP.
  • #589: Support 64-bit indexing in DeviceReduce.
  • #607: Support 128-bit integers in radix sort.

Bug Fixes

  • #547: Resolve several long-running issues resulting from using multiple versions of CUB within the same process. Adds an inline namespace that encodes CUB version and targeted PTX architectures.
  • #562: Fix bug in BlockShuffle resulting from an invalid thread offset. Thanks to @sjfeng1999 for this contribution.
  • #564: Fix bug in BlockRadixRank when used with blocks that are not a multiple of 32 threads.
  • #579: Ensure that all threads in the logical warp participate in the index-shuffle for BlockRadixRank. Thanks to Andy Adinets (@canonizer) for this contribution.
  • #582: Fix reordering in CUB member initializer lists.
  • #589: Fix DeviceSegmentedSort when used with bool keys.
  • #590: Fix CUB’s CMake install rules. Thanks to Robert Maynard (@robertmaynard) for this contribution.
  • #592: Fix overflow in DeviceReduce.
  • #598: Fix DeviceRunLengthEncode when the first item is a NaN.
  • #611: Fix WarpScanExclusive for vector types.

Other Enhancements

CUB 2.0.1

08 Mar 21:50
Compare
Choose a tag to compare

Other Enhancements

  • Skip device-side synchronization on SM90+. These syncs are a debugging-only feature and not required for correctness, and a warning will be emitted if this happens.

CUB 1.17.2

13 Sep 17:05
Compare
Choose a tag to compare

Summary

CUB 1.17.2 is a minor bugfix release.

  • #547: Introduce an annotated inline namespace to prevent issues with
    collisions and mismatched kernel configurations across libraries. The new
    namespace encodes the CUB version and target SM architectures.

CUB 2.0.0

15 Aug 16:53
Compare
Choose a tag to compare

Summary

The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime debug_synchronous debugging flags.

This release also includes several new features. DeviceHistogram now supports __half and better handles various edge cases. WarpReduce now performs correctly when restricted to a single-thread “warp”, and will use the __reduce_add_sync accelerated intrinsic (introduced with Ampere) when appropriate. DeviceRadixSort learned to handle the case where begin_bit == end_bit.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.

Breaking Changes

  • #448 Add libcu++ dependency (v1.8.0+).
  • #448: The following macros are no longer defined by default. They can be re-enabled by defining CUB_PROVIDE_LEGACY_ARCH_MACROS. These will be completely removed in a future release.
    • CUB_IS_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_IS_DEVICE_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_HOST_CODE: Replace with NV_IF_TARGET.
    • CUB_INCLUDE_DEVICE_CODE: Replace with NV_IF_TARGET.
  • #486: CUB’s CUDA Runtime support macros have been updated to support NV_IF_TARGET. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
    • CUB_RUNTIME_FUNCTION: Execution space annotations for functions that invoke CUDA Runtime APIs.
      • Old behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled:
          • NVCC host pass: Defined to __host__ __device__
          • NVCC device pass: Defined to __host__
      • New behavior:
        • RDC enabled: Defined to __host__ __device__
        • RDC not enabled: Defined to __host__
    • CUB_RUNTIME_ENABLED: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled:
        • NVCC host pass: Macro is defined.
        • NVCC device pass: Macro is not defined.
    • CUB_RDC_ENABLED: New macro, may be combined with NV_IF_TARGET to replace most usages of CUB_RUNTIME_ENABLED. Behavior:
      • RDC enabled: Macro is defined.
      • RDC not enabled: Macro is not defined.
  • #509: A compile-time error is now emitted when a __device__-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0).
    • Due to limitations in the CUDA programming model, the result of this query is unreliable, and will silently return an incorrect result. This leads to difficult to debug errors.
    • When using libcu++ 1.9.0, an error will be emitted with information about work-arounds:
      • Use a named function object with a __device__-only implementation of operator().
      • Use a __host__ __device__ lambda.
      • Use cuda::proclaim_return_type (Added in libcu++ 1.9.0)
  • #509: Use the result type of the binary reduction operator for accumulating intermediate results in the DeviceReduce algorithm, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch*Reduce layer have changed:
      • DispatchReduce:
        • Now accepts accumulator type as last parameter.
        • Now accepts initializer type instead of output iterator value type.
        • Constructor now accepts init as initial type instead of output iterator value type.
      • DispatchSegmentedReduce:
        • Accepts accumulator type as last parameter.
        • Accepts initializer type instead of output iterator value type.
    • Thread operators now accept parameters using different types: Equality, Inequality, InequalityWrapper, Sum, Difference, Division, Max, ArgMax, Min, ArgMin.
    • ThreadReduce now accepts accumulator type and uses a different type for prefix.
  • #511: Use the result type of the binary operator for accumulating intermediate results in the DeviceScan, DeviceScanByKey, and DeviceReduceByKey algorithms, following guidance from http://wg21.link/P2322R6.
    • This change requires host-side introspection of the binary operator’s signature, and device-only extended lambda functions can no longer be used.
    • In addition to the behavioral changes, the interfaces for the Dispatch layer have changed:
      • DispatchScan now accepts accumulator type as a template parameter.
      • DispatchScanByKey now accepts accumulator type as a template parameter.
      • DispatchReduceByKey now accepts accumulator type as the last template parameter.
  • #527: Deprecate the debug_synchronous flags on device algorithms.
    • This flag no longer has any effect. Define CUB_DEBUG_SYNC during compilation to enable these checks.
    • Moving this option from run-time to compile-time avoids the compilation overhead of unused debugging paths in production code.

New Features

  • #514: Support __half in DeviceHistogram.
  • #516: Add support for single-threaded invocations of WarpReduce.
  • #516: Use __reduce_add_sync hardware acceleration for WarpReduce on supported architectures.

Bug Fixes

  • #481: Fix the device-wide radix sort implementations to simply copy the input to the output when begin_bit == end_bit.
  • #487: Fix DeviceHistogram::Even for a variety of edge cases:
    • Bin ids are now correctly computed when mixing different types for SampleT and LevelT.
    • Bin ids are now correctly computed when LevelT is an integral type and the number of levels does not evenly divide the level range.
  • #508: Ensure that temp_storage_bytes is properly set in the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to the AdjacentDifferenceCopy device algorithms.
  • #533: Fix debugging utilities when RDC is disabled.

Other Enhancements

  • #448: Removed special case code for unsupported CUDA architectures.
  • #448: Replace several usages of __CUDA_ARCH__ with <nv/target> to handle host/device code divergence.
  • #448: Mark unused PTX arch parameters as legacy.
  • #476: Enabled additional debug logging for the onesweep radix sort implementation. Thanks to @canonizer for this contribution.
  • #480: Add CUB_DISABLE_BF16_SUPPORT to avoid including the cuda_bf16.h header or using the __nv_bfloat16 type.
  • #486: Add debug log messages for post-kernel debug synchronizations.
  • #490: Clarify documentation for in-place usage of DeviceScan algorithms.
  • #494: Clarify documentation for in-place usage of DeviceHistogram algorithms.
  • #495: Clarify documentation for in-place usage of DevicePartition algorithms.
  • #499: Clarify documentation for in-place usage of Device*Sort algorithms.
  • #500: Clarify documentation for in-place usage of DeviceReduce algorithms.
  • #501: Clarify documentation for in-place usage of DeviceRunLengthEncode algorithms.
  • #503: Clarify documentation for in-place usage of DeviceSelect algorithms.
  • #518: Fix typo in WarpMergeSort documentation.
  • #519: Clarify segmented sort documentation regarding the handling of elements that are not included in any segment.

CUB 1.17.1

15 Aug 16:19
Compare
Choose a tag to compare

Summary

CUB 1.17.1 is a minor bugfix release.

  • #508: Ensure that temp_storage_bytes is properly set in
    the AdjacentDifferenceCopy device algorithms.
  • #508: Remove excessive calls to the binary operator given to
    the AdjacentDifferenceCopy device algorithms.
  • Fix device-side debug synchronous behavior in DeviceSegmentedSort.

CUB 1.17.0

09 May 18:07
Compare
Choose a tag to compare

CUB 1.17.0

Summary

CUB 1.17.0 is the final minor release of the 1.X series. It provides a variety of bug fixes and miscellaneous enhancements, detailed below.

Known Issues

“Run-to-run” Determinism Broken

Several CUB device algorithms are documented to provide deterministic results (per device) for non-associative reduction operators (e.g. floating-point addition). Unfortunately, the implementations of these algorithms contain performance optimizations that violate this guarantee. The DeviceReduce::ReduceByKey and DeviceScan algorithms are known to be affected. We’re currently evaluating the scope and impact of correcting this in a future CUB release. See NVIDIA/cub#471 for details.

Bug Fixes

  • #444: Fixed DeviceSelect to work with discard iterators and mixed input/output types.
  • #452: Fixed install issue when CMAKE_INSTALL_LIBDIR contained nested directories. Thanks to @robertmaynard for this contribution.
  • #462: Fixed bug that produced incorrect results from DeviceSegmentedSort on sm_61 and sm_70.
  • #464: Fixed DeviceSelect::Flagged so that flags are normalized to 0 or 1.
  • #468: Fixed overflow issues in DeviceRadixSort given num_items close to 2^32. Thanks to @canonizer for this contribution.
  • #498: Fixed compiler regression in BlockAdjacentDifference. Thanks to @MKKnorr for this contribution.

Other Enhancements

  • #445: Remove device-sync in DeviceSegmentedSort when launched via CDP.
  • #449: Fixed invalid link in documentation. Thanks to @kshitij12345 for this contribution.
  • #450: BlockDiscontinuity: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
  • #451: Replaced the deprecated TexRefInputIterator implementation with an alias to TexObjInputIterator. This fully removes all usages of the deprecated CUDA texture reference APIs from CUB.
  • #456: BlockAdjacentDifference: Replaced recursive-template loop unrolling with #pragma unroll. Thanks to @kshitij12345 for this contribution.
  • #466: cub::DeviceAdjacentDifference API has been updated to use the new OffsetT deduction approach described in #212.
  • #470: Fix several doxygen-related warnings. Thanks to @karthikeyann for this contribution.

CUB 1.16.0

08 Feb 19:34
Compare
Choose a tag to compare

Summary

CUB 1.16.0 is a major release providing several improvements to the device scope algorithms. DeviceRadixSort now supports large (64-bit indexed) input data. A new UniqueByKey algorithm has been added to DeviceSelect. DeviceAdjacentDifference provides new SubtractLeft and SubtractRight functionality.

This release also deprecates several obsolete APIs, including type traits and BlockAdjacentDifference algorithms. Many bugfixes and documentation updates are also included.

64-bit Offsets in DeviceRadixSort Public APIs

Users frequently want to process large datasets using CUB’s device-scope algorithms, but the current public APIs limit input data sizes to those that can be indexed by a 32-bit integer. Beginning with this release, CUB is updating these APIs to support 64-bit offsets, as discussed in #212.

The device-scope algorithms will be updated with 64-bit offset support incrementally, starting with the cub::DeviceRadixSort family of algorithms. Thanks to @canonizer for contributing this functionality.

New DeviceSelect::UniqueByKey Algorithm

cub::DeviceSelect now provides a UniqueByKey algorithm, which has been ported from Thrust. Thanks to @zasdfgbnm for this contribution.

New DeviceAdjacentDifference Algorithms

The new cub::DeviceAdjacentDifference interface, also ported from Thrust, provides SubtractLeft and SubtractRight algorithms as CUB kernels.

Deprecation Notices

Synchronous CUDA Dynamic Parallelism Support

A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. Such invocations will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Deprecated Traits

CUB provided a variety of metaprogramming type traits in order to support C++03. Since C++14 is now required, these traits have been deprecated in favor of their STL equivalents, as shown below:

Deprecated CUB Trait Replacement STL Trait
cub::If std::conditional
cub::Equals std::is_same
cub::IsPointer std::is_pointer
cub::IsVolatile std::is_volatile
cub::RemoveQualifiers std::remove_cv
cub::EnableIf std::enable_if

CUB now uses the STL traits internally, resulting in a ~6% improvement in compile time.

Misnamed cub::BlockAdjacentDifference APIs

The algorithms in cub::BlockAdjacentDifference have been deprecated, as their names did not clearly describe their intent. The FlagHeads method is now SubtractLeft, and FlagTails has been replaced by SubtractRight.

Breaking Changes

  • #331: Deprecate the misnamed BlockAdjacentDifference::FlagHeads and FlagTails methods. Use the new SubtractLeft and SubtractRight methods instead.
  • #364: Deprecate some obsolete type traits. These should be replaced by the equivalent traits in <type_traits> as described above.

New Features

  • #331: Port the thrust::adjacent_difference kernel and expose it as cub::DeviceAdjacentDifference.
  • #405: Port the thrust::unique_by_key kernel and expose it as cub::DeviceSelect::UniqueByKey. Thanks to @zasdfgbmn for this contribution.

Enhancements

  • #340: Allow 64-bit offsets in DeviceRadixSort public APIs. Thanks to @canonizer for this contribution.
  • #400: Implement a significant reduction in DeviceMergeSort compilation time.
  • #415: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks for @robertmaynard for this contribution.

Bug Fixes

  • #381: Fix shared memory alignment in dyn_smem example.
  • #393: Fix some collisions with the min/max macros defined in windows.h.
  • #404: Fix bad cast in util_device.
  • #410: Fix CDP issues in DeviceSegmentedSort.
  • #411: Ensure that the nv_exec_check_disable pragma is only used on nvcc.
  • #418: Fix -Wsizeof-array-div warning on gcc 11. Thanks to @robertmaynard for this contribution.
  • #420: Fix new uninitialized variable warning in DiscardIterator on gcc 10.
  • #423: Fix some collisions with the small macro defined in windows.h.
  • #426: Fix some issues with version handling in CUB’s CMake packages.
  • #430: Remove documentation for DeviceSpmv parameters that are absent from public APIs.
  • #432: Remove incorrect documentation for DeviceScan algorithms that guaranteed run-to-run deterministic results for floating-point addition.

CUB 1.15.0

25 Oct 19:07
Compare
Choose a tag to compare

Summary

CUB 1.15.0 includes a new cub::DeviceSegmentedSort algorithm, which demonstrates up to 5000x speedup compared to cub::DeviceSegmentedRadixSort when sorting a large number of small segments. A new cub::FutureValue<T> helper allows the cub::DeviceScan algorithms to lazily load the initial_value from a pointer. cub::DeviceScan also added ScanByKey functionality.

The new DeviceSegmentedSort algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.

cub::FutureValue<T> provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization:

int *d_intermediate_result = ...;
intermediate_kernel<<<blocks, threads>>>(d_intermediate_result,  // output
                                         arg1,                   // input
                                         arg2);                  // input

// Wrap the intermediate pointer in a FutureValue -- no need to explicitly
// sync when both kernels are stream-ordered. The pointer is read after
// the ExclusiveScan kernel starts executing.
cub::FutureValue<int> init_value(d_intermediate_result);

cub::DeviceScan::ExclusiveScan(d_temp_storage,
                               temp_storage_bytes,
                               d_in,
                               d_out,
                               cub::Sum(),
                               init_value,
                               num_items);

Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into ExclusiveScan. This new feature enables better performance in workflows that use cub::DeviceScan.

Deprecation Notices

A future version of CUB will change the debug_synchronous behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with debug_synchronous = true. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Breaking Changes

  • #305: The template parameters of cub::DispatchScan have changed to support the new cub::FutureValue helper. More details under "New Features".
  • #377: Remove broken operator->() from cub::TransformInputIterator, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

New Features

  • #305: Add overloads to cub::DeviceScan algorithms that allow the output of a previous kernel to be used as initial_value without explicit synchronization. See the new cub::FutureValue helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
  • #354: Add cub::BlockRunLengthDecode algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
  • #357: Add cub::DeviceSegmentedSort, an optimized version of cub::DeviceSegmentedSort with improved load balancing and small array performance.
  • #376: Add "by key" overloads to cub::DeviceScan. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

Bug Fixes

  • #349: Doxygen and unused variable fixes.
  • #363: Maintenance updates for the new cub::DeviceMergeSort algorithms.
  • #382: Fix several -Wconversion warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
  • #388: Fix debug assertion on MSVC when using cub::CachingDeviceAllocator.
  • #395: Support building with __CUDA_NO_HALF_CONVERSIONS__. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

CUB 1.14.0 (NVIDIA HPC SDK 21.9)

24 Aug 18:41
792ac3d
Compare
Choose a tag to compare

CUB 1.14.0 is a major release accompanying the NVIDIA HPC SDK 21.9.

This release provides the often-requested merge sort algorithm, ported from the thrust::sort implementation. Merge sort provides more flexibility than the existing radix sort by supporting arbitrary data types and comparators, though radix sorting is still faster for supported inputs. This functionality is provided through the new cub::DeviceMergeSort and cub::BlockMergeSort algorithms.

The namespace wrapping mechanism has been overhauled for 1.14. The existing macros (CUB_NS_PREFIX/CUB_NS_POSTFIX) can now be replaced by a single macro, CUB_WRAPPED_NAMESPACE, which is set to the name of the desired wrapped namespace. Defining a similar THRUST_CUB_WRAPPED_NAMESPACE macro will embed both thrust:: and cub:: symbols in the same external namespace. The prefix/postfix macros are still supported, but now require a new CUB_NS_QUALIFIER macro to be defined, which provides the fully qualified CUB namespace (e.g. ::foo::cub). See cub/util_namespace.cuh for details.

Breaking Changes

  • #350: When the CUB_NS_[PRE|POST]FIX macros are set, CUB_NS_QUALIFIER must also be defined to the fully qualified CUB namespace (e.g. #define CUB_NS_QUALIFIER ::foo::cub). Note that this is handled automatically when using the new [THRUST_]CUB_WRAPPED_NAMESPACE mechanism.

New Features

  • #322: Ported the merge sort algorithm from Thrust; cub::BlockMergeSort and cub::DeviceMergeSort are now available.
  • #326: Simplify the namespace wrapper macros, and detect when Thrust's symbols are in a wrapped namespace.

Bug Fixes

  • #160, #163, #352: Fixed several bugs in cub::DeviceSpmv and added basic tests for this algorithm. Thanks to James Wyles and Seunghwa Kang for their contributions.
  • #328: Fixed error handling bug and incorrect debugging output in cub::CachingDeviceAllocator. Thanks to Felix Kallenborn for this contribution.
  • #335: Fixed a compile error affecting clang and NVRTC. Thanks to Jiading Guo for this contribution.
  • #351: Fixed some errors in the cub::DeviceHistogram documentation.

Enhancements

  • #348: Add an example that demonstrates how to use dynamic shared memory with a CUB block algorithm. Thanks to Matthias Jouanneaux for this contribution.

CUB 1.13.1 (CUDA Toolkit 11.5)

25 Oct 19:01
Compare
Choose a tag to compare

CUB 1.13.1 is a minor release accompanying the CUDA Toolkit 11.5.

This release provides a new hook for embedding the cub:: namespace inside
a custom namespace. This is intended to work around various issues related to
linking multiple shared libraries that use CUB. The existing CUB_NS_PREFIX and
CUB_NS_POSTFIX macros already provided this capability; this update provides a
simpler mechanism that is extended to and integrated with Thrust. Simply define
THRUST_CUB_WRAPPED_NAMESPACE to a namespace name, and both thrust:: and
cub:: will be placed inside the new namespace. Using different wrapped
namespaces for each shared library will prevent issues like those reported in
NVIDIA/thrust#1401.

New Features

  • #326: Add THRUST_CUB_WRAPPED_NAMESPACE hooks.