Here are the current TODOs for implementing a pointwise and normalization scheduler. NOTE: the action items are neither exhaustive nor ordered by priority.
1. Initialize and invalidate the mbarrier at the start and end of the kernel, respectively.
Currently, we initialize and invalidate an mbarrier for each TMA operation, which adds unnecessary overhead.
2. Implement mbarrier with parity bit.
With the parity scheme, a single thread arrives at the mbarrier and sets the expected transaction count, while the other threads wait at the mbarrier for the memory transaction to complete.
Currently, we use the mbarrier token style: every thread that arrives at the mbarrier gets a token, and every thread must wait at the mbarrier with its token. This forces all threads to operate in lockstep.
Only a single thread launches the TMA operation and sets the expected transaction count.
The remaining threads must still arrive at the mbarrier but set the expected transaction count to 0.
Motivation
Required for warp specialization.
Simplifies the mbarrier arrive-and-wait pattern.
mbarrier token

```cpp
__mbarrier_token_t token;
if (elect_sync()) {
  // Initiate TMA bulk tensor copy.
  cp_async_bulk_global_to_shared_tensor_2d(&smem_barrier, ...);
  token = barrier_arrive1_tx(&smem_barrier, expected_transaction_count);
} else {
  // Other threads arrive with arrival count of 1 and expected transaction count of 0.
  token = barrier_arrive1_tx(&smem_barrier, 0);
}
while (!barrier_try_wait_token(&smem_barrier, token)) { }
// compute
```
// compute
mbarrier parity

```cpp
int parity = 0;
if (elect_sync()) {
  // Initiate TMA bulk tensor copy.
  cp_async_bulk_global_to_shared_tensor_2d(&smem_barrier, ...);
  barrier_arrive1_tx(&smem_barrier, expected_transaction_count);
}
while (!barrier_try_wait_parity(&smem_barrier, parity)) { }
// compute
// update parity bit
parity ^= 1;
```
3. Pipelining (multiple mbarriers per TensorView)
Launch multiple TMA operations simultaneously, but process each stage as it becomes available.
Motivation
Overlap data movement with computation
Pseudo-code
```
for each stage of producer TV:
  launch TMA operation for stage
end for
for each stage of consumer:
  wait for corresponding TMA stage to become available
end for
```
4. Combining mbarriers (multiple TensorViews per mbarrier)
Currently, we create an mbarrier for each TensorView, but TensorViews can share the same mbarrier if they synchronize at the same point.
Use syncthreads analysis to identify the placement of each mbarrier_wait
Merge mbarriers at the same sync position together
Create a single mbarrier and combine the expected transaction counts
Motivation
We can launch independent TMA load operations and wait for all of their results at the same time.
Minimizes register pressure caused by mbarrier overhead.
5. Implement the ublkcp TMA operator (1D version of TMA).