
Using @builtin(num_workgroups) or @builtin(local_invocation_index) is very slow on M1 MacBook #5677

Closed
TYPEmber opened this issue May 8, 2024 · 4 comments

Comments

@TYPEmber

TYPEmber commented May 8, 2024

But it is fine on Windows 10 with an Nvidia 2080 Ti.

When I dispatch (4000, 1000, 1) workgroups without using @builtin(num_workgroups) or @builtin(local_invocation_index), it takes just 90 ms, but if I read these two built-in values, it takes 1800 ms on an M1 MacBook.

wgpu version: 0.20.0 & latest https://github.com/gfx-rs/wgpu.git

@group(0)
@binding(0)
var<storage, read> data: array<i32>;
@group(0)
@binding(1)
var<storage, read_write> output: array<i32>;
@group(0)
@binding(2)
var<storage, read> kernal_offset: array<i32>;
@group(0)
@binding(3)
var<storage, read> kernel_value: array<i32>;

@compute
@workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(num_workgroups) num_workgroups: vec3<u32>,
@builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    let si = global_id.x + global_id.y * 4000;
    let index = si + (si / 4000) * (4040 - 4000);

    for (var i = 0; i < i32(arrayLength(&kernal_offset)); i ++) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    // When I add either of the following lines,
    // the shader becomes very slow:
    // sum += i32(num_workgroups.x);
    // sum += i32(local_invocation_index);
    
    output[si] = sum;
}
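For readers puzzling over the index arithmetic above: the shader appears to map a logical 4000-wide grid into a buffer whose rows are padded to a stride of 4040 (inferred from the constants in the code, not stated in the issue). A minimal Rust sketch of that mapping:

```rust
// Sketch of the index mapping in the shader above (assumed: logical width
// 4000, padded row stride 4040, i.e. 40 extra halo columns per row).
fn padded_index(si: u32) -> u32 {
    let row = si / 4000; // logical row of linear index si
    si + row * (4040 - 4000) // skip 40 padding columns for each preceding row
}

fn main() {
    assert_eq!(padded_index(0), 0); // first element of row 0
    assert_eq!(padded_index(3999), 3999); // last element of row 0
    assert_eq!(padded_index(4000), 4040); // first element of row 1
    println!("ok");
}
```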
@9291Sam

9291Sam commented May 8, 2024

This is expected behavior. You are using a workgroup size of (1, 1, 1); this is incredibly small and will destroy performance.

Instead, you should use a larger workgroup size (32x32 tends to be good) and dispatch it in larger chunks, of size (125, 32) in your case. Finally, insert a branch to ensure you stay within bounds of the array, since 32 workgroups of height 32 actually cover an "area" of 1024 rows; if you're past row 1000, you don't want to touch your array.
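The dispatch arithmetic suggested above can be sketched with plain ceiling division (the 32x32 tile and the 4000x1000 problem size come from the thread; `dispatch_count` is a hypothetical helper, not a wgpu API):

```rust
// Number of workgroups needed to tile a problem dimension: ceiling division.
fn dispatch_count(problem: u32, workgroup: u32) -> u32 {
    (problem + workgroup - 1) / workgroup
}

fn main() {
    // 4000 x 1000 problem, 32 x 32 workgroups.
    let x = dispatch_count(4000, 32);
    let y = dispatch_count(1000, 32);
    assert_eq!((x, y), (125, 32)); // 32 tiles of 32 rows cover 1024 >= 1000,
                                   // hence the in-shader bounds check
    println!("dispatch ({x}, {y}, 1)");
}
```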

@TYPEmber
Author

TYPEmber commented May 8, 2024

Thanks for your reply.
You are right, this workgroup_size is not chosen for best performance; it is just an example to show the difference between an M1 MacBook Air and an Nvidia 2080 Ti on Windows.
Using @builtin(num_workgroups) or @builtin(local_invocation_index) does not destroy performance on the Nvidia 2080 Ti on Windows.
And this is my actual code, but it still shows the same difference between the two platforms.

// Declarations missing from the original snippet, reconstructed so it compiles
// (the workgroup size value is illustrative):
const workgroup_len: u32 = 64u;
var<workgroup> item_si: u32;
var<workgroup> item_index: u32;
var<workgroup> item_sum: atomic<i32>;

@compute
@workgroup_size(workgroup_len)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
        @builtin(num_workgroups) num_workgroups: vec3<u32>,
        @builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    var si = 0u;
    var index = 0u;

    if (local_invocation_index == 0) {
        item_si = workgroup_id.x + workgroup_id.y * num_workgroups.x + workgroup_id.z * num_workgroups.x * num_workgroups.y;
        item_index = item_si + (item_si / 4000) * (4040 - 4000);
    }
    workgroupBarrier();

    index = item_index;
    si = item_si;

    for (var i = local_invocation_index; i < arrayLength(&kernal_offset); i += workgroup_len) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    if (local_invocation_index != workgroup_len - 1) {
        atomicAdd(&item_sum, sum);
    } 
    workgroupBarrier();

    if (local_invocation_index == workgroup_len - 1) {
        // item_sum is atomic, so it must be read with atomicLoad.
        output[si] = sum + atomicLoad(&item_sum);
    }
}
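As an editorial aside, the strided loop plus atomicAdd in this shader is a standard workgroup reduction. A CPU sketch in Rust, with illustrative kernel values (not from the issue), showing that the per-thread partial sums combine to the full sum:

```rust
// CPU sketch of the strided workgroup reduction in the shader above.
// The kernel values are illustrative, not from the issue.
const WORKGROUP_LEN: usize = 4;

// Each "thread" t sums every WORKGROUP_LEN-th tap, mirroring the shader's
// `for (var i = local_invocation_index; ...; i += workgroup_len)` loop.
fn partial_sum(kernel_value: &[i32], t: usize) -> i32 {
    kernel_value.iter().skip(t).step_by(WORKGROUP_LEN).sum()
}

fn main() {
    let kernel_value = [1i32, 2, 3, 4, 5, 6, 7];

    // The atomicAdd plus the final write amount to summing all partials.
    let total: i32 = (0..WORKGROUP_LEN)
        .map(|t| partial_sum(&kernel_value, t))
        .sum();

    assert_eq!(total, kernel_value.iter().sum::<i32>()); // reduction is exact
    println!("total = {total}");
}
```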

In a word, no matter how bad the code is, using @builtin(num_workgroups) instead of a constant number should not take additional time.

@TYPEmber TYPEmber closed this as completed May 8, 2024
@TYPEmber TYPEmber reopened this May 8, 2024
@teoxoy
Member

teoxoy commented May 13, 2024

In a word, no matter how bad the code is, using @builtin(num_workgroups) instead of a constant number should not take additional time.

I don't know what we can do about this, though. We do an almost 1:1 translation from WGSL to MSL, and the items in question are builtins. It's up to the driver to dispatch those workgroups and execute the shaders.

@TYPEmber
Author

Thank you for your contribution to this project!
I tried using naga to translate the WGSL to MSL, and the output does look fine.
