
Using @builtin(num_workgroups) or @builtin(local_invocation_index) is very slow on M1 MacBook #5677

Closed
TYPEmber opened this issue May 8, 2024 · 4 comments

Comments

@TYPEmber

TYPEmber commented May 8, 2024

But it is fine on Windows 10 with an Nvidia 2080 Ti.

When I dispatch (4000, 1000, 1) workgroups without using @builtin(num_workgroups) or @builtin(local_invocation_index), it takes just 90 ms, but if I read these two built-in values, it takes 1800 ms on an M1 MacBook.

wgpu version: 0.20.0 & latest https://github.com/gfx-rs/wgpu.git

@group(0)
@binding(0)
var<storage, read> data: array<i32>;
@group(0)
@binding(1)
var<storage, read_write> output: array<i32>;
@group(0)
@binding(2)
var<storage, read> kernal_offset: array<i32>;
@group(0)
@binding(3)
var<storage, read> kernel_value: array<i32>;

@compute
@workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(num_workgroups) num_workgroups: vec3<u32>,
@builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    let si = global_id.x + global_id.y * 4000;
    let index = si + (si / 4000) * (4040 - 4000);

    for (var i = 0; i < i32(arrayLength(&kernal_offset)); i ++) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    // When I add either of the following lines,
    // the shader becomes very slow:
    // sum += i32(num_workgroups.x);
    // sum += i32(local_invocation_index);
    
    output[si] = sum;
}
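For readers puzzling over the index arithmetic above: the shader appears to map a logical 4000-wide grid into a buffer whose rows are padded to a stride of 4040 (inferred from the constants in the code, not stated in the issue). A minimal Rust sketch of that mapping:

```rust
// Sketch of the index mapping in the shader above (assumed: logical width
// 4000, padded row stride 4040, i.e. 40 extra halo columns per row).
fn padded_index(si: u32) -> u32 {
    let row = si / 4000; // logical row of linear index si
    si + row * (4040 - 4000) // skip 40 padding columns for each preceding row
}

fn main() {
    assert_eq!(padded_index(0), 0); // first element of row 0
    assert_eq!(padded_index(3999), 3999); // last element of row 0
    assert_eq!(padded_index(4000), 4040); // first element of row 1
    println!("ok");
}
```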
@9291Sam

9291Sam commented May 8, 2024

This is expected behavior. You are using a workgroup size of (1, 1, 1); this is incredibly small and will destroy performance.

Instead, you should use a larger workgroup size (32x32 tends to be good) and dispatch it in larger chunks, of size (125, 32) in your case. Finally, insert a branch to ensure you stay within bounds of the array, since 32 workgroups of height 32 actually cover an "area" of 1024 rows; if you're past row 1000, you don't want to touch your array.
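The dispatch arithmetic suggested above can be sketched with plain ceiling division (the 32x32 tile and the 4000x1000 problem size come from the thread; `dispatch_count` is a hypothetical helper, not a wgpu API):

```rust
// Number of workgroups needed to tile a problem dimension: ceiling division.
fn dispatch_count(problem: u32, workgroup: u32) -> u32 {
    (problem + workgroup - 1) / workgroup
}

fn main() {
    // 4000 x 1000 problem, 32 x 32 workgroups.
    let x = dispatch_count(4000, 32);
    let y = dispatch_count(1000, 32);
    assert_eq!((x, y), (125, 32)); // 32 tiles of 32 rows cover 1024 >= 1000,
                                   // hence the in-shader bounds check
    println!("dispatch ({x}, {y}, 1)");
}
```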

@TYPEmber
Author

TYPEmber commented May 8, 2024

Thanks for your reply.
You are right, this workgroup_size is not chosen for best performance; it is just an example to show the difference between an M1 MacBook Air and an Nvidia 2080 Ti on Windows.
Using @builtin(num_workgroups) or @builtin(local_invocation_index) does not destroy performance on the Nvidia 2080 Ti on Windows.
And this is my actual code, but it still shows the same difference between the two platforms.

// Declarations missing from the original snippet, reconstructed so it compiles
// (the workgroup size value is illustrative):
const workgroup_len: u32 = 64u;
var<workgroup> item_si: u32;
var<workgroup> item_index: u32;
var<workgroup> item_sum: atomic<i32>;

@compute
@workgroup_size(workgroup_len)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
        @builtin(num_workgroups) num_workgroups: vec3<u32>,
        @builtin(local_invocation_index) local_invocation_index: u32
) {
    var sum = 0;

    var si = 0u;
    var index = 0u;

    if (local_invocation_index == 0) {
        item_si = workgroup_id.x + workgroup_id.y * num_workgroups.x + workgroup_id.z * num_workgroups.x * num_workgroups.y;
        item_index = item_si + (item_si / 4000) * (4040 - 4000);
    }
    workgroupBarrier();

    index = item_index;
    si = item_si;

    for (var i = local_invocation_index; i < arrayLength(&kernal_offset); i += workgroup_len) {
        sum += data[i32(index) + kernal_offset[i]] * kernel_value[i];
    }

    if (local_invocation_index != workgroup_len - 1) {
        atomicAdd(&item_sum, sum);
    } 
    workgroupBarrier();

    if (local_invocation_index == workgroup_len - 1) {
        // item_sum is atomic, so it must be read with atomicLoad.
        output[si] = sum + atomicLoad(&item_sum);
    }
}
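As an editorial aside, the strided loop plus atomicAdd in this shader is a standard workgroup reduction. A CPU sketch in Rust, with illustrative kernel values (not from the issue), showing that the per-thread partial sums combine to the full sum:

```rust
// CPU sketch of the strided workgroup reduction in the shader above.
// The kernel values are illustrative, not from the issue.
const WORKGROUP_LEN: usize = 4;

// Each "thread" t sums every WORKGROUP_LEN-th tap, mirroring the shader's
// `for (var i = local_invocation_index; ...; i += workgroup_len)` loop.
fn partial_sum(kernel_value: &[i32], t: usize) -> i32 {
    kernel_value.iter().skip(t).step_by(WORKGROUP_LEN).sum()
}

fn main() {
    let kernel_value = [1i32, 2, 3, 4, 5, 6, 7];

    // The atomicAdd plus the final write amount to summing all partials.
    let total: i32 = (0..WORKGROUP_LEN)
        .map(|t| partial_sum(&kernel_value, t))
        .sum();

    assert_eq!(total, kernel_value.iter().sum::<i32>()); // reduction is exact
    println!("total = {total}");
}
```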

In a word, no matter how bad the code is, using @builtin(num_workgroups) instead of a constant number should not take additional time.

@TYPEmber TYPEmber closed this as completed May 8, 2024
@TYPEmber TYPEmber reopened this May 8, 2024
@teoxoy
Member

teoxoy commented May 13, 2024

In a word, no matter how bad the code is, using @builtin(num_workgroups) instead of a constant number should not take additional time.

I don't know what we can do about this, though. We do an almost 1:1 translation from WGSL to MSL, and the items in question are builtins. It's up to the driver to dispatch those workgroups and execute the shaders.

@TYPEmber
Author

Thank you for your contribution to this project!
I tried using naga to translate the WGSL to MSL, and the output does look fine.
