
understanding wave operation intrinsics

Started by
4 comments, last by Valakor 3 years ago

hello,

With DirectX 12 and Shader Model 6.0, Microsoft introduced new shader wave intrinsics,

e.g.

WaveActiveBitOr

Returns the bitwise OR of all the values of the expression across all active lanes in the current wave, and replicates the result to all lanes in the wave.
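As a minimal sketch of the usage (the helper `ComputeFlagsForThread` is invented for illustration):

```hlsl
// Hypothetical example (SM 6.0+): each lane contributes a flag word, and
// WaveActiveBitOr combines the words of all active lanes in the wave.
[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint myFlags = ComputeFlagsForThread(dtid.x); // hypothetical helper
    // Every active lane receives the OR of all active lanes' values:
    uint waveFlags = WaveActiveBitOr(myFlags);
}
```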

How many lanes can "all lanes" be at most?

Suppose I have a compute shader with a typical wave size for NVIDIA or AMD:

  • ThreadGroup (4,8,1): 4 × 8 = 32 threads are started per group (typical NVIDIA hardware)
  • ThreadGroup (8,8,1): 8 × 8 = 64 threads are started per group (typical AMD hardware)

Do these functions only behave correctly if I choose the above typical values that the GPU supports in hardware, or can I also synchronize data across threads with a larger thread count per group?

e.g. ThreadGroup (16,16,1): 16 × 16 = 256 threads are started per group (larger tiles)
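A sketch of that 256-thread case, showing why the question matters: a wave intrinsic only covers `WaveGetLaneCount()` lanes, not the whole group (kernel name is invented):

```hlsl
// Hypothetical 16x16 tile kernel: 256 threads per group, but one wave
// holds only WaveGetLaneCount() lanes (e.g. 32 on NVIDIA, 64 on AMD),
// so the group is split across several waves.
[numthreads(16, 16, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    uint lanes = WaveGetLaneCount();      // lanes per wave, not per group
    // This OR only combines the lanes of the wave this thread belongs to:
    uint waveOr = WaveActiveBitOr(gtid.x);
}
```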

I ask because I am working intensively with AMD's GPUOpen code for denoising hybrid ray-traced shadows.

Only for the typical AMD wave size of 64 is the intrinsic called directly.
On the alternative path they synchronize manually with a groupshared memory variable.

It makes me wonder why they don't check WaveGetLaneCount() < lane_count_in_thread_group:

bool FFX_DNSR_Shadows_ThreadGroupAllTrue(bool val)
{
    const uint lane_count_in_thread_group = 64;
    if (WaveGetLaneCount() == lane_count_in_thread_group)
    {
        // Fast path: the whole thread group fits in a single wave,
        // so the wave intrinsic already covers every thread of the group.
        return WaveActiveAllTrue(val);
    }
    else
    {
        // Slow path: reduce across the whole group via groupshared memory.
        // g_FFX_DNSR_Shadows_false_count is a groupshared uint declared elsewhere.
        GroupMemoryBarrierWithGroupSync();
        g_FFX_DNSR_Shadows_false_count = 0;
        GroupMemoryBarrierWithGroupSync();
        if (!val) g_FFX_DNSR_Shadows_false_count = 1;
        GroupMemoryBarrierWithGroupSync();
        return g_FFX_DNSR_Shadows_false_count == 0;
    }
}

"Typical values" are only useful for rough performance estimates, on paper.
The values you can actually rely on are: unavoidable runtime queries (like WaveGetLaneCount), parameters you are responsible for setting and remembering (like group sizes), and, if you are lucky, fixed minimum and maximum limits that the spec guarantees.



For other readers, here is my conclusion after much reading from different sources.

Intrinsics like WaveActiveBitOr behave exactly as they are defined, but this is NOT what programmers usually need.
They only synchronize the lanes of a single wave (the threads included in that wave).

BUT in most cases we want the "wave intrinsics" to behave like "thread group" intrinsics that synchronize data across ALL threads of a thread group. The algorithm of a typical compute shader is designed, e.g., around a two-dimensional tile size.
Coding the shader to behave correctly with different thread group sizes is in some cases very hard work.

Because there is no guarantee that a wave includes ALL threads of the group, there is no guarantee that the results will be correct.
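To make this concrete, here is a hedged sketch (names invented) of a group-wide OR that stays correct for any wave size, combining a per-wave intrinsic with a groupshared reduction:

```hlsl
groupshared uint g_GroupOr; // hypothetical groupshared accumulator

// Reduces 'value' across ALL threads of the group, regardless of wave size:
// each wave first ORs its own lanes, then one lane per wave atomically ORs
// the wave result into groupshared memory.
uint GroupActiveBitOr(uint value, uint groupIndex) // groupIndex = SV_GroupIndex
{
    if (groupIndex == 0) g_GroupOr = 0;   // one thread clears the accumulator
    GroupMemoryBarrierWithGroupSync();

    uint waveOr = WaveActiveBitOr(value); // combine within each wave first
    if (WaveIsFirstLane())                // one atomic per wave, not per thread
        InterlockedOr(g_GroupOr, waveOr);
    GroupMemoryBarrierWithGroupSync();

    return g_GroupOr;                     // every thread sees the group-wide OR
}
```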

On the PC platform it is recommended to design the compute shader for a thread group size of 32 (NVIDIA) or 64 (AMD), which occupies the GPU best and allows the wave intrinsics to be used.

Targeting Xbox or PlayStation, things are easier because we have well-defined hardware and can write the shader exactly accordingly.

That's kind of the point of wave ops: they expose to you how the hardware works, with all of its positives and negatives, so in certain cases you can use them to optimize things. What you described can already be achieved with atomics and LDS; wave ops won't replace those, but you can use them to optimize further. Building an abstraction above this would defeat the whole point of the exercise.

There are definitely certain algorithms that require knowing the wave size (32, 64, whatever), but there are some that don't. A very important one that I've used frequently optimizes accumulating / counting values. A naive implementation might have every thread do an AtomicAdd in global memory; however, you can instead use Wave ops to accumulate the value among the threads in the wave and then have only the first (or last) active thread in the wave perform an AtomicAdd with the wave-accumulated result.
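A sketch of the pattern described above, with invented names (`g_Counter` and `ComputeCountForThread` are assumptions for illustration):

```hlsl
RWStructuredBuffer<uint> g_Counter; // hypothetical global counter at index 0

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint myCount = ComputeCountForThread(dtid.x); // hypothetical helper

    // Naive version: one global atomic per thread.
    //   InterlockedAdd(g_Counter[0], myCount);

    // Wave-optimized version: sum within the wave, then a single atomic
    // per wave, issued by the first active lane.
    uint waveSum = WaveActiveSum(myCount);
    if (WaveIsFirstLane())
        InterlockedAdd(g_Counter[0], waveSum);
}
```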

Here's a link with some interesting DX12/Vulkan Wave Programming use-cases: https://gpuopen.com/wp-content/uploads/2017/07/GDC2017-Wave-Programming-D3D12-Vulkan.pdf

