
Structured Buffers and optimisation

Started by October 02, 2019 09:00 PM
8 comments, last by MJP 4 years, 11 months ago

Hi,

I was wondering if anyone knows much about optimisation for structured buffers. I am wondering if merging two buffers into one single buffer is better?

An example of what I mean:

// two buffers
RWStructuredBuffer<float2> buffer0; 
RWStructuredBuffer<float2> buffer1;

or

// one buffer where buffer 0 is .xy and buffer 1 is .zw
RWStructuredBuffer<float4> buffer01;

or

// one buffer, twice as large: buffer 0 is [0, n-1] and buffer 1 is [n, 2n-1]
RWStructuredBuffer<float2> buffer01;

or


// one buffer of floats, four times as large, with each float2 unpacked like so:
// [0] float2.x
// [1] float2.y
// buffer 0 occupies [0, 2n-1] and buffer 1 occupies [2n, 4n-1]
RWStructuredBuffer<float> buffer01;



Which is better? What does a GPU prefer in terms of performance and memory usage? Annoyingly, I can't find many answers on this topic, so maybe someone here has a better understanding of how GPUs handle this stuff.

FYI, it might seem like a micro-optimisation, but that's because it needs to be: it's a run-time fast Fourier transform and thus needs to be optimised to hell and back, so I'm starting off by working out the ideal buffer setup.

Thanks

Merging two buffers into one generally isn't going to change much from a GPU performance point of view. There are some minor considerations with descriptors, but for this sort of thing your performance is mainly going to be determined by the latency of loading the data itself. To optimize for that, you'll want to make sure that the data is packed in a way that maximizes cache hits. This generally means packing data together based on how you actually load/access it. So if you always access X and Y together, you'll probably want to store your data as XYXYXYXYXYXYXY so that the N threads in your warp/wave can all access their data in a coalesced load without any wasted cache space. However, if your access pattern is to only access X in one pass and then Y in another, then XXXXXXXXXXXX....YYYYYYYYYY could be more efficient.
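A rough HLSL sketch of the two layouts described above, for illustration only. The buffer names, the `N` constant, the register slots, and the 64-thread group size are all assumptions, not something taken from the original question.

```hlsl
// Illustrative sketch only: names, N, and the group size are assumptions.
cbuffer Params : register(b0)
{
    uint N; // number of (x, y) elements
};

// Interleaved layout (XYXYXY...): element i holds both x and y.
RWStructuredBuffer<float2> interleaved : register(u0);

// Split layout (XXXX...YYYY...): x values in [0, N-1], y values in [N, 2N-1].
RWStructuredBuffer<float> split : register(u1);

// Access pattern A: every thread needs x and y together.
[numthreads(64, 1, 1)]
void ScaleBoth(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= N) return;

    // One 8-byte load per thread; adjacent threads read adjacent addresses.
    float2 v = interleaved[i];
    interleaved[i] = v * 0.5f;
}

// Access pattern B: this pass only touches x.
[numthreads(64, 1, 1)]
void ScaleXOnly(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= N) return;

    // Adjacent threads read adjacent floats; no y values are pulled into cache.
    split[i] = split[i] * 0.5f;
}
```

Which layout wins in practice depends on the hardware and the real access pattern, so it is worth profiling both rather than trusting the sketch.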

7 minutes ago, MJP said:

Merging two buffers into one generally isn't going to change much from a GPU performance point of view. There are some minor considerations with descriptors, but for this sort of thing your performance is mainly going to be determined by the latency of loading the data itself. To optimize for that, you'll want to make sure that the data is packed in a way that maximizes cache hits. This generally means packing data together based on how you actually load/access it. So if you always access X and Y together, you'll probably want to store your data as XYXYXYXYXYXYXY so that the N threads in your warp/wave can all access their data in a coalesced load without any wasted cache space. However, if your access pattern is to only access X in one pass and then Y in another, then XXXXXXXXXXXX....YYYYYYYYYY could be more efficient.

Since I am accessing X and Y together, you're basically suggesting I do option 4 but keep buffers 0 and 1 separate rather than use a single buffer twice as large, right?

First, pack your data like MJP said, by access pattern, but keep in mind that loading from a StructuredBuffer whose stride is a multiple of float4 (16 bytes) might be more optimal on some hardware: https://developer.nvidia.com/content/understanding-structured-buffer-performance

I've had cases where padding a structured buffer to float4 did help performance on Nvidia.

Just now, turanszkij said:

First, pack your data like MJP said, by access pattern, but keep in mind that loading from a StructuredBuffer whose stride is a multiple of float4 (16 bytes) might be more optimal on some hardware: https://developer.nvidia.com/content/understanding-structured-buffer-performance

I've had cases where padding a structured buffer to float4 did help performance on Nvidia.

But wouldn't a StructuredBuffer<float2> already be packed as xyxyxyxy? Or do you mean to do it manually as a StructuredBuffer<float> with [0] => x, [1] => y, [2] => x2, [3] => y2?

Just need to be sure here. I am no expert on this stuff; I'm essentially trying to work it out as I come across issues.

I would think that StructuredBuffer<float2> is not so bad. I would worry more if it were float3, because then loads could span multiple cache lines. float4 would be best if you access all the elements in one shader, because I believe a float4 can be loaded in one instruction on AMD.
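To make the stride discussion concrete, here is a hedged HLSL sketch of the three element sizes mentioned in this thread (8, 12 padded to 16, and 16 bytes). Every name, register slot, and the `Count` constant are hypothetical, and whether the 16-byte variants are actually faster is hardware-dependent.

```hlsl
// Illustrative sketch only: all names, slots, and Count are assumptions.
cbuffer Params : register(b0)
{
    uint Count; // number of float4 / PaddedVec3 elements
};

// 8-byte stride: one (x, y) value per element, 2 * Count elements in total.
StructuredBuffer<float2> tight : register(t0);

// 16-byte stride: two (x, y) values per element, in .xy and .zw.
StructuredBuffer<float4> pairs : register(t1);

// A float3 payload alone has a 12-byte stride; an explicit pad rounds it up to 16.
struct PaddedVec3
{
    float3 value;
    float  pad; // unused, only there to reach a 16-byte stride
};
StructuredBuffer<PaddedVec3> padded : register(t2);

RWStructuredBuffer<float4> result : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= Count) return;

    // The same amount of data fetched as two 8-byte loads...
    float4 fromTight = float4(tight[2 * i], tight[2 * i + 1]);
    // ...or as a single 16-byte load.
    float4 fromPairs = pairs[i];
    // 16-byte-aligned element, so it does not straddle a 16-byte boundary.
    float3 v = padded[i].value;

    // Combined only so that every load contributes to the output.
    result[i] = fromTight + fromPairs + float4(v, 0.0f);
}
```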


Some measurements on many GPUs:

https://github.com/sebbbi/perftest

Hi,

This only matters for the CPU. The GPU simply doesn't care, because both structured buffers can be set up as a VertexAttribute so that the shader can't tell the difference.

Now, setting up VertexAttributes could make use of differently structured VBOs; read this:
https://www.khronos.org/opengl/wiki/Vertex_Specification_Best_Practices

Finally, if we interpret "I am wondering if merging two buffers into one single buffer is better?" as merging two VBOs into a single VBO, then the answer is definitely yes. Switching from one VBO to another and setting up a new VertexAttribute structure for it is a very expensive process. It's always faster to do nothing, because you already have the appropriate data bound to the shader :-)

Cheers,
bzt

On 10/2/2019 at 4:12 PM, CelticSir said:

But wouldn't a StructuredBuffer<float2> already be packed as xyxyxyxy? Or do you mean to do it manually as a StructuredBuffer<float> with [0] => x, [1] => y, [2] => x2, [3] => y2?

Just need to be sure here. I am no expert on this stuff; I'm essentially trying to work it out as I come across issues.

Yes, if you do StructuredBuffer<float2> then you're fine: you'll get xyxyxyxy layout. Basically, your original version (option 1) is fine.
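For reference, a minimal HLSL sketch of option 1 in use. It is not an FFT, just an element-wise complex multiply, and it assumes that .x holds the real part and .y the imaginary part; `N`, `ComplexMul`, the register slots, and the group size are hypothetical.

```hlsl
// Illustrative sketch only: the kernel, N, and the group size are assumptions.
cbuffer Params : register(b0)
{
    uint N; // number of complex values in each buffer
};

// Option 1: two separate float2 buffers (.x = real, .y = imaginary, assumed).
RWStructuredBuffer<float2> buffer0 : register(u0);
RWStructuredBuffer<float2> buffer1 : register(u1);

// (a.x + i*a.y) * (b.x + i*b.y)
float2 ComplexMul(float2 a, float2 b)
{
    return float2(a.x * b.x - a.y * b.y,
                  a.x * b.y + a.y * b.x);
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    if (i >= N) return;

    // Each thread reads one xy pair from each buffer; neighbouring
    // threads read neighbouring elements.
    float2 a = buffer0[i];
    float2 b = buffer1[i];

    buffer0[i] = ComplexMul(a, b);
}
```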

This topic is closed to new replies.
