Hello. According to this article efficient gaussian blur with linear sampling it is better to reduce the number of cycles in the gaussian blur fragment shader by using bilinear interpolation.

I did some experiments and it is indeed better but only if framebuffer texture format is not wide. I have big performance improvement(about 25%) if i use GL_RGB16F texture format with such approach. But when i use GL_RGB32F than performance drops to about same 25%. Could someone comment on that?

I experiment on nvidia p1000 video card.

BTW i use apitrace to see performance difference of specific shader program.

cgrant

1,875

November 22, 2019 04:33 PM

Texture sampling performance is also predicated on bandwidth. So going from 16-bit float to 32-bit float is theoretically doubling your bandwidth. So it would be unrealistic to expect the same performance given the bandwidth difference.

_Flame_

Author

124

November 22, 2019 06:15 PM

1 hour ago, cgrant said:

Texture sampling performance is also predicated on bandwidth. So going from 16-bit float to 32-bit float is theoretically doubling your bandwidth. So it would be unrealistic to expect the same performance given the bandwidth difference.

Do you mean the linear sampling is worse than more cycles in the shader because of bandwidth bottleneck?

pcmaster

1,119

November 22, 2019 06:19 PM

It'd be interesting to see how you measured.

I highly doubt that using a 5x5 filter realised via discretised loads (not samples) on a RGB32F could possibly be faster than a 3x3 filter realised via bilinear samples (not loads) on the same RGB32F texture.

The reason is that the underlying memory will be organised in such a fashion that the linear samples will hit the caches quite as effectively as the discretised loads, the amount of memory transferred will be the same, plus the fixed GPU texture sampling hardware will return the mix of the four samples for free, compared to wasting ALU instructions on doing it yourself.

MJP

20,297

November 22, 2019 11:30 PM

The texture filtering units on most GPU's out in the wild have varying cycle counts for different formats. It's not at all uncommon to have 1/2 rate for 64bpp formats and 1/4 rate for 128bpp formats. Generally you want to avoid 128bpp formats anyway, since they are rarely necessary in graphics and consume a lot of memory + bandwidth.

The Blog | The Book

_Flame_

Author

124

November 23, 2019 08:40 AM

9 hours ago, MJP said:

The texture filtering units on most GPU's out in the wild have varying cycle counts for different formats. It's not at all uncommon to have 1/2 rate for 64bpp formats and 1/4 rate for 128bpp formats. Generally you want to avoid 128bpp formats anyway, since they are rarely necessary in graphics and consume a lot of memory + bandwidth.

It makes sense. It would be far fetched to have same filtering cycles count for all formats. Is there a documentation for such thing?

@pcmaster It is not 5x5 and 3x3 but 5 + 5 and 3 + 3.

pcmaster

1,119

November 23, 2019 10:05 AM

Yeah, I'm sorry about that 5x5. But my argument (guesstimate) still holds.

Maybe MJP will point us to some documentation that says that 128bpp formats (such as RGBA32F) have 1/4 rate, which is definitely true for the AMD GCN. I remember having read it but I cannot seem to find it in the AMD GCN ISA whitepaper nor in the AMD GCN Architecture whitepaper (same for the newer AMD RDNA).

Nevertheless, the "AMD RDNA Architecture" whitepaper on page 21 says:

Quote

the texture sampling and interpolation for pixels using FP16 per channel has doubled and is on par with INT8 data

Suggesting that the previous architecture (GCN) had 1/1 rate for int8 (such as rgba8), 1/2 rate for fp16 or int16 and 1/4 rate for fp32 or int32. It will be stated somewhere also explicitly but I couldn't find it in 10 minutes :(

_Flame_

Author

124

November 23, 2019 12:21 PM

If understood correctly that root cause in not sampling itself but interpolation. In case of interpolation shader we reduce texture reads but add interpolation.

I use apitrace to measure shader program performance. Here are screenshots with results. Shaders that responsible for gauss filtering are outlined by black. Column "Avg GPU time" is what we are looking for. It shows how much time it took to render with a shader per frame. There are 2 shaders because it is done with 2 passes(vertical and horizontal)

5 + 5 gaussian blur GL_RGB16F 118939285_Cropped16F.png.9d095616a12c8365aa2c607ff61a9ee7.png