SVOGI Implementation Details

You still need the sparse octree because it massively reduces memory bandwidth, and during the lighting phase you only have to process the solid voxels.

If that data then gets copied into a volume texture, rendering to all slices of that volume texture is going to be the new bottleneck…
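
For context, that copy step can be sketched as a compute scatter over just the solid voxels, so empty space is never touched and no per-slice raster pass is needed. This is a hypothetical illustration, not code from either engine; the flattened voxel buffer and all names are assumptions.

// Hypothetical sketch: copy solid SVO leaves into a volume texture with a
// compute scatter. Assumes the solid leaves were previously flattened into a
// structured buffer; names are illustrative only.
struct SolidVoxel
{
	uint3  coord;   // destination texel in the volume texture
	float4 color;   // albedo/emission stored in the octree leaf
};

StructuredBuffer<SolidVoxel> solidVoxels : register(t0);
RWTexture3D<float4>          volume      : register(u0);

cbuffer CopyParams : register(b0)
{
	uint solidVoxelCount;
}

[numthreads(64, 1, 1)]
void CopySolidVoxels(uint3 DTid : SV_DispatchThreadID)
{
	// One thread per solid voxel; empty space is never written
	if (DTid.x >= solidVoxelCount)
		return;

	SolidVoxel v = solidVoxels[DTid.x];
	volume[v.coord] = v.color;
}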

10x Faster Performance for VR: www.ultraengine.com

JoeJ said:
It's kinda depressing you face the usual problems, making volume texture a win over acceleration structures again. :(

I can only agree with this. For my whole implementation I use standard volume texture(s) (you can cascade them, and having a cascaded volume texture for LOD is a winner here … which is technically a sort of 'larger blocks' approach taken to an extreme). The SVO implementation was just way too slow - yes, it allows for very high-resolution results, but cascading the volume achieves the same thing, better and faster. And since there is no duplication, it also beats Crassin's original approach in pretty much all scenarios (at least in my implementation).

I'm currently able to do full-resolution GI and specular reflections alongside the rest of the pipeline for moderately sized scenes (I had some test videos with Sponza up somewhere - temporal filtering on the voxel data with some blend factor yielded the best, smoothest results, even for animated scenes), even on hardware that is a few years old (on my current box with a Radeon 6800 it holds a stable 60fps at 4K resolution).
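
To make the cascaded-volume LOD idea concrete: the sampler just has to pick the finest cascade that still contains the shaded position. Below is a rough sketch, assuming camera-centred cascades where each level covers twice the extent of the previous one; all names are illustrative assumptions, not the actual implementation.

// Rough sketch of cascade selection (illustrative names and layout).
#define NUM_CASCADES 4

cbuffer CascadeParams : register(b0)
{
	float3 cascadeCenter;      // usually snapped to voxel-sized steps to avoid flicker
	float  cascade0HalfExtent; // half size of the finest cascade in world units
}

Texture3D<float4> cascades[NUM_CASCADES] : register(t0);
SamplerState      volumeSampler          : register(s0);

float4 SampleCascadedVolume(float3 worldPos, float mip)
{
	float3 offset     = worldPos - cascadeCenter;
	float  halfExtent = cascade0HalfExtent;

	[unroll]
	for (uint i = 0; i < NUM_CASCADES; ++i)
	{
		// Use the finest cascade whose box still contains the position
		if (all(abs(offset) < halfExtent))
		{
			float3 uvw = offset / (2.0f * halfExtent) + 0.5f;
			return cascades[i].SampleLevel(volumeSampler, uvw, mip);
		}
		halfExtent *= 2.0f; // next cascade covers twice the extent
	}
	return 0.0f; // outside all cascades
}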

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

It looks like a choice must be made:

Sparse Voxel Octree

  • Voxelized on CPU
  • Low memory usage
  • GI is done with light propagation
  • High-resolution reflections, almost like raytracing, but blurry reflections not supported
  • High latency (animation and motion supported poorly)

Cascaded Volume Textures

  • Voxelized on the GPU
  • High memory usage
  • GI is done with cone step tracing
  • Blurry reflections supported, sharp reflections not as high-res as SVO
  • Low latency (animation and motion supported well)

To me, the soft indirect specular reflections are the most impressive aspect of this, so I lean towards the second option. And having sharp reflections of the voxelized scene is not that great.
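
For reference, the cone step trace through the mip chain of such a volume looks roughly like the sketch below. This is a generic Crassin-style cone trace with illustrative names and bindings, not code from either engine; coneAperture is roughly the tangent of the cone's half angle.

// Generic cone-step-trace sketch through a pre-filtered (mipmapped) volume.
Texture3D<float4> voxelVolume   : register(t0);
SamplerState      volumeSampler : register(s0);

cbuffer ConeTraceParams : register(b0)
{
	float3 volumeOrigin;   // world-space min corner of the (cubic) volume
	float  volumeExtent;   // world-space size of the whole volume
	float  voxelSize;      // world-space size of one mip-0 voxel
	float  maxDistance;    // how far a cone is allowed to march
}

float3 WorldToUVW(float3 worldPos)
{
	return (worldPos - volumeOrigin) / volumeExtent;
}

float4 ConeTrace(float3 origin, float3 dir, float coneAperture)
{
	float4 accum = 0.0f;          // rgb = gathered radiance, a = occlusion
	float  dist  = voxelSize;     // start one voxel out to avoid self-sampling

	while (dist < maxDistance && accum.a < 1.0f)
	{
		// The cone footprint grows linearly with distance; pick the mip whose
		// voxel size matches the cone diameter at this point.
		float diameter = max(voxelSize, 2.0f * coneAperture * dist);
		float mip      = log2(diameter / voxelSize);

		float3 uvw   = WorldToUVW(origin + dir * dist);
		float4 voxel = voxelVolume.SampleLevel(volumeSampler, uvw, mip);

		// Front-to-back accumulation
		accum += (1.0f - accum.a) * voxel;

		// Step proportional to the footprint so coarse mips take big steps
		dist += diameter * 0.5f;
	}
	return accum;
}

A wide aperture (rough surface) jumps to coarse, prefiltered mips quickly and gives blurry reflections; a narrow aperture stays near mip 0 and approaches, but never quite reaches, the sharpness of an SVO ray traversal.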

10x Faster Performance for VR: www.ultraengine.com

I can add some downsides:

Octree:

  • Difficult to include dynamic objects and characters, so while the sharp reflections are impressive, they would only amplify the artifacts of dynamic stuff missing from them (or of falling back to screen space for it).
  • Light propagation happens at a lower LOD, so it misses out on the extra accuracy, meaning no real visual win over cone tracing. Though LPV can handle participating media.

Volume:

  • Cone tracing means very approximate occlusion, causing light leaks and restrictions on level design. (But I had this same issue when I experimented with light propagation cascades.)
  • Grid alignment can be noticeable even if modeling only low frequencies.
  • No robust differentiation of surface vs. volume, so high-quality GI is not really possible.

I would tend towards the latter, maybe, but I would not be happy.

I now have voxelization running on the GPU:

Is there some clever trick for efficiently downsampling the volume texture? It seems like going through a volume texture and rendering a pass for each slice would be pretty slow.

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
Is there some clever trick for efficiently downsampling the volume texture? It seems like going through a volume texture and rendering a pass for each slice would be pretty slow.

I would use a compute shader, read a small volume block to LDS, and generate 2-3 (or more) mips from that.
The higher mips mean idle threads, but lower bandwidth and fewer dispatches.

Edit: This could also do texture compression (which I guess also works with 3D textures).

What I currently do (no texture compression - for a reason though):

cbuffer InputDimensions : register(b0)
{
	uint3 dimensions;
}

cbuffer InputMiplevels : register(b1)
{
	uint srcMiplevel;
	uint miplevels;
	float texelSize;
}

SamplerState srcSampler : register(s0);
Texture3D<float4> srcLevel : register(t0);
RWTexture3D<float4> mipLevel1 : register(u0);

groupshared float4 tmp[8];

void StoreColor(uint idx, float4 color)
{
	tmp[idx] = color;
}

float4 LoadColor(uint idx)
{
	return tmp[idx];
}

float HasVoxel(float4 color)
{
	return color.a > 0.0f ? 1.0f : 0.0f;
}

// Naive version to generate single mipmap level from the previous one
//
// Runs in 2x2x2 workgroup
[numthreads(2, 2, 2)]
void GenerateMipmaps(uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
	// Each thread in workgroup loads single voxel and stores it on groupshared memory
	float4 src0;
	float4 src1;
	float4 src2;
	float4 src3;
	float4 src4;
	float4 src5;
	float4 src6;
	float4 src7;
	float3 uvw = (DTid.xyz + 0.5f) * texelSize;
	src0 = srcLevel.SampleLevel(srcSampler, uvw, (float)srcMiplevel);
	StoreColor(GI, src0);
	GroupMemoryBarrierWithGroupSync();

	// For first thread in workgroup only
	if (GI == 0)
	{
		// Load all 8 colors from shared memory
		src1 = LoadColor(GI + 0x01);
		src2 = LoadColor(GI + 0x02);
		src3 = LoadColor(GI + 0x03);
		src4 = LoadColor(GI + 0x04);
		src5 = LoadColor(GI + 0x05);
		src6 = LoadColor(GI + 0x06);
		src7 = LoadColor(GI + 0x07);

		// Perform mipmapping function
		float div = HasVoxel(src0) + HasVoxel(src1) + HasVoxel(src2) + HasVoxel(src3) + HasVoxel(src4) + HasVoxel(src5) + HasVoxel(src6) + HasVoxel(src7);

		if (div == 0.0f)
		{
			src0 = 0.0f;
		}
		else
		{
			src0 = (src0 + src1 + src2 + src3 + src4 + src5 + src6 + src7) / div;
		}

		// Store value
		mipLevel1[DTid / 2] = src0;
	}
}

This is the single-level variant. I also have variants that build 2 or 3 mip levels at once, where you work similarly to the above code but use a (4,4,4) or (8,8,8) group - this reduces the number of compute dispatches; you just have to properly select which threads do the work at each level (use masks). This is basically equivalent to how you can do mipmap generation for 2D images.
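
For illustration, a rough sketch of the 2-levels-at-once idea could look like the following. It reuses the declarations from the shader above (srcLevel, srcSampler, mipLevel1, texelSize, srcMiplevel, HasVoxel), assumes one extra UAV for the second mip level, and is not the exact pastebin code - just a sketch of the pattern.

// Rough sketch of generating two mip levels in one dispatch (NOT the exact code).
RWTexture3D<float4> mipLevel2 : register(u1); // assumed extra binding

groupshared float4 block[64];

// Same averaging rule as above: average only over non-empty voxels
float4 AverageSolid(float4 a, float4 b, float4 c, float4 d,
                    float4 e, float4 f, float4 g, float4 h)
{
	float div = HasVoxel(a) + HasVoxel(b) + HasVoxel(c) + HasVoxel(d) +
	            HasVoxel(e) + HasVoxel(f) + HasVoxel(g) + HasVoxel(h);
	return (div == 0.0f) ? 0.0f : (a + b + c + d + e + f + g + h) / div;
}

[numthreads(4, 4, 4)]
void GenerateMipmaps2(uint GI : SV_GroupIndex, uint3 GTid : SV_GroupThreadID, uint3 DTid : SV_DispatchThreadID)
{
	// All 64 threads load one source voxel each into groupshared memory
	float3 uvw = (DTid.xyz + 0.5f) * texelSize;
	block[GI] = srcLevel.SampleLevel(srcSampler, uvw, (float)srcMiplevel);
	GroupMemoryBarrierWithGroupSync();

	// First level: the 8 threads with all-even local coords each own a 2x2x2 block
	if (all((GTid & 1) == 0))
	{
		uint base = GI; // GI == GTid.z * 16 + GTid.y * 4 + GTid.x
		float4 v = AverageSolid(block[base], block[base + 1], block[base + 4], block[base + 5],
		                        block[base + 16], block[base + 17], block[base + 20], block[base + 21]);
		mipLevel1[DTid / 2] = v;
		block[base] = v; // keep the result for the second reduction
	}
	GroupMemoryBarrierWithGroupSync();

	// Second level: a single thread combines the 8 partial results
	if (GI == 0)
	{
		mipLevel2[DTid / 4] = AverageSolid(block[0], block[2], block[8], block[10],
		                                   block[32], block[34], block[40], block[42]);
	}
}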

In terms of performance… Voxelization of Crytek Sponza (real time), including mipmapping, with multiple dynamic objects and a few skinned meshes takes about 1ms at 256^3 and about 5ms at 512^3. I believe you can do better than me, as I'm writing quite a lot of data into the 3D texture. This timing is on a Radeon 6800; on my previous GPU (Radeon 590) the timing was about the same, which I believe points to it being fill-rate bound rather than compute bound. Resolving GI is MUCH faster on the new 6800.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
[numthreads(2, 2, 2)]

This would mean only 8 out of 64 or 32 threads do some work and the rest remain idle?
Or do the API and compiler automatically pack multiple small workgroups so that a whole wavefront stays busy?

I think the programmer has to do this on his own, and using 'too small' workgroups is a typical beginner mistake.
But some time back I pointed this out to @taby, who made the same 'mistake' of using (1,1,1) workgroups in an isosurface shader. He fixed it, but got no speedup.
I could not believe it and fixed it myself from his GitHub project as well, but indeed there was no speedup. Seems the compiler is smarter than I think?

Maybe you can clear up this long-standing mystery. I'm definitely not the only one stumbling over this. :D

So - this code keeps all 8 threads busy for a bit, and then 7 of them idle, pretty much until the workgroup finishes. In the "2-levels-in-kernel" case you have 64 threads (4x4x4): all 64 are busy for starters, then only 8, and then just a single one again. In the "3-levels-in-kernel" case you have 512 threads (8x8x8), but all of them are busy only for a tiny fraction (the first load); after that it is 64 busy, then 8, then 1. Generating mipmaps this way, as you can clearly see, ends up with quite a lot of idle time.

Now, a timing comparison of the 3 variants I use:

VoxelMipmap.hlsl - https://pastebin.com/cNwKn93R - 8.70ms

VoxelMipmap2.hlsl - https://pastebin.com/4a8SasbG - 2.90ms

VoxelMipmap3.hlsl - https://pastebin.com/wxxY62Zq - 2.98ms

This ran on a Radeon 6800, which with RDNA has 32 threads per wavefront. Sadly I don't have the option to run it on an NVIDIA GPU for comparison. I might be able to measure GCN results (where, if I remember correctly, the VoxelMipmap3.hlsl variant was the fastest).

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
Generating mipmaps this way, as you can clearly see, ends up with quite a lot of idle time.

That's unavoidable, but you could manually compact 8 x (2,2,2) into a single GCN wavefront to get a potential 8x speedup regardless.
Even if the example is memory bound, I would be sure it's a win. (If it were not for that strange earlier example of taby's shader, which was OpenGL.)
Drivers doing such compaction automatically is highly unlikely, as it would break subgroup functions.

Vilem Otte said:
RDNA has 32 threads per wavefront.

There is also a mode that joins two 32-thread groups into a 64-thread wavefront; I forgot what they call it. On PC we never know which choice the driver makes, but it was said that compute shaders still mostly use the 64-wide mode, while other shaders mostly use 32.
Probably CS uses 32 if the workgroup size is smaller than 64, of course.

One thing I never tested personally is compacting busy vs. idle threads.
E.g. we have a workgroup of 256 threads, and after some time only 64 remain busy.
If we manage to keep all 64 busy threads at indices 0-63 and the idle threads at indices 64-255, ideally the GPU can skip over the idle wavefronts quickly, and there is no need to execute instructions for wavefronts that are completely masked out.
Not sure if this really works, but I know a guy who tried it on GCN and he said it helped.
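
A tiny generic illustration of that idea (unrelated to the mipmapping shaders above, names purely illustrative): in a groupshared reduction, keeping the live threads packed at the low indices means whole trailing wavefronts are fully masked out and can be retired quickly, instead of every wavefront carrying a few scattered active lanes.

// Generic sketch of keeping active threads compacted at low indices.
StructuredBuffer<float>   inputValues : register(t0);
RWStructuredBuffer<float> groupSums   : register(u0);

groupshared float vals[256];

[numthreads(256, 1, 1)]
void ReduceCompacted(uint GI : SV_GroupIndex, uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID)
{
	vals[GI] = inputValues[DTid.x];

	// Scattered variant (for contrast): live lanes would be 0, 2, 4, ... so every
	// wavefront still contains active threads at every step:
	//   if ((GI % (2 * stride)) == 0) vals[GI] += vals[GI + stride];

	// Compacted variant: the live threads are always 0..activeCount-1, so the
	// trailing wavefronts are completely masked out and can be skipped quickly.
	for (uint activeCount = 128; activeCount > 0; activeCount >>= 1)
	{
		GroupMemoryBarrierWithGroupSync();
		if (GI < activeCount)
			vals[GI] += vals[GI + activeCount];
	}

	if (GI == 0)
		groupSums[Gid.x] = vals[0];
}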
