
Parallelizing a software rasterizer algorithm using OpenCL


@JoeJ

The above code is not working :(


I had missed that a <= is needed instead of a <: the bounds are inclusive.
So the fix should be:

			unsigned int w = x1 - x0 + 1; // width of rectangle
			unsigned int h = y1 - y0 + 1; // height
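To spell out the off-by-one: with inclusive bounds x0..x1 and y0..y1, the loops must use <= as well. A minimal sketch (coverPixel is a hypothetical stand-in for the per-pixel work, not code from this thread):

for (unsigned int y = y0; y <= y1; ++y)   // <= because y1 is the last row inside the rectangle
	for (unsigned int x = x0; x <= x1; ++x)
		coverPixel(x, y);                 // hypothetical per-pixel coverage test / shading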

@JoeJ

Many thanks!

Now, on to the next step…

Can you explain in plain words what NVIDIA does with their bin rasterizer algorithm? Is it easy to implement?

Which is easier, the sorting approach you proposed? Can I get better performance with it?

HPG2011_Papers_Laine.pdf (highperformancegraphics.net)


That's very complex and thus much harder to implement, but much faster for sure. Maybe my proposal is similar to the earlier ‘FreePipe’ work they mention.

The next step would be the work on binning or sorting. Both are common building blocks of more advanced algorithms like theirs, which relies on binning, for example.
Binning is also often used to bin lights to tiles in screenspace, so many lights can be processed efficiently. That may be a good application to search for examples.
Sorting, on the other hand, would give better results, actually perfect results for my proposal.
But sorting is more expensive, and I could not say which is probably the better option.

I guess you also need to spend more work on clipping. That's quite complex too. A guard band is often used as a compromise, to reduce the amount of new, clipped geometry that has to be generated. Triangles covering the guard band then need to be clipped to the actual screen per scanline, per pixel, or per bounding rectangle by the rasterizer. I assume your approach already (but only) supports that last case of clipping the bounding rectangle.
But many triangles can't be handled with a guard band. For example, triangles with vertices behind the camera can't be projected to screenspace at all. We must clip their geometry before the projection. This complicates things a lot.
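As an illustration only (not code from this thread; the names and the w > ε convention are my assumptions), clipping one triangle against the near plane in clip space, before the perspective divide, could look like this in OpenCL C. It outputs 0, 1, or 2 triangles:

#define NEAR_EPS 1e-5f

typedef struct { float4 v[3]; } Tri; // clip-space positions only; real code must also interpolate attributes

// Sutherland-Hodgman clip of one triangle against the plane w > NEAR_EPS,
// which removes points at or behind the camera. Returns 0, 1 or 2 triangles.
int clipTriangleNearPlane(Tri t, Tri res[2])
{
	float4 poly[4]; // clipping against one plane grows a triangle to at most 4 vertices
	int n = 0;
	for (int i = 0; i < 3; ++i)
	{
		float4 a = t.v[i];
		float4 b = t.v[(i + 1) % 3];
		int aIn = a.w > NEAR_EPS;
		int bIn = b.w > NEAR_EPS;
		if (aIn) poly[n++] = a;
		if (aIn != bIn) // edge crosses the plane: emit the intersection point
		{
			float tt = (NEAR_EPS - a.w) / (b.w - a.w);
			poly[n++] = mix(a, b, tt);
		}
	}
	if (n < 3) return 0; // triangle is fully behind the camera
	res[0].v[0] = poly[0]; res[0].v[1] = poly[1]; res[0].v[2] = poly[2];
	if (n == 3) return 1;
	res[1].v[0] = poly[0]; res[1].v[1] = poly[2]; res[1].v[2] = poly[3]; // fan split of the quad
	return 2;
}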

Because clipping creates new triangles, we do not know in advance how many triangles we will get. This again complicates things; it is usually solved with memory buffers that are ‘probably’ large enough in practice. If we exceed them we get artifacts, but at least we should make sure the application does not crash.
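One simple way to get that ‘no crash’ guarantee, as a hedged sketch reusing the hypothetical Tri struct from above (all names are mine, not from the thread):

void appendClipped(__global Tri* clippedTris,
                   volatile __global uint* counter, // zeroed by the host each frame
                   uint maxTriangles,               // capacity of clippedTris
                   Tri newTri)
{
	uint slot = atomic_add(counter, 1u); // reserve a slot unconditionally
	if (slot < maxTriangles)             // write only if it fits; overflow drops
		clippedTris[slot] = newTri;      // triangles (artifacts) but never writes out of bounds
}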

That's a lot of work. Maybe the hardest part, as I remember it from working on CPU rasterization.

@JoeJ

I would like to apply your idea of sorting the triangles, but I don't know what the buffer would look like or how to submit the IDs. Could you formulate your algorithm in a bit more detail? For example, what is if (survives)?

Many thanks for your collaboration and your work!


I see I had a bug there:

for each triangle
{
	transform vertices, frustum and backface culling.
	if (survives)
	{
		CalculateClippedBoundingRectangle(triangle);
		// uint key = (rectangleArea << 32) | triangleIndex; // bug: I had a 64-bit key in mind, but the key is 32 bits, so a left shift of 32 would make the area vanish
		uint key = (rectangleArea << 16) | triangleIndex;
		buffer.Append(key);
	}
}

if (survives) means that the triangle is not culled.

I proposed to combine this kernel with frustum and backface culling. For your current scene and camera setup, you would only need backface culling.
Frustum culling is related to clipping, so you could care about that later.

The key uses bitpacking to store both the area and the index. Because we want to sort by area, the area must occupy the highest bits.
After the sorting is done, you unpack the bits to get the triangle index: index = key & 0xffff;. Of course you decide how many bits you need.
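For concreteness, here is the 16/16-bit layout as tiny helpers (a sketch assuming both area and index fit in 16 bits; the names are mine):

inline uint packKey(uint rectangleArea, uint triangleIndex)
{
	return (rectangleArea << 16) | (triangleIndex & 0xffffu); // area in the high bits drives the sort order
}
inline uint unpackIndex(uint key) { return key & 0xffffu; } // low bits: triangle index
inline uint unpackArea(uint key)  { return key >> 16; }     // high bits: rectangle area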

CUDA has optimized libraries for sorting. I guess equivalents exist for OpenCL too and could be found on GitHub or in libraries offered by GPU vendors.
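If no library fits, a bitonic sort is a common do-it-yourself option on GPUs. A hedged sketch of the classic global-memory version: the host enqueues this kernel with n threads for k = 2, 4, …, n and, within each k, j = k/2, k/4, …, 1. n must be a power of two, so pad the buffer with 0xffffffff keys.

__kernel void bitonicStep(__global uint* keys, uint j, uint k)
{
	uint i   = get_global_id(0);
	uint ixj = i ^ j;                    // partner element for this step
	if (ixj > i)                         // each pair is handled by exactly one thread
	{
		bool ascending = ((i & k) == 0); // sort direction of this subsequence
		uint a = keys[i];
		uint b = keys[ixj];
		if ((a > b) == ascending)        // out of order for this direction: swap
		{
			keys[i]   = b;
			keys[ixj] = a;
		}
	}
}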

Appending to a buffer is easiest using atomicAdd.

We have a global variable in VRAM, e.g. in a small buffer. I've forgotten the exact syntax, so I'll just call it a global int (still very much pseudocode):

// global data
global int bufferIndex = 0;
// kernel...
int index = atomicAdd(bufferIndex, 1);
buffer[index] = key;
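In actual OpenCL C this could look roughly like the following sketch (buffer names are hypothetical; culling and area computation are assumed to have run already and delivered their results per triangle):

__kernel void appendKeys(__global const uint* rectangleArea, // per-triangle area, precomputed
                         __global const uchar* survives,     // 1 = triangle was not culled
                         __global uint* keyBuffer,
                         volatile __global uint* counter)    // single uint, zeroed by the host
{
	uint tri = get_global_id(0);
	if (survives[tri])
	{
		uint key  = (rectangleArea[tri] << 16) | tri;
		uint slot = atomic_add(counter, 1u); // reserve one slot in the buffer
		keyBuffer[slot] = key;
	}
}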

However, this is not great, because it causes atomic access from all threads to the same memory address.
So after this works, you can optimize by using a small buffer in LDS memory per workgroup. You fill it the same way, but now we only need an atomicAdd on a counter in LDS memory, which is very fast.
Only after this local buffer is full do we do one atomicAdd to the global index, adding the local buffer size. This reserves space, so all threads can then just copy the local LDS buffer to the global VRAM buffer. Then we set the local counter to zero and start filling the same local buffer again.
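A sketch of that LDS optimization, simplified to a single flush at the end: each thread appends at most one key, so a local buffer of workgroup size can never overflow and the refill loop isn't needed here. All names are hypothetical:

#define WG_SIZE 256 // must match the enqueued workgroup size

__kernel void appendKeysLDS(__global const uint* rectangleArea,
                            __global const uchar* survives,
                            __global uint* keyBuffer,
                            volatile __global uint* counter)
{
	__local uint localKeys[WG_SIZE];
	__local uint localCount;
	__local uint globalBase;

	uint tri = get_global_id(0);
	uint lid = get_local_id(0);

	if (lid == 0) localCount = 0;
	barrier(CLK_LOCAL_MEM_FENCE);

	if (survives[tri])
	{
		uint slot = atomic_add(&localCount, 1u); // cheap atomic on LDS
		localKeys[slot] = (rectangleArea[tri] << 16) | tri;
	}
	barrier(CLK_LOCAL_MEM_FENCE);

	// one global atomic per workgroup reserves space for all its keys
	if (lid == 0) globalBase = atomic_add(counter, localCount);
	barrier(CLK_LOCAL_MEM_FENCE);

	// all threads cooperatively copy the local buffer to VRAM
	for (uint i = lid; i < localCount; i += WG_SIZE)
		keyBuffer[globalBase + i] = localKeys[i];
}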

That's an important optimization we can use very often, but it's not needed yet just to make things work.

… a never ending stream of details … ; )

@JoeJ

I understand the approach, but I don't know why it would make a difference.

We will sort by area, so small triangles are rendered first and big ones afterwards, right?

How would that optimize the performance?


AhmedSaleh said:
How would that optimize the performance?

The problem is similar to the nested loop being bad for GPU.

Imagine thread 1 draws a triangle with 10 pixels.
Thread 2 draws a triangle with 100 pixels.

Even with the optimized loop of today, thread 2 will ‘throttle’ thread 1.
All threads of a workgroup (or wave / warp to be precise) will go idle and do nothing until the one thread which has the largest triangle is done.

That's the problem I aim to solve by sorting. After sorting, the triangle areas of our list will look like:

2,2,2,3,4,4,5,7,…

A workgroup will take a sequence of small sizes from that list, e.g. the sequence 2,2,2,3 if a warp were only 4 threads wide.
Thus all threads of the warp will finish at roughly the same time and will be instantly ready for new work.
Or in other words: all threads of the entire GPU will do work most of the time, instead of waiting.

If we don't sort, the sequence may be 2,7,3,4. The whole warp then runs for 7 steps, but only 2+7+3+4 = 16 of the 4×7 = 28 thread-steps do useful work, so utilization drops to roughly 57%, and it gets worse the more the triangle sizes diverge.

This is what makes the ‘many threads in lockstep’ execution model of GPUs so different from CPU multi threading. It is often hard to saturate a GPU and utilize its potential, requiring completely different algorithms.

@JoeJ

Here is the final rasterizer with shading and color/normal interpolation, without backface culling and a depth buffer. I will add them later.

I will try your idea, but there is one major problem: my current platform doesn't support atomic operations. Is there a way to get around adding atomically to the buffer?


With normal interpolation using barycentric coordinates:

[Image]
