
Parallelizing a software rasterizer algorithm using OpenCL


AhmedSaleh said:
Thanks for the information. Regarding the approach for removing the iteration over the triangles: I thought of something, tell me if it's correct or not. I make a kernel, enqueue it, get all the bounding boxes and store them in an output buffer, then run another kernel with all the bounding boxes as input and start the 2D scanning from the bounding box limits. Is that correct? But how do I choose which bounding box size and offset the second kernel should go to and draw?

I guess you mean the approach I proposed (sorting by area), but it's not clear to me what you're asking.
In practice we get more kernels than two. I'll write some pseudo code.

// 1st kernel; input: all triangles of the scene; output: buffer at the size of all triangles
for each triangle
{
	transform vertices, frustum and backface culling.
	If (survives)
	{
		CalculateClippedBoundingRectangle(triangle);
		ulong key = ((ulong)rectangleArea << 32) | triangleIndex; // 64 bit key: area in the high bits, triangle index in the low bits
		buffer.Append(key);
	}
}

// 2nd kernel: Sort the buffer, e.g. using radix sort. 

// 3rd kernel: for each remaining, sorted triangle, rasterize it from a single thread, using atomics to write to the framebuffer

There is no need to cache the bounding rectangle, although you can of course, e.g. if the rasterizer needs it. We could also cache transformed vertices or have an initial kernel which does this just once per vertex.
With such caching, we decrease ALU at the cost of increasing bandwidth. Probably not worth it.
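As a rough illustration of the 1st kernel's append (a sketch only; TransformAndCull and CalculateClippedBoundingRectangle are placeholders standing in for the actual culling and clipping code):

// sketch of the 1st kernel: one thread per input triangle,
// appending a 64 bit sort key (area in the high bits, index in the low bits)
__kernel void buildTriangleKeys(__global const struct Triangle_* triangles,
                                int triCount,
                                __global ulong* keys,
                                volatile __global uint* keyCount)
{
	int i = get_global_id(0);
	if (i >= triCount) return;

	// transform vertices, frustum and backface culling (placeholder helper)
	if (!TransformAndCull(&triangles[i])) return;

	// placeholder helper returning the clipped rectangle area in pixels
	uint rectangleArea = CalculateClippedBoundingRectangle(&triangles[i]);

	ulong key = ((ulong)rectangleArea << 32) | (uint)i;
	uint slot = atomic_inc(keyCount); // append position in the output buffer
	keys[slot] = key;
}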

We only want to process the triangles in sorted order, so all threads in a workgroup have a workload of similar size.
But if all triangles are already at about the same projected size, like in your initial picture, my proposed optimization is not worth it. We need a complex scene with close-up and distant stuff.
And we need enough triangles to saturate the GPU on per thread rasterization.

If we have a few very large triangles in the scene, that's still bad. E.g. if a triangle covers the entire screen, a single thread fills the entire screen without any parallelization.
For such cases, your initial approach would be better. So you could perhaps keep it and use it just for the last (largest) triangles in the sorted list.


JoeJ said:
And you dislike OpenGL / DirectX so much that you prefer to render on CPU. Because of the above, your personal dislike must be the only reason.

It's too time consuming to fix crashes and bugs with every GPU family of every manufacturer on every platform; the speed benefit isn't worth the time. If most of the users experience issues when running the program, I will just earn less money from them, for 10x the effort. (The only way out of this problem is not using any features besides triangles + small textures, and then it will look like the game is from 1997 anyway.) And that's not just an indie problem, even bigger engines can't escape this. In winter I checked a few Unity games, and with most of the driver/GPU combos I had crashes or it ran inexplicably slow.

If I went with 3D acceleration, I wouldn't go with Vulkan; that's too new and the support is questionable… I would go with OpenGL instead (that's the one I know well enough).

JoeJ said:
Now i don't want to change your mind, but have you ever tried GPU compute? Maybe you would like it?

Yeah, I have used OpenCL a bit. I can confirm it's really easy; it's just C code in a special package encapsulated in a function. I see where you are going with this one: with OpenCL, by just adding a few extra lines, you can run C algorithms on a GPU, and it can easily be enabled or disabled within the code. And there are even software-only implementations, so it doesn't matter if there is no GPU to support it.

As for the software I have used OpenCL in: of course it was in the same package as the CPU-only executable. I kept it online for a few years, and no one seemed to be interested in the OpenCL version; people either didn't understand what it is, or they had no OpenCL whatsoever. Even after a year, there were no users for the OpenCL version (no questions about it, no bug reports, no feedback) and people only used the normal version. So I threw away the OpenCL version.

Another problem: the performance of pure native C code vs. software-emulated OpenCL code is about 5x in favor of the native C code, according to my personal experience (and I have implemented basically the entire program as OpenCL code, not just a 3-4 line example-type usage). So what this means is: anemic performance on the CPU, and very extensive work required to re-implement literally everything in OpenCL.

Third problem: people need drivers to be installed.

Fourth problem: the OpenCL source code is passed as raw text to the driver. It's easy to steal your work.

Fifth problem: basic OpenCL can't use system memory. You have to copy memory segments back and forth from/to the program area. I am aware that there is probably some extension which allows it, but it isn't even supported on NVidia cards (or at least at the time I wrote it, NVidia only supported an older specification of OpenCL which didn't support this).
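The copy pattern being described looks roughly like this on the host side (a minimal sketch, not code from this thread; ctx, queue, hostData and size are assumed to be set up already):

// data has to be moved into a cl_mem buffer before the kernel runs, and read back afterwards
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, hostData, 0, NULL, NULL); // CPU -> GPU
// ... set kernel args and run kernels on buf ...
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, hostData, 0, NULL, NULL);  // GPU -> CPU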

These are my thoughts in short. Imho OpenCL needs another 20 years of extensive development to be even remotely usable for me. But since right now we have 24-32 (virtual) cores in CPUs in high-end desktop computers, and 12-16 in midrange… I think a wiser route is to find ways to use these cores efficiently than to go the GPU route.

Geri said:
If I went with 3D acceleration, I wouldn't go with Vulkan; that's too new and the support is questionable… I would go with OpenGL instead (that's the one I know well enough).

I only mentioned VK to point out that compute is simple even with that. Actually I assume your problems with gfx are a result of API complexity. There are always many ways to do one thing, and often we are unsure whether we do it correctly, or whether our approach works with every vendor. (It's well known that driver teams mainly fix our bugs with their newest drivers, ‘ready’ for a certain new game.)

That's much less of an issue with compute shaders.

Personally I started with OpenGL compute shaders. Later I switched to OpenCL, which was twice as fast on NVidia and 10% faster on AMD GPUs. The move to Vulkan then gave me another 2x speedup, due to less need for CPU / GPU sync.
The shader code itself remained the same no matter which API. To convert between CL and GLSL I use the C preprocessor. Language features are more or less the same, just the syntax varies.
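A minimal sketch of what such a preprocessor mapping can look like (an illustration of the technique, not the actual header; COMPILE_FOR_OPENCL is a made-up define):

// shared shader source, preprocessed either as OpenCL C or as GLSL
#ifdef COMPILE_FOR_OPENCL
	typedef float3 vec3;                 // give OpenCL vector types their GLSL names
	typedef float4 vec4;
	#define SHARED_MEM __local           // workgroup-shared memory qualifier
#else
	#define SHARED_MEM shared            // GLSL compute equivalent; vec3/vec4 are native
#endif
// the kernel/shader body below can then be written once, using vec3/vec4 and SHARED_MEM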

Geri said:
Another problem: the performance of pure native C code vs. software-emulated OpenCL code is about 5x in favor of the native C code, according to my personal experience

Well, it's nice that OpenCL can run on both CPU and GPU, but that's a completely useless feature in practice, because you need to write very different code for each if you want good performance. OpenCL makes sense only on the GPU; on the CPU there is surely no need for it.

Geri said:
Third problem: people need drivers to be installed.

Yeah. AMD dropped OpenCL support for their CPUs some years ago. If GPUs follow, it's dead.
Also, OpenCL 3.0 is a step back from 2.0. 2.0 is the most advanced, but can't be used because NVidia refuses to support it.
At this point I would say it's useful only as a personal entry point to learn GPU programming, and graphics APIs are the only practical cross-vendor option we really have. Sadly there is no progress to catch up with CL 2.0 or Cuda features. >:(

Geri said:
Fourth problem: the OpenCL source code is passed as raw text to the driver. It's easy to steal your work.

That's another issue gfx APIs actually 'solve'.

Geri said:
Fifth problem: basic OpenCL can't use system memory. You have to copy memory segments back and forth from/to the program area.

Of course. GPUs need high bandwidth, which the bus to system memory can't deliver. Not to mention the issues of latency, or of syncing caches if the GPU changes CPU RAM and vice versa.
So that's not an API problem but a technical limitation. One reason why I personally think dGPU monsters should just die in favor of more powerful and affordable iGPUs, which would potentially reduce such problems a lot.

Geri said:
These are my thoughts in short. Imho OpenCL needs another 20 years of extensive development to be even remotely usable for me.

I have to agree. That's true for GPUs in general. It's still not accessible enough for everyday work, and there is no progress to improve the situation.
NV pushes Cuda, Intel now builds on SYCL, AMD has their own HIP stuff which only works on Linux, Microsoft abandoned their AMP stuff, and Apple doesn't care about cross platform at all.
OpenCL was our only hope, but seemingly they neither can nor want to agree on some standard.
Biggest failure of the tech industry of all time. <:(

I still have some hope for the upcoming C++ parallelization standards, which might change things if we're lucky.

@joej

I modified the code, removed the 'for each triangle' loop from the code in the first post, and made a 3D kernel: width, height, triangle count.

Still the performance is the SAME!!! How come? And it's degraded on some models.

__kernel void sendImageToPBO(__global uchar4* dst_buffer, __global struct Triangle_* triangles, int triCount)
{
	// one thread per (pixel x, pixel y, triangle index)
	size_t blockIdx = get_group_id(0);
	size_t blockIdy = get_group_id(1);
	size_t blockDimX = get_local_size(0);
	size_t blockDimY = get_local_size(1);
	size_t threadIdX = get_local_id(0);
	size_t threadIdY = get_local_id(1);

	float3 c0 = { 1, 0, 0 };
	float3 c1 = { 0, 1, 0 };
	float3 c2 = { 0, 0, 1 };

	int x = get_global_id(0);
	int y = get_global_id(1);
	int z = get_global_id(2);	// triangle index

	// bounds check against the 2D image size and the triangle count
	if (x < get_global_size(0) && y < get_global_size(1) && z < triCount)
	{
		float3 v0 = (float3)(triangles[z].v[0].pos[0], triangles[z].v[0].pos[1], 0);
		float3 v1 = (float3)(triangles[z].v[1].pos[0], triangles[z].v[1].pos[1], 0);
		float3 v2 = (float3)(triangles[z].v[2].pos[0], triangles[z].v[2].pos[1], 0);
		float3 p = { x + 0.5f, y + 0.5f, 0 };
		float w0 = edgeFunction(v1, v2, p);
		float w1 = edgeFunction(v2, v0, p);
		float w2 = edgeFunction(v0, v1, p);
		if (w0 >= 0 && w1 >= 0 && w2 >= 0) {

			float area = edgeFunction(v0, v1, v2);

			float r = w0 * c0.x + w1 * c1.x + w2 * c2.x;
			float g = w0 * c0.y + w1 * c1.y + w2 * c2.y;
			float b = w0 * c0.z + w1 * c1.z + w2 * c2.z;

			w0 /= area;
			w1 /= area;
			w2 /= area;
			// perspective factor (1 / interpolated z); named invZ so it doesn't shadow the triangle index z
			float invZ = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z);
			r *= invZ, g *= invZ, b *= invZ;

			dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255);
		}
	}
}
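(edgeFunction isn't shown in the thread; presumably it's the usual signed-area edge test from software rasterization tutorials, roughly:)

// 2D cross product / signed area test:
// the sign tells on which side of the edge a -> b the point c lies
float edgeFunction(float3 a, float3 b, float3 c)
{
	return (c.x - a.x) * (b.y - a.y) - (c.y - a.y) * (b.x - a.x);
}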

AhmedSaleh said:
I modified the code, removed the 'for each triangle' loop from the code in the first post, and made a 3D kernel: width, height, triangle count. Still the performance is the SAME!!! How come?

It looks like, instead of looping over all triangles inside the kernel (as before), you now loop over all triangles by making the triangle index the 3rd dimension of the compute domain. So the workload remains the same, and the effect on performance is expected to be subtle.
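Roughly, on the host side (a sketch, not this project's actual enqueue code; imageWidth, imageHeight and triCount are assumed variables), the two variants are:

// variant (a): 2D domain, kernel loops over triCount triangles internally
size_t global2D[2] = { imageWidth, imageHeight };
// variant (b): 3D domain, the triangle index becomes get_global_id(2)
size_t global3D[3] = { imageWidth, imageHeight, (size_t)triCount };
// either way, width * height * triCount (pixel, triangle) pairs get tested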

@JoeJ

So I still don't understand how to optimize it more? It seems I need the third dimension.

But in your original algorithm, you still iterate over each triangle too, so what's the difference?

What does the buffer in your pseudocode look like, in terms of datatype?

I want to separate out the 'for each triangle' loop in my code.

Also another question: how do I specify the number of blocks and the number of threads in OpenCL?
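(For reference, in OpenCL the CUDA-style blocks and threads correspond to the global and local work sizes passed when enqueuing the kernel; a minimal sketch, with the 8x8 local size chosen arbitrarily and queue/kernel/imageWidth/imageHeight assumed to exist:)

// local size = threads per work group ("block"); global size = total threads;
// number of work groups = global size / local size per dimension
size_t localSize[2]  = { 8, 8 };
size_t globalSize[2] = { imageWidth, imageHeight }; // must be a multiple of localSize pre OpenCL 2.0
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, localSize, 0, NULL, NULL);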


@joej

I managed to optimize it and got it to 115 FPS, many thanks.

But I want to apply your algorithm; here is the current kernel:

// skip the triangle unless its bounding rect overlaps the screen
		if (xmin <= imageWidth - 1 && xmax >= 0 && ymin <= imageHeight - 1 && ymax >= 0)
		{

			// be careful xmin/xmax/ymin/ymax can be negative. Don't cast to uint32_t
			unsigned int x0 = max(0, (int)(floor(xmin)));
			unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
			unsigned int y0 = max(0, (int)(floor(ymin)));
			unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));
			for (unsigned int y = y0; y <= y1; ++y) {
				for (unsigned int x = x0; x <= x1; ++x) {
					float3 p = { x + 0.5f, y + 0.5f, 0 };

					float w0 = edgeFunction(v1Raster, v2Raster, p);
					float w1 = edgeFunction(v2Raster, v0Raster, p);
					float w2 = edgeFunction(v0Raster, v1Raster, p);

					if (w0 >= 0 && w1 >= 0 && w2 >= 0) {

						float area = edgeFunction(v0Raster, v1Raster, v2Raster);

						float r = w0 * c0.x + w1 * c1.x + w2 * c2.x;
						float g = w0 * c0.y + w1 * c1.y + w2 * c2.y;
						float b = w0 * c0.z + w1 * c1.z + w2 * c2.z;

						w0 /= area;
						w1 /= area;
						w2 /= area;
						float z = 1 / (w0 * v0Raster.z + w1 * v1Raster.z + w2 * v2Raster.z);
						r *= z, g *= z, b *= z;


						dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255);
					}
				}
			}
		}

AhmedSaleh said:
for (unsigned int y = y0; y <= y1; ++y) { for (unsigned int x = x0; x <= x1; ++x) {

I expected you would make this mistake : )
It's very bad, because by formulating the loop this way, the whole intended advantage gets lost. Basically you serialize execution, and parallelism isn't well utilized.

To understand the reason, be sure to imagine parallel execution in lockstep. Here is an example to point out what happens:

We have a loop similar to this:

for (int y=0; y<h; y++) 
for (int x=0; x<w; x++) 
{
	DoStuff(x,y);
}

Imagine two threads execute the loop in lockstep.
But we have different values for w and h.

thread 1: w = 1, h = 10

thread 2: w = 10, h = 1

So both have the same workload of size 10.

Now imagine the inner loop:

Thread 1 executes it once, but then it goes idle and does nothing until the second thread is finished with the same loop too.
Thread 2 executes the loop 10 times. Only after that can both threads continue with processing one step of the outer loop. So thread 2 throttles thread 1.

For the outer loop, the same problem happens, just vice versa. Thread 1 throttles thread 2.

As a result, both threads require the time to process a workload of size 100, not 10 (!)

So you always get worst case performance if you do this. For GPU parallel programming, you must think differently, in many ways, and this example is a really good one.
Be sure to understand this. Otherwise ask if something isn't clear.

To do better, we could do this:

int workload = w*h;
for (int i=0; i<workload; i++)
{
	int x = i%w;
	int y = i/w; 
	DoStuff(x,y);
}

We use a single loop to avoid the problem, but we have to do some extra work to get our coordinates.
Both threads now have best case performance and do the loop only 10 times as desired.

So that's probably the most important lesson to learn here. Again, be sure to understand.

As a side note, GPUs often lack an integer division operation, so it might be worth trying some alternatives to / and %.
But that's a low level optimization and not important at this point. Just mentioning it.
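One possible alternative, just as a sketch of the idea (not code from the thread): carry x and y along and wrap manually, so no / or % is needed inside the loop.

unsigned int x = 0, y = 0;
for (unsigned int i = 0; i < workload; i++)
{
	DoStuff(x, y);
	// advance to the next pixel; wrap to the next row without / or %
	if (++x == w) { x = 0; ++y; }
}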

(Edit: fixed a bug)

@joej

Didn't fully understand, but couldn't I convert the two loops into one?

for (unsigned int z = x1*y0; z <= x0*y1; ++z) {
				unsigned int x = z % x1*y0;
				unsigned int y = z % y1*x0;
				float3 p = { x + 0.5f, y + 0.5f, 0 };

			

AhmedSaleh said:
Didn't fully understand

You have to! So keep it in mind for the future, as this happens all the time.
Ask specifically if you can formulate a question.

AhmedSaleh said:
couldn't I convert the two loops into one?

I'll try. Original code:


			unsigned int x0 = max(0, (int)(floor(xmin)));
			unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
			unsigned int y0 = max(0, (int)(floor(ymin)));
			unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));
			for (unsigned int y = y0; y <= y1; ++y) {
				for (unsigned int x = x0; x <= x1; ++x) {

Optimized code:


			unsigned int x0 = max(0, (int)(floor(xmin)));
			unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
			unsigned int y0 = max(0, (int)(floor(ymin)));
			unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));
			
			unsigned int w = x1 - x0 + 1; // width of rectangle (x1 is inclusive)
			unsigned int h = y1 - y0 + 1; // height
			unsigned int workload = w * h; // area of rectangle in pixels
			

			for (unsigned int i=0; i<workload; i++) 
			{
				unsigned int x = (i%w) + x0;
				unsigned int y = (i/w) + y0; 

