🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Back to Graphics and GPU Programming

parallelizing a software rasterizer algorithm using opencl

AhmedSaleh · 2022-09-21T11:59:52

I've written a small software rasterizer using OpenCL and would like to optimize and parallelize it more, currently I'm scanning the whole screen and see if the triangle overlaps with the pixels.. I would like to parallelize the loop and do it more efficiently. For example in my idea, is to process only the bounding box pixels.. ? __kernel void sendImageToPBO(__global uchar4* dst_buffer, __global float* vbo, int vbosize, __global int* ibo, int ibosize) { size_t blockIdx = get_group_id(0); size_t blockIdy = get_group_id(1); size_t blockDimX = get_local_size(0); size_t blockDimY = get_local_size(1); size_t threadIdX = get_local_id(0); size_t threadIdY = get_local_id(1); float3 c0 = { 1, 0, 0 }; float3 c1 = { 0, 1, 0 }; float3 c2 = { 0, 0, 1 }; int x = get_global_id(0); int y= get_global_id(1); int imageWidth = 800; int imageHeight = 800; if (x < vbosize && y < vbosize) { for (int i = 0; i < vbosize; i += 9) { float3 v1 = (float3)(vbo[i], vbo[i + 1], vbo[i + 2]); float3 v0 = (float3)(vbo[i + 3], vbo[i + 4], vbo[i + 5]); float3 v2 = (float3)(vbo[i + 6], vbo[i + 7], vbo[i + 8]); float xmin = fmin(v0.x, fmin(v1.x, v2.x)); float ymin = fmin(v0.y, fmin(v1.y, v2.y)); float xmax = fmax(v0.x, fmin(v1.x, v2.x)); float ymax = fmax(v0.y, fmin(v1.y, v2.y)); // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to unsigned int unsigned int x0 = max(0, (int)(floor(xmin))); unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax))); unsigned int y0 = max(0, (int)(floor(ymin))); unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax))); float3 p = { x + 0.5f, y + 0.5f, 0 }; float w0 = edgeFunction(v1, v2, p); float w1 = edgeFunction(v2, v0, p); float w2 = edgeFunction(v0, v1, p); if (w0 >= 0 && w1 >= 0 && w2 >= 0) { float area = edgeFunction(v0, v1, v2); float r = w0 * c0.x + w1 * c1.x + w2 * c2.x; float g = w0 * c0.y + w1 * c1.y + w2 * c2.y; float b = w0 * c0.z + w1 * c1.z + w2 * c2.z; w0 /= area; w1 /= area; w2 /= area; float z = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z); r *= z, g *= z, b *= z; dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255); } } }

Graphics and GPU Programming Programming

Started by AhmedSaleh September 06, 2022 05:31 PM

58 comments, last by JoeJ 1 year, 9 months ago

JoeJ

4,263

September 07, 2022 07:53 PM

AhmedSaleh said:
how ? ?

Again, the simplest idea i have is:

JoeJ said:
hmmm… following that Epic example, one idea would be to bin or sort the triangles by pixel area. So triangles nearby in the sorted list have a similar count of pixels. That's one dispatch for the binning or sorting. Then a second dispatch would process the triangles from the sorted list. So all threads in a workgroup raster one triangle, and they'll have similar count of pixels, so all threads finish at nearly the same time and saturation is good. Would work with both the scanline or the edge equations approach (for the latter you would sort by are of the bounding rectangle, not the triangle). Not sure what's better. The latter wastes half of work on processing pixels which are not inside the triangle, the former has more code divergence due to complexity. Might not matter too much. That's very simple and not that bad. But the downside is that nearby threads will write to different locations in the framebuffer. But maybe due to the atomic access this does not make it so much slower than it is anyway. Though, i guess for just a single cow model it might not beat your current approach. The scene should be more complex to make it a win.

Analyzing theoretical performance, we process every triangle once for the binning (or sorting), and a second time to draw it.
Rendering performance is as usual with rasterization: The number of processed pixels depends on overdraw.
Implementation is easy.

But because each thread draws one whole triangle we need many triangles to saturate a GPU.
(That's probably also the reason why the github project shows no impressive performance - they also draw one triangle per thread and work with single low poly models.)

Geri

407

September 07, 2022 08:51 PM

I had issues above 3-4 cores with my software renderer. My software renderer supports 3 type of threading:

-per object
-per polygon
-per screen tile

None of them wanted to scale above 3-4 cores. After fine tuning it for a while, i was able to manage scaling up to 4-5 cores in some cases, and thats all.

Obviously the per polygon solution was the fastest. But its useless for alpha blended things. So i switched to per tile. Maybe the most efficient way would be having a governor that dinamically starts and allocates the tasks per tile, but i am not sure how efficient would that be under opencl.

JoeJ

4,263

September 07, 2022 08:56 PM

Geri said:
Obviously the per polygon solution was the fastest.

How did you deal with write hazards, if multiple polygons / threads draw to the same region on screen?

Geri

407

September 07, 2022 08:58 PM

JoeJ said:
How did you deal with write hazards, if multiple polygons / threads draw to the same region on screen?

I havent. If you stare on the same spot long enough, you can sometimes observe a flickering pixel, maybe once in every 10 seconds.

JoeJ

4,263

September 07, 2022 09:12 PM

Interesting solution : )

I always was wondering if it's practical / possible to use a large buffer of memory for atomics on CPU. Never tried it.

It sucks we have to declare a memory address to support atomics. On GPU that's so easy. Any memory supports atomic access - we only need to use an atomic instruction and it just works.
Beside the many threads running the same program in lockstep, this makes true parallel programming so nice on GPU. On CPU that seems not possible at all.
Actually i would love the GPU programming model. It's just that APIs suck so hard it's basically impossible to ‘develop’ anything reasonable complex on GPU. I always do it on CPU first and port it to GPU afterwards. >:(

Geri

407

September 07, 2022 09:33 PM

In the past days, i was wondering about creating a new game engine. I also have re-evaluated my 3d engine and renderer, in contrast with my future plans. I was thinking on optimizing my software renderer even further, and/or i was also thinking to write maybe an opengl fallback for very old 32 bit computers, where the frame rate is too low.

I have scrapped the second part of the idea, and i will finish my new ray tracer instead. I have finally found out, how to make it less ugly, meanwhile i keep the code short, simple, and readable, and fast enough. However, my new idea will not be able to scale beyond 6-8 cores, so almost the same limits are here to haunt again.

By the way, this was the scaling characteristics of the older version of my rasterizer, if someone wonders, with a given X scene:

1 threads: 11 fps
2 threads: 22 fps
3 threads: 26 fps
4 threads: 27 fps
6 threads: 26 fps (yeah, decrease)

After optimizing the threading a little bit two years ago, i was able to reach something like this:

1 threads: 11 fps
2 threads: 22 fps
3 threads: 27 fps
4 threads: 30 fps
6 threads: 32 fps

And the difference between the per-tile, per object, and per-polygon threading models were around 5-10%, depending on the scene. I know, its disappointing. Hopefully, the ray tracer will behave better.

JoeJ

4,263

September 07, 2022 11:14 PM

Geri said:
I have scrapped the second part of the idea, and i will finish my new ray tracer instead. I have finally found out, how to make it less ugly, meanwhile i keep the code short, simple, and readable, and fast enough. However, my new idea will not be able to scale beyond 6-8 cores, so almost the same limits are here to haunt again.

So you target modern hardware, because older stuff has no 6-8 cores.
This means any of such HW will have a GPU, because it's 2022.
And you dislike OpenGL / DirectX so much that you prefer to render on CPU. Because of the above, your personal dislike must be the only reason.

Now i don't want to change your mind, but have you ever tried GPU compute? Maybe you would like it?
I grant you: It has nothing in common with pixel or vertex shaders, or anything you associate with graphics.
It's much easier to use than the graphics pipeline on the API side. Complexity is low even with Vulkan i would say. While drawing a single triangle is tedious, complicated, error prone and a frustrating nightmare, compute is actually fine to use.
And OpenCL makes it really easy. Recommended to get started, but on the long run it lacks some performance advantages of gfx APIs, and i'm worried it might just die in the future. (OpenCL kernels are easy to port to compute shaders if so. It's basically the same thing.)
It will be much faster no matter if raytracing or rasterization, and CPU remains available to implement games which might require some performance too.

Consider to take a look. In comparison to working on CPU it's still a inferior experience. No debugging, just harder to maintain. So i really only use it if CPU is too slow. But unfortunately, that's sometimes the case.

(Did i just encourage somebody to implement basic software rendering in 2022? What's wrong with me? :D )

AhmedSaleh

192

Author

September 08, 2022 03:24 AM

@joej

Thanks for your information.

So for the approach for removing iterating over the triangles. I think about something and tell me if it's correct or no.

I make a kernel, enqueue it, get all the bounding boxes store them in out variable buffer
run another kernel with input all bounding boxes, start the scanning 2D from enqueue from the bounding boxes limits

Is that correct ? but how to chose which bounding box size and offset to let the second kernel go to it and draw

Game Programming is the process of converting dead pictures to live ones .

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines