🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Parallelizing a software rasterizer algorithm using OpenCL

AhmedSaleh · 2022-09-21T11:59:52

I've written a small software rasterizer using OpenCL and would like to optimize and parallelize it further. Currently I scan the whole screen and check whether each triangle overlaps each pixel. I would like to parallelize the loop and do it more efficiently. For example, my idea is to process only the pixels inside the bounding box?

__kernel void sendImageToPBO(__global uchar4* dst_buffer, __global float* vbo, int vbosize, __global int* ibo, int ibosize)
{
    size_t blockIdx = get_group_id(0);
    size_t blockIdy = get_group_id(1);
    size_t blockDimX = get_local_size(0);
    size_t blockDimY = get_local_size(1);
    size_t threadIdX = get_local_id(0);
    size_t threadIdY = get_local_id(1);

    // per-vertex colors
    float3 c0 = (float3)(1.0f, 0.0f, 0.0f);
    float3 c1 = (float3)(0.0f, 1.0f, 0.0f);
    float3 c2 = (float3)(0.0f, 0.0f, 1.0f);

    // one work-item per pixel
    int x = get_global_id(0);
    int y = get_global_id(1);
    int imageWidth = 800;
    int imageHeight = 800;

    // guard against the image size, not the vertex buffer size
    if (x < imageWidth && y < imageHeight)
    {
        // every work-item walks the whole triangle list (9 floats per triangle)
        for (int i = 0; i < vbosize; i += 9)
        {
            float3 v1 = (float3)(vbo[i], vbo[i + 1], vbo[i + 2]);
            float3 v0 = (float3)(vbo[i + 3], vbo[i + 4], vbo[i + 5]);
            float3 v2 = (float3)(vbo[i + 6], vbo[i + 7], vbo[i + 8]);

            float xmin = fmin(v0.x, fmin(v1.x, v2.x));
            float ymin = fmin(v0.y, fmin(v1.y, v2.y));
            float xmax = fmax(v0.x, fmax(v1.x, v2.x)); // was fmax(.., fmin(..))
            float ymax = fmax(v0.y, fmax(v1.y, v2.y));

            // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to unsigned int
            // (the bounding box is computed here but not used yet)
            unsigned int x0 = max(0, (int)(floor(xmin)));
            unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
            unsigned int y0 = max(0, (int)(floor(ymin)));
            unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));

            float3 p = (float3)(x + 0.5f, y + 0.5f, 0.0f);
            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);

            if (w0 >= 0 && w1 >= 0 && w2 >= 0)
            {
                // normalize the weights before interpolating the colors
                float area = edgeFunction(v0, v1, v2);
                w0 /= area; w1 /= area; w2 /= area;

                float r = w0 * c0.x + w1 * c1.x + w2 * c2.x;
                float g = w0 * c0.y + w1 * c1.y + w2 * c2.y;
                float b = w0 * c0.z + w1 * c1.z + w2 * c2.z;

                float z = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z);
                r *= z, g *= z, b *= z;

                dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255);
            }
        }
    }
}
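
To make the bounding box idea concrete, here is a minimal serial C++ sketch, not the kernel above. It visits only the pixels inside one triangle's clamped bounding box. The `Vec2` type, the `rasterizeInBoundingBox` name, the coverage buffer, and the body of `edgeFunction` are assumptions for illustration, since the thread does not show how `edgeFunction` is defined. The winding order is assumed to be the one for which all three edge values are non-negative inside the triangle.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; }; // hypothetical 2D vertex type

// Assumed definition: signed area term of point c relative to edge a->b.
float edgeFunction(Vec2 a, Vec2 b, Vec2 c)
{
    return (c.x - a.x) * (b.y - a.y) - (c.y - a.y) * (b.x - a.x);
}

// Marks covered pixels in `coverage` (imageWidth * imageHeight entries)
// and returns how many pixels were marked.
int rasterizeInBoundingBox(Vec2 v0, Vec2 v1, Vec2 v2,
                           int imageWidth, int imageHeight,
                           std::vector<unsigned char>& coverage)
{
    assert(coverage.size() == static_cast<std::size_t>(imageWidth) * imageHeight);

    const float xmin = std::min(v0.x, std::min(v1.x, v2.x));
    const float ymin = std::min(v0.y, std::min(v1.y, v2.y));
    const float xmax = std::max(v0.x, std::max(v1.x, v2.x));
    const float ymax = std::max(v0.y, std::max(v1.y, v2.y));

    // Clamp as signed ints: the bounds can be negative or off-screen.
    const int x0 = std::max(0, static_cast<int>(std::floor(xmin)));
    const int y0 = std::max(0, static_cast<int>(std::floor(ymin)));
    const int x1 = std::min(imageWidth - 1, static_cast<int>(std::floor(xmax)));
    const int y1 = std::min(imageHeight - 1, static_cast<int>(std::floor(ymax)));

    int covered = 0;
    // If the box is fully off-screen, x1 < x0 or y1 < y0 and nothing runs.
    for (int y = y0; y <= y1; y++)
    {
        for (int x = x0; x <= x1; x++)
        {
            const Vec2 p = { x + 0.5f, y + 0.5f }; // pixel center
            if (edgeFunction(v1, v2, p) >= 0 &&
                edgeFunction(v2, v0, p) >= 0 &&
                edgeFunction(v0, v1, p) >= 0)
            {
                coverage[static_cast<std::size_t>(y) * imageWidth + x] = 1;
                covered++;
            }
        }
    }
    return covered;
}
```

This loops per triangle instead of per pixel, so a parallel version would need some way to avoid two triangles writing the same pixel at once.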


Started by AhmedSaleh September 06, 2022 05:31 PM

58 comments, last by JoeJ 1 year, 9 months ago

JoeJ

September 21, 2022 11:59 AM

AhmedSaleh said:
I'm just confused: what is the point of a prefix sum?

It's a really simple concept, so before I learned GPU programming, I did not know it had a name.

But it's just this: we have an array of numbers, and the prefix sum is another array, where each element is the sum of all preceding elements.
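
As a minimal sketch of that definition, assuming the exclusive form with one extra trailing element (which matches the `prefixBinCount[binI+1]` lookup in the snippet further down); the function name `exclusivePrefixSum` is made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum with one extra trailing element:
// out[i] = in[0] + ... + in[i-1], and out[in.size()] is the grand total.
std::vector<int> exclusivePrefixSum(const std::vector<int>& in)
{
    std::vector<int> out(in.size() + 1, 0);
    for (std::size_t i = 0; i < in.size(); i++)
    {
        assert(in[i] >= 0); // this sketch expects counts, so no negatives
        out[i + 1] = out[i] + in[i];
    }
    return out;
}
```

For the counts 3, 0, 2, 4 this gives 0, 3, 3, 5, 9: each entry is where that element's range begins, and the last entry is the total.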

This usually has two applications:
We know the sum of all numbers in the whole array (the last element), which we often need in order to allocate memory of that size.
And we also know the running sum at every element of our initial array, so we can place all objects in compact order and process them in that order, often in parallel.

See the end of the code:

// now we can iterate points per bin:
for (int binI = 0; binI < NUM_BINS; binI++)
{
    int listBegin = prefixBinCount[binI];
    int listEnd = prefixBinCount[binI + 1];

    for (int i = listBegin; i < listEnd; i++)
    {
        std::cout << pointPerBinLists[i] << " ";
    }
}

Notice that's ideal. We don't need to traverse some linked list or something like that - we have our stuff in sequence.
And getting the begin and end of our range per bin is easy and ideal as well.
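
The earlier part of that code is not shown in this post, so here is a rough serial sketch of how `prefixBinCount` and `pointPerBinLists` could be built. This is an assumption about the usual count / prefix sum / scatter pattern, not the original code. `binOfPoint`, `BinnedPoints` and `binPoints` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BinnedPoints
{
    std::vector<int> prefixBinCount;   // numBins + 1 entries
    std::vector<int> pointPerBinLists; // point indices, grouped by bin
};

// binOfPoint[i] is the bin index of point i (hypothetical input).
BinnedPoints binPoints(const std::vector<int>& binOfPoint, int numBins)
{
    BinnedPoints result;

    // Pass 1: count how many points fall into each bin.
    std::vector<int> binCount(numBins, 0);
    for (int bin : binOfPoint)
    {
        assert(bin >= 0 && bin < numBins);
        binCount[bin]++;
    }

    // Pass 2: exclusive prefix sum over the counts gives each bin's range.
    result.prefixBinCount.assign(numBins + 1, 0);
    for (int b = 0; b < numBins; b++)
        result.prefixBinCount[b + 1] = result.prefixBinCount[b] + binCount[b];

    // Pass 3: scatter each point index into the next free slot of its bin.
    result.pointPerBinLists.assign(binOfPoint.size(), -1);
    std::vector<int> cursor(result.prefixBinCount.begin(),
                            result.prefixBinCount.end() - 1);
    for (std::size_t i = 0; i < binOfPoint.size(); i++)
        result.pointPerBinLists[cursor[binOfPoint[i]]++] = static_cast<int>(i);

    return result;
}
```

In a parallel version, the increments in pass 1 and pass 3 are the places where atomics or a per-thread split would typically be needed.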

Once this goal is clear, it is easier to understand how and why the prefix sum gives us this, and why the concept is generally useful.

Basically the applications are similar to those where we would want to use sorting. But binning is faster (O(number of elements + number of bins)), so it is preferable if we don't need exactly sorted order, just some coarse division (into 'bins').
(Another typical application is summed area tables (SAT)).
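
A summed area table is the same idea in 2D: each entry holds the sum of everything above and to the left, so the sum over any rectangle takes four lookups. A small sketch, assuming a row-major integer image and a table padded with one extra row and column of zeros; `buildSAT` and `rectSum` are illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sat has (w + 1) * (h + 1) entries; sat[y * (w + 1) + x] is the sum of
// all pixels in the half-open rectangle [0, x) x [0, y).
std::vector<long long> buildSAT(const std::vector<int>& img, int w, int h)
{
    assert(img.size() == static_cast<std::size_t>(w) * h);
    const int sw = w + 1;
    std::vector<long long> sat(static_cast<std::size_t>(sw) * (h + 1), 0);
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sat[(y + 1) * sw + (x + 1)] = img[y * w + x]
                + sat[y * sw + (x + 1)]   // sum above
                + sat[(y + 1) * sw + x]   // sum to the left
                - sat[y * sw + x];        // overlap counted twice
    return sat;
}

// Sum of pixels in the half-open rectangle [x0, x1) x [y0, y1).
long long rectSum(const std::vector<long long>& sat, int w,
                  int x0, int y0, int x1, int y1)
{
    const int sw = w + 1;
    return sat[y1 * sw + x1] - sat[y0 * sw + x1]
         - sat[y1 * sw + x0] + sat[y0 * sw + x0];
}
```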

The alternative to binning would be to build linked lists of points per bin. This is simpler, and has time complexity of only O(number of elements).
But to do this, we would again require atomics to build the lists in parallel.
Plus: With triangles, it would not even work, because triangles go into multiple tiles, not just one.
Finally, traversing linked lists is slow due to pointer chasing, and on GPU this is even more of a problem than on CPU.
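
With counting and a prefix sum, the multi-tile case is no problem: a triangle simply adds one entry to every tile its bounding box touches. A serial sketch under assumptions: square tiles, screen-space bounding boxes already computed, finite coordinates. `Box`, `TileRange`, `TileBins` and `binTriangles` are hypothetical names, and the bounding box is conservative, so a tile may list a triangle that does not actually cover it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Box { float xmin, ymin, xmax, ymax; }; // triangle bounds in pixels
struct TileRange { int tx0, ty0, tx1, ty1; }; // inclusive; empty if tx1 < tx0

struct TileBins
{
    std::vector<int> prefixTileCount;      // numTiles + 1 entries
    std::vector<int> trianglePerTileLists; // triangle indices, grouped by tile
};

// Tile range touched by a box, clamped to the screen.
TileRange tileRange(const Box& b, int tilesX, int tilesY, int tileSize)
{
    TileRange r;
    r.tx0 = std::max(0, static_cast<int>(std::floor(b.xmin / tileSize)));
    r.ty0 = std::max(0, static_cast<int>(std::floor(b.ymin / tileSize)));
    r.tx1 = std::min(tilesX - 1, static_cast<int>(std::floor(b.xmax / tileSize)));
    r.ty1 = std::min(tilesY - 1, static_cast<int>(std::floor(b.ymax / tileSize)));
    return r;
}

TileBins binTriangles(const std::vector<Box>& boxes, int width, int height, int tileSize)
{
    assert(width > 0 && height > 0 && tileSize > 0);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    const int numTiles = tilesX * tilesY;

    // Pass 1: each triangle counts once in every tile it touches.
    std::vector<int> tileCount(numTiles, 0);
    for (const Box& b : boxes)
    {
        const TileRange r = tileRange(b, tilesX, tilesY, tileSize);
        for (int ty = r.ty0; ty <= r.ty1; ty++)
            for (int tx = r.tx0; tx <= r.tx1; tx++)
                tileCount[ty * tilesX + tx]++;
    }

    // Pass 2: exclusive prefix sum; the last entry is the total list size.
    TileBins bins;
    bins.prefixTileCount.assign(numTiles + 1, 0);
    for (int t = 0; t < numTiles; t++)
        bins.prefixTileCount[t + 1] = bins.prefixTileCount[t] + tileCount[t];

    // Pass 3: scatter triangle indices into the per-tile ranges.
    bins.trianglePerTileLists.assign(bins.prefixTileCount[numTiles], -1);
    std::vector<int> cursor(bins.prefixTileCount.begin(), bins.prefixTileCount.end() - 1);
    for (std::size_t i = 0; i < boxes.size(); i++)
    {
        const TileRange r = tileRange(boxes[i], tilesX, tilesY, tileSize);
        for (int ty = r.ty0; ty <= r.ty1; ty++)
            for (int tx = r.tx0; tx <= r.tx1; tx++)
                bins.trianglePerTileLists[cursor[ty * tilesX + tx]++] = static_cast<int>(i);
    }
    return bins;
}
```

Each tile can then process its own contiguous range of triangles, with no pointers to chase.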
