🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Back to Graphics and GPU Programming

parallelizing a software rasterizer algorithm using opencl

AhmedSaleh · 2022-09-21T11:59:52

I've written a small software rasterizer using OpenCL and would like to optimize and parallelize it more, currently I'm scanning the whole screen and see if the triangle overlaps with the pixels.. I would like to parallelize the loop and do it more efficiently. For example in my idea, is to process only the bounding box pixels.. ? __kernel void sendImageToPBO(__global uchar4* dst_buffer, __global float* vbo, int vbosize, __global int* ibo, int ibosize) { size_t blockIdx = get_group_id(0); size_t blockIdy = get_group_id(1); size_t blockDimX = get_local_size(0); size_t blockDimY = get_local_size(1); size_t threadIdX = get_local_id(0); size_t threadIdY = get_local_id(1); float3 c0 = { 1, 0, 0 }; float3 c1 = { 0, 1, 0 }; float3 c2 = { 0, 0, 1 }; int x = get_global_id(0); int y= get_global_id(1); int imageWidth = 800; int imageHeight = 800; if (x < vbosize && y < vbosize) { for (int i = 0; i < vbosize; i += 9) { float3 v1 = (float3)(vbo[i], vbo[i + 1], vbo[i + 2]); float3 v0 = (float3)(vbo[i + 3], vbo[i + 4], vbo[i + 5]); float3 v2 = (float3)(vbo[i + 6], vbo[i + 7], vbo[i + 8]); float xmin = fmin(v0.x, fmin(v1.x, v2.x)); float ymin = fmin(v0.y, fmin(v1.y, v2.y)); float xmax = fmax(v0.x, fmin(v1.x, v2.x)); float ymax = fmax(v0.y, fmin(v1.y, v2.y)); // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to unsigned int unsigned int x0 = max(0, (int)(floor(xmin))); unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax))); unsigned int y0 = max(0, (int)(floor(ymin))); unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax))); float3 p = { x + 0.5f, y + 0.5f, 0 }; float w0 = edgeFunction(v1, v2, p); float w1 = edgeFunction(v2, v0, p); float w2 = edgeFunction(v0, v1, p); if (w0 >= 0 && w1 >= 0 && w2 >= 0) { float area = edgeFunction(v0, v1, v2); float r = w0 * c0.x + w1 * c1.x + w2 * c2.x; float g = w0 * c0.y + w1 * c1.y + w2 * c2.y; float b = w0 * c0.z + w1 * c1.z + w2 * c2.z; w0 /= area; w1 /= area; w2 /= area; float z = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z); r *= z, g *= z, b *= z; dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255); } } }

Graphics and GPU Programming Programming

Started by AhmedSaleh September 06, 2022 05:31 PM

58 comments, last by JoeJ 1 year, 9 months ago

JoeJ

4,264

September 15, 2022 09:02 AM

AhmedSaleh said:
The basic scanline triangle works on the simulator of the GPU (PC Simulator) it takes 1 second for a 512*512 framebuffer, for a cube. and for cow it takes 13 seconds… Would you think that tilemap rendering enhance the performance? If not, what would enhance the performance to overcome at least 20ms rendering ?

To utilize from tiles on HW which can't do atomics, the trivial approach may be:

Sort all triangles by depth.

Divide screen into M x N tiles.

Then one workgroup per tile:
One thread of the workgroup iterates all triangles, test if triangle is in the frustum of the tile.
If so, all threads can rasterize it (like your very first approach).

This way we get some parallel rasterization, but anything else is bad. All workgroups process all triangles, and only one thread of the group does it.
It won't scale with increasing scene complexity.

And i fail to come up with a better approach. We could use something like spin locks to do things like binning to tiles for better parallelization, but spin locks means to serialize execution as well, so we get similar issues most likely.
It may work, but it still sucks conceptually. So i would not want to invest time at all.

That's why i would prefer raytracing. It's slow, but easily parallelizable, and it will scale well.
RT usually only shows benefit if we need secondary rays e.g. for shadows, reflections, AO, GI, etc. But without atomics it just becomes the better concept.
Maybe i miss some options, but i would go so far to say that efficient parallelized rasterization without atomics is not possible. Early fixed function GPUs had it from the start, in form of the depth buffer.

I'm sorry, but i can't help any further. Missing atomics really is a show stopper to a whole lot of things, rasterization being one of them.

AhmedSaleh said:
Increasing the number of threads and warps, make the performance worth in the simulator… is there a solution for that to get benefits of multiple threads ?

Not sure if the simulator can reproduce related performance behavior. I guess not really. Likely the best settings have to be found from testing on the actual hardware.

AhmedSaleh

192

Author

September 18, 2022 04:22 AM

@JoeJ

Hello, hope you are doing fine..

I found an issue, increasing the number of threads for the program, it runs fast, but it doesn't render any pixels on the screen..

Why increasing the number of threads make things bad ?

executing command like that :

running: CONFIGS=-DNUM_CLUSTERS=1 -DNUM_CORES=1 -DNUM_WARPS=4 -DNUM_THREADS=4 -DL2_ENABLE=0 -DL3_ENABLE=0 make -C /home/osboxes/vx/vortex/ci/../driver/simx

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 18, 2022 09:24 AM

AhmedSaleh said:
Why increasing the number of threads make things bad ?

No idea. Idk how the emulator maps GPU threads to CPU cores. Could be issues in your side, but likely the GPU project has some bugs and issues as well.
But surely you can not make assumptions on execution order of the real HW. Meaning, e.g. if sorting triangles by depth would generate a correct image on the emulator, there's no guarantee the same applies to the real HW.

You may want to run your project also on regular CPU / GPU OpenCL drivers of your dev PC as well for a reference.

AhmedSaleh

192

Author

September 20, 2022 10:36 AM

@JoeJ

Hello Joe,

Hope you're doin fine. Simple question do you think implementing tile based approach would be beneficial in terms of FPS settings on the Vortex GPU not on PC ? would I invest time to implement as it's not easy at all

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 21, 2022 06:35 AM

AhmedSaleh said:
Simple question do you think implementing tile based approach would be beneficial in terms of FPS settings on the Vortex GPU not on PC ?

You could try to make your very first attempt tiled this way:

For each triangle, count how many tiles it intersects, and store the count.
Do a prefix sum over (tiles per triangle) count to know how much memory you need for the tile lists.
For each tile iterate all triangles to count how many triangles go into the tile, and store the count.
Do a prefix sum over (triangles per tile) count to calculate the list ranges per tile.
For each tile, iterate all triangles again to finally fill the triangles per tile list.

After that, you can run your initial algorithm per tile, and for each tile you will only process the triangles which are in that tile. (I hope i thought about that algorithm correctly.)

Even we iterate stuff many times, it is less work if scene complexity is high enough. So at some point, it will be a win.

Basically any algorithm which would benefit from atomics can be altered to use multiple passes of prefix sums instead.
The complexity to implement this is actually low, so i'd say it's worth a try.
If such scan algorithms like prefix sums are new to you, it's worth the time anyway alone for the learning experience. Scan algorithms are the most basic building block for parallel programming.

AhmedSaleh

192

Author

September 21, 2022 07:26 AM

@JoeJ

Thanks Joe. I'm always beginner and a big fan of pseudo code ? Please provide me one and I will implement it

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 21, 2022 09:01 AM

Well, let's make a simple, one dimensional example, replacing triangles with points.
And we want to bin those points into tiles of size 16.

#include <iostream>
#include <array>
#include <vector>


int main()
{
  std::array<int, 6> points = {0x04, 0x23, 0x25, 0x75, 0x13, 0x9a};
  
  constexpr int NUM_BINS = 16;
  int binCount[NUM_BINS] = {};
  int pointListIndex[points.size()] = {};
  
  constexpr int BIN_SIZE = 16;
  for (int i=0; i<points.size(); i++) 
  {
      int binIndex = points[i] / BIN_SIZE;
      
      pointListIndex[i] = binCount[binIndex]; // index of this point in its future bin, used later
      
      binCount[binIndex]++; // if this loop is parallilized, we would need an atomic add here
      // but becasue we can't, you need to rewrite the loop like so:
      // foreach bin { foreach point { if (point inside bin) count++ } }
  }  
  
  std::cout << "\nbinCount:\n";
  for (int i=0; i<NUM_BINS; i++) std::cout << binCount[i] << " ";
  // 1 1 2 0 0 0 0 1 0 1 0 0 0 0 0 0
  
  
  // prefix sum over binCount...
  
  int prefixBinCount[NUM_BINS+1];
  prefixBinCount[0] = 0;
  for (int i=0; i<=NUM_BINS; i++)
  {
      prefixBinCount[i+1] = prefixBinCount[i] + binCount[i];
  }
  
  std::cout << "\nprefixBinCount:\n";
  for (int i=0; i<=NUM_BINS; i++) std::cout << prefixBinCount[i] << " ";
  // 0 1 2 4 4 4 4 4 5 5 6 6 6 6 6 6 6
  
  // now we know we need 6(+1) vaulues fo our compacted point per bin lists (which could be more if one point could go into multiple bins!)
  // and we also know where the list of each bin stasts and ends
  
  
  // create compacted lists...
  
  std::vector<int> pointPerBinLists (prefixBinCount[NUM_BINS]);
  
  for (int i=0; i<points.size(); i++) 
  {
      int binIndex = points[i] / BIN_SIZE;
      int binListBegin = prefixBinCount[binIndex];
      
      pointPerBinLists[binListBegin + pointListIndex[i]] = i;
  }
  
  // done.
  // now we can iterate points per bin:
  
  for (int binI = 0; binI<NUM_BINS; binI++)
  {
      int listBegin = prefixBinCount[binI];
      int listEnd = prefixBinCount[binI+1];
      
      if (listBegin==listEnd)
        continue;
      
      std::cout << "\npoints in bin " << binI << ":\n";
      
      for (int i=listBegin; i<listEnd; i++)
      {
          std::cout << pointPerBinLists[i] << " ";
      }
        
  }
    /*
    points in bin 0:
    0 
    points in bin 1:
    4 
    points in bin 2:
    1 2 
    points in bin 7:
    3 
    points in bin 9:
    5  
    */
}

JoeJ

4,264

September 21, 2022 09:13 AM

It's not super obvious to understand, i guess. But that's the simplest example i can do.

I really recommend the compute shader chapter from OpenGL Super Bible, independent of API. It really nails to explain parallel programming quickly and with good examples.

To parallelize this, it becomes more complicated, so make sure to understand this example first. And i would port just that to OpenCL first, caring about 2D and triangles later.

The main difference is here:

for (int i=0; i<=NUM_BINS; i++)
  {
      prefixBinCount[i+1] = prefixBinCount[i] + binCount[i];
  }

Some OpenCL code to do such prefix sum in parallel looks like so for example:

		_rayCount[(((lID >> 0) << 1) | (lID &  0) |  1)]	+= _rayCount[(((lID >> 0) << 1) |  0)];		_rayCount[((((lID+64) >> 0) << 1) | ((lID+64) &  0) |  1)]	+= _rayCount[((((lID+64) >> 0) << 1) |  0)];	BARRIER_LOCAL
		_rayCount[(((lID >> 1) << 2) | (lID &  1) |  2)]	+= _rayCount[(((lID >> 1) << 2) |  1)];		_rayCount[((((lID+64) >> 1) << 2) | ((lID+64) &  1) |  2)]	+= _rayCount[((((lID+64) >> 1) << 2) |  1)];	BARRIER_LOCAL
		_rayCount[(((lID >> 2) << 3) | (lID &  3) |  4)]	+= _rayCount[(((lID >> 2) << 3) |  3)];		_rayCount[((((lID+64) >> 2) << 3) | ((lID+64) &  3) |  4)]	+= _rayCount[((((lID+64) >> 2) << 3) |  3)];	BARRIER_LOCAL
		_rayCount[(((lID >> 3) << 4) | (lID &  7) |  8)]	+= _rayCount[(((lID >> 3) << 4) |  7)];		_rayCount[((((lID+64) >> 3) << 4) | ((lID+64) &  7) |  8)]	+= _rayCount[((((lID+64) >> 3) << 4) |  7)];	BARRIER_LOCAL
		_rayCount[(((lID >> 4) << 5) | (lID & 15) | 16)]	+= _rayCount[(((lID >> 4) << 5) | 15)];		_rayCount[((((lID+64) >> 4) << 5) | ((lID+64) & 15) | 16)]	+= _rayCount[((((lID+64) >> 4) << 5) | 15)];	BARRIER_LOCAL
		_rayCount[(((lID >> 5) << 6) | (lID & 31) | 32)]	+= _rayCount[(((lID >> 5) << 6) | 31)];		_rayCount[((((lID+64) >> 5) << 6) | ((lID+64) & 31) | 32)]	+= _rayCount[((((lID+64) >> 5) << 6) | 31)];	BARRIER_LOCAL
		_rayCount[(((lID >> 6) << 7) | (lID & 63) | 64)]	+= _rayCount[(((lID >> 6) << 7) | 63)];		_rayCount[((((lID+64) >> 6) << 7) | ((lID+64) & 63) | 64)]	+= _rayCount[((((lID+64) >> 6) << 7) | 63)];	BARRIER_LOCAL
		_rayCount[(((lID >> 7) << 8) | (lID &127) |128)]	+= _rayCount[(((lID >> 7) << 8) |127)];		_rayCount[((((lID+64) >> 7) << 8) | ((lID+64) &127) |128)]	+= _rayCount[((((lID+64) >> 7) << 8) |127)];	BARRIER_LOCAL

Iirc, this does some prefix sum over 64 elements, where we need 8 iterations using 64 threads (lID is thread index).
I've unrolled the loop manually - could be less code. Just to give an impression.
But the point is: We need only 8 iterations instead 64, although with each iteration one half of threads becomes idle and we do much more operations than with the simple single threaded C++ loop.
So that's very useful very often.

JoeJ

4,264

September 21, 2022 09:18 AM

Oh, btw, i had to report two driver bugs to AMD with such prefix sum code. Once for OpenCL, and another time for Vulkan.
For some reason this seams a stress test for compiler optimization issues.

I guess there is a risk you face some issues here as well. But it's really a building block, so it should work.

AhmedSaleh

192

Author

September 21, 2022 11:22 AM

@JoeJ

Thank you Joe for your awesome answer. I'm just confused, what is the point of prefix sum ?

Game Programming is the process of converting dead pictures to live ones .

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines