🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Back to Graphics and GPU Programming

parallelizing a software rasterizer algorithm using opencl

AhmedSaleh · 2022-09-21T11:59:52

I've written a small software rasterizer using OpenCL and would like to optimize and parallelize it more, currently I'm scanning the whole screen and see if the triangle overlaps with the pixels.. I would like to parallelize the loop and do it more efficiently. For example in my idea, is to process only the bounding box pixels.. ? __kernel void sendImageToPBO(__global uchar4* dst_buffer, __global float* vbo, int vbosize, __global int* ibo, int ibosize) { size_t blockIdx = get_group_id(0); size_t blockIdy = get_group_id(1); size_t blockDimX = get_local_size(0); size_t blockDimY = get_local_size(1); size_t threadIdX = get_local_id(0); size_t threadIdY = get_local_id(1); float3 c0 = { 1, 0, 0 }; float3 c1 = { 0, 1, 0 }; float3 c2 = { 0, 0, 1 }; int x = get_global_id(0); int y= get_global_id(1); int imageWidth = 800; int imageHeight = 800; if (x < vbosize && y < vbosize) { for (int i = 0; i < vbosize; i += 9) { float3 v1 = (float3)(vbo[i], vbo[i + 1], vbo[i + 2]); float3 v0 = (float3)(vbo[i + 3], vbo[i + 4], vbo[i + 5]); float3 v2 = (float3)(vbo[i + 6], vbo[i + 7], vbo[i + 8]); float xmin = fmin(v0.x, fmin(v1.x, v2.x)); float ymin = fmin(v0.y, fmin(v1.y, v2.y)); float xmax = fmax(v0.x, fmin(v1.x, v2.x)); float ymax = fmax(v0.y, fmin(v1.y, v2.y)); // be careful xmin/xmax/ymin/ymax can be negative. Don't cast to unsigned int unsigned int x0 = max(0, (int)(floor(xmin))); unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax))); unsigned int y0 = max(0, (int)(floor(ymin))); unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax))); float3 p = { x + 0.5f, y + 0.5f, 0 }; float w0 = edgeFunction(v1, v2, p); float w1 = edgeFunction(v2, v0, p); float w2 = edgeFunction(v0, v1, p); if (w0 >= 0 && w1 >= 0 && w2 >= 0) { float area = edgeFunction(v0, v1, v2); float r = w0 * c0.x + w1 * c1.x + w2 * c2.x; float g = w0 * c0.y + w1 * c1.y + w2 * c2.y; float b = w0 * c0.z + w1 * c1.z + w2 * c2.z; w0 /= area; w1 /= area; w2 /= area; float z = 1 / (w0 * v0.z + w1 * v1.z + w2 * v2.z); r *= z, g *= z, b *= z; dst_buffer[y * get_global_size(0) + x] = (uchar4)(r * 255, g * 255, b * 255, 255); } } }

Graphics and GPU Programming Programming

Started by AhmedSaleh September 06, 2022 05:31 PM

58 comments, last by JoeJ 1 year, 9 months ago

AhmedSaleh

192

Author

September 10, 2022 12:43 PM

@joej

What I have noticed and is strange, while rendering a cube I get 44FPS, but while rendering the cow, I get +100 FPS, how is that possible ?

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 10, 2022 01:20 PM

Surely because the cube has only 12 triangles, so only 12 threads do all the work.
Basically you would need a second method to handle large triangles.

Looks nice ; )

What platform do you have which does not support atomics? :O
That's a big problem, because you need them for the depth test as well.

Without atomics, a completely different and inferior approach is needed.
Actually i would suggest raytracing then.

AhmedSaleh

192

Author

September 12, 2022 01:13 PM

@JoeJ

I have implemented the sorting by area method.

What I have seen is a cube is rendered at 5FPS, while the COW is at 145 FPS, no difference for the cow, but it's very slow for the cube.

Why is that ?

Also I see some flickering over the cow, some triangles are flickering.

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 12, 2022 02:00 PM

AhmedSaleh said:
What I have seen is a cube is rendered at 5FPS, while the COW is at 145 FPS, no difference for the cow, but it's very slow for the cube.

The sorting has a cost, which reduces FPS?
As said before, the sorting will only show a win in a complex and large scene, where triangles have different sizes. But for the cow model all triangles have similar size on the model.
Currently, triangles are smaller on screen only if they are seen from the side. Now we can render them faster, but the cost of sorting seemingly cancels the win.

You should measure execution times of all your kernels separately, so you know the cost of sorting, rasterization, etc.
By looking just at FPS, you get a combined cost of everything, which makes optimizing pretty hard.
The best way is to use profiling tools, which also give important details about occupancy, register pressure, LDS usage, cache misses, etc.
This helps to identify where the problems are. Otherwise you work basically blind folded and guessing is all you can do.
If no profiling tools are available for your platform, at least measure time per kernel. But i don't remember if OpenCL can do this. Usually it's done suing ‘tiemstamps’.

Also I see some flickering over the cow, some triangles are flickering.

To prevent flicker, you must implement a depth test, which you can do efficiently only with atomics.
Sorting may cause more flicker than not sorting, but it's not the reason.
The reason is that multiple threads write to the same pixels, so a distant triangle can overwrite a close triangle.
We have no control of execution order of kernel invocations. It depends on hardware, and flicker is expected.

So again, what hardware do you use? I did not know there exists any hardware which does not support atomics, and actually i doubt that's the case.
If you can't tell the hardware, check out if it has support to shard memory (LDS) eventually, if it indeed lacks support for global memory (VRAM). This would open up a solution at least.

I remember, NVidias Kepler GPUs had no support for atomics to LDS, so they implemented a work around by emulating that with atomics to VRAM in the driver.
This was slow, but it worked, and the programmer did not notice the missing hardware support other than from performance.
Thus it really is unlikely a HW vendor makes OpenCL drivers lacking atomics, and i'm really curious about your hardware…

JoeJ

4,264

September 12, 2022 02:17 PM

Oh, one thing you could try is to sort the triangles by distance to the camera instead by area.
We loose the optimization, but if you're really lucky, your hardware processes the triangles in the same order the kernels are dispatched, so the sorting order would be preserved and you get a depth test.

Even if that works, it will still cause artifacts on intersection triangles. But many early 3D games ignored this and results were good enough in practice.

AhmedSaleh

192

Author

September 13, 2022 06:41 PM

@JoeJ

I'm currently working on that GPU, it's a custom based GPGPU, and I'm doing the rasterizer for it as hobby task.

They wrote on their research paper that they did it, so I wanted to take the challenge and do it on my own too ?

vortexgpgpu/vortex (github.com)

Regarding the sorting by area, I put a complex model, still it's slower than the plain scan line.

I will try the binning by nVidia research paper and will tell you the progress.

But Many thanks for your cooperation and your thoughts, you always come to rescue and give ideas to my work as usual.

Game Programming is the process of converting dead pictures to live ones .

JoeJ

4,264

September 13, 2022 07:54 PM

So that's some open source GPU implemented on programmable hardware? Really cool! : )

I remember this recent ray tracing tutorial: https://jacco.ompf2.com/2022/06/03/how-to-build-a-bvh-part-9a-to-the-gpu/
RT becomes interesting because it surely needs no atomics for anything, but not sure about expected performance.
For binning i guess they use atomics too, but sure worth a look as well.

AhmedSaleh

192

Author

September 15, 2022 06:27 AM

@JoeJ

Hello,

hope you're doing fine. The basic scanline triangle works on the simulator of the GPU (PC Simulator) it takes 1 second for a 512*512 framebuffer, for a cube. and for cow it takes 13 seconds…

Would you think that tilemap rendering enhance the performance? If not, what would enhance the performance to overcome at least 20ms rendering ?

Game Programming is the process of converting dead pictures to live ones .

AhmedSaleh

192

Author

September 15, 2022 06:49 AM

Also another issue,

Increasing the number of threads and warps, make the performance worth in the simulator… is there a solution for that to get benefits of multiple threads ?
-DNUM_CORES=1 -DNUM_WARPS=4 -DNUM_THREADS=1 -DL2_ENABLE=1 -DL3_ENABLE=1

for example using 1 thread, its very fast but no rendering comes out

Here is how I execute the kernel

global_size[0] = 800;

global_size[1] = 800;


auto time_start = std::chrono::high_resolution_clock::now();

CL_CHECK(clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL));

Game Programming is the process of converting dead pictures to live ones .

AhmedSaleh

192

Author

September 15, 2022 06:58 AM

here is the kernel:

#pragma OPENCL EXTENSION cl_amd_printf : enable                                


struct AABB {
	float min[3];
	float max[3];
};

struct  VertexIn {
	float pos[3];
	float nor[3];
	float col[3];
};

struct  VertexOut {
	float pos[3];
	float nor[3];
	float col[3];
	float mpos[3];
};
struct Triangle_ {
	struct VertexOut v[3];
	struct  AABB box;
	bool isPoint;
	bool isLine;
	bool isValidGeom;
	float signedArea;
	float minDepth;
};

struct Fragment {
	float pos[3];
	float nor[3];
	float col[3];
	bool isCovered;
};


//LOOK: checks if a barycentric coordinate is within the boundaries of a triangle
bool isBarycentricCoordInBounds(float3 barycentricCoord) {
	return barycentricCoord.x >= 0.0 && barycentricCoord.x <= 1.0 &&
		barycentricCoord.y >= 0.0 && barycentricCoord.y <= 1.0 &&
		barycentricCoord.z >= 0.0 && barycentricCoord.z <= 1.0;
}
float edgeFunction(float3 a, float3 b, float3 c)
{
	return (c.x - a.x) * (b.y - a.y) - (c.y - a.y) * (b.x - a.x);
}


__kernel void VertexShaderKernel(__global float* vbo, int vbosize
)
{
	size_t threadIdX = get_local_id(0);
	size_t threadIdY = get_local_id(1);
	size_t blockIdx = get_group_id(0);
	size_t blockIdy = get_group_id(1);
	size_t blockDimX = get_local_size(0);
	size_t blockDimY = get_local_size(1);


	int i = (blockIdx * blockDimY) + threadIdX;


}
 bool PointInTriangle(float3 p, float3 p0, float3 p1, float3 p2)
{
	float s = (p0.x - p2.x) * (p.y - p2.y) - (p0.y - p2.y) * (p.x - p2.x);
	float t = (p1.x - p0.x) * (p.y - p0.y) - (p1.y - p0.y) * (p.x - p0.x);

	if ((s < 0) != (t < 0) && s != 0 && t != 0)
		return false;

	float d = (p2.x - p1.x) * (p.y - p1.y) - (p2.y - p1.y) * (p.x - p1.x);
	return d == 0 || (d < 0) == (s + t <= 0);
}
 float min3(const float  a, const float  b, const float c)
 {
	 return fmin(a, fmin(b, c));
 }

 float max3(const float a, const float b, const float c)
 {
	 return fmax(a, fmax(b, c));
 }

__kernel void sendImageToPBO(__global uchar4* dst_buffer, __global struct Triangle_* triangles, int triCount)

{
	size_t blockIdx = get_group_id(0);
	size_t blockIdy = get_group_id(1);
	size_t blockDimX = get_local_size(0);
	size_t blockDimY = get_local_size(1);
	size_t threadIdX = get_local_id(0);
	size_t threadIdY = get_local_id(1);

	int xt = (blockIdx * blockDimX) + threadIdX;
	int yt = (blockIdy * blockDimY) + threadIdY;
    
	int imageWidth = 800;
	int imageHeight = 800;
	int index = xt + (yt * imageWidth);
    
	
	float3 c0 = { 1, 0, 0 };
	float3 c1 = { 0, 1, 0 };
	float3 c2 = { 0, 0, 1 };

	int x_pos = get_global_id(0);
	int y_pos = get_global_id(1);

	if (index < triCount)
	{
	

		float3 v0Raster = (float3)(triangles[index].v[0].pos[0], triangles[index].v[0].pos[1], 0);
		float3 v1Raster = (float3)(triangles[index].v[1].pos[0], triangles[index].v[1].pos[1], 0);
		float3 v2Raster = (float3)(triangles[index].v[2].pos[0], triangles[index].v[2].pos[1], 0);
		float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
		float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
		float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
		float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);


		// the triangle is out of screen
		if (xmin < imageWidth - 1 || xmax > 0 || ymin < imageHeight - 1 || ymax > 0)
		{

			// be careful xmin/xmax/ymin/ymax can be negative. Don't cast to uint32_t
			unsigned int x0 = max(0, (int)(floor(xmin)));
			unsigned int x1 = min((int)(imageWidth)-1, (int)(floor(xmax)));
			unsigned int y0 = max(0, (int)(floor(ymin)));
			unsigned int y1 = min((int)(imageHeight)-1, (int)(floor(ymax)));
			for (unsigned int y = y0; y <= y1; ++y) {
				for (unsigned int x = x0; x <= x1; ++x) {
					float3 p = { x + 0.5f, y + 0.5f, 0 };

					float w0 = edgeFunction(v1Raster, v2Raster, p);
					float w1 = edgeFunction(v2Raster, v0Raster, p);
					float w2 = edgeFunction(v0Raster, v1Raster, p);

					if (w0 >= 0 && w1 >= 0 && w2 >= 0) {

						float area = edgeFunction(v0Raster, v1Raster, v2Raster);

						float r = w0 * c0.x + w1 * c1.x + w2 * c2.x;
						float g = w0 * c0.y + w1 * c1.y + w2 * c2.y;
						float b = w0 * c0.z + w1 * c1.z + w2 * c2.z;

						w0 /= area;
						w1 /= area;
						w2 /= area;
						float z = 1 / (w0 * v0Raster.z + w1 * v1Raster.z + w2 * v2Raster.z);
						r *= z, g *= z, b *= z;


						dst_buffer[y * get_global_size(0) + x] = (uchar4)(0,  255, 0, 255);
					}
				}
			}
		}
	}

}

Game Programming is the process of converting dead pictures to live ones .

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

🎉 Celebrating 25 Years of GameDev.net! 🎉

parallelizing a software rasterizer algorithm using opencl

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines