AhmedSaleh said:
The basic scanline triangle rasterizer works on the GPU simulator (PC simulator). For a 512*512 framebuffer it takes 1 second for a cube, and 13 seconds for the cow… Do you think tilemap rendering would improve the performance? If not, what would improve it enough to reach at least 20 ms per frame?
To make use of tiles on HW which can't do atomics, the trivial approach may be:
Sort all triangles by depth.
Divide screen into M x N tiles.
Then one workgroup per tile:
One thread of the workgroup iterates over all triangles and tests whether each triangle intersects the frustum of the tile.
If so, all threads can rasterize it (like your very first approach).
This way we get some parallel rasterization, but everything else is bad. Every workgroup processes all triangles, and only one thread of the group does that work.
It won't scale with increasing scene complexity.
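To make the steps above concrete, here is a minimal single-threaded Python sketch of the idea, not GPU code. The names (`edge`, `inside`, `render_tiled`) and the dict keys (`verts`, `depth`, `color`) are made up for illustration. I assume that larger depth means farther away, that drawing back to front is acceptable, and that a bounding box vs. tile rectangle check is a good enough stand-in for the tile frustum test. The two outer loops play the role of the workgroups, the two inner loops the role of the threads.

```python
def edge(a, b, p):
    """Signed area term: positive when p lies to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def inside(tri, p):
    """True if point p is inside the 2D triangle (either winding order)."""
    a, b, c = tri["verts"]
    w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)


def render_tiled(tris, width, height, tile):
    fb = [[0] * width for _ in range(height)]
    # Sort all triangles by depth, far to near (assumption: larger = farther),
    # so nearer triangles are drawn last and overwrite farther ones.
    ordered = sorted(tris, key=lambda t: t["depth"], reverse=True)
    # Divide the screen into tiles; each (tx, ty) stands for one workgroup.
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            x1, y1 = min(tx + tile, width), min(ty + tile, height)
            # Every tile walks the complete triangle list: the scaling problem.
            for t in ordered:
                xs = [v[0] for v in t["verts"]]
                ys = [v[1] for v in t["verts"]]
                # Conservative overlap test: triangle bounding box vs. tile.
                if max(xs) < tx or min(xs) >= x1 or max(ys) < ty or min(ys) >= y1:
                    continue
                # On a GPU these pixels would be spread over the group's threads.
                for y in range(ty, y1):
                    for x in range(tx, x1):
                        if inside(t, (x + 0.5, y + 0.5)):
                            fb[y][x] = t["color"]
    return fb
```

No atomics are needed because each tile's pixels are written by exactly one group. The cost is visible in the loop structure: the work per tile grows with the total triangle count, no matter how few triangles actually touch that tile.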
And i fail to come up with a better approach. We could use something like spin locks to bin triangles to tiles for better parallelization, but spin locks serialize execution as well, so we would most likely run into similar issues.
It may work, but it still sucks conceptually. So i would not want to invest time at all.
That's why i would prefer raytracing. It's slow, but easily parallelizable, and it will scale well.
RT usually only shows a benefit if we need secondary rays, e.g. for shadows, reflections, AO, GI, etc. But without atomics it simply becomes the better concept.
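Why this parallelizes so easily can be shown with a small Python sketch. It is only an illustration under simplifying assumptions: rays are orthographic (one ray per pixel, going straight along +z), smaller z is nearer, and all names (`hit_depth`, `shade_pixel`, `render`) and dict keys are made up. There is no acceleration structure, so every pixel tests every triangle.

```python
def hit_depth(tri, px, py):
    """Depth where the orthographic ray through (px, py) hits tri, else None."""
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = tri["verts"]
    area = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    if area == 0:
        return None  # degenerate triangle
    # Barycentric weights of the point with respect to vertices a, b, c.
    w_a = ((bx - px) * (cy - py) - (by - py) * (cx - px)) / area
    w_b = ((cx - px) * (ay - py) - (cy - py) * (ax - px)) / area
    w_c = 1.0 - w_a - w_b
    if w_a < 0 or w_b < 0 or w_c < 0:
        return None  # ray misses the triangle
    return w_a * az + w_b * bz + w_c * cz


def shade_pixel(tris, x, y):
    """Color of the nearest hit for one pixel; reads shared data, writes nothing."""
    nearest, color = None, 0
    for t in tris:
        z = hit_depth(t, x + 0.5, y + 0.5)
        if z is not None and (nearest is None or z < nearest):
            nearest, color = z, t["color"]
    return color


def render(tris, width, height, order=None):
    """Shade pixels in any order; on a GPU each pixel would be one thread."""
    fb = [[0] * width for _ in range(height)]
    if order is None:
        order = [(x, y) for y in range(height) for x in range(width)]
    for x, y in order:
        fb[y][x] = shade_pixel(tris, x, y)  # each pixel is written exactly once
    return fb
```

The depth comparison happens in a local variable of the pixel's own loop, so two threads never compete for the same framebuffer entry. That is the part a rasterizer has to solve with a depth buffer and atomic behavior.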
Maybe i'm missing some options, but i would go so far as to say that efficient parallelized rasterization without atomics is not possible. Even early fixed-function GPUs had atomic behavior from the start, in the form of the depth buffer.
I'm sorry, but i can't help any further. Missing atomics really is a showstopper for a whole lot of things, rasterization being one of them.
AhmedSaleh said:
Increasing the number of threads and warps makes the performance worse in the simulator… Is there a solution for that, to get the benefits of multiple threads?
Not sure if the simulator can reproduce the performance behavior of the real hardware. I guess not really. Most likely the best settings have to be found by testing on the actual hardware.