AhmedSaleh said:
How?
Again, the simplest idea I have is:
JoeJ said:
hmmm… following that Epic example, one idea would be to bin or sort the triangles by pixel area, so triangles nearby in the sorted list cover a similar number of pixels. That's one dispatch for the binning or sorting. A second dispatch then processes the triangles from the sorted list: each thread in a workgroup rasterizes one triangle, and since neighboring triangles have a similar pixel count, all threads finish at nearly the same time and saturation is good.

This would work with both the scanline and the edge-equations approach (for the latter you would sort by the area of the bounding rectangle, not of the triangle). Not sure which is better: the latter wastes half the work on processing pixels which are not inside the triangle, the former has more code divergence due to its complexity. It might not matter too much.

That's very simple and not that bad. But the downside is that nearby threads will write to different locations in the framebuffer. Maybe, due to the atomic access, this does not make it much slower than it already is. Though I guess for just a single cow model it might not beat your current approach; the scene would need to be more complex to make it a win.
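To make the sorting pass above concrete, here is a CPU-side Python sketch of it. Everything here is illustrative: the triangle data, the `bbox_area` helper, and the workgroup size are assumptions, and on a GPU this would of course be a parallel sort or binning dispatch rather than a serial `sorted` call.

```python
# Sketch: sort triangles by screen-space bounding-rectangle area so that
# each workgroup rasterizes triangles of similar pixel count (one triangle
# per thread), which keeps thread runtimes within a workgroup balanced.

def bbox_area(tri):
    """Pixel area of the axis-aligned bounding rectangle of a 2D triangle."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def sort_and_group(triangles, workgroup_size=64):
    """Return triangle indices sorted by area, chunked one list per workgroup."""
    order = sorted(range(len(triangles)), key=lambda i: bbox_area(triangles[i]))
    return [order[i:i + workgroup_size]
            for i in range(0, len(order), workgroup_size)]

# Example: triangles of very different size end up grouped small -> large.
tris = [
    [(0, 0), (100, 0), (0, 100)],  # large  (bbox area 10000)
    [(0, 0), (2, 0), (0, 2)],      # tiny   (bbox area 4)
    [(0, 0), (20, 0), (0, 20)],    # medium (bbox area 400)
]
groups = sort_and_group(tris, workgroup_size=2)
print(groups)  # [[1, 2], [0]]
```

The second dispatch would then read its triangle index from the sorted list, so thread N of workgroup W processes `groups[W][N]`.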
Analyzing theoretical performance: we process every triangle once for the binning (or sorting), and a second time to draw it.
Rendering performance is as usual with rasterization: the number of processed pixels depends on overdraw.
Implementation is easy.
But because each thread draws one whole triangle, we need many triangles to saturate a GPU.
(That's probably also why the GitHub project shows no impressive performance: they also draw one triangle per thread and test with single low-poly models.)
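A back-of-envelope calculation shows why a single low-poly model can't saturate the GPU with one triangle per thread. The hardware numbers below are illustrative assumptions, not measurements of any specific GPU:

```python
# Rough estimate of how many triangles are needed just to fill the machine
# once, when each thread rasterizes exactly one triangle.
# All hardware numbers are assumed for illustration.

compute_units = 40            # SMs / CUs on a midrange GPU (assumed)
resident_warps_per_unit = 32  # warps kept in flight to hide latency (assumed)
threads_per_warp = 32

threads_in_flight = compute_units * resident_warps_per_unit * threads_per_warp
print(threads_in_flight)  # prints 40960
```

So with these assumed numbers, a scene needs on the order of tens of thousands of triangles before every thread even gets one, which a single cow model won't provide.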