Batching sprites that need different render settings


I am rendering sprites in 2D and am finding a lot of difficulties in batching them together.

Texture switches can be avoided by texture atlasing (at the cost of figuring out potential bleed/mip map issues), but as I add more complex effects, there are shader switches (sometimes with different vertex layouts/data), and notably a lot of switches between alpha and additive blending (since alpha-A, additive-B, alpha-C is a different result to say alpha-A, alpha-C, additive-B).

I know in 2D I could probably avoid some of these issues by moving things to common layers, e.g. all alpha-blended game objects with atlas, all additive particles, all alpha (e.g. smoke/dust) particles on top. But I wonder if it can be solved for arbitrary z/depth/order, since in 3D rendering that layering approach doesn't seem to be an option.

Having played around in a number of 3D games, I couldn't find any noticeable artifacts, so I feel they must be sorting all the transparent stuff in some efficient way (e.g. I positioned the camera so that additive and alpha sprites, and other transparent objects in the scene, changed relative order, to see if I could catch something odd like an additive explosion/spark/weapon effect/etc. drawn on top of dust/smoke/glass window/etc. that should be in front). That held even with smoke trails from fire/damage/missiles/etc. that are individual sprites rather than, say, a line strip, and explosions clearly made from many smaller sprites.

My original design was to render this stuff back to front (easy in 2D), filling up a vertex buffer and then issuing a draw call per state change, in the original order with no extra sorting. But switching between alpha/additive and similar breaks the batches into thousands of draw calls.
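Roughly like this (a simplified sketch, using the same helper names as in the code further down):

// Sketch of the original design: _items is already in back-to-front
// order, so just cut a new draw call whenever the material changes.
void flush_in_order()
{
    if (_items.empty())
        return;
    start_batch(_items[0].material);
    add_to_batch(_items[0]);
    for (size_t i = 1; i < _items.size(); ++i)
    {
        if (_items[i].material != _items[i - 1].material)
        {
            flush_batch(); // Draw call; thousands of these when blend modes alternate
            start_batch(_items[i].material);
        }
        add_to_batch(_items[i]);
    }
    flush_batch();
    _items.clear();
}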

I then thought to myself that most of the time sprites do not overlap (especially when zoomed out, which is when the total on-screen object count can get really high). So I thought: put all the items into a buffer, then batch as well as possible. Batching is much improved, since overlaps are rare when zoomed out, but putting together the non-overlapping batches is extremely expensive.

Is there a better way to make the batches? Of course the CPU has multiple cores, but splitting this aspect of rendering up seems tricky and probably the wrong way to go, as does just generally searching for optimisations of the loops, memory layout, etc.

// _items is an array populated with all the scene sprites to render
// Currently each item is just 4 vertices for a sprite plus its texture. Potentially will need to become more complex for arbitrary polygons (e.g. line strips), and where I need different vertex data (some effects), multiple textures, etc.
// _items[x].rendered is made true by add_to_batch
// Populating _items with the scene contents takes 3.43% total CPU time according to the VS2019 profiler (on a release build).
// material is a combination of texture, shaders, blend mode, etc.
void flush() // 67% total CPU 
{
    bool not_rendered;
    size_t start = 0;
    do
    {
        not_rendered = false;
        for (size_t i = start; i < _items.size(); ++i) // 0.01% CPU
        {
            if (!_items[i].rendered)
            {
                auto &this_mat = _items[i].material;
                start_batch(this_mat); // Setting shaders, textures, etc. 0.84% CPU
                add_to_batch(_items[i]); // Adding sprite vertices to buffer, 0.10% CPU

                for (size_t j = i + 1; j < _items.size(); ++j) // 5.80%
                {
                    if (!_items[j].rendered) // 1.37%
                    {
                        if (_items[j].material != this_mat)
                        {
                            if (!not_rendered)
                            {
                                not_rendered = true;
                                start = i + 1;
                            }
                        } // 1.21%
                        else
                        {
                            bool can_batch = true;
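                            // The still-unrendered items before j will be drawn
                            // in a *later* batch; if any of them has a different
                            // material and overlaps j, pulling j forward into this
                            // batch would change the blend order, so skip it.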
                            for (size_t k = start; k < j; ++k) // 21.53%
                            {
                                if (!_items[k].rendered) // 26.81%
                                {
                                    if (_items[k].material != this_mat) // 0.91%
                                    {
                                        if (overlaps(_items[j].bbox, _items[k].bbox)) // 1.69%
                                        {
                                            can_batch = false;
                                            break;
                                        }
                                    }
                                }
                            }
                            if (can_batch)
                            {
                                add_to_batch(_items[j]); // 1.38%
                            }
                            else if (!not_rendered) // 0.67%
                            {
                                not_rendered = true;
                                start = i + 1;
                            }
                        }
                    }
                }
                flush_batch(); // Actual draw call, 0.20%
                break;
            }
        }
    } while (not_rendered);
    _items.clear();
}
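For reference, `overlaps` is just an axis-aligned bounding-box intersection test, roughly like this (the real bbox struct may differ; this is just to make the logic concrete):

// Simple min/max AABB and intersection test (touching edges count as overlap).
struct bbox_t { float min_x, min_y, max_x, max_y; };

bool overlaps(const bbox_t &a, const bbox_t &b)
{
    return a.min_x <= b.max_x && b.min_x <= a.max_x &&
           a.min_y <= b.max_y && b.min_y <= a.max_y;
}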

Being an RTS-like thing there are a lot of objects, and adding damage, debris, trail, etc. particle effects inflates the number of things to draw greatly. This can vary a lot by just tweaking a few things, but it can easily be in the thousands when zoomed out a bit, as distant particles can be just a few dozen pixels across. A test scene I made has some 5,000 sprites, many of them only 16x16 or so at that zoom level (including particles; possibly a bit extreme, but it was intended to stress the system, as lots of computers are somewhat slower than mine). It takes my 2700X about 20ms to process on one core (50fps).


Don't many games just use order-independent blending? Something like an average? The tile-based renderer on the SEGA Dreamcast can sort by z per tile. Modern graphics cards also like to draw tile by tile, just in a hierarchical manner, so a global sort by z like on a PSX is bad. So with a blend mode where smoke hides stuff behind it, do it this way?

I dunno, it is 2020, do we still need to batch shaders? Maybe Vulkan exposes some synergies, and then use some blend to minimize x,y change, texture change (location in a large texture also affects the cache), z-order vs overlap, shader change.

Can we get some info about the number of pipelines? Give every pipeline a different shader, let them draw to close u,v texels and close x,y pixels for good cache, but not too close x,y px to avoid collision.

So some clarification: I was testing with D3D11 and OpenGL. I didn't try D3D12/Vulkan just yet; I was under the impression they wouldn't handle thousands of per-frame state changes and single-quad draws all that much better without batching.

Although things like being able to dynamically use different textures (sprites) via vertex data, of different sizes and without an atlas (and its problems), definitely sound nice.

And the problems I was running into are almost entirely CPU performance (e.g. see the profiler annotations on that source code). GPU utilization was very low; I don't think it even boosted to its normal max clocks, although that will probably change once I actually do some better testing (testing on a 2700X + Vega 64 and getting 50fps with a game I would like to run on an Intel laptop iGPU a few years old).

EDIT: Was trying to keep this all short… So to be clear, the original “draw back to front and end-batch/draw every time a state changes” approach was about 5ms frame time on D3D11 (didn't measure my OpenGL solution), so 4X faster than my “optimisation attempt”; I'm not saying draw calls are way too expensive. But it was still very much CPU limited: the GPU didn't ramp up to its full clock, and the CPU profiler put about 30% of the total time on the various state setting + draw calls. I tested separately on an i7 4790 and that did a lot less well.

arnero said:
Don't many games just use order-independent blending? Something like an average?

Not sure what sort of average you are thinking of? With lots more researching I read that some games use screen-door transparency, especially for distant-object fade in/out, as the player probably won't look too closely, possibly with a post-process blur effect that will mostly hide it (but that prevents transparent objects having any sharp edges/details).

As for truly order-independent transparency? I only found a few research examples, like storing a per-pixel linked list of fragments that can be sorted and merged later, which doesn't sound particularly performant, and I haven't looked at how to implement it (e.g. you can't just write to a single render target, as that doesn't keep a linked list of every output for each pixel position).

"Depth peeling" looks maybe more viable, but not sure how much and couldn't quickly find examples of games doing it, just some explanations and research/gpu examples. Maybe the cost of the needed number of “passes” for good results is too great.

arnero said:
tile-based renderer on the SEGA Dreamcast can sort by z per tile

Not at all familiar with that. I suspect that when dealing with tiles, rather than arbitrarily positioned, rotated and sized sprites, it is a lot easier to implement the “what can this overlap with” logic, which is where my code above sunk 50% of my total CPU power: if every tile knows the list of sprites for that tile, I can just loop `tiles[x][y].sprites` to check. Something like the sketch below.
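A rough sketch of that idea applied to my sprites (assuming a fixed screen-space grid and reusing the bbox_t from earlier; the grid resolution is made up):

#include <algorithm>
#include <cstddef>
#include <vector>

// Bucket each sprite's bbox into a coarse screen-space grid so the
// "does an unrendered, different-material sprite overlap this one?"
// test only scans a few cells instead of every earlier sprite.
struct sprite_grid
{
    static const int W = 32, H = 32;   // grid resolution, tune to taste
    std::vector<size_t> cells[W][H];   // indices into _items

    void insert(size_t item, const bbox_t &b, float screen_w, float screen_h)
    {
        int x0 = std::max(0, (int)(b.min_x * W / screen_w));
        int x1 = std::min(W - 1, (int)(b.max_x * W / screen_w));
        int y0 = std::max(0, (int)(b.min_y * H / screen_h));
        int y1 = std::min(H - 1, (int)(b.max_y * H / screen_h));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                cells[x][y].push_back(item); // a sprite can land in several cells
    }
};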

arnero said:
Modern graphics cards also like to draw tile by tile, just in a hierarchical manner.

Not entirely sure what you mean by that. I know that once they get to the pixel/fragment shader they render the triangles in a tile-like manner, while the result is still “as if” the original vertex ordering was used (so if I have a vertex buffer with two overlapping sprites and draw the buffer, my understanding is it will always render the second correctly on top of the first, even though it might not actually run the pixel/fragment shader entirely in that order). But I need the stuff to be in a single draw call for that to work, so draw calls must be Z-sorted, but vertices within them don't need to be?

e.g. say I have top-left (red), bottom (blue) and top-right (red) sprites. The red boxes are exactly the same, but red before blue, and blue before red, give very different results.

arnero said:
I dunno, it is 2020, do we still need to batch shaders? Maybe Vulkan exposes some synergies, and then use some blend to minimize x,y change, texture change (location in a large texture also affects the cache), z-order vs overlap, shader change.

I think maybe not as much as before, but doing a lot of state sets (blend mode, maybe shader, vertex buffer/different layout, multiple textures, etc.) and then rendering 1 or 2 quads, thousands of times per frame, slowed down a lot in both D3D11 and OpenGL for me. I didn't actually try porting the code to Vulkan (or D3D12), but I didn't think they liked being given individual small quads thousands of times separately that much more.

arnero said:
Can we get some info about the number of pipelines? Give every pipeline a different shader, let them draw to close u,v texels and close x,y pixels for good cache, but not too close x,y px to avoid collision.

Not sure what you are getting at here. But sorting by UV/XY in addition to my problem with sorting overlapping sprites seems like an even bigger CPU sink?

So I had a bit of a mental breakthrough after a lot of failed searching (probably my search terms just sucked :( ). It looks like I can combine alpha and additive into a single draw without needing to change blend states. The secret is a close look at the pre-multiplied alpha formula.

Pre-multiplied alpha needs a blend state that does `result.rgb = source.rgb + dest.rgb * (1 - source.a)` with `source.rgb = source.rgb * source.a` done in advance. And additive needs a blend state of `result.rgb = source.rgb + dest.rgb`.

But what if I set all the alpha values to zero in a texture containing a sprite I know (at load time) will always use additive blending, and then use the pre-multiplied alpha blend state? Now I get `result.rgb = source.rgb + dest.rgb * (1 - 0)`, which is the same as `result.rgb = source.rgb + dest.rgb`, so I can use the one pre-multiplied alpha blend state for everything, with alpha and additive sprites in the same atlas.
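In D3D11 terms that is a single blend state for both kinds of sprite, something like this (a sketch; `device` is the usual ID3D11Device, and the alpha-channel lines shown are just one common choice for pre-multiplied targets):

// One blend state for alpha *and* additive sprites: textures are
// pre-multiplied at load time, and additive sprites get alpha forced
// to 0 so that (1 - source.a) becomes 1.
D3D11_BLEND_DESC desc = {};
desc.RenderTarget[0].BlendEnable = TRUE;
desc.RenderTarget[0].SrcBlend = D3D11_BLEND_ONE;            // source.rgb already multiplied by source.a
desc.RenderTarget[0].DestBlend = D3D11_BLEND_INV_SRC_ALPHA; // dest.rgb * (1 - source.a)
desc.RenderTarget[0].BlendOp = D3D11_BLEND_OP_ADD;
desc.RenderTarget[0].SrcBlendAlpha = D3D11_BLEND_ONE;
desc.RenderTarget[0].DestBlendAlpha = D3D11_BLEND_INV_SRC_ALPHA;
desc.RenderTarget[0].BlendOpAlpha = D3D11_BLEND_OP_ADD;
desc.RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL;

ID3D11BlendState *blend_state = nullptr;
device->CreateBlendState(&desc, &blend_state);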

I haven't looked yet at what `result.a` needs to be (for off-screen rendering), but it looks very hopeful, and that gets rid of like 99% of the state changes I was having trouble with...

There are still a lot of switches due to special shader effects I wrote (like shield-bubble impact stuff that overlays multiple textures (or parts of a texture atlas) and wants things like the centre coordinates, extra tex coords, etc. in the vertex data). Possibly I could combine all these shaders into one and use the “largest vertex” for everything, and hopefully the dynamic branching and a lot of “empty” vectors in the vertices won't be a problem.
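Something along these lines (a hypothetical layout just to illustrate; the field names and sizes are made up):

// Hypothetical "largest vertex": every effect shares one layout, unused
// fields stay zeroed, and effect_id drives the branch in the combined shader.
struct sprite_vertex
{
    float         pos[2];    // position
    float         uv0[2];    // primary atlas coords
    float         uv1[2];    // secondary texture coords (shield overlay etc.), often zero
    float         centre[2]; // effect centre (shield impact), often zero
    unsigned char rgba[4];   // per-sprite tint
    unsigned int  effect_id; // selects the code path in the combined shader
};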

>CPU bound.
oh

https://en.wikipedia.org/wiki/Order-independent_transparency
Uh, hmm a hack. But I guess it works well for fog (everything is white), or glow (everything is added).

