
DirectX 12: How bad is a large root signature?


I've started converting my code from DirectX 11 to DirectX 12. In my old code I had a bunch of terrain chunks, and I would simply set a World/View matrix (I do projection separately) in a constant buffer before rendering each chunk, to place it in its correct position. This worked fine in DX11.

But now in DX12 you apparently should set all per-object constant buffer data in advance, so I would need to generate all matrices in advance, and I would also need to know how many of them there are in advance. This seems like kind of a pain to me. So I thought I would just put that data in the root signature as root constants as I go along, since I can put commands to set this stuff right in the command list. Note, there is also some other data I need for procedural shading that will go there too. A back-of-the-envelope calculation puts the data size at somewhere under 40 DWORDs. This is under the limit of 64 DWORDs, so I suppose it should work. However, I keep reading in places that root signatures should be as small as possible. One AMD page says to try to keep it under 13 DWORDs. I guess what I'm asking is: how important is this really? Does anyone have any insight into this?
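
For context, here's roughly what I mean by setting root constants per draw (a sketch with made-up names like ChunkConstants and kChunkConstantsParam, not my actual code):

    // Sketch of setting per-chunk data as root constants each draw. The real
    // struct would hold my World/View matrix plus the procedural shading data.
    #include <d3d12.h>
    #include <DirectXMath.h>

    struct ChunkConstants
    {
        DirectX::XMFLOAT4X4 worldView;  // 16 DWORDs
        float shadingParams[8];         // extra procedural-shading data, 8 DWORDs
    };
    static_assert(sizeof(ChunkConstants) / 4 <= 64, "root constants are capped at 64 DWORDs");

    // Root signature setup (done once): one root-constants parameter at b0.
    D3D12_ROOT_PARAMETER MakeRootConstantsParam()
    {
        D3D12_ROOT_PARAMETER param = {};
        param.ParameterType            = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
        param.Constants.ShaderRegister = 0;  // register(b0) in HLSL
        param.Constants.RegisterSpace  = 0;
        param.Constants.Num32BitValues = sizeof(ChunkConstants) / 4;
        param.ShaderVisibility         = D3D12_SHADER_VISIBILITY_ALL;
        return param;
    }

    // Per draw, while recording the command list:
    void DrawChunk(ID3D12GraphicsCommandList* cmdList, const ChunkConstants& constants,
                   UINT indexCount)
    {
        const UINT kChunkConstantsParam = 0;  // index of the parameter above
        cmdList->SetGraphicsRoot32BitConstants(kChunkConstantsParam,
                                               sizeof(ChunkConstants) / 4, &constants, 0);
        cmdList->DrawIndexedInstanced(indexCount, 1, 0, 0, 0);
    }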


I think I talked about this already on the DirectX Discord, but I'll reiterate here in case you want to get more in-depth.

  • It's not true that you need to build your constant buffers in advance in DX12. In some cases that may be ideal from a performance point of view, since doing something once is less work than doing it over and over, but in DX12 you really have the power and freedom to do all kinds of things that fit your usage patterns (this is both a blessing and a curse, as you'll no doubt find out). Like I mentioned on the Discord, you can totally build a little system that's super-fast at handing out small, temporary buffers that live in UPLOAD memory and whose contents are only valid for the frame you're currently generating commands for. This can be really great for things where it's not desirable or practical to pre-build and cache the constant buffer data in advance, and you'd rather just fill it out on-the-fly as you're issuing Draw calls. I have a simple example of doing this using a basic linear allocator with atomics to “allocate” temp buffer memory very quickly from multiple threads (see the sketch after the links below):

    https://github.com/TheRealMJP/DXRPathTracer/blob/master/SampleFramework12/v1.02/Graphics/DX12_Upload.cpp#L448
    https://github.com/TheRealMJP/DXRPathTracer/blob/master/SampleFramework12/v1.02/Graphics/DX12_Helpers.h#L162
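
For a rough idea of the pattern (a simplified sketch with illustrative names, not the actual code from those links), the core of such an allocator can be as small as this:

    // Minimal sketch of a thread-safe linear allocator over a persistently
    // mapped UPLOAD buffer. An atomic offset makes it safe to call from many
    // threads; the whole buffer is reset once the GPU has finished the frame
    // that used it.
    #include <atomic>
    #include <cassert>
    #include <cstdint>
    #include <d3d12.h>

    struct TempBufferAllocator
    {
        uint8_t* cpuBase = nullptr;             // from ID3D12Resource::Map, never unmapped
        D3D12_GPU_VIRTUAL_ADDRESS gpuBase = 0;  // from GetGPUVirtualAddress
        uint64_t capacity = 0;
        std::atomic<uint64_t> offset{0};

        struct Allocation
        {
            void* cpuAddress;                      // write your constants here
            D3D12_GPU_VIRTUAL_ADDRESS gpuAddress;  // bind this as a root CBV
        };

        Allocation Allocate(uint64_t size)
        {
            const uint64_t aligned = (size + 255) & ~255ull;  // CBVs need 256-byte alignment
            const uint64_t start = offset.fetch_add(aligned, std::memory_order_relaxed);
            assert(start + aligned <= capacity && "temp buffer exhausted this frame");
            return { cpuBase + start, gpuBase + start };
        }

        // Call once per frame, after a fence confirms the GPU is done with it.
        void Reset() { offset.store(0, std::memory_order_relaxed); }
    };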

Continuing here since the editor is having trouble…

The effect of a large root signature really depends on the exact hardware and driver. AMD makes that recommendation because they have 16 registers that can be set from the command buffer, and they use those to implement the root signature. Anything that doesn't fit in those registers has to be “spilled” to memory, which can add CPU overhead and potentially affect GPU performance. If you keep your root signature smaller by building constant buffers yourself, then that overhead is instead in your own code, which means you can directly measure the cost and make the right trade-offs for your usage patterns.

Thanks for the reply. I don’t use Discord much, so I somehow missed your message there until now.

It’s my understanding (which may be completely wrong) that you can’t change constant buffer data after you have submitted a command list that will use it. All the resource data has to be set already. I was hoping to write my frame rendering commands to a single command list, so I was thinking it wouldn’t be possible to change constant buffer data while the list was executing. I have Frank Luna’s book and I’m trying to work through that, although I find DX12 a lot more confusing than DX11 since it seems to give you a lot more ways to do things. I’ve even heard that sometimes things run faster in DX11, but I’m guessing that’s because someone made the wrong design decision. I’m hoping to avoid that.

Part of the problem is how the engine works. Everything is organized in a tree; however, you start rendering from the point in the tree where the camera is. That’s usually a leaf node, and you go up the tree and then back down the various branches from there. The reason is that I’m trying to avoid having “world” coordinates: I go straight to view coordinates. This is because I’m using 64-bit coordinates on the CPU and only converting them to 32 bits before going to the GPU. Therefore I can’t just send my terrain in world coordinates, because the accuracy would be lost. So I send everything with an offset (something like model coordinates) and translate it to its real position on the GPU. If it’s far away from the camera the loss of accuracy doesn’t matter; if it’s close, I still have good precision.
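
In code terms, the idea is roughly this (a simplified sketch; my actual engine does this per node in the tree):

    // Sketch of the camera-relative conversion described above: subtract in
    // double precision first, then narrow to float. The difference is small
    // near the camera, so precision is kept exactly where it matters.
    #include <DirectXMath.h>

    struct DoubleVec3 { double x, y, z; };

    DirectX::XMFLOAT3 ToCameraRelative(const DoubleVec3& worldPos,
                                       const DoubleVec3& cameraPos)
    {
        return DirectX::XMFLOAT3(
            static_cast<float>(worldPos.x - cameraPos.x),
            static_cast<float>(worldPos.y - cameraPos.y),
            static_cast<float>(worldPos.z - cameraPos.z));
    }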

In any case, according to how I see it done in Luna’s book, I have to upload the entire set of matrices before executing the list, and then kind of scroll through them as I go. So I’d have to keep track of the size of my tree and be able to expand buffers on the fly if something changes.
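
As I understand that pattern, it looks something like this (a sketch with made-up names, assuming one 256-byte slot per object in a big upload buffer):

    // Sketch of the "upload everything, then scroll through it" pattern: one
    // big UPLOAD buffer holds a 256-byte-aligned slot of constants per object,
    // and each draw binds its slot's GPU address as a root CBV.
    #include <d3d12.h>

    const UINT64 kSlotSize = 256;  // one aligned slot of constants per object

    void DrawAllObjects(ID3D12GraphicsCommandList* cmdList,
                        ID3D12Resource* perObjectBuffer,  // capacity >= objectCount * kSlotSize
                        UINT rootParamIndex, UINT objectCount)
    {
        const D3D12_GPU_VIRTUAL_ADDRESS base = perObjectBuffer->GetGPUVirtualAddress();
        for (UINT i = 0; i < objectCount; ++i)
        {
            cmdList->SetGraphicsRootConstantBufferView(rootParamIndex, base + i * kSlotSize);
            // ... issue the draw call for object i ...
        }
    }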

The second problem is that the lights are just objects in the tree, so I can’t actually render until I’ve gone through the whole tree and figured out how the lights are positioned, since they can be attached to moving objects. That’s why I put render commands in a single list. I suppose I could keep a list of command lists and render them all at the end, but that seems like it would get out of hand.

Currently my textures are procedural so there isn’t really anything to upload in that department. At some point that will change. I’m not so much worried about that. All terrain meshes for LOD are generated and sent down in a completely separate thread. A model of the world is built for a given camera position and then I switch models, and work on the next one. The rendering thread only does rendering plus the player and camera controller.

I took a look at your stuff, but I’ll have to dig into it more to figure out what’s going on. As I said, DX12 is pretty confusing to me. I think I probably should have stuck with DX11, but since I started working on it, I want to finish it. As you saw, I kind of have a solution with the root signature constants, so maybe I’ll just use that for now and see how it works.

However, if there is some way to change (not just scroll through) constant buffer data in a command list, that would also be a possibility, but it doesn’t look like that’s the case.

Ahh, I understand what you're saying. I think I can help explain.

It is indeed true that the CPU can't write to buffers or other GPU memory while the GPU is currently reading from it. In more specific terms, this means that once you've submitted a command list that reads from some memory, you can't overwrite that memory until a fence is signaled by the GPU following the call to ExecuteCommandLists. The common workaround to this is called “versioning”, often implemented as “double-buffering” or “ring-buffering”. The basic idea is that instead of just having one buffer that you use, you actually use multiple buffers behind the scenes, so that you always have at least one that isn't currently being used by the GPU.

As an example: let's say that you're just doing 1 draw call that uses a single 256-byte constant buffer, and you would like to have the CPU be able to update the contents of that buffer every frame right before you issue the draw call on the command list. Let's also say you're using a simple “double buffering” setup where the GPU runs one frame behind the CPU so that they can run in parallel. The rough flow of your frame would look like this (a minimal code sketch follows the list):

  • Generate command list for frame 0
  • Submit command list for frame 0, and tell the queue to signal the fence
  • Generate command list for frame 1 (while GPU is working on frame 0)
  • Submit command list for frame 1
  • Wait for the fence to be signaled indicating that frame 0 has completed on the GPU
  • Move on to generating commands for frame 2
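
In code, that synchronization might look something like this (a sketch with illustrative names; HRESULT checks omitted):

    // Sketch of the double-buffered frame loop above, with one fence value
    // remembered per in-flight frame.
    #include <d3d12.h>
    #include <windows.h>

    const UINT kFrameCount = 2;
    UINT64 g_fenceValues[kFrameCount] = {};
    UINT64 g_nextFenceValue = 1;
    UINT   g_frameIndex = 0;

    void EndFrame(ID3D12CommandQueue* queue, ID3D12Fence* fence, HANDLE fenceEvent)
    {
        // Signal after this frame's command lists have been submitted.
        queue->Signal(fence, g_nextFenceValue);
        g_fenceValues[g_frameIndex] = g_nextFenceValue++;

        // Move to the next frame slot. Before reusing its buffers, wait until
        // the GPU has finished the frame that last used this slot.
        g_frameIndex = (g_frameIndex + 1) % kFrameCount;
        if (fence->GetCompletedValue() < g_fenceValues[g_frameIndex])
        {
            fence->SetEventOnCompletion(g_fenceValues[g_frameIndex], fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }
    }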

With this setup, if you have a single constant buffer, you're fine updating it from the CPU while you're building commands for frame 0, since it's not yet being used. But during frame 1, the buffer is being read by the GPU, so you can't safely touch it. A simple way to deal with this is to instead have two constant buffers of the same size (or alternatively, one buffer that's double the size). If you then update and bind one of these buffers on the even frames, and update and bind the other buffer on the odd frames, you can continue to swap back and forth and always have a “safe” buffer to write to. You just need to make sure that you're always binding the right buffer through your root signature, and everything works fine.

It gets a little complicated if you want to use a full CBV descriptor in a descriptor table, since then you either need to swap descriptor tables or build descriptor tables on-the-fly, but if you use a root CBV instead it's pretty trivial, since you just pass the GPU virtual address of the sub-buffer that you're using. I prefer to pretty much always use root CBVs for this reason, since it saves you the hassle of dealing with descriptors. You can also be a little smarter with this setup and only “swap” between the buffers whenever they actually get updated. This requires you to store an index indicating which sub-buffer currently holds the latest data, but it's a good idea for buffers that are updated infrequently rather than every frame.
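
Here's what the even/odd swap with a root CBV can look like (a sketch assuming one persistently mapped UPLOAD buffer that's double the aligned constant buffer size; names are made up):

    // Sketch of even/odd constant buffer versioning with a root CBV. The
    // buffer lives in an UPLOAD heap, is twice kAlignedCBSize, and stays
    // mapped for its whole lifetime.
    #include <cstdint>
    #include <cstring>
    #include <d3d12.h>

    const UINT64 kAlignedCBSize = 256;  // constant buffers are 256-byte aligned

    void UpdateAndBindCB(ID3D12GraphicsCommandList* cmdList,
                         ID3D12Resource* buffer,  // UPLOAD heap, 2 * kAlignedCBSize
                         uint8_t* mappedBase,     // persistently mapped CPU pointer
                         UINT frameIndex,         // running frame counter
                         UINT rootParamIndex,
                         const void* data, size_t dataSize)
    {
        // Even frames use the first half, odd frames the second half; the half
        // we write to is never the one the GPU is currently reading.
        const UINT64 offset = (frameIndex % 2) * kAlignedCBSize;
        memcpy(mappedBase + offset, data, dataSize);
        cmdList->SetGraphicsRootConstantBufferView(
            rootParamIndex, buffer->GetGPUVirtualAddress() + offset);
    }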

Now this setup I described works fine as long as you only update the constant buffer at most once per frame. If you're familiar with D3D11, you'll know that you can actually update the same buffer many times within a single frame as long as it's created as DYNAMIC and you map it with DISCARD. Now the reason this works in D3D11, and the reason why DISCARD semantics exist, is that the driver is typically doing some kind of complex versioning for you behind the scenes. Typically the driver will have some large pool of CPU-writable GPU memory available, and whenever you call Map it will grab a chunk of available memory (that's not being used by the GPU!) and let you fill that with your data. The driver will then also do all kinds of magic and bookkeeping to make sure that the right memory address actually gets hooked up to the shader bindings based on how you've bound the ID3D11Buffer to the per-stage constant buffer slots. In light of this, it should make sense why the previous contents of the buffer need to be discarded when you call Map in order for things to work efficiently: the driver needs to be able to transparently swap out memory for you, and that fresh memory isn't going to have the previous buffer contents in it.
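
For reference, this is the standard D3D11 pattern being described (plain D3D11 API usage, nothing specific to this thread):

    // The D3D11 DYNAMIC/DISCARD update pattern: each Map with DISCARD hands
    // you fresh memory, and the driver versions the buffer behind the scenes.
    #include <cstring>
    #include <d3d11.h>

    void UpdateDynamicCB(ID3D11DeviceContext* context, ID3D11Buffer* buffer,
                         const void* data, size_t dataSize)
    {
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(context->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            memcpy(mapped.pData, data, dataSize);  // previous contents are gone
            context->Unmap(buffer, 0);
        }
    }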

With D3D12 you are of course on your own to do this sort of thing, but it's possible to implement your own systems that do something similar if you'd like, or you can come up with your own patterns that work for your engine. This is basically what I was referring to in my earlier post: in D3D12 you can make your own kind of allocator that keeps “finding” fresh unused memory to use as constant buffers, and as long as you make sure the right GPU virtual address gets bound through your root signature, everything should work out.

On a side note…it's totally normal to be confused when starting D3D12. It leaves many more details in your hands compared to D3D11, which had some nice features that were already implemented for you in the driver and runtime (like all of that crazy dynamic buffer versioning stuff I just mentioned). It's also tricky because using D3D12 well generally relies on having a pretty good working knowledge of how GPUs work, and it can be hard to build that up from the docs and API alone. I would just take it one bit at a time, and work at it until your understanding and your code are good enough that things start working. It's likely you'll need to build up a good set of building blocks that you can use throughout your renderer, such as your own implementation of a “dynamic” buffer that lets you freely update the contents from the CPU. But after a while you'll have the pieces in place, and then you can start doing the cool stuff.

If you want an example of the most basic kind of double buffering, you can follow through what happens in my Buffer type when it's created with the “dynamic” flag, and also what happens when Map is called. This roughly maps to the D3D11 model, with the caveat that the buffer can only be updated once per frame. For cases where I need to update more frequently I use the “temporary” buffer allocator that I mentioned previously. I'm not familiar with Frank Luna's book, but I would be surprised if he doesn't cover something similar in one of the chapters. You can also check out AMD's Cauldron library, which has some helpers that you can look at for inspiration.

