
Benefits of multithreaded renderer

Started by February 25, 2020 02:15 AM
39 comments, last by NikiTo 4 years, 6 months ago

NikiTo said:
Thing is, we need a solid reason to do anything. And i haven't heard of a game that has 8000 zombies. Is there a game with even 800 zombies?

Dying Light, Days Gone, or this Epic Battle Simulator stuff, if non-zombies are allowed. Maybe fewer zombies than 8000, but better animation and gfx in the former examples.

NikiTo said:
Multithreading uses semaphores and a lot of extra code to work, so having two cores doesn't mean the speed of execution will double.

Semaphores, Mutexes, etc… that's not the problem. Those are just the necessary primitives to synchronize stuff.
The problem is a very different one, and because it is not obvious, i recommend getting used to MT even if it doesn't seem totally necessary, for learning purposes.

So what i mean for example are things like this:

You wanna multithread some solver for game physics constraints or UV parametrization for tools.
To do this you'll end up replacing the faster Gauss-Seidel method with the slower Jacobi method (i hope the terms are right - i'm no math guy).
Gauss-Seidel means updating a variable immediately according to its current solution; Jacobi means caching solutions in a temporary buffer, leaving all variables as they are, and changing them in one go after the full iteration over all variables is done.
Only the latter is order independent and can be multithreaded. A solver often takes twice the time to converge, but if you have more cores than two it is worth it.
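A minimal sketch of the difference (a made-up solver for Ax = b, not code from my projects - just to illustrate why only Jacobi parallelizes):

```cpp
#include <cstddef>
#include <vector>

// Gauss-Seidel: updates x[i] in place and immediately reuses the new
// values. Converges faster, but the result depends on iteration order,
// so the loop can not be split across threads safely.
void gaussSeidelStep(const std::vector<std::vector<double>>& A,
                     const std::vector<double>& b, std::vector<double>& x)
{
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) sum -= A[i][j] * x[j]; // x[j] may already be updated
        x[i] = sum / A[i][i];
    }
}

// Jacobi: reads only the previous iteration (xOld) and writes to a
// separate buffer (xNew). Order independent - every i can go to any
// thread; swap the buffers after each full iteration.
void jacobiStep(const std::vector<std::vector<double>>& A,
                const std::vector<double>& b,
                const std::vector<double>& xOld, std::vector<double>& xNew)
{
    const long long n = (long long)xOld.size();
    #pragma omp parallel for // or hand the rows to any job system
    for (long long i = 0; i < n; ++i) {
        double sum = b[i];
        for (long long j = 0; j < n; ++j)
            if (j != i) sum -= A[i][j] * xOld[j];
        xNew[i] = sum / A[i][i];
    }
}
```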

A similar problem is dividing huge workloads into chunks. If you want to multithread, you may need to enlarge the chunks by a margin of overlap so full adjacency information is present.
For volumetric workloads (blur, cubic filter, etc.) this overlap can have the same volume as the inner region you are interested in.
Again you want more cores than two to make this a real win.
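A minimal 1D sketch of the chunking idea (hypothetical box blur; the source stays read-only and shared, so the "overlap" is just each chunk reading 'radius' cells past its own borders, and no locking is needed):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

void blurInChunks(const std::vector<float>& src, std::vector<float>& dst,
                  int radius, int numChunks)
{
    const int n = (int)src.size();
    auto worker = [&](int begin, int end) {
        for (int i = begin; i < end; ++i) {
            float sum = 0.f;
            int count = 0;
            for (int k = -radius; k <= radius; ++k) {
                // reads may reach into neighbour chunks - that's the overlap
                int j = std::clamp(i + k, 0, n - 1);
                sum += src[j];
                ++count;
            }
            dst[i] = sum / count; // writes stay inside this chunk's own slice
        }
    };
    std::vector<std::thread> threads;
    for (int c = 0; c < numChunks; ++c)
        threads.emplace_back(worker, n * c / numChunks, n * (c + 1) / numChunks);
    for (auto& t : threads) t.join();
}
```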

This is what marketing won't tell you, and the reason dual core processors often struggle to beat single cores significantly with real workloads - they only manage to keep the system more responsive for the user (which is fine).

Now we have 8 cores and 16 threads. This is reason enough to utilize the power. We have to learn how to do it. Gfx drivers will not do this for us, because they do only a small percentage of a game.
And it is worth it. I get an average speedup of 7 with 8 cores, and i got 3.5 with 4 cores on a 10 years older CPU.
This difference in performance is the difference between working and useless software more often than not.

SyncViews said:
I guess there are 2 levels of multi-threading to consider as well. You could put all your rendering/animation/etc. logic on one single thread separate from other game stuff (and maybe split some of the other stuff into other threads as well), rather than have nearly the entire game be a single thread. And you can go beyond that and have the rendering itself use multiple threads, but one thread can already do a lot of rendering if that is all it is doing.

Yes, and this is the least we should do because it's almost zero extra work.
It won't teach you parallel programming, but it's a good start.
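A minimal sketch of that first level (made-up names; a real engine would rather double buffer the snapshot than hold a lock while copying):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-frame snapshot the game thread hands to the renderer.
struct FrameData { std::vector<float> transforms; };

std::mutex gMutex;
std::condition_variable gCv;
FrameData gPending;      // newest snapshot produced by the game thread
bool gHasFrame = false;
bool gQuit = false;

void renderThreadMain()
{
    for (;;) {
        FrameData frame;
        {
            std::unique_lock<std::mutex> lock(gMutex);
            gCv.wait(lock, [] { return gHasFrame || gQuit; });
            if (gQuit) return;
            frame = std::move(gPending); // grab it, then release the lock fast
            gHasFrame = false;
        }
        // submit(frame); // all gfx API calls live on this thread only
    }
}

void gameThreadTick()
{
    FrameData snapshot;
    // ... gameplay, animation, culling fill 'snapshot' here ...
    {
        std::lock_guard<std::mutex> lock(gMutex);
        gPending = std::move(snapshot); // overwrite: renderer takes the newest
        gHasFrame = true;
    }
    gCv.notify_one();
}
```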

This often leads me to the question: Should i parallelize the workload, accepting that more work / a slower algorithm is necessary? Or should i do completely different workloads on different cores, leading to worse cache utilization and CPU saturation but avoiding the extra work?
But so far i have never compared those two options and i don't know what the difference could be.

@NikiTo @JoeJ I guess I don't fully understand the benefit of multithreaded rendering. It makes sense that the animation system, for example, is multithreaded, as that is the part that computes the locations of vertices, which happens on the CPU before the gpu calls. But how can the actual rendering benefit from multithreading? First off, don't the gpu calls need to be serialized for semantic reasons? And secondly, even if you send gpu calls in a multithreaded way, wouldn't all the internal synchronization of the driver probably remove any benefit compared to a single-threaded, contention-free sequence of gpu calls?


JoeJ said:
Again you want more cores than two to make this a real win.

So you add one more uncertainty to the worth of multithreading.

JoeJ said:
And it is worth it. I get an average speedup of 7 with 8 cores, and i got 3.5 with 4 cores on a 10 years older CPU. This difference in performance is the difference between working and useless software more often than not.

Yes, but you are an advanced C++ programmer, who works on one of the most computationally intensive tasks. You DO need multithreading. Not a “beginner”.

JoeJ said:
This often leads me to the question: Should i parallelize the workload, accepting that more work / a slower algorithm is necessary? Or should i do completely different workloads on different cores, leading to worse cache utilization and CPU saturation but avoiding the extra work? But so far i have never compared those two options and i don't know what the difference could be.

Sure! This is what i'm talking about - you need to spend 2 years of development (more years for a more complex project). You need to program it one way, another way, a third way, compare them, tune them. It is not worth it unless you work on something very computationally demanding.

I am perfectly fine with programming a parallel workload manager built into the game engine. You submit the workload to it and it distributes it dynamically across the resources. This is the most sophisticated and worthy thing to do. But again, how long will it take to code?

I mean - a client installs your app on a weak dual core CPU and a strong GPU, so your app should not bet on CPU multithreading, but more on the GPU.
OR, a client installs your app on a strong CPU with 8 real cores and a weak GPU, so your app should bet on CPU multithreading.

The silliest and slowest thing to do is to write two programs, doing twice the programming work, and have the installer select between these two apps you created by putting in twice the effort.

The cool thing to do is to let your workload manager distribute the workload across the resources of the client's system. You @JoeJ might need it, but definitely 90% of people are fine with a single core.
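Something like this minimal sketch (a hypothetical class, no priorities or work stealing - a real one needs much more): it asks the client's system how many cores it has and spreads the submitted jobs over them, so the same binary adapts to the weak dual core and to the strong 8 core machine.

```cpp
#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkloadManager {
public:
    WorkloadManager() {
        // one worker per core reported by the client's machine
        unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < cores; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkloadManager() {
        { std::lock_guard<std::mutex> l(m_); quit_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> l(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return quit_ || !jobs_.empty(); });
                if (jobs_.empty()) return; // quit_ set and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool quit_ = false;
};
```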

@ccherng The command list multithreading will benefit you even less than CPU multithreaded character animation.

I recommend you code what you code for a single thread. Then, if you get 15 FPS, and IF you are sure it is not slow because of some bad code in another place, try to multithread it.

@NikiTo Well, I wasn't intending on actually using multithreading now. I just wanted to contemplate and learn about it purely for the sake of learning about it right now. Question: do multithreaded gpu calls work because there are some gpu calls that can be safely parallelized?

ccherng said:
I just wanted to contemplate and learn about it purely for the sake of learning about it right now.

This is exactly what you should do. Lots of googling and planning before coding.

ccherng said:
Question: do multithreaded gpu calls work because there are some gpu calls that can be safely parallelized?

I need confirmation from some more advanced GPU coders on the forum about this. AFAIK, you create the command lists in parallel on the CPU side. Then it all travels over the physical bus between the CPU and the GPU in a line, in serial. Then the GPU gets that serialized input and completely insanely parallelizes it in the most random order (if no barriers are used).
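As far as i understand, the CPU side looks roughly like this (a sketch only - error handling and Release() calls omitted, 'device' and 'queue' assumed to exist):

```cpp
#include <d3d12.h>
#include <thread>
#include <vector>

void recordInParallel(ID3D12Device* device, ID3D12CommandQueue* queue,
                      unsigned threadCount)
{
    std::vector<ID3D12CommandAllocator*>    allocators(threadCount);
    std::vector<ID3D12GraphicsCommandList*> lists(threadCount);
    std::vector<std::thread>                threads;

    for (unsigned i = 0; i < threadCount; ++i) {
        // each thread gets its OWN allocator + list; they are not
        // thread safe to share
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&allocators[i]));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i], nullptr,
                                  IID_PPV_ARGS(&lists[i]));
        threads.emplace_back([&lists, i] {
            // record this thread's slice of the frame here, e.g.
            // lists[i]->SetPipelineState(...); lists[i]->Dispatch(...);
            lists[i]->Close();
        });
    }
    for (auto& t : threads) t.join();

    // submission stays serial - this is the part that travels over the
    // bus "in a line"
    std::vector<ID3D12CommandList*> submit(lists.begin(), lists.end());
    queue->ExecuteCommandLists((UINT)submit.size(), submit.data());
}
```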


I am no GPU expert either, but yeah, if you plan it well, you don't do much on the CPU when controlling the GPU other than sending some commands and possibly some data. While preparing the latter, multi-threading could be of benefit in theory, but since setting up data in memory saturates the memory bus very quickly, you won't get much benefit there either.

The memory system already has a hard time keeping up with 1 core if you mostly do memory access, let alone several cores.

@Alberth I suppose that it takes time to create the structs, but once they are created, it is all about passing pointers around.

Drivers most of the time use Memory Mapped Registers. The assembler instruction tells the CPU to move something to RAM, but that RAM doesn't physically exist. When you “write a DWORD to it”, the CPU actually sends that DWORD over the bus. It is not real RAM and does not affect the cache. It is just that assembly uses the old semantics of “mov [RAX], EDX;”. But it has nothing to do with RAM (other than the usage of memory tables that hide that area of RAM from other apps, in case they try to use it as regular RAM and crash the system).
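In C++ terms it looks something like this (the address is completely made up - only a driver that owns the real mapping can do this):

```cpp
#include <cstdint>

// Hypothetical doorbell register address - invented for illustration.
constexpr std::uintptr_t kDoorbellRegister = 0xFEDC0000;

void ringDoorbell(std::uint32_t value)
{
    // 'volatile' stops the compiler from removing or reordering the store;
    // the page being mapped as uncached device memory is what makes the
    // DWORD travel over the bus instead of landing in RAM or cache.
    auto* reg = reinterpret_cast<volatile std::uint32_t*>(kDoorbellRegister);
    *reg = value; // compiles to a plain "mov [reg], value"
}
```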

I share your opinion that parallelizing the commands is not of much benefit. Not much benefit compared to other benefits one could obtain from doing parallelization/multithreading in other places.

For now i just make sure the GPU is well busy. I make sure to send it a huge dispatch, so the parallelization of commands makes no sense at all - the GPU will still be busy anyway.

I guess for the cases where there are lots of dispatches with small amounts of work, parallelization of commands matters more. My dispatches are huge. I do that on purpose. But my application allows for it. With some applications, one just cannot avoid dispatches containing small workloads.

The parallelization inside the GPU is tricky to control, but it can win a lot of time for you. This is why i try to saturate the GPU as much as I can. GPU will apply a good parallelization on the workload. It will do it for me. It saves me coding work.

Instead of multithreading 10x dispatch(256, 1, 1), i prefer to issue a single dispatch(256, 10, 1).
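I.e. (a sketch, assuming 'cmdList' is an open command list and the shader reads SV_GroupID.y to pick which of the 10 jobs a group belongs to):

```cpp
#include <d3d12.h>

void dispatchBatched(ID3D12GraphicsCommandList* cmdList)
{
    // instead of ten separate commands...
    // for (int job = 0; job < 10; ++job) cmdList->Dispatch(256, 1, 1);

    // ...one command; the GPU spreads the whole grid over its cores anyway
    cmdList->Dispatch(256, 10, 1);
}
```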

That's why i asked if the driver sends a single dispatch(1024, 1024, 1) over the bus, or sends dispatch(1,1,1) 1024x1024 times, which makes no engineering sense. This is why i assume it sends a single dispatch over the bus. It makes more engineering sense. Plus, it could explain the limit of 65,536 for each of the dimensions of the dispatch. This could be related to the limited space inside a DWORD - each dimension fits into 16 bits.

ccherng said:
But how can the actual rendering benefit from multithreading? First off, don't the gpu calls need to be serialized for semantic reasons? And secondly, even if you send gpu calls in a multithreaded way, wouldn't all the internal synchronization of the driver probably remove any benefit compared to a single-threaded, contention-free sequence of gpu calls?

It can benefit by doing all the internal work of turning API commands into something the GPU can use in parallel, earlier in the frame, and with better pipelining.

Remember that building multiple command buffers in parallel was not possible with older APIs, and this was a major reason for the complaints about slow draw calls.

However, personally i try to do more of the ‘GPU controls itself’ thing, meaning i have only static command buffers and only the data is dynamic. So i cannot benefit from MT because there is no work for the CPU at all other than uploading changed data.
But i know most games do much more than that, e.g. streaming, changing shaders, using many single-frame command buffers, etc. Pretty sure MT shows a noticeable benefit here, but i can't tell from my own experience.

NikiTo said:
So you add one more uncertainty to the worth of multithreading.

Uncertainty about performance can be solved with profiling.

NikiTo said:
Yes, but you are an advanced C++ programmer, who works on one of the most computationally intensive tasks. You DO need multithreading. Not a “beginner”.

Nah… i'm not very up to date with C++ nor with MT. I need it and it works, but i'm not happy with how my code looks after adding it. Need more practice, but opportunities are rare.

But the OP seems to be no beginner either: He is probably using a low level API so MT is an option, and he asks about MT so he is interested. I don't think one can start with MT too early - it's something physical, not something arguable like an OOP vs. ECS debate.

NikiTo said:
Sure! This is what i'm talking about - you need to spend 2 years of development (more years for a more complex project). You need to program it one way, another way, a third way, compare them, tune them. It is not worth it unless you work on something very computationally demanding.

No, you don't need to try 2 options - you can. The reason i never did is that the first option i tried was always good enough.

NikiTo said:
I mean - a client installs your app on a weak dual core CPU and a strong GPU, so your app should not bet on CPU multithreading, but more on the GPU.

The client has to check the minimum specs that we give him. Dual cores are gone - you can demand 4 cores for a game now and you won't lose a lot of potential buyers. You will lose more if you make a game that does not utilize at least entry level hardware and appears outdated. Depends on the game ofc.

ccherng said:
Question: do multithreaded gpu calls work because there are some gpu calls that can be safely parallelized?

Yes. GPUs have multiple queues that operate independently of each other. Those queues are an API abstraction over the real hardware, but it works.

You can use the transfer queue to stream data, the compute queue to do async compute, and the render queue to do draws. And you can feed the queues from different threads if you want.
But i think that's not what people generally mean by MT rendering - this really seems to be mostly about parallel command buffer generation.
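A sketch of setting those queues up in D3D12, assuming 'device' exists (Vulkan is analogous with queue families):

```cpp
#include <d3d12.h>

void createQueues(ID3D12Device* device,
                  ID3D12CommandQueue** gfx,
                  ID3D12CommandQueue** compute,
                  ID3D12CommandQueue** copy)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // render queue: draws
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(gfx));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // async compute
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(compute));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;     // transfer: streaming
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(copy));
}
```

Where the queues touch the same resources you then need fences to synchronize between them.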

If you serialize this, the driver (even if doing its own MT) may not have enough work most of the time, and then suddenly all the work comes in one go, which can cause bottlenecks.
If you pile up work earlier, the driver can utilize cores at any time, probably while your engine does not saturate all cores. That's what i mean by pipelining. (It's just an example - i do not know much about drivers. But this is how MT works in general.)

@JoeJ I'm not clear on how the API will parallelize things for you…
I fail to picture it.
It is very hard for an app to predict and guess what you intend to do.
I still think it is only about making the CPU generate commands faster. If you want to use the copy engine at the same time as the compute engine, this is something you have to provide and do manually.

For my app, whether i generate the copy commands first and the compute commands second, or generate them in parallel on the CPU side, it doesn't affect the speed at all. What has an effect is my decision to take advantage of modern GPUs having two different engines that can work in parallel.

An example question -

If i have to dispatch 20 shaders, and the biggest shader of them uses 8 resource slots and 4 constant buffers.

Should i just create a single RootSignature for the worst case? Then i apply that signature at the beginning of the app, and never generate a “SetComputeRootSignature(..)” command again.
It will work fine. Just that some of the slots will not be used in all the shaders.
It will work. But i never heard of somebody doing it to save command list time.

I can even create a huge RootSignature with all the resources assigned to a slot and then issue “SetComputeRootUnorderedAccessView()” only once at the beginning of the app, and never again.

It will work and will generate 10 times fewer commands. But i never heard of somebody doing that to get a speedup.
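Just to sketch what i mean (i haven't benchmarked this - error handling and blob Release() omitted): one worst case RootSignature with 8 UAV slots and 4 CBV slots as root descriptors, created once; shaders that use fewer slots simply ignore the rest.

```cpp
#include <d3d12.h>

ID3D12RootSignature* createWorstCaseRootSig(ID3D12Device* device)
{
    D3D12_ROOT_PARAMETER params[12] = {};
    for (int i = 0; i < 8; ++i) {                        // u0..u7
        params[i].ParameterType = D3D12_ROOT_PARAMETER_TYPE_UAV;
        params[i].Descriptor.ShaderRegister = i;
    }
    for (int i = 0; i < 4; ++i) {                        // b0..b3
        params[8 + i].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
        params[8 + i].Descriptor.ShaderRegister = i;
    }

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 12;
    desc.pParameters = params;

    ID3DBlob* blob = nullptr;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1,
                                &blob, nullptr);
    ID3D12RootSignature* rootSig = nullptr;
    device->CreateRootSignature(0, blob->GetBufferPointer(),
                                blob->GetBufferSize(),
                                IID_PPV_ARGS(&rootSig));
    return rootSig; // SetComputeRootSignature(rootSig) once, then forget it
}
```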


Summing up, i think MT command generation makes sense for very complex pipelines in full size AAA games. If you take into account the code of the logic that decides whether or not to generate a command, you can add that to the MT and it makes more sense.

One core thinks a lot, takes decisions, generates OR NOT one command or another, and pushes them onto a list.

A second core thinks a lot, takes decisions, generates OR NOT one command or another, and pushes them onto its own list.

It makes more sense now to MT the decision-making code. Yes, i see it making sense now.

(The OP said he is a beginner. I did not invent that out of nothing to hurt him.)

This topic is closed to new replies.
