I recently came across a question on the gamedev forum about "[DX11] Command Lists on a Single Threaded Renderer": If command lists are an efficient way to store replayable drawing commands, would it be efficient to use them even in a single threaded scenario where lots of drawing commands are repeatable?
To verify this, among other things, I wrote a simple micro-benchmark using C#/SharpDX. While the results are more or less what you would expect, there are a couple of gotchas that deserve a more in-depth look...
Direct3D11 Multi-threading : The basics
I assume that general multi-threading concepts and their advantages are already understood, so I will focus on the Direct3D11 multi-threading API.
There is already a nice "Introduction to Multithreading in Direct3D11" on MSDN that is worth reading if you are already a little bit familiar with the Direct3D11 API.
In Direct3D10, we only had a single class, ID3D10Device, to perform object/resource creation and draw calls. The API was not thread safe, but it was possible to emulate some kind of deferred rendering by using mutexes and simplified command buffers to access the device safely.
In Direct3D11, the preparation of draw calls is now parallelizable, while object/resource creation is thread safe. The API is now split between:
- ID3D11Device, which is responsible for creating objects/resources/shaders and device contexts.
- ID3D11DeviceContext, which holds all the commands to set up the shader pipeline and perform all draw calls (including constant buffer updates, setup of shader resource views, samplers, blend state, etc.).
When a Direct3D11 device is created, it provides a default ID3D11DeviceContext, called the immediate context, that is used for immediate rendering. There is only one immediate context per device.
In order to use deferred rendering, we need to create additional ID3D11DeviceContext instances, called deferred contexts: one context for each thread responsible for preparing a set of draw calls.
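As a minimal sketch of this step in SharpDX: constructing a DeviceContext from a Device creates a deferred context, while the immediate context is exposed through the device's ImmediateContext property. The DeferredContexts helper and its parameter names are hypothetical and not taken from the MultiCube source. The code assumes the SharpDX Direct3D11 assemblies are referenced and that a Direct3D11-capable device is available.

```csharp
using SharpDX.Direct3D11;

// Hypothetical helper, for illustration only.
static class DeferredContexts
{
    // Creates one deferred context per worker thread.
    public static DeviceContext[] Create(Device device, int threadCount)
    {
        var contexts = new DeviceContext[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            // In SharpDX, this constructor creates a deferred context on the device.
            contexts[i] = new DeviceContext(device);
        }
        return contexts;
    }
}
```

A caller would typically keep `device.ImmediateContext` for the rendering thread and hand `contexts[i]` to worker thread `i`. Each deferred context should be disposed when it is no longer needed.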
The sequence of multithreaded draw calls is then executed as follows: each secondary thread prepares its draw calls in an ID3D11CommandList, and these command lists are then executed by the immediate context (in order to push the commands to the driver).
A simplified version of the code is fairly straightforward:
// Thread-1
context[threadId1].InputAssembler.InputLayout = layout1;
context[threadId1].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadId1].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertices1, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadId1].Draw(...);
// Close the recording on this deferred context and keep the resulting command list
commandLists[threadId1] = context[threadId1].FinishCommandList(false);
[...]
// Thread-n
context[threadIdn].InputAssembler.InputLayout = layoutn;
context[threadIdn].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadIdn].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(verticesn, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadIdn].Draw(...);
// Close the recording on this deferred context and keep the resulting command list
commandLists[threadIdn] = context[threadIdn].FinishCommandList(false);
// Rendering Thread
for (int i = 0; i < threadCount; i++)
{
    var commandList = commandLists[i];
    // Execute the deferred command list on the immediate context
    immediateContext.ExecuteCommandList(commandList, false);
    // The command list is rebuilt every frame, so release it once executed
    commandList.Dispose();
}
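The snippet above leaves out how the worker threads are started and joined before the rendering thread replays their command lists. One possible way to do it, sketched here with plain .NET tasks, is a small fork/join helper. ForkJoin and RecordInParallel are hypothetical names, and the helper is deliberately generic so that it does not depend on Direct3D.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class ForkJoin
{
    // Runs record(threadIndex) on threadCount tasks, waits for all of them,
    // and returns the results indexed by thread so they can be replayed in order.
    public static T[] RecordInParallel<T>(int threadCount, Func<int, T> record)
    {
        var results = new T[threadCount];
        var tasks = new Task[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            int threadIndex = i; // local copy so each lambda sees its own index
            tasks[i] = Task.Run(() => { results[threadIndex] = record(threadIndex); });
        }
        Task.WaitAll(tasks);
        return results;
    }
}
```

With this helper, the `record` delegate would issue the draw calls on `context[threadIndex]` and return the result of `FinishCommandList(false)`. The rendering thread then loops over the returned array exactly as in the rendering loop above.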
The API provides several key advantages:
- We can easily switch the code between an immediate context and a deferred context, so using the multi-threading part of the Direct3D11 API is not intrusive for our code.
- The API is supported on downlevel hardware (from Direct3D11 down to Direct3D9)
- The underlying driver can take advantage of the call to FinishCommandList to perform some native layout that will help the deferred ExecuteCommandList command run faster.
Concerning this last point, native driver support can be queried with CheckFeatureSupport (or directly in SharpDX using CheckThreadingSupport), but it seems that almost only NVIDIA (and only quite recently, around this year) supports this feature natively. My previous ATI 6850 and my current 6900M do not support it. Is this bad? We will see that the default Direct3D11 runtime performs just fine here, but doesn't provide any extra boost.
We will also see that there is an interesting issue with the usage of Map/Unmap or UpdateSubresource to update constant buffers: depending on which one is used, a multithreading scenario could hurt performance.
MultiCube, a Direct3D11 Multi-threading micro-benchmark
In order to stress-test multi-threading using Direct3D11, I have developed a simple application called MultiCube (available as part of the SharpDX samples: see Program.cs).
This application performs the following benchmark: it renders n x n cubes on the screen, each cube having its own rotation matrix. You can change the number of cubes from 1 (1x1) to 65536 (256x256). The title bar shows some benchmark measurements (FPS / time per frame), and you can change the behavior of the application with the following keys:
- F1: Switch between Immediate Test (no threading), Deferred Test (Threading), and Frozen-Deferred Test (execute a pre-prepared CommandList on the ImmediateContext)
- F2: Switch between Map/Unmap mode and UpdateSubresource mode to update constant buffers.
- F3: Burn the CPU on/off. This is where multithreading makes the difference, and we are going to analyse the results a little bit more. When this option is on, it simulates lots of CPU calculation on the deferred threads. When it is off, it just batches the draw calls (which are simple, it's just cubes!)
- Left-Right arrows: Decrease/Increase the number of cubes to display (default 64x64)
- Down-Up arrows: Decrease/Increase the number of threads used (only for Deferred Test mode)
If your graphics driver doesn't natively support multithreading, you will see a "*" just after the Deferred mode name.
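The check behind that marker could look like the following sketch. CheckThreadingSupport is the SharpDX call mentioned earlier; the ThreadingSupport helper and the label format are assumptions made for illustration, not a copy of the MultiCube code.

```csharp
using SharpDX.Direct3D11;

// Hypothetical helper, for illustration only.
static class ThreadingSupport
{
    // Returns the mode label, with a "*" appended when the driver reports
    // no native support for command lists.
    public static string DeferredLabel(Device device)
    {
        bool supportsConcurrentResources;
        bool supportsCommandLists;
        device.CheckThreadingSupport(out supportsConcurrentResources, out supportsCommandLists);
        return supportsCommandLists ? "Deferred" : "Deferred*";
    }
}
```

Deferred contexts remain usable when `supportsCommandLists` is false. In that case the work is handled by the Direct3D11 runtime rather than natively by the driver.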
You can download the application here. It is a single exe that doesn't need any kind of install (apart from the DirectX June 2010 runtime). Also, being able to pack this application into a single exe is a unique feature of SharpDX: static linking of a .NET exe with the SharpDX DLLs.
Results
I ran two types of tests:
- Draw 65536 cubes with the Burn-Cpu option ON and OFF, and comparing Immediate and Deferred rendering (ranging from 1 thread to 6 threads).
- Draw 1024 cubes switching between Map/Unmap and UpdateSubresource, and comparing the results between Immediate and Deferred rendering.
65536 Drawcalls - BurnCpu: On (time per frame, by number of threads)
Type | 1 thread | 2 threads | 3 threads | 4 threads | 6 threads |
Nvidia-GTX 570 Deferred | 232ms | 130ms | 98ms | 92ms | 82ms |
Nvidia-GTX 570 Immediate | 220ms | 220ms | 220ms | 220ms | 220ms |
ATI 6900M Deferred | 231ms | 131ms | 98ms | 93ms | 84ms |
ATI 6900M Immediate | 228ms | 228ms | 228ms | 228ms | 228ms |
Fig2. 65536 draw calls with CPU intensive threads, comparison between Immediate and Deferred rendering
65536 Drawcalls - BurnCpu: Off (time per frame, by number of threads)
Type | 1 thread | 2 threads | 3 threads | 4 threads | 6 threads |
Nvidia-GTX 570 Deferred | 31ms | 24ms | 21ms | 20ms | 20ms |
Nvidia-GTX 570 Immediate | 19ms | 19ms | 19ms | 19ms | 19ms |
ATI 6900M Deferred | 32ms | 28ms | 28ms | 28ms | 28ms |
ATI 6900M Immediate | 28ms | 28ms | 28ms | 28ms | 28ms |
Fig3. 65536 draw calls with CPU light threads, comparison between Immediate and Deferred rendering
And finally the Map/Unmap and UpdateSubresource test:
1024 Drawcalls - Constant buffer update mode (time per frame)
Type | Map/Unmap | UpdateSubresource |
Nvidia-GTX 570 Immediate | 0.6ms | 1.1ms |
Nvidia-GTX 570 Deferred | 0.92ms | 7.32ms |
ATI 6900M Immediate | 0.6ms | 0.6ms |
ATI 6900M Deferred | 0.6ms | 0.6ms |
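The comparisons between modes boil down to ratios of the frame times in the tables above. A small helper such as the following makes them easy to recompute. Speedup, Of and Format are hypothetical names introduced for this sketch.

```csharp
using System.Globalization;

// Hypothetical helper, for illustration only.
static class Speedup
{
    // Ratio of a baseline frame time to a measured frame time (both in milliseconds).
    // A value above 1 means the measured configuration is faster than the baseline.
    public static double Of(double baselineMs, double measuredMs)
    {
        return baselineMs / measuredMs;
    }

    // Formats a ratio with two decimals, independently of the current culture.
    public static string Format(double ratio)
    {
        return ratio.ToString("F2", CultureInfo.InvariantCulture) + "x";
    }
}
```

For example, `Speedup.Format(Speedup.Of(220, 82))` gives 2.68x for the GTX 570 with BurnCpu on (immediate versus deferred with 6 threads), and `Speedup.Format(Speedup.Of(7.32, 0.92))` gives 7.96x for UpdateSubresource versus Map/Unmap in deferred mode on the same card.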
Analysis
If we examine the results a little more carefully, there are a couple of interesting things to highlight:
- Using multithreading and deferred context rendering is only relevant when the CPU is actually kept busy on each thread (that sounds obvious, but at least it is clear!). When the threads have little CPU work to do, immediate rendering is in fact as fast or faster!
- Multithreaded rendering in a CPU intensive application can perform roughly 2.7x faster than a single-threaded application in these measurements (220ms immediate versus 82ms deferred with 6 threads on the GTX 570), provided that we have enough CPU cores to dispatch the rendering jobs.
- The "native support from driver" of Direct3D11 multithreading doesn't seem to change much: comparing the NVIDIA graphics card, which supports it, with the AMD one, which doesn't, we don't see a huge difference.
- UpdateSubresource on the NVIDIA card is about 8x slower than Map/Unmap in a multithreading situation (7.32ms versus 0.92ms) and hurts the performance of the application a lot: use Map/Unmap instead!
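The two update paths compared in this test can be sketched as follows. This is an illustration under assumptions, not the exact MultiCube code: the ConstantBufferUpdate helper and its parameter names are hypothetical, dynamicBuffer is assumed to have been created with ResourceUsage.Dynamic and CpuAccessFlags.Write, and defaultBuffer with ResourceUsage.Default.

```csharp
using SharpDX;
using SharpDX.Direct3D11;
using Buffer = SharpDX.Direct3D11.Buffer;

// Hypothetical helper, for illustration only.
static class ConstantBufferUpdate
{
    // Map/Unmap path: requires a dynamic buffer with CPU write access (assumption).
    public static void WithMap(DeviceContext context, Buffer dynamicBuffer, ref Matrix worldViewProj)
    {
        // WriteDiscard: the previous contents are discarded and the whole buffer is rewritten.
        DataBox box = context.MapSubresource(dynamicBuffer, 0, MapMode.WriteDiscard, MapFlags.None);
        Utilities.Write(box.DataPointer, ref worldViewProj);
        context.UnmapSubresource(dynamicBuffer, 0);
    }

    // UpdateSubresource path: requires a buffer with default usage (assumption).
    public static void WithUpdateSubresource(DeviceContext context, Buffer defaultBuffer, ref Matrix worldViewProj)
    {
        context.UpdateSubresource(ref worldViewProj, defaultBuffer);
    }
}
```

Both methods accept either the immediate context or a deferred context, which is what makes it possible to compare the four combinations shown in the table.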
Finally, to respond to the original gamedev question, I provided a "Frozen Deferred" test in MultiCube to check whether replaying a pre-prepared CommandList is actually faster than issuing the same commands directly on the immediate context. It currently doesn't seem to make any difference (but to be sure, I would have to run this benchmark on several different machine/CPU/graphics card/driver configurations).
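For reference, the idea behind the Frozen-Deferred test can be sketched like this: record the command list once, then replay it every frame on the immediate context. The FrozenCommandList class and the recordDrawCalls delegate are hypothetical names; only FinishCommandList and ExecuteCommandList come from the code shown earlier.

```csharp
using System;
using SharpDX.Direct3D11;

// Hypothetical wrapper, for illustration only.
sealed class FrozenCommandList : IDisposable
{
    private readonly CommandList commandList;

    // Records the draw calls once on a deferred context.
    public FrozenCommandList(DeviceContext deferredContext, Action<DeviceContext> recordDrawCalls)
    {
        recordDrawCalls(deferredContext);
        commandList = deferredContext.FinishCommandList(false);
    }

    // Replays the same recorded commands; call once per frame.
    public void Replay(DeviceContext immediateContext)
    {
        immediateContext.ExecuteCommandList(commandList, false);
    }

    // Unlike the per-frame command lists, this one is released only at shutdown.
    public void Dispose()
    {
        commandList.Dispose();
    }
}
```

The list is replayed as it was recorded, so anything that must change from frame to frame would require a new recording.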