ANV
===

Debugging
---------

Here are a few debug environment variables specific to ANV:

:envvar:`ANV_ENABLE_PIPELINE_CACHE`
   If defined to ``0`` or ``false``, this will disable pipeline
   caching, forcing ANV to reparse and recompile any VkShaderModule
   (SPIR-V) it is given.
:envvar:`ANV_DISABLE_SECONDARY_CMD_BUFFER_CALLS`
   If defined to ``1`` or ``true``, this will prevent usage of
   self-modifying command buffers to implement ``vkCmdExecuteCommands``.
   As a result of this, it will also disable
   :ext:`VK_KHR_performance_query`.
:envvar:`ANV_ALWAYS_BINDLESS`
   If defined to ``1`` or ``true``, this forces all descriptor sets to
   use the internal `Bindless model`_.
:envvar:`ANV_QUEUE_THREAD_DISABLE`
   If defined to ``1`` or ``true``, this disables support for timeline
   semaphores.
:envvar:`ANV_USERSPACE_RELOCS`
   If defined to ``1`` or ``true``, this forces ANV to always do
   kernel relocations in command buffers. This should only have an
   effect on hardware that doesn't support soft-pinning (Ivybridge,
   Haswell, Cherryview).
:envvar:`ANV_PRIMITIVE_REPLICATION_MAX_VIEWS`
   Specifies up to how many view shaders can be lowered to handle
   :ext:`VK_KHR_multiview`. Beyond this number, multiview is implemented
   using instanced rendering. If unspecified, the value defaults to
   ``2``.
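
As an illustration of the ``0``/``false`` and ``1``/``true`` convention
used by the variables above, here is a minimal sketch of how such a value
could be interpreted. The ``example_parse_bool`` and ``example_env_bool``
helpers are hypothetical names introduced for this example; they are not
the functions ANV uses, and the real parser may accept other spellings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: interprets the textual value of a boolean debug
 * variable. Unset (NULL) or unrecognized values yield default_value.
 */
bool
example_parse_bool(const char *value, bool default_value)
{
   if (value == NULL)
      return default_value;

   if (strcmp(value, "1") == 0 || strcmp(value, "true") == 0)
      return true;
   if (strcmp(value, "0") == 0 || strcmp(value, "false") == 0)
      return false;

   return default_value;
}

/* Hypothetical helper: looks the variable up in the process environment. */
bool
example_env_bool(const char *name, bool default_value)
{
   assert(name != NULL);
   return example_parse_bool(getenv(name), default_value);
}
```

With these helpers, ``example_env_bool("ANV_ALWAYS_BINDLESS", false)``
would return ``true`` only when the variable is set to ``1`` or ``true``.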


Experimental features
---------------------

.. _`Bindless model`:

Binding Model
-------------

Here is the ANV bindless binding model that was implemented for the
descriptor indexing feature of Vulkan 1.2:

.. graphviz::

   digraph G {
     fontcolor="black";
     compound=true;

     subgraph cluster_1 {
       label = "Binding Table (HW)";

       bgcolor="cornflowerblue";

       node [ style=filled,shape="record",fillcolor="white",
              label="RT0" ] n0;
       node [ label="RT1" ] n1;
       node [ label="dynbuf0"] n2;
       node [ label="set0" ] n3;
       node [ label="set1" ] n4;
       node [ label="set2" ] n5;

       n0 -> n1 -> n2 -> n3 -> n4 -> n5 [style=invis];
     }
     subgraph cluster_2 {
       label = "Descriptor Set 0";

       bgcolor="burlywood3";
       fixedsize = true;

       node [ style=filled,shape="record",fillcolor="white", fixedsize = true, width=4,
              label="binding 0 - STORAGE_IMAGE\n anv_storage_image_descriptor" ] n8;
       node [ label="binding 1 - COMBINED_IMAGE_SAMPLER\n anv_sampled_image_descriptor" ] n9;
       node [ label="binding 2 - UNIFORM_BUFFER\n anv_address_range_descriptor" ] n10;
       node [ label="binding 3 - UNIFORM_TEXEL_BUFFER\n anv_storage_image_descriptor" ] n11;

       n8 -> n9 -> n10 -> n11 [style=invis];
     }
     subgraph cluster_5 {
       label = "Vulkan Objects"

       fontcolor="black";
       bgcolor="darkolivegreen4";

       subgraph cluster_6 {
         label = "VkImageView";

         bgcolor=darkolivegreen3;
         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="surface_state" ] n12;
       }
       subgraph cluster_7 {
         label = "VkSampler";

         bgcolor=darkolivegreen3;
         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="sample_state" ] n13;
       }
       subgraph cluster_8 {
         label = "VkImageView";
         bgcolor="darkolivegreen3";

         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="surface_state" ] n14;
       }
       subgraph cluster_9 {
         label = "VkBuffer";
         bgcolor=darkolivegreen3;

         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="address" ] n15;
       }
       subgraph cluster_10 {
         label = "VkBufferView";

         bgcolor=darkolivegreen3;
         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="surface_state" ] n16;
       }

       n12 -> n13 -> n14 -> n15 -> n16 [style=invis];
     }

     subgraph cluster_11 {
       subgraph cluster_12 {
         label = "CommandBuffer state stream";

         bgcolor="gold3";
         node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2,
                label="surface_state" ] n17;
         node [ label="surface_state" ] n18;
         node [ label="surface_state" ] n19;

         n17 -> n18 -> n19 [style=invis];
       }
     }

     n3 -> n8 [lhead=cluster_2];

     n8 -> n12;
     n9 -> n13;
     n9 -> n14;
     n10 -> n15;
     n11 -> n16;

     n0 -> n17;
     n1 -> n18;
     n2 -> n19;
   }



The HW binding table is generated when the draw or dispatch commands
are emitted. Here are the types of entries one can find in the binding
table:

- The currently bound descriptor sets, one entry per descriptor set
  (our limit is 8).

- For dynamic buffers, one entry per dynamic buffer.

- For draw commands, render target entries if needed.

The entries of the HW binding table for descriptor sets are
``RENDER_SURFACE_STATE`` structures, similar to what you would have for
a normal uniform buffer. The shader first emits reads of this buffer to
get the information it needs to access a surface, sampler, etc. and then
emits the appropriate message using the information gathered from the
descriptor set buffer.

Each binding type entry gets an associated structure in memory
(``anv_storage_image_descriptor``, ``anv_sampled_image_descriptor``,
``anv_address_range_descriptor``, ``anv_storage_image_descriptor``).
This is the information read by the shader.


.. _`Binding tables`:

Binding Tables
--------------

Binding tables are arrays of 32-bit offset entries referencing surface
states. This is how shaders can refer to a binding table entry to read
or write a surface. For example, fragment shaders will often refer to
entry 0 as the first render target.

The way binding tables are managed is fairly awkward.

Each shader stage must have its binding table programmed through
a corresponding instruction
``3DSTATE_BINDING_TABLE_POINTERS_*`` (each stage has its own).

.. graphviz::

   digraph structs {
     node [shape=record];
     struct3 [label="{ binding tables\n area | { <bt4> BT4 | <bt3> BT3 | ... | <bt0> BT0 } }|{ surface state\n area |{<ss0> ss0|<ss1> ss1|<ss2> ss2|...}}"];
     struct3:bt0 -> struct3:ss0;
     struct3:bt0 -> struct3:ss1;
   }


The value programmed in the ``3DSTATE_BINDING_TABLE_POINTERS_*``
instructions is not a 64-bit pointer but an offset from the address
programmed in ``STATE_BASE_ADDRESS::Surface State Base Address`` or
``3DSTATE_BINDING_TABLE_POOL_ALLOC::Binding Table Pool Base Address``
(available on Gfx11+). The offset value in
``3DSTATE_BINDING_TABLE_POINTERS_*`` is also limited to a few bits
(not a full 32-bit value), meaning that as we use more and more binding
tables we need to reposition ``STATE_BASE_ADDRESS::Surface State Base
Address`` to make space for new binding table arrays.

To make things even more awkward, the binding table entries are also
relative to ``STATE_BASE_ADDRESS::Surface State Base Address``, so as
we change ``STATE_BASE_ADDRESS::Surface State Base Address`` we need
to add that offset to the binding table entries.

The way we deal with this is that we allocate 4GiB of address space
(since the binding table entries can address 4GiB of surface state
elements). We reserve the first gigabyte exclusively for binding
tables, so that anywhere we position our binding table in that first
gigabyte, it can always refer to the surface states in the next 3GiB.
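
The address arithmetic described above can be sketched roughly as
follows. This is an illustration under stated assumptions, not ANV code:
the ``example_*`` names, the window structure and the placement checks
are all hypothetical. It only shows why a 32-bit entry is always enough
when the base address stays in the first gigabyte and the surface states
live in the following 3GiB.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_GIB (UINT64_C(1) << 30)

/* Hypothetical description of the 4GiB window: the first GiB is reserved
 * for binding tables, the remaining 3GiB hold surface states.
 */
struct example_state_window {
   uint64_t start;              /* start address of the 4GiB window */
   uint64_t surface_state_base; /* current Surface State Base Address */
};

/* Returns the 32-bit binding table entry for a surface state, expressed
 * as an offset from the current Surface State Base Address.
 */
uint32_t
example_binding_table_entry(const struct example_state_window *w,
                            uint64_t surface_state_addr)
{
   /* The base address may only move within the first gigabyte. */
   assert(w->surface_state_base >= w->start);
   assert(w->surface_state_base < w->start + EXAMPLE_GIB);

   /* Surface states live in the 3GiB that follow. */
   assert(surface_state_addr >= w->start + EXAMPLE_GIB);
   assert(surface_state_addr < w->start + 4 * EXAMPLE_GIB);

   /* The difference is below 4GiB, so it always fits in 32 bits. */
   return (uint32_t)(surface_state_addr - w->surface_state_base);
}
```

For instance, with the base at the start of the window, a surface state
located 64 bytes into the surface state area gets the entry
``0x40000040``. After moving the base up by 64KiB, the same surface state
is referenced by the entry ``0x3fff0040``.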


.. _`Descriptor Set Memory Layout`:

Descriptor Set Memory Layout
----------------------------

Here is a representation of how the descriptor set bindings, with each
element in each binding, are mapped to the descriptor set memory:

.. graphviz::

   digraph structs {
     node [shape=record];
     rankdir=LR;

     struct1 [label="Descriptor Set | \
                     <b0> binding 0\n STORAGE_IMAGE \n (array_length=3) | \
                     <b1> binding 1\n COMBINED_IMAGE_SAMPLER \n (array_length=2) | \
                     <b2> binding 2\n UNIFORM_BUFFER \n (array_length=1) | \
                     <b3> binding 3\n UNIFORM_TEXEL_BUFFER \n (array_length=1)"];
     struct2 [label="Descriptor Set Memory | \
                     <b0e0> anv_storage_image_descriptor|\
                     <b0e1> anv_storage_image_descriptor|\
                     <b0e2> anv_storage_image_descriptor|\
                     <b1e0> anv_sampled_image_descriptor|\
                     <b1e1> anv_sampled_image_descriptor|\
                     <b2e0> anv_address_range_descriptor|\
                     <b3e0> anv_storage_image_descriptor"];

     struct1:b0 -> struct2:b0e0;
     struct1:b0 -> struct2:b0e1;
     struct1:b0 -> struct2:b0e2;
     struct1:b1 -> struct2:b1e0;
     struct1:b1 -> struct2:b1e1;
     struct1:b2 -> struct2:b2e0;
     struct1:b3 -> struct2:b3e0;
   }

Each binding in the descriptor set is allocated an array of
``anv_*_descriptor`` data structures. The type of ``anv_*_descriptor``
used for a binding is selected based on the ``VkDescriptorType`` of
the binding.

The value of ``anv_descriptor_set_binding_layout::descriptor_offset``
is a byte offset from the start of the descriptor set memory to the
associated binding. ``anv_descriptor_set_binding_layout::array_size`` is
the number of ``anv_*_descriptor`` elements in the descriptor set memory
from that offset for the binding.
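
To make the role of ``descriptor_offset`` and ``array_size`` concrete,
here is a small sketch that locates one array element of a binding inside
the descriptor set memory. The ``example_binding_layout`` structure is a
simplified, hypothetical stand-in for
``anv_descriptor_set_binding_layout``, and the ``descriptor_stride``
field and the sizes used below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for anv_descriptor_set_binding_layout. */
struct example_binding_layout {
   uint32_t descriptor_offset; /* byte offset from the start of the set memory */
   uint32_t array_size;        /* number of descriptor elements in the binding */
   uint32_t descriptor_stride; /* assumed size in bytes of one anv_*_descriptor */
};

/* Returns the byte offset, from the start of the descriptor set memory,
 * of the given array element of a binding.
 */
size_t
example_descriptor_byte_offset(const struct example_binding_layout *binding,
                               uint32_t element)
{
   assert(element < binding->array_size);
   return (size_t)binding->descriptor_offset +
          (size_t)element * binding->descriptor_stride;
}
```

Assuming, purely for illustration, that binding 1 starts 24 bytes into
the set memory and that each of its 2 elements takes 16 bytes, element 1
of that binding is found at byte offset 40.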


Pipeline state emission
-----------------------

Vulkan initially started by baking as much state as possible in
pipelines. But extension after extension, more and more state has
become potentially dynamic.

ANV tries to limit the number of times an instruction has to be packed
to reprogram part of the 3D pipeline state. The packing happens in 2
places:

- ``genX_pipeline.c`` where the non-dynamic state is emitted in the
  pipeline batch. Chunks of the batches are copied into the command
  buffer as a result of calling ``vkCmdBindPipeline()``, depending on
  what changes from the previously bound graphics pipeline.

- ``genX_gfx_state.c`` where the dynamic state is added to already
  packed instructions from ``genX_pipeline.c``.

The rule to know where to emit an instruction programming the 3D
pipeline is as follows:

- If any field of the instruction can be made dynamic, it should be
  emitted in ``genX_gfx_state.c``.

- Otherwise, the instruction can be emitted in ``genX_pipeline.c``.

When a piece of state programming is dynamic, it should have a
corresponding field in ``anv_gfx_dynamic_state`` and the
``genX(cmd_buffer_flush_gfx_runtime_state)`` function should be
updated to ensure we minimize the number of times an instruction is
emitted. Each instruction should have an associated
``ANV_GFX_STATE_*`` mask so that the dynamic emission code can tell
when to re-emit an instruction.


Generated indirect draws optimization
-------------------------------------

Indirect draws have traditionally been implemented on Intel HW by
loading the indirect parameters from memory into HW registers using
the command streamer's ``MI_LOAD_REGISTER_MEM`` instruction before
dispatching a draw call to the 3D pipeline.

On recent products, it was found that the command streamer is showing
up as a performance bottleneck, because it cannot dispatch draw calls
fast enough to keep the 3D pipeline busy.

The solution to this problem is to change the way we deal with
indirect draws. Instead of loading HW registers with values using the
command streamer, we generate an entire set of ``3DPRIMITIVE``
instructions using a shader. The generated instructions contain the
entire draw call parameters. This way the command streamer executes
only ``3DPRIMITIVE`` instructions and doesn't do any data loading from
memory or touch HW registers, feeding the 3D pipeline as fast as it
can.

In ANV this is implemented in 2 different ways:

By generating instructions directly into the command stream using a
side batch buffer. When ANV encounters the first indirect draw, it
generates a jump into the side batch. The side batch contains a draw
call using a generation shader for each indirect draw. We keep adding
more generation draws into the batch until we have to stop due to
command buffer end, secondary command buffer calls or a barrier
containing the access flag ``VK_ACCESS_INDIRECT_COMMAND_READ_BIT``.
The side batch buffer jumps back right after the instruction where it
was called. Here is a high level diagram showing how the generation
batch buffer writes in the main command buffer:

.. graphviz::

   digraph commands_mode {
     rankdir = "LR"
     "main-command-buffer" [
       label = "main command buffer|...|draw indirect0 start|<f0>jump to\ngeneration batch|<f1>|<f2>empty instruction0|<f3>empty instruction1|...|draw indirect0 end|...|draw indirect1 start|<f4>empty instruction0|<f5>empty instruction1|...|<f6>draw indirect1 end|..."
       shape = "record"
     ];
     "generation-command-buffer" [
       label = "generation command buffer|<f0>|<f1>write draw indirect0|<f2>write draw indirect1|...|<f3>exit jump"
       shape = "record"
     ];
     "main-command-buffer":f0 -> "generation-command-buffer":f0;
     "generation-command-buffer":f1 -> "main-command-buffer":f2 [color="#0000ff"];
     "generation-command-buffer":f1 -> "main-command-buffer":f3 [color="#0000ff"];
     "generation-command-buffer":f2 -> "main-command-buffer":f4 [color="#0000ff"];
     "generation-command-buffer":f2 -> "main-command-buffer":f5 [color="#0000ff"];
     "generation-command-buffer":f3 -> "main-command-buffer":f1;
   }

By generating instructions into a ring buffer of commands, when the
draw count is high. This solution allows smaller batches to be
emitted. Here is a high level diagram showing how things are
executed:

.. graphviz::

   digraph ring_mode {
     rankdir=LR;
     "main-command-buffer" [
       label = "main command buffer|...| draw indirect |<f1>generation shader|<f2> jump to ring|<f3> increment\ndraw_base|<f4>..."
       shape = "record"
     ];
     "ring-buffer" [
       label = "ring buffer|<f0>generated draw0|<f1>generated draw1|<f2>generated draw2|...|<f3>exit jump"
       shape = "record"
     ];
     "main-command-buffer":f2 -> "ring-buffer":f0;
     "ring-buffer":f3 -> "main-command-buffer":f3;
     "ring-buffer":f3 -> "main-command-buffer":f4;
     "main-command-buffer":f3 -> "main-command-buffer":f1;
     "main-command-buffer":f1 -> "ring-buffer":f1 [color="#0000ff"];
     "main-command-buffer":f1 -> "ring-buffer":f2 [color="#0000ff"];
   }
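
The ring buffer loop can be approximated on the host with the sketch
below. It is a simulation written for this document, not ANV code: the
generation happens in a shader on the GPU in the real driver, and every
``example_*`` name as well as the layout of the generated draw are
assumptions. Only the field order of ``example_indirect_draw`` follows
``VkDrawIndirectCommand``.

```c
#include <assert.h>
#include <stdint.h>

/* Same field order as VkDrawIndirectCommand. */
struct example_indirect_draw {
   uint32_t vertex_count;
   uint32_t instance_count;
   uint32_t first_vertex;
   uint32_t first_instance;
};

/* Hypothetical stand-in for one generated 3DPRIMITIVE instruction. */
struct example_generated_draw {
   uint32_t vertex_count;
   uint32_t instance_count;
   uint32_t first_vertex;
   uint32_t first_instance;
};

/* Plays the role of the generation shader: fills the ring with up to
 * ring_size draws starting at draw_base and returns how many were written.
 */
uint32_t
example_generate_chunk(const struct example_indirect_draw *indirect,
                       uint32_t draw_count, uint32_t draw_base,
                       struct example_generated_draw *ring,
                       uint32_t ring_size)
{
   assert(ring_size > 0 && draw_base <= draw_count);

   uint32_t remaining = draw_count - draw_base;
   uint32_t count = remaining < ring_size ? remaining : ring_size;

   for (uint32_t i = 0; i < count; i++) {
      const struct example_indirect_draw *src = &indirect[draw_base + i];
      ring[i] = (struct example_generated_draw) {
         .vertex_count = src->vertex_count,
         .instance_count = src->instance_count,
         .first_vertex = src->first_vertex,
         .first_instance = src->first_instance,
      };
   }
   return count;
}

/* Simulates the loop from the diagram: generate a chunk, "execute" the
 * ring, increment draw_base and repeat. Returns the number of passes.
 */
uint32_t
example_run_ring(const struct example_indirect_draw *indirect,
                 uint32_t draw_count,
                 struct example_generated_draw *ring, uint32_t ring_size,
                 uint64_t *vertices_drawn)
{
   uint32_t passes = 0;
   uint32_t draw_base = 0;

   while (draw_base < draw_count) {
      uint32_t count = example_generate_chunk(indirect, draw_count,
                                              draw_base, ring, ring_size);

      /* Stand-in for the command streamer executing the ring content. */
      for (uint32_t i = 0; i < count; i++)
         *vertices_drawn += (uint64_t)ring[i].vertex_count *
                            ring[i].instance_count;

      draw_base += count;
      passes++;
   }
   return passes;
}
```

With 5 indirect draws and a ring that holds 2 generated draws, the loop
runs 3 passes. This illustrates how a small ring can serve a large draw
count at the cost of looping back to the generation step.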