1 | Asahi
2 | =====
3 |
4 | The Asahi driver aims to provide an OpenGL implementation for the Apple M1.
5 |
6 | Wrap (macOS only)
7 | -----------------
8 |
9 | Mesa includes a library that wraps the key IOKit entrypoints used in the macOS
10 | UABI for AGX. The wrapped routines print information about the kernel calls made
11 | and dump work submitted to the GPU using agxdecode. This facilitates
12 | reverse-engineering the hardware, as glue to get at the "interesting" GPU
13 | memory.
14 |
15 | The library is only built if ``-Dtools=asahi`` is passed. It builds a single
16 | ``libwrap.dylib`` file, which should be inserted into a process with the
17 | ``DYLD_INSERT_LIBRARIES`` environment variable.
18 |
19 | For example, to trace an app ``./app``, run:
20 |
21 | DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app
22 |
23 | Hardware varyings
24 | -----------------
25 |
26 | At an API level, vertex shader outputs need to be interpolated to become
27 | fragment shader inputs. This process is logically pipelined in AGX, with a value
28 | traveling from a vertex shader to remapping hardware to coefficient register
29 | setup to the fragment shader to the iterator hardware. Each stage is described
30 | below.
31 |
32 | Vertex shader
33 | `````````````
34 |
35 | A vertex shader (running on the :term:`Unified Shader Cores`) outputs varyings with the
36 | ``st_var`` instruction. ``st_var`` takes a *vertex output index* and a 32-bit
37 | value. The maximum number of *vertex outputs* is specified as the "output count"
38 | of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted
39 | as a single 32-bit value or an aligned 16-bit register pair, depending
40 | on whether interpolation should happen at 32-bit or 16-bit precision. Vertex outputs are
41 | indexed starting from 0, with the *vertex position* always coming first, the
42 | 32-bit user varyings coming next with perspective, flat, and linear interpolated
43 | varyings grouped in that order, then 16-bit user varyings with the same groupings,
44 | and finally *point size* and *clip distances* at the end if present. Note that
45 | *clip distances* are not accessible from the fragment shader; if the fragment
46 | shader needs to read the interpolated clip distance, the vertex shader must
47 | *also* write the clip distance values to a user varying for the fragment shader
48 | to interpolate. Also note there is no clip plane enable mask anywhere; that must
49 | be lowered for APIs that require this (OpenGL but not Vulkan).
50 |
51 | .. list-table:: Ordering of vertex outputs with all outputs used
52 | :widths: 25 75
53 | :header-rows: 1
54 |
55 | * - Size (words)
56 | - Value
57 | * - 4
58 | - Vertex position
59 | * - 1
60 | - 32-bit smooth varying 0
61 | * -
62 | - ...
63 | * - 1
64 | - 32-bit smooth varying m
65 | * - 1
66 | - 32-bit flat varying 0
67 | * -
68 | - ...
69 | * - 1
70 | - 32-bit flat varying n
71 | * - 1
72 | - 32-bit linear varying 0
73 | * -
74 | - ...
75 | * - 1
76 | - 32-bit linear varying o
77 | * - 1
78 | - Packed pair of 16-bit smooth varyings 0
79 | * -
80 | - ...
81 | * - 1
82 | - Packed pair of 16-bit smooth varyings p
83 | * - 1
84 | - Packed pair of 16-bit flat varyings 0
85 | * -
86 | - ...
87 | * - 1
88 | - Packed pair of 16-bit flat varyings q
89 | * - 1
90 | - Packed pair of 16-bit linear varyings 0
91 | * -
92 | - ...
93 | * - 1
94 | - Packed pair of 16-bit linear varyings r
95 | * - 1
96 | - Point size
97 | * - 1
98 | - Clip distance for plane 0
99 | * -
100 | - ...
101 | * - 1
102 | - Clip distance for plane 15
103 |
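As a rough illustration of the ordering above, the following sketch computes
the first word index of each group of vertex outputs. It is not taken from the
driver: the function ``vertex_output_offsets`` and its parameter names are
hypothetical, and every count is assumed to be given in words, so a packed pair
of 16-bit varyings counts as one.

.. code-block:: python

   def vertex_output_offsets(smooth32=0, flat32=0, linear32=0,
                             smooth16=0, flat16=0, linear16=0,
                             point_size=False, clip_distances=0):
       """Return ({group: first word index}, total words used)."""
       groups = [
           ("position", 4),            # always present, always first
           ("smooth32", smooth32),
           ("flat32", flat32),
           ("linear32", linear32),
           ("smooth16", smooth16),     # counted in packed pairs
           ("flat16", flat16),
           ("linear16", linear16),
           ("point_size", 1 if point_size else 0),
           ("clip_distances", clip_distances),
       ]
       offsets = {}
       cursor = 0
       for name, words in groups:
           if words:
               offsets[name] = cursor  # absent groups get no entry
           cursor += words
       return offsets, cursor

For instance, with two 32-bit smooth varyings, one 32-bit flat varying, point
size, and two clip distances, the sketch places the flat varying at word 6 and
point size at word 7, for 10 words in total.
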
104 | Remapping
105 | `````````
106 |
107 | Vertex outputs are remapped to varying slots to be interpolated.
108 | The output of remapping consists of the following items: the *W* fragment
109 | coordinate, the *Z* fragment coordinate, and user varyings in the vertex
110 | output order. *Z* may be omitted, but *W* may not be. This remapping is
111 | configured by the "Output select" word.
112 |
113 | .. list-table:: Ordering of remapped slots
114 | :widths: 25 75
115 | :header-rows: 1
116 |
117 | * - Index
118 | - Value
119 | * - 0
120 | - Fragment coord W
121 | * - 1
122 | - Fragment coord Z
123 | * - 2
124 | - 32-bit varying 0
125 | * -
126 | - ...
127 | * - 2 + m
128 | - 32-bit varying m
129 | * - 2 + m + 1
130 | - Packed pair of 16-bit varyings 0
131 | * -
132 | - ...
133 | * - 2 + m + n + 1
134 | - Packed pair of 16-bit varyings n
135 |
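The slot ordering in the table can be sketched as a small helper that lists
slot contents by index. The name ``remapped_slots`` is hypothetical. The sketch
also assumes that omitting *Z* simply shifts the user varyings down by one
slot, which the text above does not state explicitly.

.. code-block:: python

   def remapped_slots(num_32bit, num_16bit_pairs, has_z=True):
       """List slot contents in remapped order; W is always slot 0."""
       slots = ["W"]
       if has_z:
           slots.append("Z")  # assumption: later slots shift down if absent
       slots += [f"32-bit varying {i}" for i in range(num_32bit)]
       slots += [f"16-bit pair {i}" for i in range(num_16bit_pairs)]
       return slots

With *Z* present, 32-bit varying ``i`` lands at index ``2 + i``, matching the
table.
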
136 | Coefficient registers
137 | `````````````````````
138 |
139 | The fragment shader does not see the physical slots.
140 | Instead, it references varyings through *coefficient registers*. A coefficient
141 | register is a register whose contents stay constant across all fragment shader invocations in
142 | a given polygon. Physically, it contains the values output by the vertex shader
143 | for each vertex of the polygon. Coefficient registers are preloaded with values
144 | from varying slots. This preloading appears to occur in fixed function hardware,
145 | a simplification from PowerVR which requires a specialized program for the
146 | programmable data sequencer to do the preload.
147 |
148 | The "Bind fragment pipeline" packet points to coefficient register bindings,
149 | preceded by a header. The header contains the number of 32-bit varying slots. As
150 | the *W* slot is always present, this field is always nonzero. Slots whose index
151 | is below this count are treated as 32-bit. The remaining slots are treated as
152 | 16-bit.
153 |
154 | The header also contains the total number of coefficient registers bound.
155 |
156 | Each binding that follows maps one varying slot (or a vector of slots) to consecutive
157 | coefficient registers. Some details about the varying (perspective
158 | interpolation, flat shading, point sprites) are configured here.
159 |
160 | Coefficient registers may be ordered the same as the internal varying slots.
161 | However, this may be inconvenient for some APIs that require a separable shader
162 | model. For these APIs, the flexibility to mix-and-match slots and coefficient
163 | registers allows mixing shaders without shader variants. In that case, the
164 | bindings should be generated outside of the compiler. For simple APIs where the
165 | bindings are fixed and known at compile-time, the bindings could be generated
166 | within the compiler.
167 |
168 | Fragment shader
169 | ```````````````
170 |
171 | In the fragment shader, coefficient registers, identified by the prefix ``cf``
172 | followed by a decimal index, act as opaque handles to varyings. For flat
173 | shading, coefficient registers may be loaded into general registers with the
174 | ``ldcf`` instruction. For smooth shading, the coefficient register corresponding
175 | to the desired varying is passed as an argument to the "iterate" instruction
176 | ``iter`` in order to "iterate" (interpolate) a varying. As perspective-correct
177 | interpolation also requires the W component of the fragment coordinate, the
178 | coefficient register for W is passed as a second argument. As an example, if
179 | there's a single varying to interpolate, an instruction like ``iter r0, cf1, cf0``
180 | is used.
181 |
182 | Iterator
183 | ````````
184 |
185 | To actually interpolate varyings, AGX provides fixed-function iteration hardware
186 | to multiply the specified coefficient registers with the required barycentrics,
187 | producing an interpolated value, hence the name "coefficient register". This
188 | operation is purely mathematical and does not require any memory access, as
189 | the required coefficients are preloaded before the shader begins execution.
190 | That means the iterate instruction executes in constant time, does not signal
191 | a data fence, and does not require the shader to wait on a data fence before
192 | using the value.
193 |
194 | Image layouts
195 | -------------
196 |
197 | AGX supports several image layouts, described here. To work with image layouts
198 | in the drivers, use the ail library, located in ``src/asahi/layout``.
199 |
200 | The simplest layout is **strided linear**. Pixels are stored in raster-order in
201 | memory with a software-controlled stride. Strided linear images are useful for
202 | working with modifier-unaware window systems; however, performance will suffer.
203 | Strided linear images have numerous limitations:
204 |
205 | - Strides must be a multiple of 16 bytes.
206 | - Strides must be nonzero. For 1D images where the stride is logically
207 | irrelevant, ail will internally select the minimal stride.
208 | - Only 1D and 2D images may be linear. In particular, no 3D or cubemaps.
209 | - Array textures may not be linear. No 2D arrays or cubemap arrays.
210 | - 2D images must not be mipmapped.
211 | - Block-compressed formats and multisampled images are unsupported. Elements of
212 | a strided linear image are simply pixels.
213 |
214 | With these limitations, addressing into a strided linear image is as simple as
215 |
216 | .. math::
217 |
218 | \text{address} = (y \cdot \text{stride}) + (x \cdot \text{bytes per pixel})
219 |
220 | In practice, this suffices for window system integration and little else.
221 |
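The addressing formula and the stride rules listed above can be combined into
a short sketch. The function name ``linear_address`` is hypothetical, and the
bounds check on ``x`` is an added sanity check rather than something the
hardware is known to enforce.

.. code-block:: python

   def linear_address(x, y, stride, bytes_per_pixel):
       """Byte offset of pixel (x, y) in a strided linear image."""
       if stride <= 0 or stride % 16 != 0:
           raise ValueError("stride must be a nonzero multiple of 16 bytes")
       if (x + 1) * bytes_per_pixel > stride:
           raise ValueError("pixel does not fit within one row")
       return y * stride + x * bytes_per_pixel

For example, ``linear_address(3, 2, 64, 4)`` evaluates to
``2 * 64 + 3 * 4 = 140``.
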
222 | The most common uncompressed layout is **twiddled**. The image is divided into
223 | power-of-two sized tiles. The tiles themselves are stored in raster-order.
224 | Within each tile, elements (pixels/blocks) are stored in Morton (Z) order.
225 |
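Morton order interleaves the bits of the two coordinates within a tile. The
sketch below shows the idea for a square tile. The name ``morton_index`` is
hypothetical, and placing the bits of ``x`` in the even positions is an
assumption: the text above does not say which axis takes the low bit.

.. code-block:: python

   def morton_index(x, y):
       """Interleave the bits of x and y (x in the even bit positions)."""
       index = 0
       bit = 0
       while (x >> bit) or (y >> bit):
           index |= ((x >> bit) & 1) << (2 * bit)
           index |= ((y >> bit) & 1) << (2 * bit + 1)
           bit += 1
       return index

Under that assumption, the first four elements of a tile are visited in the
order (0, 0), (1, 0), (0, 1), (1, 1), tracing the "Z" shape that gives the
ordering its name.
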
226 | The tile size used depends on both the image size and the block size of the
227 | image format. For large images, :math:`n \times n` or :math:`2n \times n` tiles
228 | are used (:math:`n` power-of-two). :math:`n` is such that each page contains
229 | exactly one tile. Only power-of-two block sizes are supported in hardware,
230 | ensuring such a tile size always exists. The hardware uses 16 KiB pages, so tile
231 | sizes are as follows:
232 |
233 | .. list-table:: Tile sizes for large images
234 | :widths: 50 50
235 | :header-rows: 1
236 |
237 | * - Bytes per block
238 | - Tile size
239 | * - 1
240 | - 128 x 128
241 | * - 2
242 | - 128 x 64
243 | * - 4
244 | - 64 x 64
245 | * - 8
246 | - 64 x 32
247 | * - 16
248 | - 32 x 32
249 |
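The table follows from the one-tile-per-page rule and can be reproduced with a
short sketch. The helper ``large_tile_size`` is hypothetical. When a page
holds an odd power of two of elements, the sketch makes the width the longer
axis, as the table suggests.

.. code-block:: python

   PAGE_SIZE = 16 * 1024  # bytes, the page size stated above

   def large_tile_size(bytes_per_block):
       """Return (width, height) in elements of a tile filling one page."""
       is_pot = bytes_per_block > 0 and not (bytes_per_block & (bytes_per_block - 1))
       if not is_pot or bytes_per_block > PAGE_SIZE:
           raise ValueError("block size must be a power of two within a page")
       elements = PAGE_SIZE // bytes_per_block
       log2_elements = elements.bit_length() - 1
       height = 1 << (log2_elements // 2)  # n
       width = elements // height          # n or 2n
       return width, height
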
250 | The dimensions of large images are rounded up to be multiples of the tile size.
251 | In addition, non-power-of-two large images have extra padding tiles when
252 | mipmapping is used (see below).
253 |
254 | That rounding would waste a great deal of memory for small images. If
255 | an image is smaller than this tile size, a smaller tile size is used to reduce
256 | the memory footprint. For small images, the tile size is :math:`m \times m`
257 | where
258 |
259 | .. math::
260 |
261 | m = 2^{\lceil \log_2( \min \{ \text{width}, \text{ height} \}) \rceil}
262 |
263 | In other words, small images use the smallest square power-of-two tile such that
264 | the image's minor axis fits in one tile.
265 |
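The formula for :math:`m` can be evaluated with integer arithmetic only, which
avoids floating-point rounding in the logarithm. The name ``small_tile_size``
is hypothetical, and the dimensions are assumed to be in elements and at least
1.

.. code-block:: python

   def small_tile_size(width, height):
       """Smallest power of two that covers the image's minor axis."""
       minor = min(width, height)
       if minor < 1:
           raise ValueError("dimensions must be at least 1")
       # (minor - 1).bit_length() equals ceil(log2(minor)) for minor >= 1.
       return 1 << (minor - 1).bit_length()

For a 20 x 5 image this gives :math:`m = 8`, since 8 is the smallest power of
two that is at least 5.
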
266 | For mipmapped images, tile sizes are determined independently for each level.
267 | Typically, the first levels of an image are "large" and the remaining levels are
268 | "small". This scheme reduces the memory footprint of mipmapping, compared to a
269 | fixed tile size for the whole image. Each mip level is padded to fill at least
270 | one cache line (128 bytes), ensuring no cache line contains multiple mip levels.
271 |
272 | There is a wrinkle: the dimensions of large mip levels in tiles are determined
273 | by the dimensions of level 0. For power-of-two images, the two calculations are
274 | equivalent. However, they differ subtly for non-power-of-two images. To
275 | determine the number of tiles to allocate for level :math:`l`, the number of
276 | tiles for level 0 should be right-shifted by :math:`2l`. That appears to divide
277 | by :math:`2^l` in both width and height, matching the definition of mipmapping;
278 | however, it rounds down incorrectly. To compensate, the level contains one extra
279 | row, column, or both (with the corner) as required if any of the first :math:`l`
280 | levels were rounded down. This hurts the memory footprint. However, it means
281 | non-power-of-two integer multiplication is only required for level 0.
282 | Calculating the sizes for subsequent levels requires only addition and bitwise
283 | math. That simplifies the hardware (but complicates software).
284 |
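One possible reading of the paragraph above, applied to a single axis, is
sketched below. The name ``level_tiles`` is hypothetical and the code is not
taken from the ail library. It assumes the shift is applied per axis (by
:math:`l` bits in each of width and height) and that "rounded down" means a
set bit was shifted out at some earlier level. It only applies to levels that
still use large tiles.

.. code-block:: python

   def level_tiles(tiles_level0, level):
       """Tiles along one axis at `level`, given the count at level 0."""
       shifted = tiles_level0 >> level
       # A set bit among the low `level` bits means some halving rounded down.
       rounded_down = (tiles_level0 & ((1 << level) - 1)) != 0
       return shifted + (1 if rounded_down else 0)

With 5 tiles along an axis at level 0, this reading gives 3 tiles at level 1
and 2 tiles at level 2, using only a shift, a mask, and an addition.
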
285 | A 2D image consists of a full miptree (constructed as above) rounded up to the
286 | page size (16 KiB).
287 |
288 | 3D images consist simply of an array of 2D layers (constructed as above). That
289 | means cube maps, 2D arrays, cube map arrays, and 3D images all use the same
290 | layout. The only difference is the number of layers. Notably, 3D images (like
291 | ``GL_TEXTURE_3D``) reserve space even for mip levels that do not exist
292 | logically. These extra levels pad out layers of 3D images to the size of the
293 | first layer, simplifying layout calculations for both software and hardware.
294 | Although the padding is logically unnecessary, it wastes little space compared
295 | to the sizes of large mipmapped 3D textures.
296 |
297 | drm-shim (Linux only)
298 | ---------------------
299 |
300 | Mesa includes a library that mocks out the DRM UABI used by the Asahi driver
301 | stack, allowing the Mesa driver to run on non-M1 Linux hardware. This can be
302 | useful for exercising the compiler. To build, use options:
303 |
304 | ::
305 |
306 | -Dgallium-drivers=asahi -Dtools=drm-shim
307 |
308 | Then run an OpenGL workload with the environment variable:
309 |
310 | .. code-block:: console
311 |
312 | LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so
313 |
314 | For example, to compile a shader with shaderdb and print some statistics along
315 | with the IR:
316 |
317 | .. code-block:: console
318 |
319 | ~/shader-db$ AGX_MESA_DEBUG=shaders,shaderdb ASAHI_MESA_DEBUG=precompile LIBGL_DRIVERS_PATH=~/lib/dri/ LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so ./run shaders/glmark/1-12.shader_test
320 |
321 | The drm-shim implementation for Asahi is located in ``src/asahi/drm-shim``. The
322 | drm-shim implementation there should be updated as new UABI is added.
323 |
324 | Hardware glossary
325 | -----------------
326 |
327 | AGX is a tiled renderer descended from the PowerVR architecture. Some hardware
328 | concepts used in PowerVR GPUs appear in AGX.
329 |
330 | .. glossary:: :sorted:
331 |
332 | VDM
333 | Vertex Data Master
334 | Dispatches vertex shaders.
335 |
336 | PDM
337 | Pixel Data Master
338 | Dispatches pixel shaders.
339 |
340 | CDM
341 | Compute Data Master
342 | Dispatches compute kernels.
343 |
344 | USC
345 | Unified Shader Cores
346 | A unified shader core is a small CPU that runs shader code. The core is
347 | unified because a single ISA is used for vertex, pixel and compute
348 | shaders. This differs from older GPUs, where the vertex, fragment and
349 | compute shader stages have separate ISAs.
350 |
351 | PPP
352 | Primitive Processing Pipeline
353 | The Primitive Processing Pipeline is a hardware unit that does primitive
354 | assembly. The PPP is between the :term:`VDM` and :term:`ISP`.
355 |
356 | ISP
357 | Image Synthesis Processor
358 | The Image Synthesis Processor is responsible for the rasterization stage
359 | of the rendering pipeline.
360 |
361 | PBE
362 | Pixel BackEnd
363 | Hardware unit which writes to color attachments and images. Also the
364 | name for a descriptor passed to :term:`PBE` instructions.