Asahi
=====

The Asahi driver aims to provide an OpenGL implementation for the Apple M1.

Wrap (macOS only)
-----------------

Mesa includes a library that wraps the key IOKit entrypoints used in the macOS
UABI for AGX. The wrapped routines print information about the kernel calls made
and dump work submitted to the GPU using agxdecode. This facilitates
reverse-engineering the hardware, acting as glue to get at the "interesting" GPU
memory.

The library is only built if ``-Dtools=asahi`` is passed. It builds a single
``libwrap.dylib`` file, which should be inserted into a process with the
``DYLD_INSERT_LIBRARIES`` environment variable.

For example, to trace an app ``./app``, run:

.. code-block:: console

   DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app

Hardware varyings
-----------------

At an API level, vertex shader outputs need to be interpolated to become
fragment shader inputs. This process is logically pipelined in AGX, with a value
traveling from a vertex shader to remapping hardware to coefficient register
setup to the fragment shader to the iterator hardware. Each stage is described
below.

Vertex shader
`````````````

A vertex shader (running on the :term:`Unified Shader Cores`) outputs varyings with the
``st_var`` instruction. ``st_var`` takes a *vertex output index* and a 32-bit
value. The maximum number of *vertex outputs* is specified as the "output count"
of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted
as a single 32-bit value or as an aligned 16-bit register pair, depending
on whether interpolation should happen at 32-bit or 16-bit precision.

Vertex outputs are indexed starting from 0, with the *vertex position* always
coming first, the 32-bit user varyings coming next with perspective, flat, and
linear interpolated varyings grouped in that order, then 16-bit user varyings
with the same groupings, and finally *point size* and *clip distances* at the
end if present.

Note that *clip distances* are not accessible from the fragment shader; if the
fragment shader needs to read the interpolated clip distance, the vertex shader
must *also* write the clip distance values to a user varying for the fragment
shader to interpolate. Also note there is no clip plane enable mask anywhere;
that must be lowered for APIs that require this (OpenGL but not Vulkan).

.. list-table:: Ordering of vertex outputs with all outputs used
   :widths: 25 75
   :header-rows: 1

   * - Size (words)
     - Value
   * - 4
     - Vertex position
   * - 1
     - 32-bit smooth varying 0
   * -
     - ...
   * - 1
     - 32-bit smooth varying m
   * - 1
     - 32-bit flat varying 0
   * -
     - ...
   * - 1
     - 32-bit flat varying n
   * - 1
     - 32-bit linear varying 0
   * -
     - ...
   * - 1
     - 32-bit linear varying o
   * - 1
     - Packed pair of 16-bit smooth varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit smooth varyings p
   * - 1
     - Packed pair of 16-bit flat varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit flat varyings q
   * - 1
     - Packed pair of 16-bit linear varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit linear varyings r
   * - 1
     - Point size
   * - 1
     - Clip distance for plane 0
   * -
     - ...
   * - 1
     - Clip distance for plane 15

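The ordering above can be sketched as a small Python helper. This is only an
illustration of the documented order, not driver code: the function name
``vertex_output_layout`` and its dictionary-based arguments are hypothetical,
and each count is the number of words a group occupies.

.. code-block:: python

   def vertex_output_layout(counts_32, counts_16, point_size=False,
                            clip_distances=0):
       """Return ([(name, first_index, words)], total_words).

       counts_32 and counts_16 are hypothetical dicts mapping an
       interpolation group ("smooth", "flat", "linear") to the number of
       words it uses. 16-bit varyings are counted in packed pairs.
       """
       layout = []
       index = 0

       def place(name, words):
           # Append a group only if it is present, then advance the index.
           nonlocal index
           if words:
               layout.append((name, index, words))
               index += words

       # The vertex position always comes first and takes 4 words.
       place("position", 4)

       # 32-bit groups come before 16-bit groups; within each width the
       # order is smooth (perspective), flat, linear.
       for width, counts in (("32-bit", counts_32), ("16-bit pair", counts_16)):
           for group in ("smooth", "flat", "linear"):
               place(f"{width} {group}", counts.get(group, 0))

       # Point size and clip distances go at the end, if present.
       place("point size", 1 if point_size else 0)
       place("clip distances", clip_distances)
       return layout, index

For instance, two smooth and one flat 32-bit words, one packed linear 16-bit
pair, a point size, and two clip distances occupy indices 0 through 10.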
Remapping
`````````

Vertex outputs are remapped to varying slots to be interpolated.
The output of remapping consists of the following items: the *W* fragment
coordinate, the *Z* fragment coordinate, and the user varyings in the vertex
output order. *Z* may be omitted, but *W* may not be. This remapping is
configured by the "Output select" word.

.. list-table:: Ordering of remapped slots
   :widths: 25 75
   :header-rows: 1

   * - Index
     - Value
   * - 0
     - Fragment coord W
   * - 1
     - Fragment coord Z
   * - 2
     - 32-bit varying 0
   * -
     - ...
   * - 2 + m
     - 32-bit varying m
   * - 2 + m + 1
     - Packed pair of 16-bit varyings 0
   * -
     - ...
   * - 2 + m + n + 1
     - Packed pair of 16-bit varyings n

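As a rough sketch, the slot indices in the table above can be generated as
follows. The function name ``remapped_slots`` is hypothetical, and the sketch
covers only the case shown in the table, where both *W* and *Z* are present; it
makes no claim about how indices change when *Z* is omitted.

.. code-block:: python

   def remapped_slots(num_32bit, num_16bit_pairs):
       """List the remapped slots in index order, with W and Z present.

       num_32bit is the number of 32-bit user varyings (m + 1 in the table)
       and num_16bit_pairs the number of packed 16-bit pairs (n + 1).
       """
       slots = ["W", "Z"]
       slots += [f"32-bit varying {i}" for i in range(num_32bit)]
       slots += [f"16-bit pair {i}" for i in range(num_16bit_pairs)]
       return slots


   def first_16bit_slot(num_32bit):
       """Index of the first packed 16-bit slot: 2 + m + 1 with m = num_32bit - 1."""
       return 2 + num_32bit

With three 32-bit varyings (m = 2), the first packed 16-bit pair lands in slot
5, matching the ``2 + m + 1`` row of the table.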
Coefficient registers
`````````````````````

The fragment shader does not see the physical slots.
Instead, it references varyings through *coefficient registers*. A coefficient
register is a register whose contents are constant across all fragment shader
invocations in a given polygon. Physically, it contains the values output by the
vertex shader for each vertex of the polygon. Coefficient registers are
preloaded with values from varying slots. This preloading appears to occur in
fixed-function hardware, a simplification from PowerVR, which requires a
specialized program for the programmable data sequencer to do the preload.

The "Bind fragment pipeline" packet points to coefficient register bindings,
preceded by a header. The header contains the number of 32-bit varying slots. As
the *W* slot is always present, this field is always nonzero. Slots whose index
is below this count are treated as 32-bit. The remaining slots are treated as
16-bit.

The header also contains the total number of coefficient registers bound.

Each binding that follows maps a varying slot (or a vector of slots) to a
coefficient register (or a run of consecutive registers). Some details about
the varying (perspective interpolation, flat shading, point sprites) are
configured here.

Coefficient registers may be ordered the same as the internal varying slots.
However, this may be inconvenient for some APIs that require a separable shader
model. For these APIs, the flexibility to mix-and-match slots and coefficient
registers allows mixing shaders without shader variants. In that case, the
bindings should be generated outside of the compiler. For simple APIs where the
bindings are fixed and known at compile-time, the bindings could be generated
within the compiler.

Fragment shader
```````````````

In the fragment shader, coefficient registers, identified by the prefix ``cf``
followed by a decimal index, act as opaque handles to varyings. For flat
shading, coefficient registers may be loaded into general registers with the
``ldcf`` instruction. For smooth shading, the coefficient register corresponding
to the desired varying is passed as an argument to the "iterate" instruction
``iter`` in order to "iterate" (interpolate) a varying. As perspective-correct
interpolation also requires the W component of the fragment coordinate, the
coefficient register for W is passed as a second argument. As an example, if
there's a single varying to interpolate, an instruction like ``iter r0, cf1, cf0``
is used.

Iterator
````````

To actually interpolate varyings, AGX provides fixed-function iteration hardware
to multiply the specified coefficient registers with the required barycentrics,
producing an interpolated value, hence the name "coefficient register". This
operation is purely mathematical and does not require any memory access, as
the required coefficients are preloaded before the shader begins execution.
That means the iterate instruction executes in constant time, does not signal
a data fence, and does not require the shader to wait on a data fence before
using the value.

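The arithmetic behind iteration can be illustrated with the textbook
barycentric formulas. The sketch below is not a description of the hardware: it
assumes the coefficients are simply the three per-vertex values of a triangle,
that the barycentrics are screen-space weights summing to one, and that *W* is
the per-vertex clip-space ``w``. The actual encoding of coefficient registers
is not documented here, and the function name ``interpolate`` is hypothetical.

.. code-block:: python

   def interpolate(values, bary, w=None):
       """Interpolate one varying across a triangle.

       values: per-vertex values of the varying (3 floats).
       bary:   screen-space barycentric weights (3 floats summing to 1).
       w:      optional per-vertex clip-space w. When given, the result is
               perspective-correct; otherwise it is linear in screen space.
       """
       if w is None:
           # Linear (noperspective) interpolation: a plain weighted sum.
           return sum(v * b for v, b in zip(values, bary))

       # Perspective-correct interpolation: interpolate v/w and 1/w linearly
       # in screen space, then divide.
       numerator = sum(v * b / wi for v, b, wi in zip(values, bary, w))
       denominator = sum(b / wi for b, wi in zip(bary, w))
       return numerator / denominator

When every vertex has the same ``w``, both paths give the same result, which is
why only perspective-correct interpolation needs the extra *W* argument.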
Image layouts
-------------

AGX supports several image layouts, described here. To work with image layouts
in the drivers, use the ail library, located in ``src/asahi/layout``.

The simplest layout is **strided linear**. Pixels are stored in raster-order in
memory with a software-controlled stride. Strided linear images are useful for
working with modifier-unaware window systems, but performance will suffer.
Strided linear images have numerous limitations:

- Strides must be a multiple of 16 bytes.
- Strides must be nonzero. For 1D images where the stride is logically
  irrelevant, ail will internally select the minimal stride.
- Only 1D and 2D images may be linear. In particular, no 3D images or cubemaps.
- Array textures may not be linear. No 2D arrays or cubemap arrays.
- 2D images must not be mipmapped.
- Block-compressed formats and multisampled images are unsupported. Elements of
  a strided linear image are simply pixels.

With these limitations, addressing into a strided linear image is as simple as

.. math::

   \text{address} = (y \cdot \text{stride}) + (x \cdot \text{bytes per pixel})

In practice, this suffices for window system integration and little else.

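The addressing formula and the stride rules above translate directly into
code. This is a minimal sketch; ``linear_address`` is a hypothetical name, not
an ail entrypoint.

.. code-block:: python

   def linear_address(x, y, stride, bytes_per_pixel):
       """Byte offset of pixel (x, y) in a strided linear image."""
       # Strides must be nonzero multiples of 16 bytes.
       if stride <= 0 or stride % 16 != 0:
           raise ValueError("stride must be a nonzero multiple of 16 bytes")
       # A row of pixels must fit within the stride.
       if (x + 1) * bytes_per_pixel > stride:
           raise ValueError("x lies outside the row")
       return y * stride + x * bytes_per_pixel

For a 4-byte-per-pixel image with a 64-byte stride, pixel (3, 2) sits at byte
offset ``2 * 64 + 3 * 4``.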
The most common uncompressed layout is **twiddled**. The image is divided into
power-of-two sized tiles. The tiles themselves are stored in raster-order.
Within each tile, elements (pixels/blocks) are stored in Morton (Z) order.

The tile size used depends on both the image size and the block size of the
image format. For large images, :math:`n \times n` or :math:`2n \times n` tiles
are used (where :math:`n` is a power of two). :math:`n` is such that each page
contains exactly one tile. Only power-of-two block sizes are supported in
hardware, ensuring such a tile size always exists. The hardware uses 16 KiB
pages, so tile sizes are as follows:

.. list-table:: Tile sizes for large images
   :widths: 50 50
   :header-rows: 1

   * - Bytes per block
     - Tile size
   * - 1
     - 128 x 128
   * - 2
     - 128 x 64
   * - 4
     - 64 x 64
   * - 8
     - 64 x 32
   * - 16
     - 32 x 32

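The table above follows from the one-tile-per-page rule, as the sketch below
shows. The names are hypothetical. The Morton helper handles square tiles only,
and its bit order (x in the even bits, y in the odd bits) is an assumption made
for illustration; the text does not specify the order used by the hardware.

.. code-block:: python

   PAGE_SIZE_B = 16 * 1024  # 16 KiB pages, as stated in the text


   def large_tile_size(bytes_per_block):
       """Return (width, height) in elements of the tile that fills one page."""
       b = bytes_per_block
       if b <= 0 or b > PAGE_SIZE_B or b & (b - 1):
           raise ValueError("block size must be a power of two within a page")
       elements = PAGE_SIZE_B // b
       log2_elements = elements.bit_length() - 1
       # Split the element count as evenly as possible: n x n when the
       # exponent is even, 2n x n when it is odd.
       height = 1 << (log2_elements // 2)
       width = elements // height
       return width, height


   def morton_offset(x, y, log2_size):
       """Element offset of (x, y) inside a square tile in Morton (Z) order.

       Assumes x supplies the even bits and y the odd bits.
       """
       offset = 0
       for bit in range(log2_size):
           offset |= ((x >> bit) & 1) << (2 * bit)
           offset |= ((y >> bit) & 1) << (2 * bit + 1)
       return offset

Calling ``large_tile_size`` with 1, 2, 4, 8, and 16 bytes per block reproduces
the five rows of the table.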
The dimensions of large images are rounded up to be multiples of the tile size.
In addition, non-power-of-two large images have extra padding tiles when
mipmapping is used, see below.

That rounding would waste a great deal of memory for small images. If
an image is smaller than this tile size, a smaller tile size is used to reduce
the memory footprint. For small images, the tile size is :math:`m \times m`
where

.. math::

   m = 2^{\lceil \log_2( \min \{ \text{width}, \text{ height} \}) \rceil}

In other words, small images use the smallest square power-of-two tile such that
the image's minor axis fits in one tile.

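The formula for :math:`m` can be evaluated with integer arithmetic only. A
minimal sketch, with a hypothetical function name and dimensions given in
elements:

.. code-block:: python

   def small_tile_size(width, height):
       """Smallest power of two that is >= the image's minor axis."""
       minor = min(width, height)
       if minor <= 0:
           raise ValueError("dimensions must be positive")
       # (minor - 1).bit_length() equals ceil(log2(minor)) for minor >= 1.
       return 1 << (minor - 1).bit_length()

A 20 x 5 image therefore uses 8 x 8 tiles, since 8 is the smallest power of two
that covers its minor axis of 5.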
For mipmapped images, tile sizes are determined independently for each level.
Typically, the first levels of an image are "large" and the remaining levels are
"small". This scheme reduces the memory footprint of mipmapping, compared to a
fixed tile size for the whole image. Each mip level is padded to fill at least
one cache line (128 bytes), ensuring no cache line contains multiple mip levels.

There is a wrinkle: the dimensions of large mip levels in tiles are determined
by the dimensions of level 0. For power-of-two images, the two calculations are
equivalent. However, they differ subtly for non-power-of-two images. To
determine the number of tiles to allocate for level :math:`l`, the number of
tiles for level 0 should be right-shifted by :math:`2l`. That appears to divide
by :math:`2^l` in both width and height, matching the definition of mipmapping,
but it rounds down incorrectly. To compensate, the level contains one extra
row, column, or both (with the corner) as required if any of the first :math:`l`
levels were rounded down. This hurts the memory footprint. However, it means
non-power-of-two integer multiplication is only required for level 0.
Calculating the sizes for subsequent levels requires only addition and bitwise
math. That simplifies the hardware (but complicates software).

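One possible reading of the rule above, applied per axis, is sketched below.
It assumes that "rounded down at any of the first :math:`l` levels" is
equivalent to any of the low :math:`l` bits of the level 0 tile count being
set. The function name is hypothetical and the sketch is not taken from ail.

.. code-block:: python

   def large_level_tiles(tiles_x0, tiles_y0, level):
       """Tiles allocated in x and y for a large mip level.

       tiles_x0 and tiles_y0 are the level 0 dimensions in tiles.
       """
       mask = (1 << level) - 1
       # Shift each axis down by `level`, which together shifts the total
       # tile count by 2 * level.
       tiles_x = tiles_x0 >> level
       tiles_y = tiles_y0 >> level
       # Add one extra column or row if the shift discarded any set bits,
       # i.e. if an earlier level was rounded down on that axis.
       if tiles_x0 & mask:
           tiles_x += 1
       if tiles_y0 & mask:
           tiles_y += 1
       return tiles_x, tiles_y

Only shifts, masks, and additions are needed after level 0, which is the
simplification the text describes. With a 5 x 4 tile level 0, level 1 gets
3 x 2 tiles: the x axis was rounded down and gains a padding column, while the
y axis divides evenly.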
A 2D image consists of a full miptree (constructed as above) rounded up to the
page size (16 KiB).

3D images consist simply of an array of 2D layers (constructed as above). That
means cube maps, 2D arrays, cube map arrays, and 3D images all use the same
layout. The only difference is the number of layers. Notably, 3D images (like
``GL_TEXTURE_3D``) reserve space even for mip levels that do not exist
logically. These extra levels pad out layers of 3D images to the size of the
first layer, simplifying layout calculations for both software and hardware.
Although the padding is logically unnecessary, it wastes little space compared
to the sizes of large mipmapped 3D textures.

drm-shim (Linux only)
---------------------

Mesa includes a library that mocks out the DRM UABI used by the Asahi driver
stack, allowing the Mesa driver to run on non-M1 Linux hardware. This can be
useful for exercising the compiler. To build, use options:

::

   -Dgallium-drivers=asahi -Dtools=drm-shim

Then run an OpenGL workload with the environment variable:

.. code-block:: console

   LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so

For example, to compile a shader with shaderdb and print some statistics along
with the IR:

.. code-block:: console

   ~/shader-db$ AGX_MESA_DEBUG=shaders,shaderdb ASAHI_MESA_DEBUG=precompile LIBGL_DRIVERS_PATH=~/lib/dri/ LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so ./run shaders/glmark/1-12.shader_test

The drm-shim implementation for Asahi is located in ``src/asahi/drm-shim``. The
drm-shim implementation there should be updated as new UABI is added.

Hardware glossary
-----------------

AGX is a tiled renderer descended from the PowerVR architecture. Some hardware
concepts used in PowerVR GPUs appear in AGX.

.. glossary:: :sorted:

   VDM
   Vertex Data Master
      Dispatches vertex shaders.

   PDM
   Pixel Data Master
      Dispatches pixel shaders.

   CDM
   Compute Data Master
      Dispatches compute kernels.

   USC
   Unified Shader Cores
      A unified shader core is a small CPU that runs shader code. The core is
      unified because a single ISA is used for vertex, pixel and compute
      shaders. This differs from older GPUs, where the vertex, fragment and
      compute shader stages have separate ISAs.

   PPP
   Primitive Processing Pipeline
      The Primitive Processing Pipeline is a hardware unit that does primitive
      assembly. The PPP is between the :term:`VDM` and :term:`ISP`.

   ISP
   Image Synthesis Processor
      The Image Synthesis Processor is responsible for the rasterization stage
      of the rendering pipeline.

   PBE
   Pixel BackEnd
      Hardware unit which writes to color attachments and images. Also the
      name for a descriptor passed to :term:`PBE` instructions.