Asahi
=====

The Asahi driver aims to provide an OpenGL implementation for the Apple M1.

Wrap (macOS only)
-----------------

Mesa includes a library that wraps the key IOKit entrypoints used in the macOS
UABI for AGX. The wrapped routines print information about the kernel calls made
and dump work submitted to the GPU using agxdecode. This facilitates
reverse-engineering the hardware, as glue to get at the "interesting" GPU
memory.

The library is only built if ``-Dtools=asahi`` is passed. It builds a single
``libwrap.dylib`` file, which should be inserted into a process with the
``DYLD_INSERT_LIBRARIES`` environment variable.

For example, to trace an app ``./app``, run:

.. code-block:: console

   DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app

Hardware varyings
-----------------

At an API level, vertex shader outputs need to be interpolated to become
fragment shader inputs. This process is logically pipelined in AGX, with a value
traveling from a vertex shader to remapping hardware to coefficient register
setup to the fragment shader to the iterator hardware. Each stage is described
below.

Vertex shader
`````````````

A vertex shader (running on the :term:`Unified Shader Cores`) outputs varyings with the
``st_var`` instruction. ``st_var`` takes a vertex output index and a 32-bit
value. The maximum number of vertex outputs is specified as the "output count"
of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted
as a single 32-bit value or as an aligned 16-bit register pair, depending
on whether interpolation should happen at 32-bit or 16-bit precision.

Vertex outputs are indexed starting from 0, with the vertex position always
coming first, the 32-bit user varyings coming next with perspective, flat, and
linear interpolated varyings grouped in that order, then 16-bit user varyings
with the same groupings, and finally point size and clip distances at the end
if present.

Note that clip distances are not accessible from the fragment shader; if the
fragment shader needs to read the interpolated clip distance, the vertex shader
must also write the clip distance values to a user varying for the fragment
shader to interpolate. Also note there is no clip plane enable mask anywhere;
that must be lowered for APIs that require this (OpenGL but not Vulkan).

.. list-table:: Ordering of vertex outputs with all outputs used
   :widths: 25 75
   :header-rows: 1

   * - Size (words)
     - Value
   * - 4
     - Vertex position
   * - 1
     - 32-bit smooth varying 0
   * -
     - ...
   * - 1
     - 32-bit smooth varying m
   * - 1
     - 32-bit flat varying 0
   * -
     - ...
   * - 1
     - 32-bit flat varying n
   * - 1
     - 32-bit linear varying 0
   * -
     - ...
   * - 1
     - 32-bit linear varying o
   * - 1
     - Packed pair of 16-bit smooth varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit smooth varyings p
   * - 1
     - Packed pair of 16-bit flat varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit flat varyings q
   * - 1
     - Packed pair of 16-bit linear varyings 0
   * -
     - ...
   * - 1
     - Packed pair of 16-bit linear varyings r
   * - 1
     - Point size
   * - 1
     - Clip distance for plane 0
   * -
     - ...
   * - 1
     - Clip distance for plane 15

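The ordering of vertex outputs can be sketched as a small Python helper that
assigns a starting word index to each group. This is only an illustration of
the ordering described in this section: the function name
``vertex_output_layout`` and its parameters are hypothetical and do not
correspond to anything in Mesa.

.. code-block:: python

   def vertex_output_layout(smooth32=0, flat32=0, linear32=0,
                            smooth16=0, flat16=0, linear16=0,
                            point_size=False, clip_distances=0):
       """Return {group name: (first word index, size in words)}.

       Hypothetical helper. Counts for the 16-bit groups are given in
       packed pairs, since each pair occupies one 32-bit word.
       """
       groups = [
           ("position", 4),            # always first, 4 words
           ("smooth32", smooth32),     # perspective-interpolated
           ("flat32", flat32),
           ("linear32", linear32),
           ("smooth16", smooth16),
           ("flat16", flat16),
           ("linear16", linear16),
           ("point_size", 1 if point_size else 0),
           ("clip_distances", clip_distances),
       ]
       layout = {}
       index = 0
       for name, size in groups:
           if size:                    # omit groups that are not present
               layout[name] = (index, size)
           index += size
       return layout

For instance, ``vertex_output_layout(smooth32=2, flat32=1, point_size=True)``
places the position at words 0 to 3, the smooth varyings at words 4 and 5, the
flat varying at word 6, and the point size at word 7.
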
Remapping
`````````

Vertex outputs are remapped to varying slots to be interpolated.
The output of remapping consists of the following items: the W fragment
coordinate, the Z fragment coordinate, and the user varyings in vertex
output order. Z may be omitted, but W may not be. This remapping is
configured by the "Output select" word.

.. list-table:: Ordering of remapped slots
   :widths: 25 75
   :header-rows: 1

   * - Index
     - Value
   * - 0
     - Fragment coord W
   * - 1
     - Fragment coord Z
   * - 2
     - 32-bit varying 0
   * -
     - ...
   * - 2 + m
     - 32-bit varying m
   * - 2 + m + 1
     - Packed pair of 16-bit varyings 0
   * -
     - ...
   * - 2 + m + n + 1
     - Packed pair of 16-bit varyings n

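The slot indices for remapped varyings follow from simple offsets. The sketch
below restates that arithmetic, assuming both W and Z are present (how indices
change when Z is omitted is not covered here). The names are hypothetical and
chosen for illustration only.

.. code-block:: python

   FRAG_COORD_W_SLOT = 0   # always present
   FRAG_COORD_Z_SLOT = 1   # assumed present in this sketch
   FIRST_USER_SLOT = 2

   def slot_of_32bit_varying(i):
       """Slot index of 32-bit user varying number i (0-based)."""
       return FIRST_USER_SLOT + i

   def slot_of_16bit_pair(j, num_32bit_varyings):
       """Slot index of packed 16-bit pair number j (0-based).

       num_32bit_varyings is a count, so it equals m + 1 when the last
       32-bit varying is numbered m.
       """
       return FIRST_USER_SLOT + num_32bit_varyings + j

With three 32-bit varyings (so ``m`` is 2), the first packed 16-bit pair lands
in slot ``2 + m + 1``, which is slot 5.
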
Coefficient registers
`````````````````````

The fragment shader does not see the physical slots.
Instead, it references varyings through coefficient registers. A coefficient
register is a register whose contents are constant across all fragment shader
invocations in a given polygon. Physically, it contains the values output by
the vertex shader for each vertex of the polygon. Coefficient registers are
preloaded with values from varying slots. This preloading appears to occur in
fixed-function hardware, a simplification from PowerVR, which requires a
specialized program for the programmable data sequencer to do the preload.

The "Bind fragment pipeline" packet points to coefficient register bindings,
preceded by a header. The header contains the number of 32-bit varying slots. As
the W slot is always present, this field is always nonzero. Slots whose index
is below this count are treated as 32-bit. The remaining slots are treated as
16-bit.

The header also contains the total number of coefficient registers bound.

Each binding that follows maps a varying slot (or a vector of slots) to one or
more consecutive coefficient registers. Some details about the varying
(perspective interpolation, flat shading, point sprites) are configured here.

Coefficient registers may be ordered the same as the internal varying slots.
However, this may be inconvenient for some APIs that require a separable shader
model. For these APIs, the flexibility to mix-and-match slots and coefficient
registers allows mixing shaders without shader variants. In that case, the
bindings should be generated outside of the compiler. For simple APIs where the
bindings are fixed and known at compile-time, the bindings could be generated
within the compiler.

Fragment shader
```````````````

In the fragment shader, coefficient registers, identified by the prefix ``cf``
followed by a decimal index, act as opaque handles to varyings. For flat
shading, coefficient registers may be loaded into general registers with the
``ldcf`` instruction. For smooth shading, the coefficient register corresponding
to the desired varying is passed as an argument to the "iterate" instruction
``iter`` in order to "iterate" (interpolate) a varying. As perspective correct
interpolation also requires the W component of the fragment coordinate, the
coefficient register for W is passed as a second argument. As an example, if
there's a single varying to interpolate, an instruction like ``iter r0, cf1, cf0``
is used.

Iterator
````````

To actually interpolate varyings, AGX provides fixed-function iteration hardware
to multiply the specified coefficient registers with the required barycentrics,
producing an interpolated value, hence the name "coefficient register". This
operation is purely mathematical and does not require any memory access, as
the required coefficients are preloaded before the shader begins execution.
That means the iterate instruction executes in constant time, does not signal
a data fence, and does not require the shader to wait on a data fence before
using the value.

Image layouts
-------------

AGX supports several image layouts, described here. To work with image layouts
in the drivers, use the ail library, located in ``src/asahi/layout``.

The simplest layout is strided linear. Pixels are stored in raster-order in
memory with a software-controlled stride. Strided linear images are useful for
working with modifier-unaware window systems; however, performance will suffer.
Strided linear images have numerous limitations:

- Strides must be a multiple of 16 bytes.
- Strides must be nonzero. For 1D images where the stride is logically
  irrelevant, ail will internally select the minimal stride.
- Only 1D and 2D images may be linear. In particular, no 3D images or cubemaps.
- Array textures may not be linear. No 2D arrays or cubemap arrays.
- 2D images must not be mipmapped.
- Block-compressed formats and multisampled images are unsupported. Elements of
  a strided linear image are simply pixels.

With these limitations, addressing into a strided linear image is as simple as

.. math::

   \text{address} = (y \cdot \text{stride}) + (x \cdot \text{bytes per pixel})

In practice, this suffices for window system integration and little else.

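As a minimal sketch, the strided linear addressing formula translates directly
to Python. The function name ``linear_address`` is hypothetical and is not the
ail API; the stride check simply restates the limitations listed for strided
linear images.

.. code-block:: python

   def linear_address(x, y, stride, bytes_per_pixel):
       """Byte offset of pixel (x, y) in a strided linear image.

       Hypothetical helper. The stride is in bytes and must be a nonzero
       multiple of 16, as required for strided linear images.
       """
       if stride <= 0 or stride % 16 != 0:
           raise ValueError("stride must be a nonzero multiple of 16 bytes")
       return y * stride + x * bytes_per_pixel

For a 4-byte-per-pixel image with a 64-byte stride, pixel (3, 2) sits at byte
offset ``2 * 64 + 3 * 4``, which is 140.
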
The most common uncompressed layout is twiddled. The image is divided into
power-of-two sized tiles. The tiles themselves are stored in raster-order.
Within each tile, elements (pixels/blocks) are stored in Morton (Z) order.

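The following sketch shows one way to compute an element index in such a
layout for square tiles. It is an illustration, not the ail implementation:
the function names are hypothetical, and placing the bits of ``x`` in the even
positions of the Morton code is an assumption, since the bit order is not
specified here.

.. code-block:: python

   def morton_offset(x, y, bits):
       """Interleave the low `bits` bits of x and y into a Morton index.

       Assumption: x occupies the even bit positions and y the odd ones.
       """
       offset = 0
       for i in range(bits):
           offset |= ((x >> i) & 1) << (2 * i)
           offset |= ((y >> i) & 1) << (2 * i + 1)
       return offset

   def twiddled_element_index(x, y, width_in_tiles, tile_bits):
       """Element index for square tiles of size 2**tile_bits per side.

       Tiles are laid out in raster order; elements within a tile follow
       Morton order.
       """
       mask = (1 << tile_bits) - 1
       tile_index = (y >> tile_bits) * width_in_tiles + (x >> tile_bits)
       elements_per_tile = 1 << (2 * tile_bits)
       return tile_index * elements_per_tile + morton_offset(x & mask, y & mask, tile_bits)

Multiplying the returned index by the number of bytes per element gives a byte
offset under the same assumptions.
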
The tile size used depends on both the image size and the block size of the
image format. For large images, :math:`n \times n` or :math:`2n \times n` tiles
are used (:math:`n` power-of-two). :math:`n` is such that each page contains
exactly one tile. Only power-of-two block sizes are supported in hardware,
ensuring such a tile size always exists. The hardware uses 16 KiB pages, so tile
sizes are as follows:

.. list-table:: Tile sizes for large images
   :widths: 50 50
   :header-rows: 1

   * - Bytes per block
     - Tile size
   * - 1
     - 128 x 128
   * - 2
     - 128 x 64
   * - 4
     - 64 x 64
   * - 8
     - 64 x 32
   * - 16
     - 32 x 32

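The tile sizes for large images can be derived from the page size rather than
memorized. The sketch below does so under the stated rule that one tile fills
exactly one 16 KiB page; ``large_tile_size`` is a hypothetical name, and
treating the wider dimension as the width is an assumption based on the
:math:`2n \times n` notation.

.. code-block:: python

   PAGE_SIZE = 16 * 1024  # bytes, per the 16 KiB page size

   def large_tile_size(bytes_per_block):
       """Return (width, height) in elements of a one-page tile."""
       if bytes_per_block <= 0 or bytes_per_block & (bytes_per_block - 1):
           raise ValueError("block size must be a power of two")
       elements = PAGE_SIZE // bytes_per_block
       log2_elements = elements.bit_length() - 1
       # Split the exponent between the axes; width gets the odd bit.
       width = 1 << ((log2_elements + 1) // 2)
       height = 1 << (log2_elements // 2)
       return width, height

Calling it with 1, 2, 4, 8 and 16 bytes per block yields the five tile sizes
used for large images.
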
The dimensions of large images are rounded up to be multiples of the tile size.
In addition, non-power-of-two large images have extra padding tiles when
mipmapping is used; see below.

That rounding would waste a great deal of memory for small images. If
an image is smaller than this tile size, a smaller tile size is used to reduce
the memory footprint. For small images, the tile size is :math:`m \times m`
where

.. math::

   m = 2^{\lceil \log_2( \min \{ \text{width}, \text{ height} \}) \rceil}

In other words, small images use the smallest square power-of-two tile such that
the image's minor axis fits in one tile.

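The small-image tile size can be computed with integer arithmetic only. This
is a minimal sketch with a hypothetical function name; it assumes both
dimensions are at least 1.

.. code-block:: python

   def small_tile_size(width, height):
       """Side length m of the square tile for a small image.

       Returns the smallest power of two that is >= min(width, height).
       """
       minor = min(width, height)
       if minor < 1:
           raise ValueError("dimensions must be at least 1")
       # (minor - 1).bit_length() equals ceil(log2(minor)) for minor >= 1.
       return 1 << (minor - 1).bit_length()

A 5 x 9 image therefore uses 8 x 8 tiles, since 8 is the smallest power of two
that covers its minor axis.
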
For mipmapped images, tile sizes are determined independently for each level.
Typically, the first levels of an image are "large" and the remaining levels are
"small". This scheme reduces the memory footprint of mipmapping, compared to a
fixed tile size for the whole image. Each mip level is padded to fill at least
one cache line (128 bytes), ensuring no cache line contains multiple mip levels.

There is a wrinkle: the dimensions of large mip levels in tiles are determined
by the dimensions of level 0. For power-of-two images, the two calculations are
equivalent. However, they differ subtly for non-power-of-two images. To
determine the number of tiles to allocate for level :math:`l`, the number of
tiles for level 0 should be right-shifted by :math:`2l`. That appears to divide
by :math:`2^l` in both width and height, matching the definition of mipmapping;
however, it rounds down incorrectly. To compensate, the level contains one extra
row, column, or both (with the corner) as required if any of the first :math:`l`
levels were rounded down. This hurts the memory footprint. However, it means
non-power-of-two integer multiplication is only required for level 0.
Calculating the sizes for subsequent levels requires only addition and bitwise
math. That simplifies the hardware (but complicates software).

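One reading of this rule, applied to a single axis, is sketched below. It is
an interpretation rather than the ail implementation: ``tiles_for_level`` is a
hypothetical name, and the sketch assumes that "rounded down at any of the
first levels" is equivalent to any of the low :math:`l` bits of the level 0
tile count being set.

.. code-block:: python

   def tiles_for_level(tiles_level0, level):
       """Tiles along one axis for a large mip level.

       Uses only a shift, a mask and an addition, starting from the
       number of tiles along the same axis at level 0.
       """
       tiles = tiles_level0 >> level
       # Any set bit shifted out means some earlier level rounded down,
       # so one extra row or column of tiles is added to compensate.
       if tiles_level0 & ((1 << level) - 1):
           tiles += 1
       return tiles

Under that assumption, an axis spanning 5 tiles at level 0 spans 3 tiles at
level 1 and 2 tiles at level 2, whereas an axis spanning 4 tiles halves
exactly to 2 and then 1.
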
A 2D image consists of a full miptree (constructed as above) rounded up to the
page size (16 KiB).

3D images consist simply of an array of 2D layers (constructed as above). That
means cube maps, 2D arrays, cube map arrays, and 3D images all use the same
layout. The only difference is the number of layers. Notably, 3D images (like
``GL_TEXTURE_3D``) reserve space even for mip levels that do not exist
logically. These extra levels pad out layers of 3D images to the size of the
first layer, simplifying layout calculations for both software and hardware.
Although the padding is logically unnecessary, it wastes little space compared
to the sizes of large mipmapped 3D textures.

drm-shim (Linux only)
---------------------

Mesa includes a library that mocks out the DRM UABI used by the Asahi driver
stack, allowing the Mesa driver to run on non-M1 Linux hardware. This can be
useful for exercising the compiler. To build, use options:

::

   -Dgallium-drivers=asahi -Dtools=drm-shim

Then run an OpenGL workload with the environment variable:

.. code-block:: console

   LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so

For example, to compile a shader with shaderdb and print some statistics along
with the IR:

.. code-block:: console

   ~/shader-db$ AGX_MESA_DEBUG=shaders,shaderdb ASAHI_MESA_DEBUG=precompile LIBGL_DRIVERS_PATH=~/lib/dri/ LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so ./run shaders/glmark/1-12.shader_test

The drm-shim implementation for Asahi is located in ``src/asahi/drm-shim``. The
drm-shim implementation there should be updated as new UABI is added.

Hardware glossary
-----------------

AGX is a tiled renderer descended from the PowerVR architecture. Some hardware
concepts used in PowerVR GPUs appear in AGX.

.. glossary:: :sorted:

   VDM
   Vertex Data Master
      Dispatches vertex shaders.

   PDM
   Pixel Data Master
      Dispatches pixel shaders.

   CDM
   Compute Data Master
      Dispatches compute kernels.

   USC
   Unified Shader Cores
      A unified shader core is a small CPU that runs shader code. The core is
      unified because a single ISA is used for vertex, pixel and compute
      shaders. This differs from older GPUs, where the vertex, fragment and
      compute shader stages have separate ISAs.

   PPP
   Primitive Processing Pipeline
      The Primitive Processing Pipeline is a hardware unit that does primitive
      assembly. The PPP is between the :term:`VDM` and :term:`ISP`.

   ISP
   Image Synthesis Processor
      The Image Synthesis Processor is responsible for the rasterization stage
      of the rendering pipeline.

   PBE
   Pixel BackEnd
      Hardware unit which writes to color attachments and images. Also the
      name for a descriptor passed to :term:`PBE` instructions.
