<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
<title>Ogg Vorbis Documentation</title>

<style type="text/css">
body {
  margin: 0 18px 0 18px;
  padding-bottom: 30px;
  font-family: Verdana, Arial, Helvetica, sans-serif;
  color: #333333;
  font-size: .8em;
}

a {
  color: #3366cc;
}

img {
  border: 0;
}

#xiphlogo {
  margin: 30px 0 16px 0;
}

#content p {
  line-height: 1.4;
}

h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
  font-weight: bold;
  color: #ff9900;
  margin: 1.3em 0 8px 0;
}

h1 {
  font-size: 1.3em;
}

h2 {
  font-size: 1.2em;
}

h3 {
  font-size: 1.1em;
}

li {
  line-height: 1.4;
}

#copyright {
  margin-top: 30px;
  line-height: 1.5em;
  text-align: center;
  font-size: .8em;
  color: #888888;
  clear: both;
}
</style>

</head>

<body>

<div id="xiphlogo">
  <a href="https://xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.Org"/></a>
</div>

<h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>

<h2>Abstract</h2>

<p>The Vorbis audio codec provides channel coupling
mechanisms designed to reduce effective bitrate by both eliminating
interchannel redundancy and eliminating stereo image information
labeled inaudible or undesirable according to spatial psychoacoustic
models. This document describes both the mechanical coupling
mechanisms available within the Vorbis specification and the
specific stereo coupling models used by the reference
<tt>libvorbis</tt> codec provided by xiph.org.</p>

<h2>Mechanisms</h2>

<p>In encoder release beta 4 and earlier, Vorbis supported multiple
channel encoding, but the channels were encoded entirely separately
with no cross-analysis or redundancy elimination between channels.
This multichannel strategy is very similar to mp3's <em>dual
stereo</em> mode, and Vorbis uses the same name for its analogous
uncoupled multichannel modes.</p>

<p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
later implement, a coupled channel strategy. Vorbis has two specific
mechanisms that may be used alone or in conjunction to implement
channel coupling. The first is <em>channel interleaving</em> via
residue backend type 2, and the second is <em>square polar
mapping</em>. These two general mechanisms are particularly well
suited to coupling due to the structure of Vorbis encoding, as we'll
explore below. Using both, we can implement totally
<em>lossless stereo image coupling</em> [bit-for-bit decode-identical
to uncoupled modes] as well as various lossy models that seek to
eliminate inaudible or unimportant aspects of the stereo image in
order to reduce bitrate. The exact coupling implementation is
generalized to allow the encoder a great deal of flexibility in
implementation of a stereo or surround model without requiring any
significant complexity increase over the combinatorially simpler
mid/side joint stereo of mp3 and other current audio codecs.</p>

<p>A particular Vorbis bitstream may apply channel coupling directly to
more than a pair of channels; polar mapping is hierarchical such that
polar coupling may be extrapolated to an arbitrary number of channels
and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
surround. However, the scope of this document restricts itself to the
stereo coupling case.</p>

<a name="sqpm"></a>
<h3>Square Polar Mapping</h3>

<h4>maximal correlation</h4>

<p>Recall that the basic structure of a Vorbis I stream first generates
from input audio a spectral 'floor' function that serves as an
MDCT-domain whitening filter. This floor is meant to represent the
rough envelope of the frequency spectrum, using whatever metric the
encoder cares to define. This floor is subtracted from the log
frequency spectrum, effectively normalizing the spectrum by frequency.
Each input channel is associated with a unique floor function.</p>
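<p>As a minimal sketch of the subtraction described above, assume the
spectrum and the floor are both already expressed in decibels, one value
per MDCT bin. The function name <tt>whiten_spectrum</tt> and the array
layout are illustrative assumptions and are not taken from
<tt>libvorbis</tt>:</p>

```c
#include <stddef.h>

/* Subtract a channel's floor curve from its log-magnitude spectrum.
 * Both inputs are assumed to be in dB, one entry per MDCT bin; the
 * result is the whitened residue for that channel. */
void whiten_spectrum(const float *spectrum_db, const float *floor_db,
                     float *residue_db, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        residue_db[i] = spectrum_db[i] - floor_db[i];
    }
}
```

<p>Subtraction in the log domain is division in the linear domain, so in
this sketch two channels that share a spectral shape but differ in level
yield the same residue whenever each channel's floor tracks that
channel's own level.</p>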

<p>The basic idea behind any stereo coupling is that the left and right
channels usually correlate. This correlation is even stronger if one
first accounts for energy differences in any given frequency band
across left and right; think for example of individual instruments
mixed into different portions of the stereo image, or a stereo
recording with a dominant feature not perfectly in the center. The
floor functions, each specific to a channel, provide the perfect means
of normalizing left and right energies across the spectrum to maximize
correlation before coupling. This feature of the Vorbis format is not
a convenient accident.</p>

<p>Because we strive to maximally correlate the left and right channels
and generally succeed in doing so, left and right residue is typically
nearly identical. We could use channel interleaving (discussed below)
alone to efficiently remove the redundancy between the left and right
channels as a side effect of entropy encoding, but a polar
representation gives benefits when left/right correlation is
strong.</p>

<h4>point and diffuse imaging</h4>

<p>The first advantage of a polar representation is that it effectively
separates the spatial audio information into a 'point image'
(magnitude) at a given frequency and located somewhere in the sound
field, and a 'diffuse image' (angle) that fills a large amount of
space simultaneously. Even if we preserve only the magnitude (point)
data, a detailed and carefully chosen floor function in each channel
provides us with a free, fine-grained, frequency-relative intensity
stereo*. Angle information represents diffuse sound fields, such as
reverberation that fills the entire space simultaneously.</p>

<p>*<em>Because the Vorbis model supports a number of different possible
stereo models and these models may be mixed, we do not use the term
'intensity stereo' when talking about Vorbis; instead we use the terms
'point stereo', 'phase stereo' and subcategories of each.</em></p>

<p>The majority of a stereo image is representable by polar magnitude
alone, as strong sounds tend to be produced at near-point sources;
even non-diffuse, fast, sharp echoes track very accurately using
magnitude representation almost alone (for those experimenting with
Vorbis tuning, this strategy works much better with the precise,
piecewise control of floor 1; the continuous approximation of floor 0
results in unstable imaging). Reverberation and diffuse sounds tend
to contain less energy and be psychoacoustically dominated by the
point sources embedded in them. Thus, we again tend to concentrate
more represented energy into a predictably smaller number of values.
Separating representation of point and diffuse imaging also allows us
to model and manipulate point and diffuse qualities separately.</p>

<h4>controlling bit leakage and symbol crosstalk</h4>

<p>Because polar
representation concentrates represented energy into fewer large
values, we reduce bit 'leakage' during cascading (multistage VQ
encoding) as a secondary benefit. A single large, monolithic VQ
codebook is more efficient than a cascaded book due to entropy
'crosstalk' among symbols between different stages of a multistage cascade.
Polar representation is a way of further concentrating entropy into
predictable locations so that codebook design can take steps to
improve multistage codebook efficiency. It also allows us to cascade
various elements of the stereo image independently.</p>

<h4>eliminating trigonometry and rounding</h4>

<p>Rounding and computational complexity are potential problems with a
polar representation. As our encoding process involves quantization,
mixing a polar representation and quantization makes it potentially
impossible, depending on implementation, to construct a coupled stereo
mechanism that results in bit-identical decompressed output compared
to an uncoupled encoding should the encoder desire it.</p>

<p>Vorbis uses a mapping that preserves the most useful qualities of
polar representation, relies only on addition/subtraction (during
decode; high quality encoding still requires some trig), and makes it
trivial before or after quantization to represent an angle/magnitude
through a one-to-one mapping from possible left/right value
permutations. We do this by basing our polar representation on the
unit square rather than the unit circle.</p>

<p>Given a magnitude and angle, we recover left and right using the
following function (note that A/B may be left/right or right/left
depending on the coupling definition used by the encoder):</p>

<pre>
  /* Decode one coupled pair: recover A and B from magnitude and angle. */
  if(magnitude>0){
    if(angle>0){
      A=magnitude;
      B=magnitude-angle;
    }else{
      B=magnitude;
      A=magnitude+angle;
    }
  }else{
    if(angle>0){
      A=magnitude;
      B=magnitude+angle;
    }else{
      B=magnitude;
      A=magnitude-angle;
    }
  }
</pre>

<p>The function is antisymmetric for positive and negative magnitudes in
order to eliminate a redundant value when quantizing. For example, if
we're quantizing to integer values, we can visualize a magnitude of 5
and an angle of -2 as follows:</p>

<p><img src="squarepolar.png" alt="square polar"/></p>

<p>This representation loses or replicates no values; if A and B each
range over the integers -5 through 5, the number of possible Cartesian
permutations is 121. Represented in square polar notation, the
possible values are:</p>

<pre>
 0, 0

-1,-2 -1,-1 -1, 0 -1, 1

 1,-2  1,-1  1, 0  1, 1

-2,-4 -2,-3 -2,-2 -2,-1 -2, 0 -2, 1 -2, 2 -2, 3

 2,-4  2,-3  ... following the pattern ...

 ...    5, 1  5, 2  5, 3  5, 4  5, 5  5, 6  5, 7  5, 8  5, 9

</pre>

<p>...for a grand total of 121 possible values, the same number as in
Cartesian representation. (Note that, for example, <tt>5,-10</tt> is
the same as <tt>-5,10</tt>, so there's no reason to represent
both. <tt>2,10</tt> cannot happen, and there's no reason to account for it.)
It's also obvious that this mapping is exactly reversible.</p>
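<p>The reversibility claim can be checked mechanically. The sketch below
assumes integer-quantized values and pairs the decode function given
earlier with one possible forward mapping. The names
<tt>square_polar_encode</tt>, <tt>square_polar_decode</tt> and
<tt>count_roundtrip_failures</tt> are hypothetical, and the tie-breaking
rule in the encoder is an assumption chosen to match the table above; the
coupling code in <tt>libvorbis</tt> is organized differently.</p>

```c
#include <stdlib.h>

/* Decode: the square polar function from the text, written for ints. */
void square_polar_decode(int mag, int ang, int *a, int *b)
{
    if (mag > 0) {
        if (ang > 0) { *a = mag; *b = mag - ang; }
        else         { *b = mag; *a = mag + ang; }
    } else {
        if (ang > 0) { *a = mag; *b = mag + ang; }
        else         { *b = mag; *a = mag - ang; }
    }
}

/* Encode (assumed forward mapping): whichever of A and B has the larger
 * absolute value becomes the magnitude. Ties go to B, so every A/B pair
 * gets exactly one code and the angle stays within
 * [-2*|mag|, 2*|mag| - 1]. */
void square_polar_encode(int a, int b, int *mag, int *ang)
{
    *mag = (abs(a) > abs(b)) ? a : b;
    *ang = (*mag > 0) ? a - b : b - a;
}

/* Round-trip every pair with A and B in [-limit, limit] and return the
 * number of pairs that fail to decode back to themselves. */
int count_roundtrip_failures(int limit)
{
    int failures = 0;
    for (int a = -limit; a <= limit; a++) {
        for (int b = -limit; b <= limit; b++) {
            int mag, ang, a2, b2;
            square_polar_encode(a, b, &mag, &ang);
            square_polar_decode(mag, ang, &a2, &b2);
            if (a2 != a || b2 != b) failures++;
        }
    }
    return failures;
}
```

<p>With a limit of 5 the loop visits all 121 pairs. Because every pair
decodes back to itself, no two pairs can share a code, which is the
one-to-one property described above. For instance, A=3, B=5 encodes to a
magnitude of 5 and an angle of -2.</p>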

<h3>Channel interleaving</h3>

<p>We can remap an A/B vector using polar mapping into a magnitude/angle
vector, and it's clear that, in general, this concentrates energy in
the magnitude vector and reduces the amount of information to encode
in the angle vector. Encoding these vectors independently with
residue backend #0 or residue backend #1 will result in bitrate
savings. However, there are still implicit correlations between the
magnitude and angle vectors. The most obvious is that the amplitude
of the angle is bounded by its corresponding magnitude value.</p>

<p>Entropy coding the results, then, further benefits from the entropy
model being able to compress magnitude and angle simultaneously. For
this reason, Vorbis implements residue backend #2 which pre-interleaves
a number of input vectors (in the stereo case, two, A and B) into a
single output vector (with the elements in the order of
A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
each vector to be coded by the vector quantization backend consists of
matching magnitude and angle values.</p>
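<p>The element ordering described above can be sketched as follows. This
is only an illustration of the interleave itself, assuming integer
residue values and equal-length input vectors; the function name
<tt>interleave_channels</tt> is hypothetical and the real residue backend
also handles partitioning and codebook selection, which are omitted
here.</p>

```c
#include <stddef.h>

/* Interleave nch input vectors of length n into one output vector of
 * length nch * n, ordered A_0, B_0, A_1, B_1, ... for the stereo case. */
void interleave_channels(const int *const *channels, size_t nch, size_t n,
                         int *out)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t c = 0; c < nch; c++) {
            out[i * nch + c] = channels[c][i];
        }
    }
}
```

<p>After coupling, the two inputs are the magnitude and angle vectors, so
adjacent output elements are a magnitude and its matching angle.</p>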

<p>The astute reader, at this point, will notice that in the theoretical
case in which we can use monolithic codebooks of arbitrarily large
size, we can directly interleave and encode left and right without
polar mapping; in fact, the polar mapping does not appear to lend any
benefit whatsoever to the efficiency of the entropy coding. It is
perfectly possible and reasonable to build a Vorbis encoder that
dispenses with polar mapping entirely and merely interleaves the
channels. Libvorbis-based encoders may configure such an encoding and
it will work as intended.</p>

<p>However, when we leave the ideal/theoretical domain, we notice that
polar mapping does give additional practical benefits, as discussed in
the above section on polar mapping and summarized again here:</p>

<ul>
<li>Polar mapping aids in controlling entropy 'leakage' between stages
of a cascaded codebook.</li>
<li>Polar mapping separates the stereo image
into point and diffuse components which may be analyzed and handled
differently.</li>
</ul>

<h2>Stereo Models</h2>

<h3>Dual Stereo</h3>

<p>Dual stereo refers to stereo encoding where the channels are entirely
separate; they are analyzed and encoded as entirely distinct entities.
This terminology is familiar from mp3.</p>

<h3>Lossless Stereo</h3>

<p>Using polar mapping and/or channel interleaving, it's possible to
couple Vorbis channels losslessly, that is, construct a stereo
coupling encoding that both saves space and decodes
bit-identically to dual stereo. OggEnc 1.0 and later use this
mode in all high-bitrate encoding.</p>

<p>Overall, this stereo mode is overkill; however, it offers a safe
alternative to users concerned about the slightest possible
degradation to the stereo image or archival quality audio.</p>

<h3>Phase Stereo</h3>

<p>Phase stereo is the least aggressive means of gracefully dropping
resolution from the stereo image; it affects only diffuse imaging.</p>

<p>It's often quoted that the human ear is deaf to signal phase above
about 4kHz; this is nearly true and a passable rule of thumb, but it
can be demonstrated that even an average user can tell the difference
between high frequency in-phase and out-of-phase noise. Obviously
then, the statement is not entirely true. However, it's also the case
that one must resort to a demonstration nearly this extreme before
finding a counterexample.</p>

<p>'Phase stereo' is simply a more aggressive quantization of the polar
angle vector; above 4kHz it's generally quite safe to quantize noise
and noisy elements to only a handful of allowed phases, or to thin the
phase with respect to the magnitude. The phases of high amplitude
pure tones may or may not be preserved more carefully (they are
relatively rare and L/R tend to be in phase, so there is generally
little reason not to spend a few more bits on them).</p>

<h4>example: eight phase stereo</h4>

<p>Vorbis may implement phase stereo coupling by preserving the entirety
of the magnitude vector (essential to fine amplitude and energy
resolution overall) and quantizing the angle vector to one of only
four possible values. Given that the magnitude vector may be positive
or negative, this results in left and right phase having eight
possible permutations, thus 'eight phase stereo':</p>

<p><img src="eightphase.png" alt="eight phase"/></p>

<p>Left and right may be in phase (positive or negative), the most common
case by far, or out of phase by 90 or 180 degrees.</p>
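<p>One plausible reading of this scheme, offered only as a sketch, snaps
each integer angle to the nearest of four values: <tt>0</tt> (in phase),
<tt>+|magnitude|</tt> and <tt>-|magnitude|</tt> (one channel zero, the
90 degree cases), and <tt>-2*|magnitude|</tt> (180 degrees out of phase).
That choice of four values is an inference from the description above,
not something taken from the Vorbis specification or from
<tt>libvorbis</tt>, and the function name <tt>eight_phase_quantize</tt>
is hypothetical.</p>

```c
#include <stdlib.h>

/* Snap a square polar angle to one of four allowed values (assumed set:
 * -2|mag|, -|mag|, 0, +|mag|). Angles run around the square, so an angle
 * just below +2|mag| is nearest to the 180 degree point; that point is
 * coded as angle -2|mag| with the sign of the magnitude flipped. */
void eight_phase_quantize(int *mag, int *ang)
{
    int m = abs(*mag);
    int twice = 2 * *ang;   /* doubled so half-step thresholds stay integral */

    if (m == 0) {
        *ang = 0;
        return;
    }
    if (twice >= 3 * m) {          /* wraps around to 180 degrees */
        *mag = -*mag;
        *ang = -2 * m;
    } else if (twice >= m) {       /* 90 degrees, B collapses to zero */
        *ang = m;
    } else if (twice > -m) {       /* in phase */
        *ang = 0;
    } else if (twice > -3 * m) {   /* 90 degrees, A collapses to zero */
        *ang = -m;
    } else {                       /* 180 degrees out of phase */
        *ang = -2 * m;
    }
}
```

<p>In this sketch the absolute value of the magnitude is never changed;
only the wrap-around case flips its sign, which selects the same
out-of-phase point from the other side of the square.</p>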

<h4>example: four phase stereo</h4>

<p>Similarly, four phase stereo takes the quantization one step further;
it allows only in-phase and 180 degree out-of-phase signals:</p>

<p><img src="fourphase.png" alt="four phase"/></p>

<h3>Point Stereo</h3>

<p>Point stereo eliminates the possibility of out-of-phase signal
entirely. Any diffuse quality to a sound source tends to collapse
inward to a point somewhere within the stereo image. A practical
example would be balanced reverberations within a large, live space;
normally the sound is diffuse and soft, giving a sonic impression of
volume. In point stereo, the reverberations would still exist, but
sound fairly firmly centered within the image (assuming the
reverberation was centered overall; if the reverberation is stronger
to the left, then the point of localization in point stereo would be
to the left). This effect is most noticeable at low and mid
frequencies and using headphones (which grant perfect stereo
separation). Point stereo is a graceful but generally easy to
detect degradation to the sound quality and is thus used in frequency
ranges where it is least noticeable.</p>
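<p>In square polar terms, a simple way to picture point stereo is to
discard the angle vector. The sketch below assumes integer values and
reuses the decode function from the square polar mapping section; the
name <tt>point_stereo_collapse</tt> is hypothetical, and how an encoder
chooses the magnitude to keep is outside the scope of this sketch.</p>

```c
#include <stddef.h>

/* Decode: the square polar function from the text, written for ints. */
void square_polar_decode(int mag, int ang, int *a, int *b)
{
    if (mag > 0) {
        if (ang > 0) { *a = mag; *b = mag - ang; }
        else         { *b = mag; *a = mag + ang; }
    } else {
        if (ang > 0) { *a = mag; *b = mag + ang; }
        else         { *b = mag; *a = mag - ang; }
    }
}

/* Point stereo (sketch): zero every angle so that no out-of-phase
 * information survives; only the magnitude vector is kept. */
void point_stereo_collapse(int *angles, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        angles[i] = 0;
    }
}
```

<p>With a zero angle the decode function returns the same residue value
for A and B, so any left/right level difference at that frequency has to
come from the per-channel floor curves.</p>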

<h3>Mixed Stereo</h3>

<p>Mixed stereo is the simultaneous use of more than one of the above
stereo encoding models, generally using more aggressive modes in
higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>

<p>It is also the case that near-DC frequencies should be encoded using
lossless coupling to avoid frame blocking artifacts.</p>
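<p>A mixed mode built from lossless and point stereo could pick a model
per spectral value along the lines of the sketch below. Every name and
parameter here (<tt>choose_coupling</tt>, <tt>dc_guard_hz</tt>,
<tt>point_above_hz</tt>, <tt>quiet_below</tt>) is a hypothetical
illustration of the rules of thumb above; the actual thresholds and
decision logic used by any encoder are not given in this document.</p>

```c
#include <stdlib.h>

typedef enum { COUPLE_LOSSLESS, COUPLE_POINT } coupling_mode;

/* Choose a coupling model for one spectral value (illustrative only).
 * - Near-DC bins always stay lossless.
 * - Bins at or above point_above_hz use point stereo.
 * - In between, only low-amplitude values use point stereo. */
coupling_mode choose_coupling(double freq_hz, int magnitude,
                              double dc_guard_hz, double point_above_hz,
                              int quiet_below)
{
    if (freq_hz < dc_guard_hz) {
        return COUPLE_LOSSLESS;
    }
    if (freq_hz >= point_above_hz) {
        return COUPLE_POINT;
    }
    if (abs(magnitude) < quiet_below) {
        return COUPLE_POINT;
    }
    return COUPLE_LOSSLESS;
}
```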

<h3>Vorbis Stereo Modes</h3>

<p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
constructed out of lossless and point stereo. Phase stereo was used
in the rc2 encoder but, for simplicity's sake, is not currently used. It
will likely be re-added to the stereo model in the future.</p>

<div id="copyright">
  The Xiph Fish Logo is a
  trademark (&trade;) of Xiph.Org.<br/>

  These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
</div>

</body>
</html>