Your Audio Compression Format Sucks!


An Introduction to the
“Thor-like” Power of Ogg Vorbis!
Robert W. Ferguson III
January 30, 2003
Xiphophorus
Xiphophorus is a genus of freshwater fish
comprising 23 species.
Since the 1920s, it has been known that one
can make hybrids between the different
species easily. In some cases, one simply
has to place one Xiphophorus species next
to another in an aquarium, and they will
reproduce.
XIPH.COM
Xiphophorus is a non-profit organization
responsible for the Ogg project.
Xiphophorus releases its software under the GPL.
All cool companies have an X to start
their name.
What Is Ogg Vorbis?
The Ogg project is an open-source
alternative to proprietary and patented
codecs for digital media (for both audio
and video).
The Vorbis project is responsible for the
creation of a perceptual audio encoder
similar to the famous, inherently evil,
proprietary codecs popularized by global,
illegal file sharing.
It Is Not MP3
Vorbis is in the same category as
MPEG-4 (AAC)
And similar to, but higher performance
than
MPEG-1/2 audio layer 3
MPEG-4 audio (TwinVQ)
WMA - Windows Media Audio
PAC
Classification
Vorbis I
Vorbis I is a forward-adaptive monolithic
transform CODEC based on the Modified
Discrete Cosine Transform.
The codec is structured to allow addition of a
hybrid wavelet filter bank in Vorbis II to offer
better transient response and reproduction
using a transform better suited to localized
time events.
Packets
Vorbis uses free-form packets that have no
minimum size, maximum size, or fixed/expected
size. Packets are designed so that they may be
truncated (or padded) and remain decodable.
Error Detection
Vorbis provides none of its own protection
against errors.
 It is solely a method of accepting input
audio, dividing it into individual frames and
compressing these frames into raw,
unformatted 'packets'.
ATH – Absolute Threshold
of Hearing
Most codecs assume volume is fixed during
playback. Vorbis assumes that volume can be
adjusted.
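
As a point of reference, the absolute threshold of hearing is often approximated in the literature with Terhardt's closed-form curve. The sketch below uses that textbook formula, not Vorbis's internal ATH data, purely to illustrate how strongly the threshold varies with frequency.

    import math

    def ath_db_spl(freq_hz):
        """Terhardt's approximation of the absolute threshold of hearing.
        Returns the quietest audible level (dB SPL) at a frequency in Hz.
        This is a textbook curve, not the one built into Vorbis."""
        f = freq_hz / 1000.0  # work in kHz
        return (3.64 * f ** -0.8
                - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
                + 1e-3 * f ** 4)

    # The ear is far less sensitive at 50 Hz (~40 dB SPL) than at 3 kHz (~-5 dB SPL).
    print(ath_db_spl(50.0), ath_db_spl(3000.0))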
Tone Masking
Tone masking is when louder frequencies
mask out adjacent quieter ones.
Most codecs use a psychoacoustics model
to calculate what’s left as well as possible
within given bit-rate limits.
Vorbis approximates the same thing using
as many bits as it takes.
Coupling
Most recordings consist of multiple channels
with redundancy between them. This redundancy
is exploited to lower the bit-rate by encoding
the channels in some joint representation.
The simplest example is to encode the average
and the difference between channels (for a
stereo sound) – this is called mid/side
representation and it requires fewer bits for
sections that are close to mono.
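
As a rough illustration of the mid/side idea, here is a toy transform in Python; this is not Vorbis's actual coupling code.

    def ms_encode(left, right):
        """Convert L/R samples to mid/side: mid is the average, side is
        half the difference. Near-mono input gives a side channel close
        to zero, which costs very few bits to encode."""
        mid = [(l + r) / 2.0 for l, r in zip(left, right)]
        side = [(l - r) / 2.0 for l, r in zip(left, right)]
        return mid, side

    def ms_decode(mid, side):
        """Invert the transform to recover the original L/R samples."""
        left = [m + s for m, s in zip(mid, side)]
        right = [m - s for m, s in zip(mid, side)]
        return left, right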
Channel Support
Vorbis supports up to 255 channels.
At the moment the encoder only knows how to
use coupling for 2-channel files, but
eventually it will scale.
Vector Quantization
Vector Quantization (VQ) is a lossy data
compression method in which vectors are rounded
off to a representative value for their
encoding region.
Basically, if you group together numbers
describing different channels, your channels
become automatically coupled (normally a
group would be picked from data describing a
single channel, so channels would be
approximated independently).
Vector Quantization…
The process of VQ introduces some vector
quantization noise: the difference between
the approximation (chosen from a limited set
of codebook entries) and the original group
of numbers.
All codecs suffer from quantization
problems. VQ should suffer less.
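
A minimal sketch of the idea in Python (a toy quantizer, not the Vorbis codebook format): the encoder replaces each vector with the index of its nearest codebook entry, and the decoder simply looks that index up.

    def vq_encode(vectors, codebook):
        """Map each input vector to the index of its nearest codebook
        entry (squared Euclidean distance). Grouping samples from several
        channels into one vector is what couples the channels."""
        def nearest(v):
            return min(range(len(codebook)),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(v, codebook[i])))
        return [nearest(v) for v in vectors]

    def vq_decode(indices, codebook):
        """Replace each index with its codebook vector; the difference
        from the original vectors is the quantization noise."""
        return [list(codebook[i]) for i in indices]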
Memory Usage
The vector codebooks used in the first
stage of decoding are packed, in their
entirety, into the Vorbis bit-stream headers.
In packed form, these codebooks occupy
only a few kilobytes; the extent to which
they are pre-decoded into a cache is the
dominant factor in decoder memory
usage.
Following the Standard
Vorbis is defined by its decoder: any file
that decodes correctly under the decoding
standard is compliant, regardless of the
encoding method used to produce it.
Headers
Identification Header
 The identification header identifies the bitstream as Vorbis, gives the
Vorbis version, and lists the simple audio characteristics of the stream,
such as sample rate and number of channels.
Comment Header
 The comment header includes user text comments ["tags"] and a
vendor string for the application/library that produced the bitstream.
Setup Header
 The setup header includes extensive CODEC setup information as well
as the complete VQ and Huffman codebooks needed for decode.
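
To make the identification header concrete, here is a small Python sketch that pulls the basic fields out of an identification-header packet. It assumes the packet has already been extracted from its Ogg page; the offsets follow the Vorbis I field layout.

    import struct

    def parse_identification_header(packet):
        """Parse the Vorbis identification header (the first header packet).
        Raises if the packet type byte or the 'vorbis' magic is wrong."""
        if packet[0] != 1 or packet[1:7] != b"vorbis":
            raise ValueError("not a Vorbis identification header")
        version, channels, sample_rate = struct.unpack_from("<IBI", packet, 7)
        br_max, br_nominal, br_min = struct.unpack_from("<iii", packet, 16)
        blocksizes = packet[28]
        return {
            "version": version,              # must be 0 for Vorbis I
            "channels": channels,
            "sample_rate": sample_rate,
            "bitrate_max": br_max,
            "bitrate_nominal": br_nominal,
            "bitrate_min": br_min,
            # Block sizes are stored as exponents in one byte,
            # low nibble first (Vorbis packs bits LSb-first).
            "blocksizes": (1 << (blocksizes & 0x0F), 1 << (blocksizes >> 4)),
        }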
Decoding Procedure
The decoding and synthesis procedure for all
audio packets is fundamentally the same.
 1. decode packet type flag
 2. decode mode number
 3. decode window shape [long windows only]
 4. decode floor
 5. decode residue into residue vectors
 6. inverse channel coupling of residue vectors
 7. generate floor curve from decoded floor data
Decoding Procedure...
 8.
compute dot product of floor
and residue, producing audio
spectrum vector
 9.
inverse monolithic transform
of audio spectrum vector, always
an MDCT in Vorbis I
 10. overlap/add left-hand output
of transform with right-hand output
of previous frame
 11. store right hand-data from
transform of current frame for
future lapping.
 12. if not first frame, return
results of overlap/add as audio
result of current frame
Rearrangement of the synthesis arithmetic is possible.
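
As a sketch only, steps 8 through 11 map onto per-channel arithmetic roughly like the Python below. It assumes the bit-unpacking steps (1-7) have already produced floor curves and residue vectors; the parameter names and the inverse_transform callback are illustrative, not libvorbis API.

    def synthesize_frame(floor_curves, residue_vectors, prev_right_halves,
                         inverse_transform):
        """Steps 8-11 of the procedure above, applied per channel."""
        finished = []
        new_right_halves = []
        for floor, residue, prev_right in zip(floor_curves, residue_vectors,
                                              prev_right_halves):
            # step 8: multiply floor curve and residue element by element
            # (the "dot product" step) to produce the audio spectrum vector
            spectrum = [f * r for f, r in zip(floor, residue)]
            # step 9: inverse monolithic transform (always an MDCT in Vorbis I)
            pcm = inverse_transform(spectrum)
            half = len(pcm) // 2
            # step 10: overlap/add left half with the previous frame's right half
            finished.append([p + c for p, c in zip(prev_right, pcm[:half])])
            # step 11: save the right half for lapping against the next frame
            new_right_halves.append(pcm[half:])
        return finished, new_right_halves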
Controversy
 The entire probability model of the codec, the Huffman
and VQ codebooks, is packed into the bitstream header
along with extensive CODEC setup parameters (often
several hundred fields).
 It is impossible to embed a simple frame-type flag in
each audio packet or to begin decode at any frame in the
stream without having previously fetched the codec
setup header.
 Vorbis can initiate decode at any arbitrary packet within
a bitstream so long as the codec has been
initialized/setup with the setup headers.
Window Shape Decode
Vorbis frames use one of two PCM sample sizes
specified during codec setup. In Vorbis I, legal
frame sizes are powers of two from 64 to 8192
samples. Aside from coupling, Vorbis handles
channels as independent vectors and these
frame sizes are in samples per channel.
Overlapping Windows
Vorbis uses an overlapping transform, namely
the MDCT, to blend one frame into the next,
avoiding most inter-frame block boundary
artifacts. The MDCT output of one frame is
windowed according to MDCT requirements,
overlapped 50% with the output of the previous
frame and added. The window shape assures
seamless reconstruction.
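
A small sketch of the equal-blocksize case in Python: the window formula is the one given in the Vorbis I specification, while the overlap-add helper is a generic 50%-lap illustration rather than decoder code.

    import math

    def vorbis_window(block_size):
        """Vorbis I window: w(n) = sin(pi/2 * sin^2(pi*(n + 0.5)/N)).
        It satisfies the Princen-Bradley condition, which is what makes
        lapped, windowed MDCT frames reconstruct seamlessly."""
        return [math.sin(0.5 * math.pi *
                         math.sin(math.pi * (n + 0.5) / block_size) ** 2)
                for n in range(block_size)]

    def overlap_add(prev_right_half, current_windowed):
        """Add the saved right half of the previous frame to the left half
        of the current frame (50% overlap). Returns the finished samples
        and the new right half to save for the next frame."""
        half = len(current_windowed) // 2
        finished = [p + c for p, c in
                    zip(prev_right_half, current_windowed[:half])]
        return finished, current_windowed[half:]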
Dealing with Windows
Overlap-add is slightly more complex in the case of
overlapping unequal-sized windows, as when a long block
is lapped against a short one.
Inverse Monolithic Transform
 The audio spectrum is converted back into time-domain
PCM audio via an inverse modified discrete cosine
transform (MDCT). A detailed description of the MDCT is
available in the paper "The use of multirate filter banks
for coding of high quality digital audio" by T. Sporer,
K. Brandenburg, and B. Edler.
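
For intuition, the inverse MDCT can be evaluated directly from its definition as sketched below. This is O(N^2), so real decoders use a fast FFT-based algorithm instead, and the 1/N scaling shown here is one common convention among several.

    import math

    def imdct_naive(spectrum):
        """Direct inverse MDCT: N spectral coefficients -> 2N time samples.
        y(n) = (1/N) * sum_k X(k) * cos(pi/N * (n + 0.5 + N/2) * (k + 0.5)).
        Textbook definition for illustration; libvorbis uses a fast transform."""
        n = len(spectrum)
        shift = 0.5 + n / 2.0
        return [sum(coeff * math.cos(math.pi / n * (t + shift) * (k + 0.5))
                    for k, coeff in enumerate(spectrum)) / n
                for t in range(2 * n)]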