During our last lecture, we covered lossless sound coding. That was basically PCM (Pulse Code Modulation), which is essentially what the .WAV sound file format uses.
This talk will mostly concentrate on lossy methods, but we'll mention some common extensions to lossless formats too.
Human Auditory System
Well, as mentioned earlier, the human ear can only hear a limited range of frequencies. The ear can theoretically hear from 20Hz to 22kHz, but it responds best to frequencies in the range of 2kHz to 4kHz.
So when dealing with sound for human consumption, we may be able to mask (or limit the effect of) frequencies outside this range, and the human listener won't hear much of a difference.
There are also two masking effects to consider: frequency masking and temporal masking.
Frequency Masking (also known as auditory masking) occurs when a sound we can normally hear is masked by another sound with a nearby frequency. The effect depends on the actual frequencies.
A good lossy sound compression scheme must make use of this.
Temporal Masking
Temporal masking may occur when a strong sound at frequency f is preceded or followed closely in time by a weaker sound at the same (or nearly the same) frequency.
Just as with any data, we can apply RLE, Huffman, etc., to sound. The catch is that these usually don't work very well on their own, because sound is a digitized analog wave, and the sample values rarely repeat exactly.
Usually we use a combination of some lossy scheme, and lossless compression. For example, silence compression.
This works by utilizing quiet areas of the sound signal. We can usually just zero them out, and treat them as total silence. We can then apply something like RLE to compress those runs of 0s.
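As a sketch of the idea (the function names and the threshold value here are made up for illustration, not part of any standard):

```python
def compress_silence(samples, threshold=4):
    """Zero out near-silent samples, then run-length encode the zero runs.

    Output tokens: either a plain sample value, or ("Z", run_length)
    for a run of zeros. `threshold` is an illustrative cutoff.
    """
    quiet = [0 if abs(s) < threshold else s for s in samples]
    out = []
    i = 0
    while i < len(quiet):
        if quiet[i] == 0:
            j = i
            while j < len(quiet) and quiet[j] == 0:
                j += 1                      # extend the run of zeros
            out.append(("Z", j - i))
            i = j
        else:
            out.append(quiet[i])
            i += 1
    return out

def decompress_silence(tokens):
    """Expand ("Z", n) tokens back into n zero samples."""
    samples = []
    for t in tokens:
        if isinstance(t, tuple):
            samples.extend([0] * t[1])
        else:
            samples.append(t)
    return samples
```

Note that this is lossy: samples below the threshold are replaced by true silence and cannot be recovered.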
There is also the idea of companding (compressing/expanding), which exploits the fact that our ear notices differences in mid-range loudness better than differences between very loud sounds. Basically we can use an equation like:
mapped = 32767 * (2^(sample/65536) - 1)
to turn a 16 bit sample into a 15 bit sample, in such a way that the lowest amplitudes are less affected by this `compression' than the higher amplitudes. To get the sample back, we use:
sample = 65536 * log2(1 + mapped/32767)
More compression can be achieved by using a smaller number than 32767. For example, we can turn 16 bit samples into 8 bit samples (and halve the size of the file).
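A minimal sketch of the two equations above (function names are illustrative; the sign is handled separately, and rounding makes the round trip approximate rather than exact):

```python
import math

def compand(sample):
    # map a 16 bit magnitude (0..65535) down to ~15 bits (0..32767);
    # low amplitudes keep more resolution than high ones
    s = abs(sample)
    m = round(32767 * (2 ** (s / 65536) - 1))
    return int(math.copysign(m, sample))

def expand(mapped):
    # inverse transform: recover the (approximate) original sample
    m = abs(mapped)
    s = round(65536 * math.log2(1 + m / 32767))
    return int(math.copysign(s, mapped))
```

Because of the rounding in both directions, a sample typically comes back within a unit or two of its original value, with the error concentrated in the louder amplitudes.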
u-Law and A-Law
These two are `standard' companding-type methods. u-Law is used in North America and Japan, and A-Law is used elsewhere, to compress ISDN digital telephone signals.
The u-Law encoder inputs a 14 bit sample and, via a non-linear transform, outputs an 8 bit sample. A-Law is similar, except that it starts with a 13 bit sample.
The telephone signal is sampled at 8kHz at 14 bits per sample, which is 112,000 bits/s. At a compression factor of 1.75 (14/8), the encoder outputs 64,000 bits/s.
(u-Law uses: sgn(x) * ln(1 + u|x|) / ln(1 + u), where u can be 25, 255, or 2555; 255 is the common choice.)
Basically we want to go from samples in a 16,000-value range (roughly -8000 to 8000) to samples in the range -127 to 127.
Similar idea applies to A-Law.
PCM (Pulse Code Modulation) is not the only way we can encode wave data. We can also use Differential Pulse Code Modulation (DPCM), which basically says: instead of encoding the absolute value of each sample, we encode the difference between the current value and the previous one. Since most waves are `continuous', the difference between consecutive samples is usually relatively small, and may reduce the bit count per sample.
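The DPCM idea in a few lines (function names are illustrative):

```python
def dpcm_encode(samples):
    """Encode each sample as its difference from the previous sample."""
    prev = 0
    diffs = []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the differences."""
    prev = 0
    out = []
    for d in diffs:
        prev += d
        out.append(prev)
    return out
```

The differences themselves are what we would then hand to an entropy coder; small values dominate, so they compress much better than the raw samples.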
There is also the idea of not just encoding the difference, but encoding the difference from our `guessed' value. For example, we see that a wave is going down (so a bunch of samples are decreasing). What we do is guess that it will continue to decrease, and if we guess correctly, we don't need to save anything but the fact that our guess was correct.
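A minimal sketch of that predictive idea, using one simple choice of predictor (assume the trend between the last two samples continues; the predictor choice and names are illustrative):

```python
def predictive_encode(samples):
    """Store the residual between each sample and a linear prediction."""
    residuals = []
    p1 = p2 = 0                     # last two samples seen
    for s in samples:
        guess = p1 + (p1 - p2)      # assume the current trend continues
        residuals.append(s - guess) # 0 means the guess was exactly right
        p2, p1 = p1, s
    return residuals

def predictive_decode(residuals):
    """Re-run the same predictor and add back the stored residuals."""
    out = []
    p1 = p2 = 0
    for r in residuals:
        s = p1 + (p1 - p2) + r
        out.append(s)
        p2, p1 = p1, s
    return out
```

On a steadily decreasing run of samples the residuals collapse to zeros after the first couple of values, which is exactly the "we only need to record that the guess was correct" effect described above.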
In this section, we'll discuss MPEG audio compression. The way this works is very similar to JPEG: same idea, just a different domain.
Basically the standard specifies 3 layers. Each layer adds functionality on top of the layer below it, and each higher layer is backwards compatible with the lower layers (thus, a layer 3 player can play anything encoded at layer 3 or below).
The encoder works on chunks of 512 samples. It transforms each chunk into the frequency domain using what is essentially a Fourier transform (it actually uses a Polyphase Filter Bank, which is basically a coarser Fourier transform; we must remember that the encoder needs to be relatively efficient and fast, and a Fourier transform, even the Fast Fourier Transform, is relatively expensive, so an approximation is probably the best way to go). The encoder then breaks the resulting frequencies into 32 equal-width frequency sub-bands.
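As a rough sketch of the sub-band split (this uses a plain DFT over the 512-sample chunk and then groups bins into bands; the real encoder's polyphase filter bank is different, and all names here are illustrative):

```python
import cmath

def subband_energies(samples, num_bands=32):
    """Split a chunk into equal-width frequency sub-band energies.

    Illustrative only: computes a direct DFT of the chunk, keeps the
    positive-frequency bins, and sums them in groups of equal width
    (512 samples -> 256 bins -> 32 bands of 8 bins each).
    """
    n = len(samples)
    half = n // 2                   # positive-frequency bins
    spectrum = [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(half)
    ]
    width = half // num_bands       # DFT bins per sub-band
    return [sum(spectrum[b * width:(b + 1) * width])
            for b in range(num_bands)]
```

A pure tone lands almost entirely in one band, which is what lets the psychoacoustic model later decide, band by band, how coarsely each one can be quantized.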
Out of those 32 sub-bands, 24 `critical' frequency bands are selected. A psychoacoustic model is then applied to those 24 sub-bands to handle frequency masking and temporal masking.
Quantization is then applied to those sub-bands.
For layer I, each of the 32 sub-bands has 16 signals (512/32). Since we only care about the 24 critical sub-bands, we have essentially 384 signals (24 * 16). Layers II and III have 36 signals per sub-band, for a total of 1152 signals.
More details in class...