Stereo and Surround Panning in Practice

Surround From Stereo

David Griesinger, Lexicon

www.world.std.com/~griesngr

Main Message

• Two channel audio is ubiquitous and obsolete.

– Reproducing two channels through multiple loudspeakers in the listening room (or a car) is a big improvement, even if it is poorly done.

• Fully automatic two-channel-to-surround processors are widely available as consumer products.

– We need to make a processor that works well.

• Operator-controlled two-channel-to-surround processors are an important component in the re-mix process for creating discrete recordings from multichannel or two channel masters.

– The better we can make this component the better the ultimate product.

• The design of a processor that works well can teach us how to make better discrete recordings.

– If a machine can make a better surround mix than the typical sound engineer, perhaps we have something to learn!

– The design of a superior two-channel-to-surround processor requires an understanding of room acoustics and human perception. This knowledge can be applied to recording (and processing) technique.

Introduction

• Two channel sound reproduction (stereo) was introduced about 50 years ago.

– The improvement in emotional involvement provided by stereo over mono is compelling and easy to demonstrate.

– Since then the basic principles of stereo reproduction have not changed.

• Improvements in S/N or frequency response are welcome, but not compelling.

• It seems likely that CDs replaced LPs largely because they are more convenient, although the improvement in sound quality was also easily heard.

• The improvements in envelopment and sound stage provided by multichannel sound are easy to appreciate and demonstrate – but they are seldom compelling.

– Current surround recordings improve the reproduction of the original hall, but are otherwise very similar to a two channel recording.

– It takes an innovative and risky mix to show what surround sound can do emotionally, and these mixes are rare.

• The major market for surround recordings in the future will probably be for playback in automobiles.

– Automobile playback probably requires at least a five-to-seven channel conversion as part of the playback system.

– This playback venue can reveal problems with common mixing techniques.

Home playback of surround will include video.

Standard Video and DVD are also obsolete

• Standard NTSC video was also introduced 50 years ago, and has changed very little.

– Improvements in color rendition and S/N are welcome, but result in little change in emotional involvement.

– Standard video and DVD have insufficient resolution and (usually) too small a screen to fully engage human visual perception.

– Competent directors compensate with frequent close-ups, fast editing etc.

• The result is distracting and ultimately boring, as the director is always forcing us to watch a particular aspect of the show.

• At best such a presentation is good for one viewing only.

– Videos of music performances show the heads or fingers of particular performers, forcing us to listen to only one musical line.

• The intelligent interplay between musical lines (the essence of Western music) is lost.

What is the Solution?

• We need to upgrade both the audio and the video if we want to create a product that is more compelling than current standards.

– Translation: We need to develop a product that will generate substantial sales.

• High-Definition video combined with multichannel audio gives this opportunity.

– The audio can be made backwards compatible through an active downmixer, as we will see in this talk.

– The video is backwards compatible if the high-def image is cropped and scanned.

And this is becoming routine.

High-Definition Demo

• Brahms F minor Piano Quintet

– Performed by the faculty of the Point-Counter-Point Summer camp.

– Video is high-definition (with some artifacts).

– Audio is two channel, single microphone pick-up.

– Played here (after post production) with two channel to five-channel processing.

Five channel to Two channel Encoding

• It is widely believed that it is impossible to automatically mix a five channel recording into a two channel recording.

– Standard methods of downmixing have several serious errors in balance.

– These errors can be analyzed and corrected with an active downmixer.

• The results can be surprising. Very often the downmixed 5 channel recording is better than the manually mixed two channel recording.

Desired properties of an active downmixer

• Most importantly, the effective energy of each signal source should be preserved in the two channel output.

– There is no passive mixer that can achieve this goal.

• Next, the two channel mix should reproduce the position of inner voices in the identical position as the 5 channel mix.

– This also requires an active mixer, with compensation for errors in the sine/cosine pan law.

– We also must be careful to preserve the stereo width of decorrelated signals applied to the rear inputs.

• Finally, we can make some adjustments to the mix based on dynamic analysis of the input signals.

– Rear signals that are mostly reverberation should be attenuated in the mix by 3dB.

– Relatively low level signals in the rear that have prominent onsets can be briefly brought up in the mix, to give them a better chance of being heard and properly decoded.

Center channel energy

• The balance between the center vocal and the surrounding instruments is critical.

– Downmixing this essential component is difficult, as different engineers can mix it in different ways.

• The usual center downmix mixes the center equally into both output channels with an attenuation of 3dB.

• This mix works well if the vocals are entirely phantom (no center channel) or hard center (no phantom).

• However it is very common to mix the vocals equally in all front channels.

• When this configuration is downmixed the vocals will be more than 2dB too strong unless some correction is made.

– Likewise an instrument panned half-way between center and left or right will have about a 2dB amplitude error.

– How can we correct for this common error?
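The size of these errors can be checked with a short calculation. The sketch below is not part of the talk: it assumes that loudness can be approximated by the sum of the channel powers, and the function names are made up for illustration. It applies the usual passive center downmix (center at -3dB into both outputs) to the four vocal placements mentioned above.

```python
import math

def power_db(channels):
    """Total power of a set of channel amplitudes, in dB."""
    return 10.0 * math.log10(sum(a * a for a in channels))

def passive_downmix(left, center, right, center_gain=math.sqrt(0.5)):
    """Usual passive front downmix: center goes equally to both outputs at -3 dB."""
    return (left + center_gain * center, right + center_gain * center)

def downmix_error_db(left, center, right):
    """Level change in dB caused by the downmix (assumes loudness ~ summed power)."""
    before = power_db((left, center, right))
    after = power_db(passive_downmix(left, center, right))
    return after - before

cases = {
    "hard center": (0.0, 1.0, 0.0),
    "phantom center": (math.sqrt(0.5), 0.0, math.sqrt(0.5)),
    "equal in L, C, R": (1.0, 1.0, 1.0),
    "half-way between L and C": (math.sqrt(0.5), math.sqrt(0.5), 0.0),
}
for name, (l, c, r) in cases.items():
    print(f"{name}: {downmix_error_db(l, c, r):+.2f} dB")
```

Under this assumption the hard center and phantom center cases come out with no error, while the other two come out more than 2dB too strong, which is consistent with the figures quoted above.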

Rear channel energy

• The same problem occurs in the rear channels.

• The basic operation for the rear channels is to mix the left rear input (x0.91) to the left output, while also mixing it (x-0.38) to the right output.

– The result is to preserve the loudness of a discrete rear input.

• But when the two rear inputs are driven in phase, the result is a 2dB to 3dB extra increase in the loudness of the output.

– In a passive encoder it is not possible to have the correct balance for discrete rear inputs while also having the correct balance when the rear inputs are driven in phase.

– All available passive encoders (Dolby Surround, etc.) attenuate the rear inputs by 3dB to compensate for this error.

– This makes discrete rear inputs (left rear only) 3dB too weak.

• In addition there is a directional error as the signals pan from left rear only to left and right rear together.
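The rear channel balance problem can be reproduced numerically. This is only a sketch: it uses the 0.91 and -0.38 coefficients quoted above, but the sign convention is an assumption (after the phase shift networks, the two rear inputs are taken to add with the same sign in each output, with the right output inverted), and the function names are hypothetical.

```python
import math

A, B = 0.91, 0.38  # rear mix coefficients quoted in the talk

def rear_encode(ls, rs, attenuation=1.0):
    """Mix the two rear inputs into the two outputs.

    Assumption: both rear inputs add with the same sign in each output,
    and the right output carries the opposite sign to the left output.
    """
    left_out = attenuation * (A * ls + B * rs)
    right_out = -attenuation * (B * ls + A * rs)
    return left_out, right_out

def level_change_db(ls, rs, attenuation=1.0):
    """Output power relative to input power, in dB."""
    lo, ro = rear_encode(ls, rs, attenuation)
    return 10.0 * math.log10((lo * lo + ro * ro) / (ls * ls + rs * rs))

print(f"left rear only:         {level_change_db(1.0, 0.0):+.2f} dB")
print(f"rear inputs in phase:   {level_change_db(1.0, 1.0):+.2f} dB")
print(f"left rear only, -3 dB:  {level_change_db(1.0, 0.0, math.sqrt(0.5)):+.2f} dB")
```

With these assumptions a discrete rear input keeps its level to within a fraction of a dB, in-phase rear inputs gain a little over 2dB, and a fixed 3dB attenuation leaves the discrete input about 3dB too weak.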

A complaint

• This problem has vexed me from the beginning of my work on matrix audio.

• Early L7 encoders measured the phase coherence between the front and rear channels, and adjusted mix coefficients on an ad-hoc basis.

– The result was adequate for the rear energy problem, but the panning errors remained.

– Panning errors were also problematic in the front, particularly when vocals were placed in multiple channels.

• It is really very difficult to determine how the mix engineer has placed the vocals – and yet you must do so to correctly downmix the piece.

– Mix engineers who put the vocals in all five channels are really asking for trouble.

– And if they add random time delays to the center channel the result becomes impossible to downmix at all.

• (The difference these delays make to the sound is almost entirely imaginary.)

Solution:

• But there is a solution! Elegant, simple, and thoroughly obvious.

• We can correct the output signal based on an energy comparison.

– We measure the energy of the input signals, and compare this to the energy of the output signals.

• It is important to frequency contour the measurement to correspond to human loudness sensitivity.

– The difference signal can be used in a simple feedback loop to correct the amplitudes of the mix.

– As an added advantage it is possible with this technique to correct for panning errors, and for the difference between the sine/cosine pan law and subjective pan positions.

– (patent applied for)
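A feedback loop of this kind can be sketched as follows. This is an illustration only, not the patented design: it uses a single overall correction gain rather than individual mix coefficients, it omits the loudness frequency contour, and the function name, block structure and `rate` parameter are assumptions.

```python
import math

def energy_feedback(blocks, rate=0.25):
    """Track a correction gain (dB) that makes output energy match input energy.

    blocks: iterable of (input_energy, uncorrected_output_energy) per block.
    Returns the correction gain in dB after each block.
    """
    gain_db = 0.0
    history = []
    for e_in, e_out in blocks:
        if e_in > 0.0 and e_out > 0.0:
            corrected_out_db = 10.0 * math.log10(e_out) + gain_db
            error_db = 10.0 * math.log10(e_in) - corrected_out_db
            gain_db += rate * error_db  # move a fraction of the way each block
        history.append(gain_db)
    return history

# Vocal mixed equally into L, C and R: input energy 3, passive output
# energy 2 * (1 + sqrt(0.5))**2 (center mixed at -3 dB into both outputs).
blocks = [(3.0, 2.0 * (1.0 + math.sqrt(0.5)) ** 2)] * 40
history = energy_feedback(blocks)
print(f"correction after {len(history)} blocks: {history[-1]:.2f} dB")
```

For this input the gain settles near -2.9dB, which removes the excess level of a vocal mixed equally into all three front channels. A smaller `rate` gives a slower but smoother correction.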

Encoder Block Diagram

Here you can see the basic structure of the active encoder. Note the center is mixed to the output with two variable coefficients, ml and mr (usually mr = ml = -3dB).

The rear channels are mixed with the basic 0.91 and -0.38 ratio to the outputs, but they are also attenuated by the coefficients mi and ms, to preserve the correct output power under all conditions (usually mi = 1, ms = 0).

The use of 90 degree phase shift networks is common in all encoders, as it allows arbitrary pans from front to rear.

Surround from Two channel

• Two choices:

– We can create a whole new mix – repositioning the musical forces, adding reverberation, etc.

• This type of upmix requires the thoughtful participation of a skilled operator.

• Ideally these operations are done by the musical copyright holder, for distribution on CD or DVD.

– Or we can preserve the original recording as much as possible, while using multichannel reproduction to:

• Enlarge the listening area

• Increase envelopment

• Transform the listening room to a larger, more comfortable space.

– This type of upmix should be fully automatic, and can be part of a consumer playback system.

• Automatic processing is particularly beneficial in small spaces, such as cars.

– This talk will concentrate on automatic processing.

• We want to reproduce the original recording, while improving the listening area and the overall impression.

• In this talk we will deliberately limit ourselves by not discussing adding anything to the original, such as reverberation or early reflections. Potentially at least, the original recording can be reconstructed from the converted outputs.

The default

• The MOST AUDIBLE difference between the various currently available 2-5 processors is what they do when there is NO detectable direction in the input sound.

– When there is no detectable directional cue we reproduce the sound in a default condition.

– The default condition is the most common condition in most music, and how we treat this case is of utmost importance.

• If the default behavior of the processor can improve the acoustic performance of the original without altering the original sound stage, we will have succeeded.

• Our major goal is to “do no harm” to the original mix, while improving listener envelopment and increasing the useable listening area.

What does “default” mean?

• In the default condition the input channels are decorrelated – in other words there is no common signal between the two channels.

– An example might be an orchestral recording with violins on one side and cellos on the other.

– Or any complex mix with a lot of reverberation or high left/right separation.

• In a recording mastered for two loudspeaker reproduction the default playback condition is normal stereo – a phantom image spread between the front two speakers.

– Two channel stereo reproduction allows the listener to identify the direction of the sound sources through the (buried) amplitude cues in the original recording.

• If we are unsure about what to do in converting this recording to multichannel, our best option is to preserve the stereo image.

• This seems obvious, but it is not standard practice

Standard 2-to-4 decoders

• The usual default condition of multichannel decoders is Dolby Surround:

• The center and rear channels are as loud as the main channels, causing the sound image to move strongly toward the center of the room.

• Image width is reduced by about ½.

• The envelopment in the room is also reduced since the rear energy is added entirely in the medial plane (so there is no lateral sound energy.)

Standard 5 channel decoders

• The most obvious extension of Dolby Surround to 5 channels uses a type of matrix to supply the left and right rear speakers.

• If the input signals contain a buried sound that is encoded to the rear, this matrix will send the signal predominantly to the appropriate speaker.

• However when there is NO buried encoded signal (which is almost always the case) this matrix results in a center channel that is too strong, and rear channels that are out of phase!

Antiphase rear channels

• The tendency of the rear channels to be out of phase is an inherent problem with phase/amplitude encoding and decoding.

– As we pan a signal from left to left surround the input signal is positive phase in the left input, and negative phase in the right input.

– If we pan from right to right surround the signal is positive in the right channel, and negative in the left.

• So what happens when we want to be fully to the rear? Should the input signals be negative on the left or the right?

– If we choose to always make the right channel with negative phase, then in the default condition the loudspeakers will be out of phase.

• The best solution to this problem is to incorporate a variable phase shift network in the rear channels, that will actively flip the phase when there is a sound that is strongly in the rear.

Examples – a “music” decoder

X-Y plot of the rear outputs of a popular “music” decoder when the inputs are driven by uncorrelated pink noise. The correlation coefficient is -0.25

Example – a “film” decoder

The problem can be corrected to some degree by reversing the phase of the right rear channel. The rear channels are now decorrelated – but pans from right front to right rear might be a little peculiar. The correlation coefficient is +0.25.

Example – a decorrelated default

It is possible to design a decoder where the default results in decorrelated rear channels. The result is sonically superior. The correlation coefficient is 0.
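The three correlation coefficients quoted in these examples can be reproduced with a toy model. The rear matrix below is hypothetical (the talk does not give the actual decoder coefficients): each rear output is its own side minus a small amount of the opposite side, and the crosstalk value is chosen only so that the result lands at -0.25. White Gaussian noise is used instead of pink noise for simplicity; for uncorrelated inputs the correlation coefficient does not depend on the spectrum.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rear_outputs(left, right, crosstalk, flip_right=False):
    """Hypothetical rear matrix: own side minus `crosstalk` times the other side."""
    left_rear = [l - crosstalk * r for l, r in zip(left, right)]
    right_rear = [r - crosstalk * l for l, r in zip(left, right)]
    if flip_right:
        right_rear = [-v for v in right_rear]
    return left_rear, right_rear

rng = random.Random(1)
left = [rng.gauss(0.0, 1.0) for _ in range(20000)]
right = [rng.gauss(0.0, 1.0) for _ in range(20000)]

K = 4.0 - math.sqrt(15.0)  # chosen so that -2K / (1 + K*K) = -0.25
for label, k, flip in (("antiphase crosstalk", K, False),
                       ("right rear inverted", K, True),
                       ("no crosstalk (decorrelated)", 0.0, False)):
    lr, rr = rear_outputs(left, right, k, flip)
    print(f"{label}: {correlation(lr, rr):+.2f}")
```

In this model inverting the right rear output only changes the sign of the correlation, while removing the crosstalk altogether brings it to zero.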

Block diagram for a decorrelated default

If we create the rear channels by delaying and frequency contouring the front channels, the full separation in the input channels is maintained in the rear.

This results in higher envelopment around the listeners and a more comfortable sound. The frequency contouring is vital – and contains shelving as well as rolloff.

The effect is highly audible, particularly in a small listening room or an automobile.

However a decorrelated default makes the decoder design more complex, as there is no inherent cancellation of a center speaker in the rear channels

Decorrelated rear steering

• The next most audible difference between current decoders is their behavior when playing surround recordings where the original rear channels were decorrelated. For example:

– Crowd noise in music CDs

– Orchestra sounds in the rear on a film

– Backup chorus in music CDs

• When these recordings are encoded to two channels these decorrelated signals result in an output with net negative correlation.

• It is desirable to decode these signals such that the original decorrelation is restored.

• Most available decoders do poorly on this test, and the result is highly audible – (particularly in small rooms)

Examples: Decorrelated rear steering

Panels: “Film” decoder (Corr ~0.5); “Music” decoder; Decorrelated decoder (“Music” and “Film”).

Listening Examples

• Decorrelated rear steering during crowd noise in “Hotel California”

– Note high positive or negative correlation leads to a flat, two-dimensional sound

• Example with pink noise

• (must play from CD)

How does a decoder work with directional signals?

Once we have a great default, we need to consider sounds which have a distinct direction:

• We need to detect the direction of these sounds.

• And then we need to adjust mix coefficients to direct these sounds to the appropriate speakers.

• Frequency contouring of the rear channels is essential, and must be made to depend on the degree of steering.

– A high frequency rolloff is important when reproducing sounds in the front, but it should vanish smoothly when signals move rear.

– In addition a shelving filter at about 300Hz allows the rear speakers to reproduce bass frequencies while not drawing attention to themselves.

• This filter should also disappear when signals move rear.

• The type of frequency contouring used in upmixing is also very useful in making discrete recordings!

Block diagram of a 5 channel decoder

This diagram does not include shelving and rolloff filters, which are essential in a practical design.

The matrix elements in a decoder designed for maximum decorrelation

• The output of the decoder is entirely determined by the way the ten matrix elements depend on l/r and c/s.

• We can graph the surface formed by each matrix element on the l/r c/s plane.

• By symmetry we need to graph only 5 of the 10 matrix elements.

© Lexicon

- a Harman International Company

Matrix elements in a decorrelated decoder

Figure: the matrix elements, arranged by input (Left Input, Right Input) and output (Center Output, Left Front Output, Left Rear Output).

Left input to Left Front output element (LFL) inverted back to front to reveal the trough at left rear

The peak in front keeps loudness correct for center signals

The Right input to Left front output matrix element (LFR)

The peak in front keeps loudness correct for mild front steering

The Left input to Left Rear output matrix element (LRL)

The ridge in the rear keeps separation high as sound moves back

The Right input to Left Rear output matrix element (LRR)

Note the ridge along the back - which keeps separation high in the rear

The Left input to Center output matrix element (CL)

Note the rapid increase in level as the steering moves forward. This preserves stereo separation while making a hard center

Decorrelated Decoder: This element shows particular attention to the center of the plane.

Conventional Decoder: These surfaces are defined by the edges, not the center.

For directional signals we must:

• 1. Analyze the original recording to determine the directions of the original instruments.

• 2. Adjust the mixing parameters of our processor to reproduce these sounds through a different loudspeaker arrangement in precisely the same positions.

Original recording analysis

• We must look for cues in the original recording that will allow us to determine the directions of all the instruments.

• We have to determine these directions quickly enough and accurately enough to mimic the properties of human localization.

– If we can closely approximate human hearing, the perceived results will be flawless.

• Human sound perception is based on detecting “Sound Events”.

– So our processor needs a “Sound Event Detector” with accurate estimations of probable sound direction.

• Human perception uses several directional cues:

– Direct localization of sound events through amplitude and time differences between the two ears.

• These localization cues are mostly supplied through amplitude panning in a recording.

– Overall “center of gravity” localization, done through level differences (for left-right) combined with head rotation (for front-back).

• This is how we can be aware that a group of instruments are behind us, even if there are no discrete sound events.

• If we determine that the original was surround encoded (Logic 7 or Dolby Surround) we need to correctly determine rear directions.

Amplitude Panning

• Nearly all current recordings employ amplitude panning.

– Phantom images are moved from one speaker location to another by varying the relative amplitude between the speakers.

• As a consequence the intended position of a sound source can be detected if we compare the amplitudes of the two input channels.

• Both applications (panning a source and detecting its intended position) require that we know the perceptual result of various amplitudes.

– We need to know the “pan-law”

The sine/cosine pan-law

• The most common pan-law in use is the sine/cosine pan-law:

• Left_output = cos(p)*input

• Right_output = sin(p)*input

• Where p is assumed to vary from 0 degrees (full left) to 90 degrees (full right), with center at p = 45 degrees.

• Note that then: (Left_output)^2 + (Right_output)^2 = (input)^2

Sine-Cosine drawing

If the sine-cosine pan law is accurate, and we set p=22.5 degrees, we should hear a sound image half-way between center and left.
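The pan law is easy to tabulate. The short sketch below (the function name is made up for illustration) evaluates the two gains for a few values of p and confirms that the total power stays constant.

```python
import math

def sin_cos_pan(p_degrees, signal=1.0):
    """Sine/cosine pan law: p = 0 is full left, 45 is center, 90 is full right."""
    p = math.radians(p_degrees)
    return math.cos(p) * signal, math.sin(p) * signal

for p in (0.0, 22.5, 45.0, 90.0):
    left, right = sin_cos_pan(p)
    power = left ** 2 + right ** 2
    print(f"p={p:5.1f}  left={left:.3f}  right={right:.3f}  power={power:.3f}")
```

At p = 22.5 degrees the gains are about 0.92 (left) and 0.38 (right). Whether that setting is actually heard half-way between center and left depends on how accurate the pan law is perceptually.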

How does a surround decoder work?

• Compatibility with Dolby Surround requires a phase/amplitude decoder – thus direction is determined by evaluating |Lin|/|Rin| (lr) and |Lin+Rin|/|Lin-Rin| (cs).

• Stereo compatibility requires that the front localization roughly follows a sine/cosine pan law.

– Thus for strongly steered signals

• Lin = cos(A)

• Rin = sin(A)

– As angle A varies from 0 to 90 degrees the sound pans from left front to right front.

• Strongly steered rear signals can be encoded by allowing angle A to increase from 90 degrees to 180 degrees.

The left/right and center/surround signals

• Define l/r = arctan(|Lin|/|Rin|) - 45 degrees

• Define c/s = arctan(|Lin+Rin|/|Lin-Rin|) - 45 degrees

– When there is no dominant direction in the input signal, both l/r and c/s ~= 0.

• A signal cannot be both left and center at the same time! Thus the sum of |l/r| and |c/s| is bounded.

– |l/r| + |c/s| <= 45 degrees

The l/r and c/s signals are bounded

• l/r and c/s are not independent for strongly steered signals, where Lin = cos(A), Rin = sin(A).

– In this case c/s ~= A, l/r ~= 45 - A.

• Allowed values for l/r and c/s fall within a diamond in the l/r c/s plane.

• A circularly panned signal will produce l/r and c/s values which lie on the boundary of the diamond.
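The definitions of l/r and c/s and the diamond-shaped bound can be checked numerically. In this sketch the levels |Lin|, |Rin|, |Lin+Rin| and |Lin-Rin| are taken to be RMS values over a block of samples, which is an assumption (the talk does not say how the levels are measured), and the function names are hypothetical.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def steering_angles(left, right):
    """Return (l/r, c/s) in degrees for two equal-length blocks of samples."""
    total = [a + b for a, b in zip(left, right)]
    diff = [a - b for a, b in zip(left, right)]
    # atan2 handles a zero denominator (e.g. full left, where |Rin| = 0).
    lr = math.degrees(math.atan2(rms(left), rms(right))) - 45.0
    cs = math.degrees(math.atan2(rms(total), rms(diff))) - 45.0
    return lr, cs

# Circular pan: Lin = cos(A), Rin = sin(A), A from 0 to 180 degrees.
for a_deg in range(0, 181, 15):
    a = math.radians(a_deg)
    lr, cs = steering_angles([math.cos(a)], [math.sin(a)])
    print(f"A={a_deg:3d}  l/r={lr:+6.1f}  c/s={cs:+6.1f}  sum={abs(lr) + abs(cs):5.1f}")

# Two uncorrelated noise signals: no dominant direction.
rng = random.Random(0)
noise_l = [rng.gauss(0.0, 1.0) for _ in range(20000)]
noise_r = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(steering_angles(noise_l, noise_r))
```

The panned signal gives |l/r| + |c/s| = 45 degrees at every angle, i.e. it traces the boundary of the diamond, while the uncorrelated noise lands close to the middle of the plane.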

A circularly panned noise signal as seen by a phase/amplitude decoder

Music signals will fall inside this boundary

A histogram of a typical stereo piece (first 30 seconds of Jennifer Warnes)

• Note the bulk of the power is not steered (uncorrelated), but it moves from the middle to the front

Sound event direction detection, newer circuit

• Jennifer Warnes – Bird on a wire, whole song

A Histogram of an encoded 5-channel piece (first 30 seconds of Boyz II Men)

Note the extensive use of the rear directions. Here we need full stereo width in the front and in the rear!

A Histogram of a classical 5-channel piece (30 seconds of 1812 recorded by Eargle)

Note the music has no net left/right bias. It is stereo, but the voices are mostly in the rear.

Event Detection

• A MAJOR problem in the design of a two-channel to multichannel processor is determining the true direction of the incoming sounds.

– When the recording includes only a single sound there is no problem.

– But most music is a complex mix of many sounds, with reverberation. Reverberation provides a random directional signal that can easily confuse the processor.

• An event detector uses information in the amplitude envelope to search for loudness patterns that represent notes in music or syllables in speech.

• Depending on the rise-time and fall-time of the detected sound events the time constants of the sound processor are adjusted to maximize the speed with which sounds can be re-directed to the correct positions, while NOT rapidly steering on more continuous music.

Accommodation in sound perception

• All modes of human perception emphasize transients.

– A perception which is sustained tends to drop from consciousness.

– This loss of perception of sustained events can happen quite early in the detection physiology.

• It does not need higher neural processing.

• We can more accurately detect event direction if we can ignore continuous signals.

– For example, consider a strong continuous vocal line in the center, with accompanying sound events on either side.

• We can build a differential level-ratio detector that will accommodate to continuous sounds, while correctly identifying the direction of transient syllabic events.

Differential direction detection

• Conventional direction detection relies on the level in the input channels:

• Define l/r = arctan(|Lin|/|Rin|) - 45 degrees

• Define c/s = arctan(|Lin+Rin|/|Lin-Rin|) - 45 degrees

• An essential feature of a differential detector is the realization that if we use the input power instead of the input level it is possible to accurately detect the direction of a buried transient event.

– Power is additive: if we subtract any constant power from any input, the power change during a transient event accurately reflects the direction of the transient event.

• The following equations give identical results to the above:

• l/r = arctan(sqrt(|Lin^2|/|Rin^2|)) - 45 degrees

• c/s = arctan(sqrt(|(Lin+Rin)^2|/|(Lin-Rin)^2|)) - 45 degrees

Now let the input power adapt to constant signals

• (Lin_adapt)^2 = Lin^2 – av(Lin^2)

• (Rin_adapt)^2 = Rin^2 – av(Rin^2)

• (Similarly for the sum and difference signals)

• If we detect directional angles using the adapted signals, transient events which occur during constant signals can be accurately localized.

• This scheme requires that we accurately detect the start of transient events. In the absence of such an event the unadapted directions must be used, or the direction detection becomes unacceptably noisy.
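A small worked example shows what the adaptation buys. The scenario and numbers below are invented for illustration: a constant center vocal of amplitude c in both channels, plus a transient of amplitude t in the left channel only, assumed uncorrelated with the vocal so that powers add. The function names are hypothetical.

```python
import math

def angle_deg(power_a, power_b):
    """arctan(sqrt(power_a / power_b)) - 45 degrees, safe when power_b is zero."""
    return math.degrees(math.atan2(math.sqrt(power_a), math.sqrt(power_b))) - 45.0

def detect(p_left, p_right, p_sum, p_diff, averages=(0.0, 0.0, 0.0, 0.0)):
    """Return (l/r, c/s) after subtracting the average powers (clamped at zero)."""
    powers = (p_left, p_right, p_sum, p_diff)
    adapted = [max(p - av, 0.0) for p, av in zip(powers, averages)]
    return angle_deg(adapted[0], adapted[1]), angle_deg(adapted[2], adapted[3])

c, t = 1.0, 1.0
# Powers of Lin, Rin, Lin+Rin, Lin-Rin with the vocal alone ...
steady = (c * c, c * c, 4 * c * c, 0.0)
# ... and while the left-only transient is sounding.
during = (c * c + t * t, c * c, 4 * c * c + t * t, t * t)

print("unadapted:", detect(*during))
print("adapted:  ", detect(*during, averages=steady))
# If all adapted powers are zero (no transient), the angles are meaningless,
# which is why the unadapted directions must be used between events.
```

Without adaptation the transient is pulled toward the center; with the average powers subtracted it is detected at l/r = 45, c/s = 0, which is full left.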

Onset Detection

• We can use the amplitude envelope of the signals – particularly after adaptation – to find the onsets of individual events.

• The sensitivity of this process must adapt to the nature of the music. Pop music tends to have many buried significant events, while classical music does not.

– Decoding classical music as if it were pop results in many errors in directional panning, which can be audible.

Amplitude envelope of reverberant speech

Anechoic speech with added reverberation equivalent to a small hall at about 20 meters.

Analysis into 1/3 octave bands, followed by envelope detection.

Green = envelope, Yellow = edge detection.

This slide shows how the steeply rising edges of the amplitude envelope can be used to trigger a syllable onset detector.
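Envelope-based onset detection can be sketched in a few lines. This is a simplified broadband illustration, not the detector used in the talk: there is no 1/3 octave analysis, and the follower coefficients, the look-back length and the `rise_db` sensitivity are all assumed values.

```python
import math

def envelope(samples, attack=0.5, release=0.01):
    """Rectify and smooth: one-pole follower with fast attack, slow release."""
    env, out = 0.0, []
    for x in samples:
        level = abs(x)
        coeff = attack if level > env else release
        env += coeff * (level - env)
        out.append(env)
    return out

def onsets(env, lookback=32, rise_db=6.0, floor=1e-3):
    """Indices where the envelope has risen by rise_db over `lookback` samples."""
    ratio = 10.0 ** (rise_db / 20.0)
    hits, armed = [], True
    for n in range(lookback, len(env)):
        reference = max(env[n - lookback], floor)
        rising = env[n] > floor and env[n] > ratio * reference
        if rising and armed:
            hits.append(n)
        armed = not rising  # re-arm only after the rise has ended
    return hits

def burst(length, period=20):
    """A sine burst used as a stand-in for a note or syllable."""
    return [math.sin(2.0 * math.pi * k / period) for k in range(length)]

signal = [0.0] * 200 + burst(400) + [0.0] * 200 + burst(400)
print(onsets(envelope(signal)))
```

The detector fires once near the start of each burst, including the second one, which begins while the envelope of the first is still decaying. Raising `rise_db` makes it less sensitive, which is the kind of adjustment that would suit classical music better than pop.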

The advantage of human hearing

• Human hearing has the advantage of being frequency selective.

– The event separation process takes into account the frequency content of the incoming signals, and can separate them based on this information.

– Most surround processors are broadband systems. We want to do “almost as well” as human hearing using only phase/amplitude information to both separate events and to determine direction.

• In spite of this disadvantage a system based on broadband event detection works remarkably well.

Examples of event detection

This slide plays an MPEG movie showing the instantaneous output of the event detector playing music and test signals of various types.

Display labels: Center, Left, Right, Left Surround, Rear, Right Surround. Test segments: rear pan, front pan.

Testing event detection

Plot labels: pan with center tone; center tone louder; center tone louder.

A test signal which consists of a tone that alternates from left to right while panning first to the center, and then to the rear.

We then add an interfering constant tone to the center with increasing level.

Analysis of test results

Output of an accommodating decoder in the “film” setting to the pan tones with no interfering center signal

Output with a constant center signal at -7dB level

Output with a constant center signal at equal level

Output with a constant center signal at +7dB level with an accommodating decoder

Each plot shows five traces: Left Front, Right Front, Left Rear, Right Rear, Center.

Comparison at 0dB center level

Commercial decoder output with no accommodation. Note there is no left/right difference in the rear, and there is a high level in the front speakers during rear panning.

Accommodating decoder output. Note how the first 80ms or so of each tone burst tends to accurately show the intended direction of the sound. The accuracy degrades as the sound continues, but this is not audible.

Comparisons with plots

We can plot the amplitude graphs as polar diagrams, using the center of gravity of the sound to determine the probable perceived direction. This graph is from a commercial non-adapting decoder, with no constant interfering signal.

Note the first two plots expand the front axis to +-90 degrees. Each pan position has been divided into four equal sections. These plots show very good performance.

Accommodation vs non-accommodation – equal interference and signal

An accommodating detector (in this case in film mode) catches the true direction of the first wavefront.

Standard detectors cause all signals to be brought toward the center. This is the “film” setting. (These plots show the first 80ms of the tone bursts.)

A better plot – no interference (rear)

Panels: a non-accommodating “film” decoder; an accommodating “film” decoder.

It is more revealing to plot the direction of the output by finding the angle of the center of gravity. We plot the result as a circle whose radius depends on the degree of coherence of the image. If the sound comes from all around, the radius is large, and the center moves to the center of the plot. All phantom images (such as the full rear position) will have large circles, indicating the sound comes from more than one speaker. (The minimum circle radius is limited to 0.05 in these drawings so they remain visible.)

With no interference the two decoders perform similarly. It can also be seen that the non-accommodating decoder puts signals panned only partly to the rear further to the rear than might be expected. (Careful observers may notice that the accommodating drawing has 7 positions, the non-accommodating one has 6.)
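One way to compute such a plot point is sketched below. The details are assumptions, not taken from the talk: the loudspeaker angles, the use of power weighting, and the rule that the radius is one minus the length of the center-of-gravity vector are all chosen for illustration. Only the 0.05 minimum radius comes from the text.

```python
import math

# Assumed loudspeaker azimuths in degrees (0 = front, positive = listener's left).
SPEAKER_ANGLES = {"C": 0.0, "LF": 30.0, "RF": -30.0, "LR": 110.0, "RR": -110.0}
MIN_RADIUS = 0.05  # minimum circle radius, as in the drawings

def center_of_gravity(levels):
    """Return (x, y, angle_deg, radius) for a dict of speaker name -> amplitude."""
    powers = {name: amp * amp for name, amp in levels.items()}
    total = sum(powers.values())
    if total == 0.0:
        return 0.0, 0.0, 0.0, 1.0  # silence: no direction, fully diffuse
    x = sum(p * math.cos(math.radians(SPEAKER_ANGLES[n])) for n, p in powers.items()) / total
    y = sum(p * math.sin(math.radians(SPEAKER_ANGLES[n])) for n, p in powers.items()) / total
    coherence = math.hypot(x, y)  # 1 = a single speaker, 0 = sound from all around
    radius = max(MIN_RADIUS, 1.0 - coherence)
    return x, y, math.degrees(math.atan2(y, x)), radius

print(center_of_gravity({"LF": 1.0}))             # single speaker: minimum radius
print(center_of_gravity({"LR": 1.0, "RR": 1.0}))  # phantom full rear: large circle
```

A sound from a single speaker gets the minimum radius at that speaker's angle, while a full rear phantom image sits at 180 degrees with a large circle drawn closer to the center of the plot, matching the behaviour described above.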

A better plot – interference -7dB

Panels: a non-accommodating “film” decoder; an accommodating “film” decoder.

With an interfering signal the positions in the non-accommodating decoder are brought toward the front, with the large circle radius indicating the sound comes from many speakers at the same time. These sounds are heard as quite diffuse.

The performance of the accommodating decoder actually improves somewhat over the previous slide.

A better plot – equal interference and signal

Panels: non-accommodating detector (first 80ms); accommodating detector (first 80ms).

With equal interference and signal the accommodating decoder degrades somewhat, but the performance is still very good. The non-accommodating detector throws the rear signals increasingly into the front.

A better rear plot - +7dB interference

Panels: a non-accommodating decoder; an accommodating decoder.

Neither decoder works perfectly for rear signals at this level of interference from the front, but the accommodating detector does better. At least the front left and right signals are correctly reproduced.

Front positions – no interference

Panels: a non-accommodating “film” decoder; an accommodating “film” decoder.

Both decoders perform well with no interference. Notice that this non-accommodating “film” decoder widens the front soundstage by putting some of the left-only and right-only signals into the rear speakers as well as in the front.

Front positions - -7dB interference

Panels: a non-accommodating “film” decoder; an accommodating “film” decoder.

With a -6dB interfering signal both decoders still perform well. Notice that the non-accommodating decoder still has some leakage to the rear. The leakage continues into the first few positions of the front pan. (The limit on the minimum circle radius prevents us from seeing the circles between center and left become larger as they are being reproduced by two speakers instead of one. We can see that the localization outside the sweet spot will be good. This is due to the close angular spacing of the speakers.)

Front positions – equal interference and signal

Plots: a non-accommodating “film” decoder; an accommodating “film” decoder.

With equal interfering tone and signal the performance of the non-accommodating decoder degrades markedly for full-left and full-right signals, which are drawn toward the center, and are reproduced through all loudspeakers.

The accommodating decoder works well, still displaying the minimum circle radius for all positions.

Front positions - +7dB interference

Plots: a non-accommodating “film” decoder; an accommodating “film” decoder.

With high interference the non-accommodating decoder draws all the signals into the center, and reproduces them through all loudspeakers. The exact center, however, is exclusively in the center speaker.

The accommodating decoder continues to have high left-right separation, and nearly minimum circle radius.

Front Positions – “Music” vs “Film”

Plots (no interference): a non-accommodating decoder; an accommodating decoder.

It is possible to improve the imaging of a decoder (in the sweet spot only) by reducing the level of the center speaker, and reducing the front steering. The front image is then mostly phantom, and there are fewer steering artifacts. This setting is typically called the “music” setting of the decoder.

With an accommodating decoder it is not necessary to reduce the front steering to prevent artifacts. We deliberately blend only the true center position into the left and right fronts. This treatment of vocals is frequently used in discrete surround mixes. This setting can reduce any audible pumping of the decoder when vocals and background are nearly equal in level.

Front Positions “Music” -7dB interference

Plots: a non-accommodating decoder; an accommodating decoder.

With -6dB interference the two decoders behave much the same as in the previous slide.

Front Positions – Music + interference at 0dB

Plots: a non-accommodating decoder; an accommodating decoder.

The non-accommodating decoder shows leakage into the rear channels, which increases the circle radius for left front and right front. The accommodating decoder keeps the leakage low.

Front Positions – Music + interference at +7dB

Plots: a non-accommodating decoder; an accommodating decoder.

The leakage in the non-accommodating decoder increases at this level of interference, while the accommodating detector continues to work well.

Some Math

• The circular burst plots make an interesting graphic, but making them involved a number of guesses. It is not obvious how to determine just where a sound will be perceived when it is emitted from several loudspeakers. The math that follows might be useful, but I think the “better” graphs might be ultimately more useful, as they more accurately reflect the performance of the decoder when the listener is not in the exact center between the speakers.

• The angle of the sound source for front panning is found by summing the sound powers from the different loudspeakers, in order to find a “center of gravity” of the sound.

If we let

    av_front_left = sqrt(pressure_from_left_front^2);  % (and so on)
    ra = 360/(2*pi);

then

    l_angle = 45-ra*atan((av_front_right+0.71*av_center)/(av_front_left+0.71*av_center));

This yields an angle for the center of gravity of the front three speakers. The correction factor of 0.71 for the center level results in a good match to a sin/cosine pan law. No correction for the measured pan laws is included so far.
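The front three-speaker expression can be checked numerically. What follows is a minimal Python sketch of the Matlab expression on this slide, not the original code; the function name `front_angle` and the use of `atan2` in place of `atan(a/b)` are assumptions made for illustration.

```python
import math

RA = 360 / (2 * math.pi)  # degrees per radian, the constant "ra" on the slide


def front_angle(av_front_left, av_center, av_front_right, center_weight=0.71):
    """Center-of-gravity angle in degrees for the front three speakers.

    Levels are assumed to be non-negative amplitudes.
    """
    num = av_front_right + center_weight * av_center
    den = av_front_left + center_weight * av_center
    # For non-negative levels atan2(num, den) equals atan(num/den),
    # and it avoids a division by zero when the left side is silent.
    return 45 - RA * math.atan2(num, den)
```

With a signal in the left speaker only, the function returns 45 degrees; with center only, 0; with right only, -45. This matches the 45 degree front speaker angle used elsewhere in the deck.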

More math

• We can then write a similar expression for the sum for front panning of all five speakers.

• l_angle_late = 120-2.667*ra*atan( ...
      (av_front_right+0.8*av_center+av_rear_right+(.534*av_front_left)+(rear_leak*av_rear_left)) ...
     /(av_front_left+0.8*av_center+av_rear_left+(.534*av_front_right)+(rear_leak*av_rear_right)));

• Rear_leak is a fudge factor:

    rear_leak=0;
    % for low rear levels, use the full value with no leak
    if av_rear_left < .71*av_front_left
        rear_leak = .303;
    % for intermediate levels use an intermediate leak,
    % that fades out when the rear is 3dB stronger than front
    elseif av_rear_left < 1.41*av_front_left
        rear_leak = .303*(2 - av_rear_left/av_front_left)/(2-.71);
    end
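As a plausibility check on the five-speaker expression, here is a Python sketch under stated assumptions. It assumes the scale factor in front of `ra*atan(...)` is 2.667 (the value that maps the 0 to 90 degree range of the arctangent onto +120 to -120 degrees) and that the intermediate leak is `.303*(2 - ratio)/(2 - .71)`; both are readings of a damaged slide, not confirmed by the source. The function names are hypothetical.

```python
import math

RA = 360 / (2 * math.pi)  # degrees per radian


def rear_leak_factor(av_front_left, av_rear_left):
    """Fudge factor from the slide; parenthesization is an assumption."""
    if av_rear_left < 0.71 * av_front_left:
        return 0.303  # low rear level: full value
    if av_rear_left < 1.41 * av_front_left:
        # intermediate rear level: reduced leak
        return 0.303 * (2 - av_rear_left / av_front_left) / (2 - 0.71)
    return 0.0  # rear dominates: no leak


def five_speaker_angle(fl, c, fr, rl, rr):
    """Center-of-gravity angle in degrees over all five speakers."""
    leak = rear_leak_factor(fl, rl)
    num = fr + 0.8 * c + rr + 0.534 * fl + leak * rl
    den = fl + 0.8 * c + rl + 0.534 * fr + leak * rr
    # atan2 equals atan(num/den) for non-negative levels
    return 120 - 2.667 * RA * math.atan2(num, den)
```

Under these assumptions a signal in the front left speaker alone lands at about 45 degrees, the center alone at about 0, and the rear left alone at 120, which agrees with the speaker angles quoted for the display code.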

Math for the better display

Matlab code for finding center of gravity. Front_ang = angle of front speakers from the center (45). Rear_ang = angle of the rear speakers (120). Av_fr_al1 is the average power from the front left speaker during the time of index i, av_fr_ar1 is the average power from the right front speaker, etc. Angl is the index for the particular angle, and i is the index for a short time period within the burst.

    % find the front-back center of gravity
    f_grav_cnt_fb(angl,i) = (cos(front_ang)*av_fr_al1(angl,i))^2 ...
        +(cos(front_ang)*av_fr_ar1(angl,i))^2 ...
        +av_ctr_a1(angl,i)^2 ...
        -((-cos(rear_ang)*av_rear_al1(angl,i))^2 ...
        +(-cos(rear_ang)*av_rear_ar1(angl,i))^2);
    f_total_power(angl,i) = av_fr_al1(angl,i)^2 ...
        +av_fr_ar1(angl,i)^2 ...
        +av_ctr_a1(angl,i)^2 ...
        +av_rear_al1(angl,i)^2 ...
        +av_rear_ar1(angl,i)^2;
    f_grav_cnt_fb(angl,i) = f_grav_cnt_fb(angl,i)/f_total_power(angl,i);
    % signed square root
    if f_grav_cnt_fb(angl,i) < 0
        f_grav_cnt_fb(angl,i) = -sqrt(-f_grav_cnt_fb(angl,i));
    else
        f_grav_cnt_fb(angl,i) = sqrt(f_grav_cnt_fb(angl,i));
    end

Math for the better display

    % find the left-right center of gravity
    f_grav_cnt_lr(angl,i) = (sin(front_ang)*av_fr_al1(angl,i))^2 ...
        +(sin(rear_ang)*av_rear_al1(angl,i))^2 ...
        -((sin(rear_ang)*av_rear_ar1(angl,i))^2 ...
        +(sin(front_ang)*av_fr_ar1(angl,i))^2);
    f_grav_cnt_lr(angl,i) = f_grav_cnt_lr(angl,i)/f_total_power(angl,i);
    % signed square root
    if f_grav_cnt_lr(angl,i) < 0
        f_grav_cnt_lr(angl,i) = -sqrt(-f_grav_cnt_lr(angl,i));
    else
        f_grav_cnt_lr(angl,i) = sqrt(f_grav_cnt_lr(angl,i));
    end
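The two center-of-gravity coordinates can be tried out on a single time frame. This is a Python sketch of the Matlab on these slides, written as one hypothetical function `center_of_gravity`; it assumes the speaker angles are given in degrees and converts them to radians, and it assumes the total power is not zero.

```python
import math


def center_of_gravity(fl, fr, c, rl, rr, front_ang_deg=45.0, rear_ang_deg=120.0):
    """Return (front_back, left_right) for one time frame.

    fl, fr, c, rl, rr are the per-speaker averages used on the slide.
    """
    fa = math.radians(front_ang_deg)
    rear = math.radians(rear_ang_deg)
    total = fl**2 + fr**2 + c**2 + rl**2 + rr**2  # assumed non-zero

    fb = ((math.cos(fa) * fl)**2 + (math.cos(fa) * fr)**2 + c**2
          - ((-math.cos(rear) * rl)**2 + (-math.cos(rear) * rr)**2)) / total
    lr = ((math.sin(fa) * fl)**2 + (math.sin(rear) * rl)**2
          - ((math.sin(rear) * rr)**2 + (math.sin(fa) * fr)**2)) / total

    def signed_sqrt(v):
        # same as the if/else on the slide: keep the sign, take the root
        return math.copysign(math.sqrt(abs(v)), v)

    return signed_sqrt(fb), signed_sqrt(lr)
```

A center-only signal gives (1, 0), a front-left-only signal gives about (0.71, 0.71), and a rear-right-only signal gives about (-0.5, -0.87), so each single speaker lands on the unit circle at its own angle.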

Math for the better display

First find an average center of gravity for each angle, by averaging over the index i. Then find the polar angle of the center of gravity and the plot radius. The constant ra = 360/(2*pi) converts radians to degrees. Note the step which limits the minimum radius.

Nvarsp = the number of time steps for the index i (p in the loop). If we want to only look at the first half of the burst, we use a smaller value for nvarsp.

    for b=1:angstps
        av_fba(b) = 0;
        av_lra(b) = 0;
        for p = 1:nvarsp
            av_fba(b) = av_fba(b) + f_grav_cnt_fb(b,p);
            av_lra(b) = av_lra(b) + f_grav_cnt_lr(b,p);
        end
        av_fba(b) = av_fba(b)/nvarsp;
        av_lra(b) = av_lra(b)/nvarsp;
        av_plt_ang(b) = 90-atan(abs(av_fba(b))./abs(av_lra(b)))*ra;
        radius(b) = sqrt(av_fba(b)^2+av_lra(b)^2);
        av_circ_sz(b) = 1 - radius(b);
        % The minimum radius for display can be changed here.
        if av_circ_sz(b) < 0.05
            av_circ_sz(b) = 0.05;
        end
    end
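The averaging and circle-size step can be sketched for a single pan angle. This is a Python illustration, not the original Matlab; the function name `burst_circle` and its return tuple are assumptions.

```python
import math

RA = 360 / (2 * math.pi)  # degrees per radian


def burst_circle(fb_frames, lr_frames, min_size=0.05):
    """Average the per-frame centers of gravity for one pan angle.

    Returns (av_fb, av_lr, plot_angle_deg, radius, circle_size).
    """
    av_fb = sum(fb_frames) / len(fb_frames)
    av_lr = sum(lr_frames) / len(lr_frames)
    # atan2 of the absolute values equals atan(|fb|/|lr|) and tolerates lr == 0
    plot_angle = 90 - math.atan2(abs(av_fb), abs(av_lr)) * RA
    radius = math.hypot(av_fb, av_lr)
    # a sound from one speaker has radius near 1 and gets the minimum circle
    circle_size = max(1 - radius, min_size)
    return av_fb, av_lr, plot_angle, radius, circle_size
```

Passing only the first half of the frames corresponds to using a smaller nvarsp. A sound spread over many speakers has a center of gravity near the origin, so its circle is drawn large.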

Math for the better display

Now plot the results. Angstps = number of angular steps: 6 or 7 in these examples.

    x = 2*pi*(1:50)/(50);
    circx = sin(x);
    circy = cos(x);
    n = ((1:50)/25)-1;
    zip = zeros(size(n));
    colorval = ['y','r','b','g','m','c','y'];

    figure
    hold on
    for b=1:angstps
        x = circx*av_circ_sz(b)+av_fba(b);
        y = circy*av_circ_sz(b)+av_lra(b);
        plot(x,y,colorval(b)),axis([-1.1 1.1 -1.1 1.1])
        x = circx*av_circ_sz_r(b)+av_fba_r(b);
        y = circy*av_circ_sz_r(b)+av_lra_r(b);
        plot(x,y,colorval(b)),axis([-1.1 1.1 -1.1 1.1])
        plot(n,zip,'w') % mark the center
        plot(zip,n,'w') % mark the center
        plot(n,n,'w') % mark the front
        plot(n,-n,'w')
        plot(-n,n,'w')
        plot(-n,-n,'w')
    end
    xlabel('front_back')
    ylabel('left_right')
    title('front positions')

Sound redirection

• Once a sound event is detected, the processor must adjust its outputs so the sound is perceived as coming from the correct direction.

– This process is not trivial. We can detect the electrical direction of a sound. We need to know its perceived direction.

– Two channel panning and three channel panning can be perceived quite differently.

• So we need to study sound panning, both to understand how to detect sound directions in a recording, and how to reproduce them.

Sine-Cosine drawing

If the sine-cosine pan law is accurate, and we set p=22.5 degrees, we should hear a sound image half-way between center and left.

What actually happens:

The actual position of a speaking (or singing) voice when p = 22.5 is closer to the left speaker – where we would expect p=15 to be.

Sine Law, Tangent Law

From: V. Pulkki and M. Karjalainen, “Localization of Amplitude-Panned Virtual Sources, Part 1: Stereophonic Panning,” J. Audio Eng. Soc., vol. 49, pp. 739-752 (2001).

• If g1 = the gain of the left channel, and g2 = the gain of the right channel, then:

• Tangent law: Apparent position = arctan(tan(45)*(g1-g2)./(g1+g2)); The tangent law is equivalent to the sine/cosine law as described in the previous slide.

• Sine law: Apparent position = arcsin(sin(45)*(g1-g2)./(g1+g2)); The sine law predicts images even further away from their apparent position (with speech) than the sine/cosine law.
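The two laws are easy to compare numerically. The Python sketch below assumes the sine/cosine pan pot assigns g1 = cos(p) to the left channel and g2 = sin(p) to the right, so that p = 45 degrees is center and p = 0 is full left; that convention and the function names are assumptions for illustration. Angles are returned in degrees from center, positive toward the left.

```python
import math


def sine_cosine_gains(p_deg):
    """Assumed pan pot: (g1 left, g2 right) = (cos p, sin p)."""
    p = math.radians(p_deg)
    return math.cos(p), math.sin(p)


def tangent_law(g1, g2, half_angle_deg=45.0):
    """Predicted position from the tangent law, degrees from center."""
    ratio = (g1 - g2) / (g1 + g2)
    return math.degrees(math.atan(math.tan(math.radians(half_angle_deg)) * ratio))


def sine_law(g1, g2, half_angle_deg=45.0):
    """Predicted position from the sine law, degrees from center."""
    ratio = (g1 - g2) / (g1 + g2)
    return math.degrees(math.asin(math.sin(math.radians(half_angle_deg)) * ratio))
```

For p = 22.5 the tangent law returns 22.5 degrees, since (cos p - sin p)/(cos p + sin p) = tan(45 - p), which is why it coincides with the sine/cosine law. The sine law returns about 17 degrees for the same gains, placing the image closer to center, whereas the measured image for speech is closer to the speaker.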

Pan law accuracy matters!

• With two channel recordings pan-pots can be adjusted by ear, and panning errors are no big deal in practice.

• But say we want to make a surround recording with a 3 channel front image:

– We would like to duplicate the mix in two channels using the same panner settings.

– If our pan-laws were accurate, we could do this easily.

Conversion of two channel to multichannel

• We also want to automatically upmix a two channel recording to a three channel, five channel, or seven channel recording.

• To do this we must determine the true apparent direction of a sound source in the two channel recording, and then duplicate this position in the multichannel recording.

– It is not fair to cheat by ignoring the center channel!!!

• To do this we need ACCURATE pan laws.

Panning conversion

• Our standard decoder is a two channel to seven channel upmixer.

– It includes a sound event detector which separates the incoming sound into separate events, and determines the intended direction of each event.

– The output matrix is then adjusted to place that sound event as closely as possible to the intended position.

• The output pans use only the two loudspeakers that are closest to the intended position.

• We fully use the center channel (in film mode).

• As the sound event detector improved we noticed that the perceived front image was consistently narrower than the two channel original.

– Either the three channel front pan law or the two channel pan law (the sine/cosine law) had to be inaccurate.

– So… time to learn something new.

The three channel pan law

• We can test three channel pans by using a sine/cosine pan pot to sweep between the center channel and left or right.

• Left_output = cos(p)*input

• Center_output = sin(p)*input

• This time as p varies from 0 to 90 degrees the sound image will pan from left to center. When p = 45 degrees, we might expect the perceived image to be half-way between center and left.

• This is exactly what we find.

• So the panning error is in the two-channel pan law.

The physics of two-channel panning

The pressure at each ear is the sum of the direct sound pressure from one speaker and the diffracted sound pressure from the other.

These two signals interfere with each other, producing a highly frequency dependent signal.
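A toy model makes the frequency dependence visible. In the Python sketch below each ear receives the near speaker directly plus the far speaker attenuated and delayed; the fixed attenuation `shadow` and the 0.25 ms `delay_s` are made-up illustrative values, and real head diffraction varies with frequency, so this shows only the interference mechanism, not measured data.

```python
import cmath
import math


def ear_pressures(g_left, g_right, freq_hz, shadow=0.5, delay_s=0.00025):
    """Complex pressures (left_ear, right_ear) at one frequency.

    shadow and delay_s are hypothetical constants for the diffracted path.
    """
    cross = shadow * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    left_ear = g_left + g_right * cross    # direct left + diffracted right
    right_ear = g_right + g_left * cross   # direct right + diffracted left
    return left_ear, right_ear


def level_difference_db(g_left, g_right, freq_hz, **kwargs):
    """Interaural level difference in dB (positive = louder at left ear)."""
    left_ear, right_ear = ear_pressures(g_left, g_right, freq_hz, **kwargs)
    return 20 * math.log10(abs(left_ear) / abs(right_ear))
```

With the same pair of pan gains, for example cos(22.5°) and sin(22.5°), this model gives a level difference of a few dB at 100 Hz and a much larger one near 2 kHz, where the diffracted path arrives out of phase. A single pan setting therefore does not produce a single interaural cue across frequency.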

Consequences of panning physics

• A two channel pan is entirely different from the localization process in natural hearing.

– Localization always depends on the interaural time delay (ITD) and the interaural intensity difference (IID).

– In natural hearing the ITD and IID vary due to head shadowing alone.

• Between 500Hz and 1500Hz the ear relies increasingly on IID rather than ITD, and the precise phase of the ear signals becomes inaudible.

– In a two channel pan, ITD and IID vary due to INTERFERENCE.

– The interference is entirely PHYSICAL. It happens at all frequencies, even HF.

• Thus the phase relationship of HF signals continues to be important at all frequencies.

The frequency transmission of the pinnae and middle ear

From: B. C. J. Moore, B. R. Glasberg and T. Baer, “A model for the prediction of thresholds, loudness and partial loudness,” J. Audio Eng. Soc., vol. 45, pp. 224-240 (1997).

The intensity of nerve firings is concentrated in the frequency range of human speech signals, about 700Hz to 4kHz. With a broad-band source, these frequencies will dominate the apparent direction.

Past History

• We discovered that the apparent position of a sound source is highly influenced by its expected position and its past history.

– Expectation of azimuth and elevation is particularly important in sound recording.

– Localization in a natural environment is almost always dominated by the visual field, or a memory of a visual field.

• In panning experiments we can alter the bandwidth or the band frequency of a known source type, like speech. This change of frequency mix will drastically alter the ITD and IID.

– And yet a source which appears to be located at a particular position with one mix of frequencies will remain in that position when the frequency mix is changed.

– This is because the brain expects the sound to remain in the same place.

• Alternating the presentation from left to right by switching the speaker channels breaks this hysteresis.

– Thus the subject is asked to estimate the width between sound images which alternate left and right, rather than the position of images presented consistently on one side or the other.

Apparent width of broadband sources

• Broadband sources are consistently perceived as narrow in width.

– But when we analyze them in critical frequency bands we find their apparent position varies over a wide angle.

– The brain must make sense of these conflicting directional cues.

• The neurological process of separating sound into streams assigns a “best guess” position to the entire stream, rather than separating the perception into several streams in different directions.

– Since the brain expects most sources to have a narrow spatial extent, the “best guess” position is applied to all the conflicting frequency bands, and the source is perceived as sharply localized.

• Once the brain has assigned a direction to a source stream, it is quite reluctant to change it.

Pan law tests with third octave filtered speech

• Start with a broadband speech segment that alternates between half-left and half-right using a sine/cosine pan (p = 22.5 and p = 67.5 degrees).

• Ask a subject to tell you the angular width between the two apparent positions.

• Now filter the speech into third-octave bands, and ask the same question for each band.

Results at High Frequencies (ILD + ITD)

Apparent position of 1/3 octave filtered speech as a function of frequency.

Note the sine-cosine law is accurate below 600Hz, but the position of the broadband source is strongly pulled away from the center by the increased angle of frequencies above 1kHz.

Conclusions about panning

• Images from broadband speech and music sound sources are consistently perceived as wider than would be predicted by a sine/cosine pan law.

• The discrepancy can be explained by the dominance of frequencies between 700Hz and 4kHz in human hearing.

• The hearing mechanism appears to simply weight the apparent position of each frequency band by the intensity of nerve firings in that band when assigning an azimuth to a particular sound stream.

• The expected position of a sound stream, and the past history of its position, will have a strong (and usually dominant) influence on perception.
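The weighting described in the third conclusion amounts to a weighted average over frequency bands. The Python sketch below shows that arithmetic only; the band angles and weights in the usage note are invented placeholders, not measurements from the talk, and the function name is hypothetical.

```python
def stream_azimuth(band_angles_deg, band_weights):
    """Weighted mean of per-band apparent angles.

    band_weights stand in for the intensity of nerve firings in each band.
    """
    total = sum(band_weights)
    weighted = sum(a * w for a, w in zip(band_angles_deg, band_weights))
    return weighted / total
```

For example, with hypothetical band angles of 22, 23, 30 and 35 degrees and weights of 0.5, 1.0, 2.0 and 1.5, the result is 29.3 degrees. The heavily weighted middle bands pull the image outward from the roughly 22 degree low-frequency position, which is the kind of widening the conclusions describe.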

Listening tests

Conclusions

• Designing two channel to multichannel decoders can become a bit of a consuming passion.

– The results can be surprisingly good.

• The tools shown in this workshop are applicable to a decoder with any number of output channels.

– It is possible to decode the front into five channels instead of three, with a substantial improvement in localization accuracy outside the sweet spot.

• The essential features of a good decoder include careful attention to the center channel and to the decorrelation and frequency contouring of the rear channels.

– This same attention can and should be paid when a discrete mix is made.

• It is possible (but sometimes difficult) to make a discrete mix that is clearly superior to an automatic two-to-five conversion.

– Comparing the original discrete mix to the same mix after downmixing and upmixing can be quite revealing, as the down-mixed and up-mixed mix is often superior to the original.

– Analysis of the reasons for this observation can lead to better mixing technique.