Transcript Document

Speech Comprehension
A few words on acoustics
• Given a source, how it is heard is a function of the
resonant cavities through which it is filtered
• The shape of a cavity in which a sound occurs
determines several measurable properties of that
sound
– This is easy to see when you have a deformable sound
cavity, such as a wind instrument
– The sound that comes out is the one which is most
resonant: where the sound waves are ‘in sync’
– This is a function of the length and shape of the
resonating chamber: simple if the chamber is a simple
tube; complex if the chamber reflects sound in complex
ways
• Speaking is a highly controlled deformation of the
resonating chamber which is our vocal tract
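To make the "length and shape of the resonating chamber" point concrete, here is a minimal sketch (my illustration, not from the lecture): the crudest model of the vocal tract is a uniform tube closed at the glottis and open at the lips, whose resonances sit at the odd quarter-wavelength frequencies. Deforming the tube shifts these resonances, which is exactly the controlled deformation described above.

```python
# Minimal sketch (assumption, not from the lecture): resonant frequencies of a
# uniform tube closed at one end (glottis) and open at the other (lips), the
# standard first approximation to a neutral, schwa-like vocal tract.
SPEED_OF_SOUND = 343.0   # m/s in air at roughly room temperature
TRACT_LENGTH = 0.175     # m, a typical adult vocal tract length

def tube_resonances(length_m, n_formants=3, c=SPEED_OF_SOUND):
    """Odd quarter-wavelength modes: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

print(tube_resonances(TRACT_LENGTH))   # ~[490, 1470, 2450] Hz, close to schwa formants
```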
Structure of sound
• Most natural sounds are not one pure resonating frequency, but
multiple resonating frequencies stacked up on each other
– Those frequencies can divide up the sound spectrum more
or less cleanly
– ‘Hissy’ noises (like fricatives) send out waves at many
frequencies at the same time, resulting in a complex spectrum
of resonance
• We can see fricatives as smears across many high-frequency bands,
interspersed with the more clearly separated (and lower-frequency)
bands of vowels
• Note that the 'sh' sound is characterized by slightly lower frequencies
– Clean sounds (like vowels) send out controlled bands of
frequencies at different ranges, resulting in a cleaner spectrum
of resonance
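As a toy illustration of the "clean versus hissy" contrast above (my sketch, not from the lecture), a vowel-like sound can be built from a few stacked frequencies while a fricative-like sound spreads energy across the whole spectrum:

```python
# Toy sketch: a "clean" vowel-like sound is a few strong frequency components
# stacked up; a "hissy" fricative-like sound has energy at many frequencies at
# once, so its spectrum is a smear rather than a set of sharp peaks.
import numpy as np

fs = 16000                                 # sampling rate (Hz)
t = np.arange(0, 0.5, 1.0 / fs)            # half a second of samples

# Vowel-like: a few discrete resonating frequencies added together
vowel_like = sum(np.sin(2 * np.pi * f * t) for f in [500, 1500, 2500])

# Fricative-like: broadband noise, i.e. energy at many frequencies at once
rng = np.random.default_rng(0)
fricative_like = rng.standard_normal(len(t))

for name, sig in [("vowel-like", vowel_like), ("fricative-like", fricative_like)]:
    spectrum = np.abs(np.fft.rfft(sig))
    # Fraction of frequency bins carrying appreciable energy: tiny for the
    # vowel-like signal (sharp peaks), large for the noise (a smear)
    spread = np.mean(spectrum > 0.1 * spectrum.max())
    print(name, "fraction of bins with appreciable energy:", round(spread, 3))
```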
Formants
• When we deform our mouths, we are manipulating which
frequencies will resonate
• The ones that resonate are called ‘formants’ [see pg. 121],
which appear as visible bands in a spectrogram
• Before we said: Speaking is a highly controlled
deformation of the resonating chamber which is our vocal
tract
• We can equivalently say: Speaking is a method for
manipulating the resonance of formants.
“We were away a year ago”
Image from: http://www.umanitoba.ca/faculties/arts/linguistics/russell/138/sec4/specgram.htm
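For anyone who wants to reproduce a figure like the one linked above, here is a hedged sketch of how such a spectrogram could be computed with standard tools; the filename is a hypothetical stand-in for any short recorded utterance. Formants show up as dark horizontal bands.

```python
# Hedged sketch: computing and plotting a spectrogram in which formant bands
# are visible. "we_were_away.wav" is a hypothetical filename.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, audio = wavfile.read("we_were_away.wav")     # hypothetical recording
if audio.ndim > 1:                               # mix to mono if the file is stereo
    audio = audio.mean(axis=1)

f, t, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="auto")   # dB scale
plt.ylim(0, 4000)                                # the formants of interest sit below ~4 kHz
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```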
What’s in a phoneme?
• As soon as we were able to electronically
manipulate the signal, it was found that the speech
signal could be greatly simplified: much of the
information carried is not necessary
– Why is it a good thing to have (why might natural
selection have favoured) unnecessary information in a
signal system?
• The question of interest is: What are the
components of the speech signal that carry
necessary/sufficient information?
What’s in a phoneme?
• The first and second formants are sufficient for
comprehensible speech
• In fact, subjects can get some discriminating
information from only the first formant: low-frequency formants were associated with low,
back vowels (o, u) and higher-frequency with
high, front vowels (i, e)
What’s in a phoneme?
• We use sound (the formants we extract) to deduce
information about how the vocal tract was positioned when
that sound was produced
– F1 largely reflects tongue body height, which (as we
saw previously) changes with different vowels
– F2 reflects whether the tongue body is more front or
more back
• The difference between F1 and F2 is an even better
indicator of how front or back the tongue is than F2 alone
• In this way the sound encodes information about the state
of the system that produced it
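As a toy sketch of this deduction (my illustration, using rough textbook-average formant values rather than anything from the lecture): given F1 and F2 alone, pick the nearest stored vowel, and with it an estimate of how the tongue was positioned.

```python
# Toy sketch: deducing which vowel (and hence, roughly, which vocal tract
# configuration) produced a sound from its first two formants. The reference
# values are rough adult-male averages and are illustrative only.
VOWEL_FORMANTS = {                 # vowel: (F1 ~ tongue height, F2 ~ front/back), in Hz
    "i (high front)": (270, 2290),
    "u (high back)":  (300, 870),
    "a (low back)":   (730, 1090),
    "ae (low front)": (660, 1720),
}

def closest_vowel(f1, f2):
    """Nearest neighbour in (F1, F2) space, a crude stand-in for perception."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2
                           + (VOWEL_FORMANTS[v][1] - f2) ** 2)

print(closest_vowel(290, 2200))    # -> i (high front)
print(closest_vowel(700, 1150))    # -> a (low back)
```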
A complication
• Vowel sounds are dependent on the consonants
that flank them
• We make different sounds by changing the shape
of our mouth, and the mouth has to change in different
ways to get to a particular vowel sound depending on
which position it starts from
• In other words: The very process of getting into
position to make a sound involves manipulating
exactly those elements which are manipulated to
change sound
What’s in a vowel?
• If you make ‘CVC’ words and then chop out the V, people
make many mistakes in guessing what that missing sound
is supposed to be
– They are much better at guessing what a vowel sound
is if you give them only the flanking consonants
– They were as good with silent-center stimuli (the V taken out)
as they were with the original word!
• V recognition is worse if you discard temporal
information, so that subjects hear only a small,
constant-length portion of the vowel
– This suggests that temporal information (how long a
vowel lasts) is one of the clues used in vowel
identification.
Consonants too.
• The same is true for consonants
– If you take a stop consonant off the front of a vowel
(the b in BA), it is utterly impossible to recognize what
the consonant was (it sounds like a beep or a chirp): it was never a ‘b’
but a ‘b merging rapidly into an a’
– Both the stop consonant and a chunk of the formant
transition into the next vowel are necessary for
comprehension
Coarticulation
• A phoneme merging with its adjacent neighbour is called an encoded phoneme
– We can also say the two phonemes are
coarticulated
• Since an encoded phoneme is a single
indistinguishable sound which encodes two
phonemes (the encoded one and its neighbour), we
say there is ‘parallel transmission’
Information compression
• Coarticulation is a feature, not a bug
• The informational compression it offers is
one way we get up to the informational
transfer rates that I mentioned last time, of
25-30 phonemes per second
More information, please
• In normal sentence-level decoding of the phonetic
stream, we have higher-level informational cues
• Early work showed that words masked with noise
are better recognized in sentences than in isolation
• A classic experiment from the 1970s showed that
people are amazingly smooth at using these cues
to restore missing phonemic segments
– Parts of sentences were chopped out (in mid-word) and
replaced with the sound of someone coughing
– Subjects reported that they didn’t hear the cough cover
any part of the speech signal at all: they claimed to
have heard the entire word, with the cough in the
background
Our favourite theme
• Yet another linguistic phenomenon (phoneme
identification) that superficially appears to be a
single function is in fact a complex function that
uses many independent and redundant cues:
– Formant transitions [from C to V and V to C]
– Individual formants [of V]
– Durational information
– Amount of energy in the burst [the release of pressure after a stop]
– Onset frequency of the formant
– Sentence and word level information
The McGurk Effect
Models of speech perception
• i.) Motor theory of speech perception (Liberman)
• ii.) Analysis by synthesis (Stevens)
• iii.) Fuzzy logic model (Massaro)
• iv.) Cohort model (Marslen-Wilson)
• v.) TRACE model (Elman & McClelland)
i.) Motor Theory Of Speech Perception
(Liberman)
• Main idea: we interpret speech input by tying it to the motor
articulation required to produce it
• Pros:
– Provides a nice evolutionary story: phonetic comprehension
built on a more 'primitive' (evolutionarily older) level of
sound production.
– Ties into 'hardware'
– Explains McGurk effect
– Explains how we deal with coarticulation so easily
– Explains how we deal with the lack of invariance in the acoustic signal
– Explains categorical perception
• i.e. we use motor information to constrain possible sounds;
use motor invariance to counter acoustic variance
i.) Motor Theory Of Speech Perception
• Cons:
– Animals also show categorical perception but can’t
produce phonemes
– Humans with deformed mouths can comprehend speech
– We can comprehend sounds we cannot make
– Says nothing about semantic and pragmatic constraints
ii.) Analysis by synthesis
(Stevens)
• Main idea: We synthesize speech from phonetic
features; we have 'rules' for synthesizing, which
can be absolute when the signal is clear, and less
absolute (more dependent on contextual cues)
when there is known ambiguity
• The synthesized version is compared with the heard
version, but not at the level of motor articulations
ii.) Analysis by synthesis
• Pros
– Tries to capture the fact that not all phonemes are
created equal
• ambiguous sounds must be more carefully
analyzed (because they are subject to a greater
variety of constraints) than unambiguous sounds
• early phonemes have greater weight than later
phonemes
– The idea that rules across phonetic features underlie
comprehension means that the problem will be tractable
• Since we have a good handle on what those features
are, there is hope we could specify how they
combine
ii.) Analysis by synthesis
• Cons
– Can’t explain McGurk effect, since everything is
acoustically specified
– Pretty vague without the rules actually being specified
– Very abstract: hard to falsify or confirm experimentally,
because it makes claims about what is happening
internally that cannot be tested easily
– Says nothing about semantic and pragmatic constraints
iii.) Fuzzy Logic Model (Massaro)
• Main idea: speech perception is a special case of
pattern recognition (analysis by features)
• There are four steps:
– i.) Feature identification/extraction: Identify the
relevant features
– ii.) Feature evaluation: Match those features to
prototypes in memory, i.e. generate a list of partial
matches with feature sets that contain some of the
identified features
– iii.) Feature integration: Rank order the candidates
according to the degree that they match
– iv.) Feature decision: Make a ‘goodness of match’
decision and return the best candidate
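The four steps lend themselves to a small sketch. The one below is my illustration under assumed feature names and prototype values, not Massaro's published model: features are matched against prototypes with graded (fuzzy) support, the per-feature matches are combined into a single goodness score, the candidates are ranked, and the best one is returned.

```python
# Toy sketch of the four steps listed above (hypothetical features/prototypes;
# not Massaro's actual implementation).
PROTOTYPES = {
    "b": {"voiced": 1.0, "bilabial": 1.0, "nasal": 0.0},
    "p": {"voiced": 0.0, "bilabial": 1.0, "nasal": 0.0},
    "m": {"voiced": 1.0, "bilabial": 1.0, "nasal": 1.0},
}

def perceive(observed):
    # i.) Feature identification/extraction: here `observed` is already a dict
    #     of graded (0..1) feature values pulled out of the signal.
    scores = {}
    for phoneme, proto in PROTOTYPES.items():
        # ii.) Feature evaluation: per-feature fuzzy match with the prototype
        matches = [1.0 - abs(observed[f] - v) for f, v in proto.items()]
        # iii.) Feature integration: combine the matches into one goodness score
        goodness = 1.0
        for m in matches:
            goodness *= m
        scores[phoneme] = goodness
    # iv.) Feature decision: rank the candidates and return the best one
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked

# A somewhat-voiced, clearly bilabial, non-nasal sound comes out as "b"
best, ranking = perceive({"voiced": 0.8, "bilabial": 1.0, "nasal": 0.1})
print(best, ranking)
```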
iii.) Fuzzy Logic Model (Massaro)
• Pros
– Puts speech recognition out of the special case category
into the category of general pattern recognition, thereby
tying it in to work in other subfields (including other
areas of language) and to a general theory
• This could also be a con, since speech recognition does seem
to be special…
– Stresses continuous (quantitative) rather than
discontinuous (qualitative) information, so a match can
be more or less good, and more or less certain
iii.) Fuzzy Logic Model (Massaro)
• Cons
– Very abstract: hard to falsify or confirm experimentally,
because it makes claims about what is happening
internally that cannot be tested easily
– Says nothing about semantic and pragmatic constraints
(but perhaps it could…?)
iv.) Cohort Model (Marslen-Wilson)
• Basic idea: A spreading activation model
– Stage 1: Initial access
• Access cohort: Bottom-up, based on first 150-200
ms
– Stage 2: Selection
• Elimination of candidates that fail for reasons other
than phonology, so we can weed out candidates using
semantic/pragmatic and syntactic constraints, as
well as later-stage phonology
– Stage 3: Integration of semantic and syntactic
information
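The first two stages are easy to caricature in code. The sketch below is my illustration with a made-up mini-lexicon and a context predicate standing in for the semantic/pragmatic and syntactic constraints, not Marslen-Wilson's actual model.

```python
# Toy sketch of Stage 1 (initial access) and Stage 2 (selection).
LEXICON = ["captain", "capital", "captive", "cattle", "candle"]

def initial_access(heard_onset):
    """Stage 1: the access cohort is every word consistent with the onset
    (standing in for the first 150-200 ms of bottom-up signal)."""
    return [w for w in LEXICON if w.startswith(heard_onset)]

def selection(cohort, fits_context):
    """Stage 2: weed out candidates for non-phonological reasons, using a
    semantic/pragmatic or syntactic predicate supplied by the context."""
    return [w for w in cohort if fits_context(w)]

cohort = initial_access("cap")
print(cohort)                                   # -> ['captain', 'capital', 'captive']
# Hypothetical context "The ship was steered by the ..." favours a person:
print(selection(cohort, lambda w: w in {"captain", "captive"}))
```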
iv.) Cohort Model (Marslen-Wilson)
• Pros
– Does take into account semantics and
pragmatics
– Well-supported by a variety of experimental
evidence: frequency effects, neighbourhood
effects, and word/non-word reaction time (RT) effects
iv.) Cohort Model (Marslen-Wilson)
• Cons
– Says nothing about mechanisms
– Says nothing about word segmentation
• The model assumes listeners pick out the words, but
we have seen that word boundaries are not usually
specified in the speech stream
– Not incompatible with other models, since it takes
phonemic activation (the selection of the initial cohort)
for granted (maybe this is a pro?)
v.) TRACE Model (Elman &
McClelland)
• Basic idea: A parallel distributed processing (PDP)
model: degree of activation/inhibition from units
at each of three levels (phonemic feature,
phoneme, word) is determined by the resting
activation level of word units
– Each gets input directly from a constant sequence of
phonemes, all equally valuable
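A toy interactive-activation update in the spirit of a PDP model, with made-up parameters and a three-word lexicon (my sketch, not the published TRACE implementation): phoneme-level activation excites the words that contain those phonemes, and competing words inhibit each other.

```python
# Toy sketch: bottom-up excitation from phonemes to words plus lateral
# inhibition between words. Parameters and lexicon are invented for illustration.
WORDS = {"cat": ["k", "a", "t"], "cap": ["k", "a", "p"], "dog": ["d", "o", "g"]}

def update_word_activations(phoneme_activation, word_activation,
                            excite=0.1, inhibit=0.05, decay=0.02):
    new = {}
    for word, phones in WORDS.items():
        bottom_up = sum(phoneme_activation.get(p, 0.0) for p in phones)
        competition = sum(a for w, a in word_activation.items() if w != word)
        a = word_activation[word]
        new[word] = max(0.0, a + excite * bottom_up - inhibit * competition - decay * a)
    return new

# Hearing a clear /k/ and /a/ boosts both "cat" and "cap"; they then compete.
phonemes = {"k": 1.0, "a": 1.0, "t": 0.2, "p": 0.2}
words = {w: 0.1 for w in WORDS}
for _ in range(5):
    words = update_word_activations(phonemes, words)
print(words)   # "cat" and "cap" end up well above "dog"
```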
v.) TRACE Model
• Pros
– Decision based on overall goodness of fit, so degraded
input is not problematic
– Consistent in principle with cohort activation models
(and so well-supported by experimental evidence)
– Does take into account semantics and pragmatics: activation of overlapping lexical levels is explicable
v.) TRACE Model
• Cons
– Treats all features as equal, which we know they are not
– Says nothing about mechanisms
– ‘Cheats’ by building in phonemic activation (the
selection of the initial cohort) through direct activation of
those features
• One big part of the puzzle (how do we specify and
recognize these features?) is thereby glossed over
– Highly over-simplified, both at the level of language
and neurology
• Are these models incompatible?
• Can they be synthesized into a ‘metamodel’?
• How could we test for which parts of each
were best?