Speech Production - Delving into the Mental Life of Language

Speech Perception
[]
recognize speech
?
wreck a nice beach
The Major Questions in Speech Perception:
1) How do we identify the sounds we hear?
2) What about the “lack of invariance” in the
speech signal?
3) What about degraded signals?
How do we identify sounds?
Speech occurs at an alarming rate:
(estimates vary between 120 and 180 words per minute)
…10-15 or 25-30 phonetic segments per second!
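A rough check of the arithmetic behind these figures, as a sketch only. The value of 5 phonetic segments per word is an assumption made for illustration (it is not stated on the slide), and `segments_per_second` is a hypothetical helper name.

```python
def segments_per_second(words_per_minute, segments_per_word):
    """Convert a speaking rate in words per minute to phonetic segments per second."""
    return words_per_minute * segments_per_word / 60

# Assumed average of 5 segments per word (illustrative, not from the slide).
ASSUMED_SEGMENTS_PER_WORD = 5

for wpm in (120, 180):
    rate = segments_per_second(wpm, ASSUMED_SEGMENTS_PER_WORD)
    print(f"{wpm} wpm is about {rate:.0f} segments/second")
```

Under that assumption, 120-180 words per minute corresponds to the 10-15 segments per second quoted above.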
The speech signal is continuous – there are no
easily identifiable boundaries between
words
The speech signal to the right 
is segmented into “how are you?”
How do we deal with the “lack of
invariance” in a speech signal?
Lack of Invariance comes from:
• Coarticulation effects (Allophonic variation)
“Tom Burton tried to steal a butter plate”
• Speaker variation
• No exact repetition
• Reduction / deletion of segments
Acoustic Cues
• No single acoustic cue is reliably present for
any given phoneme
– for [di] and [du], the acoustic signal for /d/ is very different, but
speakers still judge it to be the same
segment
• Each phoneme has more than one acoustic cue
– voice-onset-time (VOT)
– energy in the burst
– onset frequency of the first formant
– placement in syllable
Voice Onset Time (VOT)
• Measure of time between the burst of air and
beginning of vocal-fold vibration of the adjacent vowel
• Best single cue for distinguishing between
voiced/voiceless consonants in many languages:
English, Dutch, Spanish, Hungarian, Tamil,
Cantonese, Thai, and Eastern Armenian… (Lisker &
Abramson, 1964)
• BUT we can still interpret whispered speech!
(practically all voiceless)
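A minimal sketch of how VOT could be computed and used as a single cue. The function names and the 30 ms boundary are assumptions for illustration; the actual boundary differs between languages, and VOT is only one of several cues.

```python
def voice_onset_time_ms(burst_ms, voicing_ms):
    """VOT = time of voicing onset minus time of the burst (negative = prevoicing)."""
    return voicing_ms - burst_ms

def classify_stop(vot_ms, boundary_ms=30.0):
    """Label a stop from VOT alone; boundary_ms is a hypothetical category boundary."""
    return "voiceless" if vot_ms > boundary_ms else "voiced"

# Burst at 100 ms into the recording, voicing starts at 160 ms.
vot = voice_onset_time_ms(burst_ms=100, voicing_ms=160)
print(vot, classify_stop(vot))
```

A classifier like this has nothing to work with in whispered speech, which is why listeners must be relying on other cues as well.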
Categorical Perception
(chunking of speech signals)
• Although speech is non-discrete, we perceive it
discretely!
• Task: Identify the sound
[Figure: identification function. x-axis: VOT from 0 to 60 ms. Responses are 100% /d/ at short VOTs and 100% /t/ at long VOTs, crossing 50% at the category boundary.]
Categorical Perception
Yeni-Komshian and LaFontaine (1983)
– 7 stimuli, between [di:]/[ti:] (VOT 0 - 60 ms)
[Figure: the 7 stimuli on a VOT axis from 0 to 60 ms in 10 ms steps; pairs are marked as same, 1 step apart, or 2 steps apart.]
• Task: Discriminate between these sounds
(2 steps apart – so 20 ms difference on VOT)
0/20 ms – 100% same
40/60 ms – 100% same
10/30 ms – 50% same
30/50 ms – 50% same
20/40 ms – 100% different
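The pattern above is what one would expect if listeners could only compare category labels. The sketch below assumes an idealized step-shaped identification function with a boundary at 30 ms and assumes two sounds are heard as different only when they receive different labels. Both are simplifying assumptions, not the analysis used in the study.

```python
def p_voiceless(vot_ms, boundary_ms=30):
    """Idealized identification: probability of labeling a stimulus as /t/."""
    if vot_ms < boundary_ms:
        return 0.0
    if vot_ms > boundary_ms:
        return 1.0
    return 0.5  # exactly on the boundary: labels split evenly

def p_heard_different(vot_a, vot_b):
    """Probability that the two stimuli receive different labels."""
    pa, pb = p_voiceless(vot_a), p_voiceless(vot_b)
    return pa * (1 - pb) + pb * (1 - pa)

for a, b in [(0, 20), (10, 30), (20, 40), (30, 50), (40, 60)]:
    print(f"{a}/{b} ms: {p_heard_different(a, b):.0%} different")
```

Pairs within a category come out as "same" even though they differ by the same 20 ms as the pair that straddles the boundary.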
What about bilinguals?
• VOT boundaries vary between languages
• Perception studies show compromise-effects
Canadian French-English bilinguals
(Caramazza, Yeni-Komshian, Zurif, & Carbone, 1973)
Spanish-English bilinguals
(Williams, 1977, 1980)
Bilinguals seem to have developed a single
perceptual system!
Coarticulation Effects
Phonemes are influenced by the sounds around
them!
• Take naturally recorded speech
• Remove vowel
• Guess the vowel
– Example: “see” [si:] – remove the vowel
– Play 150 ms of /s/
– Can identify removed vowel (for most vowels)
How is speech perceived under
less than ideal conditions?
[Diagram: UNDERSTANDING draws on top-down information (semantic context, syntactic structure) and on bottom-up information (acoustic information).]
A demonstration
The McGurk Effect:
We use visual AND auditory cues to
determine what segments we’re hearing!
Top-Down Processing
(using semantic and syntactic information to decode
individual words in fluent speech)
*language* *speech* *recognition* *talk*
recognize speech
[]
Bottom-Up Processing
(using acoustic information to decode the speech signal)
Phoneme Restoration Effect
(Warren, 1970)
• Replaced sounds with a cough
• Word presented in a sentence
– The bill was sent to the legi_lature.
• “Where does the cough occur?”
• Participants thought whole word was present.
The /s/ was mentally restored!
– It was found that the _eel was on the orange.
– It was found that the _eel was on the shoe.
Semantic Influences
(Garnes and Bond, 1976)
• 16 tokens, spanning the spectrum of bait-date-gate
• 3 carrier sentences:
– Here’s the fishing gear and the ______.
– Check the time and the _______.
– Paint the fence and the _______.
• If the token is unambiguous, listeners report what they hear, even in
semantically implausible sentences (Paint the fence and the bait.)
• If the token is ambiguous (near a phoneme boundary),
semantic context decides which word is heard
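One way to picture this interaction is as acoustic evidence weighted by contextual expectation. The sketch below is a simple Bayesian illustration, not the analysis from Garnes and Bond; every number in it is hypothetical.

```python
def posterior(acoustic_likelihood, context_prior):
    """Combine acoustic evidence with contextual expectation (Bayes' rule)."""
    unnormalized = {w: acoustic_likelihood[w] * context_prior[w]
                    for w in acoustic_likelihood}
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}

def perceived(acoustic_likelihood, context_prior):
    """Return the word with the highest posterior probability."""
    post = posterior(acoustic_likelihood, context_prior)
    return max(post, key=post.get)

# Hypothetical expectations after "Paint the fence and the ___".
fence_context = {"bait": 0.1, "date": 0.1, "gate": 0.8}

# Hypothetical acoustic evidence for two tokens on the continuum.
clear_bait = {"bait": 0.98, "date": 0.01, "gate": 0.01}
ambiguous = {"bait": 0.02, "date": 0.50, "gate": 0.48}

print(perceived(clear_bait, fence_context))  # acoustics override context
print(perceived(ambiguous, fence_context))   # context tips the balance
```

With a clear token the acoustics dominate and the implausible word is reported; with a token near the date/gate boundary the context settles the choice.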
Slurred Speech
• Syntactic and semantic cues help!
• Words (with noise) are perceived more
accurately in sentences than in isolation
– Pollack & Pickett (1964) recorded
conversations and excised individual words.
When the excised words were presented to listeners for
identification in isolation, only half of them
were correctly recognized.
Rules of Rapid Speech:
“hanmethethimbook”
• Often can drop the las consonan
• Consonants in clusters may be modified to
have the same blace of articulation/voicing
– |thimbook|, |thingcarpet|, |Istambul|
– NOT: |thingbook|, |thim slice|
• Almost all vowels can be shortened
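The place-assimilation rule can be sketched as a spelling-based toy. It works on letters rather than phonemes, handles only a word-final "n", and treats every word-initial "c" as a velar, so it is an illustration under those assumptions rather than a phonological model. `assimilate_final_n` is a hypothetical helper.

```python
BILABIALS = ("p", "b", "m")
VELARS = ("k", "g", "c")  # crude: assumes a spelled "c" is pronounced /k/

def assimilate_final_n(word, next_word):
    """Give a word-final 'n' the place of articulation of the next consonant."""
    if not word.endswith("n"):
        return word + " " + next_word
    first = next_word[0]
    if first in BILABIALS:
        return word[:-1] + "m" + next_word   # n becomes m before p, b, m
    if first in VELARS:
        return word[:-1] + "ng" + next_word  # n becomes ng before k, g
    return word + " " + next_word            # no assimilation elsewhere

for nxt in ("book", "carpet", "slice"):
    print(assimilate_final_n("thin", nxt))
```

The nasal only ever takes on the place of the consonant that follows it, which is why |thingbook| and |thim slice| do not occur.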
Listening for Mispronunciation
(Cole, 1973; Cole, Jakimik, & Cooper, 1978; Cole & Jakimik, 1980)
20-minute story. Press a button whenever you hear a
mispronunciation.
• Notice more stop errors with voicing:
– 70% for stops (boot to poot)
– 64% for affricates (chance to jance)
– 38% for fricatives (fin to vin)
• Notice almost all place changes (80-90%):
– (take to pake)
– no higher percentage if voicing also changed (take to gake)
• Notice more errors at beginnings of words:
– 72% for word-initial segments (dish to tish)
– 33% for word-final segments (split to splid)
• Conclusion: we DO use bottom-up information!
Mad Gab
• Lit Told Hid High No
• Ate Whole Freak Haul
• Hike Air Rub Ouch Hue
• Huff Ink Earn Elf Aisle
• Ale All Heap Hop
(A lollipop)
• Butcher Ed Stew Gather
(Put your heads together)
• Lease Hummer Reap Wrest Lee
(Lisa Marie Presley)
• Bill Spare Reed Oh-boy!
(Pillsbury Dough Boy)
Models of Speech Perception
• Motor Theory of Speech Perception
– Speech signals interpreted by reference to
motor speech movements
• Cohort Model
• TRACE Model
• Cohort Model:
1) The acoustic information at the beginning of a
word activates a “cohort” of possible words
2) Syntax and semantics influence the selection of
the target word from the cohort
Cohort size
• Standard dictionary
– after 50 ms, 115 nouns share the same sounds
– after 100 ms, 43 nouns
– after 200 ms, 11 nouns
– after 300 ms, 5 nouns
(Average word length, depending on speech rate, for one-, two-, and
three-syllable words is between 550 – 830 ms)
Word recognition occurs before the
isolation point (the point at which only one word remains possible)!
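How a cohort shrinks as more of the word is heard can be sketched with a toy lexicon. Letters stand in for phonemes and the six-word lexicon is invented for illustration; only the bottom-up narrowing is modeled, not the influence of syntax and semantics.

```python
# Hypothetical toy lexicon; letters stand in for phonemes.
LEXICON = ["candle", "candy", "canyon", "capital", "captain", "cat"]

def cohort(heard_so_far, lexicon=LEXICON):
    """All words still consistent with the input heard so far."""
    return [w for w in lexicon if w.startswith(heard_so_far)]

def isolation_point(word, lexicon=LEXICON):
    """Number of segments needed before only the target word remains."""
    for n in range(1, len(word) + 1):
        if cohort(word[:n], lexicon) == [word]:
            return n
    return None  # never uniquely identified in this lexicon

target = "candy"
for n in range(1, len(target) + 1):
    print(target[:n], cohort(target[:n]))
print("isolation point:", isolation_point(target))
```

In this sketch the target is isolated from acoustic input alone; context acting on the cohort is what would allow selection before that point.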
• TRACE Model: Neural Network
(Elman and McClelland 1984, 1986)
– processing occurs through excitatory and
inhibitory connections among processing units
called nodes
– 3 levels of nodes: features, phonemes, and
words all highly interconnected
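The idea of excitatory and inhibitory connections can be illustrated with a single layer of word nodes. This is a toy sketch, not the TRACE implementation: there are no feature or phoneme layers, letters stand in for phonemes, and the parameter values (`excite`, `inhibit`, `decay`) are arbitrary assumptions.

```python
def bottom_up_support(word, heard):
    """Fraction of positions where the word matches the input."""
    matches = sum(1 for a, b in zip(word, heard) if a == b)
    return matches / max(len(word), len(heard))

def settle(words, heard, steps=30, excite=0.2, inhibit=0.1, decay=0.1):
    """Update word-node activations until one candidate dominates."""
    act = {w: 0.0 for w in words}
    for _ in range(steps):
        total = sum(act.values())
        new_act = {}
        for w in words:
            rivals = total - act[w]  # summed activation of competing words
            delta = (excite * bottom_up_support(w, heard)  # excitation from input
                     - inhibit * rivals                    # lateral inhibition
                     - decay * act[w])                     # passive decay
            new_act[w] = min(1.0, max(0.0, act[w] + delta))
        act = new_act
    return act

activations = settle(["bank", "band", "bone"], heard="bank")
winner = max(activations, key=activations.get)
print(winner, activations)
```

All candidates are activated at first, and the best-matching word gradually suppresses its competitors through inhibition.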
Evidence for the TRACE model
(or other interactive models)
• We activate all possible words from the
phonology regardless of semantic fit
– “He swam across to the far side of the river and
scrambled up the bank before running off” primes
bank as financial institution!
– parts of words cause priming:
• “trombone” primes rib just as well as “bone” does
– word boundaries don’t interfere with phonological
retrieval
• nudist is primed by the phrase “new distance”
– BUT we eliminate all the irrelevant words within a
few syllables
For more information
• b-d-g continuum
http://www.phonetik.uni-muenchen.de/Lehre/Skripten/Haskins/Haskins/MISC/PP/bdg/bdgau.html
• Resources on phonetics and phonology
http://faculty.washington.edu/dillon/PhonResources/
• Why we need prosody and lexical access
http://emsah.uq.edu.au/linguistics/book/flant.htm