
Pattern and Speech Recognition
Speech recognition
John Beech
School of Psychology
PS1000
Speech Recognition
Listening to speech isn’t like reading.
Speech sounds are produced by changing the position and
shape of the tongue and the position and shape of the lips.
The shape of the vocal tract changes continuously in a
fluid way and these shapes depend on previous shapes.
Individual sounds of words are called phonemes. E.g. ‘rat’
and ‘cat’ differ by just one phoneme.
There are about 40 phonemes in English and the average
duration is 50 msec (1/20 sec). In the context of a word,
this is a very short time to identify each phoneme from 40
others.
Also, phonemes are not simple and distinct in the way that the letters in
written words are.
Speech perception
Often there are quite subtle
differences between the
sounds of phonemes.
E.g. in both [ba] and [pa] the
lips open and soon after the
vocal folds vibrate.
In [ba] it is after 20 ms and in
[pa] it is after 40 ms. The
longer gap between the lip
opening and the voice onset
for “pa” is because there is
more of a puff of air - and
therefore it lasts longer.
This gap is known as the ‘voice onset time’.
Thus the [b] and [p] sounds are
differentiated by voice onset time.
Differences in voice onset time
between “ba” and “pa”
[Figure: timelines for “ba” and “pa”. In both, the lips open at the start; voice onset follows after 20 ms for “ba” and after 40 ms for “pa”.]
Speech perception
If we take any point in time, the sound being
produced is not just the intended sound: it is
affected by the previous sound AND the following
sound. This subtle variation in sound is known as
co-articulation.
Thus phonemes can vary according to context; for
example, the [p] in [pit] sounds different from the [p]
in [spit]: in [pit], the [p] is more aspirated.
Speech perception
Phonemes are produced by distinct articulations, e.g. [t] at the
front and [k] at the rear of the mouth.
Phonemes vary on three dimensions: amplitude, frequency and
duration. These can be measured with a speech spectrogram.
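To make the spectrogram idea concrete, here is a minimal Python sketch (my illustration, not from the lecture): it synthesizes half a second of a crude vowel-like sound and computes its spectrogram with scipy. The fundamental and “formant” frequencies and all parameters are illustrative assumptions.

```python
# Minimal sketch: measuring frequency content over time with a spectrogram.
# The synthetic "vowel" and all numbers below are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                             # sample rate in Hz
t = np.arange(0, 0.5, 1 / fs)          # 500 ms of signal
# Crude vowel-like sound: a 120 Hz fundamental plus two "formant" tones.
sound = (np.sin(2 * np.pi * 120 * t)
         + 0.5 * np.sin(2 * np.pi * 700 * t)     # pseudo-formant 1
         + 0.3 * np.sin(2 * np.pi * 1200 * t))   # pseudo-formant 2

# Short overlapping windows give amplitude per frequency per time frame,
# i.e. the three dimensions above: amplitude, frequency and duration.
freqs, times, power = spectrogram(sound, fs=fs, nperseg=256, noverlap=128)
print(power.shape)                     # (frequency bins, time frames)
```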
Speech perception
Speech spectrograms
showing how words look
when depicted by their
frequencies over time. Note
how even when the
beginning consonant is the
same, the shape of the
speech signal differs
depending on the following
vowel.
Words used: first column, going
down: bet, bee, boat & bird;
second column, going down:
debt, deal, dome & dirt.
Spectrogram and phonetic transcription of “speech
perception lab”
What we think we hear sometimes is
not what we actually hear
The ‘McGurk effect’, named after Harry
McGurk:
The McGurk effect: a demonstration
In the following demonstration:
1. Sound & Sight: You should hear the
sound ‘Da’ when looking at the
picture and listening to the sound.
This is an illusory sound midway
between [ba] & [ga].
2. Sound alone: Next close your eyes
and listen to the sound and you should
hear ‘Ba’.
3. Sight alone: Put your fingers in your
ears and look at the face and you
should perceive the sound to be ‘Ga’.
Summary of the demonstration: Sound & Sight: “Da”; Sound alone: “Ba”; Sight alone: “Ga”.
Speech perception
What we think we hear sometimes is not what we
actually hear
Bastian et al. (1961) inserted a gap of silence between
[s] and [leet]. If the gap was short, people thought it was
[sleet]; if it was longer, they thought it was [spleet].
Categorical perception
Liberman et al. (1957) demonstrated that the perception of
phonemes is categorical. As mentioned before, phonemes
with voice onset times less than 20 msec are perceived as /b/
and those with voice onset times more than 40 msec are perceived
as /p/.
[Figure: timelines for “ba” and “pa”: the lips open at the start, and voice onset follows at 20 msec for “ba” and at 40 msec for “pa”.]
Categorical perception
In both [ba] and [pa] the lips open and soon after the vocal folds
vibrate. In [ba] voice onset is less than 20 ms and in [pa] it is
after 40 ms. Liberman et al. created artificial versions of [ba]
and [pa] with different voice onset times, then played pairs
in sequence and asked whether the two sounds were the same or different.
Liberman et al. (1957) task: if participant thinks that he or she
hears:
“pa” – “pa” then says “same”
“ba” – “ba” then says “same”
“pa” – “ba” then says “different”
The stimulus materials varied in voice onset times.
Categorical perception
Liberman et al. found that if both voice onsets were 0-20 ms,
or both were 40-60 ms, i.e. on the same side of the boundary,
then participants would say that there was no difference. If the
onsets were on different sides of the boundary (e.g. the first at
15 msec and the second at 45 msec), they said ‘different’.
Liberman et al. showed that the whole range of voice onset times
between 0 and 20 msec is classified as within the category
of a “b” sound.
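A toy Python sketch of this finding (my own illustration, not Liberman et al.’s procedure; the 30 ms boundary is an assumption placed midway between the 20 and 40 ms landmarks). The point is that “same/different” judgements follow the category a voice onset time falls into, not the raw acoustic difference:

```python
# Toy model of categorical perception of voice onset time (VOT).
BOUNDARY_MS = 30   # assumed /b/-/p/ boundary (midway between 20 and 40 ms)

def category(vot_ms: float) -> str:
    """Classify a VOT as /b/ or /p/ relative to the boundary."""
    return "b" if vot_ms < BOUNDARY_MS else "p"

def judge(vot1_ms: float, vot2_ms: float) -> str:
    """Listeners say 'same' iff both sounds fall in the same category."""
    return "same" if category(vot1_ms) == category(vot2_ms) else "different"

print(judge(5, 15))    # 10 ms apart, same side of boundary  -> same
print(judge(45, 55))   # 10 ms apart, same side of boundary  -> same
print(judge(15, 45))   # straddles the boundary              -> different
```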
Categorical perception
One hypothesis is that we learn which voice onset
times are relevant. Unfortunately, categorical perception
is also found in 1-month-olds (Eimas et al., 1971)! This suggests
that we’re born able to recognize phonemes, including
phonemes in other languages.
At 10 months, processing of phonemes is specific to
mother tongue and not to other languages.
Do we have phoneme detectors? Not in the case of
vowels. Unlike consonants, we do not categorize
vowels.
Phonemic boundaries
Summerfield (1981) showed that the voice onset
boundary between [ba] and [pa] changed according
to the rate of the surrounding speech.
Thus at a fast speech rate, a sound with a short onset
time that would otherwise be heard as [ba] is perceived
as [pa], because its onset time is long relative to the fast rate.
This suggests that we learn to interpret phonemes
according to the context of the rate of speech.
However, babies can do it too!
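One way to picture this is a category boundary that scales with speech rate. The Python sketch below is speculative: the 30 ms baseline and the simple division by a rate factor are my assumptions, not Summerfield’s model.

```python
# Speculative sketch: the /b/-/p/ boundary shrinks as speech gets faster,
# so the same VOT can flip category depending on the rate of speech.
def boundary_ms(rate_factor: float) -> float:
    """rate_factor > 1 means faster speech; the boundary scales down."""
    return 30.0 / rate_factor          # 30 ms baseline is illustrative

def perceive(vot_ms: float, rate_factor: float = 1.0) -> str:
    return "b" if vot_ms < boundary_ms(rate_factor) else "p"

print(perceive(25, rate_factor=1.0))   # normal rate: boundary 30 ms -> b
print(perceive(25, rate_factor=2.0))   # fast rate: boundary 15 ms  -> p
```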
Phonemic boundaries
Eimas and Miller (1980) showed that babies can
judge relative durations depending on different
speech contexts. This must be useful as different
speech rates would be very confusing for learning
a language. However, at the moment, it is difficult
to see how this is achieved.
Our perception of words
Words are built up from
phonemes and syllables.
Words in turn are used to
construct sentences.
Making sense of speech
sounds (going up): we start
at the level of processing
phonemes, and these in turn
are processed into higher-order
units (syllables, then words,
then phrases).
Or, producing a sentence
(going down): we have a
thought that is at the level
of the phrase or longer,
which we then need to
decompose in order to
produce speech.
Our perception of words
An early theory was by
Morton (1969 & 1970).
The ears (and/or eyes) are
bombarded with word
features (acoustic,
semantic and visual).
Logogens are word units
that fire if a critical level
of activation is reached.
This level changes with word
frequency, just as in
Treisman’s model.
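A minimal Python sketch of the logogen idea (the feature sets, frequencies and threshold formula are invented for illustration; Morton’s model is not specified at this level of detail):

```python
# Toy logogen: a word unit accumulates evidence from incoming features
# and "fires" when it crosses a threshold that is lower for frequent words.
class Logogen:
    def __init__(self, word, features, frequency):
        self.word = word
        self.features = features
        self.threshold = 4.0 - 2.0 * frequency   # illustrative formula
        self.activation = 0.0

    def receive(self, incoming):
        """Add one unit of activation per matching feature; fire at threshold."""
        self.activation += len(self.features & incoming)
        return self.activation >= self.threshold

lexicon = [
    Logogen("cat", {"/k/", "/ae/", "/t/"}, frequency=0.9),  # frequent word
    Logogen("rat", {"/r/", "/ae/", "/t/"}, frequency=0.5),  # rarer word
]
for unit in lexicon:
    fired = unit.receive({"/k/", "/ae/", "/t/"})            # hear "cat"
    print(unit.word, "fires" if fired else "stays quiet")
```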
Our perception of words
On average we know 60,000-75,000 words, excluding
variants (e.g. like, likes, liked, liking, alike,
liken, likeness, likewise). This means each word we
hear has to be accessed from this lexicon. How?
The question is: what are the important elements of
the sounds in a word that are used to access the
correct entry? Do we use phonemes?
Our perception of words
Infants seem to use syllables for organisation, or at
least they organise according to rhythms of their
language.
Experimental work suggests that for adult French
speakers syllabic structure is important, but for
English speakers it is not (Mehler et al., 1981). However,
we are not much further on: we are still not sure
how a word is accessed from the lexicon.
What’s in a word?
We’ve already described the constituent sounds, but
there is more to a word than this. There is its
meaning.
Words consist of morphemes (don’t confuse with
morphine). These are speech elements which have
a meaning and which can’t be subdivided into
further elements. E.g. ‘book’ is a morpheme,
while ‘books’ is two: ‘book’ and ‘-s’ for plural.
What’s in a word?
Morphemes are of two kinds: stems and
affixes (‘-s’, ‘-ing’). In English the more
common affix is the suffix at the end of
words (e.g. ‘-s’) and the less common the
prefix (e.g. ‘un-stable’).
Other languages can have infixes (affixes put
into the stem), but this occurs only rarely, and
usually informally, in English.
What’s in a word?
With some verbs the vowel in the stem is
modified to change meaning (e.g. say-said,
run-ran). This goes back to earlier times
before the suffix –ed was introduced to
denote past tense (cook-cooked).
What’s in a word?
Inflectional affixes don’t change a word’s meaning (e.g. -ed,
-ing), but derivational affixes do (e.g. un-usable, use-less).
It might seem that to get at meaning we analyze the word’s
stem and affixes. Unfortunately, there are times when this
does not work. There are words with common stems (e.g.
pro-blem, em-blem, blem-ish) where the stem (‘blem’)
doesn’t have a common meaning. (There may have been
one at one time for blem, from the 14th-century French
‘blemir’, to make pale.)
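A naive affix-stripping sketch in Python makes the pitfall concrete (the tiny affix inventories are illustrative assumptions): ‘books’ and ‘unstable’ decompose sensibly, but ‘problem’ yields a “stem” with no meaning of its own.

```python
# Naive stem+affix analysis: strip at most one prefix and one suffix.
PREFIXES = ["un-", "pro-", "em-"]          # tiny illustrative inventories
SUFFIXES = ["-s", "-ed", "-ing", "-less"]

def decompose(word):
    """Return (prefix, stem, suffix); None where no affix matched."""
    prefix = suffix = None
    for p in PREFIXES:
        if word.startswith(p.rstrip("-")):
            prefix, word = p, word[len(p) - 1:]
            break
    for s in SUFFIXES:
        if word.endswith(s.lstrip("-")):
            suffix, word = s, word[: len(word) - (len(s) - 1)]
            break
    return prefix, word, suffix

print(decompose("books"))     # (None, 'book', '-s')    -- sensible
print(decompose("unstable"))  # ('un-', 'stable', None) -- sensible
print(decompose("problem"))   # ('pro-', 'blem', None)  -- 'blem' means nothing
```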
What’s in a word?
We have only sampled the complexity of this subject as studied
by linguists. But our concern is more with
psycholinguistics: we want to know how words are
accessed in the mental lexicon, or dictionary.
How does our mental lexicon work?
Marslen-Wilson (1987) found, in a shadowing task, a
correlation between the time to recognise a word and the
point at which its sound becomes unique in relation to
other words. E.g. when the [z] sound in ‘trousers’ is
encountered, this eliminates ‘trowel’, ‘trounce’ and ‘trout’.
We recognise a word before its acoustic offset.
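The idea of a uniqueness point is easy to sketch in Python. The four-word lexicon below comes from the lecture’s example; using spellings to stand in for phoneme sequences is a simplifying assumption.

```python
# Cohort-style sketch: track the candidate set as a word unfolds and find
# the point at which only one candidate remains (the uniqueness point).
LEXICON = ["trousers", "trowel", "trounce", "trout"]   # toy lexicon

def cohort(prefix):
    """All lexicon words consistent with the input heard so far."""
    return [w for w in LEXICON if w.startswith(prefix)]

def uniqueness_point(word):
    """Number of segments after which only one candidate remains."""
    for i in range(1, len(word) + 1):
        if cohort(word[:i]) == [word]:
            return i
    return len(word)

for n in range(1, 6):
    print(f"after '{'trousers'[:n]}':", cohort("trousers"[:n]))
print("uniqueness point of 'trousers':", uniqueness_point("trousers"))
```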
Speech perception: How does
our mental lexicon work?
Marslen-Wilson (1987):
Listening to the word “trousers”
[Figure: the number of candidate words at each point while hearing “trousers”: after /tr/ there are 238 candidates; after /tr/ + /ow/, 4 remain (trousers, trowel, trounce, trout); after /tr/ + /ow/ + /z/, only 1 remains (trousers).]
Speech Perception: Zwitserlood (1989)
Zwitserlood (1989) used cross-modal priming (i.e.
listening and reading) with Dutch participants; in
English translation, this is what happened. People heard
just the segment of speech capt, which is part of
the word captain. As soon as the ‘t’ had been
sounded, the printed word ship was shown on the
screen. The actual task was to press either a
“word” button or a “no-word” button according to whether
the printed item (e.g. “ship”) was a real word or not.
[See next figure]
Speech Perception: Zwitserlood (1989)
Heard “capt..”, shown the word ship: press “yes” button.
Heard “capt..”, shown the nonword blag: press “no” button.
Priming effects:
1. If shown a control word (e.g. boot), responses were slower than for the word ship.
2. If shown captive, it was also responded to faster than the control word.
Conclusion: when listening to “captain” and reaching the /t/ sound, all candidate words are activated (“captain”, “captive”, “capture”).
Speech Perception: Zwitserlood (1989)
However, if the full word was spoken (either captive or
captain), then only the related word was activated. Thus
the spoken word captive would only influence the printed
word captive and not the printed word captain. Hearing
the full word meant that the other candidates (e.g. capture)
had been rapidly de-activated.
[As shown in the next figure]
Speech Perception: Zwitserlood (1989)
BUT if the full word “captain” was heard and the word ship was then shown (press “yes” button),
THEN only “captain” primed the word ship.
Conclusion: after passing /t/, hearing the full word meant that “captive”
had rapidly de-activated.
Getting at the meaning of a word:
words with several meanings
Entries in a dictionary describe the meanings of words. The mental lexicon
stores these word meanings. However, what about the many words
that have several meanings, such as “bay”?
E.g. bay: 1. type of shore; 2. alcove; 3. cry of a hound; 4. laurel; 5. reddish-brown colour.
Which way could our mental lexicon work?
1. Activate all 5 meanings in parallel?
2. Activate only one of the 5, given the context?
Swinney (1979): words with several meanings
Swinney (1979) wondered if the other meanings are also
activated even when the context singled out one meaning.
Participants listened to sentences with an ambiguous word
embedded, e.g. ‘the villains decided to kill their master so
they hatched a plot…’
Immediately after the ambiguous word was spoken they saw a
printed word. For instance, after hearing the ambiguous
word plot they might see one of its two different meanings:
either land or plan.
The task was to decide if the word on the screen was a real
word or not. For instance, if they saw land they would
respond “word” but for blag they would respond “nonword”.
Swinney (1979): words with several meanings
Start of the sentence they listened to: “the villains decided to kill their master so
they hatched a plot…”
Listened to: “…hatched a plot…”
Saw word/nonword:
plan (congruent with the sentence)
land (incongruent with the sentence, but congruent with “plot”)
bike (incongruent with the sentence, neutral with respect to “plot”)
blag (nonword)
Result: plan and land faster than bike.
Conclusion: Both meanings of “plot” highly activated at this point – even
though the sentence context points to “plan” and not to “land”.
Swinney (1979): words with several meanings
Thus Swinney found that this sentence context activated both
of the different meanings, ‘land’ and ‘plan’: decision times
were faster for both words compared with a neutral
condition. In other words, the sentence context activated both
alternative meanings of the word plot.
This showed that the activation of both meanings was high at
the point when the printed word was shown, even though
the sentence context was pointing to only one particular
meaning.
Swinney (1979): words with several meanings
Swinney THEN tried “hatched a plot to kill..” before presenting the
probe words. This time “land” had been de-activated and “plan” was
still active.
Further work refined Swinney’s findings and showed that the
activation of a particular meaning was also determined by
its frequency of use. Thus if the alternative meaning was
rarely used, then activation was weak in the first place. So
the effect happens more strongly when two meanings are
equally likely.
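A toy Python sketch of the overall pattern (the activation numbers and the equal-frequency assumption are mine, not Swinney’s data): both meanings start out active in proportion to their frequency of use, and context suppresses the inappropriate one only by the later probe point.

```python
# Toy model: meanings of an ambiguous word at two probe points.
MEANINGS = {"plot": {"plan": 0.5, "land": 0.5}}   # assumed equal frequency

def activation(word, probe_point, context_meaning):
    """Immediately, all meanings are active (scaled by frequency);
    later, context has suppressed the inappropriate ones."""
    acts = dict(MEANINGS[word])
    if probe_point == "later":                    # e.g. after "...to kill"
        for meaning in acts:
            if meaning != context_meaning:
                acts[meaning] = 0.0               # suppressed by context
    return acts

print(activation("plot", "immediate", "plan"))  # both primed: plan AND land
print(activation("plot", "later", "plan"))      # only plan remains active
```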
Tanenhaus et al. (1979): words with several
meanings
Tanenhaus et al. (1979) used the same type of
paradigm. They looked at words in which
the alternative meanings are different also in
grammatical terms. Would both forms still be
activated? For instance, watch has two
meanings: one as a noun and one as a verb.
Similarly, cross.
In the sentence: ‘Jim started to…’ only a verb
can follow.
In our mental dictionary do we scan only for a
verb at this point? The answer is ‘no’: we
activate both the noun and verb meanings.
[see next figure]
Tanenhaus et al. (1979): words with several
meanings
Listened to: “Jim started to watch…”
Saw word:
look (congruent with the sentence)
clock (incongruent)
chair (neutral)
blag (nonword)
Result: look and clock faster than chair.
Conclusion: both meanings of watch highly activated even though one is a
noun and the other a verb.
Summary of Speech Recognition
Phonemes are the smallest units of speech and vary in
loudness, pitch and duration. When we speak, each phoneme
can vary according to the surrounding phonemes
(co-articulation). Also, we interpret phonemes according to
the rate of speech rather than the absolute duration of
each phoneme. Thus we differentiate t and d in
terms of their duration, but if the rate of speech is fast we
don’t become confused, because we take into account that,
relatively speaking, t and d now have different durations.
Summary of Speech Recognition
We can hypothesize what we hear (e.g. s__eet). Also, we
categorize consonant phonemes: we don’t tolerate
ambiguity. But we don’t categorize vowel phonemes.
In English we probably don’t use syllables or morphemes in
order to understand a word, but the French seem to use
syllables.
We look up words in an ordinary dictionary sorted in
alphabetical order, getting closer to the word by a series of
decisions. This could be described as a linear process.
Summary of Speech Recognition
The mental lexicon is different. There is a wide activation of
meanings compatible with the stream of sounds,
irrespective of the appropriateness of the syntax (grammar).
Word frequency matters: highly frequent words receive
stronger activation.
There is also a counteracting suppression of inappropriate entries.
Thus although several entries are activated, once there is
more information to narrow it down to one word, the other
words are rapidly de-activated. This process takes place
during or just after hearing the word.