Week 2 - University of Pennsylvania

LING 001 Introduction to Linguistics
Spring 2010
Computer speech
Speech synthesis
Recording and sampling
Speech recognition
Apr. 5
Speech synthesis
• Wolfgang von Kempelen (1734-1804) constructed one of the first
  working synthesizers.
• It had a reed that was kept vibrating by an airstream from a bellows.
• The sound from the reed was fed into a box made of leather and
  wood (the vocal tract), with a movable flap inside it (the tongue) and a
  shutter at one end (the lips).
Speech synthesis
• At the beginning of the 20th century, progress in electrical
  engineering made it possible to synthesize speech sounds by electrical
  means.
• The first device of this kind to attract the attention of a wider public
  was the VODER, developed by Homer Dudley at Bell Labs.
Speech synthesis
• The VODER was based on the VOCODER (VOice enCODER), which uses
  a series of band-pass filters to analyze, transmit, and synthesize
  speech sounds.
Speech synthesis
• Pattern playback was developed by Frank Cooper at the Haskins Labs
  and completed in 1950. It works like an inverse of a spectrograph.
• Light from a lamp passes through a rotating disk and then through a
  spectrogram into photovoltaic cells.
• The amount of light transmitted at each frequency band corresponds
  to the amount of acoustic energy shown at that band on the
  spectrogram.
Computer speech synthesis
• Articulatory synthesizers model the movement of the articulators and
  the acoustics of the vocal tract.
  • Articulatory synthesis has never made it out of the laboratory.
• Formant synthesizers start with the acoustics, based on the
  source-filter model of speech production.
  • Formant synthesizers enjoyed a long commercial history while
    computers were relatively underpowered.
• Concatenative systems use databases of stored speech to assemble
  new utterances.
  • Today most commercial systems are concatenative, with many
    being so-called unit selection approaches.
Formant synthesis
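Below is a minimal Python sketch (not from the slides) of the source-filter
idea behind formant synthesis: a glottal-like impulse train (the source) is
passed through two resonators (the filter) tuned to the first two formants
of a vowel. The pitch, formant frequencies, and bandwidths are ballpark
illustrative values, not taken from the lecture.

```python
import math

RATE = 8000  # samples per second

def impulse_train(pitch_hz, n_samples):
    """Glottal-like source: one pulse per pitch period."""
    period = int(RATE / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz):
    """Two-pole IIR resonator: boosts energy near freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / RATE)
    c1 = 2 * r * math.cos(2 * math.pi * freq_hz / RATE)
    c2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

source = impulse_train(pitch_hz=100, n_samples=RATE)      # 1 s of voicing
vowel = resonator(resonator(source, 500, 80), 1500, 90)   # F1 ~500 Hz, F2 ~1500 Hz
print(max(abs(v) for v in vowel))  # unnormalized amplitude; write to .wav to listen
```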
Concatenative synthesis
• A speech segment is synthesized by simply playing back a recorded
  waveform with a matching phoneme string.
• An utterance is synthesized by concatenating several speech
  segments.
• Issues:
   What type of speech segment (unit) to use?
    Phoneme: ~40
    Diphone: ~1,500
    Syllable: ~15K
    Word: ~100K - 1.5M
    Sentence: ∞
   How to select the best string of speech segments from a given
    library of segments?
   How to alter the prosody of the speech segments to best match the
    desired output prosody?
Diphone Synthesis
• Units are diphones: from the middle of one phone to the middle of the
  next. The diphone r-ih, for example, runs from the middle of the r
  phone to the middle of the ih phone.
• The middle of a phone is more stable than its edges, and the transition
  between the two phones is retained.
Diphone synthesis

[Figure: diphone synthesis pipeline, from Richard Sproat. The input word
"dynamic" passes through Linguistic Preprocessing to the phone string
dInamik, then through Target-to-Unit Mapping to the diphone sequence
*d dI In na am mi ik k*, and finally through Signal Postprocessing. The
units come from a static Diphone Table (*d, am, dI, ik, In, k*, mi, na)
that is annotated by hand, with the "best" instance of each diphone used.]
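A minimal Python sketch (not from the slides) of the target-to-unit mapping
step shown above, assuming phones are written as single characters as in the
dInamik example; "*" marks silence at the utterance edges, as in the table.

```python
def phones_to_diphones(phones):
    """Turn a phone string into the diphone sequence used to index the table."""
    padded = ["*"] + list(phones) + ["*"]
    return [a + b for a, b in zip(padded, padded[1:])]

# "dynamic" -> dInamik, as in the figure above
print(phones_to_diphones(["d", "I", "n", "a", "m", "i", "k"]))
# ['*d', 'dI', 'In', 'na', 'am', 'mi', 'ik', 'k*']
```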
Unit Selection synthesis
• Larger and variable units: from diphones up to whole sentences.
• A large, representative database: 10 hours of speech or more, with
  multiple copies of each unit type.
• Search is used to find the best sequence of units, based on target and
  join costs (a sketch of the search follows below).
• Prosodic modification is often avoided: the selected units may already
  be close to the desired prosody, so little or no signal processing is
  applied to each unit.
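A minimal Python sketch (not from the slides) of the unit-selection search:
dynamic programming over candidate units, minimizing the total target cost
plus join cost. The "units" and cost functions here are made-up toys (plain
numbers standing in for speech segments).

```python
def unit_selection(targets, candidates, target_cost, join_cost):
    """Viterbi-style search; returns (best total cost, best unit sequence)."""
    # best[i][u] = (cheapest cost of a path ending in unit u at position i, backpointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for u in candidates[i]:
            prev, cost = min(
                ((p, pc + join_cost(p, u)) for p, (pc, _) in best[-1].items()),
                key=lambda pair: pair[1],
            )
            layer[u] = (cost + target_cost(targets[i], u), prev)
        best.append(layer)
    # Trace back from the cheapest unit in the final position.
    u, (total, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
    seq = [u]
    for layer in reversed(best[1:]):
        u = layer[u][1]
        seq.append(u)
    return total, seq[::-1]

# Toy usage: pick pitch values close to a target contour with smooth joins.
targets = [100, 120, 110]
candidates = [[95, 105], [118, 130], [100, 112]]
cost, seq = unit_selection(
    targets, candidates,
    target_cost=lambda t, u: abs(t - u),      # how well a unit matches its target
    join_cost=lambda a, b: abs(a - b) * 0.1,  # how smoothly adjacent units join
)
print(cost, seq)  # 10.9 [105, 118, 112]
```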
Unit Selection synthesis
[Figure from Dan Jurafsky]
Text-To-Speech Demos
• ATT: http://www.research.att.com/~ttsweb/tts/demo.php
• Festvox: http://festvox.org/voicedemos.html
• IBM: http://www.research.ibm.com/tts/coredemo.shtml
• Nuance: http://212.8.184.250/tts/demo_login.jsp
• http://www.ivona.com/
• http://www.neospeech.com/
• http://www.cereproc.com/products/voices
• Roger Ebert gets his new voice! (YouTube)
Recording
• Digital recording: the process of converting speech waves into a
  computer-readable format is called digitization, or A/D conversion.
Sampling
• In order to transform sound into a digital format, you must sample the
  sound: while you are recording, the computer takes a snapshot of the
  sound level at small time intervals.
• The number of samples taken each second is called the sampling rate.
  The more samples are taken, the better the sound quality, but we also
  need more storage space for higher-quality sound.
• For speech recordings, a sampling rate of 10 kHz is enough in most
  cases. (Typical rates: 44100 Hz, 22050 Hz, 11025 Hz, 8000 Hz,
  5000 Hz.)
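A minimal Python sketch (not from the slides) of sampling as described above:
amplitude snapshots of a tone taken at a fixed rate. The 440 Hz tone and
8 kHz rate are just illustrative values.

```python
import math

def sample_tone(freq_hz, rate_hz, duration_s):
    """Return amplitude snapshots taken every 1/rate_hz seconds."""
    n_samples = int(rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n_samples)]

samples = sample_tone(freq_hz=440, rate_hz=8000, duration_s=1.0)
print(len(samples))  # 8000 snapshots for one second at 8 kHz
print(samples[:4])   # the first few amplitude values
```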
Sampling
• Nyquist-Shannon sampling theorem: when sampling a signal (e.g.,
  converting from an analog signal to digital), the sampling frequency
  must be greater than twice the highest frequency in the input signal in
  order to be able to reconstruct the original perfectly from the sampled
  version.
• Aliasing: if the sampling frequency is less than twice the highest
  frequency component, then frequencies in the original signal that are
  above half the sampling rate will be "aliased" and will appear in the
  resulting signal as lower frequencies.
• Anti-aliasing filter: typically a low-pass filter that is applied before
  sampling to ensure that no components with frequencies greater than
  half the sampling frequency remain.
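A small numeric check (not from the slides) of aliasing: a 7000 Hz tone
sampled at 8000 Hz yields exactly the same snapshots as a 1000 Hz tone,
because 7000 Hz is above the Nyquist frequency (4000 Hz) and folds down
to |7000 - 8000| = 1000 Hz.

```python
import math

rate = 8000
t = [i / rate for i in range(16)]
high = [math.sin(2 * math.pi * 7000 * ti) for ti in t]
low = [math.sin(2 * math.pi * 1000 * ti) for ti in t]

# sin(2pi*7000*i/8000) = sin(2pi*i - 2pi*1000*i/8000) = -sin(2pi*1000*i/8000),
# so the sampled 7000 Hz tone is indistinguishable from a (phase-inverted)
# 1000 Hz tone.
print(all(abs(h + l) < 1e-9 for h, l in zip(high, low)))  # True
```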
Audio file formats
• There are a number of different types of audio files.
• “.wav” files are commonly used for storing uncompressed sound,
  which means that they can be large: around 10 MB per minute of
  music.
• “.mp3” files use the "MPEG Layer-3" codec (compressor-decompressor).
  “.mp3” files are compressed to roughly one-tenth the size of an
  equivalent .wav file while maintaining good audio quality.
• “.aiff” is the standard audio file format used by Apple. It is like a .wav
  file for the Mac.
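A quick back-of-the-envelope check (not from the slides) of the "~10 MB per
minute" figure, assuming CD-quality audio: 44100 samples per second, 2 bytes
(16 bits) per sample, 2 channels.

```python
# Uncompressed .wav size for one minute of CD-quality stereo audio.
bytes_per_minute = 44100 * 2 * 2 * 60
print(bytes_per_minute / 1024 / 1024)  # ~10.1 MB per minute
```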
Praat: doing phonetics by computer
Speech recognition
• Goal: to convert an acoustic signal O into a word sequence W.
• Statistics-based approach: what is the most likely sentence, out of all
  sentences in the language L, given some acoustic input O?
• Treat the acoustic input O as a sequence of individual observations:
  • O = o1, o2, o3, …, ot
• Define a sentence as a sequence of words:
  • W = w1, w2, w3, …, wn
Speech recognition architecture
• Solution: search through all possible sentences and pick the one that
  is most probable given the waveform/observation:

  Ŵ = argmax_{W ∈ L} P(W | O)

  Applying Bayes’ rule:

  Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

  P(O) is the same for each W, so:

  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
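A toy Python version (not from the slides) of this decision rule: score each
candidate sentence W by P(O | W) * P(W) and pick the argmax. The candidate
sentences and probabilities are made up for illustration.

```python
candidates = {
    # W: (acoustic score P(O|W), language-model score P(W))
    "recognize speech":   (0.008, 0.0005),
    "wreck a nice beach": (0.010, 0.00002),
}
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "recognize speech": its higher P(W) outweighs the acoustics
```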
Speech recognizer components
• Acoustic modeling: describes the acoustic patterns of phones in the
  language.
  1. Feature extraction
  2. Hidden Markov Model
• Lexicon (pronouncing dictionary): describes the sequences of phones
  that make up the words in the language.
• Language modeling: describes the likelihood of various sequences of
  words being spoken in the language.
Acoustic modeling
• A vector of 39 features is extracted every 10 ms from a 20-25 ms
  window of speech.
• Each phone is represented as a Hidden Markov Model (HMM) that
  consists of three states: the beginning part (s1), the middle part (s2),
  and the end part (s3). Each state is represented by a Gaussian model
  over the 39 features.
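A minimal Python sketch (not from the slides) of a left-to-right, three-state
phone HMM. For readability it uses a single feature per frame instead of the
39 described above, and all parameters are invented for illustration.

```python
import math

# Transition probabilities: each state either loops or moves on.
trans = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s2": 0.6, "s3": 0.4},
    "s3": {"s3": 0.6, "end": 0.4},
}
# One Gaussian (mean, stdev) per state for the single feature.
gauss = {"s1": (0.0, 1.0), "s2": (3.0, 1.0), "s3": (1.0, 1.0)}

def gauss_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def forward(frames):
    """P(frames | phone HMM), summing over all state paths (forward algorithm)."""
    alpha = {"s1": gauss_pdf(frames[0], *gauss["s1"]), "s2": 0.0, "s3": 0.0}
    for x in frames[1:]:
        new = {}
        for s in ("s1", "s2", "s3"):
            total = sum(alpha[p] * trans[p].get(s, 0.0) for p in alpha)
            new[s] = total * gauss_pdf(x, *gauss[s])
        alpha = new
    return alpha["s3"] * trans["s3"]["end"]  # must finish in the final state

# One feature value per 10 ms frame, roughly tracing s1 -> s2 -> s3.
print(forward([0.1, 2.8, 3.1, 1.2]))
```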
Lexicon
• The CMU pronouncing dictionary: a pronunciation dictionary for
  American English that contains over 125,000 words and their phone
  transcriptions.
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict
• The CMU dictionary uses 39 phonemes (in ARPABET); word stress is
  labeled on vowels: 0 (no stress), 1 (primary stress), 2 (secondary
  stress).
  PHONETICS    F AH0 N EH1 T IH0 K S
  COFFEE       K AA1 F IY0
  COFFEE(2)    K AO1 F IY0
  RESEARCH     R IY0 S ER1 CH
  RESEARCH(2)  R IY1 S ER0 CH
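A minimal Python sketch (not from the slides) of a CMU-dictionary-style
lexicon lookup, using only the five entries shown above; real use would load
the full dictionary from the URL above.

```python
cmudict_lines = """\
PHONETICS  F AH0 N EH1 T IH0 K S
COFFEE  K AA1 F IY0
COFFEE(2)  K AO1 F IY0
RESEARCH  R IY0 S ER1 CH
RESEARCH(2)  R IY1 S ER0 CH
"""

lexicon = {}
for line in cmudict_lines.splitlines():
    word, phones = line.split(maxsplit=1)
    base = word.split("(")[0]                 # COFFEE(2) -> COFFEE
    lexicon.setdefault(base, []).append(phones.split())

print(lexicon["COFFEE"])
# [['K', 'AA1', 'F', 'IY0'], ['K', 'AO1', 'F', 'IY0']]  (two variants)
```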
Language Modeling
• We want to compute the probability of a word sequence,
  p(w1, w2, w3, …, wn).
• Using the chain rule, we have, for example:
  p(speech, recognition, is, very, fun) =
  p(speech) * p(recognition | speech) * p(is | speech, recognition) *
  p(very | speech, recognition, is) * p(fun | speech, recognition, is, very)
• Can we learn p(fun | speech, recognition, is, very) from data? No: we
  will never be able to get enough data to compute the probabilities of
  long sentences.
• Instead, we need to make some Markov assumptions (a small sketch
  follows below):
  • Zeroth order: p(fun | speech, recognition, is, very) = p(fun) - unigram
  • First order: p(fun | speech, recognition, is, very) = p(fun | very) - bigram
  • Second order: p(fun | speech, recognition, is, very) = p(fun | is, very) - trigram
  …
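A minimal Python sketch (not from the slides) of a bigram (first-order
Markov) language model, estimated by counting over a tiny made-up corpus:
p(w2 | w1) = count(w1 w2) / count(w1).

```python
from collections import Counter

corpus = [
    "speech recognition is very fun".split(),
    "speech recognition is very hard".split(),
    "linguistics is very fun".split(),
]
# Count word pairs, and how often each word appears as the left half of a pair.
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
contexts = Counter(w for sent in corpus for w in sent[:-1])

def p_bigram(w2, w1):
    return bigrams[(w1, w2)] / contexts[w1]

print(p_bigram("fun", "very"))  # 2/3: "very" is followed by "fun" 2 times out of 3
```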
State-of-the-art ASR performance
Challenges in ASR
• Robustness and adaptability – to changing conditions (different
  microphone, background noise, new speaker, different speaking rate,
  etc.).
• Language modelling – what is the role of linguistics in improving
  language models?
• Out-of-vocabulary (OOV) words – systems must have some method of
  detecting OOV words and dealing with them in a sensible way.
• Spontaneous speech – disfluencies (filled pauses, false starts,
  hesitations, ungrammatical constructions, etc.) remain a problem.
• Prosody – stress, intonation, and rhythm convey important information
  for word recognition and for identifying the user's intentions (e.g.,
  sarcasm, anger).
• Accent, dialect, and mixed language – non-native speech is a huge
  problem, especially where code-switching is commonplace.