Speech & Language Modeling
Cindy Burklow & Jay Hatcher
CS521 – March 30, 2006
Agenda
What is Speech Recognition?
Challenges of Speech Recognition
Expresso III Case Study
IBM Superhuman Speech Tech
Speech Synthesis
What is Speech Recognition?
One long rule book
Two approaches
Deductive framework
Search algorithms & math models
How does it work?
Phonemes
Hunting Speech
Phoneme Sequence
Phonemes Energy
Challenges of Speech Recognition
Users' own preferences
Limit Speech Range
Noise
People
Infinite Combinations
Software
Expresso III Project
Who?
What?
Why?
How?
Expresso III
How is it different?
Why try a new method?
Co-Articulation
Independencies
Duration
Linear Dynamic Model (LDM)
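The slides name the linear dynamic model but do not give its equations. As a rough sketch, assuming the standard scalar state-space form (a hidden state that evolves linearly with Gaussian noise and is observed through a second linear map with its own Gaussian error), the log likelihood of a feature sequence can be accumulated with a Kalman filter. The function name and every parameter value here are illustrative and are not taken from the Expresso III project.

```python
import math


def ldm_log_likelihood(observations, f, h, q, r, x0=0.0, p0=1.0):
    """Log likelihood of a 1-D observation sequence under a scalar LDM.

    State:       x[t] = f * x[t-1] + w,  w ~ N(0, q)
    Observation: y[t] = h * x[t]   + v,  v ~ N(0, r)
    x0 and p0 are the prior mean and variance of the first state.
    """
    x, p = x0, p0
    total = 0.0
    for y in observations:
        # Predicted observation and its variance (the "error model").
        s = h * p * h + r
        err = y - h * x
        total += -0.5 * (math.log(2 * math.pi * s) + err * err / s)
        # Kalman update of the state estimate with this observation.
        k = p * h / s
        x = x + k * err
        p = (1 - k * h) * p
        # Propagate the state estimate to the next frame.
        x = f * x
        p = f * p * f + q
    return total


# A smoothly decaying sequence fits this model better than an oscillating one.
smooth = [1.0, 0.9, 0.81, 0.73]
jumpy = [1.0, -1.0, 1.0, -1.0]
params = dict(f=0.9, h=1.0, q=0.01, r=0.01)
print(ldm_log_likelihood(smooth, **params) > ldm_log_likelihood(jumpy, **params))
```

In a classifier built this way, each phone would presumably get its own set of parameters, and a segment would be assigned to the phone whose model yields the highest log likelihood.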
Expresso III
Why Linear Dynamic Model (LDM)?
Expresso III’s Hypothesis
Testing Methods
Includes error models
Only linear models allowed
Series of tests (5 total)
Increase “phones” & training data
Switching & Iteration & Data classification
Generated histograms of log likelihood
Divide & Conquer Technique
Results
IBM Superhuman Speech Tech
ViaVoice 4.4
Comprehend languages
Products
Free-Form Command
Translate dynamically
MASTOR
Create “on-the-fly” subtitles on TV
Goal
TALES
“Get performance comparable to humans in the next five years.” -IBM Jan. 2006
Speak commands
PDAs, iPods, & DVRs
“Free-Form Command”
• Commands associated with objects
• Simplified Language
• Partnering with Specialized Hardware Manufacturing
• Finding niche markets
• Well-chosen Algorithms
IBM’s MASTOR
Multilingual Automatic Speech-to-Speech Translator
IBM’s TALES
• Server-based system
• Dynamically transcribes and translates any spoken words into English subtitles
• Requires long processing time
• Real-time translations are impossible
• 60%-70% accuracy rate
• High subscription fee for users
Expanding Speech Recognition Applications
PDAs to collect data
iPod: Email & RSS Read Aloud
Navigate Your DVR with Speech
Voice commands
Requires microphone
* TV remote
* Headset
Text to Speech Systems
Two major steps:
1. Convert the text into a pronounceable format
– Look for domain-specific sections like times, dates, numbers, addresses, and abbreviations
– Try to identify homographs and the contexts in which they occur
– Use some combination of dictionary and rule-based approaches as a guide to pronunciation
2. Synthesize speech from the phonetic representation using one of many possible approaches
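The first step can be illustrated with a small rule-based normalizer. This is only a sketch under simple assumptions: the abbreviation table is hypothetical, numbers are handled only up to 999, and a real front end would use a far larger lexicon and context checks.

```python
import re

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "St.": "Street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def time_to_words(match):
    """Spell out a clock time such as 3:30 or 9:05."""
    hour, minute = int(match.group(1)), int(match.group(2))
    if minute == 0:
        return number_to_words(hour) + " o'clock"
    prefix = "oh " if minute < 10 else ""
    return number_to_words(hour) + " " + prefix + number_to_words(minute)


def normalize(text):
    """Expand abbreviations, clock times, and short integers into words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Times are handled before plain numbers so "3:30" is not split in two.
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", time_to_words, text)
    text = re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    return text


print(normalize("Dr. Smith lives at 42 Elm St. and arrives at 3:30"))
```

Note that the table blindly maps "St." to "Street", although it could also mean "Saint"; resolving that kind of ambiguity is the homograph problem mentioned in the list above.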
Speech Synthesis
Continuum of Speech Synthesis methods
Waveform Synthesis
Formant Synthesis
Concatenative synthesis
HMM-based synthesis
Articulatory Synthesis
Diphone Synthesis
Hybrid Approaches
Recordings
Unit Selection
Speech Synthesis at CMU
• Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis
• Research primarily uses the Festival Speech Synthesis System, an open-source framework developed by the University of Edinburgh
Speech Synthesis at CMU
• Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.
Speech Synthesis at CMU
• Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.
• Example: This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.
Speech Synthesis at CMU
• Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis.
• This helps with pacing, homographs, and other situations where pronunciation differs depending on context.
• He wanted to go for a drive in.
• He wanted to go for a drive in the country.
• My cat who lives dangerously has nine lives.
• Henry V: Part I Act II Scene XI: Mr X is I believe, V I Lenin, and not Charles I.
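The last example sentence shows why context matters: the same Roman numeral is read as an ordinal after a monarch's name and as a cardinal after a section word. The sketch below uses two small hand-written word lists (both hypothetical) as a stand-in for the statistical analysis the slide describes. It deliberately leaves ambiguous tokens such as "Mr X", "I believe", and "V I Lenin" untouched, since those need a real trained model.

```python
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}
CARDINAL = ["", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve"]
ORDINAL = ["", "first", "second", "third", "fourth", "fifth", "sixth",
           "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth"]

# Hypothetical context lists; a trained model would replace these.
SECTION_WORDS = {"Part", "Act", "Scene", "Chapter"}
ROYAL_NAMES = {"Henry", "Charles", "George", "Louis"}

PUNCTUATION = ".,:;"


def expand_roman(text):
    """Expand Roman numerals whose reading is fixed by the preceding word."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(PUNCTUATION)
        tail = tok[len(core):]
        prev = tokens[i - 1].rstrip(PUNCTUATION) if i > 0 else ""
        if core in ROMAN and prev in SECTION_WORDS:
            out.append(CARDINAL[ROMAN[core]] + tail)
        elif core in ROMAN and prev in ROYAL_NAMES:
            out.append("the " + ORDINAL[ROMAN[core]] + tail)
        else:
            out.append(tok)  # leave anything ambiguous as written
    return " ".join(out)


print(expand_roman("Henry V: Part I Act II Scene XI"))
```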
Speech Synthesis at CMU
• Unit selection can be used instead of diphones to improve how natural the voice sounds by using larger recorded units (e.g. whole phones or syllables) and not just diphones (sound transitions)
• The following examples are based on the
same speaker:
• Diphones
• Unit Selection
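To make the term concrete: a diphone spans the transition from the middle of one phone to the middle of the next, so a phone sequence maps onto a sequence of adjacent pairs. A minimal sketch, assuming the utterance is padded with a silence symbol on both sides (the label "pau" and the phone symbols in the example are placeholders):

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the diphone names needed to synthesize a phone sequence."""
    padded = [silence] + list(phones) + [silence]
    # Each diphone covers the transition between two neighbouring phones.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(phones_to_diphones(["k", "ae", "t"]))
```

A diphone voice only has to store one recording per phone pair, which is why its inventory stays small compared with a unit selection database.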
Speech Synthesis at CMU
• With care, unit selection can produce very convincing natural sound.
– Original Sound
– Synthesis from natural phones, pitch, and duration data
• However, it is difficult to generalize Unit Selection for a variety of situations, and if it does poorly it sounds much worse than diphones.
– Example
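The slides do not say how units are chosen. One common formulation, sketched here as an assumption and not as Festival's actual implementation, treats selection as a search for the candidate sequence that minimizes a target cost (how well a unit matches what is wanted) plus a join cost (how badly two neighbouring units fit together). The cost functions and pitch values in the example are made up for illustration.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimizing total target + join cost.

    candidates[i] lists the database units available for targets[i].
    Returns (chosen_units, total_cost), found by dynamic programming.
    """
    # best[i][j] = (cost of the best path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(targets[i], c), k_best))
        best.append(row)
    # Trace the cheapest path back from the final position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example: units are described only by their pitch in Hz.
targets = [100, 100]
candidates = [[100, 90], [80, 92]]
path, total = select_units(
    targets, candidates,
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: 2 * abs(a - b),
)
print(path, total)
```

In this toy case the search prefers the 90 Hz unit for the first slot even though a perfect 100 Hz match exists, because it joins more smoothly with what follows. When the database has no good candidates, every path carries a high cost, which is one way to see why poor unit selection output can sound worse than diphones.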
Speech Synthesis at CMU
• Most commercial TTS packages use Unit Selection with medium to large databases of samples.
– Example: NeoSpeech VoiceText
• These produce higher quality sound at the expense of memory and processor power.
• CMU’s Festival implementation has focused more on Diphone Synthesis to reduce memory footprint and allow greater control of the synthesizer.
Speech Synthesis at CMU
• Diphone Synthesis can control inflection, pitch, and other factors dynamically.
– A short example with no prosody.
– A short example with declination.
– A short example with accents on stressed syllables and end tones.
– A short example with statistically trained intonation and duration models.
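Declination, the second example above, is the gradual fall in pitch over the course of an utterance. A minimal sketch of such a contour, assuming one pitch target per syllable and a straight-line fall (the start and end frequencies are arbitrary illustrative values):

```python
def declination_contour(n_syllables, start_hz=130.0, end_hz=90.0):
    """Return one pitch target per syllable, falling linearly across the utterance."""
    if n_syllables == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_syllables - 1)
    return [start_hz + i * step for i in range(n_syllables)]


print(declination_contour(5))
```

Accents on stressed syllables and end tones would then be added as local rises and falls on top of this baseline.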
Conclusion
• CMU’s research using Festival has led to useful technology for embedded systems and servers. The Diphone Synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs. The model is still being worked on and may one day reach a natural level of quality.
References and Useful Links
What is speech recognition & Challenges?
• http://www.extremetech.com/article2/0,1697,1826664,00.asp
• http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
• http://en.wikipedia.org/wiki/Speech_recognition
• http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html
Expresso III Case Study
• http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
• http://www.cstr.ed.ac.uk/publications/users/s0129866.html
IBM Superhuman Speech Tech
• http://www.ibm.com
• http://www.pcmag.com/article2/0,1895,1915071,00.asp
References and Useful Links
• The Festival Speech Synthesis System
• NeoSpeech VoiceText Demo
• AT&T’s TTS FAQ
• Reviews of Popular Speech Synthesizers
• Speech Engine Listings with Samples
• BrightSpeech.com
• Festival at CMU
• FestVox