Summerschool Prague 2001 Jan Odijk
Acoustic Databases
Jan Odijk
ELSNET Summer School, Prague, 2001
Acknowledgements
Part of the slides have been borrowed from or are based on work by
• Bart D’Hoore
• Hugo van Hamme
• Robrecht Comeyne
• Dirk van Compernolle
• Bert van Coile
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Linguistic Resources (LRs)
Linguistic Resources are sets of language data in machine-readable
form that can be used for developing, improving
or evaluating language and speech technologies.
Some language and speech technologies
• Text-To-Speech (TTS)
• Automatic Speech Recognition (ASR)
• Dictation
• Speaker Verification/Recognition
• Spoken Dialogue
• Audio Mining
• Machine Translation
• Intelligent Content Management
• …
Linguistic Resources (LRs)
Major Types
Electronic Text Corpora
• Newspapers, magazines, etc.
• Usenet texts, e-mail, correspondence
• Etc.
Lexical Resources
• Monolingual lexicons
• Translation lexicons
• Thesauri
• …
Acoustic Resources
• Annotated Speech Recordings
• Annotated Recordings of other acoustic signals
  • Coughing, throat clearing, breathing, …
  • Door slamming, screeching tires (of a car), …
Types of Linguistic Resources
Acoustic Resources
Acoustic Databases (ADBs)
• Controlled recordings of human speech or other acoustic signals
• Enriched with annotations
• Recorded digitally
• Representative of the targeted application environment and medium
• Balanced for phonemes/phoneme combinations
• Speaker parameters, recording quality, environment/medium documented
Types of Linguistic Resources
Acoustic Resources
Annotated unstructured recordings
• Broadcast material
• Recorded conversations/monologues/speeches, etc.
• Dictated material
• Enriched with annotations
Types of Linguistic Resources
Acoustic Resources
In-service data
• Recorded sessions of interaction between humans and a running application
• Usually obtained by logging a customer system
• Enriched with annotations
• Used for tuning models, grammars, etc. to a specific application
Types of Linguistic Resources
Acoustic Resources
Environments
• “Quiet”
  • Studio
  • Quiet office
  • Normal office
• Noisy
  • Public place (street, hotel lobby, station, etc.)
  • Car (running engine at 0 km/h, city, highway)
  • Industrial environment
Types of Linguistic Resources
Acoustic Resources
Media
• HQ close-talk microphone
• Desktop microphones
• Telephone
  • analog or digital
  • fixed line or mobile
• Wide-band microphones
• Array microphones
• PC/PDA etc. low-quality microphones
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Acoustic Resources
Use
(for speech synthesis modules in TTS systems)
(as acoustic reference material for pronunciation lexicons)
Mainly for speech recognition
Training and test material for research into new recognition
engines and engine features
Training and test material for development of acoustic models
Tuning of acoustic models for specific applications
What is speech recognition?
ASR: Automatic speech recognition
Automatic speech recognition is the process by which a
computer maps an acoustic speech signal to text.
Automatic speech understanding is the process by which a
computer maps an acoustic speech signal to some form of
abstract meaning of the speech.
Speaker recognition is the process by which a computer
recognizes the identity of the speaker based on speech samples.
Speaker verification is the process by which a computer checks
the claimed identity of the speaker based on speech samples.
Elements of a Recognizer
Acoustic
Model
Speech
Data
Feature
Extraction
Pattern
Matching
Language
Model
Action
Post Processing
Natural
Language
Understanding
Display
text
Meaning
Elements of a Recognizer
Acoustic
Model
Speech
Data
Feature
Extraction
Pattern
Matching
Language
Model
Action
Post Processing
Natural
Language
Understanding
Display
text
Meaning
Feature Extraction
Turning the speech signal into something more manageable
• Do an analysis once every 10 ms
• Data compression: 220 bytes => 50 bytes => 4 bytes
Sampling of the signal: transforming it into digital form
Extracting relevant parameters from the signal
• Spectral information, energy, pitch, …
Eliminating undesirable elements (normalization)
• Noise
• Channel properties
• Speaker properties (gender)
Feature Extraction: Vectors
The signal is chopped into small pieces (frames), typically 30 ms long
Spectral analysis of a speech frame produces a vector
representing the signal properties.
=> result = stream of vectors
[Figure: a speech frame waveform and the resulting feature vector, e.g. (10.3, 1.2, -0.9, …, 0.2)]
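The framing and analysis described above can be sketched as follows. This is a minimal illustration, assuming a NumPy array of samples and simple log-spectrum coefficients; a production front-end uses richer features (e.g. cepstra plus energy and pitch):

```python
import numpy as np

def extract_features(signal, rate=8000, frame_ms=30, hop_ms=10, n_coeffs=12):
    """Chop a signal into overlapping frames and return one
    spectral feature vector per frame (a stream of vectors)."""
    frame_len = int(rate * frame_ms / 1000)   # e.g. 240 samples at 8 kHz
    hop_len = int(rate * hop_ms / 1000)       # one analysis every 10 ms
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # keep a few log-magnitude coefficients as the feature vector
        vectors.append(np.log(spectrum + 1e-10)[:n_coeffs])
    return np.array(vectors)
```

On one second of 8 kHz audio this yields 98 frames of 12 coefficients each, i.e. a stream of 12-dimensional vectors at a 10 ms rate.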
Elements of a Recognizer
Acoustic
Model
Speech
Data
Feature
Extraction
Pattern
Matching
Language
Model
Action
Post Processing
Natural
Language
Understanding
Display
text
Meaning
Acoustic Model
Split utterance into basic units, e.g. phonemes
The acoustic model describes the typical spectral shape (or
typical vectors) for each unit
For each incoming speech segment, the acoustic model will tell
us how well (or how badly) it matches each phoneme
Must cope with pronunciation variability
• Utterances of the same word by the same speaker are never identical
• Differences between speakers
• Identical phonemes sound different in different words
=> statistical techniques: models created from a lot of examples
[Figure: the utterance “friendly computers” segmented into states S1…S13, aligned with the phoneme sequence f-r-ie-n-d-l-y c-o-m-p-u-t-e-r-s]
Acoustic Model: Units
Phoneme: words share units that model the same sound
[Figure: “stop” (S-T-O-P) and “start” (S-T-A-R-T) share the phoneme units S and T]
Word: a series of units specific to the word
[Figure: “stop” modelled by states S1 S2 S3 S4; “start” by S6 S7 S8 S9 S10]
Acoustic Model: Units
Context-dependent phoneme
[Figure: “stop” as S|,|T  T|S|O  O|T|P  P|O|, : each phoneme modelled in its left and right context]
Diphone
[Figure: “stop” as ,S  ST  TO  OP  P, : units spanning the phoneme transitions]
Other sub-word units: consonant clusters
[Figure: “stop” as ST  O  P]
Acoustic Model: Units
Phonemes
Phonemes in context: spectral properties depend on
previous and following phoneme
Diphones
Sub-words: syllables, consonant clusters
Words
Multi words: example: “it is”, “going to”
Combinations of all of the above
Elements of a Recognizer
Acoustic
Model
Speech
Data
Feature
Extraction
Pattern
Matching
Language
Model
Action
Post Processing
Natural
Language
Understanding
Display
text
Meaning
Pattern matching
Acoustic Model: returns a score for each incoming feature
vector indicating how well the feature corresponds to the
model.
= Local score
Calculate the score of a word, indicating how well the word
matches the string of incoming features (Viterbi)
Search algorithm: looks for the best-scoring word or word
sequence
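The word-score computation can be sketched as a standard Viterbi dynamic program. This is a minimal illustration: the names `local_scores` and `trans` are assumptions, and a real decoder also tracks the best path and prunes the search space:

```python
import numpy as np

def viterbi_score(local_scores, trans):
    """Best-path score through a left-to-right word model.

    local_scores[t][s]: log score of frame t against state s
                        (from the acoustic model).
    trans[s1][s2]: log probability of moving from state s1 to s2.
    Returns the best total log score for the word.
    """
    T, S = local_scores.shape
    best = np.full(S, -np.inf)
    best[0] = local_scores[0, 0]          # must start in the first state
    for t in range(1, T):
        prev = best[:, None] + trans      # all transitions s1 -> s2
        best = prev.max(axis=0) + local_scores[t]
    return best[-1]                       # must end in the last state
```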
Elements of a Recognizer
Acoustic
Model
Speech
Data
Feature
Extraction
Pattern
Matching
Language
Model
Action
Post Processing
Natural
Language
Understanding
Display
text
Meaning
Language Model
Describes how words are connected to form a sentence
Limits the possible word sequences
Reduces the number of recognition errors by eliminating
unlikely sequences
Increases the speed of the recognizer => real-time
implementations
Language Model
Two major types
• Grammar based
  !start <sentence>;
  <sentence>: <yes> | <no>;
  <yes>: yes | yep | yes please ;
  <no>: no | no thanks | no thank you ;
• Statistical
  • Probability of single words, 2/3-word sequences
  • Derived from frequencies in a large corpus
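Deriving a statistical model from corpus frequencies can be sketched as follows. This is a toy bigram estimator; real language models add smoothing for unseen word pairs:

```python
from collections import Counter

def bigram_model(corpus):
    """Estimate P(w2 | w1) from word-pair frequencies in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words[:-1])              # contexts
        bigrams.update(zip(words, words[1:]))    # word pairs
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

probs = bigram_model(["call the office", "call the car", "stop the car"])
# e.g. P(the | call) = 1.0, P(car | the) = 2/3
```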
Active Vocabulary
Lists words that can be recognized by the acoustic model
and that are allowed to occur given the language model
Each word is associated with a phonetic transcription
• Enumerated, and/or
• Generated by a Grapheme-to-Phoneme (G2P) module
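Combining enumerated transcriptions with a G2P fallback can be sketched as below. All names here are illustrative, and the lambda merely stands in for a real G2P module:

```python
def active_vocabulary(words, lexicon, g2p):
    """Attach a phonetic transcription to each word: take it from the
    enumerated lexicon when present, otherwise generate it with G2P."""
    return {w: lexicon.get(w, g2p(w)) for w in words}

vocab = active_vocabulary(
    ["yes", "no"],
    {"yes": "j E s"},            # enumerated transcription
    lambda w: " ".join(w))       # stand-in for a real G2P module
# vocab -> {"yes": "j E s", "no": "n o"}
```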
Post Processing
Re-ordering of the N-best list using other criteria: e.g. account
numbers, telephone numbers
Spelling: name search from a list of known names
Applying NLP techniques that fall outside the scope of the
statistical language model
• E.g. “three dollars fifty cents” => “$3.50”
• “doctor Jones” => “Dr. Jones”
• Etc.
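A rule-based normalization step of this kind might look as follows. This is a toy rule for a single pattern; `normalize_money` and its number table are illustrative, not the actual post-processor:

```python
import re

def normalize_money(text):
    """Rewrite 'three dollars fifty cents' style phrases as '$3.50'."""
    # A toy number table; a real post-processor uses a full grammar.
    numbers = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9,
               "ten": 10, "twenty": 20, "fifty": 50}
    m = re.match(r"(\w+) dollars (\w+) cents", text)
    if m and m.group(1) in numbers and m.group(2) in numbers:
        return f"${numbers[m.group(1)]}.{numbers[m.group(2)]:02d}"
    return text

print(normalize_money("three dollars fifty cents"))  # $3.50
```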
Training of Acoustic Models
Annotated
Speech
Database
Pronunciation
Dictionary
Training Program
Acoustic
Model
Training of Acoustic Models
Database design
• Coverage of units: word, phoneme, context-dependent unit
• Coverage of the population (region, dialect, age, …)
• Coverage of environments (car, telephone, office, …)
Database collection and validation
• Checking recording quality
• Annotation: describing what people said, extra-speech sounds
Dictionaries
• Phonetic transcription of words
• Multiple transcriptions needed
• G2P: automatic transcription
Feature vectors
[Figure: a stream of feature vectors, e.g. (10.3, 1.2, -0.9, …, 0.2), (8.1, -0.5, 1.3, …, 0.2), …, (2.1, -0.2, 1.9, …, -0.3)]
Example: discrete models
A collection of prototypes is constructed (100 to 250)
Each vector is replaced by its nearest prototype
[Figure: scatter plot of feature vectors with the prototypes marked among them]
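Replacing each vector by its nearest prototype is a standard vector-quantization step, sketched here under the assumption of Euclidean distance and NumPy arrays:

```python
import numpy as np

def quantize(vectors, prototypes):
    """Replace each feature vector by the index of its nearest prototype."""
    # distances of every vector to every prototype
    d = np.linalg.norm(vectors[:, None, :] - prototypes[None, :, :], axis=2)
    return d.argmin(axis=1)

prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
codes = quantize(np.array([[0.2, -0.1], [4.8, 5.3]]), prototypes)
# codes -> [0, 1]
```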
Feature vectors
[Figure: each feature vector in the stream is replaced by the index of its nearest prototype]
Phoneme assignment (one symbol per frame):
ffrrEEEnnnnddllIIII,,,kkOOOmmpjjuuuuttt$$$rrzz
Prototype indices (one per frame):
2276998900023448889211127780128897791237787622
[Table: counts of prototype-phoneme co-occurrences (prototypes 0-9 against the phonemes f, r, E, n, d, l, I, k, O, m, p, j, u, t, $, z, ,) accumulated over the segmented frames, used to estimate the discrete models]
Training of Acoustic Models
For all utterances in the database:
Make a phonetic transcription of the sentence
Use the models to segment the utterance file:
assign a phoneme to each speech frame
Collect statistical information:
count prototype-phoneme occurrences
Create New Models
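The counting step above amounts to estimating discrete emission probabilities. A minimal sketch, assuming the segmentation is given as frame-level (phoneme, prototype) pairs:

```python
from collections import Counter, defaultdict

def train_discrete_model(segmentations):
    """Estimate P(prototype | phoneme) from aligned frames.

    segmentations: one (phoneme, prototype) pair per frame, as produced
    by segmenting the utterances with the current models.
    """
    counts = defaultdict(Counter)
    for phoneme, proto in segmentations:
        counts[phoneme][proto] += 1
    # normalize the counts per phoneme into probabilities
    return {ph: {p: n / sum(c.values()) for p, n in c.items()}
            for ph, c in counts.items()}

model = train_discrete_model([("f", 2), ("f", 2), ("r", 7), ("r", 6)])
# model["f"][2] == 1.0, model["r"][7] == 0.5
```

Iterating segmentation and re-estimation with the new models is what gradually improves the alignment and the model together.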
Key Element in ASR
ASR is based on learning from observations
• A huge amount of spoken data is needed for making acoustic models
• A huge amount of text data is needed for making language models
=> Lots of statistics, few rules
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Contents of an ADB
Utterances of different utterance types
Utterance types suited to the intended
application domain
Text balanced for phoneme and/or diphone
distribution
All enriched with annotations
Contents of an ADB
Spontaneous v. Read Utterances
A spontaneous utterance is a response to a question or a
request
“In which city do you live?”
“Please spell a letter heading to your secretary”
“Is English your mother tongue?”
“Make a hotel reservation”
A read utterance is an utterance read from a presentation
text
“London”
“Dear John”
“Yes”
“Please book me a room for 2 persons with bath. We will arrive ….”
Contents of an ADB
Isolated Phonetically Rich Word
Apple Tree, Lobster
Isolated Digit
5
Isolated Alphabet
B
Isolated number (natural number)
4256
Contents of an ADB
Continuous Digits
911
Continuous Alphabet
YMCA
Commands
Stop, left, print, call, next
Contents of an ADB
Connected Digits
Telephone Numbers
057/228888
Credit Card Numbers
3741 959289 310001
Pin-codes
8978
Social Security Number
560228 561 80
Other identification numbers, e.g. sheet id
012589225712
Contents of an ADB
Time and Date Expressions
Time (“analog”, word style)
A quarter past two
Time (“digital”)
14:15
2:15PM
Date (“analog”, word style, absolute)
Friday, June 25th, 1999
Christmas Eve, Easter
Date (“digital”, absolute)
25/06/99
Date (“analog”, word style, relative)
Tomorrow, next week, in one month
Contents of an ADB
Money amounts
$327.67
£148.95
Isolated Phonetically Rich Sentences
A cold supper was ordered and a bottle of port
Isolated Command Sentences
Insert this name in the list
Names
Microsoft, New York, Jonathan
Syllables
Hi-ta-chi
Contents of an ADB
Continuous Phonetically Rich Sentences
Once upon a time, in a land far from here, lived a little princess. She
was the most beautiful girl…
Continuous Command Sentences
Select the first line. Make it bold and move it to the bottom of the
text…
Continuous Spontaneous speech
<Make a reservation in a hotel>
Contents of an ADB
Contents of SpeechDat-II ADB
For each speaker/session
Approx. 40 utterances
Duration approx. 10 minutes
Mixture of read and spontaneous utterances
Mixture of
Phonetically rich sentences
Application specific words
Utterance types that will often occur in any application
1000-5000 speakers/sessions
Contents of an ADB
Contents of SpeechDat-II ADB
1 isolated digit
4-digit id (sheet number)
3 connected digits (~10-digit telephone number)
12-digit credit card number
3 natural numbers
2 money amounts: 1 large, 1 small
3 spelled words
1 time of day (spontaneous)
Contents of an ADB
Contents of SpeechDat-II ADB
1 time phrase (read, word style)
1 date (spontaneous, e.g. person’s birthday)
2 dates (read, word style)
3 yes/no questions
1 city of call/birth
6 common application words out of 50
3 application word phrases
9 sentences (read)
Contents of an ADB
Contents of SpeechDat-Car ADB
For each session
Approx. 120-130 utterances (depending on session)
Duration 2-3 hours
Mixture of read and spontaneous utterances
Mixture of
Phonetically rich sentences
Application specific words
Utterance types that will often occur in any application
600 sessions with min. 300 speakers
In 2 out of 7 conditions
Standing still/ Low speed/ high speed
Different road conditions/ surrounding noise
Audio equipment on/off
Contents of an ADB
Contents of SpeechDat-Car ADB
Digits and Digit Strings
1 sequence of 10 digits
1 sheet number (4+ digit sequence)
1 spontaneous telephone number
1 credit card number (16 digits)
1 PIN code (6 digits)
4 isolated digits
Dates
1 spontaneous date (e.g. birthday)
1 prompted date, word style
1 relative and general date expression
Contents of an ADB
Contents of SpeechDat-Car ADB
Names
1 spontaneous name (e.g. speaker’s first name)
1 city of growing up (spontaneous)
2 most frequent cities
2 company / agency / street names
1 person name (first name or surname)
Spellings
1 spontaneous spelled name (e.g. speaker’s first name)
1 spelling of city name
4 real words or names
1 artificial name (for coverage)
Contents of an ADB
Contents of SpeechDat-Car ADB
Money Amounts/Natural Numbers
1 money amount
1 natural number
Times
1 time of day (spontaneous)
1 time phrase (word style)
Phonetically Rich words
4 phonetically rich words
Contents of an ADB
Contents of SpeechDat-Car ADB
Application words
13 Mobile Phone Application words
22 IVR functions keywords
32 car products keywords
2 voice activation keywords
2 language dependent keywords
Sentences
2 phrases using an application word
9 phonetically rich sentences
10 prompts for spontaneous speech
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Phases in Acoustic Resource
Creation
Design
Creation of Script
Recruitment & Recording
Annotation and Validation
Lexicon
Quality control
Production
DESIGN
Language study:
• Phoneme set
• Dialects
Scripting: utterance definition and distribution over speakers
Speaker typology: distribution definition
• Sex/gender, age, dialects, educational level
Recording: specification of procedure and platform
Validation: specification of procedure and quality standard
Creation of Script
Prompts/Text/Transcription
A prompt refers to the way an utterance is presented to the
speaker. This can be done on the desktop, on paper or with a
play back file (telephony).
The (presentation) text represents the utterance as it
should be pronounced by the speaker. It is normally
presented according to the spelling conventions of the target
language.
The transcription is the utterance as it has been
pronounced by the speaker.
EXAMPLE: The pronunciation of a digit string
• PROMPT: “Please read the number on top of your form”
• TEXT: 578124
• TRANSCRIPTION: five seven eight one two four
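The mapping from presentation text to the expected transcription can be sketched for the digit-string case. This is a toy expander; real annotation tools also handle numbers, dates and abbreviations:

```python
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine"}

def expand_digits(text):
    """Expand a digit-string presentation text into the word sequence
    a speaker would be expected to read aloud."""
    return " ".join(DIGIT_WORDS[d] for d in text if d in DIGIT_WORDS)

print(expand_digits("578124"))  # five seven eight one two four
```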
Creation of Script
Collect and clean text corpora
Split cleaned text into a sequence of sentences
Remove ungrammatical and overly long sentences
Remove sentences containing offensive language
Remove (certain) ambiguities in pronunciation
numbers, dates, abbreviations, etc.
Apply phonetic balancing tools to obtain phonetically rich text
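One common approach for such a balancing tool, assumed here for illustration, is greedy selection: repeatedly pick the sentence that adds the most uncovered phonemes. The `phonemize` argument stands in for a G2P module:

```python
def greedy_balance(sentences, phonemize, target):
    """Greedily pick sentences until the target phoneme set is covered.
    phonemize(s) must return the set of phonemes in sentence s."""
    chosen, covered = [], set()
    remaining = list(sentences)
    while covered < target and remaining:
        # the sentence contributing the most not-yet-covered phonemes
        best = max(remaining, key=lambda s: len(phonemize(s) - covered))
        if not (phonemize(best) - covered):
            break                     # no sentence adds new phonemes
        chosen.append(best)
        covered |= phonemize(best)
        remaining.remove(best)
    return chosen

picked = greedy_balance(["a b c", "a b", "c d"],
                        lambda s: set(s.split()), {"a", "b", "c", "d"})
# picked -> ["a b c", "c d"]
```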
Creation of Script
Collect and/or create other utterance types
• Telephone numbers, amounts, credit card numbers, etc.
Create prompts
• Prerecorded messages to the speaker
• For unmonitored recording without access to a screen (telephony)
Put all of these in resource files
Creation of Script
Script File
Configuration:
• Acquisition board, coding type
• Sampling rate, number of channels
Information items
• Speaker id, sheet id
• Gender, age, region of birth, region of youth, of living, etc., and their possible values
• Recording environment/conditions
Sentence definitions
• Specifies the order and types of utterances in one session
Creation of Script
Resource files => utterance sheets
Generate a letter with instructions and the list of utterances for
each speaker (esp. telephony)
Creation of Script
Tools
Script Editor
• For creating/modifying scripts
• For creating utterance sheet files (from resource files)
• For generating letters to speakers
Digit String Generator
• Natural numbers
• Bank accounts
• Credit card numbers
• Phone numbers
• PIN codes
Creation of Script
Test the script
• By making one or more recording sessions
• Also tests the recording set-up
• Also provides an indication of the average session duration
RECRUITMENT
Contact potential speakers according to the typology
• Acquaintances, colleagues
• Advertisements
• Employees/students of cooperating organizations (companies, universities)
• Possibly with the help of marketing agencies
Explain
• Purpose and context
• What the speaker is supposed to do
• How much time it will take
• Reimbursement for the speaker (time spent, travel costs)
Make concrete arrangements with the speakers
RECORDING
Locations: set up recording equipment in an environment fitting
the environment definition
Set up the recording platform and test it
Welcome and instruct the speaker
Interview: log speaker typology & recording conditions
Make recordings and follow up on quality
Deal with administrative matters
• Agreement on ownership of the recordings
• Reimbursement
RECORDING TOOL
VALIDATION and ANNOTATION
After recording, the signal is NEVER touched
• Only enriched with annotations
Check (and correct) the relation between text & speech
• The orthographic transcription must represent what the speaker said
• Tool to expand abbreviations, numbers, digit sequences
Segmentation
• Check (and correct) begin- and end-of-speech markers
• (Mostly for TTS) Mark the beginning and end of phonemes
VALIDATION and ANNOTATION
Assign a quality label
• Very good overall quality … very bad overall quality
Annotations for extra events
• Speaker sounds (coughing, breathing, swallowing, …)
• Mispronunciations, truncations
• Sounds from other sources (other speaker, music, radio, …)
• Continuous background noise (wind, rain, …)
• Filled pauses (uh, um, er, ah, hmm, …)
• Telephone distortions
Validation Tool
Semi-Automatic Validation
Validation can be partially automated
For certain types of databases
• 70-75% reliably validated automatically
• 25-30% require a manual check
Using ASR systems
Research into automating this further is ongoing
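A crude sketch of such an automatic check, assumed purely for illustration: compare the recognizer's hypothesis with the prompted text word by word. A real validator would use a proper alignment (word error rate) and ASR confidence scores:

```python
def needs_manual_check(prompt, hypothesis, threshold=0.8):
    """Flag a recording for manual validation when the recognizer's
    hypothesis matches the prompted text too poorly (word level)."""
    ref, hyp = prompt.lower().split(), hypothesis.lower().split()
    # position-by-position match; a real system aligns ref and hyp
    matches = sum(r == h for r, h in zip(ref, hyp))
    score = matches / max(len(ref), len(hyp), 1)
    return score < threshold

print(needs_manual_check("five seven eight", "five seven eight"))  # False
print(needs_manual_check("five seven eight", "five nine"))         # True
```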
LEXICON
One central “mother lexicon” per language
• To reduce duplication of effort
• To maintain consistency
A request is compared with the mother database
• Entries not found are imported into the mother database
• Entries not found are turned into a job
• The job is assigned to linguists
LEXICON
After finishing the job
• The requested entries and properties are exported
• Turned into the required format
• Delivered to the requestor
Additions/modifications due to this request are now
available for other requests
LEXICON: Tools
Phoned
• Lexical database plus user interface
• (Currently in Access but switching to SQL Server)
• Reuse of G2P and synthesis modules
PhonedAdmin
• Import and export of data from the mother database
• Comparison with the existing mother database
• Definition of users and jobs
• Assignment of jobs to users
QUALITY CONTROL
Typical Circumstances
Database project is ongoing
Often at a remote location
Multiple persons (for recording and validation)
Many questions, problems and unclear points arise constantly
They require answers from specialists
Danger of errors and inconsistencies
Within the work of a single person
Between different persons
Constant monitoring
Systematic and regular quality checks required
Systematic and regular feedback required
During the whole project
From the earliest moment possible
Documentation, incl. spot-check report
QUALITY CONTROL
Tools
• ADB Scanner: checks consistency of the database
  • Standard structure, all files available
• ADB Statistics
  • Statistics on information items (sex/gender, age, dialect, quality, etc.) and utterance types
• ADB Report Tool
  • For creating parts of the documentation
• And others
PRODUCTION
Huge amount of data!
Multiple copies needed
Special fast CD-replicator equipment
Special cupboards for storing the CDs
Description in catalogue
Distribution
Conversion Tools (format converter, down sampling,
demultiplexing)
DAR Resource Description
DAR Resource Description:
Statistics
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
General
More data!
The right data!
High Quality data
In-service Data
ASAP
SpeechDat Family
Consortium of industrial and university partners
Often EU projects
One type of database is defined
Each partner makes one database according to spec
Each database is validated by external organization (SPEX,
Nijmegen, the Netherlands)
After approval databases are exchanged among the partners
At most 1-1.5 years later, the data are offered for distribution by ELRA
http://www.speechdat.org/
Overview of major projects
SpeechDat (M)
SpeechDat-II
SpeechDat-E
SpeechDat-Car
SPEECON
SALA I
SALA II
SpeechDat (M)
EU-funded
production, standardization, evaluation and
dissemination of Spoken Language Resources
8 fixed telephone network databases, 1000
speakers each; 1 mobile telephone network
database, 300 speakers
Period: 1994-1996
SpeechDat (M)
Partners
Siemens
UPC
Philips
IDIAP
Vocalis
INESC
CSELT
GEC MSIS
SpeechDat (M)
Languages
German
Spanish
French
Portuguese
Danish
Swiss French
Italian
SpeechDat-II
EU-funded
Creation of Telephony Databases
25 fixed and mobile telephone network
databases, 500-5000 speakers each; 3 speaker
verification databases
Period: 1996-1998
SpeechDat-II
Partners
Aalborg
University
Auditex
British Telecom
CSELT
DMI
ELRA
GEC
GPT
IDIAP
INESC
SPEX
Knowledge S.A.
Swiss Telecom
KTH
Telenor
Lernout &
Hauspie
Univ. of Maribor
Matra Nortel
Philips
Portugal Telecom
Siemens
Univ. of Munich
Univ. of Patras
UPC
Vocalis
SpeechDat-II
Languages
Danish
Dutch
Finnish
Finnish Swedish
Flemish
French
Belgian French
Luxemburg French
Swiss French
German
Luxemburg German
Swiss German
Greek
Italian
Norwegian
Portuguese
Slovenian
Spanish
Swedish
British English
Welsh
SpeechDat-E
EU-funded
Eastern European Speech Databases for Creation of
Voice Driven Teleservices
Speech databases for fixed telephone networks
suited for typical present-day teleservices plus
phonetically rich set of material for vocabulary
independent ASR
1000 – 2500 speakers
Period: 1999-2001
SpeechDat-E
Partners
Auditex
Lernout & Hauspie
Philips Speech
Processing
Siemens
ELRA
SPEX
Brno University of
Technology
Prague Technical
University
Budapest University of
Technology
Wroclaw University of
Technology
Slovak Academy of
Sciences
SpeechDat-E
Languages
Russian (2500)
Czech
Slovak
Hungarian
Polish
SpeechDat-Car
EU-funded
9 in-vehicle and mobile telephone network databases
300 speakers, each in 2 out of 7 conditions (600 recording
sessions)
5 simultaneous channels
Period: Apr 1998 - Oct 2000
SpeechDat-Car
Partners
Aalborg University
Nokia
Alcatel
Renault
Robert Bosch GmbH
SEAT
DMI
SPEX
ELRA
University of Munich
Knowledge S.A.
UPC
Lernout & Hauspie
L&H France (formerly
Matra Nortel)
Vocalis
Volkswagen
SpeechDat-Car
Languages
Danish
German
British English
Greek
Finnish
Italian
Flemish/Dutch
Spanish
French
American English
SPEECON
Speech driven interfaces for consumer devices
Speech databases for voice controlled consumer
applications
• television sets, video recorders, mobile phones, palmtop
computers, car navigation kits or even microwave ovens and
toasters
600 speakers
Period: 2000-2003
SPEECON
Partners
DaimlerChrysler
Nokia
Ericsson
Philips Speech
Processing
IBM
Lernout & Hauspie
Natural Speech
Communications
Siemens
Sony
Temic Telefunken
SPEECON
Languages
EU Spanish
US Spanish
Flemish
Dutch
Russian
US English
UK English
Japanese
Italian
Polish
Swedish
German
Swiss German
Danish
Hebrew
Portuguese
French
Finnish
Cantonese
Mandarin
SALA I
SpeechDat Across Latin America
Not government-subsidized
Speech databases for fixed telephony, Latin America
1000-2000 speakers per database
Period: 1998-2001
SALA
Partners
CSELT
Siemens
ELRA
SPEX
Lernout & Hauspie
UPC
Lucent
Vocalis
Philips
SALA
Languages
Brasil (Portuguese,
2000)
Mexico (2000)
Caribbean islands
and Venezuela
Central America
Panama, Colombia
Ecuador, Peru,
Bolivia
Chile
Argentina, Uruguay,
Paraguay
SALA II
Not government-subsidized
To create speech databases for cellular-telephone-oriented
applications
America (North and Latin)
1000 (or 2000) speakers
Period: 2001-2002
(project just starting up)
SALA II
Partners
ATLAS
NSC
ELRA
Philips
IBM
Siemens
Lernout&Hauspie
SPEX
Loquendo
UPC
Lucent
SALA II
Languages
Venezuela
Peru
Mexico
Chile
Argentina
Costa Rica
Brasil
Colombia
American English Canada
US English North East
US Spanish East
US English South East
US Spanish West
US English North West
Future
Non-native/multilingual ASR
Data for Speech-to-Speech Translation
Access to information
• anytime
• anywhere
• by way of any device
More use of spontaneous speech (“conversational
systems”)
Future
Devices will become
• increasingly smaller (“mobile”)
• increasingly more powerful
• connected to information sources such as the Internet, etc.
• robust against different environments
Input/Output
• Limited
• Keyboards and screens less convenient
• Opportunity for speech input and output
• Other input/output methods get different roles
• Multi-modal input and output systems
Future
Distributed systems
• Part of the recognition/synthesis on the local system (“client”)
• Part on the server
• Dynamically adaptable local systems
In car, speech is
• “Hands-free” and
• “Eyes-free” solution
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Why Speech User Interface
Pro
• Fast and simple (e.g. for Chinese!)
• Speech input: 50-250 wpm
• Typing: 20-90 wpm
• Handwriting: 25 wpm
• Pointing: 10-40/min
• Eyes free
• Hands free
• Mobile
• Compact i/o devices
• Audio feedback draws attention
• Complex commands, e.g. controlling your VCR
Con
• Audio messages are difficult to remember if too long, e.g. a telephone number or an address
• “A drawing can replace a thousand words”
• Privacy
• Sometimes cumbersome, e.g. controlling a cursor on a screen
• Voice wear-out
Text-to-Speech engines
[Figure: voice quality (machine-like to human-like) versus processor power & memory, ordering the engines TTS2500, TTS3000, RealSpeak UltraCompact, RealSpeak Compact, RealSpeak]
Text-to-Speech engines
TTS2500
• Low-quality, small-footprint engine for talking-dictionary products
• Available, no additional R&D
TTS3000
• Medium-quality engines
• Limited footprint, high densities
• Limited developments
RealSpeak Compact
• Target: handheld devices
RealSpeak
• High-end system
RealSpeak TTS
New generation, human sounding TTS
Target: server based telephony, PCMM
Platform Requirements:
• CPU: 48 real-time instances on a PIII 450 MHz (8 kHz speech data)
• RAM: < 250 kB/instance; ROM: 4-6 MB
Speechbases:
• 8 kHz uncompressed: ~250 MB
• 8, 11 kHz compressed: 20-30 MB
• 22 kHz compressed: 70-90 MB
20 languages: US English, 15 European and 4 Asian languages
2 languages under development
RealSpeak Compact
High quality, medium footprint TTS
Target: mobile and embedded platforms
Platform Requirements:
• 150 MIPS
• RAM: < 250 kB/instance; 4-6 MB common
• ROM: 16 MB (includes an 11 kHz speechbase)
Derived automatically from RealSpeak
RealSpeak UltraCompact under development
TTS3000
Low footprint, highly intelligible TTS engine
Target: Telephony, PCMM, Mobile, Embedded
Platform requirements:
• CPU: 20-30 MIPS
• RAM: 100 kB/instance; ROM: 2-3 MB
13 languages including:
• US English
• 7 European languages
• 3 Asian languages
2 languages under development
TTS2500
Dedicated TTS for very-low-footprint talking dictionaries
Analysis on an 8- or 16-bit processor: < 2 MIPS
Synthesis on a dedicated chip (LH3010 or LH3030) or a DSP
(ADSP21xx)
1.5 MB ROM, 16 kB RAM
Languages:
• American English
• Mandarin Chinese
• Mexican Spanish
• German
• French
Dimensions of ASR
Speaker
• Independent - adaptive - dependent
• Native - non-native
• Man, woman, child
Recording conditions
• Recording device: telephone, GSM, microphone, tape recorder
• Environment: quiet office, home, car, factory, street, …
Implementation
• Platform: PC, embedded
• CPU and memory
Dimensions of ASR
Size of the (active) vocabulary
• Small (10-100) - medium (100-1000) - large (>1000) - very large (>10000)
Flexibility of the vocabulary
• Fixed (factory-definable) - user-definable
• Phoneme-based => speaker-independent
• User words => speaker-dependent
Word sequences
• Isolated words - sentences - word spotting
• Fixed grammar - flexible language model
• Discrete - continuous speech
Language
• Language-independent engine, language-dependent data files
• Swapping language files
Different Applications, Different
Needs
Dictation
• Speaker-dependent, large vocabulary, continuous speech, quiet office, PC
Command & control, name dialing
• Speaker-independent, small to large vocabulary, noise-robust, DSP boards
and/or client-server
Dialogue systems
• Speaker-independent, medium to large vocabulary, noise-robust, client-server
Security: verification
• Speaker-dependent, combination of a password (what) + speaker characteristics
(who)
Language learning
• Non-native speakers; punish mistakes rather than being tolerant
Automatic Speech Recognition
L&H speech recognition engines cover a broad range of tasks,
processor types, operating systems and input signal types:
Tasks:
• Large-vocabulary continuous real-time dictation
• Large-vocabulary batch transcription
• Grammar-based recognition: large, medium and small vocabularies
• Small-vocabulary isolated word recognition
Platforms:
• PC
• Server
• Handheld, embedded
• Distributed
Automatic Speech Recognition
engines
[Figure: engines plotted by task complexity versus processor power & memory:
isolated word recognition: ASR100;
medium vocabulary, closed grammar: ASR300;
large vocabulary, closed grammar: ASR1500, ASR1600;
large vocabulary, open grammar, dictation (server/mobile terminal): XCalibur, MREC, VX]
Recognition engines …
Input conditions:
• Environments: home, office, public/industrial, car
• Channels: telephone (wireline, wireless), wideband, mobile devices
• Microphones: close-talking, far-talking
• Combinations: e.g., broadcast material
A wide range of processor/memory operating points:
• 200 MIPS/32 MB
• 60 MIPS/1 MB
• 20 MIPS/300 kB
• 5-10 MIPS/<30 kB
Recognition engines ….
ASR100:
• 5-10 MIPS
• < 30 kB
• Speaker-dependent
• Recording device: mic./phone
• Sampling frequency: 8/11 kHz
• Environment: office
• Vocabulary: small and user-adaptable
• Grammar: isolated
• Speech: isolated
• OS: various
• Architecture: stand-alone
• Languages: language-independent
Applications
• Embedded
• Cell-phone dialing
• Toys
Recognition engines ….
ASR300:
• 20 MIPS
• 300 kB
• SI & SD
• Sampling frequency: 8/11 kHz
• Vocabulary: small and factory-adaptable
• Highly noise-robust
• Environment: office/car/other noisy environments
• Unit: word-dependent
• Grammar: isolated
• Speech: quasi-connected command and control
• OS: various
• Architecture: stand-alone
• Languages: US English, French, Italian, Korean, German, Japanese
Applications
• In-car command and control
• Command and control of toys, games
• Command and control in noisy industrial environments
Recognition engines ….
ASR1500
• 60 MIPS
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 8 kHz
• Environment: office
• Recording device: telephone/mobile phone
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages
Applications
• IVR applications over the phone
  • Reverse directory, automated attendant
  • Information providers: stock quotes
  • Ordering systems
Recognition engines ….
ASR1600
• 60 MIPS
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 11 kHz
• Environment: office, car; highly noise-robust
• Recording device: mic.
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages
Applications
• In-car recognition
  • Command and control
• Embedded devices
  • PDAs, SmartPhones
Recognition engines ….
Mrec/VX:
• > 200 MIPS
• > 64 MB
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office
• Recording device: mic.
• Grammar: statistical
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone
• Languages: US English and Spanish, 7 European languages, 2 Asian languages
Applications
• Document creation, incl. command and control
• MediaIndexer (Mrec)
• Speech Transcription (Mrec)
Recognition engines ….
Xcalibur
• Scalable
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office (telephony, car)
• Recording device: mic.
• Grammar: statistical and rule-based
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone and client-server
• Languages: currently only Japanese
Applications
• Document creation
• Command and control
• Focus on conversational systems