
Acoustic Databases
Jan Odijk
ELSNET Summer School, Prague, 2001
Acknowledgements
 Part of the slides have been borrowed from or are based on work by
• Bart D'Hoore
• Hugo van Hamme
• Robrecht Comeyne
• Dirk van Compernolle
• Bert van Coile
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
Linguistic Resources (LRs)
 Linguistic Resources are sets of language data in machine-readable form that can be used for developing, improving or evaluating language and speech technologies.
 Some language and speech technologies:
• Text-To-Speech (TTS)
• Automatic Speech Recognition (ASR)
• Dictation
• Speaker Verification/Recognition
• Spoken Dialogue
• Audio Mining
• Machine Translation
• Intelligent Content Management
• …
Linguistic Resources (LRs)
Major Types
 Electronic Text Corpora
• Newspapers, magazines, etc.
• Usenet texts, e-mail, correspondence
• Etc.
 Lexical Resources
• Monolingual lexicons
• Translation lexicons
• Thesauri
• …
 Acoustic Resources
• Annotated Speech Recordings
• Annotated Recordings of other acoustic signals
• Coughing, throat clearing, breathing, …
• Door slamming, screeching tires (of a car),…
Types of Linguistic Resources
Acoustic Resources
 Acoustic Databases (ADBs)
• Controlled recording of human speech or other acoustic signals
• Enriched with annotations
• Recorded in a digital way
• Representative of targeted application environment and medium
• Balanced for phonemes/phoneme combinations
• Speaker parameters, recording quality, environment/medium documented
Types of Linguistic Resources
Acoustic Resources
 Annotated unstructured recordings
• Broadcast material
• Recorded conversations/monologues/speeches, etc.
• Dictated material
• Enriched with annotations
Types of Linguistic Resources
Acoustic Resources
 In-service data
• Recorded sessions of interaction between humans and a running application
• Usually obtained by logging a customer system
• Enriched with annotations
• Used for tuning models, grammars, etc. to a specific application
Types of Linguistic Resources
Acoustic Resources
 Environments
• "Quiet"
  • Studio
  • Quiet office
  • Normal office
• Noisy
  • Public place (street, hotel lobby, station, etc.)
  • Car (running engine at 0 km/h, city, highway)
  • Industrial environment
Types of Linguistic Resources
Acoustic Resources
 Media
• HQ close-talk microphone
• Desktop microphones
• Telephone
  • analog or digital
  • fixed line or mobile
• Wide band microphones
• Array microphones
• PC/PDA etc. low quality microphone
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
Acoustic Resources
Use
 (for speech synthesis modules in TTS systems)
 (as acoustic reference material for pronunciation lexicons)
 Mainly for speech recognition
 Training and test material for research into new recognition
engines and engine features
 Training and test material for development of acoustic models
 Tuning of acoustic models for specific applications
What is speech recognition?
 ASR: Automatic speech recognition
 Automatic speech recognition is the process by which a
computer maps an acoustic speech signal to text.
 Automatic speech understanding is the process by which a
computer maps an acoustic speech signal to some form of
abstract meaning of the speech.
 Speaker recognition is the process by which a computer
recognizes the identity of the speaker based on speech samples.
 Speaker verification is the process by which a computer checks
the claimed identity of the speaker based on speech samples.
Elements of a Recognizer
[Block diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and the Language Model) → Post Processing → Natural Language Understanding → display text / meaning / action]
Elements of a Recognizer
[Block diagram as above]
Feature Extraction
 Turning the speech signal into something more manageable
• Do analysis once every 10 ms
• Data compression: 220 bytes => 50 bytes => 4 bytes
 Sampling of a signal: transforming it into digital form
 Extracting relevant parameters from the signal
• Spectral information, energy, pitch, ...
 Eliminate undesirable elements (normalization)
• Noise
• Channel properties
• Speaker properties (gender)
Feature Extraction: Vectors
 Signal is chopped into small pieces (frames), typically 30 ms
 Spectral analysis of a speech frame produces a vector
representing the signal properties.
 => result = stream of vectors
[Figure: a waveform frame (amplitude scale -4 to 4) is mapped to a feature vector, e.g. (10.3, 1.2, -0.9, …, 0.2)]
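To make the framing step concrete, here is a minimal sketch in Python with numpy. The 30 ms window and 10 ms step follow the slides; the Hamming window, the log-magnitude spectrum, and all function names are illustrative assumptions, not the actual front-end of any particular recognizer.

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=30, step_ms=10):
    """Chop the signal into overlapping frames: 30 ms window, 10 ms step."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 240 samples at 8 kHz
    step = int(sample_rate * step_ms / 1000)         # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - frame_len) // step
    return np.stack([signal[i * step : i * step + frame_len]
                     for i in range(n_frames)])

def spectral_features(frames, n_coeffs=12):
    """One feature vector per frame: a coarse log-magnitude spectrum."""
    windowed = frames * np.hamming(frames.shape[1])   # taper the frame edges
    spectrum = np.abs(np.fft.rfft(windowed, axis=1))
    return np.log(spectrum + 1e-8)[:, :n_coeffs]      # keep a few coefficients

speech = np.random.randn(8000)                        # stand-in for 1 s of 8 kHz audio
vectors = spectral_features(frame_signal(speech))     # one analysis every 10 ms
print(vectors.shape)                                  # (98, 12): a stream of vectors
```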
Elements of a Recognizer
[Block diagram as above]
Acoustic Model
 Split utterance into basic units, e.g. phonemes
 The acoustic model describes the typical spectral shape (or
typical vectors) for each unit
 For each incoming speech segment, the acoustic model will tell
us how well (or how badly) it matches each phoneme
 Must cope with pronunciation variability
• Utterances of the same word by the same speaker are never identical
• Differences between speakers
• Identical phonemes sound different in different words
=> statistical techniques: creation via a lot of examples
[Figure: frames of the utterance "friendly computers" aligned with phonemes, with model states S1-S13 assigned along f-r-ie-n-d-l-y and c-o-m-p-u-t-e-r-s]
Acoustic Model: Units
 Phoneme: words share the units that model the same sound
[Figure: "stop" = S-T-O-P and "start" = S-T-A-R-T share the S and T units]
 Word: a series of units specific to the word
[Figure: "stop" = S1 S2 S3 S4; "start" = S6 S7 S8 S9 S10]
Acoustic Model: Units
 Context dependent phoneme
[Figure: "stop" = S|,|T T|S|O O|T|P P|O|, (each phoneme conditioned on its left and right neighbors)]
 Diphone
[Figure: "stop" = ,S ST TO OP P,]
 Other sub-word units: consonant clusters
[Figure: "stop" = ST O P]
Acoustic Model: Units
 Phonemes
 Phonemes in context: spectral properties depend on
previous and following phoneme
 Diphones
 Sub-words: syllables, consonant clusters
 Words
 Multi words: example: “it is”, “going to”
 Combinations of all of the above
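To make these unit inventories concrete, here is a small sketch that derives the different unit sequences for one transcription. It is pure string manipulation; "," stands for the utterance boundary, as on the slides above.

```python
def word_units(phones):
    """Derive unit sequences for a transcription like ['S','T','O','P']."""
    padded = [","] + phones + [","]          # ',' marks the word/utterance boundary
    return {
        "phonemes": phones,
        # Each phoneme conditioned on its left and right neighbor, e.g. T|S|O.
        "context_dependent": [f"{p}|{l}|{r}" for l, p, r in
                              zip(padded, padded[1:], padded[2:])],
        # Transitions between adjacent phonemes, e.g. ,S ST TO OP P,.
        "diphones": [a + b for a, b in zip(padded, padded[1:])],
    }

units = word_units(list("STOP"))
print(units["context_dependent"])   # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
print(units["diphones"])            # [',S', 'ST', 'TO', 'OP', 'P,']
```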
Elements of a Recognizer
[Block diagram as above]
Pattern matching
 Acoustic Model: returns a score for each incoming feature
vector indicating how well the feature corresponds to the
model.
= Local score
 Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi)
 Search algorithm: looks for the best scoring word or word sequence
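A toy version of the word-scoring step may help. This sketch assumes a simple left-to-right model in which each frame either stays in the current state or advances to the next one; the local log-scores are taken as given by the acoustic model.

```python
import numpy as np

def viterbi_word_score(local_scores):
    """Best-path score of a word through a left-to-right model.

    local_scores[t, s] = log-score of frame t against state s of the word,
    as returned by the acoustic model. A frame may repeat the current
    state or advance by one state, mimicking a left-to-right HMM.
    """
    n_frames, n_states = local_scores.shape
    best = np.full((n_frames, n_states), -np.inf)
    best[0, 0] = local_scores[0, 0]            # must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = best[t - 1, s]
            advance = best[t - 1, s - 1] if s > 0 else -np.inf
            best[t, s] = max(stay, advance) + local_scores[t, s]
    return best[-1, -1]                         # must end in the last state

# The search layer would call this for every word and keep the best score.
scores = np.log(np.random.rand(50, 4))          # 50 frames, 4-state word model
print(viterbi_word_score(scores))
```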
Elements of a Recognizer
[Block diagram as above]
Language Model
 Describes how words are connected to form a sentence
 Limit possible word sequences
 Reduce number of recognition errors by eliminating
unlikely sequences
 Increase speed of recognizer => real time
implementations
Language Model
 Two major types
• Grammar based:
    !start <sentence>;
    <sentence>: <yes> | <no>;
    <yes>: yes | yep | yes please ;
    <no>: no | no thanks | no thank you ;
• Statistical:
  • Probability of single words, 2/3-word sequences
  • Derived from frequencies in a large corpus
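The statistical variant fits in a few lines: estimate word-sequence probabilities from corpus frequencies. A minimal bigram sketch, without the smoothing for unseen pairs that any real language model needs:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(next word | word) from frequencies in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

prob = train_bigram(["yes please", "no thank you", "no thanks"])
print(prob("no", "thanks"))   # 0.5: half of the "no" tokens are followed by "thanks"
```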
Active Vocabulary
 Lists the words that can be recognized by the acoustic model
 and that are allowed to occur given the language model
 Each word is associated with a phonetic transcription
• Enumerated, and/or
• Generated by a Grapheme-to-Phoneme (G2P) module
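In code, an active vocabulary might look like the sketch below. The transcriptions and the G2P hook are purely illustrative; no particular phoneme alphabet or G2P module is implied.

```python
# Hypothetical active vocabulary: word -> enumerated phonetic transcriptions.
# Multiple transcriptions capture pronunciation variants.
vocabulary = {
    "yes":    ["j E s"],
    "please": ["p l i: z"],
    "no":     ["n oU", "n O:"],           # two enumerated variants
}

def transcriptions(word, g2p=None):
    """Look the word up; fall back to a G2P module for words not enumerated."""
    if word in vocabulary:
        return vocabulary[word]
    if g2p is not None:
        return [g2p(word)]                 # automatic grapheme-to-phoneme conversion
    raise KeyError(f"{word!r} not in the active vocabulary and no G2P available")

print(transcriptions("no"))                # ['n oU', 'n O:']
```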
Post Processing
 Re-ordering of the N-best list using other criteria: e.g. account numbers, telephone numbers
 Spelling: name search from a list of known names
 Applying NLP techniques that fall outside the scope of the statistical language model
• E.g. "three dollars fifty cents" => "$ 3.50"
• "doctor Jones" => "Dr. Jones"
• Etc.
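The kind of rewriting involved can be sketched as a small token rewriter. The rules and word tables below are toy assumptions that cover just the two examples on the slide; real post-processors use much richer grammars.

```python
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "ten": 10, "twenty": 20, "fifty": 50}
TITLES = {"doctor": "Dr.", "mister": "Mr."}

def post_process(words):
    """Rewrite recognized words into display text, e.g. "$ 3.50", "Dr. Jones"."""
    out, i = [], 0
    while i < len(words):
        w = words[i].lower()
        if w in TITLES:
            out.append(TITLES[w])
            i += 1
        elif w in NUMBER_WORDS and i + 1 < len(words) and words[i + 1] == "dollars":
            dollars, cents = NUMBER_WORDS[w], 0
            i += 2
            if i + 1 < len(words) and words[i + 1] == "cents":
                cents = NUMBER_WORDS.get(words[i].lower(), 0)
                i += 2
            out.append(f"$ {dollars}.{cents:02d}")
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(post_process("three dollars fifty cents".split()))  # $ 3.50
print(post_process("doctor Jones".split()))                # Dr. Jones
```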
Training of Acoustic Models
[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Training of Acoustic Models
 Database design
 Database design
• Coverage of units: word, phoneme, context dependent unit
• Coverage of population (region, dialect, age, …)
• Coverage of environments (car, telephone, office, …)
 Database collection and validation
• Checking recording quality
• Annotation: describing what people said, extra-speech sounds
 Dictionaries
• Phonetic transcription of words
• Multiple transcriptions needed
• G2P: automatic transcription
Feature vectors
[Figure: the stream of feature vectors, e.g. (10.3, 1.2, -0.9, …, 0.2), (8.1, -0.5, 1.3, …, 0.2), …, (2.1, -0.2, 1.9, …, -0.3)]
Example: discrete models
 A collection of prototypes is constructed (100 to 250)
 Each vector is replaced by its nearest prototype
[Figure: scatter plot of feature vectors with the prototypes marking the cluster centers]
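Constructing the prototypes and replacing each vector by its nearest one is classic vector quantization, essentially k-means. A compact sketch; the prototype count and iteration count are illustrative parameters:

```python
import numpy as np

def train_prototypes(vectors, n_prototypes=128, iterations=10):
    """Build a small codebook of prototypes with plain k-means."""
    rng = np.random.default_rng(0)
    protos = vectors[rng.choice(len(vectors), n_prototypes, replace=False)]
    for _ in range(iterations):
        # Assign each vector to its nearest prototype...
        dists = np.linalg.norm(vectors[:, None, :] - protos[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ...then move each prototype to the mean of its assigned vectors.
        for k in range(n_prototypes):
            if (labels == k).any():
                protos[k] = vectors[labels == k].mean(axis=0)
    return protos

def quantize(vectors, protos):
    """Replace each feature vector by the index of its nearest prototype."""
    dists = np.linalg.norm(vectors[:, None, :] - protos[None, :, :], axis=2)
    return dists.argmin(axis=1)

feats = np.random.randn(2000, 12)            # stand-in feature vectors
protos = train_prototypes(feats)
print(quantize(feats[:10], protos))           # ten prototype indices
```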
Feature vectors => Prototypes
[Figure: each feature vector in the stream is replaced by the index of its nearest prototype]
Phoneme assignment
ffrrEEEnnnnddllIIII,,,kkOOOmmpjjuuuuttt$$$rrzz
Prototypes
2276998900023448889211127780128897791237787622
[Table: prototype-phoneme co-occurrence counts (prototypes 0-9 against the phonemes f, r, E, n, d, l, I, k, O, m, p, j, u, t, $, z and silence ","), accumulated over the segmented training data]
Training of Acoustic Models
For all utterances in the database:
  Make a phonetic transcription of the sentence
  Use the current models to segment the utterance file:
    assign a phoneme to each speech frame
  Collect statistical information:
    count prototype-phoneme occurrences
Create new models
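For the discrete model above, one re-estimation pass of this loop reduces to counting. The sketch below assumes the frame-level segmentation (a phoneme and a prototype per frame) is already given by the previous model; normalized counts become the new emission probabilities.

```python
from collections import Counter, defaultdict

def reestimate(utterances):
    """One re-estimation pass for a discrete model.

    Each utterance is a list of (phoneme, prototype) pairs: the frame-level
    segmentation produced with the previous model. Returns the new
    P(prototype | phoneme) tables estimated from the counts.
    """
    counts = defaultdict(Counter)
    for segmentation in utterances:
        for phoneme, prototype in segmentation:
            counts[phoneme][prototype] += 1
    models = {}
    for phoneme, protos in counts.items():
        total = sum(protos.values())
        models[phoneme] = {p: c / total for p, c in protos.items()}
    return models

# Frame-level segmentation of one toy utterance ("f f r r E E ...").
seg = [("f", 2), ("f", 2), ("r", 7), ("r", 6), ("E", 9), ("E", 9)]
print(reestimate([seg])["r"])   # {7: 0.5, 6: 0.5}
```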
Key Element in ASR
 ASR is based on learning from observations
• Huge amount of spoken data needed for making acoustic models
• Huge amount of text data needed for making language models
 => Lots of statistics, few rules
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
Contents of an ADB
 Utterances of different utterance types
 Utterance types suited to the intended
application domain
 Text balanced for phoneme and/or diphone
distribution
 All enriched with annotations
Contents of an ADB
Spontaneous v. Read Utterances
 A spontaneous utterance is a response to a question or a
request
 "In which city do you live?"
 "Please spell a letter heading to your secretary"
 "Is English your mother tongue?"
 "Make a hotel reservation"
 A read utterance is an utterance read from a presentation
text
 "London"
 "Dear John"
 "Yes"
 "Please book me a room for 2 persons with bath. We will arrive …."
Contents of an ADB
 Isolated Phonetically Rich Word
 Apple Tree, Lobster
 Isolated Digit
 5
 Isolated Alphabet
 B
 Isolated number (natural number)
 4256
Contents of an ADB
 Continuous Digits
 911
 Continuous Alphabet
 YMCA
 Commands
 Stop, left, print, call, next
Contents of an ADB
Connected Digits
 Telephone Numbers
 057/228888
 Credit Card Numbers
 3741 959289 310001
 Pin-codes
 8978
 Social Security Number
 560228 561 80
 Other identification numbers, e.g. sheet id
 012589225712
Contents of an ADB
Time and Date Expressions
 Time (“analog”, word style)
 A quarter past two
 Time (“digital”)
 14:15
 2:15PM
 Date (“analog”, word style, absolute)
 Friday, June 25th, 1999
 Christmas Eve, Easter
 Date (“digital”, absolute)
 25/06/99
 Date (“analog”, word style, relative)
 Tomorrow, next week, in one month
Contents of an ADB
 Money amounts
 $327.67
 £148.95
 Isolated Phonetically Rich Sentences
 A cold supper was ordered and a bottle of port
 Isolated Command Sentences
 Insert this name in the list
 Names
 Microsoft, New York, Jonathan
 Syllables
 Hi-ta-chi
Contents of an ADB
 Continuous Phonetically Rich Sentences
 Once upon a time, in a land far from here, lived a little princess. She
was the most beautiful girl…
 Continuous Command Sentences
 Select the first line. Make it bold and move it to the bottom of the
text…
 Continuous Spontaneous speech
 <Make a reservation in a hotel>
Contents of an ADB
Contents of SpeechDat-II ADB
 For each speaker/session
 Approx. 40 utterances
 Duration approx. 10 minutes
 Mixture of read and spontaneous utterances
 Mixture of
 Phonetically rich sentences
 Application specific words
 Utterance types that will often occur in any application
 1000-5000 speakers/sessions
Contents of an ADB
Contents of SpeechDat-II ADB
 1 isolated digit
 4-digit id (sheet number)
 3 connected digits (~10-digit telephone number)
 12-digit credit card number
 3 natural numbers
 2 money amounts: 1 large, 1 small
 3 spelled words
 1 time of day (spontaneous)
Contents of an ADB
Contents of SpeechDat-II ADB
 1 time phrase (read, word style)
 1 date (spontaneous, e.g. person’s birthday)
 2 dates (read, word style)
 3 yes/no questions
 1 city of call/birth
 6 common application words out of 50
 3 application word phrases
 9 sentences (read)
Contents of an ADB
Contents of SpeechDat-Car ADB
 For each session
 Approx. 120-130 utterances (depending on session)
 Duration 2-3 hours
 Mixture of read and spontaneous utterances
 Mixture of
 Phonetically rich sentences
 Application specific words
 Utterance types that will often occur in any application
 600 sessions with min. 300 speakers
 In 2 out of 7 conditions
 Standing still/ Low speed/ high speed
 Different road conditions/ surrounding noise
 Audio equipment on/off
Contents of an ADB
Contents of SpeechDat-Car ADB
 Digits and Digit Strings
 1 sequence of 10 digits
 1 sheet number (4+ digit sequence)
 1 spontaneous telephone number
 1 credit card number (16 digits)
 1 PIN code (6 digits)
 4 isolated digits
 Dates
 1 spontaneous date (e.g. birthday)
 1 prompted date, word style
 1 relative and general date expression
Contents of an ADB
Contents of SpeechDat-Car ADB
 Names
 1 spontaneous name (e.g. speaker's first name)
 1 city of growing up (spontaneous)
 2 most frequent city names
 2 company / agency / street names
 1 person name (first name or surname)
 Spellings
 1 spontaneous spelled name (e.g. speaker's first name)
 1 spelling of city name
 4 real words or names
 1 artificial name (for coverage)
Contents of an ADB
Contents of SpeechDat-Car ADB
 Money Amounts/Natural Numbers
 1 money amount
 1 natural number
 Times
 1 time of day (spontaneous)
 1 time phrase (word style)
 Phonetically Rich words
 4 phonetically rich words
Contents of an ADB
Contents of SpeechDat-Car ADB
 Application words
 13 mobile phone application words
 22 IVR function keywords
 32 car product keywords
 2 voice activation keywords
 2 language dependent keywords
 Sentences
 2 phrases using an application word
 9 phonetically rich sentences
 10 prompts for spontaneous speech
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
Phases in Acoustic Resource
Creation
 Design
 Creation of Script
 Recruitment & Recording
 Annotation and Validation
 Lexicon
 Quality control
 Production
DESIGN
 Language study:
• Phoneme set
• Dialects
 Scripting: utterance definition and distribution over speakers
 Speaker typology: distribution definition
• Sex, gender, age, dialects, educational level
 Recording: specification of procedure and platform
 Validation: specification of procedure and quality standard
Creation of Script
Prompts/Text/Transcription
 A prompt refers to the way an utterance is presented to the
speaker. This can be done on the desktop, on paper or with a
play back file (telephony).
 The (presentation) text represents the utterance as it
should be pronounced by the speaker. It is normally
presented according to the spelling conventions of the target
language.
 The transcription is the utterance as it has been
pronounced by the speaker.
 EXAMPLE: The pronunciation of a digit string
• PROMPT: "Please read the number on top of your form"
• TEXT: 578124
• TRANSCRIPTION: five seven eight one two four
Creation of Script
 Collect and clean text corpora
 Split cleaned text into a sequence of sentences
 Remove ungrammatical and overly long sentences
 Remove sentences containing offensive language
 Remove (certain) ambiguities in pronunciation
 numbers, dates, abbreviations, etc.
 Apply phonetic balancing tools to obtain phonetically rich text
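Phonetic balancing tools typically work greedily: keep picking the sentence that adds the most not-yet-covered units. A minimal sketch using diphones as the unit; the transcribe function (in reality a lexicon/G2P lookup) is left as a parameter.

```python
def diphones(phoneme_string):
    """All adjacent phoneme pairs of a transcription like 'f r E n d'."""
    phones = phoneme_string.split()
    return {(a, b) for a, b in zip(phones, phones[1:])}

def greedy_select(sentences, transcribe, n_wanted):
    """Pick sentences that each add as many unseen diphones as possible."""
    covered, chosen = set(), []
    pool = [(s, diphones(transcribe(s))) for s in sentences]
    for _ in range(n_wanted):
        if not pool:
            break
        sentence, units = max(pool, key=lambda su: len(su[1] - covered))
        chosen.append(sentence)
        covered |= units
        pool = [su for su in pool if su[0] != sentence]
    return chosen, covered

# Toy usage: the transcriptions would normally come from the lexicon/G2P.
lex = {"a friend": "@ f r E n d", "stop it": "s t O p I t"}
chosen, covered = greedy_select(list(lex), lex.get, n_wanted=1)
print(chosen, len(covered))
```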
Creation of Script
 Collect and/or create other utterance types
• Telephone numbers, amounts, credit card numbers, etc.
 Create prompts
• Prerecorded messages to the speaker
• For unmonitored recording without access to a screen (telephony)
 Put all these in resource files
Creation of Script
Script File
 Configuration:
• Acquisition board, coding type
• Sampling rate, number of channels
 Information items
• Speaker id, sheet id
• Gender, age, region of birth, region of youth, region of living, etc. and their possible values
• Recording environment/conditions
 Sentence definitions
• Specifies order and types of utterances in one session
Creation of Script
 Resource files => utterance sheets
 Generate letter with instructions and list of utterances for
each speaker (esp. telephony)
Creation of Script
Tools
 Script Editor
• For creating/modifying scripts
• For creating utterance sheet files (from resource files)
• For generating letters to speakers
 Digit String Generator
• Natural numbers
• Bank accounts
• Credit card numbers
• Phone numbers
• Pin-codes
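The digit string generator itself is straightforward; a sketch of the idea follows. The formats are illustrative only; real scripts follow national conventions for phone numbers, account formats, and so on.

```python
import random

def digit_string(n_digits):
    """A random string of decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n_digits))

def make_items(seed=None):
    """Generate one session's worth of read number items (formats illustrative)."""
    random.seed(seed)
    return {
        "phone_number":   digit_string(3) + "/" + digit_string(6),  # e.g. 057/228888
        "credit_card":    " ".join(digit_string(4) for _ in range(4)),
        "pin_code":       digit_string(4),
        "natural_number": str(random.randint(1, 99999)),
    }

print(make_items(42))
```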
Creation of Script
 Test the script
• By making one or more recording sessions
• Also tests the recording set-up
• Also provides indications for the average session duration
RECRUITMENT
 Contact potential speakers according to the typology
• Acquaintances, colleagues
• Advertisements
• Employees/students of cooperating organizations (companies, universities)
• Possibly with the help of marketing agencies
 Explain
• Purpose and context
• What the speaker is supposed to do
• How much time it will take
• Reimbursement for the speaker (time spent, travel costs)
 Make concrete arrangements with the speakers
RECORDING
 Locations: set up recording equipment in an environment fitting the environment definition
 Set up recording platform and test it
 Welcome speaker, instruct speaker
 Interview: log speaker typology & recording conditions
 Make recordings and follow up on quality
 Deal with administrative matters
• Agreement on ownership of recording
• Reimbursement
RECORDING TOOL
[Screenshot of the recording tool]
VALIDATION and ANNOTATION
 After recording, the signal will NEVER be touched
• Only enriched with annotations
 Check (and correct) the relation between text & speech
• Orthographic transcription must represent what the speaker said
• Tool to expand abbreviations, numbers, digit sequences
 Segmentation
• Check (and correct) begin and end of speech markers
• (Mostly for TTS) Mark begin and end of phonemes
VALIDATION and ANNOTATION
 Assign quality label
• Very good overall quality … very bad overall quality
 Annotations for extra events
• Speaker sounds (coughing, breathing, swallowing, …)
• Mispronunciations, truncations
• Sound from other sources (other speaker, music, radio, …)
• Continuous background noise (wind, rain, …)
• Filled pauses (uh, um, er, ah, hmm, …)
• Telephone distortions
Validation Tool
[Screenshot of the validation tool]
Semi-Automatic Validation
 Validation can be partially automated
 For certain types of databases
• 70-75% reliably validated automatically
• 25-30% require a manual check
 Using ASR systems
 Research into further automating this is ongoing
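One common recipe for the automatic part is to run a recognizer over each utterance and flag files whose output disagrees too much with the transcription. A sketch built on standard word-level edit distance; the zero-error threshold is an illustrative choice.

```python
def word_errors(reference, hypothesis):
    """Minimum substitutions + insertions + deletions between word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def needs_manual_check(transcription, asr_output, max_errors=0):
    """Flag the utterance for a human validator if ASR and transcription disagree."""
    return word_errors(transcription, asr_output) > max_errors

print(needs_manual_check("five seven eight one two four",
                         "five seven eight one two four"))  # False: passes
print(needs_manual_check("five seven eight one two four",
                         "five seven nine one two"))         # True: flag for a human
```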
LEXICON
 One central "mother lexicon" for each language
• To reduce duplication of effort
• To maintain consistency
 Each request is compared with the mother database
• Entries not found are imported into the mother database
• Entries not found are turned into a job
• The job is assigned to linguists
LEXICON
 After finishing the job
• Requested entries and properties are exported
• Turned into the required format
• Delivered to the requestor
 Additions/modifications due to this request are now available for other requests
LEXICON: Tools
 Phoned
• Lexical database plus user interface
• (currently in Access but switching to SQL Server)
• Reuse of G2P and Synthesis Modules
 PhonedAdmin
• Import and export of data from the mother database
• Comparison with existing mother database
• Definition of users and jobs
• Assignment of jobs to users
QUALITY CONTROL
 Typical circumstances
• Database project is ongoing
• Often at a remote location
• Multiple persons (for recording and validation)
• Many questions, problems and unclarities arise constantly
• Require answers from specialists
• Danger of errors and inconsistencies
  • Within the work of a single person
  • Between different persons
 Constant monitoring
• Systematic and regular quality checks required
• Systematic and regular feedback required
  • During the whole project
  • From the earliest moment possible
 Documentation, incl. spot check report
QUALITY CONTROL
 Tools
• ADB Scanner: checks consistency of the database
  • Standard structure, all files available
• ADB Statistics
  • Statistics on information items (sex, gender, age, dialect, quality, etc.) and utterance types
• ADB Report Tool
  • For creating parts of the documentation
• And others
PRODUCTION
 Huge amount of data!
 Multiple copies needed
 Special fast CD-replicator equipment
 Special cupboards for storing the CDs
 Description in catalogue
 Distribution
 Conversion Tools (format converter, down sampling,
demultiplexing)
DAR Resource Description
[Screenshot of a DAR resource description]
DAR Resource Description: Statistics
[Screenshot of the statistics part of a DAR resource description]
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
General
 More data!
 The right data!
 High Quality data
 In-service Data
 ASAP
SpeechDat Family
 Consortium of industrial and university partners
 Often EU projects
 One type of database is defined
 Each partner makes one database according to spec
 Each database is validated by external organization (SPEX,
Nijmegen, the Netherlands)
 After approval databases are exchanged among the partners
 Max. 1-1.5 years later, the data are offered for distribution by ELRA
 http://www.speechdat.org/
Overview of major projects
 SpeechDat (M)
 SpeechDat-II
 SpeechDat-E
 SpeechDat-Car
 SPEECON
 SALA I
 SALA II
SpeechDat (M)
 EU-funded
 production, standardization, evaluation and
dissemination of Spoken Language Resources
 8 fixed telephone network databases, 1000
speakers each; 1 mobile telephone network
database, 300 speakers
 Period: 1994-1996
SpeechDat (M)
Partners
 Siemens
 UPC
 Philips
 IDIAP
 Vocalis
 INESC
 CSELT
 GEC MSIS
SpeechDat (M)
Languages
 German
 Spanish
 French
 Portuguese
 Danish
 Swiss French
 Italian
SpeechDat-II
 EU-funded
 Creation of Telephony Databases
 25 fixed and mobile telephone network
databases, 500-5000 speakers each; 3 speaker
verification databases
 Period: 1996-1998
SpeechDat-II
Partners
 Aalborg
University
 Auditex
 British Telecom
 CSELT
 DMI
 ELRA
 GEC
 GPT
 IDIAP
 INESC
 SPEX
 Knowledge S.A.
 Swiss Telecom
 KTH
 Telenor
 Lernout &
Hauspie
 Univ. of Maribor
 Matra Nortel
 Philips
 Portugal Telecom
 Siemens
 Univ. of Munich
 Univ. of Patras
 UPC
 Vocalis
SpeechDat-II
Languages
 Danish
 Finnish
 Slovenian
 Flemish
 Finnish Swedish
 Greek
 Belgian French
 French French
 Italian
 Luxemburg German
 Dutch
 Portuguese
 Swiss French
 Spanish
 Luxemburg French
 Swiss German
 Swedish
 German
 Norwegian
 British English
 Welsh
SpeechDat-E
 EU-funded
 Eastern European Speech Databases for Creation of
Voice Driven Teleservices
 Speech databases for fixed telephone networks
 suited for typical present-day teleservices plus
 phonetically rich set of material for vocabulary
independent ASR
 1000 – 2500 speakers
 Period: 1999-2001
SpeechDat-E
Partners
 Auditex
 Lernout & Hauspie
 Philips Speech Processing
 Siemens
 ELRA
 SPEX
 Brno University of Technology
 Prague Technical University
 Budapest University of Technology
 Wroclaw University of Technology
 Slovak Academy of Sciences
SpeechDat-E
Languages
 Russian (2500)
 Czech
 Slovak
 Hungarian
 Polish
SpeechDat-Car
 EU-funded
 9 in-vehicle and mobile telephone network databases
 300 speakers, each in 2 out of 7 conditions (600 recording
sessions)
 5 simultaneous channels
 Period: Apr 1998 - Oct 2000
SpeechDat-Car
Partners
 Aalborg University
 Nokia
 Alcatel
 Renault
 Robert Bosch GmbH
 SEAT
 DMI
 SPEX
 ELRA
 University of Munich
 Knowledge S.A.
 UPC
 Lernout & Hauspie
 L&H France (formerly Matra Nortel)
 Vocalis
 Volkswagen
SpeechDat-Car
Languages
 Danish
 German
 British English
 Greek
 Finnish
 Italian
 Flemish/Dutch
 Spanish
 French
 American English
SPEECON
 Speech driven interfaces for consumer devices
 Speech databases for voice controlled consumer
applications
• Television sets, video recorders, mobile phones, palmtop computers, car navigation kits or even microwave ovens and toasters
 600 speakers
 Period: 2000-2003
SPEECON
Partners
 DaimlerChrysler
 Nokia
 Ericsson
 Philips Speech Processing
 IBM
 Lernout & Hauspie
 Natural Speech Communications
 Siemens
 Sony
 Temic Telefunken
SPEECON
Languages
 EU Spanish
 Flemish
 Dutch
 Russian
 US English
 Japanese
 Italian
 US Spanish
 Polish
 Swedish
 German
 UK English
 Danish
 Hebrew
 Portuguese
 French
 Swiss German
 Finnish
 Cantonese
 Mandarin
SALA I
 SpeechDat Across Latin America
 Not government-subsidized
 Speech databases for fixed telephony, Latin America
 1000-2000 speakers per database
 Period: 1998-2001
SALA
Partners
 CSELT
 Siemens
 ELRA
 SPEX
 Lernout & Hauspie
 UPC
 Lucent
 Vocalis
 Philips
SALA
Languages
 Brasil (Portuguese, 2000)
 Mexico (2000)
 Caribbean islands and Venezuela
 Central America
 Panama, Colombia
 Ecuador, Peru, Bolivia
 Chile
 Argentina, Uruguay, Paraguay
SALA II
 Not government-subsidized
 To create speech databases for cellular-telephone oriented applications
 America (North and Latin)
 1000 (or 2000) speakers
 Period: 2001-2002
 (project just starting up)
SALA II
Partners
 ATLAS
 NSC
 ELRA
 Philips
 IBM
 Siemens
 Lernout & Hauspie
 SPEX
 Loquendo
 UPC
 Lucent
SALA II
Languages
 Venezuela
 Peru
 Mexico
 Chile
 Argentina
 Costa Rica
 Brasil
 Colombia
 American English Canada
 US English North East
 US Spanish East
 US English South East
 US Spanish West
 US English North West
Future
 Non-native/multilingual ASR
 Data for Speech-to-Speech Translation
 Access to information
• Anytime
• Anywhere
• By way of any device
 More use of spontaneous speech ("conversational systems")
Future
 Devices will become
• Increasingly smaller ("mobile")
• Increasingly more powerful
• Connected to information sources such as the Internet, etc.
 Robustness against different environments
 Input/Output
• Limited
• Keyboard and screens less convenient
• Opportunity for speech input and output
• Other input/output methods get different roles
• Multi-modal input and output systems
Future
 Distributed systems
• Part of the recognition/synthesis on the local system ("client")
• Part on the server
• Dynamically adaptable local systems
 In the car, speech is
• a "hands-free" and
• "eyes-free" solution
Overview
 What is a speech database?
 How is it used?
 What does it contain?
 How is it created?
 Industrial needs
 Technologies and applications
Why Speech User Interface
 Pro
• Audio feedback draws attention
• Complex commands, e.g. "Control your VCR"
• Fast and simple - Chinese!!!
  • Speech input: 50-250 wpm
  • Typing: 20-90 wpm
  • Handwriting: 25 wpm
  • Pointing: 10-40/min
• Eyes free
• Hands free
• Mobile
• Compact i/o devices
 Con
• Audio messages difficult to remember if too long, e.g. a telephone number or address
• "A drawing can replace a thousand words"
• Privacy
• Sometimes cumbersome, e.g. controlling a cursor on a screen
• Voice wear-out
Text-to-Speech engines
[Chart: voice quality (machine-like → human-like) versus processor power & memory (low → high); engines in increasing order of quality and resource needs: TTS2500, TTS3000, RealSpeak UltraCompact, RealSpeak Compact, RealSpeak]
Text-to-Speech engines
 TTS2500
• Low quality, small footprint engine for talking dictionary products
• Available, no additional R&D
 TTS3000
• Medium quality engines
• Limited footprint, high densities
• Limited developments
 RealSpeak Compact
• Target: handheld devices
 RealSpeak
• High-end system
RealSpeak TTS
 New generation, human sounding TTS
 Target: server based telephony, PCMM
 Platform Requirements:
• CPU: 48 real time instances on a PIII 450 MHz (8 kHz speech data)
• RAM: < 250 kB/instance; ROM: 4-6 MB
 Speechbases:
• 8 kHz uncompressed: ~ 250 MB
• 8, 11 kHz compressed: 20 – 30 MB
• 22 kHz compressed: 70 – 90 MB
 20 languages: US English, 15 European and 4 Asian languages
 2 languages under development
RealSpeak Compact
 High quality, medium footprint TTS
 Target: mobile and embedded platforms
 Platform Requirements:
• 150 MIPS
• RAM: < 250 kB/instance; 4-6 MB common
• ROM: 16 MB (includes 11 kHz speechbase)
 Derived automatically from RealSpeak
 RealSpeak UltraCompact under development
TTS3000
 Low footprint, highly intelligible TTS engine
 Target: Telephony, PCMM, Mobile, Embedded
 Platform requirements:
• CPU: 20 – 30 MIPS
• RAM: 100 kB/instance; ROM: 2 - 3 MB
 13 languages including:
• US English
• 7 European languages
• 3 Asian languages
 2 languages under development
TTS2500
 Dedicated TTS for very low footprint talking dictionaries
 Analysis on an 8 or 16 bit processor: < 2 MIPS
 Synthesis on a dedicated chip (LH3010 or LH3030) or DSP (ADSP21xx)
 1.5 MB ROM, 16 KB RAM
 Languages:
• American English
• Mandarin Chinese
• Mexican Spanish
• German
• French
Dimensions of ASR
 Speaker
• Independent - adaptive - dependent
• Native - non-native
• Man, woman, child
 Recording conditions
• Recording device: telephone, GSM, microphone, tape recorder
• Environment: quiet office, home, car, factory, street, …
 Implementation
• Platform: PC, embedded
• CPU and memory
Dimensions of ASR
 Size of the (active) vocabulary
• Small (10-100) - medium (100-1000) - large (>1000) - very large (>10000)
 Flexibility of the vocabulary
• Fixed (factory-definable) - user-definable
• Phoneme based => speaker independent
• User words => speaker dependent
 Word sequences
• Isolated words - sentences - word spotting
• Fixed grammar - flexible language model
• Discrete - continuous speech
 Language
• Language independent engine, language dependent data files
• Swapping language files
Different Applications, Different
Needs
 Dictation
• Speaker dependent, large vocabulary, continuous speech, quiet office, PC
 Command & control, name dialing
• Speaker independent, small to large vocabulary, noise robust, DSP boards and/or client-server
 Dialogue systems
• Speaker independent, medium to large vocabulary, noise robust, client-server
 Security: verification
• Speaker dependent, combination of password (what) + speaker characteristics (who)
 Language learning
• Non-native speakers, punish mistakes rather than being tolerant
Automatic Speech Recognition
 L&H speech recognition engines cover a broad range of tasks,
processor types, operating systems and input signal types:
 Tasks:
• Large vocabulary continuous real-time dictation,
• Large vocabulary batch transcription,
• Grammar-based recognition - large, medium and small vocabularies,
• Small-vocabulary isolated word recognition.
 Platforms:
• PC,
• Server,
• Handheld, Embedded,
• Distributed.
Automatic Speech Recognition engines
[Chart: task complexity versus processor power & memory (low → high): ASR100 (isolated word recognition), ASR300 (medium vocabulary, closed grammar), ASR1500 and ASR1600 (large vocabulary, closed grammar), XCalibur and MREC/VX (large vocabulary, open grammar dictation; server and mobile terminal)]
Recognition engines …
 Input conditions:
• Environments: home, office, public/industrial, car.
• Channels: telephone (wireline, wireless), wideband, mobile devices.
• Microphones: close-talking, far-talking.
• Combinations: e.g., broadcast material.
 A wide range of processor/memory operating points:
• 200 Mips/32 MB,
• 60 Mips/1 MB,
• 20 Mips/300 KB,
• 5-10 Mips/<30 KB
Recognition engines ….
 ASR100:
• 5-10 Mips,
• < 30 KB,
• Speaker-dependent
• Recording device: mic./phone
• Sampling Frequency: 8/11 kHz
• Environment: office
• Vocabulary: small and user-adaptable
• Grammar: isolated
• Speech: isolated
• OS: various
• Architecture: stand-alone
• Languages: language-independent
 Applications
• Embedded
• Cell-phone dialing
• Toys
Recognition engines ….
 ASR300:
• 20 Mips,
• 300 KB,
• SI & SD
• Sampling Frequency: 8/11 kHz
• Vocabulary: small and factory-adaptable
• Highly noise robust
• Environment: office/car/other noisy environments
• Unit: word-dependent
• Grammar: isolated
• Speech: quasi-connected command and control
• OS: various
• Architecture: stand-alone
• Languages: US English, French, Italian, Korean, German, Japanese
 Applications
• In-car command and control
• Command and control of toys, games
• Command and control in noisy industrial environments
Recognition engines ….
 ASR1500
• 60 Mips,
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling Frequency: 8 kHz
• Environment: office
• Recording Device: telephone/mobile phone
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages
 Applications
• IVR applications over the phone
  • Reverse directory, automated attendant
  • Information providers: stock quotes
  • Ordering systems
Recognition engines ….
 ASR1600
• 60 Mips,
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling Frequency: 11 kHz
• Environment: office, car; highly noise-robust
• Recording Device: mic.
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages
 Applications
• In-car recognition
  • Command and control
• Embedded devices
  • PDAs, SmartPhones
Recognition engines ….
 Mrec/VX:
• > 200 Mips
• > 64 MB
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling Frequency: 22 (16) kHz
• Environment: office
• Recording Device: mic.
• Grammar: statistical
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone
• Languages: US English and Spanish, 7 European languages, 2 Asian languages
 Applications
• Document creation, incl. command and control
• MediaIndexer (Mrec)
• Speech Transcription (Mrec)
Recognition engines ….
 Xcalibur
• Scalable
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling Frequency: 22 (16) kHz
• Environment: office (telephony, car)
• Recording Device: mic.
• Grammar: statistical and rule-based
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone and client-server
• Languages: currently only Japanese
 Applications
• Document creation
• Command and control
• Focus on conversational systems