Lecture22 - Intelligent Software Lab.

Speech-based Information Retrieval
Gary Geunbae Lee
POSTECH
Oct 15, 2007, ICU (Information and Communications University)
Korean Intellectual Property Office – ICU seminar
Contents
 Why speech-based IR?
 Speech recognition technology
 Spoken document retrieval
 Ubiquitous IR using spoken dialog
Why speech IR? – SDR (back-end multimedia material)
[ch-icassp07]
 In the past decade there has been a dramatic increase in the
availability of on-line audio-visual material…
 More than 50 percent of IP traffic is video
 …and this trend will only continue as cost of producing audio-visual
content continues to drop
 Examples: broadcast news, podcasts, academic lectures
 Raw audio-visual material is difficult to search and browse
 Keyword driven Spoken Document Retrieval (SDR):
 User provides a set of relevant query terms
 Search engine needs to return relevant spoken documents and provide an easy
way to navigate them
Why speech IR? – Ubiquitous computing (front-end query)
 Ubiquitous computing: network + sensor + computing
 Also called pervasive computing, third-paradigm computing, calm technology, or invisible computing
 “I, Robot”-style interface – human language + hologram
Ubiquitous computer interface?
 Computer – robot, home appliances, audio, telephone, fax machine, toaster, coffee machine, etc. (every object)
 VoiceBox (USA)
 Telematics Dialog Interface (POSTECH, LG, DiQuest)
 EPG guide (POSTECH)
 Dialog Translation (POSTECH)
Example Domain for Ubiquitous IR
 Car navigation
 Tele-service
 Home networking
 Robot interface
What’s hard – ambiguities, ambiguities, all different
levels of ambiguities
John stopped at the donut store on his way home from work. He thought a
coffee was good every few hours. But it turned out to be too expensive
there. [from J. Eisner]
- donut: to get a donut (doughnut)? or a “donut” (spare tire) for his car?
- donut store: a store where donuts shop? or one that is run by donuts? or that looks like a big donut? or is made of donuts?
- from work: well, actually, he stopped there from hunger and exhaustion, not just from work
- every few hours: that's how often he thought it? or is that how often the coffee was good?
- it: the particular coffee that was good every few hours? the donut store? the situation?
- too expensive: too expensive for what? what are we supposed to conclude about what John did?
Contents
 Why speech-based IR?
 Speech Recognition Technology
 Spoken document retrieval
 Ubiquitous IR using spoken dialog
The Noisy Channel Model
 Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words [Rabiner and Juang, 1993]
 The noisy channel model [Lee et al., 1996]
 Acoustic input considered a noisy version of a source sentence
[Figure: the noisy channel — a source sentence passes through the noisy channel to produce a noisy sentence, and the decoder guesses the original sentence from it. Example source and guess: “버스 정류장이 어디에 있나요?” (“Where is the bus stop?”)]
The Noisy Channel Model
 What is the most likely sentence out of all sentences in the
language L given some acoustic input O?
 Treat acoustic input O as sequence of individual observations
 O = o1,o2,o3,…,ot
 Define a sentence as a sequence of words:
 W = w1,w2,w3,…,wn
Wˆ  arg max P(W | O)
Bayes rule
WL
P(O | W ) P(W )
ˆ
W  arg max
P(O)
WL
Wˆ  arg max P(O | W ) P(W )
WL
Information and Communications
University
10 / 50
Golden rule
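As a toy illustration of this decision rule, the sketch below picks the hypothesis maximizing log P(O|W) + log P(W); the scores are made-up numbers, not the output of a real recognizer:

```python
# A minimal sketch of the noisy-channel decision rule: among a few
# hypothesized word sequences W, pick the one maximizing P(O|W) * P(W).
# The scores below are made-up log-probabilities for illustration.

hypotheses = [
    # (word sequence, acoustic log P(O|W), language-model log P(W))
    ("where is the bus stop", -210.3, -12.7),
    ("where is the best stop", -209.8, -15.2),
    ("wear is the bus top",   -211.0, -19.6),
]

# argmax over W of P(O|W) P(W)  ==  argmax of log P(O|W) + log P(W)
best = max(hypotheses, key=lambda h: h[1] + h[2])
print(best[0])  # -> "where is the bus stop"
```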
Speech Recognition Architecture Meets Noisy Channel
Wˆ  arg max P(O | W ) P(W )
버스 정류장이
어디에 있나요?
WL
O
Speech Signals
Feature
Extraction
Decoding
버스 정류장이
어디에 있나요?
W
Word Sequence
Network
Construction
Speech
DB
HMM
Estimation
Acoustic Pronunciation
Model
Model
G2P
Text
Corpora
LM
Estimation
Information and Communications
University
11 / 50
Language
Model
Feature Extraction
 The Mel-Frequency Cepstrum Coefficients (MFCC) is a popular choice
[Paliwal, 1992]
x(n) → pre-emphasis / Hamming window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|·| → DCT (Discrete Cosine Transform) → MFCC (12 dimensions)

 Frame size: 25 ms / frame rate: 10 ms
 39 features per 10 ms frame
 Absolute: log frame energy (1) and MFCCs (12)
 Delta: first-order derivatives of the 13 absolute coefficients
 Delta-delta: second-order derivatives of the 13 absolute coefficients
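A minimal sketch of this 39-dimensional front end, assuming the librosa package is available; librosa's 0th cepstral coefficient stands in for the slide's log frame energy, and "speech.wav" is a placeholder file name:

```python
# A minimal sketch of the 39-dimensional MFCC front end described above.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

n_fft = int(0.025 * sr)        # 25 ms frame size
hop_length = int(0.010 * sr)   # 10 ms frame rate

# 13 absolute coefficients: 12 MFCCs plus an energy-like 0th coefficient
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length)

delta = librosa.feature.delta(mfcc)            # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives

features = np.vstack([mfcc, delta, delta2])    # 39 features per 10 ms frame
print(features.shape)                          # (39, num_frames)
```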
Acoustic Model
 Provide P(O|Q) = P(features|phone)
 Modeling Units [Bahl et al., 1986]
 Context-independent : Phoneme
 Context-dependent : Diphone, Triphone, Quinphone
 p_L−p+p_R : a triphone for phone p with left context p_L and right context p_R
 Typical acoustic model [Juang et al., 1986]
 λ = (A, B, π)
 Continuous-density Hidden Markov Model
 Distribution : Gaussian Mixture
b_j(x_t) = Σ_{k=1}^{K} c_jk · N(x_t; μ_jk, Σ_jk)
 HMM Topology : 3-state left-to-right model for each phone, 1-state for silence
or pause
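The state output density above can be evaluated directly; a minimal sketch with placeholder (untrained) parameters and diagonal covariances:

```python
# A minimal sketch of the Gaussian-mixture state output density
# b_j(x_t) = sum_k c_jk N(x_t; mu_jk, Sigma_jk); parameters are random
# placeholders, not a trained model.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, K = 39, 4                          # feature dimension, mixture components

c = rng.dirichlet(np.ones(K))         # mixture weights c_jk, summing to 1
mu = rng.normal(size=(K, D))          # means mu_jk
var = rng.uniform(0.5, 2.0, (K, D))   # diagonal covariances Sigma_jk

def b_j(x):
    """Output likelihood of observation x for one HMM state j."""
    return sum(c[k] * multivariate_normal.pdf(x, mean=mu[k], cov=np.diag(var[k]))
               for k in range(K))

x_t = rng.normal(size=D)
print(b_j(x_t))
```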
Pronunciation Model
 Provide P(Q|W) = P(phone|word)
 Word Lexicon [Hazen et al., 2002]
 Map legal phone sequences into words according to phonotactic rules
 G2P (Grapheme to phoneme) : Generate a word lexicon automatically
 A word may have multiple pronunciations
 Example
[Figure: pronunciation network for “tomato” — [t] → ( [ow] 0.2 | [ah] 0.8 ) → [m] → ( [ey] 0.5 | [aa] 0.5 ) → [t] → [ow] ]
 P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1
 P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4
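The path probabilities above follow by multiplying the arc probabilities; a minimal sketch of the tomato network as a table of arcs:

```python
# A minimal sketch of the pronunciation network above: each arc carries a
# probability, and P(phone sequence | word) is the product of the arc
# probabilities along the matching path.

ARCS = {  # (state, phone) -> (next state, probability)
    (0, "t"):  (1, 1.0),
    (1, "ow"): (2, 0.2),
    (1, "ah"): (2, 0.8),
    (2, "m"):  (3, 1.0),
    (3, "ey"): (4, 0.5),
    (3, "aa"): (4, 0.5),
    (4, "t"):  (5, 1.0),
    (5, "ow"): (6, 1.0),
}

def pronunciation_prob(phones):
    state, prob = 0, 1.0
    for p in phones:
        if (state, p) not in ARCS:
            return 0.0
        state, arc_prob = ARCS[(state, p)]
        prob *= arc_prob
    return prob

print(pronunciation_prob("t ow m ey t ow".split()))  # 0.1
print(pronunciation_prob("t ah m aa t ow".split()))  # 0.4
```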
Training
 Training process [Lee et al., 1996]
[Figure: training loop — speech DB → feature extraction → Baum-Welch re-estimation, iterated until converged, yielding the HMM parameters.]

 Network for training
[Figure: a sentence HMM (e.g., “ONE TWO THREE ONE”) is the concatenation of word HMMs, and each word HMM (e.g., ONE = W AH N) is the concatenation of 3-state phone HMMs.]
Language Model
 Provide P(W): the probability of the sentence [Beaujard and Jardino, 1999]
 We saw this was also used in the decoding process as the probability of transitioning from one word to another
 Word sequence: W = w1,w2,w3,…,wn

P(w_1 … w_n) = ∏_{i=1}^{n} P(w_i | w_1 … w_{i−1})

 The problem is that we cannot reliably estimate the conditional word probabilities P(w_i | w_1 … w_{i−1}) for all words and all sequence lengths in a given language
 n-gram Language Model
 n-gram language models use the previous n−1 words to represent the history:

P(w_i | w_1 … w_{i−1}) ≈ P(w_i | w_{i−(n−1)} … w_{i−1})

 Bigrams are easily incorporated in a Viterbi search
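A minimal sketch of maximum-likelihood bigram estimation over a toy corpus:

```python
# A minimal sketch of maximum-likelihood bigram estimation:
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
from collections import Counter

corpus = [
    "<s> where is the bus stop </s>",
    "<s> where is the train station </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("is", "where"))   # 1.0
print(p_bigram("bus", "the"))    # 0.5
```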
Language Model
 Example
 Finite State Network (FSN)
[Figure: an FSN over the words 서울 (Seoul), 부산 (Busan), 대구 (Daegu), 대전 (Daejeon), 에서 (from), 세시 (three o'clock), 네시 (four o'clock), 출발 (depart), 도착 (arrive), 하는 (relativizer “that”), 기차 (train), 버스 (bus).]
 Context Free Grammar (CFG)
$time = 세시|네시;
$city = 서울|부산|대구|대전;
$trans = 기차|버스;
$sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
(e.g., “서울에서 세시 출발 하는 기차” = “the train departing from Seoul at three o'clock”)
 Bigram
P(에서|서울) = 0.2, P(세시|에서) = 0.5,
P(출발|세시) = 1.0, P(하는|출발) = 0.5,
P(출발|서울) = 0.5, P(도착|대구) = 0.9, …
Network Construction
 Expanding every word to state level, we get a search network
[Demuynck et al., 1997]
[Figure: building the search network for the Korean digits 일 (one) = /I L/, 이 (two) = /I/, 삼 (three) = /S A M/, 사 (four) = /S A/. The language model supplies the words, the pronunciation model expands each word into its phone sequence, and the acoustic model expands each phone into its HMM states. Intra-word transitions connect states within a word; between-word transitions loop from the end state back to the start, and it is on these word transitions that the language-model probabilities P(일|x), P(이|x), P(삼|x), P(사|x) are applied.]
Decoding
 Find Ŵ = argmax_{W ∈ L} P(W | O)
 Viterbi Search : Dynamic Programming
 Token Passing Algorithm [Young et al., 1989]
 Initialize every state with a token holding a null history and the likelihood that it is a start state
 For each frame a_k:
 For each token t in state s, with probability P(t) and history H:
 For each state r: add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H
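A minimal sketch of this token-passing recursion over a toy three-state HMM, keeping (Viterbi-style) only the best token arriving in each state; all numbers are placeholders:

```python
# A minimal sketch of the token-passing recursion above over a toy HMM:
# each token carries an accumulated probability and a state history.
import numpy as np

trans = np.array([[0.6, 0.4, 0.0],      # P_{s,r}: state transitions
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
start = np.array([1.0, 0.0, 0.0])

def emit(r, a_k):                        # placeholder output probability P_r(a_k)
    return [0.8, 0.5, 0.2][r] if a_k else [0.2, 0.5, 0.8][r]

frames = [1, 1, 0]                       # toy observation sequence

# initialize: one token per possible start state, with a null history
tokens = {s: (start[s], []) for s in range(3) if start[s] > 0}

for a_k in frames:
    new_tokens = {}
    for s, (p, hist) in tokens.items():
        for r in range(3):
            if trans[s, r] == 0:
                continue
            cand = (p * trans[s, r] * emit(r, a_k), hist + [s])
            # Viterbi: keep only the best token arriving in each state
            if r not in new_tokens or cand[0] > new_tokens[r][0]:
                new_tokens[r] = cand
    tokens = new_tokens

best_state, (prob, hist) = max(tokens.items(), key=lambda kv: kv[1][0])
print(prob, hist + [best_state])
```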
Decoding
 Pruning [Young et al., 1996]
 Entire search space for Viterbi search is much too large
 Solution is to prune tokens for paths whose score is too low
 Typical method is to use:
 histogram: only keep at most n total hypotheses
 beam: only keep hypotheses whose score is within a fraction of the best score
 N-best Hypotheses and Word Graphs
 Keep multiple tokens and return n-best paths/scores
 Can produce a packed word graph (lattice)
 Multiple Pass Decoding
 Perform multiple passes, applying successively more fine-grained language
models
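The histogram and beam pruning rules above are easy to state over a set of token log-scores; a minimal sketch:

```python
# A minimal sketch of the two pruning rules above applied to token scores
# (log-probabilities): histogram pruning keeps at most n hypotheses,
# beam pruning keeps those within a margin of the best.

def prune(tokens, n_max=3, log_beam=5.0):
    """tokens: dict state -> log score."""
    best = max(tokens.values())
    # beam: drop hypotheses too far below the best score
    kept = {s: p for s, p in tokens.items() if p >= best - log_beam}
    # histogram: keep at most n_max hypotheses overall
    top = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)[:n_max]
    return dict(top)

print(prune({"a": -10.0, "b": -12.5, "c": -30.0, "d": -11.0}))
# -> {'a': -10.0, 'd': -11.0, 'b': -12.5}   ('c' falls outside the beam)
```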
Large Vocabulary Continuous Speech Recognition (LVCSR)
 Decoding continuous speech over large vocabulary
 Computationally complex because of huge potential search space
 Weighted Finite State Transducers (WFST) [Mohri et al., 2002]
[Figure: a cascade of WFSTs — word-to-sentence, phone-to-word, HMM-to-phone, and state-to-HMM — is combined by WFST composition and then optimized into a single search network.]
 Dynamic Decoding
 On-demand network construction
 Much lower memory requirements
Out-of-Vocabulary Word Modeling [ch-icassp07]
 How can out-of-vocabulary (OOV) words be handled?
 Start with standard lexical
network
 Separate sub-word network is
created to model OOVs
 Add sub-word network to word
network as new word, Woov
 OOV model used to detect OOV
words and provide phonetic
transcription (Bazzi & Glass, 2000)
Mixture Language Models [ch-icassp07]
 When building a topic-specific language model:
 Topic-specific material may be limited and sparse
 Best results when combining with robust general model
 May desire a model based on a combination of topics
 …and with some topics weighted more heavily than others
 Topic mixtures is one approach (Iyer & Ostendorf, 1996)
 SRI Language Modeling Toolkit provides an open source implementation
(http://www.speech.sri.com/projects/srilm)
 A basic topic mixture-language model is defined as a weighted combination of N different topics T_1 to T_N: P(w | h) = Σ_{j=1}^{N} λ_j · P_{T_j}(w | h), with the topic weights λ_j summing to 1
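A minimal sketch of this mixture combination with two placeholder topic models (plain dictionaries standing in for trained n-gram models such as those built with the SRILM toolkit):

```python
# A minimal sketch of the topic-mixture combination:
# P(w|h) = sum_j lambda_j * P_Tj(w|h). The component models here are
# placeholder dictionaries, not trained language models.

topic_models = [
    {("the", "gradient"): 0.03,  ("the", "market"): 0.001},  # e.g. a math topic
    {("the", "gradient"): 0.001, ("the", "market"): 0.04},   # e.g. a finance topic
]
lambdas = [0.7, 0.3]   # topic weights, summing to 1

def p_mixture(w, h):
    return sum(lam * model.get((h, w), 0.0)
               for lam, model in zip(lambdas, topic_models))

print(p_mixture("gradient", "the"))  # 0.7*0.03 + 0.3*0.001 = 0.0213
```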
Automatic Alignment of Human Transcripts [ch-icassp07]
 Goal: Align transcript w/o time markers to long audio file
 Run recognizer over utterances to obtain word hypotheses
 Use language model strongly adapted to reference transcript
 Align reference transcript against word hypotheses
 Identify matched words (✓) and mismatched words (✗)
 Treat multi-word matched sequences as anchor regions
 Extract new segments starting and ending within anchors
 Force align reference words within each new segment si
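The anchor-finding step can be approximated with standard sequence alignment; a minimal sketch using Python's difflib, with hypothetical reference and hypothesis word lists:

```python
# A minimal sketch of the anchor-finding step: align the reference
# transcript against recognizer hypotheses and treat multi-word matched
# runs as anchor regions.
from difflib import SequenceMatcher

reference = "the cepstrum is the inverse transform of the log spectrum".split()
hypothesis = "the capstrum is the inverse transform of a log spectrum".split()

matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
anchors = [(m.a, m.b, m.size) for m in matcher.get_matching_blocks()
           if m.size >= 2]   # keep multi-word matches only

for a, b, size in anchors:
    print("anchor:", " ".join(reference[a:a + size]))
```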
Contents
 Why speech-based IR?
 Speech Recognition Technology
 Spoken document retrieval
 Ubiquitous IR using spoken dialog
Spoken Document Processing [ch-icassp07]
 The goal is to enable users to:
 Search for spoken documents as easily as they search for text
 Accurately retrieve relevant spoken documents
 Efficiently browse through returned hits
 Quickly find segments of spoken documents they would most like to listen to or
watch
 Information (or meta-data) to enable search and retrieval:
 Transcription of speech
 Text summary of audio-visual material
 Other relevant information:
 speakers, time-aligned outline, etc.
 slides, other relevant text meta-data: title, author, etc.
 links pointing to spoken document from the www
 collaborative filtering (who else watched it?)
Transcription of Spoken Documents [ch-icassp07]
 Manual transcription of audio material is expensive
 A basic text transcription of a one-hour lecture costs >$100
 Human-generated transcripts can contain many errors
 MIT study on commercial transcripts of academic lectures
 Transcripts show a 10% difference against true transcripts
 Many differences are actually corrections of speaker errors
 However, a ~2.5% word substitution rate is observed:
Substitution errors (misspelled words):
Furui → Frewey          Fourier → for your
Makhoul → McCool        Kullback → callback
Tukey → Tuki            a priori → old prairie
Eigen → igan            resonant → resident
Gaussian → galsian      affricates → aggregates
cepstrum → capstrum     palatal → powerful
Rich Annotation of Spoken Documents [ch-icassp07]
 Humans take 10 to 50 times real time to perform rich transcription of
audio data including:
 Full transcripts with proper punctuation and capitalization
 Speaker identities, speaker changes, speaker overlaps
 Spontaneous speech effects (false starts, partial words, etc.)
 Non-speech events and background noise conditions
 Topic segmentation and content summarization
 Goal: Automatically generate rich annotations of audio
 Transcription (What words were spoken?)
 Speaker diarization (Who spoke and when?)
 Segmentation (When did topic changes occur?)
 Summarization (What are the primary topics?)
 Indexing (Where were specific words spoken?)
 Searching (How can the data be searched efficiently?)
Text Retrieval [ch-icassp07]
 Collection of documents:
 “large” N: 10k-1M documents or more (videos, lectures)
 “small” N: < 1-10k documents (voice-mails, VoIP chats)
 Query:
 ordered set of words in a large vocabulary
 restrict ourselves to keyword search; other query types are clearly possible:
 Speech/audio queries (match waveforms)
 Collaborative filtering (people who watched X also watched…)
 Ontology (hierarchical clustering of documents, supervised or unsupervised)
Text Retrieval: Vector Space Model [ch-icassp07]
 Build a term-document co-occurrence (LARGE) matrix (Baeza-Yates, 1999)
 rows indexed by word
 columns indexed by documents
 TF (term frequency): frequency of word in document
 could be normalized to maximum frequency in a given document
 IDF (inverse document frequency): if a word appears in all documents
equally likely, it isn’t very useful for ranking
 (Bellegarda, 2000) uses normalized entropy
Text Retrieval: Vector Space Model (2) [ch-icassp07]
 For retrieval/ranking, one ranks the documents in decreasing order of relevance score
 Query weights have minimal impact since queries are very short, so one often uses a simplified relevance score, e.g. S(q, d) = Σ_{w ∈ q} TF(w, d) · IDF(w)
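A minimal sketch of that simplified TF-IDF score over a toy collection; this sketch also exhibits the hit-or-miss behavior discussed just below:

```python
# A minimal sketch of the simplified TF-IDF relevance score above:
# S(q, d) = sum over query words w of TF(w, d) * IDF(w).
import math
from collections import Counter

docs = {
    "d1": "coke is the most treasured asset of the soft drinks maker",
    "d2": "coca cola reported strong soft drinks sales",
    "d3": "the asset pricing model",
}
tf = {d: Counter(text.split()) for d, text in docs.items()}
N = len(docs)

def idf(w):
    df = sum(1 for c in tf.values() if w in c)   # document frequency
    return math.log(N / df) if df else 0.0

def score(query, d):
    return sum(tf[d][w] * idf(w) for w in query.split())

ranked = sorted(docs, key=lambda d: score("soft drinks", d), reverse=True)
print(ranked)   # documents containing the query terms rank first
```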
Text Retrieval: TF-IDF Shortcomings [ch-icassp07]
 Hit-or-Miss:
 returns only documents containing the query words
 query for Coca Cola will not return a document that reads:
 “… its Coke brand is the most treasured asset of the soft drinks maker …”
 Cannot do phrase search: “Coca Cola”
 needs post processing to filter out documents not matching the phrase
 Ignores word order and proximity
 query for Object Oriented Programming:
 “ … the object oriented paradigm makes programming a joy … “
 “ … TV network programming transforms the viewer into an object and it is oriented towards…”
Vector Space Model: Query/Document Expansion [ch-icassp07]
 Correct the Hit-or-Miss problem by doing some form of expansion on
the query and/or document side
 add similar terms to the ones in the query/document to increase number of terms
matched on both sides
 corpus-driven methods: TREC-7 (Singhal et al., 1999) and TREC-8 (Singhal et al., 2000)
 Query side expansion works well for long queries (10 words)
 short queries are very ambiguous and expansion may not work well
 Expansion works well for boosting Recall:
 very important when working on small to medium sized corpora
 typically comes at a loss in Precision
Vector Space Model: Latent Semantic Indexing [ch-icassp07]
 Correct the Hit-or-Miss problem by doing some form of dimensionality
reduction on the TF-IDF matrix
 Singular Value Decomposition (SVD) (Furnas et al., 1988)
 Probabilistic Latent Semantic Analysis (PLSA) (Hoffman, 1999)
 Non-negative Matrix Factorization (NMF)
 Matching of query vector and document vector is performed in the
lower dimensional space
 Good as long as the magic works
 Drawbacks:
 still ignores WORD ORDER
 users are no longer in full control of the search engine: humans are very good at crafting queries that will get them the documents they want, and expansion methods impair full use of their natural-language faculty
Probabilistic Models (Robertson, 1976) [ch-icassp07]
 Assume one has a probability model for generating queries and documents
 We would like to rank documents according to the point-wise mutual information between query and document; since P(Q) is constant across documents, this amounts to ranking by the query likelihood P(Q | D)
 One can model P(Q | D) using a language model built from each document (Ponte, 1998)
 Takes word order into account
 models query N-grams but not more general proximity features
 expensive to store
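A minimal sketch of document ranking by query likelihood; the Jelinek-Mercer smoothing against the collection model is a common choice assumed here to avoid zero probabilities, not necessarily the variant in (Ponte, 1998):

```python
# A minimal sketch of language-model IR: score each document by the
# likelihood its unigram model assigns to the query, smoothed against
# the whole-collection model.
import math
from collections import Counter

docs = {
    "d1": "the object oriented paradigm makes programming a joy",
    "d2": "tv network programming transforms the viewer",
}
doc_counts = {d: Counter(t.split()) for d, t in docs.items()}
coll = sum(doc_counts.values(), Counter())     # collection counts
coll_len = sum(coll.values())

def log_p_query(query, d, lam=0.8):
    counts, dlen = doc_counts[d], sum(doc_counts[d].values())
    s = 0.0
    for w in query.split():
        p = lam * counts[w] / dlen + (1 - lam) * coll[w] / coll_len
        s += math.log(p) if p > 0 else float("-inf")
    return s

query = "object oriented programming"
print(max(docs, key=lambda d: log_p_query(query, d)))  # -> d1
```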
Text Retrieval: Scaling Up [ch-icassp07]
 Linear scan of document collection is not an option for compiling the
ranked list of relevant documents
 Compiling a short list of relevant documents may allow for relevance score
calculation on the document side
 Inverted index is critical for scaling up to large collections of
documents
 think of the index at the end of a book, as opposed to leafing through it!
 All methods are amenable to some form of indexing:
 TF-IDF/SVD: compact index, drawbacks mentioned
 LM-IR: storing all N-grams in each document is very expensive
 significantly more storage than the original document collection
 Early Google: compact index that maintains word order information
and hit context
 relevance calculation, phrase based matching using only the index
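A minimal sketch of a positional inverted index supporting the phrase matching mentioned above, over a toy collection:

```python
# A minimal sketch of a positional inverted index: term -> list of
# (document, position) postings, enough for phrase matching without a
# linear scan of the collection.
from collections import defaultdict

docs = {
    "d1": "its coke brand is the most treasured asset",
    "d2": "coca cola is a soft drinks maker",
}

index = defaultdict(list)
for d, text in docs.items():
    for pos, w in enumerate(text.split()):
        index[w].append((d, pos))

def phrase_search(phrase):
    words = phrase.split()
    hits = []
    for d, pos in index.get(words[0], []):
        # a phrase hit needs each following word at the next position
        if all((d, pos + i) in index.get(w, []) for i, w in enumerate(words[1:], 1)):
            hits.append((d, pos))
    return hits

print(phrase_search("coca cola"))     # [('d2', 0)]
print(phrase_search("soft drinks"))   # [('d2', 4)]
```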
TREC SDR: “A Success Story” [ch-icassp07]
 The Text Retrieval Conference (TREC)
 pioneering work in spoken document retrieval (SDR)
 SDR evaluations from 1997-2000 (TREC-6 to TREC-9)
 TREC-8 evaluation:
 focused on broadcast news data
 22,000 stories from 500 hours of audio
 even fairly high ASR error rates produced document retrieval performance close
to human generated transcripts
 key contributions:
 Recognizer expansion using N-best lists
 query expansion, and document expansion
 conclusion: SDR is “A success story” (Garofolo et al, 2000)
 Why don’t ASR errors hurt performance?
 content words are often repeated providing redundancy
 semantically related words can offer support (Allan, 2003)
Broadcast News: SDR Best-case Scenario [ch-icassp07]
 Broadcast news SDR is a best-case scenario for ASR:
 primarily prepared speech read by professional speakers
 spontaneous speech artifacts are largely absent
 language usage is similar to written materials
 new vocabulary can be learned from daily text news articles
 state-of-the-art recognizers have word error rates <10%
 comparable to the closed captioning WER (used as reference)
 TREC queries were fairly long (10 words) and had a low out-of-vocabulary (OOV) rate
 the impact of query OOV rate on retrieval performance is high (Woodland et al., 2000)
 Vast amount of content is closed captioned
Beyond Broadcast News [ch-icassp07]
 Many useful tasks are more difficult than broadcast news
 Meeting annotation (e.g., Waibel et al, 2001)
 Voice mail (e.g., SCANMail, Bacchiani et al., 2001)
 Podcasts (e.g., Podzinger, www.podzinger.com)
 Academic lectures
 Primary difficulties due to limitations of ASR technology:
 highly spontaneous, unprepared speech
 topic-specific or person-specific vocabulary & language usage
 unknown content and topics potentially lacking support in general language
model
 wide variety of accents and speaking styles
 OOVs in queries: ASR vocabulary is not designed to recognize infrequent query
terms, which are most useful for retrieval
 General SDR still has many challenges to solve
Spoken Term Detection Task [ch-icassp07]
 A new Spoken Term Detection evaluation initiative from NIST
 Find all occurrences of a search term as fast as possible in heterogeneous audio
sources
               TREC                        STD
Documents      Broadcast news              BN, Switchboard, meetings
Languages      English                     English, Arabic, Mandarin
Query          Long                        Short (a few words)
System output  Ranked relevant documents   Location of the query in the audio;
                                           a decision score indicating how likely the term exists;
                                           an “actual” decision as to whether the detected term is a hit
 Objective of the evaluation
 Understand speed/accuracy tradeoffs
 Understand technology solution tradeoffs: e.g., word vs. phone recognition
 Understand technology issues for the three STD languages: Arabic, English, and
Mandarin
Text Retrieval: Evaluation [ch-icassp07]
 trec_eval (NIST) package requires reference annotations for
documents with binary relevance judgments for each query
 Standard Precision/Recall and Precision@N documents
 Mean Average Precision (MAP)
 R-precision (R=number of relevant documents for the query)
 Ranking on reference side is flat (ignored)
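A minimal sketch of Mean Average Precision over binary relevance judgments, with hypothetical rankings and judgment sets:

```python
# A minimal sketch of Mean Average Precision (MAP): average precision is
# accumulated at each rank where a relevant document is retrieved, then
# averaged over queries.

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank          # precision at this rank
    return total / len(relevant) if relevant else 0.0

queries = [
    (["d3", "d1", "d7"], {"d1", "d7"}),   # hypothetical ranking / judgments
    (["d2", "d5"], {"d2"}),
]
m_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
print(m_ap)
```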
Contents
 Why speech-based IR?
 Speech Recognition Technology
 Spoken document retrieval
 Ubiquitous IR using spoken dialog
Dialog System
 A system that provides an interface between the user and a computer-based application [Cole, 1997; McTear, 2004]
 Interaction on a turn-by-turn basis
 Dialog manager
 Control the flow of the dialog
 Main flow
 information gathering from user
 communicating with external application
 communicating information back to the user
 Three types of dialog system
 frame-based
 agent-based
 finite state- (or graph-) based (~ VoiceXML-based)
DARPA Communicator - Revisited
 From the DARPA Communicator framework to the POSTECH Ubiquitous Natural Language Dialog System [Lee et al., 2006]
 Architecture based on Communicator hub-client structure
 Adding back-end modules (contents DB assistance, dialog model building)
Spoken Language Understanding
 Spoken language understanding maps natural-language speech to a frame structure encoding its meaning [Wang et al., 2005]
 What's the difference between NLU and SLU?
 Robustness: noise and ungrammatical spoken language
 Domain dependence: deeper domain-level semantics (e.g., Person vs. Cast)
 Dialog: dialog-history-dependent, utterance-by-utterance analysis
 Traditional approach: natural language to SQL conversion
[Figure: a typical ATIS system (from [Wang et al., 2005]) — speech → ASR → text → SLU → semantic frame → SQL generation → database → response.]
Semantic Representation
 Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]
 An intermediate semantic representation to serve as the interface between
user and dialog system
 Each frame contains several typed components called slots. The type of a
slot specifies what kind of fillers it is expecting.
“Show me flights from Seattle to Boston”
<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>

Semantic representation on the ATIS task: XML format (above) and the equivalent hierarchical representation — ShowFlight → Subject: FLIGHT; Flight → Departure_City: SEA, Arrival_City: BOS [Wang et al., 2005]
Semantic Frame Extraction
 Semantic Frame Extraction (~ Information Extraction Approach)
1) Dialog act / Main action Identification ~ Classification
2) Frame-Slot Object Extraction ~ Named Entity Recognition
3) Object-Attribute Attachment ~ Relation Extraction
 1) + 2) + 3) ~ Unification
Examples of semantic frame structure:
“롯데월드에 어떻게 가나요?” (How do I get to Lotte World?)
Domain: Navigation / Dialog Act: WH-question / Main Action: Search / Object.Location.Destination = 롯데월드 (Lotte World)
“난 롯데월드가 너무 좋아.” (I really like Lotte World.)
Domain: Chat / Dialog Act: Statement / Main Action: Like / Object.Location = 롯데월드 (Lotte World)

[Figure: overall architecture of the semantic analyzer — extracted/selected features feed dialog act identification, frame-slot extraction, and relation extraction (plus external information sources), and their outputs are unified into the final semantic frame.]
The Role of Dialog Management
 For example, in the flight reservation system
 System: Welcome to the Flight Information Service. Where would you like to travel to?
 Caller: I would like to fly to London on Friday, arriving around 9 in the morning.
 System: ?????????? → There is a flight that departs at 7:45 a.m. and arrives at 8:50 a.m.
 In order to produce this response, the system has to engage in the following processes:
1) Recognize the words that the caller said (speech recognition)
2) Assign a meaning to these words (language understanding)
3) Determine how the utterance fits into the dialog so far and decide what to do next (dialog management)
Overall Architecture [on-going research]
[Figure: overall architecture — the speech recognizer output goes through linguistic analysis and a keyword feature extractor into generic SLU; an agent/domain spotter then routes the utterance, via dialog management, to either a task agent or a chat agent. Each agent has its own domain-specific SLU, discourse history, dialog (or chat) expert, and dialog example database, backed by a domain knowledge database. The resulting system utterance is rendered by text-to-speech.]
References – Recognition
 L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP, pp. 49–52.
 C. Beaujard and M. Jardino, 1999. Language modeling based on automatic word concatenations, in Proceedings of the 8th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
 K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context-dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
 T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
 M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol. 16, no. 1, pp. 69–88.
References – Recognition (cont.)
 B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
 C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.
 K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol. 2, pp. 157–173.
 L. R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
 L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
 S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
 S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book. Entropic Cambridge Research Lab., Cambridge, UK.
References – Understanding & Dialog
 J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: A natural language system for spoken language understanding. ACL, pp. 54–61.
 J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. ICSLP.
 J. Eun, M. Jeong, and G. G. Lee. 2005. A multiple classifier-based concept-spotting approach for robust spoken language understanding. Interspeech 2005–Eurospeech.
 D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
 Y. He and S. Young. January 2005. Semantic processing using the Hidden Vector State model. Computer Speech and Language, 19(1):85–106.
 M. Jeong and G. G. Lee. 2006. Exploiting non-local features for spoken language understanding. COLING/ACL.
 J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.
References – Understanding & Dialog (cont.)
 E. Levin and R. Pieraccini. 1995. CHRONUS, the next generation. In Proceedings of the 1995 ARPA Spoken Language Systems Technical Workshop, pp. 269–271, Austin, Texas.
 B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An architecture for dialogue systems. ICSLP.
 R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. ICML, pp. 538–545.
 S. Seneff. 1992. TINA: a natural language system for spoken language applications. Computational Linguistics, 18(1):61–86.
 G. Tur, D. Hakkani-Tur, and R. E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171–186.
 Y. Wang, L. Deng, and A. Acero. September 2005. Spoken language understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5).
References – Understanding & Dialog (cont.)
 J. F. Allen, B. Miller, E. Ringger, and T. Sikorski. 1996. A robust system for natural spoken dialogue. ACL.
 S. Bayer, C. Doran, and B. George. 2001. Dialogue interaction with the DARPA Communicator infrastructure: The development of useful software. HLT Research.
 R. Cole, editor. Survey of the State of the Art in Human Language Technology. Cambridge University Press, New York, NY, USA, 1997.
 G. Ferguson and J. F. Allen. 1998. TRIPS: An integrated intelligent problem-solving assistant. AAAI, pp. 26–30.
 K. Komatani, F. Adachi, S. Ueno, T. Kawahara, and H. Okuno. 2003. Flexible spoken dialogue system based on user models and dynamic generation of VoiceXML scripts. SIGDIAL.
 S. Larsson and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6(3–4).
 S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer. 2003. Providing the basis for human-robot interaction: A multi-modal attention system for a mobile robot. ICMI, pp. 28–35.
References – Understanding & Dialog (cont.)
 E. Levin, R. Pieraccini, and W. Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.
 C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A situation-based dialogue management using dialogue examples. ICASSP.
 M. Walker, L. Hirschman, and J. Aberdeen. 2000. Evaluation for DARPA Communicator spoken dialogue systems. LREC.
 M. F. McTear. Spoken Dialogue Technology. Springer, 2004.
 I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialog management in Java. Speech Communication, 54(1):99–124.
 B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An architecture for dialogue systems. ICSLP.
 A. Rudnicky, E. Thayer, P. Constantinides, C. Tchou, R. Shern, K. Lenzo, W. Xu, and A. Oh. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. Eurospeech, vol. 4, pp. 1531–1534.
 W3C, Voice Extensible Markup Language (VoiceXML) Version 2.0 Working Draft, http://www.w3c.org/TR/voicexml20/
References – Spoken Document Retrieval
 J. Allan, “Robust techniques for organizing and retrieving spoken documents”, EURASIP Journal on Applied Signal Processing, no. 2, pp. 103–114, 2003.
 C. Allauzen, M. Mohri, and B. Roark, “A general weighted grammar library”, in Proc. of International Conf. on the Implementation and Application of Automata, Kingston, Canada, July 2004.
 C. Allauzen, et al., “General indexation of weighted automata – application to spoken utterance retrieval”, in Proc. of HLT-NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 33–40, Boston, Massachusetts, USA, 2004.
 M. Bacchiani, et al., “SCANMail: audio navigation in the voicemail domain”, in Proc. of the HLT Conf., pp. 1–3, San Diego, 2000.
 R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, chapter 2, pp. 27–30. Addison Wesley, New York, 1999.
 I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition”, in Proc. of ICSLP, Beijing, China, October 2000.
 I. Bazzi and J. Glass, “Learning units for domain-independent out-of-vocabulary word modeling”, in Proc. of Eurospeech, Aalborg, September 2001.
References – Spoken Document Retrieval (cont.)
 S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine”, Computer Networks and ISDN Systems, vol. 30, pp. 107–117, 1998.
 C. Chelba and A. Acero, “Position specific posterior lattices for indexing speech”, in Proc. of the Annual Meeting of the ACL (ACL'05), pp. 443–450, Ann Arbor, Michigan, June 2005.
 R. Fagin, R. Kumar, and D. Sivakumar, “Comparing top k lists”, SIAM Journal of Discrete Math, vol. 17, no. 1, pp. 134–160, 2003.
 S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, “Speech-to-text and speech-to-speech summarization of spontaneous speech”, IEEE Trans. on Speech and Audio Processing, vol. 12, no. 4, pp. 401–408, July 2004.
 G. Furnas, et al., “Information retrieval using a singular value decomposition model of latent semantic structure”, in Proc. of ACM SIGIR Conf., pp. 465–480, Grenoble, France, June 1988.
 J. Garofolo, C. Auzanne, and E. M. Voorhees, “The TREC spoken document retrieval track: A success story”, in Proc. 8th Text REtrieval Conference (1999), vol. 500-246 of NIST Special Publication, pp. 107–130, NIST, Gaithersburg, MD, USA, 2000.
 J. Gauvain and C. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains”, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 2, pp. 291–298, April 1994.
References – Spoken Document Retrieval (cont.)
 J. Glass, T. Hazen, L. Hetherington, and C. Wang, “Analysis and processing of lecture audio data: Preliminary investigations”, in Proc. of the HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 9–12, Boston, May 2004.
 T. Hofmann, “Probabilistic latent semantic analysis”, in Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999.
 A. Inoue, T. Mikami, and Y. Yamashita, “Prediction of sentence importance for speech summarization using prosodic features”, in Proc. Eurospeech, 2003.
 R. Iyer and M. Ostendorf, “Modeling long distance dependence in language: Topic mixtures vs. dynamic cache models”, in Proc. ICSLP, Philadelphia, 1996.
 D. James, The Application of Classical Information Retrieval Techniques to Spoken Documents, PhD thesis, University of Cambridge, 1995.
 D. Jones, et al., “Measuring the readability of automatic speech-to-text transcripts”, in Proc. Eurospeech, Geneva, Switzerland, September 2003.
 G. Jones, J. Foote, K. Spärck Jones, and S. Young, “Retrieving spoken documents by combining multiple index sources”, in Proc. of ACM SIGIR Conf., pp. 30–38, Zurich, Switzerland, 1996.
References – Spoken Document Retrieval (cont.)
 K. Koumpis and S. Renals, “Transcription and summarization of voicemail speech”, in Proc. ICSLP, Beijing, October 2000.
 C. Leggetter and P. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models”, Computer Speech and Language, vol. 9, no. 2, pp. 171–185, April 1995.
 B. Logan, P. Moreno, and O. Deshmukh, “Word and sub-word indexing approaches for reducing the effects of OOV queries on spoken audio”, in Proc. of HLT, San Diego, March 2002.
 I. Malioutov and R. Barzilay, “Minimum cut model for spoken lecture segmentation”, in Proc. of COLING-ACL, 2006.
 S. Matsoukas, et al., “BBN CTS English System”, available at http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations.
 K. Ng, Subword-Based Approaches for Spoken Document Retrieval, PhD thesis, Massachusetts Institute of Technology, 2000.
 NIST, the TREC evaluation package, available at http://www-nlpir.nist.gov/projects/trecvid/trecvid.tools/trec_eval
 D. W. Oard, et al., “Building an information retrieval test collection for spontaneous conversational speech”, in Proc. ACM SIGIR Conf., pp. 41–48, New York, 2004.
References – Spoken Document Retrieval (cont.)
 J. Ponte and W. Croft, “A language modeling approach to information retrieval”, in Proc. ACM SIGIR, pp. 275–281, Melbourne, Australia, August 1998.
 J. Silva Sanchez, C. Chelba, and A. Acero, “Pruning analysis of the position specific posterior lattices for spoken document search”, in Proc. of ICASSP, Toulouse, France, May 2006.
 M. Saraclar and R. Sproat, “Lattice-based search for spoken utterance retrieval”, in Proc. of HLT-NAACL 2004, pp. 129–136, Boston, May 2004.
 F. Seide and P. Yu, “Vocabulary-independent search in spontaneous speech”, in Proc. of ICASSP, Montreal, Canada, 2004.
 F. Seide and P. Yu, “A hybrid word/phoneme-based approach for improved vocabulary-independent search in spontaneous speech”, in Proc. of ICSLP, Jeju, Korea, 2004.
 M. Siegler, Integration of Continuous Speech Recognition and Information Retrieval for Mutually Optimal Performance, PhD thesis, Carnegie Mellon University, 1999.
 A. Singhal, J. Choi, D. Hindle, D. Lewis, and F. Pereira, “AT&T at TREC-7”, in Text REtrieval Conference, pp. 239–252, 1999.
References – Spoken Document Retrieval (cont.)
 A. Singhal, S. Abney, M. Bacchiani, M. Collins, D. Hindle, and F. Pereira, “AT&T at TREC-8”, in Text REtrieval Conference, pp. 317–330, 2000.
 O. Siohan and M. Bacchiani, “Fast vocabulary-independent audio search using path-based graph indexing”, in Proc. of Interspeech, Lisbon, Portugal, 2005.
 J. M. Van Thong, et al., “SpeechBot: An experimental speech-based search engine for multimedia content on the web”, IEEE Trans. on Multimedia, vol. 4, no. 1, March 2002.
 A. Waibel, et al., “Advances in automatic meeting record creation and access”, in Proc. of ICASSP, Salt Lake City, May 2001.
 P. Woodland, S. Johnson, P. Jourlin, and K. Spärck Jones, “Effects of out of vocabulary words in spoken document retrieval”, in Proc. of SIGIR, pp. 372–374, Athens, Greece, 2000.