Retrieving Spoken Information by
Combining Multiple Speech Transcription Methods
Jonathan Mamou
Joint work with
Ron Hoory, David Carmel, Yosi Mass, Bhuvana Ramabhadran, Benjamin Sznajder
IBM Haifa Research Lab
© 2008 IBM Corporation

Motivation

Spoken data is everywhere!
- Conference meetings
- Call centers
- Broadcast news
- Surveillance & security

IR Tasks on Speech Data
- Spoken Document Retrieval (SDR)
  - Traditional search-engine approach: find spoken documents relevant to a query.
- Spoken Term Detection (STD)
  - Detect occurrences of a phrase in spoken documents.
  - NIST STD evaluation

Approaches for Speech Information Retrieval
- Keyword spotting
  - Based on direct detection of a predefined set of keywords in the speech data.
- Build an index from automatic transcription output
  - Based on full transcription of the audio and indexing of the transcription output.
  - This is the approach we are using.
- Part of this work was done in the framework of SAPIR, an EU FP6 project on search in audiovisual content using P2P.

Overview

[Architecture diagram: speech data is fed to an Automatic Speech Recognition system driven by an acoustic model, a language model, and a vocabulary; the ASR output is indexed, and a search engine answers queries over the index, returning ranked results.]

Why is it different from classic text IR?
- The classic text IR solution would be to index and search the 1-best word transcript.
- However, two main issues can arise during the transcription of the speech data:
  - Errors (substitutions, deletions, insertions) can occur during the transcription.
  - Out-of-vocabulary (OOV) terms can be present in the spoken data and in the query.
- OOV words are words missing from the ASR system vocabulary.
  - They are replaced in the output transcript by alternatives that are probable, given the acoustic model, vocabulary and language model of the ASR system.
  - e.g., TALIBAN → TELL A BAND
- Over 10% of user queries can be OOV terms (especially named entities).

Influence of the WER on the Retrieval
- Substitutions and deletions reflect the fact that a term “appearing” in the speech signal is not recognized.
  - They impact the recall of the search (i.e., the fraction of the documents relevant to the query that are successfully retrieved).
- Substitutions and insertions reflect the fact that a term which is not part of the speech signal appears in the transcript.
  - They impact the precision of the search (i.e., the fraction of the retrieved documents that are relevant to the query).
- These issues may dramatically affect the effectiveness of the retrieval and prevent the “naïve” search engine from retrieving the information.

Technical Approach
- We have developed algorithms
  - to improve search effectiveness in the presence of errors,
  - to allow OOV queries.
- Indexing of the Word Confusion Network (WCN), including word alternatives and corresponding confidences, for IV terms.
- Phonetic indexing and fuzzy search.

Retrieval Model

Word Search
- We index Word Confusion Networks (WCNs) [Mangu et al., 2000].
  - A WCN is a compact representation of a word lattice: the different word hypotheses that appear at the same time are aligned.
  - A vertex is associated with a timestamp.
  - An edge is labeled with
    - a word hypothesis,
    - its posterior probability: the probability of the word given the signal.
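
To make the indexed structure concrete, here is a minimal sketch in Python of a WCN as a sequence of aligned slots, with a helper for the hypothesis rank used later in the ranking model. The class and function names are illustrative, not taken from the actual system.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    word: str         # word hypothesis labeling an edge
    posterior: float  # Pr(word | signal)

# A WCN is a sequence of slots; each slot aligns the competing word
# hypotheses observed over the same time span.
WCN = list[list[Hypothesis]]

def rank_of(slot: list[Hypothesis], word: str) -> int | None:
    """1-based rank of `word` among the slot's hypotheses, best first."""
    for i, hyp in enumerate(sorted(slot, key=lambda h: -h.posterior)):
        if hyp.word == word:
            return i + 1
    return None
```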

A fragment of a WCN

[Figure: a WCN fragment showing aligned word slots with posterior probabilities, e.g. (have 61% | and 39%) → (glasses 27% | graphic 22% | impressions 19% | graphics 13% | interested 9% | impresses 7% | grass 3%) → (on 100%) → (my 100%) → (screen 99% | seen 1%).]

Improving Retrieval Effectiveness using WCNs
- Recall is enhanced by expanding the 1-best transcript with extra words taken from the other alternatives provided by the WCN.
  - These alternatives may have been spoken but were not the top choice of the ASR.
  - However, such an expansion will probably decrease the precision!
- Using an intelligent ranking model, we can improve the mean average precision (MAP) of the search.
  - Average precision is the average of the precisions computed after truncating the list of results after each of the relevant documents in turn.
  - MAP emphasizes returning more relevant documents earlier.

Improving Retrieval Effectiveness using WCNs
- We exploit two pieces of information provided by the WCN about the occurrences of a term to improve our ranking model:
  - the posterior probability of the hypothesis given the signal,
  - the rank of the hypothesis among the other alternatives.

Posterior Probability of the Hypothesis: Confidence Level
- The posterior probability of the hypothesis given the signal reflects the confidence of the ASR in the hypothesis.
- The retrieval process will boost documents in which the query term occurs with higher probability.
- We denote by Pr(t|o,D) the posterior probability of a term t at offset o in the WCN of a document D.

Rank of the Hypothesis: Relative Importance
- The rank of the hypothesis among the other alternatives reflects the importance of the term relative to the other alternatives.
- A document containing a query term that is ranked higher should be preferred over a document where the same term is ranked lower.
- We denote by rank(t|o,D) the rank of a term t at offset o in the WCN of a document D.
- A boosting vector B = (B_1, …, B_l) associates a boosting factor with each rank of the different hypotheses.

Scoring
- Our scoring is based on the Vector Space Model (VSM) [Salton and McGill, 1986].
  - It is an algebraic model for representing documents as vectors of words.
  - Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is its tf-idf weight.
  - Relevance ranking of documents is calculated by comparing the cosine of the angle between each document vector and the query vector, where the query is represented as the same kind of vector as the documents.
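
Written out, the cosine ranking the slide describes is the standard VSM formula (a textbook formulation, not specific to this system):

$$\mathrm{score}(q,d) \;=\; \cos(\vec{q},\vec{d}) \;=\; \frac{\sum_{t} w_{t,q}\, w_{t,d}}{\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}$$

where $w_{t,d}$ is the tf-idf weight of term $t$ in document $d$.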

Scoring
- Term frequency – inverse document frequency (tf-idf)
  - This weight is a statistical measure used to evaluate how important a word is to a document in a corpus.
  - The importance increases proportionally to the number of times a word appears in the document (term frequency, tf) but is offset by the frequency of the word in the corpus (inverse document frequency, idf).
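
For reference, one standard form of the weight (the slides do not pin down the exact variant used here):

$$\mathrm{tfidf}(t,D) \;=\; \mathrm{tf}(t,D) \cdot \log \frac{N}{\mathrm{df}(t)}$$

where $N$ is the number of documents in the corpus and $\mathrm{df}(t)$ the number of documents containing $t$.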

Term Frequency
- The term frequency is evaluated by summing the posterior probabilities of all the occurrences of the term over the document.
- The term frequency is boosted according to the rank of the term among the other hypotheses:

$$\mathrm{tf}(t,D) \;=\; \sum_{i=1}^{|\mathrm{occ}(t,D)|} B_{\mathrm{rank}(t|o_i,D)} \cdot \Pr(t|o_i,D)$$

where occ(t,D) = (o_1, o_2, …) is the sequence of all the occurrences of t in D.
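
A minimal sketch of this boosted term frequency, reusing the illustrative WCN structure from the earlier sketch; the values of the boosting vector B are an assumption, since the slides do not specify them.

```python
def boosted_tf(term: str, wcn: WCN, B: list[float]) -> float:
    """tf(t, D): sum of B[rank] * posterior over all occurrences of `term`."""
    tf = 0.0
    for slot in wcn:
        ranked = sorted(slot, key=lambda h: -h.posterior)
        for rank, hyp in enumerate(ranked, start=1):
            if hyp.word == term and rank <= len(B):
                tf += B[rank - 1] * hyp.posterior  # B is 1-indexed conceptually
    return tf
```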

Phonetic Search
- Different kinds of phonetic transcripts:
  - sub-word decoding [Siohan and Bacchiani, 2005],
  - sub-word representation of the automatic 1-best word transcript.
  - A sub-word can be a word-fragment, a syllable, or a phone.
- Sub-word transcripts have a high error rate.
  - Phonetic transcription therefore cannot be an alternative to word transcripts, especially for in-vocabulary (IV) search.
  - That is the reason why we need to combine word transcripts with phonetic transcripts.

Phonetic Search
- Relevant to both IV and OOV search.
- N-gram or sub-word based indexing.
- Retrieval approaches:
  - Exact search: high precision but low recall.
  - Fuzzy search: improves recall while decreasing precision.
    - Using an intelligent ranking model, we can improve the mean average precision of the search.
    - Based on the edit distance between pronunciations; we have implemented a fail-fast dynamic-programming algorithm for computing it (a sketch follows this list).
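
A minimal sketch of a fail-fast edit-distance computation over phone sequences, assuming “fail-fast” means abandoning as soon as the distance provably exceeds a threshold; the actual algorithm in the system may differ.

```python
def bounded_edit_distance(a: list[str], b: list[str], max_dist: int) -> int | None:
    """Levenshtein distance between phone sequences `a` and `b`,
    or None as soon as it provably exceeds `max_dist`."""
    if abs(len(a) - len(b)) > max_dist:
        return None  # the length gap alone already exceeds the bound
    prev = list(range(len(b) + 1))  # DP row for the empty prefix of `a`
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        if min(curr) > max_dist:
            return None  # fail fast: every extension is already too costly
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None
```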

Scoring
- Our scoring model extends tf-idf.
- Let's consider a query that is represented by the phonetic pronunciation ph.
- sim(ph, ph′) is the edit-distance-based similarity between two phonetic pronunciations ph and ph′.
- Term frequency:

$$\mathrm{tf}(ph,D) \;=\; \sum_{ph' \in D} \mathrm{sim}(ph, ph')$$

- Document frequency:

$$\mathrm{df}(ph) \;=\; \frac{\left|\{D : \exists\, ph' \in D \ \text{s.t.}\ \mathrm{sim}(ph, ph') > 0\}\right|}{N}$$

where N is the number of documents in the corpus.
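
A minimal sketch of these two quantities, assuming a document is represented as a list of phone sequences and reusing the bounded edit distance above; the similarity normalization is an illustrative choice, not the system's.

```python
def sim(ph: list[str], other: list[str], max_dist: int = 2) -> float:
    """Edit-distance-based similarity in [0, 1]; 0 beyond the bound."""
    d = bounded_edit_distance(ph, other, max_dist)
    return 0.0 if d is None else 1.0 - d / max(len(ph), len(other))

def tf_phonetic(ph: list[str], doc: list[list[str]]) -> float:
    """tf(ph, D): summed similarity over the document's pronunciations."""
    return sum(sim(ph, other) for other in doc)

def df_phonetic(ph: list[str], corpus: list[list[list[str]]]) -> float:
    """df(ph): fraction of documents with at least one fuzzy match."""
    matching = sum(1 for doc in corpus
                   if any(sim(ph, other) > 0 for other in doc))
    return matching / len(corpus)
```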

Phonetic Query Expansion
- Compensates for
  - OOV terms,
  - spelling variations.
- Each query term is converted to its phonetic pronunciations using a joint maximum entropy N-gram model [Chen, 2003].
- Each pronunciation is associated with a score that reflects the probability of this pronunciation, normalized by the probability of the best pronunciation, given the spelling.

Phonetic Query Expansion
- Let's consider a query term t that is expanded to (ph_1, s_1), …, (ph_m, s_m), where ph_i is a pronunciation and s_i its associated score.
- The score of t in D is given by aggregating the scores of searching D for the pronunciations ph_i, weighted by their scores s_i:

$$\mathrm{score}(t,D) \;=\; \frac{\sum_{i} s_i \cdot \mathrm{tfidf}(ph_i, D)}{\sum_{i} s_i}$$
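
A minimal sketch of this weighted aggregation; `tfidf_phonetic` stands in for a tf-idf built from the phonetic tf and df above, and all names are illustrative.

```python
def expanded_score(expansions, doc, tfidf_phonetic) -> float:
    """expansions: list of (pronunciation, score) pairs for one query term.
    Returns the score-weighted average of per-pronunciation tf-idf values."""
    num = sum(s * tfidf_phonetic(ph, doc) for ph, s in expansions)
    den = sum(s for _, s in expansions)
    return num / den if den > 0 else 0.0
```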

Combination of Word Search with Phonetic Search
- Using the Threshold Algorithm [Fagin, 1996]
  - Merge the result lists of documents returned respectively by the word and phonetic searches, ordered according to their scores.
- Using inverted indices with Boolean constraints (see the sketch after this list)
  - Merge the posting lists extracted from the inverted indices (word and phonetic), ordered by document identifier, according to the Boolean constraints.
  - Based on query rewriting to combine the word and phonetic parts of the original query.
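
As an illustration of the second strategy, here is a minimal sketch of a document-ordered merge of two posting lists under an AND constraint; the real system's posting format and score combination are not given in the slides.

```python
def merge_and(word_postings, phone_postings):
    """Intersect two posting lists sorted by document id.
    Each posting is a (doc_id, score) pair; scores are summed on a match."""
    results, i, j = [], 0, 0
    while i < len(word_postings) and j < len(phone_postings):
        d1, s1 = word_postings[i]
        d2, s2 = phone_postings[j]
        if d1 == d2:
            results.append((d1, s1 + s2))  # document satisfies both parts
            i += 1
            j += 1
        elif d1 < d2:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return results
```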

Experiments

Experimental Setup: Call-Center Data
- 2236 calls made to the IBM internal customer support service.
- The calls deal with a large range of software and hardware problems.
- The average length of a call is 18 minutes.

Precision and Recall vs. WER
- As expected, indexing all WCN candidates improves recall while reducing precision.
- Both recall and precision decrease as the WER increases.

Experiments with Several Retrieval Strategies over WCNs
- 1-best WCN TF
  - Index: 1-best transcript obtained from the WCN. Ranking: classic tf-idf.
- All WCN TF
  - Index: all the WCN hypotheses. Ranking: classic tf-idf.
- 1-best WCN CL
  - Index: 1-best transcript obtained from the WCN. Ranking: confidence levels.
- All WCN CL
  - Index: all the WCN hypotheses. Ranking: confidence levels.
- All WCN CL boost
  - Index: all the WCN hypotheses. Ranking: confidence levels and rank among the other hypotheses.

MAP vs. WER
- Using confidence level information provides a significant contribution.
- All WCN CL boost always outperforms the other models, especially at high WER.

Experimental Setup: NIST STD Data
- Data set provided by NIST for the STD evaluation: 3 hours of broadcast news.
- We built three different indices:
  - Word: a word index on the WCN.
  - WordPhone: a phonetic index of the phonetic representation of the 1-best word decoding.
  - Phone: a phonetic index of the 1-best word-fragment decoding.
- For phonetic retrieval, we compared two different search methods: exact and fuzzy match.

MAP of Phonetic Query Expansion for OOV Search

Phonetic Search Method | WordPhone | Phone | Merge
-----------------------|-----------|-------|------
Exact                  | 0.31      | 0.27  | 0.37
Exact+expansion        | 0.32      | 0.29  | 0.39
Fuzzy                  | 0.40      | 0.39  | 0.47
Fuzzy+expansion        | 0.42      | 0.40  | 0.48

- The MAP of phonetic retrieval improves by up to 7.5% with query expansion, with respect to the baseline search approaches.

MAP for Hybrid Search

Semantics | Word | WordPhone | Phone | Merge
----------|------|-----------|-------|------
OR        | 0.59 | 0.54      | 0.48  | 0.73
AND       | 0    | 0.50      | 0.36  | 0.57

- Queries combine IV and OOV terms under different query semantics.
- The merge approach improves over both the word and the phonetic approaches.

Conclusions
- The approach:
  - The word-based approach suffers from the limited vocabulary of the recognition system.
  - The phonetic-based approach suffers from lower accuracy.
  - Our spoken information retrieval system combines both approaches.
- Recall and MAP are significantly improved by searching
  - all the hypotheses provided by the WCN,
  - in phonetic transcripts.
- This approach received the highest overall ranking for US English speech data in the last NIST Spoken Term Detection evaluation (December 2006).

References
- Jonathan Mamou, David Carmel, Ron Hoory. Spoken Document Retrieval from Call-Center Conversations. SIGIR 2006.
- Jonathan Mamou, Bhuvana Ramabhadran, Olivier Siohan. Vocabulary Independent Spoken Term Detection. SIGIR 2007.
- Walter Allasia, Francesco Gallo, Fabrizio Falchi, Mouna Kacimi, Aaron Kaplan, Jonathan Mamou, Yosi Mass, Nicola Orio. Audio-visual Content Analysis in P2P: the SAPIR Approach. Workshop on Automated Information Extraction in Media Production, DEXA 2008.
- Jonathan Mamou, Yosi Mass, Bhuvana Ramabhadran, Benjamin Sznajder. Combination of Multiple Speech Transcription Methods for Vocabulary Independent Search. Search in Spontaneous Conversational Speech Workshop, SIGIR 2008.
- Jonathan Mamou, Bhuvana Ramabhadran. Phonetic Query Expansion for Spoken Document Retrieval. Interspeech 2008.

Thank you!