
Speech Assessment: Methods and
Applications for Spoken Language Learning
J.-S. Roger Jang (張智星)
[email protected]
http://www.cs.nthu.edu.tw/~jang
Multimedia Information Retrieval Lab
CS Dept, Tsing Hua Univ, Taiwan
Outline
Introduction to speech assessment
Methods
Using learning to rank for speech assessment
Demos
Conclusions
Intro. to Speech Assessment
Goal
Evaluate a person’s utterance based on some
acoustic features, for language learning
Also known as
Pronunciation scoring
CAPT (computer-assisted pronunciation training)
Computer-Assisted Language
Learning (CALL)
4 aspects of CALL
ListeningEasier
SpeakingHarder
ReadingEasier
WritingHarder
Receptive skills are
easier to be assisted by
computers, while
productive skills are
harder to evaluate
automatically.
SA plays an essential
role in CALL for
speech/pronunciation
scoring.
Speech Assessment
Characteristics of ideal SA
Assessment levels: as detailed as possible
Syllables, words, sentences, paragraphs
Assessment criteria: as many as possible
timbre, tone, energy, rhythm, co-articulation, …
Feedback: as specific as possible
High-level correction and suggestions
Basic Assessment Criteria
Timbre (咬字/音色)
Based on acoustic
models
Tone (音調/音高)
Based on tone
recognition (for tonal
language)
Based on pitch
similarity with the target
utterance
Rhythm (韻律/音長)
Based on duration
comparison with the
target utterance
Energy (強度/音量)
Based on energy
comparison with the
target utterance
Additional Assessment Criteria
English
Stress (重音)
Levels (word or sentence)
Intonation (整句音調)
Declarative sentence
Interrogative sentence
Co-articulation (連音)
A red apple.
Did you call me?
Won’t you go?
Raise your hand.
Mandarin
Tone (聲調)
Retroflex (捲舌音)
Co-articulation (連音)
兒化音 (erhua)
Others
Pause
Types of SA
Types of SA (ordered by difficulty)
Type 1: target text available, target utterance available
Type 2: target text available, no target utterance
Type 3: no target text, target utterance available
Type 4: no target text, no target utterance
We focus on types 1 and 2.
Our Approach
Basic approach to timbre assessment
Lexicon net construction (Usually a sausage net)
Forced alignment to identify phone boundaries
Phone scoring based on several criteria, such as
ranking, histograms, posterior prob., etc.
Weighted average to get syllable/sentence scores
Lexicon Net Construction
Lexicon net for “what are you allergic to?”
Sausage net with all possible (and correct)
multiple pronunciations
Optional sil between words
Lexicon Net with Confusing Phones
Common errors for Japanese learners of Chinese:
ㄖ → ㄌ, e.g., 天氣熱 → 天氣樂
ㄑ → ㄐ, e.g., 打哈欠 → 打哈見
ㄘ → ㄗ, e.g., 一次旅行 → 一字旅行
ㄢ → ㄤ, e.g., 晚安 → 晚ㄤ
Rule-based approach to creating confusing syllables
Rules:
Rule 1: re → le
Rule 2: qi → ji
Rule 3: ci → zi
Rule 4: an → ang
Example: 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
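The rule substitution can be sketched in Python. This is an illustrative sketch, not the original implementation: the function name and the single-pass strategy are assumptions, and a real system would additionally filter the generated strings against the list of valid Mandarin syllables (e.g., applying an → ang to tian would yield the invalid tiang).

```python
# RULES come from the slide; everything else is illustrative.
RULES = [("re", "le"), ("qi", "ji"), ("ci", "zi"), ("an", "ang")]

def confusing_variants(syllable):
    """Apply every subset of the substitution rules once to the syllable."""
    variants = {syllable}
    for src, dst in RULES:
        # One pass per rule, so a produced "ang" is never rewritten again.
        variants |= {v.replace(src, dst) for v in variants}
    variants.discard(syllable)
    return sorted(variants)

print(confusing_variants("qian"))  # ['jian', 'jiang', 'qiang'] — matches 見/嗆/降
```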
Lexicon Net with Confusing Phones
Lexicon net for “天氣熱、打哈欠”
Canonical form: tian qi re da ha qian
16 variant paths in the net:
Automatic Confusing Syllable Id.
Corpus of Japanese learners
Of Chinese
強制對位以得到初步切音結果
對華語411音節進行比對
以找出每個音的混淆音
將混淆音節加入辨識網路
再進行強制對位及切音
No
切音結果不再變動?
Yes
輸出混淆音節
及辨識網路
Error Pattern Identification (EPI)
Common insertions/deletions from users
Using 「朝辭白帝彩雲間」 as the target sentence:
Ending anywhere, e.g., 「朝辭白帝」
Starting anywhere, e.g., 「彩雲間」
Starting and ending anywhere, e.g., 「白帝彩雲」
Starting and ending anywhere with skipped characters, e.g., 「白彩雲」
Repeated character, e.g., 「朝…朝辭白帝彩雲間」
Repeated word, e.g., 「朝辭…朝辭白帝彩雲間」
Repeated character with a pronunciation change, e.g., 「朝(cao)…朝(zhao)辭白帝彩雲間」
Swapped characters, e.g., 「朝辭彩帝白雲間」
Wrong character, e.g., 「朝辭白帝黑山間」
Lexicon Net for EPI (I)
Detects utterances that start at the beginning and end anywhere
Lexicon Net for EPI (II)
Detects utterances that start anywhere and end at the end
Lexicon Net for EPI (III)
Detects utterances that start anywhere and end anywhere (no skipped characters)
Lexicon Net for EPI (IV)
Detects utterances that start anywhere, end anywhere, and may skip characters
Design Philosophy of Lexicon Nets
We need to strike a balance between recognition accuracy and lexicon coverage
In the extreme, we could use a net for free syllable decoding to catch all error patterns
However, the feasibility of free syllable decoding is offset by its relatively low recognition rate
Scoring Methods for Speech Assessment
Five phone-based scoring methods
Duration-distribution scores
Log-likelihood scores
Log-posterior scores
Log-likelihood-distribution scores
Rank ratio scores
All based on forced alignment to segment
phones
Method 1: Duration-distribution Scores
PDF of phone duration
Obtained from forced alignment
Normalized by speech rate
Fitted by log-normal PDF
Max of PDF → score of 100
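Method 1 can be sketched as follows, assuming a maximum-likelihood log-normal fit to the rate-normalized phone durations; the function names are illustrative, not from the original system:

```python
import math

def fit_lognormal(durations):
    """MLE fit of a log-normal PDF to rate-normalized phone durations."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma

def lognormal_pdf(d, mu, sigma):
    return math.exp(-(math.log(d) - mu) ** 2 / (2 * sigma ** 2)) \
        / (d * sigma * math.sqrt(2 * math.pi))

def duration_score(d, mu, sigma):
    """Scale the PDF so its maximum (at the mode exp(mu - sigma^2)) maps to 100."""
    mode = math.exp(mu - sigma ** 2)
    return 100.0 * lognormal_pdf(d, mu, sigma) / lognormal_pdf(mode, mu, sigma)
```

A phone whose normalized duration sits at the mode of the fitted PDF scores 100; unusually short or long phones score lower.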
Method 2: Log-likelihood Scores
Log-likelihood of phone $q_i$ with duration of $d$ frames:
$$\hat{l} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(y_t \mid q_i)$$
where $p(y_t \mid q_i)$ is the likelihood of frame $t$ with observation vector $y_t$.
Method 3: Log-posterior Scores
Log-posterior of phone $q_i$ with duration $d$:
$$\hat{\rho} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q_i \mid y_t)$$
where
$$P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\, P(q_i)}{\sum_{j=1}^{m} p(y_t \mid q_j)\, P(q_j)}$$
Method 4: Log-likelihood-distribution Scores
Use the CDF of a Gaussian fitted to the log-likelihood
CDF = 1 → score = 100
Method 5: Rank Ratio Scores
Rank ratio:
$$rr(q_j) = \frac{\mathrm{rank}(q_j) - 1}{\#\ \text{of competing phones} - 1}$$
RR-to-score conversion:
$$score(q_j; a, b) = \frac{100}{1 + \left( rr(q_j) / a \right)^{b}}$$
where parameters $a$, $b$ are phone specific.
Possible sets of competing phones for x+y:
*+y
*+*
Intro. to Learning to Rank
Learning to rank
A supervised learning algorithm that generates a ranking model from a training set of partially ordered items.
Methods
Pointwise (e.g., Pranking)
Pairwise (e.g., RankSVM, RankBoost, RankNet)
Listwise (e.g., ListNet)
Application of LTR to SA
Why use LTR for SA?
Human scoring is rank-based: A+, A, B, B-…
Tsing Hua’s grading system is moving from scores
(0~100) to ranks (A, B, C, D…).
Combination of features (scores)
Features are complementary.
Effective determination of ranking
LTR only generates numerical output whose ranking order is as close as possible to the correct order; an optimum DP-based approach is proposed to convert scores to ranks.
LTR Score Segmentation
Given: LTR scores $s = (s_1, s_2, \ldots, s_n)$ (sorted)
Desired ranks $r = (r_1, r_2, \ldots, r_n)$
We want to find the separating scores $\theta = (\theta_1, \theta_2, \ldots, \theta_{m-1})$ of the score-to-rank function $s2r(s, \theta)$
such that $J(\theta) = \sum_{i=1}^{n} \lvert r_i - s2r(s_i, \theta) \rvert$ is minimized.
(Figure: the sorted score axis $s$ partitioned by $\theta_1, \ldots, \theta_4$ into Rank 1 through Rank 5.)
LTR Score Segmentation by DP (I)
Formulate the problem in a DP framework
Optimum-value function $D(i,j)$: the minimum cost of mapping $(s_1, s_2, \ldots, s_i)$ to ranks $(1, 2, \ldots, j)$
Recurrent equation:
$$D(i,j) = \lvert r_i - j \rvert + \min\{D(i-1,j),\, D(i-1,j-1)\}$$
Boundary condition: $D(1,j) = \lvert r_1 - j \rvert,\ j \in [1, m]$
Optimum cost: $D(n, m)$
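The recurrence above can be sketched in Python. This is a minimal sketch that returns only the optimum cost $D(n,m)$; recovering the separating scores $\theta$ would additionally require backtracking through the DP table:

```python
def segment_cost(r, m):
    """Minimum total |r_i - j| cost of mapping the score-sorted items
    r = (r_1, ..., r_n) onto non-decreasing computed ranks 1..m, ending at m.
    Implements D(i,j) = |r_i - j| + min(D(i-1,j), D(i-1,j-1))."""
    n = len(r)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):          # boundary: D(1,j) = |r_1 - j|
        D[1][j] = abs(r[0] - j)
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            best = min(D[i-1][j], D[i-1][j-1] if j > 1 else INF)
            D[i][j] = abs(r[i-1] - j) + best
    return D[n][m]                     # optimum cost D(n, m)
```

When the desired ranks are already non-decreasing, the cost is 0; every violation of the monotone ordering contributes at least 1.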
LTR Score Segmentation by DP (II)
(Figure: desired ranks $r_1, \ldots, r_{13}$ on the x-axis aligned against computed ranks 1–5 on the y-axis. The separating scores are midpoints of adjacent scores across rank boundaries, e.g. $\theta_1 = (s_2+s_3)/2$, $\theta_2 = (s_6+s_7)/2$, $\theta_3 = (s_8+s_9)/2$, $\theta_4 = (s_{11}+s_{12})/2$. Local constraint of the recurrent formula: $D(i,j)$ comes from $D(i-1,j)$ or $D(i-1,j-1)$.)
LTR Score Segmentation with DP (III)
(Figure: data distribution of LTR scores by class 1–5, and the corresponding DP path; DP total distance = 23.)
Flow Charts of Our Experiment
Corpora for Experiments
WSJ
For training biphone acoustic models for forced
alignment
MIR-SD
Recordings of about 4000 multi-syllable English words by 22 students (12 females and 10 males) with an intermediate competence level
Originally designed for stress detection
Available at http://mirlab.org/dataSet/public
Human Scoring of MIR-SD
Human scoring
Only 50 utterances from each speaker of MIR-SD are scored by 2 humans, making a total of 1100 utterances

Score   Frequency   Percentage
1       110         10%
2       198         18%
3       259         24%
4       409         37%
5       124         11%

Human scores are consistent:

Correlation   Word-based   Speaker-based
Inter-rater   0.58         0.78
HR1-GT        0.84         0.96
HR2-GT        0.89         0.93
Performance Indices
Performance indices used in the literature (example: hr = [1 3 5 4 2 2], cr = [2 3 5 2 1 4])
Recognition rate (rRate) = 33.33%
Recognition rate with tolerance 1 (rRateT1) = 66.67%
Average absolute difference (AADiff) = 1
Correlation coefficient (Corr) = 0.54
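These four indices are simple to compute from the human ranks hr and computed ranks cr; a plain-Python sketch using the example above:

```python
def perf_indices(hr, cr):
    """hr: human ranks, cr: computed ranks (same length)."""
    n = len(hr)
    rrate = sum(h == c for h, c in zip(hr, cr)) / n              # exact matches
    rrate_t1 = sum(abs(h - c) <= 1 for h, c in zip(hr, cr)) / n  # tolerance 1
    aadiff = sum(abs(h - c) for h, c in zip(hr, cr)) / n         # avg abs diff
    mh, mc = sum(hr) / n, sum(cr) / n
    cov = sum((h - mh) * (c - mc) for h, c in zip(hr, cr))
    var = (sum((h - mh) ** 2 for h in hr) * sum((c - mc) ** 2 for c in cr)) ** 0.5
    return rrate, rrate_t1, aadiff, cov / var                    # Pearson corr

hr, cr = [1, 3, 5, 4, 2, 2], [2, 3, 5, 2, 1, 4]
print(perf_indices(hr, cr))  # (0.333..., 0.666..., 1.0, 0.538...)
```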
Performance Evaluation of Different Scoring Methods

Method    Index     Raw score   DP inside   DP outside   k-means inside   k-means outside
durDis    Corr      0.209       0.217       0.189        0.202            0.194
durDis    rRate     -           0.342       0.309        0.281            0.276
durDis    rRateT1   -           0.783       0.771        0.701            0.696
durDis    AADiff    -           0.906       0.942        1.109            1.122
hmmLike   Corr      0.120       0.168       0.102        0.144            0.154
hmmLike   rRate     -           0.325       0.306        0.258            0.255
hmmLike   rRateT1   -           0.780       0.757        0.692            0.689
hmmLike   AADiff    -           0.928       0.973        1.158            1.165
hmmPost   Corr      0.084       0.297       0.265        0.192            0.216
hmmPost   rRate     -           0.344       0.330        0.170            0.162
hmmPost   rRateT1   -           0.811       0.798        0.565            0.561
hmmPost   AADiff    -           0.862       0.893        1.494            1.499
likeDis   Corr      0.141       0.160       0.125        0.141            0.143
likeDis   rRate     -           0.316       0.308        0.247            0.247
likeDis   rRateT1   -           0.789       0.774        0.665            0.671
likeDis   AADiff    -           0.924       0.948        1.207            1.203
rkRatio   Corr      0.240       0.232       0.198        0.229            0.236
rkRatio   rRate     -           0.333       0.316        0.269            0.268
rkRatio   rRateT1   -           0.789       0.779        0.699            0.698
rkRatio   AADiff    -           0.898       0.929        1.120            1.124
Overall Performance Comparison
Legends
Score segmentation
Circles: DP
Triangles: k-means
Inside/outside tests
Solid lines: Inside
Dashed lines: Outside
Black lines: Baselines
Demo: Practice of Mandarin Idioms of
Length 4 (一語中的)
Level (difficulty) of an idiom is based on its frequency via Google search:
孤掌難鳴 ===> 260,000
鶼鰈情深 ===> 43,300
亡鈇意鄰 ===> 22,700
舉案齊眉 ===> 235,000
Can be adapted for
English learning
Next step: multithreading, fast decoding
via FSM
Demo: Recitation Machine
(唸唸不忘)
Supports Mandarin & English
Supports user-defined recitation scripts
Next step: multithreading for recording & recognition
Licensing for PC Applications
For Mandarin, English, Japanese
SA for Embedded Systems
Embedded platforms: PMP, iPhone, Android
Demo: Tangible Companions
Chicken run (落跑雞)
Penguin for Tang Poetry
(唐詩企鵝)
Robot Fighter (蘿蔔戰士)
Singing Bass & Dog (大
嘴鱸魚和唱歌狗)
Tools and Tutorials
Tools
DCPR toolbox: http://mirlab.org/jang/matlab/toolbox/dcpr
SAP toolbox: http://mirlab.org/jang/matlab/toolbox/sap
ASR toolbox: http://mirlab.org/jang/matlab/toolbox/asr
Tutorials
Data clustering and pattern recognition: http://mirlab.org/jang/books/dcpr
Audio signal processing: http://mirlab.org/jang/books/audioSignalProcessing
Lab page (with demos): http://mirlab.org
Other SA Issues to be Addressed
Core technology
Other acoustic features for
scoring
Pitch: tone/intonation
Volume
Duration
Pause
Coarticulation
Error pattern identification
Application side
Multimodal GUI
Extensions
Slight adaptation
Paragraph-level SA
Text-free SA
Beyond pronunciation
Translation + recognition
+ assessment
Microphone types
Examples
Coarticulation
Knock it off!
Mom woke her up
Consonant+consonant
Bus stop
Push Shirley
Ask question
Jeff flew south through
Tainan
Exception
Change jobs
Which Chair
Examples
Changes due to
coarticulation
Would you like it?
Won’t you go?
Raise your hand.
It makes you look
younger.
Softened sounds
Junction
Popcorn
Fruitful
Can and can’t
I can read the letter.
I can’t read the letter.
d and t
Better
Cider
Most Likely to be Mispronounced
Within Taiwan
Pleasure/pressure
World/war/word
Shirt/short
Walk/work
Flesh/fresh
Supply/surprise
Some/son
Confirm/conform
Cancel/cancer
Mouth/mouse
Measure/major
Police/please
Version/virgin
Conclusions
Conclusions
SA calls for more cues than ASR
SA requires techniques from ML/IR
Multi-modal approach to SA is a must
“Popcorn”, “Thursday”
On-going & future work
Tone recognition & assessment
Reliable error pattern identification
References
Witt, S. M. and Young, S. J., “Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning”,
Speech Communication 30, 95-108, 2000.
Kim, Y., Franco, H., and Neumeyer, L., “Automatic Pronunciation Scoring of Specific phone Segments for Language
Instruction”, in Proceedings of the 4th European Conference on Speech Communication and Technology (Eurospeech ’97),
pp. 649-652, Rhodes, 1997.
Neumeyer, L, Franco, H., Digalakis, V., and Weintraub, M., “Automatic Scoring of Pronunciation Quality”, Speech
Communication 30, 83-93, 2000.
Franco, H., Neumeyer, L., Digalakis, V., and Ronen, O., “Combination of Machine Scores for Automatic Grading of
Pronunciation Quality”, Speech Communication 30, 121-130, 2000.
Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., and Nakamura, S., “Automatic Pronunciation Scoring of Words and Sentences
Independent from the Non-Native’s First Language”, Computer Speech and Language 23, 65-88, 2009.
Crammer, K. and Singer, Y., “Pranking with Ranking”, in proceedings of the conference on Neural Information Processing
Systems (NIPS), 2001.
Joachims, T., “Optimizing Search Engines using Clickthrough Data”, in proceedings of the ACM Conference on Knowledge
Discovery and Data Mining (KDD), ACM, 2002.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences”, in
proceedings of ICML, pp. 170-178, 1998.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank using
Gradient Descent”, in proceedings of ICML, pp. 89-96, 2005.
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., and Li, H., “Learning to Rank: From Pairwise Approach to Listwise Approach”, in
proceedings of the 24th International Conference on Machine Learning, pp. 129-136, Corvallis, OR, 2007.
Liang-Yu Chen , Jyh-Shing Roger Jang, “Automatic Pronunciation Scoring using Learning to Rank and DP-based Score
Segmentation”, submitted to Interspeech 2010.