Speech Assessment: Methods and
Applications for Spoken Language Learning
J.-S. Roger Jang (張智星)
[email protected]
http://www.cs.nthu.edu.tw/~jang
Multimedia Information Retrieval Lab
CS Dept, Tsing Hua Univ, Taiwan
Outline
Introduction to speech assessment
Methods
Using learning to rank for speech assessment
Demos
Conclusions
Intro. to Speech Assessment
Goal
Evaluate a person's utterance based on acoustic features, for language-learning purposes
Also known as
Pronunciation scoring
CAPT (computer-assisted pronunciation training)
Computer-Assisted Language
Learning (CALL)
4 aspects of CALL
Listening: easier
Speaking: harder
Reading: easier
Writing: harder
Receptive skills (listening, reading) are easier to assist with computers, while productive skills (speaking, writing) are harder to evaluate automatically.
SA plays an essential role in CALL for speech/pronunciation scoring.
Speech Assessment
Characteristics of ideal SA
Assessment levels: as detailed as possible
Syllables, words, sentences, paragraphs
Assessment criteria: as many as possible
timbre, tone, energy, rhythm, co-articulation, …
Feedback: as specific as possible
High-level correction and suggestions
Basic Assessment Criteria
Timbre (articulation/voice quality)
Based on acoustic models
Tone (pitch)
Based on tone recognition (for tonal languages)
Based on pitch similarity with the target utterance
Rhythm (duration)
Based on duration comparison with the target utterance
Energy (intensity/volume)
Based on energy comparison with the target utterance
Additional Assessment Criteria
English
Stress
Levels (word or sentence)
Intonation (sentence-level pitch)
Declarative sentence
Interrogative sentence
Co-articulation
A red apple.
Did you call me?
Won't you go?
Raise your hand.
Mandarin
Tone
Retroflex sounds
Co-articulation
Erhua (rhotacized endings, 兒化音)
Others
Pause
Types of SA
Types of SA (ordered by difficulty)
Type 1: target text available, target utterance available
Type 2: target text available, no target utterance
Type 3: no target text, target utterance available
Type 4: no target text, no target utterance
We are focusing on types 1 and 2.
Our Approach
Basic approach to timbre assessment
Lexicon net construction (usually a sausage net)
Forced alignment to identify phone boundaries
Phone scoring based on several criteria, such as
ranking, histograms, posterior prob., etc.
Weighted average to get syllable/sentence scores
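A minimal sketch of the last step, assuming hypothetical phone scores and frame durations from forced alignment, with segment duration used as the weight:

```python
# Duration-weighted pooling of phone scores into a syllable/sentence score.
# Phone names, scores, and durations below are illustrative only.
def weighted_average(scores, weights):
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# (phone, score in [0, 100], duration in frames) from forced alignment
phones = [("t", 85.0, 8), ("i", 92.0, 20), ("an", 78.0, 25)]

syllable_score = weighted_average(
    [s for _, s, _ in phones],
    [d for _, _, d in phones],
)
print(round(syllable_score, 2))  # 84.34
```

Sentence scores can be obtained the same way, pooling syllable scores with syllable durations as weights.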
Lexicon Net Construction
Lexicon net for “what are you allergic to?”
A sausage net with all possible (and correct) pronunciation variants
Optional sil (silence) between words
Lexicon Net with Confusing Phones
Common errors for Japanese learners of Chinese:
ㄖ → ㄌ, e.g., 天氣熱 → 天氣樂
ㄑ → ㄐ, e.g., 打哈欠 → 打哈見
ㄘ → ㄗ, e.g., 一次旅行 → 一字旅行
ㄢ → ㄤ, e.g., 晚安 → 晚ㄤ
Rule-based approach to creating confusing syllables
Rules:
Rule 1: re → le
Rule 2: qi → ji
Rule 3: ci → zi
Rule 4: an → ang
Example: 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
Lexicon Net with Confusing Phones
Lexicon net for “天氣熱、打哈欠”
Canonical form: tian qi re da ha qian
16 variant paths in the net (2 options for qi × 2 for re × 4 for qian)
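The path count can be verified with a small sketch: each syllable expands to itself plus its confusing alternatives from the rules above, and the variant paths are their Cartesian product.

```python
from itertools import product

# Confusion rules for Japanese learners of Chinese, as listed above.
# (qian expands via qi->ji and an->ang, giving three alternatives.)
confusions = {"re": ["le"], "qi": ["ji"], "ci": ["zi"],
              "qian": ["jian", "qiang", "jiang"]}

canonical = ["tian", "qi", "re", "da", "ha", "qian"]

# Each syllable expands to itself plus its confusing alternatives.
options = [[s] + confusions.get(s, []) for s in canonical]
paths = list(product(*options))
print(len(paths))  # 16 variant paths
```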
Automatic Confusing Syllable Id.
Corpus of Japanese learners of Chinese
1. Run forced alignment to obtain an initial segmentation
2. Compare each segment against the 411 Mandarin syllables to find the confusing syllables of each sound
3. Add the confusing syllables to the recognition net
4. Rerun forced alignment and segmentation
5. If the segmentation still changes, repeat from step 2; otherwise output the confusing syllables and the recognition net
Error Pattern Identification (EPI)
Common insertions/deletions from users
Take 「朝辭白帝彩雲間」 as the target sentence:
Ending anywhere, e.g., 「朝辭白帝」
Starting anywhere, e.g., 「彩雲間」
Starting and ending anywhere, e.g., 「白帝彩雲」
Starting and ending anywhere, with skipped characters, e.g., 「白彩雲」
Repeated character, e.g., 「朝…朝辭白帝彩雲間」
Repeated word, e.g., 「朝辭…朝辭白帝彩雲間」
Repeated character with a changed sound, e.g., 「朝(cao)…朝(zhao)辭白帝彩雲間」
Swapped characters, e.g., 「朝辭彩帝白雲間」
Wrong character, e.g., 「朝辭白帝黑山間」
Lexicon Net for EPI (I)
Detects utterances that start at the beginning and end anywhere
Lexicon Net for EPI (II)
Detects utterances that start anywhere and end at the end
Lexicon Net for EPI (III)
Detects utterances that start anywhere and end anywhere (without skipping characters)
Lexicon Net for EPI (IV)
Detects utterances that start anywhere, end anywhere, and may skip characters
Design Philosophy of Lexicon Nets
We need to strike a balance between recognition accuracy and lexicon-net coverage.
In the extreme, we could use free syllable decoding to catch all error patterns,
but its feasibility is offset by its relatively low recognition rate.
Scoring Methods for Speech Assessment
Five phone-based scoring methods
Duration-distribution scores
Log-likelihood scores
Log-posterior scores
Log-likelihood-distribution scores
Rank ratio scores
All based on forced alignment to segment
phones
Method 1: Duration-distribution Scores
PDF of phone duration
Obtained from forced alignment
Normalized by speech rate
Fitted by log-normal PDF
Maximum of the PDF corresponds to a score of 100
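A minimal sketch of this scoring curve, assuming hypothetical per-phone log-normal parameters mu and sigma (in practice fitted from training data):

```python
import math

# Score a (speech-rate-normalized) phone duration against a log-normal
# PDF fitted from training data, scaled so the mode scores 100.
# mu and sigma are illustrative per-phone parameters.
def lognormal_pdf(d, mu, sigma):
    return math.exp(-(math.log(d) - mu) ** 2 / (2 * sigma ** 2)) / (
        d * sigma * math.sqrt(2 * math.pi))

def duration_score(d, mu, sigma):
    mode = math.exp(mu - sigma ** 2)  # duration with the maximum PDF value
    return 100.0 * lognormal_pdf(d, mu, sigma) / lognormal_pdf(mode, mu, sigma)

# The mode itself scores exactly 100; other durations score lower.
print(round(duration_score(math.exp(0.5 - 0.2 ** 2), 0.5, 0.2), 1))  # 100.0
```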
Method 2: Log-likelihood Scores
Log-likelihood of phone $q_i$ spanning $d$ frames starting at frame $t_0$:

$\hat{l} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(y_t \mid q_i)$

where $p(y_t \mid q_i)$ is the likelihood of frame $t$ with observation vector $y_t$.
Method 3: Log-posterior Scores
Log-posterior of phone $q_i$ with duration $d$:

$\hat{\rho} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q_i \mid y_t)$

where

$P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\, P(q_i)}{\sum_{j=1}^{m} p(y_t \mid q_j)\, P(q_j)}$
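Assuming per-frame likelihoods and priors are available (the values below are made up), the frame posterior and its segment average follow directly from Bayes' rule:

```python
import math

# Frame-level posterior P(q_i | y_t) from likelihoods p(y_t | q_j) and
# priors P(q_j) via Bayes' rule; likelihoods and priors are illustrative.
def log_posterior_score(frame_likelihoods, priors, i):
    """Average log P(q_i | y_t) over the frames of one phone segment."""
    logs = []
    for likes in frame_likelihoods:          # one dict per frame
        denom = sum(likes[j] * priors[j] for j in likes)
        logs.append(math.log(likes[i] * priors[i] / denom))
    return sum(logs) / len(logs)

frames = [{"a": 0.8, "e": 0.2}, {"a": 0.6, "e": 0.4}]
priors = {"a": 0.5, "e": 0.5}
score = log_posterior_score(frames, priors, "a")
print(round(score, 3))  # -0.367
```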
Method 4: Log-likelihood-distribution Scores
Use the CDF of a Gaussian fitted to the log-likelihoods
CDF = 1 corresponds to a score of 100
Method 5: Rank Ratio Scores
Rank ratio:

$rr(q_j) = \frac{\mathrm{rank}(q_j) - 1}{\#\,\text{of competing phones} - 1}$

RR-to-score conversion:

$\mathrm{score}(q_j; a, b) = \frac{100}{1 + \left(\frac{rr(q_j)}{a}\right)^{b}}$

where the parameters $a, b$ are phone-specific.
Possible sets of competing phones for x+y: *+y, *+*
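A sketch of rank-ratio scoring; the competing-phone likelihoods and the parameters a and b below are illustrative (the slides tune a, b per phone):

```python
# Rank-ratio score: rank the target phone's likelihood among its
# competing phones, then map the rank ratio to [0, 100].
# The competing set and the parameters a, b are illustrative.
def rank_ratio(target, competitors):
    """competitors: {phone: log-likelihood}, including the target."""
    ordered = sorted(competitors, key=competitors.get, reverse=True)
    rank = ordered.index(target) + 1
    return (rank - 1) / (len(competitors) - 1)

def rr_to_score(rr, a, b):
    return 100.0 / (1.0 + (rr / a) ** b)

# Hypothetical log-likelihoods of "s+ih" and its competing biphones.
likes = {"s+ih": -4.1, "sh+ih": -4.5, "z+ih": -5.0, "s+iy": -5.2}
rr = rank_ratio("s+ih", likes)   # best rank -> rr == 0.0
print(rr, round(rr_to_score(rr, a=0.3, b=2.0), 1))  # 0.0 100.0
```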
Intro. to Learning to Rank
Learning to rank
A supervised learning algorithm that generates a ranking model from a training set of partially ordered items.
Methods
Pointwise (e.g., Pranking)
Pairwise (e.g., RankSVM, RankBoost, RankNet)
Listwise (e.g., ListNet)
Application of LTR to SA
Why use LTR for SA?
Human scoring is rank-based: A+, A, B, B-…
Tsing Hua’s grading system is moving from scores
(0~100) to ranks (A, B, C, D…).
Combination of features (scores)
Features are complementary.
Effective determination of ranking
LTR generates only numerical outputs whose ordering approximates the correct ranking; an optimal DP-based approach is proposed to convert scores into ranks.
LTR Score Segmentation
Given: sorted LTR scores $s = (s_1, s_2, \ldots, s_n)$ and desired ranks $r = (r_1, r_2, \ldots, r_n)$.
We want to find the separating thresholds $\theta_1, \theta_2, \ldots, \theta_{m-1}$ of a score-to-rank function $s2r(s; \theta)$ such that

$J = \sum_{i=1}^{n} \left| r_i - s2r(s_i; \theta) \right|$

is minimized.
(Figure: the score axis $s$ divided by $\theta_1, \ldots, \theta_4$ into ranks 1 through 5.)
LTR Score Segmentation by DP (I)
Formulate the problem in a DP framework.
Optimum-value function $D(i, j)$: the minimum cost of mapping $s_1, s_2, \ldots, s_i$ to ranks $1, 2, \ldots, j$
Recurrent equation: $D(i, j) = |r_i - j| + \min\{D(i-1, j),\, D(i-1, j-1)\}$
Boundary condition: $D(1, j) = |r_1 - j|,\ j \in [1, m]$
Optimum cost: $D(n, m)$
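The recurrence can be implemented directly; here is a sketch with an illustrative desired-rank sequence, returning the optimum total cost D(n, m):

```python
# DP-based score segmentation: assign the sorted LTR scores s_1..s_n to
# ranks 1..m (rank assignments must be non-decreasing) so that the total
# |desired rank - assigned rank| is minimized.
def segment_by_dp(desired_ranks, m):
    n = len(desired_ranks)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):                    # boundary: D(1, j) = |r_1 - j|
        D[1][j] = abs(desired_ranks[0] - j)
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            prev = min(D[i - 1][j], D[i - 1][j - 1] if j > 1 else INF)
            D[i][j] = abs(desired_ranks[i - 1] - j) + prev
    return D[n][m]                               # optimum total cost

# Desired ranks of the sorted scores (an illustrative, slightly noisy case).
print(segment_by_dp([1, 1, 2, 1, 2, 3, 3, 2, 3], m=3))  # 2
```

Tracing the thresholds back from the optimum path yields the separating scores between consecutive ranks.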
LTR Score Segmentation by DP (II)
(Figure: DP trellis with the computed rank 1–5 on the vertical axis and the sorted scores $s_1, \ldots, s_n$ with their desired ranks $r_1, \ldots, r_{13}$ on the horizontal axis. The local constraint allows each step to stay at rank $j$ or move up from rank $j-1$.)
Recurrent formula: $D(i, j) = |r_i - j| + \min\{D(i-1, j),\, D(i-1, j-1)\}$
LTR Score Segmentation with DP (III)
(Figure: data distribution of LTR scores for classes 1–5 and the resulting DP path; DP total distance = 23.)
Flow Charts of Our Experiment
Corpora for Experiments
WSJ
For training biphone acoustic models for forced
alignment
MIR-SD
Recordings of about 4000 multi-syllable English words by 22 students (12 female, 10 male) with an intermediate competence level
Originally designed for stress detection
Available at http://mirlab.org/dataSet/public
Human Scoring of MIR-SD
Human scoring
Only 50 utterances from each speaker of MIR-SD are scored by 2 human raters, for a total of 1100 utterances
Score | Frequency | Percentage
1     | 110       | 10%
2     | 198       | 18%
3     | 259       | 24%
4     | 409       | 37%
5     | 124       | 11%
Human scores are consistent:

Correlation | Word-based | Speaker-based
Inter-rater | 0.58       | 0.78
HR1-GT      | 0.84       | 0.96
HR2-GT      | 0.89       | 0.93
Performance Indices
Performance indices used in the literature
Example: human ranks hr = [1 3 5 4 2 2], computed ranks cr = [2 3 5 2 1 4]
Recognition rate (rRate) = 33.33%
Recognition rate with tolerance 1 (rRateT1) = 66.67%
Average absolute difference (AADiff) = 1
Correlation coefficient (Corr) = 0.54
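These four indices are straightforward to reproduce; the sketch below recomputes them for the hr/cr example above:

```python
# Reproduce the four performance indices for the example above:
# human ranks hr and computed ranks cr.
def corr(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

hr = [1, 3, 5, 4, 2, 2]
cr = [2, 3, 5, 2, 1, 4]

rRate = sum(h == c for h, c in zip(hr, cr)) / len(hr)             # exact matches
rRateT1 = sum(abs(h - c) <= 1 for h, c in zip(hr, cr)) / len(hr)  # tolerance 1
aaDiff = sum(abs(h - c) for h, c in zip(hr, cr)) / len(hr)        # avg abs diff
print(round(rRate, 4), round(rRateT1, 4), aaDiff, round(corr(hr, cr), 2))
# 0.3333 0.6667 1.0 0.54
```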
Performance Evaluation of Different Scoring Methods

durDis:   Raw-score Corr 0.209
          DP-based (inside/outside):  Corr 0.217/0.189, rRate 0.342/0.309, rRateT1 0.783/0.771, AADiff 0.906/0.942
          k-means (inside/outside):   Corr 0.202/0.194, rRate 0.281/0.276, rRateT1 0.701/0.696, AADiff 1.109/1.122
hmmLike:  Raw-score Corr 0.120
          DP-based (inside/outside):  Corr 0.168/0.102, rRate 0.325/0.306, rRateT1 0.780/0.757, AADiff 0.928/0.973
          k-means (inside/outside):   Corr 0.144/0.154, rRate 0.258/0.255, rRateT1 0.692/0.689, AADiff 1.158/1.165
hmmPost:  Raw-score Corr 0.084
          DP-based (inside/outside):  Corr 0.297/0.265, rRate 0.344/0.330, rRateT1 0.811/0.798, AADiff 0.862/0.893
          k-means (inside/outside):   Corr 0.192/0.216, rRate 0.170/0.162, rRateT1 0.565/0.561, AADiff 1.494/1.499
likeDis:  Raw-score Corr 0.141
          DP-based (inside/outside):  Corr 0.160/0.125, rRate 0.316/0.308, rRateT1 0.789/0.774, AADiff 0.924/0.948
          k-means (inside/outside):   Corr 0.232/0.198 … see rkRatio below; likeDis k-means: Corr 0.141/0.143, rRate 0.247/0.247, rRateT1 0.665/0.671, AADiff 1.207/1.203
rkRatio:  Raw-score Corr 0.240
          DP-based (inside/outside):  Corr 0.232/0.198, rRate 0.333/0.316, rRateT1 0.789/0.779, AADiff 0.898/0.929
          k-means (inside/outside):   Corr 0.229/0.236, rRate 0.269/0.268, rRateT1 0.699/0.698, AADiff 1.120/1.124
Overall Performance Comparison
Legend
Score segmentation
Circles: DP
Triangles: k-means
Inside/outside tests
Solid lines: Inside
Dashed lines: Outside
Black lines: Baselines
Demo: Practice of Mandarin Idioms of
Length 4 (一語中的)
The level (difficulty) of an idiom is based on its frequency via Google search:
孤掌難鳴 ===> 260,000
鶼鰈情深 ===> 43,300
亡鈇意鄰 ===> 22,700
舉案齊眉 ===> 235,000
Can be adapted for
English learning
Next step: multithreading, fast decoding
via FSM
Demo: Recitation Machine
(唸唸不忘)
Support Mandarin &
English
Support user-defined
recitation script
Next step: multithreading
for recording & recognition
Licensing for PC Applications
For Mandarin, English, Japanese
SA for Embedded Systems
Embedded platforms: PMP, iPhone, Android
Demo: Tangible Companions
Chicken run (落跑雞)
Penguin for Tang Poetry
(唐詩企鵝)
Robot Fighter (蘿蔔戰士)
Singing Bass & Dog (大
嘴鱸魚和唱歌狗)
Tools and Tutorials
Tools
DCPR toolbox: http://mirlab.org/jang/matlab/toolbox/dcpr
SAP toolbox: http://mirlab.org/jang/matlab/toolbox/sap
ASR toolbox: http://mirlab.org/jang/matlab/toolbox/asr
Tutorials
Data clustering and pattern recognition: http://mirlab.org/jang/books/dcpr
Audio signal processing: http://mirlab.org/jang/books/audioSignalProcessing
Lab page (with demos): http://mirlab.org
Other SA Issues to be Addressed
Core technology
Other acoustic features for
scoring
Pitch: tone/intonation
Volume
Duration
Pause
Coarticulation
Error pattern identification
Application side
Multimodal GUI
Extensions
Slight adaptation
Paragraph-level SA
Text-free SA
Beyond pronunciation
Translation + recognition
+ assessment
Microphone types
Examples
Coarticulation
Knock it off!
Mom woke her up
Consonant+consonant
Bus stop
Push Shirley
Ask question
Jeff flew south through
Tainan
Exception
Change jobs
Which chair
Examples
Changes due to
coarticulation
Would you like it?
Won’t you go?
Raise your hand.
It makes you look
younger.
Softened sounds
Junction
Popcorn
Fruitful
Can and can’t
I can read the letter.
I can’t read the letter.
d and t
Better
Cider
Most Likely to be Mispronounced
Within Taiwan
Pleasure/pressure
World/war/word
Shirt/short
Walk/work
Flesh/fresh
Supply/surprise
Some/son
Confirm/conform
Cancel/cancer
Mouth/mouse
Measure/major
Police/please
Version/virgin
Conclusions
Conclusions
SA calls for more cues than ASR
SA requires techniques from ML/IR
Multi-modal approach to SA is a must
“Popcorn”, “Thursday”
On-going & future work
Tone recognition & assessment
Reliable error pattern identification
References
Witt, S. M. and Young, S. J., “Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning”,
Speech Communication 30, 95-108, 2000.
Kim, Y., Franco, H., and Neumeyer, L., “Automatic Pronunciation Scoring of Specific Phone Segments for Language
Instruction”, in Proceedings of the 4th European Conference on Speech Communication and Technology (Eurospeech ’97),
pp. 649-652, Rhodes, 1997.
Neumeyer, L., Franco, H., Digalakis, V., and Weintraub, M., “Automatic Scoring of Pronunciation Quality”, Speech
Communication 30, 83-93, 2000.
Franco, H., Neumeyer, L., Digalakis, V., and Ronen, O., “Combination of Machine Scores for Automatic Grading of
Pronunciation Quality”, Speech Communication 30, 121-130, 2000.
Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., and Nakamura, S., “Automatic Pronunciation Scoring of Words and Sentences
Independent from the Non-Native’s First Language”, Computer Speech and Language 23, 65-88, 2009.
Crammer, K. and Singer, Y., “Pranking with Ranking”, in Proceedings of the Conference on Neural Information Processing
Systems (NIPS), 2001.
Joachims, T., “Optimizing Search Engines using Clickthrough Data”, in Proceedings of the ACM Conference on Knowledge
Discovery and Data Mining (KDD), ACM, 2002.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences”, in
Proceedings of ICML, pp. 170-178, 1998.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank using
Gradient Descent”, in Proceedings of ICML, pp. 89-96, 2005.
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., and Li, H., “Learning to Rank: From Pairwise Approach to Listwise Approach”, in
Proceedings of the 24th International Conference on Machine Learning, pp. 129-136, Corvallis, OR, 2007.
Liang-Yu Chen , Jyh-Shing Roger Jang, “Automatic Pronunciation Scoring using Learning to Rank and DP-based Score
Segmentation”, submitted to Interspeech 2010.