Query by Singing (CBMR)

Speech Recognition
Jyh-Shing Roger Jang (張智星)
[email protected]
http://www.cs.nthu.edu.tw/~jang
Multimedia Information Retrieval Lab
Department of Computer Science, National Tsing Hua University

Speech Processing
Speech Recognition -- converting speech into text, based on the
input speech and on prior acoustic and textual analyses
Speaker Recognition -- verifying a person's identity or associating
a person with a voice
Speech Coding -- digital coding (compression) of speech for
efficient, secure storage and transmission
Speech Synthesis -- automatic generation of a speech signal from
ordinary text input
Speech Enhancement -- processing a speech signal that is subject to
certain degradations in order to increase its intelligibility
and/or its quality
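
Introduction to Speech Recognition
Goal:
Recognize words from a restricted vocabulary using voice input
Characteristics:
High technical barrier: requires familiarity with digital signal
processing, acoustic models, matching methods, language models, and so on.
Collecting a speech corpus takes a large amount of manual effort.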

Technical Difficulties and Considerations in Speech Recognition
1. Is the system required to recognize a specific individual or multiple
   speakers?
2. What is the size of the vocabulary?
3. Is the speech to be entered in discrete units with distinct pauses
   between them, or as a continuous utterance?
4. What is the extent of ambiguity and acoustic confusability in the
   vocabulary?
5. Is the system to be operated in a quiet or a noisy environment, and
   what is the nature of the environmental noise, if any?
6. What linguistic constraints are placed upon the speech, and what
   linguistic knowledge is built into the recognizer?
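
Applications of Speech Recognition
Voice-driven song selection
Automated voice telephone switchboard
Full-text retrieval systems with a speech interface
Example: full-text search and spoken-query positioning in "Harry Potter"
Lyrics retrieval system
Example: the lyric "潮起又潮落" ("the tide rises and falls again")
Any other application that can use speech as its interface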

Categories of Speech Recognition
Vocabulary Size
  Small Vocabulary: fewer than 100 words
  Medium Vocabulary: 100 to 1,000 words
  Large Vocabulary: more than 1,000 words
Speaker Dependence
  Speaker-Dependent
  Speaker-Independent

Categories of Speech Recognition (continued)
Speaking Style
Isolated Words
Connected Words
Continuous Speech
Environment
Clean Speech
Noisy Speech
Channel Distorted
Microphone Mismatched

Directions in Speech Recognition
1980s and 1990s
Methodology
Hidden Markov Models
Neural Networks
The Trends
Large Vocabulary Continuous Speech Recognition
Robust Speech Recognition
Real-Time Speaker Adaptive Speech Recognition
Language Modeling

Speech Recognition Workflow
Recording and feature extraction
Matching
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Display of the matching results
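
Schematic of Speech Recognition
Isolated Word Problem
Concept: [figure: a spoken input is matched against the vocabulary below]
Vocabulary: 紐約 (New York), 台北 (Taipei), 台中 (Taichung), ..., 洛杉磯 (Los Angeles)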
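
Feature Extraction for Speech Recognition
MFCC: Mel-frequency Cepstral Coefficients
Block diagram, in processing order:
Speech signal
-> Frame blocking
-> Pre-emphasis
-> Hamming window
-> |DFT( )|^2
-> Triangular filters
-> log( ), giving the mel-scale log spectrum
-> Discrete Cosine Transform
-> combined with the log energy into a 13-D vector
-> Differentiator, giving the 39-D feature vector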
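The block diagram above can be sketched in plain Python as follows. This is only an illustrative sketch, not the implementation behind the slides: the frame size (256), hop (128), pre-emphasis coefficient (0.95), number of triangular filters (20), the 12 + 1 split of the 13-D vector, the mel formula constants, and the simple central-difference differentiator are all assumptions chosen for the example.

```python
import math

def frame_blocking(signal, frame_size, hop):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_size]
            for s in range(0, len(signal) - frame_size + 1, hop)]

def pre_emphasis(frame, alpha=0.95):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1]
                         for n in range(1, len(frame))]

def hamming_window(frame):
    size = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)))
            for n, x in enumerate(frame)]

def power_spectrum(frame):
    """|DFT|^2 for bins 0..N/2, computed naively in O(N^2) for readability."""
    size = len(frame)
    spectrum = []
    for k in range(size // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / size) for n, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * n / size) for n, x in enumerate(frame))
        spectrum.append(re * re + im * im)
    return spectrum

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(num_bins):
            f = b * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct(values, num_coeffs):
    """DCT-II of the log filter-bank outputs, skipping the 0th coefficient."""
    size = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / size)
                for i, v in enumerate(values))
            for k in range(1, num_coeffs + 1)]

def mfcc_frame(frame, bank, num_ceps=12):
    """One 13-D vector: num_ceps cepstral coefficients plus the log energy."""
    log_energy = math.log(sum(x * x for x in frame) + 1e-10)
    spectrum = power_spectrum(hamming_window(pre_emphasis(frame)))
    log_mel = [math.log(sum(w * p for w, p in zip(row, spectrum)) + 1e-10)
               for row in bank]
    return dct(log_mel, num_ceps) + [log_energy]

def add_deltas(features):
    """Differentiator: append first and second differences (13-D -> 39-D)."""
    def diff(seq):
        last = len(seq) - 1
        return [[(b - a) / 2.0
                 for a, b in zip(seq[max(t - 1, 0)], seq[min(t + 1, last)])]
                for t in range(len(seq))]
    d1 = diff(features)
    d2 = diff(d1)
    return [f + a + b for f, a, b in zip(features, d1, d2)]

# Hypothetical test input: a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1024)]
frames = frame_blocking(signal, frame_size=256, hop=128)
bank = mel_filterbank(num_filters=20, num_bins=256 // 2 + 1, sample_rate=sample_rate)
features = add_deltas([mfcc_frame(f, bank) for f in frames])
print(len(features), len(features[0]))
```

A production front end would use an FFT and a regression-based delta; the naive DFT is kept here only so that each box of the diagram maps to one short function.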

Dynamic Time Warping
Characteristics:
Pattern-matching-based approach
Requires less computation
Difficult to achieve speaker independence
Suitable for small to medium vocabularies
Suitable for microprocessor/chip implementation
Applications:
Mobile phones, car phones, toys, digital voice recorders
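
Dynamic Time Warping (DTW)
t: input MFCC matrix, with frames t(i)
r: reference MFCC matrix, with frames r(j)
D(i, j): accumulated distance of the best path ending at (t(i), r(j))
DTW recurrence:
$$D(i, j) = \mathrm{dist}(t(i), r(j)) + \min\{D(i-1, j-2),\ D(i-1, j-1),\ D(i-2, j-1)\}$$
[Figure: DTW grid with input frames t(i-1), t(i) along the i axis and reference frames r(j-1), r(j) along the j axis]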
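The recurrence above can be turned into a short dynamic-programming routine. The following is a minimal sketch with both ends fixed; the function names and the Euclidean frame distance are assumptions made for illustration, since the slides do not specify dist.

```python
import math

def euclidean(a, b):
    """Frame-to-frame distance (an assumed choice of dist)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_match_ends(t, r, dist=euclidean):
    """D(i,j) = dist(t[i], r[j]) + min(D(i-1,j-2), D(i-1,j-1), D(i-2,j-1)).

    The path is anchored at (0, 0) and (len(t)-1, len(r)-1).
    Returns math.inf when no allowed path connects the two ends.
    """
    inf = math.inf
    n, m = len(t), len(r)
    D = [[inf] * m for _ in range(n)]
    D[0][0] = dist(t[0], r[0])
    for i in range(1, n):
        for j in range(1, m):
            best = D[i - 1][j - 1]
            if j >= 2:
                best = min(best, D[i - 1][j - 2])
            if i >= 2:
                best = min(best, D[i - 2][j - 1])
            if best < inf:
                D[i][j] = dist(t[i], r[j]) + best
    return D[n - 1][m - 1]

# The reference is spoken at half the speed of the input.
fast = [[0.0], [2.0], [4.0]]
slow = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(dtw_match_ends(fast, slow))
```

Because each step advances by at most two frames on either axis, a reference more than roughly twice as long as the input (or less than half as long) cannot be reached, and the function returns infinity.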

DTW Paths of “Match Ends”
We assume the speaking rate of the user's acoustic input is between
1/2 and 2 times that of the intended sentence.
Both ends are fixed. (Endpoint detection is critical.)
Suitable for voice command applications
[Figure: allowed DTW paths with both ends fixed; axes i and j]

DTW Paths of “Match Anywhere”
Both ends are free to move.
Suitable for personal voice retrieval applications, such as digital
voice recorders and personal voice documents
[Figure: allowed DTW paths with both ends free; axes i and j]
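One possible reading of "both ends are free" is that the whole input must be consumed, but the path may start and end at any frame of the reference. The sketch below follows that assumption and reuses the same local recurrence as the match-ends case; the function names and the Euclidean distance are illustrative choices, not taken from the slides.

```python
import math

def euclidean(a, b):
    """Frame-to-frame distance (an assumed choice of dist)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_match_anywhere(t, r, dist=euclidean):
    """DTW where the path may start and end at any reference frame.

    Returns (best accumulated distance, index of the reference frame
    where the best path ends).
    """
    inf = math.inf
    n, m = len(t), len(r)
    D = [[inf] * m for _ in range(n)]
    for j in range(m):                      # free starting point
        D[0][j] = dist(t[0], r[j])
    for i in range(1, n):
        for j in range(1, m):
            best = D[i - 1][j - 1]
            if j >= 2:
                best = min(best, D[i - 1][j - 2])
            if i >= 2:
                best = min(best, D[i - 2][j - 1])
            if best < inf:
                D[i][j] = dist(t[i], r[j]) + best
    end = min(range(m), key=lambda j: D[n - 1][j])   # free ending point
    return D[n - 1][end], end

# The query appears in the middle of a longer recording.
query = [[5.0], [6.0], [7.0]]
recording = [[0.0], [1.0], [5.0], [6.0], [7.0], [2.0]]
score, end_frame = dtw_match_anywhere(query, recording)
print(score, end_frame)
```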

Example DTW Path of “Match Ends”

DTW Demos
Match-ends (asr/demoDTW.m)
Match-anywhere (asr/demoVIR.m)

Hidden Markov Model
Characteristics:
Statistics-based approach
Requires more computation
Can achieve speaker independence
Suitable for large vocabularies
Difficult to implement on microprocessors/chips
Applications:
Speech-based full-text retrieval, dictation machines
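
Example of HMM
An example: recognizing the word 紐約 (New York)
1. Segment the text into words and convert it to Chang Gung phonetic spelling (長庚拼音):
   niou-Ye 紐約 0 (*.syl)
2. Find the models that correspond to each syllable:
   syllable   models
   niou       sil+n n+i i+o o+u u+sil
   Ye         sil+Y Y+e e+sil
3. Read the state information from the macros:
   Sil+*: 3 states; all others: 5 states
[Figure: the model sequences for niou and Ye]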
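The expansion in step 2 and the state counts in step 3 can be reproduced with a few lines of code. This is only a sketch: it assumes that every letter of the spelled syllable is one phone unit, which holds for the two syllables shown but is not confirmed by the slides as a general rule, and the function names are hypothetical.

```python
def syllable_to_models(syllable):
    """Expand a syllable into context-dependent model names of the form a+b.

    Assumption: each character of the syllable is one phone unit.
    """
    phones = ["sil"] + list(syllable) + ["sil"]
    return [f"{a}+{b}" for a, b in zip(phones, phones[1:])]

def state_count(model):
    """Models that start with sil+ have 3 states, all others have 5."""
    return 3 if model.lower().startswith("sil+") else 5

for syllable in "niou-Ye".split("-"):
    models = syllable_to_models(syllable)
    total_states = sum(state_count(m) for m in models)
    print(syllable, models, total_states)
```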
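
Viterbi Search in HMM
Dynamic Programming
With i as the frame index and j as the state index (the terms are added, so they are log probabilities):
$$\mathrm{hmmTable}(i,j) = \mathrm{StateProb}(i,j) + \max\{\mathrm{hmmTable}(i-1,j) + \mathrm{transitionProb}(j,j),\ \mathrm{hmmTable}(i-1,j-1) + \mathrm{transitionProb}(j-1,j)\}$$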
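The recurrence above can be sketched as a small left-to-right Viterbi routine in the log domain. The function name, the requirement that the path start in the first state and end in the last one, and the toy numbers are assumptions made for illustration.

```python
import math

def viterbi_left_to_right(state_logprob, trans_logprob):
    """Fill hmmTable(i, j) for a left-to-right HMM without skip transitions.

    state_logprob[i][j]: log-likelihood of frame i under state j.
    trans_logprob[a][b]: log transition probability from state a to state b.
    Returns (log score of the best path ending in the last state, state path).
    """
    n, m = len(state_logprob), len(state_logprob[0])
    neg = -math.inf
    table = [[neg] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    table[0][0] = state_logprob[0][0]          # assumed: start in state 0
    for i in range(1, n):
        for j in range(m):
            stay = table[i - 1][j] + trans_logprob[j][j]
            move = table[i - 1][j - 1] + trans_logprob[j - 1][j] if j > 0 else neg
            if stay >= move:
                best, prev = stay, j
            else:
                best, prev = move, j - 1
            table[i][j] = best + state_logprob[i][j]
            back[i][j] = prev
    path = [m - 1]                             # assumed: end in the last state
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return table[n - 1][m - 1], path

# Toy example: 3 frames, 2 states, hypothetical log probabilities.
state_lp = [[-1, -5], [-4, -1], [-6, -1]]
trans_lp = [[-1, -1], [-100, 0]]
score, path = viterbi_left_to_right(state_lp, trans_lp)
print(score, path)
```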

EM in HMM
The acoustic parameters of each state are estimated with the
Baum-Welch algorithm, which is a variant of EM (Expectation-Maximization).
To obtain a suitable set of parameters, we need a balanced corpus
of recordings from a variety of speakers.

Speedup Mechanisms in HMM
Search strategies:
Beam search in Viterbi search
Tree lexicon instead of linear lexicon
Implementation:
Fixed-point instead of floating-point operations
Many other tricks…
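As a rough illustration of the first search strategy, beam search discards, at every frame, the states whose score falls too far below the current best, so later frames expand fewer candidates. The helper below is a hypothetical sketch of that pruning step only; the name prune_to_beam and the beam width are assumptions.

```python
def prune_to_beam(scores, beam_width):
    """Keep only the states whose log score is within beam_width of the best."""
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam_width}

# Hypothetical log scores of three active states after one frame.
active = {"a": -10.0, "b": -12.0, "c": -30.0}
print(prune_to_beam(active, beam_width=5.0))
```

A narrower beam is faster but risks pruning the path that would have won later, so the width is usually tuned on held-out data.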
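
What Is a Linear Net?
Each name is annotated with its phonetic spelling and becomes its own path between the starting Sil and the ending Sil:
陳惠操   CrN-huei-cau
陳建智   CrN-jieN-Jy
陳雅姿   CrN-ia-jy
陳雅秀   CrN-ia-siou
陳雅玲   CrN-ia-liG
蔡茂豐   cai-mau-fG
孫愛玲   suN-ai-liG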
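
What Is a Tree Net?
The same names, with shared leading syllables stored only once between the starting Sil and the ending Sil:
Sil
  CrN
    huei-cau        (陳惠操)
    jieN-Jy         (陳建智)
    ia
      jy            (陳雅姿)
      siou          (陳雅秀)
      liG           (陳雅玲)
  cai-mau-fG        (蔡茂豐)
  suN-ai-liG        (孫愛玲)
Sil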
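
How to Convert a Linear Net into a Tree Net
Original list: 陳惠操, 蔡茂豐, 陳雅秀, 陳雅玲, 孫愛玲, 陳建智, 陳雅姿
Step 1, sort by field order: 陳惠操, 陳建智, 陳雅姿, 陳雅秀, 陳雅玲, 蔡茂豐, 孫愛玲
Step 2, remove the repeated prefixes: 陳惠操, 建智, 雅姿, 秀, 玲, 蔡茂豐, 孫愛玲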
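Merging shared prefixes is what a prefix tree (trie) does, so the conversion can be sketched with nested dictionaries. This is an illustrative sketch rather than the method used in the slides: it inserts the entries directly instead of sorting first, and it does not mark word ends, which is harmless here because no entry is a prefix of another. The function names are hypothetical.

```python
def build_tree_net(entries):
    """Insert each syllable sequence into a nested-dict prefix tree."""
    root = {}
    for syllables in entries:
        node = root
        for syllable in syllables:
            node = node.setdefault(syllable, {})
    return root

def count_nodes(node):
    """Number of syllable nodes stored in the tree."""
    return sum(1 + count_nodes(child) for child in node.values())

names = [
    "CrN-huei-cau", "CrN-jieN-Jy", "CrN-ia-jy", "CrN-ia-siou",
    "CrN-ia-liG", "cai-mau-fG", "suN-ai-liG",
]
entries = [name.split("-") for name in names]
tree = build_tree_net(entries)

linear_nodes = sum(len(entry) for entry in entries)  # every path stored in full
tree_nodes = count_nodes(tree)                       # shared prefixes stored once
print(linear_nodes, tree_nodes)
```

For these seven names the linear net holds 21 syllable nodes and the tree net 15, which matches the 15 characters left after the duplicate-removal step.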

Tree Net Structure
[Figure: tree net running from the starting Sil to the ending Sil; the branches 陳惠操, 建智, 雅姿, 秀, 玲, 蔡茂豐, and 孫愛玲 are connected through !N nodes]
!N = !NULL
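
Problems Encountered in Phonetic Annotation
Polyphonic characters (characters with more than one pronunciation)
Look them up in a lexicon of about 90,000 words that already carry phonetic annotations
Examples where problems may occur:
我們三人參加會議 ("The three of us attend the meeting")
朝辭白帝彩雲間 ("At dawn I leave Baidi amid colored clouds")
朝如青絲暮成雪 ("Black silk in the morning, turned to snow by evening")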

Robust Speech Recognition
Block diagram of speech feature extraction
Cepstral Mean Subtraction
Signal Bias Removal
Stochastic Matching
Spectral Subtraction
Noise Masking
Temporal filters (i.e., delta filters)
Missing-feature methods
Improved filter shapes for feature extraction
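Of the techniques listed above, cepstral mean subtraction is the simplest to illustrate: the per-dimension mean over an utterance is removed from every frame, which cancels any constant offset in the cepstral domain (such as a fixed channel effect). The sketch below assumes features are given as a list of equal-length frames; the function name is hypothetical.

```python
def cepstral_mean_subtraction(features):
    """Subtract the per-dimension mean over all frames of one utterance."""
    num_frames = len(features)
    dim = len(features[0])
    mean = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in features]

clean = [[1.0, 2.0], [3.0, 6.0]]
# The same frames with a constant offset added to each dimension.
shifted = [[11.0, -1.0], [13.0, 3.0]]
print(cepstral_mean_subtraction(clean))
print(cepstral_mean_subtraction(shifted))
```

Both calls give the same normalized frames, which is the property that makes the method useful against channel mismatch.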
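
HMM Demos
Speech-based full-text retrieval systems
Personal names: about 60 entries
Taipei City streets: about 900 roads
Three Hundred Tang Poems: about 3,200 lines
Dream of the Red Chamber: about 110,000 sentences
The Six Codes (compendium of laws): about 300,000 sentences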