Query by Singing (CBMR)
Speech Recognition
Jyh-Shing Roger Jang (張智星)
[email protected]
http://www.cs.nthu.edu.tw/~jang
Multimedia Information Retrieval Laboratory
Department of Computer Science, National Tsing Hua University
Speech Processing
Speech Recognition: converting speech into text, based on the input speech and on prior acoustic and textual analyses
Speaker Recognition: verifying a person’s identity or associating a person with a voice
Speech Coding: digital coding (compression) of speech for efficient, secure storage and transmission
Speech Synthesis: automatic generation of a speech signal starting from a normal textual input
Speech Enhancement: processing a speech signal that is subject to certain degradations to increase its intelligibility and/or its quality
-2-
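Introduction to Speech Recognition
Goal:
  Use voice input to recognize words from a restricted vocabulary
Characteristics:
  High technical barrier: requires familiarity with digital signal processing, acoustic models, matching methods, language models, and so on
  Corpus collection requires a great deal of manual labor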
-3-
Technical Difficulties and Considerations in Speech Recognition
1. Is the system required to recognize a specific individual or multiple speakers?
2. What is the size of the vocabulary?
3. Is the speech to be entered in discrete units with distinct pauses among them, or as a continuous utterance?
4. What is the extent of ambiguity and acoustic confusability in the vocabulary?
5. Is the system to be operated in a quiet or noisy environment, and what is the nature of the environmental noise if it exists?
6. What are the linguistic constraints placed upon the speech, and what linguistic knowledge is built into the recognizer?
-4-
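Applications of Speech Recognition
Selecting songs by voice
Automated voice-driven telephone switchboard
Full-text retrieval systems with a speech interface
  Example: full-text retrieval of "Harry Potter", locating the passage by voice
Lyrics retrieval systems
  Example: the lyric "潮起又潮落" ("the tide rises and falls again")
Any other application that can use speech as its interface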
-5-
Categories of Speech Recognition
Vocabulary Size
Small Vocabulary --- below 100 Words
Medium Vocabulary --- from 100 to 1000 Words
Large Vocabulary --- more than 1000 Words
Speaker Dependence
Speaker-Dependent
Speaker-Independent
-6-
Categories of Speech Recognition (continued)
Speaking Style
Isolated Words
Connected Words
Continuous Speech
Environment
Clean Speech
Noisy Speech
Channel Distorted
Microphone Mismatched
-7-
Directions in Speech Recognition
1980s and 1990s
Methodology
Hidden Markov Models
Neural Networks
The Trends
Large Vocabulary Continuous Speech Recognition
Robust Speech Recognition
Real-Time Speaker Adaptive Speech Recognition
Language Modeling
-8-
Speech Recognition Pipeline
Recording and feature extraction
Matching
  Dynamic Time Warping
  Hidden Markov Model
Displaying the matching result
-9-
Schematic of Speech Recognition
Isolated Word Problem: Concept
Vocabulary: 紐約 (New York), 台北 (Taipei), 台中 (Taichung), ..., 洛杉磯 (Los Angeles)
-10-
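Feature Extraction for Speech Recognition
MFCC: Mel-frequency Cepstral Coefficients
Blocks in the diagram:
  Speech signal
  Frame blocking
  Pre-emphasis
  Hamming window
  |DFT( )|^2
  Triangular filters
  log( ), giving the mel log spectrum
  Discrete Cosine Transform
  Log Energy, combined with the cepstral coefficients into a 13-D vector
  Differentiator
  39-D feature vector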
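The blocks above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation behind the slides: the sampling rate, frame length, hop size, number of filters, pre-emphasis coefficient, and the function names (`mfcc`, `mel_filterbank`) are all assumptions, pre-emphasis is applied before framing (a common arrangement that may differ from the figure), and `np.gradient` stands in for the differentiator.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[m - 1, k] = (hi - k) / (hi - mid)
    return fb

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=20, n_ceps=12, preemph=0.97):
    """Return a (n_frames, 39) feature matrix; all defaults are assumed values."""
    eps = 1e-10  # avoids log(0)
    # Pre-emphasis: first-order high-pass filter.
    x = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Frame blocking and Hamming window.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # |DFT|^2, triangular filters, log.
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + eps)
    # Discrete cosine transform (DCT-II basis, coefficients 1..n_ceps).
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filters)
    ceps = log_mel @ basis.T
    # Append log energy to get the 13-D static vector.
    log_energy = np.log(np.sum(frames ** 2, axis=1) + eps)[:, None]
    static = np.hstack([ceps, log_energy])
    # Differentiator: first and second time differences give 39 dimensions.
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])
```

Calling `mfcc` on a one-second signal sampled at 16 kHz yields one 39-D row per 10 ms hop under these assumed settings.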
-11-
Dynamic Time Warping
Characteristics:
  Pattern-matching-based approach
  Requires less computation
  Difficult to achieve speaker independence
  Suitable for small to medium vocabularies
  Suitable for microprocessor/chip implementation
Applications
  Mobile phones, car phones, toys, voice recorders
-12-
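Dynamic Time Warping (DTW)
t: input MFCC matrix
r: reference MFCC matrix
DTW recurrence:
$$D(i,j) = \operatorname{dist}\big(t(i), r(j)\big) + \min\{\, D(i-1,\,j-2),\ D(i-1,\,j-1),\ D(i-2,\,j-1) \,\}$$
Figure axes: i indexes the input frames (t(i-1), t(i)); j indexes the reference frames (r(j-1), r(j)).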
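The recurrence above can be written as a short dynamic-programming loop. The sketch below assumes Euclidean distance for dist, frames stored as rows of a NumPy array, and a path that starts at the first frame pair and ends at the last; the function name `dtw_match_ends` is hypothetical.

```python
import numpy as np

def dtw_match_ends(t, r):
    """DTW distance between input t (m x d) and reference r (n x d).

    Uses the three predecessors (i-1, j-2), (i-1, j-1), (i-2, j-1),
    with both ends of the path fixed. Returns inf if no path exists.
    """
    m, n = len(t), len(r)
    # Pairwise Euclidean distances between frames (assumed distance measure).
    dist = np.linalg.norm(t[:, None, :] - r[None, :, :], axis=2)
    D = np.full((m, n), np.inf)
    D[0, 0] = dist[0, 0]
    for i in range(1, m):
        for j in range(1, n):
            best = D[i - 1, j - 1]
            if j >= 2:
                best = min(best, D[i - 1, j - 2])
            if i >= 2:
                best = min(best, D[i - 2, j - 1])
            D[i, j] = dist[i, j] + best
    return float(D[m - 1, n - 1])
```

Because each step advances at most two frames on one axis per frame on the other, an input more than about twice as long or short as the reference has no valid path and the function returns infinity.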
-13-
DTW Paths of “Match Ends”
We assume the speed of a user’s acoustic input is between 1/2 and 2 times that of the intended sentence.
Both ends are fixed. (Endpoint detection is critical.)
Suitable for voice command applications
-14-
DTW Paths of “Match Anywhere”
Both ends are free to move.
Suitable for personal voice retrieval applications, such as voice recorders and personal voice documents
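One way to free the ends is to let the path start at any reference frame and stop at any reference frame, while still consuming the whole input. The sketch below makes exactly that assumption (the slides do not spell out which sequence is free), reuses the same three predecessors as the DTW recurrence, and assumes Euclidean distance; `dtw_match_anywhere` is a hypothetical name.

```python
import numpy as np

def dtw_match_anywhere(t, r):
    """Align all of input t (m x d) to any stretch of reference r (n x d).

    Returns (cost, end_index), where end_index is the reference frame
    at which the best path ends.
    """
    m, n = len(t), len(r)
    dist = np.linalg.norm(t[:, None, :] - r[None, :, :], axis=2)
    D = np.full((m, n), np.inf)
    D[0, :] = dist[0, :]                  # free start: begin at any reference frame
    for i in range(1, m):
        for j in range(1, n):
            best = D[i - 1, j - 1]
            if j >= 2:
                best = min(best, D[i - 1, j - 2])
            if i >= 2:
                best = min(best, D[i - 2, j - 1])
            D[i, j] = dist[i, j] + best
    end = int(np.argmin(D[m - 1]))        # free end: stop at the cheapest reference frame
    return float(D[m - 1, end]), end
```

For example, a three-frame query that appears verbatim inside a longer recording is found with zero cost, and the returned index tells where the match ends.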
-15-
Example DTW Path of “Match Ends”
-16-
DTW Demos
Match-ends (asr/demoDTW.m)
Match-anywhere (asr/demoVIR.m)
-17-
Hidden Markov Model
Characteristics:
  Statistics-based approach
  Requires more computation
  Can achieve speaker independence
  Suitable for large vocabularies
  Difficult to implement on a microprocessor/chip
Applications
  Voice-based full-text retrieval, dictation machines
-18-
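Example of HMM
An example: recognizing the word “紐約” (New York)
1. Segment the word and convert it to Chang Gung pinyin (長庚拼音)
   niou-Ye 紐約 0 (*.syl)
2. Find the models corresponding to each syllable
   niou: sil+n n+i i+o o+u u+sil
   Ye: sil+Y Y+e e+sil
3. Read the state information from the macros
   sil+* models: 3 states; all others: 5 states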
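The expansion in steps 2 and 3 can be mimicked with a few lines of Python. This is a sketch under two assumptions that happen to hold for this example: each letter of a syllable is one phone unit, and "sil+*" means a model whose left side is sil. The function names are hypothetical.

```python
def syllable_to_models(syllable):
    """Expand a syllable such as 'niou' into left+right unit models."""
    units = ["sil"] + list(syllable) + ["sil"]   # assumes one letter per phone unit
    return [f"{a}+{b}" for a, b in zip(units, units[1:])]

def state_count(model):
    """3 states for sil+* models, 5 states for all others (as on the slide)."""
    return 3 if model.startswith("sil+") else 5

syllables = "niou-Ye".split("-")
models = [m for syl in syllables for m in syllable_to_models(syl)]
total_states = sum(state_count(m) for m in models)
```

Under these assumptions, `models` reproduces the eight models listed in step 2, and the word uses 36 states in total (23 for niou and 13 for Ye).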
-19-
Viterbi Search in HMM
Dynamic Programming over the grid of cells (i, j):
$$\text{hmmTable}(i,j) = \text{StateProb}(i,j) + \max\{\, \text{hmmTable}(i-1,j) + \text{transitionProb}(j,j),\ \text{hmmTable}(i-1,j-1) + \text{transitionProb}(j-1,j) \,\}$$
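The recurrence can be sketched directly in Python. The sketch assumes that i indexes frames, j indexes states of a left-to-right model, and that all scores are log probabilities (which is why they are added). The argument layout and the function name are assumptions for illustration.

```python
import numpy as np

def viterbi_left_to_right(state_logprob, self_logp, next_logp):
    """Best state path through a left-to-right HMM.

    state_logprob[i, j]: log-likelihood of frame i in state j   (StateProb)
    self_logp[j]:        log P(j -> j)                           (transitionProb(j, j))
    next_logp[j]:        log P(j-1 -> j), entry 0 is unused      (transitionProb(j-1, j))
    The path is forced to start in state 0 and end in the last state.
    """
    T, S = state_logprob.shape
    table = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    table[0, 0] = state_logprob[0, 0]
    for i in range(1, T):
        for j in range(S):
            stay = table[i - 1, j] + self_logp[j]
            move = table[i - 1, j - 1] + next_logp[j] if j > 0 else -np.inf
            if move > stay:
                table[i, j], back[i, j] = state_logprob[i, j] + move, j - 1
            else:
                table[i, j], back[i, j] = state_logprob[i, j] + stay, j
    # Trace the best path backwards from the final state.
    path = [S - 1]
    for i in range(T - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return float(table[T - 1, S - 1]), path[::-1]
```

The returned path gives, for each frame, the state it was aligned to; the returned score is the log probability of that best alignment.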
-20-
EM in HMM
The acoustic parameters of each state are estimated with the Baum-Welch algorithm, which is a special case of EM (Expectation Maximization).
To obtain a suitable set of parameters, we need a balanced corpus of recordings from a variety of speakers.
-21-
Speedup Mechanism in HMM
Search strategies:
Beam search in Viterbi search
Tree lexicon instead of linear lexicon
Implementation
Fixed-point instead of floating-point operations
Many other tricks…
-22-
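What Is a Linear Net?
Each name is annotated with its pinyin and kept as a separate full path:
  陳惠操: CrN-huei-cau
  陳建智: CrN-jieN-Jy
  陳雅姿: CrN-ia-jy
  陳雅秀: CrN-ia-siou
  陳雅玲: CrN-ia-liG
  蔡茂豐: cai-mau-fG
  孫愛玲: suN-ai-liG
The figure also shows Sil (silence) nodes attached to the net.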
-23-
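What Is a Tree Net?
Names are annotated with pinyin as before, but entries that share a prefix share the corresponding path, so each entry adds only the part not already in the tree:
  陳惠操: CrN-huei-cau
  陳建智: jieN-Jy (shares CrN)
  陳雅姿: ia-jy (shares CrN)
  陳雅秀: siou (shares CrN-ia)
  陳雅玲: liG (shares CrN-ia)
  蔡茂豐: cai-mau-fG
  孫愛玲: suN-ai-liG
The figure also shows Sil (silence) nodes attached to the net.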
-24-
How to Convert from Linear to Tree
Original list:
  陳惠操, 蔡茂豐, 陳雅秀, 陳雅玲, 孫愛玲, 陳建智, 陳雅姿
Step 1: sort by field order
  陳惠操, 陳建智, 陳雅姿, 陳雅秀, 陳雅玲, 蔡茂豐, 孫愛玲
Step 2: remove the repeated prefixes
  陳惠操, 建智, 雅姿, 秀, 玲, 蔡茂豐, 孫愛玲
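The effect of sharing prefixes can be checked with a small prefix tree built from the pinyin strings of the Linear Net slide. This sketch uses nested dictionaries and inserts entries one by one rather than sorting first; both choices, and the function names, are assumptions made for brevity, not the procedure used by the original system.

```python
def build_tree(entries):
    """Build a prefix tree (nested dicts) keyed by syllable."""
    root = {}
    for entry in entries:
        node = root
        for syllable in entry.split("-"):
            node = node.setdefault(syllable, {})   # reuse an existing branch if present
    return root

def count_nodes(tree):
    """Number of syllable nodes in the tree."""
    return sum(1 + count_nodes(child) for child in tree.values())

lexicon = [
    "CrN-huei-cau", "CrN-jieN-Jy", "CrN-ia-jy", "CrN-ia-siou",
    "CrN-ia-liG", "cai-mau-fG", "suN-ai-liG",
]
linear_nodes = sum(len(e.split("-")) for e in lexicon)   # one node per syllable per entry
tree_nodes = count_nodes(build_tree(lexicon))
```

For these seven names the linear net needs 21 syllable nodes and the tree needs 15, which matches the number of syllables left on the Tree Net slide after the shared prefixes are removed.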
-25-
Tree Net Structure
Figure: the name tree, with shared characters merged:
  陳
    惠 操
    建 智
    雅
      姿
      秀
      玲
  蔡 茂 豐
  孫 愛 玲
The figure also contains Sil nodes and !N nodes, where !N = !NULL.
-26-
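Problems Encountered in Phonetic Annotation
Heteronyms (characters with more than one pronunciation)
  Look them up in a lexicon that has already been annotated with pronunciations (about 90,000 words)
Problems that may still occur:
  我們三人參加會議 ("The three of us attend the meeting")
  朝辭白帝彩雲間 ("At dawn I left Baidi amid colored clouds")
  朝如青絲暮成雪 ("Like black silk in the morning, turned to snow by evening")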
-27-
Robust Speech Recognition
Block diagram of speech feature extraction
Cepstral Mean Subtraction
Signal Bias Removal
Stochastic Matching
Spectral Subtraction
Noise Masking
Temporal filter (that is, a delta filter)
Missing-feature method
Improved filter shapes for computing the feature parameters
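Of the techniques listed above, Cepstral Mean Subtraction is the simplest to illustrate: a fixed channel or microphone response shows up as a roughly constant offset in the cepstral domain, so subtracting the per-utterance mean of each coefficient removes it. The sketch below assumes features are stored one frame per row; the function name is hypothetical.

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean of every cepstral coefficient.

    features: (n_frames, n_coeffs) array of cepstral features.
    """
    return features - features.mean(axis=0, keepdims=True)
```

After this step, two recordings of the same utterance that differ only by a constant cepstral offset produce the same features.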
-28-
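HMM Demos
Voice-based full-text retrieval systems
  Personal names: about 60 entries
  Taipei City streets: about 900 roads
  Three Hundred Tang Poems (唐詩三百首): about 3,200 lines
  Dream of the Red Chamber (紅樓夢): about 110,000 sentences
  Six Codes (六法全書): about 300,000 sentences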
-29-