Spoken Language Understanding
Intelligent Robot Lecture Note
Error Correction and Adaptation
Dialog System and Adaptation
[Figure: the ASR engine (acoustic model + language model) produces ASR outputs; a post-processor performs adaptation or re-ranking, using transcripts, an error-corrective LM (ECLM), and a semantic LM, and passes the result to the dialog system.]
Error Correction and Adaptation
• ‘Error correction’ automatically recovers ASR errors by post-processing [Ringger and Allen, 1996].
Transcript: 대출 기간 을 알려 줘 ("Let me know the period of the loan.")
ASR output: 대출 이란 을 알려 줘 ("Let me know the 'that is' of the loan.")
Error correction: 이란 (that is) → 기간 (period)
• ‘Adaptation’ is the process of fitting the ASR system to a new domain, speaker, or speech style [Bellegarda, 2004].
[Figure: SLM adaptation framework from [Bellegarda, 2004]. Corpus A (adaptation data) feeds information extraction to obtain task-specific knowledge; Corpus B (background data) feeds SLM estimation to obtain the initial model P_B(w1…wn); SLM adaptation combines the two into the adapted model P(w1…wn).]
Error Correction and Adaptation
• Error correction
  ► Rule-based approach: pattern mining / transformation-rule learning
  ► Statistical model: the noisy-channel framework, as in statistical machine translation (channel model = translation, language model = source)
• Adaptation
  ► Re-ranking with richer linguistic information (e.g. POS tags, phrase chunks, parse trees, topics, WordNet, etc.) [Bellegarda, 2004; Rosenfeld, 1996]
  ► Interpolated LM (a sketch of the interpolation follows this list)
• The Error-Corrective Language Model (ECLM) is a marriage of the two methods.
  ► Domain-specific / user-oriented adaptation
  ► ECLM is a statistical re-ranking method that interactively adapts the speech recognizer's language models [Jeong et al., 2005].
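As a concrete illustration of the interpolation idea (a standard linear mixture, not a formula copied from the slide), the adapted model can combine the adaptation-corpus LM P_A and the background LM P_B:

    P_{\text{adapted}}(w_i \mid h_i) = \lambda\, P_A(w_i \mid h_i) + (1 - \lambda)\, P_B(w_i \mid h_i), \qquad 0 \le \lambda \le 1

where h_i is the word history and λ is tuned on held-out adaptation data.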
Error Corrective Adaptation
• Two-stage re-ranking
  ► CORRECTOR: generating corrected candidates with the translation (channel) model
  ► RERANKER: re-ranking the candidates with semantic information
[Figure: two-stage architecture. Adaptation data (ASR output and transcripts) trains the channel model; a text corpus trains the bigram LM and the semantic LM. The CORRECTOR takes the ASR output and produces an n-best list; SLU assigns semantic frames; the RERANKER returns the 1-best hypothesis from the n-best list.]
An example of error-corrective re-ranking:
Reference: 대출 기간 을 알려 줘 (request (search_period (period=대출 기간)))
ASR output: 대출 이란 을 알려 줘 (request (search_calculate (period=대출 이란)))
Error-corrected candidates:
  대출 이란 을 알려 줘 (request (search_calculate (period=대출 이란)))
  대출 기간 을 알려 줘 (request (search_period (period=대출 기간)))
  대출 이라는 알려 줘 (request (search_period))
  대출 이 안 을 알려 줘 (request (search_period))
  대출 기간 을 알려 줘 야 (request (search_period (period=대출 기간)))
  대출 은 알려 줘 (request (search_about (about=대출)))
Re-ranked with SLU: 대출 기간 을 알려 줘 (request (search_period (period=대출 이란)))
Error Correction
• Correcting the ASR hypotheses as a MAP process (the noisy-channel formulation is sketched below)
• Noisy-channel correction model
• Modeling word-to-word translation pairs
  ► e.g. p(TAKE A | TICKET), p(TRAIN | TRAIN), p(FROM | FROM), p(CHICAGO | CHICAGO), p(TO | TO), p(TOLEDO | TO LEAVE)
  ► Word mapping with a fertility model (IBM Model 4 [Brown et al., 1993])
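A minimal sketch of the noisy-channel formulation (standard MAP form; o denotes the ASR output and w a candidate corrected word sequence):

    \hat{w} = \arg\max_{w} P(w \mid o) = \arg\max_{w} P(o \mid w)\, P(w)

Here P(o | w) is the channel (translation) model and P(w) is the language model (source).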
Estimating the Channel Model
• Maximum likelihood estimation
  ► Count the pairs (o_i, w_i) in the training (adaptation) data; a small counting sketch follows this list
• Discounting methods
  ► To reduce over-fitting
  ► Apply well-known language-model smoothing techniques
  ► Absolute discounting (abs), Good-Turing (GT), and modified Kneser-Ney (KN) [Chen and Goodman, 1998]
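A minimal counting sketch under assumed data structures (aligned pairs of ASR word o_i and reference word w_i; the discount value d is illustrative, not taken from the lecture):

    from collections import defaultdict

    def estimate_channel_model(aligned_pairs, d=0.5):
        """MLE of p(o | w) from aligned (ASR word, reference word) pairs,
        with simple absolute discounting to reduce over-fitting."""
        pair_count = defaultdict(lambda: defaultdict(int))
        w_count = defaultdict(int)
        for o, w in aligned_pairs:
            pair_count[w][o] += 1
            w_count[w] += 1
        model = {}
        for w, outs in pair_count.items():
            reserved = d * len(outs) / w_count[w]   # mass held out for unseen outputs
            model[w] = {o: max(c - d, 0.0) / w_count[w] for o, c in outs.items()}
            model[w]['<other>'] = reserved
        return model

    # toy usage: the reference word 기간 was misrecognized as 이란 twice, recognized correctly once
    pairs = [('이란', '기간'), ('이란', '기간'), ('기간', '기간')]
    print(estimate_channel_model(pairs)['기간'])

Good-Turing or modified Kneser-Ney smoothing would replace the simple absolute discount shown here.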
Re-ranking with Language Understanding
• Re-rank the n-best hypotheses or the word lattice
• P_LM can be any type of language model trained on background and domain-specific text data.
  ► Two baselines: trigram and class-based n-gram (CLM)
• Exploiting the semantic information
  ► Semantic Class-based Language Model (SCLM)
    ◦ s = {s1,…,sn}: semantic classes ~ (domain-specific) named-entity categories
    ◦ Estimated in the same way as a class-based n-gram [Brown et al., 1992] (see the sketch below)
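For reference, the class-based n-gram factorization of [Brown et al., 1992] in its bigram form (the SCLM replaces word classes with the semantic classes s_i):

    P(w_i \mid w_{i-1}) = P(w_i \mid s_i)\, P(s_i \mid s_{i-1})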
Re-ranking with Language Understanding
• Exploiting the semantic information
  ► Another way to estimate the SCLM: a MaxEnt exponential model
    ◦ x_i = {x_i1, x_i2, x_i3, …}: lexical, syntactic, and semantic features
Defined semantic classes for the two domains:
Air-travel domain: AIRLINE, AIRLINE_CODE, AIRPORT_CODE, AIRPORT_NAME, CAR_CLASS, CITY_NAME, CLASS_TYPE, COST_RELATIVE, COUNTRY, DATE_ORDINAL, DATE_RELATIVE, DAY_NAME, DAY_NUMBER, FLIGHT_NUMBER, FLIGHT_STOP, HOTEL_DURATION, HOTEL_LOC, HOTEL_NAME, LEG_NUM, MONTH_NAME, NEGATIVE, PERIOD_MOD, PERIOD_OF_DATE, PERIOD_OF_DAY, RENTAL_COMPANY, ROUND_TRIP, START_TIME, STATE_NAME, TIME, TIME_RELATIVE, TODAY_RELATIVE
Telebank domain: ABOUT, AMOUNT, BENEFIT, COUNT, INFO, KIND, LIMIT, PERIOD, PROFIT, QUALIFIED, RATE, REFUND, RULE, SERVICE, VIA
Reading Lists
• J. R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93-108.
• S. F. Chen and J. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
• M. Jeong, J. Eun, S. Jung, and G. G. Lee. 2005. An error-corrective language-model adaptation for automatic speech recognition. Interspeech 2005-Eurospeech, Lisbon.
• M. Jeong and G. G. Lee. 2006. Experiments on error-corrective language model adaptation. WESPAC-9, Seoul.
• E. K. Ringger and J. F. Allen. 1996. Error correction via a post-processor for continuous speech recognition. In ICASSP.
• R. Rosenfeld. 1996. A Maximum Entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187-228.
Spoken Language Understanding
• Spoken language understanding maps natural-language speech to a frame structure that encodes its meaning.
• What is the difference between NLU and SLU?
  ► Robustness: noisy and ungrammatical spoken language
  ► Domain dependence: deeper domain-specific semantics (e.g. Person vs. Cast)
  ► Dialog: analysis depends on the dialog history and proceeds utterance by utterance
• Traditional approaches: natural language to SQL conversion
[Figure: a typical ATIS system (from [Wang et al., 2005]): Speech → ASR → Text → SLU → Semantic Frame → SQL generation → SQL → Database → Response.]
Semantic Representation
• Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]
  ► An intermediate semantic representation that serves as the interface between the user and the dialog system
  ► Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it expects.
"Show me flights from Seattle to Boston"
<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>
Semantic representation for the ATIS task: XML format (above) and the equivalent hierarchical representation (ShowFlight → Subject: FLIGHT; Flight → Departure_City: SEA, Arrival_City: BOS) [Wang et al., 2005]
Semantic Representation
• Two common components in a semantic frame
  ► Dialog act (DA): the meaning of an utterance at the discourse level; in practice it is approximately equivalent to the intent or subject slot.
  ► Named entity (NE): the identifier of an entity such as a person, location, organization, or time. In SLU it represents the domain-specific meaning of a word (or word group).
• Example (ATIS and EPG domains, simplified representation)
Show me flights from Denver to New York on Nov. 18th
DIALOG_ACT = Show_Flight
FROMLOC.CITY_NAME = Denver
TOLOC.CITY_NAME = New York
MONTH_NAME = Nov.
DAY_NUMBER = 18th
I want to watch LOST
DIALOG_ACT = Search_Program
PROGRAM = LOST
Semantic Frame Extraction
• Semantic frame extraction (~ an information extraction approach)
  1) Dialog act / main action identification ~ classification
  2) Frame-slot object extraction ~ named entity recognition
  3) Object-attribute attachment ~ relation extraction
  ► 1) + 2) + 3) ~ unification
[Figure: overall architecture of the semantic analyzer. An input utterance goes through feature extraction/selection, then dialog act identification, frame-slot extraction, and relation extraction (supported by external information sources), and finally unification.]
Examples of semantic frame structures:
Utterance: 롯데월드에 어떻게 가나요? ("How do I get to Lotte World?")
  Domain: Navigation / Dialog Act: WH-question / Main Action: Search / Object.Location.Destination=롯데월드
Utterance: 난 롯데월드가 너무 좋아. ("I really like Lotte World.")
  Domain: Chat / Dialog Act: Statement / Main Action: Like / Object.Location=롯데월드
Knowledge-based Systems
• Knowledge-based systems:
  ► Developers write a syntactic/semantic grammar
  ► A robust parser analyzes the input text with the grammar
  ► No large amount of training data is required
• Previous work
  ► MIT: TINA (natural language understanding) [Seneff, 1992]
  ► CMU: PHOENIX [Pellom et al., 1999]
  ► SRI: GEMINI [Dowding et al., 1993]
• Disadvantages
  1) Grammar development is an error-prone process
  2) It takes multiple rounds to fine-tune a grammar
  3) Combined linguistic and engineering expertise is required to construct a grammar with good coverage and optimized performance
  4) Such a grammar is difficult and expensive to maintain
Statistical Systems
• Statistical SLU approaches:
  ► The system can automatically learn from example sentences paired with their corresponding semantics
  ► The annotations are much easier to create and do not require specialized knowledge
• Previous work
  ► Microsoft: HMM/CFG composite model [Wang et al., 2005]
  ► AT&T: CHRONUS (finite-state transducers) [Levin and Pieraccini, 1995]
  ► Cambridge Univ.: Hidden Vector State model [He and Young, 2005]
  ► POSTECH: semantic frame extraction using statistical classifiers [Eun et al., 2004; Eun et al., 2005; Jeong and Lee, 2006]
• Disadvantages
  1) Data sparseness: the system requires a large corpus
  2) Lack of domain knowledge
Machine Learning for SLU
• Relational Learning (RL) or Structured Prediction (SP) [Dietterich, 2002; Sutton and McCallum, 2004]
  ► Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classifier
  ► Argmax search (e.g. max-sum, belief propagation, Viterbi, etc.)
• For language processing, RL basically uses a left-to-right structure (a.k.a. linear-chain or sequence structure)
• Algorithms: CRFs, Max-Margin Markov Networks (M3N), SVM for Interdependent and Structured Outputs (SVM-ISO), Structured Perceptron, etc.
Machine Learning for SLU
• Background: Maximum Entropy (a.k.a. logistic regression)
  ► Conditional and discriminative
  ► Unstructured (no dependencies among the outputs y)
  ► Used for the dialog act classification problem
• Conditional Random Fields [Lafferty et al., 2001]
  ► A structured version of MaxEnt (argmax search at inference time)
  ► Undirected graphical model
  ► Popular in language and text processing
  ► Linear-chain structure for practical implementation
  ► Used for the named entity recognition problem
[Figure: graphical models of MaxEnt (a single output z with observation features h_k over x) and a linear-chain CRF (outputs y_{t-1}, y_t, y_{t+1}, transition features f_k, observation features g_k over x_{t-1}, x_t, x_{t+1}).]
Semantic NER as Sequence Labeling
• Relational learning for language processing
  ► Left-to-right n-th order Markov model (linear chain or sequence)
  ► E.g. part-of-speech tagging, noun-phrase chunking, information extraction, speech recognition, etc.
  ► Very large feature spaces (e.g. state-of-the-art NP chunking uses > 1M features)
  ► An open problem is how to reduce the training cost (even for a 1st-order Markov model)
• Transformation to the BIO representation [Ramshaw and Marcus, 1995]
  ► B = begin of entity, I = inside of entity, O = outside (a decoding sketch follows the example below)
Show  me  flight  from  Denver    to  New       York      on  Nov.     18th
O     O   O       O     F.CITY-B  O   T.CITY-B  T.CITY-I  O   MONTH-B  DAY-B
(recovered entities: F.CITY = Denver, T.CITY = New York, MONTH = Nov., DAY = 18th)
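A minimal decoding sketch (an assumed helper, not part of the lecture materials) that turns a BIO-tagged token sequence like the one above into (label, phrase) slots:

    def bio_to_slots(words, tags):
        """Collect (label, phrase) pairs from tags of the form LABEL-B / LABEL-I / O."""
        slots, label, span = [], None, []
        for word, tag in zip(words, tags):
            if tag.endswith('-B'):                     # a new entity starts here
                if label:
                    slots.append((label, ' '.join(span)))
                label, span = tag[:-2], [word]
            elif tag.endswith('-I') and label == tag[:-2]:
                span.append(word)                      # continue the current entity
            else:                                      # 'O' or an inconsistent tag
                if label:
                    slots.append((label, ' '.join(span)))
                label, span = None, []
        if label:
            slots.append((label, ' '.join(span)))
        return slots

    words = ['Show', 'me', 'flight', 'from', 'Denver', 'to', 'New', 'York', 'on', 'Nov.', '18th']
    tags  = ['O', 'O', 'O', 'O', 'F.CITY-B', 'O', 'T.CITY-B', 'T.CITY-I', 'O', 'MONTH-B', 'DAY-B']
    print(bio_to_slots(words, tags))
    # [('F.CITY', 'Denver'), ('T.CITY', 'New York'), ('MONTH', 'Nov.'), ('DAY', '18th')]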
Theoretical Background for SLU
Maximum Entropy
• (Conditional) Maximum Entropy = log-linear model, exponential model, multinomial logistic regression
• The criterion is based on information theory
  ► We want the distribution that is as uniform as possible subject to the constraints, i.e. has maximum entropy
  ► Given the parametric form, fitting a maximum entropy model to a collection of training data entails finding values for the parameter vector Λ that maximize the (conditional) likelihood of the data
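A standard statement of the criterion (reconstructed; the slide's own equations are images): choose the conditional distribution with maximum entropy subject to matching the empirical feature expectations,

    \max_{p} \; H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)
    \quad \text{s.t.} \quad \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_k(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f_k(x,y) \quad \forall k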
MaxEnt - Derivation
• Introduce the Lagrangian
• Differentiate the Lagrangian to obtain the parametric form
• Dual form
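In outline (a standard derivation, reconstructed since the slide's equations are images): introducing a multiplier λ_k per constraint and setting the derivative of the Lagrangian to zero yields the log-linear parametric form, and the dual problem is maximum conditional likelihood:

    p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\Big(\sum_k \lambda_k f_k(x, y)\Big),
    \qquad Z_\Lambda(x) = \sum_{y'} \exp\Big(\sum_k \lambda_k f_k(x, y')\Big)

    \Lambda^{*} = \arg\max_{\Lambda} \sum_{i=1}^{N} \log p_\Lambda\big(y^{(i)} \mid x^{(i)}\big)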
MaxEnt - Parametric Form
• Parametric form: an inner product of the parameter vector and the feature vector
• The log-likelihood function is concave → a global maximum exists
• Constrained optimization → unconstrained optimization
• Regularization (= MAP estimation with a Gaussian prior)
• Why a Maximum Entropy classifier?
  ► Efficient computation (an inner product of the parameter vector and a feature vector)
  ► e.g. SVM training (1 week) vs. MaxEnt (1 hour)
  ► Handles overlapping features
  ► Automatic feature induction (for exponential models)
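The regularized objective mentioned above, in a standard form (Gaussian prior with variance σ²; reconstructed, not copied from the slide):

    L(\Lambda) = \sum_{i=1}^{N} \log p_\Lambda\big(y^{(i)} \mid x^{(i)}\big) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}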
MaxEnt - Parameter Estimation
• Generalized Iterative Scaling (GIS)
  ► One popular method for iteratively refining the parameters, from Darroch and Ratcliff (1972) and Deming and Stephan (1940)
  ► It requires only the expected feature values, but converges very slowly
• Conjugate Gradient (Fletcher-Reeves and Polak-Ribière-Positive)
  ► The expected values required by GIS are essentially the gradient, so we can use the gradient information directly
• Limited-memory quasi-Newton (L-BFGS)
  ► For large-scale problems, evaluating the Hessian matrix is computationally impractical → quasi-Newton methods
  ► Furthermore, the limited-memory version of the BFGS algorithm saves storage on large-scale problems (a small training sketch follows)
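A minimal training sketch (binary case only, using scipy's L-BFGS implementation; the toy data and σ² value are illustrative, not from the lecture):

    import numpy as np
    from scipy.optimize import minimize

    def train_maxent(X, y, sigma2=10.0):
        """Binary MaxEnt (logistic regression) trained with L-BFGS on the
        negative penalized log-likelihood and its gradient."""
        n, d = X.shape

        def objective(w):
            z = X @ w
            log_p1 = -np.logaddexp(0.0, -z)                    # log P(y=1|x)
            log_p0 = -np.logaddexp(0.0, z)                     # log P(y=0|x)
            ll = np.sum(y * log_p1 + (1 - y) * log_p0) - np.sum(w ** 2) / (2 * sigma2)
            grad = X.T @ (y - 1.0 / (1.0 + np.exp(-z))) - w / sigma2
            return -ll, -grad                                  # minimize the negative

        result = minimize(objective, np.zeros(d), jac=True, method='L-BFGS-B')
        return result.x

    # toy usage: two features, four training examples
    X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
    y = np.array([1.0, 1.0, 0.0, 0.0])
    print(train_maxent(X, y))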
Conditional Random Fields
• An undirected graphical model [Lafferty et al., 2001]
  ► Linear-chain structure
[Figure: linear-chain CRF with label sequence y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}.]
• The label bias problem (of locally normalized models, which globally normalized CRFs avoid)
  ► States whose following-state distributions have low entropy are preferred
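The linear-chain CRF defines a globally normalized conditional distribution (the standard form of [Lafferty et al., 2001]); normalization over whole label sequences is what avoids the label bias problem:

    p_\Lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\Lambda(\mathbf{x})}
    \exp\Big(\sum_{t}\sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
    \qquad Z_\Lambda(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big(\sum_{t}\sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big)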
CRFs - Parameter Estimation
• A 2-norm penalized (conditional) log-likelihood
• Maximizing the log-likelihood, the gradient compares the empirical feature distribution with the model expectation, computed from the marginal distributions (see the sketch below)
• Optimization techniques:
  ► GIS or IIS, Conjugate Gradient, or L-BFGS
  ► Voted perceptron, stochastic meta-descent
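In a standard form (reconstructed; the slide's equations are images), the penalized conditional log-likelihood and its gradient are

    L(\Lambda) = \sum_{i} \log p_\Lambda(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}

    \frac{\partial L}{\partial \lambda_k}
    = \sum_{i,t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, \mathbf{x}^{(i)}, t\big)
    - \sum_{i,t} \sum_{y, y'} p_\Lambda\big(y_{t-1}{=}y,\, y_t{=}y' \mid \mathbf{x}^{(i)}\big)\, f_k\big(y, y', \mathbf{x}^{(i)}, t\big)
    - \frac{\lambda_k}{\sigma^2}

where the first term is the empirical feature count, the second is the model expectation computed from the marginal distributions, and the last comes from the 2-norm penalty.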
CRFs - Inference
• In training
  ► Compute the marginal distributions p(y_t = y, y_{t+1} = y' | x^(i))
  ► via the forward-backward recursion
• In testing
  ► The Viterbi recursion (sketched below)
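A minimal Viterbi sketch over log-potentials (the array layout is an assumption, not the lecture's notation: trans[i, j] scores the transition i → j and emit[t, j] scores label j at position t):

    import numpy as np

    def viterbi(trans, emit):
        """Most likely label sequence for a linear-chain model given log-potentials."""
        T, L = emit.shape
        score = np.empty((T, L))
        backptr = np.zeros((T, L), dtype=int)
        score[0] = emit[0]
        for t in range(1, T):
            # cand[i, j]: best path ending in label i at t-1, then moving to label j at t
            cand = score[t - 1][:, None] + trans + emit[t][None, :]
            backptr[t] = np.argmax(cand, axis=0)
            score[t] = np.max(cand, axis=0)
        path = [int(np.argmax(score[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return path[::-1]

    # toy usage: 3 positions, 2 labels
    trans = np.log([[0.7, 0.3], [0.4, 0.6]])
    emit = np.log([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
    print(viterbi(trans, emit))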
Exploiting Non-Local Information for SLU
Long-distance Dependency Problem
… fly from denver to chicago on dec. 10th 1999 …      ("dec." labeled DEPART.MONTH)
… return from denver to chicago on dec. 10th 1999 …   ("dec." labeled RETURN.MONTH)
• Most practical NLP models employ a local feature set
  ► Local context features (sliding window)
  ► E.g. for "dec.": current = dec., prev-1 = on, prev-2 = chicago, next+1 = 10th, next+2 = 1999, POS tag = NN, chunk = PP
  ► However, the two occurrences of "dec." above get exactly the same feature set (even though their labels differ)
• Non-local features or higher-order dependencies should be considered
Using Non-local Information
• Syntactic parser-based approach
  ► From the fields of semantic role labeling and relation extraction
  ► Parse tree path, governing category, or head word
  ► Advantage: captures the global structure of language
  ► Disadvantage: poor fit for informal language (speech, email)
• Data-driven approach
  ► From the fields of information retrieval and language modeling
  ► Identical words, lexical co-occurrence, or triggering
  ► Advantage: easy to extract; fits ungrammatical input
  ► Disadvantage: depends on the data set
Hidden Vector State Model
• A CFG-like model [He and Young, 2005]
  ► Extends the 'flat-concept' HMM model
  ► Represents hierarchical (right-branching) structure using hidden state vectors
  ► Each state is expanded to encode the stack of a push-down automaton
[Figure: HVS analysis of "I want to return to chicago on dec.". Each word carries a stack of semantic concepts under the sentence-start state SS (e.g. chicago → CITY TOLOC RETURN, dec. → MONTH ON RETURN, and DUMMY for semantically empty words), built by a sequence of PUSH/POP stack operations.]
Previous Slot Context
• A CRF-based model [Wang et al., 2005]
  ► A conditional model for the ATIS-domain SLU system
  ► Efficient heuristic functions that encode non-local information
  ► But it requires a domain-specific lexicon and knowledge
  ► More general alternatives: the skip-chain CRF [Sutton and McCallum, 2006] and a two-step approach [Jeong and Lee, 2007]
[Figure: previous-slot context for "I want to return to chicago on dec.". chicago is labeled TOLOC.CITY and dec. RETURN.MONTH; the other words are preamble/filler states, and the previous slot within a context window (k = 1, 2) supplies non-local information.]
Tree Path and Head Word
• Using a syntactic parse tree [Gildea and Jurafsky, 2002]
  ► Motivated by semantic role labeling
  ► Full structural information of language
  ► Disadvantages
    ◦ Reliable only for text, not for speech
    ◦ Parsing a sentence is computationally heavy
[Figure: tree path and head word features for "I want to return to chicago on dec." (labels: chicago = TOLOC.CITY, dec. = RETURN.MONTH); e.g. head word = return, tree path = NN↑NP↑PP↑VP↓VB.]
Trigger Feature Selection
• Using co-occurrence information [Jeong and Lee, 2006]
  ► Finding non-local and long-distance dependencies
  ► Based on a feature selection method for exponential models
Definition 1 (Trigger Pair): Let a ∈ A and b ∈ B, where A is the set of history features and B is the set of target features in the training examples. If a feature element a is significantly correlated with another element b, then a → b is considered a trigger pair, with a the trigger element and b the triggered element.
[Figure: trigger features for "I want to return to chicago on dec." (labels: chicago = TOLOC.CITY, dec. = RETURN.MONTH). Candidate triggers: I→dec., want→dec., to→dec., return→dec.; the selected trigger is return→dec.]
Trigger Features
• Trigger word pairs [Rosenfeld, 1992]
  ► The basic element for extracting information from the long-distance document history
  ► (w_i → w_j): words w_i and w_j are significantly correlated
  ► When w_i occurs in the document, it triggers w_j, causing its probability estimate to change.
• Selecting the triggers (one standard criterion is sketched below)
  ► E.g. dec. → return : 0.001179; dec. → fly : < 0.0001 (rejected)
• Is it enough to extract trigger features from the training data alone? No!
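One standard selection criterion (the average mutual information between the trigger a and the triggered word b, as in Rosenfeld's work; the exact scores above come from the lecture's own data):

    I(a; b) = \sum_{x \in \{a, \bar{a}\}} \sum_{y \in \{b, \bar{b}\}} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}

Pairs whose score falls below a threshold are rejected.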
Inducing (Trigger) Features
• Basic idea
  ► Incrementally add only the bundle of (trigger) features that increases the log-likelihood of the exponential model [Della Pietra et al., 1997]
• Feature gain
  ► A measure for evaluating (trigger) features, based on the Kullback-Leibler divergence [Della Pietra et al., 1997]
  ► Evaluated against the current model (CRFs) plus a single new feature [McCallum, 2003]
  ► Mean-field approximation
Inducing (Trigger) Features
• Feature gain (final form)
  ► No closed-form solution → Newton's method
• MI vs. FI?
              MI                FI
  Target      unlabeled data    labeled data
  Element     word pairs        arbitrary features
  Time cost   low               high
Efficient Feature Induction Algorithm
• How can we reduce the time cost of FI?
  1) Use the local feature set as the base model
    ► Problem: the original algorithm iteratively adds features from the full feature set to an empty model
    ► Solution: we only consider inducing the non-local trigger features
  2) Use a maximum entropy inducer
    ► Problem: CRFs are heavy to train
    ► Solution: we extract non-local trigger features with a lightweight MaxEnt model, and train the CRF only once at the end
[Figure: outline of the feature induction algorithm.]
Effect of Trigger Feature
[Figure: precision (70-100) vs. recall (50-100) for six systems: Transcripts (Local/Trigger), N-best (Local/Trigger), and 1-best (Local/Trigger).]
Joint Prediction for SLU
Motivation: Independent System
• The DA and NE tasks are treated as totally independent.
  ► The system produces the DA (z) and the NE sequence (y) given the input words (x) and passes them to the dialog manager (DM).
  ► MaxEnt for DA classification, CRFs for NE recognition
• Preliminary results were given in the previous section
[Figure: independent architecture. The ASR output x is fed separately to a sequence labeling model (named entity / frame slot, e.g. HMM, CRFs), producing (x, y), and to a classification model (dialog act / intent, e.g. MaxEnt, SVM), producing (x, z); both are passed to dialog management.]
Motivation: Cascaded System
• Current state-of-the-art system design [Gupta et al., 2006, the AT&T system]
  ► Train the NE module and use its prediction as a feature for the DA module (or vice versa)
  ► Significant drawback: the NE task cannot take advantage of information from the DA task (or vice versa)
  ► A cascaded system can improve only a single task's performance rather than both.
[Figure: cascaded architecture. The ASR output x feeds the sequence labeling model (named entity / frame slot, e.g. HMM, CRFs); its output (x, y) feeds the classification model (dialog act / intent, e.g. MaxEnt, SVM); the result (x, y, z) is passed to dialog management.]
Motivation: Joint System
• Joint prediction of DA and NE [Jeong and Lee, 2006]
  ► DA and NE are mutually dependent
  ► An integration of the DA and NE models → encoding their inter-dependency
  ► The GOAL is to improve the performance of both the DA and the NE task.
[Figure: joint architecture. The ASR output x feeds a single joint model (e.g. Tri-CRFs) that performs sequence labeling (named entity / frame slot) and classification (dialog act / intent) by joint inference, passing (x, y, z) to dialog management.]
Triangular-chain CRFs
• Modeling the inter-dependency (y ↔ z)
  ► Factorizing the potential into edge-transition, NE-observation, DA-observation, and y-z dependency terms
[Figure: triangular-chain CRF. The DA variable z is connected to every NE label y_{t-1}, y_t, y_{t+1}; features h_k cover the DA observation (x, z), f1_k the y,y edge transitions, f2_k the y,z dependencies, and g_k the NE observations x_{t-1}, x_t, x_{t+1}.]
In general, f_k can be a function of triangulated cliques. However, we assume that the NE state transitions are independent of the DA, i.e., the DA operates as an observation feature for identifying NE labels.
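Under that assumption (z acts as an extra observation for the NE chain), the joint distribution can be written in a factorized log-linear form roughly as follows (a sketch, not necessarily the exact parameterization of [Jeong and Lee, 2006]):

    p_\Lambda(\mathbf{y}, z \mid \mathbf{x}) = \frac{1}{Z_\Lambda(\mathbf{x})}
    \exp\Big(\sum_{t,k} \lambda^{(1)}_k f^{(1)}_k(y_{t-1}, y_t)
           + \sum_{t,k} \lambda^{(2)}_k f^{(2)}_k(y_t, z)
           + \sum_{t,k} \nu_k\, g_k(y_t, \mathbf{x}, t)
           + \sum_{k} \mu_k\, h_k(z, \mathbf{x})\Big)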
CRFs Family
[Figure: graphical illustrations of the linear-chain CRF (left), the factorial CRF [Sutton et al., 2004] (middle), and the triangular-chain CRF (right).]
Joint Inference of Tri-CRFs
• Performed by multiple exact inferences (over linear-chain CRFs)
  ► Imagine |Z| planes of linear-chain CRFs, one per DA value.
  ► The forward-backward recursion on each plane, with the final alpha, yields the mass of all state sequences.
• Beam search (over the z variable)
  ► Truncate the z planes with p < ε (= 0.001)
Parameter Estimation of Tri-CRFs
• Log-likelihood (for joint task optimization) given D = {x^(i), y^(i), z^(i)}, i = 1,…,N
  ► Derivatives
• Numerical optimization: L-BFGS
  ► Gaussian regularization (σ² = 20)
Reducing the Human Effort
Reducing the Effort of Human Annotation
• The goal is to reduce the labeling effort for spoken language understanding
  ► Preparing human-labeled utterances is labor-intensive and time-consuming
• Supervised learning
  ► Requires a large amount of labeled data
  ► Given the labeled data, we find a function f
  ► f can be any classifier (e.g. MaxEnt, SVM, Boosting, decision tree, etc.)
[Figure: supervised learning pipeline: raw data → labeled data → model.]
Reducing the Effort of Human Annotation
• Active learning
  ► Artificial membership queries (Cohn et al., 1994)
  ► Text categorization (Lewis and Catlett, 1994)
  ► Support vector machines (Schohn and Cohn, 2000; Tong and Koller, 2001)
  ► Natural language parsing and information extraction (Thompson et al., 1999; Tang et al., 2002); word segmentation (Sassano, 2002)
  ► Spoken language understanding (Tur et al., 2002)
• Semi-supervised learning
  ► Co-training (Blum and Mitchell, 1998)
  ► Co-EM (Nigam and Ghani, 2000); Co-EM with ECOC (Ghani, 2002)
  ► Natural language call routing (Iyer et al., 2002)
• Combining the two techniques
  ► Text categorization (McCallum and Nigam, 1998)
  ► Speech recognition (Riccardi and Hakkani-Tur, 2003)
Active Learning
• Certainty-based method
  ► Predict labels for the candidate raw data
  ► Estimate the confidence (e.g. the posterior probability) and use it to select samples for annotation (a selection sketch follows the figure)
[Figure: active learning loop. A small labeled set trains a model; the model predicts on raw data and estimates confidence; a filter selects samples, which are labeled and added back to the labeled set.]
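A minimal certainty-based selection sketch (the model object and the threshold are placeholders; any classifier that exposes class probabilities would do):

    def select_for_annotation(model, raw_utterances, threshold=0.8):
        """Return utterances whose top class probability falls below the
        confidence threshold; these are sent to a human annotator."""
        selected = []
        for utt in raw_utterances:
            probs = model.predict_proba(utt)     # assumed class-probability interface
            confidence = max(probs)
            if confidence < threshold:
                selected.append((utt, confidence))
        # least confident first, so annotation effort goes where the model is most unsure
        return sorted(selected, key=lambda pair: pair[1])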
Semi-supervised Learning
• Augmenting the training data with machine-labeled data
• Augmenting the classification model
  ► Like adaptation (model interpolation); a classifier-dependent method
[Figure: semi-supervised learning loop. A small labeled set trains a model; the model predicts on raw data and estimates confidence; a filter selects confidently machine-labeled samples, which augment the labeled data or the model.]
Combining Two Methods
• Use the whole set of raw data and divide it into two sets: S_raw = S_active + S_semi
[Figure: combined method. The model predicts on raw data and estimates confidence; samples with confidence below the threshold go to active learning (human labeling), while samples above the threshold are added as machine-labeled data; both augment the training data.]
Reducing the Effort of Human Annotation
• How May I Help You (HMIHY) System
Using ASR and Prior Knowledge
Using N-best & Lattice
• Robustness Issues in data-driven SLU [Y. He and S. Young, 2004]
Using N-best & Lattice
• On the use of finite state transducers for semantic interpretation [C. Raymond et al., 2006]
Using N-best & Lattice
• Beyond ASR 1-best: Using word confusion networks in spoken
language understanding [D. Hakkani-Tur et al. 2006]
Incorporating Prior Knowledge
• Composition of HMM and CFG [Wang et al., 2003]
  ► Uses 'superwords' rather than the pure lexicon
  ► Embeds domain-specific knowledge using a PCFG
• Boosting for call classification [Schapire et al., 2002]
  ► Jointly optimizes the predictor over prior rules and training data using AdaBoost
[Figures: results from [Wang et al., 2003, Eurospeech] on the ATIS task (dashed line: CFG only; solid line: HMM+CFG) and from [Schapire et al., 2002, ICML] on the HelpDesk task.]
To General Understanding: Semantic Role Labeling
Semantic Role Labeling (SRL)
• A more general natural language understanding task
• The process of assigning a WHO did WHAT to WHOM, WHEN, WHERE, WHY, HOW, etc. structure to plain text
• Natural language understanding: from domain-specific / hand-crafted to domain-independent / machine learning
• Examples
  ► [Judge She] blames [Evaluee the Government] [Reason for failing to do enough to help]. (JUDGEMENT)
  ► [Message "I'll knock on your door at quarter to six"] [Speaker Susan] said. (STATEMENT)
• Applications
  ► information extraction, question answering, spoken dialog systems, machine translation, text summarization, parsing, etc.
Semantic Role Labeling (SRL)
[Figure: a spectrum from domain-dependent (ATIS, built by computer scientists) to domain-independent (PropBank and FrameNet, built by linguists).]
• SRL = domain-independent shallow semantic parsing
• There is not always a direct mapping between syntax and semantics.
  ► Verb-specific roles (FrameNet)
    ◦ e.g. the CONVERSATION frame includes
      – verbs: argue, banter, debate, converse, gossip
      – nouns: dispute, discussion
  ► Thematic roles (FrameNet & PropBank)
    ◦ tend to apply mainly to verbs
SRL – History and Related Work
• Linguistic theories
  ► Panini's karaka theory (thousands of years ago)
  ► Fillmore (1976), Frame semantics and the nature of language
• Domain-specific semantic interpretation
  ► Miller et al. (1996), ATIS project
  ► NLU systems (1990s~), e.g. TRAINS, Communicator
• Information extraction (1990s~): MUC / ACE projects
• Emergence of semantic role labeling
  ► Fillmore and Baker (1998~), UC Berkeley, FrameNet project
  ► Kingsbury et al. (2002~), UPenn, PropBank project
  ► CoNLL 2004/2005 shared tasks
Two Projects
FrameNet vs. PropBank:
• Target corpus: FrameNet uses the British National Corpus; PropBank uses the Wall Street Journal.
• Labels: FrameNet defines 10 general abstract thematic roles plus thousands of potential verb-specific roles; PropBank uses predicate-independent labels (ARG0-5, ARG-Ms, etc.).
• Training set: FrameNet has 50,000 sentences / 100,000 frames; PropBank has 85,000 sentences / 250,000 arguments.
• Test set: FrameNet N/A; PropBank has 5,000 sentences / 12,000 arguments.
• Release: FrameNet Release 1.2, June 14, 2005; PropBank Feb. 2004 / March 2005.
• Evaluation: FrameNet in the Senseval task; PropBank in the CoNLL shared task.
• Distributor: FrameNet by UC Berkeley; PropBank by Univ. of Pennsylvania / the ACE project.
• Fee: FrameNet is free for academia; PropBank is free for CoNLL.
CoNLL 04/05 Shared Tasks
• An open and competitive task
• Dataset
  ► Released on March 4th, 2005
  ► WSJ sections 02-21 as the training set
  ► WSJ section 24 as the development set
  ► WSJ section 23 + fresh data as the test set
  ► Provided annotations: POS, chunks, Collins/Charniak parse trees, NE tags
  ► Special labels: V (verb), A0 (acceptor), A1 (thing accepted), A2 (accepted-from), A3 (attribute), AM-MOD (modal), AM-NEG (negation)
    ◦ R-*: a reference to some other argument (e.g. "that")
    ◦ C-*: a continuation phrase
• Example
  ► [A0 He] [AM-MOD would] [AM-NEG n't] [V accept] [A1 anything of value] from [A2 those he was writing about].
FrameNet – Frame Example
[Figure: sample domains and frames from the FrameNet lexicon.]
FrameNet – Example
FrameNet – Abstract Roles
PropBank – Roles
PropBank – Example
Two Problems in SRL
• Identification (= pruning)
  ► Identify argument constituents in the sentence
  ► Cast as a binary classification problem
• Classification (= labeling)
  ► Assign the appropriate argument labels
  ► Cast as a multi-class classification problem
Features in SRL
• The relationship between surface manifestations (syntax) and semantic roles = linking theory (Levin and Hovav 1996)
  ► The syntactic realization of a predicate's arguments is predictable from its semantics
• Syntactic structure
  ► Using shallow and full parsing
• Lexical information
  ► Word statistics / semantics
• Basic features provided by the CoNLL task
  ► POS, chunks, parse trees, NE tags
• Bottleneck = parse tree errors
Using Grammatical Function
• Phrase type (= chunk)
  ► [Speaker We] talked [Topic about the proposal] [Medium over the phone].
• Governing category: S (subject) and VP (object)
  ► "If there is an underlying AGENT, it becomes the syntactic subject."
• Position: before or after the predicate
  ► Helps to overcome parse errors
• Voice: active or passive
• Head word
  ► From the parse tree output
• Tree path
Parse Tree Path
• The syntactic relation between the target word and the constituent
• The syntactic path through the parse tree from the parse constituent to the predicate being classified
  ► FrameNet example (Gildea): He → ate : VB↑VP↑S↓NP
  ► PropBank example (Pradhan): The lawyer → went : NP↑S↓VP↓VBD
Parse Tree Path
• Sample statistics from FrameNet corpus
Parse Tree Path
• The subject-to-object raising case
Reading Lists (MaxEnt)
• Adam Berger, S. Della Pietra and V. Della Pietra. March 1996. A
Maximum Entropy Approach to Natural Language Processing,
Computational Linguistics, vol. 22, no. 1
• Dan Klein and Chris Manning. 2003. Maxent Models, Conditional
Estimation, and Optimization, without the Magic. Tutorial at
NAACL/ACL-03
• Robert Malouf. 2002. A comparison of algorithms for Maximum
Entropy parameter estimation. In Proceedings of the 6th CoNLL
• Thomas Minka. 2003. A Comparison of Numerical Optimizers for Logistic Regression. Technical Report, Dept. of Statistics, Carnegie Mellon Univ.
• Hal Daume III. Notes on CG and LM-BFGS Optimization of Logistic Regression. http://www.isi.edu/~hdaume/megam/index.html
Reading Lists (CRFs)
• T. G. Dietterich. 2002. Machine learning for sequential data: A review. In Caelli (Ed.), Structural, Syntactic, and Statistical Pattern Recognition.
• S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. Inducing features of random fields. IEEE PAMI.
• J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.
• C. Sutton, K. Rohanimanesh, and A. McCallum. 2004. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. ICML.
• C. Sutton and A. McCallum. 2006. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical Relational Learning. MIT Press.
• S. Sarawagi and W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In Proceedings of ICML.
Reading Lists (SLU)
• J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: A natural language system for spoken language understanding. ACL, 54-61.
• J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. ICSLP.
• N. Gupta, G. Tur, D. Hakkani-Tur, S. Bangalore, G. Riccardi, and M. Gilbert. 2006. The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):213-222.
• Y. He and S. Young. 2005. Semantic processing using the Hidden Vector State model. Computer Speech and Language, 19(1):85-106.
• E. Levin and R. Pieraccini. 1995. CHRONUS, the next generation. In Proceedings of the 1995 ARPA Spoken Language Systems Technical Workshop, 269-271, Austin, Texas.
Reading Lists (SLU)
• B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An Architecture for Dialogue Systems. ICSLP.
• R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. ICML, pp. 538-545.
• S. Seneff. 1992. TINA: a natural language system for spoken language applications. Computational Linguistics, 18(1):61-86.
• G. Tur, D. Hakkani-Tur, and R. E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171-186.
• Y. Wang, L. Deng, and A. Acero. 2005. Spoken Language Understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5).
Reading Lists (SRL and Other Topics)
• D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
• M. Jeong and G. G. Lee. 2006. Exploiting non-local features for spoken language understanding. COLING/ACL.
• M. Jeong and G. G. Lee. 2006. Jointly predicting dialog act and named entity for spoken language understanding. IEEE/ACL SLT 2006.
• M. Jeong and G. G. Lee. 2007. Structures for Spoken Language Understanding: A Two-Step Approach. IEEE ICASSP 2007.