Transcript Document

Natural Language Processing
Zhao Hai 赵海
Department of Computer Science and Engineering
Shanghai Jiao Tong University
[email protected]
1
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
2
What is an HMM?
• Graphical Model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states
3
What is an HMM?
• Green circles are hidden states
• Dependent only on the previous state
• “The past is independent of the future given the present.”
4
What is an HMM?
• Purple nodes are observed states
• Dependent only on their corresponding hidden state
5
HMM Formalism
[Figure: a sequence of hidden states S, each emitting an observation K.]
• {S, K, Π, A, B}
• S : {s_1 … s_N} are the values for the hidden states
• K : {k_1 … k_M} are the values for the observations
6
HMM Formalism
[Figure: hidden state sequence with transition arcs A between states S and emission arcs B to observations K.]
• {S, K, Π, A, B}
• Π = {π_i} are the initial state probabilities
• A = {a_ij} are the state transition probabilities
• B = {b_ik} are the observation (emission) probabilities
7
Inference in an HMM
• Probability Estimation: Compute the probability of a
given observation sequence
• Decoding: Given an observation sequence, compute the
most likely hidden state sequence
• Parameter Estimation: Given an observation sequence,
find a model that most closely fits the observation
8
Probability Estimation
[Figure: observation sequence o_1 … o_T.]
Given an observation sequence and a model, compute the probability of the observation sequence:
O = (o_1, \ldots, o_T), \quad \lambda = (A, B, \Pi)
Compute P(O \mid \lambda)
9
Probability Estimation
[Figure: trellis of hidden states x_1 … x_T, each emitting the corresponding observation o_1 … o_T.]
P(O \mid X, \lambda) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}
10
Probability Estimation
P(O \mid X, \lambda) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}
P(X \mid \lambda) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}
11
Probability Estimation
P(O \mid X, \lambda) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}
P(X \mid \lambda) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}
P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)
12
Probability Estimation
P(O \mid X, \lambda) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}
P(X \mid \lambda) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}
P(O, X \mid \lambda) = P(O \mid X, \lambda)\, P(X \mid \lambda)
P(O \mid \lambda) = \sum_X P(O \mid X, \lambda)\, P(X \mid \lambda)
13
Probability Estimation
P(O \mid \lambda) = \sum_{\{x_1 \ldots x_T\}} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}
14
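This sum can be evaluated literally by enumerating every possible state sequence, which costs O(N^T) and is exactly what the forward procedure on the following slides avoids. A minimal Python sketch of the direct evaluation, with toy parameters invented purely for illustration:

import itertools
import numpy as np

pi = np.array([0.6, 0.4])                 # initial probabilities Pi
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # transitions a_ij
B = np.array([[0.5, 0.4, 0.1],            # emissions b_ik
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                           # an observation sequence o_1..o_T

total = 0.0
for X in itertools.product(range(len(pi)), repeat=len(obs)):    # all N^T state sequences
    p = pi[X[0]] * B[X[0], obs[0]]                              # pi_{x1} * b_{x1,o1}
    for t in range(1, len(obs)):
        p *= A[X[t-1], X[t]] * B[X[t], obs[t]]                  # a_{x_t x_{t+1}} * b_{x_{t+1},o_{t+1}}
    total += p

print(total)                              # P(O | lambda), summed over all paths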
Forward Procedure
• Special structure gives us an efficient solution using
dynamic programming.
• Intuition: Probability of the first t observations is the
same for all possible t+1 length state sequences.
• Define:
\alpha_i(t) = P(o_1 \ldots o_t, x_t = i \mid \lambda)
15
Forward Procedure
\alpha_j(t+1)
= P(o_1 \ldots o_{t+1}, x_{t+1} = j)
= P(o_1 \ldots o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)
= P(o_1 \ldots o_t \mid x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j)
= P(o_1 \ldots o_t, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)
16
Forward Procedure
\alpha_j(t+1)
= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)
= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_{t+1} = j \mid x_t = i)\, P(x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)
= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_{t+1} \mid x_{t+1} = j)
= \sum_{i=1 \ldots N} \alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}
20
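The last line of the derivation is the whole algorithm: compute α at t = 1, then push it forward one step at a time. A minimal Python sketch of the forward procedure, reusing the invented toy parameters from the brute-force sketch above (the two should print the same probability):

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]

def forward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_i(1) = pi_i * b_{i,o_1}
    for t in range(1, T):
        for j in range(N):
            # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{j,o_{t+1}}
            alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, obs[t]]
    return alpha

alpha = forward(pi, A, B, obs)
print(alpha[-1].sum())                                # P(O | lambda) = sum_i alpha_i(T)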
Backward Procedure
\beta_i(T+1) = 1
\beta_i(t) = P(o_t \ldots o_T \mid x_t = i)    (the probability of the remaining observations, given state i at time t)
\beta_i(t) = \sum_{j=1 \ldots N} a_{ij}\, b_{i o_t}\, \beta_j(t+1)
24
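A matching sketch of the backward procedure, following this slide's convention in which β_i(t) includes the emission of o_t, so that β_i(T+1) = 1 and P(O | λ) = Σ_i π_i β_i(1); the toy parameters are the same invented ones used before:

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]

def backward(pi, A, B, obs):
    N, T = len(pi), len(obs)
    beta = np.ones((T + 1, N))        # beta[T] plays the role of beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        for i in range(N):
            # beta_i(t) = sum_j a_ij * b_{i,o_t} * beta_j(t+1)
            beta[t, i] = B[i, obs[t]] * np.sum(A[i] * beta[t + 1])
    return beta

beta = backward(pi, A, B, obs)
print(np.sum(pi * beta[0]))           # P(O | lambda) = sum_i pi_i * beta_i(1)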
The Solution to Estimation
P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)    (Forward Procedure)
P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)    (Backward Procedure)
P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)    (Combination)
25
Decoding: Best State Sequence
• Find the state sequence that best explains the observations
• Viterbi algorithm: \arg\max_X P(X \mid O)
26
Viterbi Algorithm
\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)
The state sequence which maximizes the
probability of seeing the observations to time
t-1, landing in state j, and seeing the
observation at time t
27
Viterbi Algorithm
\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)
Recursive computation:
\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}
\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}
28
Viterbi Algorithm
\hat{X}_T = \arg\max_i \delta_i(T)
\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)
P(\hat{X}) = \max_i \delta_i(T)
Compute the most likely state sequence by working backwards.
29
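A minimal Python sketch of Viterbi decoding with the same invented toy parameters: δ holds the best-path scores and ψ holds the backpointers used for the backward reconstruction.

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # delta_j(1) = pi_j * b_{j,o_1}
    for t in range(1, T):
        for j in range(N):
            scores = delta[t-1] * A[:, j] * B[j, obs[t]]
            delta[t, j] = scores.max()                 # delta_j(t+1) = max_i delta_i(t) a_ij b_{j,o_{t+1}}
            psi[t, j] = scores.argmax()                # psi_j(t+1) = argmax_i ...
    # backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), delta[-1].max()

states, score = viterbi(pi, A, B, obs)
print(states, score)                                   # most likely state sequence and max_i delta_i(T)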
Parameter Estimation
[Figure: HMM with transition parameters A and emission parameters B over observations o_1 … o_T.]
• Given an observation sequence, find the model that is
most likely to produce that sequence.
• No analytic method => an EM algorithm (Baum-Welch)
• Given a model and observation sequence, update the
model parameters to better fit the observations.
30
Parameter Estimation
p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{\sum_{m=1 \ldots N} \alpha_m(t)\, \beta_m(t)}    (probability of traversing an arc from i to j at time t)
\gamma_i(t) = \sum_{j=1 \ldots N} p_t(i, j)    (probability of being in state i at time t)
31
Parameter Estimation
\hat{\pi}_i = \gamma_i(1)
\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}
\hat{b}_{ik} = \frac{\sum_{\{t : o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}
Now we can compute the new estimates of the model parameters.
32
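The re-estimation formulas can be exercised end to end. The sketch below runs a single Baum-Welch update on the invented toy model used earlier; note that it uses the more common convention in which β excludes the current emission (β at the last position is 1), which is the convention under which the arc-probability formula p_t(i, j) above is exact. A practical implementation would work in log space (or rescale) and iterate to convergence.

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1, 0]
N, T = len(pi), len(obs)

# Forward variables: alpha[t, i] = P(o_1..o_t, x_t = i), with 0-based position t
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

# Backward variables, standard convention: beta excludes the current emission; beta at the last position = 1
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

likelihood = alpha[-1].sum()                       # P(O | lambda)

# p[t, i, j]: probability of traversing the arc i -> j at position t, given O
p = np.zeros((T - 1, N, N))
for t in range(T - 1):
    p[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :] / likelihood

gamma = np.zeros((T, N))                           # gamma[t, i]: probability of being in state i at position t
gamma[:-1] = p.sum(axis=2)
gamma[-1] = alpha[-1] * beta[-1] / likelihood

# Re-estimated parameters (one EM step)
pi_new = gamma[0]
A_new = p.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros_like(B)
for k in range(B.shape[1]):
    B_new[:, k] = gamma[[t for t in range(T) if obs[t] == k]].sum(axis=0) / gamma.sum(axis=0)

print(pi_new, A_new, B_new, sep="\n")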
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
33
Limitations of HMM
“US official questions regulatory scrutiny of Apple”
• Problem 1: HMMs use only word identity; they cannot use richer representations of the input.
– Apple is capitalized.
• MEMM Solution: Use more descriptive features
– (b0: Is-capitalized, b1: Is-in-plural, b2: Has-wordnet-antonym, b3: Is-"the", etc.)
– Real-valued features can also be handled.
• Here features are pairs <b, s>: b is a feature of the observation and s is the destination state, e.g. <Is-capitalized, Company>
• Feature function:
f_{<b, s>}(o_t, s_t) = 1 if b(o_t) is true and s = s_t, and 0 otherwise
e.g. f_{<Is-capitalized, Company>}("Apple", Company) = 1.
34
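Such a binary feature is a one-liner in code. A small illustrative sketch (the predicate and state names below are invented for the example):

# A <b, s> feature pairs an observation predicate b with a destination state s.
def make_feature(predicate, state):
    def f(o_t, s_t):
        return 1 if predicate(o_t) and s_t == state else 0
    return f

is_capitalized = lambda word: word[:1].isupper()
f_cap_company = make_feature(is_capitalized, "Company")

print(f_cap_company("Apple", "Company"))   # 1
print(f_cap_company("apple", "Company"))   # 0
print(f_cap_company("Apple", "Person"))    # 0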
HMMs vs. MEMMs (I)
HMMs: P(s \mid s'), P(o \mid s)
MEMMs: P(s \mid s', o), i.e. |S| distributions P_{s'}(s \mid o)
35
HMMs vs. MEMMs (II)
HMMs: \alpha_t(s) is the probability of producing o_1, \ldots, o_t and being in s at time t.
\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s')\, P(s \mid s')\, P(o_{t+1} \mid s)
MEMMs: \alpha_t(s) is the probability of being in s at time t given o_1, \ldots, o_t.
\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s')\, P_{s'}(s \mid o_{t+1})
HMMs: \delta_t(s) is the probability of the best path producing o_1, \ldots, o_t and being in s at time t.
\delta_{t+1}(s) = \max_{s' \in S} \delta_t(s')\, P(s \mid s')\, P(o_{t+1} \mid s)
MEMMs: \delta_t(s) is the probability of the best path that reaches s at time t given o_1, \ldots, o_t.
\delta_{t+1}(s) = \max_{s' \in S} \delta_t(s')\, P_{s'}(s \mid o_{t+1})
36
Maximum Entropy
•Problem 2:
• HMMs are trained to maximize the likelihood of the training set.
Generative, joint distribution.
• But they solve conditional problems (observations are given).
• MEMM Solution: Maximum Entropy.
• Idea: Use the least biased hypothesis, subject to what is known.
• Constraints: The expectation Ei of feature i in the learned
distribution should be the same as its mean Fi on the training set.
For every state s' and feature i (summing over the n_{s'} training positions whose previous state is s'):
F_i = \frac{1}{n_{s'}} \sum_{k : s_{k-1} = s'} f_i(o_k, s_k)
E_i = \frac{1}{n_{s'}} \sum_{k : s_{k-1} = s'} \sum_{s \in S} P_{s'}(s \mid o_k)\, f_i(o_k, s)
37
More on MEMMs
• It turns out that the maximum entropy distribution is unique and has
an exponential form:
Ps (s | o) =
1
exp(  i fi (o, s))
Z (o, s ')
i features
• We can estimate λi with Generalized Iterative Scaling.
–
–
–
–
Adding a feature x : f x (o, s) = C   fi (o, s) does not affect the solution.
i
Compute Fi.
(0)
Set i = 0
Compute current expectation Ei( j ) of feature i from model.

( j 1)
i
=
( j)
i
Fi
1
 log( ( j ) )
C
Ei
38
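The GIS loop can be run on a tiny, fully enumerated example. The sketch below trains P_{s'}(s | o) for a single previous state s' with invented binary features and data; it adds the correction feature so feature counts always sum to C, which is what makes the closed-form update exact. It is an illustration, not an optimized trainer.

import math

STATES = ["Company", "Other"]
# Toy training pairs for one previous state s': (observation, destination state); invented for illustration
data = [({"cap": 1}, "Company"), ({"cap": 1}, "Company"),
        ({"cap": 0}, "Other"),   ({"cap": 1}, "Other")]

def base_feats(o, s):
    # f_0 = <is-capitalized, Company>, plus one bias feature per state
    return [float(o["cap"] and s == "Company"),
            float(s == "Company"),
            float(s == "Other")]

C = 2.0                                   # max total count of the base features over any (o, s)

def feats(o, s):
    f = base_feats(o, s)
    return f + [C - sum(f)]               # GIS correction feature: counts now always sum to C

n_feats = 4

def model_probs(lam, o):
    scores = [math.exp(sum(l * f for l, f in zip(lam, feats(o, s)))) for s in STATES]
    Z = sum(scores)
    return [sc / Z for sc in scores]

# Empirical feature means F_i
F = [sum(feats(o, s)[i] for o, s in data) / len(data) for i in range(n_feats)]

lam = [0.0] * n_feats                     # lambda_i^(0) = 0
for _ in range(200):
    E = [0.0] * n_feats                   # expected feature means E_i under the current model
    for o, _unused in data:
        for s, prob in zip(STATES, model_probs(lam, o)):
            for i, f in enumerate(feats(o, s)):
                E[i] += prob * f
    E = [e / len(data) for e in E]
    lam = [l + (1.0 / C) * math.log(F[i] / E[i]) for i, l in enumerate(lam)]

print(lam)
print(model_probs(lam, {"cap": 1}))       # more probability mass on "Company" for capitalized tokens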
Extensions
• We can train even when the labels are not known using EM.
– E step: determine most probable state sequence and compute Fi.
– M step: GIS.
• We can reduce the number of parameters to estimate by moving the previous state into the features: "Subject-is-female", "Previous-was-question", "Is-verb-and-no-noun-yet".
• We can even add features regarding actions in a reinforcement learning setting: "Slow-vehicle-encountered-and-steer-left".
• We can mitigate data sparseness problems by simplifying the model:
P(s \mid s', o) = P(s \mid s')\, \frac{1}{Z(o, s')} \exp\Big( \sum_i \lambda_i f_i(o, s) \Big)
39
Decoding of MEMM
• Train an MEMM just as a maximum entropy model is trained
• The only difference is in decoding:
– ME is a classifier
– MEMM is a structured prediction tool, where the Viterbi algorithm is applied instead
40
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
41
CRFs as Sequence Labeling Tool
• Conditional random fields (CRFs) are a statistical sequence modeling framework first introduced to natural language processing (NLP) to overcome the label bias problem.
• John Lafferty, A. McCallum and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, 282-289. June 28 - July 01, 2001.
42
Sequence Segmenting and Labeling
• Goal: mark up sequences with content tags
• Application in computational biology
– DNA and protein sequence alignment
– Sequence homolog searching in databases
– Protein secondary structure prediction
– RNA secondary structure analysis
• Application in computational linguistics & computer science
– Text and speech processing, including topic segmentation, part-of-speech
(POS) tagging
– Information extraction
– Syntactic disambiguation
43
HMMs as Generative Models
• Hidden Markov models (HMMs)
• Assign a joint probability to paired observation and label
sequences
– The parameters are typically trained to maximize the joint likelihood of training examples
44
HMMs as Generative Models
(cont’d)
• Difficulties and disadvantages
– Need to enumerate all possible observation sequences
– Not practical to represent multiple interacting features or long-range
dependencies of the observations
– Very strict independence assumptions on the observations
45
Conditional Models
• Conditional probability P(label sequence y | observation sequence x) rather
than joint probability P(y, x)
– Specify the probability of possible label sequences given an observation
sequence
• Allow arbitrary, non-independent features on the observation sequence X
• The probability of a transition between labels may depend on past and
future observations
– Relax strong independence assumptions in generative models
46
Discriminative Models
Maximum Entropy Markov Models (MEMMs)
• Exponential model
• Given training set X with label sequence Y:
– Train a model θ that maximizes P(Y | X, θ)
– For a new data sequence x, the predicted label y maximizes P(y | x, θ)
– Notice the per-state normalization
47
MEMMs (cont’d)
• MEMMs have all the advantages of Conditional Models
• Per-state normalization: all the mass that arrives at a state must
be distributed among the possible successor states
(“conservation of score mass”)
• Subject to Label Bias Problem
– Bias toward states with fewer outgoing transitions
48
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro)P(1 | ro) = P(2 | 1 and o)P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri)P(1 | ri) = P(2 | 1 and i)P(1 | r)
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
In the training data, label value 2 is the only label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation
•http://wing.comp.nus.edu.sg/pipermail/graphreading/2005-September/000032.html
•http://hi.baidu.com/%BB%F0%D1%BF_ayouh/blog/item/338f13510d38e8441038c250.html
49
Solve the Label Bias Problem
• Change the state-transition structure of the model
– Not always practical to change the set of states
• Start with a fully-connected model and let the training procedure figure out a good structure
– Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
50
Random Field
51
Conditional Random Fields (CRFs)
• CRFs have all the advantages of MEMMs without
label bias problem
– MEMM uses per-state exponential model for the conditional probabilities of
next states given the current state
– CRF has a single exponential model for the joint probability of the entire
sequence of labels given the observation sequence
• Undirected acyclic graph
• Allow some transitions to “vote” more strongly than others, depending on the corresponding observations
52
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
53
Example of CRFs
54
Graphical comparison among
HMMs, MEMMs and CRFs
[Figure: graphical structures of an HMM, an MEMM, and a CRF.]
55
Conditional Distribution
If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is:
p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)
where:
– x is a data sequence
– y is a label sequence
– v is a vertex from the vertex set V (the set of label random variables)
– e is an edge from the edge set E over V
– f_k and g_k are given and fixed; g_k is a Boolean vertex feature and f_k is a Boolean edge feature
– k ranges over the features
– \theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); the \lambda_k and \mu_k are parameters to be estimated
– y|_e is the set of components of y defined by edge e
– y|_v is the set of components of y defined by vertex v
56
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)
Z(x) is a normalization factor over the data sequence x
57
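For a linear-chain CRF this expression can be evaluated directly on a tiny example: score each candidate label sequence with weighted edge features f_k and vertex features g_k, exponentiate, and normalize by Z(x), computed here by brute-force enumeration (a real implementation would use a forward-backward style dynamic program). All labels, features, and weights below are invented for illustration.

import itertools, math

LABELS = ["B", "I", "O"]
x = ["Apple", "sues", "Samsung"]                        # toy observation sequence

def edge_feats(y_prev, y_cur, x, t):
    return [float(y_prev == "B" and y_cur == "I"),      # B followed by I
            float(y_prev == "O" and y_cur == "I")]      # I after O (will be penalized by its weight)

def vertex_feats(y_cur, x, t):
    return [float(x[t][0].isupper() and y_cur == "B"),  # capitalized word labeled B
            float(y_cur == "O")]

lam = [1.5, -2.0]                                       # weights for edge features f_k
mu = [2.0, 0.5]                                         # weights for vertex features g_k

def score(y, x):
    s = 0.0
    for t in range(len(x)):
        s += sum(m * g for m, g in zip(mu, vertex_feats(y[t], x, t)))
        if t > 0:
            s += sum(l * f for l, f in zip(lam, edge_feats(y[t-1], y[t], x, t)))
    return s

# p(y | x) = exp(score) / Z(x), with Z(x) summed over all |LABELS|^T label sequences
Z = sum(math.exp(score(y, x)) for y in itertools.product(LABELS, repeat=len(x)))
y = ("B", "O", "B")
print(math.exp(score(y, x)) / Z)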
Decoding: To label an unseen sequence
We compute the most likely labeling Y* efficiently by dynamic programming (the Viterbi algorithm):
Y^* = \arg\max_Y P(Y \mid X)
58
Complexity Estimation
• The time complexity of an iteration of parameter
estimation of L-BFGS algorithm is
• O(L^2 N M F)
• where L and N are, respectively, the numbers of
labels and sequences (sentences),
• M is the average length of sequences, and
• F is the average number of activated features of
each labeled clique.
59
CRF++: a CRFs Package
• CRF++ is a simple, customizable, and open source
implementation of Conditional Random Fields (CRFs) for
segmenting/labeling sequential data.
• http://crfpp.sourceforge.net/
• Requirements
– C++ compiler (gcc 3.0 or higher)
• How to make:
% ./configure
% make
% su
# make install
60
CRF++
• Feature template representation and input file format
61
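As a concrete illustration (drawn from the CRF++ documentation rather than from these slides): the training file has one token per line with whitespace-separated feature columns and the gold label in the last column, sentences separated by blank lines; the template file expands macros of the form %x[row,col] relative to the current token. A minimal template for character context might look like:

U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
B

The U templates correspond to the C-1, C0, C1, C-1C0, C0C1, C-1C1 character features listed later for Chinese word segmentation; the single B line adds label-bigram features.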
CRF++
• training
• % crf_learn -f 3 -c 1.5 template_file train_file model_file
• test
• % crf_test -m model_file test_files
62
Summary
• Per-state normalized discriminative models such as MEMMs are prone to the label bias problem
• CRFs provide the benefits of discriminative models
• CRFs solve the label bias problem well and demonstrate good performance, but they are computationally expensive!
63
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
64
What is Chinese Word Segmentation
• A special case of tokenization in natural language
processing (NLP) for many languages that have no explicit
word delimiters such as spaces.
• Original:
– 她来自苏格兰
– She comes from SU GE LAN
Meaningless!
• Segmented:
– 她/来/自/苏格兰
– She comes from Scotland.
Meaningful!
65
Learning from a Lexicon:
maximal matching algorithm for word segmentation
• Input:
– A pre-defined lexicon
– An unsegmented character sequence
• The algorithm:
① Start from the first character and find the longest word in the lexicon that matches.
② Set the character immediately after the matched word as the new start point.
③ If the end of the sequence has been reached, the algorithm ends.
④ Otherwise, go to ①.
66
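A minimal Python sketch of forward maximal matching; the toy lexicon is invented for illustration.

def max_match(sentence, lexicon, max_word_len=4):
    """Greedy forward maximum matching: repeatedly take the longest lexicon match."""
    words, start = [], 0
    while start < len(sentence):
        for length in range(min(max_word_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + length]
            if length == 1 or candidate in lexicon:    # fall back to a single character
                words.append(candidate)
                start += length
                break
    return words

lexicon = {"苏格兰"}
print(max_match("她来自苏格兰", lexicon))    # ['她', '来', '自', '苏格兰'], as on the earlier slide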
Learning from a segmented corpus:
Word segmentation as labeling
• 自然科学的研究不断深入
• natural science / of / research / uninterruptedly / deepen
• 自然科学/的/研究/不断/深入
• BMME S BE BE BE
• B: beginning, M: Middle, E: End, of a word
• S: single-character word
• Using CRFs as the learning model
67
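Converting a segmented sentence into per-character labels is a few lines of code. A sketch assuming the B/M/E/S scheme described on this slide:

def words_to_tags(words):
    """Map each segmented word to B/M/E (multi-character) or S (single-character) tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

words = ["自然科学", "的", "研究", "不断", "深入"]
print(" ".join(words_to_tags(words)))    # B M M E S B E B E B E, matching the labeling above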
CWS as Character-based Tagging:
From the beginning to the latest
• Nianwen Xue, 2003. Chinese Word Segmentation as Character Tagging. CLCLP, Vol. 8(1), 2003.
• Xiaoqiang Luo, 2003. A Maximum Entropy Chinese Character-based Parser. EMNLP-2003.
• Hwee Tou Ng and Jin Kiat Low, 2004. Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based? EMNLP-2004.
• Jin Kiat Low, Hwee Tou Ng and Wenyuan Guo, 2005. A Maximum Entropy Approach to Chinese Word Segmentation. The 4th SIGHAN Workshop on CLP, 2005.
• Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning, 2005. A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005. The 4th SIGHAN Workshop on CLP, 2005.
68
Label Set
Tag set | Tags | Multi-character word | Reference
2-tag | B, E | B, BE, BEE, … | mostly for CRF
4-tag | B, M, E, S | S, BE, BME, BMME, … | Xue/Low/MaxEnt
6-tag | B, M, E, S, B2, B3 | S, BE, BB2E, BB2B3E, BB2B3ME, … | Zhao/CRF
More labels, better performance for CWS …
69
Feature Template Set
C-1, C0, C1, C-1C0, C0C1, C-1C1
where C-1, C0 and C1 are the previous, current and next characters
70
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
71
Parts of Speech
• Generally speaking, the “grammatical type” of word:
– Verb, Noun, Adjective, Adverb, Article, …
• We can also include inflection:
– Verbs: Tense, number, …
– Nouns: Number, proper/common, …
– Adjectives: comparative, superlative, …
–…
• Most commonly used POS tag sets for English have 50-80 different tags
72
BNC Parts of Speech
• Nouns:
NN0 Common noun, neutral for number (e.g. aircraft)
NN1 Singular common noun (e.g. pencil, goose, time)
NN2 Plural common noun (e.g. pencils, geese, times)
NP0 Proper noun (e.g. London, Michael, Mars, IBM)
• Pronouns:
PNI Indefinite pronoun (e.g. none, everything, one)
PNP Personal pronoun (e.g. I, you, them, ours)
PNQ Wh-pronoun (e.g. who, whoever, whom)
PNX Reflexive pronoun (e.g. myself, itself, ourselves)
73
• Verbs:
VVB finite base form of lexical verbs (e.g. forget, send, live, return)
VVD past tense form of lexical verbs (e.g. forgot, sent, lived)
VVG -ing form of lexical verbs (e.g. forgetting, sending, living)
VVI infinitive form of lexical verbs (e.g. forget, send, live, return)
VVN past participle form of lexical verbs (e.g. forgotten, sent, lived)
VVZ -s form of lexical verbs (e.g. forgets, sends, lives, returns)
VBB present tense of BE, except for is
…and so on: VBD VBG VBI VBN VBZ
VDB finite base form of DO: do
…and so on: VDD VDG VDI VDN VDZ
VHB finite base form of HAVE: have, 've
…and so on: VHD VHG VHI VHN VHZ
VM0 Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
74
• Articles
AT0 Article (e.g. the, a, an, no)
DPS Possessive determiner (e.g. your, their, his)
DT0 General determiner (this, that)
DTQ Wh-determiner (e.g. which, what, whose, whichever)
EX0 Existential there, i.e. occurring in “there is…” or “there are…”
• Adjectives
AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
AJC Comparative adjective (e.g. better, older)
AJS Superlative adjective (e.g. best, oldest)
• Adverbs
AV0 General adverb (e.g. often, well, longer (adv.), furthest)
AVP Adverb particle (e.g. up, off, out)
AVQ Wh-adverb (e.g. when, where, how, why, wherever)
75
• Miscellaneous:
CJC Coordinating conjunction (e.g. and, or, but)
CJS Subordinating conjunction (e.g. although, when)
CJT The subordinating conjunction that
CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
ORD Ordinal numeral (e.g. first, sixth, 77th, last)
ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
POS The possessive or genitive marker 's or '
TO0 Infinitive marker to
PUL Punctuation: left bracket - i.e. ( or [
PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?
PUQ Punctuation: quotation mark - i.e. ' or "
PUR Punctuation: right bracket - i.e. ) or ]
XX0 The negative particle not or n't
ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)
76
Task: Part-of-Speech Tagging
• Goal: Assign the correct part-of-speech to
each word (and punctuation) in a text.
• Example:
Two/CRD old/AJ0 men/NN2 bet/VVD on/PP0 the/AT0 game/NN1 ./PUN
• Learn a local model of POS dependencies,
usually from pre-tagged data
• No parsing
77
Hidden Markov Models
• Assume: the POS (state) sequence is generated as a time-invariant random process, and each POS randomly generates a word (output symbol)
[Figure: example HMM with tag states AT0, AJ0, NN1, NN2; transition probabilities between tags (0.2, 0.3, 0.5, 0.9, …) and word emission probabilities such as "a" 0.6 and "the" 0.4 from AT0, plus emissions "cat", "cats", "men", "bet".]
78
Definition of HMM for Tagging
• Set of states – all possible tags
• Output alphabet – all words in the language
• State/tag transition probabilities
• Initial state probabilities: the probability of beginning a sentence with tag t, i.e. P(t_0 \to t)
• Output probabilities – producing word w at state t
• Output sequence – observed word sequence
• State sequence – underlying tag sequence
79
HMMs For Tagging
• First-order (bigram) Markov assumptions:
– Limited Horizon: Tag depends only on previous tag
P(t_{i+1} = t_k \mid t_1 = t_{j_1}, \ldots, t_i = t_{j_i}) = P(t_{i+1} = t_k \mid t_i = t_{j_i})
– Time invariance: No change over time
P(t_{i+1} = t_k \mid t_i = t_j) = P(t_2 = t_k \mid t_1 = t_j) = P(t_j \to t_k)
• Output probabilities:
– Probability of getting word wk for tag tj: P(wk | tj)
– Assumption:
Not dependent on other tags or words!
80
Combining Probabilities
• Probability of a tag sequence:
P(t_1 t_2 \ldots t_n) = P(t_1)\, P(t_1 \to t_2)\, P(t_2 \to t_3) \cdots P(t_{n-1} \to t_n)
Assuming a starting tag t_0:
= P(t_0 \to t_1)\, P(t_1 \to t_2)\, P(t_2 \to t_3) \cdots P(t_{n-1} \to t_n)
• Probability of a word sequence and a tag sequence:
P(W, T) = \prod_i P(t_{i-1} \to t_i)\, P(w_i \mid t_i)
81
Training from Labeled Corpus
• Labeled training = each word has a POS tag
• Thus:
P_{MLE}(t_j) = C(t_j) / N
P_{MLE}(t_j \to t_k) = C(t_j, t_k) / C(t_j)
P_{MLE}(w_k \mid t_j) = C(t_j : w_k) / C(t_j)
• Smoothing can be applied.
82
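These relative-frequency estimates fall out of simple counting over a tagged corpus. A minimal sketch, with a made-up two-sentence corpus and no smoothing:

from collections import Counter

# Tagged corpus: lists of (word, tag) pairs; the sentences are invented for illustration
corpus = [
    [("Two", "CRD"), ("old", "AJ0"), ("men", "NN2"), ("bet", "VVD")],
    [("the", "AT0"), ("old", "AJ0"), ("game", "NN1")],
]

tag_count = Counter()          # C(t_j)
trans_count = Counter()        # C(t_j, t_k), including the start tag <s>
emit_count = Counter()         # C(t_j : w_k)

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_count[tag] += 1
        trans_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        prev = tag

N = sum(tag_count.values())
tag_count["<s>"] = len(corpus)                               # so transitions from <s> normalize correctly

p_tag = {t: c / N for t, c in tag_count.items() if t != "<s>"}          # P_MLE(t_j)
p_trans = {tt: c / tag_count[tt[0]] for tt, c in trans_count.items()}   # P_MLE(t_j -> t_k)
p_emit = {tw: c / tag_count[tw[0]] for tw, c in emit_count.items()}     # P_MLE(w_k | t_j)

print(p_trans[("AJ0", "NN2")], p_emit[("AJ0", "old")])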
Viterbi Tagging
• Most probable tag sequence given the text:
T^* = \arg\max_T P_m(T \mid W)
= \arg\max_T P_m(W \mid T)\, P_m(T) / P_m(W)    (Bayes' theorem)
= \arg\max_T P_m(W \mid T)\, P_m(T)    (P_m(W) is constant for all T)
= \arg\max_T \prod_i m(t_{i-1} \to t_i)\, m(w_i \mid t_i)
= \arg\max_T \sum_i \log\big[ m(t_{i-1} \to t_i)\, m(w_i \mid t_i) \big]
• Exponential number of possible tag sequences: use dynamic programming for efficient computation
83
[Figure: Viterbi trellis over words w1, w2, w3 with tags t1, t2, t3 and start tag t0; nodes carry accumulated negative log probabilities (-2.3, -1.7, -3.4, -4.7, -2.7, -7.3, -10.3, -6.7, -9.3, …).]
-log m (transitions):   t1    t2    t3
t0 →                    2.3   1.7   1
t1 →                    1.7   1     2.3
t2 →                    0.3   3.3   3.3
t3 →                    1.3   1.3   2.3
-log m (emissions):     w1    w2    w3
t1                      0.7   2.3   2.3
t2                      1.7   0.7   3.3
t3                      1.7   1.7   1.3
84
Viterbi Algorithm
1. D(0, START) = 0
2. for each tag t ≠ START do: D(0, t) = -∞
3. for i ← 1 to N do:
   for each tag t_j do:
     D(i, t_j) ← max_k [ D(i-1, t_k) + lm(t_k → t_j) + lm(w_i | t_j) ]
     record best(i, j) = k which yielded the max
4. log P(W, T) = max_j D(N, t_j)
5. Reconstruct the path from max_j backwards
where lm(.) = log m(.) and D(i, t_j) is the maximum joint (log) probability of the state and word sequences up to position i, ending at t_j.
Complexity: O(N_t^2 · N), where N_t is the number of tags and N the sentence length.
85
Overview
• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields
• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition
86
Named Entity Recognition and Classification
• Problem of NE tagging
Let W be a sequence of words
W = w1 , w2 , … , wn
Let T be the corresponding NE tag sequence
T = t1 , t2 , … , tn
Task: find T which maximizes P(T | W):
T' = \arg\max_T P(T \mid W)
87
Supervised NERC Systems (ME, CRF and SVM)
• Limitations of HMM
– Use of only local features may not work well
– Simple HMM models do not work well when large training data are not available to estimate the model parameters
– Incorporating a diverse set of features into an HMM-based NE tagger is difficult and complicates the smoothing
• Solution:
– Maximum Entropy (ME) model, Conditional Random Field (CRF) or
Support Vector Machine (SVM)
– ME, CRF or SVM can make use of rich feature information
• ME model
– Very flexible method of statistical modeling
– A combination of several features can be easily incorporated
– Careful feature selection plays a crucial role
– Does not provide a method for automatic selection of useful features
– Features selected using heuristics
– Adding arbitrary features may result in overfitting
88
Supervised NERC Systems (ME, CRF and SVM)
• CRF
– CRF does not require careful feature selection in order to avoid overfitting
– Freedom to include arbitrary features
– Ability of feature induction to automatically construct the most useful
feature combinations
– Conjunction of features
– Infeasible to incorporate all possible conjunction features due to overflow of
memory
– Good to handle different types of data
• SVM
– Predict the classes depending upon the labeled word examples only
– Predict the NEs based on feature information of words collected in a
predefined window size only
– Can not handle the NEs outside tokens
– Achieves high generalization even with training data of a very high dimension
– Can handle non-linear feature spaces with kernel functions
89
Named Entity Features
• Language Independent Features
– Can be applied for NERC in any language
• Language Dependent Features
– Generated from the language specific resources like gazetteers
and POS taggers
– Indian languages are resource-constrained
– Creation of gazetteers in resource-constrained environment
requires a priori knowledge of the language
– POS information depends on some language-specific phenomena such as person, number, tense, gender, etc.
– POS tagger (Ekbal and Bandyopadhyay, 2008d) makes use of
the several language specific resources such as lexicon, inflection
list and a NERC system to improve its performance
• Language dependent features improve system performance
90
Language Independent Features
– Context Word: Preceding and succeeding words
– Word Suffix
• Not necessarily linguistic suffixes
• Fixed length character strings stripped from the endings of words
• Variable length suffix -binary valued feature
– Word Prefix
• Fixed length character strings stripped from the beginning of the words
– Named Entity Information: Dynamic NE tag (s) of the previous word (s)
– First Word (binary valued feature): Check whether the current token is the
first word in the sentence
91
Language Independent Features (Contd..)
• Length (binary valued): Check whether the length of the current word is less than three (shorter words are rarely NEs)
• Position (binary valued): Position of the word in the sentence
• Infrequent (binary valued): Infrequent words in the training corpus are most probably NEs
• Digit features: Binary-valued
– Presence and/or the exact number of digits in a token
• CntDgt : Token contains digits
• FourDgt: Token consists of four digits
• TwoDgt: Token consists of two digits
• CnsDgt: Token consists of digits only
92
Language Independent Features (Contd..)
– Combination of digits and punctuation symbols
• CntDgtCma: Token consists of digits and comma
• CntDgtPrd: Token consists of digits and periods
– Combination of digits and symbols
• CntDgtSlsh: Token consists of digit and slash
• CntDgtHph: Token consists of digits and hyphen
• CntDgtPrctg: Token consists of digits and percentages
– Combination of digit and special symbols
• CntDgtSpl: Token consists of digit and special symbol such as $, #
etc.
93
CRF based NERC System: Feature Templates
• Feature Template: features are represented in terms of feature templates
Feature template used in the experiment
94
Best Feature Sets for ME, CRF and SVM
Model | Features
ME | Word, Context (preceding one and following one word), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous word, First word of the sentence, Infrequent word, Length of the word, Digit features
CRF | Word, Context (preceding two and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous word, First word of the sentence, Infrequent word, Length of the word, Digit features
SVM-F | Word, Context (preceding three and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tags of the previous two words, First word of the sentence, Infrequent word, Length of the word, Digit features
SVM-B | Word, Context (preceding three and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tags of the previous two words, First word of the sentence, Infrequent word, Length of the word, Digit features
Best Feature set Selection:
Training with language independent features and tested with
the development set
95
Language Dependent Evaluation
(ME, CRF and SVM)
• Observations:
– Classifiers trained with the best set of language independent as well as language dependent features
– POS information of the words is very effective
– Coarse-grained POS tagger (Nominal, PREP and Other) for ME and CRF
– Fine-grained POS tagger (developed with 27 POS tags) for SVM based systems
– Best performance of ME: POS information of the current word only (an improvement of 2.02% F-score)
– Best performance of CRF: POS information of the current, previous and next words (an improvement of 3.04% F-score)
– Best performance of SVM: POS information of the current, previous and next words (an improvement of 2.37% F-score in SVM-F and 2.32% in SVM-B)
– NE suffixes, organization suffix words, person prefix words, designations and common location words are more effective than other gazetteers
96
Reference
• HMM: http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt
• MEMM: www.cs.cornell.edu/courses/cs778/2006fa/lectures/05-memm.pdf
• CRFs: web.engr.oregonstate.edu/~tgd/classes/539/slides/Shen-CRF.ppt
• PoS-tagging: cs.haifa.ac.il/~shuly/teaching/04/statnlp/pos-tagging.ppt
• NER: www.cl.uni-heidelberg.de/colloquium/docs/ekbal_abstract.pdf
97