ACL-04 Tutorial
1
Kernel Methods in Natural Language Processing
Jean-Michel RENDERS
Xerox Research Center Europe (France)
ACL’04 TUTORIAL
ACL-04 Tutorial
2
Warnings
This presentation contains extra slides
(examples, more detailed views, further
explanations, …) that are not present in the
official notes
If needed, the complete presentation can be
downloaded from the Kermit Web Site
www.euro-kermit.org
(feedback welcome)
ACL-04 Tutorial
3
Agenda
What’s the philosophy of Kernel Methods?
How to use Kernel Methods in Learning tasks?
Kernels for text (BOW, latent concept, string, word
sequence, tree and Fisher Kernels)
Applications to NLP tasks
ACL-04 Tutorial
4
Plan
What’s the philosophy of Kernel Methods?
How to use Kernel Methods in Learning tasks?
Kernels for text (BOW, latent concept, string, word
sequence, tree and Fisher Kernels)
Applications to NLP tasks
ACL-04 Tutorial
5
Kernel Methods : intuitive idea
Find a mapping f such that, in the new space,
problem solving is easier (e.g. linear)
The kernel represents the similarity between two
objects (documents, terms, …), defined as the
dot-product in this new vector space
But the mapping is left implicit
This gives an easy generalization of many dot-product (or
distance) based pattern recognition algorithms
ACL-04 Tutorial
6
Kernel Methods : the mapping
[Figure: objects in the Original Space are mapped by f into the Feature (Vector) Space]
ACL-04 Tutorial
7
Kernel : more formal definition
A kernel k(x,y)
is a similarity measure
defined by an implicit mapping f,
from the original space to a vector space (feature
space)
such that: k(x,y)=f(x)•f(y)
This similarity measure and the mapping include:
Invariance or other a priori knowledge
Simpler structure (linear representation of the data)
The class of functions the solution is taken from
Possibly infinite dimension (hypothesis space for learning)
… but still computational efficiency when computing k(x,y)
ACL-04 Tutorial
8
Benefits from kernels
Generalizes (nonlinearly) pattern recognition algorithms in
clustering, classification, density estimation, …
When these algorithms are dot-product based, by replacing the
dot product (x•y) by k(x,y)=f(x)•f(y)
e.g.: linear discriminant analysis, logistic regression, perceptron,
SOM, PCA, ICA, …
NB. This often implies working with the “dual” form of the algorithm.
When these algorithms are distance-based, by replacing d²(x,y) by
k(x,x)+k(y,y)-2k(x,y) (a code sketch follows below)
Freedom of choosing f implies a large variety of learning
algorithms
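To make the substitution concrete, here is a minimal illustrative sketch (an addition to these notes; all function names are made up): a distance-based decision rule that only touches the data through a kernel k, so that swapping a linear kernel for an RBF kernel changes the geometry without changing the algorithm.

```python
import numpy as np

def linear_kernel(x, y):
    return float(x @ y)

def rbf_kernel(x, y, gamma=1.0):
    d = x - y
    return float(np.exp(-gamma * (d @ d)))

def kernel_distance_sq(x, y, k):
    # Squared distance in feature space, using only kernel evaluations:
    # ||f(x) - f(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y)
    return k(x, x) + k(y, y) - 2 * k(x, y)

def nearest_point(x, X_train, k):
    # A distance-based rule "kernelized" by the substitution above
    d2 = [kernel_distance_sq(x, xi, k) for xi in X_train]
    return int(np.argmin(d2))

X = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([3.0, 0.5])]
x = np.array([0.9, 1.1])
print(nearest_point(x, X, linear_kernel))  # index of closest point, linear geometry
print(nearest_point(x, X, rbf_kernel))     # same algorithm, RBF geometry
```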
ACL-04 Tutorial
9
Valid Kernels
The function k(x,y) is a valid kernel, if there exists a
mapping f into a vector space (with a dot-product) such
that k can be expressed as k(x,y)=f(x)•f(y)
Theorem: k(x,y) is a valid kernel if k is positive definite and
symmetric (Mercer Kernel)
A function is P.D. if ∫∫ K(x,y)·f(x)·f(y) dx dy ≥ 0 for all f ∈ L2
In other words, the Gram matrix K (whose elements are k(xi,xj))
must be positive definite for all xi, xj of the input space
One possible choice of f(x): k(•,x) (maps a point x to a function
k(•,x) → feature space with infinite dimension!)
ACL-04 Tutorial
10
Example of Kernels (I)
Polynomial Kernels: k(x,y) = (x•y)^d
Assume we know most information is contained in
monomials (e.g. multiword terms) of degree d (e.g.
d=2: x1², x2², x1x2)
Theorem: the (implicit) feature space contains all
possible monomials of degree d (ex: n=250; d=5; dim
F ≈ 10^10)
But kernel computation is only marginally more
complex than standard dot product!
For k(x,y) = (x•y+1)^d, the (implicit) feature space
contains all possible monomials up to degree d! (see the sketch below)
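A small numerical sketch (an illustrative addition) checking that (x•y)^d really is a dot product over degree-d monomials; for d=2 and n=2 the implicit features are x1², x2² and √2·x1x2.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (float(x @ y)) ** d

def explicit_degree2_features(x):
    # Implicit feature space of (x.y)^2 in dimension 2:
    # monomials of degree 2, with a sqrt(2) weight on the cross term
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(poly_kernel(x, y, d=2))                                        # 16.0
print(explicit_degree2_features(x) @ explicit_degree2_features(y))   # 16.0 as well
```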
ACL-04 Tutorial
11
Examples of Kernels (III)
[Figure: data mapped by f into the feature space of a polynomial kernel (n=2) and of an RBF kernel (n=2)]
ACL-04 Tutorial
12
The Kernel Gram Matrix
With KM-based learning, the sole information
used from the training data set is the Kernel Gram
Matrix
K_training =
[ k(x1,x1)  k(x1,x2)  ...  k(x1,xm) ]
[ k(x2,x1)  k(x2,x2)  ...  k(x2,xm) ]
[   ...       ...     ...     ...   ]
[ k(xm,x1)  k(xm,x2)  ...  k(xm,xm) ]
If the kernel is valid, K is symmetric and positive definite (see the sketch below).
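A minimal sketch (an illustrative addition, using NumPy) of building the Gram matrix and checking its validity numerically: symmetry plus non-negative eigenvalues.

```python
import numpy as np

def gram_matrix(X, k):
    """Build K_training with entries k(x_i, x_j) over all training points."""
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = k(X[i], X[j])
    return K

def is_valid_gram(K, tol=1e-10):
    # A Mercer kernel yields a symmetric positive (semi-)definite Gram matrix
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)
    return symmetric and eigvals.min() >= -tol

X = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
K = gram_matrix(X, lambda x, y: (x @ y + 1) ** 2)   # polynomial kernel of degree 2
print(K)
print(is_valid_gram(K))   # True
```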
ACL-04 Tutorial
13
The right kernel for the right task
Assume a categorization task: the ideal Kernel matrix is
k(xi,xj) = +1 if xi and xj belong to the same class
k(xi,xj) = -1 if xi and xj belong to different classes
→ concept of target alignment (adapt the kernel to the labels),
where alignment is the similarity between the current Gram matrix
and the ideal one [“two clusters” kernel]
A certainly bad kernel is the diagonal kernel
k(xi,xj)=+1 if xi = xj
k(xi,xj)=0 elsewhere
All points are orthogonal: no more clusters, no more structure
ACL-04 Tutorial
14
How to choose kernels?
There are no absolute rules for choosing the right kernel,
adapted to a particular problem
Kernel design can start from the desired feature space,
from combination or from data
Some considerations are important:
Use kernel to introduce a priori (domain) knowledge
Be sure to keep some information structure in the feature space
Experimentally, there is some “robustness” in the choice, if the
chosen kernels provide an acceptable trade-off between:
a simpler and more efficient structure (e.g. linear separability), which
requires some “explosion”
preserving the information structure, which requires that the “explosion” is
not too strong
ACL-04 Tutorial
15
How to build new kernels
Kernel combinations, preserving validity:
K(x,y) = λ·K1(x,y) + (1-λ)·K2(x,y)   with 0 ≤ λ ≤ 1
K(x,y) = a·K1(x,y)   with a ≥ 0
K(x,y) = K1(x,y)·K2(x,y)
K(x,y) = f(x)·f(y)   where f is a real-valued function
K(x,y) = K3(φ(x), φ(y))
K(x,y) = xᵀ·P·y   where P is a symmetric positive definite matrix
K(x,y) = K1(x,y) / √( K1(x,x)·K1(y,y) )
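A toy sketch (an illustrative addition) of a few of these closure rules as higher-order functions; each combinator builds a new valid kernel from valid ones.

```python
import numpy as np

def convex_combination(k1, k2, lam):
    assert 0.0 <= lam <= 1.0
    return lambda x, y: lam * k1(x, y) + (1.0 - lam) * k2(x, y)

def scale(k1, a):
    assert a >= 0.0
    return lambda x, y: a * k1(x, y)

def product(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

def normalize(k1):
    # cosine normalization: K(x,y) = K1(x,y) / sqrt(K1(x,x) K1(y,y))
    return lambda x, y: k1(x, y) / np.sqrt(k1(x, x) * k1(y, y))

linear = lambda x, y: float(np.dot(x, y))
poly2 = lambda x, y: (float(np.dot(x, y)) + 1.0) ** 2

k = normalize(convex_combination(linear, product(poly2, poly2), 0.3))
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k(x, y))
```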
ACL-04 Tutorial
16
Kernels built from data (I)
In general, this mode of kernel design can use
both labeled and unlabeled data of the training
set! Very useful for semi-supervised learning
Intuitively, kernels define clusters in the feature
space, and we want to find interesting clusters,
i.e. cluster components that can be associated
with labels. In general, extract (non-linear)
relationships between features which will catalyze
learning results
ACL-04 Tutorial
17
Examples
ACL-04 Tutorial
21
Kernels built from data (II)
Basic ideas:
Convex linear combination of kernels in a given family:
find the best coefficient of eigen-components of the
(complete) kernel matrix by maximizing the alignment
on the labeled training data.
Find a linear transformation of the feature space such
that, in the new space, pre-specified similarity or
dissimilarity constraints are respected (as best as
possible) and in a kernelizable way
Build a generative model of the data, then use the
Fisher Kernel or marginalized kernels (see later)
ACL-04 Tutorial
22
Kernels for texts
Similarity between documents?
Seen as ‘bag of words’: dot product or polynomial
kernels (multi-words)
Seen as set of concepts: GVSM kernels, Kernel LSI
(or Kernel PCA), Kernel ICA, … possibly multilingual
Seen as string of characters: string kernels
Seen as string of terms/concepts: word sequence
kernels
Seen as trees (dependency or parsing trees): tree
kernels
Etc.
(Going down this list: increased use of syntactic and semantic info)
ACL-04 Tutorial
23
Agenda
What’s the philosophy of Kernel Methods?
How to use Kernel Methods in Learning tasks?
Kernels for text (BOW, latent concept, string, word
sequence, tree and Fisher Kernels)
Applications to NLP tasks
ACL-04 Tutorial
24
Kernels and Learning
In Kernel-based learning algorithms, problem
solving is now decoupled into:
A general purpose learning algorithm (e.g. SVM, PCA,
…) – often a linear algorithm (well-founded, robustness,
…)
A problem specific kernel
[Diagram: Complex Pattern Recognition Task = Specific Kernel function + Simple (linear) learning algorithm]
ACL-04 Tutorial
25
Learning in the feature space: Issues
High dimensionality allows complex patterns to be rendered
flat (linear) by “explosion”
Computational issue, solved by designing kernels
(efficiency in space and time)
Statistical issue (generalization), solved by the learning
algorithm and also by the kernel
e.g. SVM, solving this complexity problem by maximizing the
margin and using the dual formulation
e.g. RBF kernel, playing with the σ parameter
With adequate learning algorithms and kernels,
high dimensionality is no longer an issue
ACL-04 Tutorial
26
Current Synthesis
Modularity and re-usability
Same kernel, different learning algorithms
Different kernels, same learning algorithms
This allows this tutorial to focus only on designing
kernels for textual data
[Diagram: Data 1 (Text) → Kernel 1 → Gram Matrix (not necessarily stored) → Learning Algo 1;
Data 2 (Image) → Kernel 2 → Gram Matrix → Learning Algo 2]
ACL-04 Tutorial
27
Agenda
What’s the philosophy of Kernel Methods?
How to use Kernel Methods in Learning tasks?
Kernels for text (BOW, latent concept, string, word
sequence, tree and Fisher Kernels)
Applications to NLP tasks
ACL-04 Tutorial
28
Kernels for texts
Similarity between documents?
Seen as ‘bag of words’ : dot product or polynomial
kernels (multi-words)
Seen as set of concepts : GVSM kernels, Kernel LSI
(or Kernel PCA), Kernel ICA, …possibly multilingual
Seen as string of characters: string kernels
Seen as string of terms/concepts: word sequence
kernels
Seen as trees (dependency or parsing trees): tree
kernels
Seen as the realization of a probability distribution
(generative model)
ACL-04 Tutorial
29
Strategies of Design
Kernel as a way to encode prior information
Invariance: synonymy, document length, …
Linguistic processing: word normalisation, semantics,
stopwords, weighting scheme, …
Convolution Kernels: text is a recursively-defined
data structure. How to build “global” kernels from
local (atomic level) kernels?
Generative model-based kernels: the “topology”
of the problem will be translated into a kernel
function
ACL-04 Tutorial
30
Strategies of Design
Kernel as a way to encode prior information
Invariance: synonymy, document length, …
Linguistic processing: word normalisation, semantics,
stopwords, weighting scheme, …
Convolution Kernels: text is a recursively-defined
data structure. How to build “global” kernels from
local (atomic level) kernels?
Generative model-based kernels: the “topology”
of the problem will be translated into a kernel
function
ACL-04 Tutorial
31
‘Bag of words’ kernels (I)
Document seen as a vector d, indexed by all the
elements of a (controlled) dictionary. The entry is
equal to the number of occurrences.
A training corpus is therefore represented by a
Term-Document matrix,
noted D=[d1 d2 … dm-1 dm]
The “nature” of a word will be discussed later
From this basic representation, we will apply a
sequence of successive embeddings, resulting in
a global (valid) kernel with all desired properties
ACL-04 Tutorial
32
BOW kernels (II)
Properties:
All order information is lost (syntactical relationships, local context,
…)
Feature space has dimension N (size of the dictionary)
Similarity is basically defined by:
k(d1,d2) = d1•d2 = d1ᵀ·d2
or, normalized (cosine similarity):
k̂(d1,d2) = k(d1,d2) / √( k(d1,d1)·k(d2,d2) )
Efficiency provided by sparsity (and sparse dot-product
algo): O(|d1|+|d2|)
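A minimal sketch (an illustrative addition) of the sparse BOW kernel with cosine normalization; documents are term→count dictionaries and the dot product iterates only over the smaller bag, in the spirit of the O(|d1|+|d2|) sparse dot-product above.

```python
from collections import Counter
import math

def bow(doc):
    """Bag-of-words vector as a sparse term -> count mapping."""
    return Counter(doc.lower().split())

def bow_kernel(d1, d2):
    # sparse dot product: iterate over the smaller bag only
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(c * d2.get(t, 0) for t, c in d1.items())

def normalized_bow_kernel(d1, d2):
    # cosine normalization: k(d1,d2) / sqrt(k(d1,d1) k(d2,d2))
    return bow_kernel(d1, d2) / math.sqrt(bow_kernel(d1, d1) * bow_kernel(d2, d2))

a = bow("the cat sat on the mat")
b = bow("the cat ate the mouse")
print(bow_kernel(a, b))             # 5  (2*2 for "the", 1*1 for "cat")
print(normalized_bow_kernel(a, b))  # value in (0, 1]
```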
ACL-04 Tutorial
33
‘Bag of words’ kernels: enhancements
The choice of indexing terms:
Exploit linguistic enhancements:
Lemma / Morpheme & stem
Disambiguated lemma (lemma+POS)
Noun Phrase (or useful collocation, n-grams)
Named entity (with type)
Grammatical dependencies (represented as feature vector
components)
Ex: The human resource director of NavyCorp communicated
important reports on ship reliability.
Exploit IR lessons
Stopword removal
Feature selection based on frequency
Weighting schemes (e.g. idf )
NB. Using polynomial kernels up to degree p is a natural and efficient way
of considering all (up-to-)p-grams (with different weights actually), but order
is not taken into account (“sinking ships” is the same as “shipping sinks”)
ACL-04 Tutorial
34
‘Bag of words’ kernels: enhancements
Weighting scheme:
the traditional idf weighting scheme
tfi → tfi·log(N/ni)
is a linear transformation (scaling) f(d) → W·f(d)
(where W is diagonal): k(d1,d2) = f(d1)ᵀ·(WᵀW)·f(d2) can
still be efficiently computed (O(|d1|+|d2|))
Semantic expansion (e.g. synonyms)
Assume some term-term similarity matrix Q (positive
definite): k(d1,d2) = f(d1)ᵀ·Q·f(d2)
In general, no sparsity is left (Q propagates similarity across terms)
How to choose Q (some kernel matrix for term)?
ACL-04 Tutorial
35
Semantic Smoothing Kernels
Synonymy and other term relationships:
GVSM Kernel: the term-term co-occurrence matrix (D·Dᵀ) is used in
the kernel: k(d1,d2) = d1ᵀ·(D·Dᵀ)·d2
The completely kernelized version of GVSM is:
the training kernel matrix K (= Dᵀ·D) → K² (m×m)
the kernel vector of a new document d vs the training documents: t → K·t (m×1)
The initial K could be a polynomial kernel (GVSM on multi-word terms)
Variants: one can use
a shorter context than the document to compute term-term similarity (term-context matrix)
another measure than the number of co-occurrences to compute the similarity (e.g. mutual information, …)
Can be generalised to Kⁿ (or a weighted combination of K¹, K², …, Kⁿ; cf. diffusion kernels later), but Kⁿ is less and less sparse!
Interpretation as a sum over paths of length 2n
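A sketch (an illustrative addition, on a toy term-document matrix) of the fully kernelized GVSM: starting from the BOW Gram matrix K = DᵀD, the training kernel becomes K² and the similarity vector of a new document becomes K·t.

```python
import numpy as np

# toy term-document matrix D (terms x documents), raw counts
D = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)

K = D.T @ D          # basic BOW training kernel (m x m), K = D'D
K_gvsm = K @ K       # kernelized GVSM training kernel, K^2 = D'(DD')D

d_new = np.array([1, 0, 2, 1], dtype=float)   # a new document over the same 4 terms
t = D.T @ d_new                               # BOW kernel vector vs training docs
t_gvsm = K @ t                                # GVSM kernel vector, K.t

print(K_gvsm)
print(t_gvsm)
```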
ACL-04 Tutorial
36
Semantic Smoothing Kernels
Can use other term-term similarity matrices than D·Dᵀ; e.g.
a similarity matrix derived from the Wordnet thesaurus,
where the similarity between two terms is defined as:
the inverse of the length of the path connecting the two terms
in the hierarchical hyper/hyponymy tree.
A similarity measure for nodes on a tree (feature space
indexed by each node n of the tree, with fn(x)=1 if term x is
the class represented by n or is “under” n), so that the similarity is
the number of common ancestors (including the node of the
class itself)
With semantic smoothing, 2 documents can be similar
even if they don’t share common words.
ACL-04 Tutorial
37
Latent concept Kernels
Basic idea:
[Diagram: two documents d1 and d2, each represented over the term space (size t), are mapped (f1, f2) into a latent concept space of size k << t, where K(d1,d2) is computed]
ACL-04 Tutorial
38
Latent concept Kernels
k(d1,d2) = f(d1)ᵀ·Pᵀ·P·f(d2),
where P is a (linear) projection operator from Term Space to Concept Space
Working with (latent) concepts provides:
Robustness to polysemy, synonymy, style, …
Cross-lingual bridge
Natural Dimension Reduction
But, how to choose P and how to define (extract) the
latent concept space? Ex: Use PCA : the concepts are
nothing else than the principal components.
ACL-04 Tutorial
39
Polysemy and Synonymy
[Figure: two plots in the term space (t1, t2) showing doc1, doc2 and a concept axis, illustrating how projection on the concept axis handles polysemy (left) and synonymy (right)]
ACL-04 Tutorial
40
More formally …
SVD decomposition of D ≈ U(t×k)·S(k×k)·Vᵀ(k×d), where U and
V are projection matrices (from term to concept and from concept to document)
Kernel Latent Semantic Indexing (SVD decomposition in feature space):
U is formed by the eigenvectors corresponding to the k largest eigenvalues
of D·Dᵀ (each column defines a concept by linear combination of terms)
V is formed by the eigenvectors corresponding to the k largest eigenvalues
of K = Dᵀ·D
S = diag(σi), where σi² (i=1,…,k) is the i-th largest eigenvalue of K
Cf. semantic smoothing with D·Dᵀ replaced by U·Uᵀ (new term-term similarity
matrix): k(d1,d2) = d1ᵀ·(U·Uᵀ)·d2
As in Kernel GVSM, the completely kernelized version of LSI is: K → V·S²·Vᵀ
(= K’s approximation of rank k) and t → V·Vᵀ·t (vector of similarities of a new
doc), with no computation in the feature space
If k = n, then the latent semantic kernel is identical to the initial kernel
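A sketch (an illustrative addition) of the completely kernelized LSI: take the top-k eigenvectors V and eigenvalues σᵢ² of K = DᵀD, form the rank-k kernel V·S²·Vᵀ, and project a new document's kernel vector t as V·Vᵀ·t, with no computation in the term space.

```python
import numpy as np

def kernel_lsi(K, k):
    """Rank-k latent semantic approximation of a Gram matrix K = D'D."""
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]           # k largest eigenvalues sigma_i^2
    V = eigvecs[:, idx]                           # m x k matrix of eigenvectors
    S2 = np.diag(eigvals[idx])
    K_lsi = V @ S2 @ V.T                          # rank-k approximation of K
    project = lambda t: V @ (V.T @ t)             # new-document kernel vector t -> V V't
    return K_lsi, project

D = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 1, 2]], dtype=float)
K = D.T @ D
K_lsi, project = kernel_lsi(K, k=2)
t = D.T @ np.array([1, 0, 2, 1], dtype=float)     # kernel vector of a new document
print(K_lsi)
print(project(t))
```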
ACL-04 Tutorial
41
Complementary remarks
Composition Polynomial kernel + kernel LSI (disjunctive normal
form) or Kernel LSI + polynomial kernel (tuples of concepts or
conjunctive normal form)
GVSM is a particular case with one document = one concept
Other decomposition :
Random mapping and randomized kernels (Monte-Carlo following some
non-uniform distribution; bounds exist to probabilistically ensure that the
estimated Gram matrix is ε-accurate)
Nonnegative matrix factorization D ≈ A(t×k)·S(k×d) [A ≥ 0]
ICA factorization D ≈ A(t×k)·S(k×d) (kernel ICA)
Cf. semantic smoothing with D·Dᵀ replaced by A·Aᵀ: k(d1,d2) = d1ᵀ·(A·Aᵀ)·d2
Decompositions coming from multilingual parallel corpora
(crosslingual GVSM, crosslingual LSI, CCA)
ACL-04 Tutorial
42
Why multilingualism helps …
Graphically:
[Diagram: terms in L1 and terms in L2 are linked through parallel contexts]
Concatenating both representations will force language-independent
concepts: each language imposes constraints on the other
Searching for maximally correlated projections of paired
observations (CCA) makes sense, semantically speaking
ACL-04 Tutorial
43
Diffusion Kernels
Recursive dual definition of the semantic smoothing:
K = Dᵀ·(I + uQ)·D
Q = D·(I + vK)·Dᵀ
NB. u=v=0 → standard BOW; v=0 → GVSM
Let B = Dᵀ·D (standard BOW kernel); G = D·Dᵀ
If u=v, the solution is the “Von Neumann diffusion kernel”:
K = B·(I + uB + u²B² + …) = B·(I - uB)⁻¹ and Q = G·(I - uG)⁻¹ [only if u < ||B||⁻¹]
Can be extended, with a faster decay, to the exponential diffusion kernel:
K = B·exp(uB) and Q = exp(uG)
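A sketch (an illustrative addition) of both diffusion kernels on a toy BOW Gram matrix B: the Von Neumann kernel B(I − uB)⁻¹, valid only when u times the largest eigenvalue of B stays below 1, and the exponential diffusion kernel B·exp(uB).

```python
import numpy as np

def von_neumann_kernel(B, u):
    # K = B (I + uB + u^2 B^2 + ...) = B (I - uB)^{-1}, only if u < ||B||^{-1}
    assert u * np.linalg.eigvalsh(B).max() < 1.0, "series does not converge"
    return B @ np.linalg.inv(np.eye(B.shape[0]) - u * B)

def exponential_diffusion_kernel(B, u):
    # K = B exp(uB): faster decay of long paths, computed via the eigendecomposition of B
    w, V = np.linalg.eigh(B)
    return B @ (V @ np.diag(np.exp(u * w)) @ V.T)

D = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 1, 2]], dtype=float)
B = D.T @ D                                   # standard BOW Gram matrix
u = 0.5 / np.linalg.eigvalsh(B).max()         # safely inside the convergence region
print(von_neumann_kernel(B, u))
print(exponential_diffusion_kernel(B, 0.1))
```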
ACL-04 Tutorial
44
Graphical Interpretation
These diffusion kernels correspond to defining similarities
between nodes in a graph, specifying only the myopic
view
[Diagram: either a bipartite Documents–Terms graph whose (weighted) adjacency matrix is the Doc-Term matrix, or, by aggregation, a term graph whose (weighted) adjacency matrix is the term-term similarity matrix G]
Diffusion kernels correspond to considering all paths of length 1, 2, 3, 4, … linking 2 nodes and summing the products of the local similarities, with different decay strategies
It is in some way similar to KPCA, just “rescaling” the eigenvalues of the basic Kernel matrix (decreasing the lowest ones)
ACL-04 Tutorial
45
Strategies of Design
Kernel as a way to encode prior information
Invariance: synonymy, document length, …
Linguistic processing: word normalisation, semantics,
stopwords, weighting scheme, …
Convolution Kernels: text is a recursively-defined
data structure. How to build “global” kernels from
local (atomic level) kernels?
Generative model-based kernels: the “topology”
of the problem will be translated into a kernel
function
ACL-04 Tutorial
46
Sequence kernels
Consider a document as:
A sequence of characters (string)
A sequence of tokens (or stems or lemmas)
A paired sequence (POS+lemma)
A sequence of concepts
A tree (parsing tree)
(later)
A dependency graph
Sequence kernels → order has importance
Kernels on string/sequence : counting the subsequences two objects
have in common … but various ways of counting
Contiguity is necessary (p-spectrum kernels)
Contiguity is not necessary (subsequence kernels)
Contiguity is penalised (gap-weighted subsequence kernels)
ACL-04 Tutorial
47
String and Sequence
Just a matter of convention:
String matching: implies contiguity
Sequence matching : only implies order
ACL-04 Tutorial
48
p-spectrum kernel
Features of s = p-spectrum of s = histogram of all
(contiguous) substrings of length p
Feature space indexed by all elements of Σᵖ
fu(s) = number of occurrences of u in s
Ex:
s = “John loves Mary Smith”
t = “Mary Smith loves John”

        JL  LM  MS  SL  LJ
   s     1   1   1   0   0
   t     0   0   1   1   1

K(s,t) = 1
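A small illustrative sketch (an addition to these notes) of the p-spectrum kernel by explicit counting over hashed p-gram histograms, which already behaves roughly as O(p(|s|+|t|)).

```python
from collections import Counter

def p_spectrum(seq, p):
    """Histogram of all contiguous subsequences (p-grams) of seq."""
    return Counter(tuple(seq[i:i + p]) for i in range(len(seq) - p + 1))

def p_spectrum_kernel(s, t, p):
    fs, ft = p_spectrum(s, p), p_spectrum(t, p)
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return sum(c * ft.get(u, 0) for u, c in fs.items())

s = "John loves Mary Smith".split()
t = "Mary Smith loves John".split()
print(p_spectrum_kernel(s, t, p=2))               # 1: only "Mary Smith" is shared
print(p_spectrum_kernel("aaba", "bababb", p=2))   # 4, as in the trie example later
```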
ACL-04 Tutorial
49
p-spectrum Kernels (II)
Naïve implementation:
For all p-grams of s, compare equality with the p-grams
of t
O(p|s||t|)
Later, implementation in O(p(|s|+|t|))
ACL-04 Tutorial
50
All-subsequences kernels
Feature space indexed by all elements of Σ* = {ε} ∪ Σ ∪ Σ² ∪ Σ³ ∪ …
fu(s) = number of occurrences of u as a (non-contiguous) subsequence of s
Explicit computation rapidly infeasible
(exponential in |s| even with sparse rep.)
Ex:
s = J L M S (“John loves Mary Smith”)
t = M S L J (“Mary Smith loves John”)
The subsequences common to s and t are ε, J, L, M, S and MS, each occurring once in both
K = 6
ACL-04 Tutorial
51
Recursive implementation
Consider the addition of one extra symbol a to s:
common subsequences of (sa,t) are either in s or
must end with symbol a (in both sa and t).
Mathematically:
k(s, ε) = 1
k(sa, t) = k(s, t) + Σ_{j: tj=a} k(s, t(1:j-1)) = k(s, t) + k'(sa, t)
k'(sa, tv) = k'(sa, t) + k(s, t) if v = a (else k'(sa, tv) = k'(sa, t))
This gives a complexity of O(|s||t|)
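A sketch (an illustrative addition) of the O(|s||t|) DP for this recursion: K[i][j] holds k(s[:i], t[:j]) and the auxiliary k' accumulates the matches ending with the last symbol of s[:i].

```python
def all_subsequences_kernel(s, t):
    """Number of (non-contiguous) subsequences, including the empty one,
    common to s and t, counted with multiplicity.  O(|s||t|) DP."""
    n, m = len(s), len(t)
    # K[i][j] = k(s[:i], t[:j]); an empty prefix shares only the empty subsequence
    K = [[1] * (m + 1)] + [[1] + [0] * m for _ in range(n)]
    for i in range(1, n + 1):
        kp = 0  # k'(s[:i], t[:j]): matches whose last symbol is s[i-1]
        for j in range(1, m + 1):
            if t[j - 1] == s[i - 1]:
                kp += K[i - 1][j - 1]
            K[i][j] = K[i - 1][j] + kp
    return K[n][m]

print(all_subsequences_kernel("JLMS", "MSLJ"))   # 6, as in the example above
s = "John loves Mary".split()
t = "John admires Mary Ann Smith".split()
print(all_subsequences_kernel(s, t))             # 4, the bottom-right cell of the DP table below
```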
ACL-04 Tutorial
52
Practical implementation (DP table)
s \ t         ε   John  admires  Mary  Ann  Smith
ε             1    1      1       1     1     1
K'(John)      0    1      1       1     1     1
John          1    2      2       2     2     2
K'(loves)     0    0      0       0     0     0
loves         1    2      2       2     2     2
K'(Mary)      0    0      0       2     2     2
Mary          1    2      2       4     4     4

NB: by-product: all k(a,b) for prefixes a of s, b of t
ACL-04 Tutorial
53
Fixed-length subsequence kernels
Feature space indexed by all elements of Σᵖ
fu(s) = number of occurrences of the p-gram u as a
(non-contiguous) subsequence of s
Recursive implementation (will create a series of p tables):
k_0(s, ε) = 1
k_p(s, ε) = 0 for p > 0
k_p(sa, t) = k_p(s, t) + Σ_{j: tj=a} k_{p-1}(s, t(1:j-1)) = k_p(s, t) + k_p'(sa, t)
k_p'(sa, tv) = k_p'(sa, t) + k_{p-1}(s, t) if v = a (else k_p'(sa, tv) = k_p'(sa, t))
Complexity: O(p|s||t|), but we get the l-length subsequence kernels
(l ≤ p) for free → easy to compute k(s,t) = Σ_l a_l·k_l(s,t)
ACL-04 Tutorial
54
Gap-weighted subsequence kernels
Feature space indexed by all elements of Σᵖ
fu(s) = sum of weights of occurrences of the p-gram u as a
(non-contiguous) subsequence of s, the weight being
length-penalizing: λ^length(u) [NB: length includes both
matching symbols and gaps]
Example:
D1: ATCGTAGACTGTC
D2: GACTATGC
f_CAT(D1) = 2λ⁸ + 2λ¹⁰ and f_CAT(D2) = λ⁴
k(D1,D2)_CAT = 2λ¹² + 2λ¹⁴
Naturally built as a dot product → valid kernel
For an alphabet of size 80, there are 512,000 trigrams
For an alphabet of size 26, there are 12·10⁶ 5-grams
ACL-04 Tutorial
55
Gap-weighted subsequence kernels
Hard to perform the explicit expansion and dot-product!
Efficient recursive formulation (dynamic
programming-like), whose complexity is
O(k·|D1|·|D2|)
Normalization (doc length independence):
k̂(d1,d2) = k(d1,d2) / √( k(d1,d1)·k(d2,d2) )
ACL-04 Tutorial
56
Recursive implementation
Defining K'i(s,t) as Ki(s,t), but with the occurrences weighted by the
length to the end of the string(s):
Ki(sa, t) = Ki(s, t) + λ² Σ_{j: tj=a} K'_{i-1}(s, t[1:j-1])
K'i(sa, t) = λ·K'i(s, t) + Σ_{j: tj=a} λ^(|t|-j+2) K'_{i-1}(s, t[1:j-1]) = λ·K'i(s, t) + K"i(sa, t)
K"i(sa, tv) = λ·K"i(sa, t) + λ²·K'_{i-1}(s, t) if v = a (else K"i(sa, tv) = λ·K"i(sa, t))
K'0(s, t) = 1
K'i(s, t) = 0 if min(|s|, |t|) < i
3·p DP tables must be built and maintained
As before, as a by-product, we get all gap-weighted l-gram kernels with l ≤ p,
so that any linear combination k(s,t) = Σ_l a_l·k_l(s,t) is easy to compute
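A sketch (an illustrative addition) of this DP, in the Lodhi-style formulation where each matched p-gram is weighted by λ^(span in s)·λ^(span in t), as in the CAT example; it returns K_1, …, K_p so linear combinations come for free.

```python
def gap_weighted_subsequence_kernel(s, t, p, lam=0.5):
    """Gap-weighted subsequence kernels [K_1, ..., K_p] for sequences s and t."""
    n, m = len(s), len(t)
    # Kp1[i][j] = K'_{l-1}(s[:i], t[:j]); level 0 is identically 1
    Kp1 = [[1.0] * (m + 1) for _ in range(n + 1)]
    kernels = []
    for l in range(1, p + 1):
        K = 0.0
        Kp1_new = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            Kpp = 0.0  # K''_l(s[:i], t[:j]), accumulated over j
            for j in range(1, m + 1):
                # K''_l(sa, tv) = lam*K''_l(sa, t) + lam^2 [v=a] K'_{l-1}(s, t)
                Kpp = lam * Kpp + (lam ** 2) * (s[i - 1] == t[j - 1]) * Kp1[i - 1][j - 1]
                # K'_l(sa, t) = lam*K'_l(s, t) + K''_l(sa, t)
                Kp1_new[i][j] = lam * Kp1_new[i - 1][j] + Kpp
                # K_l accumulates lam^2 * K'_{l-1}(s, t[1:j-1]) at every match
                if s[i - 1] == t[j - 1]:
                    K += (lam ** 2) * Kp1[i - 1][j - 1]
        kernels.append(K)
        Kp1 = Kp1_new
    return kernels

# "cat" occurs spanning 3 symbols in s and 4 in t, so K_3 = lam^3 * lam^4
print(gap_weighted_subsequence_kernel("cat", "cart", p=3, lam=0.5)[-1])  # 0.0078125
```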
ACL-04 Tutorial
57
Word Sequence Kernels (I)
Here “words” are considered as symbols
Meaningful symbols → more relevant matching
Linguistic preprocessing can be applied to improve
performance
Shorter sequence sizes → improved computation time
But increased sparsity (documents are more “orthogonal”)
Intermediate step: syllable kernels (indirectly realize some
low-level stemming and morphological decomposition)
Motivation: the noisy stemming hypothesis (important N-grams
approximate stems), confirmed experimentally in a categorization task
ACL-04 Tutorial
58
Word Sequence Kernels (II)
Link between Word Sequence Kernels and other
methods:
For k=1, WSK is equivalent to the basic “Bag Of Words” approach
For λ=1, close relation to the polynomial kernel of degree k, but WSK
takes order into account
Extensions of WSK:
Symbol-dependent decay factors (a way to introduce the IDF concept,
dependence on the POS, stop words)
Different decay factors for gaps and matches (e.g. noun<adj when
gap; noun>adj when match)
Soft matching of symbols (e.g. based on a thesaurus, or on a
dictionary if we want cross-lingual kernels)
ACL-04 Tutorial
59
Recursive equations for variants
It is straightforward to adapt the recursive equations, without
increasing complexity:
Ki(sa, t) = Ki(s, t) + Σ_{j: tj=a} λ²_{a,match} K'_{i-1}(s, t[1:j-1])
K'i(sa, t) = K'i(s, t)·λ_{a,gap} + K"i(sa, t)
K"i(sa, tv) = K"i(sa, t)·λ_{v,gap} + K'_{i-1}(s, t)·λ²_{v,match} if v = a
Or, for soft matching (b_{a,b}: elementary symbol kernel):
Ki(sa, t) = Ki(s, t) + λ² Σ_{j=1..|t|} b_{a,tj} K'_{i-1}(s, t[1:j-1])
K'i(sa, t) = K'i(s, t)·λ + K"i(sa, t)
K"i(sa, tv) = K"i(sa, t)·λ + K'_{i-1}(s, t)·b_{a,v}·λ²
ACL-04 Tutorial
60
Trie-based kernels
An alternative to DP based on string matching techniques
TRIE = Retrieval Tree (cf. prefix tree) = tree whose internal
nodes have their children indexed by Σ.
Suppose F = Σᵖ: the leaves of a complete p-trie are the
indices of the feature space
Basic algorithm:
Generate all substrings s(i:j) satisfying the initial criteria; idem for t.
Distribute the s-associated list down from the root to the leaves (depth-first)
Distribute the t-associated list down from the root to the leaves, taking into
account the distribution of the s-list (pruning)
Compute the products at the leaves and sum over the leaves
Key point: in steps (2) and (3), not all the leaves will be
populated (else complexity would be O(|Σᵖ|)) … and you need not
build the trie explicitly!
ACL-04 Tutorial
61
Some digression to learning method
Kernel-based learning algorithms (e.g. SVM,
kernel perceptron, …) result in models of the form
f(Σi ai·yi·k(x, xi)) or f((Σi ai·yi·φ(xi))•φ(x))
This suggests efficient computation by pre-storing
(« pre-compiling ») the weighted TRIE Σi ai·yi·φ(xi)
(the numbers of occurrences are weighted by ai·yi and
summed up).
This « pre-compilation » trick is often possible
with a lot of convolution kernels.
ACL-04 Tutorial
62
Example 1 – p-spectrum
p=2
s = a a b a → {aa, ab, ba}
t = b a b a b b → {ba, ab, ba, ab, bb}
Complexity: O(p(|s|+|t|))
[Trie figure: under prefix a, S:{aa, ab} and T:{ab, ab}; under prefix b, S:{ba} and T:{ba, ba, bb};
at the populated leaves: aa (S:1), ab (S:1, T:2), ba (S:1, T:2), bb (T:1)]
→ k = 2·1 + 2·1 = 4
ACL-04 Tutorial
63
Example 2: (p,m)-mismatch kernels
Feature space indexed by all elements of Σᵖ
fu(s) = number of p-grams (substrings) of s that
differ from u by at most m symbols
See the example on the next slide (p=2; m=1)
Complexity O(p^(m+1)·|Σ|^m·(|s|+|t|))
Can easily be extended, by using a semantic
(local) dissimilarity matrix b_{a,b}, to fu(s) = number of
p-grams (substrings) of s that differ from u by a
total dissimilarity not larger than some threshold
(total = sum)
ACL-04 Tutorial
64
Example 2: illustration
[Trie figure for s = aaba and t = bababb (p=2, m=1): each node lists the 2-grams of s and t that reach it, with their current number of mismatches, e.g. under prefix a: S:{aa:0, ab:0, ba:1}, T:{ab:0, ab:0, ba:1, ba:1, bb:1}; under prefix b: S:{ba:0, aa:1, ab:1}, T:{ba:0, ba:0, bb:0, ab:1, ab:1}.
At the leaves: aa (S:3, T:4), ab (S:2, T:3), ba (S:2, T:3), bb (S:2, T:5)]
k = 3·4 + 2·3 + 2·3 + 2·5 = 34
ACL-04 Tutorial
65
Example 3: restricted gap-weighted kernel
Feature space indexed by all elements of Σᵖ
fu(s) = sum of weights of occurrences of the p-gram u as a
(non-contiguous) subsequence of s, provided that u
occurs with at most m gaps, the weight being gap-penalizing: λ^gaps(u)
For small λ, restricting m to 2 or even 1 is a reasonable
approximation of the full gap-weighted subsequence
kernel
Cf. the previous algorithms, but generate all (p+m)-substrings
at the initial phase.
Complexity O((p+m)^m·(|s|+|t|))
If m is too large, DP is more efficient
ACL-04 Tutorial
66
Example 3 : illustration
p=2; m=1
s = a a b a → {aab, aba}
t = b a b a a b → {bab, aba, baa, aab}
[Trie figure: the (p+m)-substrings are distributed down the trie; at the two populated leaves the occurrences are weighted by their number of gaps:
S:{aab:0; aba:1}, T:{aab:0; aba:1} and S:{aab:1; aba:0}, T:{aab:1; aba:0}]
→ k = (1+λ)·(1+λ) + (1+λ)·(1+λ)
ACL-04 Tutorial
67
Mini Dynamic Programming
Imagine the processing of the substring aaa in the
previous example
In the depth-first traversal, to assign a unique
value at the leaf (when there is more than one way),
Dynamic Programming must be used
Build a small DP table [size (p+m)×p] to find the
least penalised way.
ACL-04 Tutorial
68
Tree Kernels
Application: categorization [one doc = one tree],
parsing (disambiguation) [one doc = multiple
trees]
Tree kernels constitute a particular case of more
general kernels defined on discrete structures
(convolution kernels). Intuitively, the philosophy is
to split the structured objects into parts,
to define a kernel on the “atoms” and a way to
recursively combine kernels over parts to get the kernel
over the whole.
ACL-04 Tutorial
69
Tree seen as String
One could use our string kernels by re-encoding
the tree as a string, using extra characters (cf.
LISP representation of trees)
Ex: the tree VP → (V loves) (N Mary) is
encoded as: [VP [V [loves]] [N [Mary]]]
Restrict substrings to subtrees by imposing
constraints on the number and position of ‘[‘ and ‘]’
ACL-04 Tutorial
70
Foundations of Tree kernels
Feature space definition: one feature for each
possible proper subtree in the training data;
feature value = number of occurrences
A subtree is defined as any part of the tree which
includes more than one node, with the restriction
that no “partial” rule production is allowed.
ACL-04 Tutorial
71
Tree Kernels : example
Example:
[Figure: a parse tree of “John loves Mary” and a few among the many subtrees of this tree, e.g. the VP subtree covering “loves Mary”, the bare production VP → V N, …]
ACL-04 Tutorial
72
Tree Kernels : algorithm
Kernel = dot product in this high dimensional feature space
Once again, there is an efficient recursive algorithm (in
polynomial time, not exponential!)
Basically, it compares the productions of all possible pairs of
nodes (n1,n2) (n1 ∈ T1, n2 ∈ T2); if the production is the
same, the number of common subtrees rooted at both n1
and n2 is computed recursively, considering the number of
common subtrees rooted at the common children
Formally, let k_co-rooted(n1,n2) = number of common subtrees
rooted at both n1 and n2
k(T1, T2) = Σ_{n1∈T1} Σ_{n2∈T2} k_co-rooted(n1, n2)
ACL-04 Tutorial
73
All sub-tree kernel
K_co-rooted(n1,n2) = 0 if n1 or n2 is a leaf
K_co-rooted(n1,n2) = 0 if n1 and n2 have different
productions or, if labeled, different labels
Else K_co-rooted(n1,n2) = Π_{children i} (1 + k_co-rooted(ch(n1,i), ch(n2,i)))
“Production” is left intentionally ambiguous, to include both
unlabelled and labeled trees
Complexity is O(|T1|·|T2|)
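A sketch (an illustrative addition) of this recursion on simple tuple-encoded trees, where a node is (label, children…); k_co_rooted follows exactly the three cases above.

```python
def productions_match(n1, n2):
    """Same production: same label and same ordered child labels."""
    return n1[0] == n2[0] and [c[0] for c in n1[1:]] == [c[0] for c in n2[1:]]

def k_co_rooted(n1, n2):
    # 0 if either node is a leaf or the productions differ
    if len(n1) == 1 or len(n2) == 1 or not productions_match(n1, n2):
        return 0
    prod = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1 + k_co_rooted(c1, c2)
    return prod

def nodes(tree):
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def tree_kernel(t1, t2):
    # sum over all pairs of nodes of the number of common subtrees rooted there
    return sum(k_co_rooted(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))

# toy parse trees of "John loves Mary" and "Mary loves John"
t1 = ("S", ("N", ("John",)), ("VP", ("V", ("loves",)), ("N", ("Mary",))))
t2 = ("S", ("N", ("Mary",)), ("VP", ("V", ("loves",)), ("N", ("John",))))
print(tree_kernel(t1, t2))   # 8 with this toy encoding
```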
ACL-04 Tutorial
74
Illustration
[Figure: two trees, the first with nodes a, b, …, h, the second with nodes i, j, k]

K_co-rooted   i   j   k
     a        1   0   0
     b        0   0   0
     c        1   0   0
     d        0   0   0
     e        0   0   0
     f        0   0   0
     g        0   0   0
     h        0   0   0

K = 2
ACL-04 Tutorial
75
Tree kernels : remarks
Even with “cosine” normalisation, the kernel
remains “too peaked” (influence of the larger structure
features … whose number grows exponentially)
Either restrict to subtrees up to depth p,
or downweight larger structure features by a decay
factor
ACL-04 Tutorial
76
d-restricted Tree Kernel
Store d DP tables K_co-rooted(n1, n2, p), p = 1, …, d
k_up-to-p(T1, T2) = Σ_{n1∈T1} Σ_{n2∈T2} k_co-rooted(n1, n2, p)
K_co-rooted(n1, n2, p) = Π_i (1 + K_co-rooted(ci(n1), ci(n2), p-1))
with initial K_co-rooted(n1, n2, 1) = 1 if n1, n2 have the same production
Complexity is O(p·|T1|·|T2|)
ACL-04 Tutorial
77
Depth-penalised Tree Kernel
Else K_co-rooted(n1,n2) = λ² Π_{children i} (1 + k_co-rooted(ch(n1,i), ch(n2,i)))
This corresponds to weighting each (implicit) feature by
λ^size(subtree), where size = number of (non-terminating) nodes
ACL-04 Tutorial
78
Variant for labeled ordered tree
Example: dealing with html/xml documents
Extension to deal with:
Partially equal productions
Children with the same labels
… but order is important
[Figure: two labeled ordered trees rooted at n1 and n2; the subtree shown (A with children B, A) is common 4 times]
ACL-04 Tutorial
79
Labeled Order Tree Kernel : algo
Actually, two-dimensional dynamic programming
(vertical and horizontal):
k(T1, T2) = Σ_{n1∈T1} Σ_{n2∈T2} k_co-rooted(n1, n2)
K_co-rooted(n1,n2) = 0 if n1, n2 have different labels; else
K_co-rooted(n1,n2) = S(nc(n1), nc(n2)) [number of common subtrees up to …]
S(i,j) = S(i-1,j) + S(i,j-1) - S(i-1,j-1) + S(i-1,j-1)·K_co-rooted(ch(n1,i), ch(n2,j))
Complexity: O(|T1|·|T2|·nc(T1)·nc(T2))
Easy extension to allow label mutation, by introducing a
sim(label(a), label(b)) PD matrix
ACL-04 Tutorial
80
Dependency Graph Kernel
A sub-graph is a
connected part
with at least two
words (and the
labeled edges)
[Figure: the dependency graph of “I saw the man with the telescope”, with labeled edges (sub, obj, det, PP, PP-obj), and two of its sub-graphs]
ACL-04 Tutorial
81
Dependency Graph Kernel
k(D1, D2) = Σ_{n1∈D1} Σ_{n2∈D2} k_co-rooted(n1, n2)
K_co-rooted(n1,n2) = 0 if n1 or n2 has no child
K_co-rooted(n1,n2) = 0 if n1 and n2 have different
labels
Else K_co-rooted(n1,n2) = Π_{(x,y) ∈ common dependencies} (2 + k_co-rooted(x, y)) - 1
ACL-04 Tutorial
82
Paired sequence kernel
A subsequence is a subsequence of states, with or
without the associated word
[Figure: a paired sequence of states (tags) Det Noun Verb … aligned with the words “The man saw …”, and some of its subsequences, e.g. Det-Noun, Det-Noun with the words “The man”, Verb]
ACL-04 Tutorial
83
Paired sequence kernel
k(S1, S2) = Σ_{n1∈states(S1)} Σ_{n2∈states(S2)} k_co-rooted(n1, n2)
If tag(n1)=tag(n2) AND tag(next(n1))=tag(next(n2)):
K_co-rooted(n1,n2) = (1+x)·(1+y+K_co-rooted(next(n1),next(n2)))
where
x = 1 if word(n1)=word(n2), = 0 else
y = 1 if word(next(n1))=word(next(n2)), = 0 else
Else K_co-rooted(n1,n2) = 0
[Figure: decomposition of the match between two word/tag pairs (A c, B d) into its common subsequences]
ACL-04 Tutorial
84
Graph Kernels
General case:
Directed
Labels on both vertices and edges
Loops and cycles are allowed (not in all algorithms)
Particular cases easily derived from the general
one:
Non-directed
No label on edge, no label on vertex
ACL-04 Tutorial
85
Theoretical issues
To design a kernel taking the whole graph
structure into account amounts to building a
complete graph kernel that distinguishes between
2 graphs only if they are not isomorphic
Complete graph kernel design is theoretically
possible … but practically infeasible (NP-hard)
Approximations are therefore necessary:
Common local subtree kernels
Common (label) walk kernels (most popular)
ACL-04 Tutorial
86
Graph kernels based on Common Walks
Walk = (possibly infinite) sequence of labels
obtained by following edges on the graph
Path = walk with no vertex visited twice
Important concept: direct product of two graphs
G1×G2:
V(G1×G2) = {(v1,v2): v1 and v2 have the same label}
E(G1×G2) = {(e1,e2): e1, e2 have the same label, p(e1) and
p(e2) have the same label, n(e1) and n(e2) have the same label}
[Figure: an edge e goes from its previous vertex p(e) to its next vertex n(e)]
ACL-04 Tutorial
87
Direct Product of Graphs
Examples:
[Figure: two examples of the direct product of small labeled graphs (labels A, B, C)]
ACL-04 Tutorial
88
Kernel Computation
Feature space: indexed by all possible walks (of length 1,
2, 3, …, n, possibly ∞)
Feature value: sum of the numbers of occurrences, weighted
by a function λ(size of the walk)
Theorem: the common walks of G1 and G2 are the walks of
G1×G2 (this is not true for paths!)
So, K(G1,G2) = Σ_{over all walks g of G1×G2} λ(size of g) (PD kernel)
The choice of the function λ will ensure convergence of the
kernel: typically β^s or β^s/s!, where s = size of g.
Walks of (G1×G2) are obtained by the power series of the
adjacency matrix (closed form if the series converges) –
cf. our discussion about diffusion kernels
Complexity: O(|V1|·|V2|·|A(V1)|·|A(V2)|)
ACL-04 Tutorial
89
Remark
This kind of kernel is unable to make the
distinction between:
[Figure: two different graphs over labels A and B that generate the same set of walks]
Common subtree kernels allow us to overcome
this limitation (locally unfold the graph as a tree)
ACL-04 Tutorial
90
Variant: Random Walk Kernel
Doesn’t (explicitly) make use of the direct product
of the graphs
Directly deals with walks up to length ∞
Easy extension to weighted graphs by the
« random walk » formalism
Actually, nearly equivalent to the previous
approach (… and same complexity)
Some reminiscence of PageRank-like approaches
ACL-04 Tutorial
91
Random Walk Graph Kernels
k(G1,G2) = Σ_{n1∈G1} Σ_{n2∈G2} k_co-rooted(n1, n2)
where k_co-root(n1,n2) = sum of probabilities of
common walks starting from both n1 and n2
Very local view: k_co-root(n1,n2) = I(n1,n2), with
I(n1,n2) = 1 only if n1 and n2 have the same label
Recursively defined (cf. diffusion kernels) as:
k_co-root(n1,n2) = I(n1,n2)·( (1-λ) + λ · Σ_{e1∈A(n1), e2∈A(n2)} [ I(e1,e2) / (|A(n1)|·|A(n2)|) ] · k_co-root(next(n1,e1), next(n2,e2)) )
ACL-04 Tutorial
92
Random Walk Kernels (end)
This corresponds to random walks with a
probability of stopping = (1-λ) at each stage and a
uniform (or non-uniform) distribution when
choosing an edge
Matricially, the vector of k_co-root(n1,n2), noted as k,
can be written k = (1-λ)·k0 + λ·B·k (where B directly
derives from the adjacency matrices and has size
|V1|·|V2| × |V1|·|V2|)
So, kernel computation amounts to … matrix
inversion: k = (1-λ)·(I - λB)⁻¹·k0 … often solved
iteratively (back to the recursive formulation)
ACL-04 Tutorial
93
Strategies of Design
Kernel as a way to encode prior information
Invariance: synonymy, document length, …
Linguistic processing: word normalisation, semantics,
stopwords, weighting scheme, …
Convolution Kernels: text is a recursively-defined
data structure. How to build “global” kernels from
local (atomic level) kernels?
Generative model-based kernels: the “topology”
of the problem will be translated into a kernel
function
ACL-04 Tutorial
94
Marginalised – Conditional Independence Kernels
Assume a family of models M (with prior p0(m) on each
model) [finite or countably infinite]
each model m gives P(x|m)
Feature space indexed by models: x → P(x|m)
Then, assuming conditional independence, the joint
probability is given by
P_M(x, z) = Σ_{m∈M} P(x, z | m)·P0(m) = Σ_{m∈M} P(x | m)·P(z | m)·P0(m)
This defines a valid probability-kernel (CI implies a PD
kernel), by marginalising over m. Indeed, the Gram matrix
is K = P·diag(P0)·Pᵀ (some reminiscence of latent concept
kernels)
ACL-04 Tutorial
95
Reminder
This family of strategies brings you the additional
advantage of using all your unlabeled training
data to design more problem-adapted kernels
They constitute a natural and elegant way of
solving semi-supervised problems (mix of labelled
and unlabelled data)
ACL-04 Tutorial
96
Example 1: PLSA-kernel
(somewhat artificial)
Probabilistic Latent Semantic Analysis provides a
generative model of both documents and words in
a corpus:
[Graphical model: d ← c → w]
P(d,w) = Σ_c P(c)·P(d|c)·P(w|c)
Assuming that the topics constitute the model family, you can
identify the models P(d|c) and P(c)
Then use marginalised kernels:
P_M(d1, d2) = Σ_{c∈M} P(d1 | c)·P(d2 | c)·P0(c)
ACL-04 Tutorial
97
Example 2: HMM generating fixed-length strings
The generative model of a string s (of length n) is given by
an HMM (A is the set of states):
[Figure: hidden states h1 → h2 → h3 → … → hn emitting the symbols s1, s2, s3, …, sn]
P(s | h) = Π_{i=1..n} P(si | hi)
Then
k(s, t) = Σ_{paths h∈Aⁿ} Π_{i=1..n} P(si | hi)·P(ti | hi)·P(hi | h_{i-1})
With an efficient recursive (DP) implementation in O(n|A|²)
ACL-04 Tutorial
One step further: marginalised
kernels with latent variable models
98
Assume you know both the visible (x) and the
latent (h) variables and want to impose the joint
kernel kz((x,h),(x',h'))
This is more flexible than the previous approach, for
which kz((x,h),(x',h')) is automatically 0 if h ≠ h'
Then the kernel is obtained by marginalising
(averaging) over the latent variables:
k(x,x') = Σ_h Σ_h' p(h|x)·p(h'|x')·kz((x,h),(x',h'))
ACL-04 Tutorial
99
Marginalised latent kernels
The posterior p(h|x) are given by the generative model
(using Bayes rule)
This is an elegant way of coupling generative models and
kernel methods; the joint kernel kz allows the introduction of
domain (user) knowledge
Intuition: learning with such kernels will perform well if the
class variable is contained as a latent variable of the
probability model (as good as the corresponding MAP
decision)
Basic key feature: when computing the similarity between
two documents, the same word (x) can be weighted
differently depending on the context (h) [title,
development, conclusion]
ACL-04 Tutorial
100
Examples of Marginalised Latent Kernels
Gaussian mixture: p(x) = Σ_h p(h)·N(x | m_h, A_h)
Choosing kz((x,h),(x',h')) = xᵀ·A_h·x' if h=h' (0 else) … this corresponds
to the local Mahalanobis intra-cluster distance
Then k(x,x') = Σ_h p(x|h)·p(x'|h)·p²(h)·xᵀ·A_h·x' / (p(x)·p(x'))
Contextual BOW kernel:
Let x, x' be two symbol sequences corresponding to sequences h, h'
of hidden states
We can decide to count common symbol occurrences only if they
appear in the same context (given the h and h' sequences, this is
the standard BOW kernel restricted to common states); then the
results are summed and weighted by the posteriors.
ACL-04 Tutorial
102
Fisher Kernels
Assume you have only 1 model
Marginalised kernels give you little information: only one feature, P(x|m)
To exploit more, the model must be “flexible”, so that we can measure
how it adapts to individual items → we require a “smoothly”
parametrised model
Link with the previous approach: locally perturbed models constitute our
family of models, but dim F = number of parameters
More formally, let P(x|θ) be the generative model; the feature vector of x is
the gradient ∇_θ log P(x|θ) evaluated at θ = θ0 (θ0 is typically
found by max likelihood); the gradient reflects how the model
would be changed to accommodate the new point x (NB. in
practice the log-likelihood is used)
ACL-04 Tutorial
103
Fisher Kernel : formally
Two objects are similar if they require similar
adaptation of the parameters or, in other words, if
they stretch the model in the same direction:
K(x,y) = (∇_θ log P(x|θ)|_{θ=θ0})ᵀ · I_M⁻¹ · (∇_θ log P(y|θ)|_{θ=θ0})
where I_M is the Fisher Information Matrix:
I_M = E[ (∇_θ log P(x|θ)|_{θ=θ0})·(∇_θ log P(x|θ)|_{θ=θ0})ᵀ ]
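A sketch (an illustrative addition) of a practical Fisher kernel with I_M approximated by the identity (as the next slide notes is common in practice), for a simplified unigram document model with parameters θ_w: the score vector of a document is its term counts divided by θ_w, and the kernel is the dot product of score vectors. All names and numbers are made up.

```python
import numpy as np

def fisher_scores(counts, theta):
    """Gradient of the log-likelihood of a unigram model wrt its parameters:
    d/d theta_w of sum_w n_w log(theta_w) = n_w / theta_w (normalisation ignored)."""
    return counts / theta

def fisher_kernel(c1, c2, theta):
    # I_M approximated by the identity
    return float(fisher_scores(c1, theta) @ fisher_scores(c2, theta))

vocab = ["kernel", "svm", "tree", "parse"]          # vocabulary, for readability only
theta = np.array([0.4, 0.3, 0.2, 0.1])              # unigram model fitted on a corpus
d1 = np.array([3, 1, 0, 0], dtype=float)            # term counts of two documents
d2 = np.array([2, 2, 1, 0], dtype=float)
print(fisher_kernel(d1, d2, theta))
```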
ACL-04 Tutorial
104
On the Fisher Information Matrix
The FI matrix gives some non-Euclidean,
topology-dependent dot-product
It also provides invariance to any smooth
invertible reparametrisation
It can be approximated by the empirical covariance
of the training data points
But it can be shown that it increases the risk of
amplifying noise if some parameters are not
relevant
In practice, I_M is often taken as the identity I.
ACL-04 Tutorial
105
Example 1 : Language model
Language models can improve k-spectrum
kernels
A language model is a generative model: the p(wn | wn-k … wn-1) are the parameters
The likelihood of s is Π_{j=1..|s|-k} p(s_{j+k} | s_j … s_{j+k-1})
The gradient of the log-likelihood with respect to the
parameter associated with uv is the number of occurrences of
uv in s divided by p(uv)
ACL-04 Tutorial
106
String kernels and Fisher Kernels
The standard p-spectrum kernel corresponds to the
Fisher Kernel of a p-stage Markov process with a
uniform distribution for p(uv) (= 1/|Σ|) … which is
the least informative parameter setting
Similarly, the gap-weighted subsequence kernel is
the Fisher Kernel of a generalized k-stage Markov
process with decay factor λ and uniform p(uv)
(any subsequence of length (k-1) from the
beginning of the document contributes to
explaining the next symbol, with a gap-penalizing
weight)
ACL-04 Tutorial
107
Example 2 : PLSA-Fisher Kernels
An example: the Fisher kernel for PLSA improves the
standard BOW kernel:
K(d1,d2) = Σ_c P(c|d1)·P(c|d2) / P(c)  +  Σ_w t̃f(w,d1)·t̃f(w,d2) · Σ_c P(c|d1,w)·P(c|d2,w) / P(w|c)
          = k1(d1,d2) + k2(d1,d2)
where k1(d1,d2) is a measure of how much d1 and d2
share the same latent concepts (synonymy is taken
into account)
where k2(d1,d2) is the traditional inner product of
common term frequencies, but weighted by the degree
to which these terms belong to the same latent concept
(polysemy is taken into account)
ACL-04 Tutorial
Link between Fisher Kernels and
Marginalised Latent Variable Kernels
108
Assume a generative model of the form:
P(x) = Σ_h p(x, h | θ) (latent variable model)
Then the Fisher Kernel can be rewritten as
k(x,x') = Σ_h Σ_h' p(h|x)·p(h'|x')·kz((x,h),(x',h'))
with a particular form of kz
Some will argue that:
Fisher Kernels are better, as kz is theoretically founded
MLV Kernels are better, because of the flexibility of kz
ACL-04 Tutorial
109
Applications of KM in NLP
Document categorization and filtering
Event Detection and Tracking
Chunking and Segmentation
Dependency parsing
POS-tagging
Named Entity Recognition
Information Extraction
Others: Word sense disambiguation, Japanese
word segmentation, …
ACL-04 Tutorial
110
General Ideas (1)
Consider generic NLP tasks such as tagging (POS, NE,
chunking, …) or parsing (syntactical parsing, …)
Kernels defined for structures such as paired sequences
(tagging) and trees (parsing) ; can be easily extended to
weighted (stochastic) structures (probabilities given by
HMM, PCFG,…)
Goal: instead of finding the most plausible analysis by
building a generative model, define kernels and use a
learning method (classifier) to discover the correct
analysis, … which necessitates training examples
Advantages: avoid relying on restrictive assumptions
(independence, restriction to low-order information, …),
take into account larger substructures via efficient kernels
ACL-04 Tutorial
111
General ideas (2)
Methodology often involves transforming the
original problem into a classification problem or a
ranking problem
Example: parsing transformed into ranking
Sentence s → {x1, x2, x3, …, xn} candidate parsing trees,
obtained by a CFG or top-n PCFG
Find a ranking model which outputs a plausibility score
W•f(xi)
Ideally, for each sentence, W•f(x1) > W•f(xi), i=2,…,n (x1 being the correct tree)
This is the primal formulation (not practical)
ACL-04 Tutorial
112
Ranking problem
[Figure: in the feature space (f1, f2), the correct trees for all training sentences lie on one side and the incorrect trees on the other; an optimal W separates them, a non-optimal W does not]
ACL-04 Tutorial
113
Ranking problem Dual Algo
This is a problem very close to classification
The only difference: the origin has no importance; we
can work with the relative values f(x1) - f(xi) instead
Dual formulation: W = Σ ai,j·[f(x1,j) - f(xi,j)] (ai,j: dual
parameters)
The decision output is now Σ ai,j·[k(x,x1,j) - k(x,xi,j)]
Dual parameters are obtained by margin
maximisation (SVM) or a simple updating rule such
as ai,j = ai,j + 1 if W•f(x1,j) < W•f(xi,j) (this is the dual
formulation of the Perceptron algorithm)
ACL-04 Tutorial
114
Typical Results
Based on the coupling (efficient kernel + margin-based
learning): this coupling is known
to have good generalisation properties, both theoretically and
experimentally
to overcome the « curse of dimensionality » problem in high-dimensional
feature spaces
Margin-based kernel method vs PCFG for parsing the
Penn TreeBank ATIS corpus: 22% increase in accuracy
Other learning frameworks such as boosting, MRF are
primal
Having the same features as in Kernel methods is prohibitive
Choice of good features is critical
ACL-04 Tutorial
115
Slowness remains the main issue!
Slowness during learning (typically SVM)
Circumvented by heuristics in SVM (caching, linear expansion, …)
Use of low-cost learning methods such as the Voted Perceptron
(same as the perceptron, with storage of intermediate « models » and a
weighted vote)
Slowness during the classification step (applying the
learned model to new examples to take a decision)
Use of efficient representation (inverted index, a posteriori
expansion of the main features to get a linear model so that the
decision function is computed in linear time, and is no longer
quadratic in document size)
Some « pre-compilation » of the Support Vector solution is often
possible
Revision Learning: efficient cooperation between a standard method
(e.g. HMM for tagging) and an SVM (used only to correct the errors of the
HMM: binary problem instead of a complex one-vs-n problem)
ACL-04 Tutorial
116
Document Categorization & Filtering
Classification task
Classes = topics
Or Class = relevant / not relevant (filtering)
Typical corpus: 30,000 features, 10,000 training
documents
Break-even point:

              Reuters   WebKB   Ohsumed
Naïve Bayes    72.3      82.0     62.4
Rocchio        79.9      74.1     61.5
C4.5           79.4      79.1     56.7
K-NN           82.6      80.5     63.4
SVM            87.5      90.3     71.6
ACL-04 Tutorial
117
SVM for POS Tagging
POS tagging is a multiclass classification problem
Typical Feature Vector (for unknown words):
Surrounding context: words of both sides
Morphological info: pre- and suffixes, existence of
capitals, numerals, …
POS tags of preceding words (previous decisions)
Results on the Penn Treebank WSJ corpus :
TnT (second-order HMM): 96.62 % (F-1 measure)
SVM (1-vs-rest): 97.11%
Revision learning: 96.60%
ACL-04 Tutorial
118
SVM for Chunk Identification
Each word has to be tagged with a chunk label
(combination of IOB tag / chunk type), e.g. I-NP, B-NP, …
This can be seen as a classification problem with typically
20 categories (multiclass problem – solved by one-vs-rest or pairwise classification and max or majority voting: K×(K-1)/2 classifiers)
Typical feature vector: surrounding context (word and
POS-tag) and (estimated) chunk labels of previous words
Results on WSJ corpus (section 15-19 as training; section
20 as test):
SVM: 93.84%
Combination (weighted vote) of SVM with different input
representations and directions: 94.22%
ACL-04 Tutorial
119
Named Entity Recognition
Each word has to be tagged with a combination of entity
label (8 categories) and 4 sub-tags (B,I,E,S) E.g.
Company-B, Person-S, …
This can be seen as a classification problem with typically
33 categories (multiclass problem – solved by one-vs-rest
or pairwise classification and max or majority voting)
Typical feature vector: surrounding context (word and
POS-tag), character type (in Japanese)
Ensures consistency among word classes by a Viterbi
search (the SVM scores are transformed into probabilities)
Results on IREX data set (‘CRL’ as training; ‘General’ as
test):
RG+DT (rule-based): 86% F1
SVM: 90% F1
ACL-04 Tutorial
120
SVM for Word Sense Disambiguation
Can be considered as a classification task (choose
between some predefined senses)
NB. One (multiclass) classifier for each ambiguous word
Typical Feature Vector:
Surrounding context (words and POS tags)
Presence/absence of focus-specific keywords in a wider context
As usual in NLP problems, few training examples and
many features
On the 1998 Senseval competition Corpus, SVM has an
average rank of 2.3 (in competition with 8 other learning
algorithms on about 30 ambiguous words)
ACL-04 Tutorial
121
Recent Perspectives
Rational Kernels
Weighted Finite State Transducer representation (and
computation) of kernels
Can be applied to compute kernels on variable-length
sequences and on weighted automata
HDAG Kernels (presented tomorrow by Suzuki
and co-workers)
Kernels defined on Hierarchical Directed Acyclic Graph
(general structure encompassing numerous structures
found in NLP)
ACL-04 Tutorial
122
Rational Kernels (I)
Reminder:
FSA: accepts a set of strings x
A string can be directly represented by an automaton
WFST: associates to each pair of strings (x,y) a weight
[[T]](x,y) given by the « sum » over all « successful
paths » (accepting x as input, emitting y as output, starting
from an initial state and ending in a final state) of the
weights of the paths (the « product » of the transition
weights)
A string kernel is rational if there exists a WFST T
and a function ψ such that k(x,y) = ψ([[T]](x,y))
ACL-04 Tutorial
123
Rational Kernels (II)
A rational kernel will define a PD (valid) kernel iff T can be
decomposed as U∘U⁻¹ (U⁻¹ is obtained by swapping
input/output labels … some kind of transposition; « ∘ » is
the composition operator)
Indeed, by definition of composition, neglecting ψ,
K(x,y) = Σ_z [[U]](x,z)·[[U]](y,z) … corresponding to A·Aᵀ (sum
and product are defined over a semiring)
Kernel computation ψ([[T]](x,y)) involves:
transducer composition ((X∘U)∘(U⁻¹∘Y))
a shortest-distance algorithm to find ψ([[T]](x,y))
Using « failure functions », the complexity is O(|x|+|y|)
P-spectrum kernels and the gap-weighted string kernels
can be expressed as rational transducers (the elementary
« U » transducer is some « counting transducer »,
automatically outputting the weighted substrings of x)
ACL-04 Tutorial
124
HDAG Kernels
Convolution kernels applied to HDAGs, i.e. a mix of
trees and DAGs (nodes in a DAG can themselves
be a DAG) … but edges connecting « non-brother » nodes are not taken into account
Can handle hierarchical chunk structures,
dependency relations, attributes (labels such as
POS, type of chunk, type of entity, …) associated
with a node at whatever level
Presented in detail tomorrow
ACL-04 Tutorial
125
Conclusions
If you can only remember 2 principles after this session,
these should be:
Kernel methods are modular
The KERNEL, unique interface with your data, incorporating the
structure, the underlying semantics, and other prior knowledge, with a
strong emphasis on efficient computation while working with an
(implicit) very rich representation
The General learning algorithm with robustness properties based on
both the dual formulation and margin properties
Successes in NLP come from both origins
Kernel exploiting the particular structures of NLP entities
NLP tasks reformulated as (typically) classification or ranking tasks,
enabling general robust kernel-based learning algorithms to be used.
ACL-04 Tutorial
126
Bibliography (1)
Books on Kernel Methods (general):
J. Shawe-Taylor and N. Cristianini , Kernel Methods for Pattern
Analysis, Cambridge University Press, 2004
N. Cristianini and J. Shawe-Taylor, An Introduction to Support
Vector Machines, Cambridge University Press, 2000
B. Schölkopf, C. Burges, and A.J. Smola, Advances in Kernel
Methods – Support Vector Learning, MIT Press, 1999
B. Schölkopf and A.J. Smola, Learning with Kernels, MIT Press,
2001
V. N. Vapnik, Statistical Learning Theory, J. Wiley & Sons, 1998
Web Sites:
www.kernel-machines.org
www.support-vector.net
www.kernel-methods.net
www.euro-kermit.org
ACL-04 Tutorial
127
Bibliography (2)
General Principles governing kernel design
[Sha01,Tak03]
Kernels built from data [Cri02a,Kwo03]
Kernels for texts – Prior information encoding
BOW kernels and linguistic enhancements: [Can02,
Joa98, Joa99, Leo02]
Semantic Smoothing Kernels: [Can02, Sio00]
Latent Concept Kernels: [Can02, Cri02b, Kol00, Sch98]
Multilingual Kernels: [Can02, Dum97, Vin02]
Diffusion Kernels: [Kan02a, Kan02b, Kon02, Smo03]
ACL-04 Tutorial
128
Bibliography (3)
Kernels for Text – Convolution Kernels [Hau99, Wat99]
String and sequence kernels: [Can03, Les02, Les03a,
Les03b, Lod01, Lod02, Vis02]
Tree Kernels: [Col01, Col02]
Graph Kernels: [Gar02a, Gar02b, Kas02a, Kas02b, Kas03,
Ram03, Suz03a]
Kernels defined on other NLP structures: [Col01, Col02,
Cor03, Gar02a, Gar03, Kas02b, Suz03b]
Kernels for Text – Generative-model Based
Fisher Kernels: [Hof01, Jaa99a, Jaa99b, Sau02, Sio02]
Other approaches: [Alt03, Tsu02a, Tsu02b]
ACL-04 Tutorial
129
Bibliography (4)
Particular Applications in NLP
Categorisation:[Dru99, Joa01, Man01, Tak01,
Ton01,Yan99]
WSD: [Zav00]
Chunking: [Kud00, Kud01]
POS-tagging: [Nak01, Nak02, Zav00]
Entity Extraction: [Iso02]
Relation Extraction: [Cum03, Zel02]
[Alt03] Y. Altsun, I. Tsochantaridis and T. Hofmann. Hidden Markov Support Vector Machines. ICML 2003.
[Can02] N. Cancedda et al., Cross-Language and Semantic Information Analysis and its Impact on Kernel Design,
Deliverable D3 of the KERMIT Project, February 2002.
[Can03] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-sequence kernels. Journal of Machine
Learning Research 3:1059-1082, 2003.
[Col01] M. Collins and N. Duffy. Convolution kernels for natural languages. NIPS’2001.
[Col02] M. Collins and N. Duffy, Convolution Kernels for Natural Language Processing. In Advances in Neural
Information Processing Systems, 14, 2002
[Cor03] C. Cortes, P. Haffner and M. Mohri. Positive Definite Rational Kernels. COLT 2003
[Cri02a] N. Cristianini, A. Eliseef, J. Shawe-Taylor and J. Kandola, On Kernel-Target Alignment, In Advances in
Neural Information Processing Systems 14, MIT Press, 2002
[Cri02b] N. Cristianini, J. Shawe-Taylor and Huma Lodhi, Latent Semantic Kernels. Journal of Intelligent
Information Systems, 18 (2-3):127-152, 2002
[Cum03] C. Cumby and D. Roth. On kernel methods for relational learning. ICML’2003.
[Dru99] H. Drucker, D. Wu and V. Vapnik, Support Vector Machines for Spam Categorization. IEEE Transactions
on Neural Networks 10 (5), 1048-1054, 1999
[Dum97] S. T. Dumais, T.A. Letsche, M.L. Littmann and T.K. Landauer, Automatic Cross-Language Retrieval
Using Latent Semantic Indexing. In AAAI Spring Symposium on Cross-Language ext and Speech Retrieval, 1997
[Gar02a] T. Gartner, J. Lloyd and P. Flach. Kernels for Structured Data. Proc. Of 12th Conf. On Inductive Logic
Programming, 2002.
[Gar02b] T. Gartner, Exponential and Geometric kernels for graphs. NIPS,- Workshop on unreal data – 2002
[Gar03] T. Gartner. A survey of kernels for structured data. SIGKDD explorations 2003.
[Hau99] D. Haussler, Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University
of California in Santa Cruz, Computer Science Department, July 1999
[Hof01] T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42 (1-2):177-196, 2001
[Iso02] H. Isozaki and H. Kazawa. Efficient Support Vector Classifiers for Named Entity Recognition. COLING
2002
[Jaa99a] T. S. Jaakkola and D. Haussler, Probabilistic Kernel Regression Models. Proceedings on the Conference
on AI and Statistics, 1999
[Jaa99b] T.S. Jaakkola and D. Haussler, Exploiting Generative Models in Discriminative Classifiers. In Advances in
Neural Information Processing Systems 11, 487:493, MIT Press, 1999
[Joa01] T. Joachims, N. Cristianini and J. Shawe-Taylor, Composite Kernels for Hypertext Categorization.
Proceedings 18th International Conference on Machine Learning (ICML-01), Morgan Kaufmann Publishers, 2001
[Joa98] T. Joachims, Text Categorization with Support Vector Machines : Learning with many Relevant Features.
In Proceedings of the European Conference on Machine Learning, Berlin, 137-142, Springer Verlag, 1998
[Joa99] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings
of the International Conference on Machine Learning, 1999
[Kan02a] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. NIPS’2002.
[Kan02b] J. Kandola, J. Shawe-Taylor, and N. Cristianini. On the Applications of Diffusion Kernels to Text Data.
NeuroCOLT’2002.
[Kas02a] H. Kashima and A. Inokuchi. Kernels for Graph Classification. ICDM Workshop on Active Mining 2002.
[Kas02b] H. Kashima and T. Koyanagi. Kernels for Semi-Structured Data. ICML 2002.
[Kas03] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. ICML’2003.
[Kol00] T. Kolenda, L.K. Hansen and S. Sigurdsson, Independent Components in Text. In Advances in
Independent Component Analysis (M. Girolami Editor), Springer Verlag, 2000
[Kon02] R.I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. ICML’2002.
[Kud00] T. Kudo and Y. Matsumoto, Use of Support Vector Learning for Chunk Identification. Proceedings of
CoNNL-2000 and LLL-2000, Lisbon, Portugal, 2000
[Kud01] T. Kudo and Y. Matsumoto. Chunking with Support Vector Machines. NAACL 2001
[Kud03] T. Kudo and Y. Matsumoto. Fast Methods for Kernel-based Text Analysis. ACL 2003
[Kwo03] J. Kwok and I. Tsang. Learning with Idealized Kernels. ICML 2003
[Leo02] E. Leopold and J. Kindermann, Text Categorization with Support Vector Machines : how to represent
Texts in Input Space?, Machine Learning 46, 423-444, 2002
[Les02] C. Leslie, E. Eskin and W. Noble. The Spectrum Kernel : a string kernel for SVM Protein Classification.
Proc. Of the Pacific Symposium on Biocomputing. 2002
[Les03a] C. Leslie and R. Kuang. Fast Kernels for Inexact String Matching. COLT 2003.
[Les03b] C. Leslie, E. Eskin, J. Weston and W. Noble. Mismatch String Kernels for SVM Protein Classification.
NIPS 2002.
[Lod01] H. Lodhi, N. Cristianini, J. Shawe-Taylor and C. Watkins, Text Classification using String Kernel. In
Advances in Neural Information Processing Systems 13, MIT Press, 2001
[Lod02] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini and C. Watkins, Text Classification using String
Kernels. Journal of Machine Learning Research 2, 419-444, 2002
[Man01] L. Manevitz and M. Yousef, One-Class SVMs for Document Classification. Journal of Machine Learning
Research 2, December 2001
[Nak01] T. Nakagawa, T. Kudoh and Y. Matsumoto, Unknown Word Guessing and Part-of-Speech Tagging Using
Support Vector Machines. Proceedings of the 6th Natural Language Processing Pacific Rim Symposium
(NLPRS2001), 2001
[Nak02] T. Nakagawa, T. Kudo and Y. Matsumoto. Revision Learning and its Application to Part-of-Speech
Tagging. ACL 2002.
[Ram03] J. Ramon and T. Gartner. Expressivity vs Efficiency of Graph Kernels. MGTS 2003
[Sau02] C. Saunders, J. Shawe-Taylor, and A. Vinokourov. String kernels, Fisher kernels and finite state
automata.NIPS’2002.
[Sch98] B. Schölkopf, A.J. Smola and K. Müller, Kernel Principal Component Analysis. In Advances in Kernel
Methods – Support Vector Learning, MIT Press, 1998
[Sha01] J. Shawe-Taylor et al., Report on Techniques for Kernels, Deliverable 2 of the KERMIT Project, August
2001.
[Sio00] G. Siolas and F. d’Alche Buc, Support Vector Machines based on a Semantic Kernel for Text
Categorization. In Proceedings of the International Joint Conference on Neural Networks 2000, Vol.5, 205-209,
IEEE Press, 2000
[Sio02] G. Siolas and F. d’Alche-Buc. Mixtures of probabilistic PCAs and Fisher kernels for word and document
Modeling. ICANN’2002.
[Smo03] A. Smola and R. Kondor. Kernels and Regularization on Graphs. COLT 2003
[Suz03a] J. Suzuki, T. Hirao, Y. Saski and E. Maeda. Hierarchical Directed Acyclic Graph Kernel: Methods for
Structured Natural Language Data. ACL 2003
[Suz03b] J. Suzuki, Y. Sasaki, and E. Maeda. Kernels for structured natural language data. NIPS’2003.
[Tak01] H. Takamura and Y. Matsumoto, Feature Space Restructuring for SVMs with Application to Text
Categorization. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2001
[Tak03] E. Takimoto and M. Warmut. Path Kernels and Multiplicative Updates. Journal of Machine Learning
Research 4, pp. 773-818. 2003
[Ton01] S. Tong and D. Koller, Support Vector Machine Active Learning with Applications to Text Classification.
Journal of Machine Learning Research 2, December 2001
[Tsu02a] K. Tsuda, M. Kawanabe, G. Ratsch, S. Sonnenburg, and K.-R. Muller. A new discriminative kernel from
probabilistic models. Neural Computation 14:2397-2414, 2002.
[Tsu02b] K. Tsuda, T. Kin and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics, 1 (1), pp. 18, 2002
[Vin02] A. Vinokourov, J. Shawe-Taylor and N. Cristianini, Finding Language-Independent Semantic
Representation of Text using Kernel Canonical Correlation Analysis. NeuroCOLT Technical Report, NC-TR-02-119, 2002
[Vis02] S. Vishwanathan and A. Smola. Fast Kernels for String and Tree Matching. NIPS 2002.
[Wat99] C. Watkins, Dynamic Alignment Kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of
London, Computer Science Department, January 1999
[Yan99] Y. Yang and X. Liu, A Re-examination of Text Categorization Methods. Proceedings of ACM SIGIR
Conference on Research and Development in Information Retrieval, 1999
[Zav00] J. Zavrel, S Degroeve, A. Kool, W. Daelemans and K Jokinen, Diverse Classifiers for NLP Disambiguation
Tasks: Comparison, Optimization, Combination and Evolution. Proceedings of the 2nd CevoLE Workshop, 2000
[Zel02] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine
Learning Research 3