
Machine Learning in Natural Language

1. No lecture on Thursday.
2. Instead: Monday, 4pm, 1404SC
   • Mark Johnson lectures on: Bayesian Models of Language Acquisition
Machine Learning in Natural Language
Features and Kernels

1. The idea of kernels
   • Kernel Perceptron
2. Structured kernels
   • Tree and graph kernels
3. Lessons
   • Multi-class classification
Embedding

Embedding can be done explicitly (generate expressive features) or implicitly (use kernels).

[Figure: a problem that is not linearly separable (e.g., Whether vs. Weather disambiguation) is embedded in a higher-dimensional space where the new discriminator is functionally simpler; monomials such as x1x2x3, x1x4x3, x3x2x5 become new coordinates y1, y4, y5.]
Kernel Based Methods

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$

• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
• Computing the weight vector is done in the original space.
• Notice: this pertains only to efficiency.
• Generalization is still relative to the real dimensionality.
• This is the main trick in SVMs (the algorithm is different), although many applications actually use linear kernels.
Kernel Based Methods

Examples: $x \in \{0,1\}^n$;  Hypothesis: $w \in \mathbb{R}^n$
$f(x) = \mathrm{Th}_\theta\Big( \sum_{i=1}^{n} w_i x_i \Big)$

If Class = 1 but $w \cdot x \le \theta$: $w_i \leftarrow w_i + 1$ (if $x_i = 1$) (promotion)
If Class = 0 but $w \cdot x \ge \theta$: $w_i \leftarrow w_i - 1$ (if $x_i = 1$) (demotion)

• Let I be the set $t_1, t_2, t_3, \ldots$ of monomials (conjunctions) over the feature space $x_1, x_2, \ldots, x_n$.
• Then we can write a linear function over this new feature space:
$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$
• Example: $x_1 x_2 x_4(11010) = 1$,  $x_3 x_4(11010) = 0$ (see the sketch below).
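To make the monomial features concrete, here is a minimal Python sketch (my own illustration, not from the slides) that evaluates the two conjunctions from the example above on x = 11010:

```python
# Evaluate monomial (conjunction) features on a Boolean example.
def monomial(indices):
    # t(x) = 1 iff every listed variable (1-indexed, as on the slide) is 1 in x
    return lambda x: int(all(x[i - 1] for i in indices))

x = (1, 1, 0, 1, 0)              # the example 11010
t_124 = monomial([1, 2, 4])      # x1 x2 x4
t_34 = monomial([3, 4])          # x3 x4
print(t_124(x), t_34(x))         # -> 1 0
```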
Kernel Based Methods

Examples: $x \in \{0,1\}^n$;  Hypothesis: $w \in \mathbb{R}^n$
$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$

If Class = 1 but $w \cdot x \le \theta$: $w_i \leftarrow w_i + 1$ (if $x_i = 1$) (promotion)
If Class = 0 but $w \cdot x \ge \theta$: $w_i \leftarrow w_i - 1$ (if $x_i = 1$) (demotion)

• Great increase in expressivity.
• Can run Perceptron, Winnow, or logistic regression, but the convergence bound may suffer exponential growth.
• An exponential number of monomials are true in each example.
• Also, we will have to keep many weights.
The Kernel Trick (1)

Examples: $x \in \{0,1\}^n$;  Hypothesis: $w \in \mathbb{R}^n$
$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$

If Class = 1 but $w \cdot x \le \theta$: $w_i \leftarrow w_i + 1$ (if $x_i = 1$) (promotion)
If Class = 0 but $w \cdot x \ge \theta$: $w_i \leftarrow w_i - 1$ (if $x_i = 1$) (demotion)

• Consider the value of $w_i$ used in the prediction.
• Each previous mistake, on example z, makes an additive contribution of +/−1 to $w_i$, iff $t_i(z) = 1$.
• The value of $w_i$ is therefore determined by the number of mistakes on which $t_i(\cdot)$ was satisfied.
The Kernel Trick (2)

Examples: $x \in \{0,1\}^n$;  Hypothesis: $w \in \mathbb{R}^n$
$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$

If Class = 1 but $w \cdot x \le \theta$: $w_i \leftarrow w_i + 1$ (if $x_i = 1$) (promotion)
If Class = 0 but $w \cdot x \ge \theta$: $w_i \leftarrow w_i - 1$ (if $x_i = 1$) (demotion)

• P – set of examples on which we Promoted
• D – set of examples on which we Demoted
• M = P ∪ D

$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} \Big[ \sum_{z \in P,\ t_i(z)=1} 1 \;-\; \sum_{z \in D,\ t_i(z)=1} 1 \Big]\, t_i(x) \Big) = \mathrm{Th}_\theta\Big( \sum_{i \in I} \sum_{z \in M} S(z)\, t_i(z)\, t_i(x) \Big)$
The Kernel Trick (3)

$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$

• P – set of examples on which we Promoted
• D – set of examples on which we Demoted
• M = P ∪ D

$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} \Big[ \sum_{z \in P,\ t_i(z)=1} 1 \;-\; \sum_{z \in D,\ t_i(z)=1} 1 \Big]\, t_i(x) \Big) = \mathrm{Th}_\theta\Big( \sum_{i \in I} \sum_{z \in M} S(z)\, t_i(z)\, t_i(x) \Big)$

• Where S(z) = 1 if z ∈ P and S(z) = −1 if z ∈ D.

Reordering:
$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z) \sum_{i \in I} t_i(z)\, t_i(x) \Big)$
The Kernel Trick (4)

$f(x) = \mathrm{Th}_\theta\Big( \sum_{i \in I} w_i\, t_i(x) \Big)$

• S(y) = 1 if y ∈ P and S(y) = −1 if y ∈ D.

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z) \sum_{i \in I} t_i(z)\, t_i(x) \Big)$

• A mistake on z contributes the value +/−1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that satisfy both x and z (see the numeric check below).
• Define a dot product in the t-space: $K(x, z) = \sum_{i \in I} t_i(z)\, t_i(x)$
• We get the standard notation: $f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$
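A toy numeric check of this identity (my own sketch, assuming the space of all positive conjunctions over n = 3 Boolean variables and a hypothetical mistake set): the primal sum over monomial weights equals the dual sum over stored mistakes.

```python
from itertools import combinations

n = 3
monomials = [c for r in range(n + 1) for c in combinations(range(n), r)]

def t(x):
    # t_i(x) = 1 iff every variable in the monomial is 1 in x
    return [int(all(x[j] for j in m)) for m in monomials]

def K(x, z):
    return sum(ti * zi for ti, zi in zip(t(x), t(z)))

# Pretend we promoted on (1,1,0) and demoted on (0,1,1):
M = [((1, 1, 0), +1), ((0, 1, 1), -1)]

# Primal: accumulate the weight vector over the monomial space.
w = [0] * len(monomials)
for z, s in M:
    w = [wi + s * zi for wi, zi in zip(w, t(z))]

x = (1, 1, 1)
primal = sum(wi * ti for wi, ti in zip(w, t(x)))
dual = sum(s * K(x, z) for z, s in M)
assert primal == dual      # both forms of f(x)'s argument agree
```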
Kernel Based Methods

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$
$K(x, z) = \sum_{i \in I} t_i(z)\, t_i(x)$

• What does this representation give us?
• We can view this kernel as the distance between x and z measured in the t-space.
• But K(x, z) can be computed in the original space, without explicitly writing the t-representation of x and z.
Kernel Based Methods

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$
$K(x, z) = \sum_{i \in I} t_i(z)\, t_i(x)$

• Consider the space of all $3^n$ monomials (allowing both positive and negative literals). Then, if same(x, z) is the number of features that have the same value for both x and z, we get:
$K(x, z) = 2^{\mathrm{same}(x, z)}$
and therefore
$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, 2^{\mathrm{same}(x, z)} \Big)$
• Example: take n = 2; x = (00), z = (01), … (see the check below).
• Proof: let k = same(x, z); for each of these k features, choose to (1) include the literal with the right polarity in the monomial, or (2) not include it at all.
• Other kernels can be used.
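A brute-force check of $K(x, z) = 2^{\mathrm{same}(x, z)}$ (my own sketch, assuming the $3^n$ monomial space described above, where each variable appears positively, negatively, or not at all):

```python
from itertools import product

def explicit_K(x, z):
    n = len(x)
    total = 0
    # each monomial assigns every variable one of: ignore, require 1, require 0
    for mono in product((None, 1, 0), repeat=n):
        t_x = all(v is None or xi == v for v, xi in zip(mono, x))
        t_z = all(v is None or zi == v for v, zi in zip(mono, z))
        total += t_x and t_z
    return total

def same(x, z):
    return sum(xi == zi for xi, zi in zip(x, z))

x, z = (0, 0), (0, 1)                        # the n = 2 example above
print(explicit_K(x, z), 2 ** same(x, z))     # -> 2 2
```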
Implementation

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$
$K(x, z) = \sum_{i \in I} t_i(z)\, t_i(x)$

• Simply run Perceptron in an on-line mode, but keep track of the set M.
• Keeping the set M allows us to keep track of S(z).
• Rather than remembering the weight vector w, remember the set M (P and D) – all those examples on which we made mistakes.
• This is the Dual Representation (sketched below).
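A minimal sketch of this dual (kernel) Perceptron, assuming Boolean examples, a zero threshold, and the $2^{\mathrm{same}(x,z)}$ kernel from the previous slides; the toy data at the end is hypothetical.

```python
def K(x, z):
    return 2 ** sum(xi == zi for xi, zi in zip(x, z))

def kernel_perceptron(stream, theta=0.0):
    M = []                                   # list of (example, S(z)) pairs
    for x, label in stream:                  # label in {0, 1}
        score = sum(s * K(x, z) for z, s in M)
        pred = int(score > theta)            # Th_theta(...)
        if pred != label:                    # mistake: promote or demote
            M.append((x, +1 if label == 1 else -1))
    return M

# Toy usage on a few Boolean examples (hypothetical data):
data = [((1, 0, 1), 1), ((0, 1, 0), 0), ((1, 1, 1), 1), ((0, 0, 0), 0)]
M = kernel_perceptron(data)
print(len(M), "mistakes stored")
```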
Summary – Kernel Based Methods I

$f(x) = \mathrm{Th}_\theta\Big( \sum_{z \in M} S(z)\, K(x, z) \Big)$

• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
• Computing the weight vector can still be done in the original feature space.
• Notice: this pertains only to efficiency: the classifier is identical to the one you get by blowing up the feature space.
• Generalization is still relative to the real dimensionality.
• This is the main trick in SVMs (the algorithm is different), although most applications actually use linear kernels.
Summary – Kernel Trick

• Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature-based representation of examples.
• We want to define a dot product in a high-dimensional space.
• Given two examples $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, we want to map them to a high-dimensional space [example: quadratic]:
$\Phi(x_1, \ldots, x_n) = (x_1, \ldots, x_n,\ x_1^2, \ldots, x_n^2,\ x_1 x_2, \ldots, x_{n-1} x_n)$
$\Phi(y_1, \ldots, y_n) = (y_1, \ldots, y_n,\ y_1^2, \ldots, y_n^2,\ y_1 y_2, \ldots, y_{n-1} y_n)$
and compute the dot product $A = \Phi(x) \cdot \Phi(y)$ [takes time $O(n^2)$].
• Instead, in the original space, compute
$B = f(x \cdot y) = [\,1 + (x_1, \ldots, x_n) \cdot (y_1, \ldots, y_n)\,]^2$
[takes time $O(n)$].
• Theorem: A = B.
• Coefficients do not really matter; this can be done for other functions as well (see the numeric check below).
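A small numeric check of the A = B claim (my own sketch, on random hypothetical vectors). Since coefficients do not matter, I use the standard $\sqrt{2}$ weights on the linear terms and all ordered pairwise products, so that the equality is exact:

```python
import math, random

def phi(v):
    n = len(v)
    out = [1.0]
    out += [math.sqrt(2) * vi for vi in v]                    # linear terms
    out += [v[i] * v[j] for i in range(n) for j in range(n)]  # all pairwise products
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x = [random.random() for _ in range(5)]
y = [random.random() for _ in range(5)]
A = dot(phi(x), phi(y))            # O(n^2) features, explicit embedding
B = (1 + dot(x, y)) ** 2           # O(n) work in the original space
print(abs(A - B) < 1e-9)           # -> True
```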
Efficiency-Generalization Tradeoff

• There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier.
• For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions.
• In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown-up space is more efficient than using kernels.

Next: More Complicated Kernels
Structured Input

• Example text: "… afternoon, Dr. Ab C … in Ms. De. F class …"
• Shallow parse: [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?)
• S = John will join the board as a director

[Figure G1: predicate-argument graph for S, with "join" linked to "will", "John", "board", "as", "director", and "the".]
[Figure G2: a node annotated with attributes Word=, POS=, IS-A=, …]
Learning From Structured Input

• We want to extract features from structured domain elements
• A feature is a mapping from the instance space to {0,1} or [0,1]
  • With an appropriate representation language it is possible to represent expressive features that constitute an infinite-dimensional space [FEX]
  • Learning can be done in the infinite attribute domain.
• What does it mean to extract features?
  • The elements' internal (hierarchical) structure should be encoded.
  • Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements)
  • Computationally: some kind of graph matching process
• Challenge:
  • Provide the expressivity necessary to deal with large-scale and highly structured domains
  • Meet the strong tractability requirements for these tasks.
Example

• Only those descriptions that are ACTIVE in the input are listed
  D = (AND word (before tag))
• Michael Collins developed kernels over parse trees.
• Cumby/Roth developed parameterized kernels over structures.
• When is it better to use a kernel vs. using the primal (explicit feature) representation?
Overview – Goals (Cumby & Roth 2003)

• Applying kernel learning methods to structured domains.
  • Develop a unified formalism for structured kernels. (Collins & Duffy, Gaertner & Lloyd, Haussler)
  • A flexible language that measures the distance between structures with respect to a given 'substructure'.
• Examine complexity & generalization across different feature sets and learners.
  • When does each type of feature set perform better, and with what learners?
• Exemplify with experiments from bioinformatics & NLP.
  • Mutagenesis, Named-Entity prediction.
Feature Description Logic

• A flexible knowledge representation for feature extraction from structured data
  • Domain elements are represented as labeled graphs
• FDL is formed from an alphabet of attribute, value, and role symbols.
• Well-defined syntax and equivalent semantics
  • Concept graphs that correspond to FDL expressions.
• E.g., descriptions are defined inductively with sensors as primitives
  • Sensor: a basic description – a term of the form a(v), or a
    • a = attribute symbol, v = value symbol (ground sensor).
    • An existential sensor a describes an object that has some value for attribute a.
  • AND clauses, (role D) clauses for relations between objects, …
• Expressive and efficient feature extraction.
Example (Cont.)

• Features; feature generation functions; extensions; subsumption… (see paper)
• Basically: only those descriptions that are ACTIVE in the input are listed
  D = (AND word (before tag))
  {Dθ} = {(AND word(the) (before tag(N))), (AND word(dog) (before tag(V))),
          (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))}
• The language is expressive enough to generate linguistically interesting features such as agreements, etc. (A small generation sketch follows below.)
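A sketch (my own illustration) of generating the active {Dθ} features listed above for D = (AND word (before tag)). The full tag set is a hypothetical choice; the slide only fixes the N, V, ADV, ADJ tags of the following words.

```python
def active_features(tokens):
    # one feature per position: the word here AND the tag of the next token
    feats = []
    for (w, _), (_, next_tag) in zip(tokens, tokens[1:]):
        feats.append(f"(AND word({w}) (before tag({next_tag})))")
    return feats

sentence = [("the", "DET"), ("dog", "N"), ("ran", "V"), ("very", "ADV"), ("fast", "ADJ")]
for f in active_features(sentence):
    print(f)
# (AND word(the) (before tag(N)))
# (AND word(dog) (before tag(V)))
# (AND word(ran) (before tag(ADV)))
# (AND word(very) (before tag(ADJ)))
```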
Kernels

• It is possible to define FDL-based kernels for structured data.
• When using linear classifiers it is important to enhance the set of features to gain expressivity.
• A common way: blow up the feature space by generating functions of primitive features.
• For some algorithms – SVM, Perceptron – kernel functions can be used to expand the feature space while still working in the original space.
• Is it worth doing in structured domains?
• Answers are not clear so far
  – Computationally: yes, when we simulate a huge space
  – Generalization: not always
  [Khardon, Roth, Servedio, NIPS'01; Ben-David et al.]
Kernels in Structured Domains

• We define a kernel family K parameterized by FDL descriptions:
$k_D(G_1, G_2) = \sum_{n_1 \in N_1^{G_1}} \sum_{n_2 \in N_2^{G_2}} k_D(n_1, n_2)$
• The definition is recursive on the definition of D
  [sensor, existential sensor; role description; AND]
• Key: many previous structured kernels considered all substructures (e.g., Collins & Duffy '02, tree kernels);
  • Analogous to an exponential feature space; overfitting.
  • Generalization issues & computation issues [if # of examples is large]
• If the feature space is explicitly expanded – can use algorithms such as Winnow (SNoW); [complexity and experimental results]
FDL Kernel Definition

• Kernel family K parameterized by feature type descriptions. For a description D:
$k_D(G_1, G_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} k_D(n_1, n_2)$

• If D is a ground sensor s(v), and s(v) is a label of both $n_1$ and $n_2$, then
$k_D(n_1, n_2) = 1$
• If D is an existential sensor s, and sensor descriptions $s(v_1), s(v_2), \ldots, s(v_j)$ are labels of both $n_1$ and $n_2$, then
$k_D(n_1, n_2) = j$
• If D is a role description (r D'), then
$k_D(n_1, n_2) = \sum_{n_1'} \sum_{n_2'} k_{D'}(n_1', n_2')$
with $n_1', n_2'$ ranging over those nodes that have an r-labeled edge from $n_1, n_2$.
• If D is a description (AND D_1 D_2 … D_n) with $l_i$ repetitions of any $D_i$, then
$k_D(n_1, n_2) = \prod_{i=1}^{n} \binom{k_{D_i}(n_1, n_2)}{l_i}$
Kernel Example

• D = (AND word (before word))
• G1: The dog ran very fast
• G2: The dog ran quickly

$k_D(G_1, G_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} k_D(n_1, n_2)$

$k_D(n^1_{the}, n^2_{the}) = k_{word}(n^1_{the}, n^2_{the}) \cdot k_{(before\ word)}(n^1_{the}, n^2_{the}) = 1 \cdot k_{word}(n^1_{dog}, n^2_{dog}) = 1 \cdot 1 = 1$

• Etc.; the final output is 2, since there are 2 matching collocations (see the sketch below).
• Can simulate Boolean kernels as seen in Khardon, Roth et al.
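To make the recursion concrete, here is a Python sketch of the FDL kernel as reconstructed on the previous slide. The graph encoding and description syntax are my own simplifying assumptions (a graph is a dict of node id to a label set plus role-edge dict; a description is a nested tuple); on the two sentences above, encoded as chain graphs with "before" edges, it reproduces the value 2 from the worked example.

```python
from math import comb

def k_nodes(D, g1, n1, g2, n2):
    labels1, edges1 = g1[n1]
    labels2, edges2 = g2[n2]
    kind = D[0]
    if kind == "sensor":                 # ground sensor s(v): 1 iff both nodes carry it
        return int(D[1] in labels1 and D[1] in labels2)
    if kind == "exists":                 # existential sensor s: number of shared values
        a = D[1] + "("
        return len({l for l in labels1 if l.startswith(a)} &
                   {l for l in labels2 if l.startswith(a)})
    if kind == "role":                   # (r D'): sum over pairs of r-successors
        _, r, Dp = D
        return sum(k_nodes(Dp, g1, m1, g2, m2)
                   for m1 in edges1.get(r, []) for m2 in edges2.get(r, []))
    if kind == "and":                    # (AND D1 ... Dn): product of binomials
        val = 1
        for Di in set(D[1:]):
            val *= comb(k_nodes(Di, g1, n1, g2, n2), D[1:].count(Di))
        return val
    raise ValueError(kind)

def k_graphs(D, g1, g2):
    # graph-level kernel: sum the node kernel over all node pairs
    return sum(k_nodes(D, g1, n1, g2, n2) for n1 in g1 for n2 in g2)

def sentence_graph(words):
    # chain graph: node i is labeled word(w_i) and has a "before" edge to node i+1
    return {i: ({f"word({w})"}, {"before": [i + 1]} if i + 1 < len(words) else {})
            for i, w in enumerate(words)}

D = ("and", ("exists", "word"), ("role", "before", ("exists", "word")))
g1 = sentence_graph("The dog ran very fast".split())
g2 = sentence_graph("The dog ran quickly".split())
print(k_graphs(D, g1, g2))               # -> 2, as in the worked example above
```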
Complexity & Generalization

• How does this compare in complexity and generalization to other kernels for structured data?
  • For m examples, with average example size g and time $t_1$ to evaluate the kernel, kernel Perceptron takes $O(m^2 g^2 t_1)$.
  • If extracting a feature explicitly takes $t_2$, Perceptron takes $O(m g t_2)$.
  • Most kernels that simulate a well-defined feature space have $t_1 \ll t_2$.
• By restricting the size of the expanded feature space we avoid overfitting – even SVM suffers under many irrelevant features (Weston).
• Margin argument: the margin goes down when you have more features (see the illustration below).
  • Given a linearly separable set of points $S = \{x_1, \ldots, x_m\} \subseteq \mathbb{R}^n$ with separator $w \in \mathbb{R}^n$,
  • embed S into an $n' > n$ dimensional space by adding zero-mean random noise e to the additional $n' - n$ dimensions, s.t. $w' = (w, 0) \in \mathbb{R}^{n'}$ still separates S.
  • The dot product is unchanged, $w'^{T} x_i' = (w, 0)^{T}(x_i, e) = w^{T} x_i$, but the norm grows, $\|x_i'\| = \|(x_i, 0) + (0, e)\| \ge \|x_i\|$, so the margin goes down.
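A small numeric illustration of this margin argument (my own sketch on hypothetical random data): padding examples with zero-mean noise dimensions leaves the dot products with $w' = (w, 0)$ unchanged while the example norms grow, so the normalized margin shrinks.

```python
import math, random

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def margin(w, X):
    # smallest |w . x| / (||w|| ||x||) over the sample
    return min(abs(sum(wi * xi for wi, xi in zip(w, x))) / (norm(w) * norm(x)) for x in X)

n, extra = 5, 50
w = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(20)]

w_pad = w + [0.0] * extra                                        # w' = (w, 0)
X_pad = [x + [random.gauss(0, 1) for _ in range(extra)] for x in X]

print(margin(w, X))          # original margin
print(margin(w_pad, X_pad))  # smaller: same dot products, larger example norms
```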
Experiments

• Serve as a comparison – our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features.
• Bioinformatics experiment in mutagenesis prediction:
  • 188 compounds with atom-bond data, binary prediction.
  • 10-fold cross-validation with 12 training runs.
• NLP experiment in classifying detected NEs:
  • 4700 training and 1500 test phrases from MUC-7
  • person, location, & organization
• Trained and tested with kernel Perceptron and Winnow (SNoW) classifiers with the FDL kernel & respective features. Also an all-subtrees kernel based on Collins & Duffy's work.

[Figure: Mutagenesis concept graph]
[Figure: Features simulated with the all-subtrees kernel]
Discussion

• We have a kernel that simulates the features obtained with FDL
  • But quadratic training time means it is cheaper to extract and learn explicitly than to use kernel Perceptron
  • SVM could take (slightly) even longer, but may perform better
  • (Results compared by microaveraged accuracy.)
• But restricted features might work better than larger spaces simulated by other kernels.
• Can we improve on the benefits of useful features?
  • Compile examples together?
  • More sophisticated kernels than the matching kernel?
• Still provides a metric for similarity-based approaches.
Conclusion

• Kernels for learning from structured data are an interesting idea
  • Different kernels may expand/restrict the hypothesis space in useful ways.
  • Need to know the benefits and hazards:
    • To justify these methods we must embed in a space much larger than the training set size.
    • Can decrease the margin.
• Expressive knowledge representations can be used to create features explicitly or in implicit kernel spaces.
  • The data representation could allow us to plug in different base kernels to replace the matching kernel.
  • A parameterized kernel allows us to direct the way the feature space is blown up, to encode background knowledge.