
New Theoretical Frameworks for
Machine Learning
Maria-Florina Balcan
Thesis Proposal
05/15/2007
Thanks to My Committee
Avrim Blum
Yishay Mansour
Manuel Blum
Tom Mitchell
Santosh Vempala
The Goal of the Thesis
New Theoretical Frameworks for Modern Machine
Learning Paradigms
Connections between Machine Learning Theory and
Algorithmic Game Theory
New Frameworks for Modern Learning Paradigms
Modern Learning Paradigms:
• Incorporating Unlabeled Data in the Learning Process (semi-supervised learning, active learning): a unified theoretical treatment is lacking.
  Our Contributions: a unified PAC framework for semi-supervised learning; new positive theoretical results for active learning.
• Kernel, Similarity based Learning and Clustering: a qualitative gap between theory and practice.
  Our Contributions: a theory of learning with general similarity functions; extensions to clustering (with Avrim and Santosh).
Machine Learning Theory and Algorithmic Game Theory
Brief Overview of Our Results
Mechanism Design, ML, and Pricing Problems
Generic Framework for reducing problems of incentive-compatible
mechanism design to standard algorithmic questions.
[Balcan-Blum-Hartline-Mansour, FOCS 2005, JCSS 2007]
Approximation Algorithms for Item Pricing. [Balcan-Blum, EC 2006]
• Revenue maximization in combinatorial auctions with single-minded consumers.
The Goal of the Thesis
New Theoretical Frameworks for Modern Machine Learning
Paradigms
• Semi-Supervised and Active Learning
• Similarity Based Learning and Clustering
Connections between Machine Learning Theory and
Algorithmic Game Theory
• Use MLT techniques for designing and analyzing auctions
in the context of Revenue Maximization
The Goal of the Thesis
New Theoretical Frameworks for Modern Machine Learning
Paradigms
Incorporating Unlabeled
Data in the Learning Process
Semi-supervised learning (SSL)
- An Augmented PAC model for SSL
[Balcan-Blum, COLT 2005; book chapter,
“Semi-Supervised Learning”, 2006]
Active Learning (AL)
- Generic agnostic AL procedure
[Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators
[Balcan-Broder-Zhang, COLT 2007]
Kernel, Similarity based
learning and Clustering
- Connections between kernels,
margins and feature selection
[Balcan-Blum-Vempala, MLJ 2006]
- A general theory of learning
with similarity functions
[Balcan-Blum, ICML 2006]
- Extensions to Clustering
[Balcan-Blum-Vempala, work in progress]
Part I, Incorporating Unlabeled Data in
the Learning Process
Semi-Supervised Learning
A unified PAC-style framework
[Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Standard Supervised Learning Setting
• X - instance/feature space.
• S = {(x, l)} - set of labeled examples:
  - labeled examples are assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c* ∈ C,
  - labels ∈ {-1, 1} (binary classification).
• Want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
  err(h) = Pr_{x∼D}[h(x) ≠ c*(x)].
• Classic models for learning from labeled data:
  - Statistical Learning Theory (Vapnik)
  - PAC (Valiant)
Standard Supervised Learning Setting
Sample Complexity
• E.g., Finite Hypothesis Spaces, Realizable Case
• In PAC, can also talk about efficient algorithms.
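As a hedged reconstruction of the standard bound this bullet refers to (finite hypothesis space C, realizable case):

\[
m \;\ge\; \frac{1}{\epsilon}\left(\ln |C| + \ln \frac{1}{\delta}\right)
\]

labeled examples suffice so that, with probability at least 1-δ, every h ∈ C consistent with the sample has err(h) ≤ ε.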
Semi-Supervised Learning
Hot topic in recent years in Machine Learning.
• Several methods have been developed to try to use
unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims '98]
• Co-training [Blum & Mitchell '98], [Balcan-Blum-Yang '04]
• Graph-based methods [Blum & Chawla '01], [ZGL '03]
Scattered Theoretical Results…
An Augmented PAC model for SSL [BB05]
Extends PAC naturally to fit SSL.
Can generically analyze:
• When unlabeled data will help and by how much.
• How much data we should expect to need in order to perform well.
Key Insight
Unlabeled data is useful if we have beliefs not only about
the form of the target, but also about its relationship
with the underlying distribution.
Different algorithms are based on different assumptions about how
data should behave.
Challenge – how to capture many of the assumptions typically used.
Example of “typical” assumption: Margins
Belief: the separator goes through low-density regions of the space (large margin).
- assume we are looking for a linear separator
- belief: there should exist one with large separation
[Figure: labeled data only with an SVM separator vs. a Transductive SVM separator passing through a low-density region]
Another Example: Self-consistency
Agreement between two parts: co-training [BM98].
- examples contain two sufficient sets of features, x = ⟨x1, x2⟩
- the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x)
For example, if we want to classify web pages: x = ⟨x1, x2⟩.
[Figure: x1 - the text info on the page ("Prof. Avrim Blum"); x2 - the link info pointing to it ("My Advisor")]
Problems thinking about SSL in the PAC model
Su={xi} - unlabeled examples drawn i.i.d. from D
Sl={(xi, yi)} – labeled examples drawn i.i.d. from D and labeled
by some target concept c*.
PAC model talks of learning a class C under (known or
unknown) distribution D.
• Not clear what unlabeled data can do for you.
• Doesn't give you any info about which c ∈ C is the target function.
We extend the PAC model to capture these (and more) uses of unlabeled data.
• Give a unified framework for understanding when and why unlabeled data can help.
Proposed Model, Main Idea (1)
Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
"Learn C" becomes "learn (C, χ)" (i.e., learn class C under compatibility notion χ).
Express relationships that one hopes the target function and underlying distribution will possess.
Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
Proposed Model, Main Idea (2)
Idea: use unlabeled data & our belief to reduce size(C) down
to size(highly compatible functions in C) in our sample
complexity bounds.
Need to be able to analyze how much unlabeled data is
needed to uniformly estimate compatibilities well.
Require that the degree of compatibility be something that
can be estimated from a finite sample.
Require χ to be an expectation over individual examples:
• χ(h, D) = E_{x∼D}[χ(h, x)] is the compatibility of h with D, with χ(h, x) ∈ [0, 1].
• err_unl(h) = 1 - χ(h, D) is the incompatibility of h with D (the unlabeled error rate of h).
Margins, Compatibility
Margins: the belief is that there should exist a large-margin separator.
[Figure: a highly compatible separator, with no probability mass near it]
Incompatibility of h and D (the unlabeled error rate of h): the probability mass within distance γ of h.
Compatibility can be written as an expectation over individual examples, χ(h, D) = E_{x∼D}[χ(h, x)], where:
  χ(h, x) = 0 if dist(x, h) ≤ γ
  χ(h, x) = 1 if dist(x, h) ≥ γ
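As an illustration (not from the proposal), here is a minimal sketch of estimating this margin compatibility from an unlabeled sample; the linear separator (w, b), the threshold gamma, and the data are assumptions made up for the example.

```python
import numpy as np

def margin_compatibility(w, b, X_unlabeled, gamma):
    """Estimate chi(h, D) for a linear separator h(x) = sign(w.x + b):
    the fraction of unlabeled points at distance >= gamma from h."""
    w = np.asarray(w, dtype=float)
    X = np.asarray(X_unlabeled, dtype=float)
    # Geometric distance of each point to the hyperplane w.x + b = 0.
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    chi = np.mean(dist >= gamma)          # empirical compatibility
    err_unl = 1.0 - chi                   # empirical unlabeled error rate
    return chi, err_unl

# Example: a separator along the first coordinate, 500 random unlabeled points.
rng = np.random.default_rng(0)
X_u = rng.uniform(-1, 1, size=(500, 2))
print(margin_compatibility(w=[1.0, 0.0], b=0.0, X_unlabeled=X_u, gamma=0.1))
```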
If we do not want to commit to γ in advance, we can define χ(h, x) to be a smooth function of dist(x, h).
An illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h (this is not an expectation over individual examples and cannot be estimated from a finite sample).
Co-Training, Compatibility
Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
Hope is that the two parts of the example are consistent.
Legal (and natural) notion of compatibility:
- the compatibility of ⟨h1, h2⟩ and D,
- which can be written as an expectation over examples.
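As a hedged reconstruction consistent with [BB05], the natural co-training compatibility is

\[
\chi(\langle h_1, h_2\rangle, \langle x_1, x_2\rangle) = \mathbf{1}\big[h_1(x_1) = h_2(x_2)\big],
\qquad
\chi(\langle h_1, h_2\rangle, D) = \Pr_{\langle x_1, x_2\rangle \sim D}\big[h_1(x_1) = h_2(x_2)\big],
\]

i.e., the probability mass of examples on which the two parts' predictions agree.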
Types of Results in the [BB05] Model
As in PAC, can discuss algorithmic and sample complexity issues.
Sample Complexity issues that we can address:
– How much unlabeled data we need:
  • depends both on the complexity of C and on the complexity of our notion of compatibility.
– Ability of unlabeled data to reduce the # of labeled examples needed:
  • compatibility of the target
  • (various) measures of the helpfulness of the distribution
Give both uniform convergence bounds and epsilon-cover based bounds.
Examples of results: Sample Complexity, Uniform
Convergence Bounds
Finite Hypothesis Spaces, Doubly Realizable Case
ALG: pick a compatible concept that agrees with the labeled sample.
C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}
Bound the # of labeled examples in terms of the helpfulness of D with respect to χ:
– a helpful D is one in which C_{D,χ}(ε) is small.
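As a hedged paraphrase of the kind of bound proved in [BB05] for this case (up to constants): given enough unlabeled data to estimate all compatibilities uniformly to within ε (for finite C, O((1/ε²)(ln|C| + ln(1/δ))) unlabeled examples suffice by Hoeffding plus a union bound), a labeled sample of size

\[
m_\ell = O\!\left(\frac{1}{\epsilon}\Big(\ln\big|C_{D,\chi}(\epsilon)\big| + \ln\frac{1}{\delta}\Big)\right)
\]

suffices so that, with probability at least 1-δ, the hypothesis returned by ALG has err(h) ≤ ε. The point: the ln|C| term of the standard bound is replaced by ln|C_{D,χ}(ε)|.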
Sample Complexity Subtleties
Uniform Convergence Bounds
- Depend both on the complexity of C and on the complexity of χ.
- Distribution-dependent measure of complexity.
ε-Cover bounds can be much better than uniform convergence bounds, for algorithms that behave in a specific way:
• first use the unlabeled data to choose a representative set of compatible hypotheses,
• then use the labeled sample to choose among these.
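A minimal sketch (not from the proposal) of this two-phase procedure for a finite hypothesis class; the threshold classifiers, the margin-style compatibility, and the toy data below are illustrative assumptions.

```python
def ssl_two_phase(hypotheses, predict, compat, unlabeled, labeled, tau=0.1):
    """Generic [BB05]-style two-phase procedure for a finite hypothesis class:
    1) use unlabeled data to keep only hypotheses with estimated
       unlabeled error rate (incompatibility) at most tau,
    2) use labeled data to pick, among those, one of minimum empirical error."""
    def est_err_unl(h):
        return 1.0 - sum(compat(h, x) for x in unlabeled) / len(unlabeled)

    def est_err(h):
        return sum(predict(h, x) != y for x, y in labeled) / len(labeled)

    compatible = [h for h in hypotheses if est_err_unl(h) <= tau]
    return min(compatible, key=est_err) if compatible else None


# Toy example: thresholds on the line; margin-style compatibility says a point
# within distance gamma of the threshold counts as "incompatible" evidence.
gamma = 0.1
thresholds = [0.1 * i for i in range(11)]                  # finite hypothesis class
predict = lambda t, x: 1 if x >= t else -1
compat = lambda t, x: 1.0 if abs(x - t) >= gamma else 0.0
unlabeled = [0.05, 0.1, 0.15, 0.7, 0.75, 0.8, 0.85, 0.9]
labeled = [(0.1, -1), (0.8, 1)]
print(ssl_two_phase(thresholds, predict, compat, unlabeled, labeled, tau=0.2))
```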
Sample Complexity Implications of Our Analysis
Ways in which unlabeled data can help
• If c* is highly compatible and we have enough unlabeled data, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined, distribution-specific notion of hypothesis space size (e.g. the size of the smallest ε-cover).
Subsequent Work, E.g.:
P. Bartlett, D. Rosenberg, AISTATS 2007
J. Shawe-Taylor et al., Neurocomputing 2007
Efficient Co-training of linear separators
• Assume independence given the label: the two parts x1, x2 are drawn independently from D+ or from D-.
• [Blum & Mitchell] show we can co-train (in polynomial time) if we have enough labeled data to produce a weakly-useful hypothesis to begin with.
• [BB05] shows we can learn (in polynomial time) with only a single labeled example.
  Idea: use unlabeled data to generate a polynomial number of candidate hypotheses such that at least one is weakly-useful (uses the Outlier Removal Lemma). Plug into [BM98].
• Key point: independence given the label implies that the functions with low err_unl rate are:
  • close to c*
  • close to ¬c*
  • close to the all-positive function
  • close to the all-negative function
Modern Learning Paradigms: Our Contributions
Modern Learning Paradigms
Incorporating Unlabeled
Data in the Learning Process
Semi-supervised learning (SSL)
- An Augmented PAC model for SSL
[Balcan-Blum, COLT 2005]
[Balcan-Blum, book chapter,
“Semi-Supervised Learning”, 2006]
Active Learning (AL)
- Generic agnostic AL procedure
[Balcan-Beygelzimer-Langford, ICML 2006]
- Margin based AL of linear separators
[Balcan-Broder-Zhang, COLT 2007]
Kernel, Similarity based
learning and Clustering
- Connections between kernels,
margins and feature selection
[Balcan-Blum-Vempala, MLJ 2006]
- A general theory of learning
with similarity functions
[Balcan-Blum, ICML 2006]
- Extensions to Clustering
[Balcan-Blum-Vempala, work in progress]
Part II, Similarity Functions
for Learning
[Balcan-Blum, ICML 2006]
Extensions to Clustering
(With Avrim and Santosh, work in progress)
Kernels and Similarity Functions
Kernels have become a powerful tool in ML.
• Useful in practice for dealing with many different kinds
of data.
• Elegant theory about what makes a given kernel good for
a given learning problem.
Our Work: analyze more general similarity functions.
• In the process we describe ways of constructing good
data dependent kernels.
Kernels
• A kernel K is a pairwise similarity function s.t. ∃ an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y).
• Point is: many learning algorithms can be written so that they only interact with the data via dot-products.
• If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
• If the data is linearly separable by a large margin in φ-space, we don't have to pay for the higher dimension in terms of data or computation time.
[Figure: data mapped by φ, separated by w with margin γ]
If the margin is γ in φ-space, we only need about 1/γ² examples to learn well.
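To illustrate the "interact with data only via dot-products / K(x,y)" point (not from the proposal), here is a minimal kernelized perceptron sketch; the RBF kernel and the toy data are assumptions. Swapping in any kernel implicitly runs the perceptron in the corresponding φ-space.

```python
import numpy as np

def kernel_perceptron(K, X, y, epochs=10):
    """Train a perceptron that accesses data only through the kernel K.
    Returns dual coefficients alpha; predictions use sum_i alpha_i*y_i*K(x_i, x)."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
            if y[i] * score <= 0:          # mistake: add this example
                alpha[i] += 1.0
    return alpha

def predict(K, X, y, alpha, x):
    return np.sign(sum(alpha[j] * y[j] * K(X[j], x) for j in range(len(X))))

# Example kernel: Gaussian/RBF, i.e., an implicit infinite-dimensional phi-space.
rbf = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2))

X = [(0, 0), (1, 1), (0, 1), (1, 0)]   # XOR-style data, not linearly separable
y = [-1, -1, 1, 1]
alpha = kernel_perceptron(rbf, X, y)
print([predict(rbf, X, y, alpha, x) for x in X])
```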
General Similarity Functions
We provide a characterization of good similarity functions for a learning problem that:
1) Talks in terms of natural, direct properties:
   • no implicit high-dimensional spaces
   • no requirement of positive-semidefiniteness
2) If K satisfies these properties for our given problem, then it has implications for learning.
3) Is broad: includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).
A First Attempt: Definition satisfying
properties (1) and (2)
Let P be a distribution over labeled examples (x, l(x))
• K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[K(x,y) | l(y)≠l(x)] + γ
• Example: suppose K(x,y) ≥ 0.2 for any two positives and for any two negatives, but K(x,y) is uniform random in [-1,1] for a positive and a negative.
  Note: this might not be a legal kernel.
A First Attempt: How to use it?
• K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
• Draw a set S- of O((1/γ²) ln(1/δ²)) negative examples.
• Classify x based on which set gives the better average similarity score.
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
Proof
• Hoeffding: for any given "good x", the probability of error w.r.t. x (over the draw of S+, S-) is at most δ².
• By Markov, there is at most a δ chance that the error rate over the good x's is more than δ. So the overall error rate is ≤ ε + δ.
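A minimal sketch (not from the proposal) of this first-attempt rule: compare the average similarity to a positive sample against the average similarity to a negative sample. The cosine-style similarity and the toy vectors are assumptions for the example.

```python
import numpy as np

def similarity_classify(K, S_pos, S_neg, x):
    """Predict +1 iff the average similarity of x to the positive sample
    beats its average similarity to the negative sample."""
    pos_score = np.mean([K(x, y) for y in S_pos])
    neg_score = np.mean([K(x, z) for z in S_neg])
    return 1 if pos_score >= neg_score else -1

# Example with a simple cosine-style similarity taking values in [-1, 1].
cos_sim = lambda a, b: float(np.dot(a, b) /
                             (np.linalg.norm(a) * np.linalg.norm(b)))
S_pos = [np.array([1.0, 0.2]), np.array([0.9, -0.1])]
S_neg = [np.array([-1.0, 0.1]), np.array([-0.8, 0.3])]
print(similarity_classify(cos_sim, S_pos, S_neg, np.array([0.7, 0.0])))   # -> 1
print(similarity_classify(cos_sim, S_pos, S_neg, np.array([-0.6, 0.2])))  # -> -1
```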
A First Attempt: Not Broad Enough
• K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the positives form two clusters with the negatives in between; points in one positive cluster are more similar to the negatives than to the typical positive]
• K(x,y) = x·y has a large-margin separator but doesn't satisfy our definition.
Idea: it would work if we didn't pick the y's from the top-left cluster.
Broaden to say: OK if ∃ a non-negligible region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.
Broader/Main Definition
• K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Main Definition, How to Use It
• K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data:
  F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
• Take a new set of labeled examples, project them to this space, and run your favorite algorithm for learning linear separators.
Point is: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4
(namely w = [w(y1), …, w(yd), -w(z1), …, -w(zd)]).
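A minimal sketch (not from the proposal) of the "triangulation" mapping; scikit-learn's LogisticRegression stands in for "your favorite algorithm for learning linear separators", and the similarity function and data are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(K, landmarks_pos, landmarks_neg, X):
    """Map each x to F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]."""
    land = list(landmarks_pos) + list(landmarks_neg)
    return np.array([[K(x, l) for l in land] for x in X])

# Toy similarity function and two-class data (illustrative only).
sim = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(30, 2))
X_neg = rng.normal(loc=-2.0, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 30 + [-1] * 30)

# Landmarks drawn from each class (playing the role of S+ and S-).
S_plus, S_minus = X_pos[:5], X_neg[:5]

F = landmark_features(sim, S_plus, S_minus, X)        # project to landmark space
clf = LogisticRegression().fit(F, y)                  # learn a linear separator there
print(clf.score(F, y))
```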
Main Definition, Implications
Algorithm
• Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data: F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
Implications: an arbitrary similarity function K that is (ε,γ)-good yields, via the mapping F (whose dot products define a legal kernel), an (ε+δ, γ/4)-good kernel function.
Good Kernels are Good Similarity Functions
Main Definition: K:(x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1-ε probability mass of x satisfy:
  E_{y∼P}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y∼P}[w(y)K(x,y) | l(y)≠l(x)] + γ
Theorem
• An (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition.
Our proofs incurred some penalty: ε' = ε + ε_extra, γ' = γ³·ε_extra.
Nati Srebro (COLT 2007) has since improved the bounds.
Learning with Multiple Similarity Functions
• Let K1, …, Kr be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good.
Algorithm
• Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
• Use them to "triangulate" the data:
  F(x) = [K1(x,y1), …, Kr(x,yd), K1(x,z1), …, Kr(x,zd)].
Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least [formula on the slide].
Sample complexity is roughly [formula on the slide].
Implications
• Theory that provides a formal way of understanding kernels
as similarity functions.
• Algorithms work for sim. fns that aren’t necessarily PSD.
• Suggests natural approach for using similarity
functions to augment feature vector in “anytime” way.
– E.g., features for document can be list of words in
it, plus similarity to a few “landmark” documents.
• Formal justification for "Feature Generation for Text Categorization using World Knowledge" [GM'05]. (Mugizi has proposed work on this.)
Clustering via Similarity Functions
(Work in Progress, with Avrim and Santosh)
What if only unlabeled examples are available?
Consider the following setting:
• Given a data set S of n objects [documents, web pages].
• There is some (unknown) "ground truth" clustering: each x has a true label l(x) in {1, …, t} [topic].
• Goal: produce a hypothesis h of low error, up to isomorphism of label names.
People have traditionally considered mixture models here.
Can we say something in our setting?
• Suppose our similarity function satisfies the following stronger condition: the ground truth is "stable", in that for all clusters C, C' and all subsets A ⊆ C, A' ⊆ C', A and A' are not both more attracted to each other than to their own clusters (here K(x,y) is the attraction between x and y).
• Then we can construct a tree (hierarchical clustering) such that the correct clustering is some pruning of this tree.
[Figure: an example hierarchy - "sports" splits into volleyball, soccer, and gymnastics; "fashion" splits into Dolce & Gabbana and Coco Chanel]
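A minimal sketch (not from the proposal) of one natural way to build such a tree: average-linkage agglomerative merging driven by the similarity function. The toy data and similarity are assumptions; this is not necessarily the construction used in the work in progress.

```python
import numpy as np

def average_linkage_tree(S, K):
    """Greedily merge the two current clusters with highest average similarity.
    Returns the merge history; every pruning of this tree is a candidate clustering."""
    clusters = [[i] for i in range(len(S))]          # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = np.mean([K(S[i], S[j])
                               for i in clusters[a] for j in clusters[b]])
                if avg > best:
                    best, pair = avg, (a, b)
        a, b = pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy data: two well-separated groups; similarity decays with distance.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
K = lambda x, y: np.exp(-abs(x[0] - y[0]))
for left, right in average_linkage_tree(pts, K):
    print(left, "+", right)
```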
Main point
• Exploring the question: what are minimal conditions on a
similarity function that allow it to be useful for
clustering?
• Have considered two relaxations of the clustering objective:
  a. List clustering -- output a small number of candidate clusterings.
  b. Hierarchical clustering -- output a tree such that the right answer is some pruning of it.
• Allow the right answer to be identified with a little bit of additional feedback.
Modern Learning Paradigms: Future Work
Modern Learning Paradigms
Incorporating Unlabeled
Data in the Learning Process
Active Learning
- Margin based AL of linear separators
Extend the analysis to a more general
class of distributions, e.g. log-concave.
Kernel, Similarity based
learning and Clustering
Learning with Sim. Functions
Alternative/tighter definitions
and connections.
Clustering via Sim. Functions
- Can we get an efficient algorithm for the stability-of-large-subsets property?
- Interactive feedback.
Machine Learning Theory and Algorithmic Game Theory, Future Work
Mechanism Design, ML, and Pricing Problems
• Revenue maximization in combinatorial auctions with general preferences.
• Extend BBHM'05 to the limited-supply setting.
• Approximation algorithms for the case of pricing below cost.
Timeline
• Plan to finish in a year.
• Summer 07 / Fall 07:
  - Revenue maximization in general combinatorial auctions, limited and unlimited supply.
  - Clustering via similarity functions.
  - Active learning under log-concave distributions.
• Spring 08: Wrap-up; writing; job search!