
New Theoretical Frameworks for
Machine Learning
Maria-Florina Balcan
…
Thanks to My Committee
Avrim Blum
Yishay Mansour
Manuel Blum
Tom Mitchell
Santosh Vempala
2
The Goals of the Thesis
New Frameworks for Important Learning Problems
Models, algorithms, generalization bounds for:
• Semi-Supervised Learning
• Active Learning
• Learning with Kernels
• Learning with General Similarity Functions
• Clustering
(Prominent methods in ML today.)
Machine Learning & Algorithmic Game Theory
• ML both for designing and analyzing auctions (Revenue Maximization)
3
New Frameworks for Machine Learning
Important Learning Paradigms and Our Contributions
Incorporating Unlabeled Data in the Learning Process
(Unified theoretical treatment lacking; not captured by standard learning models.)
• Semi-supervised learning – a unified discriminative framework
• Active learning – new positive theoretical results
Kernel, Similarity-based Learning and Clustering
(Qualitative gap between theory and practice.)
• A theory of learning with general similarity functions
• New discriminative model for Clustering
5
Machine Learning and Algorithmic Game Theory
One-slide Overview of Our Results
Machine Learning for Auction Design and Pricing
• Generic reduction from incentive-compatible auction design to standard algorithm design.
  [Balcan-Blum-Hartline-Mansour, FOCS 2005] [Balcan-Blum-Hartline-Mansour, JCSS 2008]
Other related work: Approximation and Online Algorithms for Pricing
Revenue maximization in combinatorial auctions:
• Single-minded customers [BB, EC 2006] [BB, TCS 2007] [BBCH, WINE 2007]
• Customers with general valuations [BBM, EC 2008]
6
The Goals of the Thesis
New Frameworks for Important Learning Problems
Machine Learning
Algorithmic Game Theory
7
Structure of the Talk
New Frameworks for Important Learning Paradigms
Incorporating Unlabeled Data in the Learning Process
• Semi-supervised learning (SSL) – an augmented PAC model for SSL
  [Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
• Active Learning (AL)
  – generic agnostic AL procedure
    [Balcan-Beygelzimer-Langford, ICML 2006 & JCSS 2008]
  – margin-based AL of linear separators
    [Balcan-Broder-Zhang, COLT 2007]
    [Balcan-Hanneke-Wortman, COLT 2008 & MLJ 2008] (best student paper)
Kernels, Similarity-based Learning and Clustering
• Kernels, margins & feature selection
  [Balcan-Blum-Vempala, ALT 2004 & MLJ 2006]
• General theory of learning with similarity functions
  [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]
• Discriminative model for Clustering
  [Balcan-Blum-Vempala, STOC 2008] [Balcan-Blum-Gupta, Manuscript 2008]
9
Part I, Incorporating Unlabeled Data in the Learning Process
Semi-Supervised Learning
A general discriminative framework
[Balcan-Blum, COLT 2005; book chapter, “Semi-Supervised Learning”, 2006]
Standard Supervised Learning
• X – instance/feature space
• S = {(x, l)} – set of labeled examples
  – labeled examples drawn i.i.d. from distribution D over X and labeled by some target concept c*
  – labels ∈ {-1, 1} – binary classification
• Do optimization over S, find hypothesis h ∈ C.
• Goal: h has small error over D.
  err(h) = Pr_{x∼D}[h(x) ≠ c*(x)]
• c* in C: realizable case; c* not in C: agnostic case.
• Classic models for learning from labeled data:
  • Statistical Learning Theory (Vapnik)
  • PAC (Valiant)
11
Standard Supervised Learning
Sample Complexity
• E.g., finite hypothesis spaces, realizable case: m ≥ (1/ε)(ln|C| + ln(1/δ)) labeled examples suffice for error at most ε with probability at least 1-δ.
• In PAC, can also talk about efficient algorithms.
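For completeness, a short sketch of where this classic finite-class bound comes from (standard PAC material added here for the reader; it is not part of the original slide), written in LaTeX:

% Classic finite-class, realizable-case bound (standard union-bound argument).
% A fixed hypothesis h with err(h) > \epsilon is consistent with m i.i.d. labeled
% examples with probability at most (1-\epsilon)^m \le e^{-\epsilon m}; a union
% bound over the at most |C| such hypotheses gives failure probability at most
% |C| e^{-\epsilon m}, which is at most \delta whenever:
\[
  m \;\ge\; \frac{1}{\epsilon}\Big(\ln |C| + \ln \tfrac{1}{\delta}\Big)
  \quad\Longrightarrow\quad
  \Pr\big[\exists\, h \in C:\ \widehat{\mathrm{err}}_S(h)=0 \ \wedge\ \mathrm{err}_D(h) > \epsilon\big] \;\le\; \delta .
\]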
12
Semi-Supervised Learning
S_u = {x_i} – unlabeled examples drawn i.i.d. from D
S_l = {(x_i, y_i)} – labeled examples drawn i.i.d. from D, labeled by target c*.
[Protocol diagram: the data source sends unlabeled examples to the learning algorithm; an expert/oracle provides labels for some examples; the algorithm outputs a classifier.]
13
Semi-Supervised Learning
• Variety of methods and experimental results:
  • Transductive SVM [Joachims ’98]
  • Co-training [Blum & Mitchell ’98], [Balcan-Blum-Yang ’04]
  • Graph-based methods [Blum & Chawla ’01], [Zhu-Lafferty-Ghahramani ’03]
  • etc.
• Scattered and very specific theoretical results…
We provide: a general discriminative (PAC, SLT style) framework for SSL.
Challenge: capture many of the assumptions typically used. (Different SSL algorithms are based on different assumptions.)
14
Example of “typical” assumption: Margins
Belief: target goes through low density regions (large margin).
[Figure: left – SVM using labeled data only; right – Transductive SVM, which also uses the unlabeled data to find a large-margin separator.]
15
Another Example: Self-consistency
Agreement between two parts: co-training [Blum-Mitchell ’98].
– Examples contain two sufficient sets of features, x = ⟨x1, x2⟩.
– Belief: the parts are consistent, i.e., ∃ c1, c2 s.t. c1(x1) = c2(x2) = c*(x).
For example, if we want to classify web pages: x = ⟨x1, x2⟩.
[Figure: a web page “Prof. Avrim Blum – My Advisor”; x combines link info & text info, with x1 = text info and x2 = link info.]
16
New discriminative model for SSL [BB05]
Problems with thinking about SSL in standard models:
• PAC or SLT: learn a class C under a (known or unknown) distribution D.
• Unlabeled data doesn’t give any info about which c ∈ C is the target.
Key Insight
Unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
17
Proposed Model, Main Ideas
Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
“Learn C” becomes “learn (C, χ)” (learn class C under compatibility notion χ).
Express relationships that one hopes the target function and underlying distribution possess.
Idea I: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
[Figure: the class of functions C (e.g., linear separators) is cut down, using the abstract prior χ and a finite unlabeled sample, to the compatible functions in C (e.g., large-margin linear separators).]
Idea II: require that the degree of compatibility can be estimated from a finite sample.
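To make Ideas I and II concrete, here is a minimal illustrative sketch (not code from the thesis), assuming hypotheses are linear separators and taking compatibility to be the fraction of unlabeled points at margin at least gamma from the separator; the helper names and the thresholds gamma and tau are hypothetical choices for illustration.

import numpy as np

def compatibility(w, unlabeled_X, gamma=0.5):
    """Estimated compatibility of separator w: the fraction of unlabeled points
    lying at geometric margin >= gamma from the hyperplane. This is one concrete
    instance of a compatibility notion chi(f, D) estimated from a finite
    unlabeled sample (Idea II)."""
    margins = np.abs(unlabeled_X @ w) / np.linalg.norm(w)
    return np.mean(margins >= gamma)

def reduce_class(candidate_ws, unlabeled_X, tau=0.9, gamma=0.5):
    """Idea I: keep only the highly compatible hypotheses."""
    return [w for w in candidate_ws if compatibility(w, unlabeled_X, gamma) >= tau]

def choose(compatible_ws, labeled_X, labeled_y):
    """Among the surviving (compatible) hypotheses, pick the one with the
    smallest empirical error on the (few) labeled examples."""
    errs = [np.mean(np.sign(labeled_X @ w) != labeled_y) for w in compatible_ws]
    return compatible_ws[int(np.argmin(errs))]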
18
Types of Results in the [BB05] Model
Fundamental sample complexity issues:
– How much unlabeled data we need:
  • depends both on the complexity of C and on the complexity of the compatibility notion.
– Ability of unlabeled data to reduce the number of labeled examples:
  • compatibility of the target
  • (various) measures of the helpfulness of the distribution
– Cover bounds can be much better than uniform convergence bounds.
Main poly-time algorithmic result: an improved algorithm for co-training of linear separators (improves substantially over [BM ’98]).
Subsequent work has used our framework:
P. Bartlett, D. Rosenberg, AISTATS 2007; Kakade et al., COLT 2008; J. Shawe-Taylor et al., Neurocomputing 2007.
19
Part II, Incorporating Unlabeled Data in the Learning Process
Active Learning
Brief Overview of the results
Active Learning (AL)
[Protocol diagram: the data source sends unlabeled examples to the learning algorithm; the algorithm repeatedly requests the label of a chosen example and the expert/oracle returns that label; finally the algorithm outputs a classifier.]
• Classic example where AL helps: thresholds on the real line (negatives to the left, positives to the right of an unknown threshold w*).
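To make the classic example concrete (a standard illustration, not code from the thesis): with the unlabeled points sorted on the line, binary search finds the threshold with O(log n) label requests instead of the O(n) labels passive learning would need. The oracle below is hypothetical.

def learn_threshold(points, query_label):
    """Active learning of a 1-D threshold by binary search.
    `points` is a sorted list of unlabeled examples; `query_label(x)` asks the
    oracle for the label of x (-1 left of the threshold, +1 at or right of it).
    Uses O(log n) label requests; assumes at least one positive point exists."""
    lo, hi = 0, len(points) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(points[mid]) == -1:
            lo = mid + 1          # threshold lies to the right of points[mid]
        else:
            hi = mid              # threshold lies at or to the left of points[mid]
    return points[lo]             # leftmost positive point approximates w*

# Example usage with a hypothetical oracle whose threshold is 0.37:
# import random
# pts = sorted(random.random() for _ in range(1000))
# oracle = lambda x: 1 if x >= 0.37 else -1
# learn_threshold(pts, oracle)   # about 10 label requests instead of 1000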
21
First Agnostic Active Learning Procedure
We provide A2, the first active learning algorithm that is robust to noise.
[Balcan, Beygelzimer, Langford, ICML ’06] [Balcan, Beygelzimer, Langford, JCSS ’08]
“Region of disagreement” style (similar to [CAL ’92] in the realizable case): pick a few points at random from the current region of uncertainty, query their labels, and throw out a hypothesis once you are statistically confident it is suboptimal.
Guarantees for A2:
• Fall-back guarantees & exponential improvements.
• C = thresholds, low noise: exponential improvement.
• C = homogeneous linear separators in R^d, D uniform over the unit sphere, low noise: only d² log(1/ε) labels to find h with error ε.
A lot of subsequent work: [Hanneke ’07, DHM ’07, BBZ ’07, BHW ’08].
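A highly simplified sketch of the disagreement-based style of algorithm described above, assuming a finite pool of candidate hypotheses given as Python callables; the confidence radius and pruning rule are illustrative stand-ins, not the exact bounds used in the A2 papers.

import math, random

def a2_style(hypotheses, unlabeled, query_label, batch=50, delta=0.05, rounds=10):
    """Disagreement-based ("A2-style") active learning sketch.
    Maintain a version space V; sample points where hypotheses in V disagree,
    query their labels, and discard hypotheses whose empirical error is
    confidently suboptimal (Hoeffding-style confidence radius)."""
    V = list(hypotheses)
    labeled = []                                     # (x, y) pairs queried so far
    for _ in range(rounds):
        dis = [x for x in unlabeled
               if len({h(x) for h in V}) > 1]        # region of disagreement
        if not dis or len(V) == 1:
            break
        for x in random.sample(dis, min(batch, len(dis))):
            labeled.append((x, query_label(x)))
        n = len(labeled)
        rad = math.sqrt(math.log(2 * len(V) / delta) / (2 * n))   # confidence radius
        err = {h: sum(h(x) != y for x, y in labeled) / n for h in V}
        best = min(err.values())
        V = [h for h in V if err[h] - rad <= best + rad]          # prune confidently bad h
    return min(V, key=lambda h: sum(h(x) != y for x, y in labeled) if labeled else 0)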
22
First Agnostic Active Learning Procedure (continued)
For C = homogeneous linear separators in R^d, D uniform over the unit sphere:
• Realizable case: d^{3/2} log(1/ε) labels.
• Improved in subsequent work to d log²(1/ε) labels [Balcan-Broder-Zhang, COLT 07].
23
Part III, Learning with Kernels and More General Similarity Functions
[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008]
[Balcan-Blum-Srebro, COLT 2008]
Kernel Methods
Prominent method for supervised classification today.
The learning algorithm interacts with the data only via a similarity function.
What is a Kernel?
A kernel K is a legal definition of a dot product: i.e., there exists an implicit mapping Φ such that K(x,y) = Φ(x)·Φ(y).
E.g., K(x,y) = (x·y + 1)^d
Φ: (n-dimensional space) → roughly n^d-dimensional space
Why do Kernels matter?
Many algorithms interact with data only via dot products. So, if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional Φ-space.
26
Example
E.g., for n=2, d=2, the kernel K(x,y) = (x·y)² corresponds to the mapping Φ(x) = (x1², x2², √2·x1x2).
[Figure: data that is not linearly separable in the original space (x1, x2) becomes linearly separable in the Φ-space (z1, z2, z3).]
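As a quick sanity check of this standard mapping (added for the reader; it is not spelled out on the slide), in LaTeX:

% For x = (x_1, x_2), y = (y_1, y_2) and \Phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2):
\[
  \Phi(x)\cdot\Phi(y)
  = x_1^2 y_1^2 + x_2^2 y_2^2 + 2\,x_1 x_2\, y_1 y_2
  = (x_1 y_1 + x_2 y_2)^2
  = (x\cdot y)^2
  = K(x,y).
\]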
27
Generalize Well if Good Margin
• If the data is linearly separable by margin γ in the Φ-space, then good sample complexity.
If margin γ in Φ-space (with |Φ(x)| ≤ 1), then a sample of size only Õ(1/γ²) suffices to get confidence in generalization.
(Another example of a generalization bound.)
[Figure: a large-margin linear separator with margin γ on each side, positives above and negatives below.]
28
Limitations of the Current Theory
In practice: kernels are constructed by viewing them as
measures of similarity.
Existing Theory: in terms of margins in implicit spaces.
Difficult to think about, not great for intuition.
Kernel requirement rules out many natural similarity functions.
Better theoretical explanation?
29
Better Theoretical Framework
In practice: kernels are constructed by viewing them as measures of similarity.
Existing theory: in terms of margins in implicit spaces. Difficult to think about, not great for intuition. The kernel requirement rules out many natural similarity functions.
Better theoretical explanation?
Yes! We provide a more general and intuitive theory that formalizes the intuition that a good kernel is a good measure of similarity.
[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]
30
More General Similarity Functions
We provide a notion of a good similarity function that:
1) Is simpler, in terms of natural direct quantities.
  • no implicit high-dimensional spaces
  • no requirement that K(x,y) = Φ(x)·Φ(y)
2) Is broad: includes the usual notion of a good kernel (one with a large-margin separator in the Φ-space).
3) Allows one to learn classes that have no good kernels.
If K satisfies the definition, K can be used to learn well.
[Diagram relating the main notion, the “first attempt” definition, and good kernels.]
31
A First Attempt
P = distribution over labeled examples (x, l(x)).
Goal: output a classification rule good for P.
K is good if most x are on average more similar to points y of their own type than to points y of the other type.
K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
(average similarity to points of the same label ≥ average similarity to points of the opposite label, plus a gap γ)
32
A First Attempt
K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Example: K(x,y) ≥ 0.2 when l(x) = l(y); K(x,y) random in {-1, 1} when l(x) ≠ l(y).
[Figure: a few example points with pairwise similarity values (0.4, 0.3, 0.5, -1, 1) illustrating the definition.]
33
A First Attempt
K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on its average similarity to S+ versus to S-.
[Figure: a test point x compared against the landmark sets S+ and S-.]
34
A First Attempt
K is (ε,γ)-good for P if at least a 1-ε prob. mass of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Algorithm
• Draw sets S+, S- of positive and negative examples.
• Classify x based on its average similarity to S+ versus to S-.
Theorem: If |S+| and |S-| are Ω((1/γ²) ln(1/(δε'))), then with probability ≥ 1-δ, error ≤ ε + ε'.
• For a fixed good x, the prob. of error w.r.t. x (over the draw of S+, S-) is ≤ δε'. [Hoeffding]
• So there is at most a δ chance that the error rate over GOOD points is ≥ ε'.
• Overall error rate ≤ ε + ε'.
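A minimal sketch of this classifier, assuming K is given as a Python function and S_plus / S_neg are lists of landmark examples (illustrative code, not the thesis's own implementation):

import numpy as np

def similarity_classifier(K, S_plus, S_neg):
    """Classify x as +1 iff its average similarity to the positive landmarks
    exceeds its average similarity to the negative landmarks."""
    def predict(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_neg])
        return 1 if avg_pos >= avg_neg else -1
    return predict

# Example usage with a hypothetical similarity function:
# K = lambda x, y: float(np.dot(x, y))      # e.g., the linear kernel used as a similarity
# predict = similarity_classifier(K, S_plus, S_neg)
# y_hat = predict(x_test)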
35
A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: positives in two clusters 30° apart and a cluster of negatives; some positives are more similar to the negatives than to a typical positive, e.g., ½ versus ¼, or ½ versus ½·1 + ½·(-½).]
The similarity function K(x,y) = x·y has a large-margin separator for this data, yet does not satisfy our definition.
36
A First Attempt: Not Broad Enough
E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
[Figure: the same example, with a highlighted region R of “reasonable” points.]
Broaden: ∃ a non-negligible region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.
[Even if we do not know R in advance.]
37
Broader Definition
K is (ε,γ,τ)-good if ∃ a set R of “reasonable” y (allowed to be probabilistic) s.t. a 1-ε fraction of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
and at least a τ prob. mass of reasonable positives & negatives.
Property
• Draw S = {y1, …, yd}, a set of landmarks. Re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)] ∈ R^d.
• If there are enough landmarks (d = Ω(1/(γ²τ))), then with high prob. there exists a good L1 large-margin linear separator in the new space:
w = [0, 0, 1/n+, 1/n+, 0, 0, 0, -1/n-, 0, 0]
38
Broader Definition
K is (ε,γ,τ)-good if ∃ a set R of “reasonable” y (allowed to be probabilistic) s.t. a 1-ε fraction of x satisfy:
E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
and at least a τ prob. mass of reasonable positives & negatives.
Algorithm
• Draw S = {y1, …, yd}, a set of d_u = Õ(1/(γ²τ)) unlabeled landmarks. Re-represent the data: x → F(x) = [K(x,y1), …, K(x,yd)].
[Figure: data in the original space mapped by F into R^d, where a linear separator exists.]
• Take a new set of d_l = O((1/(γ²ε_acc)) ln(d_u)) labeled examples, project them into this space, and run a good L1 linear separator algorithm.
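A rough sketch of this two-stage algorithm, assuming scikit-learn is available and using L1-regularized logistic regression as a stand-in for “a good L1 linear separator algorithm”; the landmark count and regularization strength below are arbitrary illustrative choices, not the values from the theorems.

import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_features(K, landmarks):
    """Map x -> F(x) = [K(x, y_1), ..., K(x, y_d)] for the drawn landmarks."""
    return lambda X: np.array([[K(x, y) for y in landmarks] for x in X])

def learn_with_similarity(K, unlabeled_X, labeled_X, labeled_y, d=200, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(unlabeled_X), size=min(d, len(unlabeled_X)), replace=False)
    landmarks = [unlabeled_X[i] for i in idx]        # step 1: draw landmarks
    F = landmark_features(K, landmarks)              # re-representation
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(F(labeled_X), labeled_y)                 # step 2: L1 linear separator
    return lambda X: clf.predict(F(X))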
39
Kernels versus Similarity Functions
Main Technical Contributions
[Diagram: good similarity functions (my work) versus good kernels (state-of-the-art technique).]
Theorem: If K is a good kernel, then K is also a good similarity function (but the margin γ gets squared).
If K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense.
40
Kernels versus Similarity Functions
Main Technical Contributions
[Diagram: good similarity functions (my work) are strictly more general than good kernels (state-of-the-art technique).]
Theorem: If K is a good kernel, then K is also a good similarity function (but the margin gets squared).
We can also show a strict separation.
Theorem: For any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
41
Kernels versus Similarity Functions
We can also show a strict separation.
Theorem: For any class C of n pairwise uncorrelated functions, ∃ a similarity function good for all f in C, but no such good kernel function exists.
• In principle, one should be able to learn from O(ε^{-1} log(|C|/δ)) labeled examples.
• Claim 1: can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound. (Assume D is not too concentrated.)
• Claim 2: there is no (ε,γ)-good kernel in hinge loss, even for ε = 1/2 and γ = |C|^{-1/2}. So the margin-based sample complexity is Ω(1/γ²) = Ω(|C|).
42
Similarity Functions for Classification
Conceptual Contributions
Before: difficult theory; implicit spaces; not helpful for intuition; limiting.
After (our work): a much more intuitive theory; no implicit spaces; formalizes a common intuition; provably more general.
Algorithmic Implications
• Can use non-PSD similarities; no need to “transform” them into PSD functions and plug them into an SVM.
  E.g., Liao and Noble, Journal of Computational Biology.
44
Similarity Functions for Classification
Algorithmic Implications
• Can use non-PSD similarities; no need to “transform” them into PSD functions and plug them into an SVM.
  E.g., Liao and Noble, Journal of Computational Biology.
• Gives justification to the natural rule of classifying a point by comparing its average similarity to the positive versus the negative examples.
• We also show that anything learnable with an SVM is learnable this way!
45
Part IV, A Novel View on Clustering
[Balcan-Blum-Vempala, STOC 2008]
A general framework for analyzing clustering accuracy without strong probabilistic assumptions
What if only Unlabeled Examples are Available?
S: a set of n objects. [e.g., documents]
∃ a ground-truth clustering; each x has a label l(x) in {1,…,t}. [e.g., topic]
Goal: produce a clustering h of low error, where err(h) = min over permutations σ of Pr_{x~S}[σ(h(x)) ≠ l(x)].
Problem: unlabeled data only!
But we have a similarity function!
47
What if only Unlabeled Examples are Available?
Protocol
∃ a ground-truth clustering for S, i.e., each x in S has a label l(x) in {1,…,t}.
Input: S and a similarity function K.
Output: a clustering of small error, where err(h) = min over permutations σ of Pr_{x~S}[σ(h(x)) ≠ l(x)].
The similarity function K has to be related to the ground truth.
48
What if only Unlabeled Examples Available?
Fundamental Question
What natural properties on a similarity function
would be sufficient to allow one to cluster well?
49
Contrast with Standard Approaches
Approximation algorithms: input is a graph or an embedding into R^d; algorithms are analyzed to optimize various criteria over edges and scored by approximation ratios.
Mixture models: input is an embedding into R^d; algorithms are scored by error rate, but under strong probabilistic assumptions.
Our approach [Balcan-Blum-Vempala, STOC 2008]: discriminative, not generative. Input is a graph or similarity information; algorithms are scored by error rate, with no strong probabilistic assumptions.
Much better suited to settings where the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.
50
What natural properties on a similarity function would be sufficient to allow one to cluster well?
A condition that trivially works:
K(x,y) > 0 for all x, y with l(x) = l(y);
K(x,y) < 0 for all x, y with l(x) ≠ l(y).
[Figure: two well-separated clusters, e.g., “sports” and “fashion”.]
51
What natural properties on a similarity function would be sufficient to allow one to cluster well?
Property: all x are more similar to all y in their own cluster than to any z in any other cluster.
Problem: the same K can satisfy this for two very different, equally natural clusterings of the same data!
[Figure: documents about soccer, tennis, Lacoste, and Gucci, with K(x,x')=1 within each topic, K(x,x')=0.5 within “sports” (soccer, tennis) and within “fashion” (Lacoste, Gucci), and K(x,x')=0 across; both the 2-clustering {sports, fashion} and the 4-clustering {soccer, tennis, Lacoste, Gucci} satisfy the property.]
52
Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.
53
Relax Our Goals
1. Produce a hierarchical clustering s.t. the correct answer is approximately some pruning of it.
[Figure: a hierarchy with “all topics” at the root, splitting into sports (soccer, tennis) and fashion (Lacoste, Gucci).]
2. Produce a list of clusterings s.t. at least one has low error.
Trade off the strength of the assumption against the size of the list.
We obtain a rich, general model.
54
Examples of Properties and Algorithms
Strict Separation Property
All x are more similar to all y in their own cluster than to any z in any other cluster.
Sufficient for hierarchical clustering (single-linkage algorithm).
Stability Property
For all clusters C, C', and for all A ⊆ C, A' ⊆ C', neither A nor A' is more attracted to the other one than to the rest of its own cluster (where K(A,A') denotes the average attraction between A and A').
Sufficient for hierarchical clustering (average-linkage algorithm).
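A compact sketch of the linkage algorithms these two properties license, assuming the similarity is given as an n×n numpy matrix (illustrative code, not the thesis's implementation):

import numpy as np

def linkage_tree(S, linkage="single"):
    """Agglomerative clustering from a similarity matrix S (higher = more similar).
    Repeatedly merges the two most similar clusters; "single" scores a pair of
    clusters by their maximum pairwise similarity, "average" by the mean
    (average linkage). Returns the merge history, from which any pruning of the
    resulting hierarchy can be read off."""
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                block = S[np.ix_(clusters[a], clusters[b])]
                score = block.max() if linkage == "single" else block.mean()
                if score > best:
                    best, pair = score, (a, b)
        a, b = pair
        merges.append((clusters[a], clusters[b], best))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges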
55
Examples of Properties and Algorithms
Average Attraction Property
E_{x' ∈ C(x)}[K(x,x')] > E_{x' ∈ C'}[K(x,x')] + γ   (for all clusters C' ≠ C(x))
Not sufficient for hierarchical clustering, but we can produce a small list of clusterings (sampling-based algorithm).
Stability of Large Subsets Property
For all clusters C, C', and for all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn, neither A nor A' is more attracted to the other one than to the rest of its own cluster.
Sufficient for hierarchical clustering: find the hierarchy using a multi-stage learning-based algorithm.
56
Stability of Large Subsets Property
For all C, C', all A ⊆ C, A' ⊆ C' with |A| + |A'| ≥ sn: K(A, C-A) > K(A, A').
Algorithm
1) Generate a list L of candidate clusters (average-attraction algorithm). Ensure that every ground-truth cluster is f-close to one in L.
2) For every pair (C, C') in L s.t. all three parts (C ∩ C', C \ C', C' \ C) are large:
   if K(C ∩ C', C \ C') ≥ K(C ∩ C', C' \ C), then throw out C'; else throw out C.
3) Clean up and hook the surviving clusters into a tree.
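A sketch of the pairwise pruning in step 2 above, assuming candidate clusters are represented as Python sets and similarities are averaged over pairs; the `min_size` threshold and helper names are illustrative, not the parameters from the paper.

import itertools

def avg_sim(K, A, B):
    """Average pairwise similarity (attraction) between two sets of points."""
    return sum(K(x, y) for x in A for y in B) / (len(A) * len(B))

def prune_candidates(candidates, K, min_size=1):
    """Step-2 style pruning: for overlapping candidate clusters whose
    intersection and both differences are large, keep the one whose
    intersection is more attracted to its own remainder."""
    survivors = set(range(len(candidates)))
    for i, j in itertools.combinations(range(len(candidates)), 2):
        if i not in survivors or j not in survivors:
            continue
        C, C2 = candidates[i], candidates[j]
        inter, d1, d2 = C & C2, C - C2, C2 - C
        if min(len(inter), len(d1), len(d2)) < min_size:
            continue
        if avg_sim(K, inter, d1) >= avg_sim(K, inter, d2):
            survivors.discard(j)      # throw out C'
        else:
            survivors.discard(i)      # throw out C
    return [candidates[i] for i in survivors]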
57
Similarity Functions for Clustering, Summary
• Minimal conditions on K to be useful for clustering.
• For robust theory, relax objective: hierarchy, list.
• A general model that parallels PAC, SLT, and Learning with Kernels and Similarity Functions in Supervised Classification.
59
Similarity Functions, Overall Summary
Supervised Classification: generalize and simplify the existing theory of kernels.
[Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, COLT 2008] [Balcan-Blum-Srebro, MLJ 2008]
Unsupervised Learning: the first clustering model for analyzing accuracy without strong probabilistic assumptions.
[Balcan-Blum-Vempala, STOC 2008]
60
Future Directions
Connections between Computer Science and Economics
– Active learning and online learning techniques for better pricing algorithms and auctions.
New Frameworks and Algorithms for Machine Learning
– Similarity Functions for Learning and Clustering
  • Learn a good similarity based on data from related problems.
  • Other notions of “useful”, other types of feedback.
  • Other navigational structures: e.g., a small DAG.
– Interactive Learning
61
62