
Learning from Satisfying Assignments
Rocco A. Servedio
Columbia University
Anindya De
DIMACS/IAS
Ilias Diakonikolas
U. Edinburgh
Symposium on Learning, Algorithms and Complexity /
MSR India Theory Day
Indian Institute of Science
January, 2015
1
Background:
Learning Boolean functions
from labeled examples
Each data point: a labeled example (x, f(x))
• Studied in TCS for ~three decades
• Lots of interplay with (low-level) complexity theory
• Bumper-sticker level summary of this research:
Simple functions are (computationally) easy to learn,
and complicated functions are hard to learn
2
This talk:
Learning Probability Distributions
Each data point: a draw x from the unknown distribution
• Big topic in statistics (“density estimation”) for decades
• Exciting algorithmic work in the last decade+ in TCS, largely
on continuous distributions (mixtures of Gaussians & more)
• This talk: distribution learning from a complexity theoretic
perspective
– What about distributions over the hypercube {0,1}^n?
– Can we formalize intuition that “simple distributions are easy to
learn”?
– Insights into classical density estimation questions
3
What do we mean by
“learn a distribution”?
• Unknown target distribution
• Algorithm gets i.i.d. draws from
• With probability 9/10, must output (a sampler for a)
distribution
such that statistical distance between
and
is small:
(Natural analogue of Boolean function learning.)
4
Previous work: [KMRRSS94]
• Looked at learning distributions over {0,1}^n in terms of
n-output circuits that generate distributions:
[figure: circuit with inputs z_1, …, z_m uniform over {0,1}^m and outputs x_1, …, x_n distributed according to the target distribution]
• [AIK04] showed it’s hard to learn even very simple
distributions from this perspective: already hard even if
each output bit is a 4-junta of input bits.
5
This work: A different perspective
Our notion of a “simple” distribution over {0,1}^n:
the uniform distribution over the satisfying assignments of a
“simple” Boolean function.
What kinds of Boolean functions can we learn from
their satisfying assignments?
Want algorithms with polynomial runtime and sample complexity.
6
What are “simple” functions?
Halfspaces:
[figure: + and − labeled points separated by a linear threshold]
DNF formulas:
[figure: an OR of AND terms over literals such as x1, ¬x2, x3, ¬x5, x6, ¬x7]
7
Simple functions, cont.
3-CNF formulas:
[figure: an AND of ORs of (up to) 3 literals, over variables such as x1, x2, x3, x5, x6, x7]
Monotone 2-CNF:
[figure: an AND of ORs of 2 unnegated variables, over variables such as x2, x3, x5, x6, x7]
8
Yet more simple functions
Low-degree polynomial threshold
functions:
Intersections of k halfspaces:
[figures: + and − labeled points for a low-degree PTF boundary and for an intersection of k halfspaces]
9
The model, more precisely
• Let C be a fixed class of Boolean functions over {0,1}^n.
• There is some unknown f ∈ C. The learning algorithm
sees samples drawn uniformly from f^{-1}(1).
Target distribution: U_{f^{-1}(1)}, the uniform distribution over f^{-1}(1).
• Goal: With probability 9/10, output a sampler for a
hypothesis distribution D̂ such that d_TV(D̂, U_{f^{-1}(1)}) ≤ ε.
We’ll call this a distribution learning algorithm for C.
10
Relation to function learning
Q: How is this different from learning C (function
learning) under the uniform distribution?
A: Here we only get positive examples. Some other
differences:
• (not so major) Output a hypothesis distribution rather
than a hypothesis function
• (major) Much more demanding guarantee than usual
uniform-distribution learning.
11
Example: Halfspaces
Usual uniform-distribution model for learning functions:
Hypothesis allowed to be wrong on ε·2^n points in {0,1}^n.
[figure: hypercube with 0^n and 1^n marked; the target’s satisfying assignments are a tiny region near 1^n]
For a highly biased target function like this, the constant-0
function is a fine hypothesis for any reasonable ε.
12
A stronger requirement
Our distribution-learning model: the “constant-0 hypothesis” is meaningless!
In this example, for D̂ to be a good hypothesis distribution,
its error must be only an ε fraction of the (tiny) set f^{-1}(1).
[figure: hypercube with 0^n and 1^n marked; the satisfying assignments are a tiny region near 1^n]
Essentially, we require a hypothesis function with multiplicative rather
than additive ε-accuracy relative to |f^{-1}(1)|.
13
Usual function-learning setting
Given: uniform random labeled examples (x, f(x)), must
Output: a hypothesis h such that Pr_x[h(x) ≠ f(x)] ≤ ε.
[figure: + and − regions where h and f disagree; if both regions are small, this is fine!]
Our distribution-learning setting
Given: draws from U_{f^{-1}(1)}, must
Output: a hypothesis distribution D̂ with the
following guarantee: d_TV(D̂, U_{f^{-1}(1)}) ≤ ε.
14
Brief motivational digressions:
(1) Real-world language learning
People typically learn new languages by being
exposed to correct utterances (positive
examples), which are a sparse subset of all
possible vocalizations (all examples).
Goal is to be able to generate new correct
utterances (generate draws from a distribution
similar to the one the samples came from).
15
(2) Connection to continuous density
estimation questions
A basic question in continuous 1-dimensional density estimation:
Target distribution (say over [0,1]) is a “k-bin histogram” -- pdf is
piecewise constant with k pieces.
[figure: a piecewise-constant pdf on [0,1] with k = 5 bins]
Easy to learn such a distribution with poly(k, 1/ε) samples and runtime.
16
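To make the 1-D estimator concrete, here is a minimal sketch (my own illustration, not the algorithm behind the poly(k, 1/ε) bound): split [0,1] into a fine grid of equal-width bins, take empirical frequencies, and output the resulting piecewise-constant pdf. The bin count ceil(k/eps) and the helper name are illustrative choices.

```python
import numpy as np

def learn_histogram_1d(samples, k, eps):
    """Toy piecewise-constant density estimate on [0,1].

    Illustration only: uses roughly k/eps equal-width bins and empirical
    frequencies; the poly(k, 1/eps) algorithm from the slide is more careful.
    """
    num_bins = int(np.ceil(k / eps))              # fine grid of equal-width bins
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    probs = counts / counts.sum()                 # empirical mass of each bin
    density = probs * num_bins                    # convert mass to pdf height
    return edges, density

# Example: samples from a 2-bin histogram (mass 0.8 on [0, 0.5], 0.2 on [0.5, 1])
rng = np.random.default_rng(0)
raw = np.where(rng.random(10000) < 0.8,
               rng.random(10000) * 0.5,
               0.5 + rng.random(10000) * 0.5)
edges, density = learn_histogram_1d(raw, k=2, eps=0.1)
```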
Multi-dimensional histograms
Target distribution over [0,1]^d is specified by k hyper-rectangles that
cover [0,1]^d; pdf is constant within each rectangle.
[figure: a partition of [0,1]^2 into k = 5 rectangles]
Question: Can we learn such distributions without incurring the “curse
of dimensionality”? (Don’t want runtime, # samples to be
exponential in d)
17
Connection with our problem
Our “learning from satisfying assignments” problem for the class
C = {all k-leaf decision trees over d Boolean variables}
is a (very) special case of learning k-bin d-dimensional histograms:
• one of the k hyper-rectangles ↔ the set of inputs reaching one of
the k decision tree leaves
• a rectangle with 0 weight in the distribution ↔ a decision tree leaf that’s
labeled 0
For this special case, we beat the “curse of dimensionality” and
achieve runtime d^{O(log k)}.
18
Results
19
Positive results
Theorem 1: We give an efficient distribution learning algorithm for
C = { halfspaces }.   [figure: halfspace]
Runtime is poly(n, 1/ε).
Theorem 2: We give a (pretty) efficient distribution learning algorithm for
C = { poly(n)-term DNFs }.   [figure: DNF formula]
Runtime is quasi-polynomial in n and 1/ε.
Both results obtained via a general approach, plus C-specific work.
20
Negative results
Assuming crypto-hardness (essentially RSA), there are
no efficient distribution learning algorithms for:
o Intersections of two halfspaces   [figure]
o Degree-2 polynomial threshold functions   [figure]
o 3-CNFs   [figure]
o or even monotone 2-CNFs   [figure]
21
Rest of talk
• Positive results
• General approach, illustrated through specific
case of halfspaces
• Touch on DNFs
22
Learning halfspace distributions
Given positive examples drawn uniformly from f^{-1}(1)
for some unknown halfspace f.
[figure: + labeled points on one side of an unknown linear threshold]
We need to (whp) output a sampler for a distribution that’s
close to U_{f^{-1}(1)}.
23
Let’s fantasize
Suppose somebody gave us f.
[figure: a known halfspace with + labeled points]
Even then, we need to output a sampler for a distribution
close to uniform over f^{-1}(1).
Is this doable? Yes.
24
Approximate sampling
for halfspaces
Theorem: Given a halfspace f over {0,1}^n, we can return a
(near-)uniform point from f^{-1}(1) in poly(n) time
(with failure probability δ).
• [MorrisSinclair99]: sophisticated MCMC analysis
• [Dyer03]: elementary randomized algorithm & analysis using
“dart throwing”
Of course, in our setting we are not given f.
But, we should expect to use (at least) this machinery
for our problem.
25
A potentially easier case…?
For the approximate sampling problem (where we’re given f),
the problem is much easier if |f^{-1}(1)|/2^n is large: sample
uniformly from {0,1}^n & do rejection sampling.
Maybe our problem is easier too in this case?
In fact, yes. Let’s consider this case first.
26
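As a concrete rendering of the rejection-sampling idea for the dense case (a sketch under my own assumptions; f, n, and max_tries are placeholders), the expected number of draws is 1/Pr_x[f(x) = 1], so this is efficient exactly when f^{-1}(1) is not too small:

```python
import random

def rejection_sample_satisfying(f, n, max_tries=10**6):
    """Return a uniform point of f^{-1}(1) by rejection sampling.

    Efficient only in the dense case: the expected number of tries
    is 1 / Pr_x[f(x) = 1].
    """
    for _ in range(max_tries):
        x = tuple(random.randint(0, 1) for _ in range(n))
        if f(x):                     # accept the first satisfying point
            return x
    raise RuntimeError("f looks too sparse for naive rejection sampling")

# Example: a dense halfspace f(x) = 1 iff x has at least n/2 ones
n = 20
f = lambda x: sum(x) >= n // 2
print(rejection_sample_satisfying(f, n))
```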
Halfspaces: the high-density case
• Let p = |f^{-1}(1)| / 2^n.
• We will first consider the case that p ≥ ε.
• We’ll solve this case using Statistical Query
learning & hypothesis testing for distributions.
27
First Ingredient for the
high-density case: SQ
Statistical Query (SQ) learning model:
o SQ oracle STAT(f, D): given a poly-time computable query φ
and a tolerance τ, outputs a value v with
|v − E_{x∼D}[φ(x, f(x))]| ≤ τ.
o An algorithm A is said to be an SQ learner for C
(under distribution D) if A can learn any f ∈ C given
access to STAT(f, D).
28
SQ learning for halfspaces
Good news: [BlumFriezeKannanVempala97] gave an
efficient SQ learning algorithm for halfspaces.
Outputs halfspace hypotheses!
Of course, to run it, we need access to the oracle STAT(f, U_n)
for the unknown halfspace f.
So, we need to simulate this given our examples from
U_{f^{-1}(1)}.
29
The high-density case: first
step
Lemma: Given access to uniform random samples from f^{-1}(1),
and p̂ such that p̂ ≈ p = |f^{-1}(1)|/2^n, queries to STAT(f, U_n)
can be simulated up to error τ in time poly(n, 1/τ).
Proof sketch:
Estimate E_{x∼U_{f^{-1}(1)}}[φ(x, 1)] and E_{x∼U_{f^{-1}(1)}}[φ(x, 0)] using samples
from U_{f^{-1}(1)}.
Estimate E_{x∼U_n}[φ(x, 0)] using samples
from U_n (which we can generate ourselves).
Combine with p̂, using
E_{x∼U_n}[φ(x, f(x))] = p·E_{U_{f^{-1}(1)}}[φ(x, 1)] − p·E_{U_{f^{-1}(1)}}[φ(x, 0)] + E_{U_n}[φ(x, 0)].
30
The high-density case: first
step
Lemma: Given access to uniform random samples from f^{-1}(1),
and p̂ such that p̂ ≈ p, queries to STAT(f, U_n)
can be simulated up to error τ in time poly(n, 1/τ).
Recall promise: p = |f^{-1}(1)|/2^n ≥ ε.
Additionally, we assume (for now) that we have p̂ ≈ p.
The Lemma lets us use the halfspace SQ-learner to get a hypothesis h
— a halfspace! — such that Pr_{x∼U_n}[h(x) ≠ f(x)] is small.
31
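A rough sketch of how such a STAT query could be simulated (my own rendering of the proof idea; phi, p_hat, and the sample sizes are illustrative). It splits E_{U_n}[φ(x, f(x))] into a part estimated from the positive samples (weighted by p̂) and a part estimated from uniform samples we generate ourselves:

```python
import random

def simulate_stat_query(phi, pos_samples, p_hat, n, m=20000):
    """Estimate E_{x~U_n}[phi(x, f(x))] using only positive samples of f.

    Uses the identity
      E[phi(x, f(x))] = p*E_pos[phi(x,1)] - p*E_pos[phi(x,0)] + E_Un[phi(x,0)],
    with p replaced by the estimate p_hat, and with E_Un[.] estimated
    from uniform samples we draw ourselves.
    """
    e_pos_1 = sum(phi(x, 1) for x in pos_samples) / len(pos_samples)
    e_pos_0 = sum(phi(x, 0) for x in pos_samples) / len(pos_samples)
    unif = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    e_unif_0 = sum(phi(x, 0) for x in unif) / m
    return p_hat * (e_pos_1 - e_pos_0) + e_unif_0
```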
Handling the high-density case
• Since Pr_{x∼U_n}[h(x) ≠ f(x)] is small and p ≥ ε, we have that
o |h^{-1}(1)| ≈ |f^{-1}(1)|, and
o U_{h^{-1}(1)} is statistically close to U_{f^{-1}(1)}.
• Hence using rejection sampling, we can easily
sample U_{h^{-1}(1)}.
Caveat: We don’t actually have an estimate p̂
for p.
32
Ingredient #2: Hypothesis testing
• Try all possible values of p̂ in a sufficiently fine
multiplicative grid.
• We will get a list of candidate distributions
D_1, …, D_N such that at least one of
them is ε-close to U_{f^{-1}(1)}.
• Run a “distribution hypothesis tester” to return
a candidate which is O(ε)-close to U_{f^{-1}(1)}.
33
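For example, the grid of guesses might be generated as in this small sketch (the spacing parameter gamma and the range [ε, 1] are my illustrative choices); each guess feeds one run of the high-density procedure and yields one candidate distribution:

```python
def candidate_guesses(eps, gamma=0.5):
    """Multiplicative grid of guesses for p = |f^{-1}(1)| / 2^n.

    Covers [eps, 1] with multiplicative spacing (1 + gamma), so some guess
    is within a (1 + gamma) factor of the true p whenever p >= eps.
    """
    guesses, p_hat = [], eps
    while p_hat <= 1.0:
        guesses.append(p_hat)
        p_hat *= 1.0 + gamma
    return guesses

print(candidate_guesses(eps=0.01))   # one candidate distribution per guess
```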
Distribution hypothesis testing
Theorem: Given
• a sampler for the target distribution D,
• approximate samplers for hypothesis distributions D_1, …, D_N,
• approximate evaluation oracles for D_1, …, D_N, and
• the promise that some D_i is ε-close to D,
the hypothesis tester outputs, in time poly(N, 1/ε), a D_j
such that D_j is O(ε)-close to D.
Having evaluators as well as samplers for the
hypotheses is crucial for this.
34
Distribution hypothesis testing, cont.
We need samplers & evaluators for our hypothesis
distributions U_{h^{-1}(1)}.
All our hypotheses are dense, so we can do approximate
counting easily (rejection sampling) to estimate |h^{-1}(1)|.
Note that U_{h^{-1}(1)}(x) = h(x) / |h^{-1}(1)|.
So we get the required (approximate) evaluators.
Similarly, (approximate) samples are easy via rejection
sampling.
35
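A small sketch of the evaluator for a dense hypothesis h (my illustration; the Monte Carlo count stands in for “approximate counting by rejection sampling”): estimate |h^{-1}(1)| from the fraction of uniform points that satisfy h, then return h(x)/|h^{-1}(1)|:

```python
import random

def approx_count_satisfying(h, n, m=50000):
    """Estimate |h^{-1}(1)| by uniform sampling (fine when h is dense)."""
    hits = sum(h(tuple(random.randint(0, 1) for _ in range(n))) for _ in range(m))
    return (hits / m) * (2 ** n)

def make_evaluator(h, n):
    """Approximate evaluator for the uniform distribution over h^{-1}(1)."""
    size = approx_count_satisfying(h, n)
    return lambda x: (1.0 / size) if h(x) else 0.0
```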
Recap
So we handled the high-density case using
• SQ learning (for halfspaces)
• Hypothesis testing (generic).
(Also used approximate sampling & counting, but
they were trivial because we were in the dense
case.)
Now let’s consider the low-density case (the
interesting case).
36
Low density case: A new ingredient
New ingredient for the low-density case:
A new kind of algorithm called a densifier.
• Input: an estimate p̂ such that p̂ ≈ p, and
samples from U_{f^{-1}(1)}
• Output: A function g such that:
– g(x) = 1 for (almost) all x ∈ f^{-1}(1), and
– |g^{-1}(1)| is not too much larger than |f^{-1}(1)|
(so f^{-1}(1) is a non-negligible fraction of g^{-1}(1)).
For simplicity, assume that g ∈ C (like f).
37
Densifier illustration
[figure: f^{-1}(1), a small region, contained in g^{-1}(1), which is not much larger]
Input: samples from U_{f^{-1}(1)}, and a good estimate p̂ of p
Output: g satisfying the two conditions above
38
Low-density case (cont.)
To solve the low-density case, we need approximate
sampling and approximate counting algorithms for the
class C.
This, plus the ingredients from the high-density case (SQ
learning, hypothesis testing) and the densifier, suffices: given
all these ingredients, we get a distribution learning
algorithm for C.
39
How does it work?
The overall algorithm (recall that p = |f^{-1}(1)| / 2^n); it needs a good estimate p̂ of p:
1. Run the densifier to get g.
2. Use the approximate sampling algorithm for C to get samples
from U_{g^{-1}(1)}.
3. Run the SQ-learner for C under distribution U_{g^{-1}(1)} to get a
hypothesis h for f.
4. Sample from U_{g^{-1}(1)} till we get an x such that h(x) = 1; output
this x.
Repeat with different guesses for p̂, & use hypothesis testing to
choose a hypothesis that’s close to U_{f^{-1}(1)}.
40
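One stage of this pipeline could be organized as in the sketch below (my own pseudocode-style rendering; densifier, approx_uniform_sampler, sq_learn, and the sample count are stand-ins for the subroutines the slides describe):

```python
def one_stage(pos_samples, p_hat, densifier, approx_uniform_sampler, sq_learn):
    """One stage of the general algorithm, for a single guess p_hat.

    Returns a sampler for the hypothesis distribution: (roughly) uniform
    over the points of g^{-1}(1) on which the learned hypothesis h is 1.
    """
    g = densifier(pos_samples, p_hat)                             # step 1: densify
    g_samples = [approx_uniform_sampler(g) for _ in range(5000)]  # step 2: sample g
    h = sq_learn(pos_samples, g_samples)                          # step 3: learn h ~ f under U_{g^{-1}(1)}

    def hypothesis_sampler():                                     # step 4: filter by h
        while True:
            x = approx_uniform_sampler(g)
            if h(x):
                return x
    return hypothesis_sampler

# The outer loop runs one_stage once per guess p_hat on the multiplicative
# grid, then uses the hypothesis tester to pick a good candidate.
```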
A picture of one stage
Note: This all assumed we have a good estimate p̂ of p.
[figure: nested regions f^{-1}(1) ⊆ g^{-1}(1), with the hypothesis h drawn on top]
1. Using samples from U_{f^{-1}(1)}, run the densifier to get g.
2. Run the approximate uniform generation algorithm to get uniform
positive examples of g.
3. Run the SQ-learner on distribution U_{g^{-1}(1)} to get a high-accuracy
hypothesis h for f (under U_{g^{-1}(1)}).
4. Sample from U_{g^{-1}(1)} till we get a point x where h(x) = 1, and output it.
41
How it works, cont.
Recall that to carry out hypothesis testing, we need samplers &
evaluators for our hypothesis distributions.
Now some hypotheses may be very sparse…
• Use approximate counting to estimate the size of the relevant set of
satisfying assignments. As before, each hypothesis distribution is
uniform over such a set, so we get an (approximate) evaluator.
• Use approximate sampling to get samples from each hypothesis
distribution.
42
Recap: a general method
Theorem: Let C be a class of Boolean functions
such that:
(i) C is efficiently SQ-learnable;
(ii) C has a densifier with an output in C; and
(iii) C has efficient approximate counting and
sampling algorithms.
Then there is an efficient distribution learning
algorithm for C.
43
Back to halfspaces: what have we got?
• Saw earlier we have SQ learning [BlumFriezeKannanVempala97]
• [MorrisSinclair99,Dyer03] give approximate counting and
sampling.
So we have all the necessary ingredients… except a densifier.
Reminiscent of the [Dyer03] “dart throwing” approach to approximate
counting – but in that setting, we are given f.
Approximate counting setting: given f, come up with g.   [figure]
Densifier setting: can we come up with a suitable g given only
samples from U_{f^{-1}(1)}?   [figure]
44
A densifier for halfspaces
Theorem: There is an algorithm running in time
poly(n, 1/ε) such that for any halfspace f, if the
algorithm gets as input p̂ such that p̂ ≈ p
and access to U_{f^{-1}(1)}, it
outputs a halfspace g with the following
properties:
1. g(x) = 1 for (almost) all x ∈ f^{-1}(1), and
2. |g^{-1}(1)| is not too much larger than |f^{-1}(1)|.
45
Getting a densifier for halfspaces
Key ingredients:
o Approximate sampling for halfspaces
[MorrisSinclair99,Dyer03]
o Online learner of [MaassTuran90]
46
Towards a densifier for halfspaces
Recall our goals:
1. g(x) = 1 for (almost) all x ∈ f^{-1}(1)
2. |g^{-1}(1)| is not too much larger than |f^{-1}(1)|
Fact: Let S be a set of samples from U_{f^{-1}(1)} of size poly(n, 1/ε). Then, with
high probability, condition (1) holds for any
halfspace g such that g(x) = 1 for all x ∈ S.
Proof: If (1) fails for a halfspace g, then g rejects a non-negligible
fraction of f^{-1}(1), so g is very unlikely to accept all of S.
The Fact follows from a union bound over all (at most 2^{O(n^2)} many) halfspaces.
So ensuring (1) is easy – choose such an S and ensure g is
consistent with S.
How to ensure (2)?
47
Online learning as a two-player game
Imagine a two-player game in which Arnab has a
halfspace f and Larry wants to learn f:
i. Larry initializes a constraint set S to the empty set.
ii. Larry runs a (specific polytime) algorithm A
on the set S, which returns a halfspace g
consistent with S.
iii. Arnab either says “yes, g = f” or else returns
an x such that g(x) ≠ f(x).
iv. Larry adds (x, f(x)) to S and returns to
step (ii).
48
Guarantee of the game
Theorem [MaassTuran90]: There is a specific algorithm A
that Larry can run so that the game terminates in at
most poly(n) rounds. At the end, either g = f or
Larry can certify that there is no halfspace meeting all
the constraints.
(Algorithm A is essentially the ellipsoid algorithm.)
Q: How is this helpful for us?
A: Larry seems to have a powerful strategy – we will
exploit it.
49
Using the online learner
• Choose S as defined earlier. Start with the constraint set equal to S.
• “Larry” simulation: at each stage, run Larry’s strategy A
and return a halfspace g consistent with the current constraints.
• “Arnab” simulation: If g(x) = 0 for some x ∈ S, then return that x
as a (positive) counterexample.
– Else, if |g^{-1}(1)| is small enough (use approx
counting), then we are done and return g.
– Else use approx sampling to randomly choose a
point x uniform over g^{-1}(1) and return it as a (negative) counterexample.
Key point: In this last case, g^{-1}(1) is still much larger than f^{-1}(1),
so whp f(x) = 0 and the counterexample is genuine.
50
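A schematic rendering of this simulation (my sketch; larry_strategy, approx_count, approx_sample, the slack factor, and the round bound are placeholders for the actual subroutines and parameters):

```python
def densify_halfspace(S, p_hat, n, larry_strategy, approx_count, approx_sample,
                      slack=100, max_rounds=10000):
    """Simulate the Maass-Turan game to build a densifier output g.

    Arnab's answers are simulated: positive counterexamples come from S,
    and negative counterexamples are random points of g^{-1}(1), which are
    correct whp as long as g is still much denser than p_hat.
    """
    constraints = []                               # labeled counterexamples so far
    for _ in range(max_rounds):
        g = larry_strategy(constraints)            # halfspace consistent with constraints
        missed = next((x for x in S if not g(x)), None)
        if missed is not None:
            constraints.append((missed, 1))        # g must accept all of S
            continue
        if approx_count(g, n) <= slack * p_hat * (2 ** n):
            return g                               # dense enough: done
        x = approx_sample(g, n)                    # whp f(x) = 0 here
        constraints.append((x, 0))
    raise RuntimeError("simulation did not terminate within max_rounds")
```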
Why is the simulation correct?
• If g(x) = 0 for some x ∈ S, then the simulation
step is indeed correct: x is a genuine counterexample.
• The other case in which Arnab returns a point x
is that |g^{-1}(1)| is still large. Then f(x) = 0 whp, so
the simulation of that step is correct
with high probability!
• Since the simulation lasts poly(n) steps, all the
steps are correct with high probability.
51
Finishing the algorithm
• Provided the simulation is correct, the halfspace g which gets
returned satisfies the conditions:
1. g(x) = 1 for (almost) all x ∈ f^{-1}(1), and
2. |g^{-1}(1)| is not too much larger than |f^{-1}(1)|.
So, we have a densifier – and a distribution learning
algorithm – for halfspaces.
52
DNFs
Recall general result:
Theorem: Let be a class of Boolean functions such that:
(i) is efficiently SQ-learnable;
(ii) has a densifier with an output in ; and
(iii)
has efficient approximate counting and sampling
algorithms.
Then there is an efficient distribution learning algorithm for
.
Get (iii) from [KarpLubyMadras89].
What about densifier and SQ learning?
53
Sketch of the densifier for DNFs
• Consider a DNF f = T_1 ∨ … ∨ T_s with s = poly(n) terms. For concreteness,
suppose each term T_i captures roughly the same fraction of f^{-1}(1).
• Key observation: for each i, a draw from U_{f^{-1}(1)} satisfies T_i with
probability roughly 1/s. So Pr[ a batch of consecutive samples from U_{f^{-1}(1)}
all satisfy the same T_i ] is not too small.
• If this happens, whp these samples completely
identify T_i.
• The densifier finds candidate terms in this way, and outputs the OR
of all candidate terms.
54
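The “these samples completely identify the term” step can be illustrated as follows (my sketch; the candidate term is the set of coordinates on which a batch of samples agrees, which always contains the true term and, for a large enough batch, equals it whp):

```python
def candidate_term(batch):
    """Extract a candidate DNF term from samples assumed to satisfy one term.

    Returns a dict {coordinate: bit} of positions where all samples agree;
    the true term's literals are always included, and spurious agreements
    die off as the batch grows.
    """
    n = len(batch[0])
    term = {}
    for i in range(n):
        vals = {x[i] for x in batch}
        if len(vals) == 1:                 # all samples agree on coordinate i
            term[i] = vals.pop()
    return term

# Example: samples satisfying the term (x0 AND NOT x2)
batch = [(1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1)]
print(candidate_term(batch))               # {0: 1, 2: 0}
```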
SQ learning for DNFs
• Unlike halfspaces, no efficient SQ algorithm for learning
DNFs under arbitrary distributions is known; the best known
runtime is superpolynomial.
• But: our densifier identifies a list of “candidate terms” such
that f is (essentially) an OR of at most poly(n) of them.
• Can use a noise-tolerant SQ learner for sparse disjunctions,
applied over the “metavariables” (the candidate terms).
• Running time is poly(# metavariables).
55
Summary of talk
• New model: Learning the distribution of satisfying
assignments
• “Multiplicative accuracy” learning
• Positive results: halfspaces, DNFs   [figures]
• Negative results: 3-CNFs, monotone 2-CNFs, intersections of
2 halfspaces, degree-2 PTFs   [figures]
56
Future work
• Beating the “curse of dimensionality” for d-dimensional histogram distributions?
• Extensions to agnostic / semi-agnostic setting?
• Other formalizations of “simple distributions”?
Thank you!
58
Hardness results
59
Secure signature schemes
• Key generation: a (randomized) algorithm that produces key
pairs (public key, secret key).
• Signing algorithm: given the secret key and a message m,
produces a signature σ for m.
• Verification algorithm: given the public key, a message m, and σ,
accepts iff σ is a valid signature for m.
Security guarantee: Given signed messages (m_1, σ_1), …, (m_k, σ_k),
no poly-time algorithm can produce a pair (m', σ')
for a new message m' such that the verifier accepts.
60
Connection with our problem
Intuition: View the target distribution as the uniform distribution over signed
messages (m, σ).
If, given signed messages, you can (approximately) sample from
this distribution, then you can generate new signed messages –
contradicting the security guarantee!
Need to work with a refinement of signature schemes – unique
signature schemes [MicaliRabinVadhan99] – for this intuition to go
through.
Unique signature schemes are known to exist under various crypto
assumptions (RSA’, Diffie-Hellman’, etc.)
61
Signature schemes + Cook-Levin
Lemma: For any secure signature scheme, there is
a secure signature scheme with the same security
where the verification algorithm is a 3-CNF.
The set of valid signed messages corresponds to the set of
satisfying assignments of this 3-CNF verifier, so security of the
signature scheme implies there is no distribution learning
algorithm for 3-CNFs.
62
More hardness
Same approach yields hardness for intersections of 2
halfspaces & degree-2 PTFs. (These require parsimonious
reductions: efficiently computable/invertible maps
between sat. assignments of the target function and sat. assignments
of the 3-CNF.)
For monotone 2-CNFs: use the “blow-up” reduction
used in proving hardness of approximate counting for
monotone 2-SAT. Roughly, most sat. assignments of the
monotone 2-CNF correspond to sat. assignments of the
3-CNF.
63