Bayesianism, Convexity, and the Quest Towards Optimal Algorithms
Boaz Barak
Harvard University
Microsoft Research
Talk Plan
• Dubious historical analogy.
• Philosophize about automating algorithms.
• Wave hands about convexity and the Sum of Squares
algorithm.
• Sudden shift to Bayesianism vs Frequentism.
• Non-results on the planted clique problem
(or, how to annoy your friends).
Skipping today:
• Sparse coding / dictionary learning / tensor prediction [B-Kelner-Steurer’14,’15, B-Moitra’15]
• Unique games conjecture / small set expansion
• Connections to quantum information theory [..B-Brandao-Harrow-Kelner-Steurer-Zhou’12..]
Prologue: Solving equations
Babylonians (~2000 BC): Solutions for quadratic equations.
del Ferro-Tartaglia-Cardano-Ferrari (1500’s): Solutions for cubics and quartics.
van Roomen/Viète (1593): “Challenge all mathematicians in the world”: 𝑥^45 − 45𝑥^43 + ⋯ + 45𝑥 = …
Euler (1740’s): Special cases of quintics.
Vandermonde (1777): Solve 𝑥^11 = 1 with square and fifth roots.
Gauss (1796): Root of 𝑥^17 = 1; its real part is
cos(2𝜋/17) = (1/16)·(−1 + √17 + √(34 − 2√17) + √(68 + 12√17 − 16√(34 + 2√17) − 2(1 − √17)·√(34 − 2√17)))
Ruffini-Abel-Galois (early 1800’s):
• Some equations can’t be solved in radicals.
• Characterization of solvable equations.
• Birth of group theory.
• 17-gon construction now “boring”: a few lines of Mathematica.
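For concreteness, here is a minimal Python sketch (an illustration in the spirit of the slide’s “few lines of Mathematica”; the talk itself shows no code) that numerically checks the nested-radical expression above against cos(2𝜋/17):

import math

# Gauss's nested-radical expression for cos(2*pi/17), term by term.
s17 = math.sqrt(17)
a = math.sqrt(34 - 2 * s17)
b = math.sqrt(34 + 2 * s17)
inner = math.sqrt(68 + 12 * s17 - 16 * b - 2 * (1 - s17) * a)
gauss = (-1 + s17 + a + inner) / 16

print(gauss)                        # ~0.93247...
print(math.cos(2 * math.pi / 17))   # same value, up to floating point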
A prototypical TCS paper
Interesting problem → either an efficient algorithm (e.g., MAX-FLOW is in P) or a hardness reduction (e.g., MAX-CUT is NP-hard).
Can we make algorithms boring?
Can we reduce creativity in algorithm design?
Can we characterize the “easy” problems?
Characterizing easy problems
Goal: A single simple algorithm that efficiently solves every problem that can be efficiently solved.
Trivially true: the algorithm that enumerates all Turing machines.
Trivially false: analyzing such an algorithm would resolve P vs NP.
Revised Goal: A single simple algorithm (Part 1) that is conjectured to be optimal in some interesting domain of problems (Part 2, next slide).
Byproducts: New algorithms, a theory of computational knowledge.
Domain: Combinatorial Optimization*
Maximize/minimize an objective subject to constraints.
Examples: Satisfiability, graph partitioning and coloring, Traveling Salesperson, Matching, ...
Non-examples: Integer factoring, Determinant.
Characteristics:
• Natural notions of approximation and noise.
• No/little algebraic structure.
• No “𝑁𝑃 ∩ 𝑐𝑜𝑁𝑃 = 𝑃” (“good characterization”), no “𝐵𝑄𝑃 = 𝑃”.
• Threshold behavior: either very easy or very hard (e.g., 2SAT vs 3SAT, random kSAT).
• Same algorithmic ideas and themes keep recurring.
Hope: Make this formal for some subclass of optimization.
Theme: Convexity
Convexity in optimization
Interesting problem → convex problem → general solver.
Example: Can embed {±1} in [−1, +1] or in {𝑥: ∥𝑥∥ = 1}.
Sum of Squares Algorithm [Shor’87, Parrilo’00, Lasserre’01]: a universal embedding of any* optimization problem into an 𝑛^𝑑-dimensional convex set.
Algorithmic version of works related to Hilbert’s 17th problem [Artin’27, Krivine’64, Stengle’74].
• Both the “quality” of the embedding and the running time grow with 𝑑.
• 𝑑 = 𝑛 ⇒ optimal solution, exponential time.
• Encapsulates many natural algorithms; optimal among a natural class [Lee-Raghavendra-Steurer’15].
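To make the degree-2 level concrete, here is a minimal sketch, assuming cvxpy and numpy are available (the function name and formulation are mine, not from the talk): it solves one standard SDP formulation of the Lovász theta function of the complement graph, which upper-bounds the clique number and is, up to conventions, the quantity the later corollary calls 𝑆𝑂𝑆_2(𝐺) = 𝜗(𝐺).

import itertools
import cvxpy as cp
import numpy as np

def sos2_clique_bound(adj: np.ndarray) -> float:
    """Degree-2 style SDP bound on the clique number of the graph with 0/1
    adjacency matrix `adj` (symmetric, zero diagonal)."""
    n = adj.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, cp.trace(X) == 1]
    # Two non-adjacent vertices cannot both belong to a clique.
    for i, j in itertools.combinations(range(n), 2):
        if adj[i, j] == 0:
            constraints.append(X[i, j] == 0)
    problem = cp.Problem(cp.Maximize(cp.sum(X)), constraints)
    problem.solve()
    return problem.value

# Example: the 5-cycle has clique number 2; the bound comes out to sqrt(5) ~ 2.236.
C5 = np.zeros((5, 5))
for i in range(5):
    C5[i, (i + 1) % 5] = C5[(i + 1) % 5, i] = 1
print(sos2_clique_bound(C5))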
Talk Plan
• Dubious historical analogy.
• Philosophize about automating algorithms.
• Wave hands about convexity and the Sum of
Squares algorithm.
• Sudden shift to Bayesianism vs Frequentism.
• Non-results on the planted clique problem.
Frequentists vs Bayesians
“There is a 10% chance that the 10^20-th digit of 𝜋 is 7.”
“Nonsense! The digit is either 7 or isn’t.”
“I will take an 11:1 bet on this.”
Computational version
𝐺 graph with (unknown) maximum clique 𝑆 of size 𝑘
What’s the probability that vertex 17 is in 𝑆?
Information Theoretically: Either 0 or 1
For a computationally bounded observer: may be ≈ 𝑘/𝑛.
Making this formal
𝐺 graph with (unknown) maximum clique 𝑆 of size 𝑘.
Classical Bayesian uncertainty: a posterior distribution 𝜇: {0,1}^𝑛 → ℝ with 𝜇(𝑥) ≥ 0 for all 𝑥 and ∑_𝑥 𝜇(𝑥) = 1,
consistent with the observations: 𝔼_𝜇[𝑥_𝑖 𝑥_𝑗] ≔ ∑_𝑥 𝜇(𝑥) 𝑥_𝑖 𝑥_𝑗 = 0 for every non-adjacent pair 𝑖 ≁ 𝑗.
Computational analog: a degree-𝑑 pseudo-distribution — keep ∑_𝑥 𝜇(𝑥) = 1 and the consistency constraints, but replace pointwise nonnegativity by 𝔼_𝜇[𝑝²] ≥ 0 for every 𝑝 of degree ≤ 𝑑/2.
This is a convex set, defined by 𝑛^𝑑 equations plus a PSD constraint.
Theorem: 𝑆𝑂𝑆_𝑑(𝐺) = max over degree-𝑑 pseudo-distributions 𝜇 of 𝔼_𝜇[∑ 𝑥_𝑖].
Corollary: 𝑆𝑂𝑆_2(𝐺) = 𝜗(𝐺), 𝑆𝑂𝑆_𝑛(𝐺) = 𝜔(𝐺).
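To see why this is a convex set cut out by roughly 𝑛^𝑑 linear equations plus a single PSD constraint, here is the standard reformulation (spelled out for completeness; it is not written on the slide). Collect the pseudo-moments into a matrix indexed by multilinear monomials of degree at most d/2:

\[
\mathcal{M}_d(\mu)[S,T] \;=\; \mathbb{E}_\mu\Big[\prod_{i\in S\cup T} x_i\Big],\qquad |S|,|T|\le d/2 .
\]
For any polynomial \(p(x)=\sum_{|S|\le d/2} c_S \prod_{i\in S} x_i\) (using \(x_i^2=x_i\) on \(\{0,1\}^n\)),
\[
\mathbb{E}_\mu[p^2] \;=\; c^{\top}\mathcal{M}_d(\mu)\,c ,
\]
so the condition “\(\mathbb{E}_\mu[p^2]\ge 0\) for all \(p\) of degree \(\le d/2\)” is exactly \(\mathcal{M}_d(\mu)\succeq 0\), while \(\sum_x\mu(x)=1\) and the consistency constraints \(\mathbb{E}_\mu[x_i x_j]=0\) (non-edges) are linear equations in the same entries.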
Making this formal: a general perspective
For every sound but incomplete proof system Π, say that 𝜇 is a Π-pseudo-distribution consistent with observations 𝒪 if
𝒪 ⊢_Π 𝑓 ≥ 𝛼 implies 𝔼_𝜇[𝑓] ≥ 𝛼
for every function 𝑓 and number 𝛼. Since Π is incomplete, 𝜇 might not be an actual distribution: it is a computational analog of Bayesian probabilities.
(The degree-𝑑 pseudo-distributions above are the special case where Π is the degree-𝑑 sum-of-squares proof system; as before, 𝑆𝑂𝑆_𝑑(𝐺) = max over degree-𝑑 pseudo-distributions 𝜇 of 𝔼_𝜇[∑ 𝑥_𝑖], with 𝑆𝑂𝑆_2(𝐺) = 𝜗(𝐺) and 𝑆𝑂𝑆_𝑛(𝐺) = 𝜔(𝐺).)
Analogy — Algorithms : Proof systems, Frequentist : Bayesian, Pseudorandom : Pseudo-distribution.
Planted Clique Problem [Karp’76, Kucera’95]
Distinguish between 𝐺_{𝑛,1/2} and 𝐺_{𝑛,1/2} + clique of size 𝑘.
Theorem [Lovász’79, Juhász’82]: for G ∼ 𝐺_{𝑛,1/2}, 𝜗(𝐺) = 𝑆𝑂𝑆_2(𝐺) ≈ √𝑛.
No known polynomial-time algorithm does better than 𝑐√𝑛.
Central problem in average-case complexity; related to problems in statistics, sparse recovery, finding equilibria, ... [Hazan-Krauthgamer’09, Koiran-Zouzias’12, Berthet-Rigollet’12]
Theorem [Feige-Krauthgamer’02]: for G ∼ 𝐺_{𝑛,1/2}, 𝐿𝑆_𝑑^+(𝐺) ≈ √(𝑛/2^𝑑).
Can SOS do better?
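To make the two distributions concrete, here is a minimal numpy sketch (illustrative only, not from the talk; the parameters are mine): it samples the ±1 signed adjacency matrix of 𝐺_{𝑛,1/2} with and without a planted clique and compares top eigenvalues. For 𝑘 well above √𝑛 the spectrum already separates the two cases, which is essentially the classical reason √𝑛 is the interesting threshold.

import numpy as np

rng = np.random.default_rng(0)

def signed_adjacency(n: int, clique_size: int = 0) -> np.ndarray:
    """+1/-1 adjacency matrix of G(n,1/2), optionally with a planted clique."""
    A = np.triu(rng.choice([-1.0, 1.0], size=(n, n)), k=1)
    A = A + A.T
    if clique_size:
        S = rng.choice(n, size=clique_size, replace=False)
        A[np.ix_(S, S)] = 1.0          # plant all edges inside S
        np.fill_diagonal(A, 0.0)
    return A

n, k = 1000, 150                        # k well above sqrt(n) ~ 32
lam_null    = np.linalg.eigvalsh(signed_adjacency(n))[-1]
lam_planted = np.linalg.eigvalsh(signed_adjacency(n, k))[-1]
print(lam_null, lam_planted)            # roughly 2*sqrt(n) vs roughly k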
“Theorem” [Meka-Wigderson’13]: for G ∼ 𝐺_{𝑛,1/2}, 𝑆𝑂𝑆_𝑑(𝐺) ≅ √𝑛.
“Proof”: Let 𝑘 ≅ √𝑛 and define 𝜇 of “maximal ignorance”:
𝔼_𝜇[∏_{𝑖∈𝑇} 𝑥_𝑖] ≅ (𝑘/𝑛)^{|𝑇|} if 𝑇 is a clique, and 0 otherwise.
(Same pseudo-distribution as used for 𝐿𝑆^+ by Feige-Krauthgamer.)
𝜇 is a valid pseudo-distribution assuming a higher-degree matrix-valued Chernoff bound.
Bug [Pisier]: the concentration bound is false.
In fact, for 𝑘 ∼ √𝑛 there exists 𝑝 of degree 2 such that 𝔼_𝜇[𝑝²] < 0 [Kelner].
The moments are OK for 𝑘 ∼ 𝑛^{1/(𝑑/2+1)} [Meka-Potechin-Wigderson’15, Deshpande-Montanari’15, Hopkins-Kothari-Potechin’15].
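For readers who want to poke at this numerically, here is a small Python sketch (illustrative; the finite 𝑛 and the exact form of the moments are my own simplification of the “≅” above, and no claim is made about what happens at small 𝑛 — the cited papers handle the asymptotics). It builds the moment matrix of the Feige-Krauthgamer / Meka-Wigderson style pseudo-expectation, indexed by monomials of degree ≤ 2, and reports its smallest eigenvalue; a negative value corresponds exactly to a degree-2 polynomial 𝑝 with 𝔼_𝜇[𝑝²] < 0.

import itertools
import numpy as np

rng = np.random.default_rng(1)

def mw_moment_matrix(adj: np.ndarray, k: float) -> np.ndarray:
    """Moment matrix of the pseudo-expectation E[x_T] = (k/n)^|T| if T is a
    clique of the graph, else 0, indexed by subsets T with |T| <= 2."""
    n = adj.shape[0]
    index_sets = ([frozenset()] + [frozenset([i]) for i in range(n)]
                  + [frozenset(p) for p in itertools.combinations(range(n), 2)])

    def is_clique(T):
        return all(adj[i, j] for i, j in itertools.combinations(sorted(T), 2))

    def pe(T):  # pseudo-expectation of the monomial prod_{i in T} x_i
        return (k / n) ** len(T) if is_clique(T) else 0.0

    return np.array([[pe(S | T) for T in index_sets] for S in index_sets])

n = 40
A = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = A + A.T
M = mw_moment_matrix(A, k=np.sqrt(n))
print(np.linalg.eigvalsh(M)[0])   # PSD iff this minimum eigenvalue is >= 0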
MW’s “moral” error
Pseudo-distributions should be as simple as possible, but not simpler. (Following A. Einstein.)
Pseudo-distributions should have maximum entropy, but respect the data.
MW violated Bayesian reasoning:
Consider 𝔼_𝜇[𝑥_𝑖] = Pr[𝑖 in clique] for a vertex with deg(𝑖) = 𝑛/2 + Δ, Δ = Θ(√𝑛).
According to MW: 𝔼_𝜇[𝑥_𝑖] = 𝑘/𝑛, regardless of Δ.
By Bayesian reasoning: 𝑖 ∉ 𝑆 ⇒ deg(𝑖) ∼ 𝑁(𝑛/2, √𝑛/2), while 𝑖 ∈ 𝑆 ⇒ deg(𝑖) ∼ 𝑁(𝑛/2 + 𝑘, √𝑛/2),
so 𝔼_𝜇[𝑥_𝑖] should be reweighed by exp(−2(Δ−𝑘)²/𝑛) / exp(−2Δ²/𝑛) ∝ exp(4𝑘Δ/𝑛).
Thm [Hopkins-Kothari-Potechin-Raghavendra-Schramm’16]: Bayesian moments get 𝑘 ∼ √𝑛 for 𝑑 = 4.
𝑑 > 4 ??
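Filling in the short calculation behind the reweighting factor (using the Gaussian approximations from the slide):
\[
\Pr\big[\deg(i)=\tfrac n2+\Delta \,\big|\, i\notin S\big] \propto e^{-2\Delta^2/n},
\qquad
\Pr\big[\deg(i)=\tfrac n2+\Delta \,\big|\, i\in S\big] \propto e^{-2(\Delta-k)^2/n},
\]
so by Bayes’ rule the posterior odds of \(i\in S\) are multiplied by the likelihood ratio
\[
\frac{e^{-2(\Delta-k)^2/n}}{e^{-2\Delta^2/n}} \;=\; e^{(4k\Delta-2k^2)/n}\;\propto\; e^{4k\Delta/n},
\]
since the \(e^{-2k^2/n}\) factor does not depend on \(\Delta\). A vertex whose degree exceeds the mean by \(\Delta=\Theta(\sqrt n)\) should therefore be noticeably more likely to be in the clique than the flat \(k/n\) that MW assigns.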
Why is MW’s error interesting?
• It shows SoS captures Bayesian reasoning in a way that other algorithms do not.
• It suggests a new way to define what a computationally bounded observer knows about some quantity...
• ...and a more principled way to design algorithms based on such knowledge. (See [B-Kelner-Steurer’14,’15].)
Even if SoS is not the optimal algorithm we’re looking for, the dream of a more general theory of hardness, easiness, and knowledge is worth pursuing.
Thanks!!