Computer Mountain Climbing


Peter Richtárik
Why parallelizing like crazy and being lazy can be good
I. Optimization
Optimization with Big Data = Extreme* Mountain Climbing
* in a billion-dimensional space on a foggy day
Arup (Truss Topology Design)
Western General Hospital (Creutzfeldt-Jakob Disease)
Royal Observatory (Optimal Planet Growth)
Ministry of Defence, dstl lab (Algorithms for Data Simplicity)
Big Data
BIG Volume
BIG Velocity
BIG Variety
• digital images & videos
• transaction records
• government records
• health records
• defence
• internet activity (social media, wikipedia, ...)
• scientific measurements (physics, climate models, ...)
God’s Algorithm = Teleportation
If You Are Not a God...
[Figure: successive iterates x0, x1, x2, x3 climbing step by step toward the peak]
II. Randomized Coordinate Descent Methods
[the cardinal directions of big data optimization]
P. R. and M. Takáč
Iteration complexity of randomized block
coordinate descent methods for minimizing a
composite function
Mathematical Programming A, 2012
Yu. Nesterov
Efficiency of coordinate descent methods
on huge-scale optimization problems
SIAM J Optimization, 2012
2D Optimization
[Figure: contours of the function]
Goal: find the minimizer of the function
Randomized Coordinate Descent in 2D
[Animation: starting from an initial point, iterates 1–8 each move along a randomly chosen compass direction (N, E, S, W), zig-zagging toward the minimizer]
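The compass walk above is easy to put in code. Below is a minimal illustrative sketch of randomized coordinate descent on a toy 2D problem; the function names, step size, and test function are illustrative choices, not taken from the talk or the cited papers:

```python
import random

def coordinate_descent_2d(grad, x, step=0.1, iters=1000):
    """Randomized coordinate descent: at each iteration, pick one
    coordinate (an N/S or E/W axis) and step along its partial derivative."""
    x = list(x)
    for _ in range(iters):
        i = random.randrange(2)      # choose a random "compass axis"
        x[i] -= step * grad(x)[i]    # move only along that axis
    return x

# Toy problem: minimize F(x, y) = (x - 1)^2 + (y + 2)^2, minimizer (1, -2)
grad = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
x_star = coordinate_descent_2d(grad, [0.0, 0.0])
```

Each iteration touches only one coordinate, which is what makes the method cheap per step on huge problems.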
1 Billion Rows & 100 Million Variables
Bridges are Indeed Optimal!
P. R. and M. Takáč
Parallel coordinate descent methods
for big data optimization
arXiv:1212.0873, 2012
M. Takáč, A. Bijral, P. R. and N. Srebro
Mini-batch primal and dual methods for SVMs
ICML 2013
Failure of Naive Parallelization
[Animation: from point 0, two coordinates are updated simultaneously (steps 1a and 1b); the combined move overshoots to point 1, and repeating it (2a, 2b) overshoots again to point 2, moving away from the minimizer]
Parallel Coordinate Descent
[Plot: predicted ("Theory") vs. observed ("Reality") performance]
A Problem with a Billion Variables
P. R. and M. Takáč
Distributed coordinate descent methods
for big data optimization
Manuscript, 2013
Distributed Coordinate Descent
1.2 TB LASSO problem solved on the HECToR supercomputer with 2048 cores
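The fix for the naive scheme is to update several randomly chosen coordinates per iteration while controlling the combined step. The sketch below is only an illustration of that idea on a separable toy problem; a real method such as PCDM scales the step by a data-dependent factor (derived in the papers above) rather than using a fixed one, and all names here are invented for the example:

```python
import random

def parallel_cd(grad, x, tau=4, step=0.1, iters=500):
    """Sketch of parallel coordinate descent: each iteration, a random
    subset of tau coordinates is updated "simultaneously" (all updates
    computed from the same current point, as if on tau processors)."""
    n = len(x)
    x = list(x)
    for _ in range(iters):
        g = grad(x)                       # gradient at the current point
        S = random.sample(range(n), tau)  # tau random coordinates
        for i in S:
            x[i] -= step * g[i]           # updates do not see each other
    return x

# Separable toy problem: F(x) = sum_i (x_i - i)^2, minimizer x_i = i
n = 8
grad = lambda x: [2 * (x[i] - i) for i in range(n)]
sol = parallel_cd(grad, [0.0] * n)
```

On a fully separable problem like this one the simultaneous updates never interfere; the interesting (and dangerous) case is when coordinates are coupled, which is exactly what the theory quantifies.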
III. Randomized Lock-Free Methods
[optimization as lock breaking]
A Lock with 4 Dials
A function representing the "quality" of a combination x = (x1, x2, x3, x4):
F(x) = F(x1, x2, x3, x4)
Setup: the combination maximizing F opens the lock
Optimization Problem: find the combination maximizing F
Optimization Algorithm
P. R. and M. Takáč
Randomized lock-free gradient methods
Manuscript, 2013
F. Niu, B. Recht, C. Re, and S. Wright
HOGWILD!: A Lock-Free Approach to
Parallelizing Stochastic Gradient Descent
NIPS, 2011
A System of a Billion Locks with Shared Dials
[Figure: graph with nodes x1, x2, x3, x4, ..., xn]
1) Nodes in the graph correspond to dials
2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge
# dials = n = # locks
How do we Measure the Quality of a Combination?
• Each lock j has its own quality function Fj, depending on the dials it owns
• However, a lock does NOT open when its own Fj is maximized
• The system of locks opens when F = F1 + F2 + ... + Fn is maximized, where F : R^n → R
An Algorithm with (too much?) Randomization
1) Randomly select a lock
2) Randomly select a dial belonging to the lock
3) Adjust the value on the selected dial based only on the info corresponding to the selected lock
Synchronous Parallelization
[Diagram: timeline of jobs J1–J9 on three processors; after finishing each job, a processor sits IDLE at a synchronization point until all the others finish theirs]
Crazy (Lock-Free) Parallelization
[Diagram: the same three processors and jobs J1–J9, now with no IDLE gaps — each processor starts its next job immediately, without waiting for the others]
Crazy Parallelization
Theoretical Result
[Formula: parallelization speedup expressed in terms of the # of processors, the # of locks, the average # of dials in a lock, and the average # of dials common to 2 locks]
Computational Insights
IV. Final Two Slides
Why parallelizing like crazy and being lazy can be good
Parallelization + Randomization
• Effectiveness
• Tractability
• Efficiency
• Scalability (big data)
• Parallelism
• Distribution
• Asynchronicity

Tools
• Probability
• HPC
• Matrix Theory
• Machine Learning