Computer Mountain Climbing - School of Mathematics

Peter Richtárik
APPROX: Accelerated, Parallel and PROXimal coordinate descent
(Joint work with Olivier Fercoq - arXiv:1312.5799)
Moscow
February 2014
Optimization Problem
Minimize a composite objective F(x) = f(x) + psi(x) over x in R^N, where:
- Loss f: convex (smooth or nonsmooth)
- Regularizer psi: convex (smooth or nonsmooth), separable (a sum of per-block terms), and allowed to take the value +infinity (so constraints can be encoded)
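In symbols (the standard composite form these labels describe; the block superscript notation is mine):

```latex
\min_{x \in \mathbb{R}^N} \; F(x) \;=\; f(x) + \psi(x),
\qquad
\psi(x) \;=\; \sum_{i=1}^{n} \psi_i\!\left(x^{(i)}\right),
```

where f is the convex loss and each block regularizer psi_i is convex and may take the value +infinity outside a feasible set, which is how constraints are handled.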
Regularizer: examples
- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Box constraints (e.g., SVM dual)
- Weighted L2 norm
Loss: examples
- Quadratic loss
- Logistic loss
- Square hinge loss
- L-infinity
- L1 regression
- Exponential loss
(references where these losses are treated: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13)
RANDOMIZED COORDINATE DESCENT IN 2D
2D Optimization
[Figure: contours of a function.]
Goal: find the minimizer of the function.
Randomized Coordinate Descent in 2D
[Figure: animation over seven iterations. At each step one of the two coordinate directions (shown as compass directions N/S and E/W) is chosen at random and the point is moved along that direction; the iterates 1-7 zig-zag toward the minimizer.]
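To make the picture concrete, here is a minimal sketch of randomized coordinate descent on a 2D quadratic (the function, starting point and stepsizes below are illustrative choices, not taken from the talk):

```python
import numpy as np

# Illustrative 2D quadratic: f(x) = 0.5 * x^T A x - b^T x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def partial_derivative(x, i):
    """i-th partial derivative of f at x."""
    return A[i] @ x - b[i]

x = np.array([2.0, 2.0])          # starting point
L = np.diag(A)                    # coordinate-wise Lipschitz constants of grad f
rng = np.random.default_rng(0)

for k in range(50):
    i = rng.integers(2)                        # pick the N/S or E/W direction at random
    x[i] -= partial_derivative(x, i) / L[i]    # exact minimization along coordinate i

print(x, np.linalg.solve(A, b))   # final iterate vs. true minimizer
```

Each iteration touches a single coordinate, which is what the N/E/W/S moves in the figure depict.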
CONTRIBUTIONS
Variants of Randomized Coordinate Descent Methods
• Block
– can operate on “blocks” of coordinates
– as opposed to just on individual coordinates
• General
– applies to “general” (=smooth convex) functions
– as opposed to special ones such as quadratics
• Proximal
– admits a “nonsmooth regularizer” that is kept intact when solving the subproblems (see the soft-thresholding sketch after this list)
– the regularizer is neither smoothed nor approximated
• Parallel
– operates on multiple blocks / coordinates in parallel
– as opposed to just 1 block / coordinate at a time
• Accelerated
– achieves O(1/k^2) convergence rate for convex functions
– as opposed to O(1/k)
• Efficient
– complexity of 1 iteration is O(1) per processor on sparse problems
– as opposed to O(# coordinates) : avoids adding two full vectors
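The “Proximal” bullet above keeps the regularizer intact by solving each coordinate subproblem through its proximal operator. As a minimal sketch (assuming a weighted L1 regularizer, as in the examples earlier; v_i stands for whatever stepsize parameter the method prescribes for coordinate i, and the function name is mine), the subproblem has a closed-form soft-thresholding solution:

```python
import numpy as np

def prox_coordinate_update(grad_i, x_i, v_i, lam_i):
    """Solve min_t  grad_i*(t - x_i) + (v_i/2)*(t - x_i)**2 + lam_i*|t|
    in closed form (soft-thresholding); the L1 term is handled exactly."""
    z = x_i - grad_i / v_i                         # step on the smooth part only
    return np.sign(z) * max(abs(z) - lam_i / v_i, 0.0)

# Example: partial derivative 0.8 at x_i = 0.5, stepsize parameter v_i = 2, weight lam_i = 0.3
print(prox_coordinate_update(0.8, 0.5, 2.0, 0.3))  # -> 0.0 (thresholded to zero)
```

The same pattern covers the other example regularizers: box constraints become a projection and the weighted L2 norm a scaling, which is why the regularizer never needs to be smoothed or approximated.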
Brief History of Randomized Coordinate Descent Methods
[Figure: a diagram relating earlier methods to APPROX, which combines the “PARALLEL”, “ACCELERATED” and “PROXIMAL” properties, plus new long stepsizes.]
PCDM (R. & Takáč, 2012) = APPROX if we force the acceleration to be switched off (i.e., keep the extrapolation parameter fixed).
APPROX: Smooth Case
[Formula: the update for coordinate i is a step along the negative partial derivative of f, scaled by a stepsize that we want to be as large as possible.]
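For reference, a hedged sketch of the smooth-case iteration as stated in FR'13b, up to notation (v_i are the stepsize parameters, S_k the random set of coordinates updated at iteration k, tau its expected size):

```latex
\theta_0 = \tfrac{\tau}{n}, \quad z_0 = x_0, \\
y_k = (1-\theta_k)\,x_k + \theta_k z_k, \\
z_{k+1}^{(i)} = z_k^{(i)} - \frac{\tau}{n\,\theta_k\,v_i}\,\nabla_i f(y_k)
  \quad \text{for } i \in S_k \ (\text{other coordinates unchanged}), \\
x_{k+1} = y_k + \tfrac{n\theta_k}{\tau}\,(z_{k+1}-z_k), \qquad
\theta_{k+1} = \tfrac{1}{2}\left(\sqrt{\theta_k^4 + 4\theta_k^2}-\theta_k^2\right).
```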
CONVERGENCE RATE
Convergence Rate
Theorem [FR'13b]. Provided the number of iterations k is large enough relative to the number of coordinates n, the average number of coordinates updated per iteration tau, and the target accuracy, the expected suboptimality falls below the target accuracy. Key assumption: f admits an ESO with respect to the sampling (introduced later).
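As an order-of-magnitude statement (a hedged paraphrase; the exact constants are in FR'13b):

```latex
\mathbb{E}\!\left[F(x_k)\right] - F^* \;=\;
O\!\left(\left(\frac{n}{\tau k}\right)^{2}
\Big[(F(x_0)-F^*) + \tfrac{1}{2}\|x_0 - x^*\|_v^2\Big]\right),
```

so updating tau times more coordinates per iteration cuts the iteration count by roughly the same factor, and halving the target accuracy costs only about 1.4x more iterations (the accelerated O(1/k^2) behaviour).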
Special Case: Fully Parallel Variant
All coordinates are updated in each iteration (tau = n). The number of iterations is then governed by normalized weights (summing to n).
Special Case: Effect of New Stepsizes
With the new stepsizes (discussed later), the bound is governed by the average degree of partial separability and by an "average" of the coordinate-wise Lipschitz constants.
“EFFICIENCY” OF APPROX
Cost of 1 Iteration of APPROX
Assume N = n (all blocks are of size 1) and that f is built from scalar functions (each derivative costs O(1) to evaluate) composed with a sparse data matrix A. Then the average cost of 1 iteration of APPROX is governed by the average number of nonzeros in a column of A (arithmetic ops): the rest is sparse matrix arithmetic.
Bottleneck: Computation of Partial Derivatives
The partial derivatives are not recomputed from scratch: auxiliary quantities (residuals) are maintained across iterations so that each partial derivative is cheap to obtain.
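One standard way this is done (a sketch, assuming the quadratic loss f(x) = 0.5*||Ax - b||^2 from the examples; variable names are mine): maintain the residual r = Ax - b, so each partial derivative is a sparse inner product and each coordinate update touches only the nonzeros of one column of A.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Illustrative sparse least-squares instance: f(x) = 0.5 * ||A x - b||^2
A = sp.random(1000, 200, density=0.01, format="csc", random_state=0)
b = rng.standard_normal(1000)

x = np.zeros(200)
r = -b.copy()                                                # maintained residual r = A x - b (x = 0)
L = np.asarray(A.multiply(A).sum(axis=0)).ravel() + 1e-12    # coordinate-wise Lipschitz constants

for k in range(20000):
    i = rng.integers(200)
    lo, hi = A.indptr[i], A.indptr[i + 1]        # nonzeros of column i (CSC layout)
    rows, vals = A.indices[lo:hi], A.data[lo:hi]
    grad_i = vals @ r[rows]                      # partial derivative: O(nnz of column i)
    delta = -grad_i / L[i]
    x[i] += delta
    r[rows] += vals * delta                      # keep the residual up to date, same O(nnz) cost

print(0.5 * np.linalg.norm(A @ x - b) ** 2)      # final objective value
```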
PRELIMINARY EXPERIMENTS
L1 Regularized L1 Regression (Dorothea dataset)
[Figure: comparison of the Gradient Method, Nesterov's Accelerated Gradient Method, SPCDM and APPROX.]
L1 Regularized Least Squares (LASSO), KDDB dataset
[Figure: comparison of PCDM and APPROX.]
Training Linear SVMs (Malicious URL dataset)
[Figure.]
Choice of Stepsizes: How (not) to Parallelize Coordinate Descent
Convergence of Randomized Coordinate Descent
- Strongly convex F (simple method): O(log(1/eps)) iterations
- Smooth or 'simple' nonsmooth F (accelerated method): O(1/sqrt(eps)) iterations
- 'Difficult' nonsmooth F (simple method): O(1/eps^2) iterations
- 'Difficult' nonsmooth F (accelerated method), or smooth F (simple method): O(1/eps) iterations
Parallelization Dream
Serial: one coordinate updated per iteration. Parallel: tau coordinates updated per iteration. What do we actually get? WANT: the number of iterations cut by a factor of tau. Whether we get it depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way the coordinates are chosen at each iteration.
“Naive” parallelization
Do the same thing as before, but for MORE or ALL coordinates, and ADD UP the updates.
Failure of naive parallelization
[Figure: animation. From the point 0, two coordinate updates 1a and 1b are computed independently; adding both gives the point 1, which overshoots. Repeating the process (updates 2a and 2b, summed to give point 2) moves the iterates away from the minimizer rather than toward it.]
Idea: averaging updates may help
[Figure: from the point 0, the two updates 1a and 1b are averaged rather than summed, giving a point 1 that makes progress toward the minimizer.]
Averaging can be too conservative
[Figure: continuing with averaged updates (2a and 2b giving point 2), progress becomes very slow. BAD!]
But we wanted the parallel speedup: the number of iterations reduced in proportion to the number of coordinates updated per iteration.
What to do?
Write the step in terms of the update h^(i) to coordinate i and the i-th unit coordinate vector e_i:
- Averaging: x_new = x + (1/tau) * sum_{i in S} h^(i) e_i
- Summation: x_new = x + sum_{i in S} h^(i) e_i
Figure out when one can safely use summation.
ESO: Expected Separable Overapproximation
5 Models for f Admitting Small ESO Parameters
1. Smooth partially separable f [RT'11b]
2. Nonsmooth max-type f [FR'13]
3. f with 'bounded Hessian' [BKBG'11, RT'13a]
4. Partially separable f with smooth components [NC'13]
5. Partially separable f with block smooth components [FR'13b]
Randomized Parallel Coordinate Descent Method
The new iterate is obtained from the current iterate by drawing a random set of coordinates (a "sampling") and adding, for each coordinate i in the sampled set, the update to the i-th coordinate along the i-th unit coordinate vector.
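In symbols, the iteration described by these labels (a light reconstruction; S^_k is the sampling drawn at iteration k and h^{(i)}(x_k) the update computed for coordinate i):

```latex
x_{k+1} \;=\; x_k \;+\; \sum_{i \in \hat{S}_k} h^{(i)}(x_k)\, e_i ,
```

where e_i is the i-th unit coordinate vector.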
ESO: Expected Separable Overapproximation
Definition [RT'11b]. (Shorthand: (f, S^) admits an ESO.) The expected value of f at a point perturbed on the sampled coordinates is bounded by a quantity that is:
1. separable in h,
2. can be minimized in parallel,
3. requires computing updates only for the sampled coordinates.
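The inequality behind the definition, as I recall it from RT'11b (beta and the weights w are the ESO parameters; h_{[S^]} denotes the vector that agrees with h on the blocks in S^ and is zero elsewhere):

```latex
\mathbb{E}\!\left[f\!\left(x + h_{[\hat{S}]}\right)\right]
\;\le\;
f(x) + \frac{\mathbb{E}[|\hat{S}|]}{n}
\left(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\,\|h\|_w^2\right),
\qquad
\|h\|_w^2 = \sum_{i=1}^{n} w_i \,\|h^{(i)}\|^2 .
```

The right-hand side is separable in h, which is exactly what allows the coordinate updates to be computed independently in parallel while still guaranteeing decrease in expectation.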
PART II.
ADDITIONAL TOPICS
Partial Separability and Doubly Uniform Samplings
Serial uniform sampling: a single coordinate is chosen uniformly at random. Probability law: P(S^ = {i}) = 1/n.
tau-nice sampling: a subset of tau coordinates is chosen uniformly at random; good for shared memory systems. Probability law: P(S^ = S) = 1 / C(n, tau) for every S with |S| = tau.
Doubly uniform sampling: all sets of the same cardinality are equally likely; can model unreliable processors / machines. Probability law: P(S^ = S) depends only on |S|.
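For concreteness, a minimal sketch of the three sampling laws just described (function names and interface are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def serial_uniform(n):
    """Serial uniform sampling: a single coordinate, each with probability 1/n."""
    return {int(rng.integers(n))}

def tau_nice(n, tau):
    """tau-nice sampling: a subset of size tau, uniform over all such subsets."""
    return set(rng.choice(n, size=tau, replace=False).tolist())

def doubly_uniform(n, cardinality_probs):
    """Doubly uniform sampling: first draw the cardinality, then a uniform subset
    of that size, so all sets of equal cardinality are equally probable."""
    sizes = np.arange(len(cardinality_probs))
    tau = int(rng.choice(sizes, p=cardinality_probs))
    return set(rng.choice(n, size=tau, replace=False).tolist())

print(serial_uniform(10), tau_nice(10, 4), doubly_uniform(10, [0.0, 0.5, 0.0, 0.5]))
```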
ESO for partially separable functions and doubly uniform samplings
Model 1: smooth partially separable f [RT'11b].
Theorem [RT'11b]. For a smooth partially separable f and any doubly uniform sampling, an ESO holds with parameters determined by the degree of partial separability and the sampling (see the formula below for the tau-nice case).
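For the tau-nice sampling, the ESO parameter in this theorem takes (as I recall from RT'11b) the form:

```latex
\beta \;=\; 1 + \frac{(\omega - 1)(\tau - 1)}{n - 1},
```

where omega is the degree of partial separability of f (the maximum number of coordinates any single term of f depends on). Note that beta stays close to 1 when the problem is nearly separable (omega small) and/or few coordinates are updated in parallel (tau small).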
PCDM: Theoretical Speedup
The speedup depends on n (# coordinates), omega (degree of partial separability) and tau (# coordinate updates per iteration).
LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems. Much of Big Data is here!
WEAK OR NO SPEEDUP: non-separable (dense) problems.
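Combining the beta above with the complexity bound gives, in hedged order-of-magnitude terms, a theoretical speedup over the serial method of roughly:

```latex
\text{speedup} \;\approx\; \frac{\tau}{\beta}
\;=\; \frac{\tau}{\,1 + \frac{(\omega-1)(\tau-1)}{n-1}\,},
```

which is nearly tau (linear speedup) for sparse, nearly separable problems (omega much smaller than n) and degrades toward a constant for dense, non-separable ones (omega close to n).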
[Figures: theoretical and measured speedup for n = 1000 coordinates.]
PCDM: Experiment with a 1 billion-by-2 billion LASSO problem
Optimization with Big Data = Extreme* Mountain Climbing
* in a billion dimensional space on a foggy day
[Figures: progress of the billion-scale LASSO experiment measured in coordinate updates, iterations, and wall time.]
Distributed-Memory Coordinate Descent
Distributed tau-nice sampling
Good for a distributed version of coordinate descent: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...). Probability law: each machine independently draws a tau-nice sampling within its own block of coordinates.
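A minimal sketch of this sampling under one natural reading (my reconstruction: coordinates are partitioned across c machines and each machine independently picks tau of its own coordinates uniformly at random; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def distributed_nice_sampling(n, c, tau):
    """Partition {0,...,n-1} into c contiguous blocks (one per machine); each machine
    independently picks tau of its own coordinates uniformly at random.
    Assumes tau <= block size."""
    blocks = np.array_split(np.arange(n), c)
    chosen = [rng.choice(block, size=tau, replace=False) for block in blocks]
    return [set(ids.tolist()) for ids in chosen]     # one coordinate set per machine

# Example: 12 coordinates, 3 machines, each updating 2 of its 4 coordinates per iteration
print(distributed_nice_sampling(12, 3, 2))
```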
ESO: Distributed setting
Model 3 again: f with 'bounded Hessian' [BKBG'11, RT'13a].
Theorem [RT'13b]. In the distributed setting the ESO parameters are governed by the spectral norm of the data and by the spectral norm of the partitioning; a bad partitioning of the coordinates across nodes at most doubles the number of iterations.
Theorem [RT'13b]. The bound implies a guarantee on the number of iterations, expressed via the number of updates per node and the number of nodes.
LASSO with a 3TB data matrix
n = # coordinates
128 Cray XE6 nodes with 4 MPI processes each (c = 512)
Each node: 2 x 16-core CPUs with 32GB RAM
References: serial coordinate descent
• Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
• Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
• [RT'11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
• Rachael Tappenden, P.R. and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
References: parallel coordinate descent
(A short 4-page paper among the references below is a good entry point to the topic.)
• [BKBG'11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML 2011.
• [RT'12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
• [FR'13a] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
• [RT'13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
• [RT'13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.
References: parallel coordinate descent (continued)
• P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
• Rachael Tappenden, P.R. and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
• [FR'13b] Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799, 2013.