Transcript: talk slides

Parallel Coordinate Descent for L1-Regularized Loss Minimization
Joseph K. Bradley
Danny Bickson
Aapo Kyrola
Carlos Guestrin
Carnegie Mellon
L1-Regularized Regression
Example: predicting stock volatility (label) from bigrams in financial reports (features) (Kogan et al., 2009).
Lasso (Tibshirani, 1996)
Sparse logistic regression (Ng, 2004)
Produces sparse solutions
Useful in high-dimensional settings (# features >> # examples), e.g., 5×10^6 features, 3×10^4 samples.
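Both models minimize an L1-regularized loss; written out for reference (standard forms, with labels y_i in {-1, +1} for the logistic case):

```latex
% Lasso (squared-error loss) and sparse logistic regression (logistic loss),
% both with an L1 penalty that drives many coordinates of x to exactly zero.
\[
\min_{x \in \mathbb{R}^d} \; \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1
\qquad\text{and}\qquad
\min_{x \in \mathbb{R}^d} \; \sum_i \log\!\bigl(1 + \exp(-y_i\, a_i^{\top} x)\bigr) + \lambda \lVert x \rVert_1 .
\]
```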
From Sequential Optimization...
Many algorithms: gradient descent, stochastic gradient, interior point methods, hard/soft thresholding, ...
Coordinate descent (a.k.a. Shooting (Fu, 1998)) is one of the fastest algorithms (Friedman et al., 2010; Yuan et al., 2010).
But for big problems? (5×10^6 features, 3×10^4 samples)
...to Parallel Optimization
We use the multicore setting: shared memory, low latency.
We could parallelize:
- Matrix-vector ops (e.g., interior point): not great empirically.
- W.r.t. samples (e.g., stochastic gradient (Zinkevich et al., 2010)): best for many samples, not many features; analysis not for L1.
- W.r.t. features (e.g., shooting): inherently sequential? Surprisingly, no!
Our Work
Shotgun: parallel coordinate descent for L1-regularized regression
Parallel convergence analysis: linear speedups up to a problem-dependent limit
Large-scale empirical study: 37 datasets, 9 algorithms
Lasso (Tibshirani, 1996)
Goal: regress $y \in \mathbb{R}$ on $a \in \mathbb{R}^d$, given samples $\{(a_i, y_i)\}_i$.
Objective: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
(squared error + L1 regularization)
Shooting: Sequential SCD
Lasso: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009):
  While not converged:
    Choose random coordinate j
    Update x_j (closed-form minimization)
[Figure: F(x) contour]
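For concreteness, here is a minimal Python sketch of the Shooting/SCD update for the Lasso (an illustrative sketch, not the authors' code), assuming the columns of A are normalized so that diag(AᵀA) = 1, in which case the closed-form coordinate minimization is a single soft-threshold step:

```python
import numpy as np

def soft_threshold(z, threshold):
    """Soft-thresholding: argmin_v 0.5*(v - z)^2 + threshold*|v|."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def scd_lasso(A, y, lam, num_iters=10000, seed=0):
    """Sequential Stochastic Coordinate Descent (Shooting) for the Lasso.
    Assumes each column of A has unit L2 norm, so the 1-D subproblem
    is solved exactly by one soft-threshold step."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d)
    residual = A @ x - y                               # maintained incrementally
    for _ in range(num_iters):
        j = rng.integers(d)                            # choose a random coordinate
        grad_j = A[:, j] @ residual                    # gradient of the squared error w.r.t. x_j
        new_xj = soft_threshold(x[j] - grad_j, lam)    # closed-form 1-D minimization
        delta = new_xj - x[j]
        if delta != 0.0:
            residual += delta * A[:, j]                # O(n) residual update
            x[j] = new_xj
    return x
```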
Shotgun: Parallel SCD
Lasso: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Shotgun (Parallel SCD):
  While not converged:
    On each of P processors:
      Choose random coordinate j
      Update x_j (same as for Shooting)
Is SCD inherently sequential?
[Figure: nice case (uncorrelated features) vs. bad case (correlated features)]
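A corresponding sketch of Shotgun, here simulated by computing P coordinate updates from the same x and applying them together (the experiments later count "simulated parallel updates"); a real multicore implementation would instead run the updates asynchronously on P threads. The normalization assumption and soft-threshold step are the same as in the SCD sketch above, and the optional x0 warm start is an illustrative addition:

```python
import numpy as np

def shotgun_lasso(A, y, lam, P=8, num_rounds=2000, x0=None, seed=0):
    """Shotgun (parallel SCD) for the Lasso, simulated: in each round, P coordinate
    updates are computed from the same x and then applied together.
    Assumes each column of A has unit L2 norm."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d) if x0 is None else x0.astype(float).copy()
    residual = A @ x - y
    for _ in range(num_rounds):
        coords = rng.choice(d, size=min(P, d), replace=False)   # one coordinate per "processor"
        # Each "processor" computes its update from the same (stale) x and residual.
        deltas = []
        for j in coords:
            v = x[j] - A[:, j] @ residual            # x_j minus gradient of the squared error
            deltas.append(np.sign(v) * max(abs(v) - lam, 0.0) - x[j])
        # Apply all P updates collectively (this is the Delta x of the analysis).
        for j, delta in zip(coords, deltas):
            x[j] += delta
            residual += delta * A[:, j]
    return x
```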
Is SCD inherently sequential?
Lasso: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Coordinate update: $x_j \leftarrow x_j + \delta x_j$ (closed-form minimization)
Collective update: $\Delta x$ collects the parallel updates into one sparse vector, e.g. $\Delta x = (\delta x_i, 0, 0, \delta x_j, 0, \ldots)^{\top}$, with zeros in the coordinates that were not updated.
Is SCD inherently sequential?
Lasso: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Theorem: If A is normalized s.t. diag(AᵀA) = 1, then the decrease of the objective satisfies
$$F(x + \Delta x) - F(x) \;\le\; -\tfrac{1}{2} \sum_{i_j \in \mathcal{P}} (\delta x_{i_j})^2 \;+\; \tfrac{1}{2} \sum_{\substack{i_j, i_k \in \mathcal{P} \\ j \neq k}} (A^{\top} A)_{i_j, i_k}\, \delta x_{i_j}\, \delta x_{i_k}$$
The first term is the sequential progress from the updated coordinates; the second term is the interference between parallel updates.
Is SCD inherently sequential?
Theorem: If A is normalized s.t. diag(AᵀA) = 1,
$$F(x + \Delta x) - F(x) \;\le\; -\tfrac{1}{2} \sum_{i_j \in \mathcal{P}} (\delta x_{i_j})^2 \;+\; \tfrac{1}{2} \sum_{\substack{i_j, i_k \in \mathcal{P} \\ j \neq k}} (A^{\top} A)_{i_j, i_k}\, \delta x_{i_j}\, \delta x_{i_k}$$
Nice case (uncorrelated features): $(A^{\top}A)_{jk} = 0$ for $j \neq k$ (if AᵀA is centered), so the interference term vanishes.
Bad case (correlated features): $(A^{\top}A)_{jk} \neq 0$, so parallel updates can interfere.
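A small numerical sanity check of this bound (a sketch under the same assumptions: unit-normalized columns and the closed-form soft-threshold coordinate update). The measured decrease on the left should never exceed the bound on the right:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, P = 50, 20, 0.1, 5

A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=0)           # normalize so diag(A^T A) = 1
y = rng.standard_normal(n)
x = rng.standard_normal(d) * 0.1

def F(v):
    return 0.5 * np.sum((A @ v - y) ** 2) + lam * np.sum(np.abs(v))

coords = rng.choice(d, size=P, replace=False)
residual = A @ x - y
delta = np.zeros(d)
for j in coords:                          # closed-form single-coordinate updates from the same x
    v = x[j] - A[:, j] @ residual
    delta[j] = np.sign(v) * max(abs(v) - lam, 0.0) - x[j]

lhs = F(x + delta) - F(x)                 # actual decrease of the objective
gram = A.T @ A
interference = 0.5 * (delta @ gram @ delta - np.sum(delta ** 2))   # off-diagonal terms only
rhs = -0.5 * np.sum(delta[coords] ** 2) + interference
print(f"decrease = {lhs:.4f}  <=  bound = {rhs:.4f}")   # inequality from the theorem
```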
Convergence Analysis
Lasso: $\min_{x \in \mathbb{R}^d} F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Main Theorem: Shotgun Convergence
Assume the number of parallel updates $P < d/\rho + 1$, where $\rho$ is the spectral radius of $A^{\top}A$. Then after $T$ iterations,
$$\mathbb{E}\left[F(x^{(T)})\right] - F(x^*) \;\le\; \frac{d \left( \tfrac{1}{2}\lVert x^* \rVert_2^2 + F(x^{(0)}) \right)}{T\,P}$$
(final objective minus optimal objective).
Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
Convergence Analysis
Lasso: $\min_x F(x)$ where $F(x) = \tfrac{1}{2}\lVert Ax - y \rVert_2^2 + \lambda \lVert x \rVert_1$
Theorem: Shotgun Convergence
Assume the number of parallel updates $P < d/\rho + 1$, where $\rho$ is the spectral radius of $A^{\top}A$. Then
$$\mathbb{E}\left[F(x^{(T)})\right] - F(x^*) \;\le\; \frac{d \left( \tfrac{1}{2}\lVert x^* \rVert_2^2 + F(x^{(0)}) \right)}{T\,P}$$
Nice case (uncorrelated features): $\rho = 1 \Rightarrow P_{\max} = d$.
Bad case (correlated features): $\rho = d \Rightarrow P_{\max} = 1$ (at worst).
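To make the limit concrete, a hedged helper that estimates $\rho$ and the implied $P_{\max} \approx d/\rho$ for a given design matrix; the function name is illustrative, and the $P_{\max}$ values quoted on the next slide come from the paper, not from this snippet:

```python
import numpy as np

def shotgun_p_max(A):
    """Estimate the maximum useful number of parallel updates, P_max ~ d / rho,
    where rho is the spectral radius of A^T A (columns assumed unit-normalized)."""
    d = A.shape[1]
    A = A / np.linalg.norm(A, axis=0)       # enforce diag(A^T A) = 1
    rho = np.linalg.norm(A, ord=2) ** 2     # largest squared singular value of A
    return d, rho, max(1, int(d / rho))

# Example: orthogonal (uncorrelated) columns give rho ~ 1, so P_max ~ d;
# strongly correlated columns push rho toward d, so P_max falls toward 1.
A = np.linalg.qr(np.random.default_rng(0).standard_normal((256, 64)))[0]
d, rho, p_max = shotgun_p_max(A)
print(d, rho, p_max)
```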
Convergence Analysis
Theorem: Shotgun Convergence. Assume $P < d/\rho + 1$. Then
$$\mathbb{E}\left[F(x^{(T)})\right] - F(x^*) \;\le\; \frac{d \left( \tfrac{1}{2}\lVert x^* \rVert_2^2 + F(x^{(0)}) \right)}{T\,P}$$
Experiments match the theory: up to a threshold, linear speedups are predicted.
[Figure: iterations to convergence vs. P (# simulated parallel updates)]
Mug32_singlepixcam: $d = 1024$, $\rho = 6.4967$, $P_{\max} = 158$
Ball64_singlepixcam: $d = 4096$, $\rho = 2047.8$, $P_{\max} = 3$
Thus far...
Shotgun: the naive parallelization of coordinate descent works!
Theorem: linear speedups up to a problem-dependent limit.
Now for some experiments…
Experiments: Lasso
7 Algorithms:
- Shotgun, P = 8 (multicore)
- Shooting (Fu, 1998)
- Interior point (Parallel L1_LS) (Kim et al., 2007)
- Shrinkage (FPC_AS, SpaRSA) (Wen et al., 2010; Wright et al., 2009)
- Projected gradient (GPSR_BB) (Figueiredo et al., 2008)
- Iterative hard thresholding (Hard_l0) (Blumensath & Davies, 2009)
Also ran: GLMNET, LARS, SMIDAS
35 Datasets:
- # samples n: [128, 209432]
- # features d: [128, 5845762]
- λ = 0.5, 10
Optimization Details
- Pathwise optimization (continuation)
- Asynchronous Shotgun with atomic operations
Hardware: 8-core AMD Opteron 8384 (2.69 GHz)
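The pathwise optimization (continuation) mentioned above can be sketched as solving a sequence of Lasso problems with geometrically decreasing λ, warm-starting each solve from the previous solution. The schedule below is illustrative, and it assumes a solver with the signature of the shotgun_lasso sketch shown earlier (including the x0 warm start):

```python
import numpy as np

def lasso_path(A, y, lam_target, solver, num_stages=5):
    """Pathwise optimization (continuation): solve a sequence of Lasso problems
    with geometrically decreasing lambda, warm-starting from the previous solution.
    `solver` must accept (A, y, lam, x0=...), e.g. the shotgun_lasso sketch above."""
    lam_max = np.max(np.abs(A.T @ y))        # above this lambda, the all-zero solution is optimal
    lams = np.geomspace(lam_max, lam_target, num_stages)
    x = np.zeros(A.shape[1])
    for lam in lams:
        x = solver(A, y, lam, x0=x)          # warm start from the previous stage
    return x

# Usage (with the earlier sketch): x = lasso_path(A, y, lam_target=0.5, solver=shotgun_lasso)
```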
Experiments: Lasso
[Figure: scatter plots of other algorithms' runtime (sec) vs. (parallel) Shotgun runtime (sec); points above the diagonal mean Shotgun is faster, below mean Shotgun is slower.]
Sparse Compressed Imaging: n ∈ [477, 32768], d ∈ [954, 65536], Pmax ∈ [2865, 11779] (avg 7688)
Sparco (van den Berg et al., 2009): n ∈ [128, 29166], d ∈ [128, 29166], Pmax ∈ [3, 17366] (avg 2987)
Highlighted point: on this (data, λ) pair, Shotgun 1.212 s vs. Shooting 3.406 s.
Shotgun & Parallel L1_LS used 8 cores.
Experiments: Lasso
Shooting is one of the fastest algorithms. Shotgun provides additional speedups.
[Figure: scatter plots of other algorithms' runtime (sec) vs. (parallel) Shotgun runtime (sec); points above the diagonal mean Shotgun is faster, below mean Shotgun is slower.]
Single-Pixel Camera (Duarte et al., 2008): n ∈ [410, 4770], d ∈ [1024, 16384], Pmax = 3
Large, Sparse Datasets: n ∈ [30465, 209432], d ∈ [209432, 5845762], Pmax ∈ [214, 2072] (avg 1143)
Shotgun & Parallel L1_LS used 8 cores.
Experiments: Logistic Regression
Algorithms:
- Shooting CDN: Coordinate Descent Newton (CDN) (Yuan et al., 2010); uses a line search. Extensive tests show CDN is very fast. (A sketch of the CDN coordinate step follows below.)
- Shotgun CDN
- Stochastic Gradient Descent (SGD): lazy shrinkage updates (Langford et al., 2009); used the best of 14 learning rates.
- Parallel SGD (Zinkevich et al., 2010): averages the results of 8 instances run in parallel.
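For reference, a sketch of a single CDN coordinate step for L1-regularized logistic regression, following the update described by Yuan et al. (2010): a second-order approximation along coordinate j, a closed-form solution of the resulting 1-D L1 problem, then Armijo backtracking on the true objective. The constants (beta, sigma) and names are illustrative, not taken from the authors' code:

```python
import numpy as np

def cdn_coordinate_step(A, y, x, j, lam, beta=0.5, sigma=0.01, max_backtracks=20):
    """One Coordinate Descent Newton (CDN) update for
    min_x sum_i log(1 + exp(-y_i * a_i^T x)) + lam * ||x||_1, labels y_i in {-1, +1}."""
    margins = y * (A @ x)                                    # y_i * a_i^T x
    probs = 1.0 / (1.0 + np.exp(np.clip(margins, -500, 500)))  # sigma(-margin)
    g = -np.sum(y * A[:, j] * probs)                         # d(loss)/dx_j
    h = np.sum(A[:, j] ** 2 * probs * (1 - probs)) + 1e-12   # d^2(loss)/dx_j^2
    # Closed-form minimizer of the 1-D L1-regularized quadratic model.
    if g + lam <= h * x[j]:
        d = -(g + lam) / h
    elif g - lam >= h * x[j]:
        d = -(g - lam) / h
    else:
        d = -x[j]
    # Armijo backtracking line search on the true objective.
    def F(v):
        return np.sum(np.logaddexp(0.0, -y * (A @ v))) + lam * np.sum(np.abs(v))
    F0 = F(x)
    decrease_model = g * d + lam * (abs(x[j] + d) - abs(x[j]))
    step = 1.0
    for _ in range(max_backtracks):
        x_new = x.copy()
        x_new[j] += step * d
        if F(x_new) - F0 <= sigma * step * decrease_model:
            return x_new
        step *= beta
    return x                                                 # no acceptable step found
```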
Experiments: Logistic Regression
Zeta* dataset: low-dimensional setting (λ = 1, n = 500,000, d = 2000)
[Figure: objective value (lower is better) vs. time (sec), 0–1600 s. Curves: Shooting CDN, SGD, Parallel SGD, Shotgun CDN.]
*From the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
Shotgun & Parallel SGD used 8 cores.
Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004): high-dimensional setting (λ = 1, n = 18217, d = 44504)
[Figure: objective value (lower is better) vs. time (sec), 0.1–100 s (log scale). Curves: Shooting CDN, SGD and Parallel SGD, Shotgun CDN.]
Shotgun & Parallel SGD used 8 cores.
Shotgun: Self-speedup
Aggregated results from all tests.
[Figure: speedup vs. # cores (1–8), with the optimal (linear) speedup line. Curves: Lasso iteration speedup, logistic regression time speedup, Lasso time speedup.]
Lasso time speedup: not so great on average, but we are doing fewer iterations!
Explanation: the memory wall (Wulf & McKee, 1995). The memory bus gets flooded.
Logistic regression uses more FLOPS/datum, so extra computation hides memory latency: better speedups on average.
Conclusions
Shotgun: Parallel Coordinate Descent
Linear speedups up to a problem-dependent limit
Significant speedups in practice
Large-Scale Empirical Study
Lasso & sparse logistic regression
37 datasets, 9 algorithms
Compared with parallel methods:
Parallel matrix/vector ops
Parallel Stochastic Gradient Descent
Code & Data:
http://www.select.cs.cmu.edu/projects
Future Work
Hybrid Shotgun + parallel SGD
More FLOPS/datum, e.g., Group Lasso (Yuan and Lin, 2006)
Alternate hardware, e.g., graphics processors
Thanks!
References
Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–
274, 2009.
Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling.
Signal Processing Magazine, IEEE, 25(2):83–91, 2008.
Figueiredo, M.A.T, Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other
inverse problems. IEEE J. of Sel. Top. in Signal Processing, 1(4):586–597, 2008.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical
Software, 33(1):1–22, 2010.
Fu, W.J. Penalized regressions: The bridge versus the lasso. J. of Comp. and Graphical Statistics, 7(3):397– 416, 1998.
Kim, S. J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of
Sel. Top. in Signal Processing, 1(4):606–617, 2007.
Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Technologies: NAACL, 2009.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009.
Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.
Ng, A.Y. Feature selection, l1 vs. l2 regularization and rotational invariance. In ICML, 2004.
Shalev-Shwartz, S. and Tewari, A. Stochastic methods for l1 regularized loss minimization. In ICML, 2009.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, 58(1):267–288, 1996.
van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yılmaz, O. Sparco: A testing framework for sparse
reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, 2009.
Wen, Z., Yin, W. Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and
continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010.
Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Trans. on Signal Processing,
57(7):2479–2493, 2009.
Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24,
1995.
Yuan, G. X., Chang, K. W., Hsieh, C. J., and Lin, C. J. A comparison of optimization methods and software for large-scale l1-reg. linear
classification. JMLR, 11:3183– 3234, 2010.
Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.
TO DO
References slide
Backup slides
Discussion with reviewer about SGD vs. SCD in terms of d, n
Experiments: Logistic Regression
Zeta* dataset: low-dimensional setting (λ = 1, n = 500,000, d = 2000)
[Figure: two panels vs. time (sec), 0–1600 s: objective value (lower is better) and test error. Curves: Shooting CDN, SGD, Parallel SGD, Shotgun CDN.]
*Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
Experiments: Logistic Regression
rcv1 dataset (Lewis et al., 2004): high-dimensional setting (λ = 1, n = 18217, d = 44504)
[Figure: two panels vs. time (sec), 0.1–100 s (log scale): objective value and test error (lower is better). Curves: Shooting CDN, SGD and Parallel SGD, Shotgun CDN.]
Shotgun: Improving Self-speedup
[Figure: speedup vs. # cores (1–8), three panels: Lasso time speedup, Lasso iteration speedup, and logistic regression time speedup, each showing max, mean, and min over datasets.]
Logistic regression uses more FLOPS/datum, giving better speedups on average.
Shotgun: Self-speedup
[Figure: speedup vs. # cores (1–8), two panels: Lasso time speedup and Lasso iteration speedup, each showing max, mean, and min over datasets.]
Lasso time speedup is not so great, but we are doing fewer iterations!
Explanation: the memory wall (Wulf & McKee, 1995). The memory bus gets flooded.