graphlab@msft

Machine Learning in the Cloud

Yucheng Low, Aapo Kyrola, Joey Gonzalez, Carlos Guestrin, Joe Hellerstein, David O’Hallaron, Danny Bickson

Carnegie Mellon

Machine Learning in the Real World

13 million Wikipedia pages. 500 million Facebook users. 3.6 billion Flickr photos. 24 hours of video uploaded to YouTube every minute.

Parallelism is Difficult

Wide array of different parallel architectures: GPUs, multicore, clusters, clouds, supercomputers. Different challenges for each architecture.

High Level Abstractions to make things easier.


MapReduce – Map Phase

[Figure: input records partitioned across CPU 1–4; each CPU applies the map function to its own records independently.]

Embarrassingly parallel: independent computation, no communication needed.

MapReduce – Reduce Phase

[Figure: each CPU folds/aggregates the mapped values assigned to it into a partial result.]

Fold/Aggregation
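To make the map and fold phases concrete, here is a minimal, self-contained C++ sketch (not Hadoop or GraphLab code); the input values and the square-then-sum operations are invented for illustration.

#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // Input records; each element can be mapped independently (no communication).
  std::vector<int> data = {1, 9, 4, 3, 2, 8, 5, 6};

  // Map phase: apply the same function to every record.
  std::vector<int> mapped(data.size());
  for (std::size_t i = 0; i < data.size(); ++i)
    mapped[i] = data[i] * data[i];          // example map: square each value

  // Reduce (fold/aggregation) phase: combine all mapped values into one result.
  int total = std::accumulate(mapped.begin(), mapped.end(), 0);

  std::cout << "sum of squares = " << total << "\n";
  return 0;
}

Because the map step touches each record in isolation, the loop could be split across CPUs with no coordination, which is exactly what makes this pattern embarrassingly parallel.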

MapReduce and ML

Excellent for large data-parallel tasks!

[Diagram: the space of problems split into Data-Parallel vs. Complex Parallel Structure. Map-Reduce covers the data-parallel side: feature extraction, cross validation, computing sufficient statistics.]

Is there more to Machine Learning?

Iterative Algorithms?

We can implement iterative algorithms in MapReduce: [Diagram: each iteration redistributes the data partitions to CPU 1–3, runs the map tasks, and writes the data back before the next iteration.]

Iterative MapReduce

The system is not optimized for iteration: [Diagram: every iteration pays the full cost of redistributing all data partitions to CPU 1–3 and writing them back.]

Iterative MapReduce

Only a subset of the data needs computation (multi-phase iteration): [Diagram: across iterations, only a few data partitions are assigned to CPUs for computation.]

MapReduce and ML

Excellent for large data-parallel tasks!

[Diagram repeated: Map-Reduce covers data-parallel tasks such as feature extraction, cross validation, and computing sufficient statistics.]

Is there more to Machine Learning?

Structured Problems

Example Problem: Will I be successful in research?

Interdependent computation: success depends on the success of others, so we may not be able to safely update neighboring nodes in parallel [e.g., Gibbs sampling].

Not Map-Reducible


Space of Problems

Sparse Computation Dependencies: can be decomposed into local “computation kernels”.

Asynchronous Iterative Computation: repeated iterations over local kernel computations.

Parallel Computing and ML

Not all algorithms are efficiently data parallel.

[Diagram: the space of problems split into Data-Parallel vs. Structured Iterative Parallel. Map-Reduce covers the data-parallel side: feature extraction, cross validation, computing sufficient statistics. GraphLab targets the structured iterative side: Lasso, kernel methods, tensor factorization, belief propagation, SVM, sampling, deep belief networks, graphical models, neural networks, ...]

GraphLab Goals

Designed for ML needs: express data dependencies; iterative.

Simplifies the design of parallel programs:

Abstracts away hardware issues. Addresses multiple hardware architectures: multicore, distributed, GPU, and others.

GraphLab Goals

[Diagram: a 2-D space of models (simple to complex) vs. data size (small to large). “Now”: data-parallel tools handle simple models on large data, or complex models on small data. The goal: complex models on large data.]

GraphLab Goals

[Diagram repeated: GraphLab targets the complex-model, large-data region.]

GraphLab

A Domain-Specific Abstraction for Machine Learning


Everything on a Graph

A graph with data associated with every vertex and edge.

Update Functions

Update functions: operations applied on a vertex that transform the data in the scope of the vertex.

Update Functions

An update function can schedule the computation of any other update function: FIFO scheduling, prioritized scheduling, randomized, etc.

Scheduled computation is guaranteed to execute eventually.
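A minimal sketch of the scheduling idea in plain C++, assuming nothing about GraphLab's real API: a FIFO queue of vertex tasks, where an update may schedule its neighbors. The graph, the averaging update, and the convergence threshold are all hypothetical.

#include <cmath>
#include <iostream>
#include <queue>
#include <vector>

struct Graph {
  std::vector<double> vertex_data;              // data stored on each vertex
  std::vector<std::vector<int>> neighbors;      // adjacency lists
};

// An "update function": transforms the data in the scope of vertex v and
// may schedule more work (here: schedule the neighbors if the value moved).
void update(Graph& g, int v, std::queue<int>& scheduler) {
  double sum = 0;
  for (int u : g.neighbors[v]) sum += g.vertex_data[u];
  double new_val = g.neighbors[v].empty()
                       ? g.vertex_data[v]
                       : sum / g.neighbors[v].size();
  if (std::abs(new_val - g.vertex_data[v]) > 1e-3)
    for (int u : g.neighbors[v]) scheduler.push(u);   // schedule other updates
  g.vertex_data[v] = new_val;
}

int main() {
  Graph g;
  g.vertex_data = {0.0, 1.0, 2.0, 3.0};
  g.neighbors   = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};

  // FIFO scheduling: every scheduled task is guaranteed to run eventually.
  std::queue<int> scheduler;
  for (int v = 0; v < 4; ++v) scheduler.push(v);

  int steps = 0;
  while (!scheduler.empty() && steps++ < 1000) {      // cap keeps the demo finite
    int v = scheduler.front();
    scheduler.pop();
    update(g, v, scheduler);
  }
  for (double x : g.vertex_data) std::cout << x << " ";
  std::cout << "\n";
}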

Example: Page Rank

Graph = the Web. Update function: multiply adjacent PageRank values by the edge weights and sum them to get the current vertex’s PageRank.

“Prioritized” PageRank computation? Skip converged vertices.
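A self-contained sketch of what such a PageRank-style update computes at one vertex: a weighted sum of the adjacent ranks plus a damping term. This is illustrative C++ only, not the GraphLab API; the three-vertex web graph and damping factor are made up.

#include <cstdio>
#include <vector>

struct InEdge { int src; double weight; };   // incoming link and its weight

int main() {
  const double damping = 0.85;
  std::vector<double> rank = {1.0, 1.0, 1.0};            // one value per vertex
  // Incoming edges per vertex (a tiny made-up web graph).
  std::vector<std::vector<InEdge>> in = {
      {{1, 0.5}, {2, 1.0}},   // vertex 0 is linked from 1 and 2
      {{0, 0.5}},             // vertex 1 is linked from 0
      {{0, 0.5}, {1, 0.5}}};  // vertex 2 is linked from 0 and 1

  // "Update function" for one vertex: multiply adjacent PageRank values by
  // the edge weights, sum them, and mix in the damping term.
  auto update = [&](int v) {
    double acc = 0;
    for (const InEdge& e : in[v]) acc += e.weight * rank[e.src];
    rank[v] = (1.0 - damping) + damping * acc;
  };

  // Repeated sweeps; a prioritized scheduler would instead skip converged vertices.
  for (int iter = 0; iter < 30; ++iter)
    for (int v = 0; v < 3; ++v) update(v);

  for (int v = 0; v < 3; ++v) std::printf("rank[%d] = %f\n", v, rank[v]);
}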

Example: K-Means Clustering

A (fully connected?) bipartite graph between data vertices and cluster vertices.

Update functions:

Cluster update: compute the average of the data connected by a “marked” edge.

Data update: pick the closest cluster and mark that edge; unmark the remaining edges.
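The two update functions can be sketched in ordinary C++ over arrays rather than GraphLab structures; the 1-D data points and the two clusters below are invented for illustration.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> data    = {1.0, 1.2, 0.8, 5.0, 5.3, 4.7};  // data vertices
  std::vector<double> centers = {0.0, 6.0};                      // cluster vertices
  std::vector<int>    marked(data.size(), 0);  // which edge (cluster) is "marked"

  for (int iter = 0; iter < 10; ++iter) {
    // Data update: pick the closest cluster and mark that edge.
    for (std::size_t i = 0; i < data.size(); ++i) {
      int best = 0;
      for (std::size_t c = 1; c < centers.size(); ++c)
        if (std::fabs(data[i] - centers[c]) < std::fabs(data[i] - centers[best]))
          best = static_cast<int>(c);
      marked[i] = best;
    }
    // Cluster update: average of the data connected by a marked edge.
    for (std::size_t c = 0; c < centers.size(); ++c) {
      double sum = 0;
      int count = 0;
      for (std::size_t i = 0; i < data.size(); ++i)
        if (marked[i] == static_cast<int>(c)) { sum += data[i]; ++count; }
      if (count > 0) centers[c] = sum / count;
    }
  }
  std::printf("centers: %f %f\n", centers[0], centers[1]);
}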

Example: MRF Sampling

Graph = the MRF. Update function:

- Read the samples on adjacent vertices.
- Read the edge potentials.
- Compute a new sample for the current vertex.
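A sketch of the single-vertex sampling step for a binary pairwise MRF: read the neighbors' samples and the edge potential, then draw a new value for the current vertex. The chain graph, coupling strength, and RNG seed are assumptions for the example; this is not GraphLab code.

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  // Binary MRF on a 4-vertex chain; the edge potential favours equal neighbours.
  std::vector<int> sample = {0, 1, 0, 1};
  std::vector<std::vector<int>> nbrs = {{1}, {0, 2}, {1, 3}, {2}};
  const double coupling = 0.8;   // log-potential for agreeing with a neighbour

  // Update function for one vertex: condition on the current neighbour samples.
  auto gibbs_step = [&](int v) {
    double logp[2] = {0.0, 0.0};
    for (int u : nbrs[v])
      for (int x = 0; x < 2; ++x)
        logp[x] += (x == sample[u]) ? coupling : 0.0;   // read edge potential
    double p1 = 1.0 / (1.0 + std::exp(logp[0] - logp[1]));
    sample[v] = (unif(rng) < p1) ? 1 : 0;               // draw the new sample
  };

  for (int sweep = 0; sweep < 100; ++sweep)
    for (int v = 0; v < 4; ++v) gibbs_step(v);

  for (int v = 0; v < 4; ++v) std::printf("x%d = %d\n", v, sample[v]);
}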

Not Message Passing!

The graph is a data structure. Update functions perform parallel modifications to that data structure.

Safety

What if adjacent update functions occur simultaneously?


Importance of Consistency

Is ML resilient to soft optimization? Can we permit races and accept “best-effort” computation?

True for some algorithms.

Not true for many. May work empirically on some datasets; may fail on others.

Importance of Consistency

Many algorithms require strict consistency, or perform significantly better under strict consistency.

Alternating Least Squares: [Plot: error vs. number of iterations (0–30) for consistent vs. inconsistent updates.]

Importance of Consistency

Fast ML algorithm development cycle: Build → Test → Debug → Tweak Model.

The framework must behave predictably and consistently and avoid problems caused by non-determinism.

Is the execution wrong? Or is the model wrong?

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution of update functions, there is a sequential execution that produces the same result.

[Diagram: a parallel timeline on CPU 1 and CPU 2 versus an equivalent sequential timeline on CPU 1.]

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution of update functions, there is a sequential execution that produces the same result.

This is a formalization of the intuitive concept of a “correct program”:

- Computation does not read outdated data from the past.
- Computation does not read the results of computation that occurs in the future.

Primary Property of GraphLab

Global Information

What if we need global information?

Algorithm Parameters?

Sufficient Statistics?

The sum of all the vertex values?

Shared Variables

Global aggregation through the Sync operation: a global parallel reduction over the graph data.

Synced variables are recomputed at defined intervals. Sync computation is sequentially consistent, which permits correct interleaving of Syncs and Updates.

Examples: Sync: log-likelihood; Sync: sum of vertex values.
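To make the Sync idea concrete, a small sketch of a parallel reduction over vertex values using standard C++ threads. The vertex data and the sum-of-vertex-values aggregate are illustrative; this is not how GraphLab implements Sync internally.

#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<double> vertex_value(1000);
  for (std::size_t i = 0; i < vertex_value.size(); ++i) vertex_value[i] = 0.5 * i;

  // Sync-style aggregate: a parallel reduction (here: sum of vertex values).
  const int nthreads = 4;
  std::vector<double> partial(nthreads, 0.0);
  std::vector<std::thread> workers;
  for (int t = 0; t < nthreads; ++t)
    workers.emplace_back([&, t] {
      // Each thread reduces its own contiguous slice of the graph data.
      std::size_t begin = t * vertex_value.size() / nthreads;
      std::size_t end   = (t + 1) * vertex_value.size() / nthreads;
      for (std::size_t i = begin; i < end; ++i) partial[t] += vertex_value[i];
    });
  for (auto& w : workers) w.join();

  double total = 0.0;
  for (double p : partial) total += p;   // combine the per-thread partials
  std::printf("sum of vertex values = %f\n", total);
}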

Sequential Consistency

GraphLab guarantees sequential consistency: for every parallel execution of update functions and Syncs, there is a sequential execution that produces the same result.

[Diagram: a parallel timeline on CPU 1 and CPU 2 versus an equivalent sequential timeline on CPU 1.]

GraphLab in the Cloud


Moving towards the cloud…

Purchasing and maintaining computers is very expensive. Most computing resources are seldom used, only for deadlines… In the cloud you buy time, access hundreds or thousands of processors, and only pay for the resources you need.

Distributed GL Implementation

Mixed multi-threaded / distributed implementation (each machine runs only one instance). Requires all data to be in memory. Move computation to the data.

MPI for management + TCP/IP for communication. Asynchronous C++ RPC layer. Ran on 64 EC2 HPC nodes = 512 processors.

[Architecture diagram, per machine: an RPC controller over the underlying network, a distributed graph, distributed locks, an execution engine with execution threads, and shared data backed by a cache-coherent distributed K-V store.]


GraphLab RPC


Write distributed programs easily

Asynchronous communication. Multithreaded support. Fast. Scalable.

Easy To Use

(Every machine runs the same binary)

I ♥ C++


Features

Easy RPC capabilities.

One-way calls:

rpc.remote_call([target_machine ID], printf, "%s %d %d %d\n", "hello world", 1, 2, 3);

Requests (call with a return value):

std::vector<int>& sort_vector(std::vector<int>& v) {
  std::sort(v.begin(), v.end());
  return v;
}

vec = rpc.remote_request([target_machine ID], sort_vector, vec);

Features

MPI-like primitives:

dc.barrier()
dc.gather(...)
dc.send_to([target machine], [arbitrary object])
dc.recv_from([source machine], [arbitrary object ref])

[Diagram: each machine hosts an object instance context with an RPC controller and a K-V object; MPI-like safety across machines.]

Request Latency

[Plot: request latency vs. value length (16–10240 bytes) for GraphLab RPC vs. MemCached; ping RTT = 90 µs.]

One-Way Call Rate

[Plot: one-way call rate vs. value length (16–10240 bytes) for GraphLab RPC vs. ICE, against the 1 Gbps physical peak.]

Serialization Performance

[Plot: time to issue and receive 100,000 one-way calls of a vector of 10 × {"hello", 3.14, 100}, comparing ICE RPC, buffered RPC, and unbuffered RPC.]

Distributed Computing Challenges

Q1: How do we efficiently distribute the state? (Potentially varying numbers of machines.)

Q2: How do we ensure sequential consistency?

Keeping in mind: limited bandwidth, high latency, performance.

Distributed Graph


Two-stage Partitioning

Initial over-partitioning of the graph → generate the atom graph → repartition as needed.

Ghosting: ghost vertices are copies of neighboring vertices that live on remote machines.

Ghost vertices/edges act as a cache for remote data.

Coherency is maintained using versioning, which decreases bandwidth utilization.
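A rough sketch of version-based coherency for ghost data, with an invented RemoteStore standing in for the owning machine; it only illustrates the "fetch only if the version changed" idea, not GraphLab's actual protocol.

#include <cstdio>
#include <map>
#include <utility>

struct VertexData { double value; };

// Stand-in for the machine that owns the vertex (normally reached over the network).
struct RemoteStore {
  std::map<int, std::pair<int, VertexData>> owned;   // id -> (version, data)
  int version(int id) const { return owned.at(id).first; }
  VertexData fetch(int id) const { return owned.at(id).second; }
};

// Ghost copy: cached remote data plus the version it was fetched at.
struct Ghost {
  int version = -1;
  VertexData data{};
  // Refresh only when the owner's version has advanced, saving bandwidth.
  void refresh(const RemoteStore& owner, int id) {
    int v = owner.version(id);
    if (v != version) { data = owner.fetch(id); version = v; }
  }
};

int main() {
  RemoteStore owner;
  owner.owned[7] = {3, VertexData{1.25}};
  Ghost ghost;
  ghost.refresh(owner, 7);   // first access: pulls the data
  ghost.refresh(owner, 7);   // version unchanged: no transfer needed
  std::printf("ghost value = %f (version %d)\n", ghost.data.value, ghost.version);
}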

Distributed Engine


Distributed Engine

Sequential consistency can be guaranteed through distributed locking, a direct analogue to the shared-memory implementation.

To improve performance, the user provides some “expert knowledge” about the properties of the update function.

Full Consistency

The user says: the update function modifies all the data in its scope.

Acquire write-locks on all vertices in scope.

Limited opportunities for parallelism.

Edge Consistency

The user says: the update function only reads from adjacent vertices.

Acquire a write-lock on the center vertex and read-locks on the adjacent vertices.

More opportunities for parallelism.

Vertex Consistency

The user says: the update function touches neither edges nor adjacent vertices. Acquire a write-lock on the current vertex.

Maximum opportunities for parallelism.
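A sketch of how edge consistency can be enforced with locks in shared memory: a write-lock on the center vertex and read-locks on its neighbors, acquired in increasing vertex-id order so that concurrent updates cannot deadlock. The graph and the averaging update are simplified assumptions, not the distributed locking protocol itself.

#include <algorithm>
#include <cstdio>
#include <shared_mutex>
#include <vector>

struct Graph {
  std::vector<double> data;
  std::vector<std::vector<int>> nbrs;
  std::vector<std::shared_mutex> locks;   // one lock per vertex
  explicit Graph(std::size_t n) : data(n, 1.0), nbrs(n), locks(n) {}
};

// Edge-consistent update: write-lock the center vertex, read-lock its neighbours.
void edge_consistent_update(Graph& g, int v) {
  std::vector<int> order = g.nbrs[v];
  order.push_back(v);
  std::sort(order.begin(), order.end());   // canonical order prevents deadlock
  for (int u : order) {
    if (u == v) g.locks[u].lock();
    else        g.locks[u].lock_shared();
  }

  double sum = 0;                           // safe to read the neighbours now
  for (int u : g.nbrs[v]) sum += g.data[u];
  if (!g.nbrs[v].empty())
    g.data[v] = 0.5 * g.data[v] + 0.5 * (sum / g.nbrs[v].size());

  for (int u : order) {
    if (u == v) g.locks[u].unlock();
    else        g.locks[u].unlock_shared();
  }
}

int main() {
  Graph g(4);
  g.nbrs = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
  edge_consistent_update(g, 0);
  std::printf("data[0] = %f\n", g.data[0]);
}

Vertex consistency would take only the single write-lock, and full consistency would write-lock the whole neighbourhood, which is exactly why it offers the least parallelism.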

Performance Enhancements

Latency Hiding:

- “Pipelining” of many more update-function calls than there are CPUs (about a 1K-deep pipeline).
- Hides the latency of lock acquisition and cache synchronization.

Lock Strength Reduction:

- A trick whereby the number of locks can be decreased while still providing the same guarantees.

Video Cosegmentation

Segments mean the same

Gaussian EM clustering + BP on a 3D grid. Model: 10.5 million nodes, 31 million edges.

Speedups

[Plots: video segmentation speedups.]

Chromatic Distributed Engine

Locking overhead is too high in high-degree models. Can we satisfy sequential consistency in a simpler way?

Observation: scheduling using vertex colorings can be used to automatically satisfy consistency.

Example: Edge Consistency (distance 1) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Example: Full Consistency (distance 2) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Example: Vertex Consistency (distance 0) vertex coloring

Update functions can be executed on all vertices of the same color in parallel.

Chromatic Distributed Engine

On each machine: execute tasks on all vertices of color 0 → data synchronization, completion + barrier → execute tasks on all vertices of color 1 → data synchronization, completion + barrier → …
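A minimal shared-memory sketch of chromatic scheduling: vertices of one color share no edges, so they can be updated in parallel, with a barrier between colors. The 4-cycle graph, its 2-coloring, and the averaging update are invented; a distributed engine would also exchange ghost data at each barrier.

#include <cstdio>
#include <thread>
#include <vector>

int main() {
  // A 4-cycle is 2-colorable: vertices {0,3} get color 0, {1,2} get color 1.
  std::vector<std::vector<int>> nbrs  = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
  std::vector<int>              color = {0, 1, 1, 0};
  std::vector<double>           data  = {1.0, 2.0, 3.0, 4.0};
  const int num_colors = 2;

  auto update = [&](int v) {     // same-color vertices never touch each other
    double sum = 0;
    for (int u : nbrs[v]) sum += data[u];
    data[v] = sum / nbrs[v].size();
  };

  for (int iter = 0; iter < 5; ++iter) {
    for (int c = 0; c < num_colors; ++c) {
      std::vector<std::thread> pool;
      for (int v = 0; v < 4; ++v)
        if (color[v] == c) pool.emplace_back(update, v);
      for (auto& t : pool) t.join();   // barrier: finish color c before color c+1
      // (A distributed engine would also synchronize ghost data here.)
    }
  }
  for (double x : data) std::printf("%f ", x);
  std::printf("\n");
}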

Experiments

Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Diagram: sparse Netflix rating matrix between users and movies, factorized with latent dimension d.]

Netflix

Speedup with increasing size of the matrix factorization:

[Plot: speedup vs. #nodes (4–64) against ideal, for d=100 (159.91 IPB), d=50 (85.68 IPB), d=20 (48.72 IPB), d=5 (44.85 IPB).]

Netflix

[Plot: test error (RMSE, 0.92–1.0) vs. runtime in seconds (log scale) for Hadoop and GraphLab at D=5, 20, 50, 100.]

Netflix

[Plot: runtime in seconds (log scale) vs. #nodes (4–64) for Hadoop, MPI, and GraphLab.]

Experiments

Named Entity Recognition

(part of Tom Mitchell’s NELL project) CoEM Algorithm

Example noun-phrase types extracted from a web crawl:
Food: onion, garlic, noodles, blueberries, beans
Religion: Catholic, Freemasonry, Marxism, Catholic Chr., Humanism
City: Munich, Cape Twn., Seoul, Mexico Cty., Winnipeg

[Figure 5: (a) Netflix prediction error, the test RMSE of the ALS algorithm on the Netflix dataset after 30 iterations with different values of d; (b) video frame; (d) NER examples.]

The graph is rather dense: a small number of vertices connect to almost all the vertices.

The CoEM algorithm [44] labels the remaining noun-phrases and contexts (see Table 5(d)) by alternating between estimating the best assignment to each noun-phrase given the types of its contexts and estimating the type of each context given the types of its noun-phrases.

The GraphLab data graph for the NER problem is bipartite, with vertices corresponding to each noun-phrase on one side and vertices corresponding to each context on the other. There is an edge between a noun-phrase and a context if the noun-phrase occurs in the context. The vertex for both noun-phrases and contexts stores the estimated distribution over types. The edge stores the number of times the noun-phrase appears in that context.

The NER computation is represented in a simple GraphLab update function which computes a weighted sum of probability tables stored on adjacent vertices and then normalizes. Once again, the bipartite graph is naturally two-colored, allowing us to use the chromatic scheduler. Due to the density of the graph, a random partitioning was used. Since the NER computation is relatively lightweight and uses only simple floating point arithmetic, combined with the use of a random partitioning, this application stresses the overhead of the GraphLab runtime as well as the network.
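The update just described (a weighted sum of adjacent probability tables followed by normalization) can be sketched in plain C++ as follows; the tiny bipartite graph, the two types, and the co-occurrence counts are fabricated for illustration, and this is not the actual GraphLab implementation.

#include <cstdio>
#include <vector>

struct Edge { int nbr; double count; };   // co-occurrence count acts as the weight

int main() {
  const int num_types = 2;                 // e.g. "food" vs. "city"
  // Vertices 0-1: noun-phrases, vertices 2-3: contexts (bipartite graph).
  std::vector<std::vector<double>> dist = {
      {1.0, 0.0}, {0.0, 1.0}, {0.5, 0.5}, {0.5, 0.5}};
  std::vector<std::vector<Edge>> edges = {
      {{2, 3.0}, {3, 1.0}},   // noun-phrase 0
      {{2, 1.0}, {3, 2.0}},   // noun-phrase 1
      {{0, 3.0}, {1, 1.0}},   // context 2
      {{0, 1.0}, {1, 2.0}}};  // context 3

  // Update function: weighted sum of adjacent probability tables, then normalize.
  auto update = [&](int v) {
    std::vector<double> acc(num_types, 0.0);
    double total = 0.0;
    for (const Edge& e : edges[v])
      for (int t = 0; t < num_types; ++t) acc[t] += e.count * dist[e.nbr][t];
    for (double a : acc) total += a;
    if (total > 0)
      for (int t = 0; t < num_types; ++t) dist[v][t] = acc[t] / total;
  };

  // The bipartite graph is two-colorable: alternate context and noun-phrase sweeps.
  for (int iter = 0; iter < 10; ++iter) {
    update(2); update(3);   // contexts (one color)
    update(0); update(1);   // noun-phrases (the other color)
  }
  std::printf("P(type | noun-phrase 0) = {%f, %f}\n", dist[0][0], dist[0][1]);
}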

5.4 Other Applications

In the course of our research, we have also implemented several other algorithms, which we describe briefly:

Gibbs Sampling on a Markov Random Field. The task is to compute a probability distribution for a graph of random variables by sampling. The algorithm proceeds by sampling a new value for each variable in turn, conditioned on the assignments of the neighboring variables. Strict sequential consistency is necessary to preserve statistical properties [22].

Bayesian Probabilistic Tensor Factorization (BPTF). This is a probabilistic Markov-Chain Monte Carlo version of Alternating Least Squares that also incorporates a time factor into the prediction. In this case, the tensor R is decomposed into three matrices U, V and T, which can be represented in GraphLab as a tripartite graph.

In addition, GraphLab has been used successfully in several other research projects, such as clustering communities in the Twitter network, collaborative filtering for BBC TV data, as well as non-parametric Bayesian inference.

6 Evaluation

We evaluated GraphLab on the three applications (Netflix, CoSeg and NER) described above using important large-scale real-world problems (see Table 2). We used the Chromatic engine for the Netflix and NER problems and the Locking Engine for the CoSeg application. Equivalent Hadoop and MPI implementations were also tested for both the Netflix and the NER application. An MPI implementation of the asynchronous prioritized LBP algorithm needed for CoSeg requires building an entirely new asynchronous sequentially consistent system and is beyond the scope of this work.

Experiments were performed on Amazon's Elastic Compute Cloud (EC2) using up to 64 High Performance Cluster (HPC) instances (cc1.4xlarge). The HPC instances (as of February 2011) have 2 x Intel Xeon X5570 quad-core Nehalem processors with 22 GB of memory, connected by a low-latency 10 GigaBit Ethernet network. All our timings include data loading time and are averaged over three or more runs. Our principal findings are:

- GraphLab is fast! On equivalent tasks, GraphLab outperforms Hadoop by 20x-60x and is as fast as custom-tailored MPI implementations.
- GraphLab's performance scaling improves with higher computation to communication ratios. When communication requirements are high, GraphLab can saturate the network, limiting scalability.
- The GraphLab abstraction more compactly expresses the Netflix, NER and CoSeg algorithms than MapReduce or MPI.

6.1 Scaling Performance

In Fig. 6(a) we present the parallel speedup of GraphLab when run on 4 to 64 HPC nodes. Speedup is measured relative to the 4 HPC node running time. On each node, …

Named Entity Recognition (CoEM)

[Plot: CoEM speedup vs. #nodes (4–64) compared with ideal speedup.]

[Plot: bandwidth utilization vs. #nodes (8–64) for NER, Netflix and CoSeg, relative to the bandwidth bound.]

[Plot: runtime in seconds (log scale) vs. #nodes (4–64) for Hadoop, GraphLab, and MPI on the NER task.]

Future Work

Distributed GraphLab

Fault tolerance → spot instances → cheaper. Graphs using an off-memory store (disk/SSD). GraphLab as a database. Self-optimized partitioning. Fast data → graph construction primitives.

GPU GraphLab ?

Supercomputer GraphLab ?

Is GraphLab the answer to Life, the Universe and Everything? Probably not.


GraphLab


graphlab.ml.cmu.edu

Parallel/distributed implementation. LGPL (highly probable switch to MPL in a few weeks).

Danny Bickson Marketing Agency bickson.blogspot.com

Very fast matrix factorization implementations, other examples, installation, comparisons, etc

Carnegie Mellon

SVD, CoEM, Matrix Factorization, Bayesian Tensor Factorization, Lasso, Gibbs Sampling

Questions?

PageRank, SVM, Dynamic Block Gibbs Sampling, Belief Propagation, many others…


Video Cosegmentation

Naïve idea: treat patches independently and use Gaussian EM clustering (on image features).

E step: predict the membership of each patch given the cluster centers.
M step: compute the cluster centers given the memberships of each patch.

This does not take relationships among patches into account!
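For the naïve idea, a bare-bones sketch of Gaussian EM on 1-D stand-in features with two clusters and a fixed variance; the feature values are invented, and the real pipeline runs full Gaussian EM on image feature vectors.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // 1-D stand-ins for per-patch image features (invented values).
  std::vector<double> feat = {0.9, 1.1, 1.0, 4.8, 5.2, 5.0};
  std::vector<double> mu   = {0.0, 6.0};     // cluster centers
  const double sigma2 = 1.0;                 // fixed variance for simplicity

  std::vector<std::vector<double>> resp(feat.size(), std::vector<double>(mu.size()));
  for (int iter = 0; iter < 20; ++iter) {
    // E step: predict the (soft) membership of each patch given the centers.
    for (std::size_t i = 0; i < feat.size(); ++i) {
      double norm = 0;
      for (std::size_t k = 0; k < mu.size(); ++k) {
        double d = feat[i] - mu[k];
        resp[i][k] = std::exp(-0.5 * d * d / sigma2);
        norm += resp[i][k];
      }
      for (std::size_t k = 0; k < mu.size(); ++k) resp[i][k] /= norm;
    }
    // M step: recompute the cluster centers from the (soft) memberships.
    for (std::size_t k = 0; k < mu.size(); ++k) {
      double wsum = 0, w = 0;
      for (std::size_t i = 0; i < feat.size(); ++i) {
        wsum += resp[i][k] * feat[i];
        w    += resp[i][k];
      }
      if (w > 0) mu[k] = wsum / w;
    }
  }
  std::printf("cluster centers: %f %f\n", mu[0], mu[1]);
}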

Video Cosegmentation

Better idea: connect the patches using an MRF. Set edge potentials so that adjacent (spatially and temporally) patches prefer to be of the same cluster.

Gaussian EM clustering with a twist:

E step: build unary potentials for each patch using the cluster centers, then predict the membership of each patch using BP.
M step: compute the cluster centers given the memberships of each patch.

D. Batra, et al. iCoseg: Interactive co-segmentation with intelligent scribble guidance. CVPR 2010.

Distributed Memory Programming APIs …do not make it easy…

MPI, Global Arrays, GASNet, ARMCI, etc.

- Synchronous computation.
- Insufficient primitives for multi-threaded use.
- Also, not exactly easy to use… fine only if all your data is an n-D array.
- Direct remote pointer access, with severe limitations depending on the system architecture.