graphx@ampcamp3 - UC Berkeley AMP Camp

GraphX:
Graph Analytics on Spark
Joseph Gonzalez, Reynold Xin,
Ion Stoica, Michael Franklin
Developed at the UC Berkeley AMPLab
AMPCamp: August 29, 2013
Graphs are Essential to Data Mining and Machine Learning
Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
Predicting Political Bias
[Figure: graph of people and Post nodes; a few nodes are labeled Liberal or Conservative and the rest are unknown (?). Panel labels: Conditional Random Field, Belief Propagation]
Triangle Counting
Count the triangles passing through each vertex
[Figure: example graph with per-vertex triangle counts 2, 3, 1, 4]
Measures “cohesiveness” of local community
Fewer Triangles: Weaker Community
More Triangles: Stronger Community
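The per-vertex count described above can be sketched in plain Scala collections. This is only an illustration on a small in-memory edge list, not the distributed algorithm behind the benchmark numbers later in the deck; the object name `TriangleSketch` and the toy input are assumptions made for the example.

```scala
object TriangleSketch {
  // Undirected edges as pairs of vertex ids (hypothetical toy representation).
  type Edge = (Int, Int)

  /** Number of triangles passing through each vertex that appears in an edge. */
  def trianglesPerVertex(edges: Seq[Edge]): Map[Int, Int] = {
    // Build a symmetric adjacency map, dropping self-loops and duplicate edges.
    val adj: Map[Int, Set[Int]] = edges
      .filter { case (u, v) => u != v }
      .flatMap { case (u, v) => Seq(u -> v, v -> u) }
      .groupBy(_._1)
      .map { case (u, pairs) => u -> pairs.map(_._2).toSet }

    adj.map { case (v, nbrs) =>
      // A triangle through v is an unordered pair of v's neighbors that are
      // themselves connected.
      val count = nbrs.toSeq.combinations(2).count {
        case Seq(a, b) => adj(a).contains(b)
        case _         => false
      }
      v -> count
    }
  }
}
```

For example, `TriangleSketch.trianglesPerVertex(Seq((1, 2), (2, 3), (1, 3), (3, 4)))` gives one triangle each for vertices 1, 2 and 3, and none for vertex 4.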
Collaborative Filtering
[Figure: Users and Items linked by Ratings]
Many More Graph Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
– SVD
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Graph Analytics
– PageRank
– Single Source Shortest Path
– Triangle-Counting
– Graph Coloring
– K-core Decomposition
– Personalized PageRank
• Classification
– Neural Networks
– Lasso
…
Structure of Computation
[Figure: Data-Parallel computation over a Table of Rows producing a Result, beside Graph-Parallel computation over a Dependency Graph]
The Graph-Parallel Abstraction
A user-defined Vertex-Program runs on each vertex
Graph constrains interaction along edges
Using messages (e.g. Pregel [PODC’09, SIGMOD’10])
Through shared state (e.g., GraphLab [UAI’10, VLDB’12])
Parallelism: run multiple vertex programs simultaneously
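A minimal sketch of the message-passing flavor of this abstraction, in plain Scala rather than any real Pregel or GraphLab API. The vertex program here is "keep the smallest label you have heard of", which labels connected components. The supersteps run sequentially in this sketch, and the names (`VertexProgramSketch`, `minLabel`) are hypothetical.

```scala
object VertexProgramSketch {
  /** Pregel-style label propagation: every vertex ends with the smallest id in its component.
    * Assumes every edge endpoint is contained in `vertices`.
    */
  def minLabel(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Treat edges as undirected so messages can flow both ways.
    val nbrs: Map[Int, Seq[Int]] = edges
      .flatMap { case (u, v) => Seq(u -> v, v -> u) }
      .groupBy(_._1)
      .map { case (u, ps) => u -> ps.map(_._2) }

    var state: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var active: Set[Int] = vertices // vertices whose label changed in the last superstep

    while (active.nonEmpty) {
      // Message phase: each active vertex sends its label along its edges;
      // messages to the same destination are combined with min.
      val inbox: Map[Int, Int] = active.toSeq
        .flatMap(v => nbrs.getOrElse(v, Seq.empty).map(n => n -> state(v)))
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2).min }

      // Vertex-program phase: a vertex adopts the incoming label only if it is smaller.
      val updates = inbox.filter { case (v, label) => label < state(v) }
      state = state ++ updates
      active = updates.keySet
    }
    state
  }
}
```

Interaction is constrained to edges: a vertex only ever reads labels sent by its neighbors. In a graph-parallel system each superstep's vertex programs would run simultaneously across machines.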
By exploiting graph structure, Graph-Parallel systems can be orders of magnitude faster.
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles
Hadoop [WWW’11]: 1536 Machines, 423 Minutes
64 Machines, 15 Seconds
1000x Faster
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Specialized Graph Systems
Pregel
Specialized Graph Systems
1. APIs to capture complex graph dependencies
2. Exploit graph structure to reduce communication and computation
[Figure: example graph over vertices A to F]
Why GraphX?
The Bigger Picture
[Figure: bar chart of Time Spent in Data Pipeline (0% to 100%) with stages Graph Creation, Graph Algorithms, and Post Proc.; other labels: Hadoop, GraphLab, Vertices, Edges]
Limitations of Specialized
Graph-Parallel Systems
No support for Construction & Post Processing
Not interactive
Requires maintaining multiple platforms
Spark excels at these!
GraphX Unifies Data-Parallel and Graph-Parallel Systems
Spark: Table API (RDDs, fault tolerance, and task scheduling)
GraphLab: Graph API (graph representation and execution)
One system for the entire graph pipeline
[Figure: pipeline bar from 0% to 100% covering Graph Construction, Computation, and Post-Processing]
Enable Joining Tables and Graphs
[Figure: pipeline diagram with labels User Data, Product Ratings, ETL, Friend Graph, Join, Product Rec. Graph, Inf., and Prod. Rec.; Tables and Graphs are marked separately]
The GraphX Resilient Distributed Graph

Id          Attribute (V)
rxin        (Stu., Berk.)
jegonzal    (PstDoc, Berk.)
franklin    (Prof., Berk.)
istoica     (Prof., Berk.)

SrcId       DstId       Attribute (E)
rxin        jegonzal    Friend
franklin    rxin        Advisor
istoica     franklin    Coworker
franklin    jegonzal    PI

[Figure: the same graph drawn with vertices R, F, J, I]
GraphX API
class Graph[V, E] {
  // Table Views
  def vertices: RDD[(Id, V)]
  def edges: RDD[(Id, Id, E)]
  def triplets: RDD[((Id, V), (Id, V), E)]
  // Transformations
  def reverse: Graph[V, E]
  def filterV(p: (Id, V) => Boolean): Graph[V, E]
  def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]
  // Joins
  def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
  def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]
  // Computation
  def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                            reduceF: (T, T) => T,
                            direction: EdgeDir): Graph[T, E]
}
Aggregate Neighbors
Map-Reduce for each vertex
mapF(edge A-B) gives a1
mapF(edge A-C) gives a2
reduceF(a1, a2)
[Figure: vertex A with neighbors B, C, D, E, F]
Example: Oldest Follower
What is the age of the oldest follower for each user?
val followerAge =
  graph.aggNbrs(
    e => e.src.age,  // MapF
    max(_, _),       // ReduceF
    InEdges).vertices
[Figure: graph over vertices A to F with age labels 23, 42, 30, 19, 16, 75]
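To make the map and reduce steps of this example concrete, here is a minimal local sketch in plain Scala. It imitates the shape of the slide's `aggregateNeighbors` signature but is not the GraphX implementation. `Edge`, `EdgeDir` and the helper itself are hypothetical stand-ins, the names and ages are invented, and the result is a plain `Map` that only contains vertices with at least one matching edge (the slide's version returns a `Graph[T, E]`).

```scala
object AggregateNeighborsSketch {
  // Hypothetical stand-ins for the slide's types; these are not the GraphX classes.
  final case class Edge[V](srcId: String, src: V, dstId: String, dst: V)
  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  /** Map every edge to a value, then reduce the values collected at the same vertex. */
  def aggregateNeighbors[V, T](
      vertices: Map[String, V],
      edges: Seq[(String, String)],
      mapF: Edge[V] => T,
      reduceF: (T, T) => T,
      direction: EdgeDir): Map[String, T] =
    edges
      .map { case (s, d) =>
        val e = Edge(s, vertices(s), d, vertices(d))
        // InEdges collects the mapped value at the destination, OutEdges at the source.
        val target = direction match {
          case InEdges  => d
          case OutEdges => s
        }
        target -> mapF(e)
      }
      .groupBy(_._1)
      .map { case (id, vals) => id -> vals.map(_._2).reduce(reduceF) }

  // Toy data, invented for illustration: an edge (a, b) means "a follows b".
  val ages: Map[String, Int] = Map("alice" -> 23, "bob" -> 42, "carol" -> 30, "dave" -> 19)
  val follows: Seq[(String, String)] = Seq("bob" -> "alice", "carol" -> "alice", "dave" -> "bob")

  // Oldest follower: map each in-edge to the follower's age, reduce with max.
  val oldestFollower: Map[String, Int] =
    aggregateNeighbors[Int, Int](ages, follows, e => e.src, (a, b) => math.max(a, b), InEdges)
}
```

With this toy data, alice is followed by bob (42) and carol (30), so her oldest follower is 42; bob's only follower is dave (19); carol and dave have no followers and do not appear in the result.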
We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
Performance Optimizations
Replicate & co-partition vertices with edges
» GraphLab (PowerGraph) style vertex-cut partitioning
» Minimize communication by avoiding edge data movement in JOINs
In-memory hash index for fast joins
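A rough sketch of what vertex-cut partitioning means, in plain Scala: each edge is placed in exactly one partition, and a vertex is replicated to every partition that holds one of its edges. Assigning edges by hashing is an assumption made here for simplicity; the slides do not say which placement heuristic GraphX uses, and the names below are hypothetical.

```scala
object VertexCutSketch {
  type Edge = (Int, Int)

  /** Assign each edge to exactly one partition (here by hashing the edge; an assumption). */
  def partitionEdges(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(e => Math.floorMod(e.hashCode, numParts))

  /** For each vertex, the set of partitions that need a replica of it. */
  def replicas(parts: Map[Int, Seq[Edge]]): Map[Int, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap { case (u, v) => Seq(u -> p, v -> p) } }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

Edge data never has to move once placed; only vertex attributes are shipped to the replicas, which is the communication that a good cut tries to keep small.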
Early Performance
Runtime (in seconds, PageRank for 10 iterations)
Hadoop      1340
GraphX       165
GraphLab      22
In Progress Optimizations
Byte-code inspection of user functions
» E.g., if mapF does not need edge data, we can rewrite the query to delay the join
Execution strategies optimizer
» Scan edges randomly accessing vertices
» Scan vertices randomly accessing edges
Current Implementation
[Figure: layered stack, top to bottom]
PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40)
Pregel (20), GraphLab (20)
GraphX
Spark (relational operators)
Demo
Reynold Xin
Summary
1. Graph-parallel primitives on Spark.
2. Currently slower than GraphLab, but
» No need for specialized systems
» Easier ETL, and easier consumption of output
» Interactive graph data mining
3. Future work will bring performance closer to specialized engines.
Status
Currently finalizing the APIs
» Feedback wanted: http://bit.ly/graph-api
Also working on improving system performance
Will be part of Spark 0.9
Questions?
Backup slides
Vertex Cut Partitioning
[Figure, built over two slides: a graph over vertices A to F, then the same graph divided among Partition 1, Partition 2, and Partition 3]
aggregateNeighbors
[Figure, built over four slides: graph with vertex A and neighbors B to F; the map step produces map(B), map(C), map(D), map(E), map(F); the reduce step combines them]
Example: Vertex Degree
map: 1
reduce: sum
[Figure, built over three slides: each of the five edges at A maps to 1; sum: 5]
Result: A: 5, B: 0, C: 0, D: 0, E: 0, F: 0
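The degree example is small enough to replay in plain Scala. The sketch below assumes all five edges point at A, which would explain why only A gets a non-zero count; the transcript does not show the edge directions, so treat that as a guess. `VertexDegreeSketch` and `inDegrees` are hypothetical names.

```scala
object VertexDegreeSketch {
  /** In-degree of every vertex: map each edge to 1 at its destination, reduce with sum. */
  def inDegrees(vertices: Set[String], edges: Seq[(String, String)]): Map[String, Int] = {
    val counted: Map[String, Int] = edges
      .map { case (_, dst) => dst -> 1 }                    // map: 1
      .groupBy(_._1)
      .map { case (id, ones) => id -> ones.map(_._2).sum }  // reduce: sum
    // Vertices that received nothing get 0.
    vertices.map(v => v -> counted.getOrElse(v, 0)).toMap
  }
}
```

Calling `inDegrees(Set("A", "B", "C", "D", "E", "F"), Seq("B" -> "A", "C" -> "A", "D" -> "A", "E" -> "A", "F" -> "A"))` reproduces the result on the slide: 5 for A and 0 for every other vertex.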
Example: Oldest Follower
What is the age of the oldest follower for each user?
val followerAge =
  graph.aggNbrs(
    e => e.src.age,  // MapF
    max(_, _),       // ReduceF
    InEdges).vertices
[Figure: graph over vertices A to F]
Specialized Graph Systems
Messaging: Pregel [PODC’09, SIGMOD’10]
Shared State: GraphLab [UAI’10, VLDB’12]
Many Others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …