School of Computer Science Carnegie Mellon University Dept. of ECE University of Minnesota
ParCube: Sparse Parallelizable Tensor Decompositions
Evangelos E. Papalexakis
1 , Christos Faloutsos 1 , Nikos Sidiropoulos 2
1 Carnegie Mellon University, School of Computer Science 2 University of Minnesota, ECE Department
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Bristol, UK, September 24th–28th, 2012.
Outline
• Introduction
• Problem Statement
• Method
• Experiments
• Conclusions
Evangelos Papalexakis (CMU) – ECML-PKDD 2012 2
Introduction
• Facebook has ~800 million users
• The network evolves over time
• How do we spot interesting patterns & anomalies in this very large network?
Introduction
• Suppose we have knowledge-base data, e.g. from the Read the Web project at CMU
• Subject–verb–object triplets, mined from the web
• Many gigabytes or terabytes of data!
• How do we find potential new synonyms for a word using this knowledge base?
Introduction to Tensors
• Tensors are multidimensional generalizations of matrices
• The previous problems can be formulated as tensors:
• Time-evolving graphs/social networks
• Multi-aspect data (e.g. subject, verb, object)
• Focus on 3-way tensors: can be viewed as data cubes, indexed by 3 variables (I×J×K)
[Figure: a data cube with modes labeled subject, verb, object]
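To make the data-cube view concrete, a subject–verb–object triple store can be held as a sparse 3-way tensor keyed by coordinate triples. The following is a minimal Python sketch (the triples, the `coord` helper, and the dictionary layout are illustrative choices, not part of the talk):

```python
from collections import defaultdict

# Toy (subject, verb, object) triples, as a Read-the-Web-style system might mine
triples = [("Obama", "is", "president"), ("Obama", "visited", "Berlin"),
           ("Merkel", "visited", "Berlin")]

def coord(table, key):
    """Assign each distinct string the next free coordinate along its mode."""
    return table.setdefault(key, len(table))

subjects, verbs, objects = {}, {}, {}
tensor = defaultdict(float)  # (i, j, k) -> count; zero entries are never stored
for s, v, o in triples:
    tensor[(coord(subjects, s), coord(verbs, v), coord(objects, o))] += 1.0
```

Only the observed triples occupy storage, which is exactly why very large but sparse tensors of this kind are workable at all.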
Introduction to Tensors
• PARAFAC decomposition: decompose a tensor into a sum of outer products / rank-1 tensors
• Each rank-1 tensor is a different group/"concept"
• "Similar" to the Singular Value Decomposition in the matrix case
• Store the factor vectors a_i, b_i, c_i as columns of matrices A, B, C
[Figure: rank-1 components of a subject × object × verb tensor, e.g. groups for "leaders/CEOs" and "products"]
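The "sum of outer products" view translates directly into code. A small NumPy sketch (illustrative, not from the talk) that rebuilds a tensor from factor matrices A, B, C:

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """Rebuild the I x J x K tensor approximation as a sum of F rank-1 terms:
    one outer product of columns a_f, b_f, c_f per component f."""
    F = A.shape[1]
    X = np.zeros((A.shape[0], B.shape[0], C.shape[0]))
    for f in range(F):
        # outer product a_f (x) b_f (x) c_f adds one rank-1 "concept"
        X += np.einsum('i,j,k->ijk', A[:, f], B[:, f], C[:, f])
    return X

# with all-ones factors and F = 2, every entry of X equals 2.0
A, B, C = np.ones((4, 2)), np.ones((5, 2)), np.ones((3, 2))
X = parafac_reconstruct(A, B, C)
```

If the columns of A, B, C are sparse, each rank-1 term touches only a small block of the tensor, which is the interpretability argument made later in the talk.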
Outline
• Introduction
• Problem Statement
• Method
• Experiments
• Conclusions
Why not PARAFAC?
• Today's datasets are on the order of terabytes; e.g. Facebook has ~800 million users!
• Explosive complexity/run time for truly large datasets!
• Also, the data is very sparse
• We need the decomposition factors to be sparse:
• Better interpretability / less noise
• Enables multi-way soft co-clustering
• PARAFAC factors are dense!
Problem Statement
• Wish-list:
• Significantly drop the dimensionality (ideally by 1 or more orders of magnitude)
• Parallelize the computation (ideally split the problem into independent parts and run them in parallel)
• Yield sparse factors
• Don't lose much accuracy in the process
Previous work
• A.H. Phan et al., "Block decomposition for very large-scale nonnegative tensor factorization": partition & merge parallel algorithm for non-negative PARAFAC; no sparsity
• Q. Zhang et al., "A parallel nonnegative tensor factorization algorithm for mining global climate data"
• D. Nion et al., "Adaptive algorithms to track the PARAFAC decomposition of a third-order tensor" & J. Sun et al., "Beyond streams and graphs: dynamic tensor analysis": the tensor is a stream; both methods seek to track the decomposition
• C.E. Tsourakakis, "MACH: Fast randomized tensor decompositions" & J. Sun et al., "MultiVis: Content-based social network exploration through multi-way visual analysis": sampling-based TUCKER models
• E.E. Papalexakis et al., "Co-clustering as multilinear decomposition with sparse latent factors": sparse PARAFAC algorithm applied to co-clustering
Our proposal
• We introduce ParCube and set the following goals:
• Goal 1: Fast (scalable & parallelizable)
• Goal 2: Sparse (ability to yield sparse latent factors and a sparse tensor approximation)
• Goal 3: Accurate (provable correctness in merging partial results, under appropriate conditions)
Outline
• Introduction
• Problem Statement
• Method
• Experiments
• Conclusions
ParCube: The big picture
• Break up the tensor into small pieces using biased sampling
• Fit a dense PARAFAC decomposition on each small sampled tensor independently
• Match columns across samples and distribute the non-zero values to the appropriate indices in the original (non-sampled) space
• The factors will be sparse by construction
[Fig. 3: Example of rank-1 ParCube (Algorithm 3). Create r sampled tensors from X using Algorithm 1, fit a dense PARAFAC decomposition with F = 1 on each independent sample to obtain r triplets of vectors (a_i, b_i, c_i) corresponding to the first component of X; as a final step, combine those r triplets.]
Algorithm 2: Basic ParCube for Non-negative PARAFAC
Input: Tensor X of size I × J × K, number of components F, sampling factor s.
Output: Factor matrices A, B, C of size I × F, J × F, K × F respectively.
1: Run BiasedSample(X, s) (Algorithm 1) and obtain X_s along with the sampled index sets I, J, K.
2: Run Non-Negative PARAFAC(X_s, F) and obtain A_s, B_s, C_s of size I/s × F, J/s × F and K/s × F.
3: A(I, :) = A_s, B(J, :) = B_s, C(K, :) = C_s.
Algorithm 3: ParCube for Non-negative PARAFAC with repetition
Input: Tensor X of size I × J × K, number of components F, sampling factor s, number of repetitions r.
Output: PARAFAC factor matrices A, B, C of size I × F, J × F, K × F respectively, and vector λ of size F × 1 which contains the scale of each component.
1: Initialize A, B, C to all-zeros.
2: Randomly, using mode densities as bias, select a set of 100p% (p ∈ [0, 1]) indices I_p, J_p, K_p to be common across all repetitions.
3: for i = 1 ··· r do
4: Run Algorithm 2 with sampling factor s, using I_p, J_p, K_p as a common reference among all r different samples, and obtain A_i, B_i, C_i. The sampling is made on the set difference of the set of all indices and the set of common indices.
5: Calculate the ℓ2 norm of the columns of the common part, for f = 1 ··· F: n_a(f) = ‖A_i(I_p, f)‖_2, n_b(f) = ‖B_i(J_p, f)‖_2, n_c(f) = ‖C_i(K_p, f)‖_2. Normalize the columns of A_i, B_i, C_i using n_a, n_b, n_c and set λ_i(f) = n_a(f) n_b(f) n_c(f). Note that the common part will now be normalized to unit norm.
6: end for
7: A = FactorMerge(A_i), B = FactorMerge(B_i), C = FactorMerge(C_i)
8: λ = average of λ_i.
The ParCube method
• Key ideas:
• Use biased sampling to sample rows, columns & fibers, with sampling weights given by the mode densities (marginal sums)
• During sampling, always keep a common portion of indices across samples
• For each smaller tensor, compute the PARAFAC decomposition
• Need to specify 2 parameters:
• Sampling factor s: the initial dimensions I, J, K become I/s, J/s, K/s
• Number of repetitions / different sampled tensors: r
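The biased-sampling step can be sketched as follows. This is a dense-array NumPy illustration under simplifying assumptions (the real algorithm works on sparse tensors and additionally keeps a common index set across repetitions); the function name and interface are made up for this sketch:

```python
import numpy as np

def biased_sample(X, s, rng):
    """Pick I/s, J/s, K/s indices per mode, biased by mode 'density'
    (the marginal mass of each slice), then restrict X to those indices."""
    kept = []
    for mode in range(3):
        other = tuple(m for m in range(3) if m != mode)
        density = np.abs(X).sum(axis=other)      # marginal sums along this mode
        n = max(1, X.shape[mode] // s)
        idx = rng.choice(X.shape[mode], size=n, replace=False,
                         p=density / density.sum())
        kept.append(np.sort(idx))
    I, J, K = kept
    return X[np.ix_(I, J, K)], kept

rng = np.random.default_rng(0)
X = rng.random((10, 8, 6))
Xs, (I, J, K) = biased_sample(X, 2, rng)   # Xs has shape (5, 4, 3)
```

Sampling with probability proportional to marginal mass keeps the dense "interesting" slices and tends to discard mostly-empty ones, which is what makes the small sub-tensors representative.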
Putting the pieces together
• Say we have factor matrices A_s from each sample; the factors may be re-ordered (permuted) across samples
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the "upper" (common) part, by construction
• Proposition: under mild conditions, the algorithm will stitch the components correctly and output what exact PARAFAC would
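The stitching step can be illustrated with a small sketch. Columns of each partial factor matrix are matched to a reference sample via their similarity on the shared (common) rows, then scattered back to the original indices. This is a simplified stand-in for the talk's FactorMerge, assuming the common rows come first in every partial and the columns are normalized as in Algorithm 3:

```python
import numpy as np

def factor_merge(partials, index_sets, n_common, I):
    """Stitch partial factor matrices into one I x F matrix.
    partials[i] holds the rows of sample i; index_sets[i] maps those rows back
    to original indices; the first n_common rows of every partial are the
    shared block, in the same order."""
    F = partials[0].shape[1]
    A = np.zeros((I, F))
    ref = partials[0][:n_common]              # common block of the reference
    for A_i, idx in zip(partials, index_sets):
        sim = ref.T @ A_i[:n_common]          # F x F column similarities
        perm = np.argmax(sim, axis=1)         # greedy column matching
        A[idx] = A_i[:, perm]                 # scatter into original indices
    return A

# two samples sharing common rows {0, 1}; sample 2 has its columns swapped
A_true = np.array([[1., 0.], [0., 1.], [.5, .2], [.3, .7], [.9, .1], [.2, .8]])
p1 = A_true[[0, 1, 2, 3]]
p2 = A_true[[0, 1, 4, 5]][:, [1, 0]]
A = factor_merge([p1, p2], [np.array([0, 1, 2, 3]), np.array([0, 1, 4, 5])], 2, 6)
```

Rows that were never sampled in any repetition stay exactly zero, which is why the merged factors are sparse by construction.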
Outline
• Introduction
• Problem Statement
• Method
• Experiments
• Conclusions
Experiments
• We use the Tensor Toolbox for Matlab: PARAFAC as baseline and core implementation
• Evaluation of performance:
• Algorithm correctness
• Execution speedup
• Factor sparsity
Experiments – Correctness for multiple repetitions
• Relative cost = ParCube approximation cost / PARAFAC approximation cost
• The more samples we take, the closer we get to exact PARAFAC
• Experimental validation of our theoretical result
Experiments – Correctness & Speedup for 1 repetition
• Relative cost = ParCube approximation cost / PARAFAC approximation cost
• Speedup = PARAFAC execution time / ParCube execution time
• Extrapolation to parallel execution for 4 repetitions yields a 14.2× speedup (and improves accuracy)
Experiments – Correctness & Sparsity
• Output size = NNZ(A) + NNZ(B) + NNZ(C)
• 90% sparser than PARAFAC, while maintaining the same approximation error
Experiments – Knowledge Discovery
• ENRON email/social network: 186 × 186 × 44
• Network traffic data (LBNL): 65170 × 65170 × 65327
• FACEBOOK wall posts: 63891 × 63890 × 1847
• Knowledge-base data (Never Ending Language Learner, NELL): 14545 × 14545 × 28818
Discovery – ENRON
• Who-emailed-whom data from the ENRON email dataset
• Spans 44 months: a 186 × 186 × 44 tensor
• We picked s = 2, r = 4
• We were able to identify social cliques and spot spikes that correspond to actual important events in the company's timeline
Discovery – LBNL Network Data
• Network traffic data of the form (src IP, dst IP, port #): a 65170 × 65170 × 65327 tensor
• We picked s = 5, r = 10
• We were able to identify a possible port-scanning attack (a component concentrated on 1 src and 1 dst)
Discovery – FACEBOOK Wall posts
• A small portion of Facebook's users: 63890 users over 1847 days
• Data in the form (wall owner, poster, timestamp)
• Downloaded from http://socialnetworks.mpi-sws.org/data-wosn2009.html
• We picked s = 100, r = 10
• We were able to identify a birthday-like event (1 wall, 1 day)
Discovery – NELL
• Knowledge-base data taken from the Read The Web project at CMU: http://rtw.ml.cmu.edu/rtw/
• Special thanks to Tom Mitchell for the data
• (Noun phrase, context, noun phrase) triplets, e.g. 'Obama' – 'is' – 'the president of the United States'
• Goal: discover words that may be used in the same context
• We picked s = 500, r = 10
Outline
• Introduction
• Problem Statement
• Method
• Experiments
• Conclusions
Conclusions
• Goal 1: Fast (scalable & parallelizable)
• Goal 2: Sparse (yields sparse latent factors and a sparse tensor approximation)
• Goal 3: Accurate (provable correctness in merging partial results, under appropriate conditions)
• Experiments also demonstrate that ParCube:
• Enables processing of tensors that don't fit in memory
• Yields interesting findings in diverse knowledge-discovery settings
The End
Evangelos E. Papalexakis
Email: [email protected]
Web: http://www.cs.cmu.edu/~epapalex
Christos Faloutsos
Email: [email protected]
Web: http://www.cs.cmu.edu/~christos
Nicholas Sidiropoulos
Email: [email protected]
Web: http://www.ece.umn.edu/users/nikos/