
School of Computer Science Carnegie Mellon University Dept. of ECE University of Minnesota

ParCube: Sparse Parallelizable Tensor Decompositions

Evangelos E. Papalexakis

1 , Christos Faloutsos 1 , Nikos Sidiropoulos 2

1 Carnegie Mellon University, School of Computer Science 2 University of Minnesota, ECE Department

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Bristol, UK, September 24th–28th, 2012.

Outline

Introduction

Problem Statement
Method
Experiments
Conclusions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 2

Introduction

• Facebook has ~800 million users
 It evolves over time
 How do we spot interesting patterns & anomalies in this very large network?

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 3

Introduction

• Suppose we have knowledge base data
 E.g. the Read the Web project at CMU
 Subject – verb – object triplets, mined from the web
 Many gigabytes or terabytes of data!

 How do we find potential new synonyms to a word using this knowledge base?

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 4

Introduction to Tensors

• Tensors are multidimensional generalizations of matrices
 The previous problems can be formulated as tensors!
 Time-evolving graphs/social networks, multi-aspect data (e.g. subject, verb, object)
• Focus on 3-way tensors
 Can be viewed as data cubes
 Indexed by 3 variables (I×J×K)

[Figure: I × J × K data cube; two of the modes labeled subject and object]
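To make the "data cube" view concrete, here is a minimal sketch (not from the talk; the toy data and names are illustrative) of storing such a sparse 3-way tensor as coordinate triples, along with the per-index marginal density that ParCube's biased sampling relies on later:

```python
import numpy as np

# Toy sparse I x J x K tensor stored as (i, j, k, value) coordinates --
# the natural format for data like (subject, verb, object) counts.
coords = np.array([[0, 1, 0],
                   [2, 0, 1],
                   [1, 1, 1]])      # one (i, j, k) index triple per row
vals = np.array([1.0, 3.0, 2.0])    # the corresponding non-zero entries
shape = (3, 2, 2)                   # I, J, K

# Marginal density of each index along mode 0 (sum over the other modes).
density_i = np.zeros(shape[0])
np.add.at(density_i, coords[:, 0], vals)
print(density_i)                    # [1. 2. 3.]
```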

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 5

Introduction to Tensors

• PARAFAC decomposition
 Decompose a tensor into a sum of outer products / rank-1 tensors
 Each rank-1 tensor is a different group/”concept”
 “Similar” to the Singular Value Decomposition in the matrix case

[Figure: PARAFAC as a sum of rank-1 tensors over the subject × object modes; example components labeled “leaders/CEOs” and “products”]

Store the factor vectors a_i, b_i, c_i as columns of matrices A, B, C.

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 6
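In symbols, the F-component PARAFAC model the slide describes approximates the tensor by a sum of F rank-1 outer products, with the factor vectors collected column-wise in A, B, C:

```latex
\underline{\mathbf{X}} \;\approx\; \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f,
\qquad\text{i.e.}\qquad
X(i,j,k) \;\approx\; \sum_{f=1}^{F} \mathbf{A}(i,f)\,\mathbf{B}(j,f)\,\mathbf{C}(k,f).
```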

Outline

• Introduction

Problem Statement

Method
Experiments
Conclusions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 7

Why not PARAFAC?

• Today’s datasets are on the order of terabytes
 E.g. Facebook has ~800 million users!

• Explosive complexity/run time for truly large datasets!

• Also, the data is very sparse
 We need the decomposition factors to be sparse
 Better interpretability / less noise
 Can do multi-way soft co-clustering this way!
 But PARAFAC is dense!

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 8

Problem Statement

• Wish-list:
 Significantly drop the dimensionality
 Ideally 1 or more orders of magnitude
 Parallelize the computation
 Ideally split the problem into independent parts and run in parallel
 Yield sparse factors
 Don’t lose much in the process

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 9

Previous work

• A.H. Phan et al., Block decomposition for very large-scale nonnegative tensor factorization
 Partition & merge parallel algorithm for non-negative PARAFAC
 No sparsity
• Q. Zhang et al., A parallel nonnegative tensor factorization algorithm for mining global climate data
• D. Nion et al., Adaptive algorithms to track the PARAFAC decomposition of a third-order tensor & J. Sun et al., Beyond streams and graphs: dynamic tensor analysis
 Tensor is a stream; both methods seek to track the decomposition
• C.E. Tsourakakis, MACH: Fast randomized tensor decompositions & J. Sun et al., MultiVis: Content-based social network exploration through multi-way visual analysis
 Sampling-based TUCKER models
• E.E. Papalexakis et al., Co-clustering as multilinear decomposition with sparse latent factors
 Sparse PARAFAC algorithm applied to co-clustering

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 10

Our proposal

• We introduce ParCube and set the following goals:
• Goal 1: Fast
 Scalable & parallelizable
• Goal 2: Sparse
 Ability to yield sparse latent factors and a sparse tensor approximation
• Goal 3: Accurate
 Provable correctness in merging partial results, under appropriate conditions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 11

Outline

Introduction
Problem Statement

Method

Experiments
Conclusions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 12

ParCube: The big picture

Break up the tensor into small pieces using sampling (G1).

Fit a dense PARAFAC decomposition on each small sampled tensor (G2); the resulting factors will be sparse by construction.

Match columns and distribute the non-zero values to the appropriate indices in the original (non-sampled) space (G3).

Fig. 3. Example of rank-1 PARAFAC using ParCube (Algorithm 3). The procedure described is the following: create r independent samples of X using Algorithm 1; run a rank-1 (K = 1) PARAFAC on each sample and obtain r triplets of vectors a_i, b_i, c_i corresponding to the first component of X; as a final step, combine those r triplets.

Evangelos Papalexakis (CMU) – ECML-PKDD 2012

Algorithm 2: Basic ParCube for Non-negative PARAFAC
Input: Tensor X of size I × J × K, number of components F, sampling factor s.
Output: Factor matrices A, B, C of size I × F, J × F, K × F respectively.
1: Run BiasedSample(X, s) (Algorithm 1) and obtain X_s and the sampled index sets I, J, K.
2: Run Non-Negative PARAFAC(X_s, F) and obtain A_s, B_s, C_s of size I/s × F, J/s × F and K/s × F respectively.
3: A(I, :) = A_s, B(J, :) = B_s, C(K, :) = C_s.

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 13
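A compact Python/NumPy sketch of Algorithm 2, under simplifying assumptions: the tensor is a dense numpy array (the real algorithm works on sparse data), sampling draws I/s, J/s, K/s indices at random with probability proportional to marginal density (one reasonable reading of Algorithm 1, which is not reproduced in these slides), and parafac_fn is any routine that fits a non-negative PARAFAC and returns the three factor matrices:

```python
import numpy as np

def biased_sample(X, s, rng):
    """Sketch of Algorithm 1: draw 1/s of the indices of each mode,
    biased by the marginal densities (sums over the other two modes)."""
    index_sets = []
    for mode in range(3):
        other = tuple(a for a in range(3) if a != mode)
        density = np.abs(X).sum(axis=other)
        n_keep = max(1, X.shape[mode] // s)
        idx = rng.choice(X.shape[mode], size=n_keep, replace=False,
                         p=density / density.sum())
        index_sets.append(np.sort(idx))
    return index_sets  # the sampled index sets I, J, K

def parcube_basic(X, F, s, parafac_fn, rng=None):
    """Sketch of Algorithm 2: decompose a sampled sub-tensor, then
    redistribute the factors to the original index space."""
    rng = rng or np.random.default_rng(0)
    I, J, K = biased_sample(X, s, rng)
    Xs = X[np.ix_(I, J, K)]              # small sampled tensor X_s
    As, Bs, Cs = parafac_fn(Xs, F)       # factors of the sample
    A = np.zeros((X.shape[0], F)); A[I, :] = As
    B = np.zeros((X.shape[1], F)); B[J, :] = Bs
    C = np.zeros((X.shape[2], F)); C[K, :] = Cs
    return A, B, C                       # sparse by construction
```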

Algorithm 3: ParCube for Non-negative PARAFAC with repetition
Input: Tensor X of size I × J × K, number of components F, sampling factor s, number of repetitions r.
Output: PARAFAC factor matrices A, B, C of size I × F, J × F, K × F respectively, and vector λ of size F × 1 which contains the scale of each component.
1: Initialize A, B, C to all-zeros.
2: Randomly, using mode densities as bias, select a set of 100p% (p ∈ [0, 1]) indices I_p, J_p, K_p to be common across all repetitions.
3: for i = 1 ··· r do
4: Run Algorithm 2 with sampling factor s, using I_p, J_p, K_p as a common reference among all r different samples, and obtain A_i, B_i, C_i. The sampling is made on the set difference of the set of all indices and the set of common indices.
5: Calculate the ℓ2 norm of the columns of the common part, for f = 1 ··· F: n_a(f) = ||A_i(I_p, f)||_2, n_b(f) = ||B_i(J_p, f)||_2, n_c(f) = ||C_i(K_p, f)||_2. Normalize the columns of A_i, B_i, C_i using n_a, n_b, n_c and set λ_i(f) = n_a(f) n_b(f) n_c(f). Note that the common part will now be normalized to unit norm.
6: end for
7: A = FactorMerge(A_i), B = FactorMerge(B_i), C = FactorMerge(C_i)
8: λ = average of λ_i.
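The merge step (FactorMerge, whose pseudocode is not reproduced in these slides) matches components across repetitions using the shared common part. A simplified, hedged sketch of the idea: since Algorithm 3 normalizes the common rows I_p of every column to unit norm, the inner product of two columns' common parts acts as a cosine similarity and can be used to align columns greedily before filling in non-zeros:

```python
import numpy as np

def factor_merge(factors, common_idx):
    """Illustrative merge of factor matrices from r repetitions.
    `factors` is a list of I x F matrices whose rows `common_idx`
    (the common index set I_p) are normalized to unit norm."""
    A = factors[0].copy()
    for Ai in factors[1:]:
        used = set()
        for f in range(Ai.shape[1]):
            # cosine similarity between common parts (both unit-norm)
            sims = A[common_idx, :].T @ Ai[common_idx, f]
            # greedily pick the most similar still-unmatched column of A
            best = next(c for c in np.argsort(-sims) if c not in used)
            used.add(int(best))
            # distribute the incoming column's non-zeros: outside I_p,
            # different repetitions touch (mostly) disjoint index sets
            nz = Ai[:, f] != 0
            A[nz, best] = Ai[nz, f]
    return A
```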

The ParCube method

• Key ideas:
 Use biased sampling to sample rows, columns & fibers
 Sampling weight of each index is proportional to its marginal density (the mode densities used as bias in Algorithm 3)
 During sampling, always keep a common portion of indices across samples
 For each smaller tensor, do the PARAFAC decomposition.

• Need to specify 2 parameters:
 Sampling rate s: initial dimensions I, J, K become I/s, J/s, K/s
 Number of repetitions / different sampled tensors: r
A toy end-to-end run combining these pieces is sketched below.
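The following toy run wires the earlier sketches together. Assumptions flagged loudly: the tensorly library is installed and its non_negative_parafac is used as the core solver (recent versions return a weight vector plus the factor matrices), parcube_basic is the sketch shown after Algorithm 2, and the data here is a small random dense array standing in for a large sparse tensor; a real deployment would run r such samples in parallel and merge them:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

def nn_parafac(Xs, F):
    # assumption: tensorly returns (weights, [A, B, C]) for a 3-way tensor
    weights, (A, B, C) = non_negative_parafac(tl.tensor(Xs), rank=F)
    return A * weights, B, C        # fold the component scales into A

rng = np.random.default_rng(7)
X = rng.random((60, 50, 40))        # stand-in for a large sparse tensor
A, B, C = parcube_basic(X, F=5, s=2, parafac_fn=nn_parafac, rng=rng)

# Only ~I/s rows of A are non-zero: the factors are sparse by construction.
print(A.shape, np.count_nonzero(A.any(axis=1)))
```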

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 14

Putting the pieces together

• Say we have matrices A_i from each sample (step G3 in the big picture)
• We may have a re-ordering of the factors
• Each matrix corresponds to a different sampled index set of the original index space
• All factors share the “upper” (common) part, by construction

Proposition: Under mild conditions, the algorithm will stitch the components correctly and output what exact PARAFAC would

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 15

Outline

Introduction
Problem Statement
Method

Experiments

Conclusions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 16

Experiments

• We use the Tensor Toolbox for Matlab for the baseline PARAFAC and the core implementation

Evaluation of performance

 Algorithm correctness
 Execution speedup
 Factor sparsity

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 17

Experiments

Correctness for multiple repetitions

• Relative cost = ParCube approximation cost / PARAFAC approximation cost

• The more samples we get, the closer we are to exact PARAFAC

• Experimental validation of our theoretical result.

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 18

Experiments - Correctness & Speedup for 1 repetition

• Relative cost = ParCube approximation cost / PARAFAC approximation cost
• Speedup = PARAFAC execution time / ParCube execution time

• Extrapolation to parallel execution for 4 repetitions yields a 14.2x speedup (and improves accuracy)
(Both metrics are transcribed as code below.)
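The two evaluation metrics, transcribed as code. This is a direct reading of the slide's definitions under one assumption: "approximation cost" is taken to be the squared Frobenius norm of the residual of the standard CP reconstruction:

```python
import numpy as np

def approx_cost(X, A, B, C):
    """Squared Frobenius norm of the residual of the CP reconstruction."""
    Xhat = np.einsum('if,jf,kf->ijk', A, B, C)   # reassemble the model
    return np.linalg.norm(X - Xhat) ** 2

def relative_cost(parcube_cost, parafac_cost):
    return parcube_cost / parafac_cost   # 1.0 means "as good as PARAFAC"

def speedup(parafac_seconds, parcube_seconds):
    return parafac_seconds / parcube_seconds
```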

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 19

Experiments

Correctness & Sparsity

• Output size = NNZ(A) + NNZ(B) + NNZ(C)
• 90% sparser than PARAFAC, while maintaining the same approximation error as PARAFAC

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 20

Experiments

Knowledge Discovery

 ENRON email/social network: 186 × 186 × 44
 Network traffic data (LBNL): 65170 × 65170 × 65327
 FACEBOOK wall posts: 63891 × 63890 × 1847
 Knowledge base data (Never Ending Language Learner, NELL): 14545 × 14545 × 28818

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 21

Discovery – ENRON

• Who-emailed-whom data from the ENRON email dataset
 Spans 44 months
 184×184×44 tensor
 We picked s = 2, r = 4
• We were able to identify social cliques and spot spikes that correspond to actual important events in the company’s timeline

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 22

Discovery – LBNL Network Data

[Figure: factor plots annotated “1 src” and “1 dst”]

• Network traffic data of the form (src IP, dst IP, port #)
 65170 × 65170 × 65327 tensor
 We pick s = 5, r = 10
• We were able to identify a possible port-scanning attack

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 23

Discovery – FACEBOOK Wall posts

[Figure: factor plots annotated “1 wall” and “1 day”]

• Small portion of Facebook’s users
 63890 users over 1847 days
 Picked s = 100, r = 10
• Data in the form (wall owner, poster, timestamp)
• Downloaded from http://socialnetworks.mpi-sws.org/data-wosn2009.html
• We were able to identify a birthday-like event.

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 24

Discovery – NELL

• Knowledge base data, taken from the Read the Web project at CMU
 http://rtw.ml.cmu.edu/rtw/
 Special thanks to Tom Mitchell for the data.
• Noun phrase × context × noun phrase triplets
 E.g. ‘Obama’ – ‘is’ – ‘the president of the United States’
• Discover words that may be used in the same context (a sketch of one way to do this follows)
• We picked s = 500, r = 10.
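One simple, hedged way to read "words used in the same context" off the decomposition (an illustration, not necessarily the paper's exact procedure): noun phrases whose rows in the noun-phrase factor matrix A load similarly on the latent context components are synonym candidates:

```python
import numpy as np

def similar_phrases(A, phrase_idx, top_k=5, eps=1e-12):
    """Return the top_k noun phrases whose factor rows are closest
    (in cosine similarity) to that of `phrase_idx`."""
    norms = np.linalg.norm(A, axis=1) + eps
    sims = (A @ A[phrase_idx]) / (norms * norms[phrase_idx])
    ranked = np.argsort(-sims)
    return [int(i) for i in ranked if i != phrase_idx][:top_k]
```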

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 25

Outline

Introduction
Problem Statement
Method
Experiments

Conclusions

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 26

Conclusions

• Goal 1: Fast
 Scalable & parallelizable
• Goal 2: Sparse
 Ability to yield sparse latent factors and a sparse tensor approximation
• Goal 3: Accurate
 Provable correctness in merging partial results, under appropriate conditions
• Experiments also demonstrate that ParCube:
 Enables processing of tensors that don’t fit in memory
 Yields interesting findings in diverse knowledge discovery settings

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 27

The End

Evangelos E. Papalexakis
Email: [email protected]
Web: http://www.cs.cmu.edu/~epapalex

Christos Faloutsos
Email: [email protected]
Web: http://www.cs.cmu.edu/~christos

Nicholas Sidiropoulos
Email: [email protected]
Web: http://www.ece.umn.edu/users/nikos/

Evangelos Papalexakis (CMU) – ECML-PKDD 2012 28