KDD 09 tutorial - Carnegie Mellon University

CMU SCS
Large Graph Mining:
Power Tools and a Practitioner’s guide
Task 5: Graphs over time & tensors
Faloutsos, Miller, Tsourakakis
CMU
KDD '09
Faloutsos, Miller, Tsourakakis
P5-1

Outline
• Introduction – Motivation
• Task 1: Node importance
• Task 2: Community detection
• Task 3: Recommendations
• Task 4: Connection sub-graphs
• Task 5: Mining graphs over time
• Task 6: Virus/influence propagation
• Task 7: Spectral graph theory
• Task 8: Tera/peta graph mining: hadoop
• Observations – patterns of real graphs
• Conclusions

Thanks to
• Tamara Kolda (Sandia) for the foils on tensor definitions, and on TOPHITS

Detailed outline
• Motivation
• Definitions: PARAFAC and Tucker
• Case study: web mining

Examples of Matrices:
Authors and terms

         data   mining   classif.   tree   ...
John      13      11        22       55    ...
Peter      5       4         6        7    ...
Mary     ...     ...       ...      ...    ...
Nick     ...     ...       ...      ...    ...
...      ...     ...       ...      ...    ...

Motivation: Why tensors?
• Q: what is a tensor?

Motivation: Why tensors?
• A: N-D generalization of matrix:

KDD’09
         data   mining   classif.   tree   ...
John      13      11        22       55    ...
Peter      5       4         6        7    ...
Mary     ...     ...       ...      ...    ...
Nick     ...     ...       ...      ...    ...
...      ...     ...       ...      ...    ...

Motivation: Why tensors?
• A: N-D generalization of matrix:

One authors x terms matrix per conference: KDD’07, KDD’08, KDD’09
         data   mining   classif.   tree   ...
John      13      11        22       55    ...
Peter      5       4         6        7    ...
Mary     ...     ...       ...      ...    ...
Nick     ...     ...       ...      ...    ...
...      ...     ...       ...      ...    ...

Tensors are useful for 3 or more modes
Terminology: ‘mode’ (or ‘aspect’):
[Figure: the authors x terms x conferences tensor from the previous slides (term columns: data, mining, classif., tree, ...), with its three axes labeled Mode (== aspect) #1, Mode#2 and Mode#3]
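
To make the 'mode' terminology concrete, here is a minimal sketch of how such a 3-mode tensor could be assembled with NumPy. The record list and the helper `index_of` are hypothetical; the KDD'09 counts echo the example matrix above, while the KDD'08 entry is invented purely so that the third mode has more than one slice.

```python
import numpy as np

# Hypothetical (author, keyword, conference, count) records.
# KDD'09 counts follow the example matrix; the KDD'08 row is made up.
records = [
    ("John", "data", "KDD'09", 13),
    ("John", "mining", "KDD'09", 11),
    ("Peter", "data", "KDD'09", 5),
    ("Peter", "mining", "KDD'09", 4),
    ("John", "data", "KDD'08", 2),
]

def index_of(values):
    """Map each distinct label to a position along one mode."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

authors = index_of(r[0] for r in records)    # mode 1
keywords = index_of(r[1] for r in records)   # mode 2
confs = index_of(r[2] for r in records)      # mode 3

# Dense 3-mode array: author x keyword x conference
X = np.zeros((len(authors), len(keywords), len(confs)))
for a, k, c, n in records:
    X[authors[a], keywords[k], confs[c]] += n

# Fixing the third mode gives back an ordinary author x keyword matrix
slice_09 = X[:, :, confs["KDD'09"]]
print(X.shape)
print(slice_09)
```

A dense array is only practical for small examples; real tensors of this kind are usually stored sparsely.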

Notice
• 3rd mode does not need to be time
• we can have more than 3 modes
[Figure: 3-mode tensor with modes IP source x IP destination x Dest. port (ports 80, 125, ...)]

Notice
• 3rd mode does not need to be time
• we can have more than 3 modes
– Eg, fMRI: x, y, z, time, person-id, task-id
From DENLAB, Temple U.
(Prof. V. Megalooikonomou +)
http://denlab.temple.edu/bidms/cgi-bin/browse.cgi

Motivating Applications
• Why are tensors useful?
– web mining (TOPHITS)
– environmental sensors
– Intrusion detection (src, dst, time, dest-port)
– Social networks (src, dst, time, type-of-contact)
– face recognition
– etc …

Detailed outline
• Motivation
• Definitions: PARAFAC and Tucker
• Case study: web mining

Tensor basics
• Multi-mode extensions of SVD – recall that:

Reminder: SVD
$A \approx U \Sigma V^T$, where $A$ is m x n
– Best rank-k approximation in L2

Reminder: SVD
$A \approx \sigma_1 u_1 \circ v_1 + \sigma_2 u_2 \circ v_2 + \dots$, where $A$ is m x n
– Best rank-k approximation in L2
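
The rank-k reading of the SVD can be checked numerically. The sketch below uses a random stand-in matrix (not data from the tutorial) and a hypothetical helper `rank_k`; it relies on the Eckart-Young result that the spectral-norm error of the best rank-k approximation equals the (k+1)-th singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))   # stand-in m x n matrix, random for illustration only

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(U, s, Vt, k):
    """Sum of the first k rank-1 terms sigma_i * u_i * v_i^T."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

A2 = rank_k(U, s, Vt, 2)

# Spectral-norm error of the rank-2 truncation vs. the third singular value
err = np.linalg.norm(A - A2, 2)
print(err, s[2])
```

Keeping all four terms reproduces `A` exactly (up to rounding), which is the `A = U Σ V^T` form of the first reminder slide.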

Goal: extension to >=3 modes
[Figure: an I x J x K tensor ≈ factor matrices A (I x R) and B (J x R), plus a third-mode factor, around an R x R x R core = a sum of R rank-1 terms (+…+)]

Main points:
• 2 major types of tensor decompositions: PARAFAC and Tucker
• both can be solved with "alternating least squares" (ALS)
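
Specially Structured Tensors
• Tucker Tensor: an I x J x K tensor = an R x S x T “core” combined with one factor matrix per mode: U (I x R), V (J x S), and a third-mode factor
• Kruskal Tensor: an I x J x K tensor = $u_1 \circ v_1 \circ w_1 + \dots + u_R \circ v_R \circ w_R$, with factor matrices U (I x R), V (J x R) and a third-mode factor holding $w_1, \dots, w_R$; its core is R x R x R
(Each form has its own shorthand, marked “Our Notation” on the slide.)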
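
As a rough illustration of the two structured forms, the sketch below rebuilds a full tensor from Kruskal factors and from Tucker factors using `np.einsum`. All sizes and values are random placeholders, and the names `lam`, `W`, `A`, `B`, `C`, `G` are assumptions chosen for this example (the slide only shows U, V and the core). The last lines check the standard fact that a Kruskal tensor is a Tucker tensor whose core is zero off the superdiagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 2

# Kruskal form: weights lam (R) and factors U (I x R), V (J x R), W (K x R)
R = 2
lam = rng.random(R)
U, V, W = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
X_kruskal = np.einsum("r,ir,jr,kr->ijk", lam, U, V, W)

# The same tensor as an explicit sum of R rank-1 outer products
X_sum = sum(
    lam[r] * np.multiply.outer(np.multiply.outer(U[:, r], V[:, r]), W[:, r])
    for r in range(R)
)

# Tucker form: core G (R x S x T) and factors A (I x R), B (J x S), C (K x T)
S, T = 3, 2
G = rng.random((R, S, T))
A, B, C = rng.random((I, R)), rng.random((J, S)), rng.random((K, T))
X_tucker = np.einsum("rst,ir,js,kt->ijk", G, A, B, C)

# Kruskal as a special Tucker tensor: core holds lam on its superdiagonal
G_diag = np.zeros((R, R, R))
G_diag[np.arange(R), np.arange(R), np.arange(R)] = lam
X_as_tucker = np.einsum("rst,ir,js,kt->ijk", G_diag, U, V, W)
```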

Tucker Decomposition - intuition
[Figure: I x J x K tensor ≈ R x S x T core combined with factors A (I x R), B (J x S) and C]
• author x keyword x conference
• A: author x author-group
• B: keyword x keyword-group
• C: conf. x conf-group
• G: how groups relate to each other
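
One simple way to obtain factors A, B, C and a core G with these roles is the truncated higher-order SVD (HOSVD). Note that this is a swapped-in, non-iterative technique, not the ALS fit the tutorial names; it is shown only because it fits in a few lines. The tensor is synthetic, and `unfold` and `hosvd` are hypothetical helper names.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X so that rows index the chosen mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def hosvd(X, ranks):
    """Truncated HOSVD: one SVD per mode, then project X onto the factors."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])            # orthonormal basis for this mode
    A, B, C = factors
    # Core tensor: how the groups of the three modes relate to each other
    G = np.einsum("ijk,ir,js,kt->rst", X, A, B, C)
    return G, A, B, C

rng = np.random.default_rng(1)
# Synthetic "author x keyword x conference" tensor with exact Tucker structure
G0 = rng.random((2, 3, 2))
A0, B0, C0 = rng.random((5, 2)), rng.random((6, 3)), rng.random((4, 2))
X = np.einsum("rst,ir,js,kt->ijk", G0, A0, B0, C0)

G, A, B, C = hosvd(X, ranks=(2, 3, 2))
X_hat = np.einsum("rst,ir,js,kt->ijk", G, A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(G.shape, rel_err)
```

Because the synthetic tensor has exactly the requested multilinear rank, the reconstruction error is at rounding level; on real data the truncation would only approximate the tensor, and an ALS refinement is commonly applied afterwards.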

Intuition behind core tensor
• 2-d case: co-clustering
• [Dhillon et al. Information-Theoretic Co-clustering, KDD’03]

eg, terms x documents

m x n matrix:
.05   .05   .05    0     0     0
.05   .05   .05    0     0     0
 0     0     0    .05   .05   .05
 0     0     0    .05   .05   .05
.04   .04    0    .04   .04   .04
.04   .04   .04    0    .04   .04

m x k factor:
.5    0    0
.5    0    0
 0   .5    0
 0   .5    0
 0    0   .5
 0    0   .5

k x l factor:
.3    0
 0   .3
.2   .2

l x n factor:
.36   .36   .28    0     0     0
 0     0     0    .28   .36   .36

Product of the three factors (m x n):
.054  .054  .042   0     0     0
.054  .054  .042   0     0     0
 0     0     0    .042  .054  .054
 0     0     0    .042  .054  .054
.036  .036  .028  .028  .036  .036
.036  .036  .028  .028  .036  .036

Original matrix (columns 1-3: med. doc, columns 4-6: cs doc):
.05   .05   .05    0     0     0       med. terms
.05   .05   .05    0     0     0       med. terms
 0     0     0    .05   .05   .05      cs terms
 0     0     0    .05   .05   .05      cs terms
.04   .04    0    .04   .04   .04      common terms
.04   .04   .04    0    .04   .04      common terms

term x term-group:
.5    0    0
.5    0    0
 0   .5    0
 0   .5    0
 0    0   .5
 0    0   .5

term group x doc. group:
.3    0
 0   .3
.2   .2

doc x doc group:
.36   .36   .28    0     0     0
 0     0     0    .28   .36   .36

Product of the three factors:
.054  .054  .042   0     0     0
.054  .054  .042   0     0     0
 0     0     0    .042  .054  .054
 0     0     0    .042  .054  .054
.036  .036  .028  .028  .036  .036
.036  .036  .028  .028  .036  .036
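
The co-clustering factorization can be verified by multiplying the three small matrices. In this sketch the names `P`, `T`, `G` and `D` are labels chosen here, and the middle factor is taken as [.3 0; 0 .3; .2 .2], the values under which the product reproduces the approximation shown on the slide.

```python
import numpy as np

# Original terms x documents matrix
P = np.array([
    [.05, .05, .05, 0, 0, 0],
    [.05, .05, .05, 0, 0, 0],
    [0, 0, 0, .05, .05, .05],
    [0, 0, 0, .05, .05, .05],
    [.04, .04, 0, .04, .04, .04],
    [.04, .04, .04, 0, .04, .04],
])

# term x term-group (6 x 3): med. terms, cs terms, common terms
T = np.array([[.5, 0, 0], [.5, 0, 0], [0, .5, 0],
              [0, .5, 0], [0, 0, .5], [0, 0, .5]])
# term-group x doc-group (3 x 2): the 2-d analogue of the core
G = np.array([[.3, 0], [0, .3], [.2, .2]])
# doc-group x doc (2 x 6): med. docs, cs docs
D = np.array([[.36, .36, .28, 0, 0, 0],
              [0, 0, 0, .28, .36, .36]])

approx = T @ G @ D
print(np.round(approx, 3))

# The approximation keeps the row and column totals of the original
print(np.allclose(approx.sum(axis=1), P.sum(axis=1)))
print(np.allclose(approx.sum(axis=0), P.sum(axis=0)))
```

The small middle matrix plays the role that the core tensor G plays in Tucker: it records how strongly each term group is associated with each document group.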

Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC the three modes have the same number of groups, matched one-to-one
• ( To solve: Alternating Least Squares )
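
For readers who want to see what alternating least squares looks like for PARAFAC, here is a minimal NumPy sketch. It is not the Tensor Toolbox implementation; the helper names `unfold`, `khatri_rao` and `parafac_als`, the iteration count and the synthetic test tensor are all assumptions made for this example. Each step fixes two factor matrices and solves a linear least-squares problem for the third.

```python
import numpy as np

def unfold(X, mode):
    """Matricize X: rows index `mode`, columns run over the other two modes."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (J x R) and Q (K x R): (J*K x R)."""
    return np.einsum("jr,kr->jkr", P, Q).reshape(-1, P.shape[1])

def parafac_als(X, rank, n_iter=500, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Hold the other two factors fixed and update this one
            P, Q = [factors[m] for m in range(3) if m != mode]
            gram = (P.T @ P) * (Q.T @ Q)   # = khatri_rao(P,Q).T @ khatri_rao(P,Q)
            factors[mode] = unfold(X, mode) @ khatri_rao(P, Q) @ np.linalg.pinv(gram)
    return factors

# Synthetic rank-2 tensor with orthonormal factor columns (an easy case for ALS)
rng = np.random.default_rng(42)
A0, B0, C0 = (np.linalg.qr(rng.standard_normal((n, 2)))[0] for n in (5, 4, 3))
X = np.einsum("r,ir,jr,kr->ijk", np.array([3.0, 1.0]), A0, B0, C0)

A, B, C = parafac_als(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err)
```

A fixed iteration count keeps the sketch short; a practical version would stop when the fit stops improving and would normalise the factor columns between sweeps.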
Detailed outline
• Motivation
• Definitions: PARAFAC and Tucker
• Case study: web mining
Web graph mining
• How to order the importance of web pages?
– Kleinberg’s algorithm HITS
– PageRank
– Tensor extension on HITS (TOPHITS)

Kleinberg’s Hubs and Authorities
(the HITS method)
Sparse adjacency matrix and its SVD:
[Figure: from x to adjacency matrix ≈ (hub scores for 1st topic) ∘ (authority scores for 1st topic) + (hub scores for 2nd topic) ∘ (authority scores for 2nd topic) + …]
Kleinberg, JACM, 1999
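
The link between HITS and the SVD can be seen on a toy graph. The 4-page adjacency matrix below is made up for illustration; the sketch compares the leading singular vectors with the scores from Kleinberg's mutual-reinforcement iteration.

```python
import numpy as np

# Toy adjacency matrix: A[i, j] = 1 if page i links to page j (made-up graph)
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Leading singular vectors give hub and authority scores for the 1st topic
U, s, Vt = np.linalg.svd(A)
hubs = np.abs(U[:, 0])           # sign of a singular vector is arbitrary
authorities = np.abs(Vt[0, :])

# Kleinberg's iteration: authorities from hubs, hubs from authorities
h = np.ones(4)
for _ in range(100):
    a = A.T @ h
    h = A @ a
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

print(np.round(authorities, 3), np.round(a, 3))
print(np.round(hubs, 3), np.round(h, 3))
```

Further singular vector pairs (`U[:, 1]`, `Vt[1, :]`, and so on) correspond to the additional "topics" in the figure.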

HITS Authorities on Sample Data
We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.

1st Principal Factor
.97 www.ibm.com
.24 www.alphaworks.ibm.com
.08 www-128.ibm.com
.05 www.developer.ibm.com
.02 www.research.ibm.com
.01 www.redbooks.ibm.com
.01 news.com.com

2nd Principal Factor
.99 www.lehigh.edu
.11 www2.lehigh.edu
.06 www.lehighalumni.com
.06 www.lehighsports.com
.02 www.bethlehem-pa.gov
.02 www.adobe.com
.02 lewisweb.cc.lehigh.edu
.02 www.leo.lehigh.edu
.02 www.distance.lehigh.edu
.02 fp1.cc.lehigh.edu

3rd Principal Factor
.75 java.sun.com
.38 www.sun.com
.36 developers.sun.com
.24 see.sun.com
.16 www.samag.com
.13 docs.sun.com
.12 blogs.sun.com
.08 sunsolve.sun.com
.08 www.sun-catalogue.com
.08 news.com.com

4th Principal Factor
.60 www.pueblo.gsa.gov
.45 www.whitehouse.gov
.35 www.irs.gov
.31 travel.state.gov
.22 www.gsa.gov
.20 www.ssa.gov
.16 www.census.gov
.14 www.govbenefits.gov
.13 www.kids.gov
.13 www.usdoj.gov

6th Principal Factor
.97 mathpost.asu.edu
.18 math.la.asu.edu
.17 www.asu.edu
.04 www.act.org
.03 www.eas.asu.edu
.02 archives.math.utk.edu
.02 www.geom.uiuc.edu
.02 www.fulton.asu.edu
.02 www.amstat.org
.02 www.maa.org

Three-Dimensional View of the Web
Observe that this tensor is very sparse!
Kolda, Bader, Kenny, ICDM05
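
Because almost all cells of such a tensor are zero, it is natural to store only the non-zero coordinates. The sketch below shows one simple coordinate-style layout in plain Python; the hostnames, anchor terms and link list are hypothetical.

```python
from collections import defaultdict

# Hypothetical hyperlinks: (from_page, to_page, anchor_term)
links = [
    ("a.com", "b.com", "java"),
    ("a.com", "b.com", "sun"),
    ("c.com", "b.com", "java"),
    ("c.com", "d.com", "weather"),
]

# Coordinate storage: keep only the non-zero cells of the 3-mode tensor
tensor = defaultdict(float)
for src, dst, term in links:
    tensor[(src, dst, term)] += 1.0

pages = sorted({p for s, d, _ in links for p in (s, d)})
terms = sorted({t for _, _, t in links})

# Compare stored cells with the size of the full pages x pages x terms cube
n_cells = len(pages) * len(pages) * len(terms)
density = len(tensor) / n_cells
print(len(tensor), n_cells, density)
```

The gap between stored cells and the full cube grows quickly with the number of pages and terms, which is why sparse formats matter for web-scale tensors.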

Topical HITS (TOPHITS)
Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.
[Figure: from x to x term tensor ≈ (hub scores ∘ authority scores ∘ term scores for 1st topic) + (hub scores ∘ authority scores ∘ term scores for 2nd topic) + …]
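
The idea of adding a term mode to HITS can be sketched on a tiny example. The code below is not the TOPHITS implementation: it fits only a single rank-1 PARAFAC component by alternating updates (a higher-order power iteration), and the pages, terms and links are invented. It shows how hub, authority and term scores for one topic come out of the same computation.

```python
import numpy as np

# Toy from x to x term tensor (made-up):
# X[i, j, k] = 1 if page i links to page j using anchor term k
pages = ["p0", "p1", "p2", "p3"]
terms = ["java", "weather"]
X = np.zeros((4, 4, 2))
X[0, 1, 0] = X[2, 1, 0] = X[3, 1, 0] = 1   # three pages point to p1 with "java"
X[0, 2, 1] = 1                              # one link to p2 with "weather"

# Rank-1 PARAFAC: update each score vector with the other two held fixed
h = np.ones(4)   # hub scores
a = np.ones(4)   # authority scores
t = np.ones(2)   # term scores
for _ in range(50):
    h = np.einsum("ijk,j,k->i", X, a, t)
    h /= np.linalg.norm(h)
    a = np.einsum("ijk,i,k->j", X, h, t)
    a /= np.linalg.norm(a)
    t = np.einsum("ijk,i,j->k", X, h, a)
    t /= np.linalg.norm(t)

top_authority = pages[int(np.argmax(a))]
top_term = terms[int(np.argmax(t))]
print(top_authority, top_term)
```

With the term mode removed (a single term), the same updates reduce to the HITS iteration on the adjacency matrix.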

TOPHITS Terms & Authorities on Sample Data
TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.
Tensor PARAFAC; w_k = # unique links using term k

1st Principal Factor
Terms: .23 JAVA, .18 SUN, .17 PLATFORM, .16 SOLARIS, .16 DEVELOPER, .15 EDITION, .15 DOWNLOAD, .14 INFO, .12 SOFTWARE, .12 NO-READABLE-TEXT
Authorities: .86 java.sun.com, .38 developers.sun.com, .16 docs.sun.com, .14 see.sun.com, .14 www.sun.com, .09 www.samag.com, .07 developer.sun.com, .06 sunsolve.sun.com, .05 access1.sun.com, .05 iforce.sun.com

2nd Principal Factor
Terms: .20 NO-READABLE-TEXT, .16 FACULTY, .16 SEARCH, .16 NEWS, .16 LIBRARIES, .16 COMPUTING, .12 LEHIGH
Authorities: .99 www.lehigh.edu, .06 www2.lehigh.edu, .03 www.lehighalumni.com

3rd Principal Factor
Terms: .15 NO-READABLE-TEXT, .15 IBM, .12 SERVICES, .12 WEBSPHERE, .12 WEB, .11 DEVELOPERWORKS, .11 LINUX, .11 RESOURCES, .11 TECHNOLOGIES, .10 DOWNLOADS
Authorities: .97 www.ibm.com, .18 www.alphaworks.ibm.com, .07 www-128.ibm.com, .05 www.developer.ibm.com, .02 www.redbooks.ibm.com, .01 www.research.ibm.com

4th Principal Factor
Terms: .26 INFORMATION, .24 FEDERAL, .23 CITIZEN, .22 OTHER, .19 CENTER, .19 LANGUAGES, .15 U.S, .15 PUBLICATIONS, .14 CONSUMER, .13 FREE
Authorities: .87 www.pueblo.gsa.gov, .24 www.irs.gov, .23 www.whitehouse.gov, .19 travel.state.gov, .18 www.gsa.gov, .09 www.consumer.gov, .09 www.kids.gov, .07 www.ssa.gov, .05 www.forms.gov, .04 www.govbenefits.gov

6th Principal Factor
Terms: .26 PRESIDENT, .25 NO-READABLE-TEXT, .25 BUSH, .25 WELCOME, .17 WHITE, .16 U.S, .15 HOUSE, .13 BUDGET, .13 PRESIDENTS, .11 OFFICE
Authorities: .87 www.whitehouse.gov, .18 www.irs.gov, .16 travel.state.gov, .10 www.gsa.gov, .08 www.ssa.gov, .05 www.govbenefits.gov, .04 www.census.gov, .04 www.usdoj.gov, .04 www.kids.gov, .02 www.forms.gov

12th Principal Factor
Terms: .75 OPTIMIZATION, .58 SOFTWARE, .08 DECISION, .07 NEOS, .06 TREE, .05 GUIDE, .05 SEARCH, .05 ENGINE, .05 CONTROL, .05 ILOG
Authorities: .35 www.palisade.com, .35 www.solver.com, .33 plato.la.asu.edu, .29 www.mat.univie.ac.at, .28 www.ilog.com, .26 www.dashoptimization.com, .26 www.grabitech.com, .25 www-fp.mcs.anl.gov, .22 www.spyderopts.com, .17 www.mosek.com

13th Principal Factor
Terms: .46 ADOBE, .45 READER, .45 ACROBAT, .30 FREE, .30 NO-READABLE-TEXT, .29 HERE, .29 COPY, .05 DOWNLOAD
Authorities: .99 www.adobe.com

16th Principal Factor
Terms: .50 WEATHER, .24 OFFICE, .23 CENTER, .19 NO-READABLE-TEXT, .17 ORGANIZATION, .15 NWS, .15 SEVERE, .15 FIRE, .15 POLICY, .14 CLIMATE
Authorities: .81 www.weather.gov, .41 www.spc.noaa.gov, .30 lwf.ncdc.noaa.gov, .15 www.cpc.ncep.noaa.gov, .14 www.nhc.noaa.gov, .09 www.prh.noaa.gov, .07 aviationweather.gov, .06 www.nohrsc.nws.gov, .06 www.srh.noaa.gov, .02 www.nws.noaa.gov

19th Principal Factor
Terms: .22 TAX, .17 TAXES, .15 CHILD, .15 RETIREMENT, .14 BENEFITS, .14 STATE, .14 INCOME, .13 SERVICE, .13 REVENUE, .12 CREDIT
Authorities: .73 www.irs.gov, .43 travel.state.gov, .22 www.ssa.gov, .08 www.govbenefits.gov, .06 www.usdoj.gov, .03 www.census.gov, .03 www.usmint.gov, .02 www.gsa.gov, .01 www.annualcreditreport.com

Conclusions
• Real data are often in high dimensions with multiple aspects (modes)
• Tensors provide elegant theory and algorithms
– PARAFAC and Tucker: discover groups

References
• T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, Pages 242-249, November 2005.
• Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams, Proc. of the Int. Conf. on Data Mining (ICDM), Hong Kong, China, Dec 2006

Resources
• See tutorial on tensors, KDD’07 (w/ Tamara Kolda and Jimeng Sun):
www.cs.cmu.edu/~christos/TALKS/KDD-07-tutorial

Tensor tools - resources
• Toolbox: from Tamara Kolda:
csmr.ca.sandia.gov/~tgkolda/TensorToolbox
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009
csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf
• T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)