ICS 278: Data Mining Lecture 1: Introduction to Data

From Gauss to Google: Data Analysis in
the Digital Age
Padhraic Smyth
Department of Computer Science
University of California, Irvine
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine: 1
Opening Comments
• From Gauss to Google
– Not just Gauss…
– Not just Google….
• Broad interpretation of “Web data”, e.g., will include email, etc.
• Many topics in Web data analysis will not be discussed
• Data mining, machine learning, and statistics?
– All pursuing the same goals, but with different agendas/biases
The Internet Archive
Non-profit organization with a broad goal to crawl and
archive the Web
As of June 2007:
- 96 billion Web pages archived since Oct 1996
- 49 billion unique documents
- 500 terabytes of data
Source: Internet Archive ACM/IEEE JCDL Conference Tutorial, June 2007
Computer Architecture 101
[Diagram: CPU, RAM, and Disk]
How Far Away are the Data?
Random Access Times
– CPU to RAM: 10^-8 seconds
– CPU to Disk: 10^-3 seconds
How Far Away are the Data?
Effective Distances
– CPU to RAM: 1 meter
– CPU to Disk: 100 km
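The distance analogy appears to follow from the ratio of the two random access times. A minimal sketch of that arithmetic, assuming RAM is placed at a reference distance of 1 meter (variable names are illustrative only):

```python
# Random access times quoted on the slide, in seconds.
ram_access_s = 1e-8
disk_access_s = 1e-3

# Assumption: RAM is treated as being 1 meter "away" from the CPU.
ram_distance_m = 1.0

# Disk is slower by this factor, so it is proportionally "farther" away.
slowdown = disk_access_s / ram_access_s
disk_distance_km = ram_distance_m * slowdown / 1000.0

print(f"disk is about {slowdown:.0e} times slower than RAM")
print(f"effective disk distance: about {disk_distance_km:.0f} km")
```

The factor of 10^5 between the two access times is what turns 1 meter into 100 km.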
What do we mean by “Web data”?
• Data
– Objects of interest
– Measurements we can make on objects
• Examples
– Object = Web document
• Measurements = text content, traffic, edit history, …
– Object = Network
• Measurements = nodes, links, time-stamps, content, …
– Object = Human
• Measurements = browsing behavior, queries, demographics…
Data as a matrix…..
Rows -> objects
Columns -> measurements
ID      Income    Age   ….     Monthly Debt   Good Risk?
18276   65,000    55    ….     2200           Yes
72514   28,000    19    ….     1500           No
28163   120,000   62    ….     1800           Yes
17265   90,000    35    ….     4500           No
…       …         …     ….     …              …
…       …         …     ….     …              …
61524   35,000    22    ….     900            Yes
In fact Web data is very different:
- sequential record of events per user
- vastly different amounts of data per user
- many categorical variables (e.g., query terms)
- and so on….
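To make the contrast with a fixed-width matrix concrete, here is a small sketch of an event log grouped by user. The field layout and the sample events are hypothetical, chosen only to illustrate sequential, variable-length, categorical records:

```python
from collections import defaultdict

# Hypothetical event log: (user_id, timestamp, event_type, value).
events = [
    ("u1", 1, "query", "cheap flights"),
    ("u1", 2, "click", "/results/3"),
    ("u2", 1, "query", "weather irvine"),
    ("u1", 5, "query", "hotel deals"),
]

# Group events into one time-ordered sequence per user.
sessions = defaultdict(list)
for user, ts, kind, value in sorted(events, key=lambda e: (e[0], e[1])):
    sessions[user].append((ts, kind, value))

# Users contribute very different amounts of data, so there is no
# natural fixed number of columns per row.
counts = {user: len(evts) for user, evts in sessions.items()}
print(counts)
```

Turning such sequences into a rows-by-columns matrix requires a modeling decision (for example, counting events per type), which is one reason Web data does not fit the matrix picture directly.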
Email Network over 3 months
from Hewlett Packard Research Labs
Example Research Question:
What is the “best” way to detect
significant changes in such a
network over time?
Discovering Organizational Structure from Email Network
O’Madadhain and Smyth, 2005
Networks of Instant Messengers
Leskovec and Horvitz, 2007
• Network Data
– 240 million IM users over 1 month
– 1 billion conversations per day
– 1.3 billion edges in the graph
Example Research Question:
How do these spatial patterns
depend on social and economic
factors?
Linking Demographics with IM Usage
Leskovec and Horvitz, 2007
Query Data
(source: Dan Russell, Google)
Research Question:
Predict the age and
gender of an individual
given their query history
More difficult:
Predict how many
people are using one
account, and their ages
and genders
Eye-Tracking: The Golden Triangle for Search
from Hotchkiss, Alston, Edwards, 2005; Enquiro Research
Research Question:
Build a probabilistic model
that characterizes these
patterns at individual and
population levels
The State of Web Data
• Text Content, Networks, and Human Behavior
• Complex
• Non-stationary
• Observational versus experimental
• Measurement is non-trivial
• Vast Scale
So should we just forget about statistics?
Do we need fundamentally new ways to analyze this type of data?
Key Ideas from Statistics
• Regularity in the aggregate
– The Normal curve and central limit theorem
– Ubiquity of power-laws
• …but diversity in individual behavior
– extremes are prevalent in very large data sets
• Observed versus unobserved variables
– Using unobserved variables to explain observed data
1800: The Birth of Modern Statistical Thinking
• Before 1800
– Long history of inferring patterns from data – but largely ad hoc
– More recent history of probability – limited to games of chance
• Around 1800
– New data analysis problems in science (astronomy), commerce
(navigation), and social sciences
– Realization of the importance of deriving a systematic approach to data
analysis
– Work of Laplace, Legendre, Gauss, etc, was fundamental
Source: Stephen Stigler, The History of Statistics: The Measurement of
Uncertainty before 1900, Harvard University Press, 1986.
A Simple Example from 1800
y1 = x + e1
y2 = x + e2
y3 = x + e3
…….
• e.g., astronomy: taking measurements from a telescope
yi = observed position of a planet in the sky for measurement i
x = the true position
ei = random measurement error
• Combining multiple measurements: major open problem in 1800
A Simple Example from 1800
y1 = x + e1
y2 = x + e2
y3 = x + e3
…….
• Key insights from Laplace, Legendre, Gauss:
– If e’s are normal/Gaussian, we can estimate x by least-squares
– We can also make statements about P(x | {y}, e)
– We can generalize to multiple variables
• y = ax + bv + fz + d + e
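For the single-variable case y_i = x + e_i, the least-squares estimate of x is the sample mean of the measurements. A minimal sketch on simulated data (the true value, noise level, and sample size are hypothetical):

```python
import random
import statistics

random.seed(0)

true_x = 10.0   # hypothetical true position
sigma = 0.5     # hypothetical standard deviation of the measurement error
ys = [true_x + random.gauss(0.0, sigma) for _ in range(1000)]

def sum_sq(x, ys):
    """Sum of squared residuals for a candidate value x."""
    return sum((y - x) ** 2 for y in ys)

# The value minimizing the sum of squared residuals is the sample mean.
x_hat = statistics.fmean(ys)

print(f"estimate: {x_hat:.3f}")
print(f"residual sum of squares at estimate: {sum_sq(x_hat, ys):.2f}")
```

Moving the estimate slightly in either direction increases the sum of squared residuals, which is a quick numerical check that the mean is the minimizer.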
y = x + e
[Diagram: Probabilistic Model and Observed Data]
y = x + e
[Diagram: Probabilistic Model and Observed Data, linked by Least squares]
The Average Man
• Early applications of statistical thinking were restricted
to scientific problems where x’s were physical
quantities
• 1835: enter Adolphe Quetelet
– L’homme moyen – the average man
– We can apply ideas like Normal curves to human
characteristics and behavior
• Heights, birth rates, growth curves
– Introduced statistical thinking to social sciences
Key Concepts from Quetelet
• Conditional dependence
– E.g., P( height | male) versus P( height | female)
• Latent hidden variables
– E.g., tendency to commit a crime
• The regularity of human behavior:
“The constancy with which the same crimes repeat themselves
every year with the same frequency … is one of the most curious
facts we learn from the statistics of the courts;”
Do we see such regularities in Web data?
Histogram of session length
for visitors to department Web site
over 1 week (robots removed)
[on a log-log scale]
[Figure: x-axis: Session Length L; y-axis: Empirical Frequency of L, from 10^0 down to 10^-6]
Query Distribution of MSN and AOL search logs
from Adar, Weld, Bershad, Gribble, 2007
Conversation Duration for Instant Messenger Sessions
from Leskovec and Horvitz, 2007
Login and Logout Durations for Instant Messenger Sessions
from Leskovec and Horvitz, 2007
Regularities such as power-laws are
abundant in Web data
Highly non-Gaussian
Aggregate behavior – very predictable
Individual behavior – much less so
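A rough sketch of what a power-law regularity looks like numerically, using simulated session lengths. The exponent, the range of lengths, and the sample size are assumptions, and the straight-line fit on log-log axes is only a crude check, not a recommended estimator:

```python
import math
import random
from collections import Counter

random.seed(2)

# Assumption: session lengths follow P(L) proportional to L**(-alpha).
alpha = 2.0                      # hypothetical exponent
lengths = list(range(1, 1001))   # hypothetical range of session lengths
weights = [L ** -alpha for L in lengths]
samples = random.choices(lengths, weights=weights, k=100_000)

# Empirical frequency of each session length.
counts = Counter(samples)
freq = {L: c / len(samples) for L, c in counts.items()}

# Crude least-squares line through (log10 L, log10 frequency) for small L,
# where the counts are large enough to be stable.
xs = [math.log10(L) for L in range(1, 21)]
ys = [math.log10(freq[L]) for L in range(1, 21)]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)

# Aggregate versus individual: the typical session is short, the extremes are not.
median_len = sorted(samples)[len(samples) // 2]
max_len = max(samples)
print(f"log-log slope: {slope:.2f}, median: {median_len}, max: {max_len}")
```

The fitted slope recovers the exponent in aggregate, while any single session can be far from the median.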
Contribution of Gauss et al
y = x + e
[Diagram: Probabilistic Model and Observed Data, linked by Least squares]
Inverse Probability
[Diagram: Probabilistic Model and Observed Data, linked by P(data | model) in one direction and P(model | data) in the other]
UCLA, 1988: Judea Pearl and Graphical Models
• A “language” for modeling dependencies among
sets of random variables
• Graphical model
– Nodes = variables
– Edges = direct dependencies
• Leverages the idea of conditional independence
[Diagram: Age is the parent of both Reading Ability and Height]
Reading and height are modeled as conditionally independent given age
But if age is unknown, they are dependent!
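The age, reading, and height example can be checked with a small simulation. This is a sketch under an assumed generative model (all coefficients and noise levels are hypothetical) in which age drives both quantities and they are otherwise unrelated:

```python
import random

random.seed(1)

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical model: age is the only common cause of height and reading.
ages, heights, reading = [], [], []
for _ in range(5000):
    age = random.randint(5, 12)
    ages.append(age)
    heights.append(100 + 6 * age + random.gauss(0, 5))
    reading.append(10 * age + random.gauss(0, 8))

# Age unknown: height and reading ability look strongly related.
overall = corr(heights, reading)

# Age known (condition on 8-year-olds): the relationship disappears.
h8 = [h for a, h in zip(ages, heights) if a == 8]
r8 = [r for a, r in zip(ages, reading) if a == 8]
within_age_8 = corr(h8, r8)

print(f"correlation, age unknown: {overall:.2f}")
print(f"correlation, age = 8:     {within_age_8:.2f}")
```

The marginal correlation is high only because both variables move with age; within a single age group it is close to zero.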
Classifying Documents, e.g., Spam Email Filtering
[Diagram: Class node linked to word nodes w (Word 1, …, Word i, …, Word n)]
Class = {spam, non-spam}
“All models are wrong, but some are useful”
from G. E. P. Box
Specifying the Forward Model
[Diagram: Class node and parameters φ, linked to word nodes w (Word 1, …, Word i, …, Word n)]
φ are the parameters of the model, e.g., P(w = free | class = spam)
Specifying the Forward Model
[Diagram: Class node and parameters φ, linked to word nodes w (Word 1, …, Word i, …, Word n)]
P(w | class, φ)
A Compact Notation: Plates
[Diagram: φ, Class, and w_i, with a plate i = 1:n]
Plate = replicates of a node
Nodes within plates are conditionally independent given parent nodes
Plates with Multiple Documents
[Diagram: φ, Class_j, and w_i, with plates i = 1:n and j = 1:D]
Assumes documents are conditionally independent given model parameters
Learning Parameters
[Diagram: φ, Class_j, and w_i, with plates i = 1:n and j = 1:D]
φ are in fact unknown
Use “inverse probability” (Bayes rule) to learn the φ’s
In essence, information flows from the observed nodes to the unobserved
Being Bayesian
[Diagram: Prior, φ, Class_j, and w_i, with plates i = 1:n and j = 1:D]
Priors can help smooth out data-driven estimates, e.g., dictionary-derived
Making Predictions
[Diagram: φ, Class_j, and w_i, with plates i = 1:n and j = 1:D]
Now we have new documents where “class” is unknown
Again use inverse probability (Bayes rule)
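The learning and prediction steps for the spam example can be sketched as a small naive Bayes classifier. The training documents, the pseudo-count prior, and all function names are hypothetical; the sketch assumes words are conditionally independent given the class:

```python
import math
from collections import Counter

# Tiny hypothetical training set: (class label, words in the document).
train = [
    ("spam", ["free", "money", "free", "offer"]),
    ("spam", ["free", "click", "offer"]),
    ("non-spam", ["meeting", "schedule", "project"]),
    ("non-spam", ["project", "report", "free"]),
]

vocab = sorted({w for _, words in train for w in words})
classes = sorted({c for c, _ in train})

def fit(train, pseudo_count=1.0):
    """Estimate P(class) and P(w | class); the pseudo-count acts as a simple prior."""
    class_counts = Counter(c for c, _ in train)
    word_counts = {c: Counter() for c in classes}
    for c, words in train:
        word_counts[c].update(words)
    prior = {c: class_counts[c] / len(train) for c in classes}
    params = {}
    for c in classes:
        total = sum(word_counts[c].values()) + pseudo_count * len(vocab)
        params[c] = {w: (word_counts[c][w] + pseudo_count) / total for w in vocab}
    return prior, params

def posterior(words, prior, params):
    """P(class | words) by Bayes rule, in log space; unknown words are skipped."""
    log_joint = {
        c: math.log(prior[c])
        + sum(math.log(params[c][w]) for w in words if w in params[c])
        for c in classes
    }
    m = max(log_joint.values())
    unnorm = {c: math.exp(v - m) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

prior, params = fit(train)
post = posterior(["free", "offer"], prior, params)
print(post)
```

With a pseudo-count of zero, any word never seen in a class would drive that class's probability to zero; the prior smooths this out, which is the role the slides attribute to priors.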
Why is Graphical Modeling Important?
• A systematic stochastic modeling framework
– handles parameters, variables, and data
• Links modeling with computation
– In other words, it links statistics and computer science
• Allows us to use computers to help build complex models
Graphical Model for Markov Chains
[Diagram: φ and c_i, with a plate over Pages]
Multiple Users…One Common Markov Chain
[Diagram: φ and c_i, with plates over Pages and Users]
Multiple Users…One Chain per User
[Diagram: φ and c_i, with plates over Pages and Users]
One Chain per Cluster of Users
Cadez, Meek, Heckerman, Smyth, 2003
[Diagram: φ, Cluster_j, and c_i, with plates over Pages and Users]
Clusters of Probabilistic State Machines
[Diagram: two probabilistic state machines, Cluster 1 and Cluster 2, each over states A, B, C, E]
Motivation:
approximate the heterogeneity of Web surfing behavior
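A simplified sketch of the one-chain-per-cluster idea: fit a first-order Markov chain to the page sequences of each cluster, then ask which chain best explains a new session. The sessions and cluster assignments below are hypothetical and given in advance, which is a simplification; the sketch does not learn the clusters themselves.

```python
import math

STATES = ["A", "B", "C", "E"]

def fit_chain(sessions, pseudo_count=1.0):
    """Estimate transition probabilities P(next | current) from page sequences."""
    counts = {s: {t: pseudo_count for t in STATES} for s in STATES}
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in STATES}
        for s in STATES
    }

def log_likelihood(session, chain):
    """Log-probability of the transitions in a session (first page ignored)."""
    return sum(math.log(chain[a][b]) for a, b in zip(session, session[1:]))

# Hypothetical sessions, already grouped into two clusters of users.
cluster_sessions = {
    1: [["A", "B", "C", "E"], ["A", "B", "E"], ["A", "B", "C", "E"]],
    2: [["A", "C", "C", "E"], ["A", "C", "E"], ["A", "C", "B", "E"]],
}
chains = {k: fit_chain(v) for k, v in cluster_sessions.items()}

# Score a new session under each cluster's chain.
new_session = ["A", "B", "C", "E"]
scores = {k: log_likelihood(new_session, chains[k]) for k in chains}
best_cluster = max(scores, key=scores.get)
print(scores, best_cluster)
```

A single shared chain would average these two navigation styles together; separate chains per cluster keep them distinct.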
This is the sequence-mining algorithm in SQL Server
Statistical Text Mining
– NYT: 330,000 articles
– CiteSeer: 600,000 abstracts
– Enron: 250,000 emails
– Pennsylvania Gazette: 80,000 articles, 1728-1800
– NSF/NIH: 100,000 grants
– 16 million Medline articles
Problems of Interest
– What topics do these documents “span”?
– Which documents are about a particular topic?
– How have topics changed over time?
– What does author X write about?
– Who is likely to write about topic Y?
– and so on…..
Collaborators in Text Research
Mark Steyvers, UCI
Chaitanya Chemudugunta, UCI
Tom Griffiths, UC Berkeley
Dave Newman, UCI
Michal Rosen-Zvi, IBM
[Diagram: Probabilistic Model and Words in Documents, linked by P(Data | Parameters)]
[Diagram: Probabilistic Model and Words in Documents, linked by P(Data | Parameters) in one direction and P(Parameters | Data) in the other]
The Multinomial Model for Words
[Diagram: φ and w_i, with a plate over Words]
Multiple Documents: One Multinomial
[Diagram: φ and w_i, with plates over Words and Documents]
One Multinomial per Document
[Diagram: φ and w_i, with plates over Words and Documents]
Clusters of Documents
[Diagram: φ, z, and w_i, with plates over Words and Documents]
The Statistical Topic Model
[Diagram: θ, φ, z, and w_i, with plates over Words and Documents]
The Statistical Topic Model
[Diagram: θ, φ, z, and w_i, with plates over Words and Documents]
θ: P(topic | doc)
φ: P(word | topic)
Topic Models
Documents = mixtures of topics
Topics = probability distributions over words
• Model = joint distribution over words, topics, docs
• Answering queries = computing conditional probabilities
• Topics are learned completely automatically from data
(no human intervention)
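A minimal sketch of the forward (generative) direction of such a model: each word in a document is produced by first picking a topic from the document's mixture, then picking a word from that topic. The topics, probabilities, and names below are hypothetical, and the sketch does not cover learning topics from data:

```python
import random

random.seed(3)

# Hypothetical topics: P(word | topic).
phi = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.6, "price": 0.3, "team": 0.1},
}
# Hypothetical mixture for one document: P(topic | doc).
theta = {"sports": 0.7, "finance": 0.3}

def draw(dist):
    """Sample one key from a dict mapping outcomes to probabilities."""
    items = list(dist)
    return random.choices(items, weights=[dist[k] for k in items], k=1)[0]

def generate_document(theta, phi, n_words):
    """Generate (topic, word) pairs: pick a topic, then a word from that topic."""
    doc = []
    for _ in range(n_words):
        z = draw(theta)
        w = draw(phi[z])
        doc.append((z, w))
    return doc

def word_prob(word, theta, phi):
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(theta[z] * phi[z].get(word, 0.0) for z in theta)

doc = generate_document(theta, phi, 20)
print([w for _, w in doc])
print(f"P(market | doc) = {word_prob('market', theta, phi):.2f}")
```

The `word_prob` function is an example of answering a query by computing a conditional probability from the joint model.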
Enron email data
250,000 emails
28,000 authors
1999-2002
Enron email: business topics
TOPIC 36
WORD (PROB.): FEEDBACK (0.0781), PERFORMANCE (0.0462), PROCESS (0.0455), PEP (0.0446), MANAGEMENT (0.03), COMPLETE (0.0205), QUESTIONS (0.0203), SELECTED (0.0187), COMPLETED (0.0146), SYSTEM (0.0146)
SENDER (PROB.): perfmgmt (0.2195), perf eval process (0.0784), enron announcements (0.0489), *** (0.0089), *** (0.0048)

TOPIC 72
WORD (PROB.): PROJECT (0.0514), PLANT (0.028), COST (0.0182), CONSTRUCTION (0.0169), UNIT (0.0166), FACILITY (0.0165), SITE (0.0136), PROJECTS (0.0117), CONTRACT (0.011), UNITS (0.0106)
SENDER (PROB.): *** (0.0288), *** (0.022), *** (0.0123), *** (0.0111), *** (0.0108)

TOPIC 23
WORD (PROB.): FERC (0.0554), MARKET (0.0328), ISO (0.0226), COMMISSION (0.0215), ORDER (0.0212), FILING (0.0149), COMMENTS (0.0116), PRICE (0.0116), CALIFORNIA (0.0110), FILED (0.0110)
SENDER (PROB.): *** (0.0532), *** (0.0454), *** (0.0384), *** (0.0334), *** (0.0317)

TOPIC 54
WORD (PROB.): ENVIRONMENTAL (0.0291), AIR (0.0232), MTBE (0.019), EMISSIONS (0.017), CLEAN (0.0143), EPA (0.0133), PENDING (0.0129), SAFETY (0.0104), WATER (0.0092), GASOLINE (0.0086)
SENDER (PROB.): *** (0.1339), *** (0.0275), *** (0.0205), *** (0.0166), *** (0.0129)
Enron: non-work topics…
TOPIC 66
WORD (PROB.): HOLIDAY (0.0857), PARTY (0.0368), YEAR (0.0316), SEASON (0.0305), COMPANY (0.0255), CELEBRATION (0.0199), ENRON (0.0198), TIME (0.0194), RECOGNIZE (0.019), MONTH (0.018)
SENDER (PROB.): chairman & ceo (0.131), *** (0.0102), *** (0.0046), *** (0.0022), general announcement (0.0017)

TOPIC 182
WORD (PROB.): TEXANS (0.0145), WIN (0.0143), FOOTBALL (0.0137), FANTASY (0.0129), SPORTSLINE (0.0129), PLAY (0.0123), TEAM (0.0114), GAME (0.0112), SPORTS (0.011), GAMES (0.0109)
SENDER (PROB.): cbs sportsline com (0.0866), houston texans (0.0267), houstontexans (0.0203), sportsline rewards (0.0175), pro football (0.0136)

TOPIC 113
WORD (PROB.): GOD (0.0357), LIFE (0.0272), MAN (0.0116), PEOPLE (0.0103), CHRIST (0.0092), FAITH (0.0083), LORD (0.0079), JESUS (0.0075), SPIRITUAL (0.0066), VISIT (0.0065)
SENDER (PROB.): crosswalk com (0.2358), wordsmith (0.0208), *** (0.0107), doctor dictionary (0.0101), *** (0.0061)

TOPIC 109
WORD (PROB.): AMAZON (0.0312), GIFT (0.0226), CLICK (0.0193), SAVE (0.0147), SHOPPING (0.0140), OFFER (0.0124), HOLIDAY (0.0122), RECEIVE (0.0102), SHIPPING (0.0100), FLOWERS (0.0099)
SENDER (PROB.): amazon com (0.1344), jos a bank (0.0266), sharperimageoffers (0.0136), travelocity com (0.0094), barnes & noble com (0.0089)
Enron: public-interest topics...
TOPIC 18
WORD (PROB.): POWER (0.0915), CALIFORNIA (0.0756), ELECTRICITY (0.0331), UTILITIES (0.0253), PRICES (0.0249), MARKET (0.0244), PRICE (0.0207), UTILITY (0.0140), CUSTOMERS (0.0134), ELECTRIC (0.0120)
SENDER (PROB.): *** (0.1160), *** (0.0518), *** (0.0284), *** (0.0272), *** (0.0266)

TOPIC 22
WORD (PROB.): STATE (0.0253), PLAN (0.0245), CALIFORNIA (0.0137), POLITICIAN Y (0.0137), RATE (0.0131), BANKRUPTCY (0.0126), SOCAL (0.0119), POWER (0.0114), BONDS (0.0109), MOU (0.0107)
SENDER (PROB.): *** (0.0395), *** (0.0337), *** (0.0295), *** (0.0251), *** (0.0202)

TOPIC 114
WORD (PROB.): COMMITTEE (0.0197), BILL (0.0189), HOUSE (0.0169), WASHINGTON (0.0140), SENATE (0.0135), POLITICIAN X (0.0114), CONGRESS (0.0112), PRESIDENT (0.0105), LEGISLATION (0.0099), DC (0.0093)
SENDER (PROB.): *** (0.0696), *** (0.0453), *** (0.0255), *** (0.0173), *** (0.0317)

TOPIC 194
WORD (PROB.): LAW (0.0380), TESTIMONY (0.0201), ATTORNEY (0.0164), SETTLEMENT (0.0131), LEGAL (0.0100), EXHIBIT (0.0098), CLE (0.0093), SOCALGAS (0.0093), METALS (0.0091), PERSON Z (0.0083)
SENDER (PROB.): *** (0.0696), *** (0.0453), *** (0.0255), *** (0.0173), *** (0.0317)
Examples of CiteSeer Topics
TOPIC 10
WORD (PROB.): SPEECH (0.1134), RECOGNITION (0.0349), WORD (0.0295), SPEAKER (0.0227), ACOUSTIC (0.0205), RATE (0.0134), SPOKEN (0.0132), SOUND (0.0127), TRAINING (0.0104), MUSIC (0.0102)
AUTHOR (PROB.): Waibel_A (0.0156), Gauvain_J (0.0133), Lamel_L (0.0128), Woodland_P (0.0124), Ney_H (0.0080), Hansen_J (0.0078), Renals_S (0.0072), Noth_E (0.0071), Boves_L (0.0070), Young_S (0.0069)

TOPIC 209
WORD (PROB.): PROBABILISTIC (0.0778), BAYESIAN (0.0671), PROBABILITY (0.0532), CARLO (0.0309), MONTE (0.0308), DISTRIBUTION (0.0257), INFERENCE (0.0253), PROBABILITIES (0.0253), CONDITIONAL (0.0229), PRIOR (0.0219)
AUTHOR (PROB.): Friedman_N (0.0094), Heckerman_D (0.0067), Ghahramani_Z (0.0062), Koller_D (0.0062), Jordan_M (0.0059), Neal_R (0.0055), Raftery_A (0.0054), Lukasiewicz_T (0.0053), Halpern_J (0.0052), Muller_P (0.0048)

TOPIC 87
WORD (PROB.): USER (0.2541), INTERFACE (0.1080), USERS (0.0788), INTERFACES (0.0433), GRAPHICAL (0.0392), INTERACTIVE (0.0354), INTERACTION (0.0261), VISUAL (0.0203), DISPLAY (0.0128), MANIPULATION (0.0099)
AUTHOR (PROB.): Shneiderman_B (0.0060), Rauterberg_M (0.0031), Lavana_H (0.0024), Pentland_A (0.0021), Myers_B (0.0021), Minas_M (0.0021), Burnett_M (0.0021), Winiwarter_W (0.0020), Chang_S (0.0019), Korvemaker_B (0.0019)

TOPIC 20
WORD (PROB.): STARS (0.0164), OBSERVATIONS (0.0150), SOLAR (0.0150), MAGNETIC (0.0145), RAY (0.0144), EMISSION (0.0134), GALAXIES (0.0124), OBSERVED (0.0108), SUBJECT (0.0101), STAR (0.0087)
AUTHOR (PROB.): Linsky_J (0.0143), Falcke_H (0.0131), Mursula_K (0.0089), Butler_R (0.0083), Bjorkman_K (0.0078), Knapp_G (0.0067), Kundu_M (0.0063), Christensen_J (0.0059), Cranmer_S (0.0055), Nagar_N (0.0050)
CHANGING TRENDS IN COMPUTER SCIENCE
[Figure: Topic Probability (0 to 0.012) versus Year (1990 to 2002); curves: WWW, INFORMATION RETRIEVAL]
CHANGING TRENDS IN COMPUTER SCIENCE
[Figure: Topic Probability (0 to 0.012) versus Year (1990 to 2002); curves: OPERATING SYSTEMS, WWW, PROGRAMMING LANGUAGES, INFORMATION RETRIEVAL]
HOT TOPICS: MACHINE LEARNING/DATA MINING
[Figure: Topic Probability (1 to 8, x 10^-3) versus Year (1990 to 2002); curves: CLASSIFICATION, REGRESSION, DATA MINING]
BAYES MARCHES ON
[Figure: Topic Probability (1.5 to 5.5, x 10^-3) versus Year (1990 to 2002); curve labels: BAYESIAN, PROBABILITY, STATISTICAL PREDICTION]
INTERESTING "TOPICS"
[Figure: Topic Probability (0 to 0.012) versus Year (1990 to 2002); curves: FRENCH WORDS (LA, LES, UNE, NOUS, EST), DARPA, MATH SYMBOLS (GAMMA, DELTA, OMEGA)]
Topic trends from New York Times
330,000 articles, 2000-2002
[Figure: three time-series panels, Jan00 to Jan03]
Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING
Pennsylvania Gazette
Joint work with Sharon Block, UC Irvine History Dept
Size   Labels Added     Most likely words in topic
6.3%   Republicanism    country public great people men liberty many let life friend spirit government
5.7%   Rhetoric         say might thing think without against own did know make well reason good
4.9%   Runaways         away servant reward old jacket whoever pair named paid run hat coat master
4.1%   Cloth for Sale   silk cotton ditto white black linen women cloth blue worsted men thread fine
3.8%   Real Estate      acres good land meadow plantation containing sold tract miles well premise
Analyzing Austen novels
[SENTIMENT] (3.4%) felt comfort feeling feel spirit mind heart point moment
ill letter beyond mother state never event evil fear impossible hope time idea
left situation poor distress possible hour end loss relief dearest suffering
[Figure: one panel per novel (Emma, Mansfield Park, Persuasion, Pride and Prejudice, Sense and Sensibility, Northanger Abbey); x-axis: time; y-axis: words]
Applications
• Automatically building domain-specific browsers “on the fly”
– Burns et al (2007) constructed an interactive visual browser, based on
topics, for papers at the Annual Society for Neuroscience Conference
– Kumar (2006) developed a browser for 40,00 MEDLINE documents
about schizophrenia
– Others in development
• Automated indexing in digital libraries
– REXA system uses topics to automatically index 1 million computer
science papers (McCallum et al, U Mass)
– California Digital Library (Newman et al, 2006)
• Exploratory analysis “beyond keywords”
Concluding Comments
• Web data analysis
– Tremendous opportunities and interesting problems
– Rich measurement of human behavior on a large scale
– In terms of Web data analysis, it’s about 1820-1850
• Probability and statistics remain highly relevant
• We need a new breed of “data scientist”
– fluent in both computer science and statistics
– not enough attention being paid to this in education
Further Reading
• Web Data Analysis
P. Baldi, P. Frasconi, and P. Smyth,
Modeling the Internet and the Web: Probabilistic Methods and Algorithms
Wiley, 2003
S. Chakrabarti
Mining the Web: Discovering Knowledge from Hypertext Data
Morgan Kaufmann, 2002
• Topic Modeling
M. Steyvers and T. Griffiths
Probabilistic topic models, 2006
(Good introductory article, available from Mark Steyvers’ Web page)