
Lada Adamic, HP Labs, Palo Alto, CA
Talk outline
Information flow through blogs
Information flow through email
Search through email networks
Search within the enterprise
Search in an online community
Implicit Structure and Dynamics of BlogSpace
Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose
• Blog use:
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net
• Blog structure: blog-to-blog linking
• Use + Structure
– Great to track “memes” (catchy ideas)
Approaches and uses of blog analysis
• Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?
• Ranking algorithms that take advantage of transmission
patterns
Tracking popularity over time
[Figure: popularity vs. time; labeled features: Slashdot Effect, BoingBoing Effect]
Blogdex, BlogPulse, etc. track the most popular links/phrases of the day
Different kinds of information have different popularity profiles
[Figure: four panels (Slashdot postings; major-news site (editorial content), back of the paper; front-page news; products, etc.); x-axis: days since first appearance (5, 10, 15); y-axis: % of hits received on each day since first appearance (ticks 0 to 1)]
Micro example: Giant Microbes
Microscale Dynamics
• What do we need to track specific info ‘epidemics’?
– Timings
– Underlying network
[Figure: blogs b1, b2, b3 arranged by time of infection, t0 to t1]
Microscale Dynamics
• Challenges
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media (email, voice)
– No links
[Figure: blogs b1, b2, b3, bn arranged by time of infection, t0 to t1, with uncertain (“?”) links]
Microscale Dynamics
who is getting info from whom
• Explicit blog to blog links (easy)
– Via links are even better
• Implicit/Inferred transfer (harder)
– Use ML algorithm for link inference problem
• Support Vector Machine (SVM)
• Logistic Regression
– What we can use
• Full text
• Blogs in common
• Links in common
• History of infection
Visualization
http://www-idl.hpl.hp.com/blogstuff
• Zoomgraph tool
– Using GraphViz (by AT&T) layouts
• Simple algorithm
– If single, explicit link exists, draw it
– Otherwise use ML algorithm
• Pick the most likely explicit link
• Pick the most likely possible link
• Tool lets you zoom around space, control
threshold, link types, etc.
Giant Microbes epidemic visualization
[Figure legend: via link, explicit link, inferred link, blog]
iRank
Find early sources of good information
using inferred information paths or timing
[Figure: blogs b1, b2, b3, b4, b5, …, bn, with labels “True source” and “Popular site”]
iRank Algorithm
• Draw a weighted edge for all pairs of blogs that cite the same URL
• Give higher weight to mentions closer together in time
• Run PageRank
• Control for ‘spam’
[Figure: time of infection, t0 to t1]
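The iRank steps listed on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exponential weighting with a hypothetical `half_life` parameter, the choice to point each edge from the later blog to the earlier one, and the function names `build_edges` and `pagerank` are all assumptions, and the spam-control step is omitted.

```python
from collections import defaultdict
from itertools import combinations

def build_edges(citations, half_life=24.0):
    """citations: dict url -> list of (blog, hours since the URL first appeared).

    Returns dict (later_blog, earlier_blog) -> weight. The weight halves for
    every `half_life` hours between the two mentions (assumed weighting).
    """
    weights = defaultdict(float)
    for mentions in citations.values():
        for (a, ta), (b, tb) in combinations(mentions, 2):
            if a == b or ta == tb:
                continue  # no direction can be inferred
            later, earlier = (a, b) if ta > tb else (b, a)
            weights[(later, earlier)] += 0.5 ** (abs(ta - tb) / half_life)
    return weights

def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over the edges from build_edges."""
    nodes = sorted({n for edge in weights for n in edge})
    out_total = defaultdict(float)
    for (src, _), w in weights.items():
        out_total[src] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by blogs with no outgoing edge is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0)
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for (src, dst), w in weights.items():
            new[dst] += damping * rank[src] * w / out_total[src]
        rank = new
    return rank
```

With edges pointing toward earlier mentions, rank accumulates at blogs that tend to post a URL before others do, which is one way to read “find early sources of good information”.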
Do Bloggers Kill Kittens?
02:00 AM Friday Mar. 05, 2004 PST Wired publishes:
"Warning: Blogs Can Be Infectious.”
7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:
"Bloggers' Plagiarism Scientifically Proven"
9:55 AM Friday Mar. 05, 2004 PST Metafilter announces
"A good amount of bloggers are outright thieves."
Information flow in social groups
Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler
Spread of disease is affected by the underlying network
[Figure: social network with nodes mike, mom, college friend, and co-workers]
Spread of computer viruses is affected by the underlying network
[Figure: social network with nodes mike, mom, college friend, and co-workers]
Difference between information flow and disease/virus spread
Viruses (computer and otherwise) are shared
indiscriminately (involuntarily)
Information is passed selectively from one host to another based on
knowledge of the recipient’s interests
Spread of information is affected by its content, potential recipients, and network topology
[Figure: social network with nodes mike, mom, college friend, and co-workers]
homophily: individuals with like interests associate with one another
[Figure: personal homepages at Stanford; x-axis: distance between personal homepages (0 to 20); y-axis: average similarity at the distance (0 to 1.2)]
The Model:
Decay in transmission probability as a function of the
distance m between potential target and originating node
$T(m) = (m+1)^{-b}\,T$
power-law implies slowest decay
[Figure: nodes at distances m=0, m=1, m=2 from the originating node]
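As a small worked illustration of the decay rule T(m) = (m+1)^(-b) T, the sketch below evaluates it for a few distances. The function name `decayed_transmissibility` and the sample values T = 0.5, b = 1 are assumptions chosen only for illustration.

```python
def decayed_transmissibility(T, m, b):
    """Transmission probability at distance m from the originating node,
    following T(m) = (m + 1) ** (-b) * T."""
    return T * (m + 1) ** (-b)

# With b = 0 there is no decay; with b = 1 the probability falls as 1 / (m + 1).
for m in range(3):
    print(m, decayed_transmissibility(0.5, m, 0), decayed_transmissibility(0.5, m, 1))
```

At m = 0 the probability equals the base value T, so the originating node's immediate contacts are unaffected by the decay.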
Virus, information transmission on a scale free network
$P(k) = C\,k^{-\alpha}\,e^{-k/\kappa}$
[Figure: outdegree distribution with $\alpha = 2.0$ fit; log-log axes; x-axis: outdegree k; y-axis: frequency P(k)]
Degree distribution of all senders of email passing through the HP email server
epidemics on scale free graphs
$10^6$ nodes, epidemic if 1% ($10^4$) infected
[Figure: critical threshold (0 to 1) vs. $\alpha$ (1 to 4); curves: $\kappa = \infty$, b = 0; $\kappa = 100$, b = 0; $\kappa = 100$, b = 1; curve labels: Wu et al. (2004), Newman (2002), Pastor-Satorras & Vespignani (2001)]
Study of the spread of URLs and attachments
40 participants (30 within HPL, 10 elsewhere in HP & other orgs)
6370 URLs and 3401 attachments cryptographically hashed
Question: How many recipients in our sample did each item reach?
caveats:
messages are deleted (still, the median number of messages > 2000)
non-uniform sample
Only forwarded messages are counted
[Figure: example of a forwarded message and forwarded URLs]
Results
average = 1.1 for attachments, and 1.2 for URLs
[Figure: number of items with so many recipients vs. number of recipients, log-log axes; email attachments with $x^{-4.1}$ fit; URLs with $x^{-3.6}$ fit; annotations: “ads at the bottom of hotmail & yahoo messages”, “short term expense control”]
Simulate transmission on email log
each message has a probability p of transmitting information
from an infected individual to the recipient
date        time      sender  recipient
02/19/2003  15:45:33  I-1     I-2
02/19/2003  15:45:33  I-1     I-3
02/19/2003  15:45:40  E-1     I-4
02/19/2003  15:45:52  I-5     E-2
02/19/2003  15:45:55  E-3     I-6
02/19/2003  15:45:58  I-7     I-8
02/19/2003  15:46:00  E-4     I-9
02/19/2003  15:46:05  I-10    I-11
02/19/2003  15:46:10  I-12    I-13
02/19/2003  15:46:10  I-12    I-14
02/19/2003  15:46:10  I-12    I-15
02/19/2003  15:46:14  I-16    E-5
...
I = internal node, E = external node
Simulation of information transmission on
the actual HP Labs email graph
an individual is infected if they receive a particular piece
of information
individuals remain infected for 24 hours
start by infecting one individual at random
every time an infected individual sends an email they have
a probability p of infecting the recipient
track epidemic over the course of a week, most run their
course in 1-2 days
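The simulation rules above could be sketched along these lines. This is a hedged illustration rather than the code used in the study: the function name `simulate`, the (timestamp, sender, recipient) log layout, infecting the seed at the time of the first log entry, and the rule that nobody is infected twice are all assumptions. The seed is passed in explicitly here, whereas the slides pick it at random.

```python
import random
from datetime import datetime, timedelta

def simulate(log, seed_node, p, infectious_for=timedelta(hours=24), rng=None):
    """Replay a time-sorted email log and return the set of infected people.

    log: list of (timestamp, sender, recipient) tuples, sorted by timestamp.
    p:   probability that one email from an infected sender infects the recipient.
    """
    rng = rng or random.Random()
    infected_at = {seed_node: log[0][0]}  # assumed: seed infected at log start
    for when, sender, recipient in log:
        caught = infected_at.get(sender)
        if caught is None or when - caught > infectious_for:
            continue  # sender never infected, or no longer infectious
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = when
    return set(infected_at)

t0 = datetime(2003, 2, 19, 15, 45, 33)
log = [
    (t0, "I-1", "I-2"),
    (t0 + timedelta(seconds=7), "E-1", "I-4"),
    (t0 + timedelta(hours=1), "I-2", "I-3"),
    (t0 + timedelta(hours=30), "I-3", "I-5"),  # I-3 is past its 24 hours here
]
print(sorted(simulate(log, "I-1", p=1.0)))
```

Averaging the size of the returned set over many random seeds and many runs gives an outbreak-size estimate for each value of p.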
Introduce a decay in the transmission probability
based on the hierarchical distance
$p = p_0\,h^{-1.75}$
[Figure: organizational hierarchy with nodes A and B, $h_{AB} = 5$; other nodes labeled distance 1 and distance 2]
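A possible way to compute a hierarchical distance and the decayed probability p = p0 h^(-1.75) is sketched below. The slides do not spell out how h is defined, so this sketch assumes a convention in which a person is at distance 1 from their manager and from anyone who shares that manager; the `manager` mapping and all function names are hypothetical.

```python
def chain_to_root(person, manager):
    """List of person, their manager, their manager's manager, and so on."""
    chain = [person]
    while chain[-1] in manager:
        chain.append(manager[chain[-1]])
    return chain

def h_distance(a, b, manager):
    """Hierarchical distance under the assumed convention (see lead-in)."""
    if a == b:
        return 0
    up_b = {node: j for j, node in enumerate(chain_to_root(b, manager))}
    for i, node in enumerate(chain_to_root(a, manager)):
        if node in up_b:
            j = up_b[node]
            if i == 0 or j == 0:
                return i + j      # one person is in the other's management chain
            return i + j - 1      # the hop between two co-reports counts as 1
    raise ValueError("no common manager")

def transmission_probability(p0, h, exponent=1.75):
    """p = p0 * h ** (-exponent), defined for h >= 1."""
    if h < 1:
        raise ValueError("h must be at least 1")
    return p0 * h ** (-exponent)

# Toy hierarchy: A and B sit three levels below a shared executive.
manager = {"A": "m1", "m1": "d1", "d1": "vp", "B": "m2", "m2": "d2", "d2": "vp"}
h = h_distance("A", "B", manager)
print(h, transmission_probability(0.5, h))
```

In the toy hierarchy the two people are at distance 5, so a message between them is far less likely to carry the information than one between direct co-workers.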
[Figure: average size of outbreak or epidemic (0 to 2500) vs. probability of transmission $p_0$ (0 to 1); 7119 potential recipients; curves: outbreak w/ decay, epidemic w/ decay, outbreak w/o decay, epidemic w/o decay]
Conclusions on info flow in social groups
Information spread typically does not reach epidemic proportions
Information is passed on to individuals with matching properties
The likelihood that properties match decreases with distance
from the source
Model gives a finite threshold
Results are consistent with observed URL & attachment frequencies
in a sample
Simulations following real email patterns also consistent
How to search in a small world
[Figure: map with labels NE and MA]
Milgram’s experiment:
Given a target individual and a particular property, pass the
message to a person you correspond with who is “closest” to the
target.
Small world experiment at Columbia
Dodds, Muhamad, Watts, Science 301, (2003)
email experiment conducted in 2002
18 targets in 13 different countries
24,163 message chains
384 reached their targets
average path length 4.0
Why study small world phenomena?
Curiosity:
Why is the world small?
How are people able to route messages?
Social Networking as a Business:
Friendster, Orkut, MySpace
LinkedIn, Spoke, VisiblePath
Six degrees of separation - to be expected
Pool and Kochen (1978) - average person has 500-1500 acquaintances
Ignoring clustering, other redundancy …
~ $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors
But networks are clustered:
my friends’ friends tend to be my friends
Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an
average shortest path close to that of a random graph
But how are people able to find short paths?
How to choose among hundreds of acquaintances?
Strategy:
Simple greedy algorithm - each participant chooses correspondent
who is closest to target with respect to the given property
Models
geography
Kleinberg (2000)
hierarchical groups
Watts, Dodds, Newman (2001), Kleinberg(2001)
high degree nodes
Adamic, Puniyani, Lukose, Huberman (2001), Newman(2003)
Spatial search
Kleinberg (2000)
“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”
S. Milgram, ‘The small world problem’, Psychology Today 1, 61 (1967)
nodes are placed on a lattice and
connect to nearest neighbors
additional links placed with
$f(d) \sim d(u,v)^{-r}$
if r = 2, can search in polylog ($< (\log N)^2$) time
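The lattice model and greedy routing described on this slide might look like the sketch below. It is a simplified illustration under stated assumptions: a small n x n grid without wraparound, Manhattan distance, and exactly one long-range link per node drawn with probability proportional to d^(-r). The function names are hypothetical.

```python
import random

def manhattan(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def build_long_links(n, r, rng):
    """Give every node of an n x n grid one long-range contact,
    chosen with probability proportional to distance ** (-r)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** (-r) for v in others]
        links[u] = rng.choices(others, weights=weights, k=1)[0]
    return links

def greedy_route(source, target, n, links):
    """Forward to whichever contact is closest to the target; return hop count."""
    current, steps = source, 0
    while current != target:
        x, y = current
        contacts = [(x + dx, y + dy)
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= x + dx < n and 0 <= y + dy < n]
        contacts.append(links[current])
        # A lattice neighbour is always one step closer, so this terminates.
        current = min(contacts, key=lambda v: manhattan(v, target))
        steps += 1
    return steps

links = build_long_links(10, 2, random.Random(0))
print(greedy_route((0, 0), (9, 9), 10, links))
```

Re-running with different values of r and larger grids is one way to see how the routing time depends on the link-length exponent.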
Kleinberg: searching hierarchical structures
‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001
Hierarchical network models:
h is the distance between two individuals in hierarchy
with branching b
f(h) ~ b-h
If  = 1, can search in O(log n) steps
Group structure models:
q = size of smallest group that two individuals belong to
f(q) ~ q-
If  = 1, can achieve in O(log n) steps
Identity and search in social networks
Watts, Dodds, Newman (2001)
individuals belong to hierarchically nested groups
multiple independent hierarchies coexist
$p_{ij} \sim \exp(-\alpha x)$
Identity and search in social networks
Watts, Dodds, Newman (2001)
There is an attrition rate r
Network is ‘searchable’ if a fraction q of messages reach the target
[Figure: curves for N=102400, N=204800, N=409600]
High degree search
Adamic et al. Phys. Rev. E, 64 46135 (2001)
[Figure: acquaintance network with Mary, Bob, and Jane: “Who could introduce me to Richard Gere?”]
[Figure: power-law graph, number of nodes found: 94, 67, 63, 54, 2, 6, 1]
[Figure: Poisson graph, number of nodes found: 93, 19, 15, 11, 7, 3, 1]
Scaling of search time with size of graph
Sharp cutoff at $k \sim N^{1/\tau}$, 2nd degree neighbors
[Figure: covertime for half the nodes vs. size of graph, log-log axes; random walk (0.37 fit); degree sequence (0.24 fit)]
Testing the models on social networks
(w/ Eytan Adar)
Use a well defined network:
HP Labs email correspondence over 3.5 months
Edges are between individuals who sent
at least 6 email messages each way
Node properties specified:
degree
geographical location
position in organizational hierarchy
Can greedy strategies work?
Strategy 1: High degree search
[Figure: degree distribution of all senders of email passing through the HP email server; outdegree distribution with $\alpha = 2.0$ fit; log-log axes; x-axis: outdegree; y-axis: frequency]
Filtered network
(6 messages sent each way)
Degree distribution no longer power-law, but Poisson
450 users
median degree = 10
mean degree = 13
average shortest path = 3
High degree search performance (poor):
median # steps = 16
mean = 40
[Figure: degree distribution p(k) vs. number of email correspondents, k]
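A rough sketch of a high-degree search strategy is given below. It is an illustration under assumptions rather than the procedure used in the study: each holder passes the message to its best-connected contact that has not yet seen it, hands the message back when stuck, and delivers directly once the target is among its contacts. The function name and the toy graph are hypothetical.

```python
def high_degree_search(graph, source, target):
    """graph: dict node -> set of neighbours (undirected).

    Returns the number of hops the message makes, or None if the
    target cannot be reached from the source.
    """
    visited = {source}
    path = [source]
    steps = 0
    while path:
        current = path[-1]
        if current == target:
            return steps
        if target in graph[current]:
            return steps + 1  # the holder knows the target directly
        unvisited = [v for v in graph[current] if v not in visited]
        if unvisited:
            # Prefer the contact with the most correspondents; ties by name.
            nxt = max(unvisited, key=lambda v: (len(graph[v]), str(v)))
            visited.add(nxt)
            path.append(nxt)
        else:
            path.pop()  # dead end: hand the message back
        steps += 1
    return None

graph = {
    "a": {"b", "hub"}, "b": {"a"},
    "hub": {"a", "c", "d", "e"}, "c": {"hub"}, "d": {"hub"},
    "e": {"hub", "t"}, "t": {"e"},
}
print(high_degree_search(graph, "a", "t"))
```

When degrees are spread narrowly, as in the filtered network above, few contacts stand out as hubs, which is consistent with the poor performance reported on this slide.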
Strategy 2:
Geography
Communication across corporate geography
87% of the 4000 links are between individuals on the same floor
[Figure: communication between floors 1U, 1L, 2U, 2L, 3U, 3L, 4U]
Cubicle distance vs. probability of being linked
[Figure: proportion of linked pairs vs. distance in feet, log-log axes; curves: measured, 1/r, 1/r^2; annotation: optimum for search]
Finding someone in a sea of cubicles
median = 7
mean = 12
[Figure: number of pairs (0 to 16000) vs. number of steps (0 to 20)]
Strategy 3: Organizational hierarchy
[Figure: two panels: email correspondence scrambled; actual email correspondence]
Example of search path
hierarchical distance = 5
search path distance = 4
[Figure: nodes along the search path labeled distance 2, distance 1, distance 1, distance 1]
Probability of linking vs. distance in hierarchy
[Figure: probability of linking (0 to 0.6) vs. hierarchical distance h (2 to 10); observed data with fit exp(-0.92*h)]
in the ‘searchable’ regime: $0 < \alpha < 2$ (Watts 2001)
Results
distance   median   mean
search     4        5.7 (4.7)
geodesic   3        3.1
org        6        6.1
random     28       57.4
[Figure: number of pairs (x 10^4) vs. number of steps in search (0 to 25)]
Group size vs. probability of linking
[Figure: probability of linking vs. group size g, log-log axes; observed data with fit $g^{-0.74}$; $g^{-1}$: optimum for search (Kleinberg 2001)]
Search Conclusions
Individuals associate on different levels into groups.
Group structure facilitates decentralized search using social ties.
HP Labs as a social network is searchable but not quite optimal.
searching using the organizational hierarchy is faster
than using physical location
A fraction of ‘important’ individuals are easily findable
Humans may be much more resourceful in executing search tasks:
making use of weak ties
using more sophisticated strategies
PeopleFinder2 – a search engine for HP people
Extract & disambiguate names from publicly available documents
Enrich information available about individuals
Search for them by topic
Identify knowledge communities from co-occurrence of names
Live Demo
If live demo fails:
Current PeopleFinder functionality
PeopleFinder2 info on a person
Extracted topics for a person
Social network
Social network visualization
Search for individuals by topic
Visualize knowledge network
Find social network paths to experts
To find out more:
(papers, slides, other research in the group)
Information dynamics group (IDL) at HP Labs:
http://www.hpl.hp.com/research/idl
List of publications
http://www.hpl.hp.com/personal/Lada_Adamic/research.html