School of Information University of Michigan SI 614 Empirical observations of networks over time (models to explain them another time) Perl tutorial Lecture 13

Download Report

Transcript School of Information University of Michigan SI 614 Empirical observations of networks over time (models to explain them another time) Perl tutorial Lecture 13

School of Information
University of Michigan
SI 614
Empirical observations of networks over time
(models to explain them another time)
Perl tutorial
Lecture 13
Outline (networks over time)
 university email network over time (recent Science
article)
 graphs over time (internet, citation, authorship, patents):
 densification & shrinking diameters
How does one choose new
acquaintances in a social network?
 triadic closure: choose a friend of friend
 homophily: choose someone with similar interests
 proximity: choose someone who is close spatially and
with whom you spend a lot of time
 seek novel information and resources
 connect outside of circle of acquaintances
 span structural holes between people who don’t know each other
 sometimes social ties also dissolve
 avoid conflicting relationships
 reason for tie is removed: common interest, activity
Empirical analysis of an evolving social
network
 Gueorgi Kossinets & Duncan J. Watts
 Science, Jan. 6th, 2006
 The data
 university email logs
 sender, recipient, timestamp
 no content
 43,553 undergraduate and graduate students, faculty, staff
 filtered out messages with more than 4 recipients (5% of
messages)
 14,584,423 messages remaining sent over a period of 355 days
(2003-2004 school year)
weighted ties
 wij = weight of the tie between individuals i and j
 m = # of messages from i to j in the time period between
(t-t) and t
 “geometric rate” – because rates are multiplied together
 high if email is reciprocated
 low if mostly one-way
 t serves as a relevancy horizon (30 days, 60 days…)
 60 days chosen as window is study because rate of tie
formation stabilizes after 60 days
 sliding window: compare networks day by day (but each
day represents an overlapping 60 day window)
cyclic closure & focal closure
shortest path distance between i and j
new ties that appeared
on day t
ties that were there
in the past 60 days
number of
common foci,
i.e. classes
cyclic closure & focal closure
distance between two people in the email graph
pairs that attend one or more classes together
do not attend classes together
Individuals who share at least one class are three times more likely to start
emailing each other if they have an email contact in common
If there is no common contact, then the probability of a new tie forming is
lower, but ~ 140 times more likely if the individuals share a class than if they
don’t
# triads vs. # foci
 Having 1 tie or 1 class in common yield equal probability
of a tie forming
 probability increases significantly for additional
acquaintances, but rises modestly for additional foci
>=1 class in common
no classes in common
>=1 tie in common
no ties in common
Multivariate analysis
the strength of ties
 the stronger the ties, the greater the likelihood of triadic
closure
 bridges are on average weaker than other ties
 but bridges are more unstable:
 may get stronger, become part of triads, or disappear
other observations
 properties such as degree distribution, average shortest
path, and size of giant component have seasonal
variation (summer break, start of semester, etc.)
 appropriate smoothing window (t) needed
 clustering coefficient, shape of degree distribution
constant
 but rank of individuals changes over time
Graphs over time: Densification Laws, Shrinking
Diameters and Possible Explanations
 Jurij Leskovec, Jon Kleinberg, & Christos Faloutsos
KDD’05
 analysis of 4 large data sets of time show
 increase in average node degree
 a shrinking of the diameter
 effective diameter defined as:
 distance d, such that 90% of pairs are within d hops of each
other
 more robust than diameter (maximum distance between any two
nodes) because it does not fall prey to degenerate structures
such as extremely long chains
internet
authorship
key observations
 densification
 e(t) ~ n(t)a
 shrinking diameters
 even correcting for ‘phantom’ nodes – nodes that are not in the data
because they arrived before the period of study
 models so far do not account for it
 Erdos-Renyi – grows as log(N)
 Scale free networks – log(log(N))
 nothing accounts for shrinking
 We’ll be covering two additional models in a couple of weeks
 community guided attachment
 forest fire model
Perl
 Quick-and-dirty (or sophisticated and elegant) scripting language
 easily read in and parse text
 (do some simple counting up)
 output text in a different format
 Choose a distribution of Perl
 usually already installed on campus unix/linux systems
 for windows/mac I would recommend ActivePerl (free, easy to
update/download modules)
 http://aspn.activestate.com/ASPN/Downloads/ActivePerl/
 Lots of good online tutorials
 three Powerpoint presentations by Bridget Thomson McInnes
 part 1: http://www.d.umn.edu/~bthomson/presentations/PerlPresentationPartI.ppt
 part 2: http://www.d.umn.edu/~bthomson/presentations/PerlPresentationPartII.ppt
 part 3: http://www.d.umn.edu/~bthomson/presentations/PerlPresentationPartIII.ppt
Converting a simple tab separated file into Pajek format
a simple tab separated fake file like the one I typed in from
the survey
margaret
charles
tanya
charles
yan margaret
tanya margaret
The file in Pajek format
*vertices 4
1 "margaret"
2 "charles"
3 "tanya"
4 "yan"
*Arcs
*Edges
1 2 1
1 3 1
2 4
the script (reading in the file)
open(IN,“dummysocnet.txt") || die "could not open input file\n";
$i = 0;
while($line = <IN>) {
chomp($line);
($node,@links) = split(/\t/,$line);
if (!defined($number{$node})) {
$number{$node} = ++$i;
}
for $to (@links) {
if (!defined($number{$to})) {
$number{$to} = ++$i;
}
$pair{$node}{$to} = 1;
}
}
close(IN);
$n = $i;
the script (outputting the Pajek file)
open(OUT,"> dummysocnet.net");
print OUT "*vertices $n\n";
for $node (sort {$number{$a}<=>$number{$b};} keys %number) {
print OUT " ".$number{$node}." \"$node\"\n";
}
print OUT "*Arcs\n";
print OUT "*Edges\n";
for $from (keys %pair) {
for $to ( sort keys %{ $pair{$from} } ) {
if (!$havealready{$from}{$to}) {
print OUT " $number{$from} $number{$to} 1\n";
}
$havealready{$from}{$to} = 1;
$havealready{$to}{$from} = 1;
}
}
close(OUT);
Another example: creating a regular lattice
 Creating the lattice
$n = 100;
$k = 2;
for ($i = 0; $i < $n; $i++) {
for ($j = 1; $j <= $k; $j++) {
$ii = $i+$j;
$neighbor = $ii % $n;
$links{$i} .= "\t$neighbor";
$neighbor = ($i - $j);
if ($neighbor < 0) {
$neighbor = $n + $neighbor;
}
$links{$i} .= "\t$neighbor";
}
}
Outputting the regular lattice to Pajek
open(OUT,"> smallworld.net");
print OUT "*vertices $n\n";
for ($i = 0; $i < $n; $i++) {
$j = $i+1;
print OUT " ".$j." \"$j\"\n";
}
print OUT "*Arcs\n";
print OUT "*Edges\n";
for ($i = 0; $i < $n; $i++) {
($null,@neighbors) = split(/\t/,$links{$i});
$j = $i+1;
for $neighbor (@neighbors) {
$nn = $neighbor+1;
if ($j < $nn) {
print OUT " $j $nn 1\n";
}
}
}
close(OUT);
messiest example: bipartite actor-movie network
from html files downloaded from the imdb database
files = `ls *.htm`;
@files = split(/\n/,$files);
for $file (@files) {
open(IN,"$file")||die "couldn't open $file\n";
$actor = $file;
$actor =~ s/\..*//;
$start = 0;
while($line=<IN>) {
chomp($line);
if ($line =~ /Actor<\/a> - filmography/) {
$start = 1;
}
if ($line =~ /Actress<\/a> - filmography/) {
$start = 1;
}
next if ($start == 0);
if ($line =~ /<\/ol/) {
$start = 0;
last;
}
if ($line =~ /title\/[^\/]+\/\">([^<]+)/) {
$title = $1;
$tnums{$actor} .= "\t$title";
$actorsfortitle{$title} .= "\t$actor";
}
}
print "$tnums{$actor}\n";
}
read in multiple .htm
files
identify the part of
the file that contains
the list of movies
for each movie in the
list
add movie to list
of movies for actor
add actor to list
of actors for movie
counting things up
for $movie (keys %actorsfortitle) {
next if (($movie =~ /departed/i)||($movie =~ /sinbad/i));
($null,@actors) = split(/\t/,$actorsfortitle{$movie});
if ($#actors > 0) {
$amoviewithactors{$movie} = 1;
}
ignore some movies
}
@actors = (keys %tnums);
@titles = (keys %amoviewithactors);
$numactors = $#actors+1;
$numtitles = $#titles+1;
$total = $numactors+$numtitles;
do a bit of cleanup
and make sure each
movie has some actors
count number of
actors (for Pajek)
count number of
movies (for Pajek)
defining movie and actor nodes for Pajek
open(OUT,"> actorsandmovies2.net");
print OUT "*vertices
$total $numactors\n";
$i = 1;
for $actor (@actors) {
print OUT "
$i \"$actor\"\n";
$vertexnum{$actor} = $i;
$i++;
}
for $title (@titles) {
print OUT "
$i \"$title\"\n";
$vertexnum{$title} = $i;
$i++;
}
defining edges for Pajek
print OUT "*Arcs\n";
print OUT "*Edges\n";
for $movie (keys %actorsfortitle) {
next if (($movie =~ /departed/i)||($movie =~ /sinbad/i));
($null,@actors) = split(/\t/,$actorsfortitle{$movie});
if ($#actors > 0) {
print "$movie:$actorsfortitle{$movie}\n";
for $actor (@actors) {
print OUT "
$vertexnum{$actor}
$vertexnum{$movie}
1\n";
}
}
}
close(OUT);
making a partition file for Pajek

e.g. state where actor was born
if ($line =~ /BornWhere\?.*?([^,]*),([^,<]*)\</) {
$actorstate{$actor} = $1;
}
$stateid = 0;
for $actor (sort keys %actorstate) {
if (!defined($stateid{$actorstate{$actor}})) {
$stateid{$actorstate{$actor}} = $stateid++;
}
}
open(OUT,"> actorsandmovies.net");
open(CLU,"> states.clu");
print OUT "*vertices
$total $numactors\n";
print CLU "*vertices
$numactors\n";
$i = 1;
for $actor (@actors) {
print OUT "
$i \"$actor\"\n";
print CLU "
$stateid{$actorstate{$actor}}\n";
$vertexnum{$actor} = $i;
$i++;
}
close(CLU);
output for movie data set
print "$actor\t$actorstate{$actor}\t$stateid{$actorstate{$actor}}\n";
EdwardNorton Massachusetts 2
GeorgeClooney Kentucky
3
SusanSarandon New York
4
MattDamon
Massachusetts 2
MarkWahlberg Massachusetts 2
RichardGere Pennsylvania 6
BradPitt
Oklahoma
0
JuliaRoberts Georgia
5
CatherineZetaJones
Wales 1
JenniferLopez New York
4
the file states.clu now contains:
*vertices
2
3
4
2
2
6
0
5
1
4
10
Formatting data for guess
 Use same actor dataset, this time pre-process to have a
unimodal network of just actors
for $movie (keys %actorsfortitle) {
next if (($movie =~ /departed/i)||($movie =~ /sinbad/i));
($null,@actors) = split(/\t/,$actorsfortitle{$movie});
for $a1 (@actors) {
for $a2 (@actors) {
next if ($a1 ge $a2);
$nummovies{$a1}{$a2}++;
}
}
}
output actor file for GUESS

output is straightforward – no need to count up the number of vertices or assign them
numerical IDs
open(OUT,"> actors.gdf");
print OUT "nodedef> name,state VARCHAR(32)\n";
for $actor (keys %actorstate) {
print OUT "$actor,$actorstate{$actor}\n";
}
print OUT "edgedef> node1,node2,nummovies INT,directed DEFAULT
false\n";
for $from (sort {$number{$a}<=>$number{$b};} keys %nummovies) {
for $to ( sort {$number{$a}<=>$number{$b};} keys %{
$nummovies{$from} } ) {
print OUT "$from,$to,$nummovies{$from}{$to},false\n";
}
}
close(OUT);
the gdf file
nodedef> name,state VARCHAR(32)
EdwardNorton, Massachusetts
GeorgeClooney, Kentucky
SusanSarandon, New York
MattDamon, Massachusetts
MarkWahlberg, Massachusetts
RichardGere, Pennsylvania
BradPitt, Oklahoma
JuliaRoberts, Georgia
CatherineZetaJones, Wales
JenniferLopez, New York
edgedef> node1,node2,nummovies INT,directed DEFAULT false
EdwardNorton,MarkWahlberg,1,false
EdwardNorton,RichardGere,1,false
EdwardNorton,JuliaRoberts,1,false
EdwardNorton,MattDamon,1,false
RichardGere,SusanSarandon,1,false
BradPitt,EdwardNorton,1,false
BradPitt,GeorgeClooney,3,false
BradPitt,SusanSarandon,1,false
BradPitt,JuliaRoberts,5,false
BradPitt,CatherineZetaJones,1,false
BradPitt,MattDamon,3,false
GeorgeClooney,MarkWahlberg,2,false
GeorgeClooney,JuliaRoberts,3,false
GeorgeClooney,JenniferLopez,1,false
GeorgeClooney,MattDamon,4,false
JuliaRoberts,RichardGere,2,false
JuliaRoberts,SusanSarandon,1,false
JuliaRoberts,MattDamon,4,false
JenniferLopez,RichardGere,1,false
JenniferLopez,SusanSarandon,1,false
JenniferLopez,MattDamon,1,false
CatherineZetaJones,GeorgeClooney,2,false
CatherineZetaJones,RichardGere,1,false
CatherineZetaJones,JuliaRoberts,2,false
CatherineZetaJones,MattDamon,1,false