Communities and Clustering
in some Social Networks
Guido Caldarelli
SMC CNR-INFM Rome
INTRODUCTION
Summary
1 Introduction to basic notions of graphs and clustering
2 Introduction to clustering methods based on similarity/centrality
3 Introduction to clustering methods based on spectral analysis
4 Case study: the word association network
5 Case study: Wikipedia
6 Conclusions and advertisements
Guido Caldarelli, Communities and Clustering in Some social Networks
NetSci 2007 New York, May 20th 2007
1.0 Basic matrix notation
0

1
A
0

1

1
0
0
1
0
0
0
1
1

1
1

0 
0

1
A
0

0

0
0
0
1
4
1
0
0
0
1
1

0
0

0 
4
2
 k1 0

 0 k2
K 
... ...

0 0

1
3
...
...
...
...
  a1 j
0   j 1,n
 
0  0

...   ...
 
kn   0


2
0
j 1,n
...
0



...
0 

...
... 
...  anj 
j 1,n

...
a
2j
3
0
1.1 Clusters and Communities
Generally a cluster corresponds to a community
Some communities are hard to detect with clustering analysis
1.2 Small graphs
In order to detect communities, clustering is a good clue
• Clustering Coefficient
• Motifs
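As an illustration of the first item, the sketch below computes the local clustering coefficient of a vertex, assuming the standard definition (the fraction of pairs of neighbours that are themselves connected), which the slide does not spell out. The function name `clustering_coefficient` is hypothetical; the matrix is the undirected four-vertex example of section 1.0, with vertices renumbered from 0.

```python
def clustering_coefficient(A, i):
    """Fraction of pairs of neighbours of vertex i that are linked to each other.

    A is a symmetric 0/1 adjacency matrix given as a list of lists.
    Vertices with fewer than two neighbours get coefficient 0 (a convention).
    """
    neighbours = [j for j, a in enumerate(A[i]) if a and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count the edges among the neighbours of i.
    links = sum(
        1
        for x in range(k)
        for y in range(x + 1, k)
        if A[neighbours[x]][neighbours[y]]
    )
    return 2.0 * links / (k * (k - 1))


# Undirected example graph: edges 0-1, 0-3, 1-3, 2-3.
A = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
coefficients = [clustering_coefficient(A, i) for i in range(4)]
```

Vertices 0 and 1 sit in a triangle and get coefficient 1, while the most connected vertex 3 gets a lower value because only one of its three neighbour pairs is linked.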
1.3 Hubs and Authorities
Sometimes vertices differ from each other according to their function.
• HITS
• Hubs are those web pages that point to a large number of authorities (i.e. they have a large number of outgoing edges).
• Authorities are those web pages pointed to by a large number of hubs (i.e. they have a large number of incoming edges).
Kleinberg, J.M. (1999).
Authoritative sources in a hyperlinked environment.
Journal of the ACM, 46, 604–632.
1.3 Hubs and Authorities
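If every page i has an authority score U_i and a hub score H_i, then
\[
U_i = \sum_{j \to i} H_j, \qquad H_i = \sum_{i \to j} U_j
\]
where j → i means that page j links to page i. In matrix form, with a_{ij} = 1 when page i links to page j,
\[
U = A^T H, \qquad H = A U
\quad \Rightarrow \quad
U = A^T A \, U, \qquad H = A A^T H
\]
We can divide the pages according to their value of U or H. These values are obtained (up to normalization) from the principal eigenvectors of the matrices A^T A and A A^T respectively.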
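The coupled equations for U and H can be iterated numerically. The following is a minimal sketch of such a power iteration, assuming the convention a_{ij} = 1 when page i links to page j and a graph with at least one edge; the function name `hits`, the iteration count and the toy graph are illustrative choices, not taken from the slides.

```python
import numpy as np


def hits(A, iterations=100):
    """Power iteration for authority (U) and hub (H) scores.

    A[i, j] = 1 when page i links to page j (an assumed convention).
    Scores are normalised to unit Euclidean length at every step.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    U = np.ones(n)
    H = np.ones(n)
    for _ in range(iterations):
        U = A.T @ H          # authorities collect the scores of hubs pointing to them
        H = A @ U            # hubs collect the scores of the authorities they point to
        U /= np.linalg.norm(U)
        H /= np.linalg.norm(H)
    return U, H


# Toy graph: pages 0 and 1 both link to pages 2 and 3.
A = np.zeros((4, 4))
for source, target in [(0, 2), (0, 3), (1, 2), (1, 3)]:
    A[source, target] = 1.0

U, H = hits(A)
```

In this toy graph pages 2 and 3 come out as pure authorities and pages 0 and 1 as pure hubs.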
TOPOLOGICAL ANALYSIS
2.1 Agglomerative Methods
One way to cluster vertices is to find similarities between them. One “topological” way is to consider their neighbours. One can then define a distance x given by
\[
x_{ij}^{S}
= \frac{N(S_i \cup S_j) - N(S_i \cap S_j)}{N(S_i \cup S_j) + N(S_i \cap S_j)}
= \frac{\sum_k a_{ik} + \sum_k a_{jk} - 2 \sum_k a_{ik} a_{jk}}{\sum_k a_{ik} + \sum_k a_{jk}}
\]
where S_i is the set of neighbours of vertex i and N(S) is the number of elements of the set S.
Brun, et al (2003).
Functional classification of proteins for the prediction of
cellular function from a protein-protein interaction network.
Genome Biology, 5, R6 1–13.
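The neighbour-based distance can be evaluated directly from the adjacency matrix. The sketch below is one possible reading of the formula, taking S_i to be the set of vertices k with a_{ik} = 1; the function name `neighbour_distance` is hypothetical, and the matrix is the undirected four-vertex example of section 1.0 with vertices renumbered from 0.

```python
def neighbour_distance(A, i, j):
    """Distance between vertices i and j based on their shared neighbours.

    Returns 0 when the two neighbourhoods coincide and 1 when they are disjoint.
    At least one of the two vertices must have a neighbour.
    """
    n = len(A)
    k_i = sum(A[i])
    k_j = sum(A[j])
    common = sum(A[i][k] * A[j][k] for k in range(n))
    return (k_i + k_j - 2 * common) / (k_i + k_j)


# Undirected example graph: edges 0-1, 0-3, 1-3, 2-3.
A = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
d_01 = neighbour_distance(A, 0, 1)  # share one of their two neighbours each
d_02 = neighbour_distance(A, 0, 2)
d_23 = neighbour_distance(A, 2, 3)  # no common neighbour
```

An agglomerative method would then repeatedly merge the pair of vertices (or groups) at the smallest distance.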
2.2 Divisive Methods: betweenness
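The algorithm of Girvan and Newman recursively removes the edge with the largest edge-betweenness in the graph.
The betweenness is a measure of the centrality of a vertex/edge in a graph:
\[
b(i) = \sum_{\substack{j,l = 1,n \\ i \neq j \neq l}} \frac{\sigma_{jl}(i)}{\sigma_{jl}}
\]
where \sigma_{jl} is the number of shortest paths between j and l, and \sigma_{jl}(i) is the number of those paths that pass through i.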
Girvan, M. and Newman, M.E.J. (2002).
Community structure in social and biological networks.
Proc. Natl. Acad. of Science (USA), 99, 7821–7826.
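A single divisive step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the graph is unweighted, undirected and without self-loops, edge betweenness is accumulated with a breadth-first search from every vertex (in the style of Brandes' algorithm, a swapped-in technique), and the helper names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Shortest-path betweenness of every edge; adj maps vertex -> set of neighbours."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):    # accumulate from the farthest vertices back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair of endpoints was counted twice.
    return {edge: value / 2.0 for edge, value in bet.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def girvan_newman_step(adj):
    """Remove highest-betweenness edges until the number of components grows."""
    adj = {v: set(nb) for v, nb in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(graph)[frozenset((2, 3))]
communities = girvan_newman_step(graph)
```

Applying `girvan_newman_step` again to each resulting component, and recording the order of the splits, yields the dendrogram mentioned in the examples.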
2.3 Examples
Applied to a more complicated network, the procedure produces a dendrogram of the community structure.
(a) friendship network from Zachary’s karate
club study (26). Nodes associated with the club
administrator’s faction are drawn as circles, those
associated with the instructor’s faction are drawn
as squares. (b) Hierarchical tree showing the
complete community structure. (c) Hierarchical
tree calculated by using edge-independent path
counts, which fails to extract the known
community structure of the network.
2.3 Examples
A typical example is the e-mail network. Below is the case study of the University of Tarragona (Spain). Different colors correspond to different departments.
Guimerà, R., Danon, L., Diaz-Guilera, A., Giralt, F., and Arenas, A. (2003).
Self-similar community structure in organisations.
Physical Review E, 68, 065103.
2.4 Random walks and communities
Random walks on graphs are at the basis of the PageRank algorithm (Google): the larger the probability that a walker visits a certain page, the greater the interest of that page.
Random walks can also be used to detect clusters in graphs. The idea is that the more closed a subgraph is, the longer a random walker needs to escape from it.
One of the heuristic algorithms based on random walks is the Markov Cluster (MCL) algorithm. The complete description and code are available at
http://micans.org/mcl
• Start from the Normal Matrix.
• Through matrix manipulation (power), obtain the matrix of n-step connections.
• Enhance intracluster passages by raising the elements to a certain power, and then normalize.
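The three steps above can be sketched with NumPy as follows. This is a simplified illustration, not the reference implementation distributed at micans.org: it assumes an undirected graph, adds self-loops and normalises columns (common MCL conventions that the slide does not state), and the parameter values, the threshold `1e-6` and the function name `mcl` are arbitrary choices.

```python
import numpy as np


def mcl(A, expansion=2, inflation=2.0, max_iter=100, tol=1e-9):
    """Simplified Markov Cluster iteration on a symmetric 0/1 adjacency matrix."""
    M = np.asarray(A, dtype=float) + np.eye(len(A))   # self-loops (assumed convention)
    M /= M.sum(axis=0, keepdims=True)                 # column-stochastic matrix
    for _ in range(max_iter):
        previous = M
        M = np.linalg.matrix_power(M, expansion)      # expansion: multi-step walks
        M = M ** inflation                            # inflation: favour strong flows
        M /= M.sum(axis=0, keepdims=True)             # renormalise the columns
        if np.allclose(M, previous, atol=tol):
            break
    # Each row that keeps some weight defines a cluster: the columns it attracts.
    clusters = set()
    for row in M:
        members = np.flatnonzero(row > 1e-6)
        if members.size:
            clusters.add(frozenset(int(m) for m in members))
    return clusters


# Toy graph: two disconnected triangles (0,1,2) and (3,4,5).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

clusters = mcl(A)
```

Adding a bridging edge such as 2-3 to the toy graph and varying `inflation` is a simple way to explore how the granularity of the clustering changes.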
SPECTRAL ANALYSIS
3.1 The functions of the adjacency matrix
0

1
A
0

1

1
0
0
1
0
0
0
1
1

1
1

0 
 k1 0

 0 k2
K 
... ...

0 0

4
1
2
...
...
...
...
  a1 j
0   j 1,n
 
0  0

...   ...
 
kn   0


0
a
j 1,n



...
0 

...
... 
...  anj 
j 1,n

...
2j
...
0
3
N  K 1 A
Normal Matrix
LKA
Laplacian Matrix
 a11 / k1

a / k
N   21 2
...

a / k
 n1 n
a12 / k1
a22 / k2
...
an 2 / kn
 a12
 k1

k2
 a
L   21
...
...

 a
 n1  an 2
... a1n / k1 

... a2 n / k2 
...
... 

... ann / kn 
...  a1n 

...  a2 n 
... ... 

... kn 
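For the four-vertex example, these matrices can be built in a few lines. The sketch below uses NumPy and assumes every vertex has at least one neighbour, so that K is invertible; the variable names are illustrative.

```python
import numpy as np

# Adjacency matrix of the undirected four-vertex example.
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

k = A.sum(axis=1)            # degrees k_i = sum_j a_ij
K = np.diag(k)               # degree matrix
N = np.linalg.inv(K) @ A     # Normal matrix N = K^-1 A (each row sums to 1)
L = K - A                    # Laplacian matrix L = K - A (each row sums to 0)
```

Each row of N sums to one, which is what allows its entries to be read as transition probabilities of a random walk.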
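3.1 The functions of the adjacency matrix
\[
L = \begin{pmatrix}
\sum_{j=1,n} a_{1j} & -a_{12} & \dots & -a_{1n} \\
-a_{21} & \sum_{j=1,n} a_{2j} & \dots & -a_{2n} \\
\dots & \dots & \dots & \dots \\
-a_{n1} & -a_{n2} & \dots & \sum_{j=1,n} a_{nj}
\end{pmatrix},
\qquad
N = \begin{pmatrix}
0 & a_{12}/k_1 & \dots & a_{1n}/k_1 \\
a_{21}/k_2 & 0 & \dots & a_{2n}/k_2 \\
\dots & \dots & \dots & \dots \\
a_{n1}/k_n & a_{n2}/k_n & \dots & 0
\end{pmatrix},
\qquad
f = \begin{pmatrix} f_1 \\ f_2 \\ \dots \\ f_n \end{pmatrix}
\]
If f' = Lf, then
\[
f'_i = -\sum_{j=1,n} a_{ij} f_j + k_i f_i = -\nabla^2 f_i
\]
so that L acts on f as minus the discrete Laplacian on the graph.
The elements of matrix N give the probability with which a field f passes from a vertex i to its neighbours.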
3.2 The block properties in clustered graphs
In a very clustered graph,
the adjacency matrix can be
put in a block form.
[Figure: example of a 19 × 19 adjacency matrix of a clustered graph arranged in block form; the nonzero entries concentrate around the diagonal.]
3.2 The block properties in clustered graphs
Given this probabilistic interpretation of the matrix N, we have a series of results. For example:
• One eigenvalue is equal to one.
• The corresponding eigenvector is constant.
Consider the case of disconnected subclusters: the matrix N is made of blocks, and a general eigenvector with eigenvalue one is obtained by joining the eigenvectors of the individual blocks. It is constant on each block, but the constant can be different from block to block!
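The block property can be checked numerically on a small disconnected graph. The sketch below is a toy verification: the graph (a triangle plus a separate edge) and the test vector are arbitrary illustrative choices.

```python
import numpy as np

# Two disconnected subclusters: a triangle (0,1,2) and a single edge (3,4).
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

N = A / A.sum(axis=1, keepdims=True)     # Normal matrix, N = K^-1 A

# The eigenvalue 1 appears once per block.
eigenvalues = np.linalg.eigvals(N)
unit_eigenvalues = int(np.sum(np.isclose(eigenvalues, 1.0)))

# A vector that is constant on each block, with different constants,
# is left unchanged by N.
v = np.array([2.0, 2.0, 2.0, -7.0, -7.0])
is_eigenvector = bool(np.allclose(N @ v, v))
```

When the blocks are weakly connected instead of disconnected, the exact degeneracy is lifted, but eigenvectors with eigenvalue close to one are expected to remain approximately constant on each block.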
3.3 Eigenvalues and Communities
It is possible to express the eigenvector problem as the search for a minimum under a constraint:
1. Define a fictitious quantity x for the sites of the graph
2. Define a suitable function z on these x’s (a “distance”)
3. Define a suitable constraint on these x’s (to avoid having all equal or all 0)
For example
\[
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij}
\]
where the x_i are values assigned to nodes, with some constraint expressed by
\[
\sum_{i,j=1,N} x_i x_j m_{ij} = 1 \qquad (A)
\]
Stationary points of z(x) + constraint (A) → Lagrange multiplier
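3.3 Eigenvalues and Communities
\[
z(x) = \sum_{i,j=1,N} (x_i - x_j)^2 w_{ij},
\qquad
\sum_{i,j=1,N} x_i x_j m_{ij} = 1
\]
The stationary points satisfy, with Lagrange multiplier \lambda,
\[
\frac{\partial z(x)}{\partial x_i} - \lambda \frac{\partial}{\partial x_i} \sum_{k,l=1,N} x_k x_l m_{kl} = 0
\]
For symmetric w_{ij} and m_{ij},
\[
\frac{\partial z(x)}{\partial x_i} = 4 \Big( x_i \sum_{j=1,N} w_{ij} - \sum_{j=1,N} w_{ij} x_j \Big),
\qquad
\frac{\partial}{\partial x_i} \sum_{k,l=1,N} x_k x_l m_{kl} = 2 \sum_{j=1,N} x_j m_{ij}
\]
so that
\[
(K_w - A_w) x = \mu M x \qquad (\mu = \lambda / 2)
\]
where A_w is the matrix of the weights w_{ij} and K_w is the diagonal matrix with entries \sum_j w_{ij}.
\[
M = K_w: \quad K_w^{-1} A_w x = (1 - \mu) x
\]
Lagrange Multiplier = Normal Eigenvalue problem
\[
M = 1: \quad (K_w - A_w) x = \mu x
\]
Lagrange Multiplier = Laplacian Eigenvalue problem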
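As a small numerical illustration of the Normal eigenvalue problem, the sketch below takes unit weights (w_{ij} = a_{ij}, an assumption), computes the eigenvector of N = K^{-1}A belonging to the second largest eigenvalue, and splits the vertices by the sign of their component. The toy graph and the function name `second_eigenvector` are illustrative choices.

```python
import numpy as np


def second_eigenvector(A):
    """Eigenpair of N = K^-1 A with the second largest eigenvalue."""
    A = np.asarray(A, dtype=float)
    N = A / A.sum(axis=1, keepdims=True)
    values, vectors = np.linalg.eig(N)
    order = np.argsort(-values.real)         # sort eigenvalues in decreasing order
    second = order[1]
    return values.real[second], vectors[:, second].real


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

value, x = second_eigenvector(A)

# Vertices whose component has the same sign as vertex 0 form one community.
community_of_0 = {i for i in range(6) if x[i] * x[0] > 0}
```

The components of x take similar values inside each triangle and opposite signs across the bridge, which is the plateau structure exploited in the word association analysis.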
WORD ASSOCIATION NETWORK
4.1 The experimental data
The data are collected through a psychological experiment:
persons (about 100) are given a single word as a stimulus, e.g. “House”. They must answer with the first word that comes to their mind, e.g. “Family”. Answers are later given as new stimuli, so that a network of average associations forms.
A path from “Volcano”
to “Ache”
Steyvers, M. and Tenenbaum, J.B. (2005).
The large scale structure of semantic networks:
Statistical analyses and a model of semantic growth.
Cognitive Science, 29, 41–78.
4.1 The experimental data
The number of connections
(i.e. the degree of nodes) is
power-law distributed
Capocci, A., Servedio, V. D. P., Caldarelli, G., and Colaiori, F. (2005).
Detecting communities in large networks.
Physica A, 352, 669–676.
4.2 The community structure
Therefore we expect similar words to be on the same plateau.
We can measure the correlation between the values of
various vertices averaged over 10 different eigenvectors.
Word          Corr.     Word          Corr.     Word          Corr.
science       1         literature    1         piano         1
scientific    0.994     dictionary    0.994     cello         0.993
chemistry     0.990     editorial     0.990     fiddle        0.992
physics       0.988     synopsis      0.988     viola         0.990
concentrate   0.973     words         0.987     banjo         0.988
thinking      0.973     grammar       0.986     saxophone     0.985
test          0.973     adjective     0.983     director      0.984
lab           0.969     chapter       0.982     violin        0.983
brain         0.965     prose         0.979     clarinet      0.983
equation      0.963     topic         0.976     oboe          0.983
examine       0.962     English       0.975     theater       0.982
WIKIPEDIA
5.1 Introduction
http://www.wikipedia.org
A Nature investigation aimed to find out whether Wikipedia is an authoritative source of information compared with established sources such as Encyclopedia Britannica.
Among 42 entries tested, the difference in accuracy was not particularly great:
• the average science entry in Wikipedia contained around four inaccuracies;
• the one in Britannica, about three.
On the other hand, the articles on Wikipedia are on average longer than those of Britannica, which implies a lower rate of errors per unit of text in Wikipedia.
5.2 The network properties
We generated six wikigraphs, wikiEN, wikiDE, wikiFR, wikiES, wikiIT and wikiPT, from the English, German, French, Spanish, Italian and Portuguese datasets, respectively. The graphs were obtained from an old dump of June 13, 2004. We are not using the current data due to disk space restrictions: the English dataset of June 2005 is more than 36 GB compressed, that is about 200 GB expanded.
5.2 The network properties
The degree distribution shows fat tails that can be approximated by a power-law function of the kind
\[
P(k) \sim k^{-\gamma}
\]
where the exponent is the same for both in-degree and out-degree.
In the case of the WWW
\[
2 \le \gamma_{in} \le 2.1
\]
[Figure: in-degree (empty symbols) and out-degree (filled symbols) occurrence distributions for the Wikigraph in English (o) and Portuguese.]
Capocci, A., et al. (2006).
Preferential attachment in the growth of social
networks: The internet encyclopedia Wikipedia.
Physical Review E, 74, 036116
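One common way to estimate such an exponent from data is a maximum-likelihood fit of the tail. The sketch below is not the procedure used in the cited paper (the slides do not say how the exponent was obtained): it assumes a continuous power law above a chosen cutoff `k_min`, and it is checked here on synthetic samples only. The function name `powerlaw_exponent_mle` and all parameter values are illustrative.

```python
import math
import random


def powerlaw_exponent_mle(samples, k_min):
    """Maximum-likelihood exponent of P(k) ~ k^-gamma for k >= k_min.

    Uses the continuous approximation gamma = 1 + n / sum(ln(k / k_min));
    the tail must contain at least one value strictly above k_min.
    """
    tail = [k for k in samples if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)


# Synthetic check: draw samples from a continuous power law with a known
# exponent by inverse-transform sampling, then recover the exponent.
rng = random.Random(0)
true_gamma = 2.1
k_min = 1.0
samples = [
    k_min * (1.0 - rng.random()) ** (-1.0 / (true_gamma - 1.0))
    for _ in range(20000)
]
estimate = powerlaw_exponent_mle(samples, k_min)
```

For real degree sequences, which are integer-valued, the estimate depends on the choice of `k_min`, so it is worth repeating the fit for several cutoffs.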
5.2 The network properties
As regards the assortativity
(as measured by the average
degree of the neighbours of a
vertex with degree k) there is
no evidence of any assortative
behaviour.
[Figure: the average neighbours’ in-degree, computed along incoming edges, as a function of the in-degree for the English (o) and Portuguese Wikigraphs.]
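The quantity used here, the average degree of the neighbours of vertices with degree k, can be computed as in the sketch below. For brevity it treats an undirected graph, a simplification with respect to the directed in-degree measure of the slide; the function name `average_neighbour_degree` and the star-shaped toy graph are illustrative.

```python
from collections import defaultdict


def average_neighbour_degree(adj):
    """Average degree of the neighbours, grouped by the degree k of the vertex.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Returns a dict k -> average neighbour degree over vertices of degree k.
    """
    degree = {v: len(nb) for v, nb in adj.items()}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, nb in adj.items():
        if not nb:
            continue  # isolated vertices have no neighbours to average over
        mean_nb_degree = sum(degree[w] for w in nb) / len(nb)
        sums[degree[v]] += mean_nb_degree
        counts[degree[v]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Toy graph: a star with centre 0 and four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
knn = average_neighbour_degree(star)
```

A curve that decreases with k, as in the star, signals disassortative mixing, an increasing one signals assortative mixing, and a flat one signals no degree correlation.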
5.3 The growth of Wikipedia
Given the history of growth, one can verify the hypothesis of preferential attachment. This is done by means of the histogram P(k), which gives the number of vertices (whose degree is k) acquiring new connections at time t. This quantity is weighted by the factor
N(t)/n(k,t)
[Figure: English (o) and Portuguese; white = in-degree, filled = out-degree.]
We find preferential attachment for both in-degree and out-degree.
5.4 The communities in Wikipedia
Taxonomy
The categorization provided by Wikipedia imposes a taxonomy on the pages.
5.4 The Communities in Wikipedia
Given the different wikigraphs, one can compute the frequency of the category sizes in the various systems.
5.4 The Communities in Wikipedia
Similarly, the cluster size frequency distribution (computed with the MCL algorithm) can also be considered.
Qualitatively, the agreement is rather good.
But are they the same?
5.4 The Communities in Wikipedia
NOT REALLY! The power-law shape is
probably a very common feature for any
categorization
SUMMARY
Communities represent an important categorization of graphs.
Methods to detect them vary according to the specific case study:
• SMALL GRAPHS (motifs, clustering coefficient)
• LARGE GRAPHS
  • FUNCTION OF VERTICES (HITS, Vertex Similarity)
  • CENTRALITY (Girvan Newman Algorithms)
  • DIFFUSION ON THE GRAPH
    • MCL Algorithm
    • Spectral analysis of the stochastic matrices associated with the graph
SHAMELESS ADVERTISEMENT 
http://www.complexnetworks.net