Discovering opportunities in content

An Introduction to Social Network Analysis
Fulvio D’Antonio
NARG: Network Analysis Research Group
DII - Dipartimento di Ingegneria dell'Informazione
Università Politecnica delle Marche
Outline
• What is a social network?
• A little history…
• Modelling social networks with random graphs
• Link prediction
• Content-based social networks
What is a Social Network?
• Networks in which nodes and ties model social phenomena.
• Generally represented using graphs
• Different kinds of relationships:
◦ Static (kinship, friendship, similarity, …)
◦ Dynamic (information flow, material flow, …)
History
• In the 19th century Durkheim introduces the concept of “social facts”
◦ phenomena that are created by the interactions of individuals, yet constitute a reality that is independent of any individual actor.
• In the 1930s, Moreno:
◦ the systematic recording and analysis of social interaction in small groups, especially classrooms and work groups (sociometry)
◦ He invents the “sociogram” (a graphical representation of interpersonal relationships)
History (2):
Milgram’s experiment (1960s)
• People in Nebraska were each given a letter addressed to a target person in Boston, Massachusetts, along with demographic information (name, address, profession) on this person.
• They were asked to get the letter to the target person by forwarding it to other people they knew personally
• Average number of hops to get the letter to the target: 6
◦ “six degrees of separation”
History (3):
The Strength of Weak Ties
• Granovetter
◦ “The Strength of Weak Ties” (1973)
▪ considered one of the most important sociology papers written in recent decades
◦ He argued that “weak ties” could actually be more advantageous in politics or in seeking employment than “strong ties”
◦ Some reasons:
▪ They allow you to reach a wider audience.
▪ Information coming from weak ties is “fresh”
Understanding Networks with Random Graphs
• A random graph is a graph that is generated by some random process
• The objective is to study the properties of random graphs (e.g. diameter, clustering coefficient, mean degree)
• Are the generated graphs compatible with actual social networks?
• Different approaches:
◦ Erdős–Rényi graphs
◦ Small-world model
◦ Barabási–Albert model
Random Graphs
• Studied by P. Erdős and A. Rényi in the 1960s
• How to build a random graph
◦ Take n vertices
◦ Connect each pair of vertices with an edge with some probability p
• There are n(n-1)/2 possible edges
• The mean number of edges per vertex is
$$z = \frac{n(n-1)p}{n} = (n-1)p \approx np$$
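The construction above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `erdos_renyi` and `mean_degree`, and the values of n and p, are assumptions made for the example and do not come from the slides.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return the edge list of a random graph on vertices 0..n-1."""
    rng = random.Random(seed)
    # Each of the n(n-1)/2 vertex pairs becomes an edge independently
    # with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def mean_degree(n, edges):
    """Mean number of edges per vertex; each edge touches two vertices."""
    return 2 * len(edges) / n


n, p = 1000, 0.01          # illustrative values, not from the slides
edges = erdos_renyi(n, p, seed=42)
z_expected = (n - 1) * p   # z = (n - 1) p, approximately n p
z_observed = mean_degree(n, edges)
print(f"expected z = {z_expected:.2f}, observed z = {z_observed:.2f}")
```

For large n the observed mean degree should land close to (n - 1)p, with small fluctuations from one random seed to the next.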
Degree Distribution
• The probability that a vertex has degree k follows a binomial distribution
$$p_k = \binom{n-1}{k} p^k (1-p)^{n-1-k}$$
• In the limit n ≫ kz, this becomes a Poisson distribution
$$p_k = \frac{z^k e^{-z}}{k!}$$
◦ z is the mean
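As a rough numerical check of the Poisson limit, the two formulas can be evaluated side by side. The sketch below assumes illustrative values n = 10000 and z = 4 (not taken from the slides); `binomial_pk` and `poisson_pk` are hypothetical helper names.

```python
from math import comb, exp, factorial


def binomial_pk(n, p, k):
    """Exact degree distribution of the random graph: Binomial(n - 1, p)."""
    return comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)


def poisson_pk(z, k):
    """Poisson approximation with mean degree z."""
    return z**k * exp(-z) / factorial(k)


n, z = 10_000, 4.0   # illustrative values
p = z / (n - 1)      # chosen so that the mean degree (n - 1) p equals z
for k in range(0, 11, 2):
    print(f"k={k:2d}  binomial={binomial_pk(n, p, k):.6f}  "
          f"poisson={poisson_pk(z, k):.6f}")
```

With n much larger than z, the two columns should agree to several decimal places.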
Characteristics
• Small-world effect (Milgram, 1960s)
◦ Diameter (Bollobás)
◦ Average vertex-vertex distance
◦ Grows slowly (logarithmically) with the size of the graph
• Doesn’t fit real-world networks
◦ Degree distribution (not Poisson!)
◦ Clustering (network transitivity)
▪ Random graphs: small clustering coefficient
▪ Social networks, biological networks, and artificial networks (power grid, WWW): significantly higher
Clustering
• If A is connected to B, and B is connected to C, then it is likely that A is connected to C
• “A friend of your friend is your friend”
• Clustering coefficient: the average fraction of pairs of a node’s neighbors that are also neighbors of each other
$$C = \frac{6 \times (\text{number of triangles in the graph})}{\text{number of paths of length 2}}$$
(each path of length 2 is counted once per direction)
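The clustering formula can be checked on a tiny graph. The sketch below is only an illustration: it counts triangles by brute force, which is fine for small examples but not for large networks, and it assumes paths of length 2 are counted once per direction (which is what makes the factor 6 work). The helper names and the four-node example graph are hypothetical.

```python
from itertools import combinations


def adjacency(edges):
    """Build undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj):
    """C = 6 * (number of triangles) / (number of ordered paths of length 2)."""
    triangles = sum(
        1
        for u, v, w in combinations(adj, 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )
    # A vertex of degree d is the midpoint of d * (d - 1) ordered paths.
    paths2 = sum(len(nb) * (len(nb) - 1) for nb in adj.values())
    return 6 * triangles / paths2 if paths2 else 0.0


# A triangle A-B-C with one extra node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(clustering_coefficient(adjacency(edges)))
```

This example graph has one triangle and ten ordered paths of length 2, so the coefficient is 6/10 = 0.6; a lone triangle gives 1 and a simple chain gives 0.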
Small-World Model
• Watts-Strogatz (1998) first introduced the small-world model
• Mixture of regular and random networks
◦ Regular graphs have a high clustering coefficient, but also a high diameter
◦ Random graphs have a low clustering coefficient, but a low diameter
• Characteristics of the small-world model
◦ The length of the shortest chain connecting two vertices grows very slowly, i.e., in general logarithmically, with the size of the network
◦ Higher clustering or network transitivity
Small-World Model (2)
• Construct a regular ring lattice in which each node has degree k
• For every node i, take every edge (i, j) with i < j, and rewire it with probability β (replace j with a randomly chosen node)
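The two construction steps can be sketched as follows. This is a minimal standard-library version under stated assumptions: k is even, each node is linked to its k/2 nearest neighbours on each side of the ring, and a rewired edge keeps its first endpoint and gets a new second endpoint chosen uniformly among nodes that would create neither a self-loop nor a duplicate edge. The function name `watts_strogatz` is hypothetical.

```python
import random


def watts_strogatz(n, k, beta, seed=None):
    """Return adjacency sets of a small-world graph on nodes 0..n-1."""
    if k % 2 != 0 or not 0 < k < n:
        raise ValueError("k must be even with 0 < k < n")
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Step 1: regular ring lattice, every node has degree k.
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    # Step 2: visit each lattice edge once and rewire it with probability beta.
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if j not in adj[i] or rng.random() >= beta:
                continue
            # Allowed new endpoints: no self-loop, no duplicate edge.
            candidates = [m for m in range(n) if m != i and m not in adj[i]]
            if not candidates:
                continue
            m = rng.choice(candidates)
            adj[i].discard(j)
            adj[j].discard(i)
            adj[i].add(m)
            adj[m].add(i)
    return adj


ring = watts_strogatz(20, 4, 0.0, seed=1)         # beta = 0: pure lattice
small_world = watts_strogatz(20, 4, 0.1, seed=1)  # a few random shortcuts
```

With β = 0 the lattice is left untouched, and with β = 1 every edge is rewired, so intermediate values interpolate between the regular and the random case. Rewiring never changes the total number of edges.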
Scale-Free Network
• A small proportion of the nodes in a scale-free network have a high degree of connection
• Power-law distribution
◦ A given node has k connections to other nodes with probability $P(k) \sim k^{-\gamma}$, with exponent $\gamma \in [2, 3]$
• Examples of known scale-free networks:
◦ Communication networks: the Internet
◦ Ecosystems and cellular systems
◦ Social networks responsible for the spread of disease
Barabási–Albert Networks
• Start from a small number of nodes; at each step, add a new node with m links
• Preferential attachment
◦ The probability that these links connect to an existing node is proportional to that node’s degree
$$\Pi(k_i) = \frac{k_i}{\sum_j k_j}$$
◦ “The rich get richer”
• This creates ‘hubs’: a few nodes with very large degrees
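Preferential attachment can be sketched with a simple trick: keep a list in which every node appears once per incident edge, so that a uniform draw from the list selects node i with probability k_i / Σ_j k_j. The code below is an illustrative sketch only. It assumes the seed network is a small complete graph on m + 1 nodes (the slides do not specify one), and it rejects duplicate targets, which slightly perturbs the exact proportionality when m > 1.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Return the edge list of a preferential-attachment graph on n nodes."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    # Assumed seed network: a complete graph on m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Every node appears in 'stubs' once per incident edge, so a uniform
    # draw picks node i with probability k_i / sum_j k_j.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:   # draw m distinct existing nodes
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges


edges = barabasi_albert(2000, 2, seed=3)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print("max degree:", max(degree.values()), " min degree:", min(degree.values()))
```

The largest degree is typically far above m while most nodes stay close to m, which is the hub structure described above.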
Link Prediction
• Who will be connected in the near future (or in the present or past)?
• Why link prediction?
◦ Eliciting hidden or incomplete link information
▪ Missing links from data collection (criminal networks)
◦ Recommendation
▪ Friends, groups in social networks
▪ Products, books, movies, music on e-commerce sites
▪ Articles on content sites
▪ Whom should one collaborate with?
◦ …
Ok, this was about the structure….
but what about the content?
Content-based social networks
• A special kind of social network
• The actors (nodes) of the network produce documents
◦ A document can be produced by more than one actor
▪ co-authorship relationship
• The similarity between any two actors A and B of the network can be estimated using a function on the sets of documents they produced, Doc(A) and Doc(B)
◦ Sim: Doc(A) × Doc(B) → [0,1]
Automatically detecting content-based social networks
NLP Methodology*:
1. Choose a set of actors and gather related documents;
2. Pre-process textual data to extract raw text;
3. Process raw text with a part-of-speech tagger;
4. Extract candidate annotating terms by using a set of part-of-speech patterns;
5. Rank candidates, possibly filter them choosing a threshold;
6. Output a set of weighted vectors V of annotating terms for each document;
7. Group the vectors by actor and construct a centroid (i.e. a mean vector) for each group. This centroid roughly represents the actor's main interests;
8. Build a graph by computing a similarity function for each pair of centroids.
*In cooperation with the University of Rome
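Steps 6 to 8 can be illustrated with a small sketch. It assumes each document has already been reduced to a dictionary of weighted terms, and it uses cosine similarity as the similarity function. The slides do not name the function actually used; cosine is chosen here only because it maps non-negative weight vectors into [0,1]. The actors, terms, weights and function names are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def centroid(vectors):
    """Mean of a list of sparse term-weight vectors (dicts)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_graph(docs_by_actor, threshold=0.0):
    """Weighted edges between actors whose centroids are similar enough."""
    centroids = {actor: centroid(vs) for actor, vs in docs_by_actor.items()}
    graph = {}
    for a, b in combinations(sorted(centroids), 2):
        sim = cosine(centroids[a], centroids[b])
        if sim > threshold:
            graph[(a, b)] = sim
    return graph


# Hypothetical input: one weighted term vector per document, grouped by actor.
docs_by_actor = {
    "A": [{"ontology": 0.9, "interoperability": 0.4},
          {"ontology": 0.5, "semantic web": 0.7}],
    "B": [{"ontology": 0.6, "semantic web": 0.8}],
    "C": [{"patent": 1.0}],
}
for (a, b), sim in similarity_graph(docs_by_actor).items():
    print(a, b, round(sim, 3))
```

Actors A and B share terms and end up connected, while C shares no term with either and stays isolated. Raising `threshold` prunes weak similarity edges.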
Reducing Information Dimensionality:
Clustering / Community finding
• Dividing a set of data points into subsets (called clusters) so that points in the same cluster are similar in some sense
◦ Crisp/fuzzy clustering
◦ Partitive/non-partitive clustering
• K-means, repeated bisection, graph partitioning, …
• Cohesive subgroup detection:
◦ Cliques
◦ K-cliques
◦ K-plexes
◦ Density-based subgraphs
Experiments: Research Networks
INTEROP NoE (6FP):
• Domain ontology expressed in OWL (Web Ontology Language) for the Interoperability of Software Applications domain
• INTEROP partners’ corpus
• 2 types of edges:
◦ Co-authorship
◦ Similarity
Evaluation: predictive power of the model
• We evaluated how many of the possible opportunities computed for year 2003 have been exploited in the rest of the project (2004-2007).

Perc. of opportunities for year 2003 realized in the rest of the project (2004-2007)
Year          Perc. realized
In 2004       20%
In 2005       33%
After 2005    57%

Perc. of opportunities for year 2004 realized in the rest of the project (2005-2007)
Year          Perc. realized
In 2005       54%
After 2005    75%
Experiments: Patent Networks
The European Patent Office (EPO):
web services give access to information about registered European patents:
• the date of presentation
• the applicant's name and mission
• the address of the applicant
• a textual description of the patent
Thank you…..
Questions?!?!?!