The Content of the Presentation


Subject:
Discovering Overlapping Groups in Social Media
Professor:
Dr. Sh. Esmaili
Students:
Mr. Hossien Sadrizadeh (Slides 3 to 55)
Mr. Houshyar Mohammadi Talvar (Slides 57 to 78)
Date:
Thursday, June 21st, 2012 (Tir 1st, 1391)
Discovering Overlapping Groups in social Media
1/79
Mr. Hossien Sadrizadeh
slides from 3 to 53
Introduction
• The following social media sites are popular and have more users than ever:
• Facebook
• Twitter
• Wikipedia
• Blogger
• Myspace
• In 2009, the global time spent on social media sites increased by 82% over the previous year.
• Facebook, one of the most popular social media sites, has more than 500 million active users, and the number is still increasing.
Introduction (Continued)
• What kinds of activities do people engage in on social media?
• On social media websites, users can participate in social activities, for example:
• Connecting with other like-minded people.
• Updating their status.
• Posting blogs.
• Uploading photos.
• Bookmarking and tagging.
• People can join groups on different websites, for instance:
• Fans of sports teams can join dedicated groups.
• They can share their opinions on team performance.
• They can comment on the latest news about players.
Group - Community
• A group (community) can be considered as a set of users in which each user interacts more frequently with users inside the group than with users outside it.
• Some social media websites (Flickr, YouTube) provide explicit groups that users can join.
• Some dynamic sites (Twitter, Delicious) have no clear group structure, so community detection is needed to discover the groups.
Group - Community
• In social media, a community is:
• A group of people who are more similar to people within the group than to people outside it.
• Homophily is one of the important reasons that people connect with others. For example:
• People from the same city talk more frequently.
• People who have similar political viewpoints are more likely to vote for the same candidates.
• People watch the same movies because they like the same movie stars.
Why groups?
• Group-level investigation can provide useful information.
• Studying individual behaviour is usually difficult for a large population.
• Studying statistics at the website level often fails to capture sufficient detail.
An example of forming groups (1)
• We have a set of 50 people. We want to form two sets with the following properties:
• The first set contains the people whose names begin with “J”.
• The second set contains the people whose names begin with “W”.
• No name can begin with both letters, so the two sets are disjoint.
http://en.wikipedia.org/wiki/Partition_of_a_set
(1) An example made by the presenter (Hossien Sadrizadeh)
Another example of forming groups
• We have a set of 50 people. We want to form two sets with the following properties:
• The first set contains the people whose names begin with “J”.
• The second set contains the people whose names end with “W”.
• A person whose name begins with “J” and ends with “W” belongs to both sets, so the two sets can overlap.
Adaptation And Enhancement Of Evaluation Measure To Overlapping Graph Clustering (Tatiana Gossen, Michael Kotzyba)
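The difference between the two grouping rules can be checked with a few lines of Python. This is only an illustrative sketch: the names below are hypothetical and are not the 50 people from the slide.

```python
# Hypothetical names, chosen only to illustrate the two grouping rules.
names = ["John", "Jane", "Jarrow", "Andrew", "Matthew", "William"]

starts_with_j = {n for n in names if n.startswith("J")}
starts_with_w = {n for n in names if n.startswith("W")}
ends_with_w = {n for n in names if n.lower().endswith("w")}

# Rule 1 (first letter J vs. first letter W): the groups cannot share a member.
disjoint_overlap = starts_with_j & starts_with_w

# Rule 2 (first letter J vs. last letter W): one name can satisfy both conditions.
overlap = starts_with_j & ends_with_w

print(sorted(disjoint_overlap))  # []
print(sorted(overlap))           # ['Jarrow']
```

Under the first rule the intersection is always empty; under the second rule any name such as "Jarrow" lands in both groups, which is what an overlapping clustering has to represent.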
Overlapping - Introduction
• The multiple interactions in social activities imply that community structures often overlap.
• Example: one person belongs to several communities.
• The new idea is to take advantage of the network information between users and tags in social media and to discover these overlapping communities with co-clustering.
• Co-clustering is a way to obtain this kind of community structure.
Overlapping
• When a website has explicit groups and allows users to join more than one group based on their personal preferences, overlapping takes place.
• When no explicit groups are available, a community detection algorithm can be used to obtain such groups.
Community detection
• Community detection is usually based on structural features (links).
• Figure: a sketch of a small network displaying community structure, with three groups of nodes with dense internal connections and sparser connections between groups.
Co-Clustering
• The graph on the right has two types of nodes:
• Vertices u1-u5 on the left for users.
• Vertices t1-t4 on the right for tags.
• Edges represent the tag subscription relation between users and tags.
• If we use a method to form two clusters, we see that u3 is associated with both clusters.
Co-Clustering
• There are two methods of clustering:
• Vertex clustering.
• Edge clustering.
• Clustering edges is better than clustering vertices.
• Clustering edges usually yields overlapping communities.
Our contribution (1)
• We propose to discover overlapping communities in social networks.
• We use user-tag subscription information instead of user-user links.
• We obtain clusters containing users and tags simultaneously.
(1) The research team
Co-Clustering
In this graph, the edges connecting to nodes t1, t2 and to nodes t3, t4 are clustered into two separate groups, both containing user u3.
Community – Mathematical Definition
• Suppose:
• A community $C_i$ ($1 \le i \le k$) is a subset of users and tags, where $k$ is the number of communities.
• Communities usually overlap: $C_i \cap C_j \neq \emptyset$.
• We use an adjacency matrix (a sparse matrix) to represent the relation between users and their subscribed tags.
Adjacency Matrix via Incidence Matrix
User-Tag Network
• In a user-tag network, each edge is associated with a user vertex $u_i$ and a tag vertex $t_p$.
• We can use an incidence matrix. Each edge vector in this matrix has $N_u + N_t$ entries ($N_u$ for users and $N_t$ for tags).
• For example, the edge between u1 and t1 in the following graph is:
User-Tag Network
The incidence matrix
Why is the incidence matrix useful?
The incidence matrix
• It is a sparse matrix.
• We can implement it with a linked list (or a doubly linked list).
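As a minimal sketch of why sparsity helps, the snippet below builds the incidence vector of one edge both densely and sparsely. The sizes (5 users, 4 tags) are assumed from the u1-u5 / t1-t4 toy graph, and the tuple-based sparse form is an illustrative stand-in for the linked-list idea, not the presenters' implementation.

```python
# Assumed toy sizes: 5 users (u1..u5) and 4 tags (t1..t4).
NUM_USERS, NUM_TAGS = 5, 4

def edge_vector(user, tag):
    """Dense incidence vector of length Nu + Nt for the edge (u_user, t_tag), 1-based."""
    vec = [0] * (NUM_USERS + NUM_TAGS)
    vec[user - 1] = 1                 # user part occupies positions 0 .. Nu-1
    vec[NUM_USERS + tag - 1] = 1      # tag part occupies positions Nu .. Nu+Nt-1
    return vec

def sparse_edge(user, tag):
    """Sparse form: store only the positions of the two nonzero entries."""
    return (user - 1, NUM_USERS + tag - 1)

dense = edge_vector(1, 1)
sparse = sparse_edge(1, 1)
print(dense)   # [1, 0, 0, 0, 0, 1, 0, 0, 0]
print(sparse)  # (0, 5)
```

Every edge vector has exactly two nonzero entries regardless of how many users and tags exist, so the sparse form keeps storage proportional to the number of edges.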
Overlapping co-clustering problem
• The overlapping co-clustering problem can be stated formally as follows:
Input:
• A user-tag subscription matrix $N$ of size $N_u \times N_t$, where $N_u$ and $N_t$ are the numbers of users and tags, respectively.
• $k$, the number of communities.
Output:
• $k$ overlapping communities that consist of both users and tags.
The Co-Clustering Framework
• A user usually has several friendships, but a single link is usually related to only one community. This is why we cluster edges instead of nodes.
• After obtaining the edge clusters, communities can be recovered by replacing each edge with its two vertices, i.e., a node is in a community if any of its connections is in the community.
• The communities obtained this way are often highly overlapping.
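The recovery step can be sketched as follows. The edge-to-cluster assignment is a hypothetical input made up for illustration; how the edge clusters are produced is a separate question.

```python
from collections import defaultdict

# Hypothetical edge clustering of a toy user-tag graph: (user, tag) -> cluster id.
edge_cluster = {
    ("u1", "t1"): 0, ("u2", "t1"): 0, ("u2", "t2"): 0, ("u3", "t2"): 0,
    ("u3", "t3"): 1, ("u4", "t3"): 1, ("u4", "t4"): 1, ("u5", "t4"): 1,
}

communities = defaultdict(set)
for (user, tag), cluster in edge_cluster.items():
    # A node joins a community if any of its edges is in that community.
    communities[cluster].update((user, tag))

shared = communities[0] & communities[1]
print(sorted(communities[0]))  # ['t1', 't2', 'u1', 'u2', 'u3']
print(sorted(communities[1]))  # ['t3', 't4', 'u3', 'u4', 'u5']
print(shared)                  # {'u3'}
```

Each edge belongs to exactly one cluster, yet u3 ends up in both communities because its two edges were assigned to different clusters.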
Make Categories - Find Clusters
Communities that aggregate similar users and tags together can be detected by maximizing intra-cluster similarity, as shown in the following equation (this formulation can be solved by the k-means algorithm):
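The slide's equation is not preserved in this transcript. As a rough sketch of the idea only, the code below runs a plain k-means loop over edge incidence vectors with cosine similarity. The toy graph, the seed edges, and the fixed iteration count are all assumptions, and this is ordinary k-means, not the EdgeCluster variant.

```python
import math

# Hypothetical toy graph: users are indices 0..4, tags are indices 5..8.
edges = [(0, 5), (0, 6), (1, 5), (1, 6), (2, 6),
         (2, 7), (3, 7), (3, 8), (4, 7), (4, 8)]
DIM = 9

def cosine(sparse_edge, centroid):
    """Cosine similarity between a 0/1 edge vector (two ones) and a dense centroid."""
    dot = sum(centroid[i] for i in sparse_edge)
    norm = math.sqrt(2) * math.sqrt(sum(c * c for c in centroid))
    return dot / norm if norm else 0.0

def kmeans_edges(edges, seeds, iterations=10):
    # Start each centroid at a seed edge (fixed seeds keep the run deterministic).
    centroids = [[1.0 if i in edges[s] else 0.0 for i in range(DIM)] for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each edge goes to its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda c: cosine(e, centroids[c]))
                  for e in edges]
        # Update step: each centroid becomes the mean of its member edge vectors.
        for c in range(len(centroids)):
            members = [e for e, lab in zip(edges, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(i in e for e in members) / len(members)
                                for i in range(DIM)]
    return labels

labels = kmeans_edges(edges, seeds=[0, 9])
print(labels)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In this run the edges (2, 7) and (3, 7) fall into different clusters, so node 7 (a tag) belongs to both recovered communities. Which node ends up shared depends on the seeds and on how ties are broken.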
Disadvantage of k-means clustering
• K-means is not efficient for large-scale data sets.
• So what should we do?
• Our proposal (1) is to use a variant of k-means called EdgeCluster, an efficient and scalable algorithm for extracting communities from sparse networks.
• Why is EdgeCluster efficient? Because:
• Each centroid is compared only to a small set of edges that are correlated with the centroid.
• It is reported to be able to cluster a sparse network with more than one million nodes into thousands of clusters in tens of minutes.
(1) The writers
Density
The expected density of the user-tag network is
shown in the following equation :
Key Step in Clustering Edges
• Define the edge similarity.
• Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$ in a user-tag graph, the similarity between them can be defined with the following equation:
Similarity Schemes for clustering
• There are 3 similarity schemes:
• Independent Learning.
• Normalized Learning.
• Correlational Learning.
• Our framework (1) can cover different similarity schemes.
(1) The writers
Independent Learning
• A common approach is to use the similarity.
• The similarity can be represented by the following function (the user similarity can be defined in the same way):
The Kronecker delta function: $\delta(p, q) = 1$ if $p = q$, and $0$ otherwise.
Independent Learning – Cosine Similarity (Continued)
• The cosine similarity is widely used for measuring the similarity between two vectors. It is defined in the following form.
• Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$, the cosine similarity can be defined with the following equation:
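The slide's equation is not preserved here. Under the incidence-vector representation (each edge vector has a 1 at its user position and a 1 at its tag position), the cosine between two edges works out to half the sum of two Kronecker deltas. The sketch below checks that numerically; the function names and the sample edges are hypothetical.

```python
import math

def edge_vector(user, tag, num_users, num_tags):
    """0/1 incidence vector of an edge, using 0-based user and tag indices."""
    vec = [0] * (num_users + num_tags)
    vec[user] = 1
    vec[num_users + tag] = 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def delta(i, j):
    """Kronecker delta: 1 if the indices are equal, otherwise 0."""
    return 1 if i == j else 0

# Hypothetical edges in a graph with 5 users and 4 tags.
e1 = (0, 1)   # user 0, tag 1
e2 = (2, 1)   # user 2, tag 1: shares the tag with e1, but not the user
sim = cosine(edge_vector(*e1, 5, 4), edge_vector(*e2, 5, 4))
closed_form = (delta(e1[0], e2[0]) + delta(e1[1], e2[1])) / 2
print(round(sim, 6), closed_form)  # 0.5 0.5
```

Two edges that share neither endpoint get similarity 0, edges that share one endpoint get 0.5, and identical edges get 1.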
Independent Learning – Cosine Similarity (Continued)
• The cosine of two vectors can be easily derived by using the Euclidean dot product formula:
$A \cdot B = \|A\| \, \|B\| \cos\theta$
Given two vectors of attributes, A and B, the cosine similarity, $\cos\theta$, is represented using a dot product:
$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$
An example of cosine similarity
• If we have the following two vectors, the similarity is:
A = (1, 2, 3) and B = (2, 5, -3)
• What do you think about the range of the similarity?
• The resulting similarity ranges over [-1, 1].
• −1 means exactly opposite.
• 1 means exactly the same.
• 0 usually indicates independence.
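The numeric value for the two example vectors can be worked out directly from the dot-product form of the cosine. This is a small sketch using only the standard library.

```python
import math

A = (1, 2, 3)
B = (2, 5, -3)

dot = sum(a * b for a, b in zip(A, B))       # 1*2 + 2*5 + 3*(-3) = 3
norm_a = math.sqrt(sum(a * a for a in A))    # sqrt(14)
norm_b = math.sqrt(sum(b * b for b in B))    # sqrt(38)
similarity = dot / (norm_a * norm_b)

print(dot, round(similarity, 4))  # 3 0.1301
```

The result is close to 0, so the two vectors are nearly orthogonal even though several of their components have the same sign.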
Text matching (1)
• The attribute vectors A and B are usually the term frequency
vectors of the documents. The cosine similarity can be seen
as a method of normalizing document length during
comparison.
• In the case of information retrieval, the cosine similarity of
two documents will range from 0 to 1, since the term
frequencies (tf-idf weights) cannot be negative. The angle
between two term frequency vectors cannot be greater
than 90°.
(1) Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning (S. Dhillon)
Normalized Learning
• Let $d_{u_i}$ denote the degree of user $u_i$, and let $d_{t_p}$ denote the degree of tag $t_p$ in a user-tag network.
• After normalization, the edge $e(u_i, t_p)$ can be represented in the following form:
1 The Research team
Normalized Learning (Continued)
• Given two edges $e(u_i, t_p)$ and $e'(u_j, t_q)$, the cosine similarity between them after normalization can be written as the following equation:
Normalized Learning (Continued)
• If we set $\alpha$ to 0.5, we can derive the following equation, which gives the normalized edge similarity.
• This formula says that the similarity between two users is related not only to the users but also to the tags.
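The normalized formulas themselves are not preserved in this transcript. The sketch below assumes one plausible normalization, dividing each nonzero entry of an edge vector by the degree of its node, and checks that the resulting cosine matches a closed form written with degrees and Kronecker deltas. The edge list, function names, and the normalization choice are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical user-tag edges as (user index, tag index) pairs.
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)]
NUM_USERS, NUM_TAGS = 3, 3

user_degree = Counter(u for u, _ in edges)
tag_degree = Counter(t for _, t in edges)

def normalized_edge(user, tag):
    """Assumed normalization: each nonzero entry is divided by its node's degree."""
    vec = [0.0] * (NUM_USERS + NUM_TAGS)
    vec[user] = 1.0 / user_degree[user]
    vec[NUM_USERS + tag] = 1.0 / tag_degree[tag]
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def closed_form(e1, e2):
    """The same cosine written with degrees and Kronecker deltas."""
    (ui, tp), (uj, tq) = e1, e2
    dui, duj = user_degree[ui], user_degree[uj]
    dtp, dtq = tag_degree[tp], tag_degree[tq]
    numerator = (ui == uj) * dtp * dtq + (tp == tq) * dui * duj
    return numerator / math.sqrt((dui**2 + dtp**2) * (duj**2 + dtq**2))

e1, e2 = (0, 0), (2, 0)   # different users, same tag
from_vectors = cosine(normalized_edge(*e1), normalized_edge(*e2))
from_formula = closed_form(e1, e2)
print(round(from_vectors, 4), round(from_formula, 4))  # 0.3922 0.3922
```

Under this assumption, two edges sharing a tag are less similar than the unnormalized value of 0.5 whenever the shared tag has a high degree, because the degrees of all four endpoints enter the formula.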
Correlational Learning
• Users often use more than one tag to describe the main topic of a bookmark.
• Tags that are grouped together indicate their correlation.
• In a user-tag network:
• On one side, a user can be viewed as a vector by treating tags as features.
• On the other side, a tag can also be viewed as a vector by treating users as features.
• We use a latent space to represent the users and to capture the correlation between their tags.
Correlational Learning (Continued)
• Let us take the following basis vectors as the orthogonal axes of the latent space:
• User vectors in the original space can be mapped to new vectors in the latent space, as shown here:
M is a linear mapping from the original space to the latent space.
Correlational Learning (Continued)
• We map the vectors from the original space to the latent space like this (using a mapping function):
Correlational Learning (Continued)
• Another way to select a set of orthogonal basis vectors is Singular Value Decomposition (SVD).
• The singular value decomposition of a user-tag network M is given by the following formula:
Correlational Learning (Continued)
• User latent vectors can be formulated in the following form:
• We need only a small set of vectors to compute them, as shown here:
Correlational Learning (Continued)
• User similarity and tag similarity in the latent space are defined by the following formula:
• Z is obtained by solving for the generalized eigenvectors.
Correlational Learning (Continued)
• The adjacency matrix and the Laplacian matrix are:
Correlational Learning (Continued)
• The generalized eigenvector equation can be rewritten as:
• After simple manipulation, we obtain:
Singular Value Decomposition – SVD (1, 2)
• SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices:
• An orthogonal matrix U.
• A diagonal matrix S.
• The transpose of an orthogonal matrix V.
• The Gram-Schmidt orthogonalization process:
• Is a method for converting a set of vectors into a set of orthonormal vectors.
• It uses normalization.
(1) Linear Algebra, Haffman Kenneth (Chapter 8: Vector Spaces)
(2) Numerical Analysis, Samuel D. Conte (Chapter 4: Matrices, Eigenvalues, and Eigenvectors)
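A minimal sketch of the classical Gram-Schmidt process in plain Python follows. The two input vectors are arbitrary illustrative values, and the tolerance used to drop linearly dependent vectors is an assumption.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # Remove the component of v that lies along the basis vector q.
            coeff = dot(v, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip vectors that depend linearly on earlier ones
            basis.append([x / norm for x in w])  # normalization step
    return basis

q1, q2 = gram_schmidt([(3.0, 1.0), (2.0, 2.0)])
print(round(dot(q1, q2), 10), round(dot(q1, q1), 10), round(dot(q2, q2), 10))
```

The printed values should show a dot product of 0 between the two output vectors and a squared length of 1 for each, which is what orthonormal means.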
Other view point of Gram-Schmidt
Singular Value Decomposition – SVD
• The theorem is usually presented with a formula
like this :
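The slide's formula image is not preserved in this transcript. In its standard textbook form, with an $m \times n$ matrix $A$, the decomposition reads:

```latex
A_{m \times n} = U_{m \times m} \, S_{m \times n} \, V^{T}_{n \times n},
\qquad U^{T} U = I_{m}, \qquad V^{T} V = I_{n}
```

Here the columns of $U$ are orthonormal eigenvectors of $AA^{T}$, the columns of $V$ are orthonormal eigenvectors of $A^{T}A$, and $S$ is diagonal with the singular values in descending order.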
Example - SVD
• Start with the matrix:
• To find U, we first have to find $AA^T$.
Example – SVD (Continued)
• Next, we have to find the eigenvalues and the corresponding eigenvectors of $AA^T$.
• We find the eigenvectors and store them in a matrix, ordered by the size of the corresponding eigenvalues.
Example – SVD (Continued)
• Finally, we have to convert this matrix into an orthogonal matrix.
Example – SVD (Continued)
• We use a similar method to find V, based on $A^TA$.
• Find the eigenvalues of $A^TA$.
Example – SVD (Continued)
• For all of the data, we have the following vectors:
• Ordered by the size of the eigenvalues, we have:
Example – SVD (Continued)
• After the orthonormalization process, we convert it to an orthogonal matrix.
Example – SVD (Continued)
• For S, we take the square roots of the non-zero eigenvalues and populate the diagonal with them, putting the largest in $S_{11}$, the next largest in $S_{22}$, and so on, down to the smallest value in $S_{mm}$.
• The non-zero eigenvalues of $AA^T$ and $A^TA$ are the same.
• The diagonal entries in S are the singular values of A.
• The columns in U are called left singular vectors.
• The columns in V are called right singular vectors.
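The properties above can be checked numerically on a small matrix. The slide's example matrix is not preserved in this transcript, so the matrix A below is an assumed stand-in; the helper functions are written out in plain Python to keep the sketch self-contained.

```python
import math

# Assumed 2x3 example matrix (the slide's own matrix is not available here).
A = [[3, 1, 1],
     [-1, 3, 1]]

def transpose(X):
    return [list(row) for row in zip(*X)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

AAt = matmul(A, transpose(A))   # 2x2, used to find U
AtA = matmul(transpose(A), A)   # 3x3, used to find V

# Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]] in closed form.
a, b, d = AAt[0][0], AAt[0][1], AAt[1][1]
mean, radius = (a + d) / 2, math.sqrt(((a - d) / 2) ** 2 + b * b)
eigenvalues = [mean + radius, mean - radius]           # largest first
singular_values = [math.sqrt(v) for v in eigenvalues]  # diagonal of S

def char_poly_AtA(lam):
    """det(A^T A - lam * I); it is zero exactly when lam is an eigenvalue of A^T A."""
    return det3([[AtA[i][j] - (lam if i == j else 0) for j in range(3)] for i in range(3)])

residuals = [char_poly_AtA(lam) for lam in eigenvalues]
print(AAt, eigenvalues, residuals)
```

For this matrix the eigenvalues of $AA^T$ are 12 and 10, both also satisfy the characteristic equation of $A^TA$, and the remaining eigenvalue of $A^TA$ is 0, so the singular values are $\sqrt{12}$ and $\sqrt{10}$.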
Example – SVD (Continued)
• Now we have the following matrices:
Mr. Houshyar Mohammadi Talvar
slides from 57 to 78
SYNTHETIC DATA AND FINDINGS
Clustering evaluation is difficult when there is no ground truth.
We first introduce the synthetic data and how they are generated, then the
clustering quality measurement Normalized Mutual Information (NMI).
Finally, the NMI of different clustering methods are reported.
Synthetic Data Generation
We develop a synthetic data generator that allows input of the numbers of
clusters, users and tags. First users and tags are split evenly into each
cluster. Then, in each cluster users and tags are randomly connected with a
specified density (e.g., 0.8).
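The generation procedure described above can be sketched as follows. This is not the authors' generator: the function name, the seed parameter, and the contiguous-block split are assumptions, and the sketch produces disjoint blocks only, since the text does not say how overlap between clusters is introduced.

```python
import random

def generate_user_tag_graph(num_clusters, num_users, num_tags, density, seed=0):
    """Return a set of (user, tag) edges for a block-structured synthetic graph."""
    rng = random.Random(seed)
    edges = set()
    for c in range(num_clusters):
        # Split users and tags evenly: cluster c gets one contiguous block of each.
        users = range(c * num_users // num_clusters, (c + 1) * num_users // num_clusters)
        tags = range(c * num_tags // num_clusters, (c + 1) * num_tags // num_clusters)
        for u in users:
            for t in tags:
                # Connect each in-cluster user-tag pair with probability `density`.
                if rng.random() < density:
                    edges.add((u, t))
    return edges

full = generate_user_tag_graph(num_clusters=5, num_users=10, num_tags=10, density=1.0)
sparse = generate_user_tag_graph(num_clusters=5, num_users=10, num_tags=10, density=0.8)
print(len(full), len(sparse) <= len(full))  # 20 True
```

With density 1.0 every in-cluster pair is connected, giving 5 clusters times 2 users times 2 tags, i.e. 20 edges; lower densities keep a random subset of those pairs.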
Figure 2 shows a toy example of the synthetic user-tag graph in
which users are labeled as u1−u7 and tags t1−t8. Three overlapping
clusters are highlighted with different colors.
NMI Evaluation in Synthetic Data
The Normalized Mutual Information (NMI) is commonly used to measure the
clustering quality.
Given two clusterings X and Y, the NMI is defined below.
The NMI is computed in two steps:
First, find the pairs of clusters that are closest to each other in the two clusterings.
Second, average the mutual information between those pairs of clusters.
The higher the NMI value, the more similar the two clusterings are. If two clusterings X and Y are exactly the same, the NMI value is 1.
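The slide's definition is not preserved in this transcript, and the two-step description above refers to a variant for overlapping clusterings. As a simpler point of reference, the sketch below computes the standard NMI for non-overlapping labelings, normalized by the geometric mean of the two entropies. It is a swapped-in textbook form, not the measure used in the reported experiments.

```python
import math
from collections import Counter

def nmi(labels_x, labels_y):
    """Standard NMI for two single-membership labelings of the same items."""
    n = len(labels_x)
    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    count_xy = Counter(zip(labels_x, labels_y))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual = sum((c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
                 for (x, y), c in count_xy.items())
    hx, hy = entropy(count_x), entropy(count_y)
    if hx == 0 or hy == 0:
        # Degenerate case: at least one clustering puts everything in one cluster.
        return 1.0 if hx == hy else 0.0
    return mutual / math.sqrt(hx * hy)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, different label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that share no information
print(round(same, 6), round(independent, 6))   # 1.0 0.0
```

The first call shows that NMI ignores how the clusters are named, which is why it is convenient for comparing a detected clustering against ground truth.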
NMI and Number of Clusters
We generate another data set with 1,000 users and 1,000 tags and with different numbers of clusters, ranging from 5 to 50. The cluster density is set to 1, so that all users connect to all tags within each cluster.
Figure 3. NMI Performance w.r.t Number of Clusters
NMI and Link Density
We also study how intra-cluster link density affects clustering in synthetic data sets.
We created synthetic data sets (50 clusters, 1,000 users and 1,000 tags) with different
intracluster densities that range from 0.1 to 1.
Figure 4. NMI Performance w.r.t Intra-cluster Link Density
View Correlational Learning in Figure 3 and Figure 4.
SOCIAL MEDIA DATA AND FINDINGS
BlogCatalog is a social blog directory where bloggers can register their blogs under predefined categories. We crawled user names, user ids, their friends, blogs, the associated tags, and blog categories.
Delicious is a social bookmarking website that allows users to tag, manage, and share online resources (e.g., articles). For each resource, users are asked to provide several tags to summarize its main topic.
Interplay between Link Connection and Tag Sharing
There exist explicit and implicit relations between users. Examples of explicit relations are the friends or fans that people choose to be. Examples of implicit relations are tag sharing, i.e., people who use the same tags.
Is there any correlation between the two different relations? What drives people to connect to others? Is it a random operation? We conducted a statistical analysis of user-user links and tag sharing.
In the first study, we fix users who have or do not have connections with others, then show the tag sharing probabilities.
Interplay between Link Connection and Tag Sharing (Continued)
Figure 5 shows the tag sharing probabilities in BlogCatalog and Delicious data sets. For
Delicious data, the friends network and fans network are evaluated separately.
Figure 5. X-axis represents the number of tags that two users share.
Interplay between Link Connection and Tag Sharing (Continued)
Figures 6 and 7 show the probability that two users are connected if they share tags in BlogCatalog and Delicious, respectively. In Figure 6, the probability of a link between two users increases with the number of tags they share.
Figure 6. Link probability w.r.t tag sharing in BlogCatalog
Interplay between Link Connection and Tag Sharing (Continued)
Figure 7. Link probability w.r.t tag sharing in Delicious
Clustering Evaluation
The clustering evaluation consists of three studies:
1. First, cross-validation is performed to demonstrate the effectiveness of different clustering algorithms on the BlogCatalog data set.
2. Second, we study the correlation between user connectivity and co-occurrence in extracted communities.
3. Third, concrete examples illustrate what the clusters are about.
1) Comparative Study: In BlogCatalog, categories for each blog are selected by the blog owner from a predefined list. With category information, certain procedures such as cross-validation (e.g., treating categories as class labels and cluster memberships as features) can be used to show the clustering quality.
Linear SVM is adopted in our experiments since it scales well to large data sets. As recommended by Tang et al., 1,000 communities are used in our experiments. We vary the fraction of training data from 10% to 90% and use the rest as test data.
This experiment is repeated 10 times, and the average Micro-F1 and Macro-F1 measures are reported.
Table II shows five different clustering methods and their prediction performance. In this table, the fourth algorithm, EdgeCluster, uses the user-user network rather than the user-tag network. Dhillon’s co-clustering algorithm is based on Singular Value Decomposition (SVD) of the normalized user-tag matrix.
As shown in Table II, Correlational Learning consistently performs better, especially when
the training set is small. According to Table II, normalization does not improve performance.
This suggests normalization should be taken cautiously. Dhillon’s co-clustering method
which can only deal with non-overlapping clustering does not perform well compared to
other methods.
2) Connectivity Study: We study the correlation between user co-occurrence in extracted communities and the actual social connections between them.
We also study the connectivity between users who are in the top similar list. 1,000
overlapping communities are extracted by Correlational Learning.
We study the dis-connectivity between users who are most similar. Figure 8 shows that
the probability of being disconnected is higher than 96% and 99% in BlogCatalog and
Delicious, respectively, which means that the majority of homogeneous users are not
connected in actual social networks.
For example, users marama6 and ameer1577 both are interested in the online game
“World of Warcraft”.
Figure 8. Probability being Dis-connected between Top Similar Users
3) Illustrative Examples: Health is the second largest category (the largest is personal) in BlogCatalog, a hot topic that attracts a lot of attention.
The largest cluster about health obtained by Correlational Learning is cluster-health, with 127 users and 102 tags. The cluster that has the maximum user overlap with cluster-health is cluster-nutrition, with 83 users and 25 tags. Their tag clouds are shown in Figures 10 and 11. The two clusters have 18 users and 3 tags (health, nutrition, and weight loss) in common. Both clusters are related to health, but the first has an emphasis on physical health, highlighted by the tags arthritis, drugs, food, and dentist, while the second is more about nutrition.
The top 102 tags of category-health are compared to the tags of cluster-health, and the top 25 tags of category-health to those of cluster-nutrition. The numbers of shared tags are 16 for cluster-health and 9 for cluster-nutrition.
In addition, we aggregate the tags of the users in cluster-health and present the 102 most frequent tags in Figure 12. Comparing these tags with those of cluster-health, 40 tags are in common. Many tags, such as environment, humor, and jokes, are not present in the tag cloud of cluster-health, which suggests that these users actually have other interests besides health. A similar pattern is observed for cluster-nutrition.
CONCLUSIONS AND FUTURE WORK
We proposed a framework to study the overlapping clustering of users and tags in
online social media which helps to understand the major concerns within the groups.
Experimental results in synthetic data reveal that Correlational Learning is
very effective in recovering the overlapping cluster structures even when the inner
cluster density is low.
This study suggests more interesting problems that are worth further exploring.
Formulating the co-clustering problem into an objective function and maximizing it
is one direction to work on.