
Brant Chee
Bruce Schatz
University of Illinois
http://www.beespace.uiuc.edu
Small World Graph
Clauset et al., 2004
Small World Graph
 Characteristic Path Length
 The typical separation of nodes in a graph.
 lrand ~ ln(N)/ln(z)
 Clustering Coefficient C
 Average fraction of pairs of neighbors of a node which are also neighbors of
each other.
 Roughly: how close each node's neighborhood is to being a clique.
 Crand~ z/N
 Small World Graph
 C >> Crand
 l ≥ lrand
 N >> z >> ln(N)

Newman, 2000
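The quantities on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes an undirected graph stored as a dict of neighbor sets, takes z to be the mean degree, and treats nodes with fewer than two neighbors as contributing 0 to C. The function names are hypothetical.

```python
import math
from collections import deque

def clustering_coefficient(adj):
    """Mean over nodes of the fraction of neighbor pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue  # assumption: such nodes contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length over reachable pairs, using BFS from every node."""
    dist_sum, pairs = 0, 0
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist_sum += sum(seen.values())
        pairs += len(seen) - 1
    return dist_sum / pairs

def random_baselines(n, z):
    """Random-graph estimates: lrand ~ ln(N)/ln(z) and Crand ~ z/N."""
    return math.log(n) / math.log(z), z / n

# Toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c_toy = clustering_coefficient(adj)
l_toy = characteristic_path_length(adj)

# N and z taken from the first row of the threshold table later in the deck.
l_rand, c_rand = random_baselines(6612, 9.85)
```

A graph would then be called small-world when its measured C is far above `c_rand` while its measured l stays close to `l_rand`.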
SW MI Graph
Sole et al., 2003
Purpose…So What?
 Facilitate Exploratory Process
 Search result clustering
 Information discovery
 Develop Middle Ground Algorithms
 Interactive responses AND
 Useful clusters
 Language as a Small World Network
 Make use of underlying structure of language
System Overview
Graph Construction
 A node is a term in the index
 Terms bounded by frequency cutoff.
 Terms occurring in fewer than 5 documents or in more than 25% of documents are removed.
 Edges between nodes are determined by Mutual
Information
\[ \mathrm{MI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]
 P(x,y) is calculated in a window of the size of the
abstract

Church and Hanks, 1989
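A minimal sketch of how such edges could be computed, assuming each abstract is one co-occurrence window and probabilities are estimated from document frequencies. The function name `pmi_edges`, the whitespace tokenizer, and the `threshold` parameter are illustrative assumptions, and the frequency cutoffs from the slide are omitted for brevity.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(abstracts, threshold=0.0):
    """Return {(x, y): MI} for term pairs whose mutual information exceeds threshold."""
    n_docs = len(abstracts)
    term_df = Counter()   # number of abstracts containing each term
    pair_df = Counter()   # number of abstracts containing both terms
    for text in abstracts:
        terms = sorted(set(text.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    edges = {}
    for (x, y), n_xy in pair_df.items():
        p_xy = n_xy / n_docs
        p_x, p_y = term_df[x] / n_docs, term_df[y] / n_docs
        mi = math.log2(p_xy / (p_x * p_y))
        if mi > threshold:
            edges[(x, y)] = mi
    return edges

# Made-up toy abstracts, for illustration only.
abstracts = [
    "muscarinic receptor plasticity",
    "muscarinic receptor binding",
    "synaptic plasticity",
    "microarray analysis",
]
edges = pmi_edges(abstracts)
```

In this toy data "muscarinic" and "receptor" always appear together, so their pair scores above zero, while "muscarinic" and "plasticity" co-occur exactly as often as chance predicts and get no edge.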
What threshold?
Threshold   N      z      l      lrand   ∆l     C      Crand
0.001       6612   9.85   4.31   3.85    0.46   0.57   0.002
0.002       4087   5.09   4.84   5.11    0.27   0.68   0.001
0.005       1342   2.66   5.51   7.36    1.85   0.75   0.002
0.01        517    1.69   1.21   11.91   10.7   0.81   0.003
0.02        161    1.22   0.34   25.55   25.2   0.91   0.008
0.05        25     1.12   0.06   28.40   28.3   1.0    0.045
Where to cut?
Clustering Algorithm




Clauset, Newman and Moore, 2004
Generalization for nodes based upon Newman’s algorithm.
Based upon modularity: the fraction of edges that fall within communities minus the fraction expected if edges fell at random in the same network. Modularity is near 0 if there is little community structure and above about 0.3 if there is significant structure.
If we looked only at the fraction of edges within communities, the maximum would always occur when all nodes are in one cluster.
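\[ Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \]
where δ(ci,cj) = 1 if ci and cj are in the same community and 0 otherwise, Aij = 1 if nodes i and j share an edge, ki is the degree of node i, and m is the number of edges in the graph (so 2m is the sum of all node degrees).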
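The modularity of a given partition can be evaluated directly, as in the sketch below. It assumes an undirected, unweighted graph given as an edge list and uses the equivalent per-community form (fraction of edges inside communities minus the sum of squared degree shares). It only scores a partition; it is not the Clauset, Newman and Moore agglomerative algorithm itself, and the function name is hypothetical.

```python
def modularity(edges, community):
    """Q for an undirected edge list and a node -> community label mapping."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges whose endpoints share a community.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction under random placement: sum over communities of (degree share)^2.
    degree_by_comm = {}
    for node, k in degree.items():
        c = community[node]
        degree_by_comm[c] = degree_by_comm.get(c, 0) + k
    expected = sum((d / (2 * m)) ** 2 for d in degree_by_comm.values())
    return inside - expected

# Toy graph: two triangles joined by a single bridge edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Putting every node in one cluster makes the within-community fraction 1, but the random expectation is also 1, so `q_one` is 0. The two-triangle split scores higher, which is the behavior the correction term is meant to produce.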
Experiments
 3 clustering algorithms
 Complete Link (Cluto)
 K means (Cluto)
 Small World
Test Collections
Collection      Search Terms                                                                                Number of Abstracts   Number of Terms
C1 (General)    plasticity OR acetylcholine                                                                 81,746                267,981
C2 (Specific)   microarray OR muscarinic OR plasticity OR ((cholinergic OR noradrenergic) AND receptor)     74,533                285,623
Experimental Setup
 Parameters left at package defaults
 Clustered with n = 50,100,150 and 200.
 Clusters with fewer than 4 elements or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.
Quantitative Results
Collection   Algorithm   Threshold   Running Time (s)
C1           SW          N/A         40.54
C1           C-Link      50          214.106
C1           K-Means     200         11.581
C2           SW          N/A         47.35
C2           C-Link      100         198.147
C2           K-Means     200         5.538
Conclusions
 Developed Balanced Clustering System
 Fast running time
 Good clustering results
 Modified Small World Algorithm
 Clustered text based on language model
 Produced many similar sized clusters
Social Networks as Small World Networks
 Social Network
 Network demonstrating who interacts with whom
 Threaded messages in a Newsgroup
 Create a network based on various characteristics
 Homophily
 Similar people tend to interact more than those who are dissimilar
 Race, Age, Gender, Social Class
Social Networks Inform Healthcare
 You do what your peers do
 Framingham Study
 20 years of data
 Manually constructed networks
 Smoking Cessation
 Obesity
 Happiness
 Can we construct Social Networks automatically?
Social Network Construction and Evaluation
 We have lots of text available
 30K message groups from Yahoo! Health
 Utilize threaded messaging to establish network
 Our cognitive model is evident in what we write
 Differentiate Schizophrenic from non-Schizophrenic
 LIWC
 Poets who commit suicide vs. those who do not
 Differentiate depressed vs. non-depressed college students
 Sentiment – positive or negative polarity
 Score – evaluation metric
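One way threaded messages could be turned into a network is sketched below. The slides do not specify the construction, so this assumes an edge links two authors whenever one replies to the other's message, weighted by the number of such replies. The tuple layout and the function name `reply_network` are hypothetical.

```python
from collections import Counter

def reply_network(messages):
    """Build undirected, weighted author-author edges from reply relations.

    messages: list of (message_id, author, parent_id), parent_id is None for thread starters.
    """
    author_of = {mid: author for mid, author, _ in messages}
    weights = Counter()
    for _, author, parent in messages:
        if parent is None or parent not in author_of:
            continue  # thread starter, or parent message missing from the data
        other = author_of[parent]
        if other != author:  # ignore self-replies
            weights[frozenset((author, other))] += 1
    return weights

# Made-up thread, for illustration only.
messages = [
    (1, "ann", None),
    (2, "bob", 1),
    (3, "ann", 2),
    (4, "cat", 1),
    (5, "cat", 4),
]
weights = reply_network(messages)
```

Edge weights from this construction could serve as the measure of how closely two people are connected.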
Example Message
 Hi All, I need your input. I'm having
about 27,000 extra pre-ventricular beats in a
24 hour period, per a Holter monitor test. My
electrophysiologist and cardiologist agree that
I should go on <Link>sotalol</Link>
/Betapace. They are putting me in the hospital
on February 26 to titrate me up on it. I've
refused the drug in the past because it is such
a dangerous drug.
Is there anyone out there who could
give me an idea of how you've done on this
drug? I'd sure appreciate hearing about your
experiences. Thanks so much.
[Figure: Sentiment. x-axis: Date (10/1/2004 to 10/1/2008); y-axis: Ratio of Negative Emotion Words (0 to -0.045); series: Tysabri, Control; vertical markers: FDA Approval, Withdrawal, Resume.]
Figure. Sentiment of messages mentioning Tysabri versus those that do not for two MS groups. Vertical bars indicate dates for FDA approval of Tysabri, voluntary withdrawal, and remarketing.
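A ratio of negative emotion words can be sketched as a simple dictionary lookup. The word list below is a tiny hypothetical stand-in, not LIWC's actual category dictionary, and the tokenizer and function name are assumptions for illustration.

```python
import re

# Hypothetical stand-in lexicon; LIWC's real word lists are not reproduced here.
NEGATIVE_WORDS = {"afraid", "dangerous", "hurt", "pain", "sad", "worried"}

def negative_ratio(text, lexicon=NEGATIVE_WORDS):
    """Fraction of tokens in text that appear in the negative-emotion lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)

# Sentence taken from the example message earlier in the deck.
msg = "I've refused the drug in the past because it is such a dangerous drug."
ratio = negative_ratio(msg)
```

Averaging such per-message ratios over a time window would give one point on a curve like the one in the figure.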
Results
 Sentiment over all messages
 Proxy for mental model – how happy they are
 Difference in average sentiment between two people
 Higher between random people in a network
 Lower for pairs that are closely connected
 Test methodology
 Compare means of differences between highly
connected nodes vs random pairs of nodes
 T-Test for statistical significance
 P-value < .0001 for 10 randomly selected groups
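The comparison of means can be sketched with a two-sample t statistic. The slides say only "T-Test", so the Welch (unequal variance) form used here is an assumption, and the sample values are made up purely to show the call. Turning t and the degrees of freedom into a p-value would need a t-distribution CDF, for example from a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean A
    vb = variance(sample_b) / nb  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up sentiment differences, for illustration only.
connected_pairs = [0.010, 0.020, 0.015, 0.012, 0.018]
random_pairs = [0.050, 0.040, 0.060, 0.055, 0.045]
t_stat, dof = welch_t(connected_pairs, random_pairs)
```

A strongly negative `t_stat` here would indicate that closely connected pairs differ less in sentiment than random pairs, which is the direction the slide reports.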
Acknowledgements
 Nyla Ismail for evaluating results
 Todd Littell for the MI code
Questions?
 Live demonstration available at:
 http://www.beespace.uiuc.edu
References
Church, K. W. and Hanks, P., (1989). Word association norms, mutual
information, and lexicography. In Proc. of the 27th Annual Conference of the
Association for Computational Linguistics, (Vancouver, B.C.), ACM Press, 76-83.
Clauset, A., Newman, M. E. J., and Moore, C., (2004). Finding community
structure in very large networks. Phys. Rev. E, 70 (6), 066111.
Kuhlthau, C. C., (1989). Information search process: A Summary of research and
implications for school library media programs. SLMQ, 18(1).
Newman, M. E. J., (2000). Models of the small world. J. Stat. Phys., 101, 819-841.
Solé, R., Ferrer-Cancho, R., Montoya, J. M., and Valverde, S., (2003). Selection,
tinkering, and emergence in complex networks. Complexity, 8 (1), 20-33.