Brant Chee, Bruce Schatz
University of Illinois
http://www.beespace.uiuc.edu

Small World Graph
  Clauset et al., 2004

Small World Graph
  N = number of nodes, z = average number of neighbors per node
  Characteristic Path Length l
    The typical separation of nodes in a graph.
    lrand ~ ln(N)/ln(z)
  Clustering Coefficient C
    Average fraction of pairs of neighbors of a node which are also neighbors of each other.
    Informally, how close the typical neighborhood is to a clique!
    Crand ~ z/N

Small World Graph
  C >> Crand
  l ≥ lrand
  N >> z >> ln(N)
  Newman, 2000

SW MI Graph
  Solé et al., 2003

Purpose…So What?
  Facilitate Exploratory Process
    Search result clustering
    Information discovery
  Develop Middle Ground Algorithms
    Interactive responses AND useful clusters

Language as a Small World Network
  Make use of underlying structure of language

System Overview

Graph Construction
  A node is a term in the index
    Terms bounded by frequency cutoff.
    Terms occurring in fewer than 5 documents or in more than 25% of documents are removed.
  Edges between nodes are determined by Mutual Information
    I(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)}
    P(x, y) is the co-occurrence probability of x and y, calculated in a window the size of the abstract
    Church and Hanks, 1989

What threshold?
  Threshold  N     z     l     lrand  ∆l    C     Crand
  0.001      6612  9.85  4.31  3.85   0.46  0.57  0.002
  0.002      4087  5.09  4.84  5.11   0.27  0.68  0.001
  0.005      1342  2.66  5.51  7.36   1.85  0.75  0.002
  0.01       517   1.69  1.21  11.91  10.7  0.81  0.003
  0.02       161   1.22  0.34  25.55  25.2  0.91  0.008
  0.05       25    1.12  0.06  28.40  28.3  1.0   0.045
  Where to cut?

Clustering Algorithm
  Clauset, Newman and Moore, 2004
    Generalization for nodes based upon Newman’s algorithm.
  Based upon modularity:
    The fraction of edges within communities versus the fraction expected to fall within them at random in the same network.
    Near 0 if there is little community structure, above about 0.3 if there is significant structure.
    If only the fraction of edges within communities were counted, the maximum would always occur when all nodes are in one cluster.
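The modularity comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the Clauset, Newman and Moore implementation: it evaluates the standard pairwise form of modularity from that paper by brute force (observed adjacency minus the degree-based random expectation, summed over same-community node pairs). The function name modularity and the tiny example graph (two triangles joined by a single edge) are hypothetical and chosen only for illustration.

```python
from itertools import product


def modularity(edges, community):
    """Brute-force modularity Q for a simple undirected graph.

    edges: list of (u, v) pairs, no self-loops or duplicate edges (assumed).
    community: dict mapping every node to a community label.
    """
    nodes = sorted(community)
    # Symmetric adjacency lookup: A_ij = 1 if i and j share an edge.
    adjacent = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    two_m = 2 * len(edges)  # sum of all node degrees

    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community[i] != community[j]:
            continue  # only same-community pairs contribute
        a_ij = 1.0 if (i, j) in adjacent else 0.0
        expected = degree[i] * degree[j] / two_m  # random expectation
        total += a_ij - expected
    return total / two_m


# Hypothetical example: two triangles connected by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single = {node: "all" for node in range(6)}

q_split = modularity(edges, split)    # 5/14, about 0.357
q_single = modularity(edges, single)  # 0: no better than random
print(round(q_split, 3), round(q_single, 3))
```

In this toy graph the natural two-triangle split scores above 0.3, while placing every node in one cluster scores 0, which is why modularity does not reward the trivial single-cluster answer.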
  Modularity Q:
    \delta(c_i, c_j) = 1 if c_i and c_j are in the same community, 0 otherwise
    Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
    A_{ij} = 1 if nodes i and j are connected, 0 otherwise; k_i = degree of node i
    m = number of edges in the graph (2m = sum of all node degrees)

Experiments
  3 clustering algorithms
    Complete Link (Cluto)
    K means (Cluto)
    Small World

Test Collections
  Collection Search Terms C1 plasticity acetylcholine General C2 Specific microarray muscarinic plasticity ((cholinergic noradrenergic) receptor) Number of Number of Terms Abstracts OR 81,746 267,981 OR 74,533 OR OR OR AND 285,623

Experimental Setup
  Parameters left at package defaults
  Clustered with n = 50, 100, 150 and 200.
  Clusters with fewer than 4 elements or more than 50 elements were eliminated, and the clustering that resulted in fewer than 40 clusters was chosen for evaluation.

Quantitative Results
  Collection  Algorithm  Threshold  Running Time (s)
  C1          SW         N/A        40.54
  C1          C-Link     50         214.106
  C1          K-Means    200        11.581
  C2          SW         N/A        47.35
  C2          C-Link     100        198.147
  C2          K-Means    200        5.538

Conclusions
  Developed Balanced Clustering System
    Fast running time
    Good clustering results
  Modified Small World Algorithm
    Clustered text based on language model
    Produced many similar-sized clusters

Social Networks as Small World Networks

Social Network
  Network demonstrating who interacts with whom
    Threaded messages in a Newsgroup
  Create a network based on various characteristics
  Homophily
    Similar people tend to interact more than those who are dissimilar
    Race, Age, Gender, Social Class

Social Networks Inform Healthcare
  You do what your peers do
  Framingham Study
    20 years of data
    Manually constructed networks
    Smoking Cessation
    Obesity
    Happiness
  Can we construct Social Networks automatically?

Social Network Construction and Evaluation
  We have lots of text available
    30K message groups from Yahoo! Health
  Utilize threaded messaging to establish network
  Our cognitive model is evident in what we write
    Differentiate Schizophrenic from non-Schizophrenic
    LIWC
    Poets who commit suicide vs those that do not
    Differentiate depressed vs non-depressed college students
  Sentiment: positive or negative polarity
  Score: evaluation metric

Example Message
  Hi All, I need your input. I'm having about 27,000 extra pre-ventricular beats in a 24 hour period, per a Holter monitor test. My electrophysiologist and cardiologist agree that I should go on sotalol/Betapace. They are putting me in the hospital on February 26 to titrate me up on it. I've refused the drug in the past because it is such a dangerous drug. Is there anyone out there who could give me an idea of how you've done on this drug? I'd sure appreciate hearing about your experiences. Thanks so much.

Sentiment
  [Figure: x-axis Date, 10/1/2004 to 10/1/2008; y-axis Ratio of Negative Emotion Words, 0 to -0.045; series Tysabri and Control; marked events FDA Approval, Withdrawal, Resume]
  Figure. Sentiment of messages mentioning Tysabri versus those that do not for two MS groups. Vertical bars indicate dates for FDA approval of Tysabri, voluntary withdrawal, and remarketing.

Results
  Sentiment over all messages
    Proxy for mental model: how happy they are
  Difference in average sentiment between two people
    Higher between random people in a network
    Lower for pairs that are closely connected
  Test methodology
    Compare means of differences between highly connected nodes vs random pairs of nodes
    T-Test for statistical significance
    P-value < .0001 for 10 randomly selected groups

Acknowledgements
  Nyla Ismail for evaluating results
  Todd Littell for the MI code

Questions?
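The test methodology on the Results slide can be sketched as follows. The slides do not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the per-user sentiment scores and pair lists are invented purely for illustration; none of them come from the Yahoo! Health data. Only the t statistic and degrees of freedom are computed; converting them to a p-value needs a t-distribution CDF, which is left out to keep the sketch within the standard library.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    mean_gap = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    t = mean_gap / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, dof


def sentiment_gaps(pairs, scores):
    """Absolute difference in average sentiment for each pair of users."""
    return [abs(scores[a] - scores[b]) for a, b in pairs]


# Hypothetical average sentiment per user (invented values).
avg_sentiment = {
    "u1": -0.010, "u2": -0.012, "u3": -0.030,
    "u4": -0.033, "u5": -0.020, "u6": -0.021,
}
# Hypothetical pair lists: closely connected users vs randomly drawn pairs.
connected_pairs = [("u1", "u2"), ("u3", "u4"), ("u5", "u6")]
random_pairs = [("u1", "u3"), ("u2", "u4"), ("u1", "u6"), ("u4", "u5")]

connected_gaps = sentiment_gaps(connected_pairs, avg_sentiment)
random_gaps = sentiment_gaps(random_pairs, avg_sentiment)
t_stat, dof = welch_t(connected_gaps, random_gaps)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A negative t statistic here means the connected pairs have smaller sentiment gaps on average than the random pairs, which is the direction the Results slide reports.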
Live demonstration available at: http://www.beespace.uiuc.edu

References
  Church, K. W. and Hanks, P. (1989). Word association norms, mutual information, and lexicography. In Proc. of the 27th Annual Conference of the Association for Computational Linguistics (Vancouver, B.C.), ACM Press, 76-83.
  Clauset, A., Newman, M. E. J., and Moore, C. (2004). Finding community structure in very large networks. Phys. Rev. E, 70(6), 066111.
  Kuhlthau, C. C. (1989). Information search process: A summary of research and implications for school library media programs. SLMQ, 18(1).
  Newman, M. E. J. (2000). Models of the small world. J. Stat. Phys., 101, 819-841.
  Solé, R., Ferrer-Cancho, R., Montoya, J. M., and Valverde, S. (2003). Selection, tinkering, and emergence in complex networks. Complexity, 8(1), 20-33.