Social Network Analysis: State of the Art and Challenges
Noshir Contractor, UIUC & NCSA
Katy Börner, Indiana University
Network Data
Network Data Collection & Extraction
Network Data Coding
Network Data Analysis: Methods
Network Data Visual-Analytics: Implementations and Algorithms
Network Simulation
Network Visualization
Tools for Networking and Referrals
Network Workflows
Network Data
Description
Multi-dimensional, multi-relational, multi-layer, multi-rater, ego-centric – data sets or
streams.
Entities could be people, text, documents, organizations, websites, blogs.
Datasets vary in the number of nodes, relations/dimensions, and subnetworks.
Culture of shared benchmark datasets: Holland & Leinhardt, Sampson, Krackhardt,
Enron … many included with software packages such as UCINET.
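One way to hold multi-relational data of the kind described above is a typed node table plus one edge set per relation. The sketch below is only an illustration under that assumption; the class name, node names, and relations are all hypothetical and do not come from any of the datasets listed.

```python
class MultiRelationalNetwork:
    """Typed nodes plus one directed edge set per relation (illustrative only)."""

    def __init__(self):
        self.node_type = {}   # node id -> entity type (person, document, ...)
        self.edges = {}       # relation name -> set of (source, target) pairs

    def add_node(self, node, kind):
        self.node_type[node] = kind

    def add_edge(self, relation, source, target):
        # Reject ties to unknown entities so every endpoint has a type.
        for n in (source, target):
            if n not in self.node_type:
                raise KeyError(f"unknown node: {n}")
        self.edges.setdefault(relation, set()).add((source, target))

    def ego_network(self, ego, relation):
        """Return ego's direct contacts, in either direction, for one relation."""
        alters = set()
        for s, t in self.edges.get(relation, ()):
            if s == ego:
                alters.add(t)
            elif t == ego:
                alters.add(s)
        return alters


# Hypothetical entities and ties.
net = MultiRelationalNetwork()
net.add_node("ann", "person")
net.add_node("bob", "person")
net.add_node("report.pdf", "document")
net.add_edge("advice", "ann", "bob")
net.add_edge("authored", "ann", "report.pdf")
net.add_edge("authored", "bob", "report.pdf")

print(net.ego_network("ann", "advice"))
print(net.ego_network("report.pdf", "authored"))
```

Keeping each relation in its own edge set makes it straightforward to analyze one layer at a time or to compare layers, at the cost of repeating node lookups across relations.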
Examples
Global Information Sector, Blogosphere, Political networks, Emergency
Multiorganizational Networks, AP News, Tobacco Control, Multidisciplinary virtual
teams, Scientometric networks, Bibliometric Co-author, Co-citation networks,
Funding/collaboration networks, (Cell) Phone usage logs, U.S. Census data, GIS
data, Capacity of medical facilities, Transit route schedules, etc.
Challenges
Intellectual: How do networks explain large social phenomena such as diffusion of
ideas, creativity, political movements and action, infectious diseases (AIDS, Avian
flu), online retailing, and mobilization to prepare for, respond to, and recover from disasters?
Network Data:
Size and Complexity are Growing Fast
There are now 1 billion books cataloged in WorldCat
http://www.oclc.org/worldcat/
Brewster Kahle’s Internet Archive captures about 600TB of data
http://www.archive.org/ - Michael Macy’s group (Cornell U, Cybertools)
Google no longer lists how many web pages it indexes.
Indiana University receives 650 million email messages per year (excluding
four times this amount from sites that are non-existent and are blocked).
Global Information Sector longitudinal dataset of international IT companies by
David Knoke, U Minnesota
Public Health Epidemiology, Evaluation and Surveillance Data (Gary Giovino,
Roswell Park Cancer Center, Brad Hesse/NCI, Pamela Clark/Battelle)
Instrumenting human interaction (video, speech, and non-verbal interactions)
and cognition: Corman/ASU, Cox/NCSA, Pentland/MIT, Bennett Bertenthal/U
of Chicago (Cybertools)
Infectious Disease Epidemiology data – Marathe/Eubank, Virginia Tech
A large number of small data sets dealing with the same variables or collected
using the same instruments that are typically not analyzed collectively, but
could be. For instance, ego-centric network datasets.
Network Data Collection:
Entity and Affiliation Extraction
Description
Manual (self-report, interviews, observation), Sensors, Web-crawling,
Automated network extraction methods: Entity extraction, relation or
affiliation extraction
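As a small illustration of affiliation extraction, the sketch below turns bibliographic records into a weighted co-author network by counting how many papers each pair of authors shares. The records are invented for the example, and real data would first need the entity-extraction step (name disambiguation) that this sketch assumes has already been done.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bibliographic records: one list of author names per paper.
records = [
    ["A. Smith", "B. Jones", "C. Lee"],
    ["A. Smith", "B. Jones"],
    ["C. Lee", "D. Kim"],
]


def coauthor_edges(papers):
    """Count how many papers each unordered author pair shares."""
    weights = Counter()
    for authors in papers:
        # sorted(set(...)) drops duplicate names and fixes the pair order.
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


edges = coauthor_edges(records)
for (a, b), w in sorted(edges.items()):
    print(a, "-", b, "shared papers:", w)
```

This is the usual projection of a two-mode (author by paper) affiliation network onto a one-mode author network; the edge weight keeps the number of shared affiliations.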
Examples
EgoNet, CI as a source of data (provenance graphs), Bibliographic data,
CITESEER, Web 2.0 technologies (Blogs, RSS feeds, etc), GATE,
Crawdad (Corman), D2K & T2K (Welge), VIAS (Craig)
Challenges
Validation of automated (or computer-augmented) approaches to entity and
affiliation extraction
Network Data Coding
Description
Manual, Tagging, Semantic, From Taxonomy to “Folksonomies”
Examples
Flickr, Tag Cloud, Semantic Grid, Provenance, Social bookmarking,
Credentialling
Challenges
Self-organizing standards – Metadata, Semantic Grid, Incentive
structures for mobilizing communities to contribute
Adaptiveness to changing Folksonomies
“Conservative” tendency towards the mean or majority view
Network Data Analysis: Methods
Description
ERGM - Exponential Random Graph Models (Wasserman, Pattison, Robins,
Snijders, et al)
Network Evolution – Actor oriented models (Snijders, Steglich)
Positional analysis – generalized block modeling (Batagelj)
Autocorrelation Models (Leenders)
Spectral analysis (Richards, Seary)
Multi-relational, multi-rater networks (Koehly, Corman)
Sampling, missing data (Wasserman, Butts)
Examples
Monte Carlo techniques for Maximum Likelihood Estimation of ERGM: Simulate a
distribution of random graphs from a starting set of parameter values, then refine
the parameter estimates by comparing the simulated distribution with the
observed graph until the estimates stabilize.
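The simulate-compare-refine loop can be sketched on the simplest possible case, an ERGM whose only statistic is the edge count. This is an illustration under stated assumptions, not the procedure used by any particular package: the function names, the gain sequence, the fixed iteration count (in place of a convergence check), and the observed edge count of 12 on 10 nodes are all choices made for the example. For this model the MLE has the closed form log(edges / non-edges), which gives a way to check the simulation-based estimate.

```python
import math
import random


def metropolis_edges(state, theta, steps, rng):
    """Run `steps` Metropolis toggles of an edges-only ERGM, in place.

    `state` holds one 0/1 flag per dyad. Toggling a dyad changes the edge
    count by +1 or -1, so a toggle is accepted with probability
    min(1, exp(theta * delta)). Returns the edge count after the last step.
    """
    for _ in range(steps):
        i = rng.randrange(len(state))
        delta = 1 - 2 * state[i]          # +1 adds an edge, -1 removes one
        if rng.random() < math.exp(min(0.0, theta * delta)):
            state[i] = 1 - state[i]
    return sum(state)


def fit_edges_ergm(observed_edges, n_dyads, iterations=500, steps=200, seed=0):
    """Stochastic-approximation estimate of theta (Robbins-Monro style)."""
    rng = random.Random(seed)
    theta = 0.0                           # starting parameter value
    state = [0] * n_dyads                 # start the chain from the empty graph
    for k in range(1, iterations + 1):
        simulated = metropolis_edges(state, theta, steps, rng)
        gain = 4.0 / (n_dyads * k)        # shrinking step size (assumed schedule)
        # Move theta up if simulated graphs are too sparse, down if too dense.
        theta += gain * (observed_edges - simulated)
    return theta


n_nodes = 10
n_dyads = n_nodes * (n_nodes - 1) // 2    # 45 undirected dyads
theta_hat = fit_edges_ergm(observed_edges=12, n_dyads=n_dyads)
closed_form = math.log(12 / (n_dyads - 12))
print("simulation-based estimate:", round(theta_hat, 3))
print("closed-form MLE:", round(closed_form, 3))
```

With realistic statistics (triangles, stars) no closed form exists and each toggle requires recomputing change statistics, which is where the computational cost discussed under Challenges comes from.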
Challenges
“However, Monte Carlo approaches to MLE can be computer intensive, so estimation
for networks with a large number of nodes, or for a complex model may not be possible
or may take an unacceptably long time” (p. 157, Wasserman & Robins, 2005).
Scalability in terms of size of networks (nodes, number of relationships) and complexity
of model – number of parameters to be estimated.
Network Data Visual-Analytics:
Implementations and Algorithms
Description
Descriptive methods to calculate (simple) network statistics (e.g., centrality or
transitivity)
Procedure-based analysis for more complex (iterative) algorithms (e.g., cluster
analysis or eigen decomposition)
Statistical modeling based on probability distributions (e.g., Exponential Random
Graph Modeling, Quadratic Assignment Procedures)
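For the first category above, descriptive statistics, a minimal sketch of degree centrality and transitivity on a small undirected graph follows. The graph is made up for the example, and the functions are plain-Python illustrations rather than the implementation of any of the packages named on this slide.

```python
from itertools import combinations

# Hypothetical undirected network as an adjacency dict.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def transitivity(adj):
    """Fraction of connected triples (two-paths) that are closed."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        # Every pair of neighbors of a node forms a two-path through it.
        for u, w in combinations(nbrs, 2):
            triples += 1
            if w in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0


print(degree_centrality(graph)["a"])   # 1.0: "a" is tied to every other node
print(transitivity(graph))             # 0.6: 3 of the 5 two-paths are closed
```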
Examples
Pajek (Batagelj), SNA-R/StatNet (Butts), Multinet/PSPAR (Richards/Seary), VOSON
(Ackland), StocNet/SIENA (Snijders/Steglich), NetVis (Cummings), JUNG (Fisher)
Challenges
Data formats interoperability, data integration, data management.
Open code, doing “bake-offs” using simulation models on benchmark data to
explain variance.
Evaluating and assessing different Markov Chain Monte Carlo (MCMC) algorithms
(such as Robbins-Monro, Gibbs Sampler, Metropolis-Hastings)
Extensible and scalable software frameworks to ‘plug & play’ diverse algorithms.
Algorithm and code documentation and learning modules.
Software for Network Analysis
Source: Huisman, M. & Van Duijn, M. A. J. (2005). Software for Social Network Analysis. In P. J. Carrington, J. Scott, & S. Wasserman
(Editors), Models and Methods in Social Network Analysis (pp. 270-316). New York: Cambridge University Press.
Software Toolkits for Network Analysis
Source: Huisman, M. & Van Duijn, M. A. J. (2005). Software for Social Network Analysis. In P. J. Carrington, J. Scott, & S. Wasserman
(Editors), Models and Methods in Social Network Analysis (pp. 270-316). New York: Cambridge University Press.
Network Simulation
Description
Systems Dynamics
Agent Based Models
Computational Network Models
Examples
Diffusion of information, emergence of norms, coordination of
conventions, or participation in collective action (Macy)
Spread of epidemics EPISIMS (Marathe/Eubank)
Tobacco control (Clark/Hesse)
Computational modeling Environments: Repast (Sallach), Blanche
(Contractor)
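A very small agent-based model of the epidemic-spread example can be sketched as a discrete-time SIR process on a network. This is an illustration only and is unrelated to the internals of EPISIMS, Repast, or Blanche; the ring network, the transmission probability, and the function name are assumptions made for the example.

```python
import random


def simulate_sir(adj, seeds, p_transmit, rng, max_steps=1000):
    """Discrete-time SIR on a network.

    Each step, every infected node tries once to infect each susceptible
    neighbor with probability `p_transmit`, then recovers. Returns the final
    state of every node and the number of new infections per step.
    """
    state = {v: "S" for v in adj}
    for s in seeds:
        state[s] = "I"
    history = [len(set(seeds))]
    for _ in range(max_steps):
        infected = [v for v in adj if state[v] == "I"]
        if not infected:
            break
        newly = set()
        for v in infected:
            for u in adj[v]:
                if state[u] == "S" and rng.random() < p_transmit:
                    newly.add(u)
        for v in infected:
            state[v] = "R"
        for u in newly:
            state[u] = "I"
        history.append(len(newly))
    return state, history


# Hypothetical contact network: a ring of six agents.
n = 6
ring = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

rng = random.Random(42)
final_state, new_cases = simulate_sir(ring, seeds=[0], p_transmit=0.5, rng=rng)
print("new cases per step:", new_cases)
print("ever infected:", sum(1 for s in final_state.values() if s == "R"))
```

Setting `p_transmit` to 1.0 or 0.0 gives deterministic runs (everyone or only the seed is infected), which is a convenient sanity check before docking the model against other simulations or empirical data.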
Challenges
Reusability, transparency, docking, multi-scale simulations, distributed
data, doing “bake-offs” on simulation models on benchmark data to
explain variance. Theoretical testing and empirical validation.
Network Visualization
Description
An (animated) image of a network is sometimes worth a list of millions of
(dynamically changing) node-node pairs.
In many cases, visuals are the major means to represent and
communicate scientific results -- across scientific boundaries.
Examples
Pajek (Batagelj), GUESS (Adar), JUNG (Fisher), MatrixVis (Chris
Mueller), TraceEncounters (Paley), TreePlus (Plaisant)
Challenges
Eye candy vs. highly readable and effective visualizations.
Visualization of data origin, provenance, accuracy, (un)certainty.
Tight coupling of data analysis and visualization to help people make
sense of very large, dynamically evolving datasets.
Scalable, interactive/iterative specification of data analysis and data
mappings.
http://www.visualcomplexity.com/vc/
Tools for Networking and Referrals
Description
Search-based recommendations, Category-based recommendations,
Collaborative filtering: Memory-based, Model-based (clustering),
Association rules (or item-to-item collaborative filtering), Content-based
methods, Recommendation support, Hybrid methods
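Item-to-item collaborative filtering, one of the methods listed above, can be sketched with cosine similarity over binary user-item data. The people, documents, and function names below are hypothetical, and the scoring rule (summing similarities to the items a user already has) is one simple choice among several.

```python
import math
from collections import defaultdict

# Hypothetical data: which documents each person has bookmarked.
likes = {
    "ann": {"d1", "d2", "d3"},
    "bob": {"d1", "d2"},
    "cat": {"d2", "d3", "d4"},
    "dan": {"d4"},
}


def item_similarity(user_items):
    """Cosine similarity between items, based on which users share them."""
    users_of = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            users_of[item].add(user)
    sim = {}
    for a in users_of:
        for b in users_of:
            if a != b:
                shared = len(users_of[a] & users_of[b])
                if shared:
                    norm = math.sqrt(len(users_of[a]) * len(users_of[b]))
                    sim[(a, b)] = shared / norm
    return sim


def recommend(user, user_items, top_n=2):
    """Rank unseen items by summed similarity to the user's own items."""
    sim = item_similarity(user_items)
    scores = defaultdict(float)
    for owned in user_items[user]:
        for (a, b), s in sim.items():
            if a == owned and b not in user_items[user]:
                scores[b] += s
    # Sort by descending score, breaking ties by item name.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("bob", likes))   # ['d3', 'd4']
```

The same structure applies to referrals if the "items" are people or organizations, for example recommending contacts who share affiliations with a user's existing contacts.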
Examples
TraceEncounters (Paley), Intellibadge (Donna Cox), I-neighbors
(Hampton), AOL (Karahalios), LEEP (Haythornthwaite), CI-KNOW
(Contractor), NetExpert (Ramon Sanguesa), Adaptive Referral Systems
(Singh)
Challenges
Algorithms for capturing data automatically and in close to real time
Agreement and implementation of metadata, provenance, and
Developing theoretically grounded and statistically defensible
mechanisms for referrals
Implementation of scalable algorithms
Privacy-benefit trade-offs
Access to large bibliographic databases
Network Workflows:
Tying it all together
Description
Itineraries – Supporting power users who want to thread multiple
programs, data sets
Examples
Kepler and D2K (Welge)
Challenges
Agreement and implementation of standards
Interfacing with existing and new visual-analytic tools, data sets
What Other Challenges Exist?