High-dimensional social data: a mapper’s worst nightmare

Elijah Wright School of Library and Information Science Indiana University, Bloomington

Mapping Humanity’s Knowledge and Expertise in the Digital Domain

Annual Meeting of the Association of American Geographers, Denver, CO, April 5-9, 2005

Pros and Cons, or “Why Map Social Artifacts?”

One of the largest benefits of my work, and of all similar work, is that it tries to transform very large social interaction patterns into a more understandable form. The largest danger of this approach is that fine-grained details of high interpretive value can quickly be lost in a sea of possibly, but not necessarily, relevant data.

Major advantage: we can come to understand aspects of systems that are far too complex to map out by any other means.

Visualization 1 and User Task

User Task Questions

• How do I find myself within this map?

• How do I locate other people who share more than one, or some combination, of my interests?

• How do I track myself within this map as my ideas, and as my posting habits, change?

• What can I learn from the relationships between clusters of points in this map? Should I infer that proximity between the centers of two topic clusters means that people tend to post about both of those topics?

• How do I interpret cluster centrality versus marginal positioning?
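The second question above (finding people who share more than one interest) can be sketched as a set intersection. This is only an illustration: the user names, the interest sets, and the `min_shared` threshold are all hypothetical and do not come from the LiveJournal data discussed later.

```python
def shared_interest_matches(me, interests, min_shared=2):
    """Return {user: shared interests} for users sharing at least
    `min_shared` interests with `me`."""
    mine = interests[me]
    matches = {}
    for user, theirs in interests.items():
        if user == me:
            continue
        common = mine & theirs  # set intersection
        if len(common) >= min_shared:
            matches[user] = common
    return matches


# Hypothetical toy data, invented for illustration only.
interests = {
    "me": {"knitting", "python", "maps"},
    "user_1": {"python", "maps", "jazz"},
    "user_2": {"knitting", "cooking"},
    "user_3": {"maps", "knitting", "python", "chess"},
}

matches = shared_interest_matches("me", interests)
for user, common in sorted(matches.items()):
    print(user, sorted(common))
```

On a map, the same idea would show up as points that sit near the overlap of two or more topic clusters.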

What do we all share, and what can we get from these questions?

• I want us to share new, exciting, and innovative ways of thinking about the arrangement of high-dimensional data, and to actively contribute to each other’s work.

• For myself, I want to better understand how cartographers and geographers would like to see us use spatial metaphors to place abstract (social and other) data into imagined space.

The “real-world” nature of social questions makes for a pretty convenient grounding for the interpretation of new techniques.

What I do

• Geographers and information scientists are facing similar problems, especially regarding the visualization of high-dimensional, abstract data that may or may not have a real-world spatial component.

• In my own work, which attempts to analyze, model, and visualize the evolving structure of social communication networks (via citations, weblogs, or semantic web data), the data are often so large that it is very difficult to generate usable visualizations or any meaningful analysis of their systemic properties.

More What-I-Do

• As we saw in my sample set of user tasks, it is common for users to want to develop an understanding of how they relate to others within the system, where interesting activity is going on, and where fruitful results may be gleaned from additional conversation or interaction.

• These seem to be core, common user tasks that many providers of high-dimensional or high-quality data sets wish to support.

Algorithmic Toolbox

• Many. I draw primarily on social network analysis, and often use principal components analysis (PCA), multidimensional scaling (MDS), and other techniques related to the singular value decomposition (SVD).

• More of the approaches from various schools of analysis (information retrieval research, network analysis, corpus-linguistic methods) share mathematical roots (factor analysis, principal components analysis, eigen-systems of one sort or another) than is commonly admitted.
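The shared roots mentioned above can be made concrete with a small sketch. Assuming Euclidean distances, PCA (an eigen-decomposition of the feature-by-feature scatter matrix) and classical MDS (an eigen-decomposition of the double-centered squared-distance matrix) have the same nonzero eigenvalues. The toy matrix below (six users by three topics) and all function names are invented for illustration; the source does not say which software was used.

```python
import math
import random


def center(rows):
    """Subtract column means so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def scatter(x):
    """X^T X: the unnormalized covariance matrix that PCA diagonalizes."""
    cols = list(zip(*x))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def mds_gram(rows):
    """Double-centered squared Euclidean distances (classical MDS)."""
    n = len(rows)
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in rows] for p in rows]
    row_mean = [sum(r) / n for r in d2]
    grand = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def top_eigenvalue(m, iters=500, seed=0):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    rng = random.Random(seed)
    v = [rng.random() for _ in m]  # random start avoids the null space
    lam = 0.0
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in m]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            break
        v = [x / lam for x in w]
    return lam


# Hypothetical counts of posts per topic (rows: users, columns: topics).
data = [
    [5, 0, 1], [4, 1, 0], [5, 1, 1],
    [0, 4, 3], [1, 5, 4], [0, 5, 3],
]

lam_pca = top_eigenvalue(scatter(center(data)))  # 3 x 3 problem
lam_mds = top_eigenvalue(mds_gram(data))         # 6 x 6 problem
print(lam_pca, lam_mds)  # the two values agree up to rounding error
```

The two matrices have different sizes (topics by topics versus users by users), yet the leading eigenvalue is the same, which is one way of seeing why these methods tend to produce similar maps.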

Motivation, and Suggestions for Disciplinary Synthesis

• The visualization problem, accompanied by the need for research into how users cognize and interpret our ‘maps’.

• The data storage and management problem, along with issues of data quality and data sampling.

• The mathematical problem: much of this work relies on techniques that are relatively difficult to learn, evaluate, or teach.

• Trust and privacy issues.

• All of these need *synthesis* into solutions. This is **HARD**.

Technical Challenges

• The primary technical challenge facing us, as users of high-dimensional data, is the provision of both appropriate statistical methods and reliable, efficient storage systems that can scale with the rapid increase in size of the data to be considered.

Non-technical (social!) challenges

• The most important non-technical challenges are interpretation and the retention of contextually sensitive data. Many current systems are difficult to interpret for users other than their designers, or require expertise that is not readily available to the target users. Along similar lines, it is difficult to retain an appropriate amount of contextually important information within a large universe of data. Users viewing maps of all of science, or of large subsets of human knowledge, may be able to devise interesting (and perhaps more valuable) ways to arrange the data that are not initially obvious to system designers. Any system that allows for the creation of maps should carefully consider user input and the interpretation of stored data by these end users.

Sample Data and Visualizations / ‘Maps’

My work is primarily associated with analysis of weblog and semantic web data. In association with a number of other researchers, I have been studying the network properties of both the “blogosphere” (the global, interconnected network of weblog authors) and of data associated with the World Wide Web Consortium’s (W3C) FOAF (Friend-of-a-Friend) Semantic Web project.

Sample Maps

1) Content-based MDS map of a weblog corpus [see slide #2]

2) Vis of PCA of LiveJournal users’ interests

3) Vis of correspondences between LJ user interests and their social relations

4) Vis of a snowball crawl of the blogosphere, with content-analytic codes applied

5) A smaller slice of vis 4: Catholic weblog authors
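For readers unfamiliar with the term, a snowball crawl (map 4) starts from a few seed weblogs and follows their links outward in waves. The sketch below is a generic breadth-first version over an in-memory link table. It is not the crawler used for these maps; the blog names, the `links` structure, and the `max_waves` parameter are assumptions made for illustration.

```python
from collections import deque


def snowball_sample(links, seeds, max_waves):
    """Follow outbound links from the seeds for at most `max_waves` hops.

    Returns a dict mapping each reached blog to the wave in which
    it was first reached (seeds are wave 0).
    """
    wave_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        blog = queue.popleft()
        if wave_of[blog] == max_waves:
            continue  # do not expand blogs found in the final wave
        for target in links.get(blog, ()):
            if target not in wave_of:
                wave_of[target] = wave_of[blog] + 1
                queue.append(target)
    return wave_of


# Hypothetical link table: blog -> blogs it links to.
toy_links = {
    "blog_a": ["blog_b", "blog_c"],
    "blog_b": ["blog_d"],
    "blog_c": ["blog_a"],
    "blog_d": ["blog_e"],
    "blog_e": [],
}

sample = snowball_sample(toy_links, ["blog_a"], max_waves=2)
print(sample)
```

Because the sample grows outward from the seeds, the resulting map reflects the neighborhood of those seeds rather than the blogosphere as a whole, which matters when interpreting cluster centrality.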

Visualization 2

• LJ FOAF vis

Clusters and Groups Visualization 3

Visualization 4

Visualization 5 - Catholic weblog authors

Planned Work (some now completed…)

• When I proposed this talk, my collaborator (John Paolillo) and I were preparing a chapter for the second edition of Vladimir Geroimenko’s book, Visualizing the Semantic Web. That’s now done, and is the source for Vis. 2/3.

• An article for The Semantic Web Journal and a research paper for a social networks conference (focused on the evolving network structure of NIH author networks) are also in the planning stages.

Representative Work

• Much of my recent research reading has focused on the application of social network analysis principles to the graph structures produced by interconnections between weblog and Semantic Web documents.

• Also see papers linked from http://www.blogninja.com/ for a sense of what our research group is up to and has done in the past.

Representative papers:

Susan C. Herring, Inna Kouper, John C. Paolillo, Lois Ann Scheidt, Michael Tyworth, Peter Welsch, Elijah Wright, and Ning Yu. (2005). Conversations in the Blogosphere: An Analysis "From the Bottom Up". Proceedings of the Thirty-Eighth Hawai'i International Conference on System Sciences (HICSS-38). Los Alamitos: IEEE Press. Available at http://www.blogninja.com/hicss05.blogconv.pdf

John C. Paolillo and Elijah Wright. (2004). “The Challenges of FOAF Characterization.” From the proceedings of the 1st Workshop on Friend of a Friend, Social Networking and the Semantic Web. Available at http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/challenges_of_foaf_characterization/

Citations…

John C. Paolillo and Elijah Wright. (2004). “The Challenges of FOAF Characterization.” From the proceedings of the 1st Workshop on Friend of a Friend, Social Networking and the Semantic Web. Available at http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/challenges_of_foaf_characterization/

Herring, Susan C., Kouper, Inna, Paolillo, John, Scheidt, Lois Ann, Tyworth, Michael, Welsch, Peter, Wright, Elijah, Yu, Ning. (2005). Conversations in the Blogosphere: A Social Network Analysis "from the Bottom Up". In Proceedings of the Thirty-eighth Hawaii International Conference on System Sciences (HICSS-38) (Ed.), Los Alamitos: IEEE Press.

Herring, Susan C., Kouper, Inna, Scheidt, Lois Ann, & Wright, Elijah (2004). Women and Children Last: The Discourse Construction of Weblogs. In Laura J. Gurak, Smiljana Antonijevic, Laurie Johnson, Clancy Ratliff, & Jessica Reyman (Eds.), Into the Blogosphere: Rhetoric, Community, and Culture of Weblogs (Minneapolis).

Herring, Susan C., Scheidt, Lois Ann, Bonus, Sabrina, & Wright, Elijah (in press). Weblogs as a bridging genre. Information, Technology, & People.

Herring, Susan C., Scheidt, Lois Ann, Bonus, Sabrina, & Wright, Elijah (2004b). Bridging the Gap: A Genre Analysis of Weblogs. In Proceedings of the Thirty-seventh Hawaii International Conference on System Sciences (HICSS-37) (Ed.), Los Alamitos: IEEE Press.

Scheidt, Lois Ann & Wright, Elijah (2004). Common Visual Design Elements of Weblogs. In Laura J. Gurak, Smiljana Antonijevic, Laurie Johnson, Clancy Ratliff, & Jessica Reyman (Eds.), Into the Blogosphere: Rhetoric, Community, and Culture of Weblogs (Minneapolis).