Transcript Slide 1
Visualization in Text Analysis Problems
VAC Consortium Meeting, Stanford, May 24, 2006
Marti Hearst, School of Information, UC Berkeley

Outline
Some Visualization Design Principles
Illustrated with a new example
Why Text is Tricky to Visualize
How to do good visualization design with text while meeting analysts’ needs?
Focus on Flexibility with Reproducibility
Examples from 4 different domains

What Makes for a Good Visualization?
Visually illuminates important aspects of the underlying data and domain.
Supports the users’ tasks (better than without the visualization).
Adheres to good design principles.

Example from Software Engineering
Marat Boshernitsan, UC Berkeley PhD Dissertation 2006
Problem: need to make complex changes throughout code.
Example: convert from one API to another.

A Typical Solution
Either requires programmers to understand and manipulate abstract syntax trees …
Or requires learning another programming language (or both)!

First Attempt

Second Attempt

A Better Solution
Build on how programmers think about programming.
Operate on the textual representation of code.

Users Operate on Familiar Visual Representation of Code

Context-and-Domain Sensitive Visual Cues

Lessons from this Example
User-centered Design
This was the third attempt.
First 2 attempts did not accurately reflect how users think about the problem.
Careful design of labels and interaction cues
Very intelligent backend, but user-activated.
Visually and interactively reflects how programmers think about programming.

What Makes for a Good Visualization for Analysts?
Visually illuminates important aspects of the underlying data and domain.
Supports the users’ tasks (better than without the visualization).
Adheres to good design principles.

Goals vs. Tasks
Analysts’ Goals:
Understand current and past situations
Predict and anticipate future situations
Observations by Pirolli & Card ’05: Different analysts starting with people, organizations, tasks, and time:
predict coup likelihood
understand bio-warfare threats
understand relations within cartel

Goals vs. Tasks
Analysts’ tasks: Explore, Extract, Filter, Link, Arrange, Compare, Hypothesize
(A combination of Foraging and Sensemaking)
Should do the tasks only to support the goals.

Design Principles for Analysts
Experienced analysts notice what is missing or unexpected (Wright et al. ’06)
Thus consistency and reproducibility are important.

Design Principles for Analysts
Analysts must guard against confirmation bias. (Pirolli & Card ’05)
Thus it is important for analysts to
Be able to easily arrange and re-arrange,
View information flexibly from many angles,
While at the same time retaining consistency and reproducibility.
However … it’s hard to do this with text.

Working with Text
Text is especially difficult to visualize
Very high dimensionality: tens to hundreds of thousands of features
Compositional: can be combined together in innumerable ways
Abstract: and so difficult to visualize
Not pre-attentive: must foveate to read
Subtle: small differences matter
Unordered

Text Meaning is NOT pre-attentive
[Figure: rows of words, each row followed by the same words spelled backwards; the rows repeat in varying order]
SUBJECT PUNCHED QUICKLY OXIDIZED   TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS   NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS   ECNEICS HSILGNE SDROCER SNMULOC
GOVERNS PRECISE EXAMPLE MERCURY   SNREVOG ESICERP ELPMAXE YRUCREM

Why Text is Tough
Abstract concepts are difficult to visualize
Combinations of abstract concepts are even more difficult to visualize
time
shades of meaning
social and psychological concepts
causal relationships

Why Text is Tough
The dog.

Why Text is Tough
The dog.
The dog cavorts.
The dog cavorted.

Why Text is Tough
The man.
The man walks.

Why Text is Tough
The man walks the cavorting dog.
So far, we can sort of show this in pictures.

Why Text is Tough
As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.
How do we visualize this?

Why Text is Tough
Language only hints at meaning
Most meaning of text lies within our minds and common understanding
“How much is that doggy in the window?”
“how much”: social system of barter and trade (not the size of the dog)
“doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
“in the window” implies behind a store window, not really inside a window, requires notion of window shopping

Why Text is Tough
General categories have no standard ordering (nominal data)
Categorization of documents by single topics misses important distinctions
Consider an article about
NAFTA
The effects of NAFTA on truck manufacture
The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez

Why Text is Tough
Other issues about language
Ambiguous (many different meanings for the same words and phrases)
Same meaning implied by different combinations
Different combinations imply different meanings

Why Text is (Deceptively) Easy
Text is easier when you have a lot of it
Web search is now usually conjunctive
Text has a lot of redundancy
A very simple algorithm can:
Pull out “important” phrases
Find “meaningfully” related words
Create a “summary” from document
Group “related” documents

Simple Text Analysis can Mislead
Most frequent words
Biases towards concepts with unique identifiers.
From Spink, Wolfram, Jansen, Saracevic, JASIS ’01

Major Trends vs. Minor Discoveries
With text, it’s easy to extract and show the largest, main trends
But often we want the rare but unexpected and important event:
Russian oil company example
Schwarzenegger and Enron
Cigarettes and kids
Person on the periphery who is working stealthily to influence things
This is really difficult to solve!

Design Principles for Analysts
Experienced analysts notice what is missing or unexpected.
Analysts must guard against confirmation bias.
Need to be able to easily arrange and re-arrange,
View information flexibly from many angles,
While at the same time retaining consistency and reproducibility.
Interfaces should reflect the domain and data.

How to achieve this with text collections?
Must transform text in understandable ways
Must provide multiple, consistent views that nevertheless allow for new discovery and insight

Why Emphasize Flexibility?
Can’t view representations of all the text content at once.
Instead, need ways to flexibly navigate, group, organize, explore
See important pieces over time.

The Importance of Flexibility
Russell, Slaney, Qu, Houston ’05
The ease of viewing and manipulation in the system strongly influenced the kind of analysis operations done.

Examples of Flexibility on Text Data
PaperLens (Conference proceedings)
TAKMI (Customer service requests)
Faceted Browsing (e-commerce): Flamenco, eBay Express, FaThumb
TRIST and Sandbox (Analysts)

Flexible views
Infoviz 2004 contest
Visualize 8 years of conference proceedings
Tasks:
1. Static Overview of 10 years of Infovis
2. Characterize the research areas and their evolution
3. The people in InfoVis
4. Which papers/authors are most often referenced?
5. How many papers conducted a user study?
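In its simplest form, the contest question "Which papers/authors are most often referenced?" is a counting problem. The sketch below is only an illustration under stated assumptions: the citation pairs are made up, and the function name `most_referenced` is hypothetical rather than part of the contest materials or of any contest entry.

```python
from collections import Counter


def most_referenced(citations, top_n=3):
    """Return the top_n most-cited paper ids with their citation counts.

    `citations` is an iterable of (citing_paper, cited_paper) pairs.
    """
    counts = Counter(cited for _citing, cited in citations)
    return counts.most_common(top_n)


# Made-up citation pairs, for illustration only.
citations = [
    ("p4", "p1"), ("p5", "p1"), ("p6", "p1"),
    ("p5", "p2"), ("p6", "p2"),
    ("p6", "p3"),
]

print(most_referenced(citations, top_n=2))
```

A count like this surfaces only the largest trends. The rare but important items the talk asks about would need the flexible, linked views discussed in the surrounding slides.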
PaperLens
Integrated solution by Lee, Czerwinski, Robertson, Bederson
Uses graphical elements and brushing and linking to flexibly elucidate a collection’s contents.
http://www.cs.umd.edu/hcil/InfovisRepository/contest-2004/index.shtml

Flexibility in Foraging and Analysis
TAKMI, by Nasukawa and Nagano, ’01
The system integrates:
Analysis tasks (customer service help)
Content analysis
Information Visualization

Flexibility in Analysis
TAKMI, by Nasukawa and Nagano, 2001
Documents containing “windows 98”

Flexibility in Analysis
TAKMI, by Nasukawa and Nagano, 2001
Patent documents containing “inkjet”, organized by entity and year

Flexibility in Category Navigation
Browsing Information Collections using (Hierarchical) Faceted Metadata

What are facets?
Sets of categories, each of which describes a different aspect of the objects in the collection.
Each of these can be hierarchical.
(Not necessarily mutually exclusive nor exhaustive, but often that is a goal.)
GeoRegion + Time/Date + Topic

Facet example: Recipes
Cooking Method: Stir-fry
Ingredient: Chicken, Red Bell Pepper, Curry
Course: Main Course
Cuisine: Thai

Nobel Prize Winners Collection

New Site: eBay Express

Is This Visualization?
Prior experience and other people’s attempts seem to suggest that fewer graphics and more text is better.
Details of layout, font and color contrast, label selection, and interaction make all the difference.

Earlier Variation on the Idea
Cat-a-Cone, 1997

Mobile Variation
FaThumb: Karlson, Robertson, Robbins, Czerwinski, Smith ’06
Well-received, but visualization part not looked at.

Flexibility in Sensemaking
DLITE by Cousins et al. ’97
Sandbox by Wright et al. ’06
TRIST (The Rapid Information Scanning Tool) is the work space for Information Retrieval and Information Triage.
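Returning to the facet definition above (sets of categories, each describing a different aspect of the objects in the collection), here is a minimal sketch of faceted selection. It is not the implementation of Flamenco, eBay Express, or FaThumb. The recipe data, the assignment of values to facets, and the names `select` and `facet_counts` are all assumptions made for illustration, and the facets here are flat rather than hierarchical.

```python
from collections import Counter

# Hypothetical mini-collection: each item maps a facet name to a set of values.
RECIPES = {
    "thai chicken stir-fry": {
        "cuisine": {"Thai"},
        "course": {"Main Course"},
        "cooking method": {"Stir-fry"},
        "ingredient": {"Chicken", "Red Bell Pepper"},
    },
    "thai green curry": {
        "cuisine": {"Thai"},
        "course": {"Main Course"},
        "cooking method": {"Simmer"},
        "ingredient": {"Chicken", "Curry"},
    },
    "grilled pepper salad": {
        "cuisine": {"Italian"},
        "course": {"Appetizer"},
        "cooking method": {"Grill"},
        "ingredient": {"Red Bell Pepper"},
    },
}


def select(items, constraints):
    """Keep the items that satisfy every facet=value constraint."""
    return {
        name: facets
        for name, facets in items.items()
        if all(value in facets.get(facet, set())
               for facet, value in constraints.items())
    }


def facet_counts(items, facet):
    """Count how many of the given items carry each value of one facet."""
    return Counter(v for facets in items.values() for v in facets.get(facet, set()))


thai = select(RECIPES, {"cuisine": "Thai"})
print(sorted(thai))
print(facet_counts(thai, "ingredient"))
```

Because each constraint is independent, a user can add or drop facets in any order and reach the same result set. That is one way to get flexible navigation while keeping the structure consistent and reproducible.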
Flexibility in Sensemaking
TRIST, Jonkers et al. ’05
User Defined and Automatic Categorization
Launch Queries
Comparative Analysis of Answers and Content
Rapid Scanning with Context
Entities
Query History
Dimensions
Annotated Document Browser
Linked Multi-Dimensional Views
Speed Scanning

Flexibility for Sensemaking Support
Sandbox, Wright et al. ’06
Quick Emphasis of Items of Importance.
Dynamic Analytical Models.
Direct interaction with Gestures (no dialog, no controls).
Assertions with Proving/Disproving Gates.

Communication-Centric Text
Email, conversations, blogs
The first thought is usually nodes and links
Doesn’t have the desired flexibility
Some alternatives:

The Network

Multivariate Networks

Re-envisioning Networks
Viewing people’s shared workplaces, hometowns, schools over time.
www.theyrule.net:

Re-envisioning Networks
First cut: Hastings, Snow, and King ’05

Re-envisioning Networks
Better version: Hastings, Snow, and King ’05

Re-envisioning Networks
Wattenberg ’06
OLAP on directed labeled graphs

Network Flexibility
Martin Wattenberg, “Visual Exploration of Multivariate Graphs”
[Figure labels: M, Location A, Location B, Location C, Location D, Location E, F]

Re-envisioning Networks
Idea: vary these ideas to apply to email and other communication text.

Summary: Text Viz Design Guidelines
An emphasis on flexible views on text data
Emphasize brushing and linking using appropriate visual cues.
Interaction flow should guide the user but also be flexible.
Information structure should be consistent and reproducible.
Other guidelines:
Make text visible.
Visual components should reflect the data and tasks.

Thank you!
www.sims.berkeley.edu/~hearst