From CTRnet to IDEAL (and Qatar, VT, SiteStory, UPS, …)
NSF IA WIRE Workshop
Harvard -- June 16, 2014

Edward A. Fox, [email protected], http://fox.cs.vt.edu
Professor, Dept. of Computer Science, Virginia Tech
Director, Digital Library Research Laboratory
Director, Networked Digital Library of Theses and Dissertations
http://www.eventsarchive.org

Acknowledgments - 1
• Related Funding:
  – 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
  – 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
  – 2012-2014: Villanova University (NSF DUE-1141209): Computing in Context
  – 2012-2015: Qatar NPRP 4-029-1-007, Establishing a Qatari Arabic-English Library Institute
  – 2014: Mellon/Columbia, Archiving Transactions Towards Uninterruptible Web Service (UPS – building on Memento and SiteStory)
• The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives
• LucidWorks (software and support – open jobs, internships)

Acknowledgments - 2
• IDEAL: VT: PI: Fox, co-PIs: Andrea Kavanaugh, Steve Sheetz, Don Shoemaker; GRAs: Mohamed Magdy, Sunshin Lee; Egypt: Riham Mansour
• CTRnet: also Naren Ramakrishnan (co-PI); GRAs Seungwon Yang and Venkat Srinivasan
• DL-VT416: also Christopher North and Weiguo Fan
• Computing in Context: Villanova PI Robert Beck; Students: Xuan Zhang, Tarek Kanan: class to learn Computational Linguistics by 5-way better summarizing Web archive collections (extract words/sentences, find topics, use event templates)
• Qatar: Lead PI Fox, Co-PIs Mohammed Samaka (Qatar U.), Somaya Almaadeed (QU), Krishna RoyChowdhury (Qatar National Library), C. Lee Giles (Penn State), Rick Furuta (Texas A&M); consultant John Impagliazzo (Hofstra), VT GRA Tarek Kanan
• Mellon: PI Zhiwu Xie, co-PI Fox, GRA Prashant Chandrasekar
• Other students: Kiran Chitturi, Rachel Coston, Ishita Ganotra, S.M. Shamimul Hasan, Christopher Jones, Rohan Kaul, Jun Kim, Lin Tzi Li, Ying Ni, Braeden Sebastian, and teams in CS4624, 5604, 6604
• Collaborators in: Egypt, Tunisia, Mexico, Philippines
• WE WELCOME OTHER COLLABORATORS!

Memento – Time Travel for the Web: Across-Archive Method for Linking the Current & Past Web
http://bit.ly/memento-for-chrome
http://www.mementoweb.org/
RFC 7089
(Martin Klein)

Related Projects
• Mellon/Columbia: enhance SiteStory by devising a webserver that also archives; use the archive automatically when server is down; capture the VT Web and bring up UPS on multiple campus sites
• Qatar: at Qatar U., Qatar National Library
  – Build a digital library community (consulting center)
    • 4 DL books with M&C + http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
  – Build digital library infrastructure:
    • SiteSeer (CiteSeerX, ChemXseer, TableSeer, …) with Arabic and CLIR support
    • Heritrix, Wayback Machine, Solr, …

Web Archives
• 13 TB of IA Collections, e.g., 2013: Boko Haram attack, Boston Marathon blast, Global Emergency Overview, Texas fertilizer plant explosion

Category                                                    No. of Archives
Accidents (plane crash, building collapse, ferry sinking)   11
Bombings                                                    4
Earthquakes (Japan)                                         12
Fires                                                       2
Floods                                                      4
Hurricanes (Sandy), Tsunami, Cyclones, Typhoons             8
Shootings                                                   17

Tweet Collections
• > 120 Event-specific and general collections
• Total of 600 million tweets, from streaming API, using hashtags and keywords

Category                         No. of collections
Accident (transportation)        33
Bombing                          8
Community                        10
Earthquake                       18
Fire                             6
Flood                            11
General (including health)       67
Hurricane, Tsunami               39
Political (Middle East, Iran)    40
Shooting                         29

CTRnet
Collect, analyze, and visualize disaster information with a DL
• Collect: Web sites, images; Tweets; Facebook content; Focus group interviews/surveys
• Analyze: Image similarity; Content; Content, user profiles; Usage of social media (SM)
• Visualize: Organize images by similarity; Patterns, frequencies; SM use; Usage of SM; SM use/needs
• Technology: Crawler; CBIR algorithm; CBIR visualization interface; Online tools, scripts, APIs; NLP toolkit, SQL; Facebook app; Spreadsheets; Brainstorming tool; Graphics

Social Media Use in Political Crisis (1/2)
(2/7 - 2/14, 2011)
[Chart: No. of tweets over time; total 514,782 tweets]

Social Media Use in Political Crisis (2/2)
• Opinion Leadership in Egypt Uprising 2011
  – 514,782 tweets (one week around Mubarak’s resignation)
  – Total 79,000 unique users
    • Presumably posting from Egypt: 4,710
    • Individuals excluding organizations: 3,675
  – Opinion leaders
    • 500-27,000 followers in top 10% (365) individuals
    • Bios: blogger/activist, writer/reporter, lawyer/executive director, social media consultant, … ‘elite’ type actors

Visualizing Emergency Phases in Tweets (ISCRAM 2013) (1/2)
[Figure: Four phases of emergency management model: Preparedness, Response, Recovery, Mitigation; other labels: Disaster, Emergency Management]

Visualizing Emergency Phases in Tweets (2/2)
[Figure labels: WHAT, WHO, WHEN, WHERE]

Topic Tagging of Webpages: Xpantrac - 1
Seungwon Yang dissertation
➔ Input: text file
➔ Build query
  ◆ Every 5 words, 1 word overlap
➔ Send query to search API
➔ Web search (Seungwon)
➔ Wikipedia, our collection(s): CS4624 Spring 2014: Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman
➔ Find topics in retrieved documents
  ◆ Frequency of words
➔ Select most frequent as “topics”
➔ Output: topics

Topic Tagging of Webpages: Xpantrac - 2
Seungwon Yang (GMU postdoc now)
[Figure: Xpantrac pipeline. (a) Expansion: Input Text, Query units, Search Engine API, Corpus / Web, retrieved HTML documents. (b) Extraction: Term-doc matrix (Term 1 … Term n by Doc 1 … Doc m, with a Sum column), Topics.]
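
The Xpantrac steps listed on the first Xpantrac slide (build overlapping 5-word queries, send each to a search API, count word frequencies in the retrieved documents, keep the most frequent words as topics) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Seungwon Yang's implementation: the function names, the `search` callable standing in for whichever search API is used, the lowercase-letters tokenizer, and the optional stopword set are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List


def build_queries(text: str, size: int = 5, overlap: int = 1) -> List[str]:
    """Split text into query units of `size` words; neighbours share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    queries = []
    for start in range(0, len(words), step):
        queries.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last unit already reaches the end of the text
    return queries


def extract_topics(
    text: str,
    search: Callable[[str], Iterable[str]],  # hypothetical: query -> document texts
    top_k: int = 10,
    stopwords: frozenset = frozenset(),
) -> List[str]:
    """Expansion: retrieve documents per query. Extraction: rank terms by frequency."""
    counts: Counter = Counter()
    for query in build_queries(text):
        for doc in search(query):
            tokens = re.findall(r"[a-z]+", doc.lower())
            counts.update(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    # Stand-in for a real search API: always returns the same two documents.
    def fake_search(query: str) -> List[str]:
        return ["flood water flood", "water rescue flood"]

    print(build_queries("a b c d e f g h i"))
    print(extract_topics("one two three", fake_search, top_k=2))
```

Passing the search function as a parameter keeps the sketch independent of any particular backend, so the same code could be pointed at a Web search service, Wikipedia, or a local collection index.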
Water Main Break Visualization
Sunshin Lee

Table 9 shows that at most 1.17 percent of tweet messages have GPS location information (longitude, latitude); this is a very small percentage, justifying the need to use another method for location extraction along with the GPS data. All datasets were collected between 10/24/2011 and 1/30/2012.

Table 9. A sample dataset.
Dataset keyword     Total # of tweets    # of tweets which have GPS information (percentage)
water main break    13,382               156 (1.17 %)
water pipe leak     967                  1 (0.10 %)

Identification: In order to automatically extract location information from tweets, a Named Entity Recognizer (NER) is applied. It can identify people, organizations, and locations from a text. The Stanford NER [8] is a widely used implementation. The Twitter data needs to be cleaned before the location or other detailed information can be extracted from it. Removing special characters, such as # and @, and removing URLs, helps the SNER to find location information more accurately. SNER extracts location information from each tweet message, and then it returns a set of geonames as location information. The selected dataset has 3,333 tweets. Table 10 shows that we get 1,473 relevant tweets by extracting location information from text. In contrast, we only get 36 tweets using GPS data.

Table 10. Comparison of # of tweets: GPS data vs. location data extracted from text.
Location information type                   # of tweets
GPS data (longitude, latitude)              36 (1.08 %)
Location information extracted from text    1,473 (44.19 %)

Visualization: The categorization of tweets by location may facilitate the search for relevant information. For example, a Blacksburg utility worker usually monitors water main break events that occur within the town of Blacksburg.

To visualize tweets on a map, locations (longitude, latitude) are required. The Google Fusion Table, which enables gathering, visualizing, and sharing data online, provides a geocoding function to visualize tweets according to Google Maps locations. Figure 6 shows an example of the visualized tweets on a map of the New York area, USA. On the Google Maps, each dot represents a tweet event. When a dot is clicked, a pop-up displays a tweet message, location, and created time.

Figure 6.
An example of the visualized tweets on Google Maps
[Figure stages: Tweets collected with keywords; Selected tweets with location information (lat/long, geonames); Event locations displayed with details]

Integrated Digital Event Archive and Library (IDEAL) Project
http://www.eventsarchive.org/
• Extension of CTRnet with broadened scope:
  – Event detection
  – Event data archiving & processing
    • Multimedia (images, videos) shared in social media
• Digital government research
  – Community issue detection
  – Public opinion mining, mood perception, information flow
• Technologies:
  – Focused crawling, analysis/visualization services, integration of archive and DL capabilities

Event Ontology
• Event model
  – Who, What, When, Where, How
  – Organizations/entities participating in the event
• What
  – Topics of the Event
• Where
  – Event location
• When
  – Event time frame (and later times of interest, e.g., anniversaries)

IDEAL Proposal Architecture
[Figure: architecture diagram. Producers: Curators, Contributors, Web Publishers, Social Media. Preservation Planning. Ingest: Tweet Manager, Focused Crawler. Data Management: LucidWorks. Archival Storage. Access: Search, Visualize, Analyze, Browse. Administration. Internet Archive. Consumers: Researchers, Practitioners, Affected.]

IDEAL System Architecture
Sunshin Lee (built low-cost cluster)

IDEAL Data Architecture
Sunshin Lee

Event Focused Crawler
Mohamed Magdy
Focus of research

Baseline vs. Event Focused Crawler
Mohamed Magdy
Harvest ratio: relevant crawled webpages vs. cumulative set of crawled webpages

Extracted News Events on a Time Line
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
Dates on the time line: 02/28, 03/01, 03/08, 03/09, 03/12, 03/14, 03/16, 03/20, 03/23, 03/26, 04/12, 04/16
Keyword sets of the extracted events:
• ukraine, yanukovich, crisis, minister, sign, russian
• russia, bank, sanctions, ukraine, crisis, crimea
• ukraine, tensions, data, rise, shares, china, stocks
• ukraine, russia, talks, aid, crisis, sanctions, deal
• ukraine, crimea, crisis, putin, russia, minister
• crimea, ukraine, russia, minister, referendum, vote
• ukraine, house, imf, u.s, bill, white, aid
• gas, ukraine, russian, russia, europe, talks, energy
• ukraine, aid, support, government, talks, house, russian
• crimea, ukraine, russian, troops, border
History: 3/7 referendum annulled; 3/14: UN draft resolution

News-Tweet Architecture
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
[Figure: two pipelines, each with a Preprocessor and an Event Extraction Sys. (LDA, NER) producing Event 1, Event 2, Event 3, each described by Who, Topic, When, Where; a Correlation step between the two]

IDEAL Spreadsheet
CS4624 Spring 2014: Tony Ardura, Austin Burnett, Rex Lacy, Shawn Neumann
(based on ArcSpread by Andreas Paepcke et al.)

Recommended Collection-Level Metadata
CS6604 Spring 2014: Michael Shuffett
• Dublin Core
  – Title, Description
• PROV-O
  – Starting Point Classes
  – Collection process, organization, hadMember, atLocation
• ISO 3166-2 for locations
• W3/XMLSchema#dateTime
• PLUS: TweetID tool for tweet collections
  – Extracts tweet and collection level metadata
  – Compares / combines tweet collections

Thank you! Questions/Comments?
[email protected], http://fox.cs.vt.edu
540-231-5113
Office: 2160G Torgersen Hall
Campus Mail: 114 McBryde Hall, M/C 0106, Dept. of CS, Virginia Tech, Blacksburg, VA 24061