From CTRnet to IDEAL (and Qatar, VT, SiteStory, UPS, …) NSF IA WIRE Workshop Harvard -- June 16, 2014 Edward A.


From CTRnet to IDEAL
(and Qatar, VT, SiteStory, UPS, …)
NSF IA WIRE Workshop
Harvard -- June 16, 2014
Edward A. Fox, [email protected], http://fox.cs.vt.edu
Professor, Dept. of Computer Science, Virginia Tech
Director, Digital Library Research Laboratory
Director, Networked Digital Library of Theses and Dissertations
http://www.eventsarchive.org
Acknowledgments - 1
• Related Funding:
– 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research
Related to 4/16/2007 at Virginia Tech
– 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
– 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
– 2012-2014: Villanova University (NSF DUE-1141209): Computing in Context
– 2012-2015: Qatar NPRP 4-029-1-007, Establishing a Qatari Arabic-English
Library Institute
– 2014: Mellon/Columbia, Archiving Transactions Towards Uninterruptible Web
Service (UPS – building on Memento and SiteStory)
• The Internet Archive (Kristine Hanna, co-PI):
– Heritrix crawler and other tools and support
– Hosting the crawls and resulting archives
• LucidWorks (software and support – open jobs, internships)
Acknowledgments - 2
• IDEAL: VT: PI: Fox, co-PIs: Andrea Kavanaugh, Steve Sheetz, Don
Shoemaker; GRAs: Mohamed Magdy, Sunshin Lee; Egypt: Riham Mansour
• CTRnet: also Naren Ramakrishnan (co-PI); GRAs Seungwon Yang and
Venkat Srinivasan
• DL-VT416: also Christopher North and Weiguo Fan
• Computing in Context: Villanova PI Robert Beck; Students: Xuan Zhang, Tarek
Kanan: class to learn Computational Linguistics by 5-way better summarizing Web
archive collections (extract words/sentences, find topics, use event templates)
• Qatar: Lead PI Fox, Co-PIs Mohammed Samaka (Qatar U.), Somaya Almaadeed (QU), Krishna RoyChowdhury (Qatar National Library), C. Lee
Giles (Penn State), Rick Furuta (Texas A&M); consultant John Impagliazzo
(Hofstra), VT GRA Tarek Kanan
• Mellon: PI Zhiwu Xie, co-PI Fox, GRA Prashant Chandrasekar
• Other students: Kiran Chitturi, Rachel Coston, Ishita Ganotra,
S.M.Shamimul Hasan, Christopher Jones, Rohan Kaul, Jun Kim, Lin Tzi Li,
Ying Ni, Braeden Sebastian, and teams in CS4624, 5604, 6604
• Collaborators in: Egypt, Tunisia, Mexico, Philippines
• WE WELCOME OTHER COLLABORATORS!
Memento – Time Travel for the
Web: Across-Archive Method for
Linking the Current & Past Web
http://bit.ly/memento-for-chrome
http://www.mementoweb.org/
RFC 7089 (Martin Klein)
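A minimal sketch of the datetime negotiation that RFC 7089 defines: the client asks a TimeGate for a past version of a URL by sending an Accept-Datetime header in RFC 1123 format. The TimeGate prefix and the function name are assumptions for illustration (any Memento-compliant TimeGate could be substituted), and the request is only built here, not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumed TimeGate prefix, for illustration only; substitute the TimeGate
# of whichever Memento-compliant archive or aggregator is in use.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(original_url, when):
    """Build (but do not send) a request for the version of original_url near `when`."""
    # RFC 7089 expects an HTTP date in GMT, e.g. "Mon, 16 Jun 2014 00:00:00 GMT".
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        TIMEGATE_PREFIX + original_url,
        headers={"Accept-Datetime": accept_datetime},
    )


request = build_timegate_request(
    "http://www.eventsarchive.org/",
    datetime(2014, 6, 16, tzinfo=timezone.utc),
)
# urllib stores header names capitalized, hence "Accept-datetime".
print(request.full_url)
print(request.get_header("Accept-datetime"))
```

A TimeGate would normally answer with a redirect to the closest Memento; following that redirect is left out of the sketch.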
Related Projects
• Mellon/Columbia: enhance SiteStory by devising a
webserver that also archives; use the archive
automatically when server is down; capture the VT Web
and bring up UPS on multiple campus sites
• Qatar: at Qatar U., Qatar National Library
– Build a digital library community (consulting center)
• 4 DL books with M&C +
http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
– Build digital library infrastructure:
• SiteSeer (CiteSeerX, ChemXseer, TableSeer, …) with
Arabic and CLIR support
• Heritrix, Wayback Machine, Solr, …
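The Mellon/Columbia item above (a webserver that also archives and falls back to the archive when the server is down) can be illustrated with a toy sketch. This is not the SiteStory or UPS implementation; the class name, the in-memory archive, and the `fetch_live` callable are all hypothetical.

```python
class ArchivingFrontEnd:
    """Toy front end: archive every live response, serve the archive on outage."""

    def __init__(self, fetch_live):
        # fetch_live(path) -> str is a hypothetical callable that returns the
        # live page body, or raises ConnectionError when the server is down.
        self._fetch_live = fetch_live
        self._archive = {}  # path -> last successfully served body

    def get(self, path):
        """Return (body, source), where source is "live" or "archive"."""
        try:
            body = self._fetch_live(path)
        except ConnectionError:
            if path in self._archive:
                return self._archive[path], "archive"
            raise  # never captured, so there is nothing to fall back to
        self._archive[path] = body  # capture as a side effect of serving
        return body, "live"
```

The design point is that capture happens on the serving path, so the archive is populated by ordinary traffic rather than by a separate crawl.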
Web Archives
• 13 TB of IA Collections, e.g., 2013: Boko
Haram attack, Boston Marathon blast, Global
Emergency Overview, Texas fertilizer plant
explosion
Category                                                    No. of Archives
Accidents (plane crash, building collapse, ferry sinking)   11
Bombings                                                    4
Earthquakes (Japan)                                         12
Fires                                                       2
Floods                                                      4
Hurricanes (Sandy), Tsunami, Cyclones, Typhoons             8
Shootings                                                   17
Tweet Collections
• > 120 Event-specific and general collections
• Total of 600 million tweets, from streaming
API, using hashtags and keywords
Category                        No. of collections
Accident (transportation)       33
Bombing                         8
Community                       10
Earthquake                      18
Fire                            6
Flood                           11
General (including health)      67
Hurricane, Tsunami              39
Political (Middle East, Iran)   40
Shooting                        29
CTRnet
Collect, analyze, and visualize disaster information with a DL
Collect: Web sites, images; Tweets; Facebook content; Focus group interviews/surveys
Analyze: Image similarity; Content; Content, user profiles; Usage of social media (SM); Usage of SM
Visualize: Organize images by similarity; Patterns, frequencies; SM use; SM use/needs
Technology: Crawler; CBIR algorithm; CBIR visualization interface; Online tools, scripts, APIs; NLP toolkit, SQL; Facebook app; Spreadsheets; Brainstorming tool; Graphics
Social Media Use in Political Crisis (1/2)
(2/7 - 2/14, 2011)
[Figure: No. of tweets over the period]
Total 514,782 tweets
Social Media Use in Political Crisis (2/2)
• Opinion Leadership in Egypt Uprising 2011
– 514,782 tweets (one week around Mubarak’s
resignation)
– Total 79,000 unique users
• Presumably posting from Egypt: 4,710
• Individuals excluding organizations: 3,675
– Opinion leaders
• 500-27,000 followers in top 10% (365) individuals
• Bios: blogger/activist, writer/reporter, lawyer/executive director, social media consultant, … (‘elite’ type actors)
Visualizing Emergency Phases in Tweets
(ISCRAM 2013) (1/2)
[Figure: Four phases of emergency management model. Emergency Management cycle around a Disaster: Mitigation, Preparedness, Response, Recovery]
Visualizing Emergency Phases in
Tweets (2/2)
[Figure panels: WHAT, WHO, WHEN, WHERE]
Topic Tagging of Webpages: Xpantrac - 1
Seungwon Yang dissertation
➔ Input: text file
➔ Build query
◆ Every 5 words, 1 word overlap
➔ Send query to search API
◆ Web search (Seungwon)
◆ Wikipedia, our collection(s): CS4624 Spring 2014: Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman
➔ Find topics in retrieved documents
◆ Frequency of words
➔ Select most frequent as “topics”
➔ Output: topics
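The steps above can be sketched as follows. This is an illustration of the listed steps, not the Xpantrac code itself: the `search` callable stands in for whatever search API is used, and the stopword list, tokenizer, and `top_k` parameter are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def build_query_units(text, size=5, overlap=1):
    """Split text into 5-word query units; consecutive units share 1 word."""
    words = text.split()
    step = size - overlap
    units = []
    for start in range(0, len(words), step):
        units.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this unit already reached the end of the text
    return units


def extract_topics(text, search, top_k=5):
    """Expand the text through `search`, then pick the most frequent words.

    `search(query) -> list[str]` is a hypothetical callable returning the
    text of retrieved documents (Web search, Wikipedia, a local collection).
    """
    counts = Counter()
    for query in build_query_units(text):
        for document in search(query):
            tokens = re.findall(r"[a-z]+", document.lower())
            counts.update(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]
```

For example, `build_query_units("a b c d e f g h i")` yields the two units `"a b c d e"` and `"e f g h i"`, which share the word `e`.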
Topic Tagging of Webpages: Xpantrac - 2
Seungwon Yang (GMU postdoc now)
[Figure: Xpantrac pipeline]
(a) EXPANSION: Input Text ➔ Query units ➔ Search Engine API (search, retrieve) ➔ Corpus of HTML documents from the Web
(b) EXTRACTION: Term-doc matrix (Term 1 … Term n by Doc 1 … Doc m, with a Sum column) ➔ Topics
Water Main Break Visualization
Sunshin Lee

Table 9 shows that at most 1.17 percent of tweet messages have GPS location
information (longitude, latitude); this is a very small percentage, justifying the need to
use another method for location extraction along with the GPS data. All datasets were
collected between 10/24/2011 and 1/30/2012.

Table 9. A sample dataset.
Dataset keyword     Total # of tweets   # of tweets which have GPS information (percentage)
water main break    13,382              156 (1.17 %)
water pipe leak     967                 1 (0.10 %)

Identification: In order to automatically extract location information from tweets, a
Named Entity Recognizer (NER) is applied. It can identify people, organizations, and
locations from a text. The Stanford NER [8] is a widely used implementation.

The Twitter data needs to be cleaned before the location or other detailed information can
be extracted from it. Removing special characters, such as ‘#’ and ‘@’, and removing
URLs, helps the SNER to find location information more accurately.

SNER extracts location information from each tweet message, and then it returns a set of
geonames as location information. The selected dataset has 3333 tweets. Table 10 shows
that we get 1,473 relevant tweets by extracting location information from text. In
contrast, we only get 36 tweets using GPS data.

Table 10. Comparison of # of tweets: GPS data vs. location data extracted from text.
Location information type                   # of tweets
GPS data (longitude, latitude)              36 (1.08 %)
Location information extracted from text    1,473 (44.19 %)

To visualize tweets on a map, locations (longitude, latitude) are required. The Google
Fusion Table, which enables gathering, visualizing, and sharing data online, provides a
geocoding function to visualize tweets according to Google Maps locations.

Figure 6 shows an example of the visualized tweets on a map of the New York area,
USA. On the Google Maps, each dot represents a tweet event. When a dot is clicked, a
pop-up displays a tweet message, location, and created time.

Visualization: The categorization of tweets by location may facilitate the search for
relevant information. For example, a Blacksburg utility worker usually monitors water
main break events that occur within the town of Blacksburg.

Figure 6. An example of the visualized tweets on Google Maps

➔ Tweets collected with keywords
➔ Selected tweets with location information (lat/long, geonames)
➔ Event locations displayed with details
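The tweet-cleaning step described in this section (removing ‘#’, ‘@’, and URLs before running the named entity recognizer) might look roughly like the sketch below. The regular expression and the function name are assumptions; the NER call itself is not shown.

```python
import re

# Matches http:// and https:// links up to the next whitespace (an assumed,
# deliberately simple pattern).
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet(text):
    """Drop URLs and the '#' / '@' markers, keeping the words they prefix."""
    text = URL_PATTERN.sub(" ", text)
    text = text.replace("#", " ").replace("@", " ")
    return " ".join(text.split())  # collapse the leftover whitespace


print(clean_tweet("Water main break on #Main St @VTnews http://t.co/abc"))
```

Keeping the word after ‘#’ or ‘@’ matters here, because hashtags and handles often carry the place name that the recognizer is meant to find.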
Integrated Digital Event Archive and
Library (IDEAL) Project
http://www.eventsarchive.org/
• Extension of CTRnet with broadened scope:
– Event detection
– Event data archiving & processing
• Multimedia (images, videos) shared in social media
• Digital government research
– Community issue detection
– Public opinion mining, mood perception, information
flow
• Technologies:
– Focused crawling, analysis/visualization services,
integration of archive and DL capabilities
Event Ontology
• Event model
– Who, What, When, Where, How
• Who
– Organizations/entities participating in the event
• What
– Topics of the Event
• Where
– Event location
• When
– Event time frame (and later times of interest, e.g.,
anniversaries)
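One possible way to hold the facets above in code is a small record type. The field names, the optional `how` field, and the `is_of_interest` helper are assumptions made for illustration; the slides do not specify a schema. The participant name in the example is a placeholder.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Event:
    who: List[str]    # organizations/entities participating in the event
    what: List[str]   # topics of the event
    where: str        # event location
    start: date       # event time frame: first day
    end: date         # event time frame: last day
    later_dates: List[date] = field(default_factory=list)  # e.g., anniversaries
    how: str = ""     # named in the event model, not detailed on the slide

    def is_of_interest(self, day):
        """True if `day` is inside the time frame or is a later date of interest."""
        return self.start <= day <= self.end or day in self.later_dates


marathon = Event(
    who=["Example Relief Organization"],  # placeholder participant
    what=["bombing", "marathon"],
    where="Boston",
    start=date(2013, 4, 15),
    end=date(2013, 4, 19),                # illustrative end of the time frame
    later_dates=[date(2014, 4, 15)],      # first anniversary
)
```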
IDEAL Proposal Architecture
[Architecture figure, grouped by component]
Producers: Curators, Contributors, Web Publishers, Social Media
Preservation Planning
Ingest: Tweet Manager, Focused Crawler
Data Management: LucidWorks, Internet Archive
Archival Storage: Internet Archive
Access: Search, Visualize, Analyze, Browse
Consumers: Researchers, Practitioners, Affected
Administration: Internet Archive
IDEAL System Architecture
Sunshin Lee (built low-cost cluster)
IDEAL Data Architecture
Sunshin Lee
Event Focused Crawler
Mohamed Magdy
Focus of research
Baseline vs. Event Focused Crawler
Mohamed Magdy
Harvest ratio: relevant crawled webpages vs. cumulative set of crawled webpages
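The harvest ratio defined above can be computed cumulatively over the crawl order. In this sketch the relevance judgments are assumed to be given as a list of booleans, one per crawled page; how relevance is decided is outside the sketch.

```python
def harvest_ratios(relevance_flags):
    """Cumulative harvest ratio after each crawled page.

    relevance_flags[i] is True if the (i+1)-th crawled page was judged
    relevant. The ratio after n pages is (# relevant so far) / n.
    """
    ratios = []
    relevant_so_far = 0
    for pages_so_far, is_relevant in enumerate(relevance_flags, start=1):
        relevant_so_far += bool(is_relevant)
        ratios.append(relevant_so_far / pages_so_far)
    return ratios


print(harvest_ratios([True, False, True, True]))
```

Plotting these values against the number of crawled pages gives one curve per crawler, which is how a baseline and a focused crawler can be compared.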
Extracted News Events on a Time Line
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
[Timeline figure: keyword clusters for extracted news events, placed at dates from 02/28 to 04/16]
Keyword clusters:
• ukraine, yanukovich, crisis, minister, sign, russian
• russia, bank, sanctions, ukraine, crisis, crimea
• ukraine, tensions, data, rise, shares, china, stocks
• ukraine, russia, talks, aid, crisis, sanctions, deal
• ukraine, crimea, crisis, putin, russia, minister
• crimea, ukraine, russia, minister, referendum, vote
• ukraine, house, imf, u.s, bill, white, aid
• gas, ukraine, russian, russia, europe, talks, energy
• ukraine, aid, support, government, talks, house, russian
• crimea, ukraine, russian, troops, border
Dates on the time line: 02/28, 03/01, 03/08, 03/09, 03/12, 03/14, 03/16, 03/20, 03/23, 03/26, 04/12, 04/16
History: 3/7 referendum annulled; 3/14: UN draft resolution
News-Tweet Architecture
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
[Architecture figure: two parallel pipelines, each Preprocessor ➔ Event Extraction Sys. (LDA, NER) ➔ Event 1, Event 2, Event 3, where each event is described by Who, Topic, When, Where; the two sets of events are linked by Correlation]
IDEAL Spreadsheet
CS4624 Spring 2014: Tony Ardura, Austin Burnett, Rex Lacy, Shawn Neumann
(based on ArcSpread by Andreas Paepcke et al.)
Recommended Collection-Level Metadata
CS6604 Spring 2014: Michael Shuffett
• Dublin Core
– Title, Description
• PROV-O
– Starting Point Classes
– Collection process, organization, hadMember, atLocation
• ISO 3166-2 for locations
• W3/XMLSchema#dateTime
• PLUS: TweetID tool for tweet collections
– Extracts tweet and collection level metadata
– Compares / combines tweet collections
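A collection-level record that combines the recommended vocabularies could be assembled as in the sketch below. The dictionary keys, the function name, and the validation pattern are assumptions chosen for illustration; they are not the output format of the TweetID tool.

```python
import re
from datetime import datetime, timezone

# ISO 3166-2 codes: two-letter country code, hyphen, up to three alphanumerics.
ISO_3166_2 = re.compile(r"^[A-Z]{2}-[A-Z0-9]{1,3}$")


def collection_metadata(title, description, region_code, started, member_ids):
    """Build a collection-level record (keys are assumed, prefixed by vocabulary)."""
    if not ISO_3166_2.match(region_code):
        raise ValueError("not an ISO 3166-2 code: " + region_code)
    return {
        "dc:title": title,                            # Dublin Core
        "dc:description": description,                # Dublin Core
        "prov:atLocation": region_code,               # PROV-O, ISO 3166-2 value
        "prov:startedAtTime": started.isoformat(),    # xsd:dateTime string
        "prov:hadMember": list(member_ids),           # PROV-O collection members
    }


record = collection_metadata(
    "Water main break tweets",
    "Tweets collected with the keywords 'water main break'",
    "US-VA",                                          # Virginia, USA
    datetime(2011, 10, 24, tzinfo=timezone.utc),
    ["tweet-1", "tweet-2"],                           # placeholder tweet IDs
)
```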
Thank you!
Questions/Comments?
[email protected], http://fox.cs.vt.edu
540-231-5113
Office: 2160G Torgersen Hall
Campus Mail: 114 McBryde Hall, M/C 0106, Dept. of CS,
Virginia Tech, Blacksburg, VA 24061