Enabling webscale research in Europe

Enabling Webscale Research
in Europe
Julien Masanès
European Archive Foundation
Consultation Workshop, Brussels, 19/1/2010
1
‘Webtables’
Using over 14 billion pages from the web to extract tables.
[1] M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang, “WebTables: exploring the
power of tables on the web,” Proc. VLDB Endow., vol. 1, 2008, pp. 538-549.
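To make the idea of pulling tables out of web pages concrete, here is a minimal sketch using only the Python standard library. It is not the WebTables method (which, per the cited paper, also has to decide which tables hold relational data); the names `TableExtractor` and `extract_tables` are hypothetical, and nested tables are deliberately not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each <table> in a page as a list of rows of cell strings.

    Illustrative only: nested tables and rowspan/colspan are ignored.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table being read, or None
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = self._row = self._cell = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


page = (
    "<html><body><table>"
    "<tr><th>City</th><th>Country</th></tr>"
    "<tr><td> Brussels </td><td>Belgium</td></tr>"
    "</table></body></html>"
)
tables = extract_tables(page)
```

At web scale the same per-page function would be run in parallel over a crawl, with a separate filtering step to discard tables used purely for layout.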
2
Data-driven science
• Here is the evidence, now what is the hypothesis?
• «A now-common strategy in post-genomic biology is to
measure, quantitatively, the action of all (or as many as
possible) of the genes at the level of the transcriptome,
proteome, metabolome and phenotype, and to use
computerised methods to infer gene function via various
kinds of pattern recognition techniques»
D.B. Kell and S.G. Oliver, “Here is the evidence, now what is the hypothesis? The
complementary roles of inductive and hypothesis-driven science in the post-genomic era,”
BioEssays, vol. 26, 2004, pp. 99-105.
3
Webscale data
• The web represents a unique source of access to media
content of all sorts, that a growing number of scientific
communities, agencies and industries are starting to need to
mine at large scale.
• The ability to acquire, process and mine large scale data from
the web is becoming a strategic advantage in many domains
from business intelligence to epidemiological tracking and
monitoring.
4
Research engine
• Key infrastructure to monitor and analyze the evolution of
networked media
• More broadly, it will become a key tool for research in more and
more domains:
  • low-noise signals of ecological evolution,
  • economic trends,
  • emergence of new research fields,
  • tracking of reputation on the web,
  • etc.
5
An example:
Ecological monitoring
Victor Galaz, Beatrice Crona, Tim Daw, Örjan Bodin, Magnus Nyström, and Per
Olsson, “Can web crawlers revolutionize ecological monitoring?,” Mar. 2009.
• «mining the internet to detect “early-warning” signs that may
signal abrupt ecological changes»
6
Who can do research on
Webscale data?
• Webscale is already proving to be a challenge for many
research groups, as the infrastructure, cost and skills
required represent a significant barrier to entry.
• But when it comes to doing this through time, only a few
(mainly large search engines) can do it at all.
• In other words, only large search engines (none of them
European) are able to do research at this scale, which
reinforces their lead as they develop and test new
algorithms for search, ranking, mining, etc.
7
Research challenges (1)
• Building an open, neutral and sustainable virtual observatory of the web
for research in Europe requires:
  • large-scale crawling, storage and indexing of web data (10+ petabytes),
  not limited to text.
• We know how to do this at terabyte scale, but not yet at petabyte scale.
8
Research challenges (2)
• Create baseline distributed analytics services (large-scale IE, NLP,
distributed and efficient processing and storage).
• We need to standardize and define a baseline in this domain to create a
platform for MMSE, social media research, etc.
• Hadoop-style abstractions over internet-wide repository/processing
clouds
• Optimized data placement (partitioning and replication) for analytics
• Distributed indices
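As an illustration of what a Hadoop-style abstraction with explicit partitioning and replication could look like, here is a single-process sketch. It only mimics the map, shuffle and reduce phases in memory. The function names, the MD5-based partitioner and the round-robin replica policy are assumptions made for this example, not the design of Hadoop or of any existing platform.

```python
from collections import defaultdict
import hashlib


def partition_of(key, num_partitions):
    """Stable hash partitioning: the same key always lands in the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replica_nodes(partition, nodes, replication=2):
    """Assign a partition to `replication` consecutive nodes.

    Hypothetical round-robin policy; replicas are distinct only if
    replication <= len(nodes).
    """
    return [nodes[(partition + i) % len(nodes)] for i in range(replication)]


def map_reduce(records, mapper, reducer, num_partitions=4):
    """Run mapper over records, group values by key per partition, then reduce."""
    # Map + shuffle: each emitted (key, value) goes to the key's partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in mapper(record):
            partitions[partition_of(key, num_partitions)][key].append(value)
    # Reduce: in a real cluster each partition would be processed on one of
    # its replica nodes; here they are simply processed one after another.
    result = {}
    for part in partitions:
        for key, values in part.items():
            result[key] = reducer(key, values)
    return result


def word_mapper(text):
    """Emit (word, 1) for every whitespace-separated word in a page's text."""
    return [(word, 1) for word in text.split()]


def sum_reducer(key, values):
    """Total the counts emitted for one word."""
    return sum(values)


counts = map_reduce(["a b a", "b c"], word_mapper, sum_reducer)
placement = replica_nodes(3, ["n0", "n1", "n2"], replication=2)
```

The point of the abstraction is that `word_mapper` and `sum_reducer` stay the same whether the partitions live in one process or on nodes spread across several data centres; only the placement policy changes.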
9
Research challenges (3)
• Temporal indexing of significant characteristics of networked content
(from distribution to semantics)
• Large spectrum of research in IE/IR, network topology, etc.
• Last but not least: make this infrastructure acceptable to society
(respect for privacy, transparency, IP rights)
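The temporal indexing mentioned in the first bullet above can be sketched as an inverted index whose postings carry a capture time, so that a term can be queried over a time window. This is a toy, in-memory illustration under assumed names (`TemporalIndex`, `add_capture`, `search`) and with made-up example URLs; a web-scale version would need the distributed storage discussed earlier.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class TemporalIndex:
    """Maps each term to a time-ordered list of (timestamp, url) captures."""

    def __init__(self):
        self._postings = defaultdict(list)  # term -> sorted [(timestamp, url)]

    def add_capture(self, url, timestamp, text):
        """Index one archived capture of a page under each distinct term."""
        for term in set(text.lower().split()):
            bisect.insort(self._postings[term], (timestamp, url))

    def search(self, term, start, end):
        """Return captures containing `term` with start <= timestamp < end."""
        postings = self._postings.get(term.lower(), [])
        # A 1-tuple sorts before any 2-tuple with the same timestamp,
        # which gives a half-open [start, end) window.
        lo = bisect.bisect_left(postings, (start,))
        hi = bisect.bisect_left(postings, (end,))
        return postings[lo:hi]


index = TemporalIndex()
index.add_capture("http://example.org/a", datetime(2008, 1, 1), "coral reef bleaching")
index.add_capture("http://example.org/b", datetime(2009, 6, 1), "Coral survey")
index.add_capture("http://example.org/c", datetime(2010, 1, 19), "coral")

hits_2009 = index.search("coral", datetime(2009, 1, 1), datetime(2010, 1, 1))
```

Keeping postings sorted by time means a window query costs two binary searches per term, which is what makes trend questions ("when did this term start to appear?") cheap to ask.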
10
Thanks
11
• M. Toyoda and M. Kitsuregawa, A system for visualizing and analyzing the
evolution of the web with a time series of graphs, Salzburg, Austria: ACM
Press New York, NY, USA, 2005.
12