Transcript Technical Computing Initiative - E-LIS
e-Science and its Implications for the Library Community
Tony Hey
Corporate Vice President, Technical Computing
Microsoft Corporation
Title slide graphic labels: Life Sciences; Earth Sciences; Computer and Information Sciences; Social Sciences; New Materials, Technologies and Processes; Multidisciplinary Research
Licklider’s Vision
“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job”
Larry Roberts – Principal Architect of the ARPANET
Physics and the Web
Tim Berners-Lee developed the Web at CERN as a tool for exchanging information between the partners in physics collaborations
The first Web site in the USA was a link to the SLAC library catalogue
It was the international particle physics community who first embraced the Web
‘Killer’ application for the Internet
Transformed modern world – academia, business and leisure
Beyond the Web?
Scientists developing collaboration technologies that go far beyond the capabilities of the Web
To use remote computing resources
To integrate, federate and analyse information from many disparate, distributed data resources
To access and control remote experimental equipment
Capability to access, move, manipulate and mine data is the central requirement of these new collaborative science applications
Data held in file or database repositories
Data generated by accelerators or telescopes
Data gathered from mobile sensor networks
What is e-Science?
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it’
John Taylor
Director General of Research Councils UK, Office of Science and Technology
The e-Science Vision
e-Science is about multidisciplinary science and the technologies to support such distributed, collaborative scientific research
Many areas of science are in danger of being overwhelmed by a ‘data deluge’ from new high throughput devices, sensor networks, satellite surveys …
Areas such as bioinformatics, genomics, drug design, engineering, healthcare … require collaboration between different domain experts
‘e-Science’ is a shorthand for a set of technologies to support collaborative networked science
e-Science – Vision and Reality
Vision
Oceanographic sensors - Project Neptune
Joint US-Canadian proposal
Reality
Chemistry – The Comb-e-Chem Project
Annotation, Remote Facilities and e-Publishing
http://www.neptune.washington.edu/
Undersea Sensor Network Connected & Controllable Over the Internet
Data Provenance
Visual Programming
Persistent Distributed Storage
Distributed Computation
Interoperability & Legacy Support via Web Services
Searching & Visualization
Reputation & Influence
Live Documents
Reproducible Research
Collaboration
Handwriting
Dynamic Documents
Interactive Data
The Comb-e-Chem Project
Video Data Stream
HPC Simulation
Data Mining and Analysis
Structures Database
Automatic Annotation
National X-Ray Service
Combinatorial Chemistry Wet Lab
Middleware
National Crystallographic Service
Send sample material to NCS service
Collaborate in e-Lab experiment and obtain structure
Search materials database and predict properties using Grid computations
Download full data on materials of interest
X-Ray e-Laboratory
Structures Database
Computation Service
A digital lab book replacement that chemists were able to use, and liked
Monitoring laboratory experiments using a broker delivered over GPRS on a PDA
Crystallographic e-Prints
Direct Access to Raw Data from scientific papers
Raw data sets can be very large - stored at UK National Datastore using SRB software
eBank Project
Digital Library
Virtual Learning Environment
Undergraduate Students
E-Scientists
Graduate Students
Reprints
Peer Reviewed Journal & Conference Papers
Preprints & Metadata
Technical Reports
E-Experimentation
Publisher Holdings
Institutional Archive
Local Web
Certified Experimental Results & Analyses
Data, Metadata & Ontologies
Entire E-Science Cycle
Encompassing experimentation, analysis, publication, research, learning
Grid
Support for e-Science
Cyberinfrastructure and e-Infrastructure
In the US, Europe and Asia there is a common vision for the ‘cyberinfrastructure’ required to support the e-Science revolution
Set of Middleware Services supported on top of high bandwidth academic research networks
Similar to vision of the Grid as a set of services that allows scientists – and industry – to routinely set up ‘Virtual Organizations’ for their research – or business
Many companies emphasize computing cycle aspect of Grids
The ‘Microsoft Grid’ vision is more about data management than about compute clusters
Six Key Elements for a Global Cyberinfrastructure for e-Science
1. High bandwidth Research Networks
2. Internationally agreed AAA Infrastructure
3. Development Centers for Open Standard Grid Middleware
4. Technologies and standards for Data Provenance, Curation and Preservation
5. Open access to Data and Publications via Interoperable Repositories
6. Discovery Services and Collaborative Tools
The Web Services ‘Magic Bullet’
Company A (J2EE)
Web Services
Open Source (OMII)
Company C (.Net)
Computational Modeling
Persistent Distributed Data
Workflow, Data Mining & Algorithms
Interpretation & Insight
Real-world Data
Technical Computing in Microsoft
Radical Computing
Research in potential breakthrough technologies
Advanced Computing for Science and Engineering
Application of new algorithms, tools and technologies to scientific and engineering problems
High Performance Computing
Application of high performance clusters and database technologies to industrial applications
New Science Paradigms
Thousand years ago: Experimental Science - description of natural phenomena
Last few hundred years: Theoretical Science - Newton’s Laws, Maxwell’s Equations …
Last few decades: Computational Science - simulation of complex phenomena
Today: e-Science or Data-centric Science - unify theory, experiment, and simulation - using data exploration and data mining
Data captured by instruments
Data generated by simulations
Processed by software
Scientist analyzes databases/files
(With thanks to Jim Gray)
Advanced Computing for Science and Engineering
TOOLS: Workflow, Collaboration, Visualization, Data Mining
DATA: Acquisition, Storage, Annotation, Provenance, Curation, Preservation
CONTENT: Scholarly Communication, Institutional Repositories
Top 500 Supercomputer Trends
Industry usage rising
Clusters over 50%
GigE is gaining
x86 is winning
Key Issues for e-Science
Workflows
The LEAD Project
The Data Chain
From Acquisition to Preservation
Scholarly Communication
Open Access to Data and Publications
The LEAD Project
Better predictions for Mesoscale weather
The LEAD Vision
DYNAMIC OBSERVATIONS
Analysis/Assimilation: Quality Control; Retrieval of Unobserved Quantities; Creation of Gridded Fields
Prediction/Detection: PCs to Teraflop Systems
Models and Algorithms Driving Sensors
The CS challenge: Build a virtual “eScience” laboratory to support experimentation and education leading to this vision.
Product Generation, Display, Dissemination
End Users: NWS; Private Companies; Students
Composing LEAD Services
Need to construct workflows that are:
Data Driven
The weather input stream defines the nature of the computation
Persistent and Agile
An agent mines a data stream and notices an “interesting” feature. This event may trigger a workflow scenario that has been waiting for months
Adaptive
The weather changes
Workflow may have to change on-the-fly
Resources
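The "persistent and agile" pattern described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not LEAD code: the `Observation` fields, the `DormantWorkflow` class, the wind-speed threshold and the forecast action are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Observation:
    """One reading from the input stream (hypothetical fields)."""
    station: str
    wind_speed_ms: float


@dataclass
class DormantWorkflow:
    """A workflow that stays registered until its trigger condition fires."""
    name: str
    trigger: Callable[[Observation], bool]
    action: Callable[[Observation], str]
    log: List[str] = field(default_factory=list)

    def offer(self, obs: Observation) -> bool:
        """Run the action if this observation counts as 'interesting'."""
        if self.trigger(obs):
            self.log.append(self.action(obs))
            return True
        return False


def mine_stream(stream: Iterable[Observation],
                workflows: List[DormantWorkflow]) -> int:
    """Agent loop: pass every observation to every waiting workflow."""
    fired = 0
    for obs in stream:
        for wf in workflows:
            fired += wf.offer(obs)
    return fired


# Hypothetical threshold and forecast step, chosen only for illustration.
storm_watch = DormantWorkflow(
    name="storm-forecast",
    trigger=lambda obs: obs.wind_speed_ms > 25.0,
    action=lambda obs: f"launch forecast run for {obs.station}",
)

readings = [
    Observation("A", 4.0),
    Observation("B", 31.5),
    Observation("C", 12.0),
]
fired_count = mine_stream(readings, [storm_watch])
```

The workflow holds no computing resources while it waits; only the small trigger function is evaluated against each observation, which is one plausible way to keep a scenario dormant for months.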
Example LEAD Workflow
The e-Science Data Chain
Data Acquisition
Data Ingest
Metadata
Annotation
Provenance
Data Storage
Curation
Preservation
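One way to picture the data chain is as an ordered pipeline in which each stage adds something to a record. The sketch below assumes that reading of the slide. The stage names come from the slide, while `run_chain`, the record layout and the three sample handlers are hypothetical.

```python
from typing import Callable, Dict, List

# Stage names taken from the slide, in the order shown.
DATA_CHAIN = [
    "Data Acquisition",
    "Data Ingest",
    "Metadata",
    "Annotation",
    "Provenance",
    "Data Storage",
    "Curation",
    "Preservation",
]

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def run_chain(record: Record, handlers: Dict[str, Stage]) -> Record:
    """Apply each stage's handler in chain order and note what was done."""
    completed: List[str] = []
    for stage_name in DATA_CHAIN:
        handler = handlers.get(stage_name)
        if handler is None:
            continue  # stage not implemented in this sketch
        record = handler(record)
        completed.append(stage_name)
    record["completed_stages"] = completed
    return record


# Hypothetical handlers for three of the eight stages.
handlers = {
    "Provenance": lambda r: {**r, "source": "instrument-42"},
    "Data Acquisition": lambda r: {**r, "raw": [1.2, 1.4, 1.1]},
    "Metadata": lambda r: {**r, "units": "mm"},
}

result = run_chain({"id": "sample-001"}, handlers)
```

Because the order is fixed by `DATA_CHAIN` and not by the handler dictionary, a stage registered out of order still runs in its place in the chain.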
The Data Deluge
In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history
Some normalizations:
The Bible = 5 Megabytes
Annual refereed papers = 1 Terabyte
Library of Congress = 20 Terabytes
Internet Archive (1996 – 2002) = 100 Terabytes
In many fields new high throughput devices, sensors and surveys will be producing Petabytes of scientific data
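To put the normalizations on a common scale, the short Python sketch below converts each figure to bytes and asks how many of each would fit in a single Petabyte. It assumes decimal (SI) prefixes, which the slide does not specify; the function and dictionary names are illustrative only.

```python
# Sizes quoted on the slide, converted to bytes.
# Assumption: decimal (SI) prefixes, i.e. 1 Megabyte = 10**6 bytes.
MEGABYTE = 10**6
TERABYTE = 10**12
PETABYTE = 10**15

SLIDE_SIZES = {
    "The Bible": 5 * MEGABYTE,
    "Annual refereed papers": 1 * TERABYTE,
    "Library of Congress": 20 * TERABYTE,
    "Internet Archive (1996-2002)": 100 * TERABYTE,
}


def copies_per_petabyte(size_in_bytes: int) -> float:
    """Return how many objects of the given size fit in one Petabyte."""
    return PETABYTE / size_in_bytes


if __name__ == "__main__":
    for name, size in SLIDE_SIZES.items():
        print(f"1 Petabyte ~ {copies_per_petabyte(size):,.0f} x {name}")
```

Under these assumptions one Petabyte corresponds to 50 times the Library of Congress figure and 10 times the Internet Archive figure quoted above.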
The Problem for the e-Scientist
Diagram labels: Experiments & Instruments; Other Archives; Literature; Simulations; facts; questions; answers
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist & cooperate with others?
Data Query and Visualization tools
Support/training
Performance
Execute queries in a minute
Batch (big) query scheduling
Digital Curation?
In 20 years we can guarantee that the operating system, the spreadsheet program and the hardware used to store data will not exist
Need research into ‘curation’ technologies such as workflow, provenance and preservation
Need to liaise closely with individual research communities, data archives and libraries
The UK has set up the ‘Digital Curation Centre’ in Edinburgh with Glasgow, UKOLN and CCLRC
Attempt to bring together skills of scientists, computer scientists and librarians
Digital Curation Centre
Actions needed to maintain and utilise digital data and research results over entire life-cycle
For current and future generations of users
Digital Preservation
Long-run technological/legal accessibility and usability
Data curation in science
Maintenance of body of trusted data to represent current state of knowledge
Research in tools and technologies
Integration, annotation, provenance, metadata, security …
Berlin Declaration 2003
‘To promote the Internet as a functional instrument for a global scientific knowledge base and for human reflection’
Defines open access contributions as including:
‘original scientific research results, raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material’
NSF ‘Atkins’ Report on Cyberinfrastructure
‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’
‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’
MIT DSpace Vision
‘Much of the material produced by faculty, such as datasets, experimental results and rich media data as well as more conventional document-based material (e.g. articles and reports) is housed on an individual’s hard drive or department Web server. Such material is often lost forever as faculty and departments change over time.’
Publishing Data & Analysis Is Changing
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project web site
Curators     Libraries     Data+Doc Archives
Archives     Archives      Digital Archives
Consumers    Scientists    Scientists
Data Publishing: The Background
In some areas – notably biology – databases are replacing (paper) publications as a medium of communication
These databases are built and maintained with a great deal of human effort
They often do not contain source experimental data, sometimes just annotation/metadata
They borrow extensively from, and refer to, other databases
You are now judged by your databases as well as your (paper) publications
Upwards of 1000 (public databases) in genetics
Data Publishing: The issues
Data integration
Tying together data from various sources
Annotation
Adding comments/observations to existing data
Becoming a new form of communication
Provenance
‘Where did this data come from?’
Exporting/publishing in agreed formats
To other programs as well as people
Security
Specifying/enforcing read/write access to parts of your data
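Three of these issues (annotation, provenance and per-item write access) can be shown together in a small Python sketch. This is an illustration under stated assumptions, not a description of any existing database: the `DataItem` class, the use of a SHA-256 checksum as an identifier and the `lineage` helper are all hypothetical, and the sketch assumes the derivation links contain no cycles.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataItem:
    """One published data item with its provenance and annotations."""
    payload: bytes
    derived_from: List[str] = field(default_factory=list)  # parent checksums
    annotations: List[str] = field(default_factory=list)
    writers: Set[str] = field(default_factory=set)  # users allowed to annotate

    @property
    def checksum(self) -> str:
        """SHA-256 of the payload, used here as the item's identifier."""
        return hashlib.sha256(self.payload).hexdigest()

    def annotate(self, user: str, note: str) -> None:
        """Add a comment, enforcing write access for this item only."""
        if user not in self.writers:
            raise PermissionError(f"{user} may not annotate this item")
        self.annotations.append(f"{user}: {note}")


def lineage(item: DataItem, store: Dict[str, DataItem]) -> List[str]:
    """Answer 'where did this data come from?' by walking parent links.

    Assumes the derivation graph is acyclic.
    """
    chain: List[str] = []
    pending = list(item.derived_from)
    while pending:
        parent_id = pending.pop(0)
        chain.append(parent_id)
        parent = store.get(parent_id)
        if parent is not None:
            pending.extend(parent.derived_from)
    return chain


# Hypothetical example: a cleaned data set derived from raw readings.
raw = DataItem(b"raw readings", writers={"alice"})
cleaned = DataItem(b"cleaned readings", derived_from=[raw.checksum],
                   writers={"alice"})
store = {raw.checksum: raw, cleaned.checksum: cleaned}
cleaned.annotate("alice", "outliers removed")
```

Keeping the write list on each item, not on the whole collection, is what lets access be specified for parts of the data.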
Interoperable Repositories?
Paul Ginsparg’s arXiv at Cornell has demonstrated new model of scientific publishing
Electronic version of ‘preprints’ hosted on the Web
David Lipman of the NIH National Library of Medicine has developed PubMedCentral as repository for NIH funded research papers
Microsoft funded development of ‘portable PMC’ now being deployed in UK and other countries
Stevan Harnad’s ‘self-archiving’ EPrints project in Southampton provides a basis for OAI-compliant ‘Institutional Repositories’
Many national initiatives around the world moving towards mandating deposition of ‘full text’ of publicly funded research papers in repositories
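OAI compliance means a repository answers OAI-PMH requests, which are ordinary HTTP queries with a `verb` parameter. The Python sketch below only builds a `ListRecords` request URL and makes no network call. The endpoint `https://repository.example.org/oai` is a hypothetical placeholder, and `list_records_url` is a helper name invented for this example; `oai_dc` is the Dublin Core metadata format that the protocol requires every repository to support.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec is not None:
        params["set"] = set_spec  # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url("https://repository.example.org/oai",
                       from_date="2005-01-01")
```

A harvester would fetch this URL and parse the XML response; because every compliant repository accepts the same request shape, one harvester can collect metadata from many institutional repositories.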
Microsoft Strategy for e-Science
Microsoft intends to work with the scientific and library communities:
to define open standard and/or interoperable high-level services, work flows and tools
to assist the community in developing open scholarly communication and interoperable repositories
Acknowledgements
With special thanks to Kelvin Droegemeier, Geoffrey Fox, Jeremy Frey, Dennis Gannon, Jim Gray, Yike Guo, Liz Lyon and Beth Plale