Ticer Summer School
Thursday 24th August 2006
Dave Berry & Malcolm Atkinson
National e-Science Centre, Edinburgh
www.nesc.ac.uk
Digital Libraries, Grids & E-Science
• What is E-Science?
• What is Grid Computing?
• Data Grids – Requirements, Examples, Technologies
• Data Virtualisation
• The Open Grid Services Architecture
• Challenges
What is e-Science?
• Goal: to enable better research in all disciplines
• Method: develop collaboration supported by advanced distributed computation
– to generate, curate and analyse rich data resources
• from experiments, observations, simulations & publications
• quality management, preservation and reliable evidence
– to develop and explore models and simulations
• computation and data at all scales
• trustworthy, economic, timely and relevant results
– to enable dynamic distributed collaboration
• facilitating collaboration with information and resource sharing
• security, trust, reliability, accountability, manageability and agility
climateprediction.net and GENIE
• Largest climate model ensemble
• >45,000 users, >1,000,000 model years
• Response of Atlantic circulation to freshwater forcing
Integrative Biology
Tackling two Grand Challenge research questions:
• What causes heart disease?
• How does a cancer form and grow?
Together these diseases cause 61% of all UK deaths.
Building a powerful, fault-tolerant Grid infrastructure for biomedical science, enabling biomedical researchers to use distributed resources such as high-performance computers, databases and visualisation tools to develop coupled multi-scale models of how these killer diseases develop.
Courtesy of David Gavaghan & IB Team
BRIDGES: Biomedical Research Informatics Delivered by Grid Enabled Services
[Diagram: a portal and Synteny Grid Service front the CFG Virtual Organisation, in which private data held at Oxford, Edinburgh, London and Leicester is linked through a central data hub, together with publicly curated data sources such as Ensembl, OMIM, MGI, SWISS-PROT and HUGO.]
http://www.brc.dcs.gla.ac.uk/projects/bridges/
eDiaMoND: Screening for Breast Cancer
[Diagram: X-rays and case information flow from patient screening, digital reading and radiology reporting systems (via secondary capture or full-field digital systems) into the eDiaMoND Grid, which shares case and reading information and electronic patient records from one Trust across many Trusts. Benefits: collaborative working, audit capability, epidemiology, assessment/symptomatic work with other modalities (MRI, PET, ultrasound), SMF, CAD and 3D images, better access to case information and digital tools, and training support – managing and performing training cases, mentoring, and sharing digital training cases across clinics.]
E-Science Data Resources
• Curated databases
– Public, institutional, group, personal
• Online journals and preprints
• Text mining and indexing services
• Raw storage (disk & tape)
• Replicated files
• Persistent archives
• Registries
• …
EBank
Slide from Jeremy Frey
Biomedical data – making connections
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Using Workflows to Link Services
• Describe the steps in a scripting language
• Steps performed by a Workflow Enactment Engine
• Many languages in use
– Trade-off: familiarity & availability
– Trade-off: detailed control versus abstraction
• Incrementally develop a correct process
– Sharable & editable
– Basis for scientific communication & validation
– Valuable IPR asset
• Repetition is now easy
– Parameterised explicitly & implicitly
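The ideas above – named steps, explicit parameters, and a separate enactment engine – can be sketched in miniature. This is a toy illustration, not any real workflow language; the step names and data are invented.

```python
# A toy workflow enactment engine: a workflow is an ordered list of
# named, parameterised steps; the engine runs them in order, threading
# each step's output into the next. Step names and data are invented.

def enact(workflow, data, params):
    """Run each step in order, passing the data through."""
    for name, step in workflow:
        data = step(data, params)
    return data

# Three example "services" a workflow might chain together.
def fetch(data, params):
    return [x for x in data if x >= params["threshold"]]

def transform(data, params):
    return [x * params["scale"] for x in data]

def summarise(data, params):
    return sum(data)

workflow = [("fetch", fetch), ("transform", transform), ("summarise", summarise)]

# Repetition is easy: re-run the same workflow with new parameters.
result = enact(workflow, [1, 2, 3, 4], {"threshold": 2, "scale": 10})
print(result)  # (2 + 3 + 4) * 10 -> 90
```

Because the workflow is data rather than code, it can be shared, edited and re-enacted with different parameters – the properties the slide attributes to real workflow systems.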
Workflow Systems
• Shell scripts (enactment: shell + OS) – common but not often thought of as WF; depends on context, e.g. NFS across all sites
• Perl (enactment: Perl runtime) – popular in bioinformatics; similar context dependence – distribution has to be coded
• Java (enactment: JVM) – popular target because of JVM ubiquity; similar dependence – distribution has to be coded
• BPEL (enactment: BPEL engine) – OASIS standard for industry, coordinating use of multiple Web Services; low-level detail; tools
• Taverna (enactment: Scufl) – EBI, OMII-UK & MyGrid; http://taverna.sourceforge.net/index.php
• VDT / Pegasus (enactment: Chimera & DAGman) – high-level abstract formulation of workflows, automated mapping towards executable forms, cached result re-use
• Kepler (enactment: Kepler) – BIRN, GEON & SEEK; http://kepler-project.org/
Workflow example
Taverna in MyGrid – http://www.mygrid.org.uk/
“allows the e-Scientist to describe and enact their experimental processes in a structured, repeatable and verifiable way”
• GUI
• Workflow language
• Enactment engine
Notification
Pub/Sub for laboratory data, using a broker and ultimately delivered over GPRS
Comb-e-chem: Jeremy Frey
Relevance to Digital Libraries
• Similar concerns
– Data curation & management
– Metadata, discovery
– Secure access (AAA+)
– Provenance & data quality
– Local autonomy
– Availability, resilience
• Common technology
– Grid as an implementation technology
What is a Grid?
A grid is a system consisting of:
– distributed but connected resources, and
– software and/or hardware that provides and manages logically seamless access to those resources to meet desired objectives
[Diagram: resources include web servers, handhelds, workstations, database servers, licenses, clusters, printers, supercomputers and data centres.]
Source: Hiro Kishimoto, GGF17 Keynote, May 2006
Virtualizing Resources
[Diagram: applications access computers, storage, sensors and information through common interfaces layered over type-specific and resource-specific interfaces, implemented as Web services over the underlying resources.]
Hiro Kishimoto: Keynote GGF17
Ideas and Forms
• Key ideas
– Virtualised resources
– Secure access
– Local autonomy
• Many forms
– Cycle stealing
– Linked supercomputers
– Distributed file systems
– Federated databases
– Commercial data centres
– Utility computing
Grid Middleware
[Diagram: grid middleware services – job-submit, brokering and registry services – sit above virtualized resources such as data, application and printer services.]
Hiro Kishimoto: Keynote GGF17
Key Drivers for Grids
• Collaboration
– Expertise is distributed
– Resources (data, software licences) are location-specific
– Necessary to achieve critical mass of effort
– Necessary to raise sufficient resources
• Computational power
– Rapid growth in number of processors
– Powered by Moore’s law + device roadmap
– Challenge to transform models to exploit this
• Deluge of data
– Growth in scale: number and size of resources
– Growth in complexity
– Policy drives greater data availability
Minimum Grid Functionalities
• Supports distributed computation
– Data and computation
– Over a variety of
• hardware components (servers, data stores, …)
• software components (services: resource managers, computation and data services)
– With regularity that can be exploited
• by applications
• by other middleware & tools
• by providers and operations
– It will normally have security mechanisms
• to develop and sustain trust regimes
Grid & Related Paradigms
• Distributed computing: loosely coupled; heterogeneous; single administration
• Cluster: tightly coupled; homogeneous; cooperative working
• Grid computing: large scale; cross-organizational; geographical distribution; distributed management
• Utility computing: computing “services”; no knowledge of provider; enabled by grid technology
Source: Hiro Kishimoto, GGF17 Keynote, May 2006
Why use / build Grids?
• Research Arguments
– Enables new ways of working
– New distributed & collaborative research
– Unprecedented scale and resources
• Economic Arguments
– Reduced system management costs
– Shared resources → better utilisation
– Pooled resources → increased capacity
– Load sharing & utility computing
– Cheaper disaster recovery
Why use / build Grids?
• Operational Arguments
– Enable autonomous organisations to
• write complementary software components
• set up, run & use complementary services
• share operational responsibility
– General & consistent environment for abstraction, automation, optimisation & tools
• Political & Management Arguments
– Stimulate innovation
– Promote intra-organisation collaboration
– Promote inter-enterprise collaboration
Grids In Use: E-Science Examples
• Data sharing and integration
– Life sciences: sharing standard data-sets, combining collaborative data-sets
– Medical informatics: integrating hospital information systems for better care and better science
– Sciences: high-energy physics
• Simulation-based science and engineering
– Earthquake simulation
• Capability computing
– Life sciences: molecular modeling, tomography
– Engineering: materials science
– Sciences: astronomy, physics
• High-throughput, capacity computing
– Life sciences: BLAST, CHARMM, drug screening
– Engineering: aircraft design, materials, biomedical
– Sciences: high-energy physics, economic modeling
Source: Hiro Kishimoto, GGF17 Keynote, May 2006
Database Growth
• EMBL DB: 111,416,302,701 nucleotides
• PDB: 33,367 protein structures
Slide provided by Richard Baldock: MRC HGU Edinburgh
Requirements: User’s viewpoint
• Find data
– Registries & human communication
• Understand data
– Metadata description; standard / familiar formats & representations; standard value systems & ontologies
• Data access
– Find how to interact with the data resource
– Obtain permission (authority)
– Make connection
– Make selection
• Move data
– In bulk or streamed (in increments)
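The find → authorise → connect → select → move sequence can be illustrated with a toy registry and streamed retrieval. Every resource name and record here is invented; this sketches the shape of the interaction, not any real grid API.

```python
# Toy illustration of the user's data-access steps: find a resource in
# a registry, check permission, select matching records, and stream
# them in increments rather than moving everything at once.
# All resource names and records are invented.

REGISTRY = {"sequence-db": {"url": "example://seqdb",
                            "records": ["aacg", "ttgc", "aatt"]}}
PERMITTED = {("alice", "sequence-db")}

def fetch_streamed(user, resource, selection, chunk_size=2):
    entry = REGISTRY[resource]                 # find data via the registry
    if (user, resource) not in PERMITTED:      # obtain permission
        raise PermissionError(user)
    matches = [r for r in entry["records"] if selection(r)]  # make selection
    for i in range(0, len(matches), chunk_size):             # move in increments
        yield matches[i:i + chunk_size]

chunks = list(fetch_streamed("alice", "sequence-db", lambda r: r.startswith("aa")))
print(chunks)  # [['aacg', 'aatt']]
```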
Requirements: User’s viewpoint 2
• Transform data
– To the format, organisation & representation required for computation or integration
• Combine data
– Standard database operations + operations relevant to the application model
• Present results
– To humans: data movement + transform for viewing
– To application code: data movement + transform to the required format
– To standard analysis tools, e.g. R
– To standard visualisation tools, e.g. Spitfire
Requirements: Owner’s viewpoint
• Create data
– Automated generation; accession policies; metadata generation
– Storage resources
• Preserve data
– Archiving
– Replication
– Metadata
– Protection
• Provide services with available resources
– Definition & implementation: costs & stability
– Resources: storage, compute & bandwidth
Requirements: Owner’s viewpoint 2
• Protect services
– Authentication, authorisation, accounting, audit
– Reputation
• Protect data
– Comply with owner requirements – encryption for privacy, …
• Monitor and control use
– Detect and handle failures, attacks, misbehaving users
– Plan for future loads and services
• Establish the case for continuation
– Usage statistics
– Discoveries enabled
Large Hadron Collider
• The most powerful instrument ever built to investigate elementary particle physics
• Data challenge:
– 10 Petabytes/year of data
– 20 million CDs each year!
• Simulation, reconstruction, analysis:
– LHC data handling requires computing power equivalent to ~100,000 of today's fastest PC processors
Composing Observations in Astronomy
No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12-waveband coverage of large areas of the sky
• Total about 200 TB of data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
GODIVA Data Portal
• Grid for Ocean Diagnostics, Interactive Visualisation and Analysis
• Daily Met Office marine forecasts and gridded research datasets
• National Centre for Ocean Forecasting
• ~3 TB climate model datastore via Web Services
• Interactive visualisations, inc. online movies
• ~30 accesses a day worldwide
• Other GODIVA software produces 3D/4D visualisations, reading data remotely via Web Services
www.nerc-essc.ac.uk/godiva
GODIVA Visualisations
• Unstructured meshes
• Grid rotation/interpolation
• Geospatial databases vs. files (Postgres, IBM, Oracle)
• Perspective 3D visualisation
• Google Maps viewer
NERC Data Grid
• The DataGrid focuses on federation of NERC Data Centres
• Grid for data discovery, delivery and use across sites
• Data can be stored in many different ways (flat files, databases, …)
• Strong focus on metadata and ontologies
• Clear separation between discovery and use of data
• Prototype focussing on atmospheric and oceanographic data
www.ndg.nerc.ac.uk
Global In-flight Engine Diagnostics
[Diagram: in-flight data from 100,000 aircraft (0.5 GB/flight, 4 flights/day – 200 TB/day) flows via airline ground stations and a global network (now BROADEN) to the DS&S Engine Health Center, data centre and maintenance centre, with alerts delivered by internet, e-mail and pager. Significant in winning the Boeing 787 engine contract.]
Distributed Aircraft Maintenance Environment: Leeds, Oxford, Sheffield & York; Jim Austin
Storage Resource Manager (SRM)
• http://sdm.lbl.gov/srm-wg/
• De facto & written standard in physics, …
• Collaborative effort – CERN, FNAL, JLAB, LBNL and RAL
• Essential bulk file storage
– (pre-)allocation of storage
• abstraction over storage systems
– File delivery / registration / access
– Data movement interfaces, e.g. GridFTP
• Rich function set
– Space management, permissions, directory, data transfer & discovery
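A central SRM idea is that clients (pre-)allocate space before writing files against the reservation. A minimal sketch of that abstraction, with invented names – the real SRM interface is a Web-service protocol over actual storage systems, not this toy class:

```python
# Toy sketch of SRM-style space management: reserve space first, then
# put files against the reservation. Class and method names are
# invented; real SRM is a Web-service protocol.

class ToyStorageManager:
    def __init__(self, capacity):
        self.capacity = capacity     # unreserved bytes remaining
        self.reservations = {}       # token -> reserved bytes remaining
        self.files = {}              # path -> size

    def reserve_space(self, token, size):
        """Pre-allocate storage under a reservation token."""
        if size > self.capacity:
            raise RuntimeError("insufficient space")
        self.capacity -= size
        self.reservations[token] = size

    def put(self, token, path, size):
        """Store a file, drawing on a prior reservation."""
        if self.reservations.get(token, 0) < size:
            raise RuntimeError("reservation exhausted")
        self.reservations[token] -= size
        self.files[path] = size

srm = ToyStorageManager(capacity=100)
srm.reserve_space("job42", 60)
srm.put("job42", "/data/run1.dat", 40)
print(srm.reservations["job42"], srm.capacity)  # 20 40
```

The point of the pattern is that a job can fail fast at reservation time rather than mid-way through a bulk transfer.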
Storage Resource Broker (SRB)
• http://www.sdsc.edu/srb/index.php/Main_Page
• SDSC-developed
• Widely used
– Archival document storage
– Scientific data: bio-sciences, medicine, geo-sciences, …
• Manages
– Storage resource allocation
• abstraction over storage systems
– File storage
– Collections of files
– Metadata describing files, collections, etc.
– Data transfer services
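SRB's model – files grouped into logical collections, with queryable metadata attached – can be sketched as a tiny in-memory catalog. This is an illustration of the data model only, not the real SRB API; collection paths and attributes are invented.

```python
# Toy sketch of the SRB data model: logical collections of files with
# attached metadata that can be queried independently of where the
# files physically live. Not the real SRB API.

catalog = []  # list of (collection, filename, metadata-dict)

def register(collection, filename, **metadata):
    catalog.append((collection, filename, metadata))

def query(**criteria):
    """Return filenames whose metadata matches all given criteria."""
    return [f for (_, f, md) in catalog
            if all(md.get(k) == v for k, v in criteria.items())]

register("/home/proj/scans", "scan1.img", modality="MRI", patient="p01")
register("/home/proj/scans", "scan2.img", modality="PET", patient="p01")

print(query(modality="MRI"))   # ['scan1.img']
print(query(patient="p01"))    # ['scan1.img', 'scan2.img']
```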
Condor Data Management
• Stork
– Manages file transfers
– May manage reservations
• Nest
– Manages data storage
– Cf. GridFTP with reservations
• over multiple protocols
Globus Tools and Services for Data Management
• GridFTP – a secure, robust, efficient data transfer protocol
• The Reliable File Transfer Service (RFT) – Web services-based; stores state about transfers
• The Data Access and Integration Service (OGSA-DAI) – a service to access data resources, particularly relational and XML databases
• The Replica Location Service (RLS) – a distributed registry that records the locations of data copies
• The Data Replication Service – Web services-based; combines data replication and registration functionality
Slides from Ann Chervenak
RLS in Production Use: LIGO
• Laser Interferometer Gravitational Wave Observatory
• Currently uses RLS servers at 10 sites
– Contain mappings from 6 million logical files to over 40 million physical replicas
• Used in a customized data management system: the LIGO Lightweight Data Replicator (LDR)
– Includes RLS, GridFTP, a custom metadata catalog, and tools for storage management and data validation
Slides from Ann Chervenak
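The core RLS function – mapping logical file names (LFNs) to physical replicas (PFNs) – reduces to a many-to-many registry. A toy single-site version follows; the names are invented, and the real RLS distributes these mappings across index and catalog servers rather than one dictionary.

```python
# Toy replica catalog in the spirit of RLS: one logical file name
# (LFN) maps to many physical file names (PFNs). The real RLS spreads
# these mappings over distributed index and catalog servers.

from collections import defaultdict

replicas = defaultdict(set)  # LFN -> set of PFNs

def add_replica(lfn, pfn):
    replicas[lfn].add(pfn)

def lookup(lfn):
    """Return all known physical replicas of a logical file."""
    return sorted(replicas[lfn])

add_replica("ligo:frame-0001", "gsiftp://siteA.example.org/data/frame-0001")
add_replica("ligo:frame-0001", "gsiftp://siteB.example.org/store/frame-0001")

print(len(lookup("ligo:frame-0001")))  # 2
```

A data mover can then pick whichever replica is closest or least loaded – the mapping, not the choice policy, is RLS's job.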
RLS in Production Use: ESG
• Earth System Grid: climate modeling data (CCSM, PCM, IPCC)
• RLS at 4 sites
• Data management coordinated by the ESG portal
• Datasets stored at NCAR: 64.41 TB in 397,253 files; 1,230 portal users
• IPCC data at LLNL: 26.50 TB in 59,300 files; 400 registered users
– Data downloaded: 56.80 TB in 263,800 files; avg. 300 GB downloaded/day
– 200+ research papers being written
Slides from Ann Chervenak
gLite Data Management (Enabling Grids for E-sciencE)
• FTS – File Transfer Service
• LFC – Logical File Catalogue
• Replication Service – accessed through the LFC
• AMGA – metadata services
INFSO-RI-508833
Source: 2nd EGEE Review, CERN – gLite Middleware Status
Data Management Services (Enabling Grids for E-sciencE)
• FiReMan catalog
– Resolves logical filenames (LFNs) to the physical location of files and storage elements
– Oracle and MySQL versions available
– Secure services
– Attribute support
– Symbolic link support
– Deployed on the Pre-Production Service and the DILIGENT testbed
• gLite I/O
– POSIX-like access to Grid files
– Castor, dCache and DPM support
– Has been used for the BioMedical demo
– Deployed on the Pre-Production Service and the DILIGENT testbed
• AMGA metadata catalog
– Used by the LHCb experiment
– Has been used for the BioMedical demo

Medical Data Management
• Components: File Catalog (Fireman), Encryption Keystore (Hydra), Metadata Catalog (AMGA)
• Medical imager trigger:
– Retrieve DICOM files from the imager
– Register the file in Fireman
– gLite EDS client: generate encryption keys and store them in Hydra
– Register metadata in AMGA
• Application client library:
– Look up the file through metadata (AMGA)
– Use the gLite EDS client to retrieve the file through gLite I/O
– Retrieve the encryption key from Hydra
– Decrypt the data
– Serve it up to the application
INFSO-RI-508833
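The medical retrieval path – metadata lookup, file access, key retrieval, decryption – can be sketched end-to-end. In this toy, XOR stands in for real encryption purely for illustration; the actual system uses proper cryptography via the gLite EDS client and Hydra keystore, and every catalog and name below is invented.

```python
# Toy end-to-end sketch of the medical-data retrieval path: look up a
# file via metadata, fetch its encrypted bytes, fetch its key from a
# keystore, and decrypt. XOR stands in for real encryption purely for
# illustration; all catalogs and names are invented.

def xor(data, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

KEY = b"hydra-key"
metadata_catalog = {("patient01", "MRI"): "lfn:scan-0001"}     # AMGA-like
file_store = {"lfn:scan-0001": xor(b"DICOM pixel data", KEY)}  # gLite I/O-like
keystore = {"lfn:scan-0001": KEY}                              # Hydra-like

def retrieve(patient, modality):
    lfn = metadata_catalog[(patient, modality)]  # look up file via metadata
    encrypted = file_store[lfn]                  # retrieve file contents
    key = keystore[lfn]                          # retrieve encryption key
    return xor(encrypted, key)                   # decrypt and serve

print(retrieve("patient01", "MRI"))  # b'DICOM pixel data'
```

Note that the file store never sees plaintext and the keystore never sees data – the separation that makes a shared grid acceptable for sensitive medical data.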
File Transfer Service (Enabling Grids for E-sciencE)
• Reliable file transfer
• Full, scalable implementation
– Java Web Service front-end; C++ agents; Oracle or MySQL database support
– Support for channel, site and VO management
– Interfaces for management and statistics monitoring
• GsiFTP, SRM and SRM-copy support
• Multi-VO support
INFSO-RI-508833
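“Reliable” here means the service owns the transfer state and retries failures rather than leaving that to the user. A minimal sketch of that pattern – the function and state record are invented, not the FTS interface:

```python
# Minimal sketch of the "reliable transfer" pattern behind services
# like FTS and RFT: the service records per-transfer state and retries
# failed transfers up to a limit. Function and field names invented.

def reliable_transfer(source, dest, do_copy, max_attempts=3):
    """Attempt the copy, retrying on failure; return the state record."""
    state = {"source": source, "dest": dest, "attempts": 0, "status": "Pending"}
    for _ in range(max_attempts):
        state["attempts"] += 1
        try:
            do_copy(source, dest)
            state["status"] = "Done"
            return state
        except IOError:
            state["status"] = "Retrying"
    state["status"] = "Failed"
    return state

# A flaky copy that fails twice, then succeeds.
calls = {"n": 0}
def flaky_copy(src, dst):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("network glitch")

state = reliable_transfer("gsiftp://a/f", "gsiftp://b/f", flaky_copy)
print(state["status"], state["attempts"])  # Done 3
```

Because the state record persists (in FTS, in an Oracle or MySQL database), transfers survive client disconnection and can be monitored after the fact.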
Commercial Solutions
• Vendors include:
– Avaki
– Data Synapse
• Benefits & costs
– Well packaged and documented
– Support
– Can be expensive
• but look for academic rates
Data Integration Strategies
• Use a Service provided by a Data Owner
• Use a scripted workflow
• Use data virtualisation services
– Arrange that multiple data services have common properties
– Arrange federations of these
– Arrange access presenting the common properties
– Expose the important differences
– Support integration accommodating those differences
Data Virtualisation Services
• Form a federation
– Set of data resources – incremental addition
– Registration & description of collected resources
– Warehouse data, or access dynamically to obtain updated data
– Virtual data warehouses – automating the division between collection and dynamic access
• Describe relevant relationships between data sources
– Incremental description + refinement / correction
• Run jobs, queries & workflows against the combined set of data resources
– Automated distribution & transformation
• Example systems
– IBM’s Information Integrator
– GEON, BIRN & SEEK
– OGSA-DAI is an extensible framework for building such systems
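At its simplest, a federation runs one query over several registered resources and merges the results, transforming each source into a common representation on the way. A toy sketch – the sources, fields and units are all invented:

```python
# Toy data federation: each registered resource exposes its records in
# its own units; the federation transforms them into a common
# representation and answers one query over the combined set.
# Sources, fields and units are invented.

federation = []  # list of (fetch_fn, to_common_fn)

def register(fetch, to_common):
    federation.append((fetch, to_common))

def query(predicate):
    """One query over all federation members, in common form."""
    results = []
    for fetch, to_common in federation:
        results.extend(r for r in map(to_common, fetch()) if predicate(r))
    return results

# Site A reports temperatures in Celsius, site B in Fahrenheit.
register(lambda: [{"site": "A", "temp_c": 10.0}],
         lambda r: {"site": r["site"], "temp_c": r["temp_c"]})
register(lambda: [{"site": "B", "temp_f": 86.0}],
         lambda r: {"site": r["site"], "temp_c": (r["temp_f"] - 32) * 5 / 9})

hot = query(lambda r: r["temp_c"] > 20)
print(hot)  # [{'site': 'B', 'temp_c': 30.0}]
```

New members join incrementally by registering a fetch function and a transform – the "incremental addition" the slide describes.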
Virtualisation variations
• Extent to which homogeneity obtained
– Regular representation choices – e.g. units
– Consistent ontologies
– Consistent data model
– Consistent schema – integrated super-schema
– DB operations supported across the federation
– Ease of adding federation elements
– Ease of accommodating change as federation members change their schemas and policies
– Drill-through to primary forms supported
OGSA-DAI
• http://www.ogsadai.org.uk
• A framework for data virtualisation
• Wide use in e-Science
– BRIDGES, GEON, CaBiG, GeneGrid, MyGrid, BioSimGrid, eDiamond, IU RGRBench, …
• Collaborative effort
– NeSC, EPCC, IBM, Oracle, Manchester, Newcastle
• Querying of data resources
– Relational databases
– XML databases
– Structured flat files
• Extensible activity documents
– Customisation for particular applications
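OGSA-DAI requests are documents listing “activities” (for example: query, transform, deliver) that the service executes as a pipeline. A toy interpreter for that idea follows; the activity names and table are invented, and the real OGSA-DAI request format is XML, not Python dicts.

```python
# Toy interpreter for an OGSA-DAI-style activity document: a request
# is a list of activities executed as a pipeline, each consuming the
# previous activity's output. Activity names and data are invented;
# real OGSA-DAI requests are XML documents.

TABLE = [{"gene": "BRCA1", "hits": 12}, {"gene": "TP53", "hits": 7}]

ACTIVITIES = {
    "query":     lambda data, a: [r for r in TABLE if r["hits"] >= a["min_hits"]],
    "transform": lambda data, a: [r[a["column"]] for r in data],
    "deliver":   lambda data, a: {"to": a["to"], "payload": data},
}

def perform(document):
    """Run the activity pipeline and return the final output."""
    data = None
    for activity in document:
        data = ACTIVITIES[activity["name"]](data, activity)
    return data

request = [
    {"name": "query", "min_hits": 10},
    {"name": "transform", "column": "gene"},
    {"name": "deliver", "to": "client"},
]
print(perform(request))  # {'to': 'client', 'payload': ['BRCA1']}
```

The extensibility the slide mentions corresponds to adding entries to the activity table – applications plug in custom activities without changing the engine.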
The Open Grid Services Architecture
• An open, service-oriented architecture (SOA)
– Resources as first-class entities
– Dynamic service/resource creation and destruction
• Built on a Web services infrastructure
• Resource virtualization at the core
• Build grids from a small number of standards-based components
– Replaceable, coarse-grained, e.g. brokers
• Customizable
– Support for dynamic, domain-specific content…
– …within the same standardized framework
Hiro Kishimoto: Keynote GGF17
OGSA Capabilities
• Execution management: job description & submission; scheduling; resource provisioning
• Data services: common access facilities; efficient & reliable transport; replication services
• Resource management: discovery; monitoring; control
• OGSA information services: registry; notification; logging/auditing
• Self-management: self-configuration; self-optimization; self-healing
• Security: cross-organizational users; trust nobody; authorized access only
These capabilities are defined as OGSA “profiles” on a Web services foundation.
Hiro Kishimoto: Keynote GGF17
Basic Data Interfaces
• Storage management
– e.g. Storage Resource Management (SRM)
• Data access
– ByteIO
– Data Access & Integration (DAI)
• Data transfer
– Data Movement Interface Specification (DMIS)
– Protocols (e.g. GridFTP)
• Replica management
• Metadata catalog
• Cache management
Hiro Kishimoto: Keynote GGF17
The State of the Art
• Many successful Grid & E-Science projects
– A few examples shown in this talk
• Many Grid systems
– All largely incompatible
– Interoperation talks under way
• Standardisation efforts
– Mainly via the Open Grid Forum, a merger of the GGF & EGA
• Significant user investment required
– Few “out of the box” solutions
Technical Challenges
• Issues you can’t avoid
– Lack of Complete Knowledge (LOCK)
– Latency
– Heterogeneity
– Autonomy
– Unreliability
– Scalability
– Change
• A challenging goal
– Balance technical feasibility
– against virtual homogeneity, stability and reliability
– while remaining affordable, manageable and maintainable
Areas “In Development”
• Data provenance
• Quality of service
– Service-level agreements
• Resource brokering
– Across all resources
• Workflow scheduling
– Co-scheduling
• Licence management
• Software provisioning
– Deployment and update
• Other areas too!
Operational Challenges
• Management of distributed systems
– With local autonomy
• Deployment, testing & monitoring
• User training
• User support
• Rollout of upgrades
• Security
– Distributed identity management
– Authorisation
– Revocation
– Incident response
Grids as a Foundation for Solutions
• The grid per se doesn’t provide
– Supported e-Science methods
– Supported data & information resources
– Computations
– Convenient access
• Grids help providers of these, via
– International & national secure e-Infrastructure
– Standards for interoperation
– Standard APIs to promote re-use
• But research support must be built
– Application developers
– Resource providers
Collaboration Challenges
• Defining common goals
• Defining common formats
– E.g. schemas for data and metadata
• Defining a common vocabulary
– E.g. for metadata
• Finding common technology
– Standards should help, eventually
• Collecting metadata
– Automate where possible
Social Challenges
• Changing cultures
– Rewarding data & resource sharing
– Requiring publication of data
• Taking the first steps
– If everyone shares, everyone wins
– The first people to share must not lose out
• Sustainable funding
– Technology must persist
– Data must persist
Summary
• E-Science exploits distributed computing resources to enable new discoveries, new collaborations and new ways of working
• Grid is an enabling technology for e-Science
• Many successful projects exist
• Many challenges remain
UK e-Science
• Globus Alliance
• e-Science Institute
• National Centre for e-Social Science
• Digital Curation Centre
• Open Middleware Infrastructure Institute
• Grid Operations Support Centre
• CeSC (Cambridge)
• EGEE
• National Institute for Environmental e-Science
• ChinaGrid