The Kepler Project Overview, Status, and Future Directions

The Kepler Project

Overview, Status, and Future Directions

Matthew B. Jones
on behalf of the Kepler Project team
National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara

The Kepler Project

• Goals
  • Produce an open-source scientific workflow system
    • enable scientists to design scientific workflows and execute them
  • Support scientists in a variety of disciplines
    • e.g., biology, ecology, astronomy
  • Important features
    • access to scientific data
    • flexible means for executing complex analyses
    • enable use of Grid-based approaches to distributed computation
    • semantic models of scientific tasks
    • effective UI for workflow design

Kepler Collaboration

• Open-source
  • Builds on Ptolemy II
• Collaborators
  • SEEK Project
  • SciDAC SDM Center
  • Ptolemy Project
  • GEON Project
  • ROADNet Project
  • Resurgence Project
• Goals
  • Create powerful analytical tools that are useful across disciplines
  • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …

Usage statistics

• Source code access
  • 154 people accessed source code
  • 30 members have write permission
• Kepler downloads
  • Total = 9204
  • Beta = 6675
  • [Chart: Kepler downloads; red = Windows, blue = Macintosh]
• Projects using Kepler:
  • SEEK (ecology)
  • SciDAC (molecular bio, ...)
  • CPES (plasma simulation)
  • GEON (geosciences)
  • CiPRes (phylogenetics)
  • CalIT2
  • ROADnet (real-time data)
  • LOOKING (oceanography)
  • CAMERA (metagenomics)
  • Resurgence (computational chemistry)
  • NORIA (ocean observing CI)
  • NEON (ecology observing CI)
  • ChIP-chip (genomics)
  • COMET (environmental science)
  • Cheshire Digital Library (archival)
  • Digital preservation (DIGARCH)
  • Cell Biology (Scripps)
  • DART (X-Ray crystallography)
  • Ocean Life
  • Assembling the Tree of Life project
  • Processing Phylodata (pPOD)
  • FermiLab (particle physics)

SWDB Aug 29, 2004

Kepler advances

• Data and Actor search
• Kepler Archive (KAR) format
• EarthGrid data access system
• Kepler Component Library
• Integrated support for LSID identifiers for all objects
• Object Manager and cache
• Web service execution
• RExpression & MatlabExpression actors
• Redesigned user interface
• Authentication subsystem
• Null-value handling

More advances

• Documentation
• Collection-oriented workflows (COMAD)
• Domain-specific actors for case studies
  • e.g., GARP, phylogenetics actors
• Provenance system
• Grid computing support
  • NIMROD, Globus, ssh, ...
• Semantics support
  • annotation, search, workflow validation, integration

Distributed execution

• Opportunities for parallel execution
  • Fine-grained parallelism
  • Coarse-grained parallelism
    • Few or no cycles
    • Limited dependencies among components
    • ‘Trivially parallel’
  • Many science problems fit this mold
    • parameter sweep, iteration of stochastic models
• Current ‘plumbing’ approaches to distributed execution
  • workflow acts as a controller
    • stages data resources
    • writes job description files
    • controls execution of jobs on nodes
  • requires expert understanding of the Grid system
• Scientists need to focus on just the computations
  • try to avoid plumbing as much as possible
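The "trivially parallel" case named on this slide, a parameter sweep, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_model` is a made-up toy model (discrete logistic growth), `parameter_sweep` is a hypothetical helper, and a local thread pool stands in for Grid nodes. None of these names come from Kepler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_model(growth_rate, capacity, steps=50):
    """Hypothetical toy model: discrete logistic growth, returns final population."""
    population = 1.0
    for _ in range(steps):
        population += growth_rate * population * (1 - population / capacity)
    return population


def parameter_sweep(model, grid, max_workers=4):
    """Run `model` once per parameter combination.

    The runs share no state, so they can execute in any order or in
    parallel; a thread pool stands in here for remote nodes.
    """
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda params: model(**params), combos))
    return list(zip(combos, results))


if __name__ == "__main__":
    grid = {"growth_rate": [0.1, 0.5], "capacity": [100.0, 200.0]}
    for params, final in parameter_sweep(run_model, grid):
        print(params, round(final, 2))
```

The scientist supplies only the model and the parameter grid; everything about how runs are dispatched stays inside the helper, which is the separation the slide argues for.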

Distributed Kepler

• Higher-order component for executing a model on one or more remote nodes
• Master and slave controllers handle setup and communication among nodes, and establish data channels
• Extremely easy for scientist to utilize
  • requires no knowledge of grid computing systems

[Figure labels: IN, OUT, Master Controller, Slave Controller]
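As a rough illustration of the higher-order component described on this slide, the sketch below wraps an arbitrary model function and fans its inputs out to several controllers over queues. It is not Kepler's implementation: the class names simply mirror the slide's labels, threads stand in for remote nodes, and in-process queues stand in for the data channels.

```python
import queue
import threading


class SlaveController(threading.Thread):
    """Stands in for a remote node: reads inputs from a channel and runs the model."""

    def __init__(self, model, inbox, outbox):
        super().__init__(daemon=True)
        self.model, self.inbox, self.outbox = model, inbox, outbox

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # sentinel: the master asks this node to stop
                break
            index, value = item
            self.outbox.put((index, self.model(value)))


class MasterController:
    """Higher-order component: takes a model and executes it on several nodes."""

    def __init__(self, model, nodes=2):
        self.model, self.nodes = model, nodes

    def execute(self, inputs):
        inputs = list(inputs)
        inbox, outbox = queue.Queue(), queue.Queue()  # the "data channels"
        slaves = [SlaveController(self.model, inbox, outbox)
                  for _ in range(self.nodes)]
        for slave in slaves:
            slave.start()
        for item in enumerate(inputs):
            inbox.put(item)
        for _ in slaves:
            inbox.put(None)  # one stop sentinel per node
        for slave in slaves:
            slave.join()
        # Reassemble results in input order, whichever node produced them.
        results = [None] * len(inputs)
        while not outbox.empty():
            index, value = outbox.get()
            results[index] = value
        return results


if __name__ == "__main__":
    print(MasterController(lambda x: x * x, nodes=3).execute([1, 2, 3, 4]))
```

The caller sees a single `execute` call and never touches queues or node setup, which is the "no knowledge of grid computing systems" property in miniature. Error handling for a failing model is omitted in this sketch.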

Data Management

• Need for integrated management of external data
  • EarthGrid access is partial, needs refactoring
  • Include other data sources, such as JDBC, OpeNDAP, etc.
  • Data needs to be a first-class object in Kepler, not just represented as an actor
  • Need support for data versioning to support provenance
• e.g., need to pass data by reference
  • workflows contain large data tokens (100s of megabytes)
  • intelligent handling of unique identifiers (e.g., LSID)

[Figure labels: A, B, {1,5,2}]
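Passing data by reference can be sketched as follows: the large payload is written once to a store, and only a small token carrying an identifier travels between actors. Everything here is an assumption for illustration: `DataStore` and `DataRef` are hypothetical names, the store is in memory, and the identifier merely follows the general `urn:lsid:authority:namespace:object` shape with a placeholder authority.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRef:
    """Small token passed between actors in place of the data itself."""
    identifier: str
    size: int


class DataStore:
    """Hypothetical in-memory store; a real one would sit on disk or a data grid."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> DataRef:
        # Derive the object id from the content so identical data gets one id.
        digest = hashlib.sha256(payload).hexdigest()[:16]
        identifier = f"urn:lsid:example.org:data:{digest}"
        self._objects[identifier] = payload
        return DataRef(identifier, len(payload))

    def get(self, ref: DataRef) -> bytes:
        return self._objects[ref.identifier]


def producer(store: DataStore) -> DataRef:
    """Actor A: writes a large payload once and emits only a reference."""
    return store.put(b"x" * 1_000_000)


def consumer(store: DataStore, ref: DataRef) -> int:
    """Actor B: dereferences the token only when it needs the bytes."""
    return len(store.get(ref))


if __name__ == "__main__":
    store = DataStore()
    ref = producer(store)
    print(ref.identifier, consumer(store, ref))
```

Because the identifier is derived from the content, a changed payload yields a new identifier, which is one simple way an identifier scheme could also support the versioning and provenance needs listed above.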

New projects: REAP

• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System
• Extend Kepler to:
  • Manage and monitor sensor networks
  • Consume data from sensors
  • Integrate sensor data handling with data archive handling
• Terrestrial ecology and oceanography use cases

PIs: Jones, Altintas, Estrin, Seabloom, Gallagher, Cornillon, Hosseini, Ludäscher, Schildhauer, Reichman, Baru, Potter, Borer
Institutions: UCSB, UCD, UCSD, UCLA, OSU, OpeNDAP

REAP breakdown

New projects: ChIP-chip

• A Collaborative Scientific Workflow Environment for Accelerating Genome-Scale Biological Research
  • CS/IT: Ludäscher, Bowers, McPhillips
  • Bio: Peggy Farnham, Mark Bieda
• Integrate a web-based "experiment workspace" environment with a flexible scientific workflow system
• Support rapid prototyping and easy addition of new "methods"
• Support client-server (i.e., enterprise) deployments for group/lab-wide collaboration
• templates
  • details of how key steps are carried out are left unspecified until runtime, then late-binding of one or more specific algorithms or data
• which is the best motif-finding algorithm?
• parts of workflow have similar set of steps
• need to compare results from parallel analyses
• different people have different roles [Software dev (Tim), Bioinformatics specialist (Mark), Biologist (Peggy)]
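The template idea on this slide, fixed steps with one slot bound at runtime, can be sketched as a function that takes the algorithm as a parameter. The two "motif finders" below are deliberately trivial stand-ins invented for this example; they are not MEME, MDscan, or any real algorithm, and `run_template` and `compare` are hypothetical names.

```python
from typing import Callable, Dict, List

MotifFinder = Callable[[List[str]], str]


def _most_common(items: List[str]) -> str:
    """Most frequent item; ties are broken alphabetically for determinism."""
    counts: Dict[str, int] = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return max(sorted(counts), key=counts.get)


def prefix_finder(seqs: List[str]) -> str:
    """Toy stand-in for a motif finder: most frequent 3-letter prefix."""
    return _most_common([s[:3] for s in seqs])


def suffix_finder(seqs: List[str]) -> str:
    """Second toy stand-in: most frequent 3-letter suffix."""
    return _most_common([s[-3:] for s in seqs])


def run_template(sequences: List[str], find_motif: MotifFinder) -> dict:
    """Workflow template: the cleaning and reporting steps are fixed,
    while the motif-finding step is bound only when the template runs."""
    cleaned = [s.strip().upper() for s in sequences if s.strip()]
    return {"n_sequences": len(cleaned), "motif": find_motif(cleaned)}


def compare(sequences: List[str], candidates: Dict[str, MotifFinder]) -> dict:
    """Run the same template once per candidate so results sit side by side."""
    return {name: run_template(sequences, fn) for name, fn in candidates.items()}


if __name__ == "__main__":
    data = ["acgtt", "ACGAA", " ttgaa "]
    print(compare(data, {"prefix": prefix_finder, "suffix": suffix_finder}))
```

Running one template against several candidate algorithms is a small-scale version of the "compare results from parallel analyses" need noted above.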

[Figure: architecture of the ChIP-chip workflow environment. Layers: Experiment Workspace (setup, run, and manage), with Setup "protocol", Import/Export Data, Data Display and Visualization, and Run Experiment; Workflow Automation (configuration and execution support), with Configuration Management, Kepler Workflow Engine, Execution Management, Monitoring Support, and Provenance Tracking; Workflow Specification (workflow design and template creation); Component Specification (wrapping, integration, and creation of components). Steps: (1) select design template and configure; (2) generate optimized executable workflow. Repositories: experiment, provenance, design, component. External components and services: ChIP-chip Data Analysis (ChIPOTle, HMM, …), Motif Finding Algorithms (MEME, MDscan, …), Visualization Packages and Statistics Tools, Public Databases & Services (GenBank, David, TransFac, …). Users shown: Peggy ("biologist"), Mark, Tim. Figure from Bowers and McPhillips.]

Kepler C.O.R.E. proposal

• Development of Kepler CORE -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips
• Goals
  • Reliable
  • Independently extensible
  • Open architecture, open project
  • refactored build
  • more modular design
  • improved engineering practices
  • improved governance

Kepler C.O.R.E. -- Extensibility

Kepler C.O.R.E. -- Governance

Kepler C.O.R.E. -- Sustainability

• How does Kepler persist?
  • Now, via research grants
    • unsustainable for production purposes
  • Future
    • new models for financial support
      • support contracts?
      • extension contracts?
      • new science domains?
      • continued research dollars?
      • foundations?
    • exploring a 501(c)(3) organization that can sustain Kepler and similar open-source initiatives

Acknowledgements

• Funding
  • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, 0225676, and 0619060.
  • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
  • The Andrew W. Mellon Foundation
  • The Department of Energy
• Collaborators
  • NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas, University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• Kepler contributors
