The Kepler Project: Overview, Status, and Future Directions
Matthew B. Jones, on behalf of the Kepler Project team
National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara
The Kepler Project
• Goals
  • Produce an open-source scientific workflow system
    • enable scientists to design scientific workflows and execute them
  • Support scientists in a variety of disciplines
    • e.g., biology, ecology, astronomy
  • Important features
    • access to scientific data
    • flexible means for executing complex analyses
    • enable use of Grid-based approaches to distributed computation
    • semantic models of scientific tasks
    • effective UI for workflow design
Kepler Collaboration
• Open-source
  • Builds on Ptolemy II
• Collaborators
  • SEEK Project
  • SciDAC SDM Center
  • Ptolemy Project
  • GEON Project
  • ROADNet Project
  • Resurgence Project
• Goals
  • Create powerful analytical tools that are useful across disciplines
  • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …
Usage statistics
• Source code access
  • 154 people accessed source code
  • 30 members have write permission
• Kepler downloads
  • Total = 9204
  • Beta = 6675
  • [Chart: downloads by platform; red = Windows, blue = Macintosh]
• Projects using Kepler:
  • SEEK (ecology)
  • SciDAC (molecular bio, ...)
  • CPES (plasma simulation)
  • GEON (geosciences)
  • CiPRes (phylogenetics)
  • CalIT2
  • ROADnet (real-time data)
  • LOOKING (oceanography)
  • CAMERA (metagenomics)
  • Resurgence (computational chemistry)
  • NORIA (ocean observing CI)
  • NEON (ecology observing CI)
  • ChIP-chip (genomics)
  • COMET (environmental science)
  • Cheshire Digital Library (archival)
  • Digital preservation (DIGARCH)
  • Cell Biology (Scripps)
  • DART (X-Ray crystallography)
  • Ocean Life
  • Assembling the Tree of Life project
  • Processing Phylodata (pPOD)
  • FermiLab (particle physics)
Kepler advances
• Data and Actor search
• Kepler Archive (KAR) format
• EarthGrid data access system
• Kepler Component Library
• Integrated support for LSID identifiers for all objects
• Object Manager and cache
• Web service execution
• RExpression & MatlabExpression actors
• Redesigned user interface
• Authentication subsystem
• Null-value handling
SWDB Aug 29, 2004
More advances
• Documentation
• Collection-oriented workflows (COMAD)
• Domain-specific actors for case studies
  • e.g., GARP, phylogenetics actors
• Provenance system
• Grid computing support
  • NIMROD, Globus, ssh, ...
• Semantics support
  • annotation, search, workflow validation, integration
Distributed execution
• Opportunities for parallel execution
  • Fine-grained parallelism
  • Coarse-grained parallelism
    • Few or no cycles
    • Limited dependencies among components
    • ‘Trivially parallel’
  • Many science problems fit this mold
    • parameter sweep, iteration of stochastic models
• Current ‘plumbing’ approaches to distributed execution
  • workflow acts as a controller
    • stages data resources
    • writes job description files
    • controls execution of jobs on nodes
  • requires expert understanding of the Grid system
• Scientists need to focus on just the computations
  • try to avoid plumbing as much as possible
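To make the coarse-grained, "trivially parallel" case concrete, here is a minimal Python sketch of a parameter sweep over a stochastic model. It is an illustration only, not Kepler code: `run_model`, `parameter_sweep`, and the toy growth model are hypothetical names invented for this example, and a local thread pool stands in for whatever Grid or cluster scheduler would actually run the jobs.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def run_model(growth_rate, seed=0, steps=50):
    """One independent run of a toy stochastic growth model (hypothetical)."""
    rng = random.Random(seed)  # private generator, so runs share no state
    population = 100.0
    for _ in range(steps):
        noise = rng.uniform(-0.01, 0.01)
        population *= 1.0 + growth_rate + noise
    return population


def parameter_sweep(growth_rates, max_workers=4):
    """Run the model once per parameter value; no run depends on another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with growth_rates
        results = list(pool.map(run_model, growth_rates))
    return dict(zip(growth_rates, results))


if __name__ == "__main__":
    for rate, final in parameter_sweep([0.00, 0.01, 0.02, 0.03]).items():
        print(f"growth_rate={rate:.2f} -> final population {final:.1f}")
```

Because the runs have no dependencies among them, the only coordination needed is collecting the results. In this sketch the pool is the only piece of "plumbing", and it is the part a workflow system would ideally hide from the scientist.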
Distributed Kepler
• Higher-order component for executing a model on one or more remote nodes
• Master and slave controllers handle setup and communication among nodes, and establish data channels
• Extremely easy for scientists to use
  • requires no knowledge of grid computing systems
[Figure labels: IN, OUT, Master Controller, Slave Controller]
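The idea of a higher-order component with master and slave controllers can be sketched in a few lines of Python. This is a loose analogy under stated assumptions, not the Distributed Kepler implementation: `distributed` and `_slave_controller` are hypothetical names, local threads stand in for remote nodes, and in-process queues stand in for the data channels.

```python
import queue
import threading

_STOP = object()  # sentinel telling a slave controller to shut down


def _slave_controller(model, inbox, outbox):
    """Pull (index, token) pairs from the channel, apply the model, send results back."""
    while True:
        item = inbox.get()
        if item is _STOP:
            break
        index, token = item
        outbox.put((index, model(token)))


def distributed(model, nodes=3):
    """Higher-order component: wrap `model` so callers only see tokens in, tokens out."""
    def run(tokens):
        tokens = list(tokens)
        inbox, outbox = queue.Queue(), queue.Queue()  # the "data channels"
        # Master controller: set up the workers (threads stand in for remote nodes)
        workers = [
            threading.Thread(target=_slave_controller, args=(model, inbox, outbox))
            for _ in range(nodes)
        ]
        for worker in workers:
            worker.start()
        for pair in enumerate(tokens):
            inbox.put(pair)
        for _ in workers:
            inbox.put(_STOP)  # one shutdown signal per worker
        for worker in workers:
            worker.join()
        # Reassemble results in input order, whichever worker produced them
        results = [None] * len(tokens)
        while not outbox.empty():
            index, value = outbox.get()
            results[index] = value
        return results
    return run


if __name__ == "__main__":
    square_remotely = distributed(lambda x: x * x)
    print(square_remotely([1, 5, 2]))
```

The caller supplies only the model and the input tokens. Setup, communication, and teardown stay inside the wrapper, which is the sense in which such a component would require no knowledge of the underlying system.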
Data Management
• Need for integrated management of external data
  • EarthGrid access is partial, needs refactoring
  • Include other data sources, such as JDBC, OpeNDAP, etc.
  • Data needs to be a first-class object in Kepler, not just represented as an actor
  • Need support for data versioning to support provenance
• Need to pass data by reference
  • workflows contain large data tokens (100s of megabytes)
  • intelligent handling of unique identifiers (e.g., LSID)
[Figure labels: A, B, {1,5,2}]
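As a rough illustration of passing data by reference, the following Python sketch has actors exchange a small token that names a data object, while the bytes stay in a store until an actor needs them. Everything here is an assumption made for the example: `DataStore`, `ReferenceToken`, and the `urn:example:` identifier scheme are hypothetical and are not Kepler, EarthGrid, or LSID APIs.

```python
import hashlib


class DataStore:
    """In-memory stand-in for an external data repository (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes) -> str:
        # Content-derived identifier; a real system might issue an LSID instead.
        identifier = "urn:example:" + hashlib.sha256(payload).hexdigest()[:16]
        self._objects[identifier] = payload
        return identifier

    def get(self, identifier: str) -> bytes:
        return self._objects[identifier]


class ReferenceToken:
    """A small token that names a data object instead of carrying its bytes."""

    def __init__(self, identifier: str, store: DataStore):
        self.identifier = identifier
        self._store = store

    def resolve(self) -> bytes:
        # The payload is fetched only when an actor actually needs it.
        return self._store.get(self.identifier)


def producer(store: DataStore) -> ReferenceToken:
    payload = bytes(10_000_000)  # stands in for a dataset of hundreds of megabytes
    return ReferenceToken(store.put(payload), store)


def router(token: ReferenceToken) -> ReferenceToken:
    # Intermediate actors can forward the reference without touching the data.
    return token


def consumer(token: ReferenceToken) -> int:
    return len(token.resolve())


if __name__ == "__main__":
    store = DataStore()
    token = producer(store)
    print(token.identifier, consumer(router(token)))
```

Deriving the identifier from the content is one possible design choice. It means a modified dataset gets a new identifier, which is a simple way to approach the versioning that provenance tracking needs.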
New projects: REAP
• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System
• Extend Kepler to:
  • Manage and monitor sensor networks
  • Consume data from sensors
  • Integrate sensor data handling with data archive handling
• Terrestrial ecology and oceanography use cases
PIs: Jones, Altintas, Estrin, Seabloom, Gallagher, Cornillon, Hosseini, Ludäscher, Schildhauer, Reichman, Baru, Potter, Borer
Institutions: UCSB, UCD, UCSD, UCLA, OSU, OpeNDAP
REAP breakdown
New projects: ChIP-chip
• A Collaborative Scientific Workflow Environment for Accelerating Genome-Scale Biological Research
  • CS/IT: Ludäscher, Bowers, McPhillips
  • Bio: Peggy Farnham, Mark Bieda
• Integrate a web-based "experiment workspace" environment with a flexible scientific workflow system
• Support rapid prototyping and easy addition of new "methods"
• Support client-server (i.e., enterprise) deployments for group/lab-wide collaboration
• templates
  • details of key steps are left out until runtime, then late binding of one or more specific algorithms or data
  • which is the best motif-finding algorithm?
  • parts of workflow have similar sets of steps
  • need to compare results from parallel analyses
  • different people have different roles [Software dev (Tim), Bioinformatics specialist (Mark), Biologist (Peggy)]
[Figure from Bowers and McPhillips: architecture of the ChIP-chip workflow environment. Recoverable labels:
  Layers: Experiment Workspace (setup, run, and manage); Workflow Automation (configuration and execution support); Workflow Specification (workflow design and template creation); Component Specification (wrapping, integration, and creation of components)
  Workspace functions: Setup “protocol”; Import/Export Data; Data Display, Visualization; Run Experiment
  Automation functions: Configuration Management; Execution Management; Monitoring Support; Provenance Tracking; Kepler Workflow Engine
  Steps: (1) select design template and configure; (2) generate optimized executable workflow
  Roles: Peggy (“biologist”), Mark, Tim
  Repositories: experiment, provenance, design, component
  External components and services: ChIP-chip Data Analysis (ChIPOTle, HMM, …); Motif Finding Algorithms (MEME, MDscan, …); Visualization Packages and Statistics Tools; Public Databases & Services (GenBank, David, TransFac, …)]
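The template idea from the ChIP-chip slide (fix the common steps, leave a key step open, bind a specific algorithm at run time, then compare the parallel results) can be sketched as below. This is a toy under stated assumptions: `make_template`, `compare`, and the two k-mer counters are hypothetical stand-ins written for this example. They are not MEME, MDscan, or any real motif finder, and none of this is Kepler's template mechanism.

```python
from collections import Counter


def most_frequent_kmer(sequences, k=3):
    """Toy 'motif finder' A: the k-mer seen most often across all sequences."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    return counts.most_common(1)[0][0]


def most_shared_kmer(sequences, k=3):
    """Toy 'motif finder' B: the k-mer present in the largest number of sequences."""
    counts = Counter()
    for seq in sequences:
        # A set counts each k-mer at most once per sequence
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts.most_common(1)[0][0]


def make_template(load, summarize):
    """Workflow template: the fixed steps are given now, the motif finder is left open."""
    def bind(find_motif):
        def workflow(source):
            sequences = load(source)
            return summarize(find_motif(sequences))
        return workflow
    return bind


def compare(bind, candidates, source):
    """Bind each candidate into the same template and collect results side by side."""
    return {name: bind(algorithm)(source) for name, algorithm in candidates.items()}


if __name__ == "__main__":
    bind = make_template(load=str.split, summarize=lambda motif: f"motif={motif}")
    candidates = {"frequency": most_frequent_kmer, "sharing": most_shared_kmer}
    print(compare(bind, candidates, "AAAAAA CGTCC TCGTA"))
```

On this small input the two stand-ins disagree, which is the situation the slide's question about the best motif-finding algorithm points at: the template stays the same and only the bound step changes.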
Kepler C.O.R.E. proposal
• Development of Kepler CORE -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips
• Goals
  • Reliable
  • Independently extensible
  • Open architecture, open project
  • refactored build
  • more modular design
  • improved engineering practices
  • improved governance
Kepler C.O.R.E. -- Extensibility
Kepler C.O.R.E. -- Governance
Kepler C.O.R.E. -- Sustainability
• How does Kepler persist?
  • Now, via research grants
    • unsustainable for production purposes
  • Future
    • new models for financial support
      • support contracts?
      • extension contracts?
      • new science domains?
      • continued research dollars?
      • foundations?
    • exploring a 501(c)(3) organization that can sustain Kepler and similar open-source initiatives
Acknowledgements
• Funding
  • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, 0225676, and 0619060.
  • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
  • The Andrew W. Mellon Foundation
  • The Department of Energy
• Collaborators
  • NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas, University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• Kepler contributors