The California Institute for Telecommunications and

The CAMERA Project

Metagenomics 2006 Oct 3-5, 2006 Paul Gilna, Calit2, UCSD

The CAMERA Partnership Community

Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis

Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…

• GenBank: 100 Billion Bases! (www.ncbi.nlm.nih.gov/Genbank)
• Protein Data Bank: 35,000 Structures (www.rcsb.org/pdb/holdings.html)
• Total Data < 1TB

The Sargasso Sea Experiment The Power of Environmental Metagenomics

• Yielded a Total of Over 1 Billion Base Pairs of Non-Redundant Sequence
• Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms
• Sequences from at Least 1800 Genomic Species, Including 148 Previously Unknown
• Identified over 1.2 Million Unknown Genes

J. Craig Venter, et al. Science 2 April 2004: Vol. 304. pp. 66 - 74

[Image: MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003]

Full Genome Sequencing is Exploding: Most Sequenced Genomes are Bacterial

[Figure: Completed Genomes and Ongoing Genomes, broken down as Archaeal, Bacterial, and Eukaryal. Chart labels: "First Genome 1995"; "6 Genomes/Year 2000"; "Moore 155 In Here"; "55 Metagenomes"; totals shown: 422 and 1665. Source: www.genomesonline.org]

Moore Microbial Genome Sequencing Project Selected Microbes Throughout the World’s Oceans Microbes Nominated by Leading Ocean Microbial Biologists www.moore.org/microgenome/worldmap.asp

Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute

Marine Genome Sequencing Project Measuring the Genetic Diversity of Ocean Microbes

Metagenomics Will Couple to Earth Observations

[Figure: stacked bar chart of EOSDIS archive holdings by instrument and calendar year; vertical axis 0 to 8,000. Mission markers: Terra EOM Dec 2005, Aqua EOM May 2008, Aura EOM Jul 2010.]

Instruments shown: Other EOS, HIRDLS, MLS, TES, OMI, AMSR-E, AIRS-is, GMAO, MOPITT, ASTER, MISR, V0 Holdings, MODIS-T, MODIS-A

Other EOS = ACRIMSAT, Meteor 3M, Midori II, ICESat, SORCE

NOTE: Data remains in the archive pending transition to LTA

Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group, January 6-7, 2005

Driven by User Needs

CAMERA serves as one representation of a specific research community’s need for a system to

– Collect and reference increasing metadata relevant to environmental metagenome datasets
– Exploit the power of querying on metadata across multiple geospatial locations
– Have access to a diverse and customizable set of easy-to-use tools to analyze their data
– Have the ability to add, update and propagate improvements to annotations
– Have a pre-publication, pre-submission collaborative workspace
– Serve diverse levels of informatics literacy
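The idea of querying on metadata across multiple geospatial locations can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Sample` record, its field names, the `query` helper, and the sample values are invented and do not describe CAMERA's actual schema or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; field names are illustrative only."""
    sample_id: str
    lat: float      # degrees north
    lon: float      # degrees east
    depth_m: float  # sampling depth in metres


def in_box(s, lat_min, lat_max, lon_min, lon_max):
    """True if the sample lies inside the latitude/longitude bounding box."""
    return lat_min <= s.lat <= lat_max and lon_min <= s.lon <= lon_max


def query(samples, boxes, max_depth_m):
    """IDs of samples inside any of the boxes and no deeper than max_depth_m."""
    return [
        s.sample_id
        for s in samples
        if s.depth_m <= max_depth_m and any(in_box(s, *b) for b in boxes)
    ]


# Invented demo values, not real survey data.
SAMPLES = [
    Sample("demo-01", 31.5, -64.0, 5.0),
    Sample("demo-02", 32.2, -64.5, 40.0),
    Sample("demo-03", 10.7, -80.3, 2.0),
    Sample("demo-04", -1.2, -90.4, 2.0),
]
# Two separate regions, each as (lat_min, lat_max, lon_min, lon_max).
BOXES = [(30.0, 33.0, -66.0, -63.0), (-2.0, 0.0, -92.0, -89.0)]

print(query(SAMPLES, BOXES, max_depth_m=10.0))  # ['demo-01', 'demo-04']
```

A real system would push such filters into a database with spatial indexing; the in-memory list is used only to keep the sketch self-contained.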

Services Provided

• Data and Application Services
• Tools and Workflows
• Computational Data, Visualization and Collaborative environment
• Outreach and Training in Environmental Genomics

Data and Application Services

• Primary Data
  – Sargasso Sea and Sorcerer II expedition data
  – JGI marine & terrestrial environmental datasets
  – Moore Microbial Genomes
  – JGI and other relevant whole genomes
  – Research community submitted datasets
  – Submitted 454-based metagenomic datasets
  – Publicly available NR protein and DNA sequence datasets
• Derived Data
  – Annotations of datasets
  – Assemblies
  – Alignments
  – Pre-computed clusters

Sample Metadata from GOS

• Site Metadata
  – Location (lat/long, water depth)
  – Site characterization (finite list of types plus “other”)
  – Site description (free text)
  – Country
• Sampling Metadata
  – Sample collection date/time
  – Sampling depth
  – Conditions at time of sampling (e.g., stormy, surface temperature)
  – Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)
  – “author”
• Experimental Parameters
  – Filter size
  – Insert size
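The three metadata groups listed above (site, sampling, experimental parameters) map naturally onto a small record structure. The following is a sketch under stated assumptions, not the GOS or CAMERA schema: all class names, field names, the `SiteType` members, and the example values are hypothetical, and the units of filter size and insert size are left as free text because the slide does not state them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class SiteType(Enum):
    # Hypothetical members: the slide only says "finite list of types plus other".
    OPEN_OCEAN = "open ocean"
    COASTAL = "coastal"
    OTHER = "other"


@dataclass
class SiteMetadata:
    latitude: float       # degrees north
    longitude: float      # degrees east
    water_depth_m: float
    site_type: SiteType
    description: str      # free text
    country: str

    def __post_init__(self):
        # Reject coordinates outside the valid geographic ranges.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


@dataclass
class SamplingMetadata:
    collected_at: datetime
    sampling_depth_m: float
    conditions: str                                    # e.g. "stormy"
    measurements: dict = field(default_factory=dict)   # e.g. {"T_degC": 20.0}
    author: str = ""


@dataclass
class ExperimentalParameters:
    filter_size: str  # kept as text; units are not given on the slide
    insert_size: str


@dataclass
class SampleRecord:
    site: SiteMetadata
    sampling: SamplingMetadata
    experiment: ExperimentalParameters


# Invented example values for illustration only.
record = SampleRecord(
    site=SiteMetadata(31.5, -64.0, 4200.0, SiteType.OPEN_OCEAN, "demo site", "n/a"),
    sampling=SamplingMetadata(datetime(2003, 2, 22, 12, 0), 5.0, "calm",
                              {"T_degC": 20.0, "S_ppt": 36.5}),
    experiment=ExperimentalParameters("demo filter", "demo insert"),
)
print(record.site.site_type.value)  # open ocean
```

Validating coordinates at construction time keeps bad locations out of later geospatial queries; the free-form `measurements` dictionary reflects the open-ended "etc." in the slide's list of physical and chemical measurements.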

Tools and Workflows

• Initial set
  – BLAST Server
  – Clustering
  – HMM/Profile
  – Neighborhood analysis
  – Multiple sequence alignments
  – Assembly
• Proposed New Tools
  – Multiple Auto Annotation pipelines
  – Fast Sequence lookup
  – Customized Assembly
  – Phylogenetic Analysis
  – Clustering Tools

Guiding Philosophy for Development

• Sprint: Q4 2006
  – Propagate JCVI toolkit and data ASAP
  – Mechanism for publication of Sorcerer II data
  – Enabler for community
  – Defined deliverables, project management approach
• Marathon: Q4 2006 onward
  – Additional Datasets
  – Additional tools
  – Community drives prioritization for ongoing releases
  – Advisory Board, Community Outreach
• Keys to success:
  – Tight integration of science, bioinformatics, software, and IT
  – Matched to Community Needs

The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex

• First Implementation of the CAMERA Complex
• Major Buildout of Calit2 Server Room Underway

http://calit2-1101-1.ucsd.edu/

Photo Courtesy Joe Keefe, Calit2

Moore CAMERA Production Environment

• Creation of Initial Production Environment (September 2006)
  – Hardware
    – Compute Nodes: ~200 4 CPU Nodes = ~800 Processing Cores
    – Storage Servers: 10 systems = ¼ Petabyte raw storage
    – Database Servers: Larger 20-40TB; Smaller 5-10TB
    – Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports
• User Access to Compute Cycles
  – Bulk of free cycles available to external users
  – Proposal mechanism in process

Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
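The capacity figures on this slide can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, assuming one processing core per CPU, decimal units (1 PB = 1000 TB), and storage divided evenly across the ten systems; none of these assumptions is stated on the slide, and the node count is approximate.

```python
# Figures taken from the slide (approximate).
NODES = 200            # "~200 4 CPU Nodes"
CPUS_PER_NODE = 4
STORAGE_SYSTEMS = 10   # "10 systems = 1/4 Petabyte raw storage"

# Assumption: one processing core per CPU.
total_cores = NODES * CPUS_PER_NODE

# Assumption: decimal units, so 1/4 PB = 250 TB, split evenly across systems.
raw_storage_tb = 1000 / 4
tb_per_system = raw_storage_tb / STORAGE_SYSTEMS

print(total_cores)    # 800
print(tb_per_system)  # 25.0
```

Under these assumptions each storage server holds about 25 TB raw, which is in the same range as the "Larger 20-40TB" database servers listed on the slide.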

Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems

Visualization courtesy of Bob Patterson, NCSA.

www.glif.is

Created in Reykjavik, Iceland 2003

CAMERA Outreach Modes

• Scientific Advisory Board
  – Early Adopters
  – OptIPortal End Points
• Targeted Workshops
  – User Forums
  – User Software Testing
  – Viz Tool Brainstorming
• Presentations at Scientific Meetings
  – Talks, posters, eventually demonstration booths
• Partnerships With Metagenomics Projects
  – E.g. DoE’s Joint Genome Institute (JGI)
• Training and User Services Team

A Near Future Metagenomics Fiber Optic-Enabled Data Generator

Source: John Delaney, UWash