National Center for Genome Analysis Support: Engaging


Genomics, Transcriptomics, and
Proteomics: Engaging Biologists
Richard LeDuc
Manager, NCGAS
eScience, Chicago 10/8/2012
Central Dogma of Molecular Biology
• DNA replicates itself
• DNA is transcribed to RNA
• RNA is translated to protein
[Figure: DNA → mRNA → Protein]
Central Dogma of Molecular Biology
• DNA: Genomics
• mRNA: Transcriptomics
• Protein: Proteomics
Tools of the Trade
Instruments
• Next-Generation Sequencers
  • Illumina
  • 454
  • PacBio
• Mass Spectrometers
  • 5 kinds of mass analyzers
  • Hybrid analyzers + separation technology
Techniques
• Genome assembly
• RNA-sequencing
• ChIP-sequencing
• Methyl-sequencing
• Shotgun bottom-up proteomics
• 2D gel proteomics
• Top-down proteomics
Zhao et al. BMC Bioinformatics 2011, 12(Suppl 14):S2
http://www.biomedcentral.com/1471-2105/12/S14/S2
Figure © Vincent Montoya / wikipedia
Analysis as Data Reduction
Instrument Data
• Proteomics: Shotgun Bottom-up
  • 3.4 GB of instrument data
  • 172 MB (x1/20) of unstructured files (5,219 files in 67 folders)
  • 13 MB of publishable results (x1/260)
  • Improved technology increases the size of the instrument files, but not usually the intermediate or final file sizes.
• DNA Sequencing
  • Often on the order of x1/2500 from start to finish
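The reduction factors quoted for the proteomics example can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 GB = 1000 MB), and the names STAGES_MB, reduction_factors, and report are made up for this example. The computed ratios (about 1/20 and 1/262) agree with the rounded figures on the slide.

```python
# Sizes quoted on the slide, in megabytes.
# Assumption: decimal units, so 3.4 GB is treated as 3400 MB.
STAGES_MB = [
    ("instrument data", 3400.0),
    ("unstructured intermediate files", 172.0),
    ("publishable results", 13.0),
]


def reduction_factors(stages):
    """Return (name, size_mb, factor), where factor = raw size / stage size."""
    raw_size = stages[0][1]
    return [(name, size, raw_size / size) for name, size in stages]


def report(stages):
    """Format each stage as 'name: size MB (x1/factor)'."""
    return [
        f"{name}: {size:.0f} MB (x1/{factor:.0f})"
        for name, size, factor in reduction_factors(stages)
    ]


if __name__ == "__main__":
    for line in report(STAGES_MB):
        print(line)
```

The same function could be fed the sizes of a sequencing workflow to compare against the x1/2500 figure, but the slide gives no stage sizes for that case, so none are assumed here.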
Options for Computational Support
Compute at the Instrument
• Supercomputer in a box
  • Many commercial vendors are entering with turn-key solutions to specific problems.
  • Limited variety of analytic expertise.
• Build Your Own Computational Center
  • A rack or two, a few servers, and you are good to go.
  • Only a subset of HPC skills is present in staff.
Computer Centers
• Biologists
  • Each has to learn to work with existing systems.
  • Few have specialized in HPC.
• Computer center
  • Support for hundreds of small projects.
  • Each project has different needs.
• Funded by National Science Foundation
  1. Large memory clusters for assembly
  2. Bioinformatics consulting for biologists
  3. Optimized software for better efficiency
• Open for business at: http://ncgas.org
Making it easier for Biologists
[Figure: Computational Skills, from LOW (common) to HIGH (rare)]
• Web interface to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction.
GALAXY.NCGAS.ORG Model
• NCGAS establishes tools, hardens them, and moves them into production.
• Virtual box hosting Galaxy.ncgas.org
• Individual projects can get duplicate boxes, provided they support it themselves.
• The host for each tool is configured individually.
• Custom Galaxy tools can be made for moving data.
• Policies on the DC guarantee that untouched data is removed with time.
[Diagram labels: Quarry, Mason, Archive, Data Capacitor]
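A policy that removes untouched data over time can be pictured as a periodic scan for files whose last access is older than some window. The sketch below is an assumption-laden illustration, not the Data Capacitor's actual mechanism: the slide gives no retention window, so max_idle_days is a hypothetical parameter, and the function names find_untouched and purge are invented here. It also relies on the file system recording access times, which some mount options disable.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_untouched(root, max_idle_days, now=None):
    """Return files under root not accessed within max_idle_days (hypothetical window)."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    stale = []
    for path in Path(root).rglob("*"):
        # st_atime is the last access time reported by the file system.
        if path.is_file() and path.stat().st_atime < cutoff:
            stale.append(path)
    return sorted(stale)


def purge(root, max_idle_days, dry_run=True, now=None):
    """List the untouched files and, unless dry_run is set, delete them."""
    stale = find_untouched(root, max_idle_days, now=now)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to dry_run=True is a deliberate choice in this sketch, so that a first call only reports what would be removed.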
NCGAS Sandbox Demo at SC 11
• STEP 1: data preprocessing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection to scan the alignment result for new polymorphisms
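The three steps can be illustrated with a toy pipeline. This is a deliberately simplified sketch and not the software used in the demo, which the slide does not name: preprocessing is reduced to a mean-quality filter, alignment to an ungapped scan allowing a few mismatches, and SNP detection to a majority vote per position. All function names, thresholds, and the example sequences are invented for illustration.

```python
from collections import Counter


def preprocess(reads, min_mean_quality=20):
    """STEP 1 (toy): keep reads whose mean Phred quality meets the threshold."""
    kept = []
    for seq, quals in reads:
        if quals and sum(quals) / len(quals) >= min_mean_quality:
            kept.append(seq)
    return kept


def align(read, reference, max_mismatches=1):
    """STEP 2 (toy): ungapped scan; return the best offset, or None if none fits."""
    best_offset, best_mismatches = None, max_mismatches + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset


def call_snps(reads, reference, min_depth=2):
    """STEP 3 (toy): report positions where the majority base differs from the reference."""
    pileup = {}
    for read in reads:
        offset = align(read, reference)
        if offset is None:
            continue
        for i, base in enumerate(read):
            pileup.setdefault(offset + i, Counter())[base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGGCATC"  # made-up reference sequence
    raw_reads = [
        ("TGCATGCT", [30] * 8),  # carries A->T at reference position 8
        ("CATGCTTA", [32] * 8),  # carries the same variant
        ("TAGGCATC", [35] * 8),  # matches the reference exactly
        ("ACGTTGCA", [5] * 8),   # low quality, dropped in STEP 1
    ]
    clean = preprocess(raw_reads)
    print(call_snps(clean, reference))
```

Production workflows replace each step with dedicated tools and handle gaps, paired reads, and base-quality-aware variant calling, none of which this sketch attempts.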
Moving Forward
[Diagram: sequencing centers connected to NCGAS resources]
• Your Friendly Neighborhood Sequencing Center (shown three times)
• NCGAS Mason (Free for NSF users)
• Data Capacitor
• NO data storage charges
• Lustre WAN File System
• Globus On-line and other tools
• Network links: 10 Gbps and 100 Gbps
• Optimized Software
• IU POD (12 cents per core hour)
• Other NCGAS XSEDE Resources…
How would this work at scale?
1. Biologists use Galaxy and other web portals to
move data and execute workflows
2. Instrument data transferred across Internet2
3. Data Capacitor flows data into Mason or other
computational clusters
4. Data reduction allows “compute in place” to work
5. Data Capacitor mounts or mirrors reference data
from NCBI or other sources
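To get a feel for step 2, the ideal transfer time for a data set can be estimated from its size and the link speed. The sketch below is a back-of-the-envelope calculation only: it uses the 3.4 GB instrument file and the 10 Gbps and 100 Gbps link speeds mentioned on earlier slides, assumes decimal units, and ignores protocol overhead, disk speed, and contention unless an efficiency factor is supplied. The function name and the efficiency parameter are hypothetical.

```python
def transfer_seconds(size_bytes, link_gbps, efficiency=1.0):
    """Ideal time in seconds to move size_bytes over a link of link_gbps gigabits/s.

    efficiency is a hypothetical fraction (0 < efficiency <= 1) of the nominal
    link speed that is actually achieved.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency)


if __name__ == "__main__":
    instrument_file = 3.4e9  # 3.4 GB, decimal units assumed
    for gbps in (10, 100):
        seconds = transfer_seconds(instrument_file, gbps)
        print(f"{gbps} Gbps: {seconds:.2f} s")
```

Real transfers rarely reach the nominal rate, so passing an efficiency below 1.0 gives a more cautious estimate.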
In Sum…
• Modern molecular biology, specifically the omics such as genomics, transcriptomics, and proteomics, provides many tools for answering many questions, but no single solution meets all needs.
• The amount of data generated decreases along a workflow. This has implications for both storage and analysis.
• NCGAS can provide a national-scale infrastructure to better serve the needs of biologists who cannot become bioinformaticians to accomplish their research.
• Increasingly specialized skills are needed to provide best-practice solutions at all steps in a workflow.
Thank You
Questions?
Bill Barnett
Rich LeDuc
Le-Shin Wu
Carrie Ganote
NCGAS Cyberinfrastructure at IU
• Mason large memory cluster (512 GB/node)
• Quarry cluster (16 GB/node)
• Data Capacitor (1 PB at 20 Gbps throughput)
• Research File System (RFS) for data storage
• Research Database Cluster for managing data
sets.
• All interconnected with a high speed internal
network (40 Gbps)
Acknowledgements & disclaimer
• This material is based upon work supported by the National Science Foundation under Grant No. ABI-1062432.
• This work was supported in part by the Lilly Endowment, Inc. and the Indiana University Pervasive Technology Institute.
• Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies.
License terms
• Please cite as: LeDuc, R.D., Genomics, Transcriptomics, and Proteomics: Engaging Biologists, presented at Extending High-Performance Computing Beyond its Traditional User Communities, Co-located with the 8th IEEE International Conference on eScience, Chicago, USA, October 8, 2012. Available from: http://hdl.handle.net/2022/14746
• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, contents of this presentation are copyright 2011 by the Trustees of Indiana University.
• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.