National Center for Genome Analysis Support Leverages XSEDE
Resources to Support Life Scientists
William K. Barnett, Ph.D. (Director)
Richard LeDuc, Ph.D. (Manager)
National Center for Genome Analysis Support
XSEDE 2013, San Diego CA, 7/23/2013
Summary
• What is NCGAS?
• What do we do?
• How do we do it?
I will assume you know more about HPC than biology.
National Center for Genome Analysis Support: http://ncgas.org
• Funded by National Science Foundation
1. Large memory clusters for assembly
2. Bioinformatics consulting for biologists
3. Optimized software for better efficiency
• Collaboration across IU, TACC, SDSC, and PSC.
• Open for business at: http://ncgas.org
Making it easier for Biologists
• Web interface to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction.
[Diagram: Computational Skills, ranging from LOW (Common) to HIGH (Rare)]
We Provide:
• Large RAM computational resources
• Appropriate storage
• Data transport assistance
• IT (help-desk-like) Support
• Bioinformatic Consultation and support
The services announced today include:
• Storage of up to 50 terabytes of research data on IU's Scientific Data Archive tape storage system.
• Services for curation and long-term storage of data sets and final results from genome research in the IUScholarWorks…
• NCGAS will write letters of commitment for consulting, computation, and data storage resources to include with grant proposals …
Service Continuum
Focus
Staffing
• 3.7 FTE Direct Staff
1. I’m a biologist with 15 years of software engineering experience
2. PhD Computer Scientist
3. Full-time Bioinformatic Analyst
4. 50% of a PhD Genomicist
5. 20% of people above me
• But we have direct access to the rest of our partner supercomputing centers
NCGAS Cyberinfrastructure at IU
• Rockhopper: 11 servers with 48 cores and 128 GB RAM.
• Mason large memory cluster: 16 nodes with 32 cores each and 512 GB RAM per node.
• Data Capacitor: 1 PB at 20 Gbps throughput.
• SDA – 17(+) PB hierarchical tape archive
• Additional resources through our XSEDE partners
• XRAC allocation
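As a rough worked example of the figures above, the sketch below totals Mason's cores and RAM and estimates a best-case transfer time at the Data Capacitor's quoted 20 Gbps. The function names are illustrative only, and the transfer estimate assumes the full quoted rate is sustained with no protocol or file-system overhead, which real transfers will not achieve.

```python
# Hardware figures come from the slide; function names are illustrative.
MASON_NODES = 16
MASON_CORES_PER_NODE = 32
MASON_RAM_GB_PER_NODE = 512
DATA_CAPACITOR_GBPS = 20  # quoted throughput, gigabits per second


def mason_totals():
    """Return (total cores, total RAM in GB) across the Mason cluster."""
    return (MASON_NODES * MASON_CORES_PER_NODE,
            MASON_NODES * MASON_RAM_GB_PER_NODE)


def ideal_transfer_seconds(size_gb, rate_gbps=DATA_CAPACITOR_GBPS):
    """Best-case seconds to move size_gb (decimal gigabytes) at rate_gbps.

    ASSUMPTION: the link runs at the full quoted rate with zero overhead.
    """
    return size_gb * 8 / rate_gbps


cores, ram_gb = mason_totals()
print(cores, ram_gb)                 # 512 8192
print(ideal_transfer_seconds(1000))  # 400.0 (1 TB at 20 Gbps)
```

Treat the transfer figure as a lower bound when planning data movement from a sequencing center.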
Rockhopper
• Penguin Computing's Penguin-On-Demand (POD) supercomputing cloud appliance hosted by Indiana University.
• A collaborative effort between Penguin Computing, IU, the University of Virginia, the University of California Berkeley, and the University of Michigan.
• Provides supercomputing cloud services in a secure US facility.
• Researchers at US institutions of higher education and Federally Funded Research and Development Centers (FFRDCs) can purchase computing time from Penguin Computing, and receive access via high-speed national research networks operated by IU.
Standardized Trinity Analyses
[Figure: Cost by Input Size for Trinity Jobs on POD@IU. x-axis: Size of Each Input File from Paired-End Library (0.0 GB to 8.0 GB); y-axis: cost ($0.00 to $50.00); series: Cost by Input Size, with a linear trend line.]
Who do we serve?
Zhao et al. BMC Bioinformatics 2011, 12(Suppl 14):S2. http://www.biomedcentral.com/1471-2105/12/S14/S2
Haas, B., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P., Bowden, J., Couger, M., Eccles, D., Li, B., Lieber, M., MacManes, M., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C., Henschel, R., LeDuc, R., Friedman, N., and Regev, A. (2013) De novo transcript sequence reconstruction from RNA-Seq using the Trinity platform for reference generation and analysis, Nature Protocols, in press.
The Project
• A project represents a single NSF grant, or similar for Server-on-Demand services.
• Projects are the natural organizational structures used in biology. Researchers know what projects they are on, who else is on the project, what the project is trying to accomplish, etc.
• Projects are frequently widely distributed.
Projects
“SUGAR”: Schaack lab Undergraduate Genome Analyses at Reed
Projects as of Early July 2013
How do we support Projects
Support varies by Project need
• Most projects just need access to the resources, and some technical support.
• Other projects require more staff interaction; up to and including intellectual contributions to the project.
• Several projects have utilized “private” Galaxy instances.
GALAXY.NCGAS.ORG Model
• NCGAS establishes tools, hardens them, and moves them into production.
• Virtual box hosting Galaxy.ncgas.org: individual projects can get duplicate VMs.
• The host for each tool is configured individually (diagram shows Quarry and Mason).
• Archive: each project can get 50 TB of archive space for raw data.
• Data Capacitor: policies guarantee that untouched data is removed with time.
Current System
[Diagram labels:]
• Your Friendly Neighborhood Sequencing Center (shown three times)
• NCGAS Mason (Free for NSF users)
• IU POD (12 cents per core hour)
• Data Capacitor
• NO data storage charges
• Lustre WAN File System
• Globus On-line and other tools
• Other NCGAS XSEDE Resources…
• Optimized Software
• Network links: 100 Gbps and 10 Gbps
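To make the IU POD rate of 12 cents per core hour concrete, here is a minimal cost sketch. It assumes billing is simply cores multiplied by wall-clock hours, with no minimum charge, rounding rule, or storage fee; the talk does not describe Penguin Computing's actual billing policy, and `pod_cost` is a hypothetical helper.

```python
from decimal import Decimal

POD_RATE_PER_CORE_HOUR = Decimal("0.12")  # 12 cents per core hour, from the slide


def pod_cost(cores, hours):
    """Estimated charge in dollars for holding `cores` cores for `hours` hours.

    ASSUMPTION: cost = cores * hours * rate, with no minimums or rounding.
    """
    return Decimal(cores) * Decimal(str(hours)) * POD_RATE_PER_CORE_HOUR


# Hypothetical job: 48 cores for 10 hours.
print(pod_cost(48, 10))  # 57.60
```

`Decimal` is used instead of floats so that currency amounts print exactly.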
Future Direction
NCGAS gives back to XSEDE
• NCGAS is a Tier 2 XSEDE Partner.
• XSEDE allocations are available on Mason (our large RAM cluster).
• 300,000 SUs on 0.5 TB RAM nodes with NCGAS support software.
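For a sense of scale, the sketch below converts the 300,000 SU allocation into whole-node hours on Mason's 32-core nodes. The talk does not define an SU for Mason, so the conversion of one SU to one core hour is purely an assumption made for illustration, as is the helper name `node_hours`.

```python
ALLOCATION_SUS = 300_000   # from the slide
CORES_PER_NODE = 32        # Mason node size, from the slide
CORE_HOURS_PER_SU = 1      # ASSUMPTION: the talk does not state this


def node_hours(sus, cores_per_node=CORES_PER_NODE,
               core_hours_per_su=CORE_HOURS_PER_SU):
    """Whole-node hours obtainable from `sus` service units."""
    return sus * core_hours_per_su / cores_per_node


print(node_hours(ALLOCATION_SUS))  # 9375.0
```

If the real exchange rate differs, only `CORE_HOURS_PER_SU` needs to change.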
Thank You
Questions?
Bill Barnett
Rich LeDuc
Le-Shin Wu
Carrie Ganote
Tom Doak