High Performance Data Management and Computational Architectures for Genomics Research at National and International Scales William K. Barnett, Ph.D.
Richard LeDuc, Ph.D.
National Center for Genome Analysis Support
Bio-IT World Asia, June 7, 2012
Summary
• Changing genomics analytical needs
• NCGAS and its mission
• NCGAS cyberinfrastructure
• The 100 Gigabit demonstration
• Scaling genomics analysis
• The NCGAS research model
• Outcomes for life sciences research
National Center for Genome Analysis Support: http://ncgas.org
Changing genomics analytical needs
• Next Gen sequencers are generating more data and getting cheaper
• Sequencing is:
  • Becoming commoditized at large centers
  • Multiplying at individual labs
• Analytical capacity has not kept up:
  • Bioinformatics support
  • Computational support (thousand points solution)
  • Storage support
NCGAS widening the analytical bottleneck
• Funded by the National Science Foundation (grant # ABI 1062432)
• Large memory clusters for assembly
• Bioinformatics consulting for biologists
• Optimized software for better efficiency
• Providing services at: http://ncgas.org
Making it easier for Biologists
• Galaxy interface provides a “user friendly” window to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction
[Figure: Computational Skills scale running from LOW (labeled Common) to HIGH (labeled Rare)]
NCGAS Service Model
NEED (stack shown in the diagram):
• Bioinformatics
• Applications
• Services Layer, APIs
• OS Layer
• Hardware Layer
• Network Layer
• Public Cloud Providers (also labeled in the diagram)
NCGAS (offerings shown in the diagram):
• Expert Consulting
• Hardened Applications and Workflows
• Galaxy, Parallelization
• Systems Administration
• Mason (512 GB/node)
• 100 Gbps I2
NCGAS Galaxy Applications Model
• Virtual box hosting Galaxy.NCGAS.org: the host for each tool is configured to meet national needs
• Virtual box hosting Galaxy.Indiana.edu: the host for each tool is configured to meet IU needs
• Custom site hosting Galaxy.YourSite.???: the host for each tool is configured to meet your needs
Resources shown: Quarry, Mason, RFS, Data Capacitor
NCGAS Workflow Demo at SC 11
• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection to scan the alignment result for new polymorphisms
Sites shown: Bloomington, IN; Seattle, WA
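The three workflow steps can be illustrated with a toy Python sketch. This is not the software used in the SC11 demo, which the slides do not name; the function names, the quality cutoff of 20, the minimum read length, and the brute-force mismatch-counting aligner are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter

def trim_read(seq, quals, min_q=20):
    """Step 1 (pre-processing): cut the read at the first low-quality base."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i]
    return seq

def align(read, reference, max_mismatches=1):
    """Step 2 (alignment): offset with the fewest mismatches, or None.

    Brute-force scan; real aligners use indexed data structures instead.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos

def call_snps(reads, reference, min_depth=2):
    """Step 3 (SNP detection): positions where the majority base differs."""
    pileup = {}
    for read in reads:
        pos = align(read, reference)
        if pos is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            pileup.setdefault(pos + offset, Counter())[base] += 1
    snps = {}
    for pos, counts in sorted(pileup.items()):
        base, depth = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos]:
            snps[pos] = (reference[pos], base)
    return snps

def run_pipeline(raw_reads, reference, min_len=8):
    """Chain the three steps; reads shorter than min_len after trimming are dropped."""
    trimmed = [trim_read(seq, quals) for seq, quals in raw_reads]
    kept = [r for r in trimmed if len(r) >= min_len]
    return call_snps(kept, reference)

# Hypothetical toy data: two good reads carry a variant at reference
# position 10; the third read is mostly low quality and is discarded.
reference = "ACGTACGGTCAATGCCTAGG"
raw_reads = [
    ("ACGGTCGATGCC", [30] * 12),
    ("GGTCGATGCCTA", [30] * 12),
    ("ACGTACGGTCAA", [30] * 5 + [5] * 7),
]
snps = run_pipeline(raw_reads, reference)
print(snps)
```

Each step consumes the previous step's output, which is why the workflow is sensitive to how quickly data moves between storage and compute.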
NCGAS Virtual Genomics Science Instrument
[Diagram components:]
• Large sequencing center and smaller sequencing centers
• Links labeled 10 Gbps and 100 Gbps
• Data Capacitor (Lustre WAN file system)
• Mason, IU POD
• NCBI reference data (FTP)
• International collaborators via TransPAC, GÉANT
This Architecture Scales!
• Internet2 (100 Gbps)
• DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
• IU Data Capacitor (20 Gbps throughput)
• NLR to sequencing centers (10 Gbps/link)
• Ultra SCSI 160 disk (1.28 Gbps, 160 MBps)
• Commodity Internet (1 Gbps but highly variable)
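To make the bandwidth comparison concrete, the following Python sketch estimates how long a dataset would take to move at each listed rate. The 1 TB dataset size is an assumption for illustration, and the calculation uses nominal rates with no protocol overhead or contention, so real transfers would be slower. The Ultra SCSI figure is derived from 160 MBps, which works out to 1.28 Gbps.

```python
# Nominal rates in gigabits per second, taken from the slide.
LINKS_GBPS = {
    "Internet2": 100.0,
    "DDR3 SDRAM": 51.2,
    "IU Data Capacitor": 20.0,
    "NLR link to sequencing centers": 10.0,
    "Ultra SCSI 160 disk": 160 * 8 / 1000,  # 160 MBps expressed in Gbps
    "Commodity Internet": 1.0,
}

def gbps_to_gigabytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0

def transfer_seconds(size_terabytes, gbps):
    """Ideal transfer time for a dataset, using decimal units (1 TB = 1e12 bytes)."""
    bits = size_terabytes * 1e12 * 8
    return bits / (gbps * 1e9)

# Assumed dataset size for illustration only.
DATASET_TB = 1.0
for name, rate in LINKS_GBPS.items():
    seconds = transfer_seconds(DATASET_TB, rate)
    print(f"{name:32s} {rate:6.2f} Gbps  {seconds / 60:8.1f} min")
```

Under these idealized assumptions, the same dataset that needs minutes on a 100 Gbps path needs hours on a 1 Gbps commodity connection.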
How would this work at scale?
1. Biologists anywhere use Galaxy
2. Sequence data transferred over Research Nets
3. Lustre WAN flows data into Data Capacitor
4. Data Capacitor mounts reference data
5. Results available on Data Capacitor for subsequent analyses (secure to HIPAA standards)
Outcomes for Life Sciences Research…
• National and international networks have the capacity to handle genomics data.
• Distributed workflow tools lower the bar for biologists to accomplish genomic science.
• NCGAS is an extensible model of a scaled and integrated infrastructure for biological research.
• This model can extend internationally.
Thank You
Questions?
Bill Barnett
Rich LeDuc
National Center for Genome Analysis Support: http://ncgas.org