High Performance Data Management and Computational Architectures for Genomics Research at National and International Scales

William K. Barnett, Ph.D.

Richard LeDuc, Ph.D.

National Center for Genome Analysis Support

Bio-IT World Asia, June 7, 2012

Summary

• Changing genomics analytical needs
• NCGAS and its mission
• NCGAS cyberinfrastructure
• The 100 Gigabit demonstration
• Scaling genomics analysis
• The NCGAS research model
• Outcomes for life sciences research

National Center for Genome Analysis Support: http://ncgas.org

Changing genomics analytical needs

• Next Gen sequencers are generating more data and getting cheaper
• Sequencing is:
  • Becoming commoditized at large centers, and
  • Multiplying at individual labs
• Analytical capacity has not kept up:
  • Bioinformatics support
  • Computational support (thousand points solution)
  • Storage support

NCGAS widening the analytical bottleneck

• Funded by the National Science Foundation (grant # ABI 1062432)
• Large memory clusters for assembly
• Bioinformatics consulting for biologists
• Optimized software for better efficiency
• Providing services at: http://ncgas.org

Making it easier for Biologists

• Galaxy interface provides a “user friendly” window to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction

[Figure: Computational Skills, on a scale from LOW (Common) to HIGH (Rare)]

NCGAS Service Model

NEED:
• Bioinformatics
• Applications
• Services Layer APIs
• OS Layer
• Public Cloud Providers
• Hardware Layer
• Network Layer

NCGAS:
• Expert Consulting
• Hardened Applications and Workflows
• Galaxy, Parallelization
• Systems Administration
• Mason (512 GB/node)
• 100 Gbps I2

NCGAS Galaxy Applications Model

• Virtual box hosting Galaxy.NCGAS.org: the host for each tool is configured to meet National needs
• Virtual box hosting Galaxy.Indiana.edu: the host for each tool is configured to meet IU needs
• Custom Site Hosting Galaxy.YourSite.???: the host for each tool is configured to meet Your needs

[Diagram labels: Quarry, Mason, RFS, Data Capacitor]

NCGAS Workflow Demo at SC 11

• STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
• STEP 2: sequence alignment to a known reference genome
• STEP 3: SNP detection, to scan the alignment result for new polymorphisms

[Map labels: Bloomington, IN; Seattle, WA]
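The three workflow steps can be illustrated with a toy pipeline. This is only a minimal sketch under stated assumptions, not the software used in the SC 11 demo: the function names, the Phred+33 quality encoding, the thresholds, and the naive ungapped alignment are all hypothetical choices made for illustration. Real workflows would use dedicated quality-control, alignment, and variant-calling tools.

```python
from collections import Counter, defaultdict


def mean_quality(qual: str) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def preprocess(reads, min_mean_q=20):
    """STEP 1: keep only reads whose mean base quality meets the threshold."""
    return [(seq, qual) for seq, qual in reads if mean_quality(qual) >= min_mean_q]


def align(seq, reference, max_mismatches=1):
    """STEP 2: naive ungapped alignment.

    Returns the first reference offset with the fewest mismatches,
    or None if every offset exceeds max_mismatches.
    """
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(seq) + 1):
        window = reference[pos:pos + len(seq)]
        mm = sum(1 for a, b in zip(seq, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def call_snps(reads, reference, min_depth=2):
    """STEP 3: pile up aligned bases and report positions where the most
    frequent base differs from the reference and has enough support."""
    pileup = defaultdict(Counter)
    for seq, _qual in reads:
        pos = align(seq, reference)
        if pos is None:
            continue  # read could not be placed on the reference
        for i, base in enumerate(seq):
            pileup[pos + i][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


def run_workflow(reads, reference):
    """Chain the three steps: pre-process, align, detect SNPs."""
    return call_snps(preprocess(reads), reference)


if __name__ == "__main__":
    # Hypothetical toy data: two good reads carrying a G->T change at
    # reference position 8, plus one low-quality read that is filtered out.
    reference = "GATTACACGTTCAGGCTA"
    reads = [
        ("ACACTTTC", "IIIIIIII"),
        ("ACTTTCAG", "IIIIIIII"),
        ("GGGGGGGG", "!!!!!!!!"),
    ]
    print(run_workflow(reads, reference))
```

Each step consumes the output of the previous one, which is the property that lets a workflow like this be distributed across sites: only the data handed from step to step needs to move.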

NCGAS Virtual Genomics Science Instrument

[Diagram components: Large Sequencing Center; Smaller Sequencing Centers; NCBI Reference Data; International Collaborators via TransPAC, Geant; Data Capacitor; Lustre WAN File System; Mason; IU POD. Link labels: 10 Gbps, 100 Gbps, FTP.]

This Architecture Scales!

[Chart: bandwidth in Gbps, on an axis from 0 to 100]
• Internet2 (100 Gbps)
• DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
• IU Data Capacitor (20 Gbps throughput)
• NLR to Sequencing Centers (10 Gbps/link)
• Ultra SCSI 160 Disk (1.28 Gbps, 160 MBps)
• Commodity Internet (1 Gbps but highly variable)
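To make the bandwidth comparison concrete, the short sketch below estimates how long a dataset would take to move at each of the rates listed on the slide. The 1 TB dataset size is a hypothetical example, not a figure from the talk, and the calculation is idealized: it ignores protocol overhead, contention, and storage limits, so real transfers would be slower.

```python
# Nominal rates in Gbps, as listed on the slide.
LINK_GBPS = {
    "Internet2": 100.0,
    "DDR3 SDRAM": 51.2,
    "IU Data Capacitor": 20.0,
    "NLR to Sequencing Centers (per link)": 10.0,
    "Ultra SCSI 160 Disk": 160 * 8 / 1000,  # 160 MBps expressed in Gbps
    "Commodity Internet": 1.0,  # nominal; the slide notes it is highly variable
}


def transfer_seconds(size_terabytes: float, rate_gbps: float) -> float:
    """Idealized transfer time: dataset size in bits divided by link rate."""
    bits = size_terabytes * 1e12 * 8   # decimal terabytes -> bits
    return bits / (rate_gbps * 1e9)    # Gbps -> bits per second


if __name__ == "__main__":
    size_tb = 1.0  # hypothetical dataset size
    for name, rate in LINK_GBPS.items():
        minutes = transfer_seconds(size_tb, rate) / 60
        print(f"{name:40s} {rate:6.2f} Gbps  {minutes:8.1f} min")
```

Under these assumptions the same dataset takes one hundred times longer over a 1 Gbps commodity link than over a 100 Gbps link, which is the gap the slide is drawing attention to.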

How would this work at scale?

1. Biologists anywhere use Galaxy
2. Sequence data transferred over Research Nets
3. Lustre WAN flows data into Data Capacitor
4. Data Capacitor mounts reference data
5. Results available on Data Capacitor for subsequent analyses (secure to HIPAA standards)

Outcomes for Life Sciences Research…

• National and international networks have the capacity to handle genomics data.

• Distributed workflow tools lower the bar for biologists to accomplish genomic science.

• NCGAS is an extensible model of a scaled and integrated infrastructure for biological research.

• This model can extend internationally.

Thank You

Questions?

Bill Barnett

Rich LeDuc