www.biobankcloud.com
Jim Dowling
KTH – Royal Institute of Technology, Stockholm
SICS Swedish ICT
CSHL Meeting on Biological Data Science, 2014
Definition of a Biobank
•The Biobank “concept” is defined (by Swedish law) as:
“biological material from one or several human beings collected and stored indefinitely or for a specified time and whose origin can be traced to the human or humans from whom it originates”
•In Sweden, we have the goal of digitizing biological material in Biobanks.
≈ 500,000 people
Population-Scale WGS: $1325 per Genome
HiSeq X Ten^ => ~18,000 genomes/year
Volume => ~5.2 PB/year*
Velocity => ~45 MB/sec*
^Cost ~$10 million
*5.2 PB assumes a replication factor of 3
See: http://goo.gl/OCgJ36
Network effects: Biggest Dataset wins!
Centralized dataset: technically feasible, politically hard.
Federated dataset: technically hard, politically feasible.
[Figure: plot with axis labels "#diseases", "insights" and "log(#samples)"]
Big Data means Commodity Hardware
180TB for $9,778 or 270TB for $15,763*
*[https://www.backblaze.com/blog/why-now-is-the-time-for-backblaze-to-build-a-270-tb-storage-pod/]
91,428 Whole Genomes: $782,240 (14.4PB)
*Each genome is 112.5 GB at 35x coverage; storage is Reed-Solomon erasure-coded. Prices as of Nov 2014.
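The storage figures on this slide can be reproduced with a short calculation. The sketch below is not from the talk: it assumes the 180 TB pod at $9,778 is the unit of purchase and that the Reed-Solomon overhead is 1.4x (for example 10 data + 4 parity fragments). The slide does not state the overhead; 1.4x is simply the value under which the quoted numbers line up.

```python
from fractions import Fraction
from math import ceil, floor

# Figures quoted on the slide (Backblaze pod pricing, Nov 2014).
POD_CAPACITY_TB = 180
POD_COST_USD = 9_778
GENOME_SIZE_GB = Fraction(225, 2)  # 112.5 GB at 35x coverage

# ASSUMPTION: the slide does not give the erasure-coding overhead.
# 1.4x (e.g. 10 data + 4 parity fragments) is a hypothetical value.
RS_OVERHEAD = Fraction(14, 10)


def pods_needed(n_genomes: int) -> int:
    """Number of whole pods needed to hold n_genomes after erasure coding."""
    raw_gb = n_genomes * GENOME_SIZE_GB * RS_OVERHEAD
    return ceil(raw_gb / (POD_CAPACITY_TB * 1000))


def genomes_that_fit(n_pods: int) -> int:
    """Number of whole genomes that fit on n_pods after erasure coding."""
    usable_gb = Fraction(n_pods * POD_CAPACITY_TB * 1000) / RS_OVERHEAD
    return floor(usable_gb / GENOME_SIZE_GB)


def cost_usd(n_pods: int) -> int:
    """Hardware cost of n_pods at the quoted pod price."""
    return n_pods * POD_COST_USD


pods = pods_needed(91_428)
print(pods, cost_usd(pods), genomes_that_fit(pods))  # 80 782240 91428
```

Under these assumptions, 80 pods give 14.4 PB of raw capacity, which matches the $782,240 and 91,428-genome figures above.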
Hadoop can mean low administration costs
Facebook Operations staffers manage 20,000-26,000 servers each^
^ http://allfacebook.com/20000-servers_b127053
Population-based analysis needs scale-out
Read genome on 1 machine*:
~1000 secs
[Harddisk Image courtesy of sorapop / http://www.freedigitalphotos.net]
*112 GB, 35x coverage
Hadoop means Parallelization
Read genome on 1000 machines:
~1 second*
*112 GB, 35x coverage
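The two read times above follow from dividing the genome evenly across machines. A minimal sketch, assuming each machine reads from a single disk at about 112 MB/sec (the rate implied by 112 GB in ~1000 secs, not a number stated on the slides) and ignoring coordination overhead:

```python
GENOME_SIZE_GB = 112    # from the slides, 35x coverage
DISK_MB_PER_SEC = 112   # ASSUMPTION: implied by 112 GB in ~1000 secs


def read_seconds(size_gb: float, n_machines: int,
                 mb_per_sec: float = DISK_MB_PER_SEC) -> float:
    """Idealized scan time when size_gb is split evenly over n_machines.

    Ignores scheduling, network and straggler effects, so this is a
    lower bound rather than a prediction.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    per_machine_mb = size_gb * 1000 / n_machines
    return per_machine_mb / mb_per_sec


print(read_seconds(GENOME_SIZE_GB, 1))     # 1000.0
print(read_seconds(GENOME_SIZE_GB, 1000))  # 1.0
```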
Genomic data management solutions
• Outsource to SaaS providers
• Sequencing Centers
• High Performance Computing Centers
• In-House
BiobankCloud Users
EU General Data Protection Regulation
IT-Admin
Biobanker
Users, Data, Programs
Ethics Auditor
Bioinformatician
Never the Twain Shall Meet?
•Biobankers
- NGS data producers
- Manage samples, metadata
- Non-programmers
•Bioinformaticians
- NGS data consumers
- Analyze samples
- Programmers (Python, R, Matlab, scripts)
[Architecture diagram components: BiobankCloud LIMS, Kerberos, LDAP, Hops-YARN, Hops-HDFS, Overbank, Cuneiform/HiWAY, PaaS, IT Admins]
BiobankCloud LIMS (Lab Info Mgmt System)
Perimeter Security and Multi-Tenancy
•LIMS
- J2EE Application
• LDAP, 2-Factor Authentication
- Role-Based Access Control
• Study-level
- Hadoop trusted proxy
• Kerberos
- REST APIs
•Related Hadoop Projects
- Knox, Sentry, Rhino
- Hue, Ambari
[Diagram labels: Network Isolation, LIMS, Kerberos, LDAP]
LIMS Authentication
Studies for Multi-Tenancy; Audit Trails
Study
Global Audit Trail
Study Membership
Browse Study Files
HDFS
Files
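The combination of study membership and a global audit trail can be illustrated with a toy model. This is a sketch only, not the BiobankCloud LIMS code: the class names, method names and audit record fields are all hypothetical, and the only role name used comes from the users slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Study:
    """A study is the unit of multi-tenancy: files are visible to members only."""
    name: str
    members: dict = field(default_factory=dict)  # username -> role


@dataclass
class Lims:
    """Hypothetical stand-in for the LIMS access layer."""
    studies: dict = field(default_factory=dict)      # study name -> Study
    audit_trail: list = field(default_factory=list)  # global, append-only

    def add_member(self, study_name: str, user: str, role: str) -> None:
        study = self.studies.setdefault(study_name, Study(study_name))
        study.members[user] = role
        self._audit(user, study_name, "add_member", True)

    def can_browse_files(self, user: str, study_name: str) -> bool:
        study = self.studies.get(study_name)
        allowed = study is not None and user in study.members
        # Denied attempts are recorded too, so an auditor can review them.
        self._audit(user, study_name, "browse_files", allowed)
        return allowed

    def _audit(self, user: str, study: str, action: str, allowed: bool) -> None:
        self.audit_trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user, "study": study,
            "action": action, "allowed": allowed,
        })


lims = Lims()
lims.add_member("study-a", "alice", "Bioinformatician")
print(lims.can_browse_files("alice", "study-a"))  # True
print(lims.can_browse_files("bob", "study-a"))    # False
print(len(lims.audit_trail))                      # 3
```

Keeping the audit trail global rather than per study means a single log covers every study, which is the arrangement the slide labels suggest.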
Upload Sample Data
Low-volume data ingestion (overcomes the 3 GB browser upload limit)
High-volume data ingestion: Apache Flume
Run Cuneiform Workflows on YARN
HiWay/Cuneiform Variant Call Workflow
Terabytes of input data (1000 Genomes Project)
Scalability experiments on up to 576 containers in a 28-node cluster
[Workflow diagram; surviving labels: "read and reference parallel", "reference parallel"]
HiWay/Cuneiform Scalability*
[Figure: Workflow Runtime with Increasing Number of Containers. y-axis: Runtime in Minutes (ticks 16, 32, 64, 128); x-axis: Number of Nodes (4, 8, 16, 32)]
[Figure: Container time spent at different stages of execution. Legend: idle, scheduling, startup, execution, stage-out, shutdown, stage-in]
[*Unpublished Results]
Store Genomes in Hops-HDFS
Block IDs and a block placement policy minimize re-identification risk.
HDFS-6134, HDFS-6891 - Transparent data at rest encryption
Sample data is not fully compromised unless intruders also compromise the NameNode
/genomes/jim.bam -> {2,4,3,6,9}
[Diagram: the NameNodes hold the file-to-block mapping; blocks 2, 3, 4, 6 and 9 each appear three times across the DataNodes. Caption: "Data nodes compromised"]
Hops-HDFS:
• Stateless NameNodes + MySQL Cluster
• Erasure-Coded Replication
www.hops.io
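The block placement idea on this slide can be sketched as a toy model. It is not Hops-HDFS code: the function names, the five-DataNode cluster and the random placement are hypothetical, and three replicas per block is simply the standard HDFS default. The point is that a DataNode holds only an unordered set of block IDs, while the file name and block order live on the NameNode.

```python
import random


def place_blocks(block_ids, n_datanodes, replication=3, seed=0):
    """Scatter each block onto `replication` distinct DataNodes.

    Returns {datanode_id: set of block ids}. A DataNode learns nothing
    about file names or block order from what it stores.
    """
    rng = random.Random(seed)
    datanodes = {dn: set() for dn in range(n_datanodes)}
    for block_id in block_ids:
        for dn in rng.sample(range(n_datanodes), replication):
            datanodes[dn].add(block_id)
    return datanodes


def locate_file(path, namenode, datanodes):
    """Ordered (block id, holders) list; needs the NameNode's mapping."""
    return [
        (block_id,
         sorted(dn for dn, blocks in datanodes.items() if block_id in blocks))
        for block_id in namenode[path]
    ]


# NameNode metadata, using the mapping shown on the slide.
namenode = {"/genomes/jim.bam": [2, 4, 3, 6, 9]}
datanodes = place_blocks(namenode["/genomes/jim.bam"], n_datanodes=5)

# View of an intruder who compromised DataNode 0 only: anonymous block ids.
print(sorted(datanodes[0]))

# View with NameNode access: which blocks, in which order, stored where.
for block_id, holders in locate_file("/genomes/jim.bam", namenode, datanodes):
    print(block_id, holders)
```

This only models the metadata separation; the encryption at rest mentioned on the slide is a separate mechanism.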
PaaS support with Chef/Karamel
Launch Hadoop and the LIMS on EC2 or Vagrant.
Conclusions
•Open-source NGS data management solutions on commodity hardware are feasible and economical
- BiobankCloud is building such a system
•BiobankCloud status
- First beta release coming in 2014
• Ongoing work on regulatory and ethical compliance
- BBC Newsletter on our website
•NGS Hadoop Workshop Feb 19-20, Stockholm
- Signup at www.biobankcloud.com
The Team
• KTH
Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Stig Viaene, Gautier Berthou, Roshan Sedar, Ali Gholami, Erwin Laure
• Humboldt University
Ulf Leser, Jörgen Brandt, Marc Bux
• University of Lisbon
Alysson Bessani, Vinicius Cogo
• Karolinska Institute
Jan-Eric Litton, Roxana Martinez, Jane Reichel, Mats Hansson
• Charité University Hospital
Michael Hummel, Lora Dimitrova, Karen Zimmermann
Financed by the European Commission 7th Framework Programme.