Distributed Systems Laboratory
Computational Biology Laboratory
Superlink-Online:
Harnessing the world’s computers to
hunt for disease-provoking genes
Mark Silberstein, CS, Technion
Dan Geiger, Computational Biology Lab
Assaf Schuster, Distributed Systems Lab
Genetics Research Institutes in Israel, EU, US
MS eScience Workshop 2008
Familial Onychodysplasia and dysplasia of distal phalanges (ODP)

Family Pedigree
[Figure: multi-generation family pedigree; individuals III-15, IV-7, and IV-10 are marked]
Marker Information Added

Id      Dad     Mom     Sex   Aff   Marker 1    Marker 2
III-21  II-10   II-11   f     h     0    0      0   0
II-5    I-3     I-4     f     h     155  157    A   A
III-7   II-4    II-5    f     a     155  157    A   T
III-13  II-4    II-5    m     a     151  155    A   T
III-14  II-1    II-2    f     h     151  155    A   A
III-15  II-4    II-5    m     a     151  155    A   A
III-16  II-10   II-11   f     h     151  159    A   A
III-5   II-4    II-5    f     h     151  155    A   A
IV-1    III-13  III-14  f     h     151  155    A   T
IV-2    III-13  III-14  f     a     151  155    A   T
IV-3    III-13  III-14  f     a     155  155    A   T

(Sex: m/f; Aff: a = affected, h = healthy; 0 = untyped.)
[Figure: a chromosome pair carrying marker loci M1 and M2]
Maximum Likelihood Evaluation

[Figure: pedigree fragment annotated with marker loci M1-M4 and disease loci D1, D2; individuals carry genotypes such as 202,209 / 202,202 and 139,141 / 139,146; the recombination fraction θ lies between a marker and the disease locus; III-15 is typed 151,159 and III-16 is typed 151,155]

The computational problem: find a value of θ maximizing Pr(data|θ).

LOD score (to quantify how confident we are):
Z(θ) = log10[ Pr(data|θ) / Pr(data|θ=½) ]
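To make the formula concrete, here is a minimal Python sketch (not Superlink's code) that converts natural-log likelihoods, as reported in the multipoint table below, into LOD scores. The θ=½ baseline of about -182.07 is backed out of that table (Ln(likelihood) + LOD·ln 10 is constant across its rows):

import math

def lod(ln_like_theta: float, ln_like_null: float) -> float:
    # Z(theta) = log10[ Pr(data|theta) / Pr(data|theta=1/2) ],
    # computed from natural-log likelihoods.
    return (ln_like_theta - ln_like_null) / math.log(10)

# Row at 21.9 cM (Marker 9): ln L = -172.9497, reported LOD = 3.96
print(round(lod(-172.9497, -182.07), 2))  # -> 3.96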
Results of Multipoint Analysis

Position (centi-Morgans)   Ln(Likelihood)   LOD
 0.0000 (Marker 3)           -216.0217     -14.74
 0.5500                      -192.2385      -4.41
 1.1000 (Marker 4)           -216.0210     -14.74
 3.6000                      -176.3810       2.47
 6.1000 (Marker 5)           -174.3392       3.35
 8.6500                      -173.9743       3.51
11.2000 (Marker 6)           -173.7030       3.63
16.5500                      -173.3106       3.80
21.9000 (Marker 9)           -172.9497       3.96
25.2500                      -173.6540       3.65
28.6000 (Marker 10)          -177.5622       1.95
40.3001                      -178.9946       1.33

The peak LOD of 3.96 near Marker 9 (21.9 cM) exceeds the conventional significance threshold of 3, indicating linkage in that region.
The Bayesian network model

[Figure: Bayesian network over four loci: Locus 1, Locus 2 (disease), Locus 3, Locus 4]

This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.
The Computational Task

Computing Pr(data|θ) for a specific value of θ:

P(data | θ) = Σ_{x1,…,xn} Π_{i=1}^{n} P(xi | pa_i)

This takes exponential time and space in:
• #variables (five per person, per marker and gene locus)
• #values per variable (#alleles; non-typed persons)
• table dimensionality (cycles in the pedigree)

Finding the best order is equivalent to finding the best order of sum-product operations over high-dimensional matrices:

Y_ij = Σ_{k,l,m,n} A_ikl B_kjm C_lmn
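As an illustration (NumPy, with made-up tensor shapes; not Superlink's code), einsum evaluates exactly this sum-product, and its optimize flag searches for a cheap contraction order — the same order-finding problem described above:

import numpy as np

# Illustrative shapes; indices follow the formula above.
A = np.random.rand(4, 5, 6)   # A[i, k, l]
B = np.random.rand(5, 4, 7)   # B[k, j, m]
C = np.random.rand(6, 7, 8)   # C[l, m, n]

# Sum-product over k, l, m, n; optimize=True lets NumPy pick a
# cheap pairwise contraction order instead of the naive loop.
Y = np.einsum('ikl,kjm,lmn->ij', A, B, C, optimize=True)
print(Y.shape)  # (4, 4)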
Divisible Tasks through Variable Conditioning

Conditioning on the values of selected variables splits one likelihood computation into many independent jobs, one per assignment, at the cost of non-trivial parallelization overhead (see the sketch below).
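A toy Python sketch of the idea (three binary variables and a made-up joint; nothing here is Superlink's actual code): fixing the conditioned variable x1 turns one big sum into independent jobs whose partial sums are added at the end.

from itertools import product

def joint(x1, x2, x3):
    # Stand-in for prod_i P(x_i | pa_i); any positive function works.
    return (1 + x1) * (1 + x2 * x3) / 24.0

# Serial evaluation: one sum over all variables.
serial = sum(joint(*xs) for xs in product([0, 1], repeat=3))

# Conditioned evaluation: one independent job per value of x1.
def job(x1):
    return sum(joint(x1, x2, x3) for x2, x3 in product([0, 1], repeat=2))

parallel = sum(job(x1) for x1 in (0, 1))
assert abs(serial - parallel) < 1e-12  # same likelihood, now divisible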
Terminology

• Basic unit of execution – a batch job
  – Non-interactive mode: "enqueue – wait – execute – return"
  – Self-contained execution sandbox
• A linkage analysis request – a task
  – A bag of jobs (possibly millions)
  – Turnaround time is important
Requirements

• The system must be geneticist-friendly
  – Interactive experience
    • Low response time for short tasks
    • Prompt user feedback
  – Simple, secure, reliable, stable, overload-resistant; concurrent tasks, multiple users...
  – Fast computation of previously infeasible long tasks via parallel execution
    • Harness all available resources: grids, clouds, clusters
    • Use them efficiently!
Grids or Clouds?

[Plots: remaining jobs in queue vs. time for a grid (k CPUs) and a cloud (k CPUs); the grid curve shows queue waiting time and a long tail due to failures; side panels: preempted jobs and error rate at UW Madison, queuing time in EGEE]

• Small tasks are severely slowed down on grids
  – What takes 5 minutes on a 10-node dedicated cluster
  – may take several hours on a grid

Should we move scientific loads to the cloud? YES!
Grids or Clouds?

• Consider 3.2×10^6 jobs, ~40 min each
• It took 21 days on ~6000-8000 CPUs
• It would cost about $10K on Amazon's EC2

Should we move scientific loads to the cloud? NO!
Clouds or Grids? Clouds and Grids!

                                          Opportunistic   Dedicated
Reliability                               Low             High
Performance predictability                Low             High
Potential amount of available resources   High            Low
Reuse of existing infrastructure          High            Low
Cheap and Expensive Resources

• Task sensitivity to QoS differs across execution stages

[Plot: remaining jobs in queue over time, split into a high-throughput phase followed by a high-performance tail phase]

• High throughput phase: use cheap, unreliable resources
  – Grids
  – Community grids
  – Non-dedicated clusters
• High performance phase: use expensive, reliable resources
  – Dedicated clusters
  – Clouds
• Dynamically detect entry into tail mode
• Switch to expensive resources (gracefully); see the sketch below
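A minimal sketch of the switching policy in Python (the 5% cutoff and the names are assumptions for illustration; the real system decides dynamically from runtime statistics):

def route(remaining_jobs: int, total_jobs: int,
          tail_fraction: float = 0.05) -> str:
    """Pick a resource class for the next batch of jobs."""
    if remaining_jobs <= tail_fraction * total_jobs:
        # Tail mode: few stragglers left; finish them on reliable,
        # expensive resources (dedicated clusters, clouds).
        return "dedicated"
    # Throughput mode: plenty of work left; cheap, unreliable
    # resources (grids, community grids) are fine.
    return "opportunistic"

print(route(900_000, 1_000_000))  # -> opportunistic
print(route(30_000, 1_000_000))   # -> dedicated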
Glue pools together via overlay

[Diagram: a Scheduling Server holding the job queue and scheduler feeds Submitters to Grid 1, Grid 2, Cloud 1, and Cloud 2; a virtual cluster maintainer manages the overlay]

Issues: granularity, load balancing, firewalls, failed resources, scheduler scalability… (a minimal sketch of the shared-queue shape follows)
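The overlay can be pictured as one shared job queue drained by per-pool submitters. A Python sketch of that shape (pool names and the submit step are placeholders, not the real submitters):

import queue
import threading

jobs: "queue.Queue[int]" = queue.Queue()
for j in range(100):
    jobs.put(j)

def submitter(pool: str) -> None:
    # Each submitter pulls from the shared queue and forwards work to
    # its pool; granularity, firewalls, and failure handling would
    # live here in the real system.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        # ... submit `job` to `pool` here ...

pools = ("grid-1", "grid-2", "cloud-1", "cloud-2")
threads = [threading.Thread(target=submitter, args=(p,)) for p in pools]
for t in threads:
    t.start()
for t in threads:
    t.join()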
Practical considerations

• Overlay scalability and firewall penetration
  – The server may not initiate connections to the agent
• Compatibility with community grids
  – The server is based on BOINC
  – Agents are upgraded BOINC clients
• Elimination of failed resources from scheduling
  – Performance statistics are analyzed
• Resource allocation depending on the task state
  – Dynamic policy update via the Condor ClassAd mechanism
SUPERLINK@TECHNION

[Architecture diagram:]
• Upgraded BOINC server: HTTP frontend, scheduler, and a database of jobs, monitoring data, and system statistics
• Web portal: task state, task execution and monitoring workflow
• BOINC-client submitters for EGEE and for the Madison pool
• Virtual cluster maintainer
• Submitters to the Technion cluster, the EC2 cloud, OSG, and any other grid/cluster/cloud
• Dedicated cluster as fallback
Superlink-online 1.0:
http://bioinfo.cs.technion.ac.il
Task Submission
Superlink-online statistics

• ~1720 CPU years for ~18,000 tasks during 2006-2008 (and counting)
• ~37 citations (several mutations found)
• Over 250 users (and counting): Israeli and international
  – Examples: ichthyosis, "uncomplicated" hereditary spastic paraplegia (1-9 people per 100,000)
  – Soroka H., Be'er Sheva; Galil Ma'aravi H., Nahariya; Rabin H., Petah Tikva; Rambam H., Haifa; Beney Tzion H., Haifa; Sha'arey Tzedek H., Jerusalem; Hadassa H., Jerusalem; Afula H.; NIH; universities and research centers in the US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others...

Task example
• 250 days on a single computer; 7 hours on 300-700 computers
• Short tasks: a few seconds, even during severe overload
Using our system in Israeli Hospitals

• Rabin Hospital (Motti Shochat's group)
  – New locus for mental retardation
  – Infantile bilateral striatal necrosis
• Soroka Hospital (Ohad Birk's group)
  – Lethal congenital contractural syndrome
  – Congenital cataract
• Rambam Hospital (Eli Shprecher's group)
  – Congenital recessive ichthyosis
  – CEDNIK syndrome
• Galil Ma'aravi Hospital (Tzipi Falik's group)
  – Familial onychodysplasia and dysplasia
  – Familial juvenile hypertrophy
Utilizing Community Computing
~3.4 TFLOPs, ~3000 users, from 75 countries
Superlink-online V2 (beta) deployment

[Deployment map: a submission server at the Technion feeds Technion Condor pools, the EGEE-II BIOMED VO, the UW Madison Condor pool, the OSG GLOW VO, Superlink@Campus, Superlink@Technion, and a dedicated cluster]

~12,000 hosts operational during the last month
3.1 million jobs in 21 days

[Plot: remaining jobs over the 21-day run; annotation: only 60 dedicated CPUs]
Conclusions

• Our system integrates clusters, grids, clouds, community grids, etc.
• Geneticist-friendly
• Minimizes the use of expensive resources while providing QoS for tasks
• Generic mechanism for scheduling policy
  – Can dynamically reroute jobs from one pool to another according to a given optimization function (budget, energy, etc.)
NVIDIA Compute Unified Device Architecture (CUDA)

[Diagram: G80 GPU with 16 multiprocessors (MP) of 8 scalar processors (SP) each; per-MP register file and 16 KB shared memory (~1 cycle latency, ~TB/s bandwidth); cached read-only memory; off-chip global memory]
Key ideas
(Joint work with John Owens, UC Davis)

• Software-managed cache
  – We implement the cache replacement policy in software
• Maximization of data reuse
  – Better compute-to-memory-access ratio
• A simple model for performance bounds
  – Yes, we are (optimal)
• Use special function units (SFUs) for hardware-assisted execution
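The slides do not spell the performance model out; a plausible reading is a roofline-style bound where kernel time is limited by the slower of compute and memory traffic. A hedged Python sketch with illustrative G80-era numbers (assumptions, not measurements):

def kernel_time_lower_bound(flops: float, bytes_moved: float,
                            peak_flops: float, peak_bw: float) -> float:
    # The kernel can finish no faster than its compute time or its
    # memory time, whichever is larger.
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Illustrative only: ~350 GFLOP/s compute, ~80 GB/s global memory.
t = kernel_time_lower_bound(flops=1e9, bytes_moved=4e8,
                            peak_flops=350e9, peak_bw=80e9)
print(f"{t * 1e3:.2f} ms lower bound")  # memory-bound here: 5.00 ms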
Results summary

Experiment setup:
• CPU: single-core Intel Core 2, 2.4 GHz, 4 MB L2
• GPU: NVIDIA G80 (GTX 8800), 750 MB GDDR4, 128 SPs, 16 KB shared memory / 512 threads per MP
• Only kernel runtime included (no memory transfers, no CPU setup time)

[Chart: overall speedup of ~2500x ≈ 2 × 25 × 25 × 2, attributed to the hardware, software-managed caching, and use of the SFUs]

Use of SFU: expf is about 6x slower than "+" on the GPU, but ~200x slower on the CPU.
Acknowledgments

• Superlink-online team
  – Alumni: Anna Tzemach, Julia Stolin, Nikolay Dovgolevsky, Maayan Fishelson, Hadar Grubman, Ophir Etzion
  – Current: Artyom Sharov, Oren Shtark
• Prof. Miron Livny (Condor pool UW Madison, OSG)
• EGEE BIOMED VO and OSG GLOW VO
• Microsoft TCI program, NIH grant, SciDAC Institute for Ultrascale Visualization

If your grid is underutilized – let us know!

Visit us at: http://bioinfo.cs.technion.ac.il/superlink-online
Superlink@TECHNION project home page: http://cbl-boinc-server2.cs.technion.ac.il/superlinkattechnion
QUESTIONS???
Visit us at:
http://bioinfo.cs.technion.ac.il/superlink-online