Transcript GridQTL

GridQTL : Using the NGS to map
genes through a web portal
Jean-Alain Grunchec
University of Edinburgh
Plan
• GridQTL Team and users
• Introduction to the GridQTL project
• Description of computing infrastructure and
software behind the scenes
• Short demonstration of the Grid-enabled service
GridQTL Team
* Sara Knott
+ Ian White
× Jules Hernandez-Sanchez
# Jean-Alain Grunchec
# Kashif Saleem
* Chris Haley
* Dirk-Jan de Koning
× Wenhua Wei
× Burak Karacaoeren
× Susan Rowe
* Jano van Hemert
# John Allen
+ Mathematician
× Biologist
# Computer scientist
GridQTL Users
• 443 Registered external users in 44 countries
• 211 have used the core services, 67 the LDLA
Plan
• GridQTL Team and users
• Purpose of the GridQTL project
• Description of computing infrastructure and
software behind the scenes
• Short demonstration of the Grid-enabled service
QTL mapping
• Aim: To detect and locate genes (QTL) having
an effect on a quantitative trait
• Quantitative trait – a trait with a continuous
measurement (size, weight, concentration)
• QTL (quantitative trait locus) – a gene or DNA segment
having an effect on a quantitative trait
Rationale for QTL analysis
• To understand genetic variation by dissecting
complex traits
• fundamental knowledge of gene actions and
interactions
• applications in agriculture
• applications in medicine
[Figure: histogram of stature in women. x-axis: Stature, ticks from 145.0 to 190.0; y-axis ticks from 0 to 700. Mean = 169.1, Std. Dev = 6.40, N = 1785]
History: QTL Express
• Web portal to map QTL in experimental
populations
• Based on Java servlets and uses a dedicated
pool of 6 computers
• 100+ Users
• Growing computational demand degraded the quality of
service very significantly : 6 computers are not enough !
Most recent developments in genetics
• New models are computationally very demanding (100s of CPU hours per
analysis)
• Potential for models that can be applied to complex pedigrees (real-life
populations, e.g. LDLA)
• Potential for more complex genetic models (multiple QTLs, e.g. epistasis)
• Now feasible
• 100,000s marker genotypes per individual
• 10,000s phenotypes
• 1000s individuals
• Current approaches may be inadequate to analyse the resulting
large data sets
• High-throughput analyses : 10,000s of CPU hours per analysis
Plan
• GridQTL Team and users
• Introduction to the GridQTL project
• Description of computing infrastructure
• Short demonstration of the Grid-enabled service
Increasing the computational capacity
available to GridQTL
• 2006 : Condor pool of 10 computers
• 2007 : NGS-1 (500 CPUs)
• 2008 : ECDF+NGS-2 > 2600 CPUs
Grid Infrastructure
[Diagram: Grid infrastructure. Labels shown: local resources in Edinburgh; server; PETRA (Condor); ECDF; National Grid Service sites RAL, Cardiff, Manchester, Leeds, Westminster, Oxford and Lancaster. CPU-count labels: 1456, 256 (four times), 128 and 10. Operating systems: Linux (FC, Red Hat, SUSE), Solaris, IRIX.]
Software description
[Diagram: software components. NGS/ECDF (globus, gsissh); Condor pool (ssh); server running GridSphere / Tomcat; SWARM Meta-Scheduler; JSR-168 portlet in browser (JSP, HTML); AJAX (JavaScript / servlet).]
How do we use the NGS ?
• Our users log on to the website and are identified
by their unique user name.
• They run queries by clicking buttons on the
web interface.
• These buttons run Java functions that
call Globus toolkit routines.
• The NGS authorises these routines to run by
recognising an NGS portal certificate, which
identifies the web server and its administrator.
• The administrator has to put an accounting system
in place for usage audit.
Job submission
• globus-job-submit
-env JAVA_HOME=/usr/local/Cluster-Apps/java/jdk1.6.0_01
-env host_name=ngs.wmin.ac.uk
ngs.wmin.ac.uk/jobmanager-pbs
-stderr /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/.err.0000004696.4702.527.txt
-stdout /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/.out.0000004696.4702.527.txt
-maxtime 18
-np 2 -host-count 1
-dir /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere
-x "&(jobtype=single)(minMemory=1100)"
-l /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/LDLA_GridSphere.sh
"4;4702;527;0000004696;achatzipli;0;server.cap.ed.ac.uk;qtlportlets/public/01573ec42120bf1301218d475e67021b/;"
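The command above follows a regular pattern: the stdout file, the stderr file and the job script all sit in one working directory, and the output file names carry the identifiers of the job. A minimal sketch of how such a command line could be assembled is given below. It is written in POSIX sh, uses only flags that appear on the slide, and prints the command instead of submitting it. The helper name build_submit_cmd and its parameters are hypothetical and are not part of GridQTL or of the Globus toolkit.

```shell
#!/bin/sh
# Sketch only: prints a globus-job-submit command line modelled on the slide.
# build_submit_cmd and its arguments are hypothetical names for illustration.
build_submit_cmd() {
    contact=$1    # job manager contact, e.g. ngs.wmin.ac.uk/jobmanager-pbs
    workdir=$2    # working directory on the cluster
    tag=$3        # identifiers used in the output file names
    maxtime=$4    # reserved duration, passed to -maxtime
    script=$5     # job script located in the working directory

    printf 'globus-job-submit %s' "$contact"
    printf ' -stderr %s/.err.%s.txt' "$workdir" "$tag"
    printf ' -stdout %s/.out.%s.txt' "$workdir" "$tag"
    printf ' -maxtime %s -dir %s' "$maxtime" "$workdir"
    printf ' -l %s/%s\n' "$workdir" "$script"
}
```

For example, build_submit_cmd ngs.wmin.ac.uk/jobmanager-pbs /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere 0000004696.4702.527 18 LDLA_GridSphere.sh prints a command with the same contact, paths and -maxtime value as the one on the slide, without the -env, -np, -host-count and -x options.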
Profiling the application
• Profiling software : gprof
• Your own script
#!/bin/csh
./myapplication.sh &
./profiler.csh $! &
wait
• Memory : ps -o vsize,comm,user
• Disk usage can also be monitored (some NGS
clusters have large temporary storage facilities)
• Computational load : uptime
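The slide starts a profiler script with the process ID of the application but does not show its contents. One possible body is sketched below, as an assumption: it samples the same ps and uptime commands at a fixed interval until the application exits. It is written in POSIX sh instead of csh, and the function name profile_pid, its parameters and the log layout are hypothetical.

```shell
#!/bin/sh
# Sketch only: a possible body for the profiler script started on the slide.
# profile_pid is a hypothetical helper; the log layout is an assumption.
profile_pid() {
    pid=$1        # process to watch (the slide passes $! for this)
    interval=$2   # seconds between two samples
    log=$3        # file that receives the samples

    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        {
            date
            # Memory usage, with the ps columns named on the slide.
            ps -o vsize,comm,user -p "$pid" || echo "no ps sample for $pid"
            # Computational load of the node.
            uptime
        } >> "$log" 2>&1
        sleep "$interval"
    done
}
```

A call such as profile_pid 12345 60 profile.log would append one sample per minute to profile.log for as long as process 12345 is running. The vsize column name is accepted by the Linux procps version of ps; other systems may need vsz instead.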
Failures happen !
• Output failures : data corrupted during file transfer or on the
nodes, out of memory, etc.
• Duration failures : jobs terminated because they
exceed their reserved duration
• Submission failures : network failure during
job submission
• Server failures : partly handled
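Of the failure types listed above, submission failures caused by a network problem are often transient, so one common way to handle them is to retry the submission a bounded number of times. The sketch below shows that general technique in POSIX sh. It is not taken from the GridQTL code, and the helper name retry and its parameters are hypothetical.

```shell
#!/bin/sh
# Sketch only: generic bounded retry, not the mechanism used by GridQTL.
# retry is a hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND...
retry() {
    max_attempts=$1
    delay=$2
    shift 2

    attempt=1
    while :; do
        # Run the command; stop as soon as it succeeds.
        "$@" && return 0
        # Give up once the allowed number of attempts is used.
        [ "$attempt" -ge "$max_attempts" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A call such as retry 3 30 ./submit_job.sh (where submit_job.sh stands for any submission script) would try up to three times, waiting 30 seconds between attempts. Retrying does not help with duration failures, where the job itself needs a longer reservation.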
Plan
• GridQTL Team and users
• Introduction to the GridQTL project
• Description of computing infrastructure
• Short demonstration of the Grid-enabled service
Linkage Disequilibrium Linkage Analysis
• Good for complex pedigrees
• For instance, a population of feral sheep from St Kilda
• Or others, even plants (diploids)
• Good for populations that would be too expensive
(or unethical) to breed for experimental purposes
• 3,447 analyses run (198,142 jobs on the Grid)
• 11,005 hours CPU time
• 900 hours user time
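The usage totals above can be turned into a few per-analysis and per-job averages. The short sketch below does the arithmetic with the figures quoted on the slide. The helper name ratio is hypothetical, and the last ratio compares the two totals as quoted without assuming a particular definition of user time.

```shell
#!/bin/sh
# Sketch only: averages derived from the totals quoted on the slide.
# ratio is a hypothetical helper that prints a/b with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

analyses=3447
jobs=198142
cpu_hours=11005
user_hours=900

echo "jobs per analysis:       $(ratio "$jobs" "$analyses")"
echo "CPU hours per analysis:  $(ratio "$cpu_hours" "$analyses")"
echo "CPU minutes per job:     $(ratio $((cpu_hours * 60)) "$jobs")"
echo "CPU hours per user hour: $(ratio "$cpu_hours" "$user_hours")"
```

With these totals an analysis is split into about 57 Grid jobs and uses about 3.2 CPU hours, a job uses about 3.3 CPU minutes, and roughly 12 CPU hours are consumed per hour of user time.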
Thank you !
• J. Hernández-Sánchez, J.A. Grunchec and S. Knott. A web application to
perform linkage disequilibrium and linkage analyses on a computational
grid. Bioinformatics 25(11): 1377-1383 (2009).
• J.A. Grunchec, J. Hernández-Sánchez and S. Knott. SWARM: A metascheduler
to minimize job queuing duration in a Grid portal. Accepted by
the International Conference of Cluster and Grid Computing Systems, Oslo,
Norway, July 2009.
• G. Seaton, J. Hernandez, J.A. Grunchec, I. White, J. Allen, D.J. De Koning,
Wenhua Wei, D. Berry, C. Haley, S. Knott. GridQTL: a Grid portal for QTL
mapping of compute intensive datasets. 8th World Congress on Genetics
Applied to Livestock Production, August 13-18, 2006, Belo Horizonte, MG,
Brasil.
•Portal
http://cleopatra.cap.ed.ac.uk/gridsphere/gridsphere
http://gridqt1.cap.ed.ac.uk:8080/gridsphere/gridsphere
email : [email protected]