Transcript GridQTL
GridQTL: Using the NGS to map genes through a web portal
Jean-Alain Grunchec
University of Edinburgh

Plan
• GridQTL team and users
• Introduction to the GridQTL project
• Description of the computing infrastructure and software behind the scenes
• Short demonstration of the Grid-enabled service

GridQTL Team
* Sara Knott
+ Ian White
× Jules Hernandez-Sanchez
# Jean-Alain Grunchec
# Kashif Saleem
* Chris Haley
* Dirk-Jan de Koning
× Wenhua Wei
× Burak Karacaoeren
× Susan Rowe
* Jano van Hemert
# John Allen
Key: + Mathematician, × Biologist, # Computer scientist

GridQTL Users
• 443 registered external users in 44 countries
• 211 have used the core services, 67 have used LDLA

QTL mapping
• Aim: to detect and locate genes (QTL) having an effect on a quantitative trait
• Quantitative trait: a trait with a continuous measurement (size, weight, concentration)
• QTL (quantitative trait locus): a gene or DNA segment having an effect on a quantitative trait

Rationale for QTL analysis
• To understand genetic variation by dissecting complex traits
  • fundamental knowledge of gene actions and interactions
  • applications in agriculture
  • applications in medicine

[Figure: histogram of stature in women. Mean = 169.1, Std. Dev = 6.40, N = 1785.]

History: QTL Express
• Web portal to map QTL in experimental populations
• Based on Java servlets, running on a dedicated pool of 6 computers
• 100+ users
• Rising computational demand significantly degraded the quality of service: 6 computers are not enough!
Most recent developments in genetics
• New models are computationally intensive (100s of CPU hours per analysis)
• Potential for models that can be applied to complex pedigrees (real-life populations, e.g. LDLA)
• Potential for more complex genetic models (multiple QTLs, e.g. epistasis)
• Now feasible:
  • 100,000s of marker genotypes per individual
  • 10,000s of phenotypes
  • 1000s of individuals
• Current approaches may be inadequate to analyse the resulting large data sets
• High-throughput analyses: 10,000s of CPU hours per analysis

Increasing the computational capacity available to GridQTL
• 2006: Condor pool of 10 computers
• 2007: NGS-1 (500 CPUs)
• 2008: ECDF + NGS-2 (> 2600 CPUs)

Grid Infrastructure
[Diagram: National Grid Service sites (RAL, Cardiff, Manchester, Leeds, Westminster, Oxford, Lancaster) alongside local resources in Edinburgh (ECDF, the PETRA Condor pool and the server). CPU counts shown on the diagram: 256, 256, 256, 1456, 10, 128, 256. Platforms: Linux (FC, RHAT, SUSE), Solaris, IRIX.]

Software description
[Diagram: components shown are the browser (JSR-168 portlet, JSP, HTML, AJAX: JavaScript / servlet), the server running GridSphere / Tomcat with the SWARM meta-scheduler, NGS/ECDF (globus, gsissh) and the Condor pool (ssh).]

How do we use the NGS?
• Our users log on to the website and are identified by their unique user name.
• They run queries by clicking buttons on the web interface.
• These buttons run Java functions that call Globus toolkit routines.
• The NGS authorises these routines by recognising an NGS portal certificate, which identifies the web server and its administrator.
• The administrator has to put an accounting system in place for usage audits.
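As a rough illustration of the step where portal code turns a user's request into a Globus call, the sketch below assembles a globus-job-submit command line in bash and prints it instead of running it. The function name, host, paths and job identifier are hypothetical placeholders. The flags are limited to ones that appear in the GridQTL job submission example in this talk, and the real portal issues these calls from Java, not from a shell script.

```shell
#!/usr/bin/env bash
# Sketch only: prints a globus-job-submit command line, it does not submit anything.
# Every value passed in is a placeholder, not a real GridQTL setting.

build_submit_command() {
    local contact="$1"   # job manager contact string, e.g. host/jobmanager-pbs
    local workdir="$2"   # working directory on the remote cluster
    local job_id="$3"    # hypothetical job identifier used in the log file names
    local maxtime="$4"   # requested maximum run time
    local script="$5"    # script to run on the cluster
    local args="$6"      # single argument string handed to that script

    # -np 2 and -host-count 1 are fixed here, as in the talk's example.
    local -a cmd=(
        globus-job-submit "$contact"
        -stderr "$workdir/.err.$job_id.txt"
        -stdout "$workdir/.out.$job_id.txt"
        -maxtime "$maxtime"
        -np 2
        -host-count 1
        -dir "$workdir"
        -l "$script" "$args"
    )

    # %q quotes each word so the printed line can be pasted into a shell safely.
    printf '%q ' "${cmd[@]}"
    printf '\n'
}

# Example call with made-up values.
build_submit_command "cluster.example.org/jobmanager-pbs" "/home/portal/jobs" \
    "0001" 18 "/home/portal/jobs/run_analysis.sh" "4;4702;527;"
```

Building the command as an array and quoting it only when printing keeps the semicolon-separated argument string intact as a single word, which matters because the analysis script receives all of its job parameters through that one argument.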
Job submission
    globus-job-submit \
        -env JAVA_HOME=/usr/local/Cluster-Apps/java/jdk1.6.0_01 \
        -env host_name=ngs.wmin.ac.uk \
        ngs.wmin.ac.uk/jobmanager-pbs \
        -stderr /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/.err.0000004696.4702.527.txt \
        -stdout /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/.out.0000004696.4702.527.txt \
        -maxtime 18 -np 2 -host-count 1 \
        -dir /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere \
        -x "&(jobtype=single)(minMemory=1100)" \
        -l /home/ngs/ngs0739/production_new_modules/LDLA_GridSphere/LDLA_GridSphere.sh \
        "4;4702;527;0000004696;achatzipli;0;server.cap.ed.ac.uk;qtlportlets/public/01573ec42120bf1301218d475e67021b/;"

Profiling the application
• Profiling software: gprof
• Your own script:
    #!/bin/csh
    # Start the application in the background.
    ./myapplication.sh &
    # Hand its process ID to the profiler, also in the background.
    ./profiler.csh $! &
    # Wait for both to finish.
    wait
• Memory: ps -o vsize,comm,user
• Disk usage can also be monitored (some NGS clusters have large temporary storage facilities)
• Computational load: uptime

Failures happen!
• Output failures: data corrupted during file transfer or on the nodes, out of memory, etc.
• Duration failures: jobs terminated because their run time exceeded the reserved duration
• Submission failures: network failure during job submission
• Server failures: partly handled

Linkage Disequilibrium Linkage Analysis
• Good for complex pedigrees
• For instance, a population of feral sheep from St Kilda
• Or others… even plants (diploids)
• Basically good for populations which would be too expensive (or unethical) to breed for experimental purposes
• 3,447 analyses run (198,142 jobs on the Grid)
  • 11,005 hours of CPU time
  • 900 hours of user time

Thank you!
• J. Hernández-Sánchez, J.A. Grunchec and S. Knott. A web application to perform linkage disequilibrium and linkage analyses on a computational grid. Bioinformatics 25(11): 1377-1383 (2009).
• J.A. Grunchec, J. Hernández-Sánchez and S. Knott. SWARM: A metascheduler to minimize job queuing duration in a Grid portal. Accepted by the International Conference of Cluster and Grid Computing Systems, Oslo, Norway, July 2009.
• G. Seaton, J. Hernandez, J.A. Grunchec, I. White, J. Allen, D.J. De Koning, Wenhua Wei, D. Berry, C. Haley, S. Knott. GridQTL: a Grid portal for QTL mapping of compute intensive datasets. 8th World Congress on Genetics Applied to Livestock Production, August 13-18, 2006, Belo Horizonte, MG, Brazil.
• Portal:
  http://cleopatra.cap.ed.ac.uk/gridsphere/gridsphere
  http://gridqt1.cap.ed.ac.uk:8080/gridsphere/gridsphere