Transforming life science research with advanced Information Technology at Indiana University Craig A. Stewart [email protected] University Information Technology Services, Indiana University © Copyright Trustees of Indiana University.
License Terms
• Please cite this presentation as: Stewart, C.A. Transforming life science research with advanced Information Technology at Indiana University. 2004. Presentation. Presented at: IBM Life Sciences Symposium (Pallisades, NY, 31 May 2004). Available from: http://hdl.handle.net/2022/14785
• Portions of this document that originated from sources outside IU are shown here and used by permission or under licenses indicated within this document.
• Items indicated with a © or denoted with a source url are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are copyright 2004 by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

Outline
• IU overview
• Data, data grids, and life sciences (Centralized Life Science Data Service)
• Computational grids and life science research (HPC Challenge Award at SC2003)
• Looking forward – Institute of Innovation projects
• Strategy and execution: how did we get here?
• Delivering benefits

I-Light & Abilene
• I-Light
– Connects IUB, IUPUI, and Purdue University, to be extended within Indiana
– First higher-ed-owned statewide network in the nation
– The networking infrastructure for collaboration of many sorts
• Abilene
– Nation’s (current) highest-speed national network
– NOC in Indianapolis

IU in a nutshell
• $2B annual budget
• One university with
• 8 campuses
• 90,000 students
• 3,900 faculty
• 878 degree programs
• Nation’s 2nd largest school of medicine
• CIO: Vice President Michael A. McRobbie
• ~$100M annual IT budget
• Indiana Genomics Initiative - $105M Lilly Endowment, Inc. grant

IBM Research SP (Aries/Orion Complex)
• 1.005 TeraFLOPS. 1st university-owned supercomputer in the US to exceed 1 TFLOPS peak theoretical processing capacity.
• Geographically distributed at IUB and IUPUI
• Initially 50th, now 170th in Top 500 supercomputer list
• An enabler of collaborative research using very large scale computations
Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.

AVIDD
• Analysis and Visualization of Instrument-Driven Data
• Distributed Linux cluster. Three locations: IUN, IUPUI, IUB
• 2.164 TFLOPS (peak theoretical), 0.5 TB RAM, 10 TB disk
• First distributed Linux cluster to achieve more than 1 TFLOPS on Linpack benchmark

Massive Data Storage System
• Reliable and robust
• HPSS (High Performance Storage System)
• Automatic replication of data between Indianapolis and Bloomington, via I-Light.
• 180 TB capacity with existing tapes; total capacity of 2.4 PB.
• >100 TB currently in use; >5 TB for biomedical data
Photo: Tyagan Miller. May be reused by IU for noncommercial purposes.
To license for commercial use, contact the photographer.

John-E-Box
Design licensed to central Indiana manufacturer

Data, data grids, and life sciences

Federated Databases
• Federated database approach focuses on establishing glue between existing databases
• “Private” databases stay where they are – under local control
• “Public” databases may be replicated locally for performance
• Queries are entered as SQL, and the Federated Database System knows enough about the structure of the databases to select data from the right sources
• Integrate the right data in the right way
[Diagram labels: Lab Results, Clinical Data, Toxicity Data, You!]

IBM’s Federated Database approach
• Based on DiscoveryLink
• Wrappers – programs that sit between a database and DiscoveryLink, allowing on-the-fly queries by DiscoveryLink against the database
– Database registration. Each particular database must be registered once
– Accessing a calculation as one might a database (BLAST)
• Parsers – programs that import data from one format into another that permits higher-performance queries
• Accessing a calculation from within a database query (BLAST, HMMER)
• Accessing a database from within a calculation (SAS)

Microarray Data Portal
• Web application and database designed for annotation and analysis of microarray experiments.
• Annotation: designed so that users set up the experimental design first, minimizing the time spent on sample entry while still capturing the essential information
• Analysis
– Allows user to partition data into groups based on their annotation.
– Extensive filtering, search, and display options
– T-test, clustering, SVD, etc.
– Allows different views of data based on informatics associated with the genes (e.g. KEGG, GO, chromosome location)
The Microarray Data Portal was created by the Center for Medical Genomics at IU School of Medicine. Supported in part by the 21st Century Research & Technology Fund and the Indiana Genomics Initiative.
The Indiana Genomics Initiative is supported in part by a grant from Lilly Endowment, Inc.

Hereditary Diseases and Family Studies Division, Dept. of Medical and Molecular Genetics, IU School of Medicine. Supported in part by NIH R01 NS37167.

Under development: Linking Cancer data within IUSM
• Thousands of cancer and normal tissue samples
• De-identified, select phenotype data
• Database system that manages IRB approvals
• DiscoveryLink is planned ‘glue’ to tie tissue data to data generated by other IUSM cores

Protein identification
• Problem: categorize thousands of protein identifications from proteomic experiments
• Planned solution: use CLSD interface with LocusLink to obtain information about proteins
• Data generation:
– Peptide extracts from experiment
– Separate peptides using Liquid 2D Chromatography
– Identify mass/charge using mass spectrometer
– Creates raw data (LOTS of it!)
• Data analysis:
– SAS, using queries into CLSD

HPC Challenge @ SC2003
• Are hexapods a single evolutionary group?
• Are ecdysozoans a single evolutionary group?

Computational grids and life science research (HPC Challenge Award at SC2003)

A partial bestiary
All organism illustrations copyright Jennifer Fairman, 2003.
www.fairmanstudios.com Used by agreement

Software and data analysis
• Non-grid preparatory work
– Download sequences from NCBI (67 taxa, 12,162 bp, mitochondrial genes for 12 proteins)
– Align sequences with Multi-Clustal
– Determine rate parameters with TreePuzzle
• Grid preparatory work
– Analyze performance of fastDNAml with Vampir
– Meetings via Access Grid & COVISE
• The grid software
– PACX-MPI – Grid/MPI middleware (HLRS – High Performance Computing Center Stuttgart)
– COVISE – Collaboration and visualization (HLRS)
– fastDNAml – Maximum Likelihood phylogenetics (IU)

fastDNAml
• ML analysis of phylogenetic trees based on DNA sequences
• Foreman/worker MPI program
• Heuristic search for best trees
• For 67 taxa: 2.12 × 10^109 trees
• Goal: 300 bootstraps, 10 jumbles per bootstrap – 3000 executions (more than 3x typical!)

It worked!
• Grid of 6 continents, 5 functional units, 6+ vendors, 8 types of systems, 641 processors… all analyzing evolutionary relationships of arthropods
• HPC Challenge Award winner at SC03 conference – demonstrates new capabilities in grid computing while advancing research in evolutionary biology

Looking forward – Institute of Innovation projects
• IBM Life Sciences Institute of Innovation in 3-D Cell Modeling
• Center for Cell and Virus Theory
• Biocomplexity Institute (talk tomorrow by Debasis Dan)
• Model repository
• Markup Languages and Cell Models
• To the TeraGrid (and beyond!)

Strategy and execution: how did we get here?

IU’s IT Strategic Plan
• Real plans and real execution of those plans
• Strong focus on centralization and enablement of capability computing
• Economy of scale
• Advantages of centralization while minimizing disadvantages
• Engagement with researchers and vendors in projects and grants

Support strategy
• CS research is wonderful, but what biomedical researchers care about is tools!
• Considerable effort is put into seeking out collaborators and people we can assist
• If a particular application is useful, it doesn’t matter whether it seems sophisticated to a computer scientist
• When a problem is sophisticated we need the computer scientists!
• Gradual enhancement of community

Collaboration and Outreach
• AVIDD – 20 faculty, dozens of staff, $1.8M in NSF funding
• Research in Indiana – 3 universities, dozens of faculty
• IP-Grid – 2 universities, dozens of faculty, $3M in NSF funding
• INGEN – 100+ faculty, hundreds of staff, $105M funding from Lilly Endowment, Inc.
• In-state, national, and international outreach are all essential

Delivering Benefits
• 9 inventions disclosed since 1997; 6 of these are open source software (BSD-like). Participation in the community behind community codes is essential! [IBM has supported this strongly]
• John-E-Box design licensed to a central Indiana firm for commercial production!
• A software product has just been commercialized
• Results explainable to a voter are essential for continued public support!

For further information
• fastDNAml: http://www.indiana.edu/~rac/hpc/fastDNAml/
• about.uits.iu.edu/divisions/rac/index.html
• about.uits.iu.edu/divisions/rac/pubsstaff.html
• ingen.iu.edu
• it.iu.edu

Acknowledgments
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. IU’s life science research has benefited from collaboration with IBM researchers since 1997.
• This research was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
• Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach
• For HPC Challenge: thanks to the SciNet team, SC2003 organizers, HLRS, and especially Prof. Dr. Michael Resch & Dr. Matthias Müller.

Thank you
Any questions?
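
Note on the fastDNAml numbers
The tree count and run count quoted on the fastDNAml slide can be checked with a few lines of Python. This is only a sketch: it assumes the slide is counting unrooted, strictly bifurcating tree topologies, for which the standard closed form is (2n-5)!! for n taxa. The function names below are illustrative and are not part of fastDNAml.

```python
from math import factorial


def unrooted_tree_count(n_taxa: int) -> int:
    """Count unrooted, strictly bifurcating trees on n_taxa labeled tips.

    Assumption: this is the quantity the slide refers to.
    Uses (2n-5)!! = (2k)! / (2**k * k!) with k = n - 2, valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    k = n_taxa - 2
    return factorial(2 * k) // (2 ** k * factorial(k))


def planned_runs(bootstraps: int, jumbles_per_bootstrap: int) -> int:
    """Total fastDNAml executions: every bootstrap is run once per jumble."""
    return bootstraps * jumbles_per_bootstrap


if __name__ == "__main__":
    trees = unrooted_tree_count(67)
    # Report the order of magnitude via the digit count of the exact integer.
    print(f"67 taxa: a {len(str(trees))}-digit number of trees")
    print(f"planned executions: {planned_runs(300, 10)}")
```

For 67 taxa the result is a 110-digit integer, roughly 2.1 × 10^109, in line with the figure on the slide. A search space of that size cannot be enumerated, which is why the slide describes a heuristic search rather than an exhaustive one.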