The Centralized Life Sciences Data Service at Indiana University

Craig A. Stewart, Director, Research and Academic Computing; Director, Information Technology Core, Indiana Genomics Initiative
Andrew Arenson, Principal INGEN Data Specialist
Anurag Shankar, Manager, Distributed Storage Systems Group
[email protected] [email protected] [email protected]

License terms
• Please cite as: Stewart, C.A., A. Arenson and A. Shankar. The Centralized Life Sciences Data Service at Indiana University. 2003. Presentation. Presented at: IBM/Lilly/IU Data Integration Conference (Indianapolis, IN, 17 Jan 2003). Available from: http://hdl.handle.net/2022/15216
• Except where otherwise noted, by inclusion of a source url or some other note, the contents of this presentation are © by the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.

http://www.ncbi.nlm.nih.gov

The data revolution in biology
• The key question: how can researchers effectively access diverse data resources, some public, some not, in a fashion that suits the research styles and needs of the biomedical researcher?
• http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Outline
• Some background about IU
• Overview of IU advanced IT environment
– Networks
– Storage
– Computation
• The Centralized Life Science Data Service
• Making advanced IT useful to biomedical researchers at IU
• Questions?

IU in a nutshell
• $2B Annual Budget
• One university with
– 8 campuses
– 90,000 students
– 3,900 faculty
– 878 degree programs
• Nation’s 2nd largest school of medicine
• CIO: Vice President Michael A. McRobbie
• ~$100M annual IT budget
• Indiana Genomics Initiative: $105M Lilly Endowment, Inc. grant

Network Environment
• Abilene National Network
• I-light State Network: connects IU’s campuses in Bloomington and Indianapolis, and Purdue University (West Lafayette), to each other and to Abilene

Massive Data Storage System
• Easy to use, no cost to users
• Reliable and robust
• HPSS (High Performance Storage System)
• Automatic replication of data between Indianapolis and Bloomington, via I-light
• 180 TB capacity with existing tapes; total capacity of 2.4 PB
• 100 TB currently in use; >5 TB for biomedical data
Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.

IBM Research SP (Aries/Orion Complex)
• 1.005 TeraFLOPS. 1st University-owned supercomputer in US to exceed 1 TFLOPS peak theoretical processing capacity
• Geographically distributed at IUB and IUPUI
• Initially 50th, now 170th in Top 500 supercomputer list
• An enabler of collaborative research using very large scale computations
Photo: Tyagan Miller. May be reused by IU for noncommercial purposes. To license for commercial use, contact the photographer.

AVIDD
• Analysis and Visualization of Instrument-Driven Data
• Distributed Linux cluster. Three locations: IUN, IUPUI, IUB
• 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk
• First distributed Linux cluster to achieve more than 1 TFLOPS on Linpack benchmark – currently 50th on Top500 list

All this hardware is nice… but how does it help me do my research?
• Goal set by the IU School of Medicine: any researcher should be able to transparently access, from her/his workstation, data from all relevant public data sources and from all internal data sources that the researcher has rights to access
• Our choice of tool: DiscoveryLink
• The system built on DiscoveryLink is called the Centralized Life Science Data Service

IBM’s Federated Database approach
• Federated database approach focuses on establishing glue between existing databases
• “Private” databases stay where they are – under local control
• “Public” databases may be replicated locally for performance
• Queries are entered as SQL, and the Federated Database System knows enough about the structure of the databases to select data from the right sources

IBM’s Federated Database approach
• Wrappers – programs that sit between a database and DiscoveryLink, allowing on-the-fly queries by DL from the database
– No loss of local control
– Database registration. Each particular database must be registered once
– Accessing a calculation as one might a database (BLAST)
• Parsers – programs to import data from one format into another that permits higher-performance queries
• Accessing a database from within a calculation (SAS)

More details
• Wrappers exist for:
– Relational databases: Other DB2 instances, Informix, Oracle, Sybase, SQL Server, MySQL
– Non-relational databases: Documentum, Excel, Flat files, XML, BLAST, HMMER, Entrez API (PubMed & Nucleotide)
• Parsers exist for: BIND, ENZYME, ePCR, HomoloGene, KEGG PATHWAY, LIGAND, LocusLink, SGD, UniGene
• Parsers and wrappers are straightforward to write. Parsers: days to weeks; wrappers: ~6 person-months

The idealized view of DiscoveryLink Architecture
[Diagram labels: Lab Results, Clinical Data, Toxicity Data, DL]

Some example applications

Microarray Data Portal
• Web application and database designed for annotation and analysis of microarray experiments.
• Annotation: designed so that users set up the experimental design first, minimizing the time spent on sample entry while still capturing the essential information
• Analysis
– Allows user to partition data into groups based on their annotation
– Extensive filtering, search, and display options
– T-test, Clustering, SVD, etc.
– Allows different views of data based on informatics associated with the genes (e.g. KEGG, GO, Chromosome Location)

Annotation

KEGG pathway information

GO category filtering of genes

Clustering (k-means, also EM, Hierarchical)

Online Biological Data Retrieval
• Web queries used to quickly identify SNPs and Genes in specific regions and return information about those identified SNPs and Genes.
• Used by the Hereditary Diseases and Family Studies Division of the Medical and Molecular Genetics Department of the Indiana University School of Medicine.
• Live demo (hopefully) http://www.medgen.iupui.edu/binf/cgiproto.html
• Marker1: D5S2057
• Marker2: D5S436
• Filter on tissue expression: Muscle
• < 60 seconds vs 10 hours

Informatics E-mail Server
• Web application allowing users of the Center for Medical Genomics at Indiana University School of Medicine to request genomic information for many genes or sequences and receive that information via email.
• Screen shot

LabRat
• LIMS that allows users to collect related genomic information for known sequences.
• Used internally by customers of the Center for Medical Genomics at Indiana University School of Medicine.
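
The region query described under Online Biological Data Retrieval, combined with the federated idea of joining a locally controlled source to a replicated public one, might be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an attached second database as a stand-in for a federation, not DiscoveryLink or DB2. The table names, column names, marker names (MARKER_A, MARKER_B), positions, and gene names are all hypothetical toy values and are not taken from the presentation.

```python
import sqlite3


def build_toy_federation():
    """Create two separate in-memory SQLite databases reachable from one connection.

    Assumption: 'main' stands in for a lab's private database and
    'public_db' for a locally replicated public source. All rows are toy data.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS public_db")
    conn.execute(
        "CREATE TABLE main.markers (name TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)"
    )
    conn.execute(
        "CREATE TABLE public_db.snps "
        "(snp_id TEXT, chrom TEXT, pos INTEGER, gene TEXT, tissue TEXT)"
    )
    conn.executemany(
        "INSERT INTO main.markers VALUES (?, ?, ?)",
        [("MARKER_A", "5", 1000), ("MARKER_B", "5", 5000)],
    )
    conn.executemany(
        "INSERT INTO public_db.snps VALUES (?, ?, ?, ?, ?)",
        [
            ("snp1", "5", 1500, "GENE1", "Muscle"),
            ("snp2", "5", 3000, "GENE2", "Liver"),
            ("snp3", "5", 4500, "GENE3", "Muscle"),
            ("snp4", "5", 9000, "GENE4", "Muscle"),  # outside the marker interval
            ("snp5", "7", 2000, "GENE5", "Muscle"),  # different chromosome
        ],
    )
    return conn


def snps_between(conn, marker1, marker2, tissue):
    """Return (snp_id, gene, pos) rows lying between two markers, filtered by tissue.

    A single SQL statement joins the 'private' markers table to the
    'public' snps table, which is the pattern a federated query follows.
    """
    query = """
        SELECT s.snp_id, s.gene, s.pos
        FROM public_db.snps AS s
        JOIN main.markers AS m1 ON m1.name = ?
        JOIN main.markers AS m2 ON m2.name = ?
        WHERE s.chrom = m1.chrom
          AND s.chrom = m2.chrom
          AND s.pos BETWEEN MIN(m1.pos, m2.pos) AND MAX(m1.pos, m2.pos)
          AND s.tissue = ?
        ORDER BY s.pos
    """
    return conn.execute(query, (marker1, marker2, tissue)).fetchall()


if __name__ == "__main__":
    conn = build_toy_federation()
    for row in snps_between(conn, "MARKER_A", "MARKER_B", "Muscle"):
        print(row)
```

The point of the sketch is that the caller writes one SQL statement and never has to know which database holds which table; in the service described here, that routing is the job of the wrappers and the federated database system.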

Two new applications under development

Linking Cancer data within IUSM
• Thousands of cancer and normal tissue samples
• De-identified, select phenotype data
• Database system that manages IRB approvals
• DiscoveryLink is planned ‘glue’ to tie tissue data to data generated by other IUSM cores

Protein identification
• Problem: categorize thousands of protein identifications from proteomic experiments
• Planned solution: Use CLSD interface with LocusLink to obtain information about proteins
• Data Generation:
– Peptide Extracts from experiment
– Separate peptides using Liquid 2D Chromatography
– Identify Mass/Charge using Mass Spectrometer
– Creates raw data (LOTS of it!)

[Workflow diagram labels: Raw Data; NCBI (RefSeq); CLSD LocusLink Schema; Human FASTA; Protein Ontological Information; Additions / Modifications (manual); Software Analysis (SEQUEST / Protein Prophet); Potential Protein Identifications or Quantifications; Data Processing (custom software (Sizemore)); Potential Protein IDs by Ontological Information]

The key benefits to IU’s use of DiscoveryLink
• Significant operational benefits (downloading data exactly once)
• With DiscoveryLink and the CLSD as a base, it’s quite straightforward for a programmer within a lab to build a significant application based on use of CLSD and DiscoveryLink (no marathon browsing)
• Power of accessing calculations (BLAST) within a database query, and accessing data from within common application programs (SAS)
• New opportunities for discovery within IUSM (interesting joins of data)
• New opportunities without destroying local policy autonomy

A few general thoughts on advanced information technologies for biomedical researchers

IU’s strategy
• CS research is wonderful, but what biomedical researchers care about is tools!
• Considerable effort is put into seeking out collaborators and people we can assist
• If a particular application is useful it doesn’t matter if it seems sophisticated to a computer scientist

Indiana Genomics Initiative Information Technology
• 136 users of IU’s supercomputers
• 70 users of massive data storage system – 5 TB stored
• Six new software packages created or enhanced, more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open source software as direct result of INGEN. Opportunities for tech transfer!
• The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc. > 90% satisfaction with UITS services by IUSM

Acknowledgments
• This research was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University, and in particular by IU’s relationship with IBM as an IBM Life Sciences Institute of Innovation.
• This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
• Informatics E-mail Server supported in part by the 21st Century Research & Technology Fund
• Online Biological Data Retrieval system supported in part by National Institutes of Health R01 NS37167

Acknowledgments, cont’d
• UITS Research and Academic Computing Division managers: Mary Papakhian, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock
• Center for Medical Genomics: Matthew J. Stephens, Marcus Breese, Jeanette McClintick, Howard Edenberg, Matt Grow
• Harrington Lab: Lee Ott, Alecia Sizemore
• Goebl Lab: Josh Heyen
• Wang Lab
• UITS Senior Management: Associate Vice President and Dean Bradley Wheeler, Associate Vice President and Dean (Retired) Christopher Peebles, RAC (Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda Lingwall, W. Les Teach

For additional information
• about.uits.iu.edu/divisions/rac/index.html
• about.uits.iu.edu/divisions/rac/pubsstaff.html
• ingen.iu.edu
• it.iu.edu

Thank you! Questions?