The Centralized Life Sciences Data Service at Indiana University

Craig A. Stewart
Director, Research and Academic Computing
Director, Information Technology Core, Indiana Genomics Initiative
[email protected]

Andrew Arenson
Principal INGEN Data Specialist
[email protected]

Anurag Shankar
Manager, Distributed Storage Systems Group
[email protected]
1
License terms
• Please cite as: Stewart, C.A., A. Arenson and A. Shankar. The
Centralized Life Sciences Data Service at Indiana University. 2003.
Presentation. Presented at: IBM/Lilly/IU Data Integration Conference
(Indianapolis, IN, 17 Jan 2003). Available from:
http://hdl.handle.net/2022/15216
• Except where otherwise noted, by inclusion of a source url or some
other note, the contents of this presentation are © by the Trustees of
Indiana University. This content is released under the Creative
Commons Attribution 3.0 Unported license
(http://creativecommons.org/licenses/by/3.0/). This license includes
the following terms: You are free to share – to copy, distribute and
transmit the work and to remix – to adapt the work under the following
conditions: attribution – you must attribute the work in the manner
specified by the author or licensor (but not in any way that suggests
that they endorse you or your use of the work). For any reuse or
distribution, you must make clear to others the license terms of this
work.
2
3
http://www.ncbi.nlm.nih.gov
The data revolution in biology
The key question: how
can researchers
effectively access
diverse data resources,
some public, some not,
in a fashion that suits
the research styles and
needs of the biomedical
researcher?
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
4
Outline
• Some background about IU
• Overview of IU advanced IT environment
– Networks
– Storage
– Computation
• The Centralized Life Sciences Data Service
• Making advanced IT useful to biomedical
researchers at IU
• Questions?
5
IU in a nutshell
• $2B Annual Budget
• One university with
– 8 campuses
– 90,000 students
– 3,900 faculty
– 878 degree programs
• Nation’s 2nd largest school of
medicine
• CIO: Vice President Michael A.
McRobbie
• ~$100M annual IT budget
• Indiana Genomics Initiative - $105M
Lilly Endowment, Inc. grant
6
Network Environment
Abilene National Network
I-light State Network
Connects IU’s campuses in Bloomington and
Indianapolis and Purdue University (West
Lafayette) to each other and to Abilene
7
Massive Data Storage System
• Easy to use, no cost to
users
• Reliable and robust
• HPSS (High Performance
Storage System)
• Automatic replication of
data between
Indianapolis and
Bloomington, via I-light.
• 180 TB capacity with
existing tapes; total
capacity of 2.4 PB.
• 100 TB currently in use;
>5 TB for biomedical
data
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
8
IBM Research SP
(Aries/Orion Complex)
• 1.005 TeraFLOPS. 1st
University-owned
supercomputer in US to
exceed 1 TFLOPS peak
theoretical processing
capacity.
• Geographically
distributed at IUB and
IUPUI
• Initially 50th, now
170th in Top 500
supercomputer list
• An enabler of
collaborative research
using very large scale
computations
Photo: Tyagan Miller. May be reused by IU for noncommercial
purposes. To license for commercial use, contact the photographer
9
AVIDD
• Analysis and
Visualization of
Instrument-Driven
Data
• Distributed Linux
cluster. Three
locations: IUN, IUPUI,
IUB
• 2.164 TFLOPS, 0.5 TB
RAM, 10 TB Disk
• First distributed Linux
cluster to achieve more
than 1 TFLOPS on
Linpack benchmark –
currently 50th on
Top500 list
10
All this hardware is nice… but
how does it help me do my
research?
• Goal set by the IU School of Medicine: any
researcher should be able to transparently access
from her/his workstation data from all relevant
public data sources and all internal data sources
that researcher has rights to access
• Our choice of tool: DiscoveryLink
• The system built on DiscoveryLink is called the
Centralized Life Sciences Data Service (CLSD)
11
IBM’s Federated Database
approach
• Federated database approach focuses on
establishing glue between existing databases
• “Private” databases stay where they are – under
local control
• “Public” databases may be replicated locally for
performance
• Queries are entered as SQL, and the Federated
Database System knows enough about the
structure of the databases to select data from the
right sources
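A minimal sketch of the federated idea, not DiscoveryLink itself: Python's built-in sqlite3 module attaches two separate databases to one connection so that a single SQL statement can join across them. The database aliases (lab, pub), the tables, and every row are invented for illustration.

```python
import sqlite3

# One connection plays the role of the federated server. Two attached
# in-memory databases stand in for a "private" lab database and a locally
# replicated "public" database. All names and rows here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lab")
conn.execute("ATTACH DATABASE ':memory:' AS pub")

conn.execute("CREATE TABLE lab.expression (gene_id TEXT, fold_change REAL)")
conn.execute("CREATE TABLE pub.gene_info (gene_id TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO lab.expression VALUES (?, ?)",
    [("G1", 2.5), ("G2", 0.8), ("G3", 3.1)],
)
conn.executemany(
    "INSERT INTO pub.gene_info VALUES (?, ?)",
    [("G1", "hypothetical kinase"), ("G3", "hypothetical transporter")],
)


def upregulated_genes(connection, threshold):
    """Join private measurements with public annotation in one SQL query."""
    query = """
        SELECT e.gene_id, e.fold_change, g.description
        FROM lab.expression AS e
        JOIN pub.gene_info AS g ON g.gene_id = e.gene_id
        WHERE e.fold_change > ?
        ORDER BY e.gene_id
    """
    return connection.execute(query, (threshold,)).fetchall()


rows = upregulated_genes(conn, 2.0)
for row in rows:
    print(row)
```

The user writes one query and does not need to know which source holds which table. A real federated system would also have to route the query to remote servers, which this sketch does not attempt.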
12
IBM’s Federated Database
approach
• Wrappers
– A program that sits between a database and DiscoveryLink (DL),
allowing DL to query the database on the fly
– No loss of local control
– Database registration. Each particular database must be
registered once
– Accessing a calculation as one might a database (BLAST)
• Parsers
– Programs to import data from one format into another that
permits higher-performance queries
• Accessing a database from within a calculation
(SAS)
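The parser idea can be sketched as follows, assuming a made-up tab-delimited flat-file layout (entry ID, name, pathway). The records are loaded into an indexed relational table so that later lookups are ordinary SQL. The file format, table name, and records are hypothetical and do not represent any of the parsers listed in this presentation.

```python
import sqlite3

# Hypothetical flat-file content: one "id<TAB>name<TAB>pathway" record per line.
FLAT_FILE = (
    "E1\tenzyme_a\tpathway_x\n"
    "E2\tenzyme_b\tpathway_x\n"
    "E3\tenzyme_c\tpathway_y\n"
)


def parse_records(text):
    """Yield (entry_id, name, pathway) tuples, skipping blank lines."""
    for line in text.splitlines():
        if not line.strip():
            continue
        entry_id, name, pathway = line.split("\t")
        yield entry_id, name, pathway


def load(text):
    """Import the parsed records into an indexed in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (entry_id TEXT PRIMARY KEY, name TEXT, pathway TEXT)"
    )
    # The index is what makes queries by pathway fast once the table is large.
    conn.execute("CREATE INDEX idx_pathway ON entries (pathway)")
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", parse_records(text))
    return conn


conn = load(FLAT_FILE)
in_pathway_x = [
    row[0]
    for row in conn.execute(
        "SELECT entry_id FROM entries WHERE pathway = ? ORDER BY entry_id",
        ("pathway_x",),
    )
]
print(in_pathway_x)
```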
13
More details
• Wrappers exist for:
– Relational databases: Other DB2 instances, Informix,
Oracle, Sybase, SQL Server, MySQL
– Non-relational databases: Documentum, Excel, Flat files,
XML, BLAST, HMMER, Entrez API (PubMed & Nucleotide)
• Parsers exist for: BIND, ENZYME, ePCR,
HomoloGene, KEGG PATHWAY, LIGAND, LocusLink,
SGD, UniGene
• Parsers and wrappers are straightforward to write.
Parsers: days to weeks; wrappers: about 6 person-months
14
The idealized view of
DiscoveryLink Architecture
[Figure labels: DL; Lab Results; Clinical Data; Toxicity Data]
15
16
17
Some example applications
18
Microarray Data Portal
• Web application and database designed for
annotation and analysis of microarray experiments.
• Annotation: users set up the experimental design
first, which minimizes the time spent on sample
entry while still capturing the essential
information
• Analysis
– Allows user to partition data into groups based
on their annotation.
– Extensive filtering, search, and display options
– T-test, Clustering, SVD, etc.
– Allows different views of data based on
informatics associated with the genes (e.g.
KEGG, GO, Chromosome Location)
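The partition-then-test step listed above can be sketched as below, assuming each sample carries an annotation dictionary and per-gene expression values. The data layout, the annotation key, and all numbers are invented. Welch's unequal-variance t statistic is used here as one common choice; the slides do not say which t-test variant the portal applies.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance

# Hypothetical annotated samples: annotation fields plus per-gene expression.
samples = [
    {"annotation": {"treatment": "control"}, "expression": {"gene_1": 1.0, "gene_2": 5.0}},
    {"annotation": {"treatment": "control"}, "expression": {"gene_1": 2.0, "gene_2": 6.0}},
    {"annotation": {"treatment": "control"}, "expression": {"gene_1": 3.0, "gene_2": 7.0}},
    {"annotation": {"treatment": "treated"}, "expression": {"gene_1": 4.0, "gene_2": 6.0}},
    {"annotation": {"treatment": "treated"}, "expression": {"gene_1": 5.0, "gene_2": 5.0}},
    {"annotation": {"treatment": "treated"}, "expression": {"gene_1": 6.0, "gene_2": 7.0}},
]


def partition_by(sample_list, annotation_key):
    """Group expression profiles by the value of one annotation field."""
    groups = defaultdict(list)
    for sample in sample_list:
        groups[sample["annotation"][annotation_key]].append(sample["expression"])
    return dict(groups)


def welch_t(group_a, group_b):
    """Welch's t statistic for two lists of numbers (sample variances)."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def t_by_gene(profiles_a, profiles_b):
    """Compute the t statistic for every gene present in the first profile."""
    return {
        gene: welch_t([p[gene] for p in profiles_a], [p[gene] for p in profiles_b])
        for gene in profiles_a[0]
    }


groups = partition_by(samples, "treatment")
t_values = t_by_gene(groups["control"], groups["treated"])
print(t_values)
```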
19
Annotation
20
KEGG pathway information
21
GO category filtering of genes
22
Clustering (k-means, also EM, Hierarchical)
23
Online Biological Data
Retrieval
• Web queries used to quickly identify SNPs and
Genes in specific regions and return information
about those identified SNPs and Genes.
• Used by the Hereditary Diseases and Family Studies
Division of the Medical and Molecular Genetics
Department of the Indiana University School of
Medicine.
• Live demo (hopefully)
http://www.medgen.iupui.edu/binf/cgiproto.html
• Marker1: D5S2057
• Marker2: D5S436
• Filter on tissue expression: Muscle
• < 60 seconds vs 10 hours
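A region query of this kind might look like the sketch below, which finds genes overlapping the interval between two markers and filters by tissue. The schema, marker names, gene symbols, and coordinates are all invented for illustration; they are not the markers or data used in the live demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE markers (name TEXT PRIMARY KEY, chrom TEXT, pos INTEGER);
    CREATE TABLE genes (symbol TEXT, chrom TEXT, start_pos INTEGER,
                        end_pos INTEGER, tissue TEXT);
    -- Hypothetical coordinates, chosen only to exercise the query.
    INSERT INTO markers VALUES ('marker_A', 'chr5', 1000), ('marker_B', 'chr5', 5000);
    INSERT INTO genes VALUES
        ('gene_1', 'chr5', 1500, 2000, 'muscle'),
        ('gene_2', 'chr5', 3000, 3500, 'liver'),
        ('gene_3', 'chr5', 6000, 7000, 'muscle'),
        ('gene_4', 'chr5', 4800, 5200, 'muscle'),
        ('gene_5', 'chr1', 1500, 2000, 'muscle');
    """
)


def genes_between(connection, marker1, marker2, tissue):
    """Return symbols of genes overlapping the region bounded by two markers."""
    query = """
        SELECT g.symbol
        FROM genes AS g
        JOIN markers AS m1 ON m1.name = :m1
        JOIN markers AS m2 ON m2.name = :m2
        WHERE m1.chrom = m2.chrom
          AND g.chrom = m1.chrom
          AND g.end_pos >= MIN(m1.pos, m2.pos)
          AND g.start_pos <= MAX(m1.pos, m2.pos)
          AND g.tissue = :tissue
        ORDER BY g.start_pos
    """
    params = {"m1": marker1, "m2": marker2, "tissue": tissue}
    return [row[0] for row in connection.execute(query, params)]


muscle_genes = genes_between(conn, "marker_A", "marker_B", "muscle")
print(muscle_genes)
```

Using MIN and MAX on the two marker positions means the markers can be given in either order.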
24
25
26
Informatics E-mail Server
• Web application allowing users of the Center for
Medical Genomics at Indiana University School of
Medicine to request genomic information for many
genes or sequences and receive that information
via email.
• Screen shot
27
28
LabRat
• LIMS that allows users to collect related genomic
information for known sequences.
• Used internally by customers of the Center for
Medical Genomics at Indiana University School of
Medicine.
29
30
31
32
33
34
Two new applications
under development
35
Linking Cancer data within IUSM
• Thousands of cancer and normal tissue samples
• De-identified, select phenotype data
• Database system that manages IRB approvals
• DiscoveryLink is planned ‘glue’ to tie tissue data to
data generated by other IUSM cores
36
Protein identification
• Problem: categorize thousands of protein
identifications from proteomic experiments
• Planned solution: Use CLSD interface with
LocusLink to obtain information about proteins
• Data Generation:
– Peptide Extracts from experiment
– Separate peptides using Liquid 2D
Chromatography
– Identify Mass/Charge using Mass Spectrometer
– Creates raw data (LOTS of it!)
37
[Figure: protein identification data flow. Box labels: Raw Data; NCBI (RefSeq); CLSD LocusLink Schema; Human FASTA Protein; Ontological Information; Additions / Modifications (manual); Software Analysis (SEQUEST / Protein Prophet); Potential Protein Identifications or Quantifications; Data Processing (custom software, Sizemore); Potential Protein IDs by Ontological Information]
38
The key benefits to IU’s use of
DiscoveryLink
• Significant operational benefits (downloading data exactly
once)
• With DiscoveryLink and the CLSD as a base, it’s quite
straightforward for a programmer within a lab to build a
significant application based on use of CLSD and
DiscoveryLink (no marathon browsing)
• Power of accessing calculations (BLAST) within a database
query, and accessing data from within common application
programs (SAS)
• New opportunities for discovery within IUSM (interesting joins
of data)
• New opportunities without destroying local policy autonomy
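The benefit of calling a calculation from inside a query can be illustrated with sqlite3's create_function, which registers a Python function so that SQL can invoke it per row. The scoring function below is a deliberately simple stand-in (fraction of identical positions), not BLAST, and the table and sequences are invented.

```python
import sqlite3


def identity_fraction(seq_a, seq_b):
    """Toy comparison: matching positions divided by the longer length."""
    if not seq_a or not seq_b:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / max(len(seq_a), len(seq_b))


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL name with two arguments.
conn.create_function("identity_fraction", 2, identity_fraction)

conn.execute("CREATE TABLE sequences (name TEXT, residues TEXT)")
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?)",
    [("seq_a", "ACGTACGT"), ("seq_b", "ACGTTCGT"), ("seq_c", "TTTTTTTT")],
)

# The calculation is used both as an output column and as a filter.
hits = conn.execute(
    """
    SELECT name, identity_fraction(residues, :q) AS score
    FROM sequences
    WHERE identity_fraction(residues, :q) >= 0.8
    ORDER BY score DESC, name
    """,
    {"q": "ACGTACGT"},
).fetchall()
print(hits)
```

Because the calculation is exposed to SQL, its result can be filtered, sorted, or joined with other tables in the same statement.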
39
A few general thoughts on
advanced information
technologies for biomedical
researchers
40
IU’s strategy
• CS research is
wonderful, but what
biomedical researchers
care about is tools!
• Considerable effort is
put into seeking out
collaborators and
people we can assist
• If a particular
application is useful it
doesn’t matter if it
seems sophisticated to
a computer scientist
41
Indiana Genomics Initiative
Information Technology
• 136 users of IU’s supercomputers
• 70 users of massive data storage system – 5 TB
stored
• Six new software packages created or enhanced,
more than 20 packages installed for use by INGEN-affiliated researchers
• Three software packages made available as open
source software as direct result of INGEN.
Opportunities for tech transfer!
• The INGEN IT Core is providing services valued by
traditionally trained biomedical researchers as well as
researchers in bioinformatics, genomics, proteomics,
etc. Satisfaction with UITS services at IUSM is above 90%.
42
Acknowledgments
• This research was supported in part by the Indiana Genomics
Initiative. The Indiana Genomics Initiative of Indiana University
is supported in part by Lilly Endowment Inc.
• This work was supported in part by Shared University
Research grants from IBM, Inc. to Indiana University, and in
particular by IU’s relationship with IBM as an IBM Life Sciences
Institute of Innovation.
• This material is based upon work supported by the National
Science Foundation under Grant No. 0116050 and Grant No.
CDA-9601632. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the
National Science Foundation (NSF).
• Informatics E-mail Server supported in part by the 21st
Century Research & Technology Fund. Online Biological
Data Retrieval system supported in part by National Institutes
of Health grant R01 NS37167.
43
Acknowledgments, cont’d
• UITS Research and Academic Computing Division managers:
Mary Papakhian, Stephen Simms, Richard Repasky, Matt Link,
John Samuel, Eric Wernert, Anurag Shankar
• Indiana Genomics Initiative Staff: Chris Garrison, Huian Li,
Jagan Lakshmipathy, David Hancock
• Center for Medical Genomics: Matthew J. Stephens, Marcus
Breese, Jeanette McClintick, Howard Edenberg, Matt Grow
• Harrington Lab: Lee Ott, Alecia Sizemore
• Goebl Lab: Josh Heyen
• Wang Lab
• UITS Senior Management: Associate Vice President and Dean
Bradley Wheeler, Associate Vice President and Dean (Retired)
Christopher Peebles, RAC (Data) Director Gerry Bernbom
• Assistance with this presentation: John Herrin, Malinda
Lingwall, W. Les Teach
44
For additional information
• about.uits.iu.edu/divisions/rac/index.html
• about.uits.iu.edu/divisions/rac/pubsstaff.html
• ingen.iu.edu
• it.iu.edu
45
Thank you!
Questions?
46