
The MGED Society
Facilitating Data Sharing and
Integration with Standards
CTSA Omics Data Standards Working Group
Chris Stoeckert
Dept. of Genetics and Penn Center for Bioinformatics,
U. Penn School of Medicine
Philadelphia, PA
Goal: Developing integrated data repositories, e.g.
genomics, transcriptomics, etc. along with clinical data.
MAGE-TAB for microarray, UHTS data
OBI for describing biomedical (including clinical) data
Your data repository
Some other public repository
Integration requires standards:
• For efficient loading and access
• For data sharing
The MGED Society Mission
The MGED Society is an international organization of biologists,
computer scientists, and data analysts that aims to facilitate
biological and biomedical discovery through data integration.
Our approach is to promote the sharing of large data sets
generated by high throughput functional genomics technologies.
Historically, MGED began with a focus on microarrays and gene
expression data. However, the scope of MGED now includes
data generated using any technology when applied to genome-scale
studies of gene expression, binding, modification and
other related applications.
Members of MGED work to establish standards for data quality,
management, annotation and exchange; facilitate the creation of
tools that leverage these standards; work with other
standards organizations; and promote the sharing of high-quality,
well-annotated data within the life sciences and
biomedical communities.
MGED Standards
• What information is needed for a microarray
experiment?
– MIAME: Minimum Information About a Microarray
Experiment. Brazma et al., Nature Genetics 2001
• How do you “code up” microarray data?
– MAGE-OM: MicroArray Gene Expression Object Model.
Spellman et al., Genome Biology 2002
– MAGE-TAB: MicroArray Gene Expression Tabular format.
Rayner et al., BMC Bioinformatics 2006
• What words do you use to describe a microarray
experiment?
– MO: MGED Ontology. Whetzel et al. Bioinformatics 2006
New MGED-Related Activities
• The MGED Society mission includes facilitating deposition of functional
genomics datasets (e.g. microarray studies) in public archives. In
addition to addressing what and how data gets deposited, we are very
much concerned with seeing that authors adhere to journal
requirements for data deposition. Unfortunately, the requirement for
data deposition is not being sufficiently met and important
datasets are not accessible (see for example Ochsner et al., Nature
Methods 2008).
• Therefore, we ask that investigators seeking microarray and UHTS
functional genomics datasets from studies published in journals
requiring deposition contact us if they are unable to get them. We will
then contact the authors on your behalf and inform the journal
where the study was published. We will document the results on the
MGED web site to assist others seeking the same dataset and to aid
reviewers of related publications and grants.
• http://www.mged.org/wiki/index.php/Published_Dataset_Availability
New MGED-Related Activities
• UHTS submission to repositories
– Both ArrayExpress and NCBI GEO accept functional genomic experiment
submissions generated by ultra-high-throughput sequencing (UHTS)
technologies. ArrayExpress and GEO have entered into a metadata
exchange agreement, meaning that UHTS sequence experiments will
appear in both databases regardless of where they were submitted.
This complements the exchange of underlying raw data between the short
read archives, SRA and ERA. Raw sequencing data submitted to
ArrayExpress or GEO will be sent to ERA or SRA respectively. You do not
need to submit to the sequence repositories separately.
– See Helen Parkinson (ArrayExpress) and Tanya Barrett (GEO) for details.
New MGED-Related Activities
• UHTS Quality Working Group
– Marc Salit (NIST)
– Best practices for RNA-Seq
• Illumina (Solexa)
• Ambion (ABI SOLiD)
New Directions for MGED Standards
• What information is needed for a UHTS experiment?
– MINSEQE: Minimum Information about a high-throughput
SEQuencing Experiment.
– http://www.mged.org/minseqe/
• How do you annotate microarray and gene expression
data?
– Annotare: Tool to create MAGE-TAB.
– http://code.google.com/p/annotare/
• What words do you use to describe an investigation?
– OBI: Ontology for Biomedical Investigations.
– http://obi-ontology.org/
A draft proposal for the required Minimum
Information about a high-throughput Nucleotide
SeQuencing Experiment – MINSEQE
(April 1, 2008)
• The description of the biological system and the particular
states that are studied
• The sequence read data for each assay
• The 'final' processed (or summary) data for the set of
assays in the study
• The experiment design including sample data relationships
• General information about the experiment
• Essential experimental and data processing protocols
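The six items above can be treated as a simple completeness checklist. The following is a minimal sketch in Python, not part of the MINSEQE proposal itself: the short key names, the `missing_minseqe_items` helper, and the example submission are all hypothetical and chosen only for illustration.

```python
# The six items from the draft MINSEQE proposal, keyed by hypothetical short names.
MINSEQE_ITEMS = {
    "biological_system": "Description of the biological system and states studied",
    "read_data": "Sequence read data for each assay",
    "processed_data": "'Final' processed (or summary) data for the set of assays",
    "experiment_design": "Experiment design including sample data relationships",
    "general_info": "General information about the experiment",
    "protocols": "Essential experimental and data processing protocols",
}


def missing_minseqe_items(submission):
    """Return descriptions of checklist items that are absent or empty."""
    return [desc for key, desc in MINSEQE_ITEMS.items() if not submission.get(key)]


# Made-up submission metadata: one item is empty and one is absent.
submission = {
    "biological_system": "Mouse liver, fasted vs. fed",
    "read_data": ["run1.fastq", "run2.fastq"],
    "processed_data": [],  # present but empty, so it counts as missing
    "experiment_design": "2 conditions x 3 replicates",
    "general_info": "Example study; contact: submitter@example.org",
}

for item in missing_minseqe_items(submission):
    print("Missing:", item)
```

A check like this only confirms that each item is present; whether the content is adequate still needs human review.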
Annotare - An open source
standalone MAGE-TAB editor
MAGE-TAB Format
What’s MAGE-TAB?
• MAGE-TAB is a simple spreadsheet view consisting of two files:
– IDF: describes the experiment design, contact details, variables and
protocols
– SDRF: a spreadsheet with columns that describe samples, annotations,
protocol references, hybridizations and data
• Linked data files, e.g. CEL files, are referenced by the SDRF
• For single-channel data, one row in the SDRF = 1 hybridization; for
two-channel data, one row = 1 channel
• MAGE-TAB can also be used to annotate next-generation sequencing data
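To make the two-file layout concrete, here is a minimal sketch in Python that reads a tiny IDF and SDRF using only the standard library. The file contents, accession-like names, and helper functions are invented for illustration. The column and field names ("Source Name", "Protocol REF", "Array Data File", "SDRF File") follow common MAGE-TAB usage, but real files carry many more fields.

```python
import csv
import io

# Hypothetical, heavily abbreviated tab-delimited contents (not from a real experiment).
IDF_TEXT = (
    "Investigation Title\tExample expression study\n"
    "Person Last Name\tDoe\tRoe\n"
    "Protocol Name\tP-EXAMPLE-1\tP-EXAMPLE-2\n"
    "SDRF File\texample.sdrf.txt\n"
)

SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tProtocol REF\t"
    "Hybridization Name\tArray Data File\n"
    "sample 1\tHomo sapiens\tP-EXAMPLE-1\thyb 1\tsample1.CEL\n"
    "sample 2\tHomo sapiens\tP-EXAMPLE-1\thyb 2\tsample2.CEL\n"
)


def parse_idf(text):
    """Read an IDF: each row is a field name followed by one or more values."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        fields[row[0].strip()] = [v for v in row[1:] if v.strip()]
    return fields


def parse_sdrf(text):
    """Read an SDRF: a header row, then one row per hybridization (single channel)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def referenced_data_files(sdrf_rows, column="Array Data File"):
    """Collect the linked data files (e.g. CEL files) named in the SDRF."""
    return [row[column] for row in sdrf_rows if row.get(column)]


idf = parse_idf(IDF_TEXT)
sdrf = parse_sdrf(SDRF_TEXT)
print(idf["SDRF File"])
print(len(sdrf), "hybridizations")
print(referenced_data_files(sdrf))
```

Real SDRF files may repeat a column name (for example several "Protocol REF" columns), which a dictionary-per-row reader like this one would silently collapse, so a dedicated MAGE-TAB parser is the safer choice for production use.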
Where can I get MAGE-TAB from?
• ~10,000 MAGE-TAB files are available for download from ArrayExpress
(GEO-derived and ArrayExpress data)
• caArray also provides MAGE-TAB files for download.
IDF file for E-TABM-34
SDRF file for E-TABM-34
Annotare
Annotare - an open source MAGE-TAB Editor
• Annotare is an annotation tool for high throughput gene expression
experiments in MAGE-TAB format. Biologists can describe their
investigations with the investigators’ contact details, experimental design,
protocols that were employed, references to publications, details of biological
samples, arrays, and experimental data produced in the investigation.
Annotare Features
• Intuitive graphical user interface forms for editing
• Ontology support: an inbuilt ontology and web services connectivity to
BioPortal
• Searchable standard templates
• Design wizard
• Validation module for syntactic and semantic checking
• Mac and Windows support
Annotare Features - Templates
Search, choose, and save templates
Annotare Features – Design Wizard
Define species; common array designs and protocols can be
pre-loaded
Ontology Support
Autocomplete using preloaded EFO, or ontology term lookup at
BioPortal
Excel-like or form-driven annotation
Validation
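The difference between syntactic and semantic checking can be sketched as follows. This is an illustration only and does not reproduce Annotare's actual validation rules: the two functions, the chosen checks, and the sample SDRF text are assumptions made for the example.

```python
import csv
import io

# Hypothetical SDRF fragment: the last row is short and cites an undeclared protocol.
SDRF_TEXT = (
    "Source Name\tProtocol REF\tArray Data File\n"
    "sample 1\tP-EXAMPLE-1\tsample1.CEL\n"
    "sample 2\tP-UNKNOWN\n"
)


def syntactic_errors(sdrf_text):
    """Structure checks: a header exists and every row has as many cells as the header."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    if not rows:
        return ["SDRF is empty"]
    header, errors = rows[0], []
    if "Source Name" not in header:
        errors.append("missing 'Source Name' column")
    for n, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            errors.append(f"line {n}: expected {len(header)} cells, found {len(row)}")
    return errors


def semantic_errors(sdrf_text, declared_protocols):
    """Meaning checks: every Protocol REF must name a protocol declared in the IDF."""
    errors = []
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for n, row in enumerate(reader, start=2):
        ref = row.get("Protocol REF")
        if ref and ref not in declared_protocols:
            errors.append(f"line {n}: undeclared protocol '{ref}'")
    return errors


print(syntactic_errors(SDRF_TEXT))
print(semantic_errors(SDRF_TEXT, {"P-EXAMPLE-1"}))
```

Syntactic checks need only the file itself, while semantic checks need outside knowledge, here the set of protocol names declared in the IDF.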
Supporting Applications
• caArray upload
• ArrayExpress submissions
• SOFT-MAGE-TAB converter (for GEO)
• Similarity Search – AnnotCompute
– www.cbil.upenn.edu/RAD/php/annotCompute/
• MeV data upload
• MAGE-TAB Bioconductor Import
• Generic Limpopo parser for MAGE-TAB
Links
• Code and documentation: code.google.com/p/annotare
• Limpopo parser
– sourceforge.net/projects/limpopo/
Annotare Acknowledgements
• Annotare: Catherine A. Ball, Tony Burdett,
Junmin Liu, Emma K. Hastings, Michael
Miller, Sarita Nair, Helen Parkinson, Ravi
Shankar, Rashmi Srinivasa, Joseph White
NHGRI grant P41 HG003619
OBI – Ontology for Biomedical
Investigations
• MGED is one of many communities
contributing to OBI
• Whereas the MGED Ontology is primarily a
controlled vocabulary for use with MAGE,
OBI is a well-founded ontology with logical
definitions and restrictions to be used for
multiple purposes (e.g., database models,
text mining, file annotation)
Partial high level structure of OBI classes
OBI and IAO (Information Artifact Ontology) classes are shown in blue. Classes
imported from other external ontologies are shown in red. Some example
subclasses, such as PCR product and cell culture are included to illustrate the use
of the class processed material.
OBI – Ontology for Biomedical
Investigations
• OBI intends to be part of the OBO Foundry
• Interoperable with Gene Ontology, ChEBI,
Phenotypic qualities (PATO), Cell Type
(CL)…
• Learn more at
– http://purl.obolibrary.org/obo/obi
• OBI is available through browsers like the
NCBO BioPortal
Measuring the glucose concentration in blood
From The OBI Consortium, The Ontology for Biomedical Investigations, under revision
An OBI representation of a MAGE-TAB file
Focus on where MO terms were used in E-TABM-34
Utility of these standards for CTSA?
MAGE-TAB for microarray, UHTS data
OBI for describing biomedical (including clinical) data
Your data repository
Some other public repository
Integration requires standards:
• Use Annotare to generate MAGE-TAB
• Use OBI when possible as the source of controlled terms and for
modeling protocols, assays, and investigations
For more information see http://www.mged.org
More about standards at http://biostandards.info/
Follow us on Twitter: @MGED_Society
MGED Meetings
• It’s about the science!
• Keeping up with the latest advances
• Making connections with potential
collaborators
Thank you!
Questions?