Overview of Bioconductor
Aedín Culhane
http://bcb.dfci.harvard.edu/~aedin http://www.hsph.harvard.edu/research/aedin-culhane
Bioconductor
• Biannual release (normally April and October), timed to coincide with the R release
• Current: Bioconductor 2.9 (released to coincide with R 2.14)
• To install, use the script on the Bioconductor website:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
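The biocLite script above matches the Bioconductor 2.9 era of these slides. In later releases (Bioconductor 3.8 onward) installation moved to the BiocManager package on CRAN. A minimal sketch of that newer route follows; the wrapper name install_bioc is hypothetical, and nothing is installed until the function is called.

```r
# Hypothetical helper: install Bioconductor packages via BiocManager
# (the replacement for biocLite in newer Bioconductor releases).
install_bioc <- function(pkgs = character()) {
  # Install BiocManager from CRAN if it is not already available
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # With an empty 'pkgs', this installs or updates the core packages
  BiocManager::install(pkgs)
}

# Usage (not run here, needs network access):
# install_bioc(c("Biobase", "GEOquery"))
```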
Packages Overview
Bioconductor web site
• Bioconductor BiocViews task view
– Software
– Annotation Data
– Experimental Data
What Packages do I need?
Specific to your data and analysis pipeline, but for examples see:
• Bioconductor Workshops
• Bioconductor Workflows
Main types of Annotation Packages
• Gene-centric AnnotationDbi packages:
– Organism: org.Mm.eg.db
– Technology/Platform: hgu133plus2.db
– Gene sets and pathways (biology level): GO.db or KEGG.db
– .db packages can be queried with SQL or accessed through the annotation package interface (toTable, get, mget)
• Genome-centric GenomicFeatures packages:
– Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene
– Generic features: can be generated via GenomicFeatures
• biomaRt:
– Query web-based BioMart resources for genes, sequences, SNPs, etc.
See http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf
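As a rough illustration of the toTable and mget access mentioned above, the sketch below queries the gene symbol map of org.Mm.eg.db. It assumes that package is installed; if it is not, the code only prints a message. The variable names are illustrative.

```r
# Check whether the mouse organism annotation package is available
have_org_db <- requireNamespace("org.Mm.eg.db", quietly = TRUE)

if (have_org_db) {
  library(org.Mm.eg.db)  # also attaches AnnotationDbi

  # Flatten the Entrez Gene ID -> symbol map into a data frame
  print(head(toTable(org.Mm.egSYMBOL)))

  # Look up symbols for the first few mapped Entrez Gene IDs
  ids <- head(mappedkeys(org.Mm.egSYMBOL), 3)
  print(mget(ids, org.Mm.egSYMBOL))
} else {
  message("org.Mm.eg.db is not installed; skipping the annotation lookup")
}
```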
Bioconductor resources
• Mailing list (sign up for the daily digest)
• Documentation and workshop/course material online
– Slides from talks, PDFs of tutorials, R code
• Help available for each software package
– Each package MUST contain a vignette (how-to)
• Other resources: www.Rseek.org, www.r-bloggers.com
Vignette
• Tutorials that provide a worked example of the package
• Required in Bioconductor packages
• Written in Sweave (Leisch, 2002)
– LaTeX dynamic reports in which R code is embedded and executable
– All R code in a vignette is checked (and executed) by R CMD check
– http://www.bioconductor.org/docs/vignettes.html
library("Biobase")
library("GOstats") # Load package of interest
openVignette()
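Besides openVignette() from Biobase, base R can list vignettes through vignette(). The sketch below uses the grid package only because it ships with R; substitute any installed Bioconductor package name.

```r
# List the vignettes that an installed package provides.
# "grid" is used here because it is part of every R installation.
vigs <- vignette(package = "grid")

# The result is a "packageIQR" object; its 'results' matrix holds
# one row per vignette with the item name and title.
print(class(vigs))
print(head(vigs$results[, c("Item", "Title"), drop = FALSE]))

# To open one interactively (not run here):
# vignette("grid", package = "grid")
```

In an interactive session, browseVignettes() gives the same listing as a web page.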
S4 classes and ExpressionSet
• Within Bioconductor, you will encounter packages that are structured around S4 object-oriented programming, proposed by John Chambers (developer of S)
• A class provides a software abstraction of a real-world object
• A method performs an action on a class (think of a class as a noun and a method as a verb)
Object (S4)
• An object is an instance of a class.
• Descriptions are stored in slots
• slotNames(ob1) lists all slots in an object, or use str()
• To access slots:
– ob1@slotname
– slotname(ob1) (an accessor function, where the package provides one), or
– slot(ob1, "slotname")
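The three access routes can be tried on a toy class without any Bioconductor package. The class Sample, its slots, and the accessor id below are hypothetical names made up for this sketch.

```r
library(methods)  # S4 support (attached by default in interactive R)

# Hypothetical toy class with two slots
setClass("Sample", slots = c(id = "character", values = "numeric"))

# An object is an instance of the class
ob1 <- new("Sample", id = "s1", values = c(2.5, 3.1, 4.0))

print(slotNames(ob1))      # names of all slots
str(ob1)                   # compact view of the whole object

print(ob1@id)              # direct slot access with @
print(slot(ob1, "values")) # slot access by name

# Accessor function: a generic plus a method for the class
setGeneric("id", function(object) standardGeneric("id"))
setMethod("id", "Sample", function(object) object@id)
print(id(ob1))
```

Accessor functions are usually the safest route, since a package author can change the slot layout without breaking code that goes through the accessor.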
Example: ExpressionSet
library(ALL)
data(ALL)
slotNames(ALL)
ALL@phenoData
phenoData(ALL)
class(ALL)
?ExpressionSet
> ALL
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: 01005 01010 ... LAL4 (128 total)
  varLabels: cod diagnosis ... date last seen (21 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790
Annotation: hgu95av2
Methods which act on an S4 class
showMethods(class= "ExpressionSet")
getMethod("write.exprs", "ExpressionSet")
Or, if you wish to see how the package really works, download and look at the source code
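The same inspection calls work on any S4 class. The sketch below defines a hypothetical class Reading and generic summarise_reading, then looks them up the way the slide does for ExpressionSet.

```r
library(methods)

# Hypothetical class (the noun) and generic/method (the verb)
setClass("Reading", slots = c(value = "numeric"))
setGeneric("summarise_reading",
           function(object) standardGeneric("summarise_reading"))
setMethod("summarise_reading", "Reading",
          function(object) mean(object@value))

r <- new("Reading", value = c(1, 2, 3))
print(summarise_reading(r))

# List the methods that have "Reading" in their signature
showMethods(classes = "Reading")

# Print the definition of one method
print(getMethod("summarise_reading", "Reading"))

# Check whether a method is defined for the class
print(existsMethod("summarise_reading", "Reading"))
```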
Getting Data into R & Bioconductor
Aedín Culhane
http://www.hsph.harvard.edu/research/aedin-culhane/
Simple Excel SpreadSheet data
• Simple table
– read.table()
– read.csv()
– scan()
• However, many data types need more specialized import functions. See Technologies on BiocViews.
– http://www.bioconductor.org/packages/release/BiocViews.html
• Large data files: also see http://www.revolutionanalytics.com
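A minimal sketch of the simple-table case follows. To stay self-contained it reads from an inline string rather than a file; the gene and sample names are made up. With a real export from Excel, pass the file path instead of the text argument.

```r
# Made-up expression table, as it might look after "Save as CSV" in Excel
csv_text <- "gene,sample1,sample2
geneA,5.1,6.3
geneB,7.8,7.2
geneC,2.4,3.9"

# read.csv() is read.table() with sep = "," and header = TRUE;
# row.names = 1 uses the first column as row names
expr <- read.csv(text = csv_text, row.names = 1)

# Many analysis functions expect a numeric matrix, not a data frame
expr_matrix <- as.matrix(expr)
print(dim(expr_matrix))
print(expr_matrix)

# scan() reads a flat vector of values
vals <- scan(text = "1.5 2.5 3.5", quiet = TRUE)
print(vals)
```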
Some common data types
• Microarray
• SNP
• NGS
May 2011
A Microarray Overview
Reading Affymetrix Data
library(affy)
require(affy) # Alternative way to load the package
affybatch <- ReadAffy(celfile.path="[Location of your data]")
eSet <- justRMA() # Reads the CEL files in the working directory and applies RMA
Sample R code
ExpressionSet Class in R
Assessing Data Quality
Public Microarray Data
ArrayExpress
• 21,997 studies (622,617 profiles)
GEO
• 22,735 studies (558,074 profiles)
Statistics as of May 2011
R Code
More on GEOquery
require(GEOquery)
Let's try to load the GDS810 dataset, which contains data on Alzheimer's disease at various stages of severity.
GDS810 <- getGEO("GDS810")
The getGEO function returns an object of class GEOData. You can get a description of this class like this:
help("GEOData-class")
Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))
Affy SNP Arrays
Process – Affy SNP Arrays (Oligo package)
Other Arrays
• Illumina
– lumi package
• 2-color spotted arrays
– limma package
• Other arrays
– http://www.bioconductor.org/help/workflows/oligo-arrays/
Next Generation Sequencing Data
R Code
Exercise
• Install the GEOquery package
• Download the dataset GSE1297 using getGEO
• These data will be downloaded as an eSet (ExpressionSet), so to see the expression data and phenoData, use exprs and pData
• Use arrayQualityMetrics to assess the quality of these data
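One possible way to approach the exercise is sketched below, wrapped in a function so that nothing is downloaded until you call it. The function name run_exercise and the output directory name are hypothetical. The sketch assumes GEOquery, Biobase, and arrayQualityMetrics are installed, and that getGEO is called with GSEMatrix = TRUE, in which case it returns a list of ExpressionSet objects.

```r
# Hypothetical wrapper for the exercise steps; needs network access when run
run_exercise <- function(outdir = "GSE1297_QC") {
  needed <- c("GEOquery", "Biobase", "arrayQualityMetrics")
  missing <- needed[!vapply(needed, requireNamespace, logical(1),
                            quietly = TRUE)]
  if (length(missing) > 0) {
    stop("Install first: ", paste(missing, collapse = ", "))
  }

  # Download the series; the result is a list of ExpressionSets
  gse <- GEOquery::getGEO("GSE1297", GSEMatrix = TRUE)
  eset <- gse[[1]]

  # Expression matrix (features x samples) and sample annotation
  print(dim(Biobase::exprs(eset)))
  print(head(Biobase::pData(eset)))

  # Write an HTML quality report into 'outdir'
  arrayQualityMetrics::arrayQualityMetrics(expressionset = eset,
                                           outdir = outdir,
                                           force = TRUE)
  invisible(eset)
}

# Usage (not run here):
# eset <- run_exercise()
```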
R basics: Getting help
• To get help
– ?mean
– help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
• http://www.bioconductor.org/help/
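A short sketch of the search-style helpers, using only base R. The interactive calls are left as comments because they open a help page or run a package's examples.

```r
# apropos() returns the names of visible objects matching a pattern
hits <- apropos("mean")
print(hits)

# Confirm that 'mean' itself is among the matches
print("mean" %in% hits)

# Interactive helpers (not run here):
# ?mean                  # open the help page
# help.search("mean")    # search titles and keywords of installed help
# example(mean)          # run the examples from the help page
```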
With thanks to
www.bioconductor.org/help/course.../Bioconductor-Introduction-lab.pdf