Gene Expression Studies using Array and Sequencing Technology

Microarrays and Gene Expression
DTC Bioinformatics Course
9th February 2010
Helen Lockstone
Overview
• Background
• Array design
• Applications of array technology
• Steps in data analysis
• Finding differentially expressed genes
• Biological interpretation
Schedule
Time | Topic
9.30-10.30 | Introduction to microarray technology and applications
10.30-10.45 | Break
10.45-11.30 | Microarray data analysis
11.30-13.00 | Practical 1
13.00-14.00 | Lunch
14.00-14.45 | Biological interpretation
14.45-15.00 | Break
15.00-17.00 | Practical 2
Microarrays in the Literature
[Figure: number of papers (0 to 7000) plotted by year]
The Central Dogma
Transcriptome measured by microarrays
Premise of Microarrays
• Compare gene expression between groups
• Differentially expressed genes may provide some biological insight
• But not magical solutions!
Typical Microarray Designs
• Disease vs control
• Good prognosis vs poor prognosis
• Different tumour types
• Effect of treatment
• Effect of stimulus
• Time course
• Different tissues/stages of development
Criticism of Microarrays
• Non-hypothesis driven “fishing expeditions”
• Because microarray experiments are expensive and time-consuming to interpret, they are often published as a stand-alone experiment
• They produce large amounts of data, and interpretations can be very different (but equally valid)
• Further experimental work, following up hypotheses suggested from array data, can produce elegant studies
• Perception that the data are unreliable, so results need validation
Microarray Repositories
• GEO – http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress – http://www.ebi.ac.uk/microarray-as/ae/
• Excellent resource of microarray data
• MIAME (Minimum Information About a Microarray Experiment) guidelines
What is a Microarray?
• Glass slide consisting of hundreds of thousands of
probes arranged in grid layout
• Each probe detects a particular RNA species (transcript)
• Hybridisation occurs by complementary base-pairing
• Make quantitative measurements – signal from each
probe is proportional to the amount of hybridised RNA
• Interrogate entire genome in single experiment
Microarray Technology
• Probes: cDNA, oligonucleotides, PCR products
• Design: targeted to genes; tiling (chromosomes, promoters)
• Fabrication method: spotted (robotic printing); photolithography (synthesised in-situ)
• Type: one-colour (log intensities); two-colour (log ratios)
• Labelling molecules: Cy3 (green), Cy5 (red), biotin
Experimental Protocol
Microarray Manufacturers
Company | Established | Main microarray technology | Human whole-genome array released | Headquarters
Affymetrix | 1992 | GeneChip | 1994 | Santa Clara, CA
Illumina | 1998 | BeadChip | 2005 | San Diego, CA
Roche NimbleGen | 1999 | High-density tiling arrays | | Madison, WI
Agilent | 1999 | aCGH, ChIP-chip, custom | 2004 | Santa Clara, CA
Array design
Affymetrix Microarrays
• Manufacturing microarrays for >15 years
• 25 bp probes – 11 individual probes comprise a probe-set, and their signals are combined to estimate gene expression
• Whole human genome array has >50,000 probe-sets
• Array surface size 1.28 cm²
• 3’ expression arrays – probes designed to 3’ end of transcript
Recent Developments
• Limitations of 3’ array design
– Assumes 3’ end is representative of entire gene
– Assumes well-defined 3’ end of gene
– Can’t assess splicing events
– Can be difficult to distinguish homologous genes
• Whole transcript arrays
– 4-probe probesets designed to each exon
– Gene 1.0 and Exon 1.0 arrays
Exon Array Design
Picture from Affymetrix
Illumina BeadChip Arrays
Beads randomly occupy wells on surface of array
30-40 replicates of each bead type (probe)
Longer probe length – typically one probe per gene
Applications of Microarray Technology
Microarray Applications
• Gene expression
• Alternative splicing
• microRNA expression
• SNP genotyping
• ChIP-chip
• DNA methylation
• Comparative genomic hybridisation
Gene Expression
• Still most common use for microarrays
• Aim to determine differential expression
between groups of samples e.g. disease
and control
• Generate hypotheses about the
mechanisms underlying the disease of
interest
Alternative Splicing
• Up to 75% of human genes may produce alternative transcripts
• Increases protein diversity from given set of genes
• Alternative transcripts from same gene can produce proteins with different, even opposite, functions (e.g. Bcl-x)
• Role in disease - mutations can disrupt splice sites or splicing machinery
Alternative Splicing
• Affymetrix exon array allows investigation of
alternative splicing
• Custom arrays with junction probes
• Additional layer of analysis
Alternative Poly-A Sites
• Alters length of 3’ UTR - may change which
target regions for miRNAs are present
MicroRNAs
• Small non-coding RNAs (~22 nt)
• Sequence-specific binding to 3’ UTRs
• Post-transcriptional gene silencing
Picture from He et al. Nature Reviews Cancer 7, 819-822 (2007)
SNP Arrays
• Illumina and Affymetrix
• ~6 million SNPs genome-wide
• Genotype individuals in high-throughput and
cost-effective manner
• Genome-wide association studies
• eQTL studies
Tiling Arrays
• Applications so far use arrays with probes
designed to genes/miRNAs/SNPs of interest
• Tiling arrays consist of high-density probes
covering a particular region(s) of the genome
• Identify novel transcripts, exons
DNA Methylation
• Methylation of cytosine bases (CpG islands) in gene promoter regions can silence transcription
• Epigenetic mechanism
• Two-colour hybridisation
ChIP-chip
• Method to identify transcription factor binding sites in an
unbiased fashion
• Cross-link protein (TF) of interest with DNA
• Use immuno-precipitation to pull down DNA fragments
bound to the protein (enriched sample)
• Hybridise with genomic DNA to obtain log-ratio
• Again looking for large positive ratios
Comparative Genomic Hybridisation
[Figure: trisomy 13 in female compared to reference male]
• Detect regions of amplification/deletion (copy number
changes)
• Feature of cancer – hybridise sample with reference
DNA (copy number=2)
• Potential dosage effects on genes in affected regions
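To see why amplifications and deletions appear as shifted log-ratios, the sketch below computes the ideal log2 ratio for a region with a given copy number against a diploid reference. The function name `expected_log2_ratio` is invented for this illustration. Measured ratios on real arrays are usually smaller in magnitude than these ideal values.

```python
import math

def expected_log2_ratio(sample_copies, reference_copies=2):
    """Ideal log2(sample / reference) signal ratio for a genomic region."""
    if sample_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy numbers must be positive to form a log-ratio")
    return math.log2(sample_copies / reference_copies)

# Single-copy loss, normal, single-copy gain, two-copy gain
for copies in (1, 2, 3, 4):
    print(copies, round(expected_log2_ratio(copies), 3))
```

Under these assumptions a trisomy (three copies against two) sits near 0.58, while a single-copy deletion sits at -1.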
Analysing Gene Expression Data
R and Bioconductor
• Powerful, open-source software for statistical analysis and graphical visualisation
• Greater functionality provided by software packages contributed by researchers
• Bioconductor packages are specifically for genomic data
– affy
– limma
– vsn
Analysis Steps
• Check quality of the data
• Decide if any samples are outliers
• Preprocessing and normalisation
• Statistical analysis to find differentially
expressed genes
• Tools for biological interpretation
Data Quality
• Looking for good signal and similar metrics
across all arrays in experiment (after
normalisation between arrays)
• Poor signal could indicate a hybridisation
problem or degraded sample
• Control probes for hybridisation, labelling
and sample can help identify problems
Illumina Array Metrics
• Average signal
• Number of detected genes
• Housekeeping genes signal
• Biotin controls
• Hybridisation controls
• Negative control probe signal
Processing Data
• Background correction
• Transform data to log scale (more suitable for
statistical analysis)
• Normalisation between arrays (adjust for
systematic differences such as overall brightness)
• Probe-set summarisation (Affymetrix) or across
replicate probes (Illumina)
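Two of the processing steps above, the log transform and normalisation between arrays, can be sketched in plain Python. Quantile normalisation is used here as one common choice; the slides do not say which method was applied. The function names, the `offset` of 1.0 (to avoid taking the log of zero) and the intensity values are all assumptions made for the example. Background correction and probe-set summarisation are not shown.

```python
import math

def log2_transform(matrix, offset=1.0):
    """log2(x + offset) for every value; rows are probes, columns are arrays."""
    return [[math.log2(v + offset) for v in row] for row in matrix]

def quantile_normalise(matrix):
    """Give every array (column) the same intensity distribution.

    Ties within an array are broken by probe order.
    """
    n_probes = len(matrix)
    n_arrays = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]
    # For each array, probe indices ordered from lowest to highest signal
    orders = [sorted(range(n_probes), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across arrays of the k-th smallest value
    reference = [
        sum(columns[j][orders[j][k]] for j in range(n_arrays)) / n_arrays
        for k in range(n_probes)
    ]
    normalised = [[0.0] * n_arrays for _ in range(n_probes)]
    for j in range(n_arrays):
        for rank, probe in enumerate(orders[j]):
            normalised[probe][j] = reference[rank]
    return normalised

# Made-up raw intensities: 4 probes on 3 arrays, the second array brighter overall
raw = [
    [120.0, 260.0, 100.0],
    [900.0, 2100.0, 850.0],
    [45.0, 80.0, 50.0],
    [300.0, 610.0, 310.0],
]
logged = log2_transform(raw)
normalised = quantile_normalise(logged)
```

After this step each array contains exactly the same set of values, so a difference in overall brightness no longer separates the arrays, while the ranking of probes within each array is kept.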
Exploring Data – Boxplots
[Figure: boxplots of signal intensity]
Exploring Data - PCA
Outlier Samples
• Potential outlier samples will look different to
others in the experiment
• No definitive rules to decide when to exclude a
sample from analysis
– Depends on size of experiment
– Can be useful to run analysis with and without outlier
to assess effect on results
– Always re-normalise data excluding any outlier
samples before proceeding
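As a simpler stand-in for inspecting a PCA plot, a sample that looks different can be flagged by its average correlation with every other sample. This is not the method shown in the slides, only a sketch of the same idea. The function names and intensities are made up, and no threshold is suggested because there are no definitive rules for exclusion.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_correlation_per_sample(samples):
    """samples maps sample name -> log intensities in the same probe order."""
    names = list(samples)
    result = {}
    for name in names:
        others = [pearson(samples[name], samples[o]) for o in names if o != name]
        result[name] = sum(others) / len(others)
    return result

# Made-up log2 intensities for five probes on four arrays
samples = {
    "S1": [6.1, 9.8, 7.2, 11.0, 8.4],
    "S2": [6.3, 9.6, 7.0, 11.2, 8.5],
    "S3": [6.0, 9.9, 7.3, 10.9, 8.3],
    "S4": [8.9, 7.1, 9.5, 8.0, 7.2],  # behaves differently from the rest
}
scores = mean_correlation_per_sample(samples)
candidate = min(scores, key=scores.get)
print(candidate, round(scores[candidate], 2))
```

If a flagged sample is dropped, the remaining arrays should be normalised again from the raw data before any further analysis.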
Outlier Sample
PCA indicating outlier sample
Filtering
• Lose data but signal from low intensity probes is
noisy and can give false positives
• Detection p-values calculated for each probe
based on overlap of signal with negative control
probe signal distribution
• Criteria
– Detected in all samples/at least one sample
– Detected in at least one group
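One simple way to picture a detection p-value is as the fraction of negative-control probes whose signal is at least as high as the probe in question. The sketch below uses that empirical definition, which may differ from what the array vendor's software computes. The function names, the 0.01 cut-off and all signal values are assumptions for illustration.

```python
def detection_p_value(signal, negative_controls):
    """Fraction of negative-control signals at least as high as the probe signal."""
    at_least_as_high = sum(1 for c in negative_controls if c >= signal)
    return at_least_as_high / len(negative_controls)

def is_detected(signal, negative_controls, alpha=0.01):
    # alpha is an assumed cut-off, not a value given in the slides
    return detection_p_value(signal, negative_controls) < alpha

def filter_probes(probe_signals, negatives_per_array, rule=any):
    """Keep probes detected on any array (rule=any) or on every array (rule=all)."""
    kept = []
    for probe, signals in probe_signals.items():
        calls = [is_detected(s, neg) for s, neg in zip(signals, negatives_per_array)]
        if rule(calls):
            kept.append(probe)
    return kept

# Made-up signals: two arrays, ten negative-control probes on each
negatives = [90, 95, 100, 105, 110, 98, 102, 97, 93, 108]
negatives_per_array = [negatives, negatives]
probe_signals = {
    "probe_A": [400, 380],  # clearly above background on both arrays
    "probe_B": [99, 101],   # indistinguishable from background
    "probe_C": [96, 250],   # above background on the second array only
}
kept_any = filter_probes(probe_signals, negatives_per_array, rule=any)
kept_all = filter_probes(probe_signals, negatives_per_array, rule=all)
print(kept_any)
print(kept_all)
```

The "at least one group" criterion can be built from the same pieces by applying `rule=all` to the arrays of each group separately and keeping a probe if any group passes.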
Detecting Differentially Expressed Genes
• Linear Models for Microarray Analysis (limma)
• Handles analysis of simple and complex experimental
designs
• For two-group comparisons, analogous to t-test, otherwise
ANOVA
• Uses information from all genes to estimate variance
– Reduces chance of false positives from very low variance genes
– More robust for small sample sizes
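The idea of borrowing information from all genes can be written out as the moderated t-statistic described in the limma literature. The notation is not taken from the slides. Here $s_g^2$ is the ordinary residual variance for gene $g$ with $d_g$ degrees of freedom, $s_0^2$ and $d_0$ are a prior variance and prior degrees of freedom estimated from all genes, $\hat{\beta}_g$ is the estimated group difference, and $v_g$ is the unscaled variance factor from the design.

```latex
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g},
\qquad
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

A gene with a very small $s_g^2$ has its variance pulled up towards $s_0^2$, so a tiny variance estimate alone cannot produce a very large test statistic.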
limma
• Fits linear model for each gene
• Test whether slope = 0 for each gene and assign p-values
• Multiple testing correction - FDR
[Figure: log normalised intensity (y-axis) for Group 1 and Group 2]
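The multiple testing step can be illustrated with the Benjamini-Hochberg procedure, a widely used way to control the FDR. The slides do not name the method, so treating it as Benjamini-Hochberg is an assumption, and the p-values below are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for five genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
for raw_p, adj_p in zip(p_values, adjusted):
    print(raw_p, round(adj_p, 5))
```

With many thousands of genes tested at once, a raw p-value below 0.05 is expected for many genes by chance alone, which is why the adjusted values are the ones to threshold.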
Effect of other variables
• Wt and Mut groups
• Three different
litters
• Top gene ~ 5x
higher expression in
Wt compared to
Mut
• Similarly expressed
across litters in both
genotypes
Strong litter effect
• Overlap between groups
• Within litters, consistent pattern of higher expression in Wt vs Mut
• Within genotypes,
B>C>A – expression
depends on litter
• Accounting for this
variance increases
power
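The gain in power from accounting for litter can be seen with a toy calculation on a single gene. The values are made up to mimic the pattern described (litter B > C > A, Wt higher than Mut within each litter). The comparison uses a plain two-sample statistic against a statistic on within-litter differences, which is a simplification of fitting litter as a term in a linear model.

```python
import math
from statistics import mean, stdev, variance

# Made-up log2 expression for one gene: (Wt, Mut) per litter
litters = {"A": (7.0, 6.0), "B": (10.1, 9.0), "C": (8.4, 7.5)}

wt = [pair[0] for pair in litters.values()]
mut = [pair[1] for pair in litters.values()]

# Ignoring litter: two-sample (Welch) t-statistic
t_unpaired = (mean(wt) - mean(mut)) / math.sqrt(
    variance(wt) / len(wt) + variance(mut) / len(mut)
)

# Accounting for litter: t-statistic on the within-litter differences
diffs = [w - m for w, m in litters.values()]
t_blocked = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(round(t_unpaired, 2), round(t_blocked, 2))
```

The group difference is the same in both calculations. Only the noise estimate changes, because the litter-to-litter variation no longer counts against the genotype effect.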
Limma Output
• Small sample size and subtle effects can
mean no probes would be considered
statistically significant
• Ranked in order of evidence for differential
expression – can still be explored
• Biological interpretation can be most
difficult step – tools available