Microarray Data Analysis


Microarray - Introduction
Ka-Lok Ng
Department of Bioinformatics
Asia University
http://ppi.bioinfo.asia.edu.tw/klng
Graduate syllabus
Week  Date       Graduate Level Topic
1     2010/2/23  Introduction - RNA expression, chip technology
2     2010/3/2   Suffix tree, application of microarray
3     2010/3/9   Data normalization, filtering, Log transform
4     2010/3/16  Data Normalization, Lowess normalization, SMD
5     2010/3/23  Statistical analysis of gene expression data
6     2010/3/30  Midterm
7     2010/4/6   Distance measure, Entropy
8     2010/4/13  Statistics, normal distribution
9     2010/4/20  Test of hypothesis, t-test of microarray data
10    2010/4/27  ArrayExpress - analysis of data
11    2010/5/4   Gene expression databases
12    2010/5/11  Gene regulation network prediction
13    2010/5/18  Microarray data analysis tools
14    2010/5/25  Student presentation
15    2010/6/1   Student presentation
16    2010/6/8   Student presentation
17    2010/6/15  Student presentation
18    2010/6/22  Student presentation
Undergraduate syllabus
Session  Date       Undergraduate Level Topic                        Instructor
1        2010/2/23  Introduction - RNA expression, chip technology   Ka-Lok Ng (吳家樂)
2        2010/3/2   Suffix tree, application of microarray           Ka-Lok Ng (吳家樂)
3        2010/3/9   Data normalization, filtering, Log transform     Ka-Lok Ng (吳家樂)
4        2010/3/16  Data Normalization, Lowess normalization, SMD    Ka-Lok Ng (吳家樂)
5        2010/3/23  Statistical analysis of gene expression data     Ka-Lok Ng (吳家樂)
6        2010/3/30  Midterm                                          Ka-Lok Ng (吳家樂)
7        2010/4/6   Distance measure, Entropy                        Ka-Lok Ng (吳家樂)
8        2010/4/20  Statistics, normal distribution                  Ka-Lok Ng (吳家樂)
9        2010/4/27  Test of hypothesis, t-test of microarray data    Ka-Lok Ng (吳家樂)
10       2010/5/4   ArrayExpress - analysis of data                  Ka-Lok Ng (吳家樂)
11       2010/5/11  Chi-square test of microarray data               Ka-Lok Ng (吳家樂)
12       2010/5/18  Second examination                               Ka-Lok Ng (吳家樂)
13       2010/5/25  (topic not listed)                               張培均
14       2010/6/1   (topic not listed)                               張培均
15       2010/6/8   (topic not listed)                               張培均
16       2010/6/15  (topic not listed)                               張培均
17       2010/6/22  (topic not listed)                               張培均
18       2010/6/29  (topic not listed)                               張培均
Topics to be covered
• Introduction - RNA expression
• Experimental design, image processing, Microarray databases
• Data normalization, filter and analysis
• Statistical analysis of gene expression data
• Clustering methods
• Time series data (cell cycle) and reverse engineering
• Gene regulatory networks
• Gene regulatory networks and protein-protein interaction networks
• Classwork, homework, class attendance
• Mid-term, final exam, or SCI paper/thesis presentation (for
graduate students)
References
1. Gibson G. and Muse S. A Primer of Genome Science, Ch. 4. 2nd
edition, Sinauer (2004).
2. Causton H., Quackenbush J., and Brazma A. Microarray Gene
Expression Data Analysis: A Beginner’s Guide. Blackwell (2003).
3. Baxevanis A. and Ouellette B.F.F. Bioinformatics, Ch. 16. J.
Wiley (2005).
4. Knudsen S. A Biologist’s Guide to Analysis of DNA Microarray Data. J.
Wiley (2002).
5. Benfey P. and Protopapas A.D. Genomics, Ch. 5. Prentice Hall (2005).
6. Setubal J. and Meidanis J. Introduction to Computational Molecular
Biology. PWS Publishing (1997).
7. Guénoche A. (2005). “About the design of oligo-chips”, Discrete
Applied Mathematics, 147(1), pp. 57-67.
Contents
• Introduction – the central dogma of molecular biology,
applications, data analysis, Microarray slide surface
• Printing technologies – spotting, photolithography, ink-jet
• Selection of genes for spotting on arrays
• Selection of primers for PCR – suffix tree
• Microarray application - four different types of brain tumors
• Gene co-expression and gene expression profile
• Data management
Introduction
• The last 10 years have brought spectacular achievements in
genome sequencing (such as the HGP)
• It took more than 1000 years for science to progress from human
anatomy to an understanding of how genomes function
• Even if we assume that all the genes have been correctly identified,
the results represent only sequence
• High-throughput DNA sequencing technology created a
systems approach to biology
The central dogma of molecular biology
http://www.hort.purdue.edu/hort/courses/HORT250/lecture%2004
Glossary
• Transcripts – mRNA
• Transcriptome – the complete set of transcripts
• Hybridization
• Microarray technology allows one to identify the genes that
are expressed in different cell types, to learn how their
expression levels change in different developmental stages
or disease states, and to identify the cellular processes in
which they participate
• Microarray technology provides clues about how genes and
gene products interact and about their interaction networks
Microarray gene expression data analysis
• Experimental design → data transformations from raw data
to gene expression matrices → data mining and analysis of
gene expression matrices
What are microarrays and how do they work ?
• A microarray is typically a glass or polymer slide
• DNA molecules are attached at fixed locations called spots
or features
Smooth surface enables even deposition of surface chemistries
and perfect spot morphology.
What are microarrays and how do they work ?
• ~10,000 spots on an array
• each spot contains ~10^7 identical DNA molecules, with
lengths from tens to hundreds of bp
• spots are either printed on the microarray by a robot or
ink-jet, or synthesised in situ by photolithography or by
ink-jet printing
Principle of cDNA microarrays
EST fragments arrayed in 96- or 384-well plates
are spotted at high density onto a glass microarray
slide. Subsequently, two different fluorescently
labeled cDNA populations derived from
independent mRNA samples are hybridized to the
array.
Ink-jet printer microarrays
– Ink-jet printhead draws up DNA
– Printhead moves to specific location on solid support
– DNA ejected through small hole
– Used to spot DNA or synthesize oligonucleotides
directly on glass slide
– Use pioneered by Agilent Technologies, Inc.
Types of printing pins
(A) Tweezer or split-pin designs transfer low
nanoliter (10^-9 liter) amounts of DNA to the array by
capillary action as the tip strikes the solid surface.
(B) TeleChem™ tips and pins apply small droplets
by contact between the pin and substrate.
(C) The pin-and-loop design picks up the DNA in a
small loop, and a pin stamps solution on a slide at a
uniform density.
(D) Ink jets spray picoliter (10^-12 liter) droplets of
liquid under pressure.
Robotic spotting, capillary action, the DNA sticks
through hydrostatic interactions
The spacing between spot centers is specified from
120-250 µm according to the density required. The
entire microarray usually covers an area of 2.5 x 5.0 cm,
though shorter grids can be printed when fewer
clones are to be represented.
DNA spotting I
• DNA spotting usually uses
multiple pins
• DNA in microtiter plate
• DNA usually PCR amplified
• Oligonucleotides can also be
spotted
Commercial DNA spotter
Oligonucleotide microarrays – pioneered by Affymetrix
Affymetrix GeneChips
• Oligonucleotides
– Usually at least 20–25 bases in length, optimal with 45~60 bp long
– 10–20 different oligonucleotides for each gene
• Oligonucleotides for each gene are selected by a computer program to be the
following:
– Unique in the genome ($4^{20\text{ to }25} = 2^{40\text{ to }50} \gg 3\times10^{9} \approx 2^{31.5}$), so not likely to appear
twice
– Non-overlapping (if the sequence length is too short then specificity is low,
whereas if the length is too long, self-hybridization could happen)
• Composition-based design rules
• Empirically derived rules (the ratio of G-C pairs to A-T pairs affects
the melting temperature of the sequence), that is,
$T_m = 64.9 + 0.41\,(\mathrm{GC\%}) - 675/L$, where L = length of the oligonucleotide
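The empirical rule above can be evaluated directly. The following is a minimal sketch that simply applies the formula as quoted on the slide; the function name and the example probe sequence are hypothetical, and no salt or concentration correction is attempted.

```python
def melting_temperature(oligo: str) -> float:
    """Estimate Tm with the slide's rule: 64.9 + 0.41*(GC%) - 675/L."""
    seq = oligo.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty oligonucleotide")
    # GC% is the percentage (0-100) of G and C bases in the oligo.
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / length
    return 64.9 + 0.41 * gc_percent - 675.0 / length


# Hypothetical 25-mer probe, used only to illustrate the call.
probe = "ATGCGTACGTTAGCCTAGGCTAGCA"
print(round(melting_temperature(probe), 1))
```

Because of the 675/L term, longer oligos of the same GC content get a higher estimate, which is why length and GC content are filtered together during probe design.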
Oligonucleotide microarrays – pioneered by Affymetrix
Construction of oligonucleotide arrays.
Oligonucleotides are synthesized in situ on the silicon chip. (A) In each step, a
flash of light “deprotects” the oligonucleotides at the desired locations on the chip;
then “protected” nucleotides of one of the four types (A, C, G or T) are added so
that a single nucleotide is added to each of the desired chains.
Oligonucleotide microarrays
Construction of oligonucleotide arrays.
The light flash is produced by photolithography using a mask to allow
light to strike only the required features on the surface of the chip.
Photolithography
• Light-activated chemical reaction
– For addition of bases to
growing oligonucleotide
• Custom masks
– Prevent light from reaching
spots where bases not wanted
• Mirrors also used
– NimbleGen™ uses this
approach
[Figure labels: lamp, mask, chip]
Example: building oligonucleotides by photolithography
• Want to add nucleotide G
• Mask all other spots on chip
• Light shines only where addition of
G is desired (light “deprotects” the
oligonucleotides at the desired
location on the chip)
• G added and reacts
• Now G is on subset of
oligonucleotides
Example: adding a second base
• Want to add T
• New mask covers spots where T
not wanted
• Light shines on mask
• T added
• Continue for all four bases
• Need 4 x 20 = 80 masks in total for a
20-mer oligonucleotide
Comparisons of microarrays
Photolithography
Mechanical printing
Ink-jet printing
Design of oligonucleotides by photolithography
• There are four types of masks, one for each added nucleotide. Given a set of oligos to
synthesize, the mask sequence is a common supersequence of the oligo set; in other words, each oligo
is a subsequence of the mask sequence (characters may be separated, but they remain in the
same order).
• Minimizing the number of masks needed to build a supersequence of a given set of words is
the shortest common supersequence problem (SCS problem), which is NP-hard.
• A realization of an oligo is a sequence of masks capable of synthesizing it.
The number of realizations
• Count the number of realizations of the probe sequence GTATC (L=5) in the mask sequence
GGTTATC (L=7).
• It is found that the following four sets of positions can match the probe sequences; (1,3,5,6,7),
(1,4,5,6,7), (2,3,5,6,7) and (2,4,5,6,7).
Probe:     G T A T C
Position:  1 2 3 4 5 6 7
Mask:      G G T T A T C
Mark:      + X + X + + +
The left copies are indicated by the sign +. The instances of identical
characters (repeats) in these intervals are marked by an ‘X’.
Design of oligonucleotides by photolithography
• Count the realizations of the probe sequence ATTAC in the mask sequence
ATTATTACAC. The left and right copies are indicated by the signs + and -.
The instances of identical characters in these intervals are marked by an ‘X’.
• 23 realizations: (1,2,3,4,8), (1,3,5,7,8), (1,5,6,7,8), (4,5,6,7,8), etc.
• The total number of possible paths from Start (S) to End (E) is 23.
• A circle denotes a possible position of a probe character
within the mask sequence.
• An edge joins consecutive positions in the probe sequence.
Source: 吳哲賢, “The problem of counting probe recognitions on biochips”
(生物晶片之探針辨識數目問題), 24th Workshop on Combinatorial Mathematics and
Computation Theory (第二十四屆組合數學與計算理論研討會)
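The two counts quoted above (4 realizations of GTATC in GGTTATC, and 23 realizations of ATTAC in ATTATTACAC) can be checked with a short dynamic program. This is a sketch of one standard way to count the embeddings of a probe as a subsequence of the mask sequence; it is not necessarily the path-counting construction used in the cited talk, and the function name is hypothetical.

```python
def count_realizations(probe: str, mask: str) -> int:
    """Count the position sets at which `probe` occurs as a subsequence of `mask`."""
    # ways[j] = number of ways to realize the first j probe characters
    # using the mask characters scanned so far.
    ways = [1] + [0] * len(probe)
    for base in mask:
        # Go right to left so each mask position is used at most once per realization.
        for j in range(len(probe), 0, -1):
            if probe[j - 1] == base:
                ways[j] += ways[j - 1]
    return ways[len(probe)]


print(count_realizations("GTATC", "GGTTATC"))
print(count_realizations("ATTAC", "ATTATTACAC"))
```

The running time is proportional to the product of the two lengths, so counting realizations is cheap even though choosing a shortest mask sequence is NP-hard.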
Comparison of microarray hybridization
• Spotted microarrays
– Competitive hybridization
• Two labeled cDNAs hybridized to same slide →
measure the relative difference between the signal
intensity of two targets binding to the same spot of
DNA
• Affymetrix GeneChips
– One labeled RNA population per chip
– Comparison made between hybridization intensities of
same oligonucleotides on different chips
Selection of genes for spotting on arrays
• Suppose you are interested in a family of proteins, say a particular class of receptors
• To identify all the genes that are part of the family, you can do a homology search
(PSI-BLAST) or a PubMed keyword search
• PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
• Another way is to use a commercial Affymetrix array
• In the context of spotted arrays, the term probe often refers to the labelled
population of nucleic acid in solution, while in connection with GeneChips™ it is
used to refer to the nucleic acid attached to the array.
• In the MIAME convention, the mobile population of nucleic acid is referred to as the
labelled extract, and the nucleic acid attached to the array as the reporter, feature or spot
Terminology (from the figure labels):
Labeled cDNA or RNA: GeneChips™ target; spotted-array probe; MIAME probe
Bound DNA: GeneChips™ probe; spotted-array target
Selection of regions within genes
• Once you have the list of genes you wish to spot on the array, the next question is
cross-hybridization
• How can you avoid spotting probes that are complementary to more than one gene
(target mRNA or cDNA seq.) if you are working with a gene family (many similar genes)
with similarities in sequence (such as > 70% similarity)?
– That is, a probe could cross-hybridize with different mRNAs
– That is, some probes appear to be more abundant than they really are
– Or a gene’s mRNA (alternative splicing could generate different mRNAs)
could cross-hybridize with different probes → non-specific → not a true expression
level of the gene under study
• A solution to the cross-hybridization problem cannot always be found
• Solve this problem by using → ProbeWiz Server
• Use BLAST to find regions in those genes that are the least homologous to other genes
• ProbeWiz - http://www.cbs.dtu.dk/services/DNAarray/probewiz.php
Selection of primers for PCR
• Once those unique regions have been identified, the probe needs to be
designed → use PCR amplification of a probe
• Solve this problem by using → ProbeWiz or OligoArray servers
• ProbeWiz
– predicts optimal PCR primer pairs for generation of probes for cDNA
arrays
– avoids self-hybridization → hairpin structure → high specificity
• http://www.cbs.dtu.dk/services/DNAarray/probewiz.php
• OligoArray
– Genome-scale oligonucleotide design for microarrays
• http://berry.engin.umich.edu/oligoarray2/
• Other option - use long oligonucleotides (50 to 70 bp) instead of
PCR primers
• Other complicating issues: alternative splicing, SNPs
Selection of primers for PCR
Minimal primer set (MPS) problem
• Given a set of ORF sequences S = {S1, S2, …, Sn} and a primer
length L, one needs to find a minimal set of primers
P = {P1, P2, …, Pk} of length L, such that for every i, Si contains at
least one sequence from P.
• In other words, identify a set of primers P that is shared
among the set of ORF sequences S
• Then select highly specific primers (dissimilar to the
complementary strand of the template, otherwise they will
hybridize to many positions along the template) from P
Example
• S = {ATTC, GATT, TTAC}
• L = 3 → P = {ATT, TTA} or P = {ATT, TAC}
• if L = 2 → P = {TT}
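The small example above can be reproduced by exhaustive search. The sketch below is a brute-force illustration of the minimal primer set problem, assuming exact substring matches only; the function names are hypothetical, and the search is exponential in the number of candidate primers, so it is suitable only for toy inputs.

```python
from itertools import combinations


def kmers(seq, length):
    """All substrings of `seq` with the given length."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


def minimal_primer_sets(sequences, length):
    """Return every smallest set of length-`length` primers that hits all sequences."""
    per_seq = [kmers(s, length) for s in sequences]
    candidates = sorted(set().union(*per_seq))
    for size in range(1, len(candidates) + 1):
        hits = [set(combo) for combo in combinations(candidates, size)
                # A primer set is valid if every sequence contains at least one primer.
                if all(kset & set(combo) for kset in per_seq)]
        if hits:
            return hits
    return []


orfs = ["ATTC", "GATT", "TTAC"]
print(minimal_primer_sets(orfs, 3))
print(minimal_primer_sets(orfs, 2))
```

For realistic numbers of ORFs the problem is a set-cover instance, so heuristics or suffix-tree based pre-selection are used instead of enumeration.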
Selection of whole genome oligonucleotide or cDNA primers
• Automatic generation of whole genome
oligonucleotide or cDNA probes
• Probe pre-selection
– by suffix tree algorithm; the memory space is
O(n) ~ 40n, where n is the length of
the input seq. (e.g. 10000 Hs gene seqs. are
about 35 MB in length, so 30000 human gene seqs.
→ memory space ~ 40 x 35 x 3.0 MB = 4200 MB!)
– Probes are filtered for length and GC content, and must
not contain self-complementary regions > 4 bp
• Hybridization prediction
– The most time-consuming part
– Need to predict melting temperatures Tm for
all probes (on average 4 probes/gene → do
4 x 30000 vs. 30000 Tm calculations, i.e.
4 x 30000 x 30000 = 3.6 x 10^9 runs of the
tool Mfold)
• Probe selection
– Select on the probe-target vs. probe-non-target
seqs.
[Figure: pipeline of Probe pre-selection → Hybridization prediction → Probe selection]
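The pre-selection filter (length, GC content, no self-complementary region longer than 4 bp) might look like the sketch below. The length window and GC bounds are assumptions chosen for illustration, since the slide gives no numbers for them, and the self-complementarity test is a crude exact-match check rather than a thermodynamic prediction.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    return sum(base in "GC" for base in seq) / len(seq)


def has_self_complement(seq, max_stem=4):
    """True if a stretch longer than `max_stem` bp has its reverse complement in `seq`."""
    k = max_stem + 1
    windows = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(w) in windows for w in windows)


def passes_prefilter(seq, min_len=20, max_len=25,
                     gc_min=0.40, gc_max=0.60, max_stem=4):
    """Length, GC and self-complementarity filter; all thresholds are assumed values."""
    seq = seq.upper()
    if not min_len <= len(seq) <= max_len:
        return False
    if not gc_min <= gc_fraction(seq) <= gc_max:
        return False
    return not has_self_complement(seq, max_stem)


print(passes_prefilter("ACACACACACACACACACAC"))   # balanced GC, no hairpin stem
print(passes_prefilter("AAAAACCCGGGGGCCTTTTT"))   # AAAAA can pair with TTTTT
```

Cheap filters like this run before the expensive hybridization prediction step, which reduces the number of Tm calculations that follow.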
Suffix tree - Basic notation
• The concatenation of two strings s and t is denoted by st and is formed
by appending all characters of t after s, in the order they appear in t. For
instance, if s = GGCTA and t = CAAC, then st = GGCTACAAC. The length
of st is |s|+|t|.
• A prefix of s is any substring of s of the form s[1..j] for 0 ≤ j ≤ |s|. We
admit j = 0 and define s[1..0] as being the empty string, which is a prefix
of s as well. Note that t is a prefix of s if and only if there is another string u
such that s = tu. Sometimes one needs to refer to the prefix of s with exactly
k characters, with 0 ≤ k ≤ |s|, and we use the notation prefix(s,k) to denote
this string.
• prefix(s,3) → ATT is a prefix of ATTCGATTTTAC
• A suffix of s is a substring of the form s[i..|s|] for a certain i such that
1 ≤ i ≤ |s|+1. We admit i = |s|+1, in which case s[|s|+1..|s|] denotes the
empty string. A string t is a suffix of s if and only if there is another string u
such that s = ut. The notation suffix(s,k) denotes the unique suffix of s with k
characters, for 0 ≤ k ≤ |s|.
• suffix(s,3) → TAC is a suffix of ATTCGATTTTAC
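The prefix(s,k) and suffix(s,k) notation maps directly onto string slicing. The following minimal sketch uses the slide's example string; the only subtlety is the k = 0 case, where both functions return the empty string.

```python
def prefix(s: str, k: int) -> str:
    """The prefix of s with exactly k characters, for 0 <= k <= |s|."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= |s|")
    return s[:k]


def suffix(s: str, k: int) -> str:
    """The suffix of s with exactly k characters, for 0 <= k <= |s|."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= |s|")
    # s[len(s) - k:] is used instead of s[-k:], which would return all of s when k = 0.
    return s[len(s) - k:]


s = "ATTCGATTTTAC"
print(prefix(s, 3), suffix(s, 3))
# Concatenation: |st| = |s| + |t|
print(len("GGCTA" + "CAAC"))
```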
Suffix tree
• Suffix tree – contains all suffixes of a string
factoring out common prefixes as much as
possible in the tree structure
• Edges are directed away from the root, and
each edge is labeled by a substring from S.
• All edges coming out of a given vertex
have different labels, and all such labels
have different prefixes (not counting the
empty prefix).
• To each leaf there corresponds a suffix
from S, and this suffix is obtained by
concatenating all labels on all edges on the
path from the root to the leaf.
Suffix tree
• Another example: X = AATAATGC$, where $ signals the end of the sequence
• Let the identifying substring S be the shortest substring beginning at position i
which does not occur elsewhere in X
• Suffix tree of X, where () denotes position
• The longest repeat within the string is AAT
Position  Identifying substring, S
1         AATA
2         ATA
3         TA
4         AATG
5         ATG
6         TG
7         G
8         C
9         $
Tree (as labelled in the figure):
A
  ATA (1)
  ATG (4)
  TA (2)
  TG (5)
C (8)
G (7)
TA (3)
TG (6)
$ (9)
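The table of identifying substrings for X = AATAATGC$ can be reproduced with a direct search. This sketch uses naive substring counting rather than an actual suffix tree, so it is quadratic or worse and serves only to illustrate the definition; it assumes the string ends with a unique terminator such as $.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def identifying_substrings(text):
    """For each start position, the shortest substring that occurs only once in text."""
    result = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if count_occurrences(text, candidate) == 1:
                result.append(candidate)
                break
    return result


for position, sub in enumerate(identifying_substrings("AATAATGC$"), start=1):
    print(position, sub)
```

In a suffix tree the same information is read off from the depth at which each suffix branches away from all others, which avoids the repeated scanning done here.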
Suffix tree
• Given a set of three ORF sequences S = {S1, S2, S3}, with S1 = AATG, S2 = TTTG, and
S3 = TTTC.
• Merge S1, S2 and S3 to form AATG$1TTTG$2TTTC$3, with a total length of 15.
• The non-overlapping longest repeat between S1 and S2 is TG, and between S2 and S3 it is TTT
• Each leaf corresponds to a suffix of the merged string, grouped by its first character:
• Leaf A
– AATG$1TTTG$2TTTC$3 (length 15)
– ATG$1TTTG$2TTTC$3 (length 14)
• Leaf C
– C$3 (length 2)
• Leaf G
– G$1TTTG$2TTTC$3 (length 12)
– G$2TTTC$3 (length 7)
• Leaf T
– TC$3 (length 3)
– TG$1TTTG$2TTTC$3 (length 13)
– TG$2TTTC$3 (length 8)
– TTC$3 (length 4)
– TTG$2TTTC$3 (length 9)
– TTTC$3 (length 5)
– TTTG$2TTTC$3 (length 10)
• Leaf $1
– $1TTTG$2TTTC$3 (length 11)
• Leaf $2
– $2TTTC$3 (length 6)
• Leaf $3
– $3 (length 1)
H. Chen and Y.-S. Hou, A study on specific primer selection algorithms using suffix trees,
Journal of Information Technology and Applications, Vol. 1, No. 1, 25-30, 2006.
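The longest repeats quoted for this example (TG between S1 and S2, TTT between S2 and S3) can be checked with a small dynamic program. Note that this is a substitute technique: a generalized suffix tree finds the same answer in linear time, whereas the sketch below uses the simpler quadratic longest-common-substring table. The function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest string that occurs contiguously in both a and b (first one found)."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best


s1, s2, s3 = "AATG", "TTTG", "TTTC"
print(longest_common_substring(s1, s2))
print(longest_common_substring(s2, s3))
```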
cDNA microarrays
Microarrays are used to measure gene expression
levels in two different conditions, with a green label
for the control sample and a red one for the
experimental sample.
DNA-cDNA or DNA-mRNA hybridization.
The hybridised microarray is excited by a laser
and scanned at the appropriate wavelengths for
the red and green dyes.
The amount of fluorescence emitted (intensity)
upon laser excitation ~ the amount of mRNA bound
to each spot.
If the sample in the control/experimental condition is
in abundance → green/red, which indicates the
relative amount of transcript for the mRNA (EST)
in the samples.
If both are equal → yellow
If neither is present → black
Scanning of microarrays
• Confocal laser scanning microscopy
• Laser beam excites each spot of
DNA
• Amount of fluorescence detected
• Different lasers used for different
wavelengths
– Cy3
– Cy5
Analysis of hybridization
• Results given as ratios
• Images use colors:
– Cy3 = green
– Cy5 = red
– Yellow is equal intensity or
no change in expression
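Reporting results as ratios can be sketched as follows. The log2 transform of the Cy5/Cy3 ratio is a common choice, but the two-fold threshold and the intensity floor used to call a spot red, green, yellow or black are assumptions made for illustration and are not values given in the slides.

```python
import math


def log2_ratio(cy5: float, cy3: float) -> float:
    """log2 of the red (Cy5) to green (Cy3) intensity ratio."""
    return math.log2(cy5 / cy3)


def spot_color(cy5: float, cy3: float, threshold: float = 1.0, floor: float = 100.0) -> str:
    """Classify a spot; `threshold` (in log2 units) and `floor` are assumed values."""
    if cy5 < floor and cy3 < floor:
        return "black"   # neither sample gives a signal above the floor
    ratio = log2_ratio(max(cy5, 1.0), max(cy3, 1.0))
    if ratio >= threshold:
        return "red"     # more transcript in the Cy5-labelled sample
    if ratio <= -threshold:
        return "green"   # more transcript in the Cy3-labelled sample
    return "yellow"      # roughly equal intensity


for cy5, cy3 in [(4000, 1000), (1000, 4000), (1500, 1500), (10, 20)]:
    print(cy5, cy3, spot_color(cy5, cy3))
```

Working in log2 units makes a two-fold increase and a two-fold decrease symmetric (+1 and -1), which a raw ratio does not.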
Example of spotted microarray
• RNA from irradiated cells (red)
• Compare with untreated cells
(green)
• Most genes have little change
(yellow)
• Gene CDKN1A: red = increase
in expression
• Gene Myc: green = decrease in
expression
[Figure labels: CDKN1A, MYC]
Visualizing the hybridized target on a microarray can be performed by using either a
confocal detector or a charge-coupled device (CCD) camera.
Microarray images produced with a pin-and-loop arrayer. (A) Two common undesirable
features are indicated, namely high local background (arrow head) and scratches (two
arrows) that would suggest “flagging” of the associated spots. (B) A close-up of a portion of
the array demonstrates the uniformity of relative hybridization within each spot and
differences in the red:green ratio of each clone.
Microarray – overview
[Figure: probe genes on the array; target cDNA labeled by Cy5 (red) and cDNA labeled
by Cy3 (green). By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark]
What can we learn from the
microarray data ?
(1) Microarrays permit an integrated approach to
biology, in which genetic regulation can be
examined → allows us to build a gene network
(2) Classification of disease, diagnosis, prognostic
(judgment of the likely or expected development of a
disease) prediction and pharmaceutical applications
Co-expression of gene expression
• Co-expressed genes → genes involved in common processes → clustering of genes
Examples
• Genes required for nutrition and stress responses
• Genes whose products encode components of metabolic pathways
• Genes encoding subunits of multi-subunit complexes such as the ribosome, the
proteasome and the nucleosome are coordinately expressed
• Ribosome - site of cellular protein synthesis
• Proteasome - large multi-enzyme complexes that digest proteins
• Nucleosome – A length of DNA consisting of about 140 base pairs
makes two turns around the histone core thus forming a nucleosome.
• Animation - http://www.johnkyrk.com/index.html
Co-expression of gene expression
• Waves of co-expressed, temporally regulated
genes have been observed during the
development of the rat spinal cord
• The expression levels of 112 genes at nine
different time points were measured during the
development of rat cervical spinal cord, and of
70 genes during development and following
injury of the hippocampus
http://www.cs.unm.edu/~patrik/networks/data.html
Gene expression profile and phenotype
• Profile or so-called signature
• the combination of the mRNAs (representing a subset of the
total genotype) being expressed by the cell [Thomas A. Houpt,
Nutrition, 827 (2000)]
• Can be thought of as a precise molecular definition of the cell
in a specific state
• Expression profile is a way to describe a phenotype, and can
be used to characterize a wide variety of samples
Example
• human cancer cell lines treated with 70000 agents
independently or in combinations have been used to link drug
activity with its mode of action
• genes and putative drug targets
Affymetrix GeneChip experiment - Profiling tumors
• RNA from four different types of brain
tumors extracted
• Extracted RNA hybridized to GeneChips
containing approximately 6,800 human
genes
• Identified gene expression profiles specific
to each type of tumor
• Image portrays gene expression profiles
showing differences between four different
types of brain tumors
• Tumors:
– MD (medulloblastoma)
– Mglio (malignant glioma)
– Rhab (rhabdoid)
– PNET (primitive neuroectodermal tumor)
• Ncer: normal cerebella
Affymetrix GeneChip experiment - Cancer diagnosis by microarray
• For a single type of tumor, medulloblastoma
(MD), RNA from 60 different tumor
samples was analyzed
• Response to chemotherapy was known for
each of the tumor samples
• Gene expression differences for MD
correlated with response to chemotherapy
• Patients who failed to respond had a
different profile from survivors (who did
respond and survived longer)
• Can use this approach to determine which
tumor samples are likely to respond to
treatment
Microarray data generation, processing and analysis
Two parts
1. Material processing and data
collection
2. Information processing
Five steps - Material processing and
data collection
• Array fabrication
• Preparation of the biological samples
to be studied
• Extraction and labeling of the RNA
from the samples
• Hybridization of the labeled extracts
to the array
• Scanning of the hybridized array
Microarray data generation, processing and analysis
Four steps - Information processing
• Image quantitation – locating the
spots and measuring their
fluorescence intensities
• Data normalization and integration –
construction of the gene expression
matrix from sets of spot quantitations
• Gene expression data analysis and
mining – finding differentially
expressed genes or clusters of
similarly expressed genes
• Generation from these analyses of
new hypotheses about the underlying
biological processes → stimulates
new hypotheses that in turn should
be tested in follow-up experiments
[Figure labels: image analysis, data analysis, clustering. Source:
http://www.mathworks.com/company/pressroom/image_library/biotech.html]
Microarray data processing and analysis
http://www.ebi.ac.uk/microarray/biology_intro.html
Microarray experimental raw data (image data) → spot quantitation matrices
(row = spot on array, column = quantitation of that spot, i.e. mean, median,
background) → gene expression matrix → data analysis (clustering or classification
(SVD or PCA, see http://public.lanl.gov/mewall/kluwer2002.html))
Microarray data processing and analysis
• Clustering
– unsupervised method, i.e. no prior knowledge about function is assigned
to the genes and/or samples
– supervised method, i.e. some prior knowledge about function is assigned to the
genes and/or samples
• Next, the reverse engineering of gene regulatory networks → based on the
hypothesis that genes that have similar expression profiles under a variety of
conditions are likely to be regulated by common mechanisms
• Cluster of genes → some of these genes’ promoter sequences are obtained →
may contain a ‘signal’, e.g. a specific seq. pattern relevant to gene regulation
• Applying different algorithms, different parameters (such as distance
measures), or different data filtering methods → produces different results!
What happened? It reflects the fact that cells typically carry out multiple
processes simultaneously via multiple interacting pathways
• Future research directions
– data analysis methods, quality or reliability of data → in the next generation of
microarrays, each spot is printed or synthesized multiple times → estimate
the measurement reliability using the standard deviation between the individual
measurements → data mining
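Estimating measurement reliability from replicate spots can be sketched with the standard library. The gene names and intensity values below are made-up illustration data, and the use of the sample standard deviation is an assumption about which estimator is intended.

```python
from statistics import mean, stdev


def replicate_summary(measurements):
    """Map each gene to (mean, sample standard deviation) over its replicate spots."""
    return {gene: (mean(values), stdev(values))
            for gene, values in measurements.items()}


# Hypothetical log-ratios for genes that are each spotted three times on one array.
replicates = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [0.10, 0.12, 0.11],
}
for gene, (avg, sd) in replicate_summary(replicates).items():
    print(gene, round(avg, 3), round(sd, 3))
```

A gene whose replicate spots disagree strongly (large standard deviation relative to its mean) is a candidate for flagging before clustering or other data mining.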
Microarray data management
• Microarray database consists of
three major parts – the gene
expression matrix, gene
annotation, and sample
annotation
• No established standards for
microarray experiments or raw
data processing
• No standard ways for
measuring gene expression
levels
Microarray data management
• The Microarray Gene Expression Data Society (MGED), http://www.mged.org,
has developed recommendations for the Minimum Information About a
Microarray Experiment (MIAME) that attempt to define the set of
information sufficient to interpret the experiment and its results
unambiguously, and to enable verification of the data
• MIAME is a set of guidelines for describing an experiment, and the guidelines are
translated into protocols enabling the electronic exchange of data in a
standard format
• The MIAME standard has been adopted and supported by the EBI
ArrayExpress database, NCBI GEO and the CIBEX database at the DDBJ
• Members of MGED joined with Rosetta Inpharmatics, leading to the
development of the microarray gene expression object model (MAGE-OM)
and an XML-based markup language (MAGE-ML)
• MAGE is now built into a wide range of freely available software, including
BASE, BioConductor, and TM4.
Microarray image processing
Labeled probe → transform the fluorescence intensity → transcript abundance → most of
these steps are done by software provided with commercial scanners
• Image processing essentially involves four steps: (1) image acquisition, (2) spot location,
(3) computation of spot intensities, and (4) data reporting
(1) image acquisition
• Raw image of a microarray scan → a 16-bit image file of the intensity of fluorescence
associated with each pixel → a number between 0 and 65535 (i.e. 2^16 - 1).
• Higher resolution uses a 32-bit image file (i.e. 0 ~ 4 x 10^9)
• However, the sources of experimental error are greater than the image resolution!
• Gain on the laser – if the gain is too high → high intensity spots will converge on the
same upper value; if the gain is too low → information at the low end is lost in the background
• Dyes (Cy3 and Cy5) quench (fade) with time, and at different rates, so it is not a good
idea to repeatedly scan the same array
(2) spot location
• Spot location → achieved by laying a grid over the image that places a square or circle
around each spot
• There are always imperfections in the spacing of spots → spots must be re-centered by
deforming the grid so as to maximize the coverage of the spots by the circles
Microarray image processing
(3) computation of spot intensities
• Spot intensity = mean intensity of the pixels within the
circle surrounding a spot, minus the mean intensity of the background
pixels immediately surrounding the spot
(4) data reporting
• Data is usually reported as a tab-delimited text file, with linkage of
the data to genome databases
• Microarray data or protocols are built on XML-based
languages that allow storage and retrieval from public
databases
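The spot intensity computation in step (3) can be sketched as below, together with a check for pixels that have hit the 16-bit ceiling mentioned under image acquisition. The pixel values are invented for illustration, and real scanner software typically offers median-based and other background estimates as well.

```python
from statistics import mean

SATURATION_16BIT = 2 ** 16 - 1  # largest value a 16-bit pixel can hold


def spot_intensity(foreground_pixels, background_pixels):
    """Mean foreground intensity minus mean local background intensity."""
    return mean(foreground_pixels) - mean(background_pixels)


def is_saturated(pixel):
    """True if the pixel sits at the top of the 16-bit range (gain set too high)."""
    return pixel >= SATURATION_16BIT


# Hypothetical pixel values inside the spot circle and in the surrounding ring.
inside = [1200, 1300, 1250]
around = [200, 220, 180]
print(spot_intensity(inside, around))
print(any(is_saturated(p) for p in inside))
```

Background-subtracted values can be negative for very weak spots, which is one reason such spots are often flagged or filtered before ratios are computed.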
A comparison between cDNA and oligonucleotide arrays
cDNA arrays
• Long sequences
• Two-color array platforms
• Spot small DNA sequences, whole genes or arbitrary
PCR products.
• More variability in the system.
• Easier to analyze with appropriate experimental design,
but the choice of direct comparisons on each chip may
limit the feasibility of other comparisons.
Oligonucleotide arrays
• Short sequences due to the limitations of the
synthesis technology.
• Single-color array platforms such as Affymetrix
GeneChips™
• Spot known sequences.
• More reliable data.
• More difficult to analyze. All comparisons are inferred
in the sense that different chips are used for each
measurement. As a result, chip-to-chip variation can
lead to errors in any comparison.
Regardless of the choice of platform, one of the most significant aspects of
experimental design is determining the level of replication that is necessary to achieve
significance in any study.
Two general types of replicates: (1) biological replicates - even inbred strains
of a species held under the same conditions could exhibit fairly significant
inter-individual variation in gene expression; (2) technical replicates – repeated
measurements of the same samples
Patterns of gene expression
Gene function: deduce gene function based on patterns of expression
• Infer gene function by monitoring changes in
expression resulting from experimental perturbations.
• Disadvantage: even simple changes can often produce a large number
of transcriptional changes and these may be difficult to
link to the underlying biological perturbation.
Biomarkers: use patterns of gene expression as a biomarker to
classify samples
• Search for genes exhibiting patterns of expression
that differentiate the various groups
• If the transcriptional differences between groups can
be validated, these expression patterns can then be
used as “biomarkers” in classifying other
experimental subjects.
• Disadvantage: in applications such as these, it is not essential that the
genes themselves be linked causally to the underlying
disease or other phenomenon that separates the classes.
• Functional studies and searches for biomarkers are not mutually exclusive. Ultimately
the most useful and informative biomarkers are likely those that can be linked causally
to a disease or outcome.
• Northern blotting experiments are generally used to test a hypothesis based on biology.
• Microarrays generate hypotheses that should be tested to validate them.
Summary
• A comparison between cDNA and oligonucleotides arrays
• Patterns of gene expression
– Deduce gene function based on patterns of expression vs.
uses patterns of gene expression as a biomarker to classify
samples
• Hyperlink to the National Central Library, National Digital Library of Theses and
Dissertations in Taiwan (國家圖書館全國博碩士論文資訊網), http://etds.ncl.edu.tw/theabs/index.html
• Register an account first
• Do a keyword search, such as microarray
• Look for theses with a full-text PDF only (master's theses; doctoral dissertations
are also acceptable), then download the file
• Sort by year of publication, starting from the most recent
• Select one thesis to report; one thesis per person, and no two students may pick the same one
• Prepare a PPT file for your presentation,
including background, methodology, important results, and
conclusion.