lecture9 - Stanford AI Lab

+
miRNA Discovery and
Prediction Algorithms
George Michopoulos
+
microRNAs

What are they?

Why do we care about them?

How do we discover them?


Biological Methods

Computational Methods
What limitations do these methods have?
+
What is microRNA?
+
miRNA structure

Small non-coding RNAs

~22-25 bases long

Characterized by their hairpin
precursors, composed of the
mature, the loop, and the star
miRNA
+
miRNA biogenesis

Transcribed in the nucleus

Pri-miRNA hairpin gets cut by
Drosha enzyme

The pre-miRNA then either
degrades into miRNA naturally, or
gets cleaved by the Dicer
enzyme

Then the miRNA gets bound by
an Argonaute protein into an RNA-induced silencing complex (RISC)

Then the complex binds target
mRNA and cleaves it
+
Why do we care?

miRNAs regulate protein expression,
including those involved in:

Cancer – inhibit proteins responsible
for controlling proliferation

Neural development – links to
schizophrenia

Cardiac development – linked to
cardiomyopathies

DNA methylation and histone
modification – can alter the expression
of target genes
+
Why do we care?

The use of antagomirs,
chemically engineered
oligonucleotides, could be
used as a therapy for such
diseases to silence
endogenous microRNA

Non-coding RNAs account
for a significant portion of
the genome, so their
homology can be used as
a tool to assess phylogeny
+
Detection and Discovery


Biological Methods:

Can use RT-PCR and qPCR for individual miRNAs

Can use microarrays to detect multiple miRNAs
Computational Methods:

Mining deep-sequencing data and using predictive algorithms to
detect miRNA characteristics and compare potential sequences to
homologs

Bentwich et al. (2005)

miRAlign: Wang et al. (2005)

miRDeep: Friedländer et al. (2008)

miRDeep2: Friedländer et al. (2011)
+
RT-PCR

Reverse transcription
polymerase chain reaction, not
real time PCR (qPCR)

Desired RNA is reverse transcribed and
the resulting cDNA is amplified
using qPCR

Is useful for detecting very low
copy numbers of RNA molecules;
oldest method, non-specific for
miRNA
+
Northern Blotting

Measure levels of RNA
expression using probes
with partial homology

This picture shows a
northern blot that has
detected 4/5 of the shown
microRNAs

Lower sensitivity, but
higher specificity than RT-PCR

Fewer false positives
+
Microarray Detection


Microarrays first used to detect miRNAs in 2004 by different
groups

Probes can be developed and then chip can be ordered through
companies (Barad et al.)

Everything can be developed and put together using amine-binding slides and an array printer (Miska et al.)
Far more efficient for large-scale discovery, but limited by
the need for prior sequence data for probe development
+
Barad et al. (2004)
Took known miRNA
sequences
Created DNA chips with
probes complementary to
those sequences
Hybridized miRNA
samples onto chips
Performed Clustering
Analysis
Used mirMASA to
confirm findings
Found that the microarray method
has a higher sensitivity and
specificity than previous miRNA
identification methods
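
The probe-design step above can be sketched in a few lines. This is a minimal illustration, assuming a probe is simply the DNA reverse complement of the mature miRNA; the function name and the example sequence are illustrative, and real array probes may include additional flanking or linker sequence that is not modeled here.

```python
# Map each RNA base to the DNA base that pairs with it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def dna_probe_for(mirna: str) -> str:
    """Return a DNA probe (5'->3') complementary to an RNA sequence (5'->3').

    Hypothetical helper for illustration only: it reverses the RNA and
    complements each base, so the probe can hybridize to the miRNA.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mirna.upper()))


# Illustrative 22-nt miRNA-like sequence (an assumption, not taken from the slides).
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
probe = dna_probe_for(example_mirna)
print(probe)
```

Because the probe is derived directly from a known sequence, this also shows why microarrays cannot discover miRNAs whose sequence is not already available.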
+
Useful Programs:
RNAfold

RNAfold is a program that is part of the ViennaRNA Package

Takes in RNA sequences and calculates their minimum free
energy (MFE) secondary structure
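
RNAfold reports a predicted structure in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The sketch below shows one way to turn such a string into a list of base pairs; the function name and the example structure are illustrative assumptions, and the code does not call RNAfold itself.

```python
def parse_dot_bracket(structure: str) -> list[tuple[int, int]]:
    """Convert a dot-bracket string into sorted (i, j) base-pair indices (0-based)."""
    open_positions = []  # stack of indices of unmatched "("
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# A made-up hairpin: a 4-bp stem closing a 3-nt loop.
example_structure = "((((...))))"
print(parse_dot_bracket(example_structure))
```

A hairpin precursor shows up as one long run of nested pairs around a single unpaired loop, which is the pattern that miRNA prediction tools look for.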
+
Useful Programs:
ClustalW

ClustalW is a multiple sequence
alignment tool that is frequently
used to compare homologous
sequences across species, or to
compare families of genes.

Takes in a set of sequences, does
pairwise alignments between them,
builds a guide tree from the
pairwise distances, and then
uses that tree to align the
sequences progressively
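
The pairwise alignment stage can be illustrated with a small dynamic-programming sketch. This is a generic global alignment score (Needleman-Wunsch style) with a linear gap penalty; the scoring values are assumptions chosen for readability, not ClustalW's defaults, and the guide-tree and progressive stages are not shown.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGU", "ACU"))
```

Scores like this, computed for every pair of sequences, can be converted into distances from which a guide tree is built.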
+
Bentwich et al. (2005)

Scanning the entire human genome identified 11 million hairpins, including
86% of known microRNA precursors.

After microarray sampling, the 359 expressed microRNAs were subjected to
confirmation by sequencing


Successfully cloned and sequenced 89 human microRNA genes that do not appear in
the microRNA registry
Using UCSC BlastZ alignment and ClustalW, found that fifty three of these are
located in two large non-conserved clusters, including one on chromosome
19 that is only expressed in the placenta and was the largest microRNA
cluster ever reported.



This cluster comprises 43 new predicted microRNAs which all show similarity to a
neighboring miRNA family specifically expressed in human embryonic stem cells
The other cluster is on the X chromosome and its miRNAs are only expressed in the
testis
Homology analysis showed that both clusters are conserved only in chimpanzees
and possibly rhesus monkeys
+
miRAlign: Wang et al. (2005)

A novel genome-wide
computational approach to
detect miRNAs in animals
based on both sequence and
structure alignment

Uses RNAfold to test
secondary structures, then
CLUSTAL to perform pairwise
alignment, unique algorithms
to confirm the miRNA’s
position on the stem-loop, and
finally RNAforester to conduct
pairwise structure alignment
+
miRAlign: Wang et al. (2005)

miRAlign outperforms BLAST search in both sensitivity and
selectivity, and furthermore, nearly all the known miRNAs
found by BLAST can also be detected by miRAlign.


The average number of false positives is 7.1 for BLAST and 0.9 for
miRAlign
Algorithm is dependent on pre-existing data to search
against, only useful for finding miRNAs that are closely
related to previously annotated ones.
+
miRDeep: Friedländer et al.
(2008)

Suite of PERL scripts

Uses a probabilistic model of
miRNA biogenesis to score
compatibility of the position
and frequency of sequenced
RNA with the secondary
structure of the miRNA
precursor
+
Algorithm
for P(sequence is a precursor)

score = log( P(pre | data) / P(bgr | data) )

The probability of the sequence being a
precursor is given by Bayes’ theorem:

P(pre | data) = P(data | pre) P(pre) / P(data)

P(pre | data) = P(abs | pre) P(rel | pre) P(sig | pre) P(star | pre) P(nuc | pre) P(pre) / P(data)

The same holds for the probability of the
sequence being a background hairpin:

P(bgr | data) = P(data | bgr) P(bgr) / P(data)

P(bgr | data) = P(abs | bgr) P(rel | bgr) P(sig | bgr) P(star | bgr) P(nuc | bgr) P(bgr) / P(data)
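
Because P(data) appears in both posteriors, it cancels in the ratio, so the score reduces to a sum of per-feature log-likelihood ratios plus a log prior ratio. The sketch below works through that arithmetic; the function name, the dictionary layout, and all numeric values are assumptions for illustration, it assumes P(bgr) = 1 - P(pre), and it is not the miRDeep implementation.

```python
import math

# Feature labels taken from the factorization above.
FEATURES = ("abs", "rel", "sig", "star", "nuc")


def log_odds_score(lik_pre: dict, lik_bgr: dict, prior_pre: float) -> float:
    """log(P(pre | data) / P(bgr | data)) under the factorized model.

    lik_pre[f] is P(f | pre) and lik_bgr[f] is P(f | bgr).
    Assumes the two hypotheses are exhaustive, so P(bgr) = 1 - P(pre).
    """
    score = math.log(prior_pre / (1.0 - prior_pre))  # P(data) has cancelled
    for feature in FEATURES:
        score += math.log(lik_pre[feature] / lik_bgr[feature])
    return score


# Made-up likelihoods: every feature is twice as likely under the precursor model.
made_up_pre = {feature: 0.4 for feature in FEATURES}
made_up_bgr = {feature: 0.2 for feature in FEATURES}
print(log_odds_score(made_up_pre, made_up_bgr, prior_pre=0.5))
```

A positive score means the data favor the precursor hypothesis, and a negative score favors a background hairpin.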
+
miRDeep: Friedländer et al.
(2008)

Of the 555 known human mature miRNA sequences, 213 were
present in the data set. Of these, 154 (72%) were successfully
recovered by miRDeep. The total estimated number of false
positives was 6 ± 2

This pipeline is much more efficient at finding microRNA
expression from deep-sequencing than the previous
methods
+
miRDeep2: Friedländer et al.
(2011)

Analyzing data from
seven animal species
representing the major
animal clades, miRDeep2
identified miRNAs with an
accuracy of 98.6–99.9%
and reported hundreds of
novel miRNAs

New package includes
many more options and
graphical outputs that
make the software more
accessible
+
miRDeep2: Friedländer et al.
(2011)

Relative to miRDeep1:







Performs excision by scanning the genome for stacks of reads, where a stack is
one or more reads that map to the exact same 5′ and 3′ positions in the
genome
When identifying miRNAs in data from sea squirts, known to harbor large
numbers of non-canonical miRNAs, the first version of miRDeep only reports 46
known and 31 novel miRNAs. In contrast, miRDeep2 reports 313 known and 127
novel ones
Can detect anti-sense miRNAs (+/-)
Supports single or multiple mismatches.
Performs substantially better on the human data, reporting 186 known and 36
novel miRNAs (compared to 154 known and 10 novel in the initial publication)
More accurate detection of low-abundance miRNAs
Faster; analyzed 30 million RNAs in less than 5 h and with 3 GB memory
More intuitive interface for biologists
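
The notion of a read stack in the first point can be sketched as a simple grouping of mapped reads by their exact genomic ends. This is an illustration only: the `Mapping` record, its field names, and the `min_height` parameter are assumptions, and the sketch does not reproduce how miRDeep2 parses mapping files or excises candidate precursors around each stack.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Mapping(NamedTuple):
    """One mapped read (hypothetical record layout for this sketch)."""
    chrom: str
    strand: str
    five_prime: int   # genomic position of the read's 5' end
    three_prime: int  # genomic position of the read's 3' end


def tallest_stacks(mappings: Iterable[Mapping], min_height: int = 1):
    """Group reads with identical ends into stacks; return (stack, height) pairs.

    Stacks are ordered from tallest to shortest, keeping only those with
    at least min_height reads.
    """
    heights = Counter(mappings)
    return [(stack, height) for stack, height in heights.most_common()
            if height >= min_height]


# Made-up mappings: three identical reads, one offset by a base, two on another locus.
reads = (
    [Mapping("chr1", "+", 100, 121)] * 3
    + [Mapping("chr1", "+", 100, 122)]
    + [Mapping("chr2", "-", 500, 479)] * 2
)
for stack, height in tallest_stacks(reads, min_height=2):
    print(stack, height)
```

Note that the read ending one base later forms its own stack, since a stack requires both ends to match exactly.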
+
Beyond miRDeep2

Remaining challenges in identifying and detecting
expression levels of miRNA:

miRBase, the primary source of miRNA
annotations used today, is far from pristine

Hard to tell whether detected novel miRNAs actually have a
biological function; settling that will take a lot of biological
experimentation

Algorithms still have room for improvement in terms of
accessibility and efficiency
+
Questions?
+
References

Barad, O., Meiri, E., Avniel, A., Aharonov, R., Barzilai, A., Bentwich, I., Einav, U., et al. (2004). MicroRNA
expression detected by oligonucleotide microarrays: System establishment and expression profiling in human
tissues. Genome Research, 14(12), 2486-2494. doi:10.1101/gr.2845604

Bentwich, I., Avniel, A., Karov, Y., Aharonov, R., Gilad, S., Barad, O., Barzilai, A., et al. (2005). Identification of
hundreds of conserved and nonconserved human microRNAs. Nature Genetics, 37(7), 766-770. doi:10.1038/ng1590

Friedländer, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., Knespel, S., & Rajewsky, N. (2008).
Discovering microRNAs from deep sequencing data using miRDeep. Nature biotechnology, 26(4), 407-15.
doi:10.1038/nbt1394

Friedländer, M. R., Mackowiak, S. D., Li, N., Chen, W., & Rajewsky, N. (2011). miRDeep2 accurately identifies
known and hundreds of novel microRNA genes in seven animal clades. Nucleic acids research, 1-16.
doi:10.1093/nar/gkr688

Krüger, J., & Rehmsmeier, M. (2006). RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic
acids research, 34(Web Server issue), W451-4. doi:10.1093/nar/gkl243

Miska, E. A., Alvarez-Saavedra, E., Townsend, M., Yoshii, A., Sestan, N., Rakic, P., Constantine-Paton, M., et al.
(2004). Microarray analysis of microRNA expression in the developing mammalian brain. Genome biology,
5(9), R68. doi:10.1186/gb-2004-5-9-r68

Wang, X., Zhang, J., Li, F., Gu, J., He, T., Zhang, X., & Li, Y. (2005). MicroRNA identification based on sequence
and structure alignment. Bioinformatics (Oxford, England), 21(18), 3610-4. doi:10.1093/bioinformatics/bti562