
RNA Secondary Structure Prediction
Lecture 11: June 13, 2006
Algorithms of Molecular Biology
13 June 2006
Introduction to RNA Sequence/Structure Analysis

RNAs have many structural and functional uses:
• Translation
• Transcription
• RNA splicing
• RNA processing and editing
• Cellular localization
• Catalysis

RNA functions
• RNA functions as:
– mRNA
– rRNA
– tRNA
– In nuclear export
– Part of spliceosome (snRNA)
– Regulatory molecules (RNAi)
– Enzymes
– Viral genomes
– Retrotransposons
– Medicine

Biological Functions of Nucleic Acids
• tRNA (transfer RNA, adaptor in translation)
• rRNA (ribosomal RNA, component of the ribosome)
• snRNA (small nuclear RNA, component of the spliceosome)
• snoRNA (small nucleolar RNA, takes part in processing of rRNA)
• RNase P (ribozyme, processes tRNA)
• SRP RNA (RNA component of the signal recognition particle)
• …

RNA Sequence Analysis
• RNA sequence analysis differs from DNA sequence analysis
• RNA molecules fold and base pair to form secondary structures
• With RNA, conservation of the structure, not necessarily of the sequence, is most important

Secondary Structures of Nucleic Acids
• DNA is primarily in duplex form.
• RNA is normally single stranded and can adopt diverse secondary structures other than duplex.
[Figure panels: More Secondary Structures; Pseudoknots; rRNA Secondary Structure Base]
Source: Cornelis W. A. Pleij in Gesteland, R. F. and Atkins, J. F. (1993) The RNA World. Cold Spring Harbor Laboratory

3D Structures of RNA: Catalytic RNA
[Figure panels: Secondary Structure of Self-splicing RNA; Tertiary Structure of Self-splicing RNA]
Some structural rules:
• Base pairing is stabilizing
• Unpaired sections (loops) destabilize
• 3D conformation with interactions makes up for this

RNA secondary structure
• E. coli RNase P RNA secondary structure
Image source: www.mbio.ncsu.edu/JWB/MB409/lecture/lecture05/lecture05.htm

tRNA structure
RNA Variations
• Variations in RNA sequence maintain base-pairing patterns for secondary structures
• When the nucleotide at one position changes, the base it pairs with must also change to maintain the same structure
• Such variation is referred to as covariation.

Covariance
• Secondary structure prediction in RNA takes into account conserved patterns of base pairing
• Positions of covariance are treated as conserved matches, since they maintain the secondary structure
• Computationally challenging

Features of RNA
• RNA: polymer composed of a combination of four nucleotides
– adenine (A)
– cytosine (C)
– guanine (G)
– uracil (U)

Features of RNA
• G-C and A-U form complementary hydrogen-bonded base pairs (canonical Watson-Crick)
• G-C base pairs are more stable (3 hydrogen bonds); A-U base pairs are less stable (2 hydrogen bonds)
• Non-canonical pairs can occur in RNA; the most common is G-U

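The pairing rules on this slide can be written down as a small lookup. This is a minimal sketch; the names `CANONICAL`, `WOBBLE`, and `can_pair` are illustrative choices and do not come from the lecture.

```python
# Canonical Watson-Crick pairs and the G-U wobble pair, as listed on the slide.
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}


def can_pair(x: str, y: str, allow_wobble: bool = True) -> bool:
    """Return True if bases x and y can form a base pair under these rules."""
    pair = (x.upper(), y.upper())
    if pair in CANONICAL:
        return True
    return allow_wobble and pair in WOBBLE


print(can_pair("G", "C"))                      # canonical pair
print(can_pair("G", "U"))                      # wobble pair, allowed by default
print(can_pair("G", "U", allow_wobble=False))  # wobble pair switched off
```

The `allow_wobble` switch is there because some of the later methods in the lecture score only complementary (canonical) pairs.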
Features of RNA
• RNA is typically produced as a single-stranded molecule (unlike DNA)
• The strand folds upon itself to form base pairs
• The result is the secondary structure of the RNA

Features of RNA
• Secondary structure is an intermediary between a linear molecule and a three-dimensional structure
• Secondary structure is mainly composed of double-stranded RNA regions formed by folding the single-stranded RNA molecule back on itself

Stem Loops (Hairpins)
• Loops are generally at least 4 bases long

Bulge Loops
• Occur when bases on one side of the structure cannot form base pairs

Interior Loops
• Occur when bases on both sides of the structure cannot form base pairs

Junctions (Multiloops)
• Two or more double-stranded regions converge to form a closed structure

Tertiary Interactions
• Tertiary interactions can be present as well
• They can be located using covariance analysis

Kissing Hairpins
• Unpaired bases of two separate hairpin loops base pair with one another

Pseudoknots
Hairpin-Bulge Interactions
RNA structure prediction methods
• Dot Plot Analysis
• Base-Pair Maximization
• Free Energy Methods
• Covariance Models

How RNA Prediction Methods Were Developed
• Mount p. 334
• After Tinoco et al. measured the energy associated with regions of secondary structure, several energy-based algorithms were developed
• Nussinov and Jacobson (1980), Zuker and Stiegler (1981), Trifonov and Bolshoi (1983) ….

Main approaches to RNA secondary structure prediction
• Energy minimization
– dynamic programming approach
– does not require prior sequence alignment
– requires estimation of energy terms contributing to secondary structure
• Comparative sequence analysis
– uses sequence alignment to find conserved residues and covariant base pairs
– most trusted

Circular Representation
• The bases of the sequence are laid out around a circle to represent a secondary structure
• An arc is drawn for each base pairing in the structure
• If any arcs cross, a pseudoknot is present

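The crossing-arc test for pseudoknots can be sketched in a few lines. This is an illustrative sketch, assuming base pairs are given as 0-based `(i, j)` index tuples; the function name `has_pseudoknot` is not from the lecture.

```python
def has_pseudoknot(pairs):
    """Return True if any two base-pair arcs cross.

    pairs: iterable of (i, j) index tuples, one per base pair.
    Two arcs (i, j) and (k, l) cross when i < k < j < l.
    """
    # Normalize so that each arc runs from the smaller to the larger index.
    arcs = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(arcs):
        for k, l in arcs[a + 1:]:
            if i < k < j < l:
                return True
    return False


nested = [(0, 9), (1, 8), (2, 7)]   # a simple stem: arcs are nested
crossing = [(0, 5), (3, 8)]         # second arc starts inside and ends outside the first
print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

The pairwise comparison is quadratic in the number of base pairs, which is fine for illustration.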
Circular Representation

Image source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html
Base-Pair Maximization
• Find the structure with the most base pairs
• An efficient dynamic programming approach to this problem was introduced by Ruth Nussinov (Tel-Aviv, 1970s).
• Tutorial in the classroom: let us try to reconstruct Nussinov’s algorithm

Nussinov Algorithm
• Four ways to get the best structure between positions i and j from the best structures of the smaller subsequences:
1) Add the i,j pair onto the best structure found for subsequence i+1, j-1
2) Add unpaired position i onto the best structure for subsequence i+1, j
3) Add unpaired position j onto the best structure for subsequence i, j-1
4) Combine two optimal structures i,k and k+1, j

Nussinov Algorithm
Nussinov Algorithm
• Compares a sequence against itself in a dynamic programming matrix
• Four rules for scoring the structure at a particular point
• Since the structure folds upon itself, it is only necessary to calculate half the matrix

Nussinov Algorithm
• Initialization: scores along the main diagonal and the diagonal just below it are set to zero
• Formally, the scoring matrix, M, is initialized:
– M[i][i] = 0 for i = 1 to L (L is sequence length)
– M[i][i-1] = 0 for i = 2 to L

Nussinov Algorithm
• Using the sequence GGGAAAUCC, the
matrix now looks like the following, such
that sequences of length 1 will score 0:
Nussinov Algorithm
• Matrix Fill:
• M[i][j] = max of the following:
– M[i+1][j] (ith residue is hanging off by itself)
– M[i][j-1] (jth residue is hanging off by itself)
– M[i+1][j-1] + S(xi, xj) (ith and jth residues are paired; S(xi, xj) = 1 if xi is the complement of xj, otherwise 0)
– max over i<k<j of (M[i][k] + M[k+1][j]) (merging two substructures)

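The initialization and fill rules can be sketched as follows. This is a minimal sketch, assuming 0-based indices and only G-C and A-U counted as complementary; the names `PAIRS`, `score`, and `nussinov_fill` are illustrative.

```python
# Complementary pairs scored by S(xi, xj); assumes canonical pairs only.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def score(x, y):
    """S(xi, xj): 1 if the two bases are complementary, otherwise 0."""
    return 1 if (x, y) in PAIRS else 0


def nussinov_fill(seq):
    """Return the matrix M where M[i][j] is the maximum number of base pairs in seq[i..j]."""
    n = len(seq)
    # A full n x n matrix of zeros covers both initialization rules:
    # the main diagonal M[i][i] and the diagonal just below it, M[i][i-1].
    M = [[0] * n for _ in range(n)]
    # Fill by increasing subsequence length so smaller subproblems are ready.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(
                M[i + 1][j],                               # i left unpaired
                M[i][j - 1],                               # j left unpaired
                M[i + 1][j - 1] + score(seq[i], seq[j]),   # i paired with j
            )
            for k in range(i + 1, j):                      # merge two substructures
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M


M = nussinov_fill("GGGAAAUCC")
print(M[0][len("GGGAAAUCC") - 1])  # best score for the whole sequence
```

Only the upper triangle is ever written, which matches the remark that half the matrix is enough.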
Nussinov Algorithm
• The final filled matrix is as follows:
Nussinov Algorithm
• Traceback (Durbin et al., p. 271) leads to the following structure:

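A traceback over the filled matrix can be sketched with an explicit stack. This is one possible sketch, not the procedure as printed in Durbin et al.; it assumes 0-based indices and canonical pairs only, and the names `fill`, `traceback`, and `dot_bracket` are illustrative. When several structures tie, it returns whichever one the branch order finds first.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def score(x, y):
    return 1 if (x, y) in PAIRS else 0


def fill(seq):
    """Nussinov matrix fill (same recurrence as the matrix-fill slide)."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            M[i][j] = max(
                M[i + 1][j],
                M[i][j - 1],
                M[i + 1][j - 1] + score(seq[i], seq[j]),
                max((M[i][k] + M[k + 1][j] for k in range(i + 1, j)), default=0),
            )
    return M


def traceback(seq, M):
    """Recover one optimal set of base pairs from the filled matrix."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i + 1][j] == M[i][j]:            # i unpaired
            stack.append((i + 1, j))
        elif M[i][j - 1] == M[i][j]:          # j unpaired
            stack.append((i, j - 1))
        elif score(seq[i], seq[j]) == 1 and M[i + 1][j - 1] + 1 == M[i][j]:
            pairs.append((i, j))              # i pairs with j
            stack.append((i + 1, j - 1))
        else:                                 # bifurcation into two substructures
            for k in range(i + 1, j):
                if M[i][k] + M[k + 1][j] == M[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return pairs


def dot_bracket(n, pairs):
    """Render base pairs as a dot-bracket string of length n."""
    s = ["."] * n
    for i, j in pairs:
        s[i], s[j] = "(", ")"
    return "".join(s)


seq = "GGGAAAUCC"
pairs = traceback(seq, fill(seq))
print(dot_bracket(len(seq), pairs))
```

Note that this sketch does not enforce a minimum hairpin loop length, so it may pair adjacent bases, which a real hairpin could not do.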
Nussinov Algorithm
• Web Interface:
• http://ludwig-sun2.unil.ch/~bsondere/nussinov/
Nussinov Results
Evaluation of Maximizing Basepairs
• Simplistic approach
• Does not give accurate structure predictions
• Does not account for:
– nearest neighbor interactions
– stacking interactions
– loop length preferences

Free Energy Minimization RNA Structure Prediction
• All possible choices of complementary sequences are considered
• Set(s) providing the most energetically stable molecules are chosen
• When RNA is folded, some bases are paired with others while others remain free, forming “loops” in the molecule.
• Speaking qualitatively, bases that are bonded tend to stabilize the RNA (i.e., have negative free energy), whereas unpaired bases form destabilizing loops (positive free energy).
• Through thermodynamics experiments, it has been possible to estimate the free energy of some of the common types of loops that arise.
• Because the secondary structure is related to the function of the RNA, we would like to be able to predict the secondary structure.
• Given an RNA sequence, the RNA Folding Problem is to predict the secondary structure that minimizes the total free energy of the folded RNA molecule.

Prediction of Minimum-Energy RNA Structure is Limited
• In predicting minimum-energy RNA secondary structure, several simplifying assumptions are made:
– The most likely structure is identical to the energetically preferable structure
– Nearest-neighbor energy calculations give reliable estimates of experimentally achievable energy measurements
– Usually we can neglect pseudoknots

Assumptions in Secondary Structure Prediction
• The most likely structure is similar to the energetically most stable structure
• The energy associated with any position is influenced only by local sequence and structure
• The structure formed does not produce pseudoknots

Predicting Structure From a Single Sequence
• An RNA molecule only 200 bases long has 10^50 possible secondary structures
• Find self-complementary regions in an RNA sequence using a dot plot of the sequence against its complement
– repeat regions can potentially base pair to form secondary structures
– advanced dot-plot techniques incorporate free energy measures

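The dot-plot idea can be sketched as a boolean matrix plus a scan for anti-diagonal runs, which correspond to candidate stems. This is an illustrative sketch that considers canonical pairs only; `dot_plot`, `stems`, and the `min_len` parameter are names assumed here, not part of any tool mentioned in the lecture.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def dot_plot(seq):
    """dots[i][j] is True where seq[j] is the complement of seq[i]."""
    return [[COMPLEMENT.get(a) == b for b in seq] for a in seq]


def stems(seq, min_len=3):
    """List (i, j, length) for runs where seq[i+t] pairs with seq[j-t], t = 0..length-1."""
    dots = dot_plot(seq)
    n = len(seq)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip positions that merely continue a run starting further out.
            if i > 0 and j < n - 1 and dots[i - 1][j + 1]:
                continue
            length = 0
            while i + length < j - length and dots[i + length][j - length]:
                length += 1
            if length >= min_len:
                found.append((i, j, length))
    return found


for i, j, length in stems("GGGAAAUCC"):
    print(f"stem starting at pair ({i}, {j}) with {length} consecutive pairs")
```

A run on an anti-diagonal means consecutive bases on the 5’ side pair with consecutive bases read backwards on the 3’ side, which is what a stem looks like in the plot.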
Dot Plot
Image Source: http://www.finchcms.edu/cms/biochem/Walters/rna_folding.html
Energy Minimization Methods
• RNA folding is determined by biophysical properties
• The energy minimization algorithm predicts the secondary structure by minimizing the free energy (ΔG)
• ΔG is calculated as the sum of individual contributions of:
– loops
– base pairs
– secondary structure elements
• Energies of stems are calculated as stacking contributions between neighboring base pairs

Energy Minimization Methods
• Free-energy values (kcal/mole at 37 °C) are as follows:

Energy Minimization Methods
• Given the energy tables and a folding, the free energy of a structure can be calculated

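The bookkeeping behind that calculation can be sketched for the simplest case, a single uninterrupted stem closed by a hairpin loop. This is only a sketch: the energy table is passed in by the caller, and the numbers in the example are made-up placeholders chosen for easy arithmetic, not measured free-energy values. All names (`stem_loop_energy`, `PLACEHOLDER_STACKING`, `placeholder_hairpin_penalty`) are hypothetical.

```python
def stem_loop_energy(seq, pairs, stacking, hairpin_penalty):
    """Sum stacking terms for consecutive base pairs plus one hairpin-loop term.

    seq: RNA sequence.
    pairs: nested (i, j) pairs of one uninterrupted stem, outermost first.
    stacking: dict mapping (outer_pair, inner_pair), e.g. ("GC", "GC"), to kcal/mol.
    hairpin_penalty: function mapping loop length to kcal/mol.
    """
    if not pairs:
        raise ValueError("at least one base pair is required")
    total = 0.0
    for (i, j), (k, l) in zip(pairs, pairs[1:]):
        if (k, l) != (i + 1, j - 1):
            raise ValueError("only an uninterrupted stem is handled in this sketch")
        # Stems contribute through stacking between neighboring base pairs.
        total += stacking[(seq[i] + seq[j], seq[k] + seq[l])]
    i, j = pairs[-1]
    total += hairpin_penalty(j - i - 1)  # unpaired bases enclosed by the last pair
    return total


# PLACEHOLDER numbers, NOT real thermodynamic parameters.
PLACEHOLDER_STACKING = {("GC", "GC"): -3.0}


def placeholder_hairpin_penalty(loop_length):
    return 5.0  # same made-up penalty for every loop length


energy = stem_loop_energy(
    "GGGAAAACCC",
    [(0, 9), (1, 8), (2, 7)],
    PLACEHOLDER_STACKING,
    placeholder_hairpin_penalty,
)
print(energy)
```

With a real table, negative stacking terms and positive loop terms are added in the same way; bulges, interior loops, and junctions would each need their own term.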
Calculating Best Structure
• The sequence is compared against itself using a dynamic programming approach
– similar to finding the maximum base-paired structure
– instead of counting base pairs, the score is based upon the free energy values
– gaps represent some form of a loop
• The most widely used software that incorporates this minimum free energy algorithm is MFOLD.

Free Energy Minimization
RNA Structure Prediction

http://www.bioinfo.rpi.edu/~zukerm/Bio5495/RNAfold-html/
Calculating Best Structure
• The most widely used software incorporating the minimum free energy algorithm is MFOLD
– http://www.bioinfo.rpi.edu/applications/mfold/
– http://www.bioinfo.rpi.edu/applications/mfold/old/rna/

Example Sequence
GCTTACGACCATATCACGTTGAATGCACGC
CATCCCGTCCGATCTGGCAAGTTAAGCAAC
GTTGAGTCCAGTTAGTACTTGGATCGGAGA
CGGCCTGGGAATCCTGGATGTTGTAAGCT
MFOLD Energy Dot Plot
Optimal Structure
Suboptimal Folds
• The correct structure is not necessarily the structure with optimal free energy
• It may instead lie within a certain threshold of the calculated minimum energy
• MFOLD was updated to report suboptimal folds

Comparison of Methods
Inferring Structure By Comparative Sequence Analysis
• The first step is to calculate a multiple sequence alignment
• Requires that the sequences be similar enough to be aligned initially
• Sequences should be dissimilar enough for covarying substitutions to be detected

Mutual Information

$$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$

• f_{x_i}: frequency of a base in column i
• f_{x_i x_j}: joint (pairwise) frequency of a base pair between columns i and j
• Information ranges between 0 and 2 bits
• If i and j are uncorrelated, mutual information is 0

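The mutual information between two alignment columns can be computed directly from observed frequencies. This is a minimal sketch, assuming each column is given as a string with one base per aligned sequence; the function name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i, col_j: equal-length strings, one base per aligned sequence.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)              # single-column counts for column i
    f_j = Counter(col_j)              # single-column counts for column j
    f_ij = Counter(zip(col_i, col_j))  # joint counts for the column pair
    total = 0.0
    for (a, b), count in f_ij.items():
        joint = count / n
        total += joint * log2(joint / ((f_i[a] / n) * (f_j[b] / n)))
    return total


# Perfect covariation: every sequence has a different, complementary pair.
print(mutual_information("GCAU", "CGUA"))
# Both columns vary, but independently of each other.
print(mutual_information("AAGG", "CUCU"))
```

Pairs that never occur contribute nothing to the sum, so only observed pairs are visited. A fully conserved pair of columns also scores 0, since there is no variation to correlate.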
Mutual Information Plot
Frameshifting
• Virology. 2005 Feb 20;332(2):498-510
• Programmed ribosomal frameshifting in decoding the SARS-CoV genome.
Baranov PV, Henderson CM, Anderson CB, Gesteland RF, Atkins JF, Howard MT.
Department of Human Genetics, University of Utah, 15 N 2030 E, Room 7410, Salt Lake City, UT 84112-5330, USA.
Programmed ribosomal frameshifting is an essential mechanism used
for the expression of orf1b in coronaviruses. Comparative analysis of
the frameshift region reveals a universal shift site U_UUA_AAC,
followed by a predicted downstream RNA structure in the form of either
a pseudoknot or kissing stem loops. Frameshifting in SARS-CoV has
been characterized in cultured mammalian cells using a dual luciferase
reporter system and mass spectrometry. Mutagenic analysis of the
SARS-CoV shift site and mass spectrometry of an affinity tagged
frameshift product confirmed tandem tRNA slippage on the sequence
U_UUA_AAC. Analysis of the downstream pseudoknot stimulator of
frameshifting in SARS-CoV shows that a proposed RNA secondary
structure in loop II and two unpaired nucleotides at the stem I-stem II
junction in SARS-CoV are important for frameshift stimulation. These
results demonstrate key sequences required for efficient frameshifting,
and the utility of mass spectrometry to study ribosomal frameshifting.

Frameshifting
• RNA-struct-frameshift.pdf
• frameshifts.pdf
• hepatitisC-frameshift.pdf
Covariance Models
• 7 approaches to locate covarying sites
offered in Mount, p225
• key to covariance is mutual information
content
• mutual information content can be
plotted on a motif logo
Mutual Information
Image source: http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html

Covariance Models
• A formal covariance model, COVE,
devised by Eddy and Durbin
• Provides very accurate results
• extremely slow and unsuitable for
searching large genomes
SCFGs
• Stochastic Context Free Grammars (SCFGs) have also been used to model RNA secondary structure
• Examples
– tRNAscan-SE
– a program created to find snoRNAs
• Grammars are created using a training set of data and are then applied to candidate sequences to see whether they fit the language

SCFGs
• SCFGs allow the detection of
sequences belonging to a family
– tRNAs
– group I introns
– snoRNAs
– snRNAs
SCFGs
• Base-paired columns are modeled by pairwise-emitting nonterminals
– aWu; aWa; aWc; aWg; ...
• Single-stranded columns are modeled by leftwise-emitting nonterminals (when possible)
– aW; cW; gW; uW; ...

SCFGs
• Any RNA structure can be reduced to a
SCFG (see Durbin, et al., p 278-279)
Transformational Grammars
• First described by linguist Noam Chomsky in the 1950s.
– (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

Transformational Grammars
• Very important in computer science, most notably in compiler design
• Covered in detail in compiler and automata classes

Transformational Grammars
• Idea: take an output (a sentence, an RNA structure) and determine whether it can be produced using a set of rules
• A grammar consists of a set of symbols and production rules
• The symbols can be terminal (emitting) symbols or non-terminal symbols

Grammar for Palindromes
• Consider palindromic DNA sequences
• Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol)

Grammar for Palindromes
• Production Rules, where S and W are non-terminal symbols:
• S → W
• W → aWa | cWc | gWg | tWt
• W → a | c | g | t | ε

Derivation of Sequences
• Using these production rules, a derivation of the palindromic sequence acttgttca follows:
• S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca

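The derivation above can be generated mechanically by applying the palindrome grammar's rules from the outside in. This is a minimal sketch; the function name `derive_palindrome` is illustrative, and it raises an error when the input cannot be produced by the grammar.

```python
def derive_palindrome(target):
    """Return the derivation steps of target under the palindrome grammar.

    Rules: S -> W, W -> xWx for x in {a, c, g, t}, W -> x, W -> empty string.
    Raises ValueError if target is not a palindrome over a, c, g, t.
    """
    if any(ch not in "acgt" for ch in target):
        raise ValueError("only the terminals a, c, g, t are allowed")
    steps = ["S", "W"]               # S -> W
    left, right = "", ""
    lo, hi = 0, len(target) - 1
    while lo < hi:
        if target[lo] != target[hi]:
            raise ValueError("not a palindrome: no rule W -> xWy with x != y")
        left += target[lo]           # W -> xWx emits one base on each side
        right = target[hi] + right
        steps.append(left + "W" + right)
        lo += 1
        hi -= 1
    # Odd length: W -> x for the middle base; even length: W -> empty string.
    middle = target[lo] if lo == hi else ""
    steps.append(left + middle + right)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
```

Each step of the loop corresponds to one application of a W → xWx rule, so the number of steps grows with half the sequence length.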
Parse Trees
• A context-free grammar can be aligned to a
sequence using a parse tree
• Root of the tree is the non-terminal start
symbol, S
• Leaves are terminal symbols
• Internal nodes are the nonterminals
• Leaves can be parsed from left to right to
view the results of production
Parse Tree
[Figure: parse tree for acttgttca, with root S above a chain of five nested W nodes; the leaves read a c t t g t t c a from left to right]

RNA Structure SCFG
• S → W
• W → WW (bifurcation)
• W → aWu | cWg | gWc | uWa (stems)
• W → gWu | uWg
• W → aW | cW | gW | uW (bulges)
• W → Wa | Wc | Wg | Wu (bulges)
• W → a | c | g | u | ε

Example of SCFG
• A derivation of the structure that MFOLD produced for the following sequence can be constructed (5’ to 3’):
• GCUUACGACCAUAUCACGUUGAAUGCAC
GCCAUCCCGUCCGAUCUGGCAAGUUAAG
CAACGUUGAGUCCAGUUAGUACUUGGAU
CGGAGACGGCCUGGGAAUCCUGGAUGU
UGUAAGCU

Example Construction
• S
• W
• Wu
• gWcu
• gcWgcu
• gcuWagcu
• gcuuWaagcu
• gcuuaWuaagcu
• gcuuacWguaagcu
• gcuuacgWuguaagcu
• gcuuacgaWuuguaagcu
• gcuuacgacWguuguaagcu
• gcuuacgaccWguuguaagcu
• gcuuacgaccaWguuguaagcu....

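A construction like the one above can be produced automatically from a sequence and its structure. The sketch below assumes the structure is supplied in dot-bracket notation (an input format chosen here for illustration, not taken from the lecture) and handles only unbranched structures, so the bifurcation rule W → WW is deliberately left out. The names `pair_table` and `derive` are hypothetical.

```python
# Pairs allowed by the stem rules of the grammar (canonical pairs plus G-U).
STEM_PAIRS = {"au", "ua", "gc", "cg", "gu", "ug"}


def pair_table(structure):
    """Map each paired position to its partner, from a dot-bracket string."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            open_pos = stack.pop()
            partner[open_pos], partner[pos] = pos, open_pos
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def derive(seq, structure):
    """Return derivation steps for an unbranched structure, outermost rule first."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.lower()
    partner = pair_table(structure)
    steps = ["S", "W"]                            # S -> W
    left, right = "", ""
    i, j = 0, len(seq) - 1
    while i <= j:
        if i == j and i not in partner:           # W -> x (last unpaired base)
            steps.append(left + seq[i] + right)
            return steps
        if partner.get(i) == j:                   # W -> xWy (stem)
            if seq[i] + seq[j] not in STEM_PAIRS:
                raise ValueError(f"no stem rule for pair {seq[i]}-{seq[j]}")
            left, right = left + seq[i], seq[j] + right
            i, j = i + 1, j - 1
        elif i not in partner:                    # W -> xW (leftwise emission)
            left += seq[i]
            i += 1
        elif j not in partner:                    # W -> Wx (rightwise emission)
            right = seq[j] + right
            j -= 1
        else:
            raise ValueError("bifurcation (W -> WW) is not handled in this sketch")
        steps.append(left + "W" + right)
    steps.append(left + right)                    # W -> empty string
    return steps


for step in derive("GCAAAGCU", "((...))."):
    print(step)
```

Leftwise emission is tried before rightwise emission, following the "when possible" preference stated for single-stranded columns.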
Other Programs
• RNA Movies
– http://bibiserv.techfak.uni-bielefeld.de/rnamovies/
– (Visualization of RNA secondary structure)
• RNA LOGOS
– http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html