RNA Structure Franç[email protected] www.iric.ca We finished the genome map, now we can’t figure out how to fold it! Science (1989) 243, p.786


RNA Structure
Franç[email protected]
www.iric.ca
We finished the genome map, now we can’t figure out how to fold it!
Science (1989) 243, p.786
Plan
• Introduction
• Chemical structure
• Base pairing models
• Experimental constraints
• Beyond secondary structure
• RNA families
07.04.06 - Tunis
2
Introduction
Sequence-Structure-Function
5’GCGGAUUUAG2MCUCAUDHUDHGGG
AGAGCGM2CCGAC0MUGOMAAGYAUPS
C5MGGAGG7MUCC5MUGUGU5MUPSCG
A1MUCCACAGAAUUGACCA
5’GUGGAACAGUGGUAAUUCCUACGAUUAAGAAACCUGUUUA
CAGAAGGAUCCCCACCUAUGGGCGGGUUAUCAGAUAUAUCA
GGUGGGAAAUUCGGUGGAACACGUGGAGCCUUGUCCUCCGG
GUUAAUGCGCUUUUGGCAUUGGCCCUGCUCCUGAGAGAAGA
AAUAUACUGGGGAACCAGUCUUUACCGACCGUUGUUAUCAGA
AAUUCACGGAGUUCGGCCUAGGUCGGACUCCGAUGGGAACG
CAACGGUUGUUCCGUUUGACUUGUCGCCCGCUACGGCGUGA
GCGUCAAGGUCUGUUGAGUGCAAUCGUAGGACGUCAUUAGU
GGCGAACCCAUACCGAUUACUGUGCUGUUCCAGC
Yeast transfer RNA-Phe
Image from Pande, Stanford U.
S-domain of B. subtilis RNAse P RNA
Image from Krasilnikov, Northwestern U.
RNA Families & Function
• 574 families in Rfam
• Local RNA structures in the 3’- and 5’-untranslated regions (UTRs) participate in gene regulation and expression:
– Reduce mRNA degradation
– Control the rate of mRNA translation
– Determine mRNA localization (transport)
– Regulate mRNA processing (complex splicing mechanisms)
Ribonomics
• Identify and characterize the RNAs of the cell
• Families are represented by alignments whose quality increases
considerably when high-resolution structures are available
However, the speed of RNA sequencing (GenBank) >> the speed of RNA structure determination (PDB).
Alternative high-resolution structure determination techniques
are needed to determine:
– Sequence (genome annotation)
– Structure (folding)
– Function (family recognition)
Working Hypothesis
We can learn the RNA architectural principles and the sequence-structure relationships from existing structural data, in order to enable computational high-resolution 3D structure determination from sequence.
Chemical Structure
The riboses and the phosphate groups constitute the backbone and are linked through diester bonds: C5’-O5’ and C3’-O3’. The chain C3’-O3’-P-O5’-C5’ from one ribose to another is therefore referred to as the phosphodiester linkage.
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539
Torsion Angles
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539
Glycosidic Torsion
A-RNA Double-Helix
The major groove of the A-RNA
double-helix is narrow and deep,
whereas the minor groove is
broad and shallow.
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539
Watson-Crick Base Pairs
[Figure: Watson-Crick base pairs drawn with atom labels (G and C are labeled); the major groove and minor groove sides of each pair are indicated.]
Many Possible Base Pairs
Saenger (1984) Principles of Nucleic Acid Structure, Springer-Verlag, p.120-121
Base Pairing Models
Hierarchical Folding
[Figure: hierarchical folding shown in four numbered panels (1 to 4). The sequence in the figure is:]
5’GCGGAUUUAG2M
CUCAGUDHUDHGGG
AGAGCGM2CCAGA
C0MUGOMAAGYAUPS
C5MUGGAGG7MUC
C5MUGUGU5MUPSC
GA1MUCCACAGAA
UUCGACCA
rna.ucsc.edu/rnacenter/ribosome_images.html
Representations & Complexity
    5   10        20        30        40        50        60        70
    |    |    |    |    |    |    |    |    |    |    |    |    |    |
(((((((..((((........))))(((((((.....)))))))....((((((...)..)))))))))))).
[Figure: secondary structure diagram with the labels Amino acceptor, D, Anticodon, T, 5’ and 3’.]
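The bracket notation above can be converted into an explicit list of base pairs with a stack. The following is a minimal sketch, not code from the presentation: the function name `parse_dot_bracket` and the 12-nucleotide example string are assumptions chosen for illustration.

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), 1-based, encoded by a dot-bracket string."""
    stack = []   # positions of '(' that still wait for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical 12-nt hairpin, not taken from the slides.
example = "((((....))))"
print(parse_dot_bracket(example))
```

Each `(` is matched with the nearest unmatched `)` to its right, which is why this notation can only describe nested (pseudoknot-free) structures.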
Secondary Interactions
G. Interior loop triple
Westhof’s lab
Tertiary Interactions
(Pseudo-Knots)
D.
A Dot Plot Shows the Helical Regions
[Figure: dot plot for the sequence ACAGAAAUGU, drawn next to the corresponding hairpin (pairs A.U, C.G, A.U); each dot marks a possible base pair and the diagonal is labeled.]
$W_{ij} = 0$ for $j - i < 4$
Lowest Free Energy Structure
• The RNA does not fold into a random structure.
• In general, it prefers low-energy conformations.
• The relation between the probability and the
energy is given by:
$\Delta G(\mathrm{str} \mid \mathrm{seq}) = -RT \ln P(\mathrm{str} \mid \mathrm{seq})$
where RT = 0.606 kcal/mol.
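As an illustration of how energy and probability are linked, the sketch below assumes the standard Boltzmann relation, in which the probability of a structure is proportional to exp(-E / RT). The function name and the three candidate energies are hypothetical; only RT = 0.606 kcal/mol comes from the slide. The probabilities are normalized over the listed candidates only, not over the full ensemble.

```python
import math

RT = 0.606  # kcal/mol, the value quoted on the slide


def boltzmann_probabilities(energies, rt=RT):
    """Boltzmann probabilities of candidate structures given free energies (kcal/mol)."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical free energies for three candidate structures.
candidates = [-10.9, -10.6, -9.5]
for energy, prob in zip(candidates, boltzmann_probabilities(candidates)):
    print(f"{energy:6.2f} kcal/mol -> p = {prob:.3f}")
```

A difference of one RT (about 0.6 kcal/mol) changes the relative probability by a factor of e, so structures a few kcal/mol above the minimum are populated only rarely.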

Implementation of the Pipas & McMahon Algorithm
(A naïve approach)
• List all possible helical regions:
    for i = 1 to n-(p+1)
        for j = i+p+1 to n
            if ( pair( i, j ) ) {
                // elongate pair( i, j ) to form helix( i, j, l )
                l = 1
                while ( (j-l) - (i+l) > p and pair( i+l, j-l ) ) l++
                if ( l > lmin ) store helix( i, j, l ) in a set
            }
• Create all possible secondary structures by forming permutations of compatible helical regions
• Evaluate each structure for its total free energy of formation from a completely extended chain
=> Snag: there are n! permutations of helical regions
=> Possible solution: probabilistic approaches (Monte Carlo, GA, etc.)
Pipas & McMahon (1975) PNAS 72
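The first step (listing helical regions) can be sketched in Python as follows. This is an illustrative sketch, not the published implementation: the pairing rule (Watson-Crick plus G.U), the default values p = 3 and lmin = 1, and the extra filter that keeps only helices that cannot be extended outward are all assumptions.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs plus the G.U wobble."""
    return (a, b) in CANONICAL


def list_helices(seq, p=3, lmin=1):
    """List helices (i, j, l), 1-based: pairs i.j, (i+1).(j-1), ..., l pairs in total."""
    n = len(seq)
    s = " " + seq  # pad so that s[1] is the first nucleotide
    helices = []
    for i in range(1, n - p):
        for j in range(i + p + 1, n + 1):
            if not can_pair(s[i], s[j]):
                continue
            # Added filter (not on the slide): skip helices that extend outward,
            # so that only maximal helices are stored.
            if i > 1 and j < n and can_pair(s[i - 1], s[j + 1]):
                continue
            length = 1
            # Elongate inward while the enclosed loop keeps more than p nucleotides.
            while (j - length) - (i + length) > p and can_pair(s[i + length], s[j - length]):
                length += 1
            if length > lmin:
                helices.append((i, j, length))
    return helices


# Hypothetical 12-nt sequence, not taken from the slides.
print(list_helices("GGGGAAAACCCC"))
```

The combinatorial cost comes from the second step, combining compatible helices, which this sketch deliberately leaves out.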
Number Of Secondary Structures
n2
S(N  1)  S(N)   S(k)S(N  k 1)
k 0
 1.8 N

The free energy of 1000 different structures can
be calculated
 in approximately 1 second.
Consequently, for an RNA of 100 nucleotides,
we have 3 x 1025 structures, which would need
1014 years to calculate.
Waterman (1978)
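The recurrence can be evaluated directly to see how fast the count grows. A minimal sketch, assuming the base cases S(0) = S(1) = S(2) = 1 (they are not stated on the slide) and a hypothetical function name:

```python
def count_secondary_structures(n_max):
    """Return [S(0), ..., S(n_max)] from S(N+1) = S(N) + sum_{k=0}^{N-2} S(k) S(N-k-1)."""
    counts = [1, 1, 1]  # assumed base cases S(0), S(1), S(2)
    for n in range(2, n_max):
        counts.append(counts[n] + sum(counts[k] * counts[n - k - 1] for k in range(n - 1)))
    return counts[: n_max + 1]


table = count_secondary_structures(100)
print(table[:9])                          # the first few counts
print(f"S(100)  = {table[100]:.3e}")      # exact count, shown in scientific notation
print(f"1.8^100 = {1.8 ** 100:.3e}")      # the rough estimate used on the slide
```

Python integers have arbitrary precision, so the exact count for N = 100 is obtained without overflow.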
Dynamic Programming
• Simple and discrete energy model
• Positions i and j are either base paired or not
• Position i base pairs with at most one base
• Neglect pseudo-knots and triples
• Set maximum loop size
• Linear approximation for multi-branch loops
=> Finds the minimum free energy structure
=> Storage O(n^2)
=> Time O(n^3), or O(n^6) with pseudo-knots
Secondary Structures
An RNA sequence is represented by an ordered list, S = s1, s2, …, sn,
where n is the length of the sequence and si is the ith nucleotide in the sequence.
A secondary structure on S is a set of ordered pairs, i.j,
1 <= i < j <= n that satisfies:
• j – i > p (where p is the minimal number of nucleotides in a loop)
• Given i.j and i’.j’, two base pairs, either:
• i = i’ and j = j’ (they are the same)
• i < j < i’ < j’ (i.j precedes i’.j’)
• i < i’ < j’ < j (i.j includes i’.j’)
• i < i’ < j < j’ (i.j and i’.j’ form a pseudoknot, which the dynamic programming model neglects)
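The relations between two base pairs listed above can be checked mechanically. The sketch below is an illustration under assumptions: the function names are hypothetical, p = 3 is an assumed default, and a structure is accepted only if every pair of base pairs is either in the "precedes" or the "includes" relation.

```python
def classify(pair_a, pair_b):
    """Classify two base pairs (i, j) with i < j, given as 1-based tuples."""
    (i, j), (i2, j2) = sorted([pair_a, pair_b])
    if (i, j) == (i2, j2):
        return "same"
    if j < i2:
        return "precedes"      # i < j < i' < j'
    if i < i2 and j2 < j:
        return "includes"      # i < i' < j' < j
    if i < i2 < j < j2:
        return "pseudoknot"    # i < i' < j < j'
    return "conflict"          # the two pairs share a nucleotide


def is_secondary_structure(pairs, p=3):
    """True if all pairs span more than p nucleotides and are mutually nested or disjoint."""
    pairs = sorted(set(pairs))
    if any(j - i <= p for i, j in pairs):
        return False
    return all(
        classify(a, b) in ("precedes", "includes")
        for index, a in enumerate(pairs)
        for b in pairs[index + 1:]
    )


print(classify((1, 10), (2, 9)))                   # nested pairs
print(classify((1, 6), (3, 9)))                    # crossing pairs
print(is_secondary_structure([(1, 10), (2, 9)]))
```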
Two Base Pairs
[Figure: relative arrangements of two base pairs i.j and i’.j’.]
Simplest Energy Model
The simplest energy model is to consider e(i,j) = –3, -2, and –1 kcal/mole,
respectively for the pairs CG, AU, and GU.
The energy of the entire structure is the sum of the energies of its pairs:
$E(S) = \sum_{i.j \in S} e(i,j)$
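A minimal sketch of this energy model, using the pair energies quoted above. The function name, the 1-based pair convention, and the example sequence are assumptions made for illustration.

```python
# Pair energies in kcal/mol from the simplest model: CG = -3, AU = -2, GU = -1.
PAIR_ENERGY = {"CG": -3, "GC": -3, "AU": -2, "UA": -2, "GU": -1, "UG": -1}


def structure_energy(seq, pairs):
    """Sum e(i, j) over the base pairs (i, j), 1-based, of a structure on seq."""
    total = 0
    for i, j in pairs:
        key = seq[i - 1] + seq[j - 1]
        if key not in PAIR_ENERGY:
            raise ValueError(f"positions {i} and {j} ({key}) cannot pair in this model")
        total += PAIR_ENERGY[key]
    return total


# Hypothetical hairpin: two G.C pairs and one G.U pair.
print(structure_energy("GGGAAAUCC", [(1, 9), (2, 8), (3, 7)]))
```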
A Recursive Algorithm
A recursive algorithm allows us to compute the minimum energy structure.
Let $W = \min_S W_S$, where S ranges over all secondary structures. The energy for pairing s_i with s_j is given by e(i,j). The $W_{ij}$ are computed for all fragments i…j of the RNA:
$W_{ij} = 0$ for $j - i < 4$
$W_{ij} = \min\{\, W_{i+1,j},\; W_{i,j-1},\; e(i,j) + W_{i+1,j-1},\; \min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} ) \,\}$
Either bases s_i and s_j do not pair, or else they pair with some bases k1 < k2, or else s_i and s_j pair with each other. The minimum energy structure is computed using a recursive algorithm (implemented in mfold by Zuker).
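The recursion can be filled in bottom-up, from short fragments to the full sequence. The sketch below uses the simplest energy model (CG = -3, AU = -2, GU = -1 kcal/mol) and returns only the minimum energy, without traceback. It is an illustration of the recurrence, not the mfold implementation; the function names and test sequence are assumptions.

```python
# Simplest energy model: energy depends only on the two paired bases.
PAIR_ENERGY = {frozenset("CG"): -3, frozenset("AU"): -2, frozenset("GU"): -1}


def pair_energy(a, b):
    """e(i, j) for bases a and b, or None if they cannot pair in this model."""
    return PAIR_ENERGY.get(frozenset((a, b)))


def min_energy(seq, min_span=4):
    """Fill the table W (0-based indices) and return W for the whole sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # W[i][j] stays 0 for j - i < min_span, as in the boundary condition.
    W = [[0] * n for _ in range(n)]
    for span in range(min_span, n):          # shorter fragments first
        for i in range(n - span):
            j = i + span
            best = min(W[i + 1][j], W[i][j - 1])        # s_i or s_j unpaired
            e = pair_energy(seq[i], seq[j])
            if e is not None:
                best = min(best, e + W[i + 1][j - 1])   # s_i pairs with s_j
            for k in range(i + 1, j):                   # split into two fragments
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


# Hypothetical 9-nt hairpin: three G.C pairs closing an AAA loop.
print(min_energy("GGGAAACCC"))
```

The three nested loops over span, i, and k give the O(n^3) time and the table gives the O(n^2) storage.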
Evaluating Structure Between Pair(i, j)
$\min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} )$
[Figure: the fragment i…j split at position k into the sub-fragments i…k and k+1…j.]
Dynamic Programming Table
$W_{ij} = \min\{\, W_{i+1,j},\; W_{i,j-1},\; e(i,j) + W_{i+1,j-1},\; \min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} ) \,\}$
[Figure: dynamic programming table with axes i and j.]
Dynamic Programming To Solve Recursive Problems
Consider the Fibonacci sequence: 1 1 2 3 5 8 13 21...
Fib(n) = Fib(n-1) + Fib(n-2)
where
Fib(n-1) = Fib(n-2) + Fib(n-3)
Fib(n-2) = Fib(n-3) + Fib(n-4)
Instead of re-computing over and over the same
values you store them in memory.
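A small sketch of this idea in Python, using the standard library cache so that each Fib(n) is computed once. The function name `fib` and the indexing Fib(1) = Fib(2) = 1 are chosen to match the sequence shown above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Fibonacci number with Fib(1) = Fib(2) = 1; results are stored after the first call."""
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)


print([fib(k) for k in range(1, 9)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```

Without the cache the number of calls grows exponentially with n; with it, each value is computed once and then looked up.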
Applied To Structure Prediction
To compute $W_{i,j}$, the values of $W_{i+1,j}$ and $W_{i,j-1}$ are needed. The value of $W_{i+1,j}$, in turn, needs the value of $W_{i+1,j-1}$. The value of $W_{i+1,j-1}$ needs to be computed only once, even though we need it twice (or more).
A More Realistic Free-Energy Model
5’ \ 3’   A/U    C/G    G/C    U/A    G/U    U/G
A/U      -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G      -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C      -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A      -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U      -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G      -1.0   -1.9   -2.1   -1.1   -1.5   -0.4

Stacking energy in kcal/mol in double-stranded regions.
The base pair in the left column is 5’ to the base pair in the top row.

Ex)
5’ --> 3’
   CA
   GU
3’ <-- 5’
-1.7 kcal/mol

5’ --> 3’
  CAUG
  GUGC
3’ <-- 5’
-1.7 kcal/mol + -0.8 kcal/mol + -2.1 kcal/mol = -4.6 kcal/mol
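The stacking table can be used as a lookup keyed by consecutive base pairs. The sketch below reproduces the two examples above; the data layout and the function name `stacking_energy` are assumptions made for illustration, and the values are the ones listed in the table.

```python
# Column order of the table (3' base pair); rows are the 5' base pair.
PAIRS = ["A/U", "C/G", "G/C", "U/A", "G/U", "U/G"]
STACKING = {
    "A/U": [-0.9, -1.8, -2.3, -1.1, -1.1, -0.8],
    "C/G": [-1.7, -2.9, -3.4, -2.3, -2.1, -1.4],
    "G/C": [-2.1, -2.0, -2.9, -1.8, -1.9, -1.2],
    "U/A": [-0.9, -1.7, -2.1, -0.9, -1.0, -0.5],
    "G/U": [-0.5, -1.2, -1.4, -0.8, -0.4, -0.2],
    "U/G": [-1.0, -1.9, -2.1, -1.1, -1.5, -0.4],
}


def stacking_energy(top, bottom):
    """Total stacking energy (kcal/mol) of a helix.

    top is written 5'->3' and bottom 3'<-5' beneath it, so top[k] pairs with bottom[k].
    """
    steps = [f"{t}/{b}" for t, b in zip(top, bottom)]
    return sum(
        STACKING[five_prime][PAIRS.index(three_prime)]
        for five_prime, three_prime in zip(steps, steps[1:])
    )


print(round(stacking_energy("CA", "GU"), 1))      # one stack: C/G followed by A/U
print(round(stacking_energy("CAUG", "GUGC"), 1))  # three stacks
```

A helix of n base pairs contributes n - 1 stacking terms, one per pair of adjacent base pairs.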
Destabilizing Loop Energies
Loop length
Internal
Bulge
Hairpin
-
1
3.9
4.4
5
5.3
4.8
5.3
10
6.6
5.5
6.1
20
7.0
6.3
6.5
30
7.4
6.7
Energy Computation
Loop contribution = 4.4 kcal/mol
C/G : C/G = -2.9 kcal/mol
C/G : G/C = -3.4 kcal/mol
TOTAL = -1.9 kcal/mol
Not quite yet what is used in the mfold program by Zuker!
Zuker uses a table for the initial base pair and looks at the nucleotides in the loop for more precise loop contributions.
A Short RNA Sequence
ACCCCCUCCU UCCUUGGAUC AAGGGGCUCA A
Optimal (black)
CG/CG
-8.7
CG/UA
-2.3
UA/CG
-1.7
-1.7
dsRNA -12.7
-13.3
LOOP~
Suboptimal (yellow)
15.0
-11.6
14.9
Various Programs
Mfold 3.2 @ Rensselaer Polytechnic Institute
Sfold 2.0 @ Wadsworth Bioinformatics Center
RNAfold 1.5 @ University of Vienna
VSfold 4.0 @ Chiba Institute of Technology
Hfold @ University of Montreal
paRNAss @ Bielefeld University
GeneBee @ Belozersky Institute
RDfolder @ Peking University
Pfold @ Aarhus University
ILM @ Washington University
CONTRAfold @ Stanford University
RNA Secondary Structure Prediction @ Wikiomics.org
Low-Resolution Data Improves Predictions
Chemical Probing
[Figure: chemical structures of the nucleotides (A, G and U are labeled) showing the sites targeted by the chemical probes DEPC, DMS, kethoxal and CMCT.]
Stern, Moazed & Noller (1988) Meth Enz 164:488
Knowledge Is Power
Beyond Secondary Structure
Predicting Non-Canonical Base Pairs
(sarcin-ricin motif)
A) MC-Fold
    5   10   15   20   25
    |    |    |    |    |
GGGUGCUCAGUACGAGAGGAACCGCACCC
(((((((.((((((..)))))))))))))  -26.860
((((((((.(((((..)))))))))))))  -26.680
((((((.(((((((..)))))))))))))  -26.560
(((((((((.((((..)))))))))))))  -26.400
(((((((..(((((..))).)))))))))  -26.150
(((((((..(((((..)))).))))))))  -25.870
(((((..(((((((..))).)))))))))  -25.740
(((((((..((((.....)))))))))))  -25.660
(((((..(((((((..)))).))))))))  -25.270
(((((..((((((.....)))))))))))  -25.250

B) RNAsubopt
    5   10   15   20   25
    |    |    |    |    |
GGGUGCUCAGUACGAGAGGAACCGCACCC
((((((......(....).....))))))  -10.90
((((((...((.(....)..)).))))))  -10.60
((((((....(.(....).)...))))))   -9.50
((((((((.....))..(....)))))))   -9.10
((((((...(..(....)...).))))))   -9.00
((((((((.......))(....)))))))   -8.80
((((((...((.(.....).)).))))))   -8.70
((((((....(.(....))....))))))   -8.70
((((((.................))))))   -8.63
.(((((......(....).....))))).   -8.40

Parisien, Thibault & Major
Wuchty, Fontana, Hofacker, Schuster (1999) Biopolymers 49:145
Table 1 | Predictive power
Predicted bps (%)   FP     FN     TP
INN-HB              1.1    16.9   83.1
Cycle model         3.6    3.0    97.0
Zipper              50.2   25.9   74.1
96.8
-
98.5
87.8
75.6
64.9
90.6
96.7
84.1
WC
nC
TP/(TP+FN)
TP/(TP+FP)

The predictive power of the Individual Nearest Neighbors Hydrogen Bonds (INN-HB)
and of the cycle model approaches are compared. The sequences of 288 RNA
hairpins from PDB structures were submitted to RNAsubopt (INN-HB), MC-Fold
(Cycle model), and Zipper. The set of hairpins contains 2093 different Watson-Crick
(WC) base pairs, including 296 (~14%) non-canonical (nC) tertiary base pairs. Zipper
implements a greedy algorithm that zips the sequence in a hairpin, giving a lower
bound on predictive power. For each approach, the best of the top 5 predicted
structures was analyzed. The Matthews correlation coefficients are given in the last
row.
Parisien, Thibault & Major
Sequence To Structure
Loop E (430D)
The best (blue; ranks 4th) of 516 models is superimposed on the
crystal structure + 12 others (light blue) (RMSD 0.9 to 2.5 Å).
Parisien & Thibault
RNA Families
Alignment
Frank et al. (2000) RNA 6:1895
Rfam
If Internet is available: rfam.janelia.org
Each family is represented by an alignment and
a corresponding covariance model. New family
members are searched for in ~200 complete
genomes using the covariance model + BLAST.
Erpin
If the Internet is available: tagc.univ-mrs.fr/erpin
Erpin also represents each RNA family by an
alignment. The computer representation of the
alignment differs from that of Rfam, but the goal is
similar: find new family members.
Not Mentioned
• You can define a motif by structural
constraints and use programs such as
RNAMOT to scan genomic data.
• You can model the 3D structure of an RNA
from secondary structure and a limited
number of additional structural constraints
using MC-Sym. This requires 3D modeling
and RNA structure expertise.