RNA Structure
Franç[email protected]
www.iric.ca
"We finished the genome map, now we can’t figure out how to fold it!" Science (1989) 243, p.786

Plan
• Introduction
• Chemical structure
• Base pairing models
• Experimental constraints
• Beyond secondary structure
• RNA families
07.04.06 - Tunis

Introduction

Sequence-Structure-Function
5’GCGGAUUUAG2MCUCAUDHUDHGGG
AGAGCGM2CCGAC0MUGOMAAGYAUPS
C5MGGAGG7MUCC5MUGUGU5MUPSCG
A1MUCCACAGAAUUGACCA

5’GUGGAACAGUGGUAAUUCCUACGAUUAAGAAACCUGUUUA
CAGAAGGAUCCCCACCUAUGGGCGGGUUAUCAGAUAUAUCA
GGUGGGAAAUUCGGUGGAACACGUGGAGCCUUGUCCUCCGG
GUUAAUGCGCUUUUGGCAUUGGCCCUGCUCCUGAGAGAAGA
AAUAUACUGGGGAACCAGUCUUUACCGACCGUUGUUAUCAGA
AAUUCACGGAGUUCGGCCUAGGUCGGACUCCGAUGGGAACG
CAACGGUUGUUCCGUUUGACUUGUCGCCCGCUACGGCGUGA
GCGUCAAGGUCUGUUGAGUGCAAUCGUAGGACGUCAUUAGU
GGCGAACCCAUACCGAUUACUGUGCUGUUCCAGC

Yeast transfer RNA-Phe. Image from Pande, Stanford U.
S-domain of B. subtilis RNase P RNA. Image from Krasilnikov, Northwestern U.

RNA Families & Function
• 574 families in Rfam
• Local RNA structures in 3’- and 5’-untranslated regions (UTRs) participate in gene regulation and expression:
  – Reduce mRNA degradation
  – Control the rate of mRNA translation
  – Determine mRNA localization (transport)
  – Regulate mRNA processing (complex splicing mechanisms)

Ribonomics
• Identify and characterize the RNAs of the cell
• Families are represented by alignments whose quality increases considerably when high-resolution structures are available
However, speed of RNA sequencing (GenBank) >> speed of RNA structure determination (PDB).
Alternative high-resolution structure determination techniques are needed to determine:
  – Sequence (genome annotation)
  – Structure (folding)
  – Function (family recognition)

Working Hypothesis
We can learn the RNA architectural principles and the sequence-structure relationships from existing structural data, in order to enable computational high-resolution 3D structure
determination from sequence.

Chemical Structure

The riboses and the phosphate groups constitute the backbone and are linked through diester bonds: C5’-O5’ and C3’-O3’. The chain C3’-O3’-P-O5’-C5’ from one ribose to the next is therefore referred to as the phosphodiester linkage.
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539

Torsion Angles
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539

Glycosidic Torsion

A-RNA Double-Helix
The major groove of the A-RNA double-helix is narrow and deep, whereas the minor groove is broad and shallow.
Major & Thibault (2007) In “From Genomes to Therapies” Wiley-VCH. pp 491-539

Watson-Crick Base Pairs
[Figure: G·C Watson-Crick base pair with atom labels; the major groove and minor groove sides are indicated]

Many Possible Base Pairs
Saenger (1984) Principles of Nucleic Acid Structure, Springer-Verlag, p.120-121

Base Pairing Models

Hierarchical Folding
5’GCGGAUUUAG2M CUCAGUDHUDHGGG AGAGCGM2CCAGA C0MUGOMAAGYAUPS C5MUGGAGG7MUC C5MUGUGU5MUPSC GA1MUCCACAGAA UUCGACCA
[Figure: four panels numbered 1 to 4]
rna.ucsc.edu/rnacenter/ribosome_images.html

Representations & Complexity
(((((((..((((........))))(((((((.....)))))))....((((((...)..)))))))))))).
[Figure: the same structure drawn from 5’ to 3’ with its domains labelled: amino acceptor, D, anticodon, T]

Secondary Interactions
[Figure, panel G: interior loop triple. Westhof’s lab]

Tertiary Interactions (Pseudo-Knots)
[Figure, panel D]

A Dot Plot Shows the Helical Regions
[Figure: dot plot for a short hairpin (stem A.U, C.G, A.U closing a GAAA loop); the sequence ACAGAAAUGU runs along both axes and the helix appears as a diagonal of dots]
Diagonal: $W_{i,j} = 0$ for $j - i < 4$

Lowest Free Energy Structure
• The RNA does not fold into a random structure.
• In general, it prefers low-energy conformations.
• The relation between the probability and the energy is given by:
  $\Delta G(\mathrm{str} \mid \mathrm{seq}) = -RT \ln P(\mathrm{str} \mid \mathrm{seq})$ (up to an additive constant), where RT = 0.606 kcal/mol.

Implementation of the Pipas & McMahon Algorithm (a naïve approach)
• List all possible helical regions:

    for i = 1 to n-(p+1)
      for j = i+p+1 to n
        if( pair( i, j ) ) {
          // elongate pair( i, j ) to form helix( i, j )
          l = 1
          while( pair( i+l, j-l ) ) l++
          // keep only helices longer than lmin
          if ( l > lmin ) store helix( i, j, l ) in a set
        }

• Create all possible secondary structures by forming permutations of compatible helical regions
• Evaluate each structure for total free energy of formation from a completely extended chain
=> Problem: there are n! permutations of helical regions
=> Possible solution: probabilistic approaches (Monte Carlo, GA, etc.)
Pipas & McMahon (1975) PNAS 72

Number Of Secondary Structures
$S(N+1) = S(N) + \sum_{k=0}^{N-2} S(k)\, S(N-k-1) \approx 1.8^{N}$
The free energy of 1000 different structures can be calculated in approximately 1 second. Consequently, for an RNA of 100 nucleotides, we have about $3 \times 10^{25}$ structures, which would need roughly $10^{15}$ years to evaluate ($3 \times 10^{22}$ seconds).
Waterman (1978)

Dynamic Programming
• Simple and discrete energy model
• Positions i and j are either base paired or not
• Position i base pairs with at most one base
• Neglect pseudo-knots and triples
• Set maximum loop size
• Linear approximation for multi-branch loops
Finds the minimum free energy structure.
Storage $O(n^2)$; time $O(n^3)$, or $O(n^6)$ with pseudo-knots.

Secondary Structures
An RNA sequence is represented by an ordered list, $S = s_1, s_2, \ldots, s_n$, where n is the length of the sequence and $s_i$ is the ith nucleotide in the sequence.
A secondary structure on S is a set of ordered pairs, i.j, $1 \le i < j \le n$, that satisfies:
• $j - i > p$ (where p is the minimal number of nucleotides in a loop)
• Given i.j and i’.j’, two base pairs, either:
  • i = i’ and j = j’ (they are the same)
  • i < j < i’ < j’ (i.j precedes i’.j’)
  • i < i’ < j’ < j (i.j includes i’.j’)
  • i < i’ < j < j’ (pseudoknot)

Two Base Pairs
[Figure: relative positions of two base pairs i.j and i’.j’]

Simplest Energy Model
The simplest energy model is to consider e(i,j) = -3, -2, and -1 kcal/mol, respectively, for the pairs CG, AU, and GU. The energy of the entire structure is the sum of the energies of its pairs:
$E(S) = \sum_{i.j \in S} e(i,j)$

A Recursive Algorithm
A recursive algorithm allows us to compute the minimum energy structure. Let $W = \min_S E(S)$, where S ranges over all secondary structures. The energy for pairing $s_i$ with $s_j$ is given by e(i,j). The $W_{i,j}$ are computed for all fragments i…j of the RNA.
$W_{i,j} = 0$ for $j - i < 4$
$W_{i,j} = \min\{\, W_{i+1,j},\ W_{i,j-1},\ e(i,j) + W_{i+1,j-1},\ \min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} ) \,\}$
Either $s_i$ or $s_j$ is unpaired (first two terms), or $s_i$ and $s_j$ pair with each other (third term), or they pair with two other bases $k_1 < k_2$, which splits the fragment into two independent parts (last term). The minimum structure is computed using a recursive algorithm (implemented in mfold by Zuker).

Evaluating Structure Between Pair(i, j)
$\min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} )$
[Figure: fragment i…j split at k into the sub-fragments i…k and k+1…j]

Dynamic Programming Table
$W_{i,j} = \min\{\, W_{i+1,j},\ W_{i,j-1},\ e(i,j) + W_{i+1,j-1},\ \min_{i+1 \le k \le j-1} ( W_{i,k} + W_{k+1,j} ) \,\}$
[Figure: dynamic programming table indexed by i (rows) and j (columns)]

Dynamic Programming To Solve Recursive Problems
Consider the Fibonacci sequence: 1 1 2 3 5 8 13 21...
Fib(n) = Fib(n-1) + Fib(n-2), where
Fib(n-1) = Fib(n-2) + Fib(n-3)
Fib(n-2) = Fib(n-3) + Fib(n-4)
Instead of re-computing the same values over and over, you store them in memory.

Applied To Structure Prediction
To compute $W_{i,j}$, the values of $W_{i+1,j}$ and $W_{i,j-1}$ are needed. The value of $W_{i+1,j}$, in turn, needs the value of $W_{i+1,j-1}$.
The value of $W_{i+1,j-1}$ needs to be computed only once, even though we need it twice (or more).

A More Realistic Free-Energy Model

5’ \ 3’   A/U    C/G    G/C    U/A    G/U    U/G
A/U      -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G      -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C      -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A      -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U      -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G      -1.0   -1.9   -2.1   -1.1   -1.5   -0.4

Stacking energy in kcal/mol in double-stranded regions. The base pair in the left column is 5’ to the base pair in the top row.

Ex)
5’ --> 3’
   CA
   GU
3’ <-- 5’
-1.7 kcal/mol

5’ --> 3’
  CAUG
  GUGC
3’ <-- 5’
-1.7 kcal/mol + -0.8 kcal/mol + -2.1 kcal/mol = -4.6 kcal/mol

Destabilizing Loop Energies (kcal/mol)

Loop length    1     5     10    20    30
Internal       -    5.3   6.6   7.0   7.4
Bulge         3.9   4.8   5.5   6.3   6.7
Hairpin        -    4.4   5.3   6.1   6.5

Energy Computation
Loop contribution = 4.4 kcal/mol
C/G : C/G = -2.9 kcal/mol
C/G : G/C = -3.4 kcal/mol
TOTAL = -1.9 kcal/mol
This is not quite what the mfold program by Zuker uses: mfold uses a table for the initial base pair and looks at the nucleotides in the loop for more precise loop contributions.
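
The arithmetic of the two slides above can be checked with a short Python sketch. It only assumes the stacking table and the 4.4 kcal/mol loop term as given in the slides; the names `STACK` and `helix_energy` are made up for this illustration, and the hairpin sequence (`CCG` over `GGC`) is inferred from the two stacks listed in the example. This is not how mfold is implemented.

```python
# Stacking energies (kcal/mol) copied from the table above.
# Rows: the 5' base pair; columns: the base pair that follows it (3').
PAIRS = ["A/U", "C/G", "G/C", "U/A", "G/U", "U/G"]
ROWS = [
    [-0.9, -1.8, -2.3, -1.1, -1.1, -0.8],  # A/U
    [-1.7, -2.9, -3.4, -2.3, -2.1, -1.4],  # C/G
    [-2.1, -2.0, -2.9, -1.8, -1.9, -1.2],  # G/C
    [-0.9, -1.7, -2.1, -0.9, -1.0, -0.5],  # U/A
    [-0.5, -1.2, -1.4, -0.8, -0.4, -0.2],  # G/U
    [-1.0, -1.9, -2.1, -1.1, -1.5, -0.4],  # U/G
]
STACK = {
    (five, three): ROWS[r][c]
    for r, five in enumerate(PAIRS)
    for c, three in enumerate(PAIRS)
}


def helix_energy(top, bottom):
    """Sum the stacking energies of a helix.

    `top` is the upper strand read 5'->3' and `bottom` is its partner
    strand read 3'->5', exactly as the examples are drawn on the slide.
    """
    pairs = [f"{a}/{b}" for a, b in zip(top, bottom)]
    # Each consecutive couple of base pairs contributes one stacking term.
    return sum(STACK[(p, q)] for p, q in zip(pairs, pairs[1:]))


print(round(helix_energy("CA", "GU"), 1))      # -1.7
print(round(helix_energy("CAUG", "GUGC"), 1))  # -4.6

# Hairpin from the "Energy Computation" slide: two stacks plus the loop term.
hairpin_loop = 4.4
print(round(hairpin_loop + helix_energy("CCG", "GGC"), 1))  # -1.9
```

Keeping the table as a dictionary keyed by (5’ pair, 3’ pair) mirrors how the slide tells you to read it, so a lookup error shows up as a `KeyError` on an impossible pair rather than as a silently wrong number.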

A Short RNA Sequence
ACCCCCUCCU UCCUUGGAUC AAGGGGCUCA A

            Optimal (black)   Suboptimal (yellow)
CG/CG           -8.7              -11.6
CG/UA           -2.3
UA/CG           -1.7              -1.7
dsRNA          -12.7              -13.3
LOOP ~      15.0 and 14.9 (column assignment not recoverable from the slide)

Various Programs
Mfold 3.2 @ Rensselaer Polytechnic Institute
Sfold 2.0 @ Wadsworth Bioinformatics Center
RNAfold 1.5 @ University of Vienna
VSfold 4.0 @ Chiba Institute of Technology
Hfold @ University of Montreal
paRNAss @ Bielefeld University
GeneBee @ Belozersky Institute
RDfolder @ Peking University
Pfold @ Aarhus University
ILM @ Washington University
CONTRAfold @ Stanford University
RNA Secondary Structure Prediction @ Wikiomics.org

Low-Resolution Data Improves Predictions

Chemical Probing
[Figure: the nucleotides A, C, G and U with the sites targeted by the chemical probes DEPC, DMS, kethoxal and CMCT]
Stern, Moazed & Noller (1988) Meth Enz 164:488

Knowledge Is Power

Beyond Secondary Structure

Predicting Non-Canonical Base Pairs (sarcin-ricin motif)

A) MC-Fold
GGGUGCUCAGUACGAGAGGAACCGCACCC
(((((((.((((((..)))))))))))))  -26.860
((((((((.(((((..)))))))))))))  -26.680
((((((.(((((((..)))))))))))))  -26.560
(((((((((.((((..)))))))))))))  -26.400
(((((((..(((((..))).)))))))))  -26.150
(((((((..(((((..)))).))))))))  -25.870
(((((..(((((((..))).)))))))))  -25.740
(((((((..((((.....)))))))))))  -25.660
(((((..(((((((..)))).))))))))  -25.270
(((((..((((((.....)))))))))))  -25.250
Parisien, Thibault & Major

B) RNAsubopt
GGGUGCUCAGUACGAGAGGAACCGCACCC
((((((......(....).....))))))  -10.90
((((((...((.(....)..)).))))))  -10.60
((((((....(.(....).)...))))))  -9.50
((((((((.....))..(....)))))))  -9.10
((((((...(..(....)...).))))))  -9.00
((((((((.......))(....)))))))  -8.80
((((((...((.(.....).)).))))))  -8.70
((((((....(.(....))....))))))  -8.70
((((((.................))))))  -8.63
.(((((......(....).....))))).  -8.40
Wuchty, Fontana, Hofacker, Schuster (1999) Biopolymers 49:145

Table 1 | Predictive power

Predicted bps (%)                          INN-HB   Cycle model   Zipper
FP                                           1.1       3.6         50.2
FN                                          16.9       3.0         25.9
TP                                          83.1      97.0         74.1
TP, WC                                      96.8      98.5         75.6
TP, nC                                        -       87.8         64.9
MCC, sqrt[TP/(TP+FN) × TP/(TP+FP)]          90.6      96.7         84.1

The predictive power of the Individual Nearest Neighbors Hydrogen Bonds (INN-HB) and of the cycle model approaches are compared. The sequences of 288 RNA hairpins from PDB structures were submitted to RNAsubopt (INN-HB), MC-Fold (cycle model), and Zipper. The set of hairpins contains 2093 different Watson-Crick (WC) base pairs, including 296 (~14%) non-canonical (nC) tertiary base pairs. Zipper implements a greedy algorithm that zips the sequence in a hairpin, giving a lower bound on predictive power. For each approach, the best of the top 5 predicted structures was analyzed. The Matthews correlation coefficients are given in the last row.
Parisien, Thibault & Major

Sequence To Structure

Loop E (430D)
The best (blue; ranks 4th) of 516 models is superimposed on the crystal structure + 12 others (light blue) (RMSD 0.9 to 2.5 Å).
Parisien & Thibault

RNA Families

Alignment
Frank et al. (2000) RNA 6:1895

Rfam
If the Internet is available: rfam.janelia.org
Each family is represented by an alignment and a corresponding covariance model. New family members are searched for in ~200 complete genomes using the covariance model + Blast.

Erpin
If the Internet is available: tagc.univ-mrs.fr/erpin
Erpin also represents each RNA family by an alignment. The computer representation of the alignment differs from that of Rfam, but the goal is similar: find new family members.

Not Mentioned
• You can define a motif by structural constraints and use programs such as RNAMOT to scan genomic data.
• You can model the 3D structure of an RNA from its secondary structure and a limited number of additional structural constraints using MC-Sym. This requires 3D modeling and RNA structure expertise.
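
As a closing illustration, the recursion for W(i,j) from the dynamic programming slides can be sketched in a few lines of Python. This is only a sketch under the simplest energy model of the talk (e(i,j) = -3, -2, -1 kcal/mol for CG, AU, GU) and the boundary condition W(i,j) = 0 for j - i < 4. It is not the mfold implementation, the names `pair_energy` and `min_energy` are invented for this example, and it returns only the minimum energy, without the traceback that would recover the structure itself.

```python
# Simplest energy model from the slides (kcal/mol), independent of orientation.
PAIR_ENERGY = {
    frozenset("CG"): -3,
    frozenset("AU"): -2,
    frozenset("GU"): -1,
}


def pair_energy(a, b):
    """Return e(i,j) for bases a and b, or None if they cannot pair."""
    return PAIR_ENERGY.get(frozenset((a, b)))


def min_energy(seq, min_sep=4):
    """Fill the table W bottom-up and return W for the whole sequence.

    W[i][j] is the minimum energy of the fragment i..j (0-based, inclusive).
    Fragments with j - i < min_sep keep the value 0, as on the slides.
    """
    n = len(seq)
    if n == 0:
        return 0
    W = [[0] * n for _ in range(n)]
    # Process fragments by increasing length so that every sub-fragment
    # is already stored when it is needed (each value is computed once).
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            # s_i unpaired, or s_j unpaired
            best = min(W[i + 1][j], W[i][j - 1])
            # s_i pairs with s_j
            e = pair_energy(seq[i], seq[j])
            if e is not None:
                best = min(best, e + W[i + 1][j - 1])
            # s_i and s_j pair elsewhere: split the fragment at k
            for k in range(i + 1, j):
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(min_energy("GGGAAACCC"))  # -9: three nested G-C pairs around AAA
print(min_energy("GAAC"))       # 0: the fragment is too short to pair
```

The three nested loops over `span`, `i` and `k` give the $O(n^3)$ time and the table `W` gives the $O(n^2)$ storage quoted on the Dynamic Programming slide.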