An Investigation into Selection
Constraints in RNA Genes
Naila Mimouni, Rune Lyngsoe and Jotun Hein
Department of Statistics, Oxford University
Aim
• A robust approach to investigate selection patterns acting on ncRNAs.
Results:
1- Selection Patterns on the stem classes:
• We validate the hypothesis on different ncRNA families.
• Pattern observed in snoRNA, snoRNA+tRNA.
• Not observed in miRNA class III, because the conservation of the mature
miRNA drives class III conservation upwards.
Motivation
• Selection in RNA:
+ No RNA equivalent of Ka/Ks.
+ Comparing paired vs. unpaired bases inconclusive.
+ No trend valid for most ncRNA families has been identified.
• RNA Gene Prediction:
+ Improve identification of ncRNAs: reduce false positives.
+ Identification of RNA function, active sites, and pseudogenes.
Method
Idea:
• Exploit conservation of RNA secondary structure for homologous
sequences.
• By analogy with the contrasting properties of different codon positions
in protein coding genes, we use the nearest neighbour energy model for
RNA structure to infer the effects of disrupting base pairs at different
positions in a stem.
• We divide stem positions into structural classes.
• We then validate the class hypothesis for different ncRNA families and
datasets, and establish that stem positions are under different selective
constraints.
Approach:
A) Folding homologous RNA sequences:
1- Pfold: mutational model, SCFGs.
2- RNACast & RNAforester: Abstract shapes, structure alignment.
B) Classification of base pairs:
Fig. 1: A stem of seven contiguous base
pairs annotated with the classes as
defined by thermodynamic considerations.
It is assumed that base pairs [i, j]
and [i + 6, j - 6] adjoin loops.
• Based on the Turner energy parameters, we can quantify the energy cost
of disrupting a bp depending on its proximity to a loop.
• We also consider the structural effects. Under the assumption that only
stacking base pairs are stable, we observe that disrupting a class II base
pair will also disrupt adjoining terminal base pairs leading to a larger
structural effect than disrupting class I and class III base pairs.
            Terminal     Penultimate   Other
            (Class I)    (Class II)    (Class III)
Stability   Small        Large         Large
Structure   Small        Large         Small
Overall     Small        Large         Medium
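The class labelling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the consensus structure is given in dot-bracket notation, treats a stem as a maximal run of directly stacked pairs (i, j), (i + 1, j - 1), ..., and labels each pair by its distance to the nearest stem end. The function names are hypothetical, and the handling of stems shorter than four base pairs (where a pair may be penultimate from one end and terminal from the other) is an assumption, since the poster does not state it.

```python
def parse_pairs(structure):
    """Return the sorted (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def split_into_stems(pairs):
    """Group sorted pairs into stems of directly stacked base pairs."""
    stems, current = [], []
    for i, j in pairs:
        # A pair extends the current stem only if it stacks on the previous pair.
        if current and (i, j) == (current[-1][0] + 1, current[-1][1] - 1):
            current.append((i, j))
        else:
            if current:
                stems.append(current)
            current = [(i, j)]
    if current:
        stems.append(current)
    return stems


def classify_structure(structure):
    """Map each base pair to 'I' (terminal), 'II' (penultimate) or 'III' (other)."""
    labels = {}
    for stem in split_into_stems(parse_pairs(structure)):
        last = len(stem) - 1
        for k, pair in enumerate(stem):
            # Distance to the nearest end of the stem (assumed rule for short stems).
            distance = min(k, last - k)
            labels[pair] = ("I", "II")[distance] if distance < 2 else "III"
    return labels


if __name__ == "__main__":
    # A hairpin with a stem of seven contiguous base pairs, as in Fig. 1.
    hairpin = "(((((((....)))))))"
    for pair, cls in sorted(classify_structure(hairpin).items()):
        print(pair, cls)
```

For the seven-pair stem of Fig. 1 this yields classes I, II, III, III, III, II, I from the outermost pair inwards. Under this sketch a bulge or interior loop ends a stem, so the pairs on either side of it are labelled terminal.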
C) Evolutionary Analysis using Phase:
• Use Phase software to analyse the homologous sequences, with
conserved structure, annotated with classes.
• Compare the average number of substitutions in each class.
Data:
+ Two datasets: Rfam RNA sequence alignments of human, mouse, rat
and chicken.
+ 1st dataset: Alignments: 112 snoRNA, 95 miRNA, 140 ncRNA.
+ 2nd dataset: Alignments: 54 snoRNA, 79 miRNA, 508 ncRNA.
1: http://www.bioinfo.rpi.edu/~zukerm/cgi-bin/efiles-3.0.cgi
2: Mathews, D. H. & Sabina, J. & Zuker, M. & Turner, D. H. 1999. J. Mol. Biol., 288, 911-40.
3: http://rana.lbl.gov/drosophila/wiki/index.php/Noncoding_RNA
4: Knudsen, B. & Hein, J. 2003. Nucleic Acids Research, 31, 3423-8.
5: Reeder, J. & Giegerich, R. 2005. Bioinformatics, 21, 3516-23.
6: Höchsmann, M. & Voss, B. & Giegerich, R. 2004. IEEE/ACM Transactions on
Computational Biology and Bioinformatics (TCBB), 1, 53-62.
7: http://www.cs.man.ac.uk/~gowridhv/
Dataset             Class I   Class II   Class III
snoRNA1             1.31      1.17       1.24
snoRNA1+tRNA1       0.73      0.68       0.72
miRNA1              0.78      0.71       0.60
ncRNA1              0.32      0.25       0.22
ncRNA1+snoRNA1      0.73      0.58       0.68
snoRNA2             0.77      0.50       0.67
miRNA2              0.49      0.42       0.31
ncRNA2              0.57      0.47       0.33
snoRNA1*            1.07      0.89       0.98
miRNA1*             0.91      0.72       0.37
ncRNA1*             0.67      0.44       0.64
ncRNA1*+snoRNA1*    0.91      0.74       0.87
2- Significance of the classification
Does our classification model perform better than no classification?
Chi-square test of nested models:
• Real Data: the differences in inferred rates observed in our ncRNA families
are statistically significant with a p-value of 0.01.
• Simulated Data: we simulated 100 datasets according to the no-classification model and with random classification.
Fig. 2: Plot of differences in log-likelihoods under the traditional no-classification model and our
classification. For comparison, the
corresponding percentiles of the chi-square distribution with 58 degrees
of freedom and the observed
difference for the ncRNA1+snoRNA1
dataset are also plotted.
In 90% of the cases, the difference of
log-likelihoods is less than 85.95,
the value corresponding to a p-value of 0.01.
All of the simulated-data log-likelihood
differences are smaller than that of the
data given our structural classification,
194.64. This indicates that our
classification is a better fit than no
classification.
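The comparison against the chi-square distribution can be reproduced with a short calculation. The sketch below is illustrative only: it takes the reported figures (85.95 and 194.64) as the test statistic exactly as the poster gives them, without deciding whether they already include the conventional factor of two of a likelihood-ratio test, and it uses the closed-form upper tail of the chi-square distribution that holds for an even number of degrees of freedom (58 here). The function name chi2_sf_even is hypothetical.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
        exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    df = 58  # degrees of freedom quoted in Fig. 2
    print("tail probability at 85.95:", chi2_sf_even(85.95, df))
    print("tail probability at 194.64:", chi2_sf_even(194.64, df))
```

The first value comes out at about 0.01, matching the threshold quoted in the text, and the second is many orders of magnitude smaller, in line with the observed difference lying far beyond all simulated differences.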
4- Verification of evolutionary origin of observed selection pattern
• When applying our analysis to shuffled alignments, the observed selection
pattern disappeared. For example, for shuffled snoRNA1 alignments we observed
mutation rates of 0.59, 0.86 and 0.59 for class I, II and III respectively.
• The observed evolutionary pattern remained when consensus structure
was determined using the structure-first strategy of RNACast & RNAforester
rather than the alignment-first strategy of Pfold, see data sets marked with *
in the above table.
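The shuffling control can be illustrated as follows. The poster does not say how the alignments were shuffled, so this sketch assumes one common choice: permuting whole alignment columns. That keeps the conservation of each column intact while decoupling the columns from the structure annotation. The function name shuffle_columns and the toy alignment are hypothetical.

```python
import random


def shuffle_columns(alignment, seed=None):
    """Return a copy of the alignment with its columns randomly permuted.

    `alignment` is a list of equal-length aligned sequences (strings).
    The same permutation is applied to every row, so each column stays intact.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return ["".join(row[c] for c in order) for row in alignment]


if __name__ == "__main__":
    # Toy alignment, made up for illustration only.
    toy = ["GGCAUUCGCC", "GGCAAUAGCC", "GACA-UCGUC"]
    for row in shuffle_columns(toy, seed=1):
        print(row)
```

Re-running the class analysis on such shuffled alignments, with the original structure annotation kept in place, gives a baseline in which any class-specific pattern should vanish if it is tied to the real stem positions.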
Conclusion
• Most ncRNAs follow a selection pattern whereby the penultimate base pair
is more conserved than the ultimate base pair.
• For ncRNA families, class labelling is a better model of homologous
ncRNA evolution than treating all stem base pairs the same.
I thank the 6th Framework program of the European Union for awarding me a travel fellowship.