PREDetector : Prokaryotic Regulatory Element Detector
Samuel Hiard1, Sébastien Rigali2, Séverine Colson2, Raphaël Marée1 and Louis Wehenkel1
1 Bioinformatics and Modeling, GIGA & Department of Electrical Engineering and Computer Science – University of Liège, Sart-Tilman B28, Liège, Belgium
2 Centre for Protein Engineering – University of Liège, Sart-Tilman B6, Liège, Belgium
Abstract
Background: In the post-genomic era, in silico prediction of regulatory networks is considered a powerful approach to deciphering and understanding biological pathways within prokaryotic
cells. The emergence of programs based on position weight matrices has made this approach more accessible. However, there was still strong demand for a tool that automatically estimates the reliability of the predictions and
allows users to extend predictions to genomic regions generally regarded as having no regulatory function.
Result: Here, we introduce PREDetector, a tool developed for predicting regulons of DNA-binding proteins in prokaryotic genomes that (i) automatically predicts, scores and positions potential
binding sites and their respective target genes, (ii) includes the downstream co-regulated genes, (iii) extends the predictions to coding sequences and terminator regions, (iv) saves private
matrices and allows predictions in other genomes, and (v) provides an easy way to estimate the reliability of the predictions.
Conclusion: We present, with PREDetector, an accurate prokaryotic regulon prediction tool that maximally answers biologists’ requests. PREDetector can be downloaded freely at
http://www.montefiore.ulg.ac.be/~hiard/predetectorfr.html
The weight matrix based approach
The sequences of transcription factor binding sites are usually slightly variable. A position weight matrix summarizes the information contained in an alignment of binding site sequences. It also makes it possible to predict the occurrence of new sites and to estimate their binding efficiency for a transcription factor.
The generation of a position weight matrix starts with the alignment of the experimentally validated DNA motifs of a specific transcription factor.
Multiple alignment
A C G T C
A C G G T
C C G C T
The multiple alignment is then converted into an
alignment matrix that represents how many times
nucleotide i was observed in position j of the alignment.
     1   2   3   4   5
A    2   0   0   0   0
C    1   3   0   1   1
G    0   0   3   1   0
T    0   0   0   1   2
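The counting step above can be sketched in a few lines of Python. This is only an illustration, not PREDetector's own code: the function name `alignment_matrix` is hypothetical, and the three sites are assumed to be ACGTC, ACGGT and CCGCT, which is the reading of the example alignment that reproduces the counts in the alignment matrix.

```python
from collections import Counter

ALPHABET = "ACGT"


def alignment_matrix(sites):
    """Count how many times each nucleotide occurs at each alignment position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    # zip(*sites) yields the alignment column by column.
    columns = [Counter(column) for column in zip(*sites)]
    return {nt: [column[nt] for column in columns] for nt in ALPHABET}


# Assumed example sites (hypothetical reading of the alignment in the text).
sites = ["ACGTC", "ACGGT", "CCGCT"]
counts = alignment_matrix(sites)
for nt in ALPHABET:
    print(nt, counts[nt])
```

Each row of the printed result corresponds to one row of the alignment matrix, with one count per position.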
The alignment matrix is then converted into a weight matrix via the formula:

$$\mathrm{weight}_{i,j} = \ln\left(\frac{(n_{i,j} + p_i)/(N + 1)}{p_i}\right)$$

where:
- n_{i,j} is the number of times nucleotide i is observed in position j
- N is the number of sequences in the set
- p_i is the expected frequency of nucleotide i in the genome, for instance 0.25 for each nucleotide in a genome with 50% GC content.

Weight matrix
        1       2       3       4       5
A    0.65   -1.39   -1.39   -1.39   -1.39
C    0.41    1.39   -1.39    0.41    0.41
G   -1.39   -1.39    1.39    0.41   -1.39
T   -1.39   -1.39   -1.39    0.08    0.65

The highest score in each column (shown in red in the original layout) is that of the best nucleotide at that position. The consensus sequence is ACG(C/G)T. The score of a sequence of length L is computed by summing, over its L positions, the weight of the nucleotide found at each position.

Why PREDetector ?
Our motivation for developing PREDetector came from our intensive use of similar, previously described programmes, such as Target Explorer (A. Sosinsky et al., 2003), PredictRegulon (S. Yellaboina et al., 2004), or Virtual Footprint (R. Munch et al., 2006), which were not suited to predicting some of our DNA binding sites that had been validated experimentally in vivo. The priority and challenge for PREDetector was to offer a programme that would both provide an easy way to estimate the reliability of the predictions and, beyond identifying highly reliable cis-acting elements, give users access to predicted sites whose scores are generally regarded as indicating no regulatory function because they fall below statistical reliability thresholds.

1. Weight matrix creation
The first step in PREDetector is the generation of a weight matrix from a set of experimentally validated binding sites. The weight matrix can be saved in the user's library and later used to scan different bacterial genomes.

2. Regulon Prediction
The search for potential binding sites of the regulatory protein starts with the selection of one of the saved weight matrices and the definition of the cut-off score. By default, the lowest score among the input sequences used to build a matrix is set as the recommended cut-off score for that matrix. Users can modify the cut-off score. PREDetector can scan either complete bacterial genomes available in the GenBank database or selected regions of them. Users can set the bounds of the so-called "regulatory regions" (the estimated maximal distances upstream and downstream of the translational start within which functional regulatory motifs could be found), as well as the bounds used for co-directionally transcribed genes.

3. Results
Once the options have been set, PREDetector scans the selected genome sequences and classifies the predicted target DNA motifs according to their localisation in the genome. A motif lies either in a coding sequence or in an intergenic sequence, and intergenic sequences are further classified as (1) regulatory regions (where regulatory elements are predicted to be found), (2) upstream regions (any region upstream of a translational start codon), and (3) terminator regions (in PREDetector, the term "terminator region" only denotes a region between two translational stop codons). Prediction results are distributed among these four genome localisation categories.

[Figure: genome map of orf 1 to orf 5 showing regulatory regions, a coding region, a terminator region, an upstream region and co-transcribed genes]

4. Prediction Reliability
One of the main advantages of PREDetector is that it lets the user estimate the reliability of the predictions. The large majority of naturally occurring transcription factor binding sites are located within intergenic regions and not within coding sequences. PREDetector reports how the predicted sites are distributed between these regions, so the user can estimate the scores at which strongly or weakly reliable sites will be found.
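As a minimal sketch of the weight-matrix steps described above, the following Python computes weights from the alignment matrix, scores a sequence by summing its per-position weights, and applies the default cut-off (the lowest score among the input sites). It is not PREDetector's code. The function names, the three input sites, the uniform background of 0.25, the toy sequence, and the pseudocount form of the weight formula, ln(((n + p)/(N + 1))/p), are assumptions made for illustration, and only one strand is scanned. The weights it produces need not reproduce the illustrative weight matrix in the text digit for digit.

```python
import math

ALPHABET = "ACGT"

# Alignment matrix from the text: counts per nucleotide for positions 1 to 5.
COUNTS = {
    "A": [2, 0, 0, 0, 0],
    "C": [1, 3, 0, 1, 1],
    "G": [0, 0, 3, 1, 0],
    "T": [0, 0, 0, 1, 2],
}
# Assumed input sites, chosen to be consistent with the counts above.
SITES = ["ACGTC", "ACGGT", "CCGCT"]
# Assumed uniform background (a genome with 50% GC content).
BACKGROUND = {nt: 0.25 for nt in ALPHABET}


def weight_matrix(counts, n_sites, background):
    """Assumed formula: weight[i][j] = ln(((n_ij + p_i) / (N + 1)) / p_i)."""
    return {
        nt: [
            math.log(((n + background[nt]) / (n_sites + 1)) / background[nt])
            for n in counts[nt]
        ]
        for nt in ALPHABET
    }


def score(weights, sequence):
    """Sum the weight of the nucleotide found at each position."""
    return sum(weights[nt][j] for j, nt in enumerate(sequence))


def default_cutoff(weights, sites):
    """Lowest score among the sites used to build the matrix."""
    return min(score(weights, site) for site in sites)


def scan(weights, sequence, cutoff):
    """Slide a window along one strand and keep windows scoring >= cutoff."""
    length = len(weights["A"])
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        if set(window) <= set(ALPHABET):  # skip windows with ambiguous bases
            window_score = score(weights, window)
            if window_score >= cutoff:
                hits.append((start, window, window_score))
    return hits


weights = weight_matrix(COUNTS, len(SITES), BACKGROUND)
cutoff = default_cutoff(weights, SITES)
hits = scan(weights, "TTACGGTAA", cutoff)  # toy sequence, not real genome data
for start, window, window_score in hits:
    print(f"{start}\t{window}\t{window_score:.2f}")
```

Lowering `cutoff` below the default returns more, weaker candidate sites, which corresponds to exploring predictions that fall under the usual reliability threshold.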
Conclusion
PREDetector is an accurate prokaryotic regulon
prediction tool that maximally answers biologists’
requests. Suggestions for improvements are welcome
(contact [email protected], [email protected]).