Transcript Document

A Method to Detect Gene Structure and Alternative Splice Sites by Agreeing ESTs to a Genomic Sequence
Paola Bonizzoni
Graziano Pesole*
Raffaella Rizzi
DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy
Supported by FIRB Bioinformatics: Genomics and Proteomics
15-20 september
WABI03
1
Outline




Gene structure and alternative splicing (AS)
Problem definition and algorithm
ASPic program
Experimental results and discussion
Mechanism of Splicing
[Figure: double-stranded DNA (5’–3’ and 3’–5’) is transcribed (TRANSCRIPTION) into pre-mRNA containing exon 1, exon 2 and exon 3; SPLICING by the spliceosome yields the mRNA (splicing product: exon 1, exon 2, exon 3), from which the EST, an Expressed Sequence Tag (cDNA), is obtained.]
Modes of Alternative Splicing
[Figure: a genomic sequence with exons 1, 2 and 3 separated by introns, and three splicing modes (first, second and third) that join different combinations of these exons.]
Modes of Alternative Splicing
[Figure: two further modes on exons 1, 2, 2b and 3, labelled "Competing 5’–3’" and "Exclusive exons" (1, 2b, 3).]
Why is AS important?
- AS occurs in 59% of human genes (Graveley, 2001)
- AS expands protein diversity (a single gene generates multiple transcripts)
- AS is tissue-specific (Graveley, 2001)
- AS is related to human diseases
Motivations
Regulation of AS is still an open problem
Tools are needed to:
- predict alternative splicing forms
- analyze the mechanism through a representation of splicing forms
What is available?
Fast programs that produce a single EST alignment to a genomic sequence:
- Spidey (Wheelan et al., 2001)
- Squall (Ogasawara & Morishita, 2002)
However, predicting the exon-intron gene structure is complicated because:
- sequencing errors in ESTs make it difficult to locate splice sites by alignment
- duplications and repeated sequences may produce more than one possible EST alignment
Open Problems



- Formal definition of the AS prediction problem …
- Combined analysis of the EST alignments related to the same gene, by agreeing the ESTs to a common exon-intron gene structure
- Optimization criteria
Formal Definitions

Def 1

Genomic sequence, G = I1 f1 I2 f2 I3 f3 … In fn In+1, where
Ii (i=1, 2, …, n+1) are introns and fi (i=1, 2, …, n) are
exons

Def 2


Exon factorization of G, GE = f1 f2 f3 … fn
Def 3

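EST factorization of an EST S compatible with GE is
S = s_1 s_2 … s_k s.t. there exist 1 ≤ i_1 < i_2 < … < i_k ≤ n with:
- s_t = f_{i_t} for t = 2, 3, …, k-1
- s_1 is a suffix of f_{i_1} and s_k is a prefix of f_{i_k}
Splice variant: s_t = suff(f_{i_t}) or s_t = pref(f_{i_t})
Allowing errors: edit(s_t, f_{i_t}) ≤ error for t = 2, 3, …, k-1,
edit(s_1, suff(f_{i_1})) ≤ error and edit(s_k, pref(f_{i_k})) ≤ error
(suff(f) and pref(f) denote a suffix and a prefix of f; edit is the edit distance)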
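Def 3 can be checked mechanically. The following is only a minimal sketch of the exact (error-free) version of the definition: the helper name `compatible_indices` and the representation of exons and factors as plain strings are assumptions made for illustration, not part of ASPic.

```python
from typing import List, Optional


def compatible_indices(factors: List[str], exons: List[str]) -> Optional[List[int]]:
    """Return 0-based exon indices i_1 < ... < i_k witnessing Def 3, or None.

    Hypothetical helper: factors are s_1..s_k, exons are f_1..f_n.
    """
    k = len(factors)

    def fits(t: int, exon: str) -> bool:
        s = factors[t]
        if 0 < t < k - 1:
            return s == exon                  # inner factors equal a whole exon
        ok = True
        if t == 0:
            ok = ok and exon.endswith(s)      # s_1 is a suffix of f_{i_1}
        if t == k - 1:
            ok = ok and exon.startswith(s)    # s_k is a prefix of f_{i_k}
        return ok

    def search(t: int, first: int) -> Optional[List[int]]:
        if t == k:
            return []
        for i in range(first, len(exons)):    # indices must strictly increase
            if fits(t, exons[i]):
                rest = search(t + 1, i + 1)
                if rest is not None:
                    return [i] + rest
        return None

    return search(0, 0)


exons = ["ACGT", "GGA", "TTC", "CAGT"]
print(compatible_indices(["GT", "TTC", "CA"], exons))  # [0, 2, 3]
print(compatible_indices(["GT", "GG", "CA"], exons))   # None
```

In the first call the EST skips the second exon, which is allowed because the indices only need to increase. The version with errors would replace the string comparisons with a bounded edit distance.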
The Problem
Input
- A genomic sequence G
- A set of EST sequences S = {S1, S2, …, Sn}
Output
An exon factorization GE of G (GE = f1, f2, …, fn) and a
set of EST factorizations compatible with GE
Objective: minimize n, the number of exons in GE
Example
EST set S = {S1, S2, S3}
[Figure: genomic sequence G with candidate exons A2, A1A2, B, C1, C1C2, D1 and D1D2, and the factorizations of the ESTs S1, S2 and S3 over them; the figure contrasts a factorization with 7 exons and one with 4.]
Results


- MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
- Heuristic algorithm:
  - iterative process that factorizes each EST
  - backtracking to recompute previous EST factors when they are not compatible with GE
The algorithm
Iterative jth step: partial EST factorization of Si (compute factor sij)
[Figure: the ESTs Si-1 (factors si-1 1, …, si-1 j-1, si-1 j, …, si-1 n) and Si (factors si1, …, si j-1, sij) placed on G, where they determine the exons e1, e2, …, em.]
if (Compatible(em, exon_list)) then
    add em to exon_list;
otherwise
    try to place sij elsewhere;
    if not possible then
        backtrack;
After placing all the factors sij for the set S, place the external factors;
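The place / retry / backtrack loop of this slide can be sketched as a small recursive search. This is only an illustration under stated assumptions: the slides do not define `Compatible`, so the test used here (a new exon must coincide with a listed exon or overlap none of them) is an assumption, as are the names `compatible` and `place_all` and the idea that each factor arrives with a precomputed list of candidate placements. The placement of external factors is left out.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # assumed representation: half-open [start, end) on G


def compatible(e: Interval, exon_list: List[Interval]) -> bool:
    """Assumed test: e equals a listed exon or overlaps none of them."""
    return all(e == f or e[1] <= f[0] or f[1] <= e[0] for f in exon_list)


def place_all(candidates: List[List[Interval]]) -> Optional[List[Interval]]:
    """candidates[j] holds the possible placements of the j-th factor.

    Returns the resulting exon list, or None if no consistent choice exists.
    """
    exon_list: List[Interval] = []

    def step(j: int) -> bool:
        if j == len(candidates):
            return True
        for e in candidates[j]:
            if not compatible(e, exon_list):
                continue                      # try to place the factor elsewhere
            is_new = e not in exon_list
            if is_new:
                exon_list.append(e)           # add e to exon_list
            if step(j + 1):
                return True
            if is_new:
                exon_list.remove(e)           # backtrack
        return False

    return sorted(exon_list) if step(0) else None


# The first choice (10, 20) clashes with (15, 30), so the search backtracks.
print(place_all([[(10, 20), (40, 50)], [(15, 30)], [(40, 50)]]))
# [(15, 30), (40, 50)]
```

Reusing an exon that is already in the list adds nothing to it, which is what keeps the number of exons small.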
The algorithm (more details)
Compute factor sij
[Figure: the EST Si with factors si1, …, si j-1, sij; the factor is split into components c1, …, c5, and the exon on G is delimited by the canonical ag and gt patterns.]
- sij can be divided into n components ck (k=1,2,…,n)
- At least one of these components is error-free and can be placed on G
- The algorithm searches for a perfect match of ck on G, for k from 1 to (n-1)
- Suppose that c1 has no perfect match on G
- Then the algorithm searches for a perfect match of c2 on G
- Suppose that c2 has a perfect match on G
- Find the rightmost canonical ag pattern on the left of c2 on G
- The entire factor sij can be placed on G such that the edit distance between sij y and the genomic substring from ag to gt is bounded
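The steps on this slide can be illustrated with a short sketch. It is not the ASPic implementation: the function names, the equal-length split into components, the use of only the first perfect match of each component, and the search window of `max_errors` positions around the expected exon end are all assumptions made to keep the example small.

```python
from typing import Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def place_factor(G: str, factor: str, n_components: int,
                 max_errors: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of an exon G[start:end] flanked by ag ... gt, or None."""
    size = max(1, len(factor) // n_components)
    for k in range(n_components):
        comp = factor[k * size:(k + 1) * size]
        if not comp:
            break
        p = G.find(comp)                 # perfect match of component c_k on G
        if p < 0:
            continue                     # c_k has no perfect match: try the next one
        a = G.rfind("ag", 0, p)          # rightmost ag on the left of c_k
        if a < 0:
            continue
        start = a + 2
        lo = max(start, start + len(factor) - max_errors)
        hi = min(start + len(factor) + max_errors, len(G) - 2)
        for end in range(lo, hi + 1):
            # the exon must be followed by gt and be close to the whole factor
            if G.startswith("gt", end) and edit_distance(factor, G[start:end]) <= max_errors:
                return start, end
    return None


exon = "ccttcattggcctcaa"
G = "ttttag" + exon + "gtaaaa"
factor = "cgttcattggcctcaa"              # the exon with one sequencing error
print(place_factor(G, factor, 4, 1))     # (6, 22)
```

Here the first component contains the error and has no perfect match, so the second component anchors the search, mirroring the c1 / c2 case described above.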
ASPic (Alternative Splicing PredICtion)
Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- A set (or cluster) of ESTs
Output
- A text file with all the EST alignments
- An HTML file with the exon factorization of the genomic sequence
ASPic data validation
Validation database: ASAP (Lee et al., 2003)
ASPic input:
- Genomic sequences from the ASAP database
- EST clusters of human chromosome 1 from the UniGene database
Experimental Results
[Table: for each genomic sequence (official gene name): ASAP introns, ASAP introns detected by ASPic, novel introns detected by ASPic, intron shifts detected.]
Execution times
Pentium IV, 1600 MHz, 256 MB RAM, running Linux
An example of data (gene HNRPR)
Positions are counted from 0 in ASPic and from 1 in ASAP
ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences
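Because ASPic counts positions from 0 and ASAP from 1, coordinates have to be shifted before the two outputs are compared. A minimal sketch, assuming both tools report inclusive start and end positions (the slides do not say); the function name is hypothetical.

```python
from typing import Tuple


def aspic_to_asap(start: int, end: int) -> Tuple[int, int]:
    """Shift a 0-based ASPic interval to 1-based ASAP numbering.

    Assumes both ends are inclusive in both tools.
    """
    return start + 1, end + 1


print(aspic_to_asap(2144, 5333))  # (2145, 5334)
```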
An example of data (gene HNRPR, intron 2144-5333)
[Figure: alignment listing with the labels: EST ID; genomic exons; EST exons; left and right ends of the two exons.]
WEB site
Project leaders:
Prof. Paola Bonizzoni
Prof. Graziano Pesole
Software design lead:
Raffaella Rizzi
Web site:
Graphical representation:
Gabriele Ravanelli
Francesco Perego
Anna Redondi
Francesca Rossin
Gianluca Dellavedova
Data analysis:
Other contributions:
THANK YOU!