Coordinated Laboratory for Computational Genomics

I. Programming Fundamentals
1. Problem Solving
2. Problem Specification
3. Top-down Design
4. Languages
5. Debugging/Performance Tuning
6. Testing
7. Maintenance
1. Problem Solving
(I. Programming Fundamentals)
• Since the late 1960s, the phrase “Problem Solving with Computers”
• The computer as a tool:
1. Understand Problem
2. Specification
3. Design
4. Implement
5. Test
6. Maintain
2. Problem Specification
(I. Programming Fundamentals)
• Can be informal, formal, or in between.
• A definition of Input/Output Relationships
• Uncovers and clarifies any ambiguous issues
• Involves interactions between end users and solution developers
• Ideally, produces a specification “document”
• Realistically, prototyping usually starts simultaneously with specification
3. Top-down Design
(I. Programming Fundamentals)
• Extremely important methodological philosophy.
• Develops a solution in successive phases of
decreasing levels of abstraction
• Any problem can have a solution described (at
some level of abstraction) on about a half sheet of
paper.
• Aliases: modular programming, stepwise
refinement, (object oriented design is this
philosophy with training wheels added)
4. Languages
(I. Programming Fundamentals)
• Choice of language can be important.
• Often, however, final choice of language
can be as much a matter of subjective,
personal choice, as is a type of paint brush
to an artist.
• Issues: acceptance (for maintenance),
performance, portability
4. Languages
(I. Programming Fundamentals)
• Language types:
– Procedural (C, Fortran, Pascal, Basic)
– Object-oriented (C++, Smalltalk)
– Functional, Declarative (LISP, Prolog)
– String processing (SNOBOL)
• Deployment technologies:
– Interpreted (Basic)
– Compiled (C, Fortran, most languages)
• Run-time systems
– Statically linked (real-time, older systems)
– Dynamically linked (most modern environments)
5. Debugging/Performance Tuning
(I. Programming Fundamentals)
• The most unpredictable phase of the process
• Not a matter of luck, however
• Scientific principles are critical
1. Formulate hypothesis
2. Perform an experiment, examine results
3. Make a single change
4. Repeat at step 1.
• “Crash” testing, and obvious error finding
• Debugging tools can assist (gdb)
6. Testing
(I. Programming Fundamentals)
• Goes beyond crash testing
• Need to develop test sets
• Functional testing
• Structural testing
• Must struggle with specification now
7. Maintenance
(I. Programming Fundamentals)
• The on-going, necessary update and
debugging of “finished” software
• This step never ends
• Often, earlier steps ignore this phase for the
sake of expediency
• Language choice, specification,
modularization, all bear on this step
II. Data Structures
• A practical “framework” for holding data.
• Must consider input, intermediate, and
possibly computed output data
• Impacts on:
– Development time
– Memory usage
– Performance (execution time)
– Maintainability
II. Data Structures (cont.)
• Scalar and array variables
• Static and dynamic structures
• Dense and sparse structures
• Linear and linked structures
• Lists, Stacks (LIFO), Queues (FIFO), Trees, Graphs, and Heaps
• Dynamic structure efficiency relies on OS interaction, and program “behavior”
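As a minimal sketch of the LIFO and FIFO behavior named above, the following uses Python's built-in list as a stack and `collections.deque` as a queue. The choice of containers and the three-base input are assumptions made for illustration; the slides do not prescribe an implementation.

```python
from collections import deque

# Stack (LIFO): the last item pushed is the first one popped.
stack = []
for base in "ACG":
    stack.append(base)                          # push onto the top
stack_order = [stack.pop() for _ in range(3)]   # pop from the top

# Queue (FIFO): the first item enqueued is the first one dequeued.
queue = deque()
for base in "ACG":
    queue.append(base)                              # enqueue at the tail
queue_order = [queue.popleft() for _ in range(3)]   # dequeue from the head

print("stack order:", stack_order)
print("queue order:", queue_order)
```

The stack returns the bases in reverse order of insertion, while the queue returns them in insertion order.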
III. Algorithms
• Control flow
• Template structures
• Complexity analysis
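III. Algorithms
(Control Flow)
• Sequential
• Alternation or selection
• Iteration or looping
[Figure: three flowcharts. Sequential: Statement 1 followed by Statement 2. Selection: a test (?) choosing between Statement 1 and Statement 2. Iteration: a test (?) guarding a Loop Body.]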
III. Algorithms
(Template Structures)
• Divide and Conquer
• Greedy
• Backtracking
• Branch and Bound
• Searching (depth first, breadth first)
• Dynamic Programming
III. Algorithms
(Complexity Analysis)
for i = 1 to 100
  for j = 1 to 50
    x[i] = a[i] + b[j]
The inner statement executes 50 x 100, or 5,000 times. If the
outer loop executed n times, and the inner one n
times, we would say that this “algorithm” had
complexity O(n^2).
In some sense, as the problem size n grows, the
execution time will grow as the square of n.
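The counting argument above can be checked directly. This is a small sketch, not part of the original slides; the function name `count_inner_executions` is hypothetical, and the counter increment stands in for the slide's statement `x[i] = a[i] + b[j]`.

```python
def count_inner_executions(outer, inner):
    """Count how many times the innermost statement of a doubly nested loop runs."""
    count = 0
    for _i in range(outer):
        for _j in range(inner):
            count += 1   # stands in for x[i] = a[i] + b[j]
    return count


# The loop bounds used on the slide.
print(count_inner_executions(100, 50))

# With both loops running n times, doubling n quadruples the work.
for n in (10, 20, 40):
    print(n, count_inner_executions(n, n))
```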
IV. Systems and Networks
[Figure: a Processor connected to Memory. Memory holds DATA (Scalar, Array) and PROGRAMS. Cycle: 1. Fetch Instruction, 2. Fetch Data, 3. Execute Instruction, 4. Store Data.]
IV. Systems and Networks (cont.)
[Figure: system layers. Labels: Tools and Applications; Libraries and Languages; Peripherals: Disks, etc.; Network; Operating System; CPU/Memory; Local Operating System.]
IV. Systems and Networks (cont.)
[Figure: several computers, each with CPU, Memory, and Disk, attached to a shared Network Medium.]
• Many Possible Media: Physical and Protocols
• Functional Variants: message passing, shared files, shared memory
• Security issues: protecting data, allowing sharing
• Heterogeneous Operating Systems
V. Tools and Scripts
• Tools:
– Debugging
– Performance Tuning
– Administration
• Scripting:
– Programs of “shell” commands
– “Glue” to allow other programs to work
together, and manipulate whole files (of
sequence, for example) as simple data objects
VI. Databases
• Pile o’ data
• Stored on large non-volatile media (e.g. disk
system), local vs. networked.
• Table Structures
• Primary key for each item
• Strength is “relational” query methods
– SQL – Structured Query Language
– “retrieve from table X where name like ‘Joe’ and age
equal 32”
– Insert, delete, update, etc.
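The informal query quoted above can be written in actual SQL. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name `people`, its columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# A table with a primary key for each item (hypothetical schema).
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Joe", 32), ("Joe", 45), ("Joanna", 32), ("Ann", 32)],
)

# "retrieve from table X where name like 'Joe' and age equal 32"
rows = conn.execute(
    "SELECT id, name, age FROM people WHERE name LIKE ? AND age = ?",
    ("Joe", 32),
).fetchall()
print(rows)

conn.close()
```

Without a wildcard such as `%`, `LIKE` only matches the whole string, so "Joanna" is not returned here.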
Introductory BCB Examples
• Bioinformatics:
– Sequence alignment and database search
– Gene discovery pipeline
– EST Clustering
• Computational Biology
– Gene Prediction
– Analysis of Low Complexity
Sequence Alignment and Database Search
(BioInformatics)
• Alignment-based
– Smith/Waterman
– Dynamic Programming
• Markov-model based
• Large Database issues
Sequence Alignment
• Nucleotide vs. amino acids
• Global vs. Local
• Pair-wise vs. multiple
• Simplest case:
– Global, Pair-wise
– Must match at both ends
Sequence Alignment Example
• Example:
S1: TTACTTGCC (9 bases)
S2: ATGACGAC (8 bases)
• Scoring (1 possibility):
+2 match
0 mismatch
-1 gap in either sequence
• One Possible alignment:
T T - A C T T G C C
A T G A C - - G A C
0 2-1 2 2-1-1 2 0 2
Score = 10 – 3 = 7
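The score above can be reproduced by walking the two aligned rows column by column. This is a minimal sketch using the scoring values from the slide (+2 match, 0 mismatch, -1 gap); the function name `score_alignment` is hypothetical.

```python
def score_alignment(row1, row2, match=2, mismatch=0, gap=-1):
    """Score two equal-length aligned rows, where '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # gap in either sequence
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment shown on the slide, with the spaces removed.
s1_row = "TT-ACTTGCC"
s2_row = "ATGAC--GAC"
print(score_alignment(s1_row, s2_row))
```

Five matching columns contribute 10 and three gap columns contribute -3, which gives the slide's total of 7.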
Cue to a Data Structure
[Figure: a grid cell with three possible moves: Gap in S2, Gap in S1, Alignment (match/mismatch).]
How hard can this be?
• Brute force approach: consider all possible
alignments, and choose the one with best score
• 3 choices at each internal branch point
• Assume n x n comparison: about 3^n paths
– n = 3 → 3^3 = 27 paths
– n = 20 → 3^20 ≈ 3.5 x 10^9 paths
– n = 200 → 3^200 ≈ 2.7 x 10^95 paths
• If 1 path takes 1 nanosecond (10^-9 secs)
– n = 200 takes 8.4 x 10^78 years!
• But, using data structures cleverly, this can be
greatly sped up to O(n^2)
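One way to reach O(n^2) is to fill a table of best scores for every pair of prefixes, so that each of the three choices (gap in one sequence, gap in the other, or match/mismatch) is examined once per cell. The sketch below uses the classic global dynamic-programming recurrence, often called Needleman-Wunsch; the slides do not say which formulation was used, so treat this as an assumption. It reuses the scoring from the earlier example (+2, 0, -1), and the function name is hypothetical.

```python
def global_alignment_score(s1, s2, match=2, mismatch=0, gap=-1):
    """Best global alignment score of s1 and s2 via an (n+1) x (m+1) table."""
    rows, cols = len(s1) + 1, len(s2) + 1
    best = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, rows):
        best[i][0] = i * gap
    for j in range(1, cols):
        best[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            diagonal = best[i - 1][j - 1] + pair   # align s1[i-1] with s2[j-1]
            up = best[i - 1][j] + gap              # s1[i-1] against a gap
            left = best[i][j - 1] + gap            # s2[j-1] against a gap
            best[i][j] = max(diagonal, up, left)

    return best[-1][-1]


print(global_alignment_score("TTACTTGCC", "ATGACGAC"))
```

The table has (n+1) x (m+1) cells and each cell takes constant work, which is where the quadratic bound comes from. The returned score is at least as good as the hand-built alignment shown earlier, since that alignment is one of the paths the table considers.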
Basics of Practical Alignment Algorithm
Example Sequences: AAAG, AGC
For large database searching, O(n^2) is impractical
Other Scoring Systems
EST Gene Discovery Pipeline
(BioInformatics)
EST Sequence Clustering
(BioInformatics)
• Goal: Group together expressed
sequence tags (ESTs) and full length
cDNA data into gene-based indices
– Sequences considered linked if similarity
score exceeds some threshold
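A minimal sketch of threshold-based linking follows, using a union-find structure so that linked sequences end up in the same cluster. The similarity measure used here (number of shared 4-base substrings), the threshold, and the sample reads are all assumptions for illustration; the slides do not specify how similarity is scored.

```python
def cluster(sequences, similarity, threshold):
    """Group sequences so that any pair scoring above threshold shares a cluster."""
    parent = {seq: seq for seq in sequences}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(sequences):
        for b in sequences[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)   # merge the two clusters

    groups = {}
    for seq in sequences:
        groups.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in groups.values())


def shared_kmers(a, b, k=4):
    """Hypothetical similarity score: number of distinct k-base substrings in common."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


reads = ["ACGTACGTAC", "GTACGTACGG", "TTTTGGGGCC", "GGGGCCAAAA"]
clusters = cluster(reads, shared_kmers, threshold=2)
print(clusters)
```

Because linking is transitive, two sequences can share a cluster without being directly similar, as long as a chain of linked sequences connects them.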
Data Flow
Basic Flow of Execution
Expanding on Step 4c
Hashing
• Generate unique integer for 8-base windows
$$H = \sum_{i=0}^{7} K_i \cdot 4^{i}, \qquad K_i = \begin{cases} 0 & \text{if } seq[i] = A \\ 1 & \text{if } seq[i] = C \\ 2 & \text{if } seq[i] = G \\ 3 & \text{if } seq[i] = T \end{cases}$$
Hash Example
Sequence: GCCACTTGGCGTTTTG
Hashes:
Hash 1: GCCACTTG = 48406
Hash 2: CCACTTGG = 44869
Hash 3: CACTTGGC = 27601
Hash 4: ACTTGGCG = 39668
Hash 5: CTTGGCGT = 59069
...etc.
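The window hash can be sketched in a few lines. The code below assumes each base is a base-4 digit (A=0, C=1, G=2, T=3) with the first base of the window as the least significant digit; that reading is consistent with the five example values listed above. The names `CODE`, `window_hash`, and `window_hashes` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_hash(window):
    """Base-4 value of a window, first base least significant."""
    return sum(CODE[base] * 4 ** i for i, base in enumerate(window))


def window_hashes(seq, width=8):
    """Hash every window of the given width, sliding one base at a time."""
    return [window_hash(seq[start:start + width])
            for start in range(len(seq) - width + 1)]


hashes = window_hashes("GCCACTTGGCGTTTTG")
print(hashes[:5])
```

With 8-base windows every hash falls in the range 0 to 4^8 - 1, so the value can index directly into a table of that size.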
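Global Hash Table Data Structure
[Figure: a table indexed by hash value, from 0 to 4^8 - 1. Each entry points to a linked list of clusters; for example, entry 2 holds the linked list of clusters that contain at least 1 hash with value 2.]
Fields of each cluster record:
– Cluster Representative
– Sequence Name
– Sequence
– Hashes
– Hash Indexes
– Touch Count
– ...
– Pointer To Next Cluster Member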
Gene Prediction
(Computational Biology)
• Contexts:
– Identifying full length transcripts
– Finding genes in genomic sequence
• Approaches
• Deployment Issues
Genome Architecture in a Nutshell
[Figure: annotated gene structure. Labels: Promoter, Transcription Start, 5’ UTR, Start codon, Codons, Exon, Donor site, Intron, Acceptor site, Stop Codon, 3’ UTR, Poly-A site.]
Sequence fragments shown in the figure:
GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG
CTCCCAGCCCTGCC
ATCCCCATGCCTGAGGGCCCCT
GCAGAAACAATAAAACCA
Preview of an HMM Model for
Gene Prediction
The Crux of Gene Prediction
[Figure: 5’ UTR (stops in all 3 frames), Kozak consensus, ATG, 1st Exon (no in-frame stops), GT, Intron, AG, Exon.]
Gene Prediction Approaches
• Ab initio methods:
– Profile Hidden Markov Models (GENSCAN, HMMgene)
– Neural Networks (GRAIL, Genie)
– Decision Trees (MORGAN)
• Issues:
– Seeding from training sets
– Fully general approaches?
• Interesting question:
– Can gene finding be done in a species-independent way?
Simple Dicty Gene Finder
(Intuition and an Example)
• Basic Idea (G. Klein) based on GC/AT content of Introns
vs. Exons
• Idealized Example: Count G/Cs and A/Ts in a window
size of 10 bases.
<EXON> <INTRON> <EXON>
…….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…...
[Figure: windowed counts along the sequence. AT content: 6, 10, 10, 10. GC content: 10, 10, 6, 2. Marked positions: Donor Site, Acceptor Site.]
Point where GC.left and AT.right are both maximized
Dicty Gene Finding Tool Model
• Model Parameters:
– W -- Window Size
– low -- threshold below which GC or AT content does not match hypothesis
– high -- threshold above which GC or AT content matches hypothesis
– m -- number of consecutive windows that will be examined
– n -- number of windows out of m that must exceed the threshold to qualify for an intron/exon or exon/intron transition
– tol -- maximum distance from the GC/AT content transition at which the GT or AG motif must be found
Dicty Gene Finding Tool Model
W = 8, m = 4, high = 7, low = 6
. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .
[Figure: windows 1 to 6 marked along the sequence, one labeled G/C=7, with the resulting counts n=3 and n=4.]
Dicty Intron/Gene Prediction Algorithm
1. Calculate AT (GC) content in size W windows
right and left of each base position.
2. Calculate n (AT count ≥ high, AT count ≤ low)
for each window of m bases to the left and right of
each base position.
3. For each position:
If ATleft_high ≥ n && ATright_low ≥ n → potential acceptor site
If ATleft_low ≥ n && ATright_high ≥ n → potential donor site
Dicty Intron/Gene Prediction Algorithm
(continued)
4. For each potential donor or acceptor site:
If the GT (donor) or AG (acceptor) motif is found
within tol bases distance, note this as an intron
boundary.
5. Sort boundaries into candidate introns.
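Steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the original tool: windows are assumed to slide one base at a time, a window counts as "high" when its AT count is at least `high` and as "low" when it is at most `low`, and the motif may start up to `tol` bases on either side of the position. The function names and the idealized test sequence are hypothetical, and step 5 (pairing boundaries into introns) is not covered.

```python
def at_count(window):
    """Number of A or T bases in a window."""
    return sum(1 for base in window if base in "AT")


def find_candidate_sites(seq, W, m, n, low, high, tol):
    """Return (kind, position) pairs for candidate intron boundaries."""
    seq = seq.upper()
    span = W + m - 1          # bases covered by m overlapping windows
    sites = []
    for p in range(span, len(seq) - span + 1):
        # Steps 1-2: AT counts of the m windows just left and just right of p,
        # then how many of them pass each threshold.
        left = [at_count(seq[p - W - k:p - k]) for k in range(m)]
        right = [at_count(seq[p + k:p + k + W]) for k in range(m)]
        left_high = sum(c >= high for c in left)
        left_low = sum(c <= low for c in left)
        right_high = sum(c >= high for c in right)
        right_low = sum(c <= low for c in right)

        # Step 4: the splice motif must start within tol bases of p.
        nearby = seq[max(0, p - tol):p + tol + 2]

        # Step 3: AT-poor exon on the left, AT-rich intron on the right.
        if left_low >= n and right_high >= n and "GT" in nearby:
            sites.append(("donor", p))
        # AT-rich intron on the left, AT-poor exon on the right.
        if left_high >= n and right_low >= n and "AG" in nearby:
            sites.append(("acceptor", p))
    return sites


# Idealized input: GC-only exons around an AT-only intron bounded by GT ... AG.
exon1 = "GCGCGGCCGC" * 2
intron = "GT" + "ATATTTATAT" * 3 + "AG"
exon2 = "GCGGCCGCGC" * 2
sites = find_candidate_sites(exon1 + intron + exon2,
                             W=6, m=4, n=3, low=2, high=5, tol=3)
print(sites)
```

On real sequence the thresholds matter much more than on this idealized input, which is why a search over parameter sets is needed.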
Test Data
>IIADP1D6358 Antiparallèle 811 bases
AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
taattttaattttttttttttttttttttttttttttttttttttttttt
ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
CGACCGAAGGC
•Probable Correct Introns:
+267 -341
+401 -687
Parameter Space to Search
• Ranges
– W -- 3 → 10 (8 values)
– high -- .7xW → W (4 values)
– low -- .5xW → .9xW (4 values)
– m -- 3 → 11 (9 values)
– n -- m/2 → m (4 values)
– tol -- 3-7 (5 values)
• 3584 x 5 ≈ 18,000 sets of parameters
• Search for sets that find all expected sites
with a minimum of false positives.
Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . About 18,000 more lines like this . . .
Test Data Raw Results
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18
Intron: 1 + 91
Intron: 2 + 236
Intron: 3 + 267
Intron: 4 + 385
Intron: 5 + 471
Intron: 6 + 799
+
+
+
-
213 - 213
241
267
399 - 399 - 467
759 - 759 - 797
799
len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29
Intron: 1 + 91
Intron: 2 + 219
Intron: 3 + 236
Intron: 4 + 267
Intron: 5 + 305
Intron: 6 + 341
Intron: 7 + 385
Intron: 8 + 429
Intron: 9 + 441
Intron: 10 + 471
Intron: 11 + 759
Intron: 12 + 799
+
+
-
213
223
241
267
312
341
399
433
467
753
759
799
- 213
- 335
- 399
- 797
len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13
Intron: 1 + 91 + 213 - 213 - 241
Intron: 2 + 267 + 399 - 399 - 467
Intron: 3 + 471 + 759 - 759 - 786
. . . About 18,000 sets of results like this. . .
Test Data Filtered Results
len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
This provides an initial set of parameters that are likely to be optimal
Best Parameter Set on Known Gene
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11
1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT
51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT
101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC
151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT
201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA
251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt
301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT
351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt
401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT
451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt
501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat
551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt
601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc
651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA
701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG
751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT
801: CGACCGAAGG C
Intron 1: + 213 - 241
Intron 2: + 267 - 341
Intron 3: + 385 + 399 - 433
Intron 4: + 471 - 687
overpredicted (45 bases)
UNDERPREDICTED (37 BASES)
correct + (325 bases)
CORRECT - (404 BASES)
Analysis of Unknown Gene
• Started with 21 reads from Michel Satre
(genomic and ESTs)
•Used phred to assemble them
•4 contigs found
•4th contig was longest (1759 bases)
•Used parameters from previous analysis
•Results for contig4 compared . . . . . . .
Contig4 Sampled Results
(a closer look)
W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709
W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=17
Intron 1: + 54
Intron 2: + 579
Intron 3: + 650
Intron 4: +1087
Intron 5: +1174
Intron 6: +1628
Intron 7: +1735
- 401
- 612
+ 782 -1043
-1164
+1350 -1350 -1504
-1709
len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=19
Intron 1: + 54
Intron 2: + 579
Intron 3: + 650
Intron 4: + 711
Intron 5: +1087
Intron 6: +1350
Intron 7: +1628
Intron 8: +1735
- 401
- 612
- 683
+ 782 -1043
+1164 -1164
-1350 -1504
-1709
len=1759 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=19
Intron 1: + 54
Intron 2: + 579
Intron 3: + 650
Intron 4: + 711
Intron 5: +1087
Intron 6: +1350
Intron 7: +1628
Intron 8: +1735
- 401
- 612
- 683
+ 782 -1043
+1164 -1164
-1350 -1504
-1709
Contig4 Results
len=1759 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=19
1: TGATAATAAC AATAATAACA ATAATAATAA TAATAATAAT AATATTAATA
51: ATTgtaataa taataatatt aataatgata ataataataa taataataat
101: gataatgata ataataatat taatactgtt gataatcatg atgatgatat
151: tataaataat ancaataatt ttaataaaaa tgaatatcca tcaagtaata
201: tatcaccaat atctccaaaa tcttcaatat caagttttcc aacaaattta
251: aataattcaa taaataatac aggttcaatg gtttcagatt ctttaagttc
301: ttgtagaaat tcgatttcct ctagttcaat tgattcaagt gttgcttcaa
351: ttcctataac aatacaatca atagattttg aagataagaa tattaaatca
401: gACCAATTTA AAATAATATC AAAATCAAAT ATAGAAAATA CAATTGAAAC
451: AAACCCAATA CCTCCATTCA ATCAAACCAA TAACCAATGT GAAGTTCAGT
501: TACAATCACA TTCTTTACCA ACAATTTTAA AACAACCACA TATTTATAAA
551: TCAAAATCAT TTTCTAGTAG TATCAATAgt aatagtaaaa ttaaaaaaat
601: taaaaaatca agATCATTTG AAATTGAATC AAAAATTAAT TTATTTGATg
651: taattaatca tatatattta aacctttcaa aagTTGGTAG TGAAGAACAA
701: AAAATAACCA gtatgtatta aattaacaaa tgattaatat attgttgtaa
751: aaatatatga aactaattta atattttaaa ggtgttttta aattatatga
801: tatatatgat aagggtttta tttcaagaga tgatttaaaa gaagtattaa
851: attatagaac taaacaaaat gggttaaaat ttcaagactt tacaatggaa
901: agtttgattg atcacatttt tcaacaattt gataaaaata tggatggata
951: tatagatttt gaagaaattt aaagtgaatt aacaattaac actggaaatt
1001: aagttaaaga aaaaggaaga aaatccaaat tatattttta aagaagAAAA
1051: TATTGGAATA TATACCGGAA AAAGAAAGTT TTCATAgttt aaaaagatat
1101: ttaaaaattg aaggatcaaa attatttttt atatctttat tttttattat
1151: aaattcaatt ttagTAATAA CAAGTTTTTT AAATGTTCAT GCAAATAATA
1201: AAAGAGCAAT AGAATTATTT GGCCCGGGTG TATATATAAC AAGAATTGCA
1251: GCTCAATTAA TTGAATTCAA TGCAGCAATA ATTTTAATGA CAATGTGTAA
1301: ACAATTATTT ACAATGATTA GAAATACCAA ATTTAAATTT TTATTTCCAG
1351: TTGATAAATA TATGACATTT CATAAATTAA TTGGTTATAC ATTAATCATT
1401: GCTTCATTTT TACATACTAT TGGTTGGATT GTTGGTATGG CAGTTGCNCC
1451: CGGTAAACCT GATAATATTT TTTATGATTG TTTAGCACCT CATTTTAAAT
1501: TTAGGCCACC AGTTTGGGAA ATGATTTTTA ACCGTTTACC AGGTGTAACA
1551: GGTTTCACTT ACCAAAATCA TTTTTAATAA TTATGGCAAT TTTATCTTTA
1601: AAAATTATTN GGAAATCTAA TTTNGNAgta agtttttttt tttaaaaaaa
1651: aaaaaaaaaa aaaaattaat tattttttat tatataattt tatagttatt
1701: ttattatagC CATCATTTAT TTATNGGATT TTATgtttna ttaattttac
1751: atgggacaa
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1046
Intron 5: +1087 -1164
Intron 6: +1628 -1709 (poor quality)
Further Intron Finding Options
• Exhaustive parsing of sequence
• 400 base sequence → 50 acceptor/donors
• 20 donor/acceptors → 5 minutes on P750/.5GB
• 24 donor/acceptors → 1 day
• 30 donor/acceptors → ~year
• Hybrid solution: rank top 20 d/a sites and parse
• Use protein/predicted gene homology to edit results
Gene Prediction:
Recognizing Initiation of Coding
[Figure: 5’ UTR (stops in all 3 frames), Kozak consensus, ATG, 1st Exon (no in-frame stops), GT, Intron, AG, Exon.]
High-level Outline
[Figure: classification flow. Labels: Kozak consensus (0 errors, 1 error, 2 errors); ATG/UTR heuristic; ATG position (L, M, R); CDS; UTR; stop ratio; frame shift check; stops upstream, ~E(stop); check ORF for frame shifts.]
Core Heuristic Components
226 Classes
• Kozak Existence and Fidelity
• ATG Heuristic:
Template (sIFl, sl, sFl) 5len : ATG : 3len (sIFr, sr, sFr)
Ideal ( 1, 3, 3) 125 : ATG : 300 ( 0, 6, 2)
• # Stops left of candidate ATG
• CDS: # Stops in minimum frame
• UTR Heuristic
• In frame stops to All stops Ratio
• # Frame shifts needed for perfect ORF
• Not Used:
• Codon or Hexamer Frequencies.
• Known protein starting motifs.
Verification and Testing
• Generation of sets of known CDS “reads” (12,826),
known ATG “reads” (13,672), and
known UTR “reads” (1,035)
Run Classifier against all three sets:
• Identify classes with highest CDS to ATG differential & UTR vs. CDS/ATG
• Grade A:
K0E.ATG.L.pSL.ORFr0F or 1FS
K0E.ATG.L.npSL.ORFr0FS or 1FS
K1E.ATG.L.pSL.ORF0FS or 1FS
K1E.ATG.L.npSL.ORF0FS or 1FS
KG1E.ATG.L.pSL.ORF0FS or 1FS
KG1E.ATG.L.npSL.ORF0FS or 1FS
• Grade B: Same as A, but with ATG in Middle 1/2
• Grade C: zSL for K0E only and ATG in L, M, or R
• UTR Class
Accuracy and Yield of Classes
• ATG True Positive (of 13,672):
– Grade A: 867 - 6.3%
– Grade B: 3,742 - 27.3%
– UTR: 82 - 0.6%
Total: 34.3% (4,691)
• CDS False Positive (of 12,826):
– Grade A: 3 - 0.02%
– Grade B: 753 - 5.9%
– UTR: 1,725 - 13.5%
Total: 19.3% (2,481)
• UTR True Positive (of 1,035):
– 691 - 66.8%
            ATG   UTR
Yield       34%   67%
Confidence  95%   87%
Fin