Coordinated Laboratory for Computational Genomics


What is the problem?
• Very large databases
• Unrefined datasets
– Whole genomes in draft form
• Pairwise searching
– Alignment: O(n^2) for each sequence in the database
– BLAST: a tool that searches with “hashes” to speed this up.
• Basic idea is that if you have a sequence from a
“related” gene, then you can find new genes:
– Copies of genes in same species
– Same gene in different species
• The problem is that single instances may not
represent the diversity that can be biologically
interesting.
1
Central dogma of Genome Function
2
Hidden Markov Models
(Basic Concepts)
• Goal: Construct a model which can be built
from a multiple sequence alignment (i.e., a
training dataset) that will score future
sequences with their degree of similarity to
the set of training sequences.
• Note: Fundamentally different from
BLAST, with its universal substitution
matrices.
3
PAM-250 Matrix
4
Hidden Markov Models
(Basic Concepts)
• Uses notion of a prior probability (Bayesian
Statistics) to reverse roles of observation and
expectation
• E.g., in random sequence, P(A) = P(C) = P(G) =
P(T) = 0.25. These are prior probabilities.
• Now assume that, in a training data set, a ‘G’ was seen
to follow an ‘AT’ 30% of the time. We would say that
P(G|AT) = 0.3, yet P(G) is still 0.25 overall.
5
HMMs: Start Codon Recognition
State 1 (A)    State 2 (T)    State 3 (G)
A: 0.91        A: 0.03        A: 0.03
C: 0.03        C: 0.03        C: 0.03
G: 0.03        G: 0.03        G: 0.91
T: 0.03        T: 0.91        T: 0.03
• Above: A “state machine/model” for outputting sequences. It would output
various sequences with varying probabilities:
ATG → .91 x .91 x .91 = .7536
ATT → .91 x .91 x .03 = .0248
TAG → .03 x .03 x .91 = .000819
• What are these? P(ATT|M), where M is the model.
• But what we want is P(M|ATT), i.e., the probability that we are looking at a real
start codon, given that we have seen ‘ATT’.
• Subtle, but very important difference.
6
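The three products on the start-codon slide can be reproduced with a short Python sketch. The names `START_CODON_MODEL` and `prob_given_model` are illustrative only; the data are the per-position probabilities shown on the slide.

```python
# Per-position emission probabilities, copied from the start-codon slide.
# The names used here are illustrative, not part of any library.
START_CODON_MODEL = [
    {"A": 0.91, "C": 0.03, "G": 0.03, "T": 0.03},  # state 1: mostly A
    {"A": 0.03, "C": 0.03, "G": 0.03, "T": 0.91},  # state 2: mostly T
    {"A": 0.03, "C": 0.03, "G": 0.91, "T": 0.03},  # state 3: mostly G
]


def prob_given_model(seq, model=START_CODON_MODEL):
    """Return P(seq | M): the product of one emission probability per state."""
    if len(seq) != len(model):
        raise ValueError("sequence length must equal the number of states")
    p = 1.0
    for base, emissions in zip(seq, model):
        p *= emissions[base]
    return p


if __name__ == "__main__":
    for codon in ("ATG", "ATT", "TAG"):
        print(codon, round(prob_given_model(codon), 6))
```

Because every state has unconditional transitions here, only emission probabilities enter the product.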
HMMs: Bayes Rule and
Key Derivation
• Bayes Rule:
P(A|B) x P(B) = P(B|A) x P(A)
• Rearranged:
P(A|B) = (P(B|A) x P(A)) / P(B)
• Let A be M, and B be the observed sequence,
e.g. ATT from our codon example
P(M|ATT) = (.0248 x P(M)) / (0.25 x 0.25 x 0.25)
= 1.587 x P(M)
Note: the denominator P(ATT) uses the random-sequence prior of 0.25 per base.
P(M) is a constant, so it falls out of all comparisons between scores
of sequences.
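The rearranged Bayes rule amounts to dividing P(sequence|M) by the probability of the same sequence under the random-sequence prior. A minimal sketch, assuming a uniform background of 0.25 per base as on the slide; the function names are hypothetical:

```python
import math

BACKGROUND = 0.25  # P(A) = P(C) = P(G) = P(T) in random sequence


def odds_vs_random(p_seq_given_model, length, background=BACKGROUND):
    """Return P(seq | M) / P(seq), i.e. P(M | seq) divided by the constant P(M)."""
    return p_seq_given_model / background ** length


def log_odds(p_seq_given_model, length, background=BACKGROUND):
    """Natural log of the odds ratio; easier to compare across sequences."""
    return math.log(odds_vs_random(p_seq_given_model, length, background))


if __name__ == "__main__":
    # Values taken from the codon example: ATG scores far above ATT.
    print(odds_vs_random(0.7536, 3), odds_vs_random(0.0248, 3))
```

Since P(M) is the same for every sequence scored against one model, ranking by this ratio is the same as ranking by P(M|sequence).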
7
Profile HMMs
• Example alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG
• Regular expression: [AT][CG][AC][ACGT]*A[TG][GC]
• This regular expression (RE) captures many
sequences, including the ones above. However, it
does not distinguish the unlikely TGCTAGG from the consensus-like ACACATC.
8
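The limitation of the regular expression is easy to see by running it. The sketch below uses Python's standard `re` module; `PROFILE_RE` and `matches` are illustrative names.

```python
import re

# The regular expression from the slide, anchored to the whole sequence.
PROFILE_RE = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# The example alignment; gap characters are removed before matching.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]


def matches(seq):
    """True if the ungapped sequence is accepted by the profile RE."""
    return PROFILE_RE.fullmatch(seq.replace("-", "")) is not None


if __name__ == "__main__":
    for seq in ALIGNMENT + ["TGCTAGG", "ACACATC"]:
        print(seq, matches(seq))
```

Every sequence is simply accepted or rejected, so TGCTAGG and ACACATC are indistinguishable to the RE; the HMM replaces this yes/no answer with a probability.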
HMMs: Building a Model
• Rules:
– One state for each “clear” position, or for each term in the RE.
– Insert states for Kleene closure terms in the RE.
– State probabilities computed from state “populations”.
– Transition probabilities must sum to 1.0.
• Starting out…
– The [AT] term in the previous example has 80% As, and 20% Ts.
– Transition to the next “state” is unconditional.
[AT] state     [CG] state     [AC] state
A: 0.8         A: 0.0         A: 0.8
C: 0.0         C: 0.8         C: 0.2
G: 0.0         G: 0.2         G: 0.0
T: 0.2         T: 0.0         T: 0.0
Transitions: [AT] → [CG] = 1.0; [CG] → [AC] = 1.0; ...
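The state “populations” are just column counts. A minimal sketch for the first three alignment columns; `column_emissions` is a hypothetical helper, and no pseudo-counts are applied:

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]


def column_emissions(alignment, col, alphabet="ACGT"):
    """Relative frequency of each base in one alignment column (gaps ignored)."""
    counts = Counter(row[col] for row in alignment if row[col] != "-")
    total = sum(counts.values())
    return {base: counts[base] / total for base in alphabet}


if __name__ == "__main__":
    for col, name in enumerate(["[AT]", "[CG]", "[AC]"]):
        print(name, column_emissions(ALIGNMENT, col))
```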
9
HMMs: Building a Model
• Continuing . . .
– If states must split, transition probabilities must
reflect the probabilities of going to the insert
state, versus bypassing the insert state
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG
3 sequences lead to the insert state → 3/5 = 0.6
2 sequences bypass the insert state → 2/5 = 0.4
[AT] state     [CG] state     [AC] state     Insert state
A: 0.8         A: 0.0         A: 0.8         A: -
C: 0.0         C: 0.8         C: 0.2         C: -
G: 0.0         G: 0.2         G: 0.0         G: -
T: 0.2         T: 0.0         T: 0.0         T: -
Transitions: [AT] → [CG] = 1.0; [CG] → [AC] = 1.0; [AC] → insert state = 0.6; [AC] → next state (bypass) = 0.4
10
HMMs: Building a Model
• Probabilities of symbols on insert state
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG
Total of 5 symbols: 2 C’s (0.4), 1 G (0.2), 1 A (0.2), 1 T (0.2)
• Probabilities of transitions
leaving insert state
– After arriving in insert state, 2 insertions remain: 2/5 = 0.4
– Otherwise, we leave this state: 1 - 2/5 = 0.6
[AC] state     [ACGT]* state     [A] state
A: 0.8         A: 0.2            A: 1.0
C: 0.2         C: 0.4            C: 0.0
G: 0.0         G: 0.2            G: 0.0
T: 0.0         T: 0.2            T: 0.0
Transitions: [AC] → insert = 0.6; insert → insert = 0.4; insert → [A] = 0.6; [AC] → [A] = 0.4
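The insert-state numbers follow from counting the inserted symbols in alignment columns 4 to 6. The sketch below is one way to do that count; `insert_state_stats` and `INSERT_COLS` are illustrative names, and the column range is read off the example alignment.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]
INSERT_COLS = range(3, 6)  # 0-based columns covered by the [ACGT]* term


def insert_state_stats(alignment, cols=INSERT_COLS):
    """Return (emissions, p_enter, p_stay) for the insert state."""
    inserts = ["".join(row[c] for c in cols).replace("-", "") for row in alignment]
    symbols = Counter("".join(inserts))
    total = sum(symbols.values())              # 5 inserted symbols in all
    entering = sum(1 for s in inserts if s)    # sequences that use the insert state
    emissions = {b: symbols[b] / total for b in "ACGT"}
    p_enter = entering / len(alignment)        # 3/5
    p_stay = (total - entering) / total        # symbols followed by another insert: 2/5
    return emissions, p_enter, p_stay


if __name__ == "__main__":
    print(insert_state_stats(ALIGNMENT))
```

The probability of leaving the insert state is `1 - p_stay`, and the bypass probability is `1 - p_enter`, so each state's outgoing transitions sum to 1.0.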
11
Example HMM Derivation
12
HMMs: Example Sequence Scoring
• P(ACACATC|M) =
0.8 x 1.0 x 0.8 x 1.0 x 0.8 x 0.6 x 0.4 x 0.6 x 1.0 x 1.0 x 0.8 x 1.0 x 0.8
(state probabilities alternate with transition probabilities)
= 0.04718 = 4.7 x 10^-2
• P(M|ACACATC) =
((4.7 x 10^-2) / (0.25)^7) x P(M) = (7.7 x 10^2) x P(M)
Log-odds = ln(7.7 x 10^2) = 6.65
log2(7.7 x 10^2) = 9.6
This number is a “score”: it measures how strongly seeing this sequence
implies that the model applies.
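The score can be checked by walking the path through the model. This is a sketch under stated assumptions: the state names are invented for illustration, and the emission values for the last two states (T: 0.8, C: 0.8) are taken from the factors in the slide's product because no table is shown for them. Counting the last alignment column as printed would give C: 0.6, so that value in particular should be treated as an assumption.

```python
import math

# Emission probabilities per state (only the symbols needed are listed).
EMIT = {
    "AT": {"A": 0.8, "T": 0.2},
    "CG": {"C": 0.8, "G": 0.2},
    "AC": {"A": 0.8, "C": 0.2},
    "INS": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "A": {"A": 1.0},
    "TG": {"T": 0.8},   # from the slide's product
    "GC": {"C": 0.8},   # from the slide's product (assumption, see lead-in)
}
TRANS = {
    ("AT", "CG"): 1.0, ("CG", "AC"): 1.0,
    ("AC", "INS"): 0.6, ("AC", "A"): 0.4,
    ("INS", "INS"): 0.4, ("INS", "A"): 0.6,
    ("A", "TG"): 1.0, ("TG", "GC"): 1.0,
}


def path_probability(seq, path):
    """P(seq | M) along one explicit state path."""
    if len(seq) != len(path):
        raise ValueError("one state per symbol is required")
    p = EMIT[path[0]].get(seq[0], 0.0)
    for i in range(1, len(seq)):
        p *= TRANS.get((path[i - 1], path[i]), 0.0)
        p *= EMIT[path[i]].get(seq[i], 0.0)
    return p


def log_odds(p, length, background=0.25, base=math.e):
    """Log of P(seq | M) over the random-sequence probability."""
    return math.log(p / background ** length, base)


if __name__ == "__main__":
    p = path_probability("ACACATC", ["AT", "CG", "AC", "INS", "A", "TG", "GC"])
    print(p, log_odds(p, 7), log_odds(p, 7, base=2))
```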
13
HMM Scoring of Sequences
14
Log-odds HMM Model Example
15
HMM Profile Model Structure
16
Example Alignment (SH3 domain)
17
Example HMM Profile Model
(No synthetic pseudo-counts)
18
HMM Model Example with
Pseudo-count of 1
19
Gene Prediction with HMMs
• HMMs can be used for “predicting”, in
genomic sequence, where genes are encoded.
• Models can be built from sets of known genes
– Promoters
– Start of coding (start codon)
– Intron/exon splice sites
– Stop codon
– Polyadenylation site
20
Genome Architecture Primer
[Figure: annotated gene structure. Labels: Promoter, Transcription Start, 5’ UTR, Start codon, Codons, Exon, Donor site, Intron, Acceptor site, Stop Codon, 3’ UTR, Poly-A site]
Sequence fragments shown:
GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG
CTCCCAGCCCTGCC
ATCCCCATGCCTGAGGGCCCCT
GCAGAAACAATAAAACCA
21
Comprehensive HMM Model for
Unspliced Genes
22
Coding Region Model
23
Intron Modeling
24
Gene Prediction Approaches
• Ab initio methods:
– Profile Hidden Markov Models (GENSCAN, HMMgene)
– Neural Networks (GRAIL, Genie)
– Decision Trees (MORGAN)
• Issues:
– Seeding from training sets
– Fully general approaches?
• Interesting question:
– Can gene finding be done in a species-independent way?
25
Gene Prediction:
Recognizing Initiation of Coding
[Figure: start of a gene. Labels: 5’ UTR, 1st Exon, ATG, Kozak Consensus, Stops in all 3 frames, No in-frame stops, GT, Exon, AG, Intron, Exon]
26
Classifier Outline
[Figure: classifier decision flow. Labels in source order:]
– Consensus Kozak: 0 errors, 1 error, ≥ 2 errors
– ATG/UTR Heuristic
– ATG position: L, M, R
– CDS: stop ratio; frame shift check
– Stops upstream
– UTR
– ~E(stop)
– Check ORF for frame shifts
27
Classifier Heuristic Components
226 Classes
• Kozak Existence and Fidelity
• ATG Heuristic:
Template: (sIFl, sl, sFl) 5len : ATG : 3len (sIFr, sr, sFr)
Ideal:    (1, 3, 3) 125 : ATG : 300 (0, 6, 2)
• # Stops left of candidate ATG
• CDS: # Stops in minimum frame
• UTR Heuristic
• In frame stops to All stops Ratio
• # Frame shifts needed for perfect ORF
• Not Used:
• Codon or Hexamer Frequencies.
• Known protein starting motifs.
28
Verification and Testing
• Generation of sets of known CDS “reads” (12,826),
known ATG “reads” (13,672), and
known UTR “reads” (1,035)
• Run classifier against all three sets:
• Identify classes with highest CDS to ATG differential & UTR vs. CDS/ATG
• Grade A:
K0E.ATG.L.pSL.ORFr0F or 1FS
K0E.ATG.L.npSL.ORFr0FS or 1FS
K1E.ATG.L.pSL.ORF0FS or 1FS
K1E.ATG.L.npSL.ORF0FS or 1FS
KG1E.ATG.L.pSL.ORF0FS or 1FS
KG1E.ATG.L.npSL.ORF0FS or 1FS
• Grade B: Same as A, but with ATG in Middle 1/2
• Grade C: zSL for K0E only and ATG in L, M, or R
• UTR Class
29
Accuracy and Yield of Classes
• ATG True Positive (of 13,672):
– Grade A: 867 - 6.3%
– Grade B: 3,742 - 27.3%
– UTR: 82 - 0.6%
– Total: 34.3% (4,691)
• CDS False Positive (of 12,826):
– Grade A: 3 - 0.02%
– Grade B: 753 - 5.9%
– UTR: 1,725 - 13.5%
– Total: 19.3% (2,481)
• UTR True Positive (of 1,035):
– 691 - 66.8%
• Summary (ATG / UTR):
– Yield: 34% / 67%
– Confidence: 95% / 87%
• Notes:
– The yield estimate is conservative due to variable fidelity of mRNA source.
30
Gene Prediction:
Finding Intron Boundaries
[Figure: exon/intron structure. Labels: Consensus, Stops in all 3 frames, No in-frame stops, GT, Exon, AG, Intron, Exon]
31
Simple Dicty Gene Finder
(Intuition and an Example)
• Basic Idea (G. Klein) based on GC/AT content of
Intron vs. Exons
• Idealized Example: Count G/Cs and A/Ts in a
window size of 10 bases.
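[Figure: idealized example; labels <EXON>, <INTRON>, <EXON> over the sequence below, with Donor Site and Acceptor Site marked]
…….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…...
Window counts shown: AT content 6, 10, 10, 10; GC content 10, 10, 6, 2
Point where GC.left and AT.right are both maximized (an exon-to-intron transition)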
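The idealized example can be tried directly: slide a boundary along the sequence and add the GC count of the window to its left to the AT count of the window to its right. The function names are hypothetical and the sequence is the left half of the slide's example.

```python
def count_in(window, bases):
    """Number of characters in window that belong to bases (case-insensitive)."""
    return sum(1 for b in window.upper() if b in bases)


def boundary_score(seq, pos, w):
    """GC count in the w bases left of pos plus AT count in the w bases right of pos."""
    return count_in(seq[pos - w:pos], "GC") + count_in(seq[pos:pos + w], "AT")


def best_exon_to_intron(seq, w=10):
    """Position (0-based) where GC.left and AT.right are jointly largest."""
    candidates = range(w, len(seq) - w + 1)  # both windows must fit
    return max(candidates, key=lambda p: boundary_score(seq, p, w))


if __name__ == "__main__":
    example = "CGCGGGCGCCGTATTTATATATTATA"
    pos = best_exon_to_intron(example)
    print(pos, example[:pos], "|", example[pos:])
```

Swapping the two base sets gives the mirror-image score for an intron-to-exon transition.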
32
Dicty Gene Finding Tool Model
• Model Parameters:
– W -- Window Size
– low -- threshold below which GC or AT
content does not match hypothesis
– high -- threshold above which GC or AT
content matches hypothesis
– m -- number of consecutive windows
that will be examined
– n -- number of windows out of m
that must exceed the threshold to qualify for an
intron/exon or exon/intron transition
– tol -- maximum distance from the GC/AT
content transition at which the GT or
AG motif must be found
33
Dicty Gene Finding Tool Model
W = 8, m = 4, high = 7, low = 6
[Figure: windows 1 to 6 laid over the sequence below; labels shown: G/C=7, n=3, n=4]
. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .
34
Dicty Intron/Gene
Prediction Algorithm
1. Calculate AT (GC) content in size W
windows right and left of each base position.
2. Calculate n, the number of windows with
AT count ≥ high and with AT count ≤ low,
for each window of m bases to the left and
right of each base position.
3. For each position:
If ATleft_high ≥ n && ATright_low ≥ n
→ potential acceptor site
If ATleft_low ≥ n && ATright_high ≥ n
→ potential donor site
35
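Steps 1 to 3 can be sketched as follows. This is one plausible reading of the slide, not the tool's actual code: the m windows on each side are taken to be the size-W windows that end at (left) or start at (right) the m positions nearest the candidate boundary, and the comparisons are assumed to be inclusive.

```python
def at_count(window):
    """Number of A/T bases in a window (case-insensitive)."""
    return sum(1 for b in window.upper() if b in "AT")


def find_transitions(seq, w, m, n, high, low):
    """Return (donors, acceptors) as lists of 0-based candidate positions."""
    donors, acceptors = [], []
    # Only positions where all m windows on both sides fit inside seq.
    for p in range(w + m - 1, len(seq) - w - m + 2):
        left = [at_count(seq[p - k - w:p - k]) for k in range(m)]
        right = [at_count(seq[p + k:p + k + w]) for k in range(m)]
        left_high = sum(c >= high for c in left)
        left_low = sum(c <= low for c in left)
        right_high = sum(c >= high for c in right)
        right_low = sum(c <= low for c in right)
        if left_low >= n and right_high >= n:
            donors.append(p)     # AT-poor exon on the left, AT-rich intron on the right
        if left_high >= n and right_low >= n:
            acceptors.append(p)  # AT-rich intron on the left, AT-poor exon on the right
    return donors, acceptors


if __name__ == "__main__":
    # Synthetic test sequence: exon, intron, exon.
    toy = "GC" * 15 + "AT" * 20 + "GC" * 15
    print(find_transitions(toy, w=6, m=4, n=3, high=5, low=1))
```

Several adjacent positions usually qualify around each true boundary, which is why the next step narrows them down with the GT/AG motifs.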
Dicty Intron/Gene
Prediction Algorithm
(continued)
4. For each potential donor or acceptor site:
If the GT (donor) or AG (acceptor) motif is
found within tol bases, note this as
an intron boundary.
5. Sort boundaries into candidate introns.
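Steps 4 and 5 can be sketched as a bounded motif search followed by a simple pairing. Both helpers are hypothetical; in particular, pairing each donor with the first acceptor downstream is an assumption, since the slide does not say how boundaries are sorted into introns.

```python
def nearest_motif(seq, pos, motif, tol):
    """Start index of the motif occurrence closest to pos, searching at most
    tol bases either side; None if there is no occurrence in range."""
    seq = seq.upper()
    first = max(0, pos - tol)
    last = min(len(seq) - len(motif), pos + tol)
    hits = [i for i in range(first, last + 1) if seq.startswith(motif, i)]
    return min(hits, key=lambda i: abs(i - pos)) if hits else None


def candidate_introns(donor_hits, acceptor_hits):
    """Pair each confirmed donor with the first confirmed acceptor after it."""
    introns = []
    acceptors = sorted(acceptor_hits)
    for d in sorted(donor_hits):
        later = [a for a in acceptors if a > d]
        if later:
            introns.append((d, later[0]))
    return introns


if __name__ == "__main__":
    toy = "GCGCGCG" + "GT" + "ATATATAT" + "AG" + "GCGCGC"
    donor = nearest_motif(toy, 9, "GT", tol=3)
    acceptor = nearest_motif(toy, 16, "AG", tol=3)
    print(candidate_introns([donor], [acceptor]))
```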
36
Test Data
>IIADP1D6358 Antiparallèle 811 bases
AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
taattttaattttttttttttttttttttttttttttttttttttttttt
ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
CGACCGAAGGC
•Probable Correct Introns:
+267 -341
+401 -687
37
Parameter Space to Search
• Ranges
– W -- 3 → 10 (8 values)
– high -- .7xW → W (4 values)
– low -- .5xW → .9xW (4 values)
– m -- 3 → 11 (9 values)
– n -- m/2 → m (4 values)
– tol -- 3 → 7 (5 values)
• 3584 x 5 ≈ 18,000 sets of parameters
• Search for sets that find all expected sites
with a minimum of false positives.
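A parameter sweep over these ranges can be generated with nested loops. This is a sketch only: the slide does not say how the fractional bounds (.7xW, .5xW, .9xW, m/2) are rounded, so the rounding below is an assumption and the number of sets it yields need not match the slide's figure.

```python
import math
from itertools import product


def parameter_sets():
    """Yield one dict per parameter combination within the slide's ranges."""
    for w in range(3, 11):                                       # W: 3 to 10
        highs = range(math.ceil(0.7 * w), w + 1)                 # .7xW to W
        lows = range(math.ceil(0.5 * w), math.floor(0.9 * w) + 1)  # .5xW to .9xW
        for high, low, m, tol in product(highs, lows, range(3, 12), range(3, 8)):
            for n in range(math.ceil(m / 2), m + 1):             # m/2 to m
                yield {"W": w, "high": high, "low": low, "m": m, "n": n, "tol": tol}


if __name__ == "__main__":
    sets = list(parameter_sets())
    print(len(sets), "parameter sets; first:", sets[0])
```

Each set would then be run against the test sequence and kept only if all known sites are found.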
38
Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . About 18,000 more lines like this . . .
39
Test Data Raw Results
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18
Intron:
Intron:
Intron:
Intron:
Intron:
Intron:
1
2
3
4
5
6
+
+
+
+
+
+
91
236
267
385
471
799
+
+
+
-
213 - 213
241
267
399 - 399 - 467
759 - 759 - 797
799
len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29
Intron: 1 + 91
Intron: 2 + 219
Intron: 3 + 236
Intron: 4 + 267
Intron: 5 + 305
Intron: 6 + 341
Intron: 7 + 385
Intron: 8 + 429
Intron: 9 + 441
Intron: 10 + 471
Intron: 11 + 759
Intron: 12 + 799
+
+
-
213
223
241
267
312
341
399
433
467
753
759
799
- 213
- 335
- 399
- 797
len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13
Intron: 1 + 91 + 213 - 213 - 241
Intron: 2 + 267 + 399 - 399 - 467
Intron: 3 + 471 + 759 - 759 - 786
. . . About 18,000 sets of results like this. . .
40
Test Data Filtered Results
len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
This provides an initial set of parameters that are likely to be optimal.
41
Analysis of a Known Gene
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11
1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT
51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT
101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC
151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT
201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA
251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt
301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT
351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt
401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT
451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt
501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat
551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt
601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc
651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA
701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG
751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT
801: CGACCGAAGG C
90% correct prediction
Intron 1: + 213 - 241
overpredicted (45 bases)
Intron 2: + 267 - 341
UNDERPREDICTED (37 BASES)
Intron 3: + 385 + 399 - 433
correct + (325 bases)
Intron 4: + 471 - 687
CORRECT - (404 BASES)
42
Analysis of Unknown Gene
• Started with 21 reads from
•Used phred to assemble them
•4 contigs found
•4th contig was longest (1759 bases)
•Used parameters from previous analysis
•Results for contig4 compared . . . . . . .
43
Contig4 Sampled Results
(a closer look)
W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709

W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709

len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709

len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709

len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=17
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 + 782 -1043
Intron 4: +1087 -1164
Intron 5: +1174 +1350 -1350 -1504
Intron 6: +1628 -1709
Intron 7: +1735

len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=19
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1043
Intron 5: +1087 +1164 -1164
Intron 6: +1350 -1350 -1504
Intron 7: +1628 -1709
Intron 8: +1735

len=1759 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=19
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1043
Intron 5: +1087 +1164 -1164
Intron 6: +1350 -1350 -1504
Intron 7: +1628 -1709
Intron 8: +1735
44
Contig4 Results
len=1759 W=6 n=5
1: TGATAATAAC
51: ATTgtaataa
101: gataatgata
151: tataaataat
201: tatcaccaat
251: aataattcaa
301: ttgtagaaat
351: ttcctataac
401: gACCAATTTA
451: AAACCCAATA
501: TACAATCACA
551: TCAAAATCAT
601: taaaaaatca
651: taattaatca
701: AAAATAACCA
751: aaatatatga
801: tatatatgat
851: attatagaac
901: agtttgattg
951: tatagatttt
.
.
.
m=10 thrL=5 thrH=6, Tol=4, Sites
AATAATAACA ATAATAATAA TAATAATAAT
taataatatt aataatgata ataataataa
ataataatat taatactgtt gataatcatg
ancaataatt ttaataaaaa tgaatatcca
atctccaaaa tcttcaatat caagttttcc
taaataatac aggttcaatg gtttcagatt
tcgatttcct ctagttcaat tgattcaagt
aatacaatca atagattttg aagataagaa
AAATAATATC AAAATCAAAT ATAGAAAATA
CCTCCATTCA ATCAAACCAA TAACCAATGT
TTCTTTACCA ACAATTTTAA AACAACCACA
TTTCTAGTAG TATCAATAgt aatagtaaaa
agATCATTTG AAATTGAATC AAAAATTAAT
tatatattta aacctttcaa aagTTGGTAG
gtatgtatta aattaacaaa tgattaatat
aactaattta atattttaaa ggtgttttta
aagggtttta tttcaagaga tgatttaaaa
taaacaaaat gggttaaaat ttcaagactt
atcacatttt tcaacaattt gataaaaata
gaagaaattt aaagtgaatt aacaattaac
Found=19
AATATTAATA
taataataat
atgatgatat
tcaagtaata
aacaaattta
ctttaagttc
gttgcttcaa
tattaaatca
CAATTGAAAC
GAAGTTCAGT
TATTTATAAA
ttaaaaaaat
TTATTTGATg
TGAAGAACAA
attgttgtaa
aattatatga
gaagtattaa
tacaatggaa
tggatggata
actggaaatt
.
.
1001:
1051:
1101:
1151:
1201:
1251:
1301:
1351:
1401:
1451:
1501:
1551:
1601:
1651:
1701:
1751:
Intron
Intron
Intron
Intron
Intron
Intron
.
aagttaaaga
TATTGGAATA
ttaaaaattg
aaattcaatt
AAAGAGCAAT
GCTCAATTAA
ACAATTATTT
TTGATAAATA
GCTTCATTTT
CGGTAAACCT
TTAGGCCACC
GGTTTCACTT
AAAATTATTN
aaaaaaaaaa
ttattatagC
atgggacaa
1:
2:
3:
4:
5:
6:
+ 54
+ 579
+ 650
+ 711
+1087
+1628
aaaaggaaga
TATACCGGAA
aaggatcaaa
ttagTAATAA
AGAATTATTT
TTGAATTCAA
ACAATGATTA
TATGACATTT
TACATACTAT
GATAATATTT
AGTTTGGGAA
ACCAAAATCA
GGAAATCTAA
aaaaattaat
CATCATTTAT
aaatccaaat
AAAGAAAGTT
attatttttt
CAAGTTTTTT
GGCCCGGGTG
TGCAGCAATA
GAAATACCAA
CATAAATTAA
TGGTTGGATT
TTTATGATTG
ATGATTTTTA
TTTTTAATAA
TTTNGNAgta
tattttttat
TTATNGGATT
tatattttta
TTCATAgttt
atatctttat
AAATGTTCAT
TATATATAAC
ATTTTAATGA
ATTTAAATTT
TTGGTTATAC
GTTGGTATGG
TTTAGCACCT
ACCGTTTACC
TTATGGCAAT
agtttttttt
tatataattt
TTATgtttna
aagaagAAAA
aaaaagatat
tttttattat
GCAAATAATA
AAGAATTGCA
CAATGTGTAA
TTATTTCCAG
ATTAATCATT
CAGTTGCNCC
CATTTTAAAT
AGGTGTAACA
TTTATCTTTA
tttaaaaaaa
tatagttatt
ttaattttac
- 401
- 612
- 683
+ 782 -1046
-1164
-1709 (poor quality)
45
Further Intron Finding Options
• Exhaustive parsing of sequence
• 400 base sequence → 50 acceptor/donors
• 20 donor/acceptors → 5 minutes on P750/.5GB
• 24 donor/acceptors → 1 day
• 30 donor/acceptors → ~year
• Hybrid solution: rank top 20 d/a sites and parse
• Use protein/predicted gene homology to edit results
46
Domain Finding with HMMs
• Basic Elements of Method
• Example from Defensin Genes
47
Antimicrobial Proteins and Peptides
Lysozyme, lactoferrin, SLPI, PLA2, SP-A, SP-D, LL37,
BPI, α- and β-defensins, inorganics, immunoglobulins
[Figure labels: Macrophages, ? Defensins, CCL20, T cells, DCs]
48
Functions of defensins
• Comprise an ever-ready shield at mucosal surfaces
• Antimicrobial effects: disrupt bacterial cell walls, sequester
nutrients, act as decoys for microbial attachment, enhance
phagocytosis
• Prevent attachment, colonization or infection
• Constitutive and/or inducible expression
• Cross-talk to adaptive immune system
• Synergy or additivity among factors
• Alterations in these properties may contribute to disease
49
Genomics Approach to Defensin Gene
Discovery - Rationale
• Defensin gene discovery in humans has generally
proceeded from identification of the protein
• All known defensin genes in humans cluster to a <1
Mb region on 8p22-p23
• It is likely that not all defensin genes are known
• Hypothesis: Novel defensins in the gene cluster can be
found using a computational genomics-based strategy
50
Structure of mature β-defensin peptides
Disulfide bonds: C1-C5, C2-C4, C3-C6
Structural elements labeled: T1, β1, T2, T3, β2, β-loop, β-bulge, β3
GAL3       TQCRIRGGFC RVGSCRFPHI AIGKCATFIS -CCGRAYEV(+20)
DEFB3      YYCRVRGGRC AVLSCLPKEE QIGKCSTRGR KCCRRKK
DEFB1      YNCVSSGGQC LYSACPIFTK IQGTCYRGKA KCCK
BNBD12     LSCGRNGGVC IPIRCPVPMR QIGTCFGRPV KCCRSW
DEFB2      VTCLKSGAIC HPVFCPRRYK QIGTCGLPGT KCCKKP
EP2E       TICRMQQGIC RLFFCHSGEK KRDICSDPWN RCCVSNTDE(+14)
Consensus  hSC+xxxGhC hhhxCPxxx+ QIGTCxxxxh +CC+
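The six-cysteine framework can be summarized by the number of residues between consecutive cysteines, which is what the “C pattern” column on a later slide records (for example 6 4 9 6 for DEFB1). A minimal sketch, with `cysteine_spacing` as a hypothetical helper and the peptide strings taken from the table above with spaces removed:

```python
def cysteine_spacing(peptide):
    """Number of residues between each pair of consecutive cysteines."""
    positions = [i for i, aa in enumerate(peptide.upper()) if aa == "C"]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    peptides = {
        "DEFB1": "YNCVSSGGQCLYSACPIFTKIQGTCYRGKAKCCK",
        "DEFB2": "VTCLKSGAICHPVFCPRRYKQIGTCGLPGTKCCKKP",
        "DEFB3": "YYCRVRGGRCAVLSCLPKEEQIGKCSTRGRKCCRRKK",
    }
    for name, seq in peptides.items():
        print(name, cysteine_spacing(seq))
```

The final value is 0 because the last two cysteines are adjacent (the CC motif); the first four values are the ones listed in the C pattern column.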
51
Structure of leader sequence
of β-defensin proteins
EP2C       MRQRLLPSVTSLLLVALLFPGSS
EP2E       MKVFFLFAVLFCLVQTNSGDVPP
TAP        MRLHHLLLALLFLVLSAWSGFTQ
Defb4      MRIHYLLFTFLLVLLSPLAAFTQ
GAL1       MRIVYLLLPFILLLAQGAAGSSQ
DEFB2      MRVLYLLFSFLFIFLMPLPGVFG
DEFB1      MRTSYLLLFTLCLLLSEMASGGN
Consensus  MRhxxLLhhhhhhhhhxxxxxxx
52
Genome approach for discovering
β-defensin genes
[Figure: search flow starting from the known genes. Labels in source order: BLAST, HTGS, Markov, Celera, BACs, GA-contigs]
Known genes:
– HUMAN: DEFB1, DEFB2
– MOUSE: Defb1, Defb2, Defb3, Defb4, Defb5
HTGS: DEFB1, DEFB2, DEFB3, EP2D, DEFB4 to DEFB9, EP2C, EP2D, Defbp1, DEFB10 to DEFB29
Celera: Defb1 to Defb22, Defb28, Defb31, Defb32, Defbp1
BACs: same 33 entries as the HTGS list
GA-contigs: Defb1 to Defb22, Defb28, Defb31, Defb32, Defbp1, Defb23 to Defb27, Defb29, Defb30, Defb33, Defbp2, Defbp3
Counts shown: 36 (entries in the GA-contigs list), 33 (entries in the BACs list)
53
Chromosomal localization of β-defensin genes
Human 6p11-p21 / Mouse 1
Human 8p23 / Mouse 8
Human 20q11 / Mouse 2
54
[Figure: physical map of the defensin gene cluster on human chromosome 8p, drawn from telomere (TEL) to centromere (CEN), with scales in cR, cM, and Mb and a 10 kb inset]
Genes labeled: EP2C, HE2b1/EP2D, EP2E, HE2/EP2, DEFB1, DEFB2, DEFB3, DEFA1/3, DEFA4, DEFA5, DEFA6, DEFA7
Markers labeled: D8S542, GCT10E01, D8S1469, D8S503, D8S1825, D8S351, D8S277, D8S1511, D8S561, D8S1819/D8S439, D8S1706, A004x20, D8S1099, D8S1742, WI-4625
Clones labeled: 115c21, 179c23, 16g12, 44n19, 2541m15, 397k22, 372k15, 24f4, 207i3, 2629i16, 633e22**, 540n10**, 561b17**, 332a23**, 877e9, 415d8, 211c9, 458d3, 3023L14, 398f12, 398f10, 399g23, 556o5, 540e4, 776f21, 351i21, 177k12*, 18L2, 295j18*, 62h7, 449o20*, 429b7, 8o7, 10a14, 497j4*, 115j16, 324n11*, 375n15
55
Synteny between human 8p and mouse 8
Chromosome 8p22-p23 (human)
BAC 295j18
BAC 324n11
Chromosome 8 (mouse)
GA_x5J8B7W6WMR
56
Synteny between human 6p21 and mouse 1
Chromosome 6p21 (human)
BAC RP11-397g17
Chromosome 1 (mouse)
GA_x5J8B7W3NRM
57
Synteny between human 20q11 and mouse 2
Chromosome 20q11.1 (human)
BAC RP5-854e16
BAC RP5-1018d12
BAC RP5-1093g12
Chromosome 2 (mouse)
GA_x5J8B7W3FJ8
58
Human and mouse β-defensin alignment (all 69 genes)
EP2d
_c 8
EP2e
_c 8
EPm2d _c 8
DEFB5 _c 8
Defbm12_c 8
Defbm13_c 8
DEFB11 _c 6
Defbm17_c 1
DEFB12 _c 6
DEFB14 _c 6
EP2c
_c 8
EPm2c _c 8
DEFB10 _c 6
Defbm16_c 1
DEFB13 _c 6
Defbm18_c 1
Defbm28
DEFB9 _c 8
DEFB27 _c20
DEFB17 _c20
Defbm19_c
2
DEFB18 _c20
Defbm21_c 2
DEFB20 _c20
DEFB4 _c 8
DEFB1 _c 8
Defbm1 _c 8
Defbm7 _c 8
Defbm8 _c 8
Defbm2 _c 8
Defbm31
Defbm9 _c 8
Defbm10_c 8
Defbm15_c 8
Defbm3 _c 8
Defbm4 _c 8
Defbm6 _c 8
DEFB2 _c 8
Defbm5 _c
8
DEFB3 _c 8
Defbm14_c 8
DEFB16 _c20
Defbm29
DEFB8 _c 8
DEFB29 _c20
Defbm23_c 2
DEFB28 _c20
Defbm20_c 2
DEFB15 _c20
Defbm32
DEFB25 _c20
Defbm26_c 2
DEFB24 _c20
Defbm25_c 2
DEFB6 _c 8
Defbm11_c 8
Defbm30
DEFB21 _c20
DEFB19 _c20
Defbm24_c 2
DEFB22 _c20
Defbm27_c 2
DEFB23 _c20
TI
TI
TV
ES
ET
FL
RE
KE
KS
DR
VD
VN
-ER
RE
HK
RT
GH
KK
KS
KA
KK
KR
VE
RI
YN
YK
TR
AR
DH
-ER
VS
RA
VS
IT
VT
VT
VS
YY
FF
NP
IA
EI
RR
KR
KK
-R
RR
KL
QK
-K
KR
KR
EK
EK
DT
MK
LR
LQ
ET
ER
QR
C RMQ--Q G I C RLF-F
C RMQ--Q G I C RLF-F
C LMQ--Q G H C RLF-M
C KLG--R G K C RK--E
C RLG--R G K C RR--T
C KKM--N G Q C EA--E
C RIG--N G Q C KN--Q
C KMR--R G H C KL--Q
C TAI--G G R C KN--Q
C TKR--Y G R C KR--D
C RRS--E G F C QE--Y
C KKS--E G Q C QE--Y
C EKV--R G I C KT--F
C EKV--R G M C KT--V
C QLV--R G A C KP--E
C SLV--R G T C KS--E
C FYG--L G K C RR--I
C LNL--S G V C RRD-V
C WNNYVQ G H C RK--I
C WII--K G H C RK--N
C WVL--R G H C RK--H
C WNR--S G H C RK--Q
C LKI--L G H C RR--H
C W-M--D G H C RL--L
C GYG--TAR C RK--K
C VSS--G G Q C LYS-A
C LQH--G G F C LRS-S
C YKF--G G F C HYN-I
C YKF--G G F C YNS-M
C HTN--G G Y C VRA-I
C RSW-- G T C SIAAI
C HKK--G G Y C YF--Y
C IRN--G G I C Q-Y-R
C YRE--G G E C --L-R
C LRK--G G R C WN--R
C MTN--GAI C WG--P
C MSY--G G S C QR--S
C LKS--GAI C HPV-F
CC MI--G G I C RY--L
C RVR--G G R C AVL-S
C RIR--G G R C AVL-N
C ELY--Q G M C RN--A
C ELY--Q G L C RN--A
C ERP--N G S C RD--F
C LMG--L G R C RD--H
C LVG--F G K C KD--S
C FNK-VT G Y C RK--K
C FSN-VE G Y C RK--K
C YYG--T G R C RK--S
C LDQ--KDT C PDSRT
C WKN-NV G H C RR--R
C WKN-SL G Y C RV--R
C WKG--Q G A C QT--Y
C WNG--Q G A C RT--F
C NKL--K G T C KN--N
C SRV--N G R C TA--S
C WKL--K G I C RN--T
C WGK--S G R C RT--T
C MGN--S G I C RA--S
C MGN--R G F C RS--S
C WNF--R G S C RD--E
C WKS--F G V C RE--E
C WNL--Y G K C RY--R
C HSG E KKRDI C SDPWNR CC V SNT
C HSG E KKRDI C SDPWNR CC V SNT
C RSG E RKGDI C SDPWNR CC V PYS
C LEN E KPD G N C RL-NFL CC R QRI
C IES E KIA G W C KL-NFF CC R ERI
C FTF E QK I G T C QA-NFL CC R KRC HEN E IR I AY C IRPGTH CC L QQC SEK E LR I SF C IRPGTH CC ---C DDS E FR I SY C ARPTTH CC V --C LES E KQ I DI C SLPRKI CC ---C NYM E TQV G Y C SKKKDA CC L H-C NFM E TQV G Y C SKKKEP CC L H-C DDV E YDY G Y C IKWRSQ CC V --C DID E YDY G Y C IRWRNQ CC I --C NSW E YVYYY C N--VNP CC ---C NSW E YKYNY C H--TEP CC V VRE
C RAN E KKKER C GE-RTF CC L RET
C KVV E DQ I G A C RR-RMK CC R AWW
C RVN E VPEAL C EN-GRY CC L NIK
C KPG E QVKKP C KN-GDY CC I PSN
C RSG E RVRKP C SN-GDY CC ---C KDG E AVKDT C KN-LRA CC I PSN
C KDG E MDHGS C KY-YRV CC V PDL
C KDG E DS I IR C RN-RKR CC V PSR
C RSQ E YR I G R C PN-TYA CC L RKC PIFTKIQ G T C YRGKAK CC K --C PSNTKLQ G T C KPDKPN CC K S-C PGNSRFMSN C HPENLR CC K NIK
C PPHTKF I G N C HPDHLH CC I NMK
C PPSARRP G S C FPEKNP CC K YMK
C FDSLSRR G Q C GPVKDP CC PL-C FSSHKK I G S C FPEWPR CC K NIK
C IGLRHK I G T C GSP-FK CC K --C IGLFHK I G T C NFR-FK CC K FQC IGNTRQ I G S C GVPFLK CC K RKC PTAFRQ I G N C GHFKVR CC K IRC NGSFRLG G H C GHPKIR CC R RKC PRRYKQ I G T C GLPGTK CC K KPC KGNILQNGN C GVTSLN CC K RKC LPK E EQ I G K C STRGRK CC R RKK
C LGK E EQ I G R C SNSGRK CC R KKK
C REY E IQYLT C PN-DQK CC L KLS
C QKY E IQYLS C PK-TRK CC L KYC LET E IHV G R C LN-SRP CC L PLG
C NVD E KE I QK C KM-KK- CC V GPK
C LAD E TQMQH C KA-KK- CC I GPK
C KVG E RYEIG C LS-GKL CC ANDE
C RLV E ISEMG C LH-GKY CC ---C KEI E RKKEK C GE-KHI CC V PKE
C LEGTQ---P C HPHHPN CC ESSC LDT E RY I LL C RN-KLS CC I SII
C QEE E RY I YL C KN-KVS CC I HRT
C TRQ E TYMHL C PD-ASL CC L SYA
C TRQ E TFMHL C PD-ASL CC L SYS
C GKN E EL I AL C QK-SLK CC R TIQ
C LKN E ELVAL C QK-NLK CC V TVQ
C QKE E IYHIF C G-IQSL CC L EKK
C KES E VYYIL C KT-EAK CC V DPK
C KKN E QPYLY C RN- C QS CC L QSY
C KKS E QAYFY C RT-FQM CC L QSY
C LKN E RVYVF C VS-GKL CC L KPK
C AKK E SFYIF C WN-GKL CC V KPK
C SKK E RVYVY C IN-NKM CC V KPK
59
ESTs provide sequence for exon 1
Chromosome 8 cluster
Gene Name
Exon 1
DEFB1
MRTSYLLLFTLCLLLSEMASGGN
DEFB7
MKIFFFILAALILLAQIFQG
DEFB5
DEFB6
MRTFLFLFAVLFFLTP
DEFB4 (Forssman)
MQRLVLLLAISLLLYQDLPG
EP2c
MRQRLLPSVTSLLLVALLFPG
EP2d/HE2b1
MRQRLLPSVTSLLLVALLFPGSS
EP2e
MKVFFLFAVLFCLVQTNSGDVPP
DEFB3
MRIHYLLFALLFLFLVPVPG
DEFB2
MRVLYLLFSFLFIFLMPLPG
DEFBp 1
YLLFSFRFVFLMPLP
DEFB8
DEFB9
aa sequence (exon 2)
xxxxxxxxxFLTGLGHRSDHYN
CVSSGGQ CLYSA CPIFTKIQGT CYRGKAK
xxLKTN CFLYLARTAIHRALISKRMEGH
CEAE- CLTFEVKIGG CRAELAPF
TFPGKLPQQLFLGTGEFAV
CES CKLGRGK CRKE- CLENEKPDGN CRLNFLxxxxxxxxxxxxxAKNAFFDEK
CNKLKGT CKNN- CGKNEELIAL CQKSLKxxxxxxxxxxYLVRSEFELDRI
CGYGTAR CRKK- CRSQEYRIGR CPNTYAxxxxxxxxxxxxEPASDLKVVD
CRRSEGF CQEY- CNYMETQVGY CSKKKDA
xxxxxxxxxxxxxxxxxxxxTI
CRMQQGI CRLFF CHSGEKKRDI CSDPWNR
xxxxxxxxxxxxxxxxxxxxTI
CRMQQGI CRLFF CHSGEKKRDI CSDPWNR
xxxxxxxxxGHGGIINTLQKYY
CRVRGGR CAVLS CLPKEEQIGK CSTRGRK
xxxxxxxxxxxGVFGGIGDPVT
CLKSGAI CHPVF CPRRYKQIGT CGLPGTK
xxxxxxxxxxxxxxxxxxxxxxxxx*RCV
CVLNV CSTSLKQIGTYGHDRIK
xxxxxxxxxxxLHVAKGKFKEI
CERPNGS CRDF- CLETEIHVGR CLNSRPxxxxxxxxxxxxxGGLGPAEGH
CLNLSGV CRRDV CKVVEDQIGA CRRRMK-
Chromosome 6 cluster
Gene Name
Exon 1
xxxxxxxxxxxxxxxxxxxFER
xxxxxxxxxxxxxxxxxDLRRE
xxxxxxxxxxxxxxxxxxxxWKS
xxxxxxxxxxxxxxxxxxxKRE
xxxxxxxxxxxxxTCTLVNADR
DEFB1 0
DEFB1 1
DEFB1 2
DEFB1 3
DEFB1 4
aa sequence (exon 2)
CEKVRGI CKTF- CDDVEYDYGY CIKWRSQ CC V
CRIGNGQ CKNQ- CHENEIRIAY CIRPGTH CC LQQ
CTAIGGR CKNQ- CDDSEFRISY CARPTTH CC VTECDP
CQLVRGA CKPE- CNSWEYVYYY CNVNP-- CC AVWE
CTKRYGR CKRD- CLESEKQIDI CSLPRKI CC TEKL
Chromosome 20 Cluster
Gene Name
Exon 1
DEFB1 5
DEFB1 6
DEFB1 7
DEFB1 8
DEFB1 9
DEFB2 0
MKLLLLALPMLVSYPKZSQ
MKLLYLFLAILLAIEEPVIS
May share with 20-5
DEFB2 1
DEFB2 2
DEFB2 3
maybe 3 exons
MKLLLLTLTVLLLLSQLTP
DEFB2 4
DEFB2 5
DEFB2 6
DEFB2 7
MKSLLFTLAVFMLLAQLVS
MGLFMIIAILLFQKPT
DEFB2 8
DEFB2 9
MKLLFPIFASLMLQYQVNT
C pattern
EST
Expression
CC K
6 4 9 6
ai688359,ai688522,
epithelia,
ai733355,ai334993,
kidney
CC KNRKKH
21 3 9 7
no ESTs
CC RQRI
6 3 9 5
no ESTs
CC RTIQPCGSIID
6 3 9 5
aw103145, ai910580
lung/testis EST
CC LRK
6 3 9 5
no ESTs
CC LH
6 3 9 6
1st exon aa778602,
epithelia
aa400545
CC VSNTDE
6 4 9 6
aa778602
testis
CC VSNTDE
6 4 9 6
2nd exon aa176631,
testis be044355, ai018805
CC RRKK
6 4 9 6
epithelia
CC KKP
6 4 9 6
bf08889, bf088086,
epithelia,
be714509
head/neck
CC KK
pseudogene
CC LPLGHQPRIESTTPKKD
6 3 9 5
aa406058
possible EST testis
CC RAWWILMSIPTPLIMSDYQEPLKPKLK
6 4 9 5
aw383156
head_neck
xxxxxxxxxxxGWIRR CYYGTGR
GLFRSHNGKSREPWNP CELYQGM
xxxxxxxxxxxxSQKS CWIIKGH
xxxxxxxxxxxxGEKK CWNRSGH
xxxxxxxxxxKRHILR CMGNSGI
xxxxxxxxxxxxVKSVE
CWMDGH
xxxxxxxxxxxxxxMK CWGKSGR
xxxxxxxxxxxxRIET CWNFRGS
xxxxxxxxxxxxGTQR CWNLYGK
xxxxxxxxxxxxEFKR CWKGQGA
xxxxxxxxxxFEPQK CWKNNVGH
xxxxxxxxxxNWYVKK CLNDVGI
xxxxxxxxTEQLKK CWNNYVQGH
xxxxxxxxxxxxLKK CFNKVTGY
xxxxxxxxxxxxxxRR CLMGLGR
C
6
6
6
6
6
pattern
3 9 6
3 9 6
3 9 6
3 9 4
3 9 6
no
no
no
no
no
EST
ESTs
ESTs
ESTs
ESTs
ESTs
Exon 2
CRKS CKEIERKKEK CGEKHI CC VPKEKDKLSHIHDQKETSELYI
6 3 9 5
no ESTs
CRNA CREYEIQYLT CPNDQK CC LKLSVKITSSKNVKEDYDSNSNLSVTNSSSYSHI
6 3 9 5
no ESTs
CRKN CKPGEQVKKP CKNGDY CC IPSNTDS
6 3 9 5
no ESTs
CRKQ CKDGEAVKDT CKNLRA CC IPSNEDHRRVPATSPTPLSDSTPGIIDDILTVRFTTDYFEVSSKKDMVEESEAGRGTETSLPNVHH
6 3 9 5
AA335178, AI220434
Epididymis, Pooled NFL
CRAS CKKNEQPYLY CRNCQS CC LQSYMRISISGKEENTDWSYEKQWPRLP
6 3 9 5
AA939044, AW193716,
Pooled NFL,
AI Pooled
807541,Germ
aa4366
Ce
CRLL CKDGEDSIIR CRNRKR CC VPSRYLTIQPVTIHGILGWTTPQMSTTAPKMKTNITNR
5 3 9 5
AW070283, AA834919,
Pooled NFL,
H92063
Testis, Retina
CRTT CKESEVYYIL CKTEAK CC VDPKYVPVKPKLTDTNTSLESTSAV
6 3 9 5
AI476463
Pooled NFL
CRDE CLKNERVYVF CVSGKL CC LKPKDQPHLPQHIKN
6 3 9 5
AI989655, AW236570
Pooled Germ Cell Tumors
CRYR CSKKERVYVY CINNKM CC VKPKYQPKERWWPF
6 3 9 5
AA933749, AA970840,
Pooled Germ
BF08527,
Cell Tumors,
AA298819
P
CQTY CTRQETYMHL CPDASL CC LSYALKPPPVPKHEYE 6 3 9 5
no ESTs
CRRR CLDTERYILL CRNKLS CC ISIISHEYTRRPAFPVIHLEDITLDYSDVDSFTGSPVSMLNDLITFDTTKFGETMTPETNTPETTM
7 3 9 5
AA935636(not
Pooled
cysteine
NFL domain)
CKKK CKPEEMHVKNGWAM CGKQRD CC VPADRRANYPVFCVQTKTTRISTVTATTATTTLMMTTASMSSMAPTPVSPTG
6 3 13 5
AA994981, AA846419,
Testis, Pooled
AA453384,
NFL AA40042
CRKI CRVNEVPEAL CENGRY CC LNIKELEACKKITKPPRPKPATLALTLQDYVTIIENFPSLKTQST
8 3 9 5
AI694319, AA812652,
Testis, Pooled
AA454191,
NFL AW18257
CRKK CKVGERYEIG CLSGKL CC ANDEEEKKHVSFKKPHQHSGEKLSVLQDYIILPTITIFTV
7 3 9 5
no ESTs
CRDH CNVDEKEIQK C-KMKK CC VGPKVVKLIKNYLQYGTPNVLNEDVQEMLKPAKNSSAVIQRKHILSVLPQIKSTSFFANTNFVIIP
6 3 9 4
AA401404, AA446332,
Testis, Pooled
AA399988,
NFL AA92714
60
– Summary –
Gene Discovery with HMMs
• Increased the number of known defensin genes in mouse and
human from 7 to 69
• Genomic searches based solely on BLAST may miss
genes related by tertiary structure
• The Hidden Markov tool is a more reliable approach for
identifying gene families related by tertiary structure
61
“Curing” Disease and
Finding New Treatments
I. “Curing” disease
– know the disease-causing gene(s)
– diagnose with genetic test (before onset)
– preempt entire disease with intervention (therapy or
lifestyle advice)
II. Finding new treatments of disease
– know the gene(s)
– understand the biological pathway like never before
a. identify existing drug candidates that interact
b. precisely design a new drug from a molecular basis
62
“Curing” Disease and
Finding New Treatments
• After all the analysis and data visualization….
– Make some decisions:
• 1. Is this a (strongly) genetic phenomenon?
• 2. Is/are there regulating “known” gene(s)?
• 3. Can they be prioritized for further study?
• Can the pathway be deduced or refined?
• Are there existing related products/drugs?
• BUT, where do we obtain candidate “targets”?….
63