Introduction to Bioinformatics
Iosif Vaisman
Email: [email protected]
NIH working definition of bioinformatics and
computational biology (July 2000)
The NIH Biomedical Information Science and Technology Initiative Consortium
agreed on the following definitions of bioinformatics and computational biology
recognizing that no definition could completely eliminate overlap with other
activities or preclude variations in interpretation by different individuals and
organizations.
Bioinformatics: Research, development, or application of computational tools
and approaches for expanding the use of biological, medical, behavioral or health
data, including those to acquire, store, organize, archive, analyze, or visualize
such data.
Computational Biology: The development and application of data-analytical and
theoretical methods, mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social systems.
Bioinformatics bibliography
(papers with the word “bioinformatics” in title or abstract)
[Figure: number of papers per year, 1988-2000, in Medline, ISI, and PNAS; vertical axis from 0 to 1000.]
Liebman MN, Molecular modeling of protein
structure and function: a bioinformatic approach.
J Comput Aided Mol Des 1988, 1(4):323-41
Dynamics of Database Growth
[Figure: growth of the EMBL Sequence Database, 1983-2003, on a logarithmic vertical axis from 100 to 100,000,000.]
Comparative Sequence Sizes
• Yeast chromosome 3                            350,000
• Escherichia coli (bacterium) genome         4,600,000
• Largest yeast chromosome now mapped         5,800,000
• Entire yeast genome                        15,000,000
• Smallest human chromosome (Y)              50,000,000
• Largest human chromosome (1)              250,000,000
• Entire human genome                     3,000,000,000
The String Alignment Problem
string - a sequence of characters from some alphabet
given: two strings acbcdb and cadbd
one of the possible alignments:
a c - - b c d b
- c a d b - d -
scoring function:
exact match  +2
mismatch     -1
insertion    -1
score:
3 x (2) + 5 x (-1) = 1
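The score above can be checked column by column. The following is a minimal sketch, assuming the second row of the alignment ends with a gap so that both rows have eight columns; the function name score_alignment is illustrative and not part of the lecture.

```python
MATCH, MISMATCH, INSERTION = 2, -1, -1  # scoring function from the slide


def score_alignment(row_a: str, row_b: str) -> int:
    """Score two alignment rows of equal length, one column at a time."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column may not consist of gaps only")
        if x == "-" or y == "-":
            total += INSERTION      # a character aligned with a gap
        elif x == y:
            total += MATCH          # exact match
        else:
            total += MISMATCH       # two different characters
    return total


# 3 matches and 5 gap columns: 3 x 2 + 5 x (-1)
print(score_alignment("ac--bcdb", "-cadb-d-"))
```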
The String Alignment Problem
given: two strings CTCATG and TACTTG
C T C A T G
    |   | |
T A C T T G
score:
3 x (2) + 3 x (-1) = 3
C T C A - T - G
  |   |   |   |
- T - A C T T G
score:
4 x (2) + 4 x (-1) = 4
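The two alignments of CTCATG and TACTTG show that the score depends on where gaps are placed. A standard way to find the best score is global dynamic programming (Needleman-Wunsch). The sketch below is illustrative only: it assumes the same scoring function as the earlier slide (+2, -1, -1), and the function name is hypothetical.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    # prev[j] holds the best score of aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,           # align a[i-1] with b[j-1]
                            prev[j] + gap,      # a[i-1] against a gap
                            curr[j - 1] + gap)) # b[j-1] against a gap
        prev = curr
    return prev[-1]


best = global_alignment_score("CTCATG", "TACTTG")
print(best)  # at least 4, because the gapped alignment on the slide already scores 4
```

Only two rows of the table are kept in memory because the score, not the alignment itself, is requested; recovering the alignment would need the full table or a traceback.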
Entropy and Redundancy of Language
Two sentences with the shared letters masked by *:
** CUR**** F*****W******* D***** DIS*****AND P***
**BLES****FR*****B*******BR*****AND ***** AG***
The sequences are 65% identical
Entropy and Redundancy of Language
A CURSED FIEND WROUGHT DEATH DISEASE AND PAIN
A BLESSED FRIEND BROUGHT BREATH AND EASE AGAIN
Substitution Matrices
• Dayhoff (or MDM, or PAM): derived from global alignments of closely related sequences
PAM100 - number refers to evolutionary distance
(Percentage of Acceptable point Mutations per 10^8 years)
[Figure: tree with a time axis at 100, 200, and 300 million years; distances are labeled PAM100, PAM150, and PAM200.]
Substitution Matrices
• BLOSUM (BLOcks SUbstitution Matrix) Derived from local, ungapped alignments of
distantly related sequences
BLOSUM62 - number refers to the minimum percent identity
Reference: Henikoff & Henikoff Proteins 17:49, 1993
Selecting a Matrix
Low PAM: short segments, high similarity
High PAM: long segments, low similarity
• Compared sequences are related:
200 PAM or 250 PAM
• Database scanning:
120 PAM
• Local alignment search:
40 PAM, 120 PAM, 250 PAM
• Detection of related sequences using BLAST:
BLOSUM 62
THERE IS NO “ONE SIZE FITS ALL” MATRIX!
Matrix Example
      A     B     C     D     E     F     G     H     I     K    ..
A   1.5   0.2   0.3   0.3   0.3  -0.5   0.7  -0.1   0.0   0.0    ..
B         1.1  -0.4   1.1   0.7  -0.7   0.6   0.4  -0.2   0.4    ..
C               1.5  -0.5  -0.6  -0.1   0.2  -0.1   0.2  -0.6    ..
D                     1.5   1.0  -1.0   0.7   0.4  -0.2   0.3    ..
E                           1.5  -0.7   0.5   0.4  -0.2   0.3    ..
F                                 1.5  -0.6  -0.1   0.7  -0.7    ..
G                                       1.5  -0.2  -0.3  -0.1    ..
H                                             1.5  -0.3   0.1    ..
I                                                   1.5  -0.2    ..
K                                                         1.5    ..
..
Dayhoff’s Acceptable Point Mutations
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y
Arg R  30
Asn N 109  17
Asp D 154   0 532
Cys C  33  10   0   0
Gln Q  93 120  50  76   0
Glu E 266   0  94 831   0 422
Gly G 579  10 156 162  10  30 112
His H  21 103 226  43  10 243  23  10
Ile I  66  30  36  13  17   8  35   0   3
Leu L  95  17  37   0   0  75  15  17  40 253
Lys K  57 477 322  85   0 147 104  60  23  43  39
Met M  29  17   0   0   0  20   7   7   0  57 207  90
Phe F  20   7   7   0   0   0   0  17  20  90 167   0  17
Pro P 345  67  27  10  10  93  40  49  50   7  43  43   4   7
Ser S 772 137 432  98 117  47  86 450  26  20  32 168  20  40 269
Thr T 590  20 169  57  10  37  31  50  14 129  52 200  28  10  73 696
Trp W   0  27   3   0   0   0   0   0   3   0  13   0   0  10   0  17   0
Tyr Y  20   3  36   0  30   0  10   0  40  13  23  10   0 260   0  22  23   6
Val V 365  20  13  17  33  27  37  97  30 661 303  17  77  10  50  43 186   0  17
Column codes: A Ala, R Arg, N Asn, D Asp, C Cys, Q Gln, E Glu, G Gly, H His, I Ile,
L Leu, K Lys, M Met, F Phe, P Pro, S Ser, T Thr, W Trp, Y Tyr
Search and alignment entropy
• Information content per position:
pam10    - 3.43 bits
pam120   - 0.98 bits
pam160   - 0.70 bits
pam250   - 0.38 bits
blosum62 - 0.70 bits
• Information requirements:
for search    - 30 bits
for alignment - 16 bits
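One way to read these numbers (an assumption here, since the slide does not spell it out) is that an alignment accumulates roughly the listed number of bits per aligned position, so the shortest alignment that can reach the required information is the requirement divided by the per-position content. A small sketch using only the values from the slide:

```python
import math

# Values copied from the slide
BITS_PER_POSITION = {
    "pam10": 3.43, "pam120": 0.98, "pam160": 0.70, "pam250": 0.38, "blosum62": 0.70,
}
REQUIRED_BITS = {"search": 30, "alignment": 16}


def min_aligned_length(matrix: str, task: str) -> int:
    """Smallest number of positions whose total information meets the requirement."""
    return math.ceil(REQUIRED_BITS[task] / BITS_PER_POSITION[matrix])


for name in BITS_PER_POSITION:
    print(name, min_aligned_length(name, "search"), min_aligned_length(name, "alignment"))
```

Under this reading, pam250 needs about 79 aligned positions for a database search while pam10 needs about 9, which fits the rule of thumb that low PAM matrices suit short, highly similar segments.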
Search and alignment entropy
Recommended matrices for different query length
Query length    Substitution matrix    Gap costs
<35             PAM-30                 ( 9,1)
35-50           PAM-70                 (10,1)
50-85           BLOSUM-80              (10,1)
>85             BLOSUM-62              (11,1)
FASTA Algorithm
Step 1: First run (identities)
[Figure: Sequence A plotted against Sequence B.]
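The first run looks for identical short words shared by the two sequences and groups them by diagonal. The sketch below illustrates that idea only; it is not the FASTA source code, and the function name and the default word length are assumptions made for the example.

```python
from collections import Counter, defaultdict


def diagonal_hits(seq_a: str, seq_b: str, ktup: int = 2) -> Counter:
    """Count identical words of length ktup on each diagonal (offset = i - j)."""
    # Lookup table: word -> all start positions in sequence A
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktup + 1):
        positions[seq_a[i:i + ktup]].append(i)
    hits = Counter()
    # Scan sequence B once; every shared word adds one hit to its diagonal
    for j in range(len(seq_b) - ktup + 1):
        for i in positions.get(seq_b[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


# The shared stretch ABCD starts 2 positions later in the second sequence,
# so its three 2-letter words all fall on diagonal -2.
print(diagonal_hits("ABCDEF", "XXABCD").most_common(1))
```

Diagonals with many hits are the candidate regions that the later steps rescore and join.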
FASTA Algorithm
Step 2: Rescoring using PAM matrix
[Figure: Sequence A plotted against Sequence B; regions are marked as high score or low score.]
The score of the highest scoring initial region is
saved as the init1 score.
FASTA Algorithm
Step 3: Joining threshold eliminates disjointed segments
[Figure: Sequence A plotted against Sequence B.]
Non-overlapping regions are joined. The score equals the sum
of the scores of the regions minus a gap penalty. The
score of the highest scoring region, at the end of this step,
is saved as the initn score.
FASTA Algorithm
Step 4: Alignment optimization using dynamic programming
[Figure: Sequence A plotted against Sequence B.]
The score for this alignment is the opt score.
FASTA Algorithm
FastA uses a simple linear regression against
the natural log of the search set sequence
length to calculate a normalized z-score for
the sequence pair.
Using the distribution of the z-score, the
program can estimate the number of
sequences that would be expected to produce,
purely by chance, a z-score greater than or
equal to the z-score obtained in the search.
This is reported as the E() score.
FASTA Results
• When init1 = initn = opt:
100% identity over the matched stretch.
• When initn > init1:
more than 1 matching region in the database
with poorly matching separating regions.
• When opt > initn:
the matching regions are greatly improved by
adding gaps in one or both of the sequences.
BLAST - Basic Local Alignment Search Tool
• Blast programs use a heuristic search algorithm.
The programs use the statistical methods of Karlin
and Altschul (1990, 1993).
• Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity to
distantly related sequences.
BLAST Algorithm
Step 1: Word list
Query sequence of length L
Maximum of L-w+1 words
(typically w = 3 for proteins)
For each word from the query sequence find the
list of words with high score using a substitution
matrix (PAM or BLOSUM)
BLAST Algorithm
Step 2: Database sequences are scanned against the word list
Exact matches of words from the word list
to the database sequences
BLAST Algorithm
Step 3: Maximal Segment Pairs (MSPs)
For each exact word match, alignment is extended in both
directions to find high score segments
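The seed-and-extend idea of the three steps can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it uses only the exact words of the query (no neighborhood words), a toy +2/-1 score in place of a PAM or BLOSUM matrix, and a hypothetical drop-off parameter xdrop to stop an extension once the score falls too far below its best value.

```python
def extend(query, subject, qi, si, step, match, mismatch, xdrop):
    """Walk from (qi, si) in one direction; return (best gain, columns used)."""
    gain = best = best_len = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        n += 1
        if gain > best:
            best, best_len = gain, n
        elif best - gain >= xdrop:
            break                      # score dropped too far below the best
        qi += step
        si += step
    return best, best_len


def segment_pairs(query, subject, w=3, match=2, mismatch=-1, xdrop=3):
    """Return sorted (score, query start, subject start, length) of extended hits."""
    index = {}                          # step 1: words of the query
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    found = set()
    for j in range(len(subject) - w + 1):            # step 2: exact word matches
        for i in index.get(subject[j:j + w], ()):
            right, rlen = extend(query, subject, i + w, j + w, +1, match, mismatch, xdrop)
            left, llen = extend(query, subject, i - 1, j - 1, -1, match, mismatch, xdrop)
            # step 3: ungapped extension in both directions
            found.add((w * match + left + right, i - llen, j - llen, w + llen + rlen))
    return sorted(found, reverse=True)


print(segment_pairs("ABCXDEF", "ABCYDEF"))
```

In the example both seed words (ABC and DEF) extend across the single mismatch to the same seven-column segment, so only one segment pair is reported.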
Gapped BLAST
• The Gapped Blast algorithm allows gaps to be
introduced into the alignments. That means that
similar regions are not broken into several
segments.
• This method reflects biological relationships much
better.
BLAST family of programs
• blastp - amino acid query sequence against a protein
sequence database
• blastn - nucleotide query sequence against a
nucleotide sequence database
• blastx - nucleotide query sequence translated in all
reading frames against a protein database
• tblastn - protein query sequence against a nucleotide
sequence database dynamically translated in all
reading frames
• tblastx - six-frame translations of a nucleotide query
sequence against the six-frame translations of
a nucleotide sequence database.
Database Searches
• Run Blast first, then depending on your results run a
finer tool (Fasta, Smith-Waterman, etc.)
• Where possible use translated sequence.
• E() < 0.05 is statistically significant, usually
biologically interesting. Check also 0.05 < E() <10
because you might find interesting hits.
• Pay attention to abnormal composition of the query
sequence; it usually causes biased scoring.
• Split large query sequences (if >1000 for DNA, >200
for protein).
• If the query has repeated segments, remove them and
repeat the search.
Documenting the Search
• Algorithm(s)
• Substitution matrix
• Gap penalty (FASTA)
• Name of database
• Version of database
• Computer used
MULTIPLE SEQUENCE ALIGNMENT
Computational complexity
Alignment of protein sequences with 200 amino acid residues:
# of sequences    CPU time
2                 1 sec
3                 200 sec
10                200^8 sec
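The growth in the table can be reproduced with a one-line model. It assumes, as the 1 sec and 200 sec entries suggest, that every added sequence multiplies the work by the sequence length (200); the function name and the seconds-per-year constant are introduced here only for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def cpu_seconds(n_sequences: int, length: int = 200, pair_seconds: float = 1.0) -> float:
    """Estimated time if each extra sequence multiplies the work by `length`."""
    return pair_seconds * length ** (n_sequences - 2)


for n in (2, 3, 10):
    seconds = cpu_seconds(n)
    print(n, seconds, "sec =", seconds / SECONDS_PER_YEAR, "years")
```

For 10 sequences the model gives 200^8 = 2.56 x 10^18 seconds, on the order of 10^10 years, which is why exact multidimensional dynamic programming is replaced by heuristics in practice.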
Multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Column cost: the sum of costs for all possible pairs
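The column cost described above is often called a sum-of-pairs cost. A minimal sketch follows; the individual pair costs (0 for identical residues, 1 for a mismatch, 1 for a residue against a gap, 0 for two gaps) are placeholder values chosen for the example, not values from the lecture.

```python
from itertools import combinations


def pair_cost(x: str, y: str, mismatch: int = 1, gap: int = 1) -> int:
    """Cost of one pair of symbols in a column (placeholder costs)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return 0 if x == y else mismatch


def column_cost(column) -> int:
    """Sum of the costs of all possible pairs of symbols in the column."""
    return sum(pair_cost(x, y) for x, y in combinations(column, 2))


def alignment_cost(rows) -> int:
    """Total cost: the column costs added over all columns."""
    return sum(column_cost(column) for column in zip(*rows))


print(column_cost("AAB"))                  # pairs A-A, A-B, A-B
print(alignment_cost(["AB", "A-", "AB"]))
```

With k rows a column contains k(k-1)/2 pairs, so the cost of a column grows quadratically with the number of sequences.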
Multiple alignment
A correct multiple alignment corresponds to an
evolutionary history:
• there is no direct way to determine the correct alignment
• the practical way is to find an alignment with the maximum score
Multiple sequence alignment
Given k (k > 2) sequences, s1,…, sk, each sequence
consisting of characters from an alphabet A,
a multiple alignment is a rectangular array consisting
of characters from the alphabet A’ (A + "-") that
satisfies the following 3 conditions:
1. There are exactly k rows.
2. Ignoring the gap character, row number i is
exactly the sequence si.
3. Each column contains at least one character
different from "-".
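The three conditions translate directly into a check. The sketch below is illustrative; the function name is hypothetical, and the example rows are the three-sequence alignment V S N - S / - S N A - / - - - A S written without spaces.

```python
GAP_CHAR = "-"


def is_multiple_alignment(rows, sequences) -> bool:
    """Check the three conditions of the definition for the given rows."""
    # Condition 1: exactly k rows, and the array must be rectangular
    if len(rows) != len(sequences) or len({len(row) for row in rows}) > 1:
        return False
    # Condition 2: ignoring gaps, row i is exactly sequence i
    if any(row.replace(GAP_CHAR, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # Condition 3: every column holds at least one non-gap character
    return all(any(ch != GAP_CHAR for ch in column) for column in zip(*rows))


sequences = ["VSNS", "SNA", "AS"]
print(is_multiple_alignment(["VSN-S", "-SNA-", "---AS"], sequences))    # valid
print(is_multiple_alignment(["VSN--S", "-SN-A-", "----AS"], sequences)) # all-gap column
```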
Consensus
Plurality - minimum number of votes for a consensus
Threshold - scoring matrix value below which a symbol
may not vote for a coalition.
Sensitivity - minimum score to select consensus
Profiles - blocks of prealigned sequences
Multiple alignment algorithm
1. Pairwise alignments (progressive pairwise alignments)
2. Distance matrix calculation
3. Guide tree creation (hierarchical clustering)
4. New sequence addition
Scoring system (distances)
D(ij) = -ln[ (Sreal(ij) - Srand(ij)) / (Siden(ij) - Srand(ij)) ] x 100
Sreal(ij) - observed similarity score for two aligned
sequences i and j
Siden(ij) - average of the two scores for each sequence
aligned with itself
Srand(ij) - average score determined from 100 global
randomizations of the two sequences
The distances D(ij) are used to generate the distance matrix
from which the approximate guide tree is generated.
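The distance is a direct function of the three scores. A minimal sketch, with argument names mirroring Sreal, Siden, and Srand; the numeric scores in the example are made up for illustration.

```python
import math


def distance(s_real: float, s_iden: float, s_rand: float) -> float:
    """D = -ln((Sreal - Srand) / (Siden - Srand)) x 100."""
    effective = (s_real - s_rand) / (s_iden - s_rand)
    if effective <= 0:
        # The observed score is no better than the randomized average
        raise ValueError("normalized score must be positive to take the logarithm")
    return -math.log(effective) * 100


# Hypothetical scores: identical sequences give D = 0; a score halfway between
# the random and the self-alignment level gives D = 100 x ln 2.
print(distance(100, 100, 20))
print(distance(60, 100, 20))
```

The normalization maps identical sequences to distance 0 and lets the distance grow without bound as the observed score approaches the random level.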
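Multiple alignment
[Figure: alignment lattices. For two sequences, a unit square with vertices (0,0), (1,0), (0,1), (1,1); for three sequences A, B, C, a unit cube from (0,0,0) to (1,1,1).]
Segment - line joining two vertices
Each unit m-dimensional cube in the lattice
contains 2^m - 1 segments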
Multiple alignment
Alignment Path for 3 Sequences
(0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2)
Multiple alignment
V S N - S
- S N A -
- - - A S
Pairwise Projections of the Alignment
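The correspondence between a lattice path and an alignment can be made explicit: each step of the path produces one column, and a coordinate that advances contributes the next residue of its sequence while a coordinate that stays put contributes a gap. The sketch below assumes the three sequences are VSNS, SNA, and AS, as read from the alignment rows; the function name is illustrative.

```python
def path_to_alignment(path, sequences):
    """Convert a lattice path (list of vertices) into alignment rows."""
    rows = ["" for _ in sequences]
    for prev, curr in zip(path, path[1:]):
        steps = [c - p for p, c in zip(prev, curr)]
        if any(s not in (0, 1) for s in steps) or not any(steps):
            raise ValueError("each step must advance at least one coordinate by 1")
        for k, seq in enumerate(sequences):
            # Advance: emit the next residue; otherwise emit a gap
            rows[k] += seq[prev[k]] if steps[k] else "-"
    return rows


path = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 2, 0), (3, 3, 1), (4, 3, 2)]
for row in path_to_alignment(path, ["VSNS", "SNA", "AS"]):
    print(" ".join(row))
```

Dropping one coordinate from every vertex of the path gives the pairwise projection for the remaining two sequences.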
Alignment statistics
[Sequence labels as listed on the slide: Rablpb, Humlbpa, Humcetp, Rabcetp, Bovbpi, Ratlbp, Maccetp, Humbpi; rows and columns are numbered 1-8. Only rows 1-4 are present.]
Cells above the diagonal are percentages; cells on and below the diagonal are counts.
     1           2           3           4           5           6           7           8
1    478/0/0     67/82/1     65/80/0     19/39/5     19/39/5     18/36/12    42/64/2     43/65/2
2    327/400/5   483/0/0     58/75/0     16/38/5     16/38/5     16/35/12    39/62/1     41/63/1
3    318/390/4   284/367/1   482/0/0     18/38/5     18/38/5     17/35/12    40/64/1     43/64/1
4    96/198/30   84/192/29   95/194/28   494/0/0     95/98/0     74/84/7     20/40/6     21/41/5
Alignment score
[Sequence labels as listed on the slide: Rablpb, Humlbpa, Humcetp, Ratlbp, Rabcetp, Maccetp, Bovbpi, Humbpi; rows and columns are numbered 1-8.]
        1      2      3      4      5      6      7      8
1    4077
2    5358   4129
3    5323   5650   4096
4    8103   8229   8112   4210
5    8109   8243   8118   4332   4219
6    8535   8672   8575   5511   5519   4261
7    6474   6531   6500   8103   8119   8572   4103
8    6392   6434   6378   8033   8035   8520   5508   4083
Alignment visualization
Humlbpa : M---MGALARALPS-ILLALLLTSTPEALGA-NPGLVARITDKGLQYAAQEGLLALQ :  52
Rablpb  : M---MGTWARALLGSTLLSLLLAAAPGALGT-NPGLITRITDKGLEYAAREGLLALQ :  53
Ratlbp  : M---MKSATGPLLP-TLLGLLLLSIPRTQGV-NPAMVVRITDKGLEYAAKEGLLSLQ :  52
Humcetp : M---MLAATVLT---LALLGNAHACSKGTSH-EAGIVCRITKPALLVLNHETAKVIQ :  50
Maccetp : M---MLAATVLT---LALLGNVHACSKGTSH-KAGIVCRITKPALLVLNQETAKVIQ :  50
Rabcetp : -----------------------ACPKGASY-EAGIVCRITKPALLVLNQETAKVVQ :  33
Humbpi  : MRENMARGPCNAPRWVSLMVLVAIGTAVTAAVNPGVVVRISQKGLDYASQQGTAALQ :  57
Bovbpi  : M---MARGPDTARRWATLVVLAALGTAVTTT-NPGIVARITQKGLDYACQQGVLTLQ :  53
[Consensus line, column positions lost: m m l g66 RI3 L 2 6Q]
[Figure: a further alignment block ending at residues 130, 131, 130, 131, 131, 114, 135, 131.]
[Figure: Identity, summary view of the eight sequences.]
Alignment visualization
[Figure: the same alignment block as on the previous slide, displayed by physico-chemical properties.]
Differences mode
Humlbpa : .---.G.LA...PS-...A...TST.EALG.-.....A...D............... :  52
Rablpb  : .---.GTWA....GST..S.....A.GALGT-.....T...D.......R....... :  53
Ratlbp  : .---.KS..GP..P-T..G...LSI..TQGV-..A..V...D.......K....S.. :  52
Humcetp : .---.L...VLT---.A..GNAH..S....H-EA........PA.LVLNH.TAKV.. :  50
Maccetp : .---.L...VLT---.A..GN.H..S....H-KA........PA.LVLN..TAKV.. :  50
Rabcetp : -----------------------.....A.Y-EA........PA.LVLN..TAKV.. :  33
Humbpi  : ......RGPCNAP...S......IGTAV.A.......V...Q...D..S...TA... :  57
Bovbpi  : .---..RGPDTAR..AT....A.LGTAV..T-.....A...Q...D..C.....T.. :  53
[Consensus line, column positions lost: m m l g66 RI3 L 2 6Q]
Alignment visualization (tree)
Sequence Logos:
a quantitative graphical display for binding sites and proteins
Reference: Schneider, T.D. Meth. Enzym 274:445, 1996
Sequence Logos
Sequence Logos
Multiple Alignment Programs
• Pileup (GCG): Needleman and Wunsch algorithm for pairwise
alignment and UPGMA method for tree construction
• CLUSTAL: Wilbur and Lipman algorithm for pairwise alignment
(CABIOS 8:189, 1992)
• PIMA: pattern-matching based algorithm (PNAS 87:118, 1990)
• TreeAlign: phylogenetic algorithm (Meth. Enzymol. 183:626, 1990)
Patterns in protein sequences
Regular Expressions
Patterns described in a standard way are known as
regular expressions
x    ANY
[]   OR
{}   NOT
()   repetitions
-    separator
<    N-terminal
>    C-terminal
.    END
Examples:
[ILV]    I or L or V
{DE}     not D or E
x(2,3)   x-x or x-x-x
Regular Expressions
[AC]-x-V-x(4)-{ED}.
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
...LKHVAYVFQALIYWIK...
...AVEMAGVKYLQVQHGS...
...LYTGAIVTNNDGPYMA...
...KEYKCKVEKELTDICN...
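Patterns in this notation can be translated mechanically into the regular expressions of a programming language. The sketch below uses Python's standard re module and covers only the elements listed on the slides (x, [], {}, (), -, <, >, and the final period); the function name is hypothetical, and the four test strings are the fragments shown above without the leading and trailing dots.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):   # "." ends the pattern, "-" separates
        start = "^" if element.startswith("<") else ""   # N-terminal anchor
        end = "$" if element.endswith(">") else ""       # C-terminal anchor
        element = element.strip("<>")
        # Split an optional repetition such as (4) or (2,3) from the element
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            regex = "."                               # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"           # any residue except these
        else:
            regex = core                              # a residue or an [..] choice
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(start + regex + end)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)
for fragment in ("LKHVAYVFQALIYWIK", "AVEMAGVKYLQVQHGS",
                 "LYTGAIVTNNDGPYMA", "KEYKCKVEKELTDICN"):
    print(fragment, re.search(regex, fragment).group())
```

All four fragments contain a match, for example AYVFQALI in the first one.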
PROSITE Database
Current version contains 1079 documentation entries
that describe 1459 different patterns, rules and
profiles/matrices
[ST]-x(2)-[DE]
Casein kinase II phosphorylation site
[AG]-x(4)-G-K-[ST]
ATP/GTP-binding site motif A (P-loop)
Y-x-[NQH]-K-[DE]-[IVA]-F-[LM]-R-[ED]
Heat shock hsp90 proteins family signature
http://www.expasy.ch/prosite
Blocks Database
Blocks are multiply aligned ungapped segments corresponding
to the most highly conserved regions of proteins
N-6 Adenine-specific DNA methylases proteins
width=9 seqs=78
DMA_VIBCH|Q08318 (85) SCTQWWPPF 77
HEMK_MYCLE|P45832 (181) DLFVAQPTL 100
MT57_ECOLI|P25240 (111) DGALGNPPF 13
MTC1_CHVN1|Q01511 (172) NFVFLDPPY 8
MTC1_COREQ|P42828 (71) QLSFSCPPF 49
MTH2_HAEHA|P00473 (32) KIAFFDPQY 52
MTH3_HAEIN|P43871 (23) HAIISDIPY 73
MTM1_MICAM|P50190 (306) AAVLTNPPF 14
MTM2_MORBO|P23192 (25) QLAVIDPPY 10
MTMU_MYCSP|P43641 (37) QVIYADPPW 13
MTR1_RHOSH|P14751 (60) QLIICDPPY 8
....................................
http://www.blocks.fhcrc.org/
Pfam Database
Pfam is a large collection of multiple sequence alignments and
hidden Markov models covering many common protein domains
Zinc finger, C2H2 type
TYY1_HUMAN/383-407   YVCPF.DGCN...KKFAQSTNLKSHILT...H
ZG52_XENLA/61-83     YTCT...QCN...KQFSHSAQLRAHIST...H
KRUP_DROME/306-328   YTCE...ICD...GKFSDSNQLKSHMLV...H
YKQ8_CAEEL/78-102    YKCT...VCR...KDISSSESLRTHMFKQ.HH
DEFI_CHICK/268-292   YECP...NCK...KRFSHSGSYSSHISSK.KC
ZFH1_DROME/389-413   FGCD...NCG...KRFSHSGSFSSHMTSK.KC
YL57_CAEEL/42-65     YLCY...YCG...KTLSDRLEYQQHMLK..VH
ZFA_MOUSE/542-564    FKCD...ICL...LTFSDTKEVQQHALV...H
BASO_HUMAN/719-742   FQCD...ICK...KTFKNACSVKIHHKN..MH
HUNB_DROME/297-319   FQCD...KCS...YTCVNKSMLNSHRKS...H
SFP1_YEAST/598-623   FKCPV.IGCE...KTYKNQNGLKYHRLH..GH
ZG29_XENLA/62-84     FVCT...VCG...KTYKYKHGLNTHLHS...H
http://pfam.wustl.edu/
Other Motif Databases
PRINTS : a compendium of protein fingerprints.
A fingerprint is a group of conserved motifs used
to characterise a protein family
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
DOMO : a protein domain database
http://www.infobiogen.fr/~gracy/domo/home.htm
ProDom : a protein domain database
http://protein.toulouse.inra.fr/prodom.html
InterPro Database
InterPro : integrated resource for the commonly
used signature databases - Pfam, PRINTS,
PROSITE, ProDom and SWISS-PROT + TrEMBL.
Current release of InterPro (3.2) contains 3939
entries, representing 1009 domains, 2850 families,
65 repeats and 15 post-translational modification sites.
http://www.ebi.ac.uk/interpro
InterPro Database
From genes to proteins
[Figure: DNA (promoter elements) -> transcription -> RNA (splice sites) -> splicing -> mRNA (start codon, stop codon) -> translation -> protein]
From genes to proteins
Chromosome 19 gene map
Computational Gene Prediction
• Where are the genes unlikely to be located?
• How do transcription factors know where to bind a region of DNA?
• Where are the transcription, splicing, and translation start and stop
signals?
• What do coding regions do (and non-coding regions do not)?
• Can we learn from examples?
• Does this sequence look familiar?
Measures of Prediction Accuracy
Nucleotide Level
[Figure: REALITY and PREDICTION tracks along a sequence; positions are labeled TN, FN, TP, or FP.]
                   REALITY
                   c      nc
PREDICTION   c     TP     FP
             nc    FN     TN
Sensitivity  Sn = TP / (TP + FN)
Specificity  Sp = TP / (TP + FP)
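The nucleotide-level measures can be computed by comparing two annotations position by position. A minimal sketch, assuming each annotation is given as a string with "c" for coding and "n" for non-coding positions (this encoding and the function name are choices made for the example):

```python
def nucleotide_accuracy(reality: str, prediction: str) -> dict:
    """Count TP, FP, FN, TN per position and derive Sn and Sp."""
    if len(reality) != len(prediction):
        raise ValueError("annotations must cover the same positions")
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for real, pred in zip(reality, prediction):
        if pred == "c":
            counts["TP" if real == "c" else "FP"] += 1   # predicted coding
        else:
            counts["FN" if real == "c" else "TN"] += 1   # predicted non-coding
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    counts["Sn"] = tp / (tp + fn) if tp + fn else None   # Sn = TP / (TP + FN)
    counts["Sp"] = tp / (tp + fp) if tp + fp else None   # Sp = TP / (TP + FP)
    return counts


print(nucleotide_accuracy("nncccnn", "ncccnnn"))
```

In the example the predicted coding region is shifted by one position, so two coding positions are found (TP = 2), one is missed (FN = 1), and one non-coding position is called coding (FP = 1); both Sn and Sp equal 2/3.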
Measures of Prediction Accuracy
Exon Level
[Figure: REALITY and PREDICTION tracks showing a missing exon, a wrong exon, and a correct exon.]
Sensitivity  Sn = number of correct exons / number of actual exons
Specificity  Sp = number of correct exons / number of predicted exons
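The exon-level measures can be sketched in the same way. The example assumes that exons are given as (start, end) pairs with inclusive coordinates, that a predicted exon is correct only when both boundaries match an actual exon exactly, and that both lists are non-empty; missing and wrong exons are taken to be those with no overlap at all. These conventions and all names are assumptions for illustration.

```python
def overlaps(a, b) -> bool:
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual, predicted) -> dict:
    """Exon-level sensitivity and specificity, plus missing and wrong exons."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted          # both boundaries identical
    return {
        "Sn": len(correct) / len(actual),
        "Sp": len(correct) / len(predicted),
        "missing": sorted(a for a in actual
                          if not any(overlaps(a, p) for p in predicted)),
        "wrong": sorted(p for p in predicted
                        if not any(overlaps(p, a) for a in actual)),
    }


actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (700, 800)]
print(exon_accuracy(actual, predicted))
```

Here one of three actual exons is predicted exactly, so Sn = Sp = 1/3; the exon at 500-600 is missing and the prediction at 700-800 is wrong, while 300-410 overlaps a real exon but has an incorrect boundary.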
Spliced Alignment (Procrustes)
•New genomic sequence
•Selection of candidate exons
AUG --- GU initial exons
AG --- GU internal exons
AG --- UAA or UAG or UGA terminal exons
•Filtration (based on the codon usage statistics)
•Construction of all possible chains of candidate exons
•Finding a chain with the maximum global similarity to
the target protein
Spliced Alignment (Procrustes)
Predicted Exon Assembly
(Procrustes)
PCR Primers Prediction (GenePrimer)
Exon 1085..1182 (98) hit using first 2 primers
Exon 1628..1676 (49) missed
Exon 1900..2001 (102) hit using first 8 primers
Exon 2110..2184 (75) missed
Exon 2516..2722 (207) hit using first 4 primers
Exon 3385..3472 (88) missed
Exon 3546..3746 (201) hit using first primer
...
GRAIL gene identification program
POSSIBLE EXONS
REFINED EXON
POSITIONS
FINAL EXON
CANDIDATES
Suboptimal Solutions for the Human Growth
Hormone Gene (GeneParser)
GeneMark Accuracy Evaluation
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
and
http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene