
An approach to sequence similarity significance estimation

Jacek Leluk

Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Warsaw, Poland; Department of Molecular Biology, Institute of Biotechnology and Environmental Science, University of Zielona Góra, Poland

Elżbieta Gajewska

Faculty of Mathematics and Information Science, Technical University of Warsaw, Warsaw, Poland

Artur Mikołajczyk

Faculty of Physics, Wrocław Technical University, Wrocław, Poland

Sławomir Walkowiak

Department of Biophysics, Warsaw University, Warsaw, Poland

Jacek Leluk ICM, Warsaw University and Department of Molecular Biology, University of Zielona Góra

Fundamental steps of the procedure leading to the optimal alignment of 2 sequences

Sequence 1 (length n): R V C P K I L M E C K K D S D C L A E C I C L E H G Y C G
Sequence 2 (length m): M V C P K I L M K C K H D S D C L L D C V C L E D I G Y C G V S

Step        Identities   Identity (%)
1           0            0.0
2           0            0.0
3           0            0.0
4           1            25.0
5           0            0.0
...
n - 1       1            3.6
n           18           62.1
n + 1       5            17.2
n + 2       2            6.9
...
n+m-3       1            33.3
n+m-2       0            0.0
n+m-1       0            0.0

Optimal alignment (sequence lengths n and m): 22 identities, 73%
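The stepwise comparison shown above can be sketched in Python. This is a minimal illustration, not the authors' program: the function name `sliding_identities` and the step numbering (at step k <= n the last k residues of sequence 1 face the first k residues of sequence 2) are assumptions, chosen because they reproduce the identity counts listed for the individual steps.

```python
def sliding_identities(seq1, seq2):
    """Slide seq1 along seq2 without gaps; return (step, identities, overlap) per step."""
    n, m = len(seq1), len(seq2)
    rows = []
    for step in range(1, n + m):
        offset = step - n          # position of seq1[0] relative to seq2[0]
        start = max(0, offset)     # first overlapping index in seq2
        end = min(m, offset + n)   # one past the last overlapping index in seq2
        matches = sum(seq1[j - offset] == seq2[j] for j in range(start, end))
        rows.append((step, matches, end - start))
    return rows


SEQ1 = "RVCPKILMECKKDSDCLAECICLEHGYCG"
SEQ2 = "MVCPKILMKCKHDSDCLLDCVCLEDIGYCGVS"

rows = sliding_identities(SEQ1, SEQ2)
for step, matches, overlap in rows:
    print(f"{step:3d} {matches:3d} {100 * matches / overlap:6.1f}%")
```

The sketch is gap-free, so it stops at the best ungapped step; the final 22-identity alignment on the slide needs a gap, which this code does not attempt to place.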


Comparison of the fragments of 1st and 2nd domain of chicken ovomucoid using unitary matrix, GCM, PAM250 and algorithm of genetic semihomology

1) GTT AAT TGC AGC CTG TAT GCC AGC GGC ATC GGC AAG GAT GGG ACG AGT TGG GTA GCC
   V N C S L Y A S G I G K D G T S W V A
2) ATT GAT TGC TCT CCG TAC CTC CAA GTT GTA AGA GAT GGT AAC ACC ATG GTA GCC
   I D C S P Y L Q - V V R D G N T M V A

UNITARY MATRIX
V N C S L Y A S G I G K D G T S W V A
I D C S P Y L Q - V V R D G N T M V A
0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 1
Score: 7/19 (36.8%)

GENETIC CODE MATRIX
GTT AAT TGC AGC CTG TAT GCC AGC GGC ATC GGC AAG GAT GGG ACG AGT TGG GTA GCC
ATT GAT TGC TCT CCG TAC CTC < CAA > GTT GTA AGA GAT GGT AAC ACC ATG GTA GCC
 2   2   3   0   2   2   1   0   0   1   1   1   3   2   1   1   1   3   3
Score: 29/57 (50.9%)

PAM250 SCORING
V N C S L Y A S G I G K D G T S W V A
I D C S P Y L <Q> V V R D G N T M V A
1 1 2 2 0 2 0 0 0 1 0 1 2 2 1 1 0 2 2
Scores: 42/97 (43.3%), 42/89 (47.2%), 20/38 (52.6%)

GENETIC SEMIHOMOLOGY
V N C S L Y A S G I G K D G T S W V A
I D C S P Y L <Q> V V R D G N T M V A
2 2 3 3 2 3 0 0 0 2 1 2 3 3 1 1 0 3 3
Score: 34/57 (59.6%)
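Two of the four scores above can be recomputed with a short Python sketch. It assumes that the unitary score is 1 per identical residue pair and that the genetic code matrix score is the number of identical nucleotides at matching positions of each codon pair; the second assumption is not stated on the slide but reproduces every value listed. The function names are illustrative. PAM250 and genetic semihomology need substitution tables that are not given here, so they are left out.

```python
PROT1 = "VNCSLYASGIGKDGTSWVA"
PROT2 = "IDCSPYLQ-VVRDGNTMVA"   # '-' marks the gap in sequence 2

CODONS1 = ("GTT AAT TGC AGC CTG TAT GCC AGC GGC ATC "
           "GGC AAG GAT GGG ACG AGT TGG GTA GCC").split()
# Assumption: the inserted codon <CAA> spans two alignment columns and scores 0,
# so it is written here as two gap codons.
CODONS2 = ("ATT GAT TGC TCT CCG TAC CTC --- --- GTT "
           "GTA AGA GAT GGT AAC ACC ATG GTA GCC").split()


def unitary_scores(a, b):
    """1 for an identical residue pair, 0 otherwise (gaps never match)."""
    return [int(p == q and p != "-") for p, q in zip(a, b)]


def codon_scores(codons_a, codons_b):
    """Identical nucleotides at matching positions of each codon pair."""
    return [sum(p == q and p != "-" for p, q in zip(ca, cb))
            for ca, cb in zip(codons_a, codons_b)]


unitary = unitary_scores(PROT1, PROT2)
gcm = codon_scores(CODONS1, CODONS2)
print(sum(unitary), "/", len(unitary))   # identities / aligned positions
print(sum(gcm), "/", 3 * len(gcm))       # matching nucleotides / all nucleotides
```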


WHAT IS IMPORTANT IN THE PROTEIN SIMILARITY SEARCH?

1) Contribution (%) of identical positions

P K I L M E C K K D
P K I L M K C K H D
8 identities, 80%: similar

P K I L M E C K K D
S D C L L D C V C L
2 identities, 20%: not similar

2) Length of the compared strings (sequences)

[Figure: a pair of 3-residue strings with 1 identity (33.3%) is casual; a pair of long sequences with 8 identities (26%) is probably similar]

3) Distribution of the identical positions along the analyzed sequence

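M V E M I C I E P K I R C I K V C T K D E R I T L
H V Y Y W R P E R F M H T V K L K A G G C R C W L
5 identities, 20%, scattered along the sequence: casual

M V E M I M A G D A R C I K V C T K D E R I T C L
H H Y Y W M A G D A H T V Q L K A G G C W C W A G
5 identities, 20%, clustered in one segment: similar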
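The point of criterion 3 can be checked numerically with a small Python sketch. The helper name `identity_profile` and the use of the longest run of consecutive identities as a measure of distribution are illustrative assumptions, not the measure used by the authors' programs.

```python
def identity_profile(a, b):
    """Return (identities, percent identity, longest run of consecutive identities)."""
    matches = [p == q for p, q in zip(a, b)]
    longest = run = 0
    for hit in matches:
        run = run + 1 if hit else 0   # extend the current run or reset it
        longest = max(longest, run)
    return sum(matches), 100 * sum(matches) / len(matches), longest


scattered = identity_profile("MVEMICIEPKIRCIKVCTKDERITL",
                             "HVYYWRPERFMHTVKLKAGGCRCWL")
clustered = identity_profile("MVEMIMAGDARCIKVCTKDERITCL",
                             "HHYYWMAGDAHTVQLKAGGCWCWAG")
print(scattered)
print(clustered)
```

Both pairs have the same number and percentage of identities; only the run length separates them.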

4) Residues at conservative positions

M V C P K I L M K C K H D S D C L L D C V C L E D M V C P K I L M K C K H D S D T L L D E D E G K R R T K R E H F K E S N L A A A F K E Q Q N C P G P R E W C F T T R M N D S S C V C L E D C A C P Q T

not similar similar

5) Structural/genetic similarity of the amino acids at non-conservative positions

The same pair of sequences is shown in three panels: Identity only, Structural, Genetic.

M V C P K I L M K C K H D S D C L L D C V C L E D
R L C R R L V K R C R K E T E C I V E C I C I D E

The sequence identity estimation procedure

The probability of a randomly occurring identity match at a minimum level (the number of identical positions a is equal to the declared value or higher) is:

\[
P(a, n) = \sum_{k=a}^{n} \binom{n}{k} \, \frac{x^{k} \, [x(x-1)]^{n-k}}{x^{2n}}
\]

Each term simplifies to \binom{n}{k} (1/x)^{k} (1 - 1/x)^{n-k}.

Where:

x – the number of unit types in sequence (20 for proteins; 4 for NA)

n – the sequence length (the number of compared position pairs)

a – the number of identical positions
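The probability defined above can be evaluated exactly with Python's integer arithmetic. The sketch below is not the SSSS program; it assumes that every unit type is equally likely and that positions are independent, so the chance of identity at one position is 1/x. The function name `casual_match_probability` is hypothetical.

```python
from fractions import Fraction
from math import comb


def casual_match_probability(x, n, a):
    """Exact P(at least a identities in n compared positions) for x unit types."""
    # sum_k C(n, k) (1/x)^k (1 - 1/x)^(n - k) = sum_k C(n, k) (x - 1)^(n - k) / x^n
    favourable = sum(comb(n, k) * (x - 1) ** (n - k) for k in range(a, n + 1))
    return Fraction(favourable, x ** n)


p_na = casual_match_probability(4, 10, 5)         # nucleic acids, a/n = 0.5
p_protein = casual_match_probability(20, 10, 5)   # proteins, a/n = 0.5
print(float(p_na), float(p_protein))
```

Returning a `Fraction` avoids floating-point underflow for long sequences, where the probability can fall far below the smallest representable float.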


Statement of Sequence Similarity Significance Program (SSSS)

(written by A. Mikołajczyk)

The program SSSS calculates:

• the probability of a casual hit for a certain minimum identity degree at an assumed number of unit types and length of compared sequences
• the maximum number of identities that can be reached at a declared probability value and sequence length
• the percent of identity that can be reached for an assumed number of unit types and sequence length and a declared probability/number of identities

The initial data requirements:

• the number of unit types
• the sequence length
• the probability of a casual hit or the declared number of identities

Application:

• protein and nucleic acid sequence comparison for identity significance estimation
• comparative studies on identity degree for protein families of various character, structure, function, location and origin (within one family and between families)

Availability:

• Freeware available upon request to the authors

The changes of >=50% match probability for different sequence length

Nucleic acids

n       P(a) for a/n = 0.5
10      7.81269073486328E-0002
50      1.22513451649683E-0004
100     6.63850248347665E-0008
500     3.10414102023724E-0033
1000    1.28142854362134E-0064
2000    3.07817485439889E-0127

[Figure: Probability of >=50% identity match vs. NA sequence length. x axis: sequence length; data labels (negative exponent of P): 2, 4, 8, 33, 64, 127]

Proteins

n       P(a) for a/n = 0.5
10      6.36898E-05
50      1.10048906341455E-0019
100     7.26930745611374E-0038
500     1.83659152523996E-0182
1000    6.33956683460252E-0363
2000    1.06764298924076E-0723

[Figure: Probability of >=50% identity match vs. protein sequence length. x axis: sequence length; data labels (negative exponent of P): 5, 19, 38, 182, 363, 723]

The minimum identity scores (a/n) for P(a)<=1E-4 and P(a)<=1E-6 for NAs of different length (n)

P(a)<=1E-4
n       a       a/n       P(a)
10      9       0.9       2.95639E-05
50      26      0.52      3.80215E-05
100     43      0.43      6.42443E-05
500     163     0.326     8.18E-05
1000    303     0.303     8.52E-05
2000    574     0.287     9.12E-05

P(a)<=1E-6
n       a       a/n       P(a)
10      10      1         9.54E-07
50      29      0.58      7.20E-07
100     48      0.48      5.74E-07
500     174     0.348     6.57E-07
1000    318     0.318     7.71E-07
2000    595     0.2975    8.30E-07

The minimum identity scores (a/n) for P(a)<=1E-10 and P(a)<=1E-20 for proteins of different length (n)

P(a)<=1E-10
n       a       a/n       P(a)
10      9       0.9       1.87E-11
50      18      0.36      1.46E-11
100     25      0.25      1.82E-11
500     62      0.124     7.91E-11
1000    100     0.1       8.41E-11
2000    169     0.0845    5.85E-11

P(a)<=1E-20
n       a       a/n       P(a)
10      -       -         -
50      26      0.52      5.55E-21
100     36      0.36      1.19E-21
500     82      0.164     5.25E-21
1000    126     0.126     7.03E-21
2000    203     0.1015    6.53E-21
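The thresholds tabulated above are the inverse of the probability calculation: for a given x, n and probability limit, find the smallest a whose probability does not exceed the limit. A minimal Python sketch follows, again assuming equally likely unit types and independent positions. The function name `min_identities` is hypothetical, and the limit is passed as a string so that it is converted to an exact fraction.

```python
from fractions import Fraction
from math import comb


def min_identities(x, n, p_max):
    """Smallest a with P(at least a identities) <= p_max, or None if no a qualifies."""
    p_max = Fraction(p_max)
    total = x ** n
    favourable = 0          # running numerator of the tail sum
    best = None
    for a in range(n, -1, -1):          # grow the tail from k = n downwards
        favourable += comb(n, a) * (x - 1) ** (n - a)
        # integer comparison of favourable / total against p_max
        if favourable * p_max.denominator > p_max.numerator * total:
            break
        best = a
    return best


print(min_identities(4, 10, "1e-4"))     # nucleic acids, n = 10
print(min_identities(4, 10, "1e-6"))
print(min_identities(20, 10, "1e-10"))   # proteins, n = 10
print(min_identities(20, 10, "1e-20"))   # no a qualifies at this length
```

The last call returns `None`, matching the "-" entries for proteins of length 10: even a full 10/10 match has probability 1/20^10, which is above 1E-20.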

Proposed scoring for 6-position window identity calculation - part 1

x - number of identities
y - number of non-identities between the first and the last identity

x = 6: 72 + 24 = 96
x = 5: 60 + 14 = 74; 60 + 22 = 82
x = 4: 48 + 6 = 54; 48 + 12 = 60; 48 + 18 = 66        12x + 6(3 - y)
x = 3: 36 + 3 = 39; 36 + 6 = 42; 36 + 9 = 45; 36 + 12 = 48        12x + 3(4 - y) or 12x + 12/(y + 1)

Scores of the window patterns shown:
96 82 74 74 74 74 66 60 60 60 54 54 54 54 54 54 48 45 45 45 45 45 39 39 39 39
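The proposed window scoring can be written as a small lookup in Python. The bonus values are copied from the sums listed above; the assignment of the larger bonus to the smaller y for x = 5, the string encoding of a window ('1' for identity, '0' for non-identity), and the `None` result for fewer than three identities are assumptions of this sketch, since no scores are listed for those windows.

```python
# Bonus added to 12 * x, indexed as BONUS[x][y].
BONUS = {
    6: {0: 24},
    5: {0: 22, 1: 14},
    4: {0: 18, 1: 12, 2: 6},        # equals 6 * (3 - y)
    3: {0: 12, 1: 9, 2: 6, 3: 3},   # equals 3 * (4 - y)
}


def window_score(window):
    """Score a 6-position window given as a string of '1' (identity) and '0'."""
    hits = [i for i, c in enumerate(window) if c == "1"]
    x = len(hits)
    if x not in BONUS:              # no score is listed for x < 3
        return None
    # non-identities lying between the first and the last identity
    y = (hits[-1] - hits[0] + 1) - x
    return 12 * x + BONUS[x][y]


for pattern in ("111111", "111110", "110111", "111100", "101101", "100101"):
    print(pattern, window_score(pattern))
```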


Studies on phylogenetic relationships

Program SSSS2 (Ela Gajewska and Jacek Leluk)

• Freely accessible Java application

• Contact with the authors: [email protected], [email protected]

• Phylogenetic trees generally reveal correlation with observed similarity


Program SSSS2

The basic criteria used for analysis

• Contribution of identities (%)
• Length of the sequence
• Distribution of identical positions


Pairwise similarity estimation by program SSSS2 (Sequence Similarity Significance Statement v. 2)


Pairwise similarity estimation


Phylograms

Ink4 inhibitor family

[Phylogram, scale bar 0.1. Leaf labels: CAB65455.1, BAA33541.1, AAA50282.1, NP_004927.2, CAC87045.1, CAC67498.1, NP_031696.1, NP_570825.1, O77617, AAG44950.1, AAG59801.1, NP_113738.1, AAC08963.1, AAB39600.1, P51480, AAC08962.1, AAD00229.1, AAD00227.1, AAD00228.1, AAD00230.1, AAD00236.1, AAD00231.1, NP_478104.1, AAC97110.1, AAL76343.1, NP_571977.1, NP_031697.1, CAC12811.1, Q60773, NP_034008.1, A57378, NP_001791.1, AAA85436.1, CAB65454.1, CAC87046.1, NP_000068.1, P42771, AAB60645.1, 2002364A, AAB32713.1, BAA33540.1, AAG01087.1, AAD00232.1, AAB94534.1, NP_478103.1, AAD14050.1]

cAMP inhibitor family

[Phylogram, scale bar 0.1. Leaf labels: P04541, NP_862822.1, NP_006814.1, P27776, A40536, AAA40867.1, AAH48244.1, AAA39940.1, AAD30289.1, AAA86697.1, Q90641, JC4128, AAH36011.1, AAK00638.1, NP_115860.1, Q9C010, NP_861460.1, NP_861459.1, NP_036759.1, AAA41879.1, P27775, NP_032889.2, AAB59678.1, B46707, AAH61162.1, Q04758, A40962, AAL90456.1, Q9Y2B9, NP_008997.1, NP_861521.1, AAD55445.1, NP_861520.1, AAQ17071.1, AAQ17070.1, AAQ04718.1, AAC09065.1, O70139, NP_035236.1, NP_703199.1, NP_032888.1, NP_446224.1, AAA72716.1, OKR BCI, AAH22265.1, OKRBCI]


Studies on very long sequences

Program jSSSS (Sławomir Walkowiak and Jacek Leluk)

Particularly useful for genomic comparative studies.

Reasonable computing time for 10^5-10^6 kbp DNA fragments
