Protein Analysis Course

Day 1: Databases, dotplots and
pairwise alignment
Today's timetable
Databases and file formats
 Exercises
Dotplot and pairwise alignment
 Exercises
Coffee breaks during the exercises
Databases and file formats
Sequence file format
FASTA
UniProt (Universal protein resource)
 Primary structure
PDB (Protein Data Bank)
 Tertiary structure
Sequence file format
 FASTA (a.k.a Pearson format)
 Most commonly used
 Can be easily constructed by hand if needed
 Straightforward way to store multiple sequences: just
concatenate multiple FASTA files
 Content:
 First line (Header line) always starts with symbol ”>” followed by
identifiers and descriptions
 Header line is ALWAYS just one line before sequence
 After header line (from the second line) starts the sequence
(presented using single-letter codes)
 Sequence normally divided into multiple lines (often required)
 Recommended line length max 80 chars (also with header line)
FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
…
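A minimal Python sketch of reading and writing the format described above. The function names parse_fasta and format_fasta are hypothetical, and the 60-character line width is an assumption (the slide only recommends at most 80 characters per line).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # the sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Format one record with the sequence wrapped to `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGM
MDCKNALSETNG
>SEQUENCE_2
SATVSEINSETDFVAKNDQF
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```

Because every record starts with its own ">" line, concatenating several FASTA files gives text that the same parser reads without changes.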
Databases: UniProt
 UniProt is the universal protein resource, a central repository of
protein data created by combining Swiss-Prot, TrEMBL and PIR.
This makes it the world's most comprehensive resource on protein
information [wikipedia]
 UniProt provides three core databases:
 The UniProt Archive (UniParc) provides a stable, comprehensive
sequence collection without redundant sequences by storing the
complete body of publicly available protein sequence data
 The UniProt Reference Clusters (UniRef) databases provide non-redundant reference data collections based on the UniProt
knowledgebase in order to obtain complete coverage of sequence
space at several resolutions
 The UniProt Knowledgebase (UniProtKB) is the central database of
protein sequences with accurate, consistent, and rich sequence and
functional annotation
UniProt Archive (UniParc)
 Comprehensive and non-redundant database that contains most of the publicly
available protein sequences in the world
 Currently UniParc contains protein sequences from the following publicly available
databases:
EMBL/DDBJ/GenBank nucleotide sequence databases
Ensembl
European Patent Office (EPO)
FlyBase
H-Invitational Database (H-Inv)
International Protein Index (IPI)
Japan Patent Office (JPO)
PIR-PSD
Protein Data Bank (PDB)
Protein Research Foundation (PRF)
RefSeq
Saccharomyces Genome database (SGD)
TAIR Arabidopsis thaliana Information Resource
TROME
USA Patent Office (USPTO)
UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL
Vertebrate Genome Annotation database (VEGA)
WormBase
UniProt Reference Clusters (UniRef)
 Sequence clusters, used to speed up similarity
searches
 UniRef100
 Cluster is composed of sequences that are identical
 UniRef90
 Cluster is composed of sequences that have at least
90% sequence identity
 UniRef50
Cluster is composed of sequences that have at least
50% sequence identity
Protein knowledgebase (UniProtKB)
Is the central hub for the collection of
functional information on proteins, with
accurate, consistent and rich annotation
Consists of two sections:
 Swiss-Prot, which is manually annotated and
reviewed by curators
 TrEMBL, which is automatically annotated and
is not reviewed
UniProt entry
 Every line in an entry
begins with a two-letter
identifier
 UniProt format closely
resembles EMBL format
except that considerably
more information about
physical and biochemical
properties is provided
 More information here
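A small Python sketch of how the two-letter line codes can be used to pick an entry apart. The entry text below is made up for illustration and much shorter than a real UniProt entry; group_by_line_code is a hypothetical helper name. It assumes the flat-file layout in which the code sits in the first two columns, the data starts at column 6, sequence lines have a blank code, and "//" ends the entry.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        if line.startswith("//"):
            break  # end-of-entry marker
        code, content = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(content)
        else:
            # Sequence lines have a blank code; drop the spacing between blocks.
            fields["sequence"].append(content.replace(" ", ""))
    return dict(fields)


# Made-up entry, for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;          12 AA.
AC   P00000;
DE   RecName: Full=Made-up example protein;
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MTEITAAMVK EL
//
"""

fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["AC"], "".join(fields["sequence"]))
```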
Databases: PDB
Founded in 1971 by Brookhaven National
Laboratory, New York.
Transferred to the Research Collaboratory
for Structural Bioinformatics (RCSB) in
1998.
Currently it holds more than 55,000
released structures.
PDB
Methods used to solve 3d structure:
X-ray: 86%
NMR: 13%
Electron Microscopy: 0.7%
Other: 0.3%
PDB file format
Text file – you can edit with a text editor
e.g. WordPad
Atomic co-ordinates
Rich annotation
Citation
Experimental Method
Biological source
Etc.
FYI: Errors in databases
 Be aware of errors in the databases:
 sequence errors:
 genome projects’ error rate is 1 in 10,000 nucleotides;
 ESTs’ error rate is 1 in 100 nucleotides.
 annotation errors:
 Automated computer programs do not always give correct annotations.
 Swiss-Prot is a protein database curated and annotated manually by
biologists. It is the most reliable database, but it is not up to date.
Exercises
Go to the course web page and start with
exercises given in file:
database_exercises.doc
http://ekhidna.biocenter.helsinki.fi/how
Pairwise sequence alignments
 Motivation – Why alignments?
 Sequence comparison
Dotplot
The alignment problem
 Pairwise alignment algorithms
Exact algorithms
Heuristic algorithms
Database searches
 Web tools:
Build alignments using EBI server,
Blast at NCBI, EBI,
PairsDB, …
Motivation
 Proteins perform most of the functions required in
biological systems:
 Signaling (kinases, ...)
 Enzymes (proteases, …)
 Structural (collagen, elastin, …)
 Immune system (antibodies, ...)
 Storage and transport (hemoglobin, …)
…
 Large amount of information available in current
databanks.
 Goal: Want to extrapolate information about the function
of a newly discovered sequence by comparing it to
annotated sequences.
Does it make sense?
 All functional information is ultimately contained
within the sequence.
 Proteins are evolutionarily related:
Selective pressure is on function, and thus on residues
with functional role (eg: active site or structural key
residues are conserved).
 Modular nature of proteins.
 Two sequences have the same structure if
corresponding residues are similar enough on
physico-chemical level.
Application of sequence alignment
 Determining function of newly discovered
genetic or protein sequences.
 Identification of functional patterns/domains.
 Predicting structure of proteins.
 Determining evolutionary relationships among
genes, proteins, and entire species.
Aligning and comparing sequences, and searching
databases for similar sequences – a cornerstone
of bioinformatics!!
Pairwise alignment
Pairwise alignment = identification of residue-residue correspondence.
[Alignment figure; sequence labels on the slide: ?????, GLP_HORSE; position labels: 60, 86, 101, 115]
AGVIGTILLISYGIRRLIKKSPSDVKP
||:||.|||::|..|||.|:.|:||.|
AGIIGIILLLAYVSRRLRKRPPADVPP
For the alignment to be meaningful, the correspondence should reflect
the functional or evolutionary relationship
What criteria should we use to obtain biologically meaningful
alignments?
Terminology
 Identity:
 percentage of pairs of identical residues between two aligned sequences.
 Similarity:
 percentage of pairs of similar residues between two aligned sequences.
 one must define what similar means. Eg:
 as observed in well studied evolutionary
related protein families,
 physico-chemical amino acid
properties: hydropathy, size, …
 Homology:
 two sequences are homologous if and only if they have a common ancestor.
 it's either yes or no.
 Two types: orthology and paralogy
 not to be confused with similarity!
 don’t mix up with analogy
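A short Python sketch of the identity and similarity definitions, applied to the two aligned fragments from the pairwise alignment slide. The grouping of "similar" residues in SIMILAR_GROUPS is only an illustrative assumption (the slide says that similarity must be defined first), and the percentages are taken over aligned residue pairs, which is one of several conventions in use.

```python
# Illustrative grouping of residues treated as "similar" (an assumption).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def aligned_pairs(a, b):
    """Return the residue pairs of two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]


def is_similar(x, y):
    """Identical residues, or residues in the same group, count as similar."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def percent_identity(a, b):
    pairs = aligned_pairs(a, b)
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def percent_similarity(a, b):
    pairs = aligned_pairs(a, b)
    return 100.0 * sum(is_similar(x, y) for x, y in pairs) / len(pairs)


top = "AGVIGTILLISYGIRRLIKKSPSDVKP"
bottom = "AGIIGIILLLAYVSRRLRKRPPADVPP"
print(round(percent_identity(top, bottom), 1))
print(round(percent_similarity(top, bottom), 1))
```

Neither number says anything about homology by itself: homology is a yes/no statement about common ancestry, while identity and similarity are measurements on one particular alignment.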
DotPlot
 The simplest way of
comparing two
sequences:
 A dot is placed
where both
sequence elements
are identical.
 Gives an overview of
all possible
alignments.
 Each diagonal
indicates a possible
(ungapped)
alignment
Filtering Out the Noise in Dotplots
 Dots may be scored according to a sliding window and a similarity
cutoff to reduce noise:
Window size = 5, Similarity cutoff = 3
Example sequences:
LETVHKKLYAGQYQNAGQFCDDIWLMLDNA
LSTIKRKLDTGQYQEPWQYVDDVWLMFNN
[Figure: dotplot of the two sequences above]
 The smaller the window, the more noise.
 With large windows, the sensitivity for short sequences is reduced.
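The windowed dotplot can be sketched in a few lines of Python. This is a minimal sketch, not the method of any particular tool: the function names are hypothetical, and "similarity" is simplified to the number of identical residues inside the window, so a dot is kept when at least `cutoff` of the `window` positions match.

```python
def dotplot(seq_a, seq_b, window=5, cutoff=3):
    """Return the set of (i, j) window start positions that pass the cutoff."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues along the diagonal inside the window.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= cutoff:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a along the top, seq_b down the side."""
    rows = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        rows.append(residue + " " + row)
    return "\n".join(rows)


seq_a = "LETVHKKLYAGQYQNAGQFCDDIWLMLDNA"
seq_b = "LSTIKRKLDTGQYQEPWQYVDDVWLMFNN"
filtered = dotplot(seq_a, seq_b, window=5, cutoff=3)
print(render(seq_a, seq_b, filtered))
```

Setting window=1 and cutoff=1 gives the unfiltered plot with one dot per identical residue pair, which makes the effect of the window easy to compare.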
Dotlet
At http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Let's find repeated domains in the following sequence:
> SLIT_DROME (P24014):
MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCTGLNVDCSHRGLTSVPRKISAD
VERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVITTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILT
LNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSWLSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCAD
GIVDCREKSLTSVPVTLPDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLLLNAN
EISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCESPKRMHRRRIESLREEKFKCSWGEL
RMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGRISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENK
IKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFEHLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSE
GCLGDGYCPPSCTCTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYNKLQCLQRH
ALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQMKDKLILSTPSSSFVCRGRVRNDILAKCN
ACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNATCTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKC
ECQPGFSGEFCDTKIQFCSPEFNPCANGAKCMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQ
TSPCQNHECKHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAVELFNGRIRVSYD
VGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDPAQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFG
NAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLENKCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQV
REYYTENDCRSRQPLKYAKCVGGCGNQCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY
DotPlot summary
 Comparing a sequence with itself, can be used
to identify:
Repeated domains,
Regions of low complexity (eg, …GYCAAAAAAAAALK…).
 Comparing two protein sequences, can be used
to identify:
Local regions of similarity,
Conserved protein domains.
The Pairwise Alignment Problem
 Line up diagonals using edit operations:
 substitution (mutation)
 gap or indel (insertion/deletion)
seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP
     || |||  |  ||| |  | || |     || |   |
seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP
[Figure labels: sequence 1, sequence 2, substitution, gap, insertion, deletion]
But there are many ways to align 2
sequences, so we need to score
alignments to decide which is the best.
Scoring the Edit Operations
 For example:
 identical: +10 (it's good)
 substitution: +2 for S-A, -1 for K-P, …
 gap: -3
PSDVKP--P
| || |  |
PADVPPPAP
Score: 5*(+10) + 2 - 1 + 2*(-3) = 45
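The arithmetic of the worked example can be checked with a short Python sketch. The function name score_alignment is hypothetical, and residue pairs that are not listed in the substitution table score 0 here, which is an assumption made only to keep the example small.

```python
def score_alignment(a, b, identical=10, gap=-3, substitution=None):
    """Sum the column scores of two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    substitution = substitution or {}
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap  # one flat penalty per gap position
        elif x == y:
            total += identical
        else:
            # Look the pair up in either order; unlisted pairs score 0.
            total += substitution.get((x, y), substitution.get((y, x), 0))
    return total


example_scores = {("S", "A"): 2, ("K", "P"): -1}
total = score_alignment("PSDVKP--P", "PADVPPPAP", substitution=example_scores)
print(total)
```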
Choosing an appropriate scoring scheme is where
biological information is introduced (eg, reward the
evolutionarily most likely alignment).
Standard notation:  | for identical
 : for very similar (eg, size and hydropathy)
 . for somewhat similar (eg, size or hydropathy)
Gap penalty
Few long gaps
TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK
is better than many small gaps
IG-TI--LYDL-SYYAG---IR
IGKIIPRL--LVAY--VLIGSR
 Different scores for
 gap opening, eg: -5
 gap extension, eg: L*(-1)
with L=length of extension
 gap opening > gap
extension
gap opening
gap extension
TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK
gap score= -5 -6
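A minimal Python sketch of a penalty with separate opening and extension costs, applied to the two example alignments on this slide. It assumes one particular convention, in which every gap pays the opening cost once plus the extension cost for each of its positions; programs differ on whether the first position is also charged an extension, so absolute numbers depend on the convention. The function names are hypothetical.

```python
import re


def gap_lengths(aligned):
    """Return the length of every run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned)]


def gap_score(aligned_a, aligned_b, opening=-5, extension=-1):
    """Total gap score: opening once per gap, extension once per gap position."""
    lengths = gap_lengths(aligned_a) + gap_lengths(aligned_b)
    return sum(opening + extension * length for length in lengths)


one_long_gap = gap_score("TIL--------LISYGIRRLIK", "TILKKSPSDVKLISYGIRRLIK")
many_small_gaps = gap_score("IG-TI--LYDL-SYYAG---IR", "IGKIIPRL--LVAY--VLIGSR")
print(one_long_gap, many_small_gaps)
```

Because the opening cost is paid once per gap, six short gaps are penalised far more than one long gap, even though they cover a similar number of positions.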
Gap penalty
 Can also consider special penalty for gaps at end/beginning of
alignment (eg, zero penalty).
 Need to be careful in adjusting the gap score to the substitution
score:
 too strong penalty → no gaps,
 too weak penalty → too many gaps.
 Insertions and deletions have been found to occur in nature at
significantly lower frequency than substitutions.
Residue Substitution
 A substitution score for each aa pair
→ a substitution matrix.
 Most used: based on evolutionary relationship.
 Two types:
 PAM series,
 BLOSUM series.
PAM (Percent Accepted Mutation)
 PAM1: observed mutations in
carefully selected sets of
closely related proteins (1572
sequences from 71 families).
(1978)
 Idea: observed substitutions
are the result of 1 mutation
(not many).
 PAMn: iterate PAM1 n times to
obtain substitution rate
between more divergent
sequences.
Use PAM:          0    30   80   110  200  250
when %identity:   100  75   60   50   25   20
[Figure: PAM250 matrix]
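"Iterate PAM1 n times" means multiplying the one-step mutation probability matrix by itself n times. The Python sketch below shows this on a made-up two-letter alphabet: the numbers in TOY_PAM1 are invented for illustration and are not Dayhoff's data. The published PAM scoring matrices are log-odds scores derived from such probability matrices, a step that is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, n):
    """Multiply `matrix` by itself n times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Made-up one-step matrix: each residue is kept with probability 0.99.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
# Diagonal entry: probability that a residue is unchanged after 250 steps.
print(round(toy_pam250[0][0], 4))
```

The diagonal shrinks as n grows, which is why higher PAM numbers correspond to lower expected identity.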
BLOSUM (BLOck Substitution Matrix)
 Based on a larger set than PAM is.
 More recent than PAM. (1992)
 Different approach than PAM:
 not based on an explicit
evolutionary model,
 observed aa substitutions in a set
of conserved aa patterns called
blocks.
 BLOSUMn: from blocks which are n%
identical.
 BLOSUM62: empirically shown to be
among the best at detecting weak
similarity.
BLOSUM62
Tips for using substitution matrices
 Generally, BLOSUM matrices perform better than PAM for local
similarity searches.
 For database searches, the most commonly used matrix is
BLOSUM62.
 When comparing closely related proteins, one should use lower
PAM or higher BLOSUM, for distantly related proteins higher PAM
or lower BLOSUM matrices
Less divergent:  BLOSUM 80   PAM 1
                 BLOSUM 62   PAM 120
More divergent:  BLOSUM 45   PAM 250
 Caution: substitution matrices are statistical in nature. In a given
alignment, a substitution may or may not correspond to an actual
mutation.
Pairwise Alignment Algorithms
 Given a scoring scheme, an alignment algorithm tries to find the best
alignment between 2 sequences according to that scheme.
 Exact algorithms:
 guaranteed to return an alignment with the best possible score.
 Heuristic alignments:
 not guaranteed to return best alignments.
 but they are quicker (and hopefully still return good alignments).
 Two types of alignment:
 Global: forced over the entire length of 2 sequences.
 Local: between substrings of 2 sequences.
Global vs Local Alignment
 Global alignments:
 are sensitive to gap penalties,
 Assumes homology.
 Outputs everything – either matches or gaps
 can be used to compare 2 proteins with same
function (in, eg, human/mouse).
 Local alignments:
 Can be used to look for conserved domains
or motifs in 2 proteins,
 search for local similarities in large
sequences,
 database searches,
 scanning an entire genome with a short
sequence.
 Does not output everything – only the best
hits
Exact Algorithms: Dynamic
Programming
How can we find the best alignment between 2 sequences?
 Exhaustive search among all possible
alignments is not possible (eg, for 2 sequences
of 100 and 95 residues: 55 million possible
alignments with 5 gaps).
 Problem solved by dynamic programming:
1. initialize top row and left column,
2. compute best local scores iteratively,
3. keep track of where best local score
comes from,
4. traceback to obtain the best alignments.
 Several best solutions may exist: an alignment
reported to you may be one among a number of
possibilities.
best global score
Example of 2 best solutions:
ATTCTCTGA
-TAC--TGA
ATTCTCTGA
-TA--CTGA
The example is from www.pasteur.fr
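The four dynamic programming steps can be sketched in Python as a global (Needleman-Wunsch style) alignment. The scoring values (match +1, mismatch -1, gap -1 per position) are assumptions chosen for illustration, since the slide does not state which scheme produced the two example solutions. When several paths are equally good, this traceback simply returns one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # 1. Initialize the top row and left column with cumulative gap costs.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # 2. Fill each cell with the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # 3. and 4. Trace back from the bottom-right corner, re-deriving the move.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ATTCTCTGA", "TACTGA")
print(best)
print(top)
print(bottom)
```

Under this assumed scheme both alignments shown on the slide reach the same score as the one returned, which illustrates why a reported alignment may be only one of several equally good answers.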
Local and global Alignment Servers (Exact
Algorithm)
Use the Needleman-Wunsch algorithm (1970)
and the Smith-Waterman algorithm (1981).
 Server at EBI: EMBOSS-Align
 Let's submit to http://www.ebi.ac.uk/emboss/align/index.html the
sequences:
>uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complex
MALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFC
SSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLG
LENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWN
SLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQ
LPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHN
AIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLG
LTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDN
GLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGP
LQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDC
GCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEA
HFAPC
>uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D).
MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRM
DRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDK
MVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRN
NLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDS
LGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGT
HLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALR
GLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGH
NPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPP
DPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVF
AMIKIGQLFRTLIREELLFEAMGKSSN
Heuristic Algorithms
 Motivations:
 Exact algorithms are exhaustive but computationally
expensive.
 Exact algorithms are impractical for comparing a query
sequence to millions of other sequences in a database
(database scanning),
 and so, database scanning requires faster alignment
algorithm (at the cost of optimality).
Heuristic Algorithms
 Probing a database with a query is similar to aligning a query with
a very long sequence.
 need fast local alignment methods.
 Main idea:
 Use dynamic programming, but limited to (sub-)sequences
which are likely to produce interesting alignments with the
query.
 Heuristic part of the algorithm: eliminate from search
uninteresting sequences (need to make a guess).
 Algorithms:
 FASTA : Lipman-Pearson (1985).
 BLAST (Basic Local Alignment Search Tool) : Altschul et al.
(1990).
BLAST Overview
 Many versions for different query-database cases:
 blastp: protein - protein
 blastn: nucleotide - nucleotide
 blastx: nucleotide → protein - protein
 tblastn: protein - protein ← nucleotide
 tblastx: nucleotide → protein - protein ← nucleotide
 Comes in many flavours.
 Fast and reliable.
 Easy to use.
BLAST Overview
 BLAST computes “an alignment”, not necessarily the exact optimal
alignment.
 Given the query and the database (long sequence):
 Find all words of length k (default: k=3 for AA and k=11 for
DNA) that match the query with a score high enough.
 Look for subsequences in the database that contain these
words.
 Extend subsequences to see if match score can be increased.
 Compute total score when no more extensions are possible.
 Rank the alignments.
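The word-and-extend idea can be sketched in Python as below. This is a strong simplification and not BLAST itself: words must match exactly (BLAST also accepts similar words that score above a threshold), the extension is ungapped and stops at the first mismatch (BLAST extends while the score can still increase), and there is no statistical ranking. All function names are hypothetical.

```python
from collections import defaultdict


def words(seq, k):
    """Return (position, word) for every word of length k in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = defaultdict(list)
    for j, word in words(subject, k):
        index[word].append(j)
    return [(i, j) for i, word in words(query, k) for j in index.get(word, [])]


def extend(query, subject, i, j, k):
    """Grow a seed in both directions while the residues stay identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    start_q, start_s = i, j
    end_q, end_s = i + k, j + k
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


query, subject = "GQYQNAG", "TGQYQEPW"
seeds = find_seeds(query, subject, k=3)
hits = {extend(query, subject, i, j, 3) for i, j in seeds}
print(seeds, hits)
```

Indexing the subject words once is what makes the search fast: only the positions that share a word with the query are ever examined in detail.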
BLAST at NCBI
Let's submit the query sequence
>1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR
EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYR
FPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDI
GLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGN
KPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCP
STCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVC
VPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECM
QECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQ
MLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRH
SHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDH
RNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTR
NNGERASCESDVDDDDKEQKLISEEDLN
At http://www.ncbi.nlm.nih.gov/BLAST/
Bit score: S’
The value S’ is derived from the raw
alignment score S, but statistical
properties of the scoring system have
been taken into account. Because bit
scores are normalised w.r.t. scoring
system, they can be used to compare
alignment scores from different
searches.
E value: Expectation value.
Expected # of alignments with scores
equivalent to or better than S to
occur by chance. The lower the E
value, the more significant the score.
NCBI Blast output help:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html
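The slide describes the bit score and E value in words only. The Python sketch below uses the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m is the query length and n the database length. The default values of lam and k are illustrative assumptions (they depend on the substitution matrix and gap costs and are reported in the BLAST output), and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam=0.267, k=0.041):
    """Convert a raw score S into a bit score S'.

    lam and k are parameters of the scoring system; the defaults here are
    illustrative only.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


bits = bit_score(100)
print(round(bits, 1), e_value(bits, 300, 10 ** 9))
```

Each additional bit halves the E value for a search space of the same size, which is why bit scores from different searches can be compared.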
BLAST servers
 Pairwise alignment:
 BLAST:
http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
 Database screening:
 BLAST:
 http://www.ncbi.nlm.nih.gov/BLAST/
 http://www.ebi.ac.uk/blast/index.html
 http://www.ch.embnet.org/software/bBLAST.html
 http://www.ch.embnet.org/software/aBLAST.html
Remark: there is a server with a powerful implementation of Smith-Waterman
for database screening: http://www.ebi.ac.uk/MPsrch/. It runs about 50 times
slower, but is more sensitive and returns fewer false positives than Blast.
PSI-BLAST
 Position-Specific Iterated Blast:
 More sensitive, ie better at detecting distant relationships,
than BLAST.
 Computes position-specific substitution matrices (PSSMs)
to score matches between query and database sequences.
(Blast uses precomputed substitution matrices, eg
BLOSUM62.)
PSI-BLAST
 Repeatedly searches the target databases.
 At each round:
 compute a multiple alignment of high scoring
sequences to generate a new PSSM for next round of
searching.
 Iterates until no new sequences are found (or until a maximal
number of iterations is reached).
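A toy Python sketch of how a position-specific scoring matrix can be built from a multiple alignment. It is a simplification under stated assumptions: a uniform background frequency of 1/20, a flat pseudocount, and no sequence weighting, all of which PSI-BLAST handles in a more refined way. The three aligned fragments are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for col in range(len(aligned[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


toy_alignment = ["GQYQ", "GQYQ", "GQFQ"]  # made-up example
pssm = build_pssm(toy_alignment)
print(round(score_segment(pssm, "GQYQ"), 2), round(score_segment(pssm, "GQFQ"), 2))
```

Unlike a fixed matrix such as BLOSUM62, the score for a residue here depends on the column: Y scores higher than F at the third position only because this particular alignment favours it.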
Significance of Alignments
 Raw scores cannot be used to rank alignments:
 a bad but long alignment may have a higher score than a
good but short alignment.
 We need a normalized scoring scheme that allows us to
compare alignments and evaluate their biological significance.
 Idea:
 Probe the database with random sequences.
 This gives a distribution of scores (it follows the extreme-value distribution).
 Establish a threshold for significance.
Extreme-Value Distribution
[Figure: score distribution for random sequences, with the score of our query marked on the score axis; P-value = probability that the score of our query is no better than random]
Difficulty: finding a significance threshold.
Quantifying the Significance of
Alignments
For an alignment with raw score S:
 P-value:
 The probability of an alignment occurring with score S or
better if the aligned-against sequence is random.
 The lower the P-value, the more significant the alignment.
 E-value:
 Expected number of alignments with scores equivalent to or
better than S to occur by chance only.
 The lower the E-value, the more significant the alignment.
 E-value = P-value * size of database.
Rough Guide for P-values and E-values
 P-Value (reported by many programs): 0 ≤ P-val ≤ 1
P ≤ 10^-100: Exact match
10^-100 < P < 10^-50: Sequences very nearly identical, e.g. alleles or SNPs
10^-50 < P < 10^-10: Closely related sequences, homology certain
10^-5 < P < 10^-1: Usually distant relatives
P > 10^-1: Match probably insignificant
 E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-val ≤ size
of database
E ≤ 0.02: Sequences probably homologous
0.02 ≤ E ≤ 1: Homology can’t be ruled out
E > 1: This match would be obtained by chance
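The rough guide for E-values, together with the relation E-value = P-value * size of database from the previous slide, can be written as a small Python helper. The function names are hypothetical, and the thresholds are the rules of thumb from this slide rather than hard limits.

```python
def e_value_from_p(p_value, database_size):
    """Approximate E-value from a P-value and the number of database entries."""
    return p_value * database_size


def interpret_e_value(e):
    """Apply the rule-of-thumb thresholds from the rough guide."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology can't be ruled out"
    return "match would be obtained by chance"


for p in (1e-12, 1e-7, 1e-3):
    e = e_value_from_p(p, 1_000_000)
    print(p, e, interpret_e_value(e))
```

The same P-value gives a larger E-value in a larger database, so a hit that looks convincing in a small search may not be significant in a big one.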
Rules of thumb for pairwise
alignment
 Use server defaults in the absence of any other information.
 Adjust the substitution matrix to the expected divergence of
the 2 sequences. Use BLOSUM62 if no a priori information.
 For distantly related sequences, use PSI-Blast rather than
BLAST. If PSI-BLAST doesn’t give you anything use GTG.
 Many ways of aligning 2 sequences.
 A returned alignment is not the absolute truth.
 Inspect the alignment from the biologist´s perspective.
Exercises
Go to the course web page and start with
exercises given in file:
p_alignment_exercises.doc
http://ekhidna.biocenter.helsinki.fi/how