Bioinformatics Workshops 1 & 2
Download
Report
Transcript Bioinformatics Workshops 1 & 2
Bioinformatics Workshop 1
Sequences and Similarity Searches
• Open a web browser and type in the URL:
– informatics.gurdon.cam.ac.uk/online/workshops
– Bookmark this page
• Click on the link to the file:
– useful-websites.html
– Bookmark this page too
– It also contains links to the example sequence
files used in the workshop, and the
presentations themselves
The Basic Questions
Where, and how, do I
find something?
How do I know it’s real?
Exercise 0:
Write a concise definition of what a
gene is.
Part 1: Structural Genomics
DNA arranged in
chromosomes
Vertebrate ~ 109 base pairs
Chromosomes and Genes
Total of ~30,000 genes on ~20
chromosomes
1000 – 2000 genes per chromosome
Gene to Protein
~ gene
locus
genome
primary transcript
mRNA
protein
Sequence Signals
mRNA
CTACCATCCATGCTAACCATTCTACTAGCATAACTGGCTA
MLTILL A
Genomic Signals
promoters
===CACGATCGAGTC==
=================
splice sites
==ACGTA…………CAGTA==
==================
enhancers
transcription start site
===CGCTATAAGCG===
=================
===CGCAATAAAGCG==
=================
polyadenylation signal
Derivative Sequences
mRNA
5’
3’
capture by
cloning into
cDNA library
5’ EST
EST: single pass sequence
from each end of the clone
3’ EST
cDNA sequence
cDNA: multiple pass
sequencing over whole
length of the clone
Gene Models
gene model
exons
Sequences and Genes
(Accession Numbers and Names)
gene
mRNAs/cDNAs
S43105.1
BT006437.1
‘Cyclin B1, isoform 1 [mus musculus]’
X58708.1
‘Cyclin B1, isoform 2 [mus musculus]’
NM_111985.3
proteins
‘similar to Cyclin B1 [mus musculus]’
AAB22970.1
AAP21245.1
CAA41545.1
NP_187759.2
‘CCNB1, Cyclin B1 [mus musculus]’
Gene Symbols, Names, Etc.
Gene Symbol:
CCNB1
Gene Name:
cyclin B1 [Homo sapiens]
Description:
G2/mitotic-specific cyclin B1
Aliases:
CCNB, CYCB1
A Gene-Centric View
Entrez Gene
http://www.ncbi.nlm.nih.gov/
genomic location
S43105.1
AAB22970.1
BT006437.1
X58708.1
Cyclin B1
NM_111985.3
AAP21245.1
CAA41545.1
NP_187759.2
expression data
Exercise 1:
Go to Entrez Gene and look for
your favourite gene or genes.
Sequences and Accession Numbers
NM_001015922.1
gi=62860271
GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
NM_001015922.2
gi=62860589
GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAA
BC009638.1
gi=16307106
GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA
NP_001015922.1
protein translated from mRNA
XM_001102567.1
predicted mRNA
XP_001089765.1
predicted protein translated from predicted mRNA
mRNA Splicing Signals
mRNA
CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
gene model
genome
exon
intron
exon
intron
exon
splice sites
CTACCATCCATGCTAACCATTCTAC
CATTTTATACTCATGCAACGGACCGT
AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
GTAAG. .TTTCAG
donor
acceptor
Gene Predictions
Given:
- coding sequence must run from ATG – STOP codon in-frame
- introns GT. . . . . . AG can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some ‘possible’ splice sites are more likely than others
scan genomic sequence …
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
. . .CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.
.CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.
. .
.
.CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.
.
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA.
. .
most likely gene model
. . .CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA. .
Supporting Evidence!
exons:
1
2
3
4
gene model
genome
EST evidence
We note that in the absence of EST evidence it is only really possible to predict
coding sequence with any confidence (and even then…).
So predicted genes based on computational gene models alone will usually lack
UTR regions, which has some important consequences.
Theoretical/Predicted Sequences
exons:
1
2
3
4
predicted gene model
genome
predicted transcript
predicted protein
We’ve now reversed the process of working out exon structure from
aligning cDNA sequences against the genome sequence, but we
shouldn’t lose sight of the fact that we don’t really know if these
predicted proteins exists – especially where supporting EST evidence
is weak, or non-existent.
Sequences for a model organism
ESTs – millions @ £10 each
Cheap to sequence – so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels
cDNAs – tens of thousands @ £1000 each
Expensive – but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences
Genomes – one ! @ £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation
So What’s in the Databases Now?
NCBI July 2005
DNA
3,300,000
cDNAs
15,000,000
ESTs
nr
Proteins
2,700,000
proteins
RefSeq
950,000
proteins
Part 2: Comparative Genomics
Evolution by sequence mutation
Gene sequence
ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
Imagine one mutation gets fixed every 100,000
ATGCATGCTGCCAACGACTGCCGTG
years in this gene sequence…
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGCCCTG
ATGCATGCCGCCAACGGATGTCCTG
Speciation
Gene A
ATGAAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCTACGACTGCCGTG
ATGCAGGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCGTG
ATGCATGCTGCCAACGACTGCCCTG
ATGCATGCTGCCAACGGCTGCCCTG
ATGCATGCTGCCAACGGATGCCCTG
ATGAAGGCCGCCTACGACTGCCGTG
ATGAAGGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACTGTCGTG
ATGAAAGCCGCCAACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCGTG
ATGAAAGCCGCCTACGACAGTCCTG
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||
| ||||
ATGAAAGCCGCCTACGACAGTCCTG
If the genetic difference means they can no
longer interbreed, with fertile offspring – then
we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||
| ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity
between these sequences, this is what we call
homology – detectable similarity because of
common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | | || | |
| || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may
no longer be detectable in the DNA sequence…
Computers Can Detect Homology
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||
| ||||
ATGAAAGCCGCCTACGACAGTCCTG
In fact computers are very good at this task – the two
primary challenges are:
(a) performing the search fast enough to look through
millions of sequence in a timescale compatible with a
lab scientist’s attention span
(b) at low levels of similarity, being able to distinguish
between biologically related sequences and chance
matches…
GCTGACTCGTAGCGCTTAGCTAGCT
|
||
|
|
|
|
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication
though speciation
The two copies of
Gene A will now evolve
independently, but will
continue to have the
~same function
They are
ORTHOLOGS
Paralogs
Gene duplication
though internal
genome duplication
The two copies of
Gene A will now evolve
independently, but will
probably not continue
to have exactly the
same function
They are PARALOGS
‘Other’-logs
What about gene duplication
after speciation ?
How can we describe
the relationship(s)
between the various
copies of gene A in the
two frogs?
Bear in mind that understanding gene
function is more important than
semantics…
The two copies of A in the orange frog are
sometimes called IN-PARALOGS.
If they were also present in the green frog
(and therefore were in the ancestor
species) they would be OUT-PARALOGS.
The Essential Paradigm
1. any group of modern species
can be traced back to some
extinct common ancestor
2. in all likelihood they share
orthologous genes which have the
same function in the modern
animal as in the extinct ancestor
cyclin b1
cyclin b1
3. If we can experimentally
determine the function of a gene
in one of these organisms, then
there is a good chance the
ORTHOLOGOUS gene in another
organism will have the same
function
Function Conserved Longer than
Detectable Similarity
start from first self-replicating sequence
whole genome
duplication
local duplication
living organisms
same function
detectable similarity
Redundancy in the Genetic Code
GCA
GCC
GCG
GCT
A
A
A
A
alanine
TGC
TGT
C
C
cystine
GAC
GAT
D
D
aspartate
GGA
GGC
GGG
GGT
G
G
G
G
glycine
‘Synonymous’ or ‘silent’
mutations in the third
position of the codon
triplets have no effect
on the amino acid
coded for – so there is
no evolutionary
pressure against this…
Protein Similarity Persists Longer
CTATCACGAGAACCTGTG
CTATCCCGAGAACCTGTG
CTATCCCGAGAACCAGTG
CTATCCCGTGAACCAGTG
CTATCCCGTGAGCCAGTG
CTATCCCGTGAGCCAGTT
CTGTCCCGTGAGCCAGTT
CTATCACGAGAACCTGTG
|| || || || || ||
CTGTCCCGTGAGCCAGTT
67%
CTATCACGAGAACCTGTG
| || |
|| ||
TTGTCCCGGTCGCCAGTT
44%
LSREPV
||||||
LSREPV
LSREPV
||| ||
LSRFPV
100%
80%
Always Compare Protein Sequences
DNA comparison
ATGAATGCAGCCTATGATTGCCGAGCCAGAATGCTAAGG
||||| || || || || || || || ||||| || ||
ATGAAGGCCGCATACGACTGTCGTGCTAGAATCCTGAGA
amino acid comparison
MNAAYDCRARMLR
| ||||||||+||
MKAAYDCRARILR
The DNA sequence can change while
the amino acid sequence stays the
same, so always look for similarities by
comparing amino acid sequences.
Exercise 1
nucleotide vs amino acid search
Go to the file example-sequences.html, and locate the section for this
exercise. There should be two sequences: ‘surfeit1’ for frog and fly.
Go to NCBI Blast home page, then ‘Align two sequences’ (bottom left
‘special’ panel), paste one sequence into each window and hit ‘Align’ – this
will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes, and translate them
into amino acid protein sequences, then repeat the two sequences
comparison.
Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest
ORF – click on it – next screen, hit Accept – change View to Fasta protein
– hit View – copy sequence to Blast2Seqs. Do the same with the other
sequence.
Before you hit ‘Align’ change the ‘Program’ (top left) to blastp…
Answers: Exercise 1
The Essential Task
experiment
data mining
gene sequence
what is its function?
Cyclin-A
FoxA1
database of
proteins in
other species
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
we can only do this
because of implied
function based on
orthology
Functional Orthologs ?
Xenopus gene
Human gene
function unknown
sequence similarity
orthologs
But we know that
function is largely
determined by shape
function known,
annotation ‘Gravin’
available
same function ?
similar shape?
Which in general we
cannot determine – but
it is probably SHAPE not
SEQUENCE that is
conserved!
We make an assumption that the same gene function
is likely to be present in the two organisms, and the
ones that have this function are likely to be the most
similar in sequence
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having
all the protein sequences of you own organism, and the one you wish to
find an ortholog in.
frog protein
database of
human proteins
best match
human protein
database of
frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved
their overall organisation from one organism to another.
Human
chromosome 5
Mouse
chromosome 10
Mouse
chromosome 2
And we find the same genes (i.e. orthologs!) in more or less the same
order in the syntenic sections.
These of course represent chromosomal re-arrangements since these
organisms diverged.
Metazome
Fortunately someone has done all the hard work for us….
Dan Rokhsar
http://www.metazome.net/
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms
(human, mouse, rat, chicken, frog, fish) and paste their protein
sequences into a temporary space.
Go to Metazome (http://www.metazome.net/), find the blast window,
open two versions of it, and blast your sequences against the
Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit and have a look
at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
ATGCATGCTGCCAACGGATGTCCTG
||| | || ||| ||| | ||||||
ATGAAAGCCGCCTACGAAAGTCCTG
We want computer programs which will compare sequences
at all possible different alignments, looking for a degree of
similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of
mutations and they can really mess up our simple
alignments:
ATGCATGCTGGCCAACGGATGTCCTG
||| | || ||| | | |
|
ATGAAAGCCGCCTACGAAAGTCCTG
Gaps in Alignments
Consider these two obviously similar sequences:
TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
| ||
|
|| |||||||||||||||||||| ||||||||| ||| |||
|
|||
| | |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA
In fact we realise that the most probable alignment (regarding biological
origin) is with a small gap in each sequence:
TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA
So in general we allow ourselves to insert gaps, until we find the optimal
alignment.
But where should this process stop?
The Downside of Gaps
Take two random sequences, with no ‘real’ similarity:
GACACTAGGTCGATGCGTGGTGGCGAGA
ACGCATCCGGATGTGCACCGTGGAACTG
And allow ‘cost free’ gaps:
GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
|| | |
| | | |||
||||
||
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG
Clearly, although the alignment has no mismatches, it is obviously not biologically
meaningful!
To prevent this we assign a cost to adding gaps which is offset against the benefit of
finding matches – and this is the essence of ‘finding gapped alignments’.
We want to find the ‘alignment’ between the two (or more) sequences which
shows the greatest degree of similarity while introducing the fewest gaps …
BLAST
There are many programs used to find similarities between sequences.
They range from relatively slow programs which find the exact best
matching alignment, through ones which take progressively inexact
shortcuts to speed things up. Of this latter class, the best known, and
easily most widely used is BLAST, developed by Stephen Altschul and
others, and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection
or ‘database’ of target sequences, looking for the one(s) that match the
query sequence the best.
database
query
>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG
CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA
GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG
GTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCA
CGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGC
AACGAA
COMPARE
LIST MATCHES
>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG
Flavours of BLAST
query sequence
other operation?
database sequences
BLASTn
ACGATAGATCCCATCCATAAAT
ATGACGATAGATCCCATCAT
CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA
AGCTCGCTTGGCTCGATGGCT
BLASTp
MQWCGYRWTYQGYRW
MKJLSPWERSYTRGHYTWER
MGHTVNBZY
MKLPWRHGDBKJGMNDFD
MBKLRPIUHDFRTASGSLKWWRTVBN
BLASTx
ACGATAGATCCCATCCATAAAT
tBLASTn
MQWCGYRWTYQGYRW
MKJLSPWERSYTRGHYTWER
MGHTVNBZY
MKLPWRHGDBKJGMNDFD
MBKLRPIUHDFRTASGSLKWWRTVBN
ATGACGATAGATCCCATCAT
CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA
AGCTCGCTTGGCTCGATGGCT
tBLASTx
ACGATAGATCCCATCCATAAAT
ATGACGATAGATCCCATCAT
CGATAGGACCACCACA
GATAGACCAGGATACATAGGATAATTA
AGCTCGCTTGGCTCGATGGCT
How does it work?
The main task of any sequence comparison program is to test all possible
mutual alignments of two sequence and see how good the match is:
query
1st database
sequence
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
|||||||||||||||||||||||||
||||||||||||||||||||||||
||||||||||||||||||||||||
| ||
|
|||||||||||||||||||||||||
| | | ||
|| || || || || ||
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
This would actually be a very slow search process if implemented like
this…
BLAST achieves its speed through two strategies:
- it takes a WORD based approach
- it pre-INDEXES database sequences
BLAST: WORDS and INDEXING
Database of
sequences
1 GACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
2 TAAGCAAATTTAATTTTGTTTACATTTTC
3 GTTAAGACCTTCCCTGACATTTGCAGCAGTTTCAAATGTA
Numbered list of all possible ‘words’
AAAAAAAA
AAAAAAAC
AAAAAAAG
:
ACAAATCC
ACAAATCC
ACAAATCC
:
GACAAATC
GACAAATG
:
TCCAAACC
TCCAAACC
:
00001
00002
00003
07967
07968
07979
33568
33569
64321
64322
Build a position index of all words in the database
sequence
position
word
1
1
33658
1
2
07967
1
:
3
16210
3
15
33568
3
16
07967
:
Analyse the Query Sequence
QUERY
SEQUENCE
>query
AGACAAATCCAAACCCCTGAAGTTCTCCACCAGCAAAGCCA
Numbered list of all possible ‘words’
AAAAAAAA
AAAAAAAC
AAAAAAAG
:
ACAAATCC
ACAAATCC
ACAAATCC
:
GACAAATC
GACAAATG
:
TCCAAACC
TCCAAACC
:
Analyse QUERY SEQUENCE
00001
00002
00003
position
word
1
14236
2
33658
07967
07968
07979
07967
Index of database
3
:
sequence
position
word
1
1
33658
1
2
07967
1
:
3
16210
3
15
33568
3
16
07967
33568
33569
64321
64322
:
Expand from Word Based Matches
We ‘instantly’ know which sequences in the database have at least
a word length match with our query sequence, and at what relative
position.
Next, the potential alignments are expanded, adding up a score for
(total matches – mismatches – gap penalties), to make the best
possible alignment. But this is usually for a tiny proportion of the
sequences in the database – so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in
common, consider BLASTn with a default word-size of 11:
TCGGAAGTGGAAGCTGAACCTGATTGTAGAGTTGGAGGCCAGTGTTCTGGCTGAGC
||||||||| ||||| |||||||||| |||||||||| |||| ||||| |||||||
TCGGAAGTGTAAGCTCAACCTGATTGCAGAGTTGGAGTCCAGAGTTCTAGCTGAGC
Care is sometimes needed…
BLAST –Typical Output
INPUT:
>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCC
CCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAA
GAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA
OUTPUT:
Query= (311 letters)
Database: NCBI Protein Reference Sequences 954,378 sequences; 347,895,532 total letters
>gi|41055060|ref|NP_957420.1|
similar to guanine nucleotide-releasing factor 2 (specific
for crk proto-oncogene) [Danio rerio]
Length=691
Score
Expect
Identities
Positives
Gaps
Frame
Query
Sbjct
=
=
=
=
=
=
133 bits (335)
6e-31
76/98 (77%)
82/98 (83%)
4/98 (4%)
+2
26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK
+ VKT EKPVNK +
1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59
Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA
Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97
When is a match significant?
Here is a ‘typical’ weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F
K S+ + C + + N Y N
+C
P+
+ +W
+P
+
D
I
N
M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there
is no biologically significant alignment…
RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS
E-values
The number of matches like the discovered
match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like
this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to
arise by chance, so if I have a match with such an E-value…
Also “expect value“ or “expectation”
E-values From First Principles
Notation:
1.2e-35 = 1.2 x 10-35
4.8 x 106 = 4,800,000
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 108)
Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 1010)
We will consider first searching a nucleotide sequence (‘ACGTAGACGT’)
against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein)
searches. Which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~ 5.0 x 108 letters
There are 4 possible nucleotides - ACGT
How many matches do we expect to find by chance?
Query = ‘A’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
A
A
A
A
A
AA A
A A
AA
AA
Expected number of matches = (5.0 x 108) / 4 = ~1.2 x 108
Query = ‘AC’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches = (5.0 x 108) / (4x 4) = ~3.1 x 107
Query = ‘ACG’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
ACG
Expected number of matches = (5.0 x 108) / (4 x 4 x 4) = ~8.1 x 106
Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer
Expected number of matches = (5.0 x 108) / (4 x 4 x 4 x 4 … 60 times )
= (5.0 x 108) / 1036
= 5.0 x 10-28
E-value = 5.0 x 10-28
E-values In Practice
So if I take a 60 nt sequence:
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
BLAST OUTPUT:
>gi|27469838|gb|BC041710.1|
Homo sapiens Rap guanine nucleotide exchange factor
(GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060
-28 - !?
Score = 119 bits (60), Expect = 2e-26 theoretical value was 5.0e
Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus
Query
1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT:
>gi|27469838|gb|BC041710.1|
Homo sapiens Rap guanine nucleotide exchange factor
(GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus
Query
1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036
E-value Exercise
Given a transcription factor binding site:
ACC[T/G]TA
How many would you expect to find by chance in a 10k promoter
sequence?
How would this differ if there was an optional additional base
between the 4th and 5th positions?
I.e.
ACC[T/G]TA
OR
ACC[T/G]?TA
E-value Exercise: Answer
ACC[T/G]TA
Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4x4x4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[T/G]’ every 64x2 nt = 128 nt
Expect ‘TA’ every 4x4 = 16 nt
Expect ‘ACC[T/G]TA’ every 128x16 nt = 2048 nt
(4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance
If also ACC[T/G]?TAA allowed?
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
E-values: Effect of Database Size
The nr mRNA database has ~ 1.4 x 1010 letters (was RefSeq and 5.0 x108)
There are 4 possible nucleotides - ACGT
How many matches do we expect to find by chance?
Query = ‘A’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
A
A
A
A
A
AA A
A A
AA
AA
Expected number of matches = (1.4 x 1010) / 4 = ~1.2 x 108
Query = ‘AC’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches = (1.4 x 1010) / (4x 4) = ~3.1 x 107
Query = ‘ACG’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
ACG
Expected number of matches = (1.4 x 1010) / (4 x 4 x 4) = ~8.1 x 106
Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer
Expected number of matches = (1.4 x 1010) / (4 x 4 x 4 x 4 … 60 times )
= (1.4 x 1010) / 1036
= 1.4 x 10-26
E-value = 1.4 x 10-26
(was E-value = 5.0 x 10-28)
E-values: Effect of Database Size
5.0
x108 letters
1.4 x 1010 letters
RefSeq
30 x bigger
nr
E-value =
5.0e-28
BLAST the same sequence
against each
E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
The E-value is simply dependent on database size.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was
5.0 x 10-28.
But our actual BLAST search at NCBI gave a value of
2.0 x 10-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence ‘ACGTACGTACGT’,
imagine what would happen to N if we allowed gaps in our matches.
ACGTAC?GTACGT
This would now give us additional possible alignments that would meet our
‘match’ criteria:
ACGTACGTACGT
||||||||||||
ACGTACGTACGT
ACGTACAGTACGT
|||||| ||||||
ACGTAC-GTACGT
ACGTACCGTACGT
|||||| ||||||
ACGTAC-GTACGT
etc.
We will expect many more matches in a given database, if we allow our
alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST 500 nt sequence against a database
>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
ATCGATCGCGCATCGATCGT
CTAGATCGATCGCTCGCTGT
GTAGATAGATCGGCGATAGA
BLASTn
database
Get a full length match with sequence
XYZ at an E-value = 5.0e-160
BLAST half of the same sequence against the same database
>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
BLASTn
database
Get a match with sequence XYZ
again, but at an E-value = 5.0e-80
Biologically it’s the same match! Does it mean we are
any less sure that this match didn’t occur by chance?
The E-value is simply dependent on match length.
Why not just use % identity?
At some levels this a good question.
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | | | |||||| | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10-19
And, Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in
the database just by chance…
So what’s the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can
work out what it might be doing in my organism?
The difficulty is because:
BLAST
ORTHOLOGY
biological
knowledge
BLAST
nature of query
sequence
Similarity + Probability
phylogenetic
relationship
match length, PI,
size of database…
Are there any useful guidelines though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an
ortholog?
larger/worse
E-values
10-5
10-10
smaller/better
10-40
10-100
0.0
fantasy
borderline
encouraging
pretty good
can’t get better
But note that in some gene families with closely related members you can get an
E-value of 0.0 for several different matches, and then % identity may be more
sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology
may break down, with more than one locus producing identical proteins which
share the same function…
Protein BLAST
It’s (nearly) always better to make comparisons at the amino acid level between
protein sequences than the DNA level, because the amino acid sequence is more
conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before then for an exact match of a 20 amino acid
sequence in the RefSeq protein database, each additional amino acid will reduce
the E-value by 1/20th (there are 20 different amino acids). And as there are
347,895,532 letters in that database,
E-value = ~3.5 x 108 / (20 x 20 x 20 …20 times) = ~3.5 x 10-18.
But this is what we get if we run the blast at NCBI:
Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3
Query
3
Sbjct
972
SSSSFRAYRAALSEVEPPCI
SSSSFRAYRAALSEVEPPCI
SSSSFRAYRAALSEVEPPCI
62
991
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
In fact we need to take into
account both amino acid
substitutability, as well as, as
before, allowing gapped
alignments. On average any
residue can be substituted for
by about 2 others, so each
position has about 1/7th
chance of ‘matching’ rather
than 1/20th.
A
C
F
G
I
L
M
P
V
W
S
LWY
LMV
IMFV
ILV
ILM
FY
N
Q
S
T
Y
DHS
REHK
ANT
S
HFW
H
K
R
NQY
RQE
QK
D
E
NE
DQK
So now we get:
E-value = ~3.5 x 108 / (7 x 7 x 7 …20 times) = ~4.4 x 10-9,
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices
Exercises
Go to the file random-DNA-sequences.html, select one of the 20
randomly generated nucleotide sequences, and do a BLASTx (translated
DNA->protein) at NCBI against the nr protein database.
Did you find any ‘significant’ hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in
and tick boxes, etc., it is actually a command line program that can be run just by
typing the appropriate command and options, e.g.
prompt> blastall –p <blast type> –i <input sequence> –d <database>
This is the simplest form: where the basic program ‘blastall’ takes a number of
different options, or parameters, indicated by the –x and followed by its value.
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a
default value most appropriate to the blast flavour requested.
E.g. for –W <word size> blastn uses –W 11, where blastx uses –W 3.
There are also some options that appear on the web pages that are not really
parameters but manage the job in a similar way. One of the most useful of these
is on the NCBI blast pages where you can use Entrez queries or pick from an
organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure
even for die-hard techies like me! Very few of them are regularly useful in any
but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e
-m
-F
-U
-G
-E
-q
-r
-b
-g
-W
-z
-S
-l
max expected value
output format
(graphical or tabular/spreadsheet)
filter query sequence for low complexity
(default TRUE)
use only upper case regions of query
(default FALSE)
gap opening cost
gap extension cost
nucleotide mismatch penalty
(BLASTx uses matrices)
nucleotide match reward
number of matching sequences to report
allow gaps
(default TRUE)
word size
effective database size
(removes effect of actual database size!)
query strands to search
(default both directions)
restrict database sequences to given list of ‘gi‘ numbers
BLAST Parameters Exercises
1. BLASTn vs. BLASTx
Open the file example-sequences.html, copy the sequence:
>blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page/Nucleotide-nucleotide BLAST (blastn)
section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] to wait for the results in a new page.
(hint if you paste the sequence definition line ‘>name’ into the box as well, your
results will be labelled accordingly, which can be useful)
Now repeat but go to the TRANSLATED BLAST section, and BLAST against the
nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other
vertebrates?
Results for Exercise 1.
BLASTn
BLASTx
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html, copy the sequence:
>low-complexity-filtering-A
This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page/TRANSLATED BLAST section/BLASTx.
Paste your sequence into the box.
Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second
section. And then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and –C.
C is an especially interesting case – what can we deduce about the cDNA
sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn – low complexity filtering ON
Results for Exercise 2B
ON
OFF
Results for Exercise 2C
ON
OFF
There is a sequence error, an
extra G at position 117 in the
sequence:
cDNA
(117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search
to more specific items.
For instance to find only matching sequences in fruit fly, enter ‘Drosophila
melanogaster[ORGN]’ in the Limit by entrez query box in the second section
(you can also select the organism from the adjacent drop-down list).
To combine items use logical AND, OR or NOT.
Open the file example-sequences.html.
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page/
TRANSLATED BLAST section/BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good
match to cyclin-D1. At what E-value do we expect we are no longer looking at
cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page/SPECIAL – Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, should be BLASTn. Note the alignment.
Now run again using tBLASTx – what does this do to our understanding of the
relationship between these two sequences? Are they homologs, orthologs or
paralogs – or none of these?
(ii) Revert to BLASTn, and try varying the values for mismatch penalties and
gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap
open and gap extension penalties….
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Results for Exercise 4 (i)
BLASTn
tBLASTx
Results for Exercise 4 (ii)
Mismatch penalty = -2 (default)
Mismatch penalty = -1
BLAST Parameters Exercises
5. E-Value maximum for reporting
Open the file example-sequences.html.
Copy the sequence >sumo-binding-motif and go to the NCBI BLAST Home
Page.
Go to the PROTEIN BLAST section, BLASTp, and paste the sequence.
Run the search with the default values.
Now re-run the search, setting the maximum E-value in the box:
Expect
100
What difference does this make?
BLAST Parameters Exercises
6. Word Size
Open the file example-sequences.html.
Copy the sequence >morpholino and go to the NCBI BLAST Home Page.
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.
Check OFF the low complexity filter, and then run the search.
Now re-run the search, setting the following parameters:
Low complexity
Expect
Word Size
Other advanced
OFF
100
7
-q-1
(mismatch penalty -1 instead of default -3)
What difference does this make?
END