Introduction to Bioinformatics 2. Biology Background


Sequence Alignments with Indels

Evolution produces insertions and deletions (indels)
– In addition to substitutions
Good example:
MHHNALQRRTVWVNAY
MHHALQRRTVWVNAY-
BLOSUM score = 2 (end gap = -6)
MHHNALQRRTVWVNAY
MHH-ALQRRTVWVNAY
Score = 79 (gap = -6)
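The two scores in this example can be checked with a short script. This is a minimal sketch: the table holds only the BLOSUM62 entries needed for these sequences (the slide says "Blosum" without naming the variant; BLOSUM62 is assumed here because it reproduces both totals), and `score_alignment` is an illustrative helper name, not part of any library.

```python
# BLOSUM62 entries for just the residue pairs that occur in this example.
# The matrix is symmetric, so each pair is stored once, letters sorted.
BLOSUM62_SUBSET = {
    "AA": 4, "HH": 8, "LL": 4, "MM": 5, "NN": 6, "QQ": 5, "RR": 5,
    "TT": 5, "VV": 4, "WW": 11, "YY": 7,
    "AN": -2, "AL": -1, "LQ": -2, "QR": 1, "RT": -1, "TV": 0,
    "VW": -3, "NV": -3, "AY": -2,
}

GAP = -6  # penalty per gap position, as on the slide


def score_alignment(top, bottom, gap=GAP):
    """Sum substitution scores column by column; a gap column costs `gap`."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        else:
            total += BLOSUM62_SUBSET["".join(sorted((x, y)))]
    return total


print(score_alignment("MHHNALQRRTVWVNAY", "MHHALQRRTVWVNAY-"))  # 2
print(score_alignment("MHHNALQRRTVWVNAY", "MHH-ALQRRTVWVNAY"))  # 79
```

Moving the single gap from the end to position 4 puts every remaining residue back opposite its identical partner, which is why the score jumps from 2 to 79.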

An alignment must have equal length aligned sequences
– For a full alignment we must add gaps at the start and the ends
Combinatorially difficult problem to find best indel solution
Gap



So far we ignored gaps
A gap corresponds to an insertion or a deletion of a residue
Conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap/extra residue
– Interrupts the entire polymer chain
– In DNA, shifts the reading frame in coding regions
Gap Penalties

Gaps are penalised
– Write wx to indicate the penalty for a gap of length x
– For example, each gap scores -6, so wx = -6*x
One common scheme is affine gap penalties
– Score -12 for opening a gap
– And -2 for every subsequent gap
– i.e., wx = -12 - 2*(x-1)
Start and end gap penalties often set to zero
– But this can leave a doubt about evolutionary conclusions, unless we have fragments
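The two penalty schemes can be compared side by side. This is a small sketch using the numbers from the slide; the function names `linear_gap` and `affine_gap` are chosen here for illustration.

```python
def linear_gap(x, per_position=-6):
    """Linear scheme: every gap position costs the same, wx = -6*x."""
    return per_position * x


def affine_gap(x, opening=-12, extension=-2):
    """Affine scheme: wx = -12 - 2*(x-1) for a gap of length x >= 1."""
    if x <= 0:
        return 0  # no gap, no penalty
    return opening + extension * (x - 1)


for length in (1, 2, 3, 5):
    print(length, linear_gap(length), affine_gap(length))
```

With these values the affine scheme is harsher on a single-position gap (-12 against -6) but milder on long ones (a gap of length 5 costs -20 against -30), so it favours a few long gaps over many short ones.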
Dot Matrix Representations (Dotplots)
To help visualise best alignments

Plot where each pair is the same, then draw best line
[Figure: dotplot with MNALSQLN along the horizontal axis and NALMSQNH down the vertical axis; a dot marks each position where the two residues are the same]
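A dotplot of this kind is easy to compute. The sketch below assumes the two sequences from this slide are MNALSQLN (horizontal) and NALMSQNH (vertical); `dotplot` and `render` are illustrative names, and the text rendering stands in for the drawn figure.

```python
def dotplot(horizontal, vertical):
    """Return a grid of booleans: True where the two residues are identical."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Draw the grid as text, '*' for a matching pair and '.' otherwise."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dotplot(horizontal, vertical)):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("MNALSQLN", "NALMSQNH"))
```

Runs of stars along a diagonal correspond to stretches that align without gaps; a jump from one diagonal to a neighbouring one corresponds to an indel.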



Getting Alignments from Dotplot Paths
[Figure: dotplot of MNALSQLN (horizontal) against NALMSQNH (vertical) with the best path drawn; one triangle indicates that M matches with a gap, another indicates that L matches with a gap]
Stage 1: Align middle
– Use triangles
– To indicate gaps
NAL-SQLN
NALMSQ-N
Stage 2: Sort the ends out
MNAL-SQLN-
-NALMSQ-NH
Dotplots for Real Proteins

Need a way to automatically find the best path(s)
Dynamic Programming Approach

BLAST is quick
– But not guaranteed to find best alignment
– Gapped BLAST has indels, but no guarantee…
Dynamic Programming:
– Also known as: Needleman-Wunsch Algorithm
Can use it to draw the Dotplot paths
– From that we can get the alignment
Mathematically guaranteed
– To find the best scoring alignment
– Given a substitution scheme (scoring scheme, e.g., BLOSUM)
– And given a gap penalty
The Needleman-Wunsch algorithm



A smart way to reduce the massive number of
possibilities that need to be considered, yet still
guarantees that the best solution will be found (Saul
Needleman and Christian Wunsch, 1970).
The basic idea is to build up the best alignment by
using optimal alignments of smaller subsequences.
The Needleman-Wunsch algorithm is an example of
dynamic programming, a discipline invented by
Richard Bellman (an American mathematician) in
1953!
Dynamic Programming

A divide-and-conquer strategy:
– Break the problem into smaller subproblems.
– Solve the smaller problems optimally.
– Use the sub-problem solutions to construct an optimal solution for the original problem.
Dynamic programming can be applied only to problems exhibiting overlapping subproblems whose optimal solutions combine into an optimal solution of the whole (optimal substructure).
Examples include
– Travelling salesman problem
– Finding the best chess move
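Both the "massive number of possibilities" and the idea of overlapping subproblems can be made concrete by counting alignments. The sketch below is an illustration added here, not part of the original slides: it counts every way of aligning a sequence of length n with one of length m, where each column is either a residue pair, a residue over a gap, or a gap over a residue. `count_alignments` is a hypothetical helper name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # remember each (n, m) so it is solved only once
def count_alignments(n, m):
    """Number of gapped alignments of sequences with lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only one option left: pad the remainder with gaps
    return (
        count_alignments(n - 1, m - 1)  # last column pairs two residues
        + count_alignments(n - 1, m)    # last column: residue over a gap
        + count_alignments(n, m - 1)    # last column: gap over a residue
    )


print(count_alignments(6, 7))  # 19825
```

Even two sequences of lengths 6 and 7 admit 19825 alignments, yet only 7 x 8 = 56 distinct subproblems exist. The cache exploits exactly that overlap, and the same table-filling idea is what keeps the number of cells in the alignment matrix small.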

Overview of Needleman-Wunsch

Four Stages
1. Initialise a matrix for the sequences
2. Fill in the entries of that matrix (call these Si,j)
– At the same time drawing arrows in the matrix
– Diagonal, up, left
3. Use the arrows to find the best scoring path(s)
4. Interpret the paths as alignments as before
Illustrate with: MNALQM & NALMSQA
Stage 1
Initialising the Matrix

Draw the grid
Put in increasing gap penalties
Then put in BLOSUM scores
Stage 2
Putting Scores and Arrows in
Put the score in
Draw the arrow
Mathematically, we are calculating:
Si,j = max( Si-1,j-1 + s(ai,bj), Si-1,j + w, Si,j-1 + w )
Where:
– Si,j is the matrix entry at (i,j) [the one we want to fill in]
– Si-1,j-1 is above and to the left of this
– Si-1,j is to the left of this and Si,j-1 is above it
– w is the gap penalty (-6 in this example)
– s(ai,bj) is the BLOSUM score for the i-th residue from the horizontal sequence and the j-th residue from the vertical sequence (i.e., just the scores we have written in brackets)
This diagram might help:
Fill in the next row and column
A Close up View
Continue filling in the Si,j entries
Stage 3
Finding the best path

Scores Si,j in the matrix
– Are the BLOSUM scores for alignments
However!
– We must take into account final gap penalties
Look down the final column and along the final row
– Find the highest scoring number
– Remembering to take off the gap penalty the correct number of times
Finding the best path
So, the best path is:
Stage 4: Generating the Alignment
Firstly, draw the Dotplot
Secondly, Generate the Alignment

– Using the technique previously mentioned
This path gives us an alignment with three gaps

     M   N   A   L   -   -   Q   M
     -   N   A   L   M   S   Q   A
S = -6   6   4   4  -6  -6   5  -1  = 0

Should check that you get the same score
– As on the diagram
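The four stages can be sketched in code for the example pair MNALQM and NALMSQA. This is a minimal sketch under stated assumptions: the table holds only the BLOSUM62 entries needed for these residues (BLOSUM62 is assumed because it reproduces the column scores above), the gap penalty is -6 throughout including at the ends, and the arrows are re-derived during traceback instead of being stored. It uses the common textbook layout in which rows follow the first sequence, so the bottom-right cell already contains the final gap penalties.

```python
# BLOSUM62 values for the residue pairs in this example (symmetric matrix;
# each pair is stored once with its two letters in alphabetical order).
BLOSUM62_SUBSET = {
    "AA": 4, "AL": -1, "AM": -1, "AN": -2, "AQ": -1, "AS": 1,
    "LL": 4, "LM": 2, "LN": -3, "LQ": -2, "LS": -2,
    "MM": 5, "MN": -2, "MQ": 0, "MS": -1,
    "NN": 6, "NQ": 0, "NS": 1, "QQ": 5, "QS": 0, "SS": 4,
}


def sub_score(x, y):
    return BLOSUM62_SUBSET["".join(sorted((x, y)))]


def needleman_wunsch(a, b, gap=-6):
    rows, cols = len(a) + 1, len(b) + 1
    # Stage 1: initialise the matrix with increasing gap penalties.
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap
    # Stage 2: fill each entry from its diagonal, upper and left neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(
                S[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]),
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
    # Stages 3 and 4: walk back from the bottom-right cell, following
    # whichever neighbour produced each score, and emit alignment columns.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and S[i][j] == S[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1])):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("MNALQM", "NALMSQA"))  # (0, 'MNAL--QM', '-NALMSQA')
```

The traceback prefers the diagonal when several neighbours tie, so it reports one best path; collecting all tied paths would need a small extension.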
Other Alignments
MNALQ-M-
-NALMSQA (score = -4)

MNALQM--
-NALMSQA (score = -5)
Smith-Waterman Alterations

To make the algorithm find best local alignment
Adjustments only to the scoring scheme for Si,j:
– When Si,j becomes negative, set it to zero
– So local paths are not penalised for earlier bad routes
The scoring scheme must include:
– Some negative scores for mismatches
To find best local alignment
– Find highest scoring matrix position (anywhere)
– And work backwards until a zero is reached
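The local variant differs from the global one in only a few lines. The sketch below reuses the example pair MNALQM and NALMSQA with a linear gap penalty of -6 and a hand-copied subset of BLOSUM62 (both assumptions carried over from the global example); `smith_waterman` is an illustrative name, and when several cells tie for the highest score it simply keeps the first one found.

```python
# BLOSUM62 values for the residue pairs in this example (symmetric matrix;
# each pair is stored once with its two letters in alphabetical order).
BLOSUM62_SUBSET = {
    "AA": 4, "AL": -1, "AM": -1, "AN": -2, "AQ": -1, "AS": 1,
    "LL": 4, "LM": 2, "LN": -3, "LQ": -2, "LS": -2,
    "MM": 5, "MN": -2, "MQ": 0, "MS": -1,
    "NN": 6, "NQ": 0, "NS": 1, "QQ": 5, "QS": 0, "SS": 4,
}


def sub_score(x, y):
    return BLOSUM62_SUBSET["".join(sorted((x, y)))]


def smith_waterman(a, b, gap=-6):
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay at zero: a local path may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,  # a negative score is reset to zero
                H[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:  # highest scoring position, anywhere
                best, best_pos = H[i][j], (i, j)
    # Work backwards from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("MNALQM", "NALMSQA"))  # (14, 'NAL', 'NAL')
```

For this pair the best local alignment is the shared NAL block (6 + 4 + 4 = 14), whereas the best global alignment of the same sequences scores 0 once the gaps are paid for.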
Local and Global Alignments

For illustration purposes only
– Calculations done slightly differently (don't worry)
Needleman & Wunsch: best global alignments
Smith & Waterman: best local alignments
Smith-Waterman-Eggert

To make the algorithm find n best alignments
Find best local alignment
Zero all cells used in the current alignment
Find the highest remaining score
– Second-best alignment
Repeat until results are below a threshold
An example database search
The Cystic Fibrosis Gene


Found by Lap-Chee Tsui and Francis Collins in 1989
Pure bioinformatics analysis
Cystic Fibrosis Gene: No complete mRNA/cDNA clone
Cystic Fibrosis Gene: DNA sequence -> predicted protein
Cystic Fibrosis Gene: Database search top hits
Cystic Fibrosis Gene: Top hit was 19% identity
Cystic Fibrosis Gene: Dotplot shows what 19% means
Cystic Fibrosis Gene: Protein matches itself
Cystic Fibrosis Gene: Hydrophobic features
– Window size 21 better than 9.
– Matches membrane size (12 helices marked)
Cystic Fibrosis Gene: Predicted structure diagram
Remember:
All this was derived from a predicted protein sequence.
No cDNA = no protein to experiment on!