Transcript Slide 1

Pairwise Alignment, Part II
Reconstructing the Path and Calculating the LCS
from the Values and Directions Tables
SETUP: Part II involves using the Visual Basic Editor (VBE)
to write VBA Function & Subroutines, and creating Buttons.
1. Add the Developer tab to
the ribbon. Click the Office
button
, and then click the
Excel Options button at the
bottom. In the Popular
category, select the Show
Developer tab in the Ribbon
check box from the options
that appear and then click OK.
2. Click the Developer tab
in the ribbon whenever
you want to access the
Visual Basic button in the
Code group.
Note: You can switch back and forth between workbook environment and the Visual Basic
Editor (VBE) environment with the Alt+F11 combination or with the application tabs at the
bottom of the Windows screen.
Excel 2007 comes with a new file format: Macro-Enabled Workbook that uses
the file extension .xlsm. You can no longer run macros and VBA scripts in regular
Excel files, i.e. those that have the extension .xlsx . Therefore you will need to
click the Office button
, select Save As  Excel Macro-Enabled Workbook to
save your spreadsheet so that it will run the scripts that you will now be adding.
Open the Visual Basic Editor (VBE) and
select View → Project Explorer (Ctrl+R).
You will now see a listing of the
2 worksheets – Values and
Directions – in your workbook.
Double-click on the
Values item to view the
Code Section.
Visual Basic code uses the
worksheet in which it
resides as its default
location when determining
cell references. You will be
placing the VB code into
the Values sheet.
in the VBA Editor, copy and paste the maxAddress function and 2 subroutines clearLCS
and showLCS on this slide and the next into the Code Section of the Values Worksheet.
(We will explain what the VB script code does after you get it working)
Option Explicit
Function maxAddress(The_Range)
Dim maxNum As Long
Dim Cell As range
maxNum = Application.Max(The_Range)
' Returns the LAST cell in the range that = maxNum
For Each Cell In The_Range
If Cell = maxNum Then
maxAddress = Cell.Address
End If
Next Cell
End Function
Sub clearLCS()
range("C10:M20").Interior.ColorIndex = xlColorIndexNone
Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone
range("C4:C6").Value = ""
End Sub
Sub showLCS()
Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String
Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long
Dim bMATCH, bLEFT, bUP As Boolean
' CONSTANT declarations
UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow
LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow
BACKSLASH = ChrW(92) 'Hex 5C = Dec 92
CLR_GRAY = RGB(192, 192, 192)
ROW_LTR = 9
COL_LTR = 2
VALUES_TABLE_DATA_RANGE = "D11:M20"
maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE))
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
End Sub
You will now insert 2 buttons in the
Values worksheet that will float in the
space above the cells A4:A6.
In the Developer tab, select
Insert, then choose the
Button (Form Control).
Draw the outlines of your
buttons with the mouse.
When the Assign Macro Dialog
appears, select Sheet1.showLCS.
Name the button showLCS.
Create a 2nd button, assign the
macro Sheet1.clearLCS and name
the button clearLCS.
When you click on the showLCS button, if Seq 1 is ATCTGAT and Seq 2 is TGCATA,
then the data table cells should be highlighted in grey as in the picture below.
The clearLCS button should remove the background color.
Reconstruction of the Path
We reconstruct the LCS path by locating the cell containing the largest
value in the Values table, then following the arrows back to a cell
containing the value zero. If there are several cells with the largest
value, all are possible LCS’s. However, in this exercise, we will pick the
one that is most distant.
If the beginning letter(s) of the 2 sequences did not match, then at
this point, we have only reached the beginning letter of one of the
sequences (the T in the vertical sequence).
In order to reach the beginning letter of the 2nd sequence (the A in the
horizontal sequence), we need to continue traveling either left or up.
What does the VBA code DO? What does it MEAN?
Option Explicit
Function maxAddress(The_Range)
Dim maxNum As Long
Dim Cell As range
maxNum = Application.Max(The_Range)
' Returns the LAST cell in the range that = maxNum
For Each Cell In The_Range
If Cell = maxNum Then
maxAddress = Cell.Address
End If
Next Cell
End Function
Option Explicit: In most programming languages, variables must be DECLARED
before they are used. Certain scripting languages, like VBA and Javascript, allow
programmers to use variables without declaring them. This can lead to huge problems
when trying to track down bugs, since a simple misspelling of a variable name creates a
new variable, different from the one intended. The Option Explicit declaration is a
safeguard measure that will show the programmer a compile error if VBA encounters a
variable that has not previously been declared.
Functions and the function maxAddress
Option Explicit
Function maxAddress(The_Range)
Dim maxNum As Long
Dim Cell As range
maxNum = Application.Max(The_Range)
' Returns the LAST cell in the range that = maxNum
For Each Cell In The_Range
If Cell = maxNum Then
maxAddress = Cell.Address
End If
Next Cell
End Function
Function: In VBA, a FUNCTION returns a value, while a SUBROUTINE does not.
One implements the return value of a function by assigning the function NAME a value
somewhere in the function’s body. No explicit return statement is used.
The maxAddress function 1st uses the common Excel worksheet function max() to find
the maximum value in a range and assign it to maxNum.
It then uses a for-each loop (analogous to a Java for-each loop) to (a) iterate through all of
the cells in the range and (b) return the address (cell reference) of the LAST cell it finds
whose value equals maxNum.
NOTE: If one wanted to return the 1st cell that the for-each loop found equal to maxNum,
one would place an EXIT FUNCTION or EXIT FOR statement at the end of the IF statement.
Subroutines and the subroutine clearLCS
Sub clearLCS()
range("C10:M20").Interior.ColorIndex = xlColorIndexNone
Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone
range("C4:C6").Value = ""
End Sub
Subroutine: In VBA, a SUBROUTINE does not return a value, but technically speaking
is just like a FUNCTION. Subroutines, however, are generally written to perform actions.
In contrast, functions are meant to calculate and return values, but not perform actions.
The clearLCS() subroutine clears the background color of all cells in the indicated range. It
does so not only in the default worksheet (i.e. the VALUES worksheet, line 1 of the
subroutine body), but in the DIRECTIONS worksheets as well (line 2).
Line 3 of the subroutine body clears any text in the range C4:C6, although we haven’t yet
placed anything in these cells. Later, we will place strings in these cells that will show the
alignment and matching bases for the 2 sequences.
The subroutine showLCS
Sub showLCS()
Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String
Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long
Dim bMATCH, bLEFT, bUP As Boolean
' CONSTANT declarations
UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow
LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow
BACKSLASH = ChrW(92) 'Hex 5C = Dec 92
CLR_GRAY = RGB(192, 192, 192)
ROW_LTR = 9
COL_LTR = 2
VALUES_TABLE_DATA_RANGE = "D11:M20"
maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE))
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
The code above shows the initial statements for the showLCS() subroutine, that is, the
statements that precede the do while loop.
The 1st section in showLCS() contains the Dim statements, which is where the local
variables are declared.
Notice that some of these variables are capitalized to show that they are CONSTANTS, i.e.
the values they are first assigned will not change during the execution of the program.
The next command uses the maxAddress function to fetch the cell reference of the cell in
the Values table that contains the largest value and is closest to the end of the table.
Finally, we extract the row and column numbers from the maximum value cell address.
The subroutine showLCS (continued)
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
End Sub
We’ll now examine the code within showLCS’s Do While Loop.
The 1st section in the Do While Loop shades the background color of the last cell In the
Values table whose value equals the table’s maximum (i.e. the cell we start with to
retrace the path), as well as the corresponding cell in the Directions table. It also
retrieves the direction from the latter cell.
The direction variable is used to set the values for the 3 boolean variables: bMATCH,
bLEFT and bUP, depending upon whether the direction indicates diagonal, left or up.
The 2nd part of bLEFT’s and bUP’s boolean Or expressions is not used until the loop has
reached the beginning of one of the 2 sequences.
The subroutine showLCS (continued)
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
End Sub
If the direction value is diagonal, then bMATCH is true. In this case, we decrement both
the row (r) and the column (c) variables so that in the next iteration, the cell we will
examine will be the cell located diagonally to the current cell’s top left.
If the direction value is left, then bLEFT is true. In this case, we decrement just the
column (c) variable so that in the next iteration, we will examine the cell to the left of the
current cell.
Finally, if the direction value is up, then bUP is true. In this case, we decrement just the
row (r) variable so that in the next iteration, we will examine the cell above the current
cell.
The subroutine showLCS (continued)
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
ROW_LTR +
= 1
9
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
COL_LTR +
= 1
2
End Sub
The subroutine keeps looping as long as c > COL_LTR + 1 Or r > ROW_LTR + 1
COL_LTR and ROW_LTR are the column and row
containing the letters of the 2 sequences
Hence COL_LTR + 1 and ROW_LTR + 1 are the
column and row that contain the zeroes.
Therefore, the loop will keep iterating until r and c
correspond to cell C10 (or cells above it or to its left).
= 10
= 3
The subroutine showLCS (continued)
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
r = 3 ( ROW_LTR + 1 )
c = c - 1
ElseIf bLEFT Then
c = 5 ( > COL _LTR + 1)
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
End Sub
If the boolean expressions for bLEFT and bUP consisted only of the 1st condition, then r and
c would stop being decremented when either the row or column (or both) reached the
cells containing zeroes.
When the loop reaches one of the cell’s containing a zero, the direction variable will
contain a “0”, not one of the 3 direction values. At this point, we have reached the
beginning of one of the sequences (in the table above, the blue sequence). Therefore to
continue moving left (to get to the beginning of the red sequence), we need to add an
expression that recognizes this case.
If we want to move left, that means that c > COL_LTR + 1 (since at this point r = ROW_LTR + 1).
Similarly, had we wanted to move up, that would mean that r > ROW_LTR + 1.
Reconstructing the LCS Letter Sequence
The bases that the 2 sequences have in common are in the cells containing backslashes.
In order to compare the 2 sequences, we need to line up the strings next to each other and
show both the bases that they have in common, and where they differ. To do this, we will
declare 3 string variables: strSeq1, strSeq2 and strLCS.
Open VBA. Add the Dim statement below to showLCS() to declare these 3 String variables.
Place in the section with the other variable declarations.
Dim strSeq1, strSeq2, strLCS As String
Reconstructing the LCS Letter Sequence (continued)
r = range(maxCellAddress).Row
c = range(maxCellAddress).Column
strSeq1 = "" : strSeq2 = "" : strLCS = "" ' Initialize to empty strings
Do
Cells(r, c).Interior.Color = CLR_GRAY
Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY
direction = Worksheets("Directions").Cells(r, c).Value
bMATCH = (direction = BACKSLASH)
bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1)
bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1)
If bMATCH Then
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1
range("C4").Value = strSeq1
range("C5").Value = strLCS
range("C6").Value = strSeq2
Initialize the strings so they are empty (the colon character allows you to put multiple
statements on a single line). Place this line just before the Do-While Loop.
After the Do-While Loop, add the statements for pasting these 3 strings into the Values
worksheet in the 3 cells C4:C6. (Recall that these were the cells on line 3 of the clearLCS()
subroutine body.)
Reconstructing the LCS Letter Sequence (continued)
If bMATCH Then
strSeq1 = Cells(ROW_LTR, c).Value + strSeq1
strSeq2 = Cells(r, COL_LTR).Value + strSeq2
strLCS = Cells(r, COL_LTR).Value + strLCS
r = r - 1
c = c - 1
ElseIf bLEFT Then
c = c - 1
ElseIf bUP Then
r = r - 1
End If
Add the 3 lines above to the body of the If bMATCH statement. These statements will
PREPEND the letter from the sequence into the string. We need to PREPEND each letter
encountered because we are building the LCS Letter Sequence in reverse, from tail to head.
Note that to capture the letter from the 1st sequence – as we move backwards from
column to column, we are always referencing the letter from the ROW_LTR row (row 9).
To capture the letter from the 2nd sequence – as we move backwards from row to row, we
are always referencing the letter from the COL_LTR column (column 2).
For the string that will show the common letters shared by the 2 aligned sequences, it
doesn’t matter whether we use cell(ROW_LTR, C) or cell(r, COL_LTR), because the letters in
both of these cells match.
Note the importance of placing these 3 new statements BEFORE the lines that decrement
r and c; otherwise, we would be extracting letters from the wrong cells!
Reconstructing the LCS Letter Sequence (continued)
Click on the showLCS button.
You should see the LCS (Longest Common
Subsequence) appear in each of the cells
C4:C6.
Format C4 red and C6 blue to visually connect
each to their original sequence. Although the
text that all 3 contain is identical right now,
we will be making changes that will make
each distinct.
Also format these 3 cells so that they use the
Courier New font (size 14
recommended). Courier New is a FIXEDWIDTH font, meaning that all of its characters
are the same width (unlike most other fonts).
Using this font will prove extremely useful in
viewing similarities and differences between
the 2 sequences because corresponding
letters in the 2 strings will automatically line
up vertically.
Reconstructing the LCS Letter Sequence (continued)
If bMATCH Then
strSeq1 = Cells(ROW_LTR, c).Value + strSeq1
strSeq2 = Cells(r, COL_LTR).Value + strSeq2
strLCS = Cells(r, COL_LTR).Value + strLCS
r = r - 1
c = c - 1
ElseIf bLEFT Then
strSeq1 = Cells(ROW_LTR, c).Value + strSeq1
strSeq2 = "-" + strSeq2
strLCS = "-" + strLCS
c = c - 1
ElseIf bUP Then
strSeq1 = "+" + strSeq1
strSeq2 = Cells(r, COL_LTR).Value + strSeq2
strLCS = "+" + strLCS
r = r - 1
End If
Add each of the 3 line code sections above to the If bLEFT and If bUP statements, exactly as
you just did with the bMATCH statement.
Recall that when there is a match, we captured the shared letters in all 3 strings.
When we travel LEFT, changing columns, we want to copy the letters from Sequence 1 in the
ROW_LTR (9th) row into strSeq1.
At the same time, we need to place a character in both strSeq2 and strLCS to preserve the
alignment so that corresponding characters in all 3 strings will coincide. Therefore we add
the "–" character to indicate a deletion.
Reconstructing the LCS Letter Sequence (continued)
If bMATCH Then
strSeq1 = Cells(ROW_LTR, c).Value + strSeq1
strSeq2 = Cells(r, COL_LTR).Value + strSeq2
strLCS = Cells(r, COL_LTR).Value + strLCS
r = r - 1
c = c - 1
ElseIf bLEFT Then
strSeq1 = Cells(ROW_LTR, c).Value + strSeq1
strSeq2 = "-" + strSeq2
strLCS = "-" + strLCS
c = c - 1
ElseIf bUP Then
strSeq1 = "+" + strSeq1
strSeq2 = Cells(r, COL_LTR).Value + strSeq2
strLCS = "+" + strLCS
r = r - 1
End If
When we travel UP, changing rows, we want to copy the letters from Sequence 2 in the
COL_LTR (2nd) column into strSeq2.
At the same time, we need to place a character in both strSeq1 and strLCS to preserve the
alignment so that corresponding characters in all 3 strings will coincide. Therefore we add
the "+" character to indicate an insertion.
The indel (insertion/deletion) convention that we’ve implemented corresponds to a method
for CONVERTING Seq 1 to Seq 2. Had we wanted an implementation that would have
converted Seq 2 to Seq 1, we would have reversed our usage of the + and - characters.
Reconstructing the LCS Letter Sequence (continued)
When completed, clicking on the
showLCS button when
Sequence 1 = ATCTGAT and
Sequence 2 = TGCATA will show
a pairwise alignment as in the
image to the left.
Notice the usage of the plus and
minus characters. To CONVERT
Seq 1 to Seq 2, we need to delete
the 1st three characters of Seq 1
(ATC), hence 3 minus signs appear
at the beginning of strLCS (and at
the beginning of strSeq2 to align
it with strSeq1).
When we reach the 6th and 9th
positions, we need to insert two
characters (C and A), hence our
use of the 2 plus signs.
strSeq1
strLCS
strSeq2