Dynamic Programming - LCS and Edit Distance

Dynamic Programming, Continued
Longest Common Subsequence (LCS) and Edit Distance
Announcements

Exam 2 Grades are posted – the average was a 78, much higher
than last time. No curve since multiple people got a 100.

I calculated what final grade you would have if I had to give
grades today.
◦ This includes the lab programs IF you did them.
◦ Remember, there are still 2 assignments that haven’t been graded yet
and the final exam. So there is still room for you to bring your grade
up.

If you still haven’t completed Lab Program #1 and Lab
Program #2, remember that each is worth 2% of your grade.
◦ And we only have 3 lab programs left.
◦ The next 3 weeks will be the last 3 lab programming assignments,
11/3 and 11/10 will be dynamic programming and 11/17 will be
backtracking.
◦ The lab before Thanksgiving on 11/24 is cancelled.
◦ And the very last lab on 12/1 will be a review and an activity.
Algorithm Design Techniques

We will cover in this class:
◦ Greedy Algorithms
◦ Divide and Conquer Algorithms
◦ Dynamic Programming Algorithms
◦ Backtracking Algorithms

These are 4 common types of algorithms
used to solve problems.
◦ For many problems, it is quite likely that at
least one of these methods will work.
Dynamic Programming Intro

We have looked at several algorithms that
involve recursion.

In some situations, these algorithms solve
fairly difficult problems efficiently
◦ BUT in other cases they are inefficient because
they recalculate certain function values many
times.

The problems we’ve seen before are:
◦ The Fibonacci example
◦ The Change problem
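
The recalculation problem can be seen with the Fibonacci example. The following is a minimal sketch, not code from the lecture: the class name FibDemo, the call counter, and the use of a HashMap as the storage are all assumptions made for illustration. It compares plain recursion against a version that stores each answer the first time it is computed.

```java
import java.util.HashMap;
import java.util.Map;

class FibDemo {
    // Hypothetical counter, used only to compare the two versions.
    static long calls = 0;

    // Plain recursion: the same Fibonacci values are recomputed many times.
    static long fibNaive(int n) {
        calls++;
        if (n < 2) return n;
        return fibNaive(n - 1) + fibNaive(n - 2);
    }

    // Memoized recursion: each value is computed once and then looked up.
    static long fibMemo(int n, Map<Integer, Long> memo) {
        calls++;
        if (n < 2) return n;
        Long cached = memo.get(n);
        if (cached != null) return cached;
        long value = fibMemo(n - 1, memo) + fibMemo(n - 2, memo);
        memo.put(n, value);
        return value;
    }

    public static void main(String[] args) {
        calls = 0;
        long naive = fibNaive(25);
        long naiveCalls = calls;

        calls = 0;
        long memoized = fibMemo(25, new HashMap<>());
        long memoCalls = calls;

        System.out.println("naive:    " + naive + " using " + naiveCalls + " calls");
        System.out.println("memoized: " + memoized + " using " + memoCalls + " calls");
    }
}
```

Both versions return the same value; only the number of calls differs, which is the inefficiency dynamic programming is meant to remove.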
Longest Common Subsequence Problem

The problem is to find the longest common
subsequence in 2 given strings.

A subsequence of a string is simply some subset
of the letters in the whole string in the order
they appear in the string.
◦ For example, given the string “GOODMORNING”:
◦ “ODOR” is a subsequence, made up of the characters at indices 1, 3, 5, and 6 (counting from 0).
◦ “MOOD” is not a subsequence, since its characters do not appear in that order.
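
The definition of a subsequence can be checked mechanically. This is a small sketch under assumptions: the class name SubsequenceDemo and the method isSubsequence are hypothetical names, not part of the lecture code. It scans the whole string once and matches the candidate's characters in order.

```java
class SubsequenceDemo {
    // Returns true if every character of candidate appears in text
    // in the same order (the characters need not be adjacent).
    static boolean isSubsequence(String candidate, String text) {
        int matched = 0; // number of candidate characters matched so far
        for (int i = 0; i < text.length() && matched < candidate.length(); i++) {
            if (text.charAt(i) == candidate.charAt(matched)) {
                matched++;
            }
        }
        return matched == candidate.length();
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("ODOR", "GOODMORNING")); // true
        System.out.println(isSubsequence("MOOD", "GOODMORNING")); // false
    }
}
```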
Longest Common Subsequence

Recursive solution to the problem, for strings x and y:
◦ If the last characters of x and y match,
then the LCS = 1 + the LCS of both of the strings with their
last characters removed.
For example, “BIRD” and “FIND” both end in “D”, so
LCS(“BIRD” , “FIND”) = 1 + LCS(“BIR” , “FIN”)
◦ If the last characters of x and y do NOT match,
then the LCS will be one of two options:
1) The LCS of x and y without its last character.
2) The LCS of y and x without its last character.
We will then take the maximum of the 2 values.
For example, “BIR” and “FIN” end in different characters, so
LCS(“BIR” , “FIN”) = Max ( LCS(“BI” , “FIN”) , LCS(“BIR” , “FI”) )

(Also, we could just as easily have compared the first
characters of x and y and used a similar technique.)
// Assumes both x and y are non-empty.
public static int lcsrec(String x, String y)
{
    int len1 = x.length();
    int len2 = y.length();
    // If one of the strings has 1 character, search for that
    // character in the other string and return 0 or 1.
    if (len1 == 1)
        return find(x.charAt(0), y);
    if (len2 == 1)
        return find(y.charAt(0), x);
    // Solve the problem recursively.
    // Check if corresponding last characters match.
    if (x.charAt(len1-1) == y.charAt(len2-1))
        return 1 + lcsrec(x.substring(0, len1-1),
                          y.substring(0, len2-1));
    // Corresponding characters do not match.
    else
        return Math.max(lcsrec(x, y.substring(0, len2-1)),
                        lcsrec(x.substring(0, len1-1), y));
}

// Helper used by lcsrec. Its body was not shown on the slide, so this
// version is an assumption: return 1 if c occurs in s, 0 otherwise.
public static int find(char c, String s)
{
    return s.indexOf(c) >= 0 ? 1 : 0;
}
LCS – trace through recursive solution

LCS(“FUN” , “FIN”) = 2
◦ Last chars match: 1 + LCS(“FU”, “FI”) = 1 + 1 = 2

LCS(“FU”, “FI”) = 1
◦ Last chars do NOT match:
max( LCS(“FU”, “F”), LCS(“F”, “FI”) ) = max(1, 1) = 1

LCS(“FU”, “F”)
◦ Base case: return 1, since “F” is in “FU”
LCS(“F”, “FI”)
◦ Base case: return 1, since “F” is in “FI”
Longest Common Subsequence

Now, our goal will be to take this recursive
solution and build a dynamic programming
solution.
◦ The key here is to notice that the heart of each
recursive call is the pair of indexes, telling us which
prefix string we are considering.
◦ In some sense, we can build the answer to
"longer" LCS questions based on the answers to
smaller LCS questions.
◦ This can be seen by tracing through the last few steps
of the recursion.
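
One way to act on the "pair of indexes" observation is to keep the recursion but remember the answer for each pair. The sketch below is an assumption-laden illustration, not the lecture's code: the class name LcsMemo is hypothetical, the indexes i and j count how many leading characters of each string are being considered, and the base case uses an empty prefix (length 0) rather than the one-character base case of lcsrec.

```java
import java.util.Arrays;

class LcsMemo {
    // Entry point: LCS length of the full strings x and y.
    static int lcs(String x, String y) {
        // memo[i][j] = LCS length of the first i chars of x and the
        // first j chars of y, or -1 if not computed yet.
        int[][] memo = new int[x.length() + 1][y.length() + 1];
        for (int[] row : memo) Arrays.fill(row, -1);
        return lcs(x, y, x.length(), y.length(), memo);
    }

    static int lcs(String x, String y, int i, int j, int[][] memo) {
        if (i == 0 || j == 0) return 0;          // an empty prefix shares nothing
        if (memo[i][j] != -1) return memo[i][j]; // already answered this pair

        int result;
        if (x.charAt(i - 1) == y.charAt(j - 1)) {
            // Last characters match: count it and shorten both prefixes.
            result = 1 + lcs(x, y, i - 1, j - 1, memo);
        } else {
            // No match: drop the last character of one prefix or the other.
            result = Math.max(lcs(x, y, i, j - 1, memo),
                              lcs(x, y, i - 1, j, memo));
        }
        memo[i][j] = result;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lcs("FUN", "FIN"));       // 2
        System.out.println(lcs("RACECAR", "CREAM")); // 3
    }
}
```

Each pair (i, j) is solved at most once, so the work is bounded by the size of the memo array instead of growing with the number of recursive paths.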
Longest Common Subsequence

If we make the recursive call on the strings
RACECAR and CREAM,
◦ once we have the answers to the recursive calls
for inputs RACECAR and CREA and the inputs
RACECA and CREAM,
◦ we can use those two answers and immediately
take the maximum of the two to solve our
problem!
Longest Common Subsequence

Thus, think of storing the answers to
these recursive calls in a table, such as
this:

       C    R    E    A    M
  R
  A
  C                   XXX
  E
  C
  A
  R

In this chart, for example, the slot with the
XXX will store an integer that represents the length of the
longest common subsequence of CREA and
RAC. (In this case, 2.)
Longest Common Subsequence

To build the table, first initialize the first row
and column.
◦ Basically, we search for the first letter of one string in the
other string. Where we find it, we put a 1, and all
subsequent values in that row or column
are also 1. The values before it are 0.
◦ This corresponds to the base case in the recursive
code.

       C    R    E    A    M
  R    0    1    1    1    1
  A    0
  C    1
  E    1
  C    1
  A    1
  R    1

Longest Common Subsequence

Now, we simply fill out the chart according to the recursive rule:
1) Check to see if the "last" characters match.
◦ Recursive: If so, delete this character from both strings, take the LCS of what's left, and add 1
to it.
◦ Dynamic Programming: Look up&left in the table, add 1 to
that.
2) If not, then we try 2 possibilities, and take the maximum of those 2
possibilities.
◦ Recursive: (These possibilities are simply taking the LCS of the
whole first word and the second word minus the last letter, and vice
versa.)
◦ Dynamic Programming: Max ( cell to the left , cell above )

       C    R    E    A    M
  R    0    1    1    1    1
  A    0    1    1    2    2
  C    1    1    1    2    2
  E    1    1    2    2    2
  C    1    1    2    2    2
  A    1    1    2    3    3
  R    1    2    2    3    3

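
The table-filling steps above can be sketched as follows. This is only an illustration that mirrors the chart layout (no extra row or column for empty strings); the class name LcsTable and the method build are hypothetical, and the sketch assumes both strings are non-empty.

```java
class LcsTable {
    // table[i][j] = LCS length of the first i+1 chars of rows and the
    // first j+1 chars of cols. Assumes both strings are non-empty.
    static int[][] build(String rows, String cols) {
        int n = rows.length();
        int m = cols.length();
        int[][] table = new int[n][m];

        // First row: 0 until the first letter of rows is found in cols, then 1.
        boolean seen = false;
        for (int j = 0; j < m; j++) {
            if (cols.charAt(j) == rows.charAt(0)) seen = true;
            table[0][j] = seen ? 1 : 0;
        }
        // First column: 0 until the first letter of cols is found in rows, then 1.
        seen = false;
        for (int i = 0; i < n; i++) {
            if (rows.charAt(i) == cols.charAt(0)) seen = true;
            table[i][0] = seen ? 1 : 0;
        }
        // Remaining cells: follow the recursive rule.
        for (int i = 1; i < n; i++) {
            for (int j = 1; j < m; j++) {
                if (rows.charAt(i) == cols.charAt(j)) {
                    table[i][j] = table[i - 1][j - 1] + 1;   // up&left, plus 1
                } else {
                    table[i][j] = Math.max(table[i][j - 1],  // cell to the left
                                           table[i - 1][j]); // cell above
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[][] table = build("RACECAR", "CREAM");
        for (int[] row : table) {
            StringBuilder line = new StringBuilder();
            for (int value : row) line.append(value).append(' ');
            System.out.println(line.toString().trim());
        }
    }
}
```

The bottom-right cell holds the LCS length of the two full strings.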
Longest Common Subsequence

Dynamic Programming Code…
public static int computeLCS(String a, String b)
{
    int[][] lengths = new int[a.length()+1][b.length()+1];
    // Row 0 and column 0 (the empty prefixes) are init. to 0 already.
    // lengths[i][j] is the LCS length of the first i characters of a
    // and the first j characters of b, so the characters compared
    // are a.charAt(i-1) and b.charAt(j-1).
    for (int i = 1; i <= a.length(); i++)
        for (int j = 1; j <= b.length(); j++)
            if (a.charAt(i-1) == b.charAt(j-1))
                lengths[i][j] = lengths[i-1][j-1] + 1;
            else
                lengths[i][j] = Math.max(lengths[i][j-1], lengths[i-1][j]);
    return lengths[a.length()][b.length()];
}
Longest Common Subsequence

Example on the board…
Edit Distance

The Edit Distance (or Levenshtein distance)
is a metric for measuring the amount of
difference between two strings.
◦ The Edit Distance is defined as the minimum number
of edits needed to transform one string into the
other.

It has many applications, such as spell checkers,
natural language translation, and bioinformatics.
◦ Part B of Assignment #4 is an example of one
application in Bioinformatics, measuring the amount
of difference between two DNA sequences.
Edit Distance

The problem of finding an edit distance
between two strings is as follows:
◦ Given an initial string s, and a target string t,
what is the minimum number of changes that have
to be applied to s to turn it into t ?

The valid changes are:
1) Inserting a character
2) Deleting a character
3) Changing a character to another character.
Edit Distance

You may think there are too many recursive
cases. We could insert a character in quite a
few locations!
◦ (If the string is length n, then we can insert a
character in n+1 locations.)
◦ However, the key observation that leads to a
recursive solution to the problem is that
ultimately, the last characters will have to match.
Edit Distance

So, when matching one word to another one, consider the last
characters of strings s and t.

If we are lucky enough that they ALREADY match,
◦ then we can simply "cancel" and recursively find the edit distance
between the two strings left when we delete this last character
from both strings.

Otherwise, we MUST make one of three changes:
1) delete the last character of string s
2) delete the last character of string t
3) change the last character of string s to the last character of string
t.
◦ Also, in our recursive solution, we must note that the edit distance
between the empty string and another string is the length of the
second string. (This corresponds to having to insert each letter for
the transformation.)
Edit Distance

So, an outline of our recursive solution is as follows:
1) If either string is empty, return the length of the other string.
2) If the last characters of both strings match, recursively find
the edit distance between each of the strings without that
last character.
3) If they don't match, then return 1 + the minimum value of the
following three choices:
a) Recursive call with the string s w/o its last character and the string t
b) Recursive call with the string s and the string t w/o its last character
c) Recursive call with the string s w/o its last character and the string t
w/o its last character.
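
The outline above can be sketched directly in Java. This is a minimal illustration, not the lecture's own code: the class name EditDistanceRec and the local variable names are assumptions. Like the recursive LCS, it recomputes the same subproblems many times, so it is only practical for short strings.

```java
class EditDistanceRec {
    static int editDistance(String s, String t) {
        // 1) If either string is empty, the answer is the other's length.
        if (s.isEmpty()) return t.length();
        if (t.isEmpty()) return s.length();

        String sRest = s.substring(0, s.length() - 1); // s w/o its last character
        String tRest = t.substring(0, t.length() - 1); // t w/o its last character

        // 2) Last characters match: no edit needed for them.
        if (s.charAt(s.length() - 1) == t.charAt(t.length() - 1)) {
            return editDistance(sRest, tRest);
        }

        // 3) Otherwise, 1 + the best of the three choices.
        int a = editDistance(sRest, t);     // choice a
        int b = editDistance(s, tRest);     // choice b
        int c = editDistance(sRest, tRest); // choice c
        return 1 + Math.min(a, Math.min(b, c));
    }

    public static void main(String[] args) {
        System.out.println(editDistance("hello", "keep")); // 4
        System.out.println(editDistance("he", "keep"));    // 3
    }
}
```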
Edit Distance

Recursive Example on the board…
Edit Distance

Now, how do we use this to create a DP
solution?
◦ We simply need to store the answers to all
the possible recursive calls.
◦ In particular, all the possible recursive calls we
are interested in are determining the edit
distance between prefixes of s and t.
Edit Distance

Consider the following example with s="hello" and
t="keep".
◦ To deal with empty strings, an extra row and column
have been added to the chart below:

            k    e    e    p
       0    1    2    3    4
  h    1    1    2    3    4
  e    2    2    1    2   [3]
  l    3    3    2    2    3
  l    4    4    3    3    3
  o    5    5    4    4    4

An entry in this table simply holds the edit distance
between two prefixes of the two strings.
◦ For example, the bracketed square (row e, column p) indicates that the
edit distance between the strings "he" and "keep" is 3.
Edit Distance

In order to fill in all the values in this table we
will do the following:
1) Initialize values corresponding to the base case in
the recursive solution.

            k    e    e    p
       0    1    2    3    4
  h    1
  e    2
  l    3
  l    4
  o    5

Edit Distance

In order to fill in all the values in this table we
will do the following:
2) Loop through the table from the top left to the
bottom right. In doing so, simply follow the recursive
solution.
◦ If the characters you are looking at match,
just bring down the up&left value.
◦ Else the characters don't match: min ( 1 + above, 1 + left, 1 + diag up )

Partially filled table:

            k    e    e    p
       0    1    2    3    4
  h    1    1    2
  e    2    2
  l    3    3
  l    4    4
  o    5    5

The next square (row e, first column e) is a match, so the up&left value, 1, is brought down:

            k    e    e    p
       0    1    2    3    4
  h    1    1    2
  e    2    2    1
  l    3    3
  l    4    4
  o    5    5
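
The two table-filling steps can be sketched in Java as follows. This is an illustration under assumptions: the class name EditDistanceDp and the array name dist are hypothetical, and the loops fill the table row by row, which is one valid order because every cell only needs the cells above, to the left, and diagonally up&left.

```java
class EditDistanceDp {
    static int editDistance(String s, String t) {
        // dist[i][j] = edit distance between the first i characters of s
        // and the first j characters of t.
        int[][] dist = new int[s.length() + 1][t.length() + 1];

        // Step 1: base cases (one of the prefixes is empty).
        for (int i = 0; i <= s.length(); i++) dist[i][0] = i;
        for (int j = 0; j <= t.length(); j++) dist[0][j] = j;

        // Step 2: fill the rest of the table following the recursive rule.
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    dist[i][j] = dist[i - 1][j - 1];              // bring down up&left
                } else {
                    int above = dist[i - 1][j];
                    int left = dist[i][j - 1];
                    int diag = dist[i - 1][j - 1];
                    dist[i][j] = 1 + Math.min(above, Math.min(left, diag));
                }
            }
        }
        return dist[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("hello", "keep")); // 4
    }
}
```

For s="hello" and t="keep", the value returned is the bottom-right entry of the chart.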
Edit Distance

Dynamic Programming Example on the
Board…
References
Slides adapted from Arup Guha’s Computer
Science II Lecture notes:
http://www.cs.ucf.edu/~dmarino/ucf/cop3503/lectures/

Additional material from the textbook:
Data Structures and Algorithm Analysis in Java (Second
Edition) by Mark Allen Weiss

Additional images:
www.wikipedia.com
xkcd.com