$29
1.1 Problem statement
We saw in lecture that ‘di ng’ two strings can be accomplished with dynamic programming: Lecture 2 - P10. In this problem we will solve a more general version. We would like to align two strings, s and t, in a way to produce a minimum-cost alignment.
To produce an alignment on two strings s and t, we insert the special character ‘-’ some number of times into each string to produce align s and align t so that:
align s and align t have the same length, and
There is no i such that align s[i] and align t[i] are both ‘-’.
The cost of an alignment is given by a cost function, which we will call cost. The cost of an alignment is the sum over all i of cost(align s[i], align t[i]). The cost of aligning a letter with itself is always 0.
For example, given this cost table:
Letter from s
a b c -
Letter from t
a
0
1
5
3
b
5
0
5
3
c
3
4
0
3
-
2
2
1
The minimum-cost alignment between string s = ‘acb’ and t = ‘baa’ is given by:
1
CS 5112 Algorithms and Data Structures for Applications
align_s: -acb
align_t: ba-a
With a total cost of 5, which is lower than any other alignment.
This problem arises, for example, in DNA sequencing; given two strands of DNA, there are many sequences of mutations (insertions, deletions, etc.) which would have transformed one to the other; we would like to nd the most probable. We know that certain mutations are more likely than other, and these probabilities are re ected in the cost table(the pair with higher likelihood has a lower cost).
1.2 TODO
In diffing.py:
ll cell . You should ll in fill cell(table, i, j, s, t, cost), which will be called once per cell. table is the table as lled in so far; use its get method to nd these values. Your function should not mutate the table directly, but rather return a DiffingTableCell. cost is a function which encodes a cost function: it takes two characters and returns their cost. Each call to fill cell should call it no more than four times. fill cell should run in O(1) time.
cell ordering . You should ll in cell ordering(n,m), which will be called once. n and m are the lengths of the strings s and t. cell ordering should return a list of two-tuples; these are the coordinates of cells in the order you would like fill cell to be called. No cell should be repeated, but you do not need to use all cells.
di from table . You should ll in diff from table(s,t, table). s is the string being respaced, and table is the table that is already lled in. diff from table should run in O(n + m) time, and make no calls to cost.
Once you have nished implementing these 3 functions, you can do a quick sanity check by running the main given at the bottom of diffing.py which puts all the pieces together with our dynamic programming framework. See if the result is coherent with the example we gave you in this document.
Test harness To help you concentrate more on the actual algorithms part, we have also pro-vided you a test harness test diffing.py that runs tests on your diffing.py implementation using diffing tests.txt as the input test le. It is hence your responsibility to make sure your code works before submission. However, the test harness and the test cases we gave you are non-exhaustive, meaning passing these tests alone does not necessarily guarantee a perfect score.
• Respacing
2.1 Problem statement
The respacing problem is to put spaces back into a string that has lost them, given a dictionary. For example, given the string \itwasthebestoftimes" and an English dictionary, we would like to reconstruct the original sentence: \it was the best of times".
2.2 TODO
In respacing.py:
2
CS 5112 Algorithms and Data Structures for Applications
ll cell . You should ll in fill cell(T, i, j, string, is word), which will be called once per cell. T is the table as lled in so far; use its get method to nd these values. Your function should not mutate T directly, but rather return a RespaceTableCell. is word is a function which encodes a dictionary: given a string, it returns true or false, according to whether the string is a word. Each call to fill cell should call it no more than once. fill cell should run in O(N) time.
cell ordering . You should ll in cell ordering(N), which will be called once. N is the length of the string to be respaced. cell ordering should return a list of two-tuples; these are the coordinates of cells in the order you would like fill cell to be called. No cell should be repeated, but you do not need to use all cells.
respace from table . You should ll in respace from table(s, table). s is the string being respaced, and table is the table that is already lled in. respace from table should run in O(N2) time, and make no calls to is word.
Just like the rst problem, an example for putting these all together with our dynamic programming framework is given in the main at the bottom of respacing.py.
Test harness Also we have also provided you a non-exhaustive test harness test respace.py that runs tests on your respacing.py implementation using respace tests.txt as the input test le. Passing these tests alone does not necessarily guarantee a perfect score.
• Dynamic Programming Framework
We have provided the skeleton code for you, in dynamic programming.py, diffing.py, and respacing.py. It is designed to let you just think about the algorithm and worry less about the software engineering stu , while also ensuring that the algorithm you design is in fact a dynamic program. You will write some code to ll in a dynamic programming table, and then to compute the nal answer from this table.
The table for respacing is (N + 1) (N + 1), where N is the length of the word to be respaced. Each cell may contain a single boolean and a single integer.
The table for di ng is (jsj + 1) (jtj + 1), where s and t are the strings to be di ed and jsj stands for the length of string s. Each cell may contain a single integer and two characters (a character is a length-1 string).
• Testing and logistics
4.1 Outside resources
We are asking you to come up with new algorithms using dynamic programming. We expect that nding an appropriate optimal substructure will be the bulk of the work { the code is fairly short once you have found one. You should NOT use or seek any resources that describe how to solve these problems, even in pseudocode.
4.2 Testing
In addition to the provided tests, we encourage you to write additional ones because in real-life software development, you will eventually have to learn to write tests like this yourself and I recommend reading up on Richard’s slides from last year if you want to know more.
3
CS 5112 Algorithms and Data Structures for Applications
4.3 Python etc.
You can assume that all strings (words in respacing, input to respacing, and strings in di ng) consist of lowercase letters a,b,c,...z, have length at least 1, and are not None. Please use python2.
Please do not use any additional imports.
You will need to submit your diffing.py and respacing.py on CMS. Please be careful not to swap them by accident!
4