1. (45 pts) Recall that the string alignment problem takes as input two strings x and y, composed of symbols $x_i, y_j \in \Sigma$, for a fixed symbol set $\Sigma$, and returns a minimal-cost set of edit operations for transforming the string x into string y.
Let x contain nx symbols, let y contain ny symbols, and let the set of edit operations be those defined in the lecture notes (substitution, insertion, deletion, and transposition).
Let the cost of indel be 1, the cost of swap be 13 (plus the cost of the two sub ops), and the cost of sub be 12, except when $x_i = y_j$, which is a “no-op” and has cost 0. In this problem, we will implement and apply three functions.
(i) alignStrings(x,y) takes as input two ASCII strings x and y, and runs a dynamic programming algorithm to return the cost matrix S, which contains the optimal costs for all the subproblems for aligning these two strings.
alignStrings(x,y) :   // x,y are ASCII strings
   S = table of length nx by ny   // for memoizing the subproblem costs
   initialize S   // fill in the base cases
   for i = 1 to nx {
      for j = 1 to ny {
         S[i,j] = cost(i,j)   // optimal cost for x[0..i] and y[0..j]
   }}
   return S
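As a rough illustration of the table-filling step, here is a minimal Python sketch. It assumes the usual recurrence over deletion, insertion, substitution (or no-op), and transposition, with a swap charged as the swap cost plus the two substitutions that follow it. The lecture notes are the authority on the exact recurrence; the name `align_strings`, the keyword parameters, and the helper `sub_cost` are hypothetical.

```python
def align_strings(x, y, indel=1, sub=12, swap=13):
    """Return the cost table S, where S[i][j] is the assumed optimal cost
    of aligning the first i symbols of x with the first j symbols of y."""
    nx, ny = len(x), len(y)

    def sub_cost(a, b):
        # A substitution of a symbol by itself is a no-op with cost 0.
        return 0 if a == b else sub

    # (nx+1) x (ny+1) so that row 0 and column 0 can hold the base cases.
    S = [[0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(1, nx + 1):
        S[i][0] = i * indel          # delete the first i symbols of x
    for j in range(1, ny + 1):
        S[0][j] = j * indel          # insert the first j symbols of y

    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            best = min(
                S[i - 1][j] + indel,                             # deletion
                S[i][j - 1] + indel,                             # insertion
                S[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # sub / no-op
            )
            if i >= 2 and j >= 2:
                # Assumed swap rule: transpose the last two symbols of x,
                # then pay for the two substitutions after the swap.
                swapped = (S[i - 2][j - 2] + swap
                           + sub_cost(x[i - 1], y[j - 2])
                           + sub_cost(x[i - 2], y[j - 1]))
                best = min(best, swapped)
            S[i][j] = best
    return S
```

The sketch allocates one extra row and column compared with the "nx by ny" table in the pseudocode, so that S[0][0] and the other base cases have a place to live. Passing `indel=1, sub=1, swap=1` gives the unit-cost variant mentioned in the hint.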
(ii) extractAlignment(S,x,y) takes as input an optimal cost matrix S and strings x, y, and returns a vector a that represents an optimal sequence of edit operations to convert x into y. This optimal sequence is recovered by finding a path on the implicit DAG of decisions made by alignStrings to obtain the value S[nx, ny], starting from S[0, 0].
extractAlignment(S,x,y) :   // S is an optimal cost matrix from alignStrings
   initialize a   // empty vector of edit operations
   [i,j] = [nx,ny]   // initialize the search for a path to S[0,0]
   while i > 0 or j > 0 {
      a[i] = determineOptimalOp(S,i,j,x,y)   // what was an optimal choice?
      [i,j] = updateIndices(S,i,j,a)   // move to next position
   }
   return a
When storing the sequence of edit operations in a, use a special symbol to denote no-ops.
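The backward walk described above might be sketched in Python as follows. This is one possible reading, not a reference solution: it assumes S was filled with the same costs and the same swap rule (swap cost plus two substitutions), and it encodes operations as the hypothetical strings `"="` (no-op), `"sub"`, `"ins"`, `"del"`, and `"swap"`. A `"swap"` entry here stands for the transposition together with its two follow-up substitutions.

```python
import random

def extract_alignment(S, x, y, indel=1, sub=12, swap=13, rng=random):
    """Walk from S[nx][ny] back to S[0][0], choosing uniformly at random
    among the moves that explain the value in the current cell."""
    def sub_cost(a, b):
        return 0 if a == b else sub

    ops = []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        # Collect every (op, next_i, next_j) consistent with S[i][j].
        choices = []
        if i > 0 and S[i][j] == S[i - 1][j] + indel:
            choices.append(("del", i - 1, j))
        if j > 0 and S[i][j] == S[i][j - 1] + indel:
            choices.append(("ins", i, j - 1))
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]):
            op = "=" if x[i - 1] == y[j - 1] else "sub"
            choices.append((op, i - 1, j - 1))
        if i > 1 and j > 1 and S[i][j] == (
            S[i - 2][j - 2] + swap
            + sub_cost(x[i - 1], y[j - 2]) + sub_cost(x[i - 2], y[j - 1])
        ):
            choices.append(("swap", i - 2, j - 2))
        # choices is non-empty whenever S was built with the same costs;
        # an inconsistent table makes rng.choice raise IndexError.
        op, i, j = rng.choice(choices)
        ops.append(op)
    ops.reverse()  # operations were collected from the end of x to the start
    return ops
```

Ties are broken uniformly among the tied moves at each cell, which is not the same as sampling uniformly over all optimal alignments. Passing a seeded `random.Random` instance as `rng` makes a run reproducible while testing.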
CSCI 3104, Problem Set 7, Profs. Clauset & Grochow, CU-Boulder
(iii) commonSubstrings(x,L,a) takes as input the ASCII string x, an integer $1 \le L \le n_x$, and an optimal sequence a of edits to x, which would transform x into y. This function returns each of the substrings of length at least L in x that aligns exactly, via a run of no-ops, to a substring in y.
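A single left-to-right scan over the edit sequence is one way to find these runs. The sketch below is illustrative only and assumes a hypothetical encoding of a as a list of strings, where `"="` is the no-op symbol, `"sub"` and `"del"` each consume one symbol of x, `"ins"` consumes none, and `"swap"` consumes two.

```python
def common_substrings(x, L, a):
    """Return the substrings of x of length >= L covered by a run of no-ops."""
    found = []
    pos = 0            # index of the next unread symbol of x
    run_start = None   # start index in x of the current no-op run, if any

    for op in a:
        if op == "=":
            if run_start is None:
                run_start = pos
            pos += 1
            continue
        # Any other operation ends the current run.
        if run_start is not None and pos - run_start >= L:
            found.append(x[run_start:pos])
        run_start = None
        if op in ("sub", "del"):
            pos += 1
        elif op == "swap":
            pos += 2
        # "ins" adds a symbol of y and consumes nothing from x.

    # Close a run that reaches the end of the edit sequence.
    if run_start is not None and pos - run_start >= L:
        found.append(x[run_start:pos])
    return found
```

For example, with `x = "abcXdefg"` and one substitution at the fourth symbol, a threshold of `L = 3` reports both `"abc"` and `"defg"`, while `L = 4` reports only `"defg"`.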
From scratch, implement the functions alignStrings, extractAlignment, and commonSubstrings. You may not use any library functions that make their implementation trivial. Within your implementation of extractAlignment, ties must be broken uniformly at random.
Submit (i) a paragraph for each function that explains how you implemented it (describe how it works and how it uses its data structures), and (ii) your code implementation, with code comments.
Hint: test your code by reproducing the APE / STEP and the EXPONENTIAL / POLYNOMIAL examples in the lecture notes (to do this exactly, you’ll need to use unit costs instead of the ones given above).
Using asymptotic analysis, determine the running time of the call commonSubstrings(x, L, extractAlignment(alignStrings(x,y), x, y)). Justify your answer.
(15 pts extra credit) Describe an algorithm for counting the number of optimal alignments, given an optimal cost matrix S. Prove that your algorithm is correct, and give its asymptotic running time.
Hint: Convert this problem into a form that allows us to apply an algorithm we’ve already seen.
String alignment algorithms can be used to detect changes between different versions of the same document (as in version control systems) or to detect verbatim copying between different documents (as in plagiarism detection systems).
The two data string files for PS7 (see class Moodle) contain actual documents recently released by two independent organizations. Use your functions from (1a) to align the text of these two documents. Present the results of your analysis, including a reporting of all the substrings in x of length L = 9 or more that could have been taken from y, and briefly comment on whether these documents could be reasonably considered original works, under CU’s academic honesty policy.
2. (20 pts) Ron and Hermione are having a competition to see who can compute the nth Pell number $P_n$ more quickly, without resorting to magic. Recall that the nth Pell number is defined as $P_n = 2P_{n-1} + P_{n-2}$ for $n > 1$ with base cases $P_0 = 0$ and $P_1 = 1$. Ron opens with the classic recursive algorithm:
Pell(n) :
if n == 0 { return 0 }
else if n == 1 { return 1 }
else { return 2*Pell(n-1) + Pell(n-2) }
which he claims takes $R(n) = R(n-1) + R(n-2) + c = O(\phi^n)$ time.
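To see the recurrence for R(n) in action, the recursive algorithm can be transcribed into Python with an optional call counter. The `counter` argument is a hypothetical addition for instrumentation only and is not part of Ron's algorithm.

```python
def pell(n, counter=None):
    """Direct transcription of the recursive Pell(n) pseudocode.

    If counter is a one-element list, counter[0] is incremented on every
    call, which makes the growth of the call tree visible.
    """
    if counter is not None:
        counter[0] += 1
    if n == 0:
        return 0
    if n == 1:
        return 1
    return 2 * pell(n - 1, counter) + pell(n - 2, counter)


calls = [0]
value = pell(10, calls)
print(value, calls[0])  # prints "2378 177"
```

Computing the tenth Pell number already takes 177 calls, since each call with n > 1 spawns two further calls, exactly as the recurrence R(n) = R(n-1) + R(n-2) + c describes.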
Hermione counters with a dynamic programming approach that “memoizes” (a.k.a. memorizes) the intermediate Pell numbers by storing them in an array P[n]. She claims this allows an algorithm to compute larger Pell numbers more quickly, and writes down the following algorithm.[1]
MemPell(n) {
   if n == 0 { return 0 }
   else if n == 1 { return 1 }
   else {
      if (P[n] == undefined) { P[n] = 2*MemPell(n-1) + MemPell(n-2) }
      return P[n]
   }
}
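For experimenting with MemPell, a Python transcription along the following lines may help. It is a sketch under two assumptions: `None` stands in for `undefined`, and the array is allocated by a wrapper function. The names `mem_pell_wrapper` and `mem_pell` are hypothetical.

```python
def mem_pell_wrapper(n):
    """Allocate the memo array, then run the memoized recursion."""
    # n + 1 slots so that index n itself is valid; None means "undefined".
    P = [None] * (n + 1)

    def mem_pell(k):
        if k == 0:
            return 0
        if k == 1:
            return 1
        if P[k] is None:
            # First visit to k: compute it once and remember the result.
            P[k] = 2 * mem_pell(k - 1) + mem_pell(k - 2)
        return P[k]

    return mem_pell(n)
```

Because the recursion still descends about n levels before any entry is filled, CPython's default recursion limit (typically 1000) will raise a `RecursionError` for large n unless the limit is raised with `sys.setrecursionlimit`.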
Describe the behavior of MemPell(n) in terms of a traversal of a computation tree. Describe how the array P is filled.
Determine the asymptotic running time of MemPell. Prove your claim is correct by induction on the contents of the array.
Ron then claims that he can beat Hermione’s dynamic programming algorithm in both time and space with another dynamic programming algorithm, which eliminates the recursion completely and instead builds up directly to the final solution by filling the P array in order. Ron’s new algorithm[2] is
DynPell(n) :
P[0] = 0, P[1] = 1
for i = 2 to n { P[i] = 2*P[i-1] + P[i-2] }
return P[n]
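A bottom-up version like DynPell translates almost line for line into Python. The sketch below assumes the array is allocated inside the function rather than by a separate wrapper, and the name `dyn_pell` is hypothetical.

```python
def dyn_pell(n):
    """Fill P[0..n] in increasing order and return P[n]."""
    # At least two slots so the base cases fit even when n is 0.
    P = [0] * (max(n, 1) + 1)
    P[0], P[1] = 0, 1
    for i in range(2, n + 1):
        # Both P[i-1] and P[i-2] were filled on earlier iterations.
        P[i] = 2 * P[i - 1] + P[i - 2]
    return P[n]
```

No recursion is involved, so the only storage that grows with n is the list P itself.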
Determine the time and space usage of DynPell(n). Justify your answers and compare them to the answers in part (2a).
[1] Ron briefly whines about Hermione’s P[n]=undefined trick (“an unallocated array!”), but she points out that MemPell(n) can simply be wrapped within a second function that first allocates an array of size n, initializes each entry to undefined, and then calls MemPell(n) as given.
[2] Ron is now using Hermione’s undefined array trick; assume he also uses her solution of wrapping this function within another that correctly allocates the array.
With a gleam in her eye, Hermione tells Ron that she can do everything he can do better: she can compute the nth Pell number even faster because intermediate results do not need to be stored. Over Ron’s pathetic cries, Hermione says
FasterPell(n) :
   a = 0, b = 1
   for i = 2 to n
      c = 2*a + b
      a = b
      b = c
   end
   return a
Ron giggles and says that Hermione has a bug in her algorithm. Determine the error, give its correction, and then determine the time and space usage of FasterPell(n). Justify your claims.
In a table, list each of the four algorithms as columns and for each give its asymptotic time and space requirements, along with the implied or explicit data structures that each requires. Briefly discuss how these different approaches compare, and where the improvements come from. (Hint: what data structure do all recursive algorithms implicitly use?)
(5 pts extra credit) Implement FasterPell and then compute Pn where n is the four-digit number representing your MMDD birthday, and report the first five digits of Pn. Now, assuming that it takes one nanosecond per operation, estimate the number of years required to compute Pn using Ron’s classic recursive algorithm and compare that to the clock time required to compute Pn using FasterPell.