$29
Reading: Jones and Pevzner Ch 1-4, 12;
Gusfield, Ch 16; JXZ, Ch 2, 10.
Q1: Implement the branch-and-bound algorithm for the Partial Digest Problem
(PDP) as given on p. 90 of Jones and Pevzner using your favoriate
programming language (e.g., C, C++, Java or Python). Test your program on
the following three datasets. Here, each number in parentheses indicates
the multiplicity of the preceding number.
L1 = {1(9), 2(8), 3(7), 4(6), 5(5), 6(4), 7(3), 8(2), 9(1)}.
L2 = {1(1), 2(1), 3(2), 4(1), 5(2), 6(1), 7(1), 9(2), 10(1), 12(1), 14(1), 15(1)}.
L3 = {1(16), 2(15), 3(14), 4(13), 5(12), 6(11), 7(10), 8(9), 9(8),
10(9), 11(8), 12(7), 13(6), 14(5), 15(4), 16(3), 17(2), 18(1)}.
To simpify your output, please print only the number of distinct solutions
for each dataset (symmetric solutions are considered as the same) and up to
three actual distinct solutions (in any order). Please turn in your source
code as well as the above output found in each test.
Done.
Q2: Problem 4.9, p. 121, of Jones and Pevzner.
Need to think about this problem later
Q3: Problem 4.14, p. 122, of Jones and Pevzner. You may find some complete
bacterial genomes at e.g., http://www.ebi.ac.uk/genomes/bacteria.html
For l = 5, illustrate the frequency distributions on the bacterial genome
and the random string (i.e., how many 5-mers having each frequency in each
case) using a simple plot. In other words, your plot should show the number
of distinct 5-mers (Y-axis) that occur in the bacterial genome (or random
string) with each specific frequency number (X-axis). What can you observe
from this comparison?
Bacillus amyloliquefaciens CC178 (4MB) (3916828)
Bacteria genome seems to have a few conserved 5-mers where a generated random sequence don't have anything like that. (Coding done)
Q4: Problem 12.4, p. 417, of Jones and Pevzner. Here, "coin tossing" basically
means "randomization". In other words, you may assume that you can retrieve
random numbers according to a specific probability distribution as we did
in the Gibbs Sampler algorithm.
Randomization: weighted on the distance from different centroids. So, always the closest cluster may not be selected, rather cluster with lower probability can also be chosen. Since each cluster has some probability to be assigned weighted on the distance from the data point.
Q5: Problem 12.6, p. 417, of Jones and Pevzner.
Remove a percentage of sequences from the total set of sequences rather removing only one at each iteration.