$24
Sequence alignment via dynamic programming
A central problem pursued in bioinformatics is determining similarity between sequences of nucleotides or amino acids (the basic information that makes up our genetic structure). Similarity information is required to perform some typical tasks including: using a given sequence of amino acids to infer a proteins shape and function, and finding all the genes and proteins in a given genome.
In this assignment, you will solve the problem of aligning two gene sequences optimally using dynamic programming. Given two gene sequences, our goal for global alignment is to find the optimal alignment of the two sequences to minimize the alignment cost, which is defined by a symmetric cost matrix like the one below:
-
A
T
G
C
-
A T G C
0
1
1
1
1
1
0
2
2
2
1
2
0
2
2
1
2
2
0
2
1
2
2
2
0
This cost matrix assign a cost of 0 if two identical characters are aligned, a cost of 1 if we align a character with a space (deletion or insertion), and a cost of 2 if two different characters are aligned (substitutions). The following alignment of two sequences (ATCC and TCAC) have a total cost of 6.
sequence 1:
A
-
T
C
C
sequence 2:
T
C
A
-
C
Cost:
2
1
2
1
0
A better alignment is as follows with a total cost of 2.
sequence 1:
A
T
C
-
C
sequence 2:
-
T
C
A
C
Cost:
1
0
0
1
0
Note that this can be viewed as a generalization of the edit distance problem introduced in class, whose cost matrix has a cost of 1 for all insertions and deletions and substitutions. You should adapt the dynamic programming solution presented in class for edit distance to solve this generalized alignment problem.
1 Specification of the input and output
Please follow below instructions before starting to develop your program. Your program will take two input files. The first specify the cost matrix. The second file specify sequences to be aligned. Below we describe the format of these two files.
Input file 1: Cost File The cost will be specified by a 6 by 6 matrix like the one shown above. Each row of the matrix will be one line. The first row and the first column will specify the alphabet. The remaining entries specifies the cost of aligning a particular pair of alphabets. Please see imp2cost.txt as an example cost file in your program. During grading, we use cost files with the same name and structure as imp2cost.txt. We will only change the values but the alphabets will be the same and the matrix will remain symmetric.
Input file 2: Input sequences Each sequence input file can contain multiple pairs to be aligned. Each line of the file contains one pair of sequences to be aligned. The two sequences are separated by a single comma. We have provided some sample input files named imp2input.txt.
Output file Given the inputs, the output of your program will be a single txt file. Given an input sequence file contains multiple pairs, the output file will also contain multiple lines, each line reporting the result for one pair. In particular, the line will begin with the alignment of the two sequences, separated by comma, then followed by the cost of the alignment. For example, if your input is ”AATC” and ”ATC”, the alignment output will be ”AATC, A-TC:1”, assuming cost of 1 for deletion. For the given input file, we have also provided an output file imp2output our.txt.
2 Empirical Verification of correctness.
Your code will need to compiled and run on flip. The TA will run your algorithm on his inputs and verify the correctness of the algorithm considering two aspects. First, we will verify if your program arrives at the correct optimal cost for the given inputs. This will be achieved by comparing the cost you report with the optimal cost. We will further verify if the alignment your output is correct. This will be done by checking if the cost of your reported alignment is actually the same as your reported cost. Specifically, we will use a python script, which take the cost file and the output file your algorithm produced, and verify if the reported alignments and reported costs match up. To help you debug your program and run the verification your self, you can use the provided checker.py.
Specifically checker.py checks imp2output.txt (must use this file name) to verify that the cost of the reported alignments is the same as the reported costs. To run the script, be sure to place the script in the main folder, which should also contain the output file to be checked imp2output.txt, and the cost matrix input file imp2cost.txt. Assume that you have python2.7 or python3, you can simply type: python checker.py If there was any failure, a proper message would be given.
Our Test Expectation During testing, we will place your program into one main folder containing our own inputs
imp2input.txt, imp2cost.txt. Since we will not pass any file names as parameters to your program, please make sure that your program works with the default names imp2input.txt, imp2cost.txt and is able to find them in the same folder. Finally, your output file should be placed the same folder under name imp2output.txt.
3 Empirical analysis of run time.
For this part, you need to generate your own random inputs of sequences of five different lengths: 500, 1000,
2000, 4000, 5000. Feel free to consider even longer inputs to push the limit your implementation. For each length, please generate 10 random pairs of sequences (using alphabet A, G, T and C) and compute the pairwise alignments and report the average time for computing the alignments.
4 Report.
In addition to submitting the source code (with clear instructions for running them), you will also need to submit a report, which should include the following:
• Pseudo-code. Please provide the pseudo-code for your algorithm.
• Asymptotic Analysis of run time. Please analyze the runtime for your algorithm.
• Reporting and plotting the runtime. Describe how the running time is measured in your experi- ments. What type of machine is used for the experiments. Is there any part excluded from the runtime measurements (e.g., output of the alignment)? Plot the running times as a function of the input size. Label your plot (axes, title, etc.). Include an additional plot of the running times on a log-log axis 1 . Note that if the slope of a line in a log-log plot is m, then the line is of the form O(xm ) on a linear plot. Determine the slope of the line in your log-log plot and from the slope, infer the experimental running time for your algorithm.
• Interpretation and discussion Discuss the runtime plot. Do the growth curve match your expec- tation based on their theoretical bounds?
5 Hand in Instructions
Code and report should be submitted through the TEACH website. Students are encouraged to work in groups of up to 3 people. Each group should submit only one submission, which should in- clude the names of all group members. You can choose any of the following programming languages: java, python, c/c++. If you wish to use a language that is outside of this list, you will need to obtain advance approval from at least one of the TA in charge.
1 See here for a tutorial on log-log plots: http://www.eecs.orst.edu/~glencora/cs325/videos/loglogplots.mp4 (and ac- companying file: https://web.engr.oregonstate.edu/~glencora/wiki/uploads/loglogplots.m).