Instructions: Please put all answers in a single PDF with your name and NetID and upload it to SAKAI before class on the due date (there is a LaTeX template on the course web site for you to use). Definitely consider working in a group; please include the names of the people in your group, and write up your solutions separately. If you look at any references (even Wikipedia), cite them. If you happen to track the number of hours you spent on the homework, please put that at the top of your homework to give us an indication of how difficult it was.
Problem 1
PCA
Let $X = \Lambda Z + \epsilon$, where $X \in \mathbb{R}^{p \times n}$, $\Lambda \in \mathbb{R}^{p \times k}$, $Z \in \mathbb{R}^{k \times n}$, and $\epsilon \in \mathbb{R}^{p \times n}$. As in factor analysis, assume that the entries of $Z$ have standard normal distribution priors, and that $\epsilon_i$ follows a $\mathcal{N}_p(0, \Psi)$ distribution for $i \in \{1, \dots, n\}$, where $\Psi$ is diagonal. Unlike in the FA model, let each diagonal element of the $\Psi$ matrix be $\psi_j = \psi$ (i.e., all of the diagonal elements are the same). This model is known as probabilistic PCA.
Now generate three different matrices $X$ in the following way: set $n = p = 100$ and $k = 3$, generate each element $z_{i,k} \sim \mathcal{N}(0, 1)$, and similarly each element $\lambda_{k,j} \sim \mathcal{N}(0, 1)$. Compute the product $\Lambda Z$ and then add $\mathcal{N}(0, \psi)$ noise to each element to obtain $X$, using one value of $\psi \in \{0.2, 2, 10\}$ per matrix. You should now have three matrices, each $100 \times 100$, and each generated from a low-dimensional subspace.
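The generation procedure above could be sketched in R roughly as follows. This is only an illustration: the helper name `generate_x`, the seed, and the list name `x_list` are made up here, and the sketch assumes that ψ in N(0, ψ) denotes a variance, so the noise standard deviation is sqrt(ψ).

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

n <- 100
p <- 100
k <- 3
psis <- c(0.2, 2, 10)

# Hypothetical helper: draws one p x n matrix X = Lambda Z + noise.
# Assumption: psi is the noise variance, hence sd = sqrt(psi).
generate_x <- function(psi, p, n, k) {
  Lambda <- matrix(rnorm(p * k), nrow = p, ncol = k)  # loadings, N(0, 1)
  Z <- matrix(rnorm(k * n), nrow = k, ncol = n)       # factors, N(0, 1)
  noise <- matrix(rnorm(p * n, mean = 0, sd = sqrt(psi)), nrow = p, ncol = n)
  Lambda %*% Z + noise
}

# One matrix per noise level.
x_list <- lapply(psis, generate_x, p = p, n = n, k = k)
names(x_list) <- paste0("psi_", psis)
```

Each element of `x_list` is then a 100 × 100 matrix whose noise-free part has rank at most 3.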
In this question, you will reconstruct the covariance of this matrix using eigenvalues and eigenvectors.
(a) Use the eigen() function in R to compute the eigenvalues and eigenvectors for the covariance of X (cov(X)) for each of the three matrices, and plot the normalized eigenvalues (turn in this plot). How does the distribution of the eigenvalues change as the amount of noise (ψ) in the original matrices increases?
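One possible way to set up part (a) is sketched below. It assumes that "normalized" means dividing each eigenvalue by the sum of all eigenvalues, which the problem does not state explicitly, and it regenerates the data inline so that the block is self-contained. Note that `cov(X)` in R treats the columns of `X` as variables. The plotting call is guarded so that it only runs in an interactive session.

```r
set.seed(1)  # arbitrary seed for reproducibility
p <- 100
n <- 100
k <- 3
psis <- c(0.2, 2, 10)

# For each noise level, simulate X and return the normalized eigenvalues of cov(X).
# Assumption: "normalized" = each eigenvalue divided by the sum of eigenvalues.
normalized_eigs <- sapply(psis, function(psi) {
  x <- matrix(rnorm(p * k), p, k) %*% matrix(rnorm(k * n), k, n) +
    matrix(rnorm(p * n, sd = sqrt(psi)), p, n)
  xeig <- eigen(cov(x), symmetric = TRUE)
  xeig$values / sum(xeig$values)
})
colnames(normalized_eigs) <- paste0("psi = ", psis)

# Plot the leading part of each spectrum (one line per noise level).
if (interactive()) {
  matplot(normalized_eigs[1:20, ], type = "b", pch = 1:3, lty = 1, col = 1:3,
          xlab = "Eigenvalue index", ylab = "Normalized eigenvalue")
  legend("topright", legend = colnames(normalized_eigs),
         pch = 1:3, lty = 1, col = 1:3)
}
```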
(b) Compute the RMSE between Cov(X) and the matrix reconstruction using the first three eigenvectors:
xeig$vectors[,1:3] %*% diag(xeig$values[1:3]) %*% t(xeig$vectors[,1:3]), or $\Phi \Omega \Phi^T$, where $\Phi$ is the truncated matrix of eigenvectors and $\Omega = \mathrm{diag}(\omega)$ for the corresponding eigenvalues $\omega$. How well do these eigenvectors recapitulate the original low-dimensional data matrices? How does this change as the amount of noise increases?
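The reconstruction and its error could be computed along the following lines. This is a sketch under assumptions: RMSE is taken to be the square root of the mean squared element-wise difference between the two matrices, and the names `rank3_rmse` and `rmse_by_psi` are invented for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility
p <- 100
n <- 100
k <- 3

# Hypothetical helper: simulate X for one noise variance psi, then compare
# cov(X) with its reconstruction from the top three eigenpairs.
rank3_rmse <- function(psi) {
  x <- matrix(rnorm(p * k), p, k) %*% matrix(rnorm(k * n), k, n) +
    matrix(rnorm(p * n, sd = sqrt(psi)), p, n)
  s <- cov(x)
  xeig <- eigen(s, symmetric = TRUE)
  phi <- xeig$vectors[, 1:3]           # truncated eigenvector matrix
  omega <- diag(xeig$values[1:3])      # top three eigenvalues on the diagonal
  s_hat <- phi %*% omega %*% t(phi)    # rank-3 reconstruction of cov(X)
  sqrt(mean((s - s_hat)^2))            # element-wise RMSE (assumed definition)
}

rmse_by_psi <- sapply(c(0.2, 2, 10), rank3_rmse)
names(rmse_by_psi) <- c("psi_0.2", "psi_2", "psi_10")
```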
(c) What is the effect of reconstructing these matrices using the first three eigenvalues? How might this be useful for a specific application?
Problem 2
String Kernel and Gaussian Processes.
In molecular biology, transcription factors (a class of proteins) typically recognize and bind to specific DNA regulatory sequences. The protein binding microarray (PBM) is a novel biotechnology that offers a quantitative way to measure the DNA binding specificities of any single transcription factor protein. In this task, we will build a Gaussian process regression model to predict the binding intensities of a transcription factor on a set of DNA sequences. On SAKAI, we provide 1000 training sequences and their measured intensities (in the training data gp.txt file), and 100 test sequences and their measured intensities (in the test data gp.txt file).
(a) Write a program to implement the string kernel on these DNA sequences, using substring lengths $|A| \in \{1, 2, 3\}$, and compute the Gram matrix $K$ for the samples. (Hint: The alphabet of a DNA sequence is "A", "C", "G", "T".) For this limited set of underlying strings, you can do this by brute force instead of using a suffix tree: enumerate all of the strings, and count the number of occurrences of each string in each DNA sequence. Then $\kappa(x_i, x_j)$ is a simple function of those count vectors.
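A brute-force version of this idea might look like the sketch below. It assumes the simplest choice for the kernel, namely the unweighted inner product of the substring count vectors, with overlapping occurrences counted; the problem leaves the exact form of κ open. The function names and the toy sequences are illustrative only.

```r
alphabet <- c("A", "C", "G", "T")

# Enumerate every string of length 1, 2 and 3: 4 + 16 + 64 = 84 features.
all_kmers <- unlist(lapply(1:3, function(len) {
  grid <- expand.grid(rep(list(alphabet), len), stringsAsFactors = FALSE)
  do.call(paste0, grid)
}))

# Count overlapping occurrences of every enumerated string in one sequence.
kmer_counts <- function(s) {
  counts <- setNames(numeric(length(all_kmers)), all_kmers)
  for (len in 1:3) {
    if (nchar(s) < len) next
    starts <- seq_len(nchar(s) - len + 1)
    subs <- table(substring(s, starts, starts + len - 1))
    subs <- subs[names(subs) %in% all_kmers]  # ignore non-ACGT substrings
    counts[names(subs)] <- counts[names(subs)] + as.numeric(subs)
  }
  counts
}

# Gram matrix under the assumed kernel: inner products of count vectors.
string_gram <- function(seqs) {
  feats <- t(sapply(seqs, kmer_counts))  # one row per sequence
  feats %*% t(feats)
}

# Toy sequences, for illustration only (not from the assignment data).
toy <- c("ACGT", "AACG", "TTTT")
K <- string_gram(toy)
```

For the real data, `string_gram` would be applied to the vector of training sequences, giving a 1000 × 1000 matrix.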
(b) Using Gaussian process regression (use an RBF kernel with characteristic length scale $\sigma = 5$), predict the protein binding intensities for the test data set using the training data set.
Note: For parts (a) and (b), write down the steps for building your string kernel and the equations for your Gram matrix and Gaussian process regression, and paste your code at the end.
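As a rough sketch of the regression step, the standard Gaussian process posterior mean is $\hat{y}_* = K_* (K + \sigma_n^2 I)^{-1} y$, with RBF kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$. The code below makes several assumptions that the problem does not specify: the RBF kernel is applied to the substring count vectors from part (a), the noise variance `noise_var` is a hypothetical value that would need tuning, and the training intensities are centered by their mean before fitting. The data are random toy values, not the assignment files.

```r
# Squared Euclidean distances between the rows of a and the rows of b.
sq_dist <- function(a, b) {
  d <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  pmax(d, 0)  # clip tiny negative values caused by rounding
}

# RBF kernel with characteristic length scale sigma.
rbf_kernel <- function(a, b, sigma = 5) {
  exp(-sq_dist(a, b) / (2 * sigma^2))
}

# GP posterior mean. noise_var is an assumed value, not given in the problem.
gp_predict <- function(x_train, y_train, x_test, sigma = 5, noise_var = 0.1) {
  y_mean <- mean(y_train)  # assumption: center the targets
  k_tt <- rbf_kernel(x_train, x_train, sigma) + noise_var * diag(nrow(x_train))
  k_st <- rbf_kernel(x_test, x_train, sigma)
  u <- chol(k_tt)  # upper triangular factor, k_tt = t(u) %*% u
  alpha <- backsolve(u, forwardsolve(t(u), y_train - y_mean))
  as.vector(k_st %*% alpha) + y_mean
}

# Toy feature matrices standing in for the 84-dimensional count vectors.
set.seed(1)
x_train <- matrix(rpois(20 * 84, lambda = 1), nrow = 20)
y_train <- rowSums(x_train[, 1:4]) + rnorm(20, sd = 0.1)
x_test <- matrix(rpois(5 * 84, lambda = 1), nrow = 5)
y_pred <- gp_predict(x_train, y_train, x_test)
```

Solving with the Cholesky factor rather than calling `solve()` on the full matrix is a common choice for numerical stability, since the regularized Gram matrix is symmetric positive definite.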
(c) Draw a scatter plot of your predicted intensities versus the measured intensities in the test data. Calculate the RMSE of the predicted intensities against the measured intensities.
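The evaluation could be done with a few lines like the following. The numbers are toy values chosen for illustration, not results from the assignment data, and the function name `rmse` is just a convenient label.

```r
# Root mean squared error between two equal-length numeric vectors.
rmse <- function(predicted, measured) {
  sqrt(mean((predicted - measured)^2))
}

# Toy values for illustration only.
measured <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 2.0, 2.5, 4.0)
toy_rmse <- rmse(predicted, measured)

# Scatter plot with the identity line as a visual reference.
if (interactive()) {
  plot(measured, predicted,
       xlab = "Measured intensity", ylab = "Predicted intensity")
  abline(0, 1, lty = 2)
}
```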
(d) How might you improve the prediction accuracy (name three different ways)?
Note: This problem is still an open and challenging area in genomic sciences.