Machine Learning Assignment 3 Solution

    1. The goal of this exercise is to partially show one of the remarkable properties of SVD that we discussed in class. Let $X$ be an $n \times d$ matrix and let $X = U \Sigma V^T$ be its singular-value decomposition, where $\Sigma$ is an $r \times r$ diagonal matrix with $\Sigma_{11} \ge \Sigma_{22} \ge \dots \ge \Sigma_{rr}$. Let $\sigma_i = \Sigma_{ii}$. Show that $\max_{v : \|v\| = 1} \|Xv\| \le \sigma_1$. [4 points]
[Hint: Write a vector $v$ in the basis of the columns of $V$, i.e., write $v = \sum_i \alpha_i v_i$, where $v_i$ is the $i$-th column of $V$. What is $Xv$ given this expansion? Use this to compute $\|Xv\|^2$ and bound it.]
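For reference, a short sketch of the computation the hint points at (the coefficients $\alpha_i$ are notation introduced here; we use that $X v_i = \sigma_i u_i$ for the columns $v_i$ of $V$ and $u_i$ of $U$):
\[
Xv = \sum_i \alpha_i X v_i = \sum_i \alpha_i \sigma_i u_i,
\qquad
\|Xv\|^2 = \sum_i \alpha_i^2 \sigma_i^2 \le \sigma_1^2 \sum_i \alpha_i^2 = \sigma_1^2,
\]
since the $u_i$ are orthonormal and $\sum_i \alpha_i^2 = \|v\|^2 = 1$.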
    2. In class we showed that the span of the first two right singular vectors gives a best-fit subspace of dimension 2. Generalize the argument to higher $k$. That is, show that for any $X$, the span of the first $k$ right singular vectors gives a solution to the best-fit subspace problem for dimension $k$. You have to prove this without using any properties of the SVD (even those we stated in class). [4 points]

[Hint: Use induction. Suppose the claim is true for $k-1$ and prove it for $k$. Recall that in the argument for $k = 2$, we had to choose an orthonormal basis $w_1, w_2$ for an optimal subspace $S$ carefully so that $w_2$ was perpendicular to $v_1$. You can use a generalization of this to higher $k$; you don’t have to prove this last fact.]
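For orientation, here is one way the induction step can be organized (a sketch only, using the form of the best-fit problem where one maximizes $\sum_i \|Xw_i\|^2$ over orthonormal bases $w_1, \dots, w_k$ of the candidate subspace, and the greedy definition of the $v_i$ from class): pick an orthonormal basis $w_1, \dots, w_k$ of an optimal subspace $S$ with $w_k \perp v_1, \dots, v_{k-1}$. By the induction hypothesis applied to $\mathrm{span}(w_1, \dots, w_{k-1})$ we get $\sum_{i<k} \|Xw_i\|^2 \le \sum_{i<k} \|Xv_i\|^2$, and $\|Xw_k\|^2 \le \|Xv_k\|^2$ since $v_k$ maximizes $\|Xv\|$ over unit vectors orthogonal to $v_1, \dots, v_{k-1}$. Summing gives $\sum_{i \le k} \|Xw_i\|^2 \le \sum_{i \le k} \|Xv_i\|^2$, so the span of $v_1, \dots, v_k$ does at least as well as $S$.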

    3. In class we saw a way to compute the largest singular vector without computing the entire SVD. What if you had to compute the smallest singular vector? Describe a way to compute the smallest right singular vector of a matrix X by a reduction. [3 points]

You don’t have to prove anything; just an idea about what to do will suffice.


    • Provide a complete proof without circularly referring to the two main theorems we stated about SVD!


[Hint: It might be best to work with the matrix $Y = X^T X$. If $X$ has SVD $X = U \Sigma V^T$, then $Y = V \Sigma^2 V^T$ and the eigenvalues of $Y$ are exactly the squares of the singular values of $X$. Perform a suitable shift on $Y$ to get the smallest singular vector of $X$.]
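For instance, one concrete way to realize the hint (a minimal numpy sketch; the function name, the choice of shift $c = \mathrm{tr}(Y) + 1$, and the iteration count are assumptions of mine, not part of the assignment):

    import numpy as np

    def smallest_right_singular_vector(X, num_iters=1000, seed=0):
        """Power iteration on the shifted matrix c*I - X^T X.
        Its top eigenvector is the smallest right singular vector of X."""
        rng = np.random.default_rng(seed)
        Y = X.T @ X                      # eigenvalues of Y are the sigma_i^2
        c = np.trace(Y) + 1.0            # any c > sigma_1^2 works; trace(Y) = sum of sigma_i^2
        v = rng.standard_normal(Y.shape[0])
        for _ in range(num_iters):
            v = c * v - Y @ v            # multiply by (c*I - Y) without forming it
            v /= np.linalg.norm(v)
        return v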

    4. The goal of this exercise is to implement the singular value projection idea we saw in class.

Let $X$ be an unknown $n \times d$ matrix of rank at most $k$ (a small number). We are given as input $(X_{ij} : (i,j) \in O)$ for some subset of entries $O \subseteq [n] \times [d]$. Your goal is to figure out the missing entries in $X$.

For any matrix $Y \in \mathbb{R}^{n \times d}$, define $L(Y) = \sum_{(i,j) \in O} (X_{ij} - Y_{ij})^2$. (The loss function is computable from the entries we observe.)



    (a) Describe how to compute the gradient of L with respect to Y . [1 point].
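A one-line numpy sketch of this gradient (here "mask" is a hypothetical 0/1 array marking the observed entries $O$; the partial derivative is $-2(X_{ij} - Y_{ij})$ on $O$ and 0 elsewhere):

    import numpy as np

    def loss_gradient(Y, X, mask):
        """Gradient of L(Y) = sum_{(i,j) in O} (X_ij - Y_ij)^2 with respect to Y.
        mask has 1 exactly on the observed entries O and 0 elsewhere."""
        return 2.0 * mask * (Y - X)      # -2 (X_ij - Y_ij) on O, zero off O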

    (b) You can use the code from edstem for the following exercise. Generate a random rank-$k$ $n \times d$ matrix where $k = 5$, $n = 1000$, $d = 500$. Note that technically, writing down the matrix completely will need you to write down 500,000 entries!

Generate a random subset $O$ of $[n] \times [d]$ of density $p = 0.1$. One way to do this in python is to generate a random $n \times d$ matrix with entries in the range $(0, 1)$ and set all entries bigger than $p$ to 0 and those less than $p$ to 1.
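A possible way to set this up in numpy, following the recipe above (the seed and variable names are mine):

    import numpy as np

    n, d, k, p = 1000, 500, 5, 0.1
    rng = np.random.default_rng(0)

    # Random rank-k matrix: product of n x k and k x d Gaussian factors.
    X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))

    # Observed-entry mask O: keep entries whose uniform draw falls below p.
    mask = (rng.uniform(size=(n, d)) < p).astype(float)
    X_obs = mask * X                     # entries outside O are zeroed out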

Now use the SVP algorithm as discussed in class for a couple of step-sizes and plot the cost function at each iteration. Your submission must be (a) a screenshot of your code for computing the projected gradient and (b) the final plot of the error. [3 points]

(See the workspace on power iteration for helper code on generating the indices of observed entries. You have to implement the cost function, the gradient function, and call an internal SVD computation to compute the top-5 SVD (e.g., scipy.sparse.linalg.svds(Y, k=5)) for each projection step. Note that you can use the earlier code we had for demoing GD here, except you have to change the input handle for the gradient to the projected one.)
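Putting the pieces together, a minimal sketch of the SVP loop under the setup above (the step size, iteration count, and the X_obs/mask variables from the earlier sketch are my assumptions, not the assignment's provided code):

    import numpy as np
    from scipy.sparse.linalg import svds

    def svp(X_obs, mask, k=5, step=1.0, num_iters=100):
        """Singular Value Projection: take a gradient step on the observed-entry
        loss, then project back onto rank-k matrices with a truncated SVD."""
        Y = np.zeros_like(X_obs)
        losses = []
        for _ in range(num_iters):
            grad = 2.0 * mask * (Y - X_obs)      # gradient of L at Y
            Z = Y - step * grad                  # gradient step
            U, s, Vt = svds(Z, k=k)              # top-k SVD of the stepped iterate
            Y = (U * s) @ Vt                     # projection onto rank-k matrices
            losses.append(float(np.sum(mask * (X_obs - Y) ** 2)))
        return Y, losses

The losses list can then be plotted against the iteration index, once per step size.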

You might have noticed that the above method converges to the true unknown $X$ impressively well even with only 10% of entries visible. However, you can in fact do much better in terms of implementation (and you need to when computing at the scale of the Netflix dataset). Let’s say $X_t$ is the current low-rank matrix approximation candidate after the $t$-th iteration. Can you think of ways to speed up the computation of the best rank-$k$ approximation to $Y_t = X_t - \nabla L(X_t)$? Indeed, writing down $Y_t$ entirely will itself be prohibitive when you have $n = 500{,}000$ and $d = 17{,}000$. Instead, you can store just the low-rank factorization of $X_t$. Also note that $\nabla L(X_t)$ will be a sparse matrix (only 10% of its entries are non-zero). You can use these two properties to compute matrix-vector products with $Y_t$ faster than even writing down the entire matrix. Indeed, we had to use such implementations in our paper from 2009.
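One way to exploit these two properties is to hand svds a scipy.sparse.linalg.LinearOperator that only knows how to multiply by $Y_t$ (a sketch under the assumption that $X_t$ is stored as factors $A B^T$ and the gradient as a scipy sparse matrix; all names here are mine):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, svds

    def project_rank_k_fast(A, Bt, G_sparse, k=5):
        """Top-k SVD of Y_t = A @ Bt - G_sparse without ever forming Y_t densely.
        A (n x k) and Bt (k x d) are the low-rank factors of X_t; G_sparse is the
        sparse gradient of L at X_t (nonzero only on the observed entries)."""
        n, d = A.shape[0], Bt.shape[1]
        Y_op = LinearOperator(
            (n, d),
            matvec=lambda v: A @ (Bt @ v) - G_sparse @ v,         # Y_t v
            rmatvec=lambda u: Bt.T @ (A.T @ u) - G_sparse.T @ u,  # Y_t^T u
            dtype=np.float64,
        )
        U, s, Vt = svds(Y_op, k=k)     # svds only needs matrix-vector products
        return U, s, Vt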

Let us investigate what speed-up you might gain by the above approach by looking at a simpler setting in the next problem.

    5. Let $n$ be a large number (say $n = 10{,}000$) and $k$ a small number (say 20). Let $G$ be a sparse $n \times n$ matrix with density $p$ (say 0.01). MATLAB, NumPy (and other numerical analysis software) can exploit the sparsity of $G$ to compute its top singular vector quite efficiently. Next, let $U$ be an $n \times k$ matrix and let $X = U U^T$. Computing the first right singular vector of $X$ directly in MATLAB would be quite slow. But the first right singular vector of $X$ is the same as the first left singular vector of $U$, which you can compute much more efficiently.
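A quick numerical check of that claim (sizes shrunk for illustration; this snippet is not part of the assignment):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    U = rng.standard_normal((2000, 20))

    u_fast, _, _ = svds(U, k=1)            # top LEFT singular vector of U (cheap: 2000 x 20)
    _, _, vt_slow = svds(U @ U.T, k=1)     # top RIGHT singular vector of X = U U^T (2000 x 2000)
    print(abs(u_fast[:, 0] @ vt_slow[0]))  # ~ 1.0, i.e. equal up to sign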

Now, suppose we had to compute the first singular vector of $Z = X + G$ (this is similar to the situation in SVP). Doing so directly would be quite expensive. But we can run power iteration fast, since for any vector $v$ we can compute $Zv = U(U^T v) + Gv$ without ever having to write down $Z$ (keeping only $U$ and $G$).

You can use MATLAB, R, Numpy/Scipy, or Julia for this exercise. Generate a random $n \times n$ sparse matrix $G$ with density $p$ for $n = 10{,}000$, $p = 0.01$. Generate a random $n \times k$ matrix $U$. Compute $Z = U U^T + G$.

Next, use the built-in command to compute the top singular vector and time the operation. Implement the power iteration (as in the above step) to compute the top singular vector of $Z$ (say for $10, 20, 30, \dots, 100$ iterations) and time the algorithm. Is there a difference in speed?
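A sketch of how this experiment could look in numpy/scipy (the seed, timing code, and error metric are my choices; forming the dense Z below is deliberately the expensive baseline and needs a few GB of memory at n = 10,000, so shrink n if necessary):

    import time
    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    n, k, p = 10_000, 20, 0.01
    rng = np.random.default_rng(0)
    G = sp.random(n, n, density=p, random_state=0, format="csr")
    U = rng.standard_normal((n, k))

    # Baseline: build the dense Z = U U^T + G and call the built-in routine.
    t0 = time.time()
    Z = U @ U.T + G.toarray()
    _, _, vt_true = svds(Z, k=1)
    print("built-in:", time.time() - t0, "s")

    # Power iteration using only products Zv = U (U^T v) + G v (and Z^T u likewise).
    def power_iteration(num_iters):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            w = U @ (U.T @ v) + G @ v          # w = Z v
            v = U @ (U.T @ w) + G.T @ w        # v = Z^T w, i.e. one step of Z^T Z
            v /= np.linalg.norm(v)
        return v                               # approx. top right singular vector of Z

    t0 = time.time()
    v = power_iteration(100)
    print("power iteration:", time.time() - t0, "s")
    print("error:", min(np.linalg.norm(v - vt_true[0]), np.linalg.norm(v + vt_true[0])))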

Your submission should be the following items. [3 points]

(a) A table or figure showing the run-times of the in-built command and the run-times for different numbers of iterations, obtained by averaging over 10 runs (as $X$ and $U$ are random, it is better to consider multiple runs).

    (b) A figure plotting the number of iterations (on the x-axis) against the error (between what power iteration computes and the true answer as computed by the built-in command) on the y-axis.