Machine Learning Homework #1

    1. [42 points] Linear regression on a polynomial

The files q1xTrain.npy, q1xTest.npy, q1yTrain.npy and q1yTest.npy specify a linear regression problem for a polynomial. q1xTrain.npy represents the inputs (x(i) ∈ R) and q1yTrain.npy represents the outputs (y(i) ∈ R) of the training set, with one training example per row.

    (a) [15 points] You will compare the following two optimization methods for finding the coefficients of a polynomial of degree one (i.e., slope and intercept) that minimize the training loss.

            ▪ Batch gradient descent

            ▪ Stochastic gradient descent

        i. [12 points] Give the coefficients generated by each of the optimization methods. Report the hyperparameters used to generate the coefficients.

        ii. [3 points] We compare the two optimization methods in terms of the number of epochs required for convergence. We define an “epoch” as one pass through the entire training set. Compare the number of epochs needed to converge for the two methods. Which method converges faster? Report the hyperparameters used for the comparison. [Hint: in this question, the training process can be considered converged when the mean squared error $E_{MS} = \frac{1}{N}\sum_{i=1}^{N}\left(w_0 + w_1 x^{(i)} - y^{(i)}\right)^2$ on the training dataset is consistently small enough (e.g., ≤ 0.2).] (A minimal sketch of both methods is given below.)
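To make the experiment concrete, here is a minimal sketch (in Python with NumPy, not a reference solution) of both optimization methods for the degree-one fit; it uses the convergence criterion from the hint above. The learning rates, epoch budget, and random seed are illustrative assumptions only; report whichever hyperparameters you actually use.

    # Sketch: batch gradient descent vs. stochastic gradient descent for w0 + w1*x.
    import numpy as np

    x = np.load('q1xTrain.npy').ravel()
    y = np.load('q1yTrain.npy').ravel()
    X = np.stack([np.ones_like(x), x], axis=1)     # feature (1, x) -> coefficients (w0, w1)
    N = len(y)

    def mse(w):
        # mean squared error E_MS on the training set
        return np.mean((X @ w - y) ** 2)

    def batch_gd(lr=0.1, max_epochs=10_000, tol=0.2):
        w = np.zeros(2)
        for epoch in range(1, max_epochs + 1):
            grad = (2.0 / N) * X.T @ (X @ w - y)   # full-batch gradient of E_MS
            w -= lr * grad
            if mse(w) <= tol:                      # convergence criterion from the hint
                return w, epoch
        return w, max_epochs

    def sgd(lr=0.05, max_epochs=10_000, tol=0.2, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(2)
        for epoch in range(1, max_epochs + 1):
            for i in rng.permutation(N):           # one full pass over the data = one epoch
                grad = 2.0 * (X[i] @ w - y[i]) * X[i]
                w -= lr * grad
            if mse(w) <= tol:
                return w, epoch
        return w, max_epochs

    w_bgd, epochs_bgd = batch_gd()
    w_sgd, epochs_sgd = sgd()
    print('batch GD coefficients:', w_bgd, ' epochs:', epochs_bgd)
    print('SGD coefficients     :', w_sgd, ' epochs:', epochs_sgd)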

    (b) [15 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that explored over-fitting as a function of the degree of the polynomial M, where the Root-Mean-Square (RMS) Error is defined as


$$E_{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(w^T \phi(x^{(i)}) - y^{(i)}\right)^2}$$

[Sample chart: training and test E_RMS as a function of the polynomial degree M.]


        i. [10 points] Regenerate the above chart with the provided data. To find the parameters, use the closed-form solution of linear regression (assuming all its conditions are met) that minimizes the error for an M-degree polynomial (for M = 0 . . . 9) on the training data in q1xTrain.npy and q1yTrain.npy. For the test curve, use the data in q1xTest.npy and q1yTest.npy. [Note: For different values of M, we assume the feature vector is ϕ(x(i)) = (1, x(i), (x(i))2, · · · , (x(i))M ). The trend of your curve is not necessarily the same as the sample plot.] (A brief implementation sketch is given after part (b).)

        ii. [5 points] Which degree polynomial would you say best fits the data? Was there evidence of under/over-fitting the data? Use your generated charts to justify your answer.
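As referenced in (b)i, a minimal sketch of the closed-form fit and the E_RMS curves follows. It assumes the feature map ϕ(x) = (1, x, . . . , x^M) from the note above and solves the normal equations directly; np.linalg.lstsq is a more numerically robust alternative when Φ^T Φ is ill-conditioned.

    # Sketch: closed-form polynomial least squares and E_RMS vs. degree M.
    import numpy as np
    import matplotlib.pyplot as plt

    x_tr, y_tr = np.load('q1xTrain.npy').ravel(), np.load('q1yTrain.npy').ravel()
    x_te, y_te = np.load('q1xTest.npy').ravel(), np.load('q1yTest.npy').ravel()

    def design(x, M):
        # rows are phi(x) = (1, x, x^2, ..., x^M)
        return np.vander(x, M + 1, increasing=True)

    def rms(Phi, y, w):
        return np.sqrt(np.mean((Phi @ w - y) ** 2))

    degrees = list(range(10))
    train_err, test_err = [], []
    for M in degrees:
        Phi = design(x_tr, M)
        w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_tr)   # closed-form least squares
        train_err.append(rms(Phi, y_tr, w))
        test_err.append(rms(design(x_te, M), y_te, w))

    plt.plot(degrees, train_err, 'o-', label='Training')
    plt.plot(degrees, test_err, 'o-', label='Test')
    plt.xlabel('M'); plt.ylabel('E_RMS'); plt.legend()
    plt.savefig('q1-b.png')   # example output filename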

    (c) [12 points] Finally, you will explore the role of regularization. Recall the image from lecture that explored the effect of the regularization factor λ:

[Sample chart: training and test E_RMS as a function of the regularization factor λ (M = 9).]

        i. [10 points] Derive the closed-form solution of ridge regression and find the coefficients that minimize the error for a ninth-degree polynomial (M = 9) given a regularization factor λ (for λ ∈ {0, 10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, . . . , 10^{-1}, 10^{0} (= 1)}) for the training data specified in q1xTrain.npy and q1yTrain.npy. Now use these parameters to plot the E_RMS of both the training data and the test data as a function of λ and regenerate the above chart, using q1xTest.npy and q1yTest.npy as the test data. Specifically, use the following regularized objective function:

$$\frac{1}{2}\sum_{i=1}^{N}\left(w^T \phi(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2}\|w\|_2^2$$

for optimizing the parameters w, but please use the original (unregularized) E_RMS for plotting. [Note: the trend of your curve is not necessarily the same as the sample chart.] (A brief implementation sketch is given after part (c).)

ii. [2 points] Which λ value seemed to work the best?
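As referenced in (c)i, here is a minimal sketch of the λ sweep at M = 9. It assumes the regularized objective above; the closed form it uses, w = (Φ^T Φ + λI)^{-1} Φ^T y, is the standard ridge solution and should agree with the derivation you give yourself.

    # Sketch: ridge regression at M = 9 for a range of regularization factors.
    import numpy as np

    x_tr, y_tr = np.load('q1xTrain.npy').ravel(), np.load('q1yTrain.npy').ravel()
    x_te, y_te = np.load('q1xTest.npy').ravel(), np.load('q1yTest.npy').ravel()
    M = 9
    Phi_tr = np.vander(x_tr, M + 1, increasing=True)
    Phi_te = np.vander(x_te, M + 1, increasing=True)

    def rms(Phi, y, w):
        # unregularized E_RMS, as requested for plotting/reporting
        return np.sqrt(np.mean((Phi @ w - y) ** 2))

    lambdas = [0.0] + [10.0 ** k for k in range(-8, 1)]
    for lam in lambdas:
        A = Phi_tr.T @ Phi_tr + lam * np.eye(M + 1)
        w = np.linalg.solve(A, Phi_tr.T @ y_tr)          # ridge closed form
        print(f'lambda={lam:g}  train E_RMS={rms(Phi_tr, y_tr, w):.4f}  '
              f'test E_RMS={rms(Phi_te, y_te, w):.4f}')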



    2. [36 points] Locally weighted linear regression

Consider a linear regression problem in which we want to weight different training examples differently. Specifically, suppose we want to minimize

$$E_D(w) = \frac{1}{2}\sum_{i=1}^{N} r^{(i)}\left(w^T x^{(i)} - y^{(i)}\right)^2.$$

where r(i) is the weight for the sample (x(i), y(i)). In class, we worked out what happens for the case where all the weights (the r(i)’s) are the same. In this problem, we will generalize some of those ideas to the weighted setting, and also implement the locally weighted linear regression algorithm. [Note: the weight r(i) can be different for each training example.]

(a) [2 points] Show that ED(w) can also be written as

ED(w) = (Xw − y)T R(Xw − y)

for an appropriate diagonal matrix R, and where X is a matrix whose i-th row is x(i) and y is the vector whose i-th entry is y(i). State clearly what R is.

    (b) [8 points] If all the r(i)’s equal 1, then we saw in class that the normal equation is

XT Xw = XT y,

and that the value of w∗ that minimizes ED(w) is given by (XT X)−1XT y. By finding the derivative ∇wED(w) and setting that to zero, generalize the normal equation and the closed form solution to this weighted setting, and give the new value of w∗ that minimizes ED(w) in closed form as a function of X,
R and y.

    (c) [8 points] Suppose we have a training set {(x(i), y(i)); i = 1, . . . , N} of N independent examples, but in which the y(i)’s were observed with differing variances. Specifically, suppose that

$$p\left(y^{(i)} \mid x^{(i)}; w\right) = \frac{1}{\sqrt{2\pi}\,\sigma^{(i)}} \exp\left(-\frac{\left(y^{(i)} - w^T x^{(i)}\right)^2}{2\left(\sigma^{(i)}\right)^2}\right)$$
I.e., y(i) has mean wT x(i) and variance (σ(i))2 (where the σ(i)’s are fixed, known constants). Show that finding the maximum likelihood estimate of w reduces to solving a weighted linear regression problem. State clearly what the r(i)’s are in terms of the σ(i)’s.

    (d) [18 points] The following items will use the files q2x.npy which contains the inputs (x(i)) and q2y.npy which contains the outputs (y(i)) for a linear regression problem, with one training example per row.

        i. [4 points] Implement (unweighted) linear regression (y = wT x) on this dataset (using the closed-form solution we learned in lecture; remember to include the intercept term). Plot on the same figure the data (each data sample can be shown as a point (x(i), y(i)) in the figure) and the straight line resulting from your fit. (A sketch covering parts (i) to (iii) is given after part (d).)

        ii. [8 points] Implement locally weighted linear regression on this dataset (using the weighted normal equations you derived in part (b)), and plot on the same figure the data and the curve resulting from your fit. When evaluating local regression at a query point x (which is real-valued in this problem), use weights

$$r^{(i)} = \exp\left(-\frac{\left(x - x^{(i)}\right)^2}{2\tau^2}\right)$$

with a bandwidth parameter τ = 0.8. (Again, remember to include the intercept term.)

    iii. [6 points] Repeat (ii) four times with τ = 0.1, 0.3, 2 and 10. Comment briefly on what happens to the fit when τ is too small or too large.
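A minimal sketch covering parts (i) to (iii) follows. It assumes the weighted normal equations you derive in part (b) and evaluates the locally weighted fit pointwise over a grid of query points; the grid resolution, τ list, and output filename are arbitrary choices for illustration.

    # Sketch: unweighted linear regression and locally weighted linear regression.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.load('q2x.npy').ravel()
    y = np.load('q2y.npy').ravel()
    X = np.stack([np.ones_like(x), x], axis=1)            # include the intercept term

    # (i) unweighted closed form: w = (X^T X)^{-1} X^T y
    w_lin = np.linalg.solve(X.T @ X, X.T @ y)

    def lwr_predict(x_query, tau):
        # Solve one weighted least-squares problem per query point.
        r = np.exp(-(x_query - x) ** 2 / (2.0 * tau ** 2))
        R = np.diag(r)
        w = np.linalg.solve(X.T @ R @ X, X.T @ R @ y)     # weighted normal equations
        return np.array([1.0, x_query]) @ w

    xs = np.linspace(x.min(), x.max(), 200)
    plt.scatter(x, y, s=10, label='data')
    plt.plot(xs, w_lin[0] + w_lin[1] * xs, label='unweighted fit')
    for tau in [0.1, 0.3, 0.8, 2, 10]:                    # (ii) tau = 0.8, (iii) the rest
        plt.plot(xs, [lwr_predict(q, tau) for q in xs], label=f'LWR, tau={tau}')
    plt.legend()
    plt.savefig('q2.png')                                 # example output filename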

    3. [22 points] Derivation and Proof

(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function h(x) = w1x + w0 with parameters w0 and w1 to minimize the sum squared error $L = \frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - h(x^{(i)})\right)^2$ for N pairs of data samples (x(i), y(i)). Derive the solution for w0 and w1 for this 1D case of linear regression. Show the derivation to get the solution

$$w_0 = \bar{Y} - w_1 \bar{X} \qquad\text{and}\qquad w_1 = \frac{\frac{1}{N}\sum_{i=1}^{N} x^{(i)} y^{(i)} - \bar{Y}\bar{X}}{\frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)}\right)^2 - \bar{X}^2},$$

where $\bar{X}$ is the mean of {x(1), x(2), · · · , x(N)} and $\bar{Y}$ is the mean of {y(1), y(2), · · · , y(N)}.

(b) [14 points] Consider the definition and properties of a positive (semi-)definite matrix. Let A be a real, symmetric d × d matrix. A is positive semi-definite (PSD) if, for all z ∈ Rd, zT Az ≥ 0. A is positive definite (PD) if, for all z ≠ 0, zT Az > 0. We write A ⪰ 0 when A is PSD, and A ≻ 0 when A is PD. The spectral theorem says that every real symmetric matrix A can be expressed via the spectral decomposition A = UΛUT , where U is a d × d matrix such that UUT = UT U = I and Λ = diag(λ1, λ2, · · · , λd). Multiplying on the right by U, we see that AU = UΛ. If we let ui denote the i-th column of U, we have Aui = λiui for each i. Then the λi are the eigenvalues of A, and the corresponding columns of U are the eigenvectors associated with λi. The eigenvalues constitute the “spectrum” of A, and the spectral decomposition is also called the eigenvalue decomposition of A.

        i. [6 points] Prove A is PD iff λi > 0 for each i.

        ii. [8 points] Consider the linear regression problem where Φ and y are as defined in class and the closed-form solution is (ΦT Φ)−1ΦT y. We can get the eigenvalues of the symmetric matrix ΦT Φ using the spectral decomposition. We apply ridge regression, and the symmetric matrix in its solution is ΦT Φ + βI. Prove that ridge regression has the effect of shifting all singular values by a constant β, and that for any β > 0, ridge regression makes the matrix ΦT Φ + βI PD.






Code Submission Instructions

Your solutions to Q1 and Q2 should be written in single Python programs q1.py and q2.py, respectively. Each program should be runnable with the following command line: python3 q1.py (or python3 q2.py) and produce outputs and plots for all subproblems. The program should read the data files (i.e., *.npy) from the same (current) working directory: for example, X_train = np.load('q1xTrain.npy').


The program should print all necessary outputs (e.g., computed coefficients) to standard output (stdout). For plots, it is okay to save figures as multiple files (e.g., q1-b.png). There are no requirements on the filenames or formats, as long as the program produces valid outputs. Be sure to include all outputs in your writeup.
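For example, a bare-bones q1.py skeleton consistent with these instructions might look as follows (the internal structure is an assumption; only the data loading, stdout printing, and figure saving are actually required):

    # q1.py skeleton: read data from the current directory, print results, save plots.
    import numpy as np
    import matplotlib.pyplot as plt

    def main():
        X_train = np.load('q1xTrain.npy')
        y_train = np.load('q1yTrain.npy')
        X_test = np.load('q1xTest.npy')
        y_test = np.load('q1yTest.npy')

        # ... compute coefficients for each subproblem ...
        # print all required outputs to stdout, e.g.
        #   print('1(a)i batch GD coefficients:', w_bgd)
        # and save each figure to its own file, e.g.
        #   plt.savefig('q1-b.png')

    if __name__ == '__main__':
        main()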

Please upload your code files (q1.py and q2.py) to Gradescope. Additional python files are allowed.

It is fine to include your outputs in your submission, but this is not required (we will re-run the code).

However, please DO NOT include data files (*.npy) into your Gradescope submission.

Credits

Some questions adopted/adapted from Stanford CS229 materials and from Bishop PRML.











































