
Caltech CS/CNS/EE 155: Machine Learning & Data Mining
Set 1 Solutions (January 4th, 2022)

1. Basics [16 Points]

Relevant materials: lecture 1

Answer each of the following problems with 1-2 short sentences.

Problem A [2 points]: What is a hypothesis set?


Solution A:



Problem B [2 points]: What is the hypothesis set of a linear model?


Solution B:



Problem C [2 points]: What is overfitting?


Solution C:



Problem D [2 points]: What are two ways to prevent overfitting?


Solution D:



Problem E [2 points]: What are training data and test data, and how are they used differently? Why should you never change your model based on information from test data?


Solution E:



Problem F [2 points]: What are the two assumptions we make about how our dataset is sampled?


Solution F:



Problem G [2 points]: Consider the machine learning problem of deciding whether or not an email is spam. What could X, the input space, be? What could Y, the output space, be?









Solution G:



Problem H [2 points]: What is the k-fold cross-validation procedure?


Solution H:






















































2. Bias-Variance Tradeoff [34 Points]

Relevant materials: lecture 1


Problem A [5 points]: Derive the bias-variance decomposition for the squared error loss function. That is, show that for a model $f_S$ trained on a dataset $S$ to predict a target $y(x)$ for each $x$,

$$\mathbb{E}_S[E_{\text{out}}(f_S)] = \mathbb{E}_x[\text{Bias}(x) + \text{Var}(x)],$$

given the following definitions:

$$F(x) = \mathbb{E}_S[f_S(x)]$$

$$E_{\text{out}}(f_S) = \mathbb{E}_x\left[(f_S(x) - y(x))^2\right]$$

$$\text{Bias}(x) = (F(x) - y(x))^2$$

$$\text{Var}(x) = \mathbb{E}_S\left[(f_S(x) - F(x))^2\right]$$



Solution A:


In the following problems you will explore the bias-variance tradeoff by producing learning curves for polynomial regression models.

A learning curve for a model is a plot showing both the training error and the cross-validation error as a function of the number of points in the training set. These plots provide valuable information regarding the bias and variance of a model and can help determine whether a model is over- or under-fitting.

Polynomial regression is a type of regression that models the target y as a degree-d polynomial function of the input x. (The modeler chooses d.) You don't need to know how it works for this problem, just know that it produces a polynomial that attempts to fit the data.
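Concretely, for a chosen degree d, the fitted model has the form

$$f(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d,$$

with the coefficients $w_0, \ldots, w_d$ chosen to fit the data (NumPy's polyfit does this by least squares).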

Problem B [14 points]: Use the provided 2_notebook.ipynb Jupyter notebook to enter your code for this question. This notebook contains examples of using NumPy's polyfit and polyval methods and scikit-learn's KFold method; you may find it helpful to read through and run this example code before continuing with this problem. Additionally, you may find it helpful to look at the documentation for scikit-learn's learning_curve function for some guidance.


The dataset bv_data.csv is provided and has a header denoting which columns correspond to which values. Using this dataset, plot learning curves for 1st-, 2nd-, 6th-, and 12th-degree polynomial regression (4 separate plots) by following these steps for each degree d ∈ {1, 2, 6, 12} (a short code sketch of the per-N computation appears after these steps):

    1. For each N ∈ {20, 25, 30, 35, · · · , 100}:

        i. Perform 5-fold cross-validation on the first N points in the dataset (setting aside the other points), computing both the training and validation errors for each fold.





            ▪ Use the mean squared error loss as the error function.

            ▪ Use NumPy's polyfit method to perform the degree-d polynomial regression and NumPy's polyval method to help compute the errors. (See the example code and NumPy documentation for details.)

            ▪ When partitioning your data into folds, although in practice you should randomize your partitions, for the purposes of this set, simply divide the data into K contiguous blocks.

        ii. Compute the average of the training and validation errors from the 5 folds.

    2. Create a learning curve by plotting both the average training and validation error as functions of N. Hint: Use the same y-axis scale for all degrees d.
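As a rough illustration only (not the notebook's required structure), the computation for a single value of N could look like the sketch below. It assumes the dataset has already been loaded into 1-D NumPy arrays x and y; the name learning_curve_point is purely illustrative.

import numpy as np
from sklearn.model_selection import KFold

def learning_curve_point(x, y, degree, n_folds=5):
    # Average training and validation MSE from K-fold CV on the given points.
    kf = KFold(n_splits=n_folds)  # shuffle=False: contiguous blocks, as required above
    train_errs, val_errs = [], []
    for train_idx, val_idx in kf.split(x):
        # Fit a degree-d polynomial to the training folds only.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        train_pred = np.polyval(coeffs, x[train_idx])
        val_pred = np.polyval(coeffs, x[val_idx])
        train_errs.append(np.mean((train_pred - y[train_idx]) ** 2))
        val_errs.append(np.mean((val_pred - y[val_idx]) ** 2))
    return np.mean(train_errs), np.mean(val_errs)

For each degree d and each N, one would call learning_curve_point(x[:N], y[:N], d) and plot the two returned averages against N.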



Solution B:



Problem C [3 points]: Based on the learning curves, which polynomial regression model (i.e. which degree polynomial) has the highest bias? How can you tell?


Solution C:



Problem D [3 points]: Which model has the highest variance? How can you tell?


Solution D:



Problem E [3 points]: What does the learning curve of the quadratic model tell you about how much the model would improve if we had additional training points?


Solution E:



Problem F [3 points]: Why is training error generally lower than validation error?


Solution F:



Problem G [3 points]: Based on the learning curves, which model would you expect to perform best on some unseen data drawn from the same distribution as the training data, and why?








Solution G:




























































3. Stochastic Gradient Descent [36 Points]

Relevant materials: lecture 2

Stochastic gradient descent (SGD) is an important optimization method in machine learning, used everywhere from logistic regression to training neural networks. In this problem, you will be asked to first implement SGD for linear regression using the squared loss function. Then, you will analyze how several parameters affect the learning process.

Linear regression learns a model of the form:

$$f(x_1, x_2, \cdots, x_d) = \sum_{i=1}^{d} w_i x_i + b$$
Problem A [2 points]: We can make our algebra and coding simpler by writing $f(x_1, x_2, \cdots, x_d) = w^T x$ for vectors $w$ and $x$. But at first glance, this formulation seems to be missing the bias term $b$ from the equation above. How should we define $x$ and $w$ such that the model includes the bias term?

Hint: Include an additional element in w and x.


Solution A:


Linear regression learns a model by minimizing the squared loss function $L$, which is the sum across all training data $\{(x_1, y_1), \cdots, (x_N, y_N)\}$ of the squared differences between actual and predicted output values:

$$L(f) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$$

Problem B [2 points]: SGD uses the gradient of the loss function to make incremental adjustments to the weight vector w. Derive the gradient of the squared loss function with respect to w for linear regression.


Solution B:


The following few problems ask you to work with the first of two provided Jupyter notebooks for this problem, 3_notebook_part1.ipynb, which includes tools for gradient descent visualization. This notebook utilizes the files sgd_helper.py and multiopt.mp4, but you should not need to modify either of these files.

For your implementation of problems C-E, do not consider the bias term.

Problem C [8 points]: Implement the loss, gradient, and SGD functions, defined in the notebook, to perform SGD, using the guidelines below (a rough sketch of one possible structure appears after this list):

• Use a squared loss function.


    • Terminate the SGD process after a specified number of epochs, where each epoch performs one SGD iteration for each point in the dataset.

    • It is recommended, but not required, that you shuffle the order of the points before each epoch such that you go through the points in a random order. You can use numpy.random.permutation.

    • Measure the loss after each epoch. Your SGD function should output a vector with the loss after each epoch, and a matrix of the weights after each epoch (one row per epoch). Note that the weights from all epochs are stored in order to run subsequent visualization code to illustrate SGD.
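For orientation only, one possible shape for such an SGD routine is sketched below; the notebook defines its own function signatures, so treat the name sgd_linreg and its arguments as illustrative. It assumes X is an N x d NumPy array of inputs and y is the length-N vector of targets.

import numpy as np

def sgd_linreg(X, y, w_init, eta, n_epochs):
    # Plain SGD for linear regression with squared loss.
    # Returns the loss after each epoch and the weights after each epoch.
    w = np.array(w_init, dtype=float)
    losses, weights = [], []
    for _ in range(n_epochs):
        # Visit the points in a random order each epoch (recommended above).
        for i in np.random.permutation(len(y)):
            # Gradient of (y_i - w . x_i)^2 with respect to w.
            grad = -2.0 * (y[i] - X[i] @ w) * X[i]
            w = w - eta * grad
        losses.append(np.sum((y - X @ w) ** 2))  # squared loss summed over the dataset
        weights.append(w.copy())
    return np.array(losses), np.array(weights)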



Solution C: See code.



Problem D [2 points]: Run the visualization code in the notebook corresponding to problem D. How does the convergence behavior of SGD change as the starting point varies? How does this differ between datasets 1 and 2? Please answer in 2-3 sentences.


Solution D:



Problem E [6 points]: Run the visualization code in the notebook corresponding to problem E. Fill in the cell titled "Plotting SGD Convergence" as follows: perform SGD on dataset 1 for each of the learning rates η ∈ {1e-6, 5e-6, 1e-5, 3e-5, 1e-4}. On a single plot, show the training error vs. the number of epochs trained for each of these values of η. What happens as η changes?


Solution E:


The following problems consider SGD with the larger, higher-dimensional dataset, sgd_data.csv. The file has a header denoting which columns correspond to which values. For these problems, use the Jupyter notebook 3_notebook_part2.ipynb.

For your implementation of problems F-H, do consider the bias term using your answer to problem A.

Problem F [6 points]: Use your SGD code with the given dataset, and report your final weights. Follow the guidelines below for your implementation:

    • Use $\eta = e^{-15}$ as the step size.

    • Use w = [0.001, 0.001, 0.001, 0.001] as the initial weight vector and b = 0.001 as the initial bias.

    • Use at least 800 epochs.





    • You should incorporate the bias term in your implementation of SGD and do so in the vector style of problem A.

    • Note that for these problems, it is no longer necessary for the SGD function to store the weights after all epochs; you may change your code to only return the final weights.


Solution F:


Problem G [2 points]: Perform SGD as in the previous problem for each learning rate η in

$\{e^{-10}, e^{-11}, e^{-12}, e^{-13}, e^{-14}, e^{-15}\}$,

and calculate the training error at the beginning of each epoch during training. On a single plot, show training error vs. number of epochs trained for each of these values of η. Explain what is happening.


Solution G:


Problem H [2 points]: The closed form solution for linear regression with least squares is


$$w = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \left( \sum_{i=1}^{N} x_i y_i \right).$$

Compute this analytical solution. Does the result match up with what you got from SGD?
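For reference, this expression can be evaluated in a couple of lines of NumPy. The sketch below assumes X is the N x (d+1) matrix of bias-augmented inputs (one row per x_i, in the style of problem A) and y is the target vector; the function name is illustrative.

import numpy as np

def closed_form_least_squares(X, y):
    # w = (sum_i x_i x_i^T)^{-1} (sum_i x_i y_i), written with matrix products:
    # X^T X = sum_i x_i x_i^T and X^T y = sum_i x_i y_i.
    return np.linalg.solve(X.T @ X, X.T @ y)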


Solution H:


Answer the remaining questions in 1-2 short sentences.

Problem I [2 points]: Is there any reason to use SGD when a closed form solution exists?


Solution I:


Problem J [2 points]: Based on the SGD convergence plots that you generated earlier, describe a stopping condition that is more sophisticated than a pre-defined number of epochs.


Solution J:



Problem K [2 points]: How does the convergence behavior of the weight vector differ between the perceptron and SGD algorithms?




Solution K:




























































4. The Perceptron [14 Points]

Relevant materials: lecture 2

The perceptron is a simple linear model used for binary classification. For an input vector $x \in \mathbb{R}^d$, weights $w \in \mathbb{R}^d$, and bias $b \in \mathbb{R}$, a perceptron $f : \mathbb{R}^d \to \{-1, 1\}$ takes the form

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{d} w_i x_i + b \right)$$

The weights and bias of a perceptron can be thought of as defining a hyperplane that divides $\mathbb{R}^d$ such that each side represents an output class. For example, for a two-dimensional dataset, a perceptron could be drawn as a line that separates all points of class +1 from all points of class −1.

The PLA (or the Perceptron Learning Algorithm) is a simple method of training a perceptron. First, an initial guess is made for the weight vector w. Then, one misclassified point is chosen arbitrarily and the w vector is updated by

$$w_{t+1} = w_t + y^{(t)} x^{(t)}$$

$$b_{t+1} = b_t + y^{(t)},$$

where $x^{(t)}$ and $y^{(t)}$ correspond to the misclassified point selected at the $t$-th iteration. This process continues until all points are classified correctly.
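To make the update rule concrete, a minimal sketch of a single PLA pass is given below; it is not the update_perceptron function required by the notebook. It assumes X is an array with one data point per row and y is the array of ±1 labels.

import numpy as np

def pla_step(w, b, X, y):
    # One PLA iteration: find a misclassified point and apply the update above.
    # Returns the new (w, b) and whether an update was made.
    for x_i, y_i in zip(X, y):
        if np.sign(x_i @ w + b) != y_i:      # misclassified (or on the boundary)
            return w + y_i * x_i, b + y_i, True
    return w, b, False                       # every point is already classified correctly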

The following few problems ask you to work with the provided Jupyter notebook for this problem, titled 4_notebook.ipynb. This notebook utilizes the file perceptron_helper.py, but you should not need to modify this file.

Problem A [8 points]: The graph below shows an example 2D dataset. The + points are in the +1 class and the ◦ point is in the −1 class.















Figure 1: The green + are positive and the red ◦ is negative



Implement the update_perceptron and run_perceptron methods in the notebook, and perform the perceptron algorithm with initial weights $w_1 = 0$, $w_2 = 1$, $b = 0$.

Give your solution in the form of a table showing the weights and bias at each timestep, along with the misclassified point $([x_1, x_2], y)$ that is chosen for the next iteration's update. You can iterate through the three points in any order. Your code should output the values in the table below; cross-check your answer with the table to confirm that your perceptron code is operating correctly.

t    b    w1    w2    x1    x2     y
0    0     0     1     1    -2    +1
1    1     1    -1     0     3    +1
2    2     1     2     1    -2    +1
3    3     2     0












Include in your report both the table that your code outputs and the plots showing the perceptron's classifier at each step (see the notebook for more detail).


Solution A:



Problem B [4 points]: A dataset $S = \{(x_1, y_1), \cdots, (x_N, y_N)\} \subset \mathbb{R}^d \times \mathbb{R}$ is linearly separable if there exists a perceptron that correctly classifies all data points in the set. In other words, there exists a hyperplane that separates the positive data points from the negative data points.

In a 2D dataset, how many data points are in the smallest dataset that is not linearly separable, such that no three points are collinear? How about for a 3D dataset such that no four points are coplanar? Please limit your solution to a few lines; you should justify but not prove your answer.

Finally, how does this generalize to an N-dimensional dataset in which no hyperplane of dimension less than N contains a non-linearly-separable subset? For the N-dimensional case, you may state your answer without proof or justification.


Solution B:



Problem C [2 points]: Run the visualization code in the Jupyter notebook section corresponding to question C (report your plots). Assume a dataset is not linearly separable. Will the Perceptron Learning Algorithm ever converge? Why or why not?


Solution C:



