Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission instructions. You can ignore the words “mechanics”, “reasoning”, etc.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Finding Similar Items
For this question we’ll use the Amazon product data set1 from http://jmcauley.ucsd.edu/data/amazon/. We will focus on the “Patio, Lawn, and Garden” section. You should start by downloading the ratings at https://stanford.io/2Q7QTvu and placing the file in your data directory with the original filename. Once you do that, running python main.py -q 1 should perform the following steps (a rough sketch of these steps appears after the list):
• Load the raw ratings data set into a Pandas dataframe.
• Construct the user-product matrix as a sparse matrix (to be precise, a scipy.sparse.csr_matrix).
• Create bi-directional mappings from the user ID (e.g. “A2VNYWOPJ13AFP”) to the integer index into the rows of X.
• Create bi-directional mappings from the item ID (e.g. “0981850006”) to the integer index into the columns of X.
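For reference, here is a minimal sketch of how these steps might look. It assumes the ratings file is a headerless CSV with columns user, item, rating, timestamp; the filename and variable names are illustrative, and main.py already implements this for you (possibly differently).

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

filename = "../data/ratings_Patio_Lawn_and_Garden.csv"  # adjust to wherever you placed the file
ratings = pd.read_csv(filename, names=["user", "item", "rating", "timestamp"])

# Bi-directional mappings between string IDs and integer indices.
users = ratings["user"].unique()
items = ratings["item"].unique()
user_mapper = {u: i for i, u in enumerate(users)}             # user ID -> row index
item_mapper = {b: j for j, b in enumerate(items)}             # item ID -> column index
user_inverse_mapper = {i: u for u, i in user_mapper.items()}  # row index -> user ID
item_inverse_mapper = {j: b for b, j in item_mapper.items()}  # column index -> item ID

# Sparse user-product matrix: X[i, j] is the star rating user i gave item j (0 if unrated).
row = ratings["user"].map(user_mapper).to_numpy()
col = ratings["item"].map(item_mapper).to_numpy()
X = csr_matrix((ratings["rating"].to_numpy(), (row, col)), shape=(len(users), len(items)))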
1.1 Exploratory data analysis
1.1.1 Most popular item
Rubric: {code:1}
Find the item with the most total stars. Submit the product name and the number of stars.
Note: once you find the ID of the item, you can look up the name by going to the url https://www.amazon.com/dp/ITEM_ID, where ITEM_ID is the ID of the item. For example, the URL for item ID “B00CFM0P7Y” is https://www.amazon.com/dp/B00CFM0P7Y.
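Since the total stars per item are just the column sums of X, a sketch of this step (assuming the X and item_inverse_mapper objects from the loading sketch above) could be:

import numpy as np

total_stars = np.asarray(X.sum(axis=0)).ravel()  # total stars for each item (column sums)
best = int(np.argmax(total_stars))               # column index of the top item
print(item_inverse_mapper[best], total_stars[best])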
1 The author of the data set has asked for the following citations: (1) Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. R. He, J. McAuley. WWW, 2016, and (2) Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
1.1.2 User with most reviews
Rubric: {code:1}
Find the user who has rated the most items, and the number of items they rated.
1.1.3 Histograms
Rubric: {code:2}
Make the following histograms:
1. The number of ratings per user
2. The number of ratings per item
3. The ratings themselves
Note: for the first two, use plt.yscale('log', nonposy='clip') to put the histograms on a log-scale.
Also, you can use X.getnnz to get the total number of nonzero elements along a specific axis.
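A rough sketch of these plots, assuming X and the ratings dataframe from the loading sketch above (note that Matplotlib 3.3 and later renamed the nonposy keyword to nonpositive):

import numpy as np
import matplotlib.pyplot as plt

ratings_per_user = X.getnnz(axis=1)   # number of items each user rated
ratings_per_item = X.getnnz(axis=0)   # number of users who rated each item

plt.hist(ratings_per_user, bins=50)
plt.yscale('log', nonposy='clip')     # use nonpositive='clip' on newer Matplotlib
plt.xlabel('ratings per user')
plt.show()

plt.hist(ratings_per_item, bins=50)
plt.yscale('log', nonposy='clip')
plt.xlabel('ratings per item')
plt.show()

plt.hist(ratings["rating"], bins=np.arange(0.5, 6.0))  # the star ratings themselves (1-5)
plt.xlabel('star rating')
plt.show()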
1.2 Finding similar items with nearest neighbours
Rubric: {code:6}
We’ll use scikit-learn’s neighbors.NearestNeighbors object to find the items most similar to the example item above, namely the Brass Grill Brush 18 Inch Heavy Duty and Extra Strong, Solid Oak Handle, at URL https://www.amazon.com/dp/B00CFM0P7Y.
Find the 5 most similar items to the Grill Brush using the following metrics (Make sure to include the code you have written in your pdf GradeScope submission.):
1. Euclidean distance (the NearestNeighbors default)
2. Normalized Euclidean distance (you’ll need to do the normalization)
3. Cosine similarity (by setting metric='cosine')
Some notes/hints...
• If you run python main.py -q 1.2, it will grab the row of X associated with the grill brush. The mappers take care of going back and forth between the IDs (like “B00CFM0P7Y”) and the indices of the sparse array (0, 1, 2, . . .).
• Keep in mind that scikit-learn’s NearestNeighbors is for taking neighbors across rows, but here we’re working across columns.
• Keep in mind that scikit-learn’s NearestNeighbors will include the query item itself as one of the nearest neighbours if the query item is in the “training set”.
• Normalizing the columns of a matrix would usually be straightforward to implement, but because X is stored as a sparse matrix it’s a bit more of a mess. Therefore, use sklearn.preprocessing.normalize to help you with the normalization in part 2.
Did normalized Euclidean distance and cosine similarity yield the same similar items, as expected?
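A sketch of one way to set this up (assuming X, item_mapper, and item_inverse_mapper from the loading sketch; variable names are illustrative, not the provided code):

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

grill_brush = "B00CFM0P7Y"
j = item_mapper[grill_brush]

XT = X.T.tocsr()   # transpose so that items are rows, since NearestNeighbors works across rows
query = XT[j]

# 1. Euclidean distance (the NearestNeighbors default).
nn = NearestNeighbors(n_neighbors=6).fit(XT)   # 6, since the query item itself is returned
dist, ind = nn.kneighbors(query)

# 2. Normalized Euclidean distance: normalize each item vector to unit length first.
XT_norm = normalize(XT)
nn_norm = NearestNeighbors(n_neighbors=6).fit(XT_norm)
dist_n, ind_n = nn_norm.kneighbors(normalize(query))

# 3. Cosine similarity.
nn_cos = NearestNeighbors(n_neighbors=6, metric='cosine').fit(XT)
dist_c, ind_c = nn_cos.kneighbors(query)

# Drop the query item itself and map the remaining indices back to item IDs.
print([item_inverse_mapper[i] for i in ind[0] if i != j][:5])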
1.3 Total popularity
Rubric: {reasoning:2}
For both Euclidean distance and cosine similarity, find the number of reviews for each of the 5 recommended items and report it. Do the results make sense given what we discussed in class about Euclidean distance vs. cosine similarity and popular items?
Note: in main.py you are welcome to combine this code with your code from the previous part, so that you don’t have to copy/paste all that code in another section of main.py.
2 Matrix Notation and Minimizing Quadratics
2.1 Converting to Matrix/Vector/Norm Notation
Rubric: {reasoning:3}
Using our standard supervised learning notation (X, y, w) express the following functions in terms of vectors, matrices, and norms (there should be no summations or maximums).
1. $\max_{i \in \{1,2,\dots,n\}} |w^\top x_i - y_i|$.

2. $\sum_{i=1}^{n} v_i (w^\top x_i - y_i)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} w_j^2$.

3. $\left( \sum_{i=1}^{n} |w^\top x_i - y_i| \right)^2 + \frac{1}{2} \sum_{j=1}^{d} \lambda_j |w_j|$.
Note that in part 2 we give a weight $v_i$ to each training example and the value $\lambda$ is a non-negative scalar, whereas in part 3 we are regularizing the parameters with different weights $\lambda_j$. You can use $V$ to denote a diagonal matrix that has the values $v_i$ along the diagonal, and $\Lambda$ as a diagonal matrix that has the $\lambda_j$ values along the diagonal. You can assume that all the $v_i$ and $\lambda_j$ values are non-negative.
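As a reminder (these are identities from lecture, not the final answers), conversions like these rely on facts such as:

\[
\sum_{i=1}^{n} (w^\top x_i - y_i)^2 = \|Xw - y\|_2^2,
\qquad
\max_{i \in \{1,\dots,n\}} |w^\top x_i - y_i| = \|Xw - y\|_\infty .
\]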
2.2 Minimizing Quadratic Functions as Linear Systems
Rubric: {reasoning:3}
Write finding a minimizer $w$ of the functions below as a system of linear equations (using vector/matrix notation and simplifying as much as possible). Note that all the functions below are convex, so finding a $w$ with $\nabla f(w) = 0$ is sufficient to minimize the functions (but show your work in getting to this point).
1. $f(w) = \frac{1}{2}\|w - v\|^2$ (projection of $v$ onto real space).

2. $f(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{1}{2} w^\top \Lambda w$ (least squares with weighted regularization).

3. $f(w) = \frac{1}{2}\sum_{i=1}^{n} v_i (w^\top x_i - y_i)^2 + \frac{\lambda}{2}\|w - w^0\|^2$ (weighted least squares shrunk towards non-zero $w^0$).
Above we assume that $v$ and $w^0$ are $d$ by 1 vectors (in part 3, $v$ is a vector of length $n$ by 1), and $\Lambda$ is a $d$ by $d$ diagonal matrix (with positive entries along the diagonal). You can use $V$ as a diagonal matrix containing the $v_i$ values along the diagonal.
Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute these quantities term-wise. As a sanity check for your derivation, make sure that the dimensions match for all quantities/operations; in order to make the dimensions match for some parts you may need to introduce an identity matrix. For example, $X^\top X w + \lambda w$ can be re-written as $(X^\top X + \lambda I) w$.
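The “results from class” referred to here include the gradients of the standard building blocks; a (hopefully familiar) summary, stated without derivation:

\[
\nabla_w \left[ \tfrac{1}{2}\|Xw - y\|^2 \right] = X^\top (Xw - y),
\qquad
\nabla_w \left[ \tfrac{1}{2} w^\top \Lambda w \right] = \Lambda w \ (\Lambda \text{ symmetric}),
\qquad
\nabla_w \left[ w^\top b \right] = b .
\]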
3 Robust Regression and Gradient Descent
If you run python main.py -q 3, it will load a one-dimensional regression dataset that has a non-trivial number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main downward trend that most data points exhibit (see the plot produced by the demo).
Note: we are fitting the regression without an intercept here, just for simplicity of the homework question. In reality one would rarely do this. But here it’s OK because the “true” line passes through the origin (by design). In Question 4.1 we’ll address this explicitly.
3.1 Weighted Least Squares in One Dimension
Rubric: {code:3}
One of the most common variations on least squares is weighted least squares. In this formulation, we have a weight vi for every training example. To fit the model, we minimize the weighted squared error,
\[ f(w) = \frac{1}{2}\sum_{i=1}^{n} v_i (w^\top x_i - y_i)^2 . \]
In this formulation, the model focuses on making the error small for examples $i$ where $v_i$ is high. Similarly, if $v_i$ is low then the model allows a larger error. Note: these weights $v_i$ (one per training example) are completely different from the model parameters $w_j$ (one per feature), which, confusingly, we sometimes also call “weights”.
Complete the model class, WeightedLeastSquares, that implements this model (note that Q2.2.3 asks you to show how a few similar formulations can be solved as a linear system). Apply this model to the data containing outliers, setting $v = 1$ for the first 400 data points and $v = 0.1$ for the last 100 data points (which are the outliers). Hand in your code and the updated plot.
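For orientation, the provided LeastSquares class follows a fit/solve/predict pattern roughly like the sketch below (the class name here is made up, and this is not a copy of the provided code); WeightedLeastSquares keeps the same structure but solves the linear system you derive in Q2.2 for the weighted objective instead.

import numpy as np

class LeastSquaresSketch:
    # Illustrative version of an ordinary least squares class.
    def fit(self, X, y):
        # The minimizer of 0.5*||Xw - y||^2 solves the normal equations X^T X w = X^T y.
        self.w = np.linalg.solve(X.T @ X, X.T @ y)

    def predict(self, X):
        return X @ self.w

For this question, v = np.concatenate([np.ones(400), 0.1 * np.ones(100)]) encodes the example weights described above.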
3.2 Smooth Approximation to the L1-Norm
Rubric: {reasoning:3}
Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that is more robust to outliers. In class, we discussed using the sum of absolute values objective,
\[ f(w) = \sum_{i=1}^{n} |w^\top x_i - y_i| . \]
This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize. Nevertheless, there are various smooth approximations to the absolute value function that are easy to optimize. One possible approximation is to use the log-sum-exp approximation of the max function2:
\[ |r| = \max\{r, -r\} \approx \log(\exp(r) + \exp(-r)). \]
Using this approximation, we obtain an objective of the form
\[ f(w) = \sum_{i=1}^{n} \log\big(\exp(w^\top x_i - y_i) + \exp(y_i - w^\top x_i)\big), \]
which is smooth but less sensitive to outliers than the squared error. Derive the gradient $\nabla f$ of this function with respect to $w$. You should show your work but you do not have to express the final result in matrix notation.
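A quick numeric sanity check (not required for the assignment) of how closely the approximation tracks the absolute value; np.logaddexp is a numerically stable way to evaluate it without overflow:

import numpy as np

r = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
approx = np.log(np.exp(r) + np.exp(-r))   # naive evaluation; can overflow for very large |r|
stable = np.logaddexp(r, -r)              # same value, computed stably
print(np.abs(r))    # [5. 1. 0. 1. 5.]
print(approx)       # close to |r| away from 0; equals log(2), about 0.693, at r = 0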
3.3 Robust Regression
Rubric: {code:3}
The class LinearModelGradient is the same as LeastSquares, except that it fits the least squares model using a gradient descent method. If you run python main.py -q 3.3 you’ll see it produces the same fit as we obtained using the normal equations.
The typical input to a gradient method is a function that, given $w$, returns $f(w)$ and $\nabla f(w)$. See funObj in LinearModelGradient for an example. Note that the fit function of LinearModelGradient also has a numerical check that the gradient code is approximately correct, since implementing gradients is often error-prone.3
An advantage of gradient-based strategies is that they are able to solve problems that do not have closed-form solutions, such as the formulation from the previous section. The class LinearModelGradient has most of the implementation of a gradient-based strategy for fitting the robust regression model under the log-sum-exp approximation. The only part missing is the function and gradient calculation inside the funObj code. Modify funObj to implement the objective function and gradient based on the smooth approximation to the absolute value function (from the previous section). Hand in your code, as well as the plot obtained using this robust regression approach.
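For reference, a function/gradient pair for the plain squared error (the example the text points to in LinearModelGradient) looks roughly like the following; the exact signature in the provided code may differ, so treat this as a sketch of the interface rather than a copy of it. Your task is to replace both the value and the gradient with those of the log-sum-exp objective.

import numpy as np

def funObj(self, w, X, y):
    # Squared-error objective f(w) = 0.5 * ||Xw - y||^2 and its gradient.
    residual = X @ w - y
    f = 0.5 * np.sum(residual ** 2)
    g = X.T @ residual
    return f, g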
2 Other possibilities are the Huber loss, or $|r| \approx \sqrt{r^2 + \epsilon}$ for some small $\epsilon$.
3 Sometimes the numerical gradient checker itself can be wrong. See CPSC 303 for a lot more on numerical differentiation.
4 Linear Regression and Nonlinear Bases
In class we discussed fitting a linear regression model by minimizing the squared error. In this question, you will start with a data set where least squares performs poorly. You will then explore how adding a bias variable and using nonlinear (polynomial) bases can drastically improve the performance. You will also explore how the complexity of a basis affects both the training error and the test error.
4.1 Adding a Bias Variable
Rubric: {code:3}
If you run python main.py -q 4, it will:
1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the training error.
4. Report the test error (on a dataset not used for training).
5. Draw a figure showing the training data and what the linear model looks like.
Unfortunately, this is an awful model of the data. The average squared training error on the data set is over 28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually nowhere near the training data.
The y-intercept of this data is clearly not zero (it looks like it’s closer to 200), so we should expect to improve performance by adding a bias (a.k.a. intercept) variable, so that our model is
\[ y_i = w^\top x_i + w_0 \]
instead of
\[ y_i = w^\top x_i . \]
In file linear_model.py, complete the class LeastSquaresBias, which has the same input/model/predict format as the LeastSquares class, but adds a bias variable (also called an intercept) $w_0$ (also called $\beta$ in lecture). Hand in your new class, the updated plot, and the updated training/test error.
Hint: recall that adding a bias w0 is equivalent to adding a column of ones to the matrix X. Don’t forget that you need to do the same transformation in the predict function.
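A minimal sketch of that hint (variable names are illustrative):

import numpy as np

Z = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)  # prepend a column of ones to X
# Fit on Z instead of X; the component of w corresponding to the ones column is the bias w0.
# Remember to build Z the same way inside predict.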
4.2 Polynomial Basis
Rubric: {code:4}
Adding a bias variable improves the prediction substantially, but the model is still problematic because the target seems to be a non-linear function of the input. Complete the LeastSquaresPoly class, which takes a data vector $x$ (i.e., assuming we only have one feature) and the polynomial order $p$. The function should perform a least squares fit based on a matrix $Z$ where each of its rows contains the values $(x_i)^j$ for $j = 0$ up to $p$. E.g., LeastSquaresPoly.fit(x,y) with $p = 3$ should form the matrix
\[ Z = \begin{bmatrix} 1 & x_1 & (x_1)^2 & (x_1)^3 \\ 1 & x_2 & (x_2)^2 & (x_2)^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & (x_n)^2 & (x_n)^3 \end{bmatrix}, \]
and fit a least squares model based on it. Hand in the new class, and report the training and test error for $p = 0$ through $p = 10$. Explain the effect of $p$ on the training error and on the test error.
Note: you should write the code yourself; don’t use a library like sklearn’s PolynomialFeatures.
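One self-written way to build Z, as a sketch (the helper name is made up; put the equivalent logic inside your class and reuse it in both fit and predict):

import numpy as np

def poly_basis(X, p):
    # X is n x 1; returns the n x (p+1) matrix whose columns are x^0, x^1, ..., x^p.
    n = X.shape[0]
    Z = np.ones((n, p + 1))
    for j in range(1, p + 1):
        Z[:, j] = X[:, 0] ** j
    return Z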
5 Very-Short Answer Questions
Rubric: {reasoning:7}
1. Suppose that a training example is a global outlier, meaning it is really far from all other data points. How is the cluster assignment of this example set by k-means? And how is it set by density-based clustering?
2. Why do we need random restarts for k-means but not for density-based clustering?
3. Can hierarchical clustering find non-convex clusters?
4. For model-based outlier detection, list an example method and a problem with identifying outliers using this method.
5. For graphical-based outlier detection, list an example method and a problem with identifying outliers using this method.
6. For supervised outlier detection, list an example method and a problem with identifying outliers using this method.
7. If we want to do linear regression with 1 feature, explain why it would or would not make sense to use gradient descent to compute the least squares solution.
8. Why do we typically add a column of 1 values to X when we do linear regression? Should we do this if we’re using decision trees?
9. If a function is convex, what does that say about stationary points of the function? Does convexity imply that a stationary point exists?
10. Why do we need gradient descent for the robust regression problem, as opposed to just using the normal equations? Hint: it is NOT because of the non-differentiability. Recall that we used gradient descent even after smoothing away the non-differentiable part of the loss.
11. What is the problem with having too small of a learning rate in gradient descent?
12. What is the problem with having too large of a learning rate in gradient descent?
13. What is the purpose of the log-sum-exp function and how is this related to gradient descent?
14. What type of non-linear transform might be suitable if we had a periodic function?