Goals. The goal of this exercise is to
- Build a recommendation system, and learn to evaluate its performance.
- Implement and understand matrix factorization using SGD.
- Implement and understand the alternating least-squares (ALS) algorithm.
- Choose the appropriate number of factors, as well as the regularization parameters, and compare your system to a few baselines.
Setup, data and sample code. Obtain the folder labs/ex10 of the course github repository
github.com/epfml/ML_course
We will use the dataset movielens100k.csv in this exercise. We have provided sample code templates that already contain useful snippets of the code required for this exercise.
Visualization and Creating Train-Test Splits
Since our goal is to predict the unseen ratings, we can create a test set by "punching holes" in the matrix, i.e.
we randomly select some of the ratings in the matrix as test points, while we leave the rest for training.
Figure 1 shows the number of ratings for each user and each movie. We can see that the number of ratings varies quite a lot among users and movies. This is very typical of datasets for recommendation systems, and of Big Data in general.
Exercise 1:
Fill in the notebook function split_data() to split the dataset into train and test sets. We only consider users and movies that have more than 10 ratings, and discard all other users and movies. Among the remaining valid ratings, you should randomly select each rating to become a test rating with probability 10%, and a training rating with probability 90%. This way, we keep some minimum amount of training data for users and movies that have few ratings. A minimal sketch of such a split follows below.
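As a starting point, here is a minimal sketch of how such a split could look, assuming the ratings are stored as a SciPy sparse matrix of shape D x N (movies x users). The function name split_data matches the notebook, but the exact signature and the parameter names (min_num_ratings, p_test, seed) are our own choices; the template may differ:

```python
import numpy as np
import scipy.sparse as sp

def split_data(ratings, min_num_ratings=10, p_test=0.1, seed=988):
    """Split a sparse D x N rating matrix (movies x users) into train/test,
    keeping only users and movies with more than `min_num_ratings` ratings."""
    num_items_per_user = np.asarray((ratings != 0).sum(axis=0)).flatten()
    num_users_per_item = np.asarray((ratings != 0).sum(axis=1)).flatten()

    # keep only sufficiently active users and sufficiently rated movies
    valid_users = np.where(num_items_per_user > min_num_ratings)[0]
    valid_items = np.where(num_users_per_item > min_num_ratings)[0]
    valid_ratings = ratings[valid_items, :][:, valid_users]

    train = sp.lil_matrix(valid_ratings.shape)
    test = sp.lil_matrix(valid_ratings.shape)

    rng = np.random.default_rng(seed)
    nz_rows, nz_cols = valid_ratings.nonzero()
    for d, n in zip(nz_rows, nz_cols):
        # each observed rating goes to the test set with probability p_test
        if rng.random() < p_test:
            test[d, n] = valid_ratings[d, n]
        else:
            train[d, n] = valid_ratings[d, n]
    return valid_ratings, train, test
```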
We can visualize the resulting split as in Figure 2.
Performance Evaluation
We will use the root mean-square error (RMSE) to measure performance. Given the true test ratings $x_{dn}$ for the $d$-th movie and $n$-th user, and our prediction $\hat{x}_{dn} = W_{d,:}\,(Z_{n,:})^\top = (W Z^\top)_{dn}$, we compute the RMSE as follows:
\[
\text{RMSE}(W, Z) := \left[ \frac{1}{|\Omega|} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2 \right]^{1/2} \tag{1}
\]
where $W \in \mathbb{R}^{D \times K}$ and $Z \in \mathbb{R}^{N \times K}$. Here $\Omega \subseteq [D] \times [N]$ is the set of the indices of the observed ratings of the input matrix $X$. The RMSE can be computed either on the training set $\Omega_{\text{train}}$, or on a held-out test set $\Omega_{\text{test}}$.
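As a concrete reference, here is a minimal NumPy sketch of this computation over the observed entries, assuming a sparse D x N rating matrix and dense factor matrices of shapes (D, K) and (N, K). The helper name compute_rmse is our own; the template may name it differently:

```python
import numpy as np

def compute_rmse(data, W, Z):
    """RMSE over the observed entries of a sparse D x N rating matrix,
    with movie features W of shape (D, K) and user features Z of shape (N, K)."""
    rows, cols = data.nonzero()                 # the index set Omega
    preds = np.sum(W[rows] * Z[cols], axis=1)   # (W Z^T)_{dn} for each (d, n)
    truths = np.asarray(data[rows, cols]).flatten()
    return np.sqrt(np.mean((truths - preds) ** 2))
```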
Figure 1: The left plot shows the (ordered) number of ratings for each user, while the right plot shows the (ordered) number of ratings for each movie. Both are shown in descending order for clarity.
Figure 2: This figure shows the obtained train-test split. The left plot shows the training data and the right plot shows the test data. In each plot, a dot indicates a user-movie pair with a non-zero rating.
Baseline Models
We will use the following models, which predict using means, as baselines:
\[
\text{Global Mean:} \quad \hat{x} := \frac{1}{|\Omega|} \sum_{(d,n) \in \Omega} x_{dn} = \frac{1}{|\Omega|} \sum_{n=1}^{N} \sum_{d \in \Omega_{:n}} x_{dn} = \frac{1}{|\Omega|} \sum_{d=1}^{D} \sum_{n \in \Omega_{d:}} x_{dn} \tag{2}
\]
\[
\text{User Mean:} \quad \hat{x}_n := \frac{1}{|\Omega_{:n}|} \sum_{d \in \Omega_{:n}} x_{dn} \tag{3}
\]
\[
\text{Movie Mean:} \quad \hat{x}_d := \frac{1}{|\Omega_{d:}|} \sum_{n \in \Omega_{d:}} x_{dn} \tag{4}
\]
where $\Omega$ is the set of non-zero rating indices $(d,n)$ in the training data matrix, $\Omega_{:n}$ is the set of movies rated by the $n$-th user, and $\Omega_{d:}$ is the set of users who have rated the $d$-th movie.
Exercise 2:
We will compare the above three baselines first.
Before implementing, think about the following: which of the three models will give the best performance, and why?
Implement the notebook functions baseline_global_mean(), baseline_user_mean() and baseline_item_mean(). (A minimal sketch is given after this exercise.)
Compare the resulting models. Which model gives you the lowest training RMSE? Which one gives the lowest test RMSE?
Hint: You can change the random seed of your train/test split, and thereby generate several estimates.
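Here is a minimal NumPy sketch of the first two baselines, assuming the sparse train/test matrices produced by split_data above; the function names match the notebook, but the exact signatures in the template may differ. baseline_item_mean() is symmetric, averaging over rows (movies) instead of columns (users):

```python
import numpy as np

def baseline_global_mean(train, test):
    """Test RMSE when every rating is predicted by the global training mean."""
    global_mean = train[train.nonzero()].mean()
    truths = np.asarray(test[test.nonzero()]).flatten()
    return np.sqrt(np.mean((truths - global_mean) ** 2))

def baseline_user_mean(train, test):
    """Test RMSE when each rating is predicted by that user's training mean."""
    num_movies, num_users = train.shape
    user_means = np.zeros(num_users)
    for n in range(num_users):
        col = train[:, n]
        nz = col[col.nonzero()]        # observed ratings of user n
        if nz.size > 0:
            user_means[n] = nz.mean()
    rows, cols = test.nonzero()
    truths = np.asarray(test[rows, cols]).flatten()
    return np.sqrt(np.mean((truths - user_means[cols]) ** 2))
```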
Matrix Factorization with SGD
Exercise 3:
Task: Implement Stochastic Gradient Descent
Derive the full gradient $\nabla_{(W,Z)} f(W,Z)$ for $f = \text{RMSE}(W,Z)$. Note that since we have $(D+N) \cdot K$ variables, the gradient will have $(D+N) \cdot K$ entries (each corresponding to an entry of $W$ or $Z$ respectively).
Derive a stochastic gradient $G$ using the sum structure of $f$ over the elements. We want to do this in such a way that $G$ only depends on a single observed rating $(d,n) \in \Omega$.
Implement Stochastic Gradient Descent as described in the lecture, for our objective function being the RMSE as defined in Equation (1). (A minimal sketch is given after this exercise.)
Implement SGD for the regularized cost function
\[
\mathcal{L}(W, Z) := \frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2 + \frac{\lambda_w}{2} \|W\|_{\text{Frob}}^2 + \frac{\lambda_z}{2} \|Z\|_{\text{Frob}}^2
\]
where $\lambda_w, \lambda_z \geq 0$ are scalars.
Experimentally find the best stepsize to obtain the lowest training error value.
Does the test error also decrease monotonically during optimization, or does it increase again after some time? Try different choices of $K$.
(OPTIONAL: Can you speed up your code, for example by maintaining the set of values $(W Z^\top)_{dn}$ for the few observed values $(d,n) \in \Omega$, thereby avoiding the computation of the full matrix product $W Z^\top$ in every step?)
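A minimal sketch of such an SGD loop for the regularized cost, assuming dense factor matrices W (D x K) and Z (N x K) and a fixed stepsize; the parameter names (gamma, lambda_w, lambda_z, num_epochs) are our own choices, not prescribed by the template:

```python
import numpy as np

def matrix_factorization_sgd(train, K=20, gamma=0.01, lambda_w=0.1,
                             lambda_z=0.1, num_epochs=20, seed=988):
    """SGD on the regularized cost L(W, Z); train is a sparse D x N matrix."""
    rng = np.random.default_rng(seed)
    D, N = train.shape
    W = rng.standard_normal((D, K)) / np.sqrt(K)   # movie features
    Z = rng.standard_normal((N, K)) / np.sqrt(K)   # user features

    omega = list(zip(*train.nonzero()))            # observed (d, n) pairs
    for epoch in range(num_epochs):
        rng.shuffle(omega)
        for d, n in omega:
            # prediction error for this single observed rating
            err = train[d, n] - W[d, :] @ Z[n, :]
            # stochastic gradient steps on the rows w_d and z_n;
            # here the regularizer is applied at every sampled entry,
            # a common simplification (rescaling it is a valid choice too)
            w_d = W[d, :].copy()
            W[d, :] += gamma * (err * Z[n, :] - lambda_w * w_d)
            Z[n, :] += gamma * (err * w_d - lambda_z * Z[n, :])
    return W, Z
```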
Alternating Least Squares
Exercise 4:
Theory Question: Alternating Least Squares with Missing Entries.
1. Mathematically derive the precise updates for $W$ and $Z$, for the Alternating Least Squares (ALS) algorithm in the more general setting, where only the ratings $(d,n) \in \Omega$ contribute to the cost, i.e.
\[
\mathcal{L}(W, Z) := \frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2
\]
Hint: Compute the gradient with respect to each group of variables, and set to zero.
2. Do the same for the regularized cost function
\[
\mathcal{L}(W, Z) := \frac{1}{2} \sum_{(d,n) \in \Omega} \big[ x_{dn} - (W Z^\top)_{dn} \big]^2 + \frac{\lambda_w}{2} \|W\|_{\text{Frob}}^2 + \frac{\lambda_z}{2} \|Z\|_{\text{Frob}}^2
\]
where $\lambda_w, \lambda_z \geq 0$ are scalars.
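As a check for your derivation (stated here without proof, and writing $w_d, z_n \in \mathbb{R}^K$ for the $d$-th row of $W$ and the $n$-th row of $Z$), setting the gradient with respect to $z_n$ to zero for fixed $W$ should yield a $K \times K$ linear system of the form
\[
z_n = \Big( \sum_{d \in \Omega_{:n}} w_d w_d^\top + \lambda_z I_K \Big)^{-1} \sum_{d \in \Omega_{:n}} x_{dn} \, w_d ,
\]
and symmetrically for $w_d$ with $Z$ fixed, with $\lambda_w$ in place of $\lambda_z$ and the sum running over $n \in \Omega_{d:}$.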
Exercise 5:
From our provided template, implement the ALS algorithm for regularized matrix completion, as in the objective function defined above. (A minimal sketch follows below.)
Step 1 Initialize the item matrix W by setting its first column to the average rating of the corresponding movie, and the remaining entries to small random numbers;
Step 2 Fix W, solve for Z by minimizing the objective function (the regularized sum of squared errors);
Step 3 Fix Z, solve for W by minimizing the objective function similarly;
Step 4 Repeat Steps 2 and 3 until a stopping criterion is satisfied.
Try different values of the latent dimension K.
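A minimal sketch of this procedure, assuming the sparse train matrix from above and the normal-equation updates from Exercise 4; the stopping criterion (change in training RMSE below a tolerance) and all parameter names are our own choices, not prescribed by the template:

```python
import numpy as np

def matrix_factorization_als(train, K=20, lambda_w=0.1, lambda_z=0.7,
                             tol=1e-4, max_iters=50, seed=988):
    """ALS for regularized matrix completion; train is a sparse D x N matrix."""
    rng = np.random.default_rng(seed)
    D, N = train.shape
    W = 0.1 * rng.standard_normal((D, K))   # movie features
    Z = 0.1 * rng.standard_normal((N, K))   # user features
    # Step 1: first column of W holds each movie's average training rating
    sums = np.asarray(train.sum(axis=1)).flatten()
    counts = np.asarray((train != 0).sum(axis=1)).flatten()
    W[:, 0] = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

    rows, cols = train.nonzero()
    x = np.asarray(train[rows, cols]).flatten()
    prev_rmse = np.inf
    for it in range(max_iters):
        # Step 2: fix W, solve the K x K normal equations for each user n
        for n in range(N):
            d_idx = rows[cols == n]
            if d_idx.size == 0:
                continue
            A = W[d_idx].T @ W[d_idx] + lambda_z * np.eye(K)
            b = W[d_idx].T @ x[cols == n]
            Z[n] = np.linalg.solve(A, b)
        # Step 3: fix Z, solve for each movie d symmetrically
        for d in range(D):
            n_idx = cols[rows == d]
            if n_idx.size == 0:
                continue
            A = Z[n_idx].T @ Z[n_idx] + lambda_w * np.eye(K)
            b = Z[n_idx].T @ x[rows == d]
            W[d] = np.linalg.solve(A, b)
        # Step 4: stop once the training RMSE no longer improves much
        preds = np.sum(W[rows] * Z[cols], axis=1)
        rmse = np.sqrt(np.mean((x - preds) ** 2))
        if abs(prev_rmse - rmse) < tol:
            break
        prev_rmse = rmse
    return W, Z
```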