Assignment #1 solution

IMPORTANT!!!!! Before proceeding, please carefully read the homework instructions:

www.cs.ubc.ca/~schmidtm/Courses/540-W18/assignments.pdf




We will deduct 50% on assignments that do not follow the instructions.




Most of the questions below are related to topics covered in CPSC 340, or other courses listed on the prerequisite form. There are several "notes" available on the webpage which can help with some relevant background.




If you find this assignment to be difficult overall, that is an early warning sign that you may not be prepared to take CPSC 540 at this time. Future assignments will be longer and more difficult than this one.




We use blue to highlight the deliverables that you must answer/do/submit with the assignment.







Basic Information




Name:



Student ID:



Graduate students in CPSC/EECE/STAT must submit the prerequisite form as part of a1sol.zip: https://www.cs.ubc.ca/~schmidtm/Courses/540_prereqs.pdf






1 Very-Short Answer Questions



Give a short and concise 1-2 sentence answer to the questions below.




Why was I unimpressed when a student who thought they did not need to take CPSC 340 said they were able to obtain 97% training accuracy (on their high-dimensional supervised learning problem) as the main result of their MSc thesis?



What is the difference between a test set error and the test error?



Suppose that a famous person in the machine learning community is advertising their "extremely-deep convolutional fuzzy-genetic Hilbert-long-short recurrent neural network" classifier, which has 500 hyper-parameters. This person claims that if you take 10 different famous (and very-difficult) datasets, and tune the 500 hyper-parameters based on each dataset's validation set, that you can beat the current best-known validation set error on all 10 datasets. Explain whether or not this amazing claim is likely to be meaningful.



In a parametric model, what is the effect of the number of training examples n that our model uses on the training error and on the approximation error (the difference between the training error and test error)?



Give a way to set the random tree depth in a random forest model that makes the model parametric, and a choice that makes the model non-parametric.












In the regression setting, the popular software XGBoost uses the squared error at the leaves of its regression tree, which is different than the "number of training errors" ($\sum_{i=1}^n (\hat{y}_i \neq y_i)$) we used for decision trees in 340. Why does it use the squared error instead of the number of training errors?



Describe a situation where it could be better to use gradient descent than Newton's method (known as IRLS in statistics) to fit the parameters of a logistic regression model.



How does $\lambda$ in an L2-regularizer affect the sparsity pattern of the solution (number of $w_j$ set to exactly 0), the training error, and the approximation error?



Minimizing the squared error used by k-means clustering is NP-hard. Given this, does it make sense that the standard k-means algorithm is easily able to find a local optimum?



Suppose that a matrix X is non-singular. What is the relationship between the condition number of the matrix, $\kappa(X)$, and the matrix L2-norm of the matrix, $\|X\|_2$?



How many regression weights do we have in a multi-class logistic regression problem with k classes?



Give a supervised learning scenario where you would use the sigmoid likelihood and one where you would use a Poisson likelihood.



Suppose we need to multiply a huge number of matrices to compute a product like $A_1 A_2 A_3 \cdots A_k$. The matrices have wildly-different sizes so the order of multiplication will affect the runtime (e.g., $A_1(A_2 A_3)$ may be faster to compute than $(A_1 A_2) A_3$). Describe (at a high level) an $O(k^3)$-time algorithm that finds the lowest-cost order to multiply the matrices.



You have a supervised learning dataset $\{X, y\}$. You fit a 1-hidden-layer neural network using stochastic gradient descent to minimize the squared error, that makes predictions of the form $\hat{y}_i = v^\top W x_i$ where $W$ and $v$ are the parameters. You find that this gives the same training error as using the linear model ($\hat{y}_i = w^\top x_i$) that minimizes the squared error. You thought the accuracy might be higher for the neural network. Explain why this result is or is not reasonable.



Is it possible that the neural network and training procedure from the previous question results in a higher training error than the linear least squares model? Is it possible that it results in a lower training error?



What are two reasons that convolutional neural networks overfit less than classic neural networks?






2 Calculation Questions



2.1 Minimizing Strictly-Convex Quadratic Functions




Solve for the minimizer w of the strictly-convex quadratic functions below:




1. $f(w) = \frac{1}{2}\|w - u\|_\Lambda^2$ (projection of u onto the real space under the quadratic norm defined by $\Lambda$).



2. $f(w) = \frac{1}{2\sigma^2}\|Xw - y\|^2 + \frac{1}{2} w^\top \Lambda w$ (ridge regression with known variance $\sigma^2$ and weighted L2-regularization).





3. $f(w) = \frac{1}{2}\sum_{i=1}^n v_i (w^\top x_i - y_i)^2 + \frac{1}{2}(w - u)^\top \Sigma (w - u)$ (weighted least squares shrunk towards u).

Above we use our usual supervised learning notation. In addition, we assume that u is d x 1 and v is n x 1, while $\Lambda$ and $\Sigma$ are symmetric positive-definite d x d matrices. You can use V as a diagonal matrix with v along the diagonal (with the $v_i$ non-negative). Hint: positive-definite matrices are invertible.
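As a reminder of the technique (a sketch using plain least squares, which is not one of the three objectives above): write the gradient in matrix notation, set it to zero, and solve the resulting linear system,

$$f(w) = \frac{1}{2}\|Xw - y\|^2, \qquad \nabla f(w) = X^\top(Xw - y) = 0 \;\Longrightarrow\; w = (X^\top X)^{-1} X^\top y,$$

assuming $X^\top X$ is invertible. For the objectives above, positive-definiteness of the matrices (see the hint) plays the same role.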



















2.2 Norm Inequalities




Show that the following inequalities hold for vectors $w \in \mathbb{R}^d$, $u \in \mathbb{R}^d$, and $X \in \mathbb{R}^{n \times d}$:




$\|w\|_\infty \le \|w\|_2 \le \|w\|_1$ (relationship between decreasing p-norms)



$\frac{1}{2}\|w + u\|_2^2 \le \|w\|_2^2 + \|u\|_2^2$ ("not the triangle inequality" inequality)



$\|X\|_2 \le \|X\|_F$ (matrix norm induced by the L2-norm is smaller than the Frobenius norm)



You should use the definitions of the norms, but should not use the known equivalences between these norms (since these are the things you are trying to prove). Hint: for many of these it's easier if you work with squared values (and you may need to "complete the square"). Beyond non-negativity of norms, it may also help to use the Cauchy-Schwartz inequality, to use that $\|x\|_1 = x^\top \mathrm{sign}(x)$, and to use that $\sum_{i=1}^n \sum_{j=1}^d x_{ij}^2 = \sum_{c=1}^{\min\{n,d\}} \sigma_c(X)^2$ (where $\sigma_c(X)$ is singular value c of X).
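As an illustration of the "work with squared values" hint, here is a small lemma of the same flavour (not itself one of the three required inequalities) that follows purely from non-negativity of norms:

$$0 \le \|w - u\|_2^2 = \|w\|_2^2 - 2\,w^\top u + \|u\|_2^2 \quad\Longrightarrow\quad 2\,w^\top u \le \|w\|_2^2 + \|u\|_2^2.$$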


2.3 MAP Estimation











In 340, we showed that under the assumptions of a Gaussian likelihood and Gaussian prior,

$$y_i \sim \mathcal{N}(w^\top x_i, 1), \qquad w_j \sim \mathcal{N}\!\left(0, \frac{1}{\lambda}\right),$$

the MAP estimate is equivalent to solving the L2-regularized least squares problem

$$f(w) = \frac{1}{2}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^d w_j^2,$$

in the "loss plus regularizer" framework. For each of the alternate assumptions below, write it in the "loss plus regularizer" framework (simplifying as much as possible):




1. Laplace likelihood (with a scale of 1) for each training example and Gaussian prior with separate variance $\sigma_j^2$ for each variable,

$$y_i \sim \mathcal{L}(w^\top x_i, 1), \qquad w_j \sim \mathcal{N}(0, \sigma_j^2).$$




2. Robust student-t likelihood and Gaussian prior centered at u.
























$$p(y_i \mid x_i, w) = \frac{1}{\sqrt{\nu}\, B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{(w^\top x_i - y_i)^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad w_j \sim \mathcal{N}\!\left(u_j, \frac{1}{\lambda}\right),$$

where u is d x 1, B is the "Beta" function, and the parameter $\nu$ is called the "degrees of freedom".





3. We use a Poisson-distributed likelihood (for the case where $y_i$ represents counts), and we use a uniform prior for some constant $\beta$,

$$p(y_i \mid w^\top x_i) = \frac{\exp(y_i w^\top x_i)\exp(-\exp(w^\top x_i))}{y_i!}, \qquad p(w_j) \propto \beta.$$

(This prior is "improper" since $w \in \mathbb{R}^d$ but it doesn't integrate to 1 over this domain, but nevertheless the posterior will be a proper distribution.)




For this question, you do not need to convert to matrix notation.







(Note: the Student-t likelihood above is more robust than the Laplace likelihood, but leads to a non-convex objective.)
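For reference, the general recipe behind all three parts (assuming the $y_i$ are independent given $x_i$ and $w$, and the $w_j$ are independent under the prior, as in 340) is to take the negative logarithm of the posterior:

$$\hat{w} \in \arg\max_w \; p(w \mid X, y) = \arg\max_w \; \left[\prod_{i=1}^n p(y_i \mid x_i, w)\right]\left[\prod_{j=1}^d p(w_j)\right] = \arg\min_w \; \left[-\sum_{i=1}^n \log p(y_i \mid x_i, w) - \sum_{j=1}^d \log p(w_j)\right],$$

so the negative log-likelihood gives the "loss", the negative log-prior gives the "regularizer", and additive constants that do not depend on $w$ can be dropped.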









2.4 Gradients and Hessian in Matrix Notation




Express the gradient $\nabla f(w)$ and Hessian $\nabla^2 f(w)$ of the following functions in matrix notation, simplifying as much as possible:




1. Regularized and tilted least squares,

$$f(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2 + w^\top u,$$

where u is d x 1.




2. L2-regularized weighted least squares with non-Euclidean quadratic regularization,

$$f(w) = \frac{1}{2}\sum_{i=1}^n v_i (w^\top x_i - y_i)^2 + \frac{1}{2}\sum_{i=1}^d \sum_{j=1}^d w_i w_j \Lambda_{ij},$$

where you can use V as a matrix with the $v_i$ along the diagonal and $\Lambda$ as a positive-definite d x d (symmetric) matrix with $\Lambda_{ij}$ in position (i, j).




3. Squared hinge loss,

$$f(w) = \frac{1}{2}\sum_{i=1}^n \max\{0, 1 - y_i w^\top x_i\}^2.$$




Hint: You can use the results from the linear and quadratic gradients and Hessians notes to simplify the derivations. You can use 0 to represent the zero vector or a matrix of zeroes and I to denote the identity matrix. It will help to convert the second question to matrix notation first. For the last question you'll need to define new vectors to express the gradient and Hessian in matrix notation, and you can use $\circ$ as element-wise multiplication of vectors. As a sanity check, make sure that your results have the right dimension.
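As a format reference (this is the standard least-squares result of the kind covered in the gradients and Hessians notes, not one of the three answers above), for plain least squares we have

$$f(w) = \frac{1}{2}\|Xw - y\|^2 \quad\Longrightarrow\quad \nabla f(w) = X^\top(Xw - y), \qquad \nabla^2 f(w) = X^\top X,$$

and a quick dimension check confirms that $\nabla f(w)$ is d x 1 and $\nabla^2 f(w)$ is d x d.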







3 Coding Questions



If you have not previously used Julia, there is a list of useful Julia commands (and syntax) among the list of notes on the course webpage.







3.1 Regularization and Hyper-Parameter Tuning




Download a1.zip from the course webpage, and start Julia (latest version) in a directory containing the extracted files. If you run the script example_nonLinear, it will:







Load a one-dimensional regression dataset.



Fit a least-squares linear regression model.



Report the test set error.



Draw a figure showing the training/testing data and what the model looks like.



This script uses the JLD package to load the data and the PyPlot package to make the plot. If you have not previously used these packages, they can be installed using:

using Pkg
Pkg.add("JLD")
Pkg.add("PyPlot")

(Note: last term, several people (eventually including myself) had a runtime problem with PyPlot on some systems. This seems to be fixed using the answer of K. Gkinis at this URL: https://stackoverflow.com/questions/46399480/julia-runtime-error-when-using-pyplot)




Unfortunately, this is not a great model of the data, and the figure shows that a linear model is probably not suitable.




Write a function called leastSquaresRBFL2 that implements least squares using Gaussian radial basis functions (RBFs) and L2-regularization.

You should start from the leastSquares function and use the same conventions: n refers to the number of training examples, d refers to the number of features, X refers to the data matrix, y refers to the targets, Z refers to the data matrix after the change of basis, and so on. Note that you'll have to add two additional input arguments ($\lambda$ for the regularization parameter and $\sigma$ for the Gaussian RBF variance) compared to the leastSquares function. To make your code easier to understand/debug, you may want to define a new function rbfBasis which computes the Gaussian RBFs for a given training set, testing set, and $\sigma$ value. Hand in your function and the plot generated with $\lambda = 1$ and $\sigma = 1$.
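For reference, here is a minimal sketch of what such a function might look like, assuming the leastSquares-style conventions described above. The provided starter code may use a different interface (e.g., returning a model object rather than a closure), so treat the names and structure below as illustrative only:

using LinearAlgebra

# Gaussian RBF features: Z[i,j] = exp(-||X1[i,:] - X2[j,:]||^2 / (2*sigma^2)).
# (The distancesSquared function in misc.jl could replace the explicit loop.)
function rbfBasis(X1, X2, sigma)
    n1, n2 = size(X1, 1), size(X2, 1)
    D = [sum((X1[i, :] .- X2[j, :]).^2) for i in 1:n1, j in 1:n2]
    return exp.(-D ./ (2 * sigma^2))
end

# L2-regularized least squares in the Gaussian RBF basis; returns a prediction function.
function leastSquaresRBFL2(X, y, lambda, sigma)
    Z = rbfBasis(X, X, sigma)                 # n-by-n basis matrix
    w = (Z' * Z + lambda * I) \ (Z' * y)      # regularized normal equations
    predict(Xtest) = rbfBasis(Xtest, X, sigma) * w
    return predict
end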




When dealing with larger datasets, an important issue is the dependence of the computational cost on the number of training examples n and the number of features d. What is the cost in big-O notation of training the model on n training examples with d features under (a) the linear basis, and (b) Gaussian RBFs (for a fixed $\sigma$)? What is the cost of classifying t new examples under these two bases? Assume that multiplication by an n by d matrix costs O(nd) and that solving a d by d linear system costs $O(d^3)$.




Modify the script to split the training data into a "train" and "validation" set (you can use half the examples for training and half for validation), and use these to select $\lambda$ and $\sigma$. Hand in your modified script and the plot you obtain with the best values of $\lambda$ and $\sigma$.
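One possible structure for the split-and-tune logic, assuming leastSquaresRBFL2 returns a prediction function as in the earlier sketch; the function name tuneRBF and the candidate grids are illustrative, not prescribed by the assignment:

using Random

function tuneRBF(X, y)
    n = size(X, 1)
    perm = randperm(n)                        # shuffle before splitting
    trainNdx = perm[1:div(n, 2)]
    validNdx = perm[div(n, 2)+1:end]
    bestErr, bestLambda, bestSigma = Inf, 1.0, 1.0
    for lambda in 10.0 .^ (-4:2), sigma in 10.0 .^ (-2:2)
        predict = leastSquaresRBFL2(X[trainNdx, :], y[trainNdx], lambda, sigma)
        err = sum((predict(X[validNdx, :]) .- y[validNdx]).^2) / length(validNdx)
        if err < bestErr
            bestErr, bestLambda, bestSigma = err, lambda, sigma
        end
    end
    return bestLambda, bestSigma
end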



The reasons why this dataset is particularly well-suited to Gaussian RBFs are that (i) the period of the oscillations stays constant and (ii) we have evenly sampled the training data across its domain. If either of these assumptions is violated, the performance with our Gaussian RBFs might be much worse. Consider a scenario where either (i) or (ii) is violated, and describe a way that you could address this problem.



Note: the distancesSquared function in misc.jl is a vectorized way to quickly compute the squared Euclidean distance between all pairs of rows in two matrices.




3.2 Multi-Class Logistic Regression




The script example_multiClass.jl loads a multi-class classification dataset and fits a "one-vs-all" logistic regression classifier, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts some of the classes.







Using a one-vs-all classifier hurts performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix W. An alternative to this independent model is to use the softmax probability,

$$p(y_i \mid W, x_i) = \frac{\exp(w_{y_i}^\top x_i)}{\sum_{c=1}^k \exp(w_c^\top x_i)}.$$

Here $c$ is a possible label and $w_c$ is column $c$ of $W$. Similarly, $y_i$ is the training label and $w_{y_i}$ is column $y_i$ of $W$. The loss function corresponding to the negative logarithm of the softmax probability is given by

$$f(W) = \sum_{i=1}^n \left[ -w_{y_i}^\top x_i + \log\left( \sum_{c'=1}^k \exp(w_{c'}^\top x_i) \right) \right].$$








Make a new function, softmaxClassifier, which fits W using the softmax loss from the previous section instead of fitting k independent classifiers. Hand in the code and report the validation error.




Hint: you can use the derivativeCheck option when calling findMin to help you debug the gradient of the softmax loss. Also, note that the findMin function treats the parameters as a vector (you may want to use reshape when writing the softmax objective).
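A rough sketch of the softmax objective and gradient in the (value, gradient) form that optimizers like findMin typically expect; the exact findMin interface in a1.zip may differ, and the name softmaxObj is illustrative rather than a drop-in solution:

# wVec is W flattened into a (d*k)-vector; y contains integer labels in 1:k.
function softmaxObj(wVec, X, y, k)
    n, d = size(X)
    W = reshape(wVec, d, k)            # column c holds the weights for class c
    Z = X * W                          # n-by-k matrix of scores w_c' x_i
    M = maximum(Z, dims=2)             # max-subtraction for a stable log-sum-exp
    logsum = M .+ log.(sum(exp.(Z .- M), dims=2))
    f = sum(logsum) - sum(Z[CartesianIndex.(1:n, y)])
    P = exp.(Z .- logsum)              # n-by-k matrix of softmax probabilities
    for i in 1:n
        P[i, y[i]] -= 1                # subtract the indicator of the true class
    end
    g = X' * P                         # d-by-k gradient, one column per class
    return f, g[:]                     # flatten the gradient to match wVec
end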







3.3 Robust and Brittle Regression




The script example_outliers.jl loads a one-dimensional regression dataset that has a non-trivial number of "outlier" data points. These points do not fit the general trend of the rest of the data, and pull the least squares model away from the main cluster of points. One way to improve the performance in this setting is simply to remove or downweight the outliers. However, in high dimensions it may be difficult to determine whether points are indeed outliers (or the errors might simply be heavy-tailed). In such cases, it is preferable to replace the squared error with an error that is more robust to outliers.







Write a new function, leastAbsolutes(X,y), that adds a bias variable and fits a linear regression model by minimizing the absolute error instead of the squared error,

$$f(w) = \|Xw - y\|_1.$$

You should turn this into a linear program as shown in class, and you can solve this linear program using the linprog function in the MathProgBase package. Hand in the new function and the updated plot.
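For reference, the usual reformulation (presumably the one shown in class) introduces one auxiliary variable per example, so the linear program's variables are $w \in \mathbb{R}^d$ and $r \in \mathbb{R}^n$:

$$\min_{w \in \mathbb{R}^d,\; r \in \mathbb{R}^n} \;\sum_{i=1}^n r_i \quad \text{subject to} \quad -r_i \le w^\top x_i - y_i \le r_i, \quad i = 1, \dots, n,$$

which is equivalent to minimizing $\|Xw - y\|_1$ because each $r_i$ is pushed down to $|w^\top x_i - y_i|$ at the optimum.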




The previous question assumes that the "outliers" are points that we don't want to model. But what if we want good performance in the worst case across all examples (including the outliers)? In this setting we may want to consider a "brittle" regression method that chases outliers in order to improve the worst-case error. For example, consider minimizing the absolute error in the worst case,

$$f(w) = \|Xw - y\|_\infty.$$




This objective function is non-smooth because of the absolute value function as well as the max function.




Show how to formulate this non-smooth optimization problem as a linear program.




Write and hand in a function, leastMax, that fits this model using linprog (after adding a bias variable). Hand in the new function and the updated plot.



To use the linprog function, you can use:




using MathProgBase, GLPKMathProgInterface
solution = linprog(c,A,d,b,lb,ub,GLPKSolverLP())
x = solution.sol




This requires installing the appropriate packages, and finds a vector x minimizing the function $c^\top x$ subject to $d \le Ax \le b$ and $lb \le x \le ub$. You can set values of c to 0 for variables that don't affect the cost function, and you can set values in b/d/lb/ub (or the other variables) to appropriately infinite values if there are no lower/upper bounds. The vectors c, d, b, lb, and ub should all be lists.
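As a tiny sanity check of the interface described above (assuming the GLPK packages are installed), the following solves a toy problem, minimize $x_1 + 2x_2$ subject to $x_1 + x_2 \ge 1$ and $x \ge 0$, whose solution should be $x = (1, 0)$:

using MathProgBase, GLPKMathProgInterface
c = [1.0, 2.0]            # objective coefficients
A = [1.0 1.0]             # one linear constraint row
d = [1.0]                 # lower bound on A*x
b = [Inf]                 # no upper bound on A*x
lb = [0.0, 0.0]           # x >= 0
ub = [Inf, Inf]           # no upper bounds on x
solution = linprog(c, A, d, b, lb, ub, GLPKSolverLP())
x = solution.sol          # expected: [1.0, 0.0]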


































