Starting from:
$30

$24

Assignment 4

Rubric: {mechanics:5}

IMPORTANT!!! Before proceeding, please carefully read the general homework instructions at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission instructions. You can ignore the words “mechanics”, “reasoning”, etc.

We use blue to highlight the deliverables that you must answer/do/submit with the assignment.


1    Convex Functions

Rubric: {reasoning:5}

Recall that convex loss functions are typically easier to minimize than non-convex functions, so it’s important to be able to identify whether a function is convex.

Show that the following functions are convex:

1.
f (w) = ↵w2
w  +  with w 2 R, ↵   0,
2 R,  2 R (1D quadratic).
2.
f (w) =   log(↵w) with ↵ > 0 and w > 0 (“negative logarithm”)
3.
f (w) = kXw
yk1 +
λ
kwk1 with w 2 Rd,

0 (L1-regularized robust regression).



2



4.
f (w) =
n





yiwT xi)) with w
Rd (logistic regression).


i=1 log(1 + exp(



P
i=1
{
|



|}

2 k
k2
2 R

5.
f (w) = P
n
[max
0,

wT xi
yi

✏] +
λ

w2
2 with w

d, ✏   0,    0 (support vector regression).
















General hint: for the first two you can check that the second derivative is non-negative since they are one-dimensional. For the last 3 you’ll have to use some of the results regarding how combining convex functions can yield convex functions which can be found in the lecture slides.

Hint for part 4 (logistic regression): this function may seem non-convex since it contains log(z) and log is concave, but there is a flaw in that reasoning: for example log(exp(z)) = z is convex despite containing a log. To show convexity, you can reduce the problem to showing that log(1 + exp(z)) is convex, which can
be done by computing the second derivative. It may simplify matters to note that
exp(z)
=
1

.

1+exp(z)

1+exp(
z)









2    Logistic Regression with Sparse Regularization

If you run python main.py -q 2, it will:

    1. Load a binary classification dataset containing a training and a validation set.

    2. ‘Standardize’ the columns of X and add a bias variable (in utils.load dataset).

    3. Apply the same transformation to Xvalidate (in utils.load dataset).




1

    4. Fit a logistic regression model.

    5. Report the number of features selected by the model (number of non-zero regression weights).

    6. Report the error on the validation set.

Logistic regression does reasonably well on this dataset, but it uses all the features (even though only the prime-numbered features are relevant) and the validation error is above the minimum achievable for this model (which is 1 percent, if you have enough data and know which features are relevant). In this question, you will modify this demo to use di↵erent forms of regularization to improve on these aspects.

Note: your results may vary a bit depending on versions of Python and its libraries.


2.1    L2-Regularization

Rubric: {code:2}






Make a new class, logRegL2, that takes an input parameter
and fits a logistic regression model with
L2-regularization. Specifically, while logReg computes w by minimizing


n






Xi




f (w) =
log(1 + exp(  yiwT xi)),


=1




your new function logRegL2 should compute w by minimizing



n




Xi




f (w) =
log(1 + exp(
yiwT xi))
+

kwk2.




2

=1






Hand in your updated code. Using this new code with
= 1, report how the following quantities change:

the training error, the validation error, the number of features used, and the number of gradient descent iterations.

Note: as you may have noticed, lambda is a special keyword in Python and therefore we can’t use it as a variable name. As an alternative we humbly suggest lammy, which is what Mike’s niece calls her stu↵ed animal toy lamb. However, you are free to deviate from this suggestion. In fact, as of Python 3 one can now use actual greek letters as variable names, like the symbol. But, depending on your text editor, it may be annoying to input this symbol.


2.2    L1-Regularization

Rubric: {code:3}

Make a new class, logRegL1, that takes an input parameter and fits a logistic regression model with L1-regularization,
n


Xi


f (w) =
log(1 + exp(
yiwT xi)) +  kwk1.
=1


Hand in your updated code. Using this new code with
= 1, report how the following quantities change:

the training error, the validation error, the number of features used, and the number of gradient descent iterations.

You should use the function minimizers.findMinL1, which implements a proximal-gradient method to mini-mize the sum of a di↵erentiable function g and kwk1,
f (w) = g(w) +    kwk1.

2

This function has a similar interface to findMin, EXCEPT that (a) you only pass in the the function/gradient of the di↵erentiable part, g, rather than the whole function f ; and (b) you need to provide the value . To reiterate, your funObj should not contain the L1 regularization term; rather it should only implement the function value and gradient for the training error term. The reason is that the optimizer handles the non-smooth L1 regularization term in a specialized way (beyond the scope of CPSC 340).


2.3    L0-Regularization

Rubric: {code:4}

The class logRegL0 contains part of the code needed to implement the forward selection algorithm, which approximates the solution with L0-regularization,

n


Xi


f (w) =

log(1 + exp(  yiwT xi)) +  kwk0.
=1



The for loop in this function is missing the part where we fit the model using the subset selected new, then compute the score and updates the minLoss/bestFeature. Modify the for loop in this code so that it fits the model using only the features selected new, computes the score above using these features, and updates the minLoss/bestFeature variables. Hand in your updated code. Using this new code with = 1, report the training error, validation error, and number of features selected.


Note that the code di↵ers a bit from what we discussed in class, since we assume that the first feature is the bias variable and assume that the bias variable is always included. Also, note that for this particular case using the L0-norm with = 1 is equivalent to what is known as the Akaike Information Criterion (AIC) for variable selection.

Also note that, for numerical reasons, your answers may vary depending on exactly what system and package versions you are using. That is fine.


2.4    Discussion

Rubric: {reasoning:2}

In a short paragraph, briefly discuss your results from the above. How do the di↵erent forms of regularization compare with each other? Can you provide some intuition for your results? No need to write a long essay, please!


2.5    Comparison with scikit-learn

Rubric: {reasoning:1}

Compare your results (training error, validation error, number of nonzero weights) for L2 and L1 regulariza-tion with scikit-learn’s LogisticRegression. Use the penalty parameter to specify the type of regularization. The parameter C corresponds to λ1 , so if you had = 1 then use C=1 (which happens to be the default anyway). You should set fit_intercept to False since we’ve already added the column of ones to X and thus there’s no need to explicitly fit an intercept parameter. After you’ve trained the model, you can access the weights with model.coef_.






3
2.6    L12  regularization

Rubric: {reasoning:4}

Previously we’ve considered L2 and L1 regularization which use the L2 and L1 norms respectively. Now consider least squares linear regression with “L 12 regularization” (in quotation marks because the “L 12 norm” is not a true norm):

1
n

d



Xi
yi)2 +
X
f (w) = 2




(wT xi

|wj |1/2 .



=1

j=1

Let’s consider the case of d = 1 and assume there is no intercept term being used, so the loss simplifies to

1
n





Xi
yi)2
p






f (w) = 2




(wxi

+    |w| .



=1





Finally, let’s assume n = 2 where our 2 data points are (x1, y1) = (1, 2) and (x2, y2) = (0, 1).

1.
Plug in the data set values and write the loss in its simplified form, without a summation.
2.
If
= 0, what is the solution, i.e. arg minw f (w)?
3.
If
! 1, what is the solution, i.e., arg minw f (w)?
4.
Plot f (w) when
= 1. What is arg minw f (w) when  = 1? Answer to one decimal place if appropriate.
5.
Plot f (w) when
= 10.  What is arg minw f (w) when   = 10?  Answer to one decimal place if

appropriate.


    6. Does L 12 regularization behave more like L1 regularization or L2 regularization when it comes to performing feature selection? Briefly justify your answer.

7. Is least squares with L 12 regularization a convex optimization problem? Briefly justify your answer.


3    Multi-Class Logistic

If you run python main.py -q 3 the code loads a multi-class classification dataset with yi 2 {0, 1, 2, 3, 4} and fits a ‘one-vs-all’ classification model using least squares, then reports the validation error and shows a plot of the data/classifier. The performance on the validation set is ok, but could be much better. For example, this classifier never even predicts that examples will be in classes 0 or 4.


3.1    Softmax Classification, toy example

Rubric: {reasoning:2}

Linear classifiers make their decisions by finding the class label c maximizing the quantity wcT xi, so we want to train the model to make wyTi xi larger than wcT0 xi for all the classes c0 that are not yi. Here c0 is a possible label and wc0 is row c0 of W . Similarly, yi is the training label, wyi is row yi of W , and in this setting we are assuming a discrete label yi 2 {1, 2, . . . , k}. Before we move on to implementing the softmax classifier to fix the issues raised in the introduction, let’s work through a toy example:







4

Consider the dataset below, which has n = 10 training examples, d = 2 features, and k = 3 classes:

2
1
0
3


2
1
3


6
0
1
7


6
1
7



1
1




2



6
1
1
7


6 7


6


7


6
2
7


6
1
0
7


6
1
7

X =

0
0

,
y =

2

.

6


7


6

7


6
1
0
7


6 7


6


7


6
2
7


6
1
0
7


6 7


6


7


6
3
7


6
1
1
7


6 7


6


7


6
3
7


6
1
0
7


6 7


6


7


6
3
7


6


7


6 7


4


5


4 5

Suppose that you want to classify the following test example:

⇥   ⇤ xˆ =  1  1 .

Suppose we fit a multi-class linear classifier using the softmax loss, and we obtain the following weight

matrix:
2
+2
1
3





W =

+2
2


4
+3
1
5
Under this model, what class label would we assign to the test example? (Show your work.)


3.2    One-vs-all Logistic Regression

Rubric: {code:2}

Using the squared error on this problem hurts performance because it has ‘bad errors’ (the model gets penalized if it classifies examples ‘too correctly’). Write a new class, logLinearClassifier, that replaces the squared loss in the one-vs-all model with the logistic loss. Hand in the code and report the validation error.

3.3    Softmax Classifier Gradient

Rubric: {reasoning:5}

Using a one-vs-all classifier can hurt performance because the classifiers are fit independently, so there is no attempt to calibrate the columns of the matrix W . As we discussed in lecture, an alternative to this independent model is to use the softmax loss, which is given by


n
"

k
!# ,
f (W ) =
i=1

wyTi xi + log
exp(wcT0 xi)





=1


X


cX0

Show that the partial derivatives of this function, which make up its gradient, are given by the following expression:



n

@f
Xi


=xij [p(yi = c | W, xi)   I(yi = c)] ,

@Wcj



=1
where...


• I(yi = c) is the indicator function (it is 1 when yi = c and 0 otherwise)

5
    • p(yi = c | W, xi) is the predicted probability of example i being class c, defined as

p(yi = c | W, xi) =

exp(wcT xi)

P
k
T


c0 =1 exp(wc0 xi)
3.4    Softmax Classifier Implementation

Rubric: {code:5}

Make a new class, softmaxClassifier, which fits W using the softmax loss from the previous section instead of fitting k independent classifiers. Hand in the code and report the validation error.

Hint: you may want to use utils.check_gradient to check that your implementation of the gradient is correct.

Hint: with softmax classification, our parameters live in a matrix W instead of a vector w. However, most optimization routines like scipy.optimize.minimize, or the optimization code we provide to you, are set up to optimize with respect to a vector of parameters. The standard approach is to “flatten” the matrix W into a vector (of length kd, in this case) before passing it into the optimizer. On the other hand, it’s inconvenient to work with the flattened form everywhere in the code; intuitively, we think of it as a matrix W and our code will be more readable if the data structure reflects our thinking. Thus, the approach we recommend is to reshape the parameters back and forth as needed. The funObj function is directly communicating with the optimization code and thus will need to take in a vector. At the top of funObj you can immediately reshape the incoming vector of parameters into a k ⇥ d matrix using np.reshape. You can then compute the gradient using sane, readable code with the W matrix inside funObj. You’ll end up with a gradient that’s also a matrix: one partial derivative per element of W . Right at the end of funObj, you can flatten this gradient matrix into a vector using grad.flatten(). If you do this, the optimizer will be sending in a vector of parameters to funObj, and receiving a gradient vector back out, which is the interface it wants – and your funObj code will be much more readable, too. You may need to do a bit more reshaping elsewhere, but this is the key piece.


3.5    Comparison with scikit-learn, again

Rubric: {reasoning:1}

Compare your results (training error and validation error for both one-vs-all and softmax) with scikit-learn’s LogisticRegression, which can also handle multi-class problems. One-vs-all is the default; for softmax, set multi_class=’multinomial’. For the softmax case, you’ll also need to change the solver. You can use solver=’lbfgs’. Since your comparison code above isn’t using regularization, set C very large to e↵ectively disable regularization. Again, set fit_intercept to False for the same reason as above (there is already a column of 1’s added to the data set).


3.6    Cost of Multinomial Logistic Regression

Rubric: {reasoning:2}

Assume that we have

    • n training examples.

    • d features.

    • k classes.


6

    • t testing examples.

    • T iterations of gradient descent for training.

Also assume that we take X and form new features Z using Gaussian RBFs as a non-linear feature trans-formation.

    1. In O() notation, what is the cost of training the softmax classifier with gradient descent?

    2. What is the cost of classifying the t test examples?
˜
Hint: you’ll need to take into account the cost of forming the basis at training (Z) and test (Z) time. It will be helpful to think of the dimensions of all the various matrices.


4    Very-Short Answer Questions

Rubric: {reasoning:12}

    1. Suppose that a client wants you to identify the set of “relevant” factors that help prediction. Why shouldn’t you promise them that you can do this?

    2. Consider performing feature selection by measuring the “mutual information” between each column of X and the target label y, and selecting the features whose mutual information is above a certain threshold (meaning that the features provides a sufficient number of “bits” that help in predicting the label values). Without delving into any details about mutual information, what is a potential problem with this approach?

    3. What is a setting where you would use the L1-loss, and what is a setting where you would use L1-regularization?

    4. Among L0-regularization, L1-regularization, and L2-regularization: which yield convex objectives? Which yield unique solutions? Which yield sparse solutions?

5. What is the e↵ect of in L1-regularization on the sparsity level of the solution? What is the e↵ect of on the two parts of the fundamental trade-o↵?

    6. Suppose you have a feature selection method that tends not generate false positives but has many false negatives (it misses relevant variables). Describe an ensemble method for feature selection that could improve the performance of this method.

    7. Suppose a binary classification dataset has 3 features. If this dataset is “linearly separable”, what does this precisely mean in three-dimensional space?

    8. When searching for a good w for a linear classifier, why do we use the logistic loss instead of just minimizing the number of classification errors?

    9. What are “support vectors” and what’s special about them?

    10. What is a disadvantage of using the perceptron algorithm to fit a linear classifier?

    11. Why we would use a multi-class SVM loss instead of using binary SVMs in a one-vs-all framework?

12. How does the hyper-parameter a↵ect the shape of the Gaussian RBFs bumps? How does it a↵ect the fundamental tradeo↵?








7

More products