Homework 2 Solution

Conceptual Questions [10 points]

A0. These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

    a. [2 points] Suppose that your estimated model for predicting house prices has a large positive weight on "number of bathrooms". Does this imply that if we remove the feature "number of bathrooms" and refit the model, the new predictions will be strictly worse than before? Why?

    b. [2 points] Compared to an L2 norm penalty, explain why an L1 norm penalty is more likely to result in a larger number of zeros in the weight vector.

    c. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer: $\sum_i |w_i|^{0.5}$.

    d. [1 point] True or False: If the step-size for gradient descent is too large, it may not converge.

    e. [2 points]  In your own words, describe why SGD works.

    f. [2 points] In at most one sentence each, state one possible advantage of SGD (stochastic gradient descent) over GD (gradient descent) and one possible disadvantage of SGD relative to GD.
Convexity and Norms [30 points]

A1. A norm $\|\cdot\|$ over $\mathbb{R}^n$ is defined by the properties: i) non-negativity: $\|x\| \geq 0$ for all $x \in \mathbb{R}^n$, with equality if and only if $x = 0$; ii) absolute scalability: $\|a\,x\| = |a|\,\|x\|$ for all $a \in \mathbb{R}$ and $x \in \mathbb{R}^n$; iii) triangle inequality: $\|x + y\| \leq \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$.

    a. [3 points] Show that $f(x) = \left(\sum_{i=1}^{n} |x_i|\right)$ is a norm. (Hint: begin by showing that $|a + b| \leq |a| + |b|$ for all $a, b \in \mathbb{R}$.)

    b. [2 points] Show that $g(x) = \left(\sum_{i=1}^{n} |x_i|^{1/2}\right)^2$ is not a norm. (Hint: it suffices to find two points in $n = 2$ dimensions such that the triangle inequality does not hold.)

Context: norms are often used in regularization to encourage specific behaviors of solutions. If we define $\|x\|_p := \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$ then one can show that $\|x\|_p$ is a norm for all $p \geq 1$. The important cases of $p = 2$ and $p = 1$ correspond to the penalty for ridge regression and the lasso, respectively.

B1. [6 points] For any $x \in \mathbb{R}^n$, define the following norms: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$, $\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2}$, and $\|x\|_\infty := \lim_{p \to \infty} \|x\|_p = \max_{i=1,\dots,n} |x_i|$. Show that $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$.

A2. [3 points] A set $A \subseteq \mathbb{R}^n$ is convex if $\lambda x + (1 - \lambda) y \in A$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

[Figure: three grey-shaded sets, labeled I-III, with points a, b, c, d marked.]

For each of the grey-shaded sets above (I-III), state whether each one is convex, or state why it is not convex using any of the points $a, b, c, d$ in your answer.

A3. [4 points] We say a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex on a set $A$ if $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y \in A$ and $\lambda \in [0, 1]$.

For each of the grey-colored functions below (I-III), state whether each one is convex on the given interval or state why not with a counterexample using any of the points $a, b, c, d$ in your answer.

[Figure: three panels I-III, each showing a grey-colored function with points a, b, c, d marked on the x-axis.]


    a. Function in panel I on $[a, c]$

    b. Function in panel II on $[a, c]$

    c. Function in panel III on $[a, d]$

    d. Function in panel III on $[c, d]$







B2. Use just the definitions above and let $\|\cdot\|$ be a norm.

    a. [3 points] Show that $f(x) = \|x\|$ is a convex function.

    b. [3 points] Show that $\{x \in \mathbb{R}^n : \|x\| \leq 1\}$ is a convex set.

    c. [2 points] Draw a picture of the set $\{(x_1, x_2) : g(x_1, x_2) \leq 4\}$ where $g(x_1, x_2) = \left(|x_1|^{1/2} + |x_2|^{1/2}\right)^2$. (This is the function considered in 1b above specialized to $n = 2$.) We know $g$ is not a norm. Is the defined set convex? Why not?

Context: It is a fact that a function $f$ defined over a set $A \subseteq \mathbb{R}^n$ is convex if and only if the set $\{(x, z) \in \mathbb{R}^{n+1} : z \geq f(x), x \in A\}$ is convex. Draw a picture of this for yourself to be sure you understand it.



B3. For $i = 1, \dots, n$ let $\ell_i(w)$ be convex functions over $w \in \mathbb{R}^d$ (e.g., $\ell_i(w) = (y_i - w^\top x_i)^2$), let $\|\cdot\|$ be any norm, and let $\lambda > 0$.

    a. [3 points] Show that

$$\sum_{i=1}^{n} \ell_i(w) + \lambda \|w\|$$

is convex over $w \in \mathbb{R}^d$. (Hint: Show that if $f, g$ are convex functions, then $f(x) + g(x)$ is also convex.)

    b. [1 point] Explain in one sentence why we prefer to use loss functions and regularized loss functions that are convex.

Lasso [45 points]

Given $\lambda > 0$ and data $(x_1, y_1), \dots, (x_n, y_n)$, the Lasso is the problem of solving

$$\arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \sum_{i=1}^{n} (x_i^\top w + b - y_i)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

$\lambda$ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method of Algorithm 1 that can solve the Lasso problem.

You may use common computing packages (such as NumPy or SciPy), but do not use an existing Lasso solver (e.g., of scikit-learn).

Before you get started, here are some hints that you may find helpful:



Algorithm 1: Coordinate Descent Algorithm for Lasso

while not converged do
    $b \leftarrow \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} w_j x_{i,j} \right)$
    for $k \in \{1, 2, \dots, d\}$ do
        $a_k \leftarrow 2 \sum_{i=1}^{n} x_{i,k}^2$
        $c_k \leftarrow 2 \sum_{i=1}^{n} x_{i,k} \left( y_i - \left( b + \sum_{j \neq k} w_j x_{i,j} \right) \right)$
        $w_k \leftarrow \begin{cases} (c_k + \lambda)/a_k & \text{if } c_k < -\lambda \\ 0 & \text{if } c_k \in [-\lambda, \lambda] \\ (c_k - \lambda)/a_k & \text{if } c_k > \lambda \end{cases}$
    end
end
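For concreteness, here is a minimal NumPy sketch of one sweep of Algorithm 1. The names X (the n x d data matrix), y, w, b, and lam are assumptions of this sketch; the updates follow the pseudocode above, recomputing the full residual for clarity rather than speed.

import numpy as np

def coordinate_descent_sweep(X, y, w, b, lam):
    # One full pass of Algorithm 1: update b, then each w_k in turn.
    n, d = X.shape
    b = np.mean(y - X @ w)               # offset update (b is not regularized)
    a = 2 * np.sum(X ** 2, axis=0)       # a_k = 2 * sum_i x_{i,k}^2, constant within a sweep
    for k in range(d):
        # Residual with feature k removed: y_i - (b + sum_{j != k} w_j x_{i,j})
        r = y - (b + X @ w - X[:, k] * w[k])
        c_k = 2 * X[:, k] @ r
        # Soft-thresholding step for w_k.
        if c_k < -lam:
            w[k] = (c_k + lam) / a[k]
        elif c_k > lam:
            w[k] = (c_k - lam) / a[k]
        else:
            w[k] = 0.0
    return w, b

A real implementation would maintain the residual incrementally instead of recomputing X @ w for every coordinate, as the speed hints below suggest.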




For-loops can be slow whereas vector/matrix computation in Numpy is very optimized; exploit this as much as possible.

The pseudocode provided has many opportunities to speed up computation by precomputing quantities like ak before the for loop. These small changes can speed things up considerably.

As a sanity check, ensure the objective value is nonincreasing with each step.

It is up to you to decide on a suitable stopping condition. A common criterion is to stop when no element of $w$ changes by more than some small $\delta$ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.

You will need to solve the Lasso on the same dataset for many values of $\lambda$. This is called a regularization path. One way to do this efficiently is to start at a large $\lambda$, and then for each consecutive solution, initialize the algorithm with the previous solution, decreasing $\lambda$ by a constant ratio (e.g., by a factor of 2) until finished.
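As a sketch of that warm-start loop: lasso_coordinate_descent is a hypothetical wrapper that runs sweeps of Algorithm 1 until convergence, lam_max is the quantity defined below, and lam_min is whatever stopping value you choose.

import numpy as np

lam = lam_max
w, b = np.zeros(d), 0.0
path = []                                  # (lambda, w) pairs along the path
while lam > lam_min:
    # Warm start: initialize from the previous solution.
    w, b = lasso_coordinate_descent(X, y, lam, w_init=w, b_init=b)
    path.append((lam, w.copy()))
    lam /= 2                               # decrease lambda by a constant ratio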

The smallest value of $\lambda$ for which the solution $\widehat{w}$ is entirely zero is given by

$$\lambda_{\max} = \max_{k=1,\dots,d} \; 2 \left| \sum_{i=1}^{n} x_{i,k} \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right) \right|$$

This is helpful for choosing the first $\lambda$ in a regularization path.
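In NumPy this quantity is essentially one line, assuming X is the n x d design matrix and y the response vector:

import numpy as np

# lambda_max = max_k 2 * | sum_i x_{i,k} * (y_i - mean(y)) |
lam_max = np.max(2 * np.abs(X.T @ (y - np.mean(y))))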

A4. We will first try out your solver with some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting $y$, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features. Suppose that $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $k < d$, and pairs of data $(x_i, y_i)$ for $i = 1, \dots, n$ are generated independently according to the model $y_i = w^\top x_i + \epsilon_i$ where

$$w_j = \begin{cases} j/k & \text{if } j \in \{1, \dots, k\} \\ 0 & \text{otherwise} \end{cases}$$

and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is some Gaussian noise (in the model above $b = 0$). Note that since $k < d$, the features $k + 1$ through $d$ are unnecessary (and potentially even harmful) for predicting $y$.

With this model in mind, let $n = 500$, $d = 1000$, $k = 100$, and $\sigma = 1$. Generate some data by choosing $x_i \in \mathbb{R}^d$, where each component is drawn from a $\mathcal{N}(0, 1)$ distribution, and $y_i$ generated as specified above.
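One possible way to generate this synthetic data in NumPy (a sketch; the seed and variable names are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 500, 1000, 100, 1.0

w_true = np.zeros(d)
w_true[:k] = np.arange(1, k + 1) / k              # w_j = j/k for j = 1, ..., k; 0 otherwise (b = 0)

X = rng.standard_normal((n, d))                   # each component drawn from N(0, 1)
y = X @ w_true + sigma * rng.standard_normal(n)   # y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)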

    a. [10 points] With your synthetic data, solve multiple Lasso problems on a regularization path, starting at $\lambda_{\max}$ where 0 features are selected and decreasing $\lambda$ by a constant ratio (e.g., 1.5) until nearly all the features are chosen. In plot 1, plot the number of non-zeros as a function of $\lambda$ on the x-axis (Tip: use plt.xscale('log')).

    b. [10 points] For each value of $\lambda$ tried, record values for the false discovery rate (FDR) (number of incorrect nonzeros in $\widehat{w}$ / total number of nonzeros in $\widehat{w}$) and true positive rate (TPR) (number of correct nonzeros in $\widehat{w}$ / $k$). In plot 2, plot these values with the x-axis as FDR and the y-axis as TPR. Note that in an ideal situation we would have an (FDR, TPR) pair in the upper left corner, but that one can always trivially achieve $(0, 0)$ and $\left(\frac{d-k}{d}, 1\right)$. (A small sketch for computing these rates appears after this list.)

    c. [5 points] Comment on the effect of $\lambda$ in these two plots.
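For reference, here is a sketch of computing FDR and TPR for a single solution w_hat under the synthetic model above, where the true support is exactly the first k coordinates (the zero-discovery convention below is an assumption of this sketch):

import numpy as np

def fdr_tpr(w_hat, k):
    selected = np.flatnonzero(w_hat != 0)          # indices of nonzero coefficients
    if selected.size == 0:
        return 0.0, 0.0                            # no discoveries: report (FDR, TPR) = (0, 0)
    fdr = np.sum(selected >= k) / selected.size    # incorrect nonzeros / total nonzeros
    tpr = np.sum(selected < k) / k                 # correct nonzeros / k
    return fdr, tpr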


A5. Now we put the Lasso to work on some real data. Download the training data set "crime-train.txt" and the test data set "crime-test.txt" from the website under Homework 2. Store your data in your working directory and read in the files with:

import pandas as pd

df_train = pd.read_table("crime-train.txt")

df_test = pd.read_table("crime-test.txt")


This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible; unlike Numpy arrays, they store row and column indices along with the values of the data. Each column of a DataFrame can also, in principle, store data of a different type. For this assignment, however, all data are floats. Here are a few commands that will get you working with Pandas for this assignment:

df.head()                 # Print the first few lines of DataFrame df.

df.index                  # Get the row indices for df.

df.columns                # Get the column indices.

df["foo"]                 # Return the column named "foo".

df.drop("foo", axis=1)    # Return all columns except "foo".

df.values                 # Return the values as a Numpy array.

df["foo"].values          # Grab column foo and convert to Numpy array.

df.iloc[:3, :3]           # Use numerical indices (like Numpy) to get 3 rows and cols.

The data consist of local crime statistics for 1,994 US communities. The response $y$ is the crime rate. The name of the response variable is ViolentCrimesPerPop, and it is held in the first column of df_train and df_test. There are 95 features. These features include possibly relevant variables such as the size of the police force or the percentage of children that graduate high school. The data have been split for you into a training and test set with 1,595 and 399 entries, respectively. (The features have been standardized to have mean 0 and variance 1.)
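For example, the response and feature matrices can be pulled out as NumPy arrays for your solver like this (a sketch using only the column name given above; df_train and df_test are the DataFrames loaded earlier):

y_train = df_train["ViolentCrimesPerPop"].values
X_train = df_train.drop("ViolentCrimesPerPop", axis=1).values
y_test = df_test["ViolentCrimesPerPop"].values
X_test = df_test.drop("ViolentCrimesPerPop", axis=1).values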


We'd like to use this training set to fit a model which can predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables, overfitting is a serious issue. In order to avoid this, use the coordinate descent LASSO algorithm you just implemented in the previous problem.

Begin by running the LASSO solver with $\lambda = \lambda_{\max}$ defined above. For the initial weights, just use 0. Then, cut $\lambda$ down by a factor of 2 and run again, but this time pass in the values of $\widehat{w}$ from your $\lambda = \lambda_{\max}$ solution as your initial weights. This is faster than initializing with 0 weights each time. Continue the process of cutting $\lambda$ by a factor of 2 until the smallest value of $\lambda$ is less than 0.01. For all plots, use a log scale for the $\lambda$ dimension (Tip: use plt.xscale('log')).

        a. [4 points] Plot the number of nonzeros of each solution versus $\lambda$.

        b. [4 points] Plot the regularization paths (in one plot) for the coefficients for input variables agePct12t29, pctWSocSec, pctUrban, agePct65up, and householdsize.

        c. [4 points] Plot the squared error on the training and test data versus $\lambda$.

        d. [4 points] Sometimes a larger value of $\lambda$ performs nearly as well as a smaller value, but a larger value will select fewer variables and perhaps be more interpretable. Inspect the weights (on features) for $\lambda = 30$. Which feature variable had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly. A description of the variables in the data set can be found here: http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names.

        e. [4 points] Suppose there was a large negative weight on agePct65up and, upon seeing this result, a politician suggests policies that encourage people over the age of 65 to move to high-crime areas in an effort to reduce crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around burning buildings; do fire trucks cause fires?)

Logistic Regression

Binary Logistic Regression [30 points]

A6. Let us again consider the MNIST dataset, but now just binary classification, specifically, recognizing if a digit is a 2 or 7. Here, let $Y = 1$ for all the 7's in the dataset, and use $Y = -1$ for the 2's. We will use regularized logistic regression. Given a binary classification dataset $\{(x_i, y_i)\}_{i=1}^{n}$ for $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, we showed in class that the regularized negative log likelihood objective function can be written as

$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i (b + x_i^\top w)\right)\right) + \lambda \|w\|_2^2$$

Note that the offset term $b$ is not regularized. For all experiments, use $\lambda = 10^{-1}$. Let $\mu_i(w, b) = \frac{1}{1 + \exp(-y_i (b + x_i^\top w))}$.
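As a debugging aid, the objective above can be evaluated directly; here is a minimal sketch, assuming X is the n x d feature matrix, y the labels in {-1, +1}, and lam the regularization strength:

import numpy as np

def objective(w, b, X, y, lam):
    # J(w, b) = (1/n) * sum_i log(1 + exp(-y_i (b + x_i^T w))) + lam * ||w||_2^2
    margins = y * (b + X @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)   # logaddexp(0, -m) = log(1 + exp(-m))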

    a. [8 points] Derive the gradients $\nabla_w J(w, b)$ and $\nabla_b J(w, b)$ and give your answers in terms of $\mu_i(w, b)$ (your answers should not contain exponentials).

    b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several values of step sizes to find one that appears to make convergence on the training set as fast as possible. Run until you feel you are near to convergence.

        (i) For both the training set and the test set, plot $J(w, b)$ as a function of the iteration number (and show both curves on the same plot).

        (ii) For both the training set and the test set, classify the points according to the rule $\operatorname{sign}(b + x_i^\top w)$ and plot the misclassification error as a function of the iteration number (and show both curves on the same plot).

Note that you are only optimizing on the training set. The $J(w, b)$ and misclassification error plots should be on separate plots.

    c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a). Take careful note of how to scale the regularizer. (A generic minibatch skeleton is sketched after this list.)

    d. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a).
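A generic minibatch skeleton for parts (c) and (d) might look like the sketch below. Here grad_w and grad_b are hypothetical helpers implementing your part (a) gradients on a batch, and step_size, batch_size, num_iters, and lam are hyperparameters you choose.

import numpy as np

rng = np.random.default_rng(0)
w, b = np.zeros(X_train.shape[1]), 0.0

for it in range(num_iters):
    idx = rng.choice(X_train.shape[0], size=batch_size, replace=False)   # random minibatch
    Xb, yb = X_train[idx], y_train[idx]
    # Batch gradients from part (a); construct them so that their expectation
    # over the random batch equals the full-data gradient.
    gw = grad_w(w, b, Xb, yb, lam)
    gb = grad_b(w, b, Xb, yb, lam)
    w = w - step_size * gw
    b = b - step_size * gb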


B4. Multinomial Logistic Regression [25 points]

We've talked a lot about binary classification, but what if we have $k > 2$ classes, like the 10 digits of MNIST? Concretely, suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, k\}$. Like in our least squares classifier of Homework 1 for MNIST, we will assign a separate weight vector $w^{(\ell)}$ for each class $\ell = 1, \dots, k$; let $W = [w^{(1)}, \dots, w^{(k)}] \in \mathbb{R}^{d \times k}$. We can generalize the binary classification probabilistic model to multiple classes as follows: let

$$\Pr(y_i = \ell \mid W, x_i) = \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^{k} \exp(w^{(j)} \cdot x_i)}$$

The negative log-likelihood function is equal to

$$\mathcal{L}(W) = -\sum_{i=1}^{n} \sum_{\ell=1}^{k} \mathbf{1}\{y_i = \ell\} \log\!\left( \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^{k} \exp(w^{(j)} \cdot x_i)} \right)$$

Define the softmax($\cdot$) operator to be the function that takes in a vector $\theta \in \mathbb{R}^d$ and outputs a vector in $\mathbb{R}^d$ whose $i$th component is equal to $\frac{\exp(\theta_i)}{\sum_{j=1}^{d} \exp(\theta_j)}$. Clearly, this vector is nonnegative and sums to one. If for any $i$ we have $\theta_i \gg \max_{j \neq i} \theta_j$ then softmax($\theta$) approximates $e_i$, a vector of all zeros with a one in the $i$th component.
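If you implement this by hand, a numerically stable version subtracts the maximum entry before exponentiating (a standard trick; subtracting a constant does not change the result):

import numpy as np

def softmax(theta):
    # softmax(theta)_i = exp(theta_i) / sum_j exp(theta_j)
    shifted = theta - np.max(theta)     # stability: avoid overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e)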

For each $y_i$ let $\tilde{y}_i$ be the one-hot encoding of $y_i$ (i.e., $\tilde{y}_i \in \{0, 1\}^k$ is a vector of all zeros aside from a 1 in the $y_i$th index).




    a. [5 points] If $\widehat{y}_i(W) = \operatorname{softmax}(W^\top x_i)$, show that $\nabla_W \mathcal{L}(W) = -\sum_{i=1}^{n} x_i \left( \tilde{y}_i - \widehat{y}_i(W) \right)^\top$.


    b. [5 points] Recall the Ridge Regression on MNIST problem in Homework 1 and define $J(W) = \frac{1}{2} \sum_{i=1}^{n} \| \tilde{y}_i - W^\top x_i \|_2^2$. If $\widehat{y}_i(W) = W^\top x_i$, show that $\nabla_W J(W) = -\sum_{i=1}^{n} x_i \left( \tilde{y}_i - \widehat{y}_i(W) \right)^\top$. Comparing the least squares linear regression gradient step of this part to the gradient step of minimizing the negative log likelihood of the logistic model of part a may shed light on why we call this classification problem logistic regression.

    c. [15 points] Using the original representations of the MNIST flattened images $x_i \in \mathbb{R}^d$ ($d = 28 \times 28 = 784$) and all $k = 10$ classes, implement gradient descent (or stochastic gradient descent) for both $J(W)$ and $\mathcal{L}(W)$ and run until convergence on the training set of MNIST. For each of the two solutions, report the classification accuracy of each on the training and test sets using the most natural $\arg\max_j e_j^\top W^\top x_i$ classification rule.

We highly encourage you to use PyTorch for this problem! The base object in PyTorch is the torch.tensor, which is essentially a numpy.ndarray but with some powerful new features. Namely, tensors have accelerator support (GPU, TPU) and high-performance autodifferentiation. Don't worry too much about the details of PyTorch! We will discuss PyTorch and the torch.autograd package in greater detail once we get to neural networks! At a high level though, torch.autograd allows us to automatically calculate the gradients of our model parameters with minimal additional cost. Yep, that's right! Your days of writing out gradients by hand are numbered! :D

We include some starter pseudocode for the negative log-likelihood + softmax portion of this question. You are expected to find good hyperparameters. You can install the library at https://pytorch.org/ and access the relevant beginner tutorial here.

import torch

W = torch.zeros(784, 10, requires_grad=True)

for epoch in range(epochs):
    y_hat = torch.matmul(X_train, W)
    # cross_entropy combines the softmax calculation with NLLLoss
    loss = torch.nn.functional.cross_entropy(y_hat, y_train)
    # computes derivatives of the loss with respect to W
    loss.backward()
    # gradient descent update
    W.data = W.data - step_size * W.grad
    # .backward() accumulates gradients into W.grad instead
    # of overwriting, so we need to zero out the gradients
    W.grad.zero_()
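After training, the argmax classification rule from part (c) gives a quick accuracy check (a sketch; X_test and y_test are assumed to be tensors you have already loaded):

with torch.no_grad():
    # Predict the class with the largest score W^T x_i and compare to the labels.
    predictions = torch.argmax(torch.matmul(X_test, W), dim=1)
    accuracy = (predictions == y_test).float().mean().item()
print(f"test accuracy: {accuracy:.4f}")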





















