Short Answer and “True or False” Conceptual Questions
1. These questions should be answerable without referring to external materials. Briefly justify your answers with a few words.
a. [2 points] Suppose that your estimated model for predicting house prices has a large positive weight on the feature "number of bathrooms". If we remove this feature and refit the model, will the new model have a strictly higher error than before? Why?
b. [2 points] Compared to an L2 norm penalty, explain why an L1 norm penalty is more likely to result in sparsity (a larger number of 0s) in the weight vector.
c. [2 points] In at most one sentence each, state one possible upside and one possible downside of using the following regularizer: $\sum_i |w_i|^{0.5}$.
d. [1 point] True or False: If the step-size for gradient descent is too large, it may not converge.
e. [2 points] In your own words, describe why stochastic gradient descent (SGD) works, even though only a small portion of the data is considered at each update.
f. [2 points] In at most one sentence each, state one possible advantage of SGD over GD (gradient descent), and one possible disadvantage of SGD relative to GD.
What to Submit:
• Part d: True or False.
• Parts a-f: Brief (2-3 sentence) explanation.
Lasso on a Real Dataset
A Lasso Algorithm
Given $\lambda > 0$ and data $\{(x_i, y_i)\}_{i=1}^n$, the Lasso is the problem of solving

$$\arg\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \sum_{i=1}^{n} \left( x_i^T w + b - y_i \right)^2 + \lambda \sum_{j=1}^{d} |w_j|$$
where λ is a regularization parameter. For the programming part of this homework, we have implemented the coordinate descent method shown in Algorithm 1 to solve the Lasso problem for you.
Algorithm 1: Coordinate Descent Algorithm for Lasso
while not converged do
    $b \leftarrow \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} w_j x_{i,j} \right)$
    for $k \in \{1, 2, \cdots, d\}$ do
        $a_k \leftarrow 2 \sum_{i=1}^{n} x_{i,k}^2$
        $c_k \leftarrow 2 \sum_{i=1}^{n} x_{i,k} \left( y_i - \left( b + \sum_{j \neq k} w_j x_{i,j} \right) \right)$
        $w_k \leftarrow \begin{cases} (c_k + \lambda)/a_k & c_k < -\lambda \\ 0 & c_k \in [-\lambda, \lambda] \\ (c_k - \lambda)/a_k & c_k > \lambda \end{cases}$
    end
end
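The provided solver implements this algorithm for you; purely for intuition, here is a minimal NumPy sketch of Algorithm 1 (the function name lasso_coordinate_descent and its interface are illustrative choices of ours, not the provided implementation):

import numpy as np

def lasso_coordinate_descent(X, y, lam, w=None, tol=1e-4, max_iter=1000):
    # Sketch of Algorithm 1. X: (n, d) array, y: (n,) array.
    n, d = X.shape
    w = np.zeros(d) if w is None else w.copy()
    a = 2 * np.sum(X ** 2, axis=0)           # a_k does not change across iterations
    for _ in range(max_iter):
        w_old = w.copy()
        b = np.mean(y - X @ w)               # offset update
        for k in range(d):
            # Residual with feature k's contribution removed.
            r = y - (b + X @ w - X[:, k] * w[k])
            c = 2 * (X[:, k] @ r)
            if c < -lam:
                w[k] = (c + lam) / a[k]      # soft-threshold, negative side
            elif c > lam:
                w[k] = (c - lam) / a[k]      # soft-threshold, positive side
            else:
                w[k] = 0.0                   # c in [-lam, lam]: exactly zero
        if np.max(np.abs(w - w_old)) < tol:  # simple convergence criterion
            break
    return w, b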
You will often apply Lasso on the same dataset for many values of λ. This is called a regularization path. One way to do this efficiently is to start at a large λ, and then, for each consecutive solution, initialize the algorithm with the previous solution, decreasing λ by a constant ratio (e.g., by a factor of 2). The smallest value of λ for which the solution ŵ is entirely zero is given by
$$\lambda_{\max} = \max_{k=1,\dots,d} \; 2 \left| \sum_{i=1}^{n} x_{i,k} \left( y_i - \frac{1}{n} \sum_{j=1}^{n} y_j \right) \right| \qquad (1)$$
This is helpful for choosing the first λ in a regularization path.
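As a sketch, Equation (1) is one line of NumPy, assuming X is the (n, d) feature matrix and y the (n,) response vector:

import numpy as np

# lambda_max per Equation (1): the smallest lambda with an all-zero solution.
lam_max = np.max(2 * np.abs(X.T @ (y - np.mean(y))))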
A benefit of the Lasso is that if we believe many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features.
Dataset
Download the training data set “crime-train.txt” and the test data set “crime-test.txt” from the course website. Store your data in your working directory, ensure you have pandas installed, and read in the files with the following Python code:
import pandas as pd
df_train = pd.read_table("crime-train.txt")
df_test = pd.read_table("crime-test.txt")
This stores the data as Pandas DataFrame objects. DataFrames are similar to Numpy arrays but more flexible; unlike arrays, DataFrames store row and column indices along with the values of the data. Each column of a DataFrame can also store data of a different type (here, all data are floats).
Here are a few commands that will get you working with Pandas for this assignment:
df.head()                # Print the first few lines of DataFrame df.
df.index                 # Get the row indices for df.
df.columns               # Get the column indices.
df["foo"]                # Return the column named "foo".
df.drop("foo", axis=1)   # Return all columns except "foo".
df.values                # Return the values as a Numpy array.
df["foo"].values         # Grab column "foo" and convert to a Numpy array.
df.iloc[:3, :3]          # Use numerical indices (like Numpy) to get 3 rows and columns.
The data consist of local crime statistics for 1,994 US communities. The response y is the rate of violent crimes reported per capita in a community. The name of the response variable is ViolentCrimesPerPop, and it is held in the first column of df_train and df_test. There are 95 features.
Some features are the consequence of complex political processes, such as the size of the police force and other systemic and historical factors. Others are demographic characteristics of the community, including self-reported statistics about race, age, education, and employment drawn from Census reports.
The dataset is split into a training and test set with 1,595 and 399 entries, respectively. The features have been standardized to have mean 0 and variance 1. We will use this training set to fit a model to predict the crime rate in new communities and evaluate model performance on the test set. As there are a considerable number of input variables and fairly few training observations, overfitting is a serious issue, and the coordinate descent Lasso algorithm may mitigate this problem during training.
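Putting the pandas commands above together, one way to extract NumPy arrays for training and evaluation (a sketch; the variable names are our own):

# Response is the first column, ViolentCrimesPerPop; the rest are features.
y_train = df_train["ViolentCrimesPerPop"].values
X_train = df_train.drop("ViolentCrimesPerPop", axis=1).values
y_test = df_test["ViolentCrimesPerPop"].values
X_test = df_test.drop("ViolentCrimesPerPop", axis=1).values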
The goals of this problem are threefold: (i) to encourage you to think about how data collection processes affect the resulting model trained from that data; (ii) to encourage you to think deeply about models you might train and how they might be misused; and (iii) to see how Lasso encourages sparsity of linear models in settings where d is large relative to n. We emphasize that training a model on this dataset can suggest a degree of correlation between a community’s demographics and the rate at which a community experiences
and reports violent crime. We strongly encourage students to consider why these correlations may or may not hold more generally, whether correlations might result from a common cause, and what issues can result in misinterpreting what a model can explain.
Applying Lasso
2.
a. [4 points] Read the documentation for the original version of this dataset: http://archive.ics.uci.edu/ml/datasets/communities+and+crime. Report 3 features included in this dataset for which historical policy choices in the US would lead to variability in these features. As an example, the number of police in a community is often the consequence of decisions made by governing bodies, elections, and the amount of tax revenue available to decision makers.
b. [4 points] Before you train a model, describe 3 features in the dataset which might, if found to have a nonzero weight in the model, be interpreted as reasons for higher levels of violent crime, but which might actually be a result rather than (or in addition to being) a cause of this violence.
Now, we will run the Lasso solver. Begin with λ = λmax as defined in Equation (1). Initialize all weights to 0. Then reduce λ by a factor of 2 and run again, this time initializing ŵ with your λ = λmax solution, as described above. Continue reducing λ by a factor of 2 until λ < 0.01. For all plots, use a log scale for the λ axis (Tip: use plt.xscale('log')).
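A sketch of this warm-started schedule, reusing the illustrative lasso_coordinate_descent and lam_max from the snippets above:

import numpy as np

lam = lam_max                        # start at lambda_max from Equation (1)
w = np.zeros(X_train.shape[1])       # all weights initialized to 0
lams, nonzeros, paths = [], [], []
while lam >= 0.01:
    w, b = lasso_coordinate_descent(X_train, y_train, lam, w=w)  # warm start
    lams.append(lam)
    nonzeros.append(np.count_nonzero(w))   # for the plot in part (c)
    paths.append(w.copy())                 # for the plot in part (d)
    lam /= 2                               # halve lambda each round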
c. [4 points] Plot the number of nonzero weights of each solution as a function of λ.
d. [4 points] Plot the regularization paths (in one plot) for the coefficients of the input variables agePct12t29, pctWSocSec, pctUrban, agePct65up, and householdsize.
e. [4 points] On one plot, plot the squared error on the training and test data as a function of λ.
f. [4 points] Sometimes a larger value of λ performs nearly as well as a smaller value, but a larger value will select fewer variables and perhaps be more interpretable. Inspect the weights wˆ for λ = 30. Which feature had the largest (most positive) Lasso coefficient? What about the most negative? Discuss briefly.
g. [4 points] Suppose there was a large negative weight on agePct65up and upon seeing this result, a politician suggests policies that encourage people over the age of 65 to move to high crime areas in an effort to reduce crime. What is the (statistical) flaw in this line of reasoning? (Hint: fire trucks are often seen around burning buildings, do fire trucks cause fire?)
What to Submit:
• Parts a, b: 1-2 sentence explanation.
• Part c: Plot 1.
• Part d: Plot 2.
• Part e: Plot 3.
• Parts f, g: Answers and 1-2 sentence explanation.
• Code on Gradescope through coding submission. (Note: there is no autograder for this question. However, please still submit your code and specify how to run it to generate the plots.)
Logistic Regression
Binary Logistic Regression
3. Here we consider the MNIST dataset, but for binary classification. Specifically, the task is to determine whether a digit is a 2 or a 7. Let $Y = 1$ for all the "7" digits in the dataset, and $Y = -1$ for the "2" digits. We will use regularized logistic regression. Given a binary classification dataset $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, we showed in class that the regularized negative log likelihood objective function can be written as
$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \log\left( 1 + \exp\left( -y_i (b + x_i^T w) \right) \right) + \lambda \|w\|_2^2$$

Note that the offset term $b$ is not regularized. For all experiments, use $\lambda = 10^{-1}$. Let $\mu_i(w, b) = \frac{1}{1 + \exp(-y_i (b + x_i^T w))}$.
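As a sanity check for your implementation, a minimal NumPy sketch of evaluating J(w, b) and µ_i(w, b) directly from the definitions above (function names are our own):

import numpy as np

def objective(X, y, w, b, lam=0.1):
    # J(w, b) for labels y in {-1, +1}; lam = 1e-1 per the instructions.
    margins = y * (b + X @ w)
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

def mu(X, y, w, b):
    # mu_i(w, b) = 1 / (1 + exp(-y_i (b + x_i^T w)))
    return 1.0 / (1.0 + np.exp(-y * (b + X @ w)))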
a. [8 points] Derive the gradients $\nabla_w J(w, b)$ and $\nabla_b J(w, b)$, and give your answers in terms of $\mu_i(w, b)$ (your answers should not contain exponentials).
b. [8 points] Implement gradient descent with an initial iterate of all zeros. Try several values of step sizes to find one that appears to make convergence on the training set as fast as possible. Run until you feel you are near to convergence.
(i) For both the training set and the test set, plot $J(w, b)$ as a function of the iteration number (show both curves on the same plot).
(ii) For both the training set and the test set, classify the points according to the rule $\mathrm{sign}(b + x_i^T w)$ and plot the misclassification error as a function of the iteration number (show both curves on the same plot).
Reminder: Make sure you are only using the test set for evaluation (not for training).
c. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 1. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a). Show both plots described in (b) when using batch size 1. Take careful note of how to scale the regularizer.
d. [7 points] Repeat (b) using stochastic gradient descent with a batch size of 100. That is, instead of approximating the gradient with a single example, use 100. Note, the expected gradient with respect to the random selection should be equal to the gradient found in part (a).
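For parts (c) and (d), the sampling-and-update skeleton might look like the sketch below. Here grad_w and grad_b are hypothetical stand-ins for your part (a) gradients, evaluated on the minibatch only; keep the regularization term at full strength so that the expected update equals the full-batch gradient step:

import numpy as np

rng = np.random.default_rng(0)

def sgd_step(X, y, w, b, step_size, batch_size, grad_w, grad_b):
    # Sample a minibatch uniformly at random without replacement.
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradients on the minibatch; the regularizer is not subsampled.
    w_new = w - step_size * grad_w(Xb, yb, w, b)
    b_new = b - step_size * grad_b(Xb, yb, w, b)
    return w_new, b_new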
What to Submit
• Part a: Proof
• Part b: Separate plots for b(i) and b(ii).
• Part c: Separate plots which reproduce those from b(i) and b(ii) for this case.
• Part d: Separate plots which reproduce those from b(i) and b(ii) for this case.
• Code on Gradescope through coding submission.
Administrative
4.
a. [2 points] About how many hours did you spend on this homework? There is no right or wrong answer :)