General instructions
The assignment should be implemented in Python 3. You should make sure that your code can be run on the flip server.
You can work in teams of up to 3 people. Each team only needs to submit one copy of the source code and report.
You need to submit your source code (self-contained, well documented, and with clear instructions for how to run it) and a report via TEACH. In your submission, please clearly indicate your team members' information.
Be sure to answer all the questions in your report. Your report should be typed and submitted in PDF format. You will be graded based on both your code and the report. In particular, the clarity and quality of the report will be worth 10 points, so please write your report in a clear and concise manner. Clearly label your figures, legends, and tables.
Linear regression
Data. You will use the Boston Housing dataset of the housing prices in Boston suburbs. The goal is to predict the median value of housing of an area (in thousands) based on 13 attributes describing the area (e.g., crime rate, accessibility, etc.). The file housing_desc.txt describes the data. The data is divided into two sets: (1) a training set housing_train.txt for learning, and (2) a testing set housing_test.txt for testing. Your task is to implement linear regression and explore some variations of it on this data.
1. (10 pts) Load the training data into the corresponding X and Y matrices, where X stores the features and Y stores the desired outputs. The rows of X and Y correspond to the examples and the columns of X correspond to the features. Introduce the dummy variable to X by adding an extra column of ones to X. (You can make this extra column the first column; changing the position of the added column will only change the order of the learned weights and does not matter in practice.) Compute the optimal weight vector w using w = (X^T X)^{-1} X^T Y. Feel free to use existing numerical packages (e.g., numpy) to perform the computation. Report the learned weight vector.
2. (10 pts) Apply the learned weight vector to the training data and testing data respectively, and compute for each case the average squared error (ASE), defined by (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)², which is the sum of squared errors normalized by the total number of examples in the data. Report the training and testing ASEs respectively. Which one is larger? Is it consistent with your expectation? (An illustrative code sketch for questions 1 and 2 appears after the output list below.)
Write your code so that you get the results for questions 1 and 2 using the following command:
python q1_2.py housing_train.txt housing_test.txt
The output should include:
- the learned weight vector
- ASE over the training data
- ASE over the testing data
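For concreteness, here is a minimal sketch of what q1_2.py might look like. It assumes the housing files are plain whitespace-delimited text with the target value in the last column; the helper names are illustrative, not required.

```python
import sys
import numpy as np

def load_data(path):
    """Load a whitespace-delimited housing file; the last column is the target."""
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1:]

def add_dummy(X):
    """Prepend the dummy feature: a column of ones."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit(X, Y):
    """Solve the normal equations w = (X^T X)^{-1} X^T Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ase(X, Y, w):
    """Average squared error: (1/n) * sum_i (y_i - yhat_i)^2."""
    return float(np.mean((Y - X @ w) ** 2))

if __name__ == "__main__":
    X_train, Y_train = load_data(sys.argv[1])
    X_test, Y_test = load_data(sys.argv[2])
    X_train, X_test = add_dummy(X_train), add_dummy(X_test)
    w = fit(X_train, Y_train)
    print("learned weight vector:", w.ravel())
    print("training ASE:", ase(X_train, Y_train, w))
    print("testing ASE:", ase(X_test, Y_test, w))
```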
3. (10 pts) Remove the dummy variable (the column of ones) from X and repeat questions 1 and 2. How does this change influence the ASE on the training and testing data? Provide an explanation for this influence.
Write your code so that you get the results for question 3 using the following command:
python q1_3.py housing_train.txt housing_test.txt
The output should include:
- the learned weight vector
- ASE over the training data
- ASE over the testing data
4. (20 pts) Modify the data by adding additional random features. You will do this to both the training and testing data. In particular, for each instance, generate d (consider d = 2, 4, 6, ..., 10; feel free to explore more values) random features by sampling from a standard normal distribution. For each d value, apply linear regression to find the optimal weight vector and compute its resulting training and testing ASEs. Plot the training and testing ASEs as a function of d. What trends do you observe for the training data and test data respectively? Do more features lead to better prediction performance at the testing stage? Provide an explanation for your observations. (A sketch of the feature-augmentation step appears after the output list below.)
Write your code so that you get the results for question 4 using the following command:
python q1_4.py housing_train.txt housing_test.txt
The output should include:
- plot of the training ASE (y-axis) as a function of d (x-axis)
- plot of the testing ASE (y-axis) as a function of d (x-axis)
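A possible sketch of the feature-augmentation step for question 4, reusing the (assumed) load_data, add_dummy, fit, and ase helpers from the earlier sketch; the fixed random seed is only there so the plot is reproducible.

```python
import numpy as np

def add_random_features(X, d, rng):
    """Append d extra features per instance, drawn i.i.d. from a standard normal."""
    return np.hstack([X, rng.standard_normal((X.shape[0], d))])

rng = np.random.default_rng(0)     # fixed seed for reproducibility
d_values = [2, 4, 6, 8, 10]
train_ases, test_ases = [], []
for d in d_values:
    Xtr = add_random_features(X_train, d, rng)   # X_train/X_test as built in q1_2.py
    Xte = add_random_features(X_test, d, rng)
    w = fit(Xtr, Y_train)
    train_ases.append(ase(Xtr, Y_train, w))
    test_ases.append(ase(Xte, Y_test, w))
# train_ases / test_ases can then be plotted against d_values.
```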
Logistic regression with regularization
Data. For this part you will work with the USPS handwritten digit dataset and implement a logistic regression classifier to differentiate digit 4 from digit 9. Each example is an image of digit 4 or 9, with 16 by 16 pixels. Treating the gray-scale value of each pixel as a feature (between 0 and 255), each example has 16^2 = 256 features. For each class, we have 700 training samples and 400 testing samples. You can view these images collectively at http://www.cs.nyu.edu/~roweis/data/usps_4.jpg and http://www.cs.nyu.edu/~roweis/data/usps_9.jpg.
The data is in CSV format and each row corresponds to a hand-written digit (the first 256 columns) and its label (the last column, 0 for digit 4 and 1 for digit 9).
1. (20 pts) Implement the batch gradient descent algorithm to train a binary logistic regression classifier. The behavior of gradient descent can be strongly influenced by the learning rate. Experiment with different learning rates and report your observations on the convergence behavior of the gradient descent algorithm. For your implementation, you will need to decide on a stopping condition. You might use a fixed number of iterations, the change of the objective value (when it ceases to be significant), or the norm of the gradient (when it is smaller than a small threshold). Note: if you observe an overflow, then your learning rate is too big, so you need to try smaller learning rates (e.g., divide by 2 or 10).
Once you identify a suitable learning rate, rerun the training of the model from the beginning. For each gradient descent iteration, record the training accuracy and the testing accuracy of your model, and plot both as a function of the number of gradient descent iterations. What trend do you observe? (A minimal gradient descent sketch appears after the output list below.) Write your code so that you get the results for question 1 using the following command:
python q2_1.py usps_train.txt usps_test.txt learningrate
The output should include:
plot of the learning curve: training accuracy (y-axis) as a function of the number of gradient descent iterations (x-axis)
plot of the learning curve: testing accuracy (y-axis) as a function of the number of gradient descent iterations (x-axis)
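The following is a minimal sketch of what the batch gradient descent part of q2_1.py might look like. It assumes the files are plain comma-separated text, scales the pixels to [0, 1] (see Remark 1), and uses a fixed iteration cap plus a gradient-norm threshold as the stopping condition; these are just one choice among the options mentioned above, and the per-iteration accuracy tracking needed for the plots is omitted for brevity.

```python
import sys
import numpy as np

def load_usps(path):
    """Load a CSV file: the first 256 columns are pixels, the last column is the label."""
    data = np.loadtxt(path, delimiter=",")
    return data[:, :-1] / 255.0, data[:, -1]   # scale features to [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr, max_iters=1000, tol=1e-6):
    """Batch gradient descent for binary logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        p = sigmoid(X @ w)         # predicted probability of label 1
        grad = X.T @ (p - y)       # batch gradient of the logistic loss
        if np.linalg.norm(grad) < tol:
            break
        w -= lr * grad
    return w

def accuracy(X, y, w):
    """Fraction of examples whose predicted label matches y."""
    return float(np.mean((sigmoid(X @ w) >= 0.5) == y))

if __name__ == "__main__":
    X_train, y_train = load_usps(sys.argv[1])
    X_test, y_test = load_usps(sys.argv[2])
    w = train_logistic(X_train, y_train, lr=float(sys.argv[3]))
    print("training accuracy:", accuracy(X_train, y_train, w))
    print("testing accuracy:", accuracy(X_test, y_test, w))
```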
2. (10 pts) Logistic regression is typically used with regularization. Here we will explore L2 regularization, which adds to the logistic regression objective an additional regularization term, the squared Euclidean norm of the weight vector:
L(w) = Σ_{i=1}^{n} l(g(w^T x_i), y_i) + (λ/2) ‖w‖²
where the loss function l is the same as introduced in class and λ controls the strength of the regularization. Find the gradient of this objective function and modify the batch gradient descent algorithm to use this new gradient. Provide the pseudo-code for your modified algorithm.
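Assuming, as in class, that l is the cross-entropy loss and g the sigmoid, the only change to the batch update is an extra λw term in the gradient. A minimal sketch of the modified loop (reusing the sigmoid helper from the earlier sketch):

```python
import numpy as np

def train_logistic_l2(X, y, lr, lam, max_iters=1000, tol=1e-6):
    """Batch gradient descent on the L2-regularized logistic regression objective."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + lam * w   # the (lam/2)*||w||^2 term contributes lam * w
        if np.linalg.norm(grad) < tol:
            break
        w -= lr * grad
    return w
```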
3. (25 pts) Implement your derived algorithm and experiment with different λ values (e.g., 10^{-3}, 10^{-2}, ..., 10^{3}). Report the training and testing accuracies (i.e., the percentage of correct predictions) achieved by the weight vectors learned with the different λ values. Discuss your results in terms of the relationship between training/testing performance and the λ values. (An illustrative λ-sweep driver appears after the output list below.) Write your code so that you get the results for question 3 using the following command:
python q2_3.py usps_train.txt usps_test.txt lambdas
where lambdas contains the list of λ values to be tested. The output should include:
- plot of the training accuracy (y-axis) as a function of the λ value (x-axis)
- plot of the testing accuracy (y-axis) as a function of the λ value (x-axis)
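An illustrative driver for the λ sweep in q2_3.py, assuming the load_usps, train_logistic_l2, and accuracy helpers from the earlier sketches and matplotlib for the plots. The exact format of the lambdas argument is not specified above; the comma-separated form, the learning rate of 0.01, and the output file names used here are placeholders only.

```python
import sys
import matplotlib.pyplot as plt

X_train, y_train = load_usps(sys.argv[1])
X_test, y_test = load_usps(sys.argv[2])
lambdas = [float(s) for s in sys.argv[3].split(",")]   # e.g. "0.001,0.01,0.1,1,10,100,1000"

train_acc, test_acc = [], []
for lam in lambdas:
    w = train_logistic_l2(X_train, y_train, lr=0.01, lam=lam)
    train_acc.append(accuracy(X_train, y_train, w))
    test_acc.append(accuracy(X_test, y_test, w))

# One figure per required plot, with lambda on a log-scaled x-axis.
for name, acc in [("training", train_acc), ("testing", test_acc)]:
    plt.figure()
    plt.semilogx(lambdas, acc)
    plt.xlabel("lambda")
    plt.ylabel(name + " accuracy")
    plt.savefig("q2_3_" + name + "_accuracy.png")
```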
Remark 1. For logistic regression, it is a good idea to normalize the features to the range [0, 1]. This will make it easier to find a proper learning rate. You can find information about feature normalization at https://en.wikipedia.org/wiki/Feature_scaling.
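Since each pixel feature lies in [0, 255], one simple way to do this scaling (already folded into the load_usps sketch above) is:

```python
X_train = X_train / 255.0   # scale gray-scale pixel features from [0, 255] to [0, 1]
X_test = X_test / 255.0
```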