Assignment 3

Please use Google Classroom to upload your submission by the deadline mentioned above. Your submission should comprise a single file (PDF/ZIP), named <Your Roll No> Assign3, with all your solutions.

For late submissions, 10% is deducted for each day (including weekends) that an assignment is late after it is due. Note that each student begins the course with 7 grace days for late submission of assignments. Late submissions will automatically use your grace-day balance, if you have any left. You can see your balance on the CS6510 Marks and Grace Days document.

You have to use Python for the programming questions.

Please read the department plagiarism policy. Do not engage in any form of cheating; strict penalties will be imposed on both givers and takers. Please talk to the instructor or a TA if you have concerns.



Questions: Theory

(No programming required)

    1. Logistic Regression: (10 marks)

        (a) Plot the sigmoid function $1/(1 + e^{-wx})$ vs. $x \in \mathbb{R}$ for increasing weight $w \in \{1, 5, 100\}$. A qualitative sketch is enough (see the plotting sketch after this question). Use these plots to argue why a solution with large weights can cause logistic regression to overfit. [2 marks]






    (b) To prevent overfitting, we want the weights to be small. To achieve this, instead of maximum likelihood estimation (MLE) for logistic regression:

$$\max_{w_0,\ldots,w_d} \; \prod_{i=1}^{n} P(Y_i \mid X_i, w_0, \ldots, w_d)$$

we can consider maximum a posteriori (MAP) estimation:

$$\max_{w_0,\ldots,w_d} \; \prod_{i=1}^{n} P(Y_i \mid X_i, w_0, \ldots, w_d) \, P(w_0, \ldots, w_d)$$

where $P(w_0, \ldots, w_d)$ is a prior on the weights. Assuming a standard Gaussian prior $\mathcal{N}(0, I)$ for the weight vector ($I$ = identity matrix), derive the gradient ascent update rules for the weights. [3 marks]

    (c) One way to extend logistic regression to the multi-class (say, $K$ class labels) setting is to consider $(K-1)$ sets of weight vectors and define:

$$P(Y = y_k \mid X) \propto \exp\Big(w_{k0} + \sum_{i=1}^{d} w_{ki} X_i\Big) \quad \text{for } k = 1, \ldots, K-1$$

What model does this imply for $P(Y = y_k \mid X)$? What would be the classification rule in this case? [3 marks]

        (d) Draw a set of training data with three labels and the decision boundary resulting from multi-class logistic regression. (The boundary does not need to be quantitatively correct but should qualitatively depict how a typical boundary from multi-class logistic regression would look.) [2 marks]
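For part (a), a minimal matplotlib sketch along these lines can generate the qualitative plots (the weight values come from the question; everything else is illustrative):

    import numpy as np
    from scipy.special import expit  # numerically stable sigmoid
    import matplotlib.pyplot as plt

    x = np.linspace(-10, 10, 1000)
    for w in [1, 5, 100]:
        # sigmoid(w*x) = 1 / (1 + e^{-wx}); larger w gives a sharper transition at x = 0
        plt.plot(x, expit(w * x), label=f"w = {w}")
    plt.xlabel("x")
    plt.ylabel("1 / (1 + exp(-wx))")
    plt.legend()
    plt.show()

As $w$ grows, the sigmoid approaches a step function, which is the behaviour the question asks you to connect to overfitting.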

    2. Kernel Regression and Variants: (7 marks) Given a set of $n$ examples, $(x_i, y_i)$, $i = 1, \ldots, n$, a linear smoother is defined as follows. For any $x$, there exists a vector $l(x) = (l_1(x), \ldots, l_n(x))^T$ such that the estimated output $\hat{y}$ of $x$ is $\hat{y} = \sum_{i=1}^{n} l_i(x)\, y_i = l(x)^T Y$, where $Y$ is an $n \times 1$ vector with $Y_i = y_i$. This means that the prediction is a linear function of the training responses (the $y_i$'s), and it varies slowly and smoothly with change or noise in the $y_i$'s.

As an example, for linear regression, we assume the data are generated from the model $y_i = \sum_{j=1}^{m} w_j h_j(x_i) + \epsilon_i$, where the $h_j$ are basis functions (such as polynomials: $x, x^2, x^3, \ldots$, etc.). The least squares estimate for the coefficient vector $w$, as we know, is given by $w = (H^T H)^{-1} H^T Y$, where $H$ is an $n \times m$ matrix with $H_{ij} = h_j(x_i)$. Given an input $x$, note that $\hat{y} = h(x)^T w = h(x)^T (H^T H)^{-1} H^T Y = (H (H^T H)^{-1} h(x))^T Y$. If $l(x) = H (H^T H)^{-1} h(x)$, then $\hat{y} = l(x)^T Y$. Therefore, linear regression is a linear smoother.
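The identity above is easy to check numerically. Here is a minimal sketch (the data, basis functions, and variable names are illustrative, not part of the assignment):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 20, 4
    x = rng.uniform(-1, 1, size=n)                    # training inputs
    Y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)  # training responses

    def h(x):
        # polynomial basis h_j(x) = x^j, j = 0, ..., m-1
        return np.array([x ** j for j in range(m)])

    H = np.column_stack([x ** j for j in range(m)])   # H_ij = h_j(x_i)
    w = np.linalg.solve(H.T @ H, H.T @ Y)             # least squares estimate

    x0 = 0.37
    l_x0 = H @ np.linalg.solve(H.T @ H, h(x0))        # l(x) = H (H^T H)^{-1} h(x)
    print(np.allclose(l_x0 @ Y, h(x0) @ w))           # True: y_hat = l(x)^T Y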

Now, answer the following questions:

    (a) In kernel regression using the kernel $K(x_i, x) = \exp(-\|x_i - x\|_2^2 / 2)$, given an input $x$, what is the estimated output $\hat{y}$? Is kernel regression a linear smoother? [2.5 marks]
    (b) Suppose we fit a linear regression model, but instead of the sum of squared residuals, $\|Hw - Y\|_2^2$, we minimized the sum of absolute values of residuals: $\|Hw - Y\|_1$. Prove that this is not a linear smoother (give a counter-example). (Hint: Think about the median: for a set of real numbers $(y_1, \ldots, y_n)$ where $n$ is odd, the median $y_M$ minimizes the sum of absolute differences, $M = \arg\min_j \sum_{i=1}^{n} |y_j - y_i|$.) [2.5 marks]
    (c) Suppose we divide the range $(a, b)$ ($a$ and $b$ are real numbers, and $a < b$) into $m$ equally spaced bins denoted by $B_1, \ldots, B_m$. Define the estimated output $\hat{y} = \frac{1}{|B_k|} \sum_{i: x_i \in B_k} y_i$ for $x \in B_k$, where $|B_k|$ is the number of points in $B_k$. In other words, the estimate $\hat{y}$ is a step function obtained by averaging the $y_i$'s over each bin. This estimate is called the regressogram. Is this estimate a linear smoother? If yes, give the vector $l(x)$ for a given input $x$; otherwise, state your reasons. [2 marks]


Questions: Programming

    3. Linear Regression: (13 marks) We will now implement Linear Regression to predict the age of Abalone (a type of snail). The data set is made available as part of the provided zip archive (linregdata). You can read more about the dataset at the UCI repository link. We are interested in predicting the last column of the data that corresponds to the age of the abalone using all the other attributes.

        (a) The first column in the data denotes the attribute that encodes female, infant, and male as 0, 1, and 2, respectively. The numbers used to represent these values are symbols and therefore should not be treated as ordinal. Transform this attribute into a three-column binary representation: for example, represent female as (1, 0, 0), infant as (0, 1, 0), and male as (0, 0, 1). (A sketch covering parts (a)-(d) appears after this question.) [0.5 marks]

        (b) Before performing linear regression, we must first standardize the independent variables, i.e., everything except the last (target) attribute. Standardizing means subtracting from each attribute its mean and dividing by its standard deviation. Standardization will transform the attributes to have zero mean and unit standard deviation; you can use this fact to verify the correctness of your code. [0.5 marks]

        (c) Implement the following functions: (i) mylinridgereg(X, Y, λ), which calculates the linear least squares solution with the ridge regression penalty parameter (λ) and returns the regression weights; (ii) mylinridgeregeval(X, weights), which returns a prediction of the target variable given the input variables and regression weights; and (iii) meansquarederr(T, Tdash), which computes the mean squared error between the predicted and actual target values. [2 + 1 + 1 = 4 marks]

        (d) Partition the dataset into 80% training and 20% testing (let's call the test proportion the partition fraction, in this case 0.2). Now, use your mylinridgereg with different λ values to fit the penalized linear model to the training data, and predict the target variable for both the training and testing data. [1 mark]

        (e) Identify the λ with the best performance and examine the weights of the ridge regression model. Which are the most significant attributes? Try removing two or three of the least significant attributes and observe how the mean squared errors change. [1 mark]

        (f) We would now like to ask: does the effect of λ on error change for different partitions of the data into training and test sets? To answer this, vary the partition fraction (a value between 0 and 1, as defined earlier) over at least 4 other values. Repeat the following steps 25 times for each partition fraction:

- Randomly divide the data into training and test sets.
- Standardize the training input variables.
- Standardize the testing input variables using the means and standard deviations from the training set.
- Follow step (d) for each such partition.

For each partition fraction, plot a figure with λ on the x-axis and MSE on the y-axis. Each figure should include 2 graphs: one for the training MSE and one for the test MSE. (You should then have 5 figures in total, with 2 plots on each figure.) [3 marks]

    (g) Do the above figures give you clarity? Plot two more figures. In the first graph, plot the minimum average mean squared testing error versus the partition fraction values. In the second graph, plot the λ value that produced the minimum average mean squared testing error versus the partition fraction. [1 mark]

    (h) How good is your model? So far, we have been looking only at the mean squared error. We might also be interested in understanding the contribution of each prediction towards the error; maybe the error is due to a few samples with large errors while the others have tiny errors. One way to visualize this information is a plot of predicted versus actual values. Using the best choices for the training fraction and λ, make two such graphs, one each for the training and testing sets. The X and Y axes in these graphs correspond to the predicted and actual target values, respectively. If the model is good, then all the points will be close to a 45-degree line through the plot. [2 marks]

Include all the plots and your observations in your submission.
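A minimal sketch of parts (a)-(d) follows, assuming NumPy (only the three function names in part (c) come from the assignment; every other name and detail is illustrative):

    import numpy as np

    def encode_first_column(col):
        # (a) the first column is symbolic (0 = female, 1 = infant, 2 = male);
        # expand it into a three-column binary (one-hot) representation
        out = np.zeros((len(col), 3))
        out[np.arange(len(col)), col.astype(int)] = 1
        return out

    def standardize(X, mean=None, std=None):
        # (b) subtract the mean and divide by the standard deviation;
        # pass the training set's mean/std when transforming the test set
        if mean is None:
            mean, std = X.mean(axis=0), X.std(axis=0)
        return (X - mean) / std, mean, std

    def mylinridgereg(X, Y, lam):
        # (c-i) ridge solution: weights = (X^T X + lam * I)^{-1} X^T Y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

    def mylinridgeregeval(X, weights):
        # (c-ii) predicted targets for inputs X
        return X @ weights

    def meansquarederr(T, Tdash):
        # (c-iii) mean squared error between actual and predicted targets
        return np.mean((T - Tdash) ** 2)

The random 80/20 split in part (d) can then be a permutation of the row indices:

    idx = np.random.permutation(n)            # n = number of examples
    cut = int(0.8 * n)                        # partition fraction 0.2
    train_idx, test_idx = idx[:cut], idx[cut:]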

    4. Kaggle - Taxi Fare Prediction: (10 marks) The next task of this assignment is to work on a (completed) Kaggle challenge on taxi fare prediction. As part of this task, please visit https://www.kaggle.com/c/new-york-city-taxi-fare-prediction to learn more about this problem, and download the data. (You now know how to download data from Kaggle.)

You are allowed to use any machine learning library of your choice: scikit-learn, pandas, Weka (we recommend scikit-learn), and any regression method too. Use train.csv to train your regressor. Predict the taxi fare for the data in test.csv, and report your best 2 scores in your report. (We will also randomly run your code to confirm the reported scores.)
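A minimal starting point, assuming scikit-learn and pandas (the column names follow the competition's data description; the feature set, model, and sample size here are illustrative only):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # train.csv is very large; a row limit keeps the first experiment manageable
    train = pd.read_csv("train.csv", nrows=1_000_000).dropna()
    features = ["pickup_longitude", "pickup_latitude",
                "dropoff_longitude", "dropoff_latitude", "passenger_count"]

    model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
    model.fit(train[features], train["fare_amount"])

    test = pd.read_csv("test.csv")
    submission = pd.DataFrame({"key": test["key"],
                               "fare_amount": model.predict(test[features])})
    submission.to_csv("submission.csv", index=False)

Engineered features (such as trip distance computed from the coordinates) and cleaning of out-of-range coordinates and fares typically matter more here than the choice of regressor.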

Deliverables:

- Code.
- A brief report (PDF) with the top 2 scores of your methods, and a brief description of the methods that resulted in those scores.
- An analysis, in the report, of why your best 2 methods performed better than the others you tried.

















