
CS 229 Problem Set #1




[40 points] Linear Classifiers (logistic regression and GDA)



In this problem, we cover two probabilistic linear classifiers we have covered in class so far. First, a discriminative linear classifier: logistic regression. Second, a generative linear classifier: Gaussian discriminant analysis (GDA). Both algorithms find a linear decision boundary that separates the data into two classes, but make different assumptions. Our goal in this problem is to get a deeper understanding of the similarities and differences (and strengths and weaknesses) of these two algorithms.




For this problem, we will consider two datasets, along with starter code provided in the following files:




src/linearclass/ds1_{train,valid}.csv



src/linearclass/ds2_{train,valid}.csv



src/linearclass/logreg.py



src/linearclass/gda.py



Each file contains $n$ examples, one example $(x^{(i)}, y^{(i)})$ per row. In particular, the $i$-th row contains columns $x^{(i)}_1 \in \mathbb{R}$, $x^{(i)}_2 \in \mathbb{R}$, and $y^{(i)} \in \{0, 1\}$. In the subproblems that follow, we will investigate using logistic regression and Gaussian discriminant analysis (GDA) to perform binary classification on these two datasets.




(a) [10 points]



In lecture we saw the average empirical loss for logistic regression:




$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \,\right],$$

where $y^{(i)} \in \{0, 1\}$, $h_\theta(x) = g(\theta^T x)$, and $g(z) = 1/(1 + e^{-z})$.




Find the Hessian $H$ of this function, and show that for any vector $z$, it holds true that

$$z^T H z \geq 0.$$

Hint: You may want to start by showing that $\sum_i \sum_j z_i x_i x_j z_j = (x^T z)^2 \geq 0$. Recall also that $g'(z) = g(z)(1 - g(z))$.

Remark: This is one of the standard ways of showing that the matrix $H$ is positive semi-definite, written "$H \succeq 0$." This implies that $J$ is convex, and has no local minima other than the global one. If you have some other way of showing $H \succeq 0$, you're also welcome to use your method instead of the one above.




(b) [5 points] Coding problem. Follow the instructions in src/linearclass/logreg.py to train a logistic regression classifier using Newton's Method. Starting with $\theta = \vec{0}$, run Newton's Method until the updates to $\theta$ are small: specifically, train until the first iteration $k$ such that $\|\theta_k - \theta_{k-1}\|_1 < \epsilon$, where $\epsilon = 1 \times 10^{-5}$. Make sure to write your model's predicted probabilities on the validation set to the file specified in the code.
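The starter file defines its own interface; purely as an illustration of the update and stopping rule described above, here is a minimal sketch (the function and variable names are placeholders, not the ones in logreg.py):

```python
import numpy as np

def fit_newton(x, y, eps=1e-5):
    """Minimal sketch of Newton's Method for logistic regression.

    x: (n, d) array of inputs with an intercept column already added.
    y: (n,) array of labels in {0, 1}.
    Stops at the first iteration whose update has l1 norm below eps.
    """
    n, d = x.shape
    theta = np.zeros(d)                        # start with theta = 0
    while True:
        h = 1 / (1 + np.exp(-(x @ theta)))     # h_theta(x) = g(theta^T x)
        grad = x.T @ (h - y) / n               # gradient of J(theta)
        hess = (x.T * (h * (1 - h))) @ x / n   # Hessian of J(theta)
        step = np.linalg.solve(hess, grad)     # Newton direction
        theta = theta - step
        if np.linalg.norm(step, 1) < eps:      # ||theta_k - theta_{k-1}||_1 < eps
            return theta
```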




Include a plot of the validation data with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. To visualize the two classes, use a different symbol for examples $x^{(i)}$ with $y^{(i)} = 0$ than for those with $y^{(i)} = 1$. On the same figure, plot the decision boundary found by logistic regression (i.e., the line corresponding to $p(y \mid x) = 0.5$).

 








(c) [5 points] Recall that in GDA we model the joint distribution of $(x, y)$ by the following equations:
$$p(y) = \begin{cases} 1 - \phi & \text{if } y = 0 \\ \phi & \text{if } y = 1 \end{cases}$$

$$p(x \mid y = 0) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)$$

$$p(x \mid y = 1) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right),$$

where $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$ are the parameters of our model.

Suppose we have already fit $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$, and now want to predict $y$ given a new point $x$. To show that GDA results in a classifier that has a linear decision boundary, show that the posterior distribution can be written as




$$p(y = 1 \mid x; \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-(\theta^T x + \theta_0))},$$

where $\theta \in \mathbb{R}^d$ and $\theta_0 \in \mathbb{R}$ are appropriate functions of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$.




(d) [7 points] Given the dataset, we claim that the maximum likelihood estimates of the parameters are given by



$$\phi = \frac{1}{n} \sum_{i=1}^{n} 1\{y^{(i)} = 1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 0\}}$$

$$\mu_1 = \frac{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)} = 1\}}$$

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$










The log-likelihood of the data is




























$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)$$
$$= \log \prod_{i=1}^{n} p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi).$$




By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$ are indeed as given in the formulas above. (You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_0$ and $\mu_1$ above are non-zero.)




(e) [5 points] Coding problem. In src/linearclass/gda.py, fill in the code to calculate $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$, use these parameters to derive $\theta$, and use the resulting GDA model to make predictions on the validation set. Make sure to write your model's predictions on the validation set to the file specified in the code.
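As a rough illustration only (placeholder names, not the interface defined in gda.py), the closed-form estimates claimed in part (d) can be computed along the lines of the following sketch:

```python
import numpy as np

def gda_mle(x, y):
    """Sketch of the closed-form GDA estimates from part (d).

    x: (n, d) inputs without an intercept column; y: (n,) labels in {0, 1}.
    """
    n = x.shape[0]
    phi = np.mean(y == 1)                            # fraction of positive examples
    mu_0 = x[y == 0].mean(axis=0)                    # mean of the y = 0 class
    mu_1 = x[y == 1].mean(axis=0)                    # mean of the y = 1 class
    mu_y = np.where((y == 1)[:, None], mu_1, mu_0)   # mu_{y^{(i)}} for each row
    sigma = (x - mu_y).T @ (x - mu_y) / n            # shared covariance matrix
    return phi, mu_0, mu_1, sigma
```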
 








Include a plot of the validation data with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. To visualize the two classes, use a different symbol for examples $x^{(i)}$ with $y^{(i)} = 0$ than for those with $y^{(i)} = 1$. On the same figure, plot the decision boundary found by GDA (i.e., the line corresponding to $p(y \mid x) = 0.5$).




(f) [2 points] For Dataset 1, compare the validation set plots obtained in part (b) and part (e) from logistic regression and GDA respectively, and briefly comment on your observations in a couple of lines.



(g) [5 points] Repeat the steps in part (b) and part (e) for Dataset 2. Create similar plots on the validation set of Dataset 2 and include those plots in your writeup.



On which dataset does GDA seem to perform worse than logistic regression? Why might this be the case?




(h) [1 point] For the dataset where GDA performed worse in parts (f) and (g), can you find a transformation of the $x^{(i)}$'s such that GDA performs significantly better? What might this transformation be?
 








[30 points] Incomplete, Positive-Only Labels



In this problem we will consider training binary classifiers in situations where we do not have full access to the labels. In particular, we consider a scenario, which is not too infrequent in real life, where we have labels only for a subset of the positive examples. All the negative examples and the rest of the positive examples are unlabelled.




We formalize the scenario as follows. Let $\{(x^{(i)}, t^{(i)})\}_{i=1}^{n}$ be a standard dataset of i.i.d. examples. Here the $x^{(i)}$'s are the inputs/features and the $t^{(i)}$'s are the labels. Now consider the situation where the $t^{(i)}$'s are not observed by us. Instead, we only observe the labels of some of the positive examples. Concretely, we assume that we observe $y^{(i)}$'s that are generated by

  
 
 
 
 
 
 
 
 








corresponding to model’s predicted probability = 0.5) in red color. Include this plot in your writeup.




(b) [5 points] Coding problem: The naive method on partial labels



We now consider the case where the $t$-labels are unavailable, so you only have access to the $y$-labels at training time. Extend your code in src/posonly/posonly.py to re-train the classifier (still using $x_1$ and $x_2$ as input features), but using the $y$-labels only. Output the predictions on the test set to the appropriate file (as described in the code comments).




Create a plot to visualize the test set with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. Use different symbols for examples $x^{(i)}$ with true label $t^{(i)} = 1$ (even though we only used the $y^{(i)}$ labels for training, use the true $t^{(i)}$ labels for plotting) than for those with $t^{(i)} = 0$. On the same figure, plot the decision boundary obtained by your model (i.e., the line corresponding to the model's predicted probability = 0.5) in red. Include this plot in your writeup.




Note that the algorithm should learn a function $h(\cdot)$ that approximately predicts the probability $p(y^{(i)} = 1 \mid x^{(i)})$. Also note that we expect it to perform poorly on predicting the probability of interest, namely $p(t^{(i)} = 1 \mid x^{(i)})$.




In the following sub-questions we will attempt to solve the problem with only partial observations. That is, we only have access to $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, and will try to predict $p(t^{(i)} = 1 \mid x^{(i)})$.




(c) [5 points] Warm-up with Bayes rule. Show that under our assumptions, for any $i$,

$$p(t^{(i)} = 1 \mid y^{(i)} = 1, x^{(i)}) = 1 \tag{1}$$



That is, observing a positive partial label $y^{(i)} = 1$ tells us for sure that the hidden true label is 1. Use Bayes rule to derive this (an informal explanation will not earn credit).




(d) [5 points] Show that for any example, the probability that the true label $t^{(i)}$ is positive is $1/\alpha$ times the probability that the partial label is positive. That is, show that

$$p(t^{(i)} = 1 \mid x^{(i)}) = \frac{1}{\alpha}\, p(y^{(i)} = 1 \mid x^{(i)}) \tag{2}$$








Note that the equation above suggests that if we know the value of $\alpha$, then we can convert a function $h(\cdot)$ that approximately predicts the probability $h(x^{(i)}) \approx p(y^{(i)} = 1 \mid x^{(i)})$ into a function that approximately predicts $p(t^{(i)} = 1 \mid x^{(i)})$ by multiplying by the factor $1/\alpha$.




(e) [5 points] Estimating $\alpha$



The solution to estimate $p(t^{(i)} \mid x^{(i)})$ outlined in the previous sub-question requires knowledge of $\alpha$, which we don't have. Now we will design a way to estimate $\alpha$ based on the function $h(\cdot)$ that approximately predicts $p(y^{(i)} = 1 \mid x^{(i)})$ (which we obtained in part (b)).




To simplify the analysis, let's assume that we have magically obtained a function $h(x)$ that perfectly predicts the value of $p(y^{(i)} = 1 \mid x^{(i)})$, that is, $h(x^{(i)}) = p(y^{(i)} = 1 \mid x^{(i)})$.

We make the crucial assumption that $p(t^{(i)} = 1 \mid x^{(i)}) \in \{0, 1\}$. This assumption means that the process of generating the "true" label $t^{(i)}$ is a noise-free process. This assumption is not very unreasonable to make. Note that we are NOT assuming that the observed label $y^{(i)}$ is noise-free, which would be an unreasonable assumption!

 


Now we will show that:


$$\alpha = \mathbb{E}\!\left[ h(x^{(i)}) \mid y^{(i)} = 1 \right] \tag{3}$$



The above result motivates the following algorithm to estimate $\alpha$ by estimating the RHS of the equation above using samples: Let $V_+$ be the set of labeled (and hence positive) examples in the validation set $V$, given by $V_+ = \{x^{(i)} \in V \mid y^{(i)} = 1\}$. Then we use

$$\frac{1}{|V_+|} \sum_{x^{(i)} \in V_+} h(x^{(i)})$$

to estimate $\alpha$. (You will be asked to implement this algorithm in the next sub-question. For this sub-question, you only need to show equation (3). Moreover, this sub-question may be slightly harder than other sub-questions.)




(f) [5 points] Coding problem.



Using the validation set, estimate the constant $\alpha$ by averaging your classifier's predictions over all labeled examples in the validation set:




$$\frac{1}{|V_+|} \sum_{x^{(i)} \in V_+} h(x^{(i)}).$$




Add code in src/posonly/posonly.py to rescale the predictions $h(x^{(i)}) \approx p(y^{(i)} = 1 \mid x^{(i)})$ of the classifier obtained in part (b), using equation (2) from part (d) and the estimated value of $\alpha$.
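A minimal sketch of these two steps, assuming a trained classifier object h with a predict method that returns $p(y = 1 \mid x)$ (a placeholder interface, not the one in posonly.py):

```python
import numpy as np

def estimate_alpha(h, x_valid, y_valid):
    """Average the classifier's predictions over the labeled (y = 1) validation examples."""
    v_plus = x_valid[y_valid == 1]
    return np.mean(h.predict(v_plus))

def rescale_predictions(h, x_test, alpha):
    """Turn estimates of p(y = 1 | x) into estimates of p(t = 1 | x) via equation (2)."""
    return h.predict(x_test) / alpha
```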




Finally, create a plot to visualize the test set with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. Use different symbols for examples $x^{(i)}$ with true label $t^{(i)} = 1$ (even though we only used the $y^{(i)}$ labels for training, use the true $t^{(i)}$ labels for plotting) than for those with $t^{(i)} = 0$. On the same figure, plot the decision boundary obtained by your model (i.e., the line corresponding to the model's adjusted predicted probability = 0.5) in red. Include this plot in your writeup.




Remark: We saw that the true probability $p(t \mid x)$ was only a constant factor away from $p(y \mid x)$. This means that if our task is only to rank examples (i.e., sort them) in a particular order (e.g., sort proteins in order of being most likely to be involved in transmitting signals across membranes), then in fact we do not even need to estimate $\alpha$. The rank based on $p(y \mid x)$ will agree with the rank based on $p(t \mid x)$.




 








[25 points] Poisson Regression



In this question we will construct another kind of commonly used GLM, called Poisson regression. In a GLM, the choice of the exponential family distribution is based on the kind of problem at hand. If we are solving a classification problem, then we use an exponential family distribution with support over discrete classes (such as the Bernoulli or the categorical). Similarly, if the output is real valued, we can use the Gaussian or the Laplace (both are in the exponential family). Sometimes the desired output is a count, e.g., predicting the number of emails expected in a day, or the number of customers expected to enter a store in the next hour, based on input features (also called covariates). You may recall that a probability distribution with support over the integers (i.e., counts) is the Poisson distribution, and it also happens to be in the exponential family.




In the following sub-problems, we will start by showing that the Poisson distribution is in the exponential family, derive the functional form of the hypothesis, derive the update rules for training models, and finally, using the provided dataset, train a real model and make predictions on the test set.




(a) [5 points] Consider the Poisson distribution parameterized by $\lambda$:

$$p(y; \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.$$

(Here $y$ has positive integer values and $y!$ is the factorial of $y$.) Show that the Poisson distribution is in the exponential family, and clearly state the values for $b(y)$, $\eta$, $T(y)$, and $a(\eta)$.




(b) [3 points] Consider performing regression using a GLM with a Poisson response variable. What is the canonical response function for the family? (You may use the fact that a Poisson random variable with parameter $\lambda$ has mean $\lambda$.)



(c) [7 points] For a training set $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, n\}$, let the log-likelihood of an example be $\log p(y^{(i)} \mid x^{(i)}; \theta)$. By taking the derivative of the log-likelihood with respect to $\theta_j$, derive the stochastic gradient ascent update rule for learning using a GLM model with Poisson responses $y$ and the canonical response function.



(d) [10 points] Coding problem



Consider a website that wants to predict its daily traffic. The website owners have collected a dataset of past traffic to their website, along with some features which they think are useful in predicting the number of visitors per day. The dataset is split into train/valid sets and the starter code is provided in the following files:




src/poisson/{train,valid}.csv



src/poisson/poisson.py



We will apply Poisson regression to model the number of visitors per day. Note that applying Poisson regression in particular assumes that the data follows a Poisson distribution whose natural parameter is a linear combination of the input features (i.e., $\eta = \theta^T x$). In src/poisson/poisson.py, implement Poisson regression for this dataset and use full batch gradient ascent to maximize the log-likelihood of $\theta$. For the stopping criterion, check if the change in parameters has a norm smaller than a small value such as $10^{-5}$.
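Purely as an illustration, here is a minimal sketch of full-batch gradient ascent under these assumptions; the function name, learning rate, and interface are placeholders (poisson.py specifies its own), and the gradient expression assumes the canonical response you work out in parts (b) and (c):

```python
import numpy as np

def fit_poisson(x, y, lr=1e-7, eps=1e-5):
    """Sketch of full-batch gradient ascent for Poisson regression.

    Assumes the canonical (exponential) response, so the predicted mean is
    exp(theta^T x) and the log-likelihood gradient is X^T (y - exp(X theta)).
    """
    theta = np.zeros(x.shape[1])
    while True:
        grad = x.T @ (y - np.exp(x @ theta))   # full-batch gradient of the log-likelihood
        step = lr * grad
        theta = theta + step                   # ascent step
        if np.linalg.norm(step) < eps:         # stop when the parameter change is tiny
            return theta

# Expected counts on new inputs would then be np.exp(x_new @ theta).
```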




Using the trained model, predict the expected counts for the validation set, and create a scatter plot of the true counts vs. the predicted counts (on the validation set). In the scatter plot, let the x-axis be the true count and the y-axis be the corresponding predicted expected count. Note that the true counts are integers while the expected counts are generally real valued.

 








[15 points] Convexity of Generalized Linear Models



In this question we will explore and show some nice properties of Generalized Linear Models, specifically those related to their use of exponential family distributions to model the output.




Most commonly, GLMs are trained by using the negative log-likelihood (NLL) as the loss function. This is mathematically equivalent to Maximum Likelihood Estimation (i.e., maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood). In this problem, our goal is to show that the NLL loss of a GLM is a convex function w.r.t. the model parameters. As a reminder, this is convenient because a convex function is one for which any local minimum is also a global minimum, and there is extensive research on how to optimize various types of convex functions efficiently with various algorithms such as gradient descent or stochastic gradient descent.




To recap, an exponential family distribution is one whose probability density can be represented as

$$p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)),$$




where $\eta$ is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, $\eta$ is modeled as $\theta^T x$, where $x \in \mathbb{R}^d$ are the input features of the example, and $\theta \in \mathbb{R}^d$ are the learnable parameters. In order to show that the NLL loss is convex for GLMs, we break down the process into sub-parts, and approach them one at a time. Our approach is to show that the second derivative (i.e., Hessian) of the loss w.r.t. the model parameters is positive semi-definite (PSD) at all values of the model parameters. We will also show some nice properties of exponential family distributions as intermediate steps.




For the sake of convenience we restrict ourselves to the case where $\eta$ is a scalar. Assume $p(Y \mid X; \theta) \sim \text{ExponentialFamily}(\eta)$, where $\eta \in \mathbb{R}$ is a scalar, and $T(y) = y$. This makes the exponential family representation take the form

$$p(y; \eta) = b(y) \exp(\eta y - a(\eta)).$$
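As a quick illustration of this scalar form (a standard example, not something you need to show here), the Bernoulli distribution with mean $\phi$ can be written as

$$p(y; \phi) = \phi^y (1-\phi)^{1-y} = \exp\!\left( y \log\frac{\phi}{1-\phi} + \log(1-\phi) \right),$$

so $b(y) = 1$, $\eta = \log\frac{\phi}{1-\phi}$, and $a(\eta) = -\log(1-\phi) = \log(1 + e^{\eta})$.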




(a) [5 points] Derive an expression for the mean of the distribution. Show that $\mathbb{E}[Y; \eta] = \frac{\partial}{\partial \eta} a(\eta)$ (note that $\mathbb{E}[Y; \eta] = \mathbb{E}[Y \mid X; \theta]$ since $\eta = \theta^T x$). In other words, show that the mean of an exponential family distribution is the first derivative of the log-partition function with respect to the natural parameter.

Hint: Start by observing that $\frac{\partial}{\partial \eta} \int p(y; \eta)\, dy = \int \frac{\partial}{\partial \eta} p(y; \eta)\, dy$.

(b) [5 points] Next, derive an expression for the variance of the distribution. In particular, show that $\mathrm{Var}(Y; \eta) = \frac{\partial^2}{\partial \eta^2} a(\eta)$ (again, note that $\mathrm{Var}(Y; \eta) = \mathrm{Var}(Y \mid X; \theta)$). In other words, show that the variance of an exponential family distribution is the second derivative of the log-partition function w.r.t. the natural parameter.

Hint: Building upon the result in the previous sub-problem can simplify the derivation.




(c) [5 points] Finally, write out the loss function $\ell(\theta)$, the NLL of the distribution, as a function of $\theta$. Then, calculate the Hessian of the loss w.r.t. $\theta$, and show that it is always PSD. This concludes the proof that the NLL loss of a GLM is convex.



Hint 1: Use the chain rule of calculus along with the results of the previous parts to simplify your derivations.




Hint 2: Recall that variance of any probability distribution is non-negative.







Remark: The takeaways from this problem are the following:

- Any GLM model is convex in its model parameters.
- The exponential family of probability distributions is mathematically nice. Whereas calculating the mean and variance of distributions in general involves integrals (hard), for exponential family distributions we can surprisingly calculate them using derivatives (easy).
 








[25 points] Linear regression: linear in what?



In the first two lectures, you have seen how to fit a linear function of the data for the regression problem. In this question, we will see how linear regression can be used to fit non-linear functions of the data using feature maps. We will also explore some of its limitations, for which future lectures will discuss fixes.




(a) [5 points] Learning degree-3 polynomials of the input



Suppose we have a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)}, y^{(i)} \in \mathbb{R}$. We would like to fit a third-degree polynomial $h_\theta(x) = \theta_3 x^3 + \theta_2 x^2 + \theta_1 x + \theta_0$ to the dataset. The key observation here is that the function $h_\theta(x)$ is still linear in the unknown parameter $\theta$, even though it's not linear in the input $x$. This allows us to convert the problem into a linear regression problem as follows.




Let $\phi: \mathbb{R} \to \mathbb{R}^4$ be a function that transforms the original input $x$ into a 4-dimensional vector defined as

$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix} \in \mathbb{R}^4 \tag{4}$$
















Let $\hat{x} \in \mathbb{R}^4$ be a shorthand for $\phi(x)$, and let $\hat{x}^{(i)} \triangleq \phi(x^{(i)})$ be the transformed input in the training dataset. We construct a new dataset $\{(\phi(x^{(i)}), y^{(i)})\}_{i=1}^{n} = \{(\hat{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ by replacing the original inputs $x^{(i)}$'s with $\hat{x}^{(i)}$'s. We see that fitting $h_\theta(x) = \theta_3 x^3 + \theta_2 x^2 + \theta_1 x + \theta_0$ to the old dataset is equivalent to fitting a linear function $h_\theta(\hat{x}) = \theta_3 \hat{x}_3 + \theta_2 \hat{x}_2 + \theta_1 \hat{x}_1 + \theta_0$ to the new dataset because




$$h_\theta(x) = \theta_3 x^3 + \theta_2 x^2 + \theta_1 x + \theta_0 = \theta_3 \phi(x)_3 + \theta_2 \phi(x)_2 + \theta_1 \phi(x)_1 + \theta_0 = \theta^T \hat{x} \tag{5}$$




In other words, we can use linear regression on the new dataset to find the parameters $\theta_0, \ldots, \theta_3$. Please write down 1) the objective function $J(\theta)$ of the linear regression problem on the new dataset $\{(\hat{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, and 2) the update rule of the batch gradient descent algorithm for linear regression on the dataset $\{(\hat{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$.




Terminology: In machine learning, $\phi$ is often called the feature map, which maps the original input $x$ to a new set of variables. To distinguish between these two sets of variables, we will call $x$ the input attributes, and call $\phi(x)$ the features. (Unfortunately, different authors use different terms to describe these two things. In this course, we will do our best to follow the above convention consistently.)




(b) [5 points] Coding question: degree-3 polynomial regression



For this sub-question, we will use the dataset provided in the following files:




src/featuremaps/{train,valid,test}.csv




Each file contains two columns: x and y. In the terminology described in the introduction, x is the attribute (in this case one-dimensional) and y is the output label.




Using the formulation of the previous sub-question, implement linear regression with the normal equations using the feature map of degree-3 polynomials. Use the starter code provided in src/featuremaps/featuremap.py to implement the algorithm.




Create a scatter plot of the training data, and plot the learnt hypothesis as a smooth curve over it. Submit the plot in the writeup as the solution for this problem.

 








Remark: Suppose $\hat{X}$ is the design matrix of the transformed dataset. You may sometimes encounter a non-invertible matrix $\hat{X}^T \hat{X}$. For a numerically stable code implementation, always use np.linalg.solve to obtain the parameters directly, rather than explicitly calculating the inverse and then multiplying it with $\hat{X}^T y$.
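As a rough sketch of this approach (placeholder function names, not the featuremap.py interface), the degree-3 fit via the normal equations might look like:

```python
import numpy as np

def poly_features(x, k=3):
    """Map a 1-D array of attributes to the degree-k polynomial features [1, x, ..., x^k]."""
    return np.stack([x ** j for j in range(k + 1)], axis=1)

def fit_normal_equations(x, y, k=3):
    """Solve X_hat^T X_hat theta = X_hat^T y with np.linalg.solve, per the remark above."""
    x_hat = poly_features(x, k)
    return np.linalg.solve(x_hat.T @ x_hat, x_hat.T @ y)
```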

(c) [5 points] Coding question: degree-k polynomial regression



Now we extend the idea above to degree-$k$ polynomials by considering $\phi: \mathbb{R} \to \mathbb{R}^{k+1}$ to be

$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^k \end{bmatrix} \in \mathbb{R}^{k+1} \tag{6}$$




Follow the same procedure as in the previous sub-question, and implement the algorithm with $k = 3, 5, 10, 20$. Create a similar plot to the one in the previous sub-question, and include the hypothesis curves for each value of $k$ in a different color. Include a legend in the plot to indicate which color is for which value of $k$.




Submit the plot in the writeup as the solution for this sub-problem. Observe how the fit to the training dataset changes as $k$ increases. Briefly comment on your observations in the plot.




(d) [5 points] Coding question: other feature maps



You may have observed that it requires a relatively high degree $k$ to fit the given training data, and this is because the dataset cannot be explained (i.e., approximated) very well by low-degree polynomials. By visualizing the data, you may have realized that $y$ can be approximated well by a sine wave. In fact, we generated the data by sampling from $y = \sin(x) + \epsilon$, where $\epsilon$ is noise with a Gaussian distribution. Please update the feature map to include a sine transformation as follows:




$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^k \\ \sin(x) \end{bmatrix} \in \mathbb{R}^{k+2} \tag{7}$$




With the updated feature map, train different models for values of $k = 0, 1, 2, 3, 5, 10, 20$, and plot the resulting hypothesis curves over the data as before.




Submit the plot as a solution to this sub-problem. Compare the fitted models with those of the previous sub-question, and briefly comment on any noticeable differences in the fit with this feature map.
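A minimal sketch of the updated feature map in equation (7), with placeholder names (featuremap.py defines its own interface):

```python
import numpy as np

def sine_poly_features(x, k):
    """Degree-k polynomial features with an extra sin(x) entry, as in equation (7)."""
    poly = np.stack([x ** j for j in range(k + 1)], axis=1)    # [1, x, ..., x^k]
    return np.concatenate([poly, np.sin(x)[:, None]], axis=1)  # append sin(x)
```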




(e) [5 points] Overfitting with expressive models and small data



For the rest of the problem, we will consider a small dataset (a random subset of the dataset you have been using so far) with far fewer examples, provided in the following file:




src/featuremaps/small.csv

 








We will explore what happens when the number of features starts to become larger than the number of examples in the training set. Run your algorithm on this small dataset using the following feature map




$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^k \end{bmatrix} \in \mathbb{R}^{k+1} \tag{8}$$




with $k = 1, 2, 5, 10, 20$.




Create a plot of the various hypothesis curves (just like in the previous sub-questions). Observe how the fit to the training dataset changes as $k$ increases. Submit the plot in the writeup and comment on what you observe.




Remark: The phenomenon you observe, where the model starts to fit the training dataset very well but suddenly "goes wild," is due to what is called overfitting. The intuition to have for now is that, when the amount of data you have is small relative to the expressive capacity of the family of possible models (that is, the hypothesis class, which, in this case, is the family of all degree-$k$ polynomials), it results in overfitting.




Loosely speaking, the set of hypothesis functions is "very flexible" and can easily be forced to pass through all your data points, especially in unnatural ways. In other words, the model explains the noise in the training dataset, which shouldn't be explained in the first place. This hurts the predictive power of the model on test examples. We will describe overfitting in more detail in future lectures when we cover learning theory and bias-variance tradeoffs.
