Machine Learning (CS405) – Homework #3
Question 1
Consider a data set in which each data point $t_n$ is associated with a weighting factor $r_n > 0$, so that the sum-of-squares error function becomes
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} r_n \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2.$$
Find an expression for the solution $\mathbf{w}^\star$ that minimizes this error function. Give two alternative interpretations of the weighted sum-of-squares error function in terms of (i) data-dependent noise variance and (ii) replicated data points.
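As a hint for checking your derivation (this block is not part of the original problem statement): writing $\mathbf{R} = \mathrm{diag}(r_1, \ldots, r_N)$ and letting $\boldsymbol{\Phi}$ denote the design matrix, the error takes a matrix form whose gradient yields weighted normal equations:
$$E_D(\mathbf{w}) = \frac{1}{2}(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^{\mathrm{T}}\mathbf{R}\,(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}), \qquad \nabla E_D = \mathbf{0} \;\Longrightarrow\; \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\boldsymbol{\Phi}\,\mathbf{w}^\star = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{R}\,\mathbf{t}.$$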
Question 2
We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean and unknown precision (inverse variance) is a normal-gamma distribution. This property also holds for the case of the conditional Gaussian distribution $p(t|x, \mathbf{w}, \beta)$ of the linear regression model. If we consider the likelihood function
$$p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \,\middle|\, \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\right),$$
then the conjugate prior for $\mathbf{w}$ and $\beta$ is given by
$$p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta|a_0, b_0).$$
Show that the corresponding posterior distribution takes the same functional form, so that
$$p(\mathbf{w}, \beta|\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N, b_N),$$
and find expressions for the posterior parameters $\mathbf{m}_N$, $\mathbf{S}_N$, $a_N$, and $b_N$.
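A possible starting point (a hint, not part of the original statement): take the log of likelihood times prior and group terms by their dependence on $\mathbf{w}$ and $\beta$,
$$\ln p(\mathbf{w}, \beta|\mathbf{t}) = \left(a_0 - 1 + \tfrac{N+M}{2}\right)\ln\beta - b_0\beta - \frac{\beta}{2}\left[\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + (\mathbf{w}-\mathbf{m}_0)^{\mathrm{T}}\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)\right] + \text{const},$$
where $M$ is the dimensionality of $\mathbf{w}$. Completing the square in $\mathbf{w}$ inside the bracket identifies $\mathbf{S}_N$ and $\mathbf{m}_N$; the leftover $\beta$-dependent terms then match a Gamma density, giving $a_N$ and $b_N$.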
Question 3
Show that the integration over $\mathbf{w}$ in the Bayesian linear regression model gives the result
$$\int \exp\{-E(\mathbf{w})\}\,\mathrm{d}\mathbf{w} = \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,|\mathbf{A}|^{-1/2}.$$
Hence show that the log marginal likelihood is given by
$$\ln p(\mathbf{t}|\alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi).$$
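For reference, this question appears to follow the standard evidence-approximation setup (e.g., Bishop, Section 3.5.1), in which
$$E(\mathbf{w}) = \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}, \qquad \mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}, \qquad \mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}.$$
Under that setup, the first identity follows from completing the square, $E(\mathbf{w}) = E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^{\mathrm{T}}\mathbf{A}(\mathbf{w}-\mathbf{m}_N)$, together with the normalization of a multivariate Gaussian.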
Question 4
Consider real-valued variables $X$ and $Y$. The $Y$ variable is generated, conditional on $X$, from the following process:
$$\varepsilon \sim \mathcal{N}(0, \sigma^2),$$
$$Y = aX + \varepsilon,$$
where every $\varepsilon$ is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. This is a one-feature linear regression model, where $a$ is the only weight parameter. The conditional probability of $Y$ has distribution $p(Y|X, a) \sim \mathcal{N}(aX, \sigma^2)$, so it can be written as
$$p(Y|X, a) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(Y - aX)^2\right).$$
Assume we have a training dataset of $n$ pairs $(X_i, Y_i)$ for $i = 1, \ldots, n$, and $\sigma$ is known. Derive the maximum likelihood estimate of the parameter $a$ in terms of the training examples $X_i$'s and $Y_i$'s. We recommend you start with the simplest form of the problem:
$$F(a) = \frac{1}{2}\sum_i (Y_i - aX_i)^2.$$
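As a check on your final answer (a hint, not part of the original statement), setting the derivative of $F$ to zero gives
$$F'(a) = -\sum_i X_i (Y_i - aX_i) = 0 \quad\Longrightarrow\quad \hat{a}_{\mathrm{MLE}} = \frac{\sum_i X_i Y_i}{\sum_i X_i^2}.$$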
Question 5
If a data point $y$ follows the Poisson distribution with rate parameter $\theta$, then the probability of a single observation $y$ is
$$p(y|\theta) = \frac{\theta^y e^{-\theta}}{y!}, \quad \text{for } y = 0, 1, 2, \ldots$$
You are given data points $y_1, \ldots, y_n$ independently drawn from a Poisson distribution with parameter $\theta$. Write down the log-likelihood of the data as a function of $\theta$.
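A quick numerical sanity check for your closed-form expression (purely illustrative; the observations and rate below are made up):

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

y = np.array([2, 0, 3, 1, 4])  # hypothetical observations
theta = 1.7                    # hypothetical rate parameter

# Closed form: sum_i [ y_i * ln(theta) - theta - ln(y_i!) ],
# using gammaln(y + 1) = ln(y!).
ll_closed = np.sum(y * np.log(theta) - theta - gammaln(y + 1))
ll_scipy = poisson.logpmf(y, theta).sum()
assert np.isclose(ll_closed, ll_scipy)
```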
Question 6
Suppose you are given $n$ observations, $X_1, \ldots, X_n$, independent and identically distributed with a $\mathrm{Gamma}(\alpha, \lambda)$ distribution. The following information might be useful for the problem.
(a) If $X \sim \mathrm{Gamma}(\alpha, \lambda)$, then $\mathrm{E}[X] = \frac{\alpha}{\lambda}$ and $\mathrm{E}[X^2] = \frac{\alpha(\alpha+1)}{\lambda^2}$.
(b) The probability density function of $X \sim \mathrm{Gamma}(\alpha, \lambda)$ is
$$f_X(x) = \frac{1}{\Gamma(\alpha)}\,\lambda^{\alpha}\,x^{\alpha-1}\,e^{-\lambda x},$$
where the function $\Gamma$ is only dependent on $\alpha$ and not $\lambda$.
Suppose we are given a known, fixed value for $\alpha$. Compute the maximum likelihood estimator for $\lambda$.
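As a check on the final answer (a hint, not part of the original statement): the log-likelihood is
$$\ell(\lambda) = n\alpha\ln\lambda - \lambda\sum_{i=1}^{n} X_i + (\alpha-1)\sum_{i=1}^{n}\ln X_i - n\ln\Gamma(\alpha),$$
and setting $\ell'(\lambda) = n\alpha/\lambda - \sum_i X_i = 0$ gives $\hat{\lambda} = n\alpha / \sum_i X_i = \alpha/\bar{X}$.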
Program Question
In this question, we will try to use logistic regression to solve a binary classification problem. Given some information about a house, such as its area and the number of living rooms, would it be expensive? We would like logisticRegressionScikit() to predict 1 if it is expensive, and 0 otherwise. We will use the hw3_house_sales.zip dataset.
We will first implement it with the Python scikit-learn package, and then try to implement it by updating weights with gradient descent. We will derive the gradient formula, and use stochastic gradient descent and AdaGrad to calculate the weights.
(a) Logistic regression with Scikit. Fill in the logisticRegressionScikit() function using the scikit-learn toolbox.
Report the weights and prediction accuracy here in your submitted file.
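A minimal sketch of what part (a) might look like, assuming the unzipped hw3_house_sales data has already been loaded into a feature matrix X and 0/1 labels y; the train/test split settings below are placeholders, not specified by the assignment:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def logisticRegressionScikit(X, y):
    # Hold out a test set (the split ratio and seed are assumptions),
    # fit the model, and report weights and accuracy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("weights:", clf.coef_, "bias:", clf.intercept_)
    print("train accuracy:", clf.score(X_train, y_train))
    print("test accuracy:", clf.score(X_test, y_test))
    return clf
```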
(b) Gradient derivation. Assume a sigmoid is applied to a linear function of the input features:
$$h_w(x) = \frac{1}{1 + e^{-w^{\mathrm{T}} x}}$$
Assume also that $P(y = 1|x; w) = h_w(x)$ and $P(y = 0|x; w) = 1 - h_w(x)$. Calculate the maximum likelihood estimation $L(w) = P(Y|X; w)$, then formulate the stochastic gradient ascent rule. Please write out the log-likelihood, calculate the derivative, and write out the update formula step by step.
(c) Logistic regression with SGD. Fill in the LogisticRegressionSGD() function with simple gradient descent. To do that, two helper functions will be needed: sigmoid_activation(), to calculate the sigmoid function result, and model_optimize(), to calculate the gradient of w. Both helper functions can be used in the following AdaGrad optimization function. Use a learning rate of $10^{-4}$ and run with 2000 iterations. Keep track of the accuracy every 100 iterations on the training set (no need to report). It will be used later.
Report weights, training accuracy and test accuracy here in your submitted file. Your final score will depend on correct sigmoid_activation(), model_optimize(), and LogisticRegressionSGD() functions.
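A minimal sketch of the three functions named above, assuming X is a NumPy feature matrix and y a 0/1 label vector; the exact signatures in the starter code may differ, and the explicit bias term is an assumption:

```python
import numpy as np

def sigmoid_activation(z):
    # Sigmoid of a scalar or array.
    return 1.0 / (1.0 + np.exp(-z))

def model_optimize(w, b, X, y):
    # Gradient of the average negative log-likelihood for logistic
    # regression (the negative of the gradient derived in part (b)).
    m = X.shape[0]
    p = sigmoid_activation(X @ w + b)   # predicted P(y = 1 | x)
    dw = X.T @ (p - y) / m
    db = np.sum(p - y) / m
    return dw, db

def LogisticRegressionSGD(X, y, lr=1e-4, iters=2000):
    # Plain (full-batch) gradient descent, per the "simple gradient
    # descent" wording above; accuracy is tracked every 100 iterations.
    w, b = np.zeros(X.shape[1]), 0.0
    acc_history = []
    for t in range(iters):
        dw, db = model_optimize(w, b, X, y)
        w -= lr * dw
        b -= lr * db
        if (t + 1) % 100 == 0:
            preds = sigmoid_activation(X @ w + b) >= 0.5
            acc_history.append(np.mean(preds == y))
    return w, b, acc_history
```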
(d) Logistic regression with AdaGrad. Fill in the LogisticRegressionAda() function. Use a learning rate of $10^{-4}$ and run with 2000 iterations. Keep track of the accuracy every 100 iterations on the training set (no need to report). It will be used later.
Report weights, training accuracy and test accuracy here in your submitted file.
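A sketch of LogisticRegressionAda() in the same spirit, reusing the helpers from the part (c) sketch; the epsilon term and accumulator initialization are standard AdaGrad choices, not specified by the assignment:

```python
import numpy as np

def LogisticRegressionAda(X, y, lr=1e-4, iters=2000, eps=1e-8):
    # AdaGrad: scale each coordinate's step by the square root of its
    # accumulated squared gradients (eps avoids division by zero).
    w, b = np.zeros(X.shape[1]), 0.0
    Gw, Gb = np.zeros_like(w), 0.0
    acc_history = []
    for t in range(iters):
        dw, db = model_optimize(w, b, X, y)   # helper from part (c)
        Gw += dw ** 2
        Gb += db ** 2
        w -= lr * dw / (np.sqrt(Gw) + eps)
        b -= lr * db / (np.sqrt(Gb) + eps)
        if (t + 1) % 100 == 0:
            preds = sigmoid_activation(X @ w + b) >= 0.5
            acc_history.append(np.mean(preds == y))
    return w, b, acc_history
```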
(e) Comparison of Scikit, SGD and AdaGrad convergence. Plot the loss function of SGD and AdaGrad over 2000 iterations on both the training and test data. What do you observe? Which one has better accuracy on the test dataset? Why might that be the case?
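One possible way to produce the part (e) plot, assuming loss values were recorded every 100 iterations for each optimizer (the function and dictionary names are illustrative, not from the starter code):

```python
import matplotlib.pyplot as plt

def plot_convergence(losses_by_method, step=100):
    # losses_by_method maps a label such as "SGD (train)" or
    # "AdaGrad (test)" to its recorded loss values.
    for label, losses in losses_by_method.items():
        xs = [step * (i + 1) for i in range(len(losses))]
        plt.plot(xs, losses, label=label)
    plt.xlabel("iteration")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
```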
Reference. The datasets and questions are drawn from online sources and from the University of Pennsylvania.