Instructions: Submit two files only: writeup.pdf and code.zip. Only questions 1 and 4.2 are programming questions; written answers are expected for all other questions. For all questions, show all your work, including intermediate steps.
1 Linear Regression [20 pts]
Using any programming language, implement linear least-squares regression (i.e., with no polynomial terms) with the penalty term $\lambda w^T w$. Download the dataset called regression-dataset.zip from the course webpage, which contains the training data and labels for each fold. The input and output spaces are continuous (i.e., $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$). Find the best $\lambda$ by 10-fold cross validation. Draw a graph that shows the Euclidean error as $\lambda$ increases from 0 to 4 in increments of 0.1.
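A minimal sketch of one way to set this up in Python with NumPy is below. It uses the closed-form regularized least-squares solution $w = (X^T X + \lambda I)^{-1} X^T y$; the fold file names are hypothetical, since the exact layout of regression-dataset.zip is not specified here.

    import numpy as np

    def ridge_fit(X, y, lam):
        # Closed-form regularized least squares: w = (X^T X + lam*I)^{-1} X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def cv_error(folds, lam):
        # folds: list of (X_i, y_i) pairs, one per fold
        errs = []
        for i, (X_te, y_te) in enumerate(folds):
            # Train on the other nine folds, evaluate on the held-out fold.
            X_tr = np.vstack([X for j, (X, _) in enumerate(folds) if j != i])
            y_tr = np.concatenate([y for j, (_, y) in enumerate(folds) if j != i])
            w = ridge_fit(X_tr, y_tr, lam)
            errs.append(np.linalg.norm(X_te @ w - y_te))  # Euclidean error
        return np.mean(errs)

    # Hypothetical file names -- adjust to the actual contents of regression-dataset.zip:
    # folds = [(np.loadtxt(f"trainInput{i}.csv", delimiter=","),
    #           np.loadtxt(f"trainOutput{i}.csv", delimiter=",")) for i in range(1, 11)]
    # lams = np.arange(0.0, 4.01, 0.1)
    # plt.plot(lams, [cv_error(folds, lam) for lam in lams])  # with matplotlib.pyplot as plt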
2 Maximum Likelihood and Maximum a Posteriori Estimates [20 pts]
2.1 MLE [10 pts]
Consider the probability density function below:
$P_\theta(x) = \theta^3 x\, e^{-\theta x^2}$
where $\theta$ is a parameter and $x$ is a positive real number. Suppose you get $m$ i.i.d. samples $x_i$ drawn from this distribution. Show how one can compute the maximum likelihood estimate for $\theta$ based on these samples.
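A sketch of the standard route, under the density as reconstructed above: write the log-likelihood, differentiate with respect to $\theta$, and set the derivative to zero.

$\ell(\theta) = \sum_{i=1}^{m} \ln P_\theta(x_i) = 3m \ln\theta + \sum_{i=1}^{m} \ln x_i - \theta \sum_{i=1}^{m} x_i^2, \qquad \frac{d\ell}{d\theta} = \frac{3m}{\theta} - \sum_{i=1}^{m} x_i^2 = 0 \;\Longrightarrow\; \hat{\theta}_{\mathrm{MLE}} = \frac{3m}{\sum_{i=1}^{m} x_i^2}.$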
2.2 MAP [10 pts]
Consider the gamma probability density function below:
$P_{\alpha,\beta}(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$
Show that the gamma distribution is a conjugate prior of the exponential distribution $P_\lambda(x) = \lambda e^{-\lambda x}$.
Find the maximum a posteriori estimate (MAP) of $\lambda$ as a function of $\alpha$ and $\beta$.
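A sketch of the conjugacy argument, written here for $m$ i.i.d. observations $x_1, \dots, x_m$ (the single-observation case is $m = 1$): multiply the exponential likelihood by the gamma prior and match the result to a gamma density,

$p(\lambda \mid x_1,\dots,x_m) \;\propto\; \Big(\prod_{i=1}^{m} \lambda e^{-\lambda x_i}\Big)\, \lambda^{\alpha-1} e^{-\beta\lambda} = \lambda^{\alpha+m-1}\, e^{-(\beta + \sum_i x_i)\lambda},$

which is again a gamma density with parameters $\alpha + m$ and $\beta + \sum_i x_i$. The MAP estimate is then the mode of this posterior, $(\alpha + m - 1)/(\beta + \sum_i x_i)$, valid when $\alpha + m > 1$.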
3 Generative vs Discriminative Models [20 pts]
In class, we learned that when $X = \langle X_1, X_2, \dots, X_n \rangle$ where each $X_i$ is distributed normally, and $Y$ is Boolean, then logistic regression is the discriminative equivalent of Naive Bayes.
Prove that $P(Y \mid X)$ takes the logistic form when $X$ is a vector of Boolean variables, i.e., logistic regression is also the discriminative counterpart of a Naive Bayes classifier over Boolean features.
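One standard route (a sketch, not necessarily the only acceptable proof): start from Bayes' rule and push everything into a single exponent,

$P(Y=1 \mid X) = \dfrac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)} = \dfrac{1}{1 + \exp\!\left( \ln \frac{P(Y=0)\,P(X \mid Y=0)}{P(Y=1)\,P(X \mid Y=1)} \right)},$

then apply the Naive Bayes factorization $P(X \mid Y) = \prod_i P(X_i \mid Y)$ with Bernoulli parameters $\theta_{iy} = P(X_i = 1 \mid Y = y)$. Since $\ln P(X_i \mid Y = y) = X_i \ln \theta_{iy} + (1 - X_i)\ln(1 - \theta_{iy})$ is linear in $X_i$, the exponent collapses to $w_0 + \sum_i w_i X_i$, which is exactly the logistic form.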
4 Support Vector Machines [40 pts]
4.1 Constrained Optimization [20 pts]
Consider the following optimization problem, with variable $x \in \mathbb{R}$:
minimize $\;x^2 + 1$
subject to $\;(x-2)(x-4) \le 0$
What is the optimal point $x^\star$ and optimal value $p^\star$ of this constrained optimization problem?
State the Lagrangian and the dual function $g(\lambda)$.
State the Lagrange dual optimization problem. Find the dual optimal point $\lambda^\star$ and the dual optimal value $d^\star$. Does strong duality hold? Explain your answer.
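As a starting point (a sketch, not the full solution): collecting terms in the Lagrangian gives

$L(x, \lambda) = x^2 + 1 + \lambda (x-2)(x-4) = (1+\lambda)x^2 - 6\lambda x + (1 + 8\lambda),$

and for $\lambda \ge 0$ this is a convex quadratic in $x$ with minimizer $x = 3\lambda/(1+\lambda)$, so $g(\lambda) = 1 + 8\lambda - 9\lambda^2/(1+\lambda)$. The remaining parts follow by maximizing $g$ over $\lambda \ge 0$ and comparing $d^\star$ with $p^\star$.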
4.2 SVM with Kernels [20 pts]
The Ionosphere dataset from the UCI Machine Learning Repository contains radar data collected by a system in Goose Bay, Labrador. Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers. Instances in this dataset are described by two attributes per pulse number, corresponding to the complex values returned by the function applied to the complex electromagnetic signal. The prediction target is either g = "good" or b = "bad". Download the dataset called ionosphere.csv from the course webpage.
Use your favourite machine learning toolkit to compare the performance of the SVM using each of the following kernels: linear, polynomial, and radial basis. The polynomial and radial basis kernels each take parameters. The hyperparameter $C$ controls the tradeoff between training error and margin. Experiment with various combinations of the hyperparameter $C \in \{1, 10, 100\}$ and the kernel parameters (a sketch of one possible experimental loop follows the questions below).
Explain your procedure and report the training accuracy. Which kernel performed best?
What does this say, if anything, about the data?
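A minimal sketch using scikit-learn, one possible toolkit. The column layout of ionosphere.csv (numeric feature columns followed by the g/b label in the last column) is an assumption; the polynomial degree and RBF gamma are left at scikit-learn's defaults here, and you would also want to vary them.

    import numpy as np
    from sklearn.svm import SVC

    # Assumed layout: numeric feature columns, then the g/b label column.
    data = np.genfromtxt("ionosphere.csv", delimiter=",", dtype=str)
    X = data[:, :-1].astype(float)
    y = data[:, -1]

    for kernel in ("linear", "poly", "rbf"):
        for C in (1, 10, 100):
            clf = SVC(kernel=kernel, C=C).fit(X, y)
            # Training accuracy, as the question asks; vary degree/gamma
            # for the polynomial and RBF kernels as well.
            print(f"{kernel:7s} C={C:<4d} train accuracy = {clf.score(X, y):.3f}")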