Questions
1. (30 points) In this problem, you will implement a program to fit two multivariate Gaussian distributions to the 2-class data and classify the test data by computing the log odds $\log \frac{P(C_1 \mid x)}{P(C_2 \mid x)}$. The priors $P(C_1)$ and $P(C_2)$ should be estimated from the training data. Three pairs of training data and test data are given. The parameters $\mu_1$, $\mu_2$, $S_1$ and $S_2$, the means and covariances of class 1 and class 2, are learned under the following three models for each training and test data pair:
Model 1: Assume independent $S_1$ and $S_2$ (the discriminant function is given by equation (5.17) in the textbook).
Model 2: Assume $S_1 = S_2$, that is, a single covariance $S$ shared by the two classes (the discriminant function is given by equation (5.22) in the textbook).
Model 3: Assume $S_1$ and $S_2$ are diagonal and the diagonal entries are identical within $S_1$ and within $S_2$: $S_1 = \alpha_1 I$, $S_2 = \alpha_2 I$. (You need to derive the discriminant function yourself.)
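For orientation, the log odds for Gaussian class-conditional densities can be written as a difference of two class discriminants. The form below is the standard general-covariance case with constant terms dropped; it is a sketch in this handout's notation and may differ in detail from the numbered textbook equations. The symbol $g_i$ is introduced here only for illustration.

```latex
\log \frac{P(C_1 \mid x)}{P(C_2 \mid x)} = g_1(x) - g_2(x),
\qquad
g_i(x) = -\tfrac{1}{2} \log |S_i|
         - \tfrac{1}{2} (x - \mu_i)^{\top} S_i^{-1} (x - \mu_i)
         + \log P(C_i)
```

A test point is assigned to class 1 when the log odds are positive and to class 2 otherwise.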
(a) (10 points) Write the likelihood function and derive $S_1$ and $S_2$ by maximum likelihood estimation for model 2 and model 3.
(b) (10 points) Your program should return and print the learned parameters $P(C_1)$, $P(C_2)$, $\mu_1$ and $\mu_2$ for each data pair to the MATLAB command window. Your implementations of model 1 and model 2 should also return and print the learned parameters $S_1$ and $S_2$. Your implementation of model 3 should return and print $\alpha_1$ and $\alpha_2$.
(c) (10 points) For each test set, print the error rate of each model to the MATLAB command window (three models per test set). Match each data pair to one of the models and justify your answer. Also explain the differences in your results in the report.
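The estimation and classification steps for the independent-covariance case (Model 1) can be sketched as follows. This is a minimal illustration on made-up toy data, not a reference solution: the variable names and the anonymous helper `g` are hypothetical, and reading the provided files and printing all required parameters are left out.

```matlab
% Toy 2-class data (hypothetical): rows are samples, last column is the label (1 or 2).
rng(0);                                   % fixed seed so the run is repeatable
A = randn(100,2) + 2;                     % class 1 samples, centered near (2, 2)
B = randn(150,2) - 2;                     % class 2 samples, centered near (-2, -2)
train = [A, ones(100,1); B, 2*ones(150,1)];
test  = [2, 2, 1; -2, -2, 2];             % two easy test points with known labels

X  = train(:,1:end-1);  y = train(:,end);
X1 = X(y == 1,:);       X2 = X(y == 2,:);

% Priors estimated as class frequencies in the training data.
p1 = size(X1,1) / size(X,1);
p2 = size(X2,1) / size(X,1);

% Maximum likelihood estimates: sample means and covariances normalized by N_i.
m1 = mean(X1,1);  m2 = mean(X2,1);
S1 = cov(X1,1);   S2 = cov(X2,1);         % second argument 1 -> divide by N, not N-1

% Quadratic discriminant of one class, evaluated row-wise on a matrix of points Z.
g = @(Z,m,S,p) -0.5*log(det(S)) - 0.5*sum(((Z - m) / S) .* (Z - m), 2) + log(p);

Zt      = test(:,1:end-1);
logOdds = g(Zt,m1,S1,p1) - g(Zt,m2,S2,p2);   % log P(C1|x) - log P(C2|x)
pred    = 2 - (logOdds > 0);                 % class 1 if log odds > 0, else class 2
errRate = mean(pred ~= test(:,end));

fprintf('P(C1) = %.3f, P(C2) = %.3f\n', p1, p2);
fprintf('Test error rate = %.3f\n', errRate);
```

Models 2 and 3 would change only how `S1` and `S2` are estimated; those estimators are the subject of part (a) and are not shown here.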
Instructor: Rui Kuang (kuan0009@umn.edu). TAs: Jungseok Hong (jungseok@umn.edu) and Ujval Bangalore Umesh (banga038@umn.edu).
2. In this problem, you will apply dimension reduction and classification to the Optdigits dataset provided in optdigits_train.txt and optdigits_test.txt.
(a) (5 points) Implement k-Nearest Neighbor (KNN) to classify the Optdigits dataset with k = {1, 3, 5, 7}. Print the error rate on the test set for each value of k to the MATLAB command window.
(b) (10 points) Implement your own version of Principal Component Analysis (PCA) and apply it to the Optdigits training data. Generate a plot of the proportion of variance (see Figure 6.4 (b) in the main textbook), and select the minimum number (K) of eigenvectors that explain at least 90% of the variance. Show both the plot and K in the report. Project the training and test data onto the K principal components and run KNN on the projected data for k = {1, 3, 5, 7}. Print the error rate on the test set for each value of k to the MATLAB command window.
(c) (5 points) Next, project both the training and test data to $\mathbb{R}^2$ using only the first two principal components. Plot all samples in the projected space and label some data points with the corresponding digit, using 10 different colors for the 10 digits, for a good visualization (similar to Figure 6.5).
(d) (10 points) Implement your own version of Linear Discriminant Analysis (LDA) and use it to compute a projection into L dimensions (L = 2, 4, 9) using only the Optdigits training data. Run KNN on the projected data for k = {1, 3, 5}. Print the error rate on the test set for each combination of k and L to the MATLAB command window. (Hint: the MATLAB function pinv() can be used to approximately invert a singular matrix.)
(e) (10 points) Similarly, project both the training and test data to $\mathbb{R}^2$ with the LDA projection, plot all samples in the projected space, and label some data points with the corresponding digit, using 10 different colors for the 10 digits.
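One common way to obtain an LDA projection is to build the within-class and between-class scatter matrices and take the leading eigenvectors of pinv(Sw) * Sb, using pinv() as the hint suggests. The sketch below follows that formulation on made-up 3-class data; the data, the variable names, and the choice of this particular formulation are assumptions for illustration, not the required implementation.

```matlab
% Hypothetical toy data: 3 classes in 4 dimensions, n samples per class.
rng(2);
n = 50;  d = 4;
classMeans = [0, 0, 0, 0; 3, 0, 0, 0; 0, 3, 0, 0];   % non-collinear class centers
X = zeros(3*n, d);  y = zeros(3*n, 1);
for c = 1:3
    rows = (c-1)*n + (1:n);
    X(rows,:) = randn(n,d) + classMeans(c,:);
    y(rows)   = c;
end

m  = mean(X,1);                 % overall mean
Sw = zeros(d);                  % within-class scatter
Sb = zeros(d);                  % between-class scatter
for c = unique(y)'
    Xc = X(y == c,:);
    mc = mean(Xc,1);
    Dc = Xc - mc;               % samples centered on their class mean
    Sw = Sw + Dc' * Dc;
    Sb = Sb + size(Xc,1) * (mc - m)' * (mc - m);
end

% Eigenvectors of pinv(Sw)*Sb, sorted by decreasing eigenvalue.
[V,D] = eig(pinv(Sw) * Sb);
[lambda, idx] = sort(real(diag(D)), 'descend');

L = 2;                          % target dimensionality
W = real(V(:, idx(1:L)));       % projection matrix (d-by-L)
Z = X * W;                      % projected data (one row per sample)
```

With C classes, Sb has rank at most C - 1, so at most C - 1 eigenvalues are meaningfully nonzero; for the ten digit classes this is consistent with the largest requested value L = 9.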
3. In this problem, you will work on dimension reduction and classification on a Faces dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/CMU+Face+Images). We provide the processed files face_train_data_960.txt and face_test_data_960.txt with 500 and 124 images, respectively. Each image is of size 30 × 32, with its pixel values stored in one row of the file; the last column gives the label of the image: 1 (sunglasses) or 0 (open). You can visualize the i-th image with the following MATLAB command:
imagesc(reshape(faces_data(i,1:end-1),32,30)')
(a) (10 points) Implement PCA and apply it to find the principal components of the combined training and test sets. First, visualize the first 5 eigenfaces using a command similar to the one above.
(b) (10 points) Repeat what you did in question 2 (b), using PCA and KNN on this Faces dataset.
(c) (10 points) Use the first K = {10, 50, 100} principal components to approximate the first five images of the training set (the first five rows of the data matrix): project the centered data onto the first K principal components, then "back project" (a weighted sum of the components) to the original space and add the mean. For each K, plot the reconstructed images. Explain your observations in the report.
(Hint: Read section 6.3 on pages 126 and 127 of the textbook for the projection and "back projection" to the original space.)
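The projection and back projection described in part (c) can be sketched as below. The example uses small random data instead of the face images, and the variable names are hypothetical; it only illustrates the sequence of centering, projecting onto K components, mapping back, and adding the mean.

```matlab
% Hypothetical data: 20 samples with 6 correlated features (rows are samples).
rng(3);
X  = randn(20,6) * randn(6,6);
mu = mean(X,1);
Xc = X - mu;                               % center the data

% Principal components: eigenvectors of the covariance, sorted by eigenvalue.
[V,D] = eig(cov(Xc,1));
[~, idx] = sort(diag(D), 'descend');
V = V(:,idx);

for K = [2, 4, 6]
    W    = V(:,1:K);                       % first K principal components
    Z    = Xc * W;                         % projection: coordinates in the K-dim space
    Xhat = Z * W' + mu;                    % back projection, then add the mean
    err  = norm(X - Xhat, 'fro');          % reconstruction error (Frobenius norm)
    fprintf('K = %d, reconstruction error = %.4f\n', K, err);
end
```

For the face data, a reconstructed row could be displayed with the same pattern as the visualization command above, for example imagesc(reshape(Xhat(i,:),32,30)'), assuming the label column has been removed before running PCA.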
Instructions
Solutions to all questions must be included in a report, including explanations of results, learned parameter values, and all error rates and plots.
All programming questions must be written in MATLAB; no other programming languages will be accepted. The code must be executable from the MATLAB command window on the cselabs machines. Each function must take the inputs in the order specified and print/display the required output to the MATLAB command window. For each part, you may submit additional files/functions (as needed) that are used by the main functions specified below. Put comments in your code so that one can follow the key parts and steps. Please follow the rules strictly. If we cannot run your code, you will receive no credit.
Question 1:
- MultiGaussian(training_data, testing_data, Model), where training_data is the file name of the training data, testing_data is the file name of the testing data, and Model is the model number. The function must output the learned parameters and error rates as required in Question 1.
Question 2:
- myKNN(training_data, test_data, k). The function returns the predictions for the test set.
- myPCA(data, num_principal_components). The function returns the principal components and the corresponding eigenvalues.
- myLDA(data, num_principal_components). The function returns the projection matrix and the corresponding eigenvalues.
- script_2a.m, script_2b.m and script_2c.m: script files that solve question 2 (a), (b), (c), (d) and (e) by calling the appropriate functions, producing the plots, and printing the requested values.
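The core of a k-nearest-neighbor prediction, as a function like myKNN might perform it, can be sketched as follows. The tiny data set and variable names are hypothetical, Euclidean distance and majority voting are assumed, and wrapping the loop into a function with the required signature is left out.

```matlab
% Hypothetical toy data: rows are samples, last column is the label.
trainData = [0, 0, 1; 0, 1, 1; 1, 0, 1; 5, 5, 2; 5, 6, 2; 6, 5, 2];
testData  = [0.5, 0.5, 1; 5.5, 5.5, 2];
k = 3;

Xtr = trainData(:,1:end-1);  ytr = trainData(:,end);
Xte = testData(:,1:end-1);

pred = zeros(size(Xte,1), 1);
for i = 1:size(Xte,1)
    d2 = sum((Xtr - Xte(i,:)).^2, 2);      % squared Euclidean distance to every training sample
    [~, order] = sort(d2, 'ascend');       % nearest neighbors first
    pred(i) = mode(ytr(order(1:k)));       % majority vote; mode returns the smallest label on ties
end

errRate = mean(pred ~= testData(:,end));   % fraction of misclassified test samples
fprintf('k = %d, test error rate = %.3f\n', k, errRate);
```

Squared distances are sufficient here because taking the square root does not change the neighbor ordering.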
Question 3:
- script_3a.m, script_3b.m and script_3c.m: script files that solve question 3 (a), (b) and (c) by calling the appropriate functions, producing the plots, and printing the requested values.
For each dataset, rows are the samples and columns are the features with the last column containing the label.
You can use the eig function to calculate eigenvalues and eigenvectors. To visualize the projected data, you can use the text function. To specify the color, use the Color parameter of the text function. If the figure does not show all the data, you can use the axis function to rescale the axes.
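The functions mentioned above can be combined roughly as in the following sketch, which computes principal components with eig, derives the proportion of variance, and labels a subset of projected points with text in per-class colors. The toy data, the variable names, and the use of the lines colormap for colors are assumptions made for illustration.

```matlab
% Hypothetical toy data: 3 classes in 5 dimensions, n samples per class.
rng(4);
n = 30;
classMeans = [0, 0, 0, 0, 0; 6, 0, 0, 0, 0; 0, 6, 0, 0, 0];
X = zeros(3*n, 5);  y = zeros(3*n, 1);
for c = 1:3
    rows = (c-1)*n + (1:n);
    X(rows,:) = randn(n,5) + classMeans(c,:);
    y(rows)   = c;
end

% eig returns eigenvalues in no guaranteed order, so sort them explicitly.
Xc = X - mean(X,1);
[V,D] = eig(cov(Xc,1));
[lambda, idx] = sort(diag(D), 'descend');
V = V(:,idx);

pov = cumsum(lambda) / sum(lambda);        % proportion of variance explained
K   = find(pov >= 0.9, 1);                 % smallest K reaching 90%
Z   = Xc * V(:,1:2);                       % projection onto the first two components

colors = lines(3);                         % one RGB row per class
figure; hold on;
for i = 1:5:size(Z,1)                      % label every fifth point to limit clutter
    text(Z(i,1), Z(i,2), num2str(y(i)), 'Color', colors(y(i),:));
end
% text does not rescale the axes on its own, so set the limits explicitly.
axis([min(Z(:,1)), max(Z(:,1)), min(Z(:,2)), max(Z(:,2))]);
xlabel('First principal component');
ylabel('Second principal component');
hold off;
```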
Submission
Things to submit:
hw2_sol.pdf: A document which contains the report with solutions to all questions.
MultiGaussian: Code for Question 1.
myKNN.m, myPCA.m, myLDA.m, script_2a.m, script_2b.m, script_2c.m: Code for Question 2.
script_3a.m, script_3b.m, script_3c.m: Code for Question 3.
Any other files, except the data, which are necessary for your code.
Instructions for Submission:
All material must be submitted electronically via canvas.
Submit a zip file containing all of the items listed above except hw2_sol.pdf. Submit the report hw2_sol.pdf separately; do not include it in the zip file.
Failing to follow the instructions may result in lost points.