Programming Assignment #2 Solution


This is an individual assignment. However, you are allowed to discuss the problems with other students in the class. But you should write your own code and report.

If you discuss the assignment with others, you should acknowledge the discussion in your report by mentioning their names.

Be precise with your explanations in the report. Unnecessary verbosity will be penalized.

You have to hand in the report as a hard copy in the assignments hand-in box opposite MC317. You have to submit the code on myCourses. Note that you should make both of these submissions before the deadline.

After the deadline, you have one week to submit your assignment with a 30% penalty.

You are free to use libraries with general utilities for linear algebra, data handling, and plotting, such as numpy, scipy and matplotlib for Python. However, you should implement the algorithms yourself, which means you should not use pre-existing implementations of the nearest neighbor and linear discriminant analysis algorithms as found in scikit-learn, TensorFlow, etc.!

If you have questions regarding the assignment, you can ask for clarifications in the class discussion forum or go to the following office hours: Prasanna, Philip (section 1), Ali, Lucas (section 2).

Linear Classification and Nearest Neighbor Classification

You will use a synthetic data set for the classification task that you’ll generate yourself. Generate two classes with 20 features each. Each class is given by a multivariate Gaussian distribution, with both classes sharing the same covariance matrix. You are provided with the mean vectors (DS1-m0 for the mean vector of the negative class and DS1-m1 for the mean vector of the positive class) and the covariance matrix (DS1-cov). Generate 2000 examples for each class, and label the data as positive if they came from the Gaussian with mean m1 and negative if they came from the Gaussian with mean m0. Randomly pick 30% of each class (i.e., 600 data points per class) as a test set, and train the classifiers on the remaining 70%. When you report performance results, they should be on the left-out 30%. Call this dataset DS1, and submit it with your code.
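The generation and split described above could be sketched as follows. This is only an illustration under stated assumptions: the means and covariance below are placeholders, not the provided DS1-m0, DS1-m1 and DS1-cov, and the function name `make_two_class_dataset` is hypothetical.

```python
import numpy as np

def make_two_class_dataset(m0, m1, cov, n_per_class=2000, test_frac=0.3, seed=0):
    """Sample two Gaussian classes with a shared covariance, then split each class."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * n_per_class))
    train_parts, test_parts = [], []
    for label, mean in ((0, m0), (1, m1)):
        X = rng.multivariate_normal(mean, cov, size=n_per_class)
        # Append the label as the last column so rows stay aligned when shuffled.
        data = np.column_stack([X, np.full(n_per_class, label)])
        rng.shuffle(data)                  # shuffles rows in place
        test_parts.append(data[:n_test])   # 30% of this class
        train_parts.append(data[n_test:])  # remaining 70%
    train = np.vstack(train_parts)
    test = np.vstack(test_parts)
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Placeholder parameters (assumptions); replace with the provided files.
d = 20
m0 = np.zeros(d)
m1 = np.ones(d)
cov = np.eye(d)

train, test = make_two_class_dataset(m0, m1, cov)
```

Splitting per class, rather than over the pooled data, keeps exactly 600 test points in each class as the assignment requires.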

We first consider the probabilistic LDA model as seen in class: given the class variable, the data are assumed to be Gaussians with different means for different classes but with the same covariance matrix. This model can formally be specified as follows:

$$Y \sim \mathrm{Bernoulli}(\pi), \qquad X \mid Y = j \sim \mathcal{N}(\mu_j, \Sigma).$$

Estimate the parameters of the probabilistic LDA model using the maximum likelihood approach. For DS1, report the best-fit accuracy, precision, recall and F-measure achieved by the classifier, along with the coefficients learnt.
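One way the maximum likelihood fit might look is sketched below, assuming labels are coded 0 (negative) and 1 (positive). The function names and the toy data are illustrative assumptions, not part of the assignment. The coefficients follow from the log-odds of the two Gaussians, which are linear in x when the covariance is shared.

```python
import numpy as np

def fit_lda(X, y):
    """Maximum likelihood estimates for two-class LDA with a shared covariance."""
    X0, X1 = X[y == 0], X[y == 1]
    pi = len(X1) / len(X)                      # estimate of P(Y = 1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled covariance; the MLE divides by N rather than N - 2.
    centered = np.vstack([X0 - mu0, X1 - mu1])
    sigma = centered.T @ centered / len(X)
    # Log-odds are linear in x: w @ x + w0 > 0 means "predict class 1".
    w = np.linalg.solve(sigma, mu1 - mu0)
    w0 = (
        -0.5 * mu1 @ np.linalg.solve(sigma, mu1)
        + 0.5 * mu0 @ np.linalg.solve(sigma, mu0)
        + np.log(pi / (1.0 - pi))
    )
    return w, w0

def predict_lda(X, w, w0):
    return (X @ w + w0 > 0).astype(int)

# Toy two-dimensional data (an assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-2.0, 1.0, size=(200, 2)),
    rng.normal(2.0, 1.0, size=(200, 2)),
])
y_demo = np.repeat([0, 1], 200)
w, w0 = fit_lda(X_demo, y_demo)
demo_accuracy = (predict_lda(X_demo, w, w0) == y_demo).mean()
```

Here `w` and `w0` are the learnt coefficients; `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability.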

For DS1, use k-NN to learn a classifier. Repeat the experiment for different values of k and report the performance for each value. We will compare this non-linear classifier to the linear approach, and find out how powerful linear classifiers can be. Do you do better than LDA or worse? Are there particular values of k which perform better? Report the best-fit accuracy, precision, recall and F-measure achieved by this classifier.
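A minimal k-NN sketch, assuming Euclidean distance, 0/1 integer labels, and majority voting with ties resolved toward the negative class (all of these are assumptions; the assignment does not fix them). The function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    # Indices of the k smallest distances per test row (order within the k is arbitrary).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)  # fraction of positive neighbours
    return (votes > 0.5).astype(int)       # a tie goes to class 0

X_small = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_small = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.2, 0.2], [5.5, 5.5]])
pred_small = knn_predict(X_small, y_small, queries, k=3)
```

Using odd values of k avoids ties in the two-class case. Looping this function over several values of k and recording the test metrics gives the comparison the question asks for.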

Now, instead of having a single multivariate Gaussian distribution per class, each class is going to be generated by a mixture of 3 Gaussians. For each class, we’ll define 3 Gaussians, with the first Gaussian of the first class sharing the covariance matrix with the first Gaussian of the second class and so on. For both classes, fix the mixture probabilities as (0.1, 0.42, 0.48), i.e., the sample has arisen from the first Gaussian with probability 0.1, the second with probability 0.42 and so on. The means for the three Gaussians in the positive class are given as DS2-c1-m1, DS2-c1-m2, DS2-c1-m3. The means for the three Gaussians in the negative class are given as DS2-c2-m1, DS2-c2-m2, DS2-c2-m3. The corresponding 3 covariance matrices are given as DS2-cov-1, DS2-cov-2 and DS2-cov-3. Now sample from this distribution and generate the dataset as in question 1. Call this dataset DS2, and submit it with your code.
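Sampling from such a mixture can be done by first drawing a component index for every point and then sampling from that component's Gaussian. The sketch below assumes placeholder means and covariances in place of the provided DS2 files; `sample_mixture` is a hypothetical helper name.

```python
import numpy as np

def sample_mixture(means, covs, weights, n, rng):
    """Draw n points from a Gaussian mixture, choosing a component per point."""
    comp = rng.choice(len(weights), size=n, p=weights)
    X = np.empty((n, len(means[0])))
    for j in range(len(weights)):
        idx = np.flatnonzero(comp == j)
        if idx.size:
            X[idx] = rng.multivariate_normal(means[j], covs[j], size=idx.size)
    return X

rng = np.random.default_rng(0)
d = 20
weights = [0.1, 0.42, 0.48]

# Placeholder parameters (assumptions); replace with the provided DS2 files.
# Component j of both classes shares covs[j], as the assignment specifies.
covs = [np.eye(d), 2.0 * np.eye(d), 0.5 * np.eye(d)]
pos_means = [np.full(d, v) for v in (1.0, 2.0, 3.0)]
neg_means = [np.full(d, v) for v in (-1.0, -2.0, -3.0)]

X_pos = sample_mixture(pos_means, covs, weights, 2000, rng)
X_neg = sample_mixture(neg_means, covs, weights, 2000, rng)
```

The resulting arrays can then be labelled and split 70/30 per class in the same way as DS1.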

Now perform the experiments in questions 2 and 3 again, but now using DS2. Report the same performance measures as before. What do you observe?
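The four performance measures requested for both datasets can be computed from the confusion counts. The sketch below assumes labels coded 0/1 with 1 as the positive class, takes the F-measure to be F1, and returns 0.0 when a denominator is zero (all three are assumptions). The helper name `binary_metrics` is hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F1) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

In this toy call there are two true positives, one false positive, one false negative and two true negatives, so all four measures equal 2/3.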

Comment on any similarities and differences between the performance of both classifiers on datasets DS1 and DS2.

Instructions for code submission

Submit a single zipped folder with your McGill ID as the name of the folder. For example, if your McGill ID is 12345678, then the submission should be 12345678.zip.

If you are using Python, you must submit your solution as a Jupyter notebook.

Make sure all the data files needed to run your code are within the folder and loaded with relative paths. We should be able to run your code without making any modifications.
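As an illustration of loading data by relative path, the sketch below saves and reloads an array with paths that resolve against the notebook's working directory. The folder name `data` and the file name `DS1_demo` are hypothetical, and CSV is an assumed format.

```python
from pathlib import Path
import numpy as np

DATA_DIR = Path("data")  # hypothetical folder, relative to the notebook

def save_dataset(name, array):
    """Write an array as CSV inside the submission folder."""
    DATA_DIR.mkdir(exist_ok=True)
    path = DATA_DIR / f"{name}.csv"
    np.savetxt(path, array, delimiter=",")
    return path

def load_dataset(name):
    """Read the CSV back using the same relative path."""
    return np.loadtxt(DATA_DIR / f"{name}.csv", delimiter=",")

demo = np.arange(6, dtype=float).reshape(2, 3)
saved_path = save_dataset("DS1_demo", demo)
loaded = load_dataset("DS1_demo")
```

Because no absolute path appears anywhere, the notebook runs unchanged after the zipped folder is extracted on another machine.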

Instructions for report submission

Your report should be brief and to the point. When asked for comments, your comment should not be more than 3-4 lines.

Do not include your code in the report!

If your report consists of more than one page, make sure the pages are stapled.