IF YOU NEED HELP
There are several strategies available to you.
• If you ever get stuck, the best way to get help is to ask on Slack. That way, the answers will be available to the other students in the class.
• Your instructor and TAs will offer office hours¹, which are a great way to get some one-on-one help.
• You are allowed to use well-known libraries such as scikit-learn, scikit-image, numpy, scipy, etc. in this assignment. Any reference to or copy of public code repositories should be properly cited in your submission (examples include GitHub, Wikipedia, and blogs).
¹ https://cs5785-2019.github.io/index.html
PROGRAMMING EXERCISES
1. Digit Recognizer
(a) Join the Digit Recognizer competition on Kaggle. Download the training and test data. The competition page describes how these files are formatted.
(b) Write a function to display an MNIST digit. Display one of each digit.
(c) Examine the prior probability of the classes in the training data. Display a normalized histogram of the digit counts. Is the distribution uniform across the digits? (A plotting sketch for parts (b) and (c) follows this exercise.)
(d) Pick one example of each digit from your training data. Then, for each sample digit, compute and show the best match (nearest neighbor) between your chosen sample and the rest of the training data. Use L2 distance between the two images’ pixel values as the metric. This probably won’t be perfect, so add an asterisk next to the erroneous examples (if any).
(e) Consider the case of binary comparison between the digits 0 and 1. Ignoring all the other digits, compute the pairwise distances for all genuine matches and all impostor matches, again using the L2 norm. Plot histograms of the genuine and impostor distances on the same set of axes.
(f) Generate an ROC curve from the above sets of distances. What is the equal error rate? What is the error rate of a classifier that simply guesses randomly? (A sketch covering parts (d) through (f) follows this exercise.)
(g) Implement a k-NN classifier. (You cannot use external libraries for this question; it should be your own implementation.)
(h) Randomly split the training data into two halves. Train your k-NN classifier on the first half of the data, and test it on the second half, reporting your average accuracy.
(i) Generate a confusion matrix (of size 10 × 10) from your results. Which digits are particularly tricky to classify? (A k-NN sketch covering parts (g) through (i) follows this exercise.)
(j) Train your classifier with all of the training data, and test your classifier with the test data. Submit your results to Kaggle.
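A minimal plotting sketch for parts (b) and (c), assuming the Kaggle training file is named train.csv with a label column followed by 784 pixel columns (check the competition page for the exact format). This is a starting point under those assumptions, not a required approach:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle training file: a "label" column followed by 784 pixel columns.
train = pd.read_csv("train.csv")
labels = train["label"].to_numpy()
pixels = train.drop(columns=["label"]).to_numpy()

def show_digit(row, ax=None):
    """Display a single 784-dimensional row as a 28x28 grayscale image."""
    if ax is None:
        ax = plt.gca()
    ax.imshow(row.reshape(28, 28), cmap="gray")
    ax.axis("off")

# (b) Display one example of each digit 0-9.
fig, axes = plt.subplots(1, 10, figsize=(12, 2))
for d, ax in zip(range(10), axes):
    show_digit(pixels[labels == d][0], ax)
    ax.set_title(str(d))
plt.show()

# (c) Normalized histogram of digit counts (the empirical prior probabilities).
counts = np.bincount(labels, minlength=10)
plt.bar(range(10), counts / counts.sum())
plt.xlabel("digit")
plt.ylabel("empirical prior")
plt.show()
```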
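A sketch for parts (d) through (f), reusing the pixels and labels arrays from the previous sketch. The subsampling of the 0/1 images and the 500-step threshold sweep are arbitrary choices to keep the computation small:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Reuses `pixels` (n x 784) and `labels` (length n) from the previous sketch.

# (d) Index of the nearest neighbor under L2 distance, excluding the query itself,
#     e.g. nearest_neighbor(np.flatnonzero(labels == 3)[0], pixels.astype(float)).
def nearest_neighbor(query_idx, data):
    dists = np.linalg.norm(data - data[query_idx], axis=1)
    dists[query_idx] = np.inf
    return int(np.argmin(dists))

# (e) Genuine vs. impostor distances for digits 0 and 1.
# Subsample to keep the pairwise computation small; drop this for the full comparison.
rng = np.random.default_rng(0)
zeros = pixels[labels == 0].astype(float)
ones = pixels[labels == 1].astype(float)
zeros = zeros[rng.choice(len(zeros), 500, replace=False)]
ones = ones[rng.choice(len(ones), 500, replace=False)]

d00, d11, d01 = cdist(zeros, zeros), cdist(ones, ones), cdist(zeros, ones)
genuine = np.concatenate([d00[np.triu_indices_from(d00, k=1)],
                          d11[np.triu_indices_from(d11, k=1)]])
impostor = d01.ravel()

plt.hist(genuine, bins=100, density=True, alpha=0.5, label="genuine")
plt.hist(impostor, bins=100, density=True, alpha=0.5, label="impostor")
plt.xlabel("L2 distance"); plt.legend(); plt.show()

# (f) ROC curve: sweep a threshold; pairs with distance below it are called "genuine".
thresholds = np.linspace(0.0, max(genuine.max(), impostor.max()), 500)
tpr = np.array([(genuine < t).mean() for t in thresholds])   # true positive rate
fpr = np.array([(impostor < t).mean() for t in thresholds])  # false positive rate
plt.plot(fpr, tpr); plt.xlabel("FPR"); plt.ylabel("TPR"); plt.show()

# Equal error rate: where the false positive rate equals the false negative rate (1 - TPR).
eer_idx = int(np.argmin(np.abs(fpr - (1.0 - tpr))))
print("approximate EER:", fpr[eer_idx])
```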
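A sketch of a from-scratch k-NN for parts (g) through (i), again reusing pixels and labels. The brute-force loop is slow on the full half-split, so you may want to subsample or chunk the distance computations:

```python
import numpy as np

# Reuses `pixels` and `labels` from the first sketch.

# (g) Plain k-NN under L2 distance, implemented with numpy only.
def knn_predict(train_X, train_y, test_X, k=3):
    preds = np.empty(len(test_X), dtype=train_y.dtype)
    for i, x in enumerate(test_X):
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argpartition(dists, k)[:k]]         # labels of the k closest points
        preds[i] = np.bincount(nearest, minlength=10).argmax()   # plurality vote
    return preds

# (h) Random half/half split of the training data.
rng = np.random.default_rng(0)
perm = rng.permutation(len(pixels))
half = len(pixels) // 2
tr, te = perm[:half], perm[half:]

# Note: this brute-force loop is slow on ~21,000 test points; subsample or chunk for speed.
pred = knn_predict(pixels[tr].astype(float), labels[tr], pixels[te].astype(float), k=3)
print("accuracy:", (pred == labels[te]).mean())

# (i) 10 x 10 confusion matrix: rows are true digits, columns are predicted digits.
confusion = np.zeros((10, 10), dtype=int)
for t, p in zip(labels[te], pred):
    confusion[t, p] += 1
print(confusion)
```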
2. The Titanic Disaster
(a) Join the Titanic: Machine Learning From Disaster competition on Kaggle. Download the training and test data.
(b) Using logistic regression, try to predict whether a passenger survived the disaster. You can choose the features (or combinations of features) you would like to use or ignore, provided you justify your reasoning. (A minimal sketch follows this exercise.)
(c) Train your classifier using all of the training data, and test it using the testing data. Submit your results to Kaggle.
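A minimal logistic-regression sketch under one illustrative feature choice (Pclass, Sex, Age, Fare, SibSp, Parch) with crude median imputation; the feature set and preprocessing here are assumptions you would need to justify, not a prescribed pipeline:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Illustrative feature choice; you should justify your own.
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch"]

def prepare(df):
    X = df[features].copy()
    X["Sex"] = (X["Sex"] == "female").astype(int)  # encode categorical sex as 0/1
    return X.fillna(X.median())                    # crude imputation for missing Age/Fare

model = LogisticRegression(max_iter=1000)
model.fit(prepare(train), train["Survived"])

# Kaggle expects a CSV with PassengerId and Survived columns.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(prepare(test)),
})
submission.to_csv("submission.csv", index=False)
```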
WRITTEN EXERCISES
1. Variance of a sum. Show that the variance of a sum is var[X − Y] = var[X] + var[Y] − 2 cov[X, Y], where cov[X, Y] is the covariance between random variables X and Y.
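One standard route is to expand the definition var[Z] = E[Z²] − (E[Z])² with Z = X − Y and regroup terms; a sketch of the algebra:

```latex
\begin{align*}
\mathrm{var}[X - Y] &= \mathbb{E}\!\left[(X - Y)^2\right] - \left(\mathbb{E}[X - Y]\right)^2 \\
&= \mathbb{E}[X^2] - 2\,\mathbb{E}[XY] + \mathbb{E}[Y^2]
   - (\mathbb{E}[X])^2 + 2\,\mathbb{E}[X]\,\mathbb{E}[Y] - (\mathbb{E}[Y])^2 \\
&= \underbrace{\mathbb{E}[X^2] - (\mathbb{E}[X])^2}_{\mathrm{var}[X]}
 + \underbrace{\mathbb{E}[Y^2] - (\mathbb{E}[Y])^2}_{\mathrm{var}[Y]}
 - 2\underbrace{\left(\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\right)}_{\mathrm{cov}[X, Y]}
\end{align*}
```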
2. Bayes rule for quality control. You're the foreman at a factory making ten million widgets per year. As a quality control step before shipment, you create a detector that tests for defective widgets before sending them to customers. The test is uniformly 95% accurate, meaning that the probability of testing positive given that the widget is defective is 0.95, as is the probability of testing negative given that the widget is not defective. Further, only one in 100,000 widgets is actually defective.
(a) Suppose the test shows that a widget is defective. What are the chances that it’s actually defective given the test result?
(b) If we throw out all widgets that test as defective, how many good widgets are thrown away per year? How many bad widgets are still shipped to customers each year?
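For part (a), the relevant form of Bayes' rule, writing D for "defective" and + for "tests positive", with P(+ | D) = 0.95, P(+ | ¬D) = 0.05, and P(D) = 1/100,000 as given, is:

```latex
P(D \mid +) \;=\; \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) \;+\; P(+ \mid \neg D)\,P(\neg D)}
```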
3. In k-nearest neighbors, classification is decided by a plurality vote among the nearest training points. Suppose our training data comprise n data points in two classes, each class making up exactly half of the training data, with some overlap between the two classes.
(a) Describe what happens to the average 0-1 prediction error on the training data when the neighbor count k varies from n to 1. (In this case, the prediction for training data point x_i includes (x_i, y_i) as part of the example training data used by kNN.)
(b) We randomly choose half of the data to be removed from the training data, train on the remaining half, and test on the held-out half. Predict, and explain with a sketch, how the average 0-1 prediction error on the held-out validation set might change as k varies. Explain your reasoning.
(c) In kNN, once k is determined, all of the k nearest neighbors are weighted equally in deciding the class label. This may be inappropriate when k is large. Suggest a modification to the algorithm that avoids this issue. (One common modification is sketched after this exercise.)
(d) Give two reasons why kNN may be undesirable when the input dimension is high.
(e) The plot below shows a binary classification task with two classes: red and blue. Training data points are shown as triangles, test data points as squares, and all points are colored according to their class. The line in the plot is the decision rule from training a logistic regression classifier on the training data, with points above the line classified as red and points below the line classified as blue.
i. Calculate confusion matrices for both the training and test data for the logistic regression classifier.
ii. Now, calculate confusion matrices for both the training and test data for the kNN algorithm with k = 1 on the same data, using the same train/test split (you should be able to do this by hand).
iii. Discuss the difference in accuracy on both the training and test data between the two algorithms. Explain why this difference occurs.
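One common modification for part (c) is to weight each of the k nearest neighbors by a decreasing function of its distance (for example, inverse distance), so that distant neighbors contribute less to the vote. A minimal sketch, with weighted_knn_predict as a hypothetical helper:

```python
import numpy as np

def weighted_knn_predict(train_X, train_y, x, k, eps=1e-12):
    """k-NN where each neighbor's vote is weighted by 1 / (distance + eps)."""
    dists = np.linalg.norm(train_X - x, axis=1)
    idx = np.argpartition(dists, k)[:k]              # indices of the k closest points
    weights = 1.0 / (dists[idx] + eps)               # closer neighbors get larger weights
    scores = np.bincount(train_y[idx], weights=weights)
    return int(scores.argmax())
```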