
Homework #2 Solution

R Programming Submission Instructions




Make sure you clearly list each team member's name and Unity ID at the top of your submission.




Your code should be named hw2.R. Add this file, along with a README, to the zip file mentioned on the first page.




Failure to follow the naming conventions or the programming-related instructions specified below may result in your submission not being graded.




Carefully read which function names have been requested by the instructor. In this homework and the following ones, if your code does not follow the naming format requested by the instructor, you will not receive credit.




For each function, both the input and output formats are provided in hw2.R. Function calls are specified in hw2_checker.R. Please ensure that you follow the correct input and output formats. Once again, if you do not follow the requested format, you will not receive credit. The comments in hw2.R clearly state which functions need to be implemented by you.




You are free to write your own functions to handle sub-tasks, but the TA will only call the functions he has requested. If the requested functions do not run, do not return the correct values, or do not finish running in the specified time, you will not receive full credit.




















DO NOT set the working directory (setwd function) or clear memory (rm(list=ls(all=T))) in your code. The TA(s) will do so in their own autograder.




The TA will have an autograder which will first run source("hw2.R"), then call each of the functions requested in the homework and compare the results with the correct solution.




Your code should be clearly documented.




To test your code, step through the hw2_checker.R file. If you update your code, make sure to run source("./hw2.R") again to update your function definitions. You can also check the "Source on save" option in RStudio to do this automatically on save.




Update: Please DO NOT include install.packages() in your hw2.R. Note that we specify the allowed packages, which means we already have them installed on our test machine. From HW2 onward, using install.packages in your code will result in a penalty of 5 points.







Problems




Decision Tree Construction (20 points) [Song Ju]. Create decision trees for the hw2q1.csv dataset by hand, as explained below, using Hunt's algorithm. Note the following:



In the given dataset, all of the input attributes are binary except for the first attribute, V1, which is ratio-scaled and continuous.




The output label has two class values: T or F.




In the case of ties when selecting an attribute, break ties in favor of the leftmost attribute.




When considering a split for the continuous V1 attribute, identify the best value to split on (e.g., into V1 < 15 and V1 >= 15) by testing all possible split values.




When calculating Information Gain or the Gini Index, write down each step of the computation and show your work step by step. Draw a separate tree after each attribute split, like the example in Figure 1. You can use a program (e.g., TikZ with LaTeX) to draw your trees, or draw them by hand on paper and scan your results into the final PDF.




Construct the decision tree manually, using Information Gain (IG) to select the best attribute to split on, as in the ID3 algorithm. The maximum depth of your tree should be 4 (count the root node as depth 0), meaning that any node at depth 4 will be a leaf node.
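As a quick sanity check for the hand computation, entropy and information gain can be computed in a few lines of base R. The function names here are illustrative, not part of the required hw2.R interface:

```r
# Entropy (in bits) of a class-label vector, and information gain of an
# attribute, as used by ID3. split() partitions the labels by attribute value.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

info_gain <- function(labels, attr) {
  parts <- split(labels, attr)
  weighted <- sum(sapply(parts, function(s) length(s) / length(labels) * entropy(s)))
  entropy(labels) - weighted
}

y <- c("T", "T", "F", "F")
info_gain(y, c(1, 1, 0, 0))  # perfectly separating attribute: gain = 1 bit
info_gain(y, c(1, 0, 1, 0))  # uninformative attribute: gain = 0 bits
```

Verifying each hand-computed gain this way catches arithmetic slips before they propagate down the tree.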



Construct the tree manually using the Gini index. The maximum depth of the tree should be 2.
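The Gini computation can be checked the same way; again these helper names are illustrative only:

```r
# Gini index of a class-label vector, and the weighted Gini of a candidate
# split (the attribute with the lowest weighted Gini is chosen).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini_split <- function(labels, attr) {
  parts <- split(labels, attr)
  sum(sapply(parts, function(s) length(s) / length(labels) * gini(s)))
}

y <- c("T", "T", "F", "F")
gini(y)                       # 0.5 for a 50/50 node
gini_split(y, c(1, 1, 0, 0))  # 0 for a pure split
```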



How are the trees different? As part of your explanation, give 2 examples of data objects that would be classified differently by the two trees.



Which decision tree will perform better on the training dataset (hw2q1.csv)? Which will perform better on a test dataset? Can we know the answer?



Evaluation Measures (10 points) [Song Ju] Given the decision tree in Figure 1, complete the following tasks:














Figure 1: Decision Tree




Compute the training classification error for the classifier using both the optimistic error and the pessimistic error. For the pessimistic training error, use a penalty of 0.5 per leaf node. You may assume that the whole training dataset is represented by the class labels in the leaf nodes of the tree in Figure 1, with a total of 34 training instances.
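The pessimistic-error arithmetic reduces to one line; the error and leaf counts below are placeholders (Figure 1 is not reproduced here), so substitute the values you read off the tree:

```r
# Pessimistic training error: observed leaf errors plus a fixed penalty per
# leaf, divided by the number of training instances.
pessimistic_error <- function(n_errors, n_leaves, n_total, penalty = 0.5) {
  (n_errors + penalty * n_leaves) / n_total
}

# Hypothetical example: 4 leaf errors across 5 leaves, 34 training instances.
pessimistic_error(4, 5, 34)  # (4 + 0.5 * 5) / 34
```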



Use the decision tree above to classify the provided dataset hw2q2.csv. Construct a confusion matrix and report the test Accuracy, Error Rate, Precision, Recall, and F1 score. Use "Yes" as the positive class in the confusion matrix.
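The five measures all fall out of the confusion matrix; a base-R sketch (the function name is illustrative, not required by the grader):

```r
# Confusion matrix and derived measures, with "Yes" as the positive class.
classification_metrics <- function(pred, truth, positive = "Yes") {
  cm <- table(Prediction = pred, Reference = truth)
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  acc  <- mean(pred == truth)
  prec <- tp / (tp + fp)
  rec  <- tp / (tp + fn)
  list(confusion = cm, accuracy = acc, error_rate = 1 - acc,
       precision = prec, recall = rec,
       f1 = 2 * prec * rec / (prec + rec))
}

classification_metrics(c("Yes", "Yes", "No", "No"),
                       c("Yes", "No", "No", "Yes"))
```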



Decision Tree Pruning (10 points) [Ruth Okoilu]. Consider the decision tree in Figure 1. We will focus on the sub-tree that splits on the attribute "Color". Answer the following questions and show your work.



Calculate the optimistic training classification error before and after splitting on Color. If we want to minimize the optimistic error rate, should the node be pruned?



Calculate the pessimistic training errors before and after splitting on Color. When calculating pessimistic errors, each leaf node adds a factor of 0.8 to the error. If we want to minimize the pessimistic error rate, should the node be pruned?



Assuming that the "Color" node is pruned, recalculate the test Error Rate using hw2q2.csv. Based on your evaluation using the test dataset in hw2q2.csv, was the original tree (with the Color node) overfitting? Why or why not?



1-NN, Evaluation, Cross Validation (15 points) [Ruth Okoilu] Consider the following dataset (9 instances) with 2 continuous attributes (x1 and x2) and a class attribute y, shown in Table 1. For this question, we will consider a 1-Nearest-Neighbor (1-NN) classifier that uses Euclidean distance.



Table 1: 1-NN

ID    x1    x2    y
1    35.0  15.0   -
2     2.5  11.0   -
3    10.5  12.5   +
4    44.0  11.0   +
5     1.5  13.0   -
6    48.0  11.0   +
7    45.0  13.0   -
8    38.0  10.0   +
9     7.5  13.5   -

Calculate the distance matrix for the dataset using Euclidean distance.
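Base R's dist() can be used to check the hand-computed matrix for Table 1:

```r
# The 9 instances of Table 1 as (x1, x2) rows; dist() defaults to Euclidean.
X <- rbind(c(35.0, 15.0), c(2.5, 11.0), c(10.5, 12.5),
           c(44.0, 11.0), c(1.5, 13.0), c(48.0, 11.0),
           c(45.0, 13.0), c(38.0, 10.0), c(7.5, 13.5))
D <- as.matrix(dist(X, method = "euclidean"))
round(D, 2)  # 9x9 symmetric distance matrix with a zero diagonal
```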



By hand, evaluate the 1-NN classifier, calculating the confusion matrix and testing accuracy (show your work). Use the following evaluation methods:



A holdout test dataset consisting of the last 4 instances



3-fold cross-validation, using the following folds with IDs: [3,6,9], [1,4,7], [2,5,8] respectively.



Leave-one-out cross validation (LOOCV)



For a data analysis homework, you are asked to perform an experiment with a binary classification algorithm. You are given a dataset with 20 instances and a class attribute that can be either Positive or Negative. The dataset includes 10 positive and 10 negative instances, each with d attributes. You decide to use LOOCV (i.e., 20 singleton test sets) to evaluate the algorithm. As a baseline, you compare your algorithm to a "simple majority classifier," which always predicts the majority class in the training dataset (if there is no majority, one of the classes is chosen at random). You expect the test accuracy of the simple majority classifier to be about 50% using LOOCV. Instead, it performs differently. How does it perform, and why?
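The effect is easy to observe by simulating the setup directly (this sketch spells out the answer, so treat it as part of the solution, not the question):

```r
# LOOCV with a majority-class baseline on a balanced 10/10 dataset.
# Removing one instance leaves 9 of its class and 10 of the other, so the
# training majority is always the opposite class (no ties arise here).
y <- rep(c("Positive", "Negative"), each = 10)
correct <- logical(length(y))
for (i in seq_along(y)) {
  counts <- table(y[-i])                       # class counts without instance i
  majority <- names(counts)[which.max(counts)] # always the other class
  correct[i] <- (majority == y[i])
}
mean(correct)  # 0: the baseline misclassifies every held-out instance
```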



R Programming Part 1: KNN and Decision Tree Classifiers (45 points) [Krishna Gadiraju] You are given the following files:



hw2_train.csv: CSV file with 100 rows and 101 columns. The first 100 columns (named V1:V100) are your features; the final column (called 'Class') is the Class variable. There are 4 classes: {1, 2, 3, 4}. This is your training dataset.




hw2_test.csv: This is your test dataset and follows the same format as your training dataset. It has 50 rows.




The data was generated by computing the TF-IDF matrix of a corpus of newspapers. The 100 top words from several documents were extracted (i.e., feature extraction was already done for you), and the TF-IDF matrix was calculated for these words. A similar transformation was applied to your test dataset as well.







Please note that this exercise has sections where you need to implement methods yourself and sections where you may use a library. Whether you need to implement or may use a library is clearly stated in each part:



Part A: Distance Computation (similar to HW1): Recall the calculate_matrix, calculate_euclidean, and calculate_cosine methods from HW1. Once again, the calculate_matrix implementation has already been provided to you; it takes the training data matrix and the test data matrix and computes pairwise distances between every row (sentence/document) in the test data matrix and the training data matrix. calculate_euclidean and calculate_cosine follow the same definitions as in HW1. Copy your solutions for these two methods from HW1 into hw2.R (making sure to correct them if you had any errors in HW1). We will use these methods to calculate the distance matrix for the KNN algorithm.




Part B: Vanilla KNN Classification (Implement): You are tasked with implementing the KNN algorithm using the dataset provided to you. You will implement the KNN classifier in hw2.R in the knn_classifier method. The input and output variables, and their formats, are explained in detail in the comments inside the knn_classifier function. Read every comment carefully before implementing your method.
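The core of vanilla KNN fits in a few lines once the distance matrix exists. This is a sketch under the assumption that dist_mat[i, j] holds the distance from test row i to training row j; knn_predict is an illustrative name, and the required signature lives in hw2.R (note also that which.max breaks voting ties by the first table entry, which may differ from any tie rule the checker expects):

```r
# For each test row: take the k smallest distances, then majority-vote
# over the corresponding training labels.
knn_predict <- function(dist_mat, train_labels, k) {
  apply(dist_mat, 1, function(d) {
    nn <- order(d)[1:k]
    votes <- table(train_labels[nn])
    names(votes)[which.max(votes)]
  })
}

# One test instance whose 3 nearest training neighbors are a, a, b:
knn_predict(matrix(c(0.1, 0.2, 0.9), nrow = 1), c("a", "a", "b"), k = 3)
```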



Part B-2: KNN using confidence (Implement): Read the paper titled "A simple KNN algorithm for text categorization" [1].(1) The authors present a simple solution to text categorization using the KNN algorithm. While the primary contribution of the paper is the feature selection methods used to reduce the computational complexity of classifying a large corpus of text data, we will focus on the measure they use to compute the nearest neighbors. [1] defines a confidence measure on top of the similarity measure to calculate the nearest neighbors. The standard procedure of the KNN algorithm (assuming the data is normalized and feature selection, if needed, is completed) is:



Step 1: For a test instance, calculate the distance to all training instances




Step 2: Find the k nearest training instances







(1) You can find the paper by searching scholar.google.com for the title.


Step 3: Choose the class based on the majority count among the k nearest training instances




The algorithm in [1] modifies Step 3. Instead of taking the majority class of the K nearest neighbors, [1] weights each neighbor according to its cosine similarity and then chooses the class with the greatest total weight among the K nearest neighbors. This weighted score between the test instance and each class is called confidence. In other words, for a test instance t with K nearest neighbors k1, k2, ..., kK, where k2, k4, k5 belong to class c:

Confidence(t, c) = sum(cosine similarities of t and {k2, k4, k5}) / sum(cosine similarities of t and {k1, ..., kK})

The final class label is the class with maximum confidence. Implement this functionality in the method knn_classifier_confidence. The input and output formats, as well as the allowed packages, are described in comments under the function.
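The confidence rule above can be sketched as follows, assuming sim_mat[i, j] holds the cosine similarity (higher = closer) between test row i and training row j; knn_predict_confidence is an illustrative name, and the required name is knn_classifier_confidence with the signature documented in hw2.R:

```r
# For each test row: take the k most similar neighbors, sum the similarities
# per class, normalize by the total similarity of all k neighbors, and pick
# the class with maximum confidence.
knn_predict_confidence <- function(sim_mat, train_labels, k) {
  apply(sim_mat, 1, function(s) {
    nn <- order(s, decreasing = TRUE)[1:k]
    conf <- tapply(s[nn], train_labels[nn], sum) / sum(s[nn])
    names(conf)[which.max(conf)]
  })
}

# One neighbor of class a (sim 0.9) outweighs two of class b (sims 0.5, 0.1):
knn_predict_confidence(matrix(c(0.9, 0.5, 0.1), nrow = 1), c("a", "b", "b"), k = 3)
```

Note that because confidence divides by the same total for every class, it picks the same label as the raw similarity sums; the normalization just makes the scores comparable across test instances.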







Part C: Decision Tree using Cross Validation (Library):



Using the training dataset hw2_train.csv, build a CART decision tree using the rpart library with the Gini index as the splitting criterion. Then, using this model, classify the outcomes for the test dataset hw2_test.csv. Code for this question must be written in the function dtree.



Use the caret library to tune the complexity parameter of the CART classifier, using n-fold cross validation on the training dataset. Then, using this model, classify the outcomes for the test dataset hw2_test.csv. Code for this question must be written in the function dtree_cv.






The input and output formats, as well as the allowed packages, are described in comments under each function.
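The rpart call for the dtree step can be sketched as follows, using iris as stand-in data (the homework uses hw2_train.csv and hw2_test.csv, with the exact function signature documented in hw2.R):

```r
# CART via rpart with the Gini splitting criterion (rpart's default for
# classification), fit on a 100-row sample and evaluated on the holdout.
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class",
              parms = list(split = "gini"))
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$Species)  # holdout accuracy on the stand-in data
```

For dtree_cv, the usual caret route is train(Species ~ ., data = train, method = "rpart", trControl = trainControl(method = "cv", number = n), tuneLength = 10), which cross-validates over the complexity parameter cp and refits the tree at the best value.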




Part D: Confusion Matrix and Accuracy (Library): Write a function calculate_accuracy that takes the predicted class labels as a vector (factor) and the ground truth labels as a vector (factor), and returns a list with the confusion matrix as the first element and the overall accuracy as the second. The input and output formats, as well as the allowed packages, are described in comments under the function. An example of how the confusion matrix should be represented is shown in Figure 2: Prediction refers to the predicted values, Reference refers to the ground truth values.
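Base R's table() already produces the Prediction-by-Reference layout of Figure 2; a minimal sketch (check hw2.R for the allowed packages and the exact return format the grader expects):

```r
# Confusion matrix via table(), plus overall accuracy, returned as a list.
calculate_accuracy <- function(predicted, actual) {
  cm <- table(Prediction = predicted, Reference = actual)
  list(confusion_matrix = cm, accuracy = mean(predicted == actual))
}

p <- factor(c(1, 2, 2, 1))
a <- factor(c(1, 2, 1, 1))
calculate_accuracy(p, a)$accuracy  # 3 of 4 correct: 0.75
```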



Part E: Analysis: Report these answers in your PDF. Compare the performance of the KNN classifier using Euclidean distance, the KNN classifier using cosine similarity, the KNN classifier using confidence, and the Decision Tree classifiers (with and without hyperparameter tuning). At a minimum, we expect to see the following comparisons. You are welcome to perform a more in-depth analysis in addition to these.



Comparison in terms of overall accuracy: which classifier performed best?




Comparison in terms of the confusion matrix: in each classifier, which class had the most misclassifications? Did one class perform better than another in these cases?

































Figure 2: Sample Confusion Matrix generated with random data







References




P. Soucy and G. W. Mineau, "A simple KNN algorithm for text categorization," in Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001). IEEE, 2001, pp. 647-648.





















