1 Splitting Heuristic for Decision Trees [20 pts]
(a) How many mistakes does the best 1-leaf decision tree make over the 2n training examples? (The 1-leaf decision tree does not split the data even once. Make sure you answer for the general case when n ≥ 4.)
The best 1-leaf decision tree always predicts the most common label, which is Y = 1 (7/8 of the examples are labeled 1). Since it never splits the data, this constant prediction is wrong only on the examples labeled 0, so it makes 2n · (1/8) = n/4 mistakes, an error rate of 1/8.
(b) Is there a split that reduces the number of mistakes by at least one? (That is, is there a decision tree with 1 internal node with fewer mistakes than your answer to part (a)?) Why or why not? (Note that, as in lecture, you should restrict your attention to splits that consider a single attribute.)
No. Whichever attribute Xi we place at the root, the error rate remains 1/8. Splitting on Xi for i = 1, 2, 3 produces one leaf that contains only 1s (the Xi = 1 branch) and one leaf in which the proportion of 1s is 3/4 (when Xi = 0 for i = 1, 2, 3, 3/4 of the labels are 1 and 1/4 are 0). Splitting on Xi for i > 3 leaves the proportion of 1s at 7/8 in both branches. In every case both leaves predict 1, so the tree makes exactly the same number of mistakes as the 1-leaf decision tree that always predicts 1.
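As a sanity check, here is a minimal Python sketch of the counting argument above. The assignment's actual 2n-example training set is not reproduced here; the sketch assumes the distribution implied by the proportions above, namely Y = X1 OR X2 OR X3 with every assignment of the n boolean attributes equally represented.

```python
# Hypothetical sketch only: assumes Y = X1 OR X2 OR X3 over all assignments of the
# n boolean attributes, which matches the 7/8 and 3/4 proportions described above.
from itertools import product

n = 4  # any n >= 4 gives the same proportions
examples = [(x, int(x[0] or x[1] or x[2])) for x in product([0, 1], repeat=n)]

# (a) the best 1-leaf tree predicts the majority label, 1
mistakes_one_leaf = sum(1 for _, y in examples if y != 1)
print(mistakes_one_leaf / len(examples))  # 0.125, i.e. an error rate of 1/8

# (b) after any single-attribute split, both leaves still predict 1,
# so no split reduces the number of mistakes
for i in range(n):
    mistakes = 0
    for branch in (0, 1):
        leaf = [y for x, y in examples if x[i] == branch]
        majority = int(sum(leaf) >= len(leaf) / 2)
        mistakes += sum(1 for y in leaf if y != majority)
    print(i, mistakes / len(examples))  # 0.125 for every attribute
```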
(c) What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?
Since Y = 1 for 7/8 of the examples, the entropy is H[Y] = -(7/8) log2(7/8) - (1/8) log2(1/8) ≈ 0.544 bits.
(d) Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what is it, and what is the resulting conditional entropy of Y given this split? (Again, as in lecture, you should restrict your attention to splits that consider a single attribute.)
Splitting on X1, X2, or X3 reduces the entropy of the output Y by a nonzero amount. For each such Xi (i = 1, 2, 3), Xi is 1 half of the time and 0 the other half. When Xi = 1, Y is always 1, so that branch has zero entropy; when Xi = 0, Y is 1 three quarters of the time and 0 one quarter of the time. The resulting conditional entropy is
H[Y | Xi] = (1/2)(0) + (1/2)(-(3/4) log2(3/4) - (1/4) log2(1/4)) ≈ 0.406 bits,
which is less than H[Y] ≈ 0.544 bits.
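A quick numeric check of (c) and (d), using only the 7/8 and 3/4 proportions established above:

```python
from math import log2

def H(q):
    """Binary entropy in bits for a label that is 1 with probability q."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

print(H(7 / 8))                       # ~0.544 bits, part (c)
print(0.5 * H(1.0) + 0.5 * H(3 / 4))  # ~0.406 bits, H[Y | Xi] for i = 1, 2, 3, part (d)
```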
2 Entropy and Information [5 pts]
For a Bernoulli random variable with P(X = 1) = q, q ranges from 0 to 1: q = 0 means the event never occurs and q = 1 means it always occurs. In the above example q = p/(p + n), and the entropy of the label is
B(q) = -q log2(q) - (1 - q) log2(1 - q).
When q = 0 there are no positive examples (p = 0 and n > 0), so a positive example never occurs, there is no uncertainty, and the entropy is 0. Likewise, when q = 1 there are no negative examples (n = 0 and p > 0), and the entropy is again 0. The maximum entropy occurs when p = n, i.e. when there are equal numbers of positive and negative examples and we are completely uncertain.
The derivative of the entropy with respect to q is
dB/dq = -log2(q) + log2(1 - q).
Setting it equal to 0 gives q = 1/2, so the maximum entropy occurs at q = 1/2, i.e. when p = n and there are equal numbers of positive and negative examples.
For a split whose subsets all have the same ratio of positive to negative examples, the information gain is
GAIN[Y, split] = H[Y] - H[Y | split] = 0
Each subset’s entropy is equal, since the ratio p_k/(p_k + n_k) is the same for all k; and if every subset has the same ratio, that ratio must also be the ratio p/(p + n) of the original set. The conditional entropy given the split is therefore
H[Y | split] = Σ_k [(p_k + n_k)/(p + n)] · B(p_k/(p_k + n_k)) = B(p/(p + n)),
since each term’s entropy equals B(p/(p + n)) and the weights sum to 1. This is just the original entropy with no split, so no information is gained:
GAIN[Y, split] = H[Y] - H[Y | split] = 0
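A small numeric illustration of this fact (the subset counts below are made up for the example; any counts sharing a common ratio behave the same way):

```python
from math import log2

def B(q):
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

# three hypothetical subsets, each with the same 2:1 positive-to-negative ratio
subsets = [(4, 2), (8, 4), (2, 1)]      # (p_k, n_k)
p, n = map(sum, zip(*subsets))          # totals: p = 14, n = 7
total = p + n

H_Y = B(p / total)
H_Y_given_split = sum((pk + nk) / total * B(pk / (pk + nk)) for pk, nk in subsets)
print(H_Y - H_Y_given_split)            # 0.0: GAIN[Y, split] = 0
```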
3 k-Nearest Neighbor [10 pts]
One of the problems with k-nearest neighbor learning is selecting a value for k. Say you are given the following data set. This is a binary classification task in which the instances are described by two real-valued attributes. The labels or classes of each instance are denoted as either an asterisk or a circle.
k=1 minimizes the training set error for this data set, and the resulting training error is 0: with k=1, each training point is its own nearest neighbor, so every point is classified with its own label. This training error is not a reasonable estimate of the test error, because an unseen point is not in the training set and cannot be its own neighbor; the perfect score comes from memorizing the training points rather than from generalization.
k=5 or k=7 minimizes LOOCV error.
With k=5 and leave-one-out, the error rate is 4/14 (it misclassifies the 2 asterisks at (2,7) and (3,8) and the 2 circles at (7,2) and (8,3), while classifying the 5 grouped circles and 5 grouped asterisks correctly).
With k=7, the error rate is also 4/14: it misclassifies the same four points and still classifies the 5 grouped circles and 5 grouped asterisks correctly, since with leave-one-out each grouped circle's 7 nearest neighbors include 4 circles and 3 asterisks, and likewise for the grouped asterisks.
Cross-validation is a better measure of test set performance because performance can vary from split to split; averaging over the splits gives a more reliable estimate of how well the model generalizes to unseen data.
The highest k for this data set is 13 (every remaining point in the training set). With 7 points of each class in total, leaving one point out leaves 7 of the opposite class and only 6 of the held-out point's class, so the k=13 majority vote always picks the wrong class and every point is misclassified. Using too large a k leads to underfitting.
The lowest k for this data set is 1, with a leave-one-out error rate of 10/14; only the asterisks at (5,1) and (9,5) and the circles at (1,5) and (5,9) are classified correctly. Both extremes give much higher error than k=5 or k=7: the largest k misclassifies every point (underfitting), and a very small k overfits.
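The leave-one-out numbers above can be reproduced along the following lines; the coordinates and labels below are placeholders, not the actual 14 points from the assignment's figure.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# placeholder data: 1 = asterisk, 0 = circle
X = np.array([[2, 7], [3, 8], [5, 9], [1, 5], [7, 2], [8, 3], [9, 5], [5, 1]], dtype=float)
y = np.array([1, 1, 0, 0, 0, 0, 1, 1])

for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut()).mean()
    print(k, 1 - acc)  # leave-one-out error rate for each k
```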
4 Programming [50 pts]
workclass: The majority of people are Self-emp-not-inc. Notably, almost everyone who works in the private sector makes less than 50k, while people who work in federal or local government are more likely to make more than 50k than less.
education: Most people with education level Bachelors or 11th grade make less than 50k. The ratio of people who make less than 50k to those who make more is much closer to 1 for HS-grad, Prof-school, Assoc-voc, and 7th-8th.
marital status: Divorced people are the most likely to make more than 50k; for all other marital statuses, one is much more likely to make less than 50k.
occupation: Sales and Exec-managerial are nearly as likely to make more than 50k as less; for all other occupations, one is much more likely to make less than 50k.
relationship: Husbands are nearly as likely to make more than 50k as less; for all other relationship statuses, one is more likely to make less than 50k.
race: White and Black people are much more likely to make less than 50k; the other races (Asian and Other) are closer to equally likely to make more or less than 50k.
native country: Almost everyone in this data set has the US as their native country, and they are more likely to make less than 50k.
age: As one gets older, one is less likely to make less than 50k; people in their 30s, 40s, and 50s are the most likely to make more than 50k.
final weight: Most entries have a final weight below 300k (i.e., the census estimates that each row represents fewer than about 300,000 people); the entries most likely to make more than 50k are those representing roughly 200k people.
education number: As the education number increases, one is more likely to make more than 50k.
capital gain: As the capital gain increases, one is more likely to make more than 50k.
capital loss: As the capital loss increases, one is more likely to make more than 50k.
hours per week: People that work less than 40 hours a week are very unlikely to make more than 50k. People that work 40-60 hours are more likely to make more than 50k than people who work other hours.
sex: Males are unlikely to make more than 50k, while females are somewhat more likely to make more than 50k.
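These observations came from plotting each feature against the income label. A rough sketch of how such plots could be generated with pandas and matplotlib is below; the file name and the 'income' column name are assumptions, not the assignment's actual loading code.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("adult.csv")  # hypothetical file name
for col in df.columns.drop("income"):
    if pd.api.types.is_numeric_dtype(df[col]):
        # numeric feature: overlay one histogram per income class
        for label, group in df.groupby("income"):
            plt.hist(group[col], bins=20, alpha=0.5, label=str(label))
        plt.legend()
    else:
        # categorical feature: bar chart of counts per income class
        pd.crosstab(df[col], df["income"]).plot(kind="bar", ax=plt.gca())
    plt.title(col)
    plt.show()
```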
A small k gives a high validation error because the model overfits and does not generalize to unseen data. A very large k is also not the best choice: it leads to underfitting, with a decision boundary that is too smooth. The best k is 15, with a validation error of 0.237.
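A sketch of the k-selection sweep described above. The placeholder data stands in for the assignment's preprocessed census features, and the 10-fold cross-validation here is an assumption about the evaluation setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # placeholder data
ks = range(1, 50, 2)
errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean() for k in ks]
print(ks[int(np.argmin(errors))], min(errors))  # on the real data this gave k = 15, error 0.237
```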
The best maximum depth is 5; depths greater than 5 lead to overfitting. We can see the overfitting because, as the max depth grows past 5, the training error keeps decreasing while the test error increases and then stagnates at around 0.21.
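The analogous sweep over the decision tree's maximum depth, again on placeholder data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # placeholder data
depths = range(1, 21)
errors = [1 - cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0), X, y, cv=10).mean()
          for d in depths]
print(depths[int(np.argmin(errors))])  # on the real data, a max depth of 5 was best
```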
The amount of training data does not affect the KNN model as much as it affects the decision tree. For both KNN and the decision tree, the training error increases as the amount of training data increases until over half of the training data is used, then stagnates at around 0.25 for KNN and 0.15 for the decision tree.
For the decision tree, the testing error decreases with more training data, from above 0.25 error to below 0.2.
For KNN, the testing error stays around the same level at just below 0.25.
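These learning curves can be produced with scikit-learn's learning_curve helper; the sketch below uses placeholder data and assumed hyperparameters (k = 15, max depth 5) rather than the assignment's exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)  # placeholder data
for model in (KNeighborsClassifier(n_neighbors=15), DecisionTreeClassifier(max_depth=5, random_state=0)):
    sizes, train_scores, test_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5)
    print(type(model).__name__)
    print(1 - train_scores.mean(axis=1))  # training error vs. amount of training data
    print(1 - test_scores.mean(axis=1))   # held-out error vs. amount of training data
```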
• As we can see, the performance of the random classifier and the decision tree stays the same, but the performance of the KNN improves significantly. KNN is hurt by unnormalized data because features with large scales dominate the distance computation (see the sketch after this list).
• The new best k is 27
• The decision tree performance is unaffected
• The KNN training and test errors are both lower than they were before the data was normalized
• Instead of increasing and stagnating, the KNN training error stays at around the same level as it started
• Instead of stagnating at around 0.25, the KNN testing error starts at just below 0.25 and decreases to around 0.15
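A minimal sketch of the normalization step whose effect is described above, using MinMaxScaler as one possible choice of scaler (the assignment's actual preprocessing may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # placeholder data
X[:, 0] *= 1000  # exaggerate one feature's scale to mimic unnormalized census features

raw = cross_val_score(KNeighborsClassifier(n_neighbors=27), X, y, cv=10).mean()
scaled = cross_val_score(make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=27)),
                         X, y, cv=10).mean()
print(1 - raw, 1 - scaled)  # scaling typically lowers the k-NN error
```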