$29
• Splitting Heuristic for Decision Trees [14 pts]
Recall that the ID3 algorithm iteratively grows a decision tree from the root downwards. On each iteration, the algorithm replaces one leaf node with an internal node that splits the data based on one decision attribute (or feature). In particular, the ID3 algorithm chooses the split that reduces the entropy the most, but there are other choices. For example, since our goal in the end is to have the lowest error, why not instead choose the split that reduces error the most? In this problem, we will explore one reason why reducing entropy is a better criterion.
Consider the following simple setting. Let us suppose each example is described by n boolean features: X = hX1; : : : ; Xni, where Xi 2 f0; 1g, and where n 4. Furthermore, the target function to be learned is f : X ! Y , where Y = X1 _ X2 _ X3. That is, Y = 1 if X1 = 1 or X2 = 1 or X3 = 1, and Y = 0 otherwise. Suppose that your training data contains all of the 2n possible examples, each labeled by f. For example, when n = 4, the data set would be
X1 X2 X3 X4
Y
X1 X2 X3 X4
Y
0
0
0
0
0
0
0
0
1
0
1
0
0
0
1
1
0
0
1
1
0
1
0
0
1
0
1
0
1
1
1
1
0
0
1
1
1
0
1
1
0
0
1
0
1
0
0
1
1
1
1
0
1
0
1
1
0
1
1
1
0
1
1
0
1
0
1
1
1
1
1
1
1
0
1
1
1
1
1
1
(a) How many mistakes does the best 1-leaf decision tree make over the 2n training examples? (The 1-leaf decision tree does not split the data even once. Make sure you answer for the general case when n 4.)
(b) Is there a split that reduces the number of mistakes by at least one? (That is, is there a decision tree with 1 internal node with fewer mistakes than your answer to part (a)?) Why or why not? (Note that, as in lecture, you should restrict your attention to splits that consider a single attribute.)
(c) What is the entropy of the output label Y for the 1-leaf decision tree (no splits at all)?
(d) Is there a split that reduces the entropy of the output Y by a non-zero amount? If so, what is it, and what is the resulting conditional entropy of Y given this split? (Again, as in lecture, you should restrict your attention to splits that consider a single attribute.)
• Entropy and Information [6 pts]
The entropy of a Bernoulli (Boolean 0/1) random variable X with P (X = 1) = q is given by
B(q) = q log q (1 q) log(1 q):
Suppose that a set S of examples contains p positive examples and n negative examples. The
p
entropy of S is de ned as H(S) = B p+n . In this problem, you should assume that the base
2
of all logarithms is 2. That is, log(z) := log2(z) in this problem (as in the lectures concerning entropy).
(a) Show that 0 H(S) 1 and that H(S) = 1 when p = n.
(b) Based on an attribute, we split our examples into k disjoint subsets Sk, with pk positive
and nk negative examples in each. If the ratio
pk
is the same for all k, show that the
pk+nk
information gain of this attribute is 0.
• k-Nearest Neighbor [10 pts]
One of the problems with k-nearest neighbor learning is selecting a value for k. Say you are given the following data set. This is a binary classi cation task in which the instances are described by two real-valued attributes. The labels or classes of each instance are denoted as either an asterisk or a circle.
(a) What value of k minimizes training set error for this data set, and what is the resulting training set error? Why is training set error not a reasonable estimate of test set error, especially given this value of k?
(b) What value of k minimizes the leave-one-out cross-validation error for this data set, and what is the resulting error? Why is cross-validation a better measure of test set performance?
(c) What are the LOOCV errors for the lowest and highest k for this data set? Why might using too large or too small a value of k be bad?
3
• Programming exercise : Applying decision trees [24 pts]
Submission instructions
Only provide answers and plots. Do not submit code.
Introduction1
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this problem, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Starter Files
code and data
code : titanic.py
data : titanic_train.csv
documentation
Decision Tree Classi er:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Cross-Validation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Metrics:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
Download the code and data sets from the course website. For more information on the data set, see the Kaggle description: https://www.kaggle.com/c/titanic/data. (The provided data sets are modi ed versions of the data available from Kaggle.2)
• This assignment is adapted from the Kaggle Titanic competition, available at https://www.kaggle.com/c/ titanic. Some parts of the problem are copied verbatim from Kaggle.
2Passengers with missing values for any feature have been removed. Also, the categorical feature Sex has been mapped to {’female’: 0, ’male’: 1} and Embarked to {’C’: 0, ’Q’: 1, ’S’: 2}. If you are interested more in this process of data munging, Kaggle has an excellent tutorial available at https://www.kaggle.com/c/titanic/ details/getting-started-with-python-ii.
4
Note that any portions of the code that you must modify have been indicated with TODO. Do not change any code outside of these blocks.
4.1 Visualization [4 pts]
One of the rst things to do before trying any formal machine learning technique is to dive into the data. This can include looking for funny values in the data, looking for outliers, looking at the range of feature values, what features seem important, etc.
(a) Run the code (titanic.py) to make histograms for each feature, separating the examples by class (e.g. survival). This should produce seven plots, one for each feature, and each plot should have two overlapping histograms, with the color of the histogram indicating the class. For each feature, what trends do you observe in the data?
4.2 Evaluation [20 pts]
Now, let us use scikit-learn to train a DecisionTreeClassifier on the data.
Using the predictive capabilities of the scikit-learn package is very simple. In fact, it can be carried out in three simple steps: initializing the model, tting it to the training data, and predicting new values.3
(b) Before trying out any classi er, it is often useful to establish a baseline. We have implemented one simple baseline classi er, MajorityVoteClassifier, that always predicts the majority class from the training set. Read through the MajorityVoteClassifier and its usage and make sure you understand how it works.
Your goal is to implement and evaluate another baseline classi er, RandomClassifier, that predicts a target class according to the distribution of classes in the training data set. For example, if 60% of the examples in the training set have Survived = 0 and 40% have Survived = 1, then, when applied to a test set, RandomClassifier should randomly predict 60% of the examples as Survived = 0 and 40% as Survived = 1.
Implement the missing portions of RandomClassifier according to the provided speci ca-tions. Then train your RandomClassifier on the entire training data set, and evaluate its training error. If you implemented everything correctly, you should have an error of 0:485.
(c) Now that we have a baseline, train and evaluate a DecisionTreeClassifier (using the class from scikit-learn and referring to the documentation as needed). Make sure you initialize your classi er with the appropriate parameters; in particular, use the ‘entropy’ criterion discussed in class. What is the training error of this classi er?
(d) So far, we have looked only at training error, but as we learned in class, training error is a poor metric for evaluating classi ers. Let us use cross-validation instead.
• Note that almost all of the model techniques in scikit-learn share a few common named functions, once they are initialized. You can always nd out more about them in the documentation for each model. These are
some-model-name.fit(...), some-model-name.predict(...), and some-model-name.score(...).
5
Implement the missing portions of error(...) according to the provided speci cations. You may nd it helpful to use train_test_split(...) from scikit-learn. To ensure that we always get the same splits across di erent runs (and thus can compare the classi er results), set the random_state parameter to be the trial number.
Next, use your error(...) function to evaluate the training error and (cross-validation) test error of each of your three models. To do this, generate a random 80=20 split of the training data, train each model on the 80% fraction, evaluate the error on either the 80% or the 20% fraction, and repeat this 100 times to get an average result. What are the average training and test error of each of your classi ers on the Titanic data set?
(e) One problem with decision trees is that they can over t to training data, yielding complex classi ers that do not generalize well to new data. Let us see whether this is the case for the Titanic data.
One way to prevent decision trees from over tting is to limit their depth. Repeat your cross-validation experiments but for increasing depth limits, speci cally, 1; 2; : : : ; 20. Then plot the average training error and test error against the depth limit. (Also plot the average test error for your baseline classi ers. As the baseline classi ers are independent of the depth limit, their plots should be at lines.) Include this plot in your writeup, making sure to label all axes and include a legend for your classi ers. What is the best depth limit to use for this data? Do you see over tting? Justify your answers using the plot.
(f) Another useful tool for evaluating classi ers is learning curves, which show how classi er performance (e.g. error) relates to experience (e.g. amount of training data).
Run another experiment using a decision tree with the best depth limit you found above. This time, vary the amount of training data by starting with splits of 0:05 (5% of the data used for training) and working up to splits of size 0:95 (95% of the data used for training) in increments of 0:05. Then plot the decision tree training and test error against the amount of training data. (Also plot the average test error for your baseline classi ers.) Include this plot in your writeup, and provide a 1-2 sentence description of your observations.
6