Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission instructions. You can ignore the words “mechanics”, “reasoning”, etc.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Training and Testing
If you run python main.py -q 1, it will load the citiesSmall.pkl data set from Assignment 1. Note that this file contains not only training data, but also test data, X_test and y_test. After training a depth-2 decision tree with the information gain splitting rule, it will evaluate the performance of the classifier on the test data. With a depth-2 decision tree, the training and test error are fairly close, so the model hasn’t overfit much.
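For orientation, here is a minimal sketch (not the assignment's provided code) of this train/test evaluation, using scikit-learn's DecisionTreeClassifier with the entropy criterion as a stand-in for the course's decision tree; the pickle key names are assumptions and should be matched to the actual file:

```python
# Hedged sketch: depth-2 decision tree, training vs. test error.
import pickle
import numpy as np
from sklearn.tree import DecisionTreeClassifier

with open("citiesSmall.pkl", "rb") as f:
    data = pickle.load(f)
X, y = data["X"], data["y"]                      # key names are assumptions
X_test, y_test = data["X_test"], data["y_test"]  # key names are assumptions

model = DecisionTreeClassifier(max_depth=2, criterion="entropy")
model.fit(X, y)
print("Training error:", np.mean(model.predict(X) != y))
print("Test error:    ", np.mean(model.predict(X_test) != y_test))
```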
1.1 Training and Testing Error Curves
Rubric: {reasoning:2}
Make a plot that contains the training error and testing error as you vary the depth from 1 through 15. How does each of these errors change with the decision tree depth?
Note: it’s OK to reuse code from Assignment 1.
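A hedged sketch of one way to produce this plot, again using scikit-learn as a stand-in and assuming X, y, X_test, y_test are loaded as in the sketch above:

```python
# Hedged sketch: training and test error versus tree depth (1 through 15).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

depths = range(1, 16)
train_err, test_err = [], []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    model.fit(X, y)
    train_err.append(np.mean(model.predict(X) != y))
    test_err.append(np.mean(model.predict(X_test) != y_test))

plt.plot(list(depths), train_err, label="training error")
plt.plot(list(depths), test_err, label="test error")
plt.xlabel("decision tree depth")
plt.ylabel("classification error")
plt.legend()
plt.savefig("q1_1_errors.pdf")
```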
1.2 Validation Set
Rubric: {reasoning:3}
Suppose that we didn’t have an explicit test set available. In this case, we might instead use a validation set. Split the training set into two equal-sized parts: use the first n/2 examples as a training set and the second n/2 examples as a validation set (we’re assuming that the examples are already in a random order). What depth of decision tree would we pick to minimize the validation set error? Does the answer change if you switch the training and validation set? How could we use more of our data to estimate the depth more reliably?
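A hedged sketch of the split described above (variable names are assumptions; X and y are the training data):

```python
# Hedged sketch: first half for training, second half for validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n = X.shape[0]
X_train, y_train = X[: n // 2], y[: n // 2]
X_valid, y_valid = X[n // 2 :], y[n // 2 :]

best_depth, best_err = None, np.inf
for depth in range(1, 16):
    model = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
    model.fit(X_train, y_train)
    err = np.mean(model.predict(X_valid) != y_valid)
    if err < best_err:
        best_depth, best_err = depth, err
print("Depth minimizing validation error:", best_depth)
```

Swapping X_train/y_train with X_valid/y_valid repeats the experiment with the roles of the two halves reversed.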
2 Naive Bayes
In this section we’ll implement naive Bayes, a very fast classification method that is often surprisingly accurate for text data with simple representations like bag of words.
2.1 Naive Bayes by Hand
Consider the dataset below, which has 10 training examples and 3 features:
\[
X = \begin{bmatrix}
0 & 0 & 1\\
0 & 1 & 1\\
0 & 1 & 1\\
1 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 1\\
1 & 0 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 0
\end{bmatrix},
\qquad
y = \begin{bmatrix}
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{spam}\\
\text{not spam}\\
\text{not spam}\\
\text{not spam}\\
\text{not spam}
\end{bmatrix}.
\]
The feature in the first column is <your name> (whether the e-mail contained your name), in the second column is “pharmaceutical” (whether the e-mail contained this word), and the third column is “PayPal” (whether the e-mail contained this word). Suppose you believe that a naive Bayes model would be appropriate for this dataset, and you want to classify the following test example:
\[
\hat{x} = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix}.
\]
2.1.1 Prior probabilities
Rubric: {reasoning:1}
Compute the estimates of the class prior probabilities (you don’t need to show any work):
• p(spam).
• p(not spam).
2.1.2 Conditional probabilities
Rubric: {reasoning:1}
Compute the estimates of the 6 conditional probabilities required by naive Bayes for this example (you don’t need to show any work):
• p(<your name> = 1 | spam).
• p(pharmaceutical = 1 | spam).
• p(PayPal = 0 | spam).
• p(<your name> = 1 | not spam).
• p(pharmaceutical = 1 | not spam).
• p(PayPal = 0 | not spam).
2.1.3 Prediction
Rubric: {reasoning:1}
Under the naive Bayes model and your estimates of the above probabilities, what is the most likely label for the test example? (Show your work.)
2.1.4 Laplace smoothing
Rubric: {reasoning:2}
One way to think of Laplace smoothing is that you’re augmenting the training set with extra counts. Consider the estimates of the conditional probabilities in this dataset when we use Laplace smoothing (with β = 1). Give a set of extra training examples that we could add to the original training set that would make the basic estimates give us the estimates with Laplace smoothing (in other words give a set of extra training examples that, if they were included in the training set and we didn’t use Laplace smoothing, would give the same estimates of the conditional probabilities as using the original dataset with Laplace smoothing). Present your answer in a reasonably easy-to-read format, for example the same format as the data set at the start of this question.
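For reference (this formula is not stated in the original question, but follows the standard definition of Laplace smoothing for binary features), the smoothed conditional-probability estimate has the form
\[
\hat{p}(x_{ij} = 1 \mid y_i = c) \;=\; \frac{n_{c,j} + \beta}{n_c + 2\beta},
\]
where $n_{c,j}$ is the number of training examples of class $c$ with feature $j$ equal to 1, and $n_c$ is the total number of training examples of class $c$. The extra examples you give should produce exactly these values when the un-smoothed estimates are computed.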
2.2 Bag of Words
Rubric: {reasoning:3}
If you run python main.py -q 2.2, it will load the following dataset:
1. X: A binary matrix. Each row corresponds to a newsgroup post, and each column corresponds to whether a particular word was used in the post. A value of 1 means that the word occurred in the post.
2. wordlist: The set of words that correspond to each column.
3. y: A vector with values 0 through 3, with the value corresponding to the newsgroup that the post came from.
4. groupnames: The names of the four newsgroups.
5. Xvalidate and yvalidate: the word lists and newsgroup labels for additional newsgroup posts.
Answer the following (a short indexing sketch is given after the questions):
1. Which word corresponds to column 51 of X? (This is column 50 in Python.)
2. Which words are present in training example 501?
3. Which newsgroup name does training example 501 come from?
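For reference, a hedged sketch of how these objects could be inspected with NumPy indexing (the variable names X, wordlist, y, groupnames follow the description above; the exact objects depend on main.py):

```python
# Hedged sketch: answering the indexing questions with 0-based NumPy indexing.
import numpy as np

print(wordlist[50])                    # word for column 51 (index 50 in Python)
row = np.asarray(X[500]).ravel()       # training example 501 is row 500
present = np.nonzero(row)[0]
print([wordlist[j] for j in present])  # words present in that example
print(groupnames[y[500]])              # newsgroup the example came from
```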
2.3 Naive Bayes Implementation
Rubric: {code:5}
If you run python main.py -q 2.3 it will load the newsgroups dataset, fit a basic naive Bayes model and report the validation error.
The predict() function of the naive Bayes classifier is already implemented. However, in fit() the calculation of the variable p_xy is incorrect (right now, it just sets all values to 1/2). Modify this function so that p_xy correctly computes the conditional probabilities of these values based on the frequencies in the data set. Include your code and the validation error that you obtain in your PDF GradeScope submission. Also, compare your validation error to what you obtain with scikit-learn’s implementation, BernoulliNB.
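A hedged sketch of the scikit-learn comparison mentioned above (X, y, Xvalidate, yvalidate are assumed to be the arrays described in Question 2.2):

```python
# Hedged sketch: validation error of scikit-learn's BernoulliNB.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(X, y)
y_pred = model.predict(Xvalidate)
print("BernoulliNB validation error:", np.mean(y_pred != yvalidate))
```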
2.4 Runtime of Naive Bayes for Discrete Data
Rubric: {reasoning:3}
For a given training example i, the predict function in the provided code computes the quantity
\[
p(y_i \mid x_i) \;\propto\; p(y_i) \prod_{j=1}^{d} p(x_{ij} \mid y_i),
\]
for each class yi (and where the proportionality constant is not relevant). For many problems, a lot of the p(xij | yi) values may be very small. This can cause the above product to underflow. The standard fix for this is to compute the logarithm of this quantity and use that log(ab) = log(a) + log(b),
\[
\log p(y_i \mid x_i) = \log p(y_i) + \sum_{j=1}^{d} \log p(x_{ij} \mid y_i) + (\text{irrelevant proportionality constant}).
\]
This turns the multiplications into additions and thus typically would not underflow.
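A hedged sketch of this log-domain computation for binary features (the array layout of p_y and p_xy is an assumption and may not match the provided code):

```python
# Hedged sketch: naive Bayes prediction using sums of logs instead of products.
import numpy as np

def predict_one(x, p_y, p_xy):
    """x: binary feature vector; p_y[c] = p(y=c); p_xy[j, c] = p(x_j=1 | y=c)."""
    log_post = np.log(p_y)
    for c in range(len(p_y)):
        for j, xj in enumerate(x):
            p = p_xy[j, c] if xj == 1 else 1.0 - p_xy[j, c]
            log_post[c] += np.log(p)
    return int(np.argmax(log_post))   # the proportionality constant is irrelevant
```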
Assume you have the following setup:
• The training set has n objects each with d features.
• The test set has t objects with d features.
• Each feature can have up to c discrete values (you can assume c ≤ n).
• There are k class labels (you can assume k ≤ n).
You can implement the training phase of a naive Bayes classifier in this setup in O(nd), since you only need to do a constant amount of work for each X(i, j) value. (You do not have to actually implement it in this way for the previous question, but you should think about how this could be done.) What is the cost of classifying t test examples with the model and this way of computing the predictions?
3 K-Nearest Neighbours
Rubric: {code:3, reasoning:4}
In the citiesSmall dataset, nearby points tend to receive the same class label because they are part of the same U.S. state. For this problem, perhaps a k-nearest neighbours classifier might be a better choice than a decision tree. The file knn.py has implemented the training function for a k-nearest neighbour classifier (which is to just memorize the data).
Fill in the predict function in knn.py so that the model file implements the k-nearest neighbour prediction rule. You should use Euclidean distance, and may find numpy’s sort and/or argsort functions useful. You can also use utils.euclidean_dist_squared, which computes the squared Euclidean distances between all pairs of points in two matrices.
1. Write the predict function. Include this code in your GradeScope submission.
2. Report the training and test error obtained on the citiesSmall dataset for k = 1, k = 3, and k = 10. How do these numbers compare to what you got with the decision tree?
3. Include the plot generated by utils.plotClassifier on the citiesSmall dataset for k = 1, using both your implementation of KNN and the KNeighborsClassifier from scikit-learn (a brief usage sketch follows this list).
4. Why is the training error 0 for k = 1?
5. If you didn’t have an explicit test set, how would you choose k?
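For item 3, a hedged sketch of the scikit-learn side of the comparison (the call signature of utils.plotClassifier and the data variable names are assumptions):

```python
# Hedged sketch: k=1 nearest neighbours with scikit-learn on citiesSmall.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import utils

sk_model = KNeighborsClassifier(n_neighbors=1)
sk_model.fit(X, y)
print("Training error:", np.mean(sk_model.predict(X) != y))
print("Test error:    ", np.mean(sk_model.predict(X_test) != y_test))
utils.plotClassifier(sk_model, X, y)   # plotting-helper signature is assumed
```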
4 Random Forests
4.1 Implementation
Rubric: {code:4,reasoning:3}
The file vowels.pkl contains a supervised learning dataset where we are trying to predict which of the 11 “steady-state” English vowels a speaker is trying to pronounce.
You are provided with a RandomStump class that differs from DecisionStumpInfoGain in that it only considers ⌊√d⌋ randomly-chosen features.¹ You are also provided with a RandomTree class that is exactly the same as DecisionTree except that it uses RandomStump instead of DecisionStump and it takes a bootstrap sample of the data before fitting. In other words, RandomTree is the entity we discussed in class, which makes up a random forest.
If you run python main.py -q 4 it will fit a deep DecisionTree using the information gain splitting criterion. You will notice that the model overfits badly.
1. Why doesn’t the random tree model have a training error of 0?
2. Create a class RandomForest in a file called random_forest.py that takes in hyperparameters num_trees and max_depth and fits num_trees random trees each with maximum depth max_depth. For prediction, have all trees predict and then take the mode. Make sure to include the code you have written in your PDF GradeScope submission.
3. Using 50 trees, and a max depth of ∞, report the training and testing error. Compare this to what we got with a single DecisionTree and with a single RandomTree. Are the results what you expected? Discuss.
4. Compare your implementation with scikit-learn’s RandomForestClassifier for both speed and accuracy, and briefly discuss (a timing sketch follows this list). You can use all default hyperparameters if you wish, or you can try changing them.
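A hedged sketch of the item-4 comparison (the vowel-data variable names are assumptions):

```python
# Hedged sketch: speed and accuracy of scikit-learn's RandomForestClassifier.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

start = time.time()
model = RandomForestClassifier(n_estimators=50)   # otherwise default hyperparameters
model.fit(X, y)
print("Fit time (seconds):", time.time() - start)
print("Training error:", np.mean(model.predict(X) != y))
print("Test error:    ", np.mean(model.predict(X_test) != y_test))
```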
5 Clustering
If you run python main.py -q 5, it will load a dataset with two features and a very obvious clustering structure. It will then apply the k-means algorithm with a random initialization. The result of applying the algorithm will thus depend on the randomization, but a typical run might look like this:
¹ The notation ⌊x⌋ means the “floor” of x, or “x rounded down”. You can compute this with np.floor(x) or math.floor(x).
(Note that the colours are arbitrary – this is the label switching issue.) But the ‘correct’ clustering (that was used to make the data) is this:
5.1 Selecting among k-means Initializations
Rubric: {reasoning:5}
If you run the demo several times, it will find different clusterings. To select among clusterings for a fixed value of k, one strategy is to minimize the sum of squared distances between examples $x_i$ and their means $w_{y_i}$,
\[
f(w_1, w_2, \ldots, w_k, y_1, y_2, \ldots, y_n) \;=\; \sum_{i=1}^{n} \|x_i - w_{y_i}\|_2^2 \;=\; \sum_{i=1}^{n}\sum_{j=1}^{d} (x_{ij} - w_{y_i j})^2,
\]
where yi is the index of the closest mean to xi. This is a natural criterion because the steps of k-means alternately optimize this objective function in terms of the wc and the yi values.
1. In the kmeans.py file, add a new function called error that takes the same input as the predict function but that returns the value of the above objective function. Make sure to include the code you have written in your PDF GradeScope submission.
2. What trend do you observe if you print the value of this error after each iteration of the k-means algorithm?
3. Using the code from question 5 in main.py (modify if needed), output the clustering obtained by running k-means 50 times (with k = 4) and taking the one with the lowest error. Submit your plot.
4. Looking at the hyperparameters of scikit-learn’s KMeans, explain the first four (n_clusters, init, n_init, max_iter) very briefly (a usage sketch follows this list).
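A hedged usage sketch showing where the four hyperparameters from item 4 appear (the values are only examples, and X is assumed to be the clustering data loaded by main.py):

```python
# Hedged sketch: scikit-learn's KMeans with its first four hyperparameters.
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=4,     # k, the number of clusters
    init="random",    # how the initial means are chosen (default is "k-means++")
    n_init=50,        # number of restarts; the lowest-error run is kept
    max_iter=300,     # iteration cap for each single run
)
model.fit(X)
labels = model.labels_
```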
5.2 Selecting k in k-means
Rubric: {reasoning:5}
We now turn to the task of choosing the number of clusters k.
1. Explain why we should not choose k by taking the value that minimizes the error function.
2. Explain why even evaluating the error function on test data still wouldn’t be a suitable approach to choosing k.
3. Hand in a plot of the minimum error found across 50 random initializations, as a function of k, taking k from 1 to 10 (a plotting sketch follows this list).
4. The elbow method for choosing k consists of looking at the above plot and visually trying to choose the k that makes the sharpest “elbow” (the biggest change in slope). What values of k might be reasonable according to this method? Note: there is not a single correct answer here; it is somewhat open to interpretation and there is a range of reasonable answers.
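A hedged sketch of the item-3 plot using scikit-learn’s KMeans as a stand-in for the course’s k-means code (inertia_ is scikit-learn’s name for the sum of squared distances; init="random" matches the random initialization used in the assignment; X is an assumption):

```python
# Hedged sketch: minimum k-means error over 50 random restarts, for k = 1..10.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
min_errors = []
for k in ks:
    model = KMeans(n_clusters=k, init="random", n_init=50)
    model.fit(X)
    min_errors.append(model.inertia_)   # best (lowest) error over the 50 restarts

plt.plot(list(ks), min_errors, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("minimum error over 50 initializations")
plt.savefig("q5_2_elbow.pdf")
```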
5.3 Density-Based Clustering
Rubric: {reasoning:2}
If you run python main.py -q 5.3, it will apply the basic density-based clustering algorithm to the dataset from the previous part, but with some outliers added. The final output should look somewhat like this:
(The right plot is zoomed in to show the non-outlier part of the data.) Even though we know that each object was generated from one of four clusters (and we have 4 outliers), the algorithm finds 6 clusters and does not assign some of the original non-outlier objects to any cluster. However, the clusters will change if we change the parameters of the algorithm. Find and report values for the two parameters, eps (which we called the “radius” in class) and minPts, such that the density-based clustering method finds:
1. The 4 “true” clusters.
2. 3 clusters (merging the top two, which also seems like a reasonable interpretation).
3. 2 clusters.
4. 1 cluster (consisting of the non-outlier points).
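A hedged sketch using scikit-learn’s DBSCAN, whose eps and min_samples parameters play the roles of the “radius” and minPts described above (the values shown are placeholders, not the answers to items 1–4; X is assumed to be the data loaded by main.py):

```python
# Hedged sketch: density-based clustering with scikit-learn's DBSCAN.
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=1.0, min_samples=3)   # placeholder parameter values
labels = model.fit_predict(X)            # label -1 marks outliers/noise
print("Clusters found:", len(set(labels) - {-1}))
```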
6 Very-Short Answer Questions
Rubric: {reasoning:13}
Write a short one or two sentence answer to each of the questions below. Make sure your answer is clear and concise.
1. What is an advantage of using a boxplot to visualize data rather than just computing its mean and variance?
2. What is a reason that the data may not be IID in the email spam filtering example from lecture?
3. What is the difference between a validation set and a test set?
4. Why can’t we (typically) use the training error to select a hyper-parameter?
5. What is the effect of n on the optimization bias (assuming we use a parametric model)?
6. What is an advantage and a disadvantage of using a large k value in k-fold cross-validation?
7. Why can we ignore p(xi) when we use naive Bayes?
8. For each of the three values below in a naive Bayes model, say whether it’s a parameter or a hyper-parameter:
(a) Our estimate of p(yi) for some yi.
(b) Our estimate of p(xij | yi) for some xij and yi.
(c) The value β in Laplace smoothing.
9. What is the effect of k in KNN on the two parts (training error and approximation error) of the fundamental trade-off? Hint: think about the extreme values.
10. Suppose we want to classify whether segments of raw audio represent words or not. What is an easy way to make our classifier invariant to small translations of the raw audio?
11. Both supervised learning and clustering models take in an input xi and produce a label yi. What is the key difference?
12. Suppose you chose k in k-means clustering (using the squared distances to examples) from a validation set instead of a training set. Would this work better than using the training set (which just chooses the largest value of k)?
13. In k-means clustering the clusters are guaranteed to be convex regions. Are the areas that are given the same label by KNN also convex?