Different marking schemes will be used for undergrad (SEng 474) and grad (CSc 578D) students. Undergrad students do not have to answer the grad questions.
All code questions use Python and the scikit-learn library. You may install it, along with the NumPy and SciPy libraries, on your own computer. Alternatively, you can work in the lab.
http://scikit-learn.org/stable/install.html
http://www.numpy.org/
1: Decision Trees (SEng 474: 20 points; CSc 578D: 15 points)
a) (SEng 474: 15 points; CSc 578D: 10 points)
By hand, construct the root and the first level of a decision tree for the contact-lenses data (attached to this assignment on Connex) using the ID3 algorithm. Show the details of your construction and all your calculations; no points will be given for a solution without supporting work.
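If you want to check your hand calculations, here is a minimal, plain-Python sketch of the entropy and information-gain formulas that ID3 uses (the function names are illustrative, not part of any library):

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy: H(S) = -sum over classes of p * log2(p).
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    # Information gain of splitting the dataset on attribute attr_index:
    # gain = H(S) - sum over values v of |S_v|/|S| * H(S_v).
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder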
b) (SEng 474: 5 points; CSc 578D: 5 points)
Using the tree.DecisionTreeClassifier class from Python's scikit-learn, fit a tree to the contact-lenses data with criterion='entropy'.
Compare the entropy values obtained in part a) with the ones calculated by the sklearn.tree module. Explain in detail why the trees are not the same. You may find the documentation for decision trees helpful:
http://scikit-learn.org/stable/modules/tree.html
Note: You can import the data directly from the 'contact-lenses.arff' file with the Arff2Skl() converter from util2.py (provided with this assignment), using these lines of code:
from util2 import Arff2Skl
cvt = Arff2Skl('contact-lenses.arff')
label = cvt.meta.names()[-1]   # the class attribute is the last one listed
X, y = cvt.transform(label)    # feature matrix X and class vector y
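For part b), the fitting step itself is short. Here is a minimal sketch continuing from the loading code above (the per-node entropy values live in the fitted tree's impurity array; the exact node order depends on how Arff2Skl encodes the nominal attributes):

from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)
# Each entry is the entropy at one node of the learned tree;
# compare these against your hand calculations from part a).
print(clf.tree_.impurity)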
2: Classifier Accuracy (SEng 474 and CSc 578D: 10 points)
Assume you were given a dataset built from random data, where attribute values have been randomly generated without regard to the class labels. The dataset has three classes: “red”, “blue”, and “yellow”. You were asked to build a classifier for this dataset, and told that 50% of the data will be used for training and 50% for testing. The testing set is balanced, so you can assume it has the same distribution as the training set. Because you are smart, you will start by establishing a theoretical baseline for your classifier’s performance.
a) (2 points)
Assume the data is split equally between the three classes (33.3% “red”, 33.3% “blue”, and 33.3% “yellow”) and your classifier systematically predicts “red” for every test instance. What is the expected error rate of your classifier? (Show your work)
b) (3 points)
What if, instead of always predicting “red”, the classifier predicted “red” with a probability of 0.7 and “blue” with a probability of 0.3? What is the expected error rate of the classifier in this case? (Show your work)
c) (2 points)
Now let’s assume that the data is not split equally, but instead has half (1/2) of its instances labeled “red”, one-fourth (1/4) labeled “blue”, and one-fourth (1/4) labeled “yellow”. What is the expected error rate of the classifier if, as in question a), the prediction is “red” for every test instance?
d) (3 points)
With this dataset (half (1/2) labeled “red”, one-fourth (1/4) labeled “blue”, and one-fourth (1/4) labeled “yellow”), what is the expected error rate of the classifier if, as in question b), it predicts “red” with a probability of 0.7 and “blue” with a probability of 0.3? (Show your work)
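To sanity-check your hand calculations, here is a minimal Monte Carlo sketch (NumPy assumed; it is shown with the part d) probabilities, which you can swap out for the other parts):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Labels are independent of the (random) attributes, so we can sample them directly.
y_true = rng.choice(['red', 'blue', 'yellow'], size=n, p=[0.5, 0.25, 0.25])
# The classifier's predictions, also independent of the true labels.
y_pred = rng.choice(['red', 'blue', 'yellow'], size=n, p=[0.7, 0.3, 0.0])
print('empirical error rate:', np.mean(y_true != y_pred))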
3: MLE and MAP Estimates (SEng 474: 10 points; CSc 578D: 15 points)
a) Let θ = P(X=T). Calculate the MLE for θ for the following dataset by finding the maximum of P(D | θ). Show your work. [4 marks]
D={T, T, T, T, T, T, T, F, F, F}
b) Recall that the PDF of a Beta random variable is proportional to
θ^(β1−1) · (1−θ)^(β2−1)
with parameters β1 and β2. Let’s say you have evidence from previous studies that P(X=T) = 1/2. Let β1 = 4. Find β2, then calculate the MAP estimate for θ (the maximizer of P(θ | D)) with a Beta(β1, β2) prior and the dataset above. Show your work. [6 marks]
c) (578D students only) In class we used the mode to find β1 and β2. Repeat part b) using the mean of the Beta distribution instead. [5 marks]
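To check your derivations numerically, here is a minimal grid-search sketch over θ (the β values below are placeholders, not the answers; substitute β1 = 4 and the β2 you derive):

import numpy as np

heads, tails = 7, 3                     # counts of T and F in D
b1, b2 = 2.0, 2.0                       # placeholder Beta parameters; replace with yours
theta = np.linspace(0.001, 0.999, 999)

# Log-likelihood of the Bernoulli dataset: heads*log(theta) + tails*log(1-theta).
log_lik = heads * np.log(theta) + tails * np.log(1 - theta)
print('MLE ~', theta[np.argmax(log_lik)])

# Log-posterior (up to a constant): log-likelihood + log of the Beta(b1, b2) prior.
log_post = log_lik + (b1 - 1) * np.log(theta) + (b2 - 1) * np.log(1 - theta)
print('MAP ~', theta[np.argmax(log_post)])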
4: Gradient Descent (SEng 474: 20 points; CSc 578D: 20 points)
We have a new dataset whose input has two continuous variables (x1 and x2), and the task is to predict a continuous output y (i.e., regression). We have reason to believe the following new regression model is a better fit to the problem domain:
ŷi = w0 + w1·xi,1 + w2·(xi,2)^4
a) Write down the error function for this model. Use the sum of squared errors, as in class: [5 marks]
E(X) = ∑_{i=1}^{N} (yi − ŷi)²
b) Derive the gradient descent update for w0, using learning rate κ. [5 marks]
c) Derive the gradient descent update for w1, using learning rate κ. [5 marks]
d) Derive the gradient descent update for w2, using learning rate κ. [5 marks]
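Here is a minimal NumPy sketch of batch gradient descent for this model, useful for checking your derived updates on synthetic data (the synthetic coefficients, learning rate, and iteration count are illustrative; the factor of 2 comes from differentiating E(X) above and can be absorbed into κ):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data generated from known weights, for illustration only.
x1 = rng.uniform(-1, 1, size=100)
x2 = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x1 - 0.5 * x2**4 + rng.normal(scale=0.1, size=100)

w0, w1, w2 = 0.0, 0.0, 0.0
kappa = 1e-3                     # learning rate κ
for _ in range(5000):
    y_hat = w0 + w1 * x1 + w2 * x2**4
    r = y - y_hat                # residuals (yi − ŷi)
    # From E(X): dE/dw0 = -2·∑r, dE/dw1 = -2·∑r·xi,1, dE/dw2 = -2·∑r·(xi,2)^4.
    w0 += kappa * 2 * np.sum(r)
    w1 += kappa * 2 * np.sum(r * x1)
    w2 += kappa * 2 * np.sum(r * x2**4)

print(w0, w1, w2)                # should approach (1.0, 2.0, -0.5)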