Assignment One Solution

Provide credit to any sources other than the course staff that helped you solve the problems. This includes the names of all students you talked to regarding the problems.

You can look up definitions/basics online (e.g., Wikipedia, Stack Exchange, etc.). The questions marked with an asterisk can be slightly involved.

Problem 1. (10 points). Design the decision tree for the tennis data using the Gini impurity measure. Compute the Gini measure for all attributes at each node, and continue until all the examples are correctly labeled by the tree.
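The tennis dataset itself is not reproduced in this handout, but as a rough computational companion, here is a minimal Python sketch of the two per-node quantities involved (the function names and the dict-based row format are my own choices, not part of the assignment):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, labels, attribute):
    """Weighted Gini impurity after splitting on `attribute`.

    `rows` is a list of dicts mapping attribute names to values (an
    assumed format for this sketch); `labels` is the parallel list of
    class labels.
    """
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(y)
    return sum(len(g) / n * gini(g) for g in groups.values())
```

At each node one would evaluate `gini_of_split` for every remaining attribute, split on the attribute with the smallest value, and recurse until every leaf is pure.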


Problem 2. (20 points). Consider the training set given in Table 1. The attribute "Shirt Size Fine" is a refinement of the attribute "Shirt Size", wherein the value "Medium" has been further categorized into two values "Small-Medium" and "Large-Medium".

    1. What is the entropy of the labels?

    2. Compute the information gain for "Shirt Size" and "Shirt Size Fine". Which is higher? (A computational sketch follows Table 1 below.)

    3. A function $f$ is called concave if for all $x, y$ and any $0 \le \lambda \le 1$,

$f(\lambda x + (1 - \lambda) y) \ge \lambda f(x) + (1 - \lambda) f(y).$    (1)

The logarithm function is concave. You can assume that as a fact for all other parts of this assignment. For this part, you have to show that Equation (1) holds for $\lambda = 1/2$ and $f$ the logarithm function.
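One possible route for this part (a sketch, not necessarily the intended argument): for positive $x, y$, the $\lambda = 1/2$ case asserts

$\log\frac{x+y}{2} \ge \frac{1}{2}\log x + \frac{1}{2}\log y = \log\sqrt{xy},$

and since $\log$ is increasing, this is equivalent to the AM-GM inequality $\frac{x+y}{2} \ge \sqrt{xy}$, i.e. $(\sqrt{x} - \sqrt{y})^2 \ge 0$.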

    4. The following inequality is called the log-sum inequality. For positive $x_1, x_2, y_1, y_2$,

$x_1 \log\frac{x_1}{y_1} + x_2 \log\frac{x_2}{y_2} \ge (x_1 + x_2) \log\frac{x_1 + x_2}{y_1 + y_2}.$    (2)


Prove this using the concavity of logarithms.
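One possible argument (a sketch; other routes exist): set $s = x_1 + x_2$ and $\lambda = x_1/s$. Applying Equation (1) with $f = \log$ to the points $y_1/x_1$ and $y_2/x_2$ gives

$\lambda \log\frac{y_1}{x_1} + (1 - \lambda) \log\frac{y_2}{x_2} \le \log\left(\lambda \frac{y_1}{x_1} + (1 - \lambda) \frac{y_2}{x_2}\right) = \log\frac{y_1 + y_2}{s},$

since $\lambda \cdot y_1/x_1 = y_1/s$ and $(1 - \lambda) \cdot y_2/x_2 = y_2/s$. Negating both sides and multiplying by $s$ yields exactly Equation (2).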





Gender  Car Type  Shirt Size   Shirt Size Fine  Label
------  --------  -----------  ---------------  -----
M       Family    Small        Small              +
M       Sports    Medium       Small-Medium       +
M       Family    Large        Large              -
F       Sports    Small        Small              +
M       Sports    Extra Large  Extra Large        +
F       Luxury    Small        Small              -
M       Family    Medium       Large-Medium       -
M       Sports    Extra Large  Extra Large        +
M       Sports    Large        Large              +
F       Luxury    Medium       Large-Medium       -
F       Sports    Medium       Large-Medium       +
F       Family    Small        Small              +
F       Luxury    Large        Large              -
M       Luxury    Medium       Small-Medium       -
F       Family    Medium       Small-Medium       +

Table 1: Training Data
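As a cross-check for parts 1 and 2, here is a minimal Python sketch that computes the label entropy and both information gains directly from Table 1 (the helper names and the S/M/L/XL and SM/LM abbreviations are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain from splitting `labels` by the parallel `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

# Table 1 transcribed column by column; S/M/L/XL and SM/LM abbreviate
# the shirt sizes (abbreviations mine).
shirt      = ["S", "M", "L", "S", "XL", "S", "M", "XL", "L", "M", "M", "S", "L", "M", "M"]
shirt_fine = ["S", "SM", "L", "S", "XL", "S", "LM", "XL", "L", "LM", "LM", "S", "L", "SM", "SM"]
label      = ["+", "+", "-", "+", "+", "-", "-", "+", "+", "-", "+", "+", "-", "-", "+"]

print(entropy(label))                # part 1: entropy of the labels
print(info_gain(shirt, label))       # part 2: IG for "Shirt Size"
print(info_gain(shirt_fine, label))  # part 2: IG for "Shirt Size Fine"
```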


    5. We will show that part 2 of this problem can be generalized as follows. Consider a training set of any size with the four features as in Table 1, again with the property that "Shirt Size Fine" is a refinement of the attribute "Shirt Size". Show that the information gain for "Shirt Size Fine" is always at least that for "Shirt Size". This is a heuristic justification for the fact that information gain tends to favor attributes with more possible values.

(Hint: let $n_m$ be the number of Medium examples, and let $n_{ms}$ and $n_{ml}$ be the numbers of Small-Medium and Large-Medium examples respectively; then $n_{ms} + n_{ml} = n_m$. You may also want to define quantities such as $n_m^+$, the number of Medium examples with positive labels.) You may have to use Equation (2) carefully!

Problem 3. (10 points). Consider the training set given in Table 2. There are nine examples, each with three features. Feature 1 and Feature 2 are binary, and Feature 3 is continuous.

    1. For Feature 1 and Feature 2, compute the information gain with respect to the examples.

    2. To use a feature that takes continuous values, we fix a threshold and categorize the continuous feature depending on whether it is greater than the threshold or not. For example, if the threshold is fixed at 4.5, we convert Feature 3 into the categorical feature $\{S, G, G, S, G, S, G, G, G\}$, where $S$ and $G$ indicate that the value is smaller than and greater than the threshold, respectively.

For Feature 3, compute the information gain with respect to the threshold values 2.5, 3.5, 4.5, 5.5, 6.5, and 7.5. Which threshold has the highest information gain? (A computational sketch follows Table 2 below.)

    3. Which feature will be chosen as the root node, if we use the threshold value with the highest information gain for the third feature?

    4. Construct any decision tree that gives correct answers for all the training examples.

Feature 1  Feature 2  Feature 3  Label
---------  ---------  ---------  -----
T          T          1.0          +
T          T          6.0          +
T          F          5.0          -
F          F          4.0          +
F          T          7.0          -
F          T          3.0          -
F          F          8.0          -
T          F          7.0          +
F          T          5.0          -

Table 2: Training Data
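Similarly, here is a minimal Python sketch (the names are my own) that evaluates the candidate thresholds from part 2 against Table 2:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain_threshold(values, labels, t):
    """Information gain from thresholding a continuous feature at t."""
    n = len(labels)
    below = [y for x, y in zip(values, labels) if x < t]
    above = [y for x, y in zip(values, labels) if x >= t]
    return entropy(labels) - (len(below) / n * entropy(below)
                              + len(above) / n * entropy(above))

# Table 2, transcribed column by column.
feature3 = [1.0, 6.0, 5.0, 4.0, 7.0, 3.0, 8.0, 7.0, 5.0]
label    = ["+", "+", "-", "+", "-", "-", "-", "+", "-"]

for t in [2.5, 3.5, 4.5, 5.5, 6.5, 7.5]:
    print(t, info_gain_threshold(feature3, label, t))
```

Part 3 then compares the best of these gains against the gains for Feature 1 and Feature 2 from part 1.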


Problem 4. Decision Trees (30 points). See the attached Jupyter notebook for details.