Homework #2 Solution

Task 1

Describe the difference between classification and clustering.

Task 2

Describe what entropy is.
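As a minimal sketch (not part of the original handout), the Shannon entropy of a list of class labels can be computed as below. The function name `entropy` and the sample label lists are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur.
    return -sum((n / total) * log2(n / total) for n in counts.values())

balanced = entropy(["Win", "Lose"])    # two equally likely classes: 1 bit
pure = entropy(["Win", "Win", "Win"])  # a single class: 0 bits
print(balanced, pure)
```

Entropy is highest when the classes are evenly mixed and drops to zero when every instance carries the same label.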

Task 3

Describe and compare the following “feature selection measures”, also called “splitting criteria”: information gain, gain ratio, and Gini index.
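For reference, the three criteria are commonly written as follows. The notation is an assumption of this note rather than the handout's: $D$ is the set of instances at a node, $p_k$ is the fraction of $D$ with class $k$, and $D_v$ is the subset of $D$ for which attribute $A$ takes value $v$.

```latex
\begin{align*}
H(D) &= -\sum_{k} p_k \log_2 p_k \\
\mathrm{Gain}(D, A) &= H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v) \\
\mathrm{SplitInfo}(D, A) &= -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|} \\
\mathrm{GainRatio}(D, A) &= \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}(D, A)} \\
\mathrm{Gini}(D) &= 1 - \sum_{k} p_k^2 \\
\mathrm{Gini}_A(D) &= \sum_{v} \frac{|D_v|}{|D|}\, \mathrm{Gini}(D_v)
\end{align*}
```

Information gain tends to favor attributes with many distinct values. Gain ratio corrects for this by dividing by the split information. The Gini index measures impurity without logarithms, and the attribute with the lowest weighted Gini after the split is preferred.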

Task 4

Given training instances and their attributes, construct the following three decision trees by hand, and then implement the three decision trees using Python Decision Tree models:

ID3: information gain

C4.5: gain ratio

CART: Gini index

Question 1: We have

(1) 6 training instances and 6 testing instances

(2) 3 attributes: (a) 2-value attribute (Home/Away), (b) 2-value attribute (In/Out), (c) 4-value attribute (NBC/ESPN/FOX/ABC)

| ID | Date | University | Is Home/Away? | Is Opponent in AP Top 25 at Preseason? | Media | Label: Win/Lose |
|---|---|---|---|---|---|---|
| 1 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 2 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 3 | 9/16/17 | Boston College | Away | Out | 2-ESPN | Win |
| 4 | 9/23/17 | Michigan State | Away | Out | 3-FOX | Win |
| 5 | 9/30/17 | Miami Ohio | Home | Out | 1-NBC | Win |
| 6 | 10/7/17 | North Carolina | Away | Out | 4-ABC | Win |
| 7 | 10/19/17 | USC | Home | In | 1-NBC | ? |
| 8 | 10/25/17 | North Carolina State | Home | Out | 1-NBC | ? |
| 9 | 11/4/17 | Wake Forest | Home | Out | 1-NBC | ? |
| 10 | 11/12/17 | Miami Florida | Away | In | 4-ABC | ? |
| 11 | 11/18/17 | Navy | Home | Out | 1-NBC | ? |
| 12 | 11/26/17 | Stanford | Away | In | 4-ABC | ? |
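The root-level comparison for the six training instances in the table above can be sketched in plain Python. This is an illustrative sketch, not the assignment's reference solution. The names `ROWS`, `ATTRS`, and `scores` are made up here, and the Gini score uses a multiway split over all attribute values, whereas CART proper uses binary splits.

```python
from collections import Counter
from math import log2

# Training instances 1-6 from the table above: (Home/Away, In/Out, Media, label).
ROWS = [
    ("Home", "Out", "1-NBC", "Win"),
    ("Home", "In", "1-NBC", "Lose"),
    ("Away", "Out", "2-ESPN", "Win"),
    ("Away", "Out", "3-FOX", "Win"),
    ("Home", "Out", "1-NBC", "Win"),
    ("Away", "Out", "4-ABC", "Win"),
]
ATTRS = {"Home/Away": 0, "In/Out": 1, "Media": 2}

def entropy(items):
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

def gini(items):
    n = len(items)
    return 1 - sum((c / n) ** 2 for c in Counter(items).values())

def scores(col):
    """Return (information gain, gain ratio, Gini reduction) for one attribute."""
    labels = [r[-1] for r in ROWS]
    groups = {}
    for r in ROWS:
        groups.setdefault(r[col], []).append(r[-1])
    weights = [(len(g) / len(ROWS), g) for g in groups.values()]
    gain = entropy(labels) - sum(w * entropy(g) for w, g in weights)
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([r[col] for r in ROWS])
    gini_drop = gini(labels) - sum(w * gini(g) for w, g in weights)
    return gain, gain / split_info, gini_drop

for name, col in ATTRS.items():
    print(name, [round(s, 3) for s in scores(col)])
```

On these six rows every criterion favors the In/Out attribute, because it separates the labels perfectly: the only "In" game is the only loss.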

Question 2: We have

(1) 14 training instances and 1 testing instance

(2) 4 attributes: (a) 3-value attribute (Sunny/Overcast/Rainy), (b) 3-value attribute (Hot/Mild/Cool), (c) 2-value attribute (High/Normal), (d) 2-value attribute (True/False)

| ID | Date | Outlook | Temperature | Humidity | Windy | Label: Play? |
|---|---|---|---|---|---|---|
| 1 | 9/1/17 | Sunny | Hot | High | "False" | No |
| 2 | 9/8/17 | Sunny | Hot | High | "True" | No |
| 3 | 9/15/17 | Overcast | Hot | High | "False" | Yes |
| 4 | 9/22/17 | Rainy | Mild | High | "False" | Yes |
| 5 | 9/29/17 | Rainy | Cool | Normal | "False" | Yes |
| 6 | 10/1/17 | Rainy | Cool | Normal | "True" | No |
| 7 | 10/8/17 | Overcast | Cool | Normal | "True" | Yes |
| 8 | 10/15/17 | Sunny | Mild | High | "False" | No |
| 9 | 10/22/17 | Sunny | Cool | Normal | "False" | Yes |
| 10 | 10/29/17 | Rainy | Mild | Normal | "False" | Yes |
| 11 | 11/1/17 | Sunny | Mild | Normal | "True" | Yes |
| 12 | 11/8/17 | Overcast | Mild | High | "True" | Yes |
| 13 | 11/15/17 | Overcast | Hot | Normal | "False" | Yes |
| 14 | 11/22/17 | Rainy | Mild | High | "True" | No |
| 15 | 11/29/17 | Rainy | Hot | High | "False" | ? |
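For the ID3 part of this question, a recursive tree builder over the 14 training instances might look like the sketch below. It is a plain-Python illustration under stated assumptions, not a library model: the helper names (`gain`, `id3`, `predict`) are invented here, ties in a leaf fall back to the first label seen, and `predict` raises a `KeyError` for attribute values that never appeared during training.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]
# Training instances 1-14 from the table above; the last field is the label.
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    """Information gain of splitting `rows` on attribute index `i`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r[-1])
    rest = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - rest

def id3(rows, unused):
    labels = [r[-1] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure node or when no attributes remain.
    if len(set(labels)) == 1 or not unused:
        return majority
    best = max(unused, key=lambda i: gain(rows, i))
    if gain(rows, best) <= 1e-12:  # no remaining attribute is informative
        return majority
    rest = [i for i in unused if i != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, rest)
    return (ATTRS[best], branches)

def predict(tree, instance):
    # Internal nodes are (attribute name, branches) tuples; leaves are labels.
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[instance[ATTRS.index(name)]]
    return tree

tree = id3(DATA, list(range(4)))
print(tree)
print(predict(tree, ("Rainy", "Hot", "High", "False")))  # instance 15
```

On this data the sketch selects Outlook at the root, then Humidity under Sunny and Windy under Rainy. The C4.5 and CART trees would need the gain-ratio and Gini criteria swapped in for `gain`.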

Task 5

Given a university’s football game data for the last two seasons, construct three classification models to predict game results, and evaluate the performance of each model. The three classification models are ID3, C4.5, and Naïve Bayes.

Data

Each data object (also called an instance) is a game. We have three attributes: (1) “Is Home/Away?”, a 2-value attribute (“Home”, “Away”), (2) “Is Opponent in AP Top 25 at Preseason?”, a 2-value attribute (“In”, “Out”), (3) “Media”, a 5-value attribute (“1-NBC”, “2-ESPN”, “3-FOX”, “4-ABC”, “5-CBS”). The label “Win/Lose” is binary (“Win”, “Lose”).

Training set

24 games. Please use game ID 1-24 to construct classification models.

Testing set

12 games. Please use your classification models to predict labels of game ID 25-36 and evaluate the performance of the classification models.

Predictive labels

Suppose “Win” is the positive label and “Lose” is the negative label. Keep this in mind when you use Precision and Recall to evaluate the models.
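With “Win” as the positive class, the four evaluation measures can be computed along the following lines. This is a minimal sketch: the function name `evaluate`, the zero fallback for empty denominators, and the sample labels are all assumptions made for illustration.

```python
def evaluate(actual, predicted, positive="Win"):
    """Accuracy, precision, recall, and F1 with `positive` as the positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels, for illustration only (not taken from the game tables).
actual = ["Win", "Win", "Lose", "Lose"]
predicted = ["Win", "Lose", "Win", "Lose"]
print(evaluate(actual, predicted))  # (0.5, 0.5, 0.5, 0.5)
```

Swapping the positive label changes precision and recall but not accuracy, which is why the handout fixes “Win” as positive.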

Stop criteria of decision tree models

We stop splitting instances into child nodes when either of the following criteria is satisfied:

(1) All features have been used; (2) the Information Gain or Gain Ratio is zero for every feature that has not yet been used.

Prediction criteria

(1) If the node is not pure, we use the majority label of the node for prediction. For example, if the node has 5 positives and 1 negative, we predict the testing case at this node to be positive. (2) If the node is balanced (half/half labels), e.g., 2 positives and 2 negatives, we use the majority label of the root node (the entire dataset) for prediction.
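The two prediction rules above can be expressed as a small helper. The function name `leaf_prediction` and the example root label list are hypothetical, and the sketch assumes the root itself is not balanced.

```python
from collections import Counter

def leaf_prediction(node_labels, root_labels):
    """Majority label of the node; fall back to the root majority on a tie."""
    ranked = Counter(node_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Balanced node: use the majority of the whole training set instead.
        return Counter(root_labels).most_common(1)[0][0]
    return ranked[0][0]

# Hypothetical root distribution, for illustration only.
root = ["Win"] * 3 + ["Lose"] * 2
print(leaf_prediction(["Win"] * 5 + ["Lose"], root))          # Win (5 vs 1)
print(leaf_prediction(["Win", "Win", "Lose", "Lose"], root))  # Win (tie, root majority)
```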

Training Data:

| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|---|---|---|---|---|---|---|
| 1 | 9/5/15 | Texas | Home | Out | 1-NBC | Win |
| 2 | 9/12/15 | Virginia | Away | Out | 4-ABC | Win |
| 3 | 9/19/15 | GeorgiaTech | Home | In | 1-NBC | Win |
| 4 | 9/26/15 | UMass | Home | Out | 1-NBC | Win |
| 5 | 10/3/15 | Clemson | Away | In | 4-ABC | Lose |
| 6 | 10/10/15 | Navy | Home | Out | 1-NBC | Win |
| 7 | 10/17/15 | USC | Home | In | 1-NBC | Win |
| 8 | 10/31/15 | Temple | Away | Out | 4-ABC | Win |
| 9 | 11/7/15 | PITT | Away | Out | 4-ABC | Win |
| 10 | 11/14/15 | WakeForest | Home | Out | 1-NBC | Win |
| 11 | 11/21/15 | BostonCollege | Away | Out | 1-NBC | Win |
| 12 | 11/28/15 | Stanford | Away | In | 3-FOX | Lose |
| 13 | 9/4/16 | Texas | Away | Out | 4-ABC | Lose |
| 14 | 9/10/16 | Nevada | Home | Out | 1-NBC | Win |
| 15 | 9/17/16 | MichiganState | Home | Out | 1-NBC | Lose |
| 16 | 9/24/16 | Duke | Home | Out | 1-NBC | Lose |
| 17 | 10/1/16 | Syracuse | Home | Out | 2-ESPN | Win |
| 18 | 10/8/16 | NorthCarolinaState | Away | Out | 4-ABC | Lose |
| 19 | 10/15/16 | Stanford | Home | In | 1-NBC | Lose |
| 20 | 10/29/16 | MiamiFlorida | Home | Out | 1-NBC | Win |
| 21 | 11/5/16 | Navy | Home | Out | 5-CBS | Lose |
| 22 | 11/12/16 | Army | Home | Out | 1-NBC | Win |
| 23 | 11/19/16 | VirginiaTech | Home | In | 1-NBC | Lose |
| 24 | 11/26/16 | USC | Away | In | 4-ABC | Lose |

Testing Data:

| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|---|---|---|---|---|---|---|
| 25 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 26 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 27 | 9/16/17 | BostonCollege | Away | Out | 2-ESPN | Win |
| 28 | 9/23/17 | MichiganState | Away | Out | 3-FOX | Win |
| 29 | 9/30/17 | MiamiOhio | Home | Out | 1-NBC | Win |
| 30 | 10/7/17 | NorthCarolina | Away | Out | 4-ABC | Win |
| 31 | 10/21/17 | USC | Home | In | 1-NBC | Win |
| 32 | 10/28/17 | NorthCarolinaState | Home | Out | 1-NBC | Win |
| 33 | 11/4/17 | WakeForest | Home | Out | 1-NBC | Win |
| 34 | 11/11/17 | MiamiFlorida | Away | In | 4-ABC | Lose |
| 35 | 11/18/17 | Navy | Home | Out | 1-NBC | Win |
| 36 | 11/25/17 | Stanford | Away | In | 4-ABC | Lose |

Question 1: ID3 model, a decision tree model using “Information Gain”

Programming: Use ID3 to construct a decision tree based on the training set (24 games). Use the tree to predict labels of instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.

Attach a figure of your decision tree (either hand-drawn or electronically drawn), and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.

Question 2: C4.5 model, a decision tree model using “Gain Ratio”

Programming: Use C4.5 to construct a decision tree based on the training set (24 games). Use the tree to predict labels of instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.

Attach a figure of your decision tree (either hand-drawn or electronically drawn), and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.
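To see how the two criteria can disagree at the root, the sketch below scores each attribute of the 24 training games by information gain and by gain ratio. The per-value (wins, losses) tallies in `TALLY` were counted by hand from the training data and are an assumption of this sketch, so recount them before relying on the numbers. The function names are illustrative.

```python
from math import log2

# (wins, losses) per attribute value, tallied by hand from training games 1-24.
# These tallies are an assumption of this sketch, not part of the handout.
TALLY = {
    "Is_Home_or_Away": {"Home": (10, 5), "Away": (4, 5)},
    "Is_Opponent_in_AP25_Preseason": {"In": (2, 5), "Out": (12, 5)},
    "Media": {"1-NBC": (10, 4), "2-ESPN": (1, 0), "3-FOX": (0, 1),
              "4-ABC": (3, 4), "5-CBS": (0, 1)},
}

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts (zeros ignored)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_and_ratio(branches):
    sizes = [w + l for w, l in branches.values()]
    total = sum(sizes)
    root = entropy([sum(w for w, _ in branches.values()),
                    sum(l for _, l in branches.values())])
    remainder = sum(s / total * entropy(b) for s, b in zip(sizes, branches.values()))
    gain = root - remainder
    # Split information is the entropy of the branch sizes.
    return gain, gain / entropy(sizes)

results = {name: gain_and_ratio(b) for name, b in TALLY.items()}
for name, (gain, ratio) in results.items():
    print(f"{name}: gain={gain:.3f}, gain ratio={ratio:.3f}")
```

With these tallies, information gain ranks Media first, while gain ratio narrowly ranks Is_Opponent_in_AP25_Preseason first, since the five-way Media split carries a larger split information. The ID3 and C4.5 trees can therefore start from different roots, which is worth keeping in mind when comparing the models.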

Question 3: Which model performs best, and which performs worst? Explain why.

Please submit a report (PDF or Word) that includes a link to your code, your answers/results, and your explanations or interpretations (if any).
