Homework #2 Solution

Task 1

Describe the difference between classification and clustering.




Task 2

Describe what entropy is.
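As a starting point for this task, here is a minimal sketch of Shannon entropy over a list of class labels, measured in bits. The function name `entropy` and the example label lists are illustrative assumptions, not part of the assignment.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    # Sum -p * log2(p) over the relative frequency p of each distinct label.
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


# A half/half node is maximally impure for two classes; a pure node has zero entropy.
balanced = entropy(["Win"] * 3 + ["Lose"] * 3)
pure = entropy(["Win"] * 6)
print(balanced, pure)
```

Entropy is highest when the labels are evenly mixed and drops to zero when every instance at the node has the same label.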

Task 3

Describe and compare the following feature selection measures (also called splitting criteria): information gain, gain ratio, and Gini index.
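For reference, the three measures are commonly written as follows. The notation is an assumption introduced here: S is the set of instances at a node, A is a candidate attribute, S_v is the subset of S where A takes value v, and p_c is the fraction of instances in S with class c.

```latex
\begin{align*}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} \\
\mathrm{Gini}(S) &= 1 - \sum_{c} p_c^2 \\
\mathrm{Gini}_A(S) &= \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Gini}(S_v)
\end{align*}
```

Information gain and gain ratio are maximized when choosing a split, while the weighted Gini index is minimized. Gain ratio divides by the split information, which penalizes attributes with many distinct values.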




Task 4

Given training instances and their attributes, construct the following three decision trees by hand, and then implement them using Python decision tree models:

ID3: information gain

C4.5: gain ratio

CART: Gini index




Question 1: We have




(1) 6 training instances and 6 testing instances

(2) 3 attributes: (a) a 2-value attribute (Home/Away), (b) a 2-value attribute (In/Out), (c) a 4-value attribute (NBC/ESPN/FOX/ABC)






| ID | Date | University | Is Home/Away? | Is Opponent in AP Top 25 at Preseason? | Media | Label: Win/Lose |
|---|---|---|---|---|---|---|
| 1 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 2 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 3 | 9/16/17 | Boston College | Away | Out | 2-ESPN | Win |
| 4 | 9/23/17 | Michigan State | Away | Out | 3-FOX | Win |
| 5 | 9/30/17 | Miami Ohio | Home | Out | 1-NBC | Win |
| 6 | 10/7/17 | North Carolina | Away | Out | 4-ABC | Win |
| 7 | 10/19/17 | USC | Home | In | 1-NBC | ? |
| 8 | 10/25/17 | North Carolina State | Home | Out | 1-NBC | ? |
| 9 | 11/4/17 | Wake Forest | Home | Out | 1-NBC | ? |
| 10 | 11/12/17 | Miami Florida | Away | In | 4-ABC | ? |
| 11 | 11/18/17 | Navy | Home | Out | 1-NBC | ? |
| 12 | 11/26/17 | Stanford | Away | In | 4-ABC | ? |
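To check the hand calculation for Question 1, the sketch below computes information gain, gain ratio, and weighted Gini index for each of the three attributes on training instances 1-6. The short attribute names in `ATTRIBUTES` and the helper `measures` are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

# Training instances 1-6 from Question 1:
# (Is Home/Away?, Is Opponent in AP Top 25 at Preseason?, Media, Label)
TRAIN = [
    ("Home", "Out", "1-NBC", "Win"),
    ("Home", "In", "1-NBC", "Lose"),
    ("Away", "Out", "2-ESPN", "Win"),
    ("Away", "Out", "3-FOX", "Win"),
    ("Home", "Out", "1-NBC", "Win"),
    ("Away", "Out", "4-ABC", "Win"),
]
ATTRIBUTES = ["Home/Away", "AP Top 25", "Media"]  # abbreviated names (assumption)


def entropy(items):
    n = len(items)
    return sum(-(c / n) * log2(c / n) for c in Counter(items).values())


def gini(items):
    n = len(items)
    return 1.0 - sum((c / n) ** 2 for c in Counter(items).values())


def measures(rows, col):
    """Return (information gain, gain ratio, weighted Gini) for one attribute."""
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[col]].append(r[-1])  # labels grouped by attribute value
    weights = {v: len(g) / len(rows) for v, g in groups.items()}
    gain = entropy(labels) - sum(weights[v] * entropy(g) for v, g in groups.items())
    split_info = entropy([r[col] for r in rows])  # entropy of the attribute itself
    ratio = gain / split_info if split_info > 0 else 0.0
    weighted_gini = sum(weights[v] * gini(g) for v, g in groups.items())
    return gain, ratio, weighted_gini


results = {name: measures(TRAIN, i) for i, name in enumerate(ATTRIBUTES)}
for name, (gain, ratio, wg) in results.items():
    print(f"{name:10s} gain={gain:.3f} ratio={ratio:.3f} gini={wg:.3f}")
```

The attribute with the largest gain (ID3) or gain ratio (C4.5), or the smallest weighted Gini (CART-style), becomes the root split; the same functions can then be applied to each child subset.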




















Question 2: We have




 
(1) 14 training instances and 1 testing instance

(2) 4 attributes: (a) a 3-value attribute (Sunny/Overcast/Rainy), (b) a 3-value attribute (Hot/Mild/Cool), (c) a 2-value attribute (High/Normal), (d) a 2-value attribute (True/False)




| ID | Date | Outlook | Temperature | Humidity | Windy | Label: Play? |
|---|---|---|---|---|---|---|
| 1 | 9/1/17 | Sunny | Hot | High | "False" | No |
| 2 | 9/8/17 | Sunny | Hot | High | "True" | No |
| 3 | 9/15/17 | Overcast | Hot | High | "False" | Yes |
| 4 | 9/22/17 | Rainy | Mild | High | "False" | Yes |
| 5 | 9/29/17 | Rainy | Cool | Normal | "False" | Yes |
| 6 | 10/1/17 | Rainy | Cool | Normal | "True" | No |
| 7 | 10/8/17 | Overcast | Cool | Normal | "True" | Yes |
| 8 | 10/15/17 | Sunny | Mild | High | "False" | No |
| 9 | 10/22/17 | Sunny | Cool | Normal | "False" | Yes |
| 10 | 10/29/17 | Rainy | Mild | Normal | "False" | Yes |
| 11 | 11/1/17 | Sunny | Mild | Normal | "True" | Yes |
| 12 | 11/8/17 | Overcast | Mild | High | "True" | Yes |
| 13 | 11/15/17 | Overcast | Hot | Normal | "False" | Yes |
| 14 | 11/22/17 | Rainy | Mild | High | "True" | No |
| 15 | 11/29/17 | Rainy | Hot | High | "False" | ? |
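For the CART part of Question 2, a minimal sketch of choosing the root split by Gini index is shown below. It assumes CART-style binary splits of the form "attribute equals value" versus "everything else"; because no attribute here has more than three values, these one-vs-rest splits cover every possible binary partition. The helper names are illustrative.

```python
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]
# Training instances 1-14 from Question 2 (last field is the Play? label).
TRAIN = [row.split() for row in """
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
""".strip().splitlines()]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def binary_split_gini(rows, col, value):
    """Weighted Gini of the binary split: attribute == value vs. the rest."""
    left = [r[-1] for r in rows if r[col] == value]
    right = [r[-1] for r in rows if r[col] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Score every candidate (attribute, value) split; lower is better.
candidates = {}
for col, name in enumerate(ATTRIBUTES):
    for value in sorted({r[col] for r in TRAIN}):
        candidates[(name, value)] = binary_split_gini(TRAIN, col, value)

best = min(candidates, key=candidates.get)
for (name, value), score in sorted(candidates.items(), key=lambda kv: kv[1]):
    print(f"{name} == {value}: weighted Gini {score:.4f}")
print("Best root split:", best)
```

Repeating the same scoring on each child subset, until the nodes are pure or no attributes remain, yields the rest of the tree.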














Task 5

Given a university’s football game data for the last two seasons, construct three classification models to predict game results, and evaluate each model’s performance. The three classification models are ID3, C4.5, and Naïve Bayes.

 
Data

Each data object (also called an instance) is a game. There are three attributes: (1) “Is Home/Away?”, a 2-value attribute (“Home”, “Away”), (2) “Is Opponent in AP Top 25 at Preseason?”, a 2-value attribute (“In”, “Out”), (3) “Media”, a 5-value attribute (“1-NBC”, “2-ESPN”, “3-FOX”, “4-ABC”, “5-CBS”). The label “Win/Lose” is binary (“Win”, “Lose”).




 
Training set

24 games. Use game IDs 1-24 to construct the classification models.




 
Testing set

12 games. Use your classification models to predict the labels of game IDs 25-36, and evaluate the performance of the models.




 
Predictive labels

Treat “Win” as the positive label and “Lose” as the negative label. Keep this in mind when you use Precision and Recall to evaluate the models.
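The sketch below shows one way to compute the four evaluation metrics with “Win” as the positive label. The function name `evaluate` and the two short label lists are hypothetical and are not taken from the game data.

```python
def evaluate(y_true, y_pred, positive="Win"):
    """Accuracy, precision, recall, and F1 with `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true and predicted labels, for illustration only.
y_true = ["Win", "Win", "Lose", "Win", "Lose", "Win"]
y_pred = ["Win", "Lose", "Lose", "Win", "Win", "Win"]
scores = evaluate(y_true, y_pred)
print(scores)
```

Precision asks how many predicted wins were real wins, and recall asks how many real wins were predicted; swapping the positive label would change both values.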


 
Stop criteria of decision tree models

We stop splitting instances into child nodes when either of the following criteria is satisfied:

(1) All features have been used; (2) Information Gain or Gain Ratio would be zero for every feature that has not yet been used.




 
Prediction criteria

(1) If the node is not pure, we use the majority label of that node for prediction. For example, if the node has 5 positives and 1 negative, we predict the testing case at this node to be positive. (2) If the node is balanced (half/half labels), e.g., 2 positives and 2 negatives, we use the majority label of the root node (the entire dataset) for prediction.
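The prediction rule can be sketched as a small helper. The name `leaf_prediction` and the example label lists are assumptions; the case where the root itself is balanced is not covered by the rule as stated, so this sketch simply returns whichever label was seen first.

```python
from collections import Counter


def leaf_prediction(node_labels, root_labels):
    """Majority label of the node; on a tie, the majority label of the root."""
    counts = Counter(node_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Balanced node: fall back to the majority of the entire training set.
        # If the root is also balanced, most_common keeps first-seen order.
        return Counter(root_labels).most_common(1)[0][0]
    return counts[0][0]


# Hypothetical root with more "Lose" than "Win", to make the fallback visible.
root = ["Lose"] * 3 + ["Win"] * 2
print(leaf_prediction(["Win"] * 5 + ["Lose"], root))      # impure node, clear majority
print(leaf_prediction(["Win"] * 2 + ["Lose"] * 2, root))  # balanced node
```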




Training Data:

| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|---|---|---|---|---|---|---|
| 1 | 9/5/15 | Texas | Home | Out | 1-NBC | Win |
| 2 | 9/12/15 | Virginia | Away | Out | 4-ABC | Win |
| 3 | 9/19/15 | GeorgiaTech | Home | In | 1-NBC | Win |
| 4 | 9/26/15 | UMass | Home | Out | 1-NBC | Win |
| 5 | 10/3/15 | Clemson | Away | In | 4-ABC | Lose |
| 6 | 10/10/15 | Navy | Home | Out | 1-NBC | Win |
| 7 | 10/17/15 | USC | Home | In | 1-NBC | Win |
| 8 | 10/31/15 | Temple | Away | Out | 4-ABC | Win |
| 9 | 11/7/15 | PITT | Away | Out | 4-ABC | Win |
| 10 | 11/14/15 | WakeForest | Home | Out | 1-NBC | Win |
| 11 | 11/21/15 | BostonCollege | Away | Out | 1-NBC | Win |
| 12 | 11/28/15 | Stanford | Away | In | 3-FOX | Lose |
| 13 | 9/4/16 | Texas | Away | Out | 4-ABC | Lose |
| 14 | 9/10/16 | Nevada | Home | Out | 1-NBC | Win |
| 15 | 9/17/16 | MichiganState | Home | Out | 1-NBC | Lose |
| 16 | 9/24/16 | Duke | Home | Out | 1-NBC | Lose |
| 17 | 10/1/16 | Syracuse | Home | Out | 2-ESPN | Win |
| 18 | 10/8/16 | NorthCarolinaState | Away | Out | 4-ABC | Lose |
| 19 | 10/15/16 | Stanford | Home | In | 1-NBC | Lose |
| 20 | 10/29/16 | MiamiFlorida | Home | Out | 1-NBC | Win |
| 21 | 11/5/16 | Navy | Home | Out | 5-CBS | Lose |
| 22 | 11/12/16 | Army | Home | Out | 1-NBC | Win |
| 23 | 11/19/16 | VirginiaTech | Home | In | 1-NBC | Lose |
| 24 | 11/26/16 | USC | Away | In | 4-ABC | Lose |


Testing Data:

| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|---|---|---|---|---|---|---|
| 25 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 26 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 27 | 9/16/17 | BostonCollege | Away | Out | 2-ESPN | Win |
| 28 | 9/23/17 | MichiganState | Away | Out | 3-FOX | Win |
| 29 | 9/30/17 | MiamiOhio | Home | Out | 1-NBC | Win |
| 30 | 10/7/17 | NorthCarolina | Away | Out | 4-ABC | Win |
| 31 | 10/21/17 | USC | Home | In | 1-NBC | Win |
| 32 | 10/28/17 | NorthCarolinaState | Home | Out | 1-NBC | Win |
| 33 | 11/4/17 | WakeForest | Home | Out | 1-NBC | Win |
| 34 | 11/11/17 | MiamiFlorida | Away | In | 4-ABC | Lose |
| 35 | 11/18/17 | Navy | Home | Out | 1-NBC | Win |
| 36 | 11/25/17 | Stanford | Away | In | 4-ABC | Lose |


Question 1: ID3 model, a decision tree model using “Information Gain”




 
Programming: Use ID3 to construct a decision tree based on the training set (24 games). Use the tree to predict the labels of the instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.

Attach a figure of your decision tree (either hand-drawn or electronically drawn), and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.
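One possible shape for the ID3 program is sketched below, using only the standard library. It assumes the rows are transcribed correctly from the training and testing tables, applies the stop and prediction criteria stated earlier, and uses illustrative names (`build`, `predict`, `leaf_label`) that are not prescribed by the assignment.

```python
from collections import Counter
from math import log2

# Columns: Is_Home_or_Away, Is_Opponent_in_AP25_Preseason, Media, Label
FEATURES = ["Is_Home_or_Away", "Is_Opponent_in_AP25_Preseason", "Media"]

def parse(text):
    return [line.split() for line in text.strip().splitlines()]

TRAIN = parse("""
Home Out 1-NBC Win
Away Out 4-ABC Win
Home In 1-NBC Win
Home Out 1-NBC Win
Away In 4-ABC Lose
Home Out 1-NBC Win
Home In 1-NBC Win
Away Out 4-ABC Win
Away Out 4-ABC Win
Home Out 1-NBC Win
Away Out 1-NBC Win
Away In 3-FOX Lose
Away Out 4-ABC Lose
Home Out 1-NBC Win
Home Out 1-NBC Lose
Home Out 1-NBC Lose
Home Out 2-ESPN Win
Away Out 4-ABC Lose
Home In 1-NBC Lose
Home Out 1-NBC Win
Home Out 5-CBS Lose
Home Out 1-NBC Win
Home In 1-NBC Lose
Away In 4-ABC Lose
""")
TEST = parse("""
Home Out 1-NBC Win
Home In 1-NBC Lose
Away Out 2-ESPN Win
Away Out 3-FOX Win
Home Out 1-NBC Win
Away Out 4-ABC Win
Home In 1-NBC Win
Home Out 1-NBC Win
Home Out 1-NBC Win
Away In 4-ABC Lose
Home Out 1-NBC Win
Away In 4-ABC Lose
""")

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    rest = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - rest

def leaf_label(rows, root_label):
    counts = Counter(r[-1] for r in rows).most_common()
    tied = len(counts) > 1 and counts[0][1] == counts[1][1]
    return root_label if tied else counts[0][0]  # balanced node: root majority

def build(rows, cols, root_label):
    label = leaf_label(rows, root_label)
    best = max(cols, key=lambda c: info_gain(rows, c), default=None)
    # Stop: all features used, or no remaining feature has positive gain.
    if best is None or info_gain(rows, best) <= 1e-12:
        return label
    rest = [c for c in cols if c != best]
    kids = {v: build([r for r in rows if r[best] == v], rest, root_label)
            for v in sorted({r[best] for r in rows})}
    return {"col": best, "default": label, "kids": kids}

def predict(tree, row):
    while isinstance(tree, dict):  # unseen values fall back to the node's label
        tree = tree["kids"].get(row[tree["col"]], tree["default"])
    return tree

root_label = Counter(r[-1] for r in TRAIN).most_common(1)[0][0]
tree = build(TRAIN, [0, 1, 2], root_label)
preds = [predict(tree, r) for r in TEST]
truth = [r[-1] for r in TEST]
tp = sum(p == t == "Win" for p, t in zip(preds, truth))
precision = tp / max(preds.count("Win"), 1)
recall = tp / max(truth.count("Win"), 1)
accuracy = sum(p == t for p, t in zip(preds, truth)) / len(truth)
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print("Root split:", FEATURES[tree["col"]])
print("Predictions:", preds)
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
```

The nested dictionary returned by `build` can be printed or traversed to draw the tree figure requested above.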




Question 2: C4.5 model, a decision tree model using “Gain Ratio”




 
Programming: Use C4.5 to construct a decision tree based on the training set (24 games). Use the tree to predict the labels of the instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.

Attach a figure of your decision tree (either hand-drawn or electronically drawn), and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.
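The only change C4.5 makes to the attribute selection step, as this assignment describes it, is ranking by gain ratio instead of information gain. The sketch below compares the two rankings at the root of the training set. It assumes plain gain-ratio maximization, without the refinements found in full C4.5 implementations (such as pruning or requiring at least average information gain), and the helper `gain_and_ratio` is a name chosen for this example.

```python
from collections import Counter
from math import log2

FEATURES = ["Is_Home_or_Away", "Is_Opponent_in_AP25_Preseason", "Media"]
# Games 1-24 from the training table (last field is the label).
TRAIN = [line.split() for line in """
Home Out 1-NBC Win
Away Out 4-ABC Win
Home In 1-NBC Win
Home Out 1-NBC Win
Away In 4-ABC Lose
Home Out 1-NBC Win
Home In 1-NBC Win
Away Out 4-ABC Win
Away Out 4-ABC Win
Home Out 1-NBC Win
Away Out 1-NBC Win
Away In 3-FOX Lose
Away Out 4-ABC Lose
Home Out 1-NBC Win
Home Out 1-NBC Lose
Home Out 1-NBC Lose
Home Out 2-ESPN Win
Away Out 4-ABC Lose
Home In 1-NBC Lose
Home Out 1-NBC Win
Home Out 5-CBS Lose
Home Out 1-NBC Win
Home In 1-NBC Lose
Away In 4-ABC Lose
""".strip().splitlines()]


def entropy(items):
    n = len(items)
    return sum(-(c / n) * log2(c / n) for c in Counter(items).values())


def gain_and_ratio(rows, col):
    """Return (information gain, gain ratio) for splitting rows on column col."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([r[-1] for r in rows]) - remainder
    split_info = entropy([r[col] for r in rows])  # penalizes many-valued attributes
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


scores = {name: gain_and_ratio(TRAIN, i) for i, name in enumerate(FEATURES)}
best_by_gain = max(scores, key=lambda k: scores[k][0])
best_by_ratio = max(scores, key=lambda k: scores[k][1])
for name, (gain, ratio) in scores.items():
    print(f"{name}: gain={gain:.4f} gain_ratio={ratio:.4f}")
print("ID3 root:", best_by_gain, "| C4.5 root:", best_by_ratio)
```

Because the two criteria need not agree on the root attribute, the ID3 and C4.5 trees can differ in shape and in their predictions, which is relevant when comparing the models.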




Question 3: Which model performs best, and which performs worst? Explain why.




Please submit a report (PDF or Word) that includes a link to your code, your answers and results, and your explanations or interpretations (if any).
