Task 1
Describe the difference between classification and clustering.
Task 2
Describe what entropy is.
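As a point of reference for this task, the Shannon entropy of a set of class labels can be sketched in a few lines of Python. This is a minimal illustration, assuming base-2 logarithms (entropy measured in bits); the function name `entropy` and the sample label lists are hypothetical and not part of the assignment data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical label sets, for illustration only.
pure = entropy(["Win"] * 4)                          # one class only
balanced = entropy(["Win", "Win", "Lose", "Lose"])   # even binary split
skewed = entropy(["Win"] * 5 + ["Lose"])             # 5 vs. 1

print(pure, balanced, round(skewed, 3))
```

A pure set gives 0 bits, an evenly split binary set gives 1 bit, and anything in between falls strictly between those two values.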
Task 3
Describe and compare the following “feature selection measures” (also called “splitting criteria”): information gain, gain ratio, and Gini index.
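The three measures can be compared side by side on a small example. The sketch below is one possible formulation, assuming multiway splits on a categorical attribute; strict CART uses binary splits, so the weighted Gini shown here is a simplification. All function names and the toy `values`/`labels` lists are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    total = len(items)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(items).values())

def gini(items):
    total = len(items)
    return 1.0 - sum((n / total) ** 2 for n in Counter(items).values())

def partition(values, labels):
    """Group the labels by the attribute value of each instance."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return list(groups.values())

def information_gain(values, labels):
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_split(values, labels):
    """Weighted Gini impurity after splitting on the attribute (lower is better)."""
    total = len(labels)
    return sum(len(g) / total * gini(g) for g in partition(values, labels))

# Hypothetical toy data: one 3-value attribute and a binary label.
values = ["a", "a", "b", "b", "c", "c"]
labels = ["yes", "yes", "no", "no", "yes", "no"]

print(round(information_gain(values, labels), 4))  # 0.6667
print(round(gain_ratio(values, labels), 4))        # 0.4206
print(round(gini_split(values, labels), 4))        # 0.1667
```

Information gain and gain ratio are maximized when choosing a split, whereas the weighted Gini impurity is minimized. Gain ratio divides the gain by the split information, which penalizes attributes with many distinct values.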
Task 4
Given training instances and their attributes, construct the following three decision trees by hand, and then implement the three decision trees using Python Decision Tree models:
ID3: information gain
C4.5: gain ratio
CART: Gini index
Question 1: We have
(1) 6 training instances and 6 testing instances
(2) 3 attributes: (a) 2-value attribute (Home/Away), (b) 2-value attribute (In/Out), (c) 4-value attribute (NBC/ESPN/FOX/ABC)
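| ID | Date | University | Is Home/Away? | Is Opponent in AP Top 25 at Preseason? | Media | Label: Win/Lose |
|----|------|------------|---------------|----------------------------------------|-------|-----------------|
| 1 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 2 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 3 | 9/16/17 | Boston College | Away | Out | 2-ESPN | Win |
| 4 | 9/23/17 | Michigan State | Away | Out | 3-FOX | Win |
| 5 | 9/30/17 | Miami Ohio | Home | Out | 1-NBC | Win |
| 6 | 10/7/17 | North Carolina | Away | Out | 4-ABC | Win |
| 7 | 10/19/17 | USC | Home | In | 1-NBC | ? |
| 8 | 10/25/17 | North Carolina State | Home | Out | 1-NBC | ? |
| 9 | 11/4/17 | Wake Forest | Home | Out | 1-NBC | ? |
| 10 | 11/12/17 | Miami Florida | Away | In | 4-ABC | ? |
| 11 | 11/18/17 | Navy | Home | Out | 1-NBC | ? |
| 12 | 11/26/17 | Stanford | Away | In | 4-ABC | ? |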
Question 2: We have
14 training instances and 1 testing instance
4 attributes: (a) 3-value attribute (Sunny/Overcast/Rainy), (b) 3-value attribute (Hot/Mild/Cool), (c) 2-value attribute (High/Normal), (d) 2-value attribute (True/False)
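| ID | Date | Outlook | Temperature | Humidity | Windy | Label: Play? |
|----|------|---------|-------------|----------|-------|--------------|
| 1 | 9/1/17 | Sunny | Hot | High | "False" | No |
| 2 | 9/8/17 | Sunny | Hot | High | "True" | No |
| 3 | 9/15/17 | Overcast | Hot | High | "False" | Yes |
| 4 | 9/22/17 | Rainy | Mild | High | "False" | Yes |
| 5 | 9/29/17 | Rainy | Cool | Normal | "False" | Yes |
| 6 | 10/1/17 | Rainy | Cool | Normal | "True" | No |
| 7 | 10/8/17 | Overcast | Cool | Normal | "True" | Yes |
| 8 | 10/15/17 | Sunny | Mild | High | "False" | No |
| 9 | 10/22/17 | Sunny | Cool | Normal | "False" | Yes |
| 10 | 10/29/17 | Rainy | Mild | Normal | "False" | Yes |
| 11 | 11/1/17 | Sunny | Mild | Normal | "True" | Yes |
| 12 | 11/8/17 | Overcast | Mild | High | "True" | Yes |
| 13 | 11/15/17 | Overcast | Hot | Normal | "False" | Yes |
| 14 | 11/22/17 | Rainy | Mild | High | "True" | No |
| 15 | 11/29/17 | Rainy | Hot | High | "False" | ? |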
Task 5
Given a university’s football game data for the last two seasons, construct three classification models to predict game results, and evaluate the models’ performance. The three classification models are ID3, C4.5, and Naïve Bayes.
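Of the three models named above, Naïve Bayes is the only one that is not a decision tree. A minimal sketch for categorical attributes follows. It assumes Laplace (add-one) smoothing, which the assignment does not specify, and uses a hypothetical one-attribute toy dataset rather than the game data; `train_nb` and `predict_nb` are illustrative names.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute, label) -> Counter of values
    domains = defaultdict(set)           # attribute -> values seen in training
    for row, y in zip(rows, labels):
        for attr, value in row.items():
            value_counts[(attr, y)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains

def predict_nb(model, row):
    """Return the label with the highest prior times smoothed likelihoods."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # prior P(y)
        for attr, value in row.items():
            # Laplace smoothing (an assumption): add 1 to each count so that
            # an unseen attribute/label combination does not zero the product.
            score *= (value_counts[(attr, y)][value] + 1) / (n_y + len(domains[attr]))
        scores[y] = score
    return max(scores, key=scores.get)

# Hypothetical toy data with a single attribute named "site".
rows = [{"site": "Home"}, {"site": "Home"}, {"site": "Away"},
        {"site": "Away"}, {"site": "Home"}]
labels = ["Win", "Win", "Lose", "Lose", "Win"]

model = train_nb(rows, labels)
print(predict_nb(model, {"site": "Home"}))  # Win
print(predict_nb(model, {"site": "Away"}))  # Lose
```

Without smoothing, a single zero count would force the whole product for that label to zero. That matters for small training sets in which some attribute values occur only once.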
Data
Each data object (also called an instance) is a game. There are three attributes: (1) “Is Home/Away?”, a 2-value attribute (“Home”, “Away”), (2) “Is Opponent in AP Top 25 at Preseason?”, a 2-value attribute (“In”, “Out”), (3) “Media”, a 5-value attribute (“1-NBC”, “2-ESPN”, “3-FOX”, “4-ABC”, “5-CBS”). The label “Win/Lose” is binary (“Win”, “Lose”).
Training set
24 games. Please use game ID 1-24 to construct classification models.
Testing set
12 games. Please use your classification models to predict labels of game ID 25-36 and evaluate the performance of the classification models.
Predictive labels
Suppose “Win” is the positive label and “Lose” is the negative label. Keep this in mind when you use Precision and Recall to evaluate the models.
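With “Win” as the positive label, the four evaluation metrics follow directly from the confusion-matrix counts. The sketch below is one way to compute them. Returning 0.0 when a denominator is zero is an assumed convention, not something the assignment states, and the `y_true`/`y_pred` lists are hypothetical rather than actual results.

```python
def evaluate(y_true, y_pred, positive="Win"):
    """Accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical true and predicted labels, for illustration only.
y_true = ["Win", "Win", "Lose", "Win", "Lose", "Lose"]
y_pred = ["Win", "Lose", "Lose", "Win", "Win", "Lose"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

In this toy case there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so all four metrics equal 2/3.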
Stop criteria of decision tree models
We stop splitting instances into child nodes when either of the following criteria is satisfied:
(1) All features have been used; (2) Information Gain or Gain Ratio would be zero for every feature that has not yet been used.
Prediction criteria
(1) If the node is not pure, we use the majority label of that node for prediction. For example, if a node has 5 positives and 1 negative, we predict a testing case that reaches this node to be positive. (2) If the node is balanced (half/half labels), e.g., 2 positives and 2 negatives, we use the majority label of the root node (the entire dataset) for prediction.
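The stop criteria and prediction criteria can be combined into a compact ID3-style builder. The sketch below is a hedged illustration, not a reference solution: it uses information gain only, treats gains below a small tolerance as zero, and falls back to the node majority for attribute values unseen in training (an assumption the assignment does not cover). The attribute names `site` and `rank` and the five toy rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def leaf_label(labels, root_majority):
    """Majority label at a node; an exact tie falls back to the root majority."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return root_majority
    return ranked[0][0]

def build(rows, labels, attrs, root_majority):
    # Stop criterion (1): every feature has been used.
    if not attrs:
        return leaf_label(labels, root_majority)
    gains = {a: info_gain(rows, labels, a) for a in attrs}
    best = max(gains, key=gains.get)
    # Stop criterion (2): no unused feature gives a positive gain
    # (this also covers pure nodes, whose gain is always zero).
    if gains[best] <= 1e-12:
        return leaf_label(labels, root_majority)
    node = {"attr": best, "default": leaf_label(labels, root_majority), "children": {}}
    for value in sorted(set(r[best] for r in rows)):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best], root_majority)
    return node

def predict(node, row):
    while isinstance(node, dict):
        # Assumption: a value not seen in training uses this node's majority label.
        node = node["children"].get(row[node["attr"]], node["default"])
    return node

# Hypothetical toy data, not the assignment's games.
rows = [{"site": "Home", "rank": "Out"}, {"site": "Home", "rank": "In"},
        {"site": "Away", "rank": "Out"}, {"site": "Away", "rank": "In"},
        {"site": "Home", "rank": "Out"}]
labels = ["Win", "Lose", "Win", "Lose", "Win"]

root_majority = Counter(labels).most_common(1)[0][0]
tree = build(rows, labels, ["site", "rank"], root_majority)
print(tree["attr"], predict(tree, {"site": "Away", "rank": "In"}))
```

On the toy data the builder splits once on `rank` and then stops, because both children are pure. Swapping `info_gain` for a gain-ratio function would give a C4.5-style variant under the same stop and prediction rules.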
Training Data:
| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|----|------|----------|-----------------|-------------------------------|-------|-------|
| 1 | 9/5/15 | Texas | Home | Out | 1-NBC | Win |
| 2 | 9/12/15 | Virginia | Away | Out | 4-ABC | Win |
| 3 | 9/19/15 | GeorgiaTech | Home | In | 1-NBC | Win |
| 4 | 9/26/15 | UMass | Home | Out | 1-NBC | Win |
| 5 | 10/3/15 | Clemson | Away | In | 4-ABC | Lose |
| 6 | 10/10/15 | Navy | Home | Out | 1-NBC | Win |
| 7 | 10/17/15 | USC | Home | In | 1-NBC | Win |
| 8 | 10/31/15 | Temple | Away | Out | 4-ABC | Win |
| 9 | 11/7/15 | PITT | Away | Out | 4-ABC | Win |
| 10 | 11/14/15 | WakeForest | Home | Out | 1-NBC | Win |
| 11 | 11/21/15 | BostonCollege | Away | Out | 1-NBC | Win |
| 12 | 11/28/15 | Stanford | Away | In | 3-FOX | Lose |
| 13 | 9/4/16 | Texas | Away | Out | 4-ABC | Lose |
| 14 | 9/10/16 | Nevada | Home | Out | 1-NBC | Win |
| 15 | 9/17/16 | MichiganState | Home | Out | 1-NBC | Lose |
| 16 | 9/24/16 | Duke | Home | Out | 1-NBC | Lose |
| 17 | 10/1/16 | Syracuse | Home | Out | 2-ESPN | Win |
| 18 | 10/8/16 | NorthCarolinaState | Away | Out | 4-ABC | Lose |
| 19 | 10/15/16 | Stanford | Home | In | 1-NBC | Lose |
| 20 | 10/29/16 | MiamiFlorida | Home | Out | 1-NBC | Win |
| 21 | 11/5/16 | Navy | Home | Out | 5-CBS | Lose |
| 22 | 11/12/16 | Army | Home | Out | 1-NBC | Win |
| 23 | 11/19/16 | VirginiaTech | Home | In | 1-NBC | Lose |
| 24 | 11/26/16 | USC | Away | In | 4-ABC | Lose |
Testing Data
| ID | Date | Opponent | Is_Home_or_Away | Is_Opponent_in_AP25_Preseason | Media | Label |
|----|------|----------|-----------------|-------------------------------|-------|-------|
| 25 | 9/2/17 | Temple | Home | Out | 1-NBC | Win |
| 26 | 9/9/17 | Georgia | Home | In | 1-NBC | Lose |
| 27 | 9/16/17 | BostonCollege | Away | Out | 2-ESPN | Win |
| 28 | 9/23/17 | MichiganState | Away | Out | 3-FOX | Win |
| 29 | 9/30/17 | MiamiOhio | Home | Out | 1-NBC | Win |
| 30 | 10/7/17 | NorthCarolina | Away | Out | 4-ABC | Win |
| 31 | 10/21/17 | USC | Home | In | 1-NBC | Win |
| 32 | 10/28/17 | NorthCarolinaState | Home | Out | 1-NBC | Win |
| 33 | 11/4/17 | WakeForest | Home | Out | 1-NBC | Win |
| 34 | 11/11/17 | MiamiFlorida | Away | In | 4-ABC | Lose |
| 35 | 11/18/17 | Navy | Home | Out | 1-NBC | Win |
| 36 | 11/25/17 | Stanford | Away | In | 4-ABC | Lose |
Question 1: ID3 model, a decision tree model using “Information Gain”
Programming: Use ID3 to construct a decision tree based on the training set (24 games). Use the tree to predict labels of instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.
Attach a figure of your decision tree (either hand-drawn or electronically drawn) and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.
Question 2: C4.5 model, a decision tree model using “Gain Ratio”
Programming: Use C4.5 to construct a decision tree based on the training set (24 games). Use the tree to predict labels of instances in the testing set (12 games) based on their attributes. Calculate Accuracy, Precision, Recall, and F1 score on the testing result.
Attach a figure of your decision tree (either hand-drawn or electronically drawn) and write down the predicted labels of the 12 testing games as well as the evaluation results in the PDF.
Question 3: Which model performs the best, and which performs the worst? Can you explain why?
Please submit a report (PDF or Word) that includes a link to your code, your answers/results, and your explanations or interpretations (if any).