CS 5402 – Intro to Data Mining HW #2 Solved


1. Consider the following dataset where the decision attribute is restaurant:

mealPreference   gender   drinkPreference   restaurant
hamburger        M        coke              mcdonalds
fish             M        pepsi             burgerKing
chicken          F        coke              mcdonalds
hamburger        M        coke              mcdonalds
chicken          M        pepsi             wendys
fish             F        coke              burgerKing
chicken          M        pepsi             burgerKing
chicken          F        coke              wendys
hamburger        F        coke              mcdonalds

Use the 1-rule (1R) method to find the best single attribute for determining restaurant. To demonstrate that you actually know how this method works (and aren't just guessing at which attribute is best), you must fill in ALL of the blank values in the table below; otherwise, you will not receive any credit for this problem. If there is a tie for the most frequent value of restaurant, choose whichever of the tied values you want.    (10 pts.)
Attribute         Attribute Value   # Rows with Attribute Value   Most Frequent Value for restaurant   Errors   Total Errors
mealPreference    hamburger         3                             ____                                 ____
                  fish              2                             ____                                 ____
                  chicken           4                             ____                                 ____     ____
gender            M                 5                             ____                                 ____
                  F                 4                             ____                                 ____     ____
drinkPreference   pepsi             3                             ____                                 ____
                  coke              6                             ____                                 ____     ____
Based on these calculations, list the rules that would be generated by the 1R method for determining restaurant.  
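For reference, here is a minimal sketch of the 1R tally in Python (the nine rows are hard-coded from the table above; the printed counts mirror the columns you are asked to fill in):

```python
from collections import Counter, defaultdict

# The nine rows from problem 1: (mealPreference, gender, drinkPreference, restaurant).
rows = [
    ("hamburger", "M", "coke",  "mcdonalds"),
    ("fish",      "M", "pepsi", "burgerKing"),
    ("chicken",   "F", "coke",  "mcdonalds"),
    ("hamburger", "M", "coke",  "mcdonalds"),
    ("chicken",   "M", "pepsi", "wendys"),
    ("fish",      "F", "coke",  "burgerKing"),
    ("chicken",   "M", "pepsi", "burgerKing"),
    ("chicken",   "F", "coke",  "wendys"),
    ("hamburger", "F", "coke",  "mcdonalds"),
]
attrs = ["mealPreference", "gender", "drinkPreference"]

for i, attr in enumerate(attrs):
    # Tally restaurant outcomes for each value of this attribute.
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[i]][row[3]] += 1
    # 1R: one rule per attribute value -> predict the most frequent restaurant;
    # every row with a different restaurant counts as an error.
    total_errors = 0
    for value, outcomes in counts.items():
        best, best_n = outcomes.most_common(1)[0]
        errors = sum(outcomes.values()) - best_n
        total_errors += errors
        print(f"  {attr} = {value} -> {best}  (errors: {errors}/{sum(outcomes.values())})")
    print(f"{attr}: total errors = {total_errors}/{len(rows)}\n")
```

The attribute with the lowest total error count is the one 1R selects.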



2.  Create the dataset given in problem 1 as an ARFF or CSV file, and run DecisionStump on it in Weka. List the classification rules that are produced (you can just include a screenshot of your Weka output) AND draw a tree that corresponds to the rules.      (1 pt.)








3. Statistical modeling can be used to compute the probability of occurrence of an attribute value. Based on the data given in the table below, if we have a new instance where ageGroup = youngAdult, gender = M, and bookPreference = nonFiction, what is the likelihood that musicPreference = country? Just set up the equation to compute this with the appropriate values; you don’t have to actually calculate the final answer.  (1 pt.)
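The setup follows the standard Naive Bayes factorization; in general form it looks like this (a sketch only — the individual probabilities are the relative frequencies read off the table):

```latex
likelihood(\mathrm{country}) =
    P(\mathrm{ageGroup} = \mathrm{youngAdult} \mid \mathrm{country}) \times
    P(\mathrm{gender} = \mathrm{M} \mid \mathrm{country}) \times
    P(\mathrm{bookPreference} = \mathrm{nonFiction} \mid \mathrm{country}) \times
    P(\mathrm{country})
```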





4.  Create the dataset given in problem 1 as an ARFF or CSV file, and run Id3 on it in Weka. Show the decision tree output that is produced by Weka AND draw the tree by hand.  (1 pt.)
Note: Id3 may not be installed with the initial download of Weka 3.8, in which case you will need to install the package named simpleEducationalLearningSchemes.



5. Consider the following dataset where the decision attribute is musicPreference:

ageGroup     gender   bookPreference   musicPreference
youngAdult   M        sciFiction       rock
senior       M        mystery          classical
middleAge    F        mystery          rock
youngAdult   M        nonFiction       country
middleAge    M        sciFiction       rock
senior       F        nonFiction       classical
middleAge    F        mystery          country
youngAdult   F        mystery          country

If we want to make a decision tree for determining musicPreference, we must decide which of the three attributes (ageGroup, gender, or bookPreference) to use as the root of the tree.        
        a. Set up the equation to compute what in lecture we called entropyBeforeSplit for musicPreference. You do not have to actually solve (i.e., calculate the terms in) the equation, just set up the equation.  (1.5 pts.)
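As a reminder of the general shape (a sketch using the standard entropy definition; the three classes come from the musicPreference column, and p(c) is the fraction of the 8 rows in class c):

```latex
\mathit{entropyBeforeSplit} =
    -\sum_{c \,\in\, \{\mathrm{rock},\ \mathrm{classical},\ \mathrm{country}\}} p(c) \log_2 p(c)
```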

 
        b. Set up the equation to compute entropy for bookPreference when its value is mystery. That is, a tree with bookPreference at the root would have three branches (one for sciFiction, one for mystery, and one for nonFiction), requiring us to compute entropySciFiction, entropyMystery, and entropyNonFiction; here we only want you to set up the equation to compute entropyMystery. You do not have to actually solve (i.e., calculate the terms in) the equation, just set it up.  (1.5 pts.)
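Again only the general shape (here p(c | mystery) is the fraction of the mystery rows whose musicPreference is c):

```latex
\mathit{entropyMystery} =
    -\sum_{c} p(c \mid \mathrm{mystery}) \log_2 p(c \mid \mathrm{mystery})
```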

 
        c. Suppose that instead of considering bookPreference to be the root of a decision tree for musicPreference, we had instead considered gender. Set up the equation to compute information gain for gender given the variables specified below. (1.5 pts.)

entropy before any split:   X
entropy for gender = M:     Y
entropy for gender = F:     Z
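In the weighted-average form from lecture (a sketch; n_M and n_F denote the numbers of M and F rows, and n the total number of rows):

```latex
\mathit{gain}(\mathrm{gender}) = X - \frac{n_M}{n}\,Y - \frac{n_F}{n}\,Z
```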

 


6. Consider the following dataset where the decision attribute is play:

outlook   temperature   humidity   windy   play
good      warm          high       FALSE   no
good      warm          high       TRUE    no
bad       warm          high       FALSE   no
bad       cool          normal     FALSE   yes
bad       cool          normal     TRUE    yes
good      cool          normal     FALSE   yes
good      warm          normal     TRUE    yes
bad       cool          high       TRUE    yes
bad       warm          normal     FALSE   yes
good      cool          high       TRUE    no

    a. Do ONLY the necessary calculations to determine what the ROOT NODE would be for a CART decision tree. YOU MUST SHOW YOUR WORK!!!   (6.5 pts.)
Note: If there’s a tie for which attribute you’d pick to be the root of the tree, just list those attributes and say that we could pick from them.
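A minimal sketch of the impurity bookkeeping behind part a (all four attributes happen to be binary, so each one defines exactly one CART split; the rows are hard-coded from the table above):

```python
from collections import Counter

# The ten rows from problem 6: (outlook, temperature, humidity, windy, play).
rows = [
    ("good", "warm", "high",   "FALSE", "no"),
    ("good", "warm", "high",   "TRUE",  "no"),
    ("bad",  "warm", "high",   "FALSE", "no"),
    ("bad",  "cool", "normal", "FALSE", "yes"),
    ("bad",  "cool", "normal", "TRUE",  "yes"),
    ("good", "cool", "normal", "FALSE", "yes"),
    ("good", "warm", "normal", "TRUE",  "yes"),
    ("bad",  "cool", "high",   "TRUE",  "yes"),
    ("bad",  "warm", "normal", "FALSE", "yes"),
    ("good", "cool", "high",   "TRUE",  "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

for i, attr in enumerate(attrs):
    # Split the play labels on the two values of this binary attribute.
    split_val = sorted({row[i] for row in rows})[0]
    left  = [row[4] for row in rows if row[i] == split_val]
    right = [row[4] for row in rows if row[i] != split_val]
    n = len(rows)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    print(f"{attr}: weighted Gini after split = {weighted:.4f}")
# CART picks the attribute whose split yields the lowest weighted Gini.
```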

    b. Write a Python program that runs the CART algorithm on this dataset. Include both your source code and a screenshot showing the resulting tree. The dataset (hw2_prob6.csv) is posted on Canvas along with this assignment.          (3.5 pts.)
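One possible starting point for part b (a sketch, not a from-scratch CART implementation: it leans on scikit-learn's DecisionTreeClassifier, whose default Gini criterion matches CART, and one-hot encodes the nominal attributes; if your class requires implementing the algorithm yourself, treat this only as a way to sanity-check your tree):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the dataset posted on Canvas (columns as in the table above).
df = pd.read_csv("hw2_prob6.csv")

# One-hot encode the nominal attributes; sklearn trees require numeric input.
X = pd.get_dummies(df.drop(columns=["play"]))
y = df["play"]

# criterion="gini" is CART's impurity measure; leave the tree unpruned.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Print a text rendering of the fitted tree.
print(export_text(clf, feature_names=list(X.columns)))
```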

    c. Run SimpleCart in Weka on this dataset specifying the options minNumObj = 1 and usePrune = False. Show a screenshot of the CART decision tree that it produces.  (0.5 pt.)
Note: SimpleCart may not be installed with the initial download of Weka 3.8, in which case you will need to install the corresponding package.

