$24
In this assignment, you will implement the Decision Tree learning algorithm. You will use a data-set that includes mushroom records drawn from the Audubon Society Field Guide to North American Mushrooms (1981). The database describes samples from 23 species of gilled mushrooms in the families Agaricus and Lepiota. Each sample is described by a string of 23 characters, on a single line of the provided le mushroom_data.txt; each such string describes the values of 22 attributes for each sample (described below), and the last character corresponds to the correct classi cation of the mushroom into either edible (e) or poisonous (p) mushrooms. For example, the rst two samples from the data-set are a poisonous and then an edible species, as follows:
x s n t p f c n k e e s s w w p w o p k s u p x s y t a f c b k e c s s w w p w o p n n g e
The 22 attribute variables, and their values, are given in Table 1, at the end of this document.
(and are also tabulated in the le properties.txt, for reference).
For full points, your program will work as follows:
(5 pts.) The program will execute from the command line, and handle all input/output using standard system IO (in Java, e.g., this is System.in and System.out, and in C++ you would use cin and cout, etc.).
When the program begins, it should ask the user for two inputs:
A training set size: this should be some integer value S that is a multiple of 250, within bounds 250 S 1000.
A training set increment: this should be some integer value I 2 f10; 25; 50g (that is, one of those three values only).
Your program should indicate what sorts of values it expects, and validate the data; if the user does not enter an integer in the proper range, then the program should terminate with appropriate, meaningful in-program messages. (It should not simply crash, throw exception messages, or terminate with no explanation, for instance.)
Note: The assignment folder contains a sample runs of the program so you can see what is expected in this step and later steps; your output should be much the same, if not identical.
(40 pts.) Once the user has made the necessary choices, the program will proceed automat-ically without any more user intervention. It will load the le describing the data-set (and the le of attributes and values, if necessary for your implementation), and then:
The program will choose a training set of the size S chosen by the user. It will do this by choosing S distinct examples at random from the entire data-set. These are removed from the data-set so that any testing of the tree will not be performed on the data instances used to train.
It will choose an increment I from the training set and build a tree using those I elements. For example, if the user chooses S = 250 and I = 25, it will rst choose 250 examples
1
from the overall data and set them aside; it will then take 25 elements of those, and use them to build the tree.
The tree will be built using the algorithm speci ed in Russell and Norvig, Figure 18:5, page 702; the choice of nodes for expansion in the Importance() function will be done using the information gain heuristic covered in section 18:3:4, and covered in class.
After building the tree, it will run through all the data in the original data-set (after the training set has been removed), and check the accuracy of the tree on that data. This will be reported in terms of the percentage of examples for which the tree is correct (to 4 decimal places, as seen in the sample output).
The algorithm will repeat steps (b){(d), increasing the increment each time on the size of the set used to build the tree. Again, if the user had chosen S = 250 and I = 25, this will mean the algorithm will repeat the tree-building and testing process 10 times, using 25; 50; 75; : : : ; 250 items of the available training data. (For any legal values chosen, the nal iteration will use all of the elements of the training set.)
Once the algorithm has nished generating and testing trees, it should summarize the testing results at the end of its print-out.
Again, see the sample runs included with this document to see the basic format expected. Those sample runs contain one where the user chose the values S = 250; I = 10 and one where they chose S = 1000; I = 50.
(5 pts.) After completing and bug-checking your program, run it twice with di erent sets of parameters, and save the results to two separate text- les. Produce one graph for each run, using whatever software you like, showing the training set size as x-coordinate and the percentage correct on testing as y-coordinate. Image les of the two graphs should be saved in some standard image format (PDF, PNG, etc.); two examples have been included with this document, giving graphs for the two test-runs also included.
(5 pts.:) Required of students in a graduate (519) section; optional for other (419) students. Your program should provide the option to be run in verbose mode (via a command-line argument or the like). If this option is chosen, then after building the nal tree, using all training data, but before the summary results, the program should print out the structure of the nal tree that it built. This print-out should contain information about the attributes used for splitting at each internal node (referenced either by name of attribute or number in the attribute table), along with the value of the attribute used on each branch (referenced either by single-letter code or full name). Leaves of the tree should be labeled Edible or Poison, and the branching structure should be evident. A sample of what this might look like is included in the second of the two sample runs. Other formats, if readable, are permissible.
In the latter case, there is some extra output; this is covered in the bonus/graduate portion of the assignment, described below. If you do not do that portion of the assignment, then you can ignore that part of that output le; the remainder of it is what is expected from any solution to the assignment, in any case.
2
You will hand in a single compressed archive containing the following:
Complete source-code for your program. This can be written in any language you choose, but should follow proper software engineering principles for said language, including meaningful comments and appropriate code style. Code should be in its own folder within the archive, to keep it organized and separate from the rest of the submission.
A README le that provides instructions for how the program can be run. It is expected that this will consist of the command-line instructions necessarily. If you have completed the nal part of the assignment (the optional verbose mode), then your instructions should include the commands needed to activate that mode and see the tree print-out at the end.
Two sample runs, stored in text form, of the program output. Each program run should use di erent parameters for both S and I. If you do the verbose mode portion of the assignment, then one run should use that mode.
Two distinct graphs, one for each of the runs in your text- les. File-names or text/titles within the les should indicate which graph accompanies which run.
A table of attribute values for the mushroom data-set is on the next page.
3
#
Property
Values
1.
cap shape:
bell=b, conical=c, convex=x, at=f, knobbed=k, sunken=s
2.
cap surface:
brous=f, grooves=g, scaly=y, smooth=s
3.
cap color:
brown=n, bu =b, cinnamon=c, gray=g, green=r, pink=p,
purple=u, red=e, white=w, yellow=y
4.
bruises:
bruises=t, no=f
5.
odor:
almond=a, anise=l, creosote=c, shy=y, foul=f, musty=m,
none=n, pungent=p, spicy=s
6.
gill attachment:
attached=a, descending=d, free=f, notched=n
7.
gill spacing:
close=c, crowded=w, distant=d
8.
gill size:
broad=b, narrow=n
9.
gill color:
black=k, brown=n, bu =b, chocolate=h, gray=g, green=r,
orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10.
stalk shape:
enlarging=e, tapering=t
11.
stalk root:
bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z,
rooted=r
12.
stalk surface above ring:
brous=f, scaly=y, silky=k, smooth=s
13.
stalk surface below ring:
brous=f, scaly=y, silky=k, smooth=s
14.
stalk color above ring:
brown=n, bu =b, cinnamon=c, gray=g, orange=o, pink=p,
red=e, white=w, yellow=y
15.
stalk color below ring:
brown=n, bu =b, cinnamon=c, gray=g, orange=o, pink=p,
red=e, white=w, yellow=y
16
veil type:
partial=p, universal=u
17.
veil color:
brown=n, orange=o, white=w, yellow=y
18.
ring number:
none=n, one=o, two=t
19.
ring type:
cobwebby=c, evanescent=e, aring=f, large=l, none=n, pen-
dant=p, sheathing=s, zone=z
20.
spore print color:
black=k, brown=n, bu =b, chocolate=h, green=r, or-
ange=o, purple=u, white=w, yellow=y
21.
population:
abundant=a, clustered=c, numerous=n, scattered=s, sev-
eral=v, solitary=y
22.
habitat:
grasses=g, leaves=l, meadows=m, paths=p, urban=u,
waste=w, woods=d
Table 1: Mushroom attributes in data-set.
4