Homework 2: Managing Data

This is the second project for CS1210. Please don’t wait to start on this project, or I can guarantee you won’t finish it. So, just to review:




This is a challenging project, and you have been given two weeks to work on it. If you wait to begin, you will almost surely fail to complete it. The best strategy for success is to work on the project a little bit every day.



I will cover the Naive Bayes classifier in class; we’ll take a day to do so soon. I would start with the first part of the assignment and wait to deal with the prediction part of the assignment until we cover Naive Bayes.



The work you hand in should be only your own; you are not to work with or discuss your work with any other student. Sharing your code or referring to code produced by others is a violation of the student honor code and will be dealt with accordingly.



Help is always available from the TAs or the instructor during their posted office hours. You may also post general questions on the discussion board (although you should never post your Python code). I have opened a discussion board topic specifically for HW2.



Background




In this assignment, we will be looking at a dataset of survey responses collected from subjects asked to evaluate their risk tolerance. Each subject was asked to consider two hypothetical lotteries:




Lottery A: you have a 50% chance of success, with a payout of $100.



Lottery B: you have a 90% chance of success, with a payout of $20.



Assuming the subject was given the chance to bet $10, they were asked which lottery they would choose (we’ll call this the result field, with possible values {lottery a, lottery b}).
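For intuition, note the expected payouts: Lottery A yields 0.5 × $100 = $50 on average, while Lottery B yields 0.9 × $20 = $18. Lottery A pays far more in expectation but fails half the time, which is why the choice serves as a proxy for risk tolerance.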




In addition to their choice of lottery, subjects were asked to provide a number of demographic variables, here denoted by field name and showing possible values:




gender = {male, female}

age = {18-29, 30-44, 45-60, 60}

income = {$0-$24k, $25k-$49k, $50k-$99k, $100k-$149k, $150k+}

education = {high school, some college, bachelors degree, graduate degree}

location = {east north central, east south central, middle atlantic, mountain, new england, pacific, south atlantic, west north central, west south central}




as well as answer a number of additional questions:




smoke = {yes, no}

drink = {yes, no}

gamble = {yes, no}

skydive = {yes, no}

speed = {yes, no}

cheat = {yes, no}

steak = {yes, no}

cook = {rare, medium rare, medium, medium well, well}













The header of the csv file provided gives the full wording of the questions; here, we've used the shorter field names from my Python solution.




In this project, we will be (i) loading the data into Python and representing it as a dictionary, (ii) plotting aspects of the data as a form of interactive data exploration, (iii) applying a machine learning technique called a Naive Bayes classifier in an attempt to learn a relation that allows us to predict someone's preference based on their demographic and risk-tolerance variables, and (iv) evaluating our learning algorithm in a principled fashion.




The rest of these instructions outline the functions that you should implement, describing their input/output behaviors. As you work on each function, test your work on the data provided to make sure your code functions as expected. Feel free to upload versions of your code as you go; we only grade the last version uploaded, so this practice allows you to "lock in" working partial solutions prior to the deadline. Finally, some general guidance.




Start with the template file and immediately complete the hawkid() function so that we may properly credit you for your work. Test hawkid() to ensure it in fact returns your own hawkid as the only element in a single-element tuple. Also be sure to add your name and section to the comments at the top. Ignore this instruction at your own risk: some of you have become serial offenders, and I have had to manually regrade your code too many times!



The template file also contains two useful lists, fields and values, which you can use when parsing the data. Note that these lists contain all lower-case renditions of the data field names and possible values, respectively (see the descriptions above), so you will likely have to convert some inputs to lower case in order to compare them with the values provided.



At the risk of screwing up the autograder (and your grade!), do not change the values of fields and values; use them as provided.



As with HW1, you will be graded on both the correctness and the quality of your code, including the quality of your comments!



As usual, respect the function signatures provided.



Be careful with iteration; always choose the most appropriate form of iteration (comprehension, while, or for) as the function mandates. Poorly selected iterative forms may be graded down, even if they work!



As in HW1, we will use the autograder to determine a portion of your grade, with the TAs assigning the remaining points based on style and the overall quality of your code.



def readData(filename='steak-risk-survey.csv', fields=fields, values=values):




This function returns a list, where each element of the list is a dictionary corresponding to a single line of the data from the input file. You will find the csv module and the csv.reader() function useful here; you can read about these at the following URL:




https://docs.python.org/3.5/library/csv.html




My solution makes use of a helper function that converts each data element — a row in the original csv file provided — into a dictionary, with the main function tasked with reading in each line of data and producing the list returned.




A few caveats are in order. First, we don't care about the 'RespondentID' field, so it can be safely ignored. Second, some rows in the input file don't correspond to survey responses and can also be ignored. Finally, any survey response that fails to provide a choice of lottery (the second column in the original csv file, immediately following the 'RespondentID' column, and labeled 'result' in the fields variable) can also be ignored. Following these caveats, my code yields:




D = readData('steak-risk-survey.csv')
len(D)
546
D[0]
{'result': 'lottery b'}
D[1]
{'result': 'lottery a', 'smoke': 'no', 'drink': 'yes', 'gamble': 'no', \\
'skydive': 'no', 'speed': 'no', 'cheated': 'no', 'steak': 'yes', \\
'cook': 'medium rare', 'gender': 'male', 'age': '60', \\
'income': '50,000-99,999', 'education': 'some college or associate degree', \\
'location': 'east north central'}




where D[0] corresponds to survey respondent 3237565956 and D[1] corresponds to survey respondent 3234982343 in the original csv file (I’ve used \\ to indicate an extra line feed for readability).
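To make the expected structure concrete, here is a minimal sketch of one way readData() might be organized, assuming the column order described above (lottery choice immediately after 'RespondentID') and the template-provided fields and values lists; this is an illustration, not the graded solution:

import csv

def readData(filename='steak-risk-survey.csv', fields=fields, values=values):
    # Build one dictionary per valid survey response.
    D = []
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            # Skip header rows and any response with no lottery choice.
            if len(row) < 2 or row[1].lower() not in ('lottery a', 'lottery b'):
                continue
            # Pair each column (after 'RespondentID') with its field name,
            # keeping only nonempty answers, lower-cased for comparison.
            D.append({field: value.lower()
                      for field, value in zip(fields, row[1:]) if value})
    return D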




def showPlot(D, field, values):




For a given field and its associated values, plots a chart showing lottery preference like the ones shown below for showPlot(D, 'smoke', ('yes', 'no')) and showPlot(D, 'age', ('18-29', '30-44', '45-60', '60')), where D is the list of examples returned by readData().

[Two example bar charts, one breaking down lottery preference by smoke and one by age, appear here in the original handout.]
Of course, your function should work for any field/value pairs from the original data. An excellent reference on using matplotlib in Python can be found here:




https://pythonspot.com/en/matplotlib-bar-chart/
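For reference, a grouped bar chart along these lines can be produced with matplotlib roughly as follows; this is only a sketch (the exact styling of the handout's charts may differ), and it assumes each example stores its lottery choice under the 'result' key:

import matplotlib.pyplot as plt

def showPlot(D, field, values):
    # Count each lottery preference for every value of the chosen field.
    a = [sum(1 for d in D if d.get(field) == v and d['result'] == 'lottery a')
         for v in values]
    b = [sum(1 for d in D if d.get(field) == v and d['result'] == 'lottery b')
         for v in values]
    x = range(len(values))
    width = 0.35    # offset so the two bar series sit side by side
    plt.bar([i - width / 2 for i in x], a, width, label='lottery a')
    plt.bar([i + width / 2 for i in x], b, width, label='lottery b')
    plt.xticks(list(x), values)
    plt.xlabel(field)
    plt.ylabel('count')
    plt.legend()
    plt.show()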




def train(D, fields=fields, values=values):




This function trains a Naive Bayes classifier to predict lottery preference based on a collection of the other variables. The basic idea here is quite simple. In the absence of any other information about an individual, we can predict that individual's lottery preference simply by comparing the probabilities P('lottery a') and P('lottery b'), which can be approximated by the frequencies with which these appear in the population. In other words, if there are 1000 examples in D and 650 of these prefer Lottery B, then a new individual's lottery preference is probably also Lottery B, given that 65% of the observed individuals prefer Lottery B over Lottery A.













We can extend this idea to conditional probabilities. If I know whether or not the individuals in D are smokers, then I can "condition" my prediction on whether the new individual is a smoker. Consider the following breakdown:






              Lottery A    Lottery B    Total
smoker            200          150        350
non-smoker        150          500        650
Total             350          650       1000



Clearly, if I know nothing about the new individual, then I should predict that they will prefer Lottery B (650 to 350). But if I know the new individual is a smoker, then I should predict that they prefer Lottery A (200 to 150). This is the key idea underlying the Naive Bayes algorithm; a really good and more complete explanation will be given in class.

Now what about train()? First, this function should produce a table of counts analogous to the one shown above for each of the specified fields and their values. The easiest way to code such a function is to create a dictionary of dictionaries of dictionaries, where the outermost dictionary is indexed by field in fields, the next dictionary is indexed by lottery preference, and the innermost dictionary is indexed by field value. Once the table is constructed, the values within should be converted to probabilities by an appropriate series of sums and divisions. So, for example:




P['smoke']
{'lottery a': {'no': 0.821, 'yes': 0.179}, 'lottery b': {'no': 0.863, 'yes': 0.137}}




where I’ve reduced the number of decimals shown to make them fit on this page. This classifier (the table of probabilities, P) is now ready to use for prediction purposes.
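As one illustration of the dictionary-of-dictionaries layout (not necessarily the only acceptable design), train() might look roughly like this, assuming fields and values are parallel sequences as in the template, with 'result' paired with ('lottery a', 'lottery b'):

def train(D, fields=fields, values=values):
    P = {}
    lotteries = ('lottery a', 'lottery b')
    for field, vals in zip(fields, values):
        if field == 'result':
            # Store the priors P('lottery a') and P('lottery b') directly.
            P[field] = {lot: sum(1 for d in D if d['result'] == lot) / len(D)
                        for lot in lotteries}
            continue
        # Table of counts: P[field][lottery][value]
        P[field] = {lot: {v: 0 for v in vals} for lot in lotteries}
        for d in D:
            if field in d and d[field] in vals:
                P[field][d['result']][d[field]] += 1
        # Convert each row of counts to conditional probabilities.
        for lot in lotteries:
            total = sum(P[field][lot].values())
            for v in vals:
                P[field][lot][v] = P[field][lot][v] / total if total else 0.0
    return P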




def predict(example, P, fields=fields, values=values):




Prediction will be explained in class. The short answer is that the probability a new individual prefers a given lottery is proportional to the product of that lottery's probabilities taken over all of the individual's matching field values.
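Under the table layout sketched above, prediction can be as simple as multiplying out the relevant entries for each lottery and picking the larger product; a hedged sketch, again assuming the priors are stored under 'result':

def predict(example, P, fields=fields, values=values):
    best, best_p = None, -1.0
    for lottery in ('lottery a', 'lottery b'):
        p = P['result'][lottery]    # start from the prior
        for field in fields:
            # Multiply in P(value | lottery) for every field the
            # example actually answered.
            if field != 'result' and field in example and field in P:
                p *= P[field][lottery].get(example[field], 0.0)
        if p > best_p:
            best, best_p = lottery, p
    return best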




def test(D, P, fields=fields, values=values):




The test() function takes a classifier, P, and a collection of examples, D, and compares the prediction of the classifier using only the specified fields to the actual observed value, reporting a percentage of correct outcomes.
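Given predict(), the testing loop is short; a sketch, assuming every example in D carries a 'result' key (which readData()'s filtering guarantees):

def test(D, P, fields=fields, values=values):
    # Fraction of examples whose predicted lottery matches the observed one.
    correct = sum(1 for d in D if predict(d, P, fields, values) == d['result'])
    return correct / len(D)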




Testing Your Code




I have provided a function, evaluate(fields, values), that you can use to manage the process of evaluating a subset of fields as predictive elements. When invoked, this function will reserve 10% of the data for testing, train the Naive Bayes prediction algorithm with the remaining 90% of the data, and then report the percentage of correct predictions made by the system, alongside that of random guessing for comparison:




evaluate()
NaiveBayes=0.5016666666666666, random guessing=0.49370370370370364

evaluate(fields=['result', 'gamble'], values=[('lottery a', 'lottery b'), ('no', 'yes')])
NaiveBayes=0.532962962962963, random guessing=0.5083333333333333

evaluate(fields=['result', 'gamble'], values=[('lottery a', 'lottery b'), ('no', 'yes')])
NaiveBayes=0.5324074074074077, random guessing=0.4981481481481481

evaluate(fields=['result', 'drink', 'cheat'], \\
         values=[('lottery a', 'lottery b'), ('no', 'yes'), ('no', 'yes')])
NaiveBayes=0.5051851851851853, random guessing=0.5007407407407407

evaluate(fields=['result', 'drink', 'cheat'], \\
         values=[('lottery a', 'lottery b'), ('no', 'yes'), ('no', 'yes')])
NaiveBayes=0.505, random guessing=0.5112962962962964

evaluate(fields=['result', 'steak', 'gender', 'age'], \\
         values=[('lottery a', 'lottery b'), ('no', 'yes'), ('male', 'female'), \\
                 ('18-29', '30-44', '45-60', '60')])
NaiveBayes=0.5309259259259261, random guessing=0.49907407407407406



which illustrates how gambling, drinking, cheating, steak preference, gender, and age all provide some insight into risk tolerance!































































































































































