Naive Bayes algorithm on speed dating data
In this programming assignment, you are given a dataset of experimental speed dating events, and your task is to predict whether a participant in a date decides to give his or her partner a second date after the speed dating event (i.e., the "decision" column in the dataset). You will implement algorithms to learn and apply naive Bayes classification (NBC) models to make such predictions.
More specifically, the dataset dating-full.csv is to be used for this assignment. This .csv file contains information for 6744 speed dating events in comma-separated format. The file field-meaning.pdf contains the complete description of the meaning of each column of the dataset.
You are asked to implement your algorithms in Python. Note that although there are many data mining algorithms available online, for this assignment (as well as the next few programming assignments) you must design and implement your own versions of the algorithms. DO NOT use any publicly available code, including libraries such as sklearn. Your code will be checked against public implementations. In addition, we will not provide separate testing data to you. You are asked to design your own tests to ensure that your code runs correctly and meets the specifications below. Note: You may use the pandas library for data processing purposes.
To make it easier to refer to a few sets of columns in the dataset, we will use the following terms (usages will be italicized):
1. preference scores of participant: [attractive important, sincere important, intelligence important, funny important, ambition important, shared interests important]
2. preference scores of partner: [pref o attractive, pref o sincere, pref o intelligence, pref o funny, pref o ambitious, pref o shared interests]
3. continuous valued columns: All columns other than [gender, race, race o, samerace, field, decision].
4. rating of partner from participant: [attractive partner, sincere partner, intelligence partner, funny partner, ambition partner, shared interests partner]
In the following sections, we specify a number of steps you are asked to complete for this assignment.
Note that all results in sample outputs are fictitious and for representation only.
• Preprocessing (4 pts)
Write a Python script named preprocess.py which reads the file dating-full.csv as input and performs the following operations to output a new file dating.csv:
(i) The format of values in some columns of the dataset is not unified. Strip the surrounding quotes in the values for columns race, race o, and field (e.g., ‘Asian/Pacific Islander/Asian-American’ → Asian/Pacific Islander/Asian-American), count how many cells are changed after this pre-processing step, and output this number.
Expected output line: Quotes removed from [count-of-changed-cells] cells.
(ii) Convert all the values in the column field to lowercase if they are not already in lowercase (e.g., Law → law). Count the number of cells that are changed after this pre-processing step, and output this number.
Expected output line: Standardized [count-of-changed-cells] cells to lower case.
(iii) Use label encoding to convert the categorical values in columns gender, race, race o, and field to numeric values (starting from 0). The process of label encoding works by mapping each categorical value of an attribute to an integer between 0 and n_values − 1, where n_values is the number of distinct values for that attribute. Sort the values of each categorical attribute lexicographically before you start the encoding process (a minimal sketch of one possible implementation appears after the expected output lines below). You are then asked to output the mapped numeric values for ‘male’ in the gender column, for ‘European/Caucasian-American’ in the race column, for ‘Latino/Hispanic American’ in the race o column, and for ‘law’ in the field column.
Expected output lines:
Value assigned for male in column gender: [value-for-male].
Value assigned for European/Caucasian-American in column race: [value-for-European/Caucasian-American].
Value assigned for Latino/Hispanic American in column race o: [value-for-Latino/Hispanic American].
Value assigned for law in column field: [value-for-law].
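One possible way to implement this label encoding with pandas is sketched below. The helper name label_encode and the exact column headers (e.g., race_o) are assumptions for illustration only; adapt the sketch to the actual CSV headers and to your own preprocess.py.

    import pandas as pd

    def label_encode(series):
        # Map each distinct value, sorted lexicographically, to its index (0, 1, 2, ...).
        mapping = {value: code for code, value in enumerate(sorted(series.unique()))}
        return series.map(mapping), mapping

    # Usage sketch (column names are assumptions about the CSV headers):
    # df = pd.read_csv('dating-full.csv')
    # df['gender'], gender_map = label_encode(df['gender'])
    # print('Value assigned for male in column gender: %d.' % gender_map['male'])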
(iv) As the speed dating experiments were conducted in several different batches, the instructions participants received across different batches vary slightly. For example, in some batches of experiments participants are asked to allocate a total of 100 points among the six attributes (i.e., attractiveness, sincerity, intelligence, fun, ambition, shared interests) to indicate how much they value each of these attributes in their romantic partner; that is, the values in the preference scores of participant columns of a row should sum up to 100 (similarly, values in the preference scores of partner columns of a row should also sum up to 100). In some other batches of experiments, participants are not explicitly instructed to do so.
To deal with this problem, let’s conduct one more pre-processing step for values in the preference scores of participant and preference scores of partner columns. For each row, first sum up all the values in the six columns that belong to the set preference scores of participant (denote the sum as total), and then transform the value for each column in the set preference scores of participant in that row as follows: new value = old value / total. Then conduct a similar transformation for values in the set preference scores of partner (a minimal sketch of this step appears after the expected output lines below).
Finally, you are asked to output the mean values for each column in these two sets after the transformation.
Expected output lines:
Mean of attractive important: [mean-rounded-to-2-digits].
...
Mean of shared interests important: [mean-rounded-to-2-digits].
Mean of pref o attractive: [mean-rounded-to-2-digits].
...
Mean of pref o shared interests: [mean-rounded-to-2-digits].
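A minimal sketch of this normalization step, assuming the DataFrame df from the earlier steps and column headers that follow the terminology section (the actual CSV headers may differ; treat the names below as assumptions):

    participant_cols = ['attractive_important', 'sincere_important', 'intelligence_important',
                        'funny_important', 'ambition_important', 'shared_interests_important']
    # Normalize each row so its six preference scores sum to 1, then report the column means.
    totals = df[participant_cols].sum(axis=1)
    df[participant_cols] = df[participant_cols].div(totals, axis=0)
    for col in participant_cols:
        print('Mean of %s: %.2f.' % (col.replace('_', ' '), df[col].mean()))
    # Repeat the same transformation for the preference scores of partner columns.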
In summary, below are the sample inputs and outputs we expect to see. We expect 18 lines of output in total (the numbers are fictitious):
$python preprocess.py dating-full.csv dating.csv
Quotes removed from 123 cells.
Standardized 456 cells to lower case.
Value assigned for male in column gender: 0.
Value assigned for European/Caucasian-American in column race: 1.
Value assigned for Latino/Hispanic American in column race o: 4.
Value assigned for law in column field: 2.
Mean of attractive important: 0.12.
...
Mean of shared interests important: 0.34.
Mean of pref o attractive: 0.45.
...
Mean of pref o shared interests: 0.56.
• Visualizing interesting trends in data (6 pts)
(i) First, let’s explore how males and females differ in terms of the attributes they value the most in their romantic partners. Please perform the following task on dating.csv and include your visualization code in a file named 2_1.py.
(a) Divide the dataset into two sub-datasets by the gender of the participant.
(b) Within each sub-dataset, compute the mean values for each column in the set preference scores of participant.
(c) Use a single barplot to contrast how females and males value the six attributes in their romantic partners differently. Please use the color of the bars to indicate gender (see the sketch after this question).
What do you observe from this visualization? What characteristics do males favor in their romantic partners? How does this differ from what females prefer?
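One possible way to produce the barplot in 2_1.py with pandas and matplotlib is sketched below. The column names and the numeric gender encoding produced by preprocessing are assumptions; adjust them to your own encoding.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('dating.csv')
    pref_cols = ['attractive_important', 'sincere_important', 'intelligence_important',
                 'funny_important', 'ambition_important', 'shared_interests_important']
    # Mean preference score per attribute, computed separately for each gender code.
    means = df.groupby('gender')[pref_cols].mean().T
    means.plot(kind='bar')  # one colored bar per gender for each attribute
    plt.ylabel('mean preference score')
    plt.savefig('2_1_barplot.png')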
(ii) Next, let’s explore how a participant’s rating of their partner on each of the six attributes relates to how likely he/she is to give the partner a second date. Please perform the following task on dating.csv and include your visualization code in a file named 2_2.py.
(a) Given an attribute in the set rating of partner from participant (e.g., attractive partner), determine the number of distinct values for this attribute.
(b) Given a particular value for the chosen attribute (e.g., a value of 10 for the attribute ‘attractive partner’), compute the fraction of participants who decide to give the partner a second date among all participants whose rating of the partner on the chosen attribute (e.g., attractive partner) is the given value (e.g., 10). We refer to this fraction as the success rate for the group of partners whose rating on the chosen attribute is the specified value.
(c) Repeat the above process for all distinct values on each of the six attributes in the set rating of partner from participant.
(d) For each of the six attributes in the set rating of partner from participant, draw a scatter plot using the information computed above. Specifically, for the scatter plot of a particular attribute (e.g., attractive partner), use the x-axis to represent the different values of that attribute and the y-axis to represent the success rate (see the sketch after this question).
What do you observe from these scatter plots?
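A sketch of the success-rate computation and scatter plots for 2_2.py is given below. The column names follow the terminology section and are assumptions about the actual CSV headers.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('dating.csv')
    rating_cols = ['attractive_partner', 'sincere_partner', 'intelligence_partner',
                   'funny_partner', 'ambition_partner', 'shared_interests_partner']
    for col in rating_cols:
        # For each distinct rating value, the success rate is the mean of the 0/1 decision column.
        success = df.groupby(col)['decision'].mean()
        plt.figure()
        plt.scatter(success.index, success.values)
        plt.xlabel(col)
        plt.ylabel('success rate')
        plt.savefig('2_2_%s.png' % col)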
• Convert continuous attributes to categorical attributes (3 pts)
Write a Python script named discretize.py to discretize all columns in continuous valued columns by splitting them into 5 bins of equal width over the range of values for that column (check field-meaning.pdf for the range of each column; for those columns that you finished pre-processing in Question 1(iv), the range should be considered as [0, 1]). The script reads dating.csv as input and produces dating-binned.csv as output. As output of your script, please print the number of items in each of the 5 bins (bins are sorted from small value ranges to large value ranges) for each column in continuous valued columns. A minimal binning sketch follows the sample output below.
The sample inputs and outputs are as follows. We expect 47 lines of output, and the order of the attributes in the output should be the same as the order in which they occur in the dataset:
$python discretize.py dating.csv dating-binned.csv
age: [3203 1188 1110 742 511]
age o: [2151 1292 1233 1383 685]
importance same race: [1282 4306 1070 58 28]
...
like: [119 473 2258 2804 1090]
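A minimal sketch of the equal-width binning, assuming each column’s value range [lo, hi] is known (from field-meaning.pdf, or [0, 1] for the normalized preference columns); the helper name bin_column is illustrative only:

    import pandas as pd

    def bin_column(series, lo, hi, num_bins=5):
        # Build num_bins equal-width bin edges over [lo, hi] and assign labels 0..num_bins-1.
        edges = [lo + (hi - lo) * i / num_bins for i in range(num_bins + 1)]
        binned = pd.cut(series, bins=edges, labels=list(range(num_bins)), include_lowest=True)
        # Adjust the print formatting so it matches the expected output exactly.
        print('%s: %s' % (series.name, list(binned.value_counts().sort_index())))
        return binned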
• Training-Test Split (2 pts)
Use the sample function from pandas with the parameters initialized as random_state=47, frac=0.2 to take a random 20% sample from the entire dataset. This sample will serve as your test dataset, and the rest will be your training dataset. (Note: the use of random_state will ensure all students have the same training and test datasets; incorrect or missing initialization of this parameter will lead to non-reproducible results.) Create a new script called split.py that takes dating-binned.csv as input and outputs trainingSet.csv and testSet.csv.
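A minimal sketch of split.py under this specification (dropping the sampled indices is one way to recover the remaining 80%; adapt as needed):

    import pandas as pd

    df = pd.read_csv('dating-binned.csv')
    # Take a 20% random sample with the fixed seed as the test set; the remainder is the training set.
    test = df.sample(frac=0.2, random_state=47)
    train = df.drop(test.index)
    train.to_csv('trainingSet.csv', index=False)
    test.to_csv('testSet.csv', index=False)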
• Implement a Naive Bayes Classifier (15 pts)
Learn an NBC model using the data in the training dataset, and then apply the learned model to the test dataset.
Evaluate the accuracy of your learned model and print out the model’s accuracy on both the training dataset and the test dataset as specified below.
Code Specification:
Write a function named nbc(t_frac) to train your NBC, where the parameter t_frac represents the fraction of the training data to sample from the original training set. Use the sample function from pandas with the parameters initialized as random_state=47, frac=t_frac to generate random samples of training data of different sizes. A minimal sketch of such a training function appears below.
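The sketch below is one possible structure for nbc(t_frac). It assumes a module-level DataFrame train_df loaded from trainingSet.csv whose label column is decision and whose remaining columns are the (discretized) attributes; the Laplace smoothing shown is an illustrative choice, not something the assignment mandates. Adapt every detail to your own design.

    import pandas as pd

    def nbc(t_frac):
        # Sample the requested fraction of the training data with the fixed seed.
        data = train_df.sample(frac=t_frac, random_state=47)
        label_col = 'decision'
        attrs = [c for c in data.columns if c != label_col]
        # Class priors P(decision = y).
        priors = data[label_col].value_counts(normalize=True).to_dict()
        # Conditional probabilities P(attribute = v | decision = y), with Laplace smoothing (an assumption).
        cond = {}
        for a in attrs:
            n_vals = data[a].nunique()
            cond[a] = {}
            for y, group in data.groupby(label_col):
                counts = group[a].value_counts()
                cond[a][y] = {v: (counts.get(v, 0) + 1) / (len(group) + n_vals)
                              for v in data[a].unique()}
        return priors, cond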
1. Use all the attributes and all training examples in trainingSet.csv to train the NBC by calling your nbc(t_frac) function with t_frac = 1. After obtaining the learned model, apply it to all examples in the training dataset (i.e., trainingSet.csv) and the test dataset (i.e., testSet.csv) and compute the accuracy of each. Please put your code for this question in a file called 5_1.py (a sketch of the prediction and accuracy computation follows the sample output below).
Expected output lines:
Training Accuracy: [training-accuracy-rounded-to-2-decimals]
Testing Accuracy: [testing-accuracy-rounded-to-2-decimals]
The sample inputs and outputs are as follows:
$python 5_1.py
Training Accuracy: 0.71
Testing Accuracy: 0.68
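One possible sketch of the prediction and accuracy computation for 5_1.py, reusing the nbc sketch above. The helper names predict and accuracy, the fallback probability for attribute values unseen during training, and the use of module-level train_df/test_df DataFrames are all illustrative assumptions.

    def predict(row, priors, cond):
        # Pick the class with the highest posterior (proportional to prior times conditionals).
        best_class, best_score = None, -1.0
        for y, prior in priors.items():
            score = prior
            for a, tables in cond.items():
                score *= tables[y].get(row[a], 1e-9)  # fallback for unseen values (an assumption)
            if score > best_score:
                best_class, best_score = y, score
        return best_class

    def accuracy(df, priors, cond):
        preds = df.apply(lambda row: predict(row, priors, cond), axis=1)
        return (preds == df['decision']).mean()

    # priors, cond = nbc(1)
    # print('Training Accuracy: %.2f' % accuracy(train_df, priors, cond))
    # print('Testing Accuracy: %.2f' % accuracy(test_df, priors, cond))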
2. Examine the effects of varying the number of bins for continuous attributes during the discretization step. Please put your code for this question in a file called 5_2.py.
(i) Given the number of bins b ∈ B = {2, 5, 10, 50, 100, 200}, perform discretization for all columns in the set continuous valued columns by splitting the values in each column into b bins of equal width within its range. (You can reuse the binning procedure from discretize.py, now taking the number of bins as a parameter and using dating.csv as input, as before.)
(ii) Repeat the train-test split described in Question 4 for the dataset obtained after discretizing each continuous attribute into b bins.
(iii) For each value of b, train the NBC on the corresponding new training dataset by calling your nbc(t_frac) function with t_frac = 1, and apply the learned model on the corresponding new test dataset.
(iv) Draw a plot to show how the value of b affects the learned NBC model’s performance on the training dataset and the test dataset, with the x-axis representing the value of b and the y-axis representing the model accuracy. Comment on what you observe in the plot. (A sketch of the overall loop appears after the sample output below.)
The sample inputs and outputs are as follows:
$python 5_2.py
Bin size: 2
Training Accuracy: 0.34
Testing Accuracy: 0.12
Bin size: 5
Training Accuracy: 0.78
Testing Accuracy: 0.56
...
Bin size: 200
Training Accuracy: 0.90
Testing Accuracy: 0.88
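A possible outline of 5_2.py, reusing nbc and accuracy from the sketches above and assuming discretize.py has been refactored into a callable helper (here named discretize, an assumed name) that returns the re-binned DataFrame. Since the nbc sketch reads the module-level train_df, the loop rebinds that global before each call.

    import matplotlib.pyplot as plt

    bins = [2, 5, 10, 50, 100, 200]
    train_accs, test_accs = [], []
    for b in bins:
        # Re-bin the continuous columns into b equal-width bins, then redo the split from Question 4.
        binned = discretize('dating.csv', b)   # assumed helper
        test_df = binned.sample(frac=0.2, random_state=47)
        train_df = binned.drop(test_df.index)
        priors, cond = nbc(1)                  # trains on the current train_df
        train_accs.append(accuracy(train_df, priors, cond))
        test_accs.append(accuracy(test_df, priors, cond))
        print('Bin size: %d' % b)
        print('Training Accuracy: %.2f' % train_accs[-1])
        print('Testing Accuracy: %.2f' % test_accs[-1])
    plt.plot(bins, train_accs, label='train')
    plt.plot(bins, test_accs, label='test')
    plt.xlabel('number of bins b')
    plt.ylabel('accuracy')
    plt.legend()
    plt.savefig('5_2_plot.png')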
3. Plot the learning curve. Please put your code for this question in a file called 5_3.py.
(i) For each f in F = {0.01, 0.1, 0.2, 0.5, 0.6, 0.75, 0.9, 1}, randomly sample a fraction f of the training data in trainingSet.csv with our fixed seed (i.e., random_state=47).
(ii) Train an NBC model on the selected fraction f of the training dataset (you can call your nbc(t_frac) function with t_frac = f). Evaluate the performance of the learned model on all examples in the selected sample of training data as well as all examples in the test dataset (i.e., testSet.csv), and compute the accuracy of each. Do so for all f ∈ F (see the sketch after this list).
(iii) Draw a plot of the learning curves, with the x-axis representing the values of f and the y-axis representing the corresponding model’s accuracy on the training/test dataset. Comment on what you observe in this plot.
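A possible outline of 5_3.py under the same assumptions as the earlier sketches (nbc, accuracy, and the train_df/test_df DataFrames loaded from trainingSet.csv and testSet.csv):

    import matplotlib.pyplot as plt

    fracs = [0.01, 0.1, 0.2, 0.5, 0.6, 0.75, 0.9, 1]
    train_accs, test_accs = [], []
    for f in fracs:
        priors, cond = nbc(f)
        # Evaluate on the sampled portion of the training data (same seed as inside nbc) and on the full test set.
        sampled = train_df.sample(frac=f, random_state=47)
        train_accs.append(accuracy(sampled, priors, cond))
        test_accs.append(accuracy(test_df, priors, cond))
    plt.plot(fracs, train_accs, label='train')
    plt.plot(fracs, test_accs, label='test')
    plt.xlabel('training fraction f')
    plt.ylabel('accuracy')
    plt.legend()
    plt.savefig('5_3_learning_curve.png')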
Submission Instructions:
Instructions below describe the details of how to submit your code and assignment using turnin on data.cs.purdue.edu. Alternate submissions (i.e., outside the turnin system) will not be accepted.
After logging into data.cs.purdue.edu, please follow these steps to submit your assignment:
1. Make a directory named yourFirstName_yourLastName and copy all of your files to this directory.
2. While in the directory one level up (if the files are in /homes/yin/ming_yin, go to /homes/yin), execute the following command:
turnin -c cs573 -p HW2 your_folder_name
(e.g., your professor would use: turnin -c cs573 -p HW2 ming_yin to submit her work)
Keep in mind that old submissions are overwritten with new ones whenever you execute this command.
You can verify the contents of your submission by executing the following command: turnin -v -c cs573 -p HW2
Do not forget the -v flag here, as otherwise your submission would be replaced with an empty one.
Your submission should include the following les:
1. The source code in python.
2. Your evaluation & analysis in .pdf format. Note that your analysis should include visualization plots as well as a discussion of results, as described in detail in the questions above.
3. A README file containing your name, instructions to run your code, and anything you would like us to know about your program (like errors, special conditions, etc.).