Classify the following attributes as binary, discrete, or continuous. Further classify each attribute as nominal, ordinal, interval, or ratio.
Rating of an Amazon product by a person on a scale of 1 to 5
The Internet Speed
Number of customers in a store.
MST Student ID
Distance
MST letter grade (A, B, C, D)
The temperature at Rolla
Task 2 Distance/Similarity Measures
Given the four boxes shown in the following figure, answer the following questions. In the diagram, numbers indicate the lengths and widths and you can consider each box to be a vector of two real numbers, length and width. For example, the top left box would be (2,1), while the bottom right box would be (3,3). Restrict your choices of similarity/distance measure to Euclidean distance and correlation. Briefly explain your choice.
Which proximity measure would you use to group the boxes based on their shapes (length-width ratio)? Justify your answer.
Which proximity measure would you use to group the boxes based on their size? Justify your answer.
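The two candidate measures can be compared numerically before choosing. The following is a minimal sketch, assuming NumPy is available; the two boxes are hypothetical values chosen for illustration, not the boxes in the figure, and the helper names euclidean and correlation are illustrative.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two (length, width) vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.linalg.norm(diff))

def correlation(a, b):
    """Pearson correlation between two (length, width) vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical boxes (not taken from the figure): same 2:1 shape, different size.
small = (2, 1)
large = (4, 2)

dist = euclidean(small, large)    # depends on the absolute lengths and widths
corr = correlation(small, large)  # unaffected by rescaling a box
```

Note that Pearson correlation is undefined for a vector whose components are all equal, such as a square box like (3,3), because its variance is zero; keep this in mind when justifying your choice.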
Task 3 Data Preprocessing of Titanic
You can download the Kaggle Titanic dataset from files/data/Titanic.zip. See https://www.kaggle.com/c/titanic/data for more details. The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
Data Dictionary
Variable | Definition | Key
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
Let us start by acquiring the data. The Python pandas package helps us work with our datasets. We load the training and test datasets into pandas DataFrames. We also put both DataFrames in a list so that certain operations can be run on both datasets together.
import pandas as pd

train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
combine = [train_df, test_df]
Subtask 1: Analyze by describing data
Q1: Which features are available in the dataset?
Q2: Which features are categorical?
Q3: Which features are numerical?
Q4: Which features are mixed data types?
Q5: Which features contain blank, null or empty values?
Q6: What are the data types (e.g., integers, floats, or strings) for the various features?
Q7: To understand the distribution of numerical feature values across the samples, please list the properties (count, mean, std, min, 25th percentile, 50th percentile, 75th percentile, max) of the numerical features.
Q8: To understand the distribution of categorical features, we define: count is the total number of categorical values per column; unique is the total number of unique categorical values per column; top is the most frequent categorical value; freq is the number of occurrences of the most frequent categorical value. Please list the properties (count, unique, top, freq) of the categorical features.
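The questions in this subtask map onto a handful of pandas calls. The following is a minimal sketch, assuming pandas is installed; the small DataFrame is a hypothetical stand-in for train.csv so that the example runs on its own, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for train.csv (a few made-up rows, not real data).
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

feature_names = list(train_df.columns)   # Q1: available features
dtypes = train_df.dtypes                 # Q6: data type of each feature
null_counts = train_df.isnull().sum()    # Q5: missing values per feature

# Q7: count, mean, std, min, quartiles, and max of the numerical features.
numeric_summary = train_df.describe()

# Q8: count, unique, top, and freq of the non-numerical features.
categorical_summary = train_df.select_dtypes(exclude="number").describe()
```

On the real data, replace the stand-in with the train_df loaded from train.csv; train_df.info() prints the data types and non-null counts in one view.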
Subtask 2: Analyze by pivoting features
Q9: Can you observe a significant correlation (greater than 0.5) between Pclass=1 and Survived? If Pclass has a significant correlation with Survived, we should include this feature in the predictive model. Based on your computation, will you include this feature in the predictive model?
Q10: Were women (Sex=female) more likely to have survived?
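One way to approach Q9 and Q10 is to pivot a feature against Survived and compare survival rates per group. The following is a minimal sketch, assuming pandas; the rows are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
by_class = (train_df[["Pclass", "Survived"]]
            .groupby("Pclass", as_index=False).mean()
            .sort_values("Survived", ascending=False))
by_sex = train_df[["Sex", "Survived"]].groupby("Sex", as_index=False).mean()
```

Each resulting table has one row per group, so the survival rates of Pclass=1 or Sex=female can be read off directly.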
Q11: Let us start by understanding the correlation between a numeric feature (Age) and our predictive goal (Survived). A histogram is useful for analyzing continuous numerical variables like Age, where banding or ranges help identify useful patterns. The histogram can show the distribution of samples using automatically defined bins or equally ranged bands. This helps us answer questions relating to specific bands (e.g., infants, old). Please plot histograms of Age for each value of Survived (Figure 1 is an example), and answer the following questions:
Do infants (Age <= 4) have a high survival rate?
Do the oldest passengers (Age = 80) survive?
Do a large number of 15-25 year olds not survive?
Based on your analysis of the histograms:
Should we consider Age in our model training? (If yes, then we should complete the Age feature for null values.)
Should we band age groups?
Figure 1: a sample histogram plot of age
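A plot of this kind can be produced with matplotlib. The following is a minimal sketch, assuming pandas and matplotlib are installed; the ages are hypothetical values, and the choice of 20 bins over the range 0 to 80 is an assumption rather than a requirement of the assignment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({
    "Age": [0.8, 4, 19, 22, 24, 30, 45, 62, 80, None],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0, 1, 0],
})

# One histogram of Age per value of Survived, sharing the y-axis for comparison.
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
for ax, survived in zip(axes, [0, 1]):
    ages = train_df.loc[train_df["Survived"] == survived, "Age"].dropna()
    ax.hist(ages, bins=20, range=(0, 80))
    ax.set_title(f"Survived = {survived}")
    ax.set_xlabel("Age")
axes[0].set_ylabel("Passengers")
fig.tight_layout()
```

In a notebook, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.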
Q12: We can combine three features (Age, Pclass, and Survived) to identify correlations using a single plot. This can be done with numerical and categorical features which have numeric values. Here is an example plot:
Figure 2: a sample histogram plot of Age, Pclass, and Survived.
Please plot the histograms using Python, and answer the following questions:
Does Pclass=3 have the most passengers, and did most of them not survive?
Do infant passengers in Pclass=2 and Pclass=3 mostly survive?
Do most passengers in Pclass=1 survive?
Does Pclass vary in terms of Age distribution of passengers?
Should we consider Pclass for model training?
Q13: We want to correlate categorical features (with non-numeric values) and numeric features. We can consider correlating Embarked (categorical, non-numeric), Sex (categorical, non-numeric), and Fare (numeric, continuous) with Survived (categorical, numeric). Please plot a histogram figure to illustrate the correlations of Embarked, Sex, Fare, and Survived. Here is a sample plot:
Figure 3: a sample figure of the correlations of Embarked, Sex, Fare, and Survived
And answer the following questions:
Do passengers who paid higher fares have a better survival rate?
Does the port of embarkation correlate with survival rates?
Should we consider banding the Fare feature?
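Before plotting, it can help to tabulate the values such a figure would display. The following is a minimal sketch, assuming pandas and assuming the quantity of interest is the mean Fare per combination of Embarked, Survived, and Sex; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "Q"],
    "Sex":      ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 0, 1],
    "Fare":     [8.0, 30.0, 20.0, 80.0, 7.0, 12.0],
})

# Mean Fare for every (Embarked, Survived) row and Sex column.
# Combinations that do not occur in the data appear as NaN.
mean_fare = train_df.pivot_table(values="Fare",
                                 index=["Embarked", "Survived"],
                                 columns="Sex",
                                 aggfunc="mean")
```

Calling mean_fare.plot(kind="bar") on the result gives a quick grouped bar chart when matplotlib is installed.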
Q14: What is the rate of duplicates for the Ticket feature? Is there a correlation between Ticket and survival? Should we drop the Ticket feature?
Q15: Is the Cabin feature complete? How many null values are there in the Cabin feature of the combined training and test datasets? Should we drop the Cabin feature?
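The counts asked for in Q14 and Q15 can be computed directly. The following is a minimal sketch, assuming pandas; the data is hypothetical, and the duplicate rate is taken here to mean the share of Ticket values that repeat an earlier value, which is one possible definition.

```python
import pandas as pd

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "C3"],
                         "Cabin": ["C85", None, None, "E46"]})
test_df = pd.DataFrame({"Ticket": ["D4", "D4"], "Cabin": [None, None]})

# Share of Ticket values in the training set that repeat an earlier value.
ticket_duplicate_rate = train_df["Ticket"].duplicated().mean()

# Missing Cabin entries across the combined training and test sets.
combined = pd.concat([train_df, test_df], ignore_index=True)
cabin_nulls = int(combined["Cabin"].isnull().sum())
cabin_null_share = cabin_nulls / len(combined)
```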
Q16: We can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us achieve the feature-completing goal. In this question, please convert the Sex feature to a new feature called Gender, where female=1 and male=0.
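The conversion can be expressed as a dictionary mapping. The following is a minimal sketch, assuming pandas; the two small DataFrames are hypothetical stand-ins for the training and test sets.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test sets.
train_df = pd.DataFrame({"Sex": ["male", "female", "female"]})
test_df = pd.DataFrame({"Sex": ["female", "male"]})

# Apply the same mapping to both datasets: female=1, male=0.
for df in (train_df, test_df):
    df["Gender"] = df["Sex"].map({"female": 1, "male": 0}).astype(int)
```

The astype(int) call raises an error if Sex contains a value outside the mapping, which serves as a simple check that no category was missed.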
Q17: We now start estimating and completing features with missing or null values, beginning with the Age feature. We can consider three methods to complete a numerical continuous feature. A simple way is to generate random numbers between the mean and the standard deviation. A more accurate way of guessing missing values is to use the K-Nearest Neighbor algorithm to select the top-K most similar data points, and then use those data points to impute the missing Age values.
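The two approaches described above can be sketched as follows, assuming pandas and NumPy. Several choices here are assumptions, not part of the assignment: the random method is interpreted as drawing uniformly from [mean - std, mean + std], the nearest-neighbor method uses unscaled Euclidean distance on Pclass and Gender, and the function names and data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_age_random(df, seed=0):
    """Fill missing Age with uniform draws from [mean - std, mean + std]."""
    out = df.copy()
    mean, std = out["Age"].mean(), out["Age"].std()
    missing = out["Age"].isnull()
    rng = np.random.default_rng(seed)
    out.loc[missing, "Age"] = rng.uniform(mean - std, mean + std,
                                          size=int(missing.sum()))
    return out

def impute_age_knn(df, features, k=2):
    """Fill missing Age with the mean Age of the k nearest rows with known Age."""
    out = df.copy()
    known = out[out["Age"].notnull()]
    known_x = known[features].to_numpy(dtype=float)
    known_age = known["Age"].to_numpy()
    for idx in out.index[out["Age"].isnull()]:
        row = out.loc[idx, features].to_numpy(dtype=float)
        distances = np.linalg.norm(known_x - row, axis=1)
        nearest = np.argsort(distances)[:k]
        out.loc[idx, "Age"] = known_age[nearest].mean()
    return out

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Gender": [1, 1, 0, 0, 0],
    "Age":    [38.0, 40.0, 20.0, 24.0, np.nan],
})
knn_filled = impute_age_knn(train_df, features=["Pclass", "Gender"], k=2)
random_filled = impute_age_random(train_df)
```

With features on different scales, standardizing them before computing distances is usually advisable. scikit-learn's KNNImputer offers a ready-made alternative if that library is available.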
Q18: Completing a categorical feature. The Embarked feature takes the values S, Q, and C based on the port of embarkation. Our training dataset has some missing values. Please fill these with the most common value.
Q19: Completing and converting a numeric feature. Please complete the Fare feature for the single missing value in the test dataset, using the mode to get the value that occurs most frequently for this feature.
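Both Q18 and Q19 fill missing entries with the most frequent value. The following is a minimal sketch, assuming pandas; the data is hypothetical.

```python
import pandas as pd

# Hypothetical rows, not real Titanic data.
train_df = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q"]})
test_df = pd.DataFrame({"Fare": [7.75, 7.75, 26.0, None]})

# mode() ignores missing values and returns a Series (ties are possible),
# so take its first entry.
most_common_port = train_df["Embarked"].mode().iloc[0]
train_df["Embarked"] = train_df["Embarked"].fillna(most_common_port)

most_common_fare = test_df["Fare"].mode().iloc[0]
test_df["Fare"] = test_df["Fare"].fillna(most_common_fare)
```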
Q20: Convert the Fare feature to ordinal values based on the FareBand defined as follows:
Ordinal Fare Indicator | FareBand | Survived
0 | (-0.001, 7.91] | 0.197309
1 | (7.91, 14.454] | 0.303571
2 | (14.454, 31.0] | 0.454955
3 | (31.0, 512.329] | 0.581081
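The band edges in the table can be passed to pandas as explicit bins. The following is a minimal sketch, assuming pandas; the fares are hypothetical and the column name FareOrdinal is illustrative.

```python
import pandas as pd

# Band edges taken from the FareBand table; intervals are closed on the right.
fare_bins = [-0.001, 7.91, 14.454, 31.0, 512.329]

# Hypothetical fares, including values that sit exactly on a band edge.
df = pd.DataFrame({"Fare": [0.0, 7.91, 8.05, 14.454, 26.0, 31.0, 71.28, 512.329]})

# pd.cut assigns each fare to its band; the labels are the ordinal indicators.
df["FareOrdinal"] = pd.cut(df["Fare"], bins=fare_bins,
                           labels=[0, 1, 2, 3]).astype(int)
```

Fares outside the outermost edges would become missing values and make astype(int) fail, so complete the Fare feature first. Bands of this kind are typically derived from quartiles, for example with pd.qcut on the training fares, though the assignment does not state how these were obtained.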
Please submit a report (PDF or Word) that includes a link to your code, your answers/results, and your explanations or interpretations (if any).