Project 01: Exploratory Data Analysis

Partner work is allowed on this project. Browse through the UCI Machine Learning Repository to find a data set that is interesting to you and has both categorical and numerical data (but is small enough to work with). Note: If your chosen data set is not small enough to download, take a sample of the data set that contains at least 200 entities (instances/rows).

Part 1: Write the Introduction

In a well-written paragraph, answer the following questions about the data:

    • (4 points) What was the data used for?

    • (2 points) Who (or what organization) uploaded the data?

    • (5 points) How many attributes and how many entities are represented in the data?

        - How many numerical attributes?

        - How many categorical attributes?

        - Would you suggest that each categorical attribute be label-encoded or one-hot-encoded?

        - Why did you choose the encoding?

    • (4 points) Are there missing values in the data? If so, what proportion of the data is missing overall? What proportion of data is missing per attribute (you may use a plot or table to summarize this information)?

    • (7 points) Why is this data set interesting to you?

    • (6 points) Of the attributes used to describe this data, which do you think are the most descriptive of the data and why (before doing any data analysis)?
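
The missing-value question above can be answered with a short counting routine. The following is a minimal sketch, assuming the data has been read into a list of rows and that missing entries appear as None or as one of a few marker strings such as "?" (a common convention, but check your own data set). The function name, its parameters, and the sample rows are hypothetical.

```python
def missing_proportions(rows, header, missing_markers=("", "?", "NA")):
    """Return (overall proportion missing, {attribute: proportion missing}).

    Assumption: a cell is missing if it is None or if its stripped string
    form is one of `missing_markers`. Adjust the markers to your data set.
    """
    n_rows = len(rows)
    per_attribute = {}
    total_missing = 0
    for j, name in enumerate(header):
        count = sum(
            1
            for row in rows
            if row[j] is None or str(row[j]).strip() in missing_markers
        )
        per_attribute[name] = count / n_rows
        total_missing += count
    overall = total_missing / (n_rows * len(header))
    return overall, per_attribute


# Hypothetical toy data: two attributes, four entities.
header = ["age", "color"]
rows = [["23", "red"], ["?", "blue"], ["41", ""], [None, "red"]]
overall, per_attribute = missing_proportions(rows, header)
print(overall)        # proportion of all cells that are missing
print(per_attribute)  # proportion missing for each attribute
```

The per-attribute dictionary can be reported directly as a table or passed to a bar plot.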

Part 2: Write Python code for data analysis

Use Python to write the following functions, without using any functions with the same purpose in sklearn, pandas, numpy, or any other library (though you may want to use these libraries to check your answers):

    • (5 points) A function that will compute the multi-dimensional mean of a numerical, multi-dimensional data set input as a 2-dimensional numpy array

    • (5 points) A function that will compute the estimated covariance between two attributes that are input as one-dimensional numpy vectors

    • (5 points) A function that will compute the correlation between two attributes that are input as two numpy vectors.

    • (5 points) A function that will normalize the attributes in a two-dimensional numpy array using range normalization.

    • (5 points) A function that will normalize the attributes in a two-dimensional numpy array using standard normalization.

    • (5 points) A function that will compute the covariance matrix of a data set.

    • (5 points) A function that will label-encode a two-dimensional categorical data array that is passed in as input.
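
As a rough illustration of the shape these functions might take, here is a minimal sketch that uses numpy only for array storage and element-wise arithmetic, not for the statistics themselves. All function names are hypothetical. The `ddof` parameter is an assumption: `ddof=1` divides by n - 1 and `ddof=0` divides by n, so use whichever definition of estimated covariance your course uses. Label codes are assigned in sorted order of the distinct values, which is also an assumption.

```python
import numpy as np


def multi_mean(data):
    """Column-wise mean of a 2-D array: one value per attribute."""
    data = np.asarray(data, dtype=float)
    n_rows, n_cols = data.shape
    return np.array([sum(data[:, j]) / n_rows for j in range(n_cols)])


def covariance(x, y, ddof=1):
    """Estimated covariance of two 1-D vectors (denominator n - ddof)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((x - mean_x) * (y - mean_y)) / (n - ddof)


def correlation(x, y):
    """Correlation: covariance divided by the product of standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5


def range_normalize(data):
    """Rescale each attribute to [0, 1]; constant attributes are left at 0."""
    data = np.asarray(data, dtype=float)
    out = np.zeros_like(data)
    for j in range(data.shape[1]):
        col = data[:, j]
        lo, hi = min(col), max(col)
        if hi > lo:
            out[:, j] = (col - lo) / (hi - lo)
    return out


def zscore_normalize(data, ddof=1):
    """Center each attribute at 0 and scale it to unit standard deviation."""
    data = np.asarray(data, dtype=float)
    out = np.zeros_like(data)
    for j in range(data.shape[1]):
        col = data[:, j]
        std = covariance(col, col, ddof) ** 0.5
        if std > 0:
            out[:, j] = (col - sum(col) / len(col)) / std
    return out


def covariance_matrix(data, ddof=1):
    """Symmetric d x d matrix of pairwise covariances between attributes."""
    data = np.asarray(data, dtype=float)
    d = data.shape[1]
    cov = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            cov[i, j] = cov[j, i] = covariance(data[:, i], data[:, j], ddof)
    return cov


def label_encode(data):
    """Map each column's distinct values to integers 0..k-1 (sorted order)."""
    data = np.asarray(data)
    out = np.zeros(data.shape, dtype=int)
    for j in range(data.shape[1]):
        codes = {value: code for code, value in enumerate(sorted(set(data[:, j])))}
        out[:, j] = [codes[value] for value in data[:, j]]
    return out
```

Results can be checked against library routines such as `np.mean(data, axis=0)`, `np.cov(data, rowvar=False)`, and `np.corrcoef`, keeping in mind that `np.cov` divides by n - 1 by default.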

Part 3: Analyze the data with your code and write up the results

Use your code from Part 2 to answer the following questions in a well-written paragraph, and create the following plots from the numerical portion of the data. Use your functions to compute the multi-dimensional mean and covariance matrix of the numerical portion of your data set. Before answering the questions:

    • (5 points) Convert all categorical attributes using label encoding or one-hot-encoding

    • (2 points) If your data has missing values, fill in those values with the attribute mean.
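
Mean imputation, as described in the step above, might look like the following minimal sketch. It assumes the data is already fully numerical (categorical attributes encoded) and that missing entries are stored as NaN in a float array. The function name is hypothetical, and an attribute with no observed values is left unchanged.

```python
import numpy as np


def fill_missing_with_mean(data):
    """Replace NaN entries in each column with that column's observed mean."""
    data = np.array(data, dtype=float)  # copy, so the input is not modified
    for j in range(data.shape[1]):
        col = data[:, j]  # a view: assignments below write into `data`
        missing = np.isnan(col)
        observed = col[~missing]
        if len(observed) > 0:
            col[missing] = sum(observed) / len(observed)
    return data


# Hypothetical toy data with one missing value per attribute.
raw = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
filled = fill_missing_with_mean(raw)
print(filled)
```

Note that the mean is computed from the observed values only, so the filled-in entries do not change each attribute's mean.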

Questions to answer:

    • (2 points) What is the multi-dimensional mean of the numerical data matrix (where categorical data have been converted to numerical values)?

    • (4 points) What is the covariance matrix of the numerical data matrix (where categorical data have been converted to numerical values)?

    • (5 points) Choose 5 pairs of attributes that you think could be related. Create scatter plots of all 5 pairs and include these in your report, along with a description and analysis that summarizes why these pairs of attributes might be related, and how the scatter plots do or do not support this intuition.

    • (3 points) Which range-normalized numerical attributes have the greatest estimated covariance? What is their estimated covariance? Create a scatter plot of these range-normalized attributes.

    • (3 points) Which Z-score-normalized numerical attributes have the greatest correlation? What is their correlation? Create a scatter plot of these Z-score-normalized attributes.

    • (3 points) Which Z-score-normalized numerical attributes have the smallest correlation? What is their correlation? Create a scatter plot of these Z-score-normalized attributes.

    • (3 points) How many pairs of features have correlation greater than or equal to 0.5?

    • (3 points) How many pairs of features have negative estimated covariance?

    • (2 points) What is the total variance of the data?

    • (2 points) What is the total variance of the data, restricted to the five features that have the greatest estimated variance?
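
Several of the questions above can be answered directly from a covariance matrix. The sketch below assumes a square covariance matrix computed by your own Part 2 code and treats total variance as the sum of the per-attribute variances (the trace of the covariance matrix). The function names and the example matrix are hypothetical, and every attribute is assumed to have nonzero variance so that the correlation is defined.

```python
import numpy as np


def total_variance(cov, top_k=None):
    """Sum of the diagonal variances, optionally only the top_k largest."""
    variances = sorted((cov[i, i] for i in range(cov.shape[0])), reverse=True)
    if top_k is not None:
        variances = variances[:top_k]
    return sum(variances)


def count_pairs(cov, corr_threshold=0.5):
    """Count attribute pairs (i < j) with correlation >= corr_threshold,
    and pairs with negative covariance."""
    d = cov.shape[0]
    high_corr = 0
    negative_cov = 0
    for i in range(d):
        for j in range(i + 1, d):
            # Correlation recovered from covariance and the two variances.
            corr = cov[i, j] / (cov[i, i] * cov[j, j]) ** 0.5
            if corr >= corr_threshold:
                high_corr += 1
            if cov[i, j] < 0:
                negative_cov += 1
    return high_corr, negative_cov


# Hypothetical 3 x 3 covariance matrix for illustration only.
cov = np.array([[4.0, 4.0, -1.0],
                [4.0, 9.0, 0.5],
                [-1.0, 0.5, 1.0]])
print(total_variance(cov))           # all attributes
print(total_variance(cov, top_k=2))  # two highest-variance attributes
print(count_pairs(cov))              # (pairs with corr >= 0.5, pairs with cov < 0)
```

Each unordered pair is counted once, so a data set with d attributes has d(d - 1)/2 pairs to consider.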

Tips and Acknowledgements

Submit your answers as a PDF on Gradescope and Brightspace, and show your work. Include any code snippets you used to generate an answer, using comments in the code to clearly indicate which problem each snippet corresponds to.

Acknowledgements: Project adapted from assignments of Veronika Strnadova-Neeley.
