Instructor: Ilias Tagkopoulos. TAs: Trevor Chan, Jason Youn, and Ameen Eetemadi ({tchchan, jyoun, eetemadi}@ucdavis.edu)
September 29, 2018
General Instructions: The homework should be submitted electronically through Canvas. Each submission should be a zip file that includes the following: (a) a report in PDF format ("report_HW1.pdf") that includes your answers to all questions, plots, figures and any instructions needed to run your code, (b) the Python code files. Please note: (a) do not include any other files, for instance files that we have provided such as datasets, (b) each function should be commented so that it is generally understandable (what it does, how it does it), (c) do not use any toolbox unless it is explicitly allowed in the homework description. Shared/copied code from any source is not allowed, as it is considered plagiarism.
OF CARS AND MEN [100PT]
In this exercise, you will investigate the type of relationship that exists between the “miles per gallon” (mpg) rating of a car and several of its attributes. For this task, you will use the “Auto MPG” dataset (“auto-mpg.data” file; 398 cars, 9 features; remove the 6 records with missing values to end up with 392 samples) that is available in the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
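As a starting point for the data-cleaning step, here is a minimal sketch of reading the records and dropping those with missing values. It assumes each row holds 8 whitespace-separated numeric fields followed by a quoted car name, with "?" marking a missing value; check this against your copy of the file. The function name `parse_auto_mpg` and the two sample rows are hypothetical and only illustrate the assumed layout.

```python
def parse_auto_mpg(lines):
    """Parse rows in the assumed auto-mpg.data layout.

    Returns (rows, names): rows is a list of 8-float lists, names the car names.
    Records containing the missing-value marker "?" are skipped.
    """
    rows, names = [], []
    for line in lines:
        parts = line.strip().split(None, 8)  # 8 numeric fields + the name
        if len(parts) < 9:
            continue  # blank or malformed line
        fields = parts[:8]
        if "?" in fields:
            continue  # drop records with a missing value
        rows.append([float(v) for v in fields])
        names.append(parts[8].strip().strip('"'))
    return rows, names


# Two illustrative rows in the assumed format (the second has a missing value)
sample = [
    '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"',
    '25.0   4   98.00   ?       2046.   19.0   71  1\t"ford pinto"',
]
rows, names = parse_auto_mpg(sample)
```

With the real file you would pass an open file handle instead of `sample`, for example `parse_auto_mpg(open("auto-mpg.data"))`, and confirm that 392 rows remain.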
For this assignment, you will need to write your own code from scratch. However, you are encouraged to consult code from any resource, for example scikit-learn, to help you write your own implementation of the methods. Perform and report (code and results) the following:
Assume that we want to classify the cars into 3 categories: low, medium and high mpg. Find what the threshold for each category should be, so that all samples are divided into three equally-sized bins. [10pt]
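One possible way to obtain the two cut points is to sort the mpg values and take the midpoints at the 1/3 and 2/3 positions. The sketch below assumes NumPy is permitted for basic array operations; the function names `tercile_thresholds` and `mpg_category` are hypothetical. Note that tied mpg values at a cut point can make the bins slightly unequal.

```python
import numpy as np

def tercile_thresholds(mpg):
    """Return two cut points that split mpg into three near-equal bins."""
    s = np.sort(np.asarray(mpg, dtype=float))
    n = len(s)
    # Midpoint between the neighbouring sorted values at each boundary
    t1 = 0.5 * (s[n // 3 - 1] + s[n // 3])
    t2 = 0.5 * (s[2 * n // 3 - 1] + s[2 * n // 3])
    return t1, t2

def mpg_category(mpg, t1, t2):
    """Map mpg values to 0 (low), 1 (medium) or 2 (high)."""
    return np.digitize(mpg, [t1, t2])

# Toy values for illustration only (not taken from the dataset)
toy_mpg = [10, 12, 14, 20, 22, 24, 30, 32, 34]
t1, t2 = tercile_thresholds(toy_mpg)
labels = mpg_category(toy_mpg, t1, t2)
```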
Create a 2D scatterplot matrix, similar to that of Figure 1.4 in the ML book (K. Murphy, page 6; also available on the lecture 1 slides - the figure with the flowers). You may use any published code to perform this. Which pair from all pair-wise feature combinations is the most informative regarding the three mpg categories? [10pt]
Write a linear regression solver that can accommodate polynomial basis functions on a single variable for prediction of MPG. Your code should use the Ordinary Least Squares (OLS) estimator (i.e. the Maximum-likelihood estimator). Code it without using any existing code. [20pt]
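To make the structure of such a solver concrete, here is a minimal sketch that builds a polynomial design matrix and solves the normal equations (X^T X) w = X^T y. It assumes NumPy is permitted for linear algebra; the names `poly_design`, `fit_ols` and `predict` are hypothetical. This is one possible layout, not the required one.

```python
import numpy as np

def poly_design(x, order):
    """Design matrix with columns 1, x, x^2, ..., x^order."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** k for k in range(order + 1)])

def fit_ols(X, y):
    """OLS weights from the normal equations (X^T X) w = X^T y."""
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, w):
    """Predicted targets for design matrix X and weights w."""
    return X @ w

# Toy check: recover the coefficients of y = 1 + 2x + 3x^2
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1 + 2 * x_toy + 3 * x_toy ** 2
w_toy = fit_ols(poly_design(x_toy, 2), y_toy)
```

For raw features with large magnitudes (weight, for instance), higher powers can make X^T X poorly conditioned, so standardizing the feature before expanding it is worth considering.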
Split the dataset into the first 200 samples for training and the remaining 192 samples for testing. Use your solver to regress for 0th to 3rd order polynomials on a single independent variable (feature) each time, using mpg as the dependent variable. Report (a) the training and (b) the testing mean squared errors for each variable individually (except the “car name” string variable, so a total of 7 features that are independent variables). Plot the lines and data for the testing set, one plot per variable (so 4 lines in each plot, 7 plots total). Which polynomial order performs the best in the test set? Which is the most informative feature for mpg in that case? [20pt]
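The evaluation loop for one feature might look like the following sketch. It assumes NumPy is permitted, standardizes the feature using training statistics only (a design choice, not a requirement of the assignment), and uses the hypothetical names `mse` and `evaluate_orders`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def evaluate_orders(x, y, n_train=200, orders=(0, 1, 2, 3)):
    """Fit each polynomial order on the first n_train samples.

    Returns a dict {order: (train_mse, test_mse)}.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize with training statistics so the powers stay well scaled
    mu, sigma = x[:n_train].mean(), x[:n_train].std()
    z = (x - mu) / sigma
    results = {}
    for order in orders:
        X = np.column_stack([z ** k for k in range(order + 1)])
        X_tr, X_te = X[:n_train], X[n_train:]
        w = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y[:n_train])
        results[order] = (mse(y[:n_train], X_tr @ w),
                          mse(y[n_train:], X_te @ w))
    return results

# Synthetic linear data for illustration only
x_demo = np.linspace(0.0, 10.0, 20)
y_demo = 2.0 + 3.0 * x_demo
res = evaluate_orders(x_demo, y_demo, n_train=10)
```

On the real data, `x` would be one feature column and `y` the mpg column, with `n_train=200` giving the 200/192 split.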
Modify your solver to be able to handle second order polynomials of all 7 independent variables simultaneously (i.e. 15 terms). Regress with 0th, 1st and 2nd order and report (a) the training and (b) the testing mean squared error (MSE). Use the same 200/192 split. [20pt]
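The count of 15 terms is consistent with one bias term plus a linear and a squared term for each of 7 features, with no cross terms. The sketch below assumes that reading and assumes NumPy is permitted; the function name `multi_poly_design` is hypothetical.

```python
import numpy as np

def multi_poly_design(F, order=2):
    """Design matrix [1, F, F^2, ..., F^order] without cross terms.

    F has shape (n_samples, n_features); the result has
    1 + order * n_features columns.
    """
    F = np.asarray(F, dtype=float)
    cols = [np.ones((F.shape[0], 1))]      # bias column
    for k in range(1, order + 1):
        cols.append(F ** k)                # element-wise powers per feature
    return np.hstack(cols)

# Toy matrix with 3 samples and 7 features, for illustration only
F_toy = np.arange(21).reshape(3, 7)
X2 = multi_poly_design(F_toy, order=2)
```

The resulting matrix can be passed to the same normal-equation solve used for the single-variable case; scaling the features first helps keep that solve numerically stable.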
Use logistic regression (1st order) to perform low/medium/high classification. Report the training/testing classification precision (you may want to look up how precision is defined and how it is calculated). [10pt]
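For reference, the precision of a class is TP / (TP + FP): among the samples predicted to be in that class, the fraction that truly belong to it. A minimal sketch of the per-class calculation follows, assuming NumPy is permitted and that the classes are encoded as 0 (low), 1 (medium) and 2 (high); the function name `per_class_precision` is hypothetical.

```python
import numpy as np

def per_class_precision(y_true, y_pred, n_classes=3):
    """Precision TP / (TP + FP) for each class label 0..n_classes-1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions = []
    for c in range(n_classes):
        predicted_c = (y_pred == c)
        tp = np.sum(predicted_c & (y_true == c))   # correct predictions of c
        n_pred = np.sum(predicted_c)               # all predictions of c
        precisions.append(tp / n_pred if n_pred > 0 else float("nan"))
    return precisions

# Toy labels for illustration only
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
prec = per_class_precision(y_true_toy, y_pred_toy)
```

Whether to report the per-class values or an average over the three classes is a choice you should state in the report.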
If a USA manufacturer (origin 1) had considered introducing a model in 1980 with the following characteristics: 6 cylinders, 350 cc displacement, 180 horsepower, 3700 lb weight, 9 m/sec² acceleration, what MPG rating should we have expected? In which mpg category (low, medium, high) would it belong? Use second-order, multivariate polynomial regression and logistic regression. [10pt]
Predict the mpg of the vehicle in the photo. Clearly state your assumptions. [3pt bonus]