CSC 4780/6780 Homework 03



1    What are we doing?


Linear regression is a very common method of making predictions. You should learn both ways of solving for the coefficients: matrix inversion and gradient descent.

You have decided to create a company called "Zillom" that estimates the price that a house will sell for. I have given you a spreadsheet (properties.xlsx) with the features and prices of 519 houses that have sold recently in Cleveland. The first five columns are the features you will use to predict prices:


    • sqft_hvac: Indoor square footage

    • sqft_yard: Outdoor square footage

    • bedrooms: Number of bedrooms in the house

    • bathrooms: Number of bathrooms in the house

    • miles_to_school: Number of miles children would need to walk to the nearest elementary school



You are going to use this spreadsheet (and linear regression!) to create a formula for predicting the sale price of any house in Cleveland.






(I got Stable Diffusion running, and I asked it to make "an Edward Hopper painting of a realtor in front of a modern house" for you. I’ve included three of the images in this document.)































2    Write programs that do linear regression


You are going to create three Python programs:


    • linreg_mi.py uses matrix inversion to come up with the formula.

    • linreg_scikit.py uses scikit-learn to find the formula.

    • linreg_gd.py uses gradient descent to converge upon the formula.


All three take the filename of the spreadsheet as an argument:


> python3 linreg_mi.py properties.xlsx


The program will read the Excel spreadsheet that has the features of houses and the price they sold for:































(property_id will be the index for the dataframe; you ignore it in the calculations.)
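As a rough illustration of that reading step, here is a minimal sketch. It assumes property_id is the first column of the sheet, that the last column holds the sale price, and that labels should hold only the feature names. The helper frame_to_arrays and the column name "price" are hypothetical, introduced only so the conversion can be tried on a tiny made-up DataFrame without the real file.

```python
import numpy as np
import pandas as pd


def frame_to_arrays(df):
    """Split a DataFrame into X (with a leading column of 1s), Y, and labels."""
    labels = list(df.columns[:-1])  # feature names from the header (assumed)
    features = df.iloc[:, :-1].to_numpy(dtype=float)
    Y = df.iloc[:, -1].to_numpy(dtype=float)  # last column is the price
    ones = np.ones((features.shape[0], 1))
    X = np.hstack([ones, features])  # first column of X is all 1s
    return X, Y, labels


def read_excel_data(path):
    """Read the spreadsheet, using the first column (property_id) as the index."""
    df = pd.read_excel(path, index_col=0)
    return frame_to_arrays(df)


# Tiny made-up frame standing in for properties.xlsx (values are illustrative).
demo = pd.DataFrame(
    {"sqft_hvac": [1000, 1500], "bedrooms": [2, 3], "price": [200000.0, 300000.0]},
    index=pd.Index([101, 102], name="property_id"),
)
X, Y, labels = frame_to_arrays(demo)
print(X.shape, Y.shape, labels)
```

Because the index is set aside by index_col=0, property_id never reaches X or Y.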


All three will find the hyperplane that minimizes the L2 error for those 519 data points. Each program will output those coefficients as a formula for predicting house prices:


predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) +

($-17,421.84 x miles_to_school)
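For intuition about the matrix inversion route, here is a minimal sketch of the normal equations, B = (X^T X)^(-1) X^T Y, on synthetic data. It is not the provided linreg_mi.py; the data, the seed, and the variable names are made up for illustration.

```python
import numpy as np

# Synthetic, noise-free data: 50 points, 2 features (illustrative values only).
rng = np.random.default_rng(0)
n = 50
features = rng.normal(size=(n, 2))
X = np.hstack([np.ones((n, 1)), features])  # leading column of 1s for the intercept
true_B = np.array([3.0, 2.0, -1.0])
Y = X @ true_B

# Normal equations: B = (X^T X)^{-1} X^T Y
B = np.linalg.inv(X.T @ X) @ X.T @ Y
print(B)
```

With noise-free data the recovered B should match true_B up to floating-point error.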


They will also output the \(R^2\) score for the fit. What is \(R^2\)?


3    \(R^2\)

We usually speak of the inputs for a prediction as the matrix \(X\) where each row \(x_i\) is the input for one data point.

We usually speak of the vector of correct answers ("the ground truth") as \(Y\) where each element \(y_i\) is the output for one data point. The mean of \(Y\) is usually denoted \(\bar{y}\).

Your linear regression will create a set of coefficients \(B\). For each input \(x_i\), you can use \(B\) to create a prediction \(\hat{y}_i\).

The dumbest linear function for estimating would be just the constant function that returned the mean of \(Y\). This would be equivalent to asking "How much will this house sell for?" and getting the answer, "Well, I’m going to ignore all the features of the house, and tell you that the average price of these 519 houses is $603,139.95." The sum of squared errors for this dumb approach would be



\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

Your predictions \(\hat{y}_i\) should have a smaller sum of squared errors:

\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

The \(R^2\) score for a set of predictions is:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]




If the data is basically linear without much noise, \(R^2\) will be close to 1: you have a good fit.

If the data is not linear or very noisy, \(R^2\) will be close to 0: your fit is terrible, about as bad as ignoring all the features and just using the mean.
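The two extremes can be checked numerically. The sketch below is one possible way to compute the score from B, X, and Y, assuming X already carries a leading column of 1s; the four data points are made up.

```python
import numpy as np


def score(B, X, Y):
    """Return 1 - (sum of squared residuals) / (sum of squared deviations from the mean)."""
    predictions = X @ B
    ss_res = np.sum((Y - predictions) ** 2)
    ss_tot = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Four made-up points lying exactly on the line y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])

perfect = score(np.array([1.0, 2.0]), X, Y)  # the exact line
baseline = score(np.array([Y.mean(), 0.0]), X, Y)  # always predict the mean
print(perfect, baseline)
```

The exact line scores 1, and the constant function that always returns the mean scores 0.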


4    Steps


You will edit two files: util.py and linreg_gd.py.



4.1    util.py


You need to write three functions in util.py. When they are done correctly, linreg_mi.py and linreg_scikit.py will run unchanged. The three functions are:


    • read_excel_data, which reads in the Excel file and returns X, Y, and labels. Y is a 1-dimensional numpy array containing the last column of the spreadsheet. X is a 2-dimensional numpy array whose first column is filled with 1s and whose remaining columns contain the data from the other columns of the spreadsheet. labels is a list of strings from the header in the spreadsheet.

    • format_prediction, which takes B (the vector of coefficients) and the labels that you created in read_excel_data, and returns a string like this:


predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) +

($-17,421.84 x miles_to_school)

    • score, which takes B, X, and Y and returns the \(R^2\) score.

When util.py is done, you should be able to run linreg_mi.py and linreg_scikit.py.
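The string formatting can be sketched with Python format specifiers. The version below is only an illustration: it assumes B[0] is the intercept, that labels lists just the feature names in the same order as B[1:], and it returns the formula on a single line rather than reproducing the line breaks shown above.

```python
def format_prediction(B, labels):
    """Render the intercept B[0] and the per-feature coefficients as a formula string."""
    terms = [f"${B[0]:,.2f}"]  # comma thousands separator, two decimals
    for coefficient, label in zip(B[1:], labels):
        terms.append(f"(${coefficient:,.2f} x {label})")
    return "predicted price = " + " + ".join(terms)


formula = format_prediction(
    [32362.85, 85.61, -17421.84], ["sqft_hvac", "miles_to_school"]
)
print(formula)
```

Placing the dollar sign before the format field is what produces the "$-17,421.84" style for negative coefficients.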



4.2    linreg_gd.py


In this, you will be using gradient descent to minimize the squared error. The result should be very nearly the same as linreg mi.py.


The features in properties.xlsx are on very different scales (2 bathrooms vs 50,000 square foot yards). As a result, converging would take a very, very long time if you don’t first standardize the features.

The first step is to find the mean and the standard deviation of each feature column of X (leave the leading column of 1s alone, since its standard deviation is 0). Use those to make each feature have a mean of 0 and a standard deviation of 1.

Then start with a guess of zero for all the coefficients. Do the following many times:

    • Calculate the gradient

    • Update your guess. (Multiply the gradient by -0.001 and add to the last guess.)

    • Compute and record the new mean squared error

When the gradient gets small (and thus the changes to the coefficients get small), stop. It should take a few hundred iterations.
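The loop described above can be sketched on synthetic data as follows. Several things here are assumptions rather than part of the assignment: the gradient is taken of the sum of squared errors (2 X^T (XB - Y)), the stopping threshold 1e-6 is arbitrary, and the data and seed are made up. The step size 0.001 is tied to how the gradient is scaled and to the number of rows, so it may need adjusting in other settings.

```python
import numpy as np

# Synthetic data: 100 points, 2 features on different scales (illustrative only).
rng = np.random.default_rng(0)
n = 100
raw = rng.normal(loc=[50.0, 3.0], scale=[10.0, 1.0], size=(n, 2))
Y = 5.0 + 0.3 * raw[:, 0] - 2.0 * raw[:, 1] + rng.normal(scale=0.1, size=n)

# Standardize the features, then prepend the column of 1s.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.hstack([np.ones((n, 1)), standardized])

B = np.zeros(X.shape[1])  # initial guess: all coefficients zero
learning_rate = 0.001
errors = []  # mean squared error after each update

for iteration in range(1, 10001):
    residuals = X @ B - Y
    gradient = 2.0 * X.T @ residuals  # gradient of the sum of squared errors (assumed)
    B = B + (-learning_rate) * gradient  # multiply by -0.001 and add to the last guess
    errors.append(np.mean((X @ B - Y) ** 2))
    if np.linalg.norm(gradient) < 1e-6:  # assumed stopping threshold
        break

print(f"Took {iteration} iterations to converge")
print(B)
```

The resulting B applies to the standardized inputs, so it still has to be converted back before it can be used with raw feature values.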

The coefficients that you have calculated are for standardized inputs. Using the means and standard deviations you computed earlier, adjust them to use unstandardized data. (The math for this is in the next section.)

Calculate and display the \(R^2\) score.

Plot the mean squared error vs. iterations. This will be most interesting if you use log scaling for both the x and y axes. Save it as err.png. Mine looks like this:

[Figure: err.png, mean squared error vs. iterations on log-log axes]
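A plot of this kind might be produced with matplotlib roughly as follows. The error history here is a made-up stand-in for the values recorded during gradient descent, and the axis labels and title are only suggestions.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Made-up error history standing in for the values recorded during descent.
errors = [1000.0 / (k**1.5) + 1.0 for k in range(1, 301)]
iterations = range(1, len(errors) + 1)

fig, ax = plt.subplots()
ax.plot(iterations, errors)
ax.set_xscale("log")  # log scaling on both axes
ax.set_yscale("log")
ax.set_xlabel("Iterations")
ax.set_ylabel("Mean squared error")
ax.set_title("Convergence of gradient descent")
fig.savefig("err.png")
plt.close(fig)
```

Starting the iteration count at 1 rather than 0 keeps every point visible on a logarithmic x axis.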




























4.3    Standardizing and compensating for standardizing

Using the matrix \(X\), you will calculate the vector of means \(M = [m_1, m_2, \ldots, m_d]\) and standard deviations \(S = [s_1, s_2, \ldots, s_d]\).

Then you will create a new matrix \(X'\) that has normalized each entry. The entries in column \(j\) are given by

\[ x'_j = \frac{x_j - m_j}{s_j} = \frac{x_j}{s_j} - \frac{m_j}{s_j} \]






Now the matrix \(X'\) has two nice properties:

    • The mean of every column is 0.

    • The standard deviation of every column is 1.
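Both properties are easy to confirm numerically. The sketch below uses a small made-up feature matrix (without the column of 1s) and assumes the population standard deviation, which is NumPy's default.

```python
import numpy as np

# Made-up features: square footage and bedrooms for four houses.
features = np.array(
    [[1200.0, 2.0], [1800.0, 3.0], [2400.0, 4.0], [3000.0, 3.0]]
)

M = features.mean(axis=0)  # vector of column means
S = features.std(axis=0)  # vector of column standard deviations (ddof=0, assumed)
X_prime = (features - M) / S  # broadcasting standardizes every column at once

print(X_prime.mean(axis=0))
print(X_prime.std(axis=0))
```

The printed means should be 0 and the standard deviations 1, up to floating-point error.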

When you use those standardized numbers to do linear regression, you will get a vector \(B' = [b'_0, b'_1, b'_2, \ldots, b'_d]\) which can be used for predictions like this:

\[ \hat{y} = b'_0 + b'_1 x'_1 + \ldots + b'_d x'_d \]

where the inputs have been standardized using the M and S that you calculated from the training data.

However, we really want the vector \(B = [b_0, b_1, b_2, \ldots, b_d]\) so that we can put non-standardized data \([x_1, \ldots, x_d]\) into the formula

\[ \hat{y} = b_0 + b_1 x_1 + \ldots + b_d x_d \]

Using the definition of \(x'_j\) from above we have:

\[ \hat{y} = b'_0 + b'_1 \left( \frac{x_1}{s_1} - \frac{m_1}{s_1} \right) + \ldots + b'_d \left( \frac{x_d}{s_d} - \frac{m_d}{s_d} \right) \]

Expanding and sorting we get:

\[ \hat{y} = \left( b'_0 - b'_1 \frac{m_1}{s_1} - \ldots - b'_d \frac{m_d}{s_d} \right) + \frac{b'_1}{s_1} x_1 + \ldots + \frac{b'_d}{s_d} x_d \]


Thus,

\[ b_0 = b'_0 - b'_1 \frac{m_1}{s_1} - \ldots - b'_d \frac{m_d}{s_d} \]

and for \(0 < j \le d\):

\[ b_j = \frac{b'_j}{s_j} \]



Use those for your final answer and the \(R^2\) calculation.
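The conversion can be sanity-checked with a short sketch. The function name unstandardize and the synthetic data are hypothetical; the standardized coefficients are obtained here with np.linalg.lstsq purely to have something to convert, not because the assignment prescribes it.

```python
import numpy as np


def unstandardize(B_prime, M, S):
    """Convert coefficients fitted on standardized features back to raw-feature units."""
    B = np.empty_like(B_prime)
    B[1:] = B_prime[1:] / S  # b_j = b'_j / s_j
    B[0] = B_prime[0] - np.sum(B_prime[1:] * M / S)  # b_0 = b'_0 - sum_j b'_j m_j / s_j
    return B


# Synthetic, noise-free data generated from known raw coefficients.
rng = np.random.default_rng(1)
raw = rng.normal(loc=[2000.0, 3.0], scale=[500.0, 1.0], size=(40, 2))
Y = 10.0 + 0.5 * raw[:, 0] - 4.0 * raw[:, 1]

M, S = raw.mean(axis=0), raw.std(axis=0)
X_prime = np.hstack([np.ones((40, 1)), (raw - M) / S])
B_prime = np.linalg.lstsq(X_prime, Y, rcond=None)[0]  # fit on standardized inputs

B = unstandardize(B_prime, M, S)
print(B)
```

Because the data were generated without noise, B should recover the raw coefficients 10, 0.5, and -4.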
































5    Scikit-Learn


In a job situation, you will use the sklearn implementation 99% of the time. It uses Singular Value Decomposition and pseudo-inverses, so it is usually faster and more reliable than the matrix inversion approach.
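For reference, a minimal sketch of fitting with scikit-learn on synthetic data follows. It is not the provided linreg_scikit.py; the data and seed are made up. Here LinearRegression fits its own intercept, so the features are passed without a column of 1s.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: 60 points, 3 features (illustrative only).
rng = np.random.default_rng(2)
features = rng.normal(size=(60, 3))
Y = 4.0 + features @ np.array([1.5, -2.0, 0.5])

model = LinearRegression()  # fit_intercept=True by default
model.fit(features, Y)

# Assemble a coefficient vector in the same layout as B: intercept first.
B = np.concatenate([[model.intercept_], model.coef_])
r2 = model.score(features, Y)  # coefficient of determination on the training data
print(B, r2)
```

If X already contains the column of 1s, one option is to construct the model with fit_intercept=False and read every coefficient from coef_.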


6    Criteria for success


If your name is Fred Jones, you will turn in a zip file called HW03_Jones_Fred.zip of a directory called HW03_Jones_Fred. It will contain:


    • linreg_mi.py (You don’t need to edit.)

    • linreg_scikit.py (You don’t need to edit.)




    • linreg_gd.py (Add about 22 lines of code.)

    • util.py (Add about 20 lines of code.)

    • err.png

    • properties.xlsx (You don’t need to edit.)


Be sure to format your Python code with black before you submit it.

We will run your code like this:

cd HW03_Jones_Fred

python3 linreg_mi.py properties.xlsx

python3 linreg_scikit.py properties.xlsx

python3 linreg_gd.py properties.xlsx

We expect the following output:

> python3 linreg_mi.py properties.xlsx

Read 519 rows, 5 features from ’properties.xlsx’.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

R2 = 0.875699

> python3 linreg_scikit.py properties.xlsx

Read 519 rows, 5 features from ’properties.xlsx’.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

R2 = 0.875699

> python3 linreg_gd.py properties.xlsx

Read 519 rows, 5 features from ’properties.xlsx’.

Took 352 iterations to converge

predicted price = $32,362.82 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,196.55 x bedrooms) + ($9,598.99 x bathrooms) + ($-17,421.85 x miles_to_school)

R2 = 0.875699


You will get 4 points for a well-written util.py that enables linreg_mi.py and linreg_scikit.py to get the right answer.


You will get 5 points for a well-written linreg_gd.py that uses gradient descent and gets approximately the same answer as linreg_mi.py.


You will get 1 point for an err.png that looks right.

Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.
































7    Extra help


A good video on gradient descent and linear regression: https://youtu.be/sDv4f4s2SB8





























