[60 points] Linear Regression - Body Weight vs Brain Weight
The dataset at http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt is a collection of the average body weight and brain weight for a number of mammal species. For this homework, we have already pre-processed the data in the file data.csv. Each line in this file corresponds to a mammal species, with the first column indicating the body weight and the second column indicating the brain weight. The values are separated by commas.
Write a program BodyVsBrain.java with the following command line format:
$java BodyVsBrain FLAG [arg1 arg2]
where the two optional arguments are real valued.
1. [5 points] When FLAG=100, we will compute the statistics of the dataset, like the sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and the sample standard deviation $\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}$.
Print n, the number of data points, on the first line, the sample mean and the sample standard deviation of body weight on the second line, and the sample mean and the sample standard deviation of brain weight on the third line.
For real values in this homework, keep four digits after the decimal point (this is just for printing; do not round off the numbers for further calculations). For example (the numbers are made-up):
$java BodyVsBrain 100
100
87.6543 40.1234
32.1098 10.1234
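If it helps to see the pieces in code, the following is a minimal, non-authoritative sketch of the FLAG=100 computation. It assumes data.csv sits in the working directory; the class and helper names (StatsSketch, mean, std) are illustrative only and not required by the assignment.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class StatsSketch {
    public static void main(String[] args) throws IOException {
        List<Double> body = new ArrayList<>();
        List<Double> brain = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.trim().isEmpty()) continue;       // skip blank lines
                String[] parts = line.split(",");
                body.add(Double.parseDouble(parts[0].trim()));    // body weight
                brain.add(Double.parseDouble(parts[1].trim()));   // brain weight
            }
        }
        System.out.println(body.size());
        System.out.printf("%.4f %.4f%n", mean(body), std(body));
        System.out.printf("%.4f %.4f%n", mean(brain), std(brain));
    }

    static double mean(List<Double> v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.size();
    }

    static double std(List<Double> v) {
        double m = mean(v), s = 0.0;
        for (double x : v) s += (x - m) * (x - m);
        return Math.sqrt(s / (v.size() - 1));              // sample std uses n - 1
    }
}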
2. [5 points] We will perform linear regression with the model
\[
f(x) = \beta_0 + \beta_1 x.
\]
We first define the mean squared error as a function of $\beta_0, \beta_1$:
\[
\mathrm{MSE}(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i)^2.
\]
When FLAG=200, arg1=$\beta_0$ and arg2=$\beta_1$. Print the corresponding MSE.
$java BodyVsBrain 200 0 0
834962.5029
$java BodyVsBrain 200 100.00 0
805204.5061
$java BodyVsBrain 200 -50.9 0.5
251205.9923
$java BodyVsBrain 200 50.55 50.55
2299433922.4050
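As a rough illustration (not the required structure of BodyVsBrain.java), the MSE above can be computed with a single loop over the data. The arrays x and y are assumed to already hold the body and brain weight columns.

// Mean squared error of the line f(x) = beta0 + beta1 * x over the data set.
static double mse(double beta0, double beta1, double[] x, double[] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
        double r = beta0 + beta1 * x[i] - y[i];   // residual of point i
        sum += r * r;
    }
    return sum / x.length;                        // average of the squared residuals
}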
3. [10 points] We perform gradient descent on the MSE. At the current parameter $(\beta_0, \beta_1)$, the gradient is defined by the vector of partial derivatives
\[
\frac{\partial \, \mathrm{MSE}(\beta_0, \beta_1)}{\partial \beta_0} = \frac{2}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \tag{1}
\]
\[
\frac{\partial \, \mathrm{MSE}(\beta_0, \beta_1)}{\partial \beta_1} = \frac{2}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i - y_i) \, x_i. \tag{2}
\]
When FLAG=300, arg1=$\beta_0$ and arg2=$\beta_1$. Print the corresponding gradient as two numbers on separate lines.
$java BodyVsBrain 300 0 0
-397.5800
-1650157.9772
$java BodyVsBrain 300 100.00 0
-197.5800
-1593531.1385
$java BodyVsBrain 300 -50.9 0.5
-216.2458
-747355.5257
$java BodyVsBrain 300 50.55 50.55
28328.3870
92565806.1970
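A possible helper for this part, again assuming arrays x and y hold the data; it returns the two partial derivatives of equations (1) and (2).

// Gradient of MSE at (beta0, beta1): returns { dMSE/dbeta0, dMSE/dbeta1 }.
static double[] gradient(double beta0, double beta1, double[] x, double[] y) {
    int n = x.length;
    double g0 = 0.0, g1 = 0.0;
    for (int i = 0; i < n; i++) {
        double r = beta0 + beta1 * x[i] - y[i];
        g0 += r;              // sum for equation (1)
        g1 += r * x[i];       // sum for equation (2)
    }
    return new double[] { 2.0 * g0 / n, 2.0 * g1 / n };
}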
4. [10 points] Gradient descent starts from an initial parameter $(\beta_0^{(0)}, \beta_1^{(0)})$ and iterates the following updates at time $t = 1, 2, \ldots, T$:
\[
\beta_0^{(t)} = \beta_0^{(t-1)} - \eta \, \frac{\partial \, \mathrm{MSE}(\beta_0^{(t-1)}, \beta_1^{(t-1)})}{\partial \beta_0} \tag{3}
\]
\[
\beta_1^{(t)} = \beta_1^{(t-1)} - \eta \, \frac{\partial \, \mathrm{MSE}(\beta_0^{(t-1)}, \beta_1^{(t-1)})}{\partial \beta_1}. \tag{4}
\]
When FLAG=400, arg1=$\eta$ (the step size) and arg2=$T$. Start from the initial parameter $(\beta_0^{(0)}, \beta_1^{(0)}) = (0, 0)$ and perform $T$ iterations of gradient descent. In each iteration, print the following on one line, separated by spaces: $t$, $\beta_0^{(t)}$, $\beta_1^{(t)}$, $\mathrm{MSE}(\beta_0^{(t)}, \beta_1^{(t)})$. For example,
$java BodyVsBrain 400 1e-7 5
1 0.0000 0.1650 588028.7174
2 0.0001 0.2993 424542.0532
3 0.0001 0.4085 316302.9585
4 0.0001 0.4974 244641.4458
5 0.0001 0.5698 197196.7363
$java BodyVsBrain 400 1e-8 5
1 0.0000 0.0165 807985.9714
2 0.0000 0.0327 782005.3556
3 0.0000 0.0486 756983.8884
4 0.0000 0.0642 732886.1600
5 0.0000 0.0795 709678.0680
$java BodyVsBrain 400 1e-9 5
1 0.0000 0.0017 832242.0182
2 0.0000 0.0033 829531.6620
3 0.0000 0.0049 826831.3965
4 0.0000 0.0066 824141.1842
5 0.0000 0.0082 821460.9876
$java BodyVsBrain 400 1e-5 5
1 0.0040 16.5016 227288400.6112
2 -0.0855 -274.4627 70632902935.6433
3 1.4727 4855.9687 21960124169280.0430
4 -26.0212 -85606.4253 6827522877409993.0000
5 458.7454 1509472.8203 2122714266189187072.0000
Note that with $\eta = 10^{-5}$, gradient descent diverges.
The following is not required for hand in, but try different initial parameters, different $\eta$, and a much larger $T$, and see how small you can make the MSE.
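One way to organize the FLAG=400 loop is sketched below, reusing the illustrative mse and gradient helpers from the earlier sketches. Note that both components of the gradient are computed from the old parameter values before either parameter is overwritten (see the hints at the end).

// T iterations of batch gradient descent from (0, 0), printing
// t, beta0, beta1, MSE on each line as required for FLAG=400.
static void gradientDescent(double eta, int T, double[] x, double[] y) {
    double beta0 = 0.0, beta1 = 0.0;
    for (int t = 1; t <= T; t++) {
        double[] g = gradient(beta0, beta1, x, y);   // evaluated at the old betas
        beta0 = beta0 - eta * g[0];
        beta1 = beta1 - eta * g[1];
        System.out.printf("%d %.4f %.4f %.4f%n",
                t, beta0, beta1, mse(beta0, beta1, x, y));
    }
}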
5. [10 points] Instead of using gradient descent, we can compute the closed-form solution for the parameters directly. For ordinary least squares in 1D, this is
\[
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\]
where $\bar{x} = \left(\sum_i x_i\right)/n$ and $\bar{y} = \left(\sum_i y_i\right)/n$. Required: When FLAG=500, print $\hat{\beta}_0$, $\hat{\beta}_1$ (note the order), and the corresponding MSE on a single line separated by spaces.
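A sketch of the closed-form computation, under the same assumption that the arrays x and y hold the two data columns; the helper name closedForm is illustrative.

// Ordinary least squares in 1D:
//   beta1Hat = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
//   beta0Hat = ybar - beta1Hat * xbar
// Returned as { beta0Hat, beta1Hat }.
static double[] closedForm(double[] x, double[] y) {
    int n = x.length;
    double xBar = 0.0, yBar = 0.0;
    for (int i = 0; i < n; i++) { xBar += x[i]; yBar += y[i]; }
    xBar /= n;
    yBar /= n;
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += (x[i] - xBar) * (y[i] - yBar);
        den += (x[i] - xBar) * (x[i] - xBar);
    }
    double beta1Hat = num / den;
    double beta0Hat = yBar - beta1Hat * xBar;
    return new double[] { beta0Hat, beta1Hat };
}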
The following is not required for hand in, but give it some thought: what does a negative $\hat{\beta}_0$ mean?
6. [5 points] Using $\hat{\beta}_0, \hat{\beta}_1$ you can predict the brain weight of a species given its body weight. Required: When FLAG=600, arg1=body weight. Print a single real number, which is the predicted brain weight for a species with that body weight. Use the closed-form solution to make the prediction.
For example,
$java BodyVsBrain 600 500
394.6009
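The prediction itself is then a single evaluation of the fitted line; a small sketch using the illustrative closedForm helper from the previous part:

// FLAG=600: predicted brain weight for a given body weight,
// using the closed-form parameters.
static double predict(double bodyWeight, double[] x, double[] y) {
    double[] beta = closedForm(x, y);
    return beta[0] + beta[1] * bodyWeight;
}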
The following is not required for hand in, but give it some thought:
What’s the prediction for the brain weight of a species with body weight 50? What does that say about the model?
7. [5 points] If you are agonizing over your inability to get gradient descent to match the closed-form solution in question 5, you are not alone. The culprit is the scale of the input $x$ compared to the scale of the implicit offset value 1 (think $\beta_0 = \beta_0 \cdot 1$). Gradient descent converges slowly when these scales differ greatly, a situation known as a bad condition number in optimization. When FLAG=700, we ask you to first normalize the input $x$ (note: not the labels $y$):
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{5}
\]
\[
\mathrm{std}_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{6}
\]
\[
x_i \leftarrow (x_i - \bar{x}) / \mathrm{std}_x. \tag{7}
\]
Then proceed exactly as when FLAG=400. The two arguments are again arg1=$\eta$ and arg2=$T$. For example,
$java BodyVsBrain 700 0.1 5
1 39.7580 165.2826 574430.4723
2 71.5644 298.0419 406416.1542
3 97.0095 404.6776 298065.1590
4 117.3656 490.3301 228190.2996
5 133.6505 559.1284 183128.3050
$java BodyVsBrain 700 1 5
1 397.5800 1652.8263 790900.6521
2 0.0000 53.3170 749635.6514
3 397.5800 1601.2292 710989.9691
4 0.0000 103.2496 674797.3426
5 397.5800 1552.9073 640902.0629
$java BodyVsBrain 700 0.01 5
1 3.9758 16.5283 806348.0413
2 7.8721 32.7313 778849.6074
3 11.6904 48.6155 752423.6728
4 15.4324 64.1871 727028.4073
5 19.0996 79.4523 702623.6118
With $\eta = 0.2$ your model should converge within 25 iterations.
(Note the $\beta$'s are now for the normalized version of $x$, but you can easily translate them back for the original $x$ with a little algebra. This is not required for the homework.)
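A sketch of the normalization step in equations (5)-(7); after it, the same gradient descent loop as in FLAG=400 is run on the normalized x. The helper name normalize is illustrative.

// Returns a copy of x with each entry replaced by (x_i - xBar) / stdX.
static double[] normalize(double[] x) {
    int n = x.length;
    double xBar = 0.0;
    for (double v : x) xBar += v;
    xBar /= n;                                    // equation (5)
    double ss = 0.0;
    for (double v : x) ss += (v - xBar) * (v - xBar);
    double stdX = Math.sqrt(ss / (n - 1));        // sample std, equation (6)
    double[] z = new double[n];
    for (int i = 0; i < n; i++) z[i] = (x[i] - xBar) / stdX;   // equation (7)
    return z;
}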
8. [10 points] Now we implement Stochastic Gradient Descent (SGD). With everything the same as the previous part (including $x$ normalization), we modify the definition of the gradient in equations (1) and (2) as follows. In iteration $t$ we randomly pick one of the $n$ items. Say we picked $(x_{j_t}, y_{j_t})$. We approximate the gradient using that item only:
\[
\frac{\partial \, \mathrm{MSE}(\beta_0, \beta_1)}{\partial \beta_0} \approx 2 (\beta_0 + \beta_1 x_{j_t} - y_{j_t}) \tag{8}
\]
\[
\frac{\partial \, \mathrm{MSE}(\beta_0, \beta_1)}{\partial \beta_1} \approx 2 (\beta_0 + \beta_1 x_{j_t} - y_{j_t}) \, x_{j_t}. \tag{9}
\]
When FLAG=800, print the same information as in part 7. For example (your results will differ because of randomness in the items picked):
$java BodyVsBrain 800 0.1 5
1 0.3240 -0.0946 834990.2247
2 0.2655 -0.0769 834984.1173
3 2.2096 -0.4283 834796.9149
4 2.5849 -0.5191 834799.7060
5 107.9122 44.4145 732235.1437
Since our data set is small, there is little advantage of SGD over gradient descent. However, on large data sets SGD becomes more desirable.
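A sketch of the SGD loop, under the same assumptions as the earlier sketches (the array x already normalized, the illustrative mse helper used for the printed error); only the gradient computation changes relative to FLAG=700.

// Stochastic gradient descent for FLAG=800: each iteration uses one randomly
// chosen data point (equations (8) and (9)) instead of the full-data gradient.
static void sgd(double eta, int T, double[] x, double[] y) {
    java.util.Random rng = new java.util.Random();
    double beta0 = 0.0, beta1 = 0.0;
    for (int t = 1; t <= T; t++) {
        int j = rng.nextInt(x.length);            // pick item j_t uniformly at random
        double r = beta0 + beta1 * x[j] - y[j];
        double g0 = 2.0 * r;                      // equation (8)
        double g1 = 2.0 * r * x[j];               // equation (9)
        beta0 = beta0 - eta * g0;
        beta1 = beta1 - eta * g1;
        System.out.printf("%d %.4f %.4f %.4f%n",
                t, beta0, beta1, mse(beta0, beta1, x, y));
    }
}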
Hints:
Update $\beta_0, \beta_1$ simultaneously in an iteration. Don't use the new $\beta_0$ to calculate $\beta_1$.
Use double instead of float in Java.
Don’t round the variables themselves to 4 digits in the middle stages. That is for printing only.
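For the last hint, Java's format strings handle the four-digit printing without touching the stored values, for example:

double value = 87.65432109;                        // keep full precision internally
System.out.printf("%.4f%n", value);                // prints 87.6543
String s = String.format("%.4f %.4f", 1.0, 2.0);   // or build a whole output line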