I. Problem
1. Linear regression with single variable by built-in function
To choose the most relevant feature, first plot every feature against the target. Then compute the weight and the bias, make a simple prediction, and record your accuracy.
2. Linear regression with single variable by your own gradient descent
In this case, implement linear regression with your own gradient descent model. Start from a random initial weight and bias (choose reasonable values), then update the coefficients iteratively to obtain your result.
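A minimal sketch of such a gradient descent loop is given below. The function name `fit_gd`, the learning rate, the iteration count, the initial range [-1, 1], and the toy data are all assumptions made for illustration. The input is assumed to be scaled to [0, 1]; raw values in kg per m3 would likely need scaling or a much smaller learning rate.

```python
import random


def fit_gd(xs, ys, learning_rate=0.1, iterations=2000, seed=0):
    """Fit y = w * x + b by batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Random but reasonable starting point (assumed range).
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2).
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, with x already scaled to [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
w, b = fit_gd(xs, ys)
print(w, b)
```

Changing `seed` changes the starting point, which is a simple way to check how sensitive the result is to initialization.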
3. Linear regression with multi-variable by your own gradient descent
In this case, build a multi-variable model. Record the MSE (mean squared error) and the R2 (coefficient of determination) for both training and testing. Also compare two update schemes: updating only a single weight wj in each iteration versus updating the whole weight vector w in each iteration.
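The two update schemes can be sketched as below, under one possible reading of the task: `mode="all"` updates every weight in each iteration, while `mode="single"` cycles through the weights and updates only one wj per iteration. The function names, zero initialization (chosen here for reproducibility), hyperparameters, and the noiseless two-feature toy data are assumptions, and the bias is assumed to be updated in every iteration in both modes.

```python
def predict(X, w, b):
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in X]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def gradient(X, y, w, b):
    """Gradient of the MSE with respect to each weight and the bias."""
    n = len(X)
    errors = [p - t for p, t in zip(predict(X, w, b), y)]
    grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, X))
              for j in range(len(w))]
    grad_b = 2.0 / n * sum(errors)
    return grad_w, grad_b


def fit(X, y, mode="all", learning_rate=0.1, iterations=5000):
    w, b = [0.0] * len(X[0]), 0.0
    for step in range(iterations):
        grad_w, grad_b = gradient(X, y, w, b)
        if mode == "all":
            # Update the whole weight vector w at once.
            w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        else:
            # Update only one coordinate wj, cycling through j.
            j = step % len(w)
            w[j] -= learning_rate * grad_w[j]
        b -= learning_rate * grad_b
    return w, b


# Toy data from y = 3*x1 - 2*x2 + 0.5, features scaled to [0, 1].
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25], [0.25, 0.75]]
y_train = [0.5, 3.5, -1.5, 1.5, 1.5, -0.25]
X_test = [[0.75, 0.5], [0.2, 0.9]]
y_test = [1.75, -0.7]

w_all, b_all = fit(X_train, y_train, mode="all")
w_one, b_one = fit(X_train, y_train, mode="single")
for name, (w, b) in {"all": (w_all, b_all), "single": (w_one, b_one)}.items():
    print(name,
          "train MSE", mse(y_train, predict(X_train, w, b)),
          "test MSE", mse(y_test, predict(X_test, w, b)),
          "test R2", r2(y_test, predict(X_test, w, b)))
```

For a fair comparison on the real data, it helps to record the MSE of both modes at the same iteration counts, since the single-coordinate scheme does less work per iteration.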
4. Polynomial regression by your own gradient descent
In this case, build a quadratic or higher-degree polynomial model (i.e. the degree of the polynomial is >= 2). Make your R2 as high as you can. Record the MSE and the R2 for both training and testing.
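One common way to build such a model is to expand each input into its powers and then run the same gradient descent on the expanded features. The sketch below does this for a single variable; `expand`, `fit_poly`, the hyperparameters, and the toy data are assumptions. With several input variables the expansion would normally also include cross-terms such as x1*x2, which are omitted here.

```python
def expand(x, degree):
    """Map a scalar x to the polynomial features [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]


def fit_poly(xs, ys, degree=2, learning_rate=0.1, iterations=5000):
    """Fit y = b + w[0]*x + w[1]*x^2 + ... by gradient descent on the MSE."""
    rows = [expand(x, degree) for x in xs]
    w, b, n = [0.0] * degree, 0.0, len(xs)
    for _ in range(iterations):
        errors = [sum(wj * fj for wj, fj in zip(w, row)) + b - y
                  for row, y in zip(rows, ys)]
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, rows))
                  for j in range(degree)]
        grad_b = 2.0 / n * sum(errors)
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


# Toy data from y = 1 + 2x + 3x^2, with x scaled to [-1, 1].
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0, 0.75, 1.0, 2.75, 6.0]
w, b = fit_poly(xs, ys, degree=2)
print(w, b)
```

Scaling the inputs before expansion matters more here than in the linear case, because high powers of large raw values make the gradients very uneven.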
5. (Bonus) Build a different regression model that achieves r2_score > 0.87. Hint: You can also change your loss function.
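As one illustration of changing the loss function, the sketch below swaps the squared error for the Huber loss, which is quadratic for small errors and linear for large ones and therefore reacts less strongly to outliers. This is only one possible choice, not a recommendation from the assignment, and whether it helps reach the target r2_score on Concrete_Data is not claimed here. The names `huber_gradient` and `fit_huber`, the value of `delta`, and the toy data are assumptions.

```python
def huber_gradient(error, delta=1.0):
    """Derivative of the Huber loss with respect to the prediction."""
    if abs(error) <= delta:
        return error  # quadratic region: loss = 0.5 * error^2
    return delta if error > 0 else -delta  # linear region: slope is +/- delta


def fit_huber(xs, ys, delta=1.0, learning_rate=0.1, iterations=5000):
    """Fit y = w * x + b by gradient descent on the mean Huber loss."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(iterations):
        grads = [huber_gradient(w * x + b - y, delta) for x, y in zip(xs, ys)]
        w -= learning_rate / n * sum(g * x for g, x in zip(grads, xs))
        b -= learning_rate / n * sum(grads)
    return w, b


# Toy data from y = 2x + 1, with x scaled to [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
w, b = fit_huber(xs, ys)
print(w, b)
```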
II. Dataset: Concrete_Data
Number of Instances: 1030
Number of Attributes: 8 input attributes + 1 output attribute
Attribute information:
1. Cement -- quantitative -- kg in a m3 mixture -- Input Variable
2. Blast Furnace Slag -- quantitative -- kg in a m3 mixture -- Input Variable
3. Fly Ash -- quantitative -- kg in a m3 mixture -- Input Variable
4. Water -- quantitative -- kg in a m3 mixture -- Input Variable
5. Superplasticizer -- quantitative -- kg in a m3 mixture -- Input Variable
6. Coarse Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
7. Fine Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
8. Age -- quantitative -- Day (1~365) -- Input Variable
9. Concrete compressive strength -- quantitative -- MPa -- Output Variable
Testing data = 20% of the whole dataset
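The 80/20 split can be sketched as below. Shuffling before splitting and the fixed seed are assumptions (the assignment does not say how the testing rows are chosen), and `train_test_split` here is a hypothetical helper written for this example, with integer placeholders standing in for the 1030 dataset rows.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle row indices and hold out test_ratio of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Placeholder rows: one integer per instance instead of real measurements.
rows = list(range(1030))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Keeping the seed fixed makes the reported training and testing scores reproducible across runs and across team members.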
III. Report & Scoring
This is a team-based programming assignment, so each team should submit only one report and one copy of the source code to E3.
The report should contain the following:
1. The environments the team members are using (5%)
2. Visualization of all the features with the target (5%)
3. The code, graph, r2_score, weight and bias for problem 1 (10%)
4. The code, graph, r2_score, weight and bias for problem 2 (25%)
5. Compare Problem 1 and Problem 2 and show what you got (5%)
6. The code, MSE, and the r2_score for problem 3 (20%)
7. Compare the performance of the two different update methods.
8. The code, MSE, and the r2_score for problem 4 (20%)
9. Answer the following questions (5%)
1. What is overfitting?
2. Stochastic gradient descent is also a kind of gradient descent. What is the benefit of using SGD?
3. Why may different initial values for a GD model lead to different results?
4. What is a bad learning rate? What problems occur if we use one?
5. After finishing this homework, what have you learned, what problems did you encounter, and how did you solve them?
10. Bonus (10%)
There are some rules to follow:
• C, C++, Java, Python, and Matlab are allowed. For visualization, Excel or other programs may be used.
• The report must be in PDF format.
• Attach your code to your submission.
• No cheating or plagiarism.
• Late submission: your score *= 0.8