Project 3: Regularization in Least Squares


    • Project Summary

Regularized least squares is a family of methods that adds terms to the least squares formulation to constrain the solution. Regularization helps users incorporate properties that we want the resulting approximating function to have, such as the size of the coefficients, sparsity (how many are non-zero), and smoothness of the model. In this project, you will explore the least squares regularization techniques used in ridge regression and Tikhonov regularization. How do the different techniques affect the resulting least squares problem in terms of normal equations and other linear algebra techniques? You will also explore how these techniques can be used to minimize the effects of noise and other factors on the resulting regression lines. Possible options for the independent direction of this project include generalizations of Tikhonov regularization, L1 regularization (e.g. LASSO), continuous regularized least squares, and smoothing spline approximation (reproducing kernel Hilbert spaces).

    • Project Background

A key numerical tool for many data scientists and statisticians is the regression model. At this point, the primary algorithm we have explored for building these models is least squares, which aims to solve for a model x that minimizes the loss function
\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2, \qquad (1)
\]

where A is an n × m matrix, x is an m × 1 column vector, and b is an n × 1 column vector. In order for least squares approximations to produce a useful result, several conditions must be met. First, the resulting linear system needs to have a unique solution. Second, enough information about the problem must be known in order to choose an appropriate approximating function and avoid over-fitting. In addition to these requirements, it may be preferable for x to be sparse (have few non-zero entries) to allow for simpler interpretation of x.

Addressing these issues is non-trivial and an active area of research. In this project, you will explore some of the most common ways of avoiding them. These approaches reformulate the least squares problem, always aiming for greater numerical stability and sometimes for sparsity. The trade-off is a less accurate approximation than one would expect given the amount of available data.

2.1    Ridge Regression

The first method you will explore that modifies the least squares problem is ridge regression. In some cases the matrix A that represents the data we are using to build our regression will be ill-conditioned. This can be caused by noise in measurements or the inclusion of variables that have no real correlation with the output being predicted. When these conditions occur, it becomes possible that an exact solution to the original system does not exist, and the solution x will begin to exhibit undesirable properties, such as large oscillatory regression coefficients, as the least squares problem attempts to compensate for the abnormalities in the data. One way to fight the growth in regression coefficients is to change the loss function of the least squares problem in equation (1) to include a penalization term on the magnitude of x, as shown below:

\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2 + \lambda \|x\|_2^2, \qquad (2)
\]

where λ is a constant chosen to optimally penalize the ℓ2-norm of x.

As with the method of least squares, we do not search over candidate vectors x directly to minimize the ridge loss function; instead, we determine an estimator matrix E such that x = Eb. It can be shown that the estimator that provides the optimal solution to the matrix system Ax = b under the penalized loss in equation (2) is

\[
E_{\text{ridge}} = (A^T A + \lambda I)^{-1} A^T.
\]

It is the form of the ridge estimator that gives ridge regression its name. If λ ≫ 0, the values along the main diagonal of the matrix A^T A + λI inside the estimator become very large, forming a sort of "ridge" in the matrix.
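For concreteness, here is a minimal NumPy sketch of applying the ridge estimator; the function name and test data are placeholders, and solving the linear system is preferred over forming the inverse explicitly.

    import numpy as np

    def ridge_estimate(A, b, lam):
        """Placeholder helper: solve (A^T A + lam I) x = A^T b, i.e. apply the ridge estimator to b."""
        n_cols = A.shape[1]
        # Equivalent to E_ridge @ b, but solving the system is more stable than inverting.
        return np.linalg.solve(A.T @ A + lam * np.eye(n_cols), A.T @ b)

    # Placeholder test problem: a noisy overdetermined system.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 3))
    b = A @ np.array([3.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)
    print(ridge_estimate(A, b, lam=0.1))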

2.1.1    Questions to Investigate

    1. Deriving the Ridge Estimator

        (a) Given that ||x||_2^2 = x^T x, rewrite equation (2) in terms of matrix products. Hint: just like in the least squares case, you should get a quadratic.

        (b) Using methods from calculus and properties from linear algebra, determine the point x that minimizes equation (2).

    2. Exploring the Ridge Estimator

        (a) i. Sample the line y = 3x + 2 at 20 points and add some Gaussian noise.

            ii. Select 10 of these data points at random and use them to build a line of best fit (y = mx + b) for the data using ridge regression with λ = 0. We will call the other 10 points the validation set.

            iii. Calculate the sum of squared errors of your model on the other 10 points.

            iv. Increase λ by 0.1 and build another regression to test against the validation set. Does it perform better or worse?

            v. Repeat this process for different values of λ. What is the optimal value? Which model most accurately fits y = 3x + 2? (A minimal sketch of this train/validation loop appears after this list.)

        (b) Repeat the same process you performed in 2a, but this time sample the line y = x^2 and fit the model y = ax^5 + bx^4 + cx^3 + dx^2 + ex + f. How does ridge regression help improve this model?
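The following sketch illustrates one way to carry out the loop in part 2(a); the noise level, the random split, and the λ grid are arbitrary illustrative choices rather than prescribed values.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-5, 5, size=20)
    y = 3 * x + 2 + rng.normal(scale=1.0, size=20)       # noisy samples of y = 3x + 2

    idx = rng.permutation(20)
    train, val = idx[:10], idx[10:]                      # 10 points to fit, 10 to validate
    A_train = np.column_stack([x[train], np.ones(10)])   # columns correspond to m and b
    A_val = np.column_stack([x[val], np.ones(10)])

    for lam in np.arange(0.0, 1.01, 0.1):
        coef = np.linalg.solve(A_train.T @ A_train + lam * np.eye(2),
                               A_train.T @ y[train])
        sse = np.sum((A_val @ coef - y[val]) ** 2)       # validation error
        print(f"lambda={lam:.1f}  m={coef[0]:.3f}  b={coef[1]:.3f}  SSE={sse:.3f}")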



2.2    Tikhonov Regularization

Ridge regression is able to create more accurate approximations than the standard least squares approach when the corresponding linear system is underdetermined or the underlying physics being modeled is poorly understood. It does this by introducing an acceptable amount of bias into the problem. By bias we mean that, by penalizing the magnitude of the vector x, we bias our solution toward a smaller magnitude. Unfortunately, ridge regression does not solve all of the original issues with least squares, and it introduces some new issues that one would like to avoid. For example, the ridge penalty affects each of the variables in the regression equally. In other words, the ridge model will shrink all of the coefficients of x no matter how important they are. The next technique you will investigate, Tikhonov regularization, allows for more control over how the regularization will affect the solution.

Note that the minimization problem for ridge regression can be written as

\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2 + \|\Gamma I x\|_2^2,
\]

where Γ = √λ. Tikhonov regularization considers a more general minimization problem of the form

\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2 + \|\Gamma x\|_2^2,
\]

where Γ is a weight matrix that is chosen so that x will satisfy properties desired by the user, or so that certain vectors x̂ can be eliminated from the set of possible solutions. Typically, the Tikhonov regularization problem is recast as
\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2 + \lambda^2 \|Lx\|_2^2, \qquad (3)
\]

where L is a unitary matrix and λ is a weighting constant. For the remainder of this document, we consider the minimization problem in equation (3) when we refer to Tikhonov regularization.

Like ridge regression, Tikhonov regularization discourages the elements of the solution x from being oscillatory and large. An additional feature of Tikhonov regularization is that it is able to enforce smoothness in the derivative of the solution x. Here the derivative is meant in a finite difference sense; in other words, the derivative is approximated via weighted averages of neighboring entries. One of the most common finite difference formulas for approximating the first derivative of a function is the centered difference. The centered difference approximation of f'(t) is given by
\[
\frac{df}{dt}(t) \approx \frac{f(t + \Delta t) - f(t - \Delta t)}{2\,\Delta t}.
\]

The truncation error in this approximation is O((Δt)^2).

Remark 2.1. Finite difference approximations can be constructed by adding together linear combinations of Taylor series. The truncation error is of the order of the first term in the expansion of the remainder of the series that is not included in the approximation.
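As a worked instance of this remark, consider the Taylor expansions

\[
f(t \pm \Delta t) = f(t) \pm \Delta t\, f'(t) + \frac{(\Delta t)^2}{2} f''(t) \pm \frac{(\Delta t)^3}{6} f'''(t) + \cdots.
\]

Subtracting the second expansion from the first and dividing by 2Δt gives

\[
\frac{f(t + \Delta t) - f(t - \Delta t)}{2\,\Delta t} = f'(t) + \frac{(\Delta t)^2}{6} f'''(t) + \cdots,
\]

so the first omitted term is indeed of order O((Δt)^2).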


For a vector function x of length n with jth entry x_j, and assuming Δt = 1, the derivative of x can be approximated using the matrix product

\[
x' \approx Dx = \frac{1}{2}
\begin{bmatrix}
-1 & 0 & 1 & 0 & \cdots & 0 \\
0 & -1 & 0 & 1 & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{bmatrix}.
\qquad (4)
\]
The matrix D is an (n − 2) × n tridiagonal matrix.
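For reference, a minimal NumPy sketch of one way to assemble this matrix (the function name is a placeholder):

    import numpy as np

    def centered_difference_matrix(n):
        """Placeholder helper: the (n-2) x n matrix D with rows [..., -1/2, 0, 1/2, ...]."""
        D = np.zeros((n - 2, n))
        for i in range(n - 2):
            D[i, i] = -0.5       # coefficient of x_{i}
            D[i, i + 2] = 0.5    # coefficient of x_{i+2}
        return D

    print(centered_difference_matrix(6))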

In Tikhonov regularization, this derivative matrix is used in the regularization term as follows:









\[
\operatorname*{argmin}_{x} \; \|Ax - b\|_2^2 + \lambda^2 \|Dx\|_2^2. \qquad (5)
\]


With this new loss function, the corresponding least squares estimator can be derived and the final solution to the problem computed. To determine an optimal value for λ, the simplest commonly used technique is cross-validation. Cross-validation is a fancy way of saying guess and check: increase the magnitude of λ until the error of the model is minimized on a data set that is separate from the data used to build the model.
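As an illustration of equation (5) and the guess-and-check selection of λ, the sketch below smooths a noisy signal by taking A to be the identity (so x is the smoothed signal itself) and solves each regularized problem by stacking the penalty into an ordinary least squares system, which is algebraically equivalent to minimizing equation (5); the signal, noise level, and λ values are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    t = np.linspace(0, 2 * np.pi, 50)
    y = np.cos(t) + 0.3 * rng.normal(size=t.size)        # noisy signal (illustrative)

    n = t.size
    A = np.eye(n)                                        # A = I: fit the signal values directly
    D = np.zeros((n - 2, n))                             # centered difference matrix from equation (4)
    for i in range(n - 2):
        D[i, i], D[i, i + 2] = -0.5, 0.5

    for lam in [0.0, 1.0, 10.0]:
        # Stacked form: || [A; lam*D] x - [y; 0] ||^2 equals ||Ax - y||^2 + lam^2 ||Dx||^2.
        A_aug = np.vstack([A, lam * D])
        b_aug = np.concatenate([y, np.zeros(n - 2)])
        x_hat, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
        misfit = np.sum((x_hat - y) ** 2)
        roughness = np.sum((D @ x_hat) ** 2)
        print(f"lambda={lam:5.1f}  data misfit={misfit:.3f}  roughness={roughness:.3f}")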

2.2.1    Questions to Consider

    1. Show that D is unitary.

    2. Using the methods you used to derive the ridge estimator, derive the estimator needed to solve the Tikhonov regularization problem.

    3. Sample the curve y = sin(x) + sin(5x) and add Gaussian noise. Use Tikhonov regularization to compute a regression for this data set with the smoothing penalty explained above. How well does the model fit? How does it compare to the least squares solution? How do the derivatives compare?

    4. Other finite difference methods exist to approximate the first derivative of a discrete signal. What happens if you use those methods to penalize this model? One option is the forward finite difference formula, which has a truncation error of O(Δt); a sketch of the corresponding penalty matrix appears after this list. The forward difference is given as

\[
\frac{df}{dt}(t) \approx \frac{f(t + \Delta t) - f(t)}{\Delta t}.
\]
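For item 4, the analogous penalty matrix for the forward difference (assuming Δt = 1; the function name is a placeholder) could be assembled as follows:

    import numpy as np

    def forward_difference_matrix(n):
        """Placeholder helper: the (n-1) x n matrix whose i-th row maps x to x[i+1] - x[i]."""
        F = np.zeros((n - 1, n))
        for i in range(n - 1):
            F[i, i], F[i, i + 1] = -1.0, 1.0
        return F

    print(forward_difference_matrix(5))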


    • Software Expectations

You may use software for this project to solve the systems of equations resulting from the ridge and Tikhonov estimators. By this we mean you do not need to code up your own version of Gaussian elimination, though it is easy to build your own solver using standard linear algebra packages.

If you would like to dive deeper into these methods (possibly for the independent direction), you can explore the Python library cvxpy to perform your regressions. Using this library is neither required nor expected. Documentation on this library can be found at https://www.cvxpy.org/.
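For reference, a minimal cvxpy sketch of the ridge problem in equation (2) might look like the following; the data are randomly generated and λ = 0.1 is an arbitrary placeholder.

    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(20, 3))
    b = A @ np.array([3.0, -1.0, 0.5]) + 0.1 * rng.normal(size=20)
    lam = 0.1

    x = cp.Variable(3)
    objective = cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.sum_squares(x))
    cp.Problem(objective).solve()
    print(x.value)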


    • Independent Directions

Possible next steps for this project include, but are not limited to:

    1. Explore LASSO and elastic net constraints for the least squares problem. These methods adjust the loss function to favor non-zero coefficients only for the most important variables/terms, promoting sparsity in the final system.

    2. Explore the applications of least squares in system identification techniques. One of these techniques, the Sparse Identification of Non-linear Dynamics algorithm (SINDy), employs regularized least squares to fit differential equations to discrete data sets by approximating derivatives with finite difference methods. This extension is very cool but will require additional research by the students.

    3. The application of regularization in continuous least squares models. These methods seek to build least squares approximations to a function over a given interval instead of over a predetermined discrete set of points.


    • Helpful Sources

        1. https://epubs.siam.org/doi/abs/10.1137/S0895479897326432

        2. https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

















