Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP)
Suppose you are given a dataset $D = \{X, Y\}$, where

$$X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^N, \qquad Y = [y_1, y_2, \dots, y_N] \in \mathbb{R}^N,$$

and $x_i \in \mathbb{R}$, $y_i \in \mathbb{R}$ for all $i = 1, \dots, N$. Moreover, suppose that:

$$y_i = \theta_1 + \theta_2 x_i + \dots + \theta_K (x_i)^{K-1} + e_i \qquad \forall i = 1, \dots, N \tag{1}$$
where $e_i \sim \mathcal{N}(0, \sigma^2)$ is random Gaussian noise and $\theta = (\theta_1, \dots, \theta_K)^T$. It is known that the Maximum Likelihood Estimation (MLE) approach works by defining the conditional probability of $y$ given $x$, $p_\theta(y|x)$, and then optimizing the parameters $\theta$ to maximize this probability distribution over $D$. Moreover, it is also known that this approach can be made equivalent to the deterministic approach to such problems (the Least Squares method) by taking the negative log of $p_\theta(y|x)$. Indeed, by assuming that the noise $e_i$ is Gaussian for each $i = 1, \dots, N$, we have (up to additive and multiplicative constants that do not depend on $\theta$)

$$\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{N} p_\theta(y_i|x_i) = \arg\min_\theta \; \frac{1}{2}\,\| \Phi(X)\theta - Y \|_2^2.$$
Note that the above equation is equivalent to the function you optimized in Exercise 3 of Lab 3 with GD, with $A := \Phi(X)$, $x := \theta$ and $b := Y$.
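As a quick numerical sanity check of this equivalence, here is a minimal sketch (not part of the required deliverables; the toy setup and the use of `scipy.stats.norm` are our own choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setup: a small Vandermonde model with known parameters.
N, K, sigma = 20, 3, 0.5
x = rng.uniform(-1, 1, size=N)
Phi = np.vander(x, K, increasing=True)        # Phi[i, j] = x_i**j
theta = rng.normal(size=K)
y = Phi @ theta + sigma * rng.normal(size=N)

# Negative log-likelihood under the Gaussian noise assumption.
nll = -np.sum(norm.logpdf(y, loc=Phi @ theta, scale=sigma))

# Least-squares loss, rescaled and shifted by the theta-independent constants.
lsq = 0.5 * np.sum((Phi @ theta - y) ** 2)
const = N * np.log(sigma * np.sqrt(2 * np.pi))
print(np.isclose(nll, lsq / sigma**2 + const))   # True
```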
When it is unclear how to set the parameter $K$ and the error plot cannot be used, the Maximum a Posteriori (MAP) approach is required.
To show how it works, suppose that we know that the parameters $\theta$ are normally distributed, $\theta \sim \mathcal{N}(0, \sigma^2 I)$. Then we can use Bayes' Theorem to express the a posteriori probability of $\theta$ given $X$ and $Y$ as
$$p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)}$$
The MAP solution searches for a set of parameters $\theta$ that maximizes $p(\theta \mid X, Y)$. Following the same reasoning as before,
$$\theta_{MAP} = \arg\max_\theta \; p(\theta \mid X, Y) = \arg\min_\theta \left[ -\sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) - \log p(\theta) \right]$$
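Under the Gaussian assumptions above, both terms can be written explicitly and, up to constants independent of $\theta$, the MAP objective reduces to a regularized least-squares (ridge) problem. The following is a sketch: we write $\sigma_\theta^2$ for the prior variance (which may or may not coincide with the noise variance $\sigma^2$) and introduce $\lambda$ as the ratio of the two, an assumption on how the regularization parameter used in the tasks below is defined:

$$\theta_{MAP} = \arg\min_\theta \; \frac{1}{2\sigma^2}\|\Phi(X)\theta - Y\|_2^2 + \frac{1}{2\sigma_\theta^2}\|\theta\|_2^2 = \arg\min_\theta \; \frac{1}{2}\|\Phi(X)\theta - Y\|_2^2 + \frac{\lambda}{2}\|\theta\|_2^2, \qquad \lambda := \frac{\sigma^2}{\sigma_\theta^2}.$$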
Given the two optimization problems above, you are required to implement a program that compares the two solutions, which we will refer to as $\theta_{MLE}$ and $\theta_{MAP}$.
1. Define a test problem in the following way:
• Let the user fix a positive integer $K > 0$, and define $\theta_{true} = (1, 1, \dots, 1)^T$ (you can also consider a different $\theta_{true}$);
• Define an input dataset
$$X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^N,$$
where the $x_i$ are $N$ uniformly distributed datapoints in the interval $[a, b]$, where $a < b$ are values that the user can select;
• Given a set of functions $\{\phi_1, \phi_2, \dots, \phi_K\}$, define the generalized Vandermonde matrix $\Phi(X) \in \mathbb{R}^{N \times K}$, whose element in position $(i, j)$ is $\phi_j(x_i)$. In particular, write a function defining the classical Vandermonde matrix, where $\phi_j(x) = x^{j-1}$;
• Given a variance $\sigma^2 > 0$ defined by the user, compute
$$Y = \Phi(X)\theta_{true} + e,$$
where $e \sim \mathcal{N}(0, \sigma^2 I)$ is Gaussian-distributed noise with variance $\sigma^2$. Note that the test problem defined in this way is very similar to what we did to define a test problem in the first Lab; a sketch of this construction is given right after this item.
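A minimal sketch of the test-problem construction, assuming a NumPy-based implementation (the function and variable names `vandermonde` and `build_test_problem` are our own, not prescribed by the assignment):

```python
import numpy as np

def vandermonde(x, K):
    """Classical Vandermonde matrix: element (i, j) is x_i**(j-1)."""
    return np.vander(x, K, increasing=True)

def build_test_problem(N, K, a, b, sigma2, rng=None):
    """Generate (X, Y) with Y = Phi(X) @ theta_true + e, e ~ N(0, sigma2 * I)."""
    rng = np.random.default_rng() if rng is None else rng
    theta_true = np.ones(K)                       # theta_true = (1, 1, ..., 1)^T
    X = rng.uniform(a, b, size=N)                 # N uniform points in [a, b]
    Phi = vandermonde(X, K)
    e = rng.normal(0.0, np.sqrt(sigma2), size=N)  # Gaussian noise with variance sigma2
    Y = Phi @ theta_true + e
    return X, Y, theta_true

# Example usage:
X, Y, theta_true = build_test_problem(N=50, K=4, a=-1.0, b=1.0, sigma2=0.1)
```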
2. We have now built a dataset $D = \{X, Y\}$ such that $\theta_{true} = (1, 1, \dots, 1)^T \in \mathbb{R}^K$ is the best solution to the least squares problem
$$\min_\theta \; \| \Phi(X)\theta - Y \|_2^2.$$
3. Pretend not to know the correct value of $K$. The first task is to try to guess it and use it to approximate the true solution $\theta_{true}$ by MLE and MAP. To do that:
• Write a function that takes as input the training data $D = (X, Y)$ and $K$ and returns the MLE solution (with Gaussian assumption) $\theta_{MLE} \in \mathbb{R}^K$ for that problem. Note that the loss function can be optimized by GD, SGD or the Normal Equations (see the sketch after this list).
• Write a function that takes as input a $K$-dimensional parameter vector $\theta$ and a test set $TE = \{(X_{test}, Y_{test})\}$ and returns the average absolute error of the polynomial regressor $f_\theta(x)$ over $X_{test}$, computed as:
$$\frac{\| f_\theta(X_{test}) - Y_{test} \|_2}{N_{test}}$$
• For different values of $K$, plot the training datapoints and the test datapoints with different colors, and visualize (as a continuous line) the learnt regression model $f_{\theta_{MLE}}(x)$. Comment on the results.
• For increasing values of $K$, use the functions defined above to compute the training and test errors, where the test set is generated by sampling $N_{test}$ new points in the same interval $[a, b]$ as the training set and generating the corresponding $Y_{test}$ with the same procedure as the training set. Plot the two errors with respect to $K$. Comment on the results.
• Write a function that takes as input the training data $D = (X, Y)$, $K$ and $\lambda > 0$ and returns the MAP solution (with Gaussian assumption) $\theta_{MAP} \in \mathbb{R}^K$ for that problem. Note that the loss function can be optimized by GD, SGD or the Normal Equations.
• For $K$ lower than, equal to and greater than the correct degree of the test polynomial, plot the training datapoints and the test datapoints with different colors, and visualize (as a continuous line) the learnt regression model $f_{\theta_{MAP}}(x)$ for different values of $\lambda$. Comment on the results.
• For $K$ much greater than the correct degree of the polynomial, compute the MLE and MAP solutions. Compare the test errors of the two, for different values of $\lambda$ (in the case of MAP).
• For $K$ greater than the true degree of the polynomial, define
$$\mathrm{Err}(\theta) = \frac{\| \theta - \theta_{true} \|_2}{\| \theta_{true} \|_2},$$
where $\theta_{true}$ has been padded with zeros to match the shape of $\theta$. Compute $\mathrm{Err}(\theta_{MLE})$ and $\mathrm{Err}(\theta_{MAP})$ for increasing values of $K$ and different values of $\lambda$.
• Compare the results obtained by increasing the number $N$ of datapoints.
• Compare the results obtained by the three algorithms: GD, SGD and the Normal Equations.
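A possible implementation sketch for the functions requested above, using the Normal Equations variant (the function names, the use of `np.linalg.lstsq`/`np.linalg.solve`, and the interpretation of $\lambda$ as the coefficient of the $\|\theta\|_2^2$ penalty are our own assumptions):

```python
import numpy as np

def mle_solution(X, Y, K):
    """MLE with Gaussian noise: ordinary least squares via the Normal Equations."""
    Phi = np.vander(X, K, increasing=True)
    # Solves min ||Phi theta - Y||_2^2; lstsq is more robust than forming Phi^T Phi.
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

def map_solution(X, Y, K, lam):
    """MAP with Gaussian prior: ridge-regularized Normal Equations."""
    Phi = np.vander(X, K, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(K)
    return np.linalg.solve(A, Phi.T @ Y)

def test_error(theta, X_test, Y_test):
    """Average error ||f_theta(X_test) - Y_test||_2 / N_test of the polynomial regressor."""
    K = theta.shape[0]
    Phi_test = np.vander(X_test, K, increasing=True)
    return np.linalg.norm(Phi_test @ theta - Y_test) / X_test.shape[0]

# Example usage (reusing build_test_problem from the earlier sketch):
# X, Y, theta_true = build_test_problem(N=50, K=4, a=-1.0, b=1.0, sigma2=0.1)
# theta_mle = mle_solution(X, Y, K=8)
# theta_map = map_solution(X, Y, K=8, lam=1.0)
```

The same losses can instead be minimized with GD or SGD, as requested in the comparison task.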
4. Repeat the first point of the homework by setting
$$y_i \sim \mathrm{Poi}(y \mid \theta_{true,1} + \theta_{true,2} x_i + \dots + \theta_{true,K} (x_i)^{K-1}) \qquad \forall i = 1, \dots, N \tag{2}$$
where $\mathrm{Poi}(y \mid \lambda)$ is the Poisson distribution with parameter $\lambda$ (you can find the Python function to do that), such that
$$\mathrm{Poi}(y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!},$$
and consequently,
$$p_\theta(y \mid x_i) = \mathrm{Poi}(y \mid \theta_1 + \theta_2 x_i + \dots + \theta_K (x_i)^{K-1}).$$
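A minimal sketch of how this Poisson test problem could be generated with NumPy (the clipping of the rate to a small positive value is our own assumption, since the linear predictor $\theta_{true}^T \phi(x_i)$ is not guaranteed to be positive on $[a, b]$):

```python
import numpy as np

def build_poisson_test_problem(N, K, a, b, rng=None):
    """Generate (X, Y) with y_i ~ Poi(theta_true^T phi(x_i)), identity link."""
    rng = np.random.default_rng() if rng is None else rng
    theta_true = np.ones(K)
    X = rng.uniform(a, b, size=N)
    Phi = np.vander(X, K, increasing=True)
    rate = Phi @ theta_true
    rate = np.clip(rate, 1e-6, None)   # Poisson rate must be positive (assumption)
    Y = rng.poisson(lam=rate)
    return X, Y, theta_true
```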
5. Compute the MLE and MAP losses for this choice of $p_\theta(y \mid x)$ and repeat the experiments in the previous points. A sketch of the corresponding losses is given below.
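A possible sketch of these losses (the function names, the clipping of the rate, and the use of `scipy.special.gammaln` for the $\log y!$ term, which is constant in $\theta$ and could be dropped, are our own assumptions; `lam_reg` plays the role of the regularization parameter $\lambda$ in the MAP objective):

```python
import numpy as np
from scipy.special import gammaln

def poisson_nll(theta, X, Y):
    """Negative log-likelihood for y_i ~ Poi(theta^T phi(x_i)) (MLE loss)."""
    K = theta.shape[0]
    Phi = np.vander(X, K, increasing=True)
    rate = np.clip(Phi @ theta, 1e-10, None)   # keep the rate positive (assumption)
    # -log Poi(y | rate) = rate - y * log(rate) + log(y!)
    return np.sum(rate - Y * np.log(rate) + gammaln(Y + 1.0))

def poisson_map_loss(theta, X, Y, lam_reg):
    """MAP loss: Poisson NLL plus the Gaussian-prior term (lam_reg / 2) * ||theta||^2."""
    return poisson_nll(theta, X, Y) + 0.5 * lam_reg * np.sum(theta ** 2)

# These losses can then be minimized with GD or SGD, e.g. starting from theta0 = np.zeros(K).
```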