Problem 1. (50 points)
We have a data set of the form $\{(x_i, y_i)\}_{i=1}^N$, where $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^d$. We assume $d$ is large and not all dimensions of $x$ are informative in predicting $y$. Consider the following regression model for this problem:
$$y_i \overset{ind}{\sim} \mathrm{Normal}(x_i^T w,\ \lambda^{-1}), \qquad w \sim \mathrm{Normal}\big(0,\ \mathrm{diag}(\alpha_1, \dots, \alpha_d)^{-1}\big),$$
$$\alpha_k \overset{iid}{\sim} \mathrm{Gamma}(a_0, b_0), \qquad \lambda \sim \mathrm{Gamma}(e_0, f_0).$$
Use the density function $\mathrm{Gamma}(\eta \mid \eta_1, \eta_2) = \frac{\eta_2^{\eta_1}}{\Gamma(\eta_1)}\, \eta^{\eta_1 - 1} e^{-\eta_2 \eta}$. In this homework, you will derive a variational inference algorithm for approximating the posterior distribution with $q(w, \alpha_1, \dots, \alpha_d, \lambda) \approx p(w, \alpha_1, \dots, \alpha_d, \lambda \mid y, x)$.
a) Using the factorization $q(w, \alpha_1, \dots, \alpha_d, \lambda) = q(w)\, q(\lambda) \prod_{k=1}^d q(\alpha_k)$, derive the optimal form of each $q$ distribution. Use these optimal $q$ distributions to derive a variational inference algorithm for approximating the posterior.
b) Summarize the algorithm derived in Part (a) using pseudo-code in a way similar to how algorithms are presented in the notes for the class.
c) Using these $q$ distributions, calculate the variational objective function. You will need to evaluate this function in the next problem to show the convergence of your algorithm.
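For orientation, the coordinate-ascent updates for this conjugate model typically take the following shape. This is a hedged sketch, not the required derivation: the update equations, variable names, and hyperparameter defaults below are standard results for Gaussian likelihoods with Gamma priors and should be checked against your own Part (a) answer.

```python
import numpy as np

def vi_ard_regression(X, y, a0=1e-16, b0=1e-16, e0=1.0, f0=1.0, iters=500):
    """Coordinate-ascent VI sketch for the sparse regression model.

    Approximations: q(w) = Normal(mu, Sigma), q(alpha_k) = Gamma(a_k, b_k),
    q(lambda) = Gamma(e, f), all in shape/rate form.
    """
    N, d = X.shape
    a = np.full(d, a0 + 0.5)   # shape of each q(alpha_k); fixed after one update
    b = np.full(d, b0 + 0.5)   # rate of each q(alpha_k); arbitrary initialization
    e = e0 + N / 2.0           # shape of q(lambda); fixed after one update
    f = f0                     # rate of q(lambda)
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(iters):
        # q(w): posterior precision combines E[lambda] X^T X and diag(E[alpha_k])
        E_lam, E_alpha = e / f, a / b
        Sigma = np.linalg.inv(E_lam * XtX + np.diag(E_alpha))
        mu = E_lam * Sigma @ Xty
        # q(alpha_k): rate update uses E[w_k^2] = mu_k^2 + Sigma_kk
        b = b0 + 0.5 * (mu**2 + np.diag(Sigma))
        # q(lambda): rate update uses E[||y - X w||^2]
        resid = y - X @ mu
        f = f0 + 0.5 * (resid @ resid + np.trace(XtX @ Sigma))
    return mu, Sigma, a, b, e, f
```

A full solution would also evaluate the variational objective each iteration (using digamma and log-gamma terms from the optimal $q$ distributions) to verify monotone convergence.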
Problem 2. (50 points)
Implement the algorithm derived in Problem 1 and run it on the three data sets provided. Set the prior parameters $a_0 = b_0 = 10^{-16}$ and $e_0 = f_0 = 1$. We will not discuss sparsity-promoting "ARD" priors in detail in this course, but setting $a_0$ and $b_0$ in this way will encourage only a few dimensions of $w$ to be significantly non-zero, since many $\alpha_k$ should be extremely large according to $q(\alpha_k)$.
For each of the three data sets provided, show the following:
a) Run your algorithm for 500 iterations and plot the variational objective function.
b) Using the final iteration, plot $1/\mathbb{E}_q[\alpha_k]$ as a function of $k$.
c) Give the value of $1/\mathbb{E}_q[\lambda]$ for the final iteration.
d) Using $\hat{w} = \mathbb{E}_{q(w)}[w]$, calculate $\hat{y}_i = x_i^T \hat{w}$ for each data point. Using the $z_i$ associated with $y_i$ (see below), plot $\hat{y}_i$ vs $z_i$ as a solid line. On the same plot, show $(z_i, y_i)$ as a scatter plot. Also show the function $(z_i, 10\,\mathrm{sinc}(z_i))$ as a solid line in a different color.
Hint about Part (d): $z$ is the horizontal axis and $y$ the vertical axis. Both solid lines should look like a function that smoothly passes through the data. The second line is the ground truth.
Details about the data
The data was generated by sampling $z \sim \mathrm{Uniform}(-5, 5)$ independently $N$ times for $N = 100, 250, 500$ (giving a total of three data sets). For each $z_n$ in a given data set, the response $y_n = 10\,\mathrm{sinc}(z_n) + \epsilon_n$, where $\epsilon_n \sim \mathrm{Normal}(0, 1)$.
We use $z_n$ to construct a "kernel matrix" $X$. This is a mapping of $z_n$ into a higher dimensional space (see Bishop for more details). For our purposes, it's just important to know that the $n$th row (or column, depending on which data set you use) of $X$ corresponds to the location $z_n$. We let $X_{n,1} = 1$ and use the Gaussian kernel for the remaining dimensions, $X_{n,i+1} = \exp\{-(z_n - z_i)^2\}$ for $i = 1, \dots, N$. Therefore, the dimensionality of each $x_i$ is one greater than the number of data points. The sparse model picks out the relevant locations within the data for performing the regression.
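The data-generating process and kernel construction above can be sketched as follows. This is only for intuition; use the provided data files for the actual homework. Two assumptions are flagged in the comments: the sketch uses the unnormalized sinc convention $\mathrm{sinc}(z) = \sin(z)/z$, and a Gaussian kernel bandwidth of exactly 1, either of which may differ from how the provided data was generated.

```python
import numpy as np

def make_dataset(N, rng):
    """Sketch of the described data set: locations z, kernel matrix X, responses y."""
    z = rng.uniform(-5.0, 5.0, size=N)
    # Unnormalized sinc: sin(z)/z. (np.sinc(t) = sin(pi t)/(pi t), so rescale t.)
    y = 10.0 * np.sinc(z / np.pi) + rng.normal(size=N)
    # Kernel matrix: an intercept column, then one Gaussian kernel per location,
    # so each x_i has dimension N + 1.
    X = np.ones((N, N + 1))
    X[:, 1:] = np.exp(-(z[:, None] - z[None, :]) ** 2)
    return z, X, y

rng = np.random.default_rng(0)
z, X, y = make_dataset(100, rng)
print(X.shape)  # (100, 101)
```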
Each data set contains the vector y, the matrix X and the vector of original locations z. This last vector will be useful for plotting.