1. (20 points) Estimating π Using Simulation.
Read all instructions before you start.
Consider a bivariate random variable X := (X1, X2), where X1, X2 are independent and each has a uniform distribution over (−1, 1).
(a) (5 points) What is the probability that the random variable X takes values within a circle of radius 1? Derive the expression.
(b) (5 points) Use the previous result to estimate the value of π, relying purely on simulation of X. Justify your approach in words.
(c) (5 points) Report the estimates of π using sample sizes N = 10, 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸. Use the “single” datatype for the simulation. How will you write code to handle the situation when you desire to increase the accuracy of the estimate with a sample size as large as N = 10⁹ or larger? Does your code handle this case?
(d) (5 points) How will you estimate the sample size M required to have the estimate of π within [π − 0.01, π + 0.01] with 0.95 probability? Assume the true value of π to be known, so that you know the interval exactly. Describe an algorithm and justify it. Compute this estimate for the sample size M and report it.
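A minimal Matlab sketch for parts (b)–(c) is given below. The batching over a variable batchSize is one possible way to keep the memory footprint constant even for N = 10⁹ or more; the variable names are illustrative, not prescribed.

    % Monte Carlo estimate of pi: P(X lies inside the unit circle) = pi/4
    % when X is uniform on (-1,1)^2, so pi is estimated as 4 * (fraction inside).
    % Samples are drawn in batches so that N = 1e9 never has to fit in memory.
    N = 1e8;                       % total number of samples
    batchSize = 1e6;               % samples generated per batch
    inside = 0;                    % running count of points inside the unit circle
    remaining = N;
    while remaining > 0
        n = min(batchSize, remaining);
        X = 2 * rand(n, 2, 'single') - 1;          % uniform on (-1,1)^2, single precision
        inside = inside + sum(sum(X.^2, 2) <= 1);  % points with x1^2 + x2^2 <= 1
        remaining = remaining - n;
    end
    piEstimate = 4 * inside / N;
    fprintf('N = %g, estimate of pi = %.6f\n', N, piEstimate);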
2. (20 points) Multivariate Gaussian. Read all instructions before you start.
Generate N points (with N taking the values 10, 10², 10³, 10⁴, 10⁵) from a multivariate 2D Gaussian probability density function with mean μ = [1, 2]′ and a covariance matrix C with the first row as [1.6250, −1.9486] and the second row as [−1.9486, 3.8750].
For this simulation, you are only allowed to use the randn() and eig() functions in Matlab.
For each data sample of size N, compute the maximum likelihood (ML) estimates of the mean and the covariance matrix.
(a) (5 points) Describe and justify your method for generating sample points from the 2D Gaussian.
(b) (5 points) For each value of N, repeat the experiment 100 times, and plot a box plot of the error between the true mean μ and the ML estimate μ̂_N (which depends on N), where the error measure is ‖μ − μ̂_N‖ / ‖μ‖. Use a logarithmic scale on the horizontal axis, i.e., log₁₀ N.
(c) (5 points) For each value of N, repeat the experiment 100 times, and plot a box plot of the error between the true covariance C and the ML estimate Ĉ_N (which depends on N), where the error measure is ‖C − Ĉ_N‖_Fro / ‖C‖_Fro. Use a logarithmic scale on the horizontal axis, i.e., log₁₀ N.
(d) (5 points) For each value of N, for a single data sample, within a single figure, plot the 2D scatter plot of the generated data and show the principal modes of variation of the data by plotting a line starting at the empirical mean and going a distance equal to the empirical eigenvalue along the direction given by the corresponding empirical eigenvector.
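A sketch of one standard way to meet the randn()/eig() restriction: factor C = V D Vᵀ with eig() and map standard normal samples through V √D. The plotting at the end illustrates part (d); treat it as an outline rather than the required solution.

    % Sample N points from the 2D Gaussian N(mu, C) using only randn() and eig().
    mu = [1; 2];
    C  = [1.6250, -1.9486; -1.9486, 3.8750];
    N  = 1000;

    [V, D] = eig(C);                 % C = V * D * V' since C is symmetric
    A = V * sqrt(D);                 % so A * A' = C
    Z = randn(2, N);                 % standard normal samples, one column per point
    X = A * Z + repmat(mu, 1, N);    % X ~ N(mu, C), size 2-by-N

    % ML estimates without mean() or cov()
    muHat = sum(X, 2) / N;
    Xc    = X - repmat(muHat, 1, N);
    CHat  = (Xc * Xc') / N;

    % Scatter plot with the empirical principal modes of variation (part (d))
    [Ve, De] = eig(CHat);
    figure; scatter(X(1, :), X(2, :), 10, 'filled'); hold on;
    for k = 1:2
        e = Ve(:, k) * De(k, k);     % eigenvector scaled by its eigenvalue
        plot(muHat(1) + [0, e(1)], muHat(2) + [0, e(2)], 'r-', 'LineWidth', 2);
    end
    axis equal; hold off;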
3. (15 points) PCA and Line (Hyperplane) Fitting. Read all instructions before you start.
Consider the observed set of points of the form (x, y) ∈ R² in the file "points2D_Set1.mat". Assume each observation (x, y) is drawn independently from the joint probability density function P(X, Y) of random variables X and Y.
For this question, you cannot use the functions mean(), cov(), and pca() in Matlab.
(5 points) How can principal component analysis (PCA) be used to best approximate a linear relationship between random variables X and Y? Describe the method clearly, using appropriate mathematical descriptions for clarity. Your description should be clear enough to lead to a programmable implementation.
(5 points) Show a scatter plot of the points. Overlay on the scatter plot the graph of a line showing the linear relationship between Y and X.
(5 points) Repeat the same analysis for the set of points in "points2D_Set2.mat". Show a scatter plot of the points. Overlay on the scatter plot the graph of a line showing the linear relationship between Y and X. Compare with the result on the other set of points, and justify, using logical arguments, the quality of the approximation obtained in this question.
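A possible outline of the PCA-based line fit without mean(), cov(), or pca() is sketched below. The field names x and y loaded from the .mat file are assumptions; adapt them to whatever the provided files actually contain (the analysis for "points2D_Set2.mat" is identical).

    % PCA-based line fit for 2D points, without mean(), cov(), or pca().
    S = load('points2D_Set1.mat');           % assumed to contain vectors x and y
    P = [S.x(:)'; S.y(:)'];                  % 2-by-n data matrix
    n = size(P, 2);

    m  = sum(P, 2) / n;                      % empirical mean
    Pc = P - repmat(m, 1, n);                % centered data
    C  = (Pc * Pc') / n;                     % empirical 2-by-2 covariance

    [V, D] = eig(C);
    [~, idx] = max(diag(D));                 % largest eigenvalue
    d = V(:, idx);                           % direction of maximum variance

    % The fitted line passes through the mean along direction d.
    t = d' * Pc;                             % projections of the centered points onto d
    ends = repmat(m, 1, 2) + d * [min(t), max(t)];
    figure; scatter(P(1, :), P(2, :), 10, 'filled'); hold on;
    plot(ends(1, :), ends(2, :), 'r-', 'LineWidth', 2);
    axis equal; hold off;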
4. (25 points) Multivariate Gaussian fitting for Principal Component Analysis. Read all instructions before you start.
Download the dataset comprising images of handwritten digits in http://yann.lecun.com/exdb/mnist; this has been downloaded in the folder “data” and stored as “mnist.mat”.
Each image is stored as a matrix (28 × 28) of numbers. You can visualize these images (or matrices) in Matlab using the functions imagesc() or imshow(). Use the Matlab command “axis equal” to use the same units on each axis of the image.
For the following computations, make sure to convert (cast) the integer data type to a floating-point type. For this question, you cannot use the functions mean(), cov(), and pca() in Matlab. For every digit, from 0 to 9, compute:
(i) the mean μ (3 points),
(ii) the covariance matrix C (5 points), and
(iii) the first mode of variation, determined by the eigenvector v₁ and the corresponding eigenvalue λ₁ (where λ₁ is the largest of all eigenvalues) of the covariance matrix C (7 points).
Note: Before computing the mean and covariance matrix, convert each 28 × 28 pixel image matrix to a 28² × 1 vector by concatenating its columns. To visualize the 28² × 1 mean vector, convert it back to a matrix and then visualize it using imagesc(). Use the reshape() function to change matrices to vectors and vice versa. The covariance matrix will be of size 28² × 28².
(5 points) For each digit, sort the 28² eigenvalues of the covariance matrix and plot them as a graph. Comment and justify what you observe. How many “principal” / significant modes of variation (i.e., number of “large” eigenvalues) do you find, for each digit? Are the significant modes of variation equal to 28² or far less? Why?
(5 points) For each digit, show the 3 images side by side: (i) μ − √λ₁ v₁, (ii) μ, and (iii) μ + √λ₁ v₁, to show the principal mode of variation of the digits around their mean. Comment and justify what you observe. For a certain digit, say 1, what does the principal mode of variation tell you about how people write that digit?
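A sketch of the per-digit computation (the variable names images and labels loaded from “mnist.mat”, and their layout, are assumptions; adjust them to the actual contents of the provided file):

    % Per-digit mean, covariance, and first mode of variation, without mean()/cov()/pca().
    S = load('data/mnist.mat');                   % assumed to provide images and labels
    imgs   = double(S.images);                    % assumed 28-by-28-by-numImages
    labels = double(S.labels);

    for digit = 0:9
        idx = find(labels == digit);
        n   = numel(idx);
        X   = reshape(imgs(:, :, idx), 28*28, n); % each column is a vectorized image

        mu = sum(X, 2) / n;                       % mean vector (784-by-1)
        Xc = X - repmat(mu, 1, n);
        C  = (Xc * Xc') / n;                      % covariance matrix (784-by-784)

        [V, D] = eig(C);
        [lambda, order] = sort(diag(D), 'descend');
        v1      = V(:, order(1));                 % first eigenvector
        lambda1 = lambda(1);                      % largest eigenvalue

        figure;                                   % principal mode around the mean
        subplot(1, 3, 1); imagesc(reshape(mu - sqrt(lambda1) * v1, 28, 28)); axis equal;
        subplot(1, 3, 2); imagesc(reshape(mu, 28, 28));                      axis equal;
        subplot(1, 3, 3); imagesc(reshape(mu + sqrt(lambda1) * v1, 28, 28)); axis equal;
    end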
5. (25 points) Principal Component Analysis (PCA) for Image Data. Read all instructions before you start.
Consider the dataset provided within the folder “data_fruit”.
For this question, you cannot use the functions mean(), cov(), and pca() in Matlab.
Each datum is an image of size 80 × 80 pixels with 3 color channels: red (R), green (G), and blue (B), i.e., an 80 × 80 × 3 array. For PCA, each image should be resized to a vector of length 19200. For visualization, reshape each vector back to an RGB image of size 80 × 80 pixels using the function reshape(), followed by a shift and rescaling of the values into the range [0, 1], followed by displaying the matrix using the function image().
(9 points: 1 + 4 + 4) Similar to the analysis done in the previous question, find the mean μ, the covariance matrix C, and the first 4 eigenvectors of C. Display the mean and the eigenvectors as images (side by side, in the same figure); you can use the function subplot(). Find the first 10 eigenvalues, sort them, and plot their values on a graph. Use the function eigs() for efficient computation.
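One way to obtain the top eigenpairs without ever storing the 19200 × 19200 covariance matrix is to pass eigs() a function handle that applies C to a vector. The sketch below assumes the folder “data_fruit” contains PNG images; adapt the loading step to however the data is actually stored.

    % Mean and leading eigenvectors of the fruit-image covariance, via eigs()
    % with a function handle, so the 19200-by-19200 matrix C is never formed.
    files = dir(fullfile('data_fruit', '*.png'));     % assumed image format
    n = numel(files);
    d = 80 * 80 * 3;
    X = zeros(d, n);
    for i = 1:n
        img = double(imread(fullfile('data_fruit', files(i).name)));
        X(:, i) = img(:);                             % vectorize the 80-by-80-by-3 image
    end

    mu = sum(X, 2) / n;                               % mean image as a vector
    Xc = X - repmat(mu, 1, n);                        % centered data

    Cfun = @(v) (Xc * (Xc' * v)) / n;                 % computes C*v without forming C
    [V, D] = eigs(Cfun, d, 10);                       % top-10 eigenpairs (largest magnitude)
    [lambda, order] = sort(real(diag(D)), 'descend'); % eigenvalues of the symmetric PSD C
    V = real(V(:, order));

    % Display the mean and the first 4 eigenvectors as images, rescaled into [0, 1].
    toImg = @(v) reshape((v - min(v)) / (max(v) - min(v)), 80, 80, 3);
    figure;
    subplot(1, 5, 1); image(toImg(mu)); axis equal; axis off; title('mean');
    for k = 1:4
        subplot(1, 5, k + 1); image(toImg(V(:, k))); axis equal; axis off;
        title(sprintf('eigenvector %d', k));
    end
    figure; plot(lambda(1:10), 'o-'); title('top 10 eigenvalues, sorted');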
(8 points: 4 + 4) For each fruit image in the dataset, find its closest representation as a linear combination of the mean and the first 4 eigenvectors, using the Frobenius norm of the difference as the measure of closeness. Describe the algorithm used to produce this closest representation in mathematical terms and describe the logic behind your algorithm. Display the original fruit image and its closest representation as images (side by side, in the same figure).
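The closest representation in the span of the mean and the top-4 eigenvectors is given by orthogonal projection of the centered image onto those eigenvectors, which minimizes the Frobenius (equivalently Euclidean) norm of the difference. A minimal sketch, reusing X, Xc, mu, V, and toImg from the previous sketch:

    % Closest representation of each fruit image using the mean and the
    % first 4 eigenvectors (orthogonal projection onto their span).
    V4 = V(:, 1:4);                                   % top-4 (orthonormal) eigenvectors
    for i = 1:n
        a    = V4' * Xc(:, i);                        % projection coefficients
        xHat = mu + V4 * a;                           % closest representation
        figure;
        subplot(1, 2, 1); image(toImg(X(:, i))); axis equal; axis off; title('original');
        subplot(1, 2, 2); image(toImg(xHat));    axis equal; axis off; title('reconstruction');
    end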
(9 points: 3 + 3 + 3) Using all of the top 4 eigenvectors and the mean image, sample random images from the multivariate Gaussian to generate new images of "fruit". Describe the underlying algorithm clearly in words, including suitable mathematical notation. Display three such images that are distinct from any image in the given dataset, but are representative of the dataset and can each be considered an image of a new / generated fruit.
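A sketch of the generation step: draw each coefficient from a zero-mean Gaussian whose variance is the corresponding eigenvalue, and add the resulting combination of eigenvectors to the mean (again reusing mu, V, lambda, and toImg from the sketches above):

    % Generate new "fruit" images by sampling from the fitted low-rank Gaussian:
    % x = mu + sum_k a_k v_k, with a_k ~ N(0, lambda_k) for the top 4 modes.
    V4      = V(:, 1:4);
    lambda4 = lambda(1:4);
    figure;
    for j = 1:3
        a    = sqrt(lambda4) .* randn(4, 1);          % coefficients with variance lambda_k
        xNew = mu + V4 * a;                           % sampled image as a vector
        subplot(1, 3, j); image(toImg(xNew)); axis equal; axis off;
        title(sprintf('generated fruit %d', j));
    end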