$29
Remember the honor code while submitting this (and every other) assignment. All members of the group should work on all parts of the assignment. We will adopt a zero-tolerance policy against any violation.
Submission instructions:
1. You should type out all the answers to the written problems in Word (with the equation editor) or using Latex, or write it neatly on paper and scan it. In either case, prepare a pdf le.
2. Put the pdf le and the code for the programming parts all in one zip le. The pdf should contain the names and ID numbers of all students in the group within the header. The pdf le should also contain instructions for running your code. Name the zip le as follows: A3-IdNumberOfFirstStudent-IdNumberOfSecondStudent.zip. (If you are doing the assignment alone, the name of the zip le is A3-IdNumber.zip).
3. Upload the le on moodle BEFORE 11:55 pm on the due date (i.e. 10th september). We will nevertheless allow and not penalize any submission until 6:00 am on the following day (i.e. 11th September). No assignments will be accepted thereafter.
4. Note that only one student per group should upload their work on moodle.
5. Please preserve a copy of all your work until the end of the semester.
Questions:
1. Consider a shelf containing n books, each one with a distinct color. Let us suppose that you pick a book uniformly at random with replacement (i.e. you put the book back on the shelf after picking it) and independently of what
was picked earlier. Let X(n) be the number of times you would need to pick a book in this fashion, such that you have chosen a book of each color at least once. We can write that X(n) = X1 + X2 + ::: + Xn where Xi denotes the additional number of times you have to pick a book such that you move from having picked books of i 1 distinct colors to i distinct colors. We wish to determine E(X) and V ar(X). To this end, do as follows:
(a) What is X1? When books with i 1 distinct types of colors have been collected, what is the probability of picking a book with a di erent color (i.e. di erent from the previous i 1 colors)? [3 points]
(b) Due to independence, Xi is a geometric random variable. What is its parameter? Let Z be a random variable for the trial number for the rst head obtained in a sequence of independent Bernoulli trials with head probability p. Then P (Z = k) = (1 p)k 1p where k = 1; 2; 3; :::, and Z is said to be a geometric random variable with parameter p. [3 points]
(c) Show that the expected value of a geometric random variable with parameter p is 1=p. Derive the variance of a geometric random variable. [4+4=8 points]
(d) Hence derive E(X(n)) for this problem. [3 points]
(e) Hence derive an upper bound on V ar(X(n)) for this problem. You will need the inequality that the sum of reciprocals of squares of positive integers is upper bounded by 2=6. [3 points]
(f) Plot a graph of E(X(n)) versus n for di erent n. If E(X(n)) = (f(n)), what is f(n)? [3+2=5 points]
2. (a) A student is trying to design a procedure to generate a sample from a distribution function F , where F is invertible. For this, (s)he generates a sample ui from a [0; 1] uniform distribution using the ‘rand’ function of MATLAB, and computes vi = F 1(ui). This is repeated n times for i = 1:::n. Prove that the values fvigni=1 follow the distribution F . [8 points]
(b) Let Y1; Y2; :::; Yn represent data from a continuous distribution F . The empirical distribution function Fe of
P
in=1 1(Yi
x)
these data is de ned as Fe(x) =
n
where 1(z) = 1 if
n
Now de ne D = maxxjFe(x) F (x)j. Also de ne E = max0 y 1
the predicate z is true and 0 otherwise.
Pi=1
1n i
y)
y where U1; U2; :::; Un
(U
represent data from a [0; 1] uniform distribution. Now prove that P (E d) = P (D d). Brie y explain what you think is the practical signi cance of this result in statistics. [12+5=17 points]
3. In this exercise, we will perform maximum likelihood based plane tting. Let the equation of the plane be z = ax + by + c. Let us suppose we have access to accurate X and Y coordinates of some N points lying on the plane. We also have access to the Z coordinates of these points, but those have been corrupted independently
by noise from N (0; 2). Write down the log-likelihood function L to be maximized in order to determine a; b; c. Write down three linear equations corresponding to setting partial derivatives of L w.r.t. a; b; c (respectively) to 0. Express these equations in matrix and vector form. [5+5=10 points ]
Now write MATLAB code to solve this linear system for data consisting of XYZ coordinates of N = 2000 points, stored in the le ‘XYZ.txt’ in the homework folder. Read the data using the MATLAB function ‘dlmwrite’. The data consist of N rows, each containing the X,Y,Z coordinates of a point (in that order). What is the predicted equation of the plane? What is the predicted noise variance? State these in your report, and print them out via your code. [10 points]
Given your knowledge of maximum likelihood techniques, what is the expected value of the estimates of a; b; c in terms of a; b; c; 2; N. What is their variance? [5 points]
4. We have extensively seen parametric PDF estimation in class via maximum likelihood. In many situations, the family of the PDF is however unknown. Estimation under such a scenario is called nonparametric density es-timation. We have studied one such technique in class, namely histogramming, and we also analyzed its rate of convergence. There is another popular technique for nonparametric density estimation. It is called KDE or
P
n
(x x
)2=(2 2))
Kernel density esitmation, the formula for which is given as p^n(x; )n=
i=1 exp (
n p
i
. Here p^n(x)
2
is an estimate of the underlying probability density at value x, fxigi=1 are the n samples values, from which the
unknown PDF is being estimated, and is a bandwidth parameter (similar to a histogram bin-width parameter). The choice of the appropriate is not very straightforward. We will implement one possible procedure to choose
- called cross-validation. For this, do as follows:
(a) Use MATLAB to draw n = 1000 independent samples from N (0; 16). We will use a random subset of 750 samples (set T ) for building the PDF, and the remaining 250 as the validation set V . Note that T and V must be disjoint sets.
(b) In your report, write down an expression for the joint likelihood of the samples in V , based on the estimate of the PDF built from T with bandwidth parameter . [3 points]
(c) For di erent values of from the set f0:001; 0:1; 0:2; 0:9; 1; 2; 3; 5; 10; 20; 100g, write MATLAB code to evaluate the log of the joint likelihood LL of the samples in V , based on the estimate of the PDF built from T . Plot of a graph of LL versus log and include it in your report. In the report, state which value of yielded the best LL value, and print it via your code as well. This procedure is called cross-validation. [7 points]
(d) In this experiment, we know the ground truth pdf which we shall denote as p(x). So we can peek into it, in order to choose the best . This is impractical in actual experiments, but for now it will serve as a method
of comparison. For each , write MATLAB code to evaluate D = Pxi2V (p(xi) p^n(x; ))2. Plot of a graph of D versus log and include it in the report. In the report, state which value of yielded the best D value, and also what was the D value for the parameter which yielded the best LL. [7 points]
(e) Now, suppose the set T and V were equal to each other. What happens to the cross-validation procedure, and why? Explain in the report. [4+4=8 points]