1. BDR and nearest neighbors Consider a classification problem with c classes and uniform class probabilities, i.e. PY(i) = 1/c, ∀i. Assume that the goal is to classify an iid sequence of observations D = {x1, . . . , xn} as a whole (i.e. the samples are not classified one at a time).
Compute the BDR for this problem and show that it converges (in probability) to a nearest neighbor rule based on the class-conditional distributions and the distribution of the observations. Show that the distance function is the Kullback-Leibler divergence
D[p(x)\,\|\,q(x)] = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx.
This proves that the BDR for the classification of sequences is really just a nearest neighbor rule.
Assuming that all densities are Gaussian with equal covariance Σ, that the class-conditional densities have means μi, and that the observation density has mean μ, write down an expression for the decision rule as a function of the Gaussian parameters. Provide an interpretation of this new decision rule by stating which items are being compared and what the distance function is.
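One way to see the claimed convergence, sketched under the stated iid assumption (the symbol p(x) below denotes the true distribution of the observations, and the derivation is only an outline of the argument the problem asks for):

```latex
% Sketch: with uniform class probabilities, the BDR for the whole sequence is the
% ML rule over classes, and the normalized log-likelihood converges in probability.
\begin{align*}
i^*(\mathcal{D})
  &= \arg\max_i \; \frac{1}{n}\sum_{j=1}^{n} \log P_{X|Y}(x_j \mid i) \\
  &\xrightarrow{\;P\;} \arg\max_i \int p(x) \log P_{X|Y}(x \mid i)\, dx
     \qquad \text{(law of large numbers)} \\
  &= \arg\min_i \int p(x) \log \frac{p(x)}{P_{X|Y}(x \mid i)}\, dx
   = \arg\min_i D\big[p(x) \,\|\, P_{X|Y}(x \mid i)\big],
\end{align*}
```

i.e. the sequence is assigned to the class whose conditional density is the nearest neighbor of p(x) under the KL divergence.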
2. Kernel methods Problem 4.3.3 in DHS
3. Multinomial EM In this problem we consider an example where there is a closed-form solution to ML estimation from incomplete data. The goal is to compare it with the EM solution and get some insight into how the steps of the latter can be substantially easier to derive than the former.
Consider our bridge example and let U be the type of vehicle that crosses the bridge. U can take four values (compact, sedan, station wagon, and pick-up truck), which we denote by U ∈ {1, 2, 3, 4}. On a given day, an operator collects an iid sample of size n from U, and the number of vehicles of each type is counted and stored in a vector D = (x1, x2, x3, x4). The resulting random variable X (the histogram of vehicle classes) has a multinomial distribution
P_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4; \Psi) = \frac{n!}{x_1!\, x_2!\, x_3!\, x_4!} \left(\frac{1}{2} + \frac{1}{4}\Psi\right)^{x_1} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_2} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_3} \left(\frac{1}{4}\Psi\right)^{x_4}.
However, it is later realized that the operator included motorcycles in the compact class. It is established that bikes have probability (1/4)Ψ, which leads to a new model
P_{X_{11},X_{12},X_2,X_3,X_4}(x_{11}, x_{12}, x_2, x_3, x_4; \Psi) = \frac{n!}{x_{11}!\, x_{12}!\, x_2!\, x_3!\, x_4!} \left(\frac{1}{2}\right)^{x_{11}} \left(\frac{1}{4}\Psi\right)^{x_{12}} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_2} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_3} \left(\frac{1}{4}\Psi\right)^{x_4}.
Determining the parameter Ψ from the available data is a problem of ML estimation with missing data, since we only have measurements of
x1 = x11 + x12
but not for x11 and x12 independently.
a) Determine the value of Ψ that maximizes the likelihood of D, i.e.
\Psi^*_i = \arg\max_{\Psi} P_{X_1,X_2,X_3,X_4}(D; \Psi)
by using standard ML estimation procedures.
b) Assume that we have the complete data, i.e. Dc = (x11, x12, x2, x3, x4). Determine the value of Ψ that maximizes its likelihood, i.e.
\Psi^*_c = \arg\max_{\Psi} P_{X_{11},X_{12},X_2,X_3,X_4}(D_c; \Psi),
by using standard ML estimation procedures. Compare the difficulty of obtaining this solution vs. that of obtaining the solution in a). Does this look like a problem where EM might be helpful?
c) Derive the E- and M-steps of the EM algorithm for this problem.
d) Using the equations for the EM steps, determine the fixed point of the algorithm (i.e. the solution) by setting Ψ^{k+1} = Ψ^k, where k is the iteration number. Compare to the solution obtained in a).
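As a sanity check on the derivation, a minimal numerical sketch of the resulting iteration is given below; the E- and M-step expressions are the ones implied by the complete-data model above, while the counts in the example call are purely illustrative and not part of the problem statement.

```python
def em_psi(x1, x2, x3, x4, psi0=0.5, iters=50):
    """EM for the bridge model, where only x1 = x11 + x12 is observed.

    E-step: replace the hidden motorcycle count x12 by its conditional expectation,
            E[X12 | x1; Psi] = x1 * (Psi/4) / (1/2 + Psi/4).
    M-step: plug it into the complete-data ML estimate,
            Psi = (x12 + x4) / (x12 + x2 + x3 + x4).
    """
    psi = psi0
    for _ in range(iters):
        x12 = x1 * (psi / 4.0) / (0.5 + psi / 4.0)   # E-step
        psi = (x12 + x4) / (x12 + x2 + x3 + x4)      # M-step
    return psi

# Illustrative counts only (not data from the problem).
print(em_psi(x1=125, x2=18, x3=20, x4=34))
```

At a fixed point Ψ^{k+1} = Ψ^k this iteration satisfies the stationarity condition of the incomplete-data likelihood, which is what the comparison with a) should reveal.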
4. Mixtures of Gaussians The goal of this problem is to give you some “hands-on” experience with the very important case of EM as a tool for the estimation of the parameters of a mixture. Consider a mixture of two Gaussians
P_X(x) = \sum_{c=1}^{2} \pi_c\, G(x, \mu_c, \Sigma_c)
where the covariance matrices are diagonal, i.e. Σ_c = diag(σ_{c,1}^2, σ_{c,2}^2), and a training sample of five points D = {(−2.5, −1), (−2, 0.5), (−1, 0), (2.5, −1), (2, 1)}.
Assume that the following hold:

\mu_1 = -\mu_2, \qquad \Sigma_1 = \Sigma_2 = \sigma^2 I, \qquad \pi_1 = \pi_2 = \frac{1}{2}.
a) Plot the log-likelihood surface log P_X(D) as a function of the mean parameters (entries of μ1) for σ² ∈ {0.1, 1, 2}. Let the coordinate axes cover the range [−5, 5]. What can you say about the local maxima of the likelihood surface, and how does it change with σ²? How does the convergence to the optimal solution depend on the location of the initial parameter guess?
b) Starting from the initial parameter estimate
\pi_1^{(0)} = \pi_2^{(0)} = \frac{1}{2}, \qquad \mu_1^{(0)} = -\mu_2^{(0)} = (-0.1, 0),
compute all the quantities involved in the first 3 iterations of the EM algorithm. For each iteration produce
plot 1: the posterior surface PZ|X(1|x) for the first class as a function of x,
plot 2: the mean of each Gaussian, the contour where the Mahalanobis distance associated with it becomes 1, the points in D, and the means of the solutions obtained in the previous steps.
Let EM run until convergence, storing the mean estimates at each iteration. Produce the two plots above for the final solution. In plot 2, plot the values of the means as they progress from the initial to the final estimate.
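For part b), a minimal EM sketch for the two-component, diagonal-covariance mixture on the five training points is given below (plotting omitted); the unit initial variances are an assumption, since the problem only specifies the initial weights and means.

```python
import numpy as np

# Training sample D from the problem statement.
D = np.array([[-2.5, -1.0], [-2.0, 0.5], [-1.0, 0.0], [2.5, -1.0], [2.0, 1.0]])

def gauss(X, mu, var):
    """Diagonal-covariance Gaussian density evaluated at each row of X."""
    return (np.exp(-0.5 * np.sum((X - mu) ** 2 / var, axis=1))
            / np.sqrt(np.prod(2.0 * np.pi * var)))

def em_step(X, pi, mu, var):
    """One EM iteration for a 2-component mixture with diagonal covariances."""
    # E-step: posteriors P_{Z|X}(c|x_j) for every sample and component.
    lik = np.stack([pi[c] * gauss(X, mu[c], var[c]) for c in range(2)], axis=1)
    post = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted ML re-estimates of weights, means, and variances.
    nc = post.sum(axis=0)
    pi = nc / X.shape[0]
    mu = post.T @ X / nc[:, None]
    var = np.stack([(post[:, c:c + 1] * (X - mu[c]) ** 2).sum(axis=0) / nc[c]
                    for c in range(2)])
    return pi, mu, var, post

# Initialization from part b); the unit variances are my own choice.
pi = np.array([0.5, 0.5])
mu = np.array([[-0.1, 0.0], [0.1, 0.0]])
var = np.ones((2, 2))
for k in range(3):
    pi, mu, var, post = em_step(D, pi, mu, var)
    print(f"iteration {k + 1}: means\n{mu}")
```

The posteriors returned by em_step are the quantities needed for plot 1, and (mu, var) define the means and Mahalanobis contours needed for plot 2.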
5. EM and MAP estimates In this problem we use EM for the maximization of the posterior probability
\Psi^* = \arg\max_{\Psi} P_{\Psi|X}(\Psi|x).
Consider the multinomial model of problem 3 and a Beta prior

P_\Psi(\Psi) = \frac{\Gamma(\nu_1 + \nu_2)}{\Gamma(\nu_1)\Gamma(\nu_2)}\, \Psi^{\nu_1 - 1}(1 - \Psi)^{\nu_2 - 1}.
Derive the equations of the EM algorithm for MAP estimation of the parameter Ψ.
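As a starting point for the derivation, recall the standard form of EM for MAP estimation: the E-step is unchanged, and the M-step maximizes the expected complete-data log-likelihood plus the log-prior (the Q(·;·) notation below is introduced here, with Q(Ψ; Ψ^{(k)}) = E[log P(D_c; Ψ) | D, Ψ^{(k)}]):

```latex
% EM for MAP estimation: only the M-step changes relative to the ML case.
\Psi^{(k+1)} = \arg\max_{\Psi}\; \Big[\, Q\big(\Psi;\Psi^{(k)}\big) + \log P_{\Psi}(\Psi) \,\Big],
\qquad
\log P_{\Psi}(\Psi) = (\nu_1 - 1)\log\Psi + (\nu_2 - 1)\log(1 - \Psi) + \text{const}.
```

The prior therefore only adds the two logarithmic terms above to the M-step objective of the ML case.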
6. (Computer) This week we use the cheetah image to evaluate the performance of a classifier based on mixture models estimated with EM. Once again we use the decomposition into 8 × 8 image blocks, compute the DCT of each block, and zig-zag scan the coefficients. For this (using the data in TrainingSamplesDCT new 8.mat) we fit a mixture of Gaussians of diagonal covariance to each class, i.e.
P_{X|Y}(x|i) = \sum_{c=1}^{C} \pi_c\, G(x, \mu_c, \Sigma_c)
where all Σc are diagonal matrices. We then apply the BDR based on these density estimates to the cheetah image and measure the probability of error as a function of the number of dimensions of the space (as before, use {1, 2, 4, 8, 16, 24, 32, . . ., 64} dimensions).
a) For each class, learn 5 mixtures of C = 8 components, using a random initialization (recall that the mixture weights must add up to one). Plot the probability of error vs. dimension for each of the 25 classifiers obtained with all possible mixture pairs. Comment on the dependence of the probability of error on the initialization.
b) For each class, learn mixtures with C ∈ {1, 2, 4, 8, 16, 32}. Plot the probability of error vs. dimension for each number of mixture components. What is the effect of the number of mixture components on the probability of error?
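A possible skeleton for the EM fit and the class-conditional log-likelihoods is sketched below in Python/NumPy; the .mat file name and the variable names inside it, the fixed number of EM iterations, and the variance floor are all assumptions, and the block extraction / zig-zag scan of the cheetah image is not shown.

```python
import numpy as np
from scipy.io import loadmat

def fit_gmm_diag(X, C, iters=200, seed=0, floor=1e-4):
    """EM for a C-component, diagonal-covariance Gaussian mixture, random init."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = rng.dirichlet(np.ones(C))                  # random weights, summing to one
    mu = X[rng.choice(n, size=C, replace=False)]    # random means drawn from the data
    var = np.tile(X.var(axis=0), (C, 1))
    for _ in range(iters):
        # E-step: responsibilities, computed in the log domain for stability.
        logp = (np.log(pi)
                - 0.5 * (((X[:, None, :] - mu) ** 2 / var).sum(-1)
                         + np.log(2 * np.pi * var).sum(-1)))
        post = np.exp(logp - logp.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates, with a variance floor to avoid collapse.
        nc = post.sum(axis=0)
        pi = nc / n
        mu = post.T @ X / nc[:, None]
        var = np.maximum(post.T @ X ** 2 / nc[:, None] - mu ** 2, floor)
    return pi, mu, var

def log_class_conditional(X, pi, mu, var):
    """log P_{X|Y}(x|i) under a fitted mixture, for use in the BDR."""
    logp = (np.log(pi)
            - 0.5 * (((X[:, None, :] - mu) ** 2 / var).sum(-1)
                     + np.log(2 * np.pi * var).sum(-1)))
    m = logp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True))).ravel()

# File and variable names below are assumptions about how the training data is stored.
data = loadmat("TrainingSamplesDCT_new_8.mat")
fg, bg = data["TrainsampleDCT_FG"], data["TrainsampleDCT_BG"]
dims = [1, 2, 4, 8, 16, 24, 32, 64]   # extend to the full list given in the assignment
models = {dim: (fit_gmm_diag(fg[:, :dim], C=8, seed=dim),
                fit_gmm_diag(bg[:, :dim], C=8, seed=dim)) for dim in dims}
# The BDR then compares log P_{X|Y}(x|cheetah) + log P_Y(cheetah) with the
# corresponding grass term for each 8x8 block feature of the image (not shown).
```

The class priors P_Y(i) needed by the BDR can be estimated, for example, from the relative sizes of the two training sets.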