Feel free to talk to other members of the class in doing the homework. I am more concerned that you learn how to solve the problem than that you demonstrate that you solved it entirely on your own. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.
• Please use Piazza first if you have questions about the homework. Also feel free to send us e-mails and come to office hours.
• Please, no handwritten solutions. You will submit your solution manuscript as a single pdf file.
• The homework is due at 11:59 PM on the due date. We will be using Compass for collecting the homework assignments. Please submit an electronic copy via Compass2g (http://compass2g.illinois.edu). Please do NOT hand in a hard copy of your write-up. Contact the TAs if you face technical difficulties in submitting the assignment.
• You cannot use the late submission credit hours for this problem set.
• No code is needed for any of these problems. You can do the calculations however you please. You need to turn in only the report. Please name your report as ⟨NetID⟩-hw7.pdf.
1. [EM Algorithm - 70 points]
Assume we have a set D of m data points, where each data point x ∈ D satisfies x ∈ {0, 1}^{n+1}. Denote the i-th bit of the j-th example as x_i^{(j)}. Thus, the index i ranges from 0 to n, and the index j ranges from 1 to m.

Assume these data points were generated according to the following distribution: postulate a hidden random variable Z with values z = 1, 2, where the probability of z = 1 is α and the probability of z = 2 is 1 − α, with 0 < α < 1.

For a specific example x^{(j)}, a random value of Z is chosen, but its true value z is hidden. Note that each example x^{(j)} has a fixed underlying z. If z = 1, the bit x_i^{(j)} is set to 1 with probability p_i; if z = 2, it is set to 1 with probability q_i. Thus, there are 2n + 3 unknown parameters. You will use EM to develop an algorithm to estimate these unknown parameters.
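For concreteness, here is a minimal sketch (in Python/NumPy, with hypothetical values of n, m, α, p_i, and q_i) of the generative process just described; no code is required for your solution.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 4        # bits are indexed 0..n, so each example has n + 1 bits
    m = 1000     # number of examples in D
    alpha = 0.3  # Pr(Z = 1); Pr(Z = 2) = 1 - alpha
    p = rng.uniform(size=n + 1)  # p_i = Pr(x_i = 1 | z = 1)
    q = rng.uniform(size=n + 1)  # q_i = Pr(x_i = 1 | z = 2)

    data = np.empty((m, n + 1), dtype=int)
    for j in range(m):
        z = 1 if rng.random() < alpha else 2   # hidden value, fixed per example
        probs = p if z == 1 else q
        data[j] = (rng.random(n + 1) < probs).astype(int)
    # Only `data` is observed; alpha, p, q (and each z) are what EM must estimate.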
(a) [10 points] Express Pr(x^{(j)}) first in terms of conditional probabilities and then in terms of the unknown parameters α, p_i, and q_i.
(b) [10 points] Let f_z^{(j)} = Pr(Z = z | x^{(j)}), i.e. the probability that the data point x^{(j)} has z as the value of its hidden variable Z. Express f_1^{(j)} and f_2^{(j)} in terms of the unknown parameters.
(c) [10 points] Derive an expression for the expected log likelihood (E[LL]) of the entire data set D and its associated z settings given new parameter estimates α̃, p̃_i, q̃_i.
(d) [10 points] Maximize the log likelihood (LL) and determine the update rules for the parameters according to the EM algorithm.
(e) [10 points] Examine the update rules and explain them in English. Describe in pseudocode how you would run the algorithm: initialization, iteration, termination. What equations would you use at which steps in the algorithm?
(f) [10 points] Assume that your task is to predict the value of x_0 given an assignment to the other n variables and that you have the parameters of the model. Show how to use these parameters to predict x_0. (Hint: Consider the ratio between P(X_0 = 0) and P(X_0 = 1).)
(g) [10 points] Show that the decision surface for this prediction is a linear function of the x_i's.
2. [Tree Dependent Distributions - 30 points]
A tree-dependent distribution is a probability distribution over n variables {x_1, . . . , x_n} that can be represented as a tree built over n nodes corresponding to the variables. If there is a directed edge from variable x_i to variable x_j, then x_i is said to be the parent of x_j. Each directed edge ⟨x_i, x_j⟩ has a weight that indicates the conditional probability Pr(x_j | x_i). In addition, we have a probability Pr(x_r) associated with the root node x_r. When computing joint probabilities over tree-dependent distributions, we assume that a node is independent of all its non-descendants given its parent. For instance, in the example above, x_j is independent of all its non-descendants given x_i.
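As a small illustration, the sketch below evaluates a joint probability on a hypothetical three-node tree with binary variables and edges x_1 → x_2 and x_1 → x_3 (the probability tables are made up): since each node is independent of its non-descendants given its parent, the joint factors as Pr(x_1) Pr(x_2 | x_1) Pr(x_3 | x_1).

    # Hypothetical tree: x1 is the root, with children x2 and x3 (all binary).
    pr_x1 = {0: 0.4, 1: 0.6}                  # Pr(x1)
    pr_x2_given_x1 = {0: {0: 0.7, 1: 0.3},    # Pr(x2 | x1); outer key is x1
                      1: {0: 0.2, 1: 0.8}}
    pr_x3_given_x1 = {0: {0: 0.5, 1: 0.5},    # Pr(x3 | x1)
                      1: {0: 0.9, 1: 0.1}}

    def joint(x1, x2, x3):
        # The joint probability factorizes along the tree edges.
        return pr_x1[x1] * pr_x2_given_x1[x1][x2] * pr_x3_given_x1[x1][x3]

    print(joint(1, 0, 1))   # 0.6 * 0.2 * 0.1 = 0.012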
To learn a tree-dependent distribution, we need to learn three things: the structure of the tree, the conditional probabilities on the edges of the tree, and the probabilities on the nodes. Assume that you have an algorithm to learn an undirected tree T with all required probabilities; that is, for every undirected edge ⟨x_i, x_j⟩, we have learned both probabilities Pr(x_i | x_j) and Pr(x_j | x_i). (There exists such an algorithm, and we will be covering it in class.) The only aspect missing is the directionality of the edges needed to convert this undirected tree into a directed one.
However, it is not necessary to learn the directionality of the edges explicitly. In this problem, you will show that choosing an arbitrary node as the root and directing all edges away from it is sufficient, and that any two directed trees obtained this way from the same underlying undirected tree T are equivalent.
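For illustration only, the sketch below carries out this "direction" step on a hypothetical adjacency-list representation of T: pick any node as the root and orient every edge away from it with a breadth-first traversal.

    from collections import deque

    # Hypothetical undirected tree T over four nodes, as an adjacency list.
    T = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}

    def direct_edges(tree, root):
        # Orient every edge of the undirected tree away from the chosen root.
        directed, visited, queue = [], {root}, deque([root])
        while queue:
            u = queue.popleft()
            for v in tree[u]:
                if v not in visited:
                    directed.append((u, v))   # edge u -> v points away from root
                    visited.add(v)
                    queue.append(v)
        return directed

    print(direct_edges(T, root=1))   # [(1, 2), (1, 3), (2, 4)]
    print(direct_edges(T, root=4))   # [(4, 2), (2, 1), (1, 3)]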
(a) [10 points] State exactly what is meant by the statement: “The two directed trees obtained from T are equivalent.”
(b) [20 points] Show that no matter which node in T is chosen as the root for the “direction” stage, the resulting directed trees are all equivalent (based on your definition above).