Starting from:
$29.99

$23.99

Homework #9 Solution

Instructions:  Please put  all answers in a single PDF  with your name and NetID and upload  to  SAKAI before class on the  due  date  (there  is a  LaTeX  template  on the course web site for you to use).  Definitely consider working in a group; please include the names of the people in your group and write up your solutions separately.   If you look at any references (even wikipedia), cite them.  If you happen  to track  the number of hours you spent on the homework, it would be great if you could put that  at the top of your homework to give us an indication  of how difficult it was.

 

 

Problem 1

 

Adaptive K-means  Revisited:

 

 

Statement of the problem:  Given a data  set X ∈ R1000×1  with an unknown number  of cluster components,  propose an algorithm  which simultaneously  identifies the number of cluster components  and clusters the data.

 

 

In Problem 3 of Homework 4, you thought about  how to construct  an algorithm  based on K-means that  would identify clusters  in a data  with an a priori  unknown number of cluster  components.   You saw that  the  number  of clusters  grows quickly and  is on the  order  of the  number  of data  points.   In order  to  prevent  overfitting  and  encour- age clustering  behavior,  many of you proposed various penalty  functions to remove or otherwise discourage small clusters.

Now that  we know about  Dirichlet  processes, we will put  a DP prior on the partition of our data  into clusters.  Let us put  this model in a probabilistic  framework:  let each cluster represent a univariate  Gaussian  distribution; this model will then  represent an infinite Gaussian  mixture  model.  In this mapping  to our Chinese Restaurant Process metaphor,  a table at a restaurant is a cluster,  and a data  point is a customer.

 

 

(a)  Write  out the generative model. What  is a simple choice of base distribution?

 

(b)  Write  out  the  equation  that  you would use in your Gibbs sampler  for cluster  as- signment.  Exploit  exchangeability  (i.e.  treat  each data  point as if it were the last to arrive at the restaurant, then remove it from it’s current cluster assignment, and reassign it to a cluster using this equation).

 

(c)  How does this algorithm  handle empty clusters?

 

(d)  Explain  how this  model addresses  the  problems  we encountered  in Homework 4, including how this approach  differs from specifying a penalty  to cluster size of the form: ‘remove a cluster if it has fewer than  γ points assigned to it’.

 

Problem 2

 

Project  Proposal  Update:

 

 

You  should  now have  dived  into  the  analysis  of your  data  and  chosen  a  modeling framework and an validation  approach.   An important step in the  research  process is to look at  what  other  ideas researchers  have used to approach  a related  problem.   In addition  to making progress on your project  this  week, you will perform a literature search to see what other researchers have worked with similar data or related problems. What  relevant approaches,  feature  sets, or kernels have they developed that  might be useful in your own analysis?  What  modeling approaches  and simplifying assumptions worked  in this  related  work,  and  what  didn’t  work?   What  can  you take  from the scientific literature to your project?   Write  this  up as a few paragraphs that  will (as with last week’s homework) be included in your final project writeup; include citations.

More products