Instructions: Submit two files only: writeup.pdf and code.zip. Only question 2 is a programming question; written answers are expected for all other questions. For all questions, show all your work, including intermediate steps.
1 Boosting [40 pts]
Consider the dataset below, which consists of N = 4 points:
$(X_1: 0, 1, -)$; $(X_2: 1, 0, +)$; $(X_3: -1, 0, +)$; $(X_4: 0, -1, -)$
Run (by hand, not by programming) the Adaboost algorithm using a decision stump as the weak learner for $T = 4$ timesteps. A decision stump is a depth-1 decision tree that splits on a single variable. An example of a decision stump weak learner is shown in the figure on the right.
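In case the figure is unavailable, a hypothetical example of such a stump (not necessarily the one shown in the figure) splits on a single coordinate at a single threshold, e.g.
\[
h(x) =
\begin{cases}
+1 & \text{if } x_1 > 0.5, \\
-1 & \text{otherwise,}
\end{cases}
\]
which predicts $+$ for every point to the right of the vertical line $x_1 = 0.5$ and $-$ for every point to its left.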
(a) [30 points] For each timestep $t = 1, \ldots, 4$:
state the weak learner $h_t$ and draw it (as a vertical or horizontal line) on the plot;
compute $\epsilon_t$, $\alpha_t$, $Z_t$, $D_t(i)$, and $h_t(x_i)$ for all $i = 1, \ldots, N$.
[5 points] After 4 timesteps, use the final classifier $H(x)$ to make predictions on the training data. What is the training error of Adaboost on this dataset?
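For reference, under a standard Adaboost formulation (weights initialized uniformly, $D_1(i) = 1/N$), the quantities above are computed as
\[
\epsilon_t = \sum_{i=1}^{N} D_t(i)\,\mathbb{1}\!\left[h_t(x_i) \neq y_i\right], \qquad
\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\]
\[
D_{t+1}(i) = \frac{D_t(i)\,\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t}, \qquad
Z_t = \sum_{i=1}^{N} D_t(i)\,\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right),
\]
with final classifier $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$. If lecture used a slightly different normalization, follow that version instead.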
(b) [5 points] Explain why Adaboost does better than a decision stump on the above dataset.
2 Active Learning [60 pts]
Use any programming language to implement an active learning algorithm on a movie review dataset, and compare two selection strategies.
Data. Download review_polarity.zip from the assignments page of the course website. This dataset is from (Pang et al., 2002) and contains 1000 positive and 1000 negative movie reviews. Process the movie review data to create a feature vector for each review, consisting of the unique words and their frequencies in that review. Remove stop words from consideration. Randomly set aside 20% of the data for testing; use the rest for training. Describe your particular approach in preparing the data.
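A minimal data-preparation sketch, assuming Python with scikit-learn and that the extracted reviews live in pos/ and neg/ subdirectories under review_polarity/txt_sentoken (the layout of the Pang et al. distribution); adapt the paths and tokenization to your own setup:

import os, glob
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def load_reviews(root="review_polarity/txt_sentoken"):
    texts, labels = [], []
    for label, sub in [(1, "pos"), (0, "neg")]:
        for path in glob.glob(os.path.join(root, sub, "*.txt")):
            with open(path, encoding="utf-8", errors="ignore") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, np.array(labels)

texts, y = load_reviews()
# Word-frequency features with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
# Randomly hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)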
You will need to code the active learning procedure from scratch, but you can use any machine learning toolkits to implement the underlying learning algorithms (e.g., SVM, decision trees, etc.). To implement the active learning procedure, start off with a small training set $T$ of 100 data points. Let $T'$ be the remaining training data. Implement the following two selection strategies using $K = 10$ (a code sketch of both strategies follows the list):
Uncertainty Sampling: train an SVM on $T$. Using the trained SVM, select the $K$ training examples from $T'$ that are closest to the separating hyperplane (i.e., the $K$ data points with the highest uncertainty).
Query by Committee: train 3 classifiers of your choice (e.g., k-NN, decision tree, naive Bayes) on $T$. Select the $K$ training examples from $T'$ on which the classifiers disagree the most.
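A minimal sketch of the two selection rules, assuming scikit-learn classifiers and the arrays from the data-preparation sketch above; the committee members and the disagreement measure are illustrative choices, not requirements:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def uncertainty_sampling(X_labeled, y_labeled, X_pool, k=10):
    # Train an SVM on T, then pick the K pool points closest to the hyperplane.
    svm = LinearSVC().fit(X_labeled, y_labeled)
    margins = np.abs(svm.decision_function(X_pool))
    return np.argsort(margins)[:k]  # indices into the pool T'

def query_by_committee(X_labeled, y_labeled, X_pool, k=10):
    # Train three different classifiers on T and pick the K pool points
    # on which the committee's votes are most evenly split.
    committee = [MultinomialNB(), KNeighborsClassifier(), DecisionTreeClassifier()]
    votes = np.stack([clf.fit(X_labeled, y_labeled).predict(X_pool) for clf in committee])
    pos_votes = votes.sum(axis=0)
    # For binary labels, disagreement is largest when the vote is split 2 vs 1.
    disagreement = np.minimum(pos_votes, len(committee) - pos_votes)
    return np.argsort(-disagreement)[:k]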
Add the selected $K$ training examples to the current training set $T$ (removing them from $T'$), and retrain. For each selection strategy, plot the test error against the training set size. Repeat the process until the entire training set has been used. A sketch of the overall loop follows.
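The loop might look like the following sketch, which builds on the helper functions above, uses a linear SVM as the evaluation classifier (an assumption, not a requirement), and records the test error after every round:

from sklearn.metrics import accuracy_score

def run_active_learning(X_train, y_train, X_test, y_test, select, seed_size=100, k=10):
    # Start with a small labeled set T of seed_size points; the rest is the pool T'.
    rng = np.random.default_rng(0)
    order = rng.permutation(X_train.shape[0])
    labeled, pool = list(order[:seed_size]), list(order[seed_size:])
    sizes, errors = [], []
    while True:
        clf = LinearSVC().fit(X_train[labeled], y_train[labeled])
        sizes.append(len(labeled))
        errors.append(1 - accuracy_score(y_test, clf.predict(X_test)))
        if not pool:
            break
        # Select K pool points with the chosen strategy and move them into T.
        chosen = select(X_train[labeled], y_train[labeled], X_train[pool], k=min(k, len(pool)))
        for idx in sorted(chosen, reverse=True):
            labeled.append(pool.pop(idx))
    return sizes, errors

sizes_us, err_us = run_active_learning(X_train, y_train, X_test, y_test, uncertainty_sampling)
sizes_qbc, err_qbc = run_active_learning(X_train, y_train, X_test, y_test, query_by_committee)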
In the end, there should be two plots of test error (y-axis) vs training set size (x-axis)—one for Uncertainty Sampling, one for Query by Committee. Describe the two plots. Which of these two selection strategies performed better? Explain why.
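Plotting can be as simple as the following matplotlib sketch (one figure per strategy), using the arrays returned by run_active_learning above:

import matplotlib.pyplot as plt

for name, sizes, errs in [("Uncertainty Sampling", sizes_us, err_us),
                          ("Query by Committee", sizes_qbc, err_qbc)]:
    plt.figure()
    plt.plot(sizes, errs)
    plt.xlabel("Training set size")
    plt.ylabel("Test error")
    plt.title(name)
    plt.savefig(name.replace(" ", "_").lower() + ".png")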