Project W8: Dimensionality Reduction and Visualization Solution

Character recognition is a classical problem in pattern recognition. Digit recognition is an important subproblem with many applications, such as automatic routing of mail according to zip codes, automatic recognition of signature dates on documents, automatic digitization of distances and other labels on maps and technical drawings, and automatic recognition of vehicle license plates.

In this assignment, the goal is to develop a handwritten digit recognition system. The input to the system will be a digitized image of a digit and the output will be the type of the digit {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. Figure 1 shows example images from the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/) that has been heavily used in the literature. The data set that will be used in this assignment contains around 500 patterns for each class (for a total of 5,000 patterns). The data file (a MATLAB mat file) that is available on the course home page contains a matrix called digits with 5,000 rows, where each row corresponds to a pattern, and a vector called labels where each value shows the class of the corresponding pattern. Each pattern is a digitized 20x20 image of a digit (represented in column-major order as a 400-dimensional vector).

Download the digit data set and divide it into two subsets by randomly selecting half of the patterns from each class for training and the remaining patterns for testing. Then, perform the classification experiments described below. Use the same subsets for the rest of the assignment. Do not forget to use separate subsets for training and testing.
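Although the assignment's hints are in MATLAB, the stratified half/half split can be sketched in NumPy as follows. The variable names digits and labels mirror the .mat file; the arrays below are synthetic stand-ins so the sketch runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 5,000 patterns, 400 features, 10 classes.
digits = rng.normal(size=(5000, 400))
labels = np.repeat(np.arange(10), 500)

train_idx, test_idx = [], []
for c in np.unique(labels):
    idx = rng.permutation(np.flatnonzero(labels == c))
    half = len(idx) // 2
    train_idx.extend(idx[:half])   # half of each class for training
    test_idx.extend(idx[half:])    # the remaining patterns for testing

X_train, y_train = digits[train_idx], labels[train_idx]
X_test, y_test = digits[test_idx], labels[test_idx]
```

Because the split is done per class, both subsets contain roughly 250 patterns from each digit, and the two index lists are disjoint as the assignment requires.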




















Figure 1: Example digits.

Hint: In MATLAB, you can visualize the digits using the following piece of code:

load digits.mat
i = 1;                           % select the digit to display
I = digits( i, : );
imagesc( reshape( I, 20, 20 ) ); % patterns are stored in column-major order
colormap( gray );
axis image;

Background

A quadratic Gaussian classifier models each class with a Gaussian distribution. The Gaussian can be considered as a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector. The Gaussian density for the feature vector x in R^d is defined as

    p(x | mu, Sigma) = 1 / ( (2*pi)^(d/2) |Sigma|^(1/2) ) * exp( -(1/2) (x - mu)^T Sigma^(-1) (x - mu) )

where d is the dimension of x, and mu and Sigma are the mean vector and the covariance matrix, respectively.
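As a sanity check on the density formula, a NumPy sketch of the log of this density (working in log space avoids underflow for 400-dimensional data; the function name is my own):

```python
import numpy as np

def gaussian_log_density(x, mu, cov):
    """Log of the multivariate Gaussian density p(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    sign, logdet = np.linalg.slogdet(cov)          # log |cov|, stable for large d
    maha = diff @ np.linalg.solve(cov, diff)       # (x - mu)^T cov^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
```

For a standard normal in one dimension evaluated at the mean, this returns -0.5 * log(2*pi), matching the closed form.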
The training procedure involves fitting a Gaussian p_c(x | mu_c, Sigma_c) to the corresponding subset of the training data for each class c = 1, ..., C, where C is the number of classes. Fitting can be done by using the maximum likelihood estimates of the mean vector and the covariance matrix

    mu_hat = (1/n) * sum_{i=1}^{n} x_i

    Sigma_hat = (1/n) * sum_{i=1}^{n} (x_i - mu_hat)(x_i - mu_hat)^T

using the samples for each class. In the formulas above, n represents the number of samples that belong to a particular class. Estimation should be done separately by using the subset of samples for each class.
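The per-class maximum likelihood estimates above can be sketched in NumPy as follows (function name is my own; note the division by n rather than n-1, matching the ML estimate in the formulas):

```python
import numpy as np

def fit_gaussians(X, y):
    """Maximum-likelihood mean and covariance for each class label in y."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]                    # samples belonging to class c only
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        cov = diff.T @ diff / len(Xc)     # ML estimate divides by n, not n - 1
        params[c] = (mu, cov)
    return params
```

Each class is estimated from its own subset of the training samples, as the text requires.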

Once the Gaussian densities are learned for all classes, during the classification process, a pattern can be assigned to the class c* that gives the highest probability for that pattern's feature vector x as

    c* = arg max_{c = 1, ..., C} p_c(x | mu_c, Sigma_c).

Given a data set with known class labels, the performance of the classifier can be evaluated by comparing the class predicted by the classifier to the true class known for each sample. The quantitative performance can be summarized using the classification error, computed as the ratio of the number of wrongly predicted samples to the total number of samples.
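The arg-max decision rule and the classification error can be sketched in NumPy like this (the dictionary of per-class (mean, covariance) pairs and the function names are my own conventions; the (2*pi)^(d/2) constant is dropped because it is identical for all classes and does not affect the arg max):

```python
import numpy as np

def classify(X, params):
    """Assign each row of X to the class whose Gaussian scores highest."""
    classes = sorted(params)
    scores = np.empty((len(X), len(classes)))
    for j, c in enumerate(classes):
        mu, cov = params[c]
        diff = X - mu
        sign, logdet = np.linalg.slogdet(cov)
        # Mahalanobis distance of every row at once.
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(cov), diff)
        scores[:, j] = -0.5 * (logdet + maha)   # shared constant term omitted
    return np.array(classes)[np.argmax(scores, axis=1)]

def error_rate(y_pred, y_true):
    """Ratio of wrongly predicted samples to the total number of samples."""
    return float(np.mean(y_pred != y_true))
```

Running classify on the training set and the test set with the same fitted parameters gives the two error curves requested in the questions below.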

Question 1 (35 pts)

In this question, you will use principal component analysis (PCA) to project the 400-dimensional data onto lower dimensional subspaces to observe the effect of dimensionality on the performance of the Gaussian classifier.

    1. Use PCA to obtain a new set of bases (use the training data set, i.e., 2,500 patterns for PCA). Plot the eigenvalues in descending order. How many components (subspace dimension) would you choose by just looking at this plot?

    2. Display the sample mean for the whole training data set as an image (using samples for all classes together). Also display the bases (eigenvectors) that you chose as images (e.g., like in Figure 1) and discuss the results with respect to your expectations.

    3. Choose different subspaces with dimensions between 1 and 200 (choose at least 20 different subspace dimensions, the more the better), and project the data (project both the training data and the test data using the transformation matrix estimated from the training data) onto these subspaces. Train a Gaussian classifier using data in each subspace (do not forget to use half of the data for training and the remaining half for testing).


    4. Plot classification error vs. the number of components used for each subspace, and discuss your results. Compute the classification error for both the training set and the test set (training is always done using the training set), and provide two plots.
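The PCA steps above (eigenvalues in descending order, bases estimated on the training set only) can be sketched in NumPy; the function name and return convention are my own:

```python
import numpy as np

def pca(X_train, k):
    """PCA from training data: sample mean, top-k eigenvectors, all eigenvalues."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort into descending order
    return mu, eigvecs[:, order[:k]], eigvals[order]

# Project BOTH sets with the transformation estimated from the training data:
#   mu, W, eigvals = pca(X_train, k)
#   Z_train = (X_train - mu) @ W
#   Z_test  = (X_test  - mu) @ W
```

Plotting eigvals gives the scree plot for item 1, and each column of W reshaped to 20x20 gives the basis images for item 2.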


Question 2 (35 pts)

In this question, you will use Fisher linear discriminant analysis (LDA) to project the 400-dimensional data onto lower dimensional subspaces.

    1. Use LDA to obtain a new set of bases (use the training data set, i.e., 2,500 patterns for LDA). Display these bases as images and discuss the results with respect to your expectations.

    2. Using subspaces with dimensions between 1 and 9 (the largest possible for LDA on a data set containing 10 classes), project the data (project both the training data and the test data using the transformation matrix estimated from the training data) onto these subspaces. Train a Gaussian classifier using data in each subspace (do not forget to use half of the data for training and the remaining half for testing).

    3. Plot classification error vs. the dimension of each subspace, and discuss your results. Compute the classification error for both the training set and the test set (training is always done using the training set), and provide two plots.
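Fisher LDA solves a generalized eigenproblem on the within-class and between-class scatter matrices; a NumPy sketch under my own naming conventions (the small ridge added to S_w is an assumption on my part, since the 400x400 within-class scatter of the digit data is typically singular):

```python
import numpy as np

def lda(X, y, k):
    """Fisher LDA bases: top-k eigenvectors of Sw^{-1} Sb (k <= C - 1)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
    # Sw is often singular for 400-d digit data; a small ridge keeps it invertible.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real
```

Because Sb has rank at most C - 1, at most 9 useful directions exist for 10 classes, which is why the question caps the subspace dimension at 9.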


Question 3 (30 pts)

In this question, you will use dimensionality reduction methods specifically designed for the visualization of high-dimensional data sets. In particular, you are asked to use

Sammon's mapping (J. W. Sammon, "A Nonlinear Mapping for Data Structure Analysis," IEEE Transactions on Computers, vol. C-18, no. 5, pp. 401-409, May 1969), and

t-SNE (L. J. P. van der Maaten and G. E. Hinton, "Visualizing High-Dimensional Data Using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, November 2008)

for mapping the digit data set to two dimensions. For each method, compute the resulting mapping for the whole data set, and present the scatter of the patterns together with their class information. Discuss the setup that you used (e.g., parameters needed for initialization, iterations, or stopping), and comment on the resulting visualizations.
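To make the setup choices concrete, here is a minimal NumPy sketch of Sammon's mapping by plain gradient descent on the Sammon stress (the function, learning rate, and iteration count are my own illustrative choices, not the paper's original update rule; Sammon uses a second-order step). For t-SNE, scikit-learn's sklearn.manifold.TSNE is a commonly used implementation that you could cite instead of writing your own.

```python
import numpy as np

def sammon(X, n_iter=300, lr=0.3, seed=0):
    """Minimal Sammon mapping to 2-D via gradient descent on the stress."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise input-space distances d*_ij.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mask = ~np.eye(n, dtype=bool)
    scale = D[mask].sum()                     # normalizing constant of the stress
    Y = rng.normal(scale=1e-2, size=(n, 2))   # random 2-D initialization
    Dm = D.copy()
    np.fill_diagonal(Dm, 1.0)                 # dummy diagonal, masked out below
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(d, 1.0)              # avoid division by zero
        coeff = (d - Dm) / (Dm * d)           # gradient weight per pair
        np.fill_diagonal(coeff, 0.0)
        grad = 2.0 / scale * (coeff[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```

Parameters worth reporting for the discussion: the random initialization, the step size, and the stopping criterion (here simply a fixed number of iterations).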

Note: You are required to upload your assignment report (as a PDF file) and all code that you wrote in a single archive (in tar, rar, or zip format) to Moodle. Your report must include all required details listed for each question. You are free to write your own code or use other tools. However, you should still include any code that you have written in your submission, and properly cite any other tools that you have used (you do not need to upload the external libraries together with your submission, but provide a citation and a link in your report). Make sure that the tools you are using are correct implementations of the particular steps outlined in this assignment. Note that using code from other sources without proper citation will be considered plagiarism.


Do not forget to write your name in the report that you are submitting. Also do not forget to include your name in the filename of the PDF file so that the filename becomes unique.

