Assignment 1 Dimensionality Reduction Solution

Introduction




In this assignment, in addition to related theory/math questions, you’ll work on visualizing data and reducing its dimensionality.




You may not use any functions from a machine learning library in your code; however, you may use statistical functions. For example, if available you MAY NOT use functions like




pca




entropy




however you MAY use basic statistical functions like:




std, mean, cov, eig




Grading




Although all assignments will be weighted equally in computing your homework grade, below is the grading rubric we will use for this assignment:




Part 1 (Theory)       30pts
Part 2 (PCA)          40pts
Part 3 (Eigenfaces)   20pts
Report                10pts
TOTAL                 100pts









Table 1: Grading Rubric



Datasets




Yale Faces Dataset: This dataset consists of 154 images (each of which is 243x320 pixels) taken from 14 people under 11 different viewing conditions (for our purposes, the first person was removed from the official dataset, so person ID=2 is the first person).




The filename of each image encodes class information:




subject<ID>.<condition>
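As a minimal sketch of how a script might recover the class label from such a filename, assuming names of the form subject<ID>.<condition> (the example name `subject02.happy` and the helper `parse_yale_filename` are illustrative, not part of the assignment):

```python
def parse_yale_filename(name):
    """Split a name like 'subject02.happy' into (subject_id, condition)."""
    stem, _, condition = name.partition(".")
    if not stem.startswith("subject") or not condition:
        raise ValueError(f"unexpected filename: {name!r}")
    # Everything after the 'subject' prefix is taken to be the numeric ID.
    return int(stem[len("subject"):]), condition


subject_id, condition = parse_yale_filename("subject02.happy")
```

The subject ID can then serve as the class label when coloring or grouping points in later parts.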




Data obtained from: http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html





 
Theory Questions




 
(15 points) Consider the following data:






Class 1 = [data matrix], Class 2 = [data matrix] (matrix entries lost in extraction)
 
(a) Compute the information gain for each feature. You could standardize the data overall, although it won’t make a difference. (13pts)




 
(b) Which feature is more discriminating based on the results in Part (a)? (2pts)
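One possible way to compute information gain without a library `entropy` function is sketched here. It assumes each distinct feature value is treated as its own branch, which may differ from the convention used in lecture. The toy arrays `X` and `y` are hypothetical and are NOT the assignment's data.

```python
import numpy as np


def entropy_bits(labels):
    """Shannon entropy (in bits) of a 1D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(feature, labels):
    """Label entropy minus the expected entropy after splitting on each
    distinct value of the feature (assumed splitting rule)."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    remainder = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of observations taking this value
        remainder += mask.mean() * entropy_bits(labels[mask])
    return entropy_bits(labels) - remainder


# Hypothetical toy data: rows are observations, columns are features.
X = np.array([[0, 1], [0, 1], [1, 1], [1, 0]])
y = np.array([1, 1, 2, 2])
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
```

In this toy example the first feature separates the two classes perfectly, so it has the larger gain and would be the more discriminating feature.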




 
(15 points) In principal component analysis (PCA) we are trying to maximize the variance of the data after projection while minimizing how far the magnitude of w, |w|, is from unit length. This results in attempting to find the value of w that maximizes the equation

$$w^T \Sigma w - \lambda (w^T w - 1)$$

where $\Sigma$ is the covariance matrix of the observable data matrix X.
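For reference, a short sketch of why this objective leads to an eigenvalue problem. It writes the covariance matrix as $\Sigma$ and the multiplier as $\lambda$ (both symbols are assumed here) and uses the fact that $\Sigma$ is symmetric:

$$
\begin{aligned}
J(w) &= w^T \Sigma w - \lambda (w^T w - 1) \\
\frac{\partial J}{\partial w} &= 2 \Sigma w - 2 \lambda w = 0 \quad\Rightarrow\quad \Sigma w = \lambda w \\
w^T \Sigma w &= \lambda\, w^T w = \lambda \quad \text{(when } w^T w = 1\text{)}
\end{aligned}
$$

The projected variance therefore equals the eigenvalue, so the maximizing w is the eigenvector of $\Sigma$ with the largest eigenvalue.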







One problem with PCA is that it doesn’t take class labels into account. Therefore projecting using PCA can result in worse class separation, making the classification problem more difficult, especially for linear classifiers.







To avoid this, if we have class information, one idea is to separate the data by class and aim to find the projection that maximizes the distance between the means of the class data after projection, while minimizing their variance after projection. This is called linear discriminant analysis (LDA).




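Let $C_i$ be the set of observations that have class label $i$, and let $\mu_i$, $\sigma_i$ be the mean and standard deviation, respectively, of those sets. Assuming that we only have two classes, we then want to find the value of $w$ that maximizes the equation:

$$(\mu_1 w - \mu_2 w)^T (\mu_1 w - \mu_2 w) - \lambda \left( (\sigma_1 w)^T (\sigma_1 w) + (\sigma_2 w)^T (\sigma_2 w) \right)$$

which is equivalent to

$$w^T (\mu_1 - \mu_2)^T (\mu_1 - \mu_2) w - \lambda \left( w^T (\sigma_1^T \sigma_1 + \sigma_2^T \sigma_2) w \right)$$

Show that to maximize this we must find the eigenvector/eigenvalue pairs, $(w, \lambda)$, for the equation:

$$(\sigma_1^T \sigma_1 + \sigma_2^T \sigma_2)^{-1} \left( (\mu_1 - \mu_2)^T (\mu_1 - \mu_2) \right) w = \lambda w$$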
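The eigenvalue equation above can be sanity-checked numerically. The sketch below is not the requested proof. It assumes the within-class term for each class is that class's covariance matrix, and it uses hypothetical randomly generated classes `c1` and `c2`.

```python
import numpy as np


def lda_direction(class1, class2):
    """Leading eigenpair of S_W^{-1} S_B for two classes (rows = observations)."""
    mu1 = class1.mean(axis=0, keepdims=True)   # 1 x D row vector
    mu2 = class2.mean(axis=0, keepdims=True)
    diff = mu1 - mu2
    s_b = diff.T @ diff                        # between-class term, D x D
    # Assumption: the within-class term is the sum of the class covariances.
    s_w = np.cov(class1, rowvar=False) + np.cov(class2, rowvar=False)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(s_w) @ s_b)
    top = np.argmax(eigvals.real)
    return eigvals.real[top], eigvecs[:, top].real


# Hypothetical toy classes, NOT the assignment's data.
rng = np.random.default_rng(0)
c1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(200, 2))
c2 = rng.normal(loc=[2.0, 1.0], scale=[1.0, 3.0], size=(200, 2))
lam, w = lda_direction(c1, c2)
```

With two classes the between-class matrix has rank one, so only one eigenvalue is nonzero and its eigenvector is the projection direction of interest.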

 
(40pts) Dimensionality Reduction via PCA




Download and extract the dataset yalefaces.zip from Blackboard. This dataset has 154 images (N = 154) each of which is a 243x320 image (D = 77760). In order to process this data your script will need to:




 
1. Read in the list of files.
2. Create a 154x1600 data matrix such that for each image file you:
   (a) Read in the image as a 2D array (243x320 pixels).
   (b) Subsample the image to become a 40x40 pixel image (for processing speed).
   (c) Flatten the image to a 1D array (1x1600).
   (d) Concatenate this as a row of your data matrix.
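The loading steps above might be sketched as follows. This assumes Pillow is available for decoding the image files (it is used only for file I/O), that the extracted image files all begin with `subject`, and that nearest-neighbour index selection is an acceptable way to subsample. The function names are illustrative.

```python
import os

import numpy as np


def subsample(image, out_rows=40, out_cols=40):
    """Nearest-neighbour subsampling: keep evenly spaced rows and columns."""
    rows = np.linspace(0, image.shape[0] - 1, out_rows).astype(int)
    cols = np.linspace(0, image.shape[1] - 1, out_cols).astype(int)
    return image[np.ix_(rows, cols)]


def build_data_matrix(folder):
    """Return (X, names) where X holds one flattened 40x40 image per row."""
    from PIL import Image  # Pillow; imported here because only this step needs it

    names = sorted(n for n in os.listdir(folder) if n.startswith("subject"))
    rows = []
    for name in names:
        # convert("L") yields grayscale intensities rather than palette indices
        img = Image.open(os.path.join(folder, name)).convert("L")
        pixels = np.asarray(img, dtype=float)
        rows.append(subsample(pixels).reshape(1, -1))   # 1 x 1600
    return np.vstack(rows), names
```

A call such as `X, names = build_data_matrix("yalefaces")` would then give the 154x1600 matrix, where the folder name `yalefaces` is an assumption about where the zip was extracted.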




Once you have your data matrix, your script should:




 
1. Standardize the data.
2. Reduce the data to 2D using PCA.
3. Graph the data for visualization.
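A minimal sketch of the standardize-then-project steps follows. It uses `np.linalg.eigh`, the symmetric-matrix variant of `eig`, on the assumption that it counts as a permitted statistical function. `X_demo` is a hypothetical random stand-in for the face matrix.

```python
import numpy as np


def standardize(X):
    """Zero-mean, unit-variance columns; constant columns are left at zero."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0          # avoid dividing by zero for constant pixels
    return (X - mean) / std


def pca_project(X, n_components=2):
    """Project standardized data onto its top principal components."""
    Z = standardize(X)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    return Z @ eigvecs[:, order], eigvals[order]


# Hypothetical stand-in data, NOT the Yale faces.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(154, 20)) @ rng.normal(size=(20, 20))
P, top_vals = pca_project(X_demo)
```

The two columns of `P` can then be passed to any scatter-plot routine. The sign of each eigenvector is arbitrary, which is one reason a plot may appear flipped or rotated relative to a reference figure.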




Recall that although you may not use any package ML functions like pca, you may use statistical functions like eig.




Your graph should end up looking similar to Figure 1 (although it may be rotated differently, depending on how you ordered things).


 
(20 points) Eigenfaces




Download and extract the dataset yalefaces.zip from Blackboard. This dataset has 154 images (N = 154) each of which is a 243x320 image (D = 77760). In order to process this data your script will need to:




 
1. Read in the list of files.
2. Create a 154x1600 data matrix such that for each image file you:
   (a) Read in the image as a 2D array (243x320 pixels).
   (b) Subsample the image to become a 40x40 pixel image (for processing speed).
   (c) Flatten the image to a 1D array (1x1600).
   (d) Concatenate this as a row of your data matrix.







Write a script that:




 
1. Imports the data as mentioned above.
2. Standardizes the data.
3. Performs PCA on the data (again, although you may not use any package ML functions like pca, you may use statistical functions like eig).
4. Determines the number of principal components necessary to encode at least 95% of the information, k.
5. Visualizes the most important principal component as a 40x40 image (see Figure 2).
6. Reconstructs the first person using the primary principal component and then using the k most significant eigenvectors (see Figure 3). For the fun of it, maybe even look to see if you can perfectly reconstruct the face if you use all the eigenvectors!
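The sketch below shows one way to choose k and reconstruct an observation. It assumes "information" means the fraction of total variance, measured as the cumulative sum of eigenvalues. The helper names and the random `X_demo` matrix are hypothetical; real code would use the standardized 154x1600 face matrix.

```python
import numpy as np


def pca_basis(Z):
    """Eigenvalues and eigenvectors of the covariance of Z, largest first."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def components_for(eigvals, fraction=0.95):
    """Smallest k whose leading eigenvalues reach `fraction` of the total."""
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)


def reconstruct(z, eigvecs, k):
    """Project one standardized row onto the top-k eigenvectors and back."""
    W = eigvecs[:, :k]
    return (z @ W) @ W.T


# Hypothetical low-rank data plus noise, NOT the Yale faces.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(154, 5)) @ rng.normal(size=(5, 30))
X_demo = X_demo + 0.1 * rng.normal(size=(154, 30))
mean, std = X_demo.mean(axis=0), X_demo.std(axis=0, ddof=1)
Z = (X_demo - mean) / std
eigvals, eigvecs = pca_basis(Z)
k = components_for(eigvals)
face_k = reconstruct(Z[0], eigvecs, k) * std + mean              # undo standardization
face_all = reconstruct(Z[0], eigvecs, Z.shape[1]) * std + mean   # all eigenvectors
```

For the face data, `eigvecs[:, 0].reshape(40, 40)` would give the primary component as an image, and a reconstructed row can be reshaped the same way. Using every eigenvector recovers the original row up to floating-point error, because the eigenvectors form an orthonormal basis.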




Your principal eigenface should end up looking similar to Figure 2.










Figure 2: Primary Principal Component


Your reconstruction should end up looking similar to Figure 3.




Figure 3: Reconstruction of first person (ID=2)

Submission




For your submission, upload to Blackboard a single zip file containing:




 
PDF Writeup




 
Source Code




 
readme.txt file




The readme.txt file should contain information on how to run your code to reproduce results for each part of the assignment. Do not include spaces or special characters (other than the underscore character) in your file and directory names. Doing so may break our grading scripts.




The PDF document should contain the following:




 
Part 1: Your answers to the theory questions.




 
Part 2: The visualization of the PCA result




 
Part 3:
   (a) Number of principal components needed to represent 95% of the information, k.
   (b) Visualization of the primary principal component.
   (c) Visualization of the reconstruction of the first person using:
       i. Original image
       ii. Single principal component
       iii. k principal components.