Show your work. Include any code snippets you used to generate an answer, with comments in the code clearly indicating which problem each snippet corresponds to.
1. (2 points) In Python, generate a (2-dimensional multivariate Gaussian) data matrix D using the following code:
import numpy as np
mu = np.array([0,0])
Sigma = np.array([[1,0], [0, 1]])
X1, X2 = np.random.multivariate_normal(mu, Sigma, 1000).T
D = np.array([X1, X2]).T
Create a scatter plot of the data, with the x-axis corresponding to the first attribute (column) in D, and the y-axis corresponding to the second attribute (column) in D.
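One possible way to produce the scatter plot for Question 1 is sketched below. It assumes matplotlib is available; the fixed seed, marker size, and axis labels are illustrative choices, not requirements of the assignment.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Problem 1: sample 1000 points from a standard 2-D Gaussian.
np.random.seed(0)  # assumption: fixed seed, only for reproducibility
mu = np.array([0, 0])
Sigma = np.array([[1, 0], [0, 1]])
X1, X2 = np.random.multivariate_normal(mu, Sigma, 1000).T
D = np.array([X1, X2]).T  # shape (1000, 2), one data instance per row

# Scatter plot: column 0 on the x-axis, column 1 on the y-axis.
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(D[:, 0], D[:, 1], s=5, alpha=0.5)
ax.set_xlabel("first attribute (column 0 of D)")
ax.set_ylabel("second attribute (column 1 of D)")
ax.set_aspect("equal")  # equal scaling so the cloud looks circular
# In an interactive session, call plt.show() to display the figure.
```

With an identity covariance matrix, the point cloud should look roughly circular and centered at the origin.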
2. (7 points) Use the scaling matrix S and rotation matrix R below to transform the data D from Question 1 by multiplying each data instance (row) x_i by RS. Let DRS be the matrix of the transformed data. That is, each 2-dimensional row vector x_i in D should be transformed into the 2-dimensional vector RSx_i in DRS.
(a) (4 points) Plot the transformed data DRS in the same figure as the original data D, using different colors to differentiate between the original and transformed data.
\[
R = \begin{pmatrix} \cos(\pi/4) & -\sin(\pi/4) \\ \sin(\pi/4) & \cos(\pi/4) \end{pmatrix},
\qquad
S = \begin{pmatrix} 5 & 0 \\ 0 & 2 \end{pmatrix}
\]
(b) (2 points) Write down the covariance matrix of the transformed data DRS.
(c) (1 point) What is the total variance of the transformed data DRS?
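The transformation in Question 2 can be sketched with plain NumPy as follows. The sketch assumes R is the standard counter-clockwise rotation by π/4 and S = diag(5, 2). Because instances are stored as rows, applying RS to every x_i amounts to right-multiplying D by the transpose of RS. For data with identity covariance, the linear map x ↦ Ax has covariance A Aᵀ, which the code compares with the sample estimate.

```python
import numpy as np

np.random.seed(0)  # assumption: seed chosen only for reproducibility
D = np.random.multivariate_normal([0, 0], np.eye(2), 1000)

theta = np.pi / 4
# Assumption: standard counter-clockwise rotation matrix.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Assumption: scale the first attribute by 5 and the second by 2.
S = np.diag([5.0, 2.0])

A = R @ S
# Rows of D are instances, so applying A to every row x_i is D @ A.T.
D_RS = D @ A.T

cov_theory = A @ A.T                       # Cov(Ax) = A I A^T when Cov(x) = I
cov_sample = np.cov(D_RS, rowvar=False)    # estimate from the 1000 samples
total_var_theory = np.trace(cov_theory)    # total variance = trace of covariance

print(cov_theory)
print(cov_sample)
print(total_var_theory, np.trace(cov_sample))
```

Under these assumptions the theoretical covariance is [[14.5, 10.5], [10.5, 14.5]] with total variance 25 + 4 = 29; the sample estimate will differ slightly because it is computed from a finite sample.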
3. (8 points) Use sklearn’s PCA function to transform the data matrix DRS from Question 2 to a 2-dimensional space where the coordinate axes are the principal components.
(a) (4 points) Plot the PCA-transformed data, with the x-axis corresponding to the first principal component and the y-axis corresponding to the second principal component.
(b) (2 points) What is the estimated covariance matrix of the PCA-transformed data?
(c) (2 points) What is the fraction of the total variance captured in the direction of the first principal component? What is the fraction of the total variance captured in the direction of the second principal component?
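A minimal sketch of Question 3 with scikit-learn's `PCA` is given below. It rebuilds the transformed data under the same assumptions as before (counter-clockwise rotation by π/4, scaling diag(5, 2)); the variable names `Z` and `cov_Z` are illustrative. The attributes `explained_variance_` and `explained_variance_ratio_` hold the per-component variances and their fractions of the total.

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)  # assumption: seed chosen only for reproducibility
D = np.random.multivariate_normal([0, 0], np.eye(2), 1000)

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # assumed rotation matrix
S = np.diag([5.0, 2.0])                          # assumed scaling matrix
D_RS = D @ (R @ S).T                             # transformed data, rows are instances

# Problem 3: project onto the principal components.
pca = PCA(n_components=2)
Z = pca.fit_transform(D_RS)  # column 0 = PC1 scores, column 1 = PC2 scores

cov_Z = np.cov(Z, rowvar=False)       # estimated covariance after PCA
print(cov_Z)                          # off-diagonal entries are numerically zero
print(pca.explained_variance_)        # sample variances along PC1 and PC2
print(pca.explained_variance_ratio_)  # fractions of the total variance
```

PCA decorrelates the data, so the estimated covariance of `Z` is diagonal up to floating-point error. Under the stated assumptions the fractions should come out near 25/29 ≈ 0.86 and 4/29 ≈ 0.14.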
4. (18 points) Load the Boston data set into Python using sklearn’s datasets package. Use sklearn’s PCA function to reduce the dimensionality of the data to 2 dimensions.
(a) (5 points) First, standard-normalize the data. Then, create a scatter plot of the 2-dimensional, PCA-transformed normalized Boston data, with the x-axis corresponding to the first principal component and the y-axis corresponding to the second principal component.
(b) (3 points) Create a plot of the fraction of the total variance explained by the first r components for r = 1, 2, ..., 13.
(c) (2 points)
i. (1 point) If we want to capture at least 90% of the variance of the normalized Boston data, how many principal components (i.e., what dimensionality) should we use?
ii. (1 point) If we use two principal components of the normalized Boston data, how much (what fraction or percentage) of the total variance do we capture?
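The normalization and variance-explained steps in parts (a) through (c) might look like the sketch below. Note that `load_boston` was removed in scikit-learn 1.2, so this example runs on random correlated stand-in data of the same shape (506 × 13); replace `X` with the real feature matrix where it is available. The names `r_90` and `frac_two` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 506 instances, 13 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13)) @ rng.normal(size=(13, 13))

# 4(a): standard-normalize each column, then project to 2 dimensions.
X_std = StandardScaler().fit_transform(X)
Z2 = PCA(n_components=2).fit_transform(X_std)

# 4(b): cumulative fraction of variance explained by the first r components.
pca_full = PCA().fit(X_std)  # keeps all 13 components
cum_frac = np.cumsum(pca_full.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cum_frac) + 1), cum_frac, marker="o")
ax.set_xlabel("number of principal components r")
ax.set_ylabel("fraction of total variance explained")

# 4(c)i: smallest r whose cumulative fraction reaches 90%.
r_90 = int(np.searchsorted(cum_frac, 0.90) + 1)
# 4(c)ii: fraction captured by the first two components.
frac_two = cum_frac[1]
print(r_90, frac_two)
```

The numbers printed here describe the stand-in data only and say nothing about the Boston data set itself.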
(d) (4 points) Use scikit-learn’s implementation of k-means to find 2 clusters in the two-dimensional, PCA-transformed normalized Boston data set (the input to k-means should be the data that was plotted in part 4(a)). Plot the 2-dimensional data with colors corresponding to predicted cluster membership for each point. On the same plot, also plot the two means found by the k-means algorithm in a different color than the colors used for the data.
(e) (4 points) Use scikit-learn’s implementation of DBSCAN to find clusters in the two-dimensional, PCA-transformed normalized Boston data set (the input to DBSCAN should be the data that was plotted in part ). Plot the 2-dimensional data with colors corresponding to predicted cluster membership for each point. Noise points should be colored differently than any of the clusters. How many clusters were found by DBSCAN?
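The clustering steps in parts (d) and (e) can be sketched as follows. The input `Z` is a synthetic two-blob stand-in for the 2-dimensional PCA-transformed data, and the DBSCAN parameters `eps=0.5` and `min_samples=5` are scikit-learn's defaults written out explicitly; suitable values for the real data are an open choice. DBSCAN marks noise points with the label -1, so they are excluded when counting clusters.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Stand-in data (assumption): two Gaussian blobs instead of the PCA output.
rng = np.random.default_rng(0)
Z = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.5, size=(150, 2)),
])

# 4(d): k-means with k = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)

# 4(e): DBSCAN; label -1 marks noise points.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit(Z).labels_
n_clusters_db = len(set(db_labels)) - (1 if -1 in db_labels else 0)

fig, (ax_km, ax_db) = plt.subplots(1, 2, figsize=(10, 4))

# Points colored by k-means cluster, with the two means in red.
ax_km.scatter(Z[:, 0], Z[:, 1], c=km.labels_, cmap="viridis", s=8)
ax_km.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
              c="red", marker="x", s=120, label="k-means centers")
ax_km.set_title("k-means (k = 2)")
ax_km.legend()

# Points colored by DBSCAN cluster, with noise in light gray.
noise = db_labels == -1
ax_db.scatter(Z[~noise, 0], Z[~noise, 1], c=db_labels[~noise], cmap="viridis", s=8)
ax_db.scatter(Z[noise, 0], Z[noise, 1], c="lightgray", s=8, label="noise")
ax_db.set_title(f"DBSCAN ({n_clusters_db} clusters)")
ax_db.legend()
```

Unlike k-means, DBSCAN does not take the number of clusters as input, so the count it reports depends on `eps` and `min_samples`.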
Acknowledgements: Homework problems adapted from assignments of Veronika Strnadova-Neeley.