I. Principal Component Analysis: real data
Introduction: Data for this portion of the assignment consists of 4 vectors of data, each with 150 entries (NOTE: each row is a sample with 4 entries; the number of rows is the number of samples). Each column represents a measurement, in centimeters, of one specific feature, taken from a sample of 150 flowers. The four features are sepal length and width, and petal length and width. We would like to distill this data down to a lower dimension (2 or 3), and try to determine how many species might be represented by the data.
1. For the given data, de-mean each entry, using column sample means. Perform a PCA on the de-meaned data. Clearly explain how you performed the PCA (submit pseudocode for your own PCA code, or reference and describe any function you called, i.e., which library it comes from, how it works, what it takes as inputs and what it produces as outputs).
2. State what portion of the variance in the data is contained in each of the 4 principal components. From these values, state how many true components are needed to represent the data.
3. On a 2D graph with the horizontal axis given by the first PC and the vertical axis given by the second PC, plot the 2D representation of all the data points (i.e., find the 2D projection of each row). Discuss: (a) are there any visually apparent clusters and, if so, how many, and (b) from this, can you conjecture anything about the number of species represented by the data?
4. Repeat parts 1-3 for standardized data: in addition to de-meaning the entries by column means, you should also scale by the inverse of the sample standard deviation, i.e., for each element $x_{ij}$ in the original matrix $X$, normalize the element to $\tilde{x}_{ij}$, where
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$
with $\bar{x}_j$ the sample mean and $s_j$ the sample standard deviation of column $j$.
Note whether or not this scaling changes the outcome of your analysis.
5. Turn in a report including (a) a matrix with the 4 components for the data and their associated variances, and (b) plots for parts 3 and 4 (i.e., for both the de-meaned and the standardized data sets).
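As an illustration of parts 1, 2 and 4, the following is a minimal sketch of PCA via the singular value decomposition, written in Python with NumPy. It is one possible approach, not a required one. The function name `pca`, the `standardize` flag and the array `X_demo` are hypothetical names introduced here, and the demo data is synthetic (it is not the flower data set).

```python
import numpy as np

def pca(X, standardize=False):
    """PCA via SVD. Rows of X are samples, columns are features."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)                 # de-mean with column sample means
    if standardize:
        Z = Z / X.std(axis=0, ddof=1)      # divide by sample standard deviation s_j
    # Thin SVD: Z = U @ diag(S) @ Vt; the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    variances = S**2 / (Z.shape[0] - 1)    # sample variance along each component
    ratios = variances / variances.sum()   # portion of total variance per component
    scores = Z @ Vt.T                      # projection of each row onto the components
    return Vt.T, variances, ratios, scores

# Synthetic stand-in with the same shape as the assignment data (150 x 4)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4)) @ np.diag([3.0, 1.0, 0.5, 0.1])

components, variances, ratios, scores = pca(X_demo)
scores_2d = scores[:, :2]                  # coordinates for a PC1 vs. PC2 scatter plot

# Same analysis on standardized data (part 4)
components_std, variances_std, ratios_std, scores_std = pca(X_demo, standardize=True)
```

The columns of `components` are the principal directions, and `variances` matches the eigenvalues of the sample covariance matrix of the (optionally standardized) data. Plotting `scores_2d[:, 0]` against `scores_2d[:, 1]` gives the 2D projection asked for in part 3.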
II. Regression analysis
Introduction: Data given to you in Comp1 IE529 contains two vectors, ‘lift kg’ and ‘putt m’, where ‘lift kg(i)’ corresponds to the maximum weight lifted by athlete i, in kilograms, and ‘putt m(i)’ corresponds to the longest shot-put by athlete i, in meters. Your assignment is to use regression methods to determine a model that describes the relationship between the two variables. That is, suppose $x_1$ = ‘lift’ and $x_2$ = ‘putt’; you should find a mathematical model relating $x_1$ and $x_2$, such as
$$x_2 = 10 + 2x_1 + x_1^2 - 0.1x_1^3.$$
The relationship may be linear/affine, polynomial or logistic.
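For concreteness, the three candidate model families can be written out as below. This is only a sketch: the symbols $\beta_k$, $p$, $L$, $r$ and $c$ are names introduced here, and the three-parameter logistic curve is one common parametrization, assumed for illustration rather than prescribed by the assignment.

```latex
\begin{align*}
\text{linear/affine:} \quad & x_2 = \beta_0 + \beta_1 x_1 \\
\text{polynomial of order } p: \quad & x_2 = \sum_{k=0}^{p} \beta_k x_1^{k} \\
\text{logistic:} \quad & x_2 = \frac{L}{1 + e^{-r (x_1 - c)}}
\end{align*}
```

In this notation, the example model given above is the polynomial case with $p = 3$ and $(\beta_0, \beta_1, \beta_2, \beta_3) = (10, 2, 1, -0.1)$.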
1. Write a simple program to compute a least squares solution for the case of linear or polynomial regression, and determine the lowest-order model that fits the data reasonably well (order 1 is linear, and higher orders are polynomial).
2. Consider using an existing logistic regression function in Matlab or Python to determine whether a logistic model fits the data substantially better. Discuss.
3. State which of your candidate models best describes the data, taking into account that simplicity is preferred. Provide plots of a linear fit, one polynomial fit (i.e., second order or higher) and, if possible, one logistic fit. Compute the sum of squared residuals, i.e., give the cost $\|\epsilon_i\|_2^2$, for each of the models plotted.
4. Turn in a report including (a) clearly labeled plots with associated explicit models and costs, (b) a brief discussion explaining your model choice, and (c) your code/function calls.
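As an illustration of part 1 and of the cost in part 3, here is a minimal sketch of polynomial least squares in Python with NumPy. It is one possible approach under stated assumptions, not a reference solution. The function name `polyfit_ls` and the arrays `lift_demo` and `putt_demo` are hypothetical names introduced here, and the demo data is synthetic (it is not the athlete data set).

```python
import numpy as np

def polyfit_ls(x, y, order):
    """Least-squares polynomial fit of the given order.

    Returns the coefficients (constant term first) and the cost,
    i.e. the sum of squared residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.vander(x, order + 1, increasing=True)    # columns: 1, x, x^2, ...
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||y - A @ coeffs||_2
    residuals = y - A @ coeffs
    cost = float(residuals @ residuals)
    return coeffs, cost

# Synthetic stand-in data: a noisy affine relationship
rng = np.random.default_rng(1)
lift_demo = np.linspace(50.0, 150.0, 40)
putt_demo = 2.0 + 0.1 * lift_demo + rng.normal(scale=0.3, size=lift_demo.size)

# Cost for orders 1 to 3; a higher order can only match or lower the cost,
# so the drop in cost has to be weighed against the added complexity
costs = {order: polyfit_ls(lift_demo, putt_demo, order)[1] for order in (1, 2, 3)}
```

Because the cost never increases with the order, a small cost alone does not justify a higher-order model; comparing how much the cost drops from one order to the next is one simple way to pick the lowest order that fits reasonably well.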