Answer each of the following questions using the data sets Data1, Data2, and Data3, which are available on the ECE 485 web site under the Assignments tab.
You are encouraged to use Matlab in doing the ECE 485 assignments (use 'help stats' or 'doc stats' at the Matlab command prompt to get the details on Matlab's available statistical functions).
1. Using the $\chi^2$ goodness-of-fit test, determine whether the data sets contained in the files Data1 and Data2 can be reasonably modeled by a Gaussian distribution at an $\alpha = 0.05$ significance level.
Plot the histogram for each data set.
Overlay the best-fit Gaussian on this histogram plot.
Provide the results of the $\chi^2$ goodness-of-fit test for each data set, including its computed p-value.
For the data set(s) that fail the $\chi^2$ goodness-of-fit test for a Gaussian $p(x)$, determine which distribution $p(x)$ does provide a reasonable model for the data.
That is, use the $\chi^2$ goodness-of-fit test to determine statistically what other distribution could be used to model the data.
Hint: the shape of the data's histogram can provide a good indication of which analytical $p(x)$'s you should test.
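As a rough illustration of the test described above, the following Python sketch (standard library only) fits a Gaussian by the sample mean and standard deviation, bins the data into equal-probability bins, and reports the $\chi^2$ statistic, the degrees of freedom, and the p-value. The function names, the choice of 10 bins, and the synthetic samples are assumptions made for this example; they are not taken from Data1 or Data2, and the sketch is not a prescribed solution.

```python
import bisect
import math
import random
from statistics import NormalDist, fmean, stdev


def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) for a chi-square variable with `dof` degrees of freedom."""
    a, x = dof / 2.0, x / 2.0
    if x <= 0.0:
        return 1.0
    scale = math.exp(-x + a * math.log(x) - math.lgamma(a))
    if x < a + 1.0:
        # Series expansion of the regularized lower incomplete gamma function.
        term = total = 1.0 / a
        n = a
        for _ in range(10000):
            n += 1.0
            term *= x / n
            total += term
            if term < total * 1e-16:
                break
        return max(0.0, 1.0 - total * scale)
    # Continued fraction (modified Lentz) for the regularized upper incomplete gamma function.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * scale


def gaussian_chi2_gof(data, n_bins=10):
    """Chi-square goodness-of-fit test of `data` against a Gaussian fitted to it."""
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))
    # Equal-probability bins under the fitted Gaussian: each expects n / n_bins samples.
    edges = [fitted.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for value in data:
        observed[bisect.bisect_right(edges, value)] += 1
    expected = n / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 1 - 2  # two parameters (mean, std) were estimated from the data
    return statistic, dof, chi2_sf(statistic, dof)


# Synthetic stand-ins for the real data files (assumption: not the course data).
rng = random.Random(485)
samples = {
    "gaussian": [rng.gauss(0.0, 1.0) for _ in range(2000)],
    "uniform": [rng.random() for _ in range(2000)],
}
for name, sample in samples.items():
    stat, dof, p = gaussian_chi2_gof(sample)
    verdict = "reject Gaussian" if p < 0.05 else "do not reject Gaussian"
    print(f"{name}: chi2 = {stat:.2f}, dof = {dof}, p = {p:.4g} -> {verdict}")
```

Testing a different candidate distribution follows the same pattern: replace the fitted Gaussian's inverse CDF with that of the candidate, and subtract the number of parameters estimated for that candidate when computing the degrees of freedom.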
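2. Write a function to generate $N$ random data samples with mean $\mu = [\mu_{x_1} \; \mu_{x_2}]^T$ and covariance
\[
\Sigma = \begin{bmatrix} \sigma_{x_1}^2 & \rho\,\sigma_{x_1}\sigma_{x_2} \\ \rho\,\sigma_{x_2}\sigma_{x_1} & \sigma_{x_2}^2 \end{bmatrix}
\]
Use this function to generate three sets of data, all with $N = 1000$, $\mu = [5, 5]^T$, $\sigma_{x_1} = 2$ and $\sigma_{x_2} = 1$, with $\rho = 0.8$, $\rho = 0.2$ and $\rho = 0.9$.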
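One common way to build such a generator is to colour independent standard normal samples with the Cholesky factor of the covariance matrix. The Python sketch below writes the factor out by hand for the 2 x 2 case; the function names and the seed are illustrative assumptions, and a Matlab version would follow the same steps.

```python
import math
import random


def generate_gaussian_2d(n, mean, sigma1, sigma2, rho, rng=None):
    """Draw n samples from N(mean, Sigma), with Sigma built from sigma1, sigma2 and rho."""
    rng = rng or random.Random()
    # Cholesky factor L of Sigma (Sigma = L L^T), written out for the 2x2 case.
    l11 = sigma1
    l21 = rho * sigma2
    l22 = sigma2 * math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return samples


def sample_mean_cov(samples):
    """Sample mean and unbiased sample covariance of a list of (x1, x2) pairs."""
    n = len(samples)
    m1 = sum(x for x, _ in samples) / n
    m2 = sum(y for _, y in samples) / n
    c11 = sum((x - m1) ** 2 for x, _ in samples) / (n - 1)
    c22 = sum((y - m2) ** 2 for _, y in samples) / (n - 1)
    c12 = sum((x - m1) * (y - m2) for x, y in samples) / (n - 1)
    return (m1, m2), ((c11, c12), (c12, c22))


# The three parameter sets from the question; the seed is an arbitrary choice.
rng = random.Random(2)
for rho in (0.8, 0.2, 0.9):
    data = generate_gaussian_2d(1000, (5.0, 5.0), 2.0, 1.0, rho, rng)
    mean_hat, cov_hat = sample_mean_cov(data)
    print(f"rho = {rho}: mean = {mean_hat}, cov = {cov_hat}")
```

Comparing the printed sample statistics against the requested $\mu$ and $\Sigma$ is a quick sanity check before plotting.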
Produce a 2-D scatter plot for each generated data set.
On these scatter plots overlay the eigenvectors of $\Sigma$ and draw the 1-, 2-, and 3-$\sigma$ ellipses that are associated with the generated data.
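The geometry behind this step can be sketched as follows: the eigenvectors of $\Sigma$ give the ellipse axes, and the $k$-$\sigma$ ellipse has semi-axis lengths $k\sqrt{\lambda_1}$ and $k\sqrt{\lambda_2}$. The Python sketch below uses the closed-form eigen-decomposition of a symmetric 2 x 2 matrix and returns ellipse coordinates that can be handed to any plotting routine. The function names are assumptions made for this example.

```python
import math


def eig_sym_2x2(cov):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    (a, b), (_, d) = cov
    half_trace = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # orientation of the major axis
    v_major = (math.cos(theta), math.sin(theta))
    v_minor = (-math.sin(theta), math.cos(theta))
    return (half_trace + radius, half_trace - radius), (v_major, v_minor)


def sigma_ellipse(mean, cov, k, n_points=100):
    """Points on the k-sigma ellipse of N(mean, cov), as a closed polyline."""
    (lam1, lam2), (v1, v2) = eig_sym_2x2(cov)
    r1 = k * math.sqrt(lam1)
    r2 = k * math.sqrt(max(lam2, 0.0))  # guard against tiny negative round-off
    points = []
    for i in range(n_points + 1):
        t = 2.0 * math.pi * i / n_points
        ct, st = r1 * math.cos(t), r2 * math.sin(t)
        points.append((mean[0] + ct * v1[0] + st * v2[0],
                       mean[1] + ct * v1[1] + st * v2[1]))
    return points


# Example with sigma_x1 = 2, sigma_x2 = 1, rho = 0.8 and mean [5, 5].
cov = ((4.0, 1.6), (1.6, 1.0))
eigenvalues, eigenvectors = eig_sym_2x2(cov)
print("eigenvalues:", eigenvalues)
print("eigenvectors:", eigenvectors)
ellipses = {k: sigma_ellipse((5.0, 5.0), cov, k) for k in (1, 2, 3)}
```

Every point on the $k$-$\sigma$ ellipse lies at Mahalanobis distance $k$ from the mean, which gives a convenient check on the drawing routine.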
3. The data in the file Data3 belong to 3 classes, with the third column of the file denoting which class each data item belongs to.
For each class estimate its mean and covariance, assuming that the data within each class follow $p(x) \sim N(\mu, \Sigma)$.
For each of the following unclassified data points, use the Mahalanobis distance to assign each of the points $x_1$, $x_2$, $x_3$ and $x_4$ to one of the three classes.
$x_1 = [10, 2]$, $x_2 = [3, 4]$, $x_3 = [2, 2]$, and $x_4 = [5 \; 7]$
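The estimation and assignment steps above can be sketched in Python as follows. Per-class means and covariances are estimated from labelled rows, and a point is assigned to the class with the smallest Mahalanobis distance $\sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}$. The function names and the class statistics in the usage lines are hypothetical placeholders chosen for illustration; they are not estimates from Data3.

```python
import math


def estimate_class_stats(rows):
    """rows: iterable of (x1, x2, label). Returns {label: (mean, covariance)}."""
    groups = {}
    for x1, x2, label in rows:
        groups.setdefault(label, []).append((x1, x2))
    stats = {}
    for label, pts in groups.items():
        n = len(pts)
        m1 = sum(p[0] for p in pts) / n
        m2 = sum(p[1] for p in pts) / n
        c11 = sum((p[0] - m1) ** 2 for p in pts) / (n - 1)
        c22 = sum((p[1] - m2) ** 2 for p in pts) / (n - 1)
        c12 = sum((p[0] - m1) * (p[1] - m2) for p in pts) / (n - 1)
        stats[label] = ((m1, m2), ((c11, c12), (c12, c22)))
    return stats


def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a 2-D point x from N(mean, cov)."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse.
    return math.sqrt((d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det)


def classify(x, class_stats):
    """Label of the class whose Mahalanobis distance to x is smallest."""
    return min(class_stats, key=lambda label: mahalanobis(x, *class_stats[label]))


# Hypothetical class statistics (placeholders, NOT estimated from Data3).
toy_stats = {
    1: ((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),
    2: ((10.0, 0.0), ((25.0, 0.0), (0.0, 1.0))),
}
point = (4.0, 0.0)
for label, (mean, cov) in toy_stats.items():
    print(label, mahalanobis(point, mean, cov))
print("assigned class:", classify(point, toy_stats))
```

In this toy case the point is closer to class 1 in Euclidean terms but is assigned to class 2, because class 2 has a much larger variance along the first axis. That is the behaviour the Mahalanobis distance is meant to capture.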
Provide scatter plots of each of the three clusters (all on the same plot but in different colours), place the $x_1$ through $x_4$ data points on this plot, and use your eigenvector and ellipse drawing routine from Question 2 to draw the principal axes and 1-, 2-, and 3-$\sigma$ ellipses associated with each pattern class.
How would your estimations of the three classes' statistics change if you did not have a priori knowledge of the class labels?
That is, suppose you were given the file Data3 with only its first two columns and not its third column.
This denotes the distinction between supervised learning (with column 3) and unsupervised learning (without column 3).
For each pair of classes plot (on the same plot as generated from 3(c) above) the 2-class decision boundaries defined by where $p_{x_i}(x) = p_{x_j}(x)$, for $i \neq j$ and $i, j \in \{1, 2, 3\}$, and provide the formulas for these decision boundaries.
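As a starting point for the boundary formulas, and assuming each class density is the Gaussian $N(\mu_i, \Sigma_i)$ estimated earlier, equating two densities and taking logarithms cancels the common $2\pi$ factor and leaves a quadratic equation in $x$. The derivation below is a sketch under that assumption; the symbols $\mu_i$ and $\Sigma_i$ denote the per-class estimates.

```latex
\begin{align*}
p_{x_i}(x) &= p_{x_j}(x) \\
(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \ln|\Sigma_i|
  &= (x-\mu_j)^T \Sigma_j^{-1} (x-\mu_j) + \ln|\Sigma_j| \\
x^T\!\left(\Sigma_i^{-1} - \Sigma_j^{-1}\right)x
  - 2\left(\Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j\right)^{T} x
  + \mu_i^T \Sigma_i^{-1} \mu_i - \mu_j^T \Sigma_j^{-1} \mu_j
  + \ln\frac{|\Sigma_i|}{|\Sigma_j|} &= 0
\end{align*}
```

When $\Sigma_i = \Sigma_j$ the quadratic term vanishes and the boundary reduces to a straight line; otherwise it is a conic section (ellipse, parabola, or hyperbola).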