
Homework #1 Solution

Problems




Data Properties (18 points) [Ruth Okoilu]



[Lecture 2] Answer the following questions about attribute types.




Classify the following attributes as nominal, ordinal, interval, or ratio. Also classify them as binary, discrete, or continuous. If necessary, give a few examples of values that might appear for this attribute to justify your answer. If you make any assumptions in your answer, you must state them explicitly.



Blood group (A, B, AB, O)



Ticket number for raffle draws



Brightness as measured by a light meter



Grade in terms of Pass or Fail



Time zones (EST, PST, CST)



Income earned in a month



Vehicle license plate number



Distance from the center of campus.



Dorm room number



Kelvin temperature



Table 1 is an automobile dataset with 6 attributes. For each attribute, list which of the following statistics/operations can be calculated on that attribute: mode, median, Pearson correlation, mean, standard deviation, z-score normalization, binary discretization (into a "high" and "low" group). If you make any assumptions in your answer, you must state them explicitly.







Table 1: Automobile Dataset

Make         Fuel-type  # of doors  Height  # of Cylinders  Price
alfa-romero  gas        two         48.8    four            $1349
alfa-romero  diesel     four        50.52   six             $16500
alfa-romero  diesel     two         54.3    four            $16500
audi         gas        four        55.7    eight           $13950
audi         diesel     two         70.38   eight           $17450
audi         gas        four        71.74   six             $15250
bmw          diesel     two         55.1    four            $16430
bmw          gas        four        54.3    eight           $116925
bmw          diesel     two         53.33   six             $20970
bmw          diesel     four        53.3    six             $21105



A quiz was conducted to test students' knowledge of data mining. We don't know what questions were on the quiz or how it was graded, but we do know students' scores on a scale of 0 to 5. Give an example of a situation where it would make sense to treat this as an Ordinal attribute. Then give an example of when it would make sense to consider it as a Ratio attribute. Briefly justify each answer.



Data Transformation and Data Quality (12 points) [Ruth Okoilu]



In a medical experiment, 8 patients were subjected to two different treatments. The following results are the systolic blood pressure (SBP) recorded for each patient after each treatment, for patients 1-8, respectively. For example, Patient 1 had an SBP of 160 after treatment A and 300 after treatment B, and so on. NA is used to indicate missing data.




Treatment A: 160, 120, 130, NA, 120, NA, 240, 140




Treatment B: 300, 100, NA, 130, 110, 100, 120, 90




Transform this data into a tabular, record format. Use attributes ID, Patient, Treatment, and SBP for the columns and observations in rows.
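As an illustration of the record format being asked for, the two treatment vectors can be stacked in base R (the variable names below are a sketch, not a required implementation):

```r
# SBP readings for patients 1-8 under each treatment; NA marks missing data
sbp_a <- c(160, 120, 130, NA, 120, NA, 240, 140)
sbp_b <- c(300, 100, NA, 130, 110, 100, 120, 90)

# One row per (patient, treatment) observation, with a running ID
records <- data.frame(
  ID        = 1:16,
  Patient   = rep(1:8, times = 2),
  Treatment = rep(c("A", "B"), each = 8),
  SBP       = c(sbp_a, sbp_b)
)
head(records, 3)
```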



Evaluate the following strategies for dealing with missing data (NA) in the medical experiment above. Give an advantage and a disadvantage of each strategy, and state which you would choose. Briefly justify your answers in terms of the data above.



Strategy 1: Remove the patients with any missing values.



Strategy 2: Estimate the value of missing data for an attribute by taking the average value of other participants for that attribute.
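Strategy 2 can be sketched in base R as follows; this assumes each treatment's NAs are filled from that same treatment's observed mean, which is one reasonable reading of the strategy:

```r
sbp_a <- c(160, 120, 130, NA, 120, NA, 240, 140)

# Replace each NA with the mean of the observed (non-NA) values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
sbp_a_filled <- impute_mean(sbp_a)
# mean of 160, 120, 130, 120, 240, 140 is 910/6, so both NAs become ~151.67
```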



Consider the results of the medical experiments above.



Are the results 240 and 300 noise or outliers? When analyzing the data, what are some strategies to deal with these values? Brie y justify both answers.



The instrument we used to measure SBP will give a reading within +/- 3 of the actual SBP (i.e. if the real SBP is 150, it will give a value between 147-153). Will this inaccuracy create noise or outliers? What are some strategies to deal with this inaccuracy? Brie y justify both answers.



Sampling (7 points) [Ruth Okoilu]



State the sampling method used in the following scenarios and give a reason for your answer. Choose from the following options: simple random sample with replacement, simple random sample without replacement, stratified sampling, progressive/adaptive sampling.





To determine the average salary of professors at NC State University, the faculty were divided into the following groups: instructors, assistant professors, associate professors, and professors. Twenty faculty members from each group were selected for the study.



From the following population, {2, 2, 4, 4, 6, 6, 8}, a sample {2, 2, 2, 6, 8} was collected.



Data is collected in an experiment until a predictive model reaches 90% accuracy.



The U.S. Congress is made up of 2 chambers: 1) a Senate of 100 members, with 2 members from each state, and 2) a House of Representatives of 435 members, with members from each state proportional to that state's population. For example, Alaska has 2 Senators and 1 House representative, while Florida has 2 Senators and 27 House representatives. Both the Senate and the House are conducting surveys of their constituents, which they want to reflect the makeup of each chamber. You suggest that they use stratified sampling for this survey, sending surveys to a certain number of people from each state. Each survey will be sent to 1000 participants.



Why is stratified sampling appropriate here?



For the Senate survey, how many surveys would you recommend sending to people in Alaska?



For the House survey, how many surveys would you recommend sending to people in Florida?



What are some advantages of the "Senate" approach and the "House" approach to stratified sampling?



Dimensionality Reduction (12 points) [Song Ju]



In this problem, you will analyze the PCA results on the iris dataset. Figure 1 shows the Eigenvalue Scree plot and the principal components of the PCA analysis on the raw dataset. The dataset was then normalized using z-scores, and Figure 2 shows the Eigenvalue Scree plot and the principal components of the PCA analysis on the dataset after normalization.
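For reference, both analyses can be reproduced with R's `prcomp`; the contrast between Figure 1 and Figure 2 corresponds to the `scale.` argument (a sketch for exploration, not part of the required submission):

```r
data(iris)
x <- iris[, 1:4]                                      # four numeric features

pca_raw  <- prcomp(x, center = TRUE, scale. = FALSE)  # PCA1: raw data
pca_norm <- prcomp(x, center = TRUE, scale. = TRUE)   # PCA2: z-score normalized

# Scree-plot values: eigenvalues are the squared component sdevs
pca_raw$sdev^2
pca_norm$sdev^2
# Loadings of the first principal component on the raw data
pca_raw$rotation[, 1]
```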










Figure 1: PCA1 on Raw Dataset





Figure 2: PCA2 on Normalized Dataset




Please answer the following questions:




In Figure 1, what is the most reasonable number of principal components to retain? Briefly justify your choice.



In Figure 1, according to the first principal component, what are the feature(s) in the original dataset that explain the most variance?



In Figure 2, what is the most reasonable number of principal components to retain? Briefly justify your choice.



In Figure 2, according to the first principal component, what are the feature(s) in the original dataset that explain the most variance?



Explain the difference between PCA1 and PCA2. Which one would you use for analysis and why?



Based on the results of PCA1 and PCA2, which feature(s) would you like to select if you need to do feature selection? Briefly justify your choice.



Discretization (12 points) [Song Ju]

Consider the following dataset:



No.  OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY
1    sunny     85.0         85.0      FALSE  no
2    sunny     80.0         90.0      TRUE   no
3    overcast  83.0         86.0      FALSE  yes
4    rainy     70.0         96.0      FALSE  yes
5    rainy     68.0         80.0      FALSE  yes
6    rainy     65.0         70.0      TRUE   no
7    overcast  64.0         65.0      TRUE   yes
8    sunny     72.0         95.0      FALSE  no
9    sunny     69.0         70.0      FALSE  yes
10   rainy     75.0         80.0      FALSE  yes
11   sunny     75.0         71.0      TRUE   yes
12   overcast  73.0         89.0      TRUE   yes
13   overcast  81.0         75.0      FALSE  yes
14   rainy     71.0         91.0      TRUE   no
15   sunny     95.0         85.0      FALSE  yes
16   rainy     50.0         45.0      YES    no








Discretize the attribute TEMPERATURE by binning it into 4 equal-width intervals (the range of each interval should be the same). Show your work.
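The mechanics of equal-width binning can be sketched with R's `cut` on a small made-up vector (your answer should apply the same steps to the actual TEMPERATURE column):

```r
v <- c(50, 64, 70, 81, 95)            # toy values; range = 95 - 50 = 45
k <- 4
width <- (max(v) - min(v)) / k        # 45 / 4 = 11.25
breaks <- min(v) + width * 0:k        # 50, 61.25, 72.5, 83.75, 95
bins <- cut(v, breaks = breaks, include.lowest = TRUE)
table(bins)                           # counts per equal-width interval
```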



Discretize the attribute HUMIDITY by binning it into 4 equal-depth intervals (the number of items in each interval should be the same). Show your work.
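Equal-depth binning can be sketched on a toy vector in base R; note that this naive split of the sorted values sidesteps the question of ties at bin boundaries, which your answer should address:

```r
v <- c(3, 7, 1, 9, 4, 8, 2, 6)        # toy values, n = 8
k <- 4
depth <- length(v) / k                # 2 items per bin
# Assign the sorted values, in order, to k bins of equal size
bins <- split(sort(v), rep(1:k, each = depth))
sapply(bins, length)                  # 2 2 2 2
```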



Consider the following new approach to discretizing a numeric attribute: Given the mean (x̄) and the standard deviation (σ) of the attribute values, bin the attribute values into the following intervals:

[x̄ + (k−1)σ, x̄ + kσ),

for all integer values k, i.e. k = ..., −4, −3, −2, −1, 0, 1, 2, ...




Assume that the mean of the attribute HUMIDITY above is x̄ = 80 and that the standard deviation is σ = 13. Discretize HUMIDITY using this new approach. Show your work.
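To see the mechanics: a value v falls in the interval whose index is k = floor((v − mean)/sd) + 1. A sketch under the stated assumptions (mean 80, standard deviation 13):

```r
x_bar <- 80; sigma <- 13

# Interval index k such that v lies in [x_bar + (k-1)*sigma, x_bar + k*sigma)
bin_index <- function(v) floor((v - x_bar) / sigma) + 1

bin_index(80)   # 1: interval [80, 93)
bin_index(93)   # 2: interval [93, 106)
bin_index(70)   # 0: interval [67, 80)
bin_index(45)   # -2: interval [41, 54)
```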




For each of the above discretization approaches, explain its advantages and disadvantages and when you would want to use it.



Distance Metrics (14 points) [Song Ju]



A true distance metric has three properties: a) positive definiteness, b) symmetry, c) triangle inequality. Now consider the following distance functions:



Euclidean distance between two numeric vectors
Manhattan distance between two numeric vectors (L1)



A "divergence function" defined between two sets as d(A, B) = 1 − |A ∩ B| / |A|, representing the fraction of A's items that are not present in B.



Cosine distance between two numeric vectors, defined as 1 minus the cosine similarity: d(A, B) = 1 − (A · B) / (||A|| ||B||)



For each distance function, describe whether it has each property. If so, give a short explanation of why. If not, give a counter example, including two pairs of items, the distance between them, and how it violates the given property.
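As an example of the kind of numeric evidence being asked for, the set "divergence function" can be evaluated in both directions on a small pair of sets (this alone does not settle all three properties):

```r
# d(A, B) = 1 - |A intersect B| / |A|
divergence <- function(a, b) 1 - length(intersect(a, b)) / length(a)

a <- c(1, 2, 3, 4)
b <- c(1, 2)
divergence(a, b)   # 1 - 2/4 = 0.5
divergence(b, a)   # 1 - 2/2 = 0.0
```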




A 1-nearest-neighbor (1-NN) classifier labels a new item y in the test dataset Y by finding the closest item x in the training dataset X, and returning the label of x.


Assume we have a distance function d that is very expensive to calculate for any d(x, y) where x ∈ X and y ∈ Y. However, because we can pre-calculate the distance between any two items in X, d(xi, xj) is relatively cheap to calculate for any xi, xj ∈ X.

To classify a new item y, our 1-NN algorithm will have to make |X| comparisons between y and some xi, since it has to compare y to every item xi ∈ X to find y's closest neighbor. However, if d is a true distance metric, we may be able to reduce the number of comparisons we have to make by skipping some of them.
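For concreteness, the brute-force 1-NN step described above looks like this in R (the names here are illustrative; any pruning strategy you propose would replace the full loop):

```r
# Brute-force 1-NN: compare y against every training item in X
nn_label <- function(y, X, labels, dist_fn) {
  d <- apply(X, 1, function(x) dist_fn(x, y))   # |X| expensive comparisons
  labels[which.min(d)]
}

euclid <- function(p, q) sqrt(sum((p - q)^2))
X <- rbind(c(0, 0), c(5, 5), c(1, 1))
nn_label(c(0.9, 1.2), X, c("red", "blue", "red"), euclid)  # "red"
```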




What property of distance metrics allows us to skip some d(xi, y) comparisons in the 1-NN algorithm?



What strategy could we use to reduce the number of d(xi, y) comparisons? Give one example with values for y, x1, and x2 that illustrates that strategy. (Hint: it may help to draw out the positions of x1, x2, and y in a 2D space.)



Does this strategy reduce the number of d(xi, y) comparisons in the best case? What about the worst case?



Similarity, Dissimilarity and Normalization (25 points) [Krishna Gadiraju]






R Programming Submission Instructions




Make sure you clearly list each team member's name and Unity ID at the top of your submission. Your code should be named hw1.R. Add this file, along with a README, to the zip file mentioned on the first page.




Failure to follow naming conventions or programming-related instructions specified below may result in your submission not being graded.




If the instructions are unclear, please post your questions on Piazza.



Programming related instructions




Carefully read the function names requested by the instructor. In this homework and the following ones, if your code does not follow the naming format requested by the instructor, you will not receive credit.




For each function, both the input and output formats are provided in hw1.R. Function calls are specified in hw1_checker.R. Please ensure that you follow the correct input and output formats. Once again, if you do not follow the format requested, you will not receive credit. The comments in hw1.R clearly state which functions you need to implement.




You are free to write your own functions to handle sub-tasks, but the TA will only call the functions he has requested. If the requested functions do not run, do not return the correct values, or do not finish running in the specified time, you will not receive full credit.




DO NOT set the working directory (setwd function) or clear memory (rm(list=ls(all=T))) in your code. The TA(s) will do so in their own autograder.




The TA will have an autograder which will first run source("hw1.R"), then call each of the functions requested in the homework and compare the results with the correct solution.




Your code should be clearly documented.




To test your code, step through the hw1_checker.R file. If you update your code, make sure to run source('./hw1.R') again to update your function definitions. You can also check the "Source on save" option in RStudio to do this automatically on save.




You can also check your functions manually by running them in the console with smaller vectors. Calculating the distance matrix may take 20-40 seconds. If you want to test on a smaller dataset, you can use the hw1_word_frequency_small.csv file, which contains a small subset of the original data.










Questions




You are given the following dataset(s):




Hotel reviews for a hotel in New York [1]. You are provided with the list of words in the dataset, as well as the frequency count of each word in all the sentences. Each line in hw1_word_frequency.csv represents a 200-element vector (i.e., your vocabulary size, or total number of words in your dataset, is 200) representing a single sentence from your dataset. Each value in the line represents the frequency count of that particular word over the entire document space. If the word does not exist in the sentence, you will have a zero. The corresponding mapping of the words to the array indices is provided in hw1_word_index.json for your reference (but it will not be used in this assignment). In total, there are 155 sentences.







Part 1: Without normalization. Using the data provided in hw1_word_frequency.csv, you will generate a 155 x 155 distance matrix containing the pairwise distances between sentences, using each of the following distance functions. We have already given you the implementation of the function calculate_matrix, which calls each of the distance methods specified below to compute the distance matrix for every pair of sentences. For euclidean, cosine, and manhattan, you are to implement the formulas specified below; the input type is two vectors of the same length, i.e., any two sentences from hw1_word_frequency.csv. For chebyshev, you will use one of the libraries specified below, and the input is the entire hw1_word_frequency.csv as a matrix. The formula for each distance/similarity measure, as well as whether you need to implement it or use a library, is clearly stated below. Use the philentropy R package when using a library for the distance functions, and use the package documentation to find the appropriate function.




(a) euclidean (implement): euclidean(P, Q) = sqrt( sum_i (P_i - Q_i)^2 ), where P and Q are vectors of equal length.

(b) cosine (implement): cosine(P, Q) = ( sum_i P_i * Q_i ) / ( ||P|| ||Q|| ), where P and Q are vectors of equal length, and ||P|| = sqrt( sum_i P_i^2 ).




ADLA - Spring 2019, Homework 1 (1/17/2019)


























(c) manhattan (implement): manhattan(P, Q) = sum_i |P_i - Q_i|, where |x| is the absolute value of x, and P and Q are vectors of equal length.

(d) chebyshev (use library): chebyshev(P, Q) = max_i( |P_i - Q_i| ), where P and Q are vectors of equal length. Please note that you will not use this formula directly for chebyshev, since you will be using the library.
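A sketch of the three "implement" formulas in R; the function names below are illustrative, and your submission must use the exact names and signatures given in hw1.R:

```r
# Each function takes two numeric vectors of equal length
euclidean_dist <- function(p, q) sqrt(sum((p - q)^2))
manhattan_dist <- function(p, q) sum(abs(p - q))
cosine_sim     <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

p <- c(1, 0, 2); q <- c(0, 0, 2)
euclidean_dist(p, q)   # 1
manhattan_dist(p, q)   # 1
cosine_sim(p, q)       # 4 / (sqrt(5) * 2), about 0.894
```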



Part 2: With normalization. After calculating the aforementioned distances using the data from hw1_word_frequency.csv, you will normalize each row in the data matrix to the [0, 1] range using the formula (row - min(row)) / (max(row) - min(row)). You are expected to implement this calculation, not use a library; it must be implemented in the function normalize_data. Then, recompute the euclidean distance using the method you implemented in the previous section. Do you notice any difference between the two distance matrices? In the function analyze_normalization, write some R code to compare the two outcomes. Identify at least 3 properties of the distance matrix that change when using normalized data. In the PDF, list the changes you identified and explain why they occurred. Use both numeric calculations and visualizations to justify your answers. (Hint: use the hist function and the provided plot_distance_matrix function.)
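The row-wise min-max rescaling can be sketched as follows (your submission must implement it in normalize_data with the signature given in hw1.R; note a constant row would divide by zero, which you may need to handle):

```r
# Rescale each row of a matrix to [0, 1]: (row - min(row)) / (max(row) - min(row))
normalize_rows <- function(m) {
  t(apply(m, 1, function(row) (row - min(row)) / (max(row) - min(row))))
}

m <- rbind(c(1, 3, 5), c(10, 20, 40))
normalize_rows(m)
# row 1 -> 0.0  0.5  1.0
# row 2 -> 0.0  1/3  1.0
```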







hw1.R has already been provided for you, with the function definitions. Complete all the functions requested in hw1.R. Please note that hw1_checker.R is only for you to understand how the TA will run your files. DO NOT submit hw1_checker.R. Also, please note that the TA may be using a dataset different from yours (but with the same dimensions, i.e., 155 sentences, each of length 200), so do not hard-code your solutions.







Also, it is recommended that you read up on vectorized operations in R. Any submission that takes more than 5 minutes to run on a standard university machine (32 GB RAM, i7 processor) will receive a zero grade. Also, please ensure that all the libraries are correctly loaded using the require function.




Allowed Packages: R Base, utils, plyr, philentropy, data.table, reshape and ggplot2. No other packages are allowed.




References




[1] K. Ganesan and C. Zhai, "Opinion-based entity ranking," Information Retrieval, vol. 15, no. 2, pp. 116-150, 2012.




























































































