
Assignment 1 Solution




1 Decision Trees [20 pts]




1.1 Decision Trees and Boolean Expressions [10 pts]




A boolean function can be represented by a decision tree, i.e., the decision tree equivalent to that boolean function will output 1 if the expression is satisfied, and 0 otherwise. Please write down a decision tree that is equivalent to the following:




(C ∧ ¬A ∧ ¬B ∧ ¬E) ∨ (C ∧ A ∧ D) ∨ (¬C ∧ B)
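
This expression is easiest to check mechanically. Below is a small Python sketch (the function name f is ours) that prints the truth table of the expression; a candidate decision tree can be validated against all 32 rows.

```python
from itertools import product

# f is the Boolean expression (C AND NOT A AND NOT B AND NOT E)
#                          OR (C AND A AND D)
#                          OR (NOT C AND B).
def f(A, B, C, D, E):
    return int((C and not A and not B and not E) or (C and A and D) or (not C and B))

# Print the full truth table; a correct decision tree must agree on every row.
for A, B, C, D, E in product((0, 1), repeat=5):
    print(A, B, C, D, E, "->", f(A, B, C, D, E))
```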




1.2 Train a Decision Tree [10 pts]




Below is a dataset that records the survival of passengers on the Titanic based on three attributes: Class (C), Gender (G), and Age (A). Train a depth-1 decision tree (i.e., a decision tree with only 1 decision node) using information gain as the criterion for splitting, where IG(S, x) = H(S) - H(S | x) and x ∈ {C, G, A}. Which attribute is at the root of the tree? Show all your work (i.e., include all intermediate steps) in calculating the information gain for each attribute. Round to 4 decimal places in all calculations, and use log base 2.






Class   Gender   Age      No    Yes   Total
1st     Male     Child     0      5       5
1st     Male     Adult   118     57     175
1st     Female   Child     0      1       1
1st     Female   Adult     4    140     144
Lower   Male     Child    35     24      59
Lower   Male     Adult  1211    281    1492
Lower   Female   Child    17     27      44
Lower   Female   Adult   105    176     281
Total:                  1490    711    2201






















Class     No    Yes   Total
1st      122    203     325
Lower   1368    508    1876

Gender    No    Yes   Total
Male    1364    367    1731
Female   126    344     470

Age       No    Yes   Total
Child     52     57     109
Adult   1438    654    2092
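
As a sanity check on the hand calculations, the following Python sketch computes H(S) and IG(S, x) for each attribute directly from the marginal counts above (the dictionary layout is ours):

```python
import math

def H(counts):
    """Entropy (base 2) of a collection of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# (No, Yes) counts per attribute value, taken from the marginal tables above.
splits = {
    "Class":  [(122, 203), (1368, 508)],    # 1st, Lower
    "Gender": [(1364, 367), (126, 344)],    # Male, Female
    "Age":    [(52, 57), (1438, 654)],      # Child, Adult
}

total = 2201
H_S = H([1490, 711])                        # entropy of the whole dataset
for attr, groups in splits.items():
    H_cond = sum(sum(g) / total * H(g) for g in groups)   # H(S | attr)
    print(attr, "IG =", round(H_S - H_cond, 4))
```

The attribute with the largest information gain is placed at the root.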


















































2 K-nearest neighbor [30 pts]




Implement k-nearest neighbours from scratch using any programming language of your choice. Download the dataset (knn-dataset.zip) posted on the course web page, which contains the training data and labels for each fold. Classify each input x according to the most frequent class amongst its k nearest neighbours, as measured by the Euclidean distance (L2-norm). Break ties at random. Test the algorithm by 10-fold cross validation.
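
A minimal NumPy sketch of the classifier core under the specification above (Euclidean distance, random tie-breaking); the array names and shapes are our assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def knn_predict(X_train, y_train, x, k):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # L2 distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    winners = labels[votes == votes.max()]        # classes tied for the most votes
    return rng.choice(winners)                    # break ties at random
```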




What to hand in:




Your code for K-nearest neighbor and cross validation




Find the best k by 10-fold cross validation. Draw a graph that shows the accuracy as k increases from 1 to 30.
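
Building on the knn_predict sketch above, one possible cross-validation and plotting loop (the folds structure is an assumption; how it is loaded depends on the file layout inside knn-dataset.zip):

```python
import numpy as np
import matplotlib.pyplot as plt

def cv_accuracy(folds, k):
    """10-fold CV: each fold serves once as the validation set."""
    accs = []
    for i in range(len(folds)):
        X_val, y_val = folds[i]
        X_tr = np.vstack([folds[j][0] for j in range(len(folds)) if j != i])
        y_tr = np.concatenate([folds[j][1] for j in range(len(folds)) if j != i])
        preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_val])
        accs.append(np.mean(preds == y_val))
    return np.mean(accs)

# folds = [(X_1, y_1), ..., (X_10, y_10)] as provided in knn-dataset.zip
ks = list(range(1, 31))
plt.plot(ks, [cv_accuracy(folds, k) for k in ks])
plt.xlabel("k"); plt.ylabel("10-fold CV accuracy"); plt.show()
```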


































3 Perceptron [20 pts]




Consider a threshold perceptron that predicts y = 1 when w^T x + w0 ≥ 0 and y = 0 when w^T x + w0 < 0. It is interesting to study the class of Boolean functions that can be represented by a threshold perceptron.




Assume that the input space is X = {0, 1}^2 and the output space is Y = {0, 1}. For each of the following Boolean functions, indicate whether it is possible to encode the function as a threshold perceptron. If it is possible, indicate some values for w and w0. If it is not possible, indicate a feature mapping φ : X → X̂ together with values for w and w0 such that w^T φ(x) + w0 is a linear separator that encodes the function. (A quick sanity check for candidate weights is sketched after the list below.)







and




or




exclusive-or




iff
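
As an illustration of the expected form of answer (these are not the only valid choices), AND can be encoded directly with w = (1, 1) and w0 = -1.5, while exclusive-or is not linearly separable in the original inputs but becomes separable under the feature map φ(x1, x2) = (x1, x2, x1·x2) with w = (1, 1, -2) and w0 = -0.5. The snippet below verifies both claims on all four inputs:

```python
# Verify two candidate encodings on every input in {0, 1}^2:
#   AND: predict 1 iff  x1 + x2 - 1.5 >= 0
#   XOR: predict 1 iff  x1 + x2 - 2*x1*x2 - 0.5 >= 0   (uses phi(x) = (x1, x2, x1*x2))
for x1 in (0, 1):
    for x2 in (0, 1):
        and_pred = int(x1 + x2 - 1.5 >= 0)
        xor_pred = int(x1 + x2 - 2 * x1 * x2 - 0.5 >= 0)
        print(f"x = ({x1}, {x2})  AND -> {and_pred}  XOR -> {xor_pred}")
```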




4 Empirical Evaluation [30 pts]




In this exercise, we will be working with the Breast Cancer Wisconsin (Diagnostic) Dataset (WDBC) from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). The dataset consists of 569 samples of biopsied tissue. The tissue for each sample is imaged and 10 characteristics of the nuclei of cells present in each image are characterized, including




radius (mean of distances from center to points on the perimeter)

texture (standard deviation of gray-scale values)

perimeter

area

smoothness (local variation in radius lengths)

compactness (perimeter^2 / area - 1.0)

concavity (severity of concave portions of the contour)

concave points (number of concave portions of the contour)

symmetry

fractal dimension (“coastline approximation” - 1)




Each of the 569 samples in the dataset consists of a feature vector of length 30. The first 10 entries in this feature vector are the means of the characteristics listed above for each image, the second 10 are the standard deviations, and the last 10 are the largest values of each of these characteristics present in each image. Each sample is also associated with a classification label, which is M for malignant and B for benign.




The dataset has already been divided into a training set (wdbc-train.csv) and test set (wdbc-test.csv), which you can download from the Assignments page on the course website.




Use your favourite machine learning package and train (a) a decision tree and (b) a kNN classifier on the WDBC dataset, using all 30 features. Choose the best k by 10-fold cross validation. Make sure you shuffle the training data randomly before splitting it into groups to be used for cross validation.
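
One possible scikit-learn sketch (the package choice is ours, and the CSV layout, with the label in the first column, is an assumption):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, GridSearchCV

train = pd.read_csv("wdbc-train.csv")        # assumed: label column first, then 30 features
X = train.iloc[:, 1:].values
y = (train.iloc[:, 0] == "M").astype(int).values   # M (malignant) is the positive class

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # shuffle before splitting
tree = DecisionTreeClassifier(random_state=0)
knn = GridSearchCV(KNeighborsClassifier(),
                   {"n_neighbors": list(range(1, 31))}, cv=cv)   # pick the best k
knn.fit(X, y)
print("best k:", knn.best_params_["n_neighbors"])
```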



Report the training performance using 10-fold cross validation, in terms of precision, recall, accuracy, sensitivity and specificity. Assume that M is the positive class.



Report the testing performance, in terms of precision, recall, accuracy, sensitivity and specificity. Assume that M is the positive class.
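
Continuing the sketch above, the requested metrics can be computed as follows; scikit-learn has no specificity scorer, so it is read off the confusion matrix (note that sensitivity coincides with recall):

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

def report(y_true, y_pred):
    """Print precision, recall/sensitivity, accuracy, and specificity."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("precision  ", precision_score(y_true, y_pred))
    print("recall/sens", recall_score(y_true, y_pred))
    print("accuracy   ", accuracy_score(y_true, y_pred))
    print("specificity", tn / (tn + fp))

# Training performance via 10-fold CV, then testing performance.
report(y, cross_val_predict(tree, X, y, cv=cv))
test = pd.read_csv("wdbc-test.csv")
X_te = test.iloc[:, 1:].values
y_te = (test.iloc[:, 0] == "M").astype(int).values
report(y_te, tree.fit(X, y).predict(X_te))
```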



Suppose you were to recommend an algorithm to a conservative doctor who is afraid of making a mistake in diagnosis. Which classifier would you recommend? Explain why.

















