Assignment 1

Starting from:

~~$30~~

$24

Commentary on Assignment 1: CPSC 340 is tough because it combines knowledge and skills across several disciplines. To succeed in the course, you will need to know or very quickly get up to speed on:

• Basic Python programming, including NumPy and plotting with matplotlib.

• Math to the level of the course prerequisites: linear algebra, multivariable calculus, some probability.

• Statistics, algorithms and data structures to the level of the course prerequisites.

• Some basic LaTeX and git skills so that you can typeset equations and submit your assignments.

This assignment will help you assess whether you are prepared for this course. We anticipate that each of you will have di↵erent strengths and weaknesses, so don’t be worried if you struggle with some aspects of the assignment. But if you find this assignment to be very diﬃcult overall, that is a warning sign that you may not be prepared to take CPSC 340 at this time. Future assignments will be more diﬃcult than this one (and probably around the same length).

Questions 1-4 are on review material, that we expect you to know coming into the course. The rest is new CPSC 340 material from the first few lectures.

A note on the provided code: in the code directory we provide you with a file called main.py. This file, when run with di↵erent arguments, runs the code for di↵erent parts of the assignment. For example,

python main.py -q 6.2

runs the code for Question 6.2. At present, this should do nothing (throws a NotImplementedError), because the code for Question 6.2 still needs to be written (by you). But we do provide some of the bits and pieces to save you time, so that you can focus on the machine learning aspects. For example, you’ll see that the provided code already loads the datasets for you. The file utils.py contains some helper functions. You don’t need to read or modify the code in there. To complete your assignment, you will need to modify grads.py, main.py, decision_stump.py and simple_decision.py (which you’ll need to create). Make sure to include all your code within your GradeScope pdf submission.

Instructions

Rubric: {mechanics:5}

IMPORTANT!!! Before proceeding, please carefully read the general homework instructions at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission instructions. You can ignore the words “mechanics”, “reasoning”, etc.

We use blue to highlight the deliverables that you must answer/do/submit with the assignment.

1
1 Linear Algebra Review

For these questions you may find it helpful to review these notes on linear algebra:
http://www.cs.ubc.ca/~schmidtm/Documents/2009_Notes_LinearAlgebra.pdf

1.1 Basic Operations

Rubric: {reasoning:7}

Use the definitions below,

4
0
5
4
3
5
4
1
5
4
3
2
2
5

2

5

1

1
1
3

↵ = 2, x = 2
1
3 , y =
2
4
3 , z =
2
2
3, A=
2
1
3
1
3 ,
and use xi to denote element i of vector x. Evaluate the following expressions (you do not need to show your work).

1.
P
n

i=1 xiyi (inner product).
2.

n
xizi (inner product between orthogonal vectors).

P
i=1

3.
↵(x + z) (vector addition and scalar multiplication)
4.
xT z + kxk (inner product in matrix notation and Euclidean norm of x).
5.
Ax (matrix-vector multiplication).
6.
xT Ax (quadratic form).
7.
AT A (matrix tranpose and matrix multiplication).

1.2 Matrix Algebra Rules

Rubric: {reasoning:10}

Assume that {x, y, z} are n ⇥1 column vectors, {A, B, C} are n ⇥n real-valued matrices, 0 is the zero matrix of appropriate size, and I is the identity matrix of appropriate size. State whether each of the below is true in general (you do not need to show your work).
1. xT y = Pn xiyi.
i=1

2. xT x = kxk2.

3. xT x = xxT .

4. (x y)T (x y) = kxk2 2xT y + kyk2.

5. AB = BA.

6. AT(B + IC) = ATB + ATC.

7. (A + BC)T = AT + BTCT.

8. xT Ay = yT AT x.

9. AT A = AAT if A is a symmetric matrix.

10. AT A = 0 if the columns of A are orthonormal.

2
2 Probability Review

For these questions you may find it helpful to review these notes on probability:
http://www.cs.ubc.ca/~schmidtm/Courses/Notes/probability.pdf
And here are some slides giving visual representations of the ideas as well as some simple examples:
http://www.cs.ubc.ca/~schmidtm/Courses/Notes/probabilitySlides.pdf

2.1 Rules of probability

Rubric: {reasoning:6}

Answer the following questions. You do not need to show your work.

1. You are o↵ered the opportunity to play the following game: your opponent rolls 2 regular 6-sided dice. If the di↵erence between the two rolls is at least 3, you win $15. Otherwise, you get nothing. What is a fair price for a ticket to play this game once? In other words, what is the expected value of playing the game?

2. Consider two events A and B such that Pr(A, B) = 0 (they are mutually exclusive). If Pr(A) = 0.4 and Pr(A [ B) = 0.95, what is Pr(B)? Note: p(A, B) means “probability of A and B” while p(A [ B) means “probability of A or B”. It may be helpful to draw a Venn diagram.

3. Instead of assuming that A and B are mutually exclusive (Pr(A, B) = 0), what is the answer to the previous question if we assume that A and B are independent?

2.2 Bayes Rule and Conditional Probability

Rubric: {reasoning:10}

Answer the following questions. You do not need to show your work.

Suppose a drug test produces a positive result with probability 0.97 for drug users, P (T = 1 | D = 1) = 0.97. It also produces a negative result with probability 0.99 for non-drug users, P (T = 0 | D = 0) = 0.99. The probability that a random person uses the drug is 0.0001, so P (D = 1) = 0.0001.

1. What is the probability that a random person would test positive, P (T = 1)?

2. In the above, do most of these positive tests come from true positives or from false positives?

3. What is the probability that a random person who tests positive is a user, P (D = 1 | T = 1)?

4. Suppose you have given this test to a random person and it came back positive, are they likely to be a drug user?

5. What is one factor you could change to make this a more useful test?

3 Calculus Review

3.1 One-variable derivatives

Rubric: {reasoning:8}

Answer the following questions. You do not need to show your work.

3

1.
Find the derivative of the function f(x) = 3x2
2x + 5.
2.
Find the derivative of the function f(x) = x(1
x).

1
for x 2 R. Compute the derivative of the function f(x) = x log(p(x)) and
3.
Let p(x) =

1+exp( x)

simplify it by using the function p(x).

Remember that in this course we will log(x) to mean the “natural” logarithm of x, so that log(exp(1)) = 1.

Also, obseve that p(x) = 1 p( x) for the final part.

3.2 Multi-variable derivatives

Rubric: {reasoning:5}

Compute the gradient vector rf(x) of each of the following functions. You do not need to show your work.

1.
f(x) = x12
+ exp(x1 + 2x2) where x 2 R2.

2.
f(x) = log ⇣

3
where x 2 R3 (simplify the gradient by defining Z =

P
i=1 exp(xi)⌘

3.
f(x) = a
T

3
and a 2 R
3
and b 2 R.

x + b where x 2 R

4.
f(x) =
1
x>Ax where A =

2
1
and x 2 R2.

2

1
2

5.
f(x) =
1
kxk2 where x 2 Rd.

2

P3
i=1 exp(xi)).
Hint: it is helpful to write out the linear algebra expressions in terms of summations.

3.3 Optimization

Find the following quantities. You do not need to show your work. You can/should use your results from parts 3.1 and 3.2 as part of the process.

1. min 3x2 2x + 5, or, in words, the minimum value of the function f(x) = 3x2 2x + 5 for x 2 R.

2. max x(1 x) for x 2 [0, 1].

3. min x(1 x) for x 2 [0, 1].

4. arg max x(1 x) for x 2 [0, 1].

5. min x21 + exp(x2) where x 2 [0, 1]2, or in other words x1 2 [0, 1] and x2 2 [0, 1].

6. arg min x21 + exp(x2) where x 2 [0, 1]2.

Note: the notation x 2 [0, 1] means “x is in the interval [0, 1]”, or, also equivalently, 0 x 1.

Note: the notation “max f(x)” means “the value of f(x) where f(x) is maximized”, whereas “arg max f(x)” means “the value of x such that f(x) is maximized”. Likewise for min and arg min. For example, the min of the function f(x) = (x 1)2 is 0 because the smallest possible value is f(x) = 0, whereas the arg min is 1 because this smallest value occurs at x = 1. The min is always a scalar but the arg min is a value of x, so it’s a vector if x is vector-valued.

4
3.4 Derivatives of code

Rubric: {code:4}

Note: for info on installing and using Python, see https://github.ugrad.cs.ubc.ca/CPSC340-2018W-T1/home/blob/master/python_info.md.

Your repository contains a file named grads.py which defines several Python functions that take in an input variable x, which we assume to be a 1-d array (in math terms, a vector). It also includes (blank) functions that return the corresponding gradients. For each function, write code that computes the gradient of the function in Python. You should do this directly in grads.py; no need to make a fresh copy of the file. When finished, you can run python main.py -q 3.4 to test out your code. Make sure to include the code you have written in your pdf submission for GradeScope.

Hint: it’s probably easiest to first understand on paper what the code is doing, then compute the gradient, and then translate this gradient back into code.

Note: do not worry about the distinction between row vectors and column vectors here. For example, if the correct answer is a vector of length 5, we’ll accept numpy arrays of shape (5,) (a 1-d array) or (5,1) (a column vector) or (1,5) (a row vector). In future assignments we will start to be more careful about this.

Warning: Python uses whitespace instead of curly braces to delimit blocks of code. Some people use tabs and other people use spaces. My text editor (Atom) inserts 4 spaces (rather than tabs) when I press the Tab key, so the file grads.py is indented in this manner. If your text editor inserts tabs, Python will complain and you might get mysterious errors... this is one of the most annoying aspects of Python, especially when starting out. So, please be aware of this issue! And if in doubt you can just manually indent with 4 spaces, or convert everything to tabs. For more information see https://www.youtube.com/watch?v=SsoOG6ZeyUI.

4 Algorithms and Data Structures Review

4.1 Trees

Rubric: {reasoning:2}

Answer the following questions. You do not need to show your work.

1. What is the minimum depth of a binary tree with 64 leaf nodes?

2. What is the minimum depth of binary tree with 64 nodes (includes leaves and all other nodes)?

Note: we’ll use the standard convention that the leaves are not included in the depth, so a tree with depth 1 has 3 nodes with 2 leaves.

4.2 Common Runtimes

Rubric: {reasoning:4}

Answer the following questions using big-O notation You do not need to show your work.

1. What is the cost of finding the largest number in an unsorted list of n numbers?

2. What is the cost of finding the smallest element greater than 0 in a sorted list with n numbers.

3. What is the cost of finding the value associated with a key in a hash table with n numbers? (Assume the values and keys are both scalars.)

5

4. What is the cost of computing the inner product aT x, where a is d ⇥ 1 and x is d ⇥ 1?

5. What is the cost of computing the quadratic form xT Ax when A is d ⇥ d and x is d ⇥ 1.

4.3 Running times of code

Rubric: {reasoning:4}

Your repository contains a file named bigO.py, which defines several functions that take an integer argument N. For each function, state the running time as a function of N, using big-O notation. Please include your answers in your report. Do not write your answers inside bigO.py.

5 Data Exploration

Your repository contains the file fluTrends.csv, which contains estimates of the influenza-like illness per-centage over 52 weeks on 2005-06 by Google Flu Trends. Your main.py loads this data for you and stores it in a pandas DataFrame X, where each row corresponds to a week and each column corresponds to a di↵erent region. If desired, you can convert from a DataFrame to a raw numpy array with X.values().

5.1 Summary Statistics

Rubric: {reasoning:2}

Report the following statistics:

1. The minimum, maximum, mean, median, and mode of all values across the dataset.

2. The 5%, 25%, 50%, 75%, and 95% quantiles of all values across the dataset.

3. The names of the regions with the highest and lowest means, and the highest and lowest variances.

In light of the above, is the mode a reliable estimate of the most “common” value? Describe another way we could give a meaningful “mode” measurement for this (continuous) data. Note: the function utils.mode() will compute the mode value of an array for you.

5.2 Data Visualization

Rubric: {reasoning:3}

Consider the figure below.

6

7
The figure contains the following plots, in a shu✏ed order:

1. A single histogram showing the distribution of each column in X.

2. A histogram showing the distribution of each the values in the matrix X.

3. A boxplot grouping data by weeks, showing the distribution across regions for each week.

4. A plot showing the illness percentages over time.

5. A scatterplot between the two regions with highest correlation.

6. A scatterplot between the two regions with lowest correlation.

Match the plots (labeled A-F) with the descriptions above (labeled 1-6), with an extremely brief (a few words is fine) explanation for each decision.

6 Decision Trees

If you run python main.py -q 6, it will load a dataset containing longitude and latitude data for 400 cities in the US, along with a class label indicating whether they were a “red” state or a “blue” state in the 2012 election.1 Specifically, the first column of the variable X contains the longitude and the second variable contains the latitude, while the variable y is set to 0 for blue states and 1 for red states. After it loads the data, it plots the data and then fits two simple classifiers: a classifier that always predicts the most common label (0 in this case) and a decision stump that discretizes the features (by rounding to the nearest integer) and then finds the best equality-based rule (i.e., check if a feature is equal to some value). It reports the training error with these two classifiers, then plots the decision areas made by the decision stump. The plot is shown below:

As you can see, it is just checking whether the latitude equals 35 and, if so, predicting red (Republican).

This is not a very good classifier.

1 The cities data was sampled from http://simplemaps.com/static/demos/resources/us-cities/cities.csv. The election information was collected from Wikipedia.

8
6.1 Splitting rule

Rubric: {reasoning:1}

Is there a particular type of features for which it makes sense to use an equality-based splitting rule rather than the threshold-based splits we discussed in class?

6.2 Decision Stump Implementation

Rubric: {code:3}

The file decision_stump.py contains the class DecisionStumpEquality which finds the best decision stump using the equality rule and then makes predictions using that rule. Instead of discretizing the data and using a rule based on testing an equality for a single feature, we want to check whether a feature is above or below a threshold and split the data accordingly (this is a more sane approach, which we discussed in class). Create a DecisionStumpErrorRate class to do this, and report the updated error you obtain by using inequalities instead of discretizing and testing equality. Make sure to include the code you have written in your pdf GradeScope submission. Also submit the generated figure of the classification boundary.

Hint: you may want to start by copy/pasting the contents DecisionStumpEquality and then make modifi-cations from there.

6.3 Decision Stump Info Gain Implementation

Rubric: {code:3}

In decision_stump.py, create a DecisionStumpInfoGain class that fits using the information gain criterion discussed in lecture. Make sure to include the code you have written in your pdf GradeScope submission. Report the updated error you obtain, and submit the classification boundary figure.

Notice how the error rate changed. Are you surprised? If so, hang on until the end of this question!

Note: even though this data set only has 2 classes (red and blue), your implementation should work for any number of classes, just like DecisionStumpEquality and DecisionStumpErrorRate.

Hint: take a look at the documentation for np.bincount, at https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html. The minlength ar-gument comes in handy here to deal with a tricky corner case: when you consider a split, you might not have any examples of a certain class, like class 1, going to one side of the split. Thus, when you call np.bincount, you’ll get a shorter array by default, which is not what you want. Setting minlength to the number of classes solves this problem.

6.4 Constructing Decision Trees

Rubric: {code:2}

Once your DecisionStumpInfoGain class is finished, running python main.py -q 6.4 will fit a decision tree of depth 2 to the same dataset (which results in a lower training error). Look at how the decision tree is stored and how the (recursive) predict function works. Using the splits from the fitted depth-2 decision tree, write a hard-coded version of the predict function that classifies one example using simple if/else statements (see the Decision Trees lecture). Make sure to include the code you have written in your pdf GradeScope submission.

9

Note: this code should implement the specific, fixed decision tree which was learned after calling fit on this particular data set. It does not need to be a learnable model. You should just hard-code the split values directly into the code. Only the predict function is needed.

Hint: if you plot the decision boundary you can do a visual sanity check to see if your code is consistent with the plot.

6.5 Decision Tree Training Error

Rubric: {reasoning:2}

Running python main.py -q 6.5 fits decision trees of di↵erent depths using the following di↵erent imple-mentations:

1. A decision tree using DecisionStump

2. A decision tree using DecisionStumpInfoGain

3. The DecisionTreeClassifier from the popular Python ML library scikit-learn

Run the code and look at the figure. Describe what you observe. Can you explain the results? Why is approach (1) so disappointing? Also, submit a classification boundary plot of the model with the lowest training error.

Note: we set the random_state because sklearn’s DecisionTreeClassifier is non-deterministic. This is probably because it breaks ties randomly.

Note: the code also prints out the amount of time spent. You’ll notice that sklearn’s implementation is substantially faster. This is because our implementation is based on the O(n2d) decision stump learning algorithm and sklearn’s implementation presumably uses the faster O(nd log n) decision stump learning algorithm that we discussed in lecture.

6.6 Comparing implementations

Rubric: {reasoning:2}

In the previous section you compared di↵erent implementations of a machine learning algorithm. Let’s say that two approaches produce the exact same curve of classification error rate vs. tree depth. Does this conclusively demonstrate that the two implementations are the same? If so, why? If not, what other experiment might you perform to build confidence that the implementations are probably equivalent?

6.7 Cost of Fitting Decision Trees

Rubric: {reasoning:3}

In class, we discussed how in general the decision stump minimizing the classification error can be found in O(nd log n) time. Using the greedy recursive splitting procedure, what is the total cost of fitting a decision tree of depth m in terms of n, d, and m?

Hint: even thought there could be (2m 1) decision stumps, keep in mind not every stump will need to go through every example. Note also that we stop growing the decision tree if a node has no examples, so we may not even need to do anything for many of the (2m 1) decision stumps.

10

More products

$6.00 OFF

Programming II Assignment 3: Calculator Solution

$35

$29

Buy now

$6.00 OFF

Programming II Assignment 2: Collections Solution

$30

$24

Buy now

$6.00 OFF

Assignment 4 Solution

$30

$24

Buy now