Problem Set 4 Solution

Implementation of SVM via Gradient Descent (30 points)












Here, you will implement the soft margin SVM using different gradient descent methods, as described in Section 12.3.4 of the textbook. To recap, to estimate the parameters w, b of the soft margin SVM, we can minimize the cost:

$$f(w, b) = \frac{1}{2} \sum_{j=1}^{d} \left(w^{(j)}\right)^2 + C \sum_{i=1}^{n} \max\left\{0,\; 1 - y_i \left(\sum_{j=1}^{d} w^{(j)} x_i^{(j)} + b\right)\right\}. \qquad (1)$$
In order to minimize the function, we first obtain the gradient with respect to the j-th item $w^{(j)}$ of the vector w, as follows:














$$\nabla_{w^{(j)}} f(w, b) = \frac{\partial f(w, b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i=1}^{n} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}},$$

where:

$$\frac{\partial L(x_i, y_i)}{\partial w^{(j)}} =
\begin{cases}
0 & \text{if } y_i\,(x_i \cdot w + b) \ge 1 \\
-\,y_i\, x_i^{(j)} & \text{otherwise.}
\end{cases} \qquad (2)$$
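As a point of reference only (this helper is not part of the assignment handout), equations (1) and (2) can be evaluated with a few lines of NumPy; the array names X, y, w, b and the constant C are illustrative assumptions:

import numpy as np

def svm_cost(X, y, w, b, C):
    # Equation (1): 0.5 * ||w||^2 plus C times the sum of the hinge losses
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(hinge)

def svm_grad_w(X, y, w, b, C):
    # Equation (2): samples violating the margin contribute -y_i * x_i^(j);
    # the analogous gradient with respect to b is what part (a) below asks for.
    violated = (y * (X @ w + b) < 1).astype(float)
    return w + C * (X.T @ (-y * violated))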
Now, we will implement and compare the following gradient descent techniques:




Batch gradient descent: Iterate through the entire dataset and update the parameters as follows:



k = 0




while convergence criteria not reached do




for j = 1, ..., d do
    Update w^(j) ← w^(j) − η ∇_{w^(j)} f(w, b)
end for
Update b ← b − η ∇_b f(w, b)
Update k ← k + 1



end while




where,




n is the number of samples in the training data, d is the dimension of w,









η is the learning rate of the gradient descent, and

$\nabla_{w^{(j)}} f(w, b)$ is the value computed from equation (2) above, and $\nabla_b f(w, b)$ is the value computed from your answer in question (a) below.




The convergence criterion for the above algorithm is $\Delta_{\%cost} < \varepsilon$, where


$$\Delta_{\%cost} = \frac{\left|f_{k-1}(w, b) - f_k(w, b)\right| \times 100}{f_{k-1}(w, b)}. \qquad (3)$$












where,

$f_k(w, b)$ is the value of equation (1) at the k-th iteration,




$\Delta_{\%cost}$ is computed at the end of each iteration of the while loop. Initialize w = 0, b = 0 and compute $f_0(w, b)$ with these values. For this method, use η = 0.0000003 and ε = 0.25.
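A minimal sketch of the batch loop, assuming the svm_cost and svm_grad_w helpers sketched above and a hypothetical svm_grad_b implementing your answer to part (a); eta and eps stand for η and ε:

import numpy as np

def batch_gd(X, y, C=100.0, eta=0.0000003, eps=0.25):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    costs = [svm_cost(X, y, w, b, C)]                  # f_0(w, b)
    while True:
        gw = svm_grad_w(X, y, w, b, C)                 # equation (2)
        gb = svm_grad_b(X, y, w, b, C)                 # hypothetical: your part (a) answer
        w, b = w - eta * gw, b - eta * gb              # update using the previous iterate only
        costs.append(svm_cost(X, y, w, b, C))
        pct_cost = abs(costs[-2] - costs[-1]) * 100.0 / costs[-2]   # equation (3)
        if pct_cost < eps:
            return w, b, costs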




Stochastic gradient descent: Go through the dataset and update the parameters, one training sample at a time, as follows:



Randomly shuffle the training data




i = 1; k = 0




while convergence criteria not reached do




for j = 1, ..., d do
    Update w^(j) ← w^(j) − η ∇_{w^(j)} f_i(w, b)
end for
Update b ← b − η ∇_b f_i(w, b)
Update i ← (i mod n) + 1
Update k ← k + 1



end while




where,




n is the number of samples in the training data, d is the dimension of w,




η is the learning rate, and

$\nabla_{w^{(j)}} f_i(w, b)$ is defined for a single training sample as follows:

$$\nabla_{w^{(j)}} f_i(w, b) = \frac{\partial f_i(w, b)}{\partial w^{(j)}} = w^{(j)} + C\, \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}.$$





(Note that you will also have to derive $\nabla_b f_i(w, b)$, but it should be similar to your solution to question (a) below.)

The convergence criterion here is $\Delta cost_k < \varepsilon$, where




$$\Delta cost_k = 0.5\, \Delta cost_{k-1} + 0.5\, \Delta_{\%cost},$$




where,




k = iteration number, and

$\Delta_{\%cost}$ is the same as above (equation 3).

Calculate $\Delta cost_k$ and $\Delta_{\%cost}$ at the end of each iteration of the while loop.

Initialize $\Delta cost_0 = 0$, w = 0, b = 0 and compute $f_0(w, b)$ with these values.

For this method, use η = 0.0001 and ε = 0.001.
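A corresponding sketch of the stochastic variant, under the same assumptions as above (svm_cost from the earlier sketch; sgd_grad_b is a hypothetical name for the per-sample analogue of your part (a) answer):

import numpy as np

def sgd(X, y, C=100.0, eta=0.0001, eps=0.001):
    rng = np.random.default_rng(0)
    order = rng.permutation(len(y))                    # randomly shuffle the training data
    X, y = X[order], y[order]
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    i, delta = 0, 0.0                                  # i is 0-based here; the text uses 1-based
    costs = [svm_cost(X, y, w, b, C)]                  # f_0(w, b)
    while True:
        xi, yi = X[i], y[i]
        violated = float(yi * (xi @ w + b) < 1)
        gw = w + C * (-yi * xi) * violated             # per-sample form of equation (2)
        gb = sgd_grad_b(xi, yi, w, b, C)               # hypothetical: per-sample part (a) analogue
        w, b = w - eta * gw, b - eta * gb
        i = (i + 1) % n
        costs.append(svm_cost(X, y, w, b, C))
        pct_cost = abs(costs[-2] - costs[-1]) * 100.0 / costs[-2]
        delta = 0.5 * delta + 0.5 * pct_cost           # Δcost_k = 0.5 Δcost_{k-1} + 0.5 Δ%cost
        if delta < eps:
            return w, b, costs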












Mini batch gradient descent: Go through the dataset in batches of predetermined size and update the parameters as follows:



Randomly shuffle the training data




l = 0; k = 0




while convergence criteria not reached do




for j = 1, ..., d do
    Update w^(j) ← w^(j) − η ∇_{w^(j)} f_l(w, b)
end for
Update b ← b − η ∇_b f_l(w, b)
Update l ← (l + 1) mod ((n + batch_size − 1) / batch_size)
Update k ← k + 1



end while




where,




n is the number of samples in the training data, d is the dimension of w,




η is the learning rate,




batch_size is the number of training samples considered in each batch, and $\nabla_{w^{(j)}} f_l(w, b)$ is defined for a batch of training samples as follows:






$$\nabla_{w^{(j)}} f_l(w, b) = \frac{\partial f_l(w, b)}{\partial w^{(j)}} = w^{(j)} + C \sum_{i = l \cdot batch\_size + 1}^{\min(n,\ (l+1) \cdot batch\_size)} \frac{\partial L(x_i, y_i)}{\partial w^{(j)}}.$$












The convergence criterion is $\Delta cost_k < \varepsilon$, where














$$\Delta cost_k = 0.5\, \Delta cost_{k-1} + 0.5\, \Delta_{\%cost},$$




k = iteration number,

and $\Delta_{\%cost}$ is the same as above (equation 3).

Calculate $\Delta cost_k$ and $\Delta_{\%cost}$ at the end of each iteration of the while loop.

Initialize $\Delta cost_0 = 0$, w = 0, b = 0 and compute $f_0(w, b)$ with these values.

For this method, use η = 0.00001, ε = 0.01, and batch_size = 20.
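For the mini-batch variant, only the gradient computation changes; below is a sketch of the per-batch gradient with respect to w, under the same assumed names as in the earlier sketches (0-based batch index l):

import numpy as np

def minibatch_grad_w(X, y, w, b, C, l, batch_size=20):
    # Batch l covers samples l*batch_size, ..., min(n, (l+1)*batch_size) - 1 (0-based)
    Xb = X[l * batch_size:(l + 1) * batch_size]
    yb = y[l * batch_size:(l + 1) * batch_size]
    violated = (yb * (Xb @ w + b) < 1).astype(float)
    return w + C * (Xb.T @ (-yb * violated))

# The batch index cycles as l <- (l + 1) % ((n + batch_size - 1) // batch_size),
# i.e. modulo the number of batches per pass over the data.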










(a) [5 Points]




Notice that we have not given you the equation for $\nabla_b f(w, b)$.




Task: What is $\nabla_b f(w, b)$, as used in the Batch Gradient Descent algorithm?




(Hint: It should be very similar to $\nabla_{w^{(j)}} f(w, b)$.)







(b) [25 Points]




Task: Implement the SVM algorithm for all of the above-mentioned gradient descent techniques. For this problem, you are allowed to keep the dataset in memory, and you do not need to use Spark.











Use C = 100 for all the techniques. For all other parameters, use the values specified in the description of the technique. Note: update w in iteration i + 1 using the values computed in iteration i. Do not update using values computed in the current iteration!




Run your implementation on the data set in q1/data. The data set contains the following files:







features.txt : Each line contains features (comma-separated values) for a single datapoint. It has 6414 datapoints (rows) and 122 features (columns).



target.txt : Each line contains the target variable (y = -1 or 1) for the corresponding row in features.txt.



Task: Plot the value of the cost function $f_k(w, b)$ vs. the number of iterations (k). Report the total time taken for convergence by each of the gradient descent techniques. What do you infer from the plots and the time for convergence?




The figure should contain the graphs from all three techniques on the same plot.




As a sanity check, Batch GD should converge in 10-300 iterations and SGD in 500-3000 iterations, with Mini Batch GD somewhere in between. However, the number of iterations may vary greatly due to randomness. If your implementation consistently takes longer though, you may have a bug.







What to submit




Equation for $\nabla_b f(w, b)$. [part (a)]



Plot of $f_k(w, b)$ vs. the number of updates (k). Total time taken for convergence by each of the gradient descent techniques. Interpretation of plot and convergence times. [part (b)]



Submit the code on Gradescope submission website. [part (b)]






Decision Tree Learning (20 points)



In this problem, we want to construct a decision tree to find out if a person will enjoy beer.







Definitions. Let there be k binary-valued attributes in the data.




We pick an attribute that maximizes the gain at each node:






$$G = I(D) - \left(I(D_L) + I(D_R)\right), \qquad (4)$$







where D is the given dataset, and $D_L$ and $D_R$ are the sets on the left and right-hand-side branches after division. Ties may be broken arbitrarily.




There are three commonly used impurity measures for binary decision trees: Entropy, Gini index, and Classification Error. In this problem, we use the Gini index and define I(D) as follows¹:

$$I(D) = |D| \left(1 - \sum_i p_i^2\right),$$




where:




|D| is the number of items in D;




$1 - \sum_i p_i^2$ is the Gini index;




$p_i$ is the probability distribution of the items in D; in other words, $p_i$ is the fraction of items that take value $i \in \{+, -\}$. Put differently, $p_+$ is the fraction of positive items and $p_-$ is the fraction of negative items in D.







Note that, intuitively, the more evenly distributed the numbers are, the lower $\sum_i p_i^2$ is, and the larger the impurity.
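To make the definitions concrete, here is a small illustrative helper (not part of the handout; the function names are arbitrary) that computes I(D) from positive/negative counts and the gain G of equation (4):

def gini_impurity(pos, neg):
    # I(D) = |D| * (1 - sum_i p_i^2), for the two classes {+, -}
    size = pos + neg
    if size == 0:
        return 0.0
    p_pos, p_neg = pos / size, neg / size
    return size * (1.0 - (p_pos ** 2 + p_neg ** 2))

def gain(parent, left, right):
    # G = I(D) - (I(D_L) + I(D_R)); each argument is a (pos, neg) pair of counts
    return gini_impurity(*parent) - (gini_impurity(*left) + gini_impurity(*right))

# Footnote example: 10 items, 4 positive and 6 negative:
# gini_impurity(4, 6) == 10 * (1 - (0.16 + 0.36)) == 4.8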

(a) [10 Points]




Let k = 3. We have three binary attributes that we could use: "likes wine", "likes running", and "likes pizza". Suppose the following:




There are 100 people in the sample set, 40 of whom like beer and 60 who don't.




Out of the 100 people, 50 like wine; out of those 50 people who like wine, 20 like beer.

Out of the 100 people, 30 like running; out of those 30 people who like running, 20 like beer.




Out of the 100 people, 80 like pizza; out of those 80 people who like pizza, 30 like beer.







Task: What are the values of G (defined in Equation 4) for the wine, running, and pizza attributes? Which attribute would you use to split the data at the root if you were to maximize the gain G using the Gini index metric defined above?







¹As an example, if D has 10 items, with 4 positive items (i.e., 4 people who enjoy beer) and 6 negative items (i.e., 6 who do not), we have I(D) = 10 × (1 − (0.16 + 0.36)).







(b) [10 Points]




Let’s consider the following example:




There are 100 attributes with binary values $a_1, a_2, a_3, \ldots, a_{100}$.




Let there be one example corresponding to each possible assignment of 0's and 1's to the values $a_1, a_2, a_3, \ldots, a_{100}$. (Note that this gives us $2^{100}$ training examples.)




Let the values taken by the target variable y depend on the values of $a_1$ for 99% of the datapoints. More specifically, of all the datapoints where $a_1 = 1$, let 99% of them be labeled +. Similarly, of all the datapoints where $a_1 = 0$, let 99% of them be labeled −. (Assume that the values taken by y depend on $a_2, a_3, \ldots, a_{100}$ for fewer than 99% of the datapoints.)




Assume that we build a complete binary decision tree (i.e., we use the values of all attributes).







Task: Explain what the decision tree will look like. (A one-line explanation will suffice.) Also, in 2-3 sentences, identify what the desired decision tree for this situation should look like to avoid overfitting, and why. (The desired decision tree isn't necessarily a complete binary decision tree.)







What to submit




Values of G for wine, running and pizza attributes. [part (a)]



The attribute you would use for splitting the data at the root. [part (a)]



Explain what the decision tree looks like in the described setting. Explain what a decision tree should look like to avoid overfitting. (1-2 lines each) [part (b)]






Clustering Data Streams (20 points)






Introduction. In this problem, we study an approach for clustering massive data streams. We will study a framework for turning an approximate clustering algorithm into one that can work on data streams, i.e., one which needs a small amount of memory and a small number of (actually, just one) passes over the data. As the instance of the clustering problem, we will focus on the k-means problem.







Definitions. Before going into further details, we need some definitions:




The function $d : \mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}_+$ denotes the Euclidean distance:




$$d(x, y) = \|x - y\|_2.$$




For any $x \in \mathbb{R}^p$ and $T \subseteq \mathbb{R}^p$, we define:




$$d(x, T) = \min_{z \in T}\{d(x, z)\}.$$




Having subsets $S, T \subseteq \mathbb{R}^p$, and a weight function $w : S \rightarrow \mathbb{R}_+$, we define:

$$cost_w(S, T) = \sum_{x \in S} w(x)\, d(x, T)^2.$$




Finally, if for all $x \in S$ we have w(x) = 1, we simply denote $cost_w(S, T)$ by cost(S, T).
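As a small illustration of these definitions (not part of the handout; S and T are assumed to be NumPy arrays whose rows are points and centers, and w an optional array of weights):

import numpy as np

def dist_to_set(x, T):
    # d(x, T) = min over z in T of the Euclidean distance d(x, z)
    return np.min(np.linalg.norm(T - x, axis=1))

def weighted_cost(S, T, w=None):
    # cost_w(S, T) = sum over x in S of w(x) * d(x, T)^2; w(x) = 1 recovers cost(S, T)
    if w is None:
        w = np.ones(len(S))
    return float(sum(wi * dist_to_set(x, T) ** 2 for x, wi in zip(S, w)))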







Reminder: k-means clustering. The k-means clustering problem is as follows: given a subset $S \subseteq \mathbb{R}^p$ and an integer k, find the set T (with |T| = k) which minimizes cost(S, T). If a weight function w is also given, the k-means objective would be to minimize $cost_w(S, T)$, and we call the problem the weighted k-means problem.







Strategy for clustering data streams. We assume we have an algorithm alg which is an α-approximate weighted k-means clustering algorithm (for some α ≥ 1). In other words, given any $S \subseteq \mathbb{R}^p$, $k \in \mathbb{N}$, and a weight function w, alg returns a set $T \subseteq \mathbb{R}^p$, |T| = k, such that:

$$cost_w(S, T) \le \alpha \min_{|T'| = k}\{cost_w(S, T')\}.$$




We will see how we can use alg as a building block to make an algorithm for the k-means problem on data streams.




The basic idea here is that of divide and conquer: if S is a huge set that does not fit into main memory, we can read a portion of it that does fit into memory, solve the problem on this subset (i.e., do a clustering on this subset), record the result (i.e., the cluster centers and some corresponding weights, as we will see), and then read a next portion of S which is again small enough to fit into memory, solve the problem on this part, record the result, etc. At the end, we will have to combine the results of the partial problems to construct a solution for the main big problem (i.e., clustering S).




To formalize this idea, we consider the following algorithm, which we denote as algstr:




Partition S into ℓ parts $S_1, \ldots, S_\ell$.




For each i = 1 to ℓ, run alg on $S_i$ to get a set of k centers $T_i = \{t_{i1}, t_{i2}, \ldots, t_{ik}\}$, and assume $\{S_{i1}, S_{i2}, \ldots, S_{ik}\}$ is the corresponding clustering of $S_i$ (i.e., $S_{ij} = \{x \in S_i \mid d(x, t_{ij}) < d(x, t_{ij'})\ \forall j' \ne j,\ 1 \le j' \le k\}$).








Let $\hat{S} = \bigcup_{i=1}^{\ell} T_i$, and define weights $w(t_{ij}) = |S_{ij}|$. Run alg on $\hat{S}$ with weights w to get k centers T. Return T.
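A rough sketch of algstr in Python, assuming some α-approximate weighted k-means routine alg(points, k, weights) is available (its name and signature are placeholders, not something provided with this problem):

import numpy as np

def algstr(chunks, k, alg):
    # chunks: an iterable of in-memory parts S_1, ..., S_l of the stream S
    # alg(points, k, weights): returns k centers (alpha-approximate weighted k-means)
    centers, weights = [], []
    for S_i in chunks:
        S_i = np.asarray(S_i)
        T_i = np.asarray(alg(S_i, k, np.ones(len(S_i))))       # cluster the i-th part
        # assign each point of S_i to its nearest center to get the cluster sizes |S_ij|
        assign = np.argmin(
            np.linalg.norm(S_i[:, None, :] - T_i[None, :, :], axis=2), axis=1)
        for j, t in enumerate(T_i):
            centers.append(t)
            weights.append(float(np.sum(assign == j)))
        # S_i can now be discarded; only k centers and k weights per part are kept
    S_hat = np.asarray(centers)
    return alg(S_hat, k, np.asarray(weights))                  # the final k centers T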




Now, we analyze this algorithm. Assuming $T^* = \{t^*_1, \ldots, t^*_k\}$ to be the optimal k-means solution for S (that is, $T^* = \arg\min_{|T'| = k}\{cost(S, T')\}$), we would like to compare cost(S, T) (where T is returned by algstr) with $cost(S, T^*)$.




A small fact might be useful in the analysis below: for any $a, b \in \mathbb{R}_+$ we have:




$$(a + b)^2 \le 2a^2 + 2b^2.$$







(a) [5pts]




First, we show that the cost of the final clustering can be bounded in terms of the total cost of the intermediate clusterings:




Task: Prove that:




$$cost(S, T) \le 2\, cost_w(\hat{S}, T) + 2 \sum_{i=1}^{\ell} cost(S_i, T_i).$$



Hint: You might want to use the triangle inequality for the Euclidean distance d.







(b) [5pts]




So, to bound the cost of the final clustering, we can bound the terms on the right-hand side of the inequality in part (a). Intuitively speaking, we expect the second term to be small compared to $cost(S, T^*)$, because $T^*$ only uses k centers to represent the dataset S, while the $T_i$'s, in total, use kℓ centers to represent the same dataset (and kℓ is potentially much bigger than k). We show this formally:




Task: Prove that:



$$\sum_{i=1}^{\ell} cost(S_i, T_i) \le \alpha\, cost(S, T^*).$$







(c) [10pt]




Prove that algstr is a $(4\alpha^2 + 6\alpha)$-approximation algorithm for the k-means problem.




Task: Prove that:

$$cost(S, T) \le (4\alpha^2 + 6\alpha)\, cost(S, T^*).$$











Hint: You might want to first prove two useful facts, which help bound the first term on the right-hand side of the inequality in part (a):




$$cost_w(\hat{S}, T) \le \alpha\, cost_w(\hat{S}, T^*).$$






$$cost_w(\hat{S}, T^*) \le 2 \sum_{i=1}^{\ell} cost(S_i, T_i) + 2\, cost(S, T^*).$$




Additional notes: We have shown above that algstr is a $(4\alpha^2 + 6\alpha)$-approximation algorithm for the k-means problem. Clearly, $4\alpha^2 + 6\alpha \ge \alpha$, so algstr has a somewhat worse approximation guarantee than alg (with which we started). However, algstr is better suited for the streaming application, as it not only takes just one pass over the data, but also needs a much smaller amount of memory.




Assuming that alg needs Θ(n) memory to work on an input set S of size n (note that just representing S in memory will need Θ(n) space), if we partition S into $\sqrt{n/k}$ equal parts, algstr can work with only $O(\sqrt{nk})$ memory. (Like in the rest of the problem, k represents the number of clusters per partition.)

Note that for typical values of n and k, assuming k ≪ n, we have $\sqrt{nk} \ll n$. For instance, with $n = 10^6$ and k = 100, we have $\sqrt{nk} = 10^4$, which is 100 times smaller than n.

What to submit










(a) Proof that $cost(S, T) \le 2\, cost_w(\hat{S}, T) + 2 \sum_{i=1}^{\ell} cost(S_i, T_i)$.

(b) Proof that $\sum_{i=1}^{\ell} cost(S_i, T_i) \le \alpha\, cost(S, T^*)$.

(c) Proof that $cost(S, T) \le (4\alpha^2 + 6\alpha)\, cost(S, T^*)$.






Data Streams (30 points)






In this problem, we study an approach to approximating the frequency of occurrences of different items in a data stream. Assume $S = \langle a_1, a_2, \ldots, a_t \rangle$ is a data stream of items from the set $\{1, 2, \ldots, n\}$. Assume for any $1 \le i \le n$, F[i] is the number of times i has appeared in S. We would like to have good approximations of the values F[i] ($1 \le i \le n$) at all times.




A simple way to do this is to just keep the counts for each item $1 \le i \le n$ separately. However, this will require O(n) space, and in many applications (e.g., think of online advertising and counts of users' clicks on ads) this can be prohibitively large. We see in this problem that it is possible to approximate these counts using a much smaller amount of space. To do so, we consider the algorithm explained below.



















Strategy. The algorithm has two parameters ε, δ > 0. It picks $\lceil \log \frac{1}{\delta} \rceil$ independent hash functions:

$$\forall j \in \left[1, \left\lceil \log \tfrac{1}{\delta} \right\rceil\right], \quad h_j : \{1, 2, \ldots, n\} \rightarrow \left\{1, 2, \ldots, \left\lceil \tfrac{e}{\varepsilon} \right\rceil\right\},$$

where log denotes the natural logarithm. Also, it associates a count $c_{j,x}$ to any $1 \le j \le \lceil \log \frac{1}{\delta} \rceil$ and $1 \le x \le \lceil \frac{e}{\varepsilon} \rceil$. In the beginning of the stream, all these counts are initialized to 0. Then, upon arrival of each $a_k$ ($1 \le k \le t$), each of the counts $c_{j, h_j(a_k)}$ ($1 \le j \le \lceil \log \frac{1}{\delta} \rceil$) is incremented by 1.

For any $1 \le i \le n$, we define $\tilde{F}[i] = \min_j \{c_{j, h_j(i)}\}$. We will show that $\tilde{F}[i]$ provides a good approximation to F[i].




Memory cost. Note that this algorithm only uses $O\left(\frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$ space.
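A minimal sketch of the data structure described above (illustrative only, not starter code; the hash_funcs argument is assumed to be a list of functions h_j mapping item IDs to bucket indices 0, ..., ⌈e/ε⌉ − 1):

import math

class CountMinSketch:
    def __init__(self, eps, delta, hash_funcs):
        self.n_buckets = math.ceil(math.e / eps)           # buckets per hash function
        n_rows = math.ceil(math.log(1.0 / delta))          # number of hash functions needed
        self.hash_funcs = list(hash_funcs)[:n_rows]
        self.counts = [[0] * self.n_buckets for _ in self.hash_funcs]

    def update(self, item):
        # Upon arrival of a_k, increment c_{j, h_j(a_k)} for every j
        for j, h in enumerate(self.hash_funcs):
            self.counts[j][h(item)] += 1

    def estimate(self, item):
        # F~[i] = min_j c_{j, h_j(i)}
        return min(self.counts[j][h(item)] for j, h in enumerate(self.hash_funcs))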







Properties. A few important properties of the algorithm presented above:




For any $1 \le i \le n$:




$$\tilde{F}[i] \ge F[i].$$




For any $1 \le i \le n$ and $1 \le j \le \lceil \log(\frac{1}{\delta}) \rceil$:


















$$E\left[c_{j, h_j(i)}\right] \le F[i] + \frac{\varepsilon}{e}\,(t - F[i]).$$



(a) [10 Points]




Prove that:

$$\Pr\left[\tilde{F}[i] \le F[i] + \varepsilon t\right] \ge 1 - \delta.$$

Hint: Use the Markov inequality and the property of independence of the hash functions.




Based on the proof in part (a) and the properties presented earlier, it can be inferred that $\tilde{F}[i]$ is a good approximation of F[i] for any item i such that F[i] is not very small (compared to t). In many applications (e.g., when the values F[i] have a heavy-tail distribution), we are indeed only interested in approximating the frequencies for items which are not too infrequent. We next consider one such application.







(b) [20 Points]




Warning. This implementation question requires substantial computation time; a Python implementation has been reported to take between 15 minutes and 1 hour. Therefore, we advise you to start early.









Dataset. The dataset in q4/data contains the following files:







words_stream.txt: Each line of this file is a number, corresponding to the ID of a word in the stream.



counts.txt: Each line is a pair of numbers separated by a tab. The first number is the ID of a word and the second number is its associated exact frequency count in the stream.



words_stream_tiny.txt and counts_tiny.txt are smaller versions of the dataset above that you can use for debugging your implementation.



hash_params.txt: Each line is a pair of numbers separated by a tab, corresponding to parameters a and b which you may use to define your own hash functions (see explanation below).









Instructions. Implement the algorithm and run it on the dataset with parameters $\delta = e^{-5}$ and $\varepsilon = e \times 10^{-4}$. (Note: with this choice of δ you will be using 5 hash functions; the 5 pairs (a, b) that you'll need for the hash functions are in hash_params.txt.) Then, for each distinct word i in the dataset, compute the relative error $E_r[i] = \frac{\tilde{F}[i] - F[i]}{F[i]}$ and plot these values as a function of the exact word frequency $\frac{F[i]}{t}$. (You do not have to implement the algorithm in Spark.)







The plot should use a logarithmic scale for both the x and y axes, and there should be ticks to allow reading the powers of 10 (e.g., $10^{-1}$, $10^0$, $10^1$, etc.). The plot should have a title, as well as labels for the x and y axes. The exact frequencies F[i] should be read from the counts file. Note that words of low frequency can have a very large relative error. That is not a bug in your implementation, but just a consequence of the bound we proved in question (a).




Answer the following question by reading values from your plot: What is an approximate condition on a word frequency in the document to have a relative error below $1 = 10^0$?
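A minimal matplotlib sketch of such a plot (illustrative only; freq and rel_err are assumed to be parallel lists of F[i]/t and E_r[i] produced by your implementation):

import matplotlib.pyplot as plt

def plot_relative_error(freq, rel_err):
    # Log-log scatter of relative error vs. exact word frequency
    plt.loglog(freq, rel_err, linestyle="", marker=".", markersize=2)
    plt.title("Relative error of approximate counts vs. exact word frequency")
    plt.xlabel("Exact word frequency F[i]/t")
    plt.ylabel("Relative error E_r[i]")
    plt.show()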







Hash functions. You may use the following hash function (see example pseudo-code), with p = 123457, the a and b values provided in the hash_params.txt file, and n_buckets (which is equivalent to $\lceil \frac{e}{\varepsilon} \rceil$) chosen according to the specification of the algorithm. In the provided file, each line gives you the a, b values to create one hash function.










# Returns hash(x) for the hash function given by parameters a, b, p and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a * y + b) % p
    return hash_val % n_buckets








Note: This hash function implementation produces outputs of value from 0 to (n_buckets − 1), which is different from our specification in the Strategy part. You can either keep the range as {0, ..., n_buckets − 1}, or add 1 to the hash result so the value range becomes {1, ..., n_buckets}, as long as you stay consistent within your implementation.
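For instance (an illustrative sketch only, not starter code; CountMinSketch refers to the earlier sketch and the parameter values follow the instructions above), the provided (a, b) pairs could be turned into hash functions and fed to the sketch as follows:

import math
from functools import partial

p = 123457
eps, delta = math.e * 1e-4, math.exp(-5)
n_buckets = math.ceil(math.e / eps)                   # roughly 10^4 buckets for this eps

# Each line of hash_params.txt holds one "a<tab>b" pair
params = [tuple(map(int, line.split())) for line in open("hash_params.txt")]
hash_funcs = [partial(hash_fun, a, b, p, n_buckets) for a, b in params]

cms = CountMinSketch(eps, delta, hash_funcs)
for line in open("words_stream.txt"):                 # a single pass over the stream
    cms.update(int(line))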







What to submit

(i) Proof that $\Pr\left[\tilde{F}[i] \le F[i] + \varepsilon t\right] \ge 1 - \delta$. [part (a)]



(ii) Log-log plot of the relative error as a function of the frequency, and the answer for which word frequencies the relative error is below 1. [part (b)]



(iii) Submit the code on the Gradescope submission site. [part (b)]
