1. Spark (25 pts)
Write a Spark program that implements a simple "People You Might Know" social network friendship recommendation algorithm. The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other.
Data:
The associated data file is soc-LiveJournal1Adj.txt in q1/data.
The file contains the adjacency list and has multiple lines in the following format: <User><TAB><Friends>
Here, <User> is a unique integer ID corresponding to a unique user, and <Friends> is a comma-separated list of unique IDs corresponding to the friends of the user with the unique ID <User>. Note that the friendships are mutual (i.e., edges are undirected): if A is a friend of B, then B is also a friend of A. The data provided is consistent with that rule, as there is an explicit entry for each side of each edge.
Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U, but have the largest number of mutual friends in common with U.
Output:
The output should contain one line per user in the following format: <User><TAB><Recommendations>
where <User> is a unique ID corresponding to a user and <Recommendations> is a comma-separated list of unique IDs corresponding to the algorithm's recommendations of people that <User> might know, ordered in decreasing number of mutual friends.
Even if a user has fewer than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends. If a user has no friends, you can provide an empty list of recommendations. If there are recommended users with the same number of mutual friends, output those user IDs in numerically ascending order.
Pipeline sketch: Please provide a description of how you used Spark to solve this problem. Don’t write more than 3 to 4 sentences for this: we only want a very high-level description of your strategy to tackle this problem.
Tips:
Use Google Colab to run Spark seamlessly, e.g., copy and adapt the setup cells from Colab 0.
Before submitting a complete application to Spark, you may go line by line, checking the outputs of each step. The command .take(X) should be helpful if you want to check the first X elements in the RDD.
As a sanity check, your top 10 recommendations for user ID 11 should be: 27552,7785,27573,27574,27589,27590,27600,27617,27620,27667.
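To make the expected shape of a solution concrete, here is a minimal PySpark sketch of one possible pipeline. This is only one strategy, not the required one; all function and variable names are illustrative, and users with no candidates, as well as output formatting, are left unhandled.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def parse(line):
    # "<User><TAB><Friends>" -> (user_id, [friend_ids])
    user, _, friends = line.partition("\t")
    return int(user), [int(f) for f in friends.split(",") if f]

adj = sc.textFile("soc-LiveJournal1Adj.txt").map(parse)

def candidate_pairs(record):
    # Every pair of U's friends shares U as a mutual friend (marker 1);
    # (U, friend) pairs are existing edges that must be filtered out (marker 0).
    user, friends = record
    existing = [((user, f), 0) for f in friends]
    mutual = [((a, b), 1) for a in friends for b in friends if a != b]
    return existing + mutual

counts = (adj.flatMap(candidate_pairs)
             .groupByKey()
             .filter(lambda kv: 0 not in kv[1])  # drop already-friends pairs
             .mapValues(sum))                    # number of mutual friends

recommendations = (counts
    .map(lambda kv: (kv[0][0], (kv[1], kv[0][1])))  # (user, (count, candidate))
    .groupByKey()
    # most mutual friends first; ties broken by ascending candidate ID
    .mapValues(lambda cs: [cand for _, cand in
                           sorted(cs, key=lambda x: (-x[0], x[1]))[:10]]))
```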
What to submit
Upload your code on Gradescope.
Include in your writeup a short paragraph sketching your Spark pipeline.
Include in your writeup the recommendations for the users with following user IDs: 924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.
2. Association Rules (30 pts)
Association Rules are frequently used for Market Basket Analysis (MBA) by retailers to understand the purchase behavior of their customers. This information can then be used for many different purposes, such as cross-selling and up-selling of products, sales promotions, loyalty programs, store design, discount plans, and many others.
Evaluation of item sets: Once you have found the frequent itemsets of a dataset, you need to choose a subset of them as your recommendations. Commonly used metrics for measuring significance and interest for selecting rules for recommendations are:
Confidence (denoted as conf(A → B)): Confidence is defined as the probability of occurrence of B in the basket if the basket already contains A:

$$\text{conf}(A \rightarrow B) = \Pr(B \mid A),$$

where Pr(B | A) is the conditional probability of finding item set B given that item set A is present.
Lift (denoted as lift(A → B)): Lift measures how much more "A and B occur together" than "what would be expected if A and B were statistically independent":

$$\text{lift}(A \rightarrow B) = \frac{\text{conf}(A \rightarrow B)}{S(B)/N},$$

where S(B) = Support(B) and N = total number of transactions (baskets).
Conviction (denoted as conv(A → B)): Conviction compares the "probability that A appears without B if they were independent" with the "actual frequency of the appearance of A without B":

$$\text{conv}(A \rightarrow B) = \frac{1 - S(B)/N}{1 - \text{conf}(A \rightarrow B)}.$$
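As a quick numeric illustration of the three measures (toy counts chosen for this example, not taken from the assignment data):

```python
# Toy counts: N baskets total; n_A contain A, n_B contain B, n_AB contain both.
N, n_A, n_B, n_AB = 100, 20, 50, 15

conf = n_AB / n_A                   # Pr(B | A) = 15/20 = 0.75
lift = conf / (n_B / N)             # 0.75 / 0.50 = 1.5
conv = (1 - n_B / N) / (1 - conf)   # 0.50 / 0.25 = 2.0
```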
(a) [3pts]
A drawback of using confidence is that it ignores Pr(B). Why is this a drawback? Explain why lift and conviction do not suffer from this drawback.
(b) [3pts]
A measure is symmetrical if measure(A → B) = measure(B → A). Which of the measures presented here are symmetrical? For each measure, please provide either a proof that the measure is symmetrical, or a counterexample that shows the measure is not symmetrical.
(c) [4pts]
Perfect implications are rules that hold 100% of the time (or equivalently, the associated conditional probability is 1). A measure is desirable if it reaches its maximum achievable value for all perfect implications. This makes it easy to identify the best rules. Which of the above measures have this property? You may ignore 0/0 cases but not other infinity cases. Also, you may find it easy to explain with an example.
Application in product recommendations: The action or practice of selling additional products or services to existing customers is called cross-selling. Giving product recommendations is one example of cross-selling frequently used by online retailers. One simple method to give product recommendations is to recommend products that are frequently browsed together by the customers.
Suppose we want to recommend new products to a customer based on the products they have already browsed online. Write a program using the A-priori algorithm to find products which are frequently browsed together. Fix the support to s = 100 (i.e., product pairs need to occur together at least 100 times to be considered frequent) and find itemsets of size 2 and 3.
Use the online browsing behavior dataset from browsing.txt in q2/data. Each line represents a browsing session of a customer. On each line, each string of 8 characters represents the ID of an item browsed during that session. The items are separated by spaces.
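A minimal plain-Python sketch of the first two A-priori passes for pairs follows. The file name comes from the assignment; everything else (SUPPORT, sessions, and the helper conf) is illustrative, and this is a sketch rather than a complete solution.

```python
from collections import Counter
from itertools import combinations

SUPPORT = 100
sessions = [line.split() for line in open("browsing.txt")]

# Pass 1: count individual items; keep the frequent ones (L1).
item_counts = Counter(item for s in sessions for item in s)
L1 = {item for item, c in item_counts.items() if c >= SUPPORT}

# Pass 2: count only pairs whose items are both frequent.
pair_counts = Counter()
for s in sessions:
    frequent = sorted(set(s) & L1)   # dedupe and order within the session
    pair_counts.update(combinations(frequent, 2))
L2 = {pair: c for pair, c in pair_counts.items() if c >= SUPPORT}

def conf(x, y):
    # Confidence of X => Y; assumes the pair {x, y} is frequent (in L2).
    return L2[tuple(sorted((x, y)))] / item_counts[x]
```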
Note: for parts (d) and (e), the writeup will require a specific rule ordering, but the program need not sort the output. We will not give partial credit for code when the results are wrong.
However, two sanity checks are provided, and they should be helpful as you progress: (1) there are 647 frequent items after the first pass (|L1| = 647); (2) the top 5 pairs you should produce in part (d) all have confidence scores greater than 0.985. See the detailed instructions below.
(d) [10pts]
Identify pairs of items (X, Y) such that the support of {X, Y} is at least 100. For all such pairs, compute the confidence scores of the corresponding association rules: X ⇒ Y and Y ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup. Break ties, if any, by lexicographically increasing order on the left-hand side of the rule. (You need not use Spark for parts (d) and (e) of question 2.)
(e) [10pts]
Identify item triples (X, Y, Z) such that the support of {X, Y, Z} is at least 100. For all such triples, compute the confidence scores of the corresponding association rules: (X, Y) ⇒ Z, (X, Z) ⇒ Y, and (Y, Z) ⇒ X. Sort the rules in decreasing order of confidence scores and list the top 5 rules in the writeup. Order the left-hand-side pair lexicographically, and break ties, if any, by lexicographical order of the first, then the second, item in the pair.
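Triples follow the same pattern as pairs. Continuing the illustrative sketch given after the dataset description above (it reuses sessions, L1, L2, pair_counts, SUPPORT, Counter, and combinations from that sketch):

```python
# Pass 3: count triples all of whose constituent pairs are frequent.
triple_counts = Counter()
for s in sessions:
    frequent = sorted(set(s) & L1)
    for t in combinations(frequent, 3):
        if all(p in L2 for p in combinations(t, 2)):
            triple_counts[t] += 1
L3 = {t: c for t, c in triple_counts.items() if c >= SUPPORT}

def conf3(x, y, z):
    # Confidence of (X, Y) => Z; assumes the triple {x, y, z} is in L3.
    return L3[tuple(sorted((x, y, z)))] / pair_counts[tuple(sorted((x, y)))]
```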
What to submit
Upload all the code on Gradescope and include the following in your writeup:
Explanation for 2(a).
Proofs and/or counterexamples for 2(b).
Explanation for 2(c).
Top 5 rules with confidence scores [2(d)].
Top 5 rules with confidence scores [2(e)].
3. Locality-Sensitive Hashing (15 pts)
When simulating a random permutation of rows, as described in Sect. 3.3.5 of MMDS, we could save time if we restricted our attention to a randomly chosen k of the n rows, rather than hashing all n row numbers. The downside of doing so is that, if none of the k rows contains a 1 in a certain column, then the result of the minhashing is "don't know". In other words, we get no row number as the minhash value. It would be a mistake to assume that
two columns that both minhash to "don't know" are likely to be similar. However, if the probability of getting "don't know" as a minhash value is small, we can tolerate the situation and simply ignore such minhash values when computing the fraction of minhashes in which two columns agree.
In part (a) we determine an upper bound on the probability of getting "don't know" as the minhash value when considering only a k-subset of the n rows, and in part (b) we use this bound to determine an appropriate choice for k, given our tolerance for this probability.
(a) [5pts]
Suppose a column has m 1's and therefore n − m 0's, and we randomly choose k rows to consider when computing the minhash. Prove that the probability of getting "don't know" as the minhash value for this column is at most $\left(\frac{n-k}{n}\right)^m$.
(b) [5pts]
Suppose we want the probability of "don't know" to be at most $e^{-10}$. Assuming n and m are both very large (but n is much larger than m or k), give a simple approximation to the smallest value of k that will ensure this probability is at most $e^{-10}$. Your expression should be a function of n and m. Hints: (1) You can use $\left(\frac{n-k}{n}\right)^m$ as the exact value of the probability of "don't know". (2) Remember that for large x, $\left(1 - \frac{1}{x}\right)^x \approx 1/e$.
(c) [5pts]
Note: Part (c) should be considered separate from the previous two parts, in that we are no longer restricting our attention to a randomly chosen subset of the rows.
When minhashing, one might expect that we could estimate the Jaccard similarity without using all possible permutations of rows. For example, we could allow only cyclic permutations, i.e., start at a randomly chosen row r, which becomes the first in the order, followed by rows r + 1, r + 2, and so on, down to the last row, and then continuing with the first row, second row, and so on, down to row r − 1. There are only n such permutations if there are n rows. However, these permutations are not sufficient to estimate the Jaccard similarity correctly.
Give an example of two columns such that the probability (over cyclic permutations only) that their minhash values agree is not the same as their Jaccard similarity. In your answer, please provide (a) an example of a matrix with two columns (let the two columns correspond to sets denoted by S1 and S2), (b) the Jaccard similarity of S1 and S2, and (c) the probability that a random cyclic permutation yields the same minhash value for both S1 and S2.
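If you want to verify a candidate example numerically, all n cyclic permutations can be enumerated directly. A small sketch (the 0/1-list encoding of columns and the helper names are assumptions, not part of the assignment):

```python
def cyclic_minhash_agreement(col1, col2):
    # col1, col2: equal-length lists of 0/1; each must contain at least one 1.
    # Returns the fraction of the n cyclic permutations under which both
    # columns receive the same minhash value, i.e., the same first row
    # containing a 1 in permutation order.
    n = len(col1)
    agree = 0
    for r in range(n):
        order = [(r + i) % n for i in range(n)]   # permutation starting at row r
        m1 = next(row for row in order if col1[row] == 1)
        m2 = next(row for row in order if col2[row] == 1)
        agree += (m1 == m2)
    return agree / n

def jaccard(col1, col2):
    inter = sum(a & b for a, b in zip(col1, col2))
    union = sum(a | b for a, b in zip(col1, col2))
    return inter / union
```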
What to submit
Include the following in your writeup:
Proof for 3(a)
Derivation and final answer for 3(b)
Example for 3(c) including the three requested items
4. LSH for Approximate Near Neighbor Search (30 pts)
In this problem, we study the application of LSH to the problem of finding approximate near neighbors.
Assume we have a dataset A of n points in a metric space with distance metric d(·, ·). Let c be a constant greater than 1. Then, the (c, λ)-Approximate Near Neighbor (ANN) problem is defined as follows: given a query point z, assuming that there is a point x in the dataset with d(x, z) ≤ λ, return a point x′ from the dataset with d(x′, z) ≤ cλ (this point is called a (c, λ)-ANN). The parameter c therefore represents the maximum approximation factor allowed in the problem.
Let us consider an LSH family H of hash functions that is (λ, cλ, p1, p2)-sensitive for the distance measure d(·, ·). Let G = H^k = {g = (h1, ..., hk) | hi ∈ H, ∀ 1 ≤ i ≤ k} [1], where k = log_{1/p2}(n).
Let us consider the following procedure:
1. Select L = n^ρ random members g1, ..., gL of G, where ρ = log(1/p1) / log(1/p2).
2. Hash all the data points as well as the query point using all gi (1 ≤ i ≤ L).
3. Retrieve at most 3L data points (chosen uniformly at random) from the set of L buckets to which the query point hashes. [2]
4. Among the points selected in phase 3, report the one that is the closest to the query point as a (c, λ)-ANN.
The goal of the first part of this problem is to show that this procedure leads to a correct answer with constant probability.
[1] The equality G = H^k is saying that every function of G is an AND-construction of k functions of H, so g(x) = g(y) only if hi(x) = hi(y) for every hi underlying g.
[2] If there are fewer than 3L data points hashing to the same buckets as the query point, just take all of them.
(a) [5 pts]
Let W_j = {x ∈ A | g_j(x) = g_j(z)} (1 ≤ j ≤ L) be the set of data points x mapping to the same value as the query point z under the hash function g_j. Define T = {x ∈ A | d(x, z) > cλ}. Prove:

$$\Pr\left[\sum_{j=1}^{L} |T \cap W_j| \ge 3L\right] \le \frac{1}{3}.$$

(Hint: Markov's Inequality.)
(b) [5 pts]
Let x* ∈ A be a point such that d(x*, z) ≤ λ. Prove:

$$\Pr\left[\forall\, 1 \le j \le L:\ g_j(x^*) \ne g_j(z)\right] < \frac{1}{e}.$$
(c) [5 pts]
Conclude that with probability greater than some fixed constant the reported point is an actual (c, λ)-ANN.
(d) [15 pts]
A dataset of images, patches.csv, is provided in q4/data. [3]
Each row in this dataset is a 20 × 20 image patch represented as a 400-dimensional vector. We will use the L1 distance metric on R^400 to define the similarity of images. We would like to compare the performance of LSH-based approximate near neighbor search with that of linear search. [4] You should use the code provided with the dataset for this task.
The included starter code in lsh.py marks all locations where you need to contribute code with TODOs. In particular, you will need to use the functions lsh_setup and lsh_search, and implement your own linear search. The default parameters L = 10, k = 24 to lsh_setup work for this exercise, but feel free to use other parameter values as long as you explain the reason behind your parameter choice.
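For the linear-search piece, one possible shape is sketched below. This is an assumption-laden sketch: it presumes the patches have already been loaded into a NumPy array with one patch per row, so adapt the orientation to however the starter code actually loads patches.csv.

```python
import numpy as np

def linear_search(patches, query_index, k=3):
    # Brute-force top-k neighbors of one patch under L1 distance.
    z = patches[query_index]
    dists = np.abs(patches - z).sum(axis=1).astype(float)
    dists[query_index] = np.inf          # exclude the query patch itself
    return np.argsort(dists)[:k]         # indices of the k closest patches
```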
For each of the image patches in columns 100, 200, 300, ..., 1000, find the top 3 near neighbors [5] (excluding the original patch itself) using both LSH and linear search. What is the average search time for LSH? What about for linear search?
Assuming {z_j | 1 ≤ j ≤ 10} to be the set of image patches considered (i.e., z_j is the image patch in column 100j), {x_ij} (i = 1, 2, 3) to be the approximate near neighbors of z_j found using LSH, and {x*_ij} (i = 1, 2, 3) to be the (true) top 3 near neighbors of z_j found using linear search, compute the following error measure:

$$\text{error} = \frac{1}{10} \sum_{j=1}^{10} \frac{\sum_{i=1}^{3} d(x_{ij}, z_j)}{\sum_{i=1}^{3} d(x^{*}_{ij}, z_j)}.$$
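The error measure transcribes directly into code; a sketch (the function names, argument names, and list-of-vectors layout are assumptions):

```python
import numpy as np

def l1(a, b):
    return np.abs(np.asarray(a) - np.asarray(b)).sum()

def error_measure(queries, lsh_neighbors, true_neighbors):
    # queries: the 10 patches z_j; lsh_neighbors[j] / true_neighbors[j]:
    # the 3 neighbors of z_j found by LSH / linear search, respectively.
    ratios = [sum(l1(x, z) for x in approx) / sum(l1(x, z) for x in exact)
              for z, approx, exact in zip(queries, lsh_neighbors, true_neighbors)]
    return sum(ratios) / len(ratios)
```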
Plot the error value as a function of L (for L = 10, 12, 14, ..., 20, with k = 24).
Similarly, plot the error value as a function of k (for k = 16, 18, 20, 22, 24, with L = 10).
Briefly comment on the two plots (one sentence per plot would be sufficient).
Finally, plot the top 10 near neighbors found [6] using the two methods (using the default L = 10, k = 24 or your alternative choice of parameter values for LSH) for the image patch in column 100, together with the image patch itself. You may find the function plot useful.
How do they compare visually?
What to submit
Include the proof for 4(a) in your writeup.
Include the proof for 4(b) in your writeup.
Include the reasoning for why the reported point is an actual (c, λ)-ANN in your writeup [4(c)].
Include the following in your writeup for 4(d):
Average search time for LSH and linear search.
Plots for error value vs. L and error value vs. k, and brief comments for each plot
Plot of 10 nearest neighbors found by the two methods (also include the original image) and brief visual comparison
Upload the code for 4(d) on Gradescope.
[3] Dataset and code adapted from Brown University's Greg Shakhnarovich.
[4] By linear search we mean comparing the query point z directly with every database point x.
[5] Sometimes, the function lsh_search may return fewer than 3 nearest neighbors. You can use a while loop to check that lsh_search returns enough results, or you can manually run the program multiple times until it returns the correct number of neighbors.
[6] Same remark: you may sometimes have fewer than 10 nearest neighbors in your results; you can use the same hacks to bypass this problem.