$29
Dead ends in PageRank computations (25 points)
Let the matrix of the Web M be an n-by-n matrix, where n is the number of Web pages. The entry mij in row i and column j is 0, unless there is an arc from node (page) j to node i. In that case, the value of mij is 1=k, where k is the number of arcs (links) out of node j. Notice that if node j has k 0 arcs out, then column j has k values of 1=k and the rest 0’s. If node j is a dead end (i.e., it has zero arcs out), then column j is all 0’s.
Let r = [r1; r2; : : : ; rn]T be (an estimate of) the PageRank vector; that is, ri is the estimate of the PageRank of node i. De ne w(r) to be the sum of the components of r; that is
w(r) = in=1 ri.
In one
iteration of the PageRank algorithm, we compute the next estimate r of the PageRank
P
P
n
0
as: r0
= Mr. Speci cally, for each i we compute r0
=
Mijrj. De ne w(r0) to be the
i
j=1
P
sum of components of r0
; that is w(r0) = in=1 ri0.
You may use D (the set of dead nodes) in your equation.
(a) [6pts]
Suppose the Web has no dead ends. Prove that w(r0) = w(r).
(b) [9pts]
Suppose there are still no dead ends, but we use a teleportation probability of 1 where we teleport to a random node (0 < < 1). The expression for the next estimate of ri becomes ri0 = (Pnj=1 Mijrj) + (1 )=n. Under what circumstances will w(r0) = w(r)? Prove your conclusion.
(c) [10pts]
Now let us assume there are one or more dead ends. Call a node \dead" if it is a dead end and \live" if not. At each iteration, we teleport from live nodes with probability 1 and teleport from dead nodes with probability 1. In both cases, we choose a random node uniformly to teleport to. Assume w(r) = 1.
CS 246: Mining Massive Data Sets - Problem Set 3
2
Write the equation for ri0 in terms of , M, r, n, and D (where D is the set of dead nodes).
Then, prove that w(r0) is also 1.
What to submit
Proof [1(a)]
Condition for w(r0) = w(r) and Proof [1(b)]
Equation for ri0 and Proof [1(c)]
Implementing PageRank and HITS (30 points)
In this problem, you will learn how to implement the PageRank and HITS algorithms in Spark. The general computation should be done in Spark, and you may also include numpy operations whenever needed. You will be experimenting with a small randomly generated graph (assume graph has no dead-ends) provided at graph-full.txt.
There are 100 nodes (n = 100) in the small graph and 1000 nodes (n = 1000) in the full graph, and m = 8192 edges, 1000 of which form a directed cycle (through all the nodes) which ensures that the graph is connected. It is easy to see that the existence of such a cycle ensures that there are no dead ends in the graph. There may be multiple directed edges between a pair of nodes, and your solution should treat them as the same edge. The rst column in graph-full.txt refers to the source node, and the second column refers to the destination node.
Implementation hint: You may choose to store the PageRank vector r either in memory or as an RDD. Only the matrix M of links is too large to store in memory, and you are allowed to store matrix M in an RDD. e.g. data = sc.textFile ("graph full.txt"). On an actual cluster, an RDD is partitioned across the nodes of the cluster. However, you cannot then M = data.collect() which fetches the entire RDD to a single machine at the driver node and stores it as an array locally.
(a) PageRank Implementation [15 points]
Assume the directed graph G = (V; E) nodes have positive out-degree, and M such that for any i; j 2 J1; nK:
has n nodes (numbered 1; 2; : : : ; n) and m edges, all = [Mji]n n is a an n n matrix as de ned in class
1
if (i ! j) 2 E;
M
ji
=
deg(i)
0
otherwise:
Here, deg(i) is the number of outgoing edges of node i in G. If there are multiple edges in the same direction between two nodes, treat them as a single edge. By the de nition of
CS 246: Mining Massive Data Sets - Problem Set 3
3
PageRank, assuming 1 to be the teleport probability, and denoting the PageRank vector by the column vector r, we have the following equation:
r =
1
1 + Mr;
(1)
n
Based on this equation, the iterative procedure to compute PageRank works as follows:
Initialize: r(0) = n1 1
For i from 1 to k, iterate: r(i) = 1 n 1 + Mr(i 1)
Run the aforementioned iterative process in Spark for 40 iterations (assuming = 0:8) and obtain the PageRank vector r. In particular, you don’t have to implement the blocking algorithm from lecture. The matrix M can be large and should be processed as an RDD in your solution.
Compute the PageRank scores and report the node id for the following using graph-full.txt:
List the top 5 node ids with the highest PageRank scores.
List the bottom 5 node ids with the lowest PageRank scores.
For a sanity check, we have provided a smaller dataset (graph-small.txt). In that dataset, the top node has id 53 with value 0.036. Note that the graph-small.txt dataset is only provided for sanity check purpose. Your write-up should include results obtained using graph-full.txt (for both part (a) and (b)).
(b) HITS Implementation [15 points]
Assume the directed graph G = (V; E) has n nodes (numbered 1; 2; : : : ; n) and m edges, all nodes have non-negative out-degree, and L = [Lij]n n is a an n n matrix referred to as the link matrix such that for any i; j 2 J1; nK:
Lij =
1
if (i ! j) 2 E;
0
otherwise:
Given the link matrix L and some scaling factors ; , the hubbiness vector h and the authority vector a can be expressed using the equations:
h = La; a = LT h
(2)
where 1 is the n 1 vector with all entries equal to 1.
Based on this equation, the iterative method to compute h and a is as follows:
CS 246: Mining Massive Data Sets - Problem Set 3
4
1. Initialize h with a column vector (of size n 1) of all 1’s.
2. Compute a = LT h and scale so that the largest value in the vector a has value 1.
Compute h = La and scale so that the largest value in the vector h has value 1.
Go to step 2.
Repeat the iterative process for 40 iterations, assume that = 1; = 1 and then obtain the hubbiness and authority scores of all the nodes (pages). The link matrix L can be large and should be processed as an RDD. Compute the following using graph-full.txt:
List the 5 node ids with the highest hubbiness score. List the 5 node ids with the lowest hubbiness score. List the 5 node ids with the highest authority score. List the 5 node ids with the lowest authority score.
For a sanity check, you should con rm that graph-small.txt has highest hubbiness node id 59 with value 1 and highest authority node id 66 with value 1.
What to submit
List 5 node ids with the highest and least PageRank scores [2(a)] using graph-full.txt
List 5 node ids with the highest and least hubbiness and authority scores [2(b)] using graph-full.txt
Upload all the code via Gradescope [2(a) & 2(b)]
Clique-Based Communities (25 points)
Imagine an undirected graph G with nodes 2; 3; 4; : : : ; 1000000. (Note that there is no node 1.) There is an edge between nodes i and j if and only if i and j have a common factor other than 1. Put another way, the only edges that are missing are those between nodes that are relatively prime; e.g., there is no edge between 15 and 56.
We want to nd communities by starting with a clique (not a bi-clique) and growing it by adding nodes. However, when we grow a clique, we want to keep the density of edges at 1; i.e., the set of nodes remains a clique at all times. A maximal clique is a clique for which it is impossible to add a node and still retain the property of being a clique; i.e., a clique C is maximal if every node not in C is missing an edge to at least one member of C.
Let Ci be the set of nodes of G that are divisible by i, where i is a positive integer.
CS 246: Mining Massive Data Sets - Problem Set 3
5
(a) [5 points]
Prove that Ci is a clique for any i.
(b) [10 points]
Under what circumstances is Ci a maximal clique? Prove that your conditions are both necessary and su cient. (Trivial conditions, like \Ci is a maximal clique if and only if Ci is a maximal clique," will receive no credit.)
(c) [10 points]
Prove that C2 is the unique largest clique. That is, it has more elements than any other clique. (Note: Not all cliques are in the form of Ci)
What to submit
Proof that the speci ed nodes are a clique.
Necessary and su cient conditions for Ci to be a maximal clique, with proof.
Proof that C2 is the unique largest clique.
Dense Communities in Networks (20 points)
In this problem, we study the problem of nding dense communities in networks.
De nitions: Assume G = (V; E) is an undirected graph (e.g., representing a social net-work).
For any subset S V , we let the induced edge set (denoted by E[S]) to be the set of edges both of whose endpoints belong to S.
For any v 2 S, we let degS(v) = jfu 2 Sj(u; v) 2 Egj. Then, we de ne the density of S to be:
jE[S]j
(S) = jSj :
Finally, the maximum density of the graph G is the density of the densest induced
subgraph of G, de ned as:
(G) = maxf (S)g:
S V
CS 246: Mining Massive Data Sets - Problem Set 3
6
Goal. Our goal is to nd an induced subgraph of G whose density is not much smaller than (G). Such a set is very densely connected, and hence may indicate a community in the network represented by G. Also, since the graphs of interest are usually very large in practice, we would like the algorithm to be highly scalable. We consider the following algorithm:
Require: G = (V; E) and 0
~
S;S V
while S =6 ; do
A(S) := fi 2 S j degS(i) 2(1 + ) (S)g
S n A(S)
~
if (S) (S) then
~
S S
end if
end while
~
return S
The basic idea in the algorithm is that the nodes with low degrees do not contribute much to the density of a dense subgraph, hence they can be removed without signi cantly in uencing the density.
We analyze the quality and performance of this algorithm. We start with analyzing its performance.
(a) [10 points]
We show through the following steps that the algorithm terminates in a logarithmic number of steps.
Prove that at any iteration of the algorithm, jA(S)j 1+ jSj.
Prove that the algorithm terminates in O(log1+ (n)) iterations, where n is the initial number of nodes.
[10 points]
We show through the following steps that the density of the set returned by the algorithm is at most a factor 2(1 + ) smaller than (G).
Assume S is the densest subgraph of G. Prove that for any v 2 S , we have: degS (v) (G).
Consider the rst iteration of the while loop in which there exists a node v 2 S \ A(S). Prove that 2(1 + ) (S) (G).
iii. Conclude that (S~)
1
(G).
2(1+ )
CS 246: Mining Massive Data Sets - Problem Set 3
7
What to submit
i. Proof of jA(S)j 1+ jSj.
Proof of number of iterations for algorithm to terminate.
i. Proof of degS (v)(G).
Proof of 2(1 + ) (S)(G).
~ 1
iii. Conclude that (S) 2(1+ ) (G).