1 Short Questions (Pick two out of three)
This section asks questions that admit a short and direct answer, but you need to provide mathematical justification throughout.
1.1 GDL Blueprint and Compositional Architectures (25 pt)
The GDL blueprint considers a composition of feature transformations of the form $\varphi_k : \mathcal{X}(\Omega_k, C_k) \to \mathcal{X}(\Omega_{k+1}, C_{k+1})$. Here, $\mathcal{X}(\Omega, C)$ denotes the space of signals defined over the domain $\Omega$ and having features in $C$ at each point of the domain. In other words, $x \in \mathcal{X}(\Omega, C)$ if $x(u) \in C$ for each $u \in \Omega$. The previous mapping $\varphi_k$ maps from a domain $\Omega_k$ to a coarser domain $\Omega_{k+1}$, possibly modifying the number of feature maps from $C_k$ to $C_{k+1}$. Additionally, we constrain each $\varphi_k$ to be local, that is, the output $\varphi_k(x)(\tilde{u})$, where $\tilde{u}$ is a point in $\Omega_{k+1}$, only depends on $\{x(v) : \|v - 2\tilde{u}\|_1 \le \delta\}$.
In this exercise, we will explore why this compositional non-linear structure is necessary, by studying limitations of architectures that replace any of these components by simple linear coarsening.
Assume a representation using two layers as above, i.e. $\Phi(x) = \varphi_2 \circ \varphi_1(x)$, with $\Omega_1$ the original 2-d grid of an image, $\Omega_2$ a coarsened version (say by a factor of 2 along each of the horizontal and vertical dimensions), and $\Omega_3 = \{u\}$ a singleton; in other words, $\Phi(x) \in \mathbb{R}^{C_3}$ is a delocalised feature vector (think of a global pooling). Moreover, for simplicity assume a coarsening step with stride 2, and set $\delta = 1$.
1. (5pt) Draw the diagram of the representation $\Phi$ in terms of its factors $\varphi_1$ and $\varphi_2$.
2. (10pt) Suppose now that $\varphi_1$ is a linear map, given by an average pooling. Construct an example of a two-class classification problem where the resulting representation $\Phi$ will not be able to separate the two classes, irrespective of $\varphi_2$.
3. (10pt) Suppose now that $\varphi_2$ is a linear map, given by an average pooling. Construct an example of a two-class classification problem where the resulting representation $\Phi$ will not be able to separate the two classes, irrespective of $\varphi_1$.
1.2 Invariant Polynomials (25 pt)
Consider a geometric domain given by a 1-d periodic grid $\Omega = \{1, \dots, d\}$, and a signal domain $\mathcal{X}(\Omega, \mathbb{R}) \cong \mathbb{R}^d$. We consider two symmetry groups defined on the domain: the cyclic group $C_d$ given by cyclic shifts of $\Omega$, i.e. $g.(x_1, \dots, x_d) = (x_k, x_{k+1}, \dots, x_d, x_1, \dots, x_{k-1})$, and the symmetric group $S_d$ given by all permutations $\sigma : \Omega \to \Omega$. The study and design of generic invariant function approximations starts with a good understanding of the set of invariant polynomials (since any continuous function can be well approximated by a polynomial of sufficiently large degree).
1. (5pt) Give an example of a polynomial $p$ in $(x_1, \dots, x_d)$, i.e.
$$p(x_1, \dots, x_d) = \sum_{k=0}^{K} \sum_{1 \le i_1 \le i_2 \le \dots \le i_k \le d} \alpha_{i_1, \dots, i_k}\, x_{i_1} \cdots x_{i_k},$$
such that it is invariant to neither of these groups.
2. (10pt) Give an example of a polynomial that is invariant with respect to $C_d$ but not invariant with respect to $S_d$.
3. (10pt) Can you give an example of the opposite, i.e. a polynomial that is invariant to $S_d$ but not invariant to $C_d$? Justify your answer.
1.3 Graph Isomorphism and Graph Convolutional Networks (25 pt)
Let $G = (V, E)$ be an unattributed undirected graph, with nodes $V$ and edges $E$. Given $G$, let $A \in \{0,1\}^{|V| \times |V|}$ be its adjacency matrix, where $A_{ij} = 1$ iff $(i, j) \in E$. Consider a simplified GCN architecture that takes as input the binary adjacency matrix of a graph, constructs initial node features $x_i^{(0)}$ given by node degrees $d_i := \sum_j A_{ij}$, $i \in V$, and propagates node features according to the diffusion
$$x_i^{(k)} = \sigma\!\left( \alpha_k\, x_i^{(k-1)} + \beta_k \sum_j A_{i,j}\, x_j^{(k-1)} \right),$$
where $k \le K$ is the layer index (i.e., the number of diffusion steps), and $\sigma$ is an activation function such as the ReLU. The output of the GCN is defined as $\Phi(G) = \frac{1}{|V|} \sum_{i \in V} x_i^{(K)}$, i.e. the average pooling of the last layer. This GCN is therefore parametrised by $\Theta = (\{\alpha_k, \beta_k\})_{k \le K} \in \mathbb{R}^{2K}$.
Despite being one of the most popular GNN architectures, the GCN is not a universal approximator. In this exercise, you will construct some example tasks that, while being permutation invariant, cannot be solved by GCNs.
1. (10pt) Draw two non-isomorphic graphs $G, G'$ for which the outputs $\Phi(G), \Phi(G')$ will always be the same, irrespective of $\Theta$. [Hint: Recall the 1-Weisfeiler-Lehman test.]
2. (10pt) Fix $K = 1$ and give an example of a graph prediction task $f : \mathcal{G} \to \{0,1\}$ that $\Phi$ is provably unable to solve. [Hint: Use the previous question.]
3. (5pt) Give an example of a graph prediction task $\tilde{f} : \mathcal{G} \to \{0,1\}$ that $\Phi$ is provably unable to solve when $K = 1$ but able to solve when $K = 2$.
2 Coding Question: GNNs for the Stochastic Block Model (50 pt)
In this exercise, you will train a Graph Neural Network to perform community detection on the Stochastic Block Model (SBM). You can check Chen et al. (2017) for further details. The SBM is a random graph model with a planted community structure, and has been thoroughly studied in the fields of statistical physics, high-dimensional probability and theoretical computer science. We will focus on the two-class model. Given a collection $V$ of $n$ elements (assume $n$ is even), we partition them randomly into two equally sized subsets of $n/2$ elements each, giving the hidden variable $y_i = 1$ or $y_i = -1$ accordingly. We then draw an edge between nodes $i$ and $j$ with probability $p$ if both $i$ and $j$ belong to the same set (i.e. $y_i = y_j$), and with probability $q$ if $y_i \neq y_j$. Edges are drawn independently of each other. The resulting random graph is $G = (V, E)$. The SBM is therefore parametrised by two numbers $p, q \in [0,1]$. Assume from now on that $p > q$.
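For concreteness, here is a minimal NumPy sketch of the sampling procedure just described; the helper name `sample_sbm` and its signature are illustrative choices, not part of the assignment:

```python
import numpy as np

def sample_sbm(n, p, q, rng=None):
    """Sample a two-community SBM; returns adjacency matrix A and labels y in {-1,+1}.

    Illustrative helper (not prescribed by the assignment); n is assumed even.
    """
    rng = np.random.default_rng(rng)
    # Balanced random partition: n/2 nodes per community.
    y = np.ones(n, dtype=int)
    y[rng.permutation(n)[: n // 2]] = -1
    # Edge probability is p within a community and q across communities.
    probs = np.where(np.equal.outer(y, y), p, q)
    # Draw each edge independently on the upper triangle, then symmetrise.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(np.int8)
    return A, y
```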
The goal of community detection is to infer the hidden community structure yi, i 2 V , given G. Several well-known methods exist in the literature to perform this inference, from spectral methods, to semi-definite programs, to belief-propagation. Intuitively, these algorithms are trying to perform a clustering (or cut) of the graph into components more connected than average.
In this exercise, we will consider Graph Neural Networks $\Phi(G, \Theta) = \{\Phi_i(G, \Theta) ; i \in V\}$ parametrised by $\Theta$. The output of the GNN is still localised in the graph, and is used to perform node-wise binary classification. For that purpose, we will first construct a training set $\{(G^{(k)}, y^{(k)})\}_{k \le K}$ of i.i.d. instances of the SBM, together with their labels.
1. (7pt) Let $\ell(z', z) = \log(1 + \exp(-z' z))$ be the logistic loss. Given an input graph $G$, let $\hat{y}_i = \Phi_i(G, \Theta)$, $i \in V$, be the output of the prediction model. Why can we not use as loss function
$$\tilde{L}(\Theta) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|V^{(k)}|} \sum_{i \in V^{(k)}} \ell\big(\Phi_i(G^{(k)}, \Theta),\, y_i^{(k)}\big)\ ?$$
Justify your answer.
2. (7pt) Instead, we will use
$$L(\Theta) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|V^{(k)}|} \min\!\left( \sum_{i \in V^{(k)}} \ell\big(\Phi_i(G^{(k)}, \Theta),\, y_i^{(k)}\big),\; \sum_{i \in V^{(k)}} \ell\big(\Phi_i(G^{(k)}, \Theta),\, -y_i^{(k)}\big) \right).$$
Explain what the differences are between $L$ and $\tilde{L}$ (an implementation sketch is given below). We will stick with $L$ from now on.
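For a single SBM instance, the inner term of $L$ can be transcribed directly, for example in PyTorch; the helper name `symmetric_logistic_loss` is ours, and it averages over nodes rather than summing, which only rescales the objective:

```python
import torch
import torch.nn.functional as F

def symmetric_logistic_loss(logits, y):
    """Inner term of L for one SBM instance, invariant to the global label flip.

    logits: tensor of shape (|V|,), the node outputs Phi_i(G, Theta);
    y: tensor of shape (|V|,) with entries in {-1, +1}. Illustrative helper.
    """
    # log(1 + exp(-z' z)) is softplus(-z' z); evaluate for both labelings.
    loss_pos = F.softplus(-logits * y).sum()
    loss_neg = F.softplus(logits * y).sum()  # corresponds to labels -y
    # Keep whichever global assignment of communities fits better.
    return torch.minimum(loss_pos, loss_neg) / y.numel()
```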
3. (7pt) Explain why $p \neq q$ is necessary in order to detect the communities.
4. (7pt) For each $s \in \{1, \dots, 5\}$, generate a training set of SBM instances, choosing $n = 1000$, $p = a_s/n$, $q = b_s/n$, $K = 100$, with $a_s = 12 + s$, $b_s = 8 - s$. Visualize one sample for each $s$ to show the communities (see the sketch below). [Hint: there are fewer connections between the communities and more connections within each community as $s$ increases.]
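One possible visualization recipe, assuming the `sample_sbm` helper sketched earlier; the use of networkx and the layout choice are ours, not prescribed by the assignment:

```python
import matplotlib.pyplot as plt
import networkx as nx

n = 1000
for s in range(1, 6):
    a_s, b_s = 12 + s, 8 - s
    A, y = sample_sbm(n, p=a_s / n, q=b_s / n)  # helper sketched above
    G = nx.from_numpy_array(A)
    pos = nx.spring_layout(G, seed=0)
    # Colour nodes by their hidden community label.
    colors = ["tab:blue" if label > 0 else "tab:red" for label in y]
    nx.draw_networkx_nodes(G, pos, node_size=8, node_color=colors)
    nx.draw_networkx_edges(G, pos, alpha=0.05)
    plt.title(f"SBM sample, s = {s} (a = {a_s}, b = {b_s})")
    plt.axis("off")
    plt.show()
```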
5. (7pt) Using a test set of the same size and characteristics as the training set above, measure the average overlap between predicted and true labels. Again, if $\hat{y}_i$ denotes the predicted label of the model for node $i$, the overlap over a single SBM instance is defined as
$$\mathrm{Overlap} = 2\left( \frac{1}{|V|} \max\!\left( \sum_i \mathbb{1}\big[\operatorname{sign}(\hat{y}_i) = y_i\big],\; \sum_i \mathbb{1}\big[\operatorname{sign}(\hat{y}_i) = -y_i\big] \right) - \frac{1}{2} \right).$$
Verify that $0 \le \mathrm{Overlap} \le 1$. What does it convey? [Hint: Think about the expected overlap for a purely random guess; when is the overlap equal to 0, and when is it equal to 1?]
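A direct transcription of this formula into NumPy might look as follows; the helper name `overlap` is ours, and we assume $\operatorname{sign}(\hat{y}_i)$ has no zero entries, so agreements with $y$ and with $-y$ partition the nodes:

```python
import numpy as np

def overlap(y_hat, y):
    """Overlap for one SBM instance; y_hat are raw scores, y are true +/-1 labels."""
    agree = np.sum(np.sign(y_hat) == y)
    # Account for the global label flip: take the better of the two labelings.
    best = max(agree, len(y) - agree)
    # Rescale so that chance level (~1/2 agreement) maps to ~0 and perfect recovery to 1.
    return 2.0 * (best / len(y) - 0.5)
```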
6. (8pt) Consider a GCN architecture such as the one from 1.3 (except for the final global pooling step; a sketch follows below). Choose $\sigma$ to be the ReLU. Use instance normalisation (or spatial batch normalisation assuming a batch size of 1), and use the node degrees $x_i = \sum_j A_{i,j}$ as input features. You can play with the depth of the network ($K$ in our previous notation) as well as the choice of activation. Plot the results as a function of the signal-to-noise ratio, defined as $\mathrm{SNR} = (a - b)^2 / (2(a + b))$; in other words, plot the average overlap on the test set on the y-axis vs. the SNR on the x-axis.
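A minimal PyTorch sketch of this architecture, keeping only the diffusion coefficients $\alpha_k, \beta_k$ as learnable parameters; the class name and the initialisation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    """Sketch of the Section 1.3 GCN without the final global pooling."""

    def __init__(self, depth):
        super().__init__()
        self.alpha = nn.Parameter(torch.randn(depth))
        self.beta = nn.Parameter(torch.randn(depth))
        self.depth = depth

    def forward(self, A):
        # Input features: node degrees, shape (|V|, 1).
        x = A.sum(dim=1, keepdim=True)
        for k in range(self.depth):
            # Diffusion step: alpha_k x_i + beta_k sum_j A_ij x_j, then ReLU.
            x = torch.relu(self.alpha[k] * x + self.beta[k] * (A @ x))
            # Instance normalisation: standardise features across the nodes.
            x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-5)
        return x.squeeze(-1)  # one scalar output per node
```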
7. (7pt) Compare this model against a spectral clustering baseline. Given the adjacency matrix $A$ of a graph, spectral clustering computes $v_2$ as the eigenvector with the second-largest eigenvalue:
$$v_1 = \operatorname*{arg\,max}_{\|v\| = 1} \|Av\|, \qquad v_2 = \operatorname*{arg\,max}_{\|v\| = 1,\ v^\top v_1 = 0} \|Av\|,$$
and defines the community estimator as $\hat{y} = \operatorname{sign}(v_2)$. This is also called the Fiedler vector.
Analyse the results of your GNN model in light of this baseline.
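The baseline itself is a few lines of NumPy, following the definition above (helper name ours):

```python
import numpy as np

def spectral_baseline(A):
    """Community estimate: sign of the eigenvector of A with the
    second-largest eigenvalue, as defined in the exercise."""
    eigvals, eigvecs = np.linalg.eigh(A)   # symmetric A: eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]      # indices sorted by decreasing eigenvalue
    v2 = eigvecs[:, order[1]]              # eigenvector of the second-largest eigenvalue
    return np.sign(v2)
```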