Homework 04 Solution

Q0 (0pts correct answer, -1,000pts incorrect answer: (0,-1,000) pts): A correct answer to the following questions is worth 0pts. An incorrect answer is worth -1,000pts, which carries over to other homeworks and exams, and can result in an F grade in the course.


    (1) Student interaction with other students / individuals:

        (a) I have copied part of my homework from another student or another person (plagiarism).

        (b) Yes, I discussed the homework with another person but came up with my own answers. Their name(s) is (are)

        (c) No, I did not discuss the homework with anyone.

    (2) On using online resources:

        (a) I have copied one of my answers directly from a website (plagiarism).

        (b) I have used online resources to help me answer this question, but I came up with my own answers (you are allowed to use online resources as long as the answer is your own). Here is a list of the websites I have used in this homework:

        (c) I have not used any online resources except the ones provided in the course website.


























    • Homework 4: Inductive Biases for Graph and Sequence Data


Learning Objectives: Let students understand basic concepts that are used to add inductive biases in deep learning models for images, sequences, and graphs.



Learning Outcomes: After finishing this homework, students will be able to create new inductive biases to solve new tasks.



Here, we are going to implement (1) a Graph Neural Network with a mean aggregator, (2) a Graph Neural Network with an LSTM set aggregator, (3) a Markov chain language model, and (4) an LSTM language model. This HW is a combination of two distinct tasks.

The necessary files (code and data) can be obtained from Piazza.


1.1    HW Overview

You are going to fill in the missing parts of several files under each project folder. The folder structure that you should use is shown below:



































[your purdue login] hw-4
    Data
        cora
        ptb
    GNN
        data
        data.py
        main.py
        model.py
        utils.py
    GraphSAGE
        data
        data.py
        main.py
        model.py
        utils.py
    LM
        data
        data.py
        lstm.py
        mc.py
        model.py
        utils.py
    ReadMe
[your purdue login] hw-4: The top-level folder that contains all the files required in this homework. You should replace the folder name with your own Purdue login, following the naming convention.

ReadMe: Your ReadMe should begin with a couple of example commands, e.g., "python hw4.py data", used to generate the outputs you report (even if you follow exactly the same command lines provided in each question). The TA must be able to replicate your results with the commands provided here. More detailed options, usage notes, and design descriptions of your program can follow. You can also list any concerns that you think the TA should know about while running your program. Put the information that you think is most important at the top. Moreover, the file should be written in plain-text format that can be displayed with the Linux "less" command.

Data: The root folder of all related data files in this HW. Each data folder under the project folders is linked to this folder.

GNN: Project folder for you to implement a Graph Neural Network with just the mean aggregator. One module that you are going to develop:

    - model.py

The details will be provided in the task descriptions. All the other files are supporting files that you should not change.

GraphSAGE: Project folder for you to implement GraphSAGE with both a mean aggregator and an LSTM aggregator. Pay attention that the design of this project is totally different from the previous one. An implementation of Janossy pooling is also required in this project for the LSTM aggregator. One module that you are going to develop:

    - model.py

The details will be provided in the task descriptions. All the other files are supporting files that you should not change.

LM: Project folder for you to implement Markov chain and LSTM language models. Four modules that you are going to develop:

    - data.py
    - lstm.py
    - mc.py
    - model.py

The details will be provided in the task descriptions. All the other files are supporting files that you should not change.



























    • Graph Neural Networks (GNN)


Task 1a: GNN with Mean Aggregator



This is a warm-up task. You are going to implement the mean aggregator based on the provided formula and the supporting files under the GNN folder.



In the following GNN related coding assignment, you will work with the Cora dataset. The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

You can regard it as a labeled graph G = (V, E, X, Y), where V = {1, 2, ..., 2708} is the set of n = 2708 nodes, E is the set of 5429 undirected edges, and X and Y are the feature and label matrices assigned to the graph. Although the raw dataset is a directed graph, we will regard it as an undirected graph for convenience. For each node v ∈ V, we have a feature vector x_v and a label y_v. Thus, you will have

    X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad
    Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}    (1)

where each x_u is a 1433-dimensional vector and X is a 2708 × 1433 matrix.
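
Since the raw citation links are directed but we treat the graph as undirected, a natural preprocessing step is to symmetrize the edge list into a 0/1 adjacency matrix. The sketch below is only illustrative: the names build_adjacency and edge_list are assumptions, and the actual loading logic belongs to the provided data.CoraDataset class.

import torch

def build_adjacency(edge_list, n=2708):
    """Build a symmetric 0/1 adjacency matrix from a list of (u, v) citation links."""
    A = torch.zeros(n, n)
    for u, v in edge_list:
        # The raw links are directed; mark both directions to obtain an undirected graph.
        A[u, v] = 1.0
        A[v, u] = 1.0
    return A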



In the GNN of this project, the features of a node u and its neighbors N_u will be aggregated together from layer k to the next layer k + 1:

    h_u^{(k+1)} = \begin{cases}
        \sigma\left( W^{(k+1)} \left[ h_u^{(k)},\; f\left(\{ h_v^{(k)} \}_{v \in N_u}\right) \right] + b^{(k+1)} \right) & k + 1 > 0 \\
        x_u & k + 1 = 0
    \end{cases}    (2)

where h_u^{(k)} is the embedding feature vector of node u at layer k, W^{(k)} and b^{(k)} are the weight and bias matrices of layer k, σ is a specific activation function, and f is a specific permutation-invariant function. In this project, f will be the mean function. Thus, you will have a graph embedding matrix H^{(k)} for each layer k. The final embedding matrix will then be used to classify the labels Y of graph G.

    H^{(k)} = \begin{bmatrix} h_1^{(k)} \\ h_2^{(k)} \\ \vdots \\ h_n^{(k)} \end{bmatrix}    (3)
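
To build intuition for the matrix form that Action Item 1 below asks you to derive, note that multiplying the adjacency matrix by H^{(k)} sums the neighbor embeddings, and dividing each row by the node degree turns that sum into a mean. A minimal sketch of one layer under this view is given here; it assumes every node has at least one neighbor, uses ReLU as the activation σ, follows the row-major convention of Equation (1), and its signature is illustrative rather than the provided model.GraphConvolution interface.

import torch
import torch.nn.functional as F

def mean_aggregate_layer(H, A, W, b):
    """One GNN layer with a mean aggregator, written in matrix form (illustrative).

    H: (n, d_in) embeddings at layer k; A: (n, n) 0/1 adjacency without self-loops;
    W: (2 * d_in, d_out) weights; b: (d_out,) bias.
    """
    deg = A.sum(dim=1, keepdim=True)               # node degrees, shape (n, 1)
    neighbor_mean = (A @ H) / deg                  # row u is the mean of u's neighbor embeddings
    H_cat = torch.cat([H, neighbor_mean], dim=1)   # concatenate own embedding with neighbor mean
    return F.relu(H_cat @ W + b)                   # sigma taken to be ReLU here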



Related Modules:




GNN/data.py
GNN/model.py



Action Items:


    1. The formula above is designed to work on each node separately, which is slow for deep GNNs in practice. For the mean aggregator of this project, there is a matrix version of the formula that allows us to exploit the efficiency of matrix computations. To be specific, you are given the embedding matrix H^{(k)} of layer k and the adjacency matrix A, where

    A_{uv} = \begin{cases} 0 & u \text{ and } v \text{ are disconnected} \\ 1 & u \text{ and } v \text{ are connected} \end{cases}    (4)

Rewrite the formula from Equation (2), with f as the mean, so that you can obtain H^{(k+1)} from H^{(k)} and A. You may only define constant matrices in this question.

Hint: You can compute the mean operation through a matrix multiplication.

    2. Go through all the related modules. Specifically, you should understand data.CoraDataset and model.GraphConvolution well in order to use the provided data properly.

        (a) Modify GNN/data.py to compute the necessary adjacency matrix.

        (b) Implement the GCN formula you derived above in GNN/model.py.

In the report, provide the min, max and average degrees of training, validation and test nodes of the Cora dataset.


Min    Max    Average

Training

Validation

Test


    3. After understanding the matrix version of the formula and all the related modules, implement the formula in data.py and model.py so that the GNN runs correctly. Run main.py with the default arguments: python main.py. Save the logging outputs from your console. In the report, you should include:

        (a) Learning curves of the loss function on training and validation nodes.

        (b) Learning curves of accuracy on training and validation nodes.

        (c) Loss and accuracy on test nodes for the saved models.

    4. Run python main.py --num_layers 2 and python main.py --num_layers 20. In your report, include:

        (a) A description of what happens during the training process.

        (b) For the model from python main.py --num_layers 2, get H^{(1)} and report the maximum MSE over all pairs of embedding vectors in H^{(1)}.

        (c) For the model from python main.py --num_layers 20, get H^{(19)} and report the maximum MSE over all pairs of embedding vectors in H^{(19)} (a minimal sketch of this pairwise-MSE computation is given after this list).
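
For items (b) and (c), one possible way to compute the maximum MSE over all pairs of embedding vectors is sketched below; the function name and the way you extract H^{(1)} or H^{(19)} from the trained model are assumptions, not part of the provided code.

import torch

def max_pairwise_mse(H):
    """Maximum mean-squared error over all pairs of rows of an embedding matrix H of shape (n, d)."""
    diff = H.unsqueeze(0) - H.unsqueeze(1)   # (n, n, d): diff[i, j] = H[j] - H[i]
    mse = diff.pow(2).mean(dim=2)            # (n, n): MSE between every pair of embeddings
    return mse.max().item()

If the intermediate (n, n, d) tensor does not fit in memory, a loop over rows is a slower but safer alternative.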




Task 1b: GNN with Mean and LSTM Aggregator



This is an advanced version of the previous task. You are going to implement a more general GNN supporting both mean and LSTM aggregators. This model is similar to what is known as the GraphSAGE architecture. Pay attention that the design of this project has time-efficiency issues, so we highly recommend running your experiments on GPU resources, e.g., the Scholar cluster. If a GPU is not available, you should start as soon as possible: running on CPU will take hours to finish.



In the GNN of this project, the features of a node u and its neighbors N_u will be aggregated together from layer k to the next layer k + 1:

    h_u^{(k+1)} = \begin{cases}
        \sigma\left( W^{(k+1)} \left[ h_u^{(k)},\; f\left(\{ h_v^{(k)} \}_{v \in N_u}\right) \right] + b^{(k+1)} \right) & k + 1 > 0 \\
        x_u & k + 1 = 0
    \end{cases}    (5)

where h_u^{(k)} is the embedding feature vector of node u at layer k, W^{(k)} and b^{(k)} are the weight and bias matrices of layer k, σ is a specific activation function, and f is a specific permutation-invariant function. This time, however, f will be either the mean function or an LSTM architecture with Janossy pooling.


Consider a general LSTM architecture, as we have seen in class:


[Figure: a general LSTM unrolled over t time steps. Inputs x_1, x_2, ..., x_t are fed into LSTM cells that pass hidden states h_0, h_1, ..., h_t and cell states c_0, c_1, ..., c_t from one step to the next.]




To use the LSTM as a neighbor aggregator, we will have a sequence of neighbor embeddings h_{v_1}, h_{v_2}, \ldots, h_{v_t} (where t = |N_u| is the number of neighbors to aggregate), which we will give as the input of the LSTM. To avoid confusion, we will call the short-term outputs of the LSTM m_1, m_2, \ldots, m_t, so the final output of the LSTM is going to be c_t, m_t = \tilde{f}(h_{v_1}, h_{v_2}, \ldots, h_{v_t}).

Based on the definition of the LSTM design, we know that the state memory c_t can be regarded as an embedding of the sequence h_{v_1}, h_{v_2}, \ldots, h_{v_t}. The problem is that the LSTM function \tilde{f} is permutation-sensitive, while we are seeking a permutation-invariant function f.


From the class, we know that Janossy pooling is a method to obtain a permutation-invariant function f from a permutation-sensitive function \tilde{f} by

    f(h_{v_1}, h_{v_2}, \ldots, h_{v_t}) = \frac{1}{t!} \sum_{\pi \in \Pi(t)} \tilde{f}\left(h_{v_{\pi(1)}}, h_{v_{\pi(2)}}, \ldots, h_{v_{\pi(t)}}\right)    (6)






where Π(t) is the set of all permutations of length t. In practice, enumerating t! permutations is not tractable. In this project, we will instead use π-SGD, sampling a single permutation rather than traversing all permutations:

    f(h_{v_1}, h_{v_2}, \ldots, h_{v_t}) \approx \tilde{f}\left(h_{v_{\hat\pi(1)}}, h_{v_{\hat\pi(2)}}, \ldots, h_{v_{\hat\pi(t)}}\right), \quad \text{where } \hat\pi \sim \mathrm{Uniform}(\Pi(t))    (7)

The above procedure is still expensive if a node has a large degree. In what follows we consider 5-ary dependencies to ease the computational burden.

5-ary Janossy pooling. Another issue with the LSTM is gradient vanishing or explosion when backpropagating through too many time steps. In this project, a node may have more than 100 neighbors, which would require the LSTM to backpropagate through more than 100 time steps. Instead, we will use 5-ary Janossy pooling, trained with π-SGD. The key idea is to perform the forward and backward pass over only the first 5 neighbors of \hat\pi (ignoring the remaining neighbors). At inference time, where we will sample 20 permutations \hat\pi (described later in the homework), we also only need to consider the first 5 neighbors of \hat\pi in the forward pass.
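
A minimal sketch of an LSTM aggregator with π-SGD and 5-ary Janossy pooling is given below. The class name, tensor shapes, and the use of the final cell state are assumptions for illustration; the actual interface you need to fill in is defined in GraphSAGE/model.py, and details such as batching and nodes with fewer than 5 neighbors are ignored here.

import torch
import torch.nn as nn

class LSTMAggregator(nn.Module):
    """Illustrative 5-ary Janossy LSTM aggregator trained with pi-SGD."""

    def __init__(self, dim, k=5):
        super().__init__()
        self.k = k                                           # number of neighbors kept after permuting
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

    def forward(self, neighbor_h):
        # neighbor_h: (t, dim) embeddings of the t neighbors of a single node
        perm = torch.randperm(neighbor_h.size(0))            # sample a permutation (pi-SGD)
        seq = neighbor_h[perm][: self.k].unsqueeze(0)        # keep only the first 5 permuted neighbors
        _, (h_t, c_t) = self.lstm(seq)                       # run the LSTM over the short sequence
        return c_t.squeeze(0).squeeze(0)                     # final cell state as the aggregated embedding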



Related Modules:


GraphSAGE/model.py




Action Items:


    1. You should address the difference between this task and the previous task. Report what you expect would happen if we also ran the 20-layer experiment in this project.

    2. Fill in the missing parts of GraphSAGE/model.py.

Hint: If necessary, consult PyTorch's documentation of the LSTM: https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM

    3. Run python main.py --agg mean and python main.py --agg LSTM. Save the logging outputs from your console. In the report, you should include:

        (a) Learning curves of the loss function on training and validation nodes.

        (b) Learning curves of accuracy on training and validation nodes.

        (c) Loss and accuracy on test nodes for the saved model.

    4. In the previous question with the LSTM aggregator, you also randomly sample a permutation at test time. Modify the code, and use the average prediction over 20 permutations (the average of 20 runs of the test stage) as the final output, as sketched below. What is the test accuracy over those 20 permutations? Explain why the results improve.
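
A minimal sketch of this test-time averaging, assuming a model whose forward pass draws a fresh permutation on every call; the names and the softmax averaging are illustrative assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def average_over_permutations(model, inputs, num_samples=20):
    """Average class probabilities over several sampled neighbor permutations at test time."""
    model.eval()
    probs = None
    for _ in range(num_samples):
        logits = model(inputs)                    # each call samples a new permutation internally
        p = F.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    return probs / num_samples                    # averaged prediction over num_samples permutations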








    • Language Model (LM)


Task 2a: Markov Chain Modeling



In the following LM-related tasks, you will work with the Penn Treebank (PTB) dataset. The PTB dataset is a collection of words and sentences. In the provided loader, an additional word "eos" is added to each sentence to force the language model to learn an "end-of-sentence" embedding. Given the first several words of a sentence and its context, your task is to predict the next word.



In k-order Markov chain language modeling, we compute statistics over the training data. For the t-th word w_t, we have its context c_t = [w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}]. If the context extends past the beginning of the sentence, we simply use the token -1 (or "unk") for padding.

    P(w_t \mid c_t) = \frac{\#(c_t, w_t)}{\sum_{w} \#(c_t, w)}    (8)

where \#(c_t, w) is the number of appearances of the sequence [w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w] in the training subset of the dataset.





Equation (8) can lead to zero probabilities. To solve this problem, we will use a simple smoothing scheme.

    P(w_t \mid c_t) = \begin{cases}
        \dfrac{\#(c_t, w_t)}{\sum_{w} \#(c_t, w) + 1} & \#(c_t, w_t) > 0 \\[2ex]
        \dfrac{1 / \#_{\mathrm{unk}}(c_t)}{\sum_{w} \#(c_t, w) + 1} & \#(c_t, w_t) = 0
    \end{cases}    (9)

where \#_{\mathrm{unk}}(c_t) = \sum_{w} \mathbf{1}\left(\#(c_t, w) = 0\right) is the number of words that never appear after context c_t in the training data.
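
The counting behind Equations (8) and (9) can be sketched with plain dictionaries as below. The function names, the -1 padding, and the vocab_size argument are assumptions about the provided loader; LM/data.py and LM/mc.py define the actual interface.

from collections import Counter, defaultdict

def build_counts(words, k):
    """Count k-order context/word co-occurrences over a training sequence of word ids."""
    counts = defaultdict(Counter)                 # counts[context][word] = #(context, word)
    for t, w in enumerate(words):
        context = tuple([-1] * max(0, k - t) + words[max(0, t - k):t])  # pad short contexts with -1
        counts[context][w] += 1
    return counts

def smoothed_prob(counts, context, word, vocab_size):
    """Smoothed P(word | context) following Equation (9)."""
    ctx = counts.get(context, Counter())
    total = sum(ctx.values()) + 1                 # denominator: sum_w #(c_t, w) + 1
    if ctx[word] > 0:
        return ctx[word] / total
    num_unseen = vocab_size - len(ctx)            # #unk(c_t): words never seen after this context
    return (1.0 / num_unseen) / total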



Related Modules:

LM/data.py




Action Items:

    1. Finish the missing part of each file. Feel free to modify the files as long as you can use the Markov chain to compute the cross-entropy loss and accuracy on the test data correctly.

    2. Run python LM/mc.py --order 1, python LM/mc.py --order 2, and python LM/mc.py --order 10. Report their perplexity on the test subset. Perplexity is a common metric used to evaluate the performance of language models; it is the exponential of the cross-entropy loss (a minimal sketch of this computation follows the equation):

    \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}, \ldots, w_1) \right)    (10)
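
A minimal sketch of Equation (10), where log_probs is assumed to hold log P(w_i | context) for every test word:

import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-probability over N test words."""
    n = len(log_probs)
    cross_entropy = -sum(log_probs) / n
    return math.exp(cross_entropy)

# Example: if every word has probability 0.1, the perplexity is 10.
# perplexity([math.log(0.1)] * 5)  ->  approximately 10.0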






Task 2b: LSTM Modeling



In LSTM language modeling, we train a recurrent model on the training data to estimate

    P(w_t \mid w_{t-1}, \ldots, w_1)    (11)

As mentioned in class, one problem of the LSTM is gradient vanishing or explosion over long sequences. To solve this problem, we often do truncated backpropagation through time rather than backpropagation through the whole sequence. To be specific, for a sequence x_1, x_2, x_3, x_4, x_5, x_6, common (full) backpropagation works in the following way:



[Figure: full backpropagation through time. All 6 time steps x_1, ..., x_6 are unrolled with parameter θ_1, producing hidden states h_1, ..., h_6 and cell states c_1, ..., c_6 from the initial states h_0, c_0; the loss is backpropagated through all 6 steps and θ_1 is updated.]



We feed forward all 6 time steps with parameter θ_1, compute the loss over the outputs of all 6 time steps, backpropagate gradients through all 6 time steps, and obtain the new parameter θ_2.

For truncated backpropagation through time with length 3, the process works in the following way:


[Figure: truncated BPTT, first segment. Time steps x_1, x_2, x_3 are unrolled with parameter θ_1 starting from h_0, c_0; the gradient is backpropagated through these 3 steps and θ_1 is updated to θ_2.]

[Figure: truncated BPTT, second segment. Time steps x_4, x_5, x_6 are unrolled with parameter θ_2 starting from the carried-over states h_3, c_3; the gradient is backpropagated through these 3 steps and θ_2 is updated to θ_3.]


We treat the first 3 time steps as a full sequence, and feed forward and backpropagate just as before with parameter θ_1. After the parameters are updated to θ_2, we take the last hidden states h_3, c_3 as the initial hidden states of the next segment. We then take the last 3 time steps as a full sequence (since only 3 time steps remain), and feed forward and backpropagate just as before, but with parameter θ_2 and initial hidden states h_3, c_3. After the parameters are updated, we obtain the new parameter θ_3.

To use batches, we simply divide the whole dataset sequence equally into several independent segments. For example, for the sequence x_1 to x_6, if we want a batch size of 2, we split it into 2 sequences x_1, x_2, x_3 and x_4, x_5, x_6, and train on them together.
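
A minimal sketch of this batch splitting and of the truncated-BPTT training loop described above; the batchify helper, the (inputs, hidden) -> (logits, hidden) model signature, and the detach-based state carry-over are assumptions about how LM/model.py and LM/lstm.py fit together.

import torch
import torch.nn as nn

def batchify(data, batch_size):
    """Split one long token sequence (1-D tensor) into batch_size independent columns."""
    n = data.size(0) // batch_size
    return data[: n * batch_size].view(batch_size, n).t()    # shape (n, batch_size)

def train_epoch(model, data, bptt, optimizer):
    """Truncated backpropagation through time with segment length bptt (illustrative)."""
    criterion = nn.CrossEntropyLoss()
    hidden = None
    for start in range(0, data.size(0) - 1, bptt):
        seq_len = min(bptt, data.size(0) - 1 - start)
        seg = data[start : start + seq_len]                   # inputs for this segment
        target = data[start + 1 : start + 1 + seq_len]        # next-word targets
        if hidden is not None:
            hidden = tuple(h.detach() for h in hidden)        # keep the state, cut the gradient graph
        optimizer.zero_grad()
        logits, hidden = model(seg, hidden)                   # assumed model signature
        loss = criterion(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
        loss.backward()                                       # gradients flow only within this segment
        optimizer.step()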



Related Modules:


LM/data.py
LM/model.py



Action Items:


    1. Understand the truncated backpropagation process, and finish the missing part of each file to support truncated backpropagation through time. You should correctly split the training sequence and forward the correct segments to earn full points for this question.

    2. Run python LM/lstm.py --bptt 5, python LM/lstm.py --bptt 35, and python LM/lstm.py --bptt 80. In the report, you should include, for each of the three runs:

        (a) Learning curves of the loss function on the training and validation sets.

        (b) Learning curves of accuracy on the training and validation sets.

        (c) Loss and accuracy on the test set for the saved models.






Submission Instructions


Please read the instructions carefully. Failing to follow the instructions might lead to loss of points.



Naming convention: [your purdue login] hw-4


All your submission files, including a ReadMe and code, should be included in one folder. The folder should be named following the above naming convention. For example, if my Purdue account is "jsmith123", then for Homework 4 I should name my folder "jsmith123 hw-4".




Remove any unnecessary files from your folder, such as the training datasets (the Data folder). Make sure your folder is structured as the tree shown in the Overview section.



Submit: TURNIN INSTRUCTIONS

Please submit your homework files on data.cs.purdue.edu using the turnin command, e.g.:

turnin -c cs690-dpl -p hw-4 jsmith123 hw-4.


Please make sure you did not use any library/source explicitly forbidden to use. If any such library/source code is used, you will get 0 pts for the coding part of the assignment. If your code does not run on scholar.rcac.purdue.edu, then even if it compiles on another computer, your code will still be considered not running and the respective part of the assignment will receive 0 pts.



























