MAML. [5pts] This question is meant to introduce a cool application of automatic differentiation. Much like gradient-based hyperparameter optimization (last two slides of Lecture 6), it involves treating the gradient descent learning procedure itself as a computation graph, and differentiating through it. While the algorithm would be a major pain to implement by hand, it only requires a few extra lines of Autograd code compared to ordinary neural net training.
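To make this concrete, here is a minimal, self-contained illustration (ours, not part of the assignment) of differentiating through a gradient descent step with Autograd, on a toy 1-D objective:

    import autograd as ag

    def inner_loss(w):
        # Toy training objective with minimum at w = 3.
        return (w - 3.0) ** 2

    def loss_after_one_step(w0, lr):
        # One step of gradient descent, expressed as part of the graph.
        w1 = w0 - lr * ag.grad(inner_loss)(w0)
        return inner_loss(w1)

    # Gradient of the post-update loss w.r.t. the initial weights:
    # the kind of quantity MAML's meta-objective needs.
    print(ag.grad(loss_after_one_step, 0)(0.0, 0.1))
    # Gradient w.r.t. the learning rate, as in gradient-based
    # hyperparameter optimization.
    print(ag.grad(loss_after_one_step, 1)(0.0, 0.1))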
Suppose you want to train an agent that learns to perform many different but related tasks, such as having a robot arm pick up a variety of objects. The agent, through gaining experience with many such tasks, ought to be able to improve the rate at which it can learn similar tasks. This kind of learning is known as learning to learn, or meta-learning.
This question concerns a meta-learning algorithm called Model-Agnostic Meta-Learning (MAML, pronounced “mammal”). The idea is that if you choose a good enough set of initial weights for the network, it should be possible to learn a new task in only a few steps of gradient descent. Hence, MAML trains a single, task-generic set of weights, with the meta-objective defined as the loss on any particular task after K steps of gradient descent. (K is a small number, such as 5.) The term “model-agnostic” refers to the fact that MAML assumes pretty much nothing about the model, other than that it’s trainable by gradient descent.
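In symbols (the notation here is ours): writing $\alpha$ for the inner-loop learning rate and $\mathcal{L}_{\text{task}}$ for the loss on a sampled task, the adapted weights and the meta-objective are

    $\theta^{(0)} = \theta, \qquad \theta^{(k)} = \theta^{(k-1)} - \alpha \nabla_\theta \mathcal{L}_{\text{task}}\big(\theta^{(k-1)}\big), \qquad \mathcal{L}_{\text{meta}}(\theta) = \mathcal{L}_{\text{task}}\big(\theta^{(K)}\big),$

and MAML performs gradient descent on $\mathcal{L}_{\text{meta}}$ with respect to the initialization $\theta$.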
MAML was originally formulated in the more complex setting of reinforcement learning, but we will consider the setting of simple univariate regression problems. We will sample random univariate regression problems, where the inputs are sampled uniformly from the interval $[-3, 3]$, and the functions are sampled as random piecewise constant functions with breaks at the integers. For instance:

[Figure: an example of a random piecewise constant function, omitted here.]
It’s worth thinking a bit about how the meta-learner might solve this problem. Clearly, 5 iterations would not be enough to train a generic MLP from scratch. But suppose you initialized the network such that each hidden unit computed a basis function which takes the value 1 on the interval $[k, k+1]$ for some integer $k$, and 0 everywhere else. Then you could fit the function simply by adjusting the weights in the output layer, which is just a linear regression problem, and can therefore be solved pretty quickly. Hence, such a network would be a great initialization according to MAML’s objective. Of course, there are probably other good ways for MAML to solve this problem, and we don’t know what it will actually do when we run it.
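To see the linear-regression point concretely, here is a small sketch (ours; it assumes the hypothetical indicator-basis hidden layer described above, not whatever MAML actually learns):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=100)
    heights = rng.normal(size=6)                 # one level per unit interval
    y = heights[np.floor(x).astype(int) + 3]     # piecewise constant target

    # "Hidden layer": indicator features phi_k(x) = 1 on [k, k+1), else 0.
    Phi = np.stack([(np.floor(x) == k).astype(float) for k in range(-3, 3)],
                   axis=1)

    # Fitting the output weights is ordinary least squares: fast and exact.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(np.max(np.abs(w - heights)))           # ~0: recovers the levels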
The code you actually have to write is mostly straightforward once you know how to use Autograd. The challenging (and hopefully valuable) part is wrapping your head around how the starter code works. This is defined in maml.py. Here are the functions and classes it defines:
net_predict: this implements the forward pass for an MLP. The parameters are stored in a Python dict params. (A rough sketch of what this interface might look like appears after this list.)
random_init: initializes the weights to a Gaussian with a small standard deviation.
ToyDataGen: This class generates the random piecewise constant functions.
gd_step: Performs one step of gradient descent. Note that this function returns a new set of parameters, rather than modifying the arrays that were passed in.
InnerObjective: The cost function for one regression dataset. This is just mean squared error.
MetaObjective: The MAML objective, i.e. the inner cost after num_steps steps of gradient descent.
train: Runs the actual training, i.e. repeatedly samples random regression datasets and does gradient descent on the meta-objective.
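As a rough idea of the interface (the real code is in maml.py; the layer structure, dict key names, and activation below are illustrative guesses, not the actual starter code), net_predict might look like:

    import autograd.numpy as np

    def net_predict(params, x):
        # params is a dict of weight/bias arrays; the key names are guesses.
        h = np.maximum(0.0, np.dot(x, params['W1']) + params['b1'])
        return np.dot(h, params['W2']) + params['b2']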
Here is what you need to do. Each of these parts requires only a few lines of code, and you should not need to do any messy derivations.
[2pts] Implement gd_step. You should do this by calling ag.grad.
[2pts] Implement MetaObjective.__call__. (This is Python syntax for the method that gets called when you call the class instance as if it were a function, i.e. meta_obj(params).) Your implementation should call gd_step.
[1pt] Finish the implementation of train, i.e., sample a random regression dataset and do a gradient descent step on the meta-objective. (A sketch of how these three pieces might fit together appears after this list.)
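For orientation, here is one possible shape of the three pieces. The signatures below (gd_step(cost, params, alpha), the MetaObjective constructor arguments) are assumptions made for this sketch; the real signatures are fixed by maml.py and take precedence.

    import autograd as ag

    def gd_step(cost, params, alpha):
        # Return a NEW params dict after one gradient step on cost;
        # ag.grad handles dicts of arrays directly.
        grads = ag.grad(cost)(params)
        return {name: params[name] - alpha * grads[name] for name in params}

    class MetaObjective:
        def __init__(self, inner_obj, num_steps, alpha):
            self.inner_obj = inner_obj   # an InnerObjective instance
            self.num_steps = num_steps
            self.alpha = alpha

        def __call__(self, params):
            # Inner loop: adapt to the task, building a computation
            # graph through every gradient step.
            for _ in range(self.num_steps):
                params = gd_step(self.inner_obj, params, self.alpha)
            return self.inner_obj(params)

    # Inside train, one meta-update would then look something like:
    #     meta_obj = MetaObjective(InnerObjective(x, y), num_steps, inner_lr)
    #     params = gd_step(meta_obj, params, meta_lr)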
Once you finish the code, calling train will produce a visualization such as the following, where the thinnest line corresponds to the initial parameters learned by MAML, and thicker lines correspond to more steps of SGD:

[Figure: fits after increasing numbers of SGD steps, omitted here.]
Observe that your solution will involve calling gd_step on a function which itself calls gd_step. Since gd_step calls ag.grad, this means you are calling ag.grad on a computation graph which was itself generated by ag.grad. Understanding why this happens is an important part of understanding the code.
Submit your code solution as maml.py. You don’t need to submit anything else for this question.
Adam. [5pts] Adam is a widely used optimization algorithm which essentially combines the benefits of RMSprop and momentum. (The original paper is at https://arxiv.org/abs/1412.6980, but you don’t need to read it to solve this question.) Here is a slightly simplified version of the original algorithm. All arithmetic operations, such as squaring or division, are applied elementwise.
    $m_0 \leftarrow 0$
    $v_0 \leftarrow 0$
    $\theta_0 \leftarrow$ initial parameters
    While $\theta_t$ not converged:
        $g_t \leftarrow \nabla \mathcal{J}(\theta_{t-1})$
        $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
        $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
        $\theta_t \leftarrow \theta_{t-1} - \alpha_A\, m_t \big/ \big(\sqrt{v_t} + \epsilon_A\big)$
The hyperparameters of the algorithm are the learning rate $\alpha_A$, the moment timescales $\beta_1$ and $\beta_2$, and the damping term $\epsilon_A$. The $A$ subscript stands for “Adam,” to distinguish these hyperparameters from those of the other algorithms discussed in this question.
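As a concrete reference, here is a direct numpy transcription of the update above (a sketch; the function name and interface are ours, and the bias correction from the original paper is omitted, matching the simplified pseudocode):

    import numpy as np

    def adam_step(grad_fn, theta, m, v, alpha_A, beta1, beta2, eps_A):
        # All operations are elementwise, as in the pseudocode.
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        theta = theta - alpha_A * m / (np.sqrt(v) + eps_A)
        return theta, m, v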
Here is what you need to analyze:
(a) [1pt] Recall the RMSprop algorithm, rewritten here to match the notation of this question:
    $v_0 \leftarrow 0$
    $\theta_0 \leftarrow$ initial parameters
    While $\theta_t$ not converged:
        $g_t \leftarrow \nabla \mathcal{J}(\theta_{t-1})$
        $v_t \leftarrow \gamma v_{t-1} + (1 - \gamma)\, g_t^2$
        $\theta_t \leftarrow \theta_{t-1} - \alpha_R\, g_t \big/ \big(\sqrt{v_t} + \epsilon_R\big)$
The hyperparameters are $\alpha_R$, $\gamma$, and $\epsilon_R$. Specify Adam hyperparameters $(\alpha_A, \beta_1, \beta_2, \epsilon_A)$ which make Adam equivalent to RMSprop with hyperparameters $(\alpha_R, \gamma, \epsilon_R)$. You should explain your answer, though a full derivation isn’t required.
(b) [2pts] Now consider SGD with momentum:
    $p_0 \leftarrow 0$
    $\theta_0 \leftarrow$ initial parameters
    While $\theta_t$ not converged:
        $p_t \leftarrow \mu p_{t-1} - (1 - \mu)\, \nabla \mathcal{J}(\theta_{t-1})$
        $\theta_t \leftarrow \theta_{t-1} + \alpha_S\, p_t$
Specify Adam hyperparameters $(\alpha_A, \beta_1, \beta_2, \epsilon_A)$ which make Adam approximately equivalent to momentum SGD with parameters $(\alpha_S, \mu)$. Explain your answer.
(c) [2pts] An important fact about Adam is that it is invariant to rescaling of the loss function. I.e., suppose we have a loss function $\mathcal{L}(y, t)$, and we define a modified loss function as $\tilde{\mathcal{L}}(y, t) = C \mathcal{L}(y, t)$ for some positive constant $C$. Show that for $\epsilon_A = 0$, Adam is invariant to this rescaling, i.e. it passes through the same sequence of iterates $\theta_0, \ldots, \theta_T$.
Hint: Denote the quantities computed by Adam on the modified loss function as $\tilde{g}_t$, $\tilde{m}_t$, etc. Use induction to find relationships between these and the original $g_t$, $m_t$, etc.
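As a quick numerical sanity check of the claim (ours; it does not replace the induction proof the question asks for), one can run the simplified Adam with $\epsilon_A = 0$ on a loss and on 100 times that loss, and observe identical iterates:

    import numpy as np

    def run_adam(grad_fn, theta0, steps=20, alpha=0.1, b1=0.9, b2=0.999):
        theta, m, v = theta0, np.zeros_like(theta0), np.zeros_like(theta0)
        for _ in range(steps):
            g = grad_fn(theta)
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g ** 2
            theta = theta - alpha * m / np.sqrt(v)   # eps_A = 0
        return theta

    grad = lambda th: 2.0 * (th - 1.0)           # gradient of (theta - 1)^2
    grad_scaled = lambda th: 100.0 * grad(th)    # gradient of the scaled loss
    theta0 = np.array([5.0, -2.0])
    print(run_adam(grad, theta0))
    print(run_adam(grad_scaled, theta0))         # identical iterates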