Collaboration. Weekly homeworks are individual work. See the Course Information handout for detailed policies.
[4pts] AlexNet. For this question, you will first read the following paper:
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
This is a highly influential paper (over 45,000 citations on Google Scholar!) because it was one of the first papers to demonstrate impressive performance for a neural network on a modern computer vision benchmark. It generated lots of excitement both in academia and in the tech industry. The architecture presented in this paper is widely used today, and is known as "AlexNet", after the first author. Reading this paper will also help you review a lot of the important concepts from this class.
[3pts] They use a conv net architecture which has five convolution layers and three fully connected layers (one of which is the output layer). Your job is to count the number of units, the number of weights, and the number of connections in each layer. I.e., you should complete the following table:
                           # Units    # Weights    # Connections
  Convolution Layer 1
  Convolution Layer 2
  Convolution Layer 3
  Convolution Layer 4
  Convolution Layer 5
  Fully Connected Layer 1
  Fully Connected Layer 2
  Output Layer
You can ignore the pooling layers when doing these calculations, i.e. you don't need to consider the units in the pooling layers or the connections between convolution and pooling layers. You can also ignore the biases. Note that the paper gives you the answers for the numbers of units in the caption to Figure 2. Therefore, we won't mark the column for units, though you would benefit from trying to work it out yourself.
When counting the number of connections, we'll adopt the convention that when the input to a convolution layer is zero-padded, the connections to the dummy zero values count towards the total. (This is the most convenient way to do it, since it means the number of incoming connections is the same for each unit in a given layer.)
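As a sanity check on the table, all three counts for a convolution layer follow directly from its output volume and kernel shape. The sketch below is a hypothetical helper (not part of the assignment) that applies the conventions above — biases and pooling ignored, connections to zero-padding counted. It is shown on Convolution Layer 1, whose 55×55×96 output volume and 11×11×3 kernels come from the paper's Figure 2; note that layers split across the two GPUs (Layers 2, 4, and 5) have a reduced fan-in that this simple helper does not model.

```python
def conv_layer_counts(out_h, out_w, c_out, k, c_in):
    """Unit/weight/connection counts for one convolution layer.

    Biases and pooling layers are ignored, and connections to
    zero-padded inputs are counted, per the conventions above.
    """
    units = out_h * out_w * c_out        # one unit per output activation
    weights = k * k * c_in * c_out       # kernels are shared across positions
    connections = units * k * k * c_in   # each unit sees a k x k x c_in window
    return units, weights, connections

# Convolution Layer 1: 96 kernels of size 11 x 11 x 3 over the input image,
# producing a 55 x 55 x 96 output volume.
units, weights, connections = conv_layer_counts(55, 55, 96, 11, 3)
print(units, weights, connections)  # 290400 34848 105415200
```

The contrast between the weight and connection counts shows why weight sharing matters: the layer has over 100 million connections but only about 35 thousand distinct weights.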
[1pt] Now suppose you're working at a software company and want to use an architecture similar to AlexNet in a product. Your project manager gives you some additional instructions; for each of the following scenarios, based on your answers to Part 1, suggest a change to the architecture which will help achieve the desired objective. I.e., modify the sizes of one or more layers. (These scenarios are independent.)
You want to reduce the memory usage at test time so that the network can be run on a cell phone; this requires reducing the number of parameters for the network.
Your network will need to make very rapid predictions at test time. You want to reduce the number of connections, since there is approximately one add-multiply operation per connection.
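A useful observation for both scenarios (the layer shapes below are the standard AlexNet ones, so treat the exact numbers as an illustration): the parameters concentrate in the first fully connected layer, while the connections concentrate in the convolution layers, so the two objectives point at different layers.

```python
# Where AlexNet-style costs concentrate (biases ignored). Layer shapes
# follow the paper; conv layer 2's fan-in is halved by two-GPU grouping.

# Fully Connected Layer 1: flattened 6 x 6 x 256 input into 4096 units.
fc1_weights = 6 * 6 * 256 * 4096                    # ~37.7M parameters

# Convolution Layer 2: 27 x 27 x 256 output volume, 5 x 5 kernels over
# a 48-channel (grouped) input.
conv2_connections = (27 * 27 * 256) * (5 * 5 * 48)  # ~224M connections

print(f"{fc1_weights:,}")         # 37,748,736
print(f"{conv2_connections:,}")   # 223,948,800
```

So shrinking the first fully connected layer is the natural lever for the memory scenario, while shrinking the convolution layers (fewer or smaller kernels, or larger strides) is the natural lever for the speed scenario.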
[5pts] Gaussian Naïve Bayes. In this question, you will derive the maximum likelihood estimates for Gaussian naïve Bayes, which is just like the naïve Bayes model from lecture, except that the features are continuous, and the conditional distribution of each feature given the class is (univariate) Gaussian rather than Bernoulli. Start with the following generative model for a discrete class label $y \in \{1, 2, \dots, K\}$ and a real-valued vector of $D$ features $\mathbf{x} = (x_1, x_2, \dots, x_D)$:
$$p(y = k) = \pi_k \tag{1}$$

$$p(\mathbf{x} \mid y = k;\, \boldsymbol{\mu}, \boldsymbol{\sigma}) = \left( \prod_{i=1}^{D} 2\pi\sigma_i^2 \right)^{-1/2} \exp\left\{ -\sum_{i=1}^{D} \frac{1}{2\sigma_i^2} \left( x_i - \mu_{ki} \right)^2 \right\} \tag{2}$$
where $\pi_k$ is the prior probability of class $k$, $\sigma_i^2$ is the variance of feature $i$ (shared between all classes), and $\mu_{ki}$ is the mean of feature $i$ conditioned on class $k$. We write $\boldsymbol{\pi}$ to represent the vector with elements $\pi_k$, and similarly $\boldsymbol{\sigma}$ is the vector of variances. The matrix of class means is written $\boldsymbol{\mu}$, where the $k$th row of $\boldsymbol{\mu}$ is the mean for class $k$.
(a) [1pt] Use Bayes' rule to derive an expression for $p(y = k \mid \mathbf{x};\, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$. Hint: use the law of total probability to derive an expression for $p(\mathbf{x} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$.
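As a starting point, Bayes' rule combined with the law of total probability gives the generic template below; the model-specific densities from equations (1) and (2) still need to be substituted in and simplified.

```latex
p(y = k \mid \mathbf{x};\, \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})
  = \frac{p(\mathbf{x} \mid y = k;\, \boldsymbol{\mu}, \boldsymbol{\sigma})\, \pi_k}
         {\sum_{j=1}^{K} p(\mathbf{x} \mid y = j;\, \boldsymbol{\mu}, \boldsymbol{\sigma})\, \pi_j}
```

The denominator is exactly the $p(\mathbf{x} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$ that the hint asks you to derive.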
(b) [1pt] Write down an expression for the negative log-likelihood (NLL)

$$\ell(\boldsymbol{\theta}; \mathcal{D}) = -\log p(y^{(1)}, \mathbf{x}^{(1)}, y^{(2)}, \mathbf{x}^{(2)}, \dots, y^{(N)}, \mathbf{x}^{(N)} \mid \boldsymbol{\theta}) \tag{3}$$

of a particular dataset $\mathcal{D} = \{(y^{(1)}, \mathbf{x}^{(1)}), (y^{(2)}, \mathbf{x}^{(2)}), \dots, (y^{(N)}, \mathbf{x}^{(N)})\}$ with parameters $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\sigma}\}$. (Assume the data are i.i.d.)
(c) [2pts] Take partial derivatives of the likelihood with respect to each of the parameters $\mu_{ki}$ and with respect to the shared variances $\sigma_i^2$. Based on this, find the maximum likelihood estimates for $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. You may assume that each class appears at least once in the dataset.
CSC411 Homework 4
(d) [1pt] Show that the MLE for $\pi_k$ is given by the following equation:

$$\hat{\pi}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y^{(i)} = k] \tag{4}$$
You may assume that each class appears at least once. You will find it helpful to read about Lagrange multipliers: https://en.wikipedia.org/wiki/Lagrange_multiplier
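Once you have derived the estimates in parts (c) and (d), you can check them numerically. The sketch below is a hypothetical NumPy helper (not part of the assignment) that assumes labels encoded as $0, \dots, K-1$ and implements estimates of the usual sample-average form — class frequencies for $\boldsymbol{\pi}$, per-class feature means for $\boldsymbol{\mu}$, and squared deviations averaged over all examples for the shared $\boldsymbol{\sigma}$:

```python
import numpy as np

def gnb_mle(X, y, K):
    """Candidate MLE formulas for the Gaussian naive Bayes model above:
    class priors pi_k, per-class means mu_{ki}, and variances sigma_i^2
    shared across classes. X is (N, D); y holds labels in {0, ..., K-1}."""
    N, D = X.shape
    pi = np.bincount(y, minlength=K) / N   # fraction of examples in class k
    mu = np.stack([X[y == k].mean(axis=0) for k in range(K)])
    sigma2 = ((X - mu[y]) ** 2).mean(axis=0)  # deviations from each example's
    return pi, mu, sigma2                     # own class mean, pooled over N

# Toy check: two classes, two features.
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0, 0, 1, 1])
pi, mu, sigma2 = gnb_mle(X, y, 2)
print(pi)      # [0.5 0.5]
print(mu)      # [[1. 2.] [5. 6.]]
print(sigma2)  # [1. 1.]
```

Comparing this output against a brute-force minimization of your NLL from part (b) is a good end-to-end check of the derivation.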