Problem 1. Write a function that evaluates the trained network (5 points), as well as computes all the subgradients of W1 and W2 using backpropagation (5 points)
Evaluation (5 points)
Algorithm 1 : Evaluation
class SigmoidCrossEntropy(object):
    def crossEntropy(self, x, y, w1, w2, l2_penalty=0.0):
        # average cross entropy loss over the batch
        E = -np.sum(y * np.log(x) + (1.0 - y) * np.log(1.0 - x)) / y.shape[0]
        # L2 regularization term (consistent with the l2_penalty * W gradient used in update)
        E += 0.5 * l2_penalty * (np.linalg.norm(w1) ** 2 + np.linalg.norm(w2) ** 2)
        return E

    def evaluate(self, x, y, w1, w2, l2_penalty=0.0):
        performance = []
        prob = self.sigmoid(x)                               # P(y=1)
        E = self.crossEntropy(prob, y, w1, w2, l2_penalty)   # objective loss
        performance.append(E)
        y_hat = 1 * (prob >= 0.5)                            # class prediction
        accuracy = 1 - np.sum(y_hat != y) / float(y.shape[0])  # 1 - error rate
        performance.append(accuracy)
        return performance
For evaluation, the cross entropy loss function is used to measure prediction error, and the error rate is also computed as the accuracy metric, i.e., the ratio of correct classifications.
$$E = -\big(y \log z_2 + (1 - y)\log(1 - z_2)\big)$$
$$\text{accuracy} = 1 - \text{error rate} = \frac{\text{number of correct classifications}}{\text{number of examples}}$$
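As a quick sanity check of these two formulas, here is a small sketch with made-up numbers (three predictions, NumPy only); it is illustrative and not part of the assignment code.

import numpy as np

# Toy check of the evaluation formulas; the values are invented.
prob = np.array([[0.9], [0.2], [0.6]])   # sigmoid outputs z2 = P(y=1)
y    = np.array([[1.0], [0.0], [0.0]])   # true labels

E = -np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))
y_hat = (prob >= 0.5).astype(int)
accuracy = np.mean(y_hat == y)
print(E)          # ~0.41
print(accuracy)   # ~0.67 (2 of 3 correct)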
Backpropagation (5 points)
Algorithm 2 : Backpropagation
class LinearTransform(object):
    def backward(self, grad_output):
        # gradient w.r.t. the input of the linear layer
        return np.dot(grad_output, self.W.T)

class ReLU(object):
    def backward(self, grad_output):
        gradient = (self.x > 0) * 1.0
        gradient[np.where(self.x == 0)] = 0.5   # subgradient in [0, 1] at x == 0
        return gradient * grad_output

class SigmoidCrossEntropy(object):
    def backward(self, grad_output):
        # combined sigmoid + cross entropy gradient w.r.t. the logits
        return (self.prob - self.y)

class MLP(object):
    def train(self, x_batch, y_batch, learning_rate, momentum, l2_penalty):
        ...
        # backpropagation
        gradient3 = self.SCE.backward(0)             # dE/df2
        gradient2 = self.LT2.backward(gradient3)     # dE/dg
        gradient1 = self.ReLUf.backward(gradient2)   # dE/df1
        # weight update
        delta_w2 = np.dot(z1.T, gradient3)
        delta_w1 = np.dot(x_batch.T, gradient1)
        self.LT2.update(delta_w2, learning_rate, momentum, l2_penalty)
        self.LT1.update(delta_w1, learning_rate, momentum, l2_penalty)
There are three gradient functions, one for each of the linear transform ($f$), ReLU ($g$), and sigmoid cross entropy ($E$) operations. The loss function can be written in terms of the feed-forward functions as below.
$$E = -\big(y \log \sigma(f_2) + (1 - y)\log(1 - \sigma(f_2))\big)$$
$$f_2 = W_2^T g + c, \qquad g = \max(0, f_1), \qquad f_1 = W_1^T x + b$$
The derivative of each function is implemented from its own differential formula. The derivative of the combined sigmoid cross entropy, gradient3, is
$$\frac{\partial E}{\partial f_2} = z_2 - y$$
The derivative of the linear transform with respect to $g$, gradient2, is
$$\frac{\partial f_2}{\partial g} = W_2$$
The derivative of the ReLU function, gradient1, is
$$\frac{\partial g}{\partial f_1} =
\begin{cases}
1, & f_1 > 0 \\
[0, 1], & f_1 = 0 \\
0, & f_1 < 0
\end{cases}$$
To calculate the delta of each weight vector, we compute $\frac{\partial E}{\partial W_2}$ and $\frac{\partial E}{\partial W_1}$ and update the weights:
$$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial f_2}\,\frac{\partial f_2}{\partial W_2} = (z_2 - y)\, g$$
$$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial f_2}\,\frac{\partial f_2}{\partial g}\,\frac{\partial g}{\partial f_1}\,\frac{\partial f_1}{\partial W_1} = (z_2 - y)\, W_2^T\, g'\, x$$
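These derivatives can be sanity-checked with a finite-difference gradient check on a tiny network. The sketch below is not part of the assignment code: it uses standalone functions (forward_loss, analytic_grads, numeric_grad are made-up names) and toy shapes, but follows the same forward pass and gradients as derived above.

import numpy as np

def forward_loss(x, y, W1, b, W2, c):
    # Forward pass: f1 = x.W1 + b, g = ReLU(f1), f2 = g.W2 + c, z2 = sigmoid(f2)
    f1 = x.dot(W1) + b
    g = np.maximum(0.0, f1)
    f2 = g.dot(W2) + c
    z2 = 1.0 / (1.0 + np.exp(-f2))
    E = -np.mean(y * np.log(z2) + (1 - y) * np.log(1 - z2))
    return E, (f1, g, z2)

def analytic_grads(x, y, W1, b, W2, c):
    # Backpropagation, mirroring gradient3 / gradient2 / gradient1 above
    _, (f1, g, z2) = forward_loss(x, y, W1, b, W2, c)
    d_f2 = (z2 - y) / y.shape[0]   # dE/df2 for the mean loss
    d_W2 = g.T.dot(d_f2)           # delta_w2
    d_g = d_f2.dot(W2.T)           # gradient2
    d_f1 = d_g * (f1 > 0)          # gradient1 (subgradient 0 used at f1 == 0)
    d_W1 = x.T.dot(d_f1)           # delta_w1
    return d_W1, d_W2

def numeric_grad(param, f, eps=1e-5):
    # Central finite differences, one entry at a time
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = param[idx]
        param[idx] = old + eps; plus = f()
        param[idx] = old - eps; minus = f()
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

if __name__ == '__main__':
    rng = np.random.RandomState(0)
    x = rng.randn(4, 3)
    y = rng.randint(0, 2, (4, 1)).astype(float)
    W1, b = rng.randn(3, 5) * 0.1, np.zeros(5)
    W2, c = rng.randn(5, 1) * 0.1, np.zeros(1)
    dW1, dW2 = analytic_grads(x, y, W1, b, W2, c)
    nW1 = numeric_grad(W1, lambda: forward_loss(x, y, W1, b, W2, c)[0])
    nW2 = numeric_grad(W2, lambda: forward_loss(x, y, W1, b, W2, c)[0])
    print(np.max(np.abs(dW1 - nW1)))   # should be ~1e-8 or smaller
    print(np.max(np.abs(dW2 - nW2)))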
Problem 2. Write a function that performs stochastic mini-batch gradient descent training (5 points). You may use the deterministic approach of permuting the sequence of the data. Use the momentum approach described in the course slides.
Stochastic mini-batch gradient descent training (5 points)
Algorithm 3 : Stochastic mini-batch gradient descent
if __name__ == '__main__':
    for epoch in xrange(num_epochs):
        # shuffle the example indices, then split them into mini batches
        randList = np.arange(num_examples)
        np.random.shuffle(randList)
        batches = randList.reshape((num_batches, int(num_examples / num_batches)))
        for b in xrange(num_batches):
            x_batch = train_x[batches[b], :]
            y_batch = train_y[batches[b], :]
            total_loss = mlp.train(x_batch, y_batch, lr, momentum, l2_penalty)
For stochastic mini-batch gradient descent training, we need to divide the whole set of examples into mini batches. In my implementation, I first randomly generate a list of indices, randList (instead of shuffling the examples themselves), then split that list into the defined number of batches. Each batch of examples is then processed in the order given by the shuffled index list. Note that the reshape above assumes the number of examples is evenly divisible by the number of batches; a sketch without that assumption is shown below.
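The following variation (not the implementation used in this report) keeps the same idea but uses np.array_split, so the last batches may simply be one example smaller; the surrounding names are reused from Algorithm 3.

import numpy as np

# Sketch: mini-batch indices without assuming an even split
randList = np.arange(num_examples)
np.random.shuffle(randList)
for batch_idx in np.array_split(randList, num_batches):
    x_batch = train_x[batch_idx, :]
    y_batch = train_y[batch_idx, :]
    total_loss = mlp.train(x_batch, y_batch, lr, momentum, l2_penalty)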
Momentum (5 points)
Algorithm 4 : Momentum
class LinearTransform(object):
    def update(self, delta, learning_rate=1.0, momentum=0.0, l2_penalty=0.0):
        # add the L2 regularization gradient, then take a momentum step
        regularization = l2_penalty * self.W
        delta = delta + regularization
        self.velocity = momentum * self.velocity - learning_rate * delta
        self.W += self.velocity
Whenever the weights are updated for a batch, I apply the momentum factor, together with the learning rate, to control the size of the weight changes; the corresponding update rule is written out below.
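For reference, the update implemented above corresponds to the standard momentum rule with the L2 term folded into the gradient, where $\Delta W$ is the backpropagated gradient, $\mu$ the momentum, $\eta$ the learning rate, and $\lambda$ the L2 penalty:
$$v \leftarrow \mu\, v - \eta\,\big(\Delta W + \lambda W\big), \qquad W \leftarrow W + v$$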
Problem 3-6. 3) Train the network on all the training examples, tune your parameters (number of hidden units, learning rate, mini-batch size, momentum) until you reach a good performance on the testing set. What accuracy can you achieve? (20 points based on the report). 4) Training Monitoring: For each epoch in training, your function should evaluate the training objective, testing objective, training misclassification error rate, testing misclassification error rate (5 points). 5) Tuning Parameters: please create three figures with the following requirements. Save them into jpg format:
test accuracy with different batch sizes: batch-test accuracy.png
test accuracy with different learning rates: lr-test accuracy.png
test accuracy with different numbers of hidden units: hidden units-test accuracy.png
6) Discussion about the performance of your neural network.
I first tuned the learning rate, which is the most important parameter for reaching a local minimum, and then tuned the mini-batch size, number of hidden units, momentum, and L2 penalty, in that order, to train the model. In each section, I list the range of the parameter being tested in [...], along with the fixed values of the other parameters. I used 100 epochs for all experiments.
Tuning learning rate
learning rate = [1e-06, 5e-06, 1e-05, 5e-05, 1e-04]
num batches = 1000
hidden units = 10
momentum = 0.8
l2 penalty = 0.001
Figure 1: Train Loss Figure 2: Train Accuracy
Figure 3: Test Loss Figure 4: Test Accuracy
Analysis: Learning rates higher than 0.0001 were excluded from this experiment after I observed that they fluctuated without converging. So learning rates at or below 0.0001 allow the model to reach a local minimum, and I tested which value is most effective for obtaining high accuracy. In the train loss graph (Fig. 1), we see that smaller learning rates converge more slowly. We also see that the learning rates 0.0001 and 5e-05 converge on the training data (Fig. 1), but both produce unstable test loss and accuracy (Fig. 3-4). Therefore, I chose 1e-05 as the learning rate in my model, because it lets the model converge stably and gives high test accuracy.
Tuning mini-batch size
num batches = [5, 10, 50, 100, 500]
Figure 5: Train Loss Figure 6: Train Accuracy
Figure 7: Test Loss Figure 8: Test Accuracy
learning rate = 1e-05
hidden units = 10
momentum = 0.8
l2 penalty = 0.001
Analysis: Surprisingly, the mini-batch setting does not significantly affect the loss or accuracy for either training or testing (Fig. 5-8). Rather, it influences runtime, since it determines the size of the matrix computations. As shown in Table 1, extreme settings such as 10 or 1000 require more computation time: a setting of 10 involves matrix computations over 1000 examples per batch, while a setting of 1000 performs many more update iterations even though each batch only contains 10 examples. I think this experiment shows the strength of the stochastic mini-batch approach: instead of learning from all examples at once, we can learn from a subset in less time and still obtain reasonable results. Therefore, I chose a setting of 50, since it gives efficient runtime without significantly hurting test accuracy.
Mini Batch Size | Test Accuracy (%) | Time Cost (s)
10              | 80.45             | 477.2032
50              | 81.50             | 144.3120
100             | 81.25             | 148.8222
500             | 80.40             | 177.9812
1000            | 82.05             | 214.2329

Table 1: Test accuracy and time cost with different mini batch sizes
Tuning the number of hidden units
hidden units = [5, 10, 50, 100, 1000]
learning rate = 1e-05
num batches = 50
momentum = 0.8
l2 penalty = 0.001
Figure 9: Train Loss Figure 10: Train Accuracy
Figure 11: Test Loss Figure 12: Test Accuracy
Analysis: The number of hidden layer units is the most influential parameter for obtaining higher test accuracy. From the experiments with different numbers of hidden units, we see that test accuracy keeps increasing as the number of hidden units increases (Fig. 12). This suggests that this image classification task can reach higher accuracy with a more sophisticated network model. However, a large number of hidden units significantly hurts runtime, and beyond a certain point, adding hidden units no longer improves test accuracy. Therefore, we need to choose the number of hidden units carefully, considering both computing power and the amount of improvement. In my experiments, 500 would be a good choice for test accuracy if the computing resources are available; otherwise, 50 units still give test accuracy close to that of 500, so I chose 50 hidden units for the rest of the training.
Number of Hidden Units | Test Accuracy (%) | Time Cost (s)
10                     | 81.20             | 144.2799
50                     | 83.35             | 701.0299
100                    | 83.75             | 1080.9208
500                    | 84.45             | 3025.8877
1000                   | 84.30             | 5130.4885

Table 2: Test accuracy and time cost with different numbers of hidden units
Tuning momentum
momentum = [0.0, 0.6, 0.7, 0.8, 0.9]
Figure 13: Train Loss Figure 14: Train Accuracy
Figure 15: Test Loss Figure 16: Test Accuracy
learning rate = 1e-05
num batches = 50
hidden units = 50
l2 penalty = 0.001
Analysis: Momentum is an important factor for controlling the weight changes and making the model converge faster, avoiding gradient descent oscillation, together with the learning rate. In my experiments, I tested momentum values of 0.0 and 0.6 through 0.9. The result graphs (Fig. 13-16) show that increasing the momentum up to 0.9 speeds up convergence for both training and testing. However, in Fig. 15, momentum values of 0.9 and 0.8 show a slight overfitting effect, with the test loss rising after a certain point. Therefore, I chose a momentum of 0.7, since it gives robust test accuracy and faster convergence.
Tuning l2 penalty
l2 penalty = [0.0, 0.001, 0.01, 1, 10]
learning rate = 1e-05
num batches = 50
hidden units = 50
momentum = 0.7
Analysis: The L2 penalty plays the role of preventing overfitting and increasing test accuracy. In my experiments, a very high penalty such as 10 is a poor choice because it hurts both train and test accuracy (Fig. 17-20). In fact, the other tested L2 penalty values show only very small differences in test accuracy and loss (Table 3). Although the improvement is marginal, I chose an L2 penalty of 1, which gives the highest accuracy among them.
L2 Penalty | Test Accuracy (%)
0.0        | 82.20
0.001      | 82.35
0.01       | 82.25
1          | 82.80
10         | 81.25

Table 3: Test accuracy with different L2 penalties
Figure 17: Train Loss Figure 18: Train Accuracy
Figure 19: Test Loss Figure 20: Test Accuracy
In conclusion, tuning the parameters to appropriate values is very important both for training the model in a reasonable amount of time and for obtaining higher test accuracy. Tuning the learning rate and momentum matters for guaranteeing convergence to a local minimum. Proper choices of mini-batch size and number of hidden units are also important for improving runtime and test accuracy. Finally, the number of hidden units and the L2 penalty should be chosen carefully to increase test accuracy while preventing overfitting.
Finally, I chose parameter values as below to train and evaluate my model.
learning rate = 1e-05
num batches = 50
hidden units = 50
momentum = 0.7
l2 penalty = 1
The performance of my neural network
* What accuracy can you achieve? 83.15%
I finally trained my model with the tuned parameters and obtained 83.15% test accuracy (although I could increase the accuracy up to 84.65% with 500 hidden units). Fig. 21 shows the accuracy on the train and test data; up to 100 epochs, overfitting did not occur. Fig. 22 shows the objective error (loss) on train and test; the train error is higher than the test error because of the regularization term. In sum, using the L2 penalty parameter (= 1), I can prevent overfitting while improving test accuracy and loss.
Figure 21: Train and Test Accuracy Figure 22: Train and Test Loss