Question 1: Neural Network [60 points, evenly among parts]
We will build a neural network to determine if a bank note is authentic. We have obtained a data set (https://archive.ics.uci.edu/ml/datasets/banknote+authentication) whose features are properties of a wavelet-transformed bank note image: variance, skewness, kurtosis and entropy. The data set has already been processed, and three csv files (train.csv, eval.csv and test.csv) can be downloaded from the course website. Each line in these files has four real-valued features, which we will denote $x = (x_1, x_2, x_3, x_4)$, and the class label $y$, which is either 0 or 1.
Our neural network architecture is shown below:
[Figure 1: Neural Network Architecture. The input layer (layer 1) consists of $x_1, x_2, x_3, x_4$ and a constant bias unit; the hidden layer (layer 2) has two units plus a bias unit; the output layer (layer 3) has a single unit producing $y$. A few weights, such as $w_{10}^{(2)}$, $w_{11}^{(2)}$, $w_{23}^{(2)}$, $w_{24}^{(2)}$, $w_{10}^{(3)}$, $w_{11}^{(3)}$ and $w_{12}^{(3)}$, are labeled in the diagram.]
There is one hidden layer with two hidden units, and one output layer with a single output unit. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. Each layer also has a constant bias input of 1 with a corresponding weight. The weight from unit $j$ in layer $l-1$ to unit $k$ in layer $l$ is denoted $w_{kj}^{(l)}$; for bias nodes, $j$ is taken to be 0. This convention is demonstrated for some of the weights in Figure 1. We will use the sigmoid activation function for all units. Recall that the sigmoid activation is given by
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
This neural network is fully defined by the 13 weights $w_{10}^{(2)}, \ldots, w_{14}^{(2)}, w_{20}^{(2)}, \ldots, w_{24}^{(2)}, w_{10}^{(3)}, w_{11}^{(3)}$ and $w_{12}^{(3)}$.
Your program will be run as

$java NeuralNet FLAG [args]

where the optional arguments are real valued (use the `double' data type for them).
Forward Propagation:
We will first focus on making predictions given fixed weights.
Recall that in a neural network, any unit $j$ first collects input from the units in the layer below to produce $z_j^{(l)}$, then applies a nonlinear activation function to produce an activation $a_j^{(l)}$:
$$z_j^{(l)} = \sum_{i : i \to j} a_i^{(l-1)} w_{ji}^{(l)}, \qquad a_j^{(l)} = g\bigl(z_j^{(l)}\bigr) = \frac{1}{1 + e^{-z_j^{(l)}}}$$
Note that the activations of the input layer are the inputs themselves: $a_j^{(1)} = x_j$, and $a_0^{(1)} = 1$ is the bias unit.
When FLAG=100, arg1 ... arg13 are the weights $w_{10}^{(2)}, \ldots, w_{14}^{(2)}, w_{20}^{(2)}, \ldots, w_{24}^{(2)}, w_{10}^{(3)}, w_{11}^{(3)}$ and $w_{12}^{(3)}$, and arg14 = $x_1$, arg15 = $x_2$, arg16 = $x_3$, arg17 = $x_4$. Print the activations of the hidden layer units ($a_1^{(2)}$, $a_2^{(2)}$) on line one, separated by a space, followed by the activation of the output layer unit ($a_1^{(3)}$) on line two. When printing, show 5 digits after the decimal point by rounding (but do not round the actual variables). For example,
$java NeuralNet 100 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 1 0 1 0
0.66819 0.86989
0.80346
$java NeuralNet 100 .021 .247 .35879 .1414 .75 .512 .686 .717 .818 .919 .029 .135 .20701 1 0 1 0
0.60094 0.88247
0.57268
$java NeuralNet 100 0 .2 .3 .4 .5 0 .6 .7 .8 .9 0 .5 .2 0 0 0 0
0.50000 0.50000
0.58662
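The forward pass for this fixed architecture is small enough to write out directly. Below is a minimal sketch in Java, assuming the 13 weights and 4 inputs have already been parsed from the command line; the class name and array layout (ForwardSketch, w2, w3) are chosen here only for illustration and are not prescribed by the assignment.

// Minimal forward-propagation sketch for this 2-hidden-unit, 1-output network.
// w2 holds the hidden-layer weights {w10^(2)..w14^(2)} and {w20^(2)..w24^(2)} row by row,
// w3 holds {w10^(3), w11^(3), w12^(3)}; x holds {x1, x2, x3, x4}.
public class ForwardSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Returns {a1^(2), a2^(2), a1^(3)}.
    static double[] forward(double[][] w2, double[] w3, double[] x) {
        double[] a2 = new double[2];
        for (int j = 0; j < 2; j++) {
            double z = w2[j][0];              // bias weight times a0^(1) = 1
            for (int i = 0; i < 4; i++) {
                z += w2[j][i + 1] * x[i];     // remaining weights times the four inputs
            }
            a2[j] = sigmoid(z);
        }
        double z3 = w3[0] + w3[1] * a2[0] + w3[2] * a2[1];
        return new double[] { a2[0], a2[1], sigmoid(z3) };
    }

    public static void main(String[] args) {
        // Values from the first FLAG=100 example above.
        double[][] w2 = { { .1, .2, .3, .4, .5 }, { .5, .6, .7, .8, .9 } };
        double[] w3 = { .9, .5, .2 };
        double[] x = { 1, 0, 1, 0 };
        double[] a = forward(w2, w3, x);
        System.out.printf("%.5f %.5f%n%.5f%n", a[0], a[1], a[2]);
    }
}

Running the main method reproduces the output of the first FLAG=100 example above.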
Tip to handle command-line arguments:
Since there are a lot of command-line arguments in this homework, you can use the following trick to avoid typing the arguments each time you run the program and instead load them from a file directly.
$java NeuralNet `< 100.args`
where 100.args is a file that contains the command-line arguments on a single line separated by spaces.
Also note that the quote characters are back-ticks. This doesn't work with normal single quotes.
Back Propagation:
To learn the weights of the neural network, we first store all the activations computed using forward propagation. Then, we propagate $\delta$ terms in a backwards manner to calculate the gradient of the error with respect to each weight. Once we have the gradients, we can use gradient descent to update the weights to minimize the error.
Given a training example $x = (x_1, x_2, x_3, x_4)$ and its label $y$, the error made by the neural network on the example is defined as

$$E_x = \frac{1}{2}\bigl(a_1^{(3)} - y\bigr)^2$$
The partial derivative of error with respect to the output activation z1(3) from chain rule is
@Ex
(3)
@Ex @a1(3)
= 1
=
@z1(3)
@a1(3) @z1(3)
Now,
@Ex
@(
1
(a1(3) y)2)
1
(3)
(3)
2
=
=
2(a1
y) = a1
y
@a(3)
@a(3)
2
1
1
and
@a1(3)
@g(z1(3))
(3)
(3)
=
= a1 (1
a1
)
@z(3)
@z(3)
1
1
So, we have
1(3) = (a(3)1 y)a(3)1(1 a(3)1)
When FLAG=200, arg1 ... arg13 are the weights $w_{10}^{(2)}, \ldots, w_{14}^{(2)}, w_{20}^{(2)}, \ldots, w_{24}^{(2)}, w_{10}^{(3)}, w_{11}^{(3)}$ and $w_{12}^{(3)}$, arg14 = $x_1$, arg15 = $x_2$, arg16 = $x_3$, arg17 = $x_4$, and arg18 = $y$. Print a single number, $\delta_1^{(3)}$, with 5 digits after the decimal point. For example,
$java NeuralNet 200 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 1 0 1 0 1
-0.03104
$java NeuralNet 200 -.1 .2 -.3 -.4 .5 -.5 -.6 -.7 .8 -.9 -.9 .5 -.2 -1 0 -1 0 1
-0.14814
$java NeuralNet 200 .101 .809 .31 .9 .13 .55 .66 .12 .31 .1 .92 .05 .22 10 0 11.1 0.01 1
-0.04172
The partial derivative of the error with respect to the hidden layer activation units can be computed similarly using the chain rule. For hidden unit $j$:

$$\delta_j^{(2)} = \delta_1^{(3)} \, w_{1j}^{(3)} \, a_j^{(2)}\bigl(1 - a_j^{(2)}\bigr)$$
FLAG=300 has the same arguments as FLAG=200. Print the two numbers $\delta_1^{(2)}$ and $\delta_2^{(2)}$ separated by a space. For example,
$java NeuralNet 300 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 1 0 1 0 1
-0.00344 -0.00070
$java NeuralNet 300 -.1 .2 -.3 -.4 .5 -.5 -.6 -.7 .8 -.9 -.9 .5 -.2 -1 0 -1 0 1
-0.01847 0.00657
$java NeuralNet 300 .101 .809 .31 .9 .13 .55 .66 .12 .31 .1 .92 .05 .22 10 0 11.1 0.01 1
-0.00000 -0.00000
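The $\delta$ formulas above translate directly into code. Below is a minimal sketch, assuming the activations from a forward pass are already available; the method names and array layout are illustrative only, not part of the required program.

// Sketch of the delta terms used by FLAG=200 and FLAG=300.
public class DeltaSketch {

    // delta_1^(3) = (a1^(3) - y) * a1^(3) * (1 - a1^(3))
    static double outputDelta(double a3, double y) {
        return (a3 - y) * a3 * (1.0 - a3);
    }

    // delta_j^(2) = delta_1^(3) * w_{1j}^(3) * a_j^(2) * (1 - a_j^(2)), for j = 1, 2
    static double[] hiddenDeltas(double delta3, double[] w3, double[] a2) {
        double[] d = new double[2];
        for (int j = 0; j < 2; j++) {
            d[j] = delta3 * w3[j + 1] * a2[j] * (1.0 - a2[j]);  // w3[0] is the bias weight
        }
        return d;
    }

    public static void main(String[] args) {
        // Activations from the first FLAG=100 example (rounded), with y = 1.
        double[] a2 = { 0.66819, 0.86989 };
        double a3 = 0.80346;
        double[] w3 = { .9, .5, .2 };
        double d3 = outputDelta(a3, 1.0);
        double[] d2 = hiddenDeltas(d3, w3, a2);
        System.out.printf("%.5f%n", d3);                 // approx. -0.03104
        System.out.printf("%.5f %.5f%n", d2[0], d2[1]);  // approx. -0.00344 -0.00070
    }
}

With the rounded activations of the first example above, the printed values match the first FLAG=200 and FLAG=300 examples up to rounding.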
We now have all the information we need to compute the gradient of the error with respect to the edge weights. The partial derivative with respect to an edge weight is

$$\frac{\partial E_x}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} \, a_k^{(l-1)}$$
FLAG=400 also has the same arguments as FLAG=200. Print

$$\frac{\partial E_x}{\partial w_{10}^{(3)}} \;\; \frac{\partial E_x}{\partial w_{11}^{(3)}} \;\; \frac{\partial E_x}{\partial w_{12}^{(3)}}$$

on line 1,

$$\frac{\partial E_x}{\partial w_{10}^{(2)}} \;\; \frac{\partial E_x}{\partial w_{11}^{(2)}} \;\; \ldots \;\; \frac{\partial E_x}{\partial w_{14}^{(2)}}$$

on line 2, and

$$\frac{\partial E_x}{\partial w_{20}^{(2)}} \;\; \frac{\partial E_x}{\partial w_{21}^{(2)}} \;\; \ldots \;\; \frac{\partial E_x}{\partial w_{24}^{(2)}}$$

on line 3, each separated by a space. For example,
$java NeuralNet 400 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 1 0 1 0 1
-0.03104 -0.02074 -0.02700
-0.00344 -0.00344 -0.00000 -0.00344 -0.00000
-0.00070 -0.00070 -0.00000 -0.00070 -0.00000
$java NeuralNet 400 -.1 .2 -.3 -.4 .5 -.5 -.6 -.7 .8 -.9 -.9 .5 -.2 -1 0 -1 0 1
-0.14814 -0.07777 -0.04916
-0.01847 0.01847 -0.00000 0.01847 -0.00000
0.00657 -0.00657 0.00000 -0.00657 0.00000
$java NeuralNet 400 .101 .809 .31 .9 .13 .55 .66 .12 .31 .1 .92 .05 .22 10 0 11.1 0.01 1
-0.04172 -0.04172 -0.04172
-0.00000 -0.00000 -0.00000 -0.00000 -0.00000
-0.00000 -0.00000 -0.00000 -0.00000 -0.00000
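Assembling the FLAG=400 gradients is then just each $\delta$ times the appropriate activation. Below is a minimal sketch with illustrative helper names, assuming the $\delta$ values and activations have already been computed.

// Sketch of the weight gradients: dE/dw_{jk}^(l) = delta_j^(l) * a_k^(l-1).
public class GradientSketch {

    // Gradients for the three output-layer weights w10^(3), w11^(3), w12^(3).
    static double[] outputGradients(double delta3, double[] a2) {
        return new double[] { delta3 * 1.0, delta3 * a2[0], delta3 * a2[1] };  // a0^(2) = 1 (bias)
    }

    // Gradients for the five hidden-layer weights wj0^(2) .. wj4^(2) of hidden unit j.
    static double[] hiddenGradients(double delta2j, double[] x) {
        double[] g = new double[5];
        g[0] = delta2j * 1.0;           // bias input a0^(1) = 1
        for (int i = 0; i < 4; i++) {
            g[i + 1] = delta2j * x[i];  // a_i^(1) = x_i
        }
        return g;
    }

    public static void main(String[] args) {
        // Rounded delta values from the first FLAG=200/300 examples, inputs x = (1, 0, 1, 0).
        double delta3 = -0.03104;
        double[] delta2 = { -0.00344, -0.00070 };
        double[] a2 = { 0.66819, 0.86989 };
        double[] x = { 1, 0, 1, 0 };
        double[] g3 = outputGradients(delta3, a2);
        System.out.printf("%.5f %.5f %.5f%n", g3[0], g3[1], g3[2]);
        for (double[] g : new double[][] { hiddenGradients(delta2[0], x), hiddenGradients(delta2[1], x) }) {
            System.out.printf("%.5f %.5f %.5f %.5f %.5f%n", g[0], g[1], g[2], g[3], g[4]);
        }
    }
}

With the rounded inputs shown, the three printed lines reproduce the first FLAG=400 example above up to rounding.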
Now we perform stochastic gradient descent to train the neural network on the given bank note authentication data set. To do so, for each training example, we first compute the activations of all the units using forward propagation. We then compute the $\delta$ terms for each hidden and output unit and compute the gradients of the error with respect to each weight using backward propagation. We then update the weights as follows:

$$w_{jk}^{(l)} = w_{jk}^{(l)} - \alpha \, \frac{\partial E_x}{\partial w_{jk}^{(l)}}$$

where $\alpha$ is the chosen learning rate.
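The update rule above is a single pass over all 13 weights. A minimal sketch, assuming the gradients for the current training example are stored in arrays mirroring the weight layout used in the earlier sketches (that layout is an illustrative choice):

// One stochastic gradient descent step: w := w - alpha * dE/dw for every weight.
public class SgdStepSketch {
    // w2 holds the 10 hidden-layer weights (2 rows of 5), w3 the 3 output-layer weights;
    // grad2 and grad3 hold the matching gradients for the current training example.
    static void sgdStep(double[][] w2, double[] w3,
                        double[][] grad2, double[] grad3, double alpha) {
        for (int j = 0; j < 2; j++) {
            for (int k = 0; k < 5; k++) {
                w2[j][k] -= alpha * grad2[j][k];
            }
        }
        for (int k = 0; k < 3; k++) {
            w3[k] -= alpha * grad3[k];
        }
    }
}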
We will use the training set "train.csv" for training the network. We will compute the error on the evaluation set "eval.csv" by summing up the error on each example from the set as follows:

$$E_{\text{eval}} = \sum_{x \in \text{Eval}} E_x = \sum_{x \in \text{Eval}} \frac{1}{2}\bigl(a_1^{(3)} - y\bigr)^2$$
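One possible way to compute $E_{\text{eval}}$ is to run the forward pass on every line of eval.csv and accumulate the squared error. The sketch below assumes comma-separated lines of four features followed by the label, and reuses the forward(...) helper from the earlier ForwardSketch; the file-handling details may differ in your own program.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sum 1/2 * (a1^(3) - y)^2 over every example in the evaluation set.
public class EvalErrorSketch {
    static double evalError(String csvPath, double[][] w2, double[] w3) throws IOException {
        double error = 0.0;
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split(",");
                double[] x = new double[4];
                for (int i = 0; i < 4; i++) {
                    x[i] = Double.parseDouble(parts[i]);
                }
                double y = Double.parseDouble(parts[4]);
                double a3 = ForwardSketch.forward(w2, w3, x)[2];  // output activation on this example
                error += 0.5 * (a3 - y) * (a3 - y);
            }
        }
        return error;
    }
}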
When FLAG=500, arg1 ... arg13 are the initial weights $w_{10}^{(2)}, \ldots, w_{14}^{(2)}, w_{20}^{(2)}, \ldots, w_{24}^{(2)}, w_{10}^{(3)}, w_{11}^{(3)}$ and $w_{12}^{(3)}$, and arg14 = $\alpha$. Print two lines for each training example in the order of their appearance (we shall ignore the random selection of training examples used by the actual stochastic gradient descent algorithm and use the file order instead):

- the updated weights $w_{10}^{(2)}, \ldots, w_{14}^{(2)}, w_{20}^{(2)}, \ldots, w_{24}^{(2)}, w_{10}^{(3)}, w_{11}^{(3)}, w_{12}^{(3)}$
- the evaluation set error $E_{\text{eval}}$ after the update
For example (the first output line for every training example may wrap as it is too long to fit on a single line; please refer to the test cases provided for exact outputs):
$java NeuralNet 500 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 .1
0.10020 0.19910 0.29883 0.40218 0.49989 0.50006 0.59973 0.69965 0.80065 0.89997 0.90277 0.50228 0.20243
38.14430
0.10071 0.19860 0.30025 0.40156 0.49910 0.50026 0.59953 0.70022 0.80040 0.89965 0.90709 0.50388 0.20403
38.22326
0.09969 0.19466 0.29375 0.40403 0.50010 0.50024 0.59942 0.70005 0.80047 0.89968 0.89494 0.49427 0.19201
37.84961
0.09986 0.19411 0.29245 0.40606 0.50000 0.50028 0.59928 0.69970 0.80100 0.89965 0.89770 0.49662 0.19451
37.91446
0.10012 0.19392 0.29335 0.40517 0.49912 0.50032 0.59925 0.69982 0.80088 0.89953 0.90327 0.49720 0.19469
37.97933
...
0.08324 -0.17301 -0.08162 0.31971 0.54455 -0.05914 1.40048 0.86019 0.85080 0.71119 0.99056 0.36316 -2.41299
7.88486
0.08293 -0.17443 -0.08069 0.31889 0.54418 -0.05913 1.40053 0.86016 0.85083 0.71120 0.98618 0.35992 -2.41737
7.88083
Now our neural network is trained. We will use the trained neural network to make predictions on the test set "test.csv" and compute the test set accuracy. Accuracy is defined as the fraction of examples correctly predicted. First, train your neural network using the given initial weights and learning rate $\alpha$.
FLAG=600 has the same arguments as FLAG=500. For each example in "test.csv", use the trained neural network weights and print the actual label, the predicted label and the confidence of the prediction on one line. The confidence of a prediction is the value of the activation of the output unit. Consider the predicted label to be 0 when this confidence is less than or equal to 0.5; otherwise predict 1. At the end, print the test set accuracy with 2 digits after the decimal point.
$java NeuralNet 600 .1 .2 .3 .4 .5 .5 .6 .7 .8 .9 .9 .5 .2 .1
1 1 0.54785
0 0 0.21332
0 1 0.63250
0 0 0.24230
0 0 0.21203
...
0 0 0.19786
0 0 0.19522
0.93
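A sketch of this prediction loop, reusing the earlier forward(...) helper and applying the 0.5 threshold described above; the class and method names are illustrative only, and the CSV layout (four features then the label, comma-separated) is assumed as before.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Predict each test example and report the fraction classified correctly.
public class PredictSketch {
    static void predictAndScore(String csvPath, double[][] w2, double[] w3) throws IOException {
        int correct = 0, total = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split(",");
                double[] x = new double[4];
                for (int i = 0; i < 4; i++) {
                    x[i] = Double.parseDouble(parts[i]);
                }
                int y = (int) Double.parseDouble(parts[4]);
                double confidence = ForwardSketch.forward(w2, w3, x)[2];
                int prediction = confidence <= 0.5 ? 0 : 1;  // predict 1 only when confidence > 0.5
                System.out.printf("%d %d %.5f%n", y, prediction, confidence);
                if (prediction == y) correct++;
                total++;
            }
        }
        System.out.printf("%.2f%n", (double) correct / total);  // test set accuracy
    }
}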