In Assignment 5, Q3 (bonus question), you were asked to create a classification model to detect duplicate questions. Now let's try the same problem using a deep learning approach.
You'll need 'quora_duplicate_question_500.csv' for this assignment. This dataset is in the following format:
q1                                                  q2                                                  is_duplicate
How do you take a screenshot on a Mac laptop?       How do I take a screenshot on my MacBook Pro? ...   1
Is the US election rigged?                          Was the US election rigged?                         1
How scary is it to drive on the road to Hana g...   Do I need a four-wheel-drive car to drive all ...   0
...                                                 ...                                                  ...
Create a function detect_duplicate() to detect duplicate questions as follows:
the input parameter is the full filename path to quora_duplicate_question_500.csv
convert q1 and q2 into padded sequences of numbers (see Exercise 5.2)
hold out 20% of the data for testing
carefully select hyperparameters, in particular the input sentence length, filter (kernel) sizes, number of filters, batch size, number of epochs, etc.
create a CNN model with the training data. Some hints:
Since you have a small dataset, consider using pre-trained word vectors
In your model, use the CNN to extract features from q1 and q2, and then predict whether they are duplicates based on these features. Your model may have the structure shown below.
print out accuracy, precision, recall, and AUC calculated from the testing data (a sketch of the preprocessing and evaluation steps is given after this list)
Your average precision, recall, accuracy, and AUC should all be about 70%.
If your results are lower than that (e.g. below 70%), you need to tune the hyperparameters.
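For illustration, a minimal sketch of the padding, train/test split, and metric-reporting steps might look like the following. The Tokenizer settings, the maximum sentence length of 35, and the helper names prepare_data / report_metrics are assumptions made here for illustration, not required choices.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 35   # assumed input sentence length; tune it as a hyperparameter

def prepare_data(datafile, test_size=0.2):
    # hypothetical helper, not part of the required interface
    df = pd.read_csv(datafile)

    # fit one tokenizer on both question columns so q1 and q2 share a vocabulary
    texts = df["q1"].astype(str).tolist() + df["q2"].astype(str).tolist()
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)

    # convert each question into a padded sequence of word indices
    q1 = pad_sequences(tokenizer.texts_to_sequences(df["q1"].astype(str)), maxlen=MAX_LEN)
    q2 = pad_sequences(tokenizer.texts_to_sequences(df["q2"].astype(str)), maxlen=MAX_LEN)
    y = df["is_duplicate"].values

    # hold out 20% of the rows for testing
    idx = np.arange(len(y))
    train_idx, test_idx = train_test_split(idx, test_size=test_size, random_state=42)
    return (tokenizer,
            (q1[train_idx], q2[train_idx], y[train_idx]),
            (q1[test_idx], q2[test_idx], y[test_idx]))

def report_metrics(y_true, y_prob):
    # threshold the predicted probabilities at 0.5 for the class-based metrics
    y_pred = (y_prob > 0.5).astype(int)
    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("auc      :", roc_auc_score(y_true, y_prob))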
This function has no return value. Besides your code, also provide a PDF document showing the following:
How you chose the hyperparameters
Model summary
Screenshots of model training history
Testing accuracy, precision, recall, and AUC
A few more notes about this assignment:
Due to the small sample size, the performance may vary in each round of training. Also, you may find that the performance does not improve much over the result of Assignment 5. Don't worry about this for now; we just use this example to practice building a deep learning model.
If you use pretrained word vectors, please describe which pretrained word vectors you chose (a sketch of loading pretrained vectors into an embedding layer is given after these notes). You don't need to submit the pretrained word vector files.
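If you do use pretrained vectors, one common approach is to build an embedding matrix from the pretrained file and feed it to a frozen Embedding layer. A rough sketch follows; the GloVe file name, the 300-dimensional vectors, and the helper name build_embedding_layer are assumptions for illustration only.

import numpy as np
from keras.layers import Embedding

EMBED_DIM = 300   # assumed dimension of the pretrained vectors

def build_embedding_layer(tokenizer, vec_path="glove.6B.300d.txt", max_len=35):
    # hypothetical helper; load pretrained vectors into a dict: word -> vector
    word_vectors = {}
    with open(vec_path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word_vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # build the embedding matrix following the tokenizer's word index;
    # words without a pretrained vector keep an all-zero row
    vocab_size = len(tokenizer.word_index) + 1
    matrix = np.zeros((vocab_size, EMBED_DIM))
    for word, i in tokenizer.word_index.items():
        if word in word_vectors:
            matrix[i] = word_vectors[word]

    # freeze the embeddings so only the CNN and dense layers are trained
    return Embedding(vocab_size, EMBED_DIM, weights=[matrix],
                     input_length=max_len, trainable=False)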
Hint: a possible model structure is shown below, where the left_cnn / right_cnn sub-model follows the overall model summary.
In [179]: from keras.layers import Embedding, Dense, Conv1D, MaxPooling1D, \
              Dropout, Activation, Input, Flatten, Concatenate
          # add import
In [195]: def detect_duplicate(datafile):
              # add your code

In [196]: if __name__ == "__main__":
              detect_duplicate("../../dataset/quora_duplicate_question_500.csv")
Overall model (this is just a reference structure; you don't have to use the same structure):

____________________________________________________________________________
Layer (type)                 Output Shape      Param #     Connected to
============================================================================
q1_input (InputLayer)        (None, 35)        0
____________________________________________________________________________
q2_input (InputLayer)        (None, 35)        0
____________________________________________________________________________
left_cnn (Model)             (None, 192)       877692      q1_input[0][0]
____________________________________________________________________________
right_cnn (Model)            (None, 192)       877692      q2_input[0][0]
____________________________________________________________________________
merge_q1_q2 (Concatenate)    (None, 384)       0           left_cnn[1][0]
                                                           right_cnn[1][0]
____________________________________________________________________________
dropout (Dropout)            (None, 384)       0           merge_q1_q2[0][0]
____________________________________________________________________________
hidden_layer (Dense)         (None, 64)        24640       dropout[0][0]
____________________________________________________________________________
output (Dense)               (None, 1)         65          hidden_layer[0][0]
============================================================================
Total params: 1,780,089
Trainable params: 255,489
Non-trainable params: 1,524,600
____________________________________________________________________________
Sub-CNN model (used for left_cnn and right_cnn):

____________________________________________________________________________
Layer (type)                 Output Shape      Param #     Connected to
============================================================================
main_input (InputLayer)      (None, 35)        0
____________________________________________________________________________
embedding (Embedding)        (None, 35, 300)   762300      main_input[0][0]
____________________________________________________________________________
conv_1 (Conv1D)              (None, 35, 64)    19264       embedding[0][0]
____________________________________________________________________________
conv_2 (Conv1D)              (None, 34, 64)    38464       embedding[0][0]
____________________________________________________________________________
conv_3 (Conv1D)              (None, 33, 64)    57664       embedding[0][0]
____________________________________________________________________________
max_1 (MaxPooling1D)         (None, 1, 64)     0           conv_1[0][0]
____________________________________________________________________________
max_2 (MaxPooling1D)         (None, 1, 64)     0           conv_2[0][0]
____________________________________________________________________________
max_3 (MaxPooling1D)         (None, 1, 64)     0           conv_3[0][0]
____________________________________________________________________________
flat_1 (Flatten)             (None, 64)        0           max_1[0][0]
____________________________________________________________________________
flat_2 (Flatten)             (None, 64)        0           max_2[0][0]
____________________________________________________________________________
flat_3 (Flatten)             (None, 64)        0           max_3[0][0]
____________________________________________________________________________
concate (Concatenate)        (None, 192)       0           flat_1[0][0]
                                                           flat_2[0][0]
                                                           flat_3[0][0]
============================================================================
Total params: 877,692
Trainable params: 115,392
Non-trainable params: 762,300
____________________________________________________________________________
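A minimal sketch of a model roughly matching the summaries above, written with the Keras functional API, might look like the following. The kernel sizes 1/2/3, the 64 filters per branch, the dropout rate, and the use of a single shared sub-network are assumptions made for illustration (the reference summary keeps two separate copies, left_cnn and right_cnn); you don't have to use this structure.

from keras.models import Model
from keras.layers import (Input, Conv1D, MaxPooling1D, Flatten,
                          Concatenate, Dropout, Dense)

def build_model(embedding_layer, max_len=35):
    # shared sub-CNN applied to both questions (sharing one copy is a
    # simplification of the reference left_cnn / right_cnn structure)
    sub_input = Input(shape=(max_len,), name="main_input")
    embedded = embedding_layer(sub_input)
    branches = []
    for k in (1, 2, 3):                                    # assumed kernel sizes
        conv = Conv1D(64, k, activation="relu")(embedded)  # 64 filters per branch
        pooled = MaxPooling1D(pool_size=max_len - k + 1)(conv)
        branches.append(Flatten()(pooled))
    sub_cnn = Model(sub_input, Concatenate(name="concate")(branches), name="sub_cnn")

    # apply the sub-CNN to q1 and q2, then classify from the concatenated features
    q1_input = Input(shape=(max_len,), name="q1_input")
    q2_input = Input(shape=(max_len,), name="q2_input")
    merged = Concatenate(name="merge_q1_q2")([sub_cnn(q1_input), sub_cnn(q2_input)])
    hidden = Dense(64, activation="relu", name="hidden_layer")(Dropout(0.5)(merged))
    output = Dense(1, activation="sigmoid", name="output")(hidden)

    model = Model(inputs=[q1_input, q2_input], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
    return model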
Train on 400 samples, validate on 100 samples
Epoch 1/100
Epoch 00000: val_acc improved from -inf to 0.68000, saving model to best_model
11s - loss: 0.8028 - acc: 0.5950 - val_loss: 0.7682 - val_acc: 0.6800
Epoch 2/100
Epoch 00001: val_acc did not improve
0s - loss: 0.7252 - acc: 0.6725 - val_loss: 0.7201 - val_acc: 0.6700
Epoch 3/100
Epoch 00002: val_acc improved from 0.68000 to 0.69000, saving model to best_model
0s - loss: 0.7005 - acc: 0.6575 - val_loss: 0.7446 - val_acc: 0.6900
Epoch 4/100
Epoch 00003: val_acc did not improve
0s - loss: 0.6407 - acc: 0.7675 - val_loss: 0.6793 - val_acc: 0.6800
Epoch 5/100
Epoch 00004: val_acc improved from 0.69000 to 0.70000, saving model to best_model
0s - loss: 0.5488 - acc: 0.8350 - val_loss: 0.6725 - val_acc: 0.7000
Epoch 6/100
Epoch 00005: val_acc improved from 0.70000 to 0.71000, saving model to best_model
0s - loss: 0.4717 - acc: 0.8675 - val_loss: 0.6860 - val_acc: 0.7100
Epoch 7/100
Epoch 00006: val_acc did not improve
0s - loss: 0.4090 - acc: 0.9225 - val_loss: 0.6693 - val_acc: 0.6700
Epoch 8/100
Epoch 00007: val_acc improved from 0.71000 to 0.76000, saving model to best_model
0s - loss: 0.3352 - acc: 0.9425 - val_loss: 0.6492 - val_acc: 0.7600
Epoch 9/100
Epoch 00008: val_acc did not improve
0s - loss: 0.2628 - acc: 0.9675 - val_loss: 0.6512 - val_acc: 0.7600
Epoch 10/100
Epoch 00009: val_acc did not improve
0s - loss: 0.2210 - acc: 0.9750 - val_loss: 0.6662 - val_acc: 0.7300
Epoch 11/100
Epoch 00010: val_acc did not improve
0s - loss: 0.1783 - acc: 0.9950 - val_loss: 0.7010 - val_acc: 0.7300
Epoch 12/100
Epoch 00011: val_acc did not improve
0s - loss: 0.1687 - acc: 0.9925 - val_loss: 0.6838 - val_acc: 0.7600
Epoch 13/100
Epoch 00012: val_acc did not improve
0s - loss: 0.1376 - acc: 1.0000 - val_loss: 0.6915 - val_acc: 0.7300
Epoch 14/100
Epoch 00013: val_acc did not improve
0s - loss: 0.1340 - acc: 0.9925 - val_loss: 0.7275 - val_acc: 0.7100
Epoch 00013: early stopping
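A history like the one above could be produced with a ModelCheckpoint that monitors val_acc plus an EarlyStopping callback. The sketch below continues the earlier sketches (model, q1_train, q1_test, etc. are the assumed variables from them); the patience value, batch size, and the use of the 100 held-out samples as validation data are assumptions based on the log, not requirements.

from keras.callbacks import ModelCheckpoint, EarlyStopping

# save weights whenever validation accuracy improves; stop once it stalls
checkpoint = ModelCheckpoint("best_model", monitor="val_acc",
                             save_best_only=True, verbose=1)
early_stop = EarlyStopping(monitor="val_acc", patience=5, verbose=1)  # assumed patience

history = model.fit([q1_train, q2_train], y_train,
                    validation_data=([q1_test, q2_test], y_test),
                    epochs=100, batch_size=32,             # assumed batch size
                    callbacks=[checkpoint, early_stop], verbose=2)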
             precision    recall  f1-score   support

        0.0       0.79      0.87      0.83        67
        1.0       0.67      0.55      0.60        33

avg / total       0.75      0.76      0.75       100
('auc', 0.7403889642695614)