1. Problem Formulation
Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. This project aims to build a network Intrusion Detection System (IDS), a predictive model distinguishing between bad connections, called intrusions or attacks, and good normal connections.
Model this problem as a BINARY classification problem. Use the following models to detect attack connections (intrusions). Compare the accuracy, recall, precision and F1-score of the following models. PLOT the confusion matrix for each model.
Fully-Connected Neural Networks
Convolutional Neural Networks (CNN)
Hint: For CNN, find a way to view each connection as an image. Please refer to our lab tutorial on CNN for handling data types other than images. You may use either Conv2D or Conv1D.
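One possible reading of the hint above, sketched with NumPy only: after preprocessing, each connection is a row of numbers, and that row can be reshaped into a one-channel sequence (for Conv1D) or a one-row "image" (for Conv2D). The matrix X and its width of 12 features are made-up placeholders; the real width depends on how you encode the symbolic features.

```python
import numpy as np

# Hypothetical stand-in for the preprocessed feature matrix:
# 5 connections with 12 numeric features each (placeholder sizes).
rng = np.random.default_rng(0)
X = rng.random((5, 12)).astype("float32")

n_samples, n_features = X.shape

# Conv1D view: each connection is a sequence of n_features steps, 1 channel.
X_conv1d = X.reshape(n_samples, n_features, 1)

# Conv2D view: each connection is a 1 x n_features "image", 1 channel.
X_conv2d = X.reshape(n_samples, 1, n_features, 1)

print(X_conv1d.shape)  # (5, 12, 1)
print(X_conv2d.shape)  # (5, 1, 12, 1)
```

With these shapes, and assuming the default channels-last layout in Keras, the per-sample input shape would be (n_features, 1) for Conv1D and (1, n_features, 1) for Conv2D. In the Conv2D case the kernel height has to be 1, for example a kernel size of (1, 3).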
2. Dataset
Download link:
https://drive.google.com/open?id=1FKyIqsKP4NBOTKRRuVbJjEFxTTCwzlHy
The data and its description are also available here (this project uses the 10% subset): http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
This database contains a wide variety of intrusions simulated in a military network environment.
A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol.
Each connection is labeled as either normal, or as one specific attack type.
The dataset contains 23 labels in total, normal plus 22 attack types, which are as follows:
back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster.
Type of each feature in the dataset is as follows (see Appendix A for details):
duration: continuous.
protocol_type: symbolic.
service: symbolic.
flag: symbolic.
src_bytes: continuous.
dst_bytes: continuous.
land: symbolic.
wrong_fragment: continuous.
urgent: continuous.
hot: continuous.
num_failed_logins: continuous.
logged_in: symbolic.
num_compromised: continuous.
root_shell: continuous.
su_attempted: continuous.
num_root: continuous.
num_file_creations: continuous.
num_shells: continuous.
num_access_files: continuous.
num_outbound_cmds: continuous.
is_host_login: symbolic.
is_guest_login: symbolic.
count: continuous.
srv_count: continuous.
serror_rate: continuous.
srv_serror_rate: continuous.
rerror_rate: continuous.
srv_rerror_rate: continuous.
same_srv_rate: continuous.
diff_srv_rate: continuous.
srv_diff_host_rate: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
dst_host_same_srv_rate: continuous.
dst_host_diff_srv_rate: continuous.
dst_host_same_src_port_rate: continuous.
dst_host_srv_diff_host_rate: continuous.
dst_host_serror_rate: continuous.
dst_host_srv_serror_rate: continuous.
dst_host_rerror_rate: continuous.
dst_host_srv_rerror_rate: continuous.
3. Data Preprocessing
For data preprocessing, we first need to encode good connections as “0” and attacks as “1”. To achieve this, you have two options:
Label encoding https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Write your own encoding function and apply that function to the label column. Check the map() function discussed in this tutorial: https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/
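The second option could look roughly like the sketch below. The helper name encode_outcome and the toy DataFrame are illustrative, and the trailing period on each label (e.g. "normal.") is an assumption about how the raw file spells its labels, so check your own data before relying on it.

```python
import pandas as pd

# Toy label column. The raw labels are ASSUMED to end with a period
# (e.g. "normal."), so the helper strips it before comparing.
df = pd.DataFrame({"outcome": ["normal.", "smurf.", "neptune.", "normal.", "back."]})


def encode_outcome(label: str) -> int:
    """Return 0 for a normal connection and 1 for any attack type."""
    return 0 if label.strip().rstrip(".") == "normal" else 1


# map() applies the helper to every value in the label column.
df["label"] = df["outcome"].map(encode_outcome)
print(df["label"].tolist())  # [0, 1, 1, 0, 1]
```

If you go with LabelEncoder instead, keep in mind that it assigns one integer per distinct class, so the attack codes still have to be collapsed into a single "1" afterwards.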
4. Requirements
Split the data into training and test sets. Use the training data to train your models and evaluate model quality using the test data.
Drop all the redundant records. This data set has a large number of redundant records. Redundant records in the training set will cause learning algorithms to be biased towards the more frequent records.
Remove all records with missing values.
Encode categorical features and normalize numeric features.
Some column(s) may have missing values after you normalize them. Why? Handle them appropriately.
You must use EarlyStopping when training neural networks using Tensorflow.
Tune the following hyperparameters when training neural networks using Tensorflow to see how they affect performance:
Activation: relu, sigmoid, tanh
Layers and neuron counts
Optimizer: adam and sgd
Kernel number and kernel size (for CNN only)
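The preprocessing items in the list above (dropping redundant records, removing missing values, encoding, normalizing) could be chained roughly as follows. This is a minimal sketch on a made-up four-column frame, not a full solution; z-score normalization is only one choice among several, and num_outbound_cmds is used here purely as an illustration of a column that happens to be constant.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loaded data (values are made up).
df = pd.DataFrame({
    "duration": [0, 0, 5, 12, np.nan, 3],
    "protocol_type": ["tcp", "tcp", "udp", "icmp", "tcp", "udp"],
    "num_outbound_cmds": [0, 0, 0, 0, 0, 0],  # constant column
    "outcome": [0, 0, 1, 1, 0, 1],
})

# 1. Drop exact duplicate records, then records with missing values.
df = df.drop_duplicates().dropna().reset_index(drop=True)

# 2. One-hot encode the symbolic feature.
df = pd.get_dummies(df, columns=["protocol_type"], dtype=float)

# 3. Z-score normalize the continuous features.
numeric_cols = ["duration", "num_outbound_cmds"]
values = df[numeric_cols].astype(float)
df[numeric_cols] = (values - values.mean()) / values.std()

# A constant column has a standard deviation of 0, so (x - mean) / std
# becomes 0 / 0, which pandas stores as NaN for every row.
nan_cols = df.columns[df.isna().any()].tolist()
print(nan_cols)  # ['num_outbound_cmds']

# 4. One way to handle it: drop columns that carry no information.
df = df.drop(columns=nan_cols)
```

In a real run, the mean and standard deviation would normally be computed on the training split only and then reused for the test split, so that no test information leaks into training.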
5. Grading breakdown
You may feel that this project is described with a certain degree of vagueness; this is on purpose. In other words, creativity is strongly encouraged. Your grade for this project will be based on the soundness of your design, the novelty of your work, and the effort you put into the project.
Use the evaluation form on Canvas as a checklist to make sure your work meets all the requirements.
Implementation: 70 pts
Your report: 20 pts
In-class defense: 10 pts
6. Teaming:
Students must work in teams of no more than 3 people. Think clearly about who will do what on the project. Normally, people in the same group will receive the same grade. However, the instructor reserves the right to assign different grades to team members depending on their contributions, so choose your partners carefully!
7. Deliverables:
The HTML version of your notebook that includes all your source code. Go to “File” and then “Download as”. Click “HTML” to convert the notebook to HTML.
5 pts will be deducted for an incorrect file format.
Your report in PDF format, with your name, your id, course title, assignment id, and due date on the first page. As for length, I would expect a report of more than one page. Your report should include (but is not limited to) the following sections:
Problem Statement
Methodology
Experimental Results and Analysis
Task Division and Project Reflection
In the section “Task Division and Project Reflection”, describe the following:
who is responsible for which part,
challenges your group encountered and how you solved them
and what you have learned from the project as a team.
10 pts will be deducted for missing the section on task division and project reflection. All files must be submitted by the team leader on Canvas before 11 am, Friday, March 6, 2020.
NO late submissions will be accepted.
8. In-class Demo:
Each team member must take part in the demo during the scheduled demo session. Each team has three minutes to demo its work in class. Failure to show up for the defense session will result in zero points for the project. Allocate your time as follows:
Model/code design (1 minute)
Findings/results (1 minute)
Task division, challenges encountered, and what you learned from the project (1 minute)
9. Hints
The provided CSV file has no column headers, so you may want to add column names using the following code after you load the data into a DataFrame using pd.read_csv():
df.columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
    'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
    'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'outcome'
]
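One detail worth checking when loading a headerless file: by default pd.read_csv() treats the first row as the header, so the first record would be used up as column names. A small sketch of an alternative is shown below, passing header=None and names= at load time. The two in-memory rows and the three-name column list are made up for illustration; the real file needs all 42 names listed above.

```python
import io

import pandas as pd

# Two made-up rows with three columns, standing in for the headerless CSV.
raw = io.StringIO("0,tcp,normal.\n2,udp,smurf.\n")

# Hypothetical shortened column list; the real file needs all 42 names.
columns = ["duration", "protocol_type", "outcome"]

# header=None keeps the first record as data; names= assigns the labels.
df = pd.read_csv(raw, header=None, names=columns)
print(df.shape)  # (2, 3)
```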
10. Think beyond the Project
Can you model this intrusion detection problem as a multi-class classification problem so that we can detect the specific attack type for each connection? How good can such a predictive model be in this case?
Among all the features, can you identify the most important ones (so-called feature importance analysis) and train models only on those, e.g., the top-10 most important features? Refer to our lab on regularization.
Can you apply downsampling or oversampling to create a more balanced dataset to train your model, so that your model will not be biased towards the more frequent classes/attack types?
Read this published paper based on the exact dataset used in the project. https://www.ee.ryerson.ca/~bagheri/papers/cisda.pdf
For a bigger challenge, here is a dataset on IoT applications for you to play with:
https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/bot_iot.php
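For the balancing question above, random downsampling can be sketched in a few lines of pandas. This is only one of several possible approaches, shown on a made-up frame; the column names src_bytes and label are placeholders for whatever your preprocessed data uses.

```python
import pandas as pd

# Toy imbalanced frame: 6 attacks (1) versus 2 normal connections (0).
df = pd.DataFrame({
    "src_bytes": [10, 20, 30, 40, 50, 60, 70, 80],
    "label": [1, 1, 1, 1, 1, 1, 0, 0],
})

# The rarest class determines how many rows every class keeps.
min_count = df["label"].value_counts().min()

# Randomly keep min_count rows per class, then shuffle the combined result.
balanced = (
    pd.concat(
        [group.sample(n=min_count, random_state=0) for _, group in df.groupby("label")]
    )
    .sample(frac=1.0, random_state=0)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts().to_dict())
```

Resampling is usually applied to the training split only, so that the test set still reflects the original class distribution.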
Appendix A
A complete description of the features is given in the three tables below.
http://kdd.ics.uci.edu/databases/kddcup99/task.html