Project phase 3 Solution

Phase overview

• Implement a classifier for encrypted traffic

◦ Capture traffic samples generated by various actions performed in the Android VM

◦ Split the dataset into training and evaluation sets

◦ Use the training traces to train a model (random forest, SVM, or another method of your choice) to distinguish between actions

◦ Verify accuracy using evaluation traces
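
The split and accuracy check above could be sketched as follows. This is only an illustration under assumptions: the helper names `split_traces` and `accuracy`, the 30% evaluation fraction, and the file-naming scheme are hypothetical and not part of the assignment. The split is done per label so every action appears in both sets.

```python
import random
from collections import defaultdict


def split_traces(labeled_traces, eval_fraction=0.3, seed=0):
    """Split (trace, label) pairs into training and evaluation lists.

    The split is done separately for each label, so each action keeps
    roughly the same share of traces in both sets.
    """
    by_label = defaultdict(list)
    for trace, label in labeled_traces:
        by_label[label].append(trace)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, evaluation = [], []
    for label, traces in sorted(by_label.items()):
        traces = traces[:]  # copy so the caller's list is not reordered
        rng.shuffle(traces)
        # Hold out at least one trace per label for evaluation.
        n_eval = max(1, round(len(traces) * eval_fraction))
        evaluation += [(t, label) for t in traces[:n_eval]]
        train += [(t, label) for t in traces[n_eval:]]
    return train, evaluation


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

With 50 traces per action and the assumed 30% fraction, 15 traces per action are held out for evaluation and 35 are used for training.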
Traffic capturing/processing

• Use tcpdump (or any tool of your choice) to capture traffic generated by performing a set of actions on the Android VM

• Convert network flows in traces into feature vectors suitable for a classifier
Apps and actions of interest

• Start Android browser (homepage must be set to en.wikipedia.org)

• Start YouTube app

• Start Weather Channel app

• Start Google News app

• Start Fruit Ninja app

• (all apps above can be installed for free using the Google Play app, already pre-installed)
How to capture traffic

• Start tcpdump, execute action, terminate tcpdump

◦ Either by hand, or by using Android test automation commands (start, monkey) via adb shell

◦ You should aim for ~50 traces per action (although fewer may work too)

◦ Label each trace with the action it captured
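
One way to automate the start/execute/terminate cycle is sketched below. It rests on several assumptions that you should adapt to your setup: tcpdump runs on the machine executing the script and can see the VM's traffic on the given interface, `adb` is already connected to the VM, and the package name you pass is the one installed on your VM. The function names, the `eth0` default, the 20-second wait, and the file-naming scheme are all hypothetical.

```python
import subprocess
import time


def capture_filename(label, index):
    """Encode the action label in the file name so every trace is labeled."""
    return f"{label}_{index:03d}.pcap"


def tcpdump_command(interface, out_path):
    # -i selects the capture interface, -w writes raw packets to a pcap file.
    return ["tcpdump", "-i", interface, "-w", out_path]


def launch_command(package):
    # monkey with a single event in the LAUNCHER category starts the app.
    return ["adb", "shell", "monkey", "-p", package,
            "-c", "android.intent.category.LAUNCHER", "1"]


def capture_once(label, package, index, interface="eth0", seconds=20):
    """Start tcpdump, launch the app, wait, then terminate tcpdump."""
    out_path = capture_filename(label, index)
    # tcpdump usually needs elevated privileges to open the interface.
    dump = subprocess.Popen(tcpdump_command(interface, out_path))
    try:
        time.sleep(1)  # give tcpdump a moment to start capturing
        subprocess.run(launch_command(package), check=True)
        time.sleep(seconds)  # assumed to be long enough for the app to load
    finally:
        dump.terminate()
        dump.wait()
    return out_path
```

Calling `capture_once` in a loop with indices 0 to 49 would produce the ~50 labeled traces per action. Remember to close the app between runs so each trace captures a fresh start.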
How to process traffic/2

• Data cleanup: the Android VM is relatively quiet in terms of network chatter, but you may end up capturing flows unrelated to the action you are performing

• Suggestion for data cleanup:

◦ Discard obvious noise (e.g., ARP)
◦ Look at DNS requests to figure out the IPs of flows generated by the app

• You may also decide not to clean up your data and hope the classifier can figure it out

• Looking at captures using Wireshark may help
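
The cleanup suggestion above might look like the sketch below. It assumes the pcap has already been parsed (with a library of your choice) into dictionaries holding `src_ip` and `dst_ip`, with `None` for packets that have no IP layer such as ARP, and that DNS answers have been extracted as (name, IP) pairs. These record layouts and the function names are hypothetical. IP addresses are used here only to filter packets, not as classifier features.

```python
def app_server_ips(dns_answers, domain_suffixes):
    """Collect IPs that DNS resolved for domains belonging to the app.

    dns_answers: iterable of (queried_name, resolved_ip) pairs.
    domain_suffixes: domains you attribute to the app, e.g. "wikipedia.org".
    """
    return {
        ip
        for name, ip in dns_answers
        if any(name == s or name.endswith("." + s) for s in domain_suffixes)
    }


def clean_packets(packets, server_ips):
    """Drop non-IP noise and packets that do not involve the app's servers."""
    kept = []
    for pkt in packets:
        if pkt.get("src_ip") is None:  # no IP layer, e.g. ARP
            continue
        if pkt["src_ip"] in server_ips or pkt["dst_ip"] in server_ips:
            kept.append(pkt)
    return kept
```

Deciding which domains belong to an app is a manual step; inspecting a few captures in Wireshark is a reasonable way to build that list.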
How to process traffic/3

• Once you have cleaned up the traces, divide them into bursts as explained last week

• All traffic in each burst must be partitioned into flows

• A flow is a set of packets sent between the same pair of addresses/ports and carrying the same protocol (TCP/UDP). Note that a flow includes the traffic in both directions.
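
A minimal sketch of both partitioning steps follows. The burst rule used here is an assumption, since the definition from last week's lecture is not repeated in these slides: a new burst starts whenever the gap since the previous packet exceeds a threshold (1 second below). The packet dictionaries and function names are also hypothetical. The flow key sorts the two endpoints so that both directions of a connection map to the same flow.

```python
from collections import defaultdict


def split_into_bursts(packets, gap=1.0):
    """Group packets into bursts separated by more than `gap` seconds."""
    bursts = []
    last_ts = None
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        if last_ts is None or pkt["ts"] - last_ts > gap:
            bursts.append([])  # silence was long enough: start a new burst
        bursts[-1].append(pkt)
        last_ts = pkt["ts"]
    return bursts


def flow_key(pkt):
    """Direction-independent key: protocol plus the two sorted endpoints."""
    endpoints = sorted([
        (pkt["src_ip"], pkt["src_port"]),
        (pkt["dst_ip"], pkt["dst_port"]),
    ])
    return (pkt["proto"], endpoints[0], endpoints[1])


def partition_into_flows(burst):
    """Map each flow key to the packets of that flow, in burst order."""
    flows = defaultdict(list)
    for pkt in burst:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)
```

Addresses and ports appear in the key only to group packets; they do not have to end up in the feature vector.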
How to process traffic/4

• Once you have a set of flows, you must convert them into feature vectors

• Vectorization: the process of representing an object with a vector of scalar features, suitable for classification algorithms

• Features you can’t use:

◦ IP addresses

◦ MAC addresses

◦ Packet payloads

• Everything else is fair game
How to process traffic/5

• Examples of vectorization:

◦ Convert each flow into a vector containing the lengths of the first 10 packets

◦ Convert each flow into a vector containing statistical features of the sequence of packet lengths

◦ ...
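
The two example vectorizations could be sketched as below. The function names, the zero padding for flows shorter than 10 packets, and the particular statistics chosen (count, mean, standard deviation, minimum, maximum) are assumptions made for illustration; other choices are equally valid. Both functions take only the list of packet lengths of one flow, so none of the forbidden features are involved.

```python
import statistics


def first_lengths_vector(lengths, n=10):
    """Lengths of the first n packets, zero-padded so every vector has size n."""
    head = list(lengths[:n])
    return head + [0] * (n - len(head))


def stats_vector(lengths):
    """Summary statistics of a flow's packet lengths."""
    if not lengths:
        return [0, 0.0, 0.0, 0, 0]
    return [
        len(lengths),                 # number of packets
        statistics.mean(lengths),     # average packet length
        statistics.pstdev(lengths),   # population standard deviation
        min(lengths),
        max(lengths),
    ]
```

Fixed-size vectors matter because most classifiers expect every sample to have the same number of features.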
I have the vectors, now what?

• Train a classifier to distinguish between vectors generated by different apps

• Zhuoqun will give a brief demo later for those of you not familiar with scikit-learn and machine learning in general
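
To make the train/predict workflow concrete without depending on any package, here is a deliberately simple stand-in: a nearest-centroid classifier written in plain Python. It is not a random forest or an SVM, and the class name `CentroidClassifier` is hypothetical. It only mirrors the `fit`/`predict` pattern, so a scikit-learn estimator could take its place in the same two calls.

```python
import math
from collections import defaultdict


class CentroidClassifier:
    """Toy classifier: predict the label whose mean vector is closest."""

    def fit(self, vectors, labels):
        sums = {}
        counts = defaultdict(int)
        for vec, label in zip(vectors, labels):
            if label not in sums:
                sums[label] = [0.0] * len(vec)
            sums[label] = [s + x for s, x in zip(sums[label], vec)]
            counts[label] += 1
        # One centroid (per-feature mean) for each label seen in training.
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, vectors):
        return [
            min(self.centroids_,
                key=lambda label: math.dist(vec, self.centroids_[label]))
            for vec in vectors
        ]
```

A short usage example with made-up two-feature vectors:

```python
model = CentroidClassifier().fit(
    [[100, 100], [120, 110], [1400, 1500], [1450, 1480]],
    ["news", "news", "youtube", "youtube"],
)
predictions = model.predict([[110, 105], [1420, 1490]])
```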
Phase 3 deliverables

◦ A Python script named classifyFlows that, given a pcap trace, prints the list of bursts, the flows in every burst, and the label of the action that generated each flow (if any)

◦ Output format:

• classifyFlows mytrace PCAP

Burst 1:

<timestamp> <src addr> <dst addr> <src port> <dst port> <proto> <#packets sent> <#packets rcvd> <#bytes sent> <#bytes rcvd> <label>

<label> must be either the name of an app, or unknown if the classifier is unable to determine which app was detected
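
Producing one output line per flow might look like the sketch below. Several points are assumptions, not requirements from the slides: "sent" is counted from the endpoint that sent the first packet of the flow, the timestamp is that of the first packet printed with six decimals, and a label is reported only when the classifier's best probability reaches a threshold (0.6 here), otherwise `unknown`. The packet dictionaries and function names are hypothetical.

```python
def pick_label(probabilities, threshold=0.6):
    """Return the most likely app, or 'unknown' if the model is not confident."""
    label, best = max(probabilities.items(), key=lambda item: item[1])
    return label if best >= threshold else "unknown"


def summarize_flow(packets, label):
    """Format one flow as a single output line."""
    first = packets[0]
    initiator = (first["src_ip"], first["src_port"])
    sent = [p for p in packets if (p["src_ip"], p["src_port"]) == initiator]
    rcvd = [p for p in packets if (p["src_ip"], p["src_port"]) != initiator]
    fields = [
        f"{first['ts']:.6f}",
        first["src_ip"], first["dst_ip"],
        first["src_port"], first["dst_port"],
        first["proto"],
        len(sent), len(rcvd),
        sum(p["length"] for p in sent),
        sum(p["length"] for p in rcvd),
        label,
    ]
    return " ".join(str(f) for f in fields)
```

The threshold trades missed detections for fewer wrong labels; it is worth tuning on the evaluation traces.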
Phase 3 deliverables/2

• Internally, your code must:

◦ Partition the traffic into bursts

◦ Partition each burst into flows

◦ Generate feature vectors from flows

◦ Attempt to classify each vector using the model you trained

• We will evaluate the accuracy of your code in classifying traces generated using the Android VM
Phase 3 deliverables/3

• By the phase 3 deadline (4/15, 10:45am) you must upload to Canvas a .zip file containing:

◦ Your Python script

◦ A README file specifying any Python package on which your code depends, and any information we need to be aware of when testing your code

◦ If your code has known limitations or issues, also briefly document them in the README file.