For this assignment, we will focus on graph data. You saw an instance of this in Homework 1 -- the airline flight network is actually a graph -- but there we did only limited kinds of computation over the graph. Many real-world datasets are, or can be modeled by, graphs (or trees, which are special cases of graphs). Examples include:
Networks (social networks, the Web, the connectome, the Internet, traffic networks, …)
Datasets in which some items are more closely connected than others (edges may represent weighted similarity or affinity)
Phylogenetic trees, grammars, etc.
This assignment, the second of four in the course, consists of a Basic component, to be done by everyone, and an Advanced component, to be done by students who wish to do 3 homeworks and a project. Please see the separate steps below for the Advanced component.
The “Basic” Assignment
For this assignment, we will be doing a few common operations on graphs. In the next assignment, when we have the power of matrices, we will do some further computation over the same graph data. (It’s very common to encode graph connectivity through an adjacency matrix, which we’ll discuss in lecture.)
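As a preview of the adjacency-matrix idea, here is a minimal sketch in plain Python. It is not part of the homework skeleton: the function name build_adjacency_matrix and the sample flight edges are made up for illustration, and the lecture may use a different convention (for example, NumPy arrays or a different row/column orientation).

```python
from typing import Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (source, destination, weight)


def build_adjacency_matrix(
    edges: Sequence[Edge], directed: bool = True
) -> Tuple[List[Hashable], List[List[float]]]:
    """Return (nodes, matrix) where matrix[i][j] is the weight of the
    edge from nodes[i] to nodes[j], or 0.0 if there is no such edge."""
    # Collect every node that appears as a source or destination.
    nodes = sorted({node for src, dst, _ in edges for node in (src, dst)})
    index = {node: i for i, node in enumerate(nodes)}

    # Start with an all-zero square matrix.
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for src, dst, weight in edges:
        matrix[index[src]][index[dst]] = weight
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[dst]][index[src]] = weight
    return nodes, matrix


# Hypothetical sample data: a tiny directed flight network.
flights = [
    ("PHL", "ORD", 1.0),
    ("ORD", "SFO", 1.0),
    ("SFO", "PHL", 1.0),
    ("PHL", "SFO", 1.0),
]
nodes, matrix = build_adjacency_matrix(flights)

# The row sum for a node is its (weighted) out-degree.
out_degree = {node: sum(row) for node, row in zip(nodes, matrix)}
```

In this sketch, rows are sources and columns are destinations, so summing a row gives a node's out-degree and summing a column gives its in-degree. Replacing the 1.0 weights with similarity scores gives the weighted case mentioned in the examples above.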
To start, go to Jupyter Notebook in your web browser (http://localhost:8888/tree with the big token as before). Click on your work directory, then New|Terminal. Run:
git clone https://bitbucket.org/pennbigdataanalytics/hw2.git
to get your initial data sets and skeleton notebook with test cases.
What to Work on
The basic Homework 2 has only one notebook, Homework-2.ipynb. However, as early as possible you should run through Steps 2.1 and 2.2 to make sure you can (1) connect to a simple version of Spark in your Docker container (this won’t be fast but will let you play with Spark), and (2) download a 2.5+GB dataset (which will also take a while).
The Data You’ll be Using
The data files come from the Yelp data posted on Kaggle. This data has some quirks and dirty aspects -- some of which we have cleaned for you, and some of which you’ll need to clean yourself in the Homework.
Submitting Homework 2
Once your Jupyter notebook is sanity-checked and passes all tests, go into your work/hw2 directory on Jupyter Notebook. Run zip hw2.zip Homework-2.ipynb. Note that you are required to submit a .zip file containing the Jupyter notebook file. Do not upload alternate formats.
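Before uploading, you may want to confirm that the archive really contains the notebook. The following is an optional sketch using Python's standard zipfile module; the helper name submission_contains_notebook is hypothetical and is not something the submission site provides or requires.

```python
import zipfile


def submission_contains_notebook(
    zip_path: str = "hw2.zip", notebook_name: str = "Homework-2.ipynb"
) -> bool:
    """Return True if zip_path is a readable zip archive that holds
    notebook_name at its top level, and False otherwise."""
    # is_zipfile returns False for missing files and non-zip files.
    if not zipfile.is_zipfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        return notebook_name in archive.namelist()
```

Calling submission_contains_notebook() from a notebook cell or Python session started in work/hw2 should return True if the zip command above succeeded. The check looks only at file names, so it does not tell you whether the notebook passes the tests.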
Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or Gmail account. At this point the system should know you are in the appropriate course. Select CIS 545 hw2-2019 and upload hw2.zip from your Jupyter/hw2 folder, typically found under /Users/{myid}. Please make sure your files are synced across the system well before the submission deadline, and keep backups!