Homework 3: The Cloud, Graphs, and PageRank

For this assignment, we will focus on array and matrix data. An array is a multidimensional set of values of the same type, and we’ll often encode matrices within arrays. Arrays are going to be incredibly useful for multidimensional data, for images, for matrices, for representing documents, and for machine learning training data. (We’ll see the last aspect in a future assignment.)

This assignment is the third of the four in the course that consist of a Basic component, to be done by everyone, and an Advanced component, to be done by students who wish to do three homeworks and a project. Please see the separate steps below for the Advanced component.

Step 1. Set up Spark on the Cloud
1.1 Google Cloud
The first part of your homework will involve connecting to Jupyter on Google DataProc on the Cloud. Google DataProc includes Apache Spark.

First, sign up for a $50 Google Cloud credit via the link posted on Piazza.
You will be asked for a name and an email address, which needs to end with one of the listed upenn.edu domains. We used CoursesInTouch to make sure that the domains for all registered students are accepted. Your coupon code will be emailed to you after you fill out this form.
The coupon will be valid through this semester. You can only request ONE code.
Next, visit https://console.cloud.google.com/education to redeem your credits to your Google account (e.g., someone@gmail.com). Note that the credits will be applied to the Google account you are logged in to. We recommend that you use your Google/SEAS account. If you do not have a SEAS account, you may request one from CETS or use a personal Gmail account. Non-SEAS UPENN accounts are NOT recommended, and we cannot provide support if certain features are blocked.
Now we need to do some basic setup:

In the drop-down or the “Manage Resources” page, create a new project. Call it something unique such as cis545-{your-initials}-1. If you get a notification that the name is taken, update the -1 to a higher number. We’ll call this the project-id.
In the “Billing” tab, set your Google Education credits as the billing source.
Enable the APIs and select your project.
Download and install the Google Cloud SDK for your machine (you will now be working locally, not within the Cloud website).
As instructed, run ./google-cloud-sdk/install.sh
Run ./google-cloud-sdk/bin/gcloud init

At this stage you should be “ready to go” with respect to setting up a Google Cloud account. From here you need to do two main things:

To store your data, create a bucket with a unique name, which we’ll call bucket-name (a sketch of the command follows this list).
To actually run a computation, create a cluster (see the command below).
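One way to create the bucket is sketched below, using the gsutil tool that ships with the Cloud SDK; the placeholders match those above, and bucket names must be globally unique:

# Create a Cloud Storage bucket in your project to hold notebooks and data.
gsutil mb -p {project-id} gs://{bucket-name}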
To create a cluster, you can run the following command in your macOS Terminal or Windows Command Prompt, using the Google Cloud SDK:

gcloud dataproc clusters create {cluster-name} \
--project {project-id} \
--bucket {bucket-name} \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh
Then, connect your browser to the Jupyter notebook running on your cluster’s master node.

Create an SSH tunnel from port 10000 on your local machine to your cluster’s master node by running:
gcloud compute ssh "{cluster-name}-m" --project {project-id} --zone={cluster-zone} -- -D 10000 -N
(Don’t forget to add the “-m” at the end of the cluster name, and make sure the port is 10000, not 1000!)

This should just appear to hang, but it’s actually setting up a connection.

Open another terminal / console window.

Configure your browser to use the proxy when connecting to your cluster:
<browser executable path> "http://{cluster-name}-m:8123" --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/

Note that:

For Mac, <browser executable path> is /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome
For Linux, it is /usr/bin/google-chrome
For Windows, it is "C:\Program Files (x86)\Google\Chrome\Application\chrome.exe". You may also have to change /tmp/ to \Windows\Temp
From this point on, you can use the Jupyter notebook on Google Cloud just as you do in Docker.

Remember that “the meter is ticking” while your cluster is running (especially), while you are downloading/uploading data, and while data sits in your bucket. When you take a long break, delete your cluster using gcloud dataproc clusters delete {cluster-name}. When you are done with this assignment, delete your bucket.
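For instance, a cleanup pass might look like the following sketch (same placeholders as above; note that gsutil rm -r permanently deletes the bucket and all of its contents):

# Delete the cluster when taking a long break (this stops the largest charges).
gcloud dataproc clusters delete {cluster-name} --project {project-id}
# When the assignment is fully done, remove the bucket and everything in it.
gsutil rm -r gs://{bucket-name}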
Step 2. Use Spark on the Google Cloud
Let’s start by downloading the data files for this assignment, both into your Docker instance and into your cloud Jupyter server.

Step 2.1. Download the Repo into Your Local Docker Container
Go to your operating system’s Terminal or Command Prompt, change to ~/jupyter, and run the command:

git clone https://bitbucket.org/pennbigdataanalytics/hw3.git

You’ll be using this for later parts of the homework, and you’ll shortly replicate it to the Google Cloud.

Step 2.2. Download the Repo into Your Cloud Jupyter/JupyterHub Server
Step 2.2.1 Google Cloud Setup
Now open your connection to Jupyter running on Google Cloud - enter the URL into your browser:

http://{cluster-name}-m:8123

where cluster-name is set as above, and don’t forget the -m. You should get a “blank” copy of Jupyter. Use the Jupyter Upload button to upload Homework-3-Spark.ipynb from where you cloned hw3 above.

When you open Homework-3-Spark.ipynb on Google Cloud, first check the kernel name shown in the notebook’s top menu bar.

It should say PySpark on the right. If it doesn’t (e.g., it says Python 3), go to the Kernel menu and choose Change kernel | PySpark.

Next, in a separate browser tab, go to the Google Storage Browser and select your bucket. Go into the notebooks directory and upload the web-NotreDame.csv file from your cloned hw3 repo. Go back to your Jupyter tab in the browser and verify that the file shows up. (A command-line alternative is sketched below.)
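As that command-line alternative, here is a minimal sketch using gsutil; it assumes the CSV sits at the top level of your hw3 clone and uses the same {bucket-name} placeholder:

# Copy the dataset into the notebooks folder of your bucket.
gsutil cp hw3/web-NotreDame.csv gs://{bucket-name}/notebooks/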

What to Work on
On Google Cloud, please only work on Homework-3-Spark.ipynb. When you are done, you may Download it to your machine. Be sure to shut down your Google Cloud cluster!

Step 3. Use Jupyter on Your Machine
The basic Homework 3 has two notebooks: Homework-3-Spark.ipynb and Homework-3-PageRank.ipynb.

However, you should complete only the Homework-3-Spark.ipynb notebook on the Google Cloud. The other notebook can and should be completed on your local Jupyter instance, as you have done with previous homework assignments.

Submitting Homework 3
Retrieve the following notebook files from your Docker container and zip them into hw3.zip (see the sketch after the list), much as you did for previous homework assignments. The notebooks should be:

Homework-3-Spark.ipynb
Homework-3-PageRank.ipynb
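On Mac or Linux, a minimal sketch of building the archive from the directory that contains the notebooks (your path may differ):

# Bundle both completed notebooks into the submission archive.
zip hw3.zip Homework-3-Spark.ipynb Homework-3-PageRank.ipynb

On Windows, you can instead select both files, choose Send to | Compressed (zipped) folder, and rename the result to hw3.zip.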

Next, go to the submission site, and if necessary click on the Google icon and log in using your Google@SEAS or Gmail account. At this point the system should know you are in the appropriate course. Select the assignment hw3-2019 and upload hw3.zip from your Jupyter folder, typically found under /Users/{myid}.

If you check on the submission site after a few minutes, you should see whether your submission passed validation. You may resubmit as necessary.