Overview of the assignment
This assignment consists of two tasks whose goal is to familiarize students with Spark and with using Spark for data analysis. This description has three parts: the first explains how to configure the environment and the data sets, the second describes the two tasks in detail, and the third lists the files students should submit and the grading criteria.
Spark Installation
Spark can be downloaded from the official website:
http://spark.apache.org/downloads.html
Spark 1.6.1 pre-built for Hadoop 2.4 is recommended. The download page of the official Spark website is shown in the following figure.
Scala Installation
Please refer to the Spark slides.
Python Configuration
You need to add the path of your Spark folder (path/to/your/Spark) and the path of its Python folder (path/to/your/Spark/python) to the interpreter’s environment variables SPARK_HOME and PYTHONPATH, respectively.
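As a minimal sketch of the same configuration done from inside a Python script rather than in the interpreter settings: the helper name configure_spark is hypothetical, and the path passed to it is whatever folder you unpacked Spark into. It assumes only the two variables named above are needed.

```python
import os
import sys


def configure_spark(spark_home):
    """Point the current Python process at a local Spark installation.

    spark_home is the folder Spark was unpacked into (path/to/your/Spark).
    """
    spark_python = os.path.join(spark_home, "python")

    # SPARK_HOME holds the Spark folder itself.
    os.environ["SPARK_HOME"] = spark_home

    # PYTHONPATH gets Spark's python folder; keep any entries already there.
    existing = os.environ.get("PYTHONPATH", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if spark_python not in parts:
        parts.insert(0, spark_python)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)

    # PYTHONPATH is read when the interpreter starts, so an already running
    # process also needs the folder on sys.path.
    if spark_python not in sys.path:
        sys.path.insert(0, spark_python)
    return spark_python
```

Calling configure_spark("path/to/your/Spark") at the top of a script, before any Spark import, has the same effect for that one process as setting the two variables by hand.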
Data
Please download the data from MovieLens at the following link:
https://grouplens.org/datasets/movielens/
You are required to download the data set ml-1m.zip, which is about 6 MB in size. The zip file contains three .dat files and one README file. The files users.dat, ratings.dat and movies.dat are needed for the tasks. The description of the data is provided in the README file.
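The three files can be read line by line. The sketch below assumes the layout given in the ml-1m README: fields separated by "::", with UserID::MovieID::Rating::Timestamp in ratings.dat, UserID::Gender::Age::Occupation::Zip-code in users.dat, and MovieID::Title::Genres in movies.dat, where genres are separated by "|". Please check the README of your download before relying on it. The function names are hypothetical.

```python
def parse_rating(line):
    """ratings.dat line -> (user_id, movie_id, rating); the timestamp is dropped."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating)


def parse_user(line):
    """users.dat line -> (user_id, gender); the other fields are not needed here."""
    fields = line.strip().split("::")
    return int(fields[0]), fields[1]


def parse_movie(line):
    """movies.dat line -> (movie_id, title, list of genres)."""
    fields = line.strip().split("::")
    movie_id = int(fields[0])
    genres = fields[-1].split("|")
    # Re-join the middle in case a title itself contains the separator.
    title = "::".join(fields[1:-1])
    return movie_id, title, genres
```

Each function takes one line of text, so the same functions can be passed to a map over the lines of a file, whether the lines come from plain Python or from Spark.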
Task 1: (40%)
Students are required to calculate each movie’s average rating for each user gender. The ratings.dat and users.dat files are needed for this task.
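To make the computation concrete, here is a plain-Python sketch (not Spark, and not the required implementation) that can serve as a cross-check on a tiny hand-made sample. The function name and the sample values are made up for illustration, and it assumes movieId is compared as a number when sorting.

```python
from collections import defaultdict


def average_by_movie_and_gender(ratings, gender_by_user):
    """ratings: iterable of (user_id, movie_id, rating).
    gender_by_user: dict mapping user_id -> gender.
    Returns (movie_id, gender, average) tuples sorted by movie_id, then gender.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (movie_id, gender) -> [sum, count]
    for user_id, movie_id, rating in ratings:
        key = (movie_id, gender_by_user[user_id])
        totals[key][0] += rating
        totals[key][1] += 1
    return [(movie_id, gender, total / count)
            for (movie_id, gender), (total, count) in sorted(totals.items())]


# Made-up sample: three users, two movies.
sample_users = {1: "F", 2: "M", 3: "F"}
sample_ratings = [(1, 10, 4.0), (2, 10, 2.0), (3, 10, 5.0), (1, 20, 3.0)]
task1_rows = average_by_movie_and_gender(sample_ratings, sample_users)
# task1_rows == [(10, "F", 4.5), (10, "M", 2.0), (20, "F", 3.0)]
```

In Spark the same idea is usually expressed as a join of ratings with users on the user id, followed by a per-key aggregation of sums and counts.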
Result format:
Save the result as one text file.
The result is sorted by movieId and then by gender, in ascending order.
The result file includes three columns: movieId, gender, and avg. rating.
The following snapshot is an example of the result for task 1. It shows the exact format of the result.
Task 2: (60%)
Students are required to calculate the average rating of each movie genre for each user gender. The ratings.dat, movies.dat and users.dat files are required for this task.
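As with task 1, a plain-Python sketch (not Spark) can be used to cross-check results on a tiny sample. It rests on two assumptions that the description does not spell out: a rating of a movie with several genres counts once toward each of those genres, and rows with the same genre are ordered by gender. The names and sample values are made up.

```python
from collections import defaultdict


def average_by_genre_and_gender(ratings, gender_by_user, genres_by_movie):
    """ratings: iterable of (user_id, movie_id, rating).
    gender_by_user: dict user_id -> gender.
    genres_by_movie: dict movie_id -> list of genre names.
    Returns (genre, gender, average) tuples sorted by genre, then gender.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (genre, gender) -> [sum, count]
    for user_id, movie_id, rating in ratings:
        gender = gender_by_user[user_id]
        # Assumption: the rating contributes to every genre of the movie.
        for genre in genres_by_movie[movie_id]:
            totals[(genre, gender)][0] += rating
            totals[(genre, gender)][1] += 1
    return [(genre, gender, total / count)
            for (genre, gender), (total, count) in sorted(totals.items())]


# Made-up sample: movie 10 has two genres, movie 20 has one.
sample_users = {1: "F", 2: "M", 3: "F"}
sample_genres = {10: ["Comedy", "Drama"], 20: ["Drama"]}
sample_ratings = [(1, 10, 4.0), (2, 10, 2.0), (3, 10, 5.0), (1, 20, 3.0)]
task2_rows = average_by_genre_and_gender(sample_ratings, sample_users, sample_genres)
# task2_rows == [("Comedy", "F", 4.5), ("Comedy", "M", 2.0),
#                ("Drama", "F", 4.0), ("Drama", "M", 2.0)]
```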
Result format:
Save the result as one text file.
There are three columns in the result file. The first column is the genre name, the second column is the gender, and the third column is the avg. rating. The file should be sorted by genre name in ascending order.
The following snapshot is an example of the result for task 2. It shows the exact format of the result.
What you need to turn in:
Source code for the two tasks (you can use either Python or Scala), named Firstname_Lastname_task1 and Firstname_Lastname_task2, respectively (for example, Priyambada_Jain_task1.py).
Result files of the two tasks for the large and small data sets, named Firstname_Lastname_result_task1.txt and Firstname_Lastname_result_task2.txt.
A Readme document: please describe how to run your program in this document.
If you use Scala, please also submit the jar packages, named Firstname_Lastname_task1.jar and Firstname_Lastname_task2.jar.
Zip the above files and name the archive Firstname_Lastname_HW1.zip.
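If you prefer to build the archive from a script, the following is a minimal sketch using Python's standard zipfile module. The helper name package_submission is hypothetical, and it assumes the files are stored at the top level of the archive, which the description does not specify.

```python
import os
import zipfile


def package_submission(first_name, last_name, file_paths, out_dir="."):
    """Zip the given files into Firstname_Lastname_HW1.zip inside out_dir."""
    archive_name = "{}_{}_HW1.zip".format(first_name, last_name)
    archive_path = os.path.join(out_dir, archive_name)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            # Store each file under its bare name, without folder components.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

For example, package_submission("Priyambada", "Jain", [...]) with the list of source, result, and Readme files would write Priyambada_Jain_HW1.zip to the current folder.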
Grading Criteria:
Your code will be run according to your Readme file. If your programs cannot be run with the commands you provide, your submission will be graded based on the result files you submit, with a 20% penalty.
If the file generated by your program is unsorted, there will be a 20% penalty.
If your program generates more than one file, there will be a 20% penalty.
The deadline for assignment 1 is 09/20 at midnight. There will be a 20% penalty for late submission.
Also, as described, a 10% bonus will be awarded for a Scala implementation.