Assignment 3: Spark RDD and Spark SQL Solution

Description

This assignment is very similar to Assignment #2 (kNN query), but this time you need to implement it using Spark RDD and Spark SQL. In a kNN query, the input is a set of points in Euclidean space, a query point, and an integer k. The output is the k points in the set that are closest to the query point. The Spark RDD implementation will share some similarities with the Hadoop MapReduce implementation, and the Spark SQL implementation will share some similarities with the Pig Latin implementation. In this assignment, you will observe the expressiveness and efficiency of Spark compared with Hadoop.
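As a rough illustration of the selection step in a kNN query, the sketch below keeps the k closest points seen so far in a bounded max-heap. It is plain Java with no Spark dependency; the names `KnnSketch`, `Point`, and `kNearest` are hypothetical and are not part of the assignment or of any Spark API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class KnnSketch {
    /** A point with an identifier (illustrative type, not from the handout). */
    static final class Point {
        final long id;
        final double x;
        final double y;

        Point(long id, double x, double y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    /** Squared Euclidean distance; omitting the square root keeps the same ranking. */
    static double squaredDistance(Point p, double qx, double qy) {
        double dx = p.x - qx;
        double dy = p.y - qy;
        return dx * dx + dy * dy;
    }

    /** Returns up to k points closest to the query (qx, qy), nearest first. */
    static List<Point> kNearest(Iterable<Point> points, double qx, double qy, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        Comparator<Point> byDistance =
                Comparator.comparingDouble(p -> squaredDistance(p, qx, qy));
        // Max-heap on distance: the head is the farthest of the current candidates.
        PriorityQueue<Point> heap = new PriorityQueue<>(k, byDistance.reversed());
        for (Point p : points) {
            if (heap.size() < k) {
                heap.add(p);
            } else if (byDistance.compare(p, heap.peek()) < 0) {
                // The new point is closer than the farthest candidate, so swap them.
                heap.poll();
                heap.add(p);
            }
        }
        List<Point> result = new ArrayList<>(heap);
        result.sort(byDistance);
        return result;
    }
}
```

In a Spark RDD setting, the same distance comparator could plausibly be passed to `takeOrdered`, which returns the first k elements of an RDD under a given ordering; whether that fits the intended solution is an assumption.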

Submission instructions

The assignment is due on Friday, 12/07/2018, at 11:59 PM Pacific Time.

Late submissions are allowed with a 20% penalty for each calendar day, up to four days late.

You can use either Java or Scala for this assignment.

A sample file is available at the following link: https://drive.google.com/open?id=1hEpg_-XecIKwGcTLIShBw8pvEhtCPaEc

Note that the file is compressed. You can decompress it first if you want. The file contains three comma-separated fields: ID, x, and y.
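A minimal sketch of parsing one record of this format might look like the following. It assumes each line holds exactly three comma-separated fields with no header row and no quoting, and it keeps the ID as a string because the handout does not state its type. The names `PointLineParser` and `ParsedPoint` are hypothetical.

```java
import java.util.Optional;

final class PointLineParser {
    /** One parsed record: ID kept as text, coordinates as doubles (illustrative type). */
    static final class ParsedPoint {
        final String id;
        final double x;
        final double y;

        ParsedPoint(String id, double x, double y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    /** Parses a line of the form "ID,x,y"; returns empty for malformed lines. */
    static Optional<ParsedPoint> parse(String line) {
        // The -1 limit keeps trailing empty fields so "1,2," is seen as three fields.
        String[] parts = line.trim().split(",", -1);
        if (parts.length != 3) {
            return Optional.empty();
        }
        try {
            double x = Double.parseDouble(parts[1].trim());
            double y = Double.parseDouble(parts[2].trim());
            return Optional.of(new ParsedPoint(parts[0].trim(), x, y));
        } catch (NumberFormatException e) {
            // Non-numeric or empty coordinate: treat the line as malformed.
            return Optional.empty();
        }
    }
}
```

Returning an empty `Optional` rather than throwing lets a caller skip bad lines; whether the sample file actually contains malformed lines is not stated in the handout.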

Please upload your answer in a single ZIP file named ‘cs226-asg3-<UCR net ID>.zip’, where <UCR net ID> is replaced with your ID.

Failing to follow the instructions above might result in losing some points.