FINAL PROJECT #3: GEO-LOCATION CLUSTERING IN SPARK

Project Goal




In this project you and your group will use SPARK to implement an iterative algorithm that solves the clustering problem in an efficient distributed fashion. Clustering is the process of grouping a set of objects (or data points) into a set of k clusters of similar objects. Thus, objects that are similar should be in the same cluster and objects that are dissimilar should be in different clusters.




Clustering has many useful applications such as finding a group of consumers with common preferences, grouping documents based on the similarity of their contents, or finding spatial clusters of customers to improve logistics. More specific use cases are:




 
• Marketing: given a large set of customer transactions, find customers with similar purchasing behaviors.

• Document classification: cluster web log data to discover groups of similar access patterns.

• Logistics: find the best locations for warehouses or shipping centers to minimize shipping times.




We will approach the clustering problem by implementing the k-means algorithm. k-means is a distance-based method that iteratively updates the location of k cluster centroids until convergence. The main user-defined ingredients of the k-means algorithm are the distance function (often Euclidean distance) and the number of clusters k. This parameter needs to be set according to the application or problem domain. (There is no magic formula to set k.) In a nutshell, k-means groups the data by minimizing the sum of squared distances between the data points and their respective closest centroid.
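Stated as a formula (added here only to make "minimizing the sum of squared distances" precise; x_1, ..., x_n are the data points and mu_1, ..., mu_k the centroids), k-means with Euclidean distance seeks

\[ \min_{\mu_1,\dots,\mu_k} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2 \]

Swapping in a different distance function changes the inner term accordingly.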




Goal: Implement k-means in SPARK and use it for geo-location clustering on various datasets of spatial locations. Can you find an even bigger geo-location dataset for which computation is not feasible in the pseudo-cluster? Cluster this data by executing your SPARK program on Amazon EMR and report your results and experiences.













Getting Started




Update your SVN repository. Use the folder final_project/clustering for your submissions.







Indicating Group Work and Submission Repository




Groups cannot be larger than four students and must be registered at milestone 1 sign-up with the instructor or TA. The wustlkey of the SVN repository used for submission must also be registered at this milestone presentation. It is your responsibility to add and commit all your final project submissions to exactly this repository.







Usage Agreement




By downloading and using the dataset lat_longs (cf. Problem 2 – Step 3) you agree as follows:




 
• I agree to delete this dataset once the project has been completed.

• I will not redistribute this data in any form.




Problem 2: Data Preparation




This problem prepares three datasets and defines milestone 2. Submit your programs for data pre-processing to the SVN repository. Detailed submission instructions follow below.




Step 1: Prepare device status data




Review the contents of the file $DEV1DATA/devicestatus.txt. You will have to pre-process the data in order to get it into a standardized format for later processing. This is a common part of the ETL (Extract-Transform-Load) process called data scrubbing.

The input data contains information collected from mobile devices on Loudacre’s network, including device ID, current status, location and so on. Loudacre Mobile is a (fictional) fast-growing wireless carrier that provides mobile service to customers throughout the western USA. Because Loudacre previously acquired other mobile providers’ networks, the data from different subnetworks comes in different formats. Note that the records in this file have different field delimiters: some use commas, some use pipes (|) and so on. Your task is to (a sketch of one possible approach follows the list):




 
• Load the dataset.

• Determine which delimiter to use (hint: the character at position 19 is the first use of the delimiter).

• Filter out any records which do not parse correctly (hint: each record should have exactly 14 values).

• Extract the date (first field), model (second field), device ID (third field), and latitude and longitude (13th and 14th fields respectively). You might want to store latitude and longitude as the first two fields to make it consistent with the other two datasets.













 
• Filter out locations that have a latitude and longitude of 0.

• The model field contains the device manufacturer and model name (e.g., Ronin S2). Split this field by spaces to separate the manufacturer from the model (e.g., manufacturer Ronin, model S2).

• Save the extracted data to comma-delimited text files in the /loudacre/devicestatus_etl directory on HDFS.

• Confirm that the data in the file(s) was saved correctly. Provide a screenshot named devicedata.png showing a couple of records in your SVN repository.

• Visualize the (latitude, longitude) pairs of the device location data. You do not have to use SPARK for the visualization. Show this visualization at your milestone 2 demo with the TA or instructor.
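The following is a minimal PySpark sketch of one possible approach to this scrubbing step (referenced above). It assumes a SparkContext sc from the pyspark shell and that the input file is reachable at the HDFS path shown (substitute $DEV1DATA as appropriate); the variable names and exact paths are illustrative only.

raw = sc.textFile("/loudacre/devicestatus.txt")   # substitute $DEV1DATA as appropriate

def split_record(line):
    # The character at position 19 is the first occurrence of this record's delimiter.
    delim = line[19] if len(line) > 19 else ","
    return line.split(delim)

def extract(fields):
    # Latitude/longitude first, for consistency with the other two datasets.
    lat, lon = float(fields[12]), float(fields[13])
    date, model, device_id = fields[0], fields[1], fields[2]
    manufacturer, _, model_name = model.partition(" ")
    return (lat, lon, date, manufacturer, model_name, device_id)

cleaned = (raw.map(split_record)
              .filter(lambda fields: len(fields) == 14)              # drop records that do not parse
              .map(extract)
              .filter(lambda r: not (r[0] == 0.0 and r[1] == 0.0)))  # drop (0, 0) locations

cleaned.map(lambda r: ",".join(str(v) for v in r)) \
       .saveAsTextFile("/loudacre/devicestatus_etl")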




Submit your implementation by adding a file including your SPARK statements (e.g., step1.py or step1.pyspark) to the final_project/clustering/milestone2 folder in your SVN repository. Do NOT add any data!




Add the new files/folders to your SVN repo before committing:




 
• svn add milestone2/step1.*

• svn add milestone2/devicedata.png

• svn commit -m 'milestone 2 submission' .













Step 2: Get and Visualize synthetic location data




Download the synthetic clustering data from http://statistical-research.com/wp-content/uploads/2013/11/sample_geo.txt and visualize the (latitude, longitude) pairs. You do not have to use SPARK for the visualization.
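One hedged sketch of such a visualization in plain Python (no SPARK required), assuming the downloaded file contains whitespace-separated latitude/longitude pairs; adjust the parsing if the file layout differs:

import matplotlib.pyplot as plt

lats, lons = [], []
with open("sample_geo.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            lats.append(float(parts[0]))
            lons.append(float(parts[1]))

plt.scatter(lons, lats, s=2)   # longitude on the x-axis, latitude on the y-axis
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.savefig("step2.png")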




Submit your plot as step2.png by adding it to the final_project/clustering/milestone2 folder in your SVN repository. Do NOT add any data!




Add the new files/folders to your SVN repo before committing:







 
• svn add milestone2/step2.png

• svn commit -m 'milestone 2 submission' .













Step 3: Get and Pre-process the DBpedia location data




Download the large-scale clustering data of (latitude, longitude) pairs extracted from DBpedia (https://classes.cec.wustl.edu/cse427/lat_longs.zip). Each record represents a location/place that has a Wikipedia article and latitude/longitude information. The format is: lat long name_of_page.




In total, there are 450,151 points in a 2D space (i.e., a space with spherical geometry – maybe it would make sense to use the great circle distance when analyzing this data...). To get a smaller sample of this dataset for testing purposes, you could put a bounding box around the US and filter only those records inside the bounding box. Try to visualize this data. Eventually, you want to cluster the whole world using the entire dataset...
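A minimal PySpark sketch of such a bounding-box filter, assuming sc from the pyspark shell, the "lat long name_of_page" layout described above, and purely illustrative box coordinates for the contiguous US:

raw = sc.textFile("lat_longs")    # adjust to wherever the unzipped file lives

def to_point(line):
    parts = line.split()          # latitude and longitude are the first two fields
    return (float(parts[0]), float(parts[1]))

points = raw.map(to_point)

# Rough, illustrative bounding box around the contiguous US.
us_points = points.filter(lambda p: 24.0 <= p[0] <= 50.0 and -125.0 <= p[1] <= -66.0)
print(us_points.count())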




Problem 3: Clustering Big Data – k-means in Spark




Step 1: Understanding Parallel Data Processing and Persisting RDDs




This is the theory part of the project. Review the slides from the lecture and Lab 6 to understand the main data concept in SPARK – Resilient Distributed Datasets (RDDs). You will need to persist an RDD (at least once) in your k-means implementation. Additionally, make yourself familiar with how to view stages and tasks, e.g., using the Spark Application UI (when using the spark shell in local mode) or the Spark History Server at http://localhost:18080 (when running scripts locally).
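As a small, hedged illustration of persisting an RDD (assuming sc from the pyspark shell; the file name is only a placeholder):

from pyspark import StorageLevel

points = sc.textFile("lat_longs").map(lambda line: line.split())
points.persist(StorageLevel.MEMORY_ONLY)   # equivalently: points.cache()

points.count()   # first action computes and caches the partitions
points.count()   # later actions reuse the cached partitions (visible as skipped stages in the UI)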




Step 2: Understanding and Implementing k-means




MMDS chapter 7.3 (http://infolab.stanford.edu/~ullman/mmds/ch7.pdf) gives pseudocode and implementation strategies for the k-means clustering algorithm. Detailed implementation requirements/specifications are listed below:




The following functions will be useful for calculating k-means:




 
• closestPoint: given a (latitude/longitude) point and an array of current center points, returns the index in the array of the center closest to the given point.

• addPoints: given two points, returns a point which is the sum of the two points.

• EuclideanDistance: given two points, returns the Euclidean distance between the two.

• GreatCircleDistance: given two points, returns the great circle distance between the two.




Note that the addPoints function will be used to compute the new cluster centers. As we are working with spatial data given as latitude-longitude pairs, implementing this function in a meaningful way will need some thought!
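A hedged sketch of these four helpers in plain Python (points are (latitude, longitude) tuples in degrees; the great circle distance uses the haversine formula with an Earth radius of about 6371 km, and the naive addPoints below deliberately ignores the wrap-around issue discussed above):

from math import sqrt, radians, sin, cos, asin

def EuclideanDistance(p, q):
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def GreatCircleDistance(p, q, radius_km=6371.0):
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

def addPoints(p, q):
    # Coordinate-wise sum; dividing by the cluster size later yields the new centroid.
    return (p[0] + q[0], p[1] + q[1])

def closestPoint(p, centers, dist):
    # Index (in the list `centers`) of the center closest to p under the distance `dist`.
    return min(range(len(centers)), key=lambda i: dist(p, centers[i]))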




The distance measure used (Euclidean or great circle), as well as the parameter k (the number of clusters), should be read as input from the command line.




Select a suitable convergence criterion and create a variable convergeDist that will be used to decide when the k-means calculation is done, i.e., when the amount by which the locations of the means change between iterations is less than convergeDist. A "perfect" solution would be 0, which is not achievable due to numerical computation. Hence, convergeDist is a parameter that represents a "good enough" solution. Select a small value for it that makes sense for your dataset and convergence criterion.




Parse the input file, which should also be specified as a variable via the command line, into (latitude, longitude) pairs. Be sure to persist (cache) the resulting RDD because you will access it in every iteration that follows.
















Now, plan and implement the main part of the k-means algorithm. Make sure your implementation is efficient, keeping in mind how it maps to tasks, stages, and cached RDDs.




When the iteration is complete, display and return the final k center points and store the k clusters (i.e., all data points plus cluster information).
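Putting the pieces together, here is a hedged sketch of the driver loop. It reuses the helper functions sketched above and assumes sc; the command-line layout, the comma parsing, and the output path are illustrative choices only.

import sys

fname, k = sys.argv[1], int(sys.argv[2])
measure, convergeDist = sys.argv[3], float(sys.argv[4])
dist = GreatCircleDistance if measure == "greatcircle" else EuclideanDistance

points = (sc.textFile(fname)
            .map(lambda line: line.split(","))      # adjust the delimiter to the dataset
            .map(lambda f: (float(f[0]), float(f[1])))
            .persist())                             # reused in every iteration

centers = points.takeSample(False, k, seed=42)      # initial centroids
tempDist = float("inf")

while tempDist > convergeDist:
    # (cluster index, (coordinate sum, count)) for every point
    closest = points.map(lambda p: (closestPoint(p, centers, dist), (p, 1)))
    stats = closest.reduceByKey(lambda a, b: (addPoints(a[0], b[0]), a[1] + b[1]))
    newCenters = stats.mapValues(lambda s: (s[0][0] / s[1], s[0][1] / s[1])).collectAsMap()

    # Total movement of the centroids in this iteration decides convergence.
    tempDist = sum(dist(centers[i], newCenters[i]) for i in newCenters)
    for i in newCenters:
        centers[i] = newCenters[i]

print(centers)

# Store every point together with its final cluster index.
labeled = points.map(lambda p: (closestPoint(p, centers, dist), p))
labeled.map(lambda c: "%d,%f,%f" % (c[0], c[1][0], c[1][1])).saveAsTextFile(fname + "_clusters")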







Step 3: Compute and Visualize Clusters




In this step, you will compare the clusters using Euclidean distance vs. great circle distance.




Calculate the k-means clusters for the device location data using k = 5.




Calculate the k-means clusters for the synthetic location data using k = 2 and k = 4.




Calculate the k-means clusters for the large-scale DBpedia location data. You will need to experiment with the number of clusters (maybe use k = 6 for a start, or k = 2 or 4 if you use the US locations only). Argue which choice of k makes sense by considering the problem context, i.e., what could the clusters actually mean/represent?




Visualize the clusters and cluster centers (use a random subset of data points for the last dataset) for both distance measures. Can you observe a difference?




Step 4: Runtime Analysis




Compare the runtime of your k-means implementation (using the same value for k) for all three datasets using the local mode with at least two threads. Further, rerun your implementation without using persistent RDDs and compare those runtimes to the previously obtained ones. Create a table or bar plots summarizing these results and briefly discuss your findings. After job completion you can read off the runtimes and other statistics from the Spark History Server at http://localhost:18080. You might want to rerun each experiment a couple of times and use the average runtime for a more robust comparison (time permitting).
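If you want to automate the repeated runs, one hedged way to do it from the driver side is a small wrapper around spark-submit (the script name, argument layout, and local[2] master below are placeholders; the History Server remains the place to read off per-stage details):

import subprocess, time

def timed_run(args, repeats=3):
    runtimes = []
    for _ in range(repeats):
        start = time.time()
        subprocess.run(["spark-submit", "--master", "local[2]", "kmeans.py"] + args, check=True)
        runtimes.append(time.time() - start)
    return sum(runtimes) / len(runtimes)

print(timed_run(["sample_geo.txt", "4", "euclidean", "0.1"]))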




Step 5: Documentation of Approach and Results (Report)




Write the project report documenting your clustering approach, your implementation, the obtained results, and runtime analysis. This report should be readable for an informed outsider and it should not require the reader to look at or run any code.




Problem 4: Big Data and Cloud Execution




Can you find or crawl an even bigger dataset you want to cluster? This dataset should be big enough that your k-means clustering computation is not feasible in your pseudo-cluster. Cluster this data by executing your SPARK program on Amazon EMR and report your results and experiences. If you do not find another geo-location dataset, feel free to perform clustering on any other Big Data clustering problem. Keep in mind that for higher-dimensional data, the interpretation and visualization of the retrieved clusters is much more challenging. If need be, you can also use the DBpedia data in EMR.




Document your cloud execution approach and provide the data source in your final project report. Add your pre-processing code to the src folder in your SVN repository. Describe your findings including dataset size and runtimes in your final project report.




Final Submission Instructions




Submit your report, including documentation as well as results, as project_report.pdf by adding it to the final_project/clustering folder in your SVN repository. Submit your implementation by adding it to the final_project/clustering/src folder in your SVN repository. Do NOT add any data!




Add the new files/folders to your SVN repo before committing:




 
• svn add src/*

• svn add project_report.pdf

• svn commit -m 'final project submission' .


















































































































