Data Mining Competition Project

Starting from:

~~$30~~

$24

Home

1. Overview of the Assignment

In this competition project, you need to improve the performance of your recommendation system from Assignment 3. You can use any method (like the hybrid recommendation systems) to improve the prediction accuracy and efficiency.

2. Competition Requirements

2.1 Programming Language and Library Requirements

a. You must use Python to implement the competition project. You can use external Python libraries as long as they are available on Vocareum.

b. You are required to only use the Spark RDD to understand Spark operations. You will not receive any points if you use Spark DataFrame or DataSet.

2.2 Programming Environment

Python 3.6.4, Scala 2.11, JDK 1.8 and Spark 3.1.2

We will use these library versions to compile and test your code. There will be a 20% penalty if we cannot run your code due to the library version inconsistency.

2.3 Write your own code

Do not share your code with other students!!

We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.

3. Yelp Data

In this competition, the datasets you are going to use are from:
https://drive.google.com/drive/folders/1SIlY40owpVcGXJw3xeXk76afCwtSUx11?usp=sharing

We generated the following two datasets from the original Yelp review dataset with some filters. We randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20% of the data as the testing dataset.

A. yelp_train.csv: the training data, which only include the columns: user_id, business_id, and stars.

B. yelp_val.csv: the validation data, which are in the same format as training data.
C. We are not sharing the test dataset.

D. other datasets: providing additional information (like the average star or location of a business)

a. review_train.json: review data only for the training pairs (user, business)

• b. user.json: all user metadata

• c. business.json: all business metadata, including locations, attributes, and categories

d. checkin.json: user checkins for individual businesses

• e. tip.json: tips (short reviews) written by a user about a business

• f. photo.json: photo data, including captions and classifications

4. Task (8 points)

In the competition, you need to build a recommendation system to predict the given (user, business) pairs. You can mine interesting and useful information from the datasets provided in the Google Drive folder to support your recommendation system.

You must make an improvement to your recommendation system from homework assignment 3 in terms of accuracy. You can utilize the validation dataset (yelp_val.csv) to evaluate the accuracy of your recommendation system. There are two options to evaluate your recommendation system:

(1) Error Distribution: You can compare your results to the corresponding ground truth and compute the absolute differences. You can divide the absolute differences into 5 levels and count the number for each level as following:

>=0 and <1: 12345

>=1 and <2: 123

>=2 and <3: 1234

>=3 and <4: 1234

>=4: 12

This means that there are 12345 predictions with < 1 difference from the ground truth. This way you will be able to know the error distribution of your predictions and to improve the performance of your recommendation systems.

(2) RMSE Error: You can compute the RMSE (Root Mean Squared Error) by using following formula:

where Predi is the prediction for business i and Ratei is the true rating for business i. n is the total number of the business you are predicting.

Input format: (we will use the following commands to execute your code)

/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit competition.py <folder_path> <test_file_name> <output_file_name>
Param: folder_path: the path of dataset folder, which contains exactly the same file as the google drive

Param: test_file_name: the name of the testing file (e.g., yelp_val.csv), including the file path

Param: output_file_name: the name of the prediction result file, including the file path

Output format:

a. The output file is a CSV file, containing all the prediction results for each user and business pair in the validation/testing data. The header is “user_id, business_id, prediction”. There is no requirement for the order in this task. There is no requirement for the number of decimals for the similarity values. Please refer to the format in Figure 1.

Figure 1: Output example in CSV

b. You also need to write comments that include the description of your method (less than 300 words) in the first part of your program. The description should include the explanation of the models you are using, especially the way you improved the accuracy or efficiency of the system. We look forward to seeing creative methods. Please also report the error distribution, RMSE, and the total execution time on the validation dataset in the description. Figure 2 shows an example of the description file. If the comments are not included or the comments are not informative, there will be a one-point penalty.

Figure 2: An example of description file

Grading:

We will compare your prediction results against the ground truth. We will use our testing data to evaluate your recommendation systems and grade based on the accuracy using RMSE.

To get the full points for the competition project, your RMSE result should beat that of the TAs’. TAs will also continuously improve their systems and announce the accuracy. The TA systems will be fixed three days before the competition due date. However, if your recommendation system only beats the TAs’ for the validation data, you will receive 50% of the points for the competition.
NOTICE: the current RMSE baseline is 1.03 for the validation dataset.

The final submission with the highest accuracy will receive an extra 6 points on the final grade. The second place will receive extra 5 points. The third one will receive extra 4 points and so on until the sixth one will receive extra 1 point.

To be more like a competition, you could see a "Leaderboard" button in the "Competition" on Vocareum. Every time you submit the code, your RMSE for validation data will be scored and show up on the leaderboard.

5. Submission

You need to submit your Python scripts on Vocareum with exactly the same name:

◦ competition.py

6. Grading Criteria

(% penalty = % penalty of possible points you get)

1. You cannot use the extension for the competition. No late submissions will be accepted for the competition.

2. We will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. If plagiarism is detected, you will receive no points for the entire assignment and we will report all detected plagiarism.

3. All submissions will be graded on Vocareum. Please strictly follow the format provided, otherwise you won’t receive points even though the answer is correct.

4. Do NOT use Spark DataFrame, DataSet, sparksql.

5. We will not conduct regrades on competition submissions.

6. There will be no points awarded if the total execution time exceeds 25 minutes.

7. Common problems causing fail submission on Vocareum/FAQ

(If your program runs seem successfully on your local machine but fail on Vocareum, please check these)

1. Try your program on Vocareum terminal. Remember to set python version as python3.6,

And use the latest Spark

/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit
2. Check the input command line format.

3. Check the output format, for example, the header, tag, typos.

4. Your Python script should be named as competition.py

5. Check whether your local environment fits the assignment description, i.e. version, configuration.

6. If you implement the core part in Python instead of Spark, or implement it in a high time complexity way (e.g. search an element in a list instead of a set), your program may be killed on Vocareum because it runs too slowly.