Introduction
In this assignment, you will implement the K-Means clustering algorithm, as well as use clustering to solve real-world problems.
1. Implement K-means algorithm
Please use Python 3.6 or 3.7 to implement your homework assignment.
1.1 Data Description
Both the blackbox41 and blackbox42 datasets contain 4000 two-dimensional data points, grouped into 4 clusters. What you need to do is implement the K-Means algorithm to find these 4 clusters among the data points.
1.1.1 Blackbox
As in the last assignment, you are given Blackbox41.pyc instead of ready-made data, and you need to generate your own data from that blackbox. When you ask the blackbox, it returns all 4000 data points as type ndarray. The Python script you submit to Vocareum must import both blackbox41 and blackbox42, even though blackbox42 is not provided to you.
from Blackbox41 import Blackbox41
from Blackbox42 import Blackbox42
Remember that your code will be run with 2 different boxes. One way to parse the input argument is
import sys

blackbox = sys.argv[-1]
if blackbox == 'blackbox41':
    bb = Blackbox41()
elif blackbox == 'blackbox42':
    bb = Blackbox42()
else:
    print('invalid blackbox')
    sys.exit()
and then use bb as the data source. While developing your algorithm you only need to care about blackbox41, but you must still import blackbox42 and add the parsing logic for it, since blackbox42 will be used to test your code on Vocareum.
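The K-Means loop itself can be sketched with plain NumPy. This is a minimal illustration, not a reference solution: it assumes the blackbox has already yielded an (N, 2) ndarray named data (the blackbox's exact query method is not specified here, so it is not shown).

```python
import numpy as np

def kmeans(data, k=4, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

In practice you may want to add a smarter initialization (e.g. k-means++-style seeding) or restart from several random seeds and keep the run with the lowest within-cluster distance, since Lloyd's algorithm only finds a local optimum.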
1.1.2 Program
Your program will be run in the following way:
python3 Kmeans.py blackbox41
→ results_blackbox41.csv
When we grade your model with hidden blackbox42, it should be run:
python3 Kmeans.py blackbox42
→ results_blackbox42.csv
The files results_blackbox41.csv and results_blackbox42.csv contain the cluster label your K-Means algorithm assigned to each data point, one label per line, in the following format:
0
1
2
1
…
0
3
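One simple way to produce a file in that format is numpy.savetxt. Here labels is a stand-in for the 1-D integer array of cluster assignments your algorithm produced; the variable name and the six sample values are illustrative only.

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0, 3])  # stand-in for your 4000 real labels
# Write one integer label per line, matching the format shown above.
np.savetxt("results_blackbox41.csv", labels, fmt="%d")
```

Remember to pick the output filename based on which blackbox was passed on the command line, so that blackbox42 produces results_blackbox42.csv.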
In your implementation, please do not use any existing machine learning library call. You must implement the algorithm yourself. Please develop your code yourself and do not copy from other students or from the Internet.
When we grade your algorithm, we will use a hidden blackbox42. Your code will be autograded for technical correctness. Please name your file correctly, or you will wreak havoc on the autograder. The maximum running time is 3 minutes.
1.2. Submission:
Submit your code to Vocareum
Submit KMeans.py to Vocareum
After your submission, Vocareum will run two scripts to test your code: a submission script and a grading script. The submission script tests your code with only blackbox41, while the grading script tests it with the hidden blackbox42.
The program will be terminated and fail if it exceeds the 3-minute time limit.
After the submission script/grading script finishes, you can view your submission report immediately to see if your code works, while the grading report will be released after the deadline of the homework.
You don’t have to keep the page open while the scripts are running.
Submit your report to Blackboard
Create a single .zip (Firstname_Lastname_HW4.zip) which contains:
KMeans.py
Report.pdf, a brief report containing your clustering graph (sample points and the centroid of each cluster) for blackbox41. Example graph:
Submit your zipped file to Blackboard.
1.3. Rubric:
50 points in total
Program correctness (30 points): program always works correctly and meets the specifications within the time limit
Documentation (10 points): code is well commented and submitted graph is reasonable
Performance (10 points): the classifier has reasonable performance
2. User Analysis Using Clustering
2.1 Introduction
In this exercise, you’ll take on a data challenge as a data analyst. Specifically, your task is to apply clustering to real-world Yelp user data to identify different clusters of users. In this part of the assignment, you are welcome to use existing machine learning libraries (e.g. sklearn).
Yelp dataset
The Yelp dataset can be found at https://www.kaggle.com/yelp-dataset/yelp-dataset. There are various datasets, such as business, reviews, and check-ins, but we’ll focus on the “users” dataset (yelp_academic_dataset_user.json) this time. The attributes are described below.
User_id: user id
Name: user name
Review_count: number of user’s reviews
Yelping_since: sign up date of user
Useful: count of receiving “useful” feedback (for the user’s reviews)
Funny: count of receiving “funny” feedback (for the user’s reviews)
Cool: count of receiving “cool” feedback (for the user’s reviews)
Elite: years in which the user has been an elite Yelp member
Friends: list of friends’ user id
Fans: number of fans
Average_stars: average star rating of the user’s reviews
Compliment_hot: count of receiving “hot stuff” compliments (for the user)
Compliment_more: count of receiving “write more” compliments (for the user)
Compliment_profile: count of receiving “like your profile” compliments (for the user)
Compliment_cute: count of receiving “cute pic” compliments (for the user)
Compliment_list: count of receiving “great lists” compliments (for the user)
Compliment_note: count of receiving “just a note” compliments (for the user)
Compliment_plain: count of receiving “thank you” compliments (for the user)
Compliment_cool: count of receiving “you are cool” compliments (for the user)
Compliment_funny: count of receiving “you are funny” compliments (for the user)
Compliment_writer: count of receiving “good writer” compliments (for the user)
Compliment_photos: count of receiving “great photo” compliments (for the user)
2.2 Instructions
Suggested steps:
Load the first 100,000 lines of data.
Preprocess the data.
Standardize the data.
Apply Principal Component Analysis to reduce the dimension of data.
Apply clustering on the data.
Explain your clusters and findings.
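The first three steps can be sketched as follows. This is a hedged example, not the required approach: the dataset file is JSON Lines (one user object per line), so it is read line by line up to the limit. Here a two-record in-memory stand-in replaces the real file, whose path on Kaggle will differ; with the real data you would pass open("yelp_academic_dataset_user.json") instead.

```python
import io
import json
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_users(f, limit=100_000):
    """Read at most `limit` JSON Lines records into a DataFrame."""
    rows = []
    for i, line in enumerate(f):
        if i >= limit:
            break
        rows.append(json.loads(line))
    return pd.DataFrame(rows)

# Tiny stand-in showing the shape of the data; values are made up.
sample = io.StringIO(
    '{"user_id": "a", "review_count": 10, "useful": 5, "average_stars": 3.5}\n'
    '{"user_id": "b", "review_count": 200, "useful": 80, "average_stars": 4.2}\n'
)
users = load_users(sample)

# Standardize the numeric columns to zero mean and unit variance,
# so attributes with large raw counts don't dominate the clustering.
numeric = users.select_dtypes("number")
scaled = StandardScaler().fit_transform(numeric)
```

Preprocessing beyond this (e.g. converting Yelping_since to an account age, or Friends to a friend count) is up to you; standardization matters because PCA and K-Means are both scale-sensitive.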
The first 3 steps are straightforward. For step 4, you should group the attributes according to their covariance matrix as well as your understanding of the data, and apply PCA to each group. For example, you may find that Useful, Funny, and Cool are highly correlated, so you might want to make them a group and use the first principal component of those three attributes as a factor for clustering. You can decide the name of this factor, for instance review_feedback_factor, which helps you explain your clusters in the later steps. For step 5, use the factors you developed in step 4 for clustering. You are welcome to use more than one clustering method and compare the results. For step 6, you need to name and explain your clusters. For example, if you find a cluster whose centroid has a very high review_feedback_factor and a low review_stars_factor, you might name this cluster of users strict gourmets and explain: strict gourmets tend to give strict reviews for restaurants, but are highly helpful to customers.
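Steps 4 and 5 can be sketched like this. The toy DataFrame below stands in for your standardized user table: its column names follow the dataset, but the values (and the resulting correlation structure) are synthetic, and the choice of groups and the factor name review_feedback_factor are illustrative, not prescribed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: useful/funny/cool are built to be highly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
users = pd.DataFrame({
    "useful": base + rng.normal(scale=0.1, size=300),
    "funny":  base + rng.normal(scale=0.1, size=300),
    "cool":   base + rng.normal(scale=0.1, size=300),
    "average_stars": rng.normal(size=300),
})

# Collapse the correlated feedback columns into one factor via their
# first principal component.
feedback = StandardScaler().fit_transform(users[["useful", "funny", "cool"]])
review_feedback_factor = PCA(n_components=1).fit_transform(feedback).ravel()

# Cluster on the derived factors rather than the raw columns.
factors = np.column_stack([review_feedback_factor, users["average_stars"]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(factors)
```

To interpret the clusters, look at the centroid of each one in factor space (e.g. via pandas groupby on the labels) and name the clusters from which factors run high or low, as in the strict gourmet example above.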
While doing this assignment, imagine you are a data analyst in the marketing department of Yelp. Any valuable insight into users may greatly help your company. Do not limit yourself to the above instructions; you are encouraged to use more methods to strengthen your findings and theories. The most important thing to keep in mind is to have fun with the data.
You’ll need to submit a jupyter notebook (.ipynb) file for this part of the assignment. Conclusion and findings can be written in the markdown cells. Structure your code and markdown reasonably among cells and make sure every cell is runnable.
A sample structured notebook (inf552_hw4_yelp_handout.ipynb) is given to you. You can develop your code based on the given structure.
You can use your local jupyter notebook server or any other server to finish this assignment, but it is best if you use the Kaggle Kernel, so that you don’t need to download the datasets (they are really huge) and it also reduces the difficulty of grading. To use the Kaggle kernel, simply click New Notebook on the Yelp dataset page.
2.3 Suggested Libraries
There are some libraries that you may need to use:
numpy.cov or pandas.cov for the covariance matrix.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html
sklearn.decomposition.PCA for PCA.
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
sklearn.preprocessing.StandardScaler for standardization.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
sklearn.cluster for clustering.
https://scikit-learn.org/stable/modules/clustering.html
The above libraries are only for reference; it is not mandatory to use them in this part of the assignment.
2.4 Submission
Include your Firstname_Lastname_Yelp.ipynb in Firstname_Lastname_HW4.zip and submit it to Blackboard.
2.5 Rubric
50 points in total
Program correctness (20 points): Demonstrated correct use of PCA and at least one clustering method.
Documentation (10 points): Code is well commented and well structured.
Report (20 points): The clusters are reasonable and well explained. The findings are insightful.