In this miniproject, you will be exploring two datasets. The goal is to gain experience in deploying basic supervised machine learning techniques to tackle a real-world data science problem. In particular, the project encourages you to explore preprocessing of the data, the effect of hyper-parameters, size of the dataset, and performing model selection. You are encouraged to explore techniques you have learned in class to visualize the data and thereafter form a hypothesis about possible patterns in the data.
Preprocessing
Your first task is to acquire the data, analyze it, and clean it (if necessary). You will use two datasets in this project, outlined below.
• Dataset 1 (Adult dataset): This dataset presents several attributes of different individuals and the prediction task is to determine whether someone makes over 50K a year. Download and read information about the dataset here.
• Dataset 2 (Your choice!): Select any dataset from UCI or related to your own research. We suggest selecting a dataset of appropriate size (not too small or too large) such that the experiments can be conducted effectively and efficiently.
The essential subtasks for this part of the project include:
1. Download the datasets. Hint: For clarity, in the Adult dataset, adult.data contains the training/validation data and adult.test contains the test data.
2. Load the datasets into Pandas dataframes or NumPy objects (i.e., arrays or matrices) in Python.
3. Clean the data. You should remove instances that have too many missing or invalid data entries.
4. Convert discrete variables into multiple variables using one-hot encoding. For an example on how to do this, check out "Encoding categorical features" in the scikit-learn documentation.
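The loading, cleaning, and one-hot encoding steps above can be sketched as follows. This is a minimal illustration on a toy DataFrame that mimics the Adult data format (in the real files, missing values are marked with "?"); the column names here are assumptions chosen for the example, not the full Adult schema.

```python
import pandas as pd
import numpy as np

# Toy rows mimicking the Adult data format; the real data would be loaded
# with, e.g., pd.read_csv("adult.data", header=None, skipinitialspace=True).
raw = pd.DataFrame({
    "age": [39, 50, 28, 45],
    "workclass": ["State-gov", "Private", "?", "Private"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "income": [">50K", "<=50K", ">50K", "<=50K"],
})

# Clean: treat "?" as missing and drop rows containing missing entries.
clean = raw.replace("?", np.nan).dropna()

# One-hot encode the discrete columns; numeric columns pass through unchanged.
features = pd.get_dummies(clean.drop(columns="income"),
                          columns=["workclass", "education"])
labels = (clean["income"] == ">50K").astype(int)
```

`pd.get_dummies` is one convenient route; scikit-learn's `OneHotEncoder` (see "Encoding categorical features" in its documentation) is the other standard option and integrates with pipelines.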
Experiments
In this part, you will compare two supervised learning frameworks, namely K-nearest neighbours (KNN) and decision trees, to predict whether the income of an adult exceeds $50K/yr. A similar analysis should be performed for the second dataset. The specific subtasks for this part include:
1. Implement and perform 5-fold cross validation on the training/validation data (for the Adult dataset, this data is contained in the adult.data file) to optimize hyperparameters for both models. Your implementation for cross-validation should be from scratch. You should not use existing packages for cross validation. Report the mean of the training and validation metrics for the given hyperparameters.
2. Sample growing subsets of the training/validation data and repeat step 1. We want to understand how the size of a dataset impacts both the training and validation error.
3. Take the best performing model (the one with the best performance on 5-fold cross validation) and apply it on the test set (in the Adult dataset, this is the adult.test file). This is an unbiased estimate of how your model would perform on new/unseen data.
4. [Optional] Go above and beyond! Examples: try different normalization techniques or other ways of handling missing data (search "data imputation" techniques); employ more sophisticated techniques for hyper-parameter search; engineer new features out of existing ones to get better performance; investigate which features are the most useful (e.g., by correlating them with your predictions or removing them from your data).
5. Analyze your findings; how did the choice of the various hyper-parameters impact generalization? How about the size of training data? If any of these findings do not agree with your expectation, you can form hypotheses and further investigate them.
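A from-scratch 5-fold cross-validation loop (step 1 above) can be sketched as below. The data here is synthetic stand-in data; in the project you would pass in the preprocessed Adult features, and the same helper can be reused on growing training subsets for step 2. The function name `kfold_cv` and the hyperparameter values swept are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def kfold_cv(model, X, y, k=5, seed=0):
    """From-scratch k-fold CV: returns mean train / validation accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle before splitting
    folds = np.array_split(idx, k)         # k roughly equal folds
    train_accs, val_accs = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        train_accs.append(model.score(X[train_idx], y[train_idx]))
        val_accs.append(model.score(X[val_idx], y[val_idx]))
    return np.mean(train_accs), np.mean(val_accs)

# Synthetic stand-in data; replace with the preprocessed Adult features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hyperparameter sweep: report mean train/validation accuracy per setting.
for n_neighbors in (1, 5, 15):
    tr, va = kfold_cv(KNeighborsClassifier(n_neighbors=n_neighbors), X, y)
    print(f"n_neighbors={n_neighbors}: train={tr:.3f}, val={va:.3f}")
```

Only the cross-validation splitting itself must be hand-written; the models themselves (KNN, decision trees) may come from a library unless your course instructions say otherwise.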
Deliverables
You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):
1. code.zip: Your entire code, which should consist of a Jupyter notebook file (.ipynb) and additional Python files (.py); the notebook should contain the main body of your code, where we can see and easily reproduce the plots in your report.
2. writeup.pdf: Your (max three pages) project write-up as a pdf (details below).
Project write-up
Your team must submit a project write-up that is a maximum of three pages (single-spaced, 11pt font or larger; minimum 1 inch margins, an extra page for references/bibliographical content can be used). We highly recommend that students use LaTeX to complete their write-ups. You have some flexibility in how you report your results, but you must adhere to the following structure and minimum requirements:
Abstract (100-250 words)
Summarize the project task and your most important findings. For example, include sentences like "In this project we investigated the performance of two classification models, namely k-nearest neighbours and decision trees, on predicting if the income of an adult exceeds $50K/yr from various factors, such as age, sex, nationality, etc...", "We found that the k-nearest neighbours approach achieved worse/better accuracy than decision trees and was significantly faster/slower to train."
Introduction (5+ sentences)
Summarize the project task, the two datasets, and your most important findings. This should be similar to the abstract but more detailed. You should include background information and potential citations to relevant work, if any (e.g., other papers analyzing these datasets).
Datasets (5+ sentences)
Very briefly describe the datasets and how you processed them. How did you handle the missing data? If you have come up with new features to get better results, you should explain it here. Present your efforts toward better understanding of the data, e.g., through visualization plots.
Results (7+ sentences, possibly with figures or tables)
Describe the results of all the experiments as well as any other interesting results you find. Elements we expect to see:
1. Comparing performance between KNN and decision trees
2. Revealing how changing hyperparameters affects performance for both models
3. Describing how reducing the amount of data impacts results
Discussion and Conclusion (5+ sentences)
Summarize the key takeaways from the project and possibly directions for future investigation.
Statement of Contributions (1-3 sentences)
State the breakdown of the workload across the team members.
Evaluation
The mini-project is out of 10 points, and the evaluation breakdown is as follows:
• Completeness (2 points)
– Did you submit all the materials?
– Did you run all the required experiments?
– Did you follow the guidelines for the project write-up?
• Correctness (4 points)
– Are your cross-validation schemes implemented correctly?
– Are your models used/implemented correctly?
– Are your visualizations informative and visually appealing?
– Are your reported accuracies close to (our internal) reference solutions?
• Writing quality (2.5 points)
– Is your report clear and free of grammatical errors and typos?
– Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of related work in the introduction)?
– Do you effectively present numerical results (e.g., via tables or figures)?
• Originality / creativity (1.5 points)
– Did you go beyond the bare minimum requirements for the experiments?
– Within the context of producing the required results, did you propose a creative idea?
– Note: Simply adding in a random new experiment will not guarantee a high grade on this section! You should be thoughtful and organized in your report in explaining why you performed an additional experiment and how it helped in evaluating your hypothesis.
Final Remarks
You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills. You don't need to restrict yourself to the requirements listed above; feel free to go beyond, and explore further.
You can discuss methods and technical issues with members of other teams, but you cannot share any code or data with other teams.