1 Overview
In this THE, you will use Weka to run experiments on several datasets. The aim is to make you familiar with certain machine learning algorithms and with Weka itself. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset through the Weka desktop application or called from your own Java code. You are expected to use the latest stable release of Weka in this THE.
2 Questions
You are expected to answer 5 questions. The datasets used in these questions are provided inside THE2 datasets.zip. You can download the zip file from Odtuclass.
2.1 Handling Missing Values (15 Pts.)
In this question, you are given the "labor.arff" file. In this dataset, values are missing for various attributes, and you are expected to replace them. For this purpose, you are going to use the ReplaceMissingValues function in Weka by following these steps:
Open Weka-Explorer.
While the Preprocess tab is active, click Open file and select labor.arff.
In the Filter section, choose Filters → Unsupervised → Attribute → ReplaceMissingValues. Run the filter with the default parameters by clicking Apply.
After applying the ReplaceMissingValues function, the statistics of the dataset are expected to change. Answer the following questions, based on the changes:
1. Which method(s) did ReplaceMissingValues use to replace missing values? (2 pts)
2. When you compare the statistics of all attributes after applying the function with the raw statistics, which statistic(s) have changed? (3 pts)
3. How was the dataset affected by the ReplaceMissingValues function in terms of the changes in its statistics? Discuss briefly for the following attributes only: "duration", "standby-pay", "wage-increase-third-year", "wage-increase-first-year". (4 pts)
4. Is the ReplaceMissingValues function suitable for replacing missing values? Discuss briefly for the following attributes only: "duration", "standby-pay", "wage-increase-third-year", "wage-increase-first-year". If you think it is not suitable for some attribute(s), briefly discuss why. (6 pts)
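To make the idea of replacing missing values concrete, the sketch below shows one simple strategy for a numeric attribute: substituting the mean of the observed values. It is only an illustration. The class name, the method name, and the use of NaN to mark a missing entry are assumptions made for this example, and the sketch is not a statement of what Weka's ReplaceMissingValues does internally.

```java
import java.util.Arrays;

// Illustrative sketch only: names are hypothetical and not part of Weka.
final class MeanImputationSketch {

    // Returns a copy of `values` in which every NaN (used here to mark a
    // missing value) is replaced by the mean of the observed values.
    static double[] imputeWithMean(double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        if (observed == 0) {
            throw new IllegalArgumentException("no observed values to average");
        }
        double mean = sum / observed;

        double[] result = values.clone();
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical attribute column; NaN marks a missing entry.
        double[] column = {2.0, Double.NaN, 4.0, 6.0, Double.NaN};
        System.out.println(Arrays.toString(imputeWithMean(column)));
    }
}
```

Comparing the statistics of the column before and after such a replacement is a useful way to think about questions 2 and 3.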
2.2 Discretization (15 Pts.)
In this question, you are given the "diabetes.arff" file. You are expected to discretize all the attributes of this dataset using Weka, with either equal-width binning or equal-depth (equal-frequency) binning. You can do this in Weka using the following steps:
Open Weka-Explorer.
While the Preprocess tab is active, click Open file and select the "diabetes.arff" file.
In the Filter section, choose Filters → Unsupervised → Attribute → Discretize.
Run the filter with the default parameters by clicking Apply.
After applying the Discretize function, the dataset is expected to change. Answer the following questions, based on the changes:
1. Which method did Discretize use to discretize attribute values? (2 pts)
2. Compare the original distribution of the "preg" attribute with its distribution after applying the function, and explain one difference you observe. (3 pts)
3. Assume you are given a hypothetical dataset called "health" and that all the values of its "age" attribute are as follows: 24, 15, 25, 28, 4, 21, 8, 26, 9, 21, 34, 29. You are expected to apply binning methods on this attribute:
a. Apply the equal-width binning method to discretize the values, where bin size is 3. Show the resulting bins. (2.5 pts)
b. Apply the equal-depth binning method to discretize the values. Show the resulting bins. (2.5 pts)
4. What is the difference between the equal-depth and equal-width binning methods? Which of the two would you prefer when working with numerical attributes? (5 pts)
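The two binning schemes named above can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the parameter is taken to be the number of bins, and the input values are invented for the example (they are deliberately not the "age" values from question 3). Weka's Discretize filter may handle boundaries and ties differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names are hypothetical and not part of Weka.
final class BinningSketch {

    private static List<List<Integer>> emptyBins(int bins) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < bins; i++) {
            result.add(new ArrayList<>());
        }
        return result;
    }

    // Equal-width: split the range [min, max] into `bins` intervals of the
    // same width and place each value in the interval that contains it.
    static List<List<Integer>> equalWidth(int[] values, int bins) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int min = sorted[0];
        int max = sorted[sorted.length - 1];
        double width = (max - min) / (double) bins;
        List<List<Integer>> result = emptyBins(bins);
        for (int v : sorted) {
            int index = (width == 0.0) ? 0 : (int) ((v - min) / width);
            // The maximum value falls exactly on the upper edge; keep it in the last bin.
            result.get(Math.min(index, bins - 1)).add(v);
        }
        return result;
    }

    // Equal-depth: sort the values, then give every bin (almost) the same
    // number of values, regardless of how wide the resulting intervals are.
    static List<List<Integer>> equalDepth(int[] values, int bins) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        List<List<Integer>> result = emptyBins(bins);
        for (int i = 0; i < sorted.length; i++) {
            // Maps sorted position i to a bin so that bin sizes differ by at most one.
            result.get(i * bins / sorted.length).add(sorted[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 10, 20}; // hypothetical data
        System.out.println("equal-width: " + equalWidth(values, 2));
        System.out.println("equal-depth: " + equalDepth(values, 2));
    }
}
```

With the skewed example data, the equal-width bins end up with very different counts, while the equal-depth bins have equal counts but very different ranges.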
2.3 Feature Reduction (20 Pts.)
In this question, you are given the "vehicles silhouettes.arff" file, which originates from the UCI Machine Learning Repository at the following link: http://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29
You are expected to apply a well-known feature reduction technique, Principal Component Analysis (PCA), to the given dataset using Weka. You can do this in Weka using the following steps:
Open Weka-Explorer.
While the Preprocess tab is active, click Open file and select vehicles silhouettes.arff.
Click the Select attributes tab. Select PrincipalComponents in the Attribute Evaluator section. If an alert window appears asking you to switch to the Ranker search method, click Yes.
Run the analysis with the default parameters by clicking Start.
On the right-hand side, in the Attribute Selection Output Section, the results of the analysis are given.
Answer the following questions according to the results given in the Attribute Selection Output Section:
1. Using the correlation matrix, which attributes have the highest positive correlation? (3 pts)
2. Using the correlation matrix, which attributes have the highest negative correlation? (3 pts)
3. Below the correlation matrix, the output lists numerical values under the titles "Eigenvalue", "Proportion" and "Cumulative". What do they mean? Briefly explain. (6 pts)
4. Discuss the results of this analysis. Do you think this dataset is suitable for feature reduction? Justify your answer. (8 pts)
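As background for reading the correlation matrix, the sketch below computes the Pearson correlation coefficient between two attributes, which is the usual quantity reported in such a matrix. The class name, method name, and sample values are hypothetical and chosen only for illustration. The code is a plain-Java sketch and does not reproduce Weka's own implementation.

```java
// Illustrative sketch only: names are hypothetical and not part of Weka.
final class CorrelationSketch {

    // Pearson correlation of two equally long columns. The result lies in
    // [-1, 1]; it is NaN if either column has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two columns of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};       // hypothetical attribute
        double[] rising = {2, 4, 6, 8};  // moves with a
        double[] falling = {8, 6, 4, 2}; // moves against a
        System.out.println("a vs rising:  " + pearson(a, rising));
        System.out.println("a vs falling: " + pearson(a, falling));
    }
}
```

Values near +1 indicate attributes that increase together, and values near -1 indicate attributes that move in opposite directions, which is what questions 1 and 2 ask you to locate.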
2.4 Multilayer Perceptron (25 Pts.)
In this question, you are given the "vehicles silhouettes.arff" file that you already used in the previous question. You are expected to apply the Multilayer Perceptron (MLP) classifier using Weka. You can do this in Weka using the following steps:
Open Weka-Explorer.
While the Preprocess tab is active, click Open file and select vehicles silhouettes.arff.
Click the Classify tab. Select Classifiers → Functions → MultilayerPerceptron.
By clicking on the MultilayerPerceptron name in the Classifier section, you can view/change the default parameters of MultilayerPerceptron.
In the Test options section, select Percentage split and set it to 66% (hence 66% of the dataset is used for training and the rest for testing).
Run the classifier with the default parameters by clicking Start.
On the right-hand side, in the Classifier Output Section, the results of the analysis are given. Answer the following questions according to the results given in the Classifier Output Section:
1. How many hidden layers and hidden nodes are created? (2 pts)
2. Did Weka normalize the attributes? What is the effect of normalizing the attributes? (3 pts)
3. What is the benefit of splitting the dataset into a training set and a test set? Why don't we just train our model with the whole data? (3 pts)
4. Which halting strategy did MLP use? (2 pts)
5. What is the detailed accuracy table by class of the run? (5 pts)
Now change the configuration of the MLP by clicking on the name of the classifier. Run the classification task with different training times (100, 500, 1000, 5000) while keeping the other parameters the same. You are expected to carry out/answer the following:
6. Plot test accuracy against training time. (5 pts)
7. Interpret the accuracy plot: What is the relation between accuracy and training time (epoch count)? What may cause this situation? (5 pts)
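The percentage split used in this question can be pictured as shuffling the instances and cutting the list at 66%. The sketch below does exactly that on a list of placeholder items. The class name, method name, seed, and the choice to round the training size are assumptions made for this example; Weka's own shuffling and rounding may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch only: names are hypothetical and not part of Weka.
final class PercentageSplitSketch {

    // Shuffles a copy of `data` and returns two lists: index 0 holds the
    // training part (trainPercent of the items), index 1 holds the test part.
    static <T> List<List<T>> split(List<T> data, double trainPercent, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));

        int trainSize = (int) Math.round(shuffled.size() * trainPercent / 100.0);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // 100 placeholder "instances" standing in for rows of a dataset.
        List<Integer> instances = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            instances.add(i);
        }
        List<List<Integer>> parts = split(instances, 66.0, 1L);
        System.out.println(parts.get(0).size() + " train / " + parts.get(1).size() + " test");
    }
}
```

Because the test part is never seen during training, accuracy measured on it estimates how the model behaves on new data, which is the quantity you plot against training time.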
2.5 Support Vector Machine (25 Pts.)
In this question, you are given the "vehicles silhouettes.arff" file that you already used in the previous questions. You are expected to apply the Support Vector Machine (SVM) classifier using Weka. You can do this in Weka using the following steps:
Open Weka-Explorer.
While the Preprocess tab is active, click Open file and select vehicles silhouettes.arff.
Click the Classify tab. Select Classifiers → Functions → SMO.
By clicking on the SMO name in the Classifier section, you can view/change the default parameters of SVM.
In the Test options section, select Percentage split and set it to 66% (hence 66% of the dataset is used for training and the rest for testing).
Run the classifier with the default parameters by clicking Start.
On the right-hand side, in the Classifier Output Section, the results of the analysis are given. Answer the following questions according to the results given in the Classifier Output Section:
1. Report the summary and the detailed accuracy by class. (3 pts)
2. Explain the C parameter of SVM. Change this parameter and run the classifier with various values. Plot and report the effects. (10 pts)
3. Explain the terms maximum margin hyperplane and support vector. (6 pts)
4. When we run SVM in Weka, it uses a kernel function by default. What is a kernel function? What is the benefit of using a kernel function? Explain clearly. (6 pts)
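The "detailed accuracy by class" table requested in question 1 includes per-class measures such as precision and recall, which are derived from the confusion matrix. The sketch below shows how these two measures follow from a matrix whose rows are actual classes and whose columns are predicted classes. The class name, method names, and the two-class matrix are hypothetical and are not taken from any Weka run.

```java
// Illustrative sketch only: names and numbers are hypothetical, not Weka output.
final class PerClassMetricsSketch {

    // Precision for class `cls`: of all instances predicted as `cls`
    // (column `cls`), the fraction that really belong to it.
    static double precision(int[][] confusion, int cls) {
        int predictedAsCls = 0;
        for (int[] row : confusion) {
            predictedAsCls += row[cls];
        }
        return predictedAsCls == 0 ? 0.0 : confusion[cls][cls] / (double) predictedAsCls;
    }

    // Recall for class `cls`: of all instances that really belong to `cls`
    // (row `cls`), the fraction that were predicted as it.
    static double recall(int[][] confusion, int cls) {
        int actuallyCls = 0;
        for (int count : confusion[cls]) {
            actuallyCls += count;
        }
        return actuallyCls == 0 ? 0.0 : confusion[cls][cls] / (double) actuallyCls;
    }

    public static void main(String[] args) {
        // confusion[actual][predicted] for a made-up two-class problem.
        int[][] confusion = {
            {8, 2},
            {1, 9}
        };
        for (int cls = 0; cls < confusion.length; cls++) {
            System.out.println("class " + cls
                    + ": precision=" + precision(confusion, cls)
                    + ", recall=" + recall(confusion, cls));
        }
    }
}
```

Computing these values by hand for one class of your own confusion matrix is a quick way to check that you are reading the table correctly.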
3 Submission and Regulations
1. For each task, create a directory named q1, q2, ..., q5. All of your solutions, comments, and plots for a task should be inside the corresponding directory. If your directory structure is messy, you will be penalized.
2. Zip all task directories, name the archive <ID> <FullNameSurname>, and submit it through odtuclass. For example:
e1234567 MuratOzturk.zip
3. Copying from others is strictly forbidden and is subject to disciplinary action.
Note: Any extra effort will be rewarded. Late submissions will not be accepted.