Please submit your homework report to iLMS before 23:59 on May 16. A link to a shared Google document, 'hw2 demo registration', will be announced later for you to reserve a time slot for an individual demonstration with the TA. You are encouraged to consult or collaborate with other students while solving the homework problems; however, you are required to turn in your own version of the report and programs, written in your own words, with supporting materials. Copying will not be tolerated.
Aim
Please classify the given image patches with various methods taught in class, up to Chapter Seven of the textbook. Beyond the necessary data preprocessing such as scaling and normalizing, Homework #2 asks you to practice cross-validation and ensemble methods. You may apply new methods or use new packages to improve the classification performance, but if you do so, you have to give a brief introduction of the key concepts and provide the necessary citations, instead of just directly copying, pasting, or importing. However, in this assignment you are not allowed to use any neural-network-related models (e.g., multilayer perceptron, CNN, etc.). If any neural-network-related method is applied, you will receive no credit. Once an algorithm package is merged or imported into your code, please list the package link in your references and describe its mathematical concepts in your report, followed by the reason for adopting it.
Dataset Description
The DeepSat (SAT-6) Airborne Dataset is downloaded from https://www.kaggle.com/crawford/deepsat-sat6 [1][2]. In order to save storage space and speed up the learning process, only a portion of the original dataset, labeled with 'building', 'grassland', and 'road', is given in this assignment. Each picture is a 28*28 pixel 4-band (red, green, blue and near-infrared) image.
The whole dataset is saved as ‘CSV’ files. Here is the dataset format:
X_*.csv : 4-band ('R'ed, 'G'reen, 'B'lue and near 'I'nfrared) image data.
- Each cell represents one pixel value, from 0 to 255, in the 'R'ed, 'G'reen, 'B'lue or near 'I'nfrared band.
- Each row is a separate 28*28 pixel 4-band image, stored as a 1-D array whose entries are indexed by {color}{rowIdx}{colIdx}: [R00, R01, ..., R2727, G00, ..., G2727, B00, ..., B2727, I00, ..., I2727].
- Note that there is no header row in the 'CSV' files.
y_*.csv : label data, where the row indexing matches that of X_*.csv. Each label is a 1x3 one-hot encoded vector standing for 'building', 'grassland' and 'road'.
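For reference, the format above can be loaded with a minimal Python sketch like the following. The filenames X_train.csv and y_train.csv are only placeholders for whatever X_*.csv and y_*.csv files you actually received, and pandas/NumPy are assumed to be available.

    import numpy as np
    import pandas as pd

    # Placeholder filenames; substitute the actual X_*.csv / y_*.csv you were given.
    X = pd.read_csv('X_train.csv', header=None).values         # shape: (n_samples, 28*28*4)
    y_onehot = pd.read_csv('y_train.csv', header=None).values  # shape: (n_samples, 3)

    # Per the row layout above, the four bands are stored back to back (R, G, B, I),
    # each in row-major order, so the band axis comes first when reshaping.
    images = X.reshape(-1, 4, 28, 28)   # (n_samples, band, row, col)
    labels = y_onehot.argmax(axis=1)    # 0: 'building', 1: 'grassland', 2: 'road'

    print(images.shape, labels.shape)

Scaling the pixel values (e.g., dividing by 255) can then be done as part of your preprocessing.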
You may refer to others' code on Kaggle [3]. If you are interested, you may also adapt your code or ipynb to the full dataset and submit it to Kaggle.
Submission Format
You have to submit a compressed file hw2_studentID.zip which contains the following files:
hw2_studentID.ipynb: the detailed report, including Python code, results, discussion and mathematical descriptions;
hw2_studentID.tplx: extra LaTeX-related settings, including the bibliography;
hw2_studentID.bib: citations in BibTeX format;
hw2_studentID.pdf: the PDF version of your report, exported from your ipynb with
%% jupyter nbconvert --to latex --template hw2_studentID.tplx hw2_studentID.ipynb
%% pdflatex hw2_studentID.tex
%% bibtex hw2_studentID
%% pdflatex hw2_studentID.tex
%% pdflatex hw2_studentID.tex
Other files or folders, in a workable path hierarchy relative to your Jupyter notebook (ipynb).
Coding Guidelines
For the purpose of the individual demonstration with the TA, you are required to create a function in your Jupyter notebook, as specified below, that reduces the data dimensionality, learns a classification model, and evaluates the performance of the learned model.
PipelineModel = hw2_studentID_demo(in_x, in_label, mode)
- in_x: [string] CSV file for 'data'.
- in_label: [string] None, or the CSV file for 'label', which contains the labels of the corresponding instances in in_x.
- mode: [string] 'train' for building models; 'test' for using the built model to evaluate performance.
This function should return the best model trained with cross-validation in your program, and this pipeline model should also be set as a global variable. Please note that the HW2 demonstration will be graded based on the final ranking of accuracy, and every demonstration should be completed within the selected time slot. A minimal sketch of this function is given at the end of this section.
If mode='train', please return a PipelineModel trained via cross-validation in your program. When mode='test', please dump the results to the following files:
hw2_studentID_results.csv: the predicted labels, saved in the same format as the file assigned to in_label when mode='train'.
hw2_studentID_performance.csv: the 'accuracy' in '%' as a single float, without any extra string characters.
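For concreteness, a minimal skeleton of this function might look as follows. It is only a sketch under stated assumptions: 'studentID' is a placeholder for your own ID, pandas and scikit-learn are assumed to be available, and the single PCA + AdaBoost pipeline stands in for the full set of pipelines required in the Basic Requirement below.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    PipelineModel = None  # global variable holding the best trained pipeline

    def hw2_studentID_demo(in_x, in_label, mode):
        global PipelineModel
        X = pd.read_csv(in_x, header=None).values

        if mode == 'train':
            y = pd.read_csv(in_label, header=None).values.argmax(axis=1)
            # One candidate pipeline; the real program should compare several.
            pipe = Pipeline([('pca', PCA()), ('clf', AdaBoostClassifier())])
            grid = {'pca__n_components': [20, 50], 'clf__n_estimators': [50, 100]}
            search = GridSearchCV(pipe, grid, cv=5, verbose=1)
            search.fit(X, y)
            PipelineModel = search.best_estimator_
            return PipelineModel

        # mode == 'test': assumes the model was trained earlier in the session.
        pred = PipelineModel.predict(X)
        onehot = np.eye(3, dtype=int)[pred]  # same one-hot format as the y_*.csv files
        pd.DataFrame(onehot).to_csv('hw2_studentID_results.csv',
                                    header=False, index=False)
        if in_label is not None:
            y_true = pd.read_csv(in_label, header=None).values.argmax(axis=1)
            acc = 100.0 * float((pred == y_true).mean())  # accuracy in %
            pd.DataFrame([acc]).to_csv('hw2_studentID_performance.csv',
                                       header=False, index=False)
        return PipelineModel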
Report Requirement
List the names of the packages used in your program;
Describe the pipeline combinations in your program;
Describe the cross-validation methods in your program;
For better explanation, draw flowcharts of the methods or procedures used in the program;
Describe the mathematical concepts of any new algorithms or models employed as well as the roles they play in your feature selection/extraction or classification task in Markdown cells [4];
Discuss the performance among different classifiers with/without feature selection/extraction.
5.1 Basic Requirement
Combine feature engineering and classifiers into pipelines [5]. In your program, the pipeline combinations should cover at least 3 different feature-engineering methods and 6 different classifiers, including bagging and AdaBoost, so there will be more than one pipeline in your program. Some classifiers can also act as feature-engineering steps; in that case, you may need SelectFromModel [6] to merge them into the feature-engineering part of your pipeline (see the sketch at the end of this subsection).
Apply a cross-validation method to find better parameter combinations for each pipeline. If you apply GridSearchCV and the program stalls for a long time, please remove the n_jobs setting. In addition, you can set verbose to make sure your cross-validation is still running; the sketch below illustrates both points.
Please make sure hw2_studentID_demo is functional and returns the trained pipeline model with the highest accuracy when mode='train'.
If you apply new methods or use new packages to improve the classification performance, you have to give a brief introduction of the key concepts and provide the necessary citations/links, instead of just directly copying, pasting, or importing.
Please submit your ‘report’ in English. Be aware that a ‘report’ is much more than a ‘program.’
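To make the pipeline and cross-validation requirements above more concrete, here is a minimal sketch that builds several feature-engineering/classifier pipelines and tunes each with GridSearchCV. It is only an illustration under stated assumptions: only three classifiers are shown (fewer than the six required), the parameter grids are placeholders, and X_train and y_train are assumed to have been loaded as in the earlier loading sketch.

    from sklearn.base import clone
    from sklearn.decomposition import PCA
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Example feature-engineering steps; a tree-based model becomes a feature
    # selector once wrapped in SelectFromModel.
    feature_steps = {
        'pca': PCA(n_components=50),
        'scale': StandardScaler(),
        'select_rf': SelectFromModel(RandomForestClassifier(n_estimators=100)),
    }

    # Example classifiers, including the required bagging and AdaBoost ensembles.
    classifiers = {
        'bagging': BaggingClassifier(),
        'adaboost': AdaBoostClassifier(),
        'rf': RandomForestClassifier(),
    }

    # Placeholder parameter grids for the classifier step of each pipeline.
    param_grids = {
        'bagging': {'clf__n_estimators': [10, 30]},
        'adaboost': {'clf__n_estimators': [50, 100]},
        'rf': {'clf__n_estimators': [100, 200]},
    }

    results = {}
    for f_name, step in feature_steps.items():
        for c_name, clf in classifiers.items():
            pipe = Pipeline([('feat', clone(step)), ('clf', clone(clf))])
            # verbose shows progress; n_jobs is left at its single-process default
            # to avoid the stalls mentioned above.
            search = GridSearchCV(pipe, param_grids[c_name], cv=5, verbose=1)
            search.fit(X_train, y_train)  # assumed loaded beforehand
            results[f'{f_name}+{c_name}'] = (search.best_score_, search.best_estimator_)

    best_name = max(results, key=lambda k: results[k][0])
    print(best_name, results[best_name][0])

The best estimator found this way is what hw2_studentID_demo should store in the global PipelineModel and return when mode='train'.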
References
[1] DeepSat (SAT-6) airborne dataset on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6. Accessed: 2018-05-01.
[2] SAT-4 and SAT-6 airborne datasets. http://csc.lsu.edu/~saikat/deepsat/. Accessed: 2018-05-01.
[3] Others' code on Kaggle. https://www.kaggle.com/crawford/deepsat-sat6/kernels. Accessed: 2018-05-01.
[4] Markdown. https://daringfireball.net/projects/markdown/basics. Accessed: 2018-03-29.
[5] Pipeline and FeatureUnion: combining estimators. http://scikit-learn.org/stable/modules/pipeline.html. Accessed: 2018-05-07.
[6] sklearn.feature_selection.SelectFromModel. http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html. Accessed: 2018-05-07.