1 What should I submit, where should I submit and by when?
Your submission for this assignment will be one PDF (.pdf) file and one zip (.zip) file. Instructions on how to prepare and submit these files are given below.
Assignment Package:
http://web.cse.iitk.ac.in/users/purushot/courses/ml/2019-20-a/material/assn2.zip
Deadline for all submissions: 26 October, 9:59PM IST
There is no provision for "late submission" for this assignment.
1.1 How to submit the PDF file
The PDF le must be submitted using Gradescope in the group submission mode. Note that this means that auditors may not make submissions to this assignment.
Make only one submission per assignment group on Gradescope. Gradescope allows you to submit in groups - please use this feature to make a group submission.
To clarify the above, for example, there are, as of today i.e. Sep 20 2019, 49 assignment groups. We wish to have only 49 submissions on Gradescope, i.e. one submission per assignment group, not one submission per student.
Link all group members in your group submission. If you miss out on a group member while submitting, that member may end up getting a zero since Gradescope will think that person never submitted anything.
You may overwrite your group’s submission (submitting again on Gradescope simply overwrites the old submission) as many times as you want before the deadline.
Do not submit Microsoft Word or text files. Prepare your PDF file using the style file we have provided (instructions on formatting given later).
1.2 How to submit the ZIP file
Password protect your ZIP file using a password with 8-10 characters. Use only alphanumeric characters (a-z A-Z 0-9) in your password. Do not use special characters, punctuation marks, whitespace etc. in your password. Name your ZIP file submit.zip. Specify the file name properly in the Google form.
Remember, your file is not under attack from hackers with access to supercomputers. This is just an added security measure so that even if someone guesses your submission URL, they cannot see your code immediately. A length-10 alphanumeric password (that does not use dictionary phrases and is generated randomly, e.g. 2x4kPh02V9) provides you with more than 55 bits of security. It would take more than 1 million years to go through all 2^55 combinations at 1K combinations per second.
Make sure that the ZIP file does indeed unzip when used with that password (try unzip -P your-password file.zip on Linux platforms). If we are unable to unzip your file, you will lose marks.
Upload the password protected ZIP file to your IITK (CC or CSE) website (for CC, log on to webhome.cc.iitk.ac.in, for CSE, log on to turing.cse.iitk.ac.in).
Fill in the following Google form to tell us the exact path to the file as well as the password: https://forms.gle/7RAbjxLPxVAoPs8Q9
Do not host your ZIP submission file on file-sharing services like Dropbox or Google Drive. Host it on IITK servers only. We will be using a script to autodownload your submissions and, if we are not careful, Dropbox or Google Drive URLs may send us an HTML page (instead of your submission) when we try to autodownload your file. Thus, it is best to host your code submission file locally within IITK servers.
While filling in the form, you have to provide us with the password to your ZIP file in a designated area. Write just the password in that area. For example, do not write "Password: helloworld" in that area if your password is "helloworld". Instead, simply write "helloworld" (without the quotes) in that area. Remember that your password should contain only letters and numerals, no spaces, special or punctuation characters.
While filling the form, give the complete URL to the file, not just to the directory that contains that file. The URL should contain the filename as well.
Example of a proper URL: https://web.cse.iitk.ac.in/users/purushot/mlassn2/submit.zip
Example of an improper URL (file name missing): https://web.cse.iitk.ac.in/users/purushot/mlassn2/
Example of an improper URL (incomplete path): https://web.cse.iitk.ac.in/users/purushot/
We will use an automated script to download all your files. If your URL is malformed or incomplete, or if you have hosted the file outside IITK and it is difficult for us to download it automatically, then we will not make any effort to manually locate your file or contact you for clarifications. We will simply give your group a zero in these cases.
Make sure you fill in the Google form with your file link before the deadline. We will close the form at the deadline.
Make sure that your ZIP file is actually available at that link at the time of the deadline. We will run a script to automatically download these files after the deadline is over. If your file is missing, we will simply assign your group zero marks.
We will entertain no submissions or mercy appeals over email, Piazza etc. All submissions must take place before the stipulated deadline via Gradescope and the Google form. The PDF file must be submitted on Gradescope at or before the deadline and the Python file must be available at the link specified on the Google form at or before the deadline.
Problem 2.1 (My First Recommendation System). The problem of recommendation requires us to predict, for every user, which products/services they would like. We can obtain a simplified view of this problem by casting it as a multi-label classification problem. Simplified as it may be, this view is nevertheless very powerful and indeed widely used in production by actual companies in the e-commerce, social media and internet search sectors.
Users. Each user is represented as a d-dimensional feature vector x_i ∈ R^d. This feature vector is typically very sparse, i.e. only d̂ ≪ d of these coordinates are non-zero for an average user (some users may of course have more or fewer coordinates set to non-zero values in their representation). Nothing particularly specific is known about these d features except that they will always take non-negative values. These features were possibly arrived at by looking at normalized bag-of-"words" features where "words" may have been items these users viewed or bought in the past or their profile history, but we are not sure since the engineers who developed the feature extraction routines have left the company.
Items. There are a total of L items in the inventory out of which we wish to recommend items to the users. Each item is called a label. For each user i, we are told which items they like by giving us a label vector y_i ∈ {0, 1}^L for that user. For any item j ∈ [L], y_ij = 1 indicates that user i definitely likes item j. Note that a user may like multiple items, in which case all those coordinates would get turned on in the label vector for that user. This illustrates the difference between multi-class classification and multi-label classification. In multi-class settings, a data point can belong to exactly one of the possible classes, but in multi-label settings, a data point may be associated with one or more of the labels. Going by what is mentioned about the items above, the reader may be tempted to think that just as y_ij = 1 indicates that user i definitely likes item j, y_ij' = 0 should indicate that user i dislikes item j'. However, this is not true: y_ij' = 0 indicates that either the user did not express an interest when shown the item j' or else the user was never shown the item j' at all! In an e-commerce website with 10 million items, it is unreasonable to expect a user to have interacted with, and to have told us, all the items in which they are interested; all recommendation systems have to be careful about this fact.
Your Task. Your task is to learn a recommendation system model which, when given a user feature vector, can tell us the top 5 items (labels) in which that user would be most interested. Given a test user, your method should return a vector pred of length 5 with each entry between 0 and L − 1. The first entry of the vector, i.e. pred[0] (note the zero-based indexing), will be interpreted by us as the item your method thinks is most likely to be liked by the user, the second entry of the vector, i.e. pred[1], is the next most likely item to be liked by the user, and so on. Note that the first item is indexed as the 0th label and the last item is indexed as the (L − 1)th label (again in zero-based indexing). Thus, if you think the most suitable item for the user is the ninth item, you must set pred[0] = 8. Thus, we expect a ranking of the 5 most suitable items for that user. Note that you cannot cheat by repeating items in your recommendation vector, e.g. by setting pred[0] = pred[1] = . . . . Our evaluation code will automatically remove duplicates from your recommendations.
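For concreteness, the following is a tiny illustration of the expected pred format, assuming scores is a length-L array of per-item scores for a single test user (how such scores are produced is up to your method; the variable names here are purely illustrative).

import numpy as np

L = 3400
scores = np.random.rand(L)          # placeholder scores, one per item
pred = np.argsort(-scores)[:5]      # 5 distinct item indices, best first
# pred[0] is the most likely item, pred[4] the fifth most likely; all entries are
# zero-based, so the ninth item in the inventory would appear as the index 8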
Evaluation Measures. We will evaluate your method on our secret test data (which will have exactly the same value of d and L as the training data, i.e. exactly the same set of features and items) using the two kinds of performance measures described below.
Precision at k (prec@k): For this, we will first choose some k ∈ [5]. Then we will ask, for every test user, what fraction of the top k recommendations given by your method for that user were actually liked by that user. This will be a fractional number in [0, 1]. Taking the average of this number across all test users will give us the prec@k of your method.
Macro Precision at k (mprec@k): For this, first choose some k ∈ [5]. Then, we will go over each of the items j ∈ [L] and, for each item, look at all the test users that like that item. Then we will calculate the fraction of these users for whom your method did recommend item j in its top k recommendations. This will be a fractional number in [0, 1]. Taking the average of this number across all items will give us the mprec@k of your method.
The difference between prec@k and mprec@k arises largely due to the presence of rare items in the dataset, i.e. items that very few users like. You will see in your data itself that an average item is liked by just n̂ = 40 users whereas there are a total of n = 10000 users! While a method can get very high prec@k by just recommending popular items to everyone (akin to recommending an iPhone to everyone), such a method may do poorly on mprec@k, which gives a high score only if the method pays good attention to all items, not just the popular ones. A small illustrative sketch of how both measures can be computed is given below.
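The sketch below only illustrates the two definitions above and is not the official evaluation code (eval.py in the assignment package is authoritative). It assumes Y_true is an n x L csr_matrix of 0/1 relevance values and preds is an (n, 5) integer array of ranked item indices; items that no test user likes are skipped when averaging mprec@k, which is an assumption made here.

import numpy as np
from scipy.sparse import csr_matrix

def prec_at_k(Y_true: csr_matrix, preds: np.ndarray, k: int) -> float:
    # Fraction of the top-k recommendations that each user actually likes,
    # averaged over all test users
    n = Y_true.shape[0]
    hits = np.array([Y_true[i, preds[i, :k]].sum() for i in range(n)])
    return float(np.mean(hits / k))

def mprec_at_k(Y_true: csr_matrix, preds: np.ndarray, k: int) -> float:
    # For each item, the fraction of users who like it and also received it in
    # their top-k recommendations, averaged over all items
    n, L = Y_true.shape
    likes = np.asarray(Y_true.sum(axis=0)).ravel()   # number of users liking each item
    hits = np.zeros(L)
    for i in range(n):
        for j in preds[i, :k]:
            if Y_true[i, j] != 0:
                hits[j] += 1
    valid = likes > 0                                # skip items nobody likes (assumption)
    return float(np.mean(hits[valid] / likes[valid]))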
Your Data. You have been provided, in the assignment package, training data for n = 10000 users. Each user i is represented as a d = 16385-dimensional feature vector x_i. The feature vectors are sparse and the average number of non-zero features in a data point is only d̂ ≈ 519. There are a total of L = 3400 items and each user is associated with a label vector y_i ∈ {0, 1}^L. The label vectors are also quite sparse and an average user likes only L̂ ≈ 13.5 items on average. Routines have also been provided to you that read the data as well as show you how we will perform evaluation for your submitted code using prec@k and mprec@k.
Caution: all matrices in this problem (feature and label) are sparse and should be stored in compressed representation (csr_matrix from scipy). The data loading routine given to you as a part of the assignment package will return compressed matrices. Be careful about densifying these matrices (e.g. using toarray() from scipy) as you may run into memory issues.
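As a minimal sketch of how to work with these matrices in their compressed form, the snippet below computes a few basic statistics directly on csr_matrix objects without ever calling toarray(); X and Y are assumed to be the feature and label matrices returned by the data-loading routine.

import numpy as np
from scipy.sparse import csr_matrix

def sparse_stats(X: csr_matrix, Y: csr_matrix):
    # Average number of non-zero features per user, computed on the compressed
    # representation itself so that memory usage stays small
    avg_nnz_features = X.getnnz(axis=1).mean()
    # Average number of liked items per user
    avg_likes_per_user = Y.getnnz(axis=1).mean()
    # Popularity of each item, i.e. how many users like it
    likes_per_item = np.asarray(Y.sum(axis=0)).ravel()
    return avg_nnz_features, avg_likes_per_user, likes_per_item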
Code and Resource Repository. The following repository offers the state of the art in large-scale multi-label learning, with papers that describe various methods that do well on such problems, as well as readily available code for some of these methods: manikvarma.org/downloads/XC/XMLRepository.html
Suggested Methods. The following techniques are known to succeed in various forms on such large-scale multi-label problems:
LwP: in literature this is often used as a reranking step (reference [31] in the repository) but can be used as an algorithm in its own right as well. This involves learning one or more prototypes per item that try to capture prototypical users that like that item (a minimal sketch of this idea is given after this list).
OvA: in literature this is also known as binary relevance (references [32, 33, 34] in the repository) and this involves learning a binary classifier to predict, for every item, whether a user will like that item or not. One may alternatively use logistic regression to predict a probability score per item and rank items using that score. Note that this method somewhat contradicts our earlier observation that y_ij = 0 does not mean that user i dislikes item j. Nevertheless, this method is quite successful.
DT: two main types of decision trees are popular in literature for multi-label problems:
The first kind of approaches (examples are references [02, 31, 41] in the repository) use the decision tree to take a user to a leaf, where the labels of the training users who also reached that leaf are used to perform prediction.
The second kind of approaches (examples are references [04, 36] in the repository) use the decision tree to merely eliminate items that would not be of interest to that user. An OvA-style classifier or ranker is then used to decide, among the remaining items, which should be recommended.
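As referenced in the LwP item above, here is a minimal sketch of the prototype idea, assuming a single prototype per item computed as the normalized sum of the feature vectors of users who like that item; this is an illustrative assumption and not the exact algorithm of any particular reference in the repository.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def train_lwp(X: csr_matrix, Y: csr_matrix) -> csr_matrix:
    # One prototype (row) per item: sum of feature vectors of users liking that item
    W = Y.T @ X                              # (L x d), stays sparse throughout
    return normalize(W, norm="l2", axis=1)

def recommend_lwp(W: csr_matrix, X_test: csr_matrix, k: int = 5) -> np.ndarray:
    # Cosine-style similarity between each test user and every item prototype
    scores = (normalize(X_test, norm="l2", axis=1) @ W.T).toarray()
    # Indices of the k highest-scoring items per test user, best first
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)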
Apart from these, a few other methods such as output codes (just as we saw for multi-class classification), dimensionality reduction (a.k.a. embedding) methods, and deep learning methods are also popular. Several references are available on the repository for all such techniques (although code may not be readily available for all of them).
Take one reference from the following list (references [2, 5, 31, 32, 33, 34, 36, 40, 41] in the repository), give a citation to that paper and briefly describe the method in your own words. You do not have to give all minute algorithmic details (since we can always just read the paper itself) but do explain the salient points which are crucial to the algorithm.
Discuss a few advantages and disadvantages of the method you just described above.
Download an implementation of the method you chose (or, if you wish, implement it yourself). Caution: the implementations may be in C/C++. Train the method you chose on the training data we have provided and test it on the training data itself to give us the following statistics: prec@k and mprec@k for k = 1, 3, 5.
Suggest a method that you would use to solve the problem. This can either be
the method you chose to present in the previous parts with some modifications: in this case describe all modifications you made and relate them to the advantages and disadvantages you described in the previous part
the method you chose to present in the previous parts without any modifications: in this case you must justify your choice not to perform any modifications by enumerating the modifications you tried but which failed to improve performance
some method other than the one you chose in the previous parts (with or without modifications) or else a totally new method: in this case give all details of the new method, including all details of any modifications you propose to an existing method
Train your chosen method on the training data we have provided (using any validation technique you feel is good). Store the model you obtain in the form of any number of binary/text/pickled/compressed files, as is convenient, and write a prediction method in Python in the file predict.py that can take a new data point and use the model files you have stored to make predictions on that data point. Include all the model files, the predict.py file, as well as any library files that may be required to run the prediction code, in your ZIP submission. You are allowed to have subdirectories within the archive.
(10+10+20+10+50 marks)
Some Observations. You may use these to guide your exploration. DT methods usually offer fast prediction as well as good precision. However, their model size is usually large too. OvA methods usually offer very high precision values. However, they are usually slow at prediction. LwP methods usually offer high mprec@k since they do well on rare labels. However, they are usually slow at prediction as well. It is notable, however, that these are broad observations and need not apply to every single method in these families. For instance, DT techniques of the second kind (that prepare a shortlist of labels) can use LwP methods in every leaf, within the shortlisted labels, to obtain reasonable prediction speed as well as good accuracies on rare labels.
Code Usage from the Internet. Irrespective of which option you choose in part 5, as well as part 4, you have full freedom to download any code available online to perform training. This is allowed (without any penalty) even if you choose option 2 in part 5. Thus, you are being given full freedom to use code written by others and there will be absolutely no penalty even if you (are able to) directly use someone's code without making any changes to it. However, you must cite all code that you take from others by giving the names of the authors and the URLs from where you obtained said code. There is no penalty for using code for which you have cited the original author. However, there will be a heavy penalty for using someone's work and not giving them credit for it.
Marking Scheme. Parts 1-4 need to be answered in the PDF file itself and part 5 needs to be answered in the ZIP file submission. The 50 marks for part 5 will be awarded as follows: Total size of the submission, once unzipped (smaller is better): 10 marks
Total time taken to return top-5 recommendations for all test users (smaller is better): 10 marks
prec@1,3,5 (larger is better): 3 x 5 = 15 marks
mprec@1,3,5 (larger is better): 3 x 5 = 15 marks
How to Prepare the PDF File
Use the following style file to prepare your report: https://media.neurips.cc/Conferences/NeurIPS2019/Styles/neurips_2019.sty
For an example file and instructions, please refer to the following files: https://media.neurips.cc/Conferences/NeurIPS2019/Styles/neurips_2019.tex https://media.neurips.cc/Conferences/NeurIPS2019/Styles/neurips_2019.pdf
You must use the following command in the preamble
\usepackage[preprint]{neurips_2019}
instead of \usepackage[final]{neurips_2019} as the example file currently uses. Use proper LaTeX commands to neatly typeset your responses to the various parts of the problem. Use neat math expressions to typeset any mathematical derivations. All plots must be generated electronically - no hand-drawn plots will be accepted. All plots must have axis titles and a legend indicating what the plotted quantities are. Insert the plots into the PDF file using proper LaTeX \includegraphics commands.
How to Prepare the ZIP File
Your submission ZIP archive must contain a file called predict.py which we will call to get recommendations for test data points. The assignment package contains a sample submission file sample_submit.zip which shows you how you must structure your submission, as well as the way in which your prediction code predict.py must accept input and return recommendations. Do not change this input-output behavior, otherwise our autograder will not be able to process your recommendations and will award you zero marks.
Your ZIP archive itself must contain a Python file named predict.py which should not be contained inside any other directory within the archive. Apart from this file, your ZIP archive may contain any number of files or even directories and subdirectories.
Do not change the name predict.py to anything else. The file must contain a method called getReco which must take in the test features as a csr_matrix and a value k > 0 and return k recommendations for each test data point (a hedged skeleton of this structure is sketched at the end of this section).
There are no "non-editable regions" in any of the files for this assignment. So long as you preserve the input-output behavior, feel free to change the structure of the file as you wish. However, do not change the name of the file predict.py to anything else.
Code you download from the internet may be in C/C++. You are free to perform training in C/C++ etc. However, your prediction code must be a Python file called predict.py. This file may internally call C/C++ code but it is your job to manage that using extension modules etc. Do not expect us to call anything other than the single file predict.py.
The assignment package also contains a file called eval.py which is an example of the kind of file we will be using to evaluate the code that you submit. Before submitting your code, make sure to run eval.py and confirm that there are no errors etc.
Do not submit eval.py to us; we already have it with us. We have given you access to the file eval.py just to show you how we would be evaluating your code.
Make sure that running predict.py does not require us to install any Python libraries which are not available using pip. Libraries that are installable using pip are fine. However, if your prediction routine requires external code not available using pip, you must supply all that external code to us within your ZIP archive; we should not have to install it. To be sure, we will not install or download any libraries from places like GitHub, personal homepages, or other repositories (even the repository mentioned in the question itself) even if you want us to. Your archive itself should contain all such files.
Other than the above restrictions, you are free to import libraries as you wish, e.g. sklearn, scipy, pandas. Files for libraries not available on pip must be provided by you.
Once we have installed the pip libraries required by your code, we should just be able to call predict.py and get predictions, i.e. your prediction code must be self-contained. We should not have to create any new directories, move files around or rename them. All of that must be done by you while creating the archive (remember, you are allowed to have as many files and (sub)directories apart from the file predict.py as you want).
We do not need your training code for part 5. We do not need any code (training or prediction) that you used to solve parts 1-3. We simply need your prediction code for part 5. Do not increase your submission file size (remember, there are marks for submission size too) by including unnecessary files.
The system on which we will run your prediction code will not have a GPU. Make sure your code can run on a CPU alone. Thus, use the CPU version of deep learning libraries (e.g. keras) if you are using them, and not the GPU version of those libraries.
You may use a GPU (if you can find one) to train your method but your prediction method must still work on a CPU. To be sure, we will not offer your prediction code a GPU even if you ask us to. Do not use any libraries that require GPU access during prediction. There are no restrictions on using a GPU for training but we will not provide your code GPU access during prediction.
We will test your submitted code on a secret dataset which will have the same number of features per user as well as the same number of labels.
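As mentioned in the getReco point above, the following is a minimal, hypothetical skeleton of what predict.py could look like. The exact input-output behavior must be taken from sample_submit.zip and eval.py in the assignment package; the model file name model.npz and the linear scoring used here are illustrative assumptions only.

import numpy as np
from scipy.sparse import csr_matrix, load_npz

def getReco(X: csr_matrix, k: int) -> np.ndarray:
    # Load whatever model files were shipped inside the ZIP archive
    W = load_npz("model.npz")                        # e.g. an L x d weight matrix
    scores = (X @ W.T).toarray()                     # one score per (user, item) pair
    # Top-k item indices per test user, best first, zero-based, no duplicates
    topk = np.argpartition(-scores, kth=k - 1, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(scores, topk, axis=1), axis=1)
    return np.take_along_axis(topk, order, axis=1)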