Homework Assignment 1 Solution

Collaboration Policy. Homeworks will be done individually: each student must hand in their own answers. It is acceptable for students to collaborate in understanding the material but not in solving the problems or programming. Use of the Internet is allowed, but should not include searching for existing solutions.




Under absolutely no circumstances can code be exchanged between students. If some code was shown in class, it can be used, but it must be obtained from Canvas, the instructor or the TA.




Assignments from previous offerings of the course must not be re-used. Violations will be penalized appropriately.




Late Policy. No late submissions will be allowed without consent from the instructor. If urgent or unusual circumstances prevent you from submitting a homework assignment on time, please e-mail me.




MovieLens Datasets. Data for this assignment should be obtained from: http://grouplens.org/datasets/movielens/

You should use the following datasets for all problems:




100,000 ratings from 1000 users on 1700 movies. Released 4/1998.



1 million ratings from 6000 users on 4000 movies. Released 2/2003.



10 million ratings and 100,000 tag applications applied to 10,000 movies by 72,000 users. Released 1/2009.



u.data: Data are stored as strings separated by a tab (\t):




user id item id rating timestamp




For example:




196    242    3    881250949
186    302    3    891717742
22     377    1    878887116
244    51     2    880606923
166    346    1    886397596
298    474    4    884182806
115    265    2    881171488
253    465    5    891628467
305    451    3    886324817
6      86     3    883603013














Problem 1. text2bin (10 points) Create a program that transforms the data file from text to binary. Your program should be called text2bin and take two mandatory command line arguments:

text2bin <input filename> <output filename>




The input filename is the u.data file in the dataset. The output filename is the file to store the binary output of your program. The output should include the same data in the same order but in binary using the following format:

user id          item id          rating           timestamp
2-byte integer   2-byte integer   1-byte integer   8-byte integer
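As a minimal sketch of the 13-byte record layout above, the fields can be written one at a time with fwrite(). The function name write_record and the choice of fixed-width types from <stdint.h> are assumptions made for illustration; the assignment does not prescribe them, and the bytes are written in host byte order because no byte order is specified. Writing field by field avoids the padding that a compiler would typically insert if the four fields were placed in a struct and written in one call.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write one rating as 2 + 2 + 1 + 8 = 13 bytes,
 * in host byte order. Returns 0 on success, -1 if any fwrite() fails. */
int write_record(FILE *out, uint16_t user_id, uint16_t item_id,
                 uint8_t rating, int64_t timestamp)
{
    if (fwrite(&user_id, sizeof user_id, 1, out) != 1)
        return -1;
    if (fwrite(&item_id, sizeof item_id, 1, out) != 1)
        return -1;
    if (fwrite(&rating, sizeof rating, 1, out) != 1)
        return -1;
    if (fwrite(&timestamp, sizeof timestamp, 1, out) != 1)
        return -1;
    return 0;
}
```

A call such as write_record(out, 196, 242, 3, 881250949) would then append the first example line of u.data to the output file.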




I recommend the following steps:




Use any of the character/string/buffer C stream input functions to read the data.



Use fwrite() to produce the output.



To read the four strings from the input file:



write custom code, or try using strtok(), or




use strtoul() and friends




To convert a string to a number:



Use atoi() and friends, or use strtoul() and friends
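The reading and conversion steps above can be combined in one pass, since the strtoul() family skips leading whitespace (including tabs) and reports where each number ended. The sketch below is one possible approach, not the required one: the name parse_line is hypothetical, and strtoull() is used as one of the "friends" of strtoul() so that the range checks do not depend on the width of unsigned long.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse "user<TAB>item<TAB>rating<TAB>timestamp".
 * Returns 0 on success, -1 on malformed or out-of-range input. */
int parse_line(const char *line, uint16_t *user_id, uint16_t *item_id,
               uint8_t *rating, int64_t *timestamp)
{
    unsigned long long v[4];
    const char *p = line;
    char *end;

    for (int i = 0; i < 4; i++) {
        errno = 0;
        v[i] = strtoull(p, &end, 10);
        if (end == p || errno == ERANGE)
            return -1;      /* no digits found, or the value overflowed */
        p = end;            /* the next call skips the separating tab */
    }

    /* Reject values that do not fit the sizes of the binary format. */
    if (v[0] > UINT16_MAX || v[1] > UINT16_MAX || v[2] > UINT8_MAX ||
        v[3] > (unsigned long long)INT64_MAX)
        return -1;

    *user_id = (uint16_t)v[0];
    *item_id = (uint16_t)v[1];
    *rating = (uint8_t)v[2];
    *timestamp = (int64_t)v[3];
    return 0;
}
```

Compared with atoi(), this form can detect bad input, because atoi() gives no way to tell a parse failure from a legitimate zero.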




Start with something like:



int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr,
                "Wrong number of command-line arguments\n");
        usage(argv[0]);
        return -1;
    }
    ...




Problem 2. bin2text (10 points) Create a program that transforms the binary file you created back to text. Your program should be called bin2text and take two mandatory command line arguments:

bin2text <input filename> <output filename>




The input filename is the binary file. The output filename is the file to store the text output of your program. The output should be identical to the original dataset file.
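One way to check the "identical to the original" requirement is a byte-by-byte comparison of the two files. The following is only a sketch under that assumption; the name streams_identical is hypothetical, and a read error is treated the same as end of file for brevity.

```c
#include <stdio.h>

/* Hypothetical helper: return 1 if both streams contain the same bytes
 * from their current positions to end of file, 0 otherwise. */
int streams_identical(FILE *a, FILE *b)
{
    int ca, cb;

    do {
        ca = getc(a);
        cb = getc(b);
        if (ca != cb)
            return 0;       /* differing byte, or one file ended early */
    } while (ca != EOF);

    return 1;
}
```

Both files would be opened with fopen() in "rb" mode before the call, one being the original u.data and the other the output of bin2text.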




I recommend the following steps:




Use any of the character/string/buffer C stream input functions to read the data.



Use fprintf() to produce the output.
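The steps above can be sketched as a function that converts one record per call. This assumes the binary file was written field by field in host byte order as 13-byte records; the names convert_record and RECORD_SIZE are hypothetical. Because the input is binary, the sketch reads it with fread() into a byte buffer and copies each field out with memcpy().

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RECORD_SIZE 13  /* 2 + 2 + 1 + 8 bytes */

/* Hypothetical helper: read one binary record from in and print it to out
 * as a tab-separated line. Returns 1 if a record was converted, 0 at a
 * clean end of file, and -1 on a truncated record or an I/O error. */
int convert_record(FILE *in, FILE *out)
{
    unsigned char buf[RECORD_SIZE];
    uint16_t user_id, item_id;
    uint8_t rating;
    int64_t timestamp;
    size_t n = fread(buf, 1, sizeof buf, in);

    if (n == 0 && feof(in))
        return 0;
    if (n != sizeof buf)
        return -1;

    /* Unpack the fields in the order they were written. */
    memcpy(&user_id, buf, sizeof user_id);
    memcpy(&item_id, buf + 2, sizeof item_id);
    memcpy(&rating, buf + 4, sizeof rating);
    memcpy(&timestamp, buf + 5, sizeof timestamp);

    if (fprintf(out, "%" PRIu16 "\t%" PRIu16 "\t%" PRIu8 "\t%" PRId64 "\n",
                user_id, item_id, rating, timestamp) < 0)
        return -1;
    return 1;
}
```

A caller would loop while convert_record() returns 1 and report an error if the final return value is -1.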



Problem 3. bin2indexed (20 points) u.data includes item IDs for movies but looking up the actual title of the movie in u.item is slow. Create a program that replaces the item ID in the binary file with the position (offset) of the corresponding movie item in the u.item file. Your program should be called bin2indexed and take three mandatory command line arguments:

bin2indexed <binary file> <item file> <output filename>




The binary file is the file produced by text2bin. The item file is u.item from the dataset. The output filename is the file to store the binary output of your program.




The output should include the same data in the same order but in binary using the following format.

user id          item file offset   rating           timestamp
2-byte integer   8-byte integer     1-byte integer   8-byte integer




u.item contains one line per movie. Movies are sorted by item ID. For example:





































I recommend the following steps:




Read the text in bold carefully.



ftell() can be used to obtain the current offset of a stream within a file. For example, it should be zero right after the file is opened.



You may need to use malloc()/free()/realloc() to dynamically allocate memory for the index. Do not assume that you know how many movies there are. If this turns out to be too challenging, peek at the total number of movies for a small penalty.
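A minimal sketch of the index-building step follows, assuming the item file is opened in binary mode and that each movie occupies exactly one line. The name build_index, the initial capacity of 16, and the doubling strategy are illustrative choices, not requirements. The array grows with realloc() so the number of movies never has to be known in advance.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan an open item file and return a malloc()ed
 * array in which entry i is the byte offset of line i (0-based). The line
 * count is stored in *count. Returns NULL on allocation or I/O failure;
 * the caller must free() the result. */
long *build_index(FILE *items, size_t *count)
{
    size_t cap = 16, n = 0;
    long *offsets = malloc(cap * sizeof *offsets);
    char buf[512];
    int at_line_start = 1;
    long pos;

    if (offsets == NULL)
        return NULL;

    pos = ftell(items);
    while (pos >= 0 && fgets(buf, sizeof buf, items) != NULL) {
        if (at_line_start) {
            if (n == cap) {             /* array is full: double it */
                long *tmp = realloc(offsets, 2 * cap * sizeof *tmp);
                if (tmp == NULL) {
                    free(offsets);
                    return NULL;
                }
                offsets = tmp;
                cap *= 2;
            }
            offsets[n++] = pos;
        }
        /* A chunk without '\n' means the line continues in the next read. */
        at_line_start = (strchr(buf, '\n') != NULL);
        pos = ftell(items);
    }
    if (pos < 0 || ferror(items)) {
        free(offsets);
        return NULL;
    }
    *count = n;
    return offsets;
}
```

If item IDs start at 1 and are consecutive (an assumption to verify against the data), the offset for item ID k is offsets[k - 1], which can be cast to an 8-byte integer when the output record is written.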









Time your programs: Use time -p <command + arguments> to time your programs with the various datasets. How does your program scale with file size?







Expectations.




Your program should compile and work correctly. Compiling and performing part of the requirements is better than not compiling.












Use comments where necessary to explain what you are doing. Both a lack of comments and over-commenting are bad.



Use functions! Do not place all your code in main().



Do not leak memory. Free all allocated memory and close all opened files.



Try to use only material covered in the course thus far.
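To illustrate the expectation about freeing memory and closing files, here is one possible pattern: every exit path runs through a single cleanup section. The function copy_file and the 4096-byte buffer are hypothetical and stand in for whatever work a real program does; the use of goto is a stylistic assumption that may or may not match what has been covered in the course.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical example: copy src_path to dst_path through a heap buffer.
 * Returns 0 on success, -1 on failure. All resources are released at the
 * cleanup label no matter where the function stops. */
int copy_file(const char *src_path, const char *dst_path)
{
    int rc = -1;
    FILE *in = NULL, *out = NULL;
    char *buf = NULL;
    size_t n;

    in = fopen(src_path, "rb");
    if (in == NULL)
        goto cleanup;
    out = fopen(dst_path, "wb");
    if (out == NULL)
        goto cleanup;
    buf = malloc(4096);
    if (buf == NULL)
        goto cleanup;

    while ((n = fread(buf, 1, 4096, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            goto cleanup;
    }
    if (ferror(in))
        goto cleanup;
    rc = 0;

cleanup:
    free(buf);                          /* free(NULL) is a no-op */
    if (out != NULL && fclose(out) != 0)
        rc = -1;                        /* a failed flush is also an error */
    if (in != NULL)
        fclose(in);
    return rc;
}
```

Initializing every pointer to NULL is what makes the shared cleanup section safe to reach from any point in the function.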






Deliverables. A zip/tar/gz file containing:




text2bin.c



bin2text.c



bin2indexed.c



A pdf file with brief explanations of your approach and timing results.
































































































































