
ASSIGNMENT 3 Solution


1 Assignment Overview




This assignment covers the foundations of Python 3 string processing and file I/O through an application of comma-separated value (CSV) spreadsheets, and also provides some exposure to the rigorous requirements imposed on medium-scale software projects. The goal of this assignment is to produce a Python 3 program merge.py which contains a robust implementation of a join operation to merge multiple spreadsheets into a single spreadsheet by a common key value. Join operations are common in database applications, and often arise when combining multiple related datasets into a single dataset.




Section 2 describes the join operation at a high level and Section 4 describes the data format that will be used for this assignment. Section 5 describes the specifications of the program to submit and Section 7 describes the marking scheme.




You will likely find this to be the most complicated assignment specification for this course, even though the total amount of code you will write for this task will likely be less than for assignments 1 or 2. As a result, you should plan to start early so you can seek help if you have any issues.




Your submitted merge.py program must run without errors using the command `python3 merge.py' (followed by the relevant arguments) on the Linux machines in ECS 242. Submissions which do not run under Python 3 will receive a zero (even if they work perfectly under Python 2); no exceptions will be made to this policy.










 
2 Joining Tabular Data




Consider the three tables below. The course_names table has two columns, one with the cryptic registration name for a course and the other containing the course's full name. The course_instructors table contains columns for the cryptic course name and the name of the instructor. The enrollment_data table contains columns for the cryptic course name, the course's lecture room and current enrollment.





































Table course_names

course_code  course_name
CSC 106      The Practice of Computer Science
CSC 110      Fundamentals of Programming I
CSC 115      Fundamentals of Programming II
CSC 225      Algorithms & Data Structures I
CSC 226      Algorithms & Data Structures II
CSC 230      Intro. to Computer Architecture
SENG 265     Software Development Methods
SENG 299     Software Architecture and Design























Table course_instructors

course_code  instructor
CSC 106      Isabelle Dufour
CSC 110      Tibor van Rooij
CSC 115      Tibor van Rooij
CSC 225      Rich Little
CSC 226      Frank Ruskey
CSC 230      Jason Corless
SENG 265     Bill Bird
SENG 299     Simon Diemert











Table enrollment_data

course_code  lecture_room  enrollment
CSC 106      ECS 124       42
CSC 110      ECS 123       63
CSC 115      ECS 123       126
CSC 225      ECS 116       89
CSC 226      ECS 116       70
CSC 230      ECS 125       96
SENG 265     CLE A212      74
SENG 299     ECS 124       70

















A join operation merges multiple sets of tabular data into a single table, with the rows of each constituent table matched up by a common key column. Each of the three tables above has a column called course_code. Joining the three tables on the course_code column produces the following table.







Result of three-table merge

course_code  course_name                       instructor       lecture_room  enrollment
CSC 106      The Practice of Computer Science  Isabelle Dufour  ECS 124       42
CSC 110      Fundamentals of Programming I     Tibor van Rooij  ECS 123       63
CSC 115      Fundamentals of Programming II    Tibor van Rooij  ECS 123       126
CSC 225      Algorithms & Data Structures I    Rich Little      ECS 116       89
CSC 226      Algorithms & Data Structures II   Frank Ruskey     ECS 116       70
CSC 230      Intro. to Computer Architecture   Jason Corless    ECS 125       96
SENG 265     Software Development Methods      Bill Bird        CLE A212      74
SENG 299     Software Architecture and Design  Simon Diemert    ECS 124       70

























There are several different types of join operation, which differ on how duplicate or missing data are handled. Your assignment is to implement a particular join operation which is defined as follows.




A join may be performed on one or more tables. A join of only one table will just return the original table (although the order of rows may have changed).




Column names are case sensitive. For example, `course name' is distinct from `Course Name'. Every table in the set must contain the key column. Any column can be chosen as the key column (not just the first column in the table), as long as every table in the set has the chosen column.




In each table, duplicate keys are forbidden. That is, if two rows have the same entry in the key column, the join will fail.




Duplicate column names are forbidden within each table.












Between different tables, the only column name that may be duplicated is the key column. If two input tables have a column with the same name, other than the key column, the join will fail.




If a table is missing a row for a given key, the join will succeed and the missing row will be treated as a row of blank values. Section 3 below gives an example of this behavior.




The table resulting from the join operation will contain only one occurrence of the key column, which will be the first column (even if the key column was not the first column of one of the input tables).
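
The rules above can be sketched in Python as follows. This is only a minimal illustration, not a reference solution. The function name join_tables, the representation of each table as a (columns, rows) pair with rows stored as dictionaries, and the use of ValueError to signal a failed join are all assumptions made for this example.

```python
def join_tables(key, tables):
    """Join tables on `key`.

    Each table is a (columns, rows) pair: `columns` is a list of column
    names and `rows` is a list of dicts mapping column name -> value.
    This representation is an assumption made for this sketch.
    """
    out_columns = [key]      # the key column appears once, and first
    joined = {}              # key value -> {column name: value}

    for columns, rows in tables:
        if key not in columns:
            raise ValueError("missing key column: " + key)
        if len(set(columns)) != len(columns):
            raise ValueError("duplicate column name within a table")
        for name in columns:
            if name == key:
                continue
            if name in out_columns:
                raise ValueError("duplicate column name: " + name)
            out_columns.append(name)

        seen = set()         # key values already used in this table
        for row in rows:
            k = row[key]
            if k in seen:
                raise ValueError("duplicate key: " + k)
            seen.add(k)
            joined.setdefault(k, {}).update(row)

    # A key missing from some table has no entry for that table's
    # columns, so dict.get() fills those cells with blank values.
    result = [[joined[k].get(name, "") for name in out_columns]
              for k in sorted(joined)]
    return out_columns, result
```

Storing the merged rows in a dictionary indexed by key value means each input row is handled once, and a key that is absent from one table needs no special case. The rows are returned sorted by key only for convenience; the join rules above do not depend on row order.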










 
3 Missing Values




Consider the two tables below, which are similar to the example given in the previous section. Note that the course_instructors2 table is missing the row for SENG 265 and the enrollment_data2 table is missing the row for CSC 115.







Table course_instructors2

course_code  instructor
CSC 106      Isabelle Dufour
CSC 110      Tibor van Rooij
CSC 115      Tibor van Rooij
CSC 225      Rich Little
CSC 226      Frank Ruskey
CSC 230      Jason Corless
SENG 299     Simon Diemert











Table enrollment_data2

course_code  lecture_room  enrollment
CSC 106      ECS 124       42
CSC 110      ECS 123       63
CSC 225      ECS 116       89
CSC 226      ECS 116       70
CSC 230      ECS 125       96
SENG 265     CLE A212      74
SENG 299     ECS 124       70

















According to the rules given in the previous section, a join of the two tables above will succeed. In cases where a key is present in one input table but not the other, a row will be created in the result for the key, with all missing values left blank. The result of joining the two tables above is given below.




Result of two-table merge

course_code  instructor       lecture_room  enrollment
CSC 106      Isabelle Dufour  ECS 124       42
CSC 110      Tibor van Rooij  ECS 123       63
CSC 115      Tibor van Rooij
CSC 225      Rich Little      ECS 116       89
CSC 226      Frank Ruskey     ECS 116       70
CSC 230      Jason Corless    ECS 125       96
SENG 265                      CLE A212      74
SENG 299     Simon Diemert    ECS 124       70






















 
4 CSV Spreadsheets




Tabular data can be stored in a variety of ways. One simple format for text-based tables and spreadsheets is the comma-separated value (CSV) format. Files in CSV format normally have the extension .csv or .txt. A CSV spreadsheet consists of a line of text for each row of data, with each column separated by a comma. In this assignment, column headings will be given as the first row of the spreadsheet, and no comma will appear after the last column on each line.






The enrollment_data table in Section 2 would be represented in CSV format as follows.

course_code,lecture_room,enrollment
CSC 106,ECS 124,42
CSC 110,ECS 123,63
CSC 115,ECS 123,126
CSC 225,ECS 116,89
CSC 226,ECS 116,70
CSC 230,ECS 125,96
SENG 265,CLE A212,74
SENG 299,ECS 124,70




Formally, the CSV files used in this assignment will conform to the following specification.




Every CSV file must contain at least one non-blank line, which will contain the column headings.




Blank lines (consisting entirely of whitespace characters such as spaces and tabs) will be completely ignored (and will not count as a row of the table).




If the number of columns in non-blank lines is not consistent (for example, if one line has more columns than another line), the file is invalid.




There is no limit to the number of columns (as long as every line has the same number of columns) or the number of rows in the file.




Whitespace (including spaces and tabs, but not including newlines) at the beginning and end of each column's data will be ignored. Whitespace in the middle of a column's data will be preserved. For example, the line `CSC 106, ECS 124 , 42' should be parsed as the three values `CSC 106', `ECS 124', and `42'. The .strip() method of Python strings removes all leading and trailing whitespace from the string, so you may find it helpful.




The number of columns is equal to the number of commas in the line plus one. The data in a column may be blank. For example, the line `Hello,,World,' contains four columns. The second and fourth columns are blank.
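
One way to turn the specification above into code is sketched below. The function name parse_csv and the choice to report an invalid spreadsheet with ValueError are assumptions made for this illustration. The sketch splits each line on commas directly, because the format defined here has no quoting rules.

```python
def parse_csv(text):
    """Parse CSV text into (columns, rows) following the rules above.

    Returns the list of column headings and a list of rows, each a list
    of stripped strings. Raises ValueError for an invalid spreadsheet.
    """
    table = []
    for line in text.splitlines():
        if line.strip() == "":
            continue                      # blank lines are ignored
        # One column per comma, plus one; only the edges of each value
        # are stripped, so interior whitespace is preserved.
        table.append([cell.strip() for cell in line.split(",")])

    if not table:
        raise ValueError("no column headings")
    width = len(table[0])
    if any(len(row) != width for row in table):
        raise ValueError("inconsistent column count")
    return table[0], table[1:]


columns, rows = parse_csv("course_code, lecture_room ,enrollment\n\nCSC 106, ECS 124 , 42\n")
print(columns)  # ['course_code', 'lecture_room', 'enrollment']
print(rows)     # [['CSC 106', 'ECS 124', '42']]
```

Because str.split(",") always returns one more piece than there are commas, a line such as `Hello,,World,' yields four values, two of which are empty strings.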










 
5 Spreadsheet Merging Program




Your task is to implement a Python 3 program called merge.py, which will read one or more CSV spreadsheets from text files, join them on a common key column, then print the result, as a CSV spreadsheet, to standard output. Read Section 5.2 below carefully as you develop your implementation to ensure that your program behaves correctly in the various error cases that may arise.




The merge.py program does not read any data from standard input. It is invoked from the command line with the syntax




 
python3 merge.py <key column name> <file 1> <file 2> <file 3> <file 4> ...




where <key column name> is a column name and each <file i> is a file name. At least one file must be provided, but there is no upper limit on the number of files.




If the program is invoked with the correct syntax and no error cases occur, a CSV spreadsheet will be printed to standard output, containing the result of joining the CSV spreadsheets in each of the input files on the column given as the first command line argument (if the key column name contains spaces, it must be put in quotes on the command line, to ensure that it is parsed as a single argument).
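
The argument layout described above can be handled with a few lines of Python. In this sketch the helper name parse_arguments is hypothetical, and returning None for an incomplete command line is just one possible design; the real program would pass sys.argv instead of the example list.

```python
def parse_arguments(argv):
    """Split argv into (key column name, list of file names).

    Returns None when the key column or the input files are missing.
    """
    if len(argv) < 3:          # argv[0] is the script name
        return None
    return argv[1], argv[2:]


# The shell has already removed the quotes, so a key such as
# "Student Number" arrives as a single list element.
example = ["merge.py", "Student Number", "student_names.csv", "265_a1_grades.csv"]
key_column, filenames = parse_arguments(example)
print(key_column)   # Student Number
print(filenames)    # ['student_names.csv', '265_a1_grades.csv']
```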




The input CSV files are expected to conform to the specification given in Section 4 (see Section 5.2 for information on how your program should handle invalid input files).

The output CSV data must conform to the following specifications.

The output data must be a valid CSV spreadsheet as defined by Section 4.




The first line of output must contain column headings for each column in the joined data. The first column of output data must be the key column. Since the joined data must comply with the specification in Section 2, the key column must appear exactly once.




The remaining columns must be in the same order as they were in each input file, starting from the first input file. This means that the structure of the output data will be affected by the order in which input files are given on the command line.




The rows of the output data (besides the column headers) must be sorted in ascending order by the key. The sort ordering should be the default lexicographical (or dictionary) ordering used by the built-in Python sorted() function.
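
The output requirements above might be implemented along the following lines. The function name format_output and the list-of-lists row representation are assumptions made for this sketch; the input is taken to be already joined, with the key value first in every row.

```python
def format_output(columns, rows):
    """Return CSV text: the heading line, then the rows sorted by key.

    `columns` is the list of output headings with the key column first,
    and `rows` is a list of lists whose first element is the key value.
    """
    lines = [",".join(columns)]
    for row in sorted(rows, key=lambda r: r[0]):
        lines.append(",".join(row))
    return "\n".join(lines)


print(format_output(
    ["Student Number", "A1 mark"],
    [["V00654321", "18"], ["V00000001", "10"], ["V00123456", "12"]],
))
# Student Number,A1 mark
# V00000001,10
# V00123456,12
# V00654321,18
```

Since the keys are strings, sorted() compares them character by character. For example, sorted(["9", "10"]) returns ['10', '9'], which is the lexicographical behavior the specification asks for, even though it differs from numeric order.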




Section 5.1 shows the output associated with a sample input dataset. Section 5.2 describes the various error cases that your program is expected to handle.







5.1 Example




Consider the CSV spreadsheets in the files below.







File student_names.csv

Student Name, Major , Student Number
Alastair Avocado, Psychology,V00123456
Rebecca Raspberry, Computer Science ,V00123457
Meredith Malina,Software Engineering ,V00654321
Hannah Hindbaer,Physics,V00654322
Neal Naranja,Anthropology,V00951413
Fiona Framboise, Computer Science,V00314159







File 265_a1_grades.csv

Student Number, A1 mark
V00654321, 18
V00654322,15
V00951413,15
V00123456, 12
V00123457, 17
V00000001, 10































File 265_a2_grades.csv

Student Number, A2 mark
V00951413,15
V00314159, 17
V00000001, 11
V00654322,18
V00123457, 14
V00654321, 12




All three spreadsheets have a column called `Student Number' (recall that spaces before or after the data for a column are ignored), so the three spreadsheets can be joined on the Student Number column. Notice that the key column is in a different position in the three files, and that some rows are not present in all three files (for example, there is no A1 mark for V00314159 and no A2 mark for V00123456). To join the three tables with merge.py, the following command line could be used.




$ python3 merge.py "Student Number" student_names.csv 265_a1_grades.csv 265_a2_grades.csv




The output produced from the above command would be the following.







Merge output

Student Number,Student Name,Major,A1 mark,A2 mark
V00000001,,,10,11
V00123456,Alastair Avocado,Psychology,12,
V00123457,Rebecca Raspberry,Computer Science,17,14
V00314159,Fiona Framboise,Computer Science,,17
V00654321,Meredith Malina,Software Engineering,18,12
V00654322,Hannah Hindbaer,Physics,15,18
V00951413,Neal Naranja,Anthropology,15,15







Observe that for the keys V00000001, V00123456 and V00314159, missing data is represented by blank cells.




Providing the command line arguments in a different order will produce a different result. For example, using the command




$ python3 merge.py "Student Number" 265_a2_grades.csv student_names.csv 265_a1_grades.csv




would produce the result below.







Merge output

Student Number,A2 mark,Student Name,Major,A1 mark
V00000001,11,,,10
V00123456,,Alastair Avocado,Psychology,12
V00123457,14,Rebecca Raspberry,Computer Science,17
V00314159,17,Fiona Framboise,Computer Science,
V00654321,12,Meredith Malina,Software Engineering,18
V00654322,18,Hannah Hindbaer,Physics,15
V00951413,15,Neal Naranja,Anthropology,15







5.2 Error Cases




Your program is expected to gracefully handle all user errors, I/O errors and invalid inputs. In the event of such an error, your program must print a descriptive error message to standard error (not standard output), then exit, without generating any output to standard output. The set of possible error cases is listed below. In the event that the way of handling the error is not explicitly given, you are free to decide on the exact error message to display.




If no command line arguments are given, or if only one argument is given (and no input files are specified), the program will exit with the error message

Usage: python3 merge.py <key column name> <file 1> <file 2> <file 3> <file 4> ...




If any of the input files cannot be opened, or cannot be read, the program will exit with the error message

Error: Unable to open <filename>

where <filename> is replaced by the name of the file causing the error.




If the key column is not found in one of the input files, the program will exit with the error message

Error: File <filename> No column called "<key column name>"

where <filename> is replaced by the name of the file causing the error and <key column name> is replaced by the name of the key column.




If two columns which aren't the key column have the same name (either within one file or between multiple files), the program will print an appropriate error message, which must contain the name of the column, and exit.




If an input file is not a valid CSV spreadsheet (because it contains no data, or because the column count is not consistent between lines), the program will print an appropriate error message, which must contain the name of the invalid file, and exit.




In the event that two rows in the same file contain the same key value, the program will print an appropriate error message, which must contain the name of the duplicate key, and exit. Note that this should only be enforced for key values, not the values of other columns (which may contain duplicate values).
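
A sketch of how two of the error cases above could be reported is shown below. The helper names fail, read_file and find_key_index are hypothetical, and the exit status of 1 is an assumption, since the text above does not prescribe one.

```python
import sys


def fail(message):
    """Print `message` to standard error and stop the program."""
    print(message, file=sys.stderr)
    sys.exit(1)   # assumed exit status; the specification does not fix one


def read_file(filename):
    """Return the contents of `filename`, or exit with the required message."""
    try:
        with open(filename) as f:
            return f.read()
    except (OSError, UnicodeDecodeError):
        # Covers files that cannot be opened and files that cannot be read.
        fail("Error: Unable to open " + filename)


def find_key_index(filename, columns, key):
    """Return the position of the key column, or exit if it is absent."""
    if key not in columns:
        fail('Error: File {} No column called "{}"'.format(filename, key))
    return columns.index(key)
```

To satisfy the requirement that nothing reaches standard output when an error occurs, one option is to read and validate every input file first and print the joined spreadsheet only after all checks have passed.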







 
6 Peer Testing




As in previous assignments, part of the mark for this assignment will come from creating a test case that will be used to test your peers' implementations. A test case for merge.py may contain multiple files, and different test cases may require a different number of input files. Additionally, the output of merge.py on a given set of input files may differ depending on the order of arguments on the command line. Your test case will be submitted as a zip archive called merge_testcase.zip. You may create the archive with a program of your choice, but it is your responsibility to ensure that it can be extracted using the unzip command on the Linux machines in ECS 242. You will receive a zero if your file is named incorrectly or if it cannot be extracted using the unzip command. See Section 6.1 for information on creating zip files in a Linux environment.




Your merge_testcase.zip file must contain the following files:




The set of CSV spreadsheet files needed for your test case. Your test case must use at least two input spreadsheets, and may use at most 1024. If you submit a zip file containing more than 1025 files (including the run_test.sh script below), it will be marked as invalid. The total (uncompressed) size of all spreadsheet files may not exceed 256kb (to accommodate conneX resource limits).







A shell script called run_test.sh which contains the complete command line needed to run your test case. For example, for the test case in Section 5.1, the run_test.sh file would contain




python3 merge.py "Student Number" student_names.csv 265_a1_grades.csv 265_a2_grades.csv




Your run_test.sh script should not contain any other lines besides a command like the one above (except for shell comments, which will be ignored). If your test case does not contain a valid shell script called run_test.sh, it will receive a zero. Your script must run without modification on the ECS 242 Linux machines using the command




sh run_test.sh




Your merge.py implementation and test case must be submitted by Tuesday, July 12th, 2016 at 11:00pm. After that deadline, the results of testing each implementation using each test case will be posted to the Testing Database on conneX. The complete set of merge_testcase.zip submissions will also be published anonymously to assist your peers in testing their code.




Since your test case will be published, please ensure that it contains no identifying information. Your implementation of merge.py will not be published. Do not add your merge.py implementation to your test case zip file.




After the results of peer testing are posted, you will have the opportunity to revise and resubmit your merge.py implementation until Friday, July 15th, 2016. You will not be permitted to resubmit if you did not submit an implementation before the July 12th deadline (and you will therefore receive a mark of zero). Since all test cases will be published, you will also not be permitted to resubmit your test case.




6.1 Creating Zip Files




The zip command can be used to create zip archives from the command line. The basic syntax of zip is




$ zip <name of zip archive> <file 1> <file 2> <file 3> ...




For example, to create a zip file containing the three files in the example of Section 5.1, along with run_test.sh, the following command would be used.




 
zip merge_testcase.zip student_names.csv 265_a1_grades.csv 265_a2_grades.csv run_test.sh




You can extract a zip archive using the unzip command. For example,




$ unzip merge_testcase.zip




would extract the merge_testcase.zip file created above. You may find it helpful to change to an empty directory before extracting zip files, since the contents will be placed in the current directory by default (and may overwrite your existing files).
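
Before submitting, the archive can also be inspected with Python's standard zipfile module. The sketch below checks a few of the packaging rules from Section 6. The function name check_testcase is hypothetical, the 256kb limit is assumed to mean 256 * 1024 bytes, and every file other than run_test.sh is assumed to be a spreadsheet. It does not replace testing the archive with unzip on the ECS 242 machines.

```python
import zipfile

MAX_FILES = 1025            # up to 1024 spreadsheets plus run_test.sh
MAX_CSV_BYTES = 256 * 1024  # assumes "256kb" means 256 * 1024 bytes


def check_testcase(zip_path):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        entries = archive.infolist()
    names = [entry.filename for entry in entries]
    if "run_test.sh" not in names:
        problems.append("run_test.sh is missing")
    if "merge.py" in names:
        problems.append("merge.py must not be included")
    if len(names) > MAX_FILES:
        problems.append("too many files")
    # ZipInfo.file_size is the uncompressed size of each member.
    csv_bytes = sum(e.file_size for e in entries if e.filename != "run_test.sh")
    if csv_bytes > MAX_CSV_BYTES:
        problems.append("spreadsheets exceed the size limit")
    return problems
```

Calling check_testcase("merge_testcase.zip") returns an empty list when none of these particular problems is detected.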







 
7 Evaluation




Submit your merge.py and merge_testcase.zip files electronically through the Assignments tab on conneX. Do not submit any other files.




The assignment will be marked out of 18 marks and is worth 9% of your final grade.




Your implementation and test case will be marked based on their performance on test cases developed by your instructor (not on their performance on your peers' test cases).




The marks are distributed among the components of the assignment as follows.






Marks  Component
10     The merge.py implementation functions correctly on a variety of valid inputs.
6      Each of the six error cases in Section 5.2 is handled correctly.
2      The merge_testcase.zip file contains a valid and non-trivial test case. You will receive zero marks for this component unless you also submit merge.py.







Ensure that all code files needed to compile and run your code in ECS 242 are submitted. Only the files that you submit through conneX will be marked. The best way to make sure your submission is correct is to download it from conneX after submitting and test it. You are not permitted to revise your submission after the due date, and late submissions will not be accepted, so you should ensure that you have submitted the correct version of your code before the due date. conneX will allow you to change your submission before the due date if you notice a mistake. After submitting your assignment, conneX will automatically send you a confirmation email. If you do not receive such an email, your submission was not received. If you have problems with the submission process, send an email to the instructor before the due date.







































































































































