Starting from:
$30

$24

ASSIGNMENT 1 SOLUTION

Assignment Overview



This programming assignment covers an application of algorithms and data structures arising in the implementation of database systems. Databases and data management are covered by later courses, such as CSC 370. In this assignment, you will design and implement an e cient and correct algorithm for aggregation, which is a fundamental operation in relational database systems. Since this is not a database course, the algorithm will operate on simple text-based spreadsheet les instead of a database.




The speci cation for this assignment can be found in Section 2. Sections 1.1 - 1.3 contain a de-scription and several examples of grouping and aggregation, using the following ( ctional) table of student numbers and grades.







Table grades




id
course


subject
course


num
grade












V00123457
CSC
225




78












V00654322
MATH
110




63












V00654322
CSC
110




75












V00314159
CSC
106




85












V00951413
CSC
106




72












V00951413
MATH
110




68












V00654322
CSC
106




70












V00123456
CSC
110




79












V00314159
CSC
110




47












V00654322
CSC
225




91












V00123456
CSC
106




68












V00123456
CSC
225




89












V00951413
CSC
225




40












V00314159
MATH
100




74












V00654322
MATH
100




71



















1.1 Counts and Averages by Student




Consider the task of computing the number of courses taken by each student, or the task of com-puting the average grade for each student. For both of these tasks, the rows of the table need to be divided into groups by student number (which is the id column of the table). Then, the number




1



of courses per student or average grade per student can be computed by counting or averaging the grade values within each group. Applying a function (such as counting or averaging) to collapse a group of rows into a single row is called aggregation. The table on the left below shows the groupings of rows by the id column, with each group receiving a di erent colour, and the tables on the right show the counts and averages within each group. Notice that the only columns in the result tables are the grouping criteria (in this case, the id column) and the aggregation result. All other columns in the original table are omitted.







Table grades






id
course


subject
course


num
grade




















V00123457
CSC
225




78


V00654322
MATH
110




63


V00654322
CSC
110




75


V00314159
CSC
106




85


V00951413
CSC
106




72


V00951413
MATH
110




68


V00654322
CSC
106




70


V00123456
CSC
110




79


V00314159
CSC
110




47


V00654322
CSC
225




91


V00123456
CSC
106




68


V00123456
CSC
225




89


V00951413
CSC
225




40


V00314159
MATH
100




74


V00654322
MATH
100




71


















Original table, grouped by id column.







1.2 Averages by Subject




Aggregation (Counting Rows)




id count(grade)




V00123457 1




V00654322 5




V00314159 3




V00951413 3







V00123456 3







Result of aggregation by counting the number of grade values in each group.







Aggregation (Averaging)




id avg(grade)




V00123457 78




V00654322 74




V00314159 68.66




V00951413 60




V00123456 78.66




Result of aggregation by averaging the grade values in each group.



Consider the task of computing the average grade within each subject (CSC and MATH). In this case, the grouping criteria is the course_subject column and the aggregation result is avg(grade). The tables below show the grouping and aggregation. As above, notice that the other columns are excluded from the result table.



















































2





Table grades




Aggregation Result






























id
course


subject
course


num
grade


course


subject
avg(grade)
































V00123457
CSC
225




78


CSC
72.18


V00654322
MATH
110




63


MATH
69


























V00654322
CSC
110




75
Result of aggregation by averaging the
V00314159
CSC
106




85


grade values in each group.
V00951413
CSC
106




72












V00951413
MATH
110




68












V00654322
CSC
106




70












V00123456
CSC
110




79












V00314159
CSC
110




47












V00654322
CSC
225




91












V00123456
CSC
106




68












V00123456
CSC
225




89












V00951413
CSC
225




40












V00314159
MATH
100




74












V00654322
MATH
100




71








































Original table, grouped by the course_subject




column.







1.3 Averages by Course




Now suppose we want to compute the average grade for each course. It is not su cient to group by the course_num column, since two courses with the same number may not be the same course (for example, consider CSC 110 and MATH 110). The grouping criteria therefore must include two columns: course_subject and course_num. The tables below show the grouping and aggregation. Again, the only columns in the output table are the grouping criteria and the aggregation result.












































































3





Table grades








Aggregation Result




































id
course


subject
course


num
grade


course


subject
course


num


avg(grade)






































V00123457
CSC
225




78


CSC
225






74.5
V00654322
MATH
110




63


MATH
110






65.5
V00654322
CSC
110




75


CSC
110






67
V00314159
CSC
106




85


CSC
106






73.75
V00951413
CSC
106




72


MATH
100






72.5
V00951413
MATH
110




68


Result of aggregation by averaging the
V00654322
CSC
106




70


grade values in each group.
V00123456
CSC
110




79


















V00314159
CSC
110




47


















V00654322
CSC
225




91


















V00123456
CSC
106




68


















V00123456
CSC
225




89


















V00951413
CSC
225




40


















V00314159
MATH
100




74


















V00654322
MATH
100




71




















































Original table, grouped by the course_subject and




course_num columns.










Speci cation



Your task for this assignment is to write a Java program Aggregate which reads a table from a text le and performs grouping and aggregation with one of three aggregation functions: count (count the number of rows in each group), sum (add up all elements in the aggregation column within each group) and avg (compute the average of all elements in the aggregation column within each group). To receive full marks, the entire implementation must be correct, allow any number of columns to be used as grouping criteria, and run in O(n log2 n) time on a table with n rows. See Section 3 for more details on the evaluation process.




2.1 Data Format




Tabular data can be stored in a variety of ways. One simple format for text-based tables and spreadsheets is the comma-separated value (CSV) format. Files in CSV format normally have the extension .csv or .txt. A CSV spreadsheet consists of a line of text for each row of data, with each column separated by a comma. In this assignment, column headings will be given as the rst row of the spreadsheet, and no comma will appear after the last column on each line.




The grades table in Section 1 would be represented in CSV format as follows.




id,course_subject,course_num,grade




V00123457,CSC,225,78




V00654322,MATH,110,63




V00654322,CSC,110,75




V00314159,CSC,106,85




4



V00951413,CSC,106,72




V00951413,MATH,110,68




V00654322,CSC,106,70




V00123456,CSC,110,79




V00314159,CSC,110,47




V00654322,CSC,225,91




V00123456,CSC,106,68




V00123456,CSC,225,89




V00951413,CSC,225,40




V00314159,MATH,100,74




V00654322,MATH,100,71




Formally, the CSV les used in this assignment will conform to the following speci cation.




Every CSV le must contain at least one non-blank line, which will contain the column headings.



Blank lines (consisting entirely of whitespace characters such as spaces and tabs) will be completely ignored (and will not count as a row of the table).



You may assume that every line contains the same number of columns (an input le with an inconsistent number of columns per line will be considered invalid).



There is no limit to the number of columns (as long as every line has the same number of columns) or the number of rows in the le.



The data in the le may contain letters, numbers, spaces and punctuation, with one exception: comma characters will only appear as column separators (they may not appear in the data).



The number of columns is equal to the number of commas in the line plus one. The data in a column may be blank. For example, the line ‘Hello,,World,’ contains four columns. The second and fourth columns are blank.






2.2 Aggregate program interface




Your Aggregate program will accept command line arguments in the form




$ java Aggregate <function <aggregation column <input file <group column 1 <group column 2 ...




where




<function is one of ‘count’, ‘sum’ or ‘avg’,



<aggregation column is the column name of the column to aggregate,



<input file is the name of the input CSV le, and



the <group column x arguments are column names of grouping columns (at least one must be speci ed, but in a complete implementation, any number of grouping columns should be allowed).



For example, if the table from Section 1 is contained in the le table_of_grades.csv, the number of grades per student (from Section 1.1) would be computed with the command




$ java Aggregate count grade table_of_grades.csv id




and the average grade in each course (from Section 1.3) would be computed with the command




$ java Aggregate avg grade table_of_grades.csv course_subject course_num




The program will report an error if any of the following conditions arise.




The <function argument is not one of ‘count’, ‘sum’ or ‘avg’.



5



The input le does not exist, cannot be opened or is not in the correct format.



The aggregation column or any of the grouping columns do not exist in the input le.



The aggregation column contains any non-numerical data (except the column header). Other columns may contain non-numerical data.



A column is used as both a grouping column and an aggregation column.



If none of the above errors occur, the program will perform the aggregation and output the resulting table to standard output (that is, the console) in the CSV format described in the previous section, including column headers. The only columns that will appear in the output table are the grouping columns and the aggregation result. The grouping columns must appear rst, with the last column containing the aggregation result. The column name of the aggregation result column should contain the function name (count, sum or avg) and the name of the original column (e.g. count(grade) or avg(temperature)). The rows of the result table may appear in any order, as long as the header row is rst (so you do not have to ensure that the order matches the examples in this document). For example, with table_of_grades.csv de ned as in the examples above, the output of the command




$ java Aggregate avg grade table_of_grades.csv course_subject course_num




would be




course_subject,course_num,avg(grade)




CSC,106,73.75




CSC,110,67.0




CSC,225,74.5




MATH,100,72.5




MATH,110,65.5







2.3 Using Outside Code




You are encouraged to use the features of the Java Standard Library (including any of the data structures it provides) in your code. If you use a standard library data structure, make sure you are aware of the running times of the operations you use, since that will be important for determining the running time of your program.




There should be no need to use large volumes of code from other sources (such as outside libraries or the internet) in this assignment. If you believe that your implementation requires an outside library, talk to your instructor.




If you nd a small snippet of code on the internet that you want to use (for example, in a Stack-




Over ow thread), put a comment in your code indicating the source (including a complete URL).




Remember that using code from an outside source without citation is plagiarism.




You are encouraged to discuss the assignment with your peers, and even to share implementation advice or algorithm ideas. However, you are not permitted to use any code from another CSC 225 student under any circumstances, nor are you permitted to share your code in any way with any other student (or the internet) until after the marking is complete. Sharing your code with others before marking is completed, or using another student’s code for assistance (even if you do not directly copy it) is plagiarism.
















6



2.4 Implementation Advice




Although this assignment may seem daunting, most of the complexity lies in the speci cation (since this application is likely new to you) and in the nishing touches needed to make the algorithm asymptotically fast. The actual volume of code needed may be signi cantly less than assignments you have completed in the past. Your instructor’s solution required about 150 lines of Java code.




You are encouraged to try designing a solution from scratch, since the design process is often the most di cult part of a software project. However, if you are stuck, or feeling uninspired, consider implementing your solution in the following steps.




Write a simple program to read a CSV le into an array or list, then output the data in CSV format. Test this thoroughly to ensure that it is correct. To make things easier later, use separate methods for the input and output code. You may use outside libraries to read/write CSV data, but it is likely easier to write this component yourself.



Write a method which takes a table and a set of column names as input, then produces a new table containing only the provided columns (for example, given the grades table used in previous examples and the column names ‘id’ and ‘grade’, the method would make a new table containing only the ‘id’ and ‘grade’ columns). Once this feature has been written, verify that your code can read an input le and prune away all columns besides the group columns and aggregation column, then print the resulting table (without any aggregation being performed). Note that you will receive some marks if you can reach this point (see Section 3).



Write a simple (but possibly slow) aggregation algorithm and verify that it works. Various algorithm ideas will be covered in lectures and labs. It is far more important to have working code than fast code (if your code is not correct, you will receive no marks for its performance).



Once you have a working implementation (even if it is slow), make sure you can analyse and explain its running time. Add comments and other documentation if necessary to clarify complicated parts of the code. Hand in your working implementation.
If your code does not have a worst case running time of O(n log2 n), try implementing an improved algorithm. Again, it is better to submit a slow but correct implementation than a fast but incorrect one. Additionally, a fast implementation will not receive full marks unless you can justify its running time.



Note: You should not use a hash table (or any hashing-based Java data structures, like HashMap) for your implementation if you want your code to have a worst case O(n log2 n) running time. Although hash tables are very useful in most cases, they have a relatively poor worst-case running time. We will study hash tables in June.










Evaluation



Submit all .java les needed to compile your assignment electronically via conneX. Your code must compile and run correctly in the Linux environment in ECS 242. If your code does not compile as submitted, you will receive a mark of zero.




This assignment is worth 9% of your nal grade and will be marked out of 18 during an interactive demo with an instructor. You will be expected to explain the running time of your implementation to the evaluator, and may also be asked to explain or justify some aspects of your code. Demos




7



must be scheduled in advance (through an electronic system available on conneX). If you do not schedule a demo time, or if you do not attend your scheduled demo, you will receive a mark of zero.




The marks are distributed among the components of the assignment as follows.




Marks
Component




5
The input le is read successfully and an output le containing the correct columns


is generated (even if no aggregation is performed).




4
Aggregation is correct when a single grouping column is used.




3
Aggregation is correct for an arbitrary number of grouping columns.




3
You are able to explain and justify the running time of your code to the evaluator.


Note that the running time does not have to be optimal for you to receive these


marks (as long as your explanation is clear and correct). If you use algorithms or


data structures from the Java standard library, you will be expected to know their


running times.




3
The worst case running time of the entire program is O(n log2 n) for an input with


n rows. These marks will only be given if the implementation is correct for multiple


grouping columns and if you can justify the running time.







If your code is not well organized, or if it is poorly documented, the evaluator may ask you explain any aspects that are unclear. If you are unable to do so, up to 2 marks may be deducted. To be clear, you are not required to have spotless, perfectly organized code, but you should be prepared to explain any hard-to-read parts of your code.




You are permitted to delete and resubmit your assignment as many times as you want before the due date, but no submissions or resubmissions will be accepted after the due date has passed. You will receive a mark of zero if you have not o cially submitted your assignment (and received a con rmation email) before the due date.




Your main program must be contained in a le called Aggregate.java. You may use additional les if needed by your solution (as long as the program can be invoked from the command line using the syntax in Section 2.2). Ensure that each submitted le contains a comment with your name and student number.




Ensure that all code les needed to compile and run your code in ECS 242 are submitted. Only the les that you submit through conneX will be marked. The best way to make sure your submission is correct is to download it from conneX after submitting and test it. You are not permitted to revise your submission after the due date, and late submissions will not be accepted, so you should ensure that you have submitted the correct version of your code before the due date. conneX will allow you to change your submission before the due date if you notice a mistake. After submitting your assignment, conneX will automatically send you a con rmation email. If you do not receive such an email, you did not submit the assignment. If you have problems with the submission process, send an email to the instructor before the due date.





































8

More products