Homework #1 Solution

Starting from:

$35

Home

Setup
To access the starter code, you will first need to sign up for a BitBucket account. Please use your @wustl.edu email address when registering for this account. Once you have an account, go to the repository page for this assignment. You should create a fork of this repository by clicking the plus sign on the left side of the screen. Please put your last name and the name of your partner (if you are working with one) in the name of the repository. You should also mark the repository as private, and disallow forks.

Once your repository has been created, if you would like to give your partner access, click the "Settings" link then navigate to "Access Management." You should search for your partner's account and give them write access. You are now ready to clone your project into eclipse. Instead of showing instructions for that here, I will refer you to the CSE131 instructions.

You will also need to give me admin access to your repository so that I can grade your submission. Click the "Settings" link on the repository page and go to "Access Management." Search for my name (Doug Shook) and give my account (account name is "digshake") admin access to your repository.

IMPORTANT NOTE: You MUST mark your repository as private, and you MUST add me as an admin to your account. You can at AT MOST one other person to the repository. If you fail to do this, your work may not be graded and/or you may be found in violation of the academic integrity policy. If you have questions, please ask!
Overview
In this assignment, you will first construct the basic building block of a table: Tuples. Next, you will implement a simple Catalog that keeps track of what tables are currently in the database and provides a way to access them. Finally, you will implement HeapFiles which are the physical representation of the data in our database.
Tuples
Your first task is to complete the TupleDesc and Tuple classes. TupleDesc is a class that specifies a schema for a Tuple. Examine the methods and come up with a design before beginning your implementation.

Some implementation hints for TupleDesc:

* For the purposes of our database, we will only be using two datatypes: int and string (char). ints should be 4 byte signed values. Strings can be up to 128 characters in length. The first byte of a string will contain the length of the string, followed by the bytes containing the content of the string (one byte per character). If a string is less than 128 characters, the leftover bytes will be filled with zeroes until it reaches a size of 128.

* The hashCode method is not required, but if you intend to hash TupleDesc objects (for use with a HashMap for example, then you will need to implement it.

* When implementing toString please use the format specified in the comments, since we will be using this for testing purposes.

Once you have TupleDesc implemented, it is time to focus on the Tuple class. A Tuple is responsible for storing data and making sure that it conforms to the schema specified by a TupleDesc.

Some implementation hints for Tuple:

* You will need to decide on a data structure that can be used to map field names to their content.

Once you have finished these two classes, your code should pass the TupleDescTest and TupleTest tests.
Catalog
Next, we will shift our focus to tables. Tables will consist of many Tuples. Before we construct the tables themselves, we will implement a Catalog that will keep track of what tables currently exist and provide access to those tables. Examine the Catalog class and come up with a design before implementing it.

Some implementation hints for Catalog:

* You will likely need to create a structure to hold relevant information for a table (such as its name, the location of the file, and its primary key). The best way to do this is probably to create a private inner class.

* Many of the methods in this class rely on a table id. This id is generated by a HeapFile which you have not implemented yet. It may be worthwhile to take a look at that method (it is very simple) before testing your Catalog implementation.
Heap Files
The actual physical storage of our data is called a HeapFile. HeapFiles consist of many HeapPages. The primary purpose of a HeapFile is to manage access to the HeapPages and also to stream the data out to disk. The primary function of a HeapPage is to manage the various tuples - making sure that each tuple is written to a proper location and accessing tuples as necessary.

It is highly recommended that you design and implement the HeapPage class first, then design and implement the HeapFile class.

Some implementation hints for the HeapPage class:

* Each HeapPage will have a header and some slots. Slots are used to store the actual data - the size of each slot is the same as the size of the tuples contained in the HeapPage. The header is a binary encoding of which slots are occupied. Each slot is assigned one bit in the header. If that bit is a one, then that slot is occupied, and contains data that can be accessed. If the bit is a zero, then that slot contains no data and can have new data written to it. The header format should track slot occupancy starting with the least significant bit. In other words, the header bit for slot 0 would be in byte 0, bit 0 of the header. Slot 1 would be in byte 0, bit 1. Slot 8 would be in byte 1, bit 0, etc.

* It is important to consider how much space the header will require. All pages are of constant size (see the static variable in HeapPage) so the space that we are using for the header will decrease the amount of space that we can use for tuples. Headers must be a whole number of bytes (no partial bytes).

* Any space that is not taken up by the header can be used to store tuples. Tuples should be stored immediately following the header. No partial tuples can be stored in a HeapPage. If there is any space leftover, the HeapPage will be padded with zeroes until it reaches its defined size.

* When adding or deleting a tuple from a HeapPage be sure to update the header to indicate the occupied status of the modified slot.

Here are some implementation hints for the HeapFile class:

* To create an id for your HeapFile it is sufficient to create some sort of hash based on the File.

* When reading or writing a page, it is strongly recommended that you use a RandomAccessFile for file access.

* When adding a tuple, you must first find a page with an empty slot. If there are no empty slots, create a new heap page instead. Feel free to create some additional methods in HeapPage to help with this if necessary. You should then pass the new tuple to the HeapPage. Finally, be sure to write the modified HeapPage to disk.

* When deleting a tuple, you should be able to find the page id and the tuple id (slot number) from the Tuple object. Be sure to write the modified HeapPage to disk.

* It will be helpful to use the iterator from your HeapPage class to aggregate all of the tuples in the getAllTuples() method.
Example
The following example will load a table from a file, then display all of the tuples contained in that table. In other words, it is equivallent to SELECT * FROM test;. The .txt file contains the schema and the catalog looks for the data in a file called test.dat.
Catalog c = Database.getCatalog();
c.loadSchema("testfiles/test.txt");

int tableId = c.getTableId("test");
td = c.getTupleDesc(tableId);
System.out.println(td);

hf = c.getDbFile(tableId);
ArrayList tups = hf.getAllTuples();
Iterator it = tups.iterator();

while(it.hasNext()) {
System.out.println(it.next());
}


Right now we aren't properly reading in any SQL code or processing any queries. In future assignments we will add this functionality to our database.
Testing
You have been provided some tests that you can use to check basic functionality of each of the objects. We will write more tests in class on January 22nd. These tests will then be reviewed and provided to you to use to test your code. They will also be used to grade your submission. For this assignment, bonus points are being awarded for the unit tests, however in future assignments everyone will be expected to submit unit tests as part of their assignment grade.
Rubric
The below rubric is intended as a guideline and will be followed as closely as possible, however submissions that deviate too much from the specification may be graded using an alternate rubric.

TupleDesc (15 points): instance variables and getters and setters (5 points), toString (5 points), field access methods (5 points)

Tuple (15 points): instance variables and getters and setters (5 points), toString (5 points), field access methods (5 points)

Catalog (15 points): instance variables and getters and setters (5 points), implementation of inner class (5 points), table management methods (5 points)

HeapPage (25 points): instance variables and getters and setters (5 points), numSlots and header size (5 points), adding and deleting tuples (10 points), iterator (5 points)

HeapFile (30 points): instance variables and getters and setters (5 points), adding and deleting tuples (10 points), read and write page (10 points), getAllTuples() (5 points)

Unit Tests (5 bonus points): For each group that submits 2 or more unit tests, 5 bonus points will be awarded. These tests are due by noon on Wednesday, January 23rd. The tests will then be distributed to the class so they can be used to check your work during the assignment. I reserve the right to deduct points for code that is poorly formatted or poorly documented.

Total: 100 points
Submission
To submit your homework, simply commit and push your code to your repository by the due date. Be sure to put your names at the top of each file. Any submissions that are committed and pushed after the due date will be discarded.
Acknowledgements
This lab is a heavily modified version of a lab provided from MIT Open Courseware. I am grateful to them for providing this service.

Sam Madden, Robert Morris, Michael Stonebraker, Carlo Curino, 6.830 / 6.814 Database Systems, Fall 2010. (MIT OpenCourseWare: Massachusetts Institute of Technology), https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/index.htm (Accessed 9/11/17). License: Creative commons BY-NC-SA