$35
Setup
The code for this assignment has been pushed to the same repository that you forked for the previous assignment. That means all you need to do is sync your repository with the main repository. There are multiple ways to do this, perhaps the easiest way is to login to the BitBucket website, go to the page for your repository, and click the "Sync Now" link on the right side of the screen. You can also use eclipse or the command line to perform a pull from upstream, if you prefer.
While this assignment is related to previous assignments (see the optional part, below), you are not required to have a working version of the first two assignments for this assignment to work.
Overview
In this assignment, you will implement B+ trees as discussed in class.You will also have the option to integrate these trees with the rest of your DBMS.
Nodes
Your first task is to complete the InnerNode and LeafNode classes. InnerNode is a class that represents an internal node of your B+ tree, while LeafNode represents a leaf. Examine the methods and come up with a design before beginning your implementation.
Some implementation hints for your nodes:
* An InnerNode needs to store search keys. These keys come in the form of Fields meaning that keys for our implementation could be either ints or strings. Your implementation should work for both.
* InnerNodes also need to maintain a list of their children. It is up to you to decide how to implement this.
* LeafNodes store Entry objects, where an Entry consists of a Field and a heap page number. For the purposes of this assignment, the Field is most relevant (unless you attempt the optional part, below), but you should still be able to store and retrieve page numbers properly. Note that some of the Fields in the entries will eventually become keys.
* The degree of a node determines how many keys and children (for an InnerNode) and how many entries (for a LeafNode) each node can contain.
* Keys and entries should be stored in sorted order, from least (index 0) to greatest. Entries should be sorted by the Fields that they contain.
* You are not limited to the methods included in the node classes. Please add more methods as necessary.
B+ Trees
Once your nodes are setup, it is time to implement the B+ tree. Our tree needs to be able to search, insert, and delete.
Some implementation hints for BPlusTree:
* Note that search() returns a LeafNode. This will be very useful when implementing insert and delete.
* If a search is requested for a value that is not in the tree, simply return null.
* We will assume that the values being inserted are unique. This means that if an insertion is requested for a value that is already in the tree, the tree should not change.
* If a deletion is requested for a value that is not in the tree, simply do nothing.
* The keys of an InnerNode should represent the largest value contained in the subtree to the immediate left. For example, if the keys contained in an InnerNode are 4 and 8, all of the values in the left child will be less than or equal to 4, all of the values in the middle subtree will be greater than 4 and less than or equal to 8, and all of the values in the rightmost subtree will be greater than 8. This is consistent with the examples from the textbook and the examples we worked on in class.
* When inserting new values, sometimes it will be necessary to split a node. If there are an even number of values, then after the split the two nodes should have an equal number of values (half in the left, half in the right). If there are an odd number of values, the left child should contain one more value than the right child.
* Pseudocode for these methods can be found in the textbook listed on the course website and also in the slides from class.
Integration with previous work (Optional)
By itself, a B+ tree is interesting, but not particularly useful. To see the benefit of a B+ tree in action, we must integrate it with the other parts of our DBMS.
There are two additional pieces of functionality required to integrate our B+ trees:
* Storing a B+ tree in a HeapFile
* Constructing a B+ from a HeapFile
If you wish to tackle this, consider the following:
* You will need a mapping from a B+ node to a tuple. The size of the tuple is important, as that will dictate the degree of the B+ tree. This will differ depending on whether or not we use ints or strings as the key.
* When storing a B+ tree to a HeapFile one node from our tree (either a LeafNode or an InnerNode) will correspond to one page in our HeapFile. The structure of the page should be roughly the same regardless of node type.
* Numbering the nodes in our tree becomes important for storage purposes. Each node should know which page it belongs to in the heap file.
* Reading in a tree works in the exact opposite way: read in one page at a time and turn that page into a Node. The tricky part here comes from determining whether a page represents an InnerNode or a LeafNode. Note that since our tree is balanced, we can easily determine how many inner nodes there are based on the number of pages. If we are careful about how we store nodes (for example, first n pages are inner nodes, all others are leaf nodes), then we can solve this problem.
* Our Catalog should be updated to provide a method to create an index for a given table. It should also look for an index when a table is imported (by searching for tablenameindex.dat, for example) and load the index if it exists. The metadata for a table will need to be updated to include any indexes that exist.
* Queries should ask the Catalog about indexes that exist on relevant columns and use those indexes if they are available.
* When deleting elements, sometimes it will be necessary to attempt to borrow an element from a sibling. You should attempt to borrow from the left sibling first (with the same parent), and if borrowing from that sibling is not possible you should then try borrowing from the right sibling. * When deleting elements, sometimes it will be necessary to attempt to merge nodes. You should attempt to merge the left sibling first (with the same parent), and if merging with that sibling is not possible you should then try merging the right sibling.
Testing
You are expected to write some unit tests for this assignment. Each group should submit two unit tests by noon on Monday, February 25th. I will then go through those unit tests, compile them, and give them back to you to use as your work on the assignment. These tests will also be used to grade your work.
Rubric
The below rubric is intended as a guideline and will be followed as closely as possible, however submissions that deviate too much from the specification may be graded using an alternate rubric.
LeafNode (10 points)
InnerNode (10 points)
BPlusTree: search (15 points). Simple insertion without any splits (10 points). Complex insertion that requires splits (15 points). Simple deletion without any merges (10 points). Complex deletion that requires merges (15 points). Ability to handle different degrees (10).
Tests: Two unit tests (5 points) I reserve the right to deduct points for code that is poorly formatted or poorly documented.
Total: 100 points
Submission
To submit your homework, simply commit and push your code to your repository by the due date. Be sure to put your names at the top of each file. Any submissions that are committed and pushed after the due date will be discarded.
Acknowledgements
This lab was mostly created by me, however since it is based on hw1 which was taken from MIT OpenCourseWare, it is still appropriate to provide attribution to them:
Sam Madden, Robert Morris, Michael Stonebraker, Carlo Curino, 6.830 / 6.814 Database Systems, Fall 2010. (MIT OpenCourseWare: Massachusetts Institute of Technology), https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/index.htm (Accessed 9/11/17). License: Creative commons BY-NC-SA