7: Finishing File Sweeper
Goals
To calculate and use a cryptographic hash value.
To identify duplicate files in a directory tree and output their path names.
1.1 Preparation
For P6, you created a file tree as test data. Go into that tree and add several hard and soft links. Link things from different subdirectories and different levels of the same path. Create at least one file with multiple hard and soft links leading to it. Make sure you have some files of the same length and some of different lengths.
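If you prefer to script this step, here is a minimal C++17 sketch using <filesystem>; the paths are hypothetical stand-ins for files in your own test tree, and plain ln / ln -s at the shell work just as well.

    #include <filesystem>
    namespace fs = std::filesystem;

    int main() {
        // Hard link: a second directory entry, same iNode, different level.
        fs::create_hard_link("tree/a/data.txt", "tree/b/sub/data.hard");
        // Soft link across subdirectories (target is relative to the link).
        fs::create_symlink("../a/data.txt", "tree/b/data.soft");
        return 0;
    }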
Add another field to your FileID class to store the fingerprint of the file: an array of unsigned char of length SHA256_DIGEST_LENGTH.
Add a comparison function called bySize() to the FileID class that will let you sort by the file size. This function will be called at the beginning of P7.
Add another comparison function called byFprint() to the FileID class that will let you sort by the fingerprint field.
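Putting these additions together, here is one possible shape for the class; the member names (path, size, inode, fprint) are assumptions carried over from a typical P6 solution, not requirements.

    #include <openssl/sha.h>   // SHA256_DIGEST_LENGTH
    #include <sys/types.h>     // off_t, ino_t
    #include <cstring>         // memcmp
    #include <string>

    class FileID {
      public:
        std::string   path;
        off_t         size;
        ino_t         inode;
        unsigned char fprint[SHA256_DIGEST_LENGTH];  // the new fingerprint field

        // Comparators for stable_sort(); static so they can be passed directly.
        static bool bySize(const FileID& a, const FileID& b) {
            return a.size < b.size;
        }
        static bool byFprint(const FileID& a, const FileID& b) {
            return std::memcmp(a.fprint, b.fprint, SHA256_DIGEST_LENGTH) < 0;
        }

        void computeFprint();   // the hash function described next
    };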
Add a function to the FileID class that will calculate the SHA256 hash of the file and store it in the FileID object. It will be called in the last phase of the process. You should copy and modify my SHA demo program to implement it. Use fread() to read the file into a buffer in huge blocks, to minimize disk time. Suggestion, if it works for your system: 10^10 bytes. For many files, this is big enough to hold the whole file.
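Since you will be adapting my SHA demo program, treat the following only as a sketch of the usual OpenSSL pattern; the function name computeFprint() and the block size are assumptions.

    #include <openssl/sha.h>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    void FileID::computeFprint() {
        const size_t BUFSIZE = size_t(1) << 24;   // 16 MB blocks; grow if your system allows
        std::vector<unsigned char> buf(BUFSIZE);

        FILE* fp = std::fopen(path.c_str(), "rb");
        if (!fp) { std::perror(path.c_str()); std::exit(EXIT_FAILURE); }

        SHA256_CTX ctx;
        SHA256_Init(&ctx);
        size_t n;
        while ((n = std::fread(buf.data(), 1, BUFSIZE, fp)) > 0)
            SHA256_Update(&ctx, buf.data(), n);   // hash each block as it is read
        SHA256_Final(fprint, &ctx);               // store the digest in the fingerprint field
        std::fclose(fp);
    }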
In the Sweeper class, define a function called areDups() with two pathname parameters. Its job is to compare the two files and return true if they are identical, false otherwise. This function will only be called if the file size is the same and the iNode #s are different.
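A minimal sketch of areDups(), assuming the caller has already verified equal sizes and distinct iNodes; it reads both files in parallel and compares them block by block.

    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    bool Sweeper::areDups(const std::string& p1, const std::string& p2) {
        const size_t BLOCK = size_t(1) << 20;     // 1 MB compare blocks
        std::vector<unsigned char> b1(BLOCK), b2(BLOCK);

        FILE* f1 = std::fopen(p1.c_str(), "rb");
        FILE* f2 = std::fopen(p2.c_str(), "rb");
        if (!f1 || !f2) {
            if (f1) std::fclose(f1);
            if (f2) std::fclose(f2);
            return false;                         // unreadable: treat as not duplicates
        }

        bool same = true;
        for (;;) {
            size_t n1 = std::fread(b1.data(), 1, BLOCK, f1);
            size_t n2 = std::fread(b2.data(), 1, BLOCK, f2);
            if (n1 != n2 || std::memcmp(b1.data(), b2.data(), n1) != 0) {
                same = false;
                break;
            }
            if (n1 < BLOCK) break;                // both files ended together
        }
        std::fclose(f1);
        std::fclose(f2);
        return same;
    }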
Sweeper::run().
By the end of program 6, you had built a vector<FileID> that contained one FileID object for every file and every hard link in your directory tree (starting at the specified directory).
The goal of program 7 is to use that vector to discover the duplicate files, and to report on those files in enough detail that someone could actually delete them.
First, sort by size.
In the Sweeper class, declare two iterators (start and end) for your vector. Initialize both to slot 0 of the vector.
Use stable_sort() and bySize() to sort your vector by size. The prototype is:
void stable_sort ( Iterator first, Iterator last, Compare comp );
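Assuming the vector built in P6 is a Sweeper member named files (the name is an assumption), the call looks like:

    std::stable_sort(files.begin(), files.end(), FileID::bySize);  // needs <algorithm>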
Now, enter a loop. This loop implements what some people call a "sliding window". The top of the window is set to the first FileID of each size, in turn, and the bottom of the window is set to the first FileID that is of the next bigger size. Between are the files that could possibly be duplicates.
Enter a loop that increments the end iterator, checking the file size each time. Leave the loop when the size changes. Now your iterators point to the beginning and the off-the-end position of a block of equal-size FileID objects. This is the correct position for calling sort() or stable_sort().
If the difference of the iterators is only one slot, you are done with this block of objects. Set the start iterator = the end iterator and continue from the top of the loop.
If the difference of the iterators is more than one slot, you might have a duplicate OR you might have a file that has hard links. To check, call the checkDups() function.
When checkDups() returns, go back to the top of the loop.
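Here is one way the whole loop can look, assuming the member iterators start and end declared above and a vector<FileID> member named files; it is a sketch of the steps just described, not required code.

    #include <algorithm>
    #include <vector>

    void Sweeper::run() {
        std::stable_sort(files.begin(), files.end(), FileID::bySize);
        start = end = files.begin();              // both at slot 0

        while (start != files.end()) {
            // Slide end down to the first FileID of the next bigger size.
            end = start + 1;
            while (end != files.end() && end->size == start->size)
                ++end;

            // Two or more same-size files: duplicates or hard links?
            if (end - start > 1)
                checkDups();

            start = end;                          // top of the next window
        }
    }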
Write a function void checkDups(). This function will do the actual output of duplicate-file pathnames.
Because you used stable_sort(), the objects are still in iNode order. Check whether all the iNode #s in the block are the same. If so, print the paths from the FileID objects in the block. For each, print "Link: " followed by the pathname.
If there are at least two different iNode #s in the block, calculate the SHA256 hash for each FileID and store it in the object. Do this by calling the function in the FileID class. (With a little programming effort, you can avoid calculating the hash twice for the same iNode #.)
When hash values have been stored in all the objects in the current block, call stable_sort() again to sort the FileID objects by fingerprint.
Start at the top of the block. Use control-break logic to print groups of duplicate files, that is, groups of FileIDs that have the same fingerprint. (Either these files actually are duplicates, or we have "broken" SHA256 and have a publishable paper.)
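A sketch of checkDups() over the window [start, end), following the steps above; the exact output format for duplicate groups is up to you, and the member names match the earlier sketches.

    #include <algorithm>
    #include <cstdio>
    #include <cstring>

    void Sweeper::checkDups() {
        // stable_sort() preserved iNode order, so one pass detects the all-links case.
        bool allSameInode = true;
        for (auto it = start + 1; it != end; ++it)
            if (it->inode != start->inode) { allSameInode = false; break; }

        if (allSameInode) {                       // hard links, not duplicates
            for (auto it = start; it != end; ++it)
                std::printf("Link: %s\n", it->path.c_str());
            return;
        }

        // Hash each distinct iNode once; equal iNodes are adjacent, so copy repeats.
        for (auto it = start; it != end; ++it) {
            if (it != start && it->inode == (it - 1)->inode)
                std::memcpy(it->fprint, (it - 1)->fprint, SHA256_DIGEST_LENGTH);
            else
                it->computeFprint();
        }

        std::stable_sort(start, end, FileID::byFprint);

        // Control-break logic: one group per distinct fingerprint.
        for (auto it = start; it != end; ) {
            auto grp = it + 1;
            while (grp != end &&
                   std::memcmp(grp->fprint, it->fprint, SHA256_DIGEST_LENGTH) == 0)
                ++grp;
            if (grp - it > 1) {                   // at least two paths share a fingerprint
                std::printf("Duplicates:\n");
                for (auto d = it; d != grp; ++d)
                    std::printf("    %s\n", d->path.c_str());
            }
            it = grp;
        }
    }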