PS7: Data Analysis Speed Demon Solution

Starting from:

~~$30~~

$24

Home

Overview

We live in a world overwhelmed with information. Every two days, we create as much new infor-mation as was created in the entire history of human civilization up until 2003 (according to Eric Schmidt, the CEO of Google). We produce more than 3 exabytes of data per day! Much of this data is stored in large databases, and one of the challenges today is to rapidly process and analyze all the data. Since the databases are so large, it requires very fast algorithms and data structures that are highly optimized for maximum e ciency. In this competition, you will try to develop the fastest algorithm for analyzing a large dataset.

When analyzing a large dataset, there are many di erent goals. We will focus on a particular type of data mining in which we want to discover properties and patterns of the underlying data. Frequently, we want to know various statistics: the average, the median and the mode. Often, we also want to know about patterns: how often do users of a certain pro le (e.g., males between the ages of 18 and 32) buy a certain item? Often, we want to know about correlations: how often does a user who buys item A click on link B?

For the purpose of this competition, we de ne an abstract data mining problem that involves nding correlations in our data set. The database consists of a large set of very large data entries. The goal is to nd how many pairs of entries are identical, i.e., contain the same information. Your job is to implement a data mining program that reads in the database and performs this analysis as fast as possible.

Problem Details

We now describe the details of the competition more precisely.

Input. The input \database" is a le consisting of a set of lines of text, each of which represents one entry. Each line of text contains a large number of characters. (Notice that the lines may consist of thousands, or even tens of thousands, of characters.)

Your program will implement a single public method: public int processData(String filename). This method will take a lename as an input. It should then parse the le and process the data in

the le. Finally, it should return an integer as the answer. (We have given you an existing class le template to use.)

The format of the input is as follows. The rst line of the database contains a single integer i, which represents the number of entries in the database. It is followed by i lines, each containing an arbitrary number of characters and ended by an end-of-line character (ASCII character 10). Note that entries may consist of any of the 128 legal ASCII characters, except for 10 and 13 (which indicate a new line). Characters may be repeated, and entries may be of any length.

1

Output. Your program should calculate the number of pairs of entries that contain an identical multiset (i.e. multiple instances of the same character are allowed) of characters. Notice that the characters may appear in any order: two entries e1 and e2 are equivalent if e1 is equal to some permutation of e2. You should print your output, which should consist of an integer, followed by a newline character.

Example. The following is an example of an input database:

7

BCDEFGH

ABACD

BDCEF

BDCAA

DBACA

DABACA

DABAC

The appropriate output in this case is:

6

In particular, note the following six pairs of equivalent entries:

(ABACD, BDCAA)

(ABACD, DBACA)

(ABACD, DABAC)

(BDCAA, DBACA)

(BDCAA, DABAC)

(DBACA, DABAC)

Rules

The following are the rules of the competition:

Your solution must be written in Java.

You may use any available Java libraries. (I should note that I do not expect the fastest solutions to rely heavily on existing libraries.)

You are allowed to submit only one nal program.

All programs must be written entirely by you. You may use any ideas or algorithms that you nd on the internet, in books, etc. However, all the submitted code must be written by you.

The competition will end on Friday, April 3 at 11:59pm.

You may continue to update your solution up until the deadline. The nal winner will be determined by testing that occurs after the competition ends.

2
Submission Details

Here are some instructions for submitting the code:

You need to submit your solution on Coursemology as usual.

If you want to qualify for the competition, you will need to separately submit your solution on a di erent platform. More details about this will be given later via an announcement.

The top 10 winners of the competition will get bonus marks.

Hints

A few hints toward achieving good performance:

First, develop and test a working solution that achieves good asymptotic performance. Then improve it.

Think about the performance of the data structures you are using and the actual costs of the operations involved.

Remember that for large databases, memory usage is important. Maintaining big data struc-tures that use a lot of memory can be slow.

Think about data locality in memory: accessing elements in memory that are near to each other is much cheaper than accessing elements that are far away from each other.

Beware of the small costs that add up. For example, declaring a variable leads to a memory allocation which has a cost.

Beware the costs of recursion.

Pro le your solution to determine what to optimize.

3
Java Help

How do I read in a le?

There are many di erent ways (and some may be faster than others). The easiest, perhaps, is to use the BufferedReader and FileReader classes:

import java.io.BufferedReader;

import java.io.FileReader;

The entire access of the le needs to be wrapped in a try/catch loop in order to catch any exceptions that may occur while reading the le:

try {

Code for accessing the ile goes here } catch (Exception e) {

System.out.println(e);

}

The rst thing you need to do is to open the le using the FileReader:

FileReader dataFile = new FileReader(fileName);

You then access the le via a BufferedReader:

BufferedReader bufferedDataFile = new BufferedReader(dataFile);

You can then read the le:

String line = bufferedDataFile.readline();

For more information on the Bu eredReader and FileReader (and other le access mechanisms), see the Oracle Java Reference:

https://docs.oracle.com/javase/8/docs/api/java/io/BufferedReader.html

How do I measure how fast my program runs?

We have included several sample databases to test your program on. And remember the StopWatch class you used back in the Sorting Detectives Problem Set? That might be useful here too.

4