Introduction to Data Mining Homework 2 Solution

Starting from:

~~$30~~

$24

Home

General Instructions

The programming assignment will be hosted on hackerrank (https://www.hackerrank.com/) as a programming contest. To participate in this contest, please open a hackerrank account with your illinois.edu email netid. If you already have a username in hackerrank that is di erent from your netid, please ll out your netid and username in this spreadheet (https://docs.google.com/ spreadsheets/d/1WkFufsaJWAPW1yyDWs4bmZu4dEphQPkbvpLRW8TFVGs/edit?usp=sharing).

It is OK to discuss the problems with your classmates. However, it is NOT OK to work together or share code. You need to write your code independently and the TAs will not do the code debugging. Plagiarism is an academic violation to copy, to include text from other sources, including online sources, without proper citation. To get a better idea of what constitutes plagiarism, consult the CS Honor code (http://cs.illinois.edu/academics/honor-code) on academic integrity violations, including examples, and recommended penalties. There is a zero tolerance policy on academic integrity violations. Any student found to be violating this code will be subject to disciplinary action.

Please use Piazza if you have questions about the homework.

For this assignment, TAs and the instructor will NOT provide any additional information. You should assume that we only have this problem but no solution or any idea how to solve it.

Programming Assignment Instructions

In this programming assignment, you are required to implement a contiguous sequen-tial pattern mining algorithm and apply it on text data to mine potential phrase candi-dates. Participate in the programming contest hosted at hackerrank: www.hackerrank. com/cs412-hw2-qi-li.

Please read the problem description carefully.

The input will always be valid. We are mainly testing your understanding of frequent sequential pattern mining, not your coding skills.

Please pay special attention to the output format. We will be using the hackerrank based autograder and it is extremely important that your generated output satis es the requirement.

We don't have speci c constrains for this programming question. The only constrains are the standard environment constraints in hackerrank: https://www.hackerrank. com/environment.

1

The grading will be based on if you pass the test cases. You are provided with one sample test case to debug your code. For the nal grading, we will use additional test cases to test your code.

Programming Assignment (50 points)

This question aims to train you on adapting a pattern mining algorithm on real-world ap-plications based on what you learned in class.

Understand a new problem and design an algorithm to solve it.

Implement a frequent contiguous sequential pattern mining algorithm to mine the frequent phrases from a text corpus.

Problem De nition

A contiguous sequential pattern is a sequence of items that frequently appears as a consecutive subsequence in a database of many sequences. For example, if the corpus is

good sh sandwich and french fries

disgusting sh sandwich but good french fries

their sh sandwich is the best sh sandwich

and the minimum support is 2, then patterns like \ sh sandwich" and \french fries" are both frequent contiguous sequential patterns, while the pattern \sandwich fries" is not a frequent contiguous sequential pattern because these two words are not consecutive to each other. Notice that \sandwich fries" is still a frequent sequential pattern though.

In this problem, the frequency is also de ned di erently: multiple appearances of a subsequence in a single sequence record count multiple times. The reason is that we want to nd phrases and a phrase can be repeated in a single sentence. For example, the pattern \ sh sandwich" appears once in the rst sequence, once in the second sequence and twice in the last, so its support should be calculated as 4. Another example: \A B A B A". Subsequence \A B A" actually occurs twice, so the support is 2.

Input Format

The input dataset is a text corpus. Each line is basically a sequence of strings separated by spaces. Note, punctuations are also considered as words, and are also separated by spaces. Space is not a word. The lower case letters and upper case letters are di erent (i.e., 'A' and 'a' are two words).

Be aware of the size of the input.

Constraints

Minimum length is 2, maximum length is 5, and minimum support is 2. That is, the patterns have to contain at least two words, but no more than 5. The frequency of the patterns is no less than 2.

Output Format

The output are the most frequent 20 patterns you mined out from the input dataset. The frequent patterns should be ordered according to their support from largest to smallest. Ties should be resolved by ordering the frequent patterns according to the ASCII order (increasing order).

Each line of the output should be in the format:

[ Support f r e q u e n t p a t t e r n ]

[ Support f r e q u e n t p a t t e r n ]

. . . . . .

Please refer to the sample output below.

Sample Input

good
g r i l l e d
f i s h
sandwich
and
f r e n c h f r i e s
,
but t h e
s e r v i c e i s bad
d i s g u s t i n g f i s h sandwich ,
but
good
f r e n c h
f r i e s

t h e i r
g r i l l e d
f i s h
sandwich
i s
t h e
b e s t f i s h
sandwich
, but p r i c y
A B A B A B A

Sample Output

[ 4 , ' f i s h sandwich ' ]

[ 3 , ' ,
but ' ]

[ 3 ,
'A B ' ]

[ 3 ,
'A B A ' ]

[ 3 ,
'B
A ' ]

[ 2 ,
'A B A B ' ]

2 , 'A B A B A ' ]

2 , 'B A B ' ]

[ 2 , 'B A B A ' ]

[ 2 , ' f i s h sandwich , ' ]

[ 2 , ' f i s h sandwich , but ' ]

[ 2 , ' f r e n c h f r i e s ' ]

[ 2 , ' g r i l l e d f i s h ' ]

[ 2 , ' g r i l l e d f i s h sandwich ' ]

3

[ 2 , ' sandwich , ' ]

[ 2 , ' sandwich , but ' ]