Introduction to Data Mining Homework 1 Solution

Starting from:

$30

Home

General Instructions

The programming assignment will be hosted on hackerrank (https://www.hackerrank.com/) as a programming contest. To participate in this contest, please open a hackerrank account with your illinois.edu email netid. If you already have a username in hackerrank that is di erent from your netid, please ll out your netid and username in this spreadheet (https://docs.google.com/ spreadsheets/d/1WkFufsaJWAPW1yyDWs4bmZu4dEphQPkbvpLRW8TFVGs/edit?usp=sharing).

It is OK to discuss the problems with the TAs and your classmates. However, it is NOT OK to work together or share code. You need to write your code independently and the TAs will not do the code debugging. Plagiarism is an academic violation to copy, to include text from other sources, including online sources, without proper citation. To get a better idea of what constitutes plagiarism, consult the CS Honor code (http://cs.illinois.edu/academics/honor-code) on academic integrity vio-lations, including examples, and recommended penalties. There is a zero tolerance policy on academic integrity violations. Any student found to be violating this code will be subject to disciplinary action.

Please use Piazza rst if you have questions about the homework. Also feel free to send us e-mails and come to o ce hours.

Programming Assignment Instructions

This question aims to provide you a better understanding of the frequent pattern mining and the closed/maximal pattern mining. Participate in the programming contest hosted at hackerrank: https://www.hackerrank.com/cs412-hw1-qi-li.

Please read the problem description carefully.

The input will always be valid. We are mainly testing your understanding of frequent pattern mining, not your coding skills.

Please pay special attention to the output format. We will be using the hackerrank based autograder and it is extremely important that your generated output satis es the requirement.

We don't have speci c constrains for this programming question. The only constrains are the standard environment constraints in hackerrank: https://www.hackerrank. com/environment.

The grading will be based on how many test cases you passed. You are provided with two sample test cases to debug your code. For the nal grading, we will use additional test cases to test your code.

Please use Piazza rst if you have questions about the homework.

Programming Assignment (50 points)

This question aims to provide you a better understanding of the frequent pattern mining and the closed/maximal pattern mining.

Implement a frequent pattern mining algorithm (e.g., the Apriori algorithm or FP-Growth) to mine the frequent patterns from a transaction dataset.

Implement a closed pattern mining algorithm to mine the closed frequent patterns from the same transaction dataset. An easy way is to write code based on the frequent patterns you got from part 1.

Implement a maximal pattern mining algorithm to mine the maximal frequent patterns from the same transaction dataset. An easy way is to write code based on the frequent patterns you got from part 1.

Input Format

The input dataset is a transaction dataset.

The rst line of the input corresponds to the minimum support.

Each following line of the input corresponds to one transaction. Items in each transaction are seperated by a space.

Please refer to the sample input below. In sample input 0, the minimum support is 2, and the dataset contains 3 transactions and 5 item types (A, B, C, D and E).

Constraints

NA

Output Format

The output are the frequent patterns you mined out from the input dataset.

Each line of the output should be in the format:

Support [ f r e q u e n t p a t t e r n ]

Support [ f r e q u e n t p a t t e r n ]

. . . . . .

The frequent patterns should be ordered according to their support from largest to small-est. Ties should be resolved by ordering the frequent patterns according to the alphabetical order.

First print all the frequent patterns for part 1, then the closed frequent patterns for part 2 and last the maximal frequent patterns for part 3. Each part should be separated by an empty line.

Please refer to the sample output below. In sample output 0, the rst 9 patterns are the frequent patterns for part 1, the following 3 patterns are the closed frequent patterns for part 2 and the last 2 patterns are the maximal frequent patterns for part 3.

Sample Input 0

2

B A C E D

C

C B D

Sample Output 0

[C]

2 [A]

2 [A C]

2 [ B ]

2 [ B C]

2 [ B C D]

2 [ B D]

2 [C D]

2 [D]

[C]

[A C]

[ B C D]

[A C]

[ B C D]

Sample Input 1

2

data mining

f r e q u e n t p a t t e r n
mining
mining
f r e q u e n t
p a t t e r n s from t h e t r a n s a c t i o n d a t a s e t
c l o s e d
and maximal p a t t e r n mining
Sample Output 1

4
[ mining ]

2
[ f r e q u e n t ]

2
[ f r e q u e n t mining ]
2
[ mining p a t t e r n ]
2
[ p a t t e r n ]

4
[ mining ]

2
[ f r e q u e n t mining ]
2
[ mining p a t t e r n ]
2
[ f r e q u e n t mining ]
2
[ mining p a t t e r n ]

4