$29
Two datasets (Market, Gene) can be found on Piazza. For each dataset, we provide the transaction-item representation discussed in class—Each row denotes a transaction, and each transaction consists of a set of items.
In this assignment, you are asked to implement Apriori algorithm that discovers a collection of frequent itemsets from a transaction database.
Template is for Python 3. You are asked to fill in two functions: apriori_gen and get_freq. apriori_gen generates new candidate itemsets based on the frequent itemsets found in the previous iteration, and prunes the itemsets containing any infrequent itemset of size k-1. get_freq returns the candidate itemsets that meet a minimum support threshold. Do not change the input and output of the two functions in the template.
Please take the following steps:
1. Implement Apriori algorithm. The pseudo code can be found below. Apriori (T, minSupport){
◦ T is the database and minSupport is the minimum support
1 = {the set of single item whose support is not less than minSupport} for (k = 2; −1 ≠ ∅; k++ ){
//generate and prune candidate set
is a list of itemsets in which each itemset is formed by merging two itemsets in −1 if their first k-2 items are identical
Remove an itemset from if any (k-1)-subset of this candidate itemset is not in the frequent itemset list −1
//Count the support of each candidate itemset
for each transaction t in database do{
for each candidate in
• increment the support count of all candidate itemsets that are contained in transaction t
if c is a subset of t then count[c] ← count[c] + 1
}
for each candidate in
• Judge if a candidate itemset is frequent or not if the support of c is not less than minSupport
then include c in
}
return { 1, 2, … , }
}
You cannot directly call a function or package that implements Apriori. You need to implement the algorithm by yourself. If you are not sure about whether it is OK to use a certain function, please post your question on Piazza.
2. Apply your implemented Apriori on the Market dataset with minimum support threshold=50%. You should get the following candidate itemsets (Ck) and frequent itemsets (Lk):
Candidate itemsets:
C2: {Eggs,Key-chain}, {Eggs,Mango}, {Eggs,Onion}, {Eggs,Yo-yo}, {Key-chain, Mango}, {Key-chain, Onion}, {Key-chain,Yo-yo}, {Mango,Onion}, {Mango,Yo-yo}, {Onion,Yo-yo}
C3: {Eggs,Key-chain,Onion}, {Key-chain,Mango,Onion}
Frequent itemsets:
L1: {Eggs}, {Key-chain}, {Mango}, {Onion}, {Yo-yo}
L2: {Eggs,Key-chain}, {Eggs,Onion}, {Key-chain,Mango}, {Key-chain,Onion}, {Key-
chain,Yo-yo}, {Mango, Onion}
L3: {Eggs, Key-chain,Onion}
3. If you get the same collection of itemsets at Step 2, you can proceed to apply your implemented Apriori algorithm on the Gene dataset with minimum support threshold=50%. You should be able to get 51 length-1 frequent itemsets (L1), 1275 length-2 candidate itemsets (C2), 29 length-2 frequent itemsets (L2), 20 length-3 candidate itemsets (C3) and 2 length-3 frequent itemsets (L3).
4. Prepare your submission. Your final submission should be a zip file named as Assignment2.zip. In the zip file, you should include:
A folder “Code”, which contains all the codes used in this assignment. You are required to fill in the template. Do not change other functions or add other scripts.
Report: A doc or pdf file named as Assignment2.pdf. The report should consist of the following parts: 1) The frequent itemsets you obtain on Gene dataset (L1, L2, L3). 2) The length-3 candidate itemsets generated during Apriori (C3) on Gene dataset. 3) The codes of your Apriori algorithm implementation.
5. Log in any CSE department server and submit your zip file as follows:
submit_cse469 Assignment2.zip
Please refer to Course Syllabus for late submission policy and academic integrity policy. We will take the submission time recorded by the server as the time of your submission. This assignment must be done independently. Running your submitted code should be able to reproduce the results in your report.