Q1. Exercise 26.6 Let us develop a new algorithm for the computation of all large itemsets. Assume that we are given a relation D similar to the Purchases table shown in Figure 26.1. We partition the table horizontally into k parts D1, ..., Dk.
1. Show that, if itemset X is frequent in D, then it is frequent in at least one of the k parts.
2. Use this observation to develop an algorithm that computes all frequent itemsets in two scans over D. (Hint: In the first scan, compute the locally frequent itemsets for each part Di, i ∈ {1, ..., k}.)
3. Illustrate your algorithm using the Purchases table shown in Figure 26.1. The first partition consists of the two transactions with transid 111 and 112, the second partition consists of the two transactions with transid 113 and 114. Assume that the minimum support is 70 percent.
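Parts 1 and 2 together describe the partition-based scheme: any globally frequent itemset must be locally frequent in at least one part, so the union of the locally frequent itemsets is a complete candidate set that a second scan can verify. A minimal sketch in Python, using brute-force local mining for clarity and hypothetical transactions (pen, ink, milk, juice) standing in for Figure 26.1:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """All itemsets whose support (fraction of transactions that
    contain them) is at least minsup; brute force, for clarity."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            cset = frozenset(cand)
            count = sum(1 for t in transactions if cset <= set(t))
            if count / n >= minsup:
                result[cset] = count
    return result

def partition_mine(partitions, minsup):
    """Two-scan scheme: scan 1 collects the locally frequent itemsets
    of every partition (by part 1, a globally frequent itemset is
    locally frequent somewhere, so the candidate set is complete);
    scan 2 counts the exact global support of each candidate."""
    candidates = set()
    for part in partitions:                        # scan 1
        candidates |= set(frequent_itemsets(part, minsup))
    all_tx = [set(t) for part in partitions for t in part]
    n = len(all_tx)
    result = {}
    for c in candidates:                           # scan 2
        count = sum(1 for t in all_tx if c <= t)
        if count / n >= minsup:
            result[c] = count
    return result

# Two partitions of two transactions each, minimum support 70 percent,
# mirroring the setup of part 3 (the item data here is hypothetical).
p1 = [["pen", "ink", "milk"], ["pen", "ink"]]
p2 = [["pen", "milk"], ["pen", "juice"]]
print(partition_mine([p1, p2], 0.7))   # {frozenset({'pen'}): 4}
```

Note that only 'pen' survives scan 2: 'ink' and {'pen', 'ink'} are locally frequent in the first partition but have only 50 percent global support.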
Q2. Assume you are given a document database that contains six documents. After stemming, the documents contain the following terms:
Document Terms
1 car, manufacturer, Honda, auto
2 auto, computer, navigation
3 Honda, navigation
4 manufacturer, computer, IBM
5 IBM, personal, computer
6 car, Beetle, VW
Answer the following questions.
1. Show the result of creating an inverted file on the documents.
2. Show the result of creating a signature file with a width of 5 bits. Construct your own hashing function that maps terms to bit positions.
3. Evaluate the following boolean queries using the inverted file and the signature file that you created:
● 'car'
● 'IBM' AND 'computer'
● 'IBM' AND 'car'
● 'IBM' OR 'auto'
● 'IBM' AND 'computer' AND 'manufacturer'
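As a concrete reference for parts 1–3, here is a minimal sketch of both access structures over the six documents. The bit-assignment hash is an arbitrary illustrative choice, not the one the exercise asks you to design:

```python
# Toy corpus from the exercise (terms after stemming).
docs = {
    1: ["car", "manufacturer", "Honda", "auto"],
    2: ["auto", "computer", "navigation"],
    3: ["Honda", "navigation"],
    4: ["manufacturer", "computer", "IBM"],
    5: ["IBM", "personal", "computer"],
    6: ["car", "Beetle", "VW"],
}

# 1. Inverted file: term -> list of ids of documents containing it.
inverted = {}
for did, terms in docs.items():
    for t in terms:
        inverted.setdefault(t, []).append(did)

# 2. Signature file of width 5: each term hashes to one bit position;
# a document's signature is the OR of its terms' bits.
WIDTH = 5

def bit(term):
    return sum(map(ord, term)) % WIDTH   # illustrative hash only

signatures = {}
for did, terms in docs.items():
    sig = 0
    for t in terms:
        sig |= 1 << bit(t)
    signatures[did] = sig

# 3. AND queries: the signature file yields candidates (documents whose
# signature contains every query-term bit), which must be verified
# against the documents themselves to weed out false positives;
# the inverted file answers exactly by intersecting posting lists.
def sig_and_query(terms):
    qsig = 0
    for t in terms:
        qsig |= 1 << bit(t)
    return [d for d, s in signatures.items() if s & qsig == qsig]

def inv_and_query(terms):
    lists = [set(inverted.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists))

print(inv_and_query(["IBM", "computer"]))   # [4, 5]
```

The signature-file candidate set is always a superset of the exact answer; comparing the two for each query shows where your chosen hash produces false positives.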
4. Assume that the query load against the document database consists of exactly the queries stated in the previous question, and that each of these queries is evaluated exactly once.
(a) Design a signature file with a width of 3 bits, and a hashing function, that minimize the overall number of false positives retrieved when evaluating the queries given in part 3.
(b) Design a signature file with a width of 6 bits and a hashing function that minimizes the overall number of false positives.
(c) Assume you want to construct a signature file. What is the smallest signature width that allows you to evaluate all queries without retrieving any false positives?
The smallest signature width should be on the order of the number of distinct terms, which here is 10, if arbitrary queries must be supported without false positives. If, however, the query load is restricted to the queries listed above, a width of 6 suffices without any false positives, as shown in the previous part.
5. Consider the following ranked queries:
● 'car'
● 'IBM computer'
● 'IBM car'
● 'IBM auto'
● 'IBM computer manufacturer'
(a) Calculate the IDF for every term in the database.
(b) For each document, show its document vector.
(c) For each query, calculate the relevance of each document in the database, with and without the length normalization step.
(d) Describe how you would use the inverted index to identify the top two documents that match each query.
(e) How would having the inverted lists sorted by relevance instead of document id affect your answer to the previous question?
(f) Replace each document with a variation that contains 10 copies of the same document. For each query, recompute the relevance of each document, with and without
the length normalization step.
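A sketch of the computations behind parts (a)–(c), assuming the common IDF definition log(N / df(t)) and raw term counts for TF; check these definitions against the ones your text uses, since IDF variants differ:

```python
import math

docs = {
    1: ["car", "manufacturer", "Honda", "auto"],
    2: ["auto", "computer", "navigation"],
    3: ["Honda", "navigation"],
    4: ["manufacturer", "computer", "IBM"],
    5: ["IBM", "personal", "computer"],
    6: ["car", "Beetle", "VW"],
}
N = len(docs)

# (a) IDF(t) = log(N / df(t)), where df(t) is the number of
# documents containing t.
df = {}
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1
idf = {t: math.log(N / d) for t, d in df.items()}

# (b) Document vectors: TF * IDF per term, with TF a raw count.
vectors = {}
for did, terms in docs.items():
    vec = {}
    for t in terms:
        vec[t] = vec.get(t, 0.0) + idf[t]
    vectors[did] = vec

# (c) Relevance: dot product of the (binary) query vector and the
# document vector; with length normalization, divide by the
# document vector's Euclidean norm.
def relevance(query_terms, did, normalize=False):
    vec = vectors[did]
    score = sum(vec.get(t, 0.0) for t in query_terms)
    if normalize and score:
        score /= math.sqrt(sum(w * w for w in vec.values()))
    return score
```

Under these definitions, documents 4 and 5 tie on 'IBM computer' before normalization; normalization breaks the tie in favour of document 4, whose vector is shorter because 'personal' (a rare, high-IDF term) inflates document 5's norm.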
Q3. You are in charge of the Genghis ('We execute fast') search engine. You are designing your server cluster to handle 500 million hits a day and 10 billion pages of indexed data. Each machine costs $1,000, and can store 10 million pages and respond to 200 queries per second (against these pages).
1. If you were given a budget of $500,000 for purchasing machines, and were required to index all 10 billion pages, could you do it?
2. What is the minimum budget to index all pages? If you assume that each query can be answered by looking at data in just one (10 million page) partition, and that queries are uniformly distributed across partitions, what peak load (in number of queries per second) can such a cluster handle?
3. How would your answer to the previous question change if each query, on average, accessed two partitions?
4. What is the running budget required to handle the desired load of 500 million hits per day if all queries are on a single partition? Assume that queries are uniformly distributed with respect to time of day.
5. Would your answer to the previous question change if the load went up to 5 billion hits per day? How would it change if the number of pages went up to 100 billion?
6. Assume that each query accesses just one partition, and that queries are uniformly distributed across partitions, but that at any given time the peak load on a partition is up to 10 times the average load. What is the minimum budget for purchasing machines in this scenario?
7. Take the cost for machines from the previous question and multiply it by 10 to reflect the costs of maintenance, administration, network bandwidth, etc. This amount is your annual cost of operation. Assume that you charge advertisers 2 cents per page. What fraction of your inventory (i.e., the total number of pages that you serve over the course of a year) do you have to sell in order to make a profit?
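These questions are back-of-envelope arithmetic. A sketch of the calculations for parts 1, 2, 4, and 6, using only the figures given in the problem statement:

```python
MACHINE_COST = 1_000             # dollars per machine
PAGES_PER_MACHINE = 10_000_000
QPS_PER_MACHINE = 200
TOTAL_PAGES = 10_000_000_000
HITS_PER_DAY = 500_000_000

# 1. Storage alone dictates the minimum cluster size.
machines = TOTAL_PAGES // PAGES_PER_MACHINE     # 1,000 machines
cost = machines * MACHINE_COST                  # $1,000,000
# so a $500,000 budget cannot index all 10 billion pages

# 2. With one partition per query and queries uniform across
# partitions, all machines serve queries in parallel.
peak_qps = machines * QPS_PER_MACHINE           # 200,000 queries/sec

# 4. Average load of 500M hits/day, uniform over the day.
avg_qps = HITS_PER_DAY / (24 * 60 * 60)         # ~5,787 queries/sec
# far below peak_qps, so the storage-driven 1,000 machines suffice

# 6. If peak per-partition load is 10x the average, each partition
# must handle 10 * (avg_qps / number of partitions).
per_partition_peak = 10 * avg_qps / machines    # ~57.9 queries/sec
# still under 200 qps per machine, so 1,000 machines still suffice
```

The same pattern (scale one input, recompute) answers part 5: at 5 billion hits per day the average load rises tenfold, and at 100 billion pages the storage-driven machine count rises tenfold.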