APRIORI
What are the two steps to association rule mining?
1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup (support ≥ min_sup). 2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both minimum support and minimum confidence (confidence ≥ min_conf).
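Step 2 can be illustrated with a minimal Python sketch. It assumes step 1 has already produced a table of support counts for every frequent itemset; the function name generate_rules, the item names, and the counts are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    support_counts maps each frequent itemset (a frozenset) to its count.
    By the Apriori property, every subset of a frequent itemset is also
    in the table, so the antecedent lookup below always succeeds.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty "if" part and "then" part
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                confidence = count / support_counts[antecedent]
                if confidence >= min_conf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical output of step 1 (frequent itemsets with their counts).
support_counts = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"bread", "milk"}): 3,
}
rules = generate_rules(support_counts, min_conf=0.7)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

Because every itemset in the table already meets min_sup, only the confidence threshold needs to be checked in this step.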
What is the apriori property?
All nonempty subsets of a frequent itemset must also be frequent. This is an anti-monotone property in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
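A small Python check can make the anti-monotone property concrete: no nonempty subset of an itemset can have a lower support count than the itemset itself. The transactions and helper names below are hypothetical.

```python
from itertools import combinations

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def nonempty_subsets(itemset):
    """Yield every nonempty subset of itemset (including itself)."""
    items = sorted(itemset)
    for r in range(1, len(items) + 1):
        for subset in combinations(items, r):
            yield frozenset(subset)

itemset = frozenset({"bread", "milk", "diapers"})
itemset_count = support_count(itemset, transactions)
subset_counts = {s: support_count(s, transactions) for s in nonempty_subsets(itemset)}

# Every subset occurs at least as often as the full itemset, so if the
# itemset meets min_sup, all of its subsets do too.
holds = all(count >= itemset_count for count in subset_counts.values())
print(holds)
```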
How do you generate the candidate itemsets?
By self-joining the frequent k-itemsets to form candidate (k+1)-itemsets, then pruning every candidate that has an infrequent k-item subset.
What are the 2 basic operations of apriori?
Candidate generation and candidate counting.
What are the factors that affect complexity in apriori?
1. Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets, which may increase the number of candidates and the maximum length of frequent itemsets. 2. Dimensionality (number of items) of the data set: more space is needed to store the support count of each item, and if the number of frequent items also increases, both computation and I/O costs may increase. 3. Size of the database: since Apriori makes multiple passes, the runtime of the algorithm may increase with the number of transactions. 4. Average transaction width: transaction width increases with denser data sets, which may increase the maximum length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width).
What measures are used to evaluate discovered rules? Define them.
Confidence (also called accuracy): when the "if" part is true, how often the "then" part is also true. It reflects the certainty of discovered rules. Support: the fraction of transactions in the database that contain both the "if" and the "then" parts; some texts use "coverage" for the fraction that contains the "if" part alone. It reflects the usefulness of discovered rules.
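As a worked example, the two measures can be computed for a rule such as bread -> milk. This is a minimal sketch; the transactions and function names are hypothetical, and support here means the fraction of transactions that contain every item of the itemset.

```python
# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions with the "if" part, the share that also has the "then" part."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)          # 3 of 5 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 4 with bread
print(rule_support, rule_confidence)
```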
Describe the self-joining step.
Create set Ck+1 by joining frequent k-itemsets that share the first k-1 items.
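The join can be sketched in Python as follows. The sketch assumes each itemset is stored as a sorted tuple so that "the first k-1 items" is well defined; the function name self_join and the sample itemsets are hypothetical.

```python
def self_join(frequent_k):
    """Join frequent k-itemsets that agree on their first k-1 items.

    Each itemset is a sorted tuple. Two itemsets with the same prefix
    are merged into one (k+1)-itemset that keeps the sorted order.
    """
    frequent_k = sorted(frequent_k)
    candidates = []
    for i, a in enumerate(frequent_k):
        for b in frequent_k[i + 1:]:
            if a[:-1] == b[:-1]:  # same first k-1 items
                candidates.append(a + (b[-1],))
    return candidates

# Hypothetical frequent 2-itemsets.
L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
C3 = self_join(L2)
print(C3)  # [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D')]
```

Requiring a shared prefix means each candidate is generated exactly once, so no duplicate check is needed.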
What is the apriori principle?
If an itemset is frequent, then all of its subsets must also be frequent. If an itemset is not frequent, then all of its supersets cannot be frequent.
What is Apriori?
It is a seminal algorithm for mining frequent itemsets. It uses prior knowledge of frequent itemset properties and employs an iterative approach known as a level-wise search, in which frequent k-itemsets are used to explore (k+1)-itemsets.
What is lift?
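It is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response with respect to the population as a whole. It is measured against a random-choice targeting model. For a rule A -> B, lift = confidence(A -> B) / support(B); a value above 1 means A and B occur together more often than would be expected if they were independent.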
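For an association rule A -> B, lift is commonly computed as confidence(A -> B) / support(B), which equals P(A and B) / (P(A) * P(B)). The following Python sketch applies that formula; the transactions and the function names are hypothetical.

```python
# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def lift(antecedent, consequent, transactions):
    """P(A and B) / (P(A) * P(B)), computed from raw counts."""
    n = len(transactions)
    both = count(antecedent | consequent, transactions)
    return (both * n) / (count(antecedent, transactions) * count(consequent, transactions))

bread_milk_lift = lift({"bread"}, {"milk"}, transactions)
print(bread_milk_lift)  # 0.9375
```

A lift of 1 corresponds to the random-choice baseline (independence); here the value is slightly below 1, so in this toy data buying bread does not make milk more likely than it already is overall.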
What are the challenges of the Apriori algorithm?
Multiple scans of the transaction database, a huge number of candidates, and a tedious support-counting workload for those candidates.
How can you overcome overfitting?
Reduce model complexity (e.g., by reducing dimensionality with PCA) or apply regularization.
How would one improve apriori?
Reduce the number of transaction database scans, shrink the number of candidates, facilitate support counting of candidates.
Describe the pruning step.
Remove from Ck+1 every candidate itemset that has a k-item subset that is not frequent.
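A minimal Python sketch of the pruning step, assuming itemsets are stored as sorted tuples; the function name prune and the sample itemsets are hypothetical.

```python
from itertools import combinations

def prune(candidates, frequent_k):
    """Keep only the candidates whose k-item subsets are all frequent.

    candidates holds (k+1)-itemsets and frequent_k holds the frequent
    k-itemsets, all as sorted tuples.
    """
    frequent = set(frequent_k)
    kept = []
    for candidate in candidates:
        k_subsets = combinations(candidate, len(candidate) - 1)
        if all(subset in frequent for subset in k_subsets):
            kept.append(candidate)
    return kept

# Hypothetical frequent 2-itemsets and the 3-item candidates joined from them.
L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
C3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D")]
pruned = prune(C3, L2)
print(pruned)  # [('A', 'B', 'C')]
```

Here ('A', 'B', 'D') is dropped because ('B', 'D') is not frequent, and ('A', 'C', 'D') is dropped because ('C', 'D') is not frequent, so neither needs to be counted against the database.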
What is the core of the apriori algorithm?
Use the frequent (k - 1)-itemsets to generate candidate k-itemsets, then use a database scan and pattern matching to collect the counts for those candidates.
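Putting the pieces together, the level-wise loop can be sketched in Python as below. This is an illustrative, unoptimized version (it counts candidates with a plain scan instead of a hash tree); the function name apriori, the min_count parameter, and the transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one database scan.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    all_frequent = dict(frequent)

    while frequent:
        prev = sorted(frequent)
        prev_set = set(prev)

        # Candidate generation: self-join on a shared prefix, then prune
        # candidates that have an infrequent subset one item shorter.
        candidates = []
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] == b[:-1]:
                    cand = a + (b[-1],)
                    if all(s in prev_set for s in combinations(cand, len(cand) - 1)):
                        candidates.append(cand)

        # Candidate counting: one database scan per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1

        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)

    return all_frequent

# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(transactions, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

With min_count=2 on this data, all four items and all six pairs are frequent, while every 3-item candidate occurs only once, so the loop stops after the third level.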
