Association rule mining
FP-growth Algorithm
1) Builds a condensed representation of the database as an FP-tree. 2) Then uses a recursive divide and conquer approach to mine frequent itemsets.
2 Main bottlenecks of the Apriori Alg
1) Can generate very large candidate sets. 2) Must scan the DB multiple times to find support.
2 Stages of the Apriori algorithm
1) Candidate generation. 2) Candidate test.
4 factors affecting Apriori complexity
1) Choice of minimum support threshold. Lowering it means more frequent itemsets to consider. 2) Dimensionality. More dimensions means more space needed to store support counts. 3) DB size. Apriori scans the DB multiple times. 4) Maximum transaction width. Wider transactions may increase the max length of frequent itemsets.
2 principles of FP-growth Alg
1) Compress the database into a Frequent-Pattern tree. Avoids costly repeated database scans. 2) Use a divide and conquer mining task that breaks mining into smaller sub-tasks. Avoids the candidate generation issues of Apriori.
FP-Growth divide and conquer steps
1) For each frequent item, construct its "conditional pattern base", and then its conditional FP-tree 2) Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or it contains only one path.
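Step 1 above can be sketched in Python. For brevity this sketch derives a conditional pattern base directly from the frequency-ordered transactions rather than by walking tree nodes; the resulting (prefix, count) pairs are the same ones a header-table walk over the FP-tree would yield. The function name and sample transactions are illustrative assumptions, not from the source.

```python
from collections import Counter

def conditional_pattern_base(ordered_transactions, item):
    """Map each prefix path preceding `item` to its total count."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])  # items before `item`
            if prefix:
                base[prefix] += 1
    return dict(base)

# Transactions already reordered by descending global item frequency:
ordered = [['f', 'c', 'a', 'm', 'p'],
           ['f', 'c', 'a', 'b', 'm'],
           ['f', 'b'],
           ['c', 'b', 'p'],
           ['f', 'c', 'a', 'm', 'p']]

base_m = conditional_pattern_base(ordered, 'm')
# base_m collects the prefixes of 'm': ('f','c','a') twice, ('f','c','a','b') once.
```

The conditional FP-tree for 'm' is then built from `base_m` exactly as the original tree was built from the full DB, and the process recurses (step 2).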
Advantages of FP-Tree structure
1) Scan the DB twice, and only twice 2) Complete: Tree contains all information related to mining frequent patterns 3) Compactness: Tree height is bounded by the maximum number of items in a transaction.
FP-Tree construction steps
1) Scan transaction DB, and find frequent single item patterns. Order them in a list "L" in frequency descending order. (Aka, make histogram of single items from DB) 2) For each transaction, order its frequent items (Do not include infrequent single item sets) according to the order in "L". 3) Scan the DB a second time, and construct the FP-tree by putting each "frequency ordered transaction" into it.
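The three construction steps above can be sketched in Python. The class and function names, the header-table representation, and the sample transactions are illustrative assumptions, not from the source.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Step 1: first DB scan -- count single items, keep the frequent ones,
    # and order them by descending frequency (the list "L").
    freq = Counter(item for t in transactions for item in t)
    L = [i for i, n in sorted(freq.items(), key=lambda kv: -kv[1])
         if n >= min_support]
    rank = {item: r for r, item in enumerate(L)}

    # Steps 2-3: second DB scan -- reorder each transaction by L (dropping
    # infrequent items) and insert the ordered transaction into the tree.
    root = FPNode(None, None)
    header = {item: [] for item in L}  # header table: item -> its nodes
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes just bump the count
    return root, header

transactions = [['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
                ['a', 'd', 'e'], ['a', 'b', 'c']]
root, header = build_fp_tree(transactions, min_support=2)
```

Transactions sharing a frequency-ordered prefix share a path, which is what keeps the tree compact.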
Apriori Algorithm
1) Use frequent (k-1)-itemsets to generate potential candidates for frequent k-itemsets. 2) Scan the database and find the support for each candidate k-itemset. 3) Determine the frequent k-itemsets by comparing the support of each k-itemset with the min support.
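The three steps above can be sketched in Python; the loop body shows the candidate generation and candidate test stages. The function name, sample transactions, and min_support value are illustrative assumptions, not from the source.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]
    # Frequent 1-itemsets from a first scan.
    counts = {}
    for t in transactions:
        for item in t:
            c = frozenset([item])
            counts[c] = counts.get(c, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, pruning any
        # candidate with an infrequent (k-1)-subset.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(s) in frequent
                        for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Candidate test: scan the DB to count support, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'},
                {'b', 'c'}, {'a', 'b', 'c'}]
result = apriori(transactions, min_support=3)
```

Note that the `while` loop rescans the whole transaction list once per level, which is exactly the multiple-scan bottleneck FP-growth avoids.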
Maximal Frequent Itemset
A frequent itemset, none of whose immediate supersets is frequent.
Closed Itemset
An itemset, none of whose immediate supersets has the same support as the itemset itself.
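A small worked example contrasting the two definitions above, restricted to frequent itemsets. The transaction data and min_support are illustrative assumptions, not from the source; it shows that every maximal frequent itemset is closed, but not conversely.

```python
from itertools import combinations

transactions = [frozenset(t) for t in [{'a', 'b'}, {'a', 'b'}, {'a'}]]
min_support = 2
items = {'a', 'b'}

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemsets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
frequent = [s for s in itemsets if support(s) >= min_support]

def immediate_supersets(itemset):
    return [itemset | {i} for i in items - itemset]

# Maximal: frequent, and no immediate superset is frequent.
maximal = [s for s in frequent
           if not any(support(sup) >= min_support
                      for sup in immediate_supersets(s))]

# Closed: no immediate superset has the same support.
closed = [s for s in frequent
          if not any(support(sup) == support(s)
                     for sup in immediate_supersets(s))]
```

Here {a, b} (support 2) is both maximal and closed, while {a} (support 3) is closed but not maximal, and {b} (support 2) is neither, because {a, b} has the same support.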
FP-growth stages
FP-Tree -> conditional pattern bases -> conditional FP-Tree -> frequent patterns
Confidence
Fraction of transactions containing both the antecedent and consequent itemsets out of the number of transactions containing the antecedent itemset. For rule X -> Y, confidence is |transactions with X and Y| / |transactions with X|.
Support
Fraction of all transactions containing the itemset. For rule X -> Y, support is |transactions with X and Y| / total number of transactions.
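A quick worked example of the two measures above for a rule X -> Y. The transactions are illustrative assumptions, not from the source.

```python
transactions = [{'milk', 'bread'}, {'milk', 'bread', 'eggs'},
                {'bread', 'eggs'}, {'milk', 'eggs'}, {'milk', 'bread'}]

X, Y = {'milk'}, {'bread'}
n_X = sum(1 for t in transactions if X <= t)          # transactions with X
n_XY = sum(1 for t in transactions if (X | Y) <= t)   # transactions with X and Y

support = n_XY / len(transactions)  # fraction of all transactions with X and Y
confidence = n_XY / n_X             # of those with X, the fraction also with Y
```

Here X appears in 4 of 5 transactions and X with Y in 3, so support is 3/5 and confidence is 3/4.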
Why do FP-Trees insert items of descending frequency order?
Descending order places the most frequent items near the root, so transactions share common prefixes; if items were inserted in ascending frequency order, there would be many more branches at each node and a much larger tree. (Answer in FP growth ppt, page 5)
FP-Growth divide and conquer strategy
Recursively grow frequent patterns using the FP-tree: recursively mine shorter patterns, then concatenate the suffix item to form longer ones.
Coverage
The number of rules that a transaction is part of. (Diagram of coverage can be found in my written notes, 10/1)
FP-Tree
Tree built by iterating over all transactions, adding a branch for each itemset and incrementing the support count at each node shared by transactions with the same sub-path. The tree also maintains a linked list of nodes for each item in a "Header table".