Data Mining Exam 3
Apriori Algorithm
-Initially, every item is considered as a candidate 1-itemset (let k = 1)
-Their supports are counted; anything below minsup is discarded
-Candidate (k+1)-itemsets are generated from the frequent k-itemsets
-Their supports are counted; anything below minsup is discarded
-Repeat until no additional frequent itemsets are found
Then, use the Apriori rule generation algorithm to come up with rules
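A minimal Python sketch of this loop (a hedged illustration, not the course's exact pseudocode), assuming transactions are given as a list of sets and minsup is an absolute support count:

from itertools import combinations

def apriori(transactions, minsup):
    # Minimal sketch: transactions is a list of sets of items, minsup is an
    # absolute support count. Returns {frozenset(itemset): support count}.
    counts = {}
    for t in transactions:                      # k = 1: count every single item
        for item in t:
            c = frozenset([item])
            counts[c] = counts.get(c, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup}
    result = dict(frequent)
    k = 1
    while frequent:
        # generate candidate (k+1)-itemsets by joining pairs of frequent k-itemsets
        # (a simple pairwise union; the F(k-1) x F(k-1) merge below is the stricter version)
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
        # count candidate supports and keep anything meeting minsup
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result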
Candidate Generation: Fk-1 x Fk-1 Method (Apriori-gen)
-Items in each frequent itemset must be sorted (in lexicographic order)
-Merges pairs of (k-1)-itemsets if all but their last item are the same
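A small sketch of this merge, assuming frequent (k-1)-itemsets are stored as tuples whose items are already in lexicographic order:

def fk1_x_fk1(frequent_km1_itemsets):
    # Sketch of the F(k-1) x F(k-1) merge over lexicographically sorted tuples.
    candidates = set()
    for a in frequent_km1_itemsets:
        for b in frequent_km1_itemsets:
            # merge only when everything but the last item matches
            # (the a[-1] < b[-1] test keeps each merged pair from appearing twice)
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))
    return candidates

# e.g. ('Bread', 'Diapers') and ('Bread', 'Milk') merge to ('Bread', 'Diapers', 'Milk')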
Candidate Generation: Fk-1 x F1 Method
-Items in each frequent itemset must be sorted (in lexicographic order)
-To generate k-itemsets, extend each frequent (k-1)-itemset with a frequent 1-itemset that is lexicographically larger than the items already in the (k-1)-itemset
Benefit: avoids generating duplicates
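A matching sketch for this method, under the same assumption that itemsets are lexicographically sorted tuples and frequent_items holds the frequent individual items:

def fk1_x_f1(frequent_km1_itemsets, frequent_items):
    # Sketch of the F(k-1) x F(1) extension.
    candidates = set()
    for itemset in frequent_km1_itemsets:
        for item in frequent_items:
            # only extend with items lexicographically larger than the last one,
            # so each candidate k-itemset is generated exactly once
            if item > itemset[-1]:
                candidates.add(itemset + (item,))
    return candidates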
Fk-1 x F1 Method: Pruning
-Need to check generated candidates for infrequent itemsets, and prune them
-Can use a heuristic: for every frequent k-itemset, every item in the set must be contained in at least k-1 of the frequent (k-1)-itemsets
Ex: {Beer, Diapers, Bread} can only be a frequent 3-itemset if Beer, Diapers, and Bread each show up in at least 2 of the frequent 2-itemsets. If Beer is only in 1 of the frequent 2-itemsets, Beer will not be in a frequent 3-itemset.
Fk-1 x Fk-1 Method (Apriori-gen): Pruning
-Need to check generated candidates for infrequent itemsets, and prune them
-If any (k-1)-subset of the candidate is not a frequent (k-1)-itemset, the candidate is pruned
Ex: {Bread, Diapers, Milk} is only a frequent 3-itemset if {Bread, Diapers}, {Bread, Milk}, and {Diapers, Milk} are all frequent 2-itemsets.
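A sketch of this subset check, assuming candidates and frequent (k-1)-itemsets are both stored as sorted tuples:

from itertools import combinations

def apriori_prune(candidates, frequent_km1_itemsets):
    # Keep a candidate k-itemset only if every (k-1)-subset of it is itself
    # a frequent (k-1)-itemset.
    frequent = set(frequent_km1_itemsets)
    return {c for c in candidates
            if all(sub in frequent for sub in combinations(c, len(c) - 1))}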
Association Rule Mining Steps
1. Frequent itemset generation: find all the itemsets that satisfy minsup
2. Rule generation: find all the strong rules in the frequent itemsets that satisfy minconf
For GSP, how do you generate candidate k+1-sequences from frequent k-sequences?
1. For each frequent k-sequence, form the sequence with its first item dropped ("-first") and the sequence with its last item dropped ("-last")
2. Two frequent k-sequences are merged when the "-first" of one matches the "-last" of the other
3. The merged candidate is the first sequence extended with the last item of the second; that item becomes its own element only if it was its own element in the second sequence (see the sketch below)
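A rough sketch of the merge step, assuming a sequence is stored as a list of tuples (one tuple per element, items sorted within each element); the last line shows the classic example of merging <{1}{2}{3}> with <{2}{3 4}>:

def drop_first(seq):
    # remove the first item of the first element
    head, rest = seq[0], seq[1:]
    return ([head[1:]] if len(head) > 1 else []) + rest

def drop_last(seq):
    # remove the last item of the last element
    last, rest = seq[-1], seq[:-1]
    return rest + ([last[:-1]] if len(last) > 1 else [])

def gsp_merge(s1, s2):
    # Merge two frequent k-sequences into one candidate (k+1)-sequence,
    # or return None when their "-first" and "-last" parts do not match.
    if drop_first(s1) != drop_last(s2):
        return None
    if len(s2[-1]) == 1:                        # last item of s2 formed its own element
        return s1 + [s2[-1]]                    # so it stays a separate element
    return s1[:-1] + [s1[-1] + (s2[-1][-1],)]   # otherwise join it to s1's last element

# e.g. gsp_merge([(1,), (2,), (3,)], [(2,), (3, 4)]) == [(1,), (2,), (3, 4)]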
FP-Growth algorithm
1. Make an initial scan of the data to get support counts for each individual item. Discard infrequent items; sort frequent items in decreasing order
2. Make a second scan of the data to construct the FP-tree. Each transaction is added to the tree with its frequent items sorted by decreasing support
3. Find the frequent itemsets from the FP-tree
What are some ways of storing support count for 2-itemsets in memory?
2D array: arr[i][j] = count
Hashmap ("triples method"): {i,j} --> count
1D triangular array: pair {i,j} (with i < j) stored at position (i-1)(n - i/2) + j - i
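A sketch of the triangular indexing, assuming items are numbered 1..n; the integer form below is equivalent to the (i-1)(n - i/2) + j - i formula:

def triangular_index(i, j, n):
    # 1-based position of the pair {i, j} (items numbered 1..n, with i < j) in a
    # 1D triangular array of all n*(n-1)/2 pair counts.
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# e.g. with n = 4 the pairs {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4} map to positions 1 through 6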
closed frequent itemsets
A frequent itemset is closed if it has no proper superset with the same support count
negative association rule
A rule extracted from a negative itemset that meets minsup and minconf
What is the difference between a sequence and an itemset?
A sequence contains ordered elements/events/itemsets. Elements/events/itemsets contain unordered items.
What is a k-sequence?
A sequence is a k-sequence if it contains k items, i.e. k = |A1| + |A2| + · · · + |An|.
What does it mean for a transaction to contain itemset X?
A transaction contains itemset X if X is a subset of the transaction
Ex: t2 contains {Bread, Diapers} but not {Bread, Milk}
Association Rule
An implication expression of the form X --> Y, where X and Y are itemsets
Ex: {Milk, Diaper} --> {Beer}
The rule does NOT mean causality; it only indicates co-occurrence
negatively correlated pattern
An itemset (or association rule) where s(X U Y) < s(X)s(Y)
For GSP, how do you generate candidate 2-sequences from frequent 1-sequences?
Consider both cases:
1. Bought separately (e.g. AB): have to consider the entire matrix, because with separate transactions the order of the transactions matters
2. Bought together (e.g. (AB)): consider only half of the matrix, because within a single transaction order doesn't matter
Mining Streams
Data is coming in as you mine, so the frequent itemsets are changing quickly and you need to keep up
Basically chunking:
-Accumulate a certain amount of data (e.g. a number of transactions) and mine that batch
-Let the transactions that keep coming in build up into the next batch while you work
-Compare the first batch's frequent itemsets with the second batch's: keep the ones that are still frequent (or remove them after a certain number of iterations without support) and add any new frequent itemsets
-Constantly build up chunks, mine them, and update your frequent itemsets
What are some advantages and disadvantages to using a hashmap for storing support count?
Disadvantages: hashing is expensive; stores a triple of numbers (i, j, count) per pair rather than a single count
Advantages: doesn't waste space on support counts of 0, since pairs that never occur are simply not added to the hashmap
Goal of association rule
Find all rules having:
-support ≥ minsup threshold
-confidence ≥ minconf threshold
Bonus question: why is looking at support not enough? (21.6)
Support
Fraction of transactions that contain both X and Y
s(X --> Y) = 𝜎(X ∪ Y)/N, where N is the total number of transactions
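As a tiny sketch, with X and Y as sets of items and transactions as a list of sets:

def support(X, Y, transactions):
    # s(X --> Y): fraction of the N transactions containing every item of X and Y
    union = set(X) | set(Y)
    return sum(1 for t in transactions if union <= t) / len(transactions)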
apriori principle
If an itemset is frequent, then all of its subsets must also be frequent
If an itemset is infrequent, then all of its supersets must also be infrequent
Given s = < {7} {3 8} {9} {4 5 6} {8} >, why is r2 = < {3} {4} {5} {8} > NOT a subsequence of s?
If two items appear grouped only within a single element of s, they have to stay grouped in the subsequence as well if both are included. Here 4 and 5 occur together only in the element {4 5 6}, so r2, which puts them in separate elements {4} and {5}, cannot be matched to s.
What is the difference between a subsequence and a substring?
In a subsequence, you can have gaps between the elements within the original sequence as long as the order is maintained
A substring is basically a contiguous subsequence
Ex: s = ACTGAACG
r1 = CGAAG is a subsequence of s
r2 = CTGA is a substring of s
Apriori Rule Generation Algorithm
Initially, all high-confidence rules that have only one item in the consequent are generated and tested against minconf
The high-confidence rules that are found are then used to generate the next round of candidate rules by merging consequents
Bonus question: How does this relate to the anti-monotone property of confidence?
Ex: ABD --> C and ACD --> B have their consequents merged to make AD --> BC
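A rough sketch of this loop for one frequent itemset; `support` is assumed to be a dict from frozenset to support count that already contains every frequent itemset (a hypothetical helper, not something from the notes):

from itertools import combinations

def gen_rules(itemset, support, minconf):
    # Sketch of rule generation for a single frequent itemset (a frozenset).
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # start with 1-item consequents
    while consequents:
        kept = []
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = support[itemset] / support[X]
            if conf >= minconf:
                rules.append((X, Y, conf))
                kept.append(Y)
        # merge the surviving consequents to build the next round of candidates,
        # e.g. the consequents of ABD --> C and ACD --> B merge to give AD --> BC
        consequents = {a | b for a, b in combinations(kept, 2) if len(a | b) == len(a) + 1}
    return rules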
closed frequent itemset algorithm
Keep a list of closed frequent itemsets Each time a frequent itemset is generated, perform subset and superset check
maximal frequent itemset algorithm
Keep a list of maximal frequent itemsets Each time a frequent itemset is generated, perform subset and superset check
Which timing constraints can cause the apriori principle to be violated?
Maxgap
What is the difference of performing a subset check for maximal sets vs. closed sets?
Maximal: Is the freq itemset just found a subset of anything in the maximal list? If so, it is not maximal; end. Else, add it to the maximal list and do a superset check.
Closed: Is the freq itemset just found a subset of anything in the closed list? If so, is its support higher than that superset's? If not, it is not closed; end. If yes, add it to the closed list and do a superset check.
What is the difference of performing a superset check for maximal sets vs. closed sets?
Maximal: Is the freq itemset just found a superset of anything in the maximal list? If so, remove the itemsets already in the maximal list that are subsets of this freq itemset, as they are no longer maximal.
Closed: Is the freq itemset just found a superset of anything in the closed list? If so, does the subset in the list have the same or higher support? If the subset's support is the same, remove the subset from the closed list. If the subset's support is higher, it remains in the list.
Confidence**
Measures how often items in Y appear in transactions that contain X
c(X --> Y) = 𝜎(X ∪ Y)/𝜎(X)
The higher the score, the more it validates the rule X --> Y
Advantages of FP-Growth?
Only scans the database twice
Avoids large memory requirements and repeated scans of the database
What is the formula for lift (x --> y)?
P(X, Y)/(P(X)P(Y)) = (N·f11)/((f11 + f01)(f11 + f10))
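A tiny sketch computing lift straight from the 2x2 contingency counts (f11 = both X and Y, f10 = X only, f01 = Y only, f00 = neither):

def lift(f11, f10, f01, f00):
    # N * f11 / ((count of X) * (count of Y))
    N = f11 + f10 + f01 + f00
    return (N * f11) / ((f11 + f10) * (f11 + f01))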
Four ways of dealing with scaling; explain each
1. PCY Algorithm: use the spare memory in Apriori's first pass to hash pairs into buckets and pre-prune candidate 2-itemsets
2. Sampling: mine a truly random sample that fits in memory, then cross-check to remove false positives and lower minsup to mitigate false negatives
3. SON Algorithm: mine memory-sized chunks, take the union of each chunk's frequent itemsets as candidates, and verify them against the full dataset
4. Mining streams: accumulate incoming transactions into batches, mine each batch, and keep updating the frequent itemsets
What does it mean for a rule to be more general than another rule given that they have the same consequent?
The first rule's LHS is a subset of the second rule's LHS.
What does it mean for a rule to be redundant with another rule given that they have the same consequent?
The second rule is more general and it has the same support
What does it mean for a rule to be more specific than another rule given that they have the same consequent?
The second rule's LHS is a subset of the first rule's LHS.
Sampling
Use a sample dataset that will work with the available memory size
Must be truly random
Can produce false negatives (frequent in total but not in sample)
Can produce false positives (frequent in sample but not in total)
But we can eliminate false positives (cross-check with the total data) and mitigate false negatives (reduce minsup)
prefix search tree
When order didn't matter (itemsets), the candidate space formed a lattice; now order does matter (e.g. AG is not the same as GA), so candidates form a prefix search tree
Level k of the tree holds the k-sequences
Can still prune using support counts
What does a lift less than 1 mean? (e.g. lift = .84)
a negative correlation
negative item
absent item
Lift **
also known as surprise or interest
the ratio of the observed frequency of co-occurrence to the expected frequency (under independence)
maximal frequent itemsets
a frequent itemset that has no frequent supersets
negative itemset
an itemset that meets minsup and contains at least one negative item
Frequent Itemset
an itemset whose support is equal to or greater than some minsup threshold
What property does confidence of rules generated from the SAME set have?
anti-monotone property with respect to the number of items in the consequent
Ex: Frequent itemset = {A,B,C,D}: c(ABC --> D) >= c(AB --> CD) >= c(A --> BCD)
Itemset
any collection of 0 or more items Ex: {Beer, Diapers, Eggs} is an itemset
infrequent pattern
any pattern that does not meet minsup
CDIST
distinct occurrences with no event-timestamp overlap allowed
CDIST_O
distinct occurrences with possibility of event-timestamp overlap
FP-trees
frequent pattern tree
•Each node in the tree is a single item
•Each node stores the support count for the itemset comprising the items on the path from the root to that node
•For each transaction in the dataset, the itemset is inserted into the FP-tree, incrementing the count of all nodes along its common prefix path, and creating new nodes beyond the common prefix
•Items are sorted in decreasing support order, so the most frequent items are at the top of the tree
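A minimal sketch of FP-tree construction (insertion only; the header table and node-links used for mining are left out):

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # the single item this node represents
        self.count = 0            # support count of the path from the root to this node
        self.parent = parent
        self.children = {}        # child item -> FPNode

def insert_transaction(root, sorted_items):
    # Insert one transaction (only its frequent items, already sorted by
    # decreasing support) into the FP-tree, incrementing counts along the
    # common prefix and creating new nodes past it.
    node = root
    for item in sorted_items:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
        node = node.children[item]
        node.count += 1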
K-Itemset
if an itemset contains k items, it is called a k-itemset Ex: {Beer, Diapers, Eggs} is a 3-itemset
When is a rule productive?
if improvement is greater than 0
rule improvement
imp(X --> Y) = c(X --> Y) − max{ c(W --> Y) : W ⊂ X }
support ratio
min[𝑠(𝑖1), 𝑠(𝑖2), ..., 𝑠(𝑖𝑘)]/(max[𝑠(𝑖1), 𝑠(𝑖2), ..., 𝑠(𝑖𝑘)])
CMINWIN
number of minimal windows of occurrence
COBJ
one occurrence per object
CWIN
one occurrence per sliding window
Time-series data
order and time both matter
Sequential data
order matters but no notion of time Ex: DNA
cross-support pattern**
rules that relate low-frequency items to high-frequency items, created from an itemset whose support ratio is under a user-specified threshold
anti-monotone property of support
support of an itemset never exceeds the support of its subsets
contingency table
table used to examine the relationship between categorical variables
        Y     !Y
 X     f11    f10
!X     f01    f00
What does a lift of 1 represent
that X and Y are statistically independent. P(X,Y) = P(X)P(Y)
Maxgap
the maximum allowed time span between two elements of the sequence
window size
the maximum allowed time span of any one element of the sequence
Maxspan
the maximum allowed time span of the entire sequence
Mingap
the minimum allowed time span between two elements of the sequence i.e. How much time will you let pass between two elements of a sequence before you just consider them one transaction?
Transaction Width
the number of items in a transaction
Support count [𝝈(𝑿)]
the number of transactions that contain this itemset
GSP (Generalized Sequential Pattern) Algorithm
•Given the set of frequent sequences at level k, it generates all candidate (k+1)-sequences
•Prunes based on the apriori principle (the anti-monotone property of support: all subsequences of a frequent sequence must be frequent)
Same algorithm as Apriori, but this time order matters (i.e. AB is different from BA)
SON algorithm **
•Improves upon sampling
•Divide the dataset up into chunks, however large your memory can handle
•Process each chunk as a sample (i.e. find frequent itemsets for this chunk)
•Once all chunks are processed, take the union of all frequent itemsets; these are the candidate itemsets
•Compare all candidate itemsets against the full dataset to get the true frequent itemsets
But we can eliminate false positives (cross-check with the total data) and mitigate false negatives (reduce minsup)
PCY algorithm
•In the first pass of Apriori (support counting for 1-itemsets), there is typically memory to spare
•Use this extra space for an array of buckets
•Hash pairs of items into this array and keep a running total of the support count for each bucket
•Use this array when constructing candidate 2-itemsets
•Make pairs {i,j} such that: 1) i and j are frequent items, and 2) {i,j} hashes to a frequent "bucket"
If a bucket's count doesn't meet minsup, then prune every pair that hashes into that bucket
Pre-pruning
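A rough sketch of the first pass, assuming transactions are sets of items and num_buckets is however many counters the spare memory allows:

from itertools import combinations

def pcy_first_pass(transactions, num_buckets):
    # Count individual items and, with the spare memory, hash every pair in
    # every transaction into a bucket array of summed counts.
    item_counts = {}
    buckets = [0] * num_buckets
    for t in transactions:
        items = sorted(t)
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(items, 2):
            buckets[hash(pair) % num_buckets] += 1
    return item_counts, buckets

# In the second pass, a pair {i, j} stays a candidate only if i and j are frequent
# items AND its bucket's count meets minsup; everything else is pre-pruned.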