Data Mining Exam 3

Apriori Algorithm

-Initially, every item is considered a candidate 1-itemset (let k=1)
-Their supports are counted; anything below minsup is discarded
-Candidate (k+1)-itemsets are generated from the frequent k-itemsets
-Their supports are counted; anything below minsup is discarded
-Repeat until no additional frequent itemsets are found
Then, use the Apriori rule generation algorithm to come up with rules
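
A minimal Python sketch of this loop, assuming transactions are collections of items and minsup is an absolute count (the function name and structure are illustrative, not from the course):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset (frozenset): support count} for every frequent itemset."""
    transactions = [set(t) for t in transactions]

    # k = 1: every item is a candidate 1-itemset; count and discard below minsup.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Generate candidate (k+1)-itemsets by merging pairs of frequent k-itemsets,
        # pruning any candidate that has an infrequent k-subset (apriori principle).
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)

        # Count candidate supports and discard anything below minsup.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        all_frequent.update(frequent)
        k += 1

    return all_frequent
```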

Candidate Generation: Fk-1 x Fk-1 Method (Apriori-gen)

-Items in each frequent itemset must be sorted (in lexicographical order)
-Merges pairs of frequent (k-1)-itemsets if all but their last items are the same
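
A small sketch of the merge step, assuming each frequent (k-1)-itemset is kept as a lexicographically sorted tuple (the function name is illustrative):

```python
def merge_fk1_fk1(freq_k_minus_1):
    """Merge pairs of sorted (k-1)-itemset tuples that agree on all but their last item."""
    freq = sorted(freq_k_minus_1)
    candidates = set()
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a[:-1] + (a[-1], b[-1]))
    return candidates

# Example: merging ('Beer', 'Diapers') with ('Beer', 'Milk') yields ('Beer', 'Diapers', 'Milk').
```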

Candidate Generation: Fk-1 x F1 Method

-Items in each frequent itemset must be sorted (in lexicographical order)
-To generate k-itemsets, extend each frequent (k-1)-itemset with a frequent 1-itemset that is lexicographically larger than the items already in the (k-1)-itemset
Benefit: avoids generating duplicate candidates
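
A sketch of this variant under the same sorted-tuple representation (names are illustrative):

```python
def merge_fk1_f1(freq_k_minus_1, freq_1):
    """Extend each sorted (k-1)-itemset tuple with every frequent item that is
    lexicographically larger than its last item, so no candidate is generated twice."""
    candidates = set()
    for itemset in freq_k_minus_1:
        for item in freq_1:
            if item > itemset[-1]:
                candidates.add(itemset + (item,))
    return candidates

# Example: ('Beer', 'Diapers') is extended with 'Milk' to give ('Beer', 'Diapers', 'Milk'),
# but never with 'Bread', since 'Bread' sorts before 'Diapers'.
```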

Fk-1 x F1 Method: Pruning

-Need to check generated candidates for infrequent itemsets, and prune them
-Can use a heuristic: for a k-itemset to be frequent, every item in the set must be contained in at least k-1 of the frequent (k-1)-itemsets
Ex: {Beer, Diapers, Bread} can only be a frequent 3-itemset if Beer, Diapers, and Bread each appear in at least 2 of the frequent 2-itemsets. If Beer appears in only 1 of the frequent 2-itemsets, Beer will not be in any frequent 3-itemset.

Fk-1 x Fk-1 Method (Apriori-gen): Pruning

-Need to check generated candidates for infrequent itemsets, and prune them
-If any (k-1)-subset of the candidate is not a frequent (k-1)-itemset, the candidate is pruned
Ex: {Bread, Diapers, Milk} can only be a frequent 3-itemset if {Bread, Diapers}, {Bread, Milk}, and {Diapers, Milk} are all frequent 2-itemsets.
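
A sketch of the subset-based pruning check, assuming candidates and frequent (k-1)-itemsets are sorted tuples (names are illustrative):

```python
from itertools import combinations

def prune_candidates(candidates, freq_k_minus_1):
    """Keep only candidates whose every (k-1)-subset is a frequent (k-1)-itemset."""
    freq_k_minus_1 = set(freq_k_minus_1)
    return {
        cand for cand in candidates
        if all(sub in freq_k_minus_1 for sub in combinations(cand, len(cand) - 1))
    }

# Example: ('Bread', 'Diapers', 'Milk') survives only if ('Bread', 'Diapers'),
# ('Bread', 'Milk'), and ('Diapers', 'Milk') are all frequent 2-itemsets.
```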

Association Rule Mining Steps

1. Frequent itemset generation: find all the itemsets that satisfy minsup
2. Rule generation: find all the strong rules in the frequent itemsets that satisfy minconf

For GSP, how do you generate candidate k+1-sequences from frequent k-sequences?

1. For each pair of frequent k-sequences s1 and s2, drop the first item of s1 and the last item of s2, then check whether the two trimmed sequences match
2. If they match, merge them: extend s1 with the last item of s2 (appended to the last element if that item was part of s2's last element, or as a new element otherwise)
Ex: <{1} {2 3} {4}> and <{2 3} {4 5}> merge into the candidate <{1} {2 3} {4 5}>

FP-Growth algorithm

1. Make an initial scan of the data to get support counts for each individual item. Discard infrequent items and sort the frequent items in decreasing support order
2. Make a second scan of the data to construct the FP-tree. Each transaction is added to the tree with its frequent items sorted by decreasing support
3. Find the frequent itemsets from the FP-tree

What are some ways of storing support count for 2-itemsets in memory?

2D array: arr[i][j] = count
Hashmap ("triples method"): {i, j} --> count
1D triangular array: pair {i, j} stored at index (i-1)(n - i/2) + j - i
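
A sketch of the triangular-array indexing, with items numbered 1..n and i < j; the helper name is illustrative, and the integer arithmetic is algebraically the same as (i-1)(n - i/2) + j - i:

```python
def pair_index(i, j, n):
    """1-based position of pair {i, j} in a 1D triangular array holding n(n-1)/2 counts.

    Items are numbered 1..n; this equals (i-1)(n - i/2) + j - i, written with
    integer arithmetic to avoid fractions.
    """
    if i > j:
        i, j = j, i
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# With n = 4 items the 6 pairs map to positions 1..6:
# {1,2}->1, {1,3}->2, {1,4}->3, {2,3}->4, {2,4}->5, {3,4}->6
```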

closed frequent itemsets

A frequent itemset is closed if it has no superset with the same support count

negative association rule

A rule extracted from a negative itemset that meets minsup and minconf

What is the difference between a sequence and an itemset?

A sequence contains ordered elements/events/itemsets. Elements/events/itemsets contain unordered items.

What is a k-sequence?

A sequence is a k-sequence if it contains k items, i.e. k = |A1| + |A2| + · · · + |An|.

What does it mean for a transaction to contain itemset X?

A transaction contains itemset X if X is a subset of the transaction. Ex: a transaction {Bread, Diapers, Beer, Eggs} contains {Bread, Diapers} but not {Bread, Milk}

Association Rule

An implication expression of the form X --> Y, where X and Y are itemsets. Ex: {Milk, Diapers} --> {Beer}. Does NOT mean causality... more like co-occurrence

negatively correlated pattern

An itemset (or association rule) where s(X U Y) < s(X)s(Y)

For GSP, how do you generate candidate 2-sequences from frequent 1-sequences?

Consider both cases:
1. Bought separately (i.e. AB): have to consider the entire matrix because, with separate transactions, the order of the transactions matters
2. Bought together (i.e. (AB)): consider only half of the matrix because, within a single transaction, order doesn't matter

Mining Streams

Data is coming in as you mine, so the frequent itemsets are changing quickly and you need to keep up. Basically chunking:
-Accumulate a certain amount of data (i.e. a number of transactions) and mine it
-Let the transactions that keep arriving build up into the next batch; ignore them until that batch is mined
-Compare the first batch's frequent itemsets with the second batch's: see if the itemsets are still frequent (or remove them after a certain number of iterations) and add any new itemsets
-Constantly build up chunks, mine them, and update your frequent itemsets

What are some advantages and disadvantages to using a hashmap for storing support count?

Disadvantages: hashing is expensive; stores a triple of numbers (i, j, count) per pair rather than 1 number. Advantages: doesn't waste space on support counts of 0, since those pairs never get added to the hashmap

Goal of association rule mining

Find all rules having support ≥ the minsup threshold and confidence ≥ the minconf threshold
Bonus question: why is looking at support not enough? (21.6)

Support

Fraction of transactions that contain both X and Y: s(X --> Y) = 𝜎(𝑋 ∪ 𝑌)/𝑁, where 𝑁 is the total number of transactions
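
A tiny sketch of this calculation; the transaction list is just an illustrative example, and s(X --> Y) is the support of the combined itemset X ∪ Y:

```python
def support(itemset, transactions):
    """s(X) = sigma(X) / N: the fraction of the N transactions containing every item in X."""
    sigma = sum(1 for t in transactions if set(itemset) <= set(t))
    return sigma / len(transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
# s({Milk, Diapers} --> {Beer}) = sigma({Milk, Diapers, Beer}) / N = 2/5
print(support({"Milk", "Diapers", "Beer"}, transactions))  # 0.4
```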

apriori principle

If an itemset is frequent, then all of its subsets must also be frequent If an itemset is infrequent, then all of its supersets must also be infrequent

Given s = < {7} {3 8} {9} {4 5 6} {8} >, why is r2 = < {3} {4} {5} {8} > NOT a subsequence of s?

If two items are grouped in one element of s, they have to stay grouped in the subsequence if both are included: {4} and {5} appear as separate elements of r2, but 4 and 5 only occur together in the single element {4 5 6} of s, so there is no later element of s that can match {5}.

What is the difference between a subsequence and a substring?

In a subsequence, you can have gaps between the elements within the original sequence as long as the order is maintained. A substring is basically a contiguous subsequence.
Ex: s = ACTGAACG; r1 = CGAAG is a subsequence of s; r2 = CTGA is a substring of s

Apriori Rule Generation Algorithm

Initially, all rules that have only one item in the consequent are generated and tested against minconf. The high-confidence rules that survive are then used to generate the next round of candidate rules by merging consequents.
Ex: ABD --> C and ACD --> B have their consequents merged to make AD --> BC
Bonus question: How does this relate to the anti-monotone property of confidence?
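
A sketch of this levelwise rule generation for a single frequent itemset, assuming `support_counts` maps frozensets (the itemset and all its subsets) to counts; names and structure are illustrative:

```python
from itertools import combinations

def gen_rules(itemset, support_counts, minconf):
    """Generate rules antecedent --> consequent from one frequent itemset.

    Starts with 1-item consequents and merges the consequents of rules that pass
    minconf to build the next round of candidates.
    """
    itemset = frozenset(itemset)
    rules = []
    consequents = {frozenset([x]) for x in itemset}
    while consequents:
        passed = set()
        for cons in consequents:
            antecedent = itemset - cons
            if not antecedent:
                continue
            conf = support_counts[itemset] / support_counts[antecedent]
            if conf >= minconf:
                rules.append((antecedent, cons, conf))
                passed.add(cons)
        # Merge consequents of surviving rules, e.g. {C} and {B} -> {B, C}
        # (ABD --> C and ACD --> B give the candidate AD --> BC).
        consequents = {a | b for a, b in combinations(passed, 2)
                       if len(a | b) == len(a) + 1}
    return rules
```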

closed frequent itemset algorithm

Keep a list of closed frequent itemsets Each time a frequent itemset is generated, perform subset and superset check

maximal frequent itemset algorithm

Keep a list of maximal frequent itemsets Each time a frequent itemset is generated, perform subset and superset check

Which time series constraints can cause the apriori principle to be violated?

Maxgap

What is the difference of performing a subset check for maximal sets vs. closed sets?

Maximal: Is the freq itemset just found a subset of anything in the maximal list? If so, it is not maximal; end. Else, add it to the maximal list and do a superset check.
Closed: Is the freq itemset just found a subset of anything in the closed list? If so, is its support higher than that superset's? If no, it is not closed; end. If yes, add it to the closed list and do a superset check.

What is the difference of performing a superset check for maximal sets vs. closed sets?

Maximal: Is the freq itemset just found a superset of anything in the maximal list? If so, remove the itemsets already in the maximal list that are subsets of this freq itemset, as they are no longer maximal.
Closed: Is the freq itemset just found a superset of anything in the closed list? If so, does the subset in the list have the same or higher support? If the subset's support is the same, remove the subset from the closed list. If the subset's support is higher, it remains in the list.
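
For reference, a sketch that checks the definitions directly on a finished set of frequent itemsets, rather than maintaining the lists incrementally as on the cards above; `freq` maps frozensets to support counts and the function name is illustrative:

```python
def maximal_and_closed(freq):
    """Return (maximal, closed) given freq: {frozenset: support count}.

    Maximal: no frequent superset at all. Closed: no frequent superset with the
    same support count.
    """
    maximal, closed = set(), set()
    for s, sup in freq.items():
        supersets = [t for t in freq if s < t]   # frequent proper supersets of s
        if not supersets:
            maximal.add(s)
        if all(freq[t] != sup for t in supersets):
            closed.add(s)
    return maximal, closed
```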

Confidence**

Measures how often items in Y appear in transactions that contain X: c(X --> Y) = 𝜎(𝑋 ∪ 𝑌)/𝜎(𝑋). The higher the score, the more it validates that X --> Y

Advantages of FP-Growth?

Only scans the database twice. Avoids large memory requirements and repeated scans of the database

What is the formula for lift (x --> y)?

lift(X --> Y) = P(X, Y)/(P(X)P(Y)) = (N·f11)/((f11 + f10)(f11 + f01))
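
A small sketch of the contingency-table form (f11 = count with both X and Y, f10 = X without Y, f01 = Y without X, f00 = neither); the numbers in the example are made up:

```python
def lift(f11, f10, f01, f00):
    """lift = N*f11 / ((f11 + f10) * (f11 + f01)), where N = f11 + f10 + f01 + f00."""
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

# Hypothetical counts: lift = 100*15 / (20 * 20) = 3.75, i.e. X and Y co-occur
# 3.75 times more often than expected under independence.
print(lift(f11=15, f10=5, f01=5, f00=75))
```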

Four ways of dealing with scaling; explain each

PCY Algorithm, sampling, SON Algorithm, mining streams

What does it mean for a rule to be more general than another rule given that they have the same consequent?

The first rule's LHS is a subset of the second rule's LHS.

What does it mean for a rule to be redundant with another rule given that they have the same consequent?

The second rule is more general and it has the same support

What does it mean for a rule to be more specific than another rule given that they have the same consequent?

The second rule's LHS is a subset of the first rule's LHS.

Sampling

Use a sample of the dataset that fits in the available memory; the sample must be truly random.
Can produce false negatives (frequent in total but not in the sample) and false positives (frequent in the sample but not in total).
But we can eliminate false positives (cross-check against the total data) and mitigate false negatives (reduce minsup for the sample).

prefix search tree

When order didn't matter, the search space was a lattice of itemsets; now order does matter (i.e. AG is not the same as GA), so candidates form a prefix search tree whose levels correspond to k. Can still prune using support counts.

What does a lift less than 1 mean? (e.g. lift = .84)

a negative correlation

negative item

absent item

Lift **

Also known as surprise or interest: the ratio of the observed frequency of co-occurrence to the expected frequency under independence

maximal frequent itemsets

a frequent itemset that has no frequent supersets

negative itemset

an itemset that meets minsup and contains at least one negative item

Frequent Itemset

an itemset whose support is equal to or greater than some minsup threshold

What property does confidence of rules generated from the SAME set have?

anti-monotone property with respect to items in the consequent Ex: Frequent Itemset = {A,B,C,D}: c(ABC --> D) >= c(AB --> CD) >= c(A --> BCD)

Itemset

any collection of 0 or more items Ex: {Beer, Diapers, Eggs} is an itemset

infrequent pattern

any pattern that does not meet minsup

CDIST

distinct occurrences with no event-timestamp overlap allowed

CDIST_O

distinct occurrences with possibility of event-timestamp overlap

FP-trees

Frequent pattern tree
•Each node in the tree is a single item
•Each node stores the support count for the itemset comprising the items on the path from the root to that node
•For each transaction in the dataset, the itemset is inserted into the FP-tree, incrementing the count of all nodes along its common prefix path and creating new nodes beyond the common prefix
•Items are sorted in decreasing support order, so the most frequent items are at the top of the tree
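
A compact sketch of the two-pass construction (the header table / node-links used for mining are omitted; class and function names are illustrative):

```python
class FPNode:
    """One item in the FP-tree plus the count of the prefix path ending at this node."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, minsup):
    """Build an FP-tree in two scans of the transactions (iterables of items)."""
    # Scan 1: support counts for single items; keep only the frequent ones.
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
    frequent = {i: c for i, c in item_counts.items() if c >= minsup}

    # Scan 2: insert each transaction with its frequent items in decreasing support order.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent), key=lambda i: -frequent[i])
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1  # shared prefixes just increment existing nodes
    return root
```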

K-Itemset

if an itemset contains k items, it is called a k-itemset Ex: {Beer, Diapers, Eggs} is a 3-itemset

When is a rule productive?

if improvement is greater than 0

rule improvement

imp(X --> Y) = c(X --> Y) - max{c(W --> Y) : W ⊂ X}

support ratio

min[s(i1), s(i2), ..., s(ik)] / max[s(i1), s(i2), ..., s(ik)]

CMINWIN

number of minimal windows of occurrence

COBJ

one occurrence per object

CWIN

one occurrence per sliding window

Time-series data

order and time both matter

Sequential data

order matters but no notion of time Ex: DNA

cross-support pattern**

Rules that relate low-frequency items to high-frequency items, created from an itemset whose support ratio is below a user-specified threshold.

anti-monotone property of support

support of an itemset never exceeds the support of its subsets

contingency table

Table used to examine the relationship between two categorical variables, e.g. the presence/absence of X and Y:
        Y     !Y
  X    f11    f10
 !X    f01    f00

What does a lift of 1 represent

that X and Y are statistically independent. P(X,Y) = P(X)P(Y)

Maxgap

the maximum allowed time span between two elements of the sequence

window size

the maximum allowed time span of any one element of the sequence

Maxspan

the maximum allowed time span of the entire sequence

Mingap

the minimum allowed time span between two elements of the sequence i.e. How much time will you let pass between two elements of a sequence before you just consider them one transaction?

Transaction Width

the number of items in a transaction

Support count [𝝈(𝑿)]

the number of transactions that contain this itemset

GSP (Generalized Sequential Pattern) Algorithm

•Given the set of frequent sequences at level k, it generates all candidate (k+1)-sequences
•Prunes based on the apriori principle (the anti-monotone property of support: all subsequences of a frequent sequence must be frequent)
Same algorithm as Apriori, but this time order matters (i.e. AB is different from BA)

SON algorithm **

•Improves upon sampling
•Divide the dataset up into chunks - however large your memory can handle
•Process each chunk as a sample (i.e. find frequent itemsets for this chunk, using a minsup scaled to the chunk size)
•Once all chunks are processed, take the union of all the chunk-frequent itemsets - these are the candidate sets
•Compare all candidate itemsets against the full dataset to get the true frequent itemsets
The second pass eliminates false positives; unlike plain sampling, SON produces no false negatives, because an itemset frequent in the whole dataset must be frequent in at least one chunk
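
A sketch of the chunked two-pass flow, assuming the whole transaction list is accessible (in practice each chunk would be streamed from disk) and that `mine_chunk(chunk, minsup_count)` is some in-memory miner such as Apriori returning a set of frozensets; all names are illustrative:

```python
def son(transactions, minsup_fraction, chunk_size, mine_chunk):
    """SON: mine each chunk with a proportionally scaled threshold, union the results
    as candidates, then verify the candidates against the full dataset."""
    transactions = [set(t) for t in transactions]

    # Pass 1: frequent itemsets per chunk become the global candidate set.
    candidates = set()
    for start in range(0, len(transactions), chunk_size):
        chunk = transactions[start:start + chunk_size]
        candidates |= mine_chunk(chunk, minsup_fraction * len(chunk))

    # Pass 2: count candidates over the full data to eliminate false positives.
    threshold = minsup_fraction * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= threshold}
```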

PCY algorithm

•In the first pass of Apriori (support counting for 1-itemsets), there is typically memory to spare
•Use this extra space for an array of buckets
•Hash pairs of items into this array and keep a total count for each bucket
•Use this array when constructing candidate 2-itemsets
•Make pairs {i,j} such that: 1) i and j are frequent items, and 2) {i,j} hashes to a frequent "bucket"
If a bucket's count doesn't meet minsup, then prune every pair that hashes into that bucket (pre-pruning)
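
A sketch of the first pass and the resulting candidate-pair filter; the bucket array size and the use of Python's built-in hash are illustrative choices:

```python
from itertools import combinations

def pcy_first_pass(transactions, minsup, n_buckets):
    """Pass 1: count single items, and hash every pair in every transaction into a bucket."""
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= minsup}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= minsup}
    return frequent_items, frequent_buckets

def pcy_candidate_pairs(frequent_items, frequent_buckets, n_buckets):
    """Candidate 2-itemsets: both items frequent AND the pair hashes to a frequent bucket."""
    return {pair for pair in combinations(sorted(frequent_items), 2)
            if hash(pair) % n_buckets in frequent_buckets}
```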

