Data Analysis Exam 3


Why is more memory required to find frequent pairs than 3,4,... itemsets?

-there are not as many 3,4,... itemsets that will meet MinSup -Due to the Apriori Principle, for every frequent triple there are AT LEAST 3 frequent pairs

If the two itemsets are independent then the lift of this rule =

1 -since P(X,Y) = P(X)P(Y)

Steps in Association Rule Mining

1. Frequent Itemset Generation 2. Strong Rule Generation

What is the Apriori Principle? (2)

1. If an itemset is frequent, then all of its subsets must also be frequent 2. If an itemset is infrequent, then all of its supersets must also be infrequent

Suppose you have two rules with the same consequent, R: X -> Y and R': W -> Y, where W is a subset of X. How do you know R is a redundant rule? (2)

1. If sigma(R) = sigma(R') for some generalization R', then R is redundant 2. If sigma(R) < sigma(R') for every generalization R', then R is non-redundant

Suppose you have two rules with the same consequent, R: X -> Y and R': W -> Y, where W is a subset of X. What can we say about R and R'? (3)

1. R is more specific than R' 2. R' is more general than R 3. R is redundant if there exists a more general rule that has the same support

Closed Frequent Itemset Algorithm (2)

1. Start an empty list where you will place your closed frequent itemsets 2. Each time you find a frequent itemset you must do the following checks: the subset check and the superset check

Maximal Frequent Itemset Algorithm (2)

1. Start an empty list where you will put your maximal frequent itemsets 2. Each time you find a frequent itemset you must do the following checks: subset check and the superset check

What are the biggest problems of Association Analysis? (4)

1. We spend a lot of time pulling data from the database into memory to use since the database is typically very large
2. Since the database is so large you might have to spend time merging work across distributed systems -might have data sorted by location, department, etc. so if you want a rule for all of the data you have to merge these into 1
3. The Tyranny of Counting Pairs
4. Expensive in time and memory -spend a lot of time going through each transaction to determine the support count

Given d items, there are _____ possible candidate itemsets

2^d
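
As a small illustration of the 2^d count, the sketch below enumerates every itemset over a toy set of items. The function name `all_itemsets` and the three items are made up for this example, and the count includes the empty itemset.

```python
from itertools import combinations

def all_itemsets(items):
    """Return every subset of `items`, including the empty itemset."""
    items = sorted(items)
    return [frozenset(combo)
            for k in range(len(items) + 1)
            for combo in combinations(items, k)]

candidates = all_itemsets({"bread", "milk", "beer"})
print(len(candidates))  # 8, which is 2**3
```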

What does a support of 50% mean?

50% of all transactions have these items together

Suppose X and Y are independent and that s(Y) = 10%. We would then expect c(X -> Y)

= 10% -since they are independent, P(Y|X) = P(Y), so knowing X tells us nothing about Y

What is a substring?

A consecutive subsequence. A sequence contained in another where it must be in the same order and no gaps are allowed.

What is a maximal frequent itemset?

A frequent itemset that does NOT have any frequent supersets (is not a subset of any other frequent itemsets)
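
A minimal sketch of this definition, assuming the frequent itemsets and their support counts have already been found. The dictionary `freq_counts` and the function name `maximal_itemsets` are hypothetical; the counts come from a made-up five-transaction example with a MinSup count of 3.

```python
def maximal_itemsets(freq_counts):
    """Keep frequent itemsets that have no frequent proper superset."""
    return [s for s in freq_counts
            if not any(s < other for other in freq_counts)]

# Hypothetical frequent itemsets with their support counts.
freq_counts = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"diapers"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"bread", "milk"}): 3,
    frozenset({"bread", "diapers"}): 3,
    frozenset({"milk", "diapers"}): 3,
    frozenset({"diapers", "beer"}): 3,
}

maximal = maximal_itemsets(freq_counts)
print(sorted(sorted(s) for s in maximal))
```

In this toy data every single item is contained in some frequent pair, so only the four pairs are maximal.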

What is a closed frequent itemset?

A frequent itemset that has no superset with the same support count
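
A minimal sketch of this definition, assuming the frequent itemsets and their support counts are already known. The dictionary `freq_counts` and the function name `closed_itemsets` are hypothetical. Any superset with the same support count would itself be frequent, so it is enough to compare against the frequent itemsets.

```python
def closed_itemsets(freq_counts):
    """Keep frequent itemsets with no proper superset of equal support count."""
    return [s for s, count in freq_counts.items()
            if not any(s < other and freq_counts[other] == count
                       for other in freq_counts)]

# Hypothetical frequent itemsets with their support counts.
freq_counts = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"diapers"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"bread", "milk"}): 3,
    frozenset({"bread", "diapers"}): 3,
    frozenset({"milk", "diapers"}): 3,
    frozenset({"diapers", "beer"}): 3,
}

closed = closed_itemsets(freq_counts)
print(len(closed))  # 7: everything except {beer}
```

Here {beer} is not closed because {diapers, beer} has the same support count of 3.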

What is a subsequence?

A sequence contained in another. Gaps are allowed and it must be the same order.
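
To contrast a subsequence (gaps allowed) with a substring (no gaps), here is a small sketch that treats a sequence as a plain list of items. The function names are made up for this example.

```python
def is_subsequence(small, big):
    """True if `small` appears in `big` in the same order; gaps are allowed."""
    remaining = iter(big)
    # Each `in` check consumes `remaining` up to the match, which enforces order.
    return all(item in remaining for item in small)

def is_substring(small, big):
    """True if `small` appears in `big` in the same order with no gaps."""
    width = len(small)
    return any(big[i:i + width] == small
               for i in range(len(big) - width + 1))

sequence = ["a", "b", "c", "d"]
print(is_subsequence(["a", "c"], sequence))  # True: the gap at "b" is allowed
print(is_substring(["a", "c"], sequence))    # False: "a" and "c" are not adjacent
print(is_substring(["b", "c"], sequence))    # True
```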

What is a k-sequence?

A sequence with k items in it

What is an association rule?

An implication expression of the form X -> Y, where X and Y are disjoint itemsets -no overlapping items are allowed

What is a k-itemset?

An itemset that contains k items

What is a frequent item set?

An itemset whose support is equal to or greater than some MinSup threshold

What is a cross-support pattern?

An itemset whose support ratio is below a user-specified threshold, Hc -these are the patterns that relate a low-frequency item to a high-frequency item

What is an itemset?

Any collection of 0 or more items

What is Market Basket Analysis?

Association Analysis of sales data -want to find patterns in the data to make more sales, targeted marketing, etc.

What does confidence measure?

Confidence measures the reliability of our rule -the bigger it is, the more likely it is that when X is purchased so is Y -P(Y|X)

What is time-series data?

Data where order matters and the time passed between each item of the sequence matters

What is sequential data?

Data where the order matters but there is no notion of time or how long passed between each item of the sequence

T/F: We want our rules to be more general than specific

FALSE

T/F: We want to keep redundant rules

FALSE

T/F: Association Rule Mining can imply causality (why people have these habits)

FALSE -only implies concurrence (the fact that it happens)

How can we fix the problem with a skewed support distribution?

Find all of the cross-support patterns and eliminate them

How do we find the strong rules from the frequent itemsets?

Find all rules in all of the frequent itemsets that satisfy MinConf
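
A minimal sketch of this step, assuming the support counts of all frequent itemsets are already stored in a dictionary. The names `strong_rules` and `freq_counts`, the counts, and the MinConf value of 0.8 are assumptions for illustration. Only the stored counts are used, so no further scan of the transactions is needed.

```python
from itertools import combinations

def strong_rules(freq_counts, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, count in freq_counts.items():
        # Every non-empty proper subset of the itemset can be an antecedent.
        for k in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), k):
                antecedent = frozenset(combo)
                conf = count / freq_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical frequent itemsets with their support counts.
freq_counts = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"diapers"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"bread", "milk"}): 3,
    frozenset({"bread", "diapers"}): 3,
    frozenset({"milk", "diapers"}): 3,
    frozenset({"diapers", "beer"}): 3,
}

rules = strong_rules(freq_counts, min_conf=0.8)
print(rules)  # only beer -> diapers, with confidence 1.0
```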

Which method (Fk-1 x F1 or Fk-1 x Fk-1) will ALWAYS generate fewer candidates?

Fk-1 x Fk-1 -both will still give you the same final answer

What is Association Analysis?

Given a set of transactions, we find the rules that will predict the occurrence of an item based on the occurrences of other items in that transaction

How do we denote the set of all items in the data?

I

How can a generated strong rule not be useful?

If MinSup is too low, we can get rules that are useless because one item is bought so often -EX: caviar -> milk; bananas -> milk; make-up -> milk This happens just because everyone is buying milk

How do we know when a transaction contains an itemset X?

If X is a subset of the transaction

What does a confidence of 50% mean?

If a customer buys A there is a 50% chance they will buy B

Subset Check to find the Closed Frequent Itemsets

Is the itemset a subset of anything in our list?
-YES: Is the support count higher than its superset?
  -Y: add it to the list and do the superset check
  -N: it is not closed so you can move on
-NO: add it to the list and do the superset check

Subset Check to find the Maximal Frequent Itemsets

Is the itemset a subset of anything in our list?
-YES: it is not maximal and you can move on
-NO: add this itemset to our list and do the superset check

Superset Check to find the Closed Frequent Itemsets

Is the itemset a superset of anything in our list?
-YES: How does the subset's support count compare to our itemset's support count?
  -EQUAL: remove the subset from the list
  -GREATER: leave the subset in our list
-NO: move on

Superset Check to find the Maximal Frequent Itemsets

Is the itemset a superset of anything in our list?
-YES: you can remove any subsets of this itemset from our list
-NO: leave the itemset on our list and move on

What is the Top K Algorithm?

It chooses the top k most frequent itemsets and generates strong rules out of those

What does it mean if lift < 1?

It indicates a negative correlation between the two itemsets

What does it mean if lift > 1?

It indicates a positive correlation between the two itemsets

What is the problem with a skewed support distribution?

It is very hard to set MinSup -if too high only a few items will be involved -if too low too many rules will be generated and many may not be useful

What is confidence?

It measures how often items in the consequent appear in transactions that contain the antecedent sigma(XUY) / sigma(X): the support count of the itemset containing the consequent and antecedent over the support count of the antecedent
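
The ratio sigma(XUY) / sigma(X) can be sketched as follows. The five transactions and the function names are made up for this example.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent, transactions):
    """sigma(X U Y) / sigma(X) for the rule X -> Y."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# {milk, diapers} appears in 3 transactions, and 2 of those also contain beer.
print(confidence({"milk", "diapers"}, {"beer"}, transactions))  # 2/3
```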

What is lift?

It measures the ratio of the observed frequency of concurrence to the expected frequency -AKA surprise or interest
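
As a rough sketch, lift can be computed as s(XUY) / (s(X) * s(Y)): the observed support divided by the support expected under independence. The data sets and function names below are invented for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def lift(x, y, transactions):
    """Observed support of X U Y over the support expected if X, Y were independent."""
    observed = support(set(x) | set(y), transactions)
    expected = support(x, transactions) * support(y, transactions)
    return observed / expected

# Hypothetical data where "a" and "b" are exactly independent.
independent = [{"a", "b"}, {"a"}, {"b"}, set()]
print(lift({"a"}, {"b"}, independent))  # 1.0

# Hypothetical market-basket data where milk and beer are negatively correlated.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
print(lift({"milk"}, {"beer"}, transactions))  # about 0.83, below 1
```

Swapping the two arguments gives the same value, which is why lift is a symmetric metric.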

What is a Contingency Table?

It shows you the support count of XUY, XUY^c, X^cUY, X^cUY^c -we can get our lift from this table and a lot of other metrics

What is the Fk-1 x F1 Method?

Looking at the frequent k-1 itemsets, create all possible combinations of those and any 1-itemset that is alphabetically larger -if there is nothing that is alphabetically larger, then do not use that k-1 itemset
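
A minimal sketch of this candidate-generation step, assuming each itemset is stored as an alphabetically sorted tuple. The function name `merge_fk1_f1` and the example itemsets are hypothetical.

```python
def merge_fk1_f1(frequent_k1, frequent_items):
    """Extend each frequent (k-1)-itemset with every larger frequent item."""
    candidates = []
    for itemset in sorted(frequent_k1):
        for item in sorted(frequent_items):
            # Only items alphabetically after the last item are appended,
            # so each candidate is generated once and stays sorted.
            if item > itemset[-1]:
                candidates.append(itemset + (item,))
    return candidates

frequent_pairs = [("beer", "diapers"), ("bread", "diapers"),
                  ("bread", "milk"), ("diapers", "milk")]
frequent_items = ["beer", "bread", "diapers", "milk"]

print(merge_fk1_f1(frequent_pairs, frequent_items))
# [('beer', 'diapers', 'milk'), ('bread', 'diapers', 'milk')]
```

The pairs ending in "milk" produce nothing because no frequent item is alphabetically larger.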

What is the Fk-1 x Fk-1 method?

Looking at the frequent k-1 itemsets, we can combine them if all but the last item is the same -only need to look below the itemset you are on
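
A minimal sketch of this merge step, assuming each itemset is stored as an alphabetically sorted tuple. The function name `merge_fk1_fk1` and the example itemsets are hypothetical.

```python
def merge_fk1_fk1(frequent_k1):
    """Merge pairs of (k-1)-itemsets that agree on all but their last item."""
    frequent_k1 = sorted(frequent_k1)
    candidates = []
    for i, first in enumerate(frequent_k1):
        # Only look "below" the current itemset in the sorted list.
        for second in frequent_k1[i + 1:]:
            if first[:-1] == second[:-1]:
                candidates.append(first + (second[-1],))
    return candidates

frequent_pairs = [("beer", "diapers"), ("bread", "diapers"),
                  ("bread", "milk"), ("diapers", "milk")]

print(merge_fk1_fk1(frequent_pairs))
# [('bread', 'diapers', 'milk')]
```

On these example pairs this yields one candidate, whereas extending each pair with a single larger item would yield two.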

Does order matter in an itemset?

NO

When do you eliminate a candidate itemset using the Apriori Algorithm?

Right after using the Fk-1 x F1 Method or the Fk-1 x Fk-1 Method

Why do we want to find a maximal frequent itemset or closed frequent itemsets?

Since datasets are usually very large, we get a lot of frequent itemsets, which is computationally expensive -we want to find a smaller, representative set of itemsets from which all frequent itemsets can be derived

Why do we use the Top K Algorithm?

So we do not have to set a MinSup threshold

What does support measure?

Support: measures how often this rule occurs over all transactions -what is the probability that these things occur together -the bigger the more often it occurs -P(X,Y)

How do we denote the set of all transactions in the data?

T

T/F: Confidence is an asymmetric metric

TRUE

T/F: Confidence of rules generated from the SAME ITEMSET has an anti-monotone property

TRUE

T/F: If a rule is redundant, it will never be productive

TRUE

T/F: If there is no association between 2 itemsets then they are statistically independent

TRUE

T/F: It is possible that we have a rule with very good objective measures, but subjectively it is NOT a good rule

TRUE

T/F: Lift is a symmetric metric so it does not matter which way you read the rule either itemset can be in the antecedent or consequent

TRUE

T/F: Suppose s(Y) = 0.9. Let the rule X -> Y have a .3 support and .75 confidence. This means that people who buy X are less likely to buy Y

TRUE

T/F: After finding all of the frequent itemsets, computing confidence does not require any additional scans of the data

TRUE -we already calculated the support counts of all of the frequent itemsets

What is the support?

The fraction of transactions that contain both the antecedent and consequent sigma(XUY)/n: support count of antecedent and consequent / total number of transactions
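
The formula sigma(XUY)/n can be sketched as below. The transactions and function names are invented for this example.

```python
def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(antecedent, consequent, transactions):
    """sigma(X U Y) / n for the rule X -> Y."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / len(transactions)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

print(support({"milk"}, {"diapers"}, transactions))  # 0.6, i.e. 3 of 5 transactions
```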

What is the Tyranny of Counting Pairs?

The most memory is required to find frequent pairs -we would need to store n^2/2 support counts (integers) which would be about 2n^2 bytes
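
A quick back-of-the-envelope sketch of that estimate, assuming one 4-byte integer counter per pair. The function name and the item count of 100,000 are made up for illustration; the exact number of pairs is n(n-1)/2, which is close to n^2/2 for large n.

```python
def pair_count_memory_bytes(n_items, bytes_per_count=4):
    """Memory needed to keep one support counter for every pair of items."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_count

n = 100_000
print(pair_count_memory_bytes(n))  # 19999800000 bytes, roughly 20 GB
print(2 * n ** 2)                  # 20000000000, the 2n^2 approximation
```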

What is transaction width?

The number of items in a transaction

What is the dimensionality in Association Analysis?

The number of items there are

What does it mean if the improvement of a rule is less than 0?

The rule is not productive and getting more specific is not helping. We want to get rid of this rule

What is the "anti-monotone property of support"?

The support of an itemset can never exceed the support of its subsets -aka Apriori Principle -if x is a subset of y, then s(x) >= s(y)

When generating frequent itemsets why do we use MinSup and not MinConf?

This is because the support is not affected when using different permutations of the same itemset while confidence is affected by what is in the antecedent

What is the goal of association rule mining?

To find all of the rules having support greater than some MinSup threshold and having confidence greater than some MinConf threshold

Why do we use the Apriori Principle?

To reduce the number of candidate itemsets that might be frequent and make it less computationally expensive

How do you set MinConf?

Trial & Error. You should start with a high MinConf and see how many strong rules are generated. If there are too few, MinConf is too high.

How do we set MinSup?

Trial & Error. You should start with a high MinSup and see how many frequent itemsets are generated. If there are too few, MinSup is too high.

What is the problem with generating the maximal frequent itemsets?

From the maximal frequent itemsets alone we cannot recover the support count of every frequent itemset: each subset occurs AT LEAST as often as the maximal itemset that contains it, but it could be more -instead look at closed frequent itemsets

How do we find the frequent itemsets?

We find all itemsets that satisfy MinSup using the Apriori Principle -support based pruning

What do we do once we find that a rule is NOT redundant?

We want to determine if it is productive

Why do we want to determine if a rule is productive?

We want to see if getting more specific improves the rule

What is a skewed support distribution?

When most items have a low support count and very few items have a very high support count

Does confidence have an anti-monotone property?

Yes -confidence is anti-monotone with regards to the number of items in the consequent -ONLY applies to rules generated from the same itemset

How do you eliminate a candidate itemset using the Apriori Algorithm?

You have to check if all of the subsets of this itemset are also frequent. -if not then you can eliminate it -if they are then you keep it
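
The subset check can be sketched as follows, assuming itemsets are stored as alphabetically sorted tuples and that it is enough to test the (k-1)-item subsets, since smaller subsets were checked in earlier rounds. The function name `has_infrequent_subset` and the example data are hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k1):
    """True if some (k-1)-item subset of `candidate` is not frequent."""
    k = len(candidate)
    return any(subset not in frequent_k1
               for subset in combinations(candidate, k - 1))

frequent_pairs = {("beer", "diapers"), ("bread", "diapers"),
                  ("bread", "milk"), ("diapers", "milk")}

# ("beer", "milk") is not frequent, so this candidate is eliminated.
print(has_infrequent_subset(("beer", "diapers", "milk"), frequent_pairs))   # True
# All three pairs inside this candidate are frequent, so it is kept.
print(has_infrequent_subset(("bread", "diapers", "milk"), frequent_pairs))  # False
```

A candidate that survives this check still has to have its support counted against MinSup.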

You want a ______ support for a _________ store and a ______ support for a _________ store

high support for a small store; low support for a big store

Suppose you have two rules with the same consequent, R: X -> Y and R': W -> Y, where W is a subset of X. What is the improvement of a rule?

imp(X -> Y) = c(X -> Y) - max{ c(W -> Y) over all generalizations R' } -the confidence of R minus the highest confidence among its more general rules R'
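
A small sketch of the improvement calculation on invented data. It assumes the generalizations are all proper subsets W of X, including the empty antecedent (whose confidence is simply s(Y)), and that X is non-empty; whether the empty antecedent counts is an assumption of this example, not something the card states.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent, transactions):
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)

def improvement(antecedent, consequent, transactions):
    """c(X -> Y) minus the best confidence among the more general rules W -> Y."""
    items = sorted(antecedent)
    general = [confidence(w, consequent, transactions)
               for k in range(len(items))        # proper subsets only
               for w in combinations(items, k)]
    return confidence(antecedent, consequent, transactions) - max(general)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# c({milk, diapers} -> beer) = 2/3, but c({diapers} -> beer) = 3/4 is higher.
print(improvement({"milk", "diapers"}, {"beer"}, transactions))  # about -0.083
```

The result is negative, so in this toy data adding milk to the antecedent makes the rule less reliable than the more general rule diapers -> beer.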

A rule is considered productive if its improvement..

is greater than 0

The greater the MinConf the _____ strong rules found

fewer

The greater the MinSup the _____ frequent itemsets found

fewer

What is the support ratio?

min( sigma(I1), sigma(I2), ... , sigma(Ik) ) / max( sigma(I1), sigma(I2), ... , sigma(Ik) ) where sigma(Ii) is the support count of item Ii in that itemset -the smaller the ratio, the more likely that you want to get rid of the itemset
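
A minimal sketch of the support ratio on invented data, assuming every item in the itemset appears in at least one transaction so that the maximum is not zero. The function name `support_ratio` is made up for this example.

```python
def support_ratio(itemset, transactions):
    """Smallest single-item support count divided by the largest one."""
    counts = [sum(1 for t in transactions if item in t) for item in itemset]
    return min(counts) / max(counts)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# milk appears in 4 transactions and cola in 2.
print(support_ratio({"milk", "cola"}, transactions))  # 0.5
```

With a threshold Hc above 0.5, this itemset would be flagged as a cross-support pattern.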

What is the support count?

sigma(x): the number of transactions that contain the itemset

What is the difference between support and support count?

support is the support count/total transactions -support is between 0 and 1 -support is a probability

In a prefix search tree, how long can the tree get?

the length of the longest sequence
