Data Mining
Knowing basic statistics allows us to:
- more easily fill in missing values, smooth noisy values, and spot outliers during data preprocessing
- know whether data are symmetric or skewed (if plotted)
For each of the previous measures:
- its value is influenced only by the supports of A, B, and A union B, not by the total number of transactions
- the measure ranges from 0 to 1
- the higher the value, the closer the relationship between A and B
Rule Mining Process
1. find all frequent itemsets (by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup)
2. generate strong association rules from the frequent itemsets (by definition, these rules must satisfy both the minimum support and the minimum confidence)
note: the first step is much more costly in terms of computation
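A minimal sketch of step 2, assuming step 1 has already produced support counts for the frequent itemsets (the itemsets and counts below are made-up illustration data):

```python
from itertools import combinations

# Hypothetical output of step 1: support counts for frequent itemsets
support_count = {
    frozenset({"computer"}): 6,
    frozenset({"antivirus"}): 4,
    frozenset({"computer", "antivirus"}): 3,
}
min_conf = 0.6

def strong_rules(support_count, min_conf):
    """Step 2: emit rules A => B whose confidence meets min_conf."""
    rules = []
    for itemset in support_count:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = support_count[itemset] / support_count[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf))
    return rules

for a, b, conf in strong_rules(support_count, min_conf):
    print(set(a), "=>", set(b), f"confidence={conf:.2f}")
# {'antivirus'} => {'computer'} confidence=0.75
```

Note that {computer} => {antivirus} has confidence 3/6 = 0.5 < 0.6, so only one of the two candidate rules survives.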
Kulczynski or Kulc
1/2 (P(A|B) + P(B|A))
the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B
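A minimal sketch of Kulc computed from itemset supports (the support values are made up for illustration):

```python
def kulc(sup_a, sup_b, sup_ab):
    """Kulczynski measure: the average of the two confidences
    P(B|A) = sup(A u B)/sup(A) and P(A|B) = sup(A u B)/sup(B)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

# e.g. sup(A) = 0.4, sup(B) = 0.25, sup(A union B) = 0.2
print(kulc(0.4, 0.25, 0.2))  # 0.5 * (0.5 + 0.8) = 0.65
```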
Measures of Central Tendency
Gives us an idea of the "middle" or center of a distribution:
Mean - average value
Median - middle value
Mode - most common value
Midrange - average of the greatest and least values
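All four measures can be computed with the standard library; the data sample below is made up for illustration:

```python
import statistics

# Hypothetical sample of values (illustration only)
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)              # average value
median = statistics.median(data)          # middle value of the ordered list
mode = statistics.mode(data)              # most common value (ties: first seen)
midrange = (min(data) + max(data)) / 2    # average of greatest and least

print(mean, median, midrange)  # 58 54.0 70.0
```

Note the sample is bimodal (52 and 70 each appear twice); `statistics.mode` returns the first mode encountered, while `statistics.multimode` would return both.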
Six Pattern Evaluation Measures
Lift, Chi-Squared, all_confidence, max_confidence, Kulc, Cosine
Null-Invariant
a measure is null-invariant if its value is free from the influence of null-transactions
Association Rules
a way to represent patterns ex. computer => antivirus software [support = 2%, confidence = 60%]
Market Basket Analysis
analyzes customer buying habits by finding associations between different items bought
Imbalance Ratio
assesses the imbalance of two itemsets
IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A union B))
this ratio is independent of the number of null-transactions and of the total number of transactions
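A minimal sketch of the ratio, with made-up support values:

```python
def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A union B)).
    0 means perfectly balanced itemsets; values near 1 mean highly imbalanced."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# e.g. sup(A) = 0.4, sup(B) = 0.25, sup(A union B) = 0.2
print(imbalance_ratio(0.4, 0.25, 0.2))  # 0.15 / 0.45 = 0.333...
```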
Minimum Support Threshold + Minimum Confidence Threshold
association rules are considered interesting if they satisfy both of these thresholds; rules that satisfy both are considered strong
Attribute
also known as a dimension, feature, or variable
Mean
average value
if values have weights, this may be referred to as the weighted arithmetic mean or weighted average
one major problem is that the mean is sensitive to outliers, but this can be mitigated with a trimmed mean
Confidence
certainty of discovered rules
what percentage of customers who bought A also ended up buying B?
confidence(A=>B) = P(B | A)
confidence(A=>B) = (transactions containing both A and B) / (transactions containing A)
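A minimal sketch computing support and confidence directly from a toy transaction database (the transactions and items are made up for illustration):

```python
# Hypothetical transaction database (illustration only)
transactions = [
    {"computer", "antivirus"},
    {"computer"},
    {"computer", "antivirus", "printer"},
    {"printer"},
    {"computer", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """confidence(A => B) = support(A union B) / support(A) = P(B|A)."""
    return support(a | b, transactions) / support(a, transactions)

A, B = {"computer"}, {"antivirus"}
print(support(A | B, transactions))    # 2/5 = 0.4
print(confidence(A, B, transactions))  # 0.4 / 0.8 = 0.5
```

The same `support` function also gives the support of the rule A => B directly, which makes the relationship between the two measures explicit.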
<more definitions>
closed - an itemset X is closed in a data set if no proper superset of X has the same support count as X
closed frequent itemset - an itemset that is both closed and frequent
maximal frequent itemset (max-itemset) - a frequent itemset with no frequent proper superset
Chi-Squared Test
computes correlation
chi-squared = sum over all cells of (observed - expected)^2 / expected
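A minimal sketch for a 2x2 contingency table (A / not A against B / not B), with expected counts derived from the row and column totals under independence; the counts are made up for illustration:

```python
def chi_squared(table):
    """chi-squared = sum of (observed - expected)^2 / expected over all cells.
    Expected counts come from row and column totals under independence."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(len(table)):
        for j in range(len(table[0])):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# rows: B, not B; columns: A, not A (hypothetical counts)
print(chi_squared([[400, 350], [200, 50]]))  # 55.55...
```

A value far from 0 suggests the observed counts deviate strongly from what independence would predict.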
Relationship between Confidence and Support
confidence(A=>B) = P(B|A) = (support(A union B))/(support(A)) = (support_count(A union B))/(support_count(A))
Null Transactions
a transaction that does not contain any of the itemsets being examined
could be why lift and chi-squared perform poorly at distinguishing pattern association relationships in transactional data sets
max_confidence
gives the maximum confidence of the two association rules A=>B and B=>A
max{P(A|B), P(B|A)}
Discrete Attributes
has a finite or countably infinite set of values, which may or may not be represented as integers
ex: age, zip codes, number of customers
Frequency of Itemsets
if an itemset is frequent, each of its subsets is frequent as well
Lift
measure for correlation
lift(A,B) = P(A & B) / (P(A)P(B))
if the result is less than 1, the occurrence of A is negatively correlated with the occurrence of B
if the result is greater than 1, A and B are positively correlated
if the result equals 1, A and B are independent and there is no correlation
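A minimal sketch, with made-up probabilities:

```python
def lift(p_a, p_b, p_ab):
    """lift(A,B) = P(A & B) / (P(A) P(B)); 1 means A and B are independent."""
    return p_ab / (p_a * p_b)

# e.g. P(A) = 0.6, P(B) = 0.75, P(A & B) = 0.4
print(lift(0.6, 0.75, 0.4))  # 0.4 / 0.45 = 0.888... < 1, negatively correlated
```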
Interval-Scaled Attributes
measured on a scale of equal-size units
values have order and can be positive, 0, or negative, which provides a ranking of values and allows us to compare and quantify the difference between values
we can look at the difference between values, but not express one value as a multiple of another, because there is no true zero point
ex: temperature or calendar dates
Median
middle value in a set of ordered values useful to measure the center if the data is skewed (asymmetric)
Occurrence Frequency of an Itemset
number of transactions that contain the itemset aka frequency, support count, or count
all_confidence
pattern evaluation measure
all_conf(A,B) = sup(A union B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
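A minimal sketch of all_confidence together with its dual, max_confidence, computed from made-up supports:

```python
def all_confidence(sup_a, sup_b, sup_ab):
    """all_conf(A,B) = sup(A u B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}."""
    return sup_ab / max(sup_a, sup_b)

def max_confidence(sup_a, sup_b, sup_ab):
    """max_conf(A,B) = max{P(A|B), P(B|A)}."""
    return max(sup_ab / sup_a, sup_ab / sup_b)

# e.g. sup(A) = 0.4, sup(B) = 0.25, sup(A union B) = 0.2
print(all_confidence(0.4, 0.25, 0.2))  # 0.2 / 0.4 = 0.5
print(max_confidence(0.4, 0.25, 0.2))  # max(0.5, 0.8) = 0.8
```

Both take the same two confidences; all_confidence keeps the smaller one and max_confidence the larger, which is why both are null-invariant.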
Frequent Patterns
patterns (item sets, subsequences, substructures) that appear frequently in a data set very useful for data classification, clustering, and other data mining tasks
Skewed
positively skewed - the mode occurs at a value smaller than the median
negatively skewed - the mode occurs at a value greater than the median
Ordinal Attribute
qualitative categorical variable where the order matters Examples include size of shirt (S,M,L) or grades (A,B,C) useful for registering subjective assessments of qualities that cannot be measured objectively (like on surveys of satisfaction) can find mode and median, but not mean
Nominal Attributes
qualitative
a nominal attribute refers to values that are symbols or names of things
they are considered categorical and do not have any meaningful order
because they are not numeric, it doesn't make sense to take the mean or median, but it is possible to find the mode (most often occurring value)
a.k.a. enumerations
Binary Attributes
qualitative is a nominal attribute where the only two categories are 0 or 1, where 0 means the attribute is absent and 1 means the attribute is present also referred to as Boolean if two states refer to true and false
Numeric Attributes
quantitative - measurable quantity, represented in integer or real values can be interval-scaled or ratio-scaled
Ratio-Scaled Attributes
quantitative
numeric attribute with an inherent zero-point
we can talk about a value in terms of a ratio/multiple of another value
can compute the difference, as well as the mean, median, and mode
Itemset
refers to a set of items
Symmetric vs. Asymmetric
refers to binary variables
symmetric - both states are equally valuable and carry the same weight; there is no preference on which outcome should be coded as 0 or 1
asymmetric - the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV
note: by convention, we code the most important outcome, which is usually the rarest one, as 1
Continuous/Numeric Attributes
represented with floating-point variables
Data Object
also known as samples, examples, instances, data points, or objects
Frequent Pattern Mining
searches for recurring relationships in a given data set
Apriori Algorithm
seminal, basic algorithm for finding frequent itemsets
uses an iterative approach known as level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets
ex. find itemsets of size 1, then size 2, then size 3... each of these steps requires one full scan of the database
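A compact sketch of the level-wise search, including the Apriori-property pruning step; the transactions below are made-up illustration data:

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Level-wise search: use frequent k-itemsets to build (k+1)-candidates,
    prune candidates that have an infrequent (k-1)-subset (Apriori property),
    then scan the database once per level to count supports."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) >= min_sup_count}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = sum(itemset <= t for t in transactions)
        k += 1
        # join step: combine frequent k-1 itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: drop candidates with any infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_sup_count}
    return frequent

# Hypothetical database of 5 transactions
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
                {"a", "b", "c"}]
for itemset, count in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```

With min_sup_count = 3, all singletons and pairs survive, but {a, b, c} (count 2) does not, so the search stops after the third level.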
Cosine Measure
cosine(A,B) = sqrt(P(A|B) * P(B|A))
a "harmonized lift": its value is influenced only by the supports of A, B, and A union B, not by the total number of transactions
Efficiency of Apriori
subset testing can be made faster by using a hash tree of all frequent itemsets
variations that improve efficiency:
- hash-based techniques: hashing itemsets into corresponding buckets
- transaction reduction: reducing the number of transactions scanned in future iterations
- partitioning: partitioning the data to find candidate itemsets
- sampling: mining on a subset of the given data (but there is a tradeoff in accuracy)
- dynamic itemset counting: adding candidate itemsets at different points during a scan
Cons for Apriori
the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, but...
- it may still need to generate a huge number of candidate sets
- it may need to repeatedly scan the whole database and check a large set of candidates by pattern matching
Midrange
the average of the greatest and least values in the data set
note: in a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all the same
Apriori Property
used to reduce the search space of Apriori, making it more efficient
all nonempty subsets of a frequent itemset must also be frequent
follows from the concept of antimonotonicity: if a set cannot pass a test, all of its supersets will fail the same test as well
Support
usefulness of discovered rules what percentage of the time are A and B purchased together? support(A=>B) = P(A union B) support(A=>B) = (transactions containing both A and B)/(total number of transactions)
Mode
the value that occurs most frequently
can be determined for both qualitative and quantitative attributes
data sets can have more than one mode:
1 - unimodal
2 - bimodal
3 - trimodal
more - multimodal
Frequent Pattern Growth (FP-growth)
mines the complete set of frequent itemsets without Apriori's costly candidate generation process
uses divide-and-conquer
an order of magnitude faster than the Apriori algorithm
can mine both long and short frequent patterns
Frequent Itemset
an itemset whose occurrence frequency (support count) is at least a predetermined minimum support count, min_sup
Transactional Data
data in which each record (transaction) is a set of items, e.g. the set of items a customer purchases together in one store visit
Frequent Sequential Pattern
a frequently occurring pattern in which one purchase tends to be followed by another over time
ex. buying first a PC, then a digital camera, and then a memory card