Data Mining

Knowing basic statistics allows us to:

- more easily fill in missing values, smooth noisy values, and spot outliers during data processing
- know whether the data are symmetric or skewed (if plotted)

For each of the four null-invariant measures (all_confidence, max_confidence, Kulc, cosine)

- its value is influenced only by the supports of A, B, and A union B, and not by the total number of transactions
- each measure ranges from 0 to 1
- the higher the value, the closer the relationship between A and B

Rule Mining Process

1. find all frequent itemsets (by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup)
2. generate strong association rules from the frequent itemsets (by definition, these rules must satisfy the minimum support and minimum confidence)
note: the first step is much more costly in terms of computing
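Step 2 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `strong_rules`, the `support_count` dictionary layout, and the example counts are all assumptions made for the sketch. It assumes step 1 has already produced every frequent itemset together with its support count.

```python
from itertools import combinations

def strong_rules(support_count, min_conf):
    """Generate rules A => B from frequent itemsets (hypothetical helper).

    support_count maps frozenset itemsets to support counts and is assumed
    to contain every frequent itemset, including all of their subsets.
    """
    rules = []
    for itemset, count in support_count.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # confidence(A => B) = support_count(A union B) / support_count(A)
                conf = count / support_count[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules

# Made-up support counts for illustration only.
counts = {
    frozenset({"computer"}): 10,
    frozenset({"antivirus"}): 8,
    frozenset({"computer", "antivirus"}): 6,
}
rules = strong_rules(counts, min_conf=0.7)
```

With these made-up counts only antivirus => computer passes the 70% confidence threshold, because 6/8 clears it while 6/10 does not.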

Kulczynski or Kulc

Kulc(A,B) = 1/2 (P(A|B) + P(B|A))
the average of two confidence measures: the probability of itemset B given itemset A, and the probability of itemset A given itemset B
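A small sketch of the Kulc formula computed from support counts. The helper name `kulc` and the counts are assumptions chosen for illustration.

```python
def kulc(sup_a, sup_b, sup_ab):
    """Kulczynski measure from support counts (hypothetical helper)."""
    p_b_given_a = sup_ab / sup_a  # confidence of A => B
    p_a_given_b = sup_ab / sup_b  # confidence of B => A
    return 0.5 * (p_a_given_b + p_b_given_a)

# Made-up support counts.
balanced = kulc(sup_a=1000, sup_b=1000, sup_ab=900)  # both confidences are 0.9
lopsided = kulc(sup_a=1000, sup_b=100, sup_ab=100)   # confidences are 0.1 and 1.0
```

In the lopsided case the average lands near the middle of the range even though one of the two rules holds every time, which is why Kulc is usually read together with the imbalance ratio.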

Measures of Central Tendency

gives us an idea of the "middle" or center of a distribution:
- mean - average value
- median - middle value
- mode - most common value
- midrange - average of the greatest and least values

Six Pattern Evaluation Measures

lift, chi-squared, all_confidence, max_confidence, Kulc, cosine

Null-Invariant

a measure is null-invariant if its value is free from the influence of null-transactions
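To make the idea concrete, the sketch below holds the counts for A, B, and A union B fixed and changes only the total number of transactions, which is the same as adding null-transactions. The function names and all counts are assumptions for illustration.

```python
from math import sqrt

def lift(n_a, n_b, n_ab, n_total):
    """lift = P(A union B) / (P(A) P(B)); depends on the total transaction count."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

def cosine(n_a, n_b, n_ab):
    """cosine = sqrt(P(A|B) * P(B|A)); the total transaction count cancels out."""
    return n_ab / sqrt(n_a * n_b)

# Made-up counts: 1000 transactions contain A, 1000 contain B, 500 contain both.
counts = dict(n_a=1000, n_b=1000, n_ab=500)

lift_few_nulls = lift(**counts, n_total=2_000)     # 500 null-transactions
lift_many_nulls = lift(**counts, n_total=200_000)  # 198,500 null-transactions
cosine_value = cosine(**counts)                    # same in both cases
```

Here lift moves from about 1 (independent) to about 100 (strongly positive) purely because null-transactions were added, while the cosine value stays at 0.5.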

Association Rules

a way to represent patterns
ex. computer => antivirus software [support = 2%, confidence = 60%]

Market Basket Analysis

analyzes customer buying habits by finding associations between different items bought

Imbalance Ratio

assesses the imbalance of two itemsets
IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A union B))
this ratio is independent of the number of null-transactions and of the total number of transactions
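A direct translation of the IR formula, as a sketch. The helper name `imbalance_ratio` and the support counts are made up for illustration.

```python
def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A union B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Made-up support counts.
ir_balanced = imbalance_ratio(sup_a=1000, sup_b=1000, sup_ab=900)  # equal supports
ir_skewed = imbalance_ratio(sup_a=1000, sup_b=100, sup_ab=100)     # A is 10x more common
```

An IR of 0 means the two itemsets are perfectly balanced; the closer the value is to 1, the more lopsided the pair.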

Minimum Support Threshold + Minimum Confidence Threshold

association rules are considered interesting if they satisfy both of these thresholds
rules that satisfy both are called strong

Attribute

a data field representing a characteristic of a data object
the terms attribute, dimension, feature, and variable are used interchangeably

Mean

average value
if the values have weights, this may be referred to as the weighted arithmetic mean or weighted average
one major problem is that the mean is sensitive to outliers; a trimmed mean, which drops the extreme values at both ends before averaging, reduces this effect
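A minimal sketch of a trimmed mean, assuming the same fraction of values is dropped from each end. The helper name `trimmed_mean`, the `trim_fraction` parameter, and the data are illustrative assumptions.

```python
from statistics import mean

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end (sketch)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return mean(trimmed)

# Made-up data with one extreme outlier (200).
values = [10, 12, 11, 13, 12, 200, 11, 12, 10, 9]

plain = mean(values)                  # pulled upward by the outlier
trimmed = trimmed_mean(values, 0.10)  # drops the lowest and highest value
```

The plain mean is 30, far from every ordinary value in the list, while the 10% trimmed mean is 11.375.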

Confidence

certainty of discovered rules
what percentage of customers who bought A also ended up buying B?
confidence(A=>B) = P(B|A)
confidence(A=>B) = (transactions containing both A and B)/(transactions containing A)

<more definitions>

closed, closed frequent itemset, maximal frequent itemset, max-itemset

Chi-Squared Test

tests whether two attributes are correlated (that is, not independent)
chi-squared = sum over all cells of (observed - expected)^2 / expected
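The statistic can be sketched for a contingency table as below, assuming the usual Pearson form in which the per-cell terms are summed and each expected count is (row total * column total) / grand total. The helper name `chi_squared` and both tables are made up for illustration.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Made-up tables: rows = buys A / does not buy A, columns = buys B / does not buy B.
independent = chi_squared([[20, 30], [40, 60]])  # observed matches expected exactly
associated = chi_squared([[30, 20], [20, 30]])   # every cell is off by 5 from 25
```

The first table gives 0 because every cell equals its expected count; the second gives 4 * (5^2 / 25) = 4.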

Relationship between Confidence and Support

confidence(A=>B) = P(B|A) = (support(A union B))/(support(A)) = (support_count(A union B))/(support_count(A))
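A short sketch of how support and confidence can be computed from raw transactions using this relationship. The helper names and the five transactions are assumptions made for the example.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain itemset."""
    return support_count(transactions, itemset) / len(transactions)

def confidence(transactions, a, b):
    """confidence(A => B) = support_count(A union B) / support_count(A)."""
    return support_count(transactions, a | b) / support_count(transactions, a)

# Made-up transactions, each a set of purchased items.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "antivirus"},
]

sup = support(transactions, {"computer", "antivirus"})        # 3 of 5 transactions
conf = confidence(transactions, {"computer"}, {"antivirus"})  # 3 of the 4 with a computer
```

Note that "A union B" here means the itemset containing the items of both A and B, so its support counts transactions that contain all of them.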

Null Transactions

a null-transaction is a transaction that does not contain any of the itemsets being examined
could be why lift and chi-squared perform poorly at distinguishing pattern association relationships in transactional data sets

max_confidence

the larger of the confidences of the two association rules A=>B and B=>A
max_conf(A,B) = max{P(A|B), P(B|A)}

Discrete Attributes

has a finite or countably infinite set of values, which may or may not be represented as integers
ex: age, zip codes, number of customers

Frequency of Itemsets

if an itemset is frequent, each of its subsets is frequent as well

Lift

measure of correlation
lift(A,B) = P(A union B) / (P(A)P(B))
- if the result is less than 1, the occurrence of A is negatively correlated with the occurrence of B
- if the result is greater than 1, A and B are positively correlated
- if the result equals 1, A and B are independent and there is no correlation

Interval-Scaled Attributes

measured on a scale of equal-size units
values have order and can be positive, 0, or negative, which provides a ranking of values and allows us to compare and quantify the difference between values
we can talk about the difference between two values, but not about one value as a multiple of another, because the scale doesn't have a true zero point
ex: temperature (in Celsius or Fahrenheit) or calendar dates

Median

middle value in a set of ordered values
useful for measuring the center when the data are skewed (asymmetric)

Occurrence Frequency of an Itemset

number of transactions that contain the itemset
a.k.a. frequency, support count, or count

all_confidence

pattern evaluation measure
all_conf(A,B) = sup(A union B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
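The two expressions for all_confidence describe the same quantity: dividing by the larger support gives the smaller of the two conditional probabilities. A sketch with made-up counts, with max_confidence included for contrast (the helper names are assumptions):

```python
def all_confidence(sup_a, sup_b, sup_ab):
    """sup(A union B) / max{sup(A), sup(B)}."""
    return sup_ab / max(sup_a, sup_b)

def max_confidence(sup_a, sup_b, sup_ab):
    """max{P(A|B), P(B|A)}."""
    return max(sup_ab / sup_b, sup_ab / sup_a)

# Made-up support counts.
sup_a, sup_b, sup_ab = 1000, 100, 100
p_b_given_a = sup_ab / sup_a  # 0.1
p_a_given_b = sup_ab / sup_b  # 1.0

all_conf = all_confidence(sup_a, sup_b, sup_ab)
max_conf = max_confidence(sup_a, sup_b, sup_ab)
same_as_min = all_conf == min(p_a_given_b, p_b_given_a)
```

For these counts all_confidence picks out the weaker rule (0.1) and max_confidence the stronger one (1.0).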

Frequent Patterns

patterns (itemsets, subsequences, substructures) that appear frequently in a data set
very useful for data classification, clustering, and other data mining tasks

Skewed

positively skewed - the mode occurs at a value that is smaller than the median
negatively skewed - the mode occurs at a value that is greater than the median

Ordinal Attribute

qualitative
categorical variable where the order matters
ex: size of shirt (S, M, L) or grades (A, B, C)
useful for registering subjective assessments of qualities that cannot be measured objectively (like on surveys of satisfaction)
can find the mode and median, but not the mean

Nominal Attributes

qualitative
values are symbols or names of things
they are considered categorical and do not have any meaningful order
because they are not numeric, it doesn't make sense to take the mean or median, but it is possible to find the mode (the most often occurring value)
a.k.a. enumerations

Binary Attributes

qualitative
a nominal attribute with only two categories, 0 and 1, where 0 means the attribute is absent and 1 means the attribute is present
also referred to as Boolean if the two states correspond to true and false

Numeric Attributes

quantitative - a measurable quantity, represented in integer or real values
can be interval-scaled or ratio-scaled

Ratio-Scaled Attributes

quantitative
numeric attribute with an inherent zero-point
we can talk about a value as a ratio/multiple of another value
can compute difference, mean, median, and mode

Itemset

refers to a set of items

Symmetric vs. Asymmetric

refers to binary variables
symmetric - both states are equally valuable and carry the same weight: there is no preference on which outcome should be coded as 0 or 1
asymmetric - the outcomes of the states are not equally important, such as the positive and negative outcomes of a medical test for HIV
note: by convention, the most important outcome, which is usually the rarest one, is coded as 1

Continuous/Numeric Attributes

represented with floating-point variables

Data Object

represents an entity in the data set
also called a sample, example, instance, data point, or object

Frequent Pattern Mining

searches for recurring relationships in a given data set

Apriori Algorithm

seminal, basic algorithm for finding frequent itemsets
ex. find itemsets of size 1, then size 2, then size 3, ...; each of these steps requires one full scan of the database
uses an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets
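The level-wise search can be sketched in a few lines. This is a simplified illustration under stated assumptions, not an optimized or canonical implementation: the function name `apriori`, the simple pairwise join, and the example transactions are all made up, and `min_sup` is taken to be an absolute support count.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for every itemset with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # One full scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Made-up transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
result = apriori(transactions, min_sup=3)
```

With these transactions every single item and every pair is frequent, but {A, B, C} appears in only two transactions and is dropped at the third level.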

Cosine Measure

cosine(A,B) = sqrt(P(A|B) * P(B|A))
can be viewed as a harmonized lift measure: because of the square root, its value is influenced only by the supports of A, B, and A union B, and not by the total number of transactions

Efficiency of Apriori

subset testing can be made faster by using a hash tree of all frequent itemsets
- hash-based techniques: hashing itemsets into corresponding buckets
- transaction reduction: reducing the number of transactions scanned in future iterations
- partitioning: partitioning the data to find candidate itemsets
- sampling: mining on a subset of the given data (at some cost in accuracy)
- dynamic itemset counting: adding candidate itemsets at different points during a scan

Cons for Apriori

the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, but...
- it may still need to generate a huge number of candidate sets
- it may need to repeatedly scan the whole database and check a large set of candidates by pattern matching

Midrange

the average of the greatest and least values in the data set
note: in a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all the same
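A quick check of both statements on a small, made-up symmetric data set, using Python's standard `statistics` module. The `midrange` helper is an assumption; the module does not provide one.

```python
from statistics import mean, median, mode

def midrange(values):
    """Average of the greatest and least values (hypothetical helper)."""
    return (max(values) + min(values)) / 2

# Made-up unimodal, perfectly symmetric data centered on 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

center = (mean(symmetric), median(symmetric), mode(symmetric), midrange(symmetric))
```

All four measures come out to 3 for this data set; adding a single large value would pull the mean and midrange away while the median and mode stay put.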

Apriori Property

used to reduce the search space of Apriori and make it more efficient
all nonempty subsets of a frequent itemset must also be frequent
follows from the concept of antimonotonicity: if a set cannot pass a test, all of its supersets will fail the same test as well

Support

usefulness of discovered rules
what percentage of the time are A and B purchased together?
support(A=>B) = P(A union B)
support(A=>B) = (transactions containing both A and B)/(total number of transactions)

Mode

value that occurs the most frequently
can be determined for both qualitative and quantitative attributes
data sets can have more than one mode:
- 1 mode - unimodal
- 2 modes - bimodal
- 3 modes - trimodal
- more - multimodal
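As a sketch, `statistics.multimode` from the Python standard library (3.8+) returns every most-frequent value and works on qualitative data too. The `modality` helper and the data are assumptions for illustration.

```python
from statistics import multimode

def modality(values):
    """Name the data set by how many modes it has (hypothetical helper)."""
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(len(multimode(values)), "multimodal")

# Made-up qualitative and quantitative data.
grades = ["A", "B", "B", "C", "C", "D"]  # B and C both occur twice
sizes = [1, 2, 2, 3]                     # 2 occurs most often

grade_modes = multimode(grades)
grade_modality = modality(grades)
size_modality = modality(sizes)
```

The grades have two modes (B and C), so the data set is bimodal; the sizes have a single mode.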

Frequent Pattern Growth (FP-growth)

mines the complete set of frequent itemsets without the costly candidate generation process
- divide-and-conquer
- about an order of magnitude faster than the Apriori algorithm
- can mine both long and short frequent patterns

Frequent Itemset

a set of items that occur together a lot: its support count is at least the minimum support count, min_sup

Transactional Data

data where each record is a transaction, such as the set of items bought in one purchase

Frequent Sequential Pattern

where you buy something first, and then another thing
ex. buying first a PC, then a digital camera, and then a memory card

