Chapter 4
Observation
a set of observed values associated with a single entity, often displayed as a row in a spreadsheet or database
Dendrogram
a tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering
Market basket analysis
analysis of items frequently co-occurring in transactions (such as purchases)
Unsupervised learning
category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process
Confidence
conditional probability that the consequent of an association rule occurs given the antecedent occurs
Euclidean distance
geometric measure of dissimilarity between observations based on the Pythagorean theorem
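As a minimal sketch of this definition, the distance between two observations is the square root of the sum of squared differences across their variables. The function name `euclidean_distance` is illustrative and not taken from the chapter.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric observations."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    # Pythagorean theorem generalized to any number of variables
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0 (a 3-4-5 right triangle)
```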
Association rule
if-then statement describing the relationship between item sets
Antecedent
item set corresponding to the if portion of an if-then association rule
Consequent
item set corresponding to the then portion of an if-then association rule
Single linkage
method of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters
Complete linkage
method of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters
Group average linkage
method of calculating dissimilarity between clusters by averaging the distances between each pair of observations drawn from the two clusters
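The three linkage definitions above differ only in how they summarize the pairwise distances between two clusters. The sketch below assumes Euclidean distance and uses two small made-up clusters; the function names are illustrative.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between one observation in A and one observation in B."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Distance between the two most similar (closest) observations
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Distance between the two most dissimilar (farthest) observations
    return max(pairwise_distances(cluster_a, cluster_b))

def group_average_linkage(cluster_a, cluster_b):
    # Mean of every between-cluster pairwise distance
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Made-up one-variable observations; pairwise distances are 4, 8, 3, 7
cluster_a = [(1,), (2,)]
cluster_b = [(5,), (9,)]

print(single_linkage(cluster_a, cluster_b))         # 3.0
print(complete_linkage(cluster_a, cluster_b))       # 8.0
print(group_average_linkage(cluster_a, cluster_b))  # 5.5
```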
Matching coefficient
measure of similarity between observations based on the number of matching values of categorical variables
Jaccard's coefficient
measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries
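To contrast the two similarity measures above, the sketch below computes both on the same pair of binary observations. It assumes the usual formulas: the matching coefficient divides the number of matching variables by the total number of variables, while Jaccard's coefficient drops the 0-0 matches from both the numerator and the denominator. The function names and data are illustrative.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations have the same value."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    matches = sum(a == b for a, b in zip(u, v))
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Share of 1-1 matches among variables that are not 0-0 matches."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    denominator = len(u) - both_zero
    if denominator == 0:
        raise ValueError("coefficient is undefined when every variable is a 0-0 match")
    return both_one / denominator

u = [1, 0, 0, 1, 0]
v = [1, 0, 1, 1, 0]

print(matching_coefficient(u, v))  # 0.8 (4 of 5 variables match)
print(jaccard_coefficient(u, v))   # 2/3 (2 one-one matches, 3 variables left after removing zero-zero matches)
```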
McQuitty's method
measure that computes the dissimilarity between a cluster AB (formed by merging clusters A and B) and a cluster C by averaging the distance between A and C and the distance between B and C
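The update rule in this definition is a plain average of two already-known cluster distances. A minimal sketch, with an illustrative function name and made-up distances:

```python
def mcquitty_update(dist_a_c, dist_b_c):
    """Dissimilarity between merged cluster AB and cluster C.

    The two inputs are weighted equally, regardless of how many
    observations A and B each contain.
    """
    return (dist_a_c + dist_b_c) / 2

# Made-up distances: d(A, C) = 4 and d(B, C) = 10
print(mcquitty_update(4, 10))  # 7.0
```

Because cluster sizes are ignored, the result can differ from group average linkage, which averages over every pair of observations.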
Median linkage
method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters
Ward's method
procedure that partitions observations in a manner to obtain clusters with the least amount of information loss due to the aggregation
Hierarchical clustering
process of agglomerating observations into a series of nested groups based on a measure of similarity
k-Means clustering
process of organizing observations into one of k groups based on a measure of similarity
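One common way to carry out k-means clustering is to alternate between assigning each observation to its nearest centroid and recomputing each centroid as the mean of its members. The sketch below assumes Euclidean distance and random initial centroids drawn from the data; the function name, parameters, and sample points are illustrative, not taken from the chapter.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k observations chosen at random
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no observation changed cluster, so the process has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids

# Two visibly separated groups of made-up observations
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary and depend on the random start; only the grouping is meaningful.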
Dimension reduction
process of reducing the number of variables to consider in a data-mining approach
Lift ratio
ratio of the confidence of an association rule to the benchmark confidence
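The support count, confidence, and lift ratio of a rule can all be computed by counting transactions. The sketch below takes the benchmark confidence to be the proportion of all transactions that contain the consequent; the transaction data and function names are made up for illustration.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from support counts."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)

def lift_ratio(transactions, antecedent, consequent):
    """Confidence divided by the benchmark confidence."""
    benchmark = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / benchmark

# Made-up market basket data
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "peanut butter"},
    {"bread", "milk", "jelly"},
    {"milk"},
    {"bread", "jelly"},
]

# Rule: if {bread} then {jelly}
print(support_count(transactions, {"bread", "jelly"}))  # 3
print(confidence(transactions, {"bread"}, {"jelly"}))   # 0.75 (3 of the 4 bread transactions)
print(lift_ratio(transactions, {"bread"}, {"jelly"}))   # about 1.25 (0.75 / 0.6)
```

A lift ratio above 1 suggests the antecedent makes the consequent more likely than it is across all transactions.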
Missing at random (MAR)
the case when the tendency for a variable's value to be missing is related to the values of other variables in the data
Missing not at random (MNAR)
the case when the tendency for a variable's value to be missing is related to the unrecorded value itself
Missing completely at random (MCAR)
the case when data for a variable is missing purely due to random chance
Support count
the number of times that a collection of items occurs together in a transaction data set
Centroid linkage
uses the averaging concept of cluster centroids to define between-cluster similarity
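A minimal sketch of the idea, assuming the between-cluster measure is the Euclidean distance between the two centroids (so a smaller value means more similar clusters). The function names and data are illustrative.

```python
import math

def centroid(cluster):
    """Variable-by-variable mean of the observations in a cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))

def centroid_linkage(cluster_a, cluster_b):
    # Distance between the two cluster centroids
    return math.dist(centroid(cluster_a), centroid(cluster_b))

cluster_a = [(0, 0), (2, 0)]  # centroid (1.0, 0.0)
cluster_b = [(3, 4), (5, 4)]  # centroid (4.0, 4.0)

print(centroid(cluster_a))
print(centroid(cluster_b))
print(centroid_linkage(cluster_a, cluster_b))  # 5.0
```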