Cluster Analysis
The main input into any cluster analysis procedure is ...
a measure of distance between cases (customers, patients, etc.) being clustered
Cluster analysis groups data objects based on ...
- Groups data objects based on information found in the data that describes the objects and their relationships, with the goal of differentiating well between the objects; the identified clusters should differ substantially from each other - The similarity of records within a cluster is maximized (low within-cluster variation, WCV) and the similarity to records outside the cluster is minimized (high between-cluster variation, BCV)
Centroid distance
- Distance between two clusters is the distance between the two cluster centroids. - Centroid is the vector of variable averages for all records in a cluster
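The centroid distance above can be sketched in a few lines. This is a minimal illustration assuming NumPy is available; the two "clusters" are made-up 2-D records, not data from the source.

```python
import numpy as np

# Hypothetical 2-D records split into two small clusters
cluster_a = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
cluster_b = np.array([[8.0, 8.0], [9.0, 10.0]])

# Centroid = vector of variable averages for all records in a cluster
centroid_a = cluster_a.mean(axis=0)   # [2.0, 3.0]
centroid_b = cluster_b.mean(axis=0)   # [8.5, 9.0]

# Centroid distance = Euclidean distance between the two centroids
centroid_distance = np.linalg.norm(centroid_a - centroid_b)
print(centroid_distance)
```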
Major Issues in Cluster Analysis
- How to measure similarity - How to recode categorical variables - How to standardize or normalize numerical variables - How many clusters we expect to uncover
Summary of k-Means
- Non-hierarchical clustering is computationally cheap and more stable - The number of desired clusters, K, is chosen by the user - Practical considerations usually dominate the choice of K; there is no statistically determined optimal number of clusters - The algorithm develops clusters by iteratively assigning records to the nearest cluster mean until cluster assignments no longer change - Be wary of chance results; the data may not have definitive "real" clusters
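The procedure summarized above can be demonstrated in a short sketch. This assumes scikit-learn is available and uses synthetic, well-separated blobs purely for illustration; K=2 is a user choice, as the notes say.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of 50 records each
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# K is chosen by the user; the algorithm iteratively reassigns records
# to the nearest centroid until assignments stop changing
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)
```

With clean blobs like these the recovered centroids sit near the true blob centers; on messier real data, rerunning with different seeds is a quick check against chance results.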
Distance Measures
- The objective of a distance measure is to quantify the difference between two cases on the variables used for segmentation - A shorter distance implies similar preferences on the segmentation variables; a longer distance implies dissimilar preferences - All numeric data can be clustered using a variety of distance measures
In summary: k-Means Clustering
- There should be low levels of collinearity among the variables - Applying it to ordinal data may introduce distortions - Requires that the number of clusters be pre-specified
Goal of Cluster Analysis
Aims to find useful / meaningful groups of objects (clusters), where usefulness is defined by the goals of the data analysis
Agglomerative
Begin with n clusters (i.e., each single observation initially represents a distinct cluster) and sequentially merge similar clusters until a single cluster is left
Silhouette ratings
Can range between -1 (indicating a very poor model) and 1 (indicating an excellent model)
Value of cluster analysis:
Cluster analysis is a useful exploratory tool. It can also be used to eliminate highly related variables (by grouping variables rather than records). Hierarchical clustering gives a visual representation of different levels of clustering. On the other hand, because records are never reallocated once merged, it can be unstable, can vary greatly depending on settings, and is computationally expensive
Separation
How distant the clusters are from each other
Cohesion
How tightly related the records are within the individual clusters
k-means clustering requires that data be _______ as the procedure relies on Euclidean distances.
Interval or ratio-scaled
k-Means Clustering can be applied to very ____ datasets
Large; useful for sample sizes above 500
Complete linkage
Maximum distance between two clusters, i.e., the distance between the pair of records Ai and Bj that are farthest from each other
Single linkage
Minimum distance between two clusters, i.e., the distance between the pair of records Ai and Bj that are closest
Cluster
Observations belonging to one group are: - similar to one another within the same cluster and - dissimilar to the observations in other clusters
Euclidean Distance
One of the most common distance measures - If two cases are close in the geometric sense, they represent similar data in the database - It is customary to normalize (standardize) the variables before computing the Euclidean distance so that variables with larger scales do not dominate the analysis
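The standardize-then-measure step can be sketched directly. This assumes NumPy; the records ([age, income]) are hypothetical and chosen so that income's larger scale would otherwise swamp the distance.

```python
import numpy as np

# Hypothetical customer records: [age, income]
records = np.array([[25.0, 40_000.0],
                    [60.0, 45_000.0],
                    [30.0, 90_000.0]])

# Standardize each variable (z-scores) so no variable dominates
z = (records - records.mean(axis=0)) / records.std(axis=0)

# Euclidean distance between the first two standardized records
dist = np.linalg.norm(z[0] - z[1])
print(dist)
```

Without the z-scoring, the raw distance between the first two records would be driven almost entirely by the 5,000-unit income gap rather than the 35-year age gap.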
How to Determine Inter-Cluster Similarity
Single linkage, complete linkage, average linkage, centroid distance
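The four inter-cluster similarity rules listed above can be compared side by side. This sketch assumes SciPy is available and uses synthetic blobs; on such well-separated data, all four linkage methods recover the same two groups, but they can disagree on messier data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two tight synthetic blobs of 10 records each
data = np.vstack([
    rng.normal(0, 0.3, size=(10, 2)),
    rng.normal(4, 0.3, size=(10, 2)),
])

# Cut each dendrogram at two clusters and compare the assignments
results = {}
for method in ["single", "complete", "average", "centroid"]:
    z = linkage(data, method=method)
    results[method] = fcluster(z, t=2, criterion="maxclust")
    print(method, results[method])
```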
Clusters Should have ____ within-cluster variation compared to the between-cluster variation.
Small
Interpretation of Average Silhouette Value
The average silhouette value over all records yields a measure of how well the cluster solution fits. - 0.5 or better: good evidence of the reality of the clusters in the data - 0.25 to 0.5: some evidence of the reality of the clusters; hopefully, domain-specific knowledge can be brought to bear to support it - Less than 0.25: scant evidence of cluster reality
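Computing the average silhouette value for a fitted solution is a one-liner. This assumes scikit-learn; the data is synthetic and deliberately well separated, so the score lands in the "good evidence" band above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Two clearly separated synthetic blobs
data = np.vstack([
    rng.normal(0, 0.4, size=(40, 2)),
    rng.normal(6, 0.4, size=(40, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
# Average silhouette over all records: near 1 for tight, distant clusters
avg_sil = silhouette_score(data, labels)
print(avg_sil)
```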
Divisive (top-down technique)
Start with one cluster containing all the observations, then work in the opposite direction, subdividing it into clusters of smaller size
Clustering
The process of subdividing the records of a dataset into homogeneous (similar) natural groups of observations that share common characteristics - Also called data segmentation in some applications - Does not try to classify, estimate, or predict the value of a target variable - The groups are not predefined but are found by the clustering algorithm
k-means
Type of clustering algorithm - Center-based clustering for data with continuous attributes - Defines a prototype in terms of a centroid, usually the mean (average) of all the points in the cluster; the centroid might happen to be one of the points in the cluster, but it does not have to be - Places each instance (case) in exactly one of K non-overlapping clusters by minimizing the sum of squared distances of each record to the mean (centroid) of its assigned cluster - K is user-defined - Does not ensure the clusters will have the same size, but finds the clusters that are best separated - Generally less computationally intensive and thus preferred with very large datasets
Silhouette
A measure of goodness of fit that combines the concepts of cluster cohesion (favoring models with tightly cohesive clusters) and cluster separation (favoring models with highly separated clusters)
It is notoriously difficult to validate the quality of clustering algorithms because ...
clustering is an unsupervised problem; the validation of a clustering process is highly subjective. The only true measure of clustering quality is its ability to meet the goals of a specific application
Are there any records that are outliers?
k-means has trouble clustering data that contains outliers
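The reason for this sensitivity is easy to show: the mean (centroid) is dragged toward extreme values. A tiny made-up example with NumPy:

```python
import numpy as np

# A single extreme outlier drags a cluster mean far from its records
cluster = np.array([[1.0], [2.0], [3.0]])
with_outlier = np.vstack([cluster, [[100.0]]])

print(cluster.mean())       # 2.0 -- representative of the records
print(with_outlier.mean())  # 26.5 -- dominated by the outlier
```

Since k-means assigns records to the nearest centroid, a centroid pulled off-center like this can misplace many ordinary records, which is why outlier screening is worth doing first.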
Often a preliminary step in a data mining process is to ...
reduce the dimensionality of the data; the resulting clusters serve as inputs into a different technique downstream