Cluster Analysis
Centroid method
The distance between two clusters is defined as the distance between their centroids (cluster averages).
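The centroid method can be sketched as below. This is a minimal illustration, assuming Euclidean distance between centroids; the function names and the toy clusters are hypothetical and not part of the original notes.

```python
import math

def centroid(cluster):
    """Return the mean of each variable across the cluster's members."""
    n = len(cluster)
    return [sum(point[j] for point in cluster) / n for j in range(len(cluster[0]))]

def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical toy clusters, two variables per person
cluster_a = [(1.0, 2.0), (3.0, 4.0)]    # centroid (2, 3)
cluster_b = [(6.0, 8.0), (8.0, 10.0)]   # centroid (7, 9)
d = centroid_distance(cluster_a, cluster_b)
```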
Sum of squares (review)
- A measure of the total variability of a set of scores around a particular number, usually the mean of the set of scores.
- In Ward's method, the error sum of squares compares the individual observations for each variable against the cluster mean for that variable; the total sum of squares compares them against the grand mean for that variable.
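As a quick refresher, the sum of squares can be computed as follows. The helper name and the scores are made up for illustration; the second call only shows that any reference point other than the mean gives a larger value.

```python
def sum_of_squares(scores, center=None):
    """Sum of squared deviations of the scores from `center` (default: their mean)."""
    if center is None:
        center = sum(scores) / len(scores)
    return sum((x - center) ** 2 for x in scores)

scores = [2.0, 4.0, 6.0, 8.0]              # hypothetical scores, mean = 5
ss_mean = sum_of_squares(scores)           # deviations -3, -1, 1, 3
ss_other = sum_of_squares(scores, 6.0)     # deviations -4, -2, 0, 2
```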
What is a cluster analysis?
- A data mining tool used to build a typology based on NATURAL GROUPINGS in the data.
- A person-centered analysis.
- Allows you to discover PATTERNS in your data, e.g., to cluster participants in a survey based on similarity.
- An EXPLORATORY data analysis technique in which we group HETEROGENEOUS objects/people into HOMOGENEOUS groups.
- Goal: identify underlying groups/types of people.
Proximity Matrix
- Distance serves as the index of similarity.
- The lower the index, the greater the similarity.
Agglomerative Hierarchical Clustering
- In the agglomerative hierarchical approach, we start by defining each data point to be its own cluster and combine existing clusters at each step.
- The number of clusters decreases while the number of items in each cluster increases (from n clusters to 1 cluster).
How to measure similarities in a Cluster Analysis? (Clustering Procedures)
- Pearson Correlation
- Proximity Matrix
- Hierarchical procedures
- Non-hierarchical procedures
What are the underlying principles when creating clusters/groups?
- WITHIN-CLUSTER HOMOGENEITY must be MAXIMIZED. Differences among the score profiles of people within each cluster are as small as possible.
- BETWEEN-CLUSTER HETEROGENEITY must be MAXIMIZED. Differences among the score profiles of people in different clusters are as large as possible.
Key in determining the number of clusters
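- We assess the "average" similarity index (distance) across clusters.
- As the "average" increases, the people within a cluster become less similar to one another.
- We need to find a balance between the number of clusters and within-cluster similarity.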
Ward's method (Menon)
- A very popular hierarchical method to form clusters.
- Looks at cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association.
- Most appropriate for quantitative variables, not binary variables.
- Minimizes the within-cluster sum of squares at each step.
How does Ward's method work?
- Start: all sample units are in n clusters of size 1 each.
- Step 1: n - 1 clusters are formed, one of size 2 and the remaining of size 1. The error sum of squares and r-square values are computed; the pair of sample units that yields the smallest error sum of squares (or the largest r-square value) forms the first cluster.
- Step 2: n - 2 clusters are formed from the n - 1 clusters defined in step 1. These may include two clusters of size 2, or a single cluster of size 3 that includes the two items clustered in step 1. Again, the value of r-square is maximized.
- At each step of the algorithm, clusters or observations are combined in such a way as to MINIMIZE the error SUM OF SQUARES or, equivalently, MAXIMIZE the r-SQUARE value.
- The algorithm stops when all sample units are combined into a single large cluster of size n.
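The steps above can be sketched in plain Python. This is a brute-force illustration, not how SPSS or any statistics package implements Ward's method: at each step it tries every pair of clusters and merges the pair that gives the smallest increase in the error sum of squares. The function names and the one-variable toy data are assumptions made for the example.

```python
from itertools import combinations

def error_ss(cluster):
    """Sum of squared deviations from the cluster mean, summed over variables."""
    n = len(cluster)
    total = 0.0
    for j in range(len(cluster[0])):
        mean_j = sum(p[j] for p in cluster) / n
        total += sum((p[j] - mean_j) ** 2 for p in cluster)
    return total

def ward(points):
    """Merge clusters until one remains; record (number of clusters, total ESS)."""
    clusters = [[p] for p in points]          # start: n clusters of size 1
    history = []
    while len(clusters) > 1:
        def increase(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return error_ss(a + b) - error_ss(a) - error_ss(b)
        # pick the pair whose merge adds the least to the error sum of squares
        i, j = min(combinations(range(len(clusters)), 2), key=increase)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((len(clusters), sum(error_ss(c) for c in clusters)))
    return history

# Hypothetical data: four people measured on one variable
points = [(1.0,), (2.0,), (9.0,), (10.0,)]
history = ward(points)
```

In this toy run the error sum of squares stays small while the two natural pairs are merged and jumps sharply at the final merge, which is the kind of jump used to decide where to stop.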
Non-hierarchical procedures in forming clusters in Cluster Analysis
- Termed "K-means" clustering procedures.
- The number of clusters is specified in advance by the researcher.
- Do not entail a sequential process of "building up" the clusters.
- Goal: find an optimal solution, given the number of clusters specified by the researcher.
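A minimal K-means sketch is shown below. It assumes the researcher supplies the starting centroids (which also fixes the number of clusters) and uses Euclidean distance; real software typically chooses starting points differently. All names and data here are hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and updating centroids until stable."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[k]
            for k, g in enumerate(groups)
        ]
        if new_centroids == centroids:   # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, groups

# Hypothetical data with k = 2 clusters specified in advance
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, groups = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
```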
Steps to take after clusters are identified/specified
1) Internal validity - check the mean and SD within each cluster.
2) External validity - how well do the clusters work in detecting patterns or differences on other variables? Cluster membership can be saved as a variable in SPSS and used as independent groups in an ANOVA, MANOVA, etc.
Ways to conduct Agglomerative Hierarchical Clustering
1) Linkage methods:
   a) Single linkage (minimum distance)
   b) Complete linkage (maximum distance)
   c) Average linkage
2) Centroid method (distance between cluster averages)
3) Ward's method
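The three linkage rules differ only in how they summarize the distances between every member of one cluster and every member of the other. The sketch below assumes Euclidean distance and uses made-up one-variable clusters; the function names are illustrative.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pairwise_distances(cluster_a, cluster_b))    # closest pair

def complete_linkage(cluster_a, cluster_b):
    return max(pairwise_distances(cluster_a, cluster_b))    # farthest pair

def average_linkage(cluster_a, cluster_b):
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)                  # mean of all pairs

# Hypothetical clusters on a single variable; pairwise distances are 4, 6, 3, 5
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(4.0,), (6.0,)]
single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
```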
Ward's method
A type of agglomerative clustering that computes sum of squared distances within clusters and aggregates clusters with the MINIMUM increase in the overall sum of squares.
Agglomerative vs. Divisive procedure
Agglomerative procedure - the analysis starts with n clusters (1 participant in each cluster).
Divisive procedure - the analysis starts with 1 cluster (all participants in that cluster).
Which method should we use? Hierarchical? Non-hierarchical? or Both?
Consensus is developing that researchers should use both hierarchical methods and non-hierarchical methods, sequentially.
How are Cluster analysis and Factor analysis different?
Factor Analysis - assesses the structure in items/variables; groups similar items/variables together.
Cluster Analysis - assesses the structure in the natural groupings of people; groups similar people together.
What are the types of hierarchical and non-hierarchical clustering procedures?
Hierarchical:
1) Agglomerative hierarchical clustering
2) Divisive hierarchical clustering
Non-hierarchical:
1) K-means clustering
What are the problems with very few clusters?
Homogeneity within each cluster decreases as the number of clusters decreases. In other words, people within a cluster share less similarity.
For variables that are not on the same scale...
If the variables are not on the same scale, we must standardize the scales before creating a proximity matrix.
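One common way to put variables on the same scale is to convert each one to z-scores. The sketch below assumes that approach and uses the sample standard deviation; the variable names and values are invented for illustration.

```python
import statistics

def standardize(values):
    """Convert raw scores to z-scores: (score - mean) / sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    return [(v - mean) / sd for v in values]

# Hypothetical variables measured on very different scales
income = [30000, 45000, 60000]
age = [20, 30, 40]
z_income = standardize(income)
z_age = standardize(age)
```

After standardizing, both variables contribute comparably to a distance; on the raw scale, income differences would swamp age differences.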
How is Pearson Correlation used in measuring similarities in a Cluster Analysis?
It measures the similarity between the shapes of two profiles (2D).
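To illustrate the focus on shape, the sketch below computes Pearson's r between two hypothetical score profiles that rise in the same way but sit at very different levels. The function name and data are assumptions for the example.

```python
import math

def pearson_r(profile_a, profile_b):
    """Pearson correlation between two equal-length score profiles."""
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    ss_a = sum((a - mean_a) ** 2 for a in profile_a)
    ss_b = sum((b - mean_b) ** 2 for b in profile_b)
    return cross / math.sqrt(ss_a * ss_b)

# Same shape (steadily rising), different elevation
low_scorer = [1, 2, 3]
high_scorer = [11, 12, 13]
r = pearson_r(low_scorer, high_scorer)
```

The correlation here is 1 (up to rounding) even though the two people differ by 10 points on every variable, so correlation alone treats them as identical profiles.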
Hierarchical procedures for forming clusters in Cluster Analysis
It is a SEQUENTIAL, step-by-step procedure to combine cases into clusters based on SIMILARITY COEFFICIENTS. An agglomerative approach is often used.
Dendrogram
It is a tree-like diagram produced by a Cluster Analysis in SPSS. It helps to determine the number of clusters. The cutoff point is where the "trunk" of the dendrogram gets long. The number of "branches" before the cutoff point tells us the number of clusters.
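The "long trunk" rule can be approximated numerically, assuming we have the height (distance or error sum of squares) at which each successive merge happened. The heuristic below, cutting at the largest jump between consecutive merge heights, is one possible reading of the rule and not an SPSS feature; the function name and heights are hypothetical.

```python
def clusters_at_largest_gap(merge_heights):
    """Suggest a number of clusters by cutting before the biggest jump in merge height.

    merge_heights: heights of successive merges, in the order they occurred
    (n items produce n - 1 merges).
    """
    n_items = len(merge_heights) + 1
    gaps = [later - earlier for earlier, later in zip(merge_heights, merge_heights[1:])]
    merges_kept = gaps.index(max(gaps)) + 1   # merges performed before the jump
    return n_items - merges_kept

# Hypothetical heights for 4 people: two cheap merges, then one expensive merge
suggested = clusters_at_largest_gap([0.5, 0.5, 64.0])
```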
How is Proximity Matrix used in measuring similarities in a Cluster Analysis?
It uses squared Euclidean distance to find similarities between two profiles' shapes and elevation (3D).
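A proximity matrix of squared Euclidean distances can be built as in the sketch below. The function names and the three profiles are made up; variables are assumed to already be on the same scale.

```python
def squared_euclidean(profile_a, profile_b):
    """Sum of squared differences between two profiles, variable by variable."""
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))

def proximity_matrix(profiles):
    """Square matrix whose [i][j] entry is the distance between persons i and j."""
    return [[squared_euclidean(p, q) for q in profiles] for p in profiles]

# Hypothetical profiles: persons 0 and 1 share a shape but differ in elevation;
# person 2 has the reverse shape
profiles = [(1, 2, 3), (2, 3, 4), (3, 2, 1)]
matrix = proximity_matrix(profiles)
```

Persons 0 and 1 have a nonzero distance even though their profile shapes are identical, which shows that this index is sensitive to elevation as well as shape.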
Can we use cluster analysis for testing a hypothesis?
No, but we can use the groups generated by a cluster analysis to do hypothesis testing.
Euclidean distance
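The straight-line distance between two points, i.e., the length of the hypotenuse of a right-angled triangle whose other two sides are the differences on each variable.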
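A small sketch of the calculation, using a hypothetical pair of points that form a 3-4-5 right triangle:

```python
import math

def euclidean(point_a, point_b):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

# Differences of 3 and 4 on the two variables give a hypotenuse of 5
d = euclidean((0, 0), (3, 4))
```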
How are Cluster analysis and Factor analysis similar?
They both assess the structure of a group of entities to increase homogeneity within-group and decrease homogeneity (increase heterogeneity) between groups. They both want to maximize similarities within group and minimize similarities (or maximize differences) between groups.
r-square
The r-square value is interpreted as the proportion of variation explained by a particular clustering of the observations.
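Under the usual definition, r-square = (total sum of squares - error sum of squares) / total sum of squares. The sketch below applies that formula to one variable; the function name and scores are hypothetical.

```python
def r_square(clusters):
    """Proportion of total variation explained by the clustering (one variable)."""
    scores = [x for cluster in clusters for x in cluster]
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((x - grand_mean) ** 2 for x in scores)
    error_ss = 0.0
    for cluster in clusters:
        cluster_mean = sum(cluster) / len(cluster)
        error_ss += sum((x - cluster_mean) ** 2 for x in cluster)
    return 1 - error_ss / total_ss

# Two tight clusters explain most of the variation; one big cluster explains none
two_clusters = r_square([[1.0, 2.0], [9.0, 10.0]])
one_cluster = r_square([[1.0, 2.0, 9.0, 10.0]])
```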