CH 14 Cluster Analysis
What is the normalized score for Sales = $9,077, given Average Sales = $8,914 and Std. Dev. = $3,550?
(9077 - 8914) / 3550 = 0.046
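The arithmetic above can be sketched in Python (a hedged illustration; the variable names are mine, not from the chapter):

```python
# Z-score (normalized score) for a single value: (value - mean) / std_dev.
sales = 9077
mean_sales = 8914   # average sales
std_dev = 3550      # standard deviation

z = (sales - mean_sales) / std_dev
print(round(z, 3))  # 0.046
```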
What are the disadvantages of hierarchical clustering?
- High computational cost
- Observations that are allocated incorrectly early in the process cannot be reallocated later
- Low stability (one of the characteristics we look for in a useful cluster)
- Sensitive to outliers
How many clusters do we have, at a minimum, in the agglomerative clustering process? Suppose there are n observations.
1
What is the agglomerative process?
1. Begin with n clusters (each record is its own cluster)
2. Merge similar clusters until one cluster is left (the entire data set)
What is the center?
Average values at each coordinate for the observations in a cluster
K-means process
1. Randomly pick K centers
2. Assign each observation to the nearest center
3. Recalculate the centers
4. Repeat steps 2 and 3 until convergence
What are the agglomerative method steps?
1. Start with n clusters (each record is its own cluster)
2. Merge the 2 closest records into one cluster
3. At each subsequent step, merge the two clusters closest to each other
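The merging loop above can be sketched in Python using single linkage on toy 1-D data (all names and data are illustrative, not from the chapter):

```python
def single_linkage(c1, c2):
    # Minimum pairwise distance between two clusters (1-D, so abs() suffices).
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points):
    # Start with n clusters, one per record.
    clusters = [[p] for p in points]
    merges = []
    # Merge the two closest clusters until one cluster (the whole data set) remains.
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = sorted(clusters[i] + clusters[j])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

print(agglomerate([1, 2, 9, 10, 25]))
```

The record of merges is exactly what a dendrogram draws, from bottom (singletons) to top (one cluster).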
What is K-means clustering?
1. Use a pre-specified number of clusters (K)
2. Assign cases to each cluster
Preferred with large datasets; computationally less intensive
What is the K-Mean Clustering Algorithm?
1. Choose the number of clusters desired, k
2. Start with a partition into k clusters
   a. Often based on random selection of k centroids
3. At each step, move each record to the cluster with the closest centroid
4. Re-compute centroids; repeat step 3
5. Stop when moving records increases within-cluster dispersion
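The loop above can be sketched in Python on toy 1-D data (a hedged illustration; for simplicity this sketch stops when the centers no longer move, a common variant of the stopping rule):

```python
def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # Assign each record to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Re-compute each center as the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1, 2, 9, 10], centers=[0.0, 5.0])
print(centers)  # [1.5, 9.5]
```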
How many clusters should be defined if a dendrogram like below is observed?
2 or 4 (cut longest branches)
How many pairwise distances do we have for a cluster with 5 observations and another cluster with 6 observations?
30 (6 * 5)
What kind of algorithm do multiple dimensions require?
A distance measure to form clusters
How do you determine the number of clusters in a dendrogram?
Draw a horizontal line where the clusters are far apart, then count the vertical lines it intersects; each one is a cluster
What is a dendrogram?
A tree-like diagram that summarizes the process of clustering - Records are at the bottom - Similar records are joined by lines whose vertical lengths reflect the distance between the records
Cluster separation
Examines the ratio of between-cluster variation to within-cluster variation. The higher, the better
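The ratio can be computed on toy 1-D clusters (illustrative data; `between` and `within` are my names for the two variation terms):

```python
# Ratio of between-cluster to within-cluster variation (toy example).
clusters = [[1, 2], [9, 10]]
all_points = [p for c in clusters for p in c]
overall_mean = sum(all_points) / len(all_points)

# Between-cluster variation: squared distance of each cluster mean
# from the overall mean, weighted by cluster size.
between = sum(len(c) * (sum(c) / len(c) - overall_mean) ** 2 for c in clusters)
# Within-cluster variation: squared distance of each point from its own cluster mean.
within = sum((p - sum(c) / len(c)) ** 2 for c in clusters for p in c)

print(between / within)  # higher ratio = better separation
```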
What is the main hierarchical method covered in this chapter?
Agglomerative Methods
Average linkage is also known as?
Also called unweighted pair-group method using averages
Stability
Are clusters and cluster assignments sensitive to slight changes in inputs?
At most how many clusters?
At most n clusters
How to choose k?
Base it on how results will be used. Ex. How many market segments do we want?
What happens to 2 records that are closest to each other in a dendrogram?
Closest 2 records form a cluster
Why should we be wary of chance results?
Data may not have definitive "real" clusters
What is cluster stability?
Do clusters assignments change significantly if some of the inputs are slightly altered?
Initial partition can be based on?
- Domain knowledge
- Practical constraints
- Random (if so, then repeat the process with random partitions)
How do you measure distance between 2 records in a hierarchical method?
Euclidean distance between records i and j:
1. Take the difference between the two records in each coordinate
2. Square each difference so it's not negative
3. Take the sum of all the squared differences
4. Take the square root of the sum
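The steps above can be written directly in Python (a sketch; `euclidean` and the sample records are my own names and data):

```python
import math

def euclidean(xi, xj):
    # 1-2. Difference in each coordinate, squared so it's non-negative;
    # 3.   sum the squared differences;
    # 4.   take the square root of the sum.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```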
Separation
Examines the ratio of between-cluster variation to within-cluster variation (higher is better)
What kind of tool is cluster analysis?
Exploratory tool only useful when producing meaningful clusters
What is the goal of clustering?
Form groups (clusters) based on similar measurements made on those records
Why do we need to do data normalization in hierarchical and k-means clustering?
Hierarchical and K-Means clustering are both distance based so we need to remove the scale effect of each variable
What are two algorithms considered for cluster analysis?
Hierarchical and Non-Hierarchical
Why is stability a desirable feature in cluster analysis?
If slight changes (noise) in the inputs do not change the cluster assignments, the clusters are stable and more likely to be real
Uses of cluster analysis in huge data
- Internet search engines can cluster user queries
- Grouping securities in portfolios
What does a dendrogram show?
It shows cluster hierarchy
K-Means
K Centers OR K Clusters
What is the non-hierarchical method?
K-Means clustering
At least how many clusters?
At least 1 cluster
Where is cluster analysis used?
- Marketing and industry analysis
- Segmenting groups of similar customers
How do you measure distance in hierarchical methods?
Measure distance between records and between clusters
Number of clusters
Might be useful considering the purpose of the analysis
How do you measure the distance between 2 clusters?
- Minimum distance: single linkage
- Maximum distance: complete linkage
- Average distance: average linkage
- Centroid distance: distance between the centers of the 2 clusters
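The four measures can be compared side by side on toy 1-D clusters (illustrative data and names; in 1-D, `abs()` stands in for Euclidean distance):

```python
# Four between-cluster distance measures on two toy 1-D clusters.
c1, c2 = [1, 2], [9, 10, 14]

pair_dists = [abs(a - b) for a in c1 for b in c2]

single   = min(pair_dists)                      # single linkage: minimum distance
complete = max(pair_dists)                      # complete linkage: maximum distance
average  = sum(pair_dists) / len(pair_dists)    # average linkage: mean distance
centroid = abs(sum(c1) / len(c1) - sum(c2) / len(c2))  # distance between centers

print(single, complete, average, centroid)
```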
Which cluster is furthest?
Pair that has highest distance is furthest away
Which cluster is closest?
Pair that has shortest distance is closest
What are the advantages of hierarchical clustering?
Purely data driven since it doesn't require a specification of the number of clusters
What is the problem with not normalizing?
Raw distance measures are highly influenced by scales of measurements
What is the divisive method?
Start with one cluster, then repeatedly divide it into smaller clusters
How do we normalize (standardize) the data?
Subtract the mean from each value, then divide by the standard deviation. The result is also called a z-score
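A sketch of the procedure on a toy variable (my names and data; `pstdev` is the population standard deviation, and whether to use population or sample SD is a convention choice):

```python
import statistics

def normalize(values):
    # Subtract the mean, divide by the standard deviation (z-score).
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; statistics.stdev for sample SD
    return [(v - mean) / sd for v in values]

print(normalize([2, 4, 6]))
```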
Within-Cluster Dispersion
Table: rows show clusters; 1st column shows the number of observations; 2nd column shows the average distance within the cluster
- High average distance: the cluster is loose and not well-defined
- Low average distance: the cluster is tight and well-defined
Cluster Output
Table shows distances between clusters and measurements
What is the goal of clusters?
To obtain meaningful and useful clusters
What kind of task is clustering?
Unsupervised - it looks for patterns in the data without an outcome variable to guide (supervise) the learning
Centroid linkage is also called?
Also called the unweighted pair-group method using centroids
Why is separation a desirable feature in cluster analysis?
We want clusters to be highly separated
For hierarchical clustering, do we need to normalize the data if variables are in different scales?
Yes - distance based algorithm
For k-means, do we need to normalize the data if variables are in different scales?
Yes - distance based algorithm
Random chance can often produce
apparent clusters
Types of linkage
complete, single, average, centroid
Non-hierarchical clustering is
computationally cheap and more stable; requires the user to set k
Suppose you have a data set named dat1 in R. Which of the following is the correct syntax to run hierarchical clustering?
hclust(dist(dat1))
Suppose you have a data set named dat1 in R. Which of the following is the correct syntax to run hierarchical clustering with standardization of each variable?
hclust(dist(scale(dat1)))
What is the meaning of k in k-means?
k centers
How many clusters do we have, at a maximum, in the agglomerative clustering process? Suppose there are n observations.
n
Different cluster methods
produce different results
Desirable features of clusters
stability and high separation
Goal of K-Mean Clustering Algorithm
to divide the sample into a pre-determined number k of non-overlapping clusters so that the clusters are as homogeneous as possible with respect to the measurements
Hierarchical clustering gives
visual representation of different levels of clustering