Week 10 - Clustering 2, Hierarchical & K-Means
Membership
The cluster to which a data point belongs.
HIERARCHICAL CLUSTERING STRUCTURE
The structure can be thought of like the folder hierarchy of files on a hard disk: smaller clusters nest inside larger ones, just as files sit inside nested folders.
The general procedure for K-means clustering is...
1. Choose a value for K, the number of clusters.
2. Randomly choose K points as the initial centroids.
3. Assign each item to the cluster with the nearest centroid (mean).
4. Recalculate each centroid as the average of all data points in its cluster.
5. Repeat steps 3 and 4 until there are no more reassignments or the maximum number of iterations is reached.
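A minimal NumPy sketch of this procedure follows; the data array X, the value of K, and the iteration cap are placeholders, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each item to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate centroids as the mean of each cluster's points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once no centroid moves, i.e. no more reassignments.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```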
There are 2 main types of clustering algorithms
Hierarchical Clustering
• Most popular method: Ward's Minimum Variance
• Does not require specifying the number of clusters.
Non-Hierarchical Clustering
• Most popular method: K-means clustering
• Must specify the number of clusters.
Centroid
Mean values of the variables for all items in a particular cluster.
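For example, with three items in a cluster (hypothetical values), the centroid is just the column-wise mean of the cluster's rows:

```python
import numpy as np

# Three items in one cluster, two variables each (made-up data).
cluster = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
centroid = cluster.mean(axis=0)  # mean of each variable -> [3.0, 4.0]
```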
Ward's method
Membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for merging clusters is that the merge should produce the smallest possible increase in the error sum of squares.
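One common way to apply this is SciPy's `linkage` with `method="ward"`; the random data and the choice of three clusters below are illustrative assumptions, not part of the slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # illustrative data

# Each merge is chosen to give the smallest possible increase
# in the error sum of squares (Ward's criterion).
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```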
Centres / Seeds
Starting points for non-hierarchical clustering algorithms. Clusters are built around these points.
HIERARCHICAL CLUSTERING PROCEDURE
The general procedure for agglomerative hierarchical clustering is:
1. Define a distance or dissimilarity measure between clusters.
2. Place every point into its own cluster (n points, n clusters).
3. Repeat the following until only one cluster remains:
   a. Calculate the distances between all clusters.
   b. Merge the two closest clusters.
4. Save the sequence of cluster merge operations.
Then create the dendrogram!
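SciPy implements this agglomerative procedure; a sketch that runs the merges and draws the resulting dendrogram (the sample data and linkage choice are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))  # 15 points -> start with 15 singleton clusters

# linkage() repeatedly merges the two closest clusters until one remains
# and returns the saved sequence of merge operations.
Z = linkage(X, method="average")
dendrogram(Z)  # visualise the saved merge sequence as a tree
plt.xlabel("Data point")
plt.ylabel("Merge distance")
plt.show()
```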
HIERARCHICAL CLUSTERING DISTANCE MEASURES
There are many ways to measure the dissimilarity or distance between clusters. The most common methods are:
1. Single Linkage (Nearest Neighbour)
2. Complete Linkage (Furthest Neighbour)
3. Average Linkage
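As a sketch, these three measures differ only in how the pairwise point distances between two clusters are aggregated (the function name and data are hypothetical):

```python
import numpy as np

def cluster_distance(A, B, method="single"):
    """Distance between clusters A and B, each an (n, d) array of points."""
    # Pairwise Euclidean distances between every point in A and every point in B.
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":    # nearest neighbour: closest pair of points
        return pairwise.min()
    if method == "complete":  # furthest neighbour: most distant pair of points
        return pairwise.max()
    if method == "average":   # mean of all pairwise distances
        return pairwise.mean()
    raise ValueError(method)
```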
There are 2 types of Hierarchical clustering algorithms
• Agglomerative Clustering (Bottom-Up)
• Divisive Clustering (Top-Down)
CLUSTERING
• Clustering is used to group, classify, or create subsets of data with similar attributes.
• Clustering algorithms work by calculating the similarity of different objects.
• Similarity is often treated as the inverse of distance.
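One common (though not the only) convention for turning a distance into a similarity is sketched below; the points are made up.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])

distance = np.linalg.norm(a - b)     # Euclidean distance between the points
similarity = 1.0 / (1.0 + distance)  # one common inverse-distance similarity
```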
K-MEANS CLUSTERING LIMITATIONS
• Difficult to choose K; requires human inspection or novel algorithms.
• Dependent on the seed / centre positions.
• Sensitive to outliers.
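scikit-learn's KMeans illustrates a standard mitigation for the seed dependence: running several random initialisations and keeping the best run (the dataset here is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))  # illustrative data

# n_init runs K-means from several random seed positions and keeps the
# run with the lowest sum of squared error, reducing seed sensitivity.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```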
METHODS FOR SELECTING K (2)
• In many cases there will not be an obvious elbow or bend in the scree plot.
• Many different methodologies have therefore been proposed to identify the optimum value of K.
• Each method works well on a particular type of dataset rather than on all datasets.
• There is no single go-to methodology for identifying K.
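As one example of such an alternative methodology (not named on the slide, so treat it as a supplementary sketch), the silhouette score can rank candidate values of K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))  # illustrative data

# Score each candidate K and keep the one with the highest silhouette.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```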
K-MEANS CLUSTERING
• K-Means is the most commonly used clustering algorithm.
• K refers to the number of clusters you want to classify your data into.
METHODS FOR SELECTING K
• One common approach to selecting the optimal number of clusters is to analyse a scree plot, which shows the number of clusters vs. the sum of squared error.
• As the number of clusters increases, the sum of squared error decreases because there are fewer items in each cluster.
• The elbow or bend in the plot indicates the best number of clusters.
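A sketch of producing such a scree plot with scikit-learn, whose `inertia_` attribute gives the sum of squared error (the data and range of K are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))  # illustrative data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Sum of squared error")  # decreases as K grows; look for the elbow
plt.show()
```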
VARIABLE REDUCTION
• Variable reduction techniques can be used to reduce the dimensions (variables / columns) of a dataset before applying clustering methods.
• This allows clustering on multidimensional data to be visualised in two- or three-dimensional space.
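A typical pipeline sketch: PCA, used here as an assumed example since the slide does not name a specific technique, reduces the data to two dimensions before K-means so the clusters can be plotted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))  # six variables: too many to visualise directly

# Reduce to 2 dimensions so the clustering result can be shown in 2D space.
X2 = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
```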