Week 10 - Clustering 2, Hierarchical & K-Means


Membership

The cluster to which the data point belongs

HIERARCHICAL CLUSTERING STRUCTURE

... can be thought of like files on a hard disk: folders nested inside folders, with each level grouping the items below it.

The general procedure for K-means clustering is...

1. Choose a value for K, the number of clusters.
2. Randomly choose K points as centroids.
3. Assign each item to the cluster with the nearest centroid (mean).
4. Recalculate centroids as the average of all data points in a cluster.
5. Repeat steps 3 and 4 until no more reassignments occur or the maximum number of iterations is reached.
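The steps above can be sketched in Python (a minimal NumPy illustration; the function name and details are my own, not the course's code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each item to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each centroid as the mean of its cluster's points
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once assignments no longer move the centroids
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running it on two well-separated groups of points recovers one cluster per group, regardless of which points are drawn as seeds.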

There are 2 main types of clustering algorithms

Hierarchical Clustering
- Most popular method: Ward's minimum variance
- Does not need a pre-specified number of clusters.

Non-Hierarchical Clustering
- Most popular method: K-means clustering
- Must specify the number of clusters.

Centroid

Mean values of the variables for all items in a particular cluster.

Ward's method

Membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for merging clusters is that the merge should produce the smallest possible increase in the error sum of squares.
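Ward's criterion can be illustrated in code (a pure-Python sketch; the helper names are mine, not from the course):

```python
def ess(cluster):
    # error sum of squares: total squared deviation of points from the cluster mean
    n, dims = len(cluster), len(cluster[0])
    mean = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - mean[d]) ** 2 for p in cluster for d in range(dims))

def ward_increase(a, b):
    # increase in total ESS if clusters a and b were merged;
    # Ward's method merges the pair giving the smallest increase
    return ess(a + b) - (ess(a) + ess(b))
```

Two tight but distant clusters produce a large `ward_increase`, so Ward's method would merge them last.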

Centres / Seeds

Starting points for non-hierarchical clustering algorithms. Clusters are built around these points.

HIERARCHICAL CLUSTERING PROCEDURE

The general procedure for agglomerative hierarchical clustering is:
1. Define a distance or dissimilarity measure between clusters.
2. Place every point in its own cluster (n points, n clusters).
3. Repeat until only 1 cluster remains:
   a. Calculate the distances between all clusters.
   b. Merge the two closest clusters.
4. Save the sequence of merge operations.
THEN CREATE THE DENDROGRAM!
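That loop can be sketched in pure Python, here using single linkage as the cluster distance (one possible choice; this is my own illustration, not the course's code):

```python
from itertools import combinations

def cluster_distance(a, b, points):
    # single linkage: distance between the closest pair of points across clusters
    return min(
        sum((points[i][d] - points[j][d]) ** 2 for d in range(len(points[i]))) ** 0.5
        for i in a for j in b
    )

def agglomerate(points):
    # Step 2: every point starts in its own cluster
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []  # the saved sequence of merges, used to draw the dendrogram
    # Step 3: repeatedly merge the two closest clusters until one remains
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((sorted(a), sorted(b)))
    return merges
```

For n points the loop always performs n - 1 merges, and the recorded sequence is exactly what a dendrogram visualises.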

HIERARCHICAL CLUSTERING DISTANCE MEASURES

There are many ways to measure dissimilarity or distance between clusters. The most common methods are:
1. Single Linkage (Nearest Neighbour)
2. Complete Linkage (Furthest Neighbour)
3. Average Linkage
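Assuming Euclidean distances between points, the three measures differ only in how the pairwise distances are combined (a sketch; helper names are my own):

```python
def pairwise(a, b):
    # all point-to-point Euclidean distances between clusters a and b
    return [sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5 for p in a for q in b]

def single_linkage(a, b):
    return min(pairwise(a, b))    # nearest neighbour

def complete_linkage(a, b):
    return max(pairwise(a, b))    # furthest neighbour

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)        # mean of all pairwise distances
```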

There are 2 types of Hierarchical clustering algorithms

• Agglomerative Clustering (Bottom up) • Divisive Clustering (Top Down)

CLUSTERING

• Clustering is used to group/classify data, creating subsets with similar attributes. • Clustering algorithms work by calculating the similarity between objects. • Similarity is often treated as the inverse of distance.

K-MEANS CLUSTERING LIMITATIONS

• Difficult to choose K; requires human inspection or novel algorithms. • Dependent on the seed / centre positions. • Sensitive to outliers.

METHODS FOR SELECTING K (2)

• In many cases there will not be an obvious elbow or bend in the scree plot. • Therefore many different methodologies have been proposed to identify the optimum K value. • Each method is designed for a particular type of dataset and will not work well on all datasets. • There is no single go-to methodology for identifying K.

K-MEANS CLUSTERING

• K-Means is the most commonly used clustering algorithm. • K refers to the number of clusters you want to classify your data into

METHODS FOR SELECTING K

• One common approach to selecting the optimal number of clusters is to analyse a scree plot, which shows the number of clusters vs the sum of squared error. • As the number of clusters increases, the sum of squared error decreases because there are fewer items in each cluster. • You identify the elbow or bend in the plot, which indicates the best number of clusters.
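The quantity plotted on the y-axis can be computed as follows (a NumPy sketch; the function name is mine):

```python
import numpy as np

def sse(X, labels, centroids):
    # sum of squared error: total squared distance of every point
    # to the centroid of its assigned cluster
    return float(sum(np.sum((X[labels == j] - c) ** 2)
                     for j, c in enumerate(centroids)))
```

Computing `sse` for K = 1, 2, 3, ... and plotting the values gives the scree plot; the elbow is where the decrease flattens out.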

VARIABLE REDUCTION

• Variable reduction techniques can be used to reduce the dimensions (variables / columns) of a dataset before applying clustering methods. • This allows clustering on multidimensional data to be visualised in 2 or 3 dimensional space.
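One common reduction technique is PCA (an assumed choice; the card does not name a specific method). A minimal sketch via SVD:

```python
import numpy as np

def pca_reduce(X, n_components=2):
    # centre each variable, then project onto the top principal components
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

The projected data has one row per item and `n_components` columns, so a high-dimensional dataset can be plotted in 2D before or after clustering.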

