Clustering (k-means)
k-Means clustering
partitional, exclusive, and complete clustering approach that assigns all n items in a dataset to one of k clusters so as to minimize similarity between clusters and maximize similarity within each cluster
clustering
unsupervised machine learning approach used to partition unlabeled data into subgroups (clusters) based on similarity
similarity
"alikeness of instances" sometimes expressed as distance function
strengths of k-Means clustering
- uses simple, non-statistical principles
- very flexible, malleable algorithm
- wide set of real-world applications
weaknesses of k-Means Clustering
- simplistic algorithm
- relies on chance
- sometimes requires domain knowledge to choose k
- not ideal for non-spherical clusters
- works with numeric data only
k-Means Clustering Algorithm
1) pick k random cluster centers (*centroids*)
2) assign each item to the nearest cluster center (by distance metric)
3) move each centroid to the mean of the items assigned to that cluster
4) repeat 2 and 3 until convergence is achieved
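A minimal from-scratch sketch of these four steps in R (the synthetic two-column data and the 1e-8 stopping threshold are assumptions for illustration, not part of the notes):

```r
# minimal k-means (Lloyd's algorithm) sketch; assumes numeric 2-D data
set.seed(42)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
k <- 2

# 1) pick k random data points as the initial centroids
centroids <- x[sample(nrow(x), k), , drop = FALSE]

repeat {
  # 2) assign each item to the nearest centroid (squared Euclidean distance)
  d <- sapply(1:k, function(j)
    rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2))
  assignment <- max.col(-d)   # column index of the smallest distance per row

  # 3) move each centroid to the mean of the items assigned to it
  #    (a real implementation would also handle empty clusters)
  new_centroids <- t(sapply(1:k, function(j)
    colMeans(x[assignment == j, , drop = FALSE])))

  # 4) repeat until the centroids stop moving (convergence)
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}

table(assignment)   # cluster sizes
```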
Types of Clustering
Boundary: *Partitional* (cluster boundaries are separate) vs. *Hierarchical* (start with one cluster and grow the boundary until it includes all items)
Items: *Exclusive* (every item belongs to exactly one cluster) vs. *Overlapping* (an item can belong to more than one cluster)
Inclusion: *Complete* (every item in the data set must belong to a cluster) vs. *Partial* (items are put in a cluster only if they fit)
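A quick R illustration of the partitional vs. hierarchical distinction using base functions (iris is used only as convenient demo data, an assumption not from the notes):

```r
x <- iris[, 1:4]                   # numeric features only (k-means needs numeric data)

# partitional: kmeans() splits the data directly into k separate clusters
km <- kmeans(x, centers = 3, nstart = 25)

# hierarchical: hclust() builds a full tree, then cutree() draws the boundary at k clusters
hc <- hclust(dist(x))
hc_clusters <- cutree(hc, k = 3)

# both results are exclusive and complete: every row ends up in exactly one cluster
table(km$cluster, hc_clusters)
```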
WCSS
Within-Cluster Sum of Squares
WCSS = sum over clusters ( sum over points in a cluster ( (distance from point to its centroid)^2 ) )
as k --> n, WCSS --> 0
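A small check of this in R: kmeans() reports WCSS as tot.withinss, and it can be recomputed by hand (iris again assumed as demo data):

```r
x  <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 25)

# WCSS by hand: each point's squared distance to its own centroid, summed over all clusters
wcss <- sum((x - km$centers[km$cluster, ])^2)

c(by_hand = wcss, from_kmeans = km$tot.withinss)   # the two values agree
```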
convergence
achieved when the change in cluster assignments falls below a chosen threshold, or stops entirely
association vs. clustering
association: identify patterns; clustering: learn how to group things
choosing the right "k"
a business/domain person is often best placed to pick k; usually you know how many segments you want to have
rule of thumb: k = sqrt(n/2)
or use: elbow method, average silhouette method, gap statistic (see the sketch below)
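A rough sketch of the rule of thumb and the average silhouette method in R (assumes the cluster package is installed; iris is just demo data):

```r
library(cluster)                  # for silhouette()

x <- as.matrix(iris[, 1:4])
sqrt(nrow(x) / 2)                 # rule of thumb: k = sqrt(n/2), about 8.7 here

# average silhouette method: pick the k with the largest mean silhouette width
d <- dist(x)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
setNames(avg_sil, 2:6)            # larger is better
```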
random initialization trap
different initial cluster centers may lead to different cluster results
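A small R demo of the trap (synthetic data assumed): single random starts from different seeds can land in different local optima, which is what kmeans()'s nstart argument is for:

```r
set.seed(1)
x <- cbind(rnorm(300, mean = rep(c(0, 4, 8), each = 100)), rnorm(300))

# one random start per seed: the runs may converge to different local optima (different WCSS)
sapply(1:5, function(seed) {
  set.seed(seed)
  kmeans(x, centers = 3, nstart = 1)$tot.withinss
})

# nstart tries several random initializations and keeps the best result
kmeans(x, centers = 3, nstart = 25)$tot.withinss
```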
distance measures
euclidean (what we use): sqrt(sum((xi - yi)^2))
manhattan: sum(abs(xi - yi))
also correlation-based: pearson, spearman, kendall
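These are all available in base R (the two short vectors below are made-up examples):

```r
xi <- c(1, 2, 3)
yi <- c(4, 0, 3)

sqrt(sum((xi - yi)^2))            # euclidean, by hand
sum(abs(xi - yi))                 # manhattan, by hand

m <- rbind(xi, yi)
dist(m, method = "euclidean")     # same values via dist()
dist(m, method = "manhattan")

cor(xi, yi, method = "pearson")   # correlation-based measures
cor(xi, yi, method = "spearman")
cor(xi, yi, method = "kendall")
```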
elbow method
graph k vs. WCSS; choose the k at the elbow point, where adding more clusters stops reducing WCSS much
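A minimal elbow-method sketch in R (iris again assumed as demo data):

```r
x <- as.matrix(iris[, 1:4])

# WCSS for k = 1..10
wcss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# choose the k where the curve bends (the "elbow") and extra clusters stop paying off
plot(1:10, wcss, type = "b", xlab = "k", ylab = "WCSS", main = "Elbow method")
```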
good clustering
- high intra-class similarity
- low inter-class similarity
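One rough way to check this from kmeans() output, which reports between-cluster and total sums of squares (iris assumed as demo data):

```r
km <- kmeans(as.matrix(iris[, 1:4]), centers = 3, nstart = 25)

km$betweenss / km$totss   # share of total variance between clusters; closer to 1 is better
km$withinss               # per-cluster WCSS; small values mean high intra-cluster similarity
```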
challenges with k-means clustering
- sensitive to the initially chosen centroids
- how to choose a value for k?
k-means++
an initialization approach used to overcome the random initialization trap:
- choose the first centroid as a random data point
- compute each remaining point's distance to its nearest already-chosen centroid
- choose the next centroid with probability proportional to that squared distance, so far-away points are favored
- repeat until k centroids are chosen (R packages can do this for us)
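Base R's kmeans() does not seed this way itself; a minimal from-scratch sketch of k-means++ seeding (synthetic data assumed):

```r
set.seed(7)
x <- cbind(rnorm(150, mean = rep(c(0, 5, 10), each = 50)), rnorm(150))
k <- 3

# k-means++ seeding: spread the initial centroids out
centroids <- x[sample(nrow(x), 1), , drop = FALSE]        # first centroid: a random data point
while (nrow(centroids) < k) {
  # squared distance from every point to its nearest already-chosen centroid
  d2 <- apply(x, 1, function(p) min(colSums((t(centroids) - p)^2)))
  # next centroid: sampled with probability proportional to that squared distance
  centroids <- rbind(centroids, x[sample(nrow(x), 1, prob = d2), , drop = FALSE])
}

kmeans(x, centers = centroids)$tot.withinss               # run k-means from these seeds
```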