Clustering (k-means)

k-Means clustering

a partitional, exclusive, and complete clustering approach that assigns all n items in a dataset to one of k clusters in order to minimize similarity between clusters and maximize similarity within each cluster

clustering

an unsupervised machine learning approach used to partition unlabeled data into subgroups (clusters) based on similarity

similarity

"alikeness of instances" sometimes expressed as distance function

strengths for k-Means clustering

- simple, non-statistical principles
- very flexible and malleable algorithm
- wide set of real-world applications

weaknesses of k-Means Clustering

- simplistic algorithm
- relies on chance
- sometimes requires domain knowledge to choose k
- not ideal for non-spherical clusters
- works with numeric data only

k-Means Clustering Algorithm

1) pick k random cluster centers (*centroids*)
2) assign each item to the nearest cluster center (by distance metric)
3) move each centroid to the mean of the items assigned to its cluster
4) repeat 2 and 3 until convergence is achieved
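
A minimal sketch of these four steps in Python/NumPy (function and variable names here are illustrative, not from the card):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means (Lloyd's loop) on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # 1) pick k random items from the data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each item to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) move each centroid to the mean of its assigned items
        #    (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) repeat until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```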

Types of Clustering

Boundary: *Partitional* (boundaries are separate) vs. *Hierarchical* (start with one cluster and grow it until the boundary includes all items)
Items: *Exclusive* (every item belongs to exactly 1 cluster) vs. *Overlapping* (an item can belong to more than one cluster)
Inclusion: *Complete* (every item in the data set must belong to a cluster) vs. *Partial* (items are put in a cluster only if they fit)

WCSS

Within-Cluster Sum of Squares
WCSS = sum over each cluster ( sum over points in the cluster ( (distance from point to its centroid)^2 ) )
as k --> n, WCSS --> 0
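
As a sketch, WCSS for a finished clustering could be computed like this in Python/NumPy (X, labels, and centroids are assumed to come from a k-means run such as the one above):

```python
import numpy as np

def wcss(X, labels, centroids):
    # sum over clusters of the squared distances from each point
    # to its own cluster's centroid
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    )
```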

convergence

achieved when the change in cluster assignments between iterations is below a chosen threshold, or there is no change at all

association vs. clustering

association: identify patterns
clustering: learn how to group things

choosing the right "k"

a business/domain expert is often best placed to pick k - usually you know how many segments you'd want to have
rule of thumb: k = sqrt(n/2)
or use: the elbow method, average silhouette method, or gap statistic (rule of thumb and silhouette are sketched below)
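
A sketch of the rule of thumb plus the average silhouette method, using scikit-learn (the data matrix X and the candidate range 2..10 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

n = len(X)                       # X: (n, d) data matrix, assumed given
k_rule = round(np.sqrt(n / 2))   # rule of thumb: k = sqrt(n/2)

# average silhouette method: pick the k with the highest mean silhouette
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                  random_state=0).fit_predict(X))
    for k in range(2, 11)
}
best_k = max(scores, key=scores.get)
```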

random initialization trap

different initial cluster centers may lead to different final clusterings
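
A quick way to see the trap, sketched with scikit-learn (purely random initialization, one run per seed; X is assumed):

```python
from sklearn.cluster import KMeans

# With init='random' and n_init=1, different seeds can converge to
# different clusterings with different WCSS (inertia_ in scikit-learn).
for seed in range(5):
    km = KMeans(n_clusters=3, init='random', n_init=1,
                random_state=seed).fit(X)
    print(seed, km.inertia_)
```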

distance measures

euclidean (the one we use): sqrt(sum((xi - yi)^2))
manhattan: sum(abs(xi - yi))
also: pearson, spearman, kendall (correlation-based measures)
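
The two main formulas as plain functions, a sketch in Python/NumPy:

```python
import numpy as np

def euclidean(x, y):
    # sqrt(sum((xi - yi)^2))
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum(abs(xi - yi))
    return np.sum(np.abs(x - y))
```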

elbow method

graph k vs. WCSS and choose the k at the "elbow," where the curve's decrease flattens out
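
A sketch of that graph with scikit-learn and matplotlib (scikit-learn exposes WCSS as inertia_; X is assumed):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]
plt.plot(ks, wcss_values, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS')
plt.show()
```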

good clustering

high intra-class similarity
low inter-class similarity

challenges with k-means clustering

sensitive to the initially chosen centroids
how to choose a value for k?

k-means++

the initialization approach used to overcome the random initialization (R.I.) trap
1) choose one random data point as the first center and calculate every other point's distance to it
2) choose the next center far from the centers already chosen (k-means++ proper samples each point with probability proportional to its squared distance from the nearest existing center)
3) keep iterating until all k centers are chosen (R does this for us)
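
In Python, scikit-learn's KMeans defaults to k-means++ seeding, analogous to the card's note that R does this for us (X is assumed):

```python
from sklearn.cluster import KMeans

# init='k-means++' is the default; shown explicitly for clarity
km = KMeans(n_clusters=3, init='k-means++', n_init=10,
            random_state=0).fit(X)
```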

