MIS 441: Clustering and K-means Clustering
stopping criteria for K-means clustering
1) the object partition no longer changes, 2) centroid positions no longer change, 3) a fixed number of iterations has run
steps to K-means clustering (6)
1) pick initial centroids, 2) assign points to clusters, 3) compute centroids, 4) reassign points to clusters, 5) recompute centroids, 6) converge
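The steps and stopping criteria above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the parameters `max_iters` and `seed`, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch; points are tuples of floats (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: pick K initial centroids at random from the data.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):  # stopping criterion: fixed number of iterations
        # Steps 2 and 4: (re)assign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stopping criterion: the partition did not change, so we have converged.
        if new_labels == labels:
            break
        labels = new_labels
        # Steps 3 and 5: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

On this toy data the two well-separated groups should end up in different clusters, though in general the result of K-means depends on the random initial centroids.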
________ is the most commonly used example of partitional clustering
K-means
partitional clustering definition
a division of data objects into non-overlapping subsets; each data object is in exactly one subset
hierarchical clustering definition
a set of nested clusters organized as a hierarchical tree; clusters are merged pairwise into larger clusters until only one remains
to assess inter-cluster similarity...
calculate the distance between centroids
each cluster is associated with a randomly chosen ________, the number of which is determined by ______
centroid, K
each point is assigned to the cluster with the __________
closest centroid
when a K-means clustering ________, it reaches a state where the clusters remain unchanged
converges
limitations of K-means clustering
struggles when clusters have differing sizes, differing densities, or non-globular shapes
the value of K (the number of clusters) depends on...
domain knowledge, software/hardware constraints
cluster analysis is an __________ data analysis tool used to sort objects into groups
exploratory
clustering breaks large __________ populations into smaller _______ groups
heterogeneous, homogeneous
a good clustering produces high quality clusters where intra-cluster similarity is _____ while inter-cluster similarity is _____
high, low
you can reduce sum of squared error by...
increasing K
solutions to limitations of K-means clustering
increasing K (using more clusters)
DBI (Davies-Bouldin index) is an index calculated based on...
inter-cluster distances for pairs of clusters and intra-cluster distances of all clusters; lower = better
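A minimal sketch of how a DBI value can be computed, assuming the standard Davies-Bouldin formulation: for each cluster, take the worst-case ratio of (its scatter + another cluster's scatter) to the distance between their centroids, then average over clusters. The helper names `centroid` and `davies_bouldin` and the toy data are assumptions for illustration.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of tuples."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def davies_bouldin(clusters):
    """DBI for a list of clusters (needs at least two clusters with distinct centroids)."""
    cents = [centroid(c) for c in clusters]
    # Intra-cluster scatter: mean distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k

# Same cluster shapes, but the second pair of clusters sits much closer together.
far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]
dbi_far = davies_bouldin(far_apart)
dbi_close = davies_bouldin(close_together)
```

The well-separated clustering gets the lower (better) value, matching the "lower = better" rule on the card.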
______________ distances should be minimized while ___________ distances should be maximized
intra-cluster, inter-cluster
the quality of a clustering depends on...
the object representation and the similarity measure used
clustering definition
the process of grouping a set of objects into classes of similar objects
external criteria to evaluate clusters include...
purity, rand index
purity definition
ratio of the number of objects in a cluster's dominant class to the total number of objects in that cluster
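As a small worked example of purity, the sketch below represents each cluster as a list of the true class labels of its members. The names `cluster_purity` and `overall_purity` are hypothetical, and the overall score (dominant-class counts summed over clusters, divided by the total number of objects) is the usual aggregate, added here as an assumption since the card only defines purity per cluster.

```python
from collections import Counter

def cluster_purity(class_labels):
    """Purity of one cluster: size of its dominant class / size of the cluster."""
    dominant_count = Counter(class_labels).most_common(1)[0][1]
    return dominant_count / len(class_labels)

def overall_purity(clusters):
    """Aggregate purity: summed dominant-class counts / total number of objects."""
    total = sum(len(c) for c in clusters)
    dominant = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return dominant / total

# Each inner list holds the true classes of the objects placed in that cluster.
clusters = [["a", "a", "b"], ["b", "b", "b", "a"]]
p_first = cluster_purity(clusters[0])    # 2 of 3 objects are class "a"
p_second = cluster_purity(clusters[1])   # 3 of 4 objects are class "b"
p_all = overall_purity(clusters)         # (2 + 3) / 7
```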
rand index definition
ratio of the number of object pairs the clustering gets right (same class and same cluster, or different class and different cluster) to the total number of pairs
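A minimal sketch of the Rand index, computed in the standard way over pairs of objects: a pair counts as correct when the true classes and the cluster assignments agree on whether the two objects belong together. The function name `rand_index` and the toy labels are assumptions for illustration.

```python
from itertools import combinations

def rand_index(true_labels, cluster_labels):
    """Fraction of object pairs on which classes and clusters agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        # Correct pair: together in both, or apart in both.
        if same_class == same_cluster:
            agree += 1
        total += 1
    return agree / total

true_labels = ["a", "a", "b", "b"]
ri_perfect = rand_index(true_labels, [0, 0, 1, 1])  # clustering matches the classes
ri_partial = rand_index(true_labels, [0, 0, 0, 1])  # one "b" placed with the "a"s
```

With four objects there are six pairs; the second clustering gets three of them right.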
examples of what is NOT cluster analysis...
supervised classification, simple segmentation, and query results
to assess intra-cluster similarity, use the...
sum of squared error (SSE); of two clusterings, the one with the smaller error is better
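To make the comparison concrete, here is a minimal sketch that computes SSE as the sum of squared distances from each point to the centroid of its cluster, then compares two clusterings of the same four points. The helper names `centroid` and `sse` and the toy data are assumptions for illustration.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of tuples."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def sse(clusters):
    """Sum of squared distances from every point to its own cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(p, c) ** 2 for p in cluster)
    return total

# Two ways of splitting the same four points into K = 2 clusters.
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
sse_compact = sse(compact)      # each point is 1 unit from its centroid
sse_stretched = sse(stretched)  # each point is 5 units from its centroid
```

The compact clustering has the smaller SSE, so by this criterion it is the better of the two.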
K stands for..
the number of clusters
clustering is an ________ data mining technique because...
undirected; it identifies hidden patterns and structures without a prior hypothesis, and discovers structures without providing an explanation for them
clustering is the most common form of _______________
unsupervised learning