Data Mining Ch 4
k-means clustering
-assigns each of the n examples to one of the k clusters, where k is a number that has been determined ahead of time
-the goal is to minimize the differences within each cluster and maximize the differences between the clusters
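In practice this is usually done with a library; a minimal sketch using scikit-learn's KMeans (scikit-learn assumed available; the feature matrix X is made-up illustration data):

# Minimal k-means sketch with scikit-learn (assumes scikit-learn installed).
# The feature matrix X is made-up illustration data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0],
              [8.2, 7.7], [0.9, 2.2], [7.8, 8.3]])

# k must be chosen ahead of time (here, k=2)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)     # cluster assignment for each of the n examples
print(labels)                  # e.g. [0 0 1 1 0 1]
print(km.cluster_centers_)     # one centroid per cluster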
cluster profiling
-clusters tell you which groups are closely related
-it is up to you to apply an actionable and meaningful label
-the objective is to identify the features that uniquely describe each cluster
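A common way to profile clusters is to compare per-cluster feature averages; a small sketch, assuming pandas is available (the DataFrame df and the labels are hypothetical):

# Hedged sketch: profile clusters by comparing mean feature values per cluster.
# df is a hypothetical DataFrame of numeric features; labels would come from a
# clustering model fitted elsewhere.
import pandas as pd

df = pd.DataFrame({"age": [23, 25, 61, 58, 24, 63],
                   "income": [30, 35, 90, 85, 32, 95]})
labels = [0, 0, 1, 1, 0, 1]

profile = df.assign(cluster=labels).groupby("cluster").mean()
print(profile)   # features whose means differ sharply help name each cluster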
two measures of distance
-distance between two data points
-distance between two clusters
-distance is never negative
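For numeric data, Euclidean distance is the usual choice for both kinds of distance; a minimal sketch (the points and clusters are made up, and representing a cluster by its centroid is one convention among several, alongside the linkage criteria later in these notes):

# Euclidean distance between two data points, and between two clusters
# represented by their centroids.
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.linalg.norm(a - b))   # point-to-point distance: 5.0

cluster1 = np.array([[1.0, 2.0], [1.5, 1.8]])
cluster2 = np.array([[8.0, 8.0], [8.2, 7.7]])
print(np.linalg.norm(cluster1.mean(axis=0) - cluster2.mean(axis=0)))  # centroid distance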
how to choose k
-domain knowledge: those who have been working with the data for a while may have an idea
-WSS (within-cluster sum of squares): as more clusters are formed, WSS decreases and then stabilizes -> elbow rule: wherever the decline of WSS diminishes, the optimal k is nearby
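A sketch of the elbow rule, assuming scikit-learn and matplotlib (scikit-learn exposes WSS on a fitted model as inertia_; X is made-up data):

# Elbow-rule sketch: plot WSS against k and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0],
              [8.2, 7.7], [0.9, 2.2], [7.8, 8.3]])

wss = []
ks = range(1, 6)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)   # within-cluster sum of squares

plt.plot(ks, wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS")
plt.show()   # optimal k is near where the curve's decline flattens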
clustering weaknesses
-not as sophisticated as more modern clustering algorithms
-uses an element of random chance, so it is not guaranteed to find the optimal set of clusters
-requires a reasonable guess as to how many clusters naturally exist in the data
-not ideal for non-spherical clusters or clusters of widely varying density
hierarchical clustering algorithm
-start with every data point in a separate cluster
-keep merging the most similar clusters until only a single cluster remains (the agglomerative approach)
-produces a dendrogram: the final merged cluster is the root and each data point is a leaf
-the height at which two clusters merge indicates how dissimilar they are (lower merges = more similar)
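A minimal sketch of agglomerative clustering with SciPy, assuming scipy and matplotlib are available (X is made-up illustration data):

# Agglomerative hierarchical clustering sketch with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0],
              [8.2, 7.7], [0.9, 2.2], [7.8, 8.3]])

Z = linkage(X, method="average")   # merge history: one row per merge
dendrogram(Z)                      # leaves are data points; root is the final cluster
plt.ylabel("merge distance")
plt.show()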
clusters
-the groups we generate
-not predefined
-accomplished by finding similarities between data according to characteristics found in the actual data
clustering strengths
-uses simple principles that can be explained in nonstatistical terms
-highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
-performs well enough in many real-world use cases
average
the average distance between each observation in one cluster and each observation in the other cluster
Ward's
merges the pair of clusters that produces the smallest increase in total within-cluster variance; each cluster is represented by its centroid, so the merge cost grows with the (size-weighted) squared distance between the centroids of the two clusters
clustering
-diverse and varied data can be summarized by a much smaller number of groups
-provides insights into patterns of relationships
-unsupervised method
-sorts data into groups based on similar features
-k-means clustering is used to cluster numerical data
-sensitive to outliers
major clustering methods
-k-means clustering
-hierarchical clustering
complete
the maximum distance between an observation in one cluster and an observation in the other cluster
single
the minimum distance between an observation in one cluster and an observation in the other cluster
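All four linkage criteria in these notes (single, complete, average, and Ward's) plug into the same agglomerative procedure; a minimal SciPy comparison sketch (SciPy assumed; X is made-up data):

# Compare linkage criteria on the same data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0],
              [8.2, 7.7], [0.9, 2.2], [7.8, 8.3]])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # merge history under this criterion
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)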
implementing k-means algorithm
step 1: k-means clustering begins with the data set segmented into k initial clusters (e.g., by random assignment)
step 2: assign each record to the cluster whose centroid is closest
step 3: recalculate the centroid of each cluster
step 4: repeat steps 2 and 3 until the assignments no longer change
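A from-scratch sketch of these four steps, assuming numpy (the kmeans helper, the data X, and the simple random seeding are all illustrative, not a reference implementation; there is no k-means++ seeding or empty-cluster handling):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # step 2: assign each record to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: stop once assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: recalculate each centroid as the mean of its members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0],
              [8.2, 7.7], [0.9, 2.2], [7.8, 8.3]])
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")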
centroid
-the center of a discovered cluster (the mean of its member points)
-the number of clusters is fixed at k; find k cluster centers
-assign each object to the nearest cluster center, such that the squared distances from the centers are minimized
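The "squared distances" being minimized are the within-cluster sum of squares, the same WSS used by the elbow rule; a small worked sketch computing it by hand (the data and label assignment are hypothetical):

# Compute the k-means objective (WSS) for a given assignment.
# centroids are the per-cluster means; labels assign each point to a center.
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.7]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])

wss = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in (0, 1))
print(wss)   # k-means seeks the assignment and centers minimizing this value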