Data Mining Ch 4

k-means clustering

-assigns each of the n examples to one of k clusters, where k is a number that has been determined ahead of time
-the goal is to minimize the differences within each cluster and maximize the differences between the clusters

cluster profiling

-clusters tell you which groups are closely related
-it is up to you to apply an actionable and meaningful label to each one
-the objective is to identify the features that uniquely describe each cluster
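One way to look for the features that describe each cluster is to compare every cluster's feature averages with the overall averages. The sketch below assumes cluster labels are already known. The records and the feature names (age, monthly_spend) are hypothetical, invented only for illustration.

```python
from statistics import mean

# Hypothetical records: (cluster label, age, monthly_spend)
records = [
    (0, 22, 40.0), (0, 25, 35.0), (0, 24, 45.0),
    (1, 51, 210.0), (1, 48, 190.0), (1, 55, 200.0),
]
features = ["age", "monthly_spend"]

def profile(records, features):
    """For each cluster, return how far its feature means sit from the overall means."""
    overall = [mean(r[i + 1] for r in records) for i in range(len(features))]
    profiles = {}
    for label in sorted({r[0] for r in records}):
        members = [r for r in records if r[0] == label]
        profiles[label] = {
            name: mean(m[i + 1] for m in members) - overall[i]
            for i, name in enumerate(features)
        }
    return profiles

profiles = profile(records, features)
for label, diffs in profiles.items():
    print(label, diffs)
```

Large positive or negative differences point to the features that set a cluster apart. Here cluster 0 is younger and spends less than average, which might suggest a label such as "young, low spend".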

two measures of distance

-distance between two data points
-distance between two clusters
-a distance is never negative
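The set does not name a specific distance measure. As a minimal sketch, assuming Euclidean (straight-line) distance between two data points:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a = (1.0, 2.0)
b = (4.0, 6.0)

d_ab = euclidean(a, b)  # sqrt(3^2 + 4^2) = 5.0
d_aa = euclidean(a, a)  # a point is at distance 0.0 from itself
print(d_ab, d_aa)
```

Because the differences are squared before they are summed, the result can never be negative.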

how to choose k

-domain knowledge: people who have worked with the data for a while may have an idea of how many groups to expect
-WSS (within-cluster sum of squares): as more clusters are formed, WSS decreases and then stabilizes
-elbow rule: the optimal k is near the point where the decline in WSS diminishes
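The elbow rule can be illustrated by computing WSS for several values of k. The sketch below assumes a tiny made-up data set and hand-picked partitions for each k (not the output of a clustering run), so that only the WSS arithmetic is on show.

```python
def centroid(points):
    """Mean of the points, feature by feature."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wss(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total

# Hand-picked partitions of six points that form three tight pairs (assumption).
partitions = {
    1: [[(1, 1), (1, 2), (5, 5), (5, 6), (9, 1), (9, 2)]],
    2: [[(1, 1), (1, 2), (5, 5), (5, 6)], [(9, 1), (9, 2)]],
    3: [[(1, 1), (1, 2)], [(5, 5), (5, 6)], [(9, 1), (9, 2)]],
    4: [[(1, 1)], [(1, 2)], [(5, 5), (5, 6)], [(9, 1), (9, 2)]],
}
wss_by_k = {k: wss(clusters) for k, clusters in partitions.items()}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

WSS falls sharply up to k = 3 and barely changes afterwards, so the elbow sits at k = 3, matching the three pairs in the data.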

k-means clustering weaknesses

-not as sophisticated as more modern clustering algorithms
-uses an element of random chance, so it is not guaranteed to find the optimal set of clusters
-requires a reasonable guess as to how many clusters naturally exist in the data
-not ideal for non-spherical clusters or clusters of widely varying density

hierarchical clustering algorithm

-start with every data point in its own cluster
-keep merging the two most similar clusters until a single cluster remains
-produces a dendrogram in which the single all-inclusive cluster is the root and each data point is a leaf
-the height of the bar joining two clusters shows the distance at which they were merged: the lower the bar, the more similar the clusters
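The merging procedure can be sketched in a few lines. This is a minimal illustration, not an efficient implementation. It assumes Euclidean distance and single linkage (the closest pair of points decides how far apart two clusters are), and the four points are made up.

```python
import math

def single_link(a, b):
    """Smallest point-to-point distance between clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []                       # (cluster_a, cluster_b, merge height)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        height = single_link(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merges = agglomerate(points)
heights = [h for _, _, h in merges]
print(heights)
```

Each recorded height is where the corresponding bar would sit in the dendrogram. The two nearby pairs merge at low heights and the final merge, which joins the two distant groups, happens much higher.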

clusters

-the groups we generate
-not predefined
-accomplished by finding similarities between data according to characteristics found in the actual data

k-means clustering strengths

-uses simple principles that can be explained in nonstatistical terms
-highly flexible, and can adapt with simple adjustments to address nearly all of its shortcomings
-performs well enough under many real-world use cases

average

average distance between every observation in one cluster and every observation in the other cluster

Ward's

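each cluster is represented by its centroid; the two clusters merged are the pair whose merger gives the smallest increase in within-cluster sum of squares, which depends on the squared distance between their centroids weighted by the cluster sizes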
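A small sketch of the centroid-based measure, assuming Euclidean distance and two made-up clusters. The function ward_increase uses the standard identity for the increase in within-cluster sum of squares caused by a merge: n_a * n_b / (n_a + n_b) times the squared distance between the centroids.

```python
import math

def centroid(points):
    """Mean of the points, feature by feature."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(a, b):
    """Distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    n_a, n_b = len(a), len(b)
    return (n_a * n_b) / (n_a + n_b) * centroid_distance(a, b) ** 2

cluster_a = [(0.0, 0.0), (0.0, 2.0)]  # centroid (0.0, 1.0)
cluster_b = [(4.0, 0.0), (4.0, 2.0)]  # centroid (4.0, 1.0)

d = centroid_distance(cluster_a, cluster_b)
delta = ward_increase(cluster_a, cluster_b)
print(d, delta)
```

For these clusters the sum of squares is 2 + 2 = 4 before the merge and 20 after it, which agrees with the computed increase of 16.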

clustering

-diverse and varied data can be exemplified by a much smaller number of groups
-provides insights into patterns of relationships
-unsupervised method
-sorts data into groups based on similar features
-k-means clustering is used to cluster numerical data
-sensitive to outliers

major clustering methods

-k-means clustering
-hierarchical clustering

complete

maximum distance between an observation in one cluster and an observation in the other cluster

single

minimum distance between an observation in one cluster and an observation in the other cluster
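The single, complete and average measures in this set all start from the same list of point-to-point distances between two clusters and differ only in how they summarize it. A minimal sketch, assuming Euclidean distance and two made-up clusters:

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single(a, b):
    return min(pairwise(a, b))   # closest pair

def complete(a, b):
    return max(pairwise(a, b))   # farthest pair

def average(a, b):
    distances = pairwise(a, b)
    return sum(distances) / len(distances)

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

# The four pairwise distances are 3, 6, 2 and 5.
print(single(cluster_a, cluster_b))    # 2.0
print(complete(cluster_a, cluster_b))  # 6.0
print(average(cluster_a, cluster_b))   # 4.0
```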

implementing k-means algorithm

step 1: k-means clustering begins with the data set segmented into k clusters
step 2: assign each record to the cluster with the closest centroid
step 3: recalculate the centroid of each cluster
step 4: repeat steps 2 and 3 until the assignments no longer change
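The steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it uses Euclidean distance, starts from k randomly chosen records as the initial centers (one common way to carry out step 1), and the names k_means, seed and max_iter are choices made for this example.

```python
import math
import random

def centroid(points):
    """Mean of the points, feature by feature."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1 (assumption): use k randomly chosen records as the starting centers.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = centroid(members)
    return assignment, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

On this well-separated toy data the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random start, and on harder data different starts can give different results.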

centroid

-the center of a discovered cluster
-the number of clusters is fixed to k, so the task is to find k cluster centers
-objects are assigned to the nearest cluster center, such that the squared distances from the objects to their cluster centers are minimized
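A short check of why the centroid is used as the cluster center: for a fixed set of points, the feature-by-feature mean gives a smaller sum of squared distances than any other candidate center. The points and the alternative center below are made up for illustration.

```python
def centroid(points):
    """Mean of the points, feature by feature."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sum_sq(points, center):
    """Sum of squared distances from every point to the given center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
center = centroid(cluster)                # (3.0, 2.0)

at_centroid = sum_sq(cluster, center)     # 16.0
at_other = sum_sq(cluster, (3.0, 3.0))    # 19.0, worse than the centroid
print(center, at_centroid, at_other)
```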

