Module 4
within-cluster sum of squares (WCSS)
A measure of cluster compactness that sums the squared distances between each point and its cluster centroid. Lower WCSS indicates tighter, more cohesive clusters. It's often used in K-means to evaluate clustering quality and to help choose the optimal number of clusters by looking for the "elbow" point where adding more clusters yields diminishing improvements.
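As a small illustration of the definition above, here is a minimal sketch in plain Python. The helper names `centroid` and `wcss` and the toy points are assumptions made for this example, not part of the course material.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def wcss(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total


# Toy data: the same four points grouped two different ways.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
merged = [[(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]]

print(wcss(tight))   # two compact clusters
print(wcss(merged))  # one spread-out cluster, much larger WCSS
```

WCSS can only go down (or stay the same) as more clusters are added, which is why the elbow is used rather than the minimum.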
silhouette index
A metric to evaluate clustering quality by measuring how close each point is to its own cluster compared to other clusters. Scores range from -1 to +1, where values near +1 indicate well-separated, cohesive clusters, values around 0 indicate points on cluster boundaries, and negative values suggest possible misclassification. The average silhouette score helps choose the best number of clusters and assess cluster validity.
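One common way to compute the per-point score is s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below assumes Euclidean distance, at least two clusters, and a score of 0 for single-point clusters; the function name `silhouette_scores` is made up for this example.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette scores; assumes at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from p to every other point, grouped by cluster label.
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # p is alone in its cluster; score it 0 by convention.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]
        b = min(sums[lab] / counts[lab] for lab in sums if lab != own)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_scores(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # far-apart points forced together

print(sum(good) / len(good))  # close to +1
print(sum(bad) / len(bad))    # negative
```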
how to best determine K
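Plot WCSS against K and pick the "elbow," the point where adding more clusters stops reducing WCSS by much. The average silhouette score for each K can serve as a second check.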
DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Parameters: epsilon (the radius around a point that defines its neighborhood) and minPts (the minimum number of points required to form a dense region). Core point: has at least minPts neighbors, including itself, within distance epsilon. Border point: has fewer than minPts neighbors but lies within distance epsilon of a core point. Noise point: neither a core nor a border point; an outlier. Visualization: https://www.youtube.com/watch?v=_A9Tq6mGtLI
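The core, border, and noise definitions can be sketched as a short brute-force procedure. This is a simplified illustration under stated assumptions (Euclidean distance, an O(n^2) neighbor search, and hypothetical names `dbscan`, `eps`, `min_pts`, `NOISE`), not a reference implementation.

```python
import math

NOISE = -1  # label used here for outliers (an arbitrary choice for this sketch)


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""

    def neighbors(i):
        # Every point within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        cluster += 1           # i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also core, so keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy data the two squares of points each form a cluster and the isolated point at (50, 50) is labeled as noise. Unlike k-means, the number of clusters is not specified in advance.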
Euclidean case
Each cluster has a centroid: the average (mean) of its data points.
hierarchical clustering
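Agglomerative (bottom up): start with each point in its own cluster and repeatedly combine the two nearest clusters. Divisive (top down): start with one cluster and recursively split it. The alternative family of methods is point assignment, where points are assigned to a maintained set of clusters (k-means is an example).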
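A minimal sketch of the agglomerative (bottom-up) variant follows. It assumes the Euclidean case, measures the distance between two clusters as the distance between their centroids, and stops when `k` clusters remain. The stopping rule and the names `agglomerative` and `centroid` are choices made for this example.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerative(points, k):
    """Merge the two nearest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine the nearest pair
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = agglomerative(points, k=2)
print(result)
```

Recording the order of merges instead of stopping at `k` would give the full hierarchy (a dendrogram).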
evaluate clustering quality
silhouette index; within-cluster sum of squares (WCSS)
k-means
1. Specify k. 2. Pick k initial cluster centers. 3. Assign every item to its nearest cluster center using Euclidean distance. 4. Move each cluster center to the mean of its assigned items. 5. Repeat steps 3 and 4 until convergence (assignments no longer change).
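The loop described above can be sketched in plain Python. To keep the example deterministic, the initial centers are passed in by the caller rather than chosen at random. That choice, the `max_iter` cap, and the rule of leaving a center in place when no items are assigned to it are assumptions of this sketch.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each item goes to its nearest center (Euclidean distance).
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda idx: math.dist(p, centers[idx]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its assigned items.
        new_centers = [
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, groups


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, groups = kmeans(points, centers=[(0, 0), (10, 0)])
print(centers)
print(groups)
```

Different starting centers can lead to different final clusters, so in practice the algorithm is often run several times and the result with the lowest WCSS is kept.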
