ITAO 30120: Data Analysis with Python - Week 12, Part 2
wrong
A silhouette value close to -1 implies that the item is in the (right, wrong) cluster.
right
A silhouette value close to 1 implies that the item is in the (right, wrong) cluster.
center
After deciding on the number of clusters k, the algorithm assigns every item to its nearest cluster __________ using a distance metric. Initially, these are k random points in the feature space.
similarity
After determining the number of clusters, the algorithm attempts to assign every item in the dataset to one and only one of k non-overlapping clusters based on __________.
reassigns
After the "true" new cluster centers are calculated (cluster centroids), the algorithm __________ each item to the cluster that is represented by the center closest to it. This may involve shifting some points from one cluster center to another.
mean
After the algorithm assigns each item to its nearest cluster center, the cluster center is moved to the __________ of its assigned items.
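The assignment and update steps described in the cards above can be sketched in one pass of Python. This is a toy example with made-up data and hand-picked initial centers (NumPy assumed available); it is not a full k-means implementation, just one iteration.

```python
import numpy as np

# Toy 2-D data: two loose groups of points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Hypothetical initial centers: random points in the feature space
# (they do not have to be items from the dataset).
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: each item goes to its nearest center (Euclidean distance).
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: each center moves to the mean of its assigned items.
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(labels)   # first three items in cluster 0, last three in cluster 1
print(centers)  # [[1.5, 1.5], [8.5, 8.5]]
```

Repeating these two steps until the assignments stop changing is the convergence described in a later card.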
further apart
As the value of k decreases, the (closer, further apart) the items within each cluster become.
closer
As the value of k increases, the (closer, further apart) the items within each cluster become.
k, 3
If we choose to group our items into 3 clusters, we set our _______ equal to _______.
a priori
If we don't know how to choose the right value for k, ideally it should be determined through __________ knowledge or business requirements.
rule of thumb
In the absence of prior knowledge when determining the right k, a simple __________ can be used. One such rule is setting k equal to the square root of half the number of observations in the dataset.
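The rule of thumb from the card above is a one-liner. The dataset size n here is made up for illustration.

```python
import math

# Rule of thumb: k ≈ sqrt(n / 2) for a dataset with n observations.
n = 200                       # hypothetical dataset size
k = round(math.sqrt(n / 2))   # sqrt(100) = 10
print(k)                      # 10
```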
cluster centers
K-means++ is an initialization algorithm that is used to choose the initial set of __________. The objective is to ensure that the initial centers are as FAR AWAY from each other AS POSSIBLE. This minimizes the impact of randomness on the final set of clusters.
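In practice, k-means++ initialization is usually requested from a library rather than hand-coded. A minimal sketch using scikit-learn (assumed available) on synthetic two-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of synthetic data.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# init="k-means++" spreads the initial centers far apart,
# reducing the impact of randomness on the final clusters.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # approximately (0, 0) and (5, 5)
```

Note that `init="k-means++"` is already scikit-learn's default; it is spelled out here only to make the connection to the card explicit.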
location, initial set
One of the notable shortcomings of the k-means clustering technique is that the final set of clusters is sensitive to the __________ of the __________ of cluster centers. This means that we could run the k-means clustering process several times and end up with different-looking clusters each time.
larger
The (smaller, larger) the WCSS, the less similar the items in a cluster are.
smaller
The (smaller, larger) the WCSS, the more similar the items in a cluster are.
sum, squared distances
The WCSS of a cluster is the __________ of the __________ between the items in a cluster and the cluster centroid.
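The WCSS definition above translates directly into NumPy. This toy cluster is made up for illustration:

```python
import numpy as np

# Items assigned to one cluster, plus that cluster's centroid.
cluster = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
centroid = cluster.mean(axis=0)          # [2.0, 2.0]

# WCSS: sum of squared distances between each item and the centroid.
wcss = ((cluster - centroid) ** 2).sum()
print(wcss)                              # 4.0
```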
average silhouette
The _________ computes the average silhouette of ALL items in the dataset, based on different values for k. If most items have a HIGH value, then the average will be high, and the clustering configuration is considered appropriate.
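The average-silhouette approach can be sketched with scikit-learn (assumed available). The three synthetic blobs below are constructed so that k = 3 should score highest:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs, so k = 3 should score highest.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # average silhouette of ALL items

best_k = max(scores, key=scores.get)
print(best_k)   # 3
```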
within-cluster sum of squares (WCSS)
The degree to which items in a cluster are similar or dissimilar can be quantified using a measure called the __________.
Total WCSS
The elbow method plots total number of clusters k against _________.
larger
The fewer clusters k, the (smaller, larger) total WCSS becomes.
how many
The first thing to decide in k-means clustering is __________ clusters, k, we want.
convex, negative
The graph for the elbow method is a _________ curve with a(n) _________ slope.
randomly
The initial k-means cluster centers are __________ chosen and DO NOT have to be one of the points from the original data.
heuristic
The k-means clustering algorithm is a very simple and efficient approach because it takes a(n) __________ approach to clustering. This means that it begins by making an initial guess about which cluster each item should belong to and then iteratively refines that assignment.
silhouette
The k-value corresponding to the highest average _________ represents the optimal number of clusters.
smaller
The more clusters k, the (smaller, larger) total WCSS becomes.
largest
The optimal number of clusters is denoted by the k-value that yields the (largest, smallest) gap statistic.
number
The second key shortcoming of k-means clustering is that we don't always know the optimal __________ of clusters to create.
elbow, average silhouette, gap statistic
The three statistical methods that can help give guidance as to how many clusters are reasonable are the __________ method, the __________ method, and the __________.
Euclidean distance
To determine the cluster center closest to a particular point, the k-means algorithm evaluates the __________ between each point and each of the k cluster centers.
every
To say that k-means clustering is "complete" means that __________ item is assigned to a cluster.
1
To say that k-means clustering is "exclusive" means that each item in a k-means clustering can only belong to _____ cluster(s).
independent
To say that k-means clustering is "partitional" means that cluster boundaries are __________ of each other.
True
True/False: The initial k-means cluster centers could be one of the points from the original data.
False
True/False: The initial k-means cluster centers have to be one of the points from the original data.
small
Using a rule of thumb to help choose the right value for k should only be used in __________ datasets.
experts
We can view the statistical approaches for determining the suggested values for k as a panel of _________ who provide us with recommendations based on their perspective of the data.
elbow
When plotting number of clusters k against total WCSS, a visible bend occurs that represents the point at which increasing the value for k no longer yields a significant reduction in WCSS. This point is known as the _________ and is usually expected to be the appropriate number of clusters for the dataset.
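Computing the numbers behind an elbow plot is straightforward with scikit-learn (assumed available); plotting k against the resulting WCSS values would reveal the bend. The three-blob data here is synthetic, so the elbow should fall at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three separated blobs: the elbow should appear at k = 3.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 6, 12)])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # total WCSS for this k

# WCSS always decreases as k grows; the drop flattens sharply after the elbow.
print(wcss)
```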
reference dataset
When using the gap statistic method, we compare the difference between clusters created from the observed data and clusters created from a RANDOMLY GENERATED DATASET known as a(n) _________.
silhouette
a measure of how closely the item is matched with other items within the same cluster and how loosely it is matched with items in neighboring clusters
k-means clustering
a partitional, exclusive, and complete clustering approach that assigns all n items in a dataset to one of k clusters, such that the differences within a cluster are minimized while the differences between clusters are maximized
cluster centroid
the average position of the items currently assigned to a cluster
random initialization trap
the challenge that k-means clustering is very sensitive to the initial randomly chosen cluster centers; as a result, different initial centers may lead to different cluster results
gap statistic
the difference in total WCSS for the observed data and the reference dataset
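The simplified definition above can be sketched with NumPy and scikit-learn (assumed available). The full gap statistic uses the log of the WCSS and averages over many reference datasets; this toy version uses raw WCSS and a single uniform reference drawn over the data's bounding box:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Observed data: three tight blobs.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])
# Reference dataset: points drawn uniformly over the same bounding box.
ref = rng.uniform(X.min(axis=0), X.max(axis=0), X.shape)

def total_wcss(data, k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

# Simplified gap: reference WCSS minus observed WCSS, for each candidate k.
gaps = {k: total_wcss(ref, k) - total_wcss(X, k) for k in range(1, 6)}
best_k = max(gaps, key=gaps.get)
print(best_k)
```

Because the observed blobs are much tighter than the uniform reference, the gap jumps once k matches the true number of blobs.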
k-means++
the initialization approach used to overcome the random initialization trap
convergence
the state at which the algorithm can no longer improve upon the cluster assignment or the changes become insignificant
gap statistic
the statistical method for choosing the correct number of clusters in which a reference dataset is used
Euclidean distance
the straight-line distance between the coordinates of two points in multidimensional space
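The definition above corresponds to a one-line NumPy computation; the two points here are made up for illustration:

```python
import numpy as np

# Straight-line distance between two points in multidimensional space.
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

dist = np.linalg.norm(p - q)   # sqrt(3^2 + 4^2 + 0^2)
print(dist)                    # 5.0
```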