ITAO 30120: Data Analysis with Python - Week 12, Part 2

wrong

A silhouette value close to -1 implies that the item is in the (right, wrong) cluster.

right

A silhouette value close to 1 implies that the item is in the (right, wrong) cluster.

center

After deciding to group the items into k clusters, the algorithm assigns every item to its nearest cluster __________ using a distance metric. Initially, these are k random points in the feature space.

similarity

After determining the number of clusters, the algorithm attempts to assign every item in the dataset to one and only one of k non-overlapping clusters based on __________.

reassigns

After the "true" new cluster centers are calculated (cluster centroids), the algorithm __________ each item to the cluster that is represented by the center closest to it. This may involve shifting some points from one cluster to another.

mean

After the algorithm assigns each item to its nearest cluster center, the cluster center is moved to the __________ of its assigned items.
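
The assign, move-to-mean, and reassign steps described in the cards above can be sketched as a short loop. This is a minimal illustration, not the course's implementation: the function names `assign`, `update`, and `k_means` are hypothetical, the starting centers are supplied by hand, and keeping the old center when a cluster ends up empty is an assumption.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points currently assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)  # assumption: an empty cluster keeps its old center
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate update and reassignment until the assignments stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # convergence: no item changed cluster
            break
        labels = new_labels
    return labels, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, centers=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

The loop stops when a full pass produces no reassignment, which is one simple way to detect convergence.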

further apart

As the value of k decreases, the items within each cluster become (closer, further apart).

closer

As the value of k increases, the items within each cluster become (closer, further apart).

k, 3

If we choose to group our items into 3 clusters, we set our _______ equal to _______.

a priori

If we don't know how to choose the right value for k, ideally it should be determined through __________ knowledge or business requirements.

rule of thumb

In the absence of prior knowledge when determining the right k, a simple __________ can be used. One such rule is setting k equal to the square root of half the number of observations in the dataset.
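
As a quick illustration of the rule of thumb above, k is the square root of half the number of observations. The function name `rule_of_thumb_k` is hypothetical, and rounding to the nearest whole number (with a floor of 1) is an assumption, since the card does not say how to handle non-integer results.

```python
import math

def rule_of_thumb_k(n_observations):
    """Suggest k as sqrt(n / 2), rounded to a whole number (rounding is an assumption)."""
    return max(1, round(math.sqrt(n_observations / 2)))

print(rule_of_thumb_k(200))  # 10, because sqrt(200 / 2) = sqrt(100) = 10
```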

cluster centers

K-means++ is an initialization algorithm that is used to choose the initial set of __________. The objective is to ensure that the initial centers are as FAR AWAY AS POSSIBLE from each other. This minimizes the impact of randomness on the final set of clusters.
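
A rough sketch of how k-means++ spreads out the initial centers is shown below. In the usual formulation the first center is picked at random from the data, and each later center is picked with probability proportional to its squared distance from the nearest center already chosen, so far-away points are favored rather than guaranteed. The function name `kmeans_pp_init` and the `seed` parameter are hypothetical, and the sketch assumes the data contains at least k distinct points.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from the data, favoring points far from earlier picks."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Squared distance from each point to its nearest already-chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # Sample the next center with probability proportional to that squared distance.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans_pp_init(points, k=2))
```

A point that is already a center has weight zero, so it cannot be drawn a second time.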

location, initial set

One of the notable shortcomings of the k-means clustering technique is that the final set of clusters is sensitive to the __________ of the __________ of cluster centers. This means that we could run the k-means clustering process several times and end up with different-looking clusters each time.

larger

The (smaller, larger) the WCSS, the less similar the items in a cluster are.

smaller

The (smaller, larger) the WCSS, the more similar the items in a cluster are.

sum, distances

The WCSS of a cluster is the __________ of the squared __________ between the items in a cluster and the cluster centroid.
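
To make the WCSS definition concrete, here is a small sketch that computes it for a single cluster. As the name "sum of squares" suggests, the distances to the centroid are squared before being summed. The helper names `centroid` and `wcss` are hypothetical.

```python
def centroid(points):
    """Average position of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def wcss(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

cluster = [(1, 1), (3, 1), (2, 4)]
print(centroid(cluster))  # (2.0, 2.0)
print(wcss(cluster))      # 8.0
```

A tighter cluster gives a smaller value: `wcss([(0, 0), (0, 2)])` is 2.0, while `wcss([(0, 0), (0, 4)])` is 8.0.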

average silhouette

The _________ method computes the average silhouette of ALL items in the dataset, based on different values for k. If most items have a HIGH value, then the average will be high, and the clustering configuration is considered appropriate.

within-cluster sum of squares (WCSS)

The degree to which items in a cluster are similar or dissimilar can be quantified using a measure called the __________.

Total WCSS

The elbow method plots the number of clusters k against _________.

larger

The fewer clusters k, the (smaller, larger) total WCSS becomes.

how many

The first thing to decide in k-means clustering is __________ clusters, k, we want.

convex, negative

The graph for the elbow method is a _________ curve with a(n) _________ slope.

randomly

The initial k-means cluster centers are __________ chosen and DO NOT have to be one of the points from the original data.

heuristic

The k-means clustering algorithm is a very simple and efficient approach because it takes a(n) __________ approach to clustering. This means that it begins by making a decision about what clusters items should belong to.

silhouette

The k-value corresponding to the highest average _________ represents the optimal number of clusters.

smaller

The more clusters k, the (smaller, larger) total WCSS becomes.

largest

The optimal number of clusters is denoted by the k-value that yields the (largest, smallest) gap statistic.

number

The second key shortcoming of k-means clustering is that we don't always know the optimal __________ of clusters to create.

elbow, average silhouette, gap statistic

The three statistical methods that can help give guidance as to how many clusters are reasonable are the __________ method, the __________ method, and the __________.

Euclidean distance

To determine the cluster center closest to a particular point, the k-means algorithm evaluates the __________ between each point and each of the k cluster centers.

every

To say that k-means clustering is "complete" means that __________ item is assigned to a cluster.

1

To say that k-means clustering is "exclusive" means that each item in a k-means clustering can only belong to _____ cluster(s).

independent

To say that k-means clustering is "partitional" means that cluster boundaries are __________ of each other.

True

True/False: The initial k-means cluster centers could be one of the points from the original data.

False

True/False: The initial k-means cluster centers have to be one of the points from the original data.

small

Using a rule of thumb to help choose the right value for k should only be used in __________ datasets.

experts

We can view the statistical approaches for determining the suggested values for k as a panel of _________ who provide us with recommendations based on their perspective of the data.

elbow

When plotting number of clusters k against total WCSS, a visible bend occurs that represents the point at which increasing the value for k no longer yields a significant reduction in WCSS. This point is known as the _________ and is usually expected to be the appropriate number of clusters for the dataset.
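
The elbow idea can be tried on toy data by computing total WCSS for several values of k and looking for where the improvement levels off. The sketch below is illustrative only: the function name `total_wcss` is hypothetical, and it swaps in a deterministic farthest-point initialization (an assumption made so the result is reproducible) instead of random or k-means++ initialization.

```python
import math

def total_wcss(points, k, max_iter=100):
    """Run a small k-means and return the total within-cluster sum of squares."""
    # Assumption: deterministic farthest-point initialization, for reproducibility.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
for k in range(1, 5):
    print(k, round(total_wcss(points, k), 2))
```

On this toy data the total WCSS falls sharply up to k = 3 and only slightly afterwards, which is the bend the elbow method looks for.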

reference dataset

When using the gap statistic method, we compare the difference between clusters created from the observed data and clusters created from a RANDOMLY GENERATED DATASET known as a(n) _________.

silhouette

a measure of how closely the item is matched with other items within the same cluster and how loosely it is matched with items in neighboring clusters
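
The silhouette of an item is commonly computed as (b - a) / max(a, b), where a is the item's mean distance to the other items in its own cluster and b is its mean distance to the items of the nearest other cluster. The sketch below follows that standard formula. The function names are hypothetical, returning 0 for an item that is alone in its cluster is a convention assumed here, and the code assumes at least two clusters.

```python
import math

def silhouette(i, points, labels):
    """Silhouette of item i: (b - a) / max(a, b), ranging from -1 to 1."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for a single-item cluster
    a = sum(same) / len(same)  # mean distance within the item's own cluster
    b = min(                   # mean distance to the nearest neighboring cluster
        sum(math.dist(points[i], p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

def average_silhouette(points, labels):
    """Mean silhouette over all items, used to compare different values of k."""
    return sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)

points = [(0,), (1,), (10,), (11,)]
print(silhouette(0, points, [0, 0, 1, 1]))  # close to 1: item sits in the right cluster
print(silhouette(0, points, [0, 1, 0, 1]))  # negative: item sits in the wrong cluster
print(average_silhouette(points, [0, 0, 1, 1]))
```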

k-means clustering

a partitional, exclusive, and complete clustering approach that assigns all n items in a dataset to one of k clusters, such that the differences within a cluster are minimized while the differences between clusters are maximized

cluster centroid

the average position of the items currently assigned to a cluster

random initialization trap

the challenge that k-means clustering is very sensitive to the initial randomly chosen cluster centers; as a result, different initial centers may lead to different cluster results

gap statistic

the difference in total WCSS for the observed data and the reference dataset

k-means++

the initialization approach used to overcome the random initialization trap

convergence

the state at which the algorithm can no longer improve upon the cluster assignment or the changes become insignificant

gap statistic

the statistical method for choosing the correct number of clusters in which a reference dataset is used

Euclidean distance

the straight-line distance between the coordinates of two points in multidimensional space
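
For reference, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. A minimal sketch follows; the function name `euclidean_distance` is hypothetical, and Python's standard library offers the same calculation as `math.dist`.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```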

