K-Means Clustering

Which of the following is correct about Minkowski distance when p is equal to 2?

It is equal to Euclidean distance
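A quick numerical check of this equivalence, assuming scipy is available:

```python
from scipy.spatial.distance import euclidean, minkowski

u, v = [1.0, 2.0], [4.0, 6.0]

# Minkowski distance with p=2 reduces to the Euclidean distance:
# (|1-4|^2 + |2-6|^2)^(1/2) = 5.0
d_minkowski = minkowski(u, v, p=2)
d_euclidean = euclidean(u, v)
print(d_minkowski, d_euclidean)  # 5.0 5.0
```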

Which of the following algorithms scales linearly in terms of computation?

K-means clustering algorithm - The K-means clustering algorithm scales in the order of n, the number of data points, while the agglomerative clustering algorithm scales in the order of n².

Clustering is an Unsupervised Learning technique.

True - Clustering does not involve separating predictor and target variables, so it is an Unsupervised Learning technique.

Which of the following is/are the weakness of the K-means clustering algorithm? 1. It is susceptible to the curse of dimensionality. 2. It is very computationally expensive. 3. It is sensitive to outliers. 4. Choosing the value of K.

1, 3, and 4 - 1. The K-means clustering algorithm is susceptible to the curse of dimensionality. 2. The algorithm is not very computationally expensive and scales in the order of n, the number of data points. 3. The algorithm involves the computation of means and is therefore sensitive to outliers. 4. The value of K has to be chosen before running the algorithm, but the right value of K is usually not known in advance.

Clustering has application in many areas, such as

All of the above - Some of the important applications of clustering include segmenting customers into similar groups, identifying fraudulent or criminal activity by analyzing clusters, and categorizing among different species of plants and animals by clustering them based on their attributes.

Consider the following statements: 1. Tracking of buying behavior of customers 2. Medical Surveillance Which of the above are applications of dynamic clustering?

Both A and B - Tracking the buying behavior of customers and medical surveillance both involve a continuous influx of data; the new data has to be clustered repeatedly and shifts in the data points tracked. So, both are applications of dynamic clustering.

Consider the following steps: A. Assign objects to their closest cluster center according to the Euclidean distance function. B. Calculate the centroid or mean of all objects in each cluster. C. Select k points at random as cluster centers. D. Repeat steps A and B until the same points are assigned to each cluster in consecutive rounds. What is the correct order of the execution of the steps in the K-means clustering algorithm?

C, A, B, D - In the K-means clustering algorithm, we first select k points at random as cluster centers, then we assign objects to their closest cluster center according to the Euclidean distance function, and then we calculate the centroid or mean of all objects in each cluster. We then repeat the second and third steps until the same points are assigned to each cluster in consecutive rounds.
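The C, A, B, D loop above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: it takes the initial centers as an argument (step C is left to the caller, for reproducibility) and does not handle empty clusters.

```python
import numpy as np

def kmeans(points, centers, max_iter=100):
    """Minimal K-means sketch following steps C, A, B, D."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)  # step C: initial centers, chosen by the caller
    for _ in range(max_iter):
        # Step A: assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step B: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step D: stop once the centers (and hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, labels = kmeans(pts, centers=[[0, 0], [10, 10]])
print(centers)  # approximately [[0, 0.5], [10, 10.5]]
```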

For the points X(x1, y1) and Y(x2, y2), the formula max(|x1-x2|, |y1-y2|) represents which of the following options?

Chebyshev distance
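The formula can be verified directly, and (assuming scipy is available) cross-checked against `scipy.spatial.distance.chebyshev`:

```python
from scipy.spatial.distance import chebyshev

# Chebyshev distance: the maximum absolute coordinate difference.
x, y = (1, 2), (4, 6)
d = max(abs(x[0] - y[0]), abs(x[1] - y[1]))  # max(3, 4) = 4
print(d, chebyshev(x, y))  # both equal 4
```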

In centroid-based clustering, the pairwise distance of all data points is computed and a dendrogram is made based on that.

False - In connectivity-based clustering, the pairwise distance of all data points is computed and a dendrogram is made based on that. In centroid-based clustering, the distance of each point from the cluster centroids is calculated and then points are allotted to the closest centroid.

Euclidean distance is insensitive to outliers.

False - Since it involves squares in its computation, Euclidean distance is sensitive to outliers.

The formula of the silhouette score is s = (b − a) / max(a, b), where a is the average distance of a point from all other points in the same cluster and b is the average distance of a point from all points in the nearest cluster. Which of the following statements is correct?

For a good cluster, a is smaller than b - For a good cluster, the average distance of a point from all other points in the same cluster should be smaller compared to the average distance of a point from all points in the nearest cluster. So, a will be smaller than b.
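A minimal sketch of the per-point silhouette computation, assuming numpy is available (the function name and arguments here are illustrative, not from any library):

```python
import numpy as np

def silhouette(point, own_cluster, nearest_cluster):
    """Silhouette s = (b - a) / max(a, b) for a single point (sketch)."""
    point = np.asarray(point, dtype=float)
    # a: average distance to the other points in the same cluster
    a = np.mean([np.linalg.norm(point - p) for p in own_cluster])
    # b: average distance to all points in the nearest other cluster
    b = np.mean([np.linalg.norm(point - p) for p in nearest_cluster])
    return (b - a) / max(a, b)

s = silhouette([0, 0], own_cluster=[[0, 1], [1, 0]], nearest_cluster=[[5, 5], [6, 5]])
print(s)  # positive and close to 1, since a is much smaller than b
```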

Which of the following statements is true about K-means clustering?

In K-means clustering, two points are similar to each other if they are close to the same centroid. - In K-means clustering, we need to specify the number of clusters before running the clustering algorithm, and two points are similar to each other if they are close to the same centroid.

Which of the following is correct about cdist()?

It computes the distance of every point with the centroid points provided. - The cdist function of scipy computes the distance of every point in one set from every point in another set. As such, it can be used to compute the distance of every point with the centroid points provided.
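This is how cdist is typically used in the assignment step of K-means; the points and centroids below are made-up toy values:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# cdist computes every pairwise distance between the two sets;
# the result has shape (n_points, n_centroids).
dists = cdist(points, centroids)   # Euclidean by default
labels = dists.argmin(axis=1)      # closest centroid for each point
print(labels)  # [0 0 1]
```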

Which of the following distance metrics can account for the relationship between variables?

Mahalanobis distance - In Mahalanobis distance, an ellipsoid (or contour) is created on the principal components of the data, which are its natural axes, and the points that lie on a particular ellipsoid (or contour) are considered equidistant from each other. So points that have some relationship among their variables will be closer than points whose variables have no relationship.
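A small demonstration of that effect on made-up correlated data, assuming scipy is available. Note that `scipy.spatial.distance.mahalanobis` takes the inverse of the covariance matrix, not the covariance matrix itself:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Toy data in which x2 closely tracks x1 (strong correlation).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
data = np.column_stack([x1, x2])

# mahalanobis() expects the INVERSE covariance matrix.
VI = np.linalg.inv(np.cov(data.T))
d_along = mahalanobis([0, 0], [1, 1], VI)    # step along the correlation axis
d_across = mahalanobis([0, 0], [1, -1], VI)  # step against the correlation axis
print(d_along, d_across)  # the off-axis pair is much farther apart
```

Both pairs are the same Euclidean distance apart, but the pair lying along the data's natural axis has a much smaller Mahalanobis distance.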

Consider the following statements regarding Silhouette analysis: I. Silhouette score is a measurement of how similar points within a cluster are to each other compared to how similar those points are to neighboring clusters. II. This measure has a range of [0, 1].

Only I is correct - Silhouette score gives a measure of how similar points within a cluster are to each other compared to how similar those points are to neighboring clusters. This measure has a range of [-1, 1].

Which of the following statements is correct?

The variables used for clustering should be scaled first. - The variables used for clustering should be scaled first in order to ensure that all the variables are on the same scale.

Connectivity-based clustering models are very easy to interpret but they lack scalability as computational complexity goes up quickly as the size of the dataset increases.

True - Connectivity-based clustering models are very easy to interpret as they can be visualized using dendrograms. However, they lack scalability as computational complexity goes up in the order of n², where n is the number of observations.

Dynamic clustering involves clustering again and again as new data comes in.

True - Dynamic clustering refers to the process of clustering data repetitively to account for new data.

For K-means clustering, the cluster centroid of a cluster with the points (1,2), (3,-2), and (5, 3) will be (3, 1).

True - For K-means clustering, the cluster centroid is computed as the mean of the points in the cluster. As the points are (1,2), (3,-2), and (5, 3), the coordinates of the centroid will be C_x = (1+3+5)/3 = 3 C_y = {2+(-2)+3}/3 = 1 So, the centroid will be (3, 1).
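The same arithmetic as a one-line numpy check:

```python
import numpy as np

# The centroid of a cluster is the coordinate-wise mean of its points.
points = np.array([[1, 2], [3, -2], [5, 3]])
centroid = points.mean(axis=0)
print(centroid)  # [3. 1.]
```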

Hierarchical clustering is a connectivity-based clustering algorithm.

True - Hierarchical clustering is a connectivity-based clustering technique in which the core idea is of objects being more related to nearby objects than to objects farther away.

By analyzing successive frames of security video footage, clustering algorithms can detect if a new person has entered an unauthorized area by identifying the change in clusters, and can trigger an alarm in response.

True - If a new person enters an unauthorized area, there will be a shift in points between clusters. By analyzing successive frames of security video footage, clustering algorithms can detect this shift and trigger an alarm in response.

If the silhouette value is close to 1, the sample is well-clustered and points are already assigned to a very appropriate cluster.

True - If a sample is well-clustered and points are already assigned to a very appropriate cluster, the average distance of points from points in the same cluster will be low, and consequently, the silhouette score value will be close to 1.

K-means clustering is centroid-based clustering and uses Euclidean distances.

True - K-means clustering involves assigning points to cluster centroids based on their distance from the centroids and the distance metric used is Euclidean distance.

Manhattan distance takes the sum of the absolute values of the differences of the coordinates.

True - Manhattan distance takes the sum of the absolute values of the differences of the coordinates. For example, if x=(a,b) and y=(c,d), the Manhattan distance between x and y is (|a−c|+|b−d|).
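The example can be verified directly, and (assuming scipy is available) cross-checked against scipy's `cityblock`, which is its name for the Manhattan distance:

```python
from scipy.spatial.distance import cityblock

# Manhattan distance: sum of absolute coordinate differences.
x, y = (1, 2), (4, 6)
d = abs(x[0] - y[0]) + abs(x[1] - y[1])  # |1-4| + |2-6| = 7
print(d, cityblock(x, y))  # both equal 7
```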

Scaling is very important in cluster analysis because not all features have the same range of values, and scaling prevents any feature from becoming dominant due to the high magnitude of its values.

True - Scaling prevents any feature from becoming dominant due to the high magnitude of its values. As such, it is very important in cluster analysis because all the features might not have the same range of values.

The Jaccard distance measures the distance between two sets.

True - The Jaccard index is a measure of similarity between two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two sets of data. Jaccard distance is computed as (1 - Jaccard Index)
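The definition translates directly into a few lines of Python on built-in sets (the function name here is illustrative):

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B|."""
    return 1 - len(a & b) / len(a | b)

# Intersection {2, 3} has 2 elements, union {1, 2, 3, 4} has 4.
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```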

K-means clustering aims to partition the n observations into K clusters to minimize the within-cluster sum of squares (i.e. variance).

True - The aim of K-means clustering is to partition the data into K clusters such that observations within a cluster are similar and between clusters are dissimilar. This is done by minimizing the within-cluster sum of squares, i.e. variance.

The elbow method is used to determine the optimal value of K to perform the K-means clustering algorithm. The basic idea behind this method is that it plots the within-cluster sum of squares (WCSS) with changing K.

True - The elbow method plots the within-cluster sum of squares (WCSS) with changing K, allowing us to identify the point where there is a large change (elbow) in the WCSS. This will be the primary candidate for K.
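The quantity being plotted can be computed directly. A numpy-only sketch on toy data with two obvious groups, using hand-picked centers and labels rather than a full K-means run, shows the sharp drop the elbow method looks for:

```python
import numpy as np

def wcss(points, centers, labels):
    """Within-cluster sum of squares: what the elbow method plots against K."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(((points - centers[labels]) ** 2).sum())

pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])

# K=1: a single center (the overall mean) must cover both groups.
w1 = wcss(pts, pts.mean(axis=0, keepdims=True), np.zeros(4, dtype=int))
# K=2: one center per natural group.
w2 = wcss(pts, np.array([[0, 0.5], [10, 10.5]]), np.array([0, 0, 1, 1]))
print(w1, w2)  # 201.0 1.0 - the sharp drop at K=2 is the "elbow"
```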

The frequency of purchase, the value of purchase, and the recency of purchase can be used simultaneously as the basis for defining clusters of customers for customer segmentation.

True - Using the three features simultaneously can help in determining different segments of customers.

Visual analysis of attributes selected for clustering might give an idea of the range of values of the number of clusters to form.

True - Visual analysis of data is always helpful, especially in the case of K-means clustering, as it might give an idea of the range of values of K, the number of clusters.

Which of the following is NOT true when the zscore function is applied to a variable before clustering?

standard deviation of that variable is 0.5 - When the zscore function is applied to a variable, the shape of its distribution remains the same; the variable is just scaled to have zero mean and unit variance. So its standard deviation will be 1, not 0.5.
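This is easy to confirm with `scipy.stats.zscore` (assuming scipy is available; by default it uses the population standard deviation, ddof=0):

```python
import numpy as np
from scipy.stats import zscore

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = zscore(x)  # (x - mean) / std

print(round(z.mean(), 10), round(z.std(), 10))  # 0.0 1.0
```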

