Week 5 Reading Questions
How do adjusted pairwise precision and recall work?
1. The classification column is a manually created gold standard. 2. Precision and recall values are calculated for each cluster. 3. The precision and recall values are adjusted by a scaling factor based on the size of the respective cluster.
What are some different techniques to determine the k in K-means?
1. The elbow method 2. X-means clustering 3. Information criterion approach 4. An information-theoretic approach 5. The silhouette method 6. Cross-validation 7. Finding the number of clusters in text databases 8. Analyzing the kernel matrix
What are some data preparation requirements for cluster analysis?
1. Do you need to select a random sample of the data for initial analysis? 2. Do you want to remove outliers? 3. Do you need to impute missing values? 4. Do you need to standardize your input? 5. Are there other transformation needs?
What are three important questions to ask when starting Hierarchical Clustering?
1. How do you represent a cluster of more than one point? 2. How do you determine the nearness of clusters? 3. When do you stop combining clusters?
What are different termination conditions that K-means may use?
1. If the new centroid of each cluster is the same as the old centroid, or is sufficiently close, then terminate; otherwise repeat Steps 2 through 4. 2. To optimize performance, other termination conditions (Step 4) are provided by different tools, such as a maximum number of iterations, or stopping when the variance does not improve by more than a threshold.
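The same conditions show up as parameters in common tools. A minimal sketch, assuming scikit-learn and a synthetic dataset, where `max_iter` and `tol` play the roles of the maximum-iterations and minimal-improvement conditions described above:

```python
# Minimal sketch (assumes scikit-learn): max_iter caps the number of iterations,
# and tol declares convergence once the centroids stop moving by more than a
# small tolerance between iterations.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

km = KMeans(
    n_clusters=4,
    max_iter=300,   # termination condition: maximum number of iterations
    tol=1e-4,       # termination condition: centroid movement below a threshold
    n_init=10,
    random_state=42,
)
km.fit(X)
print(km.n_iter_)   # number of iterations actually run before termination
```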
What are the differences between K-means, K-median, K-Medoids, and K-modes?
1. Medians are less sensitive to outliers than means, so k-medians is more robust than k-means. 2. k-medoids chooses actual data points (medoids) as cluster centers, by minimizing the absolute distance between the points and the selected medoid rather than minimizing the squared distance; as a result, it is more robust to noise and outliers than k-means. 3. k-modes replaces means with modes and uses a matching dissimilarity measure, which makes it suitable for categorical data.
How is mutual information different from the regular purity measure, and why is it preferred for cluster evaluation?
1. These measures originated from document clustering. 2. High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each observation gets its own cluster. 3. Mutual information trades off purity against the number of clusters.
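A small illustration of points 2 and 3, assuming scikit-learn and numpy; the `purity` helper and the toy labelings are made up for the example:

```python
# Sketch contrasting purity with normalized mutual information on toy labels.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    # For each cluster, count how many members belong to the majority
    # gold-standard class, then divide by the total number of points.
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

truth = [0, 0, 0, 1, 1, 1]
reasonable = [0, 0, 0, 1, 1, 1]   # 2 clusters matching the gold standard
singletons = [0, 1, 2, 3, 4, 5]   # every point in its own cluster

print(purity(truth, reasonable), purity(truth, singletons))  # 1.0 and 1.0
print(normalized_mutual_info_score(truth, reasonable))       # 1.0
print(normalized_mutual_info_score(truth, singletons))       # well below 1.0
```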
What are different objectives of clustering?
1. Partition the data into natural clusters (i.e., groups) that are relatively homogeneous with respect to the inputs, using some similarity metric. 2. Description of the dataset. 3. Facilitating improved performance of other DM techniques when there are many competing patterns in the data.
How do you know when to stop clustering?
1. Pick a K and stop when you have K clusters. 2. Stop when the next merge of clusters would create a cluster with low cohesion.
What are the requirements for cluster analysis?
1. Scalability 2. Ability to deal with different types of attributes 3. Discovery of clusters with arbitrary shape 4. Minimal requirements for domain knowledge to determine input parameters 5. Ability to deal with noise and outliers 6. Insensitivity to the order of input records 7. Ability to handle high dimensionality 8. Incorporation of user-specified constraints 9. Interpretability and usability
What are the basic steps in cluster analysis?
1. Specify the objectives of clustering and the data features 2. Determine the approach for evaluating clustering results: a. Quantitative: what objective measures? b. Qualitative: are the clusters useful for business purposes? 3. Select similarity metric(s) 4. Determine the type of clustering (e.g., hierarchical, partitional, or fuzzy) 5. Select relevant clustering methods/techniques 6. Select the data preparation approach
Why should you standardize your input for cluster analysis?
1. Variables should be transformed so that equal distances are of equal practical importance. 2. Standardize the input (this can also be done in the SAS EM Cluster node). 3. Variables with large variance tend to have more effect on the resulting clusters. 4. Be careful with nonlinear transformations, as they may change the number of clusters.
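A minimal sketch of points 2 and 3, assuming scikit-learn and synthetic age/income variables chosen only to illustrate a large-variance variable dominating a small-variance one:

```python
# Standardize inputs so income (in dollars) does not dominate age (in years).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=(200, 1))
income = rng.normal(60_000, 15_000, size=(200, 1))
X = np.hstack([age, income])

X_std = StandardScaler().fit_transform(X)   # each column: mean 0, std dev 1
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
```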
How do you start working with the K-means algorithm?
1. Assume Euclidean space. 2. Start by picking K, the number of clusters. 3. Initialize the clusters by picking one point per cluster.
How do you populate clusters with the K-means algorithm?
1. For each point, place it in the cluster whose centroid is nearest. 2. After all points are assigned, update the locations of the centroids of the K clusters. 3. Reassign all points to their closest centroid. 4. Repeat until convergence.
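A bare-bones sketch of that loop, assuming Euclidean distance, numpy, and a naive "first k points" initialization; empty clusters are not handled:

```python
import numpy as np

def kmeans(points, k, max_iter=100):
    centroids = points[:k].copy()                 # pick one point per cluster
    for _ in range(max_iter):
        # assignment step: place each point in the nearest centroid's cluster
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stable
            break
        centroids = new_centroids
    return labels, centroids
```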
Ways to pick the point closest to the other points when selecting a clustroid.
1. Smallest maximum distance to the other points 2. Smallest average distance to the other points 3. Smallest sum of squared distances to the other points.
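A sketch of the three criteria, assuming numpy and a toy 2-D cluster, with Euclidean distance standing in for whatever (typically non-Euclidean) distance is actually in use:

```python
import numpy as np

points = np.array([[0, 0], [1, 0], [0, 1], [5, 5]], dtype=float)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

by_max   = dists.max(axis=1).argmin()         # smallest maximum distance
by_avg   = dists.mean(axis=1).argmin()        # smallest average distance
by_sumsq = (dists ** 2).sum(axis=1).argmin()  # smallest sum of squared distances

print(points[by_max], points[by_avg], points[by_sumsq])
```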
Which of the following algorithms is most sensitive to outliers? A. K-means clustering algorithm B. K-medians clustering algorithm C. K-modes clustering algorithm D. K-medoids clustering algorithm
A - Out of all the options, the K-means clustering algorithm is most sensitive to outliers, as it uses the mean of the cluster's data points to find the cluster center.
What is a dendrogram?
A diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering.
What is another way to define K-means?
A successive refinement of a Voronoi diagram.
What are ways to pick the initial k points that are better than a purely random choice?
Approach 1: sample the data. Approach 2: pick a dispersed set of points.
Naive implementation of hierarchical clustering
At each step, compute the pairwise distances between all pairs of clusters, then merge the closest pair; overall this takes O(N^3) time.
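A naive agglomerative sketch, assuming numpy and centroid distance as the merge criterion; every merge recomputes all pairwise cluster distances, hence roughly O(N^3) work overall:

```python
import numpy as np

def naive_hierarchical(points, k):
    clusters = [[i] for i in range(len(points))]   # start with singleton clusters
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = points[clusters[a]].mean(axis=0)   # centroid of cluster a
                cb = points[clusters[b]].mean(axis=0)   # centroid of cluster b
                d = np.linalg.norm(ca - cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return clusters
```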
How can clusters be characterized?
By what members have in common and by what separates members from the population as a whole.
How can you represent a cluster of many points?
Centroid - average of its points.
Do you need to remove outliers for cluster analysis?
Not necessarily; clustering is able to deal with outliers, although some algorithms (k-means in particular) are sensitive to them.
What is the output produced by clustering?
Either a class label (cluster 1, 2, 3, ...) or a relative score. The underlying output is groups of records that are close to each other and far from records in other clusters.
Central idea of clustering
Finding groups of records that are close to each other and far from records in other clusters.
How do you calculate the Manhattan distance?
The Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 - x2| + |y1 - y2|, i.e., the sum of the absolute differences of the coordinates. (A related exercise: given n integer coordinates, find the sum of the Manhattan distances between all pairs of points.)
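A tiny sketch in plain Python, generalized to any number of coordinates:

```python
def manhattan(p, q):
    # Sum of absolute differences across all coordinates.
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))   # |1-4| + |2-6| = 7
```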
Why can highly correlated variables be an issue for k-means?
If two variables are perfectly correlated, they effectively represent the same concept. But that concept is now represented twice in the data and hence gets twice the weight of all the other variables. The final solution is likely to be skewed in the direction of that concept, which could be a problem if it's not anticipated. In the case of multiple variables and multicollinearity, the analysis is in effect being conducted on some unknown number of concepts that are a subset of the actual number of variables being used in the analysis.
The goal of the clustering algorithm is to find what?
K points that make good cluster centers.
How do you measure the nearness of clusters?
Measure the distance between their centroids.
Do automatic cluster detection techniques require a target variable?
No, a target variable is not required.
How do you visually inspect for the correct K?
Plot the evaluation measure against K and look for a K near the knee (elbow) of the curve.
Convergence
Points don't move between clusters and centroids are stable.
How do you define cohesion?
Set a threshold based on: 1. The diameter of the merged cluster (the maximum distance between any two points in the cluster). 2. The radius (the maximum distance of points from the centroid). 3. A density approach: divide the number of points in the cluster by the diameter or radius of the cluster.
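A sketch of the three measures for a single toy cluster, assuming numpy:

```python
import numpy as np

cluster = np.array([[0, 0], [1, 1], [2, 0], [1, -1]], dtype=float)

pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
diameter = pairwise.max()                                   # max distance between any two points
centroid = cluster.mean(axis=0)
radius = np.linalg.norm(cluster - centroid, axis=1).max()   # max distance from the centroid
density = len(cluster) / diameter                           # points per unit of diameter
print(diameter, radius, density)
```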
What does a cluster's silhouette value measure?
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering algorithm. After first iteration clusters, C1, C2, C3 has following observations: C1: {(2,2), (4,4), (6,6)} C2: {(0,4), (4,0)} C3: {(5,5), (9,9)} What will be the cluster centroids if you want to proceed for second iteration? A. C1: (4,4), C2: (2,2), C3: (7,7) B. C1: (6,6), C2: (4,4), C3: (9,9) C. C1: (2,2), C2: (0,0), C3: (5,5) D. None of these
Solution: (A) Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4) Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2) Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7) Hence, C1: (4,4), C2: (2,2), C3: (7,7)
Which of the following methods is used for finding the optimal number of clusters in the K-means algorithm?
Solution: (A) Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data.
Which of the following can be applied to get good results for the K-means algorithm (i.e., results corresponding to the global minimum)? 1. Run the algorithm with different centroid initializations 2. Adjust the number of iterations 3. Find the optimal number of clusters
Solution: (D) All of these are standard practices that are used in order to obtain good clustering results.
What is true about K-means clustering? 1. K-means is extremely sensitive to cluster center initialization 2. Bad initialization can lead to poor convergence speed 3. Bad initialization can lead to bad overall clustering Options: A. 1 and 3 B. 1 and 2 C. 2 and 3 D. 1, 2 and 3
Solution: (D) All three of the given statements are true. K-means is extremely sensitive to cluster center initialization. Also, bad initialization can lead to poor convergence speed as well as bad overall clustering.
In which of the following cases will K-Means clustering fail to give good results? 1. Data points with outliers 2. Data points with different densities 3. Data points with round shapes 4. Data points with non-convex shapes Options: A. 1 and 2 B. 2 and 3 C. 2 and 4 D. 1, 2 and 4 E. 1, 2, 3 and 4
Solution: (D) K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.
What is the problem with clustering in a non-Euclidean space?
The average of the points may not be meaningful (there may be no true centroid), so an actual data point, the clustroid, is used instead.
What does the K in K-means represent?
The number of clusters you want.
In K means, the best assignment of cluster centers can be defined as
The one that minimizes the sum of the distance from every data point to its nearest cluster center (or perhaps the distance squared).
What is the definition of a cluster's "typical member" (the centroid)?
The one that has the average value in each of the cluster's dimensions.
How do you create a silhouette score for an entire cluster or dataset?
The score for an entire cluster is calculated as the average of the scores of its members; likewise, the score for the entire dataset is the average over all records. This measures the degree of similarity of cluster members.
What does fuzzy cluster membership mean?
a form of clustering in which each data point can belong to more than one cluster.
How do you measure the distance between clusters using clustroids?
Treat clustroids as if they are centroids when computing intercluster distances.
When using K-means, how do you pick K?
Try different values of K, looking at the change in the average distance to the centroid as K increases.
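A sketch of that procedure, assuming scikit-learn and synthetic data; the RMS distance derived from `inertia_` is used here as a rough proxy for the average distance to centroid:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    avg_dist = np.sqrt(km.inertia_ / len(X))   # rough average-distance proxy
    print(k, round(avg_dist, 3))               # look for the knee in this curve
```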
Which is an existing data point: the centroid or the clustroid?
The clustroid.
What makes the K-means algorithm attractive?
It works on large data sets.
Voronoi diagram
a diagram whose lines mark the points that are equidistant from the two nearest seeds.
How does the Rand Index (RI) work?
It is a measure of the similarity between two data clusterings: it counts the pairs of points that are placed in the same cluster in both clusterings or in different clusters in both clusterings, and divides by the total number of pairs.
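A sketch of the pair-counting idea in plain Python; the labelings are toy examples:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which the two clusterings agree.
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)   # both together or both apart
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0 -- identical partitions
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # lower agreement
```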
Voronoi cells
a partitioning of a plane into regions based on distance to points in a specific subset of the plane
What are seeds in the K-means algorithm?
In a Voronoi partitioning of the plane, the set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells. In K-means, the seeds are the initial cluster centers.
BFR algorithm, named after its inventors Bradley, Fayyad and Reina
a variant of k-means algorithm that is designed to cluster data in a high-dimensional Euclidean space. It makes a very strong assumption about the shape of clusters: they must be normally distributed about a centroid. The mean and standard deviation for a cluster may differ for different dimensions, but the dimensions must be independent.
Describe the K-means algorithm.
aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
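A minimal usage sketch, assuming scikit-learn and synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

print(km.labels_[:10])        # cluster assignment of the first 10 observations
print(km.cluster_centers_)    # the k means serving as cluster prototypes
```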
In K-means, what is done in the assignment step?
assign each record to its closest cluster seed to form initial clusters
After initial seed selection, K-means alternates between which two steps?
assignment step and update step
Hard clustering assignment
assigns each record to a single cluster
Soft clustering assignment
associates each record with several clusters, with varying degrees of strength.
How should a point "outside" a cluster be defined?
beyond some threshold distance from the cluster as measured by single linkage
What is the best way to pick K when there is a range of acceptable values?
Build clusters using each value of K and then evaluate the resulting clusters. The evaluation may be subjective, or it may be based on technical criteria such as the ratio of the average intra-cluster distance to the average inter-cluster distance, or the cluster silhouette.
A popular application of clustering used in a case study in the book
Customer segmentation, which is useful for targeting cross-sell offers, focusing retention efforts, customizing messaging, and other purposes. Such targeted efforts work better than a one-size-fits-all approach.
What is a clustroid?
The data point closest to the other data points in the cluster (used with non-Euclidean measures).
K-means clustering relies on what approach?
interpretation of data as points in space. The distance between two points depends on their representation, so cluster detection has data preparation requirements.
What is the Rand Index (RI)?
It is a measure of the similarity between two data clusterings.
A way to characterize a cluster
look at a typical member and then ask which features of the typical member are most different from the overall population.
In K-means, each record is considered a point in a scatter plot, which in turn implies that all the input variables are: a. numeric, b. categorical, or c. either?
numeric
Cluster detection does what
provides a way to learn about the structure of complex data; to break up the cacophony of competing signals into simpler components.
Why is initial seed selection important for k-means clustering?
Repeatability and reproducibility of clustering results. The K-means algorithm does not guarantee a unique clustering, and every different choice of initial cluster centers may lead to different clustering results.
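A sketch of the effect, assuming scikit-learn: k-means is run with purely random initialization and a single start per seed, and the within-cluster sum of squares (inertia) is compared across seeds; it may differ when the algorithm lands in different local optima.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=3)

for seed in range(5):
    km = KMeans(n_clusters=5, n_init=1, init="random", random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))   # may differ across seeds (local optima)
```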
What is a centroid
the average of points
In K-means, what is done in the update step?
the centroid of each cluster is calculated. The centroid is simply the average position of cluster members in each dimension.
What distinguishes clustering from other classification techniques
the classes are detected automatically instead of being provided in the form of a categorical target.
What is single linkage?
the distance to the nearest cluster member.
What is directed clustering?
the goal is to discover clusters that have different distributions of one or several targets. Unlike a true directed technique, the clustering targets have no direct influence on the clusters discovered.
If different choices of initial seed positions yield radically different clusters what does it mean
the k-means algorithm is not finding stable clusters. This could be because there are no good clusters to be found or because a different choice for K is needed or clusters are present but the k-means algorithm is unable to detect them.
How do you use the silhouette score to determine the best value for K in k-means?
Try each value of k in the acceptable range and choose the one that yields the best silhouette.
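A sketch of that procedure, assuming scikit-learn and synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 9):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)     # mean silhouette over all points
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```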
What type of learning approach is K-means?
An undirected (unsupervised) technique.
When is semi-directed clustering a good option?
when there is more than one target.