Week 5 Reading Questions

How do adjusted pairwise precision and recall work?

1. The classification column is a manually created gold standard. 2. Precision and recall values are calculated for each cluster. 3. The precision and recall values are adjusted by a scaling factor based on the size of the respective cluster.

What are some different techniques to determine the k in K-means?

1. The elbow method 2. X-means clustering 3. Information criterion approach 4. An information-theoretic approach 5. The silhouette method 6. Cross-validation 7. Finding the number of clusters in text databases 8. Analyzing the kernel matrix
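
One way to visualize the elbow method is sketched below with scikit-learn; the random data `X` and the range of k values are placeholders for illustration, not part of the original card.

```python
# Sketch of the elbow method: plot the within-cluster sum of squares (inertia)
# against k and look for the "knee" where further improvement levels off.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```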

What are some data preparation requirements for cluster analysis?

1. Do you need to select a random sample of the data for initial analysis? 2. Do you want to remove outliers? 3. Do you need to impute missing values? 4. Do you need to standardize your input? 5. Are there other transformation needs?

What are three important questions to ask when starting Hierarchical Clustering?

1. How do you represent a cluster of more than one point? 2. How do you determine the nearness of clusters? 3. When do you stop combining clusters?

What are different termination conditions that the K-means may use?

1. If the new centroid of each cluster is the same as the old centroid, or is sufficiently close, then terminate; otherwise repeat Steps 2 through 4. 2. To optimize performance, different tools provide other termination conditions (at Step 4), such as reaching a maximum number of iterations or the variance failing to improve by more than a threshold.
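
For illustration, scikit-learn's KMeans exposes two such tool-provided termination conditions, max_iter and tol; the values below are a sketch, not prescribed by the card.

```python
# Tool-provided termination conditions: max_iter caps the number of
# assignment/update iterations, and tol stops early when the centroids
# move less than the given threshold between iterations.
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=3,
    max_iter=300,   # stop after at most 300 iterations
    tol=1e-4,       # stop when centroid movement falls below this threshold
    n_init=10,
    random_state=0,
)
# km.fit(X)  # X would be any numeric array of shape (n_samples, n_features)
```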

What are the differences between K-means, K-median, K-Medoids, and K-modes?

1. Medians are less sensitive to outliers than means, which is the basis of k-medians. 2. K-medoids chooses actual data points (medoids) as cluster centers by minimizing the absolute distance between the points and the selected center, rather than minimizing the squared distance; as a result, it is more robust to noise and outliers than k-means. 3. K-modes uses the mode instead of the mean, making it suitable for categorical data.

How is mutual information different from the regular purity measure, and why is it preferred for cluster evaluation?

1. Originated from document clustering. 2. High purity is easy to achieve when the number of clusters is large; in particular, purity is 1 if each observation gets its own cluster. 3. Mutual information trades off purity against the number of clusters.
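
A rough sketch of this trade-off, assuming scikit-learn is available; the purity helper and the toy label arrays are made up for illustration.

```python
# Contrast purity with normalized mutual information (NMI) on toy labels.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_labels):
    """Fraction of points assigned to the majority true class of their cluster."""
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

true = np.array([0, 0, 0, 1, 1, 1])

one_per_point = np.arange(6)             # every point in its own cluster
print(purity(true, one_per_point))        # 1.0 -- trivially perfect purity
print(normalized_mutual_info_score(true, one_per_point))  # well below 1: penalized

two_clusters = np.array([0, 0, 0, 1, 1, 1])
print(purity(true, two_clusters))                          # 1.0
print(normalized_mutual_info_score(true, two_clusters))    # 1.0
```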

What are different objectives of clustering?

1. Partition the data into natural clusters (i.e., groups) that are relatively homogeneous with respect to the input, using some similarity metric. 2. Description of the dataset. 3. Facilitating improvement in the performance of other data mining techniques when there are many competing patterns in the data.

How do you know when to stop clustering?

1. Pick a K and stop when you have K clusters. 2. Stop when the next merge of clusters would create a cluster with low cohesion.

What are the requirements for cluster analysis?

1. Scalability 2. Ability to deal with different types of attributes 3. Discovery of clusters with arbitrary shape 4. Minimal requirements for domain knowledge to determine input parameters 5. Able to deal with noise and outliers 6. Insensitive to order of input records 7. High dimensionality 8. Incorporation of user-specified constraints 9. Interpretability and usability

What are the basic steps in cluster analysis?

1. Specify objectives of clustering and data features. 2. Determine the approach for evaluating clustering results: a. Quantitative: what objective measures? b. Qualitative: are the clusters useful for business purposes? 3. Select similarity metric(s). 4. Determine the type of clustering (e.g., hierarchical, partitional, or fuzzy). 5. Select relevant clustering methods/techniques. 6. Select a data preparation approach.

Why should you standardize your input for cluster analysis?

1. Variables should be transformed so that equal distances have equal practical importance. 2. Standardize input (this can also be done in the SAS EM Cluster node). 3. Variables with large variance tend to have more effect on the resulting clusters. 4. Be careful with nonlinear transformations, as they may change the number of clusters.
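
A minimal sketch of standardizing inputs before clustering with scikit-learn; the two-column toy data (age, income) is an assumption for illustration.

```python
# Standardize inputs so variables with large variance (income in dollars)
# don't dominate the distance calculation over small-scale ones (age in years).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[25, 50_000],
              [32, 64_000],
              [47, 58_000],
              [51, 120_000]], dtype=float)

X_std = StandardScaler().fit_transform(X)   # mean 0, std 1 per column
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(labels)
```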

How do you start working with the K-means algorithm?

1. Assume Euclidean space. 2. Start by picking K, the number of clusters. 3. Initialize clusters by picking one point per cluster.

How do you populate clusters with the K-means algorithm?

1. For each point, place it in the cluster whose centroid is nearest. 2. After all points are assigned, update the locations of the centroids of the K clusters. 3. Reassign all points to their closest centroid. 4. Repeat until convergence.
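
A minimal NumPy sketch of this assign/update loop, assuming Euclidean distance and ignoring edge cases such as empty clusters.

```python
# Minimal k-means loop: pick one data point per cluster as the initial
# centroids, assign each point to the nearest centroid, recompute centroids
# as the average of their members, and repeat until assignments stop changing.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # one point per cluster
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                       # convergence: no point changed cluster
        labels = new_labels
        # update step: centroid = average position of the cluster's members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```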

Ways to pick the point closest to the other points when choosing a clustroid.

1. Smallest maximum distance to the other points. 2. Smallest average distance to the other points. 3. Smallest sum of squared distances to the other points.

Which of the following algorithms is most sensitive to outliers? A. K-means clustering algorithm B. K-medians clustering algorithm C. K-modes clustering algorithm D. K-medoids clustering algorithm

A - Out of all the options, the K-means clustering algorithm is most sensitive to outliers, as it uses the mean of the cluster's data points to find the cluster center.

What is a dendrogram?

A diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering.

What is another way to define K-means?

A successive refinement of a Voronoi diagram.

Ways to pick the initial K points that are not random.

Approach 1: sample the data. Approach 2: pick a dispersed set of points.

Naive implementation of hierarchical clustering

At each step, compute the pairwise distances between all pairs of clusters, then merge the closest pair; this is O(N^3) overall.
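
A rough sketch of such a naive approach in Python; merging by centroid distance is an assumption here, since the card does not specify a linkage criterion.

```python
# Naive agglomerative clustering: repeatedly compute all pairwise cluster
# distances and merge the closest pair (roughly O(N^3) over the whole run).
import numpy as np
from itertools import combinations

def naive_hierarchical(X, target_k):
    clusters = [[i] for i in range(len(X))]        # start: every point alone
    while len(clusters) > target_k:
        centroids = [X[c].mean(axis=0) for c in clusters]
        # find the closest pair of clusters by centroid distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: np.linalg.norm(centroids[p[0]] - centroids[p[1]]))
        clusters[i] = clusters[i] + clusters[j]     # merge cluster j into i
        del clusters[j]
    return clusters
```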

How can clusters be characterized?

By what members have in common and by what separates members from the population as a whole.

How can you represent a cluster of many points?

Centroid - average of its points.

Do you need to remove outliers for cluster analysis?

Not necessarily; clustering is able to deal with outliers.

What is the output produced by clustering?

Either a class label (cluster 1, 2, 3) or a relative score; it identifies groups of records that are close to each other and far from records in other clusters.

Central idea of clustering

Finding groups of records that are close to each other and far from records in other clusters.

How do you calculate the Manhattan distance?

The Manhattan distance between two points (x1, y1) and (x2, y2) is |x1 - x2| + |y1 - y2|. Given n integer coordinates, the total pairwise Manhattan distance is the sum of this quantity over all pairs of coordinates.
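
A small worked example in Python; the coordinate list is made up for illustration.

```python
# Pairwise Manhattan distance: |x1 - x2| + |y1 - y2| for each pair, summed.
from itertools import combinations

points = [(1, 2), (4, 6), (7, 1)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

total = sum(manhattan(p, q) for p, q in combinations(points, 2))
print(total)   # (3+4) + (6+1) + (3+5) = 22
```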

Why can highly correlated variables be an issue for k-means?

If two variables are perfectly correlated, they effectively represent the same concept. But that concept is now represented twice in the data and hence gets twice the weight of all the other variables. The final solution is likely to be skewed in the direction of that concept, which could be a problem if it's not anticipated. In the case of multiple variables and multicollinearity, the analysis is in effect being conducted on some unknown number of concepts that are a subset of the actual number of variables being used in the analysis.

The goal of the clustering algorithm is to find what?

K points that make good cluster centers.

How do you measure the nearness of clusters?

Measure the distance between centroids.

Do automatic cluster detection techniques require a target variable?

No, a target variable is not required.

How do you visually inspect for the correct K?

Plot the evaluation measure against K and look for a K near the knee of the curve.

Convergence

Points don't move between clusters and centroids are stable.

How do you define cohesion?

Set a threshold based on: 1. The diameter of the merged cluster (the maximum distance between any two points in the cluster). 2. The radius (the maximum distance of points from the centroid). 3. A density approach: divide the number of points in the cluster by the diameter or radius of the cluster.

What does a cluster's silhouette value measure?

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

Assume you want to cluster 7 observations into 3 clusters using the K-means clustering algorithm. After the first iteration, clusters C1, C2, and C3 have the following observations: C1: {(2,2), (4,4), (6,6)} C2: {(0,4), (4,0)} C3: {(5,5), (9,9)} What will the cluster centroids be if you proceed to the second iteration? A. C1: (4,4), C2: (2,2), C3: (7,7) B. C1: (6,6), C2: (4,4), C3: (9,9) C. C1: (2,2), C2: (0,0), C3: (5,5) D. None of these

Solution: (A) Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4) Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2) Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7) Hence, C1: (4,4), C2: (2,2), C3: (7,7)
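
A quick NumPy check of the centroid arithmetic above, using the cluster coordinates from the question.

```python
# Each centroid is simply the column-wise mean of the cluster's points.
import numpy as np

C1 = np.array([(2, 2), (4, 4), (6, 6)])
C2 = np.array([(0, 4), (4, 0)])
C3 = np.array([(5, 5), (9, 9)])

for name, cluster in [("C1", C1), ("C2", C2), ("C3", C3)]:
    print(name, cluster.mean(axis=0))   # C1 [4. 4.], C2 [2. 2.], C3 [7. 7.]
```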

Which of the following methods is used for finding the optimal number of clusters in the K-means algorithm?

Solution: (A) Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data.

Which of the following can be applied to get good results for the K-means algorithm (i.e., results corresponding to global minima)? 1. Try running the algorithm with different centroid initializations 2. Adjust the number of iterations 3. Find the optimal number of clusters

Solution: (D) All of these are standard practices that are used in order to obtain good clustering results.

What is true about K-means clustering? 1. K-means is extremely sensitive to cluster center initialization 2. Bad initialization can lead to poor convergence speed 3. Bad initialization can lead to bad overall clustering. Options: A. 1 and 3 B. 1 and 2 C. 2 and 3 D. 1, 2 and 3

Solution: (D) All three of the given statements are true. K-means is extremely sensitive to cluster center initialization, and bad initialization can lead to poor convergence speed as well as bad overall clustering.

In which of the following cases will K-Means clustering fail to give good results? 1. Data points with outliers 2. Data points with different densities 3. Data points with round shapes 4. Data points with non-convex shapes Options: A. 1 and 2 B. 2 and 3 C. 2 and 4 D. 1, 2 and 4 E. 1, 2, 3 and 4

Solution: (D) K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different and the data points follow non-convex shapes.

What is the problem with clustering in non-Euclidean space?

The average of the points may not be a meaningful center, so a clustroid (an actual data point) is used instead of a centroid.

What does the K in K-means mean?

The number of clusters you want.

In K means, the best assignment of cluster centers can be defined as

The one that minimizes the sum of the distance from every data point to its nearest cluster center (or perhaps the distance squared).

What is the definition of a cluster's typical member (its centroid)?

The one that has the average value in each of the cluster's dimensions.

How do you create a silhouette score for an entire dataset?

The score for an entire cluster is calculated as the average of the scores of its members, and the score for the whole dataset is the average over all records. This measures the degree of similarity of cluster members.

What does fuzzy cluster membership mean?

A form of clustering in which each data point can belong to more than one cluster.

How do you measure distances between clustroids?

Treat clustroids as if they are centroids when computing intercluster distances.

When using K-means, how do you pick K?

Try different values of K, looking at the change in the average distance to the centroid as K increases.

Which is an existing data point: the centroid or the clustroid?

The clustroid.

What makes K-means algorithms attractive?

They work on large data sets.

Voronoi diagram

a diagram whose lines mark the points that are equidistant from the two nearest seeds.

What does the Rand Index (RI) measure?

a measure of the similarity between two data clusterings.
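
For illustration, scikit-learn provides a chance-corrected version of the Rand Index; the label arrays below are made up (plain rand_score is also available in recent scikit-learn versions).

```python
# Compare two clusterings of the same records; adjusted_rand_score corrects
# the Rand Index for chance agreement and is invariant to label renaming.
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]   # same grouping, different label names
print(adjusted_rand_score(labels_a, labels_b))   # 1.0 -- identical clusterings
```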

Voronoi cells

a partitioning of a plane into regions based on distance to points in a specific subset of the plane

What are seeds in K-means algorithms?

a partitioning of a plane into regions based on distance to points in a specific subset of the plane. That set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells.

BFR algorithm, named after its inventors Bradley, Fayyad and Reina

a variant of k-means algorithm that is designed to cluster data in a high-dimensional Euclidean space. It makes a very strong assumption about the shape of clusters: they must be normally distributed about a centroid. The mean and standard deviation for a cluster may differ for different dimensions, but the dimensions must be independent.

Describe the K-means algorithm.

aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

In K-means, what is done in the assignment step?

assign each record to its closest cluster seed to form initial clusters

After initial seed selection, K-means alternates between which two steps?

the assignment step and the update step

Hard clustering assignment

assigns each record to a single cluster

Soft clustering assignment

associates each record with several clusters with varying degrees of strength.
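
One common way to obtain soft assignments is a Gaussian mixture model; this sketch with scikit-learn's GaussianMixture is an illustration of soft clustering, not something the card prescribes, and the data is a placeholder.

```python
# A Gaussian mixture reports the probability of membership in each cluster
# rather than a single hard label per record.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = gm.predict_proba(X)       # shape (n_samples, 2), rows sum to 1
print(memberships[:3].round(3))
```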

What should "outside the cluster" be defined as?

beyond some threshold distance from the cluster as measured by single linkage

What is the best way to pick K when there is a range of acceptable values?

build clusters using each value of K and then evaluate the resulting clusters. The evaluation may be subjective, or it may be based on technical criteria such as the ratio of the average intra-cluster distance to the average inter-cluster distance, or the cluster silhouette.

A popular application of clustering used in a case study in the book

customer segmentation, which is useful for targeting cross-sell offers, focusing retention efforts, customizing messaging, and other purposes. Segment-specific efforts work better than a one-size-fits-all approach.

What is a clustroid?

the data point closest to the other data points (used with non-Euclidean measures)

K-means clustering relies on what approach?

interpretation of data as points in space. The distance between two points depends on their representation, so cluster detection has data preparation requirements.

What is the Rand Index (RI)?

a measure of the similarity between two data clusterings

A way to characterize a cluster

look at a typical member and then ask which features of the typical member are most different from the overall population.

In K-means, each record is considered a point in a scatter plot, which in turn implies that all the input variables are: a. numeric, b. categorical, or c. either?

a. numeric

What does cluster detection do?

provides a way to learn about the structure of complex data; to break up the cacophony of competing signals into simpler components.

Why is initial seed selection important for k-means clustering?

repeatability and reproducibility of clustering results. The K-means algorithm does not guarantee a unique clustering, and every different choice of initial cluster centers may lead to different clustering results.

What is a centroid

the average of points

In K-means, what is done in the update step?

the centroid of each cluster is calculated. The centroid is simply the average position of cluster members in each dimension.

What distinguishes clustering from other classification techniques?

the classes are detected automatically instead of being provided in the form of a categorical target.

What is single linkage?

the distance to the nearest cluster member.

What is directed clustering?

the goal is to discover clusters that have different distributions of one or several targets. Unlike a true directed technique, the clustering targets have no direct influence on the clusters discovered.

If different choices of initial seed positions yield radically different clusters, what does that mean?

the k-means algorithm is not finding stable clusters. This could be because there are no good clusters to be found or because a different choice for K is needed or clusters are present but the k-means algorithm is unable to detect them.

How do you use the silhouette score to determine the best value for K in k-means?

try each value of K in the acceptable range and choose the one that yields the best silhouette.
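
A minimal sketch of this selection loop with scikit-learn; the placeholder data and the candidate range of k are assumptions for illustration.

```python
# Fit k-means for each candidate k and keep the k with the highest
# average silhouette score over all points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

scores = {}
for k in range(2, 7):                      # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```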

What type of learning approach is K-means?

An undirected (unsupervised) technique is used in K-means.

When is semi-directed clustering a good option?

when there is more than one target.

