Cluster Analysis

The main input into any cluster analysis procedure is ...

a measure of distance between cases (customers, patients, etc.) being clustered

Cluster analysis groups data objects based on ...

- Groups data objects based on information found in the data that describes the objects and their relationships, with the goal of differentiating well between the objects.
- The identified clusters should differ substantially from each other.
- The similarity of the records within a cluster is maximized (low within-cluster variation, WCV) and the similarity to records outside the cluster is minimized (high between-cluster variation, BCV).

Centroid distance

- The distance between two clusters is the distance between the two cluster centroids.
- The centroid is the vector of variable averages for all records in a cluster.
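
The centroid definition above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the sample records are made up for the example, and Euclidean distance between centroids is an assumption (the card does not name the metric).

```python
import math

def centroid(records):
    """Return the vector of per-variable averages for the records in a cluster."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))

def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters (assumed metric)."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters, each record has two numeric variables.
cluster_a = [(1.0, 2.0), (3.0, 4.0)]
cluster_b = [(7.0, 8.0), (9.0, 14.0)]

print(centroid(cluster_a))
print(centroid(cluster_b))
print(centroid_distance(cluster_a, cluster_b))
```

Note that a centroid does not have to coincide with any actual record: here neither centroid is one of the input points.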

Major Issues in Cluster Analysis

- How to measure similarity
- How to recode categorical variables
- How to standardize or normalize numerical variables
- How many clusters we expect to uncover

Summary of k-Means

- Non-hierarchical clustering is computationally cheaper and more stable than hierarchical clustering
- The number of desired clusters, K, is chosen by the user
- Practical considerations usually dominate the choice of K; there is no statistically determined optimal number of clusters
- The algorithm develops clusters by iteratively assigning records to the nearest cluster mean until cluster assignments do not change
- Be wary of chance results; data may not have definitive "real" clusters
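
The iterative assignment loop described above can be sketched as follows. This is a bare-bones illustration, not a production implementation: the function name `k_means`, its parameters, and the sample points are hypothetical, and picking K random records as the initial means is an assumption (the cards do not say how the means are initialized).

```python
import math
import random

def k_means(records, k, max_iter=100, seed=0):
    """Assign each record to the nearest cluster mean until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: start from k distinct records chosen at random.
    centroids = rng.sample(records, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every record.
        new_labels = [
            min(range(k), key=lambda c: math.dist(record, centroids[c]))
            for record in records
        ]
        if new_labels == labels:
            break  # assignments did not change, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the starting means are random, the cluster numbering (which group is 0 and which is 1) depends on the seed, and on less clear-cut data different seeds can produce different solutions, which is one reason to be wary of chance results.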

Distance Measures

- The objective of a distance measure is to quantify the difference between two cases on the variables used for segmentation
- A shorter distance implies similar preferences on the segmentation variables; a longer distance implies dissimilar preferences
- All numeric data can be clustered using a variety of distance measures

In summary: k-Means Clustering

- There should be low levels of collinearity among the variables
- Using it on ordinal data might result in some distortions
- Requires that the number of clusters be pre-specified

Goal of Cluster Analysis

Aims to find useful / meaningful groups of objects (clusters), where usefulness is defined by the goals of the data analysis

Agglomerative

Begin with n clusters (i.e., each single observation initially represents a distinct cluster) and sequentially merge similar clusters until a single cluster is left
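
The bottom-up merging described above can be sketched as a naive loop. This is an illustration under stated assumptions: the names `agglomerate` and `single_linkage` are hypothetical, single linkage with Euclidean distance is chosen only to have a concrete similarity rule, and the brute-force search is meant for tiny examples.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of records, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(records, linkage=single_linkage):
    """Start with one cluster per record and merge the closest pair until one cluster is left."""
    clusters = [[record] for record in records]
    merges = []  # history of (cluster, cluster) pairs in merge order
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical one-variable records.
records = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges = agglomerate(records)
for left, right in merges:
    print(left, "+", right)
```

With n records the loop always performs n - 1 merges; the merge history is the information a dendrogram visualizes.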

Silhouette ratings

Can range between -1 (indicating a very poor model) and 1 (indicating an excellent model)

Value of cluster analysis:

- Cluster analysis is a useful exploratory tool.
- Cluster analysis can also be used to eliminate highly related variables (by grouping variables).
- Hierarchical clustering gives a visual representation of different levels of clustering.
- On the other hand, due to its non-iterative nature, hierarchical clustering can be unstable, can vary highly depending on settings, and is computationally expensive.

Separation

How distant the clusters are from each other

Cohesion

How tightly related the records are within the individual clusters

k-means clustering requires that data be _______ as the procedure relies on Euclidean distances.

Interval or ratio-scaled

k-Means Clustering can be applied to very ____ datasets

Large. Useful for sample sizes above 500.

Complete linkage

Maximum distance between two clusters, i.e., the distance between the pair of records Ai and Bj that are farthest from each other

Single linkage

Minimum distance between two clusters, i.e., the distance between the pair of records Ai and Bj that are closest

Cluster

Observations belonging to one group are:
- similar to one another within the same cluster, and
- dissimilar to the observations in other clusters

Euclidean Distance

- One of the most common distance measures
- If two cases are close in the geometric sense, then they represent similar data in the database
- It is customary to normalize (standardize) before computing the Euclidean distance so that variables with larger scales do not dominate the analysis
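
A small sketch of why standardizing first matters. The customer records (age, income) are invented for illustration, and z-score standardization with the population standard deviation is one common choice assumed here, not the only way to normalize.

```python
import math
import statistics

def standardize(records):
    """Convert every variable to z-scores (assumes no variable is constant)."""
    columns = list(zip(*records))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(record, means, sds))
        for record in records
    ]

# Hypothetical customers: (age in years, income in dollars).
customers = [(25.0, 40000.0), (30.0, 42000.0), (55.0, 41000.0)]

raw_01 = math.dist(customers[0], customers[1])
raw_02 = math.dist(customers[0], customers[2])

scaled = standardize(customers)
std_01 = math.dist(scaled[0], scaled[1])
std_02 = math.dist(scaled[0], scaled[2])

print("raw:         ", raw_01, raw_02)
print("standardized:", std_01, std_02)
```

On the raw data, income dominates: customer 0 looks closer to customer 2 despite a 30-year age gap. After standardization both variables contribute on the same scale, and customer 0 is closer to customer 1 instead.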

How to Determine Inter-Cluster Similarity

Single linkage, complete linkage, average linkage, centroid distance
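
The pairwise linkage rules can be compared side by side on two small clusters. This is a sketch with hypothetical function names and data; the cards do not define average linkage, so the usual definition (the mean of all pairwise distances between records Ai and Bj) is assumed, with Euclidean distance throughout.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """All distances between a record Ai in cluster A and a record Bj in cluster B."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Minimum distance: the closest pair of records."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Maximum distance: the farthest pair of records."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances (assumed definition)."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters with two records each.
cluster_a = [(0.0, 0.0), (0.0, 4.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

print("single:  ", single_linkage(cluster_a, cluster_b))
print("complete:", complete_linkage(cluster_a, cluster_b))
print("average: ", average_linkage(cluster_a, cluster_b))
```

The same two clusters get a different distance under each rule, which is why the linkage choice can change which clusters a hierarchical procedure merges first.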

Clusters should have ____ within-cluster variation compared to the between-cluster variation.

Small

Interpretation of Average Silhouette Value

The average silhouette value over all records yields a measure of how well the cluster solution fits.
- 0.5 or better provides good evidence of the reality of the clusters in the data.
- 0.25 to 0.5 provides some evidence of the reality of the clusters in the data. Hopefully, domain-specific knowledge can be brought to bear to support the reality of the clusters.
- Less than 0.25 provides scant evidence of cluster reality.
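
A sketch of how the average silhouette value can be computed. The cards do not give the formula, so the standard definition is assumed: for each record, a is the mean distance to the other records in its own cluster (cohesion), b is the lowest mean distance to the records of any other cluster (separation), and the silhouette is (b - a) / max(a, b). Function names and data are hypothetical, and assigning 0 to a record that is alone in its cluster is a common convention rather than something stated in the source.

```python
import math

def silhouette_values(records, labels):
    """Per-record silhouette (b - a) / max(a, b), using Euclidean distance."""
    values = []
    for i, (record, own_label) in enumerate(zip(records, labels)):
        # Collect distances from this record to every other record, by cluster.
        by_cluster = {}
        for j, (other, label) in enumerate(zip(records, labels)):
            if j != i:
                by_cluster.setdefault(label, []).append(math.dist(record, other))
        own = by_cluster.pop(own_label, None)
        if not own or not by_cluster:
            values.append(0.0)  # convention for singleton clusters or a single cluster
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        values.append((b - a) / max(a, b))  # assumes records are not all identical
    return values

def average_silhouette(records, labels):
    values = silhouette_values(records, labels)
    return sum(values) / len(values)

# Hypothetical one-variable records with a sensible and a poor labelling.
records = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = average_silhouette(records, [0, 0, 1, 1])
poor = average_silhouette(records, [0, 1, 0, 1])
print(good, poor)
```

The sensible labelling scores well above 0.5, while the labelling that splits each natural pair scores below zero, in line with the interpretation bands above.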

Divisive (top-down technique)

Start with one cluster containing all the observations, then work in the opposite direction, subdividing it into clusters of smaller size

Clustering

The process of subdividing the records of a dataset into homogeneous (similar) natural groups of observations that share common characteristics
- Also called data segmentation in some applications
- Does not try to classify, estimate, or predict the value of a target variable
- The groups are not predefined, but are found by the clustering algorithm

k-means

A type of clustering algorithm
- Center-based clustering for data with continuous attributes
- Defines a prototype in terms of a centroid, which is usually the mean (average) of all the points in the cluster
- Places each instance (case) in exactly one of K non-overlapping clusters by minimizing the sum of squared distances of each record to the mean (centroid) of its assigned cluster
- K is user-defined
- The centroid might happen to be one of the points in the cluster, but it does not have to be
- Does not ensure the clusters will have the same size, but finds the clusters that are best separated
- Generally less computationally intensive and thus preferred with very large datasets
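
The quantity k-means tries to minimize, the sum of squared distances of each record to the centroid of its assigned cluster, can be computed directly for any labelling. This is a sketch with a hypothetical function name and made-up data.

```python
import math

def within_cluster_sse(records, labels):
    """Sum of squared Euclidean distances from each record to its cluster centroid."""
    clusters = {}
    for record, label in zip(records, labels):
        clusters.setdefault(label, []).append(record)
    total = 0.0
    for members in clusters.values():
        center = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(math.dist(record, center) ** 2 for record in members)
    return total

# Hypothetical records: two tight pairs that are far apart from each other.
records = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tight = within_cluster_sse(records, [0, 0, 1, 1])
loose = within_cluster_sse(records, [0, 1, 0, 1])
print(tight, loose)
```

The labelling that keeps each tight pair together gives a much smaller value than the one that splits the pairs, so it is the one a k-means run would prefer.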

Silhouette

A measure of goodness of fit. Combines the concepts of cluster cohesion (favoring models that contain tightly cohesive clusters) and cluster separation (favoring models that contain highly separated clusters)

It is notoriously difficult to validate the quality of clustering algorithms because ...

- Clustering is an unsupervised problem
- The validation of a clustering process is highly subjective
- The only true measure of clustering quality is its ability to meet the goals of a specific application

Are there any records that are outliers?

k-means has trouble clustering data that contains outliers

Often, clustering is a preliminary step in a data mining process, used to ...

reduce the dimensionality of the data; the resulting clusters are used as inputs to a different technique downstream

