CH 14 Cluster Analysis

What is the normalized score for Sales = $9,077, given Average Sales = $8,914 and Std. Dev. = $3,550?

(9077 - 8914) / 3550 = 0.046
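
The arithmetic on this card can be checked with a short sketch. The function name `z_score` is only an illustrative choice; the calculation is the usual one of subtracting the mean and dividing by the standard deviation.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev


# Figures taken from the card: sales, average sales, standard deviation.
score = z_score(9077, 8914, 3550)
print(round(score, 3))
```

A score this close to zero says the sales figure is only slightly above average relative to the spread of the data.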

What are the disadvantages of hierarchical clustering?

- High computational costs
- Observations that are allocated incorrectly during the process cannot be reallocated later
- Low stability (stability is one of the characteristics we look for in a useful clustering)
- Sensitive to outliers

What is the smallest number of clusters reached during agglomerative clustering? Suppose there are n observations.

1

What is the agglomerative process?

1. Begin with n clusters (each record is its own cluster)
2. Merge similar clusters until one cluster is left (the entire data set)

What is the center?

The average value on each coordinate across the observations in a cluster

K-means process

1. Randomly pick K centers
2. Assign each observation to the nearest center
3. Recalculate the centers
4. Repeat steps 2 and 3 until convergence

What are the agglomerative method steps?

1. Start with n clusters (each record is its own cluster)
2. Merge the 2 closest records into one cluster
3. At each step, merge the two clusters closest to each other
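
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes records are numeric tuples, uses Euclidean distance, and picks single linkage (minimum distance) as the between-cluster measure. The names `agglomerate` and `single_linkage` are hypothetical.

```python
import math


def single_linkage(a, b):
    """Smallest Euclidean distance between any record in a and any record in b."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(records, linkage=single_linkage):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [[r] for r in records]  # step 1: every record is its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        history.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for left, right, distance in agglomerate(points):
    print(left, "+", right, "at distance", round(distance, 2))
```

Each entry in the history corresponds to one join in a dendrogram, and n records always produce n - 1 merges.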

What is K-means clustering?

1. Use a pre-specified number of clusters (K)
2. Assign cases to each cluster
- Preferred with large datasets
- Computationally less intensive

What is the K-Mean Clustering Algorithm?

1. Choose the desired number of clusters, k
2. Start with a partition into k clusters (often based on a random selection of k centroids)
3. At each step, move each record to the cluster with the closest centroid
4. Recompute the centroids and repeat step 3
5. Stop when moving records increases within-cluster dispersion
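
A minimal sketch of this loop, under a few stated assumptions: records are numeric tuples, distance is Euclidean, the initial centroids are k randomly sampled records, and the loop stops when no record changes cluster (a common stand-in for the dispersion-based stopping rule on the card). The names `k_means` and `centroid` are illustrative.

```python
import math
import random


def centroid(records):
    """Coordinate-wise mean of a non-empty list of records."""
    return tuple(sum(col) / len(records) for col in zip(*records))


def k_means(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # step 2: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each record to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(r, centers[c])) for r in records
        ]
        if new_labels == labels:  # assumed stopping rule: no record moved
            break
        labels = new_labels
        # Step 4: recompute centroids (an empty cluster keeps its old center).
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


data = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(data, k=2)
print(labels)
print(centers)
```

Because the starting centroids are random, different seeds can give different label numbering, and on less clearly separated data they can give different clusters, which is why the process is often repeated with several random starts.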

How many clusters should be defined if a dendrogram like below is observed?

2 or 4 (cut longest branches)

How many pairwise distances do we have for a cluster with 5 observations and another cluster with 6 observations?

30 (6 * 5)

What kind of algorithm do multiple dimensions require?

A distance measure to form clusters

How do you determine the number of clusters in a dendrogram?

Draw a horizontal line across the dendrogram where the clusters are far apart, then count the vertical lines it intersects

What is a dendrogram?

A tree-like diagram that summarizes the process of clustering
- Records are at the bottom
- Similar records are joined by lines whose vertical lengths reflect the distance between the records

Cluster separation

Examines the ratio of between-cluster variation to within-cluster variation. The higher the ratio, the better.

What is the main hierarchical method covered in this chapter?

Agglomerative Methods

Average linkage is also known as?

Unweighted pair-group method using averages (UPGMA)

Stability

Are clusters and cluster assignments sensitive to slight changes in inputs?

At most how many clusters?

At most n clusters

How to choose k?

Base it on how results will be used. Ex. How many market segments do we want?

What happens to 2 records that are closest to each other in a dendrogram?

Closest 2 records form a cluster

Why should we be wary of chance results?

Data may not have definitive "real" clusters

What is cluster stability?

Do cluster assignments change significantly if some of the inputs are slightly altered?

Initial partition can be based on?

- Domain knowledge
- Practical constraints
- Random selection (if so, repeat the process with different random partitions)

How do you measure distance between 2 records in a hierarchical method?

Euclidean distance between records i and j:
1. Take the difference between the two records on each variable
2. Square each difference so it is not negative
3. Sum the squared differences
4. Take the square root of the sum
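
As a sketch, the Euclidean distance calculation can be written out one step at a time. The function name `euclidean` is illustrative, and records are assumed to be equal-length numeric tuples.

```python
import math


def euclidean(record_i, record_j):
    diffs = [a - b for a, b in zip(record_i, record_j)]  # 1. difference per variable
    squares = [d ** 2 for d in diffs]                    # 2. square each difference
    total = sum(squares)                                 # 3. sum the squares
    return math.sqrt(total)                              # 4. square root of the sum


print(euclidean((0, 0), (3, 4)))  # 3-4-5 right triangle
```

The standard library's `math.dist` (Python 3.8+) computes the same quantity in one call.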

Separation

Examine the ratio of between-cluster variation to within-cluster variation (higher is better)

What kind of tool is cluster analysis?

An exploratory tool; it is useful only when it produces meaningful clusters

What is the goal of clustering?

Form groups (clusters) based on similar measurements made on those records

Why do we need to do data normalization in hierarchical and k-means clustering?

Hierarchical and K-Means clustering are both distance based so we need to remove the scale effect of each variable

What are two algorithms considered for cluster analysis?

Hierarchical and Non-Hierarchical

Why is stability a desirable feature in cluster analysis?

A stable clustering changes little or not at all when noise is added to the inputs

Uses of cluster analysis in huge data

- Internet search engines can cluster user queries
- Grouping securities in portfolios

What does a dendrogram show?

It shows cluster hierarchy

K-Means

K Centers OR K Clusters

What is the non-hierarchical method?

K-Means clustering

At least how many clusters?

Least is 1 cluster

Where is cluster analysis used?

- Marketing and industry analysis
- Segmenting groups of similar customers

How do you measure distance in hierarchical methods?

Measure distance between records and between clusters

Number of clusters

Might be useful considering the purpose of the analysis

How do you measure the distance between 2 clusters?

- Minimum distance: single linkage
- Maximum distance: complete linkage
- Average distance: average linkage
- Centroid distance: distance between the centers of the 2 clusters
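
The four measures can be compared side by side on two small clusters. This is a sketch that assumes Euclidean distance and numeric tuples; all function names are illustrative.

```python
import math
from itertools import product


def centroid(cluster):
    """Coordinate-wise mean of the records in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def pairwise_distances(a, b):
    """Distance from every record in a to every record in b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    return min(pairwise_distances(a, b))


def complete_linkage(a, b):
    return max(pairwise_distances(a, b))


def average_linkage(a, b):
    distances = pairwise_distances(a, b)
    return sum(distances) / len(distances)


def centroid_distance(a, b):
    return math.dist(centroid(a), centroid(b))


a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
for measure in (single_linkage, complete_linkage, average_linkage, centroid_distance):
    print(measure.__name__, round(measure(a, b), 3))
```

Single, complete, and average linkage all work from the full set of pairwise distances, which is why a cluster of 5 observations and a cluster of 6 give 5 * 6 = 30 distances to consider.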

Which cluster is furthest?

Pair that has highest distance is furthest away

Which cluster is closest?

Pair that has shortest distance is closest

What are the advantages of hierarchical clustering?

Purely data driven since it doesn't require a specification of the number of clusters

What is the problem with not normalizing?

Raw distance measures are highly influenced by scales of measurements

What is the divisive method?

1. Start with one cluster
2. Repeatedly divide it into smaller clusters

How do we normalize (standardize) the data?

1. Subtract the variable's mean from each value
2. Divide by the variable's standard deviation
The result is also called a z-score
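
Applied to a whole data set, standardization is done one variable (column) at a time. The sketch below assumes rows of numeric tuples and uses the sample standard deviation; the name `standardize` and the example figures are made up for illustration.

```python
import statistics


def standardize(rows):
    """Convert every column to z-scores: (value - column mean) / column std. dev."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    std_devs = [statistics.stdev(col) for col in columns]  # sample standard deviation
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, std_devs))
        for row in rows
    ]


# Two variables on very different scales (hypothetical values).
rows = [(1, 100), (2, 200), (3, 300)]
for row in standardize(rows):
    print(row)
```

After standardization both columns are on the same scale, so neither dominates a distance calculation simply because of its units.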

Within-Cluster Dispersion

Shown as a table:
- Rows show clusters
- 1st column shows the number of observations
- 2nd column shows the average distance within the cluster
- A high average distance means the cluster is loose and not well defined
- A low average distance means the cluster is tight and well defined

Cluster Output

Table shows distances between clusters and measurements

What is the goal of clusters?

To obtain meaningful and useful clusters

What kind of task is clustering?

Unsupervised: it looks at existing data without human supervision (there is no labeled outcome variable)

Centroid linkage is also called?

Unweighted pair-group method using centroids (UPGMC)

Why is separation a desirable feature in cluster analysis?

We want clusters to be highly separated

For hierarchical clustering, do we need to normalize the data if variables are in different scales?

Yes - distance based algorithm

For k-means, do we need to normalize the data if variables are in different scales?

Yes - distance based algorithm

Random chance can often produce

apparent clusters

Types of clusters

Complete, single, average, and centroid (named after the linkage used to form them)

Non-hierarchical clustering is

computationally cheap and more stable. It requires the user to set k.

Suppose you have a data set named dat1 in R. Which of the following is the correct syntax to run hierarchical clustering?

hclust(dist(dat1))

Suppose you have a data set named dat1 in R. Which of the following is the correct syntax to run hierarchical clustering with standardization of each variable?

hclust(dist(scale(dat1)))

What is the meaning of k in k-means?

k centers

What is the largest number of clusters during agglomerative clustering? Suppose there are n observations.

n

Different cluster methods

produce different results

Desirable features of clusters

stability and high separation

Goal of K-Mean Clustering Algorithm

to divide the sample into a pre-determined number k of non-overlapping clusters so that the clusters are as homogeneous as possible with respect to the measurements

Hierarchical clustering gives

visual representation of different levels of clustering

