Quiz #5 - Module 5
conducting cluster analysis
(1) Formulate the problem (2) Select a distance measure (3) Select a clustering procedure (4) Decide on the number of clusters (5) Interpret and profile clusters (6) Assess the validity of clustering
cluster analysis purposes
Purposes: segmenting the market, understanding buyer behaviors, identifying new product opportunities, selecting test markets, and reducing data.
- Segmenting the market: For example, consumers may be clustered based on benefits sought from the purchase of a product. Each cluster would consist of consumers who are relatively homogeneous in terms of the benefits they seek. This approach is called benefit segmentation.
- Understanding buyer behaviors: Cluster analysis can be used to identify homogeneous groups of buyers. Then the buying behavior of each group may be examined separately, as in the department store patronage project, where respondents were clustered based on self-reported importance attached to each factor of the choice criteria utilized in selecting a department store.
- Identifying new product opportunities: By clustering brands and products, competitive sets within the market can be determined. Brands in the same cluster compete more fiercely with each other than with brands in other clusters.
- Selecting test markets: By grouping cities into homogeneous clusters, it is possible to select comparable cities to test various marketing strategies.
- Reducing data: Cluster analysis can be used as a general data reduction tool to develop clusters or subgroups of data that are more manageable than individual observations. Subsequent multivariate analysis is conducted on the clusters rather than on the individual observations. For example, to describe differences in consumers' product usage behavior, the consumers may first be clustered into groups. The differences among the groups may then be examined using multiple discriminant analysis.
similarity/distance coefficient matrix
A __________ is a lower-triangle matrix containing pairwise distances between objects or cases.
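As an illustration of the lower-triangle layout, here is a minimal Python sketch. It assumes Euclidean distance and three hypothetical cases measured on two variables; the function name `lower_triangle_distances` is made up for this example.

```python
import math

def lower_triangle_distances(cases):
    """Return pairwise Euclidean distances as a lower-triangle matrix.

    Row i holds the distances from case i to cases 0 .. i-1, so the
    diagonal and the redundant upper triangle are not stored.
    """
    return [
        [math.dist(cases[i], cases[j]) for j in range(i)]
        for i in range(len(cases))
    ]

# Hypothetical cases measured on two variables.
cases = [(0, 0), (3, 4), (6, 8)]
matrix = lower_triangle_distances(cases)
for row in matrix:
    print(row)
```

Only the lower triangle is needed because the distance from case i to case j equals the distance from case j to case i.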
dendrogram
A __________, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. It is read from left to right.
hierarchical clustering
A clustering procedure characterized by the development of a hierarchy or tree-like structure.
average linkage
A linkage method based on the average distance between all pairs of objects, where one member of the pair is from each of the clusters. Because it uses information on all pairs of distances, not merely the minimum or maximum distance, it is usually preferred to the single and complete linkage methods.
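A minimal sketch of the average linkage distance between two clusters, assuming Euclidean distance and two small hypothetical clusters measured on one variable:

```python
import math

def average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs with one member from each cluster."""
    distances = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    return sum(distances) / len(distances)

# Hypothetical clusters; each case is a one-variable tuple.
cluster_a = [(0,), (2,)]
cluster_b = [(5,), (9,)]

# Pair distances are 5, 9, 3, and 7, so the average is 6.
avg = average_linkage(cluster_a, cluster_b)
print(avg)
```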
nonhierarchical clustering
A procedure that first assigns or determines a cluster center and then groups all objects within a prespecified threshold value from the center. Frequently referred to as k-means clustering. These methods include sequential threshold, parallel threshold, and optimizing partitioning. Two major disadvantages are that the number of clusters must be prespecified and the selection of cluster centers is arbitrary. Furthermore, the clustering results may depend on how the centers are selected. Many nonhierarchical programs select the first k (k = number of clusters) cases without missing values as initial cluster centers. Thus, the clustering results may depend on the order of observations in the data.
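The dependence on case order can be seen in a small Python sketch. It assumes a basic k-means iteration (assign each case to the nearest center, then recompute the centers as means) seeded with the first k cases. The function name `kmeans_first_k` and the four data points are hypothetical, and real statistical packages may differ in detail.

```python
import math

def kmeans_first_k(cases, k, max_iter=100):
    """Basic k-means that uses the first k cases as initial centers."""
    centers = [tuple(c) for c in cases[:k]]
    for _ in range(max_iter):
        # Assignment step: each case joins its nearest center.
        groups = [[] for _ in range(k)]
        for case in cases:
            nearest = min(range(k), key=lambda i: math.dist(case, centers[i]))
            groups[nearest].append(case)
        # Update step: each center becomes the mean of its group.
        new_centers = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(sorted(g) for g in groups)

# The same four hypothetical cases (corners of a 4-by-1 rectangle),
# listed in two different orders.
order_1 = [(0, 0), (0, 1), (4, 0), (4, 1)]
order_2 = [(0, 0), (4, 0), (0, 1), (4, 1)]

solution_1 = kmeans_first_k(order_1, k=2)
solution_2 = kmeans_first_k(order_2, k=2)
print(solution_1)
print(solution_2)
```

With the first ordering, both seeds sit on the left edge and the procedure settles on a bottom/top split. With the second ordering, the seeds sit on opposite edges and it finds the left/right split, which has far less within-cluster variance.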
linkage methods
Agglomerative methods of hierarchical clustering that cluster objects based on a computation of the distance between them.
agglomeration schedule
An __________ gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
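To make the idea concrete, the following sketch builds such a schedule for four hypothetical one-variable cases. It assumes single linkage and Euclidean distance, and the tuple layout (first cluster, second cluster, distance) is an illustrative choice, not the format of any particular statistics package.

```python
import math

def single_linkage_schedule(cases):
    """Return one (cluster_a, cluster_b, distance) entry per merge stage."""
    clusters = [[i] for i in range(len(cases))]  # each case starts alone
    schedule = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    math.dist(cases[i], cases[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return schedule

# Hypothetical cases, identified by their index 0..3.
cases = [(0,), (1,), (5,), (7,)]
schedule = single_linkage_schedule(cases)
for stage, (left, right, distance) in enumerate(schedule, start=1):
    print(stage, left, right, distance)
```

Cases 0 and 1 merge first at distance 1, cases 2 and 3 merge next at distance 2, and the two resulting clusters merge last at distance 4.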
icicle diagram
An __________ is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. It is read from bottom to top.
variance methods
Agglomerative methods of hierarchical clustering in which clusters are generated to minimize the within-cluster variance.
cluster analysis and discriminant analysis
Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.
decide on the number of clusters
Deciding on the number of clusters requires judgment on the part of the researcher. (1) Theoretical, conceptual, or practical considerations may suggest a certain number of clusters. For example, if the purpose of clustering is to identify market segments, management may want a particular number of clusters. (2) In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram. (3) In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters. Increasing the number of clusters beyond this point is usually not worthwhile.
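One common way to apply criterion (2) is to look for the largest jump in the distances at which clusters are combined and stop merging just before it. The sketch below assumes that heuristic. The merge distances are hypothetical values of the kind read off an agglomeration schedule, and `suggest_cluster_count` is a made-up helper name.

```python
def suggest_cluster_count(n_objects, merge_distances):
    """Suggest a cluster count from the largest jump in merge distances.

    merge_distances[s] is the distance at which stage s + 1 combined two
    clusters. Each stage reduces the number of clusters by one.
    """
    jumps = [
        later - earlier
        for earlier, later in zip(merge_distances, merge_distances[1:])
    ]
    biggest = jumps.index(max(jumps))
    stages_kept = biggest + 1  # stop just before the costly merge
    return n_objects - stages_kept

# Hypothetical schedule for 6 objects: 5 merge stages in total.
merge_distances = [1.0, 1.2, 1.5, 4.8, 5.1]
suggested = suggest_cluster_count(6, merge_distances)
print(suggested)
```

Here the distance jumps from 1.5 to 4.8 between stages 3 and 4, so stopping after stage 3 leaves 6 - 3 = 3 clusters. The result is a starting point for the researcher's judgment, not a substitute for it.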
assess reliability and validity
Formal procedures for assessing the __________ of clustering solutions are complex and not fully defensible, so they are omitted here. However, the following procedures provide adequate checks on the quality of clustering results. (1) Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions. (2) Use different methods of clustering and compare the results. (3) Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples. (4) Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables. (5) In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.
divisive clustering
Hierarchical clustering procedure in which all objects start out in one giant cluster. Clusters are formed by dividing this cluster into smaller and smaller clusters.
agglomerative clustering
Hierarchical clustering procedure in which each object starts out in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters.
interpret and profile the clusters
Involves examining the cluster centroids. The centroids represent the mean values of the objects contained in the cluster on each of the variables. The centroids enable us to describe each cluster by assigning it a name or label. If the clustering program does not print this information, it may be obtained through discriminant analysis. Often it is helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.
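If a program reports only cluster membership, the centroids can be computed directly as per-cluster means. A minimal sketch, assuming hypothetical cases on two variables and a membership list giving each case's cluster label:

```python
from collections import defaultdict

def cluster_centroids(cases, membership):
    """Return {cluster label: tuple of mean values on each variable}."""
    grouped = defaultdict(list)
    for case, label in zip(cases, membership):
        grouped[label].append(case)
    return {
        label: tuple(sum(values) / len(values) for values in zip(*members))
        for label, members in grouped.items()
    }

# Hypothetical data: four cases, two variables, two clusters.
cases = [(1, 2), (3, 4), (10, 10), (12, 14)]
membership = [1, 1, 2, 2]
centroids = cluster_centroids(cases, membership)
print(centroids)
```

Cluster 1 has centroid (2, 3) and cluster 2 has centroid (11, 12). Comparing these means variable by variable is what supports naming or labeling each cluster.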
complete linkage
Linkage method that is based on maximum distance or the furthest neighbor approach.
single linkage
Linkage method that is based on minimum distance or the nearest neighbor rule.
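The nearest neighbor and furthest neighbor rules can be contrasted on the same pair of clusters. This is a minimal sketch assuming Euclidean distance and hypothetical one-variable cases:

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Minimum distance between members of the two clusters."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Maximum distance between members of the two clusters."""
    return max(math.dist(a, b) for a in cluster_a for b in cluster_b)

# Hypothetical clusters; pair distances are 5, 9, 3, and 7.
cluster_a = [(0,), (2,)]
cluster_b = [(5,), (9,)]

nearest = single_linkage(cluster_a, cluster_b)
furthest = complete_linkage(cluster_a, cluster_b)
print(nearest, furthest)
```

Single linkage reports 3 (the closest pair) and complete linkage reports 9 (the most distant pair) for the same two clusters.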
sequential threshold method
Nonhierarchical clustering method in which a cluster center is selected and all objects within a prespecified threshold value from the center are grouped together.
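A minimal sketch of the sequential threshold idea, under two assumptions that the definition leaves open: the next unassigned case is taken as each new center, and "within the threshold" means a Euclidean distance less than or equal to it. The data and the name `sequential_threshold` are hypothetical.

```python
import math

def sequential_threshold(cases, threshold):
    """Build clusters one at a time around successive centers."""
    remaining = list(cases)
    clusters = []
    while remaining:
        center = remaining[0]  # assumption: next unassigned case is the seed
        group = [c for c in remaining if math.dist(c, center) <= threshold]
        clusters.append(group)
        # Cases already grouped are not considered for later clusters.
        remaining = [c for c in remaining if c not in group]
    return clusters

# Hypothetical one-variable cases.
cases = [(0,), (1,), (2,), (10,), (11,), (30,)]
clusters = sequential_threshold(cases, threshold=3)
print(clusters)
```

The first center (0) collects cases 0, 1, and 2; the next center (10) collects 10 and 11; the case at 30 ends up alone.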
optimizing partitioning method
Nonhierarchical clustering method that allows for later reassignment of objects to clusters to optimize an overall criterion.
parallel threshold method
Nonhierarchical clustering method that specifies several cluster centers at once. All objects that are within a prespecified threshold value from the center are grouped together.
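A minimal sketch of the parallel threshold idea. It assumes that each object within the threshold is grouped with its nearest center and that objects outside every threshold stay unassigned. The centers, data, and function name are hypothetical.

```python
import math

def parallel_threshold(cases, centers, threshold):
    """Assign cases to several prespecified centers in one pass."""
    groups = [[] for _ in centers]
    unassigned = []
    for case in cases:
        distances = [math.dist(case, center) for center in centers]
        nearest = distances.index(min(distances))
        if distances[nearest] <= threshold:
            groups[nearest].append(case)
        else:
            unassigned.append(case)  # outside every center's threshold
    return groups, unassigned

# Hypothetical one-variable cases and two centers chosen up front.
cases = [(0,), (1,), (2,), (10,), (11,), (30,)]
centers = [(1,), (10,)]
groups, unassigned = parallel_threshold(cases, centers, threshold=3)
print(groups)
print(unassigned)
```

Unlike the sequential method, all centers are fixed before any object is assigned, so the clusters grow simultaneously.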
manhattan distance
The city-block or __________ between two objects is the sum of the absolute differences in values for each variable.
cluster centers
The __________ are the initial starting points in nonhierarchical clustering. Clusters are built around these centers or seeds.
chebychev distance
The __________ between two objects is the maximum absolute difference in values for any variable.
cluster centroid
The __________ is the set of mean values of the variables for all the cases or objects in a particular cluster.
city-block distance
The __________ or Manhattan distance between two objects is the sum of the absolute differences in values for each variable.
euclidean distance
The square root of the sum of the squared differences in values for each variable. It is the most commonly used measure of similarity in cluster analysis; the smaller the distance, the more similar the two objects.
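The three distance measures defined in this set (city-block or Manhattan, Chebychev, and Euclidean) can be compared on one pair of objects. A minimal sketch with two hypothetical objects measured on three variables:

```python
import math

def manhattan(a, b):
    """Sum of the absolute differences in values for each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Maximum absolute difference in values for any variable."""
    return max(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of the squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects; the absolute differences are 3, 4, and 0.
a = (1, 2, 3)
b = (4, 6, 3)

print(manhattan(a, b))   # 3 + 4 + 0 = 7
print(chebychev(a, b))   # max(3, 4, 0) = 4
print(euclidean(a, b))   # sqrt(9 + 16 + 0) = 5.0
```

For any pair of objects the Chebychev distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.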
distances between cluster centers
These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct and, therefore, desirable.
twostep clustering procedure
This procedure can automatically determine the optimal number of clusters by comparing the values of model-choice criteria across different clustering solutions. It can also create cluster models based on both categorical and continuous variables. In addition to Euclidean distance, the __________ also uses the log-likelihood measure, which places a probability distribution on the variables. The procedure accommodates two clustering criteria: Schwarz's Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). In the __________, the Euclidean measure can be used only when all of the variables are continuous.
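For orientation, the general textbook forms of the two criteria are AIC = -2 ln L + 2p and BIC = -2 ln L + p ln n, where L is the maximized likelihood, p the number of estimated parameters, and n the number of cases. The sketch below uses those general definitions. It is not the exact computation performed by any particular TwoStep implementation, and the numbers are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better model."""
    return -2 * log_likelihood + 2 * n_params

def bic(log_likelihood, n_params, n_cases):
    """Schwarz's Bayesian Information Criterion: lower is better."""
    return -2 * log_likelihood + n_params * math.log(n_cases)

# Hypothetical solution: log-likelihood -100, 5 parameters, 50 cases.
print(aic(-100.0, 5))
print(bic(-100.0, 5, 50))
```

Both criteria reward fit through the log-likelihood and penalize the number of parameters. BIC's penalty grows with the number of cases, so it tends to favor solutions with fewer clusters than AIC does.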
ward's procedure
Variance method in which the squared Euclidean distance to the cluster means is minimized.
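One way to see this criterion at work is to compute, for each candidate pair of clusters, how much the total squared Euclidean distance to the cluster means would grow if the pair were merged. The sketch below assumes that formulation; the function names and the three small clusters are hypothetical.

```python
def error_sum_of_squares(cluster):
    """Sum of squared Euclidean distances from each case to the cluster mean."""
    centroid = [sum(values) / len(cluster) for values in zip(*cluster)]
    return sum(
        (x - m) ** 2 for case in cluster for x, m in zip(case, centroid)
    )

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares caused by merging."""
    merged = cluster_a + cluster_b
    return (
        error_sum_of_squares(merged)
        - error_sum_of_squares(cluster_a)
        - error_sum_of_squares(cluster_b)
    )

# Hypothetical one-variable clusters.
cluster_a = [(0,), (2,)]
cluster_b = [(10,), (12,)]
cluster_c = [(3,), (5,)]

print(ward_merge_cost(cluster_a, cluster_b))
print(ward_merge_cost(cluster_a, cluster_c))
```

Merging A with C raises the within-cluster sum of squares by 9, while merging A with B raises it by 100, so under this criterion A and C would be combined first.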
centroid methods
Variance methods of hierarchical clustering in which the distance between two clusters is the distance between their centroids (means for all the variables).
cluster membership
__________ indicates the cluster to which each object or case belongs.
cluster analysis
__________ is a methodology that forms groups of items such that items within a group are more similar to each other than to items in other groups. __________ is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in other clusters. Also called classification analysis, or numerical taxonomy.