Business Analytics Chapter 5: Descriptive Data Mining
The strength of a cluster can be measured by comparing the average distance within a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed what value for useful clusters? a. 1 b. 2 c. 0.5 d. 1.5
a. 1
In preparing categorical variables for analysis, it is usually best to _____. a. convert the categories to binary, dummy variables b. combine as many categories as possible c. convert the categories to numeric representations d. let them remain categorical
a. convert the categories to binary, dummy variables
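As an illustration of the dummy-variable conversion, here is a minimal sketch in plain Python. The function name `to_dummies` and the `is_` prefix are hypothetical choices for this example, and it creates one 0/1 column per category; some analyses drop one column to avoid redundancy.

```python
def to_dummies(values):
    """Turn a list of category labels into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical data: three observations of a categorical variable.
colors = ["red", "blue", "red"]
for row in to_dummies(colors):
    print(row)
```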
Single linkage can be used to measure the distance between clusters that are the _____ in cluster analysis. a. most similar b. farthest apart c. closest d. most different
a. most similar
Suppose the dissimilarity between clusters A and C has the value 24 and the dissimilarity between clusters B and C has the value 12. Use McQuitty's method to determine the dissimilarity between cluster AB and cluster C. a. 18 b. 24 c. 12 d. 36
a. 18
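McQuitty's method reduces to a simple average of two dissimilarities. A minimal sketch using the values 24 and 12 from the card above; the function and argument names are illustrative only.

```python
def mcquitty(d_first, d_second):
    """Average the dissimilarities from each merged cluster to the third cluster."""
    return (d_first + d_second) / 2

print(mcquitty(24, 12))  # 18.0
```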
Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer who spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer who spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance. a. 75.39 b. 66.21 c. 72.28 d. 88.57
a. 75.39
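The arithmetic behind this answer can be checked with a short sketch. It uses the vectors u = (25, 350) and v = (53, 420) from the card; the function name is an illustrative choice.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (25, 350)  # (age, dollars spent)
v = (53, 420)
d = euclidean_distance(u, v)  # sqrt(28**2 + 70**2) = sqrt(5684)
print(round(d, 2))  # 75.39
```

Because the spending difference (70) is much larger than the age difference (28), it dominates the result, which is why variables are often standardized before clustering.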
_____ is a method of calculating dissimilarity between clusters by calculating the distance between the centroids of the two clusters. a. Complete linkage b. Average linkage c. Single linkage d. Centroid linkage
d. Centroid linkage
_____ is used to measure the dissimilarity between text documents. a. Corpus b. Word cloud c. Dendrogram d. Cosine distance
d. Cosine distance
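A minimal sketch of cosine distance, assuming each document is represented as a vector of term counts and that neither vector is all zeros. The three count vectors are made-up data for illustration.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Hypothetical term-count vectors over a four-word vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same word proportions as doc_a, twice as long
doc_c = [0, 0, 3, 1]  # shares no words with doc_a

print(cosine_distance(doc_a, doc_b))  # approximately 0
print(cosine_distance(doc_a, doc_c))  # 1.0
```

The measure depends on the direction of the vectors rather than their length, so a long document and a short one with the same word mix are treated as similar.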
Jaccard's coefficient is different from the matching coefficient in that the former _____. a. is affected by the scale used to measure variables while the latter is not b. measures overlap while the latter measures dissimilarity c. deals with categorical variables while the latter deals with continuous variables d. does not count matching zero entries while the latter does
d. does not count matching zero entries while the latter does
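The difference is easiest to see on a pair of binary observations. A minimal sketch, assuming 0/1 variables and at least one nonzero entry across the pair; the two vectors are made-up data.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Share of agreeing variables after discarding the 0-0 matches."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return both_one / (len(u) - both_zero)

u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 0.8 (four of five variables agree)
print(jaccard_coefficient(u, v))   # 0.5 (one shared 1 out of two non-0-0 variables)
```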
In k-means clustering, k represents the _____. a. number of variables b. number of observations in a cluster c. mean of the cluster d. number of clusters
d. number of clusters
Euclidean distance can be used to measure the distance between _____ in cluster analysis. a. objects b. ward c. clusters d. observations
d. observations
A popular measure for weighing terms based on frequency and uniqueness is _____. a. corpus b. cosine distance c. word cloud d. term frequency times inverse document frequency
d. term frequency times inverse document frequency
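A minimal sketch of the weighting idea. It assumes one common variant, raw term count times log(N / number of documents containing the term), which may differ from the exact formula used in the chapter. The tokenized documents are made-up data.

```python
import math

def tfidf(term, doc, docs):
    """Term count in doc times log(N / documents containing the term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(docs) / df)

# Hypothetical corpus of three tokenized documents.
docs = [
    ["flight", "delayed", "delayed"],
    ["flight", "comfortable"],
    ["flight", "food", "good"],
]
print(tfidf("flight", docs[0], docs))   # 0.0, the term appears in every document
print(tfidf("delayed", docs[0], docs))  # positive, frequent here and rare elsewhere
```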
_____ is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the distance between A and C and the distance between B and C. a. McQuitty's method b. Jaccard's coefficient c. Ward's method d. None of these choices are correct.
a. McQuitty's method
A cluster's _____ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram. a. durability b. dimension c. affordability d. span
a. durability
_____ is a dissimilarity measure that is more robust to outliers than Euclidean distance. a. Matching distance b. Manhattan distance c. Jaccard distance d. Matching coefficient
b. Manhattan distance
_____ is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the distance between A and C and the distance between B and C. a. Jaccard's coefficient b. Ward's method c. McQuitty's method d. None of these choices are correct.
c. McQuitty's method
_____ refers to the number of times a collection of items occurs together in a transaction data set. a. A consequent b. Antecedent c. Support d. Validation count
c. Support
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster? a. The short leg b. The long leg c. The hypotenuse d. Euclidean distance is not related to right triangles.
c. The hypotenuse
A cluster's _____ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram. a. dimension b. affordability c. durability d. span
c. durability
The strength of the association rule is known as _____ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence. a. antecedent b. consequent c. lift d. support count
c. lift
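Lift can be worked through on a small example. The sketch below assumes the usual definitions: confidence is the support count of the antecedent and consequent together divided by the support count of the antecedent, and benchmark confidence is the support count of the consequent divided by the total number of transactions. The five transactions are made-up data.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    confidence = (support_count(antecedent | consequent, transactions)
                  / support_count(antecedent, transactions))
    benchmark = support_count(consequent, transactions) / len(transactions)
    return confidence / benchmark

# Hypothetical transaction data.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly"},
    {"bread", "milk", "jelly"},
    {"milk"},
    {"bread", "milk"},
]
rule_lift = lift({"bread"}, {"milk"}, transactions)
print(round(rule_lift, 4))  # 0.9375, i.e. (3/4) / (4/5)
```

A lift below 1 here means that buying bread makes milk slightly less likely than it is across all transactions, so the rule adds no useful information.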