INFO320 CHAPTER 4
__________________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.
complete linkage
Average linkage is a measure of calculating dissimilarity between two clusters by
computing the average distance between every pair of observations between two clusters.
Single linkage is a measure of calculating dissimilarity between clusters by
considering only the two most similar observations in the two clusters.
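A minimal sketch of the three linkage measures defined above, assuming Euclidean distance between observations. The function names and the two small example clusters are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_distances(cluster_a, cluster_b):
    """Distances between every pair (one observation from each cluster)."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Two most similar observations: the smallest pairwise distance.
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    # Two most dissimilar observations: the largest pairwise distance.
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Mean of all pairwise distances between the two clusters.
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)


# Hypothetical example clusters (illustrative data only).
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # sqrt(17), about 4.12
print(average_linkage(cluster_a, cluster_b))
```

For any pair of clusters, the single linkage value is never larger than the average, and the average is never larger than the complete linkage value.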
In preparing categorical variables for analysis, it is usually best to
convert the categories to binary, dummy variables
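As a rough illustration of converting categories to binary dummy variables, the sketch below builds one 0/1 indicator per category. The helper name to_dummies and the is_ prefix are assumptions made for this example; some treatments drop one category as a baseline, which this sketch does not do.

```python
def to_dummies(values):
    """Return one dict of 0/1 indicators per observation."""
    categories = sorted(set(values))
    return [
        {f"is_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Hypothetical categorical variable (illustrative data only).
colors = ["red", "blue", "red"]
for row in to_dummies(colors):
    print(row)
# {'is_blue': 0, 'is_red': 1}
# {'is_blue': 1, 'is_red': 0}
# {'is_blue': 0, 'is_red': 1}
```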
The process of eliminating variables from formal analysis without losing any crucial information is called
dimension reduction
Jaccard's coefficient is different from the matching coefficient in that the former
does not count matching zero entries while the latter does
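The difference between the two coefficients can be sketched on a pair of 0/1 vectors. The function names and the example vectors are hypothetical. Returning 0.0 when both vectors are all zeros is an assumption of this sketch, since Jaccard's coefficient is undefined in that case.

```python
def matching_coefficient(x, y):
    # Share of variables on which the two observations agree,
    # counting matching 1s and matching 0s alike.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard_coefficient(x, y):
    # Matching 0s are excluded from both numerator and denominator.
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    considered = len(x) - both_zero
    return both_one / considered if considered else 0.0  # assumption for all-zero case


# Hypothetical dummy-variable observations (illustrative data only).
x = [1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0]

print(matching_coefficient(x, y))  # 0.8 (4 of 5 entries agree)
print(jaccard_coefficient(x, y))   # 0.5 (1 shared 1 out of 2 non-zero-match entries)
```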
A cluster's _____________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.
durability
The __________ the lift ratio, the ____________ the association rule.
higher, stronger
The strength of an association rule is known as ____________ and is calculated as the ratio of the confidence of the association rule to the benchmark confidence.
lift
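A small worked sketch of lift, assuming the benchmark confidence is the share of all transactions that contain the consequent. The transaction data and function names are hypothetical.

```python
# Hypothetical transaction data (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support_count(itemset, transactions):
    # Number of transactions that contain every item in the item set.
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    # Support of antecedent and consequent together, divided by
    # the support of the antecedent alone.
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    # Confidence of the rule divided by the benchmark confidence.
    benchmark = support_count(consequent, transactions) / len(transactions)
    return confidence(antecedent, consequent, transactions) / benchmark


# Rule: if jelly, then bread.
print(confidence({"jelly"}, {"bread"}, transactions))  # 1.0
print(lift({"jelly"}, {"bread"}, transactions))        # 1.25
```

A lift above 1 suggests the antecedent makes the consequent more likely than it is across all transactions.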
An analysis of items frequently co-occurring in transactions is known as
market basket analysis
When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called the
matching coefficient
Options for replacing the missing entries for a variable include replacing the missing value with the variable's mode, mean, or median. Imputing values in this manner is truly valid only if variable values are
missing completely at random
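The replacement options named above can be sketched with Python's standard statistics module. Marking missing entries with None, the function name impute, and the strategy labels are assumptions made for this example.

```python
from statistics import mean, median, mode


def impute(values, strategy="median"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with one missing entry (illustrative data only).
ages = [2, None, 4, 4, 10]

print(impute(ages, "mean"))    # missing entry replaced by 5
print(impute(ages, "median"))  # missing entry replaced by 4.0
print(impute(ages, "mode"))    # missing entry replaced by 4
```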
The endpoint of a k-means clustering algorithm occurs when
no further changes are observed in cluster structure and number
Euclidean distance can be used to measure the distance between ________________ in cluster analysis.
observations
k-means clustering is the process of
organizing observations into one of k groups based on a measure of similarity
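A minimal sketch of k-means, assuming Euclidean distance and caller-supplied starting centroids (so k is the number of starting centroids). It stops when cluster assignments no longer change. The function name, the max_iter safeguard, and the example data are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)  # k = len(centroids)
    assignments = None
    for _ in range(max_iter):
        # Assign each observation to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Endpoint: no further change in cluster membership.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids


# Hypothetical observations and starting centroids (illustrative data only).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
assignments, final_centroids = k_means(points, [(1.0, 1.0), (9.0, 9.0)])

print(assignments)  # [0, 0, 0, 1, 1, 1]
```

In practice the starting centroids are often chosen at random, and different starts can lead to different final clusters.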
An observation refers to
the set of recorded values of variables associated with a single entity
A ___________ refers to the number of times a collection of items occur together in a transaction data set.
support count
Which of the following reasons contributes to the increase in the use of data-mining techniques in business?
the ability to electronically warehouse data
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?
the hypotenuse
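The right-triangle picture can be checked with a short sketch: for two observations with two variables each, the differences along each variable are the legs and the Euclidean distance is the hypotenuse. The function name and the example points are hypothetical.

```python
from math import hypot, sqrt


def euclidean_distance(u, v):
    # Square root of the sum of squared differences across variables.
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical observations (illustrative data only).
u = (1.0, 2.0)
v = (4.0, 6.0)

# Legs of the right triangle are 3 and 4, so the hypotenuse is 5.
print(euclidean_distance(u, v))       # 5.0
print(hypot(v[0] - u[0], v[1] - u[1]))  # 5.0
```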
In k-means clustering, k represents
the number of clusters
Which is NOT a primary option for addressing missing data?
to generate random data to replace missing data
The goal of _______ is to use the variable values to identify relationships between observations
unsupervised learning
_______________ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables.
unsupervised learning
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a
dendrogram
The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed what value for useful clusters?
1
In which of the following scenarios would it be appropriate to use hierarchical clustering?
??
Which of the following is true for Euclidean distances?
It is commonly used as a method of measuring dissimilarity between quantitative observations.
Which statement is true of an association rule?
It is ultimately judged on how actionable it is and how well it explains the relationship between item sets
Data preparation includes all of the following except which task?
calculating the confidence ratio for all association rules
If a model's implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the
cause of the outliers
____________________ measures cluster similarity by calculating the distance between the centroids of the two clusters.
centroid linkage
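A brief sketch of centroid linkage, assuming each centroid is the variable-by-variable mean of the cluster's observations and that the distance between centroids is Euclidean. The function names and example clusters are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(cluster):
    # Mean of each variable across the cluster's observations.
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def centroid_linkage(cluster_a, cluster_b):
    # Distance between the two cluster centroids.
    return dist(centroid(cluster_a), centroid(cluster_b))


# Hypothetical example clusters (illustrative data only).
left = [(0.0, 0.0), (0.0, 2.0)]   # centroid (0.0, 1.0)
right = [(3.0, 1.0), (5.0, 1.0)]  # centroid (4.0, 1.0)

print(centroid_linkage(left, right))  # 4.0
```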
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called
cluster analysis
In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?
data preparation