Chapter 5
In the text mining process, the text is first preprocessed by deriving a smaller set of _____ from the larger set of words contained in a collection of documents. a. tokens b. terms c. stack d. stems
a. tokens
A collection of text documents to be analyzed is called a _____. a. library b. corpus c. consequent d. book
b. corpus
The process of extracting useful information from text data is known as _____. a. corpus b. text mining c. tokenization d. stemming
b. text mining
Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"? a. 000 b. 100 c. 010 d. 001
d. 001
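A minimal sketch of the 0-1 dummy encoding behind this answer, assuming one dummy variable per option in the order given in the question. The names OPTIONS and encode are illustrative only.

```python
# Options in the order given in the question; one 0-1 dummy variable per option.
OPTIONS = ["hear account information", "billing questions", "customer service"]

def encode(choice):
    """Return the dummy-variable string for a choice: 1 in its position, 0 elsewhere."""
    return "".join("1" if choice == option else "0" for option in OPTIONS)

print(encode("customer service"))  # prints 001
```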
_____ is a method of calculating dissimilarity between clusters by calculating the distance between the centroids of the two clusters. a. Centroid linkage b. Complete linkage c. Average linkage d. Single linkage
a. Centroid linkage
Average linkage is a measure of calculating dissimilarity between two clusters by _____. a. finding the distance between the two most dissimilar observations in the two clusters b. computing the average distance between every pair of observations between two clusters c. computing the distance between the cluster centroids d. finding the distance between the two closest observations in the two clusters
b. computing the average distance between every pair of observations between two clusters
_____ refers to the number of times a collection of items occurs together in a transaction data set. a. Validation count b. A consequent c. Antecedent d. Support count
d. Support count
_____ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables. a. Data mining b. Data sampling c. Dimension reduction d. Unsupervised learning
d. Unsupervised learning
To identify patterns across transactions, we can use _____. a. centroid linkage b. k-means c. complete linkage d. association rules
d. association rules
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called _____. a. data visualization b. supervised learning c. market analysis d. cluster analysis
d. cluster analysis
In preparing categorical variables for analysis, it is usually best to _____. a. let them remain categorical b. convert the categories to numeric representations c. combine as many categories as possible d. convert the categories to binary, dummy variables
d. convert the categories to binary, dummy variables
Single linkage can be used to measure the distance between clusters that are the _____ in cluster analysis. a. closest b. farthest apart c. most different d. most similar
d. most similar
Euclidean distance can be used to measure the distance between _____ in cluster analysis. a. clusters b. objects c. ward d. observations
d. observations
The process of converting a word to its stem, or root word, is referred to as _____. a. tokenization b. stacking c. data cleaning d. stemming
d. stemming
When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called _____. a. Jaccard's coefficient b. Euclidean distance c. the antecedent d. the matching coefficient
d. the matching coefficient
Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance. a. 72.28 b. 75.39 c. 88.57 d. 66.21
b. 75.39
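The arithmetic can be checked with a short sketch, using the coordinates u = (25, 350) and v = (53, 420) from the question: sqrt((53 - 25)^2 + (420 - 350)^2) = sqrt(784 + 4900) = sqrt(5684).

```python
import math

# Observations as (age, dollars spent), taken from the question's coordinates.
u = (25, 350)
v = (53, 420)

# Euclidean distance: square root of the sum of squared coordinate differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
print(round(distance, 2))  # prints 75.39
```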
In which of the following scenarios would it be appropriate to use hierarchical clustering? a. When the number of observations in the dataset is relatively high b. When binary or ordinal data needs to be clustered c. When the number of clusters is known beforehand d. When it is not necessary to know the nesting of clusters
b. When binary or ordinal data needs to be clustered
The strength of the association rule is known as _____ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence. a. consequent b. lift c. support count d. antecedent
b. lift
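A rough sketch of how support count, confidence, and lift fit together, assuming the usual definitions: confidence is the support count of antecedent and consequent together divided by the support count of the antecedent, and benchmark confidence is the support count of the consequent divided by the total number of transactions. The transactions and item names below are made up for illustration.

```python
# Hypothetical transaction data; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly"},
    {"bread", "milk", "jelly"},
    {"milk"},
    {"bread", "milk"},
]

def support_count(items):
    """Number of transactions that contain every item in the item set."""
    return sum(1 for t in transactions if items <= t)

antecedent = {"bread"}
consequent = {"milk"}

# Rule: if bread, then milk.
confidence = support_count(antecedent | consequent) / support_count(antecedent)
benchmark_confidence = support_count(consequent) / len(transactions)
lift = confidence / benchmark_confidence

print(round(lift, 4))  # prints 0.9375
```

A lift below 1, as in this made-up data, suggests the antecedent does not make the consequent more likely than it already is overall.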
Which statement is true of an association rule? a. It is a data reduction technique that reduces large information into smaller homogeneous groups. b. It is ultimately judged on how actionable it is and how well it explains the relationship between item sets. c. It uses analytic models to describe the relationship between metrics that drive business performance. d. It seeks to classify a categorical outcome into two or more categories.
b. It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
_____ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters. a. Average linkage b. Average group linkage c. Complete linkage d. Single linkage
c. Complete linkage
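The four linkage measures that appear in these cards can be compared side by side. This is a sketch on two made-up clusters of 2-D points, assuming Euclidean distance between observations; the cluster values and function names are illustrative only.

```python
import math
from itertools import product

# Two hypothetical clusters of 2-D observations.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (5, 2)]

# Distances between every pair of observations, one from each cluster.
pair_distances = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def centroid(cluster):
    """Coordinate-wise mean of the observations in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

single = min(pair_distances)                         # two closest observations
complete = max(pair_distances)                       # two most dissimilar observations
average = sum(pair_distances) / len(pair_distances)  # mean over all pairs
centroid_linkage = math.dist(centroid(cluster_a), centroid(cluster_b))

print(single, complete, average, centroid_linkage)
```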
Jaccard's coefficient is different from the matching coefficient in that the former _____. a. deals with categorical variables while the latter deals with continuous variables b. is affected by the scale used to measure variables while the latter is not c. does not count matching zero entries while the latter does d. measures overlap while the latter measures dissimilarity
c. does not count matching zero entries while the latter does
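A small sketch of the difference, assuming the usual definitions: the matching coefficient is the number of variables with matching values divided by the total number of variables, while Jaccard's coefficient drops the matching zero entries from both the count and the total. The two observations are made up, and the sketch assumes they are not both all zeros (otherwise Jaccard's denominator would be zero).

```python
def matching_coefficient(a, b):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def jaccard_coefficient(a, b):
    """Share of agreeing variables after ignoring variables where both are 0."""
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return both_one / (len(a) - both_zero)

# Hypothetical observations described by five 0-1 dummy variables.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]

print(matching_coefficient(u, v))  # prints 0.8
print(jaccard_coefficient(u, v))   # prints 0.5
```

The three shared zeros raise the matching coefficient but are left out of Jaccard's coefficient.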
The _____ the lift ratio, the _____ the association rule. a. higher; weaker b. lower; stronger c. higher; stronger d. lower; weaker
c. higher; stronger
An analysis of items frequently co-occurring in transactions is known as _____. a. market segmentation b. regression analysis c. market basket analysis d. cluster analysis
c. market basket analysis
In k-means clustering, k represents the _____. a. number of variables b. mean of the cluster c. number of clusters d. number of observations in a cluster
c. number of clusters
The process of dividing text into separate terms is referred to as _____. a. stemming b. data cleaning c. tokenization d. stacking
c. tokenization
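A very simple sketch of tokenization, assuming terms are runs of letters and that case and punctuation can be discarded. Real text mining tools use more careful rules; the function name tokenize and the sample sentence are illustrative only.

```python
import re

def tokenize(text):
    """Split text into lowercase terms, treating any non-letter as a separator."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("The agent was helpful, very helpful!"))
# prints ['the', 'agent', 'was', 'helpful', 'very', 'helpful']
```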