INFO 320_Exam 2
The data in a data set are often said to be ______ and ______ before they have been preprocessed
"dirty", "raw"
Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer who spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer who spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance.
75.39
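A minimal sketch of the arithmetic behind this answer, using the two observations as written, u = (25, 350) and v = (53, 420). The function name `euclidean_distance` is only an illustrative choice.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (25, 350)  # (age, dollars spent)
v = (53, 420)

# sqrt((53 - 25)^2 + (420 - 350)^2) = sqrt(784 + 4900) = sqrt(5684)
d = euclidean_distance(u, v)
print(round(d, 2))  # 75.39
```

Note that the dollar difference (70) dominates the age difference (28) because the variables are on different scales, which is one reason variables are often standardized before clustering.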
Data preparation includes all of the following except which task?
Calculating the confidence ratio for all association rules
____________________ measures cluster similarity by calculating the distance between the centroids of the two clusters.
Centroid linkage
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called
Cluster analysis
__________________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.
Complete linkage
Supervised learning
For prediction and classification
missing completely at random (MCAR)
If the missing value is a random occurrence
Which of the following is true for Euclidean distances?
It is commonly used as a method of measuring dissimilarity between quantitative observations.
Which statement is true of an association rule?
It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
In which of the following scenarios would it be appropriate to use hierarchical clustering?
When binary or ordinal data needs to be clustered.
Median linkage method
analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters. The use of the median reduces the effect of outliers.
Association rules
convey the likelihood of certain items being purchased together
The strength of the association rule is known as ____________ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.
lift
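A small sketch of how lift could be computed from a transaction list. The transactions and item names below are made up for illustration, and the helper names are assumptions, not part of the course material. Confidence is the support count of antecedent and consequent together divided by the support count of the antecedent; the benchmark confidence is the share of all transactions containing the consequent.

```python
# Hypothetical transaction data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "jelly", "peanut butter"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    benchmark = support_count(consequent, transactions) / len(transactions)
    return confidence(antecedent, consequent, transactions) / benchmark

# Rule: if {bread} then {jelly}
conf_value = confidence({"bread"}, {"jelly"}, transactions)  # 3 / 4
lift_value = lift({"bread"}, {"jelly"}, transactions)        # 0.75 / 0.60
print(round(conf_value, 2), round(lift_value, 2))  # 0.75 1.25
```

A lift above 1 suggests the antecedent makes the consequent more likely than it is in the data set overall.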
An analysis of items frequently co-occurring in transactions is known as
market basket analysis
Ward's method
merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible. Tends to produce clearly defined clusters of similar size.
If a variable is missing measurements for a large number of observations
removing this variable from consideration may be an option
Observation
set of recorded values of variables associated with a single entity
Observation refers to the
set of recorded values of variables associated with a single entity
If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?
the hypotenuse
Dimension Reduction
the process of removing variables from the analysis without losing any crucial information
Goal of Clustering
to segment observations into similar groups based on the observed variables
Centroid linkage
uses the averaging concept of cluster centroids to define between-cluster similarity. The centroid is the average observation of a cluster. The similarity between two clusters is then defined as the similarity of the centroids of the two clusters.
Dendrogram
visually summarizes the output from a hierarchical clustering (for example, one using the matching coefficient to measure similarity between observations and the group average linkage clustering method to measure similarity between clusters). A chart that depicts the set of nested clusters resulting at each step of aggregation.
Which is NOT a primary option for addressing missing data?
To generate random data to replace the missing values
In preparing categorical variables for analysis, it is usually best to
Convert the categories to binary, dummy variables
In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?
Data preparation
Variable Representation
Dimension reduction. A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. It is best to encode categorical variables with 0-1 dummy variables.
Identification of Outliers and Erroneous Data
Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools
missing at random (MAR)
If the missing values are not completely random (i.e., correlated with the values of some other variables)
The endpoint of a k-means clustering algorithm occurs when
No further changes are observed in cluster structure and number
In K-means clustering, k represents the
Number of clusters
Data-mining approaches can be separated into two categories:
Supervised learning Unsupervised learning
Which of the following reasons contributes to the increase in the use of data-mining techniques in business?
The ability to electronically warehouse data
If a model's implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down
The cause of the outliers
The increase in the use of data-mining techniques in business has been caused largely by three events
The explosion in the amount of data being produced and electronically tracked; the ability to electronically warehouse these data; the affordability of computer power to analyze the data
Measuring Similarity between Observations
The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. We need explicit measurements of similarity or dissimilarity. Euclidean distance: the most common method to measure dissimilarity between observations.
Data Preparation
Treatment of Missing Data; Identification of Outliers and Erroneous Data; Variable Representation
The process of eliminating variables from formal analysis without losing any crucial information is called
dimension reduction
Jaccard's coefficient is different from the matching coefficient in that the former
does not count matching zero entries while the latter does.
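To make the difference concrete, here is a minimal sketch that computes both coefficients for two 0-1 observations. The vectors are hypothetical, and returning 0.0 when two observations have no nonzero entries at all is an assumption made here to avoid dividing by zero.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zero entries are ignored."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    # Assumption: treat two all-zero observations as having similarity 0.0.
    return both_one / denominator if denominator else 0.0

# Hypothetical dummy-variable observations.
u = [1, 0, 0, 1, 0]
v = [1, 0, 1, 0, 0]

m = matching_coefficient(u, v)  # 3 agreeing positions out of 5
j = jaccard_coefficient(u, v)   # 1 shared "1" out of 3 non-(0,0) positions
print(round(m, 2), round(j, 2))  # 0.6 0.33
```

The two shared zeros raise the matching coefficient but contribute nothing to Jaccard's coefficient.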
The __________ the lift ratio, the ____________ the association rule.
higher; stronger
missing not at random (MNAR)
if the reason that the value is missing is related to the value of the variable
When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called the
matching coefficient.
Single-linkage clustering method
the similarity between two clusters is defined by the similarity of the pair of observations that are the most similar. Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. However, a cluster formed by merging two clusters that are close with respect to single linkage may also consist of pairs of observations that are very different. The reason is that there is no consideration of how different an observation may be from other observations in a cluster as long as it is similar to at least one observation in that cluster.
If a considerable number of observations have missing values
then replacing them with a value that seems reasonable may be useful, as it does not decrease the number of observations
Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?
001
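A short sketch of the encoding behind this answer, assuming one 0-1 dummy variable per option in the order given in the question. The helper name `encode` is hypothetical.

```python
# Options in the order stated in the question.
options = ["hear account information", "billing questions", "customer service"]

def encode(choice, categories):
    """Return one 0-1 digit per category; the chosen category gets a 1."""
    if choice not in categories:
        raise ValueError(f"unknown category: {choice!r}")
    return "".join("1" if c == choice else "0" for c in categories)

codes = {option: encode(option, options) for option in options}
print(codes["customer service"])  # 001
```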
The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio for between-cluster distance to within-cluster distance should exceed what value for useful clusters?
1
Dealing with missing data requires:
Understanding of why the data are missing and the impact of the missing data
The goal of _____________________ is to use the variable values to identify relationships between observations.
Unsupervised Learning
_____________________ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables.
Unsupervised learning
Cluster Analysis
commonly used in marketing to divide consumers into different homogeneous groups (market segmentation). Can be used to identify outliers.
Average linkage is a measure of calculating dissimilarity between two clusters by
computing the average distance between every pair of observations between two clusters.
Single linkage is a measure of calculating dissimilarity between clusters by
considering only the two most similar observations in the two clusters.
McQuitty's method
considers merging two clusters A and B; the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2
Data preparation makes heavy use of the:
descriptive statistics and data visualization methods
A cluster's _____________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.
durability
Euclidean distance can be used to measure the distance between________________ in cluster analysis.
observations
k-means clustering is the process of
organizing observations into distinct groups based on a measure of similarity.
Hierarchical Clustering
starts with each observation belonging to its own cluster and then iteratively combines the two most similar clusters into a single cluster, creating a series of nested clusters. Each iteration corresponds to an increased level of aggregation by decreasing the number of distinct clusters. Determines the similarity of two clusters by considering the similarity between the observations.
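The merge sequence described in this card can be sketched in a few lines. This is an illustrative version only: it assumes Euclidean distance between observations and single linkage between clusters, and the sample points are made up. Other linkage choices would change which clusters merge and at what distance.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Merge clusters pairwise until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each observation alone
    merges = []

    def single_link(a, b):
        # Distance between the closest pair of observations across clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters that are most similar (smallest distance).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_link(clusters[a], clusters[b])
        merges.append((clusters[a], clusters[b], distance))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerate(points)
for left, right, distance in merges:
    print(left, right, round(distance, 2))
```

The recorded merge distances are the heights a dendrogram would draw, so the gap between the distance at which a cluster forms and the distance at which it is next merged can be read off this history.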
A ___________ refers to the number of times a collection of items occur together in a transaction data set
support count
Treatment of Missing Data (primary options)
To discard observations with any missing values; to discard any variable with missing values; to fill in missing entries with estimated values; to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values
Unsupervised learning
To detect patterns and relationships in the data. Thought of as high-dimensional descriptive analytics. Designed to describe patterns and relationships in large data sets with many observations of many variables.
Options for replacing the missing entries for a variable include replacing the missing value with the variable's mode, mean, or median. Imputing values in this manner is truly valid only if variable values are
Missing Completely at Random (MCAR)
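A minimal sketch of the three replacement options named in this card, using Python's standard `statistics` module. The sample values are hypothetical and `None` is assumed to mark a missing entry.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of observed values."""
    observed = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing measurements.
values = [2, None, 7, 7, None, 10]

mean_filled = impute(values, "mean")      # observed mean is 26 / 4 = 6.5
median_filled = impute(values, "median")  # observed median is 7
mode_filled = impute(values, "mode")      # most frequent observed value is 7
print(mean_filled)
```

All three options keep every observation in the data set, but they shrink the variable's apparent spread, which is part of why the missingness mechanism matters.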
K-Means Clustering
assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are similar
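One way to sketch the assign-and-update loop behind k-means is shown below. It is an illustration under stated assumptions rather than a reference implementation: it uses Euclidean distance, takes the first k observations as starting centroids (real software usually picks starting points more carefully), and stops when cluster assignments no longer change. The sample points are made up.

```python
import math

def k_means(points, k, max_iter=100):
    """Return (labels, centroids) after iterating until assignments settle."""
    centroids = [points[i] for i in range(k)]  # assumption: naive starting points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no further change in cluster structure
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical observations forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here k is chosen by the analyst before the algorithm runs; the algorithm itself only decides which observations belong to which of the k clusters.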
Complete linkage clustering method
defines the similarity between two clusters as the similarity of the pair of observations that are the most different. Will consider two clusters to be close if their most different pair of observations are close. Can be distorted by outlier observations.
Group average linkage clustering method
defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters
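The single, complete, group average, and centroid linkage definitions in these cards can be compared side by side on a toy example. The sketch below assumes Euclidean distance as the dissimilarity measure, and the two clusters are hypothetical.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the two most similar observations across clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the two most dissimilar observations across clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over every cross-cluster pair of observations."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Average observation of a cluster, computed variable by variable."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (4, 0)]

results = {
    "single": single_linkage(cluster_a, cluster_b),
    "complete": complete_linkage(cluster_a, cluster_b),
    "average": average_linkage(cluster_a, cluster_b),
    "centroid": centroid_linkage(cluster_a, cluster_b),
}
for name, value in results.items():
    print(name, round(value, 2))
```

Single linkage always gives the smallest of the pairwise-based values and complete linkage the largest, with group average in between.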
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a
dendrogram