MKTG 451 exam 2
Two necessary conditions for an ensemble to perform better than a single model:
1. Individual base models are constructed independently of each other 2. Individual models perform better than just randomly guessing
Two primary steps to an ensemble approach:
1. The development of a committee of individual base models 2. The combination of the individual base models' predictions
durability
A cluster's _____________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.
seasonal pattern.
A time series that shows a recurring pattern over one year or less is said to follow a
dendrogram.
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a
market basket analysis.
An analysis of items frequently co-occurring in transactions is known as
false positive.
An observation classified as part of a group with a characteristic when it actually does not have the characteristic is termed as a(n)
Median linkage
Analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters
ultimately judged on how actionable it is and how well it explains the relationship between item sets; useful if it is well supported and explains an important, previously unknown relationship
Association Rules
Misclassifying an actual ______ observation as a(n) ______ observation is known as a false positive.
Class 0, Class 1
error.
Classifying a record as belonging to one class when it belongs to another class is referred to as a
training data
Data used to build a data mining model is called
supervised learning.
Data-mining methods for predicting an outcome based on a set of input variables is referred to as
Group Average linkage
Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters
Classification confusion matrix
Displays a model's correct and incorrect classifications
Supervised learning
For prediction and classification
Confidence:
Helps identify reliable association rules
missing completely at random (MCAR)
If the missing value is a random occurrence, it is called a data value
missing at random (MAR)
If the missing values are not completely random (i.e., correlated with the values of some other variables), these are called
Which of the following is true of Euclidean distances?
It is commonly used as a method of measuring dissimilarity between quantitative observations
Lift ratio:
Measure to evaluate the efficiency of a rule
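A minimal sketch of how support, confidence, and lift can be computed for a single rule, using made-up transactions and item names (the rule {bread} -> {butter} is purely illustrative):

```python
# Hypothetical toy transactions for illustrating association rule measures
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
antecedent, consequent = {"bread"}, {"butter"}   # "if" and "then" item sets

support_a = sum(antecedent <= t for t in transactions) / n              # support of the antecedent
support_c = sum(consequent <= t for t in transactions) / n              # support of the consequent
support_ac = sum((antecedent | consequent) <= t for t in transactions) / n  # joint support of the rule

confidence = support_ac / support_a   # how often the consequent appears when the antecedent does
lift = confidence / support_c         # confidence relative to the consequent's baseline rate

print(f"support={support_ac:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```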
What causes autocorrelation?
Misspecification, data manipulation, spatial ordering, and inertia
Causal model
Models that include only variables that are believed to cause changes in the variable to be forecast
Euclidean distance
Most common measure of dissimilarity when observations include continuous variables
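A quick sketch of the Euclidean distance between two hypothetical observations measured on continuous variables (NumPy assumed available):

```python
import numpy as np

# Two made-up observations measured on three continuous variables
u = np.array([2.0, 5.0, 1.0])
v = np.array([4.0, 1.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((u - v) ** 2))
print(distance)  # equivalently: np.linalg.norm(u - v)
```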
Decile-wise lift chart:
Observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups
accuracy
One minus the overall error rate is often referred to as the
Cutoff value
Probability value used to understand the tradeoff between Class 1 error rate and Class 0 error rate
Dimension reduction
Process of removing variables from the analysis without losing any crucial information
Cumulative lift chart:
Sorts observations in decreasing order of their estimated probability of being in Class 1, then builds the cumulative distribution of the number of actual Class 1 observations
specificity
The ability to correctly predict Class 0 (negative) observations is commonly expressed as
sensitivity or recall
The ability to correctly predict Class 1 (positive) observations is commonly expressed as
Which of the following reasons contributes to the increase in the use of data-mining techniques in business?
The ability to electronically warehouse data
cluster analysis.
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called
horizontal pattern
The moving averages and exponential smoothing methods are appropriate for a time series exhibiting
Single linkage
The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar
Which of the following is not true of a stationary time series?
The time series plot is a straight line.
In which of the following scenarios would it be appropriate to use hierarchical clustering?
When binary or ordinal data needs to be clustered.
Precision
a measure that corresponds to the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1
The exponential smoothing forecast for period t + 1 is a weighted average of the
actual value in period t with weight α and the forecast for period t with weight 1 - α.
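A minimal sketch of the exponential smoothing recursion F(t+1) = α·y(t) + (1 − α)·F(t), using made-up demand values and initializing the first forecast to the first observation (a common convention):

```python
def exponential_smoothing(series, alpha):
    # F(t+1) = alpha * y(t) + (1 - alpha) * F(t); F(1) set to y(1) by convention
    forecasts = [series[0]]
    for t in range(1, len(series)):
        forecasts.append(alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts

demand = [17, 21, 19, 23, 18, 16, 20]   # hypothetical time series values
print(exponential_smoothing(demand, alpha=0.4))
```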
A causal model provides evidence of ______________ between an independent variable and the variable to be forecast.
an association
Unsupervised learning
approaches are designed to describe patterns and relationships in large data sets with many observations of many variables; used to detect patterns and relationships in the data
k-means clustering
assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible
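A minimal k-means sketch using scikit-learn's KMeans on made-up two-variable data; the values and the choice of k = 2 are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical observations with two continuous variables
X = np.array([[25, 40], [27, 45], [52, 120], [55, 130], [30, 50], [60, 125]])

# Assign each observation to one of k = 2 clusters so that observations
# in the same cluster are as similar (close) as possible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # cluster centroids
```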
Logistic regression
attempts to classify a categorical outcome (y = 0 or 1) as a linear function of explanatory variables
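A minimal logistic regression sketch with scikit-learn on a made-up one-variable, binary-outcome data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one explanatory variable, categorical outcome (y = 0 or 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict(X))         # predicted class labels
print(model.predict_proba(X))   # estimated probabilities of Class 0 and Class 1
```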
Random forests (random trees)
can be viewed as a variation of bagging specifically tailored for use with classification or regression trees
Classification
categorical outcomes
Classification Trees:
classify categorical outcomes
F1 Score
combines precision and sensitivity into a single measure: F1 = 2 × (precision × sensitivity) / (precision + sensitivity)
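A short sketch computing accuracy, precision, sensitivity (recall), specificity, and F1 from hypothetical confusion-matrix counts (the TP, FP, TN, FN values are made up):

```python
# Hypothetical confusion-matrix counts (Class 1 = positive)
tp, fp, tn, fn = 40, 10, 35, 15

accuracy    = (tp + tn) / (tp + fp + tn + fn)   # 1 - overall error rate
precision   = tp / (tp + fp)                    # predicted Class 1 that are actually Class 1
sensitivity = tp / (tp + fn)                    # recall: actual Class 1 correctly predicted
specificity = tn / (tn + fp)                    # actual Class 0 correctly predicted
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, precision, sensitivity, specificity, round(f1, 3))
```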
McQuitty's method
considers merging two clusters A and B and computes the dissimilarity of the resulting cluster AB to any other cluster C as the average of the dissimilarity between A and C and the dissimilarity between B and C
Prediction
continuous outcome
The Durbin-Watson test checks ONLY for:
first-order serial correlation
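A minimal sketch of the first-order Durbin-Watson statistic, computed directly from a made-up vector of time-ordered residuals (values near 2 suggest no first-order serial correlation):

```python
import numpy as np

# Hypothetical regression residuals ordered in time
residuals = np.array([1.2, 0.8, 1.1, -0.4, -0.9, -1.1, 0.3, 0.7])

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(dw)
```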
boosting method
generates its committee of individual base models by adaptively sampling multiple training sets (more expensive)
Data partitioning
helps us assess the accuracy of data-based estimates of variable effects, e.g., by avoiding overfitting
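A minimal partitioning sketch using scikit-learn's train_test_split on randomly generated data; the 70/30 split is an illustrative choice, not a rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations, 3 input variables, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 30% of the observations so model accuracy is assessed on data
# not used to build the model (guards against overfitting)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_valid.shape)
```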
Clustering methods:
hierarchical clustering & k-means clustering
A classification or estimation method is unstable
if relatively small changes in the training set cause its predictions to fluctuate
A sample is representative
if the analyst can make the same conclusions from it as from the entire population of data
Data is missing not at random (MNAR)
if the reason that the value is missing is related to the value of the variable
Association rules
if-then statements that convey the likelihood of certain items being purchased together
Autoregressive models
models in which all of the independent variables are previous (lagged) values of the time series itself
Complete linkage
measures dissimilarity between clusters by considering only the two most dissimilar (most different) observations in the two clusters
Impurity
is a measure of the heterogeneity of observations in a classification tree.
Ward's method
merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible
In the moving averages method, the order k determines the
number of time series values under consideration.
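A short sketch of a moving averages forecast of order k on made-up sales values; each forecast averages the k most recent observations:

```python
def moving_average_forecast(series, k):
    # Forecast for each period is the average of the k most recent observed values
    return [sum(series[t - k:t]) / k for t in range(k, len(series) + 1)]

sales = [10, 12, 11, 13, 15, 14, 16]        # hypothetical time series
print(moving_average_forecast(sales, k=3))  # first forecast is for period 4
```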
Overall error rate:
percentage of misclassified observations
ensemble method
predictions are made based on the combination of a collection of models
Forecasting model
provides evidence only of association between an independent variable and the variable to be forecast.
Overfitting
refers to the scenario in which the analyst builds a model that does a great job of explaining the sample of data on which it is based but fails to accurately predict outside the sample data.
hierarchical clustering
starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters
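A minimal hierarchical clustering sketch with SciPy on made-up two-variable data; the method argument selects among the linkage definitions above (single, complete, average, centroid, median, ward):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical observations with two continuous variables
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5], [1.2, 2.2], [5.5, 7.8]])

# Agglomerative clustering: sequentially merges the most similar clusters
Z = linkage(X, method="ward", metric="euclidean")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
print(labels)
# dendrogram(Z) draws the tree of nested clusters (requires matplotlib)
```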
Classification and regression trees (CART)
successively partition a data set of observations into increasingly smaller and more homogeneous (pure) subsets
regression tree
successively partitions observations of the training set into smaller and smaller groups like a classification tree
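A minimal sketch of a classification tree and a regression tree using scikit-learn on made-up data; the inputs, outcomes, and max_depth setting are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical data: two input variables; categorical outcome for the
# classification tree, continuous outcome for the regression tree
X = np.array([[2, 30], [3, 45], [6, 80], [7, 95], [2, 35], [8, 100]])
y_class = np.array([0, 0, 1, 1, 0, 1])               # categorical outcome
y_cont = np.array([1.2, 1.5, 3.8, 4.1, 1.3, 4.5])    # continuous outcome

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)  # classification tree
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_cont)    # regression tree
print(clf.predict([[5, 70]]), reg.predict([[5, 70]]))
```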
Forecast error
the difference between the actual (observed) value and the predicted value of the time series variable
Antecedent
the collection of items (or item set) corresponding to the if portion of the rule
bagging approach
the committee of individual base models is generated by first constructing multiple training sets by repeated random sampling of the n observations in the original data with replacement
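A minimal bagging sketch using scikit-learn's BaggingClassifier on randomly generated data; by default each base model is a classification tree fit to a bootstrap sample drawn with replacement, and the predictions are combined by voting:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Hypothetical training data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 50 base models, each trained on a bootstrap sample of the original observations
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(bag.predict(X[:5]))
```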
Consequent
the item set corresponding to the then portion of the rule
Trend
the long-run shift or movement in the time series observable over several periods of time.
Autocorrelation
occurs when time series residuals are found to be serially correlated with their own lagged values (i.e., they are not independent)
A positive forecast error indicates that the forecasting method ________ the dependent variable.
underestimated
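Worked example with illustrative numbers: if the actual value in period t is 120 and the forecast was 100, the forecast error is 120 - 100 = +20; the positive error means the method underestimated the actual value.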
Exponential smoothing
uses a weighted average of past time series values as the forecast.
Centroid linkage
uses the averaging concept of cluster centroids to define between-cluster similarity
Regression models are:
very flexible and can incorporate both causal variables and time series effects