9. Predictive Data Mining
Boosting
An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previous models. Its predictions are based on the weighted average of the predictions of the individual models. The weights are proportional to the individual models' accuracy.
False Positive
An observation classified as part of a group with a characteristic when it actually does not have the characteristic.
Accuracy
One minus the overall error rate is often referred to as the accuracy of the model.
Cutoff value
The smallest value that the predicted probability of an observation can be for the observation to be classified as Class 1.
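As a sketch of the cutoff rule, the following applies a cutoff of 0.5 to a set of made-up predicted probabilities (both the probabilities and the cutoff are hypothetical):

```python
# Hypothetical predicted probabilities of Class 1 membership.
probs = [0.91, 0.42, 0.50, 0.07, 0.63]

cutoff = 0.5  # smallest probability that still yields a Class 1 label

# An observation is classified as Class 1 when its predicted probability
# meets or exceeds the cutoff, and as Class 0 otherwise.
classes = [1 if p >= cutoff else 0 for p in probs]
# classes == [1, 0, 1, 0, 1]
```

Lowering the cutoff labels more observations as Class 1 (raising sensitivity at the cost of more false positives); raising it does the reverse.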
Bias
The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.
Lagged variable
The value of an independent variable from the prior period.
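A lag-1 variable can be built by shifting a series one period, as in this sketch with made-up sales figures:

```python
sales = [100, 120, 90, 130]       # hypothetical values for periods 1..4

# Lag-1 variable: each period holds the prior period's value.
# The first period has no prior value, so it is left missing (None).
sales_lag1 = [None] + sales[:-1]  # [None, 100, 120, 90]
```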
k-Nearest neighbors
A classification method that classifies an observation based on the class of the k observations most similar or nearest to it.
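A minimal k-nearest neighbors sketch, using Euclidean distance and a majority vote over toy two-dimensional data (the points, labels, and function name are all illustrative):

```python
from collections import Counter
import math

def knn_classify(point, data, k):
    """Classify `point` by majority vote of the k nearest labeled observations.
    `data` is a list of ((x, y), label) pairs; distance is Euclidean."""
    nearest = sorted(data, key=lambda d: math.dist(point, d[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: two Class 0 points near the origin, two Class 1 points far away.
train = [((0, 0), 0), ((1, 0), 0), ((5, 5), 1), ((6, 5), 1)]

label = knn_classify((0.5, 0.5), train, k=3)  # two of the three nearest are Class 0
```

In practice k is typically chosen by comparing error rates on a validation set.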
Variable (feature)
A characteristic or quantity of interest that can take on different values.
Receiver operating characteristic (ROC) curve
A chart used to illustrate the tradeoff between a model's ability to identify Class 1 observations and its Class 0 error rate.
Cumulative lift chart
A chart used to present how well a model performs in identifying observations most likely to be in Class 1 as compared with random classification.
Logistic Regression
A generalization of linear regression for predicting a categorical outcome variable.
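The generalization works by modeling the log odds of Class 1 membership as a linear function of the inputs, so the predicted probability is the logistic (sigmoid) transform of that linear function. A sketch with hypothetical fitted coefficients:

```python
import math

# Hypothetical fitted coefficients for a single explanatory variable x.
b0, b1 = -4.0, 0.08

def predicted_probability(x):
    """Logistic regression: log odds = b0 + b1*x, so the estimated
    Class 1 probability is the logistic (sigmoid) of that value."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

p = predicted_probability(50)  # log odds = -4 + 0.08*50 = 0, so p = 0.5
```

Unlike linear regression, the output is always between 0 and 1, which is what makes it usable for classification via a cutoff value.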
Classification confusion matrix
A matrix showing the counts of actual versus predicted class values.
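The four cells of a two-class confusion matrix can be counted directly from actual and predicted labels, as in this sketch with made-up data:

```python
# Toy actual vs. predicted class labels (1 = has the characteristic).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives

accuracy = (tp + tn) / len(actual)  # 1 minus the overall error rate
```

Every classification performance measure in this list (accuracy, sensitivity, specificity, precision, the class error rates) is a ratio of these four counts.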
F1 Score
A measure combining precision and sensitivity into a single metric.
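Concretely, the F1 score is the harmonic mean of precision and sensitivity (recall). A sketch with hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # share of predicted Class 1 that is correct
recall    = tp / (tp + fn)  # share of actual Class 1 that is found

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 is low whenever either precision or recall is low, unlike a simple average.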
Mallow's Cp statistic
A measure in which small values approximately equal to the number of coefficients suggest promising multiple linear regression models.
Accuracy
A measure of classification success. Defined as 1 minus the overall error rate.
Root mean squared error
A measure of the accuracy of an estimation method defined as the square root of the average of the squared deviations between the actual values and predicted values of observations.
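The definition translates directly into code; the actual and predicted values below are made up:

```python
import math

actual    = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 12.0]

# Root mean squared error: square root of the mean squared deviation
# between actual and predicted values.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because the deviations are squared before averaging, RMSE penalizes a few large errors more heavily than many small ones.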
Estimation
A predictive data mining task requiring the prediction of an observation's continuous outcome value.
Classification
A predictive data mining task requiring the prediction of an observation's outcome class or category.
Ensemble method
A predictive data-mining approach in which a committee of individual classification or estimation models are generated and a prediction is made by combining these individual predictions.
Features
A set of input variables used to predict an observation's outcome class or continuous outcome value.
Observation (record)
A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.
Model overfitting
A situation in which a model explains random patterns in the data on which it is trained rather than just the relationships between variables, resulting in training-set accuracy that far exceeds accuracy on new data.
Classification tree
A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.
Regression tree
A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.
Random forests
A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).
Bagging
An ensemble method that generates a committee of models based on different random samples and makes predictions based on the average prediction of the set of models.
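A toy bagging sketch: each committee member is "trained" on a bootstrap sample (drawn with replacement), and the ensemble prediction averages the individual predictions. To keep the sketch self-contained, the individual "model" here is just the sample mean of the outcome, a deliberately trivial stand-in for a real tree or regression:

```python
import random
import statistics

def bagged_prediction(train, n_models=25, seed=0):
    """Bagging sketch: fit a trivial 'model' (the mean outcome) on each
    bootstrap sample, then average the individual predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the training set, drawn with replacement.
        sample = [rng.choice(train) for _ in train]
        preds.append(statistics.mean(y for _, y in sample))  # stand-in for a fitted model
    return statistics.mean(preds)

train = [(1, 2.0), (2, 4.0), (3, 6.0)]  # hypothetical (x, y) observations
estimate = bagged_prediction(train)
```

Averaging over many bootstrap-trained models reduces the variance of unstable base models, which is why bagging pairs well with trees.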
Supervised learning
Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
Test set
Data set used to compute unbiased estimate of final predictive model's accuracy.
Training set
Data used to build candidate predictive models.
Unstable
When small changes in the training set cause a model's predictions to fluctuate substantially.
Validation set
Data used to evaluate candidate predictive models.
Impurity
Measure of the heterogeneity of observations in a classification tree.
Average error
The average difference between the actual values and predicted values of observations.
False positive
The misclassification of a Class 0 observation as Class 1.
False negative
The misclassification of a Class 1 observation as Class 0.
Overall error rate
The percent of misclassified records out of the total records in the validation data.
Class 0 error rate
The percentage of Class 0 observations misclassified by a model in a data set.
Sensitivity (recall)
The percentage of actual Class 1 observations correctly identified.
Specificity
The percentage of actual Class 0 observations correctly identified.
Class 1 error rate
The percentage of actual Class 1 observations misclassified by a model in a data set.
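The four rate measures above are complementary pairs computed from the confusion-matrix counts, as this sketch with hypothetical counts shows:

```python
# Hypothetical confusion-matrix counts.
tp, fn = 40, 10   # actual Class 1 observations (found / missed)
tn, fp = 80, 20   # actual Class 0 observations (kept / misclassified)

sensitivity  = tp / (tp + fn)  # recall: actual Class 1 correctly identified
class1_error = fn / (tp + fn)  # = 1 - sensitivity
specificity  = tn / (tn + fp)  # actual Class 0 correctly identified
class0_error = fp / (tn + fp)  # = 1 - specificity
```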
Overall error rate
The percentage of observations misclassified by a model in a data set.
Precision
The percentage of observations predicted to be Class 1 that actually are Class 1.
Decile-wise lift chart
Used to present how well a model performs at identifying the observations most likely to be in Class 1 for each of the top k deciles versus a random selection.
Prediction methods
are also referred to as estimation methods.
Logistic regression
attempts to classify a categorical outcome by modeling its log odds as a linear function of explanatory variables.
Classification confusion matrix
displays a model's correct and incorrect classification.
Data exploration
involves descriptive statistics, data visualization, and clustering.
Supervised learning
is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
Logistic regression
is a generalization of linear regression for predicting a categorical outcome variable.
Data sampling
is a method of extracting data relevant to the business problem under consideration. It is the first step in the data-mining process.
Data partitioning
is dividing the sample data into three sets for training, validation, and testing of the data-mining algorithm performance.
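A partitioning sketch using a common (though arbitrary) 60/20/20 split; the record count, split proportions, and seed are all made up:

```python
import random

observations = list(range(100))  # stand-in for 100 data records
rng = random.Random(42)
rng.shuffle(observations)        # randomize before splitting

# 60/20/20 split of the sample data.
train      = observations[:60]    # build candidate models
validation = observations[60:80]  # compare candidate models
test       = observations[80:]    # unbiased accuracy estimate of the final model
```

Shuffling before splitting matters: if the data are ordered (e.g., by date or class), an unshuffled split produces unrepresentative partitions.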
Specificity
is one minus the Class 0 error rate.
Data preparation
is the manipulation of the data with the goal of putting it in a form suitable for formal modeling.
Data preparation
is the step in data-mining which includes addressing missing and erroneous data, reducing the number of variables, defining new variables, and data exploration.
The y axis of a cumulative lift chart shows
the number of actual Class 1 records identified; the x axis shows the number of observations considered.
