9. Predictive Data Mining

Boosting

An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previous models. Its predictions are based on the weighted average of the predictions of the individual models. The weights are proportional to the individual models' accuracy.
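
As a rough illustrative sketch (not from the original material), AdaBoost is one common boosting algorithm; the scikit-learn call below is an assumed tooling choice and the data are synthetic:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # synthetic data
    # Each successive model focuses on observations earlier models mispredicted;
    # the final prediction is a weighted combination of the individual models.
    model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(model.score(X, y))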

False Positive

An observation classified as part of a group with a characteristic when it actually does not have the characteristic.

Accuracy

One minus the overall error rate is often referred to as the accuracy of the model.

Cutoff value

The smallest value that the predicted probability of an observation can be for the observation to be classified as Class 1.
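
For example (a minimal sketch with made-up probabilities), a cutoff of 0.5 classifies an observation as Class 1 whenever its predicted probability is at least 0.5:

    import numpy as np

    predicted_prob = np.array([0.81, 0.46, 0.62, 0.09])  # hypothetical probabilities
    cutoff = 0.5
    predicted_class = (predicted_prob >= cutoff).astype(int)  # 1 if prob >= cutoff, else 0
    print(predicted_class)  # [1 0 1 0]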

Bias

The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.

Lagged variable

The value of an independent variable from the prior period.
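
A minimal sketch using pandas (an assumed library choice, with made-up data): shifting a column by one period creates the lagged variable:

    import pandas as pd

    sales = pd.DataFrame({"period": [1, 2, 3, 4], "sales": [100, 120, 90, 110]})
    # The lagged variable holds the prior period's value; the first period has no prior value.
    sales["sales_lag1"] = sales["sales"].shift(1)
    print(sales)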

k-Nearest neighbors

A classification method that classifies an observation based on the class of the k observations most similar or nearest to it.
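
A minimal scikit-learn sketch (illustrative only, with made-up observations): a new observation is classified by the classes of its k nearest neighbors:

    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # training observations (two features)
    y = [0, 0, 1, 1]                       # their classes
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[8, 9]]))           # classified by its 3 nearest neighbors -> [1]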

Variable (feature)

A characteristic or quantity of interest that can take on different values.

Receiver operating characteristic (ROC) curve

A chart used to illustrate the tradeoff between a model's ability to identify Class 1 observations and its Class 0 error rate.
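
A sketch of the quantities behind the curve (scikit-learn assumed, hypothetical data): each cutoff yields a sensitivity (true positive rate) and a Class 0 error rate (false positive rate):

    from sklearn.metrics import roc_curve, roc_auc_score

    actual = [1, 1, 0, 1, 0, 0]                  # hypothetical actual classes
    prob   = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]      # hypothetical Class 1 probabilities
    fpr, tpr, cutoffs = roc_curve(actual, prob)  # fpr = Class 0 error rate, tpr = sensitivity
    print(list(zip(fpr, tpr)))
    print(roc_auc_score(actual, prob))           # area under the ROC curve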

Cumulative lift chart

A chart used to present how well a model performs in identifying observations most likely to be in Class 1 as compared with random classification.
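
A rough sketch (hypothetical data) of the quantities behind a cumulative lift chart: rank observations by predicted probability and compare the cumulative Class 1 count against what random classification would be expected to find:

    import numpy as np

    prob   = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # hypothetical probabilities
    actual = np.array([1,   1,   0,   1,   0,   1,   0,   0])    # actual classes

    order = np.argsort(-prob)                    # most likely Class 1 first
    cum_found = np.cumsum(actual[order])         # cumulative Class 1 records identified
    random_baseline = actual.mean() * np.arange(1, len(actual) + 1)
    print(cum_found / random_baseline)           # cumulative lift at each depth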

Logistic Regression

A generalization of linear regression for predicting a categorical outcome variable.
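
A minimal scikit-learn sketch (assumed tooling, synthetic data): fit a logistic regression and recover predicted Class 1 probabilities:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=0)  # synthetic data
    model = LogisticRegression().fit(X, y)
    print(model.predict_proba(X[:3])[:, 1])  # predicted probability of Class 1
    print(model.predict(X[:3]))              # predicted class using the default 0.5 cutoff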

Classification confusion matrix

A matrix showing the counts of actual versus predicted class values.
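
An illustrative sketch (scikit-learn assumed, made-up labels) of how the counts are arranged:

    from sklearn.metrics import confusion_matrix

    actual    = [1, 0, 1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0]
    # Rows are actual classes, columns are predicted classes:
    # [[true negatives, false positives],
    #  [false negatives, true positives]]
    print(confusion_matrix(actual, predicted))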

F1 Score

A measure combining precision and sensitivity (recall) into a single metric; it is the harmonic mean of the two.
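
A small worked illustration with hypothetical confusion-matrix counts:

    tp, fp, fn = 40, 10, 20                      # hypothetical counts
    precision = tp / (tp + fp)                   # 0.80
    sensitivity = tp / (tp + fn)                 # 0.667
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean, about 0.727
    print(f1)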

Mallow's Cp statistic

A measure in which small values approximately equal to the number of coefficients suggest promising logistic regression models.

Accuracy

A measure of classification success. Defined as 1 minus the overall error rate.

Root mean squared error

A measure of the accuracy of an estimation method defined as the square root of the mean of the squared deviations between the actual values and predicted values of observations.
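
A minimal sketch of the computation (numpy assumed, made-up values):

    import numpy as np

    actual    = np.array([10.0, 12.0, 9.0, 15.0])
    predicted = np.array([11.0, 11.0, 10.0, 13.0])
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # square root of the mean squared deviation
    print(rmse)  # about 1.32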

Estimation

A predictive data mining task requiring the prediction of an observation's continuous outcome value.

Classification

A predictive data mining task requiring the prediction of an observation's outcome class or category.

Ensemble method

A predictive data-mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining the individual predictions.

Features

A set of input variables used to predict an observation's outcome class or continuous outcome value.

Observation (record)

A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.

Model overfitting

A situation in which a model explains random patterns in the data on which it is trained rather than just the underlying relationships, resulting in training-set accuracy that far exceeds its accuracy on new data.

Classification tree

A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.

Regression tree

A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules.
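
A minimal sketch of both tree types described above (scikit-learn assumed, made-up data):

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = [[1], [2], [8], [9]]                 # one feature, four observations
    # Classification tree: hierarchical splitting rules that assign a class.
    clf = DecisionTreeClassifier(max_depth=2).fit(X, [0, 0, 1, 1])
    print(clf.predict([[7]]))                # -> [1]
    # Regression tree: the same splitting idea, predicting a continuous outcome.
    reg = DecisionTreeRegressor(max_depth=2).fit(X, [1.0, 1.2, 8.5, 9.1])
    print(reg.predict([[7]]))                # -> a value near 8.5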

Random forests

A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).

Bagging

An ensemble method that generates a committee of models based on different random samples and makes predictions based on the average prediction of the set of models.
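
As a sketch of bagging and its random-forest variant described above (scikit-learn assumed, synthetic data): bagging fits models on bootstrap samples and combines their predictions; a random forest additionally restricts each tree to a random subset of features:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

    X, y = make_classification(n_samples=300, random_state=0)
    bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)   # bootstrap samples
    rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",    # random feature subsets
                                random_state=0).fit(X, y)
    print(bag.score(X, y), rf.score(X, y))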

Supervised learning

Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

Test set

Data set used to compute an unbiased estimate of the final predictive model's accuracy.

Training set

Data used to build candidate predictive models.

Unstable

When small changes in the training set cause a model's predictions to fluctuate substantially.

Validation set

Data used to evaluate candidate predictive models.

Impurity

Measure of the heterogeneity of observations in a classification tree.

Average error

The average difference between the actual values and predicted values of observations.

False positive

The misclassification of a Class 0 observation as Class 1.

False negative

The misclassification of a Class 1 observation as Class 0.

Overall error rate

The percent of misclassified records out of the total records in the validation data.

Class 0 error rate

The percentage of Class 0 observations misclassified by a model in a data set.

Sensitivity (recall)

The percentage of actual Class 1 observations correctly identified.

Specificity

The percentage of actual Class 0 observations correctly identified.

Class 1 error rate

The percentage of actual Class 1 observations misclassified by a model in a data set.

Overall error rate

The percentage of observations misclassified by a model in a data set.

Precision

The percentage of observations predicted to be Class 1 that actually are Class 1.
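
A small worked sketch tying these rates together, using hypothetical confusion-matrix counts:

    tp, fn = 30, 10     # actual Class 1 observations
    tn, fp = 50, 10     # actual Class 0 observations

    sensitivity = tp / (tp + fn)                      # 0.75: actual Class 1 correctly identified
    specificity = tn / (tn + fp)                      # 0.83: actual Class 0 correctly identified
    precision = tp / (tp + fp)                        # 0.75: predicted Class 1 that actually are Class 1
    overall_error = (fp + fn) / (tp + tn + fp + fn)   # 0.20: share of all observations misclassified
    print(sensitivity, specificity, precision, overall_error)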

Decile-wise lift chart

Used to present how well a model performs at identifying observations for each of the top k deciles most likely to be in Class 1 versus a random selection.

Prediction methods

are also referred to as estimation methods.

Logistic regression

attempts to classify a categorical outcome by modeling the log odds of Class 1 membership as a linear function of the explanatory variables.

Classification confusion matrix

displays a model's correct and incorrect classification.

Data exploration

involves descriptive statistics, data visualization, and clustering.

Supervised learning

is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

Logistic regression

is a generalization of linear regression for predicting a categorical outcome variable.

Data sampling

is a method of extracting data relevant to the business problem under consideration. It is the first step in the Data Mining process.

Data partitioning

is dividing the sample data into training, validation, and test sets used to build and assess the performance of the data-mining algorithm.
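
A minimal sketch of one way to create the three partitions (scikit-learn assumed, synthetic data, split sizes chosen for illustration):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    # Hold out 20% as the test set, then split the remainder into training and validation sets.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
    print(len(X_train), len(X_val), len(X_test))   # 600 200 200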

Specificity

is one minus the Class 0 error rate.

Data preparation

is the manipulation of the data with the goal of putting it in a form suitable for formal modeling.

Data preparation

is the step in the data-mining process that includes addressing missing and erroneous data, reducing the number of variables, defining new variables, and data exploration.

The x-axis of a cumulative lift chart shows

the number of observations considered, ranked from most to least likely to be Class 1; the y-axis shows the cumulative number of actual Class 1 records identified.

