Business Analytics Final Ch. 9
______ is one minus the overall error rate.
Accuracy
As we increase the cutoff value, _______ error will decrease and _________ error will rise.
Class 0, Class 1
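A small sketch of the cutoff effect, using made-up probabilities and labels (the data and the helper name `class_errors` are illustrative assumptions, not from the course material). An observation is predicted as Class 1 when its estimated probability is at or above the cutoff.

```python
def class_errors(probs, actuals, cutoff):
    """Return (Class 0 error rate, Class 1 error rate) for a given cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    n0 = sum(1 for a in actuals if a == 0)
    n1 = sum(1 for a in actuals if a == 1)
    # Class 0 error: actual 0 predicted as 1; Class 1 error: actual 1 predicted as 0
    false_pos = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 0)
    false_neg = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 1)
    return false_pos / n0, false_neg / n1

# Hypothetical estimated probabilities of Class 1 and the actual classes
probs = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.10]
actuals = [1, 1, 0, 1, 1, 0, 0, 0]

low = class_errors(probs, actuals, 0.25)   # low cutoff
high = class_errors(probs, actuals, 0.75)  # high cutoff
print("cutoff 0.25:", low)
print("cutoff 0.75:", high)
```

On this toy data, raising the cutoff lowers the Class 0 error and raises the Class 1 error.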
______________ involves descriptive statistics, data visualization, and clustering.
Data exploration
___________ is dividing the sample data into three sets for training, validation, and testing of the data-mining algorithm performance.
Data partitioning
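One possible way to partition a sample, sketched with the standard library only. The 50/30/20 split mirrors the stroke-risk exercise in this set; the function name `partition` and the fixed seed are assumptions for illustration, and this is not how XLMiner itself does it.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Shuffle the records, then cut them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# 150 placeholder record IDs standing in for 150 observations
train, valid, test = partition(range(150))
print(len(train), len(valid), len(test))
```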
__________ is the manipulation of data with the goal of putting it in a form suitable for formal modeling.
Data preparation
__________ is the step in data mining that includes addressing missing and erroneous data, reducing the number of variables, defining new variables, and data exploration.
Data preparation
__________ is a method of extracting data relevant to the business problem under consideration. It is the first step in the data mining process.
Data sampling
Logistic regression is similar to linear regression, except that it attempts to classify ______________ as a linear function of explanatory variables.
a categorical response.
If the outcome variable is ________, we can use logistic regression as a classification tool.
binary
The best value of k can be determined by building models with k between 1 and 20 and selecting the value of k that results in the smallest
classification error.
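A minimal sketch of choosing k by validation classification error, assuming a single numeric input variable and made-up data. The helper names are hypothetical, and ties are broken by whichever class or k comes first, which is a simplification.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority class among the k training points nearest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def classification_error(train, valid, k):
    """Fraction of validation observations that are misclassified."""
    wrong = sum(1 for x, y in valid if knn_predict(train, x, k) != y)
    return wrong / len(valid)

def best_k(train, valid, max_k):
    """Try k = 1..max_k and keep the k with the smallest validation error."""
    errors = {k: classification_error(train, valid, k) for k in range(1, max_k + 1)}
    return min(errors, key=errors.get), errors

# Hypothetical (input value, class) pairs
train = [(1.0, 0), (1.5, 0), (2.0, 0), (5.0, 1), (5.5, 1), (6.0, 1)]
valid = [(1.2, 0), (1.8, 0), (5.2, 1), (5.8, 1)]

k, errors = best_k(train, valid, max_k=5)
print("best k:", k, "validation error:", errors[k])
```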
A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules is called a(n)
classification tree.
A _________ chart compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 with the number of actual Class 1 observations identified when observations are randomly selected.
cumulative lift
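The two quantities a cumulative lift chart plots can be sketched as follows. The probabilities and labels are made up, and `cumulative_lift` is a hypothetical helper name.

```python
def cumulative_lift(probs, actuals, n):
    """Class 1 count among the top-n observations by probability vs. a random pick of n."""
    ranked = sorted(zip(probs, actuals), key=lambda pa: pa[0], reverse=True)
    found_by_model = sum(a for _, a in ranked[:n])
    # A random selection finds Class 1 members at the overall Class 1 rate
    expected_random = n * sum(actuals) / len(actuals)
    return found_by_model, expected_random

# Hypothetical estimated probabilities of Class 1 and the actual classes
probs = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.10]
actuals = [1, 1, 0, 1, 1, 0, 0, 0]

model_count, random_count = cumulative_lift(probs, actuals, n=4)
print(model_count, random_count)
```

Here the top 4 observations contain 3 actual Class 1 members, while a random pick of 4 would be expected to contain 2.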
The choice of __________ affects classification errors.
cutoff value
An important part of ________ is applying the chosen model to the test data as a final evaluation of model performance.
model assessment
The set of recorded values of variables associated with a single entity is a(n)
observation
A negative average error suggests a tendency to ________ the output variable in the test data.
overestimate
Estimation methods are also referred to as
prediction methods.
A(n) _______________ is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables.
record
A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules is called a(n)
regression tree.
What shape does a logistic regression curve have?
S-shape
Data-mining methods for predicting an outcome based on a set of input variables are referred to as
supervised learning.
The data used to build the candidate predictive model are called the
training set.
A positive average error on the validation data suggests a tendency to _________ the output variable in the validation data.
underestimate
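A short sketch of average error and RMSE, assuming error is defined as actual minus predicted (the convention that makes a positive average error mean underestimation, as in the card above). The numbers are made up.

```python
import math

def average_error(actual, predicted):
    """Mean of (actual - predicted); the sign shows the direction of the bias."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def rmse(actual, predicted):
    """Root mean squared error; never negative because the errors are squared."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical actual and predicted values
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [9.0, 10.0, 13.0, 16.0]

avg = average_error(actual, predicted)
root = rmse(actual, predicted)
print(avg, root)
```

The predictions here sit below the actual values, so the average error is positive (underestimation). RMSE is always at least as large as the absolute value of the average error.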
The data used to evaluate candidate predictive models are called the
validation set.
A characteristic or quantity of interest that can take on different values is a(n)
variable.
A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the RMSE on the test data?
-1.32
A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What value of k minimizes the root mean squared error (RMSE) on the validation data?
10
Given the XLMiner output shown below, the best k for the Classification Model is ______. [Image: XLMiner output]
10
From the lift chart shown below, we can infer that if 200 observations with the largest estimated probabilities of being in Class 1 were selected, 150 of them would correspond to actual Class 1 members. If 200 cases were selected at random, approximately how many could we expect to be in Class 1? [Image: lift chart]
100
A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the average error on the test data?
14.65
A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the average error on the validation data?
15.31
Use the classification confusion matrix to determine the number of Class 1's in the data set that are correctly classified as Class 1. 1/1=201 1/0=25 0/1=85 0/0=2689
201. A classification confusion matrix displays a model's correct and incorrect classifications. The number of correct Class 1 classifications is 201, which comes from the cell corresponding to a predicted and actual class of 1.
Use the classification confusion matrix to determine the number of Class 0 observations in the data set that are incorrectly classified as Class 1. 1/1=201 1/0=25 0/1=85 0/0=2689
25
Use the classification confusion matrix to determine the number of Class 0 observations in the data set that are correctly classified as Class 0. 1/1=201 1/0=25 0/1=85 0/0=2689
2689
Given the following classification confusion matrix, what is the overall error rate? 1/1=201 1/0=25 0/1=85 0/0=2689
3.67%
Given the following classification confusion matrix, what is the overall error rate? 1/1=221 1/0=30 0/1=100 0/0=3000
3.88%. (30 + 100) / (221 + 30 + 100 + 3000) = 130/3351 ≈ 3.88%
A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the RMSE on the validation data?
4.84
Use the classification confusion matrix to determine the number of Class 1's in the data set that are incorrectly classified as Class 0. 1/1=201 1/0=25 0/1=85 0/0=2689
85
Given the following classification confusion matrix, what is the accuracy of the model? 1/1=221 1/0=30 0/1=100 0/0=3000
96.12%
Given the following classification confusion matrix, what is the accuracy of the model? 1/1=201 1/0=25 0/1=85 0/0=2689
96.33%. 1 - (85 + 25) / (201 + 85 + 25 + 2689) = 1 - 110/3000 ≈ 96.33%
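The error-rate and accuracy cards can be checked with a few lines of Python. This sketch assumes each cell is written as predicted/actual (so 1/0 means predicted Class 1, actual Class 0), which is the reading consistent with the answers in this set; the function and parameter names are illustrative.

```python
def overall_error_rate(pred1_act1, pred1_act0, pred0_act1, pred0_act0):
    """Misclassified observations divided by all observations."""
    misclassified = pred1_act0 + pred0_act1
    total = pred1_act1 + pred1_act0 + pred0_act1 + pred0_act0
    return misclassified / total

def accuracy(pred1_act1, pred1_act0, pred0_act1, pred0_act0):
    """Accuracy is one minus the overall error rate."""
    return 1 - overall_error_rate(pred1_act1, pred1_act0, pred0_act1, pred0_act0)

# The two confusion matrices used in the cards, in the order 1/1, 1/0, 0/1, 0/0
first = (201, 25, 85, 2689)
second = (221, 30, 100, 3000)

for cells in (first, second):
    print(f"error rate {overall_error_rate(*cells):.2%}, accuracy {accuracy(*cells):.2%}")
```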
______________ occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.
Model overfitting
______ is one minus the Class 1 error rate.
Sensitivity
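As a quick illustration, sensitivity can be computed from the Class 1 counts in the first confusion matrix of this set (201 Class 1 observations classified correctly, 85 classified as Class 0). The function name is a hypothetical helper.

```python
def sensitivity(true_pos, false_neg):
    """One minus the Class 1 error rate."""
    class1_error_rate = false_neg / (true_pos + false_neg)
    return 1 - class1_error_rate

s = sensitivity(true_pos=201, false_neg=85)
print(f"{s:.2%}")
```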
_____ is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
Supervised learning
Which of these is NOT a step in the data mining process?
data explanation
In logistic regression, instead of using Y as the outcome variable, we use a _______ called the logit.
function of Y
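A brief sketch of the logit and its inverse, assuming the standard definition logit(p) = ln(p / (1 - p)), where p is the probability that Y = 1. The inverse maps any value of the linear function back to a probability and traces the S-shaped curve. Function names are illustrative.

```python
import math

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Logistic function: maps any real z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Probabilities along the S-shaped curve for a few values of the linear function
curve = [inverse_logit(z) for z in (-4, -2, 0, 2, 4)]
print([round(p, 3) for p in curve])
```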
