Business Analytics Final Ch. 9

______ is one minus the overall error rate.

Accuracy

As we increase the cutoff value, _______ error will decrease and _________ error will rise.

Class 0, Class 1
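
The effect of the cutoff can be checked with a minimal sketch. The observations below are hypothetical and not from the course data; the rule assumed here is that an observation is classified as Class 1 when its estimated probability is at or above the cutoff.

```python
# Hypothetical (actual class, estimated probability of Class 1) pairs.
observations = [
    (1, 0.90), (1, 0.70), (1, 0.55), (1, 0.30),
    (0, 0.80), (0, 0.60), (0, 0.40), (0, 0.20), (0, 0.10), (0, 0.05),
]


def class_error_rates(data, cutoff):
    """Return (Class 0 error rate, Class 1 error rate) for a given cutoff."""
    # Assumed rule: classify as Class 1 when probability >= cutoff.
    false_positives = sum(1 for actual, p in data if actual == 0 and p >= cutoff)
    false_negatives = sum(1 for actual, p in data if actual == 1 and p < cutoff)
    n_class0 = sum(1 for actual, _ in data if actual == 0)
    n_class1 = sum(1 for actual, _ in data if actual == 1)
    return false_positives / n_class0, false_negatives / n_class1


for cutoff in (0.25, 0.50, 0.75):
    class0_error, class1_error = class_error_rates(observations, cutoff)
    print(f"cutoff={cutoff:.2f}  Class 0 error={class0_error:.2f}  "
          f"Class 1 error={class1_error:.2f}")
```

On this toy data, each increase in the cutoff lowers the Class 0 error rate and raises the Class 1 error rate.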

______________ involves descriptive statistics, data visualization, and clustering.

Data exploration

___________ is dividing the sample data into three sets for training, validation, and testing of the data-mining algorithm performance.

Data partitioning
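
A minimal sketch of partitioning, assuming a simple random shuffle and the 50/30/20 split used in the stroke problem later in this set. The function name `partition` and the fixed seed are illustrative choices, not XLMiner's procedure.

```python
import random


def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Shuffle records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


# 150 record IDs stand in for the 150 patients in the stroke problem.
train, valid, test = partition(range(150))
print(len(train), len(valid), len(test))
```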

__________ is the manipulation of data with the goal of putting it in a form suitable for formal modeling.

Data preparation

__________ is the step in data mining that includes addressing missing and erroneous data, reducing the number of variables, defining new variables, and data exploration.

Data preparation

__________ is a method of extracting data relevant to the business problem under consideration. It is the first step in the data mining process.

Data sampling

Logistic regression is similar to linear regression, except that it attempts to classify ______________ as a linear function of explanatory variables.

a categorical response.

If the outcome variable is ________, we can use logistic regression as a classification tool.

binary

The best value of k can be determined by building models with k between 1 and 20 and selecting the value of k that results in the smallest

classification error.
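
The search over k can be sketched as follows. This is a bare-bones k-nearest neighbors classifier on hypothetical two-variable data, assuming the classification error is measured on a validation set; it skips input normalization and caps k at 5 because the toy training set is tiny.

```python
from collections import Counter


def knn_classify(train, point, k):
    """Classify `point` by majority vote among its k nearest training records."""
    # Each training record is (features, label); distance is squared Euclidean.
    nearest = sorted(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def validation_error(train, valid, k):
    """Fraction of validation records misclassified for a given k."""
    wrong = sum(1 for x, y in valid if knn_classify(train, x, k) != y)
    return wrong / len(valid)


def best_k(train, valid, max_k):
    """Return the k in 1..max_k with the smallest validation error."""
    # min() keeps the first minimum, so ties go to the smallest k.
    return min(range(1, max_k + 1),
               key=lambda k: validation_error(train, valid, k))


# Hypothetical data: (features, class label).
train = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0), ((3.5, 3.5), 1),
         ((6.0, 6.0), 1), ((6.5, 7.0), 1), ((7.0, 6.0), 1)]
valid = [((1.2, 1.1), 0), ((3.0, 3.0), 0), ((5.5, 5.0), 1), ((6.8, 6.5), 1)]

max_k = 5
chosen_k = best_k(train, valid, max_k)
print("best k:", chosen_k,
      "validation error:", validation_error(train, valid, chosen_k))
```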

A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules is called a(n)

classification tree.

A _________ chart compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 with the number of actual Class 1 observations identified when observations are randomly selected.

cumulative lift
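
The two curves on such a chart can be computed with a short sketch. The actual classes and probabilities below are hypothetical, and the function name `cumulative_lift` is illustrative.

```python
def cumulative_lift(actuals, probabilities):
    """Return (model_counts, random_counts) for a cumulative lift chart.

    model_counts[i]: actual Class 1 observations among the i+1 observations
    with the highest estimated probability of being in Class 1.
    random_counts[i]: expected Class 1 count if i+1 observations were
    selected at random.
    """
    ranked = sorted(zip(probabilities, actuals),
                    key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(actuals) / len(actuals)
    model_counts, random_counts, running = [], [], 0
    for i, (_, actual) in enumerate(ranked, start=1):
        running += actual
        model_counts.append(running)
        random_counts.append(i * overall_rate)
    return model_counts, random_counts


# Hypothetical actual classes and estimated Class 1 probabilities.
actuals = [1, 0, 1, 0, 0, 1, 0, 0]
probabilities = [0.9, 0.2, 0.8, 0.6, 0.1, 0.4, 0.3, 0.05]

model_counts, random_counts = cumulative_lift(actuals, probabilities)
print(model_counts)
print(random_counts)
```

The model curve sits above the random curve wherever ranking by probability finds Class 1 observations faster than random selection would.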

The choice of __________ affects classification errors.

cutoff value

An important part of ________ is applying the chosen model to the test data as a final evaluation of model performance.

model assessment

The set of recorded values of variables associated with a single entity is a(n)

observation

A negative average error suggests a tendency to ________ the output variable in the test data.

overestimate

Estimation methods are also referred to as

prediction methods.

A(n) _______________ is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables.

record

A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules is called a(n)

regression tree.

Logistic regression equations have what shape?

s-shape

Data-mining methods for predicting an outcome based on a set of input variables are referred to as

supervised learning.

The data used to build the candidate predictive model are called the

training set.

A positive average error on the validation data suggests a tendency to _________ the output variable in the validation data.

underestimate
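
A minimal sketch of the two error measures, assuming each error is defined as actual minus predicted, which is the convention under which a positive average error means underestimation. The values are hypothetical.

```python
import math


def average_error(actual, predicted):
    """Mean of (actual - predicted); the sign shows the direction of bias."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; never negative."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical actual and predicted values.
actual = [30.0, 42.0, 55.0, 61.0]
predicted = [28.0, 40.0, 50.0, 62.0]

print("average error:", average_error(actual, predicted))
print("RMSE:", round(rmse(actual, predicted), 3))
```

Here the average error is positive, so the predictions tend to fall below the actual values. RMSE squares each error before averaging, so it carries no sign.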

The data used to evaluate candidate predictive models are called the

validation set.

A characteristic or quantity of interest that can take on different values is a(n)

variable.

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the RMSE on the test data?

-1.32

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What value of k minimizes the root mean squared error (RMSE) on the validation data?

10

Given the XLMiner output shown below, the best k for the Classification Model is https://cxp-cdn.cengage.info/protected/prod/assets/8f/7/8f7e8fd3-9b62-4746-9f47-be82d2563f3a.jpg?__gda__=st=1564182645~exp=1564787445~acl=%2fprotected%2fprod%2fassets%2f8f%2f7%2f8f7e8fd3-9b62-4746-9f47-be82d2563f3a.jpg*~hmac=67b5e07ac091b738b4bb26d82d347ff5677d036feb6dc1578204ddde6d1905a5

10

From the lift chart shown below, we can infer that if 200 observations with the largest estimated probabilities of being in Class 1 were selected, 150 of them would correspond to actual Class 1 members. If 200 cases were selected at random, approximately how many could we expect to be in Class 1? https://cxp-cdn.cengage.info/protected/prod/assets/81/a/81a56e50-a45a-46a4-8b06-c41d1a239b42.jpg?__gda__=st=1564182645~exp=1564787445~acl=%2fprotected%2fprod%2fassets%2f81%2fa%2f81a56e50-a45a-46a4-8b06-c41d1a239b42.jpg*~hmac=8b96fc4c0037ea7cc3fc7713559e7a5f742b54ec8853ba2ed0ce73c81539cc20

100

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the average error on the test data?

14.65

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the average error on the validation data?

15.31

Use the classification confusion matrix to determine the number of Class 1 observations in the data set that are correctly classified as Class 1. Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

201. A classification confusion matrix displays a model's correct and incorrect classifications. The number of correct Class 1 classifications is 201, which comes from the cell corresponding to a predicted and actual class of 1.

Use the classification confusion matrix to determine the number of Class 0 observations in the data set that are incorrectly classified as Class 1. Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

25

Use the classification confusion matrix to determine the number of Class 0 observations in the data set that are correctly classified as Class 0. Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

2689

Given the following classification confusion matrix, what is the overall error rate? Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

3.67%

Given the following classification confusion matrix, what is the overall error rate? Cells (predicted/actual): 1/1 = 221, 1/0 = 30, 0/1 = 100, 0/0 = 3000.

3.88%. (30 + 100) / (221 + 30 + 100 + 3000) = 130 / 3351 ≈ 3.88%

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke was selected; the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers. Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a detailed scoring report for all three sets of data. What is the RMSE on the validation data?

4.84

Use the classification confusion matrix to determine the number of Class 1 observations in the data set that are incorrectly classified as Class 0. Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

85

Given the following classification confusion matrix, what is the accuracy of the model? Cells (predicted/actual): 1/1 = 221, 1/0 = 30, 0/1 = 100, 0/0 = 3000.

96.12%

Given the following classification confusion matrix, what is the accuracy of the model? Cells (predicted/actual): 1/1 = 201, 1/0 = 25, 0/1 = 85, 0/0 = 2689.

96.33%. 1 - (85 + 25) / (201 + 25 + 85 + 2689) = 1 - 110 / 3000 ≈ 96.33%
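
The error-rate and accuracy answers for both confusion matrices can be reproduced with a short sketch. The function and argument names are illustrative; the cell order assumed is predicted/actual, which is the reading implied by the answers in this set.

```python
def overall_error_rate(pred1_act1, pred1_act0, pred0_act1, pred0_act0):
    """Misclassified observations divided by all observations."""
    total = pred1_act1 + pred1_act0 + pred0_act1 + pred0_act0
    return (pred1_act0 + pred0_act1) / total


def accuracy(pred1_act1, pred1_act0, pred0_act1, pred0_act0):
    """Accuracy is one minus the overall error rate."""
    return 1 - overall_error_rate(pred1_act1, pred1_act0, pred0_act1, pred0_act0)


first = (201, 25, 85, 2689)    # cells from the first matrix in this set
second = (221, 30, 100, 3000)  # cells from the second matrix in this set

for cells in (first, second):
    print(f"error rate = {overall_error_rate(*cells):.2%}, "
          f"accuracy = {accuracy(*cells):.2%}")
```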

______________ occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.

Model overfitting

______ is one minus the Class 1 error rate.

Sensitivity
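
As a sketch, sensitivity can be computed from the Class 1 cells of the first confusion matrix in this set: 201 actual Class 1 observations classified as Class 1 and 85 classified as Class 0. The function names are illustrative.

```python
def class1_error_rate(pred1_act1, pred0_act1):
    """Share of actual Class 1 observations misclassified as Class 0."""
    return pred0_act1 / (pred1_act1 + pred0_act1)


def sensitivity(pred1_act1, pred0_act1):
    """Share of actual Class 1 observations correctly classified as Class 1."""
    return pred1_act1 / (pred1_act1 + pred0_act1)


error = class1_error_rate(201, 85)
sens = sensitivity(201, 85)
print(f"Class 1 error rate = {error:.4f}")
print(f"Sensitivity = {sens:.4f}")
```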

_____ is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

Supervised learning

Which of these is NOT a step in the data mining process?

data explanation

Instead of using Y as the outcome variable, logistic regression uses a _______ called the logit.

function of Y
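
A minimal sketch of the logit and its inverse, assuming the standard definition logit(p) = ln(p / (1 - p)), where p is the probability that Y = 1. The inverse, the logistic function, is what produces the S-shaped curve.

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def logistic(z):
    """Inverse of the logit: maps any real z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# The logistic curve rises slowly, then quickly, then levels off (S-shape).
for z in (-4, -2, 0, 2, 4):
    print(f"z={z:+d}  probability={logistic(z):.3f}")
```

Because the logit is modeled as a linear function of the explanatory variables, the predicted probability always stays between 0 and 1.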
