Business Analytics Master Quizlet
True
Big data make companies smart.
True
As more independent variables are added to a multiple regression model, the value of R-squared increases. True/false?
False
As per the analysis of Titanic data, passengers who were in 3rd class had higher odds of surviving compared to passengers in 1st class.
Both continuous and categorical dependent variables
Neural Networks can be used to predict:
True
Neural Networks were originally developed to understand biological neural networks.
False
As per the analysis of Titanic data, passengers who were in 3rd class had higher odds of surviving compared to passengers in 1st class. True/False?
True
As the level of significance alpha increases from .05 to .10, we are more likely to reject the null hypothesis.
True
A Decision Tree procedure that grows trees until the leaves are pure tends to overfit. True/False
True
A False Negative error is a Type II error.
True
A False Negative error is a Type II error. True/False
True
A False Positive error is a Type I error.
False
A box plot can be used to summarize nominal data.
True
A confusion matrix is used to describe the performance of a classification model.
True
A confusion matrix is used to describe the performance of a classification model. True/False
2
A decision tree analysis was completed, with the results above for the training and validation data. For the best predictive capability, how many splits should be used?
2
A decision tree analysis was completed, with the results above for the training and validation data. For the best predictive capability, how many splits should be used? a.) 1 b.) 2 c.) 4 d.) 10 e.) none of the above
True
A good clustering scheme will have little variation within clusters and significant variation between clusters.
True
A histogram can be used to summarize interval data
True
A histogram is only appropriate for variables whose values are numerical and measured on an interval or ratio scale.
False
A model's accuracy rate on the training data set is a better measure of the model's predictive ability than its accuracy rate on the validation data set.
x1
A multiple regression analysis produced the following table:
Predictor   Coefficients   Standard Error   t Statistic   p-value
Intercept   616.6849       154.5534         3.990108      0.000947
x1          -3.33833       2.333548         -1.43058      0.170675
x2          1.780075       0.335605         5.30407       5.83E-05
Using α = .05, which independent variable would you drop using backward elimination?
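The backward-elimination step asked about in this card can be sketched in Python. This is a minimal illustration under stated assumptions, not a full stepwise procedure: it takes the p-values from the table as given and reports the predictor with the largest p-value above alpha. The function name `drop_candidate` is hypothetical.

```python
# p-values copied from the regression table in the card above
p_values = {"x1": 0.170675, "x2": 5.83e-05}

def drop_candidate(p_values, alpha=0.05):
    """Return the predictor with the largest p-value above alpha, or None."""
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > alpha else None

print(drop_candidate(p_values))
```

Only x1 has a p-value above .05, so it is the one candidate for removal, which agrees with the card's answer.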
False
A negative correlation coefficient (r) implies a weak relationship among the variables. True/False
True
A one-tailed test is one where Ha is directional and includes < or >.
a low degree of multicollinearity and small standard error
A regression equation will have the best prediction capabilities if the independent variables have a.) a high degree of multicollinearity and small standard error. b.) a high degree of multicollinearity and large standard error. c.) a low degree of multicollinearity and large standard error. d.) a low degree of multicollinearity and small standard error
False
A scatterplot can be drawn with a set of two categorical data.
True
A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.
Chi-Square Test
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent). Select the appropriate test/model to determine whether city service satisfaction level and marital status are independent.
True
According to some experts, analytics has become important for companies these days because many products are becoming commodities and there is minimal differentiation in products and services.
1 - 0.2087 = 0.7913
According to the Fit Details output listed below, The Predictive accuracy of the model is:
True
According to the text, the most popular choice for the number of hidden layers is one.
Complete but not accurate
Accuracy and completeness of data is very important for analytics. The following data - 02/31/2020 is:
Specificity of 99.9%
An anti-theft scanner at an entrance of a book store buzzes once for every 1000 innocent people walking through the scanner. The accuracy of the scanner is: a.) Specificity of 99.9% b.) Sensitivity of 99.9% c.) None of the above d.) Do not have enough information to answer this question.
True
An application of the multiple regression model generated the following results involving the F test of the overall regression model: p - value = .0012, and adjusted R-squared= .67. Thus, the null hypothesis, which states that none of the independent variables are significantly related to the dependent variable, should be rejected at the .05 level of significance.
Less than
An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group.
Less than
An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group. a.) Equal to b.) Not Equal to c.) Greater than d.) Less than
True
An outlier is an observation in a data set, which is far removed in value from the others in the data set.
True
An overfit model does not generalize to other data well, even if they are from the same population.
True
An overfit model does not generalize to other data well, even if they are from the same population. True/False
All of the above are true
Analytics are not applicable when:
0.6498
Based on the Decision Tree, what is the probability that the following new customer will buy less in 2011 compared to 2010 (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending)? a.) 0.1843 b.) 0.0068 c.) 0.2800 d.) 0.4102 e.) 0.6498
All are True
Based on the end nodes of the decision tree (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending), which statement is True? a.) If Customer2010 <1 Then "Spending in 2011 is greater than 2010 Spending" with Prob .9932 b.) If Customer2010 >=1 And Catalogs <18 Then "Spending in 2011 is less than 2010 Spending" with Prob .6498 c.) If Customer2010 >=1 And Catalogs >=18 Then "Spending in 2011 is greater than 2010 Spending" - Prob .8157 d.) All are True
Visualize data
Business understanding phase includes the following except:
All of the above
CRISP-DM is a hierarchical process model that consists of:
False
CRISP-DM model is dependent on industry sector and technology used.
False
CRISP-DM process model consists of five phases.
True
Cluster analysis is a very attractive initial data-mining tool because it can be used to discover rules and patterns.
True
Cluster analysis is an un-supervised technique.
0.13
Consider the following confusion Matrix: What is the False Negative Rate? a.) 0.95 b.) 0.13 c.) 0.93 d.) 0.04
10
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Negative for this sample is:
10
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Negative for this sample is: a.) 10 b.) 90 c.) 80 d.) 20
20
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Positive for this sample is:
0.9
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The sensitivity for this sample is:
80/(80+20) = 0.8
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The specificity for this sample is:
80/(80+20) = 0.8
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The specificity for this sample is: a.) 90/(90+10) = 0.9 b.) 10/(10+90) = 0.1 c.) 20/(20+80) = 0.2 d.) 80/(80+20) = 0.8
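The hospital-test cards above can be checked with a few lines of Python. The results table itself is not reproduced in this set, so the four counts below are inferred from the card answers (FN = 10, FP = 20, sensitivity = 0.9, specificity = 80/(80+20)). Treat them as an assumption.

```python
# Counts implied by the card answers: FN = 10, FP = 20,
# sensitivity = 0.9 (so TP = 90), specificity = 80/(80+20) (so TN = 80).
tp, fn, fp, tn = 90, 10, 20, 80

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_negative_rate = fn / (fn + tp)  # 1 - sensitivity
false_positive_rate = fp / (fp + tn)  # 1 - specificity

print(sensitivity, specificity)
```

The four counts add up to the 200 patients in the scenario, which is a quick sanity check on the inferred table.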
I and II only
Consider the following stem-and-leaf display for recent home sales in Akron in thousands of dollars. Which of the following prices was a recent home sale in Akron?
True
Correlation analysis is concerned with measuring the strength of the relationship between two continuous variables. True/False?
Optimize Taste & Price of a product
Data mining models can do the following except:
True
Data mining should be viewed as a process.
Dorm
Data was collected on students at a nearby university on their GPA at the end of their freshman year and their living accommodations (dorm, off-campus, and other). For which form of housing is the distribution of GPAs the most symmetric (least skewed)?
False
Decision Tree is an un-supervised data mining technique.
True
Decision tree analysis splits a node to multiple child nodes in a way to increase certainty in classification. True/False
False
Decision tree analysis splits a node to multiple child nodes in a way to maximize uncertainty in classification.
False
Decision tree analysis splits a node to multiple child nodes in a way to maximize uncertainty in classification. True/False
True
Each terminal node in a decision tree can be translated into a single IF-THEN rule
True
Each terminal node in a decision tree can be translated into a single IF-THEN rule.
True
Neural networks can be used for both continuous and categorical dependent (output) variables.
True
Sensitivity measures how good a test is at finding something if it's there.
3.765
Given the output of odds ratios below about what residents would do in case of a hurricane threat, for each additional pet, the residents are _______ times as likely to stay as those without. a.) 0.0049 b.) 0.2655 c.) 3.765 d.) 201.140
If Time to Delivery >= 68, Then Status = Lost, Prob = .7652
From the following Leaf Report, which of the following statements is correct?
If Time to Delivery >= 68, Then Status = Lost, Prob = .7652
From the following Leaf Report, which of the following statements is correct? a.) If Time to Delivery < 68 & Part Type = AM Then Status = Lost, Prob .5658 b.) If Time to Delivery >= 68, Then Status = Lost, Prob = .2348 c.) If Time to Delivery < 68 & Part Type = OE Then Status = Lost, Prob .4342 d.) If Time to Delivery >= 68, Then Status = Lost, Prob = .7652
(243+31)/382 - 72%
Given the following Confusion Matrix, the explanatory power is (Total cases in the Training Set = 382; Total cases in the Validation Set = 156):
(243+31)/382 - 72%
Given the following Confusion Matrix, the explanatory power is (Total cases in the Training Set = 382; Total cases in the Validation Set = 156): a.) (243+31)/382 - 72% b.) (45+5)/156 - 32% c.) (95+11)/156 - 68% d.) (97+11)/382 - 28%
(50+25)/108
Given the following Confusion Matrix, the predictive power is (Total cases in the Training Set = 430; Total cases in the Validation Set = 108):
(50+25)/108
Given the following Confusion Matrix, the predictive power is (Total cases in the Training Set = 430; Total cases in the Validation Set = 108): a.) 50/(50+21) b.) (205+101)/430 c.) 205/(205+78) d.) (50+25)/108
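Both confusion-matrix cards above use the same calculation: correctly classified cases divided by total cases. In these cards, explanatory power is that ratio on the training set and predictive power is the same ratio on the validation set. A minimal sketch using the counts from the card answers (the helper name `accuracy` is hypothetical):

```python
def accuracy(correct_class_a, correct_class_b, total):
    """Share of cases on the confusion-matrix diagonal."""
    return (correct_class_a + correct_class_b) / total

explanatory_power = accuracy(243, 31, 382)  # training set, from the first card
predictive_power = accuracy(50, 25, 108)    # validation set, from the second card

print(f"{explanatory_power:.0%} {predictive_power:.0%}")
```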
The model overfits
Given the following Fit Details output, what can you say about the model overfit?
The Model overfits
Given the following Fit Details output, what can you say about the model overfit? a.) The Model does not overfit b.) The Model overfits c.) Cannot answer the question with the information provided d.) None of the above
3.765
Given the output of odds ratios below about what residents would do in case of a hurricane threat, for each additional pet, the residents are _______ times as likely to stay as those without.
5.245
Given the output of odds ratios below about what residents would do in case of a hurricane threat, households living in mobile homes are _______ times more likely to evacuate than those that stay.
5.245
Given the output of odds ratios below about what residents would do in case of a hurricane threat, households living in mobile homes are _______ times more likely to evacuate than those that stay. a.) 0.0049 b.) 0.1906 c.) 5.245 d.) 0.2655
.2655
Given the output of odds ratios about what residents would do in case of a hurricane threat, the odds of evacuating for each additional pet they would have is _______.
0.2655
Given the output of odds ratios about what residents would do in case of a hurricane threat, the odds of evacuating for each additional pet they would have is _______. a.) 0.0049 b.) 3.765 c.) 0.2655 e.) 201.140
False
Healthy people correctly identified as healthy by the diagnostic test is known as True Positive. True/False
False
Hierarchical clustering is a good technique when you have a very large data set.
The fraction of negative instances that were misclassified
How is False Positive Rate defined?
The fraction of negative instances that were misclassified
How is False Positive Rate defined? a.) The fraction of positive instances that were misclassified b.) The number of True Negatives/All Positives c.) The number of True Positives/All Positives d.) The fraction of negative instances that were misclassified
One
How many outlier records appear to be present in this distribution?
False
If 100 patients known to have a disease were tested, and 43 test positive, then the test has 43% specificity.
False
If 100 patients known to have a disease were tested, and 43 test positive, then the test has 43% specificity. True/False?
True
If both fit and the lack of fit tests in logistic regression are significant, we should consider adding cross effects (of independent variables) to the model.
True
If data is right skewed, the mean is greater than the median.
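A small Python check illustrates the right-skew card. The data set is made up for illustration: one large value in the right tail pulls the mean above the median.

```python
from statistics import mean, median

# Hypothetical (made-up) right-skewed sample: most values are small,
# with one large value in the right tail.
data = [1, 2, 2, 3, 3, 4, 20]

print(mean(data), median(data))
print(mean(data) > median(data))
```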
True
If the probability of The University of Akron basketball team winning against Kent State team is .5, then the odds of The University of Akron winning against Kent State is 1.
True
If the probability of The University of Akron basketball team winning against Kent State team is .5, then the odds of The University of Akron winning against Kent State is 1. True/False?
True
If the probability of winning a game is 0.2, then the odds of winning the game is 0.25.
True
If the probability of winning a game is 0.2, then the odds of winning the game is 0.25. True/False
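The probability-to-odds conversion behind the cards above is odds = p / (1 - p). A minimal sketch (the function name `odds` is chosen here for illustration):

```python
def odds(p):
    """Convert a probability (0 <= p < 1) to odds in favor."""
    return p / (1 - p)

print(odds(0.5))  # Akron vs. Kent State card
print(odds(0.2))  # winning-a-game card
```

A probability of .5 gives odds of 1, and a probability of 0.2 gives odds of 0.25, matching both cards.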
multicollinearity
If the simple correlation coefficient between two independent variables is greater than .95, then ______________________ is considered to be severe. b.) interaction c.) coefficient of determination d.) multicollinearity
True
If the test is highly sensitive and the test result is negative, we can be nearly certain that the person doesn't have the disease.
True
If the test is highly sensitive and the test result is negative, we can be nearly certain that the person doesn't have the disease. True/False
True
If the test result for the highly specific test is positive, you can be nearly certain that they actually have the disease. True/False
Is greater than the proportion earning less than $13 per hour.
If the wages of workers for a given company are normally distributed with a mean of $15 per hour, then the proportion of the workers earning more than $13 per hour:
Is greater than the proportion earning less than $13 per hour
If the wages of workers for a given company are normally distributed with a mean of $15 per hour, then the proportion of the workers earning more than $13 per hour:
X1 is significantly related to Y
If we are testing the significance of the independent variable X1, and we reject the null hypothesis H0: b1= 0, we conclude that:
True
In Cluster analysis, it is important to normalize the data to get rid of differences in scale.
True
In Multiple Regression analysis, a t-test is used in testing the significance of an individual independent variable.
You pick the attribute with the highest Logworth.
In a decision tree algorithm, how is the attribute picked for the next split? a.) You pick the attribute with the lowest Logworth. b.) You pick the attribute where the conditional entropy is maximum. c.) You pick the attribute where the conditional entropy is higher than the base entropy. d.) You pick the attribute with the highest Logworth.
Is a straight line
In a multiple regression analysis, if the normal probability plot ___________, then it can be concluded that the assumption of normality is not violated.
False
In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative.
False
In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative. True/False
False
In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions is a supervised technique.
True
In cluster analysis, algorithms construct clusters where between-cluster variation (BCV) is large, as compared to within-cluster variation (WCV).
False
In doing the one way ANOVA F test we should always do a post-hoc analysis (such as Tukey paired comparisons).
decrease the number of false positives
In logistic regression, if we change the cutoff value from .5 (the default) to a cutoff value of .7 we would expect it to
decrease the number of false positives
In logistic regression, if we change the cutoff value from .5 (the default) to a cutoff value of .7 we would expect it to a.) increase the number of false positives b.) decrease the number of false positives c.) have no effect on the number of false positives
increase the number of false positives
In logistic regression, if we change the cutoff value from 0.5 (the default) to a cutoff value of 0.3 we would expect it to a.) increase the number of false positives b.) decrease the number of false positives c.) have no effect on the number of false positives
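The effect of the cutoff on false positives in the cards above can be seen on a toy example. The predicted probabilities and labels below are made up for illustration and are not from any model in this set. A case is classified positive when its predicted probability reaches the cutoff.

```python
# Hypothetical (made-up) predicted probabilities and true labels, 1 = positive.
predicted = [0.10, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]
actual    = [0,    0,    1,    0,    0,    1,    1]

def false_positives(cutoff):
    """Count negatives whose predicted probability reaches the cutoff."""
    return sum(1 for p, y in zip(predicted, actual) if p >= cutoff and y == 0)

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, false_positives(cutoff))
```

Lowering the cutoff labels more cases positive, so the false positive count can only stay the same or rise. Raising the cutoff has the opposite effect.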
True
In logistic regression, typically the overall error rate is lowest at probability cutoff=0.50.
True
In medicine, if the test is positive, it is bad news. True/False
False
In medicine, if the test is positive, it is good news.
False
In medicine, if the test is positive, it is good news. True/False
False
In multiple regression analysis, if the simple correlation coefficient (rxy) between the dependent variable and one of the independent variables is .95, then this indicates that multicollinearity exists.
False
In multiple regression analysis, if the simple correlation coefficient (rxy) between the dependent variable and one of the independent variables is .95, then this indicates that multicollinearity exists. True/False
True
In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. This is referred to as the problem of over-fitting.
True
In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. This is referred to as the problem of over-fitting. True/False
Matched pairs from two dependent populations.
In order to test the effectiveness of a drug called XZR designed to reduce cholesterol levels, 9 heart patients' cholesterol levels are measured before they are given the drug. The same 9 patients use XZR for two continuous months. After two months of continuous use the 9 patients' cholesterol levels are measured again. The comparison of cholesterol levels before vs. after administering the drug is an example of testing the difference between:
True
In order to use a classification tree, the target variable must be categorical and not continuous.
False
In predicting the financial status of firms, sensitivity is the ability to predict a firm that is going to stay solvent correctly.
Slope of the regression line
In simple regression analysis the quantity that gives the amount by which Y (dependent variable) changes for a unit change in X (independent variable) is called the
The slope of the regression line must also be positive.
In simple regression analysis, if the correlation coefficient (r) between the dependent and independent variable is a positive value, then
An interval scale
In testing the difference between two independent population means, it is assumed that the level of measurement of the variable is at least _____________.
All of the above are true
In the Regression Trees DM technique,
Siblings and Spouses
In the Titanic data analysis, which independent variable is the Least important significant predictor of Survived?
Siblings and Spouses
In the Titanic data analysis, which independent variable is the Least important significant predictor of Survived? a.) Age b.) Sex c.) Siblings and Spouses d.) Parents and Children e.) Passenger Class
Sex, Passenger Class, Age, & Siblings and Spouses
In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable?
Sex, Passenger Class, Age, & Siblings and Spouses
In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable? a.) Sex, Passenger Class, Age b.) Fare & Parents and Children c.) Sex, Passenger Class, Age, & Siblings and Spouses d.) Sex e.) all independent variables are significant
False
In two-way ANOVA, if interaction plots look essentially parallel, we can intuitively conclude there is an interaction between the two factors.
Modeling
In which CRISP-DM phase is data partitioned and the training & validation data sets created?
True
It is possible for a valid regression equation to have none of the data points fall on the regression line.
True
It is possible for a valid regression equation to have none of the data points fall on the regression line. True/False?
True
K-Means clustering is well suited to the task of Market segmentation.
True
K-means algorithm is a typical algorithm for cluster analysis that uses "Euclidean distance" to find clusters.
True
Logistic Regression analysis is a supervised data mining technique.
False
Logistic Regression analysis is an un-supervised data mining technique. True/False?
True
Logistic Regression can be used for both profiling and classification.
False
Logistic Regression odds ratios describe the degree to which the model's independent variables predict the dependent variable.
False
Logistic Regression odds ratios describe the degree to which the model's independent variables predict the dependent variable. True/False?
True
Logistic Regression seeks to predict the probability that a dependent variable will fall into a particular category based on the values of independent variables.
Phase 5
Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is. Identify the CRISP-DM phase for this task.
False
Most charts, graphs, and other visualizations would be considered predictive models.
greater than 10
Multicollinearity between independent variables is severe if the variance inflation factor is a.) Between -1 and +1 b.) Substantially less than 1 c.) Greater than 10 d.) Less than 5
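The "greater than 10" rule of thumb can be related to R-squared through the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the remaining predictors. A minimal sketch (the function name `vif` is illustrative; the .95 value echoes the earlier card about correlation between two independent variables):

```python
def vif(r_squared):
    """Variance inflation factor for a predictor, given the R-squared
    from regressing that predictor on the other predictors."""
    return 1.0 / (1.0 - r_squared)

# With only two predictors, that R-squared is their squared correlation.
print(round(vif(0.95 ** 2), 2))
print(round(vif(0.90), 2))
```

A VIF of 10 corresponds to an R-squared of .90, so a pairwise correlation of .95 between two predictors already crosses the threshold.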
False
One advantage of neural networks is that there is very little chance of overfitting, so validation or testing data is not needed.
True
One disadvantage of neural networks is that they are slow learners.
True
Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
True
Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points. True/False
True
Pie and bar charts are used to summarize nominal and ordinal data.
True
Predicting something like the average length of a delivery person's shift is a well-suited task for decision tree modeling.
False
Predicting something like the average length of a delivery person's shift is a well-suited task for decision tree modeling. True/False
True
Predicting the approval or disapproval of a loan based on credit scores and demographic information is a good application of Logistic Regression. True/False
True
Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms is a supervised technique.
True
Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously is an unsupervised technique.
False
Sick people correctly identified as sick by the diagnostic test is known as True Negative.
False
Sick people correctly identified as sick by the diagnostic test is known as True Negative. True/False
Multicollinearity
Significant _________ may exist when the overall F-statistic is significant and the individual t statistics for all independent variables are insignificant.
True
Specificity measures how good a test is at finding something if it's false.
c.) Multicollinearity may exist between Profit and Assets
Suppose that we are trying to predict PROFIT using Sales, #emp, Assets, and Stockholder'sEq. Which statement below is FALSE? a.) Profit is the dependent or response variable b.) #emp has the weakest correlation with profit c.) Multicollinearity may exist between Profit and Assets d.) All of the above statements are true
0.90 (450/500)
Suppose we develop a Logistic Regression (LR) model to predict whether email is Spam or Not Spam. After we apply the model to a test set of 500 new email messages and the LR model produces the following confusion matrix, what is the accuracy rate for this model?
0.90 (450/500)
Suppose we develop a Logistic Regression (LR) model to predict whether email is Spam or Not Spam. After we apply the model to a test set of 500 new email messages and the LR model produces the following confusion matrix, what is the accuracy rate for this model? a.) 0.875 (70/80) b.) 0.78 (390/500) c.) 0.90 (450/500) d.) 0.10 (50/500)
False
The "K" in K-Means Cluster Analysis refers to the maximum number of variables that a clustering model can utilize.
True
The CRISP-DM process model aims to make large data mining projects less costly, more reliable, more repeatable, more manageable, and faster.
Cluster 4
The Euclidean distances of a customer from the different cluster centers are given below. Based on this information, which cluster does the customer belong to?
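The assignment rule behind this card is to place the customer in the cluster whose center is nearest. The distance table is not reproduced in this set, so the values below are made up for illustration. They are chosen only so that the nearest center is Cluster 4, as in the card's answer.

```python
# Hypothetical (made-up) Euclidean distances from one customer
# to each cluster center; the original table is not available.
distances = {
    "Cluster 1": 5.2,
    "Cluster 2": 3.9,
    "Cluster 3": 4.4,
    "Cluster 4": 1.7,
}

# Assign the customer to the cluster with the smallest distance.
assigned = min(distances, key=distances.get)
print(assigned)
```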
Response, zero
The Y intercept (b0) in a multiple regression model represents the estimated value of the ________ variable, when the values of all independent variables are ______. a.) Response, one b.) Response, zero c.) Predictor, one d.) Predictor, zero
Y-Intercept
The __________ of the simple linear regression model (Y = B0 + B1*X) is the value of Y when the value of X is zero.
All of the methods listed above are appropriate
The choice of k (number of clusters in a Cluster Analysis) can be made using a variety of methods. Which of the following methods is appropriate in selecting the number of clusters.
False
The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit.
False
The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit. true/false?
False
The correlation coefficient can only assume value between zero and 1, inclusive.
True
The curse of dimensionality refers to the computational complexity of developing clusters using a large number of variables.
True
The dependent variable in logistic regression can have more than two values or classes. True/False?
False
The dependent variable in logistic regression is always binary. True/False?
Response
The dependent variable, the variable of interest in an experiment, is also called ___________ variable.
True
The different phases in the BA Life Cycle methodology are closely interrelated.
True
The error term in the regression model describes the effects of all factors other than the independent variables on y (response variable). True/False
True
The estimated simple linear regression equation minimizes the sum of the squared deviations between each value of Y and the line.
True
The estimated simple linear regression equation minimizes the sum of the squared deviations between each value of Y and the line. True/False?
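The least-squares property in the cards above can be checked numerically. The sketch below uses made-up data and the standard closed-form estimates for a simple linear regression. The helper names `fit_line` and `sse` are illustrative.

```python
# Hypothetical (made-up) data, for illustration only.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

def fit_line(x, y):
    """Closed-form least-squares estimates for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(b0, b1):
    """Sum of squared vertical deviations between each y and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(x, y)
print(b0, b1, sse(b0, b1))
print(sse(b0, b1) < sse(b0, b1 + 0.1))  # nudging the slope increases the error
```

In this example none of the four points lies exactly on the fitted line, which also illustrates the earlier card about a valid regression equation with no data points on the line.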
False
The final stage of the CRISP-DM data mining process is "Evaluation."
Clustering
The following business question - Can we identify different groups of customers based on various demographic and purchasing characteristics? - can be answered by building a _____________ DM model.
Classification
The following business question - What factors influence customers to churn? - can be answered by building a _____________ DM model.
Classification
The following business question - What customer characteristics best predict bank loan default or fraud? - can be answered by building a _____________ DM model.
b.) Plane
The graph of the prediction equation obtained from the following model is a(n): y= B0 + B1X1 +B2X2 + E a.) Exponential curve b.) Plane c.) Line d.) Parabola
True
The mean and median are the same for a normal distribution.
True
The most popular method for using model errors to update weights is called back propagation of error.
72.75%
The output below (mosaic plot and contingency table) was generated from the Titanic data file. The first data variable is Survived (Yes or No) and the second variable is Sex (Male, Female). As per the Contingency Table, what percentage of all of the females survived on the Titanic?
50%
The predictive accuracy of this model is ...
-1 to 1
The range of feasible values for the correlation coefficient is from:
0 to 1
The range of feasible values for the multiple coefficient of determination is from: a.) 0 to 1 b.) -1 to 1 c.) minus infinity to 0 d.) -1 to 0 e.) 0 to infinity
True
The three basic building blocks of business analytics are technology, process, and people.
True
The two steps in building a decision tree model are to first generate exact probability predictions for each data record, then convert that to a yes/no prediction based on a cutoff percentage.
False
Today, business analytics are data driven rather than business driven.
True
Training a neural network model involves estimating the weights that will lead to the best predictive results.
False
Two variables x and y have a high correlation coefficient. Therefore, it can be concluded that changes in x cause y to change.
True
Under the pairwise deletion approach to managing missing data we exclude a case with a missing value for a variable only when the analysis includes that variable.
False
Unstructured data increases the veracity in the data.
The centroids of the discovered cluster and the assignment of each input datum to a cluster.
What are the outputs generated by a k-Means clustering Analysis?
Logistic Regression returns probability estimates of a response variable
What is a distinct property of Logistic Regression compared to Linear Regression?
Logistic Regression returns probability estimates of a response variable
What is a distinct property of Logistic Regression compared to Linear Regression? a.) Logistic Regression works very well with continuous target variable b.) Logistic Regression is robust with correlated predictor variables c.) Logistic Regression handles missing values well d.) Logistic Regression returns probability estimates of a response variable
Ho: µC = µA
What is the appropriate null hypothesis for testing whether the mean profit of the industries Computer and Aerospace differ?
60%
What is the predictive accuracy of the CA model?
$400
What is the total cost of errors for the DT model?
$400
What is the total cost of errors for the DT model? a.) $1400 b.) $1000 c.) $800 d.) $200 e.) $400
31%
What percentage of the variability in the dependent variable is accounted for by changes in the independent variables in this model?
31%
What percentage of the variability in the dependent variable is accounted for by changes in the independent variables in this model? a.) 0 % b.) 31 % c.) 47 % d.) 100 % e.) 2 %
False
When creating a decision tree, in general we want to keep splitting as long as we can create less purity in the nodes
False
When creating a decision tree, in general we want to keep splitting as long as we can create less purity in the nodes True/False
False
When creating a decision tree, we want to keep splitting as long as we can create more impurity in the nodes.
True
When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the validation data set.
True
When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the validation data set. True/False
False
When the F test is used to test the overall significance of a multiple regression model, if the null hypothesis is rejected, it can be concluded that all of the independent variables X1, X2, X3,... Xk are significantly related to the dependent variable Y. True/False?
False
When the predictor variable is categorical and the response variable is continuous, you would use a logistic regression model. True/False?
False
When using simple regression analysis, if there is a strong correlation between the independent and dependent variable, then we can conclude that an increase in the value of the independent variable causes an increase in the value of the dependent variable.
False
When we are predicting Fraud cases, we want to minimize false positive rates.
In a hierarchical clustering algorithm
Where would you most likely see a dendrogram?
Review project
Which activity is performed in the Deployment phase of the CRISP-DM process?
LR
Which is the best model based on predictive accuracy?
LR
Which is the worst model based on a minimized cost of erroneous predictions? a.) LR b.) CA c.) DT
None of the above
Which of the following activation functions is not used in neural networks (in JMP)?
Boxplot
Which of the following data visualization charts might include "whiskers"?
Desire to analyze big data
Which of the following is not one of the four drivers of business analytics?
Whether a Person Has a Traffic Violation
Which of the following is a categorical variable?
All of the above are weaknesses of neural networks.
Which of the following is a disadvantage of neural networks?
As the value of x increases, the value of the error term also increases.
Which of the following is a violation of one of the major assumptions of the simple regression model?
Cluster Analysis
Which of the following is an un-supervised data mining technique?
Requires assumptions of statistical models.
Which of the following is not True about Regression Trees DM technique: a.) Produces rules that are easy to interpret and implement. b.) Requires assumptions of statistical models. c.) Easy to use and understand. d.) Variable selection & reduction is automatic
The level of measurement of the data for the dependent variable is at least nominal.
Which of the following is not an accurate assumption of the linear regression model?
In which country did we have the highest sales last quarter?
Which of the following questions is typically addressed via a Business Intelligence project?
3,0,1,1
Which of the following sets of counts represents the contents of the completed LR confusion matrix?
3,0,1,1
Which of the following sets of counts represents the contents of the completed LR confusion matrix? a.) 1,2,1,1 b.) 2,0,1,1 c.) 3,0,1,1 d.) 1,2,0,2 e.) 4,0,0,1
All of the above statements are true
Which of the following statements about a ROC curve is true?
Building the model
Which of the following tasks would NOT be part of the data understanding phase in CRISP-DM?
Logistic Regression and Decision Tree
Which two analytical methods can be used for categorical target variables?
Logistic Regression and Decision Tree
Which two analytical methods can be used for categorical target variables? a.) Cluster Analysis and Logistic Regression b.) Linear Regression and Decision Tree c.) Linear Regression and Logistic Regression d.) Decision Tree and Cluster Analysis e.) Logistic Regression and Decision Tree
Logistic Regression
You are tasked with predicting if a customer will purchase a product (Yes/No) when the customer visits the website and the probability of a purchase decision. You are provided with other relevant variables that are associated with the problem. Which analytical method would you recommend?
Logistic Regression
You are tasked with predicting if a customer will purchase a product (Yes/No) when the customer visits the website and the probability of a purchase decision. You are provided with other relevant variables that are associated with the problem. Which analytical method would you recommend? a.) Linear Regression b.) ANOVA c.) Logistic Regression d.) Association Rules
49
You are using "state" a categorical variable in your linear regression model with 50 possible values. How many dummy variables with this variable "state" should be expanded for your model? a.) 49 b.) 2 c.) Categorical Variables cannot be used with Linear Regression d.) 51
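The count of 49 follows from dropping one level as the reference category: a variable with k levels needs k - 1 dummy columns. A minimal sketch, using hypothetical state codes `S00` through `S49` in place of real state names:

```python
def dummy_encode(values, levels):
    """One 0/1 column per level except the first, which acts as the reference."""
    reference, *others = levels  # the reference level gets no column of its own
    return [[1 if v == level else 0 for level in others] for v in values]

levels = [f"S{i:02d}" for i in range(50)]  # 50 hypothetical state codes
encoded = dummy_encode(["S00", "S01", "S49"], levels)

print(len(encoded[0]))  # number of dummy columns per record
print(sum(encoded[0]))  # the reference level is coded as all zeros
```

Including all 50 columns together with an intercept would make the columns perfectly collinear, which is why one level is left out.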
Scatter Plot
You received 100,000 home loan records. You want to take a quick look and see if there is any relationship between mortgage age and mortgage amount before conducting advanced analytics. Which graphical tool would you employ for your preliminary analysis?
K-means Clustering
You want to group customers in your dataset by similarity and assign labels to each group. What is the preferred analytic method to use for this task?
Backward Elimination
___________ is an iterative variable selection procedure that allows an independent variable to be deleted from a multiple regression model during the next iteration.
Veracity
______________ refers to the accuracy or trustworthiness of the big data.
Prescriptive
____________________ is a form of analytics which examines data to answer the question "what should be done?" or "what can we do to make XYZ happen?"
The slope of the regression line must also be positive
In simple regression analysis, if the correlation coefficient (r) between the dependent and independent variable is a positive value, then A) The slope of the regression line must also be positive. B) The Y intercept must also be a positive value. C) The coefficient of determination can be either positive or negative, depending on the value of the slope. D) The standard error of estimate can either have a positive or a negative value. E) The least squares regression equation could either have a positive or a negative slope.
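The slope and r share a sign because both have the same numerator, the sum of cross-deviations, divided by a positive quantity. A small sketch with made-up numbers:

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope and Pearson correlation for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Both results divide the same numerator sxy by a positive number.
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]           # illustrative values only
ys = [2, 4, 5, 4, 5]
slope, r = slope_and_r(xs, ys)
neg_slope, neg_r = slope_and_r(xs, [-y for y in ys])

print(slope, r)
print(neg_slope, neg_r)
```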
