Business Analytics Master Quizlet

True

Big data make companies smart.

True

As more independent variables are added to a multiple regression model, the value of R-squared increases. True/false?

False

As per the analysis of Titanic data, passengers who were in 3rd class had higher odds of surviving compared to passengers in 1st class.

Both continuous and categorical dependent variables

Neural Networks can be used to predict:

True

Neural Networks were originally developed to understand biological neural networks.

False

As per the analysis of Titanic data, passengers who were in 3rd class had higher odds of surviving compared to passengers in 1st class. True/False?

True

As the level of significance alpha increases from .05 to .10, we are more likely to reject the null hypothesis.

True

A Decision Tree procedure that grows trees until the leaves are pure tends to overfit. True/False

True

A False Negative error is a Type II error.

True

A False Negative error is a Type II error. True/False

True

A False Positive error is a Type I error.

False

A box plot can be used to summarize nominal data.

True

A confusion matrix is used to describe the performance of a classification model.

True

A confusion matrix is used to describe the performance of a classification model. True/False

2

A decision tree analysis was completed, with the results above for the training and validation data. For the best predictive capability, how many splits should be used?

2

A decision tree analysis was completed, with the results above for the training and validation data. For the best predictive capability, how many splits should be used? a.) 1 b.) 2 c.) 4 d.) 10 e.) none of the above

True

A good clustering scheme will have little variation within clusters and significant variation between clusters.

True

A histogram can be used to summarize interval data

True

A histogram is only appropriate for variables whose values are numerical and measured on an interval or ratio scale.

False

A model's accuracy rate on the training data set is a better measure of the model's predictive ability than its accuracy rate on the validation data set.

x1

A multiple regression analysis produced the following table:
Predictor   Coefficients   Standard Error   t Statistic   p-value
Intercept   616.6849       154.5534         3.990108      0.000947
x1          -3.33833       2.333548         -1.43058      0.170675
x2          1.780075       0.335605         5.30407       5.83E-05
Using α = .05, which independent variable would you drop using backward elimination?
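
The backward-elimination step in this question can be sketched in a few lines of Python. This is only an illustration: the function name `variable_to_drop` is made up here, and the p-values are copied from the table in the question. The rule applied is the usual one, drop the predictor with the largest p-value if that p-value exceeds alpha.

```python
# p-values copied from the regression table in the question (intercept excluded).
p_values = {"x1": 0.170675, "x2": 5.83e-05}
ALPHA = 0.05


def variable_to_drop(p_values, alpha):
    """Return the predictor with the largest p-value if it exceeds alpha, else None."""
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > alpha else None


dropped = variable_to_drop(p_values, ALPHA)
print(dropped)  # x1
```

Because x2 has a p-value far below .05, only x1 is a candidate for removal.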

False

A negative correlation coefficient (r) implies a weak relationship among the variables. True/False

True

A one-tailed test is one where Ha is directional and includes < or >.

a low degree of multicollinearity and small standard error

A regression equation will have the best prediction capabilities if the independent variables have a.) a high degree of multicollinearity and small standard error. b.) a high degree of multicollinearity and large standard error. c.) a low degree of multicollinearity and large standard error. d.) a low degree of multicollinearity and small standard error

False

A scatterplot can be drawn with a set of two categorical data.

True

A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category.

Chi-Square Test

A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.). Select the appropriate test/model to determine if city service satisfaction level & marital status are independent? (B)

True

According to some experts, analytics has become important for companies these days because many products are becoming commodities and there is minimal differentiation in products and services.

1 - 0.2087 = 0.7913

According to the Fit Details output listed below, The Predictive accuracy of the model is:

True

According to the text, the most popular choice for the number of hidden layers is one.

Complete but not accurate

Accuracy and completeness of data is very important for analytics. The following data - 02/31/2020 is:

Specificity of 99.9%

An anti-theft scanner at an entrance of a book store buzzes once for every 1000 innocent people walking through the scanner. The accuracy of the scanner is: a.) Specificity of 99.9% b.) Sensitivity of 99.9% c.) None of the above d.) Do not have enough information to answer this question.
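
A short sketch of why the answer is a specificity figure, assuming the scenario means one false alarm among every 1000 innocent (truly negative) people. The helper name `specificity` is illustrative only.

```python
def specificity(true_negatives, false_positives):
    """Share of truly negative cases that the test correctly leaves unflagged."""
    return true_negatives / (true_negatives + false_positives)


innocent_people = 1000   # all of them are true "negatives"
false_alarms = 1         # the scanner buzzes once per 1000 innocent people
quiet_passes = innocent_people - false_alarms

scanner_specificity = specificity(quiet_passes, false_alarms)
print(f"{scanner_specificity:.1%}")  # 99.9%
```

Nothing in the scenario says how often the scanner catches actual thieves, so sensitivity cannot be computed from it.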

True

An application of the multiple regression model generated the following results involving the F test of the overall regression model: p - value = .0012, and adjusted R-squared= .67. Thus, the null hypothesis, which states that none of the independent variables are significantly related to the dependent variable, should be rejected at the .05 level of significance.

Less than

An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group.

Less than

An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group. a.) Equal to b.) Not Equal to c.) Greater than d.) Less than

True

An outlier is an observation in a data set, which is far removed in value from the others in the data set.

True

An overfit model does not generalize to other data well, even if they are from the same population.

True

An overfit model does not generalize to other data well, even if they are from the same population. True/False

All of the above are true

Analytics are not applicable when:

0.6498

Based on the Decision Tree, what is the probability that the following new customer will buy less in 2011 compared to 2010 (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending)? a.) 0.1843 b.) 0.0068 c.) 0.2800 d.) 0.4102 e.) 0.6498

All are True

Based on the end nodes of the decision tree (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending), which statement is True? a.) If Customer2010 <1 Then "Spending in 2011 is greater than 2010 Spending" with Prob .9932 b.) If Customer2010 >=1 And Catalogs <18 Then "Spending in 2011 is less than 2010 Spending" with Prob .6498 c.) If Customer2010 >=1 And Catalogs >=18 Then "Spending in 2011 is greater than 2010 Spending" - Prob .8157 d.) All are True

Visualize data

Business understanding phase includes the following except:

All of the above

CRISP-DM is a hierarchical process model that consists of:

False

CRISP-DM model is dependent on industry sector and technology used.

False

CRISP-DM process model consists of five phases.

True

Cluster analysis is a very attractive initial data-mining tool because it can be used to discover rules and patterns.

True

Cluster analysis is an un-supervised technique.

0.13

Consider the following confusion Matrix: What is the False Negative Rate? a.) 0.95 b.) 0.13 c.) 0.93 d.) 0.04

10

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Negative for this sample is:

10

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Negative for this sample is: a.) 10 b.) 90 c.) 80 d.) 20

20

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The False Positive for this sample is:

0.9

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The sensitivity for this sample is:

80/(80+20) = 0.8

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The specificity for this sample is:

80/(80+20) = 0.8

Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The specificity for this sample is: a.) 90/(90+10) = 0.9 b.) 10/(10+90) = 0.1 c.) 20/(20+80) = 0.2 d.) 80/(80+20) = 0.8
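
The hospital cards above can be checked with a small calculation. The original results table is not reproduced in this set, so the four counts below are inferred from the stated answers (False Negative = 10, False Positive = 20, sensitivity 0.9, specificity 80/(80+20)) and should be read as an assumption.

```python
# Counts inferred from the answers on the cards above (assumption, not the original table).
tp, fn = 90, 10   # patients who have the disease: detected vs. missed
fp, tn = 20, 80   # patients who do not have the disease: false alarms vs. cleared

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_negative_rate = fn / (tp + fn)  # 1 - sensitivity

print(tp + fn + fp + tn)              # 200
print(sensitivity, specificity)       # 0.9 0.8
```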

I and II only

Consider the following stem-and-leaf display for recent home sales in Akron in thousands of dollars. Which of the following prices was a recent home sale in Akron?

True

Correlation analysis is concerned with measuring the strength of the relationship between two continuous variables. True/False?

Optimize Taste & Price of a product

Data mining models can do the following except:

True

Data mining should be viewed as a process.

Dorm

Data was collected on students at a nearby university on their GPA at the end of their freshman year and their living accommodations (dorm, off-campus, and other). For which form of housing is the distribution of GPAs the most symmetric (least skewed)?

False

Decision Tree is an un-supervised data mining technique.

True

Decision tree analysis splits a node to multiple child nodes in a way to increase certainty in classification. True/False

False

Decision tree analysis splits a node to multiple child nodes in a way to maximize uncertainty in classification.

False

Decision tree analysis splits a node to multiple child nodes in a way to maximize uncertainty in classification. True/False

True

Each terminal node in a decision tree can be translated into a single IF-THEN rule

True

Each terminal node in a decision tree can be translated into a single IF-THEN rule.

True

Neural networks can be used for both continuous and categorical dependent (output) variables.

True

Sensitivity measures how good a test is at finding something if it's there.

3.765

Given the output of odds ratios below about what residents would do in case of hurricane threat, for each additional pet, the residents are _______ times more likely to stay than those without. a.) 0.0049 b.) 0.2655 c.) 3.765 d.) 201.140

If Time to Delivery >= 68, Then Status = Lost, Prob = .7652

From the following Leaf Report, which of the following statements is correct?

If Time to Delivery >= 68, Then Status = Lost, Prob = .7652

From the following Leaf Report, which of the following statements is correct? a.) If Time to Delivery < 68 & Part Type = AM Then Status = Lost, Prob .5658 b.) If Time to Delivery >= 68, Then Status = Lost, Prob = .2348 c.) If Time to Delivery < 68 & Part Type = OE Then Status = Lost, Prob .4342 d.) If Time to Delivery >= 68, Then Status = Lost, Prob = .7652

(243+31)/382 = 72%

Given the following Confusion Matrix, the explanatory power is (Total cases in the Training Set = 382; Total cases in the Validation Set = 156):

(243+31)/382 = 72%

Given the following Confusion Matrix, the explanatory power is (Total cases in the Training Set = 382; Total cases in the Validation Set = 156): a.) (243+31)/382 = 72% b.) (45+5)/156 = 32% c.) (95+11)/156 = 68% d.) (97+11)/382 = 28%

(50+25)/108

Given the following Confusion Matrix, the predictive power is (Total cases in the Training Set = 430; Total cases in the Validation Set = 108):

(50+25)/108

Given the following Confusion Matrix, the predictive power is (Total cases in the Training Set = 430; Total cases in the Validation Set = 108): a.) 50/(50+21) b.) (205+101)/430 c.) 205/(205+78) d.) (50+25)/108
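
As a rough sketch, the predictive power in this card is the accuracy on the validation set: correctly classified validation cases divided by all validation cases. The function name `accuracy` is illustrative; the counts come from the selected answer.

```python
def accuracy(correct_counts, total_cases):
    """Correctly classified cases (summed over classes) divided by all cases."""
    return sum(correct_counts) / total_cases


# Validation-set figures from the answer above: 50 and 25 correct out of 108.
predictive_power = accuracy([50, 25], 108)
print(round(predictive_power, 3))  # 0.694
```

The same function applied to training-set counts gives what these cards call explanatory power.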

The model overfits

Given the following Fit Details output, what can you say about the model overfit?

The Model overfits

Given the following Fit Details output, what can you say about the model overfit? a.) The Model does not overfit b.) The Model overfits c.) Cannot answer the question with the information provided d.) None of the above

3.765

Given the output of odds ratios below about what residents would do in case of hurricane threat, for each additional pet, the residents are _______ times more likely to stay than those without.

5.245

Given the output of odds ratios below about what residents would do in case of hurricane threat, households living in mobile homes are _______ times more likely to evacuate than those that stay.

5.245

Given the output of odds ratios below about what residents would do in case of hurricane threat, households living in mobile homes are _______ times more likely to evacuate than those that stay. a.) 0.0049 b.) 0.1906 c.) 5.245 d.) 0.2655

.2655

Given the output of odds ratios, about what residents would do in case of hurricane threat, the odds of evacuating for each additional pet they would have is _______.

0.2655

Given the output of odds ratios, about what residents would do in case of hurricane threat, the odds of evacuating for each additional pet they would have is _______. a.) 0.0049 b.) 3.765 c.) 0.2655 e.) 201.140
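
The two pet-related answers (0.2655 for evacuating, 3.765 for staying) appear to be reciprocals of each other, which is what one would expect when an odds ratio is restated for the opposite outcome. The sketch below checks that relationship; the original software output is not shown in this set, so treat this as an interpretation rather than a reproduction of it.

```python
# Odds ratio for evacuating per additional pet, taken from the answer above.
odds_ratio_evacuate_per_pet = 0.2655

# Restating the same effect for the opposite outcome (staying) inverts the ratio.
odds_ratio_stay_per_pet = 1 / odds_ratio_evacuate_per_pet

print(round(odds_ratio_stay_per_pet, 2))  # 3.77, close to the reported 3.765
```

The small gap between 3.77 and 3.765 is consistent with rounding in the reported values.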

False

Healthy people correctly identified as healthy by the diagnostic test is known as True Positive. True/False

False

Hierarchical clustering is a good technique when you have a very large data set.

The fraction of negative instances that were misclassified

How is False Positive Rate defined?

The fraction of negative instances that were misclassified

How is False Positive Rate defined? a.) The fraction of positive instances that were misclassified b.) The number of True Negatives/All Positives c.) The number of True Positives/All Positives d.) The fraction of negative instances that were misclassified

One

How many outlier records appear to be present in this distribution?

False

If 100 patients known to have a disease were tested, and 43 test positive, then the test has 43% specificity.

False

If 100 patients known to have a disease were tested, and 43 test positive, then the test has 43% specificity. True/False?

True

If both fit and the lack of fit tests in logistic regression are significant, we should consider adding cross effects (of independent variables) to the model.

True

If data is right skewed, the mean is greater than the median.

True

If the probability of The University of Akron basketball team winning against Kent State team is .5, then the odds of The University of Akron winning against Kent State is 1.

True

If the probability of The University of Akron basketball team winning against Kent State team is .5, then the odds of The University of Akron winning against Kent State is 1. True/False?

True

If the probability of winning a game is 0.2, then the odds of winning the game is 0.25.

True

If the probability of winning a game is 0.2, then the odds of winning the game is 0.25. True/False
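
The odds cards above all use the same conversion, odds = p / (1 - p). A minimal sketch, with `probability_to_odds` as an illustrative name:

```python
def probability_to_odds(p):
    """Convert a probability (0 <= p < 1) into odds in favor of the event."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


print(round(probability_to_odds(0.2), 4))  # 0.25
print(round(probability_to_odds(0.5), 4))  # 1.0
```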

multicollinearity

If the simple correlation coefficient between two independent variables is greater than .95, then ______________________ is considered to be severe. b.) interaction c.) coefficient of determination d.) multicollinearity

True

If the test is highly sensitive and the test result is negative, we can be nearly certain that the person doesn't have the disease.

True

If the test is highly sensitive and the test result is negative, we can be nearly certain that the person doesn't have the disease. True/False

True

If the test result for the highly specific test is positive, you can be nearly certain that they actually have the disease. True/False

Is greater than the proportion earning less than $13 per hour.

If the wages of workers for a given company are normally distributed with a mean of $15 per hour, then the proportion of the workers earning more than $13 per hour:
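
A quick numerical check of this card, using Python's standard `statistics.NormalDist`. The question gives only the mean, so the standard deviation of $2 below is a hypothetical value chosen for illustration; the conclusion holds for any positive standard deviation because $13 lies below the mean.

```python
from statistics import NormalDist

# Mean from the question; sigma = 2 is a hypothetical value for illustration.
wages = NormalDist(mu=15, sigma=2)

share_below_13 = wages.cdf(13)
share_above_13 = 1 - share_below_13

print(share_above_13 > share_below_13)  # True
```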

Is greater than the proportion earning less than $13 per hour.

If the wages of workers for a given company are normally distributed with a mean of $15 per hour, then the proportion of the workers earning more than $13 per hour:

X1 is significantly related to Y

If we are testing the significance of the independent variable X1, and we reject the null hypothesis H0: b1= 0, we conclude that:

True

In Cluster analysis, it is important to normalize the data to get rid of differences in scale.
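
One common way to normalize before clustering is to convert each variable to z-scores. The card does not say which normalization is meant, so this is just one reasonable choice, and the income and age values are made up for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical customer data on very different scales.
incomes = [30000, 45000, 60000, 90000]
ages = [25, 35, 45, 65]

z_incomes = z_scores(incomes)
z_ages = z_scores(ages)
print([round(z, 2) for z in z_incomes])
print([round(z, 2) for z in z_ages])
```

After rescaling, income no longer dominates a Euclidean distance simply because it is measured in larger units.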

True

In Multiple Regression analysis, a t-test is used in testing the significance of an individual independent variable.

You pick the attribute with the highest Logworth.

In a decision tree algorithm, how is the attribute picked for the next split? a.) You pick the attribute with the lowest Logworth. b.) You pick the attribute where the conditional entropy is maximum. c.) You pick the attribute where the conditional entropy is higher than the base entropy. d.) You pick the attribute with the highest Logworth.

Is a straight line

In a multiple regression analysis, if the normal probability plot ___________, then it can be concluded that the assumption of normality is not violated.

False

In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative.

False

In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative. True/False

False

In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions is a supervised technique.

True

In cluster analysis, algorithms construct clusters where between-cluster variation (BCV) is large, as compared to within-cluster variation (WCV).

False

In doing the one way ANOVA F test we should always do a post-hoc analysis (such as Tukey paired comparisons).

decrease the number of false positives

In logistic regression, if we change the cutoff value from .5 (the default) to a cutoff value of .7 we would expect it to

decrease the number of false positives

In logistic regression, if we change the cutoff value from .5 (the default) to a cutoff value of .7 we would expect it to a.) increase the number of false positives b.) decrease the number of false positives c.) have no effect on the number of false positives

increase the number of false positives

In logistic regression, if we change the cutoff value from 0.5 (the default) to a cutoff value of 0.3 we would expect it to a.) increase the number of false positives b.) decrease the number of false positives c.) have no effect on the number of false positives
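
The effect of the cutoff on false positives can be seen with a toy example. The predicted probabilities and true labels below are invented for illustration; a case is classified as positive when its score is at or above the cutoff.

```python
# Hypothetical predicted probabilities and true labels (1 = positive, 0 = negative).
scores = [0.10, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 0, 1]


def count_false_positives(scores, labels, cutoff):
    """Count true negatives that are classified as positive at this cutoff."""
    return sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)


for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, count_false_positives(scores, labels, cutoff))
# 0.3 3
# 0.5 2
# 0.7 1
```

Lowering the cutoff labels more cases as positive, so false positives can only stay the same or rise; raising it has the opposite effect.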

True

In logistic regression, typically the overall error rate is lowest at probability cutoff=0.50.

True

In medicine, if the test is positive, it is bad news. True/False

False

In medicine, if the test is positive, it is good news.

False

In medicine, if the test is positive, it is good news. True/False

False

In multiple regression analysis, if the simple correlation coefficient (rxy) between the dependent variable and one of the independent variables is .95, then this indicates that multicollinearity exists.

False

In multiple regression analysis, if the simple correlation coefficient (rxy) between the dependent variable and one of the independent variables is .95, then this indicates that multicollinearity exists. True/False

True

In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. This is referred to as the problem of over-fitting.

True

In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. This is referred to as the problem of over-fitting. True/False

Matched pairs from two dependent populations.

In order to test the effectiveness of a drug called XZR designed to reduce cholesterol levels, 9 heart patients' cholesterol levels are measured before they are given the drug. The same 9 patients use XZR for two continuous months. After two months of continuous use the 9 patients' cholesterol levels are measured again. The comparison of cholesterol levels before vs. after administering the drug is an example of testing the difference between:

True

In order to use a classification tree, the target variable must be categorical and not continuous.

False

In predicting the financial status of firms, sensitivity is the ability to predict a firm that is going to stay solvent correctly.

Slope of the regression line

In simple regression analysis the quantity that gives the amount by which Y (dependent variable) changes for a unit change in X (independent variable) is called the

The slope of the regression line must also be positive.

In simple regression analysis, if the correlation coefficient (r) between the dependent and independent variable is a positive value, then

An interval scale

In testing the difference between two independent population means, it is assumed that the level of measurement of the variable is at least _____________.

All of the above are true

In the Regression Trees DM technique,

Siblings and Spouses

In the Titanic data analysis, which independent variable is the Least important significant predictor of Survived?

Siblings and Spouses

In the Titanic data analysis, which independent variable is the Least important significant predictor of Survived? a.) Age b.) Sex c.) Siblings and Spouses d.) Parents and Children e.) Passenger Class

Sex, Passenger Class, Age, & Siblings and Spouses

In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable?

Sex, Passenger Class, Age, & Siblings and Spouses

In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable? a.) Sex, Passenger Class, Age b.) Fare & Parents and Children c.) Sex, Passenger Class, Age, & Siblings and Spouses d.) Sex e.) all independent variables are significant

False

In two-way ANOVA, if interaction plots look essentially parallel, we can intuitively conclude there is an interaction between the two factors.

Modeling

In which CRISP-DM phase is data partitioned and training & validation data sets created?

True

It is possible for a valid regression equation to have none of the data points fall on the regression line.

True

It is possible for a valid regression equation to have none of the data points fall on the regression line. True/False?

True

K-Means clustering is well suited to the task of Market segmentation.

True

K-means algorithm is a typical algorithm for cluster analysis that uses "Euclidean distance" to find clusters.

True

Logistic Regression analysis is a supervised data mining technique.

False

Logistic Regression analysis is an un-supervised data mining technique. True/False?

True

Logistic Regression can be used for both profiling and classification.

False

Logistic Regression odds ratios describe the degree to which the model's independent variables predict the dependent variable.

False

Logistic Regression odds ratios describe the degree to which the model's independent variables predict the dependent variable. True/False?

True

Logistic Regression seeks to predict the probability that a dependent variable will fall into a particular category based on the values of independent variables.

Phase 5

Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is. Identify the CRISP-DM phase for this task.

False

Most charts, graphs, and other visualizations would be considered predictive models.

greater than 10

Multicollinearity between independent variables is severe if the variance inflation factor is a.) Between -1 and +1 b.) Substantially less than 1 c.) Greater than 10 d.) Less than 5
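
For reference, the variance inflation factor for predictor j is commonly defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the other predictors. A small sketch (the function name is illustrative) shows why a threshold of 10 corresponds to R_j^2 of about 0.90:

```python
def variance_inflation_factor(r_squared_j):
    """VIF for one predictor, given R^2 from regressing it on the other predictors."""
    if not 0 <= r_squared_j < 1:
        raise ValueError("r_squared_j must satisfy 0 <= R^2 < 1")
    return 1 / (1 - r_squared_j)


for r2 in (0.0, 0.5, 0.9, 0.95):
    print(r2, round(variance_inflation_factor(r2), 2))
# 0.0 1.0
# 0.5 2.0
# 0.9 10.0
# 0.95 20.0
```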

False

One advantage of neural networks is that there is very little chance of overfitting, so validation or testing data is not needed.

True

One disadvantage of neural networks is that they are slow learners.

True

Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.

True

Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points. True/False

True

Pie and bar charts are used to summarize nominal and ordinal data.

True

Predicting something like the average length of a delivery person's shift is a well-suited task for decision tree modeling.

False

Predicting something like the average length of a delivery person's shift is a well-suited task for decision tree modeling. True/False

True

Predicting the approval or disapproval of a loan based on credit scores and demographic information is a good application of Logistic Regression. True/False

True

Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms is a supervised technique.

True

Printing of custom discount coupons at the conclusion of a grocery store checkout based on what you just bought and what others have bought previously is an unsupervised technique.

False

Sick people correctly identified as sick by the diagnostic test is known as True Negative.

False

Sick people correctly identified as sick by the diagnostic test is known as True Negative. True/False

Multicollinearity

Significant _________ may exist when the overall F-statistic is significant and the individual t statistics for all independent variables are insignificant.

True

Specificity measures how good a test is at finding something if it's false.

c.) Multicollinearity may exist between Profit and Assets

Suppose that we are trying to predict PROFIT using Sales, #emp, Assets, and Stockholder'sEq. Which statement below is FALSE? a.) Profit is the dependent or response variable b.) #emp has the weakest correlation with profit c.) Multicollinearity may exist between Profit and Assets d.) All of the above statements are true

0.90 (450/500)

Suppose we develop a Logistic Regression (LR) model to predict whether email is Spam or Not Spam. After we apply the model to a test set of 500 new email messages and the LR model produces the following confusion matrix, what is the accuracy rate for this model?

0.90 (450/500)

Suppose we develop a Logistic Regression (LR) model to predict whether email is Spam or Not Spam. After we apply the model to a test set of 500 new email messages and the LR model produces the following confusion matrix, what is the accuracy rate for this model? a.) 0.875 (70/80) b.) 0.78 (390/500) c.) 0.90 (450/500) d.) 0.10 (50/500)

False

The "K" in K-Means Cluster Analysis refers to the maximum number of variables that a clustering model can utilize.

True

The CRISP-DM process model aims to make large data mining projects, less costly, more reliable, more repeatable, more manageable, and faster.

Cluster 4

The Euclidean distances of a customer from the different cluster centers are given below. Based on this information, to which cluster does the customer belong?
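
The assignment rule behind this card is simply "pick the cluster whose center is closest". The distance table from the original question is not included in this set, so the coordinates below are hypothetical and chosen only to illustrate the rule; `math.dist` (Python 3.8+) computes the Euclidean distance.

```python
import math

# Hypothetical customer and cluster centers (not the data from the original question).
customer = (2.0, 3.0)
centers = {
    "Cluster 1": (8.0, 9.0),
    "Cluster 2": (0.0, 10.0),
    "Cluster 3": (7.0, 1.0),
    "Cluster 4": (3.0, 4.0),
}

# Euclidean distance from the customer to every cluster center.
distances = {name: math.dist(customer, center) for name, center in centers.items()}

# The customer is assigned to the cluster with the smallest distance.
assigned = min(distances, key=distances.get)
print(assigned)  # Cluster 4
```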

Response, zero

The Y intercept (b0) in a multiple regression model represents the estimated value of the ________ variable, when the values of all independent variables are ______. a.) Response, one b.) Response, zero c.) Predictor, one d.) Predictor, zero

Y-Intercept

The __________ of the simple linear regression model (Y = B0 + B1*X) is the value of Y when the value of X is zero.

All of the methods listed above are appropriate

The choice of k (number of clusters in a Cluster Analysis) can be made using a variety of methods. Which of the following methods is appropriate in selecting the number of clusters?

False

The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit.

False

The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit. true/false?

False

The correlation coefficient can only assume value between zero and 1, inclusive.

True

The curse of dimensionality refers to the computational complexity of developing clusters using a large number of variables.

True

The dependent variable in logistic regression can have more than two values or classes. True/False?

False

The dependent variable in logistic regression is always binary. True/False?

Response

The dependent variable, the variable of interest in an experiment, is also called ___________ variable.

True

The different phases in the BA Life Cycle methodology are closely interrelated.

True

The error term in the regression model describes the effects of all factors other than the independent variables on y (response variable). True/False

True

The estimated simple linear regression equation minimizes the sum of the squared deviations between each value of Y and the line.

True

The estimated simple linear regression equation minimizes the sum of the squared deviations between each value of Y and the line. True/False?

False

The final stage of the CRISP-DM data mining process is "Evaluation."

Clustering

The following business question - Can we identify different groups of customers based on various demographic and purchasing characteristics? - can be answered by building a _____________ DM model.

Classification

The following business question - What factors influence customers to churn? - can be answered by building a _____________ DM model.

Classification

The following business question - What customer characteristics best predict bank loan default or fraud? - can be answered by building a _____________ DM model.

b.) Plane

The graph of the prediction equation obtained from the following model is a(n): y= B0 + B1X1 +B2X2 + E a.) Exponential curve b.) Plane c.) Line d.) Parabola

True

The mean and median are the same for a normal distribution.

True

The most popular method for using model errors to update weights is called back propagation of error.

72.75%

The output below (mosaic plot and contingency table) was generated from the Titanic data file. The first data variable is Survived (Yes or No) and the second variable is Sex (Male, Female). As per the Contingency Table, what percentage of all of the females survived on the Titanic?

50%

The predictive accuracy of this model is ...

-1 to 1

The range of feasible values for the correlation coefficient is from:

0 to 1

The range of feasible values for the multiple coefficient of determination is from: a.) 0 to 1 b.) -1 to 1 c.) minus infinity to 0 d.) -1 to 0 e.) 0 to infinity

True

The three basic building blocks of business analytics are technology, process, and people.

True

The two steps in building a decision tree model are to first generate exact probability predictions for each data record, then convert that to a yes/no prediction based on a cutoff percentage.

False

Today, business analytics are data driven rather than business driven.

True

Training a neural network model involves estimating the weights that will lead to the best predictive results.

False

Two variables x and y have a high correlation coefficient. Therefore, it can be concluded that changes in x causes y to change.

True

Under the pairwise deletion approach to managing missing data we exclude a case with a missing value for a variable only when the analysis includes that variable.

False

Unstructured data increases the veracity in the data.

The centroids of the discovered cluster and the assignment of each input datum to a cluster.

What are the outputs generated by a k-Means clustering Analysis?

Logistic Regression returns probability estimates of a response variable

What is a distinct property of Logistic Regression compared to Linear Regression?

Logistic Regression returns probability estimates of a response variable

What is a distinct property of Logistic Regression compared to Linear Regression? a.) Logistic Regression works very well with continuous target variable b.) Logistic Regression is robust with correlated predictor variables c.) Logistic Regression handles missing values well d.) Logistic Regression returns probability estimates of a response variable
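
As a rough illustration of why logistic regression yields probability estimates, the sketch below passes a linear score through the logistic (sigmoid) function. The coefficients, the 0.5 cutoff, and the function names are assumptions chosen for the example, not values from any fitted model.

```python
import math

def predict_probability(x, intercept, slope):
    """Logistic function applied to a linear score; result lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def classify(probability, cutoff=0.5):
    """Turn a probability estimate into a Yes/No prediction."""
    return "Yes" if probability >= cutoff else "No"

# Hypothetical coefficients for a single predictor.
intercept, slope = -2.0, 0.8
probabilities = [predict_probability(x, intercept, slope) for x in (0, 2.5, 5)]
labels = [classify(p) for p in probabilities]
```

A linear regression would return the unbounded score `intercept + slope * x` directly; the logistic transform is what keeps the output between 0 and 1.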

H0: µC = µA

What is the appropriate null hypothesis for testing whether the mean profits of the Computer and Aerospace industries differ?

60%

What is the predictive accuracy of the CA model?

$400

What is the total cost of errors for the DT model?

$400

What is the total cost of errors for the DT model? a.) $1400 b.) $1000 c.) $800 d.) $200 e.) $400

31%

What percentage of the variability in the dependent variable is accounted for by changes in the independent variables in this model?

31%

What percentage of the variability in the dependent variable is accounted for by changes in the independent variables in this model? a.) 0 % b.) 31 % c.) 47 % d.) 100 % e.) 2 %

False

When creating a decision tree, in general we want to keep splitting as long as we can create less purity in the nodes.

False

When creating a decision tree, in general we want to keep splitting as long as we can create less purity in the nodes. True/False

False

When creating a decision tree, we want to keep splitting as long as we can create more impurity in the nodes.

True

When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the validation data set.

True

When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the validation data set. True/False

False

When the F test is used to test the overall significance of a multiple regression model, if the null hypothesis is rejected, it can be concluded that all of the independent variables X1, X2, X3,... Xk are significantly related to the dependent variable Y. True/False?

False

When the predictor variable is categorical and the response variable is continuous, you would use a logistic regression model. True/False?

False

When using simple regression analysis, if there is a strong correlation between the independent and dependent variable, then we can conclude that an increase in the value of the independent variable causes an increase in the value of the dependent variable.

False

When we are predicting Fraud cases, we want to minimize false positive rates.

In a hierarchical clustering algorithm

Where would you most likely see a dendrogram?

Review project

Which activity is performed in the Deployment phase of the CRISP-DM process?

LR

Which is the best model based on predictive accuracy?

LR

Which is the worst model based on a minimized cost of erroneous predictions? a.) LR b.) CA c.) DT

None of the above

Which of the following activation functions is not used in neural networks (in JMP)?

Boxplot

Which of the following data visualization charts might include "whiskers"?

Desire to analyze big data

Which of the following is not one of the four drivers of business analytics?

Whether a Person Has a Traffic Violation

Which of the following is a categorical variable?

All of the above are weaknesses of neural networks.

Which of the following is a disadvantage of neural networks?

As the value of x increases, the value of the error term also increases.

Which of the following is a violation of one of the major assumptions of the simple regression model?

Cluster Analysis

Which of the following is an un-supervised data mining technique?

Requires assumptions of statistical models.

Which of the following is not true about the Regression Trees DM technique? a.) Produces rules that are easy to interpret and implement. b.) Requires assumptions of statistical models. c.) Easy to use and understand. d.) Variable selection & reduction is automatic

The level of measurement of the data for the dependent variable is at least nominal.

Which of the following is not an accurate assumption of the linear regression model?

In which country did we have the highest sales last quarter?

Which of the following questions is typically addressed via a Business Intelligence project?

3,0,1,1

Which of the following sets of counts represents the contents of the completed LR confusion matrix?

3,0,1,1

Which of the following sets of counts represents the contents of the completed LR confusion matrix? a.) 1,2,1,1 b.) 2,0,1,1 c.) 3,0,1,1 d.) 1,2,0,2 e.) 4,0,0,1

All of the above statements are true

Which of the following statements about a ROC curve is true?

Building the model

Which of the following tasks would NOT be part of the data understanding phase in CRISP-DM?

Logistic Regression and Decision Tree

Which two analytical methods can be used for categorical target variables?

Logistic Regression and Decision Tree

Which two analytical methods can be used for categorical target variables? a.) Cluster Analysis and Logistic Regression b.) Linear Regression and Decision Tree c.) Linear Regression and Logistic Regression d.) Decision Tree and Cluster Analysis e.) Logistic Regression and Decision Tree

Logistic Regression

You are tasked with predicting if a customer will purchase a product (Yes/No) when the customer visits the website and the probability of a purchase decision. You are provided with other relevant variables that are associated with the problem. Which analytical method would you recommend?

Logistic Regression

You are tasked with predicting if a customer will purchase a product (Yes/No) when the customer visits the website and the probability of a purchase decision. You are provided with other relevant variables that are associated with the problem. Which analytical method would you recommend? a.) Linear Regression b.) ANOVA c.) Logistic Regression d.) Association Rules

49

You are using "state", a categorical variable with 50 possible values, in your linear regression model. Into how many dummy variables should "state" be expanded for your model? a.) 49 b.) 2 c.) Categorical Variables cannot be used with Linear Regression d.) 51
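
The count of 49 follows from leaving one level out as the reference category. The sketch below shows this; the helper `dummy_encode` and the placeholder state codes `S00` to `S49` are hypothetical, and the choice of the alphabetically first level as the reference is an assumption.

```python
def dummy_encode(values):
    """Expand a categorical column into 0/1 columns, dropping one reference level."""
    levels = sorted(set(values))
    # Assumption: the first level (alphabetically) is the omitted reference.
    reference, others = levels[0], levels[1:]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return reference, others, rows

# Placeholder codes standing in for 50 distinct states.
states = [f"S{i:02d}" for i in range(50)]
reference, columns, rows = dummy_encode(states)
```

A row for the reference state is all zeros, so its effect is absorbed by the intercept. Adding a 50th dummy would make the columns sum to the intercept column and create perfect multicollinearity.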

Scatter Plot

You received 100,000 home loan records. You want to take a quick look and see if there is any relationship between mortgage age and mortgage amount before conducting advanced analytics. Which graphical tool would you employ for your preliminary analysis?

K-means Clustering

You want to group customers in your dataset by similarity and assign labels to each group. What is the preferred analytic method to use for this task?

Backward Elimination

___________ is an iterative variable selection procedure that allows an independent variable to be deleted from a multiple regression model during the next iteration.

Veracity

______________ refers to the accuracy or trustworthiness of the big data.

Prescriptive

____________________ is a form of analytics which examines data to answer the question "what should be done?" or "what can we do to make XYZ happen?"

The slope of the regression line must also be positive

In simple regression analysis, if the correlation coefficient (r) between the dependent and independent variable is a positive value, then A) The slope of the regression line must also be positive. B) The Y intercept must also be a positive value. C) The coefficient of determination can be either positive or negative, depending on the value of the slope. D) The standard error of estimate can either have a positive or a negative value. E) The least squares regression equation could either have a positive or a negative slope.

