Analytics Final
Analytics can help a company hire, retain, and promote the best people. T/F
True (1)
Each terminal node in a decision tree can be translated into a single IF-THEN rule. T/F
True (8)
All of the following are correct except one when describing present-day analytics. a. Centrally managed b. Fact-based decision-making c. Decision focus d. Data driven
d. Data driven (1)
This is a valid null hypothesis: The average weight of desks made on assembly line one is different from the average weight of desks made on assembly line two. T/F
False (4)
A negative correlation coefficient (r) implies weak relationship among the variables. T/F
False (5)
In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative. T/F
False (5)
The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit. T/F
False (5)
The variance inflation factor (VIF) measures the relationship between the dependent variable and the rest of the independent variables in the regression model. T/F
False (5)
When the F test is used to test the overall significance of a multiple regression model, if the null hypothesis is rejected, it can be concluded that all of the independent variables X1, X2, X3,... Xk are significantly related to the dependent variable Y. T/F
False (5)
The dependent variable in logistic regression is always binary. T/F
False (6)
When the predictor variable is categorical and the response variable is continuous, you would use a logistic regression model. T/F
False (6)
In medicine, if the test is positive, it is good news. T/F
False (7)
In predicting the financial status of firms, sensitivity is the ability to predict a firm that is going to stay solvent correctly. T/F
False (7)
When creating a decision tree, we want to keep splitting as long as we can create more impurity in the nodes. T/F
False (8)
When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the Training data set. T/F
False (8)
Cluster Analysis is a supervised learning technique. T/F
False (9)
Hierarchical clustering is a good technique when you have a very large data set. T/F
False (9)
Unstructured data increases the veracity in the data. T/F
False (1)
A scatter plot can be drawn with a set of two categorical data. T/F
False (3)
In a statistical study, the random variable X = 1, if the house is colonial, and X = 0 if the house is not colonial, then it can be stated that the random variable is continuous. T/F
False (3)
Records that have outlier values should always be removed from a data set during data preparation and cleaning. T/F
False (3)
When doing the one-way ANOVA F test, we should always do a post-hoc analysis (such as Tukey paired comparisons). T/F
False (4)
The controller of a chain of toy stores is interested in determining whether there is any difference in the weekly sales of store 1 and store 2. The weekly sales are normally distributed. This problem should be analyzed using one-way ANOVA. T/F
False (4)
Analytics can help companies only in a few industries. T/F
False (1)
Big data is structured and clean data. T/F
False (1)
The five Vs of Big Data are volume, velocity, volatility, variety, and veracity. T/F
False (1)
Today, business analytics are data driven rather than business driven. T/F
False (1)
According to some experts, analytics has become important for companies these days because many products are becoming commodities and there is minimal differentiation in products and services. T/F
True (1)
The three basic building blocks of business analytics are technology, process, and people. T/F
True (1)
A histogram is only appropriate for variables whose values are numerical and measured on an interval or ratio scale. T/F
True (3)
Pie and bar charts are used to summarize nominal and ordinal data. T/F
True (3)
The interquartile range (IQR) is a measure of variability. A data value is identified as an outlier if it is located 1.5(IQR) or more below the first quartile (Q1) or above the third quartile (Q3). T/F
True (3)
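A minimal sketch of the 1.5(IQR) rule from the card above. The function name and the sample values are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile method JMP or a textbook uses.

```python
import statistics

def iqr_outliers(values):
    """Return values located 1.5 * IQR or more below Q1 or above Q3."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v <= low_fence or v >= high_fence]

# Hypothetical data: nine ordinary values and one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = iqr_outliers(sample)
```

Flagging a value this way only identifies it; as an earlier card notes, outliers should not always be removed.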
Deciding Ha (alternate hypothesis) is true when it is false is a Type I error. T/F
True (4)
If the frequency distribution is skewed to the left, then the median of the distribution will be greater than the mean. T/F
True (4)
In regression analysis, if the normal probability plot of residuals exhibits approximately a straight line, then it can be concluded that the assumption of normality is not violated. T/F
True (5)
In this residual plot it appears that the constant variance assumption is not being violated. *View residual plot image* T/F
True (5)
Regression analysis is an example of a supervised learning technique. T/F
True (5)
If the probability of The University of Akron basketball team winning against Kent State team is .5, then the odds of The University of Akron winning against Kent State is 1. T/F
True (6)
If the probability of winning a game is 0.2, the odds of winning the game is 0.25. T/F
True (6)
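The two odds cards above follow from the definition odds = p / (1 - p). A short sketch, with hypothetical function names, that checks both values and converts odds back to a probability:

```python
def odds_from_probability(p):
    """Odds in favor of an event that has probability p (0 <= p < 1)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Inverse conversion: probability implied by the given odds."""
    return odds / (1 + odds)

even_game = odds_from_probability(0.5)   # probability .5 gives odds of 1
long_shot = odds_from_probability(0.2)   # probability 0.2 gives odds of 0.25
```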
Logistic Regression analysis is a supervised data mining technique. T/F
True (6)
A False Positive error is a Type I error. T/F
True (7)
A confusion matrix is used to describe the performance of a classification model. T/F
True (7)
Specificity measures how good a test is at finding something if it's false. T/F
True (7)
In data mining over-fitting results in developing too precise a model that will fail to generalize and therefore will have poor predicting power. T/F
True (8)
In order to use a classification tree, the target variable must be categorical and not continuous. T/F
True (8)
A good clustering scheme will have little variation within clusters and significant variation between clusters. T/F
True (9)
Cluster analysis is a very attractive initial data-mining tool because it can be used to discover patterns. T/F
True (9)
In order to include a categorical variable in k-means cluster analysis in JMP, the data must be coded numerically. Therefore, categorical variables should be coded as 0/1. T/F
True (9)
K-means algorithm is a typical algorithm for cluster analysis that uses "Euclidean distance" to find clusters. T/F
True (9)
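As an illustration of how Euclidean distance is used in k-means, the sketch below shows only the assignment step: each point goes to its nearest centroid. The function name, point, and centroids are hypothetical, and this is not a full k-means implementation.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical two-dimensional example with two centroids
centroids = [(0.0, 0.0), (5.0, 5.0)]
cluster = nearest_centroid((1.0, 1.0), centroids)
```

A complete k-means run would alternate this assignment step with recomputing each centroid as the mean of its assigned points.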
One way to decide on the number of clusters in a cluster analysis is to arbitrarily pick a value. T/F
True (9)
The curse of dimensionality refers to the computational complexity of developing clusters using a large number of variables. T/F
True (9)
Given the odds ratio output on what residents would do in case of a hurricane threat, the odds of evacuating for each additional pet they have is _______. *View image on odds ratio jmp* a. 0.2655 b. 0.0049 c. 3.765 d. 201.140
a. 0.2655 (6)
Consider the following scenario. A hospital performs a test for a rare disease on 200 patients and the results of the test are given below. Answer the following questions based on the results. The sensitivity for this sample is: *View image on test score vs truth* a. 0.9 b. 0.8 c. 0.1 d. 0.2
a. 0.9 (7)
If a null hypothesis is rejected at a significance level of .01, it will ______ be rejected at a significance level of .05. a. Always b. Sometimes c. Never
a. Always (4)
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) For the variable Marital Status (S/M/W/D) identify the Data type. a. Categorical/nominal b. Continuous c. Categorical/Ordinal
a. Categorical/nominal (3)
An important factor in selecting software for word-processing and database management systems is the time required to learn how to use the system. To evaluate three file management systems, a firm designed a test involving word processing operators. Since operator variability was believed to have an impact, each of the operators was trained on all three of the file management systems. *View analysis of variance table* If we are using an alpha value of .05 then we would conclude that a. Differences exist among both systems and operators. b. Differences exist among systems only. c. Differences exist among operators only. d. No differences exist.
a. Differences exist among both systems and operators. (4)
An anti-theft scanner at an entrance of a book store buzzes once for every 1000 innocent people walking through the scanner. The accuracy of the scanner is: a. Specificity of 99.9% b. Sensitivity of 99.9% c. None of the above d. Do not have enough information to answer this question.
a. Specificity of 99.9% (7)
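The scanner answer can be checked with the usual definitions: specificity = TN / (TN + FP) and sensitivity = TP / (TP + FN). In this sketch the function names are hypothetical, and the counts are read directly from the card: out of 1000 innocent people, 1 triggers the buzzer (false positive) and 999 do not (true negatives).

```python
def specificity(true_negatives, false_positives):
    """Share of actual negatives that the test correctly leaves unflagged."""
    return true_negatives / (true_negatives + false_positives)

def sensitivity(true_positives, false_negatives):
    """Share of actual positives that the test correctly flags."""
    return true_positives / (true_positives + false_negatives)

# 1 buzz per 1000 innocent people: 999 true negatives, 1 false positive
scanner_specificity = specificity(true_negatives=999, false_positives=1)
```

The card gives no counts for actual thieves, so sensitivity cannot be computed from it.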
A multiple regression analysis produced the following table: *View multiple regression table* Using α = .05, which variable would you drop using backward elimination? a. x1 b. x2 c. Would not drop any independent variables.
a. x1 (5)
If 100 megabytes of storage cost a penny, a terabyte of storage would cost: a. $1,000 b. $100 c. $10,000 d. $100,000
b. $100 (1)
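The storage arithmetic works out as follows, assuming decimal units (1 terabyte = 1,000,000 megabytes); the variable names are illustrative only.

```python
MB_PER_TB = 1_000_000        # assumption: decimal units, not 1024-based
MB_PER_PENNY = 100           # from the card: 100 MB cost one penny

pennies = MB_PER_TB // MB_PER_PENNY   # 10,000 pennies
dollars = pennies / 100               # 100 pennies per dollar
```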
The total economic cost of this model's errors is... *View image on outcome data set jmp* a. $260 b. $280 c. $300 d. $220
b. $280 (7)
What is the total cost of errors for the DT model? *View image of DT model* a. $200 b. $400 c. $800 d. $1400 e. $1000
b. $400 (7)
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) Select the appropriate test/model to determine if city service satisfaction level & marital status are independent? a. T-test b. Chi-square test c. Linear regression d. ANOVA
b. Chi-square test (4)
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) Select the appropriate test/model to determine if there is a relationship between age and household income. a. T-test of means b. Linear Regression c. ANOVA d. Chi-Square test/ Contingency model
b. Linear regression (5)
In order to test the effectiveness of a drug called XZR designed to reduce cholesterol levels, 9 heart patients' cholesterol levels are measured before they are given the drug. The same 9 patients use XZR for two continuous months. After two months of continuous use the 9 patients' cholesterol levels are measured again. The comparison of cholesterol levels before vs. after administering the drug is an example of testing the difference between: a. Two population variances from independent populations. b. Matched pairs from two dependent populations. c. Two population proportions. d. Two means from independent populations.
b. Matched pairs from two dependent populations. (4)
You received 100,000 home loan records. You want to take a quick look and see if there is any relationship between mortgage age and mortgage amount before conducting advanced analytics. Which graphical tool would you employ for your preliminary analysis? a. Stacked bar chart b. Scatter plot c. Histogram d. Box Plot
b. Scatter plot (3)
A study on the effects of television viewing of children reports that children watch an average of 4 hours of television per night. Sarah believes that the average number of hours children in her neighborhood watch television per night is not 4. She performs a hypothesis test and rejects the null hypothesis. In reality, children in her neighborhood DO watch an average of 4 hours of TV per night. What type of error did Sarah make? a. Type II b. Type I c. Random d. Beta e. None of the above
b. Type I (4)
In a decision tree algorithm, how is the attribute picked for the next split? a. You pick the attribute where the conditional entropy is higher than the base entropy. b. You pick the attribute with the highest log worth. c. You pick the attribute where the conditional entropy is maximum. d. You pick the attribute with the lowest log worth.
b. You pick the attribute with the highest log worth (8)
Based on the Decision Tree, what is the probability that the following new customer will buy more in 2011 compared to 2010 (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending)? *View image on decision tree jmp* a. 0.1843 b. 0.3502 c. 0.8157 d. 0.9932 e. 0.5898
c. 0.8157 (8)
Suppose we develop a LR model to predict whether email is Spam or Not Spam. After we apply the model to a test set of 500 new email messages and the LR model produces the following confusion matrix, what is the accuracy rate for this model? *View predicted class image* a. 0.78 (390/500) b. 0.875 (70/80) c. 0.90 (450/500) d. 0.10 (50/500)
c. 0.90 (450/500) (6)
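Accuracy is the share of all predictions that are correct. The sketch below uses only the totals stated in the answer options (450 correct out of 500); the individual confusion-matrix cells are in the image and are not reproduced here. The function name is hypothetical.

```python
def accuracy(correct_predictions, total_predictions):
    """Fraction of all cases the model classified correctly."""
    return correct_predictions / total_predictions

spam_accuracy = accuracy(450, 500)      # 0.90
spam_error_rate = 1 - spam_accuracy     # misclassification rate, 50/500
```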
In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable? *View image on effect summary jmp* a. Sex, Passenger Class, Age b. Fare & Parents and Children c. Sex, Passenger Class, Age, & sibling and spouses d. Sex e. All independent variables are significant
c. Sex, passenger class, age, & sibling and spouses. (6)
Based on the end nodes of the decision tree (Outcome Variable - Increased = 0 means Spending in 2011 is less than 2010 Spending; Increased = 1 means Spending in 2011 is greater than 2010 Spending), which statement is True? *View image on partial decision tree* a. If Customer2010 <1 Then "Spending in 2011 is greater than 2010 Spending" with Prob .9932 b. If Customer2010 >=1 And Catalogs <18 Then "Spending in 2011 is less than 2010 Spending" with Prob .6498 c. If Customer2010 >=1 And Catalogs >=18 Then "Spending in 2011 is greater than 2010 Spending" - Prob .8157 d. All are True
d. All are true (8)
Which of the following statements about a ROC curve is true? a. It stands for "Receiver Operating Characteristic Curve" b. It plots the false positive rate (1- specificity) and true positive rate(sensitivity) c. Each point on a ROC curve corresponds to a particular confusion matrix that depends on a specific threshold or cutoff. d. All of the above statements are true
d. All of the above statements are true (8)
This mosaic chart shows the distributions of car size (small, medium, large) by country. Which of the following statements is false? a. Compared to Europeans and Japanese, Americans are more likely to drive large cars b. You could also do a test of independence on this data. c. This chart is based on categorical (nominal) data. d. All of the above statements are true.
d. All of the above statements are true. (3)
Which of the following is a violation of one of the major assumptions of the simple regression model? a. The error terms are independent of each other. b. Histogram of the residuals form a bell-shaped, symmetrical curve. c. The error terms show no pattern over time. d. As the value of x increases, the value of the error term also increases.
d. As the value of x increases, the value of the error term also increases. (5)
Where would you most likely see a dendrogram? a. In a k-means clustering algorithm b. In a neural network model c. In Decision Trees analysis d. In a hierarchical clustering algorithm e. None of the above
d. In a hierarchical clustering algorithm (9)
An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group. a. Greater than b. Equal to c. Not Equal to d. Less than
d. Less than (6)
Which two analytical methods can be used for categorical target variables? a. Cluster Analysis and Logistic Regression b. Linear Regression and Logistic Regression c. Decision Tree and Cluster Analysis d. Logistic Regression and Decision Tree e. Linear Regression and Decision Tree
d. Logistic regression and decision tree (8)
In simple regression analysis, if the correlation coefficient (r) is a positive value, then a. The Y intercept must also be a positive value. b. The coefficient of determination can be either positive or negative, depending on the value of the slope. c. The least squares regression equation could either have a positive or a negative slope. d. The slope of the regression line must also be positive. e. The standard error of estimate can either have a positive or a negative value.
d. The slope of the regression line must also be positive. (5)
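The sign of r and the sign of the least-squares slope always agree in simple regression because both share the same numerator, the sum of cross-products of deviations. A small sketch with hypothetical data and a hypothetical function name:

```python
import math

def correlation_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Shared numerator: sum of cross-products of deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)   # denominator is always positive
    slope = sxy / sxx                # denominator is always positive
    return r, slope

# Hypothetical data, not taken from the exam
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
r, slope = correlation_and_slope(xs, ys)
```

The example also shows why r is not the change in Y per unit change in X: that quantity is the slope, which depends on the units of X and Y, while r is unitless.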
What percentage of the variability in the dependent variable is accounted for by changes in the independent variables in this model? *View image on iterations jmp* a. 0 % b. 2 % c. 47 % d. 100 % e. 31 %
e. 31% (6)
What is the predictive accuracy of the CA model? *View image on CA model jmp* a. 100 % b. 40 % c. 80 % d. 20 % e. 60 %
e. 60% (7)
The choice of k (number of clusters in a Cluster Analysis) can be made using a variety of methods. Which of the following methods is appropriate in selecting the number of clusters. a. Based on subject-matter expert b. Based on convenience c. Based on constraints d. Arbitrarily select k e. All of the methods listed above are appropriate
e. All of the methods listed above are appropriate (9)
Which of the following is a categorical variable? a. Daily Sales in a Store b. Value of Company Stock c. Air Temperature d. Bank Account Balance e. Whether a person had a traffic violation
e. Whether a person had a traffic violation. (3)