Analytics Final
Which of the following is a disadvantage of neural networks?
- Their "black box" reputation.
- If the network sees only cases in a certain range, its ability to extrapolate (predict outside this range) is a serious danger.
- Neural networks do not have a built-in variable selection method.
- All of the above are weaknesses of neural networks.
Which of the following statements about a ROC curve is true?
- It stands for "Receiver Operating Characteristic Curve".
- It plots the false positive rate (1 - specificity) against the true positive rate (sensitivity).
- Each point on a ROC curve corresponds to a particular confusion matrix that depends on a specific threshold or cutoff.
- All of the above statements are true.
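The relationship between cutoffs and ROC points can be sketched as follows. This is a minimal illustration, not taken from the course material: the labels, scores, and the helper name `roc_points` are all made up for the example.

```python
def roc_points(labels, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in thresholds:
        # Predict "positive" whenever the score reaches the cutoff.
        predicted = [score >= cutoff for score in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: 1 = actual positive, 0 = actual negative.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Each cutoff yields its own confusion matrix, hence its own ROC point.
points = roc_points(labels, scores, thresholds=[1.1, 0.65, 0.35, 0.0])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cutoff moves the point from (0, 0) toward (1, 1): more cases are called positive, so both rates can only rise.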
Which of the following is a violation of one of the major assumptions of the simple regression model?
As the value of x increases, the value of the error term also increases
A negative correlation coefficient (r) implies a weak relationship among the variables.
False
Cluster Analysis is a supervised learning technique.
False
Decision Tree is an unsupervised data mining technique.
False
Hierarchical clustering is a good technique when you have a very large data set.
False
If 100 patients known to have a disease were tested, and 43 tested positive, then the test has 43% specificity.
False
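The statement is false because all 100 patients have the disease, so the 43% figure describes sensitivity, not specificity. A small sketch of the arithmetic (the function name is only illustrative):

```python
def sensitivity(true_positives, false_negatives):
    """Share of actual positives that the test correctly flags."""
    return true_positives / (true_positives + false_negatives)


diseased = 100          # everyone tested is known to have the disease
tested_positive = 43    # true positives
missed = diseased - tested_positive  # false negatives

rate = sensitivity(tested_positive, missed)
print(f"sensitivity = {rate:.0%}")
```

Specificity cannot be computed from this card at all, since no disease-free patients were tested.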
In medicine, if the test is positive, it is good news.
False
In predicting the financial status of firms, sensitivity is the ability to predict a firm that is going to stay solvent correctly.
False
One advantage of neural networks is that there is very little chance of overfitting, so validation or testing data is not needed.
False
Predicting something like the average length of a delivery person's shift is a well-suited task for decision tree modeling.
False
The correlation coefficient (r) indicates the amount of change in variable Y when variable X changes by one unit.
False
The dependent variable in logistic regression is always binary.
False
When creating a decision tree, we want to keep splitting as long as we can create more impurity in the nodes.
False
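The goal of a split is the opposite: less impurity in the child nodes. The sketch below uses Gini impurity as one common impurity measure; that choice is an assumption for illustration (the course software may use a different criterion), and the class counts are made up.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the count of cases in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_impurity(children):
    """Average impurity of the child nodes, weighted by node size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini_impurity(child) for child in children)


# Hypothetical node with 10 cases of each class, split into two children.
parent = [10, 10]
children = [[9, 1], [1, 9]]

before = gini_impurity(parent)
after = weighted_impurity(children)
print(f"impurity before split: {before:.2f}, after split: {after:.2f}")
```

A worthwhile split is one where the weighted impurity after the split is lower than before.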
When creating a decision tree, we want to keep splitting as long as we keep increasing R-Square for the training data set.
False
When the F test is used to test the overall significance of a multiple regression model, if the null hypothesis is rejected, it can be concluded that all of the independent variables X1, X2, X3, ..., Xk are significantly related to the dependent variable Y.
False
When the predictor variable is categorical and the response variable is continuous, you would use a logistic regression model.
False
In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between the independent and dependent variables but also shows whether the relationship is positive or negative.
False
The variance inflation factor (VIF) measures the relationship between the dependent variable and the rest of the independent variables in the regression model.
False
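The statement is false because VIF looks only at the predictors: for predictor j, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below covers only the special case of two predictors, where that R² equals the squared correlation between them. The data and function names are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors (same value for both)."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictor values; note that the dependent variable Y never appears.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.2f}")
```

The more strongly the predictors are related to each other, the closer R² gets to 1 and the larger the VIF.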
Where would you most likely see a dendrogram?
In a hierarchical clustering algorithm
An odds ratio ____________ 1 indicates that the condition or event is less likely to occur in the first group.
Less than
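As a rough illustration of this card, the odds ratio can be computed from event counts in two groups. The counts below are invented for the example.

```python
def odds(events, non_events):
    """Odds of the event: cases with the event divided by cases without it."""
    return events / non_events


def odds_ratio(events_1, non_events_1, events_2, non_events_2):
    """Odds in the first group divided by odds in the second group."""
    return odds(events_1, non_events_1) / odds(events_2, non_events_2)


# Hypothetical counts: 10 of 100 in group 1 and 20 of 100 in group 2 have the event.
ratio = odds_ratio(10, 90, 20, 80)
print(f"odds ratio = {ratio:.3f}")
```

Here the ratio is below 1, which matches the card: the event is less likely in the first group.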
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) Select the appropriate test/model to determine if there is a relationship between age and household income.
Linear Regression
Which two analytical methods can be used for categorical target variables?
Logistic Regression and Decision Tree
What is a distinct property of Logistic Regression compared to Linear Regression?
Logistic Regression returns probability estimates of a response variable
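A minimal sketch of why the output is a probability: the linear part of the model gives log-odds, and the logistic function maps any log-odds value into the interval (0, 1). The intercept, slope, and inputs below are hypothetical values, not estimates from any data set.

```python
import math


def predicted_probability(intercept, slope, x):
    """Logistic model: convert the linear predictor (log-odds) to a probability."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for a single predictor.
intercept, slope = -3.0, 0.05

probabilities = [predicted_probability(intercept, slope, x) for x in (20, 60, 100)]
for p in probabilities:
    print(f"{p:.3f}")
```

A linear regression with the same coefficients would return the raw values -2, 0, and 2, which cannot be read as probabilities.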
Which of the following activation functions is not used in neural networks (in JMP)?
None of the above
The graph of the prediction equation obtained from the following model is a: Y = B0 + B1X1 + B2X2 + E
Plane
In the Titanic data analysis, which of the independent variables in this model are significant predictors of Survived variable?
Sex, Passenger Class, Age, & Sibling and Spouses
An anti-theft scanner at an entrance of a book store buzzes once for every 1000 innocent people walking through the scanner. The accuracy of the scanner is:
Specificity of 99.9%
How is False Positive Rate defined?
The fraction of negative instances that were misclassified
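The scanner card and the false positive rate definition are two sides of the same calculation, sketched below. Innocent shoppers are the actual negatives, and a buzz on an innocent shopper is a false positive. The function names are illustrative.

```python
def specificity(true_negatives, false_positives):
    """Share of actual negatives that are correctly left unflagged."""
    return true_negatives / (true_negatives + false_positives)


def false_positive_rate(true_negatives, false_positives):
    """Share of actual negatives that are misclassified as positive."""
    return false_positives / (true_negatives + false_positives)


innocent_people = 1000
false_alarms = 1  # the scanner buzzes once per 1000 innocent people
quiet_passes = innocent_people - false_alarms

spec = specificity(quiet_passes, false_alarms)
fpr = false_positive_rate(quiet_passes, false_alarms)
print(f"specificity = {spec:.1%}, false positive rate = {fpr:.1%}")
```

The two numbers always add up to 1, which is why the ROC axis is labeled "1 - specificity".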
In simple regression analysis, if the correlation coefficient (r) is a positive value, then
The slope of the regression line must also be positive.
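The reason is that the least-squares slope and r share the same numerator (the sum of cross-products), while their denominators are always positive. The sketch below shows this on made-up data; it also shows that the two numbers differ in size, so r is not the change in Y per unit change in X.

```python
import math


def slope_and_r(x, y):
    """Return (least-squares slope, Pearson r) for a simple regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Both statistics divide the same sxy by a positive quantity.
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, r = slope_and_r(x, y)
print(f"slope = {slope:.3f}, r = {r:.3f}")
```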
A False Positive error is a Type I error.
True
A confusion matrix is used to describe the performance of a classification model.
True
A good clustering scheme will have little variation within clusters and significant variation between clusters.
True
According to one of the videos, one disadvantage of neural networks is that they are slow learners
True
According to the text, the most popular choice for the number of hidden layers is one.
True
Cluster analysis is a very attractive initial data-mining tool because it can be used to discover rules and patterns.
True
Each terminal node in a decision tree can be translated into a single IF-THEN rule.
True
If the probability of The University of Akron basketball team winning against the Kent State team is 0.5, then the odds of The University of Akron winning against Kent State are 1.
True
If the probability of winning a game is 0.2, the odds of winning the game are 0.25.
True
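Both odds cards follow from the same conversion, odds = p / (1 - p). A short sketch (the function name is illustrative):

```python
def probability_to_odds(p):
    """Odds in favor of an event with probability p (requires p < 1)."""
    return p / (1.0 - p)


even_game = probability_to_odds(0.5)   # the Akron vs. Kent State card
long_shot = probability_to_odds(0.2)   # the 0.2 probability card
print(even_game, long_shot)
```

A probability of 0.2 means 1 win for every 4 losses, which is odds of 1/4.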
In data mining, overfitting results in a model that is fit too precisely to the training data, so it fails to generalize and has poor predictive power.
True
In order to include a categorical variable in k-means cluster analysis in JMP, the data must be coded numerically. Therefore, categorical variables should be coded as 0/1.
True
In order to use a classification tree, the target variable must be categorical and not continuous.
True
In regression analysis, if the normal probability plot of residuals exhibits approximately a straight line, then it can be concluded that the assumption of normality is not violated.
True
The K-means algorithm is a typical algorithm for cluster analysis that uses "Euclidean distance" to find clusters.
True
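One round of k-means can be sketched as: assign every point to the centroid with the smallest Euclidean distance, then move each centroid to the mean of its members. The points and starting centroids below are made up, and the sketch assumes every cluster keeps at least one member.

```python
import math


def assign_to_nearest(points, centroids):
    """Index of the closest centroid (by Euclidean distance) for each point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))
        for point in points
    ]


def update_centroids(points, assignments, n_clusters):
    """Move each centroid to the coordinate-wise mean of its assigned points."""
    new_centroids = []
    for k in range(n_clusters):
        members = [p for p, a in zip(points, assignments) if a == k]
        # Assumes the cluster is not empty.
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


# Hypothetical two-dimensional data and starting centroids.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(0, 0), (10, 10)]

assignments = assign_to_nearest(points, centroids)
centroids = update_centroids(points, assignments, n_clusters=2)
print(assignments, centroids)
```

The full algorithm repeats these two steps until the assignments stop changing.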
Logistic Regression analysis is a supervised data mining technique
True
Neural networks can be used for both continuous and categorical dependent (output) variables
True
One way to decide on the number of clusters in a cluster analysis is to arbitrarily pick a value.
True
Predicting the approval or disapproval of a loan based on credit scores and demographic information is a good application of Logistic Regression.
True
Regression analysis is an example of a supervised learning technique.
True
Specificity measures how good a test is at correctly identifying cases where the condition is absent (true negatives).
True
The curse of dimensionality refers to the computational complexity of developing clusters using a large number of variables.
True
The most popular method for using model errors to update weights is called back propagation of error.
True
Training a neural network model involves estimating the weights that will lead to the best predictive results.
True
In this residual plot, it appears that the constant variance assumption is not being violated.
True
In a decision tree algorithm, how is the attribute picked for the next split?
You pick the attribute with the highest Logworth.
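LogWorth is defined as -log10 of a split's p-value, so a smaller p-value gives a larger LogWorth. The selection rule can be sketched as below; the attribute names and p-values are hypothetical, and the software's own adjustment of the p-values is not reproduced here.

```python
import math


def logworth(p_value):
    """LogWorth of a candidate split: -log10 of its p-value."""
    return -math.log10(p_value)


# Hypothetical p-values for the best split on each candidate attribute.
candidate_p_values = {"Sex": 0.000001, "Age": 0.001, "Passenger Class": 0.02}

scores = {name: logworth(p) for name, p in candidate_p_values.items()}
best_attribute = max(scores, key=scores.get)
print(best_attribute, round(scores[best_attribute], 2))
```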