SAS Advanced Analytics Exam 2 Questions
The odds ratio for a $1000 increase in income is 1.074. What does this mean for every $1000 increase in income?
The odds of the event increase 7.4%
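The arithmetic behind this card can be checked with a minimal sketch. The only input taken from the card is the odds ratio 1.074; the variable names are illustrative.

```python
import math

odds_ratio = 1.074                     # value quoted on the card
beta = math.log(odds_ratio)            # implied logit coefficient per $1000 of income
pct_change = (odds_ratio - 1) * 100    # percent change in the odds per $1000

print(f"coefficient per $1000: {beta:.4f}")
print(f"odds change per $1000: {pct_change:.1f}%")
```

Note that the 7.4% applies to the odds, not to the probability of the event.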
In the HP Principal Components Node, what is analysis performed on by default?
the correlation matrix
In data mining, a cutoff of the proportion of events in the population tends to maximize ____.
the mean of sensitivity and specificity
What do the priors, π0 and π1, represent regarding adjustments for oversampling?
the population proportions of class 0 and 1 respectively
The logit transformation transforms the probability scale to ____.
the real line (-∞, +∞)
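A small sketch of the logit and its inverse, to show how probabilities in (0, 1) map onto the whole real line. The function names are illustrative, not SAS functions.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

for p in (0.001, 0.2, 0.5, 0.8, 0.999):
    print(p, round(logit(p), 4))
```

Probabilities below 0.5 give negative logits, 0.5 gives 0, and probabilities above 0.5 give positive logits.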
How should validation data be prepared for scoring?
the same way that the training data was prepared for model building
In PROC LOGISTIC, models selected in the best-subsets selection method are ranked by ____
the score chi-square statistic
HP data mining nodes (e.g., HP Neural) employ ____ to speed up processing.
threaded kernel extensions.
What is the SELECT= option used for in PROC LOGISTIC?
to select ODS tables for display
What is the TECHNIQUE= option used for in PROC LOGISTIC?
to specify the optimization algorithm
All R library statements must remain in the Open Source Integration node script. (T/F)?
true
In theory, a polynomial regression model of sufficient complexity is a universal approximator. (T/F)?
true
Minimizing deviance is equivalent to maximizing likelihood when the target is a member of the exponential family. (T/F)?
true
Pruning the input with the smallest input-to-hidden weight can negatively impact the neural network's performance. (T/F)?
true
Radial basis functions train much faster and are less likely to fall into local minima than a multilayer perceptron. (T/F)?
true
The OUTMODEL= and INMODEL= options must be in separate PROC LOGISTIC programs. (T/F)?
true
You can perform logistic regression analysis with a generalized linear model. (T/F)?
true
Which radial basis combination function generates the most flexible networks?
unequal height & width
How can you increase the smoothing of an empirical logit plot?
use a small number of bins with many observations per bin
Cluster: PROPERTIES (Visual Analytics)
- # of clusters {*5*} - seed {*1234*} - initial assignment {*forgy*, random} - visible roles {*5*} - variable standardization - # of bins - max polylines - show ellipses - show centroids
Required variable handles for R in open source integration node:
- &EMR_MODEL - &EMR_IMPORT_DATA
Generalized Linear Model: ROLES (Visual Analytics)
1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight
PROC LOGISTIC statement order
1. class 2. effectplot 3. model 4. oddsratio 5. roc 6. roccontrast 7. score 8. units 9. output
Which options are used for variable selection in PROC LOGISTIC?
- SHOWSELECTED - VARSELECTION
Required packages for R in open source integration node:
- XML - pmml
When specifying cluster roles for cluster analysis in SAS Visual Statistics ____.
- at least 2 measures are required - no categories/interaction terms are allowed - the K-means method is used
ROCCONTRAST (PROC LOGISTIC option)
- compares the different ROC models - one allowed per LOGISTIC
SCORE (PROC LOGISTIC option)
- creates a data set that contains: - all data in the DATA= data set - posterior probabilities - (optional) prediction confidence intervals - multiple allowed per LOGISTIC
Poisson regression analysis in SAS Visual Statistics
- default link function = log - decimal values permitted for response - predicted values given in the inverse link scale
Decision Tree: PROPERTIES (Visual Analytics)
- include missing - frequency {*count*, percent} - growth strategy {*custom*, basic, advanced, modeling} - maximum branches {*2*} - maximum levels {*6*} - leaf size {*10*} - response bins {*10*} - predictor bins {*20*} - rapid growth - pruning (lenient-->aggressive) - use default # of bins - number - prediction cutoff {*.50*} - tolerance - show diagnostic plots
What does the Informative Missingness property do in SAS Visual Statistics?
1. imputes missing values 2. creates binary indicator variables 3. includes the imputation in the generated score code
Generalized Linear Model: PROPERTIES (Visual Analytics)
- informative missingness - distribution {*normal*, beta, binary, exponential, gamma, geometric, inverse gaussian, negative binomial, poisson} - link function { *identity*(normal), *logit*(beta, binary), *log*(exponential, gamma, geometric, negative_binomial, poisson), *power(-2)*(inverse_gaussian)} - override function convergence {*0.000001*} - override gradient convergence {*0.000001*} - max iterations {*50*} - use default # of bins - number - tolerance - show diagnostic plots
Linear Regression: PROPERTIES (Visual Analytics)
- informative missingness - use variable selection - significance level {*0.10*} - use default # of bins - number - prediction cutoff - tolerance - show diagnostic plots
Logistic Regression: PROPERTIES (Visual Analytics)
- informative missingness - use variable selection - significance level {*0.10*} - link function {*logit*, probit} - override function convergence {*0.000001*} - override gradient convergence {*0.000001*} - use default # of bins - number - prediction cutoff - tolerance - show diagnostic plots
UNITS (PROC LOGISTIC option)
- lets you obtain an odds ratio estimate for a specified change in a predictor variable - unit of change can be: - number - standard deviation (SD) - multiple of the standard deviation (e.g., 2*SD)
EFFECTPLOT (PROC LOGISTIC option)
- produces a display of fitted model - gives options for changing & enhancing displays.
ODDSRATIO (PROC LOGISTIC option)
- produces odds ratios for variables - works with: - variables containing interactions w/ other covariates - classification variables using parameterization - multiple allowed per LOGISTIC
CLASS (PROC LOGISTIC option)
- specifies classification variables to be used in the analysis - must precede the MODEL statement.
ROC (PROC LOGISTIC option)
- specifies models to be used in the ROC comparisons - multiple allowed per LOGISTIC - identified by their label
MODEL (PROC LOGISTIC option)
- specifies response variable & predictor variables ^^ (can be character or numeric) - required statement - one allowed per LOGISTIC
What is the default significance level for the backward elimination method if no SLSTAY= option is set in PROC LOGISTIC?
.05
Suppose the profit margin of true positives is nine times higher than the loss margins of false positives. According to Bayes' rule, what is the cutoff that maximizes the expected profit? (Assume zero profit and loss for true negatives and false negatives.)
.10
Linear Regression: ROLES (Visual Analytics)
1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight
Logistic Regression: ROLES (Visual Analytics)
1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight
Decision Tree: ROLES (Visual Analytics)
1. response 2. predictors
Cluster: ROLES (Visual Analytics)
1. variables - requires >= 2
If the profit margin of true positives is 24 times higher than the loss margins of false positives, then according to Bayes' rule, what is the cutoff that maximizes the total expected profit? (Assume zero profit and loss for true negatives and false negatives.)
1/25
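Both profit-cutoff cards (ratios of 9 and 24) follow from the same rule: target a case when p * profit > (1 - p) * loss, which gives a cutoff of loss / (profit + loss). A minimal sketch, with a hypothetical function name and with true negatives and false negatives contributing zero as the cards assume:

```python
from fractions import Fraction

def profit_cutoff(tp_profit, fp_loss):
    """Probability cutoff above which targeting a case has positive expected profit.

    Derived from p * tp_profit > (1 - p) * fp_loss.
    """
    return Fraction(fp_loss, tp_profit + fp_loss)

print(profit_cutoff(9, 1))    # profit margin 9 times the loss margin
print(profit_cutoff(24, 1))   # profit margin 24 times the loss margin
```

Only the ratio of profit to loss matters, so `profit_cutoff(18, 2)` gives the same cutoff as `profit_cutoff(9, 1)`.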
Which of the following is the key limitation of the simple perceptron?
It can solve only linearly separable problems
Consider the following MODEL statement in PROC LOGISTIC: model ins(event='1') = SavBal | Age | IRABal @2; Which effects does it model?
SavBal, Age, IRABal, SavBal x Age, SavBal x IRABal, and Age x IRABal (The bar notation with @2 constructs a model with all the main effects and the two-factor interactions)
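The expansion of the bar notation can be imitated outside SAS. This sketch is not how PROC LOGISTIC parses the MODEL statement; it only enumerates the same set of effects, and the order in which SAS lists them may differ.

```python
from itertools import combinations

def expand_bar(variables, max_order):
    """List all main effects and interactions up to max_order (the @n limit)."""
    effects = []
    for order in range(1, max_order + 1):
        for combo in combinations(variables, order):
            effects.append("*".join(combo))
    return effects

print(expand_bar(["SavBal", "Age", "IRABal"], 2))
```

With `max_order=3` the list would also include the three-way term `SavBal*Age*IRABal`, which is what `@2` excludes.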
What is the formula for specificity? (true negative rate)
TN / (TN + FP): true negatives / total actual negatives
What is the formula for sensitivity? (true positive rate)
TP / (TP + FN): true positives / total actual positives
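The sensitivity and specificity formulas can be applied to a confusion matrix as follows. The counts are made up for illustration; the last line computes the mean of the two, the quantity an earlier card says is roughly maximized by a cutoff at the population event proportion.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts
tp, fn, tn, fp = 40, 10, 80, 20

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
mean_sens_spec = (sens + spec) / 2
print(sens, spec, mean_sens_spec)
```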
What is cumulative lift?
The ratio of the event rate among the cases selected by the model (up to a given depth) to the baseline event rate, i.e., how many times more events the model captures than would be expected by chance.
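A minimal sketch of cumulative lift at a given depth, assuming cases are ranked by predicted probability and the baseline is the overall event rate. The scores and targets are invented for illustration.

```python
def cumulative_lift(scores, targets, depth):
    """Event rate in the top `depth` fraction of ranked cases / overall event rate."""
    ranked = sorted(zip(scores, targets), key=lambda st: st[0], reverse=True)
    n_top = max(1, round(depth * len(ranked)))
    top_rate = sum(t for _, t in ranked[:n_top]) / n_top
    base_rate = sum(targets) / len(targets)
    return top_rate / base_rate

# Hypothetical scored data: 3 events among 10 cases
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
targets = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(cumulative_lift(scores, targets, 0.2))
print(cumulative_lift(scores, targets, 1.0))
```

At a depth of 100% every case is selected, so the cumulative lift always returns to 1.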
What would happen if you split the data by taking a simple random sample in PROC SURVEYSELECT? Assume that, as in the previous demonstration, you split the data into two data sets (a training data set and a validation data set) and specify a sampling rate of 0.6667.
The proportion of the events in the training data set would probably be different from the proportion of events in the validation data set.
Suppose you are using development data in which the target event cases are rare. (Assume that there are fewer than 50 events.) In this situation, what is a typical way to get a reasonably honest model assessment?
Use k-fold cross validation
Best-subsets selection in PROC LOGISTIC is relatively efficient for ____.
a small number of variables (e.g., < 50)
If the value of the offset is 3.2567, then the model corrected for oversampling has which of the following?
an intercept that is 3.2567 lower in value compared to the model fitted to the biased sample
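The intercept correction can be sketched as below. The offset 3.2567 comes from the card; the biased intercept is a made-up number, and the offset formula ln(π0·ρ1 / (π1·ρ0)), with π as population proportions and ρ as sample proportions, is the usual oversampling adjustment and is assumed here rather than stated on the card.

```python
import math

def oversampling_offset(pi1, rho1):
    """ln((pi0 * rho1) / (pi1 * rho0)); pi = population, rho = sample event proportions."""
    pi0, rho0 = 1 - pi1, 1 - rho1
    return math.log((pi0 * rho1) / (pi1 * rho0))

offset = 3.2567              # value quoted on the card
biased_intercept = -0.5      # hypothetical intercept from the oversampled fit
corrected_intercept = biased_intercept - offset

print(corrected_intercept)
print(oversampling_offset(pi1=0.02, rho1=0.5))   # hypothetical proportions
```

When the sample proportions equal the population proportions the offset is 0 and no correction is needed.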
In PROC LOGISTIC, the UNITS statement enables you to obtain ____.
an odds ratio estimate for a specified change in a predictor variable.
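What the UNITS statement reports can be reproduced by hand: for a change of c units, the odds ratio is exp(c * beta). A small sketch, reusing the 1.074 odds ratio from the income card as an assumed coefficient; the function name is illustrative.

```python
import math

def odds_ratio_for_change(beta, units):
    """Odds ratio for a `units`-sized change in a predictor with logit coefficient beta."""
    return math.exp(beta * units)

beta_per_1000 = math.log(1.074)   # coefficient implied by an odds ratio of 1.074 per $1000

print(odds_ratio_for_change(beta_per_1000, 1))   # per $1000
print(odds_ratio_for_change(beta_per_1000, 5))   # per $5000
```

The odds ratio compounds multiplicatively, so a 5-unit change corresponds to 1.074 raised to the 5th power, not 5 times 7.4%.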
If the lift value is 4 at a depth of 40%, then ____.
at a depth of 40%, there are four times more responders targeted by the model than by random chance
How is variable selection performed in PROC LOGISTIC?
backward elimination
How is the odds ratio obtained?
by exponentiating the parameter estimates
The area under the ROC curve is reported as the ____ in PROC LOGISTIC.
c statistic
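The c statistic can be read as a concordance measure: the share of (event, nonevent) pairs in which the event has the higher predicted probability, with ties counted as one half. A brute-force sketch on invented data (this is an illustration, not how PROC LOGISTIC computes it internally):

```python
def c_statistic(probs, targets):
    """(concordant pairs + 0.5 * tied pairs) / all event-nonevent pairs."""
    events = [p for p, t in zip(probs, targets) if t == 1]
    nonevents = [p for p, t in zip(probs, targets) if t == 0]
    concordant = tied = 0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1
            elif e == n:
                tied += 1
    return (concordant + 0.5 * tied) / (len(events) * len(nonevents))

# Hypothetical predicted probabilities and actual outcomes
probs = [0.9, 0.7, 0.4, 0.3]
targets = [1, 0, 1, 0]
print(c_statistic(probs, targets))
```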
When must PROC DMDB be used?
only the first time a data set is used by PROC NEURAL
Kendall Tau investigates the association between ____ of the input.
concordant & discordant pairs of ranked observations
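Counting concordant and discordant pairs directly gives the simplest form of the statistic (tau-a, which ignores ties). The data below are invented, and this sketch assumes no tied values.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs; assumes no ties in x or y."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendall_tau_a(x, y))
```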
In Zero Inflated Poisson Models predictions are conditioned on ____.
covariates for two distributions
Interactive group-by analysis can't be performed for ____ in Visual Statistics.
decision trees
A linear perceptron is a nonlinear model. (T/F)?
false
Even after training is completed, neural networks are usually slow to generate their estimates/decisions. (T/F)?
false
Linear regression models are appropriate for any continuous response variable. (T/F)?
false
Neural networks cannot model both the autoregressive (AR) and moving average (MA) components of the ARMA model. (T/F)?
false
Numerical optimization methods guarantee that the neural network will eventually find the optimal solution. (T/F)?
false
The DATA= and INMODEL= options can both be used in the same PROC LOGISTIC statement. (T/F)?
false
The HP Neural node is a multithreaded (parallel) version of the Neural Network node. (T/F)?
false
The addition of direct connections between the input and output layers (a skip-layer network) is usually beneficial. (T/F)?
false
The purpose of a surrogate model is to remove the need for a complex model by replacing it with a simpler model. (T/F)?
false ** purpose = to explain blackbox models
The purpose of weight decay and early stopping is to keep the network from falling into a bad local minimum. (T/F)?
false ** purpose = to prevent overfitting
The sequential network construction methodology that is described in this course creates a neural network consisting of multiple hidden layers. (T/F)?
false ** sequentially adds neurons to a single hidden layer
In SAS Visual Statistics, the default number of clusters is ____.
five, but you can change this to another number
The lift chart and the gains chart are the same except for ____.
how the vertical axis is scaled: GAINS = positive predictive value; LIFT = positive predictive value / marginal rate
What is the default output activation function for the multilayer perceptron when modeling a normally distributed target variable?
identity function
After creating a decision tree, what is the value in deriving a leafID variable?
it can be used in filters in other types of visualizations
Given an influence plot, ____ can be used to quantify leverage.
likelihood displacement
In the Kolmogorov-Smirnov test, what does the test statistic, D represent?
maximum vertical distance between the cumulative distributions
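The D statistic can be computed from two empirical cumulative distributions, for example the model scores of events and of nonevents. A minimal sketch with made-up samples:

```python
def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of samples a and b."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
print(ks_statistic(a, b))
```

Identical samples give D = 0, and completely separated samples give D = 1.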
The ____ measure is affected by oversampling.
positive predictive value
Measure variables can be used to ____.
rank-order the clusters
Spearman statistics investigate the association between ____ of the input.
rank-ordered values of the variables
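Spearman's statistic is the Pearson correlation computed on ranks, which is why it picks up monotonic but nonlinear relationships. A sketch that assumes no tied values (ties would need averaged ranks); the data are invented.

```python
import math

def ranks(values):
    """Rank from 1 (smallest) to n; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Pearson correlation of the rank-ordered values."""
    return pearson(ranks(x), ranks(y))

x = [10, 20, 30, 40]
y = [1, 4, 9, 16]          # monotonic but not linear in x
print(spearman(x, y), pearson(x, y))
```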
What does the 1-R^2 ratio represent in clustering?
ratio of (1 - R^2 with the variable's own cluster) to (1 - R^2 with the next closest cluster); a lower 1-R^2 ratio = higher correlation with own cluster & lower correlation with others
Which property can be set when modeling decision trees with Visual Statistics?
reuse predictors
How can sensitivity & specificity be generated in PROC LOGISTIC?
score data=valid1 outroc=roc;
Which properties can be set when performing cluster analysis with Visual Statistics?
seed & variable standardization
Hoeffding D investigates the association between ____ of the input.
weighted sum of chi-square statistics for 2x2 tables
When does quasi-complete separation occur?
when a level of the predictor has no observations in one of the response levels
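One way to screen for quasi-complete separation before fitting is to cross-tabulate a categorical predictor against the response and look for empty cells. A minimal sketch with a hypothetical helper and invented data:

```python
from collections import Counter

def levels_with_empty_cells(predictor, response):
    """Return predictor levels that have zero observations in some response level."""
    counts = Counter(zip(predictor, response))
    response_levels = sorted(set(response))
    return sorted(
        level
        for level in set(predictor)
        if any(counts[(level, r)] == 0 for r in response_levels)
    )

# Hypothetical data: level "C" never occurs with response 0
predictor = ["A", "A", "B", "B", "C", "C"]
response = [0, 1, 0, 1, 1, 1]
print(levels_with_empty_cells(predictor, response))
```

A flagged level is a candidate for collapsing with another level before the model is fit.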