SAS Advanced Analytics Exam 2 Questions


The odds ratio for a $1000 increase in income is 1.074. What does this mean for every $1000 increase in income?

The odds of the event increase 7.4%
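
As a quick check of the arithmetic, the percent change in the odds is (odds ratio - 1) x 100. A minimal Python sketch; the function name is illustrative and not part of the course material:

```python
import math

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

odds_ratio = 1.074  # value from the card, per $1000 of income
change = percent_change_in_odds(odds_ratio)
coefficient = math.log(odds_ratio)  # implied logistic coefficient per $1000
print(f"{change:.1f}% increase in odds; coefficient = {coefficient:.4f}")
```

An odds ratio below 1 would give a negative value, that is, a decrease in the odds.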

In the HP Principal Components Node, what is analysis performed on by default?

the correlation matrix

In data mining, a cutoff of the proportion of events in the population tends to maximize ____.

the mean of sensitivity and specificity

What do the priors, π0 and π1, represent regarding adjustments for oversampling?

the population proportions of class 0 and 1 respectively

The logit transformation transforms the probability scale to ____.

the real line (-∞, +∞)
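
To make the mapping concrete, here is a small sketch of the logit and its inverse. The function names are chosen for illustration only:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a real number back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Probabilities near 0 map to large negative values, near 1 to large positive.
for p in (0.01, 0.5, 0.99):
    print(p, round(logit(p), 4))
```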

How should validation data be prepared for scoring?

the same way that the training data was prepared for model building

In PROC LOGISTIC, models selected in the best-subsets selection method are ranked by ____

the score chi-square statistic

HP data mining nodes (e.g., HP Neural) employ ____ to speed up processing.

threaded kernel extensions.

What is the SELECT= option used for in PROC LOGISTIC?

to select ODS tables for display

What is the TECHNIQUE= option used for in PROC LOGISTIC?

to specify the optimization algorithm

All R library statements must remain in the Open Source Integration node script. (T/F)?

true

In theory, a polynomial regression model of sufficient complexity is a universal approximator. (T/F)?

true

Minimizing deviance is equivalent to maximizing likelihood when the target is a member of the exponential family. (T/F)?

true

Pruning the input with the smallest input-to-hidden weight can negatively impact the neural network's performance. (T/F)?

true

Radial basis functions train much faster and are less likely to fall into local minima than a multilayer perceptron. (T/F)?

true

The OUTMODEL= and INMODEL= options must be in separate PROC LOGISTIC programs. (T/F)?

true

You can perform logistic regression analysis with a generalized linear model. (T/F)?

true

Which radial basis combination function generates the most flexible networks?

unequal height & width

How can you increase the smoothing of an empirical logit plot?

use a small number of bins with many observations per bin

Cluster: PROPERTIES (Visual Analytics)

- # of clusters {*5*}
- seed {*1234*}
- initial assignment {*forgy*, random}
- visible roles {*5*}
- variable standardization
- # of bins
- max polylines
- show ellipses
- show centroids

Required variable handles for R in open source integration node:

- &EMR_MODEL - &EMR_IMPORT_DATA

Generalized Linear Model: ROLES (Visual Analytics)

1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight

PROC LOGISTIC statement order

1. class 2. effectplot 3. model 4. oddsratio 5. roc 6. roccontrast 7. score 8. units 9. output

Which options are used for variable selection in PROC LOGISTIC?

- SHOWSELECTED - VARSELECTION

Required packages for R in open source integration node:

- XML - pmml

When specifying cluster roles for cluster analysis in SAS Visual Statistics ____.

- at least 2 measures are required - no categories/interaction terms are allowed - the K-means method is used

ROCCONTRAST (PROC LOGISTIC statement)

- compares the different ROC models - one allowed per LOGISTIC

SCORE (PROC LOGISTIC statement)

- creates a data set that contains:
  - all data in the DATA= data set
  - posterior probabilities
  - (optional) prediction confidence intervals
- multiple allowed per LOGISTIC

Poisson regression analysis in SAS Visual Statistics

- default link function = log - decimal values permitted for response - predicted values given in the inverse link scale

Decision Tree: PROPERTIES (Visual Analytics)

- include missing
- frequency {*count*, percent}
- growth strategy {*custom*, basic, advanced, modeling}
- maximum branches {*2*}
- maximum levels {*6*}
- leaf size {*10*}
- response bins {*10*}
- predictor bins {*20*}
- rapid growth
- pruning (lenient --> aggressive)
- use default # of bins
- number
- prediction cutoff {*.50*}
- tolerance
- show diagnostic plots

What does the Informative Missingness property do in SAS Visual Statistics?

1. imputes missing values 2. creates binary indicator variables 3. includes the imputation in the generated score code

Generalized Linear Model: PROPERTIES (Visual Analytics)

- informative missingness
- distribution {*normal*, beta, binary, exponential, gamma, geometric, inverse gaussian, negative binomial, poisson}
- link function (default by distribution):
  - *identity* (normal)
  - *logit* (beta, binary)
  - *log* (exponential, gamma, geometric, negative binomial, poisson)
  - *power(-2)* (inverse gaussian)
- override function convergence {*0.000001*}
- override gradient convergence {*0.000001*}
- max iterations {*50*}
- use default # of bins
- number
- tolerance
- show diagnostic plots

Linear Regression: PROPERTIES (Visual Analytics)

- informative missingness - use variable selection - significance level {*0.10*} - use default # of bins - number - prediction cutoff - tolerance - show diagnostic plots

Logistic Regression: PROPERTIES (Visual Analytics)

- informative missingness - use variable selection - significance level {*0.10*} - link function {*logit*, probit} - override function convergence {*0.000001*} - override gradient convergence {*0.000001*} - use default # of bins - number - prediction cutoff - tolerance - show diagnostic plots

UNITS (PROC LOGISTIC statement)

- lets you obtain an odds ratio estimate for a specified change in a predictor variable
- unit of change can be:
  - a number
  - a standard deviation (SD)
  - a number times the standard deviation (e.g., 2*SD)

EFFECTPLOT (PROC LOGISTIC statement)

- produces a display of fitted model - gives options for changing & enhancing displays.

ODDSRATIO (PROC LOGISTIC statement)

- produces odds ratios for variables
- works with:
  - variables involved in interactions with other covariates
  - classification variables that use any parameterization
- multiple allowed per LOGISTIC

CLASS (PROC LOGISTIC statement)

- specifies classification variables to be used in the analysis - must precede the MODEL statement.

ROC (PROC LOGISTIC statement)

- specifies models to be used in the ROC comparisons - multiple allowed per LOGISTIC - identified by their label

MODEL (PROC LOGISTIC statement)

- specifies the response variable & predictor variables (can be character or numeric)
- required statement
- one allowed per LOGISTIC

What is the default significance level for the backward elimination method if no SLSTAY= option is set in PROC LOGISTIC?

.05

Suppose the profit margin of true positives is nine times higher than the loss margins of false positives. According to Bayes' rule, what is the cutoff that maximizes the expected profit? (Assume zero profit and loss for true negatives and false negatives.)

.10

Linear Regression: ROLES (Visual Analytics)

1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight

Logistic Regression: ROLES (Visual Analytics)

1. response 2. continuous effects 3. classification effects 4. interaction effects 5. group by 6. frequency 7. offset 8. weight

Decision Tree: ROLES (Visual Analytics)

1. response 2. predictors

Cluster: ROLES (Visual Analytics)

1. variables - requires >= 2

If the profit margin of true positives is 24 times higher than the loss margins of false positives, then according to Bayes' rule, what is the cutoff that maximizes the total expected profit? (Assume zero profit and loss for true negatives and false negatives.)

1/25
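
Both Bayes-rule cutoff cards (profit-to-loss ratios of 9 and 24) follow from the same inequality: target a case when p x profit > (1 - p) x loss, which rearranges to p > 1 / (1 + profit/loss). A minimal sketch, assuming zero profit and loss for true and false negatives as the cards state; the function name is illustrative:

```python
from fractions import Fraction

def profit_cutoff(profit_to_loss_ratio):
    """Cutoff above which targeting a case has positive expected profit.

    Target when p * profit > (1 - p) * loss, i.e. p > 1 / (1 + profit/loss).
    """
    return Fraction(1, 1) / (1 + Fraction(profit_to_loss_ratio))

print(profit_cutoff(9))   # 1/10
print(profit_cutoff(24))  # 1/25
```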

Which of the following is the key limitation of the simple perceptron?

It can solve only linearly separable problems

Consider the following MODEL statement in PROC LOGISTIC: model ins(event='1') = SavBal | Age | IRABal @2; Which effects does it model?

SavBal, Age, IRABal, SavBal x Age, SavBal x IRABal, and Age x IRABal (The bar notation with @2 constructs a model with all the main effects and the two-factor interactions)
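
The expansion can be reproduced by enumerating every combination of up to two variables. This Python sketch only mimics the effect list described above; it is not how SAS parses the statement, and the function name is hypothetical:

```python
from itertools import combinations

def expand_bar_notation(variables, max_order):
    """List the effects of A | B | ... @max_order: all main effects plus
    interactions involving up to max_order variables."""
    effects = []
    for order in range(1, max_order + 1):
        for combo in combinations(variables, order):
            effects.append("*".join(combo))
    return effects

effects = expand_bar_notation(["SavBal", "Age", "IRABal"], 2)
print(effects)
```

With max_order set to 3, the three-way term SavBal*Age*IRABal would also appear.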

What is the formula for specificity? (true negative rate)

TN / (TN + FP), i.e., true negatives / total actual negatives

What is the formula for sensitivity? (true positive rate)

TP / (TP + FN), i.e., true positives / total actual positives
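
The specificity and sensitivity formulas can be checked on a small confusion matrix. The counts below are made up for illustration:

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts, for illustration only.
tp, fn, tn, fp = 40, 10, 900, 50
print(sensitivity(tp, fn))
print(specificity(tn, fp))
```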

What is cumulative lift?

The ratio of the event rate among the cases selected by the model (up to a given depth) to the baseline event rate, i.e., how many times more events the model captures than would be expected by chance.

What would happen if you split the data by taking a simple random sample in PROC SURVEYSELECT? Assume that, as in the previous demonstration, you split the data into two data sets (a training data set and a validation data set) and specify a sampling rate of 0.6667.

The proportion of the events in the training data set would probably be different from the proportion of events in the validation data set.

Suppose you are using development data in which the target event cases are rare. (Assume that there are fewer than 50 events.) In this situation, what is a typical way to get a reasonably honest model assessment?

Use k-fold cross validation
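
A rough sketch of how k-fold splitting works: every case is used for validation exactly once and for training in the other k - 1 folds. The function name and seed are assumptions for illustration. In practice, with rare events, you would probably also want to stratify the folds by the target, which this sketch does not do:

```python
import random

def k_fold_indices(n_cases, k, seed=1234):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

for training, validation in k_fold_indices(10, 5):
    print(len(training), len(validation))
```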

Best-subsets selection in PROC LOGISTIC is relatively efficient for ____.

a small number of variables (e.g., < 50)

If the value of the offset is 3.2567, then the model corrected for oversampling has which of the following?

an intercept that is 3.2567 lower in value compared to the model fitted to the biased sample
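
For reference, the offset is usually computed as ln((π0 x ρ1) / (π1 x ρ0)), where π1 and ρ1 are the event proportions in the population and in the oversampled data. The sketch below uses made-up proportions and a made-up intercept, so the numbers are not those of the card:

```python
import math

def oversampling_offset(pi1, rho1):
    """Offset ln((pi0 * rho1) / (pi1 * rho0)) for a sample with event rate rho1
    drawn from a population with event rate pi1."""
    pi0, rho0 = 1.0 - pi1, 1.0 - rho1
    return math.log((pi0 * rho1) / (pi1 * rho0))

# Hypothetical values: 2% events in the population, 35% in the oversampled data.
offset = oversampling_offset(pi1=0.02, rho1=0.35)
biased_intercept = -0.5  # hypothetical intercept fitted to the biased sample
corrected_intercept = biased_intercept - offset
print(round(offset, 4), round(corrected_intercept, 4))
```

When the sample and population proportions match, the offset is zero and no correction is needed.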

In PROC LOGISTIC, the UNITS statement enables you to obtain ____.

an odds ratio estimate for a specified change in a predictor variable.
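
Numerically, the odds ratio for a change of c units is exp(c x β). A small sketch with a hypothetical per-dollar coefficient, chosen so that a $1000 change reproduces an odds ratio of 1.074:

```python
import math

def odds_ratio_for_change(coefficient, units):
    """Odds ratio for a change of `units` in a predictor: exp(units * beta)."""
    return math.exp(units * coefficient)

# Hypothetical coefficient per $1 of income, chosen so that $1000 gives 1.074.
beta_per_dollar = math.log(1.074) / 1000
print(round(odds_ratio_for_change(beta_per_dollar, 1), 6))
print(round(odds_ratio_for_change(beta_per_dollar, 1000), 3))
```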

If the lift value is 4 at a depth of 40%, then ____.

at a depth of 40%, there are four times more responders targeted by the model than by random chance

How is variable selection performed in PROC LOGISTIC?

backward elimination

How is the odds ratio obtained?

by exponentiating the parameter estimates

The area under the ROC curve is reported as the ____ in PROC LOGISTIC.

c statistic

When must PROC DMDB be used?

only the first time a data set is used by PROC NEURAL

Kendall Tau investigates the association between ____ of the input.

concordant & discordant pairs of ranked observations
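
To illustrate the idea of concordant and discordant pairs, here is a sketch of the simplest version of the statistic (tau-a). Statistical software typically reports a tie-adjusted variant, so results can differ when the data contain ties:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables order the pair the same way
        elif product < 0:
            discordant += 1      # the variables order the pair oppositely
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up data: five concordant pairs and one discordant pair out of six.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(round(tau, 4))
```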

In Zero Inflated Poisson Models predictions are conditioned on ____.

covariates for two distributions

Interactive group-by analysis can't be performed for ____ in SAS Visual Statistics.

decision trees

A linear perceptron is a nonlinear model. (T/F)?

false

Even after training is completed, neural networks are usually slow to generate their estimates/decisions. (T/F)?

false

Linear regression models are appropriate for any continuous response variable. (T/F)?

false

Neural networks cannot model both the autoregressive (AR) and moving average (MA) components of the ARMA model. (T/F)?

false

Numerical optimization methods guarantee that the neural network will eventually find the optimal solution. (T/F)?

false

The DATA= and INMODEL= options can both be used in the same PROC LOGISTIC statement. (T/F)?

false

The HP Neural node is a multithreaded (parallel) version of the Neural Network node. (T/F)?

false

The addition of direct connections between the input and output layers (a skip-layer network) is usually beneficial. (T/F)?

false

The purpose of a surrogate model is to remove the need for a complex model by replacing it with a simpler model. (T/F)?

false (the purpose is to explain black-box models)

The purpose of weight decay and early stopping is to keep the network from falling into a bad local minimum. (T/F)?

false (the purpose is to prevent overfitting)

The sequential network construction methodology that is described in this course creates a neural network consisting of multiple hidden layers. (T/F)?

false (it sequentially adds neurons to a single hidden layer)

In SAS Visual Statistics, the default number of clusters is ____.

five, but you can change this to another number

The lift chart and the gains chart are the same except for ____.

how the vertical axis is scaled (GAINS = positive predictive value; LIFT = positive predictive value / marginal rate)
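
The relationship between the two vertical-axis scalings can be sketched with a few counts. The numbers and the function name below are hypothetical:

```python
def gains_and_lift(events_in_top, cases_in_top, total_events, total_cases):
    """Return (positive predictive value, lift) for the cases above a depth."""
    ppv = events_in_top / cases_in_top          # gains-chart value
    marginal_rate = total_events / total_cases  # baseline event rate
    return ppv, ppv / marginal_rate

# Hypothetical counts: the top 20% of 1000 scored cases holds 60 of 100 events.
ppv, lift = gains_and_lift(60, 200, 100, 1000)
print(round(ppv, 2), round(lift, 2))
```

Because lift is just the gains value divided by a constant, the two charts have the same shape and differ only in scale.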

What is the default output activation function for the multilayer perceptron when modeling a normally distributed target variable?

identity function

After creating a decision tree, what is the value in deriving a leafID variable?

it can be used in filters in other types of visualizations

Given an influence plot, ____ can be used to quantify leverage.

likelihood displacement

In the Kolmogorov-Smirnov test, what does the test statistic, D, represent?

maximum vertical distance between the cumulative distributions
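
A minimal sketch of the D statistic, computed directly from two empirical cumulative distributions. The scores are invented for illustration, and this covers only the statistic itself, not the p-value:

```python
def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between two empirical CDFs."""
    def ecdf(sample, x):
        # Fraction of the sample at or below x.
        return sum(value <= x for value in sample) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

# Hypothetical predicted probabilities for events and non-events.
events = [0.6, 0.7, 0.8, 0.9]
non_events = [0.1, 0.2, 0.3, 0.7]
print(ks_statistic(events, non_events))
```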

The ____ measure is affected by oversampling.

positive predictive value

Measure variables can be used to ____.

rank-order the clusters

Spearman statistics investigate the association between ____ of the input.

rank-ordered values of the variables

What does the 1-R^2 ratio represent in clustering?

the ratio of a variable's correlation with its own cluster versus the other clusters; a lower 1-R^2 ratio = higher correlation with its own cluster & lower correlation with the others

Which property can be set when modeling decision trees with Visual Statistics?

reuse predictors

How can sensitivity & specificity be generated in PROC LOGISTIC?

score data=valid1 outroc=roc;

Which properties can be set when performing cluster analysis with Visual Statistics?

seed & variable standardization

Hoeffding D investigates the association between ____ of the input.

weighted sum of chi-square statistics for 2x2 tables

When does quasi-complete separation occur?

when a level of the predictor has no observations in one of the response levels
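
One way to spot this before fitting is to cross-tabulate the predictor against the response and look for empty cells. The sketch below uses made-up data and a hypothetical helper name:

```python
from collections import Counter

def levels_with_empty_cells(predictor, response):
    """Predictor levels that have zero observations in some response level."""
    counts = Counter(zip(predictor, response))
    response_levels = set(response)
    return sorted(
        level for level in set(predictor)
        if any(counts[(level, r)] == 0 for r in response_levels)
    )

# Made-up data: level "C" never occurs with response 0.
predictor = ["A", "A", "B", "B", "C", "C"]
response = [0, 1, 0, 1, 1, 1]
print(levels_with_empty_cells(predictor, response))
```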

