SAS Certification Practice Exam: Statistical Business Analysis Using SAS 9: Regression and Modeling

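Option b expands to all main effects and interactions up to three-way, and option d expands to all main effects and two-way interactions. Option c limits the expansion to main effects with @1, so it is equivalent to option a. The correct answers are: model Purchase = Gender Age Region; model Purchase = Gender|Age|Region @1;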

Select the equivalent LOGISTIC procedure model statements. (Choose two.) Select one or more: a.model Purchase = Gender Age Region; b.model Purchase = Gender|Age|Region; c.model Purchase = Gender|Age|Region @1; d.model Purchase = Gender|Age|Region @2;
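The bar-notation expansion behind this question can be sketched in a few lines of Python. This is an illustration only, not SAS itself: `expand_bar` is a hypothetical helper, and it lists effects grouped by order, which may differ from the order in which SAS prints them.

```python
from itertools import combinations

def expand_bar(effects, max_order=None):
    """Expand A|B|C, optionally capped like @max_order, into effect names."""
    if max_order is None:
        max_order = len(effects)
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(effects, order):
            terms.append("*".join(combo))
    return terms

inputs = ["Gender", "Age", "Region"]
main_only = expand_bar(inputs, max_order=1)  # like Gender|Age|Region @1
two_way = expand_bar(inputs, max_order=2)    # like Gender|Age|Region @2
full = expand_bar(inputs)                    # like Gender|Age|Region

print(main_only)
print(two_way)
print(full)
```

With the cap set to 1, only the three main effects remain, which is why the @1 form matches the plain list of inputs.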

Option a describes Kendall's tau. Option b is a link function, not an association statistic. Option d describes Hoeffding's D. The correct answer is: rank-ordered values of the variables

Spearman statistics in the CORR procedure are useful for screening for irrelevant variables by investigating the association between which function of the input variables? Select one: a.concordant and discordant pairs of ranked observations b.logit link (log(p/(1-p))) c.rank-ordered values of the variables d.weighted sum of chi-square statistics for 2x2 tables

PROC LOGISTIC uses complete case analysis, so only the 7,000 complete records (10,000 × 0.70) will be used. The correct answer is: 7000

The LOGISTIC procedure will be used to perform a regression analysis on a data set with a total of 10,000 records. A single input variable contains 30% missing records. How many total records will be used by PROC LOGISTIC for the regression analysis? Enter an unformatted numeric answer in the space below.
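Complete case analysis (listwise deletion) can be illustrated with a small Python sketch. The data here are made up: 10,000 hypothetical records with one input, `x2`, missing for 30% of them. The helper `complete_cases` is an illustration of the idea, not how PROC LOGISTIC is implemented.

```python
# Hypothetical data: 10,000 records, one input (x2) missing for 30% of them.
n_records = 10_000
records = [
    {"y": i % 2, "x1": float(i), "x2": None if i % 10 < 3 else float(i)}
    for i in range(n_records)
]

def complete_cases(rows):
    """Keep only rows in which every variable is populated."""
    return [row for row in rows if all(value is not None for value in row.values())]

used = complete_cases(records)
print(len(used))  # 7000
```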

Variable clustering addresses redundant inputs. Redundancy among inputs has nothing to do with collapsing levels of a categorical input or with dendrograms, and switching from PROC REG to PROC GLM does not remove it. The correct answer is: Cluster the variables using PROC VARCLUS.

There are several redundant input variables in a regression application. What is a viable solution to consider? Select one: a.Cluster the variables using PROC VARCLUS. b.Collapse levels of categorical inputs using PROC CLUSTER. c.Produce a dendrogram using PROC TREE. d.Use PROC GLM rather than PROC REG.

PROC CLUSTER is used to collapse levels of a nominal predictor variable using Greenacre's method. The other procedures listed cannot be used for this. The correct answer is: CLUSTER

Which SAS procedure is used to collapse levels of a nominal predictor variable using Greenacre's method? Select one: a.FREQ b.FACTOR c.CLUSTER d.PRINCOMP

Option a is correct: PROC LOGISTIC performs complete case analysis, so even a small probability of missingness in each input can leave many observations with at least one missing value and few observations for model fitting. Option b is true as stated, but PROC MI is not relevant for pure predictive modeling, where the validity of tests of model parameters matters less than the accuracy of the model. Option c is also factually true, but artificially reducing input variance is not a goal of predictive modeling and does not necessarily improve a model; preventing the loss of data matters more. Option d is false: PROC LOGISTIC has no built-in feature for handling missing values, so it makes no assumptions such as MCAR. The correct answer is: PROC LOGISTIC performs complete case analysis, and even a few missing values per variable can cause an enormous loss of data.

Logistic regression candidate input variables have missing values. In a predictive modeling project, what is the primary reason for imputing missing values before fitting a logistic regression model to the data using PROC LOGISTIC? Select one: a.PROC LOGISTIC performs complete case analysis, and even a few missing values per variable can cause an enormous loss of data. b.Imputing missing values using PROC MI will reduce the bias of statistical tests for model parameters. c.Imputing missing values using PROC STDIZE with mean imputation will decrease the variance of inputs having missing values. d.Replacing the missing values with some reasonable value will lower the type I error for the statistical tests of significance of the predictor variables.
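A rough calculation shows how quickly complete cases disappear as inputs are added. The sketch below assumes, purely for illustration, that each input is missing independently with the same probability; real data rarely behave this neatly, and `expected_complete_fraction` is a hypothetical helper.

```python
def expected_complete_fraction(p_missing, n_inputs):
    """Expected share of fully populated rows if each input is missing
    independently with probability p_missing (an illustrative assumption)."""
    return (1.0 - p_missing) ** n_inputs

# With only 1% missing per input, the usable share shrinks as inputs are added.
for k in (1, 10, 50, 100):
    print(k, round(expected_complete_fraction(0.01, k), 3))
```

Under that assumption, 100 inputs with 1% missingness each would leave only about 37% of the observations for model fitting.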

Oversampling changes only the baseline event rate, which is reflected only in the intercept estimate. The correct answer is: Only the intercept estimate is biased.

One common approach for predicting rare events in the LOGISTIC procedure is to build a model that disproportionately over-represents those cases with an event occurring (e.g., a 50-50 event/non-event split). What problem does this present? Select one: a.All parameter estimates are biased. b.Only the intercept estimate is biased. c.Only the non-intercept parameter estimates are biased. d.Sensitivity estimates are biased.
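One common way to correct the intercept after oversampling is to subtract an offset built from the population and sample event rates. The Python sketch below uses an intercept-only model, where the fitted intercept is simply the logit of the sample event rate, so the correction can be checked by hand. The 2% population rate and 50% sample rate are assumed values, and `oversampling_offset` is a hypothetical helper name.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def oversampling_offset(pop_event_rate, sample_event_rate):
    """Offset ln((pi0 * rho1) / (pi1 * rho0)), where pi is the population
    event rate and rho is the event rate in the oversampled data."""
    pi1, rho1 = pop_event_rate, sample_event_rate
    return math.log(((1.0 - pi1) * rho1) / (pi1 * (1.0 - rho1)))

pi1 = 0.02   # assumed true event rate in the population
rho1 = 0.50  # event rate after oversampling to a 50-50 split

# For an intercept-only logistic model, the fitted intercept equals
# the logit of the event rate in the data used for fitting.
biased_intercept = logit(rho1)
corrected_intercept = biased_intercept - oversampling_offset(pi1, rho1)

print(biased_intercept, corrected_intercept, logit(pi1))
```

The corrected intercept matches the logit of the population rate; slope estimates, had there been any, would be left untouched by this adjustment.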

Training data are used to derive a model, validation data are used to fine tune a model, such as selecting a subset of inputs, and test data are used to perform final assessment of a model. Choice A is incorrect because validation data are not used to derive a model. Choice C is incorrect because the test data set is just a partition of the original modeling data and therefore has a target variable. The correct answers are: The validation data set is used to tune a model, such as selecting an appropriate set of inputs for a model. The test data set is used for final assessment of a model.

A common practice in predictive modeling is to employ training, validation, and test data sets. Which two statements correctly describe elements of this practice? (Choose two.) Select one or more: a.Comparing parameters estimated from the training data set provides evidence of model stability. b.The validation data set is used to tune a model, such as selecting an appropriate set of inputs for a model. c.The test data set is the deployment data set that contains inputs but no target variable. d.The test data set is used for final assessment of a model.

Sensitivity and specificity are not affected by separate sampling because they do not depend on the proportion of each class in the sample. PV+ and PV- do have this dependence. The correct answer is: Sensitivity and Specificity

A confusion matrix is created for data that were oversampled due to a rare target. Which values are not affected by this oversampling? Select one: a.Sensitivity and PV+ b.Specificity and PV- c.PV+ and PV- d.Sensitivity and Specificity
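The invariance can be checked numerically. In this Python sketch the counts are invented: a population with a 1% event rate, and an oversampled version that keeps every event and 1 in 99 nonevents, assuming the sampled nonevents are classified at the same error rate as the full set.

```python
def rates(tp, fn, fp, tn):
    """Row-based and column-based rates from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "pv_plus": tp / (tp + fp),
        "pv_minus": tn / (tn + fn),
    }

# Hypothetical population: 100 events, 9,900 nonevents (1% event rate).
population = rates(tp=80, fn=20, fp=990, tn=8910)

# Oversampled data: all 100 events kept, nonevents thinned to 100.
oversampled = rates(tp=80, fn=20, fp=10, tn=90)

for name in population:
    print(name, round(population[name], 4), round(oversampled[name], 4))
```

Sensitivity and specificity are computed within each actual class, so thinning one class leaves them unchanged, while PV+ and PV- mix the two classes and shift with the event rate.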

Assessment measures computed on the data used to train a model are optimistically biased. Accuracy rate = 1 - misclassification rate, but that redundancy is not the reason to avoid reporting this value for assessment. The correct answer is: It is optimistically biased since it is calculated from the data used to train the model.

An analyst builds a logistic regression model which is 75% accurate at predicting the event of interest on the training data set. The analyst presents this accuracy rate to upper management as a measure of model assessment. What is the problem with presenting this measure of accuracy for model assessment? Select one: a.This accuracy rate is redundant with the misclassification rate. b.It is pessimistically biased since it is calculated from the data set used to train the model. c.This accuracy rate is redundant with the average squared error. d.It is optimistically biased since it is calculated from the data used to train the model.

The c statistic (the area under the ROC curve) measures model performance, and higher values are better. Use the validation data, because over-fit models inflate the c statistic on the training data. The correct answer is: The model had the highest c statistic on the validation data.

An analyst compared many different models to predict the binary Purchase variable and selected one particular model. What rationale supports this decision? Select one: a.The model had the highest c statistic on the training data. b.The model had the highest c statistic on the validation data. c.The model had the lowest c statistic on the training data. d.The model had the lowest c statistic on the validation data.
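The c statistic can be computed directly from event/nonevent pairs as the share of concordant pairs plus half the tied pairs. The Python sketch below uses a tiny invented validation set; `c_statistic` is a hypothetical helper written for clarity, not speed.

```python
def c_statistic(targets, scores):
    """Share of (event, nonevent) pairs in which the event has the higher
    predicted probability, counting ties as one half."""
    events = [s for t, s in zip(targets, scores) if t == 1]
    nonevents = [s for t, s in zip(targets, scores) if t == 0]
    concordant = 0
    ties = 0
    for e in events:
        for n in nonevents:
            if e > n:
                concordant += 1
            elif e == n:
                ties += 1
    return (concordant + 0.5 * ties) / (len(events) * len(nonevents))

# Hypothetical validation targets and predicted probabilities.
valid_targets = [1, 0, 1, 0, 0, 1]
valid_scores = [0.9, 0.2, 0.6, 0.6, 0.1, 0.4]

c_valid = c_statistic(valid_targets, valid_scores)
print(round(c_valid, 4))
```

Here 7 of the 9 pairs are concordant and 1 is tied, giving 7.5 / 9. Comparing candidate models on this quantity is only honest when the scores come from data not used for fitting.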

data=valid1 is needed because the statistics should come from the validation data set. The correct option on the SCORE statement is outroc=. The correct answer is: score data=valid1 outroc=roc;

An analyst generates a model using the LOGISTIC procedure. They are now interested in getting the sensitivity and specificity statistics on a validation data set for a variety of cutoff values. Which statement and option combination will generate these statistics? Select one: a.score data=valid1 out=roc; b.score data=valid1 outroc=roc; c.model resp(event='1') = gender region / outroc=roc; d.model resp(event='1') = gender region / out=roc;

Quasi-complete separation occurs when a single level of a categorical input has a 0% or 100% rate for the event of interest. Collinearity concerns correlation between inputs, and influential observations are individual records that affect the parameter estimates. The correct answer is: quasi-complete separation

An analyst investigates Region (A, B, or C) as an input variable in a logistic regression model. The analyst discovers that the probability of purchasing a certain item when Region = A is 1. What problem does this illustrate? Select one: a.collinearity b.influential observations c.quasi-complete separation d.problems that arise due to missing values
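A quick screen for this situation is to tabulate the event rate within each level of the categorical input and flag any level at exactly 0 or 1. The Python sketch below uses invented Region and Purchase values, and `event_rate_by_level` is a hypothetical helper.

```python
from collections import defaultdict

def event_rate_by_level(levels, targets):
    """Event rate within each level of a categorical input."""
    counts = defaultdict(lambda: [0, 0])  # level -> [events, total]
    for level, target in zip(levels, targets):
        counts[level][0] += target
        counts[level][1] += 1
    return {level: events / total for level, (events, total) in counts.items()}

# Hypothetical data: every Region A customer purchased.
region = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
purchase = [1, 1, 1, 1, 0, 0, 0, 1, 0]

level_rates = event_rate_by_level(region, purchase)
flagged = [level for level, rate in level_rates.items() if rate in (0.0, 1.0)]
print(level_rates)
print(flagged)
```

A flagged level signals that the maximum likelihood estimate for that level may not exist, which is the practical symptom of quasi-complete separation.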

The interaction effect should be listed in the LSMEANS statement, and the SLICE= option should name the predictor whose levels we are looking within. The correct answer is: lsmeans Gender*Income / slice=Income;

An analyst, using the GLM procedure, determines that there is a significant interaction between two categorical predictors: -income (Low, Medium, High) -gender (M, F) The analyst is interested in testing the effect of gender within each level of income. Which GLM procedure statement will generate these tests? Select one: a.lsmeans Gender*Income /slice=Gender; b.lsmeans Income / slice=Gender; c.lsmeans Gender*Income / slice=Income; d.lsmeans Gender / slice=Income;

The test of equal variance (Folded F test) is produced by default whenever a two-sample t-test is requested. A two-sample t-test is requested using the CLASS statement in PROC TTEST. The correct answer is: Use a CLASS statement.

How do you get PROC TTEST to display the test for equal variance? Select one: a.Use the option EV. b.Use the MEANS statement with a HOVTEST option. c.Request a plot of the residuals. d.Use a CLASS statement.

Honest assessment requires at least a training data set and a validation data set, so splitting the data into both is necessary. The correct answer is: Training: 50% Validation: 50% Testing: 0%

In order to perform honest assessment on a predictive model, what is an acceptable division between training, validation, and testing data? Select one: a.Training: 50% Validation: 0% Testing: 50% b.Training: 100% Validation: 0% Testing: 0% c.Training: 0% Validation: 100% Testing: 0% d.Training: 50% Validation: 50% Testing: 0%

Members of the training and validation data sets should be mutually exclusive, so sampling with replacement is unacceptable. The correct answers are: simple random sampling without replacement, stratified random sampling without replacement

In partitioning data for model assessment, which two sampling methods are acceptable? (Choose two.) Select one or more: a.simple random sampling without replacement b.simple random sampling with replacement c.stratified random sampling without replacement d.sequential random sampling with replacement
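Stratified sampling without replacement can be sketched as shuffling each target class once and cutting it at the desired fraction, so that every record lands in exactly one partition. The Python below is an illustration with invented data; `stratified_split`, the 50% fraction, and the seed are all assumptions.

```python
import random

def stratified_split(rows, target_key, train_fraction=0.5, seed=42):
    """Split rows into training and validation sets without replacement,
    preserving the target proportions within each partition."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(row[target_key], []).append(row)
    train, valid = [], []
    for members in strata.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # each record is placed exactly once
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return train, valid

# Hypothetical data: 200 records with a 10% event rate.
rows = [{"id": i, "purchase": 1 if i < 20 else 0} for i in range(200)]
train, valid = stratified_split(rows, "purchase")

print(len(train), len(valid))
print(sum(r["purchase"] for r in train), sum(r["purchase"] for r in valid))
```

Because each stratum is shuffled and then sliced, no record can appear in both partitions, and both keep the original 10% event rate.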

If the target value is known for the new observations, concatenating them to the training data and rerunning the LOGISTIC procedure is NOT appropriate, because the new observations would then contribute to the parameter estimates. The correct answer is: Concatenate the new observations data set to the training data and then use the SCORE statement in the LOGISTIC procedure.

Which method is NOT an appropriate way to score new observations with a known target in a logistic regression model? Select one: a.Concatenate the new observations data set to the training data and then use the SCORE statement in the LOGISTIC procedure. b.Use the CODE statement to output code and then use a DATA step to score the new observations data set with that code. c.Use the saved parameter estimates from the LOGISTIC and then use PROC PLM to score the new observations data set. d.Run PROC LOGISTIC first on the training data with an OUTMODEL= option and then again on the new observations data set with an INMODEL= option.

The one-way ANOVA model states that the dependent variable is equal to a within-group population mean (mu_i) plus a deviation from the population mean. The within-group population mean (mu_i) is estimated by using the within-group sample mean (ybar_i). The correct answer is: within-group sample means

What are the "predicted values" that result from fitting a one-way analysis of variance (ANOVA) model? Select one: a.within-group sample variances b.between-group sample variances c.within-group sample means d.between-group mean differences
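The point can be seen numerically: in a one-way ANOVA, each observation's predicted value is the sample mean of its own group. The Python sketch below uses made-up values for three groups, and `anova_predictions` is a hypothetical helper.

```python
from statistics import mean

def anova_predictions(groups, values):
    """Predicted value for each observation under a one-way ANOVA model:
    the sample mean of the group the observation belongs to."""
    group_values = {}
    for g, v in zip(groups, values):
        group_values.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in group_values.items()}
    return [group_means[g] for g in groups]

# Hypothetical responses for three groups.
groups = ["Low", "Low", "Medium", "Medium", "High", "High"]
values = [10.0, 14.0, 20.0, 22.0, 31.0, 35.0]

predicted = anova_predictions(groups, values)
residuals = [v - p for v, p in zip(values, predicted)]
print(predicted)
print(residuals)
```

The residuals are the deviations from the within-group means, and they sum to zero within every group.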

If a variable ranks poorly on Spearman (coefficient close to 0) but ranks highly on Hoeffding's D, the association is nonmonotonic, which also makes it nonlinear. The correct answer is: nonlinear and nonmonotonic association between two variables

What does a high Hoeffding's D correlation statistic and a low Spearman's rank correlation statistic indicate? Select one: a.linear and monotonic association between two variables b.nonlinear and monotonic association between two variables c.nonlinear and nonmonotonic association between two variables d.linear and nonmonotonic association between two variables
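The Spearman half of this pattern is easy to demonstrate. The Python sketch below computes Spearman's coefficient as the Pearson correlation of average ranks and applies it to an invented U-shaped relationship and a monotone one. Hoeffding's D is not computed here; the helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-ordered values."""
    return pearson(average_ranks(x), average_ranks(y))

x = [-3, -2, -1, 0, 1, 2, 3]
u_shaped = [v * v for v in x]   # depends on x completely, but not monotonically
monotone = [v ** 3 for v in x]  # nonlinear but monotonic

print(spearman(x, u_shaped))
print(spearman(x, monotone))
```

The U-shaped variable is fully determined by x yet scores zero on Spearman, which is the kind of input a rank-based screen would discard and a more general dependence measure would keep.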

The effectiveness of different cleansing methods can be compared only if the cleansing is done after the data are partitioned. The correct answer is: There is no ability to compare the effectiveness of different cleansing methods.

What is a drawback to performing data cleansing (imputation, transformations, etc.) on raw data prior to partitioning the data for honest assessment as opposed to performing the data cleansing after partitioning the data? Select one: a.It violates assumptions of the model. b.It requires extra computational effort and time. c.It omits the training (and test) data sets from the benefits of the cleansing methods. d.There is no ability to compare the effectiveness of different cleansing methods.

By default, PROC LOGISTIC excludes observations with missing data. The correct answer is: Only cases with variables that are fully populated are used.

What is the default method in the LOGISTIC procedure to handle observations with missing data? Select one: a.Missing values are imputed. b.Parameters are estimated accounting for the missing values. c.Parameter estimates are made on all available data. d.Only cases with variables that are fully populated are used.

The ILINK option transforms the estimate of the response (the dependent variable) from the logit scale back into a predicted probability; the parameter estimates remain on the logit scale. The logit, or natural log of the odds, is the link function for logistic regression, and ILINK is short for inverse link. The correct answer is: To transform the estimate of the response from the logit scale back to predicted probabilities.

What is the general purpose of the ILINK option in PROC LOGISTIC? Select one: a. To transform the parameter estimates of the model from the logit scale back to predicted probabilities. b. To transform the estimate of the response from the logit scale back to predicted probabilities. c. To transform the parameter estimates of the model from the logit scale back to odds ratios. d. To transform the estimate of the response from the logit scale back to the odds ratios.
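The inverse link itself is a one-line formula, p = 1 / (1 + exp(-eta)). The Python sketch below applies it to a linear predictor built from invented parameter estimates; the intercept, slope, and input value are assumptions chosen only to illustrate the transformation.

```python
import math

def logit(p):
    """Link function: natural log of the odds."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter estimates, which stay on the logit scale.
intercept, slope = -2.0, 0.8

eta = intercept + slope * 3.0     # estimated response on the logit scale
probability = inverse_logit(eta)  # same estimate as a predicted probability

print(eta, round(probability, 4))
```

Only the estimated response is transformed; the intercept and slope are unchanged, and applying the logit to the probability recovers the original linear predictor.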

