SAS Statistics


95% You want to be as confident as possible, but if you increase the confidence level too much, the confidence bounds widen toward negative and positive infinity.

A 95% confidence interval represents a range of values within which you are _______ certain that the true population mean exists. A. 5% B. 95%

Gaussian

A Normal Distribution bell curve is also known as a ___________ distribution? A. Gaussian B. Expected

One-Sample

A _____________ t-test compares the mean calculated from a sample to a hypothesized mean. The null hypothesis of the test is generally that the difference between the two means is zero. A. One-Sample B. Two-Sample

Straight

A linear association between two continuous variables can be inferred when the general shape of a scatter plot of the two variables is a __________ line. A. Straight B. Curved

Model 1 Models 1 and 3 are better than Model 2 because they have lower values of AIC and SC. Of these two, Model 1 has the higher c statistic, so it is the best of the three models. Review: Comparing the Binary and Multiple Logistic Regression Models, Fitting a Binary Logistic Regression Model

According to the goodness-of-fit statistics shown below, which multiple logistic regression model would be the best to use?
Statistic  Model 1  Model 2  Model 3
AIC        501.5    520.4    501.5
SC         501.5    520.4    501.5
c          0.675    0.675    0.655
a. Model 1 b. Model 2 c. Model 3

Adj R-Sq

An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale? Adj R-Sq R-Square Error DF Coeff Var

the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false Power is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis. The probability of committing a Type I error is α. Failing to reject the null hypothesis when it is actually false is a Type II error; its probability is β, and power equals 1 - β.

How do you define the term power? a. the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false b. the probability of committing a Type I error c. the probability of failing to reject the null hypothesis when it is actually false

Use a CLASS statement.

How do you get PROC TTEST to display the test for equal variance? Use the option EV. Request a plot of the residuals. Use a CLASS statement. Use the MEANS statement with a HOVTEST option.

CLASS Statement

How do you tell PROC TTEST that you want to do a two-sample t-test? a.SAMPLE=2 option b.CLASS statement c.GROUPS=2 option d.PAIRED statement
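
A minimal sketch of a two-sample t-test, assuming the sample data set sashelp.class that ships with SAS (the data set and variable names are illustrative and not part of the original cards). The CLASS statement names the two-level grouping variable, and the output then also includes the Equality of Variances (folded F) table.

```sas
proc ttest data=sashelp.class;
   class Sex;      /* two-level grouping variable: requests a two-sample test */
   var Height;     /* continuous analysis variable */
run;
```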

5 In Mallows' Cp criterion, p equals the number of variables in the model plus 1 for the intercept. Therefore, for these models, p equals 8, 9, or 10, depending on the number of terms in the model. All the C(p) values are less than their respective p values, so all five models meet Mallows' Cp criterion. Review: Evaluating Models Using Mallows' Cp Statistic, Viewing Mallows' Cp Statistic in PROC REG, The REG Procedure: Using the All-Possible Regressions Technique, The REG Procedure: Using Automatic Model Selection

How many of the following models meet Mallows' Cp criterion for model selection?
Model Index  Number in Model  C(p)    R-Square  Variables in Model
1            7                5.8653  0.7445    Age Weight Neck Abdomen Thigh Forearm Wrist
2            8                5.8986  0.7466    Age Weight Neck Abdomen Hip Thigh Forearm Wrist
3            8                6.4929  0.7459    Age Weight Neck Abdomen Thigh Biceps Forearm Wrist
4            9                6.7834  0.7477    Age Weight Neck Abdomen Hip Thigh Biceps Forearm Wrist
5            7                6.9017  0.7434    Age Weight Neck Abdomen Biceps Forearm Wrist
a. 0 b. 1 c. 3
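
The comparison in the answer can be checked by hand or with a short DATA step. This is only a sketch: the data set name cp_check and the variable names are made up for illustration, and the values are copied from the table in the question. The criterion applied is C(p) <= p, where p is the number of variables plus 1 for the intercept.

```sas
data work.cp_check;
   input Model NumInModel Cp;
   p = NumInModel + 1;              /* variables in the model plus the intercept */
   MeetsCriterion = (Cp <= p);      /* 1 if Mallows' criterion is met, else 0 */
   datalines;
1 7 5.8653
2 8 5.8986
3 8 6.4929
4 9 6.7834
5 7 6.9017
;
run;

proc print data=work.cp_check noobs;
run;
```

Every row has C(p) below its p (8, 9, 9, 10, and 8), which is why all five models meet the criterion.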

D. proc reg data=SASUSER.MLR; model y = x1-x4; run;

Identify the correct SAS program for fitting a multiple linear regression model with dependent variable (y) and four predictor variables (x1-x4). A. proc reg data=SASUSER.MLR; model y = x1 x2 x3 x4 /solution; run; B. proc reg data=SASUSER.MLR; model y = x1; model y = x2; model y = x3; model y = x4; run; C. proc reg data=SASUSER.MLR; var y x1 x2 x3 x4; model y = x1-x4; run; D. proc reg data=SASUSER.MLR; model y = x1-x4; run;

Report the F value and possibly remove the blocking factor from future studies. Your only choice is to report the F value, and if you plan future studies, do not include the blocking variable. The blocking factor must be included in all ANOVA models that you calculate with the sample that you've already collected. Review: Performing ANOVA with Blocking

If your blocking variable has a very small F value in the ANOVA report, what would be a valid next step? a. Remove it from the MODEL statement and re-run the analysis. b. Test an interaction term. c. Report the F value and possibly remove the blocking factor from future studies.

UNIVARIATE

PROC _____________ DATA=SAS-data-set <options>; VAR variables; HISTOGRAM variables </ options>; INSET keywords </ options>; RUN; A. FREQ B. UNIVARIATE
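
A filled-in sketch of this syntax template, assuming the sample data set sashelp.class (an illustrative choice, not from the original cards). It combines options that other cards in this set describe: a fitted normal curve, a kernel density overlay, and an inset placed in the northeast corner.

```sas
ods graphics on;

proc univariate data=sashelp.class;
   var Height;
   histogram Height / normal(mu=est sigma=est) kernel;  /* normal fit and kernel density overlays */
   inset n mean std / position=ne;                      /* summary statistics in the upper-right corner */
run;
```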

FREQ PROC FREQ can generate large volumes of output as the number of variables or the number of variable levels (or both) increases.

PROC _____________ DATA=SAS-data-set; TABLES table-requests </ options>; RUN; A. FREQ B. UNIVARIATE

Diagnostics

Produces a panel display of diagnostic plots for linear models? A. Diagnostics B. Hovtest

model Focus(event='Sports')=Gender; In the MODEL statement, the response variable name is followed by the EVENT= option in parentheses (which specifies the event category—the level of the response variable that you're interested in), an equal sign, and the predictor variable name. Review: The LOGISTIC Procedure

Suppose you want to investigate the relationship between the gender of elementary school students and their focus in school. The variable Gender indicates the gender of each student as Boy or Girl. The variable Focus identifies each student's main focus in school as Grades or Sports. Which of the following MODEL statements correctly completes this PROC LOGISTIC step for your analysis? proc logistic data=school.students; class Gender; _____________________________________ run; a. model Focus(event='Sports*Grades')=Gender; b. model Focus(event='Sports')=Gender; c. model Focus(ref='Sports')=Gender; d. model Focus*Gender(ref='Sports');
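
The same pattern can be tried on data that ships with SAS. The sketch below assumes sashelp.heart, whose Status variable takes the values Alive and Dead; the data set and variables are illustrative stand-ins for school.students, Focus, and Gender.

```sas
proc logistic data=sashelp.heart;
   class Sex;
   model Status(event='Dead') = Sex;   /* EVENT= names the response level being modeled */
run;
```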

means, normal, larger The central limit theorem states that the distribution of sample means is approximately normal, regardless of the distribution of the population data, and this approximation improves as the sample size gets larger.

The central limit theorem states that the distribution of sample __(1)__ is approximately __(2)__, regardless of the distribution of the population data, and this approximation improves as the sample size gets __(3)__. a. means, skewed, larger b. variance, equal, smaller c. means, normal, larger d. proportions, equal, smaller
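
A small simulation can make the theorem concrete. This is a sketch under assumed settings (1,000 samples of size 30 from an exponential population, arbitrary seed); none of these numbers come from the original cards.

```sas
/* Draw 1,000 samples of size 30 from a skewed (exponential) population */
data work.sample_means;
   call streaminit(12345);          /* arbitrary seed for reproducibility */
   do sample = 1 to 1000;
      total = 0;
      do i = 1 to 30;
         total = total + rand('exponential');
      end;
      xbar = total / 30;            /* mean of this sample */
      output;
   end;
   keep sample xbar;
run;

/* The histogram of the sample means should look roughly bell-shaped */
proc univariate data=work.sample_means;
   var xbar;
   histogram xbar / normal;
run;
```

Raising the inner loop limit (and the divisor) from 30 to a larger sample size should make the histogram track the normal curve more closely.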

Linear

The defining feature of _____________ models is the __________ function of the explanatory variables. A. Linear B. Logistic

True

True or False? Assessing ANOVA Assumptions: In many cases, good data collection designs can help ensure the independence assumption. Diagnostic plots from PROC GLM can be used to verify the assumption that the error is approximately normally distributed. PROC GLM produces a test of equal variances with the HOVTEST option in the MEANS statement. H0 for this hypothesis test is that the variances are equal for all populations.

True

True or False? Tukey's HSD Test (HSD = Honest Significant Difference): This method is appropriate when you consider pairwise comparisons. The experimentwise error rate is equal to alpha when all pairwise comparisons are considered, and less than alpha when fewer than all pairwise comparisons are considered. It is also known as the Tukey-Kramer test.

The variance inflation factors indicate that collinearity is present in the model. Several variance inflation factors are above 10 (Abdomen, Weight, Height, Chest, Hip, Density, Adiposity, and FatFreeWt). This indicates that collinearity among the predictor variables is present in the model. Review: The REG Procedure: Detecting Collinearity

View this PROC REG output. What does the output indicate about the model? a. The p-value for the overall model is not significant. b. The model does not fit the data well. c. The p-values for the parameter estimates indicate that collinearity is present in the model. d. The variance inflation factors indicate that collinearity is present in the model. e. none of the above
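
Variance inflation factors are requested with the VIF option in the MODEL statement of PROC REG. A minimal sketch, assuming sashelp.class as stand-in data (the body fat data behind this card is not available here):

```sas
proc reg data=sashelp.class;
   model Weight = Height Age / vif;   /* adds a Variance Inflation column to the parameter estimates */
run;
quit;
```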

within-group sample means

What are the "predicted values" that result from fitting a one-way analysis of variance (ANOVA) model? within-group sample variances between-group sample variances within-group sample means between-group mean differences

Link

When modeling a categorical variable, which function is used? A. Link B. Logit

Link

When modeling an interval variable, which function is used? A. Link B. Logit

-2 Log L increased.

When selecting variables or effects using SELECTION=BACKWARD in the LOGISTIC procedure, the business analyst's model selection terminated at Step 3. What happened between Step 1 and Step 2? A. Pr > Chisq increased. B. -2 Log L increased. C. AIC increased. D. DF increased.

Low

When the p-value is ____________, it provides doubt about the truth of the null hypothesis. A. High B. Low

model Health=Drug Disease Drug*Disease; In the MODEL statement, you first specify the main effect variables as they exist in the two-way ANOVA model. You then define the interaction term by separating the two main effect variables with an asterisk in the MODEL statement. Review: Performing Two-Way ANOVA with Interactions, Applying the Two-Way ANOVA Model

When you perform a two-way ANOVA in SAS, which of the following statements correctly defines the model that includes the interaction between the two main effect variables? a. class Drug*Disease; b. class Drug=Disease; c. model Drug*Disease; d. model Health=Drug Disease Drug*Disease;
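
A runnable variant of the correct answer, assuming the sample data set sashelp.cars; Origin, DriveTrain, and MPG_City stand in for Drug, Disease, and Health and are not from the original cards.

```sas
proc glm data=sashelp.cars;
   class Origin DriveTrain;
   model MPG_City = Origin DriveTrain Origin*DriveTrain;   /* two main effects plus their interaction */
run;
quit;
```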

the smallest overall validation average squared error PROC GLMSELECT selects the model that has the smallest overall validation error. Review: Building a Predictive Model

Which of the following does PROC GLMSELECT use to select a model from the candidate models when a validation data set has been provided? a. the smallest number of predictors b. the largest adjusted R-Square value c. the smallest overall validation average squared error d. none of the above

The errors are independent, normally distributed with zero mean and constant variance.

Y = B0 + B1X + E Which statement best summarizes the assumptions placed on the errors? A. The errors are correlated, normally distributed with constant mean and zero variance. B. The errors are correlated, normally distributed with zero mean and constant variance. C. The errors are independent, normally distributed with constant mean and zero variance. D. The errors are independent, normally distributed with zero mean and constant variance.

Dunnett

________________ method is recommended when there is a true control group. When appropriate (when a natural control category exists, against which all other categories are compared) it is more powerful than methods that control for all possible comparisons. A. Levene B. Tukey C. Dunnett
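
Both multiple-comparison methods can be requested through the LSMEANS statement in PROC GLM. The sketch below assumes sashelp.cars and treats USA as the control level purely for illustration.

```sas
ods graphics on;

proc glm data=sashelp.cars;
   class Origin;
   model MPG_City = Origin;
   lsmeans Origin / pdiff=all adjust=tukey;                 /* all pairwise comparisons */
   lsmeans Origin / pdiff=control('USA') adjust=dunnett;    /* each level against the control */
run;
quit;
```

With ODS Graphics enabled, the Tukey request typically also produces a diffogram for the pairwise differences.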

GLM

ods graphics; proc _________ data=STAT1.ameshousing3 plots=diagnostics; class Heating_QC; model SalePrice=Heating_QC; means Heating_QC / hovtest=levene; format Heating_QC $Heating_QC.; title "One-Way ANOVA with Heating Quality as Predictor"; run; quit; A. SGPLOT B. SGSCATTER C. GLM

Total Variation

the overall variability in the response variable. It is calculated as the sum of the squared differences between each observed value and the overall mean. This measure is also referred to as the Total Sum of Squares (SST). A. Total Variation B. Between Group Variation C. Within Group Variation

Between Group Variation

the variability explained by the independent variable and therefore represented by the between-treatment sum of squares. It is calculated as the weighted (by group size) sum of the squared differences between the mean for each group and the overall mean. This measure is also referred to as the Model Sum of Squares (SSM). A. Total Variation B. Between Group Variation C. Within Group Variation

proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run;

A Human Resource manager fits a logistic regression model with the following characteristics: - binary target Hired - continuous predictor Salary - categorical predictor Education (levels=1,2,3) The default odds ratio compares each level against the last class level for the variable Education. Which SAS program gives parameter estimates for Education that are consistent with the default odds ratios? proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education; model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education (ref='3'); model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education Salary (param=ref ref='3'); model Hired = Salary Education; run;

Diffogram The downward-sloping diagonal lines show the confidence intervals for the differences. The upward-sloping line is a reference line showing where the group means would be equal.

A _____________ can be used to quickly tell whether two group means are significantly different. The point estimates for the differences between pairs of group means can be found at the intersections of the vertical and horizontal lines drawn at group mean values. A. Histogram B. Diffogram

C. The portfolios differ significantly with respect to risk.

A financial analyst wants to know whether assets in portfolio A are more risky (have higher variance) than those in portfolio B. The analyst computes the annual returns (or percent changes) for assets within each of the two groups and obtains the following output from the GLM procedure: A. Assets in portfolio A are significantly more risky than assets in portfolio B. B. Assets in portfolio B are significantly more risky than assets in portfolio A. C. The portfolios differ significantly with respect to risk. D. The portfolios do not differ significantly with respect to risk.

proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run;

A linear model has the following characteristics: - a dependent variable (y) - one continuous predictor variable (x1), including a quadratic term (x1²) - one categorical predictor variable (c1 with 3 levels) - one interaction term (c1 by x1) Which SAS program fits this model? proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1sq c1byx1 /solution; run; proc reg data=SASUSER.MLR; model y = c1 x1 x1sq c1byx1 /solution; run; proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run; proc reg data=SASUSER.MLR; model y = c1 x1 x1*x1 c1*x1; run;

Standard Error

A statistic that measures the variability of your estimate is the ___________ of the mean. A. Variability B. Standard Error

ANOVA

Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following: a continuous dependent variable, or response variable a discrete independent variable, also called a predictor or explanatory variable. A. CONOVA B. ANOVA

NORMAL

Creates a normal probability plot. Options (MU= SIGMA=) determine the mean and standard deviation of the normal distribution used to create reference lines (normal curve overlay in HISTOGRAM and diagonal reference line in PROBPLOT). A. NORMAL B. EXTENDED

median The median is not affected by outliers and is less affected by the skewness. The mean, on the other hand, averages in any outliers that might be in your data.

For an asymmetric (or skewed) distribution, which of the following statistics is a good measure for the middle of the data? a. mean b. median c. either mean or median

Large wrist size is significantly different than small wrist size.

Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small(S), medium(M), and large(L) wrist sizes? Medium wrist size is significantly different than small wrist size. Large wrist size is significantly different than small wrist size. There is no significant difference due to wrist size. Large wrist size is significantly different than medium wrist size.

no The p-value of 0.2942 is greater than 0.05, so you fail to reject the null hypothesis that the variances are equal. Review: The GLM Procedure

Given this SAS output, is there sufficient evidence to reject the assumption of equal variances? a. yes b. no

yes The p-value of <.001 is less than 0.05, so you would reject the null hypothesis and conclude that the means between the two groups are significantly different. Review: Examining the Equal Variance t-Test and p-Values

Given this SAS output, is there sufficient evidence to reject the hypothesis of equal means? a. yes b. no

the most parsimonious model The most parsimonious model is selected. The most parsimonious model is the simplest, least complex of the candidate models. Review: Building a Predictive Model

Honest assessment might generate multiple candidate models that have the same (or nearly the same) validation assessment values. In this situation, which model is selected? a. the model that has the highest variance when it is applied to the population b. the model that has the most terms c. the most parsimonious model d. the most biased model

T-Test

If you analyze the difference between two means using ANOVA, you reach the same conclusions as you reach using a pooled, two-group _________. A. T-Test B. Analysis

the SCORE= option The SCORE= option specifies the data set that contains the parameter estimates. PROC SCORE reads the parameter estimates from this data set, scores the observations in the data set that the DATA= option specifies, and writes the scored observations to the data set that the OUT= option specifies. Review: The SCORE Procedure: Scoring Predicted Values Using Parameter Estimates

In this PROC SCORE step, which option specifies the data set containing the parameter estimates that are used to score observations? proc score data=dataset1 score=dataset2 out=dataset3 type=parms; var Performance; run; a. the DATA= option b. the SCORE= option c. the OUT= option

GLM

PROC __________DATA=SAS-data-set PLOTS=options; CLASS variables; MODEL dependents=independents </ options>; MEANS effects </ options>; LSMEANS effects </ options>; OUTPUT OUT=SAS-data-set keyword=variable...; RUN; QUIT; A. SGPLOT B. SGSCATTER C. GLM

HOVTEST

Performs a test of homogeneity (equality) of variances. The null hypothesis for this test is that the variances are equal. Levene's test is the default. A. T-TEST B. HOVTEST C. EQUALTEST

both parametric and non-parametric models Predictive models can be based on both parametric and non-parametric models. Review: What Is Predictive Modeling?

Predictive models can be based on which of the following? a. parametric models only b. non-parametric models only c. both parametric and non-parametric models

proc univariate data=statdata.sleep mu0=8; var hours; run; You specify the MU0= option as part of the PROC UNIVARIATE statement to indicate the test value of the null hypothesis. The alternative hypothesis is that μ is not equal to 8 hours, but this does not need to be specified in the PROC UNIVARIATE code.

Psychologists at a college want to know if students are sleeping more or less than the recommended average of 8 hours a day. Which of the following code choices correctly tests the null hypothesis? a. proc univariate data=statdata.sleep mu0<>8; var hours; run; b. proc univariate data=statdata.sleep; var hours / mu0=8; run; c. proc univariate data=statdata.sleep; var hours / mu0<>8; run; d. proc univariate data=statdata.sleep mu0=8; var hours; run;
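
The same MU0= pattern on data that ships with SAS. This is a sketch: sashelp.class and the test value of 60 inches are illustrative assumptions, not part of the original question.

```sas
proc univariate data=sashelp.class mu0=60;   /* H0: mean Height equals 60 */
   var Height;
run;
```

The Tests for Location table in the output reports the Student's t statistic and its two-sided p-value.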

The odds of the event are 1.142 times greater for each one thousand dollar increase in salary.

Salary data are stored in 1000's of dollars. What is a correct interpretation of the estimate? A. The odds of the event are 1.142 times greater for each one thousand dollar increase in salary. B. The probability of the event is 1.142 times greater for each one thousand dollar increase in salary. C. The probability of the event is 1.142 times greater for each one dollar increase in salary. D. The odds of the event are 1.142 times greater for each one dollar increase in salary.

KERNEL

Superimposes kernel density estimates on the histogram. A. NORMAL B. EXTENDED C. KERNEL

the predicted value of the response when all predictors = 0.

The Intercept estimate is interpreted as: the predicted value of the response when all predictors are at their means. the predicted value of the response when all predictors are at their minimum values. the predicted value of the response when all the predictors are at their current values. the predicted value of the response when all predictors = 0.

Constant variance, because the interquartile ranges are different in different ad campaigns.

The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why? A. Constant variance, because Prob > F < .0001. B. Normality, because Prob > F < .0001. C. Constant variance, because the interquartile ranges are different in different ad campaigns. D. Normality, because the interquartile ranges are different in different ad campaigns.

7

The following SAS Code is submitted: proc reg data=sashelp.fish; model weight=length1 height width / selection=adjrsq; run; How many possible subset models will be assessed by SAS? A. 6 B. 8 C. 5 D. 7
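
The count follows from the number of non-empty subsets of k predictors, which is 2**k - 1. A short DATA step sketch of the arithmetic for the three predictors in the question:

```sas
data _null_;
   k = 3;                   /* predictors: length1, height, width */
   n_models = 2**k - 1;     /* every non-empty subset of the predictors */
   put n_models=;           /* writes n_models=7 to the log */
run;
```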

the mean (µ) and the standard deviation (σ) The location and spread of a normal distribution depend on the value of two parameters, the mean (µ) and the standard deviation (σ).

The location and spread of a normal distribution depend on the value of which two parameters? a. the mean (x̄) and the standard deviation (s) b. the standard deviation (σ) and the variance (σ²) c. the mean (µ) and the standard deviation (σ) d. none of the above

a two-sided t-test Because the cereal manufacturer is interested in determining whether the two processes produce a different mean cereal weight, he needs to perform a two-sided t-test. Review: Scenario: Comparing Group Means, Scenario: Testing for Differences on One Side

The manufacturer for a cereal company uses two different processes to package boxes of cereal. He wants to be sure the two processes are putting the same amount of cereal in each box. He plans to perform a two-sample t-test to determine whether the mean weight of cereal is significantly different between the two processes. What type of test should he run? a. an upper-tailed t-test b. a two-sided t-test c. a lower-tailed t-test

Mean

The predicted value in ANOVA is the group _________. A. Mean B. Median C. Mode

Predicted

The regression coefficients are just numbers and they are multiplied by the explanatory variable values. These products are then summed to get the individual's ______________ value. A. Expected B. Predicted

used to calculate confidence intervals of the mean. The standard error of the mean is part of the equation used to calculate a confidence interval of the mean. It is not normally distributed, and it is never less than 0.

The standard error of the mean is a. used to calculate confidence intervals of the mean. b. always normally distributed. c. sometimes less than 0. d. none of the above

True The CLASS statement creates a set of "design variables" (sometimes referred to as "dummy variables") representing the information contained in any categorical variables. Linear regression is then performed on the design variables. ANOVA can be thought of as linear regression on dummy variables. It is only in the interpretation of the model that a distinction is made.

True or False? What Does a CLASS Statement Actually Do? The CLASS statement creates a set of "design variables" representing the information in the categorical variables. PROC GLM performs linear regression on the design variables, but reports the output in a manner interpretable as group mean differences. There is only one "parameterization" available in PROC GLM.

Several observations exceed the cutoff values, so these observations might be influential. The gray horizontal lines mark the +2 and -2 cutoff values of the RSTUDENT residuals. Several observations fall outside these lines, so these observations might be influential. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

View this plot of RSTUDENT residuals versus predicted values of PctBodyFat2. What does it indicate? a. The model does not fit the data well. b. The residuals have a cyclical shape, so the independence assumption is being violated. c. Several observations exceed the cutoff values, so these observations might be influential. d. none of the above

H0: µ = µ0, or equivalently H0: µ - µ0 = 0

What is the null hypothesis for a one-sample t-test? A. H0: µ = µ0 B. H0: µ0 = 0 C. H0: µ - µ0 = 0 D. H0: µ0 - 0 = 0

C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run;

Which SAS program will correctly use backward elimination with BIC selection criterion within the GLMSELECT procedure? A. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward choose=bic; run; B. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward selection=bic; run; C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run; D. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward choose=bic; run;

None of the Above

Which of the following affects alpha? a. The p-value of the test b. The sample size c. The number of Type I errors d. All of the above e. Answers a and b only f. None of the above

Gender should not be removed due to its involvement in the significant interaction.

Which statement is correct at an alpha level of 0.05? School should be removed because it is significant. Gender should not be removed due to its involvement in the significant interaction. School*Gender should be removed because it is non-significant. Gender should be removed because it is non-significant.

the assumption of equal variances You use Levene's Test for Homogeneity in PROC GLM to verify the assumption of equal variances in a one-way ANOVA model. Review: The GLM Procedure

You can examine Levene's Test for Homogeneity to more formally test which of the following assumptions? a. the assumption of errors being normally distributed b. the assumption of independent observations c. the assumption of equal variances d. the assumption of treatments being randomly assigned

PLOTS= FREQPLOT

requests a frequency plot. Frequency plots are available for frequency and crosstabulation tables. For multiway crosstabulation tables, PROC FREQ provides a two-way frequency plot for each stratum (two-way table). A. PLOTS= FREQPLOT B. PLOTS= FREQUENCY

any of the above Any of these approaches can be used to score data based on the model built by PROC GLMSELECT. Review: Methods of Scoring

A department store is deploying a chosen model to make predictions for an upcoming sales period. They have the necessary data and are ready to proceed. Which of the following methods can be used for scoring? a. a PROC GLMSELECT step that contains the SCORE statement b. a PROC PLM step that contains the SCORE statement and references an item store that was created in PROC GLMSELECT c. a PROC PLM step with the CODE statement that writes the score code based on item store created in PROC GLMSELECT, and a DATA step that scores the data d. any of the above
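
The three approaches might look like the sketch below. Everything named here is an assumption for illustration: sashelp.class serves as both the training data and the data to score, work.wt_model is a made-up item store name, and 'score_code.sas' is a hypothetical file path that must be writable in your session.

```sas
/* (a) Score directly in PROC GLMSELECT and save an item store for later use */
proc glmselect data=sashelp.class;
   model Weight = Height Age / selection=stepwise;
   score data=sashelp.class out=work.scored_a;
   store out=work.wt_model;
run;

/* (b) Score in PROC PLM from the item store; (c) also write DATA step score code */
proc plm restore=work.wt_model;
   score data=sashelp.class out=work.scored_b;
   code file='score_code.sas';      /* hypothetical path */
run;

/* (c) continued: apply the generated score code in a DATA step */
data work.scored_c;
   set sashelp.class;
   %include 'score_code.sas';
run;
```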

yes, Hip and Abdomen Hip and Abdomen both have p-values lower than .05, so they are statistically significant in predicting or explaining the variability of the percentage of body fat. Review: Performing Simple Linear Regression, Analysis versus Prediction in Multiple Regression, Fitting a Multiple Linear Regression Model

According to these parameter estimates, are any of the variables in the model statistically significant in predicting or explaining the percentage of body fat?
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -20.98714           5.55433         -3.78    0.0002
Age        1   0.01226             0.02836         0.43     0.6658
Hip        1   -0.40163            0.09994         -4.02    <.0001
Abdomen    1   0.86123             0.06814         12.64    <.0001
a. no b. yes, Age c. yes, Hip and Abdomen d. yes, Age, Hip, and Abdomen

a fairly strong, negative linear relationship The correlation coefficient for the relationship between Performance and RunTime is -0.82049, which is negative. Its absolute value is also close to 1, making it a relatively strong relationship. Review: Using Correlation to Measure Relationships between Continuous Variables

Based on this correlation matrix, what type of relationship do Performance and RunTime have?
Pearson Correlation Coefficients, N = 31
Prob > |r| under H0: Rho=0 (shown in parentheses)
             Performance        RunTime            Age
Performance  1.00000            -0.82049 (<.0001)  -0.71257 (<.0001)
RunTime      -0.82049 (<.0001)  1.00000            0.19523 (0.2926)
Age          -0.71257 (<.0001)  0.19523 (0.2926)   1.00000
a. a fairly strong, positive linear relationship b. a fairly strong, negative linear relationship c. a fairly weak, positive linear relationship d. a fairly weak, negative linear relationship
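
A matrix like this one comes from PROC CORR. A minimal sketch, assuming sashelp.class in place of the fitness data used in the question:

```sas
proc corr data=sashelp.class;
   var Height Weight Age;   /* prints Pearson r with Prob > |r| under H0: Rho=0 for each pair */
run;
```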

POSITION=NE

Determines the position of the inset. The position is a compass point keyword, a margin keyword, or a pair of coordinates. You can specify coordinates in axis percent units or axis data units. The default value is NW. A. POSITIONPLOT B. POSITION=NE

STEPWISE The summary table contains both Variable Entered and Variable Removed columns. Of the three types of stepwise selection (forward, backward, and stepwise), only stepwise selection can both enter and remove variables. Therefore, STEPWISE must have been specified in the PROC REG step. Review: The Stepwise Selection Approach to Model Building, The GLMSELECT Procedure, The GLMSELECT Procedure: Performing Stepwise Regression

Given the information in this summary of variable selection, which stepwise selection method was specified in the PROC REG step?
Step  Variable Entered  Variable Removed  Number Vars In  Partial R-Square  Model R-Square  C(p)    F Value  Pr > F
1     RunTime                             1               0.7434            0.7434          3.3432  84.00    <.0001
2     Age                                 2               0.0213            0.7647          2.8192  2.54     0.1222
a. FORWARD b. BACKWARD c. STEPWISE d. can't tell from the information given

36.1680 and 52.3021 The CLI option, which displays the 95% CL Predict column in the Output Statistics table, produces confidence limits for an individual predicted value. In this table, the third observation, for Kate, contains the value 55 for Performance. Therefore, the values in her 95% CL Predict column are the lower and upper confidence limits for a new individual value at the same value of Performance. In contrast, the CLM option displays the values in the 95% CL Mean column, which are the lower and upper confidence limits for a mean predicted value for each observation. Review: Specifying Confidence and Prediction Intervals in SAS, Viewing and Printing Confidence Intervals and Prediction Intervals, The REG Procedure: Producing Predicted Values

Here is a table of output statistics from PROC REG. If you sample a new value of the dependent variable when Performance equals 55, what are the lower and upper prediction limits for this newly sampled individual value?
Output Statistics
Obs  Name   Performance  Dependent Variable  Predicted Value  Std Error Mean Predict  95% CL Mean       95% CL Predict    Residual
1    Jack   48           40.8400             44.9026          1.0190                  42.0732  47.7319  37.4190  52.3861  -4.0626
2    Annie  43           45.1200             45.3793          1.3081                  41.7475  49.0112  37.5570  53.2016  -0.2593
3    Kate   55           44.7500             44.2351          1.4885                  40.1023  48.3678  36.1680  52.3021  0.5149
4    Carl   40           46.0800             45.6654          1.6493                  41.0862  50.2446  37.3608  53.9700  0.4146
5    Don    58           44.6100             43.9490          1.8646                  38.7719  49.1261  35.3003  52.5977  0.6610
6    Effie  45           47.9200             45.1886          1.1361                  42.0343  48.3429  37.5763  52.8009  2.7314
a. 44.7500 and 44.2351 b. 40.1023 and 48.3678 c. 36.1680 and 52.3021 d. can't tell from the information given
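
Both interval types are requested as MODEL statement options in PROC REG. A minimal sketch, assuming sashelp.class as stand-in data:

```sas
proc reg data=sashelp.class;
   model Weight = Height / clm cli;   /* CLM: limits for the mean; CLI: limits for an individual prediction */
run;
quit;
```

The 95% CL Predict limits are always wider than the 95% CL Mean limits for the same observation, which matches the pattern in the table above.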

tables Country Size Country*Size; You use the TABLES statement in PROC FREQ to create frequency and crosstabulation tables. In the TABLES statement, you separate table requests with a space. In a table request for a crosstabulation table, you specify an asterisk between the variable names. Review: Crosstabulation Tables

In a PROC FREQ step, which statement or set of statements creates a frequency table for Country, a frequency table for Size, and a crosstabulation table for Country by Size? a. tables Country, Size, Country*Size; b. tables Country*Size; c. tables Country | Size; d. tables Country Size Country*Size;

Populations

In inferential statistics, the focus is on learning about ______________. Examples of ___________ are all people with a certain disease, all drivers with a certain level of insurance, or all customers, both current and potential, at a bank. A. Populations B. Volumes

no The most complex model is not always the best choice. An overly complex model might be too flexible, which can lead to overfitting. Review: Model Complexity

In predictive modeling, is the most complex model the best choice? a. yes b. no

that the errors are normally distributed The Residuals versus Quantile plot is a normal quantile plot of the residuals. Using this plot, you can verify that the errors are normally distributed, which is one of our assumptions. Here the residuals follow the normal reference line pretty closely, so we can conclude that the errors are normally distributed. Review: The REG Procedure: Producing Default Diagnostic Plots

In the diagnostic plots below, what does the Residual versus Quantile plot indicate about the model? a. that the errors are normally distributed b. that the data set contains many influential observations c. that the model is inadequate because the spread of the residuals is less than the spread of the centered fit d. that the model is inadequate because patterns occur in the spread around the reference line

age, body temperature, gas mileage, income The continuous variables are age, body temperature, gas mileage, and income.

Select the choice that lists only continuous variables. a. body temperature, number of children, gender, beverage size b. age, body temperature, gas mileage, income c. number of children, gender, gas mileage, income d. gender, gas mileage, beverage size, income

The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. A 95% confidence interval means that you are 95% confident that the interval contains the true population mean. If you sample repeatedly and calculate a confidence interval for each sample mean, 95% of the time your confidence interval will contain the true population mean. A confidence interval is not a probability. When a confidence interval is calculated, the true mean is in the interval or it is not. There is no probability associated with it.

Select the statement below that incorrectly interprets a 95% confidence interval (15.02, 15.04) for the population mean, if the sample mean is 15.03 ounces of cereal. a. You are 95% confident that the true average weight for a box of cereal is between 15.02 and 15.04 ounces. b. The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. c. In the long run, approximately 95% of the intervals calculated with this procedure will capture the true average weight.

A Cramer's V statistic that is close to 1 Cramer's V statistic is the only appropriate statistic to use in this example. When Cramer's V is close to 1, there is a relatively strong general association between two categorical variables. You cannot use an odds ratio because the predictor Type is not binary. You cannot use the Spearman correlation statistic because the predictor Type is not ordinal. Review: Cramer's V Statistic, Odds Ratios, The Spearman Correlation Statistic

Suppose you are analyzing the relationship between hot dog ingredients and taste. Which of the following statistics provides evidence of a relatively strong association between the variables Type (which has the values Beef, Meat, and Poultry) and Taste (which has the values Bad and Good)? a. A Cramer's V statistic that is close to 1 b. An odds ratio that is greater than 1 c. A Spearman correlation statistic that is close to 1

tables Rating*Grade / chisq measures; Both variables are ordinal and have logically-ordered values, so the Mantel-Haenszel test (for ordinal association) is a stronger test than the Pearson chi-square test (for general association) in this situation. The CHISQ option produces both the Pearson and Mantel-Haenszel statistics. The MEASURES option produces the Spearman correlation statistic, which measures the strength of an ordinal association. MHCHISQ is not a valid option, and the CLODDS= option is not a valid option in PROC FREQ. Review: The Mantel-Haenszel Chi-Square Test, The Spearman Correlation Statistic, Performing a Mantel-Haenszel Chi-Square Test of Ordinal Association

Suppose you are testing for an association between student ratings of teachers and student grades. The Rating variable has the values 1 (for poor), 2 (for fair), 3 (for good) and 4 (for excellent). The Grade variable has the values A, B, C, D, and F. Which of the following TABLES statements in PROC FREQ produces the appropriate chi-square statistics and measure of strength for these variables? a. tables Rating*Grade / chisq measures; b. tables Rating*Grade / chisq; c. tables Rating*Grade / mhchisq; d. tables Rating*Grade / mhchisq clodds=pl;
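For context, the chosen TABLES statement might sit in a complete step such as the sketch below. The data set name work.evals is hypothetical; the variable names come from the card.

```sas
/* Hypothetical data set name; Rating and Grade are from the card. */
proc freq data=work.evals;
   /* CHISQ prints the Pearson and Mantel-Haenszel chi-square tests;
      MEASURES prints the Spearman correlation, among other measures */
   tables Rating*Grade / chisq measures;
run;
```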

the equal variance assumption When a residuals plot displays a funnel shape, it indicates that the variance of the residuals is not constant. That is, the variance increases toward the wide end of the "funnel." This shows you that your model violates the equal variance assumption. Review: Verifying Assumptions Using Residual Plots

Suppose you have a residuals plot that shows a funnel shape for the residuals, such as in the plot below. Which assumption of linear regression is being violated? a. the linearity assumption b. the independence assumption c. both the linearity assumption and the independence assumption d. the equal variance assumption e. both the linearity assumption and the equal variance assumption

proc plm restore=homestore; score data=new out=new_out; run; In PROC PLM, the RESTORE= option specifies the name of the item store. In the SCORE statement, the DATA= option specifies New as the data set that contains the observations to be scored. The OUT= option specifies that the scored results are saved in a data set named New_Out. Review: Scoring Data

Suppose you ran a PROC GLMSELECT step that saved the context and results of the statistical analysis in an item store named Homestore. Which of the following programs scores new observations in a data set named New and saves the predictions in a data set named New_Out? a. proc plm restore=homestore; score data=new out=new_out; run; b. proc plm restore=new; score data=homestore out=new_out; run; c. proc plm data=homestore; score data=new out=new_out; run; d. proc plm restore=homestore; model data=new out=new_out; run;
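A sketch of the full workflow, assuming the item store was created with a STORE statement in PROC GLMSELECT. The data set Housing and the model shown here are illustrative assumptions borrowed from another card in this set; only Homestore, New, and New_Out come from this question.

```sas
/* Step 1 (assumed): fit the model and save it in an item store. */
proc glmselect data=housing;
   class fireplace lot_shape;
   model Sale_price = fireplace lot_shape;
   store out=homestore;
run;

/* Step 2: restore the item store and score the new observations. */
proc plm restore=homestore;
   score data=new out=new_out;
run;
```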

Total

The _________ sum of squares, SST, is a measure of the total variability in a response variable. It is calculated by summing the squared distances from each point to the overall mean. Because it is correcting for the mean, this sum is sometimes called the corrected total sum of squares. A. Total B. Error

units Amount=20; oddsratio Amount; oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); You must specify the intervals of Amount in the UNITS statement, not in the ODDSRATIO statement. To calculate odds ratios for the two categorical variables as described, each of the two ODDSRATIO statements must set DIFF= to REF against all levels of the interacting variable. Review: The ODDSRATIO Statement, The UNITS Statement

Suppose you want to fit a multiple logistic regression model to determine how the method of administering a drug affects patients' response to the drug. The binary variable Response has the values 0 and 1. There are three predictors: Amount identifies the dosage amount in mg, Frequency has the values Daily and Weekly, and Meal has the values Yes and No. You want to calculate three odds ratios: (1) an odds ratio for Amount at 20 mg intervals; (2) an odds ratio for Frequency against the reference level (Daily) as compared to all levels of Meal; (3) an odds ratio for Meal against the reference level (Yes) as compared to all levels of Frequency. Which of the following blocks of code correctly completes this PROC LOGISTIC program? proc logistic data=newdrug; class Frequency (param=ref ref='Daily') Meal (param=ref ref='Yes'); model Response (event='1') = Frequency | Meal | Amount @2; _____________________________________________ run; a. oddsratio Amount (units=20); oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); b. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily'); c. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); d. oddsratio Amount (units=20); oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily');
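Assembled into one step, the program from the question and the block consistent with the explanation (intervals for Amount go in the UNITS statement) would look like the sketch below. All names come from the card; only the layout and comments are added.

```sas
proc logistic data=newdrug;
   class Frequency (param=ref ref='Daily') Meal (param=ref ref='Yes');
   model Response (event='1') = Frequency | Meal | Amount @2;
   units Amount=20;                               /* odds ratio per 20 mg */
   oddsratio Amount;
   oddsratio Frequency / diff=ref at (Meal=all);  /* vs. Daily, at each level of Meal */
   oddsratio Meal / diff=ref at (Frequency=all);  /* vs. Yes, at each level of Frequency */
run;
```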

class Program(param=ref ref='2') Gender(param=ref ref='Male'); The CLASS statement lists all the categorical predictor variables. For each categorical predictor, you use the PARAM= option to specify reference cell coding (REF or REFERENCE) instead of the default parameterization method, effect coding. The default reference level is the level with the highest ranked value when the levels are sorted in ascending alphanumeric order. Review: Specifying a Parameterization Method in the CLASS Statement, Reference Cell Coding

Suppose you want to fit a multiple logistic regression model to determine which of two rehabilitation programs is more effective. The categorical response variable Relapsed (Yes or No) indicates whether study participants stayed clean after one year. The categorical predictor variables are Program (1 or 2) and Gender (Male or Female). Age is a continuous predictor variable. Assume that you want to use reference cell coding with the default reference levels. Which of the following CLASS statements correctly completes the PROC LOGISTIC step for this analysis? proc logistic data=programs.rehabilitation; _____________________________________ model Relapsed (event='Yes') = Program | Gender | Age @2; run; a. class Program(param=ref ref='2') Gender(param=ref ref='Male'); b. class Program(param=ref ref='2') Gender (param=ref ref='Male') Age (param=ref units=1); c. class Program(param=ref ref='1') Gender(param=ref ref='Female');

false The Tukey method and the pairwise t-tests are two methods you learned about that compare all possible pairs of means, so they can be used only when you make pairwise comparisons. The Dunnett method compares all categories to a control group. Review: Dunnett's Multiple Comparison Method, Tukey's Multiple Comparison Method

The Dunnett method compares all possible pairs of means, so it can be used only when you make pairwise comparisons. a. true b. false

Error

The ___________ sum of squares, SSE, measures the random variability within groups; it is the sum of the squared deviations between observations in each group and that group's mean. This is often referred to as the unexplained variation or within-group variation. A. Total B. Error
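The total and error sums of squares can be checked by hand on a tiny example. The sketch below uses a made-up data set (two groups of two observations each; none of these values come from the source) and lets PROC GLM print the ANOVA table.

```sas
/* Made-up data for illustration only. */
data groups;
   input Group $ Y;
   datalines;
A 1
A 3
B 5
B 7
;
run;

proc glm data=groups;
   class Group;
   model Y = Group;   /* ANOVA table reports Model, Error, and Corrected Total SS */
run;
quit;
```

By hand: the overall mean is 4, so SST = 9 + 1 + 1 + 9 = 20. The group means are 2 and 6, so SSE = (1 + 1) + (1 + 1) = 4. The remaining 16 is the model (between-group) sum of squares, and 16 + 4 = 20.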

DFFITS and CooksD only The variable Summary_i compresses the indicator variables RStud_i, DFits_i, and CookD_i into a single variable, with values in the order shown in the assignment statement that defines Summary_i. Therefore, the Summary_i value 011 means that the RStudent value did not exceed the cutoff, but the values for DFFITS and CooksD did. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

The observation below is from the data set InfluentialBF.
Obs  Summary_i  Case  PredictedValue  RStudent  DFFITS   CutDFFits  CooksD  CutCooksD
1    011        39    44.8580         -2.6312   -1.5941  0.80322    0.496   0.12903
Assume that these assignment statements were used in creating the data set:
CutDFFits=2*(sqrt(&numparms/&numobs));
CutCooksD=4/&numobs;
RStud_i=(abs(RStudent)>3);
DFits_i=(abs(DFFits)>CutDFFits);
CookD_i=(CooksD>CutCooksD);
Summary_i=compress(RStud_i||DFits_i||CookD_i);
For which statistics did this observation exceed the cutoff criteria? a. RStudent, DFFITS, and CooksD b. RStudent and DFFITS only c. RStudent and CooksD only d. DFFITS and CooksD only
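The flags can be reproduced with a short DATA step. This is a sketch under one assumption: the values numparms=5 and numobs=31 are not stated on the card; they are inferred from the cutoffs shown (4/31 is about 0.12903, and 2*sqrt(5/31) is about 0.80322).

```sas
/* Assumed values, inferred from the cutoffs on the card. */
%let numparms = 5;
%let numobs   = 31;

data check_flags;
   /* Values copied from the observation on the card */
   RStudent = -2.6312;
   DFFits   = -1.5941;
   CooksD   = 0.496;

   /* Same assignment statements as on the card */
   CutDFFits = 2*(sqrt(&numparms/&numobs));
   CutCooksD = 4/&numobs;
   RStud_i = (abs(RStudent) > 3);        /* 0: 2.6312 does not exceed 3 */
   DFits_i = (abs(DFFits) > CutDFFits);  /* 1: 1.5941 exceeds the cutoff */
   CookD_i = (CooksD > CutCooksD);       /* 1: 0.496 exceeds the cutoff */
   Summary_i = compress(RStud_i || DFits_i || CookD_i);

   put CutDFFits= CutCooksD= Summary_i=;  /* writes the values to the log */
run;
```

Each comparison evaluates to 1 (true) or 0 (false). The concatenation converts those numbers to text, and COMPRESS removes the padding blanks, which gives the three-character string 011.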

P Value

The probability calculated from the data is called the ____________. A. P Value B. Expected Confidence

The row percentages indicate that the distribution of size changes when the value of country changes. To see a possible association, you look at the row percentages. A higher percentage of American-made cars are large as opposed to small. The opposite is true for European cars and especially for Japanese cars. Review: Association between Categorical Variables, Crosstabulation Tables

This table shows frequency statistics for the variables country and size in a data set that contains data about people and the cars they drive. What evidence in the table indicates a possible association?
Table of country by size (each cell lists Frequency / Percent / Row Pct / Col Pct)
country   Large                       Medium                       Small                        Total
American  36 / 11.88 / 31.30 / 85.71  53 / 17.49 / 46.09 / 42.74   26 / 8.58 / 22.61 / 18.98    115 / 37.95
European   4 /  1.32 / 10.00 /  9.52  17 /  5.61 / 42.50 / 13.71   19 / 6.27 / 47.50 / 13.87     40 / 13.20
Japanese   2 /  0.66 /  1.35 /  4.76  54 / 17.82 / 36.49 / 43.55   92 / 30.36 / 62.16 / 67.15   148 / 48.84
Total     42 / 13.86                  124 / 40.92                  137 / 45.21                  303 / 100.00
a. The frequency statistics indicate that the values of each variable are equally distributed across levels. b. The row percentages indicate that the distribution of size changes when the value of country changes. c. The column percentages indicate that most of the cars of each size are manufactured in Japan.

The drug effect is not significant when used in patients with disease Z. The p-value for disease Z is 0.7815. Because this p-value is greater than your alpha of 0.05, you fail to reject the null hypothesis and conclude that there is no significant effect of Drug on Health for patients with disease Z. Review: Performing a Post Hoc Pairwise Comparison

This table shows output from a post hoc pairwise comparison in which you tested the significance of a drug on patients' health for three different diseases. What conclusion can you make based on this output? a. The drug effect is significant when used in patients with disease Z. b. The drug effect is significant when used in patients with diseases Y and Z. c. The drug effect is not significant when used in patients with disease Z.

CONNECT= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;

Which VBOX option specifies that a connect line joins a statistic from box to box? This option applies only when the CATEGORY= option is used to generate multiple boxes. A. CATEGORY= B. CONNECT=

CATEGORY= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;

Which VBOX option specifies the category variable for the plot? A box plot is created for each distinct value of the category variable. A. CATEGORY= B. CONNECT=

both of the above An influential observation is an observation that strongly affects the linear model's fit to the data. If the influential observation weren't there, the best fitting line to the rest of the data would most likely be very different. Review: Introduction, Using Diagnostic Statistics to Identify Influential Observations, Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2, Handling Influential Observations

What is an influential observation? a. an unusual observation that can sometimes have a large residual compared to the rest of the points b. an observation so far away from the rest of the data that it influences the slope of the regression line c. both of the above d. neither of the above

a table of correlations and a scatter plot matrix with histograms along its diagonal By default, PROC CORR produces a table of correlations (which can be a correlation matrix, depending on your program). The NOSIMPLE option suppresses printing of the simple descriptive statistics for each variable, and PLOTS=MATRIX requests a scatter plot matrix instead of individual scatter plots. The HISTOGRAM option displays histograms of the variables in the VAR statement along the diagonal of the scatter plot matrix. Review: Using Correlation to Measure Relationships between Continuous Variables

What output does this program produce? proc corr data=statdata.bodyfat2 nosimple plots=matrix(nvar=all histogram); var Age Weight Height; run; a. individual correlation plots and simple descriptive statistics b. a scatter plot matrix only, with histograms along its diagonal c. a table of correlations and a scatter plot matrix with histograms along its diagonal d. can't tell from the information given

For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. The parameter estimate for Age is the average change in Oxygen_Consumption for a 1-unit change in Age. In this case, the parameter estimate is negative. So, for each year older (a 1-unit change in Age), oxygen consumption decreases by 2.78 units. Review: The Simple Linear Regression Model

When Oxygen_Consumption is regressed on RunTime, Age, Run_Pulse, and Maximum_Pulse, the parameter estimate for Age is -2.78. What does this mean? a. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 greater. b. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. c. For every 2.78 years older (holding the other predictors at a fixed value), oxygen consumption doubles. d. For every 2.78 years younger (holding the other predictors at a fixed value), oxygen consumption doubles.

proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run; The PARTITION statement specifies that the original data set, Housing, be split. The FRACTION option specifies the fraction of the original data set (as a decimal value) to be placed in the holdout data set. The training data set contains the remaining observations, those that were not allocated to the validation (or, if specified, test) data sets. Review: Using PROC GLMSELECT to Build a Predictive Model, Building a Predictive Model

Which of the following PROC GLMSELECT steps splits the original data set into a training data set that contains 80% of the original data and a validation data set that contains 20% of the original data? a. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / fraction(test=0 validate=.20); run; b. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / partition(test=0 validate=.20); run; c. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; fraction(test=0 validate=.20); run; d. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run;

STUDENT residuals You can use STUDENT residuals to detect outliers. To detect influential observations, you can use RSTUDENT residuals and the DFFITS and Cook's D statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which of the following can you use to detect outliers? a. DFFITS statistics b. Cook's D statistics c. STUDENT residuals d. RSTUDENT residuals

proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; In the HISTOGRAM statement, you specify the Speed variable and the NORMAL option using estimates of the population mean and the population standard deviation. In the INSET statement, you specify the keywords SKEWNESS and KURTOSIS, as well as the POSITION=NE option.

Which of the following code choices creates a histogram for the variable Speed from the data set SpeedTest with a normal curve overlay and a box with the skewness and kurtosis statistics printed in the northeast corner? a. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis; run; b. proc univariate data=statdata.speedtest; histogram Speed / normal (mean std); inset skewness kurtosis / position=ne; run; c. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; d. proc univariate data=statdata.speedtest; histogram Speed / normal(skewness kurtosis); run;

proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; The PROC MEANS statement must include the option PRINTALLTYPES in order for SAS to display statistics for all requested combinations of class variables - that is, for each level or occurrence of the variable and for all occurrences combined. The statistics specified on the second line must include the keywords N MEAN MEDIAN STD VAR RANGE QRANGE. The code must specify Type as the class variable and Yield as the analysis variable.

Which of the following code examples correctly calculates descriptive statistics of popcorn yield (Yield) for each level of the class variable (Type) in the data set Statdata.Popcorn, as well as statistics for all levels combined? The output should include the following statistics: sample size, mean, median, standard deviation, variance, range, and interquartile range. a. proc means data=statdata.popcorn maxdec=2 fw=10 n mean median std var range qrange; class Type; var Yield; run; b. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Yield; var Class; run; c. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; d. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std range IQR; class Type; var Yield; run;

all of the above All of these statements are available for use within PROC PLM for postprocessing. Recall that this postprocessing will be performed using the item store. Review: Performing Postprocessing Tasks with the PLM Procedure

Which of the following is available for use in postprocessing within PROC PLM? a. LSMEANS b. LSMESTIMATE c. SLICE d. all of the above

The observations are dependent. In an ANOVA model, you assume that the errors are normally distributed for each treatment, the errors have equal variances across treatments, and the observations are independent. When you add a blocking factor to your ANOVA model, you also assume that the treatments are randomly assigned within each block and that the effects of the treatment are the same within each block. Review: More ANOVA Assumptions

Which of the following is not an assumption you make when including a blocking factor in an ANOVA randomized block design? a. The treatments are randomly assigned within each block. b. The errors are normally distributed. c. The effects of the treatment factor are constant across the levels of the blocking variable. d. The observations are dependent.

all of the above All six steps are important for developing good regression models. You might need to perform some steps iteratively to produce the best possible model. Review: Using an Effective Modeling Cycle

Which of the following is suggested for developing good regression models? a. getting to know your data by performing preliminary analyses b. identifying good candidate models c. checking and validating your assumptions using residual plots and other statistical tests d. identifying any influential observations or collinearity e. revising the model if needed f. validating the model with data not used to build the model g. all of the above h. a, c, and d only

When you score data, you apply the score code (the equations obtained from the final model) to the scoring data. When you score data, you apply the score code to the scoring data. It is not necessary to rerun the algorithm that was used to build the model. If you made any modifications to the training or validation data, you must make the same modifications to the scoring data before you can score it. The size of the scoring data set is not affected by the size of the training and validation data sets. Review: Preparing for Scoring

Which of the following statements about scoring is true? a. When you score data, you must rerun the algorithm that was used to build the model. b. When you score data, you apply the score code (the equations obtained from the final model) to the scoring data. c. If you made any modifications to the training or validation data, it is not necessary to make the same modifications to the scoring data. d. The scoring data set cannot be larger than either the training data set or the validation data set.

2 only In statement 2, the amount of salty snacks eaten and thirst have a positive linear relationship. As the values of one variable (amount of salty snacks eaten) increase, the values of the other variable (thirst) increase as well. Review: Using Scatter Plots to Describe Relationships between Continuous Variables, Using Correlation to Measure Relationships between Continuous Variables

Which of the following statements describes a positive linear relationship between two variables? 1. The more I eat, the less I want to exercise. 2. The more salty snacks I eat, the more water I want to drink. 3. No matter how much I exercise, I still weigh the same. a. 1 only b. 1 and 2 c. 2 only d. 2 and 3 e. 3 only

all of the above All of the statements are true concerning information criteria. All of the formulas begin with the same calculation but differ in the penalty term that assesses the complexity of the model. With this penalty, models that contain different numbers of parameters can be compared, and the model with the smaller information criterion value is considered better. Review: Information Criteria

Which of the following statements is true about information criteria such as AIC, AICC, BIC, and SBC? a. Formulas for all information criteria begin with the same calculation. b. The penalty term to assess the complexity of the model allows information criteria to be a useful means of comparing models with different number of parameters. c. The best model is the one with the smallest information criteria value. d. all of the above

You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value. By specifying an integer that is greater than zero in the SEED= option, you can reproduce your results by rerunning the code using the same SEED= value. The SEED= option has nothing to do with the allocation of observations to the validation data set. If you do not specify a valid value in the SEED= option, the seed is automatically generated from reading the time of day from the computer's clock. The SEED= option is used when you start with a data set that is not yet partitioned. Review: Using PROC GLMSELECT to Build a Predictive Model

Which of the following statements is true about the SEED= option in PROC GLMSELECT? PROC GLMSELECT DATA=training-data-set <SEED=number>; MODEL target(s)=input(s) </ options>; PARTITION FRACTION(<TEST=fraction><VALIDATE=fraction>); RUN; a. You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value. b. The SEED= option offers an alternative way to specify the proportion of observations to allocate to the validation data set. c. If a valid value is not specified for the SEED= option, the code will not run. d. You can use the SEED= option only when you have already partitioned the data prior to model building.
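A sketch of a reproducible partition, combining the SEED= option with the PARTITION statement. The data set and model are borrowed from another card in this set, and the seed value 12345 is an arbitrary illustrative choice.

```sas
/* The seed value is arbitrary; any positive integer works the same way. */
proc glmselect data=housing seed=12345;
   class fireplace lot_shape;
   model Sale_price = fireplace lot_shape;
   partition fraction(validate=0.20);  /* 20% validation, remainder training */
run;
```

Rerunning this step with the same SEED= value assigns the same observations to the validation data set each time.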

proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case; run; quit; Program a specifies the R and INFLUENCE options, which request diagnostic statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which of these programs requests diagnostic statistics as well as diagnostic plots? a. proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case; run; quit; b. proc reg data=statdata.bodyfat2 plots(only)= (QQ RESIDUALBYPREDICTED RESIDUALS); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case; run; quit; c. both of the above

ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(label) COOKSD(label) DFFITS(label) DFBETAS(label)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; Program b is almost correct, but the images must be created for the data sets to be saved. Program c tells SAS to create the images and save them into their own data sets. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which program correctly saves information from influential plots into individual output data sets? Assume that ODS GRAPHICS is on. a. proc reg data=statdata.bodyfat2; PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; run; quit; b. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots=none; PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; c. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(label) COOKSD(label) DFFITS(label) DFBETAS(label)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; d. ods output outputstatistics; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; run; quit;

The response variable can have more than two levels as long as one of the levels is coded as 0. In binary logistic regression, the response variable can only have two levels. Review: Modeling a Binary Response

Which statement about binary logistic regression is false? a. Binary logistic regression uses predictor variables to estimate the probability of a specific outcome. b. To model the relationship between a predictor variable and the probability of an outcome, you must use a nonlinear function. c. The mean of the response in binary logistic regression is a probability, which is between 0 and 1. d. The response variable can have more than two levels as long as one of the levels is coded as 0.

All main effects and interactions that remain in the final model must be significant. Backward elimination results in a final model that can contain one or more main effects and (if specified) interactions. Any interactions in the final model must be significant. Main effects that are involved in interactions must appear in the final model, whether or not they are significant. Review: The Backward Elimination Method of Variable Selection

Which statement about the backward elimination method is false? a. Backward elimination is a method of selecting variables for a logistic regression model. b. Backward elimination removes effects and interactions one at a time. c. All main effects and interactions that remain in the final model must be significant. d. To obtain a more parsimonious model, you specify a smaller significance level.

reference Typically, the original data set is split into two subset data sets called the training and validation data sets. However, in some situations, the data is split into three subsets, and the third of these is called the test data set. Review: Using PROC GLMSELECT to Build a Predictive Model

With a large enough data set, observations can be divided into three subset data sets for use in honest assessment. Which of the following is not the name of one of these three subset data sets? a. training b. validation c. reference d. test
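As a rough sketch of the three-way partition described above, the following plain Python (not SAS) shuffles the observations and slices them into training, validation, and test subsets. The function name `three_way_split`, the 60/20/20 fractions, and the seed are assumptions made for this example; the source does not prescribe particular proportions.

```python
import random


def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test lists.

    The fractions are illustrative defaults; whatever remains after the
    training and validation slices becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


training, validation, test = three_way_split(range(100))
print(len(training), len(validation), len(test))
```

Each observation lands in exactly one subset, so the test set stays untouched while the model is fit and tuned.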

Parameters

____________ are evaluations of characteristics of populations. They are generally unknown and must be estimated through the use of samples. A sample is a group of measurements from a population. In order for inferences to be valid, the sample should be representative of the population. A. Metrics B. Parameters

Scatter Scatter plots are useful to accomplish the following: explore the relationships between two variables; locate outlying or unusual values; identify possible trends; identify a basic range of Y and X values; communicate data analysis results.

____________ plots are two-dimensional graphs produced by plotting one variable against another within a set of coordinate axes. The coordinates of each point correspond to the values of the two variables. A. Box B. Histogram C. Scatter

Within Group Variation

The variability not explained by the model. It is also referred to as within-treatment variability or the residual sum of squares. It is calculated as the sum of the squared differences between each observed value and the mean for its group. This measure is also referred to as the Error Sum of Squares (SSE). A. Total Variation B. Between Group Variation C. Within Group Variation
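The calculation described above can be worked through on a tiny made-up data set. This is a plain Python sketch (not SAS); the group labels and values are invented for illustration, and the helper names are assumptions.

```python
from statistics import mean


def within_group_ss(groups):
    """Error sum of squares (SSE): squared deviations from each group's own mean."""
    total = 0.0
    for values in groups.values():
        group_mean = mean(values)
        total += sum((y - group_mean) ** 2 for y in values)
    return total


def total_ss(groups):
    """Total sum of squares: squared deviations from the grand mean."""
    all_values = [y for values in groups.values() for y in values]
    grand_mean = mean(all_values)
    return sum((y - grand_mean) ** 2 for y in all_values)


# Invented data: two treatment groups with three observations each.
groups = {"A": [1, 2, 3], "B": [4, 6, 8]}

sse = within_group_ss(groups)   # (1 + 0 + 1) + (4 + 0 + 4) = 10
sst = total_ss(groups)          # 9 + 4 + 1 + 0 + 4 + 16 = 34
print(sse, sst, sst - sse)      # the difference is the between-group variation
```

Here the group means are 2 and 6 and the grand mean is 4, so the between-group variation is 3(2 - 4)^2 + 3(6 - 4)^2 = 24, which together with the SSE of 10 accounts for the total of 34.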

