SAS 240: Regression Modeling

AIC, AICC, BIC, and SBC

- Formulas for all information criteria begin with the same calculation.
- The penalty term for model complexity makes information criteria a useful means of comparing models with different numbers of parameters.
- The best model is the one with the smallest information criterion value.
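
As a rough illustration of the shared starting calculation plus penalty, here is a minimal Python sketch using the textbook likelihood-based forms of AIC, AICC, and SBC. These forms are an assumption: SAS procedures such as PROC GLMSELECT use error-sum-of-squares variants, and Sawa's BIC is not shown. The function name and all input values are hypothetical.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, AICC, and SBC for a likelihood-based model (textbook forms)."""
    base = -2.0 * log_likelihood  # the calculation every criterion starts from
    aic = base + 2 * n_params     # penalty: 2 per parameter
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    sbc = base + n_params * math.log(n_obs)  # penalty grows with sample size
    return {"AIC": aic, "AICC": aicc, "SBC": sbc}

# Hypothetical fits: the larger model improves the likelihood only slightly.
small = information_criteria(log_likelihood=-120.0, n_params=3, n_obs=200)
large = information_criteria(log_likelihood=-118.5, n_params=8, n_obs=200)

for name in ("AIC", "AICC", "SBC"):
    best = "small" if small[name] < large[name] else "large"
    print(name, round(small[name], 2), round(large[name], 2), "prefer:", best)
```

In this made-up comparison the small gain in fit does not offset the penalty for five extra parameters, so every criterion prefers the smaller model.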

assumptions of logistic regression

1) The dependent variable is binary. 2) Observations are independent. 3) Little to no multicollinearity (no strong correlation among the independent variables). 4) Independent variables are linearly related to the log odds. 5) Large sample size (at least 10 cases per variable).

goodness of fit logistic regression

A pair is concordant if the model predicted the order correctly. In other words, a pair is concordant if the observation with the desired outcome has a higher predicted probability, based on the model, than the observation without the outcome. A pair is discordant if the model did not predict the order correctly. That is, a pair is discordant if the observation with the desired outcome has a lower predicted probability than the observation without the outcome. If the predicted outcome probabilities are the same, the pair is tied. In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.

Cook's distance (D)

A statistic used to identify influential cases in the data set; values of 1 or greater indicate excessive influence. It measures the change in the parameter estimates that results from deleting each observation.

Forward selection

After a variable is added to the model, it stays in, even if it becomes non-significant later.

AIC

Akaike's Information Criterion (lower score the better)

Which assumption does Collinearity violate?

Collinearity causes instability in the model by inflating the variance of the parameter estimates, which raises the p-values. However, it doesn't violate any assumptions.

Quantile-Quantile plots (Q-Q plots)

Compares ordered variable values with quantiles of a specified theoretical distribution. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Thus, you can use a Q-Q plot to determine how well a theoretical distribution models a set of measurements. Detects violations of normality.

WELCH

Compensate for heterogeneous variances

Effect of sample size on P Value and Power

For a given effect size and sample size, as alpha is decreased power is also decreased.
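
A minimal Python sketch of this relationship, assuming a one-sided, one-sample z-test with known standard deviation 1 (the test type, function name, and numbers are illustrative assumptions, not from the course):

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha):
    """Power of a one-sided, one-sample z-test with sigma = 1 (assumed setup)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    # Probability the test statistic exceeds the threshold when the effect is real.
    return 1 - z.cdf(critical - effect_size * n ** 0.5)

power_05 = z_test_power(effect_size=0.3, n=50, alpha=0.05)
power_01 = z_test_power(effect_size=0.3, n=50, alpha=0.01)
power_big_n = z_test_power(effect_size=0.3, n=100, alpha=0.01)
print(round(power_05, 3), round(power_01, 3), round(power_big_n, 3))
```

With effect size and sample size held fixed, lowering alpha raises the rejection threshold and so lowers power. Increasing the sample size raises power again.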

Empirical logit plots

For continuous predictor variables, the plots should be fairly linear if the assumptions of the standard logistic regression model were met.

Compared to a multiple comparisons test that controls the experimentwise error rate, what characteristics will a multiple comparisons test that only controls the comparisonwise error rate tend to have?

If only the comparisonwise error rate is controlled, the overall risk of a Type I error across all the comparisons is increased (and therefore the risk of Type II error is decreased), so the test might find more significant differences than would otherwise be found.
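
To see how quickly the overall Type I risk grows, here is a small Python sketch. It assumes the comparisons are independent, which real pairwise comparisons usually are not, so treat the result as an approximation. The function name is hypothetical.

```python
def experimentwise_error_rate(alpha, n_comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

# Four groups give 6 pairwise comparisons, each tested at alpha = 0.05.
for m in (1, 3, 6, 10):
    print(m, round(experimentwise_error_rate(0.05, m), 3))
```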

r-squared

Is a goodness of fit measure. It measures the proportion of variability explained by the model by dividing the model sums of squares by the total sums of squares.
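
The ratio can be checked by hand on a small example. The Python sketch below fits a simple least-squares line to hypothetical data and divides the model sum of squares by the total sum of squares. The data and function name are made up for illustration.

```python
def simple_regression_r_squared(x, y):
    """Fit y = b0 + b1*x by least squares and return SSM / SST."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    ssm = sum((p - mean_y) ** 2 for p in predicted)  # model sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)        # total sum of squares
    return ssm / sst

r2 = simple_regression_r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 3))
```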

Influential Point

Is any point that has a large effect on the slope of a regression line fitting the data. They are generally extreme values. The process to identify an [term] begins by removing the suspected [term] from the data set. If this removal significantly changes the slope of the regression line, then the point is considered an [term]

When may a categorical variable cause a problem or an inefficiency in predictive modeling?

It has levels that rarely occur; it has too many different levels; it has a level that almost always occurs.

Benefits of cut off point

It helps maximize True Positive and minimize False Positive rates

What is the role of validation dataset?

It is used for model selection

What is the role of test data set (suppose data is partitioned to training, validation, and test datasets)?

It will be used to assess the final model

lsmeans

Least Square Means. In contrast to the MEANS statement, the [term] statement performs multiple comparisons on interactions as well as main effects.

CP

Mallows' statistic

MSE

Mean Squared Error

Kolmogorov-Smirnov (KS) statistic

Measures the ability of the model in separating the positive and negative events.

standard error

Measures the variability associated with the sample mean, xbar.

If there is no correlation among the predictor variables, can there still be collinearity in the model?

No

NOBS

Number of observations used

Brier Score

The [term] is the weighted squared difference between the predicted probabilities and their observed response levels. For events/trials syntax, the [term] reliability is the weighted squared difference between the predicted probabilities and the observed proportions
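
A minimal Python sketch of the single-trial case, assuming the "weighted squared difference" is a weighted average normalized by the sum of the weights (that normalization, the function name, and the data are assumptions):

```python
def brier_score(predicted, observed, weights=None):
    """Weighted average of squared differences between predictions and 0/1 outcomes."""
    if weights is None:
        weights = [1.0] * len(predicted)  # unweighted case
    total = sum(w * (p - y) ** 2 for w, p, y in zip(weights, predicted, observed))
    return total / sum(weights)

# Hypothetical predicted probabilities and observed responses.
score = brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0])
print(round(score, 3))
```

Lower values indicate predicted probabilities that sit closer to the observed outcomes.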

Error rate Or Misclassification Rate

The [term] of the model equals the number of false positives plus the number of false negatives over the total number of cases. Example: (FP + FN) / total = (10 + 5) / 165 = 0.09. Equivalent to 1 minus Accuracy.

Accuracy

The [term] of the model equals the number of true positives plus the number of true negatives over the total number of cases
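
Accuracy and error rate can be computed together from the four confusion-matrix counts. In this Python sketch, FP = 10, FN = 5, and the total of 165 come from the error-rate card; the split of the remaining 150 cases into TP = 100 and TN = 50 is a hypothetical assumption.

```python
def accuracy_and_error_rate(tp, tn, fp, fn):
    """Return (accuracy, error rate) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    return accuracy, error_rate

# TP/TN split below is hypothetical; FP, FN, and the total follow the card.
accuracy, error_rate = accuracy_and_error_rate(tp=100, tn=50, fp=10, fn=5)
print(round(accuracy, 2), round(error_rate, 2))
```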

ROCCONTRAST

The [term] statement implements the nonparametric approach of DeLong, DeLong, and Clarke-Pearson (1988) to compare the three ROC curves

Power of a test increases

The chances of a Type II error decrease

Gains Chart

The horizontal axis is the depth, which refers to the proportion of cases exceeding different cutoffs of the predicted probabilities.

In a logistic regression model, ________ and _______ have a linear relationship.

The logit of the predicted probability and independent (predictor) variables

How many levels can a response variable have in logistic regression?

The response variable can have two levels (binary) or more than two levels (ordinal or nominal); PROC LOGISTIC does not require any level to be coded as 0.

Degrees of Freedom

Total DF = Model DF + Error DF. There are 24 data rows in the data set, so Total DF = 24 - 1 = 23. With Error DF = 15, Model DF = 23 - 15 = 8.

Interaction Plot

When the difference between the group means of one variable changes at different levels of another variable, a possible interaction exists between the variables. This interaction is displayed as nonparallel lines in an interaction plot.

Student Residuals

You can use [term] to detect outliers. To detect influential observations, you can use DFFITS and Cook's D statistics, and [term]

Specificity

[term] = TN / (TN + FP) (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

sensitivity

[term] = TP / (TP + FN): the true positives divided by the total actual positives. Also called the true positive rate or probability of detection.

C Statistic

[term] = percent concordant + 0.5 (percent tied)
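
The pair counting behind this formula can be sketched in Python as follows. The sketch compares every event with every non-event using exact predicted probabilities; SAS may bin probabilities before comparing, so its counts can differ slightly. The function name and data are hypothetical.

```python
def concordance(event_probs, nonevent_probs):
    """Count concordant, discordant, and tied pairs; return them with the c statistic."""
    concordant = discordant = tied = 0
    for p_event in event_probs:
        for p_nonevent in nonevent_probs:
            if p_event > p_nonevent:
                concordant += 1   # model ordered the pair correctly
            elif p_event < p_nonevent:
                discordant += 1   # model ordered the pair incorrectly
            else:
                tied += 1
    pairs = len(event_probs) * len(nonevent_probs)
    c_stat = (concordant + 0.5 * tied) / pairs
    return concordant, discordant, tied, c_stat

# Hypothetical predicted probabilities for observations with and without the event.
result = concordance([0.8, 0.6, 0.4], [0.7, 0.4, 0.2])
print(result)
```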

COVRATIO

[term] statistic measures the change in the determinant of the covariance matrix of the estimates by deleting the ith observation

DFBETAS

[term] statistics are the scaled measures of the change in each parameter estimate and are calculated by deleting the ith observation

Describes the model sums of squares, or SSM, in one-way ANOVA

____ is the variability explained by the predictor variable, and therefore, it measures the variability between the groups.

confusion matrix

a cross tabulation of predicted classes and actual classes. It quantifies the confusion of the classifier.

Type III sum of squares

all listed effects are adjusted for all other effects in the table, so order is not important. [term] also called the partial sum of squares, is the increase in the model sum of squares due to adding that variable to a model that already contains all the other variables.

VALIDATE

average square error over the validation data

Dunnett's method

compares all categories to a control group.

Tukey's Method

compares all possible pairs of means

sample size influence

effect of the number of trials on the p-value

GMSEP

estimated MSE of prediction, assuming multivariate normality

Type II Error

failure to reject the null hypothesis when it is false

Lift

gain at a certain depth / depth
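
A short Python sketch of this ratio, assuming "gain" means the share of all events captured among the top-scored cases at a given depth (that reading, the function name, and the data are assumptions):

```python
def gain_and_lift(probs, outcomes, depth):
    """Return (gain, lift) for the top `depth` fraction of cases ranked by probability."""
    ranked = sorted(zip(probs, outcomes), key=lambda pair: pair[0], reverse=True)
    n_selected = int(round(depth * len(ranked)))
    captured = sum(outcome for _, outcome in ranked[:n_selected])
    gain = captured / sum(outcomes)  # share of all events found at this depth
    return gain, gain / depth

# Hypothetical scores and 0/1 outcomes (4 events in 10 cases).
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
gain_20, lift_20 = gain_and_lift(probs, outcomes, depth=0.2)
gain_100, lift_100 = gain_and_lift(probs, outcomes, depth=1.0)
print(gain_20, lift_20, gain_100, lift_100)
```

At full depth every case is selected, so the lift falls back to 1.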

decision rules in data mining

Higher cutoffs decrease sensitivity and increase specificity. The use of the central cutoff, π1, is recommended when no profit matrix is given. The central cutoff tends to maximize the mean of sensitivity and specificity. The plug-in Bayes rule might not achieve the maximum profit if the posterior probability is poorly estimated. A cutoff equal to the proportion of events in the population tends to maximize the mean of sensitivity and specificity.
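
The trade-off between sensitivity and specificity can be seen by classifying the same scores at two cutoffs. This Python sketch uses hypothetical data and treats a case as a predicted event when its probability is at or above the cutoff (an assumed convention).

```python
def sensitivity_specificity(probs, outcomes, cutoff):
    """Classify as an event when prob >= cutoff; return (sensitivity, specificity)."""
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, outcomes) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predicted probabilities and 0/1 outcomes.
probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
low_sens, low_spec = sensitivity_specificity(probs, outcomes, cutoff=0.25)
high_sens, high_spec = sensitivity_specificity(probs, outcomes, cutoff=0.65)
print(low_sens, low_spec, high_sens, round(high_spec, 3))
```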

Collinearity

increases the variance of the parameter estimates, which also increases the prediction error of the model

Outlier Point

is a data point that diverges from an overall pattern in a sample. An [term] has a large residual (the distance between the predicted value (ŷ) and the observed value (y)). [term] lower the significance of the fit of a statistical model because they do not coincide with the model's prediction.

Cramer's V

is a measure of association derived from the Pearson chi-square. It is designed so that the attainable upper bound is always 1.

Hoeffding's measure of dependence, D

is a nonparametric measure of association that detects more general departures from independence. Measured from -0.5 to 1. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables. Use it to determine whether any nonlinear associations exist between the variables.

DFFITS

is a scaled measure of the change in the predicted value for the ith observation and is calculated by deleting the ith observation. A large value indicates that the observation is very influential in its neighborhood of the X space.

disordinal interaction

lines cross. Interaction effects.

ordinal interaction

lines do not cross. Parallel = Not Significant. Not parallel = Significant

What are data_1 and data_2? proc logistic data=TEST.PREDICT_8 outest=data_1; class marital_status_ gender_ race_; model pass_fail = marital_status_ gender_ race_; score data=score_data out=data_2; run;

data_1 (OUTEST=) contains the parameter estimates; data_2 (SCORE OUT=) contains the scored observations with their predicted probabilities.

PRESS

predicted residual sum of squares statistic. CHOOSE=[term]

PROC CLUSTER SYNTAX

proc cluster data=mileages method=ward pseudo; ID City; run;

Type I Error

rejecting the null hypothesis when it is true

WARD

requests [term]'s minimum-variance method (error sum of squares, trace W). Distance data are squared unless you specify the NOSQUARE option. To reduce distortion by outliers, the TRIM= option is recommended. See the NONORM option.

variance inflation factors

requests diagnostic statistics to assess the magnitude of the collinearity problem. Several variance inflation factors are above 10. This indicates that collinearity among the predictor variables is present in the model. http://support.sas.com/resources/papers/proceedings17/1404-2017.pdf
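
For intuition about where values above 10 come from: the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The Python sketch below covers only the two-predictor case, where that R² equals the squared Pearson correlation. The function name and data are hypothetical.

```python
def vif_two_predictors(x1, x2):
    """VIF for either of two predictors: 1 / (1 - r^2), r = Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)
    return 1 / (1 - r_squared)

# Hypothetical predictors: x2 nearly duplicates x1, x3 is uncorrelated with x1.
x1 = [1, 2, 3, 4, 5]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
x3 = [2, 1, 3, 1, 2]
vif_collinear = vif_two_predictors(x1, x2)
vif_uncorrelated = vif_two_predictors(x1, x3)
print(round(vif_collinear, 1), round(vif_uncorrelated, 1))
```

The near-duplicate predictor produces a VIF far above 10, while the uncorrelated one stays at 1.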

PDIFF

requests that p-values for differences of the LS-means be produced. ALL requests pairwise comparisons for all pairs, and CONTROL defines a control level to compare against.

Centroid

requests the [term] method (unweighted pair-group method using centroids, UPGMC, [term] sorting, weighted-group method). Distance data are squared unless you specify the NOSQUARE option.

SLE

sets criterion for entry into model

SL

significance level of the statistic used to assess an effect's contribution to the fit when it is added to or removed from a model

PROC LOGISTIC CLASS statement

specifies the classification variables to be used in the analysis. The CLASS statement must precede the MODEL statement.

SMOTE

synthetic minority oversampling technique. To oversample the minority (25%) to balance the data set, 4 synthetic points need to be created along each pair. This way, instead of the two original points, we'll have 6 points; making the count of the minority event cases equal to the non-event cases.
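
The "points along each pair" step can be sketched in Python as random interpolation between two minority-class points. This covers only the interpolation step; full SMOTE also selects nearest minority neighbors, which is omitted here. The function name, seed, and coordinates are hypothetical.

```python
import random

def synthetic_points(point_a, point_b, count, seed=0):
    """Create `count` synthetic points on the segment between two minority points."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    points = []
    for _ in range(count):
        t = rng.random()  # position along the segment, between 0 and 1
        points.append(tuple(a + t * (b - a) for a, b in zip(point_a, point_b)))
    return points

# Two hypothetical minority cases; 4 synthetic points turn 2 cases into 6.
a, b = (1.0, 2.0), (3.0, 6.0)
minority = [a, b] + synthetic_points(a, b, count=4)
print(len(minority))
```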

effect size

the difference between the observed statistic and the hypothesized value

Oversampling affects

the intercepts, false positive and false negative rates

Oversampling Does not Affect

the parameter estimates, ROC Curve, sensitivity

alpha

the probability of committing a type I error

power

the probability that you correctly reject the null hypothesis

Which of the following does PROC GLMSELECT use to select a model from the candidate models when a validation data set is provided?

the smallest overall validation average squared error

HOVTEST=LEVENE

widely considered to be the standard homogeneity of variance test

F-Test

The [term] in the ANOVA table tests the global hypothesis for the model. The [term]s in the Type I and Type III tables, as well as the t tests in the parameter estimates table, test only individual effects. The R-square and Adjusted R-square are measures of model fit.

mi

SAS Procedure to impute missing values

BIC

Sawa's Bayesian Information Criterion

SBC

Schwarz Bayesian information criterion

SLS

Sets criterion for staying in model

Data partitioning is performed for the purpose of model assessment. Which type of sampling method should be used?

Simple/Stratified random sampling without replacement

proc corr data outs=work

Spearman's statistic

RStudent

Studentized residual. Used to measure influence, outliers, and leverage.

The location and spread of a normal distribution depend on the value of?

The Mean and Standard Deviation

How can you recognize an interaction?

By comparing the effect of one variable on the response at different levels of another using group means, or plotting the means to investigate different effect patterns on the response for a variable at different levels of another.

Which statements should be used to split data into a training and validation set?

The PARTITION statement specifies that the original data set, housing, be split. The FRACTION option specifies the fraction of the original data set (as a decimal value) to be placed in the holdout data set. The training data set contains the remaining observations, that is, those that were not allocated to the validation (or, if specified, test) data sets.

PROC LOGISTIC Syntax

PROC LOGISTIC <options>;
  CLASS variable </v-options>;
  EFFECTPLOT <plot-type <(plot-definition-options)>> </options>;
  MODEL response=<effects> </options>;
  ODDSRATIO <'label'> variable </options>;
  SCORE <options>;
  CODE <options>;
  UNITS <predictor1=list1> </option>;
RUN;

proc corr data outp=work

Pearson's Statistic

Pearson correlation coefficient

The Pearson correlation statistic is a measure of the linear relationship, or association, between two continuous variables. The closer the value is to -1, the stronger the negative linear relationship between the two variables; the closer it is to +1, the stronger the positive linear relationship. The closer the value is to 0, the weaker the linear relationship. A correlation coefficient of 0 means that no linear relationship exists between the two variables.

UNITS statement

The [term] computed an odds ratio for a 25-unit increase in Recent_Avg_Gift_Amt. With an odds ratio of 0.77, the odds change by 100*(0.77-1) = -23 percent, that is, a 23% decrease in the odds for each 25-unit increase.
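
The arithmetic can be sketched in Python. An odds ratio for a c-unit increase is exp(c × beta), where beta is the per-unit logit coefficient. The per-unit coefficient below is back-calculated from the 0.77 on the card purely for illustration, and the function names are hypothetical.

```python
import math

def odds_ratio_for_units(beta, units):
    """Odds ratio for a `units`-sized increase in a predictor with coefficient beta."""
    return math.exp(beta * units)

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return 100 * (odds_ratio - 1)

# Per-unit coefficient implied by an odds ratio of 0.77 per 25 units (illustrative).
beta_per_unit = math.log(0.77) / 25
or_25 = odds_ratio_for_units(beta_per_unit, 25)
print(round(or_25, 2), round(percent_change_in_odds(or_25), 1))
```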

Duncan Grouping

Result-guided test that compares the treatment means while controlling the comparison-wise error rate. Means with the same letter are not significantly different. You should use this test for planned comparisons only

Akaike's Information Criterion

This is a measure of fit. Lower score the better.

