STA 363 Final
multiple linear regression phrasing conclusions
(response) increases on average by (b1) units for every one-unit increase in X1 when other variables are held constant
True or False? Consider two models for the same data. Model 1 has AIC = -32.9, and model 2 has AIC = -28.8. Model 2 is the better fitting model to the data
false
True or False? Cross-validation is a method used to determine what variables are significant in a statistical model
false
True or False? In ANCOVA models, we typically start by fitting a no-interaction model and then simplify the model if warranted
false
True or False? In a multiple regression model given by Y= B0 + B1X1 + B2X2 + B3X3 + e, B1 can be correctly interpreted as the mean change in Y given a one-unit increase in X1
false
True or False? In a one-way ANOVA, the F-Test tests the null hypothesis H0: μ1 = μ2 = ... = μk = 0
false
True or False? Multicollinearity is a situation in a multiple regression where some of the predictors are related to the response variable
false
Repeated measures ANOVA assumptions
independence (between subjects), constant error variance, and normality; these CANNOT be checked with autoplot
Simple linear regression assumptions
independence (no plot), constant error variance (check scale-location), Normality (check QQ plot), and linearity (check residuals vs fitted)
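A minimal R sketch of these checks, assuming a hypothetical data frame dat with columns y and x1 and the ggfortify package for autoplot():
    library(ggfortify)
    fit <- lm(y ~ x1, data = dat)
    autoplot(fit)  # residuals vs fitted, QQ plot, scale-location, residuals vs leverage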
two-way ANOVA assumptions
independence (no plot), constant error variance (check scale-location), and Normality (check QQ plot)
Blocked ANOVA assumptions
independence (no plot), constant error variance (check scale-location), and normality (check QQ plot)
one-way ANOVA assumptions
independence (no plot), constant error variance (check scale-location), and normality (check QQ plot)
multiple linear regression assumptions
independence (no plot), constant error variance (check scale-location), normality (check QQ plot), and linearity (check residuals vs fitted)
t-test assumptions
independence, normality, and equal variance (unless we are doing the unequal-variances test)
two-way ANOVA EDA
interaction plot (line plot): use the x-axis for one factor and separate lines for the other
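A minimal R sketch of this EDA step, assuming a hypothetical data frame dat with two factors A and B and a numeric response y:
    # x-axis for one factor (A), separate lines for the other (B); plots group means
    with(dat, interaction.plot(x.factor = A, trace.factor = B, response = y))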
categorical predictors... understand coding of dummy variables
interpretations of model coefficients are based on the reference category: all categories are compared to the reference. Since the reference category is represented by all dummy variables equal to 0, the intercept represents the group that is in the reference category for all categorical predictors
Poisson Regression
is appropriate in some applications when the response variable of interest is a count
multiple linear regression usage
multiple linear regression is used for similar purposes as simple linear regression, but it is used when we have multiple predictors of interest
A confidence interval for the mean response in a regression model is...
never wider than the corresponding prediction interval for the response
Repeated measures ANOVA EDA
profile plots (line plots grouped by subject, so each line represents one subject)
Model coefficients... other coefficients (when exponentiated)
represent odds ratios. Remember that effects are multiplicative (e.g., a two-unit increase in a predictor multiplies the odds by (e^β)^2)
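A small R sketch, assuming a hypothetical fitted logistic regression object log_fit with a predictor named x1:
    exp(coef(log_fit))           # exponentiated slopes are odds ratios per one-unit increase
    exp(confint(log_fit))        # confidence intervals on the odds-ratio scale
    exp(coef(log_fit)["x1"])^2   # multiplicative effect of a two-unit increase in x1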
Unusual observations... outliers
standardized residuals larger than +/- 3
Unusual observations... what plot do we look at?
the residuals vs. leverage plot. Residuals: the difference between the observed value of the response and the predicted value. Leverage: a measure of how far an observation's predictor values are from those of the other observations
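A base-R sketch, assuming a hypothetical fitted linear model fit:
    plot(fit, which = 5)   # residuals vs leverage plot
    rstandard(fit)         # standardized residuals (flag values beyond about +/- 3)
    hatvalues(fit)         # leverage of each observation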
Two variables are said to interact if...
the effect that one of them has on the response depends on the value of the other
True or False? Logistic regression is a modeling tool for binary response variables. However, you can use either quantitative or qualitative predictors in a logistic regression model
true
True or False? Poisson regression is a type of generalized linear model useful for data where the response variable Y is a count
true
True or False? R^2 is a poor means by which to compare the quality of fit of two models because R^2 will never decrease when predictors are added to the model
true
when are reduced f-tests used?
when we want to compare our full model to a "reduced" model, which has fewer predictors. ONLY for linear models; for GLMs, use a likelihood ratio test instead
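A minimal R sketch, assuming hypothetical nested fits (reduced_fit/full_fit are lm objects, reduced_glm/full_glm are glm objects):
    anova(reduced_fit, full_fit)                  # reduced f-test for linear models
    anova(reduced_glm, full_glm, test = "Chisq")  # likelihood ratio test for GLMs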
one-way ANOVA usage
when we want to compare the means of a response at two or more factor levels of one factor of interest
Paired t-test usage
when we want to compare the means of two samples, and there is a natural pairing between measurements (same subjects measured at different times)
Blocked ANOVA null hypothesis
α1 = α2 = ... = αk = 0 OR μ1 = μ2 = ... = μk
one-way ANOVA null hypothesis
α1 = α2 = ... = αk = 0 OR μ1 = μ2 = ... = μk
one-way ANOVA model form
Yij = μ + αi + εij, where μ is the overall mean, αi is the effect of group i, and εij is the error
Blocked ANOVA model form
Yij = μ + αi + βj + εij, where μ is the overall mean, αi is the effect of group i, βj is the effect of block j, and εij is the error
t-test model form
Yij = μ + αi + εij, for i = 1, 2, where μ is the overall mean, αi is the effect of group i, and εij is the error (a special case of ANOVA)
two-way ANOVA model form
Yij = μ + αi + βj + (αβ)ij + εij, where μ is the overall mean, αi is the effect for factor 1, βj is the effect for factor 2, (αβ)ij is the interaction, and εij is the error
R^2
higher is better
one-way ANOVA follow up procedures
if the ANOVA f-test is significant for the factor of interest, Tukey or Dunnett multiple comparisons
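A minimal R sketch, assuming a hypothetical data frame dat with response y and factor group; Dunnett comparisons use the multcomp package:
    fit <- aov(y ~ group, data = dat)
    TukeyHSD(fit)                                        # all pairwise comparisons
    library(multcomp)
    summary(glht(fit, linfct = mcp(group = "Dunnett")))  # each level vs the reference level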
Simple linear regression model form
Y = β0 + β1X1 + ε (be sure to define Y and X1)
Simple linear regression null hypothesis
t-test: β1 = 0 which means predictor has no effect on response variable
t-test usage
When we want to compare the means of two independent samples
Simple linear regression phrasing conclusion
(response) increases on average by (b1) units for every one-unit increase in X1
When to use any of the different models (a) ANOVA (b) Linear Regression (c) Generalized Linear Models
(a) One-way, Two-way, Blocked, Repeated Measures (b) simple and multiple (c) logistic regression, Poisson regression
t-test EDA
Box plot
Simple linear regression follow-up procedures
Box-Cox for potential power transformations if the model appears to be non-linear or has non-constant variance
ANCOVA... fitted models
When the dummy variable equals 0, the fitted model is β0 + β1X; when the dummy variable equals 1, the fitted model is (β0 + β2) + (β1 + β3)X
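A minimal R sketch of the corresponding interaction model, assuming a hypothetical data frame dat with a numeric covariate x and a two-level factor group:
    fit <- lm(y ~ x * group, data = dat)
    summary(fit)
    # (Intercept) = beta_0, x = beta_1,
    # group dummy = beta_2 (intercept shift), x:group dummy = beta_3 (slope shift)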
two-way ANOVA null hypothesis
For factor 1: α1 = α2 = ... = αk = 0. For factor 2: β1 = β2 = ... = βk = 0
Cross-validation... what does the number of folds control?
How many groups we create from the data for testing sets
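A small base-R sketch of assigning rows to folds, assuming a hypothetical data frame dat:
    k <- 5                                             # number of folds
    set.seed(363)
    folds <- sample(rep(1:k, length.out = nrow(dat)))  # fold label for each row
    table(folds)                                       # roughly equal test-group sizes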
Blocked ANOVA follow up procedures
If the ANOVA f-test is significant for the factor of interest, Tukey or Dunnett multiple comparisons; NO multiple comparisons for blocking factor
Box-Cox
If the confidence interval for the optimal λ includes 1, then no transformation is needed; if it does not include 1, a transformation is appropriate. The peak of the curve represents the optimal power transformation (e.g., squaring or taking a square root)
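A minimal R sketch using MASS::boxcox, assuming a hypothetical fitted lm object fit:
    library(MASS)
    boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # log-likelihood vs lambda, with a 95% CI around the peak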
Use deviance to describe variability
If the model is a good fit, the null deviance should be large compared to the residual deviance. The null deviance is basically the total variation; the residual deviance is basically the error variation. A "good" model has relatively small error variation
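A small R sketch, assuming a hypothetical fitted glm object glm_fit:
    glm_fit$null.deviance                          # roughly total variation
    glm_fit$deviance                               # roughly error (residual) variation
    1 - glm_fit$deviance / glm_fit$null.deviance   # rough proportion of deviance explained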
two-way ANOVA follow up procedures
If there is a significant interaction, Tukey or Dunnett multiple comparisons for each factor at each level of the other factor; otherwise, treat same as one-way ANOVA
Paired t-test EDA
Look at profile of how each observation changes over time
Logistic Regression... relationships between p, odds, and log odds
Odds = p/(1-p) = P(success)/P(failure); log odds = log(odds)
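A quick numeric check in R:
    p <- 0.75
    odds <- p / (1 - p)    # 3: success is three times as likely as failure
    log_odds <- log(odds)  # logit(p)
    plogis(log_odds)       # inverse logit recovers p = 0.75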
Interpret model output from the chosen model
Same as any other linear model output at this point. Review: f-test, t-test, coefficients
Paired t-test model form
Same as t-test (with slightly more complicated error structure)
How to address multicollinearity
Scaling predictors means we standardize them by centering and scaling - every predictor is represented by Z-scores instead
t-test phrasing conclusion
There is a significant difference in the mean (response) between (group 1) and (group 2)
Main concepts behind model validation
Training data: fit the model (compute model coefficients). Test data: evaluate the model (compute RMSE)
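A minimal R sketch of one training/test split, assuming a hypothetical data frame dat with response y:
    set.seed(363)
    idx   <- sample(nrow(dat), size = 0.8 * nrow(dat))  # 80% of rows for training
    train <- dat[idx, ]
    test  <- dat[-idx, ]
    fit   <- lm(y ~ ., data = train)                    # fit on training data
    pred  <- predict(fit, newdata = test)               # predict on test data
    sqrt(mean((test$y - pred)^2))                       # test RMSE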
Multicollinearity
VIFs (a value > 10 indicates a multicollinearity issue)
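A one-line R check, assuming a hypothetical fitted lm object fit and the car package:
    car::vif(fit)  # values > 10 suggest a multicollinearity problem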
one-way ANOVA EDA
box plot
categorical predictors... understand how to interpret linear model coefficients for categorical predictors
comparing groups, e.g., the average difference in the response variable between males and females, or the average difference between ages 21-30 and ages 11-20
Choose models based on cross-validation output
check RMSE values
Best Subsets method
checks every combination of predictors. Step-wise selection only checks some of the models
Main limitation for best subsets?
computation is slow
Benefits of model validation?
eliminates the bias that comes from using the same data for both fitting and evaluation
categorical predictors... understand coding of dummy variables (Predictors with 3+ factor levels)
have to choose a "reference" category, and set up k-1 dummy variables, where k is the number of factor levels
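A small R sketch showing the dummy coding R builds automatically, assuming a hypothetical data frame dat whose column group is a factor:
    head(model.matrix(~ group, data = dat))  # intercept plus k-1 dummy columns
    contrasts(dat$group)                     # shows which level is the reference category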
categorical predictors... understand coding of dummy variables (Binary Predictors)
just one dummy variable, coded 0/1
Poisson Regression... model form
log(λ) = β0 + β1X1 + ...
Logistic Regression... model form
logit(p) = β0 + β1X1 + ...
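Minimal R sketches for the two GLM forms above, assuming a hypothetical data frame dat with a 0/1 response y, a count response counts, and predictors x1 and x2:
    log_fit  <- glm(y ~ x1 + x2, family = binomial, data = dat)      # logistic: logit link
    pois_fit <- glm(counts ~ x1 + x2, family = poisson, data = dat)  # Poisson: log link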
Unusual Observations... high-leverage
look for natural gaps in the leverage (x-axis) - can also compute a threshold
Reduced f-tests... how to interpret the output
look for the f-stat, degrees of freedom, and the p-value in the output
BIC
lower is better; a criterion for model selection among a finite set of models
AIC
lower is better; an estimator of out-of-sample prediction error
ANCOVA... understand when you can and cannot interpret main effects (like two-way ANOVA)
main effects are the coefficients for the non-interaction terms. If there is a significant interaction, cannot interpret main effects.
Paired t-test follow up procedure
none
t-test follow-up procedures
none
Repeated measures ANOVA model form
same as one-way or two-way ANOVA, depending on context (with a slightly more complicated error structure)
Simple linear regression EDA
scatterplot
multiple linear regression EDA
scatterplot matrix
Reduced f-tests... variables are being tested based on code
should be straightforward: whatever variables are left out of the reduced model. We are NOT testing the variables present in both models
Blocked ANOVA EDA
should include the block factor in the plot. For a one-way design, use linetypes or color, or a boxplot with facets over the blocks; for a two-way blocked design, use an interaction plot, possibly faceted over the blocks
Interpret model output from stepwise selection output
shows each iteration with AIC values as well as which variables were removed or added at each step
two-way ANOVA phrasing conclusions
similar to one-way ANOVA, but may be more complicated if we have significant interactions
Step-wise selection (forward)
start with the empty (intercept-only) model; MUST specify the scope
Step-wise selection (backward)
start with the full model
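A minimal R sketch for the forward and backward step-wise cards above, assuming a hypothetical data frame dat with response y:
    null_fit <- lm(y ~ 1, data = dat)  # empty (intercept-only) model
    full_fit <- lm(y ~ ., data = dat)  # all candidate predictors
    step(null_fit, scope = formula(full_fit), direction = "forward")  # forward: scope is required
    step(full_fit, direction = "backward")                            # backward: start from the full model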
Model coefficients... Exponentiated intercept
the odds of (success) when all predictors are equal to 0. This may include dummy variables, so you must know which factor level is the reference category
full model f-test is a special case where...
the reduced model is an intercept-only model; notation in R: response ~ 1
one-way ANOVA phrasing conclusions
there is a significant difference in the mean (response) between (factor level) and (other factor level)
Blocked ANOVA phrasing conclusion
there is a significant difference in the mean (response) between (factor level) and (other factor level), adjusting for (block factor)
Paired t-test phrasing conclusion
there is a significant difference in the mean (response) between (group 1) and (group 2)
Violations of the linearity assumption in a regression model may be addressed by...
transforming one or more of the predictor variables
multiple linear regression follow up procedures
trying different sets of predictors or transformations to make the model a better fit (checking adjusted R squared)
Paired t-test Assumptions
independence (between pairs) and normality of the differences
unusual observations... what can be done about them
verify that they are legitimate data entries; if so, they should not be removed. Can use a dummy variable to represent a single observation, and can fit the model both ways to see if the results differ
ANCOVA... test for significant interactions
we can use either the t-test in the model output or a reduced f-test to test this. Leave the interaction out of the model if its p-value is not less than 0.05 (prefer the simpler model)
two-way ANOVA usage
when the model has two factors of interest
Blocked ANOVA usage
when there is a confounding factor that either has a known effect or we are not interested in its effect
Simple linear regression usage
when we are trying to estimate the relationship between one predictor and a response
Repeated measures ANOVA usage
when we have a within-subjects factor, or multiple measurements for each experimental unit
t-test null hypothesis
μ1 = μ2
Paired t-test Null hypothesis
μ1 = μ2 OR the true mean difference D = 0