Week 6: Multiple Regression
categorical variables in regression?
best to dummy code categorical variables (i.e., code them as 0 and 1); dummy coding turns categories into something a regression can treat as having a high (1) and low (0) score
stepwise (variable selection)
combines forward entry and backward removal; only the significant predictors remain --> each variable is entered in sequence and its contribution assessed; if adding it improves the model it is retained, and all the other variables are then re-tested to see whether they still contribute; those that no longer contribute are removed
how to deal with categorical predictor variables in multiple regression?
in general, a categorical variable with k levels will be transformed into k-1 variables each with two levels (e.g., if a categorical variable had six levels, then 5 dichotomous variables could be constructed that would contain the same information as the single categorical variable; dichotomous variables have the advantage that they can be directly entered into the regression model.)
consequence and solution of multicollinearity?
it leads to unstable regression estimates; solution: omit one of the highly correlated IVs from the model
sequential/ hierarchical variable selection
items are entered in a given order based on theory, logic, or practicality; done by block in SPSS (e.g., block 1: age, sex, education; block 2: the 8 illness perceptions) --> gives the researcher greater control over the regression process (predictor variables are entered in a specific order) -> used when the researcher has an idea of which predictors influence the DV (e.g., we might wish to control for demographic influences on adherence BEFORE examining the IPQ variables)
level of measurement in multiple linear regression
continuous DV; predictor variables can be continuous or categorical
how does regression report the mean?
it reports only one mean (as the intercept) and the difference between that mean/intercept and the other group mean
R-square interpretation multiple regression
the proportion of variance in the outcome of interest that is explained by the predictors
homoscedasticity: meaning in linear regression
the variance of the residuals is constant across all values of the predicted outcome (i.e., equal spread along the regression line)
advantages/disadvantages of enter method
(+) useful when the researcher does not know which IVs will create the best equation; the contribution made by each IV can be assessed; (-) we may end up with a large number of IVs relative to the sample size, and most of the IVs may not contribute
assumptions in multiple linear regression
(1) Normality - normally distributed residuals (2) No multicollinearity (3) Independence of errors (4) Homoscedasticity (5) Linearity
methods for selecting variables in a regression model:
(1) entry: default - all predictor variables; (2) selection procedures - to reduce the set of predictor variables; only retain the most important predictors
selection procedures (variable selection)?
(1) statistical regression models: forward selection, backward elimination, and stepwise (2) sequential regression: hierarchical
report regression results in the text, what to include:
(a) unstandardised or standardised slope (beta), whichever is more interpretable given the data; (b) the t statistic; (c) the corresponding significance level; (d) perhaps a confidence interval; (e) it is useful, although not customary, to report the percentage of variance explained along with the corresponding F-test
variable selection: statistical regression models
(i.e., forward selection, backward elimination, and stepwise) use a statistical criterion (e.g., variance explained) to select variables; takes many methodological decisions out of the hands of the researcher; ideal when there is no sound theoretical literature on which to base your model
variable selection: sequential regression
(i.e., hierarchical) items are entered in a given order based on theory, logic or practicality; allows the researcher greater control over the regression process; used when the researcher has an idea as to which predictors may impact the dependent variable
linear regression with ordinal outcome (e.g., Likert Scale)
- If you have a large sample and can assume an underlying continuum, treat the outcome as continuous --> linear regression
- If you cannot, use non-parametric alternatives (i.e., Kruskal-Wallis & Mann-Whitney) or ordinal regression
steps in conducting a regression analysis
1. Centre continuous variables (e.g., around the mean or another suitable value) 2. Dummy code categorical variables 3. Decide on the regression method (e.g., enter) 4. Run the analysis 5. Check the assumptions
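A minimal sketch of these five steps in Python with statsmodels, assuming a hypothetical DataFrame df with a continuous outcome adherence, continuous predictors age and illness_duration, and a categorical predictor ethnicity (all made-up column names):

```python
import pandas as pd
import statsmodels.api as sm

# 1. Centre continuous variables around their means
for col in ["age", "illness_duration"]:
    df[col + "_c"] = df[col] - df[col].mean()

# 2. Dummy code the categorical variable (the dropped level becomes the reference)
dummies = pd.get_dummies(df["ethnicity"], prefix="eth", drop_first=True, dtype=float)

# 3. Enter method: all predictors go in as one block
X = sm.add_constant(pd.concat([df[["age_c", "illness_duration_c"]], dummies], axis=1))
y = df["adherence"]

# 4. Run the analysis
model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, t-tests, R-square, F-test

# 5. Check the assumptions on the residuals (model.resid) -- see the later cards
```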
how to check independence of errors
Durbin-Watson statistic: values less than 1 or greater than 3 are cause for concern (some concern already when the value falls outside the range 1.5 to 2.5)
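A quick way to obtain the statistic outside SPSS, assuming model is the fitted OLS result from the sketch above:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
print(f"Durbin-Watson = {dw:.2f}")   # roughly 1.5-2.5 is usually acceptable
```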
backward elimination
Method of selecting variables for inclusion in the regression model that starts by including all independent variables and then eliminating those not making a significant contribution to prediction (i.e., a variable is dropped if removing it does not significantly decrease the amount of variance explained) -> proceeds to remove the non-significant variables one by one -> looks at all predictor variables together
multiple linear regression vs correlation (vs SLR)?
Simple linear regression and correlation give similar results; multiple linear regression and correlation DO NOT. In multiple linear regression: (a) you are using more than one predictor variable; (b) these predictor variables (Xs) may be correlated with each other; (c) this affects the outcome of the regression. You can separately correlate each X with Y to get an idea of the relationships, but those results will differ from the results of the regression when all IVs are analysed simultaneously
how are ANOVA and regression outputs similar for binary predictor?
They have the same F-statistic and p-value (significance). The intercept of the regression is the mean of the reference group (e.g., boys, coded 0); the regression coefficient for the predictor (e.g., gender) is the difference between the groups (e.g., boys and girls); thus, intercept + regression coefficient = mean of the other group (as reported in ANOVA)
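A small illustration of this equivalence, assuming hypothetical columns score and gender (0 = boys, 1 = girls) in df:

```python
import statsmodels.api as sm
from scipy import stats

boys = df.loc[df["gender"] == 0, "score"]
girls = df.loc[df["gender"] == 1, "score"]

fit = sm.OLS(df["score"], sm.add_constant(df["gender"])).fit()
print(fit.params["const"], boys.mean())                    # intercept = mean of reference group
print(fit.params["gender"], girls.mean() - boys.mean())    # slope = difference between groups
print(fit.fvalue, stats.f_oneway(boys, girls).statistic)   # same F-statistic as one-way ANOVA
```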
differences ANOVA and Regression: how to assess which means (of 3 or more categorical predictors) differ?
while ANOVA requires post-hoc analysis to find that out, regression reports the differences between the reference group and each other group
variable selection: blocking
allows us to enter variables in blocks (e.g., block 1: age, sex, education; block 2: duration of illness; block 3: illness perceptions); SPSS also allows us to use the stepwise method within each block
Forward Selection
builds a model by adding predictors one at a time, in order of how much variance they explain; non-significant variables are excluded -> looks at each predictor separately
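A minimal forward-selection sketch using p-values as the statistical criterion (other criteria are possible), with the same hypothetical df and a made-up candidate list:

```python
import statsmodels.api as sm

candidates = ["age", "illness_duration", "education"]
selected = []

while candidates:
    # p-value each remaining candidate would have if added to the current model
    pvals = {}
    for var in candidates:
        X = sm.add_constant(df[selected + [var]])
        pvals[var] = sm.OLS(df["adherence"], X).fit().pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= 0.05:        # stop when no remaining candidate is significant
        break
    selected.append(best)
    candidates.remove(best)

print("Retained predictors:", selected)
```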
dummy coding
creating dichotomous variables from categorical variables
adjusted r-squared
an estimate of what R-square would be in the population (it is much smaller than R-square when the number of IVs is large relative to the sample size)
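One common formula for the adjustment, stated here for reference (n = sample size, k = number of predictors):

```latex
\text{adjusted } R^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```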
standardised beta interpretation multiple regression
for a one SD increase in x, y increases by beta SDs, whilst adjusting for (i.e., holding constant) all other variables
unstandardised beta interpretation multiple regression
for a one unit increase in x, y increases by B units, whilst adjusting for (i.e., holding constant) all other variables
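One way to obtain both kinds of coefficients, assuming the same hypothetical df: fit the raw variables for unstandardised B, and z-scored variables for standardised beta.

```python
import statsmodels.api as sm

cols = ["adherence", "age", "illness_duration"]
z = (df[cols] - df[cols].mean()) / df[cols].std()   # z-score outcome and predictors

raw = sm.OLS(df["adherence"], sm.add_constant(df[["age", "illness_duration"]])).fit()
std = sm.OLS(z["adherence"], sm.add_constant(z[["age", "illness_duration"]])).fit()

print(raw.params)   # unstandardised B: change in adherence per unit change in x
print(std.params)   # standardised beta: change in SDs of adherence per SD change in x
```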
multicollinearity
high intercorrelations among the IVs (e.g., r = 0.80 or above); multiple regression assumes no multicollinearity
why can we just not add up the r-squared of the individual predictors in multiple regression?
if all correlations between the IVs were zero, we could just add up the r-square for each IV... but IVs are usually correlated, so they share explained variance and the sum would double-count it
SPSS output for checking multicollinearity?
in "coefficient" and "collinearity statistics"(right) dialogue box: if the VIF value is greater than 10, or the Tolerance is less than 0.1, then you have concerns over multicollinearity!
interpretation of slope in regression after logarithmic transformation
when the outcome has been log-transformed, an increase of 1 unit in x corresponds to a (exp(B) - 1) * 100 percent change in Y
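A worked example with a made-up coefficient: if the fitted model is ln(Y) = b0 + B*x and B = 0.05, then

```latex
\left(e^{B} - 1\right) \times 100 = \left(e^{0.05} - 1\right) \times 100 \approx 5.1\%
```

so a one-unit increase in x corresponds to roughly a 5.1% increase in Y.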
multiple regression: control for effects?
multiple regression allows us to control for the effects of certain variables by entering them into the model (e.g., age, SES)
how many dummy variables do you need to create in regression?
number of categories minus 1 (i.e., k - 1); you need to decide on your reference category (it will not have a dummy variable associated with it)
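A small sketch of k-1 coding with pandas, using a hypothetical three-level ethnicity variable with "White" chosen as the reference:

```python
import pandas as pd

eth = pd.Series(["White", "Black", "Asian", "White", "Asian"], name="ethnicity")
# Make "White" the first category so it becomes the (dropped) reference group
eth = pd.Categorical(eth, categories=["White", "Black", "Asian"])

dummies = pd.get_dummies(eth, prefix="eth", drop_first=True, dtype=int)
print(dummies)   # 3 levels -> 2 dummy columns: eth_Black and eth_Asian
```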
what could cause non-normality in multiple regression?
often due to non-normality of the DV; maybe also due to the presence of a few large univariate outliers, or the violation of the linearity assumption
how to check for homoscedasticity and linearity assumption?
plot the standardised residuals against the standardised predicted values; the plot should show a random pattern centred around the line of zero standardised residual value
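A sketch of that plot, assuming model is the fitted OLS result from the earlier sketch:

```python
import matplotlib.pyplot as plt

std_resid = model.get_influence().resid_studentized_internal             # standardised residuals
std_pred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()

plt.scatter(std_pred, std_resid, alpha=0.6)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Standardised predicted values")
plt.ylabel("Standardised residuals")
plt.show()   # a random cloud around zero suggests homoscedasticity and linearity hold
```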
semi-partial correlations squared (sr2)
provide the percentage of variance in the DV which is uniquely associated with each IV
variable selection: entry method
standard way of selecting variables into a regression model; all predictor variables are entered into the equation at the same time
independence of errors
the errors associated with one observation are not correlated with the errors of any other observation
linearity assumption
the relationship between the outcome variable and predictors is linear
semi-partial correlation (sr)
the unique relation between an IV and the DV; an sr is the variance in the DV explained by an IV and only that IV (i.e., it does not include variance in the DV explained by other IVs)
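One way to compute sr2 for a given IV is as the drop in R-square when that IV is removed from the model; a sketch with the same hypothetical df:

```python
import statsmodels.api as sm

full = sm.OLS(df["adherence"],
              sm.add_constant(df[["age", "illness_duration"]])).fit()
reduced = sm.OLS(df["adherence"],
                 sm.add_constant(df[["illness_duration"]])).fit()

sr2_age = full.rsquared - reduced.rsquared   # variance uniquely explained by age
print(f"sr2 for age: {sr2_age:.3f}")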
dummy variable: interpretation for significant predictor variable?
the dummy variable's group is (or is not) significantly different (i.e., higher or lower) from the reference group (e.g., Asian people have higher depression scores than white people; there is no difference between black and white ethnic groups for depression in this analysis) --> B = ..., so on average, group X has higher/lower outcome scores than group Y
what if the data violate the assumption of normality or linearity?
use a transformation (e.g., log transformation); categorise the IV (e.g., age into 18-30, 31-50, etc.); dichotomise the outcome variable (e.g., a test score on a scale of 1-100 into pass and fail)
when is stepwise MRL useful? advantage/disadvantage?
very useful if there are a large number of IVs, especially with a small sample; advantage: always results in the most parsimonious model; however, it is generally better to test a theory than simply to build a model from the most significant predictors (stepwise can produce different models in different samples from the same population)
assumption of normality in multiple linear regression
we only need to consider the distribution of the residuals (We do not need to care about the univariate normality of either the DV or the IV)
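A quick residual-normality check, assuming model is the fitted OLS result from the earlier sketch:

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

sm.qqplot(model.resid, line="45", fit=True)   # points close to the line suggest normal residuals
plt.show()
print(stats.shapiro(model.resid))             # formal test; very sensitive in large samples
```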
when is the entry method (variable selection) appropriate?
when dealing with a small set of predictors, or when the researcher does not know which independent variables will create the best prediction equation