Week 6: Multiple Regression

categorical variables in regression?

it is best to dummy code categorical variables (i.e., code each level as 0 or 1); dummy coding turns categories into something a regression can treat as having a high (1) and low (0) score

stepwise (variable selection)

combines forward entry and backward removal of variables, leaving only the significant predictors --> each variable is entered in sequence and its contribution assessed; if adding it improves the model it is retained, and all other variables already in the model are re-tested to see if they still contribute; those that no longer contribute are removed

how to deal with categorical predictor variables in multiple regression?

in general, a categorical variable with k levels will be transformed into k-1 variables each with two levels (e.g., if a categorical variable had six levels, then 5 dichotomous variables could be constructed that would contain the same information as the single categorical variable; dichotomous variables have the advantage that they can be directly entered into the regression model.)
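
As a minimal sketch (outside SPSS), the pandas snippet below shows k-1 dummy coding for a made-up three-level variable; the variable names and data are purely illustrative:

```python
import pandas as pd

# Hypothetical data: 'ethnicity' has k = 3 levels, so we need k - 1 = 2 dummies
df = pd.DataFrame({
    "ethnicity": ["White", "Black", "Asian", "White", "Asian"],
    "depression": [10, 12, 15, 9, 14],
})

# drop_first=True drops the reference category (here "Asian", the first in
# alphabetical order), leaving k - 1 dichotomous 0/1 variables
dummies = pd.get_dummies(df["ethnicity"], prefix="eth", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df)
```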

consequence and solution of multicollinearity?

it leads to unstable regression estimates; solution: omit one of the correlated IVs from the model

sequential/ hierarchical variable selection

variables are entered in a given order based on theory, logic, or practicality; entered by block in SPSS (e.g., block 1: age, sex, education; block 2: 8 illness perceptions) --> the researcher has greater control over the regression process (predictor variables are entered in a specific order) -> used when the researcher has an idea which predictors influence the DV (e.g., we might wish to control for demographic influences on adherence BEFORE examining IPQ variables)
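
For illustration only, the Python/statsmodels sketch below mimics blockwise (hierarchical) entry by fitting nested models and inspecting the R-square change; the variables (age, sex, education, ipq, adherence) and data are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: demographics (block 1) and an illness-perception score (block 2)
df = pd.DataFrame({
    "age": rng.normal(50, 10, n),
    "sex": rng.integers(0, 2, n),          # 0/1 dummy
    "education": rng.normal(12, 3, n),
    "ipq": rng.normal(0, 1, n),
})
df["adherence"] = 2 + 0.02 * df["age"] + 0.5 * df["ipq"] + rng.normal(0, 1, n)

# Block 1: demographics only
X1 = sm.add_constant(df[["age", "sex", "education"]])
m1 = sm.OLS(df["adherence"], X1).fit()

# Block 2: demographics + illness perception
X2 = sm.add_constant(df[["age", "sex", "education", "ipq"]])
m2 = sm.OLS(df["adherence"], X2).fit()

# The R-squared change shows what the block-2 predictor adds beyond demographics
print(f"R2 block 1: {m1.rsquared:.3f}")
print(f"R2 block 2: {m2.rsquared:.3f}  (change = {m2.rsquared - m1.rsquared:.3f})")
# An F-test comparing the nested models assesses whether the change is significant
print(m2.compare_f_test(m1))
```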

level of measurement in multiple linear regression

continuous DV; predictor variables can be continuous or categorical

how does regression report the mean?

reports only one mean (as the intercept) and the difference between that mean/intercept and the other group's mean

R-square interpretation multiple regression

the amount of variance explained by the predictors in the outcome of interest

homoscedasticity: meaning in linear regression

the variance of the residuals is constant (equal) across all values of the predicted scores along the regression line

advantages/disadvantages of enter method

(+) useful when the researcher does not know which IVs will create the best equation; we can assess the contribution made by each IV; (-) we may have a large number of IVs relative to the sample size; most of the IVs may not contribute

assumptions in multiple linear regression

(1) Normality - normally distributed residuals (2) No multicollinearity (3) Independence of errors (4) Homoscedasticity (5) Linearity

methods for selecting variables in a regression model:

(1) entry: default - all predictor variables; (2) selection procedures - to reduce the set of predictor variables; only retain the most important predictors

selection procedures (variable selection)?

(1) statistical regression models: forward selection, backward elimination, and stepwise (2) sequential regression: hierarchical

report regression results in the text, what to include:

(a) the unstandardised or standardised slope (beta), whichever is more interpretable given the data (b) the t-statistic (c) the corresponding significance level (d) perhaps a confidence interval (e) it is useful, though not customary, to report the percentage of variance explained along with the corresponding F-test

variable selection: statistical regression models

(i.e., forward selection, backward elimination, and stepwise) use of statistical criterion (e.g., variance explained) to select variables; takes many methodological decisions out of the hand of the researcher; ideal when there is no sound theoretical literature to base your model on

variable selection: sequential regression

(i.e., hierarchical) items are entered in a given order based on theory, logic or practicality; allows the researcher greater control over the regression process; used when the researcher has an idea as to which predictors may impact the dependent variable

linear regression with ordinal outcome (e.g., Likert Scale)

- If you have a large sample and can assume an underlying continuum, treat the outcome as continuous --> linear regression
- If you cannot, use non-parametric tests for a continuous outcome (i.e., Kruskal-Wallis & Mann-Whitney) or ordinal regression

steps in conducting a regression analysis

1. Centre continuous variables, e.g., mean or a suitable value 2. Dummy code categorical variables 3. Decide on the regression method, e.g., enter 4. Run the analysis 5. Check the assumptions
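
A minimal Python/statsmodels sketch of these steps (centre, dummy code, enter method, run, check), using made-up data and variable names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150

# Hypothetical data set
df = pd.DataFrame({
    "age": rng.normal(40, 12, n),
    "group": rng.choice(["control", "treatA", "treatB"], n),
})
df["score"] = 50 + 0.3 * df["age"] + rng.normal(0, 5, n)

# 1. Centre continuous predictors (here at the mean)
df["age_c"] = df["age"] - df["age"].mean()

# 2. Dummy code the categorical predictor (k - 1 dummies, reference dropped)
dummies = pd.get_dummies(df["group"], prefix="grp", drop_first=True).astype(float)

# 3./4. Enter method: all predictors entered together, then run the analysis
X = sm.add_constant(pd.concat([df[["age_c"]], dummies], axis=1))
model = sm.OLS(df["score"], X).fit()
print(model.summary())

# 5. Check the assumptions (residual plots, Durbin-Watson, VIF) on the fitted model
```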

how to check independence of errors

Durbin-Watson statistic: values less than 1 or greater than 3 are cause for concern (mild concern when the value falls outside the range 1.5 to 2.5)
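
A small sketch of the same check outside SPSS, using statsmodels' durbin_watson on the residuals of a hypothetical fitted model:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical fitted model, just to have residuals to test
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

dw = durbin_watson(model.resid)
print(f"Durbin-Watson = {dw:.2f}")
if dw < 1 or dw > 3:
    print("Cause for concern: residuals may be autocorrelated")
elif not (1.5 <= dw <= 2.5):
    print("Mild concern: value outside the 1.5-2.5 range")
else:
    print("No obvious autocorrelation problem")
```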

backward elimination

a method of selecting variables for inclusion in the regression model that starts by including all independent variables and then eliminates those not making a significant contribution to prediction (i.e., removing the variable does not significantly decrease the amount of variance explained) -> removes the non-significant variables one by one -> considers all predictor variables together

multiple linear regression vs correlation (vs SLR)?

Simple linear regression and correlation give similar results; multiple linear regression and correlation DO NOT. In multiple linear regression: (a) you are using more than one predictor variable; (b) these predictor variables (Xs) may be correlated with each other; (c) this affects the outcome of the regression. You can separately correlate each X with Y to get an idea of the relationships, but these results will differ from the results of the regression when all IVs are analysed simultaneously

how are ANOVA and regression outputs similar for binary predictor?

They have the same F-statistic and p-value (significance). The intercept of the regression is the mean of the reference group (e.g., the group coded 0, such as boys); the regression coefficient for the predictor (e.g., gender) is the difference between the two group means; thus, intercept + regression coefficient = mean of the other group (matching what ANOVA reports)
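
A quick illustrative sketch (made-up data) showing that an OLS regression on a 0/1 dummy and a one-way ANOVA on the same two groups agree: same F and p, intercept = reference-group mean, slope = group difference:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical scores for two groups (0 = reference group, e.g., boys; 1 = other group)
boys = rng.normal(10, 2, 50)
girls = rng.normal(12, 2, 50)

y = np.concatenate([boys, girls])
group = np.concatenate([np.zeros(50), np.ones(50)])

# Regression with a single 0/1 predictor
X = sm.add_constant(group)
fit = sm.OLS(y, X).fit()
print(f"intercept = {fit.params[0]:.2f} (mean of reference group = {boys.mean():.2f})")
print(f"slope     = {fit.params[1]:.2f} (group difference = {girls.mean() - boys.mean():.2f})")
print(f"regression F = {fit.fvalue:.2f}, p = {fit.f_pvalue:.4f}")

# One-way ANOVA on the same two groups gives the same F and p
f_anova, p_anova = stats.f_oneway(boys, girls)
print(f"ANOVA F = {f_anova:.2f}, p = {p_anova:.4f}")
```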

differences ANOVA and Regression: how to assess which means (of 3 or more categorical predictors) differ?

While ANOVA requires post-hoc analysis to find that out, regression reports the difference between the reference group and each other group

variable selection: blocking

allows us to enter variables in blocks (e.g., block 1: age, sex, education; block 2: duration of illness; block 3: illness perceptions); SPSS also allows us to use the Stepwise method within each block

Forward Selection

builds a model by adding in predictors in order of how much variance they explain; non-significant variables are excluded -> looks at each predictor separately
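
statsmodels has no built-in stepwise routine, so the sketch below is a simplified, hand-rolled forward-selection loop based on p-values (illustrative only, with made-up data); real software may use different entry criteria:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(y, X, alpha=0.05):
    """Simplified forward selection: repeatedly add the predictor with the
    smallest p-value, as long as that p-value is below alpha."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for cand in remaining:
            cols = selected + [cand]
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no remaining predictor makes a significant contribution
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: x1 and x2 predict y, x3 is pure noise
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 1 + 2 * X["x1"] + 1 * X["x2"] + rng.normal(0, 1, 200)
print(forward_selection(y, X))  # expected to pick x1 and x2, leave out x3
```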

dummy coding

creating dichotomous variables from categorical variables

adjusted r-squared

an estimate of what R-square would be in the population (it is much smaller than R-square when the number of IVs is large relative to the sample size)
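
The usual formula is adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1); the illustrative numbers below show the shrinkage with many IVs and a small sample:

```python
# Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - k - 1)
# where n = sample size, k = number of predictors (illustrative numbers below)
r2, n, k = 0.40, 30, 8
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))  # ~0.171: much smaller than 0.40 with many IVs and a small n
```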

standardised beta interpretation multiple regression

for a one SD increase in X, Y increases by B SDs whilst adjusting for (holding constant) all other variables

unstandardised beta interpretation multiple regression

for a one unit increase in X, Y increases by B units whilst adjusting for (holding constant) all other variables

multicollinearity

high intercorrelations among IVs (e.g., r=0.80 or above) (no multicollinearity assumed in multiple regression)

why can we just not add up the r-squared of the individual predictors in multiple regression?

if all correlations between IVs were zero, we could just add up the r-square for each IV... but IVs are usually correlated!

SPSS output for checking multicollinearity?

in "coefficient" and "collinearity statistics"(right) dialogue box: if the VIF value is greater than 10, or the Tolerance is less than 0.1, then you have concerns over multicollinearity!

interpretation of slope in regression after logarithmic transformation

when the DV has been log-transformed, an increase of 1 unit in X results in a (exp(B)-1)*100 percent change in Y
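
A quick worked example with an illustrative slope value:

```python
import numpy as np

# Illustrative value: suppose the slope on a log-transformed DV is B = 0.05
B = 0.05
pct_change = (np.exp(B) - 1) * 100
print(round(pct_change, 1))  # ~5.1: a 1-unit increase in X -> about a 5.1% increase in Y
```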

multiple regression: control for effects?

multiple regression allows us to control for the effects of certain variables by entering them into the model (e.g., age, SES)

how many dummy variables do you need to create in regression?

k-1, where k is the number of categories (levels); you need to decide on your reference category (which will not have a dummy variable associated with it)

what could cause non-normality in multiple regression?

often due to non-normality of the DV; maybe also due to the presence of a few large univariate outliers, or the violation of the linearity assumption

how to check for homoscedasticity and linearity assumption?

plot the standardised residuals against standardised predicted values; the plot should show a random pattern centred around the line of zero standard residual value
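
A rough sketch of this plot in Python (made-up data; residuals and predicted values are standardised here by simple z-scoring, which approximates what SPSS plots):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import zscore

# Hypothetical fitted model, just to have residuals and predicted values to plot
rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 3 + 0.5 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Standardise (z-score) the residuals and predicted values, then plot them
std_resid = zscore(model.resid)
std_pred = zscore(model.fittedvalues)
plt.scatter(std_pred, std_resid, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardised predicted values")
plt.ylabel("Standardised residuals")
plt.title("A random scatter around zero supports homoscedasticity and linearity")
plt.show()
```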

semi-partial correlations squared (sr2)

provide the percentage of variance in the DV which is uniquely associated with each IV
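
One way to sketch sr-squared from a fitted OLS model is the standard identity sr_i^2 = t_i^2 * (1 - R^2) / df_residual; the data and variable names below are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data and model
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
y = 1 + 1.5 * X["x1"] + 0.5 * X["x2"] + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(X)).fit()

# Standard identity for OLS: sr_i^2 = t_i^2 * (1 - R^2) / df_residual
sr2 = (model.tvalues.drop("const") ** 2) * (1 - model.rsquared) / model.df_resid
print((sr2 * 100).round(1))  # % of DV variance uniquely associated with each IV
```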

variable selection: entry method

standard way of selecting variables into a regression model; all predictor variables are entered into the equation at the same time

independence of errors

the errors associated with one observation are not correlated with the errors of any other observation

linearity assumption

the relationship between the outcome variable and predictors is linear

semi-partial correlation (sr)

the unique relation between an IV and the DV. An sr is the variance in the DV explained by an IV and only that IV (i.e., it does not include variance in the DV explained by other IVs).

dummy variable: interpretation for significant predictor variable?

the variable is (or is not) significantly different (i.e., higher or lower) from the reference category (e.g., Asian people have higher depression scores than white people; there is no difference between black and white ethnic groups for depression in this analysis) --> B = ..., so on average, X have higher/lower outcome scores than Y

what if the data violate the assumption of normality or linearity?

use a transformation (e.g., a log transformation); categorise the IV (e.g., age into 18-30, 30-50, etc.); dichotomise the outcome variable (e.g., a test score on a scale of 1-100 into pass and fail)

when is stepwise MRL useful? advantage/disadvantage?

very useful if there is a large number of IVs, especially with a small sample; advantage: it results in the most parsimonious model; in general, however, it is better to test a theory than simply build a model from the most significant predictors (stepwise selection can produce different models in different samples from the same population)

assumption of normality in multiple linear regression

we only need to consider the distribution of the residuals (We do not need to care about the univariate normality of either the DV or the IV)

when is the entry method (variable selection) appropriate?

when dealing with a small set of predictors; the researcher does not know which independent variables will create the best prediction equation

