Quantitative Methods: Multiple Regression
Adjusted R^2
- Used as a measure of model goodness of fit, since it does not automatically increase as independent variables are added to the model
- Adjusts for degrees of freedom by incorporating the number of independent variables
- Increases if a variable is added whose coefficient t-stat has an absolute value > 1
- Decreases if a variable is added whose coefficient t-stat has an absolute value < 1
- The higher the better
Adjusted R^2 = 1 - [(n-1)/(n-k-1)] x (1 - R^2), where R^2 = SSR/SST
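A minimal Python sketch of the two formulas above; the inputs (sse, sst, n, k) are illustrative assumptions, taken to come from an already-fitted model:

```python
# Adjusted R^2 from the sums of squares; variable names are hypothetical.
def adjusted_r2(sse: float, sst: float, n: int, k: int) -> float:
    r2 = 1 - sse / sst  # R^2 = SSR / SST = 1 - SSE / SST
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

print(adjusted_r2(sse=20.0, sst=100.0, n=60, k=3))  # R^2 = 0.80 -> ~0.789
```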
Studentized Residuals (ti*)
- Used to identify outliers
Steps:
1. Estimate the regression model using the original sample of n observations. Delete one observation and re-estimate the regression using n-1 observations. Perform this sequentially for all observations, deleting one at a time.
2. Compare the actual Y value of the deleted observation i to the value predicted by the model estimated without that observation; the deleted residual is ei* = Yi - Yi*.
3. The studentized residual is the deleted residual from step 2 divided by its standard deviation: ti* = ei* / s*.
4. Compare the absolute value of the studentized residual against the two-tailed critical value from a t-distribution with n-k-2 degrees of freedom to determine if the observation is influential.
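A sketch of the four steps in Python with numpy; the design matrix X (first column an intercept) and response vector y are assumed inputs:

```python
import numpy as np

def studentized_residuals(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n, p = X.shape                        # p = k + 1 (intercept included)
    t_star = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # step 1: delete observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid = y[keep] - X[keep] @ beta
        s2 = resid @ resid / (n - 1 - p)  # residual variance without obs i
        xi = X[i]
        # standard deviation of the deleted residual ei* = yi - xi'beta
        xtx_inv = np.linalg.inv(X[keep].T @ X[keep])
        s_star = np.sqrt(s2 * (1 + xi @ xtx_inv @ xi))
        t_star[i] = (y[i] - xi @ beta) / s_star   # steps 2-3
    return t_star  # step 4: compare |ti*| to a t critical value, n - k - 2 df
```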
Homoskedasticity
- Variance of the regression residuals is the same for all observations
- Best detected with scatter plots that compare the regression residuals against the predicted values
Outliers
Extreme observations of the dependent (Y) variable
Influential Data Points
Extreme observations that, when excluded, cause a significant change to model coefficients
Changes in Slope Coefficient
For every one-unit change in the independent variable, the dependent variable is expected to change by the amount of the slope coefficient, holding the other independent variables constant
Likelihood Ratio
- A method to assess the fit of logistic regression models, based on the log-likelihood metric that describes the model's fit to the data
- Log-likelihood is always a negative number, so higher values (closer to 0) indicate a better-fitting model
- LR test statistic = -2 x (Log-Likelihood of Restricted Model - Log-Likelihood of Unrestricted Model), which follows a chi-square distribution with degrees of freedom equal to the number of restrictions
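A sketch of the LR test, assuming the two log-likelihoods are already known; scipy is used only for the chi-square p-value, and q (the number of restrictions) is an illustrative input:

```python
from scipy.stats import chi2

def lr_test(ll_restricted: float, ll_unrestricted: float, q: int):
    lr = -2 * (ll_restricted - ll_unrestricted)  # >= 0 for nested models
    p_value = chi2.sf(lr, df=q)                  # 1 - CDF at the statistic
    return lr, p_value

print(lr_test(ll_restricted=-110.4, ll_unrestricted=-104.1, q=2))  # LR = 12.6
```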
Normal Q-Q Plot
- A visual used to compare the distribution of the residuals from a regression to a theoretical normal distribution
- The residuals should lie along the diagonal if they are normally distributed
Dummy Variable
- An independent variable that takes on a value of either 1 or 0, depending on a specified condition
- Used to quantify the impact of qualitative events
- Can shift either the intercept or the slope (see Intercept Dummy and Slope Dummy)
- AKA: Indicator Variable
High-Leverage Points
- Extreme observations of the independent (X) variables
- Identified using Leverage (Lij)
- Leverage measures the distance of the "j"th observation of independent variable i from its sample mean
Leverage (Lij)
- Measures the distance of the "j"th observation of independent variable i from its sample mean
- Takes a value between 0 and 1 (the higher the value, the greater the distance, and hence the greater the potential influence of the observation on the estimated regression parameters)
- If leverage is greater than 3 x [(k + 1) / n], the observation is potentially influential
- k = # of independent variables
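A sketch computing leverage as the diagonal of the hat matrix H = X(X'X)^(-1)X', on synthetic data containing one extreme X observation:

```python
import numpy as np

def leverage(X: np.ndarray) -> np.ndarray:
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)                            # each value lies in [0, 1]

x = np.append(np.linspace(1, 10, 19), 100.0)     # last observation is extreme
X = np.column_stack([np.ones(20), x])            # intercept + one X variable
h = leverage(X)
k = X.shape[1] - 1                               # # of independent variables
print(np.flatnonzero(h > 3 * (k + 1) / len(h)))  # flags only index 19
```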
Nested Model
- Models in which one regression model has a subset of the independent variables of another regression model
- Compared using the joint F-statistic (see F-Statistic)
Coefficient of Determination (R^2)
- Percentage of the variation of the dependent variable that is explained by the independent variables (its explanatory power)
- Limitations:
  - Can't indicate whether coefficients are statistically significant
  - Can't indicate whether there are biases in the estimated coefficients and predictions
  - Can't indicate whether the model fit is good
R^2 = SSR/SST = (SST - SSE)/SST
Logistic Regression (Logit) Model
- Regression model that uses an exponential function of the independent variables to estimate a response between 0 and 1
- Slopes are interpreted as the change in the log odds of the event occurring per one-unit change in the independent variable, holding all other variables constant
- Intercept is interpreted as the log odds of the response variable occurring when all independent (predictor) variables are 0
ln(p / (1 - p)) = b + mx + e
p = Odds / (1 + Odds) = 1 / (1 + e^(-linear model))
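A minimal sketch of turning the logit's linear predictor into a probability; the coefficients (b, m) are made up purely for illustration:

```python
import math

def logit_probability(b: float, m: float, x: float) -> float:
    log_odds = b + m * x                  # ln(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))  # p = Odds / (1 + Odds)

print(logit_probability(b=-2.0, m=0.5, x=6.0))  # log odds = 1.0 -> p ~ 0.731
```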
Restricted Model
- Regression model with a subset of the complete set of independent variables
- In hypothesis testing, the model obtained after imposing all of the restrictions required under the null
Unrestricted Model
- Regression model with the complete set of independent variables
- In hypothesis testing, the model that has no restrictions placed on its parameters
Akaike Information Criterion (AIC)
- Statistic used to compare sets of independent variables for explaining a dependent variable
- Preferred for finding the model that is best suited for prediction
- The lower the better (see the combined AIC/BIC sketch after the next card)
AIC = n ln(SSE/n) + 2(k + 1)
Schwarz Bayesian Information Criterion (BIC or SBC)
- Statistic used to compare sets of independent variables for explaining a dependent variable
- Preferred for finding the model with the best goodness of fit
- The lower the better
BIC = n ln(SSE/n) + ln(n)(k + 1)
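A sketch computing both criteria (this card and the AIC card above) from the SSE of a fitted model; the input values are illustrative:

```python
import math

def aic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# BIC penalizes added variables more heavily than AIC once ln(n) > 2 (n >= 8).
print(aic(sse=20.0, n=60, k=3), bic(sse=20.0, n=60, k=3))
```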
Partial Regression Coefficient
- The regression coefficient in a multiple regression equation
- Describes the effect of a one-unit change in the independent variable on the dependent variable, holding all other independent variables constant
- AKA: Partial Slope Coefficient
F-Statistic
Overall test: F = MSR / MSE
Joint test of q restrictions:
F = [(SSE Restricted - SSE Unrestricted) / q] / [SSE Unrestricted / (n - k - 1)]
Equivalent forms of the overall test (restricted model = intercept only, so q = k):
F = [(SST - SSE Unrestricted) / k] / [SSE Unrestricted / (n - k - 1)]
F = [RSS Unrestricted / k] / [SSE Unrestricted / (n - k - 1)], where RSS = SST - SSE (the regression sum of squares)
q = # of variables omitted in the restricted model
n = # of observations (denominator degrees of freedom = n - k - 1)
k = # of independent variables in the unrestricted model
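A sketch of the joint F-test, assuming each model's SSE has already been computed; scipy supplies the p-value:

```python
from scipy.stats import f

def joint_f_test(sse_r: float, sse_u: float, q: int, n: int, k: int):
    f_stat = ((sse_r - sse_u) / q) / (sse_u / (n - k - 1))
    p_value = f.sf(f_stat, dfn=q, dfd=n - k - 1)
    return f_stat, p_value

print(joint_f_test(sse_r=30.0, sse_u=20.0, q=2, n=60, k=5))  # F = 13.5
```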
Multiple Linear Regression
Modeling and estimation method that uses two or more independent variables to describe the variation of the dependent variable. Also referred to as multiple regression.
Y = b + m1X1 + m2X2 + ... + mkXk + e
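A minimal OLS sketch using numpy's least squares on synthetic data, to make the equation concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two independent variables
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_design = np.column_stack([np.ones(100), X])  # prepend the intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)                                    # roughly [1.5, 2.0, -0.7]
```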
Log-Likelihood Criteria
The higher the log-likelihood (i.e., the closer to 0), the better the fit
# of Independent Variables in a Model
k
Probability
p = 1 / [1 + e^(-regression model)]
Assumptions of Multiple Regression
1) Linearity
2) Homoskedasticity
3) Independence of errors
4) Independence of the independent variables
5) Normality (of the residuals)
Qualitative Dependent Variable
A dependent variable that is categorical (usually binary), taking on a value of either 1 or 0
Cooks Distance (Di)
A composite metric for evaluating whether a specific observation is influential (i.e., it accounts for both leverage and outliers)
Di = [ei^2 / ((k + 1) x MSE)] x [hii / (1 - hii)^2]
ei: Residual of the ith observation
k: # of independent variables
MSE: Mean square error of the regression model
hii: Leverage value of the ith observation
- Values greater than sqrt(k/n) indicate that the ith observation is highly likely to be an influential data point
- Values greater than 0.5 indicate that further investigation is required
- Values greater than 1 indicate a high likelihood of an influential observation
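A sketch of the formula, assuming the residuals, leverage values, and MSE come from an already-fitted OLS model (e.g., the leverage() sketch above):

```python
import numpy as np

def cooks_distance(resid: np.ndarray, h: np.ndarray,
                   mse: float, k: int) -> np.ndarray:
    # Di = [ei^2 / ((k + 1) * MSE)] * [hii / (1 - hii)^2]
    return (resid**2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)

# Flag candidates for investigation, e.g. Di > sqrt(k / n):
# flags = cooks_distance(resid, h, mse, k) > np.sqrt(k / len(resid))
```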
Maximum Likelihood Estimation (MLE)
A method that estimates values for the intercept and slope coefficients in a logistic regression that make the data in the regression sample most likely.
Logistic Regression
A statistical analysis that estimates the probability of an outcome as a function of one or more risk factors, where the outcome of interest has two categories. Best used when the dependent variable is NOT continuous.
Analysis of Variance (ANOVA)
A table that presents the sum of squares, degrees of freedom, mean squares, and F-Statistic for a regression model
Interaction Term
A term that combines two or more independent variables and represents their joint influence on the dependent variable
Influence Plot
A visual that shows, for all observations, studentized residuals on the y-axis, leverage on the x-axis, and Cook's D as circles whose size is proportional to the degree of influence of the given observation
Slope Dummy
Allows the slope of the relationship between the dependent variable and an independent variable to differ depending on whether the condition specified by a dummy variable is met (see the combined sketch after the Intercept Dummy card)
D = 0: Y = b + mx + e
D = 1: Y = b + (m + d)x + e, where d is the coefficient on the interaction term (D times x)
Intercept Dummy
Changes the constant or intercept term, depending on whether the qualitative condition is met
D = 0: Y = b + mx + e
D = 1: Y = (b + d) + mx + e, where d is the coefficient on the dummy variable
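A sketch showing both dummy types in one design matrix: the column D shifts the intercept, and the interaction column D*x shifts the slope; the data is synthetic:

```python
import numpy as np

n = 100
rng = np.random.default_rng(1)
x = rng.normal(size=n)
D = (rng.random(n) > 0.5).astype(float)          # 1 when the condition is met

X = np.column_stack([np.ones(n), x, D, D * x])   # intercept, slope, two dummies
y = 1.0 + 2.0 * x + 0.5 * D + 1.5 * D * x + rng.normal(scale=0.3, size=n)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [1.0, 2.0, 0.5, 1.5]: intercept shift 0.5, slope shift 1.5
```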
P-Value
The smallest level of significance at which the null hypothesis can be rejected
- P-value < level of significance: null is rejected
- P-value > level of significance: null cannot be rejected
