Multiple Regression (Partial F Test)
Guidelines for Including/Excluding Variables in a Regression Model
1. Look at a variable's t-value and its associated P-value. If the P-value is above some accepted significance level, such as 0.05, the variable is a candidate for exclusion (see the sketch after this list).
2. It is a mathematical fact that:
   - If a variable's t-value is less than 1, then Se will decrease and adjusted r^2 will increase if that variable is excluded from the equation.
   - If its t-value is greater than 1, the opposite will occur.
   Because of this, some statisticians advocate excluding variables with t-values less than 1 and including variables with t-values greater than 1. However, analysts who base the decision on statistical significance at the usual 5% level, as in guideline 1, typically exclude a variable from the equation unless its t-value is at least 2 (approximately). This latter approach is more stringent - fewer variables will be retained - but it is probably the more popular approach.
3. When a group of variables is in some sense logically related, it is sometimes a good idea to include all of them or exclude all of them. In this case, their individual t-values are less relevant; instead, a partial F test can be used to make the include/exclude decision.
4. Use economic, theoretical, or practical considerations to decide whether to include or exclude variables. Some variables might really belong in an equation because of their theoretical relationship with the response variable, and their low t-values, possibly the result of an unlucky sample, should not necessarily disqualify them from being in the equation. Similarly, a variable that has no economic or physical relationship with the response variable might have a significant t-value just by chance; this does not necessarily mean that it should be included in the equation.
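A minimal sketch of guideline 1 using statsmodels (the data and variable names here are made up): fit the model, then flag any explanatory variable whose P-value exceeds 0.05 as a candidate for exclusion.

```python
# Sketch of guideline 1: flag variables whose P-values exceed 0.05.
# The data and variable names below are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)          # x2 has no real relationship with y

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

for name, t, p in zip(["const", "x1", "x2"], fit.tvalues, fit.pvalues):
    flag = "candidate for exclusion" if (name != "const" and p > 0.05) else ""
    print(f"{name:5s}  t = {t:6.2f}  P-value = {p:.4f}  {flag}")
```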
Will adding extra explanatory variables ever decrease r^2?
Adding new explanatory variables will always keep the r^2 value the same or increase it; it can never decrease it. In general, adding explanatory variables to the model causes the prediction errors to become smaller, thus reducing the sum of squares due to error, SSE. Because SSR = SST - SSE, when SSE becomes smaller, SSR becomes larger, causing r^2 = SSR/SST to increase. Therefore, if a variable is added to the model, r^2 usually becomes larger even if the variable added is not statistically significant. This can lead to "fishing expeditions," where you keep adding variables to the model, some of which have no conceptual relationship to the response variable, just to inflate the r^2 value.
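A small simulation, with made-up data, illustrates the point: adding a variable that is pure noise still nudges r^2 upward.

```python
# Sketch: adding a pure-noise explanatory variable cannot decrease r^2.
# Data are simulated; variable names are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
noise_var = rng.normal(size=n)               # no conceptual link to y
y = 5 + 1.5 * x1 + rng.normal(size=n)

r2_small = sm.OLS(y, sm.add_constant(x1)).fit().rsquared
r2_big = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise_var]))).fit().rsquared

print(f"r^2 without noise variable: {r2_small:.4f}")
print(f"r^2 with noise variable:    {r2_big:.4f}   (never smaller)")
```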
Reduced and Full Models
Consider the following general situation. Suppose you have already estimated a reduced multiple regression model that includes the variables x1 through xj:

y = α + β1 x1 + ... + βj xj + ε

Now you propose to estimate a larger model, referred to as the full model, that includes x(j+1) through xk in addition to the variables x1 through xj:

y = α + β1 x1 + ... + βj xj + β(j+1) x(j+1) + ... + βk xk + ε

That is, the full model includes all of the variables from the reduced model, but it also includes k - j extra variables.
Null and alternative hypothesis and test statistic for partial F test
The null hypothesis is that the extra variables have no explanatory power once the original variables are included: H0: β(j+1) = β(j+2) = ... = βk = 0. The alternative hypothesis is that at least one of these coefficients is not zero. The test statistic compares the sums of squared residuals of the reduced and full models:

F = [(SSE_reduced - SSE_full) / (k - j)] / [SSE_full / (n - k - 1)]

If the null hypothesis is true, this test statistic has an F distribution with df1 = k - j and df2 = n - k - 1 degrees of freedom. If the corresponding P-value is sufficiently small, you can reject the null hypothesis that the extra variables have no explanatory power.
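A sketch of the computation, assuming simulated data and statsmodels: the reduced model uses x1 only, the full model adds x2 and x3, and the F statistic is built from the two SSE values. statsmodels' compare_f_test gives the same result.

```python
# Partial F test sketch: reduced model (x1) vs. full model (x1, x2, x3).
# Data and variable names are hypothetical.
import numpy as np
import statsmodels.api as sm
from scipy.stats import f

rng = np.random.default_rng(2)
n = 80
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 + 0.8 * x2 + rng.normal(size=n)

reduced = sm.OLS(y, sm.add_constant(x1)).fit()
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()

k, j = 3, 1                                   # explanatory variables in full, reduced
F = ((reduced.ssr - full.ssr) / (k - j)) / (full.ssr / (n - k - 1))
p_value = f.sf(F, k - j, n - k - 1)
print(f"partial F = {F:.3f}, P-value = {p_value:.4f}")

# statsmodels can perform the same test directly:
print(full.compare_f_test(reduced))           # (F, P-value, df difference)
```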
What does adding explanatory variables usually do to SSE and therefore r^2? What can this lead to?
In general, adding explanatory variables to the model causes the prediction errors to become smaller, thus reducing the sum of squares due to error, SSE. Because SSR = SST - SSE, when SSE becomes smaller, SSR becomes larger, causing r^2 = SSR/SST to increase. Therefore, if a variable is added to the model, r^2 usually becomes larger even if the variable added is not statistically significant. This can lead to "fishing expeditions," where you keep adding variables to the model, some of which have no conceptual relationship to the response variable, just to inflate the r^2 value.
Do the 4 methods produce the same equation?
In most cases the final results of these four procedures are very similar. However, there is no guarantee that they will all produce exactly the same final equation.
What does it mean if adjusted r^2 is negative?
It can happen that the value of r^2adj is negative. This is not a mistake, but a result of a model that fits the data very poorly. In this case, some software systems set r^2adj equal to 0. Excel will print the actual value.
What is one common mistake people make when testing the significance of an added group of explanatory variables?
Many users look only at the r^2 and Se values to check whether extra variables are doing a "good job." For example, they might cite that r^2 went from 80% to 90% or that Se went from 500 to 400 as evidence that extra variables provide a "significantly" better fit. Although these are important indicators, they are not the basis for a formal hypothesis test. The partial F test is the formal test of significance for an extra set of variables.
If the added group of variables is significant, does this mean each individual variable that was added is significant?
No. If the partial F test shows that a group of variables is significant, it does not imply that each variable in the group is significant. Some of these variables can have low t-values (and consequently, large P-values). Some analysts favor excluding the individual variables that are not significant, whereas others favor keeping the whole group or excluding the whole group. Either approach is valid, and fortunately the final model-building results are often nearly the same either way.
Backward elimination
The backward elimination procedure begins with a model that includes all potential explanatory variables. At each step it considers the variable with the largest P-value and removes it if that P-value exceeds the prescribed P-value to Leave. The backward elimination procedure does not permit a variable to be reentered once it has been removed. The procedure stops when none of the explanatory variables remaining in the model has a P-value greater than P-value to Leave.
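A minimal sketch of this procedure, assuming a hypothetical pandas DataFrame X of candidate variables, a response y, and P-value to Leave = 0.10:

```python
# Backward elimination sketch with P-value to Leave = 0.10.
# Column names and data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, p_to_leave=0.10):
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")     # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= p_to_leave:        # nothing left to remove
            return fit, cols
        cols.remove(worst)                    # removed variables never re-enter
    return None, []

# hypothetical demo data
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 + 1.5 * X["x1"] - 0.7 * X["x2"] + rng.normal(size=100)
fit, kept = backward_elimination(X, y)
print("variables retained:", kept)
```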
best subsets regression
The best subsets regression procedure works by examining subsets of the candidate explanatory variables. Despite its name, it does not actually compute all possible regressions; there are ways to exclude models known to be worse than models already examined. Typical computer output reports results for a collection of "best" models, usually the two best one-variable models, the two best two-variable models, the two best three-variable models, and so on. The user can then select the best model based on measures such as Cp, r^2, adjusted r^2, and Se.
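A brute-force sketch conveys the idea (a real best subsets routine would use the shortcuts mentioned above rather than fitting every subset); the data and column names are hypothetical, and adjusted r^2 is used here to rank the models of each size.

```python
# Brute-force best subsets sketch: rank models of each size by adjusted r^2.
# Fine for a handful of variables; data and names are hypothetical.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(120, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] + X["x3"] + rng.normal(size=120)

best_by_size = {}
for size in range(1, X.shape[1] + 1):
    for subset in combinations(X.columns, size):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        if size not in best_by_size or fit.rsquared_adj > best_by_size[size][1]:
            best_by_size[size] = (subset, fit.rsquared_adj)

for size, (subset, r2_adj) in best_by_size.items():
    print(size, subset, f"adjusted r^2 = {r2_adj:.4f}")
```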
What is one issue with Cp?
The bias is measured with respect to the total group of variables provided by the researcher. This criterion cannot determine when the researcher has forgotten about some variable not included in the total group.
Forward selection
The forward selection procedure begins with no explanatory variables in the model and successively adds variables one at a time until no remaining variables make a significant contribution. The forward selection procedure does not permit a variable to be removed from the model once it has been entered. The procedure stops if the P-value for each of the explanatory variables not in the model is greater than the prescribed P-value to Enter.
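A minimal sketch of forward selection under these rules, assuming a hypothetical pandas DataFrame X of candidate variables and P-value to Enter = 0.05:

```python
# Forward selection sketch with P-value to Enter = 0.05.
# Data and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, p_to_enter=0.05):
    chosen = []
    remaining = list(X.columns)
    while remaining:
        # P-value each remaining variable would have if added next
        candidate_p = {}
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit()
            candidate_p[var] = fit.pvalues[var]
        best = min(candidate_p, key=candidate_p.get)
        if candidate_p[best] > p_to_enter:    # no remaining variable is significant
            break
        chosen.append(best)                   # entered variables are never removed
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 + 2 * X["x2"] + 0.9 * X["x4"] + rng.normal(size=100)
print("variables entered:", forward_selection(X, y))
```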
What are p-value to enter and p-value to leave in software model building?
The levels of significance used to determine whether an explanatory variable should be entered into the model or removed from it are typically referred to as P-value to Enter and P-value to Leave. Usually, the defaults are P-value to Enter = 0.05 and P-value to Leave = 0.10.
What is the purpose of a partial F test?
The partial F test is used to determine whether the extra variables provide enough extra explanatory power as a group to warrant their inclusion in the equation. In other words, the partial F test tests whether the full model is significantly better than the reduced model.
stepwise regression
The stepwise regression procedure is much like a forward procedure, except that it also considers possible deletions along the way. Because of the nature of the stepwise regression procedure, an explanatory variable can enter the model at one step, be removed at a subsequent step, and then enter the model at a later step. The procedure stops when no explanatory variables can be removed from or entered into the model.
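A compact sketch of the idea, assuming the same hypothetical setup as the forward and backward sketches above: each pass attempts one forward entry and then one backward removal, stopping when neither changes the model.

```python
# Stepwise sketch: forward entry followed by a backward check at each pass,
# so a variable entered earlier can be dropped later. Thresholds, data, and
# column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise(X, y, p_enter=0.05, p_leave=0.10):
    chosen = []
    while True:
        changed = False
        # forward step: try to enter the most significant remaining variable
        remaining = [c for c in X.columns if c not in chosen]
        if remaining:
            pvals = {v: sm.OLS(y, sm.add_constant(X[chosen + [v]])).fit().pvalues[v]
                     for v in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] <= p_enter:
                chosen.append(best)
                changed = True
        # backward step: drop the least significant variable already in the model
        if chosen:
            fit = sm.OLS(y, sm.add_constant(X[chosen])).fit()
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > p_leave:
                chosen.remove(worst)
                changed = True
        if not changed:
            return chosen

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 1.2 * X["x1"] + rng.normal(size=100)
print("final variables:", stepwise(X, y))
```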
Cp explanation
Theory says that if the value of Cp is large, then the mean square error of the fitted values is large, indicating either a poor fit, substantial bias in the fit, or both. In addition, if the value of Cp is much greater than k + 1, then there is a large bias component in the regression, usually indicating omission of an important variable. Therefore, when evaluating which regression is best, it is recommended that regressions with small Cp values and those with values near k + 1 be considered.
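A small sketch of the computation, assuming the common formula Cp = SSE_subset/MSE_full - n + 2(k + 1), where k is the number of explanatory variables in the candidate subset and MSE_full is the mean squared error of the model containing all candidate variables (data and names are hypothetical):

```python
# Mallows' Cp sketch using Cp = SSE_subset / MSE_full - n + 2(k + 1).
# Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 + 1.8 * X["x1"] + 0.6 * X["x2"] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()

def mallows_cp(subset):
    sub = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
    k = len(subset)
    return sub.ssr / full.mse_resid - n + 2 * (k + 1)

for subset in [("x1",), ("x1", "x2"), ("x1", "x2", "x3")]:
    print(subset, f"Cp = {mallows_cp(subset):.2f}  (k + 1 = {len(subset) + 1})")
```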
Interpretation of F test statistic
This test statistic measures how much the sum of squared residuals, SSE, decreases by including the extra variables in the equation. It must decrease by some amount because the sum of squared residuals CANNOT INCREASE when extra variables are added to an equation. But if it does not decrease sufficiently, the extra variables might not explain enough to justify their inclusion in the equation, and they should probably be excluded.
Adjusted r^2 purpose
To avoid overestimating the impact of adding an explanatory variable on the amount of variability explained by the estimated regression equation, many analysts prefer adjusting r^2 for the number of explanatory variables.
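A tiny sketch using the usual adjustment formula, r^2_adj = 1 - (1 - r^2)(n - 1)/(n - k - 1), where k is the number of explanatory variables; the numbers plugged in are hypothetical.

```python
# Adjusted r^2 from the usual formula; the inputs below are hypothetical.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.80, n=50, k=3))   # about 0.787: small penalty
print(adjusted_r2(r2=0.80, n=15, k=10))  # 0.30: heavy penalty for many variables
print(adjusted_r2(r2=0.05, n=20, k=8))   # negative: the model fits very poorly
```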
The major issues in model building are (2)
finding the proper functional form of the relationship and selecting the explanatory variables to be included in the model.
four most common types of model-building procedures
forward selection, backward elimination, stepwise regression, and best subsets regression.
principle of parsimony
suggests using the smallest number of explanatory variables that can predict the response variable adequately. Regression models with fewer explanatory variables are easier to interpret and are less likely to be affected by interaction or collinearity problems.
If you add variables and the adjusted r^2 decreases, what does this mean?
the extra variables are essentially not pulling their weight and should probably be omitted.
Conditions for good regression models (beyond LINE conditions)
• Relatively few explanatory variables.
• Relatively high r^2 and r^2adj, indicating that much of the variability in y is accounted for by the regression model.
• A small value of Cp (close to or less than k + 1).
• A relatively small value of Se, the standard deviation of the residuals, indicating that the magnitude of the errors is small.
• Relatively small P-values for the F- and t-statistics, showing that the overall model is better than a simple summary with the mean and that the individual parameters are reliably different from zero.