BUAL 2650 EXAM 3
Challenges in Building Regression Models
- For large data sets, checking the data for accuracy, missing values, consistency, and reasonableness can be a major part of the effort.
- The probability that a Type I error will occur increases as the number of models considered increases.
- Best subsets regression and stepwise regression do not check assumptions and conditions, so human judgment must always be part of the process.
the error terms are autocorrelated, either positively or negatively
Ha for Residual Analysis
Negative Autocorrelation
Positive residuals tend to be followed over time by negative residuals
True
T/F In situations where two competing models have essentially the same predictive power (as determined by an F-test), choose the more parsimonious of the two
D > 4 - dL
there is evidence of negative autocorrelation
D < dL
there is evidence of positive autocorrelation
D < 4 - dU
there is no evidence of negative autocorrelation
D > dU
there is no evidence of positive autocorrelation
parsimonious model
a general linear model with a small number of β parameters
regression outlier
a residual that is larger than 3s (in absolute value)
T/F Adding dummy variables takes two regressions and turns them into one
True
if you want to fit a first-order model, you need at least ______ observed x values.
two
Extrapolating
predicting a y value by extending the regression model to regions outside the range of the x-values of the data.
Assumptions for Random Error ε
1. Mean equal to 0
2. Variance equal to sigma squared
3. Normal distribution
4. Random errors are independent
Base Levels
- This is the category that is left out when making dummy variables.
- Ex. the number of dummy variables you should use is k - 1, so if you were using months of the year you would have 11 dummy variables and pick one month to be the base level; it could be any of them.
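A minimal Python sketch (not from the course materials) of k - 1 dummy coding; the DataFrame `df` and the column name `month` are made up for illustration:

```python
import pandas as pd

# Hypothetical data: 'month' is a qualitative variable.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Feb"]})

# drop_first=True keeps k - 1 dummies; the dropped level ("Feb" here,
# first in sorted order) becomes the base level. With all 12 months
# present, this would leave 11 dummy variables.
dummies = pd.get_dummies(df["month"], drop_first=True)
print(dummies)
```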
Best Subsets Regression
- Choose a single criterion of "best."
- Choose a modest set of potential predictors.
- A computer searches all possible models and reports the best two-predictor model, the best three-predictor model, the best four-predictor model, and so on.
- The technique becomes increasingly impracticable as the size of the data set increases.
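A rough Python sketch (simulated data, not from the course) of the search idea behind best subsets regression, scoring each subset by adjusted R^2 with statsmodels:

```python
import itertools
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1 and x2; x3, x4 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50)
names = ["x1", "x2", "x3", "x4"]

# Exhaustively fit every subset and keep the best model (by adjusted
# R^2) at each size. Cost grows like 2^k, hence impracticable for
# large predictor pools.
best = {}
for size in range(1, len(names) + 1):
    for cols in itertools.combinations(range(len(names)), size):
        fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        if size not in best or fit.rsquared_adj > best[size][0]:
            best[size] = (fit.rsquared_adj, [names[i] for i in cols])

for size, (r2a, kept) in sorted(best.items()):
    print(size, round(r2a, 3), kept)
```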
Best regression models should have:
- Relatively few predictors, to keep the model simple
- Relatively high R^2
- Relatively small value of se (the standard error of the regression)
- Relatively small p-values for the F- and t-statistics
- No cases with extraordinarily high leverage
- No cases with extraordinarily large residuals, and Studentized residuals that appear to be nearly Normal
- Predictors reliably measured and relatively unrelated
Stepwise Regression
- The regression builds the model stepwise from a given initial model. At each step, a predictor is either added to or removed from the model.
- The predictor chosen to add is the one whose addition increases the adjusted R^2 the most.
- The predictor chosen to remove is the one whose removal reduces the adjusted R^2 the least.
- The user first identifies the dependent variable, y, and the set of potentially important independent variables, x1, x2, ..., xk, where k is generally large.
- The dependent and independent variables are then entered into a computer stepwise regression program, where the data set goes through a series of steps.
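A simplified forward-only sketch of the stepwise idea (real stepwise programs also remove predictors); the adjusted-R^2 criterion matches the card above, but the data and the helper function `adj_r2` are made up:

```python
import numpy as np
import statsmodels.api as sm

def adj_r2(y, X, cols):
    """Adjusted R^2 of an OLS fit using the given predictor columns."""
    return sm.OLS(y, sm.add_constant(X[:, cols])).fit().rsquared_adj

# Simulated data: only the first two predictors matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=60)

selected, remaining, current = [], list(range(4)), -np.inf
while remaining:
    # Add the predictor that raises adjusted R^2 the most;
    # stop when no addition helps.
    score, j = max((adj_r2(y, X, selected + [j]), j) for j in remaining)
    if score <= current:
        break
    current = score
    selected.append(j)
    remaining.remove(j)

print("selected columns:", selected, "adjusted R^2:", round(current, 3))
```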
Which predictors should you keep?
- Variables that are most reliably measured
- Least expensive to find
- Inherently important to the problem
- New variables formed by combining variables
Quadratic Model
- We allow a curve in the relationship between the dependent variable (y) and the independent variable (x).
- This is done by introducing a quadratic term (x^2) into the model.
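A minimal statsmodels sketch (simulated data) showing that the quadratic term enters as just another column, E(y) = b0 + b1*x + b2*x^2:

```python
import numpy as np
import statsmodels.api as sm

# Simulated curved data.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 3 + 2 * x - 0.4 * x**2 + rng.normal(size=40)

# The quadratic term x^2 is an extra column in the design matrix.
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # estimates of b0, b1, b2
```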
Dummy Variable
- Also known as an indicator variable
- Indicates whether a condition holds (e.g., an inversion)
- = 1 if "yes"
- = 0 if "no"
- Ex. all meat items get a 1: hamburger = 1, veggie burger = 0
Cross-product Term
- Also known as the interaction term
- It is the part of the regression equation that is produced by multiplying x1 by x2
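A minimal sketch (simulated data) of a cross-product term: the x1*x2 column lets the slope on x1 change with the level of x2:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where the effect of x1 on y depends on x2.
rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
x2 = rng.binomial(1, 0.5, size=50)  # e.g., a dummy variable
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=50)

# The interaction column is literally x1 multiplied by x2.
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
print(sm.OLS(y, X).fit().params)
```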
Inversion
- Categorical or qualitative data
- The levels of the variable indicate membership (inversion vs. no inversion)
Steps in a Residual Analysis
1. Check for a mis-specified model by plotting the residuals against each of the quantitative independent variables. Analyze each plot, looking for a curvilinear trend. This shape signals the need for a quadratic term in the model. Try a second-order term in the variable against which the residuals are plotted.
2. Examine the residual plots for outliers. Draw lines on the residual plots at 2- and 3-standard-deviation distances below and above the 0 line. Examine residuals outside the 3-standard-deviation lines as potential outliers, and check that no more than 5% of the residuals exceed the 2-standard-deviation lines. Determine whether each outlier can be explained as an error in data collection or transcription, corresponds to a member of a population different from that of the remainder of the sample, or simply represents an unusual observation. If the observation is determined to be an error, fix it or remove it. Even if you cannot determine the cause, you may want to rerun the regression analysis without the observation to determine its effect on the analysis.
3. Check for non-normal errors by plotting a frequency distribution of the residuals, using a stem-and-leaf display or a histogram. Check to see if obvious departures from normality exist. Extreme skewness of the frequency distribution may be due to outliers or could indicate the need for a transformation of the dependent variable.
4. Check for unequal error variances by plotting the residuals against the predicted values, y-hat. If you detect a cone-shaped pattern or some other pattern that indicates that the variance of ε is not constant, refit the model using an appropriate variance-stabilizing transformation on y, such as ln(y).
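A small matplotlib/statsmodels sketch (simulated data, not from the course) covering pieces of steps 2-4: residuals vs. fitted values with 2s and 3s reference lines, plus a histogram of the residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data and a fitted simple regression.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=80)
y = 5 + 2 * x + rng.normal(scale=2, size=80)
fit = sm.OLS(y, sm.add_constant(x)).fit()

resid = fit.resid
s = np.sqrt(fit.mse_resid)  # estimate of sigma

# Steps 2 and 4: residuals vs. fitted, with 0, +/-2s, +/-3s lines.
plt.scatter(fit.fittedvalues, resid)
for k in (-3, -2, 0, 2, 3):
    plt.axhline(k * s, linestyle="--" if k else "-")
plt.xlabel("fitted values")
plt.ylabel("residuals")

# Step 3: histogram of the residuals to check for non-normal errors.
plt.figure()
plt.hist(resid, bins=15)
plt.show()
```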
Solutions to some problems created by multicollinearity
1. Drop one or more of the correlated independent variables from the model. One way to decide which variables to keep in the model is to employ stepwise regression.
2. If you decide to keep all the independent variables in the model:
   a. Avoid making inferences about the individual β parameters based on the t-tests.
   b. Restrict inferences about E(y) and future y-values to values of the x's that fall within the range of the sample data.
Why regression models based on time series data are especially susceptible to violating the independence condition
1. Key independent (x) variables that have not been included in the model are likely to be related to time.
2. The omission of one or more of these independent variables will tend to yield residuals (errors) that are also related to time.
Goals of Re-expression
1. Make the distribution of a variable more symmetric
2. Make the spread of several groups more alike
3. Make the form of a scatterplot more nearly linear
4. Make the scatter in a scatterplot or residual plot spread out evenly rather than following a fan shape
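A tiny numpy sketch (simulated data) of goal 1: a log re-expression pulls a right-skewed variable toward symmetry, which is why ln(y) is a common variance-stabilizing choice:

```python
import numpy as np

# Simulated right-skewed variable.
rng = np.random.default_rng(5)
y = rng.lognormal(mean=0, sigma=1, size=1000)
log_y = np.log(y)  # the re-expressed variable

# Skewed: mean well above median. Re-expressed: mean ~ median.
print(round(float(np.mean(y)), 2), round(float(np.median(y)), 2))
print(round(float(np.mean(log_y)), 2), round(float(np.median(log_y)), 2))
```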
Signs of Multicollinearity in the Regression Model
1. Significant correlations between pairs of independent variables
2. Nonsignificant t-tests for all (or nearly all) of the individual β parameters when the F-test for overall model adequacy is significant
3. Signs opposite from what is expected in the estimated β parameters
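A short statsmodels sketch (simulated data) that quantifies sign 1 with variance inflation factors; VIF_j = 1/(1 - R^2_j), where R^2_j comes from regressing predictor j on the others:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors: x2 is nearly a copy of x1.
rng = np.random.default_rng(6)
x1 = rng.normal(size=60)
x2 = x1 + rng.normal(scale=0.1, size=60)
x3 = rng.normal(size=60)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIFs well above 10 are a common warning sign of multicollinearity;
# expect huge values for x1 and x2 here, and ~1 for x3.
for j in range(1, X.shape[1]):
    print(f"VIF x{j}: {variance_inflation_factor(X, j):.1f}")
```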
How to check if our dummy variable is adding value to our regression
Check its p-value and compare it to alpha; if you reject H0, the dummy variable is significant and adding value
DF for Regression in a Simple Regression in an ANOVA Table is ______
ALWAYS one!!
First-order Autocorrelation
Adjacent measurements are related
Why is extrapolation dangerous?
Because it assumes that the relationship between x and y does not change outside the range of the observed data
How do you simplify the model and improve the t-statistic?
By removing some of the predictors.
When a category is "turned on" what is it coded as?
Coded as 1 (vs. 0 when turned off)
Only difference in Multiple Regression ANOVA Table
Degrees of Freedom Column
What to use when the slopes are the same but the intercept differs
Dummy/indicator variable
Second-order Autocorrelation
Every other measurement is related
error terms are not autocorrelated
Ho for Residual Analysis
Multicollinearity
If the correlation between x1 and x2 is too high, either x1 or x2 becomes redundant
the mean of the residuals is equal to 0
Property 1 of Regression Residuals
the standard deviation of the residuals is equal to the standard deviation of the fitted regression model, s
Property 2 of Regression Residuals
Errors are normally distributed
Property 3 of Regression Residuals
Errors are independent
Property 4 of Regression Residuals
Positive Autocorrelation
Residuals tend to be followed by residuals with the same sign
Difference between Simple and Multiple Regression (ANOVA Table)
Simple has only one independent variable; multiple has at least two independent variables
Single Regression Model
When the data can be divided into two groups with similar regression slopes
interaction
a relationship between our two independent variables; we add an extra term to the model whose value depends on both independent variables
How a Single Regression Model is accomplished
by including an indicator (or "dummy") variable that indicates whether or not there is an inversion
Durbin-Watson Statistic
estimates the autocorrelation by summing squares of consecutive differences and comparing the sum with its expected value
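A quick numpy/statsmodels check (simulated residuals) that the Durbin-Watson statistic is D = sum((e_t - e_(t-1))^2) / sum(e_t^2); independent residuals give D near 2:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Simulated (independent) residuals.
rng = np.random.default_rng(7)
e = rng.normal(size=100)

# Sum of squared consecutive differences over the sum of squares.
d_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d_manual, 3), round(durbin_watson(e), 3))  # the two agree
```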
nested
two models are nested if one model contains all the terms of the second model plus at least one additional term
What to use when the slopes and the intercept differ
interaction
How dummy variables indicate an inversion:
Inversion = 1 if "yes" (category is turned 'on'); = 0 if "no" (category is turned 'off')
DF for Regression in Multiple Regression
k (the number of independent variables in the problem)
Dummy Variables Equation
k - 1: always use a number of dummy variables that is one less than the number of levels of the qualitative variable
Total DF in a Simple Regression in ANOVA is ________
n-1
DF for Error for Simple Regression in ANOVA is _____
n-2
DF For Error in Multiple Regression
n - k - 1, where n = total sample size and k = number of independent variables in the problem
autocorrelated
points near each other in time will be related
reject H0 in a nested model F-test
preferred: complete model
fail to reject H0 in a nested model F-test
preferred: reduced model
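A minimal statsmodels sketch (simulated data) of the nested model F-test: fit the reduced and complete models, then let compare_f_test report the F statistic and p-value:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: x2 genuinely adds information beyond x1.
rng = np.random.default_rng(8)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
y = 1 + 2 * x1 + 0.8 * x2 + rng.normal(size=50)

reduced = sm.OLS(y, sm.add_constant(x1)).fit()
complete = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Small p-value -> reject H0 -> prefer the complete model.
f_stat, p_value, df_diff = complete.compare_f_test(reduced)
print(round(f_stat, 2), round(p_value, 4), df_diff)
```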
4 - dL < D < 4 - dU
test is inconclusive
dL < D < dU
test is inconclusive
Multicollinearity is measured in terms of _____
the R^2 between a predictor and all of the other predictors in the model; it is not measured in terms of the correlation between any two predictors
complete model
the more complex of the two models; also called "full"
reduced model
the simpler of the two models
if you wanted to fit a quadratic model, at least _________ different x-values must be observed.
three
negative autocorrelation
Durbin-Watson values above 2 are evidence of WHAT?
positive autocorrelation
Durbin-Watson values below 2 are evidence of WHAT?
no autocorrelation
Durbin-Watson values equal to 2 are evidence of WHAT?
quadratic model
we can allow a curve in the relationship between the dependent variable (y) and an independent variable