BUAL 2650 EXAM 3


Challenges in Building Regression Models

- For large data sets, checking the data for accuracy, missing values, consistency, and reasonableness can be a major part of the effort.
- The probability that a Type I error will occur increases as the number of models considered increases.
- Best subsets regression and stepwise regression do not check assumptions and conditions, so human judgment must always be brought into the process.

the error terms are autocorrelated, either positively or negatively

Ha for Residual Analysis

Negative Autocorrelation

Positive residuals tend to be followed over time by negative residuals.

True

T/F In situations where two competing models have essentially the same predictive power (as determined by an F-test), choose the more parsimonious of the two

D > 4 - dL

there is evidence of negative autocorrelation

D < dL

there is evidence of positive autocorrelation

D < 4 - dU

there is no evidence of negative autocorrelation

D > dU

there is no evidence of positive autocorrelation
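A minimal Python sketch (not part of the card set; the residuals and the table values dL and dU are made-up illustrative numbers) showing how the Durbin-Watson statistic is computed and how the decision rules above are applied for the positive-autocorrelation test:

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum of squared consecutive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.array([1.2, 0.8, 0.5, -0.3, -0.9, -1.1, 0.4, 1.0])  # hypothetical residuals
D = durbin_watson(e)
dL, dU = 0.95, 1.54  # hypothetical table values for this n and k

if D < dL:
    print(f"D = {D:.2f} < dL: evidence of positive autocorrelation")
elif D > dU:
    print(f"D = {D:.2f} > dU: no evidence of positive autocorrelation")
else:
    print(f"dL <= D = {D:.2f} <= dU: test is inconclusive")
```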

parsimonious model

a general linear model with a small number of β parameters

regression outlier

a residual that is larger than 3s (in absolute value)

T/F Adding dummy variables takes two regressions and turns them into one

True

if you want to fit a first-order model, you need at least ______ different observed x-values.

two

Extrapolating

predicting a y value by extending the regression model to regions outside the range of the x-values of the data.

Assumptions for Random Error ε

1. Mean equal to 0
2. Variance equal to sigma squared
3. Normal distribution
4. Random errors are independent

Base Levels

- This is the category that is left out when making the dummy variables.
- The number of dummy variables to use is k - 1; for example, with the 12 months of the year you would use 11 dummy variables and pick one month (any of them) as the base level.
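A minimal sketch, assuming pandas, of the k - 1 coding described above; the month column is a made-up example, and drop_first=True leaves one level out to serve as the base level:

```python
import pandas as pd

# Hypothetical qualitative variable with k = 3 levels
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Feb"]})

# k - 1 = 2 dummy columns; the dropped level is the base level
dummies = pd.get_dummies(df["month"], drop_first=True)
print(dummies)
```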

Best Subsets Regression

- Choose a single criterion of "best."
- Choose a modest set of potential predictors.
- A computer searches all possible models and reports the best two-predictor model, the best three-predictor model, the best four-predictor model, and so on.
- The technique becomes increasingly impracticable as the size of the data set increases.
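A minimal Python sketch of the idea (illustrative only; it uses r^2 as the single criterion of "best" and made-up data, where real software would report several criteria):

```python
from itertools import combinations
import numpy as np

def r_squared(X, y):
    """r^2 of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                      # 4 candidate predictors
y = 2 + 3 * X[:, 0] - X[:, 2] + rng.normal(size=50)

# For each model size, search all subsets and report the best one
for size in range(1, X.shape[1] + 1):
    best = max(combinations(range(X.shape[1]), size),
               key=lambda cols: r_squared(X[:, list(cols)], y))
    print(size, best, round(r_squared(X[:, list(best)], y), 3))
```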

Best regression models should have:

- Relatively few predictors, to keep the model simple
- Relatively high r^2
- Relatively small value of se
- Relatively small p-values for the F- and t-statistics
- No cases with extraordinarily high leverage
- No cases with extraordinarily large residuals, and Studentized residuals that appear to be nearly Normal
- Predictors reliably measured and relatively unrelated

Stepwise Regression

- The regression builds the model stepwise from a given initial model. At each step, a predictor is either added to or removed from the model.
- The predictor chosen to add is the one whose addition increases the adjusted r^2 the most.
- The predictor chosen to remove is the one whose removal reduces the adjusted r^2 the least.
- The user first identifies the dependent variable, y, and the set of potentially important independent variables, x1, x2, ..., xk, where k is generally large.
- The dependent and independent variables are then entered into a computer stepwise regression program, where the data set goes through a series of steps.
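A minimal Python sketch of forward stepwise selection by adjusted r^2 (illustrative only, on made-up data; real stepwise programs also remove predictors and examine t-test p-values):

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted r^2 of a least-squares fit with an intercept."""
    n, k = len(y), X.shape[1]
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_stepwise(X, y):
    chosen, remaining = [], list(range(X.shape[1]))
    current = -np.inf
    while remaining:
        # add the predictor that increases adjusted r^2 the most
        best = max(remaining, key=lambda j: adj_r2(X[:, chosen + [j]], y))
        score = adj_r2(X[:, chosen + [best]], y)
        if score <= current:          # no improvement: stop
            break
        chosen.append(best)
        remaining.remove(best)
        current = score
    return chosen, current

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=60)
print(forward_stepwise(X, y))   # expected to pick predictors 0 and 3
```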

Which predictors should you keep?

- Variables that are most reliably measured
- Least expensive to find
- Inherently important to the problem
- New variables formed by combining variables

Quadratic Model

- We allow a curve in the relationship between the dependent variable (y) and the independent variable.
- This is done by introducing a quadratic term into the model.
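A minimal Python sketch, fitting the quadratic model y = b0 + b1x + b2x^2 to made-up data with numpy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 5 + 2 * x - 0.3 * x ** 2 + rng.normal(scale=1.0, size=x.size)

b2, b1, b0 = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
print(f"y-hat = {b0:.2f} + {b1:.2f} x + {b2:.2f} x^2")
```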

Dummy Variable

- Also known as an indicator variable.
- Indicates whether there is an inversion: = 1 if "yes", = 0 if "no".
- Ex. all meat items get a 1: hamburger = 1, veggie burger = 0.

Cross-product Term

- Also known as the interaction term.
- It is the part of the regression equation produced by multiplying x1 by x2.
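A minimal Python sketch, building the x1*x2 cross-product column and fitting the interaction model by least squares on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
x2 = rng.integers(0, 2, size=50).astype(float)   # e.g., a dummy variable
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=50)

# Design matrix: intercept, x1, x2, and the cross-product term x1*x2
X = np.column_stack([np.ones(50), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # estimates of b0, b1, b2, b3
```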

Inversion

- Categorical or qualitative data.
- The levels of the variable are the categories of membership (inversion vs. no inversion) that we are using.

Steps in a Residual Analysis

1. Check for a mis-specified model by plotting the residuals against each of the quantitative independent variables. Analyze each plot, looking for a curvilinear trend. This shape signals the need for a quadratic term in the model. Try a second-order term in the variable against which the residuals are plotted.

2. Examine the residual plots for outliers. Draw lines on the residual plots at 2- and 3-standard-deviation distances below and above the 0 line. Examine residuals outside the 3-standard-deviation lines as potential outliers, and check that no more than 5% of the residuals exceed the 2-standard-deviation lines. Determine whether each outlier can be explained as an error in data collection or transcription, corresponds to a member of a population different from that of the rest of the sample, or simply represents an unusual observation. If the observation is determined to be an error, fix it or remove it. Even if you cannot determine the cause, you may want to rerun the regression analysis without the observation to determine its effect on the analysis.

3. Check for non-normal errors by plotting a frequency distribution of the residuals, using a stem-and-leaf display or a histogram. Check for obvious departures from normality. Extreme skewness of the frequency distribution may be due to outliers or could indicate the need for a transformation of the dependent variable.

4. Check for unequal error variances by plotting the residuals against the predicted values, y-hat. If you detect a cone-shaped pattern or some other pattern that indicates that the variance of ε is not constant, refit the model using an appropriate variance-stabilizing transformation on y, such as ln(y).
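A minimal Python sketch, assuming matplotlib and made-up residuals, of the plots used in steps 2-4: a residuals-versus-predicted plot with lines at the 2- and 3-standard-deviation distances, and a histogram of the residuals:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
y_hat = rng.uniform(0, 10, 100)            # hypothetical predicted values
resid = rng.normal(scale=1.0, size=100)    # hypothetical residuals
s = resid.std(ddof=1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(y_hat, resid)              # step 4: look for a cone shape
for k in (-3, -2, 0, 2, 3):
    axes[0].axhline(k * s, linestyle="--") # step 2: 2s and 3s outlier lines
axes[0].set(xlabel="predicted y", ylabel="residual")
axes[1].hist(resid, bins=15)               # step 3: check for non-normality
axes[1].set(xlabel="residual", ylabel="frequency")
plt.show()
```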

Solutions to some problems created by multicollinearity

1. Drop one or more of the correlated independent variables from the model. One way to decide which variables to keep in the model is to employ stepwise regression.
2. If you decide to keep all the independent variables in the model:
a. Avoid making inferences about the individual β parameters based on the t-tests.
b. Restrict inferences about E(y) and future y-values to values of the x's that fall within the range of the sample data.

Why regression models based on time series data are especially susceptible to violations of the independence condition

1. Key independent (x) variables that have not been included in the model are likely to be related to time.
2. The omission of one or more of these independent variables will tend to yield residuals (errors) that are also related to time.

Goals of Re-expression

1. Make the distribution of a variable more symmetric
2. Make the spread of several groups more alike
3. Make the form of a scatterplot more nearly linear
4. Make the scatter in a scatterplot or residual plot spread out evenly rather than following a fan shape

Signs of Multicollinearity in the Regression Model

1. Significant correlations between pairs of independent variables
2. Nonsignificant t-tests for all (or nearly all) of the individual β parameters when the F-test for overall model adequacy is significant
3. Signs opposite from what is expected in the estimated β parameters

How to check if our dummy variable is adding value to our regression

Check its p-value and compare it to alpha: if you reject Ho (p-value less than alpha), the variable is significant and adding value.

DF for Regression in a Simple Regression in an ANOVA Table is ______

ALWAYS one!!

First-order Autocorrelation

Adjacent measurements are related

Why is extrapolation dangerous?

Because it assumes that the relationship between x and y does not change outside the range of the observed data.

How do you simplify the model and improve the t-statistic?

By removing some of the predictors.

When a category is "turned on" what is it coded as?

Coded as 1 vs. turned off = 0

Only difference in Multiple Regression ANOVA Table

Degrees of Freedom Column

What to use when the slopes are the same but the intercept differs

Dummy/indicator variable

Second-order Autocorrelation

Every other measurement is related

error terms are not autocorrelated

Ho for Residual Analysis

Multicollinearity

If the correlation between x1 and x2 is too high, it will cause either x1 or x2 to be redundant.

the mean of the residuals is equal to 0

Property 1 of Regression Residuals

the standard deviation of the residuals is equal to the standard deviation of the fitted regression model, s

Property 2 of Regression Residuals

Errors are normally distributed

Property 3 of Regression Residuals

Errors are independent

Property 4 of Regression Residuals

Positive Autocorrelation

Residuals tend to be followed by residuals with the same sign

Difference between Simple and Multiple Regression (ANOVA Table)

Simple has only one independent variable; multiple has at least two independent variables.

Single Regression Model

When the data can be divided into two groups with similar regression slopes

interaction

A relationship between our two independent variables: we add an extra term to the model, and that term depends on both independent variables.

How a Single Regression Model is accomplished

by including an indicator (or "dummy") variable that indicates whether there is an inversion or not

Durbin-Watson Statistic

estimates the autocorrelation by summing squares of consecutive differences and comparing the sum with its expected value
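For reference, a standard form of the statistic (the usual textbook formula, stated here for completeness rather than taken from the cards):

$$D = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$

where e_t is the residual at time t. D ranges from 0 to 4, with values near 2 indicating no autocorrelation, consistent with the cards below.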

nested

Two models are nested if one model contains all the terms of the second model plus at least one additional term.

What to use when the slopes and the intercept differ

interaction

How dummy variables indicate an inversion:

Inversion = 1 if "yes" (category is turned 'on'); = 0 if "no" (category is turned 'off')

DF for Regression in Multiple Regression

k, the number of independent variables in the problem

Dummy Variables Equation

k - 1: always use a number of dummy variables that is one less than the number of levels of the qualitative variable.

Total DF in a Simple Regression in ANOVA is ________

n-1

DF for Error for Simple Regression in ANOVA is _____

n-2

DF For Error in Multiple Regression

n - k - 1, where n = total sample size and k = number of independent variables in the problem
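For example (hypothetical numbers): with n = 25 observations and k = 3 independent variables, df(Regression) = k = 3, df(Error) = n - k - 1 = 21, and df(Total) = n - 1 = 24; the first two always sum to the third.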

autocorrelated

points near each other in time will be related

reject H0 in a nested model F-test

preferred: complete model

fail to reject H0 in a nested model F-test

preferred: reduced model
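For reference, the usual nested-model F statistic (the standard textbook formula, stated here for completeness; SSE_R and SSE_C are the sums of squared errors of the reduced and complete models, which have g and k β parameters, respectively, beyond the intercept):

$$F = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[n - (k + 1)]}$$

Reject H0 (that the extra β parameters are all zero) when F exceeds its critical value; the complete model is then preferred.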

4 - dU < D < 4 - dL

test is inconclusive

dL < D < dU

test is inconclusive

Multicollinearity is measured in terms of _____

the R^2 between a predictor and all of the other predictors in the model. It is not measured in terms of the correlation between any two predictors.
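A minimal Python sketch of that measurement (made-up data): regress each predictor on all of the others, report the resulting R^2, and also the variance inflation factor VIF = 1/(1 - R^2), a standard companion statistic:

```python
import numpy as np

def r2_on_others(X, j):
    """R^2 from regressing column j of X on all the other columns."""
    others = np.delete(X, j, axis=1)
    X1 = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(X1, X[:, j], rcond=None)
    resid = X[:, j] - X1 @ beta
    return 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 3))
# add a near-duplicate of the first column to create multicollinearity
X = np.column_stack([X, X[:, 0] + 0.1 * rng.normal(size=80)])

for j in range(X.shape[1]):
    r2 = r2_on_others(X, j)
    print(f"x{j + 1}: R^2 = {r2:.3f}, VIF = {1 / (1 - r2):.1f}")
```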

complete model

the more complex of the two models; also called "full"

reduced model

the simpler of the two models

if you wanted to fit a quadratic model, at least _________ different x-values must be observed.

three

negative autocorrelation

values above 2 are evidence of WHAT?

positive autocorrelation

values below 2 are evidence of WHAT?

no autocorrelation

values that equal 2 are evidence of WHAT?

quadratic model

we can allow a curve in the relationship between the dependent variable (y) and an independent variable

