P507 Final Exam

Durbin Watson Statistic Range

Ranges from 0 to 4. 0 = perfect positive serial correlation; 2 = no (first-order) autocorrelation; 4 = perfect negative serial correlation. In the case of first-order autocorrelation, d ≈ 2(1 - ρ).
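A minimal sketch of how d can be computed from a list of residuals, using the standard definition d = Σ(e_t - e_(t-1))² / Σe_t². The residual values and function names below are hypothetical and only illustrate the two ends of the range.

```python
def durbin_watson(residuals):
    """Return d = sum((e_t - e_(t-1))^2) / sum(e_t^2)."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def rho_from_d(d):
    """Invert the approximation d = 2(1 - rho)."""
    return 1 - d / 2


# Hypothetical residuals that alternate in sign (negative serial correlation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# Hypothetical residuals that drift slowly (positive serial correlation).
drifting = [1.0, 1.1, 1.2, 1.1, 1.0, 0.9]

d_alt = durbin_watson(alternating)    # above 2
d_drift = durbin_watson(drifting)     # close to 0
print(d_alt, d_drift, rho_from_d(2.0))
```

Alternating residuals push d toward 4, slowly drifting residuals push it toward 0, and d = 2 maps back to ρ = 0.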

White's Test

White's test checks for heteroscedasticity in our dataset. H0: homoscedasticity; Ha: heteroscedasticity. Find the critical value in the chi-sq table for k' - 1 degrees of freedom. If White's statistic is greater than the critical value, reject the null of homoscedasticity.

Causes of Autocorrelation

Note: autocorrelation violates the non-autoregression GM assumption. Misspecification (faulty model, human error): 1) incorrect specification of a variable's form or relationship; 2) omission of a relevant variable. Serial correlation (non-faulty model, nature of the data): 3) smooth or cyclical data movement over time; 4) consistent measurement error for Y.

Practical significance and shortcomings of Logit model analysis and interpretations.

It is possible to compute the probability of Y occurring, which is useful. Logit models are more difficult to interpret (not as intuitive as OLS), and they're less precise. Logit can tell you whether a relationship is significant, but it doesn't do well at telling you the strength of that relationship. You need to have more than 100 observations to fulfill GM.

Effects of Autocorrelation

1) OLS estimators are still unbiased because the mean of the error term still equals 0. 2) OLS estimators are no longer best (minimum variance) because the variance of the error term depends on rho. 3) The estimated standard deviation is biased, implying that the SE, t, F and R2 are also biased.

Potential causes of heteroscedasticity?

1) Error-learning: people learn from their mistakes and do better going forward, so the errors are not constant (funnel pointing right). 2) Degree of choice with socioeconomic variables: as discretionary income increases, the spread of the errors increases (funnel pointing left). 3) Measurement errors decrease with increasing size of the unit of analysis. 4) Outliers in small samples can really alter regression results. 5) Misspecification due to omission of variables (concentrated errors).

Use and characteristics of Two-Stage Least Squares (2SLS)

2SLS replaces an endogenous variable with an exogenous proxy/IV to help us get to one-way causality. 2SLS estimates won't be unbiased (not BLUE), but they're consistent.
1) Regress Y1 on all of the predetermined variables in the model (to get Y1-hat).
2) Regress Y2 on Y1-hat and the rest of the variables in the original equation (using the relationship between X and the IV to express Y).
Y1-hat acts as an instrument and cannot be correlated with the error.
Result of 2SLS: the t and F statistics are asymptotically valid. R^2 does not indicate goodness of fit or the percentage of variation explained. 2SLS can also be used for simultaneous equations.
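The two stages can be sketched for the simplest case: one endogenous regressor and one instrument. This is only an illustration under that assumption; the data and the names `z`, `x`, `y` are hypothetical (`x` plays the role of Y1, `y` the role of Y2, and `z` is the single predetermined variable).

```python
def mean(v):
    return sum(v) / len(v)


def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(z, x, y):
    # Stage 1: regress the endogenous regressor on the instrument
    # and keep the fitted values (the "hat" variable).
    a0, a1 = simple_ols(z, x)
    x_hat = [a0 + a1 * zi for zi in z]
    # Stage 2: regress the dependent variable on the fitted values.
    return simple_ols(x_hat, y)


# Hypothetical data: instrument z, endogenous regressor x, outcome y.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
y = [3.0, 4.1, 6.2, 6.9, 9.1, 9.8]

intercept_2sls, slope_2sls = two_stage_least_squares(z, x, y)
print(intercept_2sls, slope_2sls)
```

With a single instrument, the second-stage slope equals cov(z, y) / cov(z, x), which is a convenient check on the two-step arithmetic. The standard errors from a hand-run second stage are not the correct 2SLS standard errors, so this sketch reports coefficients only.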

How simultaneous equation bias originates.

An econometric model is simultaneous if all of the equations (built into the model) are required to determine the value of at least one endogenous variable. When constructing such a model, using the same variables across embedded equations can lead to independent variables being related to the error term, which violates the classical linear model assumption that the independent variables are fixed or uncorrelated with the error term (GLM Assumption 6). Because this assumption is violated, applying OLS to these equations to estimate the structural coefficients would yield estimates that are not only biased but also inconsistent. In other words, the bias comes from endogenous variables appearing on the right side of the equation.

Characteristics of the Logit model.

Based on the cumulative logistic distribution function, Pi = e^(Zi) / [1 + e^(Zi)], where e is the base of the natural log, Zi is the linear combination of the Xs, and Pi is the probability of Y occurring. Superior to the LPM because: 1) the probability of the dependent variable occurring always falls between 0 and 1; 2) the relationship between the Xs and Y is non-linear, which is generally preferable. Some issues with logit: 1) interpretation is tricky; 2) loss of variation across variables is hard to see due to grouping data in the independent variables; 3) it requires a sample size of over 100 due to the DFs needed and the structure.

After the WLS estimation is carried out, the estimated equation is: Y* = 13.64 + 4.75 X(2)* + 0.882 X(3)* + u-hat* R2 = 0.912 F = 16.41 n = 40 t stats: (7.5) (8.9) Using these estimation results, write out the final model.

Either: 1) flip the intercept and the parameter estimate for the variable that you corrected for, or 2) run the WLS without an intercept, in which case the variable that has been modified becomes the intercept variable. You weighted by the second variable. When weighting by it, the X2 gets cancelled out, making the second parameter estimate the intercept. Final model: Y = 4.75 + 13.64 X(2) + 0.882 X(3) + u-hat

Interpretation of the Odds Ratio from a Logit regression.

Given a one unit increase in X, the odds of Y happening increase/decrease by ____, holding the effects of all other variables constant. A one unit increase in the independent variable X changes the odds of Y by ____ times; thus the odds increase by ____%, holding the effects of all else constant.
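The blanks in this interpretation come from exponentiating the logit coefficient. A small sketch, where the coefficient value 0.25 is hypothetical:

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logit coefficient: exp(beta)."""
    return math.exp(beta)


def percent_change_in_odds(beta):
    """Percent change in the odds for a one unit increase in X."""
    return (math.exp(beta) - 1) * 100


beta_hat = 0.25  # hypothetical parameter estimate from a logit regression
print(round(odds_ratio(beta_hat), 3))              # about 1.284
print(round(percent_change_in_odds(beta_hat), 1))  # about 28.4
```

So a coefficient of 0.25 would be read as: a one unit increase in X multiplies the odds of Y by about 1.284, an increase of about 28.4%, holding all else constant. A negative coefficient gives an odds ratio below 1 and a percent decrease.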

Interpretation of the parameter estimate from a Logit regression.

Given a one unit increase in the independent variable, the log of the odds ratio of the dependent variable will increase/decrease by [parameter estimate], holding the effects of the other independent variables constant. The intercept, β1, is the log of the odds ratio when X is 0. Note that since these parameters relate to the log of the odds ratio rather than to the dependent variable directly, parameter estimates obtained from the Logit Model are not directly comparable with those from the LPM.

Using the Goldfeld-Quandt procedure, test for the presence of the indicated form of heteroscedasticity by calculating the appropriate statistic and giving your conclusion. model: Y = 3.98 + 14.87 X2 + 0.863 X3 + ui R2 = 0.893 F = 10.25 n = 40 t stats (5.6) (6.8) RSS(high) = 48.93 RSS(low) = 2.64

H0: homoscedasticity
H1: heteroscedasticity
F(calc) = RSS(high) / RSS(low) = 48.93 / 2.64 = 18.53 (RSS(high) goes in the numerator because it is the greater value)
DF = (n - c - 2k) / 2 = (40 - 8 - (2*3)) / 2 = 13, so 13 degrees of freedom in both the numerator and the denominator
Critical F (13, 13, .05) = 2.5769
F(calc) > F*, so we reject H0.
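The arithmetic on this card can be reproduced in a few lines. The function name is hypothetical; the inputs (RSS values, n = 40, c = 8 omitted central observations, k = 3 parameters) and the critical value 2.5769 are taken from the card.

```python
def goldfeld_quandt(rss_high, rss_low, n, c, k):
    """Return (F statistic, degrees of freedom per sub-sample)."""
    df = (n - c - 2 * k) // 2
    # Each RSS is divided by its own df; the df are equal here, so they cancel.
    f_calc = (rss_high / df) / (rss_low / df)
    return f_calc, df


f_calc, df = goldfeld_quandt(48.93, 2.64, n=40, c=8, k=3)
f_crit = 2.5769  # critical F(13, 13) at the 5% level, as given on the card
reject_h0 = f_calc > f_crit

print(round(f_calc, 2), df, reject_h0)
```

Because the calculated F is far above the critical value, the null of homoscedasticity is rejected, matching the conclusion on the card.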

Instrumental Variables (short)

If there are confounding variables that affect X but are not accounted for in X's measurement, we can end up with correlation between our X and the error term (because the effect of the unobserved variables is picked up by the error term). The IV can be used to overcome this endogeneity problem. It is a variable that doesn't suffer from X's endogeneity issues and is related to Y only through X.

Why is it important to identify the source of Heteroscedasticity when correcting for this problem?

In order to correct for it, you need to know the source of the heteroscedasticity. If you know the source, you can weight by that variable in order to eliminate the heteroscedasticity. Under heteroscedasticity, OLS estimators are no longer minimum variance (best), and estimates of the variances of the parameter estimates are biased, which affects the accuracy of the t and F statistics. Correcting under an incorrect assumption could make the heteroscedasticity worse.

Purpose served by the concordance/discordance results from a Logit regression.

It shows us how desirable the Logit model is (we want the pairs to be concordant) and how accurately the model predicts the outcome. Percent concordant: percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event). Percent discordant: percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event). Percent tied: percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (non-event). Higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model; it is basically a measure of the reliability of the model.
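The pair counting behind these percentages can be sketched as follows. This is a simplified illustration that compares predicted probabilities exactly; statistical packages may bin or round probabilities before declaring a tie. The data are hypothetical.

```python
def concordance(probs, outcomes):
    """Return (% concordant, % discordant, % tied) over event/non-event pairs."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(probs, outcomes) if y == 0]
    conc = disc = tied = 0
    for p_event in events:
        for p_non in non_events:
            if p_event > p_non:
                conc += 1
            elif p_event < p_non:
                disc += 1
            else:
                tied += 1
    total = len(events) * len(non_events)
    return tuple(100 * count / total for count in (conc, disc, tied))


# Hypothetical predicted probabilities and observed outcomes (1 = event).
probs = [0.9, 0.7, 0.4, 0.6, 0.4, 0.2]
outcomes = [1, 1, 1, 0, 0, 0]

pct_conc, pct_disc, pct_tied = concordance(probs, outcomes)
print(round(pct_conc, 1), round(pct_disc, 1), round(pct_tied, 1))
```

Here there are 3 × 3 = 9 event/non-event pairs: 7 are concordant, 1 is discordant and 1 is tied.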

Logit regression SAS full output interpretation and tests.

Likelihood ratio (like the F test): H0: the model is not significant (betas = 0). Ha: the model is significant (betas ≠ 0). With a chi-sq value for the likelihood ratio of [blank] and a p-value of [blank], we reject (or fail to reject) H0; rejecting indicates statistically significant evidence of a relationship.
Pseudo-R2 of [blank] suggests that [blank%] of the variation in the dependent variable can be explained by changes in the independent variables.
Parameter estimates:
- β1 = (value) is the predicted value of the log of the odds ratio when all the independent variables are equal to zero.
- β2: a one unit increase in X leads to a (value) change in the log of the odds ratio of Y, holding the effects of other independent variables constant.
Chi-sq for each independent variable (like t stats).
Interpret the odds ratio for each independent variable: a one unit increase in X results in a (value) times increase in the odds of Y, holding the effects of other independent variables constant.
Percent Concordant = (value)% of the pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).
Percent Discordant = (value)% of the pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event).
AIC: the lowest AIC tells us the model with the most explanatory power (the covariate + intercept measure should be lower than intercept alone).
SC: the lowest SC tells us the model with the most predictive power (the covariate + intercept measure should be lower than intercept alone).
Predicted probabilities: solve for Y-hat, i.e. put the values into the regression equation and solve. Then plug Y-hat into [e^(Y-hat)]/[1+e^(Y-hat)] to get the predicted probability of the event happening.
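The predicted-probability step can be sketched directly from the formula [e^(Y-hat)]/[1+e^(Y-hat)]. The coefficient and X values below are hypothetical and do not come from any actual SAS output.

```python
import math


def predicted_probability(betas, xs):
    """betas[0] is the intercept; betas[1:] pair up with the values in xs."""
    y_hat = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(y_hat) / (1 + math.exp(y_hat))


betas = [-1.5, 0.8, 0.05]  # hypothetical parameter estimates
xs = [1.0, 10.0]           # hypothetical values of the independent variables

p = predicted_probability(betas, xs)
print(round(p, 3))  # Y-hat is -0.2, so p is about 0.45
```

Whatever the value of Y-hat, the result stays strictly between 0 and 1, and Y-hat = 0 corresponds to a probability of 0.5.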

What were the conclusions reached by Donovan et al. in the article on the relationship between trees and human health?

Results suggest that loss of trees to the emerald ash borer increased mortality related to cardiovascular and lower-respiratory-tract illness. This finding adds to the growing evidence that the natural environment provides major public health benefits.

How simultaneous equation bias originates (short)

Simultaneous equations occur when all of the equations in a model have to be used to estimate the value of at least one of the endogenous variables. Using the same variable across embedded equations can lead to it becoming correlated with the error term, which violates the 6th assumption of the GLM (so applying OLS will give biased and inconsistent estimates).

Concept and use of an instrumental variable.

Sometimes one of our independent variables does not satisfy the OLS assumption of no correlation with the error term (it is endogenous). This happens when confounding factors that are not captured by its measurement affect Xi (e.g. measurement error, reverse causality, self-selection). These unobserved factors are captured by ui, which is therefore correlated with Xi. This biases the parameter estimates. We use instrumental variables as a solution to independent variables being endogenous. The IV is related to Yi through Xi only (exclusion restriction) and is uncorrelated with ui. Good instruments are usually generated by real or natural experiments. We use 2SLS to estimate an IV model.

Exclusion restriction of the instrumental variable approach.

The instrumental variable is related to Y through X only and is therefore uncorrelated with the error term u. This lets us capture the effect of X on Y without X's endogeneity problems. Basically, the IV must help explain Y (through X) without explaining any of the error term (it is exogenously determined).

Structure of the auxiliary regression for White's test for a given regression equation.

The equation depends on the sample size and the number of variables we have. The squared residuals from the main regression are regressed on all of the independent variables, the squared independent variables, and the cross-products of the independent variables. U-hat^2 = normal betas and X variables + squared variables + cross-products + vi. For two independent variables: U-hat^2 = B1 + B2X2 + B3X3 + B4X2^2 + B5X3^2 + B6(X2*X3) + vi
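A short sketch of how the right-hand-side terms of the auxiliary regression are assembled: levels, squares, then pairwise cross-products. The helper names are hypothetical; the regression itself would still be run in a statistics package.

```python
from itertools import combinations


def white_auxiliary_terms(names):
    """Names of the regressors in White's auxiliary regression (no constant)."""
    levels = list(names)
    squares = [f"{n}^2" for n in names]
    crosses = [f"{a}*{b}" for a, b in combinations(names, 2)]
    return levels + squares + crosses


def white_auxiliary_row(values):
    """Regressor values for one observation, in the same order as the names."""
    levels = list(values)
    squares = [v ** 2 for v in values]
    crosses = [a * b for a, b in combinations(values, 2)]
    return levels + squares + crosses


terms = white_auxiliary_terms(["X2", "X3"])
row = white_auxiliary_row([2.0, 3.0])
print(terms)  # ['X2', 'X3', 'X2^2', 'X3^2', 'X2*X3']
print(row)    # [2.0, 3.0, 4.0, 9.0, 6.0]
```

With two independent variables this gives five regressors plus the constant, which matches B2 through B6 in the equation on this card. With three variables the list grows to nine terms.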

What are the effects of omitting a relevant variable that is uncorrelated with the included variables in an OLS regression?

The parameter estimates for the included independent variables will not be biased, but the intercept will be biased because it absorbs part of the effect of the missing variable. OLS estimators are still "best" (minimum variance). Estimates of the variance of the parameter estimates will be biased upwards. The t and F statistics will not be accurate.

Omitted variable bias essay question (short)

This is because the omitted variable caused the model to be misspecified and the parameter estimate of the included variable to be biased. Rather than the estimated value of the parameter being equal to the true value, it is equal to the true value plus some additional value whose exact amount depends on the amount of correlation between the included and omitted variables. In other words, the parameter estimate of X2 is picking up the effects of the omitted variables. Once the omitted variables are included, the included variable is unbiased and may become insignificant. We can conclude that the variables are correlated: including the omitted variables would not have affected the included variable's bias and significance if they were uncorrelated.
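The "true value plus some additional value" statement can be checked numerically. In the two-regressor case, the slope from the short regression equals β2 + β3·b32, where b32 is the slope from regressing the omitted X3 on the included X2. The data and true coefficients below are hypothetical, and the data are noise-free so the identity holds exactly.

```python
def slope(x, y):
    """Slope from regressing y on a single regressor x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


beta1, beta2, beta3 = 1.0, 2.0, 3.0  # hypothetical true parameters
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Case 1: the omitted X3 is correlated with X2.
x3_corr = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y_corr = [beta1 + beta2 * a + beta3 * b for a, b in zip(x2, x3_corr)]
b2_short = slope(x2, y_corr)   # estimate of beta2 when X3 is left out
b32 = slope(x2, x3_corr)       # slope of X3 on X2

# Case 2: the omitted X3 is uncorrelated with X2 (zero sample covariance).
x3_uncorr = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
y_uncorr = [beta1 + beta2 * a + beta3 * b for a, b in zip(x2, x3_uncorr)]
b2_uncorr = slope(x2, y_uncorr)

print(b2_short, beta2 + beta3 * b32, b2_uncorr)
```

In the correlated case the short-regression slope overshoots the true value of 2 by exactly β3·b32; in the uncorrelated case it recovers 2.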

How the Cochrane-Orcutt procedure estimates rho.

Uses GLS approximations in an iterative process to converge on a reliable estimate of rho for first-order serially correlated disturbances. This process depends on minimizing the sum of squared residuals (RSS) for different values of rho until a stable value for RSS is obtained (i.e. one that doesn't change between iterations).
1) Apply OLS to the initial equation and calculate the residuals, then use these residuals to derive the first-round estimate of rho (capture the residuals).
2) Carry out the GLS transformation on the original equation, dropping the 1st observation.
3) Apply OLS to the transformed data to get parameter estimates.
4) Using the values of the untransformed variables, generate the second round of residuals.
5) Repeat the GLS and OLS steps until you come to a stable rho; parameter estimates at this point become BLUE.
Note: the larger the sample, the smaller the bias.
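A minimal sketch of the iteration for a model with one explanatory variable, assuming rho is estimated by regressing each residual on its lag without an intercept. The simulated data, the tolerance and the function names are all hypothetical choices made for illustration.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on one regressor x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def estimate_rho(e):
    """Regress e_t on e_(t-1) without an intercept."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(e[t - 1] ** 2 for t in range(1, len(e)))
    return num / den


def cochrane_orcutt(x, y, tol=1e-6, max_iter=100):
    # Step 1: OLS on the original equation, then a first-round rho.
    b1, b2 = simple_ols(x, y)
    rho = estimate_rho([yi - b1 - b2 * xi for xi, yi in zip(x, y)])
    for _ in range(max_iter):
        # Step 2: GLS transformation; the first observation is dropped.
        x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
        y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
        # Step 3: OLS on the transformed data.
        a_star, b2 = simple_ols(x_star, y_star)
        b1 = a_star / (1 - rho)  # transformed intercept equals b1 * (1 - rho)
        # Step 4: residuals from the untransformed variables, then a new rho.
        new_rho = estimate_rho([yi - b1 - b2 * xi for xi, yi in zip(x, y)])
        # Step 5: stop once rho is stable between iterations.
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return b1, b2, rho


# Simulated data (hypothetical): y = 1 + 2x + u, with u_t = 0.7 * u_(t-1) + noise.
rng = random.Random(0)
x = [float(t) for t in range(1, 61)]
u, y = 0.0, []
for xi in x:
    u = 0.7 * u + rng.gauss(0, 1)
    y.append(1.0 + 2.0 * xi + u)

b1_hat, b2_hat, rho_hat = cochrane_orcutt(x, y)
print(b1_hat, b2_hat, rho_hat)
```

The stopping rule here watches rho itself, which corresponds to step 5 on the card; watching the RSS for stability would be an equivalent way to phrase it.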

Omitted variable bias essay question.

When you leave out critical variables from a regression, the parameter estimate of X2 tries to pick up the effects of the omitted variables and is therefore biased (no longer the true value of that variable's parameter). When the relevant variables are added to the model, they pick up some of the explanatory power that was mistakenly picked up by B2. If X2 isn't that relevant, B2 could become insignificant once the model is correctly specified. We can therefore conclude that X2 and the omitted variables are correlated, because B2 wouldn't pick up the effects of the omitted variables otherwise. The omission of a relevant variable biases the true variance of the parameter estimate down and the estimate of the variance of the parameter estimate up, so the net effect on the variable's significance (whether it increases or decreases) is difficult to determine.

After the WLS estimation is carried out, the estimated equation is: Y* = 13.64 + 4.75*X_2* + 0.882*X_3* + u-hat* R2 = 0.912 F = 16.41 n = 40 t stats: (7.5) (8.9) Using these estimation results, write out the final model.

You weighted by X2 because the problem states that the error variance is proportional to X2 squared (variance of ui = variance*X2i^2). From the WLS regression, the second parameter estimate (4.75) becomes the intercept in the final model, and the intercept from the WLS regression (13.64) becomes the parameter estimate on X2. The parameter estimate on X3 remains unchanged. Answer: Y = 4.75 + 13.64*X2 + 0.882*X3 + u-hat
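The swap of intercept and slope can be verified numerically. The sketch assumes the starred variables are Y* = Y/X2, X2* = 1/X2 and X3* = X3/X2 (the card does not spell this out), and uses hypothetical, noise-free data generated from the final model.

```python
# Coefficients of the final (untransformed) model from the card.
b1, b2, b3 = 4.75, 13.64, 0.882

# Hypothetical values of the independent variables.
x2 = [2.0, 4.0, 5.0, 8.0]
x3 = [1.0, 3.0, 2.0, 7.0]

# Noise-free Y generated from the final model: Y = 4.75 + 13.64*X2 + 0.882*X3.
y = [b1 + b2 * a + b3 * b for a, b in zip(x2, x3)]

# Assumed WLS transformation: divide every term by X2.
y_star = [yi / a for yi, a in zip(y, x2)]
x2_star = [1 / a for a in x2]
x3_star = [b / a for a, b in zip(x2, x3)]

# The WLS equation as estimated on the card: Y* = 13.64 + 4.75*X2* + 0.882*X3*.
wls_fitted = [13.64 + 4.75 * s2 + 0.882 * s3
              for s2, s3 in zip(x2_star, x3_star)]

max_gap = max(abs(f - t) for f, t in zip(wls_fitted, y_star))
print(max_gap)  # effectively zero: the two equations describe the same model
```

Dividing the original intercept by X2 turns it into the coefficient on 1/X2, while the X2 term becomes a constant, which is why the two estimates trade places when writing out the final model.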

Potential Causes of Heteroscedasticity

a. Error-Learning Model: error-learning models imply that people generally learn from their mistakes, so that over time, errors of behavior become smaller.
b. Income Effect: certain socio-economic variables are characterized by a greater degree of choice as the values of these variables grow. The disposition of discretionary income is an often-used example of this phenomenon.
c. Measurement Errors: measurement errors often decrease with the increasing size of the unit of analysis. For example, survey data generally provide a more accurate representation of reality for large cities than for small cities.

Which of the following is/are true concerning heteroscedasticity?

a. there is no general test for heteroscedasticity that is perfectly free of any assumption about which variable the error terms are correlated with b. in order to correct for heteroscedasticity, you must have some idea of the values of σi c. heteroscedasticity generally does not represent a problem with the specification of the model d. correcting for heteroscedasticity results in estimators which are BLUE (on test, the answer is e. all of the above)

If first-order autocorrelation is present in a regression model, the following equation:

a. can be estimated with OLS assuming ρ is known or can be estimated b. is in first-difference form c. loses the first observation unless some additional transformations on the data values of this observation are undertaken d. when estimated, represents an application of Weighted Least Squares (on test, the answer is e. all of the above)

In constructing a regression model, an important theoretical explanatory variable cannot be measured directly. To minimize specification error, the researcher should (choose the best answer):

a. find a proxy variable that has a strong relationship to the actual explanatory variable

The most problematic issue that Logit analysis overcomes with respect to the Linear Probability Model (LPM) is that, in the LPM:

c. there is no way to guarantee that the predicted value of the dependent variable will be between 0 and 1

A distributed lag model is a regression equation that:

c. uses time-series data and contains both the current values of an explanatory variable and the past period value(s) of this variable on the right side of the equation

Hypothesis testing with Durbin Watson

The d statistic is predicated on the concept that we can use the values of the residuals to approximate the values of the error terms, thus checking for autocorrelation. We're given d by the SAS output, along with k' (k' = # of Xs in the model).
H0: no autocorrelation, ρ = 0. Ha: autocorrelation, ρ ≠ 0.
For Ha: ρ > 0:
d < dl: reject H0
d > du: do not reject H0
dl < d < du: indeterminate region
For Ha: ρ < 0 and d > 2 (or Ha: ρ ≠ 0 and d > 2): substitute (4 - d) for d as the test value in the above decision rule.
The accepted practice is to make corrections for serial correlation only when H0 can be firmly rejected (i.e. d < dl). Use the DW table to find the critical values. This test is designed to detect first-order autocorrelation and might not be able to find autocorrelation of a higher order.
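The decision rule can be written as a small function. The function name, the `alternative` labels and the bounds used in the example (dl = 1.39, du = 1.60) are illustrative assumptions; actual bounds come from the DW table for the given n and k'.

```python
def dw_decision(d, dl, du, alternative="positive"):
    """Apply the Durbin-Watson decision rule.

    alternative: "positive" (rho > 0), "negative" (rho < 0) or "two-sided".
    """
    test_value = d
    # For a negative alternative, or a two-sided one with d > 2,
    # substitute (4 - d) for d as the test value.
    if alternative == "negative" or (alternative == "two-sided" and d > 2):
        test_value = 4 - d
    if test_value < dl:
        return "reject H0"
    if test_value > du:
        return "do not reject H0"
    return "indeterminate"


dl, du = 1.39, 1.60  # illustrative bounds; look up real ones in the DW table

print(dw_decision(1.0, dl, du))              # reject H0
print(dw_decision(1.5, dl, du))              # indeterminate
print(dw_decision(2.0, dl, du))              # do not reject H0
print(dw_decision(3.2, dl, du, "negative"))  # reject H0
```

Only the first and last cases would justify correcting for serial correlation under the practice described on this card, since those are the ones where H0 is firmly rejected.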

Why is it important to identify the source of heteroscedasticity before correcting for it?

Heteroscedasticity is more common in cross-sectional data than in time-series data, because time-series data usually involve values that are of a similar order of magnitude. When heteroscedasticity is present, OLS estimators are not minimum variance, and the estimates of the variances of the parameter estimates are biased, meaning biased F and t statistics. If you try to correct for heteroscedasticity using the wrong assumption, you could make the problem even worse. You have to make theoretically justifiable assumptions when determining the course of action for correcting heteroscedasticity. We use WLS to correct for heteroscedasticity, but we need to rely on assumptions to implement it properly.

