306 Final Exam
Suppose that the linear probability model yields a predicted value of Y that is equal to 1.3. Explain why this is nonsensical.
The predicted value of Y is an estimated probability, Pr(Y = 1 | X), and a probability must lie between 0 and 1; a value of 1.3 cannot be a probability.
Suppose that you have just read a careful statistical study of the effect of advertising on the demand for cigarettes. Using data from New York during the 1970s, the study concluded that advertising on buses and subways was more effective than print advertising. Use the concept of external validity to determine if these results are likely to apply to Boston in the 1970s; Los Angeles in the 1970s; New York in 2010.
The results are likely to apply to Boston in the 1970s, but not to Los Angeles in the 1970s or New York in 2010.
A researcher estimates a regression using two different software packages. The first uses the homoskedasticity-only formula for standard errors. The second uses the heteroskedasticity-robust formula. The standard errors are very different. Which should the researcher use?
The heteroskedasticity-robust standard errors should be used.
An ordinary least squares regression of Y onto X will be internally invalid if X is correlated with the error term. Each of the five primary threats to internal validity implies that X is correlated with the error term.
True. True.
By imposing restrictions on the true coefficients, the researcher wishes to test the null hypothesis that the coefficients on I and E are jointly 0, against the alternative that at least one of them is not equal to 0, while controlling for the other variables. The values of the sum of squared residuals (SSR) from the unrestricted and restricted regressions are 36.50 and 37.50, respectively.
F = [(SSR_restricted − SSR_unrestricted)/q] / [SSR_unrestricted/(n − k_unrestricted − 1)] = [(37.50 − 36.50)/2] / [36.50/(100 − 3 − 1)] ≈ 1.32
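As a check, the homoskedasticity-only F-statistic can be computed directly. A minimal sketch, taking the sample size n = 100, the number of unrestricted regressors k = 3, and q = 2 restrictions from the question:

```python
# Homoskedasticity-only F-statistic for testing q restrictions:
# F = [(SSR_r - SSR_u) / q] / [SSR_u / (n - k - 1)]
def f_statistic(ssr_restricted, ssr_unrestricted, q, n, k):
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k - 1)
    return numerator / denominator

# Values from the question: SSR_r = 37.50, SSR_u = 36.50, q = 2, n = 100, k = 3
print(round(f_statistic(37.50, 36.50, 2, 100, 3), 2))  # -> 1.32
```

The resulting F of about 1.32 is well below the 5% critical value of F_{2,∞} (3.00), so the null would not be rejected.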
Consider the following least squares specification relating test scores to district income: Test Score = 557.8 + 36.42 ln(Income). According to this equation, a 1% increase in income is associated with an increase in test scores of:
0.36 points (= 36.42 × 0.01 ≈ 0.36)
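In a linear-log specification, a 1% increase in X changes Y by roughly 0.01 × β_1. A quick sketch using the coefficient from the question:

```python
# Linear-log model: Y = b0 + b1*ln(X), so dY ≈ b1 * (dX / X).
# A 1% increase in income means dX/X = 0.01.
b1 = 36.42
effect = b1 * 0.01
print(round(effect, 2))  # -> 0.36
```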
Consider the following regression output where the dependent variable is test scores and the two explanatory variables are the student-teacher ratio and the percent of English learners: Test Score=698.9−1.10×STR−0.650×PctEL. You are told that the t-statistic on the student-teacher ratio coefficient is 2.56. The standard error therefore is approximately:
0.43 (≈ 1.10/2.56)
The adjusted R^2, or Rbar^2, is given by:
1 − [(n − 1)/(n − k − 1)] × (SSR/TSS), where k is the number of regressors.
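The formula can be sketched numerically. The SSR, TSS, n, and k values below are hypothetical, chosen only to show that adjusted R² penalizes added regressors (here k counts regressors excluding the intercept):

```python
def adjusted_r2(ssr, tss, n, k):
    # R-bar^2 = 1 - [(n - 1) / (n - k - 1)] * (SSR / TSS),
    # where k is the number of regressors (excluding the intercept).
    return 1 - (n - 1) / (n - k - 1) * ssr / tss

# Hypothetical example: same SSR/TSS, but more regressors -> lower adjusted R^2
print(round(adjusted_r2(ssr=30.0, tss=100.0, n=50, k=2), 4))
print(round(adjusted_r2(ssr=30.0, tss=100.0, n=50, k=10), 4))
```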
The critical value of F_{4,∞} at the 5% significance level is:
2.37
Assume that you had estimated the following quadratic regression model: Test Score = 607.3 + 3.85 Income − 0.0423 Income^2. If income increased from 10 to 11 ($10,000 to $11,000), then the predicted effect on test scores would be:
2.96 points (= 3.85 × (11 − 10) − 0.0423 × (11^2 − 10^2) ≈ 2.96)
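The predicted effect in a quadratic specification is the difference in fitted values at the two income levels; a sketch using the coefficients from the question:

```python
def predicted_score(income):
    # Quadratic model from the question:
    # TestScore = 607.3 + 3.85*Income - 0.0423*Income^2
    return 607.3 + 3.85 * income - 0.0423 * income ** 2

# Predicted effect of moving from Income = 10 to Income = 11
effect = predicted_score(11) - predicted_score(10)
print(round(effect, 2))  # -> 2.96
```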
You have estimated the relationship between test scores and the student-teacher ratio under the assumption of homoskedasticity of the error terms. The regression output is as follows: Test Score = 698.9 − 2.28 × STR, and the standard error on the slope is 0.48. The homoskedasticity-only "overall" regression F-statistic for the hypothesis that the regression R^2 is zero is approximately:
22.56 (= t^2 = (−2.28/0.48)^2 ≈ 22.56)
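With a single regressor, the homoskedasticity-only overall F-statistic equals the square of the slope t-statistic; a quick check with the numbers from the question:

```python
slope = -2.28
se = 0.48
t_stat = slope / se      # t-statistic for H0: slope = 0
f_stat = t_stat ** 2     # with one regressor, overall F = t^2
print(round(f_stat, 2))  # -> 22.56
```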
What is the difference between internal validity and external validity? What is the difference between the population studied and the population of interest?
A statistical analysis is said to have internal validity if the statistical inferences about causal effects are valid for the population being studied. The analysis is said to have external validity if conclusions can be generalized to other populations and settings. The population studied is the population from which the sample was drawn, while the population of interest is the population to which causal inferences from this study are to be applied.
What is the trade-off when including an extra variable in a regression?
An extra variable could control for omitted variable bias, but it also increases the variance of other estimated coefficients.
A researcher estimates the effect on crime rates of spending on police by using city-level data. Which of the following represents simultaneous causality?
Cities with high crime rates may need a larger police force, and thus more spending. More police spending, in turn, reduces crime.
Which of the following is an example of panel data? A panel in which the variables are studied for all 'n' entities for all 'T' time periods is (_________) panel. Such omitted variable biases in panel regressions can be removed by using an OLS regression with (__________)
Data on the performance of the Golden State Warriors, Cleveland Cavaliers, Chicago Bulls, New York Knicks, and Dallas Mavericks in the NBA playoffs for the years 2000 to 2015. A balanced panel. Fixed effects.
A recent study found that the death rate for people who sleep 6 to 7 hours per night is lower than the death rate for people who sleep 8 or more hours. The 1.1 million observations used for this study came from a random survey of Americans aged 30 to 102. Each survey respondent was tracked for 4 years. The death rate for people sleeping 7 hours was calculated as the ratio of the number of deaths over the span of the study among people sleeping 7 hours to the total number of survey respondents who slept 7 hours. This calculation was then repeated for people sleeping 6 hours, and so on. Based on this summary, would you recommend that Americans who sleep 9 hours per night consider reducing their sleep to 6 or 7 hours if they want to prolong their lives? Why or why not? Explain.
No. The data are observational, and the estimated relationship is likely confounded by omitted factors that are correlated with both sleep duration and mortality, such as drug or alcohol use, type of employment, or an indicator for chronic illness.
The following OLS assumption is most likely violated by omitted variables bias:
E(u_i|X_i)=0
Consider a regression with two variables, in which X_1i is the variable of interest and X_2i is the control variable. Conditional mean independence requires:
E(ui|X_1i,X_2i)=E(ui|X_2i)
Consider the polynomial regression model of degree r, Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ... + β_r X_i^r + u_i. The null hypothesis that the regression is linear, against the alternative that it is a polynomial of degree r, corresponds to:
H_0: β_2 = 0, β_3 = 0, ..., β_r = 0 vs. H_1: at least one β_j ≠ 0, j = 2, ..., r
You have estimated a linear regression model relating Y to X. Your professor says, "I think that the relationship between Y and X is nonlinear." How would you test the adequacy of your linear regression? (Check all that apply)
If adding a quadratic term, you could test the hypothesis that the estimated coefficient on the quadratic term is significantly different from zero. Compare the fit of the linear regression to that of the nonlinear regression model.
When does missing data pose a threat to internal validity? Which of the following statements is not an implication of the regression error being correlated across observations?
Internal validity is threatened when the data are missing because of a selection process that is related to Y_i beyond depending on X_i. The OLS estimators become biased and inconsistent.
What do subscripts i and t refer to?
Subscripts i and t identify the entity and time period respectively.
A researcher is interested in the effect on test scores of computer usage. Using school district data, she regresses district average test scores on the number of computers per student. What are possible sources of bias for β1, the estimated effect on tests scores of increasing the number of computers per student? For each source of bias below, determine whether β1 will be biased up or down. Average income per capita in the district. If this variable is omitted, it will likely produce a(an) (________)bias of the estimated effect on tests scores of increasing the number of computers per student. The availability of computerized adaptive learning tools in the district. If this variable is omitted, it will likely produce a(an) (_________)bias of the estimated effect on tests scores of increasing the number of computers per student. The availability of computer-related leisure activities in the district. If this variable is omitted, it will likely produce a(an) (____) bias of the estimated effect on tests scores of increasing the number of computers per student.
Upward. Upward. Downward.
Graph Question
Y_i = β_0 + β_1 X_i + β_2 X_i^2 + u_i. The relationship between wage earnings and years of experience. The relationship between time spent studying for an exam and the grade on that exam. The relationship between income and fertility.
A polynomial regression model is specified as:
Y_i = β_0 + β_1 X_i + β_2 X_i^2 + ... + β_r X_i^r + u_i
Consider the following regression equation: Y_i = β_0 + β_1 X_i + β_2 (X_i × D_i) + u_i, where β_0 is the intercept, β_1 is the slope coefficient on X_i, β_2 is the coefficient on the interaction term (X_i × D_i), and D_i is a binary variable. This regression equation has (_____) slope and (______) intercept for the two values of the binary variable. The coefficient on (X_1 × X_2) is the effect of a one-unit increase in the product of X_1 and X_2, above and beyond the sum of the individual effects of a unit increase in X_1 alone and a unit increase in X_2 alone. Which of the following statements describes a way of determining the degree of the polynomial in X which best models a nonlinear regression? Let r denote the highest power of X that is included in the regression.
A different slope and the same intercept. This holds true whether X_1 and X_2 are continuous or binary. One way is to check whether the coefficients in the regression equation associated with the largest values of r are equal to zero.
All of the following are true with the exception of one condition:
a high R^2 or Rbar^2 always means that an added variable is statistically significant.
A survey of earnings contains an unusually high fraction of individuals who state their weekly earnings in 100s, such as 300, 400, 500, etc. This is an example of:
errors-in-variables bias.
Consider the population regression of log earnings [Yi,where Yi= ln(Earningsi)] against two binary variables: whether a worker is married (D1i,where D1i = 1 if the ith person is married) and the worker's gender (D2i,where D2i = 1 if the ith person is female), and the product of the two binary variables Yi=β0+β1D1i+β2D2i+β3D1i×D2i+μi. The interaction term:
allows the population effect on log earnings of being married to depend on gender.
Under the least squares assumptions for the multiple regression problem (zero conditional mean for the error term, all X_i and Y_i being i.i.d., all X_i and μ_i having finite fourth moments, no perfect multicollinearity), the OLS estimators for the slopes and intercept:
are unbiased and consistent.
The interpretation of the slope coefficient in the model ln(Yi)=β_0+β_1ln(Xi)+μi is as follows:
a 1% change in X is associated with a β_1% change in Y.
In the multiple regression model, the t-statistic for testing that the slope is significantly different from zero is calculated:
by dividing the estimate by its standard error.
If you had a two-regressor regression model, then omitting one variable that is relevant:
can result in a negative value for the coefficient of the included variable, even though that variable would have a significant positive effect on Y if the omitted variable were included.
The homoskedasticity-only F-statistic and the heteroskedasticity-robust F-statistic typically are:
different.
A binary variable is often called a:
dummy variable.
Threats to internal validity lead to:
failures of one or more of the least squares assumptions.
The probit model:
forces the predicted values to lie between 0 and 1.
Consider the multiple regression model with two regressors X1 and X2, where both variables are determinants of the dependent variable. When omitting X2 from the regression, there will be omitted variable bias for B_1
if X_1 and X_2 are correlated.
Imperfect multicollinearity:
implies that it will be difficult to estimate precisely one or more of the partial effects using the data at hand.
The question of reliability/unreliability of a multiple regression depends on:
internal and external validity.
A nonlinear function:
is a function with a slope that is not constant.
This problem is inspired by a study of the "gender gap" in earnings in top corporate jobs [Bertrand and Hallock (2001)]. The study compares total compensation among top executives in a large set of U.S. public corporations in the 1990s. (Each year these publicly traded corporations must report total compensation levels for their top five executives.) Let Female be an indicator variable that is equal to 1 for females and 0 for males. A regression of the logarithm of earnings onto Female yields ln(Earnings)=6.48−0.44Female, SER=2.65 (0.01) (0.05) The estimated coefficient on Female is -0.44. Explain what this value means. The SER is 2.65. Explain what this value means. Does this regression suggest that female top executives earn less than top male executives? Does this regression suggest that there is gender discrimination?
ln(Earnings) for females are, on average, 0.44 lower than men's; that is, earnings for females are, on average, approximately 44% lower than men's. The error term has a standard deviation of 2.65 (measured in log points). Yes. No.
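The 44% figure uses the log-point approximation, which is rough for a coefficient this large; the exact percentage effect of a dummy in a log regression is e^β − 1. A sketch of both:

```python
import math

coef = -0.44  # coefficient on Female from the regression in the question

approx_pct = coef * 100                 # log-point approximation
exact_pct = (math.exp(coef) - 1) * 100  # exact percentage change

print(round(approx_pct, 1))  # -> -44.0
print(round(exact_pct, 1))   # -> -35.6
```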
A "Cobb-Douglas" production function relates production (Q) to factors of production, capital (K), labor (L), raw materials (M), and an error term u using the equation Q=λKβ1Lβ2Mβ3eu, where λ, β1, β2, and β3 are production parameters. Taking logarithms of both sides of the equation yields ln(Q)=β0+β1ln(K)+β2ln(L)+β3ln(M)+u. Suppose that you thought that the value of β2 was not constant, but rather increased when K increased. Which of the following regression functions captures this dynamic relationship?
ln(Q) = β_0 + β_1 ln(K) + β_2 ln(L) + β_3 ln(M) + β_4 [ln(L) × ln(K)] + u
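In this interaction specification the elasticity of Q with respect to L is β_2 + β_4 ln(K), so it increases with K when β_4 > 0. A sketch with hypothetical coefficient values (b2 and b4 are not from the source, just illustrative):

```python
import math

# Hypothetical coefficients for
# ln(Q) = b0 + b1*ln(K) + b2*ln(L) + b3*ln(M) + b4*[ln(L)*ln(K)] + u
b2, b4 = 0.30, 0.05

def labor_elasticity(K):
    # d ln(Q) / d ln(L) = b2 + b4 * ln(K): rises with K when b4 > 0
    return b2 + b4 * math.log(K)

print(round(labor_elasticity(10), 3))
print(round(labor_elasticity(100), 3))
```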
Imperfect multicollinearity:
means that two or more of the regressors are highly correlated.
Consider the multiple regression model with two regressors X_1 and X_2, where both variables are determinants of the dependent variable. You first regress Y on X_1 only and find no relationship. However, when regressing Y on X_1 and X_2, the slope coefficient on X_1 changes by a large amount. This suggests:
omitted variable bias.
The dummy variable trap is an example of:
perfect multicollinearity.
The best way to interpret polynomial regressions is to:
plot the estimated regression function and to calculate the estimated effect on Y associated with a change in X for one or more values of X.
Your textbook plots the estimated regression function produced by the probit regression of deny on the P/I ratio. The estimated probit regression function has a stretched "S" shape given that the coefficient on the P/I ratio is positive. A probit regression function with a negative coefficient would:
resemble an inverted "S" shape (for low values of X, the predicted probability of Y would approach 1).
In the case of errors-in-variables bias:
the OLS estimator is consistent if the variance in the unobservable variable is relatively large compared to the variance in the measurement error.
The linear probability model is:
the application of the linear multiple regression model to a binary dependent variable.
In the probit regression, the coefficient β1 indicates:
the change in the z-value associated with a unit change in X.
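The z-value change translates into a probability change through the standard normal CDF Φ, so the effect on Pr(Y = 1) of a given change in X depends on the starting point on the S-curve. A minimal sketch with hypothetical probit coefficients (b0 and b1 are illustrative, not from the source):

```python
from statistics import NormalDist

# Hypothetical probit model: Pr(Y = 1 | X) = Phi(b0 + b1 * X)
b0, b1 = -2.0, 3.0
phi = NormalDist().cdf  # standard normal CDF

def predicted_prob(x):
    return phi(b0 + b1 * x)

# The same change in X shifts the z-value by the same amount (b1 * dX),
# but the probability change differs along the S-curve:
print(round(predicted_prob(0.5) - predicted_prob(0.4), 3))
print(round(predicted_prob(1.0) - predicted_prob(0.9), 3))
```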
In the case of errors-in-variables bias, the precise size and direction of the bias depend on:
the correlation between the measured variable and the measurement error.
In the log-log model, the slope coefficient indicates:
the elasticity of Y with respect to X.
Internal validity is that:
the estimator of the causal effect should be unbiased and consistent.
Comparing the California test scores to test scores in Massachusetts is appropriate for external validity if:
the institutional settings in California and Massachusetts, such as organization in classroom instruction and curriculum, were similar in the two states.
A statistical analysis is internally valid if:
the statistical inferences about causal effects are valid for the population studied.
The true causal effect might not be the same in the population studied and the population of interest because:
All of the above: the study may be out of date, there may be geographical differences, and the characteristics of the populations may differ.
When testing a joint hypothesis, you should:
use the F-statistic and reject at least one of the hypotheses if the statistic exceeds the critical value.
Using the textbook example of 420 California school districts and the regression of test scores on the student-teacher ratio, you find that the standard error on the slope coefficient is 0.51 when using the heteroskedasticity-robust formula, while it is 0.48 when employing the homoskedasticity-only formula. When calculating the t-statistic, the recommended procedure is to:
use the heteroskedasticity-robust formula.
In the multiple regression model, the adjusted R^2, R bar^2
will never be greater than the regression R^2
In which of the following scenarios does perfect multicollinearity occur? Why is it impossible to compute OLS estimators in the presence of perfect multicollinearity? Perfect multicollinearity can be rectified by modifying the ()
Perfect multicollinearity occurs when one of the regressors is a perfect linear function of the other regressors. It is impossible to compute the OLS estimators in the presence of perfect multicollinearity because the OLS formulas involve division by 0. The independent variables.
Labor economists studying the determinants of women's earnings discovered a puzzling empirical result. Using randomly selected employed women, they regressed earnings on the women's number of children and a set of control variables (age, education, occupation, and so forth). They found that women with more children had higher wages, controlling for these other factors. What is most likely causing this result?
Sample selection bias.
Suppose that a state offered voluntary standardized tests to all its third graders and that these data were used in a study of class size on student performance. Which of the following would generate selection bias?
Schools with higher-achieving students could be more likely to volunteer to take the test.
One of your friends is using data on individuals to study the determinants of smoking at your university. She is particularly concerned with estimating marginal effects on the probability of smoking at the extremes. She asks you whether she should use a probit, logit, or linear probability model. What advice do you give her?
She should use the logit or probit, but not the linear probability model.
(Y_i, X_1i, X_2i) satisfy the least squares assumptions. You are interested in β_1, the causal effect of X_1 on Y. Suppose that X_1 and X_2 are uncorrelated. You estimate β_1 by regressing Y on X_1 only (so that X_2 is not included in the regression). Does this estimator suffer from omitted variable bias?
No.
Consider the following regression model Y_i=B_0+B_1X_i+u_i Suppose that Y is measured with random error. Does this mean that regression analysis is unreliable? Now, suppose that X is measured with random error. Does this mean that regression analysis is unreliable?
No; random measurement error in Y increases the variance of the error term but does not bias the OLS estimators. Yes; random measurement error in X causes errors-in-variables bias, making the OLS estimator biased and inconsistent.