ECON 3720 Midterm
statistically significant
"Statistical significance helps quantify whether a result is likely due to chance or to some factor of interest," says Redman. When a finding is significant, it simply means you can feel confident that's it real, not that you just got lucky (or unlucky) in choosing the sample.
*The higher R^2 is...
...the closer the estimated regression equation fits the sample data. Measures of this type are called "goodness of fit" measures. R^2 measures the percentage of the variation of Y around Y bar that is explained by the regression equation.
The ability to interpret R-squared, to understand the bounds on R-squared, to know its relationship to the correlation coefficient, and its shortcomings as a measure of the goodness of fit.
R-squared (R2) is a statistical measure that represents the proportion of the variance of a dependent variable that's explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model's inputs. An R-squared of 100% means that all movements of a security (or other dependent variable) are completely explained by movements in the index (or the independent variable(s) you are interested in).

R-squared only works as intended in a simple linear regression model with one explanatory variable. With a multiple regression made up of several independent variables, the R-squared must be adjusted. The adjusted R-squared compares the descriptive power of regression models that include different numbers of predictors. Every predictor added to a model increases R-squared and never decreases it. Thus, a model with more terms may seem to have a better fit just because it has more terms, while the adjusted R-squared compensates for the addition of variables: it increases only if the new term improves the model by more than would be expected by chance, and it decreases when a predictor improves the model by less than would be expected by chance. In an overfitting condition, an incorrectly high value of R-squared is obtained even when the model actually has a decreased ability to predict. This is not the case with the adjusted R-squared.

The coefficient of determination, R2, is related to the correlation coefficient, r. The correlation coefficient tells you how strong the linear relationship between two variables is, and in a simple regression R-squared is the square of the correlation coefficient, r (hence the term r squared).
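The standard formulas behind this discussion, written out (TSS = total sum of squares of Y around its mean, RSS = sum of squared residuals, N = observations, K = slope coefficients); they match the adjusted-R² calculation used by hand later in these notes:

R^2 = 1 - \frac{RSS}{TSS}, \qquad \bar{R}^2 = 1 - \frac{RSS/(N-K-1)}{TSS/(N-1)}, \qquad \text{and in a simple regression } R^2 = r^2.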
Understand that it is possible for the error to be heteroskedastic when a regression is done in levels, but to be homoscedastic when done in a log-log specification.
REMINDER: Homoskedastic (also spelled "homoscedastic") refers to a condition in which the variance of the residual, or error term, in a regression model is constant. That is, the error term does not vary much as the value of the predictor variable changes.
Understand how to do this in the case of log-log specifications. Understand the usefulness of the Ramsey test in helping to choose between a specification in levels or in logs.
***compare the goodness of fit for the two models
*you can't just compare adj-R^2, AIC, BIC from the two regressions
*the total sums of squares are different, so some adjustments have to be made

*compare the adjusted R-squares
*step 1: generate the squared residual for every observation based on the levels predictions from the log regression
*(need to do the steps from above before this)
gen RS = (predictedlevel - wage)^2
*step 2: generate the sum of squared residuals for the log regression by summing all of the observations in the column RS
sum RS
display r(sum)   //gives RSS: the sum of squared residuals
*step 3: re-run the levels regression to get the total sum of squares (TSS) that we will use
regress wage i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper   //to get TSS
*step 4: calculate adjusted R2 for the log regression
*adjr2 = 1 - (RSS/(n-k-1)) / (TSS/(n-1))
display 1-((52285.653/1281)/(80309.8255/1288))
*we don't need to re-calculate for the levels regression - we can get that from the regression output

*compare AICs
*NOTE: we are doing them by hand here. this formula is a simplified version of the one STATA uses.
*if you are comparing AIC/BIC you must use the same formula for both models
*AIC = log(RSS/N) + 2(K+1)/N
display log(52285.653/1289)+2*(7+1)/1289     //AIC for the log model
display log(51517.0149/1289)+2*(7+1)/1289    //AIC for the model without logs

*compare BICs
*BIC = log(RSS/N) + log(N)(K+1)/N
display log(52285.653/1289)+log(1289)*(7+1)/1289     //BIC for the log model
display log(51517.0149/1289)+log(1289)*(7+1)/1289    //BIC for the model without logs

Ramsey RESET - the null hypothesis is that the coefficients on the added powers of the fitted values are all zero, meaning the powers of the fitted values have no relationship that serves to explain the dependent variable y, i.e. the model has no omitted variables. The alternative hypothesis is that the model is suffering from an omitted variable problem.
Ability to locate estimates of various regression parameters (For example, those discussed in table 4.1) on STATA output
**kind of blurry... everything is estimated (a sample statistic)
SS = sum of squares (difference between the observed and predicted dependent variables)
DF = degrees of freedom
MS = mean sum of squares
Number of obs = N
F(2, 47) = F statistic
Prob > F = p value
Root MSE = square root of the mean squared error = standard deviation of the model's error
Coefficient = regression coefficient
Std. err. = standard error of the estimated coefficient
t = t stat
P>|t| = p value
*divide the model mean square by the residual mean square to get the F statistic
*slope dummies in STATA
*SLOPE DUMMIES
use "projectstar.dta", clear
*run the regression w/o slope dummies, picking "regular" as the omitted condition for class type
reg tscorek i.male i.black i.freelunch c.totexpk b2.cltypek i.schidkn, beta
*estimate the information criteria
estat ic
*run the regression with slope dummies, using the "##" syntax
reg tscorek i.male i.black i.freelunch c.totexpk##b2.cltypek i.schidkn
estat ic
*compute the slope of teaching experience for each of the 3 classroom types
*the slope coefficient of teaching experience for the "regular" classroom, from the regression output: 1.419747
*the slope coefficient of teaching experience for the "small" classroom, from the regression output:
display 1.419747-1.393757
*the slope coefficient of teaching experience for the "regular with aide" classroom, from the regression output:
display 1.419747-.9472276
*compute the slope coefficients separately for each class type using a postestimation command
margins cltypek, dydx(totexpk) atmeans
Knowing how to perform one and two- sided t tests of hypothesized values of a coefficient other than zero.
*think of Rhine example
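A minimal STATA sketch of such a test (hypothetical model and variable names; here the null is that the coefficient on ed is less than or equal to 1, against a greater-than alternative at alpha = .05):

*run the regression (hypothetical variables)
regress wage ed exper
*t-statistic by hand: (estimate - hypothesized value) / standard error
display (_b[ed] - 1)/_se[ed]
*one-sided 5% critical value with N-K-1 (residual) degrees of freedom
display invttail(e(df_r), 0.05)
*reject H0: beta_ed <= 1 in favor of HA: beta_ed > 1 if the t-statistic exceeds the critical value
*(the built-in command "test ed = 1" reports the equivalent two-sided F and p-value)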
knowing how to apply the mechanical rule correctly to accept/reject at a fixed level of alpha in STAT 2120 level example
1. Specify the null and alternative hypotheses.
2. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic. To conduct the hypothesis test for the population mean μ, we use the t-statistic t* = (x̄ − μ)/(s/√n), which follows a t-distribution with n − 1 degrees of freedom.
3. Determine the critical value by finding the value of the known distribution of the test statistic such that the probability of making a Type I error — which is denoted α (greek letter "alpha") and is called the "significance level of the test" — is small (typically 0.01, 0.05, or 0.10).
4. Compare the test statistic to the critical value. If the test statistic is more extreme in the direction of the alternative than the critical value, reject the null hypothesis in favor of the alternative hypothesis. If the test statistic is less extreme than the critical value, do not reject the null hypothesis.
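A minimal numeric sketch of steps 2-4 (hypothetical numbers: x̄ = 52, μ0 = 50, s = 8, n = 25, one-sided greater-than alternative, α = .05), using STATA as a calculator:

*step 2: the test statistic t* = (xbar - mu0)/(s/sqrt(n))
display (52-50)/(8/sqrt(25))       // = 1.25
*step 3: the one-sided critical value with n-1 = 24 degrees of freedom
display invttail(24,0.05)          // = 1.71 (approximately)
*step 4: 1.25 < 1.71, so we fail to reject the null at the 5% level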
*Four important specification criteria
1. Theory: Is the variable's place in the equation unambiguous and theoretically sound? 2. t-Test: Is the variable's estimated coefficient significant in the expected direction? 3. Adjusted R squared: Does the overall fit of the equation (adjusted for degrees of freedom) improve when the variable is added to the equation? 4. Bias: Do other variables' coefficients change significantly when the variable is added to the equation?
Understand the considerations that go into deciding whether to include or exclude additional variables, as summarized by Studenmund on p. 178, with the addendum of mine that sample size and the degree to which the iffy variable is correlated with included variables might also enter into the consideration.
1. Theory: Is the variable's place in the equation unambiguous and theoretically sound? 2. t-Test: Is the variable's estimated coefficient significant in the expected direction? 3. Adjusted R squared: Does the overall fit of the equation (adjusted for degrees of freedom) improve when the variable is added to the equation? 4. Bias: Do other variables' coefficients change significantly when the variable is added to the equation? From Michener: **sample size and the degree to which the iffy variable is correlated with included variables might also enter into the consideration.
level of confidence in a confidence interval
A confidence interval, in statistics, refers to the probability that a population parameter will fall between a set of values for a certain proportion of times. Analysts often use confidence intervals that contain either 95% or 99% of expected observations. Thus, if a point estimate of 10.00 is generated from a statistical model with a 95% confidence interval of 9.50 - 10.50, it can be inferred that there is a 95% probability that the true value falls within that range. For example, a 95% confidence interval for the mean of [9, 11] suggests you can be 95% confident that the population mean is between 9 and 11. (More precisely, the 95% describes the procedure: in repeated sampling, 95% of intervals constructed this way would contain the true parameter.)
Distinction between correlation and causation
A correlation between variables does not automatically mean that the change in one variable is the cause of the change in the values of the other variable. Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events.
Understand the properties of estimators. Be able to explain why it is desirable that an estimator be unbiased, consistent and efficient
A desirable property of a distribution of estimates is that its mean equals the true mean of the variable being estimated. An estimator that yields such estimates is called an unbiased estimator. An estimator is an unbiased estimator if its sampling distribution has as its expected value the true value of β. OLS IS BLUE! OLS coefficient estimators can be shown to have the following properties:
1. They are unbiased. That is, E(β̂) = β. This means that the OLS estimates of the coefficients are centered around the true population values of the parameters being estimated.
2. They are minimum variance. The distribution of the coefficient estimates around the true parameter values is as tightly or narrowly distributed as is possible for an unbiased distribution. No other linear unbiased estimator has a lower variance for each estimated coefficient than OLS.
3. They are consistent. As the sample size approaches infinity, the estimates converge to the true population parameters. Put differently, as the sample size gets larger, the variance gets smaller, and each estimate approaches the true value of the coefficient being estimated.
4. They are normally distributed (when the error term is normally distributed, Assumption VII).
Understand the use of dummy variables when the explanatory variable is qualitative. (discussion is scattered in Studenmund - see the index. There is a reading online.) This includes correctly defining dummy variables as well as correctly interpreting the β and βˆ associated with them.
A dummy variable (aka, an indicator variable) is a numeric variable that represents categorical data, such as gender, race, political affiliation, etc. Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small; they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence. Republican, democrat, independent example: The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of Independent voters. In analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the dummy variable political affiliation than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income discrepancy with the reference group is also statistically significant.
*Problem with R squared
A major problem with R squared is that adding another independent variable to a particular equation can never decrease R squared. That is, if you compare two equations that are identical, except that one has an additional independent variable, the equation with the greater number of independent variables will always have a better (or equal) fit as measured by R squared. To sum, R squared is of little help if we're trying to decide whether adding a variable to an equation improves our ability to meaningfully explain the dependent variable.
Know how to use a p value to decide whether to accept or reject the null in a STAT 2120 level example
A p-value less than 0.05 is typically considered to be statistically significant, in which case the null hypothesis should be rejected. A p-value greater than 0.05 means that deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.
population parameter/sample statistic
A parameter is a number describing a whole population (e.g., the population mean), while a statistic is a number describing a sample (e.g., the sample mean). The goal of quantitative research is to understand characteristics of populations by using sample statistics to estimate the corresponding parameters.
Understand the use and interpretation of standardized Beta coefficients (not covered in Studenmund)
A standardized beta coefficient compares the strength of the effect of each individual independent variable on the dependent variable. The higher the absolute value of the standardized beta, the stronger the effect. For example, a beta of −0.9 has a stronger effect than a beta with a smaller absolute value, regardless of sign. **think least important to most important (from lab 5)
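In STATA, standardized coefficients are requested with the beta option on regress (as in the project STAR regression elsewhere in these notes); a minimal sketch with hypothetical variable names:

*report standardized (beta) coefficients alongside the usual output
regress y x1 x2, beta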
Definition of correlation coefficient
A statistical measure of the strength of a linear relationship between two variables. Its values can range from -1 to 1.
Definition of the (stochastic) error term
A term that is added to a regression equation to introduce all of the variation in Y that cannot be explained by the included Xs. Sometimes called a disturbance term. Usually referred to with the symbol epsilon, although other symbols like u or v sometimes are used.
*What does adjusted R squared measure?
Adjusted R squared measures the percentage of the variation of Y around the mean that is explained by the regression equation, adjusted for degrees of freedom.
What does an increase or decrease in adjusted R squared tell you?
Adjusted R squared will increase, decrease, or stay the same when a variable is added to an equation, depending on whether the improvement in fit caused by the addition of the new variable outweighs the loss of a degree of freedom. An increase indicates that the marginal benefit of adding the variable exceeds the cost, while a decrease indicates that the marginal cost exceeds the benefit.
Definition of an estimate
An estimate is the actual numerical value obtained when an estimator is applied to a particular sample of data. An estimator is a mathematical technique that is applied to a sample of data to produce a real-world numerical estimate of the true population regression coefficient (or other parameter). Thus, OLS is an estimator, and a β hat produced by OLS from a given sample is an estimate.
Definition of an estimator
An estimator is a mathematical technique that is applied to a sample of data to produce a real-world numerical estimate of the true population regression coefficient (or other parameter). Thus, OLS is an estimator, and a β hat produced by OLS is an estimate.
Slope dummy variables. Understand what an interaction term is more generally, and the special case of a slope dummy variable.
An interaction term is an independent variable in a regression equation that is the product of two or more other independent variables. Each interaction term has its own regression coefficient, so the end result is that the interaction term has three or more components, as in β3XiDi. Such interaction terms are used when the change in Y with respect to one independent variable (in this case X) depends on the level of another independent variable (in this case D). Interaction terms can involve two quantitative variables (X1X2) or two dummy variables (D1D2), but the most frequent application of interaction terms involves one quantitative variable and one dummy variable (X1D1), a combination that is typically called a slope dummy. Slope dummy variables allow the slope of the relationship between the dependent variable and an independent variable to be different depending on whether the condition specified by a dummy variable is met. This is in contrast to an intercept dummy variable, which changes the intercept, but does not change the slope, when a particular condition is met.
Unbiased, consistent, efficient as applied to estimators
An unbiased estimator is said to be consistent if the difference between the estimator and the target population parameter becomes smaller as we increase the sample size. Formally, an unbiased estimator μ̂ for parameter μ is said to be consistent if V(μ̂) approaches zero as n → ∞. An estimator is consistent if, as the sample size increases, the estimates (produced by the estimator) "converge" to the true value of the parameter being estimated. To be slightly more precise - consistency means that, as the sample size increases, the sampling distribution of the estimator becomes increasingly concentrated at the true parameter value. An estimator is unbiased if, on average, it hits the true parameter value. That is, the mean of the sampling distribution of the estimator is equal to the true parameter value. The two are not equivalent: Unbiasedness is a statement about the expected value of the sampling distribution of the estimator. Consistency is a statement about "where the sampling distribution of the estimator is going" as the sample size increases.
OLS is
BLUE
Understanding the arithmetic basis of logs (e.g. the log of a product is the sum of the logs, the log of x^a is a*log(x), the difference between base 10 logs and natural logs, that the log is only defined for positive numbers.)
Base 10 logs vs natural logs (base e logs):
- "Log" usually refers to a logarithm to the base 10; this is also called the common logarithm, represented as log10(x).
- "Ln" refers to a logarithm to the base e; this is also called the natural logarithm, represented as loge(x).
- Exponent form: if log10(y) = x, then 10^x = y; if ln(y) = x, then e^x = y.
- The question the common log answers is "to what power must we raise 10 to get y?"; the natural log asks "to what power must we raise Euler's number e to get y?"
- The common log (base 10) is more widely used in physics; since logarithms there are usually taken to base 10, ln is used much less in that setting.
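The arithmetic rules referenced in the prompt, written out (standard identities):

\log(xy) = \log x + \log y, \qquad \log(x/y) = \log x - \log y, \qquad \log(x^a) = a\log x,
\qquad \ln x = \frac{\log_{10} x}{\log_{10} e}, \qquad \text{and } \log x \text{ is defined only for } x > 0.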
*Why do we need the stochastic error term?
Because there are at least four sources of variation in Y other than the variation in the included Xs: 1. Many minor influences on Y are omitted from the equation (for example, because data are unavailable) 2. It is virtually impossible to avoid some sort of measurement error in the dependent variable. 3. The underlying theoretical equation might have a different functional form (or shape) than the one chosen for the regression. For example, the underlying equation might be nonlinear. 4. All attempts to generalize human behavior must contain at least some amount of unpredictable or purely random variation.
*Specification
Before any equation can be estimated, it must be specified. Specifying an econometric equation consists of three parts: choosing the correct independent variables, the correct functional form, and the correct form of the stochastic error term. A specification error results when any one of these choices is made incorrectly.
In regression applications, knowing how to perform one and two-sided t tests of the null that beta is zero (or greater than or equal to zero, or less than or equal to zero)
Beta is zero = the variable has no effect on Y (for a dummy variable, the groups are not different). Beta is greater than zero = the variable has a positive effect (the groups differ in the hypothesized direction). https://www.statisticshowto.com/probability-and-statistics/t-test/
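The t-statistic used for these tests, in its standard form (with the hypothesized value βH0 set to 0 for the usual significance test, and N − K − 1 degrees of freedom):

t_k = \frac{\hat{\beta}_k - \beta_{H_0}}{SE(\hat{\beta}_k)}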
Knowing how to compute: find p-values from t-statistics, using STATA (for a precise answer) or by bracketing the p-values from Studenmund's tables
Bracketing the values: Use your estimated t stat and look on the t distribution table. See the alpha values that your estimated t stat is between and give an estimate.
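For the precise STATA answer, the ttail function gives the upper-tail area; a minimal sketch (hypothetical t-statistic of 2.09 with 24 degrees of freedom):

*one-sided p-value (area above the t-statistic)
display ttail(24, 2.09)
*two-sided p-value
display 2*ttail(24, abs(2.09))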
Definitions of cross sectional, time series and panel data sets
Cross sectional data means that we have data from many units, at one point in time. Time series data means that we have data from one unit, over many points in time. Panel data (or time series cross section) means that we have data from many units, over many points in time.
understand the interpretation of the coefficient of a dummy variable in a log-log regression
Double-log models should be run only when the logged variables take on positive values. Dummy variables, which can take on the value of zero, should not be logged. From lab 6: "The average effect of a diamond being Clarity = SI1 as opposed to being I1 is a 40.57% increase in price of a diamond."
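A note on where a figure like the 40.57% comes from: the coefficient on a dummy variable in a regression with a logged dependent variable is only an approximate percentage effect; one standard way to compute the exact effect is

100\,(e^{\hat{\beta}} - 1)\%

so a coefficient of roughly 0.34 on the SI1 dummy would correspond to roughly a 40% higher price, holding the other variables constant (the coefficient itself is a good approximation only when it is small).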
Understanding the use of econometrics in economics: estimation (including confidence intervals)
Econometricians use regression analysis to make quantitative estimates of economic relationships that previously have been completely theoretical in nature Dependent variables, independent variables, and causality - Regression analysis is a statistical technique that attempts to "explain" movements in one variable, the dependent variable, as a function of movements in a set of other variables, called the independent (or explanatory) variables, through the quantification of one or more equations
Understanding the use of econometrics in economics: hypothesis testing
Econometrics has three major uses: 1. Describing economic reality 2. Testing hypotheses about economic theory and policy 3. Forecasting future economic activity
Understanding the use of econometrics in economics: prediction
Econometrics has three major uses: 1. Describing economic reality 2. Testing hypotheses about economic theory and policy 3. Forecasting future economic activity
Understanding the use of econometrics in economics: description
Econometrics is a set of techniques that allows the measurement and analysis of economic phenomena and the prediction of future economic trends Literally means "economic measurement" - the quantitative measurement and analysis of actual economic and business phenomena Econometrics has three major uses: 1. Describing economic reality 2. Testing hypotheses about economic theory and policy 3. Forecasting future economic activity
Understanding the use of econometrics in economics: macroeconomic policy evaluation as an illustration of these^ (reading online)
Econometrics uses economic theory, mathematics, and statistical inference to quantify economic phenomena. In other words, it turns theoretical economic models into useful tools for economic policymaking. The objective of econometrics is to convert qualitative statements (such as "the relationship between two or more variables is positive") into quantitative statements (such as "consumption expenditure increases by 95 cents for every one dollar increase in disposable income"). Econometricians—practitioners of econometrics—transform models developed by economic theorists into versions that can be estimated. As Stock and Watson (2007) put it, "econometric methods are used in many branches of economics, including finance, labor economics, macroeconomics, microeconomics, and economic policy." Economic policy decisions are rarely made without econometric analysis to assess their impact. https://www.imf.org/external/pubs/ft/fandd/basics/econometric.htm
What is meant by the "linear regression model"?
Equation 1.3 states that Y, the dependent variable, is a single equation linear function of X, the independent variable The betas are the coefficients that determine the coordinates of the straight line at any point. Beta zero is the constant or intercept term; it indicates the value of Y when X equals zero. Beta one is the slope coefficient, and it indicates the amount that Y will change when X increases by one unit (for single variable regression).
Understand why one would expect a stochastic error term in the regression model
Equation 1.4 can be thought of as having two components, the deterministic component and the stochastic, or random, component. The expression without epsilon is called the deterministic component of the regression equation because it indicates the value of Y that is determined by a given value of X, which is assumed to be nonstochastic. This deterministic component can also be thought of as the expected value of Y given X, the mean value of the Ys associated with a particular value of X. Unfortunately, the value of Y observed in the real world is unlikely to be exactly equal to the deterministic expected value E (Y|X). As a result, the stochastic element must be added.
Understand the connection between the units in which an explanatory variable is measured (for instance - feet or inches), its estimated coefficient, and the fit of the regression
Example: if height is measured in feet instead of inches, each X value is divided by 12, so the estimated coefficient is multiplied by 12 (and its standard error with it). The fit of the regression - R squared, the residuals, the t-statistics, the predicted values - is completely unchanged. Changing the units of measurement changes the size of the coefficient, not the quality of the fit.
Definition of degrees of freedom
Excess of the number of observations (N) over the number of coefficients (including the intercept) estimated (K+1) For instance, when the campus box number variable is added to the weight/height example, the number of observations stays constant at 20, but the number of estimated coefficients increases from 2 to 3, so the number of degrees of freedom falls from 18 to 17 - this decrease has a cost, since the lower the degrees of freedom, the less reliable the estimates are likely to be
meaning of E(β̂k)
Expected value of the estimated coefficient. The measure of the central tendency of the sampling distribution of β̂, which can be thought of as the mean of the β̂s, is denoted as E(β̂), read as "the expected value of beta-hat."
Where to find critical values and how to compute observed sample values to perform a comparison in STAT 2120 level examples
F = compare the calculated value to the table F value. If the F table value is smaller than the calculated value, you can reject the null. Look for the right degrees of freedom and significance level: numerator df in the column and denominator df in the row.
T = compare the calculated value to the table t value. Look for the right degrees of freedom and significance level.
Z = compare the calculated value to the table Z value. Look for the right significance level (the standard normal has no degrees of freedom).
https://www.statisticshowto.com/probability-and-statistics/find-critical-values/
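STATA can substitute for the tables; a minimal sketch of the inverse-tail functions (hypothetical significance levels and degrees of freedom):

display invttail(24, 0.05)       // t critical value, 24 df, 5% in the upper tail
display invFtail(3, 401, 0.01)   // F critical value, 3 numerator and 401 denominator df, 1% level
display invnormal(0.975)         // z critical value for a two-sided 5% test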
Understand how economic theories that are nonlinear in variables are often linear in the logs of those variables.
Many economic relationships involve roughly constant elasticities or growth rates, so they are nonlinear in the levels of the variables but linear in their logs. For example, a constant-elasticity relationship Y = e^(β0) · X^(β1) · e^(ε) is nonlinear in X, but taking logs gives ln Y = β0 + β1 ln X + ε, which is linear in the logged variables, with β1 the elasticity. It is also a different measurement with a different interpretation: once we are in percentage terms, the graph of the relationship looks different from the graph in levels.
In regression applications, knowing how to use the estimated variance-covariance matrix of the coefficients to compute the standard error of a linear combination of regression coefficients.
From Eric: "He is getting at this relationship: Var(X+Y)=Var(X)+Var(Y)+2*Cov(X,Y) Replace the x and y with the betas. You need the variance-covariants matrix to get the last term. The terms on the diagonal of the matrix are the coefficient variances, the off-diagonal terms are covariances."
In regression applications, know what the F test of overall significance tests, how to apply it, and where the relevant numbers are found on the computer output. Understand its connection to the size of R-squared.
From lab 4: Overall F test: Is the regression significant at the 1% level? This test tests the null hypothesis H0: β1=β2=β3=β4=0, and the alternative hypothesis is HA: H0 not true. The work for equation 5.11 here: F = (445696505/18) / (17418579.9/(420-18-1)) = 570.03083785, so F = 570.031. The critical value of F as stated in the table is 1.98 (from the F distribution table at the 1% significance level); the precise critical value is 1.9792604. If the null were true here, "we would expect F to be in the neighborhood of 1" as Prof. Michener explained in his video, and clearly our F value of 570.031 is not in the neighborhood of 1. It is also far bigger than our critical value of F, which is 1.979, so we reject the null hypothesis. The practical interpretation is that a bigger R2 leads to higher values of F, so if R2 is big (which means that a linear model fits the data well), then the corresponding F statistic should be large, which means that there should be strong evidence that at least some of the coefficients are non-zero.
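The overall-significance F statistic behind that calculation, and its link to R² (ESS = explained sum of squares, RSS = residual sum of squares):

F = \frac{ESS/K}{RSS/(N-K-1)} = \frac{R^2/K}{(1-R^2)/(N-K-1)}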
Be able to explain, in simple situations, the sign of the bias caused by a particular omitted variable based on common sense.
From lab 5: the sign of the expected bias can be reasoned out with common sense - it is the sign of (the omitted variable's true effect on Y) times (the correlation between the omitted variable and the included variable). If both are positive or both are negative, the included variable's coefficient is biased upward; if they have opposite signs, it is biased downward. For example, omitting ability from a wage-education regression biases the education coefficient upward if ability raises wages and is positively correlated with education.
Know how the estimated coefficient of a variable in a multiple regression which omits an important variable is related to the estimated coefficient of the same variable in another regression which is otherwise identical but contains the important variable.
From lab 5: the coefficient on a variable in the regression that omits the important variable equals its coefficient in the otherwise identical regression that includes it, plus a bias term - the coefficient on the omitted variable multiplied by the slope from an auxiliary regression of the omitted variable on the included variable(s). The short regression's coefficient therefore "picks up" part of the effect of the omitted variable it is correlated with.
Using log specifications. Be able to use a log specification to forecast the level of a variable, and know how to construct confidence intervals for the conditional mean and forecast intervals for the level of the variable that are based on a regression estimated in logs.
From recitation 9 dofile:
use "Ruud wage data (1).dta", clear
*calculate the summary statistics of all the variables in the data set
sum
*generate levels wages variable
gen wage=exp(logw)

***run the regression with logw
regress logw i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
*marginal effect of education at sample mean
*REMEMBER you can only use the margins command if you use the # syntax for quadratics
margins, dydx(c.ed) atmeans

***run the regression with wage (without logs)
regress wage i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper

***estimate the wage for a white male union member with 12 years of education and 10 years of experience
*by hand
display -2.908767*0-1.48922*0+1.112505*1-.4062558*12+.0675012*(12^2)+.5139389*10-.0066643*(10^2)-.7337009
*using Stata
*need to create a new observation for this person
*can do it using the data editor like we did in class, or this way (1290 is the observation # of the new observation)
set obs `=_N+1'
replace fe = 0 in 1290
replace nw = 0 in 1290
replace un = 1 in 1290
replace ed = 12 in 1290
replace exper = 10 in 1290
*run the levels regression again
regress wage i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
*predict wages based on the levels regression
predict wagehat, xb   //a postestimation command

*Confidence intervals: remember that the confidence interval for the conditional mean is DIFFERENT from the confidence interval for the forecast
***calculate the 95% confidence interval for the conditional mean (wage)
*t-statistic for 5% significance level
display invttail(1281,0.025)
*generate standard errors of the PREDICTION
predict SEpredict, stdp
display 9.6968652+1.9618176*.61002836   //upper limit
display 9.6968652-1.9618176*.61002836   //lower limit
***calculate the 95% confidence interval for a new observation (wage)
*t-statistic for 5% significance level
display invttail(1281,0.025)
*generate standard errors of the FORECAST
predict SEforecast, stdf
display 9.6968652+1.9618176*6.3709013   //upper limit
display 9.6968652-1.9618176*6.3709013   //lower limit

***estimate the log wage for a white male union member with 12 years of education and 10 years of experience
*run the log regression
regress logw i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
*by hand
display -.2380428*0-.1284902*0+.1769132+.0191133*12+.0029654*(12^2)+.0473488*10-.0006527*(10^2)+.9812464
*using Stata
predict logwagehat, xb   //a postestimation command
***calculate the 95% confidence interval for the conditional mean (lwage)
display invttail(1281,0.025)
predict SEpredictlog, stdp
display 2.2227577+1.9618176*.04459711   //upper limit
display 2.2227577-1.9618176*.04459711   //lower limit
***calculate the 95% confidence interval for a new observation (lwage)
display invttail(1281,0.025)
predict SEforecastlog, stdf
display 2.2227577+1.9618176*.46575502   //upper limit
display 2.2227577-1.9618176*.46575502   //lower limit

***use the predicted lwage to predict wage (in levels) - for confidence intervals
*convert the CI for the conditional mean from lwage to wage
display exp(2.3102491)   //upper limit
display exp(2.1352663)   //lower limit
*convert the CI for a new observation from lwage to wage
display exp(3.1364841)   //upper limit
display exp(1.3090313)   //lower limit

***use the predicted lwage to predict wage (in levels) - for point estimate
*REMEMBER: you can't just exponentiate the point estimate of the prediction from your log regression for this
*it was OK for confidence intervals, but not for point estimates
*see the handouts for more details
*here are the steps
*1. exponentiate the predicted logs
gen m_hat = exp(logwagehat)
*2. regress wage on the exp(predlog), suppressing the constant
reg wage m_hat, noconstant
*3. use this regression to generate predicted values of the level of wage
predict predictedlevel, xb
*now we have the point estimate in the predictedlevel variable
*interpretation: the LOG regression predicts that this new individual will make $10.185868 per hour in LEVELS terms
*compare this with the LEVELS regression, which predicts that this same person will make $9.698652 per hour in LEVELS terms
*they are different!
Gauss-Markov theorem. Know which assumptions must be met for OLS to be BLUE and to be MVUE
Given Classical Assumptions I through VI (Assumption VII, normality, is not needed for this theorem), the Ordinary Least Squares estimator of βk is the minimum variance estimator from among the set of all linear unbiased estimators of βk, for k = 0, 1, 2,..., K. Best means that each β̂k has the smallest variance possible (in this case, out of all the linear unbiased estimators of βk). An unbiased estimator with the smallest variance is called efficient, and that estimator is said to have the property of efficiency. Since the variance typically falls as the sample size increases, larger samples almost always produce more accurate coefficient estimates than do smaller ones. The Gauss-Markov Theorem requires that just the first six of the seven classical assumptions be met. What happens if we add in the seventh assumption, that the error term is normally distributed? In this case, the result of the Gauss-Markov Theorem is strengthened because the OLS estimator can be shown to be the best (minimum variance) unbiased estimator out of all the possible estimators, not just out of the linear estimators. In other words, if all seven assumptions are met, OLS is "BUE."
Understand that when percentage changes are more nearly constant than absolute changes, log-log specifications are likely to outperform specifications in the levels of the same variables.
Got it: when the scatter of a variable grows with its level - that is, when percentage changes are more nearly constant than absolute changes - taking logs turns those roughly constant percentage changes into roughly constant absolute changes in the logged variable. A specification that is linear in logs then fits better, and its error variance is more nearly constant, than a specification in the levels of the same variables.
homoskedasticity and heterosckedasticity
Homoskedastic (also spelled "homoscedastic") refers to a condition in which the variance of the residual, or error term, in a regression model is constant. That is, the error term does not vary much as the value of the predictor variable changes. Another way of saying this is that the variance of the data points around the regression line is roughly the same for all data points. Heteroskedasticity is the opposite: the variance of the error term is not constant - the spread of the errors depends on the level of at least one independent variable (or on the predicted value), so the standard deviation of the errors is non-constant. One rough rule of thumb: compare the largest error variance to the smallest; if the ratio is about 1.5 or smaller, the regression can be treated as homoskedastic. If the variance of the error term is homoskedastic, the model is considered well specified in this respect; if there is too much variance, the model may not be well specified, and adding additional predictor variables can sometimes help explain the remaining systematic variation in the dependent variable.
less than, greater than and two sided tests
How can we tell whether it is a one-tailed or a two-tailed test? It depends on the original claim in the question. A one-tailed test (less than, greater than) looks for a "decrease" or "increase" in the parameter, whereas a two-sided test looks for a "change" (could be an increase or a decrease) in the parameter.
Know the classical assumptions, p 94, and their meaning
I. The regression model is linear, is correctly specified, and has an additive error term.
II. The error term has a zero population mean.
III. All explanatory variables are uncorrelated with the error term.
IV. Observations of the error term are uncorrelated with each other.
V. The error term has a constant variance (no heteroskedasticity).
VI. No explanatory variable is a perfect linear function of any other explanatory variable(s) (no perfect multicollinearity).
VII. The error term is normally distributed (this assumption is optional but usually is invoked).
Understand the interpretation of the coefficient of an ordinary quantitative variable in a log-log regression
If X increases by one percent, Y will change by β1 percent. (Thus β1 is the elasticity of Y with respect to X.)
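In symbols, the standard double-log result:

\ln Y = \beta_0 + \beta_1 \ln X + \varepsilon \;\Rightarrow\; \beta_1 = \frac{d\ln Y}{d\ln X} = \frac{dY/Y}{dX/X} \approx \frac{\%\Delta Y}{\%\Delta X}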
*More again on consequences
If all these conditions hold, the variable belongs in the equation; if none of them do, the variable is irrelevant and can be safely excluded from the equation. When a typical omitted relevant variable is included in the equation, its inclusion probably will increase adjusted R squared and change at least one other coefficient. If an irrelevant variable, on the other hand, is included, it will reduce adjusted r squared, have an insignificant t-score, and have little impact on the other variables' coefficients.
Understand the relationship between the F-test when there is one d.f. in the numerator and a t- test.
If there is one degree of freedom in the numerator, the F-test is testing a single restriction (a single coefficient or a single linear combination of coefficients). In that case the F-statistic is exactly the square of the corresponding t-statistic (F = t²), so the F-test and the two-sided t-test give identical results.
Be able to judge, based on AIC and BIC, which specification appears to have the best fit.
The specification with the lower AIC and BIC has the better fit: both criteria reward a smaller sum of squared residuals but penalize additional parameters, so smaller is better (they decrease when a specification improves).
Knowing how to use STATA's invTtail command to find a critical value of a t-test.
If you know t* and want to calculate the area above it under the t-model with df degrees of freedom (shown in gray), use the command display ttail(df, t*). If you know the area in gray, alpha (e.g. 5%), and want to calculate t*, use the command invttail(df, alpha). Suppose you want to calculate the critical value of t for a 90% confidence interval with 17 degrees of freedom, i.e. you want to find the value of t* for which 5% of the area under the curve lies above t* and 5% lies below -t*. To find this value using STATA, type
. display invttail(17,0.05)
in the STATA command window. This gives us the 95th percentile of the t-model with 17 degrees of freedom, which corresponds to the critical value for a 90% confidence interval. In the Results window the value 1.7396067 is shown (compare this value with the one given by the table in the back of the book). Note that the p-value for t ≤ 2.09 (the area to the left of 2.09) with 4 degrees of freedom would be given by
. display 1-ttail(4,2.09)
The p-value for |t| ≥ 2.09 (two-sided test) with 4 degrees of freedom would be given by
. display 2*ttail(4,2.09)
*More on slope dummy variables (explanation from textbook part 1)
In general, a slope dummy is introduced by adding to the equation a variable that is the multiple of the independent variable that has a slope you want to change and the dummy variable that you want to cause the changed slope. The general form of a slope dummy equation is:
Yi = β0 + β1Xi + β2Di + β3XiDi + εi
so the slope of Y with respect to X is β1 when D = 0 and (β1 + β3) when D = 1.
Understand how fitting a polynomial modifies the interpretation of the slope coefficients.
In most average cost functions, the slope of the cost curve changes sign as output changes. If the slopes of a relationship are expected to depend on the level of the variable itself, then a polynomial model should be considered. Polynomial functional forms express Y as a function of independent variables, some of which are raised to powers other than 1. For example, in a second-degree polynomial (also called a quadratic) equation, at least one independent variable is squared (Equation 7.10), of the form:
Yi = β0 + β1X1i + β2(X1i)² + β3X2i + εi
Such a model can indeed produce slopes that change sign as the independent variables change. For another example, consider a model of annual employee earnings as a function of the age of each employee and a number of other measures of productivity such as education. What is the expected impact of age on earnings? As a young worker gets older, his or her earnings will typically increase. Beyond some point, however, an increase in age will not increase earnings by very much at all, and around retirement we'd expect earnings to start to fall abruptly with age. As a result, a logical relationship between earnings and age might look something like the right half of Figure 7.4; earnings would rise, level off, and then fall as age increased. Such a theoretical relationship could be modeled with a quadratic equation (Equation 7.12), of the form:
Earningsi = β0 + β1Agei + β2(Agei)² + (other productivity measures) + εi
What would the expected signs of the coefficients on Age and Age² be? Since you expect the impact of age to rise and then fall, you'd expect the coefficient on Age to be positive and the coefficient on Age² to be negative (all else being equal). In fact, this is exactly what many researchers in labor economics have observed.
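The key change in interpretation: with a squared term in the equation, no single coefficient is "the" slope. The marginal effect of X1 depends on the level of X1:

\frac{\partial Y}{\partial X_1} = \beta_1 + 2\beta_2 X_1

so the effect of one more year of Age, say, is β1 + 2β2·Age, which can change sign as Age rises.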
*EXAMPLE of slope dummy variables (explanation from textbook part 3)
In practice, slope dummies have many realistic uses. For example, consider the question of earnings differentials between men and women. Although there is little argument that these differentials exist, there is quite a bit of controversy over the extent to which these differentials are caused by sexual discrimination (as opposed to other factors). Suppose you decide to build a model of earnings to get a better view of this controversy. If you hypothesized that men earn more than women on average, then you would want to use an intercept dummy variable for gender in an earnings equation that included measures of experience, special skills, education, and so on, as independent variables - an equation of the form:
Earningsi = β0 + β1Di + β2EXPi + (other control variables) + εi,   where Di = 1 if the ith worker is male and 0 otherwise,
so that β1 measures the average earnings differential between men and women, holding the other included variables constant.
Knowing how to pick an appropriate null and alternative hypothesis for such a test, and understanding why it is not appropriate to first look at the estimated sign of the coefficient and let it determine whether you do a greater than or less than test.
In real world applications, deciding on the alternative is sometimes difficult; ambiguous cases ought to be treated as two sided "CAUTION: We do not recommend changing from a two-tailed test to a one-tailed test after running your regression. This would be statistical cheating! You must know the direction of your hypothesis before running your regression."
Minimum variance unbiased estimator (MVUE)
In statistics a minimum-variance unbiased estimator (MVUE) or uniformly minimum-variance unbiased estimator (UMVUE) is an unbiased estimator that has lower variance than any other unbiased estimator for all possible values of the parameter. Minimum variance and minimum MSE (mean squared error) are the same thing when you restrict attention to unbiased estimators (bias = 0). In other words, efficiency can be measured by variance
Understand how the Ramsey RESET test works, what a significant value of the RESET test most likely signals, and how to actually do a RESET test in STATA.
In statistics, the Ramsey Regression Equation Specification Error Test (RESET) is a general specification test for the linear regression model. More specifically, it tests whether non-linear combinations of the fitted values help explain the response variable. The null hypothesis is that the coefficients on the added powers of the fitted values are all zero, i.e. the powers of the fitted values have no relationship that serves to explain the dependent variable y, meaning that the model has no omitted variables. The alternative hypothesis is that the model is suffering from an omitted variable problem. If p is less than .05 (or whatever significance level you choose), then you reject the null - i.e., there IS evidence of specification error such as omitted variable bias. HOW TO: Run your regression and then use the command "estat ovtest" to get your p value. It doesn't tell you why your model is wrong, just that it is.
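A minimal STATA sketch (hypothetical variable names):

*estimate the model, then run the RESET test on it
regress y x1 x2
estat ovtest    // H0: no omitted variables (powers of the fitted values add nothing)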
Knowing what the t-test does NOT do, which is well described by Studenmund on pp. 145-8.
Limitations of the t test:
- It is easy to misuse, and its usefulness diminishes rapidly as more and more specifications are estimated and tested.
- The t test does not test theoretical validity.
- The t test does not test importance.
- The t test is not intended for tests of the entire population.
Know Section 2.3; that is, how one should evaluate a fitted regression, with the addendum that - as part of point 5 - we are interested in the magnitude as well as the signs of the estimated coefficients, insofar as we have any expectations about magnitudes.
Most good econometricians spend quite a bit of time thinking about what to expect from an equation before they estimate the equation. Once the computer estimates have been produced, however, it's time to evaluate the regression results. The list of questions that should be asked during such an evaluation is long. For example:
1. Is the equation supported by sound theory?
2. How well does the estimated regression fit the data?
3. Is the data set reasonably large and accurate?
4. Is OLS the best estimator to be used for this equation?
5. How well do the estimated coefficients correspond to the expectations developed by the researcher before the data were collected?
6. Are all the obviously important variables included in the equation?
7. Has the most theoretically logical functional form been used?
8. Does the regression appear to be free of major econometric problems?
In regression applications, knowing the proper number of degrees of freedom to use in a t-test (both from first principles and by finding the appropriate number on STATA output).
N − K − 1: the number of observations minus the number of estimated coefficients (K slope coefficients plus the intercept). In a simple STAT 2120-style test of a mean the degrees of freedom are N − 1, but in a regression t-test they are N − K − 1, which STATA reports as the residual degrees of freedom (stored as e(df_r)).
Null and alternative hypotheses
Null: The value of the unknown population parameter we wish to test is called the null hypothesis. H0 is the universal notation for a null hypothesis. Often the null corresponds to the idea "nothing interesting or unexpected is going on" Alternative hypothesis: The alternative hypothesis comes in three flavors: less than, not equal to (Two sided), greater than. H-sub-A signifies the alternative hypothesis. Which flavor is a matter of judgement: What do you expect to be true if the null is false? Rhine deck example:
Definition of ordinary least squares
Ordinary least squares (OLS) is a regression estimation technique that calculates the beta hats so as to minimize the sum of the squared residuals.
Definition of adjusted R-squared
Regular R squared is of little help if we're trying to decide whether adding a variable to an equation improves our ability to meaningfully explain the dependent variable. Because of this problem, econometricians have developed another measure of the quality of the fit of an equation. That measure is adjusted R squared (pronounced R bar squared), which is R squared adjusted for degrees of freedom. The highest possible adjusted R squared is 1.00, the same as for R squared. The lowest possible adjusted R squared is not 0; if R squared is extremely low, adjusted R squared can be slightly negative.
*Difference between residual and error term
Residual: difference between the observed/actual Y and the estimated/predicted regression line (Y hat) Error term: difference between the observed/actual Y and the true regression equation (the expected value of Y) *Note that the error term is a theoretical concept that can never be observed, but the residual is a real world value that is calculated for each observation every time a regression is run. *The residual can be thought of as an estimate of the error term.
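The same distinction in symbols (standard notation, consistent with the definitions above):

e_i = Y_i - \hat{Y}_i \quad \text{(residual)}, \qquad \varepsilon_i = Y_i - E(Y_i \mid X_i) \quad \text{(error term)}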
In regression applications, understand how to use the F test to test the statistical significance of a subset of variables simultaneously -- both by hand and with STATA. Understand how this relates to the change in R-squared if the relationship is estimated with and without these variables. Understand how to use the F test to test restrictions on coefficients (beyond the simple case where several are simultaneously equal to zero) - both by hand and with STATA.
STATA: the "test" postestimation command (e.g., test x1 x2 to test that both coefficients are simultaneously zero, or test x1 = x2 for other linear restrictions). By hand: the F formula for constrained vs. unconstrained regressions, sketched below. R squared will always increase as new variables are introduced; adjusted R squared, however, accounts for degrees of freedom. The critical F-value, Fc, is determined from Statistical Table B-2 or B-3, depending on the level of significance chosen by the researcher and on the degrees of freedom. The F-statistic has two types of degrees of freedom: the degrees of freedom for the numerator of Equation 5.10 (M, the number of constraints implied by the null hypothesis) and the degrees of freedom for the denominator of Equation 5.10 (N − K − 1, the degrees of freedom in the unconstrained regression equation). The underlying principle here is that if the calculated F-value (or F-ratio) is greater than the critical value, then the estimated equation's fit is substantially better than the constrained equation's fit, and we can reject the null hypothesis of no effect.
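A sketch of the by-hand formula of this form (RSS_M = residual sum of squares from the constrained regression that imposes the null, RSS = residual sum of squares from the unconstrained regression, M = number of restrictions); the change in R² between the two regressions drives the numerator, since TSS is the same in both:

F = \frac{(RSS_M - RSS)/M}{RSS/(N-K-1)}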
Be able to give examples from elementary statistics (estimating the center of a symmetric distribution) of estimators with various desirable and undesirable properties
SUMMARY... desirable properties: reasonably accurate point estimates, small-ish confidence intervals, unreactive to misunderstandings about reality (extreme scores) undesirable properties: inaccurate point estimates, large confidence intervals, reactive to misunderstandings about reality (extreme scores) First let's think about the qualities we might seek in an ideal estimation partner, in non-technical language: "Open-minded data scientist seeking estimator for reasonably accurate point estimates with small-ish confidence intervals. Prefer reasonable performance in small samples. Unreactive to misunderstandings a plus." That's what we often want, anyway. This brings up one of the first interesting points about selecting an estimator: we may have different priorities depending on the situation and therefore the best estimator can be context dependent and subject to the judgement of a human. If we have a large sample, we may not care about small sample properties. If we know we're working with noisy data, we might prioritize estimators that are not easily influenced by extreme values. Now let's introduce the properties of estimators more formally and discuss how they map to our more informally described needs. We will refer to an estimator at a given sample size n as tn and the true target parameter of interest as θ. "reasonably accurate point estimates" → consistent "small-ish confidence intervals" → asymptotically normal, high relative efficiency "unreactive to misunderstandings about reality" → robust https://multithreaded.stitchfix.com/blog/2020/09/24/what-makes-a-good-estimator/
Be able to interpret the coefficient of a slope dummy variable (*and the p value)
See example above. In Equation 7.19, β3 (the coefficient on the experience × gender interaction) would be an estimate of the differential impact of an extra year of experience on earnings between men and women. We could test the possibility of a positive true β3 by running a one-tailed t-test on β̂3. If β̂3 were significantly different from zero in a positive direction, then we could reject the null hypothesis of no difference due to gender in the impact of experience on earnings, holding constant the other variables in the equation. If β3 > 0, then men earn that amount more than women for each additional unit of experience.
serial correlation
Serial correlation (also called Autocorrelation) is where error terms in a time series transfer from one period to another. In other words, the error for one time period a is correlated with the error for a subsequent time period b. For example, an underestimate for one quarter's profits can result in an underestimate of profits for subsequent quarters. This can result in a myriad of problems.
meaning of SE(β̂k) or s(β̂k) in table 4.1
Since the standard error of the estimated coefficient, SE(β̂k), is the square root of the estimated variance of the β̂k, it is similarly affected by the size of the sample and the other factors we've mentioned. For example, an increase in sample size will cause SE(β̂k) to fall; the larger the sample, the more precise our coefficient estimates will be.
Knowing how to find, document and input your own data. Knowing the steps one would take in going from a research hypothesis to point estimates of the relevant parameters in an applied problem. (Outlined in Chapter 3 and practiced in second lab)
Six steps in applied regression analysis 1. Review the literature and develop the theoretical model 2. Specify the model: select the independent variables and the functional form 3. Hypothesize the expected signs of the coefficients 4. Collect the data. Inspect and clean the data. 5. Estimate and evaluate the equation 6. Document the results
Definition of a specification error
Specification error occurs when the functional form or the choice of independent variables poorly represent relevant aspects of the true data-generating process.
Ability to correctly interpret the meaning of βi, i = 0, 1, ..., K and (when presented with sample output) β̂i, i = 0, 1, ..., K
TIPS: The beta coefficient is the degree of change in the outcome variable for every 1-unit change in the predictor variable. The non-hat beta coefficients belong to the true (population) regression model, not the estimated regression model. β̂1, β̂2, β̂3, etc. represent the estimated increase in Y per unit increase in the corresponding X; note that the "increase" may be negative, which is reflected when the β̂ is negative. The intercept is the value of Y when X = 0. The subscript i indexes each coefficient/independent variable, and K is the total number of independent variables. β̂0 is the Y-intercept of the regression line; when X = 0 is within the scope of observation, β̂0 is the estimated value of Y when X = 0.
Ability, when faced with a new data set, to identify which are the dependent variables and which are the independent variables
TIPS: The independent variable is the cause. Its value is independent of other variables in your study. The dependent variable is the effect. Its value depends on changes in the independent variable.
Definition of dummy variables
Take on the value of one or zero (and only those values) depending on whether a specified condition is met
sampling distribution of an estimator
The "sampling distribution" of a statistic (estimator) is a probability distribution that describes the probabilities with which the possible values for a specific statistic (estimator) occur.
Ability to use an F table.
The F distribution is a right-skewed distribution used most commonly in Analysis of Variance. When referencing the F distribution, the numerator degrees of freedom are always given first, as switching the order of degrees of freedom changes the distribution (e.g., F(10,12) does not equal F(12,10) ). For the four F tables below, the rows represent denominator degrees of freedom and the columns represent numerator degrees of freedom. The right tail area is given in the name of the table. For example, to determine the .05 critical value for an F distribution with 10 and 12 degrees of freedom, look in the 10 column (numerator) and 12 row (denominator) of the F Table for alpha=.05. F(.05, 10, 12) = 2.7534. You can use the interactive F-Distribution Applet to obtain more accurate measures.
In regression applications, knowing how to use STATA output to quickly determine whether one accepts/rejects and the correct p-value for this kind of null.
The F-value is the Mean Square Model (2385.93019) divided by the Mean Square Residual (51.0963039), yielding F = 46.69. The p-value associated with this F value is very small (0.0000). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". You could say that the group of variables math and female can be used to reliably predict science (the dependent variable). If the p-value were greater than 0.05, you would say that the group of independent variables does not show a statistically significant relationship with the dependent variable, or that the group of independent variables does not reliably predict the dependent variable. Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the coefficient table, where each of the individual variables is listed.
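A minimal STATA sketch of where those numbers come from (hedged: it assumes a data set containing the variables science, math, and female mentioned above is already loaded):
regress science math female
* the header of the regression output reports the overall F statistic and Prob > F
* if Prob > F is below your alpha (e.g., 0.05), reject H0 that all slope coefficients are zero
test math female
* test reproduces the same joint F test and p-value explicitly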
Best linear unbiased estimator (BLUE)
The Gauss-Markov theorem says that, under certain conditions, the ordinary least squares (OLS) estimator of the coefficients of a linear regression model is the best linear unbiased estimator (BLUE), that is, the estimator that has the smallest variance among those that are unbiased and linear in the observed output variables.
mean squared error
The Mean Squared Error measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. Mean squared error is calculated by taking the average, specifically the mean, of the squared errors from data as it relates to a function. One method of deciding whether a decreased variance in the distribution of the β^'s is valuable enough to offset the bias is to compare different estimation techniques by using a measure called the Mean Square Error (MSE). The Mean Square Error is equal to the variance plus the square of the bias. The lower the MSE, the better.
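In symbols (a standard identity, added here for reference): MSE(β^) = Var(β^) + [Bias(β^)]^2, where Bias(β^) = E(β^) - β. An estimator with a little bias but a much smaller variance can therefore have a lower MSE than an unbiased one.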
how to convert a two sided p value to a one sided p value
The p value is the probability, under a given statistical model and assuming the null hypothesis is true, that the statistical summary would be equal to or more extreme than the actual observed result. To convert a two-sided p value to a one-sided p value: if the estimate lies in the direction specified by the one-sided alternative, the one-sided p value is half the two-sided p value; if it lies in the opposite direction, the one-sided p value is 1 minus half the two-sided p value.
Meaning of hats over variables, such as CBt
The estimated or predicted values in a regression or other predictive model are termed the y-hat values. "Y" because y is the outcome or dependent variable in the model equation, and a "hat" symbol (circumflex) placed over the variable name is the statistical designation of an estimated value.
Definition of dependent (response) variables and independent (explanatory, predictor) variables
The independent variable is the cause. Its value is independent of other variables in your study. It is changed or controlled in a scientific experiment to test the effects on the dependent variable. The dependent variable is the effect. Its value depends on changes in the independent variables.
alpha, beta, power
The probability of a type I error is called alpha. The probability of a type II error is called beta. These are really conditional probabilities: alpha is the probability of rejecting the null given that it is true, and beta is the probability of accepting the null given that it is false (evaluated at a particular false value of the parameter). Power is the chance that a false null is correctly rejected; it is equal to 1 - beta (equivalently, beta = 1 - power).
Definition of residuals
The residual for each observation is the difference between the observed value of y (the dependent variable) and its predicted value: residual = actual/observed y value - predicted/estimated y value. In other words, the difference between the actual/observed value of the dependent variable and its estimated value is defined as the residual.
Definition of R-squared
The simplest commonly used measure of fit is R squared, or the coefficient of determination. It is the ratio of the explained sum of squares to the total sum of squares. Ranges from 0 to 1.
*Relationship between TSS, ESS, RSS
TSS = ESS + RSS. The smaller the RSS is relative to the TSS, the better the estimated regression line fits the data. OLS is the estimating technique that minimizes the RSS and therefore maximizes the ESS for a given TSS.
*Why do we need the stochastic error term? (important)
The stochastic error term must be present in a regression equation because there are at least four sources of variation in Y other than the variation in the included Xs: 1. Many minor influences on Y are omitted from the equation (for example, because data are unavailable). 2. It is virtually impossible to avoid some sort of measurement error in the dependent variable. 3. The underlying theoretical equation might have a different functional form (or shape) than the one chosen for the regression. For example, the underlying equation might be nonlinear. 4. All attempts to generalize human behavior must contain at least some amount of unpredictable or purely random variation.
Understanding how and why one uses an F-test to test the statistical significance of a set of two or more dummy variables.
There are many other uses of the F-test besides the test of overall significance. For example, let's take a look at the problem of testing the significance of seasonal dummies. Seasonal dummies are dummy variables that are used to account for seasonal variation in the data in time-series models. Notice that only three dummy variables are required to represent four seasons. In this formulation β1 shows the extent to which the expected value of Y in the first quarter differs from its expected value in the fourth quarter, the omitted condition. β2 and β3 can be interpreted similarly. Inclusion of a set of seasonal dummies "deseasonalizes" Y. This procedure may be used as long as Y and X4 are not "seasonally adjusted" prior to estimation. Many researchers avoid the type of seasonal adjustment done prior to estimation because they think it distorts the data in unknown and arbitrary ways, but seasonal dummies have their own limitations such as remaining constant for the entire time period. As a result, there is no unambiguously best approach to deseasonalizing data. To test the hypothesis of significant seasonality in the data, one must test the hypothesis that all the dummies equal zero simultaneously rather than test the dummies one at a time. In other words, the appropriate test of seasonality in a regression model using seasonal dummies involves the use of the F-test instead of the t-test. Seasonal dummy coefficients should be tested with the F-test instead of with the t-test because seasonality is usually a single compound hypothesis rather than 3 individual hypotheses (or 11 with monthly data) having to do with each quarter (or month). To the extent that a hypothesis is a joint one, it should be tested with the F-test. If the hypothesis of seasonal variation can be summarized into a single dummy variable, then the use of the t-test will cause no problems. Often, where seasonal dummies are unambiguously called for, no hypothesis testing at all is undertaken.
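A hedged STATA sketch of this joint test (hypothetical variable names: a dependent variable y, a regressor x, and a categorical variable quarter coded 1-4 in quarterly data):
regress y x i.quarter
* i.quarter enters three dummies; one quarter is the omitted (base) category
testparm i.quarter
* testparm reports the F test of H0: all three seasonal dummy coefficients are jointly zero
* a small p-value indicates significant seasonality; checking each dummy's t-score separately is not the appropriate test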
Understand why it is good practice to report a sensitivity analysis. Understand why these considerations make it important to rely, as much as is possible, on a priori theory to determine your specification.
Throughout this text, we've encouraged you to estimate as few specifications as possible and to avoid depending on fit alone to choose between those specifications. If you read the current economics literature, however, it won't take you long to find well-known researchers who have estimated five or more specifications and then have listed all their results in an academic journal article. What's going on? In almost every case, these authors have employed a technique called sensitivity analysis. Sensitivity analysis consists of purposely running a number of alternative specifications to determine whether particular results are robust (not statistical flukes). In essence, we're trying to determine how sensitive a potential "best" equation is to a change in specification because the true specification isn't known. Researchers who use sensitivity analysis run (and report on) a number of different reasonable specifications and tend to discount a result that appears significant in some specifications and insignificant in others. Indeed, the whole purpose of sensitivity analysis is to gain confidence that a particular result is significant in a variety of alternative specifications, functional forms, variable definitions, and/or subsets of the data.
*Purpose of OLS
To obtain numerical values for the coefficients of an otherwise completely theoretical regression equation
Definition of total, explained and residual sum of squares and their acronyms
Total sum of squares (TSS): econometricians use the squared variations of Y around its mean as a measure of the amount of variation to be explained by the regression; this computed quantity is usually called the total sum of squares, or TSS. For OLS, the total sum of squares has two components: variation that can be explained by the regression and variation that cannot.
Explained sum of squares (ESS): measures the amount of the squared deviation of Y from its mean that is explained by the regression line; this component is attributable to the fitted regression line.
Residual sum of squares (RSS): the unexplained portion of TSS.
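In symbols (standard definitions, added for reference): TSS = Σ(Yi - Y bar)^2, ESS = Σ(Y^i - Y bar)^2, and RSS = Σ(Yi - Y^i)^2 = Σ ei^2, with TSS = ESS + RSS for OLS with an intercept.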
Understand why doing specification searches in a data set, and then using the same data set to perform statistical tests, is an invalid procedure and can cause unimportant factors to erroneously appear to be statistically significant.
Two kinds of mistakes can be made using such a system. First, X2 sometimes can be left in the equation when it does not belong there, but such a mistake does not change the expected value of β^1. Second, X2 sometimes can be dropped from the equation when it belongs. In this second case, the estimated coefficient of X1 will be biased. In other words, β^1 will be biased every time X2 belongs in the equation and is left out, and X2 will be left out every time that its estimated coefficient is not significantly different from zero. We will have systematic bias in our equation! To summarize, the t-test is biased by sequential specification searches. Since most researchers consider a number of different variables before settling on the final model, someone who relies on the t-test or adjusted R^2 alone is likely to encounter this problem systematically.
Other info: One of the weaknesses of econometrics is that a researcher potentially can manipulate a data set to produce almost any result by specifying different regressions until estimates with the desired properties are obtained. If theory, not adjusted R^2 or t-scores, is the most important criterion for the inclusion of a variable in a regression equation, then it follows that most of the work of specifying a model should be done before you attempt to estimate the equation. Since it's unreasonable to expect researchers to be perfect, there will be times when additional specifications must be estimated. However, these new estimates should be few in number and should be thoroughly grounded in theory. In addition, they should be explicitly taken into account when testing for significance and/or summarizing results. In this way, the danger of misleading the reader about the statistical properties of the final equation will be reduced. The sequential specification search technique allows a researcher to estimate an undisclosed number of regressions and then present a final choice (which is based upon an unspecified set of expectations about the signs and significance of the coefficients) as if it were the only specification estimated. Such a method misstates the statistical validity of the regression results for two reasons: (1) the statistical significance of the results is overestimated because the estimations of the previous regressions are ignored, and (2) the expectations used by the researcher to choose between various regression results rarely, if ever, are disclosed, so the reader has no way of knowing whether all the other regression results had opposite signs or insignificant coefficients for the important variables. Unfortunately, there is no universally accepted way of conducting sequential searches, primarily because the appropriate test at one stage in the procedure depends on which tests previously were conducted, and also because the tests have been very difficult to invent. As stated above, sequential specification searches are likely to mislead researchers about the statistical properties of their results. In particular, the practice of dropping a potential independent variable simply because its coefficient has a low t-score or because it lowers adjusted R^2 will cause systematic bias in the estimated coefficients (and their t-scores) of the remaining variables.
Type I and type II error
Type I: Rejecting a true null Type II: Accepting a false null
Understand consequences of including a full set of dummy variables along with a constant in a regression model.
When defining dummy variables, a common mistake is to define too many variables. If a categorical variable can take on k values, it is tempting to define k dummy variables. Resist this urge. Remember, you only need k - 1 dummy variables. A kth dummy variable is redundant; it carries no new information. And it creates a severe multicollinearity problem for the analysis. Using k dummy variables when only k - 1 dummy variables are required is known as the dummy variable trap. Avoid this trap!
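A small STATA illustration of the trap (hypothetical variable names: a dependent variable y and a categorical variable quarter with four values):
tabulate quarter, generate(q)    // creates the indicator variables q1-q4
regress y q1 q2 q3 q4            // all four dummies plus the constant: STATA drops one because of perfect collinearity
regress y q1 q2 q3               // correct: k - 1 = 3 dummies, with q4 as the omitted (base) category
regress y i.quarter              // factor-variable notation handles the omitted category automatically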
*Why use OLS? How does it work?
Why?
- easy to use
- the goal of minimizing the residuals is quite appropriate
- OLS estimates have a number of useful characteristics (the sum of the residuals is exactly zero, and OLS can be shown to be the best estimator under a set of specific assumptions)
- an estimator is a mathematical technique that is applied to a sample of data to produce a real-world numerical estimate of the true population regression coefficient (or other parameter); thus OLS is an estimator, and the beta-hat produced by OLS is an estimate
How?
- it selects those estimates of beta zero and beta one that minimize the squared residuals, summed over all of the sample data points
p value
Yet another way to test a hypothesis is by using what are known as p values. If alpha > p value, reject H0; if alpha < p value, accept H0. For a greater than test, the p value is the probability of getting an outcome as big or bigger than what you got in your sample, when the null hypothesis is true
Distinction between Yi and Yi^
Yi is in the true regression equation and Y^i is in the estimated regression equation. Specifically, Yi is the ith observation of the dependent variable, while Y^i is its estimated (fitted) value.
Understand the different measures of goodness of fit - adjusted R-squared, AIC, and BIC, including how to compute the latter two by hand and in STATA.
You want adjusted R-squared as close to 1 as possible. AIC and BIC you want to get smaller when you run a different model; if they decrease, that indicates that your new model is better. STATA command for AIC and BIC: estat ic. By hand (the simplified versions used in class): AIC = log(RSS/N) + 2(K+1)/N and BIC = log(RSS/N) + log(N)(K+1)/N. STATA's estat ic uses a different (likelihood-based) formula, so only compare values computed the same way.
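A minimal STATA sketch (assuming the wage data used elsewhere in these notes, with the variables wage, ed, and exper, is loaded):
regress wage ed exper
estat ic         // reports N, the log likelihoods, df, AIC, and BIC for the model just fit
regress wage ed c.exper##c.exper
estat ic         // the model with the smaller AIC/BIC (computed the same way) fits better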
how to test hypotheses by using a z, t, or F test statistic
Z score: do these two populations differ? T statistic: Do these two samples differ? F statistic: Do any of these three or more samples differ from each other? F = larger sample variance/smaller sample variance (placing larger on top forces it into a right tailed test) A test statistic is just a way to calculate this likelihood consistently across a variety of situations and data. This is important because it helps you establish the statistical significance of your result, which in turn determines whether or not you reject your null hypothesis. This is done by comparing your test statistic value to a pre-established critical value. The higher the absolute value of your test statistic, the higher the significance of your result.
Using STATA's Ftail and invFtail commands to find p-values and critical values for F tests.
display Ftail(df1, df2, F0) computes the right-tail (upper-tail) p value of an F statistic.
display invFtail(df1, df2, p) computes the right-tail critical value of an F distribution.
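For example (the numbers match the F table entry cited above):
display invFtail(10, 12, .05)    // critical value for alpha = .05 with (10, 12) degrees of freedom; about 2.7534
display Ftail(10, 12, 2.7534)    // right-tail p value of that F statistic; about .05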
Understand that when comparing two specifications with different measures (e.g. log and level) of the dependent variable, goodness of fit measures like R-squared, adjusted R-squared, AIC, and BIC can't be compared without converting everything to a common basis.
duh - the two models have different dependent variables (wage vs. log wage), so their total sums of squares differ and the reported adjusted R-squared, AIC, and BIC are not on a common basis; convert the log model's predictions back to levels (or otherwise put both models on the same footing) before comparing.
Understand how to display a polynomial fit overlaid with a scatterplot to visually judge the quality of the fit. Be able to implement a polynomial model in STATA.
from Eric's dofile:
*QUADRATICS
use "Ruud wage data.dta"
*create a scatter plot of log wages against years of schooling
twoway (scatter logw ed) (lfit logw ed)
*create a scatter plot and a fitted quadratic plot of log wages against years of schooling
twoway (scatter logw ed) (qfit logw ed)
*run the regression
*method 1
gen ed2 = ed^2
gen exper2 = exper^2
reg logw i.fe i.nw i.un ed exper ed2 exper2
*method 2
regress logw i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
*evaluate the CONDITIONAL marginal effect of years of schooling at sample means
*method A: from the regression output
display .0191133+2*.0029654*13.14507
*method B: use a postestimation command
*** IMPORTANT NOTE: the margins command can be used ONLY if the regression you run uses the "#" method from method 2 above! If you run the regression using the generated ed2 and exper2, the margins command will be inaccurate!
regress logw i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
margins, dydx(ed) atmeans
*evaluate the AVERAGE marginal effect of years of schooling using a postestimation command
margins, dydx(ed)
*for quadratic terms, marginal effect at sample means = average marginal effect
*run the regression with only linear education
regress logw i.fe i.nw i.un ed c.exper c.exper#c.exper
*do a RESET test
estat ovtest
*run the original regression
regress logw i.fe i.nw i.un c.ed c.ed#c.ed c.exper c.exper#c.exper
*do a RESET test
estat ovtest
Be able to do the necessary manipulations to use logs in STATA.
generate lnwage = ln(wage)
regress lnwage grade c.tenure##c.tenure
We can fit a regression model for our transformed variable including grade, tenure, and the square of tenure. Note the use of Stata's factor-variable notation to include tenure and the square of tenure: the c. prefix tells Stata to treat tenure as a continuous variable, and the ## operator tells Stata to include both the main effect (tenure) and the interaction (tenure*tenure).
margins
This margins command reports the average predicted log wage. Based on this model, and assuming we have a random or otherwise representative sample, we expect that the average hourly log wage in the population is 1.87 with a confidence interval of [1.85, 1.89]. That is hard to judge, though, because we are not used to thinking on a log-wage scale. It is tempting to simply exponentiate the predictions to convert them back to wages, but the reverse transformation results in a biased prediction.
*REMEMBER: you can't just exponentiate the point estimate of the prediction from your log regression for this
*it was OK for confidence intervals, but not for point estimates
*see the handouts for more details
*here are the steps
*1. exponentiate the predicted logs
gen m_hat = exp(logwagehat)
*2. regress wage on the exp(predlog), suppressing the constant
reg wage m_hat, noconstant
*3. use this regression to generate predicted values of the level of wage
predict predictedlevel, xb
*now we have the point estimate in the predictedlevel variable
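One assumption in the steps above is that the variable logwagehat already exists; it is the fitted value from the log-wage regression. Hypothetically (matching the variable names used above), it would be generated right after that regression with:
regress lnwage grade c.tenure##c.tenure
predict logwagehat, xb    // fitted values of log wage, exponentiated in step 1 above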
F tests: Understand the meaning of degrees of freedom in the numerator and denominator, and be able to figure out what they are in a particular application.
For a regression F test, the numerator degrees of freedom equal the number of restrictions being tested (K slope coefficients for the overall significance test) and the denominator degrees of freedom equal N - K - 1, the degrees of freedom of the unrestricted regression. In the ANOVA framing: numerator df = k - 1 and denominator df = N - k, where k = # of groups and N = total sample size.
Understanding the consequences for bias and variance of having an omitted variable or an irrelevant variable.
Omitted variable:
- whenever you have an omitted variable, the interpretation and use of your estimated equation become suspect
- usually causes bias in the estimated coefficients of the variables that are in the equation
- the bias caused by leaving a variable out of an equation is called omitted variable bias. In an equation with more than one independent variable, the coefficient βk represents the change in the dependent variable Y caused by a one-unit increase in the independent variable Xk, holding constant the other independent variables in the equation. If a variable is omitted, then it is not included as an independent variable, and it is not held constant for the calculation and interpretation of β^k. This omission can cause bias: it can force the expected value of the estimated coefficient away from the true value of the population coefficient.
- MAJOR CONSEQUENCE: causes bias in the regression coefficients that remain in the equation.
From Equations 6.2 and 6.3 (in the textbook), it might seem as though we could get unbiased estimates even if we left X2 out of the equation. Unfortunately, this is not the case, because the included coefficients almost surely pick up some of the effect of the omitted variable and therefore will change, causing bias. To see why, take another look at Equations 6.2 and 6.3. Most pairs of variables are correlated to some degree, so X1 and X2 almost surely are correlated. When X2 is omitted from the equation, the impact of X2 goes into ε*, so ε* and X2 are correlated. Thus if X2 is omitted from the equation and X1 and X2 are correlated, both X1 and ε* will change when X2 changes, and the error term will no longer be independent of the explanatory variable. That violates Classical Assumption III! In other words, if we leave an important variable out of an equation, we violate Classical Assumption III (that the explanatory variables are independent of the error term), unless the omitted variable is uncorrelated with all the included independent variables (which is extremely unlikely). In general, when there is a violation of one of the Classical Assumptions, the Gauss-Markov Theorem does not hold, and the OLS estimates are not BLUE. Given linear estimators, this means that the estimated coefficients are no longer unbiased or are no longer minimum variance (for all linear unbiased estimators), or both. In such a circumstance, econometricians first determine the exact property (unbiasedness or minimum variance) that no longer holds and then suggest an alternative estimation technique that might be better than OLS.
Irrelevant variables:
- increase the variances of the estimated coefficients of the included variables
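A small simulation sketch of the bias (entirely hypothetical numbers and variable names, not from the course data):
clear
set seed 2024
set obs 1000
generate x2 = rnormal()
generate x1 = 0.5*x2 + rnormal()            // x1 and x2 are correlated
generate y  = 1 + 2*x1 + 3*x2 + rnormal()   // true coefficients: 2 on x1, 3 on x2
regress y x1 x2                             // the x1 coefficient is close to its true value of 2
regress y x1                                // omitting x2: the x1 coefficient picks up part of x2's effect and is biased upward here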
*More explanation for the consequences of irrelevant variables
t scores and adjusted R squared decrease! standard errors increase!
Distinction between β and βˆ
β is a population parameter. It is in the theoretical regression equation. β^ is called beta-hat and is the sample estimate of the population parameter β. It is in the estimated regression equation. β is always a regression coefficient.
Distinction between εi and ei
εi is in the true regression equation and ei is in the estimated regression equation. Specifically, ε is the stochastic error term.