PAM 3100 - Prelim #2
Per-capita transformations
-Per capita (i.e., per person): Xt / Pt
-for example: GDP per capita
-often more informative than raw values
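A minimal Stata sketch of a per-capita transformation, assuming hypothetical variables gdp (total GDP) and pop (population):
* per-capita GDP: divide the raw series by population
gen gdp_pc = gdp / pop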
F-Distribution
- a family of distributions
- F is always >= 0
- larger values are more surprising
- the center of the distribution is usually around 1
- depends on 2 parameters:
--df1: "numerator degrees of freedom"
--df2: "denominator degrees of freedom"
Motivation for multivariate regression
-"control" for any omitted variables
-for OLS to be unbiased in multivariate regression, we need the error term to be uncorrelated with each of X₂,...,Xk
F-test
-After estimating our regression and specifying the hypothesis, we can compute an F-statistic: F* (the formula is complicated; rely on Stata)
-If H0 is true, F* will be drawn from an F-distribution
-Then, we ask whether F* is a surprising number to have come from this F-distribution:
--if yes: reject null hypothesis
--if no: fail to reject null hypothesis
Solutions to Multicollinearity
-Do a joint test of suspect variables: F-test -Maybe drop some variables that are highly correlated -get more data
Consequences of clustered errors
-OLS estimators unbiased
-typically less precise
-OLS estimators no longer minimum variance (some other linear estimators have lower variance)
-OLS standard errors not correct: standard errors too small --> t-stat too big
Consequences of autocorrelation
-OLS estimators unbiased: E[bk]=Bk
-typically less precisely estimated
-OLS estimators no longer minimum variance
--other estimators have lower variance
-OLS standard errors not correct:
--positive autocorrelation: OLS standard errors too small --> t-statistic too big
--negative autocorrelation: OLS standard errors too large --> t-statistic too small
Nominal to Real
-consider a time series. suppose that:
--X1t: value of X in year t, expressed in year 1 dollars (real value of X in year t)
--Xtt: value of X in year t, expressed in year t dollars (nominal value of X in year t)
-to transform data from nominal to real, we use the CPI:
--CPI1, CPIt: cost of a fixed bundle of goods in year 1 and year t
--X1t = Xtt * (CPI1 / CPIt)
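A minimal Stata sketch of the nominal-to-real conversion, assuming hypothetical variables x_nominal (year-t dollars) and cpi (the CPI in year t):
* suppose the base-year (year 1) CPI is 100 (hypothetical value)
* real value in base-year dollars = nominal value * (CPI1 / CPIt)
gen x_real = x_nominal * (100 / cpi)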
Consequences of Including Irrelevant Variables
-estimates of b₁,...,b₄ are unbiased
-But: estimators of b₁,...,b₄ are less precise (i.e., have larger standard errors) than they would be if X₄ were left out of the regression
Solutions to Omitted Variables
-find a rich dataset that includes variables that would otherwise be omitted
-run an experiment where assignment of the variable of interest can be done randomly
-find a "quasi-experiment" (natural experiment)
--difference-in-differences
--regression discontinuity
-look for instrumental variables
--variables that predict the included X's, but are uncorrelated with the error term
Fundamental Difference
-for a multivariate model, the key difference is in interpretation of coefficients: "holding the other X's constant"
Regression and causality - problem
-if X is correlated with the error term: there are hidden (omitted) variables in the error term that are correlated with X --> B₂ will not give the causal impact
-if we can include these in our model, the remaining part of the error term might be uncorrelated with X (so we could give an estimate of the causal effect)
-main motivation for multivariate regression --> controlling for confounding factors
Violation of Population assumption
-if the multivariate population model is the "truth", then in the bivariate model, X₂ and the error term are correlated
-if B₃ ≠ 0, and X₂ and X₃ are correlated --> X₂ and the error term in the bivariate regression are correlated (violation of assumption)
-key assumption in bivariate model: X₂ and error term uncorrelated
--if true: OLS --> unbiased estimator of B₂
--if not true (i.e., X₂ and error term are correlated), then OLS generates biased estimates of B₂
--we call this "omitted variable bias"
-implications for statistical inference. Either:
--estimators (bk) are biased, or
--OLS standard errors are wrong, or
--estimators are not efficient
Assumption 4 implications
-if we have perfect multicollinearity, we cannot even estimate OLS
-usually, perfect multicollinearity is a result of a user error
-STATA detects multicollinearity and "drops" one variable to estimate the model
-Note: this assumption fails if the sample size is smaller than the number of estimated parameters (i.e., if n < K)
Interaction Terms
-interactions can be extremely useful, especially to examine heterogeneity in relationships between X and Y
-we use them any time we expect the relationship between two variables to be different across groups
-example: does the impact of education on earnings differ by gender or race?
Graphical Description
-it's difficult to show multivariate data graphically
-one exception: if there are 3 variables, one of which is a dummy variable
-some examples: still (static) figures, 3-D pictures, animated figures
Consequences of Heteroscedasticity
-least squares estimators remain unbiased: E[bk]=Bk
-noisier (less precise) estimates
-OLS estimator no longer minimum variance
--some other linear estimators have lower variance
-OLS standard errors are not correct
--so, t-statistics and p-values are wrong --> can lead to incorrect inferences
Dummy variables (or indicator variables)
-many times we are interested in learning about how categorical variables predict outcomes -we have focused on the case where X and Y are numerical -to use categorical variables, we need to recode them as dummy variables: variable that =1 if a condition is satisfied, =0 otherwise
Nonlinear relationships
-our method is for linear relationships
-solution: data transformation
--create new variables so the transformed data has a linear relationship
--apply regression to the transformed data
Solution to variable selection
-modern "big data" methods like LASSO -LASSO: least absolute shrinkage and selection operator
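A minimal sketch of LASSO in Stata, assuming Stata 16 or newer and hypothetical regressors x1-x50:
* LASSO penalizes coefficient size and selects a subset of the candidate regressors
lasso linear y x1-x50
lassocoef          // list the variables the penalty kept in the model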
Why transform Data
-most times, data are not in a form that is useful for analysis
-data transformations let us use our main tools
-careful: they might require a slightly different interpretation of results
T-tests in multivariate regression
-perform hypothesis testing about one of the parameters Bk
-Two important differences:
--degrees of freedom: n-K
--interpreting the slope: "change in Y for a 1 unit increase in X, holding constant the other X's in the regression"
-note: K is the number of estimated parameters (including the intercept); if we include 3 independent variables in the model, then K=4
Interaction Terms equations
-population model: Y = B₁ + B₂D + B₃X + B₄(D × X) + ε
-Therefore:
--dummy variables: different intercepts by group
--interaction variables: different slopes by group
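A minimal Stata sketch of estimating this model, assuming hypothetical variable names d (a 0/1 dummy), x (numerical), and y:
gen d_x = d * x          // interaction term: dummy times numerical variable
regress y d x d_x
* equivalent, using factor-variable notation:
regress y i.d##c.x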
Regression and causality
-regression model: Y = B₁ + B₂X + ε
-B₂ = statistical relationship between X and Y
-When can we interpret the statistical relationship as a causal relationship? --> when the error term is uncorrelated with X
--when X is randomly assigned or as good as randomly assigned
-note: random assignment of X ≠ random sample
Uses of natural logarithm
-rescale time-series data to have linear trends -calculating growth rates or percentage changes
Heteroskedasticity and homoskedasticity
-sometimes people assume that all ε's are drawn from the same distribution
--this implies they all have the same variance
--constant variance = homoscedasticity (equal variance)
-non-constant variance = heteroscedasticity (different variance)
--very common in cross-section data
Measures of goodness of fit
-standard error of the regression -R² (r-squared) -adjusted R²
Dummy Variable Trap
-suppose that there are M categories for some categorical variable (M is a constant)
-for the model to be estimated, use M-1 dummy variables and M-1 interaction terms:
--for the omitted group, all dummy variables and interactions will be zero
--the line for the omitted group is then the one to which the lines for all other groups are compared (e.g., B₁ + B₃X)
Omitted Variables Theory
-suppose the true model is given by: Y = B₁ + B₂X₂ + B₃X₃ + ε
-then we want to estimate B₂, not B̃₂ (the bivariate slope)
-if we estimated a bivariate regression, we would be incorrectly omitting X₃; therefore, X₃ would be an omitted variable
Violation of A0: Population relationship not linear
-the relationship between Y and X₂,...,Xk might not be linear
-need to do the appropriate data transformation
-transformed data should have a linear relationship
-sometimes we can do "nonparametric" analysis --> in Stata, the "lowess" or "lfit" commands
Ordinary Least Squares
-there is a formula for each b₁,...,bK: b = (X'X)⁻¹X'Y
-where X is an n x K matrix (with X'X invertible), b is a K x 1 vector, and Y is an n x 1 vector
Interaction Terms: definitions
-to use categorical variables in multivariate analysis, need to recode them as dummy variables or interaction terms -dummy variable: is a variable equal to 1 if some condition is satisfied, 0 otherwise -interaction term: is the product of a dummy variable and numerical variable
dummy variables with more than 2 categories
-to use categorical variables in multivariate data analysis, we must recode them as dummy variables
-if the categorical variable has more than 2 categories, we must create one dummy variable for each category
-in the regression, we include (# categories - 1) dummy variables:
--omitted group = "reference group"
--the other dummy variables tell us the shift in intercept relative to the omitted group (reference group)
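A minimal Stata sketch, assuming a hypothetical categorical variable region with 4 categories and outcome wage:
* i.region automatically creates (# categories - 1) dummies,
* leaving the first category as the omitted (reference) group
regress wage i.region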
TESTING ABOUT MULTIPLE COEFFICIENTS SIMULTANEOUSLY: F-Test
-used to test hypothesis regarding two or more slopes at the same time -when testing all slope coefficients in the model simultaneously --> "Test of overall significance of the regression"
steps for inference on Bk
0. choose the target parameter Bk you want to test and identify the corresponding sample estimate bk
1. form a hypothesis about the value of the parameter (define H₀ vs Ha)
2. create a test statistic (and draw a graph)
3. use the test statistic to decide whether to reject --> p-value approach or critical value approach
4. interpret results
Steps of an F-test
1. Set up the null and alternative hypotheses
2. Estimate the regression model
3. Using STATA, compute F* and the p-value
4. Use the p-value to conclude: reject H0, or fail to reject H0
5. Interpret the conclusion
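A minimal Stata sketch of steps 2-3, assuming hypothetical variables y, x2, x3, x4 and the null H0: B2 = B3 = 0:
regress y x2 x3 x4          // step 2: estimate the model
test x2 x3                  // step 3: Stata reports F* and the p-value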
Step 3: critical value approach
1. calculate using invttail, with n-K degrees of freedom
2. two-sided test: display invttail(n-K, a/2)
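For example, with n = 100 observations, K = 4 estimated parameters, and a = 0.05 (hypothetical numbers):
display invttail(96, 0.025)     // two-sided critical value with n-K = 96 degrees of freedom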
Step 3: p-value approach
1. calculate using ttail(n-K, t*)
-now: n-K degrees of freedom (in bivariate regression K=2)
2. two-sided test: display 2*ttail(n-K, abs(t*))
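For example, with n-K = 96 degrees of freedom and a computed t* of 2.1 (hypothetical numbers):
display 2*ttail(96, abs(2.1))   // two-sided p-value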
Natural Log Transformations
1. log-linear 2. linear-log 3. log-log
Assumptions: Population model for multivariate data
Assumption 0:
-the population regression model, showing the relationship between X₂,...,Xk and Y, is linear
Assumption 1:
-the conditional distribution of ε (error term) given values of the X's has a mean of zero
-distribution of the error term: mean is zero (average deviation is zero) --> E[ε]=0
-E[ε]=0 is true regardless of the values of the X's:
--knowing the X's is not informative about the average value of the error term
--mathematically: E[ε|X₂,...,Xk]=0
Assumption 2:
-the X's and Y are independently and identically distributed (iid) draws from the population (random sample)
Assumption 3:
-no extreme outliers
-the X's and Y have non-zero, finite 4th moments (finite kurtosis)
Assumption 4:
-the regressors X₂,...,Xk are not perfectly multicollinear
Assumption 5:
-either n is large, or ε ~ N(0, σ²ε) (note: S&W does not explicitly make this assumption)
Solutions to autocorrelation
Calculate "correct" standard errors after OLS:
-HAC "Heteroscedasticity and Autocorrelation Consistent" standard errors
-in Stata: "newey"
Get more precise estimates and correct standard errors by modeling the error variance:
-transform the model so the new error term is independent
-"weighted" least squares (will be the minimum variance approach)
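A minimal Stata sketch of the HAC approach, assuming a hypothetical yearly time series with variables year, y, and x (the lag length of 4 is an illustrative choice):
tsset year                 // declare the data as a time series
newey y x, lag(4)          // OLS coefficients with Newey-West (HAC) standard errors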
step 4: interpret results
Do we reject H0 or not?
-If we reject H0, we have evidence that HA is true
-If we do not reject H0, we have insufficient evidence to claim HA is true. Remember: we can never 'prove' H0!
F-Test and T-Tests
If we are only testing 1 coefficient, we can use either the t-test or the F-test:
-results are identical (same p-value)
-the F-statistic F* is equal to (t*)²
-we can only use F-tests for two-sided hypotheses!
When should natural log models be used?
Log-Linear Model:
-when the underlying relationship between X and Y is best characterized as exponential (e.g., education and wages)
Linear-Log Model:
-when X is on a very different scale for different observations (e.g., population of countries or states)
Log-Log Model:
-when interested in calculating an elasticity
Important:
-when interpreting coefficients in models using log transformations in multivariate regression, remember to include "holding all other X's constant"
Violation of Assumption 2: autocorrelation
Model assumes ε's are drawn independently:
-this means that one draw does not affect the value of the next draw
-independent draws --> uncorrelated errors
When dealing with time-series data, draws are almost always correlated:
-correlated draws (over time) --> autocorrelated errors
Violation of Assumption 4: Multicollinearity
Perfect multicollinearity is a strict violation of assumption 4:
-OLS cannot be estimated
A less extreme version that can still be a problem is high multicollinearity:
-essentially, when the regressors are highly correlated with each other (though not perfectly, i.e., |ρ| ≠ 1)
-for example: age, years of education, years of work experience
Problems:
-coefficients are imprecise (significance tests indicate we cannot tell them apart from zero), but R-squared is high (the model predicts the data well!)
-coefficients are "unstable": small changes to the model lead to large changes in estimated coefficients
Some Notation
Random variables are X₂,...,Xk and Y
-X₂,...,Xk: covariates, explanatory variables, independent variables, right-hand-side (RHS) variables, or regressors
-Y: dependent variable, outcome, regressand, or left-hand-side (LHS) variable
K ≥ 2:
-K is the total # of variables being analyzed
-1 Y variable; K-1 X variables
-K is the total # of unknown population parameters (B₁, B₂, B₃,...,Bk)
Adjusted R-squared
Adjusted R² (R̄²) < R². The difference between the measures is larger when:
-K is larger
-n is smaller
Adjusted R-squared can increase or decrease when we add a variable to the model
-it can also be negative
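The standard formula relating the two measures, with n observations and K estimated parameters (including the intercept):
R̄² = 1 - (1 - R²) * (n - 1) / (n - K)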
R-Squared
Stata output:
-"r-squared" in the regression statistics table
Interpretation: ↑R² → ↑ fit of the regression:
-R² = 1: a perfect fit; all points on the regression line
-0 < R² < 1: typical case; higher → better fit
-R² = 0: no fit
-notice that OLS is the result of minimizing SSR = maximizing R²
Standard error of the regression
Stata output:
-Root MSE
Interpretation:
-measured in the same units as Y, so interpreted in relation to similar regressions
-↓Se → ↑ fit of the regression
Interpretation of OLS coefficients
The OLS coefficients, bk are random variables -The bk's have distributions!
Violation of assumption 2: Clustered errors
The error term (ε) might be correlated within some groups of observations, but independent across groups
Step 1: Form a Hypothesis
Two-sided or two-tailed tests:
-null hypothesis: H₀: Bk = 0
-alternative hypothesis: Ha: Bk ≠ 0
One-sided tests:
-upper one-sided: H₀: Bk ≤ 0 vs Ha: Bk > 0
-lower one-sided: H₀: Bk ≥ 0 vs Ha: Bk < 0
Outline -- Data Transformations
Univariate analysis:
-nominal to real currency
-per-capita adjustments
-percentage changes
-moving averages
-natural logarithm
Bivariate and multivariate analysis:
-dummy variables
-natural logarithm
-polynomial models
-interaction models
Log-Log Model
Use OLS to estimate B₁ and B₂:
-thus, get b₁ and b₂: ln(yᵢ) = b₁ + b₂ln(xᵢ)
-b₂ interpreted as an elasticity: the average % change in y for a 1% increase in x
Log-Linear Model
Use OLS to estimate B₁ and B₂:
-thus, get b₁ and b₂: ln(yᵢ) = b₁ + b₂xᵢ
-b₂ * 100 interpreted as the average % change in y for a one-unit increase in x
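A minimal Stata sketch of the log-linear model, assuming hypothetical variables y and x:
gen lny = ln(y)
regress lny x          // log-linear: slope * 100 ≈ % change in y per unit increase in x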
Linear-Log Model
Use OLS to estimate B₁ and B₂:
-thus, get b₁ and b₂: yᵢ = b₁ + b₂ln(xᵢ)
-b₂ / 100 interpreted as the average change in y for a 1% increase in x
Linear log model interpretation
ex: on average, a 1% increase in household income is associated with an increase of 0.188 (18.8/100) points in the math score
Other Types of Interactions
We can also include interaction terms between continuous variables
-this allows for modeling more complicated relationships
Ex: does the increase in income from an extra year of education depend on age?
-then we can include a variable "education*age"
-the change in income for an extra year of education will depend on age, through the coefficient on the interaction term
-also, the change in income for an extra year of age will depend on education, through the coefficient on the interaction term
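A minimal Stata sketch, assuming hypothetical variables income, educ, and age:
* c.educ##c.age includes educ, age, and the educ*age interaction
regress income c.educ##c.age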
Logarithm
a logarithmic function is the inverse operation of raising a number to a power
Formally:
-base-a logarithm: if a^b = c, then log_a(c) = b
-"natural logarithm": if e^x = y, then ln(y) = x
-thus ln is the logarithm with base e (ln = log_e)
--very common
Polynomial transformation
a multivariate linear model that allows for nonlinear relationships between X and Y -we can still use OLS to estimate B's
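A minimal Stata sketch of a quadratic model, assuming hypothetical variables y and x:
gen x_sq = x^2
regress y x x_sq          // still linear in the B's, so OLS applies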
Problem with R-squared
adding more explanatory variables (X's) will always increase R², or in extreme cases leave it unchanged:
-TSS constant (never changes)
-ESS stays the same or goes up
-we can interpret higher R² as better fit, but this might just be due to an increase in K (# of variables included in the model)
Dummy variable interpretation
b₂ (slope) is the vertical distance, i.e., the average difference in Y for units with D=1 relative to units with D=0
Solutions to Heteroscedasticity
calculate "correct" standard errors after using OLS estimates by:
-(most common) Stata's robust option
get more precise estimates and correct standard errors by modeling the variance:
-transform the model so the new error term has constant variance
-"weighted" least squares (will be the minimum variance approach)
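A minimal Stata sketch of the robust-standard-errors approach, assuming hypothetical variables y, x1, and x2:
regress y x1 x2, robust    // same OLS coefficients, heteroscedasticity-robust standard errors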
Solutions to Clustered errors
calculate correct standard errors after OLS:
-"cluster-robust standard errors"
-regress y x, cluster(school_id)
Warning: be careful using the "cluster" option if there are few clusters (<30 groups)
multivariate analysis interpretation
generally, we interpret Bk as the predicted change in Y for a unit increase in Xk, holding the other X's constant (all else equal, all else constant)
Moving averages
many times, time-series data can be very noisy:
-e.g., stock prices, COVID cases, etc.
Moving averages "smooth over" these fluctuations:
-calculate the average of observations close to the observation of interest (e.g., daily data, +/- 3 days)
-the average moves as the observation of interest moves
-adjacent observations in moving averages use a lot of the same information (look "smoother")
Different choices when defining a moving average:
-centered vs lagged
-even weights vs varying weights (less common)
In Stata:
-we can use the "lag" and "forward" operators
-for a centered 5-quarter moving average: gen ma5 = (L2.gdp + L.gdp + gdp + F.gdp + F2.gdp) / 5
Should we always control for as many variables as possible?
not always... can generate problems when including irrelevant variables
perfect multicollinearity
one of the regressors is a perfect linear function of one or more of the other regressors -assumption 4: we assume this is not the case
Polynomial transformation: interpretation
suppose B₂ > 0 and B₃ > 0:
-Y is increasing in X, at an increasing rate
suppose B₂ < 0 and B₃ < 0:
-Y is decreasing in X, at an increasing rate
suppose B₂ > 0 and B₃ < 0:
-Y is increasing in X, at a decreasing rate
dummy variables: estimation
we can estimate models with dummy variables using ordinary least squares
Properties of bk
when assumptions 0-5 are satisfied:
1. estimator bk is unbiased: E[bk]=Bk
-the average value of bk in repeated samples is Bk
2. estimator bk is consistent: plim(n→∞) bk = Bk
-the distribution of bk collapses around the "truth" (Bk) as the sample size grows
3. has variance: var(bk) = σ²bk --> 0 as n→∞
-the estimated slopes get more precise (smaller variance) with larger sample sizes
4. OLS is still BLUE
-the bk's are minimum variance estimators of Bk
-of all possible linear and unbiased estimators, bk is the most likely to be close to Bk (OLS is BLUE)
5. bk ~ N(Bk, σ²bk)
-the bk's are normally distributed
6. the test statistic is distributed as a t with n-K degrees of freedom
-(bk - Bk) / s_bk ~ t(n-K)
Polynomial transformation: when to use
when to use polynomial models?:
-when you expect the relationship between two variables to be smooth, but nonlinear