Intro Econ A
Heteroskedasticity (11)
the pattern of covariation around the regression line is not constant: it varies in some way as the values of X change from small to medium to large.
1. Heteroskedasticity and homoskedasticity describe a feature of the disturbance term u - is var(u|X) constant or not? Heteroskedasticity - non-constant. Homoskedasticity - constant.
2. Consequences of heteroskedasticity vs. homoskedasticity.
3. Implication for computing standard errors:
a) The (robust) SE formula we've seen so far works whether we have heteroskedasticity or homoskedasticity.
b) But the (traditional) SE formula that is reported by default by most regression software is valid only if we have homoskedasticity.
c) This is why we are always asking Stata for "robust" SEs.
Terminology alert: "Robust SEs" (what we've been using so far): - "Robust to heteroskedasticity" - "Heteroskedastic-consistent" - "Heteroskedastic-robust" - Available with Stata's robust option.
"Un-robust SEs" (this lecture, an alternative, sometimes useful): - "Homoskedasticity-only" - "Classical" - "Traditional" - Reported by Stata by default.
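A minimal Stata sketch (my illustration, using the auto dataset as elsewhere in these notes) contrasting the two SE formulas; the coefficient estimates are identical, only the reported SEs change:
sysuse auto, clear
regress mpg weight            // classical ("homoskedasticity-only") SEs - Stata's default
regress mpg weight, robust    // heteroskedasticity-robust SEs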
Nonlinear regression function(10++)
A regression function with a slope that is not constant. The regression function so far has been linear in the Xs.
• But the linear approximation is not always a good one.
• The multiple regression model can handle regression functions that are nonlinear in one or more X.
If the relation between Y and X is nonlinear:
• The effect on Y of a change in X depends on the value of X - that is, the marginal effect of X is not constant.
• A linear regression is mis-specified: the functional form is wrong.
• The estimator of the effect on Y of X is biased: in general it isn't even right on average.
• The solution is to estimate a regression function that is nonlinear in X.
dummy variable (9+)
A variable for which all cases falling into a specific category assume the value of 1, and all cases not falling into that category assume a value of 0.
Frisch-Waugh-Lovell (FWL) Theorem
The Frisch-Waugh-Lovell (FWL) Theorem converts any multiple regression model into a two-dimensional version with one Y and one X.
• Intuition: we "partial out" ("take account of") the effects of all the other Xs from Y and from the X1 of interest.
• Specifically, the FWL Theorem says:
- Regress Y on all the other Xs. Get the residuals from this regression. Call this new variable Ỹ. Intuition: Ỹ is Y after removing everything that the other Xs can explain.
- Regress the X1 of interest on all the other Xs. Get the residuals from this regression. Call this new variable X̃1. Intuition: X̃1 is X1 after removing everything that the other Xs can explain.
- Regress Ỹ on X̃1. You will get exactly the same estimate (and the same SEs) as you got in your original multiple regression with all the Xs!
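A minimal Stata sketch of the FWL steps; the choice of mpg, weight and displacement is mine, for illustration only:
sysuse auto, clear
regress mpg weight displacement     // full multiple regression: note the coefficient on displacement
regress mpg weight                  // step 1: regress Y on the other Xs
predict ytilde, resid               // Y with everything weight can explain removed
regress displacement weight         // step 2: regress X1 on the other Xs
predict x1tilde, resid              // X1 with everything weight can explain removed
regress ytilde x1tilde              // step 3: same slope on x1tilde as on displacement above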
Population regression line with slope β1 and intercept β0
The population regression line is the expected value of Y given X.
• The slope is the difference in the expected values of Y, for two values of X that differ by one unit.
• The estimated regression can be used either for:
- prediction (predicting the value of Y given X, for an observation not in the data set)
- causal inference (learning about the causal effect on Y of a change in X)
• Causal inference and prediction place different requirements on the data - but both use the same regression toolkit.
• Here we cover prediction. We cover causal inference later.
We illustrate with the Stata auto dataset. Does car weight predict car miles per gallon? The population regression line: mpg = β0 + β1weight, where β1 = slope of the population regression line.
Why are β0 and β1 "population" parameters? • We would like to know the population value of β0 and β1. • We don't know β0 or β1, so must estimate them using data.
Yi = β0 + β1Xi + ui, i = 1, ..., n
• We have n observations, (Xi, Yi), i = 1, ..., n.
• X is the independent variable or regressor
• Y is the dependent variable
• β0 = intercept
• β1 = slope
• ui = the regression error
The regression error consists of omitted factors. In general, these omitted factors are other factors that influence Y, other than the variable X. The regression error also includes error in the measurement of Y.
Presentation of regression results
We have a number of regressions and we want to report them. It is awkward and difficult to read regressions written out in equation form, so instead it is conventional to report them in a table.
• A table of regression results should typically include:
- estimated regression coefficients
- standard errors
- confidence intervals (at least for the variables of interest)
- measures of fit
- number of observations
- other relevant statistics, as appropriate
- explanatory notes
• Find this information in the following table!
Unbiased estimator; consistent estimator
What is an unbiased estimator? An unbiased estimator is a statistic used to approximate a population parameter that is neither an overestimate nor an underestimate on average: its expected value equals the true parameter. If it does systematically over- or under-estimate, the mean of the difference is called the "bias". In statistics, a consistent estimator or asymptotically consistent estimator is an estimator - a rule for computing estimates of a parameter θ0 - having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0.
Population mean vs sample mean (and similarly for SD etc.)
Population mean μY: the "true" value of the parameter; what it is we are trying to estimate. Also written E(Yi). (Think "DGP", data generating process: Yi is a random variable drawn from a distribution with mean μY.)
Sample mean Ῡ: the arithmetic average value of the responses on a variable - just the average of all the observations in our dataset, where n = number of observations. (Think 74 cars, and Yi is miles per gallon for car i.)
The sample mean Ῡ is our estimator of the population mean μY; Ῡ is our estimate of the parameter μY. (NB: with OLS, the true parameter will be β and our estimator is β̂.)
Population vs sample
Population: a group or collection of all possible entities of interest - the entire group about which information is wanted (e.g. American adults). We usually idealise populations as being infinitely large.
Sample: a part or subset of the population that is used to gain information about the whole population.
"Big Data" (12)++
"Big Data" means many things: • Data sets with many observations (millions) • Data sets with many variables (thousands, or more) • Data sets with nonstandard data types, like text, voice, or images "Big Data" also has many different applications: • Prediction using many predictors - Given your browsing history, what products are you most likely to shop for now? - Given your loan application profile, how likely are you to repay a bank loan? • Prediction using highly nonlinear models (for which you need many observations) • Recognition problems, like facial and voice recognition "Big Data" has different jargon, which makes it seem very different than statistics and econometrics... • "Machine learning:" when a computer (machine) uses a large data set to learn (e.g., about your online shopping preferences) But at its core, machine learning builds on familiar tools of prediction. • This chapter focuses on one of the major big data applications, prediction with many predictors. We treat this as a regression problem, but with many predictors we need new methods that go beyond OLS. • For prediction, we do not need - and typically will not have - causal coefficients.
Quadratic in X
A regression that includes both X and X² as regressors - the r = 2 case of a polynomial in X (see "polynomial in x" below). The marginal effect of X on Y then depends on the value of X. (13)
Predicted value Y hat i; Residual u hat i
The predicted value Ŷi = β̂0 + β̂1Xi is the value of Y predicted by the estimated regression line for observation i. The residual ûi = Yi - Ŷi is the prediction mistake for observation i (actual = predicted + residual). (5)
• How to interpret β̂1 and β̂0
β̂1 is the estimated slope: the predicted difference in Y between two observations whose X differs by one unit. β̂0 is the estimated intercept: the predicted value of Y when X = 0, which is often not meaningful if X = 0 lies outside the range of the data. (5)
• R2 ("R-sq")
The R² ("R-squared") is the fraction of the sample variance of Y explained by (predicted by) X; it lies between 0 (no fit) and 1 (perfect fit). See the RSS/RMSE entries below for the formula. (5)
RSS (Residual Sum of Squares)
As in regression with a single regressor, the RMSE measures the spread of the Ys around the regression line. Same two "flavours" as before:
Flavour 1 (dof adjustment): SER = sqrt[ SSR / (n - k - 1) ]
Flavour 2 (no dof adjustment): RMSE = sqrt[ SSR / n ]
• Think of these as measuring the standard deviation of the "prediction mistake".
• Flavour 1 is called "Standard Error of the Regression" (SER) in S-W and is called "Root MSE" in Stata regression output.
The R2 is the fraction of the variance explained - same definition as in regression with a single regressor:
R2 = ESS/TSS = 1 - SSR/TSS, where ESS = Σi (Ŷi - Ῡ)², SSR = Σi ûi², TSS = Σi (Yi - Ῡ)².
• The R2 always increases when you add another regressor (why?) - a bit of a problem for a measure of "fit".
The adjusted R2 (R̄2) corrects this problem by "penalizing" you for including another regressor: the adjusted R2 does not necessarily increase when you add another regressor.
Adjusted R2: R̄2 = 1 - [(n - 1)/(n - k - 1)] × (SSR/TSS)
Note that R̄2 < R2; however, if n is large the two will be very close.
S^2(Y) = sample variance (estimated variance) of Yi
The sample variance s²Y = [1/(n - 1)] Σi (Yi - Ῡ)² is our estimator of the population variance σ²Y = var(Yi), just as the sample mean Ῡ is our estimator of the population mean μY. By the law of large numbers, s²Y converges in probability to σ²Y (see "var(Ῡ) vs var(Yi)" below).
Overfitting
A hypothesis is said to be overfit if its prediction performance on the training data is overoptimistic compared to that on unseen data. It presents itself in complicated decision boundaries that depend strongly on individual training examples.
Multicollinearity
A situation in which several independent variables are highly correlated with each other. This characteristic can result in difficulty in estimating separate or independent regression coefficients for the correlated variables.
Perfect multicollinearity is when one of the regressors is an exact linear function of the other regressors. Some more examples of perfect multicollinearity:
1. The example from before: you include weight measured in two different units. Weight in kilos is a linear transformation of weight in pounds (just multiply by 0.453592).
2. Regress mpg on a constant, F, and D, where: Fi = 1 if the car is imported (F = "foreign"), = 0 otherwise; Di = 1 if the car is made locally (D = "domestic"), = 0 otherwise. Since Fi = 1 - Di there is perfect multicollinearity. (The "dummy variable trap".)
Imperfect and perfect multicollinearity are quite different despite the similarity of the names. Imperfect multicollinearity occurs when two or more regressors are very highly correlated.
• Why the term "multicollinearity"? If two regressors are very highly correlated, then their scatterplot will pretty much look like a straight line - they are "co-linear" - but unless the correlation is exactly ±1, that collinearity is imperfect.
Imperfect multicollinearity implies that one or more of the regression coefficients will be imprecisely estimated.
• The idea: the coefficient on X1 is the effect of X1 holding X2 constant; but if X1 and X2 are highly correlated, there is very little variation in X1 once X2 is held constant - so the data don't contain much information about what happens when X1 changes but X2 doesn't. If so, the variance of the OLS estimator of the coefficient on X1 will be large.
• Imperfect multicollinearity (correctly) results in large standard errors for one or more of the OLS coefficients. But arguably, the real problem here is that you don't have enough data! More observations means more information in the dataset. An econometrics joke [sic]: the problem is not multicollinearity but "micronumerosity".
polynomial in x
A sum of multiples of powers of x. Approximate the population regression function by a polynomial: Yi = β0 + β1Xi + β2Xi² + ... + βrXi^r + ui.
• This is just the linear multiple regression model - except that the regressors are powers of X. In practice, r = 2 is most common.
• Estimation, hypothesis testing, etc. proceeds as in the multiple regression model using OLS. (But not confidence intervals!)
• The coefficients are harder to interpret vs. the standard linear model: β1 is not the effect on Y of a change in X holding X² constant ... because you can't change X and hold X² constant at the same time!
• Quadratic (r = 2): Yi = β0 + β1Xi + β2Xi² + ui. The marginal effect is β1 + 2β2X - it depends on the value of X!
• In practice: use Stata factor variables and the "margins" command, as in the sketch below.
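A hedged Stata sketch of the quadratic (r = 2) case using factor-variable notation and margins; mpg, weight and the evaluation points are my illustrative choices:
sysuse auto, clear
regress mpg c.weight##c.weight, robust               // includes weight and weight squared
margins, dydx(weight) at(weight=(2000 3000 4000))    // estimated marginal effect at three weights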
RSS (residual sum of squares)
As in regression with a single regressor, the RMSE measures the spread of the Ys around the regression line. Same two "flavours" as before:
Flavour 1 (dof adjustment): SER = sqrt[ SSR / (n - k - 1) ]
Flavour 2 (no dof adjustment): RMSE = sqrt[ SSR / n ]
These measure the standard deviation of the "prediction mistake". Flavour 1 is called "Standard Error of the Regression" (SER) in S-W and "Root MSE" in Stata regression output.
The R2 is the fraction of the variance explained - same definition as in regression with a single regressor: R2 = ESS/TSS = 1 - SSR/TSS.
The R2 always increases when you add another regressor (why?) - a bit of a problem for a measure of "fit". The adjusted R2 (R̄2) corrects this problem by "penalizing" you for including another regressor: the adjusted R2 does not necessarily increase when you add another regressor.
Adjusted R2: R̄2 = 1 - [(n - 1)/(n - k - 1)] × (SSR/TSS)
Note that R̄2 < R2; however, if n is large the two will be very close.
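A small Stata sketch (my example) showing where the two statistics appear; e(r2) and e(r2_a) are Stata's stored results for R-squared and adjusted R-squared after regress:
sysuse auto, clear
regress mpg weight
display "R2 = " e(r2) ", adjusted R2 = " e(r2_a)
regress mpg weight length turn
display "R2 = " e(r2) ", adjusted R2 = " e(r2_a)   // R2 cannot fall when regressors are added; adjusted R2 can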
Marginal effect of X on Y
But if the variable X1 is continuous, we can just use calculus. The marginal effect of X1 on Y - the expected change in Y for a small change in X1, holding the other Xs constant - is just the partial derivative ∂E(Y|X1, ..., Xk)/∂X1. And the estimated predicted change in Y - the estimated marginal effect - is the partial derivative of the estimated regression function, ∂Ŷ/∂X1, evaluated at the chosen values of the Xs.
Distribution, joint distribution
Distribution: the sampling distribution of a statistic is the distribution of the different outcomes that could possibly occur for that statistic over repeated samples from the population.
Joint distribution: random variables X and Z have a joint distribution.
Covariance between X and Z: cov(X,Z) = E[(X - μX)(Z - μZ)] = σXZ, a measure of the linear association between X and Z.
Estimated regression line w/ estimated slope Bhat1 and intercept beta hat0
Estimated regression line: mpg = 39.4 - 0.006 ✕ weight • A car that weighs one pound more gets 0.006 less miles per gallon. • The intercept (taken literally) means that, according to this estimated line, car models that had weight = 0 would get 39.4 miles per gallon. • Obviously crazy - no such thing as a weightless car! • This interpretation of the intercept makes no sense - it extrapolates the line outside the range of the data. Here, the intercept is just not meaningful.
Impact of changing units of Y or X
Estimated regression line: mpg = 39.4 - 0.006 × weight in lbs
• A car that weighs one pound more gets 0.006 fewer miles per gallon.
• The original units of Y and X are inconvenient:
- "Miles per gallon" is a common US measure, but we would probably prefer "kilometres per litre".
- Similarly, kilos would be better than lbs.
- But tonnes might be even better.
• Changing the units doesn't change the model, it just changes how it's presented. (Like reporting income in pounds or pence.)
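A quick worked conversion using the rounded slope above (1 tonne ≈ 2,204.6 lbs); this is just a rescaling of the reported coefficient, not a new regression:
\[
\hat{\beta}_1^{\text{per tonne}} = \hat{\beta}_1^{\text{per lb}} \times 2204.6 \approx -0.006 \times 2204.6 \approx -13.2 \ \text{mpg per tonne}
\]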
Bias-variance tradeoff
From basic probability (see S-W chapter 2) we can decompose the expected value of the square of a random variable Z into its variance and the square of its mean: E(Z²) = var(Z) + [E(Z)]². Apply this to prediction, where instead of Z we have the prediction error YOOS - ŶOOS (see the decomposition below):
• Biased estimates will make the MSPE bigger.
• Noisy estimates will also make the MSPE bigger.
• Given the choice between two estimators, we might prefer one with the bigger bias if it has a small variance. The low-bias/low-variance estimator has excellent predictions ... but might be impractical or unavailable. The high-bias/low-variance estimator has a lower MSPE than the low-bias/high-variance estimator ... even though it's biased!
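In symbols (a standard decomposition; writing the prediction as ŶOOS is my notation):
\[
E(Z^2) = \mathrm{var}(Z) + [E(Z)]^2
\;\;\Longrightarrow\;\;
\mathrm{MSPE} = E\big[(Y^{OOS}-\hat{Y}^{OOS})^2\big]
= \underbrace{\mathrm{var}\big(Y^{OOS}-\hat{Y}^{OOS}\big)}_{\text{variance}}
+ \underbrace{\big[E\big(Y^{OOS}-\hat{Y}^{OOS}\big)\big]^2}_{\text{squared bias}}
\]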
Independently and identically distributed (i.i.d.)
Random variables Y1, ..., Yn are i.i.d. if they are independently distributed (the value of one carries no information about the others) and identically distributed (all drawn from the same distribution). This arises automatically under simple random sampling.
Conditional expectation (conditional mean)
In probability theory, the conditional expectation, conditional expected value, or conditional mean of a random variable is its expected value - the value it would take "on average" over an arbitrarily large number of occurrences - given that a certain set of "conditions" is known to occur.
The conditional mean plays a key role in prediction:
• Suppose you want to predict a value of Y, and you are given the value of a random variable X that is related to Y. That is, you want to predict Y given the value of X. Of all possible predictions m that depend on X, the conditional mean E(Y|X) has the smallest mean squared prediction error (proof is in Appendix 2.2). A common measure of the quality of a prediction m of Y is the mean squared prediction error (MSPE), given X: E[(Y - m)²|X].
Conditional mean: you want to predict a value of Y, and you are given the value of a random variable X that is related to Y; your dataset has values of both Y and X. (Contrast with the unconditional mean: you want to predict a value of Y, and the only information you have is a dataset of values of the random variable Y for other individuals.)
Much of econometrics is about estimating a conditional mean. But our first example will be to estimate an unconditional mean (easier). The conditional mean (conditional on being a member of a group) is a (possibly new) term for the familiar idea of the group mean.
OLS (ordinary least squares) regression
In statistics, ordinary least squares is a type of linear least squares method for estimating the unknown parameters in a linear regression model; it is the most popular method for computing sample regression models.
By analogy, we will focus on the least squares ("ordinary least squares" or "OLS") estimator of the unknown parameters β0 and β1. The OLS estimator solves: minimize over b0, b1 the sum Σi [Yi - (b0 + b1Xi)]².
The population regression line: E(mpg|weight) = β0 + β1weight. (Be careful when reading scatterplots of this relationship: the apparent steepness depends on whether the axes start at zero or, say, at about 2,000 lbs and 10 miles per gallon.)
• The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction ("predicted value") based on the estimated line.
• The expression Σi [Yi - (b0 + b1Xi)]² is often called the "Residual Sum of Squares" (RSS). OLS minimizes the RSS.
• It's also often called the "Sum of Squared Residuals" (SSR).
• This minimization problem can be solved using calculus (App. 4.2).
• The result is the OLS estimators of β0 and β1 (see the formulas below).
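Solving the minimisation gives the familiar closed-form OLS estimators (a standard result, as in S-W Appendix 4.2):
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}
\]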
Mean, variance, standard deviation, covariance, correlation
Mean: the arithmetic average of a distribution, obtained by adding the scores and then dividing by the number of scores; expected value of Y = E(Y) = μY = long-run average value of Y over repeated realisations of Y.
Variance: a measure of the squared spread of the distribution; the standard deviation squared; var(Y) = E[(Y - μY)²].
Standard deviation: a measure of how much values vary around the mean; the square root of the variance.
Covariance: a measure of the linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship. The covariance of X and Z has units of (units of X) × (units of Z); cov(X,Z) > 0 means a positive relationship between X and Z.
Correlation: a measure of the extent to which two factors vary together, and thus of how well either factor predicts the other.
• -1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = -1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
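The standard formulas behind these definitions, in the notation used above:
\[
\mu_Y = E(Y), \qquad
\mathrm{var}(Y) = \sigma_Y^2 = E[(Y-\mu_Y)^2], \qquad
\sigma_Y = \sqrt{\sigma_Y^2},
\]
\[
\mathrm{cov}(X,Z) = \sigma_{XZ} = E[(X-\mu_X)(Z-\mu_Z)], \qquad
\mathrm{corr}(X,Z) = \frac{\sigma_{XZ}}{\sigma_X\,\sigma_Z}
\]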
LSA (Least Squares Assumptions) for Prediction (8)
Prediction entails using an estimation sample to estimate a prediction model, then using that model to predict the value of Y for an observation not in the estimation sample. - Prediction requires good out-of-sample performance.
• The "out-of-sample" (OOS) prediction of Y is ŶOOS = β̂0 + β̂1XOOS: it is simply the estimated OLS coefficients applied to XOOS. For example, if car 75 weighed 1.5 tonnes, we would predict ŶOOS = 16.8 - 5.63 × 1.5 = 8.36 km/litre.
• The critical LSA for prediction is that the out-of-sample observation for which you want to predict Y comes from the same distribution as the data used to estimate the model.
Assumptions:
1. The out-of-sample observation (XOOS, YOOS) is drawn from the same distribution as the estimation sample (Xi, Yi), i = 1, ..., n.
2. (Xi, Yi), i = 1, ..., n are i.i.d.
3. Large outliers in X and/or Y are rare (X and Y have finite fourth moments, i.e., E(X⁴) < ∞ and E(Y⁴) < ∞).
(Reminder: i.i.d. = independently and identically distributed.)
1) The out-of-sample observation (XOOS, YOOS) is drawn from the same distribution as the estimation sample (Xi, Yi), i = 1, ..., n. This ensures that the regression line fit using the estimation sample also applies to the out-of-sample data to be predicted.
• This is a much easier assumption to satisfy than the assumption we will have to make for causal inference ("zero conditional mean" of the error u). More on this later in the course.
• Why is this a weak assumption? All we have to believe is that we are looking at a stable distribution. "The rules don't change when we want to make a prediction."
• But this also means we cannot interpret β1 as the causal effect of X on Y. We are using X only to predict Y.
2) (Xi, Yi), i = 1, ..., n are i.i.d. This arises automatically if the entity (individual, district) is sampled by simple random sampling:
• The entities are selected from the same population, so (Xi, Yi) are identically distributed for all i = 1, ..., n.
• The entities are selected at random, so the values of (X, Y) for different entities are independently distributed.
The main place we will encounter non-i.i.d. sampling is when data are recorded over time for the same entity (time series data, panel data). We will deal with that complication when we cover time series. (NB: We will also make this assumption for causal inference.)
3) Large outliers are rare. Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞.
• A large outlier is an extreme value of X or Y.
• On a technical level, if X and Y are bounded, then they have finite fourth moments. (E.g., standardized test scores automatically satisfy this; family income would satisfy this too.)
• The substance of this assumption is that a large outlier can strongly influence the results - so we need to rule out large outliers.
• Look at your data! If you have a large outlier, is it a typo? Does it belong in your data set? Why is it an outlier?
(NB: We will also make this assumption for causal inference.)
For the multiple regression model Yi = β0 + β1X1i + β2X2i + ... + βkXki + ui, i = 1, ..., n (where β0, β1, β2, ..., βk are the regression coefficients), the LSAs are:
1. The out-of-sample observation (XOOS, YOOS) is drawn from the same distribution as the estimation sample (Xi, Yi), i = 1, ..., n.
2. (Xi, Yi), i = 1, ..., n, are i.i.d.
3. Large outliers are unlikely.
4. There is no perfect multicollinearity. (NEW)
Independently distributed
Random variables X and Z are independently distributed if knowing the value of one provides no information about the other: the occurrence or value of one does not affect the probability distribution of the other. If X and Z are independently distributed, then cov(X,Z) = 0 (but not vice versa).
Heteroskedasticity-robust SEs Also called "heteroskedasticity-consistent SEs","robust SEs"
Recall the three least squares assumptions for prediction: 1. The out-of-sample observation (XOOS,YOOS) is drawn from the same distribution as the estimation sample (Xi,Yi), i = 1,..., n 2. (Xi,Yi), i = 1,...,n, are i.i.d. 3. Large outliers are rare Heteroskedasticity and homoskedasticity concern var(u|X=x). Because we have not explicitly assumed homoskedastic errors, we have implicitly allowed for heteroskedasticity. Not a problem. So why does this matter?
var(Ῡ ) vs var (Yi)
var(Yi) = σ²Y is the variance of a single draw Yi; var(Ῡ) = σ²Y/n is the variance of the sample mean, which shrinks as n grows.
The sample variance of Yi converges in probability to the population variance of Yi: s²Y →p σ²Y. (The intuition of "→p" is "as n gets big".) Why does the law of large numbers apply?
• Because s²Y is a sample average; see Appendix 3.3.
• Technical note: we assume E(Y⁴) < ∞ because here the average is not of Yi, but of its square; see Appendix 3.3.
Dummy variable as a dependent variable (LPM, Linear Probability Model). (9++)
So far the dependent variable (Y) has been continuous: • district-wide average test score • traffic fatality rate.
What if Y is binary?
• Y = get into college, or not; X = high school grades, SAT scores, demographic variables
• Y = person smokes, or not; X = cigarette tax rate, income, demographic variables
• Y = mortgage application is accepted, or not; X = race, income, house characteristics, marital status
A natural starting point is the linear regression model with a single regressor: Yi = β0 + β1Xi + ui. But:
• What does β1 mean when Y is binary? Is β1 = ΔY/ΔX?
• What does the line β0 + β1X mean when Y is binary?
• What does the predicted value Ŷ mean when Y is binary? For example, what does Ŷ = 0.26 mean?
When Y is binary, the linear regression model Yi = β0 + β1Xi + ui is called the linear probability model because Pr(Y = 1|X) = β0 + β1X.
• The predicted value is a probability:
- E(Y|X = x) = Pr(Y = 1|X = x) = probability that Y = 1 given x
- Ŷ = the predicted probability that Yi = 1, given X
• β1 = difference in probability that Y = 1 associated with a unit difference in x.
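A hedged Stata sketch of a linear probability model using the auto dataset; treating the foreign dummy as the outcome is my illustrative choice, not an example from the notes:
sysuse auto, clear
regress foreign weight mpg, robust   // LPM: Pr(foreign = 1 | weight, mpg) modeled linearly
predict phat                         // predicted probabilities (these can fall outside [0,1])
summarize phat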
Dummy variable as a regressor
Sometimes a regressor is binary:
• X = 1 if small class size, = 0 if not
• X = 1 if female, = 0 if male
• X = 1 if treated (experimental drug), = 0 if not
Binary regressors are sometimes called "dummy" variables. So far, β1 has been called a "slope," but that doesn't make much sense if X is binary. How do we interpret regression with a binary regressor?
When Xi = 0, Yi = β0 + ui (benchmark) • the mean of Yi is β0 -or- E(Yi|Xi = 0) = β0
When Xi = 1, Yi = β0 + β1 + ui • the mean of Yi is β0 + β1 -or- E(Yi|Xi = 1) = β0 + β1
so: β1 = E(Yi|Xi = 1) - E(Yi|Xi = 0) = population difference in group means
(NB: This will be very handy when we do causal inference later. Think of an experiment where one group gets the drug (Xi = 1) and the other gets a placebo (Xi = 0). β1 is the effect of the drug on Yi.)
"Dummy variable trap"
Suppose you have a set of multiple binary (dummy) variables, which are mutually exclusive and exhaustive - that is, there are multiple categories and every observation falls in one and only one category (primary, hs, uni, pg). If you include all these dummy variables and a constant, you will have perfect multicollinearity - this is sometimes called the dummy variable trap.
• Why is there perfect multicollinearity here?
• Solutions to the dummy variable trap: 1. Omit one of the groups (e.g. primary), or 2. Omit the intercept (rarely done).
• Solution (1) is equivalent to choosing the benchmark category.
• In the auto example, we used the "foreign" dummy, omitted the "domestic" dummy, and our benchmark was domestic.
• Perfect multicollinearity usually reflects a mistake in the definitions of the regressors, or an oddity in the data.
• If you have perfect multicollinearity, your statistical software will let you know - either by crashing or giving an error message or by "dropping" one of the variables arbitrarily.
• The solution to perfect multicollinearity is to modify your list of regressors so that you no longer have perfect multicollinearity.
• When you do this, you should think about which benchmark category (the one you omit) makes interpretation easy or convenient.
• Stata's "factor variables" make this easy to do (see the sketch below).
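A short Stata illustration (my construction) of the trap and of the factor-variable fix:
sysuse auto, clear
generate domestic = 1 - foreign
regress mpg foreign domestic    // perfect multicollinearity: Stata drops one regressor and notes the omission
regress mpg i.foreign           // factor-variable syntax: Stata omits the base category automatically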
Lasso (Least Absolute Selection and Shrinkage Operator)
The Lasso estimator "shrinks" the estimate towards zero by penalizing large absolute values of the coefficients. OLS: choose bs to minimize the sum of squared residuals Lasso: choose bs to minimize the penalized SSR where is the "penalty term" and is the size of the penalty. ??? 18 The Lasso estimator "shrinks" the estimate towards zero by penalizing large absolute values of the coefficients. Lasso: choose bs to minimize the penalized SSR where is the "penalty term". Fact: the Lasso solution "shrinks" some - many - of the β's exactly to 0. (An example of extreme "shrinkage".) Intuition: Think of as the "price" for including a particular Xj. If Xj contributes little to the fit, don't buy it - just set bj=0. Means the same thing as dropping Xj from the model.?????????? Lasso shrinks some - many - of the β's exactly to 0 • This property gives the Lasso its name: the Least Absolute Selection and Shrinkage Operator. It selects a subset of the predictors to use for prediction - and drops the rest. • This feature means that the Lasso can work especially well when in reality many of the predictors are irrelevant. • Models in which most of the true β's are zero are sparse. • The Lasso produces sparse models, and works well when the population model ("reality") is in fact sparse. • The bigger the Lasso penalty , the more dropped predictors. • How to choose ? Usual approach: use cross-validation and choose the to minimize the MSPE. • Dependent variable y: log(mpg). 69 observations. • 16 continuous predictors: - price, headroom, trunk space, weight, length, turning circle, displacement (engine size), gear ratio - squares of all of the above • 6 dummy predictors - foreign dummy - All 5 dummies for repair record (5 repair categories) NB: the "dummy variable trap" is not a trap for the Lasso! Not only is perfect multicollinearity not a problem, the Lasso can even handle k>n! • → k=22 predictors in total • If we used OLS, deterioration vs. the oracle MSPE ≈ k/n = 22/69 ≈ 32%. OLS will overfit. Lasso a good alternative.
Sparse models
The Lasso produces sparse models, and works well when the population model ("reality") is in fact sparse.
Mean squared prediction error (MSPE)
The Mean Squared Prediction Error (MSPE) is the expected value of the squared error made by predicting Y for an observation not in the estimation data set: MSPE = E[(YOOS - Ŷ(XOOS))²], where:
- Y is the variable to be predicted
- X denotes the k variables used to make the prediction, and (XOOS, YOOS) are the values of X and Y in the out-of-sample data set
- the prediction uses a model estimated using the estimation data set, evaluated at XOOS.
• The MSPE measures the expected quality of the prediction made for an out-of-sample observation.
RMSE (Root Mean Squared Error), SER (Standard Error of the Regression)
The RMSE measures the spread of the distribution of u, i.e. the standard deviation of the OLS residual (the "mistake" made by the OLS regression line). It has the units of u, which are the units of Y. Two flavours:
Flavour 1 (dof adjustment): SER = sqrt[ (1/(n - 2)) Σi ûi² ] - (almost) the sample standard deviation of the OLS residuals, using a "degrees of freedom" adjustment (divide by n - 2, not n).
Flavour 2 (no dof adjustment): RMSE = sqrt[ (1/n) Σi ûi² ] - exactly the same except without the "degrees of freedom" adjustment (divide by n, not n - 2).
Terminology alert: S-W call Flavour 1 the "Standard Error of the Regression" (SER); Stata calls it the "Root MSE" and reports it in the OLS output. S-W call Flavour 2 the "Root MSE"; Stata doesn't report it. Confusing ... but nothing we can do. We will just say "RMSE" or "Root MSE".
Added Variable plots (7)
The Stata avplot command will graph the added-variable (partialled-out) relationship automatically. Here: partial out weight and graph mpg vs. displacement (see the sketch below).
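A minimal Stata sketch of the added-variable plot described here (my variable choices):
sysuse auto, clear
regress mpg weight displacement
avplot displacement    // partials weight out of both mpg and displacement, then plots the residuals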
LSA (Least Squares Assumptions) for Prediction 1. Out-of-sample (OOS) observations come from the same distribution as the estimation sample. 2. Data are i.i.d. (independently and identically distributed). 3. Large outliers are rare.
The assumptions for the linear regression model listed in Key Concept 4.3 (single variable regression) and Key Concept 6.4 (multiple regression model). See the entry "LSA (Least Squares Assumptions) for Prediction (8)" above for the full discussion; the multiple regression version adds assumption 4: there is no perfect multicollinearity.
Prediction with many predictors
The many-predictor problem: • The goal is to provide a good prediction of some outcome variable Y given a large number of X's, when the number of X's (k) is large relative to the number of observations (n) - in fact, maybe k > n! • The goal is good out-of-sample prediction. - The estimation sample is the n observations used to estimate the prediction model - The prediction is made using the estimated model, but for an out-of-sample (OOS) observation - an observation not in the estimation sample.
Oracle Prediction
The oracle prediction is the best possible prediction - the prediction that minimizes the MSPE - if you knew the joint distribution of Y and X. The oracle prediction is the conditional expectation of Y given X, E(YOOS | X = XOOS). Intuition: the oracle knows the true βs, but does not know the true error uOOS. The oracle makes the best possible predictions, where "best possible" means "knowing the true βs". But the oracle cannot foretell the future perfectly, because the oracle cannot predict the true error uOOS.
P-value
The probability level which forms the basis for deciding whether results are statistically significant (not due to chance): the probability of drawing a statistic (e.g. Ῡ) at least as extreme ("unusual, unexpected") as the value actually computed with your data, assuming that the null hypothesis is true.
p-value = Pr[ |Ῡ - μY,0| > |Ῡact - μY,0| ], computed under H0, where Ῡact is the value of Ῡ actually observed (nonrandom). To compute the p-value, you need to know the sampling distribution of Ῡ, which is complicated if n is small. But if n is large, you can use the normal approximation (CLT).
P-value and significance level: the significance level is prespecified. For example, if the prespecified significance level is 5%:
- you reject the null hypothesis if |t| ≥ 1.96.
- Equivalently, you reject if p ≤ 0.05.
• It is common - but very bad! - practice for people simply to report whether a coefficient is "statistically significant" (meaning whether or not they can reject that it is zero). Not informative!
• As we will see, it is much better to report standard errors and/or confidence intervals instead.
• However, sometimes we have no choice because the hypothesis tested is about something other than a coefficient. We will see this when we cover "specification tests".
Significance level
The probability of a Type I error; a benchmark against which the p-value is compared to determine whether the null hypothesis will be rejected. See also alpha. The significance level of a test is a pre-specified probability of incorrectly rejecting the null, when the null is true.
Unconditional expectation (unconditional mean)
Unconditional mean: you want to predict a value of Y, and the only information you have is a dataset of values of the random variable Y for other individuals.
Standard error for a single coefficient
SE(β̂1) is the square root of the estimated variance of β̂1, computed from the sample using the "plug-in" principle and reported by regression software (in Stata, regress with the robust option gives the heteroskedasticity-robust version).
Under the four Least Squares Assumptions:
• The sampling distribution of β̂1 has mean β1.
• var(β̂1) is inversely proportional to n.
• Other than its mean and variance, the exact (finite-n) distribution of β̂1 is very complicated; but for large n...
- β̂1 is consistent: β̂1 →p β1 (law of large numbers).
- [β̂1 - E(β̂1)] / sqrt[var(β̂1)] is approximately distributed N(0,1) (CLT).
• These statements hold for β̂1, ..., β̂k.
Conceptually, there is nothing new here! OLS is unbiased, consistent, and approximately normal in large samples. Hypothesis tests use the t-statistic based on SE(β̂1), and confidence intervals are constructed as {β̂1 ± 1.96 SE(β̂1)}.
The OLS estimator is B hat 1 unbiased and consistent for B 1
Under the four Least Squares Assumptions:
• The sampling distribution of β̂1 has mean β1.
• var(β̂1) is inversely proportional to n.
• Other than its mean and variance, the exact (finite-n) distribution of β̂1 is very complicated; but for large n...
- β̂1 is consistent: β̂1 →p β1 (law of large numbers).
- [β̂1 - E(β̂1)] / sqrt[var(β̂1)] is approximately distributed N(0,1) (CLT).
• These statements hold for β̂1, ..., β̂k.
Conceptually, there is nothing new here! OLS is unbiased, consistent, and approximately normal in large samples.
The OLS estimator is unbiased and consistent
Same content as the previous entry: under the four Least Squares Assumptions, each β̂j is unbiased (its sampling distribution has mean βj), consistent (β̂j →p βj by the law of large numbers), and approximately normally distributed in large samples (CLT).
identically distributed
With respect to random variables, the property of following the identical (common) probability distribution - coming from the same distribution. (Contrast with independently distributed: two random variables are independent when the occurrence or value of one does not affect the probability distribution of the other; if X and Z are independently distributed, then cov(X,Z) = 0, but not vice versa.)
Sampling distribution of Bhat 1
Yi = β0 + β1Xi + ui, i = 1, ..., n, where β1 = slope of the population regression line (our main interest) and β0 = intercept of the population regression line.
The sampling distribution of β̂1: the OLS estimator is an unbiased estimator of β1, E(β̂1|X1, ..., Xn) = β1. See the S-W chapter for a proof. But we prefer to use "large sample theory" - more generally applicable.
For n large, β̂1 is approximately normally distributed: β̂1 ~ N(β1, σ²β̂1), where σ²β̂1 = (1/n) × var[(Xi - μX)ui] / [var(Xi)]².
So β̂1 is a consistent estimator of β1 and is approximately normal for large n. But this isn't enough: the variance depends on population quantities whose true values we don't know. So we use the "plug-in" principle: replace them with estimates based on the sample. SE(β̂1) is the square root of the resulting estimated variance. This is a bit nasty, but:
• It is less complicated than it seems. The numerator estimates var(v), where vi = (Xi - μX)ui; the denominator estimates [var(X)]².
• Why the degrees-of-freedom adjustment n - 2? Because two coefficients have been estimated (β0 and β1). (But you could also use n; sometimes done.)
• It is computed and reported by regression software, e.g., Stata (regress with the robust option - will return to this technical point later).
Random sampling
a sample that fairly represents a population because each member has an equal chance of inclusion. Choose an individual (car, individual, entity) at random from the population.
Randomness and data:
• Prior to sample selection, the value of Y is random because the individual selected is random.
• Once the individual is selected and the value of Y is observed, then Y is just a number - not random.
• The dataset is (Y1, Y2, ..., Yn), where Yi = value of Y for the ith individual (district, entity) sampled.
Statistical significance
a statistical statement of how likely it is that an obtained result occurred by chance The significance level of a test is a pre-specified probability of incorrectly rejecting the null, when the null is true
Forecast interval
Confidence interval vs. forecast interval: we use confidence intervals for estimates of parameters such as μY; we use forecast intervals for outcomes such as Yi.
• A 95% confidence interval for μY is an interval that contains the true value of μY in 95% of repeated samples. As we accumulate more data, the uncertainty in our point estimate Ῡ disappears: our confidence interval for μY gets narrower and narrower, and eventually shrinks to a point. (Remember the definition of a "consistent" estimator.) Eventually Ῡ, our estimate of μY, becomes exactly right.
• A point forecast is a prediction of what the "next" Yi will be. A 95% forecast interval for Yn+1 is an interval that contains Yn+1 (an "extra" observation) in 95% of repeated samples. Forecast intervals tell us something about "how good" our prediction will be - often more useful than hypothesis testing.
• Accumulating more and more data does not eliminate the uncertainty in our forecast for Yn+1! The next car could always be a low or high mileage car.
• Terminology issues here: textbooks often use "prediction" and "forecast" interchangeably. But sometimes they use "prediction" for what our model predicts for Yi, where observation i is in the sample of n observations (an "in-sample" prediction), and "forecast" for what our model predicts for Yn+1, an "extra" observation (an "out-of-sample" forecast).
• The natural point forecast for our example of estimating the mean is easy to construct: it's just the sample mean Ῡ. That's because we assumed i.i.d. sampling: every observation is independent, so our best guess of the mileage of the next car model is just the average of the cars we already observed.
• But the forecast interval for Yn+1 will not be the same as the confidence interval for μY based on Ῡ. It will be wider. Why? Look again at the distribution of mpg across the 74 cars.
• Our car mileage example involves an unconditional forecast: we don't know anything about car n+1 except that it comes from the same distribution as our randomly selected 74 cars. (A very special case!) Our goal in prediction will be a good conditional forecast: we will use extra information about car n+1 (say, its weight) to construct a point forecast and a forecast interval for it.
How to construct a forecast interval: see slides 29 onward.
(Law of Large Numbers (LLN), Central Limit Theory (CLT)... but covered later when we come to econometric theory.)
Law of large numbers: as n increases, the distribution of Ῡ becomes more tightly centered around μY. An estimator is consistent if the probability that it falls within an interval of the true population value tends to one as the sample size increases.
If Y1, ..., Yn are i.i.d. and σ²Y < ∞, then Ῡ is a consistent estimator of μY, that is, Pr[|Ῡ - μY| < ε] → 1 as n → ∞, which can be written Ῡ →p μY ("Ῡ →p μY" means "Ῡ converges in probability to μY"). (The math: as n → ∞, var(Ῡ) = σ²Y/n → 0, which implies that Pr[|Ῡ - μY| < ε] → 1.)
The Central Limit Theorem: moreover, the distribution of (Ῡ - μY) becomes the normal distribution (CLT). If Y1, ..., Yn are i.i.d. and 0 < σ²Y < ∞, then when n is large the distribution of Ῡ is well approximated by a normal distribution:
• Ῡ is approximately distributed N(μY, σ²Y/n) ("normal distribution with mean μY and variance σ²Y/n").
• √n(Ῡ - μY)/σY is approximately distributed N(0,1) (standard normal).
• That is, the "standardized" Ῡ, (Ῡ - E(Ῡ))/σῩ = (Ῡ - μY)/(σY/√n), is approximately distributed N(0,1).
• The larger is n, the better is the approximation.
Log specifications: - Linear-log - Log-linear - Log-log
ln(X) = the natural logarithm of X.
• Logarithmic transforms permit modeling relations in "percentage" terms (like elasticities), rather than linearly.
• The interpretation of the slope coefficient differs in each case (see the summary below).
• The interpretation is found by applying the general "before and after" rule: "figure out the change in Y for a given change in X."
• Each case has a natural interpretation (for small changes in X). (Slides 20 onward.)
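A compact summary of the three specifications and the standard small-change interpretations (as in S-W):
\[
\begin{aligned}
\text{linear-log:} \quad & Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i && \text{a 1\% increase in } X \text{ is associated with a change in } Y \text{ of about } 0.01\beta_1 \\
\text{log-linear:} \quad & \ln(Y_i) = \beta_0 + \beta_1 X_i + u_i && \text{a one-unit increase in } X \text{ is associated with about a } 100\beta_1\% \text{ change in } Y \\
\text{log-log:} \quad & \ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i && \text{a 1\% increase in } X \text{ is associated with about a } \beta_1\% \text{ change in } Y \text{ (an elasticity)}
\end{aligned}
\]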
shrinkage
In this course, "shrinkage" refers to pulling estimated coefficients towards zero: the Lasso estimator "shrinks" the estimates by penalizing large absolute values of the coefficients, and shrinks some - many - of them exactly to zero (an example of extreme shrinkage).
Hypothesis testing (maintained, null, alternative)
Make and test an educated guess about a problem/solution. Hypothesis testing terminology:
• The model (or "maintained hypothesis"): everything we believe about the model and don't test.
• H0 or "null hypothesis": a restriction on the model. Usually a restriction on a parameter, e.g., H0: E(Y) = μY,0. Means "Hypothesis: the true mean μY is actually exactly μY,0."
• H1 or "alternative hypothesis": what we believe if the null is actually wrong.
The hypothesis testing problem (for the mean): make a provisional decision, based on the evidence at hand, whether a null hypothesis is true, or instead that some alternative hypothesis is true. That is, test:
- H0: E(Y) = μY,0 vs. H1: E(Y) > μY,0 (1-sided, >)
- H0: E(Y) = μY,0 vs. H1: E(Y) < μY,0 (1-sided, <)
- H0: E(Y) = μY,0 vs. H1: E(Y) ≠ μY,0 (2-sided)
The view we take in this course: hypothesis tests of whether a parameter (E(Y), i.e., μY) takes a specific value (μY,0) are rarely interesting. Much more useful: confidence and forecast intervals. But we need to do hypothesis testing first.
categorical variable
places an individual into one of several groups or categories. Example from the auto dataset: rep78 = repair record in 1978, on a 5-point scale, where 1 = "bad" and 5 = "good". Note 5 observations are missing.
We can't include all 5 dummies and a constant - this is the dummy variable trap - so we have to omit one. Only 2 cars have the worst repair record (rep78 = 1), so that would be a poor benchmark: hard to estimate precisely with just two cars! The best choice here is probably rep78 = 3 (in the middle).
What do we predict for the price of the car (in thousands of current US$) based on its repair record?
• If we specify rep78 as a "factor variable" by saying i.rep78, this tells Stata to include a set of dummies, omitting the first one as the benchmark.
• So we specify it by ib3.rep78 instead. This says use category 3 as the base (the omitted category, the benchmark) - see the sketch below.
• But we shouldn't forget to look at the price data. Average price in 1978 was about $6,000. Keep this in mind when we interpret the results.
• Not much evidence of systematic variation in price based on repair record: the effects compared to the base category (rep78 = 3) are mostly not very big, the SEs are large (parameters are imprecisely estimated), and the CIs are wide.
• The exception is rep78 = 1, the car models with "bad" repair records. These are low-price cars compared to the benchmark. But there are only 2 such cars - we can't conclude much based on just 2 observations.
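A hedged Stata sketch of the repair-record regression described above; it uses the raw price variable from the auto dataset and ib3. to set category 3 as the benchmark:
sysuse auto, clear
regress price ib3.rep78, robust    // dummies for rep78 = 1, 2, 4, 5; category 3 is the omitted benchmark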
Homoskedasticity
the pattern of the covariation is constant (the same) around the regression line, whether the values are small, medium, or large.
If var(u|X) is constant - that is, if the variance of the conditional distribution of u given X does not depend on X - then u is said to be homoskedastic. Sometimes we denote this by var(ui|Xi) = σ²u. (Note the absence of the i subscript on σ²u - it's a constant.)
If var(u|X) is not constant - that is, if the variance of the conditional distribution of u given X does depend on X - then u is said to be heteroskedastic. Sometimes we denote this by var(ui|Xi) = σ²ui. (Note the i subscript on σ²ui - it varies across observations.)
• There are formal tests for heteroskedasticity vs. homoskedasticity. Typically these work by estimating the model, obtaining the residuals ûi, and then constructing a test statistic.
• Intuition: we can't analyse the variance of the unobserved disturbance u directly, but we can analyse the residuals instead.
• But sometimes we can tell what is going on just by doing a scatterplot of the residuals vs. the explanatory variable(s) X. The pattern may be obvious.
• Let's do this with the auto dataset. The Stata commands are:
regress mpg weight
predict mpg_resid, resid
scatter mpg_resid weight
Confidence interval
the range of values within which a population parameter is estimated to lie.
Recall that a 95% confidence interval is, equivalently:
• The set of points that cannot be rejected at the 5% significance level.
• A set-valued function of the data (an interval that is a function of the data) that contains the true parameter value 95% of the time in repeated samples.
Because the t-statistic for β1 is N(0,1) in large samples, construction of a 95% confidence interval for β1 is just like the case of the sample mean: 95% confidence interval for β1 = {β̂1 ± 1.96 SE(β̂1)}.
Given a sample from a population, the CI indicates a range in which the population parameter is believed to lie, usually expressed as a 95% CI with lower and upper boundaries.
• A 95% confidence interval for μY is an interval that contains the true value of μY in 95% of repeated samples. 95% is traditional but arbitrary; sometimes you see 99%.
• Confidence intervals are usually much more useful than hypothesis tests, p-values, "statistical significance", etc. Why? They tell us right away about the likely magnitude of the true μY, and they tell us right away how precise our estimate is. A narrow CI is precise; a wide CI is imprecise.
• Digression: what is random here? The values of Y1, ..., Yn and thus any functions of them. The confidence interval is a function of the data, so it is random: it will differ from one sample to the next. The population parameter, μY, is not random; we just don't know it.
Confidence interval (and how to construct one)
• Confidence intervals for a single coefficient in multiple regression follow the same logic and recipe as for the slope coefficient in a single-regressor model.
• [β̂1 - E(β̂1)] / sqrt[var(β̂1)] is approximately distributed N(0,1) (CLT).
• Thus hypotheses on β1 can be tested using the usual t-statistic, and confidence intervals are constructed as {β̂1 ± 1.96 SE(β̂1)}.
• So too for β2, ..., βk.
Population regression model with coefficients β1,β2, ..., βk and intercept β0
• Consider the case of two regressors: Yi = β0 + β1X1i + β2X2i + ui, i = 1, ..., n.
• Y is the dependent variable; X1, X2 are the two independent variables (regressors); (Yi, X1i, X2i) denote the ith observation on Y, X1, and X2.
• β0 = population intercept
• β1 = effect on Y of a change in X1, holding X2 constant
• β2 = effect on Y of a change in X2, holding X1 constant
• ui = the regression error (omitted factors)
Consider the difference in the expected value of Y for two values of X1 holding X2 constant:
Population regression line when X1 = X1,0: Y = β0 + β1X1,0 + β2X2
Population regression line when X1 = X1,0 + ΔX1: Y + ΔY = β0 + β1(X1,0 + ΔX1) + β2X2
Subtracting: ΔY = β1ΔX1, so β1 = ΔY/ΔX1 holding X2 constant.
Actual = predicted + residual: Yi = Ŷi + ûi.
RMSE = standard deviation of ûi (with or without dof adjustment):
Flavour 1 (dof adjustment): SER = sqrt[ (1/(n - k - 1)) Σi ûi² ]
Flavour 2 (no dof adjustment): RMSE = sqrt[ (1/n) Σi ûi² ]
R² = fraction of the variance of Y explained by the regression.
R̄² = "adjusted R²" = R² with a degrees-of-freedom correction that adjusts for estimation uncertainty; R̄² < R². (17)
E(Ῡ ) vs E(Yi)
• E(Ῡ) is the expected value - the true population mean - of the new random variable Ῡ (the sample mean, average mileage).
• μY = E(Yi) is the expected value - the true population mean - of the original random variable Yi (mileage for a single car i).
• The same distinction applies to var(Ῡ) vs. var(Yi), and to the corresponding standard deviations.
First LSA (least-squares assumption) for prediction (reminder)
• For prediction, it is crucial that the data for which we will be making the prediction are similar to the data used to estimate the model. The first least squares assumption for prediction: (XOOS, YOOS) are drawn from the same distribution as the estimation sample, (Xi, Yi), i = 1, ..., n.
• NB: When we did estimation and testing of the βs etc. in the classical ("no-Big-Data") setting, we also needed LSA #2 (the sample dataset is i.i.d.) and LSA #3 ("outliers are rare"). For the setting in this chapter it is possible to relax these assumptions.
Homoskedasticity-only SEs - Also called"traditional SEs","classical SEs"
• It is not a problem for OLS, as long as you use "robust SEs" ("heteroskedastic-robust", "heteroskedastic-consistent", etc.).
• This is why the "robust" approach is popular, and why it's called "robust": our estimates, tests, and confidence intervals all work whether or not the disturbance is heteroskedastic or homoskedastic. They are "robust to the presence of heteroskedasticity". But...
• OLS is not the "best" estimator available if the disturbance is heteroskedastic. "Best" here means "has the lowest variance among estimators that are linear in Y".
• If you are willing to make more assumptions, you can use a "better" estimator, namely GLS - Generalised Least Squares. Advanced topic, might cover it at the end of the course.
• But even if the errors are heteroskedastic, OLS is still perfectly useable, and doesn't require any extra assumptions.
Two different consequences of homoskedastic disturbances u:
1. OLS becomes the "best" estimator.
2. Another formula for SEs becomes available.
First point: OLS as the "best" estimator:
• It can be shown [sic] that OLS has the lowest variance among estimators that are linear in Y.
• In this sense, OLS is the "best" estimator to use if the disturbance is homoskedastic.
• Sometimes we say OLS is "BLUE" = Best Linear Unbiased Estimator.
• This is a result called the Gauss-Markov Theorem. Will cover at the end of the course. (17)