BA with RRRRR
Standard significance levels
0.1, 0.05, 0.01, 0.001
using R to adjust for heteroscedasticity
1. White estimator for robust standard errors (see the sketch below) 2. "tiny g and tiny w"?? (unclear class shorthand, possibly weighted least squares -- see the weighted least squares card later in this set)
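A minimal sketch of the White estimator in R, assuming the sandwich and lmtest packages are installed; the data are simulated so the error variance visibly grows with x:

    library(sandwich)  # vcovHC(): heteroscedasticity-consistent covariance matrices
    library(lmtest)    # coeftest(): re-test coefficients with a chosen covariance

    set.seed(1)
    x <- runif(200)
    y <- 1 + 2 * x + rnorm(200, sd = 0.5 + 2 * x)  # error variance grows with x

    fit <- lm(y ~ x)
    coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # HC0 is White's original estimator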
R linear model output order of terms
1. Estimate 2. Std. Error 3. t value 4. Pr(>|t|), i.e. the p-value
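A quick way to see those four columns, using R's built-in mtcars data:

    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)$coefficients
    # columns, in order: Estimate, Std. Error, t value, Pr(>|t|)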
model selection
1. R^2 -- always goes up when you add variables, so it can't compare models 2. adjusted R^2 -- penalizes added variables, but still a weak criterion 3. t-testing each variable -- compounds Type I error across tests 4. AIC or BIC -- the best ways to select your model (see the sketch below)
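A sketch of the AIC/BIC comparison in R on built-in data; lower values win (see the "lower is better" card later in this set):

    m1 <- lm(mpg ~ wt, data = mtcars)
    m2 <- lm(mpg ~ wt + hp, data = mtcars)
    AIC(m1, m2)  # lower is better
    BIC(m1, m2)  # BIC penalizes extra parameters more heavily than AIC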
where does endogeneity come from?
1. sample selection bias 2. omitted variable bias 3. measurement error bias 4. reverse regression bias 5. simultaneity bias
Covariance
A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship
Correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other. Between -1 and 1
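Both are one-liners in R, and the correlation is just the covariance rescaled by the standard deviations:

    x <- mtcars$wt
    y <- mtcars$mpg
    cov(x, y)                    # sign gives the direction of the linear association
    cor(x, y)                    # always between -1 and 1
    cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)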
normal random variable
A random variable whose probability distribution follows the bell-shaped (Gaussian) curve; the standard normal is the special case with mean 0 and standard deviation 1
heteroscedasticity
A regression in which the variance of y is not constant across the values of x -- very problematic because it invalidates the standard errors, and with them the hypothesis test results and confidence intervals
interaction variable
a variable z whose different values are associated with differences in the pattern and/or strength of the relationship between x and y. Shows the differences (if any) between the slopes of the regression of y on x at different levels of the z variable
independence
a critical assumption made in statistics all the time -- assume that rows (observations) are independent of each other. This lets us factor joint probabilities into products of individual probabilities
r squared
a measure of the strength of a linear relationship; the coefficient of determination
endogeneity
a correlation between the error term and an independent variable; a violation of the core OLS assumptions
joint hypothesis
testing two or more restrictions at once (e.g., B1=0 and B2=0); use ANOVA to calculate the F statistic and find the p-value to decide whether or not to reject the null
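A minimal sketch in R: fit the restricted and full models and let anova() compute the F statistic (the variable names are just illustrative):

    full       <- lm(mpg ~ wt + hp + qsec, data = mtcars)
    restricted <- lm(mpg ~ qsec, data = mtcars)  # imposes B_wt = 0 and B_hp = 0
    anova(restricted, full)                      # F statistic and p-value for the joint test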
sample selection bias
the bias introduced when data availability leads to certain observations being excluded from the analysis. Can be reduced by adding other variables
Beta 0
the intercept; B0 = y-bar - B1 * x-bar
conditional probability
the probability that one event happens given that another event is already known to have happened
confidence intervals
the range on either side of an estimate that is likely to contain the true value for the whole population; invert the t-statistic: estimate +/- t* x SE
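In R, confint() does the inversion for you; the by-hand version below shows the estimate +/- t* x SE arithmetic:

    fit <- lm(mpg ~ wt, data = mtcars)
    confint(fit, level = 0.95)  # built-in intervals

    # by hand for the wt slope: estimate +/- t* x SE
    est <- coef(summary(fit))["wt", "Estimate"]
    se  <- coef(summary(fit))["wt", "Std. Error"]
    est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se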
how to show interaction variables in R?
variable1:variable2 (use variable1*variable2 to also include the main effects)
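A sketch of the two formula syntaxes on built-in data:

    lm(mpg ~ wt:hp, data = mtcars)   # interaction term only
    lm(mpg ~ wt * hp, data = mtcars) # expands to wt + hp + wt:hp (usually what you want)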
joint distributions
ways to describe potential relationships when you have multiple random variables
what are you doing when you log a rate?
you're looking at a growth rate
logs in linear model
tells us that we're interpreting a percentage change
BIC
Bayesian information criterion; BIC = k*ln(n) - 2*ln(L), where k = number of estimated parameters, n = number of observations, and L = the maximized likelihood
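A quick numeric check of the formula against R's built-ins; note that k counts every estimated parameter, including the error variance:

    fit <- lm(mpg ~ wt + hp, data = mtcars)
    n  <- nobs(fit)
    k  <- length(coef(fit)) + 1  # +1 for the error variance
    ll <- as.numeric(logLik(fit))
    k * log(n) - 2 * ll          # matches BIC(fit)
    2 * k - 2 * ll               # matches AIC(fit) (see the AIC card below)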
simultaneity bias
Bias in an estimate of a causal effect due to reverse causation (x impacts y but y also impacts x)
reverse regression bias
Cannot simply reverse a linear regression; example from class: bwght = B0 + B1*cigs is not equivalent to cigs = B0 + B1*bwght, because in the reversed equation the regressor (bwght) is correlated with the error term
complicated hypothesis
Example: B1=B3; use algebra to rearrange the linear model: define theta = B1 - B3, substitute B1 = theta + B3 into the model, and test whether theta = 0
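The car package can test the restriction directly, which is a handy check on the theta rearrangement; this assumes car is installed, and the variable names are illustrative:

    library(car)  # linearHypothesis(): tests linear restrictions on coefficients
    fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
    linearHypothesis(fit, "wt = qsec")  # H0: B_wt = B_qsec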
as we get more data, what happens to the variance
It decreases, so a larger data set provides more precise estimates than a smaller one
is there a test for endogeneity?
No -- you have to know your data and think about room for bias
Beta 1
B1 = Sxy/Sx^2 (the sample covariance of x and y divided by the sample variance of x); the coefficient on the independent variable that shows the relationship between the independent and dependent variables
omitted variable bias
The bias that arises in the OLS estimators when a relevant variable is omitted from the regression. Can reduce omitted variable bias by adding relevant variables
expected value
The mean of a probability distribution.
central limit theorem
The theorem that, as sample size increases, the distribution of sample means of size n, randomly selected, approaches a normal distribution. The CLT lets us treat sample means from an unknown distribution as approximately normally distributed
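A tiny simulation makes the point: sample means of a skewed distribution still pile up in a bell shape:

    set.seed(1)
    # 5000 sample means, each from n = 50 draws of a skewed (exponential) distribution
    means <- replicate(5000, mean(rexp(50, rate = 1)))
    hist(means, breaks = 40)  # roughly bell-shaped despite the skewed source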
white estimator for robustness
adjusts the variance estimate to account for heteroscedasticity -- the White estimator is valid on both heteroscedastic and homoscedastic data
AIC
Akaike information criterion; AIC = 2k - 2*ln(L), where k = number of estimated parameters and L = the maximized likelihood
likelihood
another way to find the best fit; essentially the joint PDF of the observed data viewed as a function of the parameters, which helps us estimate where mu and the population standard deviation are
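A minimal sketch of maximizing a normal log-likelihood numerically in R; parameterizing sigma on the log scale keeps it positive during the search:

    set.seed(1)
    x <- rnorm(100, mean = 5, sd = 2)

    # negative log-likelihood of a normal model in (mu, log(sigma))
    nll <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))

    fit <- optim(c(0, 0), nll)
    c(mu = fit$par[1], sigma = exp(fit$par[2]))  # close to mean(x) and sd(x)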
uniform random variable
a random variable whose density is constant over an interval, so every value in that interval is equally likely
weighted least squares method
assigns less weight to observations with larger error variance, such as those from smaller samples; in R ??? good luck with that (but see the sketch below)
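It's actually one argument to lm(); a minimal sketch on simulated data where the error standard deviation grows with x:

    set.seed(1)
    x <- runif(200, 1, 10)
    y <- 1 + 2 * x + rnorm(200, sd = x)  # noisier observations at larger x
    lm(y ~ x, weights = 1 / x^2)         # weight each row by 1 / its error variance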
are lower values better or worse in AIC and BIC when selecting model?
better
if we know there is exogeneity, what is bias?
bias(B1) = 0 -- the estimator is unbiased
bias
a systematic difference between an estimator's expected value and the true parameter; arises from the breaking of fundamental assumptions
when log is on the right of the linear model...
divide the coefficient by 100: a 1% increase in x is associated with a B1/100 unit change in y (level-log model)
when there are logs on both sides of a linear model
do nothing: the coefficient is an elasticity -- a 1% change in x is associated with a B1% change in y (log-log model)
quadratic models
include a squared variable in the linear model to capture curvature such as diminishing marginal returns
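A sketch on simulated data; I() keeps ^ from being read as formula syntax:

    set.seed(1)
    x <- runif(200, 0, 10)
    y <- 5 + 3 * x - 0.2 * x^2 + rnorm(200)  # built-in diminishing returns
    fit <- lm(y ~ x + I(x^2))
    coef(fit)  # negative squared term = concave, flattening relationship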
iid
independent and identically distributed -- all values are independent of other values and all values are from the same distribution
t statistic
indicates the distance of a sample mean from a population mean in terms of the estimated standard error; it asks whether the result plausibly came from the null distribution or from somewhere far out in its tails
Probability
likelihood that a particular event will occur
when log is on the left of the linear model...
multiply the coefficient by 100: a one-unit increase in x is associated with roughly a 100*B1 percent change in y (log-level model)
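A sketch pulling the three log cards together on simulated data; the interpretations in the comments are the point:

    set.seed(1)
    x <- runif(200, 1, 100)
    y <- exp(4 + 0.5 * log(x) + rnorm(200, sd = 0.1))

    coef(lm(log(y) ~ log(x)))[2]  # log-log: do nothing -- 1% change in x gives ~B1% change in y
    coef(lm(log(y) ~ x))[2]       # log-level: multiply by 100 -- % change in y per unit of x
    coef(lm(y ~ log(x)))[2]       # level-log: divide by 100 -- unit change in y per 1% change in x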
measurement error bias
occurs when we have inaccurate data due to a faulty or inappropriate measuring tool (example: mothers self-reporting the number of cigarettes smoked per day while pregnant)
OLS
ordinary least squares -- the standard linear model, which finds the best-fit line by minimizing the sum of squared errors
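The slope and intercept cards earlier in this set can be verified by hand in a few lines:

    x <- mtcars$wt
    y <- mtcars$mpg
    b1 <- cov(x, y) / var(x)      # B1 = Sxy / Sx^2
    b0 <- mean(y) - b1 * mean(x)  # B0 = y-bar - B1 * x-bar
    c(b0, b1)
    coef(lm(y ~ x))               # lm() reproduces the same numbers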