GLBL 121 Final

Standard Error Interpretation for Difference in Means

"On average, the amount of difference between the sample mean difference and the population mean difference is _____."

Interpreting Indicator Variable Coefficient

"Respondents living in the mountain region experience 1.75 weeks of average duration of unemployment relative to the average of unemployment duration in the Pacific region, holding all other variables constant." -all in reference to the omitted category

Linear Relationships

-Explanatory/Independent Variable: X
-Response/Dependent Variable: Y
-A linear function explains how variable Y tends to change from one subset of the population to another, as defined by the values of X
-We must specify the model we are using: Y = alpha + beta(X)
-alpha = y-intercept of the straight line
-beta = slope of the line (how much Y changes when X changes by 1)

Am I confident there's a difference in the average of Y between the indicator variable categories?

-I can't compare the coefficients to each other, because that comparison is not what the output provides (given this output, I'd have to change the omitted category)
-I cannot say with confidence (even with a significant p-value) that there's a significant difference, because the output only gives each coefficient relative to the Pacific region
-"I don't have enough information to determine if there is a significant difference in the average of Y."

Choice of Omitted Group

-if we choose, say, females as the group against which we compare male wages, then females are our omitted/base/reference group
-the choice is arbitrary and does not affect our conclusions about the size of differences between groups

Linear Regression Model

E(Y) = alpha + beta(X)
-probabilistic model
-E(Y) denotes the mean of the conditional distribution of Y
-Regression function: mathematical function that describes how the mean of the response variable changes according to the value of X
-Linear regression: uses a straight line to describe the relationship between the average of Y and X
-Least squares provides the sample prediction equation (Y hat = a + bX) --> estimates the mean of Y for all observations in the population having that value of X
-alpha and beta are the regression coefficients

Multiple Regression

Estimate the relationship between X and Y, holding constant other factors that may be responsible for the observed association between X and Y

Pearson Correlation

Measure of association for quantitative variables; the standardized slope of the linear regression equation
-does not depend on units of measurement
-adjusts the slope b for the fact that the marginal distributions of X and Y have SDs that depend on the units of X and Y
-the correlation is the value the slope would take if the measurement units of X and Y were such that their SDs were the same
-r = b(s sub x / s sub y)
--> s sub x = sqrt(sum((x - xbar)^2)/(n-1))
--> s sub y = sqrt(sum((y - ybar)^2)/(n-1))
-multiplying the slope by the ratio of standard deviations provides a standardized measure (with equal sample dispersions, the correlation equals the slope)
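
Below is a minimal sketch of this standardization in Python, using made-up data (all numbers are hypothetical, for illustration only):

```python
import math

# Hypothetical sample data (made-up numbers for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations: s = sqrt(sum((v - vbar)^2)/(n - 1))
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Least squares slope: b = sum((x - xbar)(y - ybar))/sum((x - xbar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)

# Standardizing the slope by the ratio of SDs gives the correlation.
r = b * (s_x / s_y)
print(f"slope b = {b:.3f}, correlation r = {r:.3f}")
```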

Least Squares Prediction: Formal Terms

Population equation for a linear model: Y = alpha + beta(X)
Prediction for a given value of X, based on a sample: Y hat = a + b(X)
-Y hat is the predicted value of Y
Equation for a given observation: Yi = a + b(Xi) + ei
-Yi and Xi are the observed values of Y and X; ei is the error in observation Yi
-ei = Yi - Yi hat

Least Squares Prediction: More Terms

Prediction Equation: Y hat = a + bX
-predicts the response variable Y in relation to X
-the best prediction equation is the one with the smallest sum of squared errors
-also called the least squares line
Residuals: prediction errors
-for a given observation, the response variable equals Y and the predicted value equals Y hat
-the difference between Y and Y hat is the residual
-we minimize the residuals on average
Sum(Yi - Yi hat)^2: Sum of Squared Errors (SSE)
-residual sum of squares
-squared errors in the response variable left over after you control for variation due to the explanatory variable
Least Squares Estimates: b and a
-values determining the prediction equation that minimizes SSE
Method of Least Squares: method to calculate b and a such that SSE is minimized
Outlier: atypical X or Y score, large residual
-biases the estimate of the slope
-increases the possibility of type I error
-Y scores that are outliers are especially troublesome if they're associated with extreme values of X
-an observation is an influential outlier if removing it causes a large change in the prediction equation
-extreme values/outliers should probably be excluded

Total Sum of Squares (TSS)

Summarizes the total variation in Y.
-sum of squared differences between Y and Y bar
-the total squared error from predicting Y using the mean

Sum of Squared Errors (SSE)

Summarizes all the variation not explained by the model.
-sum of squared differences between Y and Y hat

Independent Sample

The choice of subjects for one sample doesn't depend on which subjects are in the other sample.
-e.g. comparisons of groups made by dividing large samples into sub-samples (by class, race, etc.)
-when the full sample is random, the sub-samples are independent random samples from the sub-populations

Beta

The slope of a relationship
-beta > 0: the relationship between X and Y is positive
-beta < 0: the relationship between X and Y is negative
-beta = 0: no relationship between X and Y (Y does not change in relation to X); the variables are independent
-beta is a population parameter
-how much Y changes due to a one-unit increase in X

Mean Square Error

The ratio of a sum of squares to its degrees of freedom (SS/df).

Multiple Regression Results Interpretations (Bivariate to Multiple)

-"For all the variation we see in Y, the linear regression only accounts for ____% of it." -"There is a negative/positive relationship between X and average of Y because we see that a one unit increase in X is associated with a ____ unit dec/inc in average Y." -make sure to look at p-value to see confidence to reject the null that there is no significant relationship

Comparing 2 Proportions for Large Samples: Confidence Intervals

-Again, compare two population proportions by using the difference between them.
-If n1 and n2 are relatively large, then the estimator has a sampling distribution that is approximately normal about π2 - π1
-The sample is large enough to use this method if, for each sample, more than 5 observations fall in the category for which the proportion is estimated (and more than 5 do not fall in that category)
-Sum the variances of the sampling distributions of the separate proportions to get the variance of the sampling distribution for the difference of two proportions
-Can compare outcomes over time (because of the randomized control aspect)
-The standard error of the difference is bigger than either individual standard error (less precision in the confidence interval, and it is harder to reject the null, because the uncertainty from both samples combines)
-c.i.: (π hat 2 - π hat 1) +/- z(standard error)
Say the 95% ci is (-0.26, -0.16):
-"We are 95% confident that the difference in population proportions falls between -0.26 and -0.16. Group 1 is larger than group 2; that is, the level of agreement went down over time (the interval is negative). I am 95% confident that the difference in level of agreement between 1986 and 2016 is between -0.26 and -0.16."
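
A short sketch of this interval in Python, with hypothetical counts chosen only to land near the (-0.26, -0.16) example:

```python
import math

# Made-up counts for illustration: "agree" responses in two surveys.
x1, n1 = 520, 1000   # group 1 (e.g. 1986)
x2, n2 = 310, 1000   # group 2 (e.g. 2016)

p1, p2 = x1 / n1, x2 / n2

# Variances of the separate sampling distributions add for the difference.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = 1.96  # 95% confidence
diff = p2 - p1
lo, hi = diff - z * se, diff + z * se
print(f"95% CI for pi2 - pi1: ({lo:.3f}, {hi:.3f})")
```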

If the samples have the same sample SD, but one sample is larger, are we more confident in that sample?

-Because we add the variances of the two sampling distributions together, we need to standardize the difference.
-We incorporate sample size and variability in the standard error; as long as we meet the necessary threshold for a normal sampling distribution, it is fine to compare groups of different sizes.
-The variability in both groups is reflected in the overall standard error.

Proportional Reduction in Error (PRE) Measure

-Error 1: errors in predicting Y without using X --> Rule 1: the best predictor is the sample mean (the one value closest to all observations on Y) --> Y - Y bar
-Error 2: errors in predicting Y with X --> Rule 2: the best predictor of Y is the prediction equation (Y hat = a + bX), when the relationship between X and Y is linear --> Y - Y hat
-PRE = (E1 - E2)/E1
-The strength of association between an explanatory variable X and a response variable Y is judged by the goodness of X as a predictor of Y
-If you can better predict Y by substituting X values into the prediction equation, the variables are strongly related

Gossett's Catastrophe

-Gossett was referencing Table A (the standard normal distribution)
-With a small sample, the sample will not be representative of the true population --> the distribution will be a flatter curve, because it isn't that unusual to get something at the extreme ends of the curve
-Within a distribution of samples, the estimated variance and standard deviation will vary (the gap between the sample and population standard deviations has implications for the difference between the sample mean and the population mean)
-NOT caused by non-normal sampling distributions (it occurs even when the underlying population is normal) and NOT because we have a biased estimator of the population variance
-The problem is that bias only tells us the average of the estimator, and our decision rules make our tests very sensitive to the full distributions

Causality and Multiple Regression

-It can be difficult to isolate the causal effect using observed data due to additional factors that impact the dependent variable
-Some other variables could be responsible for the observed association between X and Y
-multiple regression lets us control for these other variables

Measures of Fit for Multiple Regression

-SER = std. dev. of the errors (with df correction)
-RMSE = std. dev. of the errors (without df correction)
-the original RMSE doesn't account for the X's, so use the SER to adjust for the number of regressors
-same for R^2 (note: adjusted r squared is less than r squared because it adjusts for estimation uncertainty)

How do we know if we can assume that the population distribution is normal?

-We can never really know, so we can only guess and use robust methods.
-It's impossible to actually verify the shape of the population because we simply can't collect all the data; our sample data is our best rough guess (which makes a one-tailed test more problematic, since we are really looking at just one tail, and the opposite tail could look different)
-However, given that we are working with small samples, this rough guess can be quite imprecise.
-The saving grace, if the violation is minor, is using a two-tailed methodology (we may be over in one tail and under in the other) --> in aggregate, you get the appropriate p-value by looking at both tails --> note: MINOR violations

Why can't we just eyeball the difference in means?

-We need to contextualize how confident we are in our samples and in the difference in means, since we don't have sufficient info about the population.
-Because we only have sample information, we can't just trust a direct comparison; we need to contextualize the information with the standard error
-If we are incredibly confident in our sample because of a small standard error, then we are more confident in the significance of our difference in means

Why do standard errors matter?

-With an incorrect standard error, we are putting false information into how confident we are that our sample statistic actually measures the population parameter
-THE STANDARD ERRORS ARE BIASED WHEN HETEROSKEDASTICITY IS PRESENT --> BIAS IN TEST STATISTICS AND CONFIDENCE INTERVALS
-Problem for the slope coefficient: we need the standard error to determine how precise an estimate of the population parameter the sample statistic is (is it a biased estimator?); it does not change the coefficient, but it changes how confident we are in these estimates
-Confidence interval: the width of the confidence interval would differ from what it would be with a proper standard error, leading us to wrong estimates of where the population parameter should fall
-Significance test: a wrong standard error can also bias the confidence we place in our coefficient (especially in extreme cases of heteroskedasticity)
TAKEAWAY:
-standard errors, t-statistics, and confidence intervals will be wrong
-typically, homoskedastic standard errors are too small (the homoskedasticity-only estimator of the variance of beta is inconsistent if there is heteroskedasticity) --> it sees less variance in the conditional distributions than there actually is --> we'd be overestimating our confidence in our sample statistic and would have a falsely narrower ci

What is the critique if we find a difference to be statistically significant and we are pretty confident?

-Yeah, but is this result universally observed, or conditional? Why is there a difference (what factors)?
-While this is the answer in aggregate, the answer could be conditional; the result could change with consideration of other factors. This is why we need regression analysis.
-We use regression to take in all these factors. We first must define the mathematical equation to describe the form and assume a causal relationship. Our payoff is that we can discern whether the relationship exists, how strong it is, and in what direction.
(1) Does an association exist?
(2) If an association exists, what is the strength of the association?
(3) What is the form of the relationship?

Properties of t distributions

-bell-shaped and symmetric about a mean of 0 (like the standard normal)
-standard deviation is a bit larger than 1 (the precise value depends on the degrees of freedom)
-thicker tails and more spread out than the standard normal --> because with smaller samples and more erratic statistics, you are more likely to observe values that land in the extreme ends
-the larger the df value, the more closely the t distribution resembles the standard normal (the t distribution narrows because of the increasing precision of the sample standard deviation as a point estimate)

Confidence Interval for a Slope

-ci = b +/- t(estim. standard error of sample slope)
-"We are 95% confident that the real relationship (true population slope) between X and Y is between ____ and ____. This means that we are 95% confident that there is at least a ____ unit inc/dec and at most a ____ unit inc/dec in the average of Y due to a one-unit increase in X."
-If zero is included in the interval: "Since zero is included in this interval (along with positive and negative numbers), we can't be certain whether there is a negative, positive, or any relation between X and Y."

Why does it matter what line we choose?

-could influence future predictions
-the linear function summarizes the relationship (defines the relationship's strength, direction, etc.)
-we need to find the function that best summarizes the points to get the story right (best captures the entirety of the whole relationship)
-least squares prediction allows us to find the line we are looking for

Large Sample Confidence Interval for Difference in Means

-find the standard error: sqrt(s1^2/n1 + s2^2/n2)
-find the appropriate z score
-confidence interval: (difference in sample means) +/- z(standard error)
-"We are 99% confident that the population mean difference is in this interval; that is, the difference in mean prison terms between Hispanics and whites is between -0.48 months and 6.08 months, with 99% confidence. We are 99% confident that the population mean prison term is between 0.48 months less and 6.08 months more for Hispanics than for whites. This interval contains zero, which means that one plausible outcome is that the population mean prison term for Hispanics equals that of whites; we cannot rule out any possibility--Hispanics may serve mean prison terms less than, equal to, or greater than whites."
-OR, say the whole interval is positive: "We are 99% confident that women work more hours than men (group 2 larger than group 1) in terms of average work hours, working 13.7 to 15.3 hours more than men."
-NOTE: If zero (no difference) is included in the interval, that is a potential outcome. If negative and positive values are included, anything is possible.
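
A quick sketch in Python with assumed summary statistics (made-up values that roughly echo the prison-term example):

```python
import math

# Hypothetical summary statistics for two independent large samples.
ybar1, s1, n1 = 12.8, 10.5, 150   # group 1 (e.g. whites)
ybar2, s2, n2 = 15.6, 11.2, 130   # group 2 (e.g. Hispanics)

# Standard error of the difference: the variances of the two means add.
se = math.sqrt(s1**2 / n1 + s2**2 / n2)

z = 2.576  # 99% confidence
diff = ybar2 - ybar1
lo, hi = diff - z * se, diff + z * se
print(f"99% CI for mu2 - mu1: ({lo:.2f}, {hi:.2f})")
```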

Properties of R Squared

-measures the strength of association between X and Y (the larger r squared, the stronger the association)
-falls between 0 and 1
-r squared = 1 when SSE = 0 (all sample points fall exactly on the prediction line; no prediction error in using X to explain Y)
-r squared = 0 when TSS = SSE (no linear association between X and Y)
-does not depend on units of measurement
-takes the same value when X predicts Y as when Y predicts X

Why do we not use t-distribution for small sample proportion test?

-no "extreme" in this case --> either you are one category or the other -the logic of the t distribution doesn't apply (there is no consistent sampling distribution to leverage) -the sampling distribution of the population proportion is highly discrete (concentrated at a few points) --> exacerbated by a small sample (continuous approximation such as standard normal distribution is inappropriate) -closer π is to 0 or 1, the more skewed the actual sampling distribution becomes -binomial distribution and sampling distribution of estimated π are approximately normal for large n

How can adding variables to a regression change the coefficients?

-omitted variables can be correlated both with the observed X and with Y
-once we hold those factors constant, the observed relationship between X and Y changes
-bivariate regression suffers from OMITTED VARIABLE BIAS

How does the t-distribution account for small samples?

-only if the population is normally distributed
-we are already accounting for sample size in conducting a small-sample test with the t distribution --> already adjusting for the lack of confidence
-more likely to get extreme values (more area in the tails) --> harder to reject the null, wider confidence interval
-everything is relative to sample size; don't adjust the confidence level to the methodology

Comparing Two Means for Large Samples

-parameter of interest: the population mean difference
-estimate the parameter from sample data by the difference of sample means
-sampling distribution: every possible difference between sample means across all possible samples from the populations
-unbiased estimator --> the mean of the sampling distribution is the true population difference (the sampling distribution is approximately normal about the true population difference in means)
-ASSUMPTIONS: (1) random samples, (2) large samples

How are the probabilities of Type I and Type II error related?

-the probabilities are inversely related
-the smaller the alpha level, the greater beta (the probability of type II error) --> the stronger the evidence required to reject the null (smaller alpha), the greater the chance that you fail to detect a real difference (failing to reject for a difference that may have been statistically significant at other levels) --> artificially lowering alpha means that we are changing the substantive result (the z-score would have to be very extreme to reject)
-we lower the probability of type I error by lowering alpha (the chance we commit a type I error is alpha: the chance that we fall in the extreme tail ends of the sampling distribution by chance even though the proposed population parameter is true)

Randomized Control Trial

-randomly select samples of similar quality and character over all the conditional factors
-give one group the treatment --> randomization isolates that one treatment, so we attribute any observable difference to that treatment

Confidence Interval for Small Sample Test of Means

-same formula as for the large-sample confidence interval
-instead of a z score, use a t score (for a 95% confidence interval, use t sub 0.025 for the appropriate df)
-with the additional assumption that the population distribution is normal
-in practice, we don't know the population standard deviation, so we don't know the exact standard error --> substituting the sample standard deviation to get the estimated standard error introduces extra error (especially if n is small) --> which is why we use the t score
-"We are 95% confident that the true mean number of ___ for the population of ____ falls within the interval of ____ and ____."
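
A minimal sketch in Python, using a made-up small sample; the t critical value is taken from scipy (assumed available):

```python
from math import sqrt
from scipy import stats

# Hypothetical small sample (made-up values for illustration).
y = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5, 4.4, 5.1]
n = len(y)
ybar = sum(y) / n
s = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))

se = s / sqrt(n)                       # estimated standard error
t_crit = stats.t.ppf(0.975, df=n - 1)  # t sub 0.025 with n-1 df

lo, hi = ybar - t_crit * se, ybar + t_crit * se
print(f"95% CI for mu: ({lo:.2f}, {hi:.2f})")
```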

Why do we want to conduct a confidence interval for a slope?

-a small p-value for a significance test suggests that the regression line has a nonzero slope (since our null is that beta is zero)
-though this is useful info, we also want to know the size of the slope (not just that there is a relationship)
-if the absolute value of the slope is small relative to the units of measurement for X and Y, then even though the association might be statistically significant, it could be practically unimportant
-this is why it is extremely informative to construct a ci for the sample slope

Calculating Coefficient of Determination

-square the correlation coefficient, or
-(TSS - SSE)/TSS
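
A tiny worked example with assumed sums of squares:

```python
# Hypothetical sums of squares for illustration.
tss = 120.0   # total sum of squares: sum((y - ybar)^2)
sse = 42.0    # sum of squared errors: sum((y - yhat)^2)

r_squared = (tss - sse) / tss
print(f"r^2 = {r_squared:.2f}")  # 0.65: a 65% reduction in prediction error
```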

Variance of Sampling Distribution of Difference Between 2 Estimates

-sum up the variances
-sampling error is associated with each estimate, and these errors add in determining the sampling error of the difference of estimates
-standard error --> we estimate the standard error by substituting the sample standard deviations for the unknown population values

Why do we use the minimum value in the calculation where n > 10/min(π0,1-π0)?

-when the null hypothesized proportion falls between 0.3 and 0.7, the rule of n greater than or equal to 30 applies
-however, when the null hypothesized proportion falls outside of this range and gets closer to 0 or 1, a larger sample size is necessary to ensure the sampling distribution is approximately normal (which is why we take the minimum --> we want to look at the most extreme proportion)
-the sampling distribution of the proportion is more skewed when π is close to 0 or 1 than when it is near the middle of the range, especially since the proportion must fall within the bounded range of 0 to 1
-this is even more true with smaller samples, because the sampling distribution of the sample proportion becomes more discrete (clustered at very few points), since there are a limited number of possible values the sample proportion can take on (and this number decreases as n gets smaller)

Comparison of Binomial and Normal Distribution

-with a large enough n, the binomial distribution will look like the normal distribution
-when π is near 0.5, the binomial distribution tends to a normal distribution more quickly with increasing n
-with small samples, we work from the binomial distribution instead of a normal approximation --> shape of the distribution --> discrete levels of the distribution --> we do not calculate means and standard deviations (we calculate p directly)

Points to Hit When Interpreting the Pearson Coefficient

1. An increase of one standard deviation in X corresponds to a change of r standard deviations in the Y variable.
-"The murder rate for a state is expected to be higher by 0.63 standard deviations for each 1.0 standard deviation increase in poverty rate."
-the larger the absolute value of r, the stronger the association (an SD change in X leads to a greater proportion of an SD change in Y)
2. Strength of relationship
-"An r of 0.63 suggests a moderately strong linear relationship between poverty rate and murder rate for a state."
3. Is the relationship positive, negative, or zero?
-"Since our r of 0.63 is positive, we can deduce that poverty rate and murder rate are positively related. That is, an increase in poverty rate is associated with an increase in murder rate."

What are the costs of using the t distribution?

1. Assume the population distribution is normal.
-an additional assumption that is hard to evaluate
2. More difficult to reject the null.
-the t distribution has more statistics in the extreme tails (the curve is flatter because, with the standard deviation estimated from few observations, relatively extreme values are more likely) --> means are very susceptible to outliers, but with more observations the outliers have less impact (closer to the actual population distribution) --> with smaller samples, means are very susceptible to outliers, and relatively extreme values have a huge impact (less info to counterbalance)

Model Assumptions and Violations

1. Assumption of Linearity
-linear regression assumes that the relationship between X and the mean of Y follows a straight line
-if this assumption is badly violated, as with a U-shaped distribution, results and conclusions using the linear model can be misleading
2. Extrapolation beyond observed values of X is dangerous
-the relationship might not be linear past that range, and even if it is, the standard errors get wider
-be careful interpreting the y-intercept (X = 0 might lie outside the observed data or be an impossible value)
3. Influential observations and outliers may unduly influence the model
-the slope and the standard error of the slope may be affected by influential observations
-an inherent weakness of least squares regression
-may wish to evaluate 2 models: with and without the influential observation
-observations with large influence on model parameter estimates can also have a large impact on the correlation (especially with small samples)
4. Truncated samples cause the opposite problems of influential observations and outliers
-truncation on the X-axis reduces the correlation coefficient for the remaining data
-truncation on the Y-axis is a worse problem (violates the assumption of normally distributed errors)
-e.g. top-coded income data, health measured by # of days in hospital

Comparing 2 Means for Large Samples: Significance Tests

1. Assumptions:
-samples are random
-the variable of interest has a continuous scale
-sample size large enough s.t. the sampling distribution of (Ybar2 - Ybar1) is approximately normal --> n1 greater than or equal to 20, n2 greater than or equal to 20
-samples are drawn independently
2. Hypothesis: Let μ1 = pop. mean of group 1, μ2 = pop. mean of group 2
-H0: μ2 = μ1
-Ha: μ2 < μ1 or μ2 > μ1 or μ2 not equal to μ1
3. Test Statistic:
-standard error: sqrt(s1^2/n1 + s2^2/n2)
-z score: ((difference in sample means) - 0)/standard error
-The z statistic is our statistic of interest for comparing 2 means for large samples.
4. P-Value:
-"When we look up the p-value for z = 2.2, we get 0.0139. This means that if the true difference in population mean prison terms between the two groups were really zero, then 1.39% of samples constructed this way would have a z-score of 2.196 or greater by chance alone. We are performing a two-tailed test; the p-value for a two-tailed test is 2*0.0139 = 0.0278. So, if there really is no difference in population mean prison terms for Hispanics and whites, then about 3% of samples of this construction would have a difference in sample means this far or farther from the difference in population means."
5. Conclusion:
-"The p-value of 0.0305 for a two-tailed test at the 99% confidence level indicates that it could plausibly happen, if whites and Hispanics had the same mean prison terms, for this sample difference to occur. That is, our p-value of 0.0305 is greater than our statistically significant alpha level of 0.01. Thus, our observed difference of 2.8 months greater mean prison terms for Hispanics compared to whites is not statistically significant; we fail to reject our null hypothesis that the population means do not differ."
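
A sketch of steps 3 and 4 in Python, with assumed summary statistics (made-up values, not the actual data behind the quoted numbers):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical summary statistics loosely echoing the prison-term example.
ybar1, s1, n1 = 12.8, 10.5, 150   # whites
ybar2, s2, n2 = 15.6, 11.2, 130   # Hispanics

se = sqrt(s1**2 / n1 + s2**2 / n2)
z = ((ybar2 - ybar1) - 0) / se          # null: mu2 - mu1 = 0

p_two_tailed = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.4f}")
```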

Significance Test for π2-π1

1. Assumptions:
-samples are random
-sample size large enough s.t. the sampling distribution of the estimator is approximately normal
-samples are drawn independently
2. Hypothesis: Let π1 = pop. proportion of group 1, π2 = pop. proportion of group 2
-H0: π2 = π1
-Ha: π2 < π1 or π2 > π1 or π2 not equal to π1
3. Test Statistic:
-Let π hat denote the proportion of the total sample in the category of interest. This is the pooled estimate (based on the null hypothesis assumption that the population proportions are equal).
-z = (estimate - null hyp. value)/se = ((π hat 2 - π hat 1) - 0)/std. error
4. P-Value:
-"When we look up the p-value for z = 4.87, we get an extremely small p-value, close to zero. To get our two-tailed p-value, we multiply by 2, getting a p-value that is still extremely close to zero (very small). This means that if the true difference in population proportions of HIV infection between the two groups were really zero, then a close-to-zero percentage of samples would have z-scores as far or farther from zero as we got in this sample by chance alone."
5. Conclusion:
-"The p-value for this two-tailed test, close to zero, indicates that it is extremely unlikely for these sample proportion differences to occur if the true difference in population proportions of HIV infection between couples who always use condoms and couples who don't was zero. That is, our p-value is smaller than our statistically significant level of 0.05. Thus, there is extremely strong evidence to suggest that our observed difference, that couples who don't always use condoms have a population proportion of HIV infected that is 0.112 higher than couples who always use condoms, is statistically significant. We reject the null that the population proportions do not differ. There is strong evidence to suggest, from the sample proportions, that the difference takes the direction of a higher proportion for couples who don't always use condoms."
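
A sketch of the pooled calculation in Python, with hypothetical counts (not the actual study data):

```python
from math import sqrt
from scipy.stats import norm

# Made-up counts for illustration (category of interest = infected).
x1, n1 = 5, 170     # group 1: always use condoms
x2, n2 = 50, 400    # group 2: don't always use condoms

p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)  # pooled estimate under H0: pi1 = pi2

# Standard error using the pooled proportion for both groups.
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = ((p2 - p1) - 0) / se
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.6f}")
```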

Small Sample Test for Population Proportion

1. Assumptions: "We will be conducting a small-sample test for population proportion. To conduct this test, we must assume that observations are: -dichotomous (for given n, observation falls in one of two categories) -identical (probability of falling into each category is same) -outcomes of successive observations are independent." 2. Hypothesis: Let π denote the population proportion of ____. -H0: π = ___ -Ha: π does not equal ___ 3. Test Statistic: We calculate p-value using the test statistic of the binomial variable X (exact application of binomial distribution). 4. P-Value: The P-value is the sum of P(X) for every X as unlikely as measured X -the two-tailed p-value is the probability of an outcome at least this extreme -"The two-tailed p-value is ___. That is, for samples of size n, we expect _____% of samples to produce a sample population as far or farther from the hypothesized population proportion." 5. Conclusion: -REJECT NULL: "We reject the null hypothesis that μ=____. The p-value of ___ for a two-tailed test indicates that it is extremely unlikely, given a true population mean of ____, to have a sample proportion as far away from the population proportion as ___. That is, there is sufficient evidence to reject the null since ___ is less than the alpha level of ____ and the difference between the sample proportion and the proposed population proportion is statistically significant." -FAIL TO REJECT NULL: "We fail to reject the null hypothesis that μ=____. The p-value of ___ for a two-tailed test indicates that it is possible, given a true population proportion of ____, to have a sample proportion as far away from the population proportion as ___. That is, there is insufficient evidence to reject the null since ___ is more than the alpha level of ____ and the difference between the sample proportion and the proposed population proportion is not statistically significant."

Small Sample Test for Population Mean

1. Assumptions: "We will be doing a small sample test for population means. To perform this test, we must assume that... -the variable has a normal population distribution -the sample is a random sample -the variable is treated as quantitative with interval scale" 2. Hypothesis: Let μ denote the population mean income for ____. -H0: μ = ____ -Ha: μ < ____ or μ > ____ or μ not equal to ____ 3. Test Statistic: For a n of ___, we calculate the following statistics: -Y bar -s -estimated population SD (sub Y bar)=s/sqrt(n) -t score (t-statistic is the statistic of interest in a small sample test of population mean) = (y bar-null)/estimated pop. SD 4. P-value: find corresponding p value for t score -"For alpha=0.05 and df=___, we get a critical t-value of ____. That is 2.5% of the distribution lies beyond t=____ and by symmetry, 2.5% of the distribution also falls in the left tail below -t=-_____. For n=__, the probability equals 95% between t= -___ and ____. Our calculated t value of ___ lies between t=___ and t=___, so we would expect a tail probability between ___ and ____. Thus, we would expect a t score of -___ to fall/not fall within this 95% and (not) in the extreme 2.5% ends of the tails; that is, our p-value is (not) greater than 0.05. To get a two-tailed p-value, we would have to multiply our p-value by 2. 5. Conclusion: -REJECT NULL: "We reject the null hypothesis that μ=____. The p-value of ___ for a two-tailed test indicates that it is extremely unlikely, given a true population mean of ____, to have a sample mean as far away from the population mean as ___. That is, there is sufficient evidence to reject the null since ___ is less than the alpha level of ____ and the difference between the sample mean and the proposed population mean is statistically significant." -FAIL TO REJECT NULL: "We fail to reject reject the null hypothesis that μ=____. The p-value of ___ for a two-tailed test indicates that it is possible, given a true population mean of ____, to have a sample mean as far away from the population mean as ___. That is, there is insufficient evidence to reject the null since ___ is more than the alpha level of ____ and the difference between the sample mean and the proposed population mean is not statistically significant."

Significance Test for the Slope of the Linear Regression Equation

1. Assumptions: To conduct a significance test for the slope of the linear regression equation, we assume the following:
-The mean of Y is related to X by the equation E(Y) = alpha + beta(X)
-The conditional standard deviation of Y is identical at each X value
-The conditional distribution of Y at each value of X is normal
-The sample is selected randomly
2. Hypothesis: Let beta represent the linear relationship between X and Y.
-H0: beta = 0 (the variables are statistically independent)
-Ha: beta not equal to 0
3. Test Statistic:
-Our test statistic of choice is the t score.
-t = b/(estim. standard error of the sample slope)
--> NOTE: the standard error is smaller when the conditional sd of Y is small, the sample size is large, or the sd of X is large
-root MSE = sqrt(SSE/(n-2)) is the point estimate of the sd of the conditional distribution; the standard error of the sample slope divides this by sqrt(sum((x - xbar)^2))
-df for the t statistic is n-2
4. P-Value: "From the results of our significance test, we get a t-score of -2.136. With 8 degrees of freedom, a t-score of -2.136 corresponds to a one-tailed p-value that is smaller than 0.05 and greater than 0.025. That is, with an alpha of 0.05 and df = 8, we get a critical t-value of 2.306 for a two-tailed test (such that 2.5% of the distribution falls in each extreme tail). Thus, we would expect a t-score of -2.136 to fall within the middle 95% of the distribution and not in the extreme 2.5% tails; that is, it would have a two-tailed p-value greater than 0.05. We expect our t-value to have a two-tailed p-value between 0.05 and 0.1 (multiply the one-tailed p-value by 2). That is, between 5% and 10% of sample slopes would have t-scores as far or farther than -2.136 by chance."
5. Conclusion: "The p-value for this two-tailed test is just greater than our statistically significant level of alpha = 0.05. That is, it is just not unlikely, given a true population slope of 0 (no relation between X and Y), to have a sample slope as far from the population slope as -0.004. That is, there is insufficient evidence that the deviation of the sample slope from the null hypothesized population slope is significant (that our observed relationship between X and Y is significant). We fail to reject the null."
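
A sketch of the test statistic and p-value in Python, with assumed regression summaries (values chosen only to land near the t = -2.136 example):

```python
from math import sqrt
from scipy.stats import t as t_dist

# Hypothetical regression summaries for illustration.
b = -0.004        # sample slope
sse = 18.2        # sum of squared errors from the fitted line
sxx = 650_000.0   # sum((x - xbar)^2)
n = 10

root_mse = sqrt(sse / (n - 2))   # estimated conditional SD of Y
se_b = root_mse / sqrt(sxx)      # standard error of the sample slope

t_stat = (b - 0) / se_b          # null: beta = 0
p_two_tailed = 2 * (1 - t_dist.cdf(abs(t_stat), df=n - 2))
print(f"t = {t_stat:.3f}, two-tailed p = {p_two_tailed:.3f}")
```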

Options to Compare Groups (i.e. gender difference in wages)

1. Basic Comparison of Descriptive Statistics
-can give us a rough picture, but we don't know how confident we are in the difference
-have to standardize the difference using the standard error and our t/z score
-we need to know how confident we are in the statistics that we collected (how representative they are of the population)
2. Comparison of Means Test
-not very specific (looks at the sample groups as a whole)
-does not include several other important factors that could influence our variable of interest
3. Difference in Means Test using OLS regression
-power: we can isolate the relationship between two variables from the other factors that affect the observed difference in groups
-especially useful when we have multiple independent variables

How do you interpret coefficient on region (when there are multiple dummy variables)?

1. Create a 1-0 dummy variable for each category in the qualitative variable (esp. with no ranking) --> an exhaustive set of indicators for each category
2. Include all but one of the indicator variables in your regression
3. Interpreting the coefficient: "If our respondent lives in the mountain region, they will experience 1.75 fewer weeks of average unemployment duration than respondents living in the Pacific region, holding all else constant." --> mention significance
4. Interpreting the constant/y-intercept: "The average duration of unemployment when all X's are zero (when the respondent is male, has no experience, and lives in the Pacific)."

Interpretations of Coefficient of Determination

1. For predicting Y, the linear prediction equation has ____% less error than the sample mean.
2. ____% of the variation in Y is explained by its linear relationship with X.
3. My mathematical function (linear prediction equation) summarizes the data _____. There seems to be a strong/fairly strong/weak association between X and Y because the r squared value is ___. The variance of the conditional distribution of Y for a given X is ___% smaller than the variance of the marginal distribution of Y.

Model

A formula that provides a simple approximation for the true relationship between X and Y
-for a given value of X, the model predicts Y
-the better the predictions, the better the model
-the linear function is the simplest bivariate model

Robust Method

A statistical procedure that performs adequately even when an assumption is violated.
-identification and use of such methods is important, considering that in practice we are rarely able to know for sure whether assumptions are perfectly satisfied
-two-sided inferences for a mean are quite robust against violations of the assumption that the population distribution is normal --> this does not mean the test is robust to violations of other assumptions (e.g. the random sample assumption)

Response Variable

A variable that we think of as being explained or caused by the value of another variable
-also called the dependent variable, Y

Explanatory Variable

A variable that we think of as explaining or causing the value of another variable
-also called the independent variable, X

Adjusted R Squared

Adjusted r squared corrects for the inflation of r squared by penalizing you for including another regressor
-does not necessarily increase if you add another regressor
-but if n is large, the two will be close
-increases only if the new variable improves the model more than would be expected by chance
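
One standard form of the correction, shown as a small sketch (the function name is mine):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Penalize R^2 for the k regressors: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With small n the penalty bites; with large n the two are close.
print(adjusted_r_squared(0.40, n=30, k=5))    # about 0.275
print(adjusted_r_squared(0.40, n=3000, k=5))  # about 0.399
```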

Making predictions using scattered data

Ask: Is a linear model suitable? If so, how do we draw the line?
Samples may not be suitable for a linear model if:
-data points in the scatterplot suggest a nonlinear relationship
-there are several outliers
-there is evidence of heteroskedasticity (the amount of scatter of the dots depends on the x value)
Possible problems if you incorrectly use a linear model:
-estimates of beta may be biased for a particular value of x
-subsequent statistical significance tests may produce too many type I errors
Remedies if there is no linear fit:
-drop outliers
-rescale one or both variables
-use a simpler statistical analysis

Magnitude and Sign of OVB

BIAS = short coeff - long coeff = (beta2)(gamma1)
-multiply (the relationship b/w X2 and Y) by (the relationship b/w X1 and X2) --> if these two values increase, the magnitude of the bias increases
-we calculate it this way b/c sometimes you don't have X2 (someone else's data, not in the data set, inherently unobservable)
-if you have X2, you can just add it directly to the regression and observe how the coefficient on X1 changes
-if gamma1 or beta2 = 0, then there is no bias (the omitted variable is not a determinant of Y, or it is not correlated with the included variable of interest)
SIGN:
-depends on the signs of beta2 and gamma1
--> beta2 is a population parameter, so we can never be sure if it is pos/neg (and we don't have X2, so we can't estimate it)
--> since X2 is not observed, we can't be sure of the sign of the correlation between X1 and X2
--> but we can make educated guesses
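
A tiny numeric illustration with hypothetical values (the variable roles are assumed for the example):

```python
# Hypothetical wage regression omitting ability (all numbers made up).
beta2 = 0.08    # effect of omitted X2 (ability) on Y in the long regression
gamma1 = 0.50   # slope from regressing X2 on the included X1 (education)

bias = beta2 * gamma1   # short coefficient minus long coefficient
print(f"OVB = {bias:+.3f}")  # positive: the short regression overstates

# Sign logic: the bias is positive when beta2 and gamma1 share a sign,
# negative when they differ, and zero if either is zero.
```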

Why do we prefer the beta (coefficient in multivariate regression) to alpha (coefficient in bivariate regression)?

Beta is closer to the value of the relationship. That is, it accounts for more variables so it is more representative of the isolated effect of the independent variable we wish to examine.

Interpretation of Multiple Regression Equation (Slope Coefficients and Y-Intercept)

COEFFICIENTS:
-"According to the output, our coefficient on age is 0.135. That is, holding all other independent variables constant, a one-year increase in age is associated with a 0.135 increase in the average of hourly wage. Namely, controlling for union status, we see this correlation between age and hourly wage."
Y-INTERCEPT:
-"When X = 0 across all independent variables, the average of Y is _____."

Variation about the Regression Line

For each fixed value of X, there is a conditional distribution of Y values
-the linear regression model has an additional parameter, sigma, describing the standard deviation of the conditional distribution (conditional sd)
-the conditional sd measures the variability of Y for all observations with the same x value
Estimate of sigma: sqrt(SSE/(n-2)) = sqrt(sum((y - yhat)^2)/(n-2))
-based on the assumption that the conditional sd is identical for all values of X (homoskedasticity)
-you use n-2 because two parameters (a and b) are estimated from the data, costing 2 degrees of freedom
In practice, conditional distributions probably won't have exactly the same SD or be perfectly normal

Why do we use a pooled estimate?

Given that I'm assuming equality of the proportions under the null, I have to use a single estimate in my calculation of the standard error. We can't use the separate sample proportions because we assume the null is true, so we are calculating the probability under that condition. We need a single value, so we choose the pooled estimate for our standard error calculation (it provides a higher-precision estimate under our assumption). We can think of the pooled estimate as the weighted average of our sample statistics.

Least Squares Prediction: Equations for Slope and Y Intercept

Goal for a given sample: estimate b and a such that the sum of the squared errors in the observations is as small as possible
-b = sum((x - xbar)(y - ybar))/sum((x - xbar)^2)
-a = ybar - b(xbar)
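
A minimal sketch applying these two formulas to made-up data, with a check that the result beats a perturbed line on SSE:

```python
# Hypothetical data (made-up numbers for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# b = sum((x - xbar)(y - ybar))/sum((x - xbar)^2); a = ybar - b(xbar)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
print(f"prediction equation: yhat = {a:.3f} + {b:.3f}x")

# Any other line should have a larger sum of squared errors.
def sse(a_, b_):
    return sum((yi - (a_ + b_ * xi)) ** 2 for xi, yi in zip(x, y))

assert sse(a, b) <= sse(a + 0.1, b - 0.05)
```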

Deterministic vs. Probabilistic Model

It is hard to say that if I increase X by 1, every observation of Y increases by a certain amount (there is some variation).
Deterministic Model: every value of X corresponds to a single value of Y
-doesn't account for variability in Y values for observations with the same X value
-there is a determined Y for every X --> all points fall on the line
Probabilistic Model: allows for variability in the values of Y at each value of X
Conditional Distribution: the distribution of Y values at each fixed value of X
-the probability distribution of Y for individual observations at each X
-the prediction equation gives us the expected value of Y, the mean of Y
To go from deterministic to probabilistic, we must assume:
-the distribution of Y at each X (the conditional distribution) is normal --> the majority of points are close to the line
-the standard deviation of the conditional distributions is equal across all X's --> the level of spread is uniform across conditional distributions (homoskedastic)

Heteroskedasticity vs. Homoskedasticity

Homoskedasticity: the variance of the residuals (error terms) around the regression line is the same for all values of X.
Heteroskedasticity: the variance of the residuals (error terms) around the regression line is not the same for all values of X. That is, the variance of the residuals depends on the value of X.
-NOTE: there is a heteroskedasticity-robust formula for the standard error of the sample slope on the formula sheet.

Measuring Linear Variation--Does the slope provide information on the magnitude or strength of the given relationship?

If the magnitude of the slope gets bigger, does that mean the relationship is stronger?
-the slope of the prediction equation depends on the units of measurement
-it doesn't directly indicate whether the relation is strong or weak (since we can make b as large or small as we like through the choice of units) --> we need to standardize the slope
-the slope can't really tell us about the strength of the relationship, only the direction of the association

Interpretation of Pearson Coefficient

If r = 0, then there is no relationship between X and Y; X and Y are independent. If r is positive, then the relationship is positive (as X increases, Y increases). If r is negative, then the relationship is negative (as X increases, Y decreases). The correlation has the same sign as b. Since r is proportional to the slope of the linear association, it measures the strength of the association between X and Y. It also treats X and Y symmetrically, which means it doesn't say anything about causal direction. r is bounded between -1 and 1 (the closer to these ends, the stronger the relationship).
-0 to 0.3: weak
-0.3 to 0.6: moderately strong
-0.6 to 1: strong

Ideal Situation v. Reality

In reality, all data points don't fall on a single straight line.
-there are several lines, with different slopes and y-intercepts, that could fit the data in a plot

Interpreting OVB

NEGATIVE BIAS:
-estimate is too far left (below the true value)
-include the omitted variable --> the estimate moves right (the coefficient on the observed variable increases)
-understating the true effect of our observed independent variable on the dependent variable
POSITIVE BIAS:
-estimate is too far right (above the true value)
-include the omitted variable --> the estimate moves left (the coefficient on the observed variable decreases)
-overstating the true effect of our observed independent variable on the dependent variable
HOW MUCH DOES IT MOVE BACK?:
-if both relationships are very weak, the bias is essentially zero (exclusion of the factor is unimportant to the substantive results)
-if both are important, the bias is huge
-the relationship can be strong, moderate, or weak
-"experiencing moderately strong negative bias, overstating the negative effect"

Perfect Multicollinearity

One of the regressors is an exact linear function of the other regressors
-often associated with: regressors constructed from other regressors; lots of dummy variables (collectively exhaustive, mutually exclusive) plus a constant; a quirk of the data

Scatter Diagrams

One of the simplest methods for analyzing the relationship between 2 quantitative variables by graphing the data
-observe the distribution of n points for n observations
-can pick up: direction (pos/neg), dispersion (clustered, spread out), shape (linear, random)

Why does r squared increase when you add another regressor?

PROBLEM: Every time you add a regressor, r squared increases (even if only by chance). Thus, a model with more terms might appear to be a better fit just because it has more terms, and not necessarily because it explains the relationships well. R squared is upwardly biased.
-This is because adding more terms minimizes the sum of squared errors. It is impossible to explain less variation in Y by adding explanatory variables.

Coefficient of Determination

Proportional reduction in error from using the linear prediction equation instead of the sample mean to predict Y
-r squared = (E1 - E2)/E1 = (TSS - SSE)/TSS
-tells us how much of the total variation our prediction model is actually explaining
-when a strong linear relationship exists between X and Y, the prediction equation provides predictions that are much better than the sample mean

Dummy Variable

Qualitative info is often captured by defining a binary/dummy/indicator variable
-group of interest: 1
-everyone else: 0
-lets us use a single regression equation to explain multiple groups (can control for something with multiple categories)
-used to help sort data into mutually exclusive categories

Dependent Sample

Results when natural matching occurs between each subject in one sample and a subject in the other sample
-we can't use independent-sample techniques on dependent samples (e.g. longitudinal data on the same respondents, differences between husbands and wives)

Conditional vs. Marginal Variation

SD of Conditional Distribution:
-sqrt(SSE/(n-2)); SSE = sum of squared differences between Y and Y hat
-uses the residual differences
SD of Marginal Distribution:
-sqrt(sum of squared differences between Y and Y bar/(n-1))
-makes no reference to any other variable X
Differences:
-degrees of freedom (n-1 vs. n-2)
-the reference point differs: Y hat vs. Y bar
-the conditional SD is usually smaller than the marginal SD
-note: a steeper slope doesn't always mean you've increased the strength of the relationship, because the slope isn't standardized

Interpretation of Dummy Variable Equation

SLOPE:
-if the dummy is 1, we add the coefficient to the y-intercept ("this much more")
-"According to the output, our coefficient on union status, or the slope of the regression equation, is 5.84. That is, a union member's average hourly wage is $5.84 higher than a non-union member's average hourly wage. In other words, the presence of union membership is associated with a $5.84 increase in the average of hourly wage."
Y-INTERCEPT:
-"Our output tells us that the constant, or y-intercept of the regression equation, is 16.68. That is, the average value of hourly wage when X = 0 (when the respondent is not a union member) is $16.68."

How can a result be statistically significant in the bivariate regression but not in the multiple regression?

Sometimes we do find that controlling for more independent variables that could affect the relationship between X and Y leads us to determine that the relationship found in the bivariate regression is not statistically significant.
-By adding more independent variables one by one, we see the predictive power of each of the elements.
-In the bivariate form, when we found a highly negative/positive/largely correlative relationship, we were excluding other factors. This means that the variable we observe serves as a crude proxy for all those other factors, compensating for their omission and making it harder to isolate the true effect of our independent variable of interest. Thus, our regression might overstate or understate the true effect of our independent variable on the dependent variable. By serving as a proxy for the effects of all other variables related to X and Y that are not accounted for, the relationship also might look more important than it actually is.

Model Sum of Squares

TSS - SSE
-represents the amount of variation that IS explained by your model

Conditions for Omitted Variable Bias

The coefficient in the bivariate regression is biased when there is some omitted variable X2 that is correlated with both Y and X1.
1. X2 is a determinant of Y
-X2 is correlated with Y
-X2 is in u (the error term in the bivariate regression)
-X2 would have a non-zero coefficient if we put it in the regression
2. X2 is correlated with X1

Root MSE

The root mean squared error is the estimated standard deviation of the conditional distribution of Y. In other words, the standard deviation of the distribution of Y's around the predicted average Y value associated with a given X is the RMSE. It also thus represents the standard deviation of the residuals for this regression line relating X to Y.

The Danger of Using the Simple Homoskedasticity Formula

The simple homoskedasticity formula is only correct if the errors are homoskedastic.
-which is generally not a realistic assumption
-the more robust formula gives you the right answer regardless of whether the errors are homoskedastic or heteroskedastic
-using the homoskedastic formula with heteroskedastic errors means that you will get incorrect standard errors
-the two formulas coincide for large n in the case of homoskedasticity

Why do we need to identify the type of sample?

The way we conceptualize standard error, average error, and the difference of statistics/parameters is very different for dependent samples
-we cannot apply independent-sample methodology to dependent samples; we would get grossly wrong results

Alpha

The y-intercept of a relationship
-alpha is the value of Y when X = 0
-alpha is a population parameter

Imperfect Multicollinearity

Two or more regressors are very highly correlated
-a scatterplot will pretty much look like a straight line, but unless the correlation is exactly one, the collinearity is imperfect
-one or more of the regression coefficients will be imprecisely estimated
-the coefficient on X1 is the effect of X1 holding X2 constant, BUT if X1 and X2 are highly correlated, there is little variation in X1 when X2 is held constant --> the data are uninformative about how Y changes when X1 moves and X2 doesn't --> the variance of the OLS estimator on X1 will be very large (large standard errors too)
-the numbers can get much bigger and very inaccurate, and the sign may change --> not a predictable pattern (also conditional on the other X's)

Will the probability of type II error decrease with very large sample sizes? What about type I?

Type II: Yes.
-As n increases, by the law of large numbers, we should get a sample mean closer to the actual population mean. That means a deviation of the observed statistic from the null hypothesized population mean is more likely to be statistically significant, and we are more likely to reject the null (less type II error).
-With a smaller sample size, it is not unusual to get statistics that are more erratic, which increases the chance that we observe a statistic very close to the null (when in actuality the true population parameter is not near the null value). Thus, smaller samples yield greater type II error.
Type I: No.
-As long as we meet our assumption (that n is large enough that the sampling distribution is approximately normal), our sampling distribution still looks normal no matter how many observations we collect. Thus, the chance of falling in the extreme tail ends of our sampling distribution by chance (such that we reject the null hypothesis when we actually should not have) remains constant no matter how much we increase the sample size, as long as the minimum threshold is met.

Regression Analysis

Using linear models to study the form of relationship between variables, the strength of a relationship between variables, and the statistical significance of that relationship

Which of the following will tend to be smaller, meaning smaller prediction error -- E1 or E2?

We expect E2 to be smaller because:
-when we find the best-fit line using the second method, the equation we derive is designed to have less error (minimizing residuals)
-we are optimizing the prediction power of our function
-we are using more distributions (the method using the average only uses one distribution)
-we are using the covariance of both variables
AKA WE ARE MINIMIZING ERROR AND USING MORE INFO TO CALCULATE

At least how many groups do we have to exclude?

We have to exclude at least one group (mathematically necessary).
-each of the coefficients is a direct comparison of the indicator to the excluded group (specific to the difference compared to that specific group)
-we can't include all of them because of perfect collinearity

The most basic assumption when constructing confidence intervals or with hypothesis tests using the t distribution is that the population distribution is normal. Why?

We need to assume the population distribution is normal because in that case, the sampling distribution of Y bar is normal even for small samples. Ordinarily, if the sample size is large enough, then we can use Central Limit Theorem to approximate the sampling distribution of a variable with any population distribution shape to be normal. However, since small samples are not large enough for the sampling distribution to be approximately normal (regardless of population distribution shape), we assume that the population distribution is normal because then the sampling distribution has to be normal. To conduct valid significance tests or construct valid confidence intervals, we need normal sampling distributions. Thus, to perform these statistical methods on small samples, we assume that the population distribution is normal (such that the sampling distribution is still normal).

