confidence intervals and t-distribution


Why do we need distributions?

- Distributions help us determine the probability of observing a sample mean of a certain value.
- Why does this matter? It is not uncommon to observe one person who is very tall, or one building with high-lead-content pipes. This may not be groundbreaking news or require attention. But what if we saw an entire city or state of very tall people, or of high lead levels? This is much more unlikely and may be worthy of further investigation... But HOW unlikely does something have to be for us to conclude there IS something going on here?

How Unlikely is Unlikely?

A fairly common convention in statistics is to test how likely something is... If the probability of seeing, for example, this high lead content in water is less than 5%, we tend to call it "statistically significant."
- There are three main levels of significance:
  - .10, equivalent to a less than 10% chance (p-value < .10)
  - .05, equivalent to a less than 5% chance (p-value < .05)
  - .01, equivalent to a less than 1% chance (p-value < .01)
- When we were working with the z-distribution, the z-values associated with those chances are known, and we can look them up on a z-table:
  - 90% confidence has z = 1.65 (p-value = .10)
  - 95% confidence has z = 1.96 (p-value = .05)
  - 99% confidence has z = 2.58 (p-value = .01)
- But now we need to be a bit more realistic... rarely do we know σ, and sample sizes can sometimes be small. We need a different distribution now...
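Those z-critical values can be recovered (to table precision) from the standard normal distribution itself; a quick sketch using only Python's standard library:

```python
# Sketch: recovering the z-critical values quoted above from the
# standard normal distribution, using only the standard library.
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# For a two-sided confidence level C, the critical z cuts off
# (1 - C) / 2 in each tail, so we invert the CDF at 1 - (1 - C) / 2.
for confidence in (0.90, 0.95, 0.99):
    upper_tail = 1 - (1 - confidence) / 2
    z_crit = std_normal.inv_cdf(upper_tail)
    print(f"{confidence:.0%} confidence -> z = {z_crit:.3f}")
```

This prints 1.645, 1.960, and 2.576, which round to the 1.65, 1.96, and 2.58 values listed on the z-table.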

95% CI in z and in t (df=6)

All else being equal, because of the extra uncertainty of not knowing 𝜎, the t-distribution will result in a wider confidence interval.

Sample Compared to Sampling - Ideal situation where our Point Estimate is equal to μ

Data Distribution (x̄, s): distribution of the data from ONE sample taken from the population.
Sampling Distribution (μ_x̄, σ_x̄): distribution of the possible statistics (x̄, s) of many samples.
Pretend we know the true sampling distribution: the Perfect Match Scenario. Remember, with proper, large enough samples, the mean of the sampling distribution will equal the true population mean.

68, 95, 99 Rule, Rounded

For normally distributed data:
68% of observations fall within ±1 SD
95% of observations fall within ±2 SDs
99.7% of observations fall within ±3 SDs
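The rounded rule can be checked against the exact standard normal probabilities; a small sketch with the standard library:

```python
# Sketch: checking the rounded 68-95-99.7 rule against the exact
# standard normal probabilities.
from statistics import NormalDist

std_normal = NormalDist()

for k in (1, 2, 3):
    # P(-k SD < X < +k SD) for normally distributed data
    prob = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within +/-{k} SD: {prob:.4f}")
```

The exact values are about .6827, .9545, and .9973, which is why the 95%/±2 SD pairing is only approximate (the exact cut-off is 1.96, as discussed below).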

Sample Compared to Sampling -More Realistic Scenario

Notice how our one sample mean x̄ is close to, but not exactly, the true μ. Our SAMPLE mean x̄ won't be exactly μ. It is one of the many sample means (x̄) in the SAMPLING distribution (μ_x̄). In this case, our sample mean is one of the slightly higher estimates in the sampling distribution.

confidence interval formula

Point Estimate: the statistic (ex. sample mean) in the original number scale
z-critical: the level of confidence I want (ex. 95%, z = 1.96) as its corresponding z-score
Standard Error (SE): the standard deviation of the sampling distribution
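Putting those three pieces together, the formula is CI = point estimate ± z-critical × SE. A minimal sketch; the sample mean x̄ = 81 and σ = 8 are made-up illustration values (chosen so that, with the n = 9 used later in this set, they reproduce the [75.77, 86.23] interval):

```python
# Minimal sketch of CI = point estimate +/- z-critical * SE.
# x_bar and sigma below are hypothetical illustration values.
import math

x_bar = 81.0    # point estimate (sample mean), hypothetical
sigma = 8.0     # population standard deviation, assumed KNOWN here
n = 9           # sample size
z_crit = 1.96   # z-critical for 95% confidence

se = sigma / math.sqrt(n)        # standard error of the mean
moe = z_crit * se                # margin of error
lower = x_bar - moe
upper = x_bar + moe
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")  # [75.77, 86.23]
```

Note that the answer is in the original number scale, just like the point estimate.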

Wait, what's with the 1.96??

Remember the 68%, 95%, 99.7% rules? For normally distributed data (like sampling distributions), 95% of observations fall between plus and minus two standard deviations from the mean. Well... that was not completely exact... In case you're wondering where 1.96 came from, it is from the unit normal table for z-scores. If you look for the z-score associated with 95%, you will find this: .4750 + .4750 = .9500

Confidence interval for a mean when 𝜎 is unknown:

We use the sample standard deviation (s) instead of σ

Based on the equation above, what factors influence how wide or narrow your CI is?

The Margin of Error is what widens or narrows your CI, but what affects your MOE?
The Margin of Error is affected by:
- Sample size (n)
- The variability in the POPULATION (σ)
- The desired level of confidence (ex. 95%), i.e. the z-critical value (ex. 1.96)

But how do we know?

Since we will never actually see the sampling distribution (remember, we only ever take one sample in the real world), you might be wondering: how do we know if our sample mean is one of the sample means that is too high or too low? Can we ever know this? Hm... it all seems so uncertain... It is... but we can account for some of that uncertainty...

margin of error

The amount of error based on SE and a desired level of confidence in the original number scale

95% Confidence Interval and 1.96

The confidence level of 95% is a fairly common standard in academia (for better or for worse). For much of the rest of the course we will be using the standardized values associated with 95% confidence. When we are dealing with data where the standard deviation (σ) is known, we use the standardized z-value of 1.96. In the future, when σ is unknown, we will use different standardized values... I am sorry, again.

Flatter and wider

The t-distribution is flatter and wider than the z-distribution, but gets closer to the z-distribution as sample size increases.

90, 95, 99 Rule, Exact

These are now exact, not rounded. We don't care as much about 68% in this context.
90% of observations fall within ±1.65 SDs
95% of observations fall within ±1.96 SDs
99% of observations fall within ±2.58 SDs

What is our goal when we take a sample and calculate a statistic like the mean?

We want the truth! We are sampling and calculating in an attempt to capture the truth, in this case the true μ.

point estimate formula

(lower bound + upper bound) / 2

standard error =

MOE / z-critical

Because σ is known, we use the unit normal z-table, and we see the critical z-value associated with 95% confidence is 1.96

true

MOE=

x̄ - lower bound = upper bound - x̄
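The three formulas above can be run in reverse: given a reported CI, you can back out the point estimate, MOE, and SE. A sketch using the [75.77, 86.23] interval from this set (95% confidence, σ known, so z = 1.96):

```python
# Sketch: backing out point estimate, MOE, and SE from a reported CI.
# Interval and z-critical taken from the 95% CI example in this set.
lower, upper = 75.77, 86.23
z_crit = 1.96

point_estimate = (lower + upper) / 2   # midpoint of the interval
moe = upper - point_estimate           # also equals point_estimate - lower
se = moe / z_crit                      # standard error

print(round(point_estimate, 2))   # 81.0
print(round(moe, 2))              # 5.23
print(round(se, 2))               # 2.67
```

This is handy on exams when a problem hands you the interval and asks for the pieces.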

How do we interpret this confidence interval?

- If we repeated the sample again and again, 95 out of 100 times the true μ would be captured in our confidence interval.
- Said another way, we are 95% confident the true μ value is between [75.77, 86.23].
- "If we take repeated samples of n = 9 and compute a 95% confidence interval each time, approximately 95% of the intervals would contain the true number of hours college students sleep."
- "We are 95% confident that the true number of hours college students sleep is between 3.5 and 10.5 hours."

Interpretation

A random sample of 25 college graduates revealed that they worked an average of 6 years at a job before being promoted, with a standard deviation of 1.3 years. Compute and interpret a 99% confidence interval for the mean number of years worked at a job before being promoted.
- "If we take repeated samples of size 25 and compute a 99% confidence interval each time, approximately 99% of the intervals would contain the true mean number of years worked at a job before being promoted."
- "The average number of years it takes to get promoted is somewhere between 5.27 years and 6.73 years."
- 1 out of 100 times we'll get a confidence interval that is too high or too low.
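The computation behind that interval can be sketched as follows. Since σ is unknown, we use the t-critical from a t-table; for df = 25 - 1 = 24 and 99% confidence (two-tailed .01 level), the table value is 2.797:

```python
# Sketch of the promotion example: n = 25, x_bar = 6, s = 1.3, 99% CI.
# Sigma is unknown, so we use s and the t-critical from a t-table
# (2.797 for df = 24, two-tailed .01 level).
import math

n = 25
x_bar = 6.0
s = 1.3
t_crit = 2.797   # t-table value for df = 24, 99% confidence

se = s / math.sqrt(n)   # estimated standard error
moe = t_crit * se       # margin of error
lower = x_bar - moe
upper = x_bar + moe
print(f"99% CI: [{lower:.2f}, {upper:.2f}]")  # [5.27, 6.73]
```

The result matches the [5.27, 6.73] interval interpreted above.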

Degrees of Freedom = 𝑛 − 1

Degrees of Freedom are the number of scores allowed to vary when calculating a statistic...
- Suppose you want an average score of 85 (x̄ = 85) on all three exams.
  - There are no restrictions on the first two exam scores.
  - Ex. You could get a 75 and an 85 (or a 65 and a 95, etc.), but then that last score has to be a 95 to get that 85 average.
- As soon as you know your first two exam scores, the desired mean (x̄ = 85) dictates the score that you need on the third exam (ex. a 95).
  - Notice that the first two exam scores are free to vary, but the third score is fixed by the mean (i.e. n - 1 scores are free).
For now, what is most important is that you know how to calculate the degrees of freedom for a sample so you can look up the correct t-critical.
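The exam example above can be sketched in a couple of lines: once the mean fixes the total, the last score has no freedom left.

```python
# Sketch of the degrees-of-freedom idea: with a fixed mean over three
# exams, the first two scores are free but the third is determined.
target_mean = 85        # desired average across all three exams
exam1, exam2 = 75, 85   # the two "free" scores from the example

# The mean fixes the total (3 * 85 = 255), so the last score is forced:
exam3 = 3 * target_mean - (exam1 + exam2)
print(exam3)  # 95, exactly as the example says
```

Swap in any other two free scores (65 and 95, say) and the third is always whatever makes the total 255: n - 1 scores vary, one is fixed.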

Which one?

If I ask you to calculate a Confidence Interval, how do you know which method, which table, to use?
Hints:
- Is this a sample with a KNOWN σ? If you see the Greek σ, you know you can use the z-distribution.
- Is this a sample with an UNKNOWN σ? Do you see an "s"? If you see "s" or "unknown variance," you will use the t-distribution.

z vs. t-table

The main purpose of the z and t distributions is to standardize, to determine how probable something is, and to create confidence intervals.
The main difference between the z and t tables:
- z-tables have:
  - Lists of z-values to look up corresponding proportions
  - Or look up a proportion to find the corresponding z-score
- t-tables have:
  - Degrees of freedom (n - 1)
  - The level of significance you want
  - One- or two-tail options

t-distribution Characteristics

Slight differences from z, but same purposes:
- to standardize samples for comparisons,
- to determine the probability of observing certain means,
- and to create confidence intervals.
The t-distribution handles uncertainty (unknown σ). It also takes into account the sample size, n. It is similar to the z-distribution when sample size is large; once the sample size is large enough, it mimics the z-distribution with no real difference.

Why are we doing this...?

So far, we've been calculating confidence intervals for statistics of interest like the mean.
- Point estimates aren't actually that precise... To be more confident, we must be more humble and give a range (a confidence interval) for our estimates.
- And we have to acknowledge our shortcomings with regard to sample size (i.e. df, using t instead of z). Getting boring, I know... But...
This is all working up to being able to test bigger, more interesting ideas...

How Do We Know Where It Is?

Taking into account sample size and our desired level of confidence, we can use a table to look up our critical t-value(s).

Comparing t to z

The standardized t-critical to construct a 95% CI for a sample of 10 people when σ is unknown is ±2.262.
Notice how the "cut-offs," the critical t-values, move further outward, away from the mean. Compare this to the same situation where σ is assumed known and the z-value is ±1.96.
The larger the critical t-value, the wider your CI will be, meaning less precision.
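A small sketch of that comparison, with the t-critical hardcoded from a t-table (2.262 for df = 9, 95% two-tailed) rather than computed:

```python
# Sketch comparing CI widths: n = 10 (df = 9) with sigma unknown gives
# t = 2.262 from the t-table, versus z = 1.96 when sigma is known.
t_crit = 2.262   # t-table value, df = 9, 95% two-tailed
z_crit = 1.96

se = 1.0  # any standard error; the width RATIO doesn't depend on it

t_width = 2 * t_crit * se
z_width = 2 * z_crit * se
print(f"width ratio (t vs z): {t_width / z_width:.3f}")  # 1.154
```

With the same standard error, the t-based interval is about 15% wider than the z-based one: that extra width is the price of not knowing σ.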

t-table

We need to know how far out that 5% (2.5% on either side) is and what t-critical corresponds to it... Why can't we just use 1.96 anymore? Because σ is unknown AND we have to take sample size into account...

Confidence Intervals w/ t instead of z

What affects the width of our CI?
- The Margin of Error (MOE), which is determined by: sample size, standard deviation, and the critical t-value.
- We know that using t is always going to result in a wider CI because it "penalizes" small sample sizes and unknown σ.
- For a 95% CI, z is always 1.96, while t depends on sample size but is always equal to (when n is large) or greater than z.
- The larger the critical value, the wider your CI.

Historical Moment: William Sealy Gosset

William Sealy Gosset (June 13, 1876 - October 16, 1937), AKA "Student," was a statistician, chemist, and Head Brewer for Guinness. He had to check the sugar content in the malt used to make the beer. It had to be within a certain range because sugar is related to alcohol content... and he didn't want beer batches to vary too much.
But he frequently had to use small sample sizes, which have more uncertainty associated with them, and he didn't know the true σ. So he created a new distribution, very similar to the normal distribution.
It became known as "Student's t" because Guinness would not allow him to mention beer, Guinness, or his own last name. So he called himself "Student." The "t" comes from people using Gosset's "Student's test."

Only the Important Stuff

With z-tables, we are looking for an exact proportion, i.e. probability. This is possible because σ is known. But now that σ is unknown and we might have a small sample size, there isn't one chart for every single combination of degrees of freedom and one- vs. two-tailed tests; there would be way too many charts. Instead, we use charts that focus on the levels of significance we are interested in as researchers. Those levels are typically .10, .05, and .01.
With the t-table, instead of looking up a z-value and getting the proportion, we are going to look for the t-value that corresponds to the proportion we want, like .05.

Accepting Uncertainty

"How did we miss the truth if we created our 95% confidence interval?? Aren't we 95% confident?!"
- Yes, indeed, we are 95% confident, but that still leaves 5% we are not so confident about.
- That is a 1 in 20 chance of missing the truth. Basically, being wrong...
- You might be that 1 in 20, that 5%...
- If I reach into a bag full of 100 point estimates (x̄), 5 out of 100 times I will grab an estimate that does not catch the truth.
- In these cases I will "miss" the mark and my CI will not capture the true population parameter, μ.

Height Example With Different N's

Are you more likely to miss the true μ with a small sample size or a large sample size?
You are much more likely to be "off" if you have a small sample size... So, to make sure you "catch the truth," you need to make your plus-or-minus range (the confidence interval) wider with small sample sizes...
- 95% CI with a sample size of 36: [65'', 67''] vs.
- 95% CI with a sample size of 4: [63'', 69'']
Notice how the interval gets WIDER with a smaller sample size... There's more uncertainty with small sample sizes.

Sample Size (n)

As sample size (n) increases:
- The larger the sample size, the smaller the standard error
- The MOE decreases as n increases
- The narrower your confidence interval
(an inverse relationship)

Variability (𝜎)

As the variability in the POPULATION (σ) increases:
- The standard error increases
- The MOE increases as variability increases
- The confidence interval widens
(a direct relationship)

Confidence Levels

Confidence intervals describe the uncertainty and error associated with taking a sample from the population:
- Point estimate ± Margin of Error
We can't really know the absolute truth about a population when taking a sample (due to error).
- Though our point estimate may be wrong, we can be confident that the extra cushion (the margin of error) around our estimate captures the truth, the true μ. This is our goal... to capture the truth.

Confidence Intervals

Confidence intervals are in the raw number language of the original scale.
- Ex. jelly beans, height, exam scores, etc.
"The number of jelly beans in the jar is 300" vs. "The number of jelly beans in the jar is between 250 and 350": 95% CI [250, 350]

Jellybeans!

First, guess how many jellybeans are in the jar.
- Ex. 100
Second, give a ± value.
- Ex. 100 ± 20
Next, give me a range for your guess.
- Ex. 80-120
Lastly, tell me how confident you are in your estimate.
- Ex. 95%, 80%, etc.

Confidence Intervals

Gives us a range of what the truth (μ) might be.
Helps increase our chance of catching the truth.
- In case our initial point estimate (x̄) is off... which it usually is...
Later, they will help us with inferential statistics...

Reality Check

If we are trying to calculate estimates about something unknown, how likely is it that we will know a population's σ (standard deviation)?
- Not very likely... so we can't use the normal z-distribution anymore... but we do have another option...
We get a bit more realistic and start using a different distribution where we don't need to know the population's standard deviation (σ).
Now we will start using... the t-distribution.

Level of Confidence (𝑧critical)

If you increase your desired level of confidence (ex. from 95% to 99%):
- The standard error is not affected
- The margin of error increases with higher levels of confidence
- The confidence interval widens as you increase your desired level of confidence
90% confidence has z = 1.65
95% confidence has z = 1.96
99% confidence has z = 2.58
(a direct relationship)
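The three factors from the last few cards (n, σ, confidence level) can all be checked against MOE = z × σ / √n in one sketch; the baseline numbers below are made up for illustration:

```python
# Sketch: how MOE = z * sigma / sqrt(n) responds to each factor,
# holding the others fixed. Baseline values are hypothetical.
import math

def moe(z, sigma, n):
    """Margin of error for a mean when sigma is known."""
    return z * sigma / math.sqrt(n)

base = moe(z=1.96, sigma=10, n=25)

# Larger n -> smaller MOE (inverse relationship)
assert moe(1.96, 10, 100) < base
# Larger sigma -> larger MOE (direct relationship)
assert moe(1.96, 20, 25) > base
# Higher confidence (z = 2.58 for 99%) -> larger MOE (direct)
assert moe(2.58, 10, 25) > base
print("all three relationships hold")
```

Only the sample size works in your favor: it is the one factor under the researcher's control that narrows the interval.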

More or Less Truths?

Imagine you hear in the news:
- "Purple candidate is expected to have 52% of the votes and is projected to win by 4 points in the upcoming election. Green candidate is expected to have 48% of the vote."
Who's going to win? Purple or Green?
What if I told you the poll is accurate to within 7 points?
- Purple is expected to have between 45-59% of the votes
- Green is expected to have between 41-55% of the votes
Who's going to win? Do we know?
The inaccuracy of polls is to be expected because we are sampling a subset, and research is vulnerable to the same inaccuracy issues.
Every time you hear something like, "The average American watches 5 hours of TV per day,"
- there is always some error in such a statistic, because statistics are derived from samples.
A more accurate way to report the average amount of TV watching is to report a confidence interval, a range:
- "The average American watches between 4 and 6 hours of TV per day."

Give me an estimate...

Point Estimate: this is what you just gave me as your jellybean guess.
Margin of Error (MOE): this is the ± value.
Confidence Interval (CI): this is the range you calculated after you took your point estimate and applied ± the MOE.
Confidence Level: this is how confident you are that the true # of jellybeans falls within your CI.

Why do we calculate confidence intervals?

Rarely is a single point estimate (x̄) the exact parameter value we are looking for (μ).
- More likely, it is a little off: maybe a little too high or a little too low.
To increase the chances we capture the truth (μ), we put a little wiggle room, a little cushion, around our point estimate.
- Ex. x̄ = 65'' vs. 95% CI [64.5'', 66.5'']
Two Ways to Fish for Truth: if you are trying to capture the Truth Tuna, casting a wider net increases your chances.

Quick Reminder

Sampling distributions are theoretical. Though they represent the distribution of a bunch of sample means, we do not actually go out, collect, and calculate 100s of sample means.
- Rather, we use our ONE sample and create a theoretical sampling distribution based on our one sample's mean, variance, and sample size.
While we are being introduced to these new topics, we are going to pretend as though we know the truth before moving on.

Precision vs. Confidence

The more precise, the less confident. The more confident, the less precise.
- I am 99% confident that the average age of this class is 19, or somewhere in between: z = 2.58, CI: [17, 21]. How precise is this estimate?
- I am 95% confident that the average age of this class is 19, or somewhere in between: z = 1.96, CI: [17.5, 20.5]. How precise is this estimate?
- I am 80% confident that the average age of this class is 19, or somewhere in between: z = 1.28, CI: [18, 20]. How precise is this estimate?

How Confidence Intervals Help Us

Why do we create confidence intervals?
- To provide us with the wiggle room we need to account for uncertainty and to give us a better chance of capturing the truth.

