Ch. 18
1) suppose you have the following population: {2, 3, 4, 5, 8, 11} 2) suppose you create a sampling distribution for x bar (the sample mean) by selecting samples of size n=2 from this population (choose two random numbers from the data set) and recording their sample means 5) what happens if you increase the sample size from 2 to 5? 8) standard deviation of x bar
1) population mean (µ) = 5.5; population st. deviation (σ) = 3.10 (how much the #s deviate from the center); population size (N) = 6
2) randomly select 2 numbers out of the data set and record their average; listing all possible combinations for n=2 and their sample means gives all possible x bars for n=2
3) x bar does not necessarily equal µ (bc 2 numbers do not necessarily represent the whole data set, so the x bar of the 2 numbers doesn't necessarily match the true mean of the data set)
4) make the sampling distribution of x bar for n=2 (easier than making a sampling distribution of p hat, which needs many, many repeats; for x bar here it is just all the possible combos of the 6 #s)
-the sampling distribution is very spread out, because averaging only 2 random numbers does not give you a good idea of what µ actually is
-the middle of the distribution of x bar is still centered at µ=5.5
-σ/√n = 3.10/√2 = 2.19 (x bar typically varies about 2.19 away from 5.5)
5) now pick 5 numbers from the data set and find the corresponding x bar
6) then make the sampling distribution of x bar for n=5
-the sampling distribution of x bar is always centered at µ=5.5 (both sampling distributions, whether they had a lot of spread or not, were centered at 5.5)
-for n=5, the sampling distribution of x bar was less varied and more bell shaped
-σ/√n = 3.10/√5 = 1.38 (x bar typically varies about 1.38 away from 5.5)
7) the sampling distribution of x bar is always centered at µ!! the only difference is the spread (variability)
8) as n increases, you see less and less spread (but how much spread?)
9) σ of x bar (st. deviation of x bar, i.e., how much x bar varies from µ) = σ/√n (as n gets bigger, the denominator gets bigger and the quotient gets smaller, so the variation of x bar gets smaller)
10) if you don't know σ, use the st. deviation of the sample (s) instead; this gives the standard error of x bar = s/√n (still tells you how much x bar deviates from the mean)
-the st. error is only an estimate: we don't actually know sigma, so we replace it with the next best thing (s), which adds some extra uncertainty
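The enumeration described above can be sketched in a few lines of Python. This is only an illustration: the helper name `sampling_distribution` is made up for this example, and it lists every possible sample without repeats, as the notes do. The population standard deviation computed here comes out near 3.10. One caveat worth labeling: σ/√n is exact for independent draws, so for samples drawn without replacement from a population of only 6 numbers the enumerated spread comes out somewhat smaller than σ/√n. The center still lands on µ and the spread still shrinks as n grows.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [2, 3, 4, 5, 8, 11]
mu = mean(population)        # population mean
sigma = pstdev(population)   # population st. deviation (divides by N)


def sampling_distribution(n):
    """Return x bar for every possible sample of size n (no repeats).

    Hypothetical helper for illustration only.
    """
    return [sum(sample) / n for sample in combinations(population, n)]


xbars_n2 = sampling_distribution(2)   # 15 possible samples
xbars_n5 = sampling_distribution(5)   # 6 possible samples

print(f"mu = {mu}, sigma = {sigma:.2f}")
for n, xbars in ((2, xbars_n2), (5, xbars_n5)):
    print(
        f"n={n}: center of x bars = {mean(xbars):.2f}, "
        f"enumerated spread = {pstdev(xbars):.2f}, "
        f"sigma/sqrt(n) = {sigma / sqrt(n):.2f}"
    )
```

Both sampling distributions are centered at 5.5, which is the "unbiased" idea in action, and the n=5 distribution is visibly tighter than the n=2 one.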
The CDC reports that the mean weight of adult men in the US is 190 lbs, with a st. deviation of 59 lbs. An elevator in our building has a weight limit of 10 persons or 2500 lbs. What's the probability that if 10 men get on the elevator, they will overload its weight limit?
(assume normal) µ=190, σ=59, x bar cutoff: 2500/10 = 250 lbs (convert the total limit to an average weight per person bc µ is given for individual weights)
1. find the probability that the average weight exceeds 250: P(x bar ≥ 250)
2. x bar follows a normal distribution centered at 190 with a spread of 59/√10 = 18.66
3. normalcdf(250, 1E99, 190, 18.66) ≈ 0.00065, essentially 0% (250 is more than 3 st. devs above 190)
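As a cross-check on the calculator work, the same tail probability can be sketched with Python's standard-library `statistics.NormalDist`. The variable names are illustrative, and the Normal model for x bar is the same assumption the notes make.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 190, 59, 10      # individual weights (lbs), 10 riders
cutoff = 2500 / n               # average weight that hits the 2500 lb limit

# Model for x bar: centered at mu with spread sigma / sqrt(n)
xbar_model = NormalDist(mu=mu, sigma=sigma / sqrt(n))

z = (cutoff - mu) / xbar_model.stdev       # how many st. devs above 190
p_overload = 1 - xbar_model.cdf(cutoff)    # P(x bar >= 250)

print(f"spread of x bar = {xbar_model.stdev:.2f}")
print(f"z = {z:.2f}, P(overload) = {p_overload:.5f}")
```

The z-score is a little over 3, so the probability is well under 0.1%.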
assumptions and conditions
-these must be satisfied in order to use the normal approximation (model) for the distribution of sample proportions
two assumptions (can't be checked directly, but you need to state that you assume they're true)
1. independence assumption (the sample values must be independent of each other)
2. sample size assumption (the sample size, n, must be large enough)
three conditions (must check)
1. randomization condition (the sample should be a simple random sample of the population)
2. 10% condition (the sample size, n, must be no larger than 10% of the population; if this is true, it is reasonable to assume independence since the population is large)
3. success/failure condition (the sample size has to be big enough so that both np (expected # of successes) and nq (expected # of failures) are at least 10)
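The two numeric conditions can be checked mechanically; the randomization condition cannot, since it depends on how the data were collected. A minimal sketch, assuming a made-up helper name `check_proportion_conditions` and a made-up population size of 5000 for the usage example:

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the numeric conditions for the normal model of p hat.

    Hypothetical helper for illustration. The randomization condition
    is about study design and has to be judged separately.
    """
    q = 1 - p
    results = {"success/failure": n * p >= 10 and n * q >= 10}
    if population_size is not None:
        # 10% condition: the sample is at most 10% of the population
        results["10%"] = n <= 0.10 * population_size
    return results


# 200 students with p = 0.22; population_size=5000 is an assumed figure
checks = check_proportion_conditions(n=200, p=0.22, population_size=5000)
print(checks)

# A sample that is too small: np = 20 * 0.13 = 2.6, which is below 10
too_small = check_proportion_conditions(n=20, p=0.13)
print(too_small)
```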
normal approximation to binomial distribution and sampling distribution
-notation: p = population proportion, p̂ = sample proportion, q = 1 - p
-the normal approximation to the binomial distribution is centered at the mean (µ) of np with a spread of √(np(1-p))
-the difference between the sampling distribution and the binomial model is that the sampling distribution talks about the proportion of successes, not the number of successes (not "5 heads out of 10 tries"; we want the percentage)
-p̂ = proportion of successes (# of successes/n)
-so the sampling distribution is the binomial model divided by n (to turn it into a proportion, since the binomial model tells you the number of successes)
-this makes the sampling distribution of p̂ centered at µ of p̂ = p, with a spread/variability (σ of p̂) of √(pq/n)
-as n increases, p̂ converges to the true proportion (the spread, √(pq/n), decreases)
-standard error of p̂ = SE of p̂ = √(p̂q̂/n) (still describes the variability, but uses the sample proportion in place of the true proportion)
-SE of p̂ is only an estimate of σ of p̂: building the spread from a sample proportion rather than the true proportion adds some uncertainty (but both describe how p̂ deviates from p)
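The "divide the binomial model by n" step can be verified numerically. The sketch below builds the exact binomial distribution for the count of successes and confirms that its mean and spread match np and √(npq), and that dividing by n gives p and √(pq/n). The values n=90 and p=0.13 are just example inputs.

```python
from math import comb, sqrt

n, p = 90, 0.13   # example values
q = 1 - p

# Exact binomial probabilities for k = 0, 1, ..., n successes
pmf = [comb(n, k) * p**k * q ** (n - k) for k in range(n + 1)]

# Center and spread of the NUMBER of successes
count_mean = sum(k * pk for k, pk in enumerate(pmf))
count_sd = sqrt(sum((k - count_mean) ** 2 * pk for k, pk in enumerate(pmf)))

# Dividing every count by n rescales both the center and the spread
phat_mean = count_mean / n
phat_sd = count_sd / n

print(f"count: mean = {count_mean:.3f} vs np = {n * p:.3f}")
print(f"count: sd   = {count_sd:.3f} vs sqrt(npq) = {sqrt(n * p * q):.3f}")
print(f"p hat: mean = {phat_mean:.3f} vs p = {p}")
print(f"p hat: sd   = {phat_sd:.4f} vs sqrt(pq/n) = {sqrt(p * q / n):.4f}")
```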
sampling distribution
-repeat a random process (e.g., take a sample of n coin flips and record p̂) many, many times to generate a distribution
-the shape of the resulting histogram is the sampling distribution
-it lets you predict the behavior of p̂
-as you increase the sample size (n), p̂ tends to land closer to the true proportion (p), so the sampling distribution is less varied and more concentrated near the mean
-the variability of p̂ decreases as n increases (the spread of the sampling distribution becomes more scrunched up around the mean)
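A short simulation makes the shrinking spread concrete. This is a sketch under stated assumptions: a fair coin (p = 0.5), 2000 simulated samples per sample size, and a fixed seed so the run is repeatable. The helper name `simulate_phats` is made up for this example.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(18)   # fixed seed so the simulation is repeatable
p = 0.5           # true proportion of heads for a fair coin


def simulate_phats(n, repeats=2000):
    """Flip a coin n times, record p hat, and repeat (hypothetical helper)."""
    return [
        sum(random.random() < p for _ in range(n)) / n
        for _ in range(repeats)
    ]


phats_small = simulate_phats(10)     # samples of 10 flips
phats_large = simulate_phats(1000)   # samples of 1000 flips

for n, phats in ((10, phats_small), (1000, phats_large)):
    print(
        f"n={n}: center = {mean(phats):.3f}, "
        f"simulated spread = {pstdev(phats):.4f}, "
        f"sqrt(pq/n) = {sqrt(p * (1 - p) / n):.4f}"
    )
```

Both sets of p̂ values center near 0.5, but the n=1000 values cluster far more tightly.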
the fundamental theorem of statistics
-the sampling distribution of any mean becomes more nearly Normal as the sample size grows
-if you have a large enough sample, the sampling distribution of the mean will be approximately normal
-this is called the Central Limit Theorem (CLT)
-not only does the histogram of the sample means get closer and closer to the normal model as the sample size grows, but this is true regardless of the shape of the population distribution
-the CLT works better (and faster) the closer the population is to a Normal model itself; it also works better for larger samples
-definition: the mean of a random sample is a random variable whose sampling distribution can be approximated by a Normal model. The larger the sample, the better the approximation will be.
-normality comes from what averaging does (it pulls values toward the middle, so sample means pile up around the center); the curve becomes taller and narrower bc variability is decreasing
-it matters more that x bar is normal than that x is normal, because we want the average response (x represents only one individual, while x bar summarizes a whole sample)
-for x bar, the rule of thumb is n ≥ 30 for the CLT to make the sampling distribution of x bar approximately normal
-for p hat, we need np ≥ 10 and nq ≥ 10 for the sampling distribution of p hat to be approximately normal
-averages pull things to the center and so do proportions (the idea is convergence: you are converging to the same number, so you get a mound in the middle), so the CLT applies to both proportions and means
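To see the CLT at work on a population that is clearly not Normal, the sketch below draws sample means from a right-skewed population and measures how lopsided the resulting sampling distributions are. The exponential population, the 5000 repeats, the seed, and the helper names are all assumptions chosen for illustration; skewness near 0 indicates a symmetric, more bell-like shape.

```python
import random
from statistics import mean, pstdev

random.seed(18)   # fixed seed so the simulation is repeatable


def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


def sample_means(n, repeats=5000):
    """x bar for many samples of size n from a skewed population.

    The population here is exponential with mean 1 (an assumed example).
    """
    return [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(repeats)
    ]


means_n2 = sample_means(2)
means_n30 = sample_means(30)

skew_n2 = skewness(means_n2)
skew_n30 = skewness(means_n30)

print(f"n=2:  center = {mean(means_n2):.3f}, skewness = {skew_n2:.2f}")
print(f"n=30: center = {mean(means_n30):.3f}, skewness = {skew_n30:.2f}")
```

The n=30 sample means stay centered at the population mean but are much less skewed than the n=2 sample means, which is the behavior the theorem describes.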
population proportion
-the true proportion, p
-e.g., p=50% is the true proportion of heads when you flip a fair coin
unbiased estimator
-unbiased estimator: the average of all your x bars ends up being µ (µ of x bar = µ)
-biased estimator: the average of all your x bars does not equal µ (µ of x bar ≠ µ)
-this tells us why the center of the sampling distribution is µ
Suppose that 13% of the population is left-handed. A 200-seat auditorium has been built with 15 "lefty seats," which have the built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what's the probability that there will not be enough seats for the left-handed students?
p=0.13 (proportion of left-handed people); cutoff p̂ = 15/90 = 0.167
1. use the normal approximation to find the probability that more than 15 of the 90 students are left-handed, i.e., P(p̂ > 15/90)
2. the model is centered at 0.13 with a spread of √((0.13)(0.87)/90) = 0.035
3. find how likely it is that the proportion of left-handers is greater than 15/90 (the chance that you run out of left-handed seats)
4. normalcdf(0.167, 1E99, 0.13, 0.035) = 0.145 = 14.5% (this is the probability that the proportion of left-handed students out of 90 students is greater than 15/90)
5. remember to check conditions since you are using the normal approximation
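The same calculation can be sketched with `statistics.NormalDist`. Variable names are illustrative. Because the code keeps the unrounded spread and cutoff, its answer comes out near 0.15, slightly above the 0.145 obtained from the rounded values 0.167 and 0.035.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.13, 90          # true proportion of lefties, class size
cutoff = 15 / n          # proportion at which the lefty seats run out

spread = sqrt(p * (1 - p) / n)           # sigma of p hat
phat_model = NormalDist(mu=p, sigma=spread)

p_not_enough = 1 - phat_model.cdf(cutoff)   # P(p hat > 15/90)

# Success/failure condition for using the normal model
conditions_ok = n * p >= 10 and n * (1 - p) >= 10

print(f"spread = {spread:.4f}, P(not enough seats) = {p_not_enough:.3f}")
print(f"np = {n * p:.1f}, nq = {n * (1 - p):.1f}, ok = {conditions_ok}")
```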
The CDC reports that 22% of 18 y/o women in the US have a BMI of 25 or more, a value associated with an increased health risk. As part of a health check at a large college, the PE department usually requires students to come in to be measured and weighed. This year, the department decided to try out a self-report system. It asked 200 randomly selected female students to report their heights and weights (from which their BMI is calculated). Only 31 of these students had BMIs greater than 25. Is the proportion of high-BMI students unusually small?
p=0.22 (true proportion of females with a BMI of 25 or more); p̂ = 31/200 = 0.155 (how likely would it have been to see this 15.5% if the true proportion was 22%?)
1. make a sampling distribution centered at the true p (0.22) to answer this question
2. then mark where p̂ is (we need the likelihood of landing at p̂ or further to the left)
3. for this, we have to know how many people were sampled (n), because whether the gap of 6.5 percentage points (about 7%) is large depends on n (a 7% gap in a sample of 200 is much more meaningful than a 7% gap in a sample of 2)
4. to find the spread (variability) of p̂ around p and see if the 7% gap is actually large, compute √(pq/n) = √((0.22)(0.78)/200) = 0.0293
5. so we can expect p̂ (the sample proportion from different samples of size 200) to vary by around 3% from p=22%
6. to see how likely it is to land about 7% below, do normalcdf(-1E99, 0.155, 0.22, 0.0293) = 1.32%
7. 1.32% is unusually low
8. remember to check conditions since we used the normal approximation
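A minimal Python sketch of the same lower-tail calculation, using `statistics.NormalDist`; the variable names are illustrative. It also reports the z-score, which shows that the observed proportion sits a bit more than 2 st. devs below p.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.22, 200         # CDC proportion, sample size
phat = 31 / n            # observed sample proportion, 0.155

spread = sqrt(p * (1 - p) / n)            # sigma of p hat
phat_model = NormalDist(mu=p, sigma=spread)

z = (phat - p) / spread                   # st. devs below the center
p_this_low = phat_model.cdf(phat)         # P(p hat <= 0.155)

print(f"spread = {spread:.4f}, z = {z:.2f}")
print(f"P(p hat <= {phat}) = {p_this_low:.4f}")
```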
categorical variable
parameter: p = population proportion; statistic: p hat = sample proportion; center: µ of p hat = p; spread: σ of p hat = √(pq/n); shape: approximately normal if np ≥ 10 and nq ≥ 10
quantitative variable
parameter: µ = population mean, σ = population st. deviation; statistic: x bar = sample mean; center: µ of x bar = µ; spread: σ of x bar = σ/√n; shape: approximately normal if n ≥ 30 (always normal if the population is normal)
mean vs. proportion
proportion is for categorical data (p̂) and mean is for quantitative data (x bar)
sample proportion
p̂
-p̂ is not necessarily equal to p (how close it gets depends on how many times you repeat the event)
-p̂ = # of observed successful outcomes / # of total outcomes
-e.g., flipping a coin does not guarantee exactly 50% heads: flip 2 times and you won't necessarily get 50% heads; flip 2 million times and p̂ will be much closer to 50% (Law of Large Numbers)
A college PE department asked a random sample of 200 female students to self-report their heights and weights, but the percentage of students with BMIs over 25 seemed suspiciously low. One possible explanation may be that the respondents 'shaded' their weights down a bit. The CDC reports that the mean weight of 18 y/o women is 143.74 lb., with a st. deviation of 51.54, but these 200 randomly selected women reported a mean weight of only 140 lbs. Based on the Central Limit Theorem and the 68-95-99.7 Rule, does the mean weight in this sample seem exceptionally low, or might this just be random sample-to-sample variation?
µ=143.74, σ=51.54, n=200 (which is greater than 30, so by the CLT x bar is approximately normal)
1. make a normal distribution centered at 143.74 with a spread of σ/√n = 51.54/√200 = 3.64
2. find the probability of observing an x bar less than or equal to 140 to see if the mean weight is exceptionally low
3. normalcdf(-1E99, 140, 143.74, 3.64) ≈ 15.2%
4. not unusual: 140 is only about 1 st. dev below the mean (z ≈ -1.03), which the 68-95-99.7 Rule says happens often, so this could just be random sample-to-sample variation
5. if the sample size increased to n=2000 and the sample mean were still 140, this percentage would become smaller: when you increase n, the spread gets smaller and the normal curve gets taller and narrower (less variability), so the area in that tail shrinks and the result may become unusual (unusual is beyond 2 st. devs)
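The effect of sample size on this tail probability can be sketched directly. The helper name `prob_mean_at_or_below` is made up for this example, and the n=2000 case assumes the reported mean would still be 140. With these inputs the n=200 probability comes out near 15%.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_at_or_below(observed, mu, sigma, n):
    """P(x bar <= observed) under the CLT normal model (hypothetical helper)."""
    return NormalDist(mu=mu, sigma=sigma / sqrt(n)).cdf(observed)


mu, sigma = 143.74, 51.54   # CDC mean and st. deviation (lbs)
observed = 140              # reported sample mean

p_n200 = prob_mean_at_or_below(observed, mu, sigma, n=200)
p_n2000 = prob_mean_at_or_below(observed, mu, sigma, n=2000)  # same mean assumed

z_n200 = (observed - mu) / (sigma / sqrt(200))
z_n2000 = (observed - mu) / (sigma / sqrt(2000))

print(f"n=200:  z = {z_n200:.2f}, P = {p_n200:.4f}")
print(f"n=2000: z = {z_n2000:.2f}, P = {p_n2000:.4f}")
```

At n=200 the sample mean is about 1 st. dev below µ, which is not unusual; at n=2000 the same mean would be more than 3 st. devs below µ.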