Stats 100 P2
What are degrees of freedom (t distribution)?
Degrees of freedom play an important role when using the Student t-score table. There are actually several t-score distributions. We differentiate between these distributions by use of degrees of freedom. Here the probability distribution that we use depends upon the size of our sample. If our sample size is n, then the number of degrees of freedom is n - 1. For instance, a sample size of 22 would require us to use the row of the t-score table with 21 degrees of freedom.
Example question for z test for two proportions
Do the proportions of men and women differ in their perception of Wall St execs?
Example question for two sample means
Does drinking tea lead to a stronger immune system? Null is that there's no difference. Alternative = outcome differs
What is the Central Limit Theorem?
If random samples of size n are drawn repeatedly from any population with mean mu and variance sigma^2, then when n is relatively large (n >= 30 for most distributions), the distribution of these sample means will be APPROXIMATELY normal. The distribution of sample means of size n drawn from the same population is centered around mu and has a variance of sigma^2/n.
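A small simulation can make the CLT concrete. This is only a sketch under assumptions that are not in the notes: the population is exponential with rate 1 (so mu = 1 and sigma = 1, clearly not normal), and the number of repetitions (5000) and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable (arbitrary choice)

n = 30       # size of each sample
reps = 5000  # number of repeated samples (arbitrary choice)

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
# It is strongly skewed, i.e. not normal.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

center = statistics.fmean(sample_means)  # should land close to mu = 1
spread = statistics.stdev(sample_means)  # should land close to sigma/sqrt(n)

print("mean of sample means:", round(center, 3))
print("SD of sample means:  ", round(spread, 3))
print("sigma/sqrt(n):       ", round(1 / n ** 0.5, 3))
```

The center and spread of the simulated sample means should come out near mu and sigma/sqrt(n), even though the population itself is skewed.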
What are the values of z* for different confidence intervals?
95% confidence interval: z* = 1.96. 99% confidence interval: z* = 2.58. 80% confidence interval: z* = 1.28. We know that this proportion of sample statistics lies within z* standard errors of the center of the sampling distribution.
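The z* values can be recovered from the standard normal distribution rather than memorized. A minimal sketch using Python's standard library; the helper name `z_star` is made up for illustration.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* that leaves (1 - confidence)/2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)  # standard normal: mean 0, SD 1

for level in (0.80, 0.95, 0.99):
    print(level, round(z_star(level), 2))
```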
What happens if we convert X to z-scores in the normal distribution?
If we convert 𝑋 to Z-scores, then we have a standard (or unit) normal distribution
What happens if n<30?
If we have n < 30 and use s as an estimate for sigma, then using the standard normal distribution will underestimate the amount of uncertainty (our confidence intervals will be too narrow)
When should you use t-tests?
If you don't know the distribution in the population or the population standard deviation and n < 30, then you should use the one-sample t-test or calculate a t-interval. If we have a small sample and don't know the population SD, we run into problems with inferences based on the standard normal distribution.
What indicates the probability of a Type I error?
In this notation, alpha is the probability of a Type I error or "False Alarm" (that is, rejecting the null when in fact it is true)
High variance vs low variance
Larger variance or high variance means the individual values are spread out around the average and may mean poor process control. Low variance means the values are closer to the average and that means better process control.
Heights of Male Adults: What's the probability that a random US adult male is between 69 and 74 inches tall?
Lower z-score bound = -0.3. Higher z-score bound = 1.21. Goal is to find the probability within that range: about 50.5% of men are between those heights. Requires us to know that the population distribution can be approximated by the normal model with known mean mu and standard deviation sigma.
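The 50.5% figure can be checked from the two z-scores alone. A short sketch; it uses only the z-scores given on the card, since the notes do not state the population mean and SD.

```python
from statistics import NormalDist

z_low, z_high = -0.30, 1.21  # z-scores for 69 and 74 inches, from the card
std_normal = NormalDist()    # standard normal: mean 0, SD 1

# Area under the curve between the two z-scores.
prob = std_normal.cdf(z_high) - std_normal.cdf(z_low)
print(round(prob, 3))
```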
Inferential Stats
Making a conclusion(s) about a population based only on data from a sample Estimating and hypothesis testing
Correct? We are 95% confident that the sample mean lies in our interval.
No. We already know the sample mean; the interval is a statement about the population mean, which we don't know.
What are the null vs alternative hypotheses for proportions for 2 sided and 1 sided tests?
Null ⇒ Ho : p = p(o); Two-sided z test ⇒ Ha : p ≠ p(o); Right-sided z test ⇒ Ha : p > p(o); Left-sided z test ⇒ Ha : p < p(o)
What is the success-failure condition for proportions?
Population Parameters vs Sample Stats: Population PROPORTION = p (or pi). Sample proportion = p hat (or pi hat). The sampling distribution for p hat is approximately normal as long as the success-failure condition holds: we must see at least 10 successes and 10 failures. As long as n * p >= 10 AND n(1 - p) >= 10, the condition holds. Since p is unknown, use the sample proportion (p hat) to assess the condition.
Measures of Variability
Range, IQR, Variance, SD
You obtain a p-value of 0.04 for a two independent samples t-test. Your null hypothesis is that the difference in means is zero in the population, while your alternative hypothesis is that the difference is not equal to zero in the population. What is your decision at the alpha 0.05 level?
Reject the null that the two means are equal in the population
What's an example of a success-failure condition that DOES NOT hold?
Sample n = 4, 3 approve, so p hat = .75. n * p = 4 x .75 = 3 and n(1 - p) = 4 x .25 = 1; both are LESS THAN 10. The success-failure condition, which determines whether the sampling distribution of p hat is approximately normal, doesn't hold, so we can't use the normal model for the sampling distribution.
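The check is simple enough to write as a function. A minimal sketch; the function name and the `threshold` parameter are made up for illustration, and the second call reuses the n = 50, p hat = 0.70 numbers from the funding card further down.

```python
def success_failure_holds(n, p, threshold=10):
    """True when both expected successes and failures reach the threshold."""
    successes = n * p
    failures = n * (1 - p)
    return successes >= threshold and failures >= threshold

print(success_failure_holds(4, 0.75))   # 3 successes, 1 failure
print(success_failure_holds(50, 0.70))  # 35 successes, 15 failures
```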
What is the standard error for a proportion?
SE = sqrt(p x (1 - p) / n). The sampling distribution is centered on p. We typically don't know p and need to use p hat as an estimate.
What are we using in this course?
Simple random sample without replacement
What is z*
Since we know the sampling distribution of the sample mean is approximately normal and centered on the population mean, we know that 95% of the sample means will lie within 1.96 standard deviations of the population mean.
t distribution model
Small n = "fatter tails" (more likely to get an extreme sample mean). Continuous distribution just like the normal, and "mound shaped". The shape of the distribution is governed by "degrees of freedom", which depend on n but also on the number of parameters in the model. Works for large and small samples.
What are the qualities of a t distribution?
Small n = "fatter tails" (more likely to get an extreme sample mean). Continuous distribution just like the normal, and "mound shaped". The shape of the distribution is governed by "degrees of freedom", which depend on n but also on the number of parameters in the model. Less likely to reject the null hypothesis with a small sample size because there is less certainty.
What does a smaller z* indicate?
Smaller z* = smaller (narrower) interval. Larger sample size (n) creates a smaller interval. Smaller sample size (n) creates a larger interval.
When do we typically use the one-sample z-test or calculate z-intervals?
So far we are assuming that at least one of these conditions holds: (1) We know the distribution in the population is normal. (2) We know the population standard deviation. (3) We don't know sigma, but we have a sample size n ≥ 30, allowing us to use s as a reasonable approximation of sigma. Use the one-sample z-test when we know the population SD, the population is normally distributed, OR the sample size is >= 30.
What is the standard error of the sample mean?
Standard Error = sigma/sqrt(n)
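A quick sketch of the formula, showing how the standard error shrinks as n grows. The value sigma = 10 and the sample sizes are hypothetical, chosen only to make the arithmetic easy.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population SD
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
```

Quadrupling the sample size halves the standard error.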
An educational psychologist split the population of Ivy League students into eight (8) groups based on the school they attended in the past year (Brown University, Columbia University, Cornell University, Dartmouth College, Harvard University, the University of Pennsylvania, Princeton University, or Yale University). She then randomly sampled 10 students within each of these groups. What kind of sampling method did she conduct?
Stratified Sampling
What does a test stat indicate?
Tells us how far our sample mean (x bar) is from the claimed value Mu(o) of the null hypothesis. More extreme test stat = the more inconsistent the data are with the null hypothesis.
Will the distribution of the sample means be normal if drawn repeatedly from any population with mean and variance?
The CLT states that if random samples of size n are drawn repeatedly from any population with a given population mean and population variance, then when the sample size is relatively large (greater than 30 for most distributions) the distribution of these sample means will be approximately normal.
What is the average distribution of the sample means?
The average of all the sample means, which is close to the population mean. The distribution of sample means should be approximately normal.
What is the point estimate for the z-interval for the difference of two proportions?
The difference in sample proportions
Sampling Distribution
The distribution of sample means of the same size n from the same population. The sampling distribution of the sample means is centered on the population mean (mu). Although the mean of any one sample will be off, the average of the sample means will be CLOSE to mu. X bar is our best guess for mu.
Normal Distribution Model as a Probability Distribution
The height of the normal distribution model is given by a probability density function (pdf). The graph of the pdf is sometimes called the curve of the probability distribution. The higher the curve over a range of values, the more likely values in that range are to occur. Total area under the curve equals 1. The area under the curve over a range is the PROBABILITY that values in that range will occur.
What does the height of a normal distribution tell us?
The height of the normal distribution model tells us how likely certain sets of values are to occur
What is the mean of the standard normal distribution model (that is, where is it centered)?
The mean of the standard normal distribution model is always zero
How is the normal distribution expressed?
The normal distribution model is expressed as: X ~ N(μ, σ^2), that is, X ~ N(population mean, SD^2 aka variance)
What is the null hypothesis for two proportions?
The null is that the difference in proportions is zero, while in the two-sided test the alternative is that the difference is not zero. If the data are consistent with the null, the confidence interval for the difference should include ZERO! Hypotheses are always stated using population parameters.
Instead of a sample of 30 people, suppose you have a random sample of 50 adults and you find that the proportion who support an increase in government funding of scientific projects is p hat = 0.70. You want to make an inference about the proportion found in your sample of 50 to the adult population. What is the point estimate for the population proportion?
The point estimate is simply the sample proportion of 0.70. This is our best guess of the population proportion without additional information.
What is the p-value?
The probability of obtaining data as or more extreme than the data we actually obtained, assuming the null hypothesis is true.
Does the t distribution model change as the degrees of freedom change?
The shape of the _________________ changes as the degrees of freedom change, with the tails becoming "fatter" (or "thicker") and the peak flatter with smaller sample sizes.
When does the standard normal distribution model work best?
The standard normal distribution model for the sampling distribution works best when you know the population distribution is normal, you know σ, or n ≥ 30
The null hypothesis that the difference of means in the population is zero implies what?
The two means in the population are equal to each other
What is the variance of the standard normal distribution model (that is, what is the spread)?
The variance of the standard normal distribution model is always 1
Chi Squared Goodness of Fit Test
The 𝜒^2 one-way goodness-of-fit test is used to evaluate some claim about the distribution of a single categorical variable. We use it when we DON'T HAVE the whole POPULATION. It is suitable for data that have been randomly selected from a population.
What is the probability of randomly drawing a z-score between -2 and +2 on the standard normal distribution based on a sample size of 10, approximately? Round your answer to two decimal places.
.95 - it's ALWAYS about .95 because the standard normal distribution doesn't change based on sample size
What are the 5 steps of conducting a hypothesis test?
5 STEPS: 1. State the null (Ho) and alternative (Ha) hypotheses. 2. Select the significance level alpha - how willing are you to make a Type I error? Lower = less willing to make the error. 3. Compute the test statistic z from the sample data. 4. Assume the null is true and make a decision by (a) comparing the test stat z with the critical value or values, or (b) comparing the p-value with the significance level alpha. 5. Report a confidence interval (CI).
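The steps can be walked through in code for a two-sided one-sample z-test. This is a sketch with hypothetical numbers (claimed mean 50, known sigma 10, n = 100, sample mean 52); none of these values come from the notes.

```python
import math
from statistics import NormalDist

# Step 1: Ho: mu = mu_0, Ha: mu != mu_0 (two-sided). All numbers are hypothetical.
mu_0, sigma, n, x_bar = 50.0, 10.0, 100, 52.0

# Step 2: choose the significance level.
alpha = 0.05

# Step 3: compute the test statistic from the sample data.
se = sigma / math.sqrt(n)
z = (x_bar - mu_0) / se

# Step 4: assume the null is true and decide, by critical value or by p-value.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = p_value < alpha  # same decision as abs(z) > z_crit

# Step 5: report a confidence interval for mu.
ci = (x_bar - z_crit * se, x_bar + z_crit * se)

print(round(z, 2), round(p_value, 4), reject, [round(v, 2) for v in ci])
```

With these numbers the null value 50 falls just outside the 95% interval, which matches the decision to reject at alpha = 0.05.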
Normal Distribution
A normal distribution is determined by two parameters the mean and the variance.
Normal distribution vs standard normal distribution
A normal distribution is determined by two parameters the mean and the variance. Often in statistics we refer to an arbitrary normal distribution as we would in the case where we are collecting data from a normal distribution in order to estimate these parameters. Now the standard normal distribution is a specific distribution with mean 0 and variance 1. This is the distribution that is used to construct tables of the normal distribution.
Sample stat vs Population Parameter
A population parameter is a numerical summary of a population, while a sample statistic is a numerical summary of a sample.
What is the problem with the standard normal distribution?
A problem with using the standard normal distribution as a model for the sampling distribution for the sample mean is that our confidence intervals will be too narrow if we have a small sample size (say less than 30).
What does a test stat indicate?
A test statistic gives a measure of how far our data lie from the claimed null value on the sampling distribution
Strata vs Cluster & when they work best
Both are non-overlapping subsets of the population. DIFFERENCES: - All strata are represented in the sample, but only a subset of clusters is represented in the sample. - Stratified sampling works best when elements within strata are homogeneous, while cluster sampling works best when elements within clusters are heterogeneous.
What is the Normal Coverage Rule?
Data are mound-shaped and symmetric. Within 1 SD: about 68%; within 2 SD: about 95%; within 3 SD: about 99.7%.
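The coverage percentages can be computed from the standard normal distribution. A minimal sketch using only the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

# Probability of falling within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} SD: {prob:.4f}")
```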
Simple Random Sampling
Elements are equally likely to be sampled. Each observation is independent of the others. Pros: Simple; requires little knowledge of the population. Cons: Need a list of all elements in the population; may be expensive or difficult to obtain the sampling frame.
The shape of the sampling distribution
FOUNDATION OF INFERENTIAL STATS: 1) Centered around mu (mean of population) 2) With a SD of sigma/sqrt(n) (population SD divided by sqrt(n)) 3) Mound-shaped 4) Symmetric
Probability density function
Gives the height of the normal distribution model
Cluster Sampling
Have population N = 100. Want n = 18. ID c = 9 (there are 9 clusters). Clusters are naturally occurring groups, e.g. neighborhoods, cities, etc. Randomly select clusters and sample EVERYONE in those clusters. Each element can only be a member of ONE cluster. Pros: Saves time and $$; only need to know about part of the overall population. Cons: Typically results in greater variation between successive samples; may require larger samples to say something about the population.
Stratified Sample
Have population N = 100. Want sample n = 18. Divide individuals into s = 6 strata. Strata are groups, e.g. race, city, etc., and each element can only be part of ONE group. Randomly sample people within each stratum. Pros: Ensures we can analyze groups that make up small proportions of the population; successive samples tend to be less different from one another. Cons: Need a list of all elements including their membership in the strata; expensive, might be more administrative work.
Do we always assume the null is true?
Ho: Mu = Mu(o). Ha: Mu ≠ Mu(o). We are guessing a value for the population mean. Assume the null is TRUE. If we reject the null, we accept the alternative. If we fail to reject the null, we do not accept the alternative.
Null vs Alternative Hypotheses
Ho = null hypothesis, the default state of affairs. Ha = alternative hypothesis.
1 sided vs 2 sided test
Ho → the null hypothesis is almost always an equality, Mu = Mu(o), where Mu(o) is some claimed value for the population mean. The alternative hypothesis is one of the following: Two-sided test = Ha: Mu does not equal Mu(o). Right-sided test = Ha: Mu is greater than Mu(o). Left-sided test = Ha: Mu is less than Mu(o).
Example question for one sample z test for a proportion
Is the proportion of people voting for a politician greater than .5?
How do we compare the proportions of 2 groups? What is the success/failure condition?
We can compare the proportions of two groups using the two independent samples z-test. The test can be used when the sampling distribution of the difference in proportions is approximately normal, which occurs when the number of "successes" and "failures" for each group is at least 5. We have 2 population proportions (p1 and p2). GOAL = make an inference about the difference between these two proportions. The point estimate for the difference of the 2 population proportions (p1 - p2) is the difference of the 2 sample proportions (p-hat1 - p-hat2). The success-failure condition must be met, except the threshold is at least FIVE (5) per group.
Multistage Cluster Sampling
We have N = 100, sample n = 18, clusters c = 9. Randomly select a subset of clusters and then randomly select elements within them. Pros: Saves time and $; only need to know part of the overall population since samples are based on particular clusters. Cons: Greater variation; may require larger samples to say something about the population; adds another layer of complexity to the sampling procedure.
When constructing a z-interval, we typically do not know the population standard deviation σ. How do we calculate the standard error then?
We use the sample standard deviation as an estimate for the population standard deviation
Example questions for comparing means?
What is avg income of men vs women? Avg level of happiness between US and Denmark?
Since we typically do not know the population proportion, how do we calculate the standard error of the sample proportion?
When calculating confidence intervals we use the sample proportion as an estimate for the population proportion in the formula for the SE
Do critical values change in a one-sample t-test?
When conducting a one-sample t-test, the critical values and p-values change depending on the sample size (that is, the degrees of freedom).
When computing the z test statistic for a one-sample z-test for a proportion, we must know the standard error. Since we typically do not know the population proportion, how do we calculate the standard error (SE) for a one-sample z-test for a proportion?
When conducting a one-sample z-test for a proportion, we use the claimed proportion from the null hypothesis as an estimate for the population proportion in the formula for the SE
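A sketch of the right-sided test for the "greater than .5" style of question. The sample numbers (p hat = 0.70, n = 50) are hypothetical here; the point is that the SE is built from the null value p_0, not from p hat.

```python
import math
from statistics import NormalDist

p_0 = 0.5     # claimed proportion from the null hypothesis
p_hat = 0.70  # hypothetical sample proportion
n = 50        # hypothetical sample size

# For the test, the SE uses the claimed p_0 in place of the unknown p.
se = math.sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / se

# Right-sided alternative (Ha: p > p_0): area to the right of z.
p_value = 1 - NormalDist().cdf(z)

print(round(z, 3), round(p_value, 4))
```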
Z intervals vs T intervals
Z-intervals = use the standard normal distribution to calculate confidence intervals. t-intervals = use the t-distribution to calculate confidence intervals.
Do t-test stat and z-test stat change?
The standard normal distribution (and so the z critical values) DOES NOT change based on the sample size. The location of the critical values on the t distribution changes based on the sample size. As the sample size increases, the shape of the t distribution looks more "normal" (closer to the standard normal distribution).
What does z test stat indicate?
The z-test stat tells us how many SDs (standard errors) our sample mean is above or below the claimed population mean on the sampling distribution. z near zero = data are consistent with the null hypothesis. Large z (in absolute value) = data are less consistent with the null hypothesis.
Does the standard error of the sample mean increase or decrease as n increases?
decreases!
Do confidence intervals capture the population mean?
Expect 95% of interval estimates to capture the population mean (at the 95% confidence level).
Sampling Frame
List of the elements of the population, e.g. all households in a city
A ____ p-value suggests the observed data and the null hypothesis are inconsistent with each other, while a ____ p-value suggests that they are consistent with each other.
low, high
Measures of Central Tendency
mean, median, mode
Normal Distribution
Mound-shaped, symmetric, centered on the mean Mu, with a SD of sigma and a variance of sigma^2
True or False? The one-sample z-test for a proportion, like the z-interval for a proportion, uses the standard normal distribution as a model for the sampling distribution. This is only valid if there are at least 10 "successes" and 10 "failures," which is an assumption about the population.
True - We do have to assume the success-failure condition holds when conducting a one-sample z-test for a proportion.
Type I vs Type II Errors
Type I Error = alpha error (false positive, "false alarm"): saying "there's a fire" when there is no fire, or concluding the drug works when it doesn't. Rejecting Ho when it's true. Type II Error = beta error (false negative): failing to reject Ho when it's false.
Inferences for the Difference of Two Means - what are we comparing?
We can compare the means of two groups using the two independent samples t-test The null is that the difference in means is zero, while in the two-sided test the alternative is that the difference is not zero
What is the z-score?
The z-score of a value is the number of standard deviations it falls above or below the mean.
Standard Normal Distribution
The standard normal distribution is a specific distribution with mean 0 and variance 1. Conveniently, if X has a normal distribution with mean m and variance s^2, and we define Z = (X − m)/s, then Z has the standard normal distribution. So for any specific normal distribution we can calculate probabilities of the form P[a < X < b] from the tables for Z. The standard normal distribution is the normal distribution expressed in terms of z-scores.
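The standardization Z = (X − m)/s can be checked numerically: P[a < X < b] computed on the original scale should match the probability computed from z-scores. A sketch with hypothetical values for m, s, a, and b.

```python
from statistics import NormalDist

m, s = 100.0, 15.0   # hypothetical mean and SD of X
a, b = 85.0, 130.0   # hypothetical bounds

# Directly on the original scale: X ~ N(m, s^2).
x_dist = NormalDist(m, s)
p_direct = x_dist.cdf(b) - x_dist.cdf(a)

# Via z-scores on the standard normal: Z = (X - m) / s.
z_a, z_b = (a - m) / s, (b - m) / s
p_standard = NormalDist().cdf(z_b) - NormalDist().cdf(z_a)

print(z_a, z_b, round(p_direct, 4), round(p_standard, 4))
```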
What distribution do we use for the difference of two means? What is the point estimate?
The t distribution. The point estimate for the difference of two means is the difference between the sample means.
What is the name of a confidence interval for a sample mean based on the t distribution?
t-interval
What does a z-score indicate?
Where a value lies in the distribution (how many SDs from the mean), which lets us find how likely values that extreme are to occur
How do you calculate a z-interval estimate for p?
SE = math.sqrt((p hat * (1 - p hat)) / n), using p hat as an estimate for p. Interval = p hat +/- z* x SE
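A worked z-interval for a proportion, using the numbers from the funding card earlier in the notes (p hat = 0.70, n = 50). The 95% confidence level is an assumption made for this sketch.

```python
import math
from statistics import NormalDist

p_hat, n = 0.70, 50  # from the funding example card
confidence = 0.95    # assumed confidence level

z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# For the interval, p hat stands in for the unknown p in the SE formula.
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z_star * se
lower, upper = p_hat - margin, p_hat + margin

print(round(se, 4), round(lower, 3), round(upper, 3))
```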
What is the SD of the sampling distribution?
The standard error of the sample mean: Standard Error = sigma/sqrt(n)
How do you make inferences about a one-sample proportion?
Accordingly, we can use one-sample z-tests and z-intervals for making inferences about a one-sample proportion
What is the 95% confidence interval?
An approximate 95% confidence interval is given by the point estimate plus or minus 2 times the SE: Point Estimate +/- 2 x SE, aka Point Estimate +/- 2 x (sigma/sqrt(n))
Based on what you know about the population mean and population standard deviation for age, what is the standard deviation of the sampling distribution based on a sample size of 50? Round your answer to two decimal places.
Answer will always be SD/math.sqrt(n)
Based on what you know about the population mean and population standard deviation for age, what is the mean of the sampling distribution based on a sample size of 50?
Answer will always be the POPULATION MEAN
Does the center of the sampling distribution change as the sample size increases?
As the sample size increases the center of the sampling distribution remains the same.
Does the spread of the sampling distribution change as the sample size increases?
As the sample size increases the spread of the sampling distributions decreases.
Confidence Intervals
CI = x bar +/- z* x SE
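The formula can be wrapped in a small helper. This is a sketch, not a definitive implementation: the function name `z_interval` is made up, and the inputs (sample mean 45, sigma 10, n = 100) are hypothetical, with sigma assumed known.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval x_bar +/- z* x SE, with SE = sigma / sqrt(n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sigma / math.sqrt(n)
    return x_bar - z_star * se, x_bar + z_star * se

# Hypothetical inputs: sample mean 45, assumed known sigma 10, n = 100.
low, high = z_interval(x_bar=45.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))
```

Raising the confidence level increases z* and widens the interval; raising n shrinks the SE and narrows it.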
As the sample size increases, would you expect a sample mean to lie closer or farther from the population mean on the sampling distribution? Why?
Closer, because as the sample size increases the standard error of the sample mean decreases.
Sampling method
procedure for collecting elements of a population
You randomly sample 100 people from a population, ask the age of the people in the sample, and calculate average age in the sample. You obtain a sample average of 45 years. Suppose you repeat this process two more times and obtain sample averages of 50 and 52 years, respectively. What do we call the variation in the sample averages?
sampling variability
