Exam 2
Statistical Inference
Drawing conclusions about a larger group of individuals (the population) on the basis of observing only a smaller sample. The conclusions are backed by a statement of our confidence in them.
Binomial Mean and Variance
If we have a binomial distribution with n trials and probability of success p, then: mean = mu = np; variance = sigma^2 = np(1 - p); standard deviation = sigma = root(np(1 - p)).
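A minimal sketch of these three formulas in Python. The function name `binomial_summary` and the example values n = 100, p = 0.5 are illustrative assumptions, not part of the course material.

```python
import math

def binomial_summary(n, p):
    """Return (mean, variance, standard deviation) of a binomial count."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance, math.sqrt(variance)

# Hypothetical example: 100 trials, success probability 0.5
mean, variance, sd = binomial_summary(100, 0.5)
print(mean, variance, sd)
```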
Population Distribution
Population Distribution of a variable is the distribution of all values among all the individuals in that population.
P(Z > 2.09)
Procedure: Find the probability associated with 2.09 in the table and subtract that probability from 1. P(Z > 2.09) = 1 - P(Z < 2.09) = 1 - 0.9817 = 0.0183
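The table lookup can be checked with Python's standard library, as a rough sketch. `NormalDist().cdf` plays the role of the table here (area to the left of z); the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

left_area = Z.cdf(2.09)      # what the table gives: P(Z < 2.09)
right_area = 1 - left_area   # P(Z > 2.09)
print(round(left_area, 4), round(right_area, 4))
```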
P(Z < 1.25)
The table gives the area under the standard normal curve to the left of z, so read the value directly from the table. Look down the first column to find 1.2, then go across the top row to .05. The entry where that row and column meet is P(Z < 1.25) = 0.8944.
Shape of Binomial Distribution
Symmetric if p = 0.5. Also approximately symmetric when n is large, even if p is close to 0 or 1.
Accurate
The method measures what it was intended to measure; it correctly estimates the population parameter.
Sample Space of Binomial Distribution
The sample space for the binomial setting consists of 2^n possible outcomes. Example: n = 5 trials gives 2^5 = 32 possible outcomes.
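As an illustration, the 32 outcomes for n = 5 can be listed by brute force. Labeling the two categories "S" and "F" is an assumption made for this sketch.

```python
from itertools import product

n = 5
# Each trial is either a success "S" or a failure "F"
outcomes = ["".join(seq) for seq in product("SF", repeat=n)]

print(len(outcomes))   # size of the sample space, 2**n
print(outcomes[:4])    # first few sequences
```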
Sampling Variability
The value of a statistic varies in repeated random sampling. Main idea: to see how trustworthy a procedure is, ask what would happen if we repeated it many times.
Analogy for accuracy and precision
To be a good golfer, we need to get the golf ball in the cup. A good golfer needs to be both accurate (tends to hit the ball near the cup) and precise (even when missing, does not miss by much). Think of the cup as the population parameter and the golf balls as estimates.
Measuring Accuracy and Precision
We will measure an estimate's accuracy by considering bias (which focuses on center). We will measure an estimate's precision with a statistic called the standard error (which focuses on spread).
Hypothesis Testing
Assesses evidence for a claim about a parameter by comparing it with observed data. It is used to judge between two different claims about the parameter.
Important Questions to Ask when Sampling
What percentage of people who were asked to participate actually did so? Did researchers choose people to participate in the survey or did the people themselves choose to participate? If a large percentage of those chosen to participate refused to do so or if people themselves chose to participate, the conclusions of the survey are suspect.
Properties of Normal Distribution
When drawing the normal curve, the mean, μ, and the standard deviation, σ, have specific roles. The mean μ is the center of the curve. The values (μ - σ) and (μ + σ) are the inflection points of the curve.
Continuous or Discrete Random Variable? For each experiment, identify the random variable (RV): a) Take a 30-question multiple-choice test; RV = # of questions answered correctly. b) Observe cars arriving at a tollbooth for 1 hour; RV = # of cars that paid the toll. c) Observe an employee's work (8-hour day); RV = # of non-productive hours. d) Weight of a shipment; RV = # of pounds.
a) X = 0, 1, 2, ..., 30; discrete. b) X = 0, 1, 2, ...; discrete. c) 0 < X < 8; continuous. d) X > 0; continuous.
Can we use a sample proportion to make inferences about a population proportion?
Yes, we can. p = population proportion; p(hat) = sample proportion (varies from sample to sample).
Parameter
a measure that describes a population. This value is fixed, yet (most likely) unknown
Statistic
a measure that describes a sample. This value can vary from sample to sample but is known when we have the sample.
Hypothesis
An assumption made about the population from which the sample was taken. Two types: the null hypothesis, H0 (read H-naught), and the alternative hypothesis, Ha (read H-A).
Independent Events
Two events are independent if knowing that one occurs does not change the probability that the other occurs.
Estimated Standard Error in CI
root(p(hat)(1 - p(hat))/n)
Margin of Error
Shows how precise we believe our estimate is, based on the variability of the estimate. It is the quantity we add to and subtract from our estimate when constructing a confidence interval: z* × root(p(hat)(1 - p(hat))/n)
Confidence Intervals
Supplements an estimate of a parameter with an indication of its variability. Answers the question "what is the value of the parameter?"
significance level
The probability of making the mistake of rejecting the null hypothesis when, in fact, the null hypothesis is true. It is denoted α. Typical values for α are 0.10, 0.05, and 0.01, which set the probability of a Type I error (when the null hypothesis is true) at 10%, 5%, or 1%. For example, with α = 0.05 there is a 5% chance of rejecting a true null hypothesis.
Unbiased estimate:
when the mean of the sampling distribution of the statistic is equal to the true value of the parameter being estimated.
Critical Value in CI
z*
Standardization?
z = (x - mu)/sigma, where mu = mean and sigma = standard deviation.
Precision (CLT)
Precision is reflected in the spread of the sampling distribution. When two sample sizes are compared, the means are nearly the same, but the standard error for the larger sample size is much smaller. Smaller standard error = more precise.
As the confidence level increases...
the margin of error of the confidence interval also increases.
Random Variable
variable whose value is a numerical outcome of a probability experiment
95% Confidence Interval Example
(0.3758, 0.4642) We are 95% confident that this interval captures the unknown population approval proportion.
Requirements of a Discrete Probability Distribution
0 ≤ P(xi) ≤ 1 for each value xi of X. The sum of all probabilities equals 1. P(X = x) is read "the probability that the random variable X equals the value x." Notation: pi = P(xi) = P(X = xi).
Density Curve
A density curve is a mathematical model of a distribution; think of it as a smoothed version of the histogram. The histogram is built from the sample, while the smoothed curve describes the population. The total area under the density curve is, by definition, equal to 1, or 100%. "Area under the curve" = probability.
Sampling Distribution
A graph that represents all possible values the sample proportion can take is defined as a sampling distribution. The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size n are taken from the population. It is a theoretical idea; in reality, we often do not actually build it (though today we will simulate it). The sampling distribution of a statistic is the probability distribution of that statistic.
Standard Normal Distribution
A normal distribution with mean = 0 and standard deviation = 1. Any normal distribution with mean μ and standard deviation σ can be standardized...
Discrete Random Variable
A random variable that can take on a countable number of values. An example is flipping a coin 10 times, where X is the number of heads.
Continuous Random Variable
A random variable whose values are uncountable. An example is the amount of time it takes to complete a task. Our exam is 70 minutes, so you can finish at any time between 0 and 70 minutes.
Confidence intervals provide us with:
A range of plausible values for a population parameter. A confidence level, which expresses our level of confidence that the interval contains the population parameter.
Simple Random Sampling (SRS)
Each individual in the population has the same chance of being chosen for the sample. In addition, every possible sample of size n out of a population of N individuals has an equally likely chance of being selected. Subjects are selected without replacement (subjects cannot be selected twice).
Example of Finding Critical Value
Ex: Find the critical value for a 92% confidence interval. A = 0.92; 1 - A = 0.08; (1 - A)/2 = 0.04. Look inside Table 2 for 0.04 to find the negative critical value (remember the critical value is used as plus/minus). To find the positive critical value, look inside the table for 1 - 0.04 = 0.96.
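The same lookup can be sketched with the inverse of the standard normal CDF, which stands in for reading the table "inside out". The helper name `critical_value` is an assumption for illustration.

```python
from statistics import NormalDist

def critical_value(confidence_level):
    """Positive z* that leaves (1 - A)/2 in each tail of the standard normal."""
    tail_area = (1 - confidence_level) / 2
    # Area to the left of the positive critical value is 1 - tail_area
    return NormalDist().inv_cdf(1 - tail_area)

z_star = critical_value(0.92)
print(round(z_star, 2))
```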
Binomial Probability Function Example
Flip a coin five times. First, find how many ways we can obtain 2 heads in the 5 flips, where a head is a success. Use the first part of the function: 5!/(2!(5 - 2)!) = 10
Conditions for SE to be correct
For this standard error to be correct: the sample is randomly selected from the population of interest, and, if the sampling is without replacement, the population must be at least 10 times larger than the sample size.
Right-sided Hypotheses
H0: p = p0 Ha: p > p0
Left-sided Hypotheses
H0: p = p0 Ha: p < p0
Two-sided Hypotheses
H0: p = p0 (a given value we are testing) Ha: p ≠ p0
Precise
If the method is repeated, the estimates are very consistent.
Multiplication Rule
If two events A and B are independent, then P(A and B) = P(A) x P(B)
What we say if we do not reject H0
If we do not reject the null hypothesis, we conclude that there is not enough statistical evidence to infer that Ha is true. Remember, we never accept H0 as true; we always either reject it or fail to reject it.
Binomial Distribution
One of several specific discrete probability distributions. Handles probability problems with two outcomes (or problems that can be reduced to two outcomes). Result of a binomial experiment.
Error
P(Type I error) = α. This is called the significance level. P(Type II error) = β. For a fixed sample size, α and β are inversely related: any attempt to reduce one will increase the other.
Population VS Samples
Populations: we have access to all of the individuals in a group of interest. Descriptive measures of populations are called parameters. Samples: we can only access a portion of the individuals in a group of interest. Descriptive measures of samples are called statistics.
Precision
Precision is reflected in the spread of distribution. It is measured by using the standard deviation of the column of sample proportions.
Probability Sample
Probability sampling uses chance to select a sample, based on known selection probabilities. Any bias is accommodated using knowledge of the selection probabilities.
P(1.01 < Z < 2.02)
Procedure: Find the area to the left of 2.02 and subtract the area to the left of 1.01. P(Z < 2.02) = 0.9783; P(Z < 1.01) = 0.8438; P(1.01 < Z < 2.02) = 0.9783 - 0.8438 = 0.1345. The result may be a little off because the table values are rounded.
Conditions to Check
1. Random sample. 2. Large sample size: np0 ≥ 10 and n(1 - p0) ≥ 10. 3. Large population: the population is at least 10 times bigger than the sample size if the sample is collected without replacement. 4. Independence*: each observation has no influence on any others. (*Note: the book separates conditions 1 and 4, but we do not have to.)
p(hat) characteristics
The mean of the p(hat)'s (over all possible samples) equals the population proportion, p. This always holds in theory. In practice we observe only one sample, so a single p(hat) need not equal p.
H0 Information
The null hypothesis (H0) is assumed to be true throughout the statistical analysis. Only if the sample observations are in extreme contradiction to H0 do we reject H0 in favor of Ha. If H0 cannot be rejected, we do not conclude that H0 is true but merely that we have no evidence to reject it.
Identify the population and sample. What was the parameter of interest? What is the statistic? In January 2015 the Pew Research Center published a report stating that 37% of Americans believed that genetically modified foods (GMOs) were safe to eat. This was based on a survey of 2002 American adults.
The population is all American adults. The sample was the 2002 American adults who were surveyed. The parameter of interest is the percentage of all American adults who believe that GMOs are safe to eat. The statistic is 37% (the percentage of the sample who felt this way).
What if you were asked to find that z-value where 10% of the data fell above this value?
The probabilities are found inside the table. Remember, they are probabilities to the left of a specific z-value. We notice 0.10 is to the right of the z-value in question. If 0.10 is the area to the right, then 0.90 is the area to the left. Then, find the probability closest to 0.9000 inside the table. I found 0.8997. This probability is associated with the z-value 1.28.
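A quick check of this reverse lookup, sketched with the inverse CDF from Python's standard library in place of the table. The variable names are illustrative.

```python
from statistics import NormalDist

area_right = 0.10
area_left = 1 - area_right            # the table works with left-hand areas
z = NormalDist().inv_cdf(area_left)   # z-value with 90% of the area to its left
print(round(z, 2))
```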
Discrete Probability Distribution
The probability distribution of a discrete random variable X lists the values and their corresponding probabilities.
Probability Distribution
The probability distribution of a random variable X tells us what values X can take and how to assign probabilities to those values. Also called probability model
P-Value
The probability, computed assuming H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed. The smaller the p-value, the stronger the evidence against H0. (The easier it will be to reject H0).
Binomial Distribution Variables
The random variable representing the count X of successes in the binomial setting has the binomial distribution with parameters n and p. The parameter n is the total number of observations. The parameter p is the probability of success on each observation. The count of successes X can be any whole number between 0 and n.
Alternative Hypothesis, Ha
The research hypothesis; the statement about a population parameter we intend to demonstrate is true. Claims that the effect we are looking for does exist.
Central Limit Theorem
The simulations helped us understand what happens all the time. Thus, we do not need to simulate over and over again. The central limit theorem gives information about the shape of the sampling distribution of the sample proportion when certain conditions hold.
Test Statistic
The test statistic has the structure: z = (observed value - null value)/SE = (p(hat) - p0)/root(p0(1 - p0)/n), where p(hat) is the sample proportion and p0 is the proportion assumed to be true in the null hypothesis.
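A minimal sketch of this calculation. The data (230 successes out of 500, with p0 = 0.5) are hypothetical, and the function name `one_proportion_z` is an assumption for illustration. The p-values use the standard normal distribution, which is appropriate only when the conditions for the test hold.

```python
import math
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), using the null value in the SE."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# Hypothetical data: 230 successes in 500 trials, H0: p = 0.5
z = one_proportion_z(230, 500, 0.5)

Z = NormalDist()
p_value_left = Z.cdf(z)                  # Ha: p < p0
p_value_right = 1 - Z.cdf(z)             # Ha: p > p0
p_value_two = 2 * (1 - Z.cdf(abs(z)))    # Ha: p != p0
print(round(z, 2), round(p_value_left, 4), round(p_value_two, 4))
```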
Binomial Experiment Requirements
There is a fixed number of trials, n. Each observation falls into one of just two categories (called success and failure). The probability of a success is the same for each trial and is labeled p. The n trials are all independent.
Second Piece of the function
To obtain the probability of any one specific outcome we use P("outcome") = p^(#S) (1 - p)^(#F), where p = probability of success, 1 - p = probability of failure, #S = number of successes (k), and #F = number of failures (n - k).
Cautions about Sampling
Undercoverage: some individuals or groups in the population are left out of the process of choosing the sample. Nonresponse: individuals chosen for the sample cannot be contacted or refuse to cooperate/respond. Response bias: behavior of the respondent or interviewer may lead to inaccurate answers or measurements. Wording of questions: confusing or leading (biased) questions; words with different meanings. Unfortunately, random sampling cannot handle all types of possible bias.
Procedure for finding Critical Value
Using standard normal table: Once we know the area in the middle (call it A), we can use it to find the critical value. If area A is found in the middle of our standard normal distribution, then the area 1 - A represents the total area in both tails. Thus, the area in one tail would be (1-A)/2
Suppose we have IQ test results that we know are normally distributed with mean 100 and standard deviation 15. (We write N(100, 15)). Find the probability that a randomly selected person scores below 112.
We are looking for P(X < 112). We must standardize: z = (x - mu)/sigma = (112 - 100)/15 = 0.80, so P(Z < 0.80) lets us use the table to answer this question. P(Z < 0.80) = 0.7881. Thus there is a 78.81% chance that a randomly selected individual scores below 112 on the IQ test.
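The standardization step can be checked two ways, as a sketch: by standardizing first and using the standard normal, or by asking the N(100, 15) distribution directly. Both should agree.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

z = (112 - 100) / 15                 # standardize: z = (x - mu) / sigma
p_via_z = NormalDist().cdf(z)        # P(Z < 0.80), the table lookup
p_direct = iq.cdf(112)               # P(X < 112) without standardizing

print(round(z, 2), round(p_via_z, 4), round(p_direct, 4))
```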
Making a Decision using the P-value
We compare the p-value with the significance level, α. This value is decided before conducting the test. If the p-value is less than or equal to α (p ≤ α), then we reject H0. If the p-value is greater than α (p > α), then we fail to reject H0. If the p-value is as small as or smaller than α, we say that the data are statistically significant at level α.
What happens when we construct many confidence intervals?
We will simulate the process of constructing many, many confidence intervals. This will help us clearly understand the interpretation of a confidence interval.
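One way such a simulation might look, as a rough sketch. The true proportion p = 0.42, the sample size 500, the number of repetitions, and the seed are all hypothetical choices, and `simulate_coverage` is an illustrative name. The fraction of intervals that capture p should land near the confidence level.

```python
import math
import random

def simulate_coverage(p, n, reps, z_star=1.96, seed=1):
    """Fraction of simulated confidence intervals that contain the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw one random sample of size n and count the successes
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / reps

coverage = simulate_coverage(p=0.42, n=500, reps=2000)
print(coverage)  # expected to be close to 0.95 for z* = 1.96
```

Each individual interval either contains p or it does not; the 95% describes how often the method succeeds across many repetitions.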
Empirical Rule
Approximately 68% of the data will lie within 1 standard deviation of the mean. Approximately 95% of the data will lie within 2 standard deviations of the mean. Approximately 99.7% of the data will lie within 3 standard deviations of the mean.
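For a normal distribution, these three percentages can be checked directly, as a small sketch using the standard normal CDF.

```python
from statistics import NormalDist

Z = NormalDist()

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
for k, area in within.items():
    print(k, round(area, 4))
```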
Things that can go wrong in choosing a sample for a population
Bias: a survey method is biased if it has a tendency to produce an untrue value. Three types: measurement bias, sampling bias, and use of an estimator that is biased (we will not do this).
Bias
Bias is measured as the distance between the mean value of the estimator (the center of the distribution) and the population parameter.
Categorical Variables
Calculating a mean is impossible, BUT we can count the "successes": p(hat) = # who approve (x) / total # in sample (n)
Example: Aspirin claims that the proportion of headache sufferers who get relief with two pills is 53%. What is the probability that in a random sample of 400 headache sufferers, less than 50% get relief? We are given p = 0.53 and n = 400. We are asked to find P(p(hat) < 0.50).
Check that p(hat) is approximately normal: np = 212 ≥ 10 and n(1 - p) = 188 ≥ 10. Find the mean and SE: mean = 0.53, SE = root(0.53(0.47)/400) ≈ 0.025. Standardize: z = (0.50 - 0.53)/0.025 = -1.20. P(Z < -1.20) = 0.1151
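The steps of this example can be sketched in Python. Because the code keeps the unrounded standard error, its probability differs slightly in the later decimal places from the table answer, which rounds the SE to 0.025 and z to -1.20.

```python
import math
from statistics import NormalDist

p, n = 0.53, 400

# Large-sample check: at least 10 expected successes and 10 expected failures
assert n * p >= 10 and n * (1 - p) >= 10

se = math.sqrt(p * (1 - p) / n)   # standard error of p_hat
z = (0.50 - p) / se               # standardize the cutoff 0.50
prob = NormalDist().cdf(z)        # P(p_hat < 0.50)

print(round(se, 3), round(z, 2), round(prob, 3))
```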
Null Hypothesis, H0
Claims that the effect we are looking for does not exist. It is the "no change" or "no difference" hypothesis. A skeptical statement about a population parameter. The hypothesis test is designed to assess the strength of the evidence against the null hypothesis.
Continued...
Next, we want to calculate the probability of one of these sequences happening. The question is: what is the probability of getting, say, SSFFF in five coin flips? Said another way, what is the probability of getting a head on the first flip and a head on the second and a tail on the third and a tail on the fourth and a tail on the fifth? We know the probability of getting heads on a coin flip is 0.5 (getting a tail is also 0.5). In addition, we know that getting a head on the first flip has no effect on getting a head or a tail on the second or any of the other flips. Thus coin flips are independent, and we can use the Multiplication Rule for independent events (remember a head is a success). P(SSFFF) = P(S) * P(S) * P(F) * P(F) * P(F) = P(S)^2 * P(F)^3 = 0.5^2 * 0.5^3 = 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = 0.03125
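Putting the two pieces of the binomial probability function together for the coin example (n = 5 flips, k = 2 heads), as a short sketch. The variable names are illustrative.

```python
from math import comb

p_success = 0.5
n, k = 5, 2

# Probability of one specific sequence with k successes, e.g. SSFFF
p_one_sequence = p_success**k * (1 - p_success)**(n - k)

# Number of distinct sequences with exactly k successes (binomial coefficient)
ways = comb(n, k)

# Probability of exactly k successes in n trials
p_exactly_k = ways * p_one_sequence
print(p_one_sequence, ways, p_exactly_k)
```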
Sampling Bias
Occurs when a sample is used that is not representative of the population. Example: Internet polls. People who answer these polls tend to be those who have strong feelings about the results and are not necessarily representative of the population.
Measurement Bias
Results from asking questions that do not produce a true answer. Occurs when measurements tend to record values larger (or smaller) than the true value. Example: Asking people, "How much do you earn?" It is likely that people will report a value higher than their actual salary, resulting in an estimate that tends to be too large. Common sources: self-reporting of personal data, confusing wording in survey questions, and non-neutral language in questions.
Standard Error
SE = sigma_p(hat) = root(p(1 - p)/n). In reality we don't know p, so we estimate it: SE = root(p(hat)(1 - p(hat))/n)
Example of Sample: Consider a simple random sample of size n = 2 from a population of N = 4. Population: {A, B, C, D}. How many different samples of 2 can be taken (assuming no replacement)?
Samples: [AB, AC, AD, BC, BD, CD] Each sample is equally likely.
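The list of possible samples can be generated rather than written out by hand, as a small sketch. Under simple random sampling each of these samples has the same chance of being selected.

```python
from itertools import combinations

population = ["A", "B", "C", "D"]

# All samples of size 2 drawn without replacement (order does not matter)
samples = ["".join(pair) for pair in combinations(population, 2)]
chance_each = 1 / len(samples)

print(samples)
print(chance_each)
```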
Sampling Distribution
Sampling Distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
Conditions for CLT for Sample Proportions
Sampling is random and independent. May sample with or without replacement. Large sample: the sample size, n, is large enough that the sample expects at least 10 successes and 10 failures: np ≥ 10 and n(1 - p) ≥ 10. Big population: if sampling is done without replacement, the population must be at least 10 times larger than the sample size.
Confidence Level
Tells us how often the estimation is successful. Measures the success rate of the method, not of any one particular interval.
Continuous Probability Distribution
The continuous random variable, X takes all values in an interval of numbers (often measurements). The continuous probability distribution assigns probabilities as areas under a density curve.
Properties continued
The curve is symmetric about the mean (i.e. area under the curve to the left of the mean is equal to the area under the curve to the right of the mean). The mean = median = mode. So, the highest point of the curve is at x = μ. The curve has inflection points at (μ - σ) and (μ + σ). The total area under the curve is equal to 1. As x gets larger and larger (in either the positive or negative directions), the graph approaches but never reaches the horizontal axis.
Binomial Probability Function
The first part of the function gives us the number of ways of arranging exactly k successes among n trials. This is also called the number of possible combinations. This piece of the function is also known as the binomial coefficient. n!/(k!(n-k)!)
Variable
defined to be a characteristic of an individual
Sampling Design
describes exactly how to choose a sample from the population.
The probability that on entering college, a student will graduate is 0.77. An academic advisor selected a random sample of 12 students and followed that group for 6 years. What is the probability of the following events. Exactly 10 of the 12 graduate? Less than half graduate? 8 or more graduate? Between 7 and 9 inclusive graduate?
n = 12, p = 0.77. Exactly 10: P(X = 10). Less than half: P(X ≤ 5). 8 or more: P(X ≥ 8). Between 7 and 9 inclusive: P(7 ≤ X ≤ 9).
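A sketch of how these four probabilities could be computed from the binomial probability function. The helper name `binom_pmf` is an assumption for illustration; each event is a sum of the pmf over the listed values of X.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial count with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 12, 0.77

exactly_10 = binom_pmf(10, n, p)                                # P(X = 10)
less_than_half = sum(binom_pmf(k, n, p) for k in range(0, 6))   # P(X <= 5)
eight_or_more = sum(binom_pmf(k, n, p) for k in range(8, 13))   # P(X >= 8)
seven_to_nine = sum(binom_pmf(k, n, p) for k in range(7, 10))   # P(7 <= X <= 9)

for label, value in [("X = 10", exactly_10), ("X <= 5", less_than_half),
                     ("X >= 8", eight_or_more), ("7 <= X <= 9", seven_to_nine)]:
    print(label, round(value, 4))
```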
Type II error
occurs if one does not reject H0 when it is false.
Type I error
occurs if one rejects a true H0
Confidence interval equation
p(hat) - z* × root(p(hat)(1 - p(hat))/n) < p < p(hat) + z* × root(p(hat)(1 - p(hat))/n)
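A minimal sketch of the interval calculation. The data (210 successes out of 500) are hypothetical, `proportion_ci` is an illustrative name, and z* = 1.96 assumes a 95% confidence level.

```python
import math

def proportion_ci(x, n, z_star=1.96):
    """Confidence interval for a proportion: p_hat +/- z* * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    margin = z_star * se                      # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical data: 210 of 500 respondents approve
low, high = proportion_ci(210, 500)
print(round(low, 4), round(high, 4))
```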
