AP Statistics Chapter 4-6
Basic Principles of Experimental Design
1. Comparison - Use a design that compares two or more treatments. 2. Random Assignment - Use chance to assign experimental units to treatments. This creates roughly equivalent groups of experimental units at the start of the experiment and balances the effects of other variables among the treatment groups. 3. Control - Keep other variables that might affect the response the same for all groups. (This is not the same as a control group.) 4. Replication - Use enough experimental units in each group so that differences in the effects of the treatments can be distinguished from chance differences between the groups.
Scope of Inference
1. Inferences about populations are possible when individuals are randomly selected. 2. Inferences about cause and effect are possible when individuals are randomly assigned to groups.
Criteria for establishing causation when we can't do an experiment.
1. The association is strong. 2. The association is consistent. 3. Larger values of the explanatory variable are associated with stronger responses. 4. The alleged cause precedes the effect in time. 5. The alleged cause is plausible.
Rules of Probability
1. The probability of any event must be between 0 and 1, inclusive. 0 ≤ P(E) ≤ 1. 2. The sum of the probabilities of all outcomes must equal 1. 3. If E and F are disjoint events, then P(E or F) = P(E) + P(F). If E and F are not disjoint events, then P(E or F) = P(E) + P(F) - P(E and F) 4. If E represents any event and Ec represents the complement of E, then P(Ec) = 1 - P(E) 5. If E and F are independent events, then P(E and F) = P(E)∗P(F)
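A minimal sketch of rules 3 and 4, assuming one roll of a fair six-sided die as the chance process. The events E and F and the helper `prob` are made up for illustration.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an assumed example).
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of S) when outcomes are equally likely."""
    return Fraction(len(event & S), len(S))

E = {2, 4, 6}  # the roll is even
F = {4, 5, 6}  # the roll is at least 4

# General addition rule: P(E or F) = P(E) + P(F) - P(E and F)
p_union = prob(E) + prob(F) - prob(E & F)

# Complement rule: P(E^c) = 1 - P(E)
p_not_E = 1 - prob(E)

print(p_union, p_not_E)
```

E and F are not disjoint (both contain 4 and 6), so subtracting P(E and F) keeps those outcomes from being counted twice.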
simple random sample problem
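1. Label each individual with a distinct number from 1 to N (the population size) and put the labels in a hat, or use a random number generator or a table of random digits. 2. Pick labels at random without replacement until the sample reaches the desired size. When using random digits, skip repeats and skip any numbers outside the range of labels. 3. The individuals whose labels were picked form the sample.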
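The hat method can be sketched with a random number generator. This is only an illustration: the function name `choose_srs`, the list of 30 students, and the seed are assumptions, not part of the notes.

```python
import random

def choose_srs(population, n, seed=None):
    """Choose an SRS of size n by labeling individuals 1..N and drawing labels."""
    rng = random.Random(seed)
    labels = list(range(1, len(population) + 1))  # label each individual 1..N
    chosen = rng.sample(labels, n)                # draw without replacement, so no repeats
    return [population[label - 1] for label in chosen]

# Hypothetical population of 30 students.
students = [f"student_{i}" for i in range(1, 31)]
sample = choose_srs(students, 5, seed=1)
print(sample)
```

`random.sample` draws without replacement, which plays the role of not putting names back in the hat.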
Matched pair design
A common form of blocking for comparing just two treatments. In some matched pairs designs, each subject receives both treatments in a random order. In others, the subjects are matched in pairs as closely as possible, and each subject in a pair is randomly assigned to receive one of the treatments. Examples: twin studies; pretest vs. posttest comparisons.
Block
A group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. When blocks are formed wisely, it is easier to find convincing evidence that one treatment is more effective than another. Blocks play the same role in experiments that strata play in sampling.
Table of random digits (table d)
A long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties: • Each entry in the table is equally likely to be any of the 10 digits 0 through 9. • The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part.
For events A and B related to the same chance process
If A and B are independent and each has a nonzero probability, then they cannot be mutually exclusive.
law of large numbers
If we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value, which is the probability of that outcome.
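A quick simulation sketch of this idea, assuming a fair coin. The function name and seed are arbitrary choices for illustration.

```python
import random

def proportion_heads(num_flips, seed=0):
    """Proportion of heads in num_flips simulated flips of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The proportion bounces around for small n and settles near 0.5 for large n.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```

The law says nothing about short runs: only the long-run proportion settles down.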
Spread of a Linear Transformation
σY = |b|σX
Population (chapter 4)
In a statistical study, the population is the entire group of individuals about which we want information.
Simple random sample (SRS)
The basic random sampling method. An SRS of size n is chosen in such a way that every group of n individuals in the population has an equal chance to be selected as the sample. We often choose an SRS by labeling the members of the population and using random digits to select the sample. Common ways to choose an SRS include drawing names out of a hat, using a technology random number generator, or using a table of random digits. You should be able to describe in detail how to choose an SRS using each of those methods. For example, the people whose names are randomly drawn from the hat form the simple random sample.
Bias
The design of a statistical study shows bias if it would consistently underestimate or consistently overestimate the value you want to know.
Sampling frame
The list from which a sample is actually chosen.
Wording of questions
The most important influence on the answers given to a survey. Confusing or leading questions can introduce strong bias, and changes in wording can greatly change a survey's outcome. Even the order in which questions are asked matters.
Nonsampling error
The most serious errors in most careful surveys are nonsampling errors. These have nothing to do with choosing a sample—they are present even in a census. Some common examples of nonsampling errors are nonresponse, response bias, and errors due to question wording.
Binomial Coefficient
The number of ways of arranging k successes among n observations
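As an illustration, the count can be checked by listing the arrangements and comparing with the formula n! / (k!(n - k)!). The values n = 5 and k = 2 are hypothetical.

```python
from itertools import combinations
from math import comb, factorial

n, k = 5, 2  # hypothetical values: 2 successes among 5 observations

# List every way to choose which k of the n observations are the successes.
arrangements = list(combinations(range(1, n + 1), k))

# The same count from the formula n! / (k! (n - k)!).
by_formula = factorial(n) // (factorial(k) * factorial(n - k))

print(len(arrangements), by_formula, comb(n, k))
```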
Sample (chapter 4)
The part of the population from which we actually collect information. We use information from a sample to draw conclusions about the entire population. subset of individuals in the population from which we actually collect data
Experimental units
The smallest collection of individuals to which treatments are applied.
Random sampling
The use of chance to select a sample; the central principle of statistical sampling. Letting chance do the sampling avoids the bias that comes from people choosing who is in the sample.
Stratified random sample
To select a stratified random sample, first classify the population into groups of similar individuals, called strata. Then choose a separate SRS from each stratum to form the full sample. When the strata are chosen well, stratified random samples give more precise estimates than simple random samples of the same size.
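A rough sketch of the two steps, assuming the strata are four high school grade levels with 100 students each. The names, sizes, and seed are invented for illustration.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Take a separate SRS of size per_stratum from each stratum and combine them."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))  # SRS within this stratum
    return sample

# Hypothetical strata: 100 students in each grade level.
grades = ("freshman", "sophomore", "junior", "senior")
strata = {g: [f"{g}_{i}" for i in range(1, 101)] for g in grades}

chosen = stratified_sample(strata, per_stratum=5, seed=2)
print(len(chosen))
```

Every stratum is guaranteed to appear in the sample, which an ordinary SRS of 20 students would not guarantee.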
symbol for intersection
∩ (means "and")
symbol for union
∪ (means "or")
Geometric Random Variable
The random variable Y = the number of trials required to obtain the first success in a geometric setting.
probability model
a description of some chance process that consists of two parts: a sample space S and a probability for each outcome
probability (chapter 5)
a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions
simulation
An imitation of chance behavior based on a model that accurately reflects the situation. Follows a four-step process: State -- Ask a question of interest about some chance process. Plan -- Describe how to use a chance device to imitate one repetition of the process. Tell what you will record at the end of each repetition. Do -- Perform many repetitions of the simulation. Conclude -- Use the results of your simulation to answer the question of interest.
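The four steps can be sketched in code. The question used here (how likely is at least one 6 in four rolls of a fair die?) is a made-up example, and the number of repetitions and the seed are arbitrary.

```python
import random

# State: what is the probability of at least one 6 in four rolls of a fair die?

# Plan: one repetition = roll the die four times; record whether a 6 appears.
def one_repetition(rng):
    rolls = [rng.randint(1, 6) for _ in range(4)]
    return 6 in rolls

# Do: perform many repetitions.
def estimate(num_reps=20_000, seed=0):
    rng = random.Random(seed)
    hits = sum(one_repetition(rng) for _ in range(num_reps))
    return hits / num_reps

# Conclude: the proportion of repetitions with a 6 estimates the probability.
estimated = estimate()
exact = 1 - (5 / 6) ** 4  # exact answer from the complement rule, for comparison
print(estimated, exact)
```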
event
any collection of outcomes from some chance process; a subset of the sample space; usually designated by capital letters (ex. A, B, C, etc.). When all outcomes are equally likely, P(A) = (number of outcomes corresponding to event A)/(total number of outcomes in sample space)
basic probability rules
For any event A, 0 ≤ P(A) ≤ 1. If S is the sample space in a probability model, P(S) = 1. In the case of equally likely outcomes, P(A) = (number of outcomes corresponding to event A)/(total number of outcomes in sample space). Complement rule: P(A^C) = 1 − P(A). Addition rule for mutually exclusive events: If A and B are mutually exclusive, P(A or B) = P(A) + P(B).
Strata
Groups of individuals in a population that are similar in some way that might affect their responses. Ideally, individuals are similar within each stratum and different between strata. Ex. high school grade levels (freshmen, sophomores, juniors, seniors).
Margin of error (chapter 4)
A numerical estimate of how far the sample result is likely to be from the truth about the population due to sampling variability.
Convenience sample
A sample selected by taking the members of the population that are easiest to reach; particularly prone to large bias. A poor method of sampling.
systematic random sample
A sample in which the items or people are selected at a fixed interval of time or of items (for example, every 10th person), usually after a random starting point.
Treatment
A specific condition applied to the individuals in an experiment. If an experiment has several explanatory variables, a treatment is a combination of specific values of these variables.
Level
A specific value of an explanatory variable (factor) in an experiment. For example, if we were studying the effects of advertising, an explanatory variable might be the length of a commercial. Commercials of 30, 45, and 60 seconds would make 3 levels of that one explanatory variable.
Census
A study that attempts to collect data from every individual in the population. Often costs too much and takes too much time to be practical.
Sample survey
A study that uses an organized plan to choose a sample that represents some specific population. We base conclusions about the population on data from the sample. You must 1) say exactly what population you want to describe and 2) say exactly what you want to measure - give exact definitions of the variables.
Response bias
A systematic pattern of incorrect responses. Ex. people lie about their age or income, or misremember a number of hours. The gender, race, ethnicity, or behavior of the interviewer can also affect people's responses.
Single-blind
An experiment in which either the subjects or those who interact with them and measure the response variable, but not both, know which treatment a subject received.
Double-blind
An experiment in which neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received.
Control group
An experimental group whose primary purpose is to provide a baseline for comparing the effects of the other treatments. Depending on the purpose of the experiment, a control group may be given a placebo or an active treatment. Related idea: the placebo effect.
Replication
An important experimental design principle. Use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups.
Random assignment
An important experimental design principle. Use some chance process to assign experimental units to treatments. This helps create roughly equivalent groups of experimental units at the start of the experiment.
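One possible way to carry out random assignment with software, assuming 20 numbered experimental units and two treatments. The function name, labels, and seed are illustrative only.

```python
import random

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units by chance, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # chance decides the order
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experiment: 20 units split between a treatment and a control group.
groups = randomly_assign(range(1, 21), ["treatment", "control"], seed=3)
print(groups)
```

Because the shuffle, not the researcher, decides who goes where, other variables tend to balance out between the groups.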
Placebo
An inactive (fake) treatment.
Statistically significant
An observed effect so large that it would rarely occur by chance. Statistical significance alone does not imply causation; a cause-and-effect conclusion also requires that the treatments were randomly assigned.
Experiment
Deliberately imposes some treatment on individuals to measure their responses. Sometimes, the explanatory variables in an experiment are called factors. Many experiments study the joint effects of several factors. In such an experiment, each treatment is formed by combining a specific value (often called a level) of each of the factors. Only a well-designed experiment can give evidence of cause and effect. Ex. a taste test.
Placebo effect
Describes the fact that some subjects respond favorably to any treatment, even an inactive one (placebo).
Subjects
Experimental units that are human beings.
difference between nonresponse and voluntary response
Students often misuse the term "voluntary response" to explain why certain individuals don't respond in a sample survey. Their idea is that participation in the survey is optional (voluntary), so anyone can refuse to take part. What these students are describing is nonresponse. Nonresponse can occur only after a sample has been selected. In a voluntary response sample, every individual has opted to take part, so there won't be any nonresponse.
Observational study
Observes individuals and measures variables of interest but does not attempt to influence the responses. Ex. a survey.
Nonresponse
Occurs when a selected individual cannot be contacted or refuses to cooperate; an example of a nonsampling error.
Undercoverage
Occurs when some members of the population are left out of the sampling frame; a type of sampling error. Ex. opinion polls conducted by calling landlines miss households that have only cell phones as well as those without any phone
general multiplication rule (probability)
P(A and B) = P(A ∩ B) = P(A) * P(B|A), where P(B|A) is the conditional probability that event B occurs given that A has already occurred
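A small worked example of the rule, assuming two cards are drawn without replacement from a standard 52-card deck (A = first card is an ace, B = second card is an ace). The setup is illustrative.

```python
from fractions import Fraction

# A = first card is an ace: 4 aces among 52 cards.
p_first_ace = Fraction(4, 52)

# B given A = second card is an ace after one ace is gone: 3 aces among 51 cards.
p_second_ace_given_first = Fraction(3, 51)

# General multiplication rule: P(A and B) = P(A) * P(B|A)
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)  # 1/221
```

The draws are not independent, so P(B|A) = 3/51 differs from the 4/52 chance that the second card is an ace when nothing is known about the first.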
general addition rule (probability)
P(A or B) = P(A) + P(B) - P(A and B) if A and B are any 2 events resulting from some chance process
addition rule of mutually exclusive events
P(A or B) = P(A) + P(B), if A and B are mutually exclusive
multiplication rule for independent events (probability)
If A and B are independent events, then the probability that A and B both occur is P(A ∩ B) = P(A) * P(B)
Complement rule
P(A^C) = 1 - P(A), where A^C is the event "not A"
Geometric Probability
P(Y=k)=(1-p)^(k-1)p
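A sketch of the formula in code, assuming the chance process is rolling a fair die until the first 6 appears (so p = 1/6). The function name is made up for illustration.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(Y = k): k - 1 failures in a row, then a success on trial k."""
    return (1 - p) ** (k - 1) * p

p = Fraction(1, 6)  # assumed success probability: rolling a 6

# Probability that the first 6 comes on exactly the third roll.
p_first_six_on_third_roll = geometric_pmf(3, p)

# Probability that the first 6 comes within the first three rolls.
p_within_three = sum(geometric_pmf(k, p) for k in (1, 2, 3))

print(p_first_six_on_third_roll, p_within_three)
```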
Voluntary response samples
People decide whether to join a sample based on an open invitation; particularly prone to large bias. Call-in polls and many Internet polls rely on voluntary response samples. People who choose to participate in such surveys are usually not representative of some larger population of interest. Voluntary response samples attract people who feel strongly about an issue and who often share the same opinion, so the results tend to underestimate or overestimate the truth. A poor method of sampling.
P(A^C)
Probability of NOT A within the sample space
Randomized block design
Start by forming blocks consisting of individuals that are similar in some way that is important to the response. The random assignment of experimental units to treatments is then carried out separately within each block. In summary: control what you can, block on what you can't control, and randomize to create comparable groups. Blocking reduces variation in the responses.
Cluster sample
To take a cluster sample, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample. Often used for practical reasons: saving time and money. Cluster sampling works best when the clusters look just like the population but on a smaller scale. Cluster samples don't offer the statistical advantage of better information about the population that stratified random samples do, because clusters are often chosen for ease, so they may have as much variability as the population itself. Some people take an SRS from each chosen cluster rather than including all members of the cluster.
Inference about the population
Using information from a sample to draw conclusions about the larger population. Requires that the individuals taking part in a study be randomly selected from the population of interest, because random sampling tends to produce a sample that represents the whole population. We can make inferences about the population only when individuals were randomly selected.
Inference about cause and effect
Using the results of an experiment to conclude that the treatments caused the difference in responses. Requires a well-designed experiment in which the treatments are randomly assigned to the experimental units. The conclusion rests on statistical significance: the observed difference is too large to be explained by the chance variation in the random assignment. Well-designed experiments randomly assign individuals to treatment groups. However, most experiments don't select experimental units at random from the larger population, which limits such experiments to inference about cause and effect. Observational studies don't randomly assign individuals to groups, which rules out inference about cause and effect. An observational study that uses random sampling can make an inference about the population. We can make inferences about cause and effect only when individuals were randomly assigned.
Completely randomized design
When the treatments are assigned to all the experimental units completely by chance.
Lack of realism
When the treatments, the subjects, or the environment of an experiment are not realistic. Lack of realism can limit researchers' ability to apply the conclusions of an experiment to the settings of greatest interest. Ex. a treatment is tested on rats and assumed to have the same effect on humans.
Confounding
When two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other. To explain why a variable is confounding, describe how it is associated with the explanatory variable and how it affects the response variable.
Adding/Subtracting Constants to a Random Variable
adds/subtracts a to measures of center and location (mean, median, quartiles, percentiles); doesn't change the shape or the measures of spread (range, IQR, standard deviation)
Binomial Setting
consists of n independent trials of the same chance process, each resulting in success or failure, with the same probability p of success on each trial
Geometric Setting
consists of repeated independent trials of the same chance process in which the probability p of success is the same on each trial, and the goal is to count the number of trials it takes to get one success. Because we stop at the first success, there is no fixed number of trials n.
law of averages
Do not mistake this for the law of large numbers. It is the idea that possible outcomes must balance out in the short run, e.g. that getting heads on a coin flip six times in a row must be followed by getting tails six times. This is a MYTH.
Well designed experiment
Establishes internal validity, which is one of the most important validities to interrogate when you encounter causal claims. It tells us that changes in the explanatory variable cause changes in the response variable. More precisely, it tells us that this happened for specific individuals in the specific environment of this specific experiment. Well-designed experiments randomly assign individuals to treatment groups. However, most experiments don't select experimental units at random from the larger population, which limits such experiments to inference about cause and effect. Observational studies don't randomly assign individuals to groups, which rules out inference about cause and effect. An observational study that uses random sampling can make an inference about the population.
cluster
group of individuals that are located near each other; ideally different within each cluster and similar between clusters
Discrete Random Variable
has a fixed set of possible values with gaps between them; each probability has to be between 0 and 1, and the sum of the probabilities has to equal 1
Normal Approximation (large numbers count)
If X is a count of successes having the binomial distribution with parameters n and p, then when n is large, X is approximately Normally distributed with mean np and standard deviation √(np(1-p)). Use the Normal approximation only when np ≥ 10 AND n(1-p) ≥ 10.
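A rough check of how close the approximation can be, assuming n = 100 trials with p = 0.5 and the probability P(X ≤ 55). These numbers and the helper names are illustrative, and no continuity correction is applied.

```python
from math import comb, sqrt
from statistics import NormalDist

def large_counts_ok(n, p):
    """Large Counts condition: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial count X."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, k = 100, 0.5, 55  # assumed example values

exact = binomial_cdf(k, n, p)

# Normal model with mean np and standard deviation sqrt(np(1 - p)).
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(k)

print(large_counts_ok(n, p), exact, approx)
```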
Linear Transformation
Involves adding or subtracting a constant, multiplying or dividing by a constant, or both. Y = a + bX is a linear transformation of the random variable X. The probability distribution of Y has the same shape as the probability distribution of X if b > 0. μY = a + bμX and σY = |b|σX (b could be a negative number). Linear transformations have similar effects on other measures of center or location (median, quartiles, percentiles) and spread (range, IQR). Whether we're dealing with data or random variables, the effects of a linear transformation are the same. These results apply to both discrete and continuous random variables.
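The rules μY = a + bμX and σY = |b|σX can be checked numerically. This sketch assumes a small made-up discrete distribution for X and the transformation Y = 10 - 2X.

```python
from math import sqrt

def mean(values, probs):
    """Mean of a discrete random variable: sum of value * probability."""
    return sum(x * p for x, p in zip(values, probs))

def sd(values, probs):
    """Standard deviation: square root of the probability-weighted squared deviations."""
    mu = mean(values, probs)
    return sqrt(sum(p * (x - mu) ** 2 for x, p in zip(values, probs)))

# Hypothetical distribution of X and transformation Y = a + bX.
x_values = [1, 2, 3]
probs = [0.2, 0.5, 0.3]
a, b = 10, -2

y_values = [a + b * x for x in x_values]  # same probabilities, transformed values

mean_y_direct = mean(y_values, probs)
mean_y_rule = a + b * mean(x_values, probs)
sd_y_direct = sd(y_values, probs)
sd_y_rule = abs(b) * sd(x_values, probs)  # |b|, so the SD stays positive

print(mean_y_direct, mean_y_rule, sd_y_direct, sd_y_rule)
```

The negative b flips the distribution, which is why the spread rule uses |b| rather than b.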
Binomial Distribution
The probability distribution of a binomial random variable X. It is a binomial distribution with parameters n and p, where n is the number of trials of the chance process and p is the probability of a success on any one trial. The possible values of X are the whole numbers from 0 to n. To describe a binomial distribution, state n and p.
Geometric Distribution
The probability distribution of a geometric random variable Y, with parameter p, the probability of a success on any trial. The possible values of Y are 1, 2, 3, ....
Independent Random Variables
knowing the value of one variable tells you nothing about the other
Multiplying/Dividing Constants to a Random Variable
multiplies/divides measures of center and location (mean, median, quartiles, percentiles) by b; multiplies/divides measures of spread (range, IQR, standard deviation) by |b|; doesn't change the shape of the distribution
Binomial Probability
P(X = k) = (n choose k) p^k (1 - p)^(n - k), where n = # of trials, p = probability of success on each trial, and k = # of successes (or use binompdf)
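A minimal sketch of the calculation P(X = k) = (n choose k) p^k (1 - p)^(n - k), similar in spirit to a calculator's binompdf. The values n = 5 and p = 0.25 are assumed for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): number of arrangements times the probability of any one arrangement."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.25  # hypothetical setting: 5 trials, 25% chance of success on each

p_exactly_two = binomial_pmf(2, n, p)

# The probabilities for k = 0, 1, ..., n should add to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(p_exactly_two, total)
```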
Shape of a Linear Transformation
same as the shape of the probability distribution of X if b > 0
intersection
shows A and B
Continuous Random Variable
Takes all values in some interval of numbers, so it has infinitely many possible values. Use a density curve; the probability of an event is the area under the curve (normcdf for a Normal distribution). All continuous probability models assign probability 0 to every individual outcome.
Random Variable
takes numerical values determined by the outcome of a chance process
Probability Distribution
tells us what the possible values of X are and how probabilities are assigned to those values discrete or continuous
Variance of a Random Variable
The "average" squared deviation of the values of the variable from their mean. Its square root, the standard deviation, measures the typical distance from the mean.
Mean (Expected Value) of a Random Variable
The balance point of the probability distribution's density curve or histogram; the average of the possible values of X, with each value weighted by its probability.
10% Condition
When taking an SRS of size n from a large population containing proportion p of successes, the binomial distribution with n trials and success probability p gives a good approximation to the count of successes, as long as the sample size n is no more than 10% of the population size N. Sampling without replacement means the trials are not quite independent, so check this condition whenever a sample is drawn without replacement: n ≤ 0.10N.
Binomial Random Variable
the count X of successes in a binomial setting
Mean of the Difference of Two Random Variables
the difference of their means
conditional probability
the probability that one event happens given that another event is already known to have happened; denoted by P(B|A)
sample space S
the set of all possible outcomes
Standard Deviation of random variable
The square root of the variance; measures the typical distance of the values in the distribution from the mean. For a discrete random variable, σX = √(Σ(xi - μX)^2 pi).
Mean of the Sum of Two Random Variables
The sum of their means: μ(X+Y) = μX + μY. The random variables do not have to be independent in order to add means; independence is needed only to add variances.
The Variance of the Sum of Two Independent Random Variables
the sum of their variances
The Variance of the Difference of Two Independent Random Variables
the sum of their variances, not the difference (take the square root to find the standard deviation)
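Both variance rules can be verified by exact enumeration. This sketch assumes X and Y are the results of two independent rolls of a fair die, which is an example chosen for illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def variance(values):
    """Variance of a random variable whose listed values are all equally likely."""
    values = list(values)
    mu = Fraction(sum(values), len(values))
    return sum((v - mu) ** 2 for v in values) / len(values)

var_one_die = variance(faces)

# All 36 equally likely (x, y) pairs for two independent dice.
pairs = list(product(faces, faces))

var_sum = variance(x + y for x, y in pairs)
var_diff = variance(x - y for x, y in pairs)

print(var_one_die, var_sum, var_diff)
```

Subtracting Y adds its variability rather than removing it, which is why both results equal twice the variance of a single die.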
mutually exclusive (disjoint)
Two events that have no outcomes in common and so can never occur together; P(A and B) = 0. Example: on a single coin flip, the events "heads" and "tails" are mutually exclusive, because the result can be one or the other but never both. Mutually exclusive events 1) have no outcomes in common, 2) cannot be independent (when each has a nonzero probability), 3) cannot occur at the same time, and 4) have an intersection that is the "empty set".
Large Counts Condition
using normal approximation when np≥10 and n(1-p)≥10
Mean of a Geometric Random Variable
μY = 1/p
Center of a Linear Transformation
μY = a + bμX
independent events (probability)
Two events are independent when the occurrence of one event does not change the probability that the other event will happen; that is, P(A|B) = P(A) and P(B|A) = P(B). Two mutually exclusive events with nonzero probabilities can never be independent, because if one event happens, the other event is guaranteed not to happen (ex. male and pregnant). Independent events 1) cannot be disjoint and 2) are events for which the outcome of one does not influence the outcome of the other.