AP Stats Review BVD
the probability of an event is anywhere between...and...
0...1
What is the area under a density curve?
1
Binomial probability model
A Binomial model is appropriate for a random variable that counts the number of successes in a fixed number of Bernoulli trials. ~Binom(n,p)
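A minimal sketch of the Binom(n,p) probability calculation, assuming the usual formula P(X = k) = C(n,k) p^k (1-p)^(n-k). The function name and the example numbers are illustrative, not part of the card.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) when X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: exactly 3 successes in 5 trials with p = 0.5.
prob_three = binomial_pmf(3, 5, 0.5)

# The probabilities over k = 0..n should add up to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```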
Random Digit Table
A chance device that is used to select experimental units or conduct simulations.
Population Parameter
A characteristic or measure of a population.
Normal Distribution
A continuous probability distribution that appears in many situations, both natural and man-made. It has a bell-shape and the area under the normal density curve is always equal to 1.
Probability Distribution
For a discrete random variable X, the probability distribution lists all n possible outcomes (xi) together with their associated probabilities P(xi).
Tree diagram
A display of conditional events of probabilities that is helpful in thinking through conditioning. Use tree diagram for Bayes' Rule problems. Add up all of the instances where the event occurs and this is the denominator of your probability. The numerator will be the specific situation that is being asked about.
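The tree-diagram recipe for a Bayes' Rule problem can be sketched as below. The screening-test probabilities are hypothetical and chosen only for illustration: multiply along each branch, add every branch where the event occurs for the denominator, and use the specific branch asked about as the numerator.

```python
# Hypothetical screening-test numbers, chosen only for illustration.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Multiply along each branch that ends in a positive test.
branch_disease_pos = p_disease * p_pos_given_disease
branch_healthy_pos = (1 - p_disease) * p_pos_given_healthy

# Denominator: every branch where a positive test occurs.
p_positive = branch_disease_pos + branch_healthy_pos

# Numerator: the specific branch being asked about.
p_disease_given_pos = branch_disease_pos / p_positive
```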
Frequency Table
A display organizing categorical or numerical data and how often each occurs.
Normal probability plot
A display to help assess whether a distribution of data is approximately normal; if it is nearly straight, the data satisfy the nearly normal condition
Symmetric
A distribution where the two halves on either side of the center are roughly equal or mirror images
uniform distribution
A distribution whose frequencies are constant across the possible values. Its plot is rectangular.
One-Way Table
A frequency table of one variable.
Two-Way Table
A frequency table that displays two categorical variables.
Power Model
A function of the form y = axᵇ.
Geometric probability model
A geometric model is appropriate for a random variable that counts the number of Bernoulli trials UNTIL the first success.
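A short sketch of the geometric calculation, assuming X counts trials up to and including the first success, so P(X = k) = (1-p)^(k-1) p. The function name and the value p = 0.2 are illustrative assumptions.

```python
def geometric_pmf(k: int, p: float) -> float:
    """Return P(X = k): the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1 - p) ** (k - 1) * p

# Hypothetical example: first success on the 3rd trial when p = 0.2.
first_success_on_third = geometric_pmf(3, 0.2)

# Under this convention the expected number of trials is 1/p.
expected_trials = 1 / 0.2
```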
Box-And-Whisker Plot/Boxplot
A graphical display of the five-number summary of a set of data, which also shows outliers with an *. Make comparison statements if you have parallel box-and-whisker graphs. A has lower median than B. C is more spread out than D. 50% of the cars from A are cheaper than the lowest priced car at B. Do not just write about one at a time. Must compare two or more in the same sentence.
Bar Chart
A graphical display used with categorical data, where frequencies for each category are shown in vertical bars. Usually, bars do not touch. A broken y-axis can cause bar charts to be misleading.
Mode
A hump or local high point in the shape of the distribution of a variable
Sampling frame
A list of individuals from whom the sample is drawn
median
A measure of center that is the value that divides an ordered set of values into two equal halves. To find it, you list all the values in order and select the middle one, or, if the number of values is even, the average of the two middle ones. If there are n values, the median is at position (n+1)/2. On a plot of a distribution, the median is the value that divides the area between the distribution curve and the x-axis in half.
mean
A measure of center, often called the average, computed by adding all the values of x and dividing by the number of values, n. On a plot, the place where you would put a pencil point below the horizontal axis in order to balance the distribution.
Simulation
A method of modeling chance behavior that accurately mimics the situation being considered.
Exponential Model
A model of the form y = abˣ.
Standard Normal model
A normal model, N(μ,σ) with mean μ=0 and standard deviation σ=1
Spread
A numerical summary of how tightly the values are clustered around the center measured by the standard deviation, interquartile range and range.
Population parameter
A numerically valued attribute of a model for a population
Categorical Variable
A variable recorded as labels, names, or other non-numerical outcomes. There will be no units. Remember that something like Social Security Number is an identifier, and is therefore categorical even though it is numbers.
Lurking Variable
A variable that has an effect on the outcome of a study but was not part of the investigation.
Coefficient of Determination (r²)
Percent of variation in "y" that can be explained by variation in "x". OR Percent of variation in "y" that can be explained by the model.
Relative Frequency
Percentage or proportion of the whole number of data.
What happens to power when we increase alpha level (type 1 error probability)?
Power increases; the chance of a Type 2 error decreases.
P(B|A)
Probability of B occurring given A has occurred
Logarithmic Transformation
Procedure that changes a variable by taking the logarithm of each of its values.
Bivariate Data
Quantitative data. Each point is written as (x,y).
Randomize
Randomize subjects to treatments to even out effects that we cannot control
Test Statistic
The number of standard deviations (standard errors) that a sample statistic lies from a hypothesized population parameter.
Lurking variable
A variable that is not explicitly part of a model but affects the way the variables in the model appear to be related is called a lurking variable. Because we can never be certain that observational data are not hiding a lurking variable that influences both x and y, it is never safe to conclude that a linear model demonstrates a causal relationship, no matter how strong the linear association.
Factor
A variable whose levels are controlled by the experimenter. Experiments attempt to discover the effects that differences in factor levels may have on the responses of the experimental units.
Response
A variable whose values are compared across different treatments. In a randomized experiment, large response differences can be attributed to the effect of differences in treatment levels.
Quantitative
A variable whose values are counts or measurements.
Graphical Display
A visual representation of a distribution.
Significance Level
ALPHA. The probability of a Type I error. A benchmark against which the P-value is compared to determine whether the null hypothesis will be rejected. -denoted by α -ex: .01, .05, .10
Observed Values
Actual outcomes or data from a study or an experiment.
Shifting
Adding a constant to each data value, which adds the constant to the mean, median and quartiles but doesn't change the standard deviation or IQR
Stemplot, or stem-and-leaf plot
Also called a stem-and-leaf plot. Data are separated into a stem and leaf by place value and organized in the form of a histogram. Always provide a KEY. Shows the same SOCS as a histogram.
Extrapolation
Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x past the x-values in the data set. Such extrapolation may pretend to see into the future, but the predictions should not be trusted.
Point Estimate
An approximate value that has been calculated for the unknown parameter.
Random
An event is this if we know what outcomes could happen, but not which particular values will happen
Experiment
An experiment manipulates factor levels to create treatments, randomly assigns subjects to these treatment levels, and then compares the responses of the subject groups across treatment levels
Outlier
An extreme value in a data set. Quantified by being less than Q1 - 1.5*IQR or more than Q3 + 1.5*IQR.
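A minimal sketch of the 1.5*IQR rule. It assumes Python's default quartile convention in statistics.quantiles; calculators and textbooks may compute Q1 and Q3 slightly differently, so fence values can vary a little. The data set is made up.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the 1.5*IQR fences."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with one value far from the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
outliers = iqr_outliers(data)
```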
Influential Point
An extreme value whose removal would drastically change the slope of the least-squares regression model.
Case
An individual about whom we have data
Outcome
An individual result of a component of a simulation is its outcome
Standardized Score
The number of standard deviations an observation lies from the mean, z = (observation - mean) / (standard deviation).
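The z formula on this card can be sketched directly. The test score, mean, and standard deviation below are hypothetical.

```python
def z_score(observation: float, mean: float, sd: float) -> float:
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

# Hypothetical example: a score of 85 when the mean is 70 and the SD is 10.
z = z_score(85, 70, 10)
```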
Prospective study
An observational study in which subjects are followed to observe future outcomes. Because no treatments are deliberately applied, a prospective study is not an experiment. Nevertheless, prospective studies typically focus on estimating differences among groups that might appear as the groups are followed during the course of the study. CANNOT PROVE CAUSATION.
Prospective Study
An observational study in which subjects are followed to observe future outcomes. It is NOT an experiment because no treatments are deliberately applied.
Outlier
Any data point that stands away from the others can be called an outlier. In regression, outliers can be extraordinary in two ways, by having a large residual or by having high leverage.
Blinding
Keeping any individual associated with an experiment unaware of how subjects have been allocated to treatment groups
Probability Rule (Definition of Probability)
Any probability of any given event is a number between zero and one.
Response bias
Anything in a survey design that influences responses, including wording of the question or even how the interviewer is dressed etc.
Law of Large Numbers
As the number of trials increases, the experimental probability approaches the theoretical probability - the more trials you do, the closer you will get to the predicted probability. THERE IS NO LAW OF AVERAGES!!!
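A small simulation sketch of this idea, assuming a fair coin (theoretical probability 0.5). The seed and the trial counts are arbitrary choices so the run is repeatable.

```python
import random

def proportion_heads(num_flips: int, seed: int = 0) -> float:
    """Simulate fair coin flips and return the observed proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The observed proportion tends to settle near 0.5 as the number of flips grows.
proportions = {n: proportion_heads(n) for n in (10, 1_000, 100_000)}
```

Short runs can still drift far from 0.5; only the long-run proportion is expected to settle.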
Replication
The practice of reducing chance variation by assigning each treatment to many experimental units.
Geometric Distribution
The probability distribution of a geometric random variable X. All possible outcomes of X before the first success is seen and their associated probabilities.
Sampling Distribution
The probability distribution of a sample statistic when a sample is drawn from a population.
Beta (β)
The probability of a Type II error. See power.
What happens to the probability of a type 1 Error when we increase the power?
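The probability of a type 1 error increases if the extra power comes from raising alpha (raising power with a larger sample size does not change alpha)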
Power
The probability of correctly rejecting the null hypothesis when it is in fact false. Equal to 1 - β. See beta and Type II error.
General Addition Rule for Unions of Two Events
The probability of either the first event (A) or the second event (B) occurring is equal to the probability of A plus the probability of B minus the probability of both A and B occurring. Also, the probability of the union of A and B is equal to the probability of A plus the probability of B minus the probability of the intersection of A and B.
P-Value
The probability of observing a test statistic as extreme as, or more extreme than, the statistic obtained from a sample, under the assumption that the null hypothesis is true. -if the p-value is as small or smaller than α, we say the data is statistically significant at level α -the smaller the p-value, the stronger the evidence AGAINST Ho
P-value
The probability of observing a value for a test statistic at least as far from the hypothesized value as the statistic value actually observed IF the null hypothesis is true. A small P-value indicates either that the observation is improbable or that the probability calculation was based on incorrect assumptions.
Conditional Probability
The probability of one event (B) under the condition that another event (A) has occurred. This is calculated as the probability that A and B occur divided by the probability that A occurs.
Rule of Multiplication
The probability that Events A and B both occur is equal to the probability that Event A occurs times the probability that Event B occurs, given that A has occurred. P(A ∩ B) = P(A) P(B|A)
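A quick sketch of P(A ∩ B) = P(A) P(B|A), using the assumed example of drawing two aces in a row from a standard 52-card deck without replacement.

```python
from fractions import Fraction

# A = first card is an ace, B = second card is an ace (no replacement).
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # one ace and one card are gone

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_both_aces = p_first_ace * p_second_ace_given_first
```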
Observational Study
Attempts to determine relationships between variables, but the researcher imposes no conditions as in an experiment.
Sampling Variability
Natural variability due to the sampling process. Each possible random sample from a population will generate a different sample statistic.
Random Variables
Numerical outcome of a random phenomenon.
Subset
One unstated condition for finding a linear model is that the data be homogeneous. If, instead, the data consist of two or more groups that have been thrown together, it is usually best to fit different linear models to each group than to try to fit a single model to all of the data. Displays of the residuals can often help you find subsets in the data.
Univariate
One-variable data.
general addition rule for two events
P(A or B) = P(A) + P(B) - P(A and B)
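A worked sketch of the addition rule, assuming a standard 52-card deck with A = "heart" and B = "face card". The three face cards that are hearts would be double-counted without the subtraction.

```python
from fractions import Fraction

p_heart = Fraction(13, 52)
p_face = Fraction(12, 52)
p_heart_and_face = Fraction(3, 52)  # jack, queen, king of hearts

# P(A or B) = P(A) + P(B) - P(A and B)
p_heart_or_face = p_heart + p_face - p_heart_and_face
```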
regardless of whether events are independent or not, the equation for finding P(A ∩ B) is?
P(A) x P(B|A)
the formula used to determine whether or not two events are independent is P(B|A) = ?
P(B)
What happens to the probability of a Type 1 Error when we decrease the significance level?
Type 1 error probability decreases; Type 2 error probability increases
How do you find the sample size when given confidence and support level?
Use the ME formula and solve for n
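A sketch of solving for n, assuming the one-proportion margin of error ME = z* sqrt(p(1-p)/n); the card does not say which ME formula is meant, so this choice is an assumption. Using p = 0.5 as a conservative guess and always rounding up are common conventions.

```python
from math import ceil

def sample_size_for_proportion(z_star: float, margin_of_error: float,
                               p_guess: float = 0.5) -> int:
    """Solve ME = z* * sqrt(p(1-p)/n) for n and round up."""
    n = (z_star / margin_of_error) ** 2 * p_guess * (1 - p_guess)
    return ceil(n)

# Hypothetical target: 95% confidence (z* = 1.96) and a 3% margin of error.
n_needed = sample_size_for_proportion(1.96, 0.03)
```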
Probability Table
Use when the question has no conditionals in the given information and does not ask for any conditionals as answers OR when it has no conditionals in the given information but asks for conditionals in the answers -Columns and Rows should total up to 100 percent if dealing with percentages -when solving for a "given" statement, look at the individual totals for denominator -when solving for an "and" statement, look at the grand totals -doesn't add up to 100 if dealing with numerical totals rather than percents
Probability Venn Diagram
Use when the question has no conditionals in the question and asks for no conditionals in the answer - if the middle value of the diagram (.28) is not given and neither are the other inner values (.39 and .17), add the two total values together and then subtract the outer total (.17) from that value or vice versa -don't put individual probabilities inside the circles unless they say "only"
Scatterplots
Used to visualize bivariate data. The explanatory variable is shown on the horizontal axis and the response variable is shown on the vertical axis.
Paired samples
Used when the two samples are not independent from each other -take both samples and combine them to create a list of differences that becomes our new data -use this list and perform a one sample CI with a t distribution (1 sample t interval) -when we interpret non independent samples we say "the mean of differences" -all the variables represent the mean difference for populations
Histogram
Uses adjacent bars to show the distribution of values in a quantitative variable
Response variable
Values of this variable record the results of each trial with respect to what we were interested in
Independence
Variables where the conditional distribution of one is the same for each category of the other
What does the margin of error in a confidence interval estimating the z-scores cover?
Variation from random sampling
10% condition
Verify that the sample is smaller than 10% of the population.
Confidence intervals for two means
We are either looking for an interval that contains the true difference in the means or the true mean difference -when we interpret CIs for independent samples we say "the true difference between two means"
Pooling
We pool with 2-prop z-test because the null hypothesis is that the proportions are equal. add up successes and divide by sum of n1 and n2
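A sketch of the pooled proportion and the resulting 2-prop z statistic, assuming the usual standard error sqrt(p(1-p)(1/n1 + 1/n2)) with the pooled p. The counts are hypothetical.

```python
from math import sqrt

def two_prop_z(x1: int, n1: int, x2: int, n2: int):
    """Return (pooled proportion, z statistic) for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Pool: add up the successes and divide by the sum of n1 and n2.
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return p_pooled, (p1 - p2) / se

# Hypothetical counts: 45 of 100 successes versus 55 of 120.
p_pooled, z = two_prop_z(45, 100, 55, 120)
```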
Statistically significant
When an observed difference is too large for us to believe that it is likely to have occurred naturally. When the p-value falls below the alpha level, we say that the test is "statistically significant" at that alpha level.
Simpson's paradox
When averages are taken across different groups, they can appear to contradict the overall averages
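A small numeric sketch of the paradox with illustrative counts: treatment A has the higher success rate within each group, yet treatment B has the higher rate once the groups are combined, because the treatments were applied to the groups in very different proportions.

```python
# Illustrative (successes, total) counts for two treatments in two groups.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes: int, total: int) -> float:
    return successes / total

# Success rate inside each group.
group_rates = {
    t: {g: rate(*counts[t][g]) for g in counts[t]} for t in counts
}

# Overall success rate after combining the groups.
overall_rates = {
    t: rate(sum(s for s, _ in counts[t].values()),
            sum(n for _, n in counts[t].values()))
    for t in counts
}
```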
Intersection P(A ∩ B) = ?
When both event A AND B occur -uses "and" statements -use multiplication to find answers. P(A ∩ B) = P(A)xP(B|A) if the events are not independent. If A and B are independent, then P(A ∩ B) = P(A)xP(B).
Comparing distributions
When doing this, consider their SOCS. Use comparison words and discuss similarities/differences in the same sentences.
Voluntary bias
When volunteering to respond, it is typical that only those who are passionate (strongly against or for) respond to the surveys/attend meetings/express opinions. This leads to a lack of representation for the moderate/middle group.
Standard error
When we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error. The S.D. formulas are on the formula sheet you get on the exam -- then you sub in statistics from your data and now it's called standard error.
outlier
a data point that does not fit the overall pattern in a scatter plot, or any value less than Q1-1.5*IQR or above Q3+1.5*IQR
normal probability plot
a display to help assess whether a distribution of data is approximately normal; if it is nearly straight, the data satisfy the nearly normal condition
mode
a hump or local high point in the shape of the distribution of a variable; the apparent locations of these can change as the scale of a histogram is changed
sampling frame
a list of individuals from whom the sample is drawn
population parameter
a numerically valued attribute of a model for a population
outlier in scatterplot
a point that does not fit the overall pattern seen in the scatterplot
leverage point
a point whose x-value is far from the mean of x. The further the point is from the mean, the more it determines the slope and intercept of the regression line.
sample
a representative subset of a population, examined in hope of learning about the population
representative
a sample is this if the statistics computed from it accurately reflect the corresponding population parameters
strength
a scatterplot shows an association that is this if there is little scatter around the underlying relationship
observational study
a study based on data in which no manipulation of factors has been employed
sample survey
a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population
wording bias
a type of response bias where the question is posed to achieve a desired result
quantitative variable
a variable in which the numbers act as numerical values; always has units
lurking variable
a variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two
response
a variable whose values are compared across different treatments
shifting
adding a constant to each data value adds the same constant to the mean, the median, and the quartiles, but does not change the standard deviation or IQR
effect on mean or median, of adding the same number B to each observation
adds B
effect on quartiles and percentiles, of adding the same number C to each observation
adds C
if the probability of an event is one, then the event...occurs
always
if the random variables are independent, the variance of their sum or difference is always the...?
always the SUM of the variances. Then take the square root of the sum of the variances to find the s.d. of the sum or difference
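A sketch of this rule for independent random variables: variances add for both X + Y and X - Y, and the standard deviation is the square root of that sum. The standard deviations 3 and 4 are hypothetical.

```python
from math import sqrt

def sd_of_sum_or_difference(sd_x: float, sd_y: float) -> float:
    """SD of X + Y or X - Y when X and Y are independent."""
    variance = sd_x**2 + sd_y**2  # variances always ADD
    return sqrt(variance)

# Hypothetical example: SD(X) = 3 and SD(Y) = 4.
sd_combined = sd_of_sum_or_difference(3, 4)
```

Adding the standard deviations directly (3 + 4 = 7) would overstate the spread.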
linear model
an equation of the form y-hat = b0 + b1x
model
an equation or formula that simplifies and represents reality
random
an event is this if we know what outcomes could happen, but not which particular values will happen
outcome
an individual result of a component of a simulation
prospective study
an observational study in which subjects are followed to observe future outcomes
retrospective study
an observational study in which subjects are selected and then their previous conditions or behaviors are determined
Retrospective Study
an observational study where subjects are selected and their previous conditions are determined
random behavior
an occurrence for which we know what outcomes could happen, but not which particular values will happen
matching
any attempt to force a sample to resemble specified attributes of the population
outlier
any data point that stands away from the others; can be extraordinary by having a large residual or by having high leverage
nonresponse bias
bias introduced to a sample when a large fraction of those sampled fails to respond
direction
can be positive or negative. Use to describe a scatterplot that is linear in form. Positive means that, in general, as one variable increases, so does the other. Negative means that increases in one variable generally correspond to decreases in the other.
the set of outcomes that are not in the event
complement
the probability of an event occurring is one minus the probability that it doesn't occur
complement rule
P(B|A) = P(A ∩ B)/P(A) is the equation for finding the...?
conditional probability
r
correlation coefficient, on a scale of -1 to 1 where 0 means no correlation and -1 or 1 means a perfect correlation (perfect correlation means all points on the line of regression)
leverage
data points whose x-values are far from the mean of x are said to exert ____ on a linear model; with high enough ____, residuals can appear to be deceptively small
conditional distributions
describes the values of one variable among individuals who have a specific value of another variable (#/row or column total)
the mean of the difference of two random variables is the...?
difference of the means E(X-Y)=E(X)-E(Y)
how to examine a scatter plot
direction (positive, negative) form (linear, curved) strength outliers
mutually exclusive events that have no outcomes in common
disjoint
the probability that one or the other of two...events occurs is the sum of the probabilities of the two events
disjoint
contingency table
displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables; categorizes the individuals on all variables at once, to reveal possible patterns in one variable that may be contingent on the category of the other
multimodal
distributions with more than two modes
effect on shape, range, IQR and standard deviation, of adding the same number A to each observation
does not change
quantitative data is displayed by
dot plots, stemplots, histograms
P(AUB) = P(A) + P(B) does not work for events that are not disjoint because you would ____?___ the events
double-count
regression to the mean
each predicted y-hat tends to be fewer standard deviations from its mean than its corresponding x was from its mean
complement
event that is "not A" is referred to as complement of A -denoted A^c
control group
experimental units assigned to a baseline treatment level (placebo, no fertilizer, etc.) so that comparisons of results can be made
how to calculate outliers
falls more than 1.5 x IQR above quartile 3 or below quartile one
Success/ failure condition
for a Normal model to be a good approximation of a Binomial model, we must expect at least 10 successes and 10 failures. That is, np is greater than or equal to 10 and nq is greater than or equal to 10.
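The condition can be checked with a few lines. The function name and the example values of n and p are illustrative.

```python
def success_failure_ok(n: int, p: float) -> bool:
    """True when np >= 10 and nq >= 10, where q = 1 - p."""
    q = 1 - p
    return n * p >= 10 and n * q >= 10

# Hypothetical checks: 50 trials at p = 0.1 expects only 5 successes.
small_sample_ok = success_failure_ok(50, 0.1)
large_sample_ok = success_failure_ok(200, 0.1)
```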
General Multiplication Rule
for any two events, P(A and B) = P(A) x P(B | A)
predicted value
found by substituting the x-value in the regression equation; they're the values on the fitted line
Systematic Random Sample
generate a random starting point, then select every kth individual from the list or queue.
slope
gives a value in "y-units per x-unit"; changes of one unit in x are associated with changes of b1 units in predicted values of y
distribution
gives the possible values of the variable and the relative frequency of each value
Dotplot
graphs a dot for each value against a single axis. Can show shape, center, spread, & outliers just like a histogram
mutually exclusive (disjoint)
have no outcomes in common can never occur together
r-squared
how much of the variability of the data is accounted for by the model or regression line. The larger the decimal the more successful the model is in relating y to x
Multiplication Rule
if events A and B are independent, then P(A and B) = P(A)*P(B)
influential point
if omitting the point greatly changes the regression model.
the outcome of one trial does not influence or change the outcome of another
independent
SRS
individuals chosen in a way that every possible set of n individuals has an equal chance to be the sample selected
the probability of A and B occurring
joint probability
single-blind
just one party is blinded and does not know which treatment has been assigned (either the participant or the person judging the results). If both are blinded, it's called double blind
frequency table
lists the categories in a categorical variable and gives the count or percentage of observations for each category
use for distributions that are reasonably symmetric w/ no outliers
mean & standard deviation
use for skewed distributions or distribution with strong outliers
median & IQR
simulation
models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model
In a right skewed distribution the mean is ____ (less/more) than the median
more
effect on mean or median, of multiplying by a constant B
multiplies by B
effect on quartiles and percentiles, of multiplying by a constant C
multiplies by C
effect on range, IQR and standard deviation, of multiplying by a constant B
multiplies by absolute value of B
Tree Diagram
multiply along the branches to find final probabilities ('and' statements)
rescaling
multiplying each data value by a constant multiplies the measures of position by that constant and the measures of spread by the absolute value of that constant
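A sketch comparing shifting and rescaling on a small made-up data set. Adding 10 moves the mean but leaves the standard deviation alone; multiplying by -3 multiplies the mean by -3 and the standard deviation by 3.

```python
from statistics import mean, stdev

data = [2, 4, 6, 8]                 # hypothetical values
shifted = [x + 10 for x in data]    # shifting: add a constant
rescaled = [-3 * x for x in data]   # rescaling: multiply by a constant

summary = {
    "original": (mean(data), stdev(data)),
    "shifted": (mean(shifted), stdev(shifted)),
    "rescaled": (mean(rescaled), stdev(rescaled)),
}
```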
double blind
neither the subjects nor the people who have contact with them know which treatment a subject has received
Disjoint events are ____________ independent.
never
if the probability of an event is zero, then the event...occurs
never
heterogeneous
not similar in makeup
n stands for?
number of trials
parameter
numerically valued attribute of a model
prospective
observational study in which subjects are followed to observe future outcomes
retrospective
observational study in which subjects are selected and then their previous conditions or behaviors are determined
independent events
occurrence of one event has no effect on chance that other event will happen
categorical variable data is displayed by
pie charts, bar graphs, stacked bar graphs
What happens as you increase sample size?
Power increases; the spread (standard deviation) of the sampling distribution decreases
the proportion of times that an event is likely to occur in a long series of trials
probability
quartile
the lower of this is the value with a quarter of the data below it; the upper of this has a quarter of the data above it
simulation component
the most basic situation in a simulation in which something happens at random
sampling variability
the natural tendency of randomly drawn samples to differ, one from another
sample size
the number of individuals in a sample
r2
the square of the correlation between y and x; gives the fraction of the variability of y accounted for by the least squares linear regression on x; an overall measure of how successful the regression is in linearly relating y to x
placebo effect
the tendency of many human subjects to show a response even when administered a fake treatment
random numbers
these are hard to generate, but several websites offer an unlimited supply of equally likely random values
normal percentile
this corresponding to a z-score gives the percentage of values in a standard normal distribution found at that z-score or below
random assignment
to be valid, an experiment must assign experimental units to treatment groups at random
shape
to describe this aspect of a distribution, look for single vs. multiple modes, and symmetry vs. skewness
matched pairs design
treatment is randomly assigned to similar individuals in a pair (two same aged females, one given placebo and other given drug)
placebo
treatment known to have no effect, administered so that all groups experience the same conditions
confounding
two variables are associated in such a way that their effects on response variable can't be distinguished from each other. i.e. Does X cause Y or does Z cause Y?
nonresponse
type of bias that is problematic because the intended sample is incomplete
completely randomized
type of experiment in which all experimental units have an equal chance of receiving any treatment
matched pairs
type of study in which subjects who are similar in ways not under study may be grouped together and then compared with each other on the variables of interest
normal model
useful family of models for unimodal, symmetric distributions
histogram
uses adjacent bars to show the distribution of values in a quantitative variable; each bar represents the frequency (or relative frequency) of values falling in an interval of values
response variable
values of this record the results of each trial with respect to what we were interested in
independence
variables are said to be this if the conditional distribution of one variable is the same for each category of the other
re-express data
we do this by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values in the data set
influential point
when omitting a point from the data results in a very different regression model, the point is an ____
When is a linear model appropriate?
when the graph of the residuals shows an absence of pattern. Needs to be completely random. Any curvature, distinct clusters, or fanning is not a good sign.
regression line
the linear equation y-hat = b0 + b1x that satisfies the least squares criterion
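A minimal sketch of the least squares calculation, assuming the standard formulas b1 = Sxy/Sxx and b0 = y-bar - b1 * x-bar, with r² computed as Sxy²/(Sxx Syy). The five data points are made up.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1, r_squared) for the line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b1 = sxy / sxx              # slope, in y-units per x-unit
    b0 = y_bar - b1 * x_bar     # the line passes through (x-bar, y-bar)
    r_squared = sxy**2 / (sxx * syy)
    return b0, b1, r_squared

# Hypothetical data.
b0, b1, r_squared = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```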
How do you find the new sample size when reducing the margin of error by a fixed amount?
Multiply the original sample size by (reduction factor)^2. For example, to cut the margin of error to 1/3 of its original size, the factor is 3, so the sample size must be 3^2 = 9 times as large.
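A sketch of why the factor is squared, assuming a one-proportion margin of error ME = z* sqrt(p(1-p)/n). Because n sits under a square root, a sample 3^2 = 9 times as large gives a margin of error 1/3 the size. The starting n of 100 is hypothetical.

```python
from math import sqrt

def margin_of_error(z_star: float, p_hat: float, n: int) -> float:
    """One-proportion margin of error (assumed formula)."""
    return z_star * sqrt(p_hat * (1 - p_hat) / n)

original_me = margin_of_error(1.96, 0.5, 100)
reduced_me = margin_of_error(1.96, 0.5, 100 * 3**2)

# How many times smaller the new margin of error is.
ratio = original_me / reduced_me
```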
Type I Error
(α is its probability) Stating there is a difference when there actually isn't one -aka Ho is true, but you reject it anyway
Type II Error
(β is its probability) Stating there is no difference when there actually is one -aka Ho is false, but you fail to reject it anyway
Null Hypothesis
- Ho -the statement being tested -the test assesses the strength of the evidence against the null hypothesis -is usually a statement of no effect or no difference -assume it to be true for calculations
nonresponse bias
- Occurs when some individuals who are A PART of the survey do not respond - Those who choose not to respond may differ from those who do
response bias
- When something in the survey design influences the response, such as interviewer behaviour or question wording - If someone feels uncomfortable answering a question, they may be reluctant to tell the truth and may change their answer to avoid judgement
correlation r
- measures direction and strength of linear relationship between two quantitative variables - always between 1 and -1 -doesn't change when units change - does not matter what we call x variable and what we call y variable
Alternative Hypothesis
-Ha -Depending on the p-value we can either say that there is sufficient evidence to support Ha or there is NOT sufficient evidence to support Ha -always refers to population parameters, not sample statistics -is one sided when we are testing that the true proportion is larger or smaller than the claim (one tailed test) -is two sided if we are testing that the true proportion is different than the claim (two tailed test)
The following will increase power...
-increasing the sample size (which decreases the variability) -increasing the effect size -increasing alpha -anything that increases the power will automatically decrease the Type II error
normal distribution
A useful probability distribution that has a symmetric bell or mound shape and tails extending infinitely far in both directions.
How do you calculate power?
1-Beta
First 3 rules of working with probability
1. Make a picture! 2. Make a picture!! 3. Make a picture!!!
Bernoulli trials if...
1. There are two disjoint possible outcomes "success" and "failure" 2. The probability of success is constant 3. The trials are independent
stratified random sample
1. classify population into groups/strata of similar individuals 2. choose separate SRS in each stratum 3. combine SRSs to form full sample (taking an SRS of seniors, SRS of juniors, SRS of freshmen, SRS of sophomores)
principles of experimental design
1. use a control 2. random assignment 3. replication (use enough experimental units)
Third Quartile
25% of data in the set is above this value. 75% of the data in this set is below this value. It is equivalent to the 75th percentile.
___%-___%-___% Rule
68%-95%-99.7% Rule. In a normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.
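The three percentages can be checked numerically. This is a minimal sketch using the standard Normal model and Python's error function; the helper name normal_within is made up for this example.

```python
import math

def normal_within(k):
    """P(-k <= Z <= k) for a standard Normal Z, computed with the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {normal_within(k):.4f}")
```

The results round to about 68%, 95%, and 99.7%.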
Quantitative variable
A variable in which the numbers act as numerical values (always have units)
Placebo Effect
A phenomenon where subjects show a response to a treatment merely because the treatment is imposed regardless of its actual effect.
Cluster Sample
A population is split into heterogeneous subgroups, such as a city being broken up by neighborhoods, and then a census is performed in randomly selected clusters. Do not confuse with Stratified random sampling.
undercoverage bias
A portion of the population has been excluded from the sample or is not represented in proportion to its share of the population. Can arise from: absence during sampling because of location/day/time/other
Parameter
A quantity (such as the mean or variance) that characterizes a statistical population and that can be estimated by calculations from sample data
Units
A quantity or amount adopted as a standard of measurement, such as dollars, hours or grams
Geometric Random Variable
A random variable X that counts the number of trials up to and including the first success, where (a) each trial has two possible outcomes, (b) the probability of a success is constant for each trial, and (c) each trial is independent of the other trials.
margin of Error
A range of values to the left and right of a point estimate.
Representative
A sample is said to be this if the statistics computed from it accurately reflect the corresponding population parameters
Multistage Sample
A sample resulting from multiple applications of cluster, stratified, and/or simple random sampling.
Simple Random Sample (SRS)
A sample where n individuals are selected from a population in a way that every possible combination of n individuals is equally likely.
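One way to draw an SRS in software is sketched below. It relies on Python's random.sample, which selects without replacement so that every set of n individuals is equally likely. The roster of 30 students and the seed are hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw an SRS of size n (without replacement) from a list of individuals."""
    rng = random.Random(seed)  # a seed is used only to make the example repeatable
    return rng.sample(population, n)

students = [f"student_{i}" for i in range(1, 31)]  # hypothetical roster
sample = simple_random_sample(students, n=5, seed=1)
print(sample)
```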
Bias
A sampling method is biased if it tends to produce samples that do not represent the population. You cannot "fix" this by making a sample larger.
Simple random sample (SRS)
A simple random sample of sample size n is one in which each set of n elements in the population has an equal chance of selection
Simulation
A simulation models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model.
skewed left
A skewed distribution with a tail that stretches left, toward the smaller values.
skewed right
A skewed distribution with a tail that stretches right, toward the larger values.
Observational study
A study based on data in which no manipulation of factors has been employed (no treatments).
Sample survey
A study that asks questions of a sample drawn from some population in the hope of learning something about the entire population
Sample Survey
A study that collects information from a sample of a population in order to determine one or more characteristics of the population.
Census
A study that observes, or attempts to observe, every individual in a population. Can be expensive and time-consuming, depending on the population.
Sample
A subset of a population, examined in hope of learning about the population
Probability Distribution and its Rules
A table made up of the sample space and the individual probabilities of each event in the sample space. Rules: 1) The probability of any event is a number between 0 and 1 2) The sum of the probabilities of all possible outcomes must equal 1 3) The probability that an event does NOT occur is 1 minus the probability that the event does occur (called the complement)
Placebo
A treatment known to have no effect, administered so that all groups experience the same conditions
Skewed
A unimodal, asymmetric distribution that tends to slant: most of the data are clustered on one side of the distribution, and it "tails" off on the other side.
Normal model
A useful family of models for unimodal, symmetric distributions
Response Bias
Because of the manner in which an interview is conducted, because of the phrasing of questions, or because of the attitude of the respondent, inaccurate data are collected.
Non response bias
Bias introduced to a sample when a large fraction of those who are randomly selected to participate fail to respond
Don't confuse Geometric and Binomial models
Both involve Bernoulli trials, but the issues are different. If you are repeating trials until your first success, that's a geometric probability. You don't know in advance how many trials you'll need (so with geometric, there is no "n"). If you are counting the number of successes in a specified number of trials, n, that's a Binomial probability.
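The contrast can be seen by computing one probability of each kind. This is a minimal sketch; the function names and the value p = 0.2 are chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in a fixed number n of Bernoulli trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k); there is no n in this model."""
    return (1 - p)**(k - 1) * p

p = 0.2
print(binomial_pmf(2, 5, p))  # exactly 2 successes in 5 trials
print(geometric_pmf(3, p))    # first success on the 3rd trial
```

The binomial function needs n as an input, while the geometric function does not.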
Range
Calculated as the maximum value minus the minimum value in a data set.
Transformation
Changing the values of a data set using a mathematical operation.
Treatments
Combinations of different levels of the factors in an experiment.
What are the principles of experimental design?
Control, randomize, replicate, and block (if applicable)
Principles of experimental design
Control, randomize, replicate, block
Leverage
Data points whose x-values are far from the mean of x are said to exert leverage on a linear model. High-leverage points pull the line close to them, and so they can have a large effect on the slope and intercept of the line. With high enough leverage, their residuals can appear to be deceptively small.
Quartile
Data values that divide the data into 4 equal parts (Q1 or lower quartile, median, Q3 or upper quartile)
Interquartile Range
Defines the middle 50% of a data set, IQR = Q3 - Q1.
unimodal
Describes a distribution of univariate data with only one well-defined peak.
bimodal
Describes a distribution with two well-defined peaks/humps
Probability
Describes the pattern of chance outcomes - the proportion of times the outcome would occur in a very long series of repetitions
sampling distribution model
Different random samples give different values for a statistic. The sampling distribution model shows the behavior of the statistic over all the possible samples of the same size n.
Timeplot
Displays data that change over time. Successive points are often connected with lines to show trends more clearly.
Percentiles
Divide the data set into 100 equal parts. An observation at the Pth percentile is higher than P percent of all observations.
Confidence intervals for two proportions
Do not assume that p1=p2 so there is no pooling -want to estimate the true difference in two proportions -our CI statements now reflect "the true difference in the two proportions"
Standardizing
Done to eliminate units; values can be compared and combined even if the original variables had different units and magnitudes
Area principle
Ensures that in a data display, each data value should be represented by the same amount of area. Graphs that violate this are misleading.
Outliers
Extreme values that don't appear to belong with the rest of the data
Randomized Block Design
First, units are sorted into subgroups or blocks, and then treatments are randomly assigned within the blocks.
Joint Frequencies
Frequencies for each cell in a two-way table relative to the total number of data.
Venn Diagram
Graphical representation of sets or outcomes and how they intersect.
discrete random variable
Has a fixed set of possible values. Each value has probability between 0 and 1. Sum of probabilities of all values = 1
Disjoint (a.k.a. Mutually Exclusive) Events
Have no outcomes in common- the events CAN NOT occur at the same time - These events are not independent because if one of the disjoint events occurs, the others cannot.
percentile
I'm in the 95th percentile if 95% of the others are at or below my score.
Context
Ideally tells WHO and WHAT was measured, HOW and WHERE the data were collected, and WHEN and WHY the study was performed (a.k.a. - the W's)
sampling distribution for a mean
If assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, μ, and a standard deviation equal to σ/√n.
sampling distribution model for a proportion
If assumptions of independence and random sampling are met, and we expect at least 10 successes and 10 failures, then the sampling distribution of a proportion is modeled by a Normal model with a mean equal to the true proportion value, p, and a standard deviation equal to √(pq/n).
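The two standard deviation formulas, σ/√n for a sample mean and √(pq/n) for a sample proportion, are easy to compute directly. The sketch below also checks the "at least 10 successes and 10 failures" condition; the function names and example numbers are assumptions for illustration.

```python
import math

def sd_sample_mean(sigma, n):
    """Standard deviation of the sampling distribution of x-bar: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def sd_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of p-hat: sqrt(pq/n), q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

def success_failure_ok(p, n):
    """True when we expect at least 10 successes and at least 10 failures."""
    return n * p >= 10 and n * (1 - p) >= 10

print(sd_sample_mean(sigma=15, n=25))      # 15 / 5
print(sd_sample_proportion(p=0.5, n=100))  # sqrt(0.25 / 100)
print(success_failure_ok(p=0.05, n=100))   # only 5 expected successes
```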
How to interpret confidence LEVEL...
If many samples of this size were taken and their CI's constructed, we'd expect approx ___% of the CI's to contain the true _____________.
Influential point
If omitting a point from the data results in a very different regression model, then that point is called an influential point.
Multiplication Principle
If one can perform an original task in X different ways and a second task in Y different ways, he or she can perform both tasks in (X)(Y) different ways.
Independent Random Variables
If the values of one random variable have no association with the values of another, the two variables are called independent random variables.
5-number summary
In order: consists of Minimum, Q1, Median, Q3, and Maximum. These are the components of a Box-plot. To find, do 1-var Stats on Calculator.
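Away from a calculator, the same five values can be found with a few lines of code. The sketch below assumes one common quartile convention: Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the number of values is odd. Other software may use a different rule and give slightly different quartiles.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # skip the middle value when n is odd
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

summary = five_number_summary([13, 1, 7, 3, 11, 5, 9])
print(summary)
print("IQR =", summary[3] - summary[1])
```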
Experimental Units
Individuals (a person, a plot of land, a machine, or any single material unit) in an experiment.
Margin of error. Higher confidence level means __________ margin of error.
Is the "plus or minus" part of a CI. You multiply the z-star or t-star by the standard error. Get larger sample size to reduce margin of error. Higher confidence level means LARGER margin of error. Lower confidence level means SMALLER margin of error.
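Both effects (confidence level and sample size) show up when the margin of error is computed for a one-proportion interval. This is a sketch with made-up sample values; the z-star values 1.645, 1.96, and 2.576 are the usual critical values for 90%, 95%, and 99% confidence.

```python
import math

def margin_of_error_proportion(p_hat, n, z_star):
    """Margin of error = critical value * standard error of p-hat."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z_star * standard_error

# Hypothetical sample: p-hat = 0.5
for level, z_star in (("90%", 1.645), ("95%", 1.96), ("99%", 2.576)):
    me = margin_of_error_proportion(0.5, 100, z_star)
    print(f"{level} confidence, n = 100: ME = {me:.4f}")

print(f"95% confidence, n = 400: ME = {margin_of_error_proportion(0.5, 400, 1.96):.4f}")
```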
Simulation
It models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model
Frequency table
Lists the categories in a categorical variable and gives the count or percentage of observations for each category
Position
Location of a data value relative to the population
Shape
Look for single vs. multiple modes and symmetry vs. skewness
What happens as alpha level decreases?
Lower chance of a Type I error, higher chance of a Type II error
Response Variable
Measures the outcomes that have been observed.
expected value how to calculate
Multiply each possible value by the probability that it occurs and find the sum. ON CALCULATOR: 1-var stats L1,L2 (Frequency list is the list containing the probabilities)
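The same "multiply and add" calculation can be written out in code. A minimal sketch follows; the function name and the example distribution (number of heads in two fair coin flips) are chosen for illustration.

```python
def expected_value(values, probs):
    """E(X) = sum of (value * probability) over all possible values."""
    if abs(sum(probs) - 1) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))

# X = number of heads in two fair coin flips
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]
print(expected_value(values, probs))
```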
Rescaling
Multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant
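A quick numerical check of this fact, using a made-up list of heights converted from inches to centimeters (multiplying by 2.54):

```python
from statistics import mean, median, stdev

inches = [60, 64, 66, 70, 75]      # hypothetical heights
cm = [2.54 * x for x in inches]    # rescale every value by the same constant

print(mean(inches), mean(cm))      # the mean is multiplied by 2.54
print(median(inches), median(cm))  # so is the median
print(stdev(inches), stdev(cm))    # and so is the standard deviation
```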
Replicate
Repeat an experiment on as many subjects as possible
Mound-Shaped
Resembles a hill or mound; a distribution that is symmetric and unimodal.
Sample Statistic
Result of a sample used to estimate a parameter.
Replacement
Sampling can be done with or without replacement based on context- used when drawing from a set of objects
Multistage sample
Sampling schemes that combine several sampling methods
Sampling Error
See sampling variability.
Pie chart
Shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
Strata
Subgroups of a population that are similar or homogeneous.
Block
Subgroups of the experimental units that are separated by some characteristic before treatments are assigned because they may respond differently to the treatments. Blocking reduces the effects of identifiable attributes of the subjects that cannot be controlled. For example, if we think gender may affect our results, we first block 100 students by gender, then randomly assign the 64 girls to 2 treatment groups of 32 each and randomly assign the 36 boys to 2 treatment groups of 18 each.
First Quartile
Symbolized Q1, represents the median of the lower 50% of a data set.
Data
Systematically recorded information, whether numbers or labels, together with context
z-score
Tells how many standard deviations a value is from the mean. The standard normal distribution has a mean of 0 and standard deviation of 1. "unusual" is more than 2 s.d.'s from the mean.
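A z-score is (value - mean) / standard deviation. The sketch below applies that formula and the "more than 2 s.d.'s" rule of thumb; the mean of 100 and standard deviation of 15 are made-up numbers.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def is_unusual(x, mean, sd):
    """Rule of thumb: unusual if more than 2 standard deviations from the mean."""
    return abs(z_score(x, mean, sd)) > 2

print(z_score(130, mean=100, sd=15))     # exactly 2 SDs above the mean
print(is_unusual(135, mean=100, sd=15))  # more than 2 SDs away
print(is_unusual(110, mean=100, sd=15))  # well within 2 SDs
```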
Least-Squares Regression Line (LSRL)
The "best-fit" line to model the linear relationship between the two variables. It is calculated by minimizing the sum of the squares of the differences between the observed and predicted values of the line. The LSRL has the equation ŷ = b0 + b1x.
Explanatory Variable
The "x" variable
Central Limit Theorem
The Central Limit Theorem (CLT) states that the sampling distribution model of the sample mean (and sample proportion) from a random sample will be approximately Normal for large n, regardless of the distribution of the population, as long as the observations are independent. The larger the sample size, the closer to the normal curve it will become. - the population could be ANY SHAPE, yet the sampling distributions for x-bar or p-hat will be approx normal for large n
Mean
The arithmetic average of a data set; the sum of all the values divided by the number of values, x̄ = (Σxi)/n.
Random
The descriptor of an order that is unpredictable in the short term but has a regular pattern in the long run.
Independent
The descriptor that indicates knowing the occurrence of one event does not change the probability another event will occur. If two events (A and B) have positive probability and the probability of A given B equals the probability of A, they are independent.
Matched-Pairs Design
The design of a study where experimental units are naturally paired by a common characteristic, or with themselves in a before-after type of study. Such as twins, or the same person before/after.
Effect size
The difference between the null hypothesis value and true value of a model parameter is called the effect size.
levels
The different quantities or categories of a factor in an experiment.
Marginal distribution
The distribution of either variable alone in a contingency table; the counts or percentages are the totals found in the margins (last row or column) of the table
Sampling Distribution of the Sample Mean (x̄)
The distribution of sample means from all possible simple random samples of size n taken from a population.
Sampling Distribution of a Sample Proportion p̂
The distribution of sample proportions from all possible simple random samples of size n taken from a population.
Population
The entire group of individuals or instances about whom we hope to learn
Bins
The equal intervals that define the "bars" of a histogram. Changing bin width can alter the appearance of the histogram.
Type I error
The error of rejecting a null hypothesis when in fact it is true (also called a "false positive"). The probability of a Type I error is alpha.
Intersect
The event that all of a collection of events has occurred, which is best displayed using a Venn Diagram.
Union
The event that at least one of a collection of events occurs. (one OR both events)
Empty Event
The event that has no possible outcomes, which can occur in the intersection of two disjoint events.
Percentile
The ith ___ is the number that falls above i% of the data
Confidence Level
The level of certainty that a population parameter exists in the calculated confidence interval.
Probability Model
The mathematical description of a random phenomenon consisting of the sample space and a way to assign probabilities to events.
Expected Value definition
The mean of a probability distribution. E(X) can be found by multiplying each value times its probability and finding the sum.
Median
The middle value of a data set; the equal areas point, where 50% of the data are at or below this value, and 50% of the data are at or above this value.
Five-Number Summary
The minimum, first quartile (Q1), median, third quartile (Q3), and maximum values in a data set.
Sampling variability
The natural tendency of randomly drawn samples to differ, one from another
Sample size
The number of individuals in a sample
General Multiplication Rule for Any Two Events
The probability that both of two events (A and B) happen together is equal to the probability of B multiplied by the conditional probability that A occurs given that B has occurred.
Power of a Test
The probability to correctly reject a null hypothesis (1-β) -as alpha increases beta decreases -as type II error decreases power increases
Randomization
The process by which treatments are assigned by a chance mechanism to the experimental units.
Back-Transform
The process by which values are substituted into a model of transformed data, and then reversing the transforming process to obtain the predicted value or model for nontransformed data.
Estimation
The process of determining the value of a population parameter from a sample statistic.
Sampling Without Replacement
The process through which one changes the independence of outcomes by not returning selected items, so the group being drawn from shrinks with each draw.
Sampling With Replacement
The process through which one maintains the independence of outcomes by returning each selected item, so the group being drawn from stays the same size.
Probability
The proportion of times the outcome of a random phenomenon will occur in a very long series of repetitions.
Randomized block design
The randomization occurs only within block
Joint Event
The simultaneous occurrence of two events.
Nonresponse Bias
The situation where an individual selected to be in the sample is unwilling, or unable, to provide data.
b1
The slope of the regression equation. For every 1 additional unit of x, the y will go up/down by _______ units. b1 = r*sy / sx (where sy and sx are the stdev of y and x)
Minimum
The smallest numerical value in a data set.
Variability
The spread in a data set.
Standard Deviation of a random variable
The spread in a model, and square root of the variance of a random variable
Standard Deviation
The square root of the variance of a set of data. It is roughly the typical amount by which the values in a distribution deviate from the mean, and it is based on every value in the distribution. Used to measure the variability of a data set.
Placebo effect
The tendency of many human subjects to show a response even when administered a placebo
Alpha (α)
The threshold p-value that determines when we reject a null hypothesis. If we observe a statistic whose p-value based on the null hypothesis is less than our alpha, we reject that null hypothesis. If no alpha is given, state that you are using 0.05. - Understand how the critical value for a test is related to the specified alpha level - Also is the probability of a Type I error.
Predicted Value
The value of the response variable predicted by a model for a given explanatory variable.
Critical Value
The value that the test statistic must exceed in order to reject the null hypothesis. OR --When computing a confidence interval, the value of t-star (or z-star) used to find the margin of error.
sampling variability/sampling error
The variability we expect to see from one sample to another. It is sometimes called the sampling error, but sampling variability is the better answer.
b0
The y-intercept of the regression equation. The value when x = 0. b0 = ybar - b1xbar (where ybar is the mean of the y variable and xbar is the mean of the x variable)
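The intercept formula above, together with the slope formula b1 = r*sy / sx, is enough to fit a least-squares line by hand. Here is a minimal sketch on a small made-up data set; the function name and the data are assumptions for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1, r) using b1 = r * sy / sx and b0 = ybar - b1 * xbar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # correlation r: sum of products of deviations, divided by (n - 1) * sx * sy
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1, r

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r = least_squares_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f}x, r = {r:.3f}")

# residual = observed - predicted, for the point (2, 4)
residual = 4 - (b0 + b1 * 2)
print(f"residual at x = 2: {residual:.2f}")
```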
measures of Center
These locate the middle of a distribution. The mean and median are measures of center.
Normal percentile
This corresponding to a z-score gives the percentage of values in a standard normal distribution found at that z-score or below
Random Phenomena
Those outcomes that are unpredictable in the short term, but nevertheless, have a long-term pattern.
Continuous Random Variables
Those typically found by measuring, such as heights or temperatures. Can take on infinitely many values.
Random assignment
To be valid, an experiment must assign experimental units to treatment groups at random. This is called random assignment.
Addition Rule
To find probability of A OR B, we use P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Note-- If A and B are disjoint events: P(A or B)=P(A) + P(B) because there is no overlap
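A small worked check of the rule P(A or B) = P(A) + P(B) - P(A and B), using one roll of a fair six-sided die. The events A (even number) and B (greater than 3) are chosen only as an example.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                      # even number
B = {4, 5, 6}                      # greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

by_rule = prob(A) + prob(B) - prob(A & B)  # subtract the overlap {4, 6} once
by_count = prob(A | B)                     # count the union {2, 4, 5, 6} directly
print(by_rule, by_count)
```

Both approaches give 2/3. Without subtracting the overlap, the outcomes 4 and 6 would be counted twice.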
Dependent Events
Two events are called dependent when they are related and the fact that one event has occurred changes the probability that the second event occurs.
Independent Events
Two events are called independent when knowing that one event has occurred does not change the probability that the second event occurs, or vice versa. If A and B are independent, then P(B|A) = P(B).
q stands for?
probability of failure. q=1-p
p stands for?
probability of success
multiplication rule for independent events
probability that A and B will occur = P(A) * P(B)
general multiplication rule
probability that A and B will occur = P(A) * P(B|A)
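For instance, take drawing two aces from a standard 52-card deck (an example chosen here for illustration). Without replacement, the second draw depends on the first, so the general rule applies. With replacement, the draws are independent and the rule reduces to P(A) * P(B).

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # P(A): 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # P(B|A): 3 aces left among 51 cards

# General multiplication rule (drawing without replacement)
p_two_aces_no_replacement = p_first_ace * p_second_ace_given_first

# Independent events (drawing with replacement): P(B|A) = P(B)
p_two_aces_with_replacement = p_first_ace * p_first_ace

print(p_two_aces_no_replacement)    # 1/221
print(p_two_aces_with_replacement)  # 1/169
```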
IQR
quartile 3 - quartile 1
random
has a distinguishing feature; in the long run, it settles down in a way that is consistent and predictable. individual outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions
Convenience Sample
sample of individuals easiest to reach (standing in front of school and sampling). Not necessarily representative of the population, so this does not produce reliable results.
What is p hat?
sample proportion (from the sample). used to estimate the true proportion, p.
simple random
sampling design in which each set of n elements in the population has an equal chance of selection
multistage
sampling schemes that combine several sampling methods
Sample Space
set of all possible outcomes for the variable. Probabilities must add to 1
SOCS
shape outliers center spread
pie chart
shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
scatterplots
shows the relationship between two quantitative variables measured on the same cases
homogeneous
similar in makeup
the intersection of two or more events implies where both events occur....?
simultaneously
level
specific values that the experimenter chooses for a factor
Level
specific values that the experimenter chooses for a factor are called the levels of the factor. Placebo counts as a level (i.e. 0mg of a drug)
ladder of powers
y² (square of y), √y (square root of y), log y (log of y), -1/√y (-1 over square root of y), -1/y (-1 over y)
Variance
standard deviation squared
Observational study
study based on data in which no manipulation of factors has been employed. NOT AN EXPERIMENT AND THEREFORE CANNOT PROVE CAUSATION.
mean of discrete variable
sum of value*probability for each possible value
the mean of the sum of two random variables is the...?
sum of the means E(X+Y)=E(X)+E(Y)
form
the ____ we care about most is "nearly linear". Form can also be curved, oscillating, etc.
randomization
the best defense against bias, in which each individual is given a fair, random chance of selection
range
the difference between the lowest and highest values in a data set
residuals
the differences between data values and the corresponding values predicted by the regression model; ____ = observed value - predicted value
Residual
the distance the actual value is from the predicted value y - yhat = residual (a neg value means y is below the predicted and a pos value means y is above the predicted.)
population
the entire group of individuals or instances about whom we hope to learn