AP Stats Review BVD

the probability of an event is anywhere between...and...

0...1

What is the area under a density curve?

1

Binomial probability model

A Binomial model is appropriate for a random variable that counts the number of successes in a fixed number of Bernoulli trials. ~Binom(n,p)
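
As a minimal sketch of the Binom(n, p) model above, the probability of exactly k successes can be computed from the standard binomial formula. The function name and the coin-flip numbers are illustrative assumptions, not part of the original card.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binom(n, p): k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical example: exactly 3 heads in 10 fair coin flips.
print(binomial_pmf(3, 10, 0.5))  # 0.1171875
```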

Random Digit Table

A chance device that is used to select experimental units or conduct simulations.

Population Parameter

A characteristic or measure of a population.

Normal Distribution

A continuous probability distribution that appears in many situations, both natural and man-made. It has a bell-shape and the area under the normal density curve is always equal to 1.

Probability Distribution

The probability distribution of a discrete random variable X lists all n possible outcomes of the random variable (xi) and their associated probabilities P(xi).

Tree diagram

A display of conditional events and their probabilities that is helpful in thinking through conditioning. Use a tree diagram for Bayes' Rule problems. Add up the probabilities of all the branches where the given event occurs; this is the denominator of your probability. The numerator is the probability of the specific branch being asked about.
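
The tree-diagram recipe for Bayes' Rule can be sketched in a few lines: multiply along each branch, then divide the branch of interest by the sum of all branches where the observed event occurs. The function name and the screening-test numbers below are hypothetical, chosen only for illustration.

```python
def bayes_from_tree(p_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) using a two-level tree: first A / not A, then B."""
    branch_a_and_b = p_a * p_b_given_a                # multiply along the A branch
    branch_not_a_and_b = (1 - p_a) * p_b_given_not_a  # multiply along the not-A branch
    denominator = branch_a_and_b + branch_not_a_and_b  # every branch where B occurs
    return branch_a_and_b / denominator


# Hypothetical numbers: 1% have a condition, the test flags 95% of those
# who have it and 5% of those who do not.
print(round(bayes_from_tree(0.01, 0.95, 0.05), 3))  # about 0.161
```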

Frequency Table

A display organizing categorical or numerical data and how often each occurs.

Normal probability plot

A display to help assess whether a distribution of data is approximately normal; if it is nearly straight, the data satisfy the nearly normal condition

Symmetric

A distribution where the two halves on either side of the center are roughly equal or mirror images

uniform distribution

A distribution whose frequencies are constant across the possible values. Its plot is rectangular.

One-Way Table

A frequency table of one variable.

Two-Way Table

A frequency table that displays two categorical variables.

Power Model

A function in the form of y = axᵇ.

Geometric probability model

A geometric model is appropriate for a random variable that counts the number of Bernoulli trials UNTIL the first success.
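
A short sketch of the geometric model, assuming X counts the trial on which the first success occurs (so X = 1, 2, 3, ...). The function name and the example probability are illustrative assumptions.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): k - 1 failures followed by the first success on trial k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return (1 - p) ** (k - 1) * p


# Hypothetical example: success probability 0.2, first success on trial 3.
print(geometric_pmf(3, 0.2))  # about 0.128
```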

Box-And-Whisker Plot/Boxplot

A graphical display of the five-number summary of a set of data, which also shows outliers with an *. Make comparison statements if you have parallel box-and-whisker graphs. A has lower median than B. C is more spread out than D. 50% of the cars from A are cheaper than the lowest priced car at B. Do not just write about one at a time. Must compare two or more in the same sentence.

Bar Chart

A graphical display used with categorical data, where frequencies for each category are shown in vertical bars. Usually, bars do not touch. A broken y-axis can cause bar charts to be misleading.

Mode

A hump or local high point in the shape of the distribution of a variable

Sampling frame

A list of individuals from whom the sample is drawn

median

A measure of center that is the value that divides an ordered set of values into two equal halves. To find it, you list all the values in order and select the middle one, or if the number of values is even, the average of the two middle ones. If there are n values, the median is at position (n+1)/2. On a plot of a distribution, the median is the value that divides the area between the distribution curve and the x-axis in half.
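
The procedure just described (sort, then take the middle value or average the two middle values) might be written like this. The function name is an illustrative choice; Python's built-in `statistics.median` does the same job.

```python
def find_median(values):
    """Median by sorting and picking the middle value(s)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values


print(find_median([3, 1, 2]))     # 2
print(find_median([4, 1, 3, 2]))  # 2.5
```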

mean

A measure of center, often called the average, computed by adding all the values of x and dividing by the number of values, n. On a plot, the place where you would put a pencil point below the horizontal axis in order to balance the distribution.

Simulation

A method of modeling chance behavior that accurately mimics the situation being considered.

Exponential Model

A model of the form y = abˣ.

Standard Normal model

A normal model, N(μ,σ) with mean μ=0 and standard deviation σ=1

Spread

A numerical summary of how tightly the values are clustered around the center measured by the standard deviation, interquartile range and range.

Population parameter

A numerically valued attribute of a model for a population

Categorical Variable

A variable recorded as labels, names, or other non-numerical outcomes. There will be no units. Remember that something like Social Security Number is an identifier, and is therefore categorical even though it is numbers.

Lurking Variable

A variable that has an effect on the outcome of a study but was not part of the investigation.

Coefficient of Determination (r²)

Percent of variation in "y" that can be explained by variation in "x". OR Percent of variation in "y" that can be explained by the model.

Relative Frequency

Percentage or proportion of the whole number of data.

What happens to power when we increase alpha level (type 1 error probability)?

Power increases; Type 2 error chance decreases

P(B|A)

Probability of B occurring given A has occurred

Logarithmic Transformation

Procedure that changes a variable by taking the logarithm of each of its values.

Bivariate Data

Quantitative data. Each point is written as (x,y).

Randomize

Randomize subjects to treatments to even out effects that we cannot control

Test Statistic

The number of standard deviations (standard errors) that a sample statistic lies from a hypothesized population parameter.
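
As one concrete case, here is a sketch of the test statistic for a one-proportion z-test, where the standard deviation of the sample proportion is computed from the hypothesized value p0. The function name and the numbers are assumptions made for illustration.

```python
from math import sqrt


def one_proportion_z(p_hat: float, p0: float, n: int) -> float:
    """Distance of p_hat from the hypothesized p0, in standard deviations."""
    sd = sqrt(p0 * (1 - p0) / n)  # spread of p_hat if the null hypothesis is true
    return (p_hat - p0) / sd


# Hypothetical example: 56% observed in a sample of 100, testing p0 = 0.5.
print(round(one_proportion_z(0.56, 0.5, 100), 2))  # 1.2
```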

Lurking variable

A variable that is not explicitly part of a model but affects the way the variables in the model appear to be related is called a lurking variable. Because we can never be certain that observational data are not hiding a lurking variable that influences both x and y, it is never safe to conclude that a linear model demonstrates a causal relationship, no matter how strong the linear association.

Factor

A variable whose levels are controlled by the experimenter. Experiments attempt to discover the effects that differences in factor levels may have on the responses of the experimental units.

Response

A variable whose values are compared across different treatments. In a randomized experiment, large response differences can be attributed to the effect of differences in treatment levels.

Quantitative

A variable whose values are counts or measurements.

Graphical Display

A visual representation of a distribution.

Significance Level

ALPHA. The probability of a Type I error. A benchmark against which the P-value is compared to determine if the null hypothesis will be rejected. -denoted by α -ex: .01, .05, .10

Observed Values

Actual outcomes or data from a study or an experiment.

Shifting

Adding a constant to each data value, which adds the constant to the mean, median and quartiles but doesn't change the standard deviation or IQR
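
A quick check of the shifting rule on a small made-up data set, using Python's `statistics` module. The data values and the shift of 10 are arbitrary assumptions; the population standard deviation (`pstdev`) is used here only to keep the numbers tidy.

```python
from statistics import mean, median, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
shifted = [x + 10 for x in data]  # add the same constant to every value

print(mean(data), mean(shifted))      # center moves up by 10
print(median(data), median(shifted))  # center moves up by 10
print(pstdev(data), pstdev(shifted))  # spread is unchanged
```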

Stemplot, or stem-and-leaf plot

Also called a stem-and-leaf plot. Data are separated into a stem and leaf by place value and organized in the form of a histogram. Always provide a KEY. Shows the same SOCS as a histogram.

Extrapolation

Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x past the x-values in the data set. Such extrapolation may pretend to see into the future, but the predictions should not be trusted.

Point Estimate

An approximate value that has been calculated for the unknown parameter.

Random

An event is this if we know what outcomes could happen, but not which particular values will happen

Experiment

An experiment manipulates factor levels to create treatments, randomly assigns subjects to these treatment levels, and then compares the responses of the subject groups across treatment levels

Outlier

An extreme value in a data set. Quantified by being less than Q1 - 1.5*IQR or more than Q3 + 1.5*IQR.
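
A sketch of the 1.5*IQR rule in code. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data (a common classroom convention; calculators and software may use slightly different quartile rules). The function names and data are illustrative.

```python
from statistics import median


def outlier_fences(values):
    """Return (lower fence, upper fence) using Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, the overall median is left out of both halves.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values falling outside the fences."""
    low, high = outlier_fences(values)
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up values
print(outlier_fences(data))  # (-4.5, 15.5)
print(find_outliers(data))   # [50]
```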

Influential Point

An extreme value whose removal would drastically change the slope of the least-squares regression model.

Case

An individual about whom we have data

Outcome

An individual result of a component of a simulation is its outcome

Standardized Score

The number of standard deviations an observation lies from the mean, z = (observation - mean) / (standard deviation).
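
The z-score formula translates directly into code. The function name and the exam-score numbers are hypothetical.

```python
def z_score(observation: float, mean: float, sd: float) -> float:
    """Number of standard deviations the observation lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / sd


# Hypothetical example: a score of 85 when the mean is 70 and the SD is 10.
print(z_score(85, 70, 10))  # 1.5
```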

Prospective study

An observational study in which subjects are followed to observe future outcomes. Because no treatments are deliberately applied, a prospective study is not an experiment. Nevertheless, prospective studies typically focus on estimating differences among groups that might appear as the groups are followed during the course of the study. CANNOT PROVE CAUSATION.

Prospective Study

An observational study in which subjects are followed to observe future outcomes. It is NOT an experiment because no treatments are deliberately applied.

Outlier

Any data point that stands away from the others can be called an outlier. In regression, outliers can be extraordinary in two ways, by having a large residual or by having high leverage.

Blinding

Any individual associated with an experiment is blinded if they are not aware of how subjects have been allocated to treatment groups

Probability Rule (Definition of Probability)

Any probability of any given event is a number between zero and one.

Response bias

Anything in a survey design that influences responses, including wording of the question or even how the interviewer is dressed etc.

Law of Large Numbers

As the number of trials increases, the experimental probability approaches the theoretical probability - the more trials you do, the closer you will get to the predicted probability THERE IS NO LAW OF AVERAGES!!!
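
A small simulation can illustrate the Law of Large Numbers: the observed proportion of successes drifts toward the true probability as the number of trials grows. The function name, seed, and trial counts are arbitrary assumptions for this sketch.

```python
import random


def simulated_proportion(n_trials: int, p: float = 0.5, seed: int = 0) -> float:
    """Proportion of successes in n_trials simulated Bernoulli(p) trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = sum(rng.random() < p for _ in range(n_trials))
    return successes / n_trials


# The experimental probability should settle near 0.5 as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, simulated_proportion(n))
```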

Replication

The practice of reducing chance variation by assigning each treatment to many experimental units.

Geometric Distribution

The probability distribution of a geometric random variable X. All possible outcomes of X before the first success is seen and their associated probabilities.

Sampling Distribution

The probability distribution of a sample statistic when a sample is drawn from a population.

Beta (β)

The probability of a Type II error. See power.

What happens to the probability of a type 1 Error when we increase the power?

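The probability of a type 1 error increases if the extra power comes from raising the alpha level. Increasing the sample size raises power without raising alpha.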

Power

The probability of correctly rejecting the null hypothesis when it is in fact false. Equal to 1 - β. See beta and Type II error.

General Addition Rule for Unions of Two Events

The probability of either the first event (A) or the second event (B) occurring is equal to the probability of A plus the probability of B minus the probability of both A and B occurring. Also, the probability of the union of A and B is equal to the probability of A plus the probability of B minus the probability of the intersection of A and B.

P-Value

The probability of observing a test statistic as extreme as, or more extreme than, the statistic obtained from a sample, under the assumption that the null hypothesis is true. -if the p-value is as small or smaller than α, we say the data is statistically significant at level α -the smaller the p-value, the stronger the evidence AGAINST Ho

P-value

The probability of observing a value for a test statistic at least as far from the hypothesized value as the statistic value actually observed IF the null hypothesis is true. A small P-value indicates either that the observation is improbable or that the probability calculation was based on incorrect assumptions.

Conditional Probability

The probability of one event (B) under the condition that another event (A) has occurred. This is calculated as the probability that A and B occur divided by the probability that A occurs.
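
The calculation described above, P(B|A) = P(A and B) / P(A), as a minimal sketch. The function name and the probabilities are made up for illustration.

```python
def conditional_probability(p_a_and_b: float, p_a: float) -> float:
    """P(B | A): the share of A's probability in which B also occurs."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive to condition on A")
    return p_a_and_b / p_a


# Hypothetical example: P(A and B) = 0.1 and P(A) = 0.4.
print(round(conditional_probability(0.1, 0.4), 4))  # 0.25
```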

Rule of Multiplication

The probability that Events A and B both occur is equal to the probability that Event A occurs times the probability that Event B occurs, given that A has occurred. P(A ∩ B) = P(A) P(B|A)

Observational Study

Attempts to determine relationships between variables, but the researcher imposes no conditions as in an experiment.

Sampling Variability

Natural variability due to the sampling process. Each possible random sample from a population will generate a different sample statistic.

Random Variables

Numerical outcome of a random phenomenon.

Subset

One unstated condition for finding a linear model is that the data be homogeneous. If, instead, the data consist of two or more groups that have been thrown together, it is usually best to fit different linear models to each group than to try to fit a single model to all of the data. Displays of the residuals can often help you find subsets in the data.

Univariate

One-variable data.

general addition rule for two events

P(A or B) = P(A) + P(B) - P(A and B)

regardless of whether events are independent or not, the equation for finding P(A ∩ B) is?

P(A) x P(B|A)

the formula used to determine whether or not two events are independent is P(B|A) = ?

P(B)

What happens to the probability of a Type 1 Error when we decrease the significance level?

Type 1 error probability decreases; Type 2 error probability increases

How do you find the sample size when given confidence and support level?

Use the ME formula and solve for n
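
As a sketch of solving the margin-of-error formula for n, assume a one-proportion z-interval, where ME = z* x sqrt(p(1 - p)/n), so n = (z*/ME)^2 x p(1 - p), rounded up. The function name is hypothetical, and p = 0.5 is the usual conservative guess when no estimate is available.

```python
from math import ceil


def sample_size_for_proportion(z_star: float, margin: float, p: float = 0.5) -> int:
    """Smallest n whose margin of error is at most `margin`."""
    return ceil((z_star / margin) ** 2 * p * (1 - p))  # always round up


# Hypothetical example: 95% confidence (z* = 1.96), margin of error 3%.
print(sample_size_for_proportion(1.96, 0.03))  # 1068
```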

Probability Table

Use when the question has no conditionals in the given information and does not ask for any conditionals as answers OR when it has no conditionals in the given information but asks for conditionals in the answers -Columns and Rows should total up to 100 percent if dealing with percentages -when solving for a "given" statement, look at the individual totals for denominator -when solving for an "and" statement, look at the grand totals -doesn't add up to 100 if dealing with numerical totals rather than percents

Probability Venn Diagram

Use when the question has no conditionals in the question and asks for no conditionals in the answer - if the middle value of the diagram (.28) is not given and neither are the other inner values (.39 and .17), add the two total values together and then subtract the outer total (.17) from that value or vice versa -don't put individual probabilities inside the circles unless they say "only"

Scatterplots

Used to visualize bivariate data. The explanatory variable is shown on the horizontal axis and the response variable is shown on the vertical axis.

Paired samples

Used when the two samples are not independent from each other -take both samples and combine them to create a list of differences that becomes our new data -use this list and perform a one sample CI with a t distribution (1 sample t interval) -when we interpret non independent samples we say "the mean of differences" -all the variables represent the mean difference for populations

Histogram

Uses adjacent bars to show the distribution of values in a quantitative variable

Response variable

Values of this variable record the results of each trial with respect to what we were interested in

Independence

Variables where the conditional distribution of one is the same for each category of the other

What does the margin of error in a confidence interval estimating the z-scores cover?

Variation from random sampling

10% condition

Verify that the sample is smaller than 10% of the population.

Confidence intervals for two means

We are either looking for an interval that contains the true difference in the means or the true mean difference -when we interpret CIs for independent samples we say "the true difference between two means"

Pooling

We pool with 2-prop z-test because the null hypothesis is that the proportions are equal. add up successes and divide by sum of n1 and n2
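
The pooling step (add up the successes and divide by n1 + n2) looks like this in code. The function name and the counts are illustrative assumptions.

```python
def pooled_proportion(x1: int, n1: int, x2: int, n2: int) -> float:
    """Combined success proportion across both samples."""
    return (x1 + x2) / (n1 + n2)


# Hypothetical example: 30 successes out of 100 and 45 out of 150.
print(pooled_proportion(30, 100, 45, 150))  # 0.3
```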

Statistically significant

When an observed difference is too large for us to believe that it is likely to have occurred naturally. When the p-value falls below the alpha level, we say that the test is "statistically significant" at that alpha level.

Simpson's paradox

When averages are taken across different groups, they can appear to contradict the overall averages

Intersection P(A ∩ B) = ?

When both event A AND B occur -uses "and" statements -use multiplication to find answers. P(A ∩ B) = P(A)xP(B|A) if the events are not independent. If A and B are independent, then P(A ∩ B) = P(A)xP(B).

Comparing distributions

When doing this, consider their SOCS. Use comparison words and discuss similarities/differences in the same sentences.

Voluntary bias

When volunteering to respond, it is typical that only those who are passionate (strongly against or for) respond to the surveys/attend meetings/express opinions. This leads to a lack of representation for the moderate/middle group.

Standard error

When we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error. The S.D. formulas are on the formula sheet you get on the exam -- then you sub in statistics from your data and now it's called standard error.

outlier

a data point that does not fit the overall pattern in a scatter plot, or any value less than Q1-1.5*IQR or above Q3+1.5*IQR

normal probability plot

a display to help assess whether a distribution of data is approximately normal; if it is nearly straight, the data satisfy the nearly normal condition

mode

a hump or local high point in the shape of the distribution of a variable; the apparent locations of these can change as the scale of a histogram is changed

sampling frame

a list of individuals from whom the sample is drawn

population parameter

a numerically valued attribute of a model for a population

outlier in scatterplot

a point that does not fit the overall pattern seen in the scatterplot

leverage point

a point whose x-value is far from the mean of x. The further away from the mean, the more they determine the slope and intercept of the regression line.

sample

a representative subset of a population, examined in hope of learning about the population

representative

a sample is this if the statistics computed from it accurately reflect the corresponding population parameters

strength

a scatterplot shows an association that is this if there is little scatter around the underlying relationship

observational study

a study based on data in which no manipulation of factors has been employed

sample survey

a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population

wording bias

a type of response bias where the question is posed to achieve a desired result

quantitative variable

a variable in which the numbers act as numerical values; always has units

lurking variable

a variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two

response

a variable whose values are compared across different treatments

shifting

adding a constant to each data value adds the same constant to the mean, the median, and the quartiles, but does not change the standard deviation or IQR

effect on mean or median, of adding the same number B to each observation

adds B

effect on quartiles and percentiles, of adding the same number C to each observation

adds C

if the probability of an event is one, then the event...occurs

always

if the random variables are independent, the variance of their sum or difference is always the...?

always SUM of the variances. then take square root of the sum of variances to find s.d. of the sum or difference
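
A tiny sketch of the rule: variances add (for both sums and differences of independent random variables), and the standard deviation is the square root of that total. The function name and the standard deviations 3 and 4 are made up.

```python
from math import sqrt


def sd_of_sum_or_difference(sd_x: float, sd_y: float) -> float:
    """SD of X + Y or X - Y when X and Y are independent."""
    variance = sd_x**2 + sd_y**2  # variances add, even for a difference
    return sqrt(variance)


print(sd_of_sum_or_difference(3, 4))  # 5.0, not 3 + 4 = 7
```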

linear model

an equation of the form y-hat = b0 + b1x

model

an equation or formula that simplifies and represents reality

random

an event is this if we know what outcomes could happen, but not which particular values will happen

outcome

an individual result of a component of a simulation

prospective study

an observational study in which subjects are followed to observe future outcomes

retrospective study

an observational study in which subjects are selected and then their previous conditions or behaviors are determined

Retrospective Study

an observational study where subjects are selected and their previous conditions are determined

random behavior

an occurrence for which we know what outcomes could happen, but not which particular values will happen

matching

any attempt to force a sample to resemble specified attributes of the population

outlier

any data point that stands away from the others; can be extraordinary by having a large residual or by having high leverage

nonresponse bias

bias introduced to a sample when a large fraction of those sampled fails to respond

direction

can be positive or negative. Use to describe a scatterplot that is linear in form. Positive means that, in general, as one variable increases, so does the other. Negative means that increases in one variable generally correspond to decreases in the other.

the set of outcomes that are not in the event

complement

the probability of an event occurring is one minus the probability that it doesn't occur

complement rule

P(B|A) = P(A ∩ B)/P(A) is the equation for finding the...?

conditional probability

r

correlation coefficient, on a scale of -1 to 1 where 0 means no correlation and -1 or 1 means a perfect correlation (perfect correlation means all points on the line of regression)

leverage

data points whose x-values are far from the mean of x are said to exert ____ on a linear model; with high enough ____, residuals can appear to be deceptively small

conditional distributions

describes the values of one variable among individuals who have a specific value of another variable (#/row or column total)

the mean of the difference of two random variables is the...?

difference of the means E(X-Y)=E(X)-E(Y)

how to examine a scatter plot

direction (positive, negative) form (linear, curved) strength outliers

mutually exclusive events that have no outcomes in common

disjoint

the probability of two...events is the sum of the probabilities of the two events

disjoint

contingency table

displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables; categorizes the individuals on all variables at once, to reveal possible patterns in one variable that may be contingent on the category of the other

multimodal

distributions with more than two modes

effect on shape, range, IQR and standard deviation, of adding the same number A to each observation

does not change

quantitative data is displayed by

dot plots, stemplots, histograms

P(AUB) = P(A) + P(B) does not work for events that are not disjoint because you would ____?___ the events

double-count

regression to the mean

each predicted y-hat tends to be fewer standard deviations from its mean than its corresponding x was from its mean

complement

event that is "not A" is referred to as complement of A -denoted A^c

control group

experimental units assigned to a baseline treatment level (placebo, no fertilizer, etc.) so that comparisons of results can be made

how to calculate outliers

falls more than 1.5 x IQR above quartile 3 or below quartile one

Success/ failure condition

for a Normal model to be a good approximation of a Binomial model, we must expect at least 10 successes and 10 failures. That is, np is greater than or equal to 10 and nq is greater than or equal to 10.
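
The condition can be checked mechanically. This is a minimal sketch with a hypothetical function name; q is taken as 1 - p.

```python
def success_failure_ok(n: int, p: float, threshold: float = 10) -> bool:
    """True when both expected successes (np) and failures (nq) reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold


print(success_failure_ok(100, 0.3))   # True: 30 successes, 70 failures expected
print(success_failure_ok(100, 0.05))  # False: only 5 successes expected
```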

General Multiplication Rule

for any two events, P(A and B) = P(A) x P(B|A)

predicted value

found by substituting the x-value in the regression equation; they're the values on the fitted line

Systematic Random Sample

generate a random starting point, then select every kth individual from the list or queue.
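
One way to sketch a systematic random sample: pick a random starting position among the first k individuals, then take every kth individual after that. The function name, the list of ID numbers, and the seed are illustrative assumptions.

```python
import random


def systematic_sample(frame, k: int, seed=None):
    """Random start in the first k positions, then every kth individual."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting index from 0 to k - 1
    return frame[start::k]


# Hypothetical sampling frame of 100 ID numbers, selecting every 10th.
sample = systematic_sample(list(range(100)), 10, seed=1)
print(sample)
```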

slope

gives a value in "y-units per x-unit"; changes of one unit in x are associated with changes of b1 units in predicted values of y

distribution

gives the possible values of the variable and the relative frequency of each value

Dotplot

graphs a dot for each value against a single axis. Can show shape, center, spread, & outliers just like a histogram

mutually exclusive (disjoint)

have no outcomes in common; can never occur together

r-squared

how much of the variability of the data is accounted for by the model or regression line. The larger the decimal the more successful the model is in relating y to x

Multiplication Rule

if events A and B are independent, then P(A and B) = P(A)*P(B)

influential point

if omitting the point greatly changes the regression model.

the outcome of one trial does not influence or change the outcome of another

independent

SRS

individuals chosen in a way that every possible set of n individuals has equal chance to be sample selected

the probability of A and B occurring

joint probability

single-blind

just one party is blinded and does not know which treatment has been assigned (either the participant or the person judging the results). If both are blinded, it's called double blind

frequency table

lists the categories in a categorical variable and gives the count or percentage of observations for each category

use for distributions that are reasonably symmetric w/ no outliers

mean & standard deviation

use for skewed distributions or distribution with strong outliers

median & IQR

simulation

models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model

In a right skewed distribution the mean is ____ (less/more) than the median

more

effect on mean or median, of multiplying by a constant B

multiplies by B

effect on quartiles and percentiles, of multiplying by a constant C

multiplies by C

effect on range, IQR and standard deviation, of multiplying by a constant B

multiplies by absolute value of B

Tree Diagram

multiply along the branches to find final probabilities ('and' statements)

rescaling

multiplying each data value by a constant multiplies both the measures of position and the measures of spread by that constant
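
A quick check of the rescaling rule on made-up data, using a negative multiplier to show that measures of position are multiplied by the constant itself while the standard deviation is multiplied by its absolute value. The data and the constant -3 are arbitrary assumptions.

```python
from statistics import mean, median, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
scaled = [-3 * x for x in data]  # multiply every value by -3

print(mean(data), mean(scaled))      # mean is multiplied by -3
print(median(data), median(scaled))  # median is multiplied by -3
print(pstdev(data), pstdev(scaled))  # SD is multiplied by |-3| = 3
```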

double blind

neither the subjects nor the people who have contact with them know which treatment a subject has received

Disjoint events are ____________ independent.

never

if the probability of an event is zero, then the event...occurs

never

heterogeneous

not similar in makeup

n stands for?

number of trials

parameter

numerically valued attribute of a model

prospective

observational study in which subjects are followed to observe future outcomes

retrospective

observational study in which subjects are selected and then their previous conditions or behaviors are determined

independent events

occurrence of one event has no effect on chance that other event will happen

categorical variable data is displayed by

pie charts, bar graphs, stacked bar graphs

What happens as you increase sample size?

Power increases; the spread (standard deviation) of the sampling distribution decreases

the proportion of times that an event is likely to occur in a long series of trials

probability

quartile

the lower of this is the value with a quarter of the data below it; the upper of this has a quarter of the data above it

simulation component

the most basic situation in a simulation in which something happens at random

sampling variability

the natural tendency of randomly drawn samples to differ, one from another

sample size

the number of individuals in a sample

r2

the square of the correlation between y and x; gives the fraction of the variability of y accounted for by the least squares linear regression on x; an overall measure of how successful the regression is in linearly relating y to x

placebo effect

the tendency of many human subjects to show a response even when administered a fake treatment

random numbers

these are hard to generate, but several websites offer an unlimited supply of equally likely random values

normal percentile

this corresponding to a z-score gives the percentage of values in a standard normal distribution found at that z-score or below

random assignment

to be valid, an experiment must assign experimental units to treatment groups at random

shape

to describe this aspect of a distribution, look for single vs. multiple modes, and symmetry vs. skewness

matched pairs design

treatments are randomly assigned within pairs of similar individuals (two same-aged females, one given placebo and the other given the drug)

placebo

treatment known to have no effect, administered so that all groups experience the same conditions

confounding

two variables are associated in such a way that their effects on response variable can't be distinguished from each other. i.e. Does X cause Y or does Z cause Y?

nonresponse

type of bias that is problematic because the intended sample is incomplete

completely randomized

type of experiment in which all experimental units have an equal chance of receiving any treatment

matched pairs

type of study in which subjects who are similar in ways not under study may be grouped together and then compared with each other on the variables of interest

normal model

useful family of models for unimodal, symmetric distributions

histogram

uses adjacent bars to show the distribution of values in a quantitative variable; each bar represents the frequency (or relative frequency) of values falling in an interval of values

response variable

values of this record the results of each trial with respect to what we were interested in

independence

variables are said to be this if the conditional distribution of one variable is the same for each category of the other

re-express data

we do this by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values in the data set

influential point

when omitting a point from the data results in a very different regression model, the point is an ____

When is a linear model appropriate?

when the graph of the residuals shows an absence of pattern. Needs to be completely random. Any curvature, distinct clusters, or fanning is not a good sign.

regression line

the linear equation y-hat = b0 + b1x that satisfies the least squares criterion

How do you find the new sample size when reducing the margin of error by a fixed amount?

Multiply the original sample size by (reduction factor)^2. For example, to cut the margin of error to 1/3 of its original size, the factor is 3, so the sample size must be 3^2 = 9 times as large.
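
Because the margin of error shrinks with the square root of n, the sample size must grow by the square of the reduction factor. A minimal sketch, with a hypothetical function name and starting sample size:

```python
def new_sample_size(n_original: int, reduction_factor: int) -> int:
    """Sample size needed to shrink the margin of error by reduction_factor."""
    return n_original * reduction_factor**2


# Hypothetical example: cutting the margin of error to 1/3 with n = 100.
print(new_sample_size(100, 3))  # 900
```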

Type I Error

Stating there is a difference when there actually isn't one -aka Ho is true, but you reject it anyway. Its probability is α.

Type II Error

Stating there is no difference when there actually is one -aka Ho is false, but you fail to reject it anyway. Its probability is β.

Null Hypothesis

- Ho -the statement being tested -the test assesses the strength of the evidence against the null hypothesis -is usually a statement of no effect or no difference -assume it to be true for calculations

nonresponse bias

- Occurs when some individuals who are A PART of the survey do not respond - Those who choose not to respond may differ from those who do

response bias

- When something in the survey design influences the response, such as interviewer behaviour or question wording - If someone feels uncomfortable answering a question, they may be reluctant to tell the truth and change their answer to avoid judgement

correlation r

- measures direction and strength of linear relationship between two quantitative variables - always between 1 and -1 -doesn't change when units change - does not matter what we call x variable and what we call y variable

Alternative Hypothesis

-Ha -Depending on the p-value, we can either say that there is sufficient evidence to support Ha or there is NOT sufficient evidence to support Ha -always refers to population parameters, not sample statistics -is one-sided if we are testing that the true proportion is larger or smaller than the claim (one-tailed test) -is two-sided if we are testing that the true proportion is different from the claim (two-tailed test)

The following will increase power...

-increasing the sample size (which decreases the variability) -increasing the effect size -increasing alpha -anything that increases the power will automatically decrease the Type II error probability

normal distribution

A useful probability distribution that has a symmetric bell or mound shape and tails extending infinitely far in both directions.

How do you calculate power?

1-Beta

First 3 rules of working with probability

1. Make a picture! 2. Make a picture!! 3. Make a picture!!!

Bernoulli trials if...

1. There are two disjoint possible outcomes "success" and "failure" 2. The probability of success is constant 3. The trials are independent

stratified random sample

1. classify population into groups/strata of similar individuals 2. choose separate SRS in each stratum 3. combine SRSs to form full sample (for example, taking an SRS of seniors, an SRS of juniors, an SRS of sophomores, and an SRS of freshmen)

principles of experimental design

1. use a control 2. random assignment 3. replication (use enough experimental units)

Third Quartile

25% of data in the set is above this value. 75% of the data in this set is below this value. It is equivalent to the 75th percentile.

___%-___%-___% Rule

68%-95%-99.7% Rule. In a normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.
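The three percentages can be reproduced from the standard Normal model. This is a minimal sketch using Python's `statistics.NormalDist`; the helper name `within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Proportion of a Normal model within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {within(k):.4f}")
```

The printed proportions round to roughly 68%, 95%, and 99.7%.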

Quantitative variable

A variable in which the numbers act as numerical values (always have units)

Placebo Effect

A phenomenon where subjects show a response to a treatment merely because the treatment is imposed regardless of its actual effect.

Cluster Sample

A population is split into heterogeneous subgroups, such as a city being broken up by neighborhoods, and then a census is performed in randomly selected clusters. Do not confuse with Stratified random sampling.

undercoverage bias

A portion of the population has been excluded from the sample or is not represented in proportion to its share of the population. Can arise from: absence during sampling because of location/day/time/other

Parameter

A quantity (such as the mean or variance) that characterizes a statistical population and that can be estimated by calculations from sample data

Units

A quantity or amount adopted as a standard of measurement, such as dollars, hours or grams

Geometric Random Variable

A random variable X that counts the number of trials until the first success, where (a) each trial has two possible outcomes, (b) the probability of a success is constant for each trial, and (c) each trial is independent of the other trials.

margin of Error

A range of values to the left and right of a point estimate.

Representative

A sample is said to be this if the statistics computed from it accurately reflect the corresponding population parameters

Multistage Sample

A sample resulting from multiple applications of cluster, stratified, and/or simple random sampling.

Simple Random Sample (SRS)

A sample where n individuals are selected from a population in a way that every possible combination of n individuals is equally likely.

Bias

A sampling method is biased if it tends to produce samples that do not represent the population. You cannot "fix" this by making a sample larger.

Simple random sample (SRS)

A simple random sample of sample size n is one in which each set of n elements in the population has an equal chance of selection

Simulation

A simulation models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model.

skewed left

A skewed distribution with a tail that stretches left, toward the smaller values.

skewed right

A skewed distribution with a tail that stretches right, toward the larger values.

Observational study

A study based on data in which no manipulation of factors has been employed (no treatments).

Sample survey

A study that asks questions of a sample drawn from some population in the hope of learning something about the entire population

Sample Survey

A study that collects information from a sample of a population in order to determine one or more characteristics of the population.

Census

A study that observes, or attempts to observe, every individual in a population. Can be expensive and time-consuming, depending on the population.

Sample

A subset of a population, examined in hope of learning about the population

Probability Distribution and its Rules

A table made up of the sample space and the individual probabilities of each event in the sample space. Rules: 1) The probability of any event is a number between 0 and 1 2) The sum of the probabilities of all possible outcomes must equal 1 3) The probability that an event does NOT occur is 1 minus the probability that the event does occur (called the complement)

Placebo

A treatment known to have no effect, administered so that all groups experience the same conditions

Skewed

A unimodal, asymmetric distribution that tends to slant: most of the data are clustered on one side of the distribution, and the distribution "tails" off on the other side.

Normal model

A useful family of models for unimodal, symmetric distributions

Response Bias

Because of the manner in which an interview is conducted, because of the phrasing of questions, or because of the attitude of the respondent, inaccurate data are collected.

Non response bias

Bias introduced to a sample when a large fraction of those who are randomly selected to participate fail to respond

Don't confuse Geometric and Binomial models

Both involve Bernoulli trials, but the issues are different. If you are repeating trials until your first success, that's a geometric probability. You don't know in advance how many trials you'll need (so with geometric, there is no "n"). If you are counting the number of successes in a specified number of trials, n, that's a Binomial probability.
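The contrast shows up in the two probability formulas. The sketch below uses the standard Binomial and Geometric formulas; the function names and the success probability p = 0.2 are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in a fixed number n of Bernoulli trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """P(the first success occurs on trial k); note there is no n."""
    return (1 - p)**(k - 1) * p

p = 0.2  # assumed probability of success on each trial

# Binomial question: exactly 2 successes in 5 trials.
print(round(binomial_pmf(2, 5, p), 4))   # 0.2048

# Geometric question: first success on the 3rd trial.
print(round(geometric_pmf(3, p), 4))     # 0.128
```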

Range

Calculated as the maximum value minus the minimum value in a data set.

Transformation

Changing the values of a data set using a mathematical operation.

Treatments

Combinations of different levels of the factors in an experiment.

What are the principles of experimental design?

Control, randomize, replicate, and block (if applicable)

Principles of experimental design

Control, randomize, replicate, block

Leverage

Data points whose x-values are far from the mean of x are said to exert leverage on a linear model. High-leverage points pull the line close to them, and so they can have a large effect on the slope and intercept of the line. With high enough leverage, their residuals can appear to be deceptively small.

Quartile

Data values that divide the data into 4 equal parts (Q1 or lower quartile, median, Q3 or upper quartile)

Interquartile Range

Defines the middle 50% of a data set, IQR = Q3 - Q1.

unimodal

Describes a distribution of univariate data with only one well-defined peak.

bimodal

Describes a distribution with two well-defined peaks/humps

Probability

Describes the pattern of chance outcomes -the proportion of times the outcome would occur in a very long series of repetitions

sampling distribution model

Different random samples give different values for a statistic. The sampling distribution model shows the behavior of the statistic over all the possible samples of the same size n.

Timeplot

Displays data that change over time. Successive points are often connected with lines to show trends more clearly.

Percentiles

Divide the data set into 100 equal parts. An observation at the Pth percentile is higher than P percent of all observations.

Confidence intervals for two proportions

Do not assume that p1=p2 so there is no pooling -want to estimate the true difference in two proportions -our CI statements now reflect "the true difference in the two proportions"

Standardizing

Done to eliminate units; values can be compared and combined even if the original variables had different units and magnitudes

Area principle

Ensures that in a data display, each data value should be represented by the same amount of area. Graphs that violate this are misleading.

Outliers

Extreme values that don't appear to belong with the rest of the data

Randomized Block Design

First, units are sorted into subgroups or blocks, and then treatments are randomly assigned within the blocks.

Joint Frequencies

Frequencies for each cell in a two-way table relative to the total number of data.

Venn Diagram

Graphical representation of sets or outcomes and how they intersect.

discrete random variable

Has a fixed set of possible values. Each value has probability between 0 and 1. Sum of probabilities of all values = 1

Disjoint (a.k.a. Mutually Exclusive) Events

Have no outcomes in common- the events CAN NOT occur at the same time - These events are not independent because if one of the disjoint events occurs, the others cannot.

percentile

I'm in the 95th percentile if 95% of the others are at or below my score.

Context

Ideally tells WHO and WHAT was measured, HOW and WHERE the data were collected, and WHEN and WHY the study was performed (a.k.a. - the W's)

sampling distribution for a mean

If assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, μ, and a standard deviation equal to σ/√n.
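A seeded simulation can illustrate this model. The sketch below assumes a right-skewed Exponential population with μ = 1 and σ = 1; the sample size and number of repetitions are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable

n = 40                # sample size (assumed)
reps = 5000           # number of simulated samples (assumed)
mu, sigma = 1.0, 1.0  # mean and SD of an Exponential(1) population

# Draw many samples of size n and record each sample mean.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print("mean of sample means:", round(fmean(sample_means), 3), "vs mu =", mu)
print("SD of sample means:  ", round(stdev(sample_means), 3),
      "vs sigma/sqrt(n) =", round(sigma / sqrt(n), 3))
```

Even though the population is skewed, the simulated sample means should center near μ with a spread near σ/√n.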

sampling distribution model for a proportion

If assumptions of independence and random sampling are met, and we expect at least 10 successes and 10 failures, then the sampling distribution of a proportion is modeled by a Normal model with a mean equal to the true proportion value, p, and a standard deviation equal to √(pq/n).
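As a quick check of the formula and the 10-successes/10-failures condition, here is a small sketch; the function name `proportion_model` and the example values are hypothetical.

```python
from math import sqrt

def proportion_model(p, n):
    """Return (mean, sd, condition_ok) for the sampling distribution of p-hat."""
    q = 1 - p
    # Success/failure condition: expect at least 10 of each.
    condition_ok = n * p >= 10 and n * q >= 10
    return p, sqrt(p * q / n), condition_ok

print(proportion_model(p=0.5, n=100))   # (0.5, 0.05, True)
print(proportion_model(p=0.05, n=100))  # condition fails: only about 5 expected successes
```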

How to interpret confidence LEVEL...

If many samples of this size were taken and their CI's constructed, we'd expect approx ___% of the CI's to contain the true _____________.

Influential point

If omitting a point from the data results in a very different regression model, then that point is called an influential point.

Multiplication Principle

If one can perform an original task in X different ways and a second task in Y different ways, he or she can perform both tasks in (X)(Y) different ways.

Independent Random Variables

If the values of one random variable have no association with the values of another, the two variables are called independent random variables.

5-number summary

In order: consists of Minimum, Q1, Median, Q3, and Maximum. These are the components of a Box-plot. To find, do 1-var Stats on Calculator.

Experimental Units

Individuals (a person, a plot of land, a machine, or any single material unit) in an experiment.

Margin of error. Higher confidence level means __________ margin of error.

Is the "plus or minus" part of a CI. You multiply the z-star or t-star by the standard error. Get larger sample size to reduce margin of error. Higher confidence level means LARGER margin of error. Lower confidence level means SMALLER margin of error.
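To see both effects numerically, the sketch below assumes a one-proportion z-interval, so the critical value is z-star and the standard error is √(p̂q̂/n). The function name and inputs are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """Margin of error = z-star * standard error (one-proportion z-interval)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return z_star * standard_error

print(round(margin_of_error(0.5, 400, 0.95), 4))   # about 0.049
print(round(margin_of_error(0.5, 400, 0.99), 4))   # higher confidence, larger margin
print(round(margin_of_error(0.5, 3600, 0.95), 4))  # 9 times the sample, one third the margin
```

The last line shows why cutting the margin of error to 1/3 of its size takes 3² = 9 times as many observations.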

Simulation

It models random events by using random numbers to specify event outcomes with relative frequencies that correspond to the true real-world relative frequencies we are trying to model

Frequency table

Lists the categories in a categorical variable and gives the count or percentage of observations for each category

Position

Location of a data value relative to the population

Shape

Look for single vs. multiple modes and symmetry vs. skewness

What happens as alpha level decreases?

Lower chance of a type 1 error Higher chance of type 2 error

Response Variable

Measures the outcomes that have been observed.

expected value how to calculate

Multiply each possible value by the probability that it occurs and find the sum. ON CALCULATOR: 1-var stats L1,L2 (Frequency list is the list containing the probabilities)
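The same calculation can be written out directly. The values and probabilities below are a made-up distribution used only to show the arithmetic.

```python
# Hypothetical probability distribution: possible values and their probabilities.
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Multiply each value by its probability, then add the products.
expected_value = sum(x * p for x, p in zip(values, probabilities))

print(round(expected_value, 2))  # 1.7
```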

Rescaling

Multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant
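A short check of this property, using a small made-up data set and a constant of 3:

```python
from statistics import mean, median, stdev

data = [2, 4, 4, 6, 9]          # hypothetical data set
scaled = [3 * x for x in data]  # multiply every value by the constant 3

# Measures of position are multiplied by 3...
print(mean(data), mean(scaled))      # 5 15
print(median(data), median(scaled))  # 4 12

# ...and so is the standard deviation, a measure of spread.
print(round(stdev(scaled) / stdev(data), 6))  # 3.0
```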

Replicate

Repeat an experiment on as many subjects as possible

Mound-Shaped

Resembles a hill or mound; a distribution that is symmetric and unimodal.

Sample Statistic

Result of a sample used to estimate a parameter.

Replacement

Sampling can be done with or without replacement based on context- used when drawing from a set of objects

Multistage sample

Sampling schemes that combine several sampling methods

Sampling Error

See sampling variability.

Pie chart

Shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category

Strata

Subgroups of a population that are similar or homogeneous.

Block

Subgroups of the experimental units that are separated by some characteristic before treatments are assigned because they may respond differently to the treatments. Blocking reduces the effects of identifiable attributes of the subjects that cannot be controlled. For example, if we think gender may affect our results, 100 students are first blocked by gender; then the 64 girls are randomly assigned to 2 treatment groups of 32 each and the 36 boys are randomly assigned to 2 treatment groups of 18 each.
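The 64-girls/36-boys example can be sketched as a shuffle within each block. The helper name `assign_within_block`, the student labels, and the seeds are hypothetical.

```python
import random

def assign_within_block(units, seed=None):
    """Randomly split one block into two equal treatment groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # randomization happens only inside this block
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Block first, then randomize within each block.
girls = [f"G{i}" for i in range(64)]
boys = [f"B{i}" for i in range(36)]

girls_t1, girls_t2 = assign_within_block(girls, seed=0)
boys_t1, boys_t2 = assign_within_block(boys, seed=1)

print(len(girls_t1), len(girls_t2), len(boys_t1), len(boys_t2))  # 32 32 18 18
```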

First Quartile

Symbolized Q1, represents the median of the lower 50% of a data set.

Data

Systematically recorded information, whether numbers or labels, together with context

z-score

Tells how many standard deviations a value is from the mean. The standard normal distribution has a mean of 0 and standard deviation of 1. "unusual" is more than 2 s.d.'s from the mean.
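A minimal sketch of the calculation and the "more than 2 s.d.'s" rule of thumb; the function names and the scores (mean 100, SD 15) are made up for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

def is_unusual(value, mean, sd):
    """Rule of thumb from the card: more than 2 SDs from the mean."""
    return abs(z_score(value, mean, sd)) > 2

# Hypothetical scores with mean 100 and standard deviation 15.
print(z_score(85, 100, 15))      # -1.0
print(is_unusual(135, 100, 15))  # True, since z is about 2.33
print(is_unusual(110, 100, 15))  # False
```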

Least-Squares Regression Line (LSRL)

The "best-fit" line to model the linear relationship between the two variables. It is calculated by minimizing the sum of the squares of the differences between the observed and predicted values of the line. The LSRL has the equation ŷ = b0 + b1x.

Explanatory Variable

The "x" variable

Central Limit Theorem

The Central Limit Theorem (CLT) states that the sampling distribution model of the sample mean (and sample proportion) from a random sample will be approximately Normal for large n, regardless of the distribution of the population, as long as the observations are independent. The larger the sample size, the closer to the normal curve it will become. - the population could be ANY SHAPE, yet the sampling distributions for x-bar or p-hat will be approx normal for large n

Mean

The arithmetic average of a data set; the sum of all the values divided by the number of values, x̄ = (Σxi)/n.

Random

The descriptor of an order that is unpredictable in the short term but has a regular pattern in the long run.

Independent

The descriptor that indicates knowing the occurrence of one event does not change the probability another event will occur. If two events (A and B) have positive probability and the probability of A given B equals the probability of A, they are independent.

Matched-Pairs Design

The design of a study where experimental units are naturally paired by a common characteristic, or with themselves in a before-after type of study. Such as twins, or the same person before/after.

Effect size

The difference between the null hypothesis value and true value of a model parameter is called the effect size.

levels

The different quantities or categories of a factor in an experiment.

Marginal distribution

The distribution of either variable alone in a contingency table; the counts or percentages are the totals found in the margins (last row or column) of the table

Sampling Distribution of the Sample Mean (x̄)

The distribution of sample means from all possible simple random samples of size n taken from a population.

Sampling Distribution of a Sample Proportion p̂

The distribution of sample proportions from all possible simple random samples of size n taken from a population.

Population

The entire group of individuals or instances about whom we hope to learn

Bins

The equal intervals that define the "bars" of a histogram. Changing bin width can alter the appearance of the histogram.

Type I error

The error of rejecting a null hypothesis when in fact it is true (also called a "false positive"). The probability of a Type I error is alpha.

Intersect

The event that all of a collection of events has occurred, which is best displayed using a Venn Diagram.

Union

The event that at least one of a collection of events occurs. (one OR both events)

Empty Event

The event that has no possible outcomes, which can occur in the intersection of two disjoint events.

Percentile

The ith ___ is the number that falls above i% of the data

Confidence Level

The level of certainty that a population parameter exists in the calculated confidence interval.

Probability Model

The mathematical description of a random phenomenon consisting of the sample space and a way to assign probabilities to events.

Expected Value definition

The mean of a probability distribution. E(X) can be found by multiplying each value times its probability and finding the sum.

Median

The middle value of a data set; the equal areas point, where 50% of the data are at or below this value, and 50% of the data are at or above this value.

Five-Number Summary

The minimum, first quartile (Q1), median, third quartile (Q3), and maximum values in a data set.
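One way to compute the summary by hand is sketched below. It assumes the "median of each half" convention for quartiles (the middle value is left out of both halves when n is odd); other software may use a slightly different quartile rule. The function name and data are illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]          # values below the median position
    upper = xs[half + n % 2:]  # values above it (skips the middle value if n is odd)
    return min(xs), median(lower), median(xs), median(upper), max(xs)

minimum, q1, med, q3, maximum = five_number_summary([7, 1, 15, 3, 10, 4, 8])
print(minimum, q1, med, q3, maximum)  # 1 3 7 10 15
print("IQR =", q3 - q1)               # IQR = 7
```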

Sampling variability

The natural tendency of randomly drawn samples to differ, one from another

Sample size

The number of individuals in a sample

General Multiplication Rule for Any Two Events

The probability that both of two events (A and B) happen together is equal to the probability of B multiplied by the conditional probability that A occurs given B has occurred.

Power of a Test

The probability of correctly rejecting a false null hypothesis (1-β) -as alpha increases beta decreases -as the probability of a Type II error decreases power increases

Randomization

The process by which treatments are assigned by a chance mechanism to the experimental units.

Back-Transform

The process by which values are substituted into a model of transformed data, and then reversing the transforming process to obtain the predicted value or model for nontransformed data.

Estimation

The process of determining the value of a population parameter from a sample statistic.

Sampling Without Replacement

The process through which one changes the independence of outcomes by not maintaining the sample size.

Sampling With Replacement

The process through which one maintains the independence of outcomes by making sure the sample size remains the same.

Probability

The proportion of times the outcome of a random phenomenon will occur in a very long series of repetitions.

Randomized block design

The randomization occurs only within block

Joint Event

The simultaneous occurrence of two events.

Nonresponse Bias

The situation where an individual selected to be in the sample is unwilling, or unable, to provide data.

b1

The slope of the regression equation. For every 1 additional unit of x, the y will go up/down by _______ units. b1 = r*sy / sx (where sy and sx are the stdev of y and x)

Minimum

The smallest numerical value in a data set.

Variability

The spread in a data set.

Standard Deviation of a random variable

The spread in a model, and square root of the variance of a random variable

Standard Deviation

The square root of the variance. It is roughly the typical amount by which the values in a distribution deviate from the mean, and it is based on every value in the distribution. Used to measure the variability of a data set.

Placebo effect

The tendency of many human subjects to show a response even when administered a placebo

Alpha (α)

The threshold p-value that determines when we reject a null hypothesis. If we observe a statistic whose p-value based on the null hypothesis is less than our alpha, we reject that null hypothesis. If no alpha is given, state that you are using 0.05. - Understand how the critical value for a test is related to the specified alpha level - Also is the probability of a Type I error.

Predicted Value

The value of the response variable predicted by a model for a given explanatory variable.

Critical Value

The value that the test statistic must exceed in order to reject the null hypothesis. OR --When computing a confidence interval, the value of t-star (or z-star) used to find the margin of error.

sampling variability/sampling error

The variability we expect to see from one sample to another. It is sometimes called the sampling error, but sampling variability is the better answer.

b0

The y-intercept of the regression equation. The value when x = 0. b0 = ybar - b1xbar (where ybar is the mean of the y variable and xbar is the mean of the x variable)
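The slope and intercept formulas, b1 = r*sy/sx and b0 = ybar - b1*xbar, can be applied step by step. The five (x, y) pairs below are made-up data, and r is computed from its standard definition.

```python
from statistics import mean, stdev

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation: average product of z-scores, dividing by n - 1.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x       # slope
b0 = y_bar - b1 * x_bar  # y-intercept

print(round(r, 4), round(b1, 4), round(b0, 4))  # 0.7746 0.6 2.2
```

For these data the fitted line is ŷ = 2.2 + 0.6x.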

measures of Center

These locate the middle of a distribution. The mean and median are measures of center.

Normal percentile

The Normal percentile corresponding to a z-score gives the percentage of values in a standard normal distribution found at that z-score or below

Random Phenomena

Those outcomes that are unpredictable in the short term, but nevertheless, have a long-term pattern.

Continuous Random Variables

Those typically found by measuring, such as heights or temperatures. Can take on infinitely many values.

Random assignment

To be valid, an experiment must assign experimental units to treatment groups at random. This is called random assignment.

Addition Rule

To find the probability of A OR B, we use P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Note-- If A and B are disjoint events: P(A or B) = P(A) + P(B) because there is no overlap
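A worked example of the rule P(A ∪ B) = P(A) + P(B) - P(A ∩ B), using a standard 52-card deck as an assumed setting: A is "draw a heart" and B is "draw a king". Exact fractions avoid rounding.

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# A = heart (13 cards), B = king (4 cards), overlap = king of hearts (1 card).
heart_or_king = prob_union(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))
print(heart_or_king)  # 4/13

# Disjoint events have no overlap: A = heart, B = spade.
heart_or_spade = prob_union(Fraction(13, 52), Fraction(13, 52), Fraction(0))
print(heart_or_spade)  # 1/2
```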

Dependent Events

Two events are called dependent when they are related and the fact that one event has occurred changes the probability that the second event occurs.

Independent Events

Two events are called independent when knowing that one event has occurred does not change the probability that the second event occurs, or vice versa. If A and B are independent, then P(B|A) = P(B).

q stands for?

probability of failure. q=1-p

p stands for?

probability of success

multiplication rule for independent events

probability that A and B will occur = P(A) * P(B)

general multiplication rule

probability that A and B will occur = P(A) * P(B|A)
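For illustration, the sketch below applies the rule to drawing two aces from a standard 52-card deck (an assumed example), once without replacement and once with replacement, where the draws are independent and P(B|A) = P(B).

```python
from fractions import Fraction

def prob_both(p_a, p_b_given_a):
    """General multiplication rule: P(A and B) = P(A) * P(B|A)."""
    return p_a * p_b_given_a

# Without replacement: after one ace is drawn, 3 aces remain among 51 cards.
two_aces_without = prob_both(Fraction(4, 52), Fraction(3, 51))

# With replacement: the draws are independent, so P(B|A) = P(B).
two_aces_with = prob_both(Fraction(4, 52), Fraction(4, 52))

print(two_aces_without)  # 1/221
print(two_aces_with)     # 1/169
```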

IQR

quartile 3 - quartile 1

random

has a distinguishing feature; in the long run, it settles down into a way that is consistent and predictable. individual outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions

Convenience Sample

sample of individuals easiest to reach (standing in front of school and sampling). Not necessarily representative of the population, so this does not produce reliable results.

What is p hat?

sample proportion (from the sample). used to estimate the true proportion, p.

simple random

sampling design in which each set of n elements in the population has an equal chance of selection

multistage

sampling schemes that combine several sampling methods

multistage sample

sampling schemes that combine several sampling methods

Sample Space

set of all possible outcomes for the variable. Probabilities must add to 1

SOCS

shape outliers center spread

pie chart

shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category

scatterplots

shows the relationship between two quantitative variables measured on the same cases

homogeneous

similar in makeup

the intersection of two or more events implies where both events occur....?

simultaneously

level

specific values that the experimenter chooses for a factor

Level

specific values that the experimenter chooses for a factor are called the levels of the factor. Placebo counts as a level (i.e. 0mg of a drug)

ladder of powers

y² (square of y), √y (square root of y), log(y), -1/√y, -1/y

Variance

standard deviation squared

Observational study

study based on data in which no manipulation of factors has been employed. NOT AN EXPERIMENT AND THEREFORE CANNOT PROVE CAUSATION.

mean of discrete variable

sum of value*probability over all possible values

the mean of the sum of two random variables is the...?

sum of the means E(X+Y)=E(X)+E(Y)
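The rule E(X+Y) = E(X) + E(Y) can be verified by listing every outcome. The sketch assumes X and Y are the results of two fair six-sided dice.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # faces of a fair six-sided die

# E(X) for one die: each face has probability 1/6.
e_x = sum(Fraction(face, 6) for face in faces)

# E(X + Y): average the sum over all 36 equally likely pairs.
e_sum = sum(Fraction(a + b, 36) for a, b in product(faces, repeat=2))

print(e_x)    # 7/2
print(e_sum)  # 7
```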

form

the ____ we care about most is "nearly linear". Form can also be curved, oscillating, etc.

randomization

the best defense against bias, in which each individual is given a fair, random chance of selection

range

the difference between the lowest and highest values in a data set

residuals

the differences between data values and the corresponding values predicted by the regression model; ____ = observed value - predicted value

Residual

the distance the actual value is from the predicted value y - yhat = residual (a neg value means y is below the predicted and a pos value means y is above the predicted.)
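A short sketch of residual = y - yhat. The data and the line ŷ = 2.2 + 0.6x are hypothetical; that line is the least-squares fit for these five points, which is why the residuals sum to zero.

```python
# Hypothetical data and fitted line y-hat = 2.2 + 0.6x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

for xi, res in zip(x, residuals):
    position = "above" if res > 0 else "below"
    print(f"x = {xi}: residual = {res:+.1f} (observed y is {position} the prediction)")
```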

population

the entire group of individuals or instances about whom we hope to learn

