AP Statistics Exam Study Guide
How do we interpret a confidence interval?
"It can be stated with ______% confidence that the true parameter is between ________ and _________."
how do we find expected counts for homogeneity and independence tests?
(row total) • (column total)/ table total
how do we calculate degrees of freedom for a homogeneity or independence test
(rows-1)*(columns-1)
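As a minimal sketch of the two formulas above, the snippet below computes expected counts and degrees of freedom for a two-way table. The observed counts are made up for illustration.

```python
# Hypothetical 2x3 table of observed counts (rows = groups, columns = responses).
observed = [
    [20, 30, 50],
    [30, 20, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count for each cell = (row total)(column total) / table total.
expected = [
    [r * c / table_total for c in col_totals]
    for r in row_totals
]

# Degrees of freedom = (rows - 1)(columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(df)
```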
If calculating the test statistic (z or t) by hand, what formula should be used?
(statistic-parameter)/standard error
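To make the formula concrete, here is a rough sketch for a 1-proportion z-test. The sample numbers are invented, and the standard error formula for a proportion under Ho, sqrt(p0(1-p0)/n), is assumed from the standard formula sheet rather than taken from this guide.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: Ho says p = 0.50; a sample of 100 gives p-hat = 0.56.
p0 = 0.50
n = 100
p_hat = 0.56

# Standard error of p-hat, assuming Ho is true.
standard_error = sqrt(p0 * (1 - p0) / n)

# Test statistic = (statistic - parameter) / standard error.
z = (p_hat - p0) / standard_error

# One-sided p-value for Ha: p > 0.50 (area to the right of z).
p_value = 1 - NormalDist().cdf(z)

print(z, p_value)
```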
Empirical Rule
-About 68% of data is within 1 standard deviation of the mean -About 95% of data is within 2 standard deviations of the mean -About 99.7% of data is within 3 standard deviations of the mean.
Binomial Curve
-Center: np( number of trials x probability of success= expected # of successes) -Spread: standard deviation, σ = square root of np(1-p) -Shape: approaches normality if you can expect at least 10 successes and 10 failures
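A quick sketch of the center, spread, and shape check above, using made-up values for n and p:

```python
import math

n, p = 50, 0.4  # hypothetical number of trials and probability of success

mean = n * p                      # center: expected number of successes
sd = math.sqrt(n * p * (1 - p))   # spread: standard deviation

# Shape: roughly normal if we expect at least 10 successes and 10 failures.
large_counts = n * p >= 10 and n * (1 - p) >= 10

print(mean, sd, large_counts)
```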
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: check your conditions--Unbiased Estimator (Randomization)
-Ensures that the CENTER (the sample statistic) is legitimate -samples: randomly selected from the population -experiments: randomly assigned into treatment or control group(s) -Note: If you are running a 2-sample interval or test, you must check and STATE that both samples are random, and independent of each other!
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: check your conditions--Independence
-Ensures that the SPREAD (the standard deviation) formulas that you're given are reliable -when sampling from the population: sample must be less than 10% of the population -Experiments relying on volunteers: Groups should be independent of each other (i.e. not matched pairs) -if there are matched pairs, do a PAIRED t-test; find the difference between each pair and use those numbers in a 1-sample t-test (or interval).
Median
-Find the number in the MIDDLE position (use (n+1)/2). If there are TWO numbers in the middle position, average them. -Yes, it is resistant to the effects of outliers. -best to use when the data is skewed.
how to calculate a geometric distribution
-GeometPDF is used for P(X=#), the probability that the first success will happen on exactly the nth trial -GeometCDF is used for P(X<=#), the probability that the first success will happen on or before the nth trial -for P(X>=#), the probability that the first success will happen on or after the nth trial, use the complement: 1 - P(X<=#-1)
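The calculator functions can be mimicked with two short helper functions. This is only a sketch: the names geom_pdf and geom_cdf are made up here, and the value of p is hypothetical.

```python
def geom_pdf(p, n):
    """P(X = n): the first success happens on exactly the nth trial."""
    return (1 - p) ** (n - 1) * p


def geom_cdf(p, n):
    """P(X <= n): the first success happens on or before the nth trial."""
    return 1 - (1 - p) ** n


p = 0.2  # hypothetical probability of success on each trial

p_exactly_3 = geom_pdf(p, 3)        # P(X = 3)
p_at_most_3 = geom_cdf(p, 3)        # P(X <= 3)
p_at_least_3 = 1 - geom_cdf(p, 2)   # P(X >= 3), via the complement

print(p_exactly_3, p_at_most_3, p_at_least_3)
```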
Sample size (aka Large Counts) condition for proportions
-How to meet the condition: ensure the expected number of successes AND failures for the sample are >=10 -ensures that the shape of the sampling distribution is appropriate for inference
What is the margin of error, and how do we calculate it?
-Margin of Error= Critical value • Standard error -it indicates how far off we can reasonably expect the parameter to be from the statistic obtained from the sample
Boxplot
-Min, Q1, Med, Q3, Max shown -cannot show shape (but can show skews vs. symmetry) -outliers should be marked with a * -one-fourth (25%) of the # of data points are located in each section of the box plot.
what function in the calculator should you use to analyze a normal distribution?
-NormalCDF
How do you calculate the conditional probability of a given event?
-P(A|B) = P(A∩B)/P(B), or P(both events)/P(given event)
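A small worked example of the formula, with invented counts standing in for a two-way table:

```python
# Hypothetical counts from a survey of 200 students.
total = 200
count_b = 80          # students for whom event B is true
count_a_and_b = 20    # students for whom both A and B are true

p_b = count_b / total
p_a_and_b = count_a_and_b / total

# P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)
```

Note that the table total cancels out, so the same answer comes from 20/80 directly.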
Spread: standard deviation
-Population: σ -Sample: s -paired with mean -to find: use 1-Var Stats in calculator! or σ = the square root of Σ(x-μ)^2/n (for s, use x̄ in place of μ and divide by n-1 instead of n) -not resistant to the effects of outliers
Spread: variance
-Population: σ^2 -Sample: s^2 -paired with mean -to find: use 1-Var Stats in calculator! or σ^2 = Σ(x-μ)^2/n (for s^2, use x̄ in place of μ and divide by n-1 instead of n) -not resistant to the effects of outliers
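For checking hand calculations, Python's statistics module has population and sample versions of both measures. The data set below is made up.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

pop_variance = pvariance(data)   # divides by n
pop_sd = pstdev(data)            # square root of the population variance

sample_variance = variance(data)  # divides by n - 1
sample_sd = stdev(data)           # square root of the sample variance

print(pop_variance, pop_sd, sample_variance, sample_sd)
```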
what does a mosaic plot show us?
-Vertically, it shows the distribution of each variable within each sub-group. -horizontally, the width of the bars shows the proportion of the population that is taken up by each sub-group
Histogram
-X-axis shows intervals, y-axis shows frequency (# of data points within that interval) -finding median: figure out how many data points there are, use (n+1)/2 to find the position of the median, then find which interval contains that position
z-score
-a data point's z-score is the number of standard deviations it is located away from the mean -formula: z = (x-μ)/σ -can help us compare two unalike measurements
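Here is a short sketch of using z-scores to compare two unalike measurements. The two tests, their means, and their standard deviations are all hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma


# Hypothetical scores on two tests with different scales.
z_test_a = z_score(1300, 1050, 200)  # test A: mean 1050, SD 200
z_test_b = z_score(28, 21, 5)        # test B: mean 21, SD 5

# The larger z-score is the more impressive result relative to its own group.
better = "A" if z_test_a > z_test_b else "B"

print(z_test_a, z_test_b, better)
```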
Geometric Distribution
-a probability distribution that allows us to determine how many trials it will take to get the first success -trials need to be independent (of course)
what types of graphs can be used to represent categorical data?
-bar graphs (including side-by-side and segmented) -mosaic plots -2-way tables
How does transforming and combining a random variable change that variable's distribution: Combining (adding or subtracting two random variables to each other)
-effect on center: changes according to the operation -effect on spread: ADD VARIANCES (regardless of whether variables are being added or subtracted), as long as the variables are independent
How does transforming and combining a random variable change that variable's distribution: Multiplying/Dividing by a CONSTANT (number)
-effect on center: changes according to the operation -effect on spread: standard deviation is multiplied/divided by the same constant (use its absolute value); variance changes by the constant squared
How does transforming and combining a random variable change that variable's distribution: Adding/Subtracting a CONSTANT (number)
-effect on center: changes according to the operation -effect on spread: doesn't change
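The three cases above (combining, multiplying by a constant, adding a constant) can be sketched with a pair of hypothetical random variables X and Y, assumed here to be independent:

```python
import math

# Hypothetical independent random variables X and Y.
mean_x, sd_x = 10.0, 3.0
mean_y, sd_y = 4.0, 4.0

# Combining: X - Y. Means subtract, but variances still ADD.
mean_diff = mean_x - mean_y
sd_diff = math.sqrt(sd_x ** 2 + sd_y ** 2)

# Multiplying by a constant: 2X. Center and spread are both multiplied.
mean_scaled = 2 * mean_x
sd_scaled = abs(2) * sd_x

# Adding a constant: X + 1. Center shifts, spread does not change.
mean_shifted = mean_x + 1
sd_shifted = sd_x

print(mean_diff, sd_diff, mean_scaled, sd_scaled, mean_shifted, sd_shifted)
```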
Sample size (aka Large Counts) condition for means
-How to meet the condition: ensure that the sample size is >=30 -ensures that the shape of the sampling distribution is appropriate for inference -Note: if the population has an approximately normal distribution, this condition can be considered "met" regardless of sample size!
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: Do the calculation--Confidence intervals
-give interval (lower, upper)
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: state your conclusion--confidence intervals
-give the % confidence -give the interval in context -"It can be stated with __________% confidence that the true mean (or true proportion) of (context) is between _________ and _____________."
How to identify outliers: IQR Test
-how it works: a point is considered an outlier if it is less than Q1-1.5(IQR) or Greater than Q3+1.5(IQR) -this is a general guideline not a strict rule!
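A minimal sketch of the IQR test follows. The data set is made up, and the quartiles are found as the medians of the lower and upper halves of the sorted data, which is one common classroom convention (calculators and software may use slightly different rules).

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    s = sorted(data)
    n = len(s)
    lower_half = s[: n // 2]
    upper_half = s[(n + 1) // 2:]
    return median(lower_half), median(upper_half)


data = [2, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data set

q1, q3 = quartiles(data)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q3, iqr, outliers)
```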
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: check your conditions--sample size (also known as "Large Counts")
-if met, the shape of the sampling distribution is appropriate for the test. -Means (μ): -30 or more, OR -Population is known to be normal, OR -Graph of the sample shows no obvious skews or outliers (t-test only) -Proportions (p): -at least 10 expected successes and 10 expected failures (find expected value of each) -for 2-proportion z-tests, use POOLING to check for expected success and failures -if making a confidence interval where there are no expected values, you may use the actual values from your sample.
theoretical distribution
-it's like a histogram in which the center is the mean and the intervals are each one standard deviation apart from each other
X^2 test of independence null and alternate hypotheses
-null: there is no association between the variables within this population -alternate: there is an association between the variables within the population
X^2 test of homogeneity null and alternate hypotheses
-null: there is no difference in the distribution of (variable) across the populations (or treatment groups) -alternate: there is a difference in the distribution of (variable) across the populations (or treatment groups)
X^2 Goodness of Fit null and alternate hypotheses
-null: varies depending on problem -alternate: at least one of these proportions is not as specified in the null hypothesis
Using the Binomial Distributions
-only works in binomial settings, which occur when the following conditions are met ("BINS") -B: Binary (can be analyzed in terms of successes and failures) -I: Independent -N: Number of trials is known in advance -S: Same probability of success for each trial -BinomPDF: finds P(X=#) -BinomCDF: finds P(X<=#); for P(X>=#), use the complement: 1 - P(X<=#-1)
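As a rough stand-in for BinomPDF and BinomCDF, the helper functions below (names made up for this sketch) compute the same probabilities for hypothetical values of n and p:

```python
from math import comb


def binom_pdf(n, p, k):
    """P(X = k): exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(n, p, k):
    """P(X <= k): k or fewer successes in n trials."""
    return sum(binom_pdf(n, p, i) for i in range(k + 1))


n, p = 10, 0.5  # hypothetical: 10 fair coin flips

p_exactly_3 = binom_pdf(n, p, 3)        # P(X = 3)
p_at_most_3 = binom_cdf(n, p, 3)        # P(X <= 3)
p_at_least_4 = 1 - binom_cdf(n, p, 3)   # P(X >= 4), via the complement

print(p_exactly_3, p_at_most_3, p_at_least_4)
```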
Range
-paired with either mean or median -to find: maximum-minimum -not resistant to the effects of outliers
Interquartile Range (IQR)
-paired with median -to find: Q3-Q1 -yes, resistant to the effects of outliers
Upper Quartile (Q3)
-paired with the median -to find: median of the upper half of the data (the values above the median), or use 1-var stats -yes, resistant to the effects of outliers
Lower Quartile (Q1)
-paired with the median -to find: median of the lower half of the data (the values below the median), or use 1-var stats -yes, resistant to the effects of outliers
Mean
-Population: μ -Sample: x̄ (x-bar) -add up all data points, divide by number of data points in the data set -not resistant to the effects of outliers -usually the best measure of center to use, unless the data is skewed.
When looking at bivariate data, what does shape tell us?
-possibilities: linear or non-linear (curved) -what the r-value tells us: r assumes that shape is linear
when looking at bivariate data, what does direction tell us?
-possibilities: positive or negative -what the r-value tells us: -positive r-value=positive direction -negative r-value= negative direction -if r=0, there is no (linear) correlation
When looking at bivariate data, what does strength tell us?
-possibilities: weak, moderately strong, or strong -what the r-value tells us: if r is close to 1 (or -1), correlation is strong. If r is close to 0, correlation is weak.
PROCEDURES FOR CONFIDENCE INTERVALS: 1. State what you're doing (P in PANIC)
-procedure (interval) you're using -the parameter (population) of interest! -confidence level
PROCEDURES FOR HYPOTHESIS TESTS: 1. State what you're doing (P in PHANTOM)
-procedure (test) you're using -the parameter (population) of interest! -Hypotheses, Ho and Ha -Significance Level, α (if none is given, use .05)
Stemplot
-remember to give a key to show what the numbers mean -do not skip stems -if given a back-to-back stemplot, always read stem first, then leaf
multiple outcomes--mutually exclusive
-rule: add probabilities -formula: P(A∪B) = P(A) + P(B)
Multiple outcomes--Not mutually exclusive
-rule: add probabilities but subtract the overlap (if using a Venn diagram, just add up the 3 sections in the diagram) -formula: P(A∪B) = P(A) + P(B) - P(A∩B)
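A quick check of the general addition rule, using a standard 52-card deck as the example (A = drawing a heart, B = drawing a king):

```python
from fractions import Fraction

p_heart = Fraction(13, 52)
p_king = Fraction(4, 52)
p_heart_and_king = Fraction(1, 52)  # the king of hearts is the overlap

# P(A or B) = P(A) + P(B) - P(A and B)
p_heart_or_king = p_heart + p_king - p_heart_and_king

print(p_heart_or_king)
```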
Multiple events--Independent
-rule: multiply probabilities; account for combinations (nCr) if necessary. (If the events are NOT independent, account for the change in probability with each trial.) -formula: nCr x P(event 1) x P(event 2) x P(event 3)...
"At least one"
-rule: opposite of "none" -formula: 1-P(0)
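For example, the chance of rolling at least one 6 in four rolls of a fair die can be sketched as follows (the die example is an illustration, not from the guide):

```python
p_six = 1 / 6
rolls = 4

# P(none) = probability of missing on every independent roll.
p_none = (1 - p_six) ** rolls

# P(at least one) = 1 - P(none)
p_at_least_one = 1 - p_none

print(p_at_least_one)
```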
Conditions of a Chi-Squared test
-same conditions as other significance tests -for the sample size condition to be met, all expected counts must be ≥5
Dotplot
-so easy a caveman could do it!
how do we find expected counts for goodness of fit?
-sometimes, you may expect certain proportions out of a total -sometimes, you may expect that the data is equally distributed among the categories (in this case, just use simple division)
Confidence interval formula
-statistic ± critical value * standard error of statistic -critical values can be found in the t table (for z distributions, use the ∞ row) -standard deviation: use the formula sheet (they are very clearly laid out!)
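Below is a sketch of the formula for a 1-proportion z interval. The sample counts are hypothetical, the standard error formula sqrt(p̂(1-p̂)/n) is assumed from the standard formula sheet, and the critical value comes from Python's NormalDist rather than the t table.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 120, 300   # hypothetical sample
confidence = 0.95

p_hat = successes / n                                      # statistic
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # critical value
standard_error = sqrt(p_hat * (1 - p_hat) / n)

margin_of_error = z_star * standard_error
interval = (p_hat - margin_of_error, p_hat + margin_of_error)

print(z_star, margin_of_error, interval)
```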
when running a chi-squared test, what three things must you report?
-test statistic (x^2), degrees of freedom (df) and p-value (p)
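The first two of these can be computed by hand. The sketch below does so for a goodness-of-fit test with made-up counts, using the standard formula χ² = Σ(observed - expected)²/expected, which this guide does not state explicitly. The p-value is left to the calculator or a chi-squared table.

```python
# Hypothetical goodness-of-fit test: are 100 observations spread
# evenly across 5 categories?
observed = [18, 22, 20, 25, 15]

total = sum(observed)
expected = [total / len(observed)] * len(observed)  # equal split: simple division

# Chi-squared statistic = sum of (observed - expected)^2 / expected.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom for goodness of fit = categories - 1.
df = len(observed) - 1

print(chi_squared, df)
```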
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: Do the calculation--Significance Tests
-test statistic (z, t, or x^2) -degrees of freedom (t and x^2 only) -p-value
Be able to discuss generalizability
-the extent to which the results of a sample (or experimental group) can be applied to a certain population -BIAS can hurt (or even eliminate) generalizability. You need RANDOMNESS to avoid this! -for example, a study that consists of volunteers should only be generalized to those volunteers! You might be able to generalize to "people who are similar to the volunteers," but absolutely no further, because they weren't randomly selected! -even a relatively small size (not ridiculously small, but somewhat small) can be valid as long as it's random!
Independence condition for both proportions and means:
-to meet the condition: ensure that the sample is less than 10% of the population -ensures that the spread of the sampling distribution is appropriate for inference
Unbiased Estimator condition for both proportions and means:
-to meet the condition: ensure that the sample is randomly selected -ensures the center of the sampling distribution is appropriate for inference.
how to use Z-scores to calculate the percentage of data points above, below, or between certain boundaries
-works only for normally-distributed data! do not do these procedures if you do not know that your data is normally distributed! -with Z-table: -Z-table gives the percentage of values below a given z-score -can use the z-table backwards--if you know the percentage, find it on the z-table, and see what z-score it equates to! -with calculator: -normalcdf (if looking for percentage or probability) -Invnorm (if given percentile or probability) -to adequately show work, you must write anything you type into the calculator.
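Python's statistics.NormalDist can play the role of normalcdf and invNorm. This is a sketch with a hypothetical normal distribution (mean 100, standard deviation 15):

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)  # hypothetical normal distribution

# Like normalcdf: proportion of values below, between, or above boundaries.
below_115 = dist.cdf(115)
between_85_and_115 = dist.cdf(115) - dist.cdf(85)
above_130 = 1 - dist.cdf(130)

# Like invNorm: the value at a given percentile.
cutoff_90th = dist.inv_cdf(0.90)

print(below_115, between_85_and_115, above_130, cutoff_90th)
```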
PROCEDURES FOR HYPOTHESIS TESTS & CONFIDENCE INTERVALS: state your conclusion-significance tests
-State whether p < α (reject) or p > α (fail to reject) -give the consequences in context -chi-squared: you might be asked to perform a follow-up analysis to see where the biggest gaps between observed and expected values are -REJECT: "Because p < α, Ho is rejected. There is statistically significant evidence to suggest (whatever Ha was)." -FAIL TO REJECT: "Because p > α, the test fails to reject Ho. There is NO statistically significant evidence to suggest (whatever Ha was)."
how do you analyze the least-squares regression line (line of best fit)?
-ŷ = mx + b -ŷ is the predicted value of y for a given value of x -interpretation of slope: y is predicted to change by (slope) for each (1 unit) increase of x -interpretation of y-intercept: when x=0, y is predicted to be (y-intercept) -r^2 value (coefficient of determination): the % of the variation in y that can be accounted for by its linear relationship with x -extrapolation: using a regression model to predict a y-value with an x-value that is outside the interval of x-values you have. Do not do this!
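To see where the slope, intercept, and r^2 come from, here is a sketch that fits a least-squares line by hand on a tiny made-up data set. It uses the standard formulas slope = Sxy/Sxx and intercept = ȳ - slope·x̄, which are assumed here rather than stated in this guide.

```python
from math import sqrt

# Hypothetical bivariate data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Residual = actual y - predicted y, for each x in the data.
predicted = [slope * xi + intercept for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

print(slope, intercept, r_squared)
print(residuals)
```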
suppose you want to avoid a type I error as much as possible. Should you use a significance level of .10, .05, or .01?
.01, because a lower alpha level means a lower chance of rejecting Ho (and thus a lower chance of doing so incorrectly, which is what a type I error is).
for continuous random variables, what is the probability of getting exactly one given outcome?
0 (think of it as 1/∞)
What type of confidence interval to calculate when estimating a population proportion
1-proportion Z interval
when testing a claim about a population proportion use
1-proportion Z test
What type of confidence interval to calculate when estimating a population mean and the population standard deviation is NOT known
1-sample T interval
when testing a claim about a population mean and the population standard deviation is NOT known use
1-sample T test
What type of confidence interval to calculate when estimating a population mean and the population standard deviation is known (RARE)
1-sample Z interval
when testing a claim about a population mean and the population standard deviation is known (RARE) use
1-sample Z test
How is power calculated?
1-β
What is the four-step process of statistical inference (in this case, creating a confidence interval)
1. State what you're doing 2. Check conditions 3. Run the procedure 4. Draw the appropriate conclusion
what is the expected value of a geometric random variable
1/p (if n=1/p, then np=1)
when testing a claim about the difference between two population proportions use
2-proportion Z test
What type of confidence interval to calculate when estimating the difference between two population means and the population standard deviations are NOT known
2-sample T interval
when testing a claim about the difference between two population means and the population standard deviations are NOT known.
2-sample T test
What type of confidence interval to calculate when estimating the difference between two population means and the population standard deviations are known (RARE)
2-sample Z interval
when testing a claim about the difference between two population means and the population standard deviations are known (RARE) use
2-sample Z test
Significance levels (alpha-levels) determine the p-value below which a test's result should be considered significant. If no alpha level is given, it is a good general rule to use _______________.
5% or .05
When calculating the probability of getting more than one outcome for a given event, what formula should you use?
P(A∪B) = P(A) + P(B) - P(A∩B) -In layman's terms, P(A or B) = P(A) + P(B) - P(Both)
Define Treatment
A condition (or lack thereof) imposed upon a subject.
How is a sample measured?
A sample is measured using a sampling technique, of which there are several.
Define Confounding
A situation in which an outside variable is tied to both the explanatory and response variables, to the point where it is impossible to tell whether the observed response is being caused by the explanatory variable, the outside ("lurking") variable, or some combination of the two.
The 5 things you should discuss when analyzing a distribution of data:
Center, Shape, Spread, Outliers, and Context (CSOCS)
Why do we often measure samples instead of populations?
Collecting data from an entire population is often impossible, infeasible, or beyond the capability of the resources we have available. Samples can supply good estimates of the population, and are much easier to collect data for.
How do we calculate the degrees of freedom of a t-distribution?
DF= n-1
What does a confidence interval allow us to do?
Estimate the value of a population parameter, using a sample statistic.
What is the difference between an Experiment and an Observational Study? Which one lets us establish cause-and-effect relationships?
Experiments assign treatments or conditions to subjects, while observational studies simply observe what is already happening. Only well-run experiments allow us to make claims of cause-and-effect relationships.
If dealing with a t-distribution and your sample size is not 30 or more, what other methods can you use to check for normality?
Graph the sample data, and check the graph for any heavy skews or outliers. If it has neither, proceed.
what is a type II error? Also indicate which variable is used to express its probability.
Ho is not rejected, but in reality, it was false, and should have been rejected. Represented by Beta (β) (the conclusion suggests nothing is happening, but in reality, it is).
what is a type I error? Also indicate which variable is used to express its probability.
Ho is rejected, but in reality, it was true and should not have been rejected. Represented by ALPHA (⍺) (the conclusion suggests something is happening, but in reality it isn't.)
How do we interpret a confidence level?
If all possible samples of size n were taken, and a ________% confidence interval was generated for each one, _______% of those intervals would contain the true parameter.
Why is comparison (either with a control group OR a second treatment group) important?
It allows us to see whether (or to what extent) the explanatory variable is responsible for the observed response.
In experiments, it is important to control for outside factors or variables. What does this mean?
It means mitigating the influence of outside factors as much as possible. To do this, treatment groups should be as similar as possible, with the only substantial difference between the groups being the treatment itself.
what does spread tell us about our data?
It tells us how far apart or clustered together the data points are from the center.
Conditions for linear regression
L: Linear I: Independent (or use 10% rule) N: Normal Distribution of Residuals E: Equal variance of y-values (can think of this as even scatter) R: Randomization
How do you calculate the expected value of a discrete random variable?
MULTIPLY the value of each outcome by its probability, then ADD the results together.
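A short sketch of the multiply-then-add rule, using a made-up probability distribution:

```python
# Hypothetical discrete random variable: outcomes and their probabilities.
values = [0, 1, 2, 3]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Multiply each outcome by its probability, then add the results.
expected_value = sum(v * p for v, p in zip(values, probabilities))

print(expected_value)
```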
Multistage Sample
More than one sampling technique is utilized in order to obtain the sample.
X and Y are correlated. Does this mean that X causes Y? If not, what else might be going on?
No. Correlation does not imply causation. It could be reverse causation (y causes x), it could be a lurking variable (z influences both x and y), or it could simply be a coincidence (a spurious correlation).
What are the two types of hypotheses used in significance tests, and what symbols do we use to represent them?
Null hypothesis (Ho) and Alternate hypothesis (Ha)
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a simple random sample
Number all 1800 students, then select 20 random numbers (skipping repeats). The students corresponding with the selected #s will be on the committee.
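In code, this procedure might look like the sketch below. The seed is fixed only so the example is repeatable; random.sample never picks the same number twice, which takes care of skipping repeats.

```python
import random

rng = random.Random(1)  # fixed seed, only so the example is repeatable

student_ids = range(1, 1801)  # students numbered 1 to 1800

# Select 20 distinct student numbers at random.
committee = rng.sample(student_ids, 20)

print(sorted(committee))
```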
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a...systematic random sample
Number all 1800 students. Randomly select one number from 1 to 90. The student corresponding to that number will be on the committee, as will every 90th student thereafter (90 is used because 1800/20 = 90)
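The same steps can be sketched in code (the seed is fixed only so the example is repeatable):

```python
import random

population_size = 1800
sample_size = 20
step = population_size // sample_size  # 1800 / 20 = 90

rng = random.Random(0)  # fixed seed, only so the example is repeatable

# Randomly pick the starting student from 1 to 90...
start = rng.randint(1, step)

# ...then take every 90th student after that.
selected = list(range(start, population_size + 1, step))

print(selected)
```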
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a stratified random sample
Number all male students, then number all female students. Randomly select 10 of the numbers assigned to the male students and 10 of the numbers assigned to the female students.
Cluster Sample
One or more subgroups ("clusters") that are reasonably representative of the population are selected and used as the sample.
When calculating the probability of multiple events, what rule or formula should you use?
P(event 1) x P(event 2) x P(event 3)...
what must we do if the conditions are not met?
PROCEED WITH CAUTION!
Simple Random Sample (SRS)
Randomly select n subjects from the population. Every subject has an equal chance of being selected.
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a...multistage sample
Randomly select one homeroom from each grade level (cluster), then select 5 students from each (SRS).
What is the difference between sampling error and sampling bias?
Sampling error is an unavoidable phenomenon, because no sample can perfectly represent the population. Sampling bias, however, is a systematic and avoidable failure of the sample to at least be reasonably representative of the population.
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a...cluster sample
Select a group of 20 students that is reasonably representative of the school (for example, an elective class that has a mix of student ages and abilities).
A principal wants to create an advisory committee of 20 randomly-selected students out of the 1,800 students at their high school. Describe how he could do so using a...convenience sample
Send out an invitation to join the committee and select the first 20 students who respond.
Stratified Random Sample
Separate the population into sub-groups (strata), then randomly select a certain number of subjects from each group. Helps ensure important sub-groups are represented. *Stratifying will reduce variability of possible sample results!
How can a small sample size affect the validity of the sample?
Small sample sizes will result in high sampling error, thus reducing the usefulness and applicability of the data.
Convenience Sample
Subjects are selected based on how easy they are to obtain. Almost always leads to biased results and should not be used if at all possible.
When analyzing a series of multiple events, each with multiple possible outcomes, what visual aide will be helpful?
TREE DIAGRAM!
Systematic Random Sample
The first subject is selected randomly. Then, every "nth" subject thereafter will be sampled. The interval at which each subsequent subject will be selected is determined in advance.
Important information about p-values
The p-value is ALWAYS between 0 and 1. If your calculator gives something other than this, I guarantee there will be an E at the end. This represents scientific notation. This means your p-value is very small (in fact, many statisticians just write "P<0.001" and call it a day). As far as we're concerned, p-values this low will always be significant!
Loaded questions
When a question is designed in such a way to elicit a certain response, rather than seeking to gather accurate information.
Undercoverage
When one or more sub-groups within a population are not adequately represented within the sample. This is particularly important if there is a meaningful difference between which groups are represented and which aren't, with regard to the research question.
Voluntary response bias
When subjects are allowed to choose whether or not they are in the sample. This tends to cause subjects with strong opinions to be overrepresented in the sample.
Nonresponse bias
When the attempt is made to include all relevant sub-groups, but some groups fail to respond or provide meaningful data. Again, particularly important if there is a meaningful difference between which groups are providing responses and which aren't.
False answers
When the wording or nature of a question is such that subjects are unwilling or unable to answer truthfully, and thus provide inaccurate or estimated data.
what is an outlier?
a data point that is far lower or far higher than the others in the data set, to the point that it single-handedly changes how we analyze the entire data set.
Control group
a group of experimental units that receive no active treatment. This allows us to compare the effects of the active treatments to the effect of no treatment at all, which helps us see the extent to which the active treatments are actually influencing the response variable.
what is the difference between a parameter and a statistic?
a parameter is a measurement from a population; a statistic is a measurement from a sample
How is a population measured?
a population is measured by a census.
What is the difference between a proportion and a mean?
a proportion measures a fraction or percentage out of a total; a mean measures an average value
Matched Pairs Design
a special case of blocking; subjects are paired up with a similar subject, and one member from each pair is randomly assigned to each treatment group. OR, each subject is assigned to BOTH treatments (in which case each subject is their own matched pair), with the ORDER of treatments being assigned randomly.
Double-blind study
a study in which neither the subjects nor those administering treatment to the subjects know which treatment each subject is receiving. This adds a further layer of protection against confounding by ensuring the researcher bias does not taint the results.
Blind study
a study in which subjects do not know which treatment they are receiving. This is advantageous because it helps combat sources of confounding (for example, the Placebo Effect).
Completely Randomized Design
all subjects are taken and divided randomly into treatment groups. Every subject has the same chance of being in each group. Analogous to SRS in sampling.
Bias
anything that causes a sample to be not representative of the population of interest.
uniform distribution
approximately equal mean and median
how do you interpret a p-value? what does that number mean?
assuming Ho and the probability model are true, the p-value is the probability of getting a sample result as extreme as or more extreme than the one obtained.
why can two events that are mutually exclusive never be independent?
because if the first event happens, the probability that the second event will happen is now 0. This change in probability for the second event means that the events are not independent.
how do you analyze (interpret the results of) a test for which the p-value is less than alpha? What would you write?
because p<⍺, Ho is rejected. There is statistically significant evidence to suggest that Ha is true.
How do you analyze a test for which the p-value is greater than alpha? What would you write?
because p>⍺, the test fails to reject Ho. There is not statistically significant evidence to suggest that Ha is true.
how do we calculate degrees of freedom for a goodness-of-fit test
categories-1
what is quantitative data?
data that consists of numerical measurements
what is categorical data?
data that consists of counts from various categories
what are independent events?
events for which the occurrence of one does not affect the probability of the other. P(A)=P(A|B)
what is a residual?
for a given x-value, it is the difference between what y is and what y was predicted to be.
when do we use Chi-squared tests?
for categorical variables, chi-squared tests allow us to measure the extent to which the observed counts differ from the expected counts.
how can power be increased? List 3 ways.
increase sample size, increase alpha, and increase the distance between Ho and the true parameter.
What is a null hypothesis, and what does the null hypothesis always assume to be true?
it assumes there is no difference or no change from the proposed parameter in one-sample tests. For two-sample tests, it assumes that the two parameters are equal to each other (i.e. no difference)
what information does a residual plot give you?
it helps show whether the regression equation is a good fit for the data. If the residual plot is scattered, the regression is likely a good fit. But if there is a noticeable pattern in the residuals, the regression is not an ideal fit.
What is an alternative hypothesis? What are the 3 types of alternative hypotheses you could have?
it is a competing claim made in contrast to Ho. Ha can be >, <, or ≠Ho
What is the definition of expected value?
it is the average value of the variable over many trials.
What is conditional probability?
it is the probability that an outcome will occur, given that a previous outcome has occurred.
What does shape tell us about our data?
it tells us the patterns or trends taking place within the data set.
What does the center tell us about our data?
it tells us the value of the "average" or "typical" data within our data set.
Remember that the sample statistic ("point estimate") is in the center of the confidence interval and that the distance between the sample statistic and the ends of the confidence interval is the ___________________.
margin of error
What happens to the margin of error when there is an increase in sample size
margin of error decreases
What happens to the margin of error when the confidence level decreases
margin of error decreases (confidence level)
What happens to the margin of error when there is a decrease in sample size
margin of error increases
What happens to the margin of error when the confidence level increases
margin of error increases (confidence level)
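The four margin-of-error facts above can be checked numerically. This sketch uses a 1-proportion z interval with hypothetical values; margin_of_error is a helper name made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence):
    """Critical value times standard error for a 1-proportion z interval."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * sqrt(p_hat * (1 - p_hat) / n)


me_small_n = margin_of_error(0.5, 100, 0.95)
me_large_n = margin_of_error(0.5, 400, 0.95)    # larger sample size
me_high_conf = margin_of_error(0.5, 100, 0.99)  # higher confidence level

print(me_small_n, me_large_n, me_high_conf)
```

Quadrupling the sample size cuts the margin of error in half, because n sits under a square root.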
Normal distribution
mean and median are approximately equal
Skewed right distribution
mean is greater than the median
When do we use X^2 Goodness of Fit
measuring one categorical variable
when do we use X^2 test of independence?
measuring the association between two variables with one sample
when do we use X^2 Test of Homogeneity
measuring the distribution of a variable across several samples or treatment groups
Skewed left distribution
median is greater than the mean
power and type ii error always go the _________ direction
opposite
what are mutually exclusive outcomes?
outcomes that cannot happen at the same time.
what is the law of large numbers?
over a very large number of trials, the proportion of trials that result in the desired outcome will approach the value of the outcome's theoretical probability.
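A short simulation sketch of the law of large numbers, assuming a fair coin (p = 0.5) and a fixed random seed so the run is repeatable. The function name is made up for this example.

```python
import random


def simulated_proportion(p, n_trials, seed=0):
    """Simulate n_trials independent trials and return the proportion of successes."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials


# The proportion tends to settle closer to 0.5 as the number of trials grows.
for n in (10, 100, 1_000, 100_000):
    print(n, simulated_proportion(0.5, n))
```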
when testing a claim about a study or experiment that utilizes matched pairs use
paired T-test (1-sample T test)
what is the definition of power?
power is the probability of rejecting Ho, given that Ho is false.
What is replication, and how can we make sure that our sample has it?
replication happens when the same (or similar) results are seen across multiple trials. Having a large sample size helps ensure replication is present, so that we can be more sure that the observed results aren't from random chance or coincidence.
Conditional Probability (A given B)
rule: probability of both events/probability of given event formula: P(A|B) = P(A∩B)/P(B)
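A worked example of the formula, sketched in Python with exact fractions. It assumes a standard 52-card deck with 12 face cards (J, Q, K), 4 of which are kings; the event names are chosen only for illustration.

```python
from fractions import Fraction


def conditional_probability(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a_and_b / p_b


p_face = Fraction(12, 52)          # P(B): card is a face card
p_king_and_face = Fraction(4, 52)  # P(A and B): card is a king (every king is a face card)

p_king_given_face = conditional_probability(p_king_and_face, p_face)
print(p_king_given_face)  # 1/3
```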
what is the difference between a sample distribution and a sampling distribution?
A sample distribution is a graph of data taken from one sample. A sampling distribution is a graph of statistics taken from multiple samples.
shape of a geometric distribution
shape is always skewed right
equation for a confidence interval for the slope of a linear regression
slope ± critical value • SE Coef (b ± t* •SE)
How are variance and standard deviation related?
standard deviation is the square root of variance; variance is standard deviation squared (Variance = σ², thus σ = square root of variance)
Central Limit Theorem (CLT)
states that, if the sample size is large enough, the sampling distribution will be approximately normal, and the center (mean) of the distribution will be equal to the true parameter.
Randomized Block Design ("Blocking")
subjects are sorted into homogeneous groups (blocks) based on certain characteristics (such as gender, age, prior experience, etc.). The subjects in each block are then randomly assigned to treatment groups, to ensure that each treatment group has representation from each block. Analogous to stratified sampling.
what two (for t-tests, three) things should you report after running a significance test in your calculator?
test statistic (z or t), degrees of freedom (df, for t only), and p-value (p).
If you adjust the sample size, how does the confidence interval change?
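the width of the confidence interval changes by the square root of that factor, in the opposite direction (since n is inside the square root in the denominator of all standard error formulas). For example, quadrupling the sample size cuts the margin of error in half.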
What is a sampling distribution?
the distribution of all possible sample results, from a given sample size, for a certain population
Define Experimental Units (Subjects when human)
the individuals upon which treatments are being imposed
How does the process for calculating the probability of multiple events change when the events are dependent?
the probability of each event will change with each trial. Exponents can no longer be used to represent the probability of multiple events.
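A small Python sketch contrasting the two cases, assuming the task is to draw two aces in a row from a standard 52-card deck (an illustrative scenario, not from the guide). With replacement the draws are independent and an exponent works; without replacement each probability changes.

```python
from fractions import Fraction

# Independent draws (card is replaced): same probability each time, so use an exponent.
with_replacement = Fraction(4, 52) ** 2

# Dependent draws (card is not replaced): one fewer ace and one fewer card on draw 2.
without_replacement = Fraction(4, 52) * Fraction(3, 51)

print(with_replacement)     # 1/169
print(without_replacement)  # 1/221
```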
How do you know if the transformation has achieved linearity?
the transformed graph will appear linear, and the residual plot will show no pattern.
Why is it important to randomize the assignment of treatments?
this ensures that any lurking variables have an equal chance of affecting all treatment groups, thus minimizing the risk of results being compromised by confounding.
when power decreases ...
type I error decreases and type II error increases
when power increases ...
type I error increases and type ii error decreases
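The trade-off between significance level, power, and Type II error can be sketched numerically. This example assumes a one-sided z-test for a mean (Ha: μ > μ0) with a known σ, which is a simplification; the function name and all numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(mu0, mu_true, sigma, n, alpha):
    """Probability of rejecting Ho: mu = mu0 (Ha: mu > mu0) when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # Smallest sample mean that leads to rejecting Ho at significance level alpha.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Chance that the sample mean lands above that cutoff under the true mean.
    return 1 - NormalDist(mu_true, se).cdf(critical_mean)


power_05 = power_one_sided_z(100, 103, 10, 25, alpha=0.05)
power_10 = power_one_sided_z(100, 103, 10, 25, alpha=0.10)

# Type II error probability is 1 - power.
print(power_05, 1 - power_05)
print(power_10, 1 - power_10)
```

Raising alpha from 0.05 to 0.10 (more Type I error) gives higher power and a smaller Type II error probability, matching the cards above.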
When, and how, do you use the combinations (nCr) function in your calculator?
use it when one outcome has multiple sequences of events. For example, if you are running 4 trials and want the probability of 3 successes, there are four possible combinations for that outcome (123, 124, 134, 234). In your calculator, input nCr(trials, successes). For instance nCr(4,3) will equal 4.
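The same count can be reproduced outside the calculator. In this Python sketch, `math.comb` plays the role of nCr, and `itertools.combinations` lists the actual sequences; the `binomial_probability` helper and the value p = 0.5 are assumptions added for illustration.

```python
from itertools import combinations
from math import comb

n_trials, n_successes = 4, 3

# Equivalent of nCr(4, 3) on the calculator.
print(comb(n_trials, n_successes))  # 4

# Which trials are the successes: (1,2,3), (1,2,4), (1,3,4), (2,3,4)
sequences = list(combinations(range(1, n_trials + 1), n_successes))
print(sequences)


def binomial_probability(n, k, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


print(binomial_probability(4, 3, 0.5))  # 4 sequences, each with probability 0.5**4
```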
discrete random variable
variable whose possible outcomes can be counted or listed (usually a finite number, such as the number of successes in n trials).
continuous random variable?
variable that can take any value within an interval, so it has infinitely many possible outcomes that cannot be listed (such as height or time).
bimodal distribution
varies (can be symmetric OR skewed)
when do we use a t-distribution?
when doing inference for a population mean, when the population standard deviation (σ) is unknown (which will almost always be the case).
Placebo effect
when subjects demonstrate a real response from a fake or ineffective treatment, because they believe the treatment to be real or effective. To prevent this, subjects should not be allowed to know which treatment they are receiving, so that this "belief" in the treatment can affect all treatment groups equally.
explanatory and response variables (which one is x and which one is y)
x is the explanatory variable, y is the response variable
how do you calculate a residual?
y-ŷ
How to design a random sampling procedure
-Random number generator will be your friend! -"Describe a method..." (NOTE: blanks will be filled in with the context of the problem!) (1) START WITH: Assign each _________ (unit, subject, etc.) a different number between ____ and _____. (2) Describe how you will implement the sampling method you want to use. (3) Randomly select ________ numbers, ignoring repeats, and include the _________ (unit, subject, etc.) that corresponds with those numbers in your sample.
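The procedure above can be sketched in Python for a simple random sample. The roster of 30 students, the sample size of 5, and the function name are all hypothetical; `random.sample` picks without replacement, which corresponds to "ignoring repeats".

```python
import random


def simple_random_sample(units, sample_size, seed=None):
    """Number each unit 1..N, randomly pick sample_size distinct numbers, return those units."""
    rng = random.Random(seed)
    # Step 1: assign each unit a different number between 1 and N.
    numbered = {number: unit for number, unit in enumerate(units, start=1)}
    # Steps 2 and 3: randomly select distinct numbers (no repeats possible).
    chosen_numbers = rng.sample(range(1, len(units) + 1), sample_size)
    return [numbered[number] for number in chosen_numbers]


students = [f"student_{i}" for i in range(1, 31)]  # hypothetical roster
sample = simple_random_sample(students, 5, seed=42)
print(sample)
```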
If X has a binomial distribution with parameters n and p, then...
μp̂ = p and σp̂ = √(p(1-p)/n)
If x̄ is the mean of a random sample of size n from an infinite population with mean μ and standard deviation σ, then...
μx̄ = μ and σx̄ = σ/√n
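A quick numeric check of the two standard-deviation formulas for sampling distributions, σp̂ = √(p(1-p)/n) and σx̄ = σ/√n. The function names and input values are illustrative assumptions.

```python
from math import sqrt


def sd_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of p-hat."""
    return sqrt(p * (1 - p) / n)


def sd_sample_mean(sigma, n):
    """Standard deviation of the sampling distribution of x-bar."""
    return sigma / sqrt(n)


print(sd_sample_proportion(0.5, 100))  # about 0.05
print(sd_sample_mean(10, 25))          # 2.0
```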
standard deviation of a geometric distribution
σ = √(1-p) / p (the square root covers only 1-p, not the p in the denominator)
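A small Python sketch of the geometric distribution's standard deviation, using the standard formula σ = √(1-p)/p. It assumes X counts the number of trials up to and including the first success, so the mean is 1/p; p = 0.25 is an illustrative value.

```python
from math import sqrt


def geometric_mean(p):
    """Expected number of trials until the first success."""
    return 1 / p


def geometric_sd(p):
    """Standard deviation of the number of trials until the first success."""
    return sqrt(1 - p) / p


p = 0.25
print(geometric_mean(p))  # 4.0
print(geometric_sd(p))    # about 3.46
```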