Stats 101 Midterm 1
null hypothesis (H0)
"There is nothing going on": the observed difference in proportions between the treatment and control groups is simply due to chance; the variables are independent
alternative hypothesis (Ha)
"There is something going on": the observed difference in proportions between the treatment and control groups is not due to chance; the variables are dependent
complementary events
2 disjoint events whose probabilities add up to 1, e.g. a coin landing on heads and the same coin landing on tails
68-95-99.7 rule
for a normal distribution, about 68% of observations fall between µ-σ and µ+σ, about 95% fall between µ-2σ and µ+2σ, and about 99.7% fall between µ-3σ and µ+3σ
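The three percentages in the rule can be checked numerically. The sketch below assumes a perfectly normal distribution and uses the standard library's `NormalDist`; the helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

def within_k_sd(k: float) -> float:
    """Probability that a normal observation falls within k SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {within_k_sd(k):.4f}")
```

The printed values should land close to 0.68, 0.95, and 0.997; real data will only match this to the extent that it is actually normal.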
population parameters
a quantity or statistical measure that, for a given population, is fixed and that is used as the value of a variable in some general distribution or frequency function to make it descriptive of that population
dependent variable
aka associated variables: two variables that show some connection with one another; the association can be either positive or negative
disjoint event
aka mutually exclusive; events cannot happen at the same time ○ e.g. a voter cannot be registered as a Democrat and a Republican at the same time ○ If A and B are disjoint, P(A and B) = 0 and P(A | B) = 0
Simpson's paradox
an association or comparison that holds when we compare two groups can disappear, or even be reversed, when the original groups are broken down into smaller groups according to some other feature (a confounding/lurking variable) ○ It is possible to derive two different conclusions, depending on how you break up the data ○ Must take a step back to find the root cause
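A small numeric sketch can make the reversal concrete. All counts below are invented purely for illustration (they come from no real study), and the names `counts`, `rate`, and `overall_rate` are hypothetical.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
counts = {
    "small cases": {"A": (9, 10), "B": (80, 100)},
    "large cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes: int, trials: int) -> float:
    """Success proportion for one treatment within one subgroup."""
    return successes / trials

def overall_rate(treatment: str) -> float:
    """Success proportion for a treatment after pooling all subgroups."""
    successes = sum(group[treatment][0] for group in counts.values())
    trials = sum(group[treatment][1] for group in counts.values())
    return successes / trials

for name, group in counts.items():
    print(name, "A:", rate(*group["A"]), "B:", rate(*group["B"]))
print("overall", "A:", overall_rate("A"), "B:", overall_rate("B"))
```

In these made-up numbers treatment A does better inside each subgroup, yet B looks better once the subgroups are pooled, because case size (the lurking variable) is unevenly split between the treatments.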
law of large numbers
as more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome (the larger the sample, the more likely the observed proportion is close to the actual probability) ○ e.g. it is highly unlikely for a fair coin to land on heads exactly 3 times out of 1000 flips, while it is far more likely for it to land on heads exactly 3 times out of 10 flips
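The convergence can be watched in a quick simulation. This is a minimal sketch assuming a fair coin (p = 0.5) and a fixed random seed so the run is repeatable; `proportion_heads` is a hypothetical helper name.

```python
import random

def proportion_heads(n_flips: int, p: float = 0.5, seed: int = 0) -> float:
    """Simulate n_flips coin flips and return the observed proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n_flips))
    return heads / n_flips

for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```

The observed proportion for small n can sit far from 0.5, while for large n it should settle close to it.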
block
block on variables (potential confounders) known or suspected to affect the outcome; blocking is to random assignment what stratifying is to random sampling
double-blind
both the experimental units and the researchers don't know the group assignment
marginal probabilities
calculated in margins of frequency tables; P(A), P(A-bar), P(B), P(B-bar)
replicate
collect a sufficiently large sample, or replicate the entire study
sample space
collection of all possible outcomes of a trial, e.g. the sample space for the sexes of the kids of a couple who has two kids: S = {MM, FF, FM, MF}
control
compare treatment of interest to a control group
z-score
creates a common scale for assessing data without worrying about the specific units in which it was measured: Z = (obs - mean) / SD
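The formula above translates directly into code. The test scores, means, and SDs below are hypothetical values chosen only to show how two differently scaled measurements become comparable.

```python
def z_score(obs: float, mean: float, sd: float) -> float:
    """Number of standard deviations that obs lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (obs - mean) / sd

# Hypothetical scores on two exams with different scales.
exam_a = z_score(obs=1800, mean=1500, sd=300)
exam_b = z_score(obs=24, mean=21, sd=5)
print(exam_a, exam_b)
```

Whichever score has the larger Z sits further above its own mean, regardless of the units of the original exams.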
continuous probability distribution
different from discrete in that: • Probability of a continuous random variable being equal to any specific value = 0 • Can't be expressed in tabular form • We use a probability density function (pdf) to describe its distribution • We can calculate the probability for ranges of values the random variable takes as the area under the curve
multistage sample
divide the population into heterogeneous clusters; randomly sample a few clusters; randomly sample within these chosen clusters
cluster sample
divide the population into heterogeneous clusters; randomly sample a few clusters; sample all observations within the chosen clusters
stratified sample
divide the population into homogeneous strata; randomly sample from within each stratum
simple random sample
drawing names from a hat; each case equally likely to be selected
center
estimate of a typical observation in the distribution: mean, median, mode
non-disjoint event
events can happen at the same time ○ e.g. a voter can be both a Republican and a moderate at the same time ○ If A and B are non-disjoint, P(A and B) =/= 0
blinding
experimental units don't know which group they're in
relative frequency segmented bar plot
explore relationships between variables
confounding variable
extraneous variables that affect both the explanatory and the response variable, making it seem like there is a relationship between them
placebo
fake treatment, used as control group for medical studies
bar plot
for categorical variables, whereas histograms are for numerical variables. ordering of bars is interchangeable in bar plots, while x-axis is a number line for histograms
inference (soup example)
generalize and conclude that your entire soup needs salt
non-response bias
if only a non-random fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population, e.g. randomly select 300 households and mail them polls to return, but not everyone returns them
Bayesian Interpretation of inference
interprets probability as a subjective degree of belief; uses different conditional probability: P(hypothesis | observed data)
Bayes' theorem rewritten
P(A and B) = P(A | B) * P(B), the general multiplication rule, obtained by rearranging P(A | B) = P(A and B) / P(B)
joint probabilities
P(A and B), P(A and B-bar), P(A-bar and B-bar), P(A-bar and B)
general addition rule
P(A or B) = P(A) + P(B) - P(A and B)
addition rule for disjoint event
P(A or B) = P(A)+ P(B) because P(A and B)=0
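Both addition rules above fit in one small function, since the disjoint case is just P(A and B) = 0. The sketch below uses a standard 52-card deck as the example and exact fractions to avoid rounding; `p_a_or_b` is a hypothetical helper name.

```python
from fractions import Fraction

def p_a_or_b(p_a: Fraction, p_b: Fraction, p_a_and_b: Fraction = Fraction(0)) -> Fraction:
    """General addition rule; leave p_a_and_b at 0 for disjoint events."""
    return p_a + p_b - p_a_and_b

# Non-disjoint: a card can be both a jack and red (2 red jacks in the deck).
p_jack_or_red = p_a_or_b(Fraction(4, 52), Fraction(26, 52), Fraction(2, 52))

# Disjoint: a single card cannot be both a jack and a queen.
p_jack_or_queen = p_a_or_b(Fraction(4, 52), Fraction(4, 52))

print(p_jack_or_red, p_jack_or_queen)
```

Subtracting P(A and B) keeps the two red jacks from being counted twice.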
Bayes' theorem
P(A | B) = P(A and B) / P(B) = P(B | A) * P(A) / P(B)
conditional probabilities
P(A|B), P(A|B-bar), P(A-bar|B-bar), P(A-bar|B), P(B|A), P(B-bar|A), P(B-bar|A-bar), P(B|A-bar)
posterior probability
P(hypothesis | observed data) = P(hypothesis & data)/P(data) = (P(data | hypothesis) * P(hypothesis))/P(data)
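The posterior formula above can be sketched for the simple case of one hypothesis versus its complement, where P(data) is expanded over both. The prior and the two likelihoods in the example are hypothetical numbers, and `posterior` is a made-up function name.

```python
def posterior(prior: float, p_data_given_h: float, p_data_given_not_h: float) -> float:
    """P(hypothesis | data), with P(data) expanded over hypothesis and not-hypothesis."""
    p_data = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / p_data

# Hypothetical inputs: 1% prior, data seen 90% of the time if the hypothesis
# is true and 5% of the time if it is false.
print(posterior(prior=0.01, p_data_given_h=0.90, p_data_given_not_h=0.05))
```

Even with data that strongly favors the hypothesis, a small prior keeps the posterior well below 1.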
dependent event
The probability of A depends on whether B occurred, and vice versa ○ If A and B are dependent, P(A | B) =/= P(A), and P(A and B) = P(A | B) * P(B)
discrete probability distribution
lists all possible events and the probabilities with which they occur • Events must be disjoint • Each probability must be between 0 and 1 • Probabilities must add up to 1 • e.g. binomial distribution
probability distribution
lists all possible outcomes in the sample space, and the probabilities with which they occur • Events listed must be disjoint • Each probability must be between 0 and 1 • The probabilities must add up to 1
spread
measure of variability in the distribution: standard deviation, variance, IQR, range
response variable
measures an outcome of a study
outlier
an observation more than 1.5 * IQR below Q1 or above Q3
P(k successes in n trials)
n!/(k! * (n−k)!) * p^k * (1−p)^(n−k); describes the probability of having exactly k successes in n independent trials, each with probability of success p
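The formula above maps onto `math.comb` from the standard library, which computes n!/(k! * (n−k)!). This is a minimal sketch; the function name `binomial_pmf` and the coin-flip example are illustrative choices.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: exactly 3 heads in 10 flips of a fair coin.
print(binomial_pmf(k=3, n=10, p=0.5))

# Sanity check: the probabilities over all possible k add up to 1.
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))
```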
Calculating expected value using binomial distribution
n*p = µ
types of biases
non-response, voluntary response, convenience
robust statistics
not easily affected by outliers and extreme skew • Mean and SD are easily affected by extreme observations since the value of each data point contributes to their calculation • Mean and SD are good for describing symmetric distributions • Median and IQR are more robust • Therefore, choose median and IQR when describing skewed distributions
k
number of successes
n
number of trials
unusual observations
observations that stand out from the rest of the data that may be suspected outliers
unusual observations (z score)
observations with |Z| > 2 are considered unusual
voluntary response bias
occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue; such a sample will not be representative of the population, e.g. an open online poll
µ
population mean
σ
population standard deviation
categorical variable
qualitative; takes on a limited number of distinct categories, which can be labeled with numbers, but it doesn't make sense to do arithmetic operations on them • Regular categorical: levels have no inherent ordering • Ordinal: levels have inherent ordering
numerical variable
quantitative; takes on numerical values that can be added, subtracted, averaged, etc. • Continuous: can take any value in a range (not countable) • Discrete: countable (e.g. whole-number counts)
four principles of experimental design
randomize, control, block, replicate
randomize
randomly assign subjects to treatments
x̄
sample mean
s
sample standard deviation
mosaic plot
shows marginal distribution as well as conditional frequency distributions (i.e. width shows marginal distribution)
point estimates
single value used to estimate population parameter
skewness
a distribution is skewed to the side of its long tail: left skewed, symmetric, right skewed
shape
skewness, modality
Representative (soup example)
spoonful must be representative of the whole pot (aka the population) for the inference to be valid
variance
square of the SD; why square? □ To get rid of negatives so that observations equally distant from the mean are weighted equally □ To weight larger deviations more heavily
Exploratory analysis (soup example)
taste a spoonful of soup and decide the spoonful you tasted isn't salty enough
hypothesis test
test the hypothesis by assuming that the null hypothesis is true, using either simulation or theoretical methods ○ Results from simulation look like the data --> difference in proportions was due to chance; variables are independent ○ Results from simulation do not look like the data --> difference in proportions was not due to chance, but due to an actual effect of the variable; variables are dependent ○ p-value: probability of observing an outcome at least as extreme as the one observed in the original data, given that the null hypothesis is true; if this probability is low, reject the null in favor of the alternative
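The simulation approach described above can be sketched as a randomization test: shuffle the outcomes between the two groups many times (which is what "no effect" would look like) and count how often the shuffled difference is at least as large as the observed one. The group data, the one-sided direction, and the name `randomization_p_value` are all assumptions made for illustration.

```python
import random

def randomization_p_value(treatment, control, n_sims: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for the difference in proportions (treatment minus control)."""
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    n_c = len(control)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # under the null, group labels are interchangeable
        simulated = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c
        if simulated >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims

# Hypothetical outcomes (1 = success, 0 = failure), 20 subjects per group.
treatment = [1] * 14 + [0] * 6
control = [1] * 6 + [0] * 14
print(randomization_p_value(treatment, control))
```

A small p-value here means shuffled data rarely produce a difference as large as the observed one, which is the "results from simulation do not look like the data" case.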
Frequentist interpretation of inference
the probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times; uses p-value to help make decisions
explanatory variable
the variable suspected of affecting the other; usually what the researcher divides the population into first
independent variable
two variables are not associated
modality
unimodal, bimodal, uniform, multimodal
normal distribution
unimodal and symmetric (bell curve), and follows the 68-95-99.7 rule; shorthand: N(µ, σ)
normal approximation to binomial
use the expected value and SD to calculate a Z-score, then use the Z-distribution to calculate the probability, e.g. for P(K >= 70): find µ and σ --> calculate the z-score --> find P(Z > calculated z-score) using the Z-distribution
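The three steps above can be sketched and compared against the exact binomial sum. The card does not give n or p for the P(K >= 70) example, so n = 245 and p = 0.25 below are assumed values, and no continuity correction is applied since the card does not mention one.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_approx_at_least(k: int, n: int, p: float) -> float:
    """Approximate P(K >= k) for a binomial using the normal distribution."""
    mu = n * p                         # expected value
    sigma = sqrt(n * p * (1 - p))      # standard deviation
    z = (k - mu) / sigma               # z-score of the cutoff
    return 1 - NormalDist().cdf(z)     # area to the right of z

def exact_at_least(k: int, n: int, p: float) -> float:
    """Exact P(K >= k) by summing binomial probabilities."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n, p = 245, 0.25  # assumed values for illustration
print(normal_approx_at_least(70, n, p), exact_at_least(70, n, p))
```

The two results should be close but not identical; the gap is the price of replacing a discrete distribution with a continuous curve.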
Binomial distribution
used for calculating the probability of an exact number of successes in a given number of trials; each trial is a Bernoulli random variable, meaning an individual trial has only 2 possible outcomes
probability trees
useful for organizing info in conditional probability calculations, especially when you know P(A | B) and you want to find P(B | A)
segmented bar plot
useful for visualizing conditional frequency distributions
side-by-side box plot
usually compares the distribution of a numerical variable across levels of a categorical variable, e.g. number of clubs college students are involved with, by class year
random process
we know what outcomes could happen but we don't know which particular outcome will happen
sample statistics
we use sample statistics as point estimates for unknown population parameters of interest
S-F Rule
you can use the normal distribution to approximate binomial probabilities when the sample size is large enough that the expected numbers of successes and failures are both at least 10: np >= 10 and n(1-p) >= 10; the binomial distribution then closely follows a normal distribution
Calculating standard deviation using binomial distribution
σ = sqrt(np(1−p))
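The binomial mean (n*p), the standard deviation above, and the success-failure check can be sketched together. The function names and the example values of n and p are hypothetical.

```python
from math import sqrt

def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Expected value and standard deviation of a binomial(n, p) count."""
    return n * p, sqrt(n * p * (1 - p))

def meets_sf_rule(n: int, p: float) -> bool:
    """True when expected successes and expected failures are both at least 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(binomial_mean_sd(100, 0.5), meets_sf_rule(100, 0.5))
print(binomial_mean_sd(10, 0.5), meets_sf_rule(10, 0.5))
```

With n = 10 and p = 0.5 only 5 successes are expected, so the rule fails and the normal approximation should not be trusted.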
observational study
• Collect data in a way that does not directly interfere with how the data arise • Can only establish association • Retrospective: uses past data • Prospective: data collected throughout the study
variability vs. diversity
• Diverse: more different observations • Variability: more observations further away from mean
blocking vs. explanatory variables
• Explanatory variables (factors): conditions we can impose on experimental units (what we are testing) • Blocking variables: characteristics that the experimental units come with, that we would like to control for
Rules for binomial distribution
• Must follow rules of discrete probability distribution: • Events must be disjoint • Each probability must be between 0 and 1 • Probabilities must add up totally to 1
histogram
• Provides a view of data density • Great for describing the shape of a distribution • Bin width: can alter the story that the histogram tells • The wider the bin width, the less detail you can see in the distribution
scope of inference
• Random sample --> generalizability • Random assignment --> causality
experiment
• Randomly assign subjects to treatments • Establish causal connections
IQR
• Range of the middle 50% of data; distance between the first quartile (25th percentile) and the third quartile (75th percentile) • IQR = Q3 - Q1
Degrees of belief
• Step 1: Start with a set of prior beliefs (aka prior probabilities) • Step 2: Observe data • Step 3: Based on that data, update beliefs • Step 4: New beliefs are called posterior beliefs (aka posterior probabilities), because they are post-data • Step 5: Set new prior beliefs = old posterior beliefs and repeat
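The five steps above form a loop, which can be sketched with a toy example: deciding whether a coin is biased, one flip at a time. The two candidate heads probabilities (0.8 for biased, 0.5 for fair), the 50/50 starting prior, and the flip sequence are all assumptions invented for illustration.

```python
def update(prior_biased: float, flip_is_heads: bool,
           p_heads_biased: float = 0.8, p_heads_fair: float = 0.5) -> float:
    """Steps 2-4: observe one flip and return the posterior belief that the coin is biased."""
    like_biased = p_heads_biased if flip_is_heads else 1 - p_heads_biased
    like_fair = p_heads_fair if flip_is_heads else 1 - p_heads_fair
    numerator = like_biased * prior_biased
    return numerator / (numerator + like_fair * (1 - prior_biased))

belief = 0.5  # Step 1: prior belief that the coin is biased
for flip in [True, True, False, True]:  # hypothetical flips, True = heads
    belief = update(belief, flip)       # Step 5: the posterior becomes the next prior
    print(round(belief, 4))
```

Each heads nudges the belief toward "biased" and each tails pulls it back, with the previous posterior always serving as the new prior.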
Upper/lower fence
• Upper fence = Q3 + (1.5 * IQR) • Lower fence = Q1 - (1.5 * IQR)
Upper/lower whisker
• Upper whisker = largest observation at or below the upper fence • Lower whisker = smallest observation at or above the lower fence
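The fences and whiskers above can be computed as in the sketch below. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of `statistics.quantiles`, and the data values are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(data):
    """Return fences, whiskers, and outliers using the 1.5 * IQR rule."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")  # assumed quartile convention
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "whiskers": (min(inside), max(inside)),  # most extreme points within the fences
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical data
print(summary)
```

Note that the whiskers stop at actual observations, while the fences are only cutoffs and usually do not coincide with any data point.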
box plot
• Useful for highlighting outliers, median, and IQR • CANNOT display modality
independent event
Having information on A does not tell us anything about B (and vice versa) ○ If A and B are independent: P(A | B) = P(A) and P(A and B) = P(A) * P(B)
convenience bias
Individuals who are easily accessible are more likely to be included in the sample, e.g. polling everyone in the Stats 101 class
Interpreting data matrices
Observation = case = each row of the data matrix; variable = each column of the data matrix
percentiles using z-score
We can only use z-scores to calculate percentiles when the distribution is normal; graphically, percentile = area under the probability distribution curve to the left of that observation. Similarly, if you are given a percentile, you can find the value by looking up the corresponding Z-score and solving for X (the observed value)
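Both directions described above (observation to percentile, and percentile back to observation) can be sketched with `statistics.NormalDist`. The mean of 1500 and SD of 300 are hypothetical, and the function names are made up for illustration.

```python
from statistics import NormalDist

def percentile_of(obs: float, mean: float, sd: float) -> float:
    """Area under the normal curve to the left of obs (as a proportion)."""
    return NormalDist(mean, sd).cdf(obs)

def value_at_percentile(pct: float, mean: float, sd: float) -> float:
    """Find the Z-score for the percentile, then solve Z = (X - mean) / sd for X."""
    z = NormalDist().inv_cdf(pct)
    return mean + z * sd

print(percentile_of(1800, mean=1500, sd=300))
print(value_at_percentile(0.90, mean=1500, sd=300))
```

Both results are only meaningful if the underlying data are approximately normal.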
Z-distribution aka standardized normal distribution
Z ~ N(µ=0, σ=1)
