Stats 101 Midterm 1

null hypothesis (Ho)

"There is nothing going on": the observed difference in proportions between the treatment and control groups is simply due to chance; the variables are independent

alternative hypothesis (Ha)

"There is something going on": the observed difference in proportions between the treatment and control groups is not due to chance; the variables are dependent

complementary events

two disjoint events whose probabilities add up to 1, e.g. a coin landing heads and the same coin landing tails

68-95-99.7 rule

for a normal distribution, about 68% of observations fall between µ-σ and µ+σ, about 95% fall between µ-2σ and µ+2σ, and about 99.7% fall between µ-3σ and µ+3σ
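
A quick numerical check of these percentages, as a sketch using Python's standard-library `statistics.NormalDist`. The helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

def within_k_sd(k: float) -> float:
    """Probability that a normal observation falls within k SDs of its mean."""
    z = NormalDist()  # standard normal: mu = 0, sigma = 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {within_k_sd(k):.4f}")
```

The three values come out close to 0.68, 0.95 and 0.997, which is where the rule gets its name.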

population parameters

a quantity or statistical measure that, for a given population, is fixed and that is used as the value of a variable in some general distribution or frequency function to make it descriptive of that population

dependent variable

aka associated variables: two variables that show some connection with one another; the association can be either positive or negative

disjoint event

aka mutually exclusive; events that cannot happen at the same time ○ e.g. a voter cannot be registered as both a Democrat and a Republican at the same time ○ If A and B are disjoint, P(A and B) = 0 and P(A | B) = 0

Simpson's paradox

an association, or a comparison, that holds when we compare two groups can disappear or even be reversed when the original groups are broken down into smaller groups according to some other feature (a confounding/lurking variable) ○ Possible to derive two different conclusions, depending on how you break the data up ○ Must take a step back to find the root cause
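
A small sketch of the reversal, using illustrative (successes, total) counts for two treatments split by a lurking variable (here, case size). The counts and all names are assumptions chosen only to show the effect.

```python
# Illustrative (successes, total) counts per treatment and subgroup.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall_rate(groups):
    """Success rate after pooling all subgroups together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return successes / total

# Treatment A has the higher success rate inside every subgroup...
a_wins_each_subgroup = all(
    rate(*counts["A"][g]) > rate(*counts["B"][g]) for g in ("small", "large")
)
# ...yet treatment B has the higher success rate once the subgroups are pooled.
b_wins_overall = overall_rate(counts["B"]) > overall_rate(counts["A"])

print(a_wins_each_subgroup, b_wins_overall)
```

The reversal happens because treatment A was mostly given to the harder (large) cases, so the subgroup sizes differ sharply between treatments.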

law of large numbers

as more observations are collected, the proportion of occurrences with a particular outcome converges to the probability of that outcome (the larger the sample, the closer the observed proportion tends to be to the true probability) ○ e.g. it is highly unlikely for a fair coin to land on heads exactly 3 times out of 1000 flips, while landing on heads exactly 3 times out of 10 flips is far more plausible
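
A simulation sketch of this convergence for a fair coin. The seed, the checkpoints and the function name are arbitrary choices for illustration.

```python
import random

def running_proportion_of_heads(n_flips, seed=0):
    """Flip a simulated fair coin and record the proportion of heads so far."""
    rng = random.Random(seed)
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1
        if i in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[i] = heads / i
    return checkpoints

proportions = running_proportion_of_heads(100_000)
for n, prop in proportions.items():
    print(f"{n:>7} flips: proportion of heads = {prop:.4f}")
```

The early proportions can be far from 0.5, while the later ones typically settle close to it.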

block

block for variables (confounders) known or suspected to affect the outcome; blocking is to random assignment what stratifying is to random sampling

double-blind

neither the experimental units nor the researchers know the group assignment

marginal probabilities

calculated in margins of frequency tables; P(A), P(A-bar), P(B), P(B-bar)

replicate

collect a sufficiently large sample, or replicate the entire study

sample space

collection of all possible outcomes of a trial, e.g. the sample space for the sexes of a couple's two children is S = {MM, FF, FM, MF}

control

compare treatment of interest to a control group

z-score

creates a common scale to assess data without worrying about the specific units in which it was measured; Z = (obs - mean) / SD
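
A minimal sketch of the formula. The two exams and their means and SDs are hypothetical numbers, used only to show how z-scores put different scales side by side.

```python
def z_score(obs, mean, sd):
    """Number of standard deviations an observation lies above the mean."""
    return (obs - mean) / sd

# Hypothetical exams measured on different scales.
z_exam_a = z_score(obs=1800, mean=1500, sd=300)
z_exam_b = z_score(obs=24, mean=21, sd=5)

print(z_exam_a, z_exam_b)
```

Under these assumed numbers the first score sits further above its mean, in SD units, than the second, even though the raw scores cannot be compared directly.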

continuous probability distribution

different from discrete in that: • Probability of a continuous random variable being equal to any specific value = 0 • Can't be expressed in tabular form • We use a probability density function (pdf) to describe its distribution • We can calculate the probability for ranges of values the random variable takes as the area under the curve

multistage sample

divide the population into heterogeneous clusters; randomly sample a few clusters; randomly sample within these chosen clusters

cluster sample

divide the population into heterogeneous clusters; randomly sample a few clusters; sample all observations within the chosen clusters

stratified sample

divide the population into homogeneous strata; randomly sample from within each stratum

simple random sample

drawing names from a hat; each case equally likely to be selected
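
The four sampling schemes (simple random, stratified, cluster, multistage) can be sketched side by side. The population below is hypothetical: 300 students, with class year playing the role of stratum and dorm the role of cluster. All field and function names are assumptions.

```python
import random

rng = random.Random(0)

# Hypothetical population: 3 class years (strata), 10 dorms of 30 (clusters).
population = [
    {"id": i, "year": ["first", "second", "third"][i % 3], "dorm": i // 30}
    for i in range(300)
]

def simple_random_sample(pop, n):
    """Every case is equally likely to be selected."""
    return rng.sample(pop, n)

def stratified_sample(pop, key, n_per_stratum):
    """Randomly sample from within each stratum."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    return [u for group in strata.values() for u in rng.sample(group, n_per_stratum)]

def cluster_sample(pop, key, n_clusters):
    """Randomly pick a few clusters, then keep everyone in them."""
    clusters = sorted({u[key] for u in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [u for u in pop if u[key] in chosen]

def multistage_sample(pop, key, n_clusters, n_per_cluster):
    """Randomly pick a few clusters, then randomly sample within each."""
    clusters = sorted({u[key] for u in pop})
    sample = []
    for c in rng.sample(clusters, n_clusters):
        members = [u for u in pop if u[key] == c]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample

srs = simple_random_sample(population, 30)
strat = stratified_sample(population, "year", 10)
clust = cluster_sample(population, "dorm", 2)
multi = multistage_sample(population, "dorm", 3, 5)
```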

center

estimate of a typical observation in the distribution: mean, median, mode

non-disjoint event

events that can happen at the same time ○ e.g. a voter can be both a Republican and a moderate at the same time ○ If A and B are non-disjoint, P(A and B) ≠ 0

blinding

experimental units don't know which group they're in

relative frequency segmented bar plot

explore relationships between variables

confounding variable

extraneous variables that affect both the explanatory and the response variable, making it seem like there is a relationship between them

placebo

fake treatment, used as control group for medical studies

bar plot

for categorical variables, whereas histograms are for numerical variables; the ordering of bars is interchangeable in bar plots, while the x-axis of a histogram is a number line

inference (soup example)

generalize and conclude that your entire soup needs salt

non-response bias

if only a non-random fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population, e.g. randomly select 300 households, mail them polls, and ask them to return them, but not everyone does

Bayesian Interpretation of inference

interprets probability as a subjective degree of belief; uses a different conditional probability than the p-value: P(hypothesis | observed data) rather than P(observed data | hypothesis)

Bayes' theorem rewritten

P(A and B) = P(A | B) * P(B) (also known as the general multiplication rule)

joint probabilities

P(A and B), P(A and B-bar), P(A-bar and B-bar), P(A-bar and B)

general addition rule

P(A or B) = P(A) + P(B) - P(A and B)
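
A sketch that checks the rule by brute force on a standard 52-card deck, with A = "card is a jack" and B = "card is red". The event choice and helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    """Exact probability of an event when every card is equally likely."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_jack(card):
    return card[0] == "J"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

# Count the union directly...
p_or_direct = prob(lambda c: is_jack(c) or is_red(c))
# ...and via P(A) + P(B) - P(A and B), which removes the double-counted red jacks.
p_or_rule = prob(is_jack) + prob(is_red) - prob(lambda c: is_jack(c) and is_red(c))

print(p_or_direct, p_or_rule)
```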

addition rule for disjoint event

P(A or B) = P(A) + P(B), because P(A and B) = 0

Bayes' theorem

P(A | B) = P(A and B) / P(B) = P(B | A) * P(A) / P(B)

conditional probabilities

P(A|B), P(A|B-bar), P(A-bar|B-bar), P(A-bar|B), P(B|A), P(B-bar|A), P(B-bar|A-bar), P(B|A-bar)

posterior probability

P(hypothesis | observed data) = P(hypothesis & data)/P(data) = (P(data | hypothesis) * P(hypothesis))/P(data)
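
A worked sketch of this formula for a yes/no hypothesis. The prior and the two conditional probabilities are hypothetical numbers (think of a screening test), and P(data) is expanded over "hypothesis true" and "hypothesis false".

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(hypothesis | data) for a binary hypothesis."""
    # P(data) = P(data | H) P(H) + P(data | not H) P(not H)
    p_data = p_data_given_h * prior + p_data_given_not_h * (1 - prior)
    return p_data_given_h * prior / p_data

# Hypothetical inputs: 1% prior, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior=0.01, p_data_given_h=0.90, p_data_given_not_h=0.05)
print(f"posterior = {p:.4f}")
```

With these assumed inputs the posterior stays well below the 90% true-positive rate, because the hypothesis was rare to begin with.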

dependent event

the probability of A depends on whether B occurs, and vice versa ○ If A and B are dependent, P(A | B) ≠ P(A), and P(A and B) = P(A | B) * P(B)

Discrete probability distribution:

lists all possible events and the probabilities with which they occur • Events must be disjoint • Each probability must be between 0 and 1 • Probabilities must add up to 1 • e.g. binomial distribution

probability distribution

lists all possible outcomes in the sample space, and the probabilities with which they occur • Events listed must be disjoint • Each probability must be between 0 and 1 • The probabilities must add up to 1

spread

measure of variability in the distribution: standard deviation, variance, IQR, range

response variable

measures an outcome of a study

outlier

an observation more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3

P(k successes in n trials)

n! / (k! * (n-k)!) * p^k * (1-p)^(n-k); describes the probability of having exactly k successes in n independent trials, each with probability of success p
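
A direct translation of the formula, as a sketch. `math.comb(n, k)` computes n! / (k! * (n-k)!), and the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 10 flips of a fair coin.
p_three_heads = binomial_pmf(3, 10, 0.5)
print(p_three_heads)

# The probabilities over k = 0..n should add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```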

Calculating expected value using binomial distribution

n*p = µ

types of biases

non-response, voluntary response, convenience

robust statistics

not easily affected by outliers and extreme skew • Mean and SD are easily affected by extreme observations, since the value of each data point contributes to their calculation • Mean and SD are good for describing symmetric distributions • Median and IQR are more robust • Therefore, choose median and IQR when describing skewed distributions
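
A small sketch contrasting the two pairs of summaries on made-up data, with and without one extreme value. Quartile conventions vary between textbooks and software; the `inclusive` method of `statistics.quantiles` is used here as one reasonable choice.

```python
from statistics import mean, median, quantiles, stdev

def summary(data):
    """Mean/SD (not robust) next to median/IQR (robust)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "sd": stdev(data),
        "median": median(data),
        "iqr": q3 - q1,
    }

clean = summary([1, 2, 3, 4, 5])
with_outlier = summary([1, 2, 3, 4, 500])

print(clean)
print(with_outlier)
```

Replacing a single value moves the mean and SD dramatically, while the median and IQR of this small data set do not change at all.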

k

number of successes

n

number of trials

unusual observations

observations that stand out from the rest of the data that may be suspected outliers

unusual observations (z score)

observations with |Z| > 2 are considered unusual

voluntary response bias

occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue; such a sample will not be representative of the population, e.g. an open online poll

µ

population mean

σ

population standard deviation

categorical variable

qualitative; takes on a limited number of distinct categories, which can be labeled with numbers, but it doesn't make sense to do arithmetic on them • Regular categorical: levels have no inherent ordering • Ordinal: levels have an inherent ordering

numerical variable

quantitative; takes on numerical values that can be added, subtracted, averaged, etc. • Continuous: not countable • Discrete: countable (integers)

four principles of experimental design

randomize, control, block, replicate

randomize

randomly assign subjects to treatments

x̄

sample mean

s

sample standard deviation

mosaic plot

shows the marginal distribution as well as conditional frequency distributions (the bar widths show the marginal distribution)

point estimates

single value used to estimate population parameter

skewness

a distribution is skewed to the side of its long tail: left skewed, symmetric, or right skewed

shape

skewness, modality

Representative (soup example)

spoonful must be representative of the whole pot (aka the population) for the inference to be valid

variance

square of the SD • Why square? □ To get rid of negatives, so observations equally distant from the mean are weighted equally □ To weight larger deviations more heavily

Exploratory analysis (soup example)

taste a spoonful of soup and decide the spoonful you tasted isn't salty enough

hypothesis test

test the hypothesis by assuming that the null hypothesis is true, using either simulation or theoretical methods ○ Results from simulation look like the data --> the difference in proportions was due to chance; variables are independent ○ Results from simulation do not look like the data --> the difference in proportions was not due to chance but to an actual effect of the variable; variables are dependent ○ p-value: probability of observing an outcome at least as extreme as the one observed in the original data, given that the null hypothesis is true; if this probability is low, reject the null in favor of the alternative
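
A sketch of the simulation approach for a difference in proportions: under the null hypothesis the group labels are irrelevant, so the outcomes are reshuffled between groups many times and the p-value is the share of shuffles with a difference at least as large as the observed one (one-sided). The counts 21 of 24 versus 14 of 24, the seed and the function name are assumptions for illustration.

```python
import random

def randomization_p_value(success_a, n_a, success_b, n_b, reps=10_000, seed=0):
    """One-sided p-value for (proportion in A) - (proportion in B)."""
    rng = random.Random(seed)
    n_success = success_a + success_b
    # Pool all outcomes: 1 = success, 0 = failure.
    outcomes = [1] * n_success + [0] * (n_a + n_b - n_success)
    observed = success_a / n_a - success_b / n_b
    at_least_as_extreme = 0
    for _ in range(reps):
        rng.shuffle(outcomes)  # random reassignment to groups under the null
        diff = sum(outcomes[:n_a]) / n_a - sum(outcomes[n_a:]) / n_b
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

p_value = randomization_p_value(21, 24, 14, 24)
print(f"simulated p-value = {p_value:.4f}")
```

A small p-value here means the shuffled results rarely look like the observed data, which is evidence against the null.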

Frequentist interpretation of inference

the probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times; uses p-value to help make decisions

explanatory variable

the variable suspected of affecting the other; usually what the researcher divides the population into first

independent variable

two variables are not associated

modality

unimodal, bimodal, multimodal, uniform

normal distribution

unimodal and symmetric (bell curve); follows the 68-95-99.7 rule; shorthand: N(µ, σ)

normal approximation to binomial

use the expected value and SD to calculate a Z-score, then use the Z-distribution to calculate the probability, e.g. for P(K >= 70): find µ and σ --> calculate the z-score --> find P(Z > calculated z-score) using the Z-distribution
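
A sketch of these steps for P(K >= 70), compared with the exact binomial sum. The card does not give n and p, so n = 245 and p = 0.25 are assumed here. The code also checks that np and n(1-p) are both at least 10 before approximating, and applies no continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_approx_upper_tail(k, n, p):
    """Approximate P(K >= k) for K ~ Binomial(n, p) via a z-score."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("expected successes and failures must both be >= 10")
    mu = n * p                      # expected value
    sigma = sqrt(n * p * (1 - p))   # standard deviation
    z = (k - mu) / sigma
    return 1 - NormalDist().cdf(z)  # P(Z > z)

def exact_upper_tail(k, n, p):
    """Exact P(K >= k) by summing binomial probabilities."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n, p = 245, 0.25  # assumed values for illustration
approx = normal_approx_upper_tail(70, n, p)
exact = exact_upper_tail(70, n, p)
print(f"normal approximation: {approx:.4f}, exact binomial: {exact:.4f}")
```

The two answers are close but not identical; the approximation is a convenience for large n, not a replacement for the exact sum.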

Binomial distribution

used for calculating the probability of an exact number of successes in a given number of trials; each trial is a Bernoulli random variable, meaning an individual trial has only 2 possible outcomes

probability trees

useful for organizing info in conditional probability calculations, especially when you know P(A | B) and you want to find P(B | A)

segmented bar plot

useful for visualizing conditional frequency distributions

side-by-side box plot

usually compares the distribution of a numerical variable across levels of a categorical variable, e.g. number of clubs college students are involved with, by class year

random process

we know what outcomes could happen but we don't know which particular outcome will happen

sample statistics

we use sample statistics as point estimates for unknown population parameters of interest

S-F Rule

success-failure rule: you can use the normal distribution to approximate binomial probabilities when the sample size is large enough that the expected numbers of successes and failures are both at least 10, i.e. np >= 10 and n(1-p) >= 10; the binomial distribution then closely follows a normal distribution

Calculating standard deviation using binomial distribution

σ = sqrt(np(1−p))
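
A sketch that checks µ = np and σ = sqrt(np(1-p)) against the mean and SD computed directly from the binomial probabilities. The values n = 20 and p = 0.3 are arbitrary.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.3  # arbitrary illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and SD straight from the definition of a discrete distribution.
mean_from_pmf = sum(k * pk for k, pk in enumerate(pmf))
var_from_pmf = sum((k - mean_from_pmf) ** 2 * pk for k, pk in enumerate(pmf))
sd_from_pmf = sqrt(var_from_pmf)

# The shortcut formulas.
mu = n * p
sigma = sqrt(n * p * (1 - p))

print(mean_from_pmf, mu)
print(sd_from_pmf, sigma)
```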

observational study

• Collect data in a way that does not directly interfere w how the data arise • Only can establish association • Retrospective: uses past data • Prospective: data collected throughout study

variability vs. diversity

• Diversity: more distinct values among the observations • Variability: observations lie further away from the mean

blocking vs. explanatory variables

• Explanatory variables (factors): conditions we can impose on experimental units (what we are testing) • Blocking variables: characteristics that the experimental units come with, that we would like to control for

Rules for binomial distribution

• Must follow rules of discrete probability distribution: • Events must be disjoint • Each probability must be between 0 and 1 • Probabilities must add up totally to 1

histogram

• Provides a view of data density • Great for describing the shape of a distribution • Bin width: can alter the story that the histogram tells • The wider the bin width, the less detail of the distribution you can see

scope of inference

• Random sample --> generalizability • Random assignment --> causality

experiment

• Randomly assign subjects to treatments • Establish causal connections

IQR

• Range of the middle 50% of data; distance between the first quartile (25th percentile) and the third quartile (75th percentile) • IQR = Q3 - Q1

Degrees of belief

• Step 1: Start with a set of prior beliefs (aka prior probabilities) • Step 2: Observe data • Step 3: Based on that data, update beliefs • Step 4: New beliefs are called posterior beliefs (aka posterior probabilities), because they are post-data • Step 5: Set new prior beliefs = old posterior beliefs and repeat
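
The five steps can be sketched as a loop. The setup is hypothetical: a coin is either fair (P(heads) = 0.5) or biased (P(heads) = 0.8, an assumed value), the prior belief in "fair" is 0.5, and the belief is updated after each observed flip.

```python
def update(prior_fair, flip, p_heads_biased=0.8):
    """One Bayesian update of the belief that the coin is fair."""
    like_fair = 0.5  # P(flip | fair) is 0.5 for heads and for tails
    like_biased = p_heads_biased if flip == "H" else 1 - p_heads_biased
    numerator = like_fair * prior_fair
    evidence = numerator + like_biased * (1 - prior_fair)  # P(flip)
    return numerator / evidence

belief = 0.5            # Step 1: prior belief that the coin is fair
history = [belief]
for flip in "HHTH":     # Step 2: observe data, one flip at a time
    belief = update(belief, flip)  # Steps 3-4: the posterior belief
    history.append(belief)         # Step 5: posterior becomes the next prior

print([round(b, 4) for b in history])
```

Updating one flip at a time gives the same final belief as a single update on all four flips at once.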

Upper/lower fence

• Upper fence = Q3 + (1.5 * IQR) • Lower fence = Q1 - (1.5 * IQR)

Upper/lower whisker

• Upper whisker = largest observation at or below the upper fence • Lower whisker = smallest observation at or above the lower fence
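
A sketch that computes the fences, the whiskers and the resulting outliers for a small made-up data set. Quartile conventions differ between textbooks and software; the `inclusive` method of `statistics.quantiles` is one choice among several, so the exact fence values depend on it.

```python
from statistics import quantiles

def fences_and_whiskers(data):
    ordered = sorted(data)
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach to the most extreme observations still inside the fences.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": outliers,
    }

result = fences_and_whiskers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(result)
```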

box plot

• Useful for highlighting outliers, median, and IQR • CANNOT display modality

independent event

Having information on A does not tell us anything about B (and vice versa) ○ If A and B are independent: P(A | B) = P(A) and P(A and B) = P(A) * P(B)
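
A sketch that tests the product rule exactly on two fair dice. The events are illustrative: "first die is even" turns out to be independent of "sum is 7" but not of "sum is at least 10".

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(lambda r: a(r) and b(r)) == prob(a) * prob(b)

def first_even(r):
    return r[0] % 2 == 0

def sum_is_7(r):
    return sum(r) == 7

def sum_at_least_10(r):
    return sum(r) >= 10

print(independent(first_even, sum_is_7))
print(independent(first_even, sum_at_least_10))
```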

convenience bias

Individuals who are easily accessible are more likely to be included in the sample, e.g. polling everyone in the Stats 101 class

Interpreting data matrices

Observation = case = each row of the data matrix; variable = each column of the data matrix

percentiles using z-score

We can only use z-scores to calculate percentiles when the distribution is normal; graphically, percentile = area under the probability distribution curve to the left of that observation. Similarly, if you are given a percentile, you can find the observed value by looking up the corresponding Z-score and solving Z = (X - mean) / SD for X
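
Both directions can be sketched with `statistics.NormalDist`, which stands in for a printed Z-table. The distribution N(µ = 1500, σ = 300) and the score 1800 are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 1500, 300  # hypothetical normal distribution
std_normal = NormalDist()  # Z ~ N(0, 1)

# Observation -> percentile: area to the left of the z-score.
z = (1800 - mu) / sigma
percentile = std_normal.cdf(z)

# Percentile -> observation: find the z-score, then solve Z = (X - mu) / sigma for X.
z_90 = std_normal.inv_cdf(0.90)
x_90 = mu + z_90 * sigma

print(f"a score of 1800 is at about the {percentile:.1%} percentile")
print(f"the 90th percentile is a score of about {x_90:.0f}")
```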

Z-distribution aka standardized normal distribution

Z ~ N(µ=0, σ=1)

