MATH-11 STATISTICS MEGASTUDY
if the correlation is 1 or -1, the scatterplot must make
a perfect line
Random Variable
a variable that takes on possible numeric values that result from a random event
If x = Uniform(1,4), what is the probability of getting a rational number? What is the probability of getting an irrational number?
a. 0 b. 1, because nearly every value from 1 to 4 is irrational
If you break up the sample space into disjoint sets, the probabilities of these events must
add up to 1
Trial
an action that creates data
Quadrant 1 or 2 curve
apply a function higher on the tower of power than is currently used
Quadrant 3 or 4 curve
apply a function lower on the tower of power than is currently used
Law of Large Numbers (LLN)
as a random process is repeated more and more, the proportion of times an event occurs converges to a number (the probability of that event)
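The LLN can be seen in a quick simulation (a sketch; the fair coin, flip count, and seed are illustrative assumptions, not from the cards):

```python
# Sketch: Law of Large Numbers for a fair coin.
# With many flips, the proportion of heads converges to P(heads) = 0.5.
import random

def running_proportion(n_flips, seed=0):
    rng = random.Random(seed)                       # seeded for reproducibility
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

prop = running_proportion(100_000)   # very close to 0.5
```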
Mean on a histogram
balance point of the histogram, the torque is the same on both sides of the mean
Correlation cannot reveal
causation
In general, the size of the SD gives a sense for how
closely your experience playing the game will hug the mean
The X^2 distribution is used to compare
counts in a table (to a list of expected values).
Outliers
data that stands apart from the distribution
median
data value in the middle of the list of data
CI for mean difference of paired samples
d̄ +/- t*df * SE(d̄) (d stands for the paired differences) SE = s_d/sqrt(n) df = n-1
Difference in Means: Confidence Interval for paired, dependent samples
d̄ +/- t*df * SE(d̄) SE = (standard deviation of d) / sqrt(n)
Good way to inflate r^2
dividing data into subgroups that are more homogenous
Residual
e=y-ŷ (how off the model is at the value of x) -y = value observed from actual data point -ŷ = value predicted from regression line
Pros of range
easy to calculate, gives sense of span of data
Attributes of a scatterplot
form, direction, strength, outliers
high influence point
gives a significantly different slope for the regression line if it is included, versus excluded, from an analysis
When a histogram is skewed right, the mean is
greater than the median. (Small amount of higher values push the mean forward but don't affect the median)
Standard deviation
a measure of spread: roughly, how far values typically lie from the mean
Extrapolation is dangerous because
it assumes the relationship holds beyond the data range you have seen and used for a model
Reasons for using the complement rule
it is often easier to calculate the complement of something.
marginal probabilities are the sums of
joint probabilities
Two random variables are independent fi
knowing the outcome of one has no effect on the outcome of the other
As the graph skews to the right, the mean becomes
larger than the median. The mean is pulled right by the large values in the data set.
Tails of distribution
left and right sides of a graph
When the relationship is curved, the correlation is
less meaningful
The density graph predicts
relative likelihood of an event occurring, not its probability of occurring
Correlation only works with
linear relationships
correlation is unaffected by
linear scale changes (cor(x,y) = cor (x,2.5y) = cor(x,y+14) = cor(2x-17, 99999y+1))
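This invariance can be checked directly with a hand-rolled Pearson correlation (the data values are a toy example, assumed for illustration):

```python
# Sketch: Pearson r is unchanged by linear rescalings of either variable.
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]
r1 = pearson_r(x, y)
r2 = pearson_r([2 * v - 17 for v in x], [99999 * v + 1 for v in y])
# r1 and r2 agree up to floating-point error
```

(Note that a negative scale factor would flip the sign of r; the card's examples all use positive scale factors.)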
Probability Model
lists the different outcomes for a random variable and gives the probability of each outcome
E(x) is also known as the
long-run average, denoted μ
Skewed left
longer tail on the left
Skewed right
longer tail on the right
When a histogram is skewed left, the mean is
lower than the median (Small amount of lower values push the mean back but don't affect the median)
To establish a causation, eliminate
lurking variables
Sampling distribution
making a histogram of all the means from all our different samples
center used for symmetric distributions without outliers
mean
The center of the sample distribution is at
mean, μ
Which center of distribution is resistant to outliers
median
center used for asymmetric distributions (skewed)
median
Outliers can occur for many different reasons:
mistakes, atypical, scientifically important
How to find line of best fit
minimize model error, where model error = sum of the absolute values of residuals OR sum of (residuals)^2 (least squares)
As the size of a sample grows, the sampling distributions tends to look
more and more normal
Bigger padding in a confidence interval leads to
more confidence, but less relevance. (100% confidence that a value is in the interval 0-1000 is obvious. Where is the value, 2? 200? 547?)
In general, the bigger X^2 is, the
more evidence we have against H0
As df becomes larger, the t-distribution becomes
more standard/normal. The center does not change. The spread becomes narrower.
when you average things, you are eliminating
most variation that happens
The Poisson model is a good approximation of the Binomial model when
n >/= 20 and p </= 0.05, or n >/= 100 and p </= 0.10 This is helpful because the Binomial model becomes unusable when n gets really big
finding n given margin of error
n = [(z*)²(p̂)(q̂)] / (ME)²
Uniform model histogram
no peaks
As long as the conditions are met, it does not matter what distribution you start with. If you keep taking samples, you'll eventually get a
normal distribution
A high r^2 value is
not an indicator that a linear model is appropriate
If an outcome is common to both events, the events are
not disjoint
A random variable should always be a
numeric outcome, NEVER a probability
Exclusive "or"
often used in real life. A or B means: 1. A, but not B 2. B, but not A
don't assume your data are all part of
one homogeneous population. think about possible subgroups to make analysis better.
unimodal histogram
one peak
Scatterplot
one variable on x-axis (predictor), other on y-axis (response).
high leverage point
outlier where x is far from the mean of x values
the median is resistant to
outliers and skew
For a two sample proportion test testing p1 and p2, you would think about
p1 - p2. e.g. p1 - p2 > 0
Regression to the mean
people far from the mean are pulled towards it in subsequent trials because it is easier to score near the mean than far from it.
T-distribution gives more
precise results
General confidence interval formula
p̂ +/- [(z*)(SE(p̂))] = (p̂ - [(z*)(SE(p̂))],p̂ + [(z*)(SE(p̂))]) where SE(p̂) = sqrt(p̂*q̂ / n) z* is the critical value
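The formula above can be sketched in a few lines (the counts 520 out of 1000 are hypothetical, used only to show the calculation):

```python
# Sketch: 95% confidence interval for a proportion, p̂ ± z* · SE(p̂).
from math import sqrt

def proportion_ci(successes, n, z_star=1.96):
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # SE(p̂) = sqrt(p̂q̂/n)
    return (p_hat - z_star * se, p_hat + z_star * se)

lo, hi = proportion_ci(520, 1000)   # p̂ = 0.52, interval roughly (0.489, 0.551)
```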
so if p̂1~N(p1, sqrt((p1q1/n1)) and p̂2~N(p2,sqrt((p2q2/n2)), then
p̂1-p̂2~N(p1-p2, sqrt((p1q1/n1)+(p2q2/n2))
b1
r(stdev y / stdev x) this is the slope (the predicted change in y for every one-unit increase in x)
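A minimal sketch of fitting b1 and b0 from data (the equivalent covariance form of the slope is used; the data points are made up so the fit is exact):

```python
# Sketch: least-squares slope b1 and intercept b0 = ȳ − b1·x̄.
from statistics import mean

def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
         / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx          # intercept: value of ŷ when x = 0
    return b0, b1

b0, b1 = fit_line([1, 2, 3, 4], [5, 7, 9, 11])   # data lies exactly on y = 3 + 2x
```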
One of the best ways to avoid bias is by introducing
random elements into the sampling process e.g. Stir the pot before tasting the soup
Continuous random variable
random quantity that can take on any value on a continuous scale ("a smooth interval of possibilities") e.g. The amount of water you drink in a day, how long you wait for a bus, how far you live from the nearest grocery store.
Three ideas for measuring spread
range, interquartile range, the five number summary
Median on a histogram
same amount of area on both sides
bias
sample is not representative of the population in some way -good sampling is about reducing as much bias as possible
Correlation matrix
shows the correlation of every variable with every other variable
For any x-value (or z-score, if you convert to a standard normal model, N(0,1)) the percentile is
simply the area to the left of this value
Conditions for creating a regression model
since correlations are involved, we need the conditions from before: 1) quantitative variables 2) straight enough 3) no outliers 4) residual noise
the line of best fit is determined by
slope and y-intercept
The sample size does not need to be
some percentage of the population size. Larger samples are better irrespective of the population size e.g. Tasting a spoonful from a small pot of soup gives the same amount of info as tasting a spoonful from a large pot -However, tasting 3 spoons of soup is better than tasting 1 spoon
event
some set of outcomes you might care about
subgroups can be identified in original data or residuals.
splitting your data into different parts and doing several linear regressions instead of one clunky regression.
Standard Error
sqrt(p̂*q̂ / n) The same as standard deviation, sqrt(pq/n), but built upon p̂ (the sample proportion) instead of p. You are trying to estimate the population using the sample, so you use p̂ values to estimate p.
Standard deviation of a density function
sqrt(var(x))
p̂ pooled
(successes1 + successes2) / (n1 + n2) Used when you are doing a hypothesis test. If we assume H0 is true (p1-p2 = 0) then the populations are the same and pooling p̂1 and p̂2 will give better approximations than using them separately.
cons of range
summarizes the data using only 2 data points, not resistant to outliers
Difference in Means: Hypothesis Test for paired, dependent samples
t_df = (d̄ - 0) / SE(d̄)
mean
the center of a distribution must take into account the data values themselves, not just the order they're in. It is the calculated average value (sum of all terms / amount of terms)
outcome
the data created from a trial
To get a normal sampling distribution from samples of a population: The greater the skew in the population,
the higher n must be to get a normal sampling distribution
In any given situation, the higher the risk of Type I error,
the lower the risk of Type II error.
For positive test results to be useful, you need
the orders of magnitude of "test accuracy" and "disease prevalence" to be better matched.
You can ask questions about __________ since the probability of any individual outcome is always 0
the probability of some interval of values occurring
Conditional probability P(A|B)
the probability that event A occurs given the information that event B occurred. Pronounced P(A given B)
Using noise to determine whether a regression model is appropriate
the residual plot should show "noise", or no observable patterns in the plot. -if a pattern is seen, regression is not appropriate
sample space
the set of ALL possible outcomes
The units on variance will always be
the square of the units in the problem. This can make variance difficult to interpret
Reexpressing data
to make data more visually appealing, to create more commonly-shaped histograms, to get lens of analysis correct
Methods for conditional probabilities
tree diagrams, P(A|B), and Bayes' Theorem
bimodal histogram
two peaks
Inclusive "or"
used in probability. A or B means: 1. A, but not B 2. B, but not A 3. Both A and B
bad way of inflating r^2
using summarized data rather than unsummarized data
Extrapolation
using your model to predict a new y value for an x value that is outside the span of x data in your model
Interpolation
using your model to predict a new y value for an x value that is within the span of x data in your model
mode
value that occurs the most often in a set of data
Because of randomness, there is
variation in this statistic
When r is close to -1, the correlation is
very strong and negative
When r is close to 1, the correlation is
very strong and positive
When r is close to 0, the correlation is
weak (little to no correlation)
Spread of distribution
where does most of the data lie?
95% of all point estimates are
within (+/-) (2)*(SE) of p
Predictor variable
x-axis variable that predicts y.
if x and y are 2 independent random variables with normal distributions, then
x-y is also normal. also, since x and y are independent, var(x-y) = var(x) + var(y) and thus SD(x-y) =sqrt(var(x-y)) = sqrt(var(x)+var(y)) = sqrt(SD(x)^2 + SD(y)^2)
CI for the mean of one sample
x̄ +/- t*df*(SE(x̄)) SE = s/sqrt(n) df = n-1
Confidence Interval formula for t-distribution
x̄ +/- t*n-1 * (SE(x̄)) SE(x̄) = Sx/sqrt(n) (use the sample SD, Sx, since σ is unknown)
Response variable
y-axis variable that is predicted by x.
Z-score
z = (y-ȳ)/(SDy) ȳ = mean Unitless idea that tells you how many standard deviations above the mean some piece of data is z=0 is the mean (0 standard deviations from the mean) z=1 means 1 standard deviation away from the mean z=x means x standard deviations away from the mean
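A tiny sketch of the calculation (the data set is a common textbook example with mean 5 and population SD 2, chosen so the numbers come out exact):

```python
# Sketch: z = (value − mean) / SD, a unitless distance from the mean.
from statistics import mean, pstdev

def z_score(value, data):
    return (value - mean(data)) / pstdev(data)   # population SD here

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, SD 2
z = z_score(9, data)               # 9 is exactly 2 SDs above the mean
```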
Given a value xf (e.g. a height of 6 feet), the prediction interval of yf for a person with that height xf is:
yf +/- t*n-2 * SEpi where SEpi = sqrt([SE(b1)]^2 * (xf - x̄)^2 + se^2/n + se^2)
Confidence interval formula from regression
yhat new +/- t*n-2 * SEci where SEci = sqrt([SE(b1)]^2 * (xnew - x̄)^2 + se^2/n)
Subgroups may not be visible unless
you think about them
If the residuals show any type of pattern
your current linear model is not appropriate
Critical value
z* If you want a confidence interval of 80%, then 10% would be to the left and 10% to the right. Therefore, the critical value of the z-score (z*) would be at the 90th percentile (80+10) because 10% is to the right of the 90th percentile.
Regression line equation
ŷ = b0 + b1 * x
Recall the regression line equation
ŷ = b0 + b1 * x b0 is the intercept b1 is the slope
b0
ȳ - b1(x̄) this is the y-intercept ( value of ŷ when x=0 )
If you have a situation modelled by Binom(n,p) in which n is large and p is small, then use a Poisson model instead where
λ = np where: [n >/= 20 and p </= 0.05 or n >/= 100 and p </= 0.10] and [np </= 20]
Parameter, μ (or E(X))
A value that helps summarize a probability model
Variance of a density function
μ (mu) = mean Var(X) = the integral from -∞ to ∞ of ∫(x-μ)^2 * f(x)dx Easier version: E[X^2] - μ^2 = [the integral from -∞ to ∞ of ∫x^2 * f(x)dx] - μ^2
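The integral version can be checked numerically. A sketch using a midpoint Riemann sum for Uniform(1, 4), whose exact variance is (b−a)²/12 = 0.75 (the step count is an arbitrary choice):

```python
# Sketch: Var(X) = E[X²] − μ² via numeric integration of a density.
def variance_numeric(f, a, b, steps=100_000):
    dx = (b - a) / steps
    mids = [a + (i + 0.5) * dx for i in range(steps)]
    mu = sum(x * f(x) * dx for x in mids)          # E[X]
    ex2 = sum(x ** 2 * f(x) * dx for x in mids)    # E[X²]
    return ex2 - mu ** 2

uniform_pdf = lambda x: 1 / 3                      # f(x) = 1/(b−a) on [1, 4]
var = variance_numeric(uniform_pdf, 1, 4)          # ≈ 0.75
```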
The standard deviation of the sampling distribution is:
σ = sqrt(pq/n) (square root of [probability of success]*[probability of failure] divided by the [number of samples])
The spread of the sampling distribution is:
σ/sqrt(n) (Standard deviation over the square root of the number of samples)
Whiskers
(1.5 * IQR) away from Q1 and Q3 Lower whisker: Q1 - (1.5 * IQR) Upper whisker: Q3 + (1.5 * IQR)
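A sketch of the whisker computation; note the quartiles here are medians of the lower/upper halves, matching this course's definition (software packages vary), and the data set is made up:

```python
# Sketch: whisker bounds at Q1 − 1.5·IQR and Q3 + 1.5·IQR.
from statistics import median

def whisker_bounds(data):
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])          # median of lower half
    q3 = median(s[-half:])         # median of upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = whisker_bounds([1, 3, 5, 7, 9, 11, 13, 15])
# Q1 = 4, Q3 = 12, IQR = 8 → whiskers at −8 and 24
```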
The Law of Averages
(Gambler's fallacy) Incorrect use of LLN. False way of thinking that says if the current situation is out of whack, then it must correct itself in the short term.
range
(max value)-(min value)
Confidence interval for 2 sample proportions
(p̂1-p̂2)+/-z*SE(p̂1-p̂2) -samples must be independent from each other, at least 10 success/fails condition must be met also
interquartile range (IQR)
(upper quartile)-(lower quartile)
CI for mean difference in two samples
(x̄1-x̄2) +/- t*df*(SE(x̄1-x̄2) SE = sqrt(s1^2/n1 + s2^2/n2) df=min(n1-1, n2-1) (min means pick the lowest number between the two)
-If 60% of people run and 20% of runners wear long socks, what percent of people run and wear long socks? (What is the joint probability?) -What is the probability that someone doesn't wear long socks given that they run?
-0.2*0.6 = 0.12, so 12% of the total people run and wear long socks. -(0.6 - 0.12 = 0.48 = people that run and don't wear long socks) so 0.48 / 0.6 = 0.8 = 80% chance of not wearing long socks given that they run
Summary of Sampling Statistics: To estimate a population parameter p, we can
-Draw a random sample of size n. -This sample will have a statistic p̂ ≈ p. -If we drew many samples, each would have its own statistic p̂ and we could make a histogram of these values -The histogram, the sampling distribution, is approximately: N( μ, σ/sqrt(n) )
When to use Z vs. T
-If you know sigma (almost never true): use z-distribution -In all other cases: Use t-distribution
Why doesn't X+X = 2X?
-In the X+X scenario, we often add winning and losing situations which diminish the influence of one another. (e.g win + win, loss + loss, win + loss, loss + win are all possible) -In the 2X scenario, you either win twice or you lose twice.
Cluster Sampling
-Sampling in which elements are selected in two or more stages, with the first stage being the random selection of naturally occurring clusters and the last stage being the random selection of elements within clusters -e.g. asking people as they walk into various gyms on campus what their average GPAs are. 3 different gyms can have both grads and undergrads. -Clusters are chosen just because it's more convenient -Clusters are heterogeneous in relation to the parameter you're measuring (the gyms all have both undergrads and grads)
In inference about regression, you use the histogram for all b1 values because b0 doesn't really tell us anything.
-The histogram of all possible b1's is centered at the true population parameter, β1 -SE = se / (sx*sqrt(n-1)) se = standard deviation of residuals sx = standard deviation of x values -The curve is best approximated by a t-distribution with n-2 df Conditions: 1. the conditions for inference from a regression line must be met (straight enough, quantitative, no outliers, residual noise) 2. independence condition (random, and <10% rule) 3. histogram of residuals is nearly normal
Stratified Random Sampling
-What is the average GPA of UCSD students? -Since grads and undergrads have much different average GPAs, you split the sample into 2 groups, do SRSs on each, then combine the results. -Pieces are homogeneous in relation to parameter you are measuring (undergrads have lower GPAs, grads have higher GPAs)
Common Geometric Model questions
-What is the probability that it takes exactly k <Bernoulli trials> to get the first <success>? -On average, how many <Bernoulli Trials> will it take to get the first <success>?
Common Binomial Model questions
-What's the probability of getting exactly k <successes> in n <Bernoulli trials>? -On average, how many <successes> will I get if I do n <Bernoulli trials>?
Margin of error is increased by
-smaller samples -higher level of confidence
The probability of any particular outcome happening is
0. This is because the integral from a to a of f(x)dx = 0
For a continuous random variable X which takes on any real number, we model it through a density function f(x) which has 2 properties:
1) f(x) >/= 0 for all x 2) The integral from -∞ to ∞ of f(x) equals 1
How do we decide on the null hypothesis and the alternative hypothesis?
1. Adopt some belief for the moment (null hypothesis) 2. Operating under the assumption that this belief is true, you collect some data. -If the data supports the belief, you continue to operate with this mindset (fail to reject null hypothesis) -If the data supports an alternative belief, discard old belief in favor of new belief (reject null hypothesis in favor of alternative hypothesis)
Steps to testing a hypothesis
1. Create null hypothesis H0 2. Create alternative hypothesis HA 3. Draw a sample and consider it assuming the null hypothesis H0 is true. Find the mean and SD of this data and make a plot. (you use p and q instead of hats because if you are assuming that H0 is true, then you are assuming you know the values for p and q) -Calculate the p-value: the probability/chance of seeing our result or something more extreme if our universe is "H0: The drug works as well as the placebo" 4. If p-value </= 0.05, reject null hypothesis If p-value > 0.05, fail to reject null hypothesis
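The steps above can be sketched for a one-proportion z-test, using only the standard library (the H0 proportion 0.5 and counts 590/1000 are hypothetical, and a one-sided alternative is assumed):

```python
# Sketch: one-proportion z-test, following the steps above.
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(p0, successes, n):
    p_hat = successes / n
    sd = sqrt(p0 * (1 - p0) / n)        # use p0, q0: we assume H0 is true
    z = (p_hat - p0) / sd
    p_value = 1 - NormalDist().cdf(z)   # P(result or more extreme | H0)
    return z, p_value

z, p = one_prop_z_test(0.5, 590, 1000)
# p < 0.05 here, so we would reject H0: p = 0.5
```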
two ways to use t-distribution:
1. Estimate μ1-μ2 using a confidence interval about x̄1 - x̄2 2. Run a hypothesis test with H0: μ1-μ2 = 0
Steps for Test for Independence
1. Find the expected counts for each cell. This is equal to (row total)(column total)/(table total) 2. Find X^2 (same as before), X^2 = sum (Oi-Ei)^2/Ei 3. Find the P-Value: look up the X^2 value on X^2df, where df = (r-1)(c-1). r = amount of rows, c = amount of columns (exclude total column/row) 4. Use the P-value to conclude based on the null.
Two cases for the X^2 distribution
1. Goodness-of-fit 2. Test of homogeneity/independence
Examples of goodness-of-fit questions
1. If we look at the birth months of National Hockey League (NHL) players, do they resemble what we might see in the larger US population? 2. If we breed a bunch of peas, do we really get the results expected from Mendel's theory of genetics?
Steps of the Goodness-of-fit test
1. You wish to compare a collection of counts to those predicted by some theory 2. Calculate the expected counts from your theory (Expected = Total population * Percentage expected for that category) 3. Calculate X^2 = sum (Oi-Ei)^2/Ei Oi = observed counts Ei = expected counts 4. Find the P-value. Look up the X^2 value on the curve X^2k-1, where k is the number of categories 5. Use the P-value to decide about H0: The observed and expected values are the same.
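The X² statistic from step 3 is easy to compute directly (the fair-die counts below are a made-up example; the P-value lookup against the chi-squared curve is left to a table or software):

```python
# Sketch: goodness-of-fit statistic, X² = Σ (Oi − Ei)² / Ei.
def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [12, 8, 10, 9, 11, 10]   # 60 die rolls
expected = [10] * 6                 # fair die: 60 · (1/6) per face
x2 = chi_square_stat(observed, expected)
# Compare x2 to the chi-squared curve with k − 1 = 5 df for a P-value.
```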
Incorrect uses of linear regression
1. failing to look at the residuals to make sure the model is reasonable 2. extrapolating without caution 3. not considering outliers carefully enough 4. building a model of data that isn't straight enough
p-value main points
1. p-values can indicate how incompatible the data are with a specified statistical model 2. P-values do not measure the probability that the study's hypothesis is true 3. A P-value (statistical significance) does not measure the size of an effect or the importance of a result (practical significance) 4. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
P overload
1. p: is the proportion of some trait in a population. It is a parameter 2. p̂: is the proportion of some trait in a sample. It is a statistic. 3. P(A) is the probability of some event A happening 4. P-value is a conditional probability: it is the probability of getting the value p̂ (or something more extreme) in a universe where H0's p is true.
Test for Independence
2 or more populations split across a categorical variable. You should have a 2-dimensional table of counts.
multimodal histogram
3+ peaks
Common Confidence intervals and critical values
90% = 1.645 95% = 1.960 99% = 2.576
If an event can never occur, then P(A)
=0
If an event must occur, then P(A)
=1
Suppose A and B are independent events, then P(A and B)
=P(A) * P(B)
Sample
A (hopefully) representative subset of your population e.g. A spoonful of soup from the top of the pot
Hypothesis
A claim that may or may not be true
Convenience Sample Bias
A form of bad sample frame. Easiest sample to take is not representative of population. e.g. You work at facebook and survey 5000 on whether they love FB. This is convenience sample bias because you are likely friends with your coworkers, who also work at facebook and are more likely to either love it or hate it (depending on how working there affects them).
Correlation (R or r)
A statistic that measures strength and direction of a linear association between two quantitative variables where no outliers are present.
Lurking variable
A variable not x or y that causes a change in either x or y.
Visualize the probability table on a graph
An outcome is more likely if there is more area in the bar for that value on the graph We also know that the sum of the areas of the bars must be 1 Heights must be at least 0 (no negative bars)
Spread
Another parameter we might care about. Big spread is exciting for people in Vegas because they focus on the bigger wins
Exponential Distribution
Asks about a continuous idea, usually related to time: f(x) = λe^(-λx) when x >/= 0, and 0 otherwise
Calculating area under a distribution using a z-score table
Calculate the z-score, then find the % value on the table corresponding to your calculated z-score value.
Ch 24
Chi-Squared Tests
ch 4
Comparing distributions
Ch 18
Confidence Intervals for Proportions
Subjective probability
Consider a number of factors important to the situation, personally decide how important they are, and use these to come up with an answer. e.g. I have a 60% chance of getting an A because I do all readings, HW, come to classes, and am an A/B student in other math classes.
Handout Lectures
Continuous Random Variables
Correlation vs causation thinking
Correlation thinking: weight and height are correlated, so heavier people tend to be taller Causation thinking: (wrong) weighing more causes you to become taller
Categorical/Qualitative Data
Data that falls into categories or labels; often text ideas; tend not to have units
Normal Distribution (aka the Bell Curve)
Density function is too hard to calculate by hand. Usually given or computed with technology. -Mean is in the middle of the bell curve. Values are distributed symmetrically on either side of the mean, resulting in a bell curve.
CH14
Dependent events, tree diagrams, Bayes' Theorem
Expected Value of a density function
E(X) = The integral from -∞ to ∞ of x*f(x)dx
Expected value for a discrete random variable
E(X) = sum of x for [P(x)*x] (the sum of all (each outcome multiplied by its probability)
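A sketch of the sum, using a fair die as the probability model (the die is an illustrative choice):

```python
# Sketch: E(X) = Σ x·P(x) over a discrete probability model.
def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}   # fair six-sided die
ev = expected_value(die)                # (1+2+...+6)/6 = 3.5
```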
Adding random variables (no constants)
E(X±Y) = E(X) ± E(Y) if X and Y are independent variables: Var(X±Y) = Var(X)+Var(Y) (always +, never -!) SD(X±Y) = sqrt(Var(X)+Var(Y)) (always +, never -!)
Adding constants to random variables
E(X±c) = E(X) ± c Var(x±c) = Var(x) SD(x±c) = SD(x)
Scaling random variables
E(aX) = aE(X) Var(aX) = a^2 * Var(X) SD(aX) = |a| * SD(X)
If x = Uniform(a,b):
E(x) = (a+b)/2 Var(x) = [(b-a)^2]/12 SD(x) = sqrt(Var(x)) = (b-a)/sqrt(12)
Disjoint events
Events A and B are disjoint if they share no common outcomes ex: A: rolling an even number on a die B: rolling a 5 on a die
Independence
Events A and B are independent if event A occurring has no effect on the probability of B occurring, and vice versa.
Population
Everything you want to study e.g. A huge pot of soup
Type II Error
Failing to reject the null hypothesis when it is false. β
Claim 2: 65% of UCSD students are FB users
False. Population parameter may not match the sample statistic
Percent variance explains (R^2 or r^2)
For a given linear model, r^2 (the correlation coefficient squared) is the proportion of the variation in the y-variable that is accounted for (or explained) by the variation in the x-variable
Goodness-of-fit
Goodness of Fit Test: 1 population (NHL players, peas) split across a categorical variable (birth month, phenotype). You should have a 1-dimensional table of counts.
Discrete random variable
Has only a) finitely-many outcomes (e.g. X is the result of a die roll) or b) space between the values (e.g. Y is the number of meteors that have hit a planet)
From the SEpi equation: what does (xf - x̄)^2 tell us?
How far the individual is from the center of all the individuals we used to build our model. As we move far away from the core of our data, we should be more worried.
The expected value of the geometric model answers
How many trials are needed to get the first success, on average
From the SEpi equation: what does [SE(b1)]^2 tell us?
How unsure we are about the real slope of the regression line.
Understanding p-value
If the p-value is below 0.05, there is less than a 5% chance of observing a result this extreme (or more extreme) given that the null hypothesis is true, i.e. that the center of the distribution is the mean specified by H0.
"95% confident" technically means
If you drew many, many samples, and for each one, you find p̂ and built a confidence interval by reaching out +/- 2 standard deviations, then the true population parameter would be in about 95% of these intervals. -It is not "A 95% chance that your value is in the interval" -It is actually "95% of the intervals you find will contain the parameter"
If you have two quantitative variables, you can measure the strength of an association using a correlation coefficient.
If you have two qualitative (categorical) variables, you can use a chi-squared test for the significance of an association.
Simple Random Sample (SRS)
Imagine each point in a box as a person. We just pick a certain number of random points.
Approximation rule
In ANY data set that is normally distributed: -About 68% of the data values are within 1 SD of the mean -About 95% of the data values are within 2 SDs of the mean -About 99.7% of the data values are within 3 SDs of the mean
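These three percentages can be checked against the exact normal CDF using the standard library's NormalDist (Python 3.8+):

```python
# Sketch: the 68-95-99.7 rule from the standard normal CDF.
from statistics import NormalDist

z = NormalDist()   # standard normal, N(0, 1)
within = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]
# ≈ [0.6827, 0.9545, 0.9973]
```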
Common question for the Poisson distribution
In general, <some behavior> is average. How likely am I to see <some specific behavior>? e.g. You receive an average of 12.5 emails per day and X% are spam. How likely are you to see 5 spam emails in a day? e.g. There is an average of 2.5 goals scored in each soccer game. How likely is it for a game to have 9 goals?
Assumptions made for statistical inference when using the t-distribution
Independence of data: Randomization condition, <10% condition. Population distribution must be nearly normal: -look for near-normality in histogram of your sample -More skew is OK as n gets larger
Ch 25 P1
Inference About the Regression Coefficients
Ch 20
Inferences for Means
Which is more accurate: interpolation or extrapolation?
Interpolation is more accurate because the pattern you built applies to the data within range
The density graph is NOT P(X)
It is a function that helps you figure out probabilities by examining the area under it. Its shape suggests what values are more likely (relatively), but the probability of any particular outcome occurring is still 0
For smaller sample sizes (n<30) or populations where you don't know σ (and must approximate using sx), there is a better approximation of the sampling distribution than the normal model
It is called the t-distribution
ch 7
Linear Regression
How does intensity of skew affect the difference between mean and median?
Lower skew = lower difference between mean and median. Greater skew = greater difference.
How do you increase the power of a test?
Raise the significance level (α), making it easier to reject H0, or increase the sample size n
Margin of Error
ME = z*(sqrt(p̂*q̂ / n)) ME = z*(SE)
Table of Contents:
Midterm 1
Ch 3: Welcome
Ch 4: Comparing distributions
Ch 6: Scatterplots, Association, Correlation (2 variables)
Ch 7: Linear Regression
Ch 8 and 9: More things about Regression
Ch 13: Probability
Ch 14: More Probability Theory
Ch 15: Random Variables
_________________________________
Midterm 2
Ch 16: Modeling
Handout: Continuous Random Variables
Ch 5: Z-scores, the normal model, the standard normal model
Ch 17: Sampling Distributions
Ch 18: Confidence Intervals for Proportions
Ch 19: Testing Hypotheses About Proportions
__________________________________
Final
Ch 20: Inferences for Means
Ch 21: Types of Errors and 21 Questions
Ch 22: Two-Sample Proportion Inference
Ch 22 and 23: Paired Data, Two Sample Means
Ch 25 P1: Inference About the Regression Coefficients
Ch 25 P2: Prediction Intervals/Confidence Intervals
Ch 24: Chi-Squared Tests
Midterm 2 material
Midterm 2 material
When multiplying data by a value Y, how are the statistics affected?
Minimum value = Y*min Maximum value = Y*max Mean = Y*mean Median = Y*median SD = |Y|*SD IQR = |Y|*IQR
When multiplying data by a value Y and adding a value X, how are the statistics affected?
Minimum value = Y*min + X Maximum value = Y*max + X Mean = Y*mean + X Median = Y*median + X SD = |Y|*SD IQR = |Y|*IQR
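These transformation rules can be verified on a small data set (the values Y = 3 and X = 10 and the data are hypothetical):

```python
# Sketch: applying y = Y·x + X shifts center by X and scales spread by |Y|.
from statistics import mean, median, stdev

data = [1.0, 2.0, 4.0, 7.0]
scaled = [3 * v + 10 for v in data]      # Y = 3, X = 10

# mean/median transform by Y·(value) + X; SD transforms by |Y| alone.
m_ok = mean(scaled) == 3 * mean(data) + 10
```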
When adding a value X to a data, how are the statistics affected?
Minimum value = min + X Maximum value = max + X Mean = mean + X Median = median + X SD = SD (unaffected) IQR = IQR (unaffected)
Ch 16
Modeling
Ch 14
More Probability Theory
ch 8 and 9
More things about Regression
The sampling distribution is a normal curve with model
N( μ, σ/sqrt(n) )
Numeric/Quantitative Data
Numerical data with units
Density function
Only area under the graph is linked to probability.
P(A|B)
P(A and B)/P(B)
Losing Disjointness: P(A or B) =
P(A) + P(B) - P(A and B)
If all the outcomes in a sample space are equally likely, we define the probability of an event A to be
P(A) = (# of outcomes in event A)/(# of outcomes in the sample space) where 0 <= P(A) <= 1
Complement rule
P(A) = 1 - P(A^c)
In general, P(A and B) =
P(A|B) * P(B)
Advanced Bayes' Theorem (for when P(B) is not known)
P(A|B) = [P(B|A)*P(A)] / [P(B|A)*P(A) + P(B|A^c)*P(A^c)]
Bayes' Theorem
P(A|B) = [P(B|A)*P(A)]/(P(B))
P(making a Type I error) =
P(reject H0|H0 is true) = alpha
The Poisson Distribution
P(x) = (λ^x)(e^-λ) / (x!) λ = average value x = value whose probability you are trying to predict E(x) = λ SD(x) = sqrt(λ)
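The pmf is direct to implement; a sketch using the soccer example from the card above (average 2.5 goals, asking about a 9-goal game):

```python
# Sketch: Poisson pmf, P(x) = λ^x · e^(−λ) / x!.
from math import exp, factorial

def poisson_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

p9 = poisson_pmf(9, 2.5)   # a 9-goal game when the average is 2.5: rare
```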
Ch 22 and 23
Paired Data, Two Sample Means
Ch 11
Populations and Samples
Ch 25 P2
Prediction Intervals/Confidence Intervals
prediction interval vs confidence interval
Prediction: range of values in which a future observation will fall for [a single person] Confidence: range of values in which [the average of all people like that person] will fall
Ch 13
Probability
Know the symbols for both Statistics and Parameter:
Proportion: p / p̂; Mean: μ / x̄; SD: σ / s; Correlation: ρ / r; Regression coefficient: β1 / b1 (parameter / statistic in each pair)
The Central Limit Theorem (CLT)
States that the sampling distribution for a proportion statistic or mean statistic will be approximately normal, regardless of the population distribution (assuming we have met the 2 conditions: Independence and Nearly Normal)
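The CLT can be seen in a small simulation; a skewed exponential population is used here purely as an illustration:

```python
import random
import statistics

random.seed(0)  # reproducible

# Population: right-skewed exponential draws (mean 1, SD 1).
# CLT: the distribution of sample means is still approximately normal.
n, reps = 50, 2000
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# Center of the sampling distribution is near the population mean (1.0)
assert abs(statistics.mean(sample_means) - 1.0) < 0.05
# Spread is near sigma/sqrt(n) = 1/sqrt(50) ≈ 0.141
assert abs(statistics.stdev(sample_means) - 0.141) < 0.03
```

This matches the N(μ, σ/sqrt(n)) model for the sampling distribution given earlier.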
Q1, Q2, and Q3
Q1: median in the first (lower) half of the data Q2 (median): median of the whole distribution Q3: median in the second (upper) half of the data
Ch 15
Random Variables
Bernoulli trial
A random variable with exactly 2 possible outcomes, success or failure, with trials independent of one another. P(x) = p if x = success, q = 1-p if x = failure
Confidence Interval
Range of values around a point estimate that convey our uncertainty about the population parameter (as well as a range of plausible values for it)
Type I Error
Rejecting the null hypothesis when it is actually true
Standard Deviation (σ)
SD(X) = sqrt(Var(X))
Systematic Sampling
Sample elements are selected from a list or from sequential files e.g. Asking every 10th person you see
Bad Sample Frame Bias
Sample is not representative of population. e.g. Want to determine if people in US like facebook. Study facebook users in US. You completely underrepresent people who don't use facebook. Maybe they don't use facebook because they hate it!
Ch 17
Sampling Distributions
Ch 6
Scatterplots, Association, Correlation (2 variables)
Parameter
Some value summarizing the population
Statistic
Some value summarizing the sample
Standard deviation equation
s = sqrt( [sum of (yi - ȳ)²] / (n-1) )
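The sample SD formula translates directly to code (the data values are illustrative):

```python
import math
import statistics

def sample_sd(values):
    """s = sqrt( Σ(y_i - ȳ)² / (n - 1) )"""
    n = len(values)
    ybar = sum(values) / n
    return math.sqrt(sum((y - ybar) ** 2 for y in values) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data
# Matches the library's sample standard deviation
assert math.isclose(sample_sd(data), statistics.stdev(data))
```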
Null hypothesis (H0)
Statement that says nothing interesting is happening (the opposite of what you're looking for). e.g. if you're trying to prove that a drug is more effective than a placebo, your null hypothesis would be that the drug is exactly as effective as the placebo: p(drug) = p(placebo)
Ch 19
Testing Hypotheses about Proportions
Statistical Inference
The attempt to say something about the population parameter given a particular sample statistic (i.e. point estimate)
Memoryless
The exponential distribution is memoryless: the probability of a washing machine lasting 3 more years is the same whether it is brand new or has already lasted 30 years. (This might not be true in real life, but it is true for the exponential model)
If f(x) is a density function for the continuous random variable x, then P(a < x < b) equals
The integral from a to b of f(x)
From the SEpi equation: what does se^2/n + se^2 tell us?
The more spread that exists around our line (i.e., the bigger se), the less confident we are in our prediction. Having more data helps reduce SEPI (the se²/n term shrinks), but this can only help so much: the standalone se² term does not shrink as n grows.
Power of a test
The power of any test of statistical significance is the probability that it will reject a false null hypothesis: Power = P(reject H0 | H0 is false) = P(reject H0 | HA is true) = 1 - β
Marginal Probabilities
The probability of one value of a categorical variable occurring (A)
Joint probabilities
The probability of two things joining forces and happening simultaneously (A and B)
p-value
The probability, assuming H0, of a test statistic at least as extreme as the one observed (recall z0 = (x̄ - μ0)/SE). For a one-tailed test: p-value = P(z > z0) if HA is on the right tail, or p-value = P(z < z0) if HA is on the left tail. For a two-tailed test: p-value = 2*P(z > |z0|) = 2*P(z < -|z0|) = 2*(1 - P(z < |z0|))
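A sketch of computing these tail areas with the standard normal CDF (z0 = 1.96 is an illustrative test statistic):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z0 = 1.96   # illustrative test statistic
p_right = 1 - phi(z0)            # one-tailed, HA on the right tail
p_left = phi(z0)                 # one-tailed, HA on the left tail
p_two = 2 * (1 - phi(abs(z0)))   # two-tailed

assert math.isclose(p_two, 2 * p_right)
assert 0.049 < p_two < 0.051     # z0 = 1.96 sits near the 5% cutoff
```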
Requirements for both types of X^2 tests
The requirements are the same: 1) You start with a table of observed counts (k x 1 for a goodness-of-fit test; r x c for an independence/homogeneity test). You wish to compare these counts to those predicted by some theory. 2) The counts in the cells of the table must be independent of one another. Randomly sampling the people that comprise these counts usually gives us this. 3) The expected count for each cell must be at least 5. (Note: We don't require that the observed counts be at least 5, just the expected counts.)
What are the effects on error of increasing alpha (α)
The risk of a Type I error is increased and the risk of a Type II error is decreased.
Volunteer bias
Those who are willing to take their own time to voluntarily complete something like a survey usually look different from those who don't. The sample does not represent those who opt out of volunteering.
Hypothesis testing on Slopes of Regression Lines
t(n-2) = (b1 - 0)/SE(b1) -df = n-2 -SE(b1) will be given
neutral way of inflating r^2
Tossing outliers and doing the analysis without them (good or bad depending on the situation) -if outliers are trolls, tossing them is fine -if outliers are valid, observed data, you cannot toss them
Draw SRS of 200 UCSD students and ask if they have a FB account. 130 say they do. Claim 1: 65% of our sample are FB Users
True. p̂ = 130/200 = 65%
Ch 22
Two-Sample Proportion Inference
Ch 21
Types of Errors
Sampling Frame
Universe you will be picking from
Difference in means (μ1 and μ2): Hypothesis test for two independent samples
Use tdf = [(x̄1-x̄2)-0] / SE, where SE = sqrt(s1²/n1 + s2²/n2), df = min(n1-1, n2-1); no pooling necessary
Difference in proportions (p̂1 and p̂2): Hypothesis test for two independent samples
Use z = [(p̂1-p̂2)-0] / SEpooled, where SEpooled = sqrt( (p̂pooled * q̂pooled)/n1 + (p̂pooled * q̂pooled)/n2 )
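A minimal sketch of the pooled two-proportion z statistic; `two_prop_z` is a hypothetical helper and the counts are illustrative:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pool successes under H0
    q_pool = 1 - p_pool
    se_pooled = math.sqrt(p_pool * q_pool / n1 + p_pool * q_pool / n2)
    return ((p1 - p2) - 0) / se_pooled

# Illustrative counts: 60/100 successes vs 45/100 successes
z = two_prop_z(60, 100, 45, 100)
assert 2.1 < z < 2.15
```

Pooling is used because H0 says the two proportions are equal, so both samples estimate the same p.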
Claim 3: About 65% of UCSD students use FB
Vague. Need to learn how to do better. "About" is not precise enough in statistics
Variance (σ^2)
Var(X) = Sum of all x: (x-μ)^2 * P(x)
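The variance formula for a discrete random variable, sketched with an illustrative pmf:

```python
# Var(X) = Σ over all x of (x - μ)² · P(x), for a discrete random variable
pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # illustrative distribution
mu = sum(x * p for x, p in pmf.items())            # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sd = var ** 0.5                                     # SD(X) = sqrt(Var(X))

assert mu == 1.0
assert var == 0.5
```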
Methods for Marginal and Joint Probabilities
Venn diagrams, contingency tables, P(A or B) rule, P(A and B) rule.
Confidence statement format
We are (C%) confident that the (population parameter) is in the (confidence interval)
Do we always get a normal model for the sampling distribution of a mean
We do if 2 conditions are met: 1. Independence Assumption: the items in each sample must be independent of one another. Typically it is better to check two conditions (which together effectively give independence): 1.A. Randomness Condition: the items in your sample must be randomly chosen. 1.B. <10% Condition: your sample size needs to be <10% of the population size. 2. Nearly Normal Condition (sample size condition): the population histogram should look nearly normal. If the histogram shows skew, the sample size needs to be large for the sampling distribution to be normal, e.g. n>30 for moderate skew, n>60 for large skew.
Ch 3
Welcome
Alternate hypothesis (HA)
What you expect might be true. Opposite to the null hypothesis. e.g. The drug produces more treatment than the placebo: p(drug) > p(placebo)
Uniform Distribution
When a finite interval of possibilities is equally likely throughout: f(x) = 1/(b-a) when a <= x <= b, 0 otherwise. Height of the density = 1/(b-a)
Tower of power
When original data or the residuals convince you that the data are not straight enough, apply a mathematical function to the values
Proportions and means
When populations are big, we must draw a random sample and estimate these parameters using statistics
histogram symmetry
When the left and right halves of a histogram look similar/the same
Two-sided alternative hypothesis
When you are excited about results on both sides. You are wondering if your percentage is different from the comparison %. P(a) ≠ P(b)
One-sided alternative hypothesis
When you are excited with only one side; the better side. You are hoping your % is on a certain side of the comparison %. P(a) > P(b) or P(a) < P(b)
There are many curves in the t-distribution family
With n data points in sample, you use t-distribution with df (degrees of freedom) = n-1
Geometric model
X = Geom(p), where p is the probability of success and X is the number of trials needed to get a success. -Assume we are doing a Bernoulli trial with success probability p (and failure probability q=1-p) over and over until we get a success. The probability of getting a success in x trials is: P(x)=[q^(x-1)]*p E(X) = 1/p SD(X) = sqrt(q/p^2) = [sqrt(q)]/(p)
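A sketch of the geometric pmf and its mean, with an illustrative success probability:

```python
import math

def geom_pmf(x, p):
    """P(x) = q^(x-1) · p: probability the first success is on trial x."""
    return (1 - p) ** (x - 1) * p

p = 0.25   # illustrative success probability
# E(X) = 1/p, checked against the series Σ x·P(x)
mean = sum(x * geom_pmf(x, p) for x in range(1, 500))
assert math.isclose(mean, 1 / p)
# SD(X) = sqrt(q/p²), which simplifies to sqrt(q)/p
assert math.isclose(math.sqrt((1 - p) / p ** 2), math.sqrt(1 - p) / p)
```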
E(X±Y) = E(X) ± E(Y) is true even if
X and Y are dependent
The choose symbol (n nCr k)
n nCr k: helps you calculate how many ways there are to choose k successes among n attempts. Formula: n!/[k! * (n-k)!]
If x = Exp(λ):
X represents how long we will have to wait before an event with rate λ occurs. E(x) = 1/λ (the average wait before the event occurs) Var(x) = 1/(λ^2) SD(x) = 1/λ
Binomial Model
X = Binom(n,p), where n is the number of trials, p is the probability of success, and X is the number of successes in n trials. -The probability of getting k successes in n Bernoulli trials is: P(k) = (n nCr k) * (p^k) * (q^(n-k)) E(X) = np SD(X) = sqrt(npq)
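The binomial pmf, sketched for an illustrative case of 10 fair coin flips:

```python
import math

def binom_pmf(k, n, p):
    """P(k) = (n nCr k) · p^k · q^(n-k)"""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5   # illustrative: 10 fair coin flips
# The probabilities over k = 0..n sum to 1
assert math.isclose(sum(binom_pmf(k, n, p) for k in range(n + 1)), 1.0)
# E(X) = np, checked against the series Σ k·P(k)
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
assert math.isclose(mean, n * p)
```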
Does empirical probability make sense?
Yes, you tend to get what you expect
Theoretical probability
You build a mathematical model to describe a situation and use the axioms of probability to determine the likelihood of some events. e.g. determining the chance of rolling an even number on a die is 1/2 because 3/6 of possible outcomes are even numbers
Empirical probability
You determine how likely something is by trying it over and over and looking at tons of data. eg. determining if a coin is fair by flipping it 100,000 times and recording number of heads and tails
Multistage Sampling
You focus on undergrads today and ask every 4th one you see. You do grads the next day and ask every 4th one you see. -Uses 2 or more of the previous methods (excluding SRS)
By assuming H0, build a universe where p is in accordance with H0.
You must first make sure that the sampling distribution is approximately normal: make sure the sample is <10% of the total population, and make sure np >= 10 successes and nq >= 10 failures
Finding Z using pooling
Z = [(p̂1-p̂2)-0] / SEpooled, where SEpooled = sqrt( (p̂pooled * q̂pooled)/n1 + (p̂pooled * q̂pooled)/n2 )
Ch 5
Z-scores, the normal model, the standard normal model
The five number summary
Minimum value, Q1, Median, Q3, Maximum value (on a boxplot, the lower whisker extends from Q1 toward the minimum and the upper whisker from Q3 toward the maximum)
All confidence intervals work the same way, with slight changes
a point estimate ± (critical value) × SE; what changes between them is the estimate, the critical value's distribution (z or t), and the SE formula
Do not create a regression when what type of outlier is present?
a high influence outlier is present