Ch 7-11 & 20
What do residuals look like for over and underestimates?
- a positive residual → the model underestimated (observed > predicted) - a negative residual → the model overestimated (observed < predicted)
if you standardize x and y...
- the means of x and y become 0 - sx and sy become 1
What assumptions must be met in order to make inferences about the coefficients of a line?
-LINEARITY ASSUMPTION (check the scatterplot, residuals, and qq-plot to see if the data can be represented by a linear model)
 - you would expect to see no pattern in the residual plot
 - you may need to re-express the data to make the relationship linear
 - the data must be quantitative for a scatterplot to make sense
-INDEPENDENCE ASSUMPTION (important because the errors need to be independent of each other; also want to make sure the individuals are representative of the population)
 - can check the residuals for patterns or clumping, which may indicate a failure of independence
-EQUAL VARIANCE ASSUMPTION (the variability of y should be the same for all values of x; this includes the Does the plot thicken? Condition)
 - can look at the standard deviation of the residuals to measure scatter
-NORMAL POPULATION ASSUMPTION (the errors around an idealized regression line should follow the Normal model; this assumption becomes less important as the sample size grows)
 - residuals should satisfy the "nearly normal condition" and the "outlier condition"
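A minimal Python sketch of these checks (simulated data for illustration; statsmodels, scipy, and matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

# Simulated data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 80)
y = 3 + 2 * x + rng.normal(0, 1.5, 80)

results = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(x, y)                                 # linearity: scatterplot should look straight
axes[1].scatter(results.fittedvalues, results.resid)  # equal variance: no pattern, no thickening
stats.probplot(results.resid, plot=axes[2])           # normality: qq-plot of the residuals
plt.show()
```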
lurking and confounding
-a lurking variable creates an association between 2 other variables that tempts us to think one may cause the other (a variable that makes it appear that x is causing y) -confounding and lurking variables are outside influences that make it harder to understand the relationship being modeled
When doing inference for a regression model you should...
-check assumptions and conditions (plot the data and residuals) -look at R² and se -use the p-values to test hypotheses about whether the coefficients are really 0
What are the different parts of a regression table?
-coefficients (the estimates of the parameters in the regression model)
-standard errors (estimated standard deviations of the sampling distributions of the coefficients)
-t-ratios (how many standard errors the coefficients are away from 0)
-p-values (a small p-value says that if the true parameter value were 0, it would be unlikely to see a coefficient this far from 0)
-R² (fraction of the variability of y accounted for by the regression model)
-s (the standard deviation of the regression residuals, also written se)
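These quantities can be read directly off a fitted model; a hedged sketch (simulated data, statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 1 + 0.5 * x + rng.normal(0, 1, 50)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.summary())           # the full regression table

print(results.params)              # coefficients
print(results.bse)                 # standard errors
print(results.tvalues)             # t-ratios
print(results.pvalues)             # p-values
print(results.rsquared)            # R²
print(np.sqrt(results.mse_resid))  # s (se), the residual standard deviation
```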
coefficients of MLR
-a coefficient does NOT by itself tell us the effect of its associated predictor on the response variable
Why don't you want to add too many predictors?
-concern of collinearity -having many predictors makes it unlikely the model will predict new values well -predictors may model noise rather than actual variation -each predictor affects the coefficients of the other predictors
Key things not to do with regression
-do NOT interpret association as causation -NEVER interpret the regression slope coefficient as predicting how y is likely to change if its x value were changed
Interpretation of multiple regression coefficients
-each coefficient must be interpreted as the relationship between y and x after allowing for the linear effects of the other x's on both variables -basically looking at how 2 variables relate for a given value of a 3rd variable -the presence of several predictors in MLR makes interpretation of a slope less straightforward -when predictor variables are highly correlated → interpretation is more challenging (it may be impossible to remove the effects of one predictor variable from another)
What are the three big ideas of sampling?
-examine a part of a whole -randomize -it's the sample size
Assumptions for MLR
-linearity assumption (straight enough condition - the data should have an underlying linear relationship)
-also a good idea to check the residuals for linearity
-equal variance (Does the plot thicken? Condition - the variability of the errors should be about the same)
-check the residuals (use this to check the straight enough condition and equal variance)
-a histogram of the residuals should be unimodal, symmetric, and without outliers
When saying what the regression model can tell you beyond the sample...
-make a confidence interval -test hypotheses about slope and intercept of regression line
What happens when models get farther from the middle of the data?
-if the model is wrong, predictions tend to get worse the farther you move from the middle of the data -in a residual plot you should see no direction, no particular shape, no outliers, and no bends
Conditions for regression
-must be quantitative variables -straight enough condition -Does the plot thicken? Condition (residuals must share the same spread) -outlier condition
In order to use a regression model you must check for...
-quantitative (the variables should be quantitative) -straight enough (the relationship between the variables is straight enough) -no outliers
What can regression models tell us?
-regression models can tell us how things differ but not how they might change if circumstances were different -regression only describes data as they are
What should the graph of the residuals look like?
-a graph of residuals vs. x should have no shape -should show horizontal, uniform scatter -only really care about the overall pattern of the residual plot
examine part of the whole
-sample is meant to represent population -be careful to avoid bias -in order to get representative samples of entire populations, individuals are selected at random
What are common sampling mistakes?
-sample volunteers -sample conveniently -use a bad sampling frame -undercoverage -nonresponse bias -response bias
What are major ideas to keep in mind when working with summary values?
-scatterplots of statistics summarized over groups tend to show less variability (summary statistics vary less than the data on the individuals) -for plots of non-summarized data you would expect a smaller R² value -conclusions from summary data may seem stronger than they truly are
Key features for good line
-small residuals -smallest spread of the data around the line (the spread is measured with the standard deviation of the residuals) -minimizes the sum of the squared residuals
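A small numpy demonstration that the least squares line beats nearby alternative lines (simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 40)
y = 2 + 1.5 * x + rng.normal(0, 1, 40)

def ssr(b0, b1):
    """Sum of squared residuals for the line y^ = b0 + b1*x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

b1, b0 = np.polyfit(x, y, 1)  # least squares slope and intercept
print(ssr(b0, b1))            # the smallest achievable value
print(ssr(b0 + 0.5, b1))      # shifting the line up does worse
print(ssr(b0, b1 + 0.1))      # tilting the line does worse
```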
components of calculation of standard error coefficients
-spread around the model (the less scatter around the regression model → the more consistent from sample to sample)
 - can also use se (the standard deviation of the residuals) instead of R² to assess the strength of the relationship
-spread of the x's (a broader range of x-values gives a more stable base for the slope)
 - if the sd of x is large, you get a more stable regression (for multiple regression, you look at the sd of x after allowing for the other x's in the model)
-sample size (a larger sample size (n) gives more consistent estimates from sample to sample)
What does the linear model do?
-summarizes the general pattern with only a couple of parameters -no line can go through all the points -even though the model is not perfect, it can still help us understand the relationship -models allow us to expand our understanding beyond the data -we want the model's line to come closer to all the points than any other line
What can and can't the correlation coefficient tell us?
-the correlation coefficient can tell us if a relationship is strong -the correlation coefficient can NOT tell us what the equation of that line is
In order to interpret a linear model you need to know...
-variables -units
Goal of residuals:
-want to make residuals as small as possible -mean of residuals is always 0 -Se=standard deviation of residuals - standard deviation of residuals is how much the points spread around the regression line -important to make sure residuals have same scatter throughout
What is the mean residual?
0 -the model balances overestimates and underestimates so the mean residual is 0
What are the four principles of experimental design?
1) Control - decide the levels of factors and how they will be allocated - control experimental factors and other sources of variation
2) Randomize - randomization allows us to equalize the effects of unknown/uncontrollable sources of variation and reduces bias
3) Replicate
4) Block
Only need three values to find the model line:
1) Correlation (strength of the linear relationship) 2) Standard deviations (give the units) 3) Means (tell us where to put the line)
Goals of re-expression for regression
1) make the form of a scatterplot more nearly linear 2) make the scatter in a scatterplot spread out evenly rather than thickening at one end 3) make the distribution of the residuals more nearly normal (for a good re-expression you will then see a normal distribution of residuals in the histogram)
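A sketch of a log re-expression on simulated exponential-growth data (illustrative only; the constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 60)
y = 2 * np.exp(0.4 * x) * rng.lognormal(0, 0.2, 60)  # curved, thickening scatter

log_y = np.log(y)              # re-express y
b1, b0 = np.polyfit(x, log_y, 1)
resid = log_y - (b0 + b1 * x)  # residuals on the re-expressed scale
# check: a scatterplot of (x, log_y) should now look straight,
# and a histogram of resid should look nearly normal
```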
What 3 aspects of scatterplots affect the standard error of the regression slope?
1) the spread around the model (se) 2) variation among x values 3) sample size (n)
Should you check your assumptions before or after regression?
Both! -it depends on what assumption you are checking but sometimes you can only check assumptions after you have run regression
What must be done before interpreting regression?
Check conditions! -make sure to check residuals to check that the linear model is reasonable -sometimes nonlinearity is clearer in a residual plot
Conditions to check with residuals plot:
Does the plot thicken? Condition (this is part of the Equal Variance Assumption) -want to make sure that the spread is the same throughout
What kind of distribution does each coefficient have?
Each coefficient has a normal sampling distribution -sampling distributions are centered at their respective true coefficient values (the population coefficient)
Confidence intervals?
If the histogram of the residuals is unimodal and symmetric, about 95% of the errors made by the regression are smaller than 2se in magnitude (68-95-99.7 rule)
For a linear relationship can you predict y will be farther from its mean than x?
No - you can never predict that y will be farther from its mean than x was from its mean - this is because r is always between -1 and +1 - the predicted y tends to be closer to its mean than the corresponding x - because of "regression to mediocrity"
Can you predict x from y?
No -with a given model you cannot predict x from y - would need to make a new model -must decide ahead of time which variable (x) will be used to predict the other
What can you look at to check for collinearity?
R² -find the regression of that predictor and the others -R² gives the fraction of variability of the predictor accounted for by other predictors
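A hedged sketch of this check, regressing one predictor on the others; the related VIF = 1/(1 - R²) helper from statsmodels is shown alongside (simulated data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(0, 1, 100)
x2 = x1 + rng.normal(0, 0.1, 100)  # nearly collinear with x1
x3 = rng.normal(0, 1, 100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# R² of x1 regressed on the other predictors
r2 = sm.OLS(X[:, 1], np.delete(X, 1, axis=1)).fit().rsquared
print(r2)                               # near 1 → collinearity
print(variance_inflation_factor(X, 1))  # equivalently, VIF = 1 / (1 - R²)
```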
What are possible R² values?
R² is always 0-100% -R²=100% means the model is a perfect fit and there is no scatter, se=0, all of the variance would be accounted for in the model
stratified random sampling
an SRS is used within each stratum (the homogeneous groups a population is sliced into) and the results are combined -can be useful for ensuring you get data from all parts of the population -samples tend to vary less -be aware of Simpson's paradox
What does 1-R² represent?
The amount of the predictor's variance left after allowing for effects of other predictors
voluntary response sample
a large group of individuals invited to respond and those who choose to respond are counted -respondents rather than researchers decide who is included -voluntary response bias
sampling frame
a list of individuals from which the sample is drawn -in order to select sample at random, need to be able to define where sample will come from
population parameter
a numerically valued attribute of a model of a population -rarely know the population parameter value so estimate the value from sampled data
Multiple regression
a regression with 2 or more predictor variables -simple regression is just regression on a single predictor -multiple regression models allow us to do regression with multiple variables -able to make better predictions if we use multiple predictors -the coefficients of the predictors are found to minimize the sum of the squared residuals
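A minimal numpy sketch of fitting MLR coefficients by minimizing the sum of squared residuals (simulated data; the true coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 4 + 2 * x1 - 3 * x2 + rng.normal(0, 1, n)

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Coefficients that minimize the sum of squared residuals
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
print(coefs)               # approximately [4, 2, -3]
print(np.sum(resid ** 2))  # the minimized sum of squared residuals
```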
systematic sampling
a sample drawn by selecting individuals systematically from a sampling frame -when there is no relationship between the order of the sampling frame and the variable of interest, systematic sampling can be representative
Simple random sample (SRS)
a sample in which each set of n elements in the population has an equal chance of selection -ensures that every possible sample of the size drawn has an equal chance to be selected -samples drawn at random differ from one another
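A short numpy illustration of drawing an SRS (the population array here is a stand-in for a real sampling frame):

```python
import numpy as np

rng = np.random.default_rng(7)
population = np.arange(10_000)  # stand-in for a sampling frame
# without replacement: every set of 100 individuals is equally likely
srs = rng.choice(population, size=100, replace=False)
```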
census
a sample that consists of the entire population -difficult to complete -populations are constantly changing - a population can change in the time it takes to complete a census
Cluster
a smaller, representative part of the population -can be more practical to use clusters -if each cluster represents the population then cluster sampling won't be biased -clustering may be more practical or affordable -clusters can be heterogeneous, unlike strata
observational studies
a study based on data in which no manipulation of factors has been employed
placebo
a treatment known to have no effect, administered so that all groups experience the same conditions -makes sure observed effect is not due to placebo effect
pilot
a trial run of the survey you eventually plan to give to a larger group
lurking variable
a variable that is not explicitly a part of a model but affects the way the variables appear to be related
factor
a variable whose levels are manipulated by the experimenter
response variable
a variable whose values are compared across different treatments
influential
a way to describe points that change the model when they are omitted -influence depends on leverage and residual -a case that does not change the slope is not influential -a model dominated by a single case is often not useful for identifying unusual cases -influential points can hide in residual plots by pulling the line close to them
completely randomized design
all experimental units have an equal chance of receiving any treatment
linear model
an equation of a straight line that goes through the data and can then be used to predict values
predicted value
an estimate from the model -the y^ value
randomized block design
an experimental design in which participants are randomly assigned to treatments within each block
prospective study
an observational study in which subjects are followed to observe future outcomes
retrospective study
an observational study in which subjects are selected then their previous conditions/behaviors are determined
outlier
any data point that stands away from the others -in regression, cases can be extraordinary by having a large residual or having high leverage -removing outliers can increase R²
blinding
keeping any individual associated with the experiment unaware of how subjects have been allocated to treatments -helps prevent bias
adjusted R²
attempt to adjust R² by adding a penalty for each additional predictor -this way more predictors does not automatically make R² bigger
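The standard adjustment formula, shown as a small Python function (the numbers in the example are illustrative):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for a model fit to n cases with k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.80, n=50, k=3))  # ~0.787: penalized for the 3 predictors
```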
regression to the mean
because the magnitude of correlation is always less than 1, the predicted y^ tends to be fewer standard deviations away from its mean than the corresponding x
In linear model equation what is b₀ equal to?
b₀=intercept b₀=ȳ-b₁x̄ -the intercept value often is not meaningful
In linear model equation what is b₁ equal to?
b₁=slope b₁=r*(sy/sx) -the slope inherits the same sign as the correlation (r) -changing units does not affect correlation value, but it does affect slope -slope units are always y per x
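A worked check of these formulas on simulated data (numpy only; the seed and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 2, 50)

r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))  # b₁ = r * (sy / sx)
b0 = np.mean(y) - b1 * np.mean(x)                 # b₀ = ȳ - b₁x̄
print(b1, b0)
print(np.polyfit(x, y, 1))                        # matches [b1, b0]
```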
What should you be careful about when interpreting coefficients from multiple linear regression?
collinearity -when you have a good predictor but then add another predictor that carries much the same information, it makes the first predictor look worse -the effect of one predictor after allowing for the effects of another predictor may be negligible -important to think about how the different predictors are related -when the predictors are unrelated, the new information helps account for even more variation
leverage
data points whose values are far from the mean exert leverage on linear model -high leverage points pull the line close to them -with enough leverage, the residuals of these points can appear small
sample surveys
designed to ask questions of a small group of people in the hope of learning something about the entire population
e
e denotes the residuals, the sample-based errors: e=y-y^
population
the entire group of individuals or instances about whom we hope to learn -usually impractical to examine the entire population, so we use a sample
control group
experimental units assigned to a baseline treatment -response is useful for comparison
least squares (MLR)
fitting multiple regression models by choosing the coefficients that make the sum of the squared residuals as small as possible -R² still gives the fraction of variability accounted for -se=standard deviation of residuals -residual=observed-predicted
interaction term
gives an adjustment to the slope -an interaction variable accounts for different slopes in different groups
What does R² tell us?
how much of the variance in y is accounted for in regression model -a higher R² is more desirable -adding a new predictor will NEVER decrease R²
convenience sampling
includes individuals who are convenient to sample -not representative of the population
experimental unit
the individuals on which the experiment is conducted -also called subjects/participants
Confidence interval for mean prediction
interval around mean of predicted values -when there is less spread around the line it is easier to be precise -more certain of slope→more certain of prediction -more data→more precise predictions -predictions closer to the mean will have less variability
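A sketch using statsmodels' prediction interface (simulated data; x = 5 is an arbitrary illustration point):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 60)
y = 1 + 0.8 * x + rng.normal(0, 1, 60)
results = sm.OLS(y, sm.add_constant(x)).fit()

new_X = np.array([[1.0, 5.0]])  # [intercept, x] for x = 5
pred = results.get_prediction(new_X)
print(pred.conf_int())          # 95% CI for the mean response at x = 5;
                                # narrower near the mean of x, wider far from it
```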
collinearity
an issue in multiple regression when any predictor is highly correlated with one or more other predictors
extrapolation
it is unsafe to predict for values of x far from the ones used to find the linear model equation -the farther an x value is from x̄, the less that prediction should be trusted
What can happen if a predictor is collinear with another predictor in the model?
its coefficient can be surprising -may have unexpected sign/size
What does it mean that the sampling distribution of the estimated coefficient is estimated to be normal?
under the null hypothesis the distribution is centered at 0 -need a sampling distribution in order to find a p-value
What is regression used for?
modeling relationships among variables
What is one reason why models are useful?
models help simplify real situations to make them easier to understand
Where is it easiest to predict values?
near the mean (harder to predict extreme values)
Does a high R² guarantee a straight relationship?
no
Does a strong correlation show causation?
no -cannot conclude causation from regression -there is no way to be sure there is not a lurking variable responsible for the apparent association
In MLR is the coefficient a good indicator of the strength of a relationship between two variables?
no -there can be a strong relationship between y and x even though its coefficient b is almost 0 (this could be because another variable has accounted for that variation)
Do you want predictors to predict similar things?
no because of collinearity -don't want predictors to be highly correlated
What should you check for after making a scatterplot of the residuals?
no pattern, homoscedasticity, outliers
In MLR does the sign of the coefficient match the relationship of the two variables?
not always -it is possible for the coefficient to have the opposite sign of the actual relationship -this could be because for a given value of other variables, this variable has the opposite relationship overall -the coefficient of a predictor variable in multiple regression depends on other predictors
sample size
number of individuals in the sample -sample size is what determines how well the sample represents the population (not the fraction sampled) -how big a sample you need depends on what you are estimating (for something with many categories - need lots of respondents)
partial regression plots
-a plot for a specified coefficient that helps us understand that coefficient in multiple regression
-the plot's direction corresponds to the sign of the multiple regression coefficient
-can make sure the form of the plot is straight enough
-the slope should be equal to the coefficient
-can look for influential data points
-looking at partial regression plots can help determine which variables are reliable for the relationship (if there is lots of scatter or nonlinearity - be careful)
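statsmodels can draw these plots directly; a hedged sketch on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1 + 2 * x1 - x2 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()

# One partial regression plot per predictor; the slope in each plot
# equals the corresponding multiple regression coefficient
fig = sm.graphics.plot_partregress_grid(results)
plt.show()
```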
Regression models allow us to...
predict values
treatment
process, intervention and other controlled circumstances applied to randomly assigned experimental units -treatments are the different levels of factors -important to assign treatment at random
What makes the best experiments?
randomized, comparative, double-blind, placebo-controlled
matching
reduces unwanted variation -participants who are similar in ways not under study may be matched then compared with each other
How do you fix collinearity?
remove predictors -keep the predictors that are most reliably measured/easiest
Sample-based equivalent of the model error ε
residual (e) -the residual is the sample-based version of ε -residuals help us assess the sample model
Residual equation
residual=observed value-predicted value OR e=y-y^ -residuals help us determine if the model makes sense
If the model was perfect...
residuals would be 0
biased
sampling methods that by their nature tend to over or under emphasize some characteristics of the population -conclusions from biased methods are flawed
multistage samples
sampling schemes that combine several sampling methods
What do we use to estimate true population values?
the sample slope and intercept -these tend to vary from the true values but they give us our best guess -if the data satisfy certain assumptions and conditions we can estimate the uncertainty of the estimates
sample
a smaller group of individuals selected from the population
What is the difference between standard deviation and standard error?
standard deviation: amount of variability/dispersion from individual values to the mean standard error: how far the sample mean is likely to be from the true population mean
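A quick numeric illustration of the difference (simulated data):

```python
import numpy as np

rng = np.random.default_rng(11)
data = rng.normal(100, 15, 400)

sd = np.std(data, ddof=1)     # variability of individual values
se = sd / np.sqrt(len(data))  # variability of the sample mean
print(sd)  # ~15
print(se)  # ~0.75: the mean is far more stable than individual values
```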
What should you check for after making the scatterplot?
straight enough condition
does stratifying reduce or increase sampling variability?
stratifying reduces sampling variability
Experiment
-a study that examines the relationship between two or more variables
-there must be at least one explanatory variable
-involves an aspect of manipulation
-an experiment actively and deliberately manipulates factors to control the details of the possible treatments
-subjects are assigned to treatments at random
-the experimenter compares responses across the different groups
What could multiple modes in data be an indicator of?
subgroups of data -the residual histogram can potentially help identify multiple modes
randomizing
the best defense against bias is randomization, in which each individual is given a fair, random chance of selection -randomization helps protect against the influence of unknown factors
residual
the difference between the observed and predicted value - Residual=observed value - predicted value = y - y^ -residuals tell us how well the model predicted the observed value
effect size
the effect of a treatment is the magnitude of the difference in responses -when effect size is larger, it is easier to discern a difference in treatment groups
r²
the fraction of the data's variation accounted for by the model
1-r²
the fraction of the original variation left in the residuals
least squares
the line of best fit is the line for which the sum of the squared residuals is the smallest -this line minimizes the variance of the residuals
What should you check for after making a histogram of the residuals?
the nearly normal condition
What does a negative residual mean?
the observed value was less than the predicted value
What varies from sample to sample?
the regression line -each regression line will have its own b₀ and b₁
least squares line is the same as...
the regression line -regression = linear model fit by least squares
sampling variability
the sample-to-sample differences; each sample yields different values
Does size of sample or fraction of population matter?
the size of the sample is what matters -the same sample size works equally well for populations of different sizes -the fraction of the population sampled does not matter!!
What is the natural null hypothesis?
the slope is 0 (H₀: β₁=0)
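A sketch of the t-test this implies, computed by hand and checked against statsmodels (simulated data):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(12)
x = rng.uniform(0, 10, 40)
y = 2 + 0.5 * x + rng.normal(0, 2, 40)

results = sm.OLS(y, sm.add_constant(x)).fit()
b1, se_b1 = results.params[1], results.bse[1]

t = b1 / se_b1                                   # how many SEs b₁ is from 0
p = 2 * stats.t.sf(abs(t), df=results.df_resid)  # two-sided p-value
print(t, p)
print(results.tvalues[1], results.pvalues[1])    # matches statsmodels
```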
levels
the specific values that the experimenter chooses for a factor
R²
the square of the correlation between x and y -an R² of 0 means that none of the variance in the data is accounted for by the model -R² is often given as a percentage -R² gives the fraction of the variability of y accounted for by the least squares regression on x -R² is an overall measure of how successful the regression is in linearly relating x to y
random assignment
to be valid, an experiment must assign experimental units to treatment groups using some form of randomization
the slope and intercept are parameters and coefficients
true
statistic
values calculated for sampled data -used to estimate population parameter
indicator variable
variables that indicate which of the 2 categories each case is in -in order to use an indicator variable, the lines for the two groups must be roughly parallel (similar slopes) -the coefficient of the indicator variable acts as a vertical shift
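A minimal sketch of an indicator (dummy) variable in a regression (simulated parallel groups; the shift of 5 is arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
n = 80
x = rng.uniform(0, 10, n)
group = rng.integers(0, 2, n)  # 0/1 indicator for the two categories
y = 1 + 2 * x + 5 * group + rng.normal(0, 1, n)  # parallel lines, shifted by 5

X = sm.add_constant(np.column_stack([x, group]))
results = sm.OLS(y, X).fit()
print(results.params)  # approximately [1, 2, 5]; the indicator's coefficient
                       # is the vertical shift between the two groups
```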
R²
variance accounted for by the model
single blind
when all the individuals in one of the classes are blinded
statistically significant
when an observed difference is too large for us to believe that it is likely it occurred naturally -there are statistical tests to determine this
double blind
when both classes (everyone) are blinded
representative
when the calculated statistics accurately reflect the corresponding population parameters -a biased sampling methodology tends to over/under estimate the parameter of interest
block
when groups of experimental units are similar in a way that is not a factor under study -blocking helps isolate the variability -should block whenever there is an identifiable difference among the experimental units -blocking is useful but not a requirement of experiments -used to reduce variability
confounded
when levels of one factor are associated with levels of another factor
standard deviation of residuals (se)
when the assumptions and conditions are met, residuals can be well described by using standard deviation and the 68-95-99.7 rule
collinear
when two or more predictors are linearly related
linear model general equation form
y^=b₀+b₁x
Coefficients in general linear model equation
y^=b₀+b₁x -b₁=slope -b₀=intercept