Ch 7-11 & 20

What do residuals look like for over and underestimates?

- a positive residual → underestimate - a negative residual → overestimate

if you standardize x and y...

- the means of x and y both equal 0 - sx and sy both equal 1
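
A minimal Python sketch of standardizing, using made-up data. The values and the helper name `standardize` are illustrative assumptions, not part of the notes.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the sample SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]

# Made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

zx = standardize(x)
zy = standardize(y)

# After standardizing, each mean is (numerically) 0 and each SD is 1
print(mean(zx), stdev(zx))
print(mean(zy), stdev(zy))
```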

What assumptions must be met in order to make inferences about the coefficients of a line?

- LINEARITY ASSUMPTION (check the scatterplot and the residual plot to see if the data can be represented by a linear model)
  - you would expect to see no pattern in the residual plot
  - you may need to re-express the data to make the relationship linear
  - the data must be quantitative for a scatterplot to make sense
- INDEPENDENCE ASSUMPTION (the errors need to be independent of each other; also want to make sure the individuals are representative of the population)
  - check the residuals for patterns or clumping, which may indicate a failure of independence
- EQUAL VARIANCE ASSUMPTION (the variability of y should be the same for all values of x; this includes the Does the plot thicken? Condition)
  - can look at the standard deviation of the residuals to measure scatter
- NORMAL POPULATION ASSUMPTION (the errors around an idealized regression line should follow the Normal model; the normality assumption becomes less important as sample size grows)
  - residuals should satisfy the "nearly normal condition" and the "outlier condition" (check a histogram or qq-plot of the residuals)

lurking and confounding

-a lurking variable creates an association between 2 other variables that tempts us to think one may cause the other (a variable that makes it appear like x is causing y) -confounding and lurking are outside influences that make it harder to understand the relationship being modeled

When doing inference for a regression model you should...

-check assumptions and conditions (plot the data and residuals) -look at R² and se -use the p-values to test the hypotheses that the true coefficients are 0

What are the different parts of a regression table?

- coefficients (the estimates of the parameters in the regression model)
- standard errors (estimated standard deviations of the sampling distributions of the coefficients)
- t-ratios (how many standard errors the coefficients are away from 0)
- p-values (a small p-value says that if the true parameter value were 0, it would be unlikely to see a coefficient this far from 0)
- R² (fraction of the variability of y accounted for by the regression model)
- s (standard deviation of the regression residuals, also written se)

coefficients of MLR

-a coefficient does NOT tell us the effect of its associated predictor on the response variable by itself (its value depends on the other predictors in the model)

Why don't you want to add too many predictors?

-concern of collinearity -having many predictors makes it unlikely the model will predict new values well -predictors may model noise and not actual variation -each predictor affects the coefficients of the other predictors

Key things not to do with regression

-do NOT interpret association as causation -NEVER interpret the regression slope coefficient as predicting how y is likely to change if its x value were changed

Interpretation of multiple regression coefficients

-each coefficient must be interpreted as the relationship between y and x after allowing for the linear effects of the other x's on both variables -basically looking at how 2 variables relate for a given value of a 3rd variable -the presence of several predictors in MLR makes interpretation of a slope less straightforward -when predictor variables are highly correlated → interpretation is more challenging (it may be impossible to remove the effects of one predictor variable from another)

What are the three big ideas of sampling?

-examine a part of a whole -randomize -it's the sample size

Assumptions for MLR

-linearity assumption (straight enough condition - data should have underlying linear relationship) -also a good idea to check residuals for linearity -equal variance (Does the plot thicken? Condition - variability of errors should be about the same) -check the residuals (use this to check straight enough condition and equal variance) - histogram of residuals should be unimodal, symmetric and without outliers

When saying what the regression model can tell you beyond the sample...

-make a confidence interval -test hypotheses about slope and intercept of regression line

What happens when models get farther from the middle of the data?

-predictions tend to be worse the farther x is from the middle of the data, especially if the model is wrong -in a residual plot you should see no direction, no particular shape, no outliers, and no bends

Conditions for regression

-must be quantitative variables -straight enough condition -Does the plot thicken? Condition (residuals must share the same spread) -outlier condition

In order to use a regression model you must check for...

-quantitative (the variables should be quantitative) -straight enough (the relationship between the variables is straight enough) -no outliers

What can regression models tell us?

-regression models can tell us how things differ but not how they might change if circumstances were different -regression only describes data as they are

What should the graph of the residuals look like?

-a plot of residuals vs. x should have no shape -should be horizontal, uniform scatter -only really care about the overall pattern of the residual plot

examine part of the whole

-sample is meant to represent population -be careful to avoid bias -in order to get representative samples of entire populations, individuals are selected at random

What are common sampling mistakes?

-sample volunteers -sample conveniently -use a bad sampling frame -undercoverage -nonresponse bias -response bias

What are major ideas to keep in mind when working with summary values?

-scatterplots of statistics summarized over groups tend to show less variability (summary statistics vary less than the data on the individuals) -for a plot of the individual (non-summarized) data you would expect a smaller R² value -conclusions from summary data may seem stronger than they truly are

Key features for good line

-small residuals -smallest spread of data around the line (the spread is measured with the standard deviation of the residuals) -minimized sum of squared residuals

components of the calculation of the standard errors of the coefficients

- spread around the model (the less scatter around the regression model → the more consistent the estimates from sample to sample)
  - can also use se (standard deviation of the residuals) instead of R² to assess the strength of the relationship
- spread of the x's (a broader range of x-values gives a more stable base for the slope)
  - if the sd of x is large, you get a more stable regression (for multiple regression, look at the sd of x after allowing for the other x's in the model)
- sample size (a larger sample size n gives more consistent estimates from sample to sample)

What does the linear model do?

-summarize the general pattern with only a couple of parameters -no line can go through all the points -even though the model is not perfect, it can still help us understand the relationship -models allow us to expand our understanding beyond the data -want the line of model to come closer to all the points than any other line

What can and can't the correlation coefficient tell us?

-the correlation coefficient can tell us how strong a linear relationship is -the correlation coefficient can NOT tell us what the equation of the line is

In order to interpret a linear model you need to know...

-variables -units

Goal of residuals:

-want to make residuals as small as possible -mean of residuals is always 0 -Se=standard deviation of residuals - standard deviation of residuals is how much the points spread around the regression line -important to make sure residuals have same scatter throughout

What is the mean residual?

0 -the model balances overestimates and underestimates so the mean residual is 0

What are the four principles of experimental design?

1) Control
- decide the levels of the factors and how they will be allocated
- control experimental factors and other sources of variation
2) Randomize
- randomization allows us to equalize the effects of unknown/uncontrollable sources of variation and reduces bias
3) Replicate
4) Block

Only need three values to find the slope and intercept of the model line:

1) Correlation (strength of linear relationship) 2) Standard deviations (gives the units) 3) Means (tells us where to put the line)

Goals of re-expression for regression

1) make the form of a scatterplot more nearly linear 2) make the scatter in a scatterplot spread out evenly rather than thickening at one end 3) make the distribution of the residuals more nearly normal (for a good re-expression you will then see a normal distribution of residuals in the histogram)

What 3 aspects of scatterplots affect the standard error of the regression slope?

1) the spread around the model (se) 2) variation among x values 3) sample size (n)
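
A rough Python sketch of how these three pieces combine in simple regression, using the standard formula SE(b₁) = se / (√(n-1) · sx). The data and the function name `slope_standard_error` are made up for illustration.

```python
from math import sqrt

def slope_standard_error(x, y):
    """SE of the slope in simple regression: s_e / (sqrt(n - 1) * s_x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # 1) spread around the model
    s_x = sqrt(sxx / (n - 1))                             # 2) variation among the x values
    return s_e / (sqrt(n - 1) * s_x)                      # 3) sample size enters via n - 1

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
se_b1 = slope_standard_error(x, y)
print(se_b1)
```

Less scatter (smaller se), more spread in x (larger sx), or more data (larger n) each make the result smaller.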

Should you check your assumptions before or after regression?

Both! -it depends on what assumption you are checking but sometimes you can only check assumptions after you have run regression

What must be done before interpreting regression?

Check conditions! -make sure to check residuals to check that the linear model is reasonable -sometimes nonlinearity is clearer in a residual plot

Conditions to check with residuals plot:

Does the plot thicken? Condition (this is part of the Equal Variance Assumption) -want to make sure that the spread is the same throughout

What kind of distribution does each coefficient have?

Each coefficient has a normal sampling distribution -sampling distributions are centered at their respective true coefficient values (the population coefficient)

Confidence intervals?

If the histogram of the residuals is unimodal and symmetric - about 95% of the errors made by the regression are smaller in size than 2se (68-95-99.7 rule)

For a linear relationship can you predict y will be farther from its mean than x?

No - measured in standard deviations, you can never predict that y will be farther from its mean than x was from its mean - this is because r is always between -1 and +1 - the predicted y tends to be closer to its mean than the corresponding x - because of "regression to mediocrity"

Can you predict x from y?

No -with a given model you cannot predict x from y - would need to make a new model -must decide ahead of time which variable (x) will be used to predict the other

What can you look at to check for collinearity?

R² -regress that predictor on the other predictors -this R² gives the fraction of variability of the predictor accounted for by the other predictors
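
A small Python sketch of this check for the simplest case of one other predictor, where the R² of the regression is just the squared correlation between the two predictors. The data and the name `r_squared_between` are hypothetical; with several other predictors you would fit a full multiple regression instead.

```python
def r_squared_between(x1, x2):
    """R² from regressing predictor x1 on a single other predictor x2."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return s12 ** 2 / (s11 * s22)  # squared correlation

# Made-up predictors, for illustration only
size = [1, 2, 3, 4, 5]
size_rescaled = [3, 5, 7, 9, 11]  # exactly 2 * size + 1, so perfectly collinear
unrelated = [2, 1, 3, 1, 2]       # chosen to be uncorrelated with size

r2_collinear = r_squared_between(size, size_rescaled)
r2_unrelated = r_squared_between(size, unrelated)
print(r2_collinear, r2_unrelated)
```

An R² near 1 signals collinearity; an R² near 0 means the predictor carries information the others do not.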

What are possible R² values?

R² is always between 0% and 100% -R²=100% means the model is a perfect fit and there is no scatter, se=0, all of the variance would be accounted for by the model

stratified random sampling

SRS is used within each stratum (the homogeneous groups a population is sliced into) and the results are combined -can be useful for ensuring you get data from all parts of the population -samples tend to vary less - be aware of Simpson's paradox
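
A minimal Python sketch of stratified random sampling. The strata, the per-stratum sample sizes, and the function name `stratified_sample` are assumptions made up for illustration.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample within each stratum and keep them by stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        # SRS without replacement inside this stratum
        sample[name] = rng.sample(members, sizes[name])
    return sample

# Hypothetical population sliced into two homogeneous groups
strata = {
    "first_year": [f"F{i}" for i in range(60)],
    "senior": [f"S{i}" for i in range(40)],
}
sizes = {"first_year": 6, "senior": 4}  # proportional to stratum size here

chosen = stratified_sample(strata, sizes, seed=0)
print(chosen)
```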

What does 1-R² represent?

The amount of the predictor's variance left after allowing for effects of other predictors

voluntary response sample

a large group of individuals is invited to respond and those who choose to respond are counted -respondents rather than researchers decide who is included -voluntary response bias

sampling frame

a list of individuals from which the sample is drawn -in order to select sample at random, need to be able to define where sample will come from

population parameter

a numerically valued attribute of a model of a population -rarely know the population parameter value so estimate the value from sampled data

Multiple regression

a regression with 2 or more predictor variables -simple regression is just regression on a single predictor -multiple regression models allow us to do regression with multiple variables -able to make better predictions if use multiple predictors -the coefficients of predictors are found to minimize the sum of the squared residuals

systematic sampling

a sample drawn by selecting individuals systematically from a sampling frame -when there is no relationship between the order of the sampling frame and the variable of interest, systematic sampling can be representative

Simple random sample (SRS)

a sample in which each set of n elements in the population has an equal chance of selection -ensures that every possible sample of the size drawn has an equal chance to be selected -samples drawn at random differ from one another
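
A short Python sketch of drawing an SRS from a sampling frame with the standard library's `random.sample`. The frame of 100 ID numbers and the helper name `simple_random_sample` are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n individuals without replacement; every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers 1 through 100
frame = list(range(1, 101))

sample_a = simple_random_sample(frame, 10, seed=1)
sample_b = simple_random_sample(frame, 10, seed=2)
print(sample_a)
print(sample_b)  # a different draw: samples drawn at random differ from one another
```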

census

a sample that consists of the entire population -difficult to complete -populations are constantly changing - a population can change in the time it takes to complete a census

Cluster

a smaller group that is representative of the population -if each cluster represents the population then cluster sampling won't be biased -clustering may be more practical or affordable -clusters can be heterogeneous, unlike strata

observational studies

a study based on data in which no manipulation of factors has been employed

placebo

a treatment known to have no effect, administered so that all groups experience the same conditions -makes sure observed effect is not due to placebo effect

pilot

a trial run of the survey you eventually plan to give to a larger group

lurking variable

a variable that is not explicitly a part of a model but affects the way the variables appear to be related

factor

a variable whose levels are manipulated by the experimenter

response variable

a variable whose values are compared across different treatments

influential

a way to describe points that change the model when they are omitted -influence depends on leverage and residual -a case that does not influence the slope is not influential -a model dominated by a single case is often not useful for identifying unusual cases -influential points can hide in residual plots by pulling the line close to them

completely randomized design

all experimental units have an equal chance of receiving any treatment

linear model

an equation of a straight line that goes through the data and can then be used to predict values

predicted value

an estimate from the model -the ŷ value

randomized block design

an experimental design in which participants are randomly assigned to treatments within each block
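
A rough Python sketch of a randomized block assignment. The blocks, the treatment names, and the function `randomized_block_assignment` are invented for illustration; it assumes each block's size is a multiple of the number of treatments so the groups come out balanced.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=None):
    """Within each block, shuffle the units and deal out treatments in turn."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # random order within this block only
        for position, unit in enumerate(shuffled):
            assignment[unit] = treatments[position % len(treatments)]
    return assignment

# Hypothetical experimental units, blocked by a variable that is not under study
blocks = {
    "men": ["M1", "M2", "M3", "M4"],
    "women": ["W1", "W2", "W3", "W4"],
}
treatments = ["new drug", "placebo"]

assignment = randomized_block_assignment(blocks, treatments, seed=42)
print(assignment)
```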

prospective study

an observational study in which subjects are followed to observe future outcomes

retrospective study

an observational study in which subjects are selected then their previous conditions/behaviors are determined

outlier

any data point that stands away from the others -in regression, cases can be extraordinary by having a large residual or having high leverage -removing outliers can increase R²

blinding

keeping any individual associated with the experiment unaware of how subjects have been allocated to treatments -helps prevent bias

adjusted R²

an attempt to adjust R² by adding a penalty for each additional predictor -this way adding more predictors does not automatically make adjusted R² bigger
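
The notes do not give the formula; a minimal Python sketch using the usual textbook definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of predictors. The function name and the example numbers are illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R² for a model with k predictors fit to n cases."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Same raw R², but the second model used one more predictor to get there
one_predictor = adjusted_r_squared(0.6, n=5, k=1)
two_predictors = adjusted_r_squared(0.6, n=5, k=2)
print(one_predictor, two_predictors)
```

With the raw R² held fixed, the extra predictor lowers the adjusted value, which is the penalty at work.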

regression to the mean

because the magnitude of the correlation is at most 1 (and less than 1 unless the fit is perfect), the predicted ŷ tends to be fewer standard deviations away from its mean than the corresponding x
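
In standardized units the regression line is (predicted z-score of y) = r × (z-score of x). A tiny Python sketch of that relationship; the function name `predicted_z_y` and the example correlations are illustrative.

```python
def predicted_z_y(r, z_x):
    """Predicted z-score of y for a case whose x is z_x SDs from its mean."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation must be between -1 and +1")
    return r * z_x

# A case 2 SDs above the mean in x, under two hypothetical correlations
moderate = predicted_z_y(0.5, 2.0)   # predicted y is closer to its mean than x was
negative = predicted_z_y(-0.8, 2.0)  # below the mean of y, still fewer than 2 SDs away
print(moderate, negative)
```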

In linear model equation what is b₀ equal to?

b₀ = intercept
- b₀ = ȳ - b₁x̄
- the intercept value often is not meaningful

In linear model equation what is b₁ equal to?

b₁ = slope
- b₁ = r(sy/sx)
- the slope inherits the same sign as the correlation (r)
- changing units does not affect the correlation value, but it does affect the slope
- slope units are always y per x
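
A minimal Python sketch that builds the line from the summary values alone, using b₁ = r(sy/sx) and b₀ = ȳ - b₁x̄. The data and the helper names are made up for illustration.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson correlation r computed from sample means and SDs."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * sx * sy)

def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    r = correlation(x, y)
    b1 = r * stdev(y) / stdev(x)   # slope: correlation times sy/sx
    b0 = mean(y) - b1 * mean(x)    # intercept: the line passes through (x-bar, y-bar)
    return b0, b1

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(b0, b1)
```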

What should you be careful about when interpreting coefficients from multiple linear regression?

collinearity -when you have a good predictor but then add another predictor that does the same thing so it makes the other predictor worse -the effect of one predictor after allowing for the effects of another predictor is negligible -important to think about how the different predictors are related -when the predictors are unrelated, the new info helps account for even more variation

leverage

data points whose values are far from the mean exert leverage on linear model -high leverage points pull the line close to them -with enough leverage, the residuals of these points can appear small

sample surveys

designed to ask questions of a small group of people in the hope of learning something about the entire population

e

e = residual (the sample-based version of the error) - e = y - ŷ

population

entire group of individuals or instances about whom we hope to learn -usually impractical to examine the entire population so use a sample

control group

experimental units assigned to a baseline treatment -response is useful for comparison

least squares (MLR)

fitting multiple regression models by choosing the coefficients that make the sum of the squared residuals as small as possible -R² still gives the fraction of variability -se=standard deviation of residuals -residual=observed-predicted

interaction term

gives adjustment to slope -variable accounts for different slopes in different groups

What does R² tell us?

how much of the variance in y is accounted for in regression model -a higher R² is more desirable -adding a new predictor will NEVER decrease R²
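
A small Python sketch of R² for a simple regression, computed as 1 minus the fraction of the variation in y left in the residuals. The data and the function name `r_squared` are made up for illustration.

```python
def r_squared(x, y):
    """Fraction of the variation in y accounted for by the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Variation left over in the residuals vs. total variation in y
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    return 1 - sse / sst

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(r2)
```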

convenience sampling

includes individuals who are convenient to sample -not representative of the population

experimental unit

individuals on whom the experiment is conducted -also called subjects/participants

Confidence interval for mean prediction

interval around mean of predicted values -when there is less spread around the line it is easier to be precise -more certain of slope→more certain of prediction -more data→more precise predictions -predictions closer to the mean will have less variability

collinearity

an issue in multiple regression when any predictor is highly correlated with one or more other predictors

extrapolation

it is unsafe to predict for values of x far from the ones used to find the linear model equation -the farther an x value is from x̄, the less that prediction should be trusted

What can happen if a predictor is collinear with another predictor in the model?

its coefficient can be surprising -may have unexpected sign/size

What does it mean that the sampling distribution of the estimated coefficient is estimated to be normal?

under the null hypothesis the sampling distribution is centered at 0 (mean=0) -need a sampling distribution in order to find a p-value

What is regression used for?

modeling relationships among variables

What is one reason why models are useful?

models help simplify real situations to make them easier to understand

Where is it easiest to predict values?

near the mean (harder to predict extreme values)

Does a high R² guarantee a straight relationship?

no

Does a strong correlation show causation?

no -cannot conclude causation from regression -there is no way to be sure there is not a lurking variable responsible for the apparent association

In MLR is the coefficient a good indicator of the strength of a relationship between two variables?

no -there can be a strong relationship between y and x but b be almost 0 (this could be because another variable has accounted for that variation)

Do you want predictors to predict similar things?

no because of collinearity -don't want predictors to be highly correlated

What should you check for after making a scatterplot of the residuals?

no pattern, homoscedasticity, outliers

In MLR does the sign of the coefficient match the relationship of the two variables?

not always -it is possible for the coefficient to have the opposite sign of the overall relationship -this could be because, for a given value of the other variables, this variable has the opposite relationship to the one it shows overall -the coefficient of a predictor variable in multiple regression depends on the other predictors

sample size

number of individuals in the sample -sample size is what determines how well the sample represents the population (not the fraction sampled) -how big a sample you need depends on what you are estimating (for something with many categories - need lots of respondents)

partial regression plots

- a plot for a specified coefficient that helps us understand that coefficient in multiple regression
- the plot's direction corresponds to the sign of the multiple regression coefficient
- can make sure the form of the plot is straight enough
- the slope of the plot should be equal to the coefficient
- can look for influential data points
- looking at partial regression plots can help determine which variables are reliable for the relationship (if there is lots of scatter or the form is nonlinear, be careful)

Regression models allow us to...

predict values

treatment

process, intervention and other controlled circumstances applied to randomly assigned experimental units -treatments are the different levels of factors -important to assign treatment at random

What makes the best experiments?

randomized, comparative, double-blind, placebo-controlled

matching

reduces unwanted variation -participants who are similar in ways not under study may be matched then compared with each other

How do you fix collinearity?

remove predictors -keep the predictors that are most reliably measured/easiest

Sample-based equivalent of the error ε

residual (e) -the residual is the sample-based version of ε -residuals help us assess the sample model

Residual equation

residual=observed value-predicted value OR e=y-ŷ -residuals help us determine if the model makes sense
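
A short Python sketch of the residual calculation for a least squares line, with made-up data and illustrative helper names. It also shows the signs: a negative residual is an overestimate, a positive one an underestimate, and the residuals balance out to a mean of 0.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares(x, y)
predicted = [b0 + b1 * a for a in x]
residuals = [obs - pred for obs, pred in zip(y, predicted)]  # e = y - y-hat

for obs, pred, e in zip(y, predicted, residuals):
    label = "underestimate" if e > 0 else "overestimate"
    print(obs, round(pred, 2), round(e, 2), label)

mean_residual = sum(residuals) / len(residuals)
print(mean_residual)  # zero apart from floating-point rounding
```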

If the model was perfect...

residuals would be 0

biased

sampling methods that by their nature tend to over or under emphasize some characteristics of the population -conclusions from biased methods are flawed

multistage samples

sampling schemes that combine several sampling methods

What do we use to estimate true population values?

the sample slope and intercept (b₁ and b₀) -these tend to vary from the true values but they give us our best guess -if the data follow certain assumptions and conditions we can estimate the uncertainty of the estimates

sample

smaller groups of individuals from population

What is the difference between standard deviation and standard error?

- standard deviation: amount of variability/dispersion of individual values around the mean
- standard error: how far the sample mean is likely to be from the true population mean

What should you check for after making the scatterplot?

straight enough condition

does stratifying reduce or increase sampling variability?

stratifying reduces sampling variability

Experiment

- a study that examines the relationship between two or more variables
- there must be at least one explanatory variable
- involves manipulation: an experiment actively and deliberately manipulates factors to control the details of the possible treatments
- subjects are assigned to treatments at random
- the experimenter compares responses for the different groups

What could multiple modes in data be an indicator of?

subgroups of data -the residual histogram can potentially help identify multiple modes

randomizing

the best defense against bias is randomization in which each individual is given a fair, random chance of selection -randomization helps protect against the effects of factors in the data that we don't know about

residual

the difference between the observed and predicted value - Residual=observed value - predicted value = y - ŷ -residuals tell us how well the model predicted the observed value

effect size

the effect of a treatment is the magnitude of the difference in responses -when effect size is larger, it is easier to discern a difference in treatment groups

R²

the fraction of the data's variation accounted for by the model

1-r²

the fraction of the original variation left in the residuals

least squares

the line of best fit will be the line for which the sum of the squared residuals is the smallest -this line minimizes the variance of the residuals

What should you check for after making a histogram of the residuals?

the nearly normal condition

What does a negative residual mean?

the observed value was less than the predicted value

What varies from sample to sample?

the regression line -each sample's regression line will have its own b₀ and b₁

least squares line is the same as...

the regression line -regression = linear model fit by least squares

sampling variability

the natural sample-to-sample differences in the values that different samples yield

Does size of sample or fraction of population matter?

the size of the sample is what matters -about the same sample size is needed no matter how large the population is -the fraction of the population sampled does not matter!!

What is the natural null hypothesis?

the slope is 0 (H₀: β₁ = 0)

levels

the specific values that the experimenter chooses for a factor

R²

the square of the correlation between x and y -R² of 0 means that none of the variance in the data is accounted for by the model -R² is often given as a percentage -R² gives the fraction of variability of y accounted for by the least squares regression on x -R² is an overall measure of how successful the regression is in linearly relating x to y

random assignment

to be valid, an experiment must assign experimental units to treatment groups using some form of randomization

the slope and intercept are parameters and coefficients

true

statistic

values calculated for sampled data -used to estimate population parameter

indicator variable

variables that indicate which of the 2 categories each case is in -in order to use indicator variable, slopes of each group must be roughly parallel -coefficients of indicator variable act as vertical shift

R²

variance accounted for by the model

single blind

when all the individuals in one of the classes are blinded

statistically significant

when an observed difference is too large for us to believe that it is likely it occurred naturally -there are statistical tests to determine this

double blind

when both classes (everyone) are blinded

representative

when calculated statistic accurately reflects corresponding population parameters -a biased sampling methodology tends to over/under estimate the parameter of interest

block

when groups of experimental units are similar in a way that is not a factor under study -blocking helps isolate the variability due to differences between blocks -should block whenever there is an identifiable difference among experimental units -blocking is useful but not a requirement of experiments -used to reduce variability

confounded

when levels of one factor are associated with levels of another factor

standard deviation of residuals (se)

when the assumptions and conditions are met, residuals can be well described by using standard deviation and the 68-95-99.7 rule

collinear

when two or more predictors are linearly related

linear model general equation form

ŷ=b₀+b₁x

Coefficients in general linear model equation

ŷ=b₀+b₁x -b₁=slope -b₀=intercept

