AP Statistics Ch.3
How can we determine the "best" regression line for a set of data?
It must go through the point (x bar, y bar), the means of x & y, and it must minimize the sum of the squared residuals.
Do all relationships have a clear association?
No
Is correlation a resistant measure?
No, but that doesn't always mean an outlier weakens the correlation. -An outlier that falls far above or below the line of best fit, away from the cluster of data points, weakens the correlation, so r would get stronger if it were removed. -An outlier that follows the general trend of the data, but just falls much to the left or right, strengthens the correlation, so r would get weaker if it were removed, because more dots stay close to the line when it is present.
Can SSE ever be greater than SST?
No. Because the least-squares line yields the smallest possible sum of squared prediction errors, SSE can never exceed SST, which is the sum of squared errors for the horizontal line y = y bar. In the worst-case scenario, the least-squares regression line does no better at predicting y than y = y bar does.
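The argument above can be written as a single inequality. This is only a sketch: a and b are assumed names for any candidate intercept and slope, and the horizontal line y = y bar is the special case a = y bar, b = 0.

```latex
\mathrm{SSE}
  = \min_{a,\,b} \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2
  \;\le\; \sum_{i=1}^{n} \left( y_i - \bar{y} - 0 \cdot x_i \right)^2
  = \mathrm{SST}
```

Equality holds when the best slope is 0, i.e. when the regression line is no better than the mean line.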
Does a strong association between 2 variables indicate a cause-effect relationship?
No, correlation does not mean causation.
Does it always make sense to interpret the y-intercept?
No, not if x = 0 is an extrapolation or if the value is illogical/unreasonable in the context.
Outlier
-ONLY mention POTENTIAL outliers, do not waste time calculating it. (If none, do not mention none in sentence) -data values above/ below line -OR follow pattern of line & stray from rest (does not affect slope)
Explain the formula for the standard deviation of the residuals.
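-s = √(SSE/(n-2)), where SSE is the sum of the squared residuals -divide by n-2 because the line depends on 2 estimated values (slope & y-intercept) -min for variation around the line= 3 data points→n-2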
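A minimal sketch of that calculation, assuming the usual form s = √(SSE/(n-2)). The data and the function names (lsrl, residual_sd) are made up for illustration and are not from the course materials.

```python
import math

def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residual_sd(xs, ys):
    """Standard deviation of the residuals: sqrt(SSE / (n - 2))."""
    n = len(xs)
    if n < 3:
        # With 2 points the line fits perfectly, so there is nothing to measure.
        raise ValueError("need at least 3 data points")
    a, b = lsrl(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s = residual_sd(xs, ys)
print(round(s, 3))
```

The result is in the same units as y, which is why it reads as a "typical prediction error."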
How do you know which variable to put on which axis? Where do you start each axis?
-explanatory= x if clear -response= y if clear -sometimes doesn't matter if no clear distinction
What are some additional facts about correlation?
-used with numerical summaries (e.g. µ, σ, M) -doesn't care about explanatory vs. response variables (i.e. you can switch x and y and r would remain the same) -has no units; if we ∆ the units, r does not ∆ (e.g. height in cm→in) -x&y must be QUANTITATIVE -relationship in the scatterplot must be LINEAR
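These facts can be checked numerically. The sketch below uses invented height/weight values and a hand-written correlation helper (an assumption for illustration, not a calculator or library routine).

```python
import math

def correlation(xs, ys):
    """Pearson correlation r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 74, 88]

r = correlation(heights_cm, weights_kg)

# Switching x and y leaves r unchanged.
r_swapped = correlation(weights_kg, heights_cm)

# Changing units (cm to inches) leaves r unchanged.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = correlation(heights_in, weights_kg)

# Adding a constant to every value leaves r unchanged.
r_shifted = correlation(heights_cm, [w + 10 for w in weights_kg])

print(r, r_swapped, r_inches, r_shifted)
```

All four printed values should agree up to rounding error.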
How do you construct and interpret a residual plot?
1) Must run LIN REG 1st→ calculator automatically stores residuals as list 2) 2nd Y= →choose scatter plot X List: L₁ (or x's) Y List: 2nd Stat & scroll to residual 3) Zoom 9
How do you calculate the standard deviation of the residuals on a calculator?
1) Stat--> Tests 2) Alpha cos for LinRegTTest
What 2 things do you look for in a residual plot to see if a linear model is appropriate?
1) equal scatter above and below 0 residual line 2) no leftover pattern --> linear -leftover pattern/ curve --> not linear -curve more important than scatter in determining whether or not data is linear
How do you create a scatterplot on a TI-84?
1. Enter the data in 2 different lists- L1= explanatory; L2=response 2. 2nd Y= turn on scatterplot, choose dots 3. Graph 4. Zoom 9 (zoom stat)
What are the steps for using the formula sheet to get a regression line?
1. Look at the formula for b1, your slope. Formula: b1=r(sy/sx) 2. Look for b0, your y-intercept. b0=y bar-slope(x bar) 3. Plug b0 & b1 into equation for regression line: y hat=b1x+ b0
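The three steps above can be sketched in a few lines. The summary statistics are made-up numbers and lsrl_from_summary is a hypothetical helper name.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (b0, b1) for the line y hat = b0 + b1*x from summary statistics."""
    b1 = r * (s_y / s_x)        # step 1: slope
    b0 = y_bar - b1 * x_bar     # step 2: y-intercept
    return b0, b1

# Hypothetical summary statistics, invented for this example
b0, b1 = lsrl_from_summary(r=0.8, s_x=2.0, s_y=5.0, x_bar=10.0, y_bar=50.0)

# Step 3: plug b0 and b1 into the regression equation.
def predict(x):
    return b0 + b1 * x

print(b0, b1, predict(10.0))
```

Predicting at x = x bar should give back y bar, since the line always passes through (x bar, y bar).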
r (aka correlation)
1. measures direction & strength of a LINEAR relationship; however, an r close to -1 or 1 does not by itself prove the relationship is linear (curved data can also have a strong r). 2. -1≤r≤1 3. -1, 1 are equally strong, 0=weakest 4. same sign as slope.
How are s and r² affected by changing units?
1. r² has no units and thus will not change. 2. s has the same units as y, so it changes when the units of y change: it is multiplied by the same factor as y.
What factors should you evaluate to see if a linear model is appropriate for a set of data?
1. scatterplot shows S P pattern 2. residual plot shows no leftover pattern 3. r² is __ so it is S. (must do one of the above 1st)
positive association
Above-average values of one variable tend to accompany above-average values of the other, and below-average values also tend to occur together.
negative association
Above-average values of one variable tend to accompany below-average values of the other, and vice versa.
Interpret the slope.
For every increase in one ___________ (x-value), we expect the _________ (y-value) to go ___________ (up/down) by about _________ (|slope|)
What happens to the predicted value of y for each increase of 1 standard deviation in x?
For every increase of 1 standard deviation in x, we expect the predicted y value to change by r standard deviations of y (r times sy), because the slope is r(sy/sx).
How do outliers affect correlation, least-squares regression line, & std deviation of the residuals (s)? Are the outliers influential?
A high-leverage point is an outlier in the x direction. It is influential if it changes r (making it weaker/ stronger), increases/ decreases the slope, and increases/ decreases the y-intercept. The slope & y-intercept usually move in opposite directions: when one goes up, the other goes down. S is also probably affected, unless the outlier is positioned about a distance of s from the regression line. An outlier can even change the sign of the slope and r if it is far enough from the line.
Interpret the y-intercept.
If the _________ (x-value) is 0, we expect the _________ (y-value) to be _________ (y-intercept) (units).
How do you know whether or not to use linear regression?
If the pattern in the scatterplot is linear, you can use it, but if the data show a distinct curve, do not.
How do you determine whether the regression over vs. under predicted the data value?
If the line is under the point, you under predicted. However, if the line is above the point, you over predicted.
What is a residual plot? What is its purpose?
It is a scatter plot of residuals against an explanatory variable. -assesses how well a regression line fits the data (i.e. is a linear model appropriate)
What does r squared provide?
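It tells how much better the regression line predicts y than the mean line (y = y bar) does: the fraction by which the sum of squared errors is reduced.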
On a computer output from a regression analysis, how do you identify the y-intercept?
It is the constant (to the right of "Constant")
On a computer output from a regression analysis, how do you identify the slope?
It is under the y-intercept value & next to the explanatory variable name.
What is the standard deviation of the residuals? How do you calculate and interpret it?
It measures the approximate size of the residuals and how far off the predictions are from the trend line. (i.e. the approximate size of the average error)
What does PODS stand for?
Pattern, Outliers, Direction, Strength
Does it matter which variable is x & which is y?
Sometimes, if y is AFFECTED BY x, we want x to be explanatory. Otherwise, it doesn't matter.
What happens if you standardize both variables?
The point (x bar, y bar) is transformed to (0,0). Since the formula for slope is b=r(sy/sx), and sx=sy=1, b=r (slope=correlation).
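A quick numerical check of this result, as a sketch with invented data. The helper names (standardize, lsrl, correlation) are assumptions for illustration.

```python
import math
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(xs, ys):
    """Pearson correlation r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
intercept_z, slope_z = lsrl(standardize(xs), standardize(ys))

# After standardizing, the slope should match r and the intercept should be 0.
print(r, slope_z, intercept_z)
```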
Interpret the correlation (r).
There is a _________ (strength) ________ (pos/ neg) association between ________ (x-value) & ________ (y-value).
How should you interpret a scatterplot in 1-2 sentences?
There is a ___________ (S) negative/ positive (D) _____________ (P) relationship between _________ (x-value) and _____________ (y-value). There is/are _______ potential outliers. Ex: There is a moderately strong negative linear relationship between sprint time and jump. There is 1 potential outlier.
When does an outlier decrease the slope/ increase y intercept?
When it falls far below the trend line.
Outlier weakens correlation.
When outlier (x,y) falls far above or below trend line.
Outlier strengthens correlation.
When point follows trend line but has a high or low x.
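Both outlier effects can be seen with a small made-up data set. This is an illustrative sketch only: the added points are chosen by hand, and other data sets may show smaller changes.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this example
base_x = [1, 2, 3, 4, 5]
base_y = [2, 4, 5, 4, 5]
r_base = correlation(base_x, base_y)

# Extra point far ABOVE the trend at a typical x: correlation gets weaker.
r_off_line = correlation(base_x + [3], base_y + [12])

# Extra point that FOLLOWS the trend but has a very high x: correlation gets stronger.
r_on_trend = correlation(base_x + [15], base_y + [11.2])

print(r_off_line, r_base, r_on_trend)
```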
Interpret a residual.
When using this regression line, our actual value was _________ |residual| (units) more/ less than predicted (expected).
Interpret the standard deviation of the residuals
When we use this regression line, we will be off by about ______ (units).
Interpret r-squared.
_______ (r²)% of the variation in _________ (y-value) is accounted for by the regression line (linear model) (LSRL) relating _________ (y-value) and _________ (x-value).
How is r^2 related to s?
both measure the strength of a line
What must you ensure that you do every time you make a regression line equation?
define your variables: y is ________ x is ________ y hat is predicted _________ OR INSTEAD write the variable names in the equation: predicted (y-variable)= a+b(x-variable)
What is the easiest way to lose points when making a scatterplot?
forgetting label (both x & y words & scale)
Pattern
form (linear, polynomial, exponential)
Strength
how closely points follow a clear form -weak, strong, moderately, very -helps convey how accurate a prediction is
How can you calculate the equation of the least-squares regression line using summary statistics?
if given r, sx, sy, x bar, and y bar AND NOT A DATA SET... use the formula sheet and plug in givens to get the equations.
explanatory variable
in general, explains response -goes on x-axis (NOT CAUSES Y) -AKA IV
response variable
in general, outcome of explanatory -goes on y-axis -AKA DV
Is the least squares regression line resistant to outliers?
no
Do you need to know the formula for correlation?
no, get it from your calculator or in a table -avg of the products of the standardized scores for all data -adding a constant to all data points does not ∆ the r
Can you determine the form of the relationship using only the correlation?
no, r has a formula that you could use on any set of numbers, but that doesn't make it correct
Direction
positive or negative association
What is extrapolation? Is it a good idea to extrapolate?
predicting a y-value for an x-value outside of the range of given x-values (NEVER OK IN AP)
What is the slogan to remember about deriving r from r²?
r from r² is easy to find, take the square root AND the slope's sign
How is r^2 related to r?
r*r=r² √r²=(+/-)r (r CANNOT BE BOTH POSITIVE AND NEGATIVE)
What is an easier way to think about a residual plot?
rotate scatter plot to get horizontal line
How do you calculate r^2?
r²=1-SSE/SST
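A sketch of this formula on invented data, with a check that it agrees with squaring r. The helper lsrl is a hypothetical name, not a calculator command.

```python
import math

def lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)
y_bar = sum(ys) / len(ys)

# SSE: squared errors around the regression line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# SST: squared errors around the mean line y = y bar.
sst = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - sse / sst

# Compare with r computed directly, then squared.
x_bar = sum(xs) / len(xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
r = sxy / math.sqrt(sxx * sst)

print(round(r_squared, 3), round(r ** 2, 3))
```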
scatterplot
shows the relationship between two quantitative variables measured on the same individuals
residual
the difference between an observed value of the response variable and the value predicted by the regression line.
coefficient of determination (r^2)
the fraction of the variation in the values of y that is accounted for by the LSRL of y on x
Least squares
the line that makes the sum of the squared residuals (the squared vertical distances from each data point to the best fit line) as small as possible. -WHAT % of ERROR (SQUARES) WAS GOTTEN RID OF? -minimizes the total area of the squares drawn from the points to the regression line
What is SST?
the ∑(yi-ȳ)² (squares from the mean line y = ȳ, added up) -TOTAL variation in y values
What is SSE?
the ∑residuals² (squares of our line added up) -sum of the squared errors
What is the equation for the residual?
y - ŷ (actual y-predicted y) (A-P)
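A short sketch of A-P in code. The line ŷ = 2.2 + 0.6x and the data points are hypothetical values chosen for illustration.

```python
# Hypothetical regression line: y hat = 2.2 + 0.6x
INTERCEPT = 2.2
SLOPE = 0.6

def predict(x):
    """Predicted y (y hat) from the regression line."""
    return INTERCEPT + SLOPE * x

def residual(x, y):
    """Residual = actual y - predicted y (A - P)."""
    return y - predict(x)

# Hypothetical data, invented for this example
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

for x, y in points:
    res = residual(x, y)
    # Positive residual: point is above the line, so the line under-predicted.
    # Negative residual: point is below the line, so the line over-predicted.
    verdict = "under-predicted" if res > 0 else "over-predicted"
    print(x, y, round(res, 2), verdict)

residuals = [residual(x, y) for x, y in points]
```

For a least-squares line the residuals add up to 0 (up to rounding), which is a handy check on the arithmetic.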
What is the difference between y and y hat?
y hat is a predicted y (a point on the regression line), whereas y is the actual observed value from the data
What is the general form of a regression equation?
ŷ=a+bx, where b is the slope and a is the y-intercept