Statistics Chapter 3

SST

Total sum of squares: measures the total variation in the y values, SST = sum of (y - ybar)^2. It is a constant multiple, (n - 1), of the variance of y.

General idea of regression lines

Model for the data: the equation of a regression line gives a compact mathematical description of what the model tells us about the relationship between the response variable y and the explanatory variable x

Y-intercept

In the regression line yhat = a + bx, a is the y-intercept, the predicted value of y when x = 0

Caution w/ scatterplots

Association does not imply causation because there may be other variables lurking in the background that contribute to the relationship between two variables

Facts about correlation: #2

Because r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both. Unit conversions (linear transformations with a positive multiplier) do not affect r.
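
A minimal sketch of this fact, assuming made-up height and weight data (the numbers and variable names are hypothetical). The helper computes r from sums of deviations, which is algebraically the same as averaging products of standardized scores.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r, computed from sums of deviations from the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 71]
weights_lb = [110, 125, 140, 155, 180]

# Same individuals, new units: centimeters and kilograms
heights_cm = [2.54 * h for h in heights_in]
weights_kg = [0.4536 * w for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)

# The two values agree up to floating-point rounding
print(r_original, r_converted)
```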

Residual

Difference between an observed value of the response variable and the value predicted by the regression line. Residual = observed y - predicted y = y - yhat *Represents the leftover variation in the response variable after fitting the regression line
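
As a rough illustration, residuals can be computed directly from the definition. The data and the line yhat = 2.2 + 0.6x below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical data and a hypothetical regression line yhat = a + bx
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Predicted value for each x
predicted = [a + b * x for x in xs]

# Residual = observed y - predicted y
residuals = [y - yhat for y, yhat in zip(ys, predicted)]

for x, y, yhat, res in zip(xs, ys, predicted, residuals):
    print(f"x={x}  y={y}  yhat={yhat:.1f}  residual={res:+.1f}")
```

A positive residual means the observed y is above the line; a negative residual means it is below.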

Distance and standard deviations

For an increase of one standard deviation (sx) in the value of the explanatory variable x, the least-squares regression line predicts a change of r standard deviations (r × sy) in the response variable y

When is r2 = 1

If all the points fall directly on the least-squares line, SSE = 0 and r2 = 1. Then all of the variation in y is accounted for by the linear relationship with x.

Problems with regression lines

In most cases, no line passes exactly thru all the points in a scatterplot. Because we use the line to predict y from x, the prediction errors we make are errors in y, the vertical direction in the scatterplot. A good regression line makes vertical distances (residuals) of the points from the line as small as possible.

Specific values of variables

It is easiest to identify explanatory and response variables when we actually specify values of one variable to see how it affects another variable.

Least-squares regression line

The least-squares regression line of y on x is the line that makes the sum of the squared residuals as small as possible (squared b/c positive and negative residuals would otherwise cancel out)

Linear relationships

Linear relationships are important because a straight line is a simple pattern that is quite common; a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line.

Explanatory variable

May help explain or influence changes in a response variable

Response variable

Measures an outcome of a study

Problems w/ positive and negative association

Not all relationships have a clear direction that we can describe as a positive association or negative association

X = 0 as an extrapolation

Often using the regression line to make a prediction for x = 0 is an extrapolation. That's why the y-intercept isn't always statistically meaningful.

Causation

Often we want to know whether changes in the explanatory variable cause changes in the response variable. Remember, correlation does NOT imply causation.

Another name for residual

Prediction error

Scatterplot

Shows the relationship between two quantitative variables measured on the same individuals. The values of one variable (explanatory variable) appear on the horizontal axis and the values of the other variable (response variable) appear on the vertical axis. Each individual in the data appears as a point in the graph.

The size of slope

Small slope does not mean there is no relationship. The size of the slope depends on units in which we measure the two variables. You can't say how important a relationship is by looking at the size of the slope of the regression line (unlike correlation).

Accuracy of predictions

The accuracy of predictions from a regression line depends on how much the data scatter about the line.

Another meaning of correlation

The average of the products of the standardized scores

r2 name

The coefficient of determination

Nonsense correlations

The correlation is real but the conclusion that changing one variable causes a change in the other variable is nonsense.

The coefficient of determination

The fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. We can calculate r2: r2 = 1 - SSE/SST, where SSE = sum of squared residuals = sum of (y - yhat)^2 and SST = sum of squared deviations from the mean = sum of (y - ybar)^2
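
A minimal sketch of the r2 calculation, assuming a small hypothetical data set. The slope is computed as sum of (x - xbar)(y - ybar) divided by sum of (x - xbar)^2, which is algebraically equivalent to b = r(sy/sx).

```python
# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Least-squares slope and intercept
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

# SSE: squared residuals from the regression line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# SST: squared deviations from the mean line y = ybar
sst = sum((y - ybar) ** 2 for y in ys)

r_squared = 1 - sse / sst
print(f"SSE={sse:.2f}  SST={sst:.2f}  r2={r_squared:.2f}")
```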

Are all outliers influential?

No. The least-squares line is most likely to be heavily influenced by observations that are outliers in x. Influential points often have small residuals because they pull the regression line toward themselves. The scatterplot alerts you to these (don't rely on the residual plot alone b/c it may miss influential points)

(xbar, y bar)

The least-squares regression line for any data set passes through the point (x bar, y bar)

The importance of slope vs. y-intercept

The slope of a regression line is an important numerical description of the relationship between the two variables. Although we need the value of the y-intercept to draw the line, it is statistically meaningful only when the explanatory variable can actually take values close to zero.

Cluster

There are a bunch of data points together - Name ranges of each variable where cluster appears

Outlier

There's a lot of white space around the data point - Outlier in response variable (y) - Outlier in explanatory variable - Outlier in both - Outlier b/c doesn't follow the overall pattern/trend

Prediction

We can use a regression line to help predict the response (y hat) for a specific value of the explanatory variable x

Rounding

When doing calculations, don't round until the end of the problem. Use as many decimal places as your calculator stores to get accurate values of the slope and y intercept.

Interpret regression line

_ % of the variation in the (response variable) is accounted for by the regression line

How to determine if relationship b/w explanatory and response variable

- Make scatterplot and look for overall pattern; if linear, find regression line and plot it - Look at size of residuals - Look at residual plot - Find r2 and s to determine how well the line describes data and how large our prediction errors will be

What happens to the least squares regression line if we standardize both variables?

Standardizing a variable converts its mean to 0 and its standard deviation to 1. So (xbar, ybar) is transformed to (0,0), and the least-squares line for the standardized values will pass through (0,0). Since sx = sy = 1, the slope is equal to the correlation.
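
A quick check of this claim on hypothetical data: standardize both variables, fit the least-squares line to the z-scores, and compare the slope with r. The helper names `standardize` and `least_squares` are made up for this sketch.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    xbar, ybar = mean(xs), mean(ys)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return ybar - slope * xbar, slope

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = standardize(xs), standardize(ys)
intercept, slope = least_squares(zx, zy)

# Correlation as the sum of z-score products divided by n - 1
r = sum(p * q for p, q in zip(zx, zy)) / (len(xs) - 1)

print(f"intercept={intercept:.4f}  slope={slope:.4f}  r={r:.4f}")
```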

How to calculate the correlation

Suppose that we have data on variables x and y for n individuals. The means and the standard deviations of the two variables are xbar and sx for the x-values and ybar and sy for the y values. The correlation r between x and y is 1/(n-1) times the sum of the products of zx and zy (Use calculator: 6 1 4)
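
The formula above can be sketched in a few lines as an alternative to the calculator. The data are hypothetical, and `correlation` is just an illustrative helper name.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) times the sum of the products of the z-scores."""
    n = len(xs)
    xbar, sx = mean(xs), stdev(xs)
    ybar, sy = mean(ys), stdev(ys)
    zx = [(x - xbar) / sx for x in xs]  # standardized x-values
    zy = [(y - ybar) / sy for y in ys]  # standardized y-values
    return sum(p * q for p, q in zip(zx, zy)) / (n - 1)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(f"r = {r:.4f}")
```

Swapping the arguments, `correlation(ys, xs)`, gives the same value because the products can be multiplied in either order.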

Predicted value

Y hat is the predicted value of the response variable y for a given value of the explanatory variable x *Approximation

2 tools to describe the relationship between variables

1.) Correlation 2.) Regression

How to determine if a line is an appropriate model to use for the data

1.) Residual plot (scattered/no pattern) 2.) Small residuals 3.) Find S 4.) Find r2

Interpreting computer regression output

- Constant coefficient: y-intercept - Variable coefficient: slope - Variable: explanatory variable name - S: standard deviation of the residuals - R-sq: coefficient of determination (r2)

Facts about correlation: #1

Correlation makes no distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating correlation (you can multiply in any order)

Facts about correlation: #3

The correlation r itself has no units of measurement. It is just a number.

Cautions: describing the relationship between two variables is more complex than describing the distribution of one variable

1.) Correlation requires that both variables be quantitative so that it makes sense to do the arithmetic indicated by the formula for r. 2.) Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables no matter how strong the relationship is. A correlation of 0 doesn't guarantee that there's no relationship between two variables, just that there's no linear relationship. 3.) Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot. 4.) Correlation is not a complete summary of two-variable data, even when the relationship b/w the variables is linear. You should give the means and standard deviations of both x and y along with the correlation.

How to make a scatterplot

1.) Decide which variable should go on each axis 2.) Label and scale your axes - Don't start at (0,0) - Start scale to highlight main body of points 3.) Title your plot 4.) Plot individual data values

Procedure with 2 variable statistics

1.) Plot data and calculate numerical summaries 2.) Look for overall patterns and deviations from those patterns 3.) When there's a regular overall pattern, use a simplified model to describe it

Residual plot

A residual plot is a scatterplot of the residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.

Examining residual plots

A residual plot turns the regression line horizontal. It magnifies deviations of the points from the line, making it easier to see unusual observations and patterns. If the regression line captures the overall pattern of the data, there should be no pattern in the residuals.

Problems with correlation calculation

A value of r close to 1 or -1 does not guarantee a linear relationship between two variables. A scatterplot with a clear curved form can have a correlation near 1 or -1. Always plot your data.
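
A small sketch of why plotting matters, using a deliberately curved hypothetical relationship, y = x^2 for x = 1 to 10. The helper computes r from sums of deviations, which is equivalent to the z-score formula.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r, computed from sums of deviations from the means."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical curved data: every point lies exactly on the parabola y = x^2
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)

# r is close to 1 even though the relationship is clearly not linear
print(f"r = {r:.3f}")
```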

Explanatory-response relationship

Always plot the explanatory variable, if there is one, on the horizontal axis (x axis) of the scatterplot. We usually call the explanatory variable x and the response variable y. If there is no explanatory-response distinction, either variable can go on the horizontal axis.

Association does not imply causation

An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y. *Sometimes association is due to cause and effect but other times it is due to lurking variables

Influential observation

An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation; points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line

Outlier (definition)

An outlier is an observation that lies outside the overall pattern of other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers (large in x direction but not y direction) may not have large residuals.

Slope

B is the slope; the amount by which y is predicted to change when x increases by one unit *Coefficient of x is always the slope no matter what symbol is used

Limitations of regression and correlation: 3

Correlation and least-squares regression lines are not resistant. They are affected by outliers.

Limitations of regression and correlation: 2

Correlation and regression lines describe only linear relationships. You can calculate the correlation and least-squares line for any relationship b/w 2 quantitative variables, but the results are only useful if the scatterplot shows a linear pattern (always plot your data!)

Direction/trend

Draw an oval around the data and find the slope of the major axis: negative slope means negative trend while positive slope means positive trend. If the relationship has a clear direction, we speak of positive association (high values of the two variables tend to occur together) or negative association (high values of one variable tend to occur with low values of the other variable)

How to verify if a point is influential

Find the regression line both with and without the unusual point. If the line moves more than a small amount when point is deleted, the point is influential
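
One way to sketch this check, assuming hypothetical data in which the last point (20, 0) is an outlier in x. How much movement counts as "more than a small amount" is a judgment call; here the slope even changes sign.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    return ybar - slope * xbar, slope

# Hypothetical data; the last point is far from the others in the x direction
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 0]

# Fit with the unusual point included
a_with, b_with = least_squares(xs, ys)

# Fit again with the unusual point removed
a_without, b_without = least_squares(xs[:-1], ys[:-1])

print(f"with point:    yhat = {a_with:.2f} + {b_with:.2f}x")
print(f"without point: yhat = {a_without:.2f} + {b_without:.2f}x")
```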

Limitations of regression and correlation: 1

For regression, the distinction b/w explanatory and response variables is important. Least-squares regression makes distance of data points from line small only in y direction. If we reverse role of two variables we get a different line. This is not true for correlation.
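
A minimal sketch of the role reversal, on hypothetical data. If the two regressions described the same line, the slope of y on x would equal 1 divided by the slope of x on y; in general it does not.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line that predicts ys from xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_y_on_x = least_squares_slope(xs, ys)  # minimizes vertical distances
slope_x_on_y = least_squares_slope(ys, xs)  # minimizes horizontal distances

# Drawn on the same axes, the x-on-y line has slope 1 / slope_x_on_y
print(f"y on x slope: {slope_y_on_x:.2f}")
print(f"x on y line, drawn on the same axes: {1 / slope_x_on_y:.2f}")
```

The product of the two slopes equals r2, so the lines coincide only when r = 1 or r = -1.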

Correlation vs. association

It only makes sense to talk about correlation between two quantitative variables. If one or both variables are categorical, you should refer to the association b/w them. To be safe, use "association" when describing relationship b/w 2 variables.

How to examine scatterplots

Look for the overall pattern and striking departures from that pattern. 1.) To describe the OVERALL PATTERN of a scatterplot, discuss the direction/trend, the form/shape, clusters and the strength of the relationship 2.) To describe DEPARTURES from the OVERALL PATTERN discuss outliers (an individual that falls outside the overall pattern of the relationship)

SSE

Measures the sum of squared errors

Overall pattern vs. departures

One common method of data analysis is looking for an overall pattern and for striking departures from the pattern. A regression line describes the overall pattern of a linear relationship between an explanatory variable and a response variable. We see departures from this pattern by looking at the residuals.

Problem with judging linear relationships

Our eyes are not a good judge of the strength of a linear relationship. It is easy to be fooled by different scales or by the amount of space around the cloud of points. We need to use a numerical measure to supplement the graph. Correlation is the measure we use.

Positive association

Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.

What is another way to see how well a least squares line fits our data

r2 (the coefficient of determination) tells us how well the least-squares line predicts the values of the response variable

Important things to look for when you examine a residual plot: #2

Residuals should be relatively small in size. A regression line that fits the data well should come "close" to most of the points. How do we decide whether residuals are small enough? We consider the size of a "typical" prediction error.

The ratio SSE/SST

Tells us what proportion of the total variation in y still remains after using the regression line to predict values of the response variable (interpret: __% of the variation in (response variable) is unaccounted for by the linear model relating y to x)

Interpret s in context

The average residual/prediction error for predicting the response variable is __ using the least squares line

Correlation coefficient and determination coefficient

The determination coefficient is the correlation coefficient squared! There is a relationship between correlation and regression. When reporting regression, find r to note strength of linear relationship. When reporting correlation, find r2 to note how successful the regression was in explaining the response.

Important things to look for when you examine a residual plot: #1

The residual plot should show no obvious patterns. Ideally, the graph shows an unstructured (Random) scatter of points in a horizontal band centered at zero. A curved pattern in a residual plot shows that the relationship is not linear. If the spread about the regression line increases for larger/smaller values of x, predictions of y using this line will be less accurate for these values of x.

Interpret a residual

The residual says ___ than predicted by the least squares regression line

There is a close connection between correlation and the slope of the least-squares regression line

The slope equation says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r = 1 or r = -1), the change in the predicted response is the same (in standard deviation units) as the change in x. Otherwise, because -1 < r < 1, the change in y hat is smaller (in standard deviation units) than the change in x. As the correlation grows less strong, the prediction moves less in response to changes in x.

Extrapolation

The use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. *Few relationships are linear for all values of the explanatory variable. Don't make predictions using values of x that are much larger or much smaller than those that actually appear in your data.

Negative association

Two variables have a negative association when above-average values of one tend to accompany below-average values of the other

Equation for the least-squares regression line

We have data on an explanatory variable x and a response variable y for n individuals. From the data we calculate the means xbar and ybar and the standard deviations sx and sy of the two variables and their correlation r. The least-squares regression line is the line yhat = a + bx with slope b = r(sy/sx) and y-intercept a = ybar - b(xbar)
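
A short sketch of these formulas applied to hypothetical data. The summary statistics come from Python's `statistics` module, and r is computed from z-scores.

```python
from statistics import mean, stdev

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Summary statistics
xbar, ybar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)
r = sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Slope and y-intercept of the least-squares line yhat = a + bx
b = r * (sy / sx)
a = ybar - b * xbar

print(f"yhat = {a:.2f} + {b:.2f}x")

# The line passes through (xbar, ybar): predicting at xbar returns ybar
yhat_at_xbar = a + b * xbar
```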

How can we predict y if we don't know x

Use the mean of the response variable

Strength (heteroscedasticity)

How scattered the data are (based on the oval): how close the points in a scatterplot lie to a simple form such as a line - Thin hot dog shape = strong - Football shape = moderate - Basketball shape = weak - Fan out = strength differs for different values of the explanatory variable *Measured by the correlation coefficient

Benefits of residuals

Residuals show how far the data fall from the regression line and thus help us assess how well the line fits/describes the data. Residuals can be calculated from any model fitted to data. However, residuals from the least-squares line have a special property: the mean of the least-squares residuals is always zero.

Graph for displaying relationship between two quantitative variables

Scatterplot

SSE > SST

Never happens. Since the least-squares line yields the smallest possible sum of squared prediction errors, SSE can never be more than SST, which is based on the line y = ybar. In the worst case scenario, the least-squares line does no better at predicting y than y = ybar does. Then SSE = SST and r2 = 0

Regression line

Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form: yhat = a + bx

Correlation (coefficient)

The correlation r measures the direction and strength of the linear relationship between two quantitative variables. - The correlation r is always a number b/w -1 and 1. - Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association. - Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 toward -1 or 1. - The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.

Form/shape

The general shape of the graph Ex: linear relationships/curved relationships/outliers/clusters

S (standard deviation of the residuals)

We know the average prediction error (mean of the residuals) is 0 when using a least-squares regression line since positive and negative residuals cancel. That's why we use the standard deviation to find the approximate size of a "typical" or "average" prediction error (residual). If we use a least-squares line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by s = square root of [(sum of squared residuals)/(n - 2)] = square root of [(sum of (y - yhat)^2)/(n - 2)]
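
A minimal sketch of the calculation of s, assuming hypothetical data and a least-squares line fitted in the same block.

```python
from math import sqrt

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Fit the least-squares line yhat = a + bx
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

# Residuals: observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# The mean residual is zero (up to rounding), so it says nothing about error size
mean_residual = sum(residuals) / n

# s: square root of the sum of squared residuals divided by n - 2
s = sqrt(sum(res ** 2 for res in residuals) / (n - 2))

print(f"mean residual = {mean_residual:.4f}  s = {s:.4f}")
```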

