GS ECO 302 CH 13 Correlation and Linear Regression


The correlation coefficient gives a good idea of the (linear) relationship between two variables and is much more informative than a simple scatter plot or the covariance. But it has issues and limitations too.

1. The first issue is that a correlation does not by itself imply any cause-and-effect relation. Two variables may be highly correlated yet in reality have no relation at all.
2. Another issue with the coefficient of correlation is that it cannot help us predict one variable from knowledge of the other.

CHARACTERISTICS OF THE CORRELATION COEFFICIENT

1. The sample correlation coefficient is identified by the lowercase letter r.
2. It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables.
3. It ranges from -1 up to and including +1.
4. A value near 0 indicates there is little relationship between the variables.
5. A value near 1 indicates a direct or positive relationship between the variables.
6. A value near -1 indicates an inverse or negative relationship between the variables.

regression equation A formula for a line that models a linear relationship between two quantitative variables: Yi = β0 + β1Xi + εi for i = 1, 2, ..., n
1. It is a hypothetical or theoretical relation assumed to exist in the unknown population. Therefore we use Greek letters for all the parameters of this equation. Also note that Y and X are random variables, so this relation has probability concepts underlying it. Each observation pair (Yi, Xi) is a pair of random variables with some joint probability distribution.
2. The variable on the L.H.S. of the equation is the dependent variable, with the generic notation Y.
3. The subscript i refers to the observation number. There are n observations, or n pairs of Y and X values. We assume that the same relation holds across all observations.
4. The relation is assumed to be linear: the Y and X variables both enter in linear form, with no powers, square roots, logs, etc.

5. The first parameter, β0, is called the constant term or intercept and is the expected or mean value of Y when X is zero. In many situations zero may not be a possible or valid value for X. In fact β0 is the height at which the regression line cuts the Y-axis: positive if the line cuts above the origin, negative if it cuts below the origin, and zero if it passes right through the origin.
6. The second parameter, β1, is called the slope coefficient or slope parameter. It is generally the focus of regression analysis. It is the slope or inclination of the regression line. If it is positive, the regression line slopes upward, showing a positive relation. The value of β1 is the rate of change in Y with respect to a small increment in X; in calculus terms it is the derivative of Y with respect to X.
7. The first two terms of the bivariate regression, β0 + β1Xi, constitute the deterministic or systematic part, in contrast to the unsystematic last term.
8. The last term, εi (epsilon), has many names: the residual, error, random disturbance, or stochastic term. We will generally call it the error term. It is an acknowledgement that the model is not perfect.
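To make the deterministic-plus-error decomposition concrete, here is a minimal Python sketch that simulates data from this model. The parameter values (β0 = 2.0, β1 = 0.5, σ = 1.0) and the choice of X values are illustrative assumptions, not values from the chapter.

```python
import random

# A minimal simulation of the bivariate model Yi = b0 + b1*Xi + ei.
# The parameter values below are made-up assumptions for demonstration.
random.seed(42)
b0, b1, sigma = 2.0, 0.5, 1.0

n = 10
X = [float(i) for i in range(1, n + 1)]             # chosen (non-random) X values
eps = [random.gauss(0.0, sigma) for _ in range(n)]  # error term, mean 0

# Each Y is the deterministic part (b0 + b1*X) plus a random disturbance.
Y = [b0 + b1 * x + e for x, e in zip(X, eps)]

for x, y in zip(X, Y):
    print(f"X = {x:4.1f}  Y = {y:6.2f}")
```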

regression line

A straight line that describes how a response variable y changes as an explanatory variable x changes

predetermined variable

A variable whose value was fixed in a previous period of time.

13.3 The Correlation Coefficient

Originated by Karl Pearson around 1900, the correlation coefficient describes the strength of the relationship between two sets of interval-scaled or ratio-scaled variables.

Designated r, it is often referred to as Pearson's r and as the Pearson product-moment correlation coefficient. It can assume any value from -1.00 to +1.00 inclusive. A correlation coefficient of -1.00 or +1.00 indicates perfect correlation. For example, a correlation coefficient for the preceding example computed to be +1.00 would indicate that the number of sales calls and the number of copiers sold are perfectly related in a positive linear sense. A computed value of -1.00 reveals that sales calls and the number of copiers sold are perfectly related in an inverse linear sense. How the scatter diagram would appear if the relationship between the two sets of data were linear and perfect is shown in Chart 13-2.

In this chapter, we shift the emphasis to the study of relationships between two interval- or ratio-level variables.

In all business fields, identifying and studying relationships between variables can provide information on ways to increase profits, methods to decrease costs, or variables to predict demand.

Learning Objectives
When you have completed this chapter, you will be able to:

LO1 Define the terms independent variable and dependent variable.
LO2 Calculate, test, and interpret the relationship between two variables using the correlation coefficient.
LO3 Apply regression analysis to estimate the linear relationship between two variables.
LO4 Interpret the regression analysis.
LO5 Evaluate the significance of the slope of the regression equation.
LO6 Evaluate a regression equation to predict the dependent variable.
LO7 Calculate and interpret the coefficient of determination.
LO8 Calculate and interpret confidence and prediction intervals.

Covariance

A measure of how two sample sets of data vary simultaneously.

Professor's Version of r formula (cont.) The Greek letter ρxy (rho) denotes the population correlation coefficient; rxy denotes the corresponding sample statistic.

Revised: rxy = sxy / (sx · sy), i.e., the sample covariance divided by the product of the sample standard deviations of the two variables.
Original: rxy = ∑(X - X̅)(Y - Ȳ) / [(n - 1)SxSy]
Since sxy involves the product of the units of X and Y, and so does the product in the denominator, the units cancel out. As a result we have a pure number, and it can be proved that it ranges only between -1 and +1.
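As a quick check of this form, here is a short Python sketch (assuming Python 3.10+, where statistics.covariance and statistics.correlation are available) applied to the five (X, Y) pairs used in the worked example that follows.

```python
import statistics

# Verify r = s_xy / (s_x * s_y) on the example pairs below.
# Requires Python 3.10+ for statistics.covariance / statistics.correlation.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]

s_xy = statistics.covariance(X, Y)  # sample covariance (n - 1 denominator)
s_x = statistics.stdev(X)           # sample standard deviation of X
s_y = statistics.stdev(Y)           # sample standard deviation of Y

r = s_xy / (s_x * s_y)
print(r)                             # ~0.896
print(statistics.correlation(X, Y))  # same value, computed directly
```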

correlation coefficient formula (Book Version), numerator explained: ∑(X - X̅)(Y - Ȳ)
You are provided a series of (X, Y) values, for example: (20, 30) (40, 60) (20, 40) (30, 60) (10, 30)

Step 1. Find the average of the X values and the Y values separately.
X̅ = (20 + 40 + 20 + 30 + 10) / 5 = 120 / 5 = 24
Ȳ = (30 + 60 + 40 + 60 + 30) / 5 = 220 / 5 = 44
Step 2. Subtract X̅ from each X value and Ȳ from each Y value.
(X - X̅): 20 - 24 = -4, 40 - 24 = 16, 20 - 24 = -4, 30 - 24 = 6, 10 - 24 = -14
(Y - Ȳ): 30 - 44 = -14, 60 - 44 = 16, 40 - 44 = -4, 60 - 44 = 16, 30 - 44 = -14
Step 3. Multiply the results pairwise, (X - X̅)(Y - Ȳ):
-4 × -14 = 56, 16 × 16 = 256, -4 × -4 = 16, 6 × 16 = 96, -14 × -14 = 196
Step 4. Add up the results: ∑(X - X̅)(Y - Ȳ) = 56 + 256 + 16 + 96 + 196 = 620 <- This is called the covariation.
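The same four steps, written out as a small Python sketch for the example pairs above:

```python
# Steps 1-4 of the numerator calculation, in plain Python.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]

x_bar = sum(X) / len(X)  # Step 1: 120 / 5 = 24
y_bar = sum(Y) / len(Y)  #         220 / 5 = 44

# Steps 2-4: deviations, their products, and the sum of the products.
covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
print(covariation)       # 620.0
```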

correlation coefficient formula (Book Version), denominator explained: (n - 1)SxSy
n = the number of paired observations; subtracting 1 creates an unbiased estimator. In our case we have 5 pairs: (20, 30) (40, 60) (20, 40) (30, 60) (10, 30), so n - 1 = 5 - 1 = 4.

Sx * Sy: Sx and Sy are the sample standard deviations of the X and Y values, calculated with the sample standard deviation formula s = √[∑(X - X̅)² / (n - 1)] (do the same for Y). Or just use Excel, multiply out the denominator, and finally divide the numerator by the denominator.
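Putting numerator and denominator together, a minimal Python sketch for the same example data (using math.sqrt in place of Excel):

```python
import math

# Book formula: r = covariation / ((n - 1) * Sx * Sy), on the example data.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # 620
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in X) / (n - 1))         # sqrt(130)
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in Y) / (n - 1))         # sqrt(230)

r = covariation / ((n - 1) * s_x * s_y)
print(round(r, 3))  # 0.896
```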

The general form of the linear regression equation is exactly the same as the equation of any line: a is the Y-intercept and b is the slope.

The formulas for a and b (start with b):
b is the slope of the regression line: b = r · (sy / sx)
• r = the correlation coefficient
• sy = the standard deviation of Y (the dependent variable)
• sx = the standard deviation of X (the independent variable)
Y-intercept: a = Ȳ - bX̅
where Ȳ = the mean of Y (the dependent variable) and X̅ = the mean of X (the independent variable)
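A small Python sketch computing b and a for the five-pair example used earlier in these notes. For those data the results are b ≈ 1.192 and a ≈ 15.385.

```python
import math

# Compute b = r * sy/sx and a = y_bar - b * x_bar for the example data.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

s_x = math.sqrt(sum((x - x_bar) ** 2 for x in X) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in Y) / (n - 1))
covariation = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
r = covariation / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope: about 1.192
a = y_bar - b * x_bar  # intercept: about 15.385
print(round(a, 3), round(b, 3))
```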

In the professor's example he builds xi² and yi², the squared deviations of X and Y from their means. He uses these to determine sx and sy (the standard deviations of x and y).

The sums of xi² and yi² are each divided by n - 1 to get the variances; the square roots give the standard deviations. In his example the sum of the squared values of X - X̅ is 60, which is divided by (9 - 1) = 8 and then square-rooted to give √7.5 ≈ 2.74.
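A one-line check of that arithmetic in Python:

```python
import math

# Sum of squared deviations = 60 with n = 9 gives s_x = sqrt(60 / 8) ≈ 2.74.
sum_sq_dev = 60
n = 9
s_x = math.sqrt(sum_sq_dev / (n - 1))
print(round(s_x, 2))  # 2.74
```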

When we study the relationship between two interval- or ratio-scale variables, we often start with a scatter diagram.

This procedure provides a visual representation of the relationship between the variables. The next step is usually to calculate the correlation coefficient. It provides a quantitative measure of the strength of the relationship between two variables.

The Bivariate Simple Linear Regression Model The linear regression model assumes that a variable denoted by Y, called the dependent (or response) variable, can be predicted using the known values of another variable denoted by X, called the explanatory (or independent) variable. There may or may not be an actual cause-and-effect relationship.

What we are claiming is that we can predict the values of Y given the values of X. It may just be like the sunglasses and ice-cream case, but for regression it is sufficient if we can predict the sales of ice cream given the sales of sunglasses. For an ice-cream manufacturer that would be important whether there is a cause-and-effect relation or not. The predictor is also sometimes called the predetermined variable, and the assumption is that we can determine or know its values, which can then be used to predict the dependent or response variable.

General form of a linear regression equation

Ŷ = a + bX
where:
○ Ŷ, read "Y hat", is the estimated value of the Y variable for a selected X value.
○ a is the Y-intercept. It is the estimated value of Y when X = 0. Another way to put it: a is the estimated value of Y where the regression line crosses the Y-axis, at X = 0.
○ b is the slope of the line, or the average change in Ŷ for each change of one unit (either increase or decrease) in the independent variable X.
○ X is any value of the independent variable that is selected.
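As a minimal illustration, the sketch below wraps the fitted equation in a prediction helper. The values a ≈ 15.385 and b ≈ 1.192 are the ones computed for the five-pair example earlier in these notes; they are hard-coded here only so the snippet is self-contained.

```python
# Prediction with the fitted line y_hat = a + b*X.
# a and b below come from the earlier worked example (hard-coded here).
a, b = 15.385, 1.192

def predict(x: float) -> float:
    """Return the estimated Y (Y hat) for a selected X value."""
    return a + b * x

print(predict(0))   # 15.385 -> the intercept: estimated Y when X = 0
print(predict(25))  # about 45.2
```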

correlation analysis

a group of techniques to measure the relationship between two variables

correlation coefficient

a measure of the strength of the linear relationship between two variables

independent variable

a variable (often denoted by x) whose variation does not depend on that of another. The experimental factor that is manipulated; the variable whose effect is being studied. The independent variable provides the basis for estimation. It is the predictor variable. Notice that we choose this value: the independent variable is not random.

dependent variable

a variable (often denoted by y) whose value depends on that of another. The outcome factor; the variable that may change in response to manipulations of the independent variable. The dependent variable is random. That is, for a given value of the independent variable, there are many possible outcomes for the dependent variable.

Some of the important assumptions about the error term are listed below (also called the Classical Linear Model assumptions):
i. The explanatory variables Xi are assumed to be uncorrelated with the error terms εi; that is, the deterministic (or systematic) and stochastic parts of the regression equation are uncorrelated.
ii. The expected value of the error term is zero: E(ε) = 0. This assumption is based on the belief that some of the omitted variables will have a positive influence and some a negative one; on balance they cancel out. As a result E(Yi) = β0 + β1E(Xi), because the third term drops out under the expectation operation. Thus the regression line passes through the mean or expected values of Y and X for each observation, which is why the book also refers to this line as the "line of means".

iii. The variance of the error term is assumed to be constant across all observations. In symbols, E(εi²) = σ², a fixed value for all i. Statisticians call this property "homoscedasticity". If this assumption is violated we have the problem of "heteroscedasticity", which has serious implications for the estimates and requires correction.
iv. The error term of one observation is assumed to be uncorrelated with the error term of another observation. In symbols, E(εi εj) = 0 for all i ≠ j. This is called the property of "no autocorrelation" or "no serial correlation". If this assumption is violated we have a serious problem, called serial or autocorrelation, which needs correction.
v. For hypothesis testing, we also assume that the error terms are normally distributed. These assumptions combined can be expressed as εi ~ iid N(0, σ²) for all i, which means the error terms are identically and independently normally distributed with mean 0 and variance σ².
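These assumptions are easy to visualize by simulation. The sketch below draws iid normal errors and confirms the sample mean is near 0 and the sample variance near σ²; the value σ = 2.0 is an illustrative assumption.

```python
import random
import statistics

# Simulate iid N(0, sigma^2) error terms and check assumptions ii and iii.
random.seed(1)
sigma = 2.0
errors = [random.gauss(0.0, sigma) for _ in range(10_000)]

print(round(statistics.mean(errors), 3))      # close to 0   (assumption ii)
print(round(statistics.variance(errors), 3))  # close to 4.0 (assumption iii)
```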

In regression analysis, our objective is to use the data to

position a line that best represents the relationship between the two variables. Our first approach is to use a scatter diagram to visually position the line. However, the line drawn using a straight edge has one disadvantage: Its position is based in part on the judgment of the person drawing the line. The hand-drawn lines in Chart 13-8 represent the judgments of four people.

correlation coefficient formula (Book Version)

r = ∑(X - X̅)(Y - Ȳ) / [(n - 1)SxSy]
where:
X = an X value
X̅ = average of the X values
Y = a Y value
Ȳ = average of the Y values
n = number of pairs of observations
Sx = sample standard deviation of the X values
Sy = sample standard deviation of the Y values

Bi-variate Analysis

relationships between two variables at a time. A starting point of most research is the scatter plot. It gives a visual impression of the nature of the relationship between two variables: we can see whether the relation is positive (upward sloping) or negative (downward sloping), and whether it is linear or non-linear.

Professor's Version of r formula: starts with the covariance

sxy = ∑(xi · yi) / (n - 1), where lowercase xi = Xi - X̅ and yi = Yi - Ȳ are the deviations from the means. This is the same as the book version so far, except that we are not yet dividing by the sample standard deviations of x and y in the denominator. The explanation is this: you can see how compact the formula becomes, and mathematicians and statisticians prefer compact and concise forms wherever possible; that is the main reason for using symbols.
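A short Python sketch of the compact form, with the deviations built explicitly (the reading of lowercase xi, yi as deviations is inferred from the professor's examples above):

```python
# Professor's compact covariance: s_xy = sum(xi * yi) / (n - 1),
# where xi = Xi - x_bar and yi = Yi - y_bar. Same example data as before.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

xi = [x - x_bar for x in X]  # deviations of X
yi = [y - y_bar for y in Y]  # deviations of Y

s_xy = sum(px * py for px, py in zip(xi, yi)) / (n - 1)
print(s_xy)  # 155.0 (= 620 / 4)
```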

predictor variable

the variable in a correlational study that is used to predict the score on another variable; it plays the role of the independent variable

best-fitting line

the line that lies as close as possible to all the data points in a scatter plot

LEAST SQUARES PRINCIPLE

A mathematical procedure that uses the data to position a line with the objective of minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted values of Y.
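A rough demonstration of the principle on the earlier example data: the least squares line from the formulas above gives a smaller sum of squared errors (SSE) than hand-drawn alternatives. The two alternative (a, b) pairs are made up for illustration.

```python
# The least squares principle: SSE is smallest at the fitted (a, b);
# any other line gives a larger SSE. Example data as before.
X = [20, 40, 20, 30, 10]
Y = [30, 60, 40, 60, 30]

def sse(a: float, b: float) -> float:
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

print(round(sse(15.385, 1.192), 2))  # least squares line: smallest SSE
print(round(sse(10.0, 1.5), 2))      # a plausible hand-drawn line: larger
print(round(sse(20.0, 1.0), 2))      # another guess: larger still
```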

For the scatter diagram representing a strong relationship, there is

very little scatter about the line. This indicates, in the example shown on the chart, that hours studied is a good predictor of exam score. Scatter diagrams for r = 0, a weak r (say, -.23), and a strong r (say, +.87) are shown in Chart 13-3. Note that, if the correlation is weak, there is considerable scatter about a line drawn through the center of the data.

If there is absolutely no relationship between the two sets of variables, Pearson's r is

zero. A correlation coefficient r close to 0 (say, .08) shows that the linear relationship is quite weak. The same conclusion is drawn if r = -.08. Coefficients of -.91 and +.91 have equal strength; both indicate very strong correlation between the two variables. Thus, the strength of the correlation does not depend on the direction (either - or +). The following drawing summarizes the strength and direction of the correlation coefficient.

