Scmt 303 Chapter 13
Rule of Thumb
-These act as conditions for treating a regression model as useful
-R-square should be at least 30%
-P-value for the slope should be less than α
-α = significance level. (In general, use a confidence level of 95%, which makes α = 5%, unless otherwise specified)
Linear Regression Model
-Finding the line that best fits the data
-Yᵢ = bₒ + b1Xᵢ + Ԑᵢ, where Yᵢ is the "actual Y value"
-Only useful if it helps in predicting Y from X
Pitfalls in Regression
-Outliers: points at one extreme or the other, away from the points around the trend-line. These points must be removed before running regression, or they will shift the regression line and could drastically change r.
-Correlation does not imply causation. There needs to be a cause-effect relationship between the X and Y variables.
-Simple linear regression should be used only when the relationship is linear; otherwise it will give misleading results.
-Exercise extreme caution when using the regression line to extrapolate beyond the range of the observed data.
Regression Equation
-Prediction involves identifying the trend-line
-Finding the slope and the y-intercept values in the regression equation: Ŷ = bₒ + b1X, where Ŷ is the "predicted Y value"
-The trend-line is called the prediction equation in regression
Ԑᵢ
-Random error/variation in observation i
b1
-Slope for the data
-b1 = (ΣXY − n·x̄·ȳ) / (ΣX² − n·x̄²)
-Must find b1 before finding bₒ
-The hypothesis test on the slope is two-sided: b1≠0
The Least Squares Method
-The best prediction equation is the one that minimizes the sum of squared prediction errors, Σ(Y−Ŷ)².
-The least squares method finds the slope (b1) and y-intercept (bₒ) coefficients that achieve this minimum.
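As a minimal sketch of what "minimizes the sum of squared errors" means, the Python below scores a few candidate lines on the four-point experience/sales data from the worked example in these notes. The function name `sse` and the alternative candidate lines are illustrative assumptions; the line with bₒ = 2250 and b1 = 50 is what the slope and intercept formulas give for this data.

```python
# Data from the worked example in these notes: years of experience (X), sales (Y).
years = [1, 2, 3, 4]
sales = [2300, 2400, 2300, 2500]

def sse(b0, b1, xs, ys):
    """Sum of squared prediction errors for the line Y-hat = b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Candidate (b0, b1) pairs. The first is the least squares line for this data;
# the others are arbitrary alternatives chosen only for comparison.
candidates = [(2250, 50), (2300, 50), (2250, 60), (2375, 0)]
for b0, b1 in candidates:
    print(b0, b1, sse(b0, b1, years, sales))
```

The least squares line gives a sum of squared errors of 15000, which is smaller than the value for each of the other candidates tried here.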
Dependent or Response Variable
-Variable we are trying to predict -Predicted variable -Y-variable
bₒ
-Y-intercept for the data
-bₒ = ȳ − b1·x̄
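The slope and intercept formulas can be sketched in Python as follows. This is only an illustration: `fit_line` is a hypothetical helper name, and the data are the four experience/sales points from the worked example in these notes. The slope is computed first because the intercept formula needs it.

```python
def fit_line(xs, ys):
    """Return (b0, b1) using the sum-based formulas from these notes."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope first: b1 = (sum XY - n*x_bar*y_bar) / (sum X^2 - n*x_bar^2)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Then intercept: b0 = y_bar - b1*x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

years = [1, 2, 3, 4]                 # X: years of experience
sales = [2300, 2400, 2300, 2500]     # Y: sales
b0, b1 = fit_line(years, sales)
print(b0, b1)
```

For this data the slope works out to 50 and the intercept to 2250, so the prediction equation is Ŷ = 2250 + 50X.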
Example of table to create for easy calculations
-Columns: Years of experience (X), Sales (Y), X², Y², XY
-At the bottom of the table, add a row for the column totals.
-This makes plugging values into the correlation coefficient formula much easier.
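A short Python sketch of building this calculation table is shown below, using the experience/sales data from the worked example in these notes. The variable names and column widths are arbitrary choices for illustration.

```python
years = [1, 2, 3, 4]                 # X: years of experience
sales = [2300, 2400, 2300, 2500]     # Y: sales

# One row per observation: X, Y, X^2, Y^2, XY
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(years, sales)]
# Column totals go in the bottom row of the table
totals = tuple(sum(col) for col in zip(*rows))

header = ("X", "Y", "X^2", "Y^2", "XY")
print("".join(f"{h:>10}" for h in header))
for row in rows:
    print("".join(f"{v:>10}" for v in row))
print("".join(f"{t:>10}" for t in totals), " <- column totals")
```

The totals row (10, 9500, 30, 22590000, 24000) supplies every sum that the slope and correlation formulas need.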
Coefficient of Correlation
-aka Correlation Coefficient -r is used to represent Sample Coefficient of Correlation -r represents the relative strength of a linear relationship between 2 variables
Identifying the prediction equation
-Actual value: Yᵢ = bₒ + b1Xᵢ + Ԑᵢ
-Predicted value: Ŷᵢ = bₒ + b1Xᵢ
-Combining the two: Yᵢ = Ŷᵢ + Ԑᵢ
-Actual Y value = predicted Y value + error
"Perfect" Correlation
-r=+1: perfect direct relationship, slope>0
-r=-1: perfect inverse relationship, slope<0
-r=0: no linear relationship, slope=0
-Always make sure there is an actual relationship. High correlation does NOT necessarily imply causality.
Interpretation of r
-r>0: direct relationship -r<0: inverse relationship
Independent or Predictor variable
-variable used to make the prediction -variable that causes an effect in the response variable -aka explanatory variable -X-variable
Steps to Predict
1) Relationship: establish that the two variables are related
2) Validation: check that the pattern can be expected to continue
3) Prediction: use the pattern to predict the future
Multiple Choice: All but one of the following situations are impossible. Which is the only possible one?
A) Estimated slope=1.27 while the correlation coefficient=-.27 for the same data.
B) A correlation coefficient=-1.27
C) Correlation coefficient=1.27
D) R-square=1.27
E) The predicted value of Y decreases as X increases.
Correct Choice: E) The predicted value of Y decreases as X increases.
Explanation: A is impossible because the slope and r must have the same sign; B and C are impossible because r must lie between -1 and +1; D is impossible because R-square must lie between 0 and 1. E simply describes a downward-sloping trend-line. For example, if Y is the number of errors made during a test and X is the amount of time spent preparing for it, the number of errors (Y) decreases as preparation time (X) increases.
Multiple Choice: In a simple Linear regression problem, r and b1....
A) May have opposite signs
B) Must have opposite signs
C) Must have the same sign
D) Are equal
Correct Choice: C) Must have the same sign
Explanation: r is the correlation coefficient and b1 is the slope. A downward-sloping trend-line means a negative (inverse) correlation coefficient, and an upward-sloping trend-line means a positive (direct) one, so the two always share a sign.
Multiple Choice: The bₒ value in the prediction equation represents..
A) Predicted value of Y when X=0
B) The estimated average change in Y per unit change in X
C) The predicted value of Y
D) Variation around the line of regression
Correct Choice: A) Predicted value of Y when X=0
Explanation: bₒ is the y-intercept, which is the point where the line crosses the y-axis. Any point on the y-axis has an X value of 0.
Multiple Choice: The strength of the linear relationship between two numerical variables may be measured by the...
A) Slope
B) Coefficient of Correlation
C) Y-intercept
D) Scatter diagram
Correct Choice: B) Coefficient of Correlation
Explanation: The closer the data points come to all lying on the same trend-line, the closer r gets to +1 or -1. The closer |r| is to 1 (100%), the stronger the linear relationship between the 2 variables.
Multiple Choice: In regression we must select the Y variable as the variable......
A) With the smaller variance.
B) Used to predict the other variable.
C) Which is the independent variable.
D) To be predicted.
E) It does not make a difference; either variable can be X or Y.
Correct Choice: D) To be predicted
SS(Error) or SS(Residuals)
Excel Regression Measures of Variation:
-Excel refers to errors as residuals
-Unexplained variation in the regression model
-Variation due to factors other than X
-SS(Error) = Σ(Y−Ŷ)²
SS(Regression)
Excel Regression Measures of Variation:
-Explained variation in the regression model
-Variation due to X
-SS(Regression) = Σ(Ŷ−Ȳ)²
SS(Total)
Excel Regression Measures of Variation:
-Total variation in the regression model
-SS(Total) = SS(Regression) + SS(Error)
-SS(Total) = Σ(Y−Ȳ)²
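The three measures of variation can be checked numerically with a short Python sketch. It assumes the experience/sales data from the worked example in these notes and takes bₒ = 2250 and b1 = 50 as the already-computed least squares coefficients for that data; the variable names are illustrative.

```python
years = [1, 2, 3, 4]                 # X: years of experience
sales = [2300, 2400, 2300, 2500]     # Y: sales
b0, b1 = 2250, 50                    # assumed: least squares fit for this data

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in years]

# Unexplained variation: actual vs. predicted
ss_error = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
# Explained variation: predicted vs. mean of Y
ss_regression = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
# Total variation: actual vs. mean of Y
ss_total = sum((y - y_bar) ** 2 for y in sales)

print(ss_regression, ss_error, ss_total)
```

For this data SS(Regression) is 12500, SS(Error) is 15000 and SS(Total) is 27500, so the explained and unexplained parts add up to the total.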
Multiple R
Excel Regression:
-Multiple R is the absolute value of r, the correlation coefficient.
-To find the sign of r (direct or inverse), look at the sign of the slope b1. The signs of b1 and r have to match.
-If you have r² and are looking for Multiple R, take the square root of r².
-Positive slope implies a direct relationship (+r)
-Negative slope implies an inverse relationship (-r)
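Recovering a signed r from R-square and the slope can be sketched as below. `signed_r` is a hypothetical helper name, and the slope values passed in are made up purely to show the sign handling; only the R-square value .971069 comes from these notes.

```python
import math

def signed_r(r_squared, slope):
    """Return r with the same sign as the slope; its magnitude is sqrt(R-square)."""
    if slope == 0:
        return 0.0  # a flat trend-line means no linear relationship
    return math.copysign(math.sqrt(r_squared), slope)

# Same R-square, opposite (hypothetical) slopes: same magnitude, opposite sign.
print(signed_r(0.971069, 2.0))
print(signed_r(0.971069, -2.0))
```

The magnitude in both cases is Multiple R; only the sign changes with the direction of the trend-line.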
R square
Excel Regression:
-R-square is the square of the correlation coefficient r
-AKA coefficient of determination
-The value of R-square indicates the strength of the linear relationship.
-R-square is the most important value in the regression output.
-Tells us the proportion of variation in Y explained by the independent variable X:
r² = SS(Regression)/SS(Total) = 1 − [SS(Error)/SS(Total)]
ex: if r=+.985429 then r²=.971069. This implies a strong linear relationship: 97.1069% of the variation in Y is explained by X, and the remaining 2.8931% is due to other factors or variables.
Example of incorporating slope equation
Let's say we're given a partial Excel regression output: Intercept=40.9262 and X=-.1343. Predict the value of Y when X=100.
In the output, the Intercept coefficient is the y-intercept (bₒ) and the X coefficient is the slope (b1), so the prediction equation Ŷ=bₒ+b1X has the same form as y=b+mx.
We can now fill in the equation to predict the value of Y when X=100:
Ŷ = 40.9262 + (-.1343)(100) = 40.9262 − 13.43 = 27.4962 ≈ 27.50
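The same prediction can be reproduced with a few lines of Python. `predict` is a hypothetical helper name; the coefficients are the ones quoted from the partial Excel output.

```python
def predict(b0, b1, x):
    """Predicted Y from the prediction equation Y-hat = b0 + b1*X."""
    return b0 + b1 * x

intercept = 40.9262   # "Intercept" row in the Excel output (b0)
slope = -0.1343       # "X" row in the Excel output (b1)

y_hat = predict(intercept, slope, 100)
print(f"{y_hat:.2f}")  # about 27.50
```

Because the slope is negative, each additional unit of X lowers the predicted Y by .1343.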
Linear Regression Analysis Solution Methodology
Our goal is to...
1) Find out whether the variables are related to each other: scatter plot or time series plot
2) Determine how strong the association is: correlation coefficient r
3) Ask whether we can explain the relationship: correlation does not imply causality
-Our objective is to predict Y from X using the prediction equation: identify bₒ and b1.
Regression Objective
Predict Y using X
Excel Regression Output: Measures of Variation
SS(Regression) = Regression row under SS column
SS(Error) = Residual row under SS column
SS(Total) = Total row under SS column
P-value for slope = "Years of Experience (X)" row under P-value column. (Ideal p-value=0, so that the slope is significant at any small α)
-Row=Horizontal
-Column=Vertical
T/F: To find the correlation coefficient between 2 variables, it does not matter which variable we term as Y because it does not make a difference to r; either variable can be X or Y
True
Example of finding coefficient of determination given a data set
          X      Y     X²         Y²       XY
          1   2300      1    5290000     2300
          2   2400      4    5760000     4800
          3   2300      9    5290000     6900
          4   2500     16    6250000    10000
Total:   10   9500     30   22590000    24000
First we find x̄ and ȳ (n=4):
x̄ = ΣX/n = 10/4 = 2.5
ȳ = ΣY/n = 9500/4 = 2375
Then we find r:
r = (ΣXY − n·x̄·ȳ) / [√(ΣX² − n·x̄²) · √(ΣY² − n·ȳ²)]
numerator = 24000 − (4 × 2.5 × 2375) = 250
denominator = √(30 − 4 × 2.5²) × √(22590000 − 4 × 2375²) = √5 × √27500 = 370.8099
r = 250/370.8099 ≈ .6742
Finally, we square r to get the coefficient of determination:
r² = (.6742)² ≈ .4545 = 45.45%
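The hand calculation can be double-checked with a small Python sketch that plugs the column totals from the table into the same formula. The variable names are illustrative; the totals are the ones shown in the table.

```python
import math

# Column totals from the table (n = 4 observations)
n = 4
sum_x, sum_y = 10, 9500
sum_x2, sum_y2, sum_xy = 30, 22590000, 24000

x_bar = sum_x / n
y_bar = sum_y / n

# Numerator and denominator are computed separately, as in the hand calculation
numerator = sum_xy - n * x_bar * y_bar
denominator = math.sqrt(sum_x2 - n * x_bar ** 2) * math.sqrt(sum_y2 - n * y_bar ** 2)

r = numerator / denominator
r_squared = r ** 2   # coefficient of determination: r squared, not its square root

print(round(r, 4), round(r_squared, 4))
```

The result, r of about .6742 and r² of about .4545, agrees with SS(Regression)/SS(Total) for the same data.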
Computing correlation coefficient
r = (ΣXY − n·x̄·ȳ) / [√(ΣX² − n·x̄²) · √(ΣY² − n·ȳ²)]
-Compute the numerator and denominator separately to prevent mistakes.
-For r, it doesn't matter which variable is X or Y, because each variable goes through the exact same process in the formula.
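The claim that swapping X and Y leaves r unchanged can be checked with the sketch below. `correlation` is a hypothetical helper that applies the formula above to raw data, here the experience/sales points from the worked example in these notes.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r from the sum-based formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = (math.sqrt(sum(x * x for x in xs) - n * x_bar ** 2)
                   * math.sqrt(sum(y * y for y in ys) - n * y_bar ** 2))
    return numerator / denominator

years = [1, 2, 3, 4]
sales = [2300, 2400, 2300, 2500]

r_xy = correlation(years, sales)   # years as X, sales as Y
r_yx = correlation(sales, years)   # roles swapped
print(math.isclose(r_xy, r_yx))
```

The slope and intercept do change when the roles are swapped; only r is symmetric in the two variables.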
Excel Regression Output
|r| = Multiple R
r² = R Square
n = Observations
bₒ = Intercept row under Coefficients column = y-intercept
b1 = "Years of experience (X)" row under Coefficients column = slope (the name of this row changes based on the data, but it is located directly underneath the Intercept row)
-Row=Horizontal
-Column=Vertical