Chapter 8 Trendlines and Regression Analysis
Confidence Intervals
The Lower 95% and Upper 95% values in the output provide information about the unknown values of the true regression coefficients, accounting for sampling error.
R Square
- coefficient of determination, R², which varies from 0 (no fit) to 1 (perfect fit)
Partial Regression Coefficients
- represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.
Standard Error
- variability between observed and predicted Y values.
Checking Assumptions - Homoscedasticity
- variation about the regression line is constant
- residual plot shows no serious difference in the spread of the data for different X values
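A minimal Python sketch of the residual plot described above (an assumption; the chapter itself uses Excel, and the data here are simulated):

```python
# Residual plot for a visual homoscedasticity check; x and residuals would
# come from a fitted model (here they are simulated for illustration).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
residuals = rng.normal(0, 1, size=x.size)   # assumed residuals from a fit

plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual")
plt.title("Spread should look roughly constant across X")
plt.show()
```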
Checking Assumptions - Normality of Errors
- view a histogram (bar chart) of standard residuals
- residual histogram appears slightly skewed but is not a serious departure
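A similar sketch (assumed data, Python rather than Excel) for the normality check:

```python
# Histogram of standardized residuals; serious skew or heavy tails would
# signal a departure from the normality-of-errors assumption.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
residuals = rng.normal(0, 2, size=100)            # assumed residuals from a fit
std_residuals = residuals / residuals.std(ddof=1)

plt.hist(std_residuals, bins=10, edgecolor="black")
plt.xlabel("Standard residual")
plt.ylabel("Frequency")
plt.show()
```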
Systematic Model Building Approach
1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted R². (Don't remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant.
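One way this loop might look in Python with statsmodels (a sketch under assumptions: df is a pandas DataFrame, and the column names in the usage comment are hypothetical):

```python
# Backward elimination: drop the least-significant predictor one at a time
# until every remaining p-value is at or below alpha.
import statsmodels.api as sm

def backward_eliminate(df, y_col, x_cols, alpha=0.05):
    x_cols = list(x_cols)
    while x_cols:
        X = sm.add_constant(df[x_cols])
        model = sm.OLS(df[y_col], X).fit()
        pvals = model.pvalues.drop("const")   # intercept is not a removal candidate
        worst = pvals.idxmax()                # step 2: largest p-value
        if pvals[worst] <= alpha:             # step 4: all significant, stop
            return model
        x_cols.remove(worst)                  # step 3: remove only one variable
        print(f"dropped {worst}; check adjusted R^2 on the next fit")
    raise ValueError("no significant predictors remain")

# Hypothetical usage:
# best = backward_eliminate(df, "Sales", ["Price", "Ads", "Age"], alpha=0.05)
```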
Standard Residual
= residual / standard deviation
Rule of thumb: standard residuals outside of ±2 or ±3 are potential outliers.
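A minimal sketch (made-up numbers) of flagging potential outliers this way:

```python
import numpy as np

actual = np.array([10.0, 12.5, 15.1, 30.0, 18.2])     # assumed observed Y
predicted = np.array([10.4, 12.1, 15.0, 18.9, 18.5])  # assumed predicted Y

residuals = actual - predicted
std_residuals = residuals / residuals.std(ddof=1)

# Rule of thumb: |standard residual| > 2 (or 3) marks a potential outlier
print(np.where(np.abs(std_residuals) > 2)[0])
```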
T-test
An alternate method for testing whether a slope or intercept is zero is to use a t-test: t = (coefficient) / (standard error of the coefficient), with n − 2 degrees of freedom in simple linear regression.
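A minimal SciPy sketch of that t-test, assuming the slope and its standard error were already computed (the numbers are made up):

```python
from scipy import stats

b, se_b, n = 2.3, 0.8, 25        # assumed slope, standard error, sample size
t_stat = b / se_b                # H0: slope = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed, df = n - 2
print(t_stat, p_value)
```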
Principle of Parsimony
Good models are as simple as possible.
Regression
Often used to identify (model) relationships between one or more independent variables and some dependent variable, and to predict future results.
Excel Regression Tool
The independent variables in the spreadsheet must be in contiguous columns. Key differences: Multiple R and R Square are called the multiple correlation coefficient and the coefficient of multiple determination, respectively, in the context of multiple regression. ANOVA tests for significance of the entire model; that is, it computes an F-statistic for testing the hypotheses:
H0: β1 = β2 = … = βk = 0
H1: at least one βj is not 0
Exponential
Y = ab^x
Residual
actual Y value − predicted Y value
Cross-Sectional Data
collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.
ANOVA
conducts an F-test to determine whether variation in Y is due to varying levels of X. Used to test for significance of regression:
H0: population slope coefficient = 0
H1: population slope coefficient ≠ 0
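A minimal sketch of that F-test computed by hand (assumed observed and predicted values; k = 1 independent variable):

```python
import numpy as np
from scipy import stats

y = np.array([3.1, 4.0, 5.2, 6.1, 7.3, 7.9])       # assumed observed Y
y_hat = np.array([3.0, 4.1, 5.1, 6.2, 7.2, 8.0])   # assumed predicted Y
k, n = 1, y.size

ssr = np.sum((y_hat - y.mean()) ** 2)              # regression sum of squares
sse = np.sum((y - y_hat) ** 2)                     # error sum of squares
f_stat = (ssr / k) / (sse / (n - k - 1))           # F = MSR / MSE
print(f_stat, stats.f.sf(f_stat, k, n - k - 1))    # F and its p-value
```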
Checking Assumptions - Linearity
- examine scatter diagram (should appear linear)
- examine residual plot (should appear random)
Simple Linear Regression
involves a single independent variable.
Multiple Regression
involves two or more independent variables.
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
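A minimal NumPy sketch (made-up data) of estimating β0, β1, β2 by least squares:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 5.0]])             # columns are X1, X2
y = np.array([7.1, 6.9, 13.2, 12.8, 17.0])

A = np.column_stack([np.ones(len(X)), X])          # prepend 1s for the intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # [B0, B1, B2]
print(coef)
```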
R²
is a measure of the "fit" of the line to the data. The value of R² will be between 0 and 1; a value of 1.0 indicates a perfect fit, with all data points lying on the line. The larger the value of R², the better the fit.
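A minimal sketch (made-up numbers) of computing R² from observed and predicted values:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])       # assumed observed Y
y_hat = np.array([3.2, 4.9, 7.1, 8.8, 11.0])   # assumed predicted Y

ss_res = np.sum((y - y_hat) ** 2)              # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
print(1 - ss_res / ss_tot)                     # R^2
```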
Regression Analysis
is a tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical), all of which are numerical.
Interactions
occurs when the effect of one variable is dependent on another variable. We can test for interaction by defining a new variable as the product of the two variables and testing whether it is significant in the regression.
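A minimal pandas sketch of building the interaction term; the DataFrame and column names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"X1": [1.0, 2.0, 3.0, 4.0],
                   "X2": [0.0, 1.0, 0.0, 1.0]})
df["X1xX2"] = df["X1"] * df["X2"]   # include this column in the regression
                                    # and check its p-value for significance
```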
Multicollinearity
occurs when there are strong correlations among the independent variables, so that they can predict each other better than they predict the dependent variable. When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable; the signs of coefficients may be the opposite of what they should be, making regression coefficients difficult to interpret; and p-values can be inflated. Correlations exceeding ±0.7 may indicate multicollinearity.
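A minimal sketch of screening predictors with a correlation matrix; the column names and values are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [1.1, 2.1, 2.9, 4.2, 5.1],   # nearly a copy of X1 (for illustration)
    "X3": [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = df.corr()
print(corr)   # pairs with |correlation| > 0.7 warrant a closer look
```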
Adjusted R Square
reflects both the number of independent variables and the sample size and may either increase or decrease when an independent variable is added or dropped; an increase in adjusted R² indicates that the model has improved. Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k is the number of independent variables.
Checking Assumptions - Independence of Errors
successive observations should not be related. This is important when the independent variable is time. Because the data here are cross-sectional, we can assume this assumption holds.
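For time-series data, a common follow-up check is the Durbin-Watson statistic; a minimal statsmodels sketch with assumed residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

residuals = np.array([0.5, -0.2, 0.1, -0.4, 0.3, 0.0, -0.1])  # assumed residuals
print(durbin_watson(residuals))   # values near 2 suggest independent errors
```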
Least Squares Regression
the best-fitting line minimizes the sum of squares of the residuals.
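A minimal NumPy sketch with made-up data; polyfit finds the line that minimizes the sum of squared residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])

b, a = np.polyfit(x, y, 1)          # slope b and intercept a of y = a + b*x
residuals = y - (a + b * x)
print(a, b, np.sum(residuals**2))   # the minimized sum of squared residuals
```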
Residuals
the observed errors associated with estimating the value of the dependent variable using the regression line: eᵢ = Yᵢ − Ŷᵢ (observed minus predicted).
Multiple R
Multiple R = |r|, where r is the sample correlation coefficient. The value of r varies from −1 to +1 (r is negative if the slope is negative).
Polynomial (2nd order)
y = ax^2 + bx + c
Polynomial (3rd order)
y = ax^3 + bx^2 + cx + d
Power
y = ax^b
Linear
y = a + bx
Logarithmic
y = a + b·ln(x)
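A minimal sketch (made-up data) of fitting several of the trendline forms above in NumPy; the exponential and power fits use log transforms, so they assume y > 0 (and x > 0 where logs of x are taken):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # assumed sample data
y = np.array([2.1, 4.3, 8.2, 16.5, 31.9])

b_lin, a_lin = np.polyfit(x, y, 1)                  # linear: y = a + b*x
p2 = np.polyfit(x, y, 2)                            # 2nd-order polynomial coefficients

slope, intercept = np.polyfit(x, np.log(y), 1)      # exponential: ln y = ln a + x*ln b
a_exp, b_exp = np.exp(intercept), np.exp(slope)     # so y = a*b^x

b_pow, ln_a = np.polyfit(np.log(x), np.log(y), 1)   # power: ln y = ln a + b*ln x
a_pow = np.exp(ln_a)                                # so y = a*x^b

b_log, a_log = np.polyfit(np.log(x), y, 1)          # logarithmic: y = a + b*ln(x)
print(a_lin, b_lin, a_exp, b_exp, a_pow, b_pow, a_log, b_log)
```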