Ch. 14 - Applied Business Probability and Statistics
Which entry in a multiple regression output table is used to draw the conclusion for the Global Test?
"Significance F" in the ANOVA table
What is multicollinearity?
Two or more independent variables are highly correlated with each other. This distorts the standard errors of the regression coefficients.
What are characteristics of multicollinearity?
1. An independent variable known to be an important predictor ends up having a regression coefficient that is not significant. 2. A regression coefficient that should have a positive sign turns out to be negative, or vice versa. 3. When an independent variable is added or removed, there is a drastic change in the values of the remaining coefficients.
What are characteristics of the adjusted coefficient of multiple determination?
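1. It compensates for the fact that the coefficient of determination (R^2) increases as more independent variables are added. 2. It guards against the case where the number of variables (k) and the sample size (n) are equal, in which the coefficient of determination is 1.0.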
What situations can occur when studying interaction among independent variables?
1. It is possible to have a three-way interaction among the independent variables. 2. It is possible to have an interaction where one of the variables is nominal scale.
What are the characteristics of coefficient of multiple determination?
1. It is symbolized by capital r squared (R^2) 2. It can range from 0 to 1. 3. It cannot assume negative values. 4. It is easy to interpret.
What are characteristics of an F-Distribution?
1. There is a family of F-distributions. 2. The F-distribution cannot be negative. 3. It is a continuous distribution. 4. It is positively skewed. 5. It is asymptotic.
What characteristic of a residual plot is applicable for evaluating multiple regressions?
1. They are plotted on a vertical axis and centered around 0. They are both positive and negative. 2. They show a random distribution of positive/negative values across the entire range of variables plotted on the horizontal axis. 3. The points are scattered and there is no obvious pattern, so there is no reason to doubt the linearity assumption.
What are characteristics of the distribution of residuals?
1. They follow a normal probability distribution 2. We organize them into a frequency distribution. 3. We use a normal probability plot graph to analyze. 4. If the normal probability plot is fairly straight, it indicates normally distributed residuals.
Suppose the following variables are being considered to build a model that predicts the sales price of a home: size of home in square feet, age in years, number of bedrooms, number of bathrooms, and size of yard in square feet. How many possible regression models are considered using the best-subset method?
31 Reason: 2^5 - 1 = 31
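The count can be checked by enumerating every non-empty subset of the five candidate predictors. This is a minimal sketch; the predictor labels are illustrative names chosen here, not taken from any dataset.

```python
from itertools import combinations

# Hypothetical labels for the five candidate predictors in the question.
predictors = ["size_sqft", "age_years", "bedrooms", "bathrooms", "yard_sqft"]

# Every non-empty subset of the predictors is one candidate model.
models = [
    subset
    for r in range(1, len(predictors) + 1)
    for subset in combinations(predictors, r)
]

print(len(models))  # 2**5 - 1 = 31
```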
What level of correlation between two independent variables in a regression model generally does not cause multicollinearity problems?
A correlation coefficient between -0.7 and +0.7.
What level of correlation between two independent variables in a regression model generally will cause multicollinearity problems?
A correlation coefficient greater than +0.7. A correlation coefficient less than -0.7.
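The ±0.7 rule of thumb can be applied to a pair of independent variables as sketched below. The data and the helper names `pearson_r` and `may_cause_multicollinearity` are made up for illustration; the threshold of 0.7 is the one stated in the cards above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def may_cause_multicollinearity(r, threshold=0.7):
    """Flag correlations greater than +threshold or less than -threshold."""
    return abs(r) > threshold

# Made-up values for three independent variables.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]   # fairly strongly related to x1
x3 = [2, 5, 1, 4, 3]   # only weakly related to x1

r12 = pearson_r(x1, x2)
r13 = pearson_r(x1, x3)
print(may_cause_multicollinearity(r12), may_cause_multicollinearity(r13))
```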
What is an "interaction term" in the context of regression analysis?
A new variable created by multiplying two independent variables.
Many situations can occur when studying interactions among variables. Which of these situations are valid examples of an interaction?
A nominal scale variable interacting with a ratio-scale variable. An interaction among three variables.
Multicollinearity can have many adverse effects on a multiple regression equation. Which of the following would it be?
A variable known to be an important predictor has a non-significant coefficient. The value or sign of one or more coefficients violates common sense.
What is a dummy variable?
A variable that takes on only two values, 0 and 1.
If a regression equation predicts a Y-value of 15 with a standard error of 5, what does this mean?
About 68% of the sample Y-values are between 10 and 20.
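The 10-to-20 range is the predicted value plus or minus one standard error, which covers about 68% of values when the residuals are approximately normal. A small sketch of the arithmetic follows; the function name is hypothetical.

```python
def one_standard_error_band(y_hat, std_error):
    """Return (low, high) for the prediction plus or minus one standard error."""
    return (y_hat - std_error, y_hat + std_error)

low, high = one_standard_error_band(15, 5)
print(low, high)  # 10 20
```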
Which method of building a regression model starts with all independent variables and removes one variable at a time until all remaining variables are significant?
Backward elimination method
What is a best subset regression?
Best subsets regression is an exploratory model building regression analysis. It compares all possible models that can be created based upon an identified set of predictors.
Suppose k independent variables are being considered for building a regression model. Which method builds the best 1-variable model, the best 2-variable model, ..., the best k-variable model?
Best-subset method
How are the coefficients found for the multiple regression equation?
By the least squares method using a statistical software package.
What is the impact of correlated independent variables?
Correlated independent variables make inferences about individual regression coefficients difficult.
What is the preferred procedure if more than one coefficient of the multiple regression equation is found to be not significant?
Drop the independent variable with the smallest absolute t value (equivalently, the largest p-value) and rerun the regression.
The global test of the regression model examines the ratio of two variances. Which of these is the correct description of the test statistic?
F = MSR/MSE
What is the global test formula?
F= (SSR/k)/(SSE/(n-(k+1)))
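The formula above can be computed directly from the ANOVA sums of squares. This is a sketch with made-up numbers; `global_f` is a hypothetical helper name.

```python
def global_f(ssr, sse, n, k):
    """Global test statistic: F = (SSR/k) / (SSE/(n-(k+1)))."""
    msr = ssr / k                # mean square regression
    mse = sse / (n - (k + 1))    # mean square error
    return msr / mse

# Made-up ANOVA values for illustration only.
f_stat = global_f(ssr=300.0, sse=100.0, n=24, k=3)
print(f_stat)  # 20.0
```

The result is compared with the F-distribution with k and n-(k+1) degrees of freedom, which is what the "Significance F" entry reports as a p-value.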
What are the hypotheses for the Global test of the multiple regression model with three independent variables?
H0: β1= β2= β3 = 0, H1: Not all β's are 0
On a residual plot the points are close to zero on the left side but widely scattered on the right side. This indicates a possible violation of which multiple regression assumption?
Homoscedasticity
On a residual plot there are many, widely scattered points in the middle, but only a few points close to the line at either end. This indicates a possible violation of which multiple regression assumption?
Homoscedasticity
What is the term used for the assumption that the variation around the regression line will appear to be the same for the whole range of the residual plot?
Homoscedasticity
Data which is collected over time (time series data) often violates which of these regression assumptions?
Independent Observations
Many situations can occur when studying interactions among variables. Which of these situations is not a valid example of an interaction?
Interaction occurring as the sum of two variables.
What is the purpose of the Adjusted Coefficient of Determination?
It adjusts R^2 to reflect the number of independent variables used.
As more independent variables are added to a regression model, the coefficient of determination tends to increase. How is this bad?
It can lead to adding variables with no predictive power.
Which statement(s) correctly describe the Coefficient of Multiple Determination (R^2)?
It can range from 0 to 1. It is the percent of explained variation.
Which statement(s) correctly describe the Coefficient of Multiple Determination (R^2)? Select all that apply.
It explains the percent of variation explained by the regression. It cannot assume negative values.
What does a smaller standard error of estimate mean?
It indicates a better or more effective predictive equation.
What is a global test?
It investigates whether it is possible that all the independent variables have zero regression coefficients.
What is the multiple standard error of estimate?
It is a measure of the average deviation of the errors, that is, the differences between the ŷ-values predicted by the multiple regression model and the y-values in the sample.
What is the main advantage of using stepwise regression?
It is an efficient way to find a regression equation with only significant coefficients.
How do you interpret the "Standard Error" in a multiple regression output table?
It is the typical "error" when the regression equation is used to predict Y.
In an ANOVA table, how is the "Residual" related to the regression equation?
It is the variation in the value of Y not explained by the regression.
What effect does increasing the number of independent variables in the regression have on the coefficient of determination?
It makes it larger.
Which of the following are reasons to avoid correlation between independent variables (multicollinearity)? Select all that apply.
It may lead to erroneous results in hypothesis tests of independent variables. It is difficult to make inferences about the individual regression coefficients.
What is a normal probability plot? What is it used for? Select all that apply.
It plots the percentiles vs. the residuals. It is used to check the normality assumption of regression.
When you run a "stepwise regression" or "best subset regression", the software may work "too hard" to find an equation that fits the quirks of your data set. What characteristics should you look for in the regression equation?
It should make sense, based on your knowledge of the connection among the variables. It should be simple and logical.
How does the backward elimination method build a regression model?
It starts with all variables in the model and removes insignificant ones one at a time.
When you use the global test for the multiple regression model, what are you testing?
It tests the null hypothesis that all population coefficients are zero.
Define step-wise regression.
The term for the process that builds a regression equation one independent variable at a time.
If the linearity assumption is violated, what might you see in a residual plot? Select all that apply.
Most of the residuals are positive. There are more negative values in one part of the range.
One of the requirements of regression analysis is called the multicollinearity assumption. How is multicollinearity defined?
Multicollinearity exists when independent variables are correlated.
What is the coefficient of multiple determination formula?
R^2 = SSR/ SS Total
What is the adjusted coefficient of multiple determination formula?
R^2 adjusted = 1 - (SSE/(n-(k+1)))/(SS Total/(n-1))
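Both coefficients can be computed from the ANOVA sums of squares. The sketch below uses made-up values, and the function names are hypothetical.

```python
def r_squared(ssr, ss_total):
    """Coefficient of multiple determination: SSR / SS Total."""
    return ssr / ss_total

def adjusted_r_squared(sse, ss_total, n, k):
    """Adjusted R^2: 1 - (SSE/(n-(k+1))) / (SS Total/(n-1))."""
    return 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))

# Made-up ANOVA values: SSR = 300, SSE = 100, so SS Total = 400.
r2 = r_squared(ssr=300.0, ss_total=400.0)
r2_adj = adjusted_r_squared(sse=100.0, ss_total=400.0, n=24, k=3)
print(r2, round(r2_adj, 4))  # 0.75 0.7125
```

With these numbers the adjusted value is slightly smaller than R^2, reflecting the penalty for the three independent variables.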
Drag and drop the descriptions against their corresponding terms from ANOVA in regression analysis.
Regression = explained variation of y; Residual = unexplained variation of y; DF = degrees of freedom; F = ratio of explained to unexplained variation of y
Multicollinearity can have many adverse effects on a multiple regression model. Which of these could be one of them?
Removing a non-significant variable results in drastic changes in the values of the remaining coefficients.
Nominal level variables can be used in regression analysis, in which case they are known as qualitative variables. Identify the qualitative variables from this list. Select all that apply.
Right or left handedness. Male or female. Whether or not a car has air conditioning.
What are the two kinds of plots that allow us to visually evaluate the "linearity assumption"? Select all that apply.
Scatter diagram. Residual plot.
What methods can you use to evaluate the assumptions of multiple regression?
Scatter plots and residual plots.
Which one of the following entries from the output tables of a multiple regression model could be used to reject the null hypothesis of equal coefficients in the global Test?
Significance F = 0.002
What is the difference between simple linear regression and multiple regression?
Simple linear regression has one independent variable and multiple regression has two or more.
What is the term used to denote the process that builds a regression equation one independent variable at a time, starting with the one the most highly correlated and keeping only terms with significant coefficients?
Stepwise regression
What does autocorrelation mean in the context of multiple regression analysis?
Successive residuals are correlated.
What is the fifth assumption about regression and correlation analysis?
Successive residuals should be independent. This means that the residuals are not highly correlated, and there are not long runs of positive or negative residuals.
What distribution is used with the global test of the regression model to reject the null hypothesis?
The F-distribution.
What happens in a regression analysis if the number of independent variables is equal to the sample size?
The coefficient of determination becomes 1.
The formula for the variance inflation factor is VIF = 1/(1 - R^2j). If VIF > 10 then multicollinearity is excessive. What is the meaning of R^2j?
The coefficient of determination of a regression with Xj as the dependent variable against the other independent variables
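Given R^2j from regressing Xj on the other independent variables, the VIF and the VIF > 10 rule can be sketched as follows. The R^2j values are made up, and the function names are hypothetical.

```python
def vif(r2_j):
    """Variance inflation factor: 1 / (1 - R^2_j)."""
    if not 0 <= r2_j < 1:
        raise ValueError("R^2_j must be in [0, 1)")
    return 1 / (1 - r2_j)

def is_excessive(vif_value, cutoff=10):
    """Apply the VIF > 10 rule of thumb."""
    return vif_value > cutoff

# Made-up R^2_j values for two independent variables.
for r2_j in (0.50, 0.95):
    value = vif(r2_j)
    print(r2_j, round(value, 2), is_excessive(value))
```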
One of the assumptions of regression analysis is that the distribution of the Y values about the regression line is approximately normal. Which of these tools can you use to check this?
The histogram of residuals
If the pattern of residuals seems to cluster around a line with mostly positive values on the left and mostly negative values on the right, what regression assumption is violated?
The independent observations assumption.
If the points on a scatter diagram seem to be best described by a curving line, which one of the regression assumptions might be violated?
The linearity assumption.
Your experience tells you that an independent variable is positively correlated to the dependent variable, but a multiple regression model gives it a negative coefficient. What could cause this?
The model may have correlated independent variables.
Which of the following is not an advantage of stepwise regression models?
The model selected always has the highest R^2.
What is the coefficient of multiple determination?
The percent of variation in the dependent variable, y, explained by the set of independent variables (x1, x2, x3, ..., xk).
What would you expect to see in a residual plot if the linearity assumption is correct? Select all that apply.
The points are scattered and there is no obvious pattern. The positive and negative values are evenly spread across the whole range.
The normal probability plot shows each residual plotted according to the percentile it represents in the set of residuals. What does it look like if the normality assumption is true?
The points closely approximate a line with positive slope.
Regression analysis makes several assumptions. Which of these best describes the "linearity assumption"?
The relationship between the dependent and individual independent variables is a straight line.
What do the sample distributions follow?
The t distribution with n - (k+1) degrees of freedom.
Describe the sampling distribution of the test statistic for testing regression coefficients.
The t-distribution with n - (k + 1) degrees of freedom.
When testing individual coefficients of a multiple regression model, what is the sampling distribution?
The t-distribution with n - (k + 1) degrees of freedom.
What is meant by "homoscedasticity" in regard to a multiple regression model?
The variation around the regression line is the same for all values of the independent variables.
What does the independent observation assumption mean for the residuals plot?
There is no pattern to the residuals.
Most software packages provide a histogram of residuals as part of regression analysis. How would you use this?
To visually evaluate the normality assumption
Which of the following is characteristic of multiple regression but not simple linear regression?
Two or more independent variables.
Suppose someone is building a model to predict the sales price of a house and they would like to include a variable to indicate whether or not the house has a pool. Which of the following could be used to model that variable.
Use 0 if there is no pool and 1 if there is one.
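A 0/1 coding like this can be built from a yes/no field. The records and the field name `has_pool` below are invented for illustration.

```python
# Hypothetical listing records; "has_pool" is an illustrative field name.
houses = [
    {"price": 310_000, "has_pool": True},
    {"price": 255_000, "has_pool": False},
    {"price": 402_000, "has_pool": True},
]

# Dummy variable: 1 if the house has a pool, 0 if it does not.
pool_dummy = [1 if house["has_pool"] else 0 for house in houses]
print(pool_dummy)  # [1, 0, 1]
```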
What is the variance inflation factor?
VIF = 1/(1 - R^2j), where R^2j is the coefficient of determination from regressing the jth independent variable on the other independent variables.
In the population model Y = α + β1X1+ β2X2+β3X1X2 what is the interaction term?
X1X2
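The interaction term is simply the product of the two independent variables, added as an extra column. This is a sketch with made-up data; the coefficient values passed to `predict` are arbitrary and not estimated from anything.

```python
# Made-up values: x1 is a ratio-scale variable, x2 is a 0/1 dummy.
x1 = [1.0, 2.0, 3.0]
x2 = [0, 1, 1]

# Interaction term: the product X1 * X2 for each observation.
interaction = [a * b for a, b in zip(x1, x2)]
print(interaction)  # [0.0, 2.0, 3.0]

def predict(alpha, b1, b2, b3, x1_value, x2_value):
    """Evaluate Y = alpha + b1*X1 + b2*X2 + b3*X1*X2."""
    return alpha + b1 * x1_value + b2 * x2_value + b3 * x1_value * x2_value

# Arbitrary coefficients, for illustration only.
print(predict(1.0, 2.0, 3.0, 0.5, 2.0, 1))  # 9.0
```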
What is homoscedasticity?
For every value of x, the variance of y should be equal. The variation around the regression line is the same for all values of the independent variables.
What is stepwise regression?
It is a combination of forward selection and backward elimination. We can start with either all factors or no factors, and at each step we add or remove a factor. After each new factor is added, we re-check the factors already in the model and immediately eliminate any that are no longer significant.
What is the formula for testing the individual regression coefficients?
t = (bi - 0)/sbi
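The test statistic divides the estimated coefficient by its standard error and is compared with the t-distribution with n - (k+1) degrees of freedom. This is a sketch with made-up numbers; the function names are hypothetical.

```python
def coefficient_t(b_i, s_b_i, hypothesized=0.0):
    """t statistic for an individual coefficient: (b_i - 0) / s_b_i."""
    return (b_i - hypothesized) / s_b_i

def degrees_of_freedom(n, k):
    """Degrees of freedom for the coefficient test: n - (k + 1)."""
    return n - (k + 1)

# Made-up coefficient and standard error from a regression output table.
t_stat = coefficient_t(b_i=2.5, s_b_i=0.5)
df = degrees_of_freedom(n=24, k=3)
print(t_stat, df)  # 5.0 20
```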
What is the general multiple regression equation formula?
ŷ = a + b1x1 + b2x2 + b3x3 + ... + bkxk
What is the multiple standard error of estimate formula?
√(Σ(y - ŷ)^2 / (n - (k+1))) = √(SSE / (n - (k+1))), where y = actual observation, ŷ = estimated value computed from the regression equation, n = number of observations in the sample, k = number of independent variables, SSE = residual sum of squares in the ANOVA table
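The multiple standard error of estimate can be computed from the actual and predicted values. This is a sketch with made-up data, assuming the predictions came from a model with k = 2 independent variables; the function name is hypothetical.

```python
import math

def standard_error_of_estimate(y, y_hat, k):
    """sqrt(SSE / (n - (k + 1))), with SSE the sum of squared residuals."""
    n = len(y)
    sse = sum((actual - predicted) ** 2 for actual, predicted in zip(y, y_hat))
    return math.sqrt(sse / (n - (k + 1)))

# Made-up actual values and predictions (assumed k = 2 predictors).
y = [10, 12, 15, 19, 24]
y_hat = [11, 12, 14, 20, 23]

s = standard_error_of_estimate(y, y_hat, k=2)
print(round(s, 4))  # 1.4142
```

Here SSE = 4 and n - (k+1) = 2, so the result is √2.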