HB Core: Business Analytics (Combined set)
True or False: Multicollinearity occurs when two or more independent variables are highly correlated
True
True or False: Multicollinearity can typically be reduced by adding more independent variables
False. Multicollinearity can typically be reduced by removing one or more of the collinear variables.
True or False: Multicollinearity can typically be reduced by decreasing the sample size
False. Multicollinearity can be reduced by increasing the sample size.
High p-value
The linear relationship is not significant.
Adjusted R-squared
A measure of the explanatory power of a regression analysis. Adjusted R-squared is equal to R-squared multiplied by an adjustment factor that decreases slightly as each independent variable is added to a regression model. Unlike R-squared, which can never decrease when a new independent variable is added to a regression model, Adjusted R-squared drops when an independent variable is added that does not improve the model's true explanatory power. Adjusted R-squared should always be used when comparing the explanatory power of regression models that have different numbers of independent variables.
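One standard textbook form of this adjustment (not stated on the original card), where n is the number of observations and k is the number of independent variables: $\text{Adjusted } R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$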
R squared
A measure of the explanatory power of a regression analysis. R-squared measures how much of the variation in a dependent variable can be "explained" by the variation in the independent variable(s). Specifically, R-squared compares the vertical dispersion of the dependent variable about the regression line to the dispersion of the dependent variable about its mean. To calculate R-squared, divide the regression sum of squares by the total sum of squares. R-squared ranges from 0 to 1; a value of 1 means the model explains all of the variation in the dependent variable.
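In sum-of-squares notation, matching the calculation described above: $R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}$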
Multicollinearity
A situation that occurs when two or more independent variables are so highly correlated that it is difficult for the regression model to separate the relationship between each variable and the dependent variable. Multicollinearity can obscure the results of a regression analysis. If adding a new independent variable decreases the significance of another independent variable that was previously significant, multicollinearity may well be the culprit. Another symptom of multicollinearity is a regression with a high R-squared in which none of the independent variables is significant.
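A minimal sketch of these symptoms on synthetic data, assuming numpy and statsmodels are available (all names and values here are hypothetical):
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.01, size=100)  # x2 is nearly a copy of x1: collinear
    y = 2 * x1 + rng.normal(size=100)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(fit.rsquared)  # high R-squared: together the predictors explain y well
    print(fit.pvalues)   # yet the individual slope p-values are likely insignificant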
A p-value to test the significance of a linear relationship between two variables was calculated to be 0.0210. What can we conclude? Select all that apply. There is a significant linear relationship between the 2 variables at A. 90% confidence B. 95% confidence C. 98% confidence D. 99% confidence
A, B. A. We can be 90% confident that there is a significant linear relationship between the two variables. Since the p-value, 0.0210, is less than 1-0.90=0.10, we can be 90% confident that there is a significant linear relationship between the two variables. Note another option is also correct. B. We can be 95% confident that there is a significant linear relationship between the two variables. Since the p-value, 0.0210, is less than 1-0.95=0.05, we can be 95% confident that there is a significant linear relationship between the two variables. Note another option is also correct. C. We can be 98% confident that there is a significant linear relationship between the two variables. Since the p-value, 0.0210, is greater than 1-0.98=0.02, we cannot be 98% confident that there is a significant linear relationship between the two variables. D. We can be 99% confident that there is a significant linear relationship between the two variables. Since the p-value, 0.0210, is greater than 1-0.99=0.01, we cannot be 99% confident that there is a significant linear relationship between the two variables.
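The comparisons in this explanation reduce to one decision rule, p < 1 - confidence; a minimal sketch in Python using the card's numbers:
    p = 0.0210
    for confidence in (0.90, 0.95, 0.98, 0.99):
        alpha = 1 - confidence  # significance level for this confidence level
        print(confidence, "significant" if p < alpha else "not significant")
    # significant at 90% and 95%; not significant at 98% and 99%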
residual sum of squares
Also known as the error sum of squares and denoted SSResid (or SSE), the residual sum of squares is the sum of the squared residuals: the sum of the squared differences between the observed values of the dependent variable and the values predicted by the regression. It measures the variation in y that cannot be attributed to the approximate linear relationship (the unexplained variation) and is usually less than the total sum of squares (SSTo). To calculate the residual sum of squares, subtract the regression sum of squares from the total sum of squares.
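In symbols, with $y_i$ observed and $\hat{y}_i$ predicted: $\text{SSResid} = \sum_i (y_i - \hat{y}_i)^2 = \text{Total SS} - \text{Regression SS}$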
A restaurant supply manager analyzes the relationship between a restaurant's location and the number of meals consumed by comparing clients in two locations: Munich and Paris. The manager's first regression model uses the number of meals consumed as the dependent variable and a dummy variable for location (Munich or Paris) as an independent variable. This model has an R-squared of 0.712, and the coefficient for location is statistically significant. The manager runs a second model, adding another variable, the amount of wine consumed with meal. In this model, the coefficient for location is no longer significant, the R-squared has increased from 0.712 to 0.719, and the adjusted R-squared has decreased. Which of the following is the most likely reason for this pattern of changes?
B. The variables for location and wine consumption are collinear. A. The owner made a mistake; it is impossible for a once significant variable to no longer be significant. Incorrect: adding a collinear variable can cause a previously significant variable to lose its significance. B. The variables for location and wine consumption are collinear. Correct: the loss of significance for location, the negligible increase in R-squared, and the decrease in adjusted R-squared are all classic symptoms of multicollinearity. C. The variable for location is a better predictor than wine consumption. Incorrect: these results do not tell us which variable is the better predictor. D. Neither location nor wine consumption is a good predictor in the model. Incorrect: the first model's R-squared of 0.712 shows that location alone has substantial explanatory power.
Below is a list of students and the number of hours each one spends weekly studying Business Analytics: · Ching: 3 hours · Nadia: 4 hours · Rehmi: 1 hour · Tonio: 14 hours · Isabel: 4 hours · Michael: 2 hours · Neha: 6 hours · Alphonse: 5 hours Which of the following is the most appropriate way to deal with Tonio's data? A. Delete the data point because it's too far from the mean. B. Leave it in because one should never delete research-based data. C. Research the data point and then make a decision based on the findings. D. Correct the typo by changing it to 4 hours.
C. Research the data point and then make a decision based on the findings. A. Delete the data point because it's too far from the mean. A data point should not be deleted simply because it is an outlier. B. Leave it in because one should never delete research-based data. There may be situations where a data point should be deleted: because it is a measurement or entry error, because it is not representative of the population the researcher is interested in, or for many other reasons. C. Research the data point and then make a decision based on the findings. Doing business analytics effectively requires judgment. An outlier should be eliminated only if researching the data shows that the data value is incorrect or irrelevant to the research at hand. In some cases, it makes more sense to focus analysis specifically on a subset of the data. A researcher needs to use experience and knowledge of the research question to decide things on a case-by-case basis. D. Correct the typo by changing it to 4 hours. While the value may be a typo, the researcher cannot know this without researching it first.
Which Excel function is used to find the correlation coefficient between two variables?
CORREL(array1, array2)
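A hypothetical usage example (the cell ranges are placeholders): =CORREL(A2:A20, B2:B20) returns the correlation coefficient between the values in A2:A20 and B2:B20.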
A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site now has an average of 124,247 unique listeners per day. The manager finds that the p-value for the hypothesis test is approximately 0.0743. What can be concluded at the 95% confidence level? A. The manager should reject the null hypothesis; there is sufficient evidence that the number of unique daily listeners has likely changed. B. The manager should reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed. C. The manager should fail to reject the null hypothesis; there is sufficient evidence that the number of unique daily listeners has likely changed. D. The manager should fail to reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed.
D. A. The manager should reject the null hypothesis; there is sufficient evidence that the number of unique daily listeners has likely changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis. B. The manager should reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis. C. The manager should fail to reject the null hypothesis; there is sufficient evidence that the number of unique daily listeners has likely changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis. However, failing to reject the null hypothesis is not the same as accepting it, so the manager cannot conclude that the number of unique daily listeners has changed. D. The manager should fail to reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis.
When analyzing a residual plot, which of the following indicates that a linear model is a good fit? A. Patterns or curves in the residuals B. Increasing size of the residuals as values increase along the x-axis C. Decreasing size of the residuals as values increase along the x-axis D. Random spread of residuals around the y-axis E. Random spread of residuals around the x-axis
E. A. Patterns or curves in the residuals Patterns or curves in the residual plot indicate that the linear model is not a good fit. It is possible that a nonlinear relationship exists between the independent and dependent variable. B. Increasing size of the residuals as values increase along the x-axis The size of the residuals should not increase as values increase along the x-axis; this is a sign of heteroskedasticity. C. Decreasing size of the residuals as values increase along the x-axis The size of the residuals should not decrease as values increase along the x-axis; this is a sign of heteroskedasticity. D. Random spread of residuals around the y-axis The y-axis of a residual plot represents the size of the residuals; this is not the axis that we examine. E. Random spread of residuals around the x-axis A linear model is a good fit if the residuals are spread randomly above and below the x-axis.
The linear relationship between two variables can be statistically significant but not explain a large percentage of the variation between the two variables. This would correspond to which pair of R^2 and p-value? A. Low R-squared, Low p-value B. Low R-squared, High p-value C. High R-squared, Low p-value D. High R-squared, High p-value
A. Low R-squared, Low p-value. A low R-squared and low p-value indicate that the independent variable explains little of the variation in the dependent variable but that the linear relationship between the two variables is significant. Other options: Low R-squared, High p-value: the independent variable explains little of the variation in the dependent variable and the linear relationship is not significant. High R-squared, Low p-value: the independent variable explains much of the variation in the dependent variable and the linear relationship is significant. High R-squared, High p-value: the independent variable explains much of the variation in the dependent variable but the linear relationship is not significant.
For categorical variables with more than two categories, how many multiple dummy variables are required?
The number of dummy variables must be the total number of categories minus one. For example, a categorical variable with three categories requires two dummy variables; the omitted category serves as the baseline.
coefficient of variation
Standard deviation divided by the mean; in Excel, =STDEV.S(range)/AVERAGE(range).
Independent variable
The experimental factor that is manipulated; the variable whose effect is being studied. A variable that is presumed to be related to a dependent variable. In a regression model, independent variables can be used to help predict the value of the dependent variable. A regression model seeks to find the best-fit linear relationship between a dependent variable and one or more independent variables.
p-value
The probability level which forms the basis for deciding whether results are statistically significant (not due to chance).
Is there bias here: "How much more important is location than price when purchasing a house?"
Yes. This question presupposes that location is more important than price, and thus may bias the respondent in that direction.
residual plot
a scatterplot of the regression residuals against the explanatory variable
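A minimal sketch of constructing such a plot, assuming numpy and matplotlib are available (the data is synthetic):
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 50)
    y = 3 * x + 5 + rng.normal(scale=2, size=x.size)  # hypothetical linear data plus noise
    slope, intercept = np.polyfit(x, y, 1)            # best-fit regression line
    residuals = y - (slope * x + intercept)           # observed minus predicted

    plt.scatter(x, residuals)
    plt.axhline(0)  # for a good linear fit, residuals scatter randomly about this line
    plt.xlabel("explanatory variable")
    plt.ylabel("residual")
    plt.show()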
null hypothesis
a statement or idea that can be falsified, or proved wrong
Coefficient of variation
A measure of relative variability
Dummy variables
A variable used to model the effect of categorical independent variables in a regression model; generally takes only the value zero or one. Dummy variables are used to transform categorical variables into quantitative variables. A categorical variable with only two categories (e.g. "heads" or "tails") can be transformed into a quantitative variable using a single dummy variable that takes on the value 1 when a data point falls into one category (e.g. "heads") and 0 when a data point falls into the other category (e.g. "tails").
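A minimal sketch of this transformation with pandas (hypothetical data; drop_first=True keeps one fewer dummy than there are categories, matching the categories-minus-one rule above):
    import pandas as pd

    df = pd.DataFrame({"coin": ["heads", "tails", "heads"]})
    # Two categories need a single dummy; the dropped category ("heads")
    # becomes the baseline and the remaining column marks "tails" rows
    print(pd.get_dummies(df["coin"], drop_first=True))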
Alternative hypothesis
An alternative hypothesis is the theory or claim we are trying to substantiate, and is stated as the opposite of a null hypothesis. When our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis.
residual
For a specified value of an independent variable x, the residual in a regression model is the vertical distance between the observed value of the dependent variable y corresponding to that x-value and the expected value of y for that x-value. To calculate the residual for a given x-value, subtract the expected value of y from the observed value of y. See error term.
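In symbols (standard notation): $e_i = y_i - \hat{y}_i$, the observed value of y minus the value of y predicted by the regression.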
Multiple R value
For single variable linear regression, Multiple R, which is the square root of R-squared, is equal to the absolute value of the correlation coefficient.
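In symbols, for single variable regression: $\text{Multiple } R = \sqrt{R^2} = |r|$, where r is the correlation coefficient.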
What is the equation for the coefficient of variation?
Formula: standard deviation / mean
When do you use Adjusted R2?
It is important to use Adjusted R-squared to compare two regression models that have different numbers of independent variables.
Which Excel function returns the k-th percentile value in the specified array?
PERCENTILE.INC(array, k)
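A hypothetical usage example (the range is a placeholder): =PERCENTILE.INC(B2:B17, 0.9) returns the 90th percentile of the values in B2:B17.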
This type of variable needs to be transformed to a dummy variable before regression analysis
Qualitative variable
Correlation coefficient
Strength of the linear relationship between two variables. Values range between -1 and 1, where 0 indicates no linear relationship.
Name 3 ways that a mean can be found (Trick Question!)
The Descriptive Statistics tool; =AVERAGE(B2:B17); =SUM(B2:B17)/COUNT(B2:B17)
Dependent variable
The measurable effect, outcome, or response in which the research is interested. A variable that is presumed to be related to one or more independent variables. In a regression model, the dependent variable is the value we are trying to understand or predict. A regression model seeks to find the best-fit linear relationship between a dependent variable and one or more independent variables.
Null hypothesis
The null hypothesis is the opposite of the hypothesis you are trying to substantiate.
p-value
The probability level which forms the basis for deciding whether results are statistically significant (not due to chance). A p-value can be interpreted as the probability, assuming the null hypothesis is true, of obtaining an outcome that is equal to or more extreme than the result obtained from a data sample. The lower the p-value, the greater the strength of statistical evidence against the null hypothesis.
Central Limit Theorem
The theorem that, as the sample size n increases, the distribution of the means of randomly selected samples of size n approaches a normal distribution.
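A minimal simulation sketch of this convergence, assuming numpy is available (the skewed population here is hypothetical):
    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.exponential(scale=2.0, size=100_000)  # deliberately non-normal
    # Means of 5,000 random samples of size n = 30
    sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]
    # The distribution of sample_means is approximately normal despite the skew,
    # and it centers near the population mean
    print(np.mean(sample_means), np.mean(population))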
True or False: Multicollinearity is usually not an issue when the regression model is only being used for forecasting
True. Multicollinearity is typically not a problem when the model is being used for forecasting, especially if the predictive power of the model is increased by the additional variable(s). However, when the model is used to understand net relationships, multicollinearity affects the estimates of the coefficients, thereby distorting the net relationships.
High R-squared value means
indicates that the independent variable explains a lot of the variation in the dependent variable
z-value
The number of standard deviations a given observation is from the population mean: z = (x - μ) / σ, i.e., (point x - population mean) / standard deviation.
Collinear
points lying on the same line; in regression analysis, two independent variables are collinear when they are highly correlated with each other
Coefficient of variation
standard deviation/mean
Quantitative variable
takes numerical values for which it makes sense to find an average
alternative hypothesis
the hypothesis that a proposed result is true for the population