GB 307 exam 1
Performing a Best Subsets Regression will help us better understand why certain parameters matter.
False. Best subsets regression produces a set of models that include combinations of parameters that are highly predictive of the outcome, but it doesn't tell us why certain variables matter.
As noise increases, seasonality and cycles become more difficult to discover, but trend will be easier to find.
False. Noise tends to mask seasonality, cycles, and trend.
aggregation function in SQL
AVG (average), MIN, SUM. Aggregation functions combine multiple records to produce one summary statistic.
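As a quick illustration of how aggregation collapses several records into one value, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical in-memory table, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("East", 300.0), ("West", 50.0)],
)

# Each aggregate function turns three rows into a single summary value.
avg_amount, min_amount, total_amount = conn.execute(
    "SELECT AVG(amount), MIN(amount), SUM(amount) FROM sales"
).fetchone()
conn.close()

print(avg_amount, min_amount, total_amount)  # 150.0 50.0 450.0
```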
Which variable type is the smallest (i.e., requires the least amount of memory to store)?
Binary
True or False: Outliers always have a significant effect on the best fit line.
False. Influential points always have a significant impact on the best fit line, but the same is not necessarily true for outliers.
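The distinction can be seen in a small sketch with made-up data: a point with an unusual Y value at the mean of X leaves the slope unchanged (only the intercept shifts), while an unusual point far from the mean of X pulls the slope substantially. The helper `fit_slope` is a hypothetical name for this example.

```python
def fit_slope(x, y):
    """Least-squares slope for a simple one-variable regression."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Made-up base data lying exactly on the line y = 2x.
base_x = [float(i) for i in range(1, 10)]
base_y = [2.0 * xi for xi in base_x]

# Outlier in Y placed at the mean of X (x = 5): the slope does not move.
slope_outlier = fit_slope(base_x + [5.0], base_y + [30.0])

# Unusual point far from the mean of X (x = 20): the slope changes a lot.
slope_influential = fit_slope(base_x + [20.0], base_y + [10.0])

print(slope_outlier, slope_influential)
```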
The Root Mean Square Error (RMSE) is the average percent difference between observed and predicted values.
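False. The Mean Absolute Percentage Error (MAPE) is the average percent difference between observed and predicted values. RMSE is the square root of the average squared difference, so it is expressed in the units of the outcome variable rather than as a percentage.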
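To make the difference concrete, here is a minimal sketch that computes both measures on a tiny set of made-up observed and predicted values.

```python
import math

# Made-up values, for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

errors = [o - p for o, p in zip(observed, predicted)]

# RMSE: square root of the mean squared error, in the units of the outcome.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAPE: mean of the absolute errors expressed as a percent of the observed value.
mape = 100 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / len(errors)

print(round(rmse, 2), round(mape, 2))  # 12.91 6.67
```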
You have run a regression based on the following model: $\ln(\hat{Y}) = \alpha + \beta \ln(X)$. The parameter estimate for β came back as 0.10. Which of the following is the correct interpretation of this parameter estimate?
For every one percent increase in X, we expect Y to increase by 0.10%.
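This interpretation can be checked numerically. The sketch below assumes a hypothetical intercept and starting X (neither is given on the card) and shows that a 1% increase in X raises the predicted Y by roughly 0.10%, whatever those values are.

```python
import math

alpha = 2.0   # hypothetical intercept; it cancels out of the percent change
beta = 0.10   # the estimate from the card

def predict_y(x):
    """Predicted Y from the log-log model ln(Y) = alpha + beta * ln(X)."""
    return math.exp(alpha + beta * math.log(x))

x = 50.0  # hypothetical starting value of X
pct_change_y = 100 * (predict_y(1.01 * x) / predict_y(x) - 1)

print(round(pct_change_y, 3))  # approximately 0.1
```

The match is approximate rather than exact because the elasticity interpretation relies on the approximation ln(1 + r) ≈ r for small r.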
Which of the following is an example of a continuous variable?
Google's earnings per share; amount of beer produced by MillerCoors in 2016.
A ________ clause specifies a condition on which we can filter an aggregated value.
HAVING. The HAVING clause allows us to filter our results based on aggregated values such as averages or sums.
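A minimal sketch of HAVING in action, again using sqlite3 with a hypothetical `orders` table. WHERE would filter individual rows before grouping; HAVING filters the groups after the aggregate has been computed.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ann", 20.0), ("Ann", 40.0), ("Bob", 5.0), ("Bob", 10.0)],
)

# Keep only customers whose summed order total exceeds 30.
big_spenders = conn.execute(
    """
    SELECT customer, SUM(total) AS spent
    FROM orders
    GROUP BY customer
    HAVING SUM(total) > 30
    ORDER BY customer
    """
).fetchall()
conn.close()

print(big_spenders)  # [('Ann', 60.0)]
```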
When joining two tables, which of the following join types requires a match between columns in both tables to return a record?
INNER JOIN. Inner joins require that the value appear in both tables to be returned in the result set. Outer joins allow records to be returned when the value appears in just one of the tables.
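The difference shows up clearly on a small example. In this sketch (hypothetical `customers` and `orders` tables), Bob has no orders, so the inner join drops him while the left outer join keeps him with a NULL total.

```python
import sqlite3

# Hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0)])

# INNER JOIN: only customers with a matching order row are returned.
inner_rows = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# LEFT OUTER JOIN: every customer is returned, matched or not.
left_rows = conn.execute(
    "SELECT c.name, o.total FROM customers c "
    "LEFT OUTER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
conn.close()

print(inner_rows)  # [('Ann', 25.0)]
print(left_rows)   # [('Ann', 25.0), ('Bob', None)]
```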
A primary key does which of the following?
Identifies a unique record in a table
When joining two tables together, the type of relationship between the fields on which you are joining matters for efficiency (i.e., how quickly you get results). Match the type of relationship and the relative efficiency generally experienced.
Joining on a primary key and a related foreign key: most efficient. Joining on a primary key and an indexed field that is not a related foreign key: second most efficient. Joining two indexed fields that don't have a pre-defined relationship: third most efficient. Joining on two un-indexed fields that do not have a defined relationship: slowest.
In linear regression, what are we doing to determine the parameter estimates for the best fit line?
Minimizing the sum of the squared residuals. Linear regression minimizes the sum of the squared residuals. By squaring the residuals, we accomplish two things: 1. Positive and negative residuals will not offset each other. 2. Outliers have a larger impact on model estimates.
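As a minimal sketch of what "minimizing the sum of squared residuals" means for a one-variable regression, the code below uses the standard closed-form least-squares estimates on made-up data and checks that nudging either estimate makes the sum of squared residuals larger. The helper name `sse` is hypothetical.

```python
# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]

def sse(intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form least-squares estimates for simple linear regression.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

print(round(intercept, 2), round(slope, 2))  # 0.05 1.98

# Any other line gives a larger sum of squared residuals.
print(sse(intercept, slope) < sse(intercept, slope + 0.1))  # True
```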
Describe what overfitting is, and why it is a concern in model building.
Overfitting occurs when we use many parameters in our model to most accurately predict in sample, without regard for predictive ability out of sample. This can produce very complex models that don't actually help us make better managerial decisions.
Traits of good models
Predicts well out of sample. Uses the fewest possible independent variables to adequately predict the outcome. Has coefficients that are easy to interpret.
Which of the following statements regarding MAE, RMSE, and MAPE is correct?
RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable.
equal variance assumption
The equal variance assumption says that the variance of the population errors is constant for all values of X. That is, the spread of the points should not differ greatly over values of X (i.e., there is no "fan" or "diamond" shape in the residual plot). In the residual plot for this question, there is a clear fan shape, indicating the population errors likely violate the equal variance assumption.
independence assumption
The independence assumption says that the values of the population errors are independent. That is, knowing the value of one residual tells us nothing about the value of the next. If residuals are positively auto-correlated, then when the last residual was large, we expect the next to be large. If residuals are negatively auto-correlated, then when the last residual was large, we expect the next to be small. This rarely occurs with cross-sectional data, and is generally only a concern with time-series data.
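One simple way to see positive versus negative auto-correlation is to compute the lag-1 autocorrelation of the residuals. This is a sketch on made-up residual series, and `lag1_autocorr` is a hypothetical helper, not a course-provided function.

```python
def lag1_autocorr(residuals):
    """Lag-1 sample autocorrelation: how strongly each residual tracks the previous one."""
    mean = sum(residuals) / len(residuals)
    dev = [r - mean for r in residuals]
    numerator = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    denominator = sum(d * d for d in dev)
    return numerator / denominator

# Made-up residuals that drift slowly: a large value tends to follow a large value.
positive_series = [3.0, 2.5, 2.0, 1.0, -1.0, -2.0, -2.5, -3.0]

# Made-up residuals that flip sign each period: a large value follows a small one.
negative_series = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0]

print(round(lag1_autocorr(positive_series), 3))  # 0.691
print(round(lag1_autocorr(negative_series), 3))  # -0.875
```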
linearity assumption
The linearity assumption says that for all values of X, the population errors have mean zero. In the residual plot, this is violated if there exists a range of X values over which the residuals consistently fall above or below zero. This frequently occurs when the residuals follow some non-linear pattern. Hence, the "linearity" assumption.
Which of the following are true about best subsets and stepwise regression?
They will capitalize on any spurious correlation present in the data, leading to concerns about overfitting. They can help sift through a large number of variables to build a predictive model.
normality assumption
This is a judgement call, as is common in real world data. The residual distribution appears left skewed, which means it is not symmetric, and therefore cannot be normal. However, this skew is a result of just a few points, including the large positive residual associated with our outlier. Further, recall that the assumption is over the population errors, not the sample residuals. With so few points, it is possible that the errors are still normally distributed despite the residual distribution. The good news is that linear regression is relatively robust to violations of the normality assumption. Because of the ambiguity here, there were no points awarded for this question regardless of your answer.
When performing out of sample validation, you must estimate your model using a subset of your data, and then compare your model's predictive accuracy on an entirely separate subset of your data.
True.
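A minimal sketch of out-of-sample validation on made-up data: the line is estimated on one subset only, and accuracy is measured on a separate hold-out subset that played no part in estimation. Holding out every third point is an arbitrary choice made for reproducibility here.

```python
import math

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [5.1, 6.8, 9.2, 11.1, 12.7, 15.3, 16.9, 19.2, 20.8, 23.1]

# Hold out every third observation; estimate on the rest.
test_idx = [i for i in range(len(x)) if i % 3 == 2]
train_idx = [i for i in range(len(x)) if i % 3 != 2]

x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Least-squares estimates using the training subset only.
x_bar = sum(x_train) / len(x_train)
y_bar = sum(y_train) / len(y_train)
slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x_train, y_train)) / sum(
    (a - x_bar) ** 2 for a in x_train
)
intercept = y_bar - slope * x_bar

# Predictive accuracy on the hold-out subset.
test_errors = [y[i] - (intercept + slope * x[i]) for i in test_idx]
rmse_test = math.sqrt(sum(e ** 2 for e in test_errors) / len(test_errors))

print(round(slope, 2), round(rmse_test, 2))
```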
If not properly modeled, auto-correlation in the dependent variable can cause violations of the independent errors assumption in regression.
True. This is one of the most common causes of violations of the independent errors assumption, and why we need to be so careful when applying linear regression models to time-series data.
Which of the following statements about forecasting is correct?
Two common approaches to forecasting are qualitative and quantitative forecasting. Extrapolation & econometric models are both types of quantitative forecasting methods. When a time series increases at a rate such that the percentage difference from value to value is constant, an exponential trend is present.
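The exponential-trend statement can be illustrated with a made-up series that grows 5% per period: the period-to-period percentage change is constant even though the absolute change keeps growing.

```python
# Made-up series growing at a hypothetical 5% per period.
series = [100 * 1.05 ** t for t in range(6)]

# Percentage change from each value to the next: constant under an exponential trend.
pct_changes = [100 * (b / a - 1) for a, b in zip(series, series[1:])]

# Absolute change from each value to the next: increases every period.
abs_changes = [b - a for a, b in zip(series, series[1:])]

print([round(p, 2) for p in pct_changes])  # [5.0, 5.0, 5.0, 5.0, 5.0]
print([round(d, 2) for d in abs_changes])
```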
Which of the following statements regarding Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) is correct? Select all that apply.
When a model poorly fits the time-series data, the MAPE is large. RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable. Outliers have a greater impact on RMSE than on MAE.
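The last two statements can be checked with a short sketch on made-up forecast errors: adding one large error raises RMSE more than it raises MAE, and rescaling the outcome (for example, dollars to cents) rescales RMSE by the same factor.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean square error."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up forecast errors, without and with one large outlier.
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]

print(mae(clean), rmse(clean))                                   # 1.0 1.0
print(round(mae(with_outlier), 2), round(rmse(with_outlier), 2))  # 2.8 4.56

# Scale dependence: measuring the outcome in units 100 times smaller
# makes every error, and therefore the RMSE, 100 times larger.
scaled = [100 * e for e in with_outlier]
print(round(rmse(scaled) / rmse(with_outlier), 6))  # 100.0
```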