GENBUS307

By squaring the residuals, what two things are accomplished?

1. Positive and negative residuals will not offset each other 2. Outliers have a larger impact on model estimates
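
A small sketch of both effects, using made-up residuals (the numbers are illustrative assumptions, not course data). The raw residuals sum to zero, while the squared residuals do not, and the two largest residuals account for most of the squared total.

```python
# Hypothetical residuals (observed minus predicted); the last two are large.
residuals = [2.0, -2.0, 1.0, -1.0, 6.0, -6.0]

# 1. Raw residuals: positives and negatives offset each other.
raw_sum = sum(residuals)

# 2. Squared residuals: every term is non-negative, so nothing cancels.
sum_sq = sum(r ** 2 for r in residuals)

# Share of the total carried by the two large residuals, squared vs. absolute.
share_squared = (6.0 ** 2 + (-6.0) ** 2) / sum_sq
share_absolute = (abs(6.0) + abs(-6.0)) / sum(abs(r) for r in residuals)

print(raw_sum, sum_sq, round(share_squared, 3), round(share_absolute, 3))
```

The large residuals carry a bigger share of the squared total than of the absolute total, which is why outliers pull harder on a least squares fit.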

Which of the following is an Excel function commonly used to create categorical variables? =IF(...) =MAX(...) =SUMPRODUCT(...) =WHEN(...)

=IF(...)

We frequently create moving averages to reduce the amount of noise in a data set and make the trend and potential cycles more apparent. If you create a 25 period moving average, how many observations would you have to eliminate? Options: 23; 12; 24; 0; 13

24. When we create a moving average, we lose W-1 observations, where W is the length of the window (here 25). With a centered window, we lose (W-1)/2 = 12 observations from the beginning and 12 from the end.
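
The W-1 rule can be checked with a minimal sketch. The function name `centered_moving_average` and the 100-period series are assumptions made for illustration; the sketch only handles odd window lengths.

```python
def centered_moving_average(values, window):
    """Return the centered moving average for an odd window length."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window lengths")
    half = (window - 1) // 2
    # Only positions with `half` neighbors on each side get a value.
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

series = list(range(100))                 # hypothetical 100-period series
smoothed = centered_moving_average(series, 25)
lost = len(series) - len(smoothed)        # observations eliminated

print(len(smoothed), lost)
```

With W = 25, the first 12 and last 12 periods have no complete window, so 24 observations drop out.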

Influential point

A point that, when included, noticeably changes our estimate of the relationship between X and Y

What are the advantages and disadvantages of creating a single period lagged dependent variable?

Advantages: Helps control for autocorrelation without requiring a leading indicator. Disadvantages: We lose an observation. Further, we can only predict a single period in advance. This may severely constrain the business' ability to respond to unexpected forecasts, reducing the usefulness of the model.
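
A minimal sketch of why a one-period lag costs an observation. The `sales` values are hypothetical; each pair holds the current value and the value from the period before it.

```python
# Hypothetical time series of sales, oldest first.
sales = [100, 104, 110, 108, 115]

# Pair each period's value with the previous period's value: (y_t, y_{t-1}).
# The first period has no predecessor, so it cannot form a pair.
pairs = list(zip(sales[1:], sales[:-1]))

print(len(sales), len(pairs))
print(pairs)
```

The same structure explains the forecasting limit: predicting period t requires the observed value from period t-1, so the model can only look one period ahead.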

Which of the following statements about forecasting is NOT correct? Extrapolation & econometric models are both types of quantitative forecasting methods. When a time series increases at a rate such that the percentage difference from value to value is constant, an exponential trend is present. Two common approaches to forecasting are qualitative and quantitative forecasting. After you select a particular forecasting model, you do not need to continually monitor your forecasts.

"After you select a particular forecasting model, you do not need to continually monitor your forecasts." This statement is not correct: we should continually check the accuracy of any model that we use.

In this class, we will focus on using SQL to extract information from a database using queries. However, it is much more powerful than that. Which of the following can SQL do? Select all that apply. Enter new records into a database. Change the structure of an existing database, including adding tables, deleting tables, deleting the entire database, etc. Extract information from a database. Update records already entered in a database. Define the database structure, including creating tables, fields, relationships, etc.

All

Which of the following are examples of databases? Select all that apply. IMDB Canvas Facebook Amazon.com

All

Which of the following is an advantage of databases over spreadsheets? Select all that apply. Easy integration with other programs; Reduces file sizes; Straightforward user controls; Easier to maintain data integrity

All

OUTER JOIN

Allows records to be returned when the value appears in just one of the tables

Threshold for statistical significance

Always set relative to the cost of making a type-I error

Which of the following is an example of a continuous variable? Number of store employees scheduled to work from 12pm-3pm; Total visitors to Disney parks on 12/31/2016; Amount of beer produced by MillerCoors in 2016; Google's earnings per share

Amount of beer produced by MillerCoors in 2016; Google's earnings per share

Which of the following is not a component of a time series? Trend Baseline Seasonality Cycles Random Effect / Noise

Baseline

Which of these approaches tests all possible combinations of the independent variables? Stepwise Regression Best Subsets Regression

Best Subsets Regression

An incremental value question asks: the expected value of Y for a given value of X; how much of the value in our dependent variable is explained by the independent variable; by how much our prediction of the dependent variable would change in response to a shift in the independent variable; how much value the firm could add by following our advice.

By how much our prediction of the dependent variable would change in response to a shift in the independent variable

What is it called when two independent variables in a regression model are highly correlated? Strong Inter-P Intervariance Entropy Collinearity or Multi-Collinearity

Collinearity / Multi-collinearity

Which of the following is not a measure we use to compare models? Cp Statistic Mean Absolute Percentage Error Combined p-value Adjusted R^2

Combined p-value

As compared to worksheets, which of the following is not an inherent advantage of relational databases? Allows capabilities (i.e., viewing and editing rights) to be assigned by user or user type; Mitigates the risk of invalid data being created or stored; Reduces redundancy in data entry; Contains straightforward tools for visualizing and statistically analyzing patterns in the data; Integrates easily with other programs and applications

Contains straightforward tools for visualizing and statistically analyzing patterns in the data

Which of the following is not a component of a relational database? Primary Key Field Cousin Object Record Table

Cousin Object

Macroeconomic effects

Cyclicality

When comparing two models what do you look at?

Don't look at R^2; look at adjusted R^2.

Performing a Best Subsets Regression will help us better understand why certain parameters matter. True False

False. Best subsets regression produces a set of models that include combinations of parameters that are highly predictive of the outcome, but it doesn't tell us why certain variables matter.

True or False: Outliers always have a significant effect on the best fit line. True False

False. Influential points always have a significant impact on the best fit line, but the same isn't necessarily true for outliers.

As noise increases, seasonality and cycles become more difficult to discover, but trend will be easier to find. True False

False. Noise tends to mask seasonality, cycles, and trends.

True/False: A primary key is always a single field that defines a unique record within a table. True False

False. Primary keys can span multiple fields, in which case they are known as composite keys.
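
A sketch of a composite key using Python's built-in `sqlite3` module. The `enrollments` table and its columns are hypothetical; the point is that uniqueness is enforced on the pair of fields, not on either field alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,
        course_id  TEXT    NOT NULL,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)  -- composite key: two fields
    )
    """
)
conn.execute("INSERT INTO enrollments VALUES (1, 'GENBUS307', 'A')")
# Same student, different course: the pair is still unique, so this is allowed.
conn.execute("INSERT INTO enrollments VALUES (1, 'ACCT100', 'B')")

try:
    # Same student AND same course: violates the composite primary key.
    conn.execute("INSERT INTO enrollments VALUES (1, 'GENBUS307', 'C')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(duplicate_rejected, row_count)
```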

The Root Mean Square Error (RMSE) is the average percent difference between observed and predicted values. True False

False. The MAPE is the average percent difference between observed and predicted values. RMSE is the square root of the average squared difference.

True or False: Even when a relationship is not statistically significant, there is still a good chance that it is practically significant. True False

False. When a relationship between X and Y is statistically insignificant, we are saying we can't discern an impact on Y from changing X. It is hard to argue for practical significance if we can't be confident that X has an impact on Y.

Beta parameter estimate of 0.10

For every one percent increase in X, we expect Y to increase by 0.10% (this percent-to-percent reading applies when both X and Y are log-transformed).

A ___________ clause specifies a condition on which we can filter an aggregated value.

HAVING
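
A sketch contrasting WHERE and HAVING, run through Python's `sqlite3` module. The `orders` table and its values are hypothetical. WHERE filters individual records before grouping; HAVING filters on the aggregated value after grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ann", 50.0), ("Ann", 70.0), ("Bob", 20.0), ("Bob", 30.0), ("Cy", 200.0)],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 10            -- filters single records
    GROUP BY customer
    HAVING SUM(amount) > 100     -- filters the aggregated total
    ORDER BY customer
    """
).fetchall()

print(rows)
```

Bob's total of 50 passes the WHERE filter record by record but fails the HAVING condition, so only Ann and Cy are returned.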

Which one of the following was NOT a trait of good models: Predict well out of sample; Has coefficients that are easy to interpret; Uses the fewest possible independent variables to adequately predict the outcome; Has an adjusted r^2 greater than 0.5

Has an adjusted r^2 greater than 0.5. There is no objective baseline for identifying a "good" adjusted r^2 value across all models and applications.

When joining two tables, which of the following join types requires a match between columns in both tables to return a record? RIGHT OUTER JOIN INNER JOIN FULL OUTER JOIN LEFT OUTER JOIN

INNER JOIN
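
A sketch comparing INNER JOIN with LEFT OUTER JOIN through Python's `sqlite3` module. The `customers` and `orders` tables are hypothetical; customer Cy has no orders, so Cy appears only in the outer join result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    """
)

# INNER JOIN: a record is returned only when both tables have a match.
inner_rows = conn.execute(
    """SELECT c.name, o.order_id FROM customers AS c
       INNER JOIN orders AS o ON o.customer_id = c.customer_id
       ORDER BY c.name, o.order_id"""
).fetchall()

# LEFT OUTER JOIN: every customer is returned, matched or not.
left_rows = conn.execute(
    """SELECT c.name, o.order_id FROM customers AS c
       LEFT OUTER JOIN orders AS o ON o.customer_id = c.customer_id
       ORDER BY c.name, o.order_id"""
).fetchall()

print(inner_rows)
print(left_rows)
```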

Which of the following is an example of seasonality? Select all that apply. Ice cream sales decrease every winter. Terrace beer sales have increased steadily for the past ten years. Last Tuesday, shoe sales at Dick's Sporting Goods were higher than expected. Every year between Thanksgiving and Christmas, retail sales increase significantly.

Ice cream sales decrease every winter. Every year between Thanksgiving and Christmas, retail sales increase significantly.

A primary key does which of the following? Ensures that only certain types of information can be entered in a table; Allows users to upload many records into a table simultaneously; Allows a user to access a given table; Identifies a unique record in a table

Identifies a unique record in a table

When not properly accounted for in the model, autocorrelation in the dependent variable leads to violations of which linear regression assumption? Normality; Linearity; Equal Variance; Independence; None of the above

Independence. Autocorrelation in the dependent variable that the model does not account for will produce autocorrelation in the regression residuals. This creates dependence between the residuals over time, violating the independent errors assumption.

Slowest type of relationship

Joining on two un-indexed fields that do not have a defined relationship

Third most efficient type of relationship

Joining two indexed fields that don't have a predefined relationship

Which of the following is not a method commonly used to account for cycles in time series forecasting models? Instrumental Variables Leading Indicators Lagged Variables

Instrumental variables

[Equation: Root Mean Square Error (RMSE)] Which of the following is true of Root Mean Square Error? Select all that apply. The output can be interpreted as the average percent forecast error arising from the model. It is one method by which we can evaluate out-of-sample model accuracy. Adding any variable to a regression will increase Root Mean Square Error. Large forecast errors carry disproportionate weight.

It is one method by which we can evaluate out-of-sample model accuracy. Large forecast errors carry disproportionate weight.

Most efficient type of relationship

Joining on a primary key and a related foreign key

Second most efficient type of relationship

Joining on a primary key and an indexed field that is not a related foreign key

Violation of the equal variance assumption

The left plot is a violation; the right plot is normal.

In linear regression, what are we doing to determine the parameter estimates for the best fit line? Minimizing the average difference between our observed and predicted values; Minimizing the average value of the residuals; Minimizing the sum of the squared residuals; Minimizing the sum of the absolute values of the residuals

Minimizing the sum of the squared residuals
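
A minimal sketch of the idea for a single independent variable, using the standard closed-form least squares estimates. The data and the function names `fit_line` and `sum_squared_residuals` are assumptions for illustration. Nudging either estimate away from the fitted value raises the sum of squared residuals.

```python
def fit_line(x, y):
    """Closed-form least squares estimates for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = fit_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
nudged_slope = sum_squared_residuals(x, y, b0, b1 + 0.1)
nudged_intercept = sum_squared_residuals(x, y, b0 + 0.1, b1)

print(b0, b1, best, nudged_slope, nudged_intercept)
```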

Which of the following are examples of seasonality? Select all that apply. Consumers are increasingly turning to alternative energy sources, driving the sales of solar panels. Stock market prices are currently much higher than we would expect based on historical trends and averages. Netflix web traffic increases every evening between 7 pm and 10 pm. Sales of ice skates at Dick's Sporting Goods are highest in the winter.

Netflix web traffic increases every evening between 7 pm and 10 pm. Sales of ice skates at Dick's Sporting Goods are highest in the winter.

Which of the following is not an assumption about the population errors in linear regression? No Collinearity Independence Equal Variance Linearity Normality

No collinearity

When predicting time series values such as sales, commodity prices, or resource utilization, which component of the time series will contain unpredictable, rare events? Seasonality Cycle Noise Trend

Noise

Which of the linear regression assumptions can be checked by creating a histogram of the residuals? Normality Independence Linearity Equal Variance

Normality

Which of the following is not a method by which we can smooth out the noise in a time series variable, aiding in data visualization? Out of sample validation Exponential smoothing Moving averages

Out of sample validation

Which of the following statements regarding MAE, RMSE, and MAPE are correct? Outliers have a greater impact on RMSE than MAE. RMSE and MAPE are common types of forecast errors, while MAE isn't. RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable. When a model poorly fits the time-series data, the MAPE is large.

Outliers have a greater impact on RMSE than MAE. RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable. When a model poorly fits the time-series data, the MAPE is large.

MAPE

Preferred model is the one with the lowest MAPE

Suppose you add a new variable to an existing regression model. Which of the following can be true? Select all that apply. R square increases, Adjusted R square increases. R square decreases, Adjusted R square decreases. R square decreases, Adjusted R square increases. R square increases, Adjusted R square decreases

R square increases, Adjusted R square increases. R square increases, Adjusted R square decreases. Adding a new variable never lowers the r-square value, and in practice it almost always raises it. This is why we can't use r-square to evaluate models when we add additional parameters. In contrast, adjusted r-square contains a penalty term for the number of parameters to offset the potential increase associated with spurious correlations. As a result, it can increase or decrease when a new parameter is added.
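
A sketch of the penalty at work, using the standard adjusted r-square formula 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The r-square values below are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-square for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30  # hypothetical sample size

base = adjusted_r2(0.600, n, 2)      # model with two variables
useful = adjusted_r2(0.700, n, 3)    # third variable adds real explanatory power
useless = adjusted_r2(0.601, n, 3)   # third variable adds almost nothing

print(round(base, 4), round(useful, 4), round(useless, 4))
```

In both cases r-square went up when the third variable was added, but adjusted r-square rose only when the gain outweighed the penalty for the extra parameter.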

Imagine you're working for a retailer, and you've developed several models to forecast store traffic. You're getting ready to test these models out of sample, and are deciding which metric to use. While you want a model that is as accurate as possible, you're willing to take one that is slightly less accurate on average if it significantly reduces the risk of large forecast errors. Which would be the appropriate metric by which to compare your models? Mean Absolute Error (MAE) Root Mean Square Error (RMSE) Mean Absolute Percentage Error (MAPE)

RMSE. Because it squares the prediction errors before calculating the average, RMSE places an additional penalty on instances in which the model is highly inaccurate. This is the value of RMSE over some of the other options for out of sample validation.
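
A sketch of the three metrics on two hypothetical forecasts of store traffic. The `steady` model is off by a little every period; the `spiky` model is perfect three times and badly wrong once. All values and names are assumptions for illustration.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error: errors are squared before averaging."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (assumes no actual is 0)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 100.0, 100.0, 100.0]
steady = [95.0, 105.0, 95.0, 105.0]    # absolute errors: 5, 5, 5, 5
spiky = [100.0, 100.0, 100.0, 84.0]    # absolute errors: 0, 0, 0, 16

for name, pred in [("steady", steady), ("spiky", spiky)]:
    print(name, mae(actual, pred), rmse(actual, pred), mape(actual, pred))
```

The spiky model wins on MAE and MAPE because it is more accurate on average, but it loses on RMSE because of its one large miss. A forecaster who wants to avoid large errors would therefore compare on RMSE.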

Which of the following statements regarding MAE, RMSE, and MAPE is correct? Outliers have a greater impact on MAE than on RMSE. When a model poorly fits the time-series data, the MAPE is small. RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable. RMSE and MAPE are common types of forecast errors, while MAE is not.

RMSE is a good measure of model accuracy, but its magnitude depends on the scale of the outcome variable. Of these three, MAPE is the only measure of model accuracy that is independent of the scale of the outcome variable.

The threshold for statistical significance should be set: relative to the cost of making a Type-I error; at 0.05 to mirror the confidence interval; at 0.10

Relative to the cost of making a Type-I error

Which of the following is an example of a one-to-many relationship? Select all that apply. Restaurants and reviews in Yelp's database. Oscar nominations and films at the Academy Awards. Students matched to courses at the University of Wisconsin. Books and their authors in Amazon's database.

Restaurants and reviews in Yelp's database.

The _________ clause lists the fields that you want returned from your query.

SELECT

Which of the following is NOT an aggregation function in SQL? SUM MIN AVERAGE TOP

TOP

When the residuals exhibit unequal variance with respect to X, such that the variance increases as the value of X increases, how could we transform Y and/or X in a new regression?

Take the natural log (ln) of Y. Taking the ln will reduce the variance when values are large, but increase it when values are near 0. This will help with heteroskedasticity.

Which of the following would likely be a required field in a retailer's database? Select all that apply. The employee ID number when creating a new entry in a table tracking hours worked. A customer's shipping address when creating a new e-commerce order. A coupon code when recording a transaction in the orders table. A customer's mobile phone number when creating a new customer record.

The employee ID number when creating a new entry in a table tracking hours worked. A customer's shipping address when creating a new e-commerce order.

Equal variance assumption

The equal variance assumption says that the variance of the population errors is constant for all values of the independent variables. This can be checked by examining the residual plot. A fan shape in that plot indicates that the population errors are heteroskedastic, which violates the equal variance assumption.

Independent errors assumption

The independent errors assumption says that there is no autocorrelation in the population errors. We examine the reasonableness of this assumption using the Durbin-Watson statistic, which measures the autocorrelation in the sample residuals. This measure will range from zero to four and equals two when there is no autocorrelation.
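
A sketch of the Durbin-Watson calculation, using its standard definition: the sum of squared differences between consecutive residuals, divided by the sum of squared residuals. The two residual series are hypothetical and chosen to be extreme.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Long runs of same-signed residuals: positive autocorrelation.
runs = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
# Residuals that flip sign every period: negative autocorrelation.
flips = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_runs = durbin_watson(runs)
dw_flips = durbin_watson(flips)
print(dw_runs, dw_flips)
```

Values well below two point to positive autocorrelation and values well above two point to negative autocorrelation, which is why two is the benchmark.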

Which of the following is a legitimate concern when applying a square transformation to an independent variable? Select all that apply. You cannot apply a square transformation when the independent variable takes values less than zero. The intuitive interpretation of slope estimates can be more difficult. You cannot compare the adjusted r^2 between otherwise identical models with and without the square transformation. The transformation requires an additional covariate, challenging model parsimony.

The intuitive interpretation of slope estimates can be more difficult. The transformation requires an additional covariate, challenging model parsimony.

What does a p-value tell us? Select all that apply. The p-value is the probability, given the data, of getting a parameter estimate at least as extreme as what we observe if the true population estimate were zero. The p-value is the percent chance that our statistical relationship has practical significance. If we declare the relationship between our independent and dependent variables to be statistically significant, the p-value is the probability that we're making a type-1 error. The p-value is the percent of variation in the dependent variable explained by variation in the independent variable(s).

The p-value is the probability, given the data, of getting a parameter estimate at least as extreme as what we observe if the true population estimate were zero. If we declare the relationship between our independent and dependent variables to be statistically significant, the p-value is the probability that we're making a type-1 error.

When two independent variables in a regression model exhibit collinearity, which of the following is NOT a likely result of removing one of the offending covariates? Select all that apply. The p-values for the slope coefficients decrease. The range of the confidence intervals for the slope coefficients increases. The adjusted r^2 increases. r^2 changes only slightly.

The range of the confidence intervals for the slope coefficients increases.

Which of the following is true? The sign of the correlation coefficient is always the same as the sign of the slope of the least squares line. The correlation coefficient is always equal to the square root of the r^2 value. Given a non-zero correlation, the least squares line becomes steeper as the correlation coefficient approaches one in absolute value. There is no relationship between the correlation coefficient and the least squares line.

The sign of the correlation coefficient is always the same as the sign of the slope of the least squares line.

Which of the following can be determined using only the information available in a scatterplot? Select all that apply. The strength of the correlation; Whether a relationship is causal; The presence of outliers; The magnitude of a relationship

The strength of the correlation; The presence of outliers; The magnitude of a relationship

True or False: The following statement would return all records from a hypothetical table named "Impressions". SELECT * FROM Impressions; True False

True
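
A sketch that runs the statement against a small in-memory table through Python's `sqlite3` module. The columns and rows of `Impressions` are hypothetical, since the card does not define them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical structure for the "Impressions" table.
conn.execute(
    "CREATE TABLE Impressions (impression_id INTEGER PRIMARY KEY, ad_name TEXT)"
)
conn.executemany(
    "INSERT INTO Impressions VALUES (?, ?)",
    [(1, "spring_sale"), (2, "back_to_school"), (3, "holiday")],
)

# No WHERE clause, so no records are filtered out; * returns every field.
all_rows = conn.execute("SELECT * FROM Impressions;").fetchall()
total = conn.execute("SELECT COUNT(*) FROM Impressions").fetchone()[0]

print(len(all_rows), total)
```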

Which of the following are true about primary keys? Select all that apply. They can help accelerate join statements. They identify unique records within a table. They are always a single field. They are almost always required fields.

They can help accelerate join statements. They identify unique records within a table. They are almost always required fields.

Which of the following are true about best subsets and stepwise regression? They can help sift through a large number of variables to build a predictive model. They are guaranteed to return the same set of candidate "best" models. They will capitalize on any spurious correlation present in the data, leading to concerns about overfitting. They may help identify why certain parameters matter.

They can help sift through a large number of variables to build a predictive model. They will capitalize on any spurious correlation present in the data, leading to concerns about overfitting.

When performing out of sample validation, you must estimate your model using a subset of your data, and then compare your model's predictive accuracy on an entirely separate subset of your data. True False

True

If not properly modeled, auto-correlation in the dependent variable can cause violations of the independent errors assumption in regression. True False

True. Autocorrelation is one of the most common causes of violations of the independent errors assumption.

p-value

The probability of making a Type I error if you declare an estimate to be statistically significant.

Which of the following is true with respect to databases? Select all that apply. User profiles define what actions a user can take within a database, including which tables they can view and which they can edit. Validation rules help prevent erroneous data entry by defining the requirements for a valid entry. We cannot join tables unless a primary key / foreign key relationship has been specified between them. A primary key must be a single field that uniquely identifies each record.

User profiles define what actions a user can take within a database, including which tables they can view and which they can edit. Validation rules help prevent erroneous data entry by defining the requirements for a valid entry.

When the relationship between two similar variables is very strong, why might you want to be cautious about including both variables in a regression?

Using two highly correlated independent variables in a regression creates multi-collinearity. Including both variables can result in insignificant p-values for each, even though either one would have a significant effect if included alone. This can mask relationships between certain variables.

Durbin-Watson Statistic

Values close to two represent less autocorrelation in the residuals

The variance inflationary factor can help identify: When two or more independent variables are highly correlated (i.e., multi-collinearity). The preferred model using best subsets regression. When the variance in our independent variable is larger than we would expect in a random sample. When the variance in our dependent variable is larger than we would expect in a random sample.

When two or more independent variables are highly correlated
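
A sketch of the variance inflation factor for the simplest case of two independent variables. In general, the VIF for a variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on the other independent variables; with only two variables, that R^2 equals their squared correlation. The data and function names are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two independent variables."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_collinear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # roughly 2 * x1
x2_unrelated = [5.0, 3.0, 4.0, 4.0, 3.0, 5.0]     # uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(vif_high, vif_low)
```

A VIF of 1 means no inflation from collinearity. Thresholds such as 5 or 10 are common rules of thumb rather than fixed cutoffs.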

Collinearity

When two or more variables in a regression are strongly correlated

INNER JOIN

Requires that the value appear in both tables to be returned in the result set

r^2 value

x.xx% of the variance in y is explained by x

