DSC 423 Final Practice
What assumptions do we make about residuals? (Mark all that apply.)
They have a mean of 0. They have a constant variance. They are normally distributed. They are independent.
The quarterly single-family housing starts (in thousands of dwellings) in the United States from 2004 through 2014 have been recorded. Researchers want to predict the housing starts for the first two quarters of 2015. What technique should they use?
Time series using regression
T/F: The regression line is chosen because it minimizes the sum of the squares of the errors.
True
The difference between glmnet and cv.glmnet is that the latter performs cross validation.
True
The interaction term is responsible for creating a "twist" in the regression plane.
True
We can use our domain knowledge to infer that a second-order term might be warranted.
True
When an interaction term is deemed important, include the first-order terms in the model regardless of their p-values.
True
When creating a pth order model, the number of levels for a variable must be greater than or equal to (p+1).
True
When evaluating a residual plot we should look for changes in variability.
True
When evaluating a residual plot we should look for trends.
True
When we keep a second order term in a model, we always keep the first order term.
True
When performing n-fold cross validation, n models are generated. Generally, it is best to select the best (highest adjusted R squared) of these n models as the final model.
False
When transforming variables, one should consider transforming the independent variables but not the dependent variable.
False
Assume the number of miles a car was driven last year is normally distributed with a mean of 9,000 and a standard deviation of 3,500. If we define "unusual" as those in the bottom 2.5% or top 2.5%, find the interval for the "usual" number of miles driven per year.
(2140, 15860)
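The interval can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative, and the mean and standard deviation come from the question.

```python
from statistics import NormalDist

# Values taken from the question: mean 9,000 miles, standard deviation 3,500.
miles = NormalDist(mu=9000, sigma=3500)

# "Usual" is the middle 95%, so 2.5% is cut from each tail.
lower = miles.inv_cdf(0.025)
upper = miles.inv_cdf(0.975)

print(round(lower), round(upper))  # 2140 15860
```

This is the same as computing 9,000 ± 1.96 × 3,500 by hand.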
Step 4: What is z?
-1.76
Step 4: What is z?
-2.14
Step 4: What is z?
-2.46
When you look up 1.73, what is the p-value?
.0418
Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.
.7%
Step 3: What is the declared alpha?
1%
Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.
1.6%
If the sample mean is 172.52 with a standard deviation of 10.31, what is z?
1.73, from z = (xbar - mu) / (s / sqrt(n)) = (172.52 - 170) / (10.31 / sqrt(50))
A VIF value of ________ says that the other predictors "explain" 90% of the variation in predictor j and suggests that there may be multicollinearity in the dataset.
10
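The value follows from the standard definition VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on the other predictors. A small sketch (the function name `vif` is illustrative, not from any library):

```python
def vif(r_squared_j):
    """Variance inflation factor for predictor j.

    r_squared_j is the R-squared obtained when predictor j is regressed
    on all of the other predictors.
    """
    return 1.0 / (1.0 - r_squared_j)

# If the other predictors explain 90% of the variation in predictor j:
print(round(vif(0.90), 2))  # 10.0
```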
Assume the number of songs stored on college students' smart phones are normally distributed with a mean of 500 and a standard deviation of 150. What is the maximum number of songs a student can have on their smart phone and still be considered in the bottom 1%?
151
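As a quick check, the cutoff is the 1st percentile of the distribution. A minimal sketch with the standard-library `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

# Values from the question: mean 500 songs, standard deviation 150.
songs = NormalDist(mu=500, sigma=150)

# The 1st percentile is the largest value still in the bottom 1%.
cutoff = songs.inv_cdf(0.01)

print(round(cutoff))  # 151
```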
Given x is a random variable from a normal population with the indicated mean and standard deviation, estimate the probability using the 68-95-99.7 Empirical Rule. P(x < 13 | mean = 15, std = 2)
16%
Given x is a random variable from a normal population with the indicated mean and standard deviation, estimate the probability using the 68-95-99.7 Empirical Rule. P(x < 42 | mean = 50, std = 4)
2.5%
Given z is a random variable from a standard normal population, estimate the probability using the 68-95-99.7 Empirical Rule. P(z < -2)
2.5%
Assume the number of miles a car was driven last year is normally distributed with a mean of 9,000 and a standard deviation of 3,500. What percent of cars were driven over 15,000 miles last year?
4.32%
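The upper-tail probability can be verified with the standard-library `statistics.NormalDist`. This is only a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

# Values from the question: mean 9,000 miles, standard deviation 3,500.
miles = NormalDist(mu=9000, sigma=3500)

# P(x > 15,000) is one minus the cumulative probability at 15,000.
p_over = 1 - miles.cdf(15000)

print(f"{p_over:.2%}")  # 4.32%
```

A table lookup with z rounded to 1.71 gives a slightly different figure (about 4.36%), so small discrepancies against a printed z-table are expected.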
Assume you are building an interaction model with two qualitative variables. The first variable has 3 levels. The second variable has 4 levels. In total, how many dummy variables do you need and how many interaction terms do you need?
5 dummy variables and 6 interaction terms. The two variables need 2 and 3 dummy variables respectively (levels minus 1), and interaction terms are formed only between dummy variables of different qualitative variables (2 * 3 = 6).
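The counting rule can be written out as a short sketch. The function name `interaction_model_terms` is hypothetical; it simply encodes "levels minus one" dummies per variable and one interaction term per cross-variable pair of dummies.

```python
def interaction_model_terms(levels_a, levels_b):
    """Return (dummy_count, interaction_count) for two qualitative variables."""
    dummies_a = levels_a - 1  # one level serves as the base case
    dummies_b = levels_b - 1
    # Interactions are only formed across variables, never within one variable.
    return dummies_a + dummies_b, dummies_a * dummies_b

print(interaction_model_terms(3, 4))  # (5, 6)
```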
Step 3: What is the declared alpha?
5% (default)
Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.
5.26%
Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.
7.7%
Given z is a random variable from a standard normal population, estimate the probability using the 68-95-99.7 Empirical Rule. P(z <= 1)
84%
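The Empirical Rule answers all come from splitting the leftover area evenly between the two tails. A hedged sketch (the helper `empirical_rule_below` is hypothetical and only handles whole numbers of standard deviations from -3 to 3):

```python
from statistics import NormalDist

# Area within 1, 2, and 3 standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def empirical_rule_below(k):
    """Approximate P(z < k) for an integer k between -3 and 3."""
    if k == 0:
        return 0.5
    tail = (1 - WITHIN[abs(k)]) / 2  # area in a single tail
    return tail if k < 0 else 1 - tail

for k in (-2, -1, 1):
    exact = NormalDist().cdf(k)
    print(k, round(empirical_rule_below(k), 3), round(exact, 4))
```

The rule gives 2.5%, 16%, and 84% for k = -2, -1, and 1; the exact values printed alongside differ slightly because 68 and 95 are rounded figures.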
Assume the number of songs stored on college students' smart phones are normally distributed with a mean of 500 and a standard deviation of 150. What percent of students' phones have less than 300 songs?
9.12%
Which feature selection method is the most computationally expensive?
All Subsets Selection
Which feature selection method will guarantee the best model based on the selected metric?
All Subsets Selection
In a first-order model with a single independent variable, the slope of the regression line is captured by:
B1
____ starts with the full model with k variables. It then removes variables one at a time, recording R-squared (or some other metric). It retains the best (k-1)-variable model and repeats until there is no improvement in R-squared.
Backward elimination
In a first-order model with a single independent variable, the y-intercept of the regression line is captured by:
B0
Supervised learning is used to draw inferences from data sets consisting of input data without labeled responses
False. This describes unsupervised learning. Examples of supervised learning: regression and classification techniques such as decision trees. Examples of unsupervised learning: clustering and association rules.
Multicollinearity can impact a regression model in several ways, including... (mark all that apply)
Creating rounding errors in betas. Reversing signs of betas. Muddying t-tests. Confounding beta estimates.
The tuning parameter, lambda, for ridge and LASSO regression is commonly set to 0.1 in ridge regression and 0.25 in LASSO regression.
False
When building a model relating E(y) to a qualitative independent variable with n levels you will need n dummy variables
False
When building a regression model, a qualitative variable with n levels would require n dummy variables.
False
What is/are the response variable(s)?
Exam scores
The general form of a probabilistic model includes
Expected value, error, dependent variable
Step 6: Draw a conclusion.
Fail to reject the null hypothesis. More data is needed.
Step 6: Draw a conclusion.
Fail to reject the null hypothesis. The manufacturer's claims may or may not be accurate. More data is needed.
A correlation table can be used to detect multicollinearity even if it is a one-to-many relationship.
False
AIC compares the precision and bias of the full model to models with a subset of the predictors.
False
Backward selection begins with an empty model.
False
Because ridge regression is more likely to produce a parsimonious model, it is often preferred over LASSO regression.
False
Consider a qualitative variable with three levels (tall, average, short). Dummy variables are used to model the levels in a regression model. We should consider the interaction between the dummy variables for "average" and "short".
False
Due to the nature of logistic regression, multicollinearity is not a problem when building models to predict a binary variable.
False
Everything else being equal a one-at-a-time treatment is preferred over an all-possible treatment.
False
Homoscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.
False
If a qualitative variable has 5 levels, it will require 5 dummy variables to model all the levels.
False
In a second-order model with one independent variable, if the beta associated with the second order term is negative the regression line will curve upwards.
False
In a second-order model with two independent variables, if both betas associated with the second order terms are negative the response surface will be saddle shaped.
False
In a second-order model, β1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.
False
In an interaction model, β1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.
False
In an observational study, the researcher actively changes some characteristics of the units before the data are collected.
False
In the model above, if β2 is negative the regression line will curve upward.
False
Leave-one-out cross validation (LOOCV), also called jackknifing, generally gives less accurate estimates of true test error than 10-fold cross validation.
False
One of the primary benefits of second-order models is their affinity for extrapolation.
False
Rejecting the null hypothesis in an F-test means that you accept the alternative hypothesis: all betas are not equal to 0.
False
T/F: Experimental data should be used when evaluating the impact of cocaine on birth defects.
False
T/F: In the appliance store example, for every unit change in sales revenue, we expect a change of 0.70 in advertising expenditure.
False
T/F: Regression is an example of unsupervised learning.
False
T/F: The difference between R2 and adjusted R2 is that R2 represents the coefficient of determination between two variables while adjusted R2 represents the coefficient of determination of a model.
False
The best cutoff value to use is 50%.
False
The null hypothesis of a basic F-test is that all beta parameters are equal to one.
False
The penalty for ridge regression is the L1-norm of the betas.
False
The size of a beta is a good indication of the importance of a variable in a regression model.
False
Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)
Ha: (mu1 - mu2) < 0
Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)
Ha: mu < 14
Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)
Ha: mu < 75000
Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)
Ha: mu ≠ 0.75
The United States and Japan often engage in intense trade negotiations. U.S. officials claim that Japanese manufacturers price their goods higher in Japan than in the United States, in effect subsidizing the low prices in the United States with extremely high prices in Japan. According to the U.S. argument, Japanese manufacturers accomplish this by preventing U.S. goods from reaching the market. An economist decides to test the hypothesis that higher retail prices are being charged for automobiles in Japan than in the United States. She obtains independent samples from 50 retail sales in the United States and 50 sales in Japan over the same time. She found the sample average of the U.S. sales to be 26,596 and the sample average of the Japanese sales to be 27,236. The standard deviations were 1,981 and 1,974 respectively. Using an alpha of 5%, conduct a hypothesis test. Step 1: What is the null hypothesis?
Ho: (mu1 - mu2) = 0
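The remaining steps of this two-sample test can be sketched in a few lines. This assumes a large-sample z-test with mu1 as the U.S. mean and mu2 as the Japanese mean, and a lower-tailed alternative (mu1 - mu2 < 0); the function name `two_sample_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference of two independent sample means."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Sample 1 = United States, sample 2 = Japan (figures from the question).
z = two_sample_z(26596, 1981, 50, 27236, 1974, 50)

# Lower-tailed test: the p-value is the area to the left of z.
p_value = NormalDist().cdf(z)

print(round(z, 2))  # -1.62
print(round(p_value, 3))
```

The p-value is a little above 5%, so at an alpha of 5% the null hypothesis is not rejected. A z-table lookup with z rounded to -1.62 gives 5.26%.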
A tire manufacturer claims that the mean life of its tire is 75,000 miles. A sample of 50 tires finds xbar = 74,200 and s = 2300. Test the manufacturer's claim. Step 1: What is the null hypothesis?
Ho: mu = 75000
A weight loss center claims its participants have a mean loss of 14 pounds in 14 days. A sample of 100 participants finds x-bar = 13.1 pounds and s = 4.2 pounds. You suspect they are lying! Test the weight loss center's claim using an alpha of 1%. Step 1: What is the null hypothesis?
Ho: mu = 14
A supplier claims the mean thickness of its washers is 0.75 inches. Washers that are too big or too small can cause problems in the machines. A sample of 50 finds xbar = 0.73 inches and s = 0.08 inches. Test the supplier's claim. Step 1: What is the null hypothesis?
Ho: mu = 0.75
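The one-sample tests in this set (tires, weight loss, washers) all follow the same arithmetic. Below is a hedged sketch for the washer example, assuming a large-sample z-test and a two-tailed alternative because washers can be too big or too small; the function name `one_sample_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, claimed_mean, sample_sd, n):
    """z statistic for a one-sample test of a mean (large n)."""
    return (sample_mean - claimed_mean) / (sample_sd / sqrt(n))

# Washer example: xbar = 0.73, claimed mu = 0.75, s = 0.08, n = 50.
z = one_sample_z(0.73, 0.75, 0.08, 50)

# Two-tailed test: double the area beyond |z|.
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2), round(p_value, 3))
```

For a one-tailed test such as the tire or weight-loss example, the p-value would be the single tail area `NormalDist().cdf(z)` rather than twice that area.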
A pharmaceutical company has developed a new drug designed to reduce a smoker's reliance on tobacco. Since certain dosages of the drug may reduce one's pulse rate to dangerously low levels, the product-testing division of the pharmaceutical company wants to model the relationship between dosage x and decrease in pulse rate y. An important question with regards to this experiment is "What is the maximum dosage that could safely be administered?" What technique should be used?
Inverse prediction
We will be looking at feature selection in the next lecture. In general, you should select a model that ....
Is simple. Is parsimonious. Has few features. Has high accuracy.
The regression line was chosen because....
It minimized the sum of the square of the errors (SSE).
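To make the SSE criterion concrete, here is a minimal sketch of the closed-form least-squares fit for one predictor. The five data points are made up for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]

b0, b1 = fit_line(xs, ys)
print(round(b0, 2), round(b1, 2))  # -0.1 0.7

# Any other line has a larger SSE than the fitted one.
print(sse(xs, ys, b0, b1) < sse(xs, ys, b0, b1 + 0.1))  # True
```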
Offshore oil drilling near an Alaskan estuary has led to increased air traffic - mostly large helicopters - in the area. The U.S. Fish and Wildlife Service commissioned a study to investigate the impact these helicopters have on the flocks of Pacific Brant Geese, which inhabit the estuary in the Fall before migrating. Two large helicopters were flown repeatedly over the estuary at different altitudes and lateral distances from the flock. The flight response of the geese (recorded as "low" or "high"), altitude (meters), and lateral distance (meters) for each flight was recorded. What technique should they use?
Logistic regression
In the standard regression model, what are the assumptions about the errors?
Mean is 0. Errors are homoscedastic. Errors are normal. Errors are independent.
R2 does not offer any significant insights into how well our regression model can predict future values.
R2
In a paper presented at the 2009 IM Conference in China, a group of university finance professors examined the relationship between customer satisfaction of a product and product performance with performance measured on a 10-point scale. The researchers discovered that the linear relationship varied over different performance ranges. What technique should they use?
Piecewise linear regression
Which of the following can be used to evaluate a logistic model?
Recall. Precision. Specificity. Accuracy.
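All four metrics are computed from the confusion matrix at a chosen cutoff. A small sketch with made-up counts (the function name and the counts are hypothetical):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),    # of predicted positives, share correct
        "recall": tp / (tp + fn),       # of actual positives, share found
        "specificity": tn / (tn + fp),  # of actual negatives, share found
    }

# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Changing the probability cutoff changes the counts, which is why a 50% cutoff is not automatically the best choice.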
When designing an experiment we may improve the quality of the data (and thus the model) by...
Reducing the noise. Increasing the signal.
Assuming an alpha of 5%, what is the conclusion?
Reject the null hypothesis and accept the alternative. The engineer's belief about the mean is correct.
Step 6: Draw a conclusion.
Reject the null hypothesis and accept the alternative. The manufacturer's claims are not accurate.
In a simple regression model.. what is b1?
Slope of the regression line. Change in y for every unit change in x.
Consider this experiment. A scientist is interested in evaluating the impact of study location on two different college levels - undergraduates and graduates. He instructs one third of his students to study at home. Another third studies in the library. The last third of students study in a busy subway station. Undergraduates and graduates are divided in the same way. Afterwards he administers an exam, recording their scores. What is/are the experimental unit(s)?
Students
What is/are the factor(s)?
Study location, college level
AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.
TRUE
Which of these are assumptions made by the least-squares regression model?
The mean of the probability distribution of the errors is 0. The variability of the probability distribution of the errors is constant for all settings of the independent variable. The probability distribution of the errors is normal. The errors associated with any two different observations are independent.
If you fail to reject a null hypothesis based on a t-test...
There is no relationship between x and y. There is a linear relationship. Type II error occurred. A relationship between x and y exists, but it is not linear.
Which is possible if you fail to reject a null-hypothesis based on a t-test...
There is no relationship between x and y. There is a linear relationship. Type II error occurred. A relationship between x and y exists, but it is not linear.
What is/are the treatments(s)?
The six combinations of college level and study location
A common transformation for salary data is a log transformation.
True
Alpha is the mixing parameter that controls whether you build a LASSO or ridge regression model (or something in between).
True
An Anova F-test for a factorial experiment can be used to evaluate whether an interaction between variables occurs.
True
An Anova F-test for a randomized block design can be used to compare block means.
True
An Anova F-test for a randomized block design can be used to compare treatment means.
True
An experiment that includes all possible combinations of factors is called a complete factorial experiment.
True
Anova (Analysis of Variance) considers the between-sample variation and the within-sample variation.
True
Block randomization works by randomizing participants within blocks such that an equal number are assigned to each treatment.
True
By shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias, substantially improving the accuracy of prediction for future observations.
True
E(y) = β0 + β1x + β2x². In this second-order model, β2 controls the rate of curvature.
True
Evaluating the scatterplot of the dependent variable versus an independent variable can suggest the use of a second order term.
True
Evaluating the scatterplots of the dependent variable versus x1 for different values of x2 can help determine if an interaction term is warranted.
True
Feature engineering can be used to combine multiple independent variables into a new single independent variable.
True
Feature transformations can be based on domain knowledge or on insights from residual plots.
True
Hidden extrapolation may occur when you use a model for predictions outside the jointly defined space.
True
If an interaction is deemed important, do not conduct t-tests on the first-order terms; include them in the model.
True
If one dummy variable of a qualitative variable has a significant p-value we should keep the other dummy variables of the qualitative variable, even if they do not have a significant p-value.
True
In a main effects model with a single qualitative independent variable, β0 is equal to the average response value for the base case.
True
In a second-order model with one independent variable, the beta associated with the second order term controls the curvature of the regression line.
True
In an interaction model, the beta associated with the interaction term controls the rate of twist in the model.
True
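The "twist" can be seen by writing out the slope with respect to x1 in the interaction model E(y) = β0 + β1x1 + β2x2 + β3x1x2: the slope is β1 + β3x2, so it changes with x2. A brief sketch with hypothetical coefficient values:

```python
def slope_in_x1(b1, b3, x2):
    """Slope of E(y) with respect to x1 in the model
    E(y) = b0 + b1*x1 + b2*x2 + b3*x1*x2, at a fixed value of x2."""
    return b1 + b3 * x2

# Hypothetical coefficients, chosen only for illustration.
b1, b3 = 2.0, 0.5

print(slope_in_x1(b1, b3, x2=0))  # 2.0
print(slope_in_x1(b1, b3, x2=4))  # 4.0
```

With b3 = 0 the slope would stay at b1 for every x2, giving a flat plane with no twist. This is also why β1 alone is not the slope in an interaction model.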
LASSO is a form of continuous feature selection whereas forward selection is a form of discrete feature selection.
True
LASSO regression can be used to select features.
True
Mallows's Cp helps strike a balance with the number of predictors in the model.
True
Modeling techniques tend to overfit the data. Therefore, the validation error of an n-fold cross-validation routine is useful because it gives an unbiased estimate of the predictive power of a model.
True
Multicollinearity occurs when the independent variables x1, x2, ..., xn are correlated instead of being independent.
True
Multiplicative processes are very common in income data.
True
Ordinary least square regression models minimize the sum of the square of the errors.
True
Overfitting occurs when a model is too closely fit to a limited set of data points
True
R squared is the proportion of the variance in the dependent variable that is predictable from the independent variables
True
Residual plots can be used to detect outliers.
True
Residuals plots can be used to evaluate the assumptions of homoscedasticity.
True
Ridge regression can be used to mitigate multicollinearity.
True
Ridge regression estimates tend to be stable in the sense that they are usually little affected by small changes in the data on which the fitted regression is based. In contrast, ordinary least squares estimates may be highly unstable under certain conditions, for example when the independent variables are highly multicollinear.
True
T/F: Experimental data should be used when evaluating the impact of temperature and pressure on product defects.
True
T/F: In a simple first order model, b1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.
True
While automatic feature selection techniques like backward elimination and forward selection are useful for optimizing a metric (adjusted R-squared, AIC, etc.) a data scientist should consider other factors when removing/adding variables from a model including cost and interpretability
True
You can assess the independence of the residuals by using the Durbin-Watson test.
True
You may assess the normality of the residuals by looking at a histogram or a Q-Q (or P-P) plot.
True
A new electric car is being evaluated for the distance it can travel before the battery fails. Four engineers work independently over the course of several months to evaluate four different cars of the same model. Tweaks are made to the cars over time to get the best performance possible. Consequently, some of the early trials, while still relevant, are not as accurate as some of the newer trials. What technique should they use?
Weighted least squares regression
Suppose we want to compute 10-Fold Cross-Validation error on 100 training examples. We need to compute error X times, and the Cross-Validation error is the average of the errors. To compute each error, we need to build a model with data of size Y, and test the model on the data of size Z. What are the appropriate numbers for X, Y and Z?
X = 10; Y = 90; Z = 10
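The split sizes can be confirmed by generating the folds directly. This is a minimal sketch without shuffling, assuming k divides the number of examples evenly; the function name `kfold_splits` is illustrative.

```python
def kfold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumes k divides n_examples evenly, as in the 100-example, 10-fold case.
    """
    fold_size = n_examples // k
    indices = list(range(n_examples))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]                # held-out fold
        train = indices[:start] + indices[stop:]  # everything else
        yield train, test

splits = list(kfold_splits(100, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Each fold's error is computed on its 10 held-out examples, and the cross-validation error is the average of the 10 fold errors.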
In a complete ____ order model, a Beta indicates the change in y for every change in x
first order
In logistic regression, the dependent variable is the _________ of an event occurring.
log odds
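The log odds (logit) and the probability are two views of the same quantity. A short sketch of the conversion in both directions (function names are illustrative):

```python
from math import exp, log

def log_odds(p):
    """Logit: natural log of the odds p / (1 - p)."""
    return log(p / (1 - p))

def probability(logit):
    """Inverse logit: convert log odds back to a probability."""
    return 1 / (1 + exp(-logit))

print(log_odds(0.5))                         # 0.0
print(round(log_odds(0.8), 3))               # 1.386
print(round(probability(log_odds(0.8)), 3))  # 0.8
```

Logistic regression models the log odds as a linear function of the predictors, and the inverse logit turns the fitted value into a probability that can be compared against a cutoff.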
For a data set with p features, of which q will eventually enter the model, forward selection will test approximately how many models?
pq
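The approximation comes from counting candidates at each step: forward selection tries p models in the first step, p - 1 in the second, and so on for q steps. A sketch comparing that count with pq and with all-subsets selection (the function name and the values of p and q are illustrative):

```python
def forward_selection_models(p, q):
    """Number of models fit when q of p features enter one at a time."""
    return sum(p - step for step in range(q))

p, q = 20, 5
print(forward_selection_models(p, q))  # 90
print(p * q)                           # 100
print(2 ** p)                          # 1048576
```

The exact count is pq - q(q - 1)/2, which is close to pq when q is small relative to p. The last line shows why all-subsets selection is the most computationally expensive option.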
The standard regression model minimizes...
the sum of the square of the errors
An engineer measured the Brinell hardness of 50 pieces of ductile iron that were annealed. The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is greater than 170. What is the null hypothesis?
Ho: mu = 170
What is the alternative hypothesis?
Ha: mu > 170
What is/are the factor levels(s)?
College level: undergraduate, graduate. Study location: home, library, subway.
Step 4: What is the critical value?
z=-1.62