DSC 423 Final Practice

What assumptions do we make about residuals.... (mark all that apply)

They have a mean of 0. They have a constant variance. They are normally distributed. They are independent.

The quarterly single-family housing starts (in thousands of dwellings) in the United States from 2004 through 2014 have been recorded. Researchers want to predict the housing starts for the first two quarters of 2015. What technique should they use?

Time series using regression

T/F: The regression line is chosen because it minimizes the sum of the squares of the errors.

True

The difference between glmnet and cv.glmnet is that the latter performs cross validation.

True

The interaction term is responsible for creating a "twist" in the regression plane.

True

We can use our domain knowledge to infer that a second-order term might be warranted.

True

When an interaction term is deemed important, include the first-order terms in the model regardless of their p-values.

True

When creating a pth order model, the number of levels for a variable must be greater than or equal to (p+1).

True

When evaluating a residual plot we should look for changes in variability.

True

When evaluating a residual plot we should look for trends.

True

When we keep a second order term in a model, we always keep the first order term.

True

When performing n-fold cross validation, n models are generated. Generally, it is best to select the best (highest adjusted R squared) of these n models as the final model.

False

When transforming variables, one should consider transforming the independent variables but not the dependent variable.

False

Assume the number of miles a car was driven last year is normally distributed with a mean of 9,000 and a standard deviation of 3,500. If we define "unusual" as those in the bottom 2.5% or top 2.5%, find the interval for the "usual" number of miles driven per year.

(2140, 15860)
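
The interval can be checked numerically. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the original card.

```python
from statistics import NormalDist

# Miles driven last year: normal with mean 9,000 and standard deviation 3,500.
miles = NormalDist(mu=9000, sigma=3500)

# "Usual" excludes the bottom 2.5% and the top 2.5%,
# so the bounds are the 2.5th and 97.5th percentiles.
low = miles.inv_cdf(0.025)
high = miles.inv_cdf(0.975)

print(round(low), round(high))  # 2140 15860
```

Both bounds sit about 1.96 standard deviations from the mean, which is where the 6,860-mile half-width comes from.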

Step 4: What is z?

-1.76

Step 4: What is z?

-2.14

Step 4: What is z?

-2.46

When you look up 1.73, what is the p-value?

.0418 (upper-tail area; a z-table gives 0.9582 for z = 1.73)

Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.

.7%

Step 3: What is the declared alpha?

1%

Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.

1.6%

If the sample mean is 172.52 with a standard deviation of 10.31, what is z?

1.73, from z = (xbar - mu0) / (s / sqrt(n)) = (172.52 - 170) / (10.31 / sqrt(50))
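
The arithmetic behind this z value can be sketched as follows. The sketch assumes the card refers to the Brinell hardness problem elsewhere in this set (50 pieces, hypothesized mean of 170), so the standard deviation is divided by the square root of the sample size; the variable names are illustrative.

```python
from math import sqrt

n = 50         # sample size (assumed from the Brinell hardness problem)
mu0 = 170      # hypothesized mean under the null hypothesis
xbar = 172.52  # sample mean
s = 10.31      # sample standard deviation

# Test statistic for a sample mean: distance from mu0 in standard-error units.
standard_error = s / sqrt(n)
z = (xbar - mu0) / standard_error

print(round(z, 2))  # 1.73
```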

A VIF value of ________ says that the other predictors "explain" 90% of the variation in predictor j and suggests that there may be multicollinearity in the dataset.

10
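
The link between the 90% figure and the VIF comes from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the other predictors. A minimal sketch (the function name is hypothetical):

```python
def vif(r_squared_j: float) -> float:
    """Variance inflation factor for predictor j.

    r_squared_j is the R-squared obtained by regressing predictor j
    on all of the other predictors.
    """
    return 1.0 / (1.0 - r_squared_j)

print(round(vif(0.90), 6))  # 10.0
```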

Assume the number of songs stored on college students' smart phones are normally distributed with a mean of 500 and a standard deviation of 150. What is the maximum number of songs a student can have on their smart phone and still be considered in the bottom 1%?

151

Given x is a random variable from a normal population with the indicated mean and standard deviation, estimate the probability using the 68-95-99.7 Empirical Rule. P(x < 13 | mean = 15, std = 2)

16%

Given x is a random variable from a normal population with the indicated mean and standard deviation, estimate the probability using the 68-95-99.7 Empirical Rule. P(x < 42 | mean = 50, std = 4)

2.5%

Given z is a random variable from a standard normal population, estimate the probability using the 68-95-99.7 Empirical Rule. P(z < -2)

2.5%

Assume the number of miles a car was driven last year is normally distributed with a mean of 9,000 and a standard deviation of 3,500. What percent of cars were driven over 15,000 miles last year?

4.32%
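
A quick numerical check of this upper-tail probability, again assuming the standard-library `statistics.NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

miles = NormalDist(mu=9000, sigma=3500)

# P(X > 15000) is one minus the cumulative probability at 15,000.
share_over_15000 = 1 - miles.cdf(15000)

print(f"{share_over_15000:.2%}")  # 4.32%
```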

Assume you are building an interaction model with two qualitative variables. The first variable has 3 levels. The second variable has 4 levels. In total, how many dummy variables do you need and how many interaction terms do you need?

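5 dummy variables and 6 interaction terms. The two variables need 2 and 3 dummies respectively (levels minus one), and interaction terms are formed only between dummies of different variables (2 * 3 = 6).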
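
The counting rule can be written as a small helper. This is an illustrative sketch only; the function name is hypothetical.

```python
from typing import Tuple

def dummy_and_interaction_counts(levels_a: int, levels_b: int) -> Tuple[int, int]:
    """Count dummy variables and interaction terms for two qualitative variables."""
    dummies_a = levels_a - 1  # one level serves as the base case
    dummies_b = levels_b - 1
    # Interactions pair each dummy of one variable with each dummy of the other;
    # dummies of the same variable are never interacted with each other.
    return dummies_a + dummies_b, dummies_a * dummies_b

print(dummy_and_interaction_counts(3, 4))  # (5, 6)
```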

Step 3: What is the declared alpha?

5% (default)

Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.

5.26%

Step 5: What is the p-value? Remember to consider whether it is a one-tailed or two-tailed test.

7.7%

Given z is a random variable from a standard normal population, estimate the probability using the 68-95-99.7 Empirical Rule. P(z <= 1)

84%

Assume the number of songs stored on college students' smart phones are normally distributed with a mean of 500 and a standard deviation of 150. What percent of students' phones have less than 300 songs?

9.12%

Which feature selection method is the most computationally expensive?

All Subsets Selection

Which feature selection method will guarantee the best model based on the selected metric?

All Subsets Selection

In a first-order model with a single independent variable, the slope of the regression line is captured by:

B1

____ starts with the full model with k variables. It then removes variables one at a time, recording R-squared (or some other metric). It retains the best (k-1)-variable model and repeats until there is no improvement in R-squared.

Backward elimination

In a first-order model with a single independent variable, the y-intercept of the regression line is captured by:

B0

Supervised learning is used to draw inferences from data sets consisting of input data without labeled responses.

False. That describes unsupervised learning. Examples of supervised learning: regression and classification techniques such as decision trees. Examples of unsupervised learning: clustering and associations.

Multicollinearity can impact a regression model in several ways, including... (mark all that apply)

Creating rounding errors in betas. Reversing signs of betas. Muddying t-tests. Confounding beta estimates.

The tuning parameter, lambda, for ridge and LASSO regression is commonly set to 0.1 in ridge regression and 0.25 in LASSO regression.

False

When building a model relating E(y) to a qualitative independent variable with n levels you will need n dummy variables

False

When building a regression model, a qualitative variable with n levels would require n dummy variables.

False

What is/are the response variable(s)?

Exam scores

The general form of a probabilistic model includes

Expected value, error, dependent variable

Step 6: Draw a conclusion.

Fail to reject the null hypothesis. More data is needed.

Step 6: Draw a conclusion.

Fail to reject the null hypothesis. The manufacturer's claims may or may not be accurate. More data is needed.

A correlation table can be used to detect multicollinearity even if it is a one-to-many relationship.

False

AIC compares the precision and bias of the full model to models with a subset of the predictors.

False

Backward selection begins with an empty model.

False

Because ridge regression is more likely to produce a parsimonious model, it is often preferred over LASSO regression.

False

Consider a qualitative variable with three levels (tall, average, short). Dummy variables are used to model the levels in a regression model. We should consider the interaction between the dummy variables for "average" and "short".

False

Due to the nature of logistic regression, multicollinearity is not a problem when building models to predict a binary variable.

False

Everything else being equal a one-at-a-time treatment is preferred over an all-possible treatment.

False

Homoscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.

False

If a qualitative variable has 5 levels, it will require 5 dummy variables to model all the levels.

False

In a second-order model with one independent variable, if the beta associated with the second order term is negative the regression line will curve upwards.

False

In a second-order model with two independent variables, if both betas associated with the second order terms are negative the response surface will be saddle shaped.

False

In a second-order model, β1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.

False

In an interaction model, β1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.

False

In an observational study, the researcher actively changes some characteristics of the units before the data are collected.

False

In the model above, if β2 is negative the regression line will curve upward.

False

Leave-one out cross validation (LOOCV) - also called jackknifing -- generally gives less accurate estimates of true test error than 10-fold cross validation.

False

One of the primary benefits of second-order models is their affinity for extrapolation.

False

Rejecting the F-test means that you accept the alternative hypothesis: all betas are not equal to 0.

False

T/F: Experimental data should be used when evaluating the impact of cocaine on birth defects.

False

T/F: In the appliance store example, for every unit change in sales revenue, we expect a change of 0.70 in advertising expenditure.

False

T/F: Regression is an example of unsupervised learning.

False

T/F: The difference between R2 and R2-adjusted is that R2 represents the coefficient of determination between two variables while R2-adjusted represents the coefficient of determination of a model.

False

The best cutoff value to use is 50%.

False

The null hypothesis of a basic F-test is that all beta parameters are equal to one.

False

The penalty for ridge regression is the L1-norm of the betas.

False

The size of a beta is a good indication of the importance of a variable in a regression model.

False

Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)

Ha: (mu1 - mu2) < 0

Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)

Ha: mu < 14

Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)

Ha: mu < 75000

Step 2: What is the alternative hypothesis? (Take an adversarial position. Are you interested in a one-sided or two-sided test?)

Ha: mu =/= .75

The United States and Japan often engage in intense trade negotiations. U.S. officials claim that Japanese manufacturers price their goods higher in Japan than in the United States, in effect subsidizing the low prices in the United States with extremely high prices in Japan. According to the U.S. argument, Japanese manufacturers accomplish this by preventing U.S. goods from reaching the market. An economist decides to test the hypothesis that higher retail prices are being charged for automobiles in Japan than in the United States. She obtains independent samples from 50 retail sales in the United States and 50 sales in Japan over the same time. She found the sample average of the U.S. sales to be 26,596 and the sample average of the Japanese sales to be 27,236. The standard deviations were 1,981 and 1,974 respectively. Using an alpha of 5%, conduct a hypothesis test. Step 1: What is the null hypothesis?

Ho: (mu1 - mu2) = 0
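
The remaining steps of this two-sample test can be sketched numerically. The sketch assumes mu1 is the U.S. mean and mu2 the Japanese mean, uses the large-sample z statistic, and rounds z to two decimals before the lookup to mimic a printed z-table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Sample size, sample mean, and sample standard deviation for each country.
n_us, mean_us, sd_us = 50, 26596, 1981
n_jp, mean_jp, sd_jp = 50, 27236, 1974

# Standard error of the difference between two independent sample means.
se = sqrt(sd_us**2 / n_us + sd_jp**2 / n_jp)
z = (mean_us - mean_jp) / se

# Lower-tail p-value for Ha: (mu1 - mu2) < 0, with z rounded as in a z-table.
p_value = NormalDist().cdf(round(z, 2))

alpha = 0.05
reject_null = p_value < alpha

print(round(z, 2), f"{p_value:.2%}", reject_null)  # -1.62 5.26% False
```

Because the p-value is just above the 5% alpha, the null hypothesis is not rejected.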

A tire manufacturer claims that the mean life of its tire is 75,000 miles. A sample of 50 tires finds xbar = 74,200 and s = 2300. Test the manufacturer's claim. Step 1: What is the null hypothesis?

Ho: mu = 75000

A weight loss center claims its participants have a mean loss of 14 pounds in 14 days. A sample of 100 participants finds x-bar = 13.1 pounds and s = 4.2 pounds. You suspect they are lying! Test the weight loss center's claim using an alpha of 1%. Step 1: What is the null hypothesis?

Ho: mu= 14

A supplier claims the mean thickness of its washers is 0.75 inches. Washers that are too big or too small can cause problems in the machines. A sample of 50 finds xbar = 0.73 inches and s = 0.08 inches. Test the supplier's claim. Step 1: What is the null hypothesis?

Ho: mu=.75
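
The three one-sample problems above (tires, weight loss, washers) follow the same steps, so one helper covers them all. This is a sketch under stated assumptions: a large-sample z test, exact (unrounded) z values for the p-values, and hypothetical function and variable names.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(xbar: float, mu0: float, s: float, n: int) -> float:
    """Test statistic for a sample mean: (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

std_normal = NormalDist()

# Tires: Ha: mu < 75000, so the p-value is the lower-tail area.
z_tire = one_sample_z(74200, 75000, 2300, 50)
p_tire = std_normal.cdf(z_tire)

# Weight loss: Ha: mu < 14, also a lower-tail test.
z_weight = one_sample_z(13.1, 14, 4.2, 100)
p_weight = std_normal.cdf(z_weight)

# Washers: Ha: mu != 0.75, so both tails count.
z_washer = one_sample_z(0.73, 0.75, 0.08, 50)
p_washer = 2 * std_normal.cdf(-abs(z_washer))

print(round(z_tire, 2), round(z_weight, 2))  # -2.46 -2.14
```

The p-values come out near 0.7%, 1.6%, and 7.7%. The unrounded washer statistic is about -1.768, which a card may show truncated to -1.76. With the 1% alpha stated for the weight loss problem and the 5% default for the other two, only the tire claim is rejected.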

A pharmaceutical company has developed a new drug designed to reduce a smoker's reliance on tobacco. Since certain dosages of the drug may reduce one's pulse rate to dangerously low levels, the product-testing division of the pharmaceutical company wants to model the relationship between dosage x and decrease in pulse rate y. An important question with regards to this experiment is "What is the maximum dosage that could safely be administered?" What technique should be used?

Inverse prediction

We will be looking at feature selection in the next lecture. In general, you should select a model that ....

Is simple. Is parsimonious. Has few features. Has high accuracy.

The regression line was chosen because....

It minimized the sum of the square of the errors (SSE).
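
To make the SSE criterion concrete, here is a minimal sketch of the closed-form least-squares fit for one predictor. The five data points are made up for illustration and do not come from the cards; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors of the line b0 + b1 * x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]   # illustrative data only
ys = [1, 1, 2, 2, 4]

b0, b1 = fit_line(xs, ys)
print(round(b0, 3), round(b1, 3))  # -0.1 0.7
```

Nudging either coefficient away from the fitted values, for example `sse(xs, ys, b0, b1 + 0.05)`, gives a larger SSE than `sse(xs, ys, b0, b1)`.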

Offshore oil drilling near an Alaskan estuary has led to increased air traffic - mostly large helicopters - in the area. The U.S. Fish and Wildlife Service commissioned a study to investigate the impact these helicopters have on the flocks of Pacific Brant Geese, which inhabit the estuary in the Fall before migrating. Two large helicopters were flown repeatedly over the estuary at different altitudes and lateral distances from the flock. The flight response of the geese (recorded as "low" or "high"), altitude (meters), and lateral distance (meters) for each flight was recorded. What technique should they use?

Logistic regression

In the standard regression model, what are the assumptions about the errors?

Mean is 0. Errors are homoscedastic. Errors are normal. Errors are independent.

____ does not offer any significant insights into how well our regression model can predict future values.

R2

In a paper presented at the 2009 IM Conference in China, a group of university finance professors examined the relationship between customer satisfaction of a product and product performance with performance measured on a 10-point scale. The researchers discovered that the linear relationship varied over different performance ranges. What technique should they use?

Piecewise linear regression

Which of the following can be used to evaluate a logistic model?

Recall. Precision. Specificity. Accuracy.
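
All four metrics come from the confusion matrix of predicted versus actual classes. A minimal sketch with made-up counts (the function name and the counts are hypothetical):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common metrics from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),         # share of actual positives found
        "precision": tp / (tp + fp),      # share of predicted positives that are right
        "specificity": tn / (tn + fp),    # share of actual negatives found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Changing the classification cutoff moves counts between these cells, which is why no single cutoff is best for every problem.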

When designing an experiment we may improve the quality of the data (and thus the model) by...

Reducing the noise. Increasing the signal.

Assuming an alpha of 5%, what is the conclusion?

Reject the null hypothesis and accept alternative. The engineer's belief about the mean is correct.

Step 6: Draw a conclusion.

Reject the null hypothesis and accept the alternative. The manufacturer's claims are not accurate.

In a simple regression model.. what is b1?

Slope of the regression line. Change in y for every unit change in x.

Consider this experiment. A scientist is interested in evaluating the impact of study location on two different college levels - undergraduates and graduates. He instructs one third of his students to study at home. Another third studies in the library. The last third of students study in a busy subway station. Undergraduates and graduates are divided in the same way. Afterwards he administers an exam, recording their scores. What is/are the experimental unit(s)?

Students

What is/are the factor(s)?

Study location, college level

AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.

TRUE

Which of these are assumptions made by the least-squares regression model?

The mean of the probability distribution of the errors is 0. The variance of the probability distribution of the errors is constant for all settings of the independent variable. The probability distribution of the errors is normal. The errors associated with any two different observations are independent.

If you fail to reject a null hypothesis based on a t-test...

There is no relationship between x and y. There is a linear relationship. Type II error occurred. A relationship between x and y exists, but it is not linear.

Which is possible if you fail to reject a null-hypothesis based on a t-test...

There is no relationship between x and y. There is a linear relationship. Type II error occurred. A relationship between x and y exists, but it is not linear.

What is/are the treatment(s)?

The six combinations of college level and study location

A common transformation for salary data is a log transformation.

True

Alpha is the mixing parameter that controls whether you build a LASSO or ridge regression model (or something in between).

True
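
The role of alpha can be seen in the penalty itself. The sketch below follows the elastic-net penalty form documented for glmnet, lambda * [(1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|)]; the function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(betas, lam: float, alpha: float) -> float:
    """Elastic-net penalty on the slope coefficients (intercept excluded)."""
    l1 = sum(abs(b) for b in betas)       # LASSO part: L1 norm
    l2_squared = sum(b * b for b in betas)  # ridge part: squared L2 norm
    return lam * ((1 - alpha) / 2 * l2_squared + alpha * l1)

betas = [3.0, -4.0]  # illustrative coefficients

print(elastic_net_penalty(betas, lam=1.0, alpha=1.0))  # 7.0  (pure LASSO)
print(elastic_net_penalty(betas, lam=1.0, alpha=0.0))  # 12.5 (pure ridge)
print(elastic_net_penalty(betas, lam=1.0, alpha=0.5))  # 9.75 (a mix)
```

Lambda scales the whole penalty and is normally chosen by cross validation rather than fixed in advance.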

An Anova F-test for a factorial experiment can be used to evaluate whether an interaction between variables occurs.

True

An Anova F-test for a randomized block design can be used to compare block means.

True

An Anova F-test for a randomized block design can be used to compare treatment means.

True

An experiment that includes all possible combinations of factors is called a complete factorial experiment.

True

Anova (Analysis of Variance) considers the between-sample variation and the within-sample variation.

True

Block randomization works by randomizing participants within blocks such that an equal number are assigned to each treatment.

True

By shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias, substantially improving the accuracy of prediction for future observations.

True

E(y) = β0 + β1 x + β2 x^2. In this second-order model, β2 controls the rate of curvature.

True

Evaluating the scatterplot of the dependent variable versus an independent variable can suggest the use of a second order term.

True

Evaluating the scatterplots of the dependent variable versus x1 for different values of x2 can help determine if an interaction term is warranted.

True

Feature engineering can be used to combine multiple independent variables into a new single independent variable.

True

Feature transformations can be based on domain knowledge or on insights from residual plots.

True

Hidden extrapolation may occur when you use a model for predictions outside the jointly defined space.

True

If an interaction is deemed important, do not conduct t-tests on the first-order terms; include them in the model.

True

If one dummy variable of a qualitative variable has a significant p-value we should keep the other dummy variables of the qualitative variable, even if they do not have a significant p-value.

True

In a main effects model with a single qualitative independent variable, β0 is equal to the average response value for the base case.

True

In a second-order model with one independent variable, the beta associated with the second order term controls the curvature of the regression line.

True

In an interaction model, the beta associated with the interaction term controls the rate of twist in the model.

True

LASSO is a form of continuous feature selection whereas forward selection is a form of discrete feature selection.

True

LASSO regression can be used to select features.

True

Mallows's Cp helps strike a balance with the number of predictors in the model.

True

Modeling techniques tend to overfit the data. Therefore, the validation error of an n-fold cross-validation routine is useful because it gives an unbiased estimate of the predictive power of a model.

True

Multicollinearity occurs when the independent variables x1, x2, ..., xn are correlated instead of being independent.

True

Multiplicative processes are very common in income data.

True

Ordinary least square regression models minimize the sum of the square of the errors.

True

Overfitting occurs when a model is too closely fit to a limited set of data points

True

R squared is the proportion of the variance in the dependent variable that is predictable from the independent variables

True

Residual plots can be used to detect outliers.

True

Residual plots can be used to evaluate the assumption of homoscedasticity.

True

Ridge regression can be used to mitigate multicollinearity.

True

Ridge regression estimates tend to be stable in the sense that they are usually little affected by small changes in the data on which the fitted regression is based. In contrast, ordinary least squares estimates may be highly unstable under certain conditions, for example when the independent variables are highly multicollinear.

True

T/F: Experimental data should be used when evaluating the impact of temperature and pressure on product defects.

True

T/F: In a simple first order model, b1 represents the slope of the regression line relating y to x1 when all other x's are held fixed.

True

While automatic feature selection techniques like backward elimination and forward selection are useful for optimizing a metric (adjusted R-squared, AIC, etc.) a data scientist should consider other factors when removing/adding variables from a model including cost and interpretability

True

You can assess the independence of the residuals by using the Durbin-Watson test.

True

You may assess the normality of the residuals by looking at a histogram or a Q-Q (or P-P) plot.

True

A new electric car is being evaluated for the distance it can travel before the battery fails. Four engineers work independently over the course of several months to evaluate four different cars of the same model. Tweaks are made to the cars over time to get the best performance possible. Consequently, some of the early trials, while still relevant, are not as accurate as some of the newer trials. What technique should they use?

Weighted least squares regression

Suppose we want to compute 10-Fold Cross-Validation error on 100 training examples. We need to compute error X times, and the Cross-Validation error is the average of the errors. To compute each error, we need to build a model with data of size Y, and test the model on the data of size Z. What are the appropriate numbers for X, Y and Z?

X = 10; Y = 90; Z = 10
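
The three numbers fall out of how the folds are built. A minimal sketch that splits 100 example indices into 10 folds; the function name is hypothetical and the folds are formed by simple striding rather than random shuffling.

```python
def k_fold_indices(n_samples: int, k: int):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    # Fold i holds indices i, i + k, i + 2k, ...
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # Training data is everything outside the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(100, 10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Each split yields one error estimate, and the cross-validation error is the average of those 10 estimates.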

In a complete ____ order model, a beta indicates the change in y for every unit change in x.

first order

In logistic regression, the dependent variable is the _________ of an event occurring.

log odds
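
The log odds (logit) and its inverse can be sketched in a few lines. The function names are illustrative.

```python
from math import exp, log

def log_odds(p: float) -> float:
    """Logit: the natural log of the odds p / (1 - p)."""
    return log(p / (1 - p))

def probability(logit: float) -> float:
    """Inverse logit: convert log odds back to a probability."""
    return 1 / (1 + exp(-logit))

print(log_odds(0.5))            # 0.0 (even odds)
print(round(log_odds(0.8), 3))  # 1.386 (odds of 4 to 1)
```

A logistic regression models the log odds as a linear function of the predictors, and `probability` maps a fitted value back to the 0 to 1 scale.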

For a data set with p features, of which q will eventually enter the model, forward selection will test approximately how many models?

pq
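
The pq approximation can be compared with an exact count. The sketch assumes forward selection adds one variable per step and, at each step, fits one candidate model per remaining feature; the function names and the example values of p and q are hypothetical.

```python
def forward_selection_fits(p: int, q: int) -> int:
    """Models fitted while adding q of p features, one feature per step."""
    # Step 0 tries p features, step 1 tries p - 1, and so on for q steps.
    return sum(p - step for step in range(q))

def all_subsets_fits(p: int) -> int:
    """Models fitted by all-subsets selection (every non-empty subset)."""
    return 2 ** p - 1

print(forward_selection_fits(20, 5), all_subsets_fits(20))  # 90 1048575
```

With p = 20 and q = 5 the exact count of 90 is close to pq = 100, while all-subsets selection needs over a million fits.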

The standard regression model minimizes...

the sum of the square of the errors

An engineer measured the Brinell hardness of 50 pieces of ductile iron that were annealed. The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is greater than 170. What is the null hypothesis?

mu = 170

What is the alternative hypothesis?

mu > 170

What is/are the factor level(s)?

College level: undergraduate, graduate. Study location: home, library, subway.

Step 4: What is the critical value?

z=-1.62

