Quiz 2

The following figure shows the logic for creating dummy variables from a categorical predictor called "Remodel". Which class level is considered the baseline (reference)? "Recent" "None" "Remodel" "Old"

"None"

What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to 1 decimal place.

11.5

The following formula shows the model that we have developed to predict the average used car Price based on its Age, Fuel Type, and quarterly Tax amount (Fuel Type = Natural Gas is the model baseline). Use the model to predict the price of a car with the following characteristics: Age = 12 Fuel Type = Diesel Tax = 230 15996 19087 24566 21445

21445

The following linear regression model is calculated to predict Boston house prices using CRIM (per capita crime rate by town), ZN (proportion of residential land zone), INDUS (proportion of industrial land zone), CHAS (Charles River dummy variable), and NOX (air pollution level). 34.7 27.7 37.1 21.7

27.7

The following chart shows the within-cluster sum of square errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering? 2 4 5 8

4

Which of the following models (the dark blue line) shows a case of underfitting? C None of the others B A

A

In the following scatter plot matrix, Price is the target variable. What predictor shows the strongest negative correlation with Price? Weight HP CC Age_08_04

Age_08_04

In the development of a linear regression model, what is the naive (baseline) model against which we compare the performance of the linear model? Random guess Simple linear model Multiple linear model Average model

Average model
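
One common way this comparison shows up is in R2, which measures a model's squared error relative to the squared error of always predicting the average. The sketch below is illustrative only; the function name `r_squared` and the sample numbers are assumptions.

```python
def r_squared(actual, predicted):
    """R2 = 1 - SSE(model) / SSE(naive average model)."""
    mean = sum(actual) / len(actual)
    sse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sse_naive = sum((a - mean) ** 2 for a in actual)
    return 1 - sse_model / sse_naive


actual = [1.0, 2.0, 3.0]                      # made-up target values
print(r_squared(actual, [2.0, 2.0, 2.0]))     # predicting the average everywhere
print(r_squared(actual, [1.0, 2.0, 3.0]))     # perfect predictions
```

A model that does no better than the average model scores 0, and a perfect model scores 1.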

With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? Majority vote determines the predicted class Through a linear combination of neighbors Average of the neighbors Through a logistic regression between the neighbors

Average of the neighbors
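
A minimal sketch of k-NN prediction for a numerical target, assuming Euclidean distance and an unweighted average. The function name `knn_predict` and the training records are made up for illustration.

```python
import math


def knn_predict(train, new_point, k):
    """Average the targets of the k training records closest to new_point.

    train is a list of (features, target) pairs, where features is a tuple.
    """
    ranked = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    neighbors = ranked[:k]
    return sum(target for _, target in neighbors) / k


# Hypothetical one-feature training set.
train = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]
print(knn_predict(train, (1.5,), k=2))   # average of the two closest targets
print(knn_predict(train, (1.5,), k=3))   # k = n: the average of all targets
```

With k equal to the number of records, every prediction collapses to the overall average of the target.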

Which model is an OVER-FITTED model? B C D A

B

In the context of predictive model training, ... is a measure that shows how much the model's predictions differ from the true values. Variance F-Test Bias R2

Bias

Based on the following correlation matrix, what is potentially the weakest predictor of MEDV? TAX CHAS RM PTRATIO

CHAS

Which statement is INCORRECT about clustering? Quality of a clustering model depends on the similarity measure that is used Clustering is useful for predicting association rules Clustering has many applications in marketing, insurance, logistics, and health care businesses Clustering is an unsupervised learning method

Clustering is useful for predicting association rules

Which of the following scatterplots shows the strongest negative correlation? C A D B

D

Which of the following variable selection methods for the linear regression model examines all possible combinations of variables? Forward selection Backward elimination Exhaustive search Stepwise search

Exhaustive search
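
To make "all possible combinations" concrete, the sketch below enumerates every non-empty subset of a predictor list. The helper name `all_subsets` is an assumption; the predictor names are borrowed from the used-car question earlier in this set.

```python
from itertools import combinations


def all_subsets(predictors):
    """Every non-empty subset of predictors, smallest subsets first."""
    return [
        combo
        for size in range(1, len(predictors) + 1)
        for combo in combinations(predictors, size)
    ]


subsets = all_subsets(["Age", "Fuel_Type", "Tax"])
for s in subsets:
    print(s)
print(len(subsets))   # 2**3 - 1 candidate models
```

With p predictors there are 2^p - 1 candidate models, which is why exhaustive search becomes expensive as p grows.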

Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm. True False

False

Categorical variables can NOT be used as predictors in the linear regression model. True False

False

In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error. True False

False

In the k-nearest neighbor models, increasing the value of k leads to overfitting. True False

False

In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods. True False

False

Predictors of a multiple linear regression model can only be numeric type. True False

False

The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex data patterns. True False

False

We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering. True False

False

You are requested to use a large data set of customers to predict how many days after their first purchase they will make the second purchase. You can do this by developing a classification model. True False

False

We have trained a linear regression model and tested it on the validation set. Here are the results: On the training set: RMSE is low, R2 is high On the validation set: RMSE is low, R2 is high What type of fit is this? Under-fit Over-fit Good (optimized) fit

Good (optimized) fit

The following scatterplot matrix is created from the Boston Housing data set that shows the median price of the houses in the Boston metropolitan area. Which two variables (any two variables) show the strongest correlation? (read the column and row labels to find the variable names) CHAS & INDUS MEDV & RM INDUS & NOX MEDV & NOX

INDUS & NOX

The following report shows Excel output for a linear regression model. What can the p-value of F-statistic tell us? If this p-value is larger than our significance level then the model as a whole is significant If this p-value is less than our significance level then the model as a whole is significant If this p-value is larger than our significance level then the coefficients are significant If this p-value is less than our significance level then the coefficients are significant

If this p-value is less than our significance level then the model as a whole is significant

Which statement is INCORRECT about linear regression models? Regression models are robust against violation of some assumptions such as normality assumption Linear regression models are relatively easy to explain It is a very popular, robust, and flexible method for predicting numerical and categorical targets In some cases, it is better to transform the variables before training the model to build a better model in terms of goodness-of-fit and accuracy

It is a very popular, robust, and flexible method for predicting numerical and categorical targets

Based on the following correlation matrix, what is potentially the strongest predictor of MEDV? INDUS RM AGE LSTAT

LSTAT

Which statement is INCORRECT about k-NN predictive models? k-NN is sensitive to irrelevant features When k=n (number of data records) the k-NN and the universal average methods are the same Finding optimum value of k can be computationally expensive Larger values of k increase the risk of over-fitting

Larger values of k increase the risk of over-fitting

Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method? Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k Similar analyses can be used to inform our decision about a right value of k Sometimes business considerations impose constraints on the value of k Ability to do a useful profiling based on the cluster centroids helps us select a right value of k

Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k

We have developed two different linear regression models on the same data set. Which model shows a better goodness-of-fit? Models are the same Not enough information Model B Model A

Model A

The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT? Model B is better than Model A because it shows smaller residuals Model B is violating homoscedasticity assumption Model A is violating linearity assumption Both models have met all the assumptions of the linear regression model

Model B is violating homoscedasticity assumption

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (dots create 3 lines) Model is violating the linearity assumption Model is violating the assumption of independence of observations Model is not violating any of the linear regression assumptions Model is violating the homoscedasticity assumption

Model is violating the assumption of independence of observations

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (dots form a "U") Model is violating the linearity assumption Model is violating the assumption of independence of observations Model is not violating any of the linear regression assumptions Model is violating the homoscedasticity assumption

Model is violating the linearity assumption

When we are building a linear regression model, against what model do we compare it to evaluate its significance? Classification model Random model Logistic model Naïve (average) model

Naïve (average) model

The target variable in a multiple linear regression model must be a: Numerical variable Nominal Variable Ordinal variable Binary variable

Numerical variable

Which of the following is NOT a strategy to prevent model over-fitting? Adding variables to the model only if they improve the model performance and goodness-of-fit Set a limit on the value of R2 metric Splitting data into train and validation sets Penalizing the model for including more variables

Set a limit on the value of R2 metric

You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables). MEDV & LSTAT TAX & RAD MEDV & PTRATIO CHAS & NOX

TAX & RAD
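
A correlation matrix entry is a Pearson correlation coefficient, which can be sketched as below. The function name `pearson` and the two predictor columns are made up for illustration and are not the Boston Housing values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Two hypothetical predictors that move together almost linearly.
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson(predictor_a, predictor_b))
```

A coefficient close to +1 or -1 between two predictors (not between a predictor and the target) is the warning sign for multicollinearity.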

Which statement is INCORRECT about the k-means clustering algorithm? Each data point is assigned to the cluster with the nearest centroid The choice of distance function is arbitrary, and the Euclidean distance function is very popular The algorithm starts with random seeds as the initial centroids The algorithm starts with initial centroids that are determined by distance function

The algorithm starts with initial centroids that are determined by distance function
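
The steps named in this question can be sketched as follows: k is chosen up front, random records serve as the initial centroids, and each point is assigned to its nearest centroid before the centroids are recomputed. This is a simplified sketch under those assumptions, not a production implementation; the function name `kmeans`, its parameters, and the sample points are all made up.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random seeds; k is fixed in advance
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:       # no centroid moved: converged
            break
        centroids = new_centroids
    wss = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, wss


# Two well-separated hypothetical groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids_1, wss_1 = kmeans(points, k=1)
centroids_2, wss_2 = kmeans(points, k=2)
print(wss_1, wss_2)
```

Running this for several values of k and plotting the returned WSS against k gives the kind of curve used by the Elbow method.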

What statement is correct about the k-nearest neighbor (k-NN) method? The value of k can control model over and underfitting Overfitted k-NN models can be fixed by decreasing k Underfitted k-NN models can be fixed by adding a dummy variable for accuracy Logistic regression is a special case of k-NN

The value of k can control model over and underfitting

Before computing the distance between two data records, we should normalize the numerical variables to prevent variables with large scales from having an undue effect. True False

True
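
A small sketch of why normalization matters, assuming min-max scaling (z-scores are another common choice). The helper name `min_max_scale` and the (age, income) records are made up for illustration.

```python
import math


def min_max_scale(rows):
    """Rescale every column of rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero range; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


# Hypothetical (age, income) records: income is on a much larger scale.
rows = [(25, 40000), (60, 41000), (26, 90000)]
scaled = min_max_scale(rows)

print(math.dist(rows[0], rows[1]), math.dist(rows[0], rows[2]))          # raw
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))  # scaled
```

On the raw data the income column dominates the distance; after scaling, a large age gap counts about as much as a large income gap.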

In a linear regression model, the t-Test for each predictor's coefficient indicates if the estimated value is significantly different from zero. True False

True

Increasing the data size can help avoid both over- and under-fitting problems. True False

True

When a model is over-fitted the regression coefficients represent noise in the data, rather than the genuine relationships in the population. True False

True

k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or numerical targets. True False

True

In model training, ... measures how much the model's predictions fluctuate when given different input data. Bias Adjusted R2 p-value Variance

Variance

Splitting the data set into training and validation sets is a method to avoid overfitting. We can calculate metrics such as RMSE and R2 for the training and validation sets. Under what conditions can we tell that the model is overfitted? High training, but low testing error Very high training and test errors Low training and testing errors Very low training error and high test error

Very low training error and high test error
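
The pattern can be illustrated with a deliberately over-flexible model. The sketch below uses a 1-nearest-neighbor predictor, which memorizes the training set; that model choice, the helper names `rmse` and `one_nn`, and the data are all assumptions made for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def one_nn(train_x, train_y, x):
    """Predict with the target of the single closest training record."""
    closest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[closest]


# Made-up data: the target is roughly y = x, with noise in the training set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [1.2, 1.8, 3.3, 3.7]
valid_x = [1.4, 2.6, 3.4]
valid_y = [1.4, 2.6, 3.4]

train_rmse = rmse(train_y, [one_nn(train_x, train_y, x) for x in train_x])
valid_rmse = rmse(valid_y, [one_nn(train_x, train_y, x) for x in valid_x])
print(train_rmse, valid_rmse)
```

The training error is zero because each training record is its own nearest neighbor, while the validation error is clearly larger; that gap is the overfitting signal.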

We have created a multiple linear regression model to predict car MPG (miles per gallon) values from five car features: cylinders, displacement, horsepower, weight, and acceleration. The following table shows the linear model. Based on the results, what variable's estimated coefficient is significantly different from zero? displacement cylinders horsepower acceleration

horsepower

When a model captures both the underlying patterns and random fluctuations in data, it is called... non-linear regression under-fitting optimal fitting over-fitting

over-fitting

