Quiz 2
The following figure shows the logic for creating dummy variables from a categorical predictor called "Remodel". Which class level is considered the baseline (reference)? "Recent" "None" "Remodel" "Old"
"None"
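A minimal sketch of the dummy-coding logic, assuming the three levels named in the answer options and "None" as the baseline. The column names (Remodel_Old, Remodel_Recent) and the helper dummy_code are hypothetical; the original figure is not reproduced here. The baseline is the level that gets no column of its own, so its row is all zeros.

```python
def dummy_code(value, levels, baseline):
    """Return one 0/1 indicator per non-baseline level of a categorical value."""
    return {f"Remodel_{level}": int(value == level)
            for level in levels if level != baseline}

# Levels taken from the answer options; "None" is the baseline per the answer key.
levels = ["None", "Old", "Recent"]
rows = {value: dummy_code(value, levels, baseline="None") for value in levels}

for value, indicators in rows.items():
    print(value, indicators)
```

With k levels only k - 1 dummy columns are needed, since the baseline is implied when all of them are 0.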
What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to 1 decimal. [Blank]
11.5
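As a rough illustration of the calculation, the sketch below computes an unnormalized Euclidean distance for two made-up records. The records record_a and record_b are hypothetical (the quiz's records are in a figure that is not available here), so the result is not the 11.5 above.

```python
import math

# Hypothetical records; NOT the ones from the quiz figure.
record_a = [1.0, 2.0, 3.0]
record_b = [4.0, 6.0, 3.0]

def euclidean(a, b):
    """Square root of the sum of squared differences across all variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance = round(euclidean(record_a, record_b), 1)
print(distance)
```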
The following formula shows the model that we have developed to predict the average used car Price based on its Age, Fuel Type, and quarterly Tax amount (Fuel Type = Natural Gas is the model baseline). Use the model to predict the price of a car with the following characteristics: Age = 12 Fuel Type = Diesel Tax = 230 15996 19087 24566 21445
21445
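The mechanics of this kind of prediction can be sketched as follows. All coefficients below are hypothetical placeholders, since the quiz's formula is in a figure not reproduced here, so the output will not match 21445. The point is only that the baseline level (Natural Gas) contributes nothing, while any other fuel type adds its dummy coefficient.

```python
# Hypothetical coefficients for illustration only; NOT the quiz's model.
COEFFICIENTS = {
    "intercept": 10000,
    "age": -100,
    "tax": 5,
    "fuel_diesel": 800,  # shift relative to the Natural Gas baseline
    "fuel_petrol": 500,  # hypothetical third fuel level
}

def predict_price(age, fuel_type, tax, coef=COEFFICIENTS):
    """Linear prediction: intercept + numeric terms + dummy term for fuel type."""
    price = coef["intercept"] + coef["age"] * age + coef["tax"] * tax
    # The baseline level has no dummy coefficient, so the lookup falls back to 0.
    price += coef.get(f"fuel_{fuel_type.lower()}", 0)
    return price

diesel_price = predict_price(age=12, fuel_type="Diesel", tax=230)
baseline_price = predict_price(age=12, fuel_type="Natural Gas", tax=230)
print(diesel_price, baseline_price)
```

In this sketch any unrecognized fuel type is silently treated as the baseline; real code would probably validate the level first.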
The following linear regression model is calculated to predict Boston house prices using CRIM (per capita crime rate by town), ZN (proportion of residential land zone), INDUS (proportion of industrial land zone), CHAS (Charles River dummy variable), and NOX (air pollution level). 34.7 27.7 37.1 21.7
27.7
The following chart shows the within-cluster sum of square errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering? 2 4 5 8
4
Which of the following models (the dark blue line) shows a case of underfitting? C None of the others B A
A
In the following scatter plot matrix, Price is the target variable. What predictor shows the strongest negative correlation with Price? Weight HP CC Age_08_04
Age_08_04
In the development of a linear regression model, what is the naive (baseline) model against which we compare the performance of the linear model? Random guess Simple linear model Multiple linear model Average model
Average model
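One way to see why the average model is the baseline: R2 is the fraction of the average model's squared error that the regression removes. The sketch below uses a tiny made-up data set and a hand-rolled simple least-squares fit; the data and helper names are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_simple_ols(x, y)
naive_pred = [mean(y)] * len(y)                 # average model: always predict mean(y)
linear_pred = [intercept + slope * xi for xi in x]

sse_naive = sse(y, naive_pred)
sse_linear = sse(y, linear_pred)
r2 = 1 - sse_linear / sse_naive                 # improvement over the average model
print(sse_naive, sse_linear, r2)
```

An R2 of 0 would mean the regression does no better than always predicting the mean.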
With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? Majority vote determines the predicted class Through a linear combination of neighbors Average of the neighbors Through a logistic regression between the neighbors
Average of the neighbors
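A minimal sketch of k-NN prediction for a numerical target, assuming Euclidean distance and an unweighted average. The training data and the function name knn_predict are made up for illustration.

```python
import math

def knn_predict(train, query, k):
    """train: list of (features, target). Returns the mean target of the k nearest rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical training records: (features, target).
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

pred_k3 = knn_predict(train, (2.2,), k=3)           # averages the targets 20, 30, 10
pred_kn = knn_predict(train, (2.2,), k=len(train))  # k = n: the overall average
print(pred_k3, pred_kn)
```

Setting k to the number of records returns the overall average regardless of the query, which is the same prediction as the naive average model.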
Which model is an OVER-FITTED model? B C D A
B
In the context of predictive model training, ... is a measure that shows how much the model's predictions differ from the true values. Variance F-Test Bias R2
Bias
Based on the following correlation matrix, what is potentially the weakest predictor of MEDV? TAX CHAS RM PTRATIO
CHAS
Which statement is INCORRECT about clustering? Quality of a clustering model depends on the similarity measure that is used Clustering is useful for predicting association rules Clustering has many applications in marketing, insurance, logistics, and health care businesses Clustering is an unsupervised learning method
Clustering is useful for predicting association rules
Which of the following scatterplots shows the strongest negative correlation? C A D B
D
Which of the following variable selection methods for the linear regression model examines all possible combinations of variables? Forward selection Backward elimination Exhaustive search Stepwise search
Exhaustive search
Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm. True False
False
Categorical variables can NOT be used as predictors in the linear regression model. True False
False
In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error. True False
False
In the k-nearest neighbor models, increasing the value of k leads to overfitting. True False
False
In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods. True False
False
Predictors of a multiple linear regression model can only be numeric type. True False
False
The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex data patterns. True False
False
We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering. True False
False
You are requested to use a large data set of customers to predict how many days after their first purchase they will make the second purchase. You can do this by developing a classification model. True False
False
We have trained a linear regression model and tested it on the validation set. Here are the results: On the training set: RMSE is low, R2 is high On the validation set: RMSE is low, R2 is high What type of fit is this? Under-fit Over-fit Good (optimized) fit
Good (optimized) fit
The following scatterplot matrix is created from the Boston Housing data set that shows the median price of the houses in the Boston metropolitan area. Which two variables (any two variables) show the strongest correlation? (read the column and row labels to find the variable names) CHAS & INDUS MEDV & RM INDUS & NOX MEDV & NOX
INDUS & NOX
The following report shows Excel output for a linear regression model. What can the p-value of F-statistic tell us? If this p-value is larger than our significance level then the model as a whole is significant If this p-value is less than our significance level then the model as a whole is significant If this p-value is larger than our significance level then the coefficients are significant If this p-value is less than our significance level then the coefficients are significant
If this p-value is less than our significance level then the model as a whole is significant
Which statement is INCORRECT about linear regression models? Regression models are robust against violation of some assumptions such as normality assumption Linear regression models are relatively easy to explain It is a very popular, robust, and flexible method for predicting numerical and categorical targets In some cases, it is better to transform the variables before training the model to build a better model in terms of goodness-of-fit and accuracy
It is a very popular, robust, and flexible method for predicting numerical and categorical targets
Based on the following correlation matrix, what is potentially the strongest predictor of MEDV? INDUS RM AGE LSTAT
LSTAT
Which statement is INCORRECT about k-NN predictive models? k-NN is sensitive to irrelevant features When k=n (number of data records) the k-NN and the universal average methods are the same Finding optimum value of k can be computationally expensive Larger values of k increase the risk of over-fitting
Larger values of k increase the risk of over-fitting
Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method? Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k Similar analyses can be used to inform our decision about a right value of k Sometimes business considerations impose constraints on the value of k Ability to do a useful profiling based on the cluster centroids helps us select a right value of k
Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k
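To make the role of WSS concrete, here is a small sketch that computes it for hand-assigned cluster labels. The points and labelings are hypothetical and are not produced by running k-means. Lower WSS means tighter clusters, and since WSS keeps shrinking as k grows, the Elbow method looks for the k after which the drop flattens out.

```python
def wss(points, labels):
    """Within-cluster sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum((v - c) ** 2 for p in members for v, c in zip(p, centroid))
    return total

# Hypothetical one-dimensional data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (11.0,), (12.0,), (13.0,)]

wss_k1 = wss(points, [0, 0, 0, 0, 0, 0])  # everything in one cluster
wss_k2 = wss(points, [0, 0, 0, 1, 1, 1])  # the two natural groups
wss_k3 = wss(points, [0, 0, 1, 2, 2, 2])  # one group split further
print(wss_k1, wss_k2, wss_k3)
```

Here the large drop happens from k=1 to k=2 and the gain from k=3 is small, so the elbow for this toy data is at k=2.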
We have developed two different linear regression models on the same data set. Which model shows a better goodness-of-fit? Models are the same Not enough information Model B Model A
Model A
The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT? Model B is better than Model A because it shows smaller residuals Model B is violating homoscedasticity assumption Model A is violating linearity assumption Both models have met all the assumptions of the linear regression model
Model B is violating homoscedasticity assumption
We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (dots create 3 lines) Model is violating the linearity assumption Model is violating the assumption of independence of observations Model is not violating any of the linear regression assumptions Model is violating the homoscedasticity assumption
Model is violating the assumption of independence of observations
We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (dots form a "U") Model is violating the linearity assumption Model is violating the assumption of independence of observations Model is not violating any of the linear regression assumptions Model is violating the homoscedasticity assumption
Model is violating the linearity assumption
When we are building a linear regression model, against what model do we compare it to evaluate its significance? Classification model Random model Logistic model Naïve (average) model
Naïve (average) model
The target variable in a multiple linear regression model must be a: Numerical variable Nominal Variable Ordinal variable Binary variable
Numerical variable
Which of the following is NOT a strategy to prevent model over-fitting? Adding variables to the model only if they improve the model performance and goodness-of-fit Set a limit on the value of R2 metric Splitting data into train and validation sets Penalizing the model for including more variables
Set a limit on the value of R2 metric
You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables). MEDV & LSTAT TAX & RAD MEDV & PTRATIO CHAS & NOX
TAX & RAD
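A quick way to screen for this is to compute the Pearson correlation between pairs of predictors. The sketch below uses made-up values for two predictors (not the actual Boston Housing columns), and the 0.9 cutoff is an assumed rule of thumb rather than a fixed standard.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor values; NOT the real TAX and RAD columns.
tax = [300, 400, 500, 600, 700]
rad = [2, 4, 5, 7, 8]

r = pearson(tax, rad)
THRESHOLD = 0.9  # assumed cutoff for flagging a predictor pair
print(round(r, 3), abs(r) > THRESHOLD)
```

Note that the check applies to predictor pairs only; a strong correlation between a predictor and the target (such as MEDV & LSTAT) is desirable, not a multicollinearity problem.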
Which statement is INCORRECT about the k-means clustering algorithm? Each data point is assigned to the cluster with the nearest centroid The choice of distance function is arbitrary, and the Euclidean distance function is very popular The algorithm starts with random seeds as the initial centroids The algorithm starts with initial centroids that are determined by distance function
The algorithm starts with initial centroids that are determined by distance function
What statement is correct about the k-nearest neighbor (k-NN) method? The value of k can control model over and underfitting Overfitted k-NN models can be fixed by decreasing k Underfitted k-NN models can be fixed by adding a dummy variable for accuracy Logistic regression is a special case of k-NN
The value of k can control model over and underfitting
Before computing the distance between two data records, we should normalize the numerical variables to prevent variables with large scales from having an undue effect. True False
True
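The effect of scale can be sketched with min-max normalization, which is one common choice (z-score standardization is another; the quiz does not say which is intended). The records and the helper min_max_normalize are hypothetical.

```python
import math

def min_max_normalize(records):
    """Rescale each column to [0, 1]. Assumes every column has at least two distinct values."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(r, lows, highs))
            for r in records]

# Hypothetical records: (age in years, income in dollars).
records = [(25, 50000), (45, 52000), (30, 90000)]
scaled = min_max_normalize(records)

# Raw distances are dominated by income, the large-scale variable.
raw_d01 = math.dist(records[0], records[1])
raw_d02 = math.dist(records[0], records[2])

# After scaling, age and income contribute on a comparable footing.
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])
print(raw_d01, raw_d02, scaled_d01, scaled_d02)
```

Before scaling, the 20-year age gap between the first two records barely registers next to the income differences; after scaling, the two distances are of similar size.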
In a linear regression model, the t-Test for each predictor's coefficient indicates if the estimated value is significantly different from zero. True False
True
Increasing the data size can help avoid both over- and under-fitting problems. True False
True
When a model is over-fitted the regression coefficients represent noise in the data, rather than the genuine relationships in the population. True False
True
k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or numerical targets. True False
True
In model training, ... measures how much the model's predictions fluctuate when given different input data. Bias Adjusted R2 p-value Variance
Variance
Splitting the data set into training and validation is a method to avoid overfitting. We can calculate metrics such as RMSE and R2 for the training and validation sets. Under what conditions can we tell the model is overfitted? High training, but low testing error Very high training and test errors Low training and testing errors Very low training error and high test error
Very low training error and high test error
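The train-versus-validation comparison can be sketched with a deliberately over-flexible model. The example below uses a 1-nearest-neighbor regressor as the overfit model, which is a choice made here for illustration (the quiz question itself is about linear regression); the data are made up.

```python
import math

def knn_predict(train, x, k):
    """Mean target of the k training rows whose x is closest to the query."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(data, train, k):
    """Root mean squared error of k-NN predictions on `data`, fitted on `train`."""
    squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in data]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical (x, y) pairs: noisy training data and a separate validation set.
train = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.4), (4.0, 3.8), (5.0, 5.3)]
valid = [(1.4, 1.5), (2.6, 2.5), (3.4, 3.5), (4.6, 4.5)]

train_rmse = rmse(train, train, k=1)  # each point is its own nearest neighbor
valid_rmse = rmse(valid, train, k=1)
print(train_rmse, valid_rmse)
```

With k=1 the model memorizes the training set, so the training error is zero while the validation error is clearly larger: the overfitting signature described in the answer.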
We have created a multiple linear regression model to predict car MPG (miles per gallon) values by five car features: cylinders, displacement, horsepower, weight, and acceleration. The following table shows the linear model. Based on the results, what variable's estimated coefficient is significantly different from zero? displacement cylinders horsepower acceleration
horsepower
When a model captures both the underlying patterns and random fluctuations in data, it is called... non-linear regression under-fitting optimal fitting over-fitting
over-fitting