DSCI 4520 Exam 1 section 2
The following figure shows the logic for creating dummy variables from a categorical predictor called "Remodel". Which class level is considered the baseline (reference)? "None" "Recent" "Old" "Remodel"
"None"
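A minimal sketch of dummy coding with a reference level, since the figure is not reproduced here. The column names (`Remodel_Recent`, `Remodel_Old`) and the helper `dummy_code` are assumptions made for illustration. The baseline level gets no column of its own and is represented by zeros in every dummy column.

```python
def dummy_code(value, levels, baseline):
    """Return 0/1 indicator columns for every level except the baseline."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {f"Remodel_{lvl}": int(value == lvl) for lvl in levels if lvl != baseline}

levels = ["None", "Recent", "Old"]
rows = ["None", "Recent", "Old"]

# "None" is the baseline, so it is encoded as all zeros.
encoded = [dummy_code(v, levels, baseline="None") for v in rows]
for value, columns in zip(rows, encoded):
    print(value, columns)
```

With three levels, two dummy columns are enough; a third column would be redundant with the other two.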
What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to one decimal place. Euclidean distance formula: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
11.5
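The two records from the question are not reproduced here, so the sketch below uses made-up records purely to show the calculation: the square root of the sum of squared differences across variables. The values and the helper name `euclidean_distance` are assumptions.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two records."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records (not the ones from the exam figure).
record_1 = [3.0, 4.0]
record_2 = [0.0, 0.0]

distance = euclidean_distance(record_1, record_2)
print(round(distance, 1))  # 5.0
```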
The following formula shows the model that we have developed to predict the average used car Price based on its Age, Fuel Type, and quarterly Tax amount (Fuel Type = Natural Gas is the model baseline). Use the model to predict the price of a car with the following characteristics: Age = 12 Fuel Type = Diesel Tax = 230 15996 21445 19087 24566
21445
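The model formula from the figure is not reproduced here, so the following is only a sketch of how such a prediction is computed. Every coefficient below is invented for illustration and will not reproduce the exam answer, and the "Petrol" level is an assumption. The point is that the baseline level (Natural Gas) contributes no dummy term, while any other fuel type adds its own coefficient.

```python
# Hypothetical coefficients for illustration only (not the exam's model).
INTERCEPT = 10000
COEF_AGE = -100
COEF_TAX = 10
BASELINE_FUEL = "Natural Gas"
# One dummy coefficient per non-baseline fuel type.
FUEL_DUMMY_COEFS = {"Diesel": 500, "Petrol": 300}

def predict_price(age, fuel_type, tax):
    """Apply the linear model; the baseline fuel type adds nothing."""
    if fuel_type != BASELINE_FUEL and fuel_type not in FUEL_DUMMY_COEFS:
        raise ValueError(f"unknown fuel type: {fuel_type!r}")
    fuel_effect = FUEL_DUMMY_COEFS.get(fuel_type, 0)
    return INTERCEPT + COEF_AGE * age + fuel_effect + COEF_TAX * tax

diesel_price = predict_price(12, "Diesel", 230)         # 10000 - 1200 + 500 + 2300
baseline_price = predict_price(12, "Natural Gas", 230)  # 10000 - 1200 + 0 + 2300
print(diesel_price, baseline_price)
```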
The following linear regression model is calculated to predict Boston house prices using CRIM (per capita crime rate by town), ZN (proportion of residential land zone), INDUS (proportion of industrial land zone), CHAS (Charles River dummy variable), and NOX (air pollution level). Use the model to predict the Median Value of the following house: CRIM = 0.121 ZN = 22 INDUS = 1.47 CHAS = 0 NOX = 0.44 34.7 37.1 21.7 27.7
27.7
The following chart shows the within-cluster sum of squared errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimal for clustering? 2 5 8 4
4
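A small sketch of the quantity plotted in an elbow chart, using made-up one-dimensional data and hand-picked cluster assignments (both are assumptions). The within-cluster sum of squared errors (WSS) always shrinks as k grows, which is why the elbow method looks for the point where the improvement levels off rather than for the smallest WSS.

```python
def wss(clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
        )
    return total

# Hypothetical data: two obvious groups on a line.
wss_k1 = wss([[[1.0], [2.0], [9.0], [10.0]]])
wss_k2 = wss([[[1.0], [2.0]], [[9.0], [10.0]]])
wss_k4 = wss([[[1.0]], [[2.0]], [[9.0]], [[10.0]]])

print(wss_k1, wss_k2, wss_k4)  # 65.0 1.0 0.0
```

Here the drop from k=1 to k=2 is large and the drop from k=2 to k=4 is small, so the elbow sits at k=2 for this toy data.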
Which of the following models (the dark blue line) shows a case of underfitting? B C A None of the others
A
In the following scatter plot matrix, Price is the target variable. What predictor shows the strongest negative correlation with Price? CC HP Age_08_04 Weight
Age_08_04
In the development of a linear regression model, what is the naive (baseline) model against which we compare the performance of the linear model? Simple linear model Average model Multiple linear model Random guess
Average model
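One way to see the comparison with the average model is through R2, which can be written as one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below uses made-up numbers; the function name `r_squared` is an assumption.

```python
def r_squared(actual, predicted):
    """1 - SSE(model) / SSE(naive average model)."""
    mean_y = sum(actual) / len(actual)
    sse_model = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sse_naive = sum((y - mean_y) ** 2 for y in actual)
    return 1 - sse_model / sse_naive

# Hypothetical target values and model predictions.
actual = [2.0, 4.0, 6.0, 8.0]
model_predictions = [2.5, 3.5, 6.5, 7.5]
naive_predictions = [5.0, 5.0, 5.0, 5.0]  # the average of the actual values

r2_model = r_squared(actual, model_predictions)
r2_naive = r_squared(actual, naive_predictions)
print(round(r2_model, 2), r2_naive)
```

The average model scores exactly zero by construction, so a positive R2 means the regression explains more variation than the naive benchmark.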
With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? A. Majority vote determines the predicted class B. Average of the neighbors C. Through a logistic regression between the neighbors D. Through a linear combination of neighbors
Average of the neighbors
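A minimal sketch of k-NN prediction for a numerical target, assuming Euclidean distance and an unweighted average (the data and the name `knn_predict` are made up for illustration). It also shows that when k equals the number of records, the prediction collapses to the overall average.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the target values of the k records closest to the query."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of records")
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data with one predictor.
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

pred_k3 = knn_predict(train_X, train_y, [2.1], k=3)  # neighbors: 2.0, 3.0, 1.0
pred_kn = knn_predict(train_X, train_y, [2.1], k=4)  # all records
print(pred_k3, pred_kn)  # 20.0 40.0
```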
Which model is an OVER-FITTED model? B A D C
B
In the context of predictive model training, ... is a measure that shows how much the model's predictions differ from the true values. Variance R2 Bias F-Test
Bias
Based on the following correlation matrix, what is potentially the weakest predictor of MEDV? TAX CHAS PTRATIO RM
CHAS
Which statement is INCORRECT about clustering? A. Clustering is useful for predicting association rules B. Clustering is an unsupervised learning method C. Clustering has many applications in marketing, insurance, logistics, and health care businesses D. Quality of a clustering model depends on the similarity measure that is used
Clustering is useful for predicting association rules
Which of the following scatterplots shows the strongest negative correlation? D A B C
D
Which of the following variable selection methods for the linear regression model examines all possible combinations of variables? Forward selection Stepwise search Exhaustive search Backward elimination
Exhaustive search
Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm. True False
False
Categorical variables can NOT be used as predictors in the linear regression model. True False
False
In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error. True False
False
In the k-nearest neighbor models, increasing the value of k leads to overfitting. True False
False
In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods. True False
False
Predictors of a multiple linear regression model can only be numeric type. True False
False
The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex data patterns. True False
False
We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering. True False
False
You are requested to use a large data set of customers to predict how many days after their first purchase they will make the second purchase. You can do this by developing a classification model. True False
False
We have trained a linear regression model and tested it on the validation set. Here are the results: On the training set: RMSE is low, R2 is high On the validation set: RMSE is low, R2 is high What type of fit is this? Good (optimized) fit Over-fit Under-fit
Good (optimized) fit
The following scatterplot matrix is created from the Boston Housing data set that shows the median price of the houses in the Boston metropolitan area. Which two variables (any two variables) show the strongest correlation? (read the column and row labels to find the variable names) MEDV & NOX INDUS & NOX CHAS & INDUS MEDV & RM
INDUS & NOX
The following report shows Excel output for a linear regression model. What can the p-value of F-statistic tell us? A. If this p-value is less than our significance level then the coefficients are significant B. If this p-value is larger than our significance level then the coefficients are significant C. If this p-value is larger than our significance level then the model as a whole is significant D. If this p-value is less than our significance level then the model as a whole is significant
If this p-value is less than our significance level then the model as a whole is significant
Which statement is INCORRECT about linear regression models? A. Linear regression models are relatively easy to explain B. In some cases, it is better to transform the variables before training the model to build a better model in terms of goodness-of-fit and accuracy C. It is a very popular, robust, and flexible method for predicting numerical and categorical targets D. Regression models are robust against violation of some assumptions such as normality assumption
It is a very popular, robust, and flexible method for predicting numerical and categorical targets
Based on the following correlation matrix, what is potentially the strongest predictor of MEDV? AGE INDUS LSTAT RM
LSTAT
Which statement is INCORRECT about k-NN predictive models? A. Larger values of k increase the risk of over-fitting B. When k=n (number of data records) the k-NN and the universal average methods are the same C. k-NN is sensitive to irrelevant features D. Finding optimum value of k can be computationally expensive
Larger values of k increase the risk of over-fitting
Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method? A. Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k B. Sometimes business considerations impose constraints on the value of k C. Ability to do a useful profiling based on the cluster centroids helps us select a right value of k D. Similar analyses can be used to inform our decision about a right value of k
Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k
We have developed two different linear regression models on the same data set. Which model shows a better goodness-of-fit? Not enough information Models are the same Model B Model A
Model A
The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT? A. Model A is violating linearity assumption B. Model B is better than Model A because it shows smaller residuals C. Model B is violating homoscedasticity assumption D. Both models have met all the assumptions of the linear regression model
Model B is violating homoscedasticity assumption
We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? A. Model is violating the homoscedasticity assumption B. Model is not violating any of the linear regression assumptions C. Model is violating the linearity assumption D. Model is violating the assumption of independence of observations
Model is violating the assumption of independence of observations
We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? A. Model is violating the linearity assumption B. Model is not violating any of the linear regression assumptions C. Model is violating the homoscedasticity assumption D. Model is violating the assumption of independence of observations
Model is violating the linearity assumption
When we are building a linear regression model, against what model do we compare it to evaluate its significance? Naïve (average) model Logistic model Classification model Random model
Naïve (average) model
The target variable in a multiple linear regression model must be a: Binary variable Numerical variable Nominal Variable Ordinal variable
Numerical variable
Which of the following is NOT a strategy to prevent model over-fitting? A. Adding variables to the model only if they improve the model performance and goodness-of-fit B. Splitting data into train and validation sets C. Set a limit on the value of R2 metric D. Penalizing the model for including more variables
Set a limit on the value of R2 metric
You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables). CHAS & NOX TAX & RAD MEDV & PTRATIO MEDV & LSTAT
TAX & RAD
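The correlation matrix itself is not reproduced here. As a sketch of how predictor pairs might be screened for multicollinearity, the code below computes Pearson correlations and flags pairs above a cutoff. The data values are invented (only the column names come from the question), and the 0.8 threshold is an arbitrary assumption rather than a fixed rule.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def collinear_pairs(columns, threshold=0.8):
    """Return predictor pairs whose absolute correlation meets the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, r))
    return flagged

# Hypothetical predictor values (not the Boston Housing data).
predictors = {
    "TAX": [300, 400, 500, 600, 700],
    "RAD": [2, 4, 5, 7, 8],
    "CHAS": [0, 1, 0, 1, 0],
}

flagged = collinear_pairs(predictors)
print([(a, b, round(r, 2)) for a, b, r in flagged])
```

Only predictor-to-predictor pairs are checked; a strong correlation between a predictor and the target is desirable and is not multicollinearity.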
Which statement is INCORRECT about the k-means clustering algorithm? A. The algorithm starts with initial centroids that are determined by distance function B. The algorithm starts with random seeds as the initial centroids C. Each data point is assigned to the cluster with the nearest centroid D. The choice of distance function is arbitrary, and the Euclidean distance function is very popular
The algorithm starts with initial centroids that are determined by distance function
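The steps named in the options (random seeds as initial centroids, assignment to the nearest centroid, centroid update) can be sketched as follows. This is a bare-bones illustration assuming Euclidean distance and made-up data, not a production implementation; the function name `k_means` and its parameters are assumptions.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Basic k-means: random initial centroids, then assign and update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random records serve as initial seeds
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no movement means convergence
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

The value of k is passed in before the algorithm starts; nothing inside the loop changes it.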
What statement is correct about the k-nearest neighbor (k-NN) method? A. Underfitted k-NN models can be fixed by adding a dummy variable for accuracy B. Logistic regression is a special case of k-NN C. The value of k can control model over and underfitting D. Overfitted k-NN models can be fixed by decreasing k
The value of k can control model over and underfitting
Before computing the distance between two data records, we should normalize the numerical variables to prevent variables with large scales from having an undue effect. True False
True
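A short sketch of why scale matters, using invented age and income values and min-max normalization (one of several possible normalization choices). On the raw data, income dominates the distance; after normalization, the nearest record changes.

```python
import math

def min_max_normalize(rows):
    """Rescale every column to the 0-1 range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical records: [age in years, income in dollars].
rows = [[25, 50000], [60, 51000], [27, 58000], [40, 150000]]

raw_d01 = math.dist(rows[0], rows[1])  # ages differ by 35 years
raw_d02 = math.dist(rows[0], rows[2])  # ages differ by 2 years

scaled = min_max_normalize(rows)
norm_d01 = math.dist(scaled[0], scaled[1])
norm_d02 = math.dist(scaled[0], scaled[2])

print(raw_d01 < raw_d02)    # True: raw distance says record 1 is closer
print(norm_d01 > norm_d02)  # True: after scaling, record 2 is closer
```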
In a linear regression model, the t-Test for each predictor's coefficient indicates if the estimated value is significantly different from zero. True False
True
Increasing the data size can help avoid both over- and under-fitting problems. True False
True
When a model is over-fitted, the regression coefficients represent noise in the data rather than the genuine relationships in the population. True False
True
k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or numerical targets. True False
True
In model training, ... measures how much the model's predictions fluctuate when given different input data. Variance p-value Bias Adjusted R2
Variance
Splitting the data set into training and validation sets is a method to avoid overfitting. We can calculate metrics such as RMSE and R2 for the training and validation sets. Under what conditions can we tell the model is overfitted? A. Very high training and test errors B. Very low training error and high test error C. High training, but low testing error D. Low training and testing errors
Very low training error and high test error
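A rough sketch of the comparison described above: compute RMSE on both sets and flag the model when the validation error is much larger than the training error. The data are made up, and the 1.5 ratio in `looks_overfitted` is an arbitrary illustrative cutoff, not a standard rule.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def looks_overfitted(train_rmse, valid_rmse, ratio=1.5):
    """Flag a model whose validation error far exceeds its training error."""
    return valid_rmse > ratio * train_rmse

# Hypothetical actual values and model predictions.
train_rmse = rmse([10.0, 20.0, 30.0], [10.5, 19.5, 30.5])
valid_rmse = rmse([15.0, 25.0, 35.0], [20.0, 19.0, 41.0])

print(round(train_rmse, 2), round(valid_rmse, 2))
print(looks_overfitted(train_rmse, valid_rmse))  # True
```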
We have created a multiple linear regression model to predict car MPG (miles per gallon) values from five car features: cylinders, displacement, horsepower, weight, and acceleration. The following table shows the linear model. Based on the results, which variable's estimated coefficient is significantly different from zero? displacement cylinders acceleration horsepower
horsepower
When a model captures both the underlying patterns and random fluctuations in data, it is called ... under-fitting optimal fitting non-linear regression over-fitting
over-fitting