DSCI 4520 Exam 1 section 2


The following figure shows the logic for creating dummy variables from a categorical predictor called "Remodel". Which class level is considered the baseline (reference)? "None" "Recent" "Old" "Remodel"

"None"

What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to 1 decimal. Euclidean distance formula:

11.5
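The two records from the original figure are not reproduced here, so the following is only a minimal sketch of the unnormalized Euclidean distance calculation. The records `r1` and `r2` are hypothetical values chosen for illustration and are not the ones behind the 11.5 answer.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences, one term per variable."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records (assumed for illustration only).
r1 = (25.0, 3.0)
r2 = (28.0, 7.0)

d = round(euclidean(r1, r2), 1)  # differences are 3 and 4, so sqrt(9 + 16)
print(d)
```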

The following formula shows the model that we have developed to predict the average used car Price based on its Age, Fuel Type, and quarterly Tax amount (Fuel Type = Natural Gas is the model baseline). Use the model to predict the price of a car with the following characteristics: Age = 12 Fuel Type = Diesel Tax = 230 15996 21445 19087 24566

21445

The following linear regression model is calculated to predict Boston house prices using CRIM (per capita crime rate by town), ZN (proportion of residential land zone), INDUS (proportion of industrial land zone), CHAS (Charles River dummy variable), and NOX (air pollution level). Use the model to predict the Median Value of the following house: CRIM = 0.121 ZN = 22 INDUS = 1.47 CHAS = 0 NOX = 0.44 34.7 37.1 21.7 27.7

27.7

The following chart shows the within-cluster sum of square errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering? 2 5 8 4

Which of the following models (the dark blue line) shows a case of underfitting? B C A None of the others

In the following scatter plot matrix, Price is the target variable. What predictor shows the strongest negative correlation with Price? CC HP Age_08_04 Weight

Age_08_04

In the development of a linear regression model, what is the naive (baseline) model that we compare the performance of the linear model with? Simple linear model Average model Multiple linear model Random guess

Average model
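As a rough illustration of why the average model serves as the baseline, the sketch below fits a one-predictor least-squares line and compares its error with a model that always predicts the mean of the target. The data points are made up for illustration; R² is computed as 1 - SSE/SST, where SST is the squared error of the average model.

```python
import math

# Hypothetical data (assumed for illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least squares for a single predictor.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

# Squared error of the linear model and of the naive (average) model.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_mean) ** 2 for yi in y)

rmse_model = math.sqrt(sse / n)
rmse_naive = math.sqrt(sst / n)
r2 = 1 - sse / sst  # share of the naive model's error that the line removes

print(rmse_model, rmse_naive, r2)
```

An R² near 0 would mean the linear model does no better than always predicting the average.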

With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? A. Majority vote determines the predicted class B. Average of the neighbors C. Through a logistic regression between the neighbors D. Through a linear combination of neighbors

Average of the neighbors
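A minimal sketch of k-NN prediction for a numerical target, assuming Euclidean distance and an unweighted average. The training records and the function name `knn_predict` are hypothetical, chosen only to illustrate the idea.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict a numerical target as the mean target of the k closest records."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest_targets = [y for _, y in by_distance[:k]]
    return sum(nearest_targets) / k

# Hypothetical training data (assumed for illustration only).
train_X = [(1, 1), (2, 2), (3, 3), (10, 10)]
train_y = [10.0, 20.0, 30.0, 100.0]

pred_k3 = knn_predict(train_X, train_y, query=(2, 2), k=3)
pred_kn = knn_predict(train_X, train_y, query=(2, 2), k=len(train_X))
print(pred_k3, pred_kn)
```

With k equal to the number of records, the prediction is the overall average of the target, whatever the query is.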

Which model is an OVER-FITTED model? B A D C

In the context of predictive model training, ... is a measure that shows how much the model's predictions differ from the true values. Variance R2 Bias F-Test

Bias

Based on the following correlation matrix, what is potentially the weakest predictor of MEDV? TAX CHAS PTRATIO RM

CHAS

Which statement is INCORRECT about clustering? A. Clustering is useful for predicting association rules B. Clustering is an unsupervised learning method C. Clustering has many applications in marketing, insurance, logistics, and health care businesses D. Quality of a clustering model depends on the similarity measure that is used

Clustering is useful for predicting association rules

Which of the following scatterplots shows the strongest negative correlation? D A B C

Which of the following variable selection methods for the linear regression model examines all possible combinations of variables? Forward selection Stepwise search Exhaustive search Backward elimination

Exhaustive search

Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm. True False

False

Categorical variables can NOT be used as predictors in the linear regression model. True False

False
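Categorical predictors enter a linear regression through dummy variables: one 0/1 column per class level except the baseline, which is represented by all zeros. The sketch below assumes the "Remodel" levels named earlier in this set ("None", "Recent", "Old"); the helper name `dummy_code` and the column naming are hypothetical.

```python
def dummy_code(values, baseline, prefix="Remodel"):
    """Build one 0/1 indicator per non-baseline level; the baseline gets all zeros."""
    levels = sorted(set(values) - {baseline})
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]

rows = dummy_code(["None", "Recent", "Old", "None"], baseline="None")
for row in rows:
    print(row)
```

Three class levels produce two dummy columns, and each coefficient is then read relative to the baseline level.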

In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error. True False

False

In the k-nearest neighbor models, increasing the value of k leads to overfitting. True False

False

In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods. True False

False

Predictors of a multiple linear regression model can only be numeric type. True False

False

The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex data patterns. True False

False

We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering. True False

False

You are requested to use a large data set of customers to predict how many days after their first purchase they will make the second purchase. You can do this by developing a classification model. True False

False

We have trained a linear regression model and tested it on the validation set. Here are the results: On the training set: RMSE is low, R2 is high On the validation set: RMSE is low, R2 is high What type of fit is this? Good (optimized) fit Over-fit Under-fit

Good (optimized) fit

The following scatterplot matrix is created from the Boston Housing data set that shows the median price of the houses in the Boston metropolitan area. Which two variables (any two variables) show the strongest correlation? (read the column and row labels to find the variable names) MEDV & NOX INDUS & NOX CHAS & INDUS MEDV & RM

INDUS & NOX

The following report shows Excel output for a linear regression model. What can the p-value of F-statistic tell us? A. If this p-value is less than our significance level then the coefficients are significant B. If this p-value is larger than our significance level then the coefficients are significant C. If this p-value is larger than our significance level then the model as a whole is significant D. If this p-value is less than our significance level then the model as a whole is significant

If this p-value is less than our significance level then the model as a whole is significant

Which statement is INCORRECT about linear regression models? A. Linear regression models are relatively easy to explain B. In some cases, it is better to transform the variables before training the model to build a better model in terms of goodness-of-fit and accuracy C. It is a very popular, robust, and flexible method for predicting numerical and categorical targets D. Regression models are robust against violation of some assumptions such as normality assumption

It is a very popular, robust, and flexible method for predicting numerical and categorical targets

Based on the following correlation matrix, what is potentially the strongest predictor of MEDV? AGE INDUS LSTAT RM

LSTAT

Which statement is INCORRECT about k-NN predictive models? A. Larger values of k increase the risk of over-fitting B. When k=n (number of data records) the k-NN and the universal average methods are the same C. k-NN is sensitive to irrelevant features D. Finding optimum value of k can be computationally expensive

Larger values of k increase the risk of over-fitting

Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method? A. Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k B. Sometimes business considerations impose constraints on the value of k C. Ability to do a useful profiling based on the cluster centroids helps us select a right value of k D. Similar analyses can be used to inform our decision about a right value of k

Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k

We have developed two different linear regression models on the same data set. Which model shows a better goodness-of-fit? Not enough information Models are the same Model B Model A

Model A

The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT? A. Model A is violating linearity assumption B. Model B is better than Model A because it shows smaller residuals C. Model B is violating homoscedasticity assumption D. Both models have met all the assumptions of the linear regression model

Model B is violating homoscedasticity assumption

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? A. Model is violating the homoscedasticity assumption B. Model is not violating any of the linear regression assumptions C. Model is violating the linearity assumption D. Model is violating the assumption of independence of observations

Model is violating the assumption of independence of observations

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? A. Model is violating the linearity assumption B. Model is not violating any of the linear regression assumptions C. Model is violating the homoscedasticity assumption D. Model is violating the assumption of independence of observations

Model is violating the linearity assumption

When we are building a linear regression model, against what model do we compare it to evaluate its significance? Naïve (average) model Logistic model Classification model Random model

Naïve (average) model

The target variable in a multiple linear regression model must be a: Binary variable Numerical variable Nominal Variable Ordinal variable

Numerical variable

Which of the following is NOT a strategy to prevent model over-fitting? A. Adding variables to the model only if they improve the model performance and goodness-of-fit B. Splitting data into train and validation sets C. Set a limit on the value of R2 metric D. Penalizing the model for including more variables

Set a limit on the value of R2 metric

You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables). CHAS & NOX TAX & RAD MEDV & PTRATIO MEDV & LSTAT

TAX & RAD
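A correlation matrix entry between two predictors is a Pearson correlation coefficient, and a value close to +1 or -1 between predictors signals possible multicollinearity. The sketch below computes it by hand. The `tax` and `rad` values are invented for illustration and are not taken from the Boston Housing data.

```python
import math

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictor values (assumed for illustration only).
tax = [220.0, 300.0, 410.0, 660.0, 670.0]
rad = [1.0, 3.0, 5.0, 24.0, 24.0]

r = pearson(tax, rad)
print(round(r, 2))
```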

Which statement is INCORRECT about the k-means clustering algorithm? A. The algorithm starts with initial centroids that are determined by distance function B. The algorithm starts with random seeds as the initial centroids C. Each data point is assigned to the cluster with the nearest centroid D. The choice of distance function is arbitrary, and the Euclidean distance function is very popular

The algorithm starts with initial centroids that are determined by distance function
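The steps described in the options above (random seeds as initial centroids, assignment of each point to the nearest centroid, centroid update) can be sketched in a few lines. This is a simplified illustration assuming Euclidean distance; the points, the `seed` parameter, and the function name `kmeans` are hypothetical. It also returns the within-cluster sum of squares (WSS) used by the Elbow method.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means: random seeds, nearest-centroid assignment, mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random records serve as initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
    return centroids, labels, wss

# Hypothetical points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels, wss_k2 = kmeans(points, k=2)
_, _, wss_k1 = kmeans(points, k=1)
print(labels, wss_k2, wss_k1)
```

WSS shrinks as k grows, which is why the Elbow method looks for the point where the decrease levels off rather than for the smallest WSS.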

What statement is correct about the k-nearest neighbor (k-NN) method? A. Underfitted k-NN models can be fixed by adding a dummy variable for accuracy B. Logistic regression is a special case of k-NN C. The value of k can control model over and underfitting D. Overfitted k-NN models can be fixed by decreasing k

The value of k can control model over and underfitting

Before computing the distance between two data records, we should normalize the numerical variables to prevent variables with large scales from having an undue effect. True False

True
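To illustrate the effect of scale, the sketch below standardizes each variable to z-scores (subtract the column mean, divide by the column standard deviation) before computing distances. Z-scoring is one common choice of normalization and is an assumption here; the records (age, income) are hypothetical.

```python
import math
import statistics

def zscore_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1 (assumes no constant column)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows
    ]

# Hypothetical records: (age in years, income in dollars).
records = [(25, 40000), (45, 41000), (30, 90000)]
scaled = zscore_columns(records)

raw_d = math.dist(records[0], records[1])    # dominated by the income difference
scaled_d = math.dist(scaled[0], scaled[1])   # age now contributes on an equal footing
print(raw_d, scaled_d)
```

Before scaling, the 20-year age gap between the first two records barely changes the distance because income is measured in the tens of thousands.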

In a linear regression model, the t-Test for each predictor's coefficient indicates if the estimated value is significantly different from zero. True False

True

Increasing the data size can help avoid both over- and under-fitting problems. True False

True

When a model is over-fitted, the regression coefficients represent noise in the data rather than the genuine relationships in the population. True False

True

k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or numerical targets. True False

True

In model training, ... measures how much the model's predictions fluctuate when given different input data. Variance p-value Bias Adjusted R2

Variance

Splitting the data set into training and validation is a method to avoid overfitting. We can calculate metrics such as RMSE and R2 for the training and validation sets. Under what conditions can we tell the model is overfitted? A. Very high training and test errors B. Very low training error and high test error C. High training, but low testing error D. Low training and testing errors

Very low training error and high test error

We have created a multiple linear regression model to predict car MPG (miles per gallon) values from five car features: cylinders, displacement, horsepower, weight, and acceleration. The following table shows the linear model. Based on the results, which variable's estimated coefficient is significantly different from zero? displacement cylinders acceleration horsepower

horsepower

When a model captures both the underlying patterns and random fluctuations in data, it is called ... under-fitting optimal fitting non-linear regression over-fitting
