DSCI exam 2


A decision tree can be pre- or post-pruned to avoid underfitting a classification and regression tree. True or False

False

Consider two models A and B. If the prediction accuracy of Model A is higher than that of Model B for the training dataset, we can safely say that Model A is better than Model B. True or False

False

In evaluating a predictive model with a numerical target, the mean absolute error (MAE) can be negative or positive but the mean error (ME) is always positive. True or False

False
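A minimal sketch of why the statement is false, using made-up actual and predicted values (the numbers are assumptions for illustration only). The mean error keeps the sign of each residual, so it can be negative; MAE and RMSE cannot.

```python
import math

def mean_error(actual, predicted):
    # Signed average of residuals; positive and negative errors can cancel.
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    # Average of absolute residuals; never negative.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    # Square root of the mean squared residual; same unit as the target.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical data: residuals are -2, 1, -3
actual = [10.0, 12.0, 14.0]
predicted = [12.0, 11.0, 17.0]

me = mean_error(actual, predicted)                 # negative here
mae = mean_absolute_error(actual, predicted)       # 2.0
rmse = root_mean_squared_error(actual, predicted)  # sqrt(14 / 3)
print(me, mae, rmse)
```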

In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error. True or False

False

In the k-nearest neighbor models, increasing the value of k leads to overfitting. True or False

False

In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods. True or False

False

We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering. True or False

False

The following chart shows the prediction error of a decision tree based on the training set and validation set as functions of the number of splits. What phenomenon is causing the gap between the two curves at higher numbers of splits? (Figure not shown.) A)Model overfitting B)Model underfitting C)Model instability D)Model variability

A)Model overfitting

Which statement is INCORRECT about the k-means clustering algorithm? A)Each data point is assigned to the cluster with the nearest centroid B)The choice of distance function is arbitrary, and the Euclidean distance function is very popular C)The algorithm starts with initial centroids that are determined by distance function D)The algorithm starts with random seeds as the initial centroids

C)The algorithm starts with initial centroids that are determined by distance function
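The steps named in this card (random seeds, then assignment to the nearest centroid) can be sketched as follows. This is an illustrative plain-Python version under assumed toy data, not the implementation used in the course; the function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    # Random seeds: k data points picked at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two well-separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

Note that k is passed in before the algorithm starts; it is not discovered while the algorithm runs.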

Before computing the distance between two data records, we should normalize the numerical variables to prevent variables measured on larger scales from having a disproportionately large effect. True or False

True
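A small sketch of the effect, assuming hypothetical (age, income) records and min-max normalization (one common choice; the card does not say which normalization is meant). Without normalization, income dominates the distance and the nearest neighbor of the first record changes.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_normalize(rows):
    # Rescale every column to the [0, 1] range.
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest(rows, i):
    # Index of the record closest to record i (excluding itself).
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: euclidean(rows[i], rows[j]))

# Hypothetical records: (age in years, income in dollars)
records = [[25, 40000], [60, 41000], [27, 45000], [40, 100000]]

raw_nearest = nearest(records, 0)                      # driven almost entirely by income
norm_nearest = nearest(min_max_normalize(records), 0)  # age now counts too
print(raw_nearest, norm_nearest)
```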

Statistical independence for two events is present when the outcome of the first event has no impact on the probability of the second event. True or False

True

The overall goal of building a decision tree for classification is to create leaves that are purer in terms of class labels. True or False

True

With the Naive Bayes classification method, the zero frequency problem occurs if a given scenario for a single predictor has not been observed. True or False

True
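A hedged illustration of the zero frequency problem, using a made-up predictor ("outlook") observed within one target class. Laplace (add-one) smoothing is shown as one common remedy; it is an assumption here, since the card does not name a fix. The function `conditional_prob` is hypothetical.

```python
from collections import Counter

def conditional_prob(values, value, n_levels, alpha=0.0):
    # P(predictor = value | class), with optional additive smoothing alpha.
    counts = Counter(values)
    return (counts[value] + alpha) / (len(values) + alpha * n_levels)

# Hypothetical training records for one class; "overcast" was never observed.
outlook_given_yes = ["sunny", "sunny", "rainy", "rainy", "rainy"]
n_levels = 3  # sunny, rainy, overcast

p_raw = conditional_prob(outlook_given_yes, "overcast", n_levels)               # 0 / 5
p_smooth = conditional_prob(outlook_given_yes, "overcast", n_levels, alpha=1)   # 1 / 8
print(p_raw, p_smooth)
```

Because Naive Bayes multiplies the conditional probabilities, a single zero wipes out the whole product for that class.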

k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical and numerical targets. True or False

True

In the following confusion matrix, which cell is the FALSE POSITIVE?

c

A linear regression model that is developed on the original (non-normalized) data can be used to compare the effect of predictors on the predicted target variable. True or False

False

The k-means clustering algorithm can handle noisy data with outliers as well as non-convex data patterns. True or False

False

In evaluating a predictive model with a numerical target, the root mean squared error (RMSE) has the same unit as the predicted variable. True or False

True

In the standardized linear regression model, normalized predictors don't have the same unit and scale as the original predictors. True or False

True

When a model is over-fitted, the regression coefficients represent noise in the data rather than the genuine relationships in the population. True or False

True

What is the Euclidean distance between the following two records WITHOUT normalization? (Records not shown.) Euclidean distance formula: d(x, y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)

11.5

The following chart shows the prediction error of a decision tree based on the training set and validation set as functions of the number of splits. To avoid overfitting, what is the best number of splits? (Figure not shown.) A)5 B)10 C)16 D)3

A)5

In the logistic regression model the target variable is: A)A categorical variable B)A numeric variable C)Either a numeric or a binary variable D)A number between 0 and 1

A)A categorical variable

Which of the following techniques is NOT useful for preventing over-fitting a decision tree? A)Adding duplicate records B)Cross-validation C)Early stopping D)Pruning a tree

A)Adding duplicate records

What is the primary method to avoid underfitting when you are training a classification and regression tree model? A)Increasing the number of tests (splits) B)Increasing the entropy or the Gini index of the leaves C)Converting numerical variables to categorical D)Substituting multi-level categorical variables with binary dummy variables

A)Increasing the number of tests (splits)

Which one is NOT one of the advantages of the Naive Bayes classifiers? A)Assumption of independence of features B)Ability to handle both categorical and numerical variables C)Easy to build and understand D)Robustness to irrelevant features

A)Assumption of independence of features

With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? A)Average of the neighbors B)Majority vote determines the predicted class C)Through a linear combination of neighbors D)Through a logistic regression between the neighbors

A)Average of the neighbors

How can we turn the logistic regression model into the classification model? A)By setting a cutoff value and comparing the predicted probability with it B)By setting a cutoff value and comparing the predicted odds with it C)By introducing inverse natural log function to the model D)By setting a cutoff value and comparing the predicted log odds with it

A)By setting a cutoff value and comparing the predicted probability with it
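A minimal sketch of the cutoff idea. The intercept, coefficient, and input value are hypothetical; the point is only that the model outputs a probability and the cutoff turns it into a class.

```python
import math

def predicted_probability(intercept, coefs, x):
    # Logistic regression: the linear part is the log odds,
    # and the logistic function maps it to a probability in (0, 1).
    log_odds = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(prob, cutoff=0.5):
    # Compare the predicted probability with the cutoff.
    return 1 if prob >= cutoff else 0

# Hypothetical model: log odds = -1.0 + 0.8 * x
p = predicted_probability(-1.0, [0.8], [2.0])  # log odds = 0.6
print(p, classify(p, cutoff=0.5), classify(p, cutoff=0.7))
```

The same probability yields different classes under different cutoffs, which is why the cutoff is tuned through model performance assessment rather than fixed at 0.5.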

Which statement is INCORRECT about a CART trained to predict a numerical target? A)Impurity of the leaves can be measured with the Gini index or Entropy B)Prediction is computed as the average of numerical target variable at the leaves C)Training procedure is similar to training a CART for classification D)Pruning procedures and techniques are similar to those for the classification tree

A)Impurity of the leaves can be measured with the Gini index or Entropy

The following chart shows the within-cluster sum of square errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering? (Figure not shown.) A)2 B)8 C)4 D)5

C)4

We have trained classification models, and their ROC curves are shown below. Given that the Area Under the Curve (AUC) is our performance metric, which model is performing better? (Figure not shown.)

A

Which of the following models (the dark blue line) shows a case of underfitting? (Figure not shown.)

A

Which of the following statements is INCORRECT about the logistic regression model? A)In the logistic regression, the intercept cannot be zero because of the natural logarithm function B)Logistic regression can be used for classification C)Logistic regression uses odds and the natural logarithm function D)Logistic regression can be developed for a binary or a multi-class target variable

A)In the logistic regression, the intercept cannot be zero because of the natural logarithm function

We are building a decision tree to predict loan default with four predictors: Age, Income, Gender, and Credit Score. For the first split, we have calculated the Gini index of each test. Based on the following information, which predictor is the best for the first split? (Figure not shown.) A)Income B)Gender C)Age D)Credit

A)Income

Which statement is INCORRECT about the Naïve Bayes classifier? A)It computes and includes prior probability of predictors B)It returns the event with which the joint probability of that event and predictors is maximized C)It examines the existing evidence to predict the probability of target levels D)It identifies the dependent variable's level (i.e. event) that increases the probability of the desired target class label

A)It computes and includes prior probability of predictors

With the k-NN model for classification, after we have determined the k nearest neighbors of a new data record, how is the class predicted? A)Majority vote determines the predicted class B)Average of the neighbors C)Through a linear combination of neighbors D)Through a logistic regression between the neighbors

A)Majority vote determines the predicted class
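Both k-NN prediction rules from these cards (majority vote for a class, average for a numerical target) can be sketched together. The training data and function names below are hypothetical, and ties are broken arbitrarily.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    # Indices of the k training records closest to the query (Euclidean distance).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, labels, query, k):
    # Categorical target: majority vote among the k neighbors.
    votes = Counter(labels[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, targets, query, k):
    # Numerical target: average of the k neighbors' target values.
    idx = k_nearest(train_X, query, k)
    return sum(targets[i] for i in idx) / k

# Hypothetical training data
X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6]]
labels = ["a", "a", "a", "b", "b"]
targets = [10, 12, 14, 50, 54]

predicted_class = knn_classify(X, labels, [1.5, 1.5], k=3)
predicted_value = knn_regress(X, targets, [1.5, 1.5], k=3)
print(predicted_class, predicted_value)
```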

Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method? A)Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k B)Sometimes business considerations impose constraints on the value of k C)Ability to do a useful profiling based on the cluster centroids helps us select a right value of k D)Similar analyses can be used to inform our decision about a right value of k

A)Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k

Which statement is INCORRECT about the structure of decision trees? A)Numerical attributes cannot be tested in the tree B)Each node represents a test result on a predictor C)Each branch is a test on the predictor D)Each leaf is a terminal node with prediction

A)Numerical attributes cannot be tested in the tree

If events A and B are statistically independent, what is P(A|B), that is the conditional probability of A, given B? A)P(A) B)P(A) * P(B) C)P(B|A) D)P(B)

A)P(A)
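A quick check of this identity by enumeration, assuming two fair dice and two events chosen only for illustration (A: the first die is even; B: the second die is greater than 4).

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    # Exact probability of an event given as a predicate on outcomes.
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def event_a(o):
    return o[0] % 2 == 0  # first die is even

def event_b(o):
    return o[1] > 4       # second die shows 5 or 6

p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))
p_a_given_b = p_a_and_b / p_b  # definition of conditional probability

print(p_a, p_a_given_b)
```

Since P(A and B) = P(A) * P(B) for independent events, dividing by P(B) leaves P(A).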

Which statement explains the issues when linear regression is used to model binary target variables? A)Predicted probabilities can be >1 or <0 leading to model interpretation difficulties B)Instability of the model coefficients C)Complexity of logarithmic calculations D)Large intercept value associated with the linear regression models

A)Predicted probabilities can be >1 or <0 leading to model interpretation difficulties

Which statement about Entropy and the Gini index is correct? A)Smaller values of both metrics indicate higher purity of a node B)Larger values of both metrics indicate higher purity of a node C)Larger values of Entropy and smaller values of the Gini index indicate higher purity of a node D)Larger values of the Gini index and smaller values of Entropy indicate higher purity of a node

A)Smaller values of both metrics indicate higher purity of a node
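A short sketch of both impurity measures on hypothetical class labels, showing that a pure node scores zero on each and an evenly mixed two-class node scores the maximum.

```python
import math
from collections import Counter

def gini(labels):
    # Gini index: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Entropy in bits: minus the sum of p * log2(p) over the classes present.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical nodes
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

print(gini(pure), entropy(pure))    # both zero: perfectly pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0: maximum impurity for two classes
```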

You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables). A)TAX & RAD B)MEDV & PTRATIO C)CHAS & NOX D)MEDV & LSTAT

A)TAX & RAD

Which statement is correct about the cutoff value of the probability calculated by a logistic regression model to be used for classification? A)The cutoff value is an arbitrary value determined by model performance assessment B)Larger cutoff values result in higher model performance C)Smaller cutoff values result in higher model performance D)The cutoff value must always be set to 0.5

A)The cutoff value is an arbitrary value determined by model performance assessment

Which statement is correct about the k-nearest neighbor (k-NN) method? A)The value of k can control model over- and underfitting B)Underfitted k-NN models can be fixed by adding a dummy variable for accuracy C)Overfitted k-NN models can be fixed by decreasing k D)Logistic regression is a special case of k-NN

A)The value of k can control model over- and underfitting

Which of the following variable search methods for the linear regression model examines all possible combinations of variables? A)Forward selection B)Exhaustive search C)Backward elimination D)Stepwise search

B)Exhaustive search
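To make "all possible combinations" concrete, the sketch below enumerates every non-empty subset of a hypothetical list of four predictors. An exhaustive search would fit one model per subset, so the count grows as 2^p - 1 with p predictors; model fitting itself is left out here.

```python
from itertools import combinations

def all_subsets(predictors):
    # Yield every non-empty combination of predictors, smallest subsets first.
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset

# Hypothetical predictor names
predictors = ["Age", "Room", "CRIME", "LotSize"]
subsets = list(all_subsets(predictors))

print(len(subsets))  # 2**4 - 1 candidate models
```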

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (4) A)Model is violating the homoscedasticity assumption B)Model is violating the assumption of independence of observations C)Model is violating the linearity assumption D)Model is not violating any of the linear regression assumptions

B)Model is violating the assumption of independence of observations

Given the following linear regression model for predicting House Prices, which variable has the largest effect on the predicted price? where: Age is the age of the house, Room is the number of rooms, CRIME is the crime rate per capita in the town, and Lot Size is the lot size (sqf) of the house A)the intercept B)Room C)It cannot be determined, because this is not a normalized model D)Age

C)It cannot be determined, because this is not a normalized model

The following linear model is developed on the normalized data to predict used car prices. Which of the predictors has the LARGEST effect on the predicted price? (Model not shown.) A)CC B)This model cannot be used to compare the effect of predictors C)HP D)Age

A)CC

Which statement is INCORRECT about clustering? A)Clustering is an unsupervised learning method B)Quality of a clustering model depends on the similarity measure that is used C)Clustering has many applications in marketing, insurance, logistics, and health care businesses D)Clustering is useful for predicting association rules

D)Clustering is useful for predicting association rules

The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT? (3) A)Model A is violating linearity assumption B)Model B is better than Model A because it shows smaller residuals C)Both models have met all the assumptions of the linear regression model D)Model B is violating homoscedasticity assumption

D)Model B is violating homoscedasticity assumption

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model? (8) A)Model is violating the homoscedasticity assumption B)Model is not violating any of the linear regression assumptions C)Model is violating the assumption of independence of observations D)Model is violating the linearity assumption

D)Model is violating the linearity assumption

Which of the following is NOT a strategy to prevent model over-fitting? A)Penalizing the model for including more variables B)Adding variables to the model only if they improve the model performance and goodness-of-fit C)Splitting data into train and validation sets D)Setting a limit on the value of the R² metric

D)Setting a limit on the value of the R² metric

What is the error rate of the following confusion matrix? (rounded to 2 decimal places)

0.41

What is the fall-out score of the following confusion matrix given that "1" is positive? (rounded to 2 places)

0.47

What is the sensitivity score of the following confusion matrix given that "1" is positive? (rounded to 2 decimal places)

0.71

What is the specificity score of the following confusion matrix given that "1" is positive? (rounded to 2 places)

0.81
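The four confusion-matrix scores asked about in these cards can be computed as below. The matrices from the original cards are not available, so the counts here are hypothetical and will not reproduce the card answers; the function name `confusion_metrics` is also an assumption.

```python
def confusion_metrics(tp, fn, fp, tn):
    # Class "1" is treated as the positive class.
    total = tp + fn + fp + tn
    return {
        "error_rate": (fp + fn) / total,   # share of misclassified records
        "sensitivity": tp / (tp + fn),     # true positive rate (recall)
        "specificity": tn / (tn + fp),     # true negative rate
        "fall_out": fp / (fp + tn),        # false positive rate = 1 - specificity
    }

# Hypothetical counts
m = confusion_metrics(tp=50, fn=20, fp=10, tn=120)
for name, value in m.items():
    print(name, round(value, 2))
```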

What is the predicted variable in the logistic regression model? A)RMSE B)Probability of class membership C)Confusion matrix

B)Probability of class membership
