BADM Exam 3


MLR assumes that the outcome is a continuous numerical variable that can take any value in the range

(-infinity, +infinity)

False Positive Rate = 1 - Specificity

FP/(TN+FP)

A classifier that performs well will have a ROC curve that is a perfectly diagonal line.

False

In a multiple linear regression problem, the outcome variable is a/an

continuous numerical variable

A propensity score is used in combination with a _____ to determine class membership.

cutoff value

Suppose that you are interested in the median of a variable in your dataset data_df. Which of the following commands can be used for that purpose?

data_df.describe()
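A minimal sketch of where the median appears in the `describe()` output. The contents of `data_df` and the column name `x` are made up for illustration; only the name `data_df` comes from the card. The `50%` row of the summary is the median.

```python
import pandas as pd

# Hypothetical data; the column name "x" is an assumption for this example.
data_df = pd.DataFrame({"x": [1, 3, 5, 7, 100]})

# describe() reports count, mean, std, min, 25%, 50%, 75%, max per numeric column.
summary = data_df.describe()

# The 50th percentile is the median, so both values below should agree.
median_from_describe = summary.loc["50%", "x"]
median_direct = data_df["x"].median()

print(median_from_describe, median_direct)
```

The mean of this column is pulled up by the outlier 100, while the median is not, which is one reason to look at the `50%` row.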

Odds vs. Probability

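Odds grow nonlinearly with probability (odds on the Y axis, probability on the X axis): the curve rises slowly at first, then steeply, and is unbounded as the probability approaches 1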

A column of a dataset is also known as a(n) ______________.

feature

Logistic Regression is similar to MLR but assumes that the logit of the outcome probability p is a linear function of the predictors

Can be used for both predictive and explanatory modeling

What is the type of the outcome variable in a classification problem?

Categorical

Two main types of supervised learning methods are regression and

Classification

What is another name for confusion matrix?

Classification matrix

What is the naïve rule for classifying?

Classify the record as a member of the majority class
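A small sketch of the naive rule, using made-up labels (the `stay`/`leave` values and variable names are assumptions for illustration): find the most frequent class and predict it for every record.

```python
from collections import Counter

# Hypothetical class labels; the values are illustrative only.
train_labels = ["stay", "stay", "leave", "stay", "leave"]

# The naive rule ignores all predictors and always predicts the majority class.
majority_class = Counter(train_labels).most_common(1)[0][0]

# Every new record receives the same prediction.
naive_predictions = [majority_class] * 3

print(majority_class, naive_predictions)
```

Any useful classifier should beat the accuracy of this baseline.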

Assuming a threshold of 0.05, a variable with a p-value=0.02 is not significantly associated with the outcome variable.

False

When fitting a multiple linear regression model, it is best practice to include as many correlated predictors as possible.

False

Odds VS Logit range

Odds: 0 to infinity; Logit: negative infinity to positive infinity

ACME Corporation wants to develop a model to predict whether an employee will leave the organization in the next 6 months. Which of the following algorithms can be applied in this problem?

Logistic Regression

Which of the following algorithms can be used for classification problems?

Logistic Regression

Based on the prediction accuracy measures, which model performs better?

The model with the lower RMSE and MAE performs better.
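As a rough illustration of comparing two regression models by RMSE and MAE, the sketch below uses made-up actual values and predictions (all numbers and the names `pred_a` and `pred_b` are hypothetical).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical validation outcomes and predictions from two models.
actual = [10, 20, 30]
pred_a = [12, 18, 33]
pred_b = [15, 25, 20]

for name, pred in [("model A", pred_a), ("model B", pred_b)]:
    print(name, "RMSE:", round(rmse(actual, pred), 3), "MAE:", round(mae(actual, pred), 3))
```

Here model A has the smaller RMSE and MAE, so it would be preferred on these measures.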

numerical or regression problems

MLR

Which of the following metrics cannot be used to assess predictive performance in a regression problem?

Mean Error (ME)

Assume that for a classification problem you have two models with AUC1=0.7 and AUC2=0.8. Which model has a better performance when AUC is used to evaluate performance?

Model 2

Suppose you build a model for classification and from the confusion matrix you observe that TP=40,TN=90,FN=10,FP=10. Then the sensitivity of the classifier is

0.8

Suppose you build a model for classification and from the confusion matrix you observe that TP=40,TN=90,FN=10,FP=10. Then the accuracy of the classifier is

13/15
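The two cards above can be checked with a few lines of arithmetic. This sketch uses the counts from the question (TP=40, TN=90, FN=10, FP=10); the variable names are only illustrative.

```python
# Counts taken from the confusion matrix in the question.
TP, TN, FN, FP = 40, 90, 10, 10

sensitivity = TP / (TP + FN)                 # true positive rate: 40/50
specificity = TN / (TN + FP)                 # true negative rate: 90/100
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 130/150, i.e. 13/15

false_positive_rate = FP / (TN + FP)         # equals 1 - specificity
false_negative_rate = FN / (TP + FN)         # equals 1 - sensitivity

print(sensitivity, specificity, round(accuracy, 4))
print(false_positive_rate, false_negative_rate)
```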

Misclassification error arises when ____

A record belongs to one class but is classified as another.

You run the code car_df.head() and get the above output. Which of the following variables is not a nominal categorical variable

Age

You run an MLR and obtained the results displayed in the above table. Based on this output, how many of the coefficients are significant?

A coefficient is significant if its p-value in the table is less than the given threshold (0.05); count the coefficients that meet this condition

False Negative Rate = 1 - Sensitivity

FN/(TP+FN)

Estimate the probability that each record belongs to the positive class, then use a threshold to decide yes or no

Example: Will it rain tomorrow, yes or no? Using a threshold of 0.50, if the probability is higher than 0.50, then yes, it will
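A minimal sketch of turning estimated probabilities into yes/no classes with a cutoff. The probabilities and the function name `classify` are made up for illustration; the 0.50 cutoff follows the rain example.

```python
def classify(probabilities, cutoff=0.5):
    """Label a record 'yes' when its probability is strictly above the cutoff."""
    return ["yes" if p > cutoff else "no" for p in probabilities]

# Hypothetical estimated probabilities of rain for four days.
rain_probs = [0.10, 0.50, 0.51, 0.93]

labels = classify(rain_probs)
print(labels)
```

Lowering the cutoff labels more records as positive, which tends to raise sensitivity and lower specificity; raising it does the opposite.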

Which of the following is a classification problem?

Identification of digits ( 0-9) using images of handwritten digits.

Unsupervised or clustering problems

K-means algorithm

Specificity (TNR)

TNR = TN/AN = TN/(TN+FP)

Sensitivity (TPR)

TPR = TP/AP = TP/(TP+FN)

If logit(p) = 0

Then odds of winning =1 and Probability = 0.5
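The relationship between probability, odds, and logit can be checked numerically. The sketch below uses the standard definitions (odds = p/(1-p), logit = natural log of the odds); the function names are illustrative.

```python
import math

def odds_from_prob(p):
    """Odds = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)

def logit(p):
    """Logit = natural log of the odds, defined for 0 < p < 1."""
    return math.log(odds_from_prob(p))

def prob_from_logit(z):
    """Inverse of the logit (the logistic function)."""
    return 1 / (1 + math.exp(-z))

# A logit of 0 corresponds to odds of 1 and a probability of 0.5.
print(prob_from_logit(0), odds_from_prob(0.5), logit(0.5))
```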

Higher AUC

The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.

What do the off-diagonal cells in a confusion matrix tell us?

The number of misclassifications

Multicollinearity is a problem for Logistic Regression.

True

Linear regression is used to estimate the dependent variable in case of a change in independent variables. For example, predict the price of houses.

Whereas logistic regression is used to calculate the probability of an event. For example, classify if tissue is benign or malignant.

Logistic Regression

classification algorithm that estimates the probability p that a record belongs to the positive class, given a set of predictors

Which of the following lines of code will print the model performance in the training set? Assuming that we have already run the following lines of code

classificationSummary(train_y, pred_y_train)

(TP+TN)/(TP+TN+FP+FN)

accuracy

The naive benchmark

assign each record to the majority class

You want to run a linear regression analysis and visually check whether one of the variables in your dataset is normally distributed. You can use a ______________ for this purpose.

histogram

If a categorical variable is ordinal and represented as a string initially, then before you can run any model, it needs to be coded as ____________.

integer values
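One possible way to code an ordinal string variable as integers is an explicit mapping that preserves the order. The category names and the `order` dictionary below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ordinal variable stored as strings.
sizes = pd.Series(["low", "high", "medium", "low"])

# Explicit mapping so the integers respect the natural order low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}
encoded = sizes.map(order)

print(encoded.tolist())
```

Writing the mapping by hand avoids relying on alphabetical order, which would place "high" before "low".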

How many cells will a confusion matrix have if there are m classes?

m^2

Probability

odds/ ( 1+odds)

The Acme Corporation is launching a new line of exclusive widgets, but because of their platinum shielding and unobtainium cores, it's very important to accurately predict demand based on current customer behaviors. You have a large dataset, so it is possible to create independent training and validation sets, which you should do so that you won't _______________.

overestimate model accuracy

Odds

p/(1-p)

You want to load data from a csv file using the read_csv function. To do this you have to first import the __________________ package/library.

pandas

We want to run a multiple linear regression model to predict the outcome variable charges. What code will you use to convert the nominal categorical values into dummy variables?

pd.get_dummies(car_df, drop_first=True)
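A short sketch of what `pd.get_dummies(..., drop_first=True)` does. The contents of `car_df` and the column name `Fuel_Type` are made up for illustration; only the call itself comes from the card.

```python
import pandas as pd

# Hypothetical data frame with one numeric and one nominal categorical column.
car_df = pd.DataFrame({
    "Age": [23, 40, 31],
    "Fuel_Type": ["Diesel", "Petrol", "CNG"],
})

# Each category becomes a 0/1 indicator column; drop_first=True drops the first
# category (CNG here) so the dummies are not perfectly collinear.
dummies_df = pd.get_dummies(car_df, drop_first=True)

print(list(dummies_df.columns))
```

Numeric columns such as `Age` pass through unchanged; only string or categorical columns are expanded.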

The three most effective basic plots are _______________.

scatter plots, line graphs and bar charts

Suppose the actual outcome in your validation set is given as y_valid=[2,3,7]. Then the naive benchmark for the validation set is

y_naive=[4,4,4]
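The answer above follows from predicting the average outcome for every record. A minimal check, using the values from the card (the variable `mean_y` is illustrative):

```python
# Actual outcomes from the card.
y_valid = [2, 3, 7]

# For a numerical outcome, the naive benchmark predicts the mean for every record.
mean_y = sum(y_valid) / len(y_valid)
y_naive = [mean_y] * len(y_valid)

print(y_naive)
```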

