BADM Exam 3
MLR assumes that the outcome is continuous numerical can take any value from
(-infinity, +infinity)
False Postive Rate = 1 - Specificity
FP/(TN+FP)
A classifier that performs well will have a ROC curve that is a perfectly diagonal line.
False
In a multiple linear regression problem, the outcome variable is a/an
continuous numerical variable
A propensity score is used in combination with a _____ to determine class membership.
cutoff value
Suppose that you are interested in the median of a variable in you dataset data_df. Which of the following commands can be used for that purpose?
data_df.describe()
Odds vs. Probability
exponential growth With odds on Y axis and probability on X axis
The columns of a dataset is also known as a(n) ______________.
feature
Logistic Regression is similar to MLR but assumes that the logit of the outcome variable p to be a linear function of predictors
Can be used for both predictive and explanatory model
What is the type of the outcome variable in a classification problem?
Categorical
Two main types of supervised learning methods are regression and
Classification
What is another name for confusion matrix?
Classification matrix
What is the naïve rule for classifying?
Classify the record as a member of the majority class
Assuming a threshold of 0.05, a variable with a p-value=0.02 is not significantly associated with the outcome variable.
False
When fitting a multiple linear regression model, it is best practice to include as many correlated predictors as possible.
False
Odds VS Logit range
Odds: 0 to Infinity Logit: negative infinity to positive infinity
ACME Corporation wants to develop a model to predict whether an employee will leave the organization in the next 6 months. Which of the following algorithms can be applied in this problem?
Logistic Regression
Which of the following algorithms can be used for the classification problems ?
Logistic Regression
Based on the prediction accuracy measures, which model performs better?
Lower the RMSE and MAE is the better.
numerical or regression problems
MLR
Which of the following metrics can not be used to assess Predictive Performance in a regression problem?
Mean Error (ME)
Assume that for a classification problem you have two models with AUC1=0.7 and AUC2=0.8. Which model has a better performance when AUC is used to evaluate performance?
Model 2
Suppose you build a model for classification and from the confusion matrix you observe that TP=40,TN=90,FN=10,FP=10. Then the sensitivity of the classifier is
0.8
Suppose you build a model for classification and from the confusion matrix you observe that TP=40,TN=90,FN=10,FP=10. Then the accuracy of the classifier is
13/15
Misclassification error arises when ____
A record belongs to one class but is classified as another.
You run the code car_df.head() and get the above output. Which of the following variables is not a nominal categorical variable
Age
You run an MLR and obtained the results displayed in the above table. Based on this output, how many of the coefficients are significant?
Any P value given in the chart less than given threshold or (0.05) then variable is significant
False Negative Rate = 1 - Sensitity
FN/(TN+FN)
Estimate the probabailty that each record is belonging to positive class By using threshold to determine yes or no
EX: Will it rain tomorrow yes or no? using threshold .50, if prob higher that .50 then yes it will
Which of the following is a classification problem?
Identification of digits ( 0-9) using images of handwritten digits.
Unsupervised or clustering problems
K-means algorithm
Specificity (TNR)
TNR = TN/AN = TN/(TN+FP)
Sensitivty TRP
TRP = TP/AP = TP/(TP+FN)
If logit(p) = 0
Then odds of winning =1 and Probability = 0.5
Higher AUC
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
What do the off-diagonal cells in a confusion matrix tell us?
The number of misclassifications
Multicollinearity is a problem for Logistic Regression.
True
Linear regression is used to estimate the dependent variable in case of a change in independent variables. For example, predict the price of houses.
Whereas logisticregression is used to calculate the probability of an event. For example, classify if tissue is benign or malignant.
Logistic Regression
classification algorithm that estimates the probability that a record belongs to a positive class p given a set of predictor
Which of the following lines of code will print the model performance in the training set? Assuming that we have already run the following lines of code
classificationSummary(train_y, pred_y_train)
(TP+TN)/(TP+TN+FP+FN)
accuracy
The naive benchmark
assign each record to the majority class
You want to run a linear regression analysis and want to visually check whether one of the variables in your dataset is normally distributed, you can use a ______________ for this purpose.
histogram
If a categorical variable is ordinal and represented as a string initially, then before you can run any model, it needs to be coded as ____________.
integer values
How many cells will a confusion matrix have if there are m classes?
m^2
Probability
odds/ ( 1+odds)
The Acme Corporation is launching a new line of exclusive widgets, but because of their platinum shielding and unobtainium cores, it's very important to accurately predict demand based on current customer behaviors. You have a large dataset, so it is possible to create independent training, and validation sets, which you should do so that you won't _______________.
overestimate model accuracy
Odds
p/(1-p)
You want to load data from a csv file using the read_csv function. To do this you have to first import the __________________ package/library.
pandas
We want to run a multiple linear regression model to predict the outcome variable charges. What code will you use to convert the nominal categorical values into dummy variables?
pd.get_dummies(car_df, drop_first=True)
The three most effective basic plots are _______________.
scatter plots, line graphs and bar charts
Suppose the actual outcome in your validation set is given as y_valid=[2,3,7]. Then the naive benchmark for the validation set is
y_naive=[4,4,4]