ISDS 471 Exam 1
What do we usually observe if the model is overfitted?
Fitting the training data very well but fitting the validation data very badly
Interpretation of the beta coefficients
If the beta coefficient for a variable such as age is -150, then for each one-unit increase in age the selling price decreases by 150, holding the other predictors constant
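A rough Python sketch (not part of the course's Excel workflow; the data are synthetic, with a true slope of -150 on age built in only to mirror the flashcard example) showing how a fitted coefficient is read:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(0, 30, size=200)          # hypothetical house age in years
sqft = rng.uniform(800, 3000, size=200)     # hypothetical square footage
price = 50_000 - 150 * age + 100 * sqft + rng.normal(0, 500, size=200)

model = LinearRegression().fit(np.column_stack([age, sqft]), price)
print(model.coef_)   # approx [-150, 100]: each extra year of age lowers the
                     # predicted price by about 150, holding sqft constant
```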
Stepwise
Like forward selection, but at each step it may also drop non-significant predictors, as in backward elimination. Drawbacks: R^2 is biased high, regression coefficient estimates are biased high in absolute value, confidence intervals are too narrow, and standard errors of the coefficient estimates are biased low
How many dummy variables can we create for a categorical variable with m levels?
m (one dummy per level)
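A minimal pandas sketch (the column name and levels are made up for illustration) showing that m levels give m dummy columns, of which only m - 1 are kept for a regression model:

```python
import pandas as pd

df = pd.DataFrame({"fuel_type": ["gas", "diesel", "electric", "gas"]})
all_dummies = pd.get_dummies(df["fuel_type"])                      # m = 3 columns
model_dummies = pd.get_dummies(df["fuel_type"], drop_first=True)   # m - 1 = 2 columns
print(all_dummies.shape[1], model_dummies.shape[1])                # 3 2
```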
What is homoscedasticity?
The variance of the residuals is constant; check with a scatter plot of residuals versus fitted y. The spread of points around the line should be roughly the same everywhere
What is overfitting?
When the same data are used to build the model and assess its performance; also when validation data are used to choose parameters, or multiple models are assessed on the same validation data
Which of the following is (are) assumption(s) for MLR? Normality Linearity Homoscedasticity All of the above
all of the above
Which of the following is unsupervised learning? - Data visualization - Data reduction - Cluster analysis - All of the above
all of the above
Why do we want to evaluate supervised learning tasks? Compare models Select the tuning parameter Learn the prediction or classification accuracy All of the above
all of the above
What is AUC
Area under the (ROC) curve; the most common summary metric. AUC = 1 means perfect discrimination between classes; AUC = .5 means no better than the naive rule
Which of the following is not supervised learning? - Association rule - Classification - prediction
association rule
AE
average error; gives an idea of systematic over- or under-prediction
For a data set with 200 observations and 200 variables, which of the following variable selection method do you think is inappropriate? -Backward -Best subset -Both
both
How do we check the assumption for normality? QQ plot Histogram Both
both
What type of data to use bar chart for
categorical
What can we use heatmap for?
Check the missing-value pattern (with conditional formatting); check collinearity
What is the naive rule?
Classify all records as belonging to the most prevalent class; used as a benchmark
How do we check the missing value pattern in excel?
conditional formatting --> new rule
What type of data to use histogram for
continuous
How do we check collinearity
Correlation in the Data Analysis ToolPak
Which of the following is not a measure of prediction error? -error rate -average error -MAE -RMSE
error rate
What is ER
error rate = # of misclassified records / total # of records; accuracy = 1 - error rate
FNR
false negative rate = (# of true C1 classified as C0) / (# of true C1); the % of C1 incorrectly classified
FPR
false positive rate = (# of true C0 classified as C1) / (# of true C0); the % of C0 incorrectly classified
What are the 4 variable selection methods for MLR
forward, backward, stepwise, best subset (exhaustive search)
What is collinearity
high correlation among independent variables
What is collinearity?
high correlation among independent variables; use a pairwise correlation matrix and delete one of a pair of variables when their correlation is above .8 or below -.8
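A rough Python equivalent of Excel's Data Analysis > Correlation (the file name "housing.csv" is hypothetical), flagging variable pairs whose correlation exceeds .8 in absolute value:

```python
import pandas as pd

df = pd.read_csv("housing.csv")            # hypothetical data set
corr = df.corr(numeric_only=True)
pairs = corr.where(corr.abs() > 0.8).stack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs)                               # candidate variables to drop
```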
MAD/MAE
mean absolute error (deviation), gives an idea of the magnitude of errors
MAPE
mean absolute percentage error
In an MLR model 1 with 12 predictors, you have adjusted R2 = 0.710 and Mallow's Cp = 13.09. In the other MLR model 2 with 13 predictors, you have adjusted R2 = 0.712 and Mallow's Cp = 11.35. Which model is better? -Model 1 -Model 2
Model 1; its Mallow's Cp (13.09) is close to its number of coefficients (12 predictors + intercept = 13), whereas Model 2's Cp (11.35) is far from its 14 coefficients
If data are partitioned into three parts, what is validation data not used for?
Model evaluation (with three parts, the test data evaluates the final model; validation data are used to compare models and select parameters)
What assumptions do we need in MLR?
normality, linearity, independence, homoscedasticity
Why do we combine categories when there are too many categories for a categorical variable? - Avoid categories with too few observations - Reduce the number of dummy variables - Other two are correct
other two are correct
how many dummy variables do we need? how many dummy variables can we create?
m - 1 are needed (one level serves as the reference category); m can be created
What is ROC
Receiver operating characteristic curve: plots sensitivity on the y-axis against 1 - specificity on the x-axis. Curves closer to the top-left corner are better performers; the comparison curve is the diagonal, which reflects the performance of the naive rule
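A small scikit-learn sketch (outside the Excel/XLMiner workflow used in class) using the 10-record example from later in this deck; roc_curve returns the (1 - specificity, sensitivity) points and roc_auc_score gives the area under that curve:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
y_prob = [.93, .91, .76, .69, .51, .44, .31, .25, .17, .04]

fpr, tpr, cutoffs = roc_curve(y_true, y_prob)
print(roc_auc_score(y_true, y_prob))   # 1 = perfect, .5 = the diagonal naive rule
```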
RMSE
root mean squared error = sqrt( (1/n) * sum of (yhat - y)^2 )
What is linearity?
The relationship between y and each continuous x is linear; check with a scatterplot of y over each continuous x
if cut off increases, what happens to sensitivity, specificity, FPR, FNR
sensitivity decreases, specificity increases, FPR decreases, FNR increases
Confusion matrix
                Predicted 1   Predicted 0
Actual 1             a             b
Actual 0             c             d
Summarizes correct and incorrect classifications from a dataset. a and d give the numbers of correct classifications; b = number of class 1 records incorrectly predicted as 0; c = number of class 0 records incorrectly predicted as 1
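A sketch (my own helper, not from the course) that turns the a/b/c/d cells into the classification measures defined in this deck; the values passed at the end are those from the cutoff-.5 example later on:

```python
def confusion_metrics(a, b, c, d):
    # a = true 1 predicted 1, b = true 1 predicted 0,
    # c = true 0 predicted 1, d = true 0 predicted 0
    n = a + b + c + d
    return {
        "error rate":  (b + c) / n,
        "accuracy":    (a + d) / n,
        "sensitivity": a / (a + b),   # % of class 1 correctly classified
        "specificity": d / (c + d),   # % of class 0 correctly classified
        "FPR":         c / (c + d),
        "FNR":         b / (a + b),
    }

print(confusion_metrics(a=4, b=1, c=1, d=4))
```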
Which dummy variable is usually considered as reference category?
the last category
If given a regression output, which variable should be deleted?
the one with a high p-value, usually higher than .05
What type of data to use line chart for
time series
SSE
sum of squared errors = sum of (yhat - y)^2
Which data set do you conduct the model diagnosis on?
training data
What is independence?
When residuals (error terms) are independent of each other; check with a scatterplot of residuals over fitted y. A violation shows up when points in a small chunk are grouped together above or below the regression line
How many observations can XLminer handle for training data?
10,000 observations
What is the misclassification error rate?
                Predicted 1   Predicted 2   Predicted 3
Actual 1            10             0             2
Actual 2             3             9             3
Actual 3             1             1            11
10/40 (correct classifications = 10 + 9 + 11 = 30, so 10 of the 40 records are misclassified)
What are partitioned data used for?
2 parts: training (trains the model) and validation (evaluates the model). 3 parts: training (trains the model), validation (compares and selects models/parameters), and test (evaluates the final model)
What is the sensitivity when the cutoff is .6?
Actual    Prob(Y=1)
1            .93
1            .91
1            .76
0            .69
1            .51
1            .44
0            .31
0            .25
0            .17
0            .04
3/5 (sensitivity = # of true C1 correctly classified as C1 / # of true C1; everything above .6 is classified as C1, so three 1's are correctly classified, but there are five 1's in total)
Example of the naive rule: a sample has 70% C1's and 30% C0's; what is the error rate for the naive rule?
30% (.3); all observations are classified as C1, so the 30% that are C0 are misclassified
The following 5 questions are based on the distribution chart below.
Actual    Prob(Y=1)
1            .93
1            .91
1            .76
0            .69
1            .51
1            .44
0            .31
0            .25
0            .17
0            .04
Using a cutoff of .5, what is the value of a?
                Predicted 1   Predicted 0
Actual 1             a             b
Actual 0             c             d
4
What is a high correlation
> .8 or < -.8
Outliers
Values > Q3 + 1.5 * IQR or < Q1 - 1.5 * IQR
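A quick check of the 1.5 * IQR rule on made-up numbers:

```python
import statistics

values = [2, 3, 4, 5, 6, 7, 8, 50]
q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles
iqr = q3 - q1
print([v for v in values if v > q3 + 1.5 * iqr or v < q1 - 1.5 * iqr])   # [50]
```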
Variance explained by the model
Adjusted R^2; select the model with the highest value. If two models have similar values, select the one whose Mallow's Cp is closest to its number of coefficients (# of variables + 1)
Best subset (exhaustive)
All possible subsets of predictors are assessed, judged by adjusted R^2. Drawback: computationally intensive; not practical when the number of predictors exceeds about 23
Examples of supervised learning tasks
Classification (predict a categorical target/outcome variable, often binary) and prediction (predict a numerical target/outcome variable)
How do we check the distribution of numeric variables in excel
Plot a box plot or a histogram
True or false: FNR will increase if the cutoff increases from .2 to .8.
Actual    Prob(Y=1)
1            .93
1            .91
1            .76
0            .69
1            .51
1            .44
0            .31
0            .25
0            .17
0            .04
FNR = (# of true C1 classified as C0) / (# of true C1). With a cutoff of .2, FNR = 0; with a cutoff of .8, FNR = 3/5, so the statement is true
Specificity
(# of true C0 classified as C0) / (# of true C0); the % of C0 correctly classified
Sensitivity
(# of true C1 classified as C1) / (# of true C1); the % of C1 correctly classified
Which of the following is not a partial search algorithm? -Best subset selection -forward selection -backward selection -stepwise regression
-best subset selection
Which of the following variable selection method is best? -forward selection -backward selection -Stepwise regression -cannot be determined
-cannot be determined
What is the FPR when the cutoff is .7?
Actual    Prob(Y=1)
1            .93
1            .91
1            .76
0            .69
1            .51
1            .44
0            .31
0            .25
0            .17
0            .04
0/5 (everything above .7 is classified as 1; FPR = # of true C0 classified as C1, which is zero here, divided by # of true C0, which is five since there are five 0's in total)
Using .5 as a cutoff, what is the value of c?
1 (everything above .5 is classified as 1; c asks how many actual 0's are predicted as 1, and there is one actual 0, with probability .69, above .5)
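A small Python sketch (outside the Excel workflow used in class) that reproduces the cutoff questions above: records with Prob(Y=1) at or above the cutoff are classified as 1, then a, b, c, d and the rates follow.

```python
y_true = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
y_prob = [.93, .91, .76, .69, .51, .44, .31, .25, .17, .04]

for cutoff in (.2, .5, .6, .7, .8):
    pred = [1 if p >= cutoff else 0 for p in y_prob]
    a = sum(t == 1 and p == 1 for t, p in zip(y_true, pred))
    b = sum(t == 1 and p == 0 for t, p in zip(y_true, pred))
    c = sum(t == 0 and p == 1 for t, p in zip(y_true, pred))
    d = sum(t == 0 and p == 0 for t, p in zip(y_true, pred))
    print(cutoff, "a,b,c,d =", (a, b, c, d),
          "sensitivity", a / (a + b), "FPR", c / (c + d), "FNR", b / (a + b))
```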
What is normality?
Assumes the residuals are normally distributed; check with a histogram of residuals or a QQ plot. Take a slice of the data: it should be dense in the middle and sparse on the sides, normally distributed around the regression line
What is supervised learning?
Predict a single target or outcome variable (categorical for classification, numerical for prediction) using data where the outcome is known
Measures to evaluate prediction (how well does the model predict new data? Not how well it fits the data it was trained with)
RMSE, MAD/MAE, MAPE, AE, SSE
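A hand computation of these measures on made-up actual and predicted values (my own illustration):

```python
y    = [100, 150, 200, 250]            # actual
yhat = [110, 140, 210, 240]            # predicted
n = len(y)
errors = [yh - yi for yh, yi in zip(yhat, y)]             # yhat - y
AE   = sum(errors) / n                                     # average error (sign kept)
MAE  = sum(abs(e) for e in errors) / n                     # MAD / MAE
MAPE = 100 * sum(abs(e) / yi for e, yi in zip(errors, y)) / n
SSE  = sum(e ** 2 for e in errors)
RMSE = (SSE / n) ** 0.5
print(AE, MAE, round(MAPE, 1), SSE, RMSE)                  # 0.0 10.0 6.4 400 10.0
```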
Measures to evaluate classification
ROC, AUC, ER, Sensitivity, Specificity, FPR, FNR, Confusion matrix
How to determine which variables are important?
Variables with small p-values are important; if a p-value is greater than .05, the variable is not significant
Backward
Start with all predictors and eliminate the least useful one by one; stop when all remaining predictors have a statistically significant contribution. Drawback: computing the initial model with all predictors can be time-consuming and unstable; not good when n < p
Forward selection
Start with no predictors and add them one by one, starting with the largest R^2; stop when an addition is no longer statistically significant. Drawback: may miss pairs or groups of predictors that perform well together but perform poorly as single predictors
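A rough forward-selection sketch (the class uses XLMiner for this; this Python version swaps in adjusted R^2 as the stopping rule instead of a significance test, and the data are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adj_r2(X, y):
    n, p = X.shape
    r2 = LinearRegression().fit(X, y).score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def forward_select(X, y):
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        scores = {j: adj_r2(X[:, chosen + [j]], y) for j in remaining}
        j = max(scores, key=scores.get)
        if scores[j] <= best:            # stop when adding no longer helps
            break
        chosen.append(j); remaining.remove(j); best = scores[j]
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                          # 5 candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)   # only the first two matter
print(forward_select(X, y))                            # typically [0, 1]
```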