Statistical Learning Study
You have a bag of marbles with 64 red marbles and 36 blue marbles. What is the value of the Gini index for that bag?
(0.64)(1-0.64) + (0.36)(1-0.36) = 0.4608
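The arithmetic can be checked with a short sketch (Python, using the marble proportions from the question):

```python
# Gini index for a node: G = sum over classes k of p_k * (1 - p_k).
# Proportions for the bag: 64 red and 36 blue out of 100 marbles.
def gini(proportions):
    """Compute the Gini index from class proportions."""
    return sum(p * (1 - p) for p in proportions)

g = gini([0.64, 0.36])
print(round(g, 4))  # 0.4608
```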
Quadratic Discriminant Analysis (QDA) offers an alternative approach to LDA that makes many of the same assumptions, except that QDA assumes that
each class has its own covariance matrix.
You are fitting a linear model to data assumed to have Gaussian errors. The model has up to p = 5 predictors and n = 100 observations. Which of the following is most likely true of the relationship between Cp and AIC in terms of using the statistic to select a number of predictors to include?
Cp will select the same model as AIC (for least squares models with Gaussian errors, Cp and AIC are proportional to each other).
A fitted model with more predictors will necessarily have a lower Training Set Error than a model with fewer predictors.
False
A model with a high Cp statistic is preferable.
False
If one feature (compared to all others) is a very strong predictor of the class label of the output variable, then all of the trees in a random forest will have this feature as the root node.
False
R^2 is a good measure of model adequacy for a logistic regression model.
False
Suppose you are given a dataset of cellular images from patients with and without cancer. If you are required to train a classifier that predicts the probability that the patient has cancer, you would prefer to use Decision trees over logistic regression.
False
The bootstrap method involves sampling without replacement.
False
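Since the answer hinges on sampling with replacement, a small sketch (Python; the data values are hypothetical) shows why duplicates appear and roughly a third of observations are left out of each bootstrap sample:

```python
import random

random.seed(0)
data = list(range(100))
# A bootstrap sample draws n observations WITH replacement,
# so duplicates appear and some observations are omitted.
boot = [random.choice(data) for _ in data]
n_distinct = len(set(boot))

# The chance a given observation appears at least once is 1 - (1 - 1/n)^n,
# which approaches 1 - 1/e ≈ 0.632 for large n.
n = len(data)
p_included = 1 - (1 - 1 / n) ** n
print(n_distinct, round(p_included, 3))
```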
We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p+1 models, containing 0, 1, 2, ..., p predictors. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1) variable model identified by best subset selection.
False
Logistic regression is a _____ used to model a binary categorical outcome using numerical and categorical predictors.
Generalized linear model
The LASSO, relative to least squares is:
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Logistic regression assumes a:
Linear relationship between continuous predictor variables and the logit of the outcome variable.
We want to predict gender based on annual income and weekly working hours. The training set consists of annual income and weekly working hours for 900 men and 800 women. Which method should one prefer?
Logistic regression
The logistic regression coefficients are usually estimated using the _____
Maximum likelihood estimation
You have a bag of marbles with 64 red marbles and 36 blue marbles. What is the value of the entropy for that bag?
-0.64*log2(0.64) - 0.36*log2(0.36) ≈ 0.943
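The entropy can be evaluated with a quick sketch (Python, base-2 logs, as is conventional for entropy):

```python
from math import log2

def entropy(proportions):
    """Shannon entropy: -sum of p_k * log2(p_k) over the classes."""
    return -sum(p * log2(p) for p in proportions if p > 0)

h = entropy([0.64, 0.36])
print(round(h, 3))  # ≈ 0.943
```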
A logistic regression model was used to assess the association between cardiovascular disease (CVD) and obesity. P is defined to be the probability that a person has CVD, and obesity was coded as 0 = non-obese, 1 = obese, resulting in the model: ln(P/(1-P)) = -2 + 0.7*obesity. What is the log odds for CVD in persons who are obese as compared to not obese?
0.7
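A sketch of the computation (Python; the coefficients are the ones given in the question):

```python
from math import exp

def log_odds(obesity):
    # Fitted model from the question: ln(P/(1-P)) = -2 + 0.7*obesity
    return -2 + 0.7 * obesity

log_odds_ratio = log_odds(1) - log_odds(0)  # obese vs. non-obese
odds_ratio = exp(log_odds_ratio)            # e^0.7, roughly a doubling of odds
print(round(log_odds_ratio, 1), round(odds_ratio, 2))  # 0.7 2.01
```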
You are trying to fit a model and are given p=30 predictor variables to choose from. Ultimately, you want your model to be interpretable, so you decide to use Best Subset Selection. How many different models will you end up considering? (you can leave the answer expressed in terms of p)
2^p = 2^30 models (just over one billion)
Give an example of a supervised learning problem.
A supervised learning problem is one where we have an observed output to guide the search for good predictors. One example is predicting crime rates in a state: you can use several years of historical data to determine which predictors most reliably explain crime rates from one year to the next, and then use that model to predict crime rates in future years.
Which of the following can be used to evaluate the performance of logistic regression model?
AIC
Bagging algorithms attach weights to a set of N weak learners. They re-weight the learners and convert them into strong ones. Boosting algorithms draw N sample distributions (usually with replacement) from an original data set for learners to train on.
False
Give an example of an unsupervised problem.
An unsupervised learning problem is one where we have no output variable to rely on. Instead we can, for example, apply clustering to identify structure in the data set. One example is examining whether certain temperature ranges are associated with better transmission of the cold virus.
Bagging = _____ _____
Bootstrap Aggregating
Which of the following gives the differences between logistic regression and LDA?
If the classes are well separated, the parameter estimates for logistic regression can be unstable. If the sample size is small and the distribution of the features is approximately normal in each class, linear discriminant analysis is more stable than logistic regression.
We want to predict gender based on height and weight. The training set consists of heights and weights for 80 men and 60 women. Which method should one prefer?
LDA
_____ is a phenomenon where a model closely matches the training data such that it captures too much of the noise or error in the data. This results in a model that fits the training data very well, but does not make good predictions under test or in general.
Overfitting
What are the differences between Random Forest (RF) and Boosted Regression Trees (BRT) algorithms?
RF builds multiple independent trees, while BRT builds multiple dependent trees that take into account the fit of the previous tree. RF grows trees in parallel, while BRT is sequential. RF uses the bagging method to select random subsets, and BRT uses the boosting method.
ROC stands for
Receiver Operating Characteristic
How does the bias-variance decomposition of a ridge regression estimator compare with that of ordinary least squares regression?
Ridge has larger bias, smaller variance.
The standard error (SE) of an estimator reflects how it varies under repeated sampling. For simple linear regression:
SE(β̂1)² = σ² / ∑_{i=1}^{n} (x_i − x̄)²
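A numeric sketch of the slope's standard error formula (Python; the x values and σ are hypothetical, with σ assumed known):

```python
# SE(b1_hat)^2 = sigma^2 / sum_i (x_i - xbar)^2
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = 2.0

xbar = sum(xs) / len(xs)                  # 3.0
ssx = sum((x - xbar) ** 2 for x in xs)    # 10.0
se_b1 = (sigma ** 2 / ssx) ** 0.5
print(round(se_b1, 4))  # sqrt(0.4) ≈ 0.6325
```

The more spread out the x values are, the larger the denominator and the smaller the standard error of the slope.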
The ROC curve is obtained by plotting...
Sensitivity vs. (1-Specificity)
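A sketch of both ROC axes at a single threshold, computed by hand (Python; the labels and scores are hypothetical):

```python
# One point on the ROC curve: sensitivity (TPR) vs. 1 - specificity (FPR).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
thr = 0.5
pred = [1 if s >= thr else 0 for s in scores]

tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_true))
tn = sum(p == 0 and t == 0 for p, t in zip(pred, y_true))

sensitivity = tp / (tp + fn)        # y-axis of the ROC curve
one_minus_spec = fp / (fp + tn)     # x-axis of the ROC curve
print(sensitivity, one_minus_spec)
```

Sweeping the threshold from 1 down to 0 traces out the full curve.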
Given a matrix X, the expression UΣV^T denotes the _____ of X
Singular Value Decomposition
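A quick check of the factorization (a sketch assuming NumPy is available; `np.linalg.svd` returns the singular values as a vector rather than as the diagonal matrix Σ):

```python
import numpy as np

# SVD: X = U @ diag(s) @ Vt, with U and Vt having orthonormal columns/rows.
X = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(X, X_rebuilt))  # True
```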
Decision trees such as regression or classification trees are known to _____
Suffer from high variance. They stratify or segment the predictor space into a number of simple regions.
A good strategy is to grow a very large tree T0, and then prune it back in order to obtain a subtree. Cost complexity pruning, also known as weakest link pruning, is used to do this. We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T0 that minimizes

∑_{m=1}^{|T|} ∑_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|

Here |T| indicates the number of terminal nodes of the tree T, R_m is the rectangle (i.e. the subset of predictor space) corresponding to the m-th terminal node, and ŷ_{R_m} is the mean of the training observations in R_m. Imagine that you are doing cost complexity pruning as defined above. You fit two trees to the same data: T1 is fit at α = 1 and T2 is fit at α = 2. Which of the following is true?
T1 will have at least as many nodes as T2.
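The monotone effect of α can be seen with scikit-learn's minimal cost-complexity pruning (a sketch; `ccp_alpha` plays the role of α, and the simulated data and α values are chosen purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * rng.normal(size=200)

# Same data, two values of the pruning parameter.
t1 = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X, y)
t2 = DecisionTreeRegressor(ccp_alpha=0.10, random_state=0).fit(X, y)

# A larger alpha penalizes terminal nodes more heavily, so the
# pruned tree can only get smaller (or stay the same size).
print(t1.tree_.node_count >= t2.tree_.node_count)  # True
```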
You are doing a simulation in order to compare the effect of using Cross-Validation or a Validation set. For each iteration of the simulation, you generate new data and then use both Cross-Validation and a Validation set in order to determine the optimal number of predictors. Which of the following is most likely?
The validation set method will result in a higher variance of optimal number of predictors.
In binary logistic regression:
The dependent variable has exactly two categories.
What do residuals represent?
The difference between the actual Y values and the predicted Y values.
In a simple linear regression model, y=B0 + B1x, what does the B1 represent?
The estimated change in the average of y per unit change in x.
In order to perform Boosting, we need to select some parameters. List 3 of those parameters
The number of trees, the number of splits in those trees, and...
Which of the following can be a stopping rule in fitting a Classification Tree?
The tree is stopped when all groups are relatively homogeneous. The tree is stopped when a predefined maximum number of splits is reached.
Which one of the following is the main reason for pruning a Decision tree?
To avoid overfitting the training set.
Some of the advantages of decision tree models are:
Trees closely mirror the human decision-making process. Trees are easy to explain. Trees don't require dummy variables to model qualitative variables. Trees can be displayed graphically and are easily interpreted by non-experts.
Adjusted R^2 aims to penalize models that include unnecessary variables
True
In simple linear regression, the square of the correlation between X and Y (that is, r^2) and the fraction of variance explained (that is, R^2) match
True
Is logistic regression a supervised machine learning algorithm?
True
The link function of linear regression is the identity function (i.e. y=y), whereas the logit is the link function for logistic regression.
True
We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p+1 models, containing 0, 1, 2, ..., p predictors. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1) variable model identified by forward stepwise selection.
True
When using LASSO, normalizing your input features influences the predictions.
True
False Negative (FN) rate is also known as _____
Type II Error
Which of the following is NOT a benefit of the sparsity imposed by the Lasso?
Using the Lasso penalty helps to decrease the bias of the fits.
While doing a homework assignment, you fit a Linear Model to your data set. You are thinking about changing the Linear Model to a Quadratic one. Which of the following is most likely true?
Using the Quadratic Model will decrease the Bias of your model.
Given an ROC curve, we can use the _____ as an assessment of the predictive ability of the model.
area under the curve (AUC)
If we want to build a logistic regression model in R, we can use the function:
glm() with the option 'family = binomial'
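For comparison, a rough Python analogue of R's `glm(..., family = binomial)` (a sketch with hypothetical toy data, assuming scikit-learn is available; note sklearn's `LogisticRegression` applies L2 regularization by default, unlike `glm()`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one numeric predictor, binary outcome (hypothetical).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# P(y = 1 | x), like predict(fit, type = "response") in R.
probs = clf.predict_proba(X)[:, 1]
print(probs[0] < 0.5 < probs[-1])  # True: probability rises with x
```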
Tree/Rule based classification algorithms generate _____ rule to perform the classification
if-then
In K-Nearest Neighbors, the choice of K can have a drastic effect on the yielded classifier. Too low of a K yields a classifier that...
is too flexible, has too high a variance, and has low bias
(in statistical learning) LASSO stands for
least absolute shrinkage and selection operator
Predicting how many points a student can get in a competitive exam based on hours of study can be solved using _____ regression model.
linear
Whether a student will pass or fail in the competitive exam based on hours of study can be solved using _____ regression model
logistic
For 0 < p < 1, ln(p/(1-p)) is called the _____
logit function
In simple linear regression, the least squares approach chooses ^B0 and ^B1 to _____
minimize the RSS
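The RSS-minimizing coefficients have a closed form; a quick sketch (Python, with hypothetical data lying near y = 2x):

```python
# Closed-form least squares for simple linear regression:
#   b1 = sum((x - xbar) * (y - ybar)) / sum((x - xbar)^2)
#   b0 = ybar - b1 * xbar
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar
print(round(b1, 4), round(b0, 4))  # slope near 2, intercept near 0
```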
For a classification tree, predictions are made based on the notion that each observation belongs to the _____ of the training observations in the region to which the observation belongs.
most commonly occurring class.
A frequent problem in estimating logistic regression models is a failure of the likelihood maximization algorithm to converge. In most cases this failure is a consequence of data patterns known as _____
complete or quasi-complete separation
To present the results of a logistic regression model, it is often helpful to use graphs of _____
predicted probabilities
Ridge Regression
reduces variance at the expense of higher bias.
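The shrinkage effect can be seen directly (a sketch assuming scikit-learn, with simulated data and an arbitrary penalty strength):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) + rng.normal(size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The l2 penalty shrinks the coefficient vector toward zero relative to OLS,
# trading a little bias for lower variance.
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```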
The _____ R package can be used with the caret package to train tree-based models.
rpart
When creating a logistic regression model in addition to the accuracy of the classifier, it is also important to check the values of _____
the log-odds
_____ is one example of a non-parametric method.
thin-plate spline
LASSO can be interpreted as least squares linear regression where
weights are regularized with the l1 norm.
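Unlike the ridge (l2) penalty, the l1 penalty sets coefficients exactly to zero; a sketch (assuming scikit-learn, with simulated data where only the first two features matter):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
# Only features 0 and 1 carry signal; the other 8 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
# The l1 norm penalty drives many coefficients exactly to zero,
# performing variable selection.
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(n_zero)  # most of the 8 irrelevant coefficients are exactly 0
```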