Week 9
For an independent binary classifier, what is the minimum accuracy threshold that the classifier must exceed?
0.5000
Suppose you have 5 independent classifiers, each with 60% accuracy. What is the majority-vote accuracy?
0.6826
Suppose you have 5 independent classifiers, each with 80% accuracy. What is the majority-vote accuracy?
0.9421
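Both answers can be checked directly in R (a sketch): a majority of 5 classifiers means at least 3 are correct, so sum the binomial tail probabilities.

    # P(at least 3 of 5 independent classifiers are correct)
    sum(dbinom(3:5, size = 5, prob = 0.6))   # 0.68256 -> 0.6826
    sum(dbinom(3:5, size = 5, prob = 0.8))   # 0.94208 -> 0.9421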
What is the lower bound rule-of-thumb for B, the number of bootstrap samples?
50
a. Reduces overfitting by averaging predictions b. Robust to outliers and noisy data
Advantages of Bagging
a. Effective for improving the performance of weak learners b. Can achieve high accuracy
Advantages of Boosting
a. High accuracy and generalization b. Robust to overfitting and noisy data
Advantages of Random Forests
Simplicity, flexibility, and applicability to a wide range of statistical problems. Requires fewer distributional assumptions and is computationally efficient.
Advantages of the bootstrap
By training multiple models on bootstrapped samples from the original data and averaging their predictions to reduce variance.
Bagging for regression
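A minimal R sketch of the idea (the helper name bagged_predict and the choice of rpart trees are illustrative, not from the course):

    library(rpart)
    # Fit B trees, each on its own bootstrap sample, then average the predictions
    bagged_predict <- function(formula, data, newdata, B = 50) {
      preds <- sapply(1:B, function(b) {
        boot_idx <- sample(nrow(data), replace = TRUE)   # bootstrap resample of rows
        fit <- rpart(formula, data = data[boot_idx, ])
        predict(fit, newdata)                            # one tree's predictions
      })
      rowMeans(preds)   # averaging across trees reduces variance (newdata with >1 row assumed)
    }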
Is warranted when there is a need to improve the performance of weak learners by sequentially giving more emphasis to misclassified examples.
Boosting methodology
Relies on the diversity and independence of base models. The more diverse the models, the better the ensemble's performance.
Characteristics of bagging models
Is to reduce overfitting, increase model robustness and improve predictive performance by aggregating the strengths of multiple models.
Motivation for ensembles
What happens in random forests when categorical data variables have different numbers of levels?
bias in favor of variables with more levels
For ensemble learning types, which of the following is based on reweighting the training data?
boosting
If you conduct an exhaustive survey of all elements of a population your result is a __________.
census
By training multiple models on bootstrapped samples and combining their predictions through voting or averaging to reduce overfitting and improve accuracy.
Bagging for classification
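For classification the aggregation step is a vote rather than an average; a hedged variant of the regression sketch above, again assuming rpart:

    bagged_classify <- function(formula, data, newdata, B = 50) {
      votes <- sapply(1:B, function(b) {
        boot_idx <- sample(nrow(data), replace = TRUE)
        fit <- rpart(formula, data = data[boot_idx, ], method = "class")
        as.character(predict(fit, newdata, type = "class"))
      })
      # Majority vote: the most frequent predicted class for each observation
      apply(votes, 1, function(v) names(which.max(table(v))))
    }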
Reducing variance by averaging predictions from diverse models. The aggregation of independent models helps mitigate error and provides more robust predictions.
Bagging for sample data
____ is easily parallelized, ______ is not.
Bagging | Boosting
The moniker 'Bagging' comes from ______.
Bootstrap Aggregation
a. Sensitive to noisy data and outliers b. Computationally more expensive
Disadvantages of Boosting
a. Less interpretable compared to individual decision trees b. Computational complexity increases with the number of trees.
Disadvantages of Random Forests
Sensitivity to outliers and potential for overfitting the observed data. Also, may not perform well when the sample size is small.
Disadvantages of the bootstrap
Combine the predictions of multiple models to improve overall performance. They leverage the diversity among individual models to achieve better accuracy and generalization.
Ensembles
Example weights are adjusted based on the model's performance. Misclassified examples are given higher weights to emphasize their importance in subsequent iterations.
How example weights interact with models
Don't require base models to be highly accurate individually; they focus on improving performance by emphasizing difficult-to-classify instances.
How accurate must boosting models be?
Weights the predictions of individual models based on their performance, giving higher weight to models that contribute more to correct predictions.
How does boosting weight voting?
Diversity among individual models
Key feature of ensemble models
The ability to leverage the strengths of different models for improved predictive accuracy.
Key feature of ensemble models
What were the two features (variables) used in the bootstrap library's law school data to compute correlations?
LSAT score and GPA
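The course calculation presumably resembles this sketch using the bootstrap package's law data (15 schools, columns LSAT and GPA); the nboot value here is an arbitrary choice:

    library(bootstrap)   # Efron & Tibshirani's package, which ships the law data
    theta <- function(idx) cor(law$LSAT[idx], law$GPA[idx])
    out <- bootstrap(1:15, nboot = 1000, theta)
    sd(out$thetastar)    # bootstrap estimate of the correlation's standard error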
Conduct classification by building multiple decision trees on bootstrapped samples and aggregating their predictions through voting.
Random Forests classification
By using a random subset of features at each split, ensuring diversity among the trees and reducing overfitting.
Random Forests perform data splitting
Used in classification tasks where the goal is to assign observations to distinct classes or categories.
Random Forests used in Classification
Combine multiple decision trees, making them an ensemble method.
Random forests
Require diverse base models with some degree of independence among them; diversity is crucial for capturing different aspects of the underlying data patterns.
Requirements for ensembles
Involves resampling from the observed data to create new datasets. The relationship lies in using the distribution of statistics from these resampled datasets to make inferences about the population parameter; this provides a way to estimate the sampling distribution of a statistic.
The relationship between the bootstrap, the original statistic and the population parameter
Use multiple base models, which can be of the same or different types and common techniques include bagging and boosting.
Use of ensembles
Include obtaining confidence intervals, estimating standard errors, and assessing the variability of sample statistics.
Use of the bootstrap to improve estimators
Used for estimating confidence intervals, standard errors, bias correction, and constructing sampling distributions for various statistics.
Uses of the bootstrap in estimation
If you have a non-representative sample to use for bootstrapping what should you do about it?
abandon it and get a representative sample
Ensemble models _______________ to produce ________________
aggregate multiple weak learners | a strong learner
If we repeatedly sample a distribution and compute arithmetic means, the sampling distribution of those means will be ____________ distribution.
an empirical
In order to construct a normal sampling distribution of arithmetic means, the central limit theorem _____________ the underlying distribution be normal.
does not require that
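A quick simulation makes the point: sample means from a strongly skewed exponential population still produce an approximately normal sampling distribution.

    set.seed(1)
    means <- replicate(5000, mean(rexp(30)))  # 5000 means, each from n = 30 skewed draws
    hist(means)                               # roughly bell-shaped despite the skewed parent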
A combination of multiple individual models
ensemble model
In boosting we ______ the weights of examples not correctly learned and ______ the weights of examples correctly learned.
increase | decrease
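An AdaBoost-style sketch of one such update (the toy vectors y, pred, and w are hypothetical, not course data):

    y    <- c(1, 1, -1, -1, 1)              # true labels
    pred <- c(1, -1, -1, 1, 1)              # this round's predictions (2 mistakes)
    w    <- rep(1/5, 5)                     # current example weights
    err  <- sum(w[pred != y])               # weighted error rate
    alpha <- 0.5 * log((1 - err) / err)     # this model's vote weight
    w <- w * exp(alpha * ifelse(pred != y, 1, -1))  # up-weight mistakes, down-weight the rest
    w <- w / sum(w)                         # renormalize to a probability distribution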
The bootstrap distribution is _______________ the underlying population.
not identical to
The random choice of a subset of features is used in __________.
random forests
Which technique uses double randomization?
random forests
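Both randomizations are exposed as parameters in the randomForest package, for example:

    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris,
                        ntree = 500,  # randomization 1: a bootstrap sample of rows per tree
                        mtry = 2)     # randomization 2: 2 random candidate features per split
    fit                               # prints the out-of-bag error estimate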
You get good results from a bootstrap when ____________.
the original sample is a representative sample
The R function sapply() requires ______________.
two parameters
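Those two required arguments are X, the object to iterate over, and FUN, the function to apply:

    sapply(X = 1:5, FUN = function(x) x^2)   # returns 1 4 9 16 25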
The standard error produced by the law school correlation calculations _________.
will vary from machine to machine
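That variation comes from the random resampling; fixing the RNG seed (continuing the law-school sketch above, so theta is assumed already defined) makes the result reproducible:

    set.seed(42)                       # same seed, same resamples, same standard error
    out <- bootstrap(1:15, nboot = 1000, theta)
    sd(out$thetastar)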