Week 9


For an independent binary classifier, what is the minimum accuracy threshold that the classifier must exceed?

0.5000

Suppose you have 5 independent classifiers, each with 60% accuracy. What is the majority-vote accuracy?

0.6826

Suppose you have 5 independent classifiers, each with 80% accuracy. What is the majority-vote accuracy?

0.9421
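
Both figures follow from the binomial distribution: a majority of 5 voters means 3 or more correct votes. A minimal R check (majority_vote_accuracy is an illustrative name, not a library function):

    # Probability that a majority of n independent classifiers,
    # each correct with probability p, votes for the right class
    majority_vote_accuracy <- function(n, p) {
      k <- ceiling((n + 1) / 2)   # smallest winning majority, e.g. 3 of 5
      sum(dbinom(k:n, size = n, prob = p))
    }

    majority_vote_accuracy(5, 0.6)   # 0.68256
    majority_vote_accuracy(5, 0.8)   # 0.94208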

What is the lower bound rule-of-thumb for B, the number of bootstrap samples?

50

a. Reduces overfitting by averaging predictions
b. Robust to outliers and noisy data

Advantages of Bagging

a. Effective for improving the performance of weak learners
b. Can achieve high accuracy

Advantages of Boosting

a. High accuracy and generalization
b. Robust to overfitting and noisy data

Advantages of Random Forests

Simplicity, flexibility, and applicability to a wide range of statistical problems. Requires fewer distributional assumptions and is computationally efficient.

Advantages of the bootstrap

By training multiple models on bootstrapped samples from the original data and averaging their predictions to reduce variance.

Bagging for regression
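
A minimal sketch of bagged regression in R, assuming rpart trees as the base learner and using B = 50 bootstrap samples, matching the rule of thumb above (bagged_regression is an illustrative name):

    library(rpart)   # regression trees; recommended package bundled with R

    # Fit B trees on bootstrap samples and average their predictions
    bagged_regression <- function(formula, data, newdata, B = 50) {
      preds <- replicate(B, {
        idx <- sample(nrow(data), replace = TRUE)   # bootstrap sample of rows
        fit <- rpart(formula, data = data[idx, ])
        predict(fit, newdata = newdata)
      })
      rowMeans(preds)   # averaging across trees reduces variance
    }

    set.seed(1)
    bagged_regression(mpg ~ ., mtcars, mtcars[1:3, ])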

Is warranted when there is a need to improve the performance of weak learners by sequentially giving more emphasis to misclassified examples.

Boosting methodology

Relies on the diversity and independence of base models. The more diverse the models, the better the ensemble's performance.

Characteristics of bagging models

Is to reduce overfitting, increase model robustness and improve predictive performance by aggregating the strengths of multiple models.

Motivation for ensembles

What happens in random forests when categorical data variables have different numbers of levels?

bias in favor of variables with more levels

For ensemble learning types, which of the following is based on reweighting the training data?

boosting

If you conduct an exhaustive survey of all elements of a population your result is a __________.

census

By training multiple models on bootstrapped samples and combining their predictions through voting or averaging to reduce overfitting and improve accuracy.

Bagging for classification

Reducing variance by averaging predictions from diverse models. The aggregation of independent models helps mitigate error and provides more robust predictions.

Bagging for sample data

____ is easily parallelized, ______ is not.

Bagging | Boosting

The moniker 'Bagging' comes from ______.

Bootstrap Aggregation

a. Sensitive to noisy data and outliers
b. Computationally more expensive

Disadvantages of Boosting

a. Less interpretable compared to individual decision trees
b. Computational complexity increases with the number of trees

Disadvantages of Random Forests

Sensitivity to outliers and potential for overfitting the observed data. Also, may not perform well when the sample size is small.

Disadvantages of the bootstrap

Combine the predictions of multiple models to improve overall performance. They leverage the diversity among individual models to achieve better accuracy and generalization.

Ensembles

Example weights are adjusted based on the model's performance. Misclassified examples are given higher weights to emphasize their importance in subsequent iterations.

Example weights interaction with models

Don't require base models to be highly accurate individually; they focus on improving performance by emphasizing difficult-to-classify instances.

How accurate must boosting models be?

Weights the predictions of individual models based on their performance, giving higher weight to models that contribute more to correct predictions.

How does boosting weight voting?

Diversity among individual models

Key feature of ensemble models

The ability to leverage the strengths of different models for improved predictive accuracy.

Key feature of ensemble models

What were the two features (variables) used in the bootstrap library's law school data to compute correlations?

LSAT score and GPA
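
A sketch of that calculation, following the bootstrap package's documented idiom of resampling row indices (nboot = 1000 is an arbitrary choice here):

    library(bootstrap)   # Efron & Tibshirani's package; includes the law data

    data(law)
    theta <- function(idx) cor(law$LSAT[idx], law$GPA[idx])
    out <- bootstrap(1:nrow(law), nboot = 1000, theta)
    sd(out$thetastar)   # bootstrap standard error of the correlation;
                        # without a fixed seed this varies from run to run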

Conduct classification by building multiple decision trees on bootstrapped samples and aggregating their predictions through voting.

Random Forests classification
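
A minimal classification sketch, assuming the randomForest package and R's built-in iris data:

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    predict(rf, iris[c(1, 51, 101), ])   # each prediction is the majority
                                         # vote across the 500 trees

Note the double randomization: each tree is grown on a bootstrap sample of the rows, and each split considers only a random subset of the features (the mtry parameter).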

By using random subsets of features for each tree, ensuring diversity, and reducing overfitting.

Random Forests perform data splitting

Used in classification tasks where the goal is to assign observations to distinct classes or categories.

Random Forests used in Classification

Combine multiple decision trees, making them an ensemble method.

Random forests

Require diverse base models with some degree of independence among them; diversity is crucial for capturing different aspects of the underlying data patterns.

Requirements for ensembles

Involves resampling from the observed data to create new datasets. The relationship lies in using the distribution of statistics from these resampled datasets to make inferences about the population parameter: the bootstrap provides a way to estimate the sampling distribution of a statistic.

The relationship between the bootstrap, the original statistic and the population parameter

Use multiple base models, which can be of the same or different types; common techniques include bagging and boosting.

Use of ensembles

Uses to improve estimators include obtaining confidence intervals, estimating standard errors, and assessing the variability of sample statistics.

Use of the bootstrap to improve estimators

Used for estimating confidence intervals, standard errors, bias correction, and constructing sampling distributions for various statistics.

Uses of the bootstrap in estimation

If you have a non-representative sample to use for bootstrapping what should you do about it?

abandon it and get a representative sample

Ensemble models _______________ to produce ________________

aggregate multiple weak learners | a strong learner

If we repeatedly sample a distribution and compute arithmetic means, the sampling distribution of those means will be ____________ distribution.

an empirical

In order to construct a normal sampling distribution of arithmetic means, the central limit theorem _____________ the underlying distribution be normal.

does not require that
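
A quick R simulation of both points: the underlying exponential distribution is strongly skewed, yet the distribution of sample means comes out approximately normal:

    set.seed(1)
    means <- replicate(10000, mean(rexp(30, rate = 1)))
    hist(means, breaks = 50)   # roughly bell-shaped around the true mean of 1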

A combination of multiple individual models

ensemble model

In boosting we ______ the weights of examples not correctly learned and ______ the weights of examples correctly learned.

increase | decrease
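
A sketch of this update in the AdaBoost style (update_weights is an illustrative name; err is the current model's weighted error rate):

    # Increase weights on misclassified examples, decrease on correct ones
    update_weights <- function(w, misclassified, err) {
      alpha <- 0.5 * log((1 - err) / err)   # also the model's voting weight
      w <- w * exp(ifelse(misclassified, alpha, -alpha))
      w / sum(w)   # renormalize so the weights sum to 1
    }

    w <- rep(1/5, 5)
    update_weights(w, misclassified = c(TRUE, FALSE, FALSE, TRUE, FALSE),
                   err = 0.4)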

The bootstrap distribution is _______________ the underlying population.

not identical to

The random choice of a subset of features is used in __________.

random forests

Which technique uses double randomization?

random forests

You get good results from a bootstrap when ____________.

the original sample is a representative sample

The R function sapply() requires ______________.

two parameters
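
In practice those are a vector (or list) to iterate over and a function to apply:

    sapply(1:5, sqrt)      # 1.000 1.414 1.732 2.000 2.236
    sapply(mtcars, mean)   # column means of a data frame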

The standard error produced by the law school correlation calculations _________.

will vary from machine to machine

