Quant II Final


1. What is a key difference between boosting and methods like bagging and random forests? A) Boosting models are always more accurate. B) In boosting, subsequent models are independent of each other. C) In boosting, subsequent models use information from previous trees D) Boosting models cannot handle outliers.

(C) In boosting, subsequent models use information from previous trees

Some strengths of neural networks are:

1. Ability to Model Complex Non-linear Relationships: Neural networks excel at identifying and modeling complex patterns and relationships in data, which may be difficult or impossible to capture with traditional statistical methods.
2. Flexibility and Adaptability: They can be applied to a wide range of problems, from image and speech recognition to time series prediction and natural language processing.
3. High Tolerance to Noisy Data: Neural networks can handle imperfect data and are relatively robust to noise, making them useful in real-world applications where data may be incomplete or messy.
4. Feature Extraction: In deep learning, neural networks can automatically detect and extract relevant features from raw data, eliminating the need for manual feature engineering.
5. Continuous Improvement: Given more data, neural networks can continually improve their performance, making them ideal for applications where data is constantly being generated.

Stopping Criteria

1. All data in the node are of the same class. No need to keep splitting here as the node is already perfectly pure.
2. A specified tree size limit has been met. Very deep trees tend to overfit the data, so stopping the tree growing too deep is one way to control for this.
3. The number of samples in a leaf node falls below a minimum value. Here, there are only a couple of samples present, so adding additional splits may be focusing too much on individual patterns from the training data, again causing the model to overfit.
4. None of the possible splits are above a specified information gain threshold.

Some of the weaknesses of decision trees are:

1. Choosing splitters is biased towards features with a large number of levels.
2. Very easy for the model to overfit or underfit the data.
3. Reliance on splits involving only a single feature is limiting.
4. Large trees can be difficult to interpret or understand.
5. Small changes in the data result in large changes to the model; that is, trees are a high variance method.

The boosting process works as follows:

1. Create a naïve model.
2. Calculate the errors of the model.
3. Re-weight incorrect samples and fit a new model.
4. Add the model to the ensemble.
5. Repeat steps 2-4 until the model converges (no more improvement) or a set number of trees is reached.

To make the predictions in each pass through step 2, all the models currently in the ensemble are used (see the sketch below).
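A minimal sketch of this loop using scikit-learn's AdaBoostClassifier; the synthetic dataset and the settings are illustrative, not from the course:

```python
# AdaBoost follows the boosting recipe: each new tree up-weights the samples
# the current ensemble misclassifies (steps 2-4), stopping after n_estimators
# trees (step 5). Data here is a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

boost = AdaBoostClassifier(n_estimators=100, random_state=42)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))  # accuracy of the full ensemble
```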

Image Classification works as follows:

1. Define a set of target classes.
2. Provide the model with a set of labelled images.
3. Train the model to recognize the classes based on the images it is fed.

An image can be represented as a matrix of pixel values; we previously saw this with the handwritten digits dataset. To extend this to colour photos, we get three values for each pixel, corresponding to red, green, and blue.

CNN Parameters

1. Depth - The number of filters we apply to the image; each filter produces an output matrix that is used for the next step of training.
2. Stride - The number of pixels we slide our filter matrix over the input image. In the example on 2-1 we used a stride of 1 and moved our filter 1 pixel at a time. We could instead set this to 2 and jump two pixels each time, in which case the output will be a smaller matrix.
3. Zero padding - Controls whether a border of 0s is added around the image. This lets us control the size of the output matrix.

Some of the strengths of decision trees are:

1. Does reasonably well on most problems.
2. Handles numeric and categorical features well.
3. Does well with missing data.
4. Ignores unimportant features: they do not get chosen to split the data, in effect performing internal feature selection.
5. Useful for both large and small datasets.
6. Output is easy to understand and interpret.

The process for fitting a random forest model is:

1. Generate B bootstrap samples of the data.
2. Fit B trees to the data:
   a. Whenever a split is considered, only a random sample of m variables are chosen as possible split candidates from the p variables that we have available.
   b. Each split can only use one of the m variables.
   c. A fresh sample of m variables is selected at each split.
3. To make predictions, the results from each of the B trees are again averaged to produce a final prediction.
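A minimal sketch of this process with scikit-learn's RandomForestClassifier; n_estimators plays the role of B, and max_features='sqrt' sets m = sqrt(p). The data is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 500 bootstrap samples/trees; each split considers a random subset of
# sqrt(p) features, and predictions aggregate votes across the B trees.
rf = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=0)
rf.fit(X, y)
preds = rf.predict(X)
```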

Some things to note about training and test sets:

1. Overfitting - This is when the model is tuned so it has very high accuracy/low error on the training data. In this case the model has likely learned the peculiarities of the training data and will not generalize well to new data.
2. If the model has zero test error then it is likely that the model has been trained on the test set.
3. When selecting data for training and test sets, we want an equal representation of the classes in the training and test data. This is particularly important when dealing with imbalanced data, as we need to ensure that the minority class appears in both the training and test data.
4. We also want to ensure that the test data set is large enough to give statistically meaningful results.

Some strengths of bagging models are:

1. Perform relatively well in the presence of outliers.
2. Reduces the variance seen in a single tree and often gives superior results.
3. Avoids overfitting due to the randomness used when adding additional trees.
4. Good at modelling non-linear relationships.

This can be a challenging problem as many different aspects of an image can vary besides the class the picture belongs to. Some of these are:

1. Position of the object
2. Background of the object
3. Ambient lighting
4. Camera angle
5. Camera focus

So we need to think about what we can change from tree to tree, some options here would be:

1. The raw data used to build the tree. As trees are a high variance method, this will lead to large changes in the trees.
2. The size/depth of the trees.
3. The dependency on previous trees.

Some weaknesses of neural networks are:

1. Requirement of Large Amounts of Data: Neural networks, especially deep ones, require large amounts of training data to perform effectively.
2. Black Box Nature: Neural networks, particularly deep learning models, are often considered "black boxes" because it can be difficult to interpret how they arrive at a specific decision or prediction.
3. Overfitting Risk: They can overfit to training data, especially if it is noisy or if the network is too complex. This means they perform well on training data but poorly on unseen data.
4. Computational Intensity: Training neural networks, especially deep networks, can be computationally intensive and time-consuming, requiring significant hardware resources.
5. Dependence on Quality of Data: Their performance is heavily dependent on the quality of the input data. Garbage in, garbage out is a principle that strongly applies.
6. Regularization and Tuning: They require careful tuning of parameters and regularization techniques to perform optimally and avoid overfitting.

Data Augmentation Options

1. Rotating the image
2. Shifting the image vertically or horizontally
3. Flipping the image
4. Zooming in or out of the image

This both increases the training sample size and exposes the CNN to more variations of images of each class.
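A minimal sketch of these four transformations using Keras' ImageDataGenerator (assumes TensorFlow 2.x; the ranges below are illustrative choices, not recommendations):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # 1. rotate the image by up to 20 degrees
    width_shift_range=0.1,   # 2. shift horizontally by up to 10%
    height_shift_range=0.1,  # 2. shift vertically by up to 10%
    horizontal_flip=True,    # 3. flip the image
    zoom_range=0.1,          # 4. zoom in or out by up to 10%
)
# datagen.flow(X_train, y_train, batch_size=32) then yields augmented batches.
```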

Tuning Other Parameters Process

1. Select sets of possible parameter values.
2. Iterate through each combination of parameters, each time fitting a model and estimating the test error.
3. Select the combination of parameter values that leads to the lowest error.
4. Use these parameters in the final model.
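A minimal sketch of this process using scikit-learn's GridSearchCV; the estimator and candidate values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {                        # step 1: sets of possible parameter values
    'n_estimators': [100, 300],
    'max_features': ['sqrt', 'log2'],
}
# steps 2-3: fit a model for each combination, estimate error by 5-fold
# cross-validation, and keep the best-scoring combination
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)            # step 4: use these in the final model
```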

Some of the weaknesses of boosting are:

1. Sensitive to outliers - If we are less interested in correctly predicting outliers, then this can be a weakness of boosting as it will pay a lot of attention to them.
2. Slow to train and tune.
3. Challenging to run in parallel since the trees are built sequentially.

There are some problems that can surface from splitting our data into training and test sets.

1. Some samples are easier/harder to predict than others. Depending on whether these samples are in the training or test set, this can lead to the over/underestimation of test error.
2. Substantial portions of the data must be reserved for testing and validating the model. If dealing with smaller datasets this can lead to sample size issues.

Some of the strengths of boosting are:

1. Strong prediction power - Usually boosting will outperform random forest and bagging models.
2. Resilient to overfitting - This is dependent on the model being well tuned, and boosting is less resilient than both bagging and random forest.
3. Sensitive to outliers - Boosting is good at accurately predicting outlier samples as it learns to predict them; if we are interested in correctly predicting the outliers then this is a strength.

Some things to keep in mind when tuning hyper parameters:

1. The optimal set of parameters will vary from problem to problem.
2. The default values for many algorithms work well and are often worth trying as a first approach.
3. Tuning can be computationally expensive and take considerable time, but can often lead to increases in model performance.

Some weaknesses of bagging models are:

1. There is a substantial loss of transparency and model interpretability.
2. Can struggle to capture linear relationships.
3. If there is a single very strong predictor in a dataset, this can dominate the model. That is, this feature may be chosen as the first split in every tree, which reduces the differences between them and so fails to reduce the variance.

Model Building Steps

1. Train/fit the model on our training data.
2. Evaluate the model's performance based on the validation set or cross-validation.
3. Identify any hyperparameter changes to be made based on validation set/cross-validation performance. If any changes are to be made, return to step 1.
4. Select the best model based on the validation set/cross-validation.
5. Evaluate the model using the test set.

Generalization can be accomplished by splitting the data available into parts:

1. Training data - We use this subset of the data to train our model; often about 80% of the data is used for this purpose.
2. Test data - We use this subset of the data to evaluate the performance of our model; often about 20% is used for this.

When parameter tuning, a third split can be added, which is used to identify the optimal hyperparameters for the model, though cross-validation can also be used for this purpose. A minimal sketch of such a split is below.
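A hedged example of the 80/20 split using scikit-learn; the dataset is a synthetic stand-in, and stratify=y keeps class proportions equal across the two sets, which matters for imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,   # reserve 20% of the data for testing
    stratify=y,      # keep class proportions equal in both sets
    random_state=0)
```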

When trying to select the learning rate there are a couple of things to keep in mind:

1. Training error should decrease, steeply at first and then more slowly, until the slope of the curve approaches or reaches 0.
2. If the training error does not converge, then the number of iterations should be increased.
3. If the training loss decreases too slowly, then increase the learning rate.
4. Lowering the learning rate and increasing the number of iterations is often a good combination.

Cross-validation works as follows:

1. We split the data into k subsets.
2. We then fit the model k times, each time using one subset of the data as the test set and all remaining subsets as our training data.
3. We can then calculate the average error across all of the different test sets.

This results in every sample appearing in a test set once and being included in the training data k-1 times.
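A minimal sketch of this with scikit-learn's cross_val_score (k = 5 here; the model and data are illustrative stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# cv=5 splits the data into 5 subsets; each subset serves as the test set once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average performance across the k test sets
```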

Random Forest Rationale

A common question when encountering random forests is how this process improves over bagging models, and why choosing the best splitter at each point is not optimal. Suppose we have a dataset with a single strong predictor and a number of moderately strong predictors. When we fit a bagging model to this dataset, the strong predictor will be used as the initial splitter for many or all of the trees that are built. This results in the bagged trees being very similar to each other, and so the predictions coming from each tree will be similar. In this case, as we are averaging very similar models, the reduction in variance will not be as large, which hurts our test set performance. Random forests, by forcing each split to consider only a subset of possible predictors, overcome this problem. On average, each predictor will only be considered for m/p of the splits in the model. That is, the strong predictor will be left out of (p − m)/p of the splits, which allows the other moderately strong predictors to play a role in the model. This process decorrelates the trees, making the averaged predictions less variable and therefore more reliable.

Convolutional Neural Networks

A convolutional neural network (CNN) can be used to progressively extract higher-level representations of the image content. Instead of having to derive features here, a CNN uses the raw pixel data of the image as input and learns how to extract relevant features and determine what class these features are representing.

Pooling

After the convolution is applied, a ReLU transformation is applied to the convolved features to introduce non-linearity. After the ReLU has been applied there is a pooling step, where the convolved feature is downsampled to reduce the number of dimensions. This improves processing time.
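A minimal sketch of this convolution, ReLU, pooling sequence in Keras; the layer sizes and input shape are illustrative assumptions:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu',  # convolution + ReLU non-linearity
                  input_shape=(28, 28, 1)),       # e.g. 28x28 grayscale images
    layers.MaxPooling2D((2, 2)),                  # pooling: downsample feature maps
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),       # class probabilities
])
model.summary()
```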

Interpreting AUC

An AUC value of 0.5 suggests that the model is no better than random chance, while a value close to 1 indicates very good performance. In practical terms, the AUC can be thought of as the probability that a randomly chosen positive sample will have a higher predicted probability than a randomly chosen negative sample. This makes the AUC a particularly valuable metric for imbalanced data as it does not depend on how many samples are in each class.

K-fold Cross-Validation

An alternative to LOOCV is to use k-fold cross-validation, where k is equal to 5 or 10. In this case there is much less computation to be done, as we only need to fit the model 5 or 10 times. Since most of the data is still used to build the model, the bias should still be reduced. For an example of how this works, suppose we go with 5-fold cross-validation: we would first split our data into 5 subsets. We would then fit a model on the data from subsets 2 to 5 and use subset 1 as the test set for the first iteration, calculating the error for this model using its performance on subset 1. In the second iteration, subset 2 would be used as the test set, and subsets 1, 3, 4, and 5 would be used to build the model. We iterate through this process until all subsets have been used as the test set. We then average the error over all of the folds to get an estimate of error for the model.

Cross-validation

An alternative to splitting the data into training, test, and validation sets is to use cross-validation

Bias Variance Trade-Off

As increasing model complexity increases variance and decreases bias, we must try to achieve a balance between bias and variance to achieve optimal model performance. Plotting error against model complexity shows that as complexity increases the bias decreases, but the variance increases. We would like to set our model at the optimal complexity where the total error is minimized.

1. In Random Forest, how is the subset of features chosen for splitting at each node? A) All features are considered for each split. B) A random subset of features is chosen for each split. C) Features are chosen based on their importance scores. D) The features are chosen sequentially.

B

2. How does Random Forest help when there is a dominant predictor in the dataset? A) By always using the dominant predictor for splits. B) By reducing the chances of the dominant predictor being used in splits, allowing weaker predictors to play a role. C) By increasing the chances of the dominant predictor being used in splits. D) The presence of a dominant predictor does not affect Random Forest.

B

2. What is the key characteristic of Softmax Classification in a multi-class setting? A) It only allows for binary classification. B) It assigns a probability of being in each class, and these probabilities must sum to 1. C) It can predict multiple classes for a single sample. D) It does not require probabilities to sum up to 1.

B

2. What is the purpose of pooling in a CNN? A) To increase the number of dimensions B) To reduce the number of dimensions C) To amplify the features D) To categorize pixel values

B

What does Leave-One-Out Cross-Validation (LOOCV) involve? A. Using 10 folds for cross-validation. B. Leaving out one sample at a time as the test set. C. Using a single train-test split. D. Training the model on the entire dataset.

B Leaving out one sample at a time as the test set.

2. What does a flattened loss/error curve signify in the context of boosting model training? A) The model has overfitted. B) The model has stopped improving and has converged. C) The model requires more trees. D) The model's learning rate is too high.

B) The model has stopped improving and has converged.

Bias

Bias refers to the error that is introduced by approximating a real-world problem using a statistical model. The models we estimate will often be a simplification of reality, as we do not have data on all of the potential factors that influence the response variable, and we estimate a simplification of the true relationship when we estimate 𝑓(𝑥). In linear regression we assume a linear relationship between the response variable and the predictors. In reality it is unlikely that any relationship is truly linear, and so any approximation will result in some bias. In general, more complex models have a lower bias: the additional complexity results in a less simplified estimation of the relationship between Y and X.

How does boosting work?

Boosting works by fitting an initial model on the data, identifying which samples are being incorrectly classified, increasing the weight of the samples which are being incorrectly classified and then refitting a model on the data. By increasing the weight of the incorrectly predicted samples the model steadily pays more attention to the hard to predict samples and hopefully gets them correct. This process continues until we hit a specified stopping point or no further improvements can be made.

1. What does data augmentation in CNNs typically involve? A) Changing the color of images B) Reducing the size of images C) Transforming images to create new variants D) Deleting redundant images

C

1. What is Multi-Class Classification? A) Classifying samples into two classes. B) Classifying samples based on their size. C) Classifying samples as one of multiple classes. D) Classifying samples based on a single characteristic

C

What is the main purpose of evaluating a model on test data? A. To make the model more complex. B. To optimize the model's hyperparameters. C. To assess the model's performance on unseen data. D. To increase the accuracy of the model on the training data.

C To assess the model's performance on unseen data.

Training and Test Sets

Calculating model performance on the data we used to train the model can be overly optimistic and is not a reliable measure of future performance. A better measure is to evaluate the model on previously unseen data. We call this generalization.

Trees

Classification and regression trees are a commonly used statistical model and form the basis for many of the ensemble models.

• Trees are a supervised learning method; that is, we have both explanatory variables and a response variable.
• They are non-linear models and so can often detect patterns that cannot be picked up by linear methods such as linear regression/logistic regression.
• They are suitable for both classification and regression problems.
• They can also handle multi-class problems.

Trees work by constructing a set of rules which can be applied to carry out prediction/classification.

Tuning Other Parameters

For the other hyperparameters that are not the learning rate, we will use cross-validation combined with a grid search to find the most effective set of parameters.

Trees Terminology

Here, the data starts at the root node. It then gets passed through decision nodes (internal nodes), splits across branches, and ends at a terminal or leaf node. We can use this visualization of the model to produce and explain predictions for new data. The human readable nature of the output makes it very well suited for applications that require transparency such as the medical or legal fields.

Leave-One-Out Cross-Validation

Here, we set k equal to the number of samples available in our data, n. That is, we split the data so that each individual observation is its own subset. In this case, the model is fit n times, each time leaving out a single observation and using it as the test set for that model. Leave-one-out cross-validation (LOOCV) results in less bias than the traditional training/test split, since the model is fit on almost all the data at each step. Additionally, there is no randomness in the training and test set selection, which provides less opportunity to over/underestimate the error of the model. However, LOOCV can be very computationally expensive, especially with large datasets, as we must fit the model n times.

Problem with Looking at Only Accuracy

However, looking at accuracy alone can be misleading. This can occur in cases where errors have different costs; e.g., in the medical field false negatives often have a much higher cost, in that someone does not receive appropriate treatment. It can also occur with imbalanced data, where accuracy could be high for the majority class and very low for the minority class while overall accuracy still appears high.

Hyper-Parameter Tuning

Hyper-parameters are parameters whose values are used to control the model fitting process. As we fit an ensemble model (bagging, random forest, boosting, etc.), at each iteration the model will improve its fit on the training data. When this error curve flattens out, this is a sign that the model has stopped improving and has converged.

Hidden Layers/Activation Functions

In each of the nodes in the hidden layers (HL 1-3 in the previous diagram) we use an activation function to transform the input. If we used a linear function, such as simply summing the inputs to the node, then we would have a linear model: a complex one, but still a linear model. We instead want a model that can capture non-linear relationships, so for this we must use a non-linear activation function. Any mathematical function can be used as an activation function, and having a non-linear one allows the model to capture non-linear relationships that are present in the data. At each node we also add a bias term, which works like an intercept and shifts the output in either the positive or negative direction.
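A minimal NumPy sketch of one hidden-layer node: a weighted sum of the inputs plus a bias term, passed through a non-linear activation (ReLU here); all the values are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)          # non-linear activation function

x = np.array([0.5, -1.2, 3.0])       # inputs to the node
w = np.array([0.4, 0.1, -0.6])       # weights on each input
b = 0.2                              # bias term: shifts the output like an intercept

output = relu(np.dot(w, x) + b)      # node output, fed to the next layer
print(output)
```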

More complex models have _____ variance.

In general, more complex models have a higher variance. Model complexity can be thought of as the number of parameters that need to be calculated in the model. For example the number of 𝛽 values that we need to estimate for a linear regression problem. Adding interaction terms or additional predictors will lead to an increase in model complexity.

Information Gain

Information Gain = Gini(S1) − Gini(S2)

where Gini(S1) is the Gini impurity of the original node, and Gini(S2) is the Gini impurity of the split nodes, weighted by the number of samples in the new nodes.

At each node, the algorithm will evaluate the possible splits and select the best possible split available for that current node. Note that when splitting the data at each node, the algorithm does not consider the other nodes that are present in the tree, and only focuses on splitting the data in that particular node. We call models which do this greedy learners.
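A small Python sketch of these quantities for a binary split; the function names and toy labels are my own, not from the course:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Gini(S1) minus the size-weighted Gini of the split nodes, Gini(S2)
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
# a perfect split sends each class to its own node: gain = 0.5 - 0 = 0.5
print(information_gain(parent, parent[:3], parent[3:]))
```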

Multi-Class Classification

Multi-class classification is the problem of classifying objects as one of multiple classes.

Neural Networks

Neural networks are a statistical method that was developed to mimic the brain's structure when solving problems. This type of method forms the core of deep learning. Neural networks work by taking our predictor variables (inputs) passing them through a function (hidden layers) and using them to predict the response variable (output).

Approaching Model Complexity

One way to approach this problem is to apply a feature selection process so that predictors which have no relationship with the response variable do not play a role in the model. By removing variables that have no relationship with the response variable we are reducing the model complexity but hopefully keeping the bias low.

Test Set Error

Our objective here is to minimize the errors made when making predictions for the test set data, that is data that has not previously been seen by the model. There is no guarantee that the model with the lowest training set error will have the lowest test set error. For example, in linear regression, as we add more features, even if they are not related to the response, the training error will still decrease. However, the test set error may not.

Overfitting

Overfitting is of particular concern when training CNNs, as generalization of the model to new images can be quite challenging given all the small differences that images of the same object can contain. For CNNs we can apply dropout to certain layers of the network when training, as we did in previous sections. Here we can also carry out data augmentation. This involves artificially increasing the number and variation of the training examples by transforming the existing images to create new variants of the images.

Random Forests

Previously we have used bootstrap aggregation to create ensembles of trees. Random forest is a modification of the bagging process that provides an improvement by decorrelating the trees. That is, random forest leads to larger differences in the trees that are aggregated for the ensemble model. When the random forest algorithm goes to split the data in a tree, instead of considering all of the variables available, it only considers a random subset of the possible features.

Sensitivity

Sensitivity, also called the true positive rate, is the number of correctly predicted positive samples, divided by the total number of true positive samples.

Softmax Limitations

SoftMax assumes that each example is a member of a single class. In cases where the samples can be members of multiple classes simultaneously SoftMax cannot be used and we must rely on one-vs-all.

Softmax Classification

SoftMax classification extends this idea to the multiclass world. When we use SoftMax each of the samples gets assigned a probability of being in each class and these probabilities must add up to 1.
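A minimal NumPy sketch of the softmax transformation; the raw scores are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw model scores for three classes
probs = softmax(scores)
print(probs, probs.sum())            # per-class probabilities, summing to 1
```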

Specificity

Specificity, also called the true negative rate, is the number of correctly predicted negative samples divided by the total number of negative samples.
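For reference, in confusion-matrix terms (standard definitions, not spelled out on the cards above, where TP/TN/FP/FN denote true/false positives/negatives):

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)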

Image Classification

Supervised Learning Problem

ROC and AUC Curve

The curve is created by generating predicted probabilities for each sample, calculating the sensitivity and specificity at each cut-off, and then plotting the Sensitivity (True Positive Rate) against 1 − Specificity (False Positive Rate). A model that produces a curve reaching the top left-hand side of the graph is considered a good model, as it has high sensitivity and high specificity. Based on the curve, we can also calculate the area underneath it to generate a single number that captures model performance. The area under the curve ranges from 0 to 1.
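A minimal sketch of building an ROC curve and AUC with scikit-learn; the data and classifier are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# predicted probability of class 1 for each test sample
probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# one (false positive rate, true positive rate) point per probability cut-off
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(roc_auc_score(y_test, probs))   # area under the curve, between 0 and 1
```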

Generalization

The model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

More complex models:

These have low bias but high variance, in this case we are overfitting the data. Here we are capturing the peculiarities of the training data that are not seen in the test data.

Less complex models:

These models have low variance but high bias, in this case we are not capturing the relationships in the data, which is called underfitting.

One vs All

This approach leverages binary classification. Suppose we have a problem with five classes, a one-vs.-all solution involves building five binary classifiers, one for each outcome. During the training process, we build a sequence of classifiers each trained to predict a single class. This approach works well when the number of classes is small, however, as the number of classes rises this becomes increasingly inefficient and computationally intensive.

Neural Networks Comparison

To compare with some methods we have previously covered: if there were no hidden layer here, then this would be equivalent to linear regression. Instead of using linear combinations directly, here the variables are fed through hidden layers which transform the features, allowing the model to capture non-linear patterns in the data.

Bootstrap Aggregation (Bagging)

To modify the raw data used for building each tree, we can make use of bootstrap resampling. Bootstrap resampling is sampling with replacement; that is, when taking a random sample, the same value can be selected multiple times. When we select a bootstrap sample of the data, in general only about 2/3 of the samples will be included in the bootstrap sample. We can then build a tree on each bootstrap-sampled dataset. Due to the high variance nature of trees, this should produce different trees each time, so averaging them will then reduce the variance of the overall model.
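A minimal sketch of bagging trees with scikit-learn's BaggingClassifier; the number of trees and the data are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# bootstrap=True: each tree is trained on a sample drawn with replacement,
# and predictions are averaged/voted across the trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
```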

Growing Trees

Trees are built using a divide and conquer approach. This involves splitting the data into subsets, and then continuing to split the data into even smaller subsets until one or more stopping criteria are met. This approach is known as recursive partitioning. The goal of splitting the data at each node is to minimize the impurity of the resulting nodes. A node with high impurity would have a balance of both classes, while a node with low impurity would have the majority of its samples from a single class. To measure impurity we will make use of Gini impurity.

Variance

Variance refers to the amount by which our estimated function 𝑓(𝑥) would change in response to a change in the training data. For example, the difference in linear regression coefficients if a different train/test split was used. Ideally a model should not change too much with different training sets for the same problem. A model which has high variance will undergo a large change in 𝑓(𝑥) given a small change in the training data.

Boosting

We have previously encountered the ensemble methods bagging and random forest. In each of these multiple models are created and combined to produce the final model. Each of the models built in these processes are created independently, that is, they do not use any information from the other trees that are created in the modelling process. Boosting works in a similar fashion in that multiple models are combined to produce the final model, however, here the information from the previous trees is used for all the subsequent trees.

Bagging Motivation

We have previously used classification trees to perform classification. However, single trees can often suffer from low prediction accuracy compared to many methods; we call models like this weak learners. Additionally, trees are a very high-variance method, in that a small change to the data will lead to a large change in the estimated tree model. However, if we were to aggregate the results of many trees, we may get an increase in performance. Here, by averaging out the predictions from all of the trees, we should be able to reduce the variance. If we train multiple trees on the same dataset, using the same parameters, then the same model will be estimated each time. Aggregating the results from these trees would still give us the same answer as a single tree and would not reduce the variance.

Learning Rate Converge

We see the model converges much faster at a learning rate of 0.1 than 0.001. If the error or loss curve oscillates wildly or increases, then this is a sign that the learning rate is too high.

Measuring Performance

When carrying out prediction we are interested to see how the model performs at making predictions on new data. That is, the performance on the training data is less useful to us here. Instead, we would like to know how the model will perform on unseen test data. There are several measures we can use to judge performance on new data. The best measure of performance is one that captures whether the model will be successful for its purpose. In general, we should aim to define our performance measures for their utility rather than raw accuracy.

ROC and AUC

When considering the performance of models we often evaluate their accuracy and other metrics based on the classification results that the model returns to us. Internally these models produce predicted probabilities between 0 and 1 and convert the results into classes by classifying any result above 0.5 as a 1 and below as a 0. However, the choice of cut-off can have a large impact on model performance and there is no guarantee that 0.5 is the best choice here. So if we only evaluate models based on that cut-off we may choose the wrong model as the best one. A better approach to this is to compare how the model performs at all cut-off values. We can do this using Receiver Operating Characteristic (ROC) curves and the area under the curve.

Balanced Accuracy

When dealing with imbalanced data, a better measure to calculate is balanced accuracy. Here the error rate for each class is calculated, and these per-class error rates are then averaged.
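A minimal sketch with scikit-learn's balanced_accuracy_score; the labels below are an illustrative imbalanced example:

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # imbalanced: 8 negatives, 2 positives
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # one minority sample missed

# Plain accuracy is 9/10 = 0.9, but balanced accuracy averages per-class
# recall: (8/8 + 1/2) / 2 = 0.75, exposing the weak minority-class performance.
print(balanced_accuracy_score(y_true, y_pred))   # 0.75
```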

Choosing Features

When splitting the data, the model must choose which variable to split upon and where in that variable to split. To do this, the model calculates the change in Gini impurity that would result from a split on each feature. It then chooses the split that leads to the largest decrease in impurity, known as the information gain.

Bagging Variable Importance

When we are aggregating hundreds of trees, it is no longer practical to visualize each tree and identify the important predictors. Instead we can take a different approach:

1. For each variable, we calculate the amount that Gini impurity has been decreased by splitting on that predictor across the B trees.
2. We then average this value across all of the trees that we have built.
3. The variables that lead to the largest decrease in impurity are considered to be the most important variables for the model.

So here, we are identifying the features that are doing the most work in terms of splitting the data across all of the trees.

1. In the Bias-Variance Trade-Off, a more complex model typically results in: a. High bias and low variance b. Low bias and high variance c. Low bias and low variance d. High bias and high variance

b

2. As the number of features in linear regression increases, even if they aren't related to the response, what happens to the training error? a. It increases b. It decreases c. It remains the same d. It first decreases then increases

b

2. Which of the following is a strength of neural networks? a) They do not require much data for training b) Ability to model complex non-linear relationships c) Low computational intensity d) Transparency in decision-making

b) Ability to model complex non-linear relationships

2. Why might decision trees be considered a high variance method? a) They always provide consistent results regardless of input data. b) Small changes in data lead to large changes in the model. c) They only work well with large datasets. d) They have a low risk of overfitting.

b) Small changes in data lead to large changes in the model.

1. The goal of splitting the data in decision trees is to: a) Maximize the impurity of each node. b) Achieve a balance of both classes in each node. c) Minimize the impurity of each node. d) Equalize the number of data points in each node

c) Minimize the impurity of each node

1. Which of the following is a commonly used activation function in modern neural networks? a) Linear function b) Sigmoid function c) Rectified Linear Unit (ReLU) d) Polynomial function

c) Rectified Linear Unit (ReLU)

Boosting is suitable for both...

classification and regression problems

The default number of variables considered at each split is:

m = √p. That is, the number of possible variables to be selected as a splitter at each node is the square root of the total number of predictors. For example, with p = 16 predictors, m = 4 variables would be considered at each split. Doing this in our trees leads to two kinds of randomness: the samples used to build the tree, coming from the bootstrap samples, and the predictors chosen as potential splitters at each node in the tree.

To minimize expected test set error we want a model...

with low variance and low bias. There is nothing we can do about the noise.

Some weaknesses of random forests include:

• Can be very computationally intensive, particularly when it comes to hyperparameter tuning.
• Struggles with interpretability, as the whole forest can be challenging to interpret at once.

Some strengths of random forests include:

• Good performing classifier
• Produces lots of information about the data
• Works with a large number of variables
• Can handle different types of variables
• Applicable to both regression and classification problems

The three aspects of the error are:

Var(y) - Variance of the model
Bias(y)² - Squared bias of the model
Var(ε) - Variance of the error terms, also called noise or irreducible error.

