BUSI 488 Quiz 4


CART

(Classification and Regression Tree) same as decision tree

Steps in the AdaBoost Algorithm

1. Initialize a weight for each record.
2. Make the best split.
3. Determine the total error of the stump.
4. Use the error of the classifier (i.e., share of misclassified records) to determine how much say (i.e., vote) the particular stump has.
5. Update (i.e., increase or decrease) the record weights (w_r).
6. Create a new dataset of records.
7. Repeat steps 2-6 (a sketch of steps 3-5 follows below).
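A minimal sketch of steps 3-5 for a single stump, assuming NumPy arrays w (record weights), y (true labels coded as -1/+1), and pred (the stump's predictions); the names and values are hypothetical:

```python
import numpy as np

# Hypothetical record weights, true labels, and one stump's predictions
w = np.full(8, 1 / 8)                      # step 1: initialize equal weights
y = np.array([1, 1, 1, -1, -1, -1, 1, -1])
pred = np.array([1, 1, -1, -1, -1, -1, 1, 1])

# Step 3: total error of the stump = sum of the weights of misclassified records
err = np.sum(w[pred != y])

# Step 4: "amount of say" (vote) of the stump
say = 0.5 * np.log((1 - err) / err)

# Step 5: increase the weights of misclassified records, decrease the others, renormalize
w = w * np.exp(-say * y * pred)
w = w / w.sum()
```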

Building a Random Forest

1. Create a bootstrapped dataset from the original data.
2. Create a decision tree using this bootstrapped data.
3. Rinse and repeat steps 1 and 2 to create more decision trees.
4. Take the majority vote (classification) or the average (regression). A minimal from-scratch sketch follows below.
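This sketch assumes X and y are NumPy arrays with non-negative integer class labels; the function names are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # 1. Bootstrapped dataset: sample rows with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # 2. Fit a decision tree on the bootstrapped data
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        # 3. Rinse and repeat
    return trees

def predict_forest(trees, X):
    # 4. Majority vote across trees (classification)
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```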

Information Gain

= information before splitting (parent) − information after splitting (children). Shannon invented the concept of entropy, which measures the impurity of an input set. In physics and mathematics, entropy refers to the randomness or impurity in a system; in information theory, it refers to the impurity in a group of examples. Information gain (IG) is the decrease in entropy (or in another impurity measure). IG computes the difference between the entropy before the split and the weighted average entropy after the split of the dataset, based on the given feature's values.
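A small sketch of how entropy and IG could be computed for a binary split (NumPy-based; the example labels are made up):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # IG = entropy(parent) - weighted average entropy of the children
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = np.array([1, 1, 1, 0, 0, 0])
print(information_gain(parent, parent[:3], parent[3:]))   # perfect split -> IG = 1.0
```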

Attribute Selection Measures

An attribute selection measure (ASM) is a heuristic for selecting the splitting criterion that partitions the data in the best possible manner. ASMs are also known as splitting rules because they help us determine breakpoints for the records at a given node. An ASM assigns a rank to each feature according to how well it can explain the outcome (response); the best-ranked feature is selected for the split. In the case of a continuous-valued feature, split points must also be defined. We will use Information Gain to split our data.

Consequence of CARTs

As far as predictive accuracy goes, Decision Trees are quite inaccurate. Even one misstep in the choice of the next node can lead you to a completely different end: choosing the right branch instead of the left could lead you to the furthest end of the tree. You would be off by a huge margin!

ROC curves and AUC scores

Be able to interpret
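A hedged example of how an ROC curve and AUC score might be computed with scikit-learn (the labels and probabilities below are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_test = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]

fpr, tpr, thresholds = roc_curve(y_test, y_prob)   # points on the ROC curve
auc = roc_auc_score(y_test, y_prob)                # 1.0 = perfect ranking, 0.5 = random guessing
print(auc)
```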

Limitations of CARTs

Classification: can only produce orthogonal decision boundaries. Sensitive to small variations in the training set. High variance: unconstrained CARTs may overfit the training set. Solution: Wisdom of the Crowds!

Example of weak learner:

Decision stump (a CART whose maximum depth is 1). Basic Idea: Train an ensemble of predictors sequentially. Each predictor tries to correct its predecessor. Most popular boosting methods: AdaBoost, Gradient Boosting.
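In scikit-learn, for example, a decision stump can be obtained by capping the tree depth:

```python
from sklearn.tree import DecisionTreeClassifier

# A decision stump: a CART limited to a single split (maximum depth of 1)
stump = DecisionTreeClassifier(max_depth=1)
```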

Basic Idea of a Random Forest

- Each tree is a full-size tree
- Each tree can look different in terms of depth and splits
- Each tree can use multiple features to make a decision
- Each tree has equal weight in voting
- Each tree is made independently of the others
- Most trees have different samples to operate on (because of bootstrapping)

Stochastic Gradient Boosting (SGB)

- Each tree is trained on a random subset of rows of the training data
- The sampled instances (40%-80% of the training set) are drawn without replacement
- Features are sampled (without replacement) when choosing split points
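This row and feature subsampling roughly corresponds to the subsample and max_features parameters of scikit-learn's GradientBoostingClassifier (the values below are illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Stochastic Gradient Boosting: each tree sees a random fraction of the rows
# (drawn without replacement) and split points are chosen from a subset of features
sgb = GradientBoostingClassifier(
    n_estimators=200,
    subsample=0.8,        # fraction of training rows sampled per tree
    max_features="sqrt",  # features considered at each split
)
```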

Advantage of Tree-based methods

Enable measuring the importance of each feature in prediction.
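For example, fitted scikit-learn tree ensembles expose this through the feature_importances_ attribute (the iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)

# Importance of each feature in the ensemble's predictions (the scores sum to 1)
for name, score in zip(iris.feature_names, rf.feature_importances_):
    print(name, round(score, 3))
```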

Boosting:

Ensemble method combining several weak learners to form a strong learner.

Building blocks of decision tree

Essentially a flowchart-like structure consisting of a hierarchy of nodes. Nodes are questions or predictions about a particular feature. Multiple elements make up a decision tree:

Internal Node

Question about a feature: one parent node, a question giving rise to two child nodes.

Defection Detection Reading

First, methods do matter. Second, models have staying power. Third, researchers use a variety of modeling "approaches," characterized by variables such as estimation technique, variable selection procedure, number of variables included, and time allocated to steps in the model-building process.

The Cons of Gradient Boosting

Gradient Boosting involves an exhaustive search procedure: each CART is trained to find the best split points and features. This may lead to CARTs using the same split points and maybe the same features.

Feature Randomness

In a normal decision tree, when it is time to split a node, we consider every possible feature and pick the one that produces the most separation between the observations in the left node vs. those in the right node. In contrast, each tree in a random forest can pick only from a random subset of features. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.
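In scikit-learn this per-split feature subsampling is controlled by the max_features parameter (the value shown is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# Each split considers only a random subset of features (here: sqrt of the feature count)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```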

Splitting criterion:

Information Gain (IG), i.e., the parent node's impurity minus the weighted sum of the child node impurities.

White Box ML algorithm

- Known internal decision-making logic (not available in, for example, Neural Networks)
- Faster than many other algorithms; time complexity is a function of the number of records and the number of features
- Distribution-free and non-parametric method (does not depend on probability distribution assumptions)
- Can handle high-dimensional data with good accuracy

Objective function:

Maximize IG at each split; equivalently, minimize the impurity criterion.

Weak learner:

Model doing slightly better than random guessing.

Decision Trees

One of the easiest and most popular classification algorithms to understand and interpret. Can be utilized for both classification and regression problems.

Classification tree

Sequence of if-else questions about individual features. Objective: infer class labels. Able to capture non-linear relationships between features and labels. Does not require feature scaling (e.g., standardization, min-max).
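A minimal scikit-learn example of fitting a classification tree (the iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No feature scaling needed; the tree asks if-else questions about raw feature values
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy of the inferred class labels
```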

Gradient Boosting (GB)

Sequential correction of the predecessor's errors. Does not tweak the weights of training instances; instead, each predictor is trained using its predecessor's residual errors as labels. Gradient Boosted Trees: a CART is used as the base learner.
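A minimal from-scratch sketch of this residual-fitting idea for regression, assuming X and y are NumPy arrays and using shallow scikit-learn regression trees as base learners (the function name, learning rate, and depth are illustrative choices, not the course's exact implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, lr=0.1):
    pred = np.full(len(y), y.mean())       # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # predecessor's errors become the new labels
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred += lr * tree.predict(X)       # each new tree corrects its predecessor
        trees.append(tree)
    return trees
```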

AdaBoost

Stands for Adaptive Boosting. Each predictor pays more attention to the instances wrongly predicted by its predecessor. Achieved by changing the weights of training instances.

Ensemble learning

The process by which multiple models, such as classifiers, are generated and combined to solve a particular problem better than single models could. Ensemble learning is commonly used in machine learning to improve classification or regression models.

Where do the regions come from?

Theoretically, regions could have any shape in multi-dimensional space (usually higher than 2D or even 3D, since we have more than 2-3 features). Divide the predictor space into high-dimensional "containers" (for simplicity and ease of interpretation of the resulting predictive model).

Impurity Measures

There are various measures of impurity available, such as entropy, the Gini Index, and the misclassification error. We will use the Gini Index as the impurity measure today.

Bootstrap Aggregation (BAGGING)

Uses a technique known as bootstrapping. Reduces the variance of individual models in an ensemble of the same algorithm: one algorithm trained on different subsets of the training set.
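In scikit-learn, bagging one algorithm can be expressed with BaggingClassifier, whose default base estimator is a decision tree (the settings below are illustrative):

```python
from sklearn.ensemble import BaggingClassifier

# Many copies of one algorithm (by default a decision tree), each trained on a
# bootstrap sample (rows drawn with replacement) of the same training set
bag = BaggingClassifier(n_estimators=100, bootstrap=True)
```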

Basic Idea of AdaBoost for Decision Trees

- Usually uses stumps (a root node with two leaves): a forest of stumps
- Stumps can only use one variable to make a decision (split)
- Stumps are not good at making accurate classifications = weak learners
- Some stumps have a greater say (i.e., significance) in the voting at the end than others (weighted by their accuracy)
- Stump order matters! Stumps are created in sequence, whereby the next stump implicitly takes the errors made by its predecessor into account for its own classification.

Regression

Weighted average. In sklearn: AdaBoostRegressor

Classification

Weighted majority voting. In sklearn: AdaBoostClassifier
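Illustrative usage of the two estimators named above (the estimator counts are arbitrary):

```python
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# Classification: weighted majority voting over the boosted predictors
ada_clf = AdaBoostClassifier(n_estimators=100)

# Regression: weighted average of the boosted predictors
ada_reg = AdaBoostRegressor(n_estimators=100)
```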

Regression tree

When the decision tree has a continuous target variable. For example, a regression tree would be used for the price of a newly launched product because price can be anything depending on various constraints.

Decision Boundaries

While Logistic Regression can only produce a linear decision boundary, Decision Trees are much more flexible!

Strengths of CARTs

- White box: simple to understand, simple to interpret, easy to use
- Flexibility: ability to describe non-linear dependencies
- Preprocessing: no need to standardize or normalize features

Confusion matrices

Be able to interpret
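A small illustrative example with scikit-learn (the labels are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```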

Stopping rules

Can be set to stop splitting a tree if: 1. The leaf node is pure. 2. The maximal node depth is reached. 3. Splitting a node does not lead to an IG. 4. An early stopping criterion is reached (e.g., a minimum impurity decrease).
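Several of these rules map directly onto scikit-learn's DecisionTreeClassifier hyperparameters (the values below are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # rule 2: maximal node depth
    min_impurity_decrease=0.01,  # rule 4: minimum impurity decrease required to split
)                                # rules 1 and 3 are enforced automatically: pure nodes
                                 # and zero-gain splits are not split further
```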

Gini Index

Gini = 1 − Σ_c p_c², where p_c is the proportion of the samples at a particular node that belong to class c. The Gini Index is therefore 0 if all samples at a node belong to the same class, and it is maximal if we have a uniform class distribution.
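A small sketch of that calculation (the example label lists are made up):

```python
import numpy as np

def gini(labels):
    # Gini = 1 - sum of squared class proportions at a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini([1, 1, 1, 1]))   # pure node -> 0.0
print(gini([0, 1, 0, 1]))   # uniform two-class distribution -> 0.5 (the maximum for two classes)
```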

Branch

Represents a decision rule. Branches connect parent nodes with (their) child nodes.

Leaf Node

Represents an outcome: one parent node, no child nodes --> prediction.

Root Node

Starting point: no parent node, a question giving rise to two child nodes.

