Cosc 3337 Week 7
How to build and evaluate a random forest?
1. Create a bootstrapped data set from the original data set.
2. Build a decision tree on each bootstrapped data set.
3. Run new data down each tree and aggregate the predictions, as in the sketch below.
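A minimal scikit-learn sketch of these three steps (the iris data and the parameter values are purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example data; any labeled dataset would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps 1-2: each of the 100 trees is grown on its own bootstrapped sample.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Step 3: new data is run down every tree and the votes are aggregated.
print("test accuracy:", rf.score(X_test, y_test))
```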
Bayes Classifier
•A probabilistic framework for solving classification problems
Advantages of the Naive Bayes classifier
- Not sensitive to irrelevant features
- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with the number of predictors and data points
- Fast, so it can be used for real-time predictions
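For reference, a minimal Naive Bayes fit in scikit-learn (GaussianNB handles the continuous features here; the iris data is only an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Naive Bayes trains in a single pass over the data and needs no tuning to start.
X, y = load_iris(return_X_y=True)
print("CV accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
```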
Why Random Forests work:
- If the trees are sufficiently deep, they have very small bias.
- As the number of trees increases, the variance decreases.
LDA Approach
1. Compute the within-class and between-class scatter matrices.
2. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
3. Sort the eigenvalues and select the top k.
4. Create a new matrix containing the eigenvectors that map to the k eigenvalues.
5. Obtain the new features (i.e., LDA components) by taking the dot product of the data and the matrix from step 4.
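A numpy sketch of these five steps, assuming the iris data and k = 2 purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
d = X.shape[1]
overall_mean = X.mean(axis=0)

# Step 1: within-class (S_W) and between-class (S_B) scatter matrices
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * diff @ diff.T

# Step 2: eigenvectors/eigenvalues of S_W^-1 S_B
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Steps 3-4: sort by eigenvalue and keep the top k eigenvectors
k = 2
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:k]].real

# Step 5: project the data onto the LDA components
X_lda = X @ W
print(X_lda.shape)   # (150, 2)
```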
Feature Selection
1. Remove features with missing values
2. Remove features with low variance
3. Remove highly correlated features
4. Univariate feature selection
5. Recursive feature elimination
6. Feature selection using SelectFromModel
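A hedged scikit-learn sketch of items 2, 4, 5, and 6 (the breast-cancer data, thresholds, and feature counts are only examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                        VarianceThreshold, f_classif)
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 2. Remove features with low variance
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# 4. Univariate selection: keep the 10 features with the highest F-scores
X_uni = SelectKBest(f_classif, k=10).fit_transform(X, y)

# 5. Recursive feature elimination around a linear model
rfe = RFE(LogisticRegression(solver="liblinear"), n_features_to_select=10).fit(X, y)

# 6. SelectFromModel: keep features whose L1 coefficients survive the threshold
sfm = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")).fit(X, y)

print(X_var.shape, X_uni.shape, rfe.support_.sum(), sfm.transform(X).shape)
```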
Review: Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
- Foundation: based on Bayes' Theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Gradient descent
Advantages of this approach:
•Can be done for non-linear systems (e.g., SVM with Gaussian kernels)
•Can mix the search for features with the search for an optimal regularization parameter and/or other kernel parameters
Drawbacks:
•Heavy computations
•Brings back the usual issues of gradient-based learning algorithms (early stopping, initialization, etc.)
Naïve Bayesian Classifier: Comments
Advantages
-Easy to implement
-Good results obtained in most of the cases
Disadvantages
-Assumes class conditional independence, therefore loss of accuracy
-Practically, dependencies exist among variables
--E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
--Dependencies among these cannot be modeled by the naïve Bayesian classifier
How to deal with these dependencies? Bayesian Belief Networks.
Bagging
Bootstrapping the data and using the aggregate to make decisions is called Bagging.
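In scikit-learn this corresponds to BaggingClassifier, which bags decision trees by default (a rough sketch; the dataset and n_estimators are only examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Each of the 100 base trees is trained on its own bootstrap sample and the
# final prediction aggregates their votes.
X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```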
GD vs SGD
Both algorithms minimize (or maximize) a cost function by iteratively adjusting the hypothesis function's parameters: the gradient is scaled by the learning rate and subtracted from (for minimization) or added to (for maximization) the parameter vector. The only (algorithmic) difference is the cost function each one optimizes at every step:
- Gradient Descent's cost function iterates over ALL training samples.
- Stochastic Gradient Descent's cost function only accounts for ONE training sample, chosen at random.
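A toy numpy comparison of the two update rules on a linear-regression MSE (the learning rates and data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Batch gradient descent: every update uses the gradient over ALL samples.
w_gd = np.zeros(3)
for _ in range(50):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)
    w_gd -= 0.1 * grad

# Stochastic gradient descent: every update uses ONE randomly chosen sample.
w_sgd = np.zeros(3)
for _ in range(50):
    for i in rng.permutation(len(y)):
        grad_i = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.01 * grad_i

print("GD weights: ", np.round(w_gd, 2))
print("SGD weights:", np.round(w_sgd, 2))
```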
Dimensionality Reduction
It reduces the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d) in order to increase the computational efficiency while retaining most of the information.
Gradient descent summary
Many algorithms can be turned into embedded methods for feature selection by using the following approach:
1. Choose an objective function that measures how well the model returned by the algorithm performs.
2. Differentiate this objective function with respect to the feature-scaling parameter s.
3. Perform a gradient descent on s. At each iteration, rerun the initial learning algorithm to compute its solution on the newly scaled feature space.
4. Stop when there are no more changes (or use early stopping, etc.).
5. Threshold the scaling values to get the list of features and retrain the algorithm on that subset of features.
The difference from the add/remove approach is the search strategy. It still uses the inner structure of the learning model, but it scales features rather than selecting them.
PCA
PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space.
Bayes' theorem
Practical difficulty: requires initial knowledge of many probabilities and carries a significant computational cost.
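For reference, Bayes' theorem is P(H|X) = P(X|H) P(H) / P(X). A tiny worked example with made-up numbers (1% prevalence, 99% sensitivity, 5% false-positive rate):

```python
p_h = 0.01                     # prior P(H)
p_x_given_h = 0.99             # likelihood P(X|H)
p_x_given_not_h = 0.05         # false-positive rate P(X|not H)

# Evidence P(X) via the law of total probability
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Posterior P(H|X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # about 0.167
```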
Review PCA
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a simple (it works on selected samples, not all the data at once) yet very efficient approach to discriminative learning of linear classifiers under convex loss functions, such as:
-(linear) Support Vector Machines
-Logistic regression
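A scikit-learn sketch using SGDClassifier, where the chosen loss determines which linear model is trained (loss="log_loss" assumes a recent scikit-learn version; older releases used loss="log"):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# loss="hinge" -> linear SVM, loss="log_loss" -> logistic regression.
# SGD is sensitive to feature scaling, hence the StandardScaler in each pipeline.
svm_sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge"))
logreg_sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="log_loss"))

print("linear SVM via SGD:  ", cross_val_score(svm_sgd, X, y, cv=5).mean())
print("logistic reg via SGD:", cross_val_score(logreg_sgd, X, y, cv=5).mean())
```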
Out of bag error
The proportion of out-of-bag samples incorrectly classified is called Out-of-Bag Error
Limitation of PCA
The results of PCA depend on the scaling of the variables. A scale-invariant form of PCA has been developed.
Transformation
This transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest possible variance subject to being orthogonal to the preceding components.
Purpose
• Discriminant Analysis classifies objects into two or more groups according to a linear combination of features.
- Feature selection: which set of features can best determine group membership of the object? (dimension reduction)
- Classification: what is the classification rule or model that best separates those groups?
About the Bayesian framework
•Allows us to combine observed data and prior knowledge
•Provides practical learning algorithms
•It is a generative (model-based) approach, which offers a useful conceptual framework
-This means that any kind of object (e.g. time series, trees, etc.) can be classified, based on a probabilistic model specification
Random Forest
•As in bagging, we build a number of decision trees on bootstrapped training samples. The difference is that each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
•Note that if m = p, then this is simply bagging.
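In scikit-learn, m corresponds to the max_features parameter (the values below are only illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# m = sqrt(p): the usual random-forest setting for classification
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")

# m = p: every split may use all predictors, which reduces to bagged trees
bagged = RandomForestClassifier(n_estimators=100, max_features=None)
```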
Variable Importance Measures
•Bagging results in improved accuracy over prediction using a single tree.
•Unfortunately, the resulting model is difficult to interpret: bagging improves prediction accuracy at the expense of interpretability.
•Variable importance: calculate the total amount by which the RSS (for regression) or the Gini index (for classification) is decreased due to splits over a given predictor, averaged over all B trees.
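scikit-learn exposes this averaged impurity decrease as feature_importances_ (a sketch; the dataset is only an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# feature_importances_ is the Gini-based importance averaged over all trees.
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

top5 = sorted(zip(data.feature_names, rf.feature_importances_),
              key=lambda pair: pair[1], reverse=True)[:5]
for name, importance in top5:
    print(f"{name}: {importance:.3f}")
```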
LDA vs PCA
•Both are linear transformation techniques that are commonly used for dimensionality reduction.
•PCA can be described as an "unsupervised" algorithm (it "ignores" class labels).
•PCA's goal is to find the directions (the so-called principal components) that maximize the variance in a dataset.
•LDA is "supervised" and computes the directions ("linear discriminants") that represent the axes that maximize the separation between multiple classes.
Bootstrap
•Construct B (hundreds of) trees with no pruning
•Learn a classifier for each bootstrap sample and average them
•Very effective
Cost functions
•Cost functions are referred to by different names: loss function, error function, or scoring function.
•Consider linear regression, where we choose mean squared error (MSE) as our cost function. Our goal is to find a way to minimize the MSE.
•Our final goal, however, is to use a cost function so we can learn something from our data.
Bagging
•Each tree is identically distributed (i.d.) → the expectation of the average of B such trees is the same as the expectation of any one of them → the bias of bagged trees is the same as that of the individual trees
•i.d. and not i.i.d.
Feature Reduction (extraction) vs. Feature Selection
•Feature reduction
-All original features are used
-The transformed features are linear combinations of the original features
•Feature selection
-Only a subset of the original features is selected
•Continuous versus discrete
Feature Selection vs Dimensionality Reduction
•Feature selection is simply selecting and excluding given features without changing them.
•Dimensionality reduction (feature extraction) transforms features into a lower dimension.
Random Forest Algorithm
•For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached:
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
•Output the ensemble of trees.
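A rough from-scratch sketch of this loop, leaning on scikit-learn's DecisionTreeClassifier for steps i-iii (it assumes integer class labels and a classification task):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=100, m="sqrt", min_node_size=1, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    N = len(X)
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # (a) bootstrap sample Z* of size N
        tree = DecisionTreeClassifier(            # (b) grow an unpruned tree where
            max_features=m,                       #     each split tries m random variables
            min_samples_leaf=min_node_size,
            random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees                                  # output the ensemble of trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])
    # majority vote across the ensemble (classification case)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```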
How to Estimate Probabilities from Data?
•For continuous attributes:
-Discretize the range into bins: one ordinal attribute per bin (violates the independence assumption)
-Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute
-Probability density estimation: assume the attribute follows a normal distribution, use the data to estimate the parameters of the distribution (e.g., mean and standard deviation), and once the probability distribution is known, use it to estimate the conditional probability P(Ai|c)
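A small sketch of the density-estimation option, assuming one continuous attribute of the iris data and a per-class normal distribution:

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
attribute = X[:, 0]                    # e.g., sepal length

for c in np.unique(y):
    # estimate the distribution parameters from the data of class c
    mu = attribute[y == c].mean()
    sigma = attribute[y == c].std(ddof=1)
    # use the fitted normal density to estimate P(Ai = 5.1 | c)
    print(f"class {c}: P(A=5.1|c) ~ {norm.pdf(5.1, mu, sigma):.3f}")
```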
Bagging
•If we split the data in different random ways, decision trees give different results: high variance.
•Bagging (bootstrap aggregating) is a method that results in low variance.
•If we had multiple realizations of the data (or multiple samples), we could calculate the predictions multiple times and take the average; averaging multiple noisy estimates produces a less uncertain result.
Method
•Maximize the between-class scatter
-Difference of mean values (m1 - m2)
•Minimize the within-class scatter
-Covariance
Generative vs Discriminative models
•Naïve Bayes is a type of generative model. The generative process:
-First pick the category of the record
-Then, given the category, generate the attribute values from the distribution of the category (conditional independence given C)
-We use the training data to learn the distribution of the values in a class
•Logistic Regression and SVM are discriminative models
-The goal is to find the boundary that discriminates between the two classes from the training data
•In order to classify the language of a document, you can
-either learn the two languages and find which is more likely to have generated the words you see,
-or learn what differentiates the two languages.
Out-of-Bag Error Estimation
•No cross validation?
•Remember, in bootstrapping we sample with replacement, and therefore not all observations are used for each bootstrap sample. On average 1/3 of them are not used!
•We call them out-of-bag (OOB) samples
•We can predict the response for the i-th observation using each of the trees in which that observation was OOB, and do this for n observations
•Calculate the overall OOB MSE or classification error
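With scikit-learn the OOB estimate comes almost for free (a sketch; the dataset and n_estimators are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# oob_score=True scores each observation only with the trees that never saw it,
# giving a built-in estimate of test error without cross-validation.
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("OOB error:   ", 1 - rf.oob_score_)
```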
Advantages of Random Forest
•No need for pruning trees
•Accuracy and variable importance generated automatically
•Overfitting is not a problem
•Not very sensitive to outliers in training data
•Easy to set parameters
•Good performance
Bagging
•Reduces overfitting (variance)
•Normally uses one type of classifier
•Decision trees are popular
•Easy to parallelize
Naïve Bayes
•Robust to isolated noise points
•Handles missing values by ignoring the instance during probability estimate calculations
•Robust to irrelevant attributes
•The independence assumption may not hold for some attributes
-Use other techniques such as Bayesian Belief Networks (BBN)
•Naïve Bayes can produce a probability estimate, but it is usually a very biased one
-Logistic Regression is better for obtaining probabilities.
PCA Approach
•Standardize the data.
•Perform Singular Value Decomposition to get the eigenvectors and eigenvalues.
•Sort the eigenvalues in descending order and choose the k eigenvectors corresponding to the k largest eigenvalues.
•Construct the projection matrix from the selected k eigenvectors.
•Transform the original dataset via the projection matrix to obtain a k-dimensional feature subspace.
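A numpy sketch of these steps, using the iris data and k = 2 purely as an example:

```python
import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# 1. Standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. SVD of the standardized data: the rows of Vt are the eigenvectors, and the
#    eigenvalues of the covariance matrix are S**2 / (n - 1)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
eigenvalues = S ** 2 / (len(X_std) - 1)

# 3-4. Singular values come out sorted in descending order; keep the top k
k = 2
W = Vt[:k].T                       # projection matrix (d x k)

# 5. Transform the data into the k-dimensional subspace
X_pca = X_std @ W
print(X_pca.shape, np.round(eigenvalues, 2))
```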
Why does bagging generate correlated trees?
•Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
•Then all bagged trees will select the strong predictor at the top of the tree, and therefore all trees will look similar.
•How do we avoid this?
•What if we consider only a subset of the predictors at each split?
•We will still get correlated trees unless ... we randomly select the subset!
Advantages / Disadvantages
•The advantages of Stochastic Gradient Descent are:
-Efficiency
-Ease of implementation (lots of opportunities for code tuning)
•The disadvantages of Stochastic Gradient Descent include:
-SGD requires a number of hyperparameters, such as the regularization parameter and the number of iterations
-SGD is sensitive to feature scaling
Random Forests Tuning
•The inventors make the following recommendations:
-For classification, the default value for m is √p and the minimum node size is one.
-For regression, the default value for m is p/3 and the minimum node size is five.
•In practice the best values for these parameters will depend on the problem, and they should be treated as tuning parameters.
•As with bagging, we can use the OOB error, so a random forest can be fit in one sequence, with cross-validation effectively performed along the way. Once the OOB error stabilizes, the training can be terminated.
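Expressed as scikit-learn parameters, those recommended defaults look roughly like this (a sketch, not a tuned configuration; min_samples_leaf stands in for the minimum node size):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: m = sqrt(p), minimum node size 1
clf = RandomForestClassifier(max_features="sqrt", min_samples_leaf=1, oob_score=True)

# Regression: m = p/3, minimum node size 5
reg = RandomForestRegressor(max_features=1/3, min_samples_leaf=5, oob_score=True)
```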
Goals
•The main goal of PCA is to identify patterns in data.
•PCA aims to detect the correlation between variables.
•It attempts to reduce the dimensionality.
Random Forest, Ensemble Model
•The random forest (Breiman, 2001) is an ensemble approach that can also be thought of as a form of nearest neighbor predictor.
•Ensembles are a divide-and-conquer approach used to improve performance. The main principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner".
Trees and Forests
•The random forest starts with a standard machine learning technique called a "decision tree" which, in ensemble terms, corresponds to our weak learner. In a decision tree, an input is entered at the top and as it traverses down the tree the data gets bucketed into smaller and smaller sets.
Trees and Forests
•The random forest takes this notion to the next level by combining trees with the notion of an ensemble. Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.
Random Forest Algorithm
•To make a prediction at a new point x:
-For regression: average the results
-For classification: majority vote
Differences to standard tree
•Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
•For each split, consider only m randomly selected variables
•Don't prune
•Fit B trees in this way and use averaging or majority voting to aggregate the results
Running a Random Forest
•When a new input is entered into the system, it is run down all of the trees. The result may either be an average or weighted average of all of the terminal nodes that are reached, or, in the case of categorical variables, a voting majority.
Note that:
•With a large number of predictors, the eligible predictor set will be quite different from node to node.
•The greater the inter-tree correlation, the greater the random forest error rate, so one pressure on the model is to have the trees as uncorrelated as possible.
•As m goes down, both inter-tree correlation and the strength of individual trees go down. So some optimal value of m must be discovered.
Gradient descent
•pick a starting point (w)
•repeat until the loss doesn't decrease in any dimension:
-pick a dimension
-move a small amount in that dimension towards decreasing loss (using the derivative)
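A minimal sketch of this loop for a one-dimensional loss (the toy loss and step size are only illustrative):

```python
# Toy loss with its minimum at w = 3, and its derivative.
def loss(w):
    return (w - 3) ** 2

def dloss_dw(w):
    return 2 * (w - 3)

w, step = 0.0, 0.1                          # starting point and step size
for _ in range(1000):
    w_new = w - step * dloss_dw(w)          # move a small amount downhill
    if abs(loss(w_new) - loss(w)) < 1e-12:  # stop when the loss no longer decreases
        break
    w = w_new

print(round(w, 3))                          # close to 3.0
```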