Machine Learning Final Exam
Akaike Information Criterion (AIC)
An estimate of the relative information lost by a given model: the less information a model loses, the higher the quality of the model
Bias Variance Tradeoff
As complexity increases, bias decreases but variance increases.
Classes with very unequal frequencies
Imbalanced Data
Bagging (2)
Also known as *bootstrap aggregating*, it is a technique that generates a number of training datasets by bootstrap sampling the original training data. • In short: it uses bootstrapping to supply the training data for each model in the ensemble.
k-means strengths (3)
Uses simple non-statistical principles. • Very flexible and malleable algorithm. • Wide set of real-world applications.
loans_pred <- predict(loans_mod, loans_test, type = "class")
Using our new model against the test dataset, let's predict whether a borrower will or will not Default
The anti-monotone property of support states that the support of an itemset is _________ than that of its subsets.
never greater (i.e., always less than or equal to)
Generalized Linear Models
an extension of linear regression that allows for linear predictors to be related to a response variable that is not normally distributed via a transformation function or link function.
F-measure
combines precision and recall into a single number using the harmonic mean. It provides a convenient way to compare several models side by side
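In symbols, with precision P and recall R (matching the fmeasure line in the performance-measure code later in this deck): F = (2 * P * R) / (P + R).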
set.seed(1234)
xgb.mod <- train(
  Default ~ .,
  data = loans.train,
  method = "xgbTree",
  metric = "Kappa",
  trControl = ctrl,
  tuneGrid = grid
)
xgb.mod
Creates an extreme gradient boosting model
set.seed(1234)
tree.mod <- train(
  Default ~ .,
  data = loans.train,
  method = "rpart",
  metric = "Kappa",
  trControl = ctrl,
  tuneGrid = grid
)
Creates an rpart decision tree model
loans_train_b <- SMOTE(Default ~ ., data.frame(loans_train), perc.over = 100, perc.under = 200)
generates balanced TRAINING data
modelLookup("C5.0")
Get a list of the tuning parameters for C5.0
modelLookup("rpart")
Get a list of the tuning parameters for rpart
Combination Function
governs how disagreements among the predictions are reconciled
Entropy
a quantification of the level of randomness or disorder within a set of class values. • Low entropy implies large homogeneity within the group, while high entropy implies large heterogeneity.
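A minimal R sketch of the calculation, using a hypothetical entropy() helper (not from the course code):
entropy <- function(labels) {
  p <- prop.table(table(labels))  # class proportions
  -sum(p * log2(p))               # entropy: -sum of p * log2(p)
}
entropy(c("Yes", "Yes", "No", "No"))    # 1: maximum disorder for two classes
entropy(c("Yes", "Yes", "Yes", "Yes"))  # 0: complete homogeneity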
Boosting
a technique that sequentially boosts the performance of weak learners in order to construct a strong classifier as a linear combination of simple weak classifiers.
Additive Smoothing
additive (or Laplace) smoothing is used to smooth the data by adding a pseudocount α to the number of observed and unobserved cases in order to change the expected probability so as to avoid zero-frequency problems.
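A hedged sketch of the idea in R, with hypothetical names: alpha is the pseudocount, n the number of observations, and k the number of possible feature values.
smoothed_prob <- function(count, n, k, alpha = 1) {
  (count + alpha) / (n + alpha * k)  # Laplace-smoothed probability estimate
}
smoothed_prob(0, 100, 2)  # ~0.0098: the zero-frequency case no longer yields 0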
Kappa statistic
adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.
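As a formula (a standard definition, not deck-specific): kappa = (Po - Pe) / (1 - Pe), where Po is the observed accuracy and Pe is the accuracy expected by chance alone. A tiny illustrative R sketch:
kappa_stat <- function(po, pe) (po - pe) / (1 - pe)
kappa_stat(0.90, 0.75)  # 0.6: agreement well beyond chance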
The recursive process used in logistic regression to minimize the cost function during maximum likelihood estimation is known as _________
gradient descent
Given a set of candidate models from the same data, the model with the ________ AIC is the "preferred" model.
lowest
A small k...
makes the model susceptible to noise and/or outliers
tree.mod
Outputs results such as cp, accuracy, kappa, the type of tree used (CART), and how the data was resampled
Reasons to discretize data (3)
• Some algorithms require categorical or binary features. • Can improve visualization. • Can reduce categories for features with many values.
Error Due to Variance
Errors made as a result of the sampling of the training data.
Inexplicable Association Rules
Rules that defy rational explanation and do not suggest a clear course of action.
Specificity
(also called the true negative rate) measures the proportion of negative examples that were correctly classified.
Sensitivity
(also called the true positive rate) measures the proportion of positive examples that were correctly classified.
Precision
(also known as the positive predictive value) is defined as the proportion of positive examples that are truly positive. A model with high precision is trustworthy.
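Sensitivity, specificity, and precision can all be computed directly from confusion matrix counts; a self-contained R sketch with assumed (illustrative) counts:
# Assumed counts, not from the deck: true/false positives and negatives
TP <- 40; FP <- 10; TN <- 45; FN <- 5
sens <- TP / (TP + FN)  # sensitivity / recall / true positive rate
spec <- TN / (TN + FP)  # specificity / true negative rate
prec <- TP / (TP + FP)  # precision / positive predictive value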
kNN Strengths (3)
• Simple and effective. • Makes no assumptions about the underlying data distribution. • Training phase is very fast.
Missing Data
• Missing values can themselves have meaning. • Causes include changes in data collection methods, human error, combining various datasets, human bias, etc.
Types of supervised learning
-Classification -Regression
Types of unsupervised learning
-Clustering
Stratified Random Sampling
-Sample from the data such that original class distribution is maintained. -Works for imbalanced data but is often inefficient
Actionable Association Rules
Rules that provide clear and useful insights that can be acted upon.
set.seed(1234)
sample_set <- sample(nrow(loans), round(nrow(loans) * 0.75), replace = FALSE)
loans_train <- loans[sample_set, ]
loans_test <- loans[-sample_set, ]
• set.seed() ensures the same random values are chosen each time. • Samples 75% of the row indices without replacement (replace = FALSE), so this is not bootstrapping. • Creates training data from the sample and test data from everything not in the sample.
set.seed(1234)
sample.set <- createDataPartition(loans$Default, p = 0.75, list = FALSE)
loans.train <- loans[sample.set, ]
loans.train <- SMOTE(Default ~ ., data.frame(loans.train), perc.over = 100, perc.under = 200)
loans.test <- loans[-sample.set, ]
• Sets the seed so that the same random values are chosen on each run. • Partitions the data (75% of it) using createDataPartition() from the caret package, creating a *stratified* split. • SMOTE balances the training data.
Given two independent events A and B, with P(A) = 0.5 and P(B) = 0.4. The joint probability P(A ⋂ B) =
0.2
T1: bread, milk, beer
T2: bread, diaper, beer, eggs
T3: milk, diaper, beer, coke
T4: bread, milk, diaper, beer
T5: bread, milk, diaper, coke
What is the support of the itemset {beer, coke} in the dataset above?
0.2 (1/5)
Random Forest Steps:
1. Create random vectors. 2. Use the random vectors to build multiple decision trees. 3. Combine the decision trees.
Knowledge Discovery Process (6)
1. Data Collection 2. Data Exploration 3. Data Preparation 4. Modeling 5. Validation & Interpretation 6. Knowledge
Ensemble methods are distinguished by the answers to the following two questions:
1. How are the weak learning models chosen and/or constructed? 2. How are the weak learners' predictions combined to make a single final prediction?
2 ways to reduce computational complexity:
1. Reduce number of candidate itemsets (M) by pruning the itemset lattice (Apriori algorithm). 2. Reduce number of comparisons by using advanced data structures to store the candidate itemsets or to compress the dataset (FP-growth)
At each iteration of the Boosting process: (2)
1. The resampled datasets are constructed specifically to generate complementary learners. 2. Each learner's vote is weighted based on its past performance.
3 Approaches for Choosing the right k in kNN
1. The square root of the number of training examples. 2. Test different k values against a variety of test datasets and chose the one that performs best. 3. Use weighted voting where the closest neighbors have larger weights.
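A quick sketch of the first approach above (the square-root rule of thumb), assuming the loans_train data frame used elsewhere in this deck:
k <- round(sqrt(nrow(loans_train)))  # rule-of-thumb starting value for k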
In the Bagging process: (2)
1. The training datasets are used to generate a set of models using a single learner. 2. The models' predictions are combined using voting (for classification) or averaging (for numeric prediction).
Grid Search Process
1. Use k-fold cross validation to measure model performance 2. Find model performance across k folds 3. Pick the set of parameters that define the model with the highest accuracy
Automated parameter tuning requires you to consider: (3)
1. What type of machine learning model (and specific implementation) should be trained on the data? 2. Which model parameters can be adjusted, and how extensively should they be tuned to find the optimal settings? 3. What criteria should be used to evaluate the models to find the best candidate?
T1: bread, milk, beer
T2: bread, diaper, beer, eggs
T3: milk, diaper, beer, coke
T4: bread, milk, diaper, beer
T5: bread, milk, diaper, coke
What is the support count of ({beer, milk, diaper})?
2
Fuzzy Clustering
A clustering method in which every object belongs to every cluster with a membership weight between 0 (if it absolutely doesn't belong to the cluster) and 1 (if it absolutely belongs to the cluster).
Itemset
A collection of one or more items. e.g. {beer, bread, diaper, milk, eggs}.
Parametric Models
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples). No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.
Trivial Association Rules
Rules that provide insight that is already well-known by those familiar with the domain.
Examples of Boosting Methods
• AdaBoost • Gradient Tree Boosting
Occam's razor
Also called "law of parsimony" when presented with competing hypothetical answers to a problem, one should select the answer that makes the fewest assumptions.
Parameter tuning
The process of adjusting a model's parameters to identify the best fit
The Curse of Dimensionality
As the dimensionality increases, the classifier's performance increases until the optimal number of features is reached. Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.
2 Main Families of Ensemble Methods:
Averaging Methods & Boosting Methods
set.seed(1234)
rf.mod <- train(
  Default ~ .,
  data = loans.train,
  method = "rf",
  metric = "Kappa",
  trControl = ctrl,
  tuneGrid = grid
)
rf.mod
Creates a random forest model
Examples of averaging methods
Bagging methods and Random Forests.
set.seed(1234)
tree.mod <- train(
  Default ~ .,
  data = loans.train,
  method = "rpart",
  metric = "Kappa",
  trControl = ctrl,
  tuneGrid = grid
)
Creates training model using: (1) The control object, (2) Our tuning grid, and (3) Our model performance evaluation statistic (Kappa).
grid <- expand.grid(
  .model = "tree",
  .trials = c(1, 5, 10, 15, 20, 25, 30, 35),
  .winnow = FALSE
)
Creates a grid of tuning parameters for C5.0. To build the grid we use expand.grid(), which fills in every combination without having to do it cell by cell.
Example of a regression problem
Can I determine a country's currency exchange rate based on its GDP?
round(prop.table(table(loans$Default)), 2)
round(prop.table(table(loans_train$Default)), 2)
round(prop.table(table(loans_test$Default)), 2)
Checks the class proportions of Default in the full, training, and test datasets
CART
Classification And Regression Trees
Examples of nominal (or discrete) features
Color, shape, angle and number of edges
loans$Grade <- as.factor(loans$Grade)
Converts a categorical feature to a factor (internally, a vector of integer codes with associated labels)
grid <- expand.grid(.mtry = c(3, 6, 9))
Creates a search grid based on the available tuning parameters for random forest (mtry)
tree.mod <- train(Default ~ ., data = loans.train, method = "rpart")
Create a simple model using caret's train() function and the rpart decision tree learner.
CrossTable(
  loans_pred,
  loans_test$Default,
  prop.chisq = FALSE,
  prop.t = FALSE,
  prop.r = FALSE,
  dnn = c('predicted', 'actual')
)
Creates a confusion matrix
accuracy <- mean(test == pred)
precision <- posPredValue(as.factor(pred), as.factor(test), positive = "Yes")
recall <- sensitivity(as.factor(pred), as.factor(test), positive = "Yes")
fmeasure <- (2 * precision * recall) / (precision + recall)
kappa <- kappa2(data.frame(test, pred))$value
auc <- as.numeric(performance(roc.pred, measure = "auc")@y.values)
comparisons <- tibble(
  approach = "Classification Tree",
  accuracy = accuracy,
  fmeasure = fmeasure,
  kappa = kappa,
  auc = auc
)
Computes all of the performance measures and collects them in a tibble for side-by-side comparison
loans %>%
  keep(is.factor) %>%
  gather() %>%
  group_by(key, value) %>%
  summarise(n = n()) %>%
  ggplot() +
  geom_bar(mapping = aes(x = value, y = n, fill = key), color = "black", stat = 'identity') +
  coord_flip() +
  facet_wrap(~ key, scales = "free") +
  theme_minimal()
Creates bar charts for all categorical features
grid <- expand.grid(
  nrounds = 20,
  max_depth = c(4, 6, 8),
  eta = c(0.1, 0.3, 0.5),
  gamma = 0.01,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = c(0.5, 1)
)
Creates grid for extreme gradient boosting
loans %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot() +
  geom_histogram(mapping = aes(x = value, fill = key), color = "black") +
  facet_wrap(~ key, scales = "free") +
  theme_minimal()
Creates histograms for all numeric features
Sparsity and Density
Data sparsity and density describe the degree to which data exists for each feature of all observations. So if a table is 80% dense, then 20% of the data is missing or undefined. This means it is 20% sparse.
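If missing values are coded as NA, the split can be measured directly; a sketch assuming the loans data frame used elsewhere in this deck:
sparsity <- mean(is.na(loans))  # share of missing cells
density  <- 1 - sparsity        # share of populated cells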
Pruning (2)
Decision trees have a tendency to overfit against the training data. • To remediate this, the size of the tree is reduced in order for it to generalize better.
The primary difference between classification and regression is that classification is used to predict _____ values, while regression is used to predict ______ values.
Discrete, Continuous
legend(0.6, 0.6, c('Classification Tree', 'Random Forest', 'Extreme Gradient Boosting'), 2:4)
Draws ROC legend
C5.0 uses _____ for impurity
Entropy
The AUC metric and ROC curve can be used interchangeably because if two models have the same or identical AUC values, they will always have the same ROC curve. True or False?
False
Cold-deck Imputation
Fill in the missing value using similar instances from another dataset.
Hot-deck imputation
Fill in the missing value using similar instances from the same dataset
Support count
Frequency of an itemset.
head(loans_pred, n=20)
Get a glimpse of the predictions
CART uses _____ for impurity
Gini
T1: bread, milk, beer
T2: bread, diaper, beer, eggs
T3: milk, diaper, beer, coke
T4: bread, milk, diaper, beer
T5: bread, milk, diaper, coke
What is the lift of ({beer, milk}→{diaper})?
Given rule X→Y, lift t(X→Y) = c(X→Y) / s(Y):
c({beer, milk}→{diaper}) = 0.67
s({diaper}) = 0.8
t({beer, milk}→{diaper}) = 0.67 / 0.8 ≈ 0.84
T1: bread, milk, beer
T2: bread, diaper, beer, eggs
T3: milk, diaper, beer, coke
T4: bread, milk, diaper, beer
T5: bread, milk, diaper, coke
What is the confidence of the rule ({beer, milk}→{diaper})?
Given rule X→Y, confidence c(X→Y) = s(X ∪ Y) / s(X):
s({beer, milk, diaper}) = 0.4
s({beer, milk}) = 0.6
c({beer, milk}→{diaper}) = 0.4 / 0.6 ≈ 0.67
Cluster Sampling (3)
Group or segment data based on similarities, then randomly select from each group. • Efficient. • Typically not optimal.
Example of a clustering problem
How can I group supermarket products using purchase frequency?
Support (2)
How frequently a rule occurs in the dataset. Support is the fraction of transactions containing an itemset.
Averaging methods
Independently built models with their predictions averaged or combined by a voting scheme. They attempt to reduce the variance of a single base estimator.
Example of a classification problem
Is an email message spam or not?
As we discussed in class, the elbow method makes use of the Within Cluster Sum of Squares (WCSS) metric to suggest the appropriate value for "k". If we keep increasing the value for "k", what will happen to the value for WCSS?
It will tend toward zero
How to fix Random Initialization trap
K-means++
library(DMwR)
Loads the DMwR package, which provides the SMOTE() function
modelLookup("rf")
Look up parameters for model rf
Min-Max Normalization
Rescales feature values to a fixed range, typically [0, 1], using x' = (x - min) / (max - min). (It is z-score standardization, not min-max normalization, that makes the mean zero.)
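A minimal sketch, using a hypothetical normalize() helper:
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(10, 20, 30, 40, 50))  # 0.00 0.25 0.50 0.75 1.00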
Resubstitution Error
Most learners present performance measures during training. This is known as the resubstitution error. This metric is overly optimistic and cannot reliably measure future performance. • A better measure of future performance is evaluation against unseen data.
Non-Parametric Models
Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing just the right features
sampling without replacement
Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
sampling with replacement
Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
Bayes Theorem
P(A | B) = P(A ⋂ B) / P(B)
P(spam | free) = (P(free | spam)*P(spam))/P(free) Label each expression.
P(spam | free) = posterior probability P(free | spam) = Likelihood P(spam) = prior probability P(free) = marginal likelihood
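Plugging in assumed (illustrative) numbers, not values from the deck:
p_free_given_spam <- 0.10  # likelihood
p_spam <- 0.20             # prior probability
p_free <- 0.05             # marginal likelihood
p_spam_given_free <- (p_free_given_spam * p_spam) / p_free  # posterior = 0.4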
roc.pred <- prediction(predictions = prob, labels = test)
roc.perf <- performance(roc.pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf, main = "ROC Curve for Loan Default Prediction Approaches", col = 2, lwd = 2)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
Plots the ROC curve, then adds a diagonal reference line representing a classifier with no predictive value
Automated parameter tuning
Rather than manually and arbitrarily choosing values for each of a model's parameters, it is better to conduct a search to identify the optimal combination of parameters using a choice of evaluation methods and metrics
ROC Curve
The Receiver Operating Characteristic (ROC) curve is commonly used to examine the trade-off between detecting true positives and avoiding false positives.
Dimensionality
Represents the number of features in the dataset.
In simple linear regression, the errors that are squared and summed are also known as __________
Residuals
Bootstrapping
Sampling with replacement that creates a sample of the same size as the original data. Bootstrapping typically uses less data than cross-validation for training; therefore, its test error will be rather pessimistic.
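A sketch of drawing one bootstrap sample, assuming the loans data frame used elsewhere in this deck:
set.seed(1234)
n <- nrow(loans)
boot_idx <- sample(n, n, replace = TRUE)  # same size as the data, with replacement
loans_boot <- loans[boot_idx, ]           # some rows repeat, others are left out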
Systematic Sampling (2)
• Select instances from an ordered sampling window by systematically selecting every kth element, where k = N/n, N is the population size, and n is the sample size. • Risks interacting with periodic irregularities in the data.
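A sketch of the every-kth-element rule, again assuming the loans data frame; the random start is an added assumption, not part of the deck's definition:
N <- nrow(loans)
n <- 100                 # desired sample size
k <- floor(N / n)        # sampling interval k = N / n
start <- sample(1:k, 1)  # random start within the first window
loans_sys <- loans[seq(from = start, by = k, length.out = n), ]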
Boosting is built...
Sequentially
Boosting methods
Sequentially built models which are combined to produce a powerful ensemble. They attempt to reduce the bias of the combined estimator.
Simple Random Sampling (3)
Shuffle the data and then select examples. • Avoids regularities in the data. • May be problematic with imbalanced data.
Occam's Razor
Simpler methods are not only computationally more efficient, they also reduce the chance of overfitting; thus, when two models perform similarly, the simpler one should be chosen.
Area Under the Curve (AUC)
The AUC treats the ROC diagram as a two-dimensional square and measures the total area under the ROC curve. AUC ranges from 0.5 (for a classifier with no predictive value) to 1.0 (for a perfect classifier).
Class
The attribute or feature that is described by the other features within an instance.
Euclidean distance
The straight-line distance between two points; in k-means, the distance between each instance and the centroid
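In R, for two numeric vectors (a generic sketch with a hypothetical helper):
euclidean <- function(x, y) sqrt(sum((x - y)^2))
euclidean(c(1, 2), c(4, 6))  # 5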
Lift (t) (2)
The increased likelihood that a rule occurs in a dataset relative to its typical rate of occurrence. Lift is the confidence of the rule X→Y divided by the support of the itemset containing only Y.
Transaction (T)
The itemset for an observation.
logit function
The link function used for binomial logistic regression: logit(p) = ln(p / (1 - p))
Ensemble
The meta-learning approach that utilizes the principle of creating a varied team of experts. Ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created.
The K in k-NN has to do with ______________.
The number of labeled observations to compare with the unlabeled observation.
Confidence (c) (2)
The predictive power or accuracy of a rule. Confidence is the support of the itemset containing both X and Y divided by the support of the itemset containing only X.
Supervised Learning
The process of training a predictive model
Apriori Principle / Anti-Monotone Principle of Support
The support of an itemset never exceeds that of its subsets. Therefore, if a subset of an itemset is infrequent, then the itemset itself must also be infrequent.
Meta-learning
The technique of combining and managing the predictions of multiple models
ctrl <- trainControl(method = "cv", number = 10, selectionFunction = "oneSE")
The trainControl() function creates a set of configuration options. This is known as a control object. We will focus on two things: (1) The resampling strategy (method, number(k)), and (2) The measure used for choosing the best model (selectionFunction).
Unsupervised learning
The training of descriptive models that are concerned with summarizing or grouping data in new and interesting ways. In these types of models, no single feature is more important than any other.
loans_mod <- naiveBayes(Default ~ ., data = loans_train, laplace = 1)
Trains a naive Bayes model with an additive (Laplace) smoothing value of 1; the model can then output predicted classes or probabilities
Decimal Scaling
Transform the data by moving the decimal points of values of feature F. The number of decimal points moved depends on the maximum absolute value of F.
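A minimal sketch with a hypothetical helper; j is the smallest integer such that all scaled values have absolute value below 1:
decimal_scale <- function(x) {
  j <- ceiling(log10(max(abs(x))))  # number of decimal places to shift
  x / 10^j
}
decimal_scale(c(120, -500, 730))  # 0.12 -0.50 0.73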
Discretization
Transformation of continuous data into discrete counterparts, similar to the process used for binning. This is done because some algorithms work only with discrete (or only with continuous) values.
Dummy Variables
Transformation of discrete features into a series of continuous features (usually with binary values).
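One way to do this in R is caret's dummyVars(), sketched here against the loans data used elsewhere in this deck (assumes Grade is already a factor):
library(caret)
dummies <- dummyVars(~ Grade, data = loans)        # one binary column per level
grade_dummies <- predict(dummies, newdata = loans)
head(grade_dummies)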
confusionMatrix(tree.pred, loans.test$Default, positive = "Yes")
View confusion matrix
Before we use k-NN, what can we do if we have significant variance in the range of values for our features?
We normalize our data
Stacking
When ensembles utilize another model to learn a combination function from various combinations of predictions
head(predict(tree.mod, loans.test, type = "raw"))
Will output actual predicted classes
head(predict(tree.mod, loans.test, type = "prob"))
Will output predicted probabilities
caret package
A wrapper package that provides a unified interface for training and tuning models from many other machine learning packages
Recall
a measure of the completeness of the results. This measure gives the same values as sensitivity. A model with high recall has wide breadth
Extreme Gradient Boosting (XGBoost)
an implementation of gradient boosted decision trees designed specifically for speed and performance. • With gradient boosting, instead of assigning weights to models at each iteration, subsequent models attempt to predict the residuals of prior models using a gradient descent algorithm to minimize loss
Grid Search is
building and evaluating a model for each combination of parameter values in a predefined grid, then selecting the best-performing combination
Association rules do not imply...
causality, they simply imply a strong co-occurrence relationship between items.
Allocation function
dictates how much of the training data each model receives.
The F-Measure assumes...
equal weight for precision and recall. This is a limitation
Error due to Bias
errors made as a result of the specified learning algorithm.
Note that regression does NOT...
establish causation between the independent variables and the dependent variable.
The goal of cross validation is to
estimate future performance on unseen data by averaging results across iterations
Because events cannot simultaneously happen and not happen...
events are mutually exclusive with their complements
According to the formal definition of machine learning, "A computer program is said to learn from _______ with respect to some class of _______ and performance measure P, if its performance at ________, as measured by P, improves with __________".
experience (E), tasks (T), tasks (T), experience (E)
One of the major disadvantages of the leave-one-out cross-validation approach is that it _______________.
is computationally expensive
Random initialization trap
k-Means is very sensitive to the initial randomly chosen cluster centers
Hyperparameters
one or more parameters that need to be tuned before the learning process begins.
loans_raw <- predict(loans_mod, loans_test, type = "raw")
Outputs the predicted probabilities using type = "raw"
K-means clustering employs a __________ approach to clustering
partitional, exclusive and complete
A large k...
reduces the impact of noisy data but increases the risk of ignoring important patterns.
grid <- expand.grid(.cp = seq(from = 0.0001, to = 0.005, by = 0.0001))
Creates the rpart tuning grid (complexity parameter cp)
Examples of continuous features
temperature, height, weight, age
Information Gain (2)
The change in entropy that would result from a split on each possible feature. • The split with the highest information gain is chosen.
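A self-contained R sketch with hypothetical helpers (the split below is a toy example):
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log2(p))
}
info_gain <- function(parent, children) {
  w <- sapply(children, length) / length(parent)      # child weights
  entropy(parent) - sum(w * sapply(children, entropy))
}
parent <- c("Yes", "Yes", "Yes", "No", "No", "No")
children <- list(c("Yes", "Yes", "Yes"), c("No", "No", "No"))
info_gain(parent, children)  # 1: a perfect split removes all entropy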
Clustering results in labels against previously unlabeled data, that is why it is sometimes referred to as _____________________
unsupervised classification
Lazy Learners
• A class of non-parametric learning methods that do not generate a model but instead make use of verbatim training data for classification. • They are also known as instance-based learners or rote learners.
Naive Bayes Weaknesses (4)
• Assumes that features are equally important and independent. • Not suitable for datasets with many numeric features. • Correlated features degrade performance. • Estimated probabilities are less reliable than predicted classes.
Advantages of Ensembles (4)
• Better generalizability to future problems. • Improved performance on massive or minuscule datasets. • The ability to synthesize data from distinct domains. • A more nuanced understanding of difficult learning tasks
Logistic regression can be of three types depending on the type of dependent variable:
• Binomial Logistic Regression • Multinomial Logistic Regression • Ordered Logistic Regression
Adaptive boosting strengths
• Boosting is a relatively simple ensemble method to implement. • Requires less parameter tuning compared to other ensemble methods. • Can be used with many different classifiers.
Association Rules Strengths (3)
• Capable of working with large amounts of transactional data. • Rules are easy to understand. • Useful for "data mining" and discovering unexpected patterns in data.
Any regression analysis involves three key sets of variables:
• Dependent or response variable (y). • Independent or predictor variables (x). • Model parameters (unknown parameters to be estimated by the regression model).
kNN Weaknesses (4)
• Does not produce a model. • The selection of an appropriate k is often arbitrary. • Rather slow classification phase. • Does not handle missing, outlier and nominal data well without pre-processing.
Decision Tree Strengths (7)
• Does well on most problems. • Handles numeric and nominal features well. • Does well with missing data. • Ignores unimportant features. • Useful for both large and small datasets. • Output is easy to understand. • Efficient and low cost model.
5 ways to choose the right "k" in k-means
• Elbow Method • Information Criterion Approach • Silhouette method • Jump method • Gap statistic
Adaptive boosting weaknesses
• High tendency to overfit with many weak learners. • Rather slow training time. • Sensitive to noisy data and outliers.
Match-based Imputation (2)
• Impute based on similar instances with non-missing values. • Two approaches: hot-deck and cold-deck imputation.
What are Box Plots good for? (4)
• Is a feature significant? • Does the location differ between subgroups? • Does the variation differ between subgroups? • Are there outliers in the data?
What are Odds Plots good for?
• Is a feature significant? • How do feature values affect the probability of occurrence? • Is there a threshold for the effect?
What are scatter plots good for?
• Is a feature significant? • How do features interact? • Are there outliers in the data?
List of Parametric Models (5)
• Logistic Regression • Linear Discriminant Analysis (LDA) • Perceptron • Naive Bayes • Simple Neural Networks
Logistic Regression Weaknesses (4)
• Makes strong assumptions about the data. • Does not do well with missing data. • Tends to underperform when there are multiple or non-linear decision boundaries. • Does not naturally capture complex relationships
Association Rules Weaknesses
• Not very useful for small data sets. • Separating true insight from common sense requires some effort. • Easy to draw misleading conclusions from random patterns.
Logistic Regression Strengths (4)
• Outputs have a nice probabilistic interpretation. • Can be regularized to avoid overfitting. • Easy to implement and use. • Very efficient to train.
Random Forest Strengths (4)
• Performs well on most problems. • Handles noisy or missing data as well as categorical or continuous features. • Selects only the most important features. • Works for data with an extremely large number of features.
Feature
• Property or characteristic of an instance. • Features can be discrete or continuous.
Sampling (2)
• Sampling is typically used because, sometimes, it is too expensive or time-consuming to use all of the available data to generate a model. • The sample subset should permit the construction of a model representative of a model generated from the entire data set.
Naive Bayes Strengths (5)
• Simple, fast and effective. • Works well with noisy and missing data. • Useful for both continuous and discrete features. • Works well with both little or large training data. • Estimated probabilities are easy to obtain.
K-Means Weaknesses (5)
• Simplistic algorithm. • Relies on chance. • Sometimes requires some domain knowledge in determining the ideal number of clusters. • Not ideal for non-spherical clusters. • Works with numeric data only.
Dummy variables are useful because:
• Some algorithms only work with continuous features. • It is a useful approach for dealing with missing data. • It is a necessary pre-step in dimensionality reduction such as with PCA (Principal Component Analysis).
Decision Tree Weaknesses (5)
• Splits biased towards features with a large number of levels. • Easy to overfit or underfit. • Reliance on axis-parallel splits is limiting. • Small changes in data result in large changes to decision logic. • Large trees can be difficult to interpret or understand.
Adaptive Boosting (AdaBoost) (3)
• The Adaptive Boosting learner works by sequentially adding weak models which are trained using weighted training data. • Each model is assigned a stage value which corresponds to how accurate it is against the training data. • The prediction for the ensemble is taken as the sum of the weighted predictions of the weak classifiers.
Complexity Parameter (2)
• The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. • When used for pre-pruning, if the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue.
What criteria to use (when considering automated parameter tuning) (2)
• These include statistics such as accuracy and kappa (for classifiers) and R-squared or RMSE (for numeric models). • Cost-sensitive measures such as sensitivity, specificity, and area under the ROC curve (AUC) can also be used.
Instance (4)
• Thing to be classified, associated or clustered. • Individual independent example of target concept. • Described by a set of attributes or features. • A set of instances are the input to the learning scheme.
Which parameters to tune (when considering automated parameter tuning) (4)
• This is largely dependent on the model used and the parameters available in the model. • The goal is to: 1. Create a matrix or grid of parameter combinations. 2. Develop candidate models based on the parameter grid. 3. Perform a search for the best model.
What model to choose (when considering automated parameter tuning) (3)
• This requires an understanding of the strengths and weaknesses of machine learning models. • It also requires an understanding of the data and the machine learning task. Simply understanding whether the task is a classification problem or a numeric prediction problem (regression) helps narrow the choices.
Random Forest Weaknesses
• Unlike a decision tree, the model is not easily interpretable. • May require some work to tune the model to the data. • Increased computational complexity
What are Histograms good for? (5)
• What kind of population distribution does the data come from? • Where is the data located? • How spread out is the data? • Is the data symmetric or skewed? • Are there outliers in the data?
Random Forest
• Also known as *Decision Tree Forest*, this learner focuses only on ensembles of decision trees. It combines the base principles of bagging with random feature selection to add additional diversity to decision tree models. • After the ensemble of trees (the forest) is generated, the model combines the trees' predictions.
List of Non Parametric Models
• k-Nearest Neighbor • Decision Trees • Support Vector Machines
Good clustering will produce clusters with:
‣ High intra-class similarity. ‣ Low inter-class similarity.