Data Mining Exam 1

Confusion Matrix (Binary Classification)

- Compare Actual Class to Predicted Class with 4 sections: f++ (TP), f+- (FN), f-+ (FP), f-- (TN)
-Error rate = (FP+FN)/n: fraction of mistakes
-Accuracy = (TP+TN)/n: fraction of correct predictions
-True Positive Rate (TPR) (sensitivity) = TP/(TP+FN): fraction of positive examples correctly predicted (how many we got right out of the actual positives)
-True Negative Rate (TNR) (specificity) = TN/(FP+TN): fraction of negative examples correctly predicted
-False Positive Rate (FPR) = FP/(FP+TN): fraction of negative examples predicted as positive
-False Negative Rate (FNR) = FN/(TP+FN): fraction of positive examples predicted as negative
-can be used with cross-validation by summing the counts across all the folds (see the sketch below)
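
A minimal sketch of these calculations from hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fn, fp, tn = 40, 10, 5, 45
n = tp + fn + fp + tn

accuracy = (tp + tn) / n          # fraction of correct predictions
error_rate = (fp + fn) / n        # fraction of mistakes
tpr = tp / (tp + fn)              # sensitivity
tnr = tn / (fp + tn)              # specificity
fpr = fp / (fp + tn)              # 1 - specificity
fnr = fn / (tp + fn)              # 1 - sensitivity

print(accuracy, error_rate, tpr, tnr, fpr, fnr)
```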

Bivariate Noise/Outliers

- Some data points may not be outliers in any single attribute, but are when attributes are combined Fixing: -Deletion -Transformation (to reduce the extremeness of the feature) -Binning -Imputation -Separation -Leave as-is

Bayesian Classifiers

-Bayesian Classifiers take a probabilistic approach to classification -relationships between input features and class are expressed as probabilities

Bivariate Analysis - Continuous & Categorical

-Box Plots -Z-test (is the difference in these statistically significant? Calculation/formula)

Aggregation

-Combining two or more attributes into one -makes the data size smaller -Example: combining Mass & Volume into Density

Reduced Nearest Neighbor (RNN)

-Create S, a subset of T (the training set); Start with S=T -if removing record r from S does not cause any other record in T to be misclassified using S, permanently remove record r -typically use k=1 or k=3

Hunt's Algorithm

-Decision tree grown recursively by partitioning the data into successively purer subsets
Step 1: if all records belong to the same class, or all records have the same attribute values, then this is a leaf node
Step 2: else, choose an attribute to split on; an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each partition of the data and the algorithm is recursively applied to each child node (see the sketch below)
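
A self-contained toy sketch of Hunt's algorithm (multiway splits chosen by weighted Gini; the helper names, dictionary tree representation, and weather data are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(records, labels, attrs):
    # Step 1: leaf node if the records are pure or no attributes remain to split on
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]      # majority class label
    # Step 2: choose the attribute whose multiway split gives the lowest weighted Gini
    def weighted_gini(a):
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attrs, key=weighted_gini)
    node = {"split_on": best, "children": {}}
    for v in set(r[best] for r in records):
        pairs = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        sub_records = [r for r, _ in pairs]
        sub_labels = [y for _, y in pairs]
        remaining = [a for a in attrs if a != best]
        node["children"][v] = grow_tree(sub_records, sub_labels, remaining)
    return node

# Toy records: should we play outside?
records = [{"outlook": "sunny", "windy": "no"},  {"outlook": "rainy", "windy": "yes"},
           {"outlook": "sunny", "windy": "yes"}, {"outlook": "rainy", "windy": "no"}]
labels = ["yes", "no", "yes", "yes"]
print(grow_tree(records, labels, ["outlook", "windy"]))
```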

Feature Selection

-Dimensionality reduction by selecting a subset of the attributes
-Domain Knowledge (remove irrelevant features using domain knowledge)
-Missing Values Ratio (columns with lots of missing values are unlikely to be helpful)
-Low Variance Filter (columns with low variance aren't going to help much)
-High Correlation Filter (highly correlated columns are redundant; they say the same thing)
-Feature Creation (combine 2+ features into one)
-Filter Methods: score each feature based on a statistical correlation with the target variable
-Wrapper Methods: train a model using a subset of features; based on the results, decide to add or remove features from the subset -Forward Feature Construction (start with 1 feature, add the best second feature, and continue until accuracy stops improving) -Backward Feature Elimination (the reverse) -both try to find the best subset of features
-Embedded Methods: feature selection occurs as part of the data mining algorithm
-Performed inside the cross-validation loop (specific to the training set) (see the sketch below)
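
A sketch of the filter and wrapper ideas using scikit-learn (the dataset, scoring function, and number of features kept are illustrative choices, and RFE is used here as one backward-elimination-style wrapper):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score each feature against the target, keep the top 5
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter picks:", filt.get_support(indices=True))

# Wrapper method (backward-elimination flavor): repeatedly drop the weakest
# feature according to a model until 5 remain
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("wrapper picks:", wrap.get_support(indices=True))
```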

Machine Learning

-Ex: images of hand-written numbers put into generic machine learning algorithm and output to "1" or "2" or "3" or etc -Ex: emails put into generic machine learning algorithm and output to spam or not spam

Curse of Dimensionality

-Exponential increase in the data required to densely populate space as the dimension increases -data points get further away from each other as the number of dimensions increases -as the number of dimensions increases, the points occupy an increasingly sparse region of the space -Dimensionality reduction: process of reducing the number of variables under consideration (Feature Selection, Feature Extraction)

Characteristics of Naive Bayes

-Fast and simple algorithm -probabilities can all be calculated with a single scan of the data, then stored; iterative processing of the data is not necessary -scales well: avoids problems with the curse of dimensionality -independence assumption may not always hold true -in practice, works quite well -good for text classification

Distinguish Winners in Data Science

-Feature Creation -Feature Selection -boils down to creating variables which capture hidden business insights and then making the right choices about which variables to use in predictive models

Decision Tree

-For classification -Start with the root node -Internal Nodes -Leaf Nodes hold the class labels (what we are trying to classify as) -Every node besides a leaf represents a feature in the data set; each branch off it is a value that feature can take, and a record goes one way or the other based on its value for that feature

Explicit vs Implicit Data

-For example, explicit will be exact date while implicit is the day of the week

Univariate Analysis - Categorical Variables

-Frequency Table, Bar Chart, Pie Chart

Data Preprocessing

-Fusing Data (from multiple sources) -Cleaning Data -Data Exploration -Sampling/Aggregation -Dimensionality Reduction -Feature Selection/Creation -Discretization/Binarization

Partition Data

-Holdout method: split data into a training set and a test set -Build the model on the training set, evaluate generalization error on the test set -want the testing data to be representative of the training data and vice versa Stratified Sampling: keep the same proportion of class labels in each set Issues: -less data for training (and less data for testing) -some records never get trained on (and some never get tested on) -a class overrepresented in one set will be underrepresented in the other set -varying performance of the model, depending on which records were held out for testing

Types of Ensembles

-Homogenous = All base classifiers are of the same type (ensemble classifiers) -Heterogeneous = multiple types of base classifiers (voting classifiers, multiple classifier systems) -Ensemble itself is a supervised learner (it is trained then used to make a classification) -the goal of ensemble methods is to reduce both bias and variance

Gain Ratio

-Impurity measures tend to favor attributes that have a large number of distinct values (valuing a customer ID over gender) (favoring multi-way splits over binary splits) -an attribute that produces a large number of branches will have a large SplitInfo, reducing its gain ratio -use when the number of splits is not the same, since it makes the comparison more fair -penalizes the extra splits

Supervised Learning

-Inputs with labels put into predictive model -Then given new record, predict output -give algorithm labeled examples to train/learn on

Unsupervised Learning

-Just given features for example (not labels) -Put into categories of things that relate -unlabeled examples given -it tries to figure something out, and finds which things are similar or the same

Characteristics of KNN Classifiers

-KNN classification uses a general technique called instance-based learning, where specific training instances/records are used to make predictions, rather than maintaining an abstraction (model); no need to re-train a model as data changes/updates -simple to understand, easy to implement -classifying a test record can be computationally expensive -data needs to be scaled to avoid one attribute dominating the decision (different ranges can cause things to be farther away or not as far) -can suffer if there is a class imbalance (weighting votes can help) -curse of dimensionality: KNN breaks down in high dimensions -feature selection is critical: irrelevant features can dominate a decision

Selecting the Best Split

-Measures for selecting the best split are often based on the degree of impurity of the child nodes -can calculate impurity with entropy and Gini -max entropy = 1 for two classes (perfectly impure); max Gini = 0.5 (perfectly impure) -want the most pure subsets -start building the tree based on the first choice, then recurse and evaluate the Gini/gain of all splits at each node (can split on the same attribute again) (see the sketch below)
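
Minimal impurity calculations from a node's class counts (the counts are made up):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(entropy([5, 5]), gini([5, 5]))     # perfectly impure (2 classes): 1.0 and 0.5
print(entropy([10, 0]), gini([10, 0]))   # perfectly pure: 0.0 and 0.0
```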

Handle Missing Values

-Missing Values can lead to wrong results -List Wise Deletion (Take out records that have missing values) (delete entire record if missing info) -Pair Wise Deletion (Take out just the values which are missing) (ignore blanks and use rest) (sometimes an option, but not always) -Imputation (Replace missing values with reasonable expected value) (could be average, assumption)

Attribute test conditions

-Multiway Split -Binary Split (by grouping attribute values) -For continuous attributes, can have node be whether it is < or > value, or can have multiway splits with ranges -Must preserve order among ordinal attributes

Noise/Outliers

-Noise: Random Errors in the data; legit error in data that stands out bc it is an error -Outliers: Anomalous objects with characteristics that are different from, or unusual with respect to, the rest of the data; values way outside norm of data -Can skew results -determining whether noise or outlier is difficult

Feature Scaling

-Normalization: scale all data to a set range (usually [0, 1]) -Standardization: rescale data to have a mean of 0 and standard deviation of 1 -Performed inside the cross-validation loop (specific to the training set) -scale so all features are on the same scale -outliers/extreme values can shorten the range of all the other values (squishes the data down to a short interval), so don't normalize; standardize instead (see the sketch below)
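
A small sketch contrasting min-max normalization and standardization (the values, including the extreme one, are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])        # note the extreme value

normalized = (x - x.min()) / (x.max() - x.min())  # squashed into [0, 1]
standardized = (x - x.mean()) / x.std()           # mean 0, std 1

print(normalized)      # the outlier squishes the other values near 0
print(standardized)
```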

Dimensionality

-Number of attributes in a data set -Each row as a d-dimensional point -graphing features: plotting points out -categorical on graph could be just ticks on axis -color can help show another dimension

Predictive Modeling

-Predict the class of a new or unknown record -We don't know class label of that record, only know the features

Sampling

-Reduce the dataset size -Will become less overfitted -Don't reduce too much or will be underfitted -Want a sample size that has good probability of getting at least one object from each group (want it to be representative of data set) -good for data exploration

Bivariate Analysis - Continuous & Continuous

-Scatter Plot (helps identify outliers/noise) -Points tight along a positive slope: strong positive correlation -Points loose around a positive slope: moderate positive correlation -Blob: no correlation -(the same applies for negative slopes) -Curve: curvilinear relationship -Correlation Coefficient

Reinforcement Learning

-Sort of like unsupervised -See categories, then assign labels -See how many correct -If errors, try again -pass in data, tries to figure out, shows what it believes, we give feedback if right/wrong, then it adjusts and tries again

Data Cleaning

-The process of trying to detect and correct data quality issues

Bivariate Analysis-Categorical & Categorical

-Two-way table -Chi-squared test (tests whether 2 categorical variables are independent or not; Does knowing value of one tell anything about value of other) -Stacked Column Chart

Feature Extraction/Dimensionality Reduction

-Use techniques from linear algebra to project the data from a high-dimensional space to a lower-dimensional space -want to preserve the maximum amount of variance in the data -values can/will change -Principal Component Analysis (PCA) (identifies the best hyperplane to project the data onto); captures the direction with the most variance (what % of the variance is captured on a single principal component) (decide how much variance you want to keep to decide how many dimensions to project to) -Singular Value Decomposition (SVD) -Linear Discriminant Analysis (LDA) (see the sketch below)
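
A sketch of PCA-based reduction with scikit-learn (keeping 95% of the variance and using the iris data are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)                   # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```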

Postprocessing

-Visualization -Analysis

Data

-a collection of objects and their attributes

Overfitting

-a model that fits the training data too well (has low training error) may have a higher generalization error than a model with a higher training error -an overfitted model has become too specific to the training data and will not generalize well to new data -when graphed, the decision boundary follows the training data exactly -look at a graph to compare over- and underfitting -can lead to misclassifying new records

Building the Final Model

-after folds, now have k different decision trees -once a good solution is found (got goal average accuracy) -run the whole process on your entire dataset to make a final model (don't use any of the specific folds) (no separate training or testing records) -perform scaling, feature selection, dimensionality reduction, etc on the entire dataset and train a model on the entire dataset -final model used in real-world to make predictions on new data

Exploratory Data Analysis (EDA)

-analyzing data -first explore and notice simple things -scatter plots of the data help find noise & outliers -analyzing one feature at a time = univariate

Classification

-assigning objects to one of several predefined categories or class labels -done with supervised learning -based on features, predict what class it is in

Bayes: Continuous Attributes

-option 1: bin the features first -option 2: use a probability density function

Training vs Test Error on Graph

-both decrease until a point where test data begins to get worse or plateaus -the point is where getting too specific to training data -point where overfitting begins to occur

Area Under the Curve (AUC) for ROC Curves

-can be used to compare classifiers -an alternative to accuracy -doesn't help pick a threshold, because threshold values are not explicit or in order
1 >= AUC >= 0.9: excellent (A)
0.9 > AUC >= 0.8: good (B)
0.8 > AUC >= 0.7: fair (C)
0.7 > AUC >= 0.6: poor (D)
0.6 > AUC >= 0.5: fail (F)
-since the curve should hug the upper-left corner, we want the AUC to be as large as possible (max is 1, since both axes max out at 1)

Ensemble Methods

-improve classification accuracy by aggregating the predictions of multiple classifiers -construct a set of base classifiers from the training data and perform classification by taking a vote on the predictions made by each classifier -the ensemble will make a wrong prediction only if more than half of the base classifiers predict incorrectly -the error rate of the ensemble can be computed from this (see the sketch below) -for an ensemble classifier to perform better than a single classifier: -all base classifiers must perform better than 50% (better than random guessing) -all base classifiers must be independent (not identical) -accurate and diverse
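
A minimal sketch of the ensemble error-rate calculation, assuming 25 independent base classifiers that each have a 0.35 error rate (illustrative numbers): the majority vote is wrong only when 13 or more base classifiers are wrong.

```python
from math import comb

def ensemble_error(n_classifiers, base_error):
    # number of wrong votes needed to flip the majority
    k_min = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k) * base_error**k * (1 - base_error)**(n_classifiers - k)
               for k in range(k_min, n_classifiers + 1))

print(ensemble_error(25, 0.35))   # ~0.06, much lower than the 0.35 of a single classifier
```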

Types of Attributes

Categorical, Qualitative:
-Nominal (==, !=): names, labels (pure categories)
-Ordinal (<, >): categories that can be ordered (ex: small, medium, large)
Numeric, Quantitative, Continuous:
-Interval (+, -): differences between values are meaningful
-Ratio (*, /): ratios are meaningful

What does a data scientist primarily do?

Clean Data -debugging code to fit models

In real life: Changing Pattern

Getting new data: -models don't always auto-update -must re-train a whole new tree -need to re-train regularly If no new data arrives because of the tree itself? -the feedback loop is broken, so the pattern can't change -make sure the pattern can adjust over time -applies to spam filtering, medical results, credit fraud, sports data, ads

Partitioning Data for Validation Set

Holdout method issues (2/3 for training, 1/3 testing): -less data for training (and less data for pruning/validation) -some records never get trained on (and some never get used for validation) -a class overrepresented in one set will be underrepresented in the other set) -varying performance of the model, depending on which records were held out for testing vs validation

Bayes Theorem

P(B | A) = (P(A | B) * P(B))/(P(A))
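
A tiny worked example of the formula (the spam/"free" probabilities below are made up for illustration):

```python
# P(spam) = 0.2, P("free" appears | spam) = 0.6, P("free" appears | not spam) = 0.05
p_spam = 0.2
p_free_given_spam = 0.6
p_free_given_ham = 0.05

# P(A) via the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)   # 0.75
```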

Cost Matrix

-can encode a penalty for misclassification errors
-a negative entry in the cost matrix indicates a reward for making a correct prediction
-specify a cost for each cell of the matrix
-can be used for evaluation of a classifier
-a model can have a lower error rate and a better F-score yet still a higher cost
-to calculate total cost: take the cost in each cell of the matrix, multiply by the number of records in that cell, and sum
-useful when identifying the rare class is more important
-a model can improve accuracy at the expense of making more costly errors
-Risk: the chance of making an error multiplied by its cost (can find the risk of predicting both positive and negative)
Risk(-) = prob(+) * cost(FN)
Risk(+) = prob(-) * cost(FP)
-choose to classify as the class that has the lowest risk
-evaluate the risks for new records and predict the low-risk class (see the sketch below)
-can also factor in the chance you are correct
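
A tiny sketch of risk-based prediction, assuming hypothetical costs cost(FN) = 100 and cost(FP) = 1:

```python
cost_fn, cost_fp = 100, 1

def predict_by_risk(p_positive):
    risk_neg = p_positive * cost_fn          # risk of predicting negative
    risk_pos = (1 - p_positive) * cost_fp    # risk of predicting positive
    return "+" if risk_pos < risk_neg else "-"

# even a small chance of the positive class makes "+" the lower-risk prediction
print(predict_by_risk(0.02))    # "+"
print(predict_by_risk(0.005))   # "-"
```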

Nested CV for KNN

-classify the test set using the best k found in the inner CV loop; gets you accuracy for one fold of the outer CV loop -repeat for all folds of the outer CV loop -average accuracies from the outer CV loop to get overall estimate of generalization accuracy/error to build final model: -train on all data -perform inner loops on k's to find best k based on entire data and accuracies of folds

Decision Tree Characteristics

-constructing decision trees is computationally inexpensive/cheap, even when the training set is very large -once the tree is built, classification of new records is extremely fast -decision trees are easy to interpret -decision trees are robust to the presence of noise, especially when pruned to avoid overfitting (can leave outliers in) -presence of redundant/irrelevant attributes does not affect accuracy; presence of irrelevant attributes may result in larger-than-necessary trees (the algorithm knows splits on them won't have the best impurity (no good gain)) -the number of records becomes smaller as we traverse down the tree; at some leaves, the number of records may be too small to make a statistically significant decision about class representation (data fragmentation) (must be careful)

Decision Boundaries

-decision trees use axis-parallel hyperplanes to split the data space into purer partitions -each decision draws a plane in space -want to get to where each space created is same shape/color -hyperplane because multi-dimensional

Lazy Learners

-delay the process of generalizing the training data until it is needed to classify test examples -Ex: nearest neighbor -waits to build model until classify

Gain

-to determine the performance of a test condition, compare the impurity of the parent node (before the split) to the impurity of the child nodes (after the split) -the larger the gain, the better the split -helps to figure out the best split -for splitting on continuous attributes, must find the best value to split on -first sort the records; midpoints between values are the possible splits -only check threshold values between records where the class label changes -no need to loop over the data for each candidate threshold; only a constant amount of time is needed to update the class distribution and re-calculate the impurity as the threshold moves

Underfitting

-doesn't classify well at all -when graphed, line doesn't split properly -model hasn't learned patterns well enough -bad training error and generalization error

Descriptive Modeling

-explain, describe, summarize the data -What features define various class labels? -Ex. what features define a mammal, reptile, bird, fish, amphibian

Feature Engineering

-features are the columns; what we know about each data point -Creating new features from the data you have -Effective for improving predictive models -Choosing features to use -Feature Transformation (make feature non-linear) -Feature Creation (new features from existing ones) -Feature Scaling -can begin to find patterns -do to get more information out of what you already have -most important factor is the features used

Nearest Neighbor

-find all the training examples that are relatively similar to the new record -these are the nearest neighbors -they are used to classify the new record

Accuracy

-fraction of correct predictions on the testing set (1-error rate) -probability of correct prediction

Reduced Error Pruning (REP)

-from the bottom up: -for a decision node that has only leaves as children, replace the decision node with its most popular class from the training data -if the prediction accuracy is not worse on the validation data set, permanently prune the tree at this node

Classification with Bayes Theorem

-given features X = {X1, X2, ..., Xn}, predict class C for the record
-what is the probability the record belongs to each class?
-find the class C that maximizes P(C | X)
Types:
-P(C | X) = posterior probability: probability of the class label being C after observing input features X
-P(X | C) = class-conditional probability: probability of observing input features X, given that C is the class label (use the naive independence assumption to simplify)
-P(C) = prior probability: probability of the class label being C, prior to observing anything (calculate the fraction of records with each class label in the training data)
-P(X) = P(X1)*P(X2|X1)*...: probability of observing input features X, regardless of the class label (a constant that can be ignored because it is the same for all classes)
-probabilistic approach: labels and features expressed as probabilities; predict whichever class has the highest probability
Overall: P(C|X) is proportional to P(X|C)*P(C)
-if the features are all independent, the class-conditional probability is the product of the individual probabilities
-for each class, compute this score; the larger one is the prediction (see the sketch below)
-to report probabilities, normalize so the scores for the classes add to 1
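
A toy sketch of the whole computation with made-up priors and class-conditional probabilities for two observed feature values:

```python
priors = {"yes": 0.6, "no": 0.4}
cond = {                     # P(xi | C) for the observed values of x1 and x2
    "yes": [0.5, 0.3],
    "no":  [0.2, 0.8],
}

# P(C | X) is proportional to P(C) * product of P(xi | C); P(X) is ignored
scores = {}
for c in priors:
    score = priors[c]
    for p in cond[c]:
        score *= p
    scores[c] = score

total = sum(scores.values())                     # normalize so posteriors sum to 1
posteriors = {c: s / total for c, s in scores.items()}
print(posteriors, "->", max(posteriors, key=posteriors.get))
```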

Regression Trees

-hard to make global model for all data -these split dataspace & then on each partition it does regression on just that -regression trees known for instability; small change in dataset could mean entire tree changes

Choosing K

-if k is too small, sensitive to noise -if k is too large, may include points of other classes -use a nested CV to decide k -use a validation set to choose the best k -split training data into train and validation -for each possible k, perform k folds; find accuracy and average for each k -select the best one -want to combat overfitting: increase k, but not to point of underfitting

Laplace Smoothing

-if we try to classify a record that contains an attribute value we've never seen with a given class, all of the probabilities for that class zero out -add 1 to the numerator and denominator of all of your class-conditional probabilities for that attribute value if anything zeroes out -even if the other class labels have nonzero counts for it, add 1 to those as well to keep the comparison fair -add 1 anytime there is an attribute value that does not occur in every class -these can be pre-calculated

Inside the Cross-Validation Loop

-applies inside each fold
-used to assess the performance of the whole process
-critical that the test data remains totally unseen and is not used for any part of model creation
1) on the training set only: perform feature engineering: feature scaling (normalization/standardization), mathematical or algorithmic feature selection, dimensionality reduction (PCA, etc.) -test data should be unseen data, so no decisions should be made based on it!
2) build the model using the training data
3) transform the test data into the feature space that was found on the training data (scale using the values found on the training data, reduce dimensionality using the PCA found on the training data); project the test data the same way the training data was prepared
4) make a prediction using the model; evaluate the prediction (see the sketch below)
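
One way to keep all of this inside the loop is a scikit-learn Pipeline, so the scaler and PCA are re-fit on each training fold only; this is a sketch with illustrative choices of dataset, component count, and classifier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                     KNeighborsClassifier(n_neighbors=5))

# cross_val_score refits the whole pipeline on each training fold, then
# transforms and scores the held-out fold using parameters learned from training only
print(cross_val_score(pipe, X, y, cv=5).mean())
```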

Voting Schemes

-majority voting: every neighbor has same impact on the classification -can weigh each vote according to distance

Eager Learners

-model input attributes to a class label as soon as training data is available -Ex: Decision Trees

If model has high error

-more data -maybe different model -more feature engineering

Class Imbalance

-when there isn't enough of one of the classes, the error rate can be very low even though the classifier won't identify the rare class correctly in practice -don't use accuracy to judge how good the model is; don't evaluate with accuracy -imbalance in the number of labels -the rare class is the positive class when there is a class imbalance (if balanced, either one can be positive)

Building the Final Model including Inner Loop

-once found at least goal accuracy -outer cross validation loop evaluated the process -once process is good, run whole process on entire dataset to make final model -run inner loop on entire dataset (cross-validation with just training set and validation set, but no test set) to tune the hyperparameters -select the best model (the one with best hyperparameters) -final model will be used in the real-world to make predictions on new data -basically like one outer fold with inner loop; no test set on outer fold

Accuracy Error

-percent of correct classification on the test data set (100-Generalization Error)

Training Error

-percent of misclassification errors on the training data set -how many of training set put into correct spots? -error because can't always separate data further

Test Error/Generalization Error

-percent of misclassification errors on the test data set (previously unseen records) -treat an old data subset as new data & see how well it does on the tree -must set some records aside for this (not used for training)

Decision Boundaries in Naive Bayes

-probability distribution in Bayes

K-Nearest Neighbors (KNN)

-represent each training example as a data point in a d-dimensional space (where d is the number of attributes) -given a test record, compute its distance to all of the training points -the test record is classified based on the majority class label of its k nearest neighbors -k is a hyperparameter (like the depth of a tree) -to compute distance: Euclidean distance -for nominal attributes, Hamming distance (or overlap) (distance = 0 if same, distance = 1 if different) -must scale all data to the same scale since we are calculating distances -calculate the distance to all points, then find the k closest -then take a majority vote or weight votes according to distance -usually use an odd number for k (so there is no tie) -could also report a probability (see the sketch below)
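
A minimal KNN sketch with Euclidean distance and majority voting (the 2-D training points and labels are made up):

```python
from collections import Counter
from math import dist

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((4.5, 3.8), "B")]

def knn_predict(x, k=3):
    # sort training points by Euclidean distance to x, keep the k closest
    neighbors = sorted(train, key=lambda item: dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 1.0)))   # "A"
print(knn_predict((4.2, 4.0)))   # "B"
```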

Mitigating Class Imbalances

-sampling based approaches before sending to algorithm -undersampling: remove some of the majority class (risk losing information) -oversampling: duplicate the minority class records (will have overfitting)

Instance Reduction Algorithms

-select a representative subset of the training data to use (eliminate the rest) -can help with classification times -can help with memory/storage issues of storing all training records -can help eliminate noise points and reduce overfitting

K-Fold Cross-Validation

-split data into K folds -continuously shift the test section -Model building (training) and error estimation (testing) is repeated K times -each iteration, one of the folds is used for testing, the rest are used for training -get an error estimate from each fold; average them to establish a final estimate of generalization error -each fold, the testing/training data is different -split the original data into k groups; each time, one group is the test set and the other k-1 are training -keep getting the error, then take the average to see how the model performs in the real world -every record will at some point be used for both training and testing -5-fold or 10-fold is sufficient for estimating error (see the sketch below)
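
A short sketch of 5-fold cross-validation in scikit-learn (the dataset and classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, then the averaged estimate
```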

Pre-Pruning

-stop growing the tree before it is fully grown/before it gets too complex -max tree depth -x% of records in a node have the same class label -number of records in a node falls below a minimum threshold -gain of the split is smaller than some threshold -test whether or not the split is statistically significant (like a chi-squared test)

F-Measure

-summarizes precision and recall in one metric -a combination of precision and recall that measures the extent to which a group contains only objects of a particular class, and all objects of that class -the overall F-measure of the classifier is the mean of the class-specific F-measures -or you could consider the F-measure of only the positive class

Decision Boundaries for KNN

-the smaller the k, the more likely it is to be overfitted -just care about boundaries between class changes

Discretization/Binning

-transform a continuous attribute into a categorical/nominal attribute (or to a binary attribute for binarization) -idea that outlier goes away

Post-Pruning

-trim the fully-grown tree from the bottom up -keep aside a part of the dataset for post-pruning of the tree -this is different from the training or test sets and known as pruning or validation set -more successful generally, but more expensive -need validation set -> from within training data set (train on data then prune with validation)

Model Selection

Also known as: hyperparameter tuning -use validation error to determine when to stop training -once validation error begins increasing again, stop training -at some point, validation error will come back up (when overfitting occurs)

Precision & Recall

-typically only care about these in relation to the positive (rare) class
Precision(+) = TP/(TP+FP): fraction of records that actually are of class C, out of the records predicted to be of class C
-addresses the question: "Given a positive prediction from the classifier, how likely is it to be correct?"
-look down the columns (predicted classes)
-when we predict a specific class, how often is it right
Coverage(+) = TP/(TP+FN): fraction of correct predictions of class C, over all points in class C
-also known as: Recall, sensitivity, TPR
-addresses the question: "Given a positive example, will the classifier detect it?"
-look across the rows (actual classes)
-can be done with multiple classes
Tradeoff:
-you can get high recall by predicting everything as +, but precision will drop
-there is often a tradeoff between precision and recall
-want them both to be high
With cross-validation: add up the counts across all folds for the final numbers (the confusion matrix only uses the test set, so combining all test sets covers the entire data set) (see the sketch below)
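
A quick sketch computing precision, recall, and the F-measure for the positive class from hypothetical counts (the F-measure here is the standard harmonic mean of the two):

```python
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)    # of predicted positives, how many are right
recall = tp / (tp + fn)       # of actual positives, how many were found
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, f_measure)   # 0.75, 0.6, ~0.667
```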

Class-Conditional Probability

-use the independence assumption to simplify: P(X | C) = P(X1, X2, ..., Xn | C) = P(X1 | C) * P(X2 | C) * ... * P(Xn | C)

Naive Bayes

-uses Bayes Theorem and a naive assumption -assume the input features are statistically independent from one another -value of 1 feature doesn't affect value of other

Data Science Process

1) Input Data 2) Data Preprocessing (clean, explore, prep data) 3) Data Mining (Modeling) 4) Postprocessing (Analyze) 5) Information -Can cycle backwards 4->3->2 and back forward

ROC Curves (Receiver Operating Characteristic)

1) Sort the data by the predicted probability of a specific class label 2) Select a cutoff threshold for the "Yes" class 3) Calculate the TPR (sensitivity) and FPR (1-specificity), which becomes a point on the ROC curve 4) Adjust the threshold for the "Yes" class, recalculate, and find a new point -the threshold for the + class can be set lower than 50% -how much FPR are you willing to accept for a better TPR -plots the tradeoff between TPR and FPR -check all thresholds from 0 to 100% -the 0% threshold is always the top right (1,1) and 100% is always the opposite (every ROC curve includes these points) -we want high TPR and low FPR (top left) -how much does the curve hug the upper-left corner? -can be used to compare classifiers -if the curve is the diagonal line, the classifier is no better than random guessing (see the sketch below)
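
A minimal sketch using scikit-learn's roc_curve and roc_auc_score (the labels and predicted probabilities are made up):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # predicted P(class = "Yes")

fpr, tpr, thresholds = roc_curve(y_true, y_scores)     # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_scores))
```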

AI Model Phases

2 phases: Training Phase & Testing Phase

Attributes

Also known as: Features, Variables -Specific aspects of an object -Usually columns in a table

Objects

Also known as: Records, Data Points -usually a row in a table Example: person with certain attributes

Data Quality

Issues: -Missing Data -Inconsistent Data (try to detect as best as can) (similar solutions to missing data) -Duplicate Data (hard to detect) -Biased Values -Bad quality data can lead to wrong results

Univariate Analysis - Continuous Variables

Measures of Central Tendency: -Mean -Median -Mode Measures of Dispersion: -Range -Variance -Standard Deviation (how spread out from average) -Quartiles -Inter-Quartile Range -for example, can have difference in distribution of 2 datasets with same measures of central tendency -Distributions -Box Plots & Quartiles (outlier is larger than Q3 or smaller than Q1 by at least 1.5 times the inter-quartile range)

OSEMN

Obtain (be able to get data)
Scrub (clean the data into what you need) (can take 70% of the time or more)
Explore (get to know the data)
Model (build models)
iNterpret (what can I do with this?)
-a taxonomy of tasks that data scientists should feel comfortable working on

Nested Cross Validation

Outer fold: -split into training and testing data Inner loop: -create more folds within the training data (training and validation) -for each inner fold, try each max depth and find the validation error -then select the best max depth (here we don't restart with the average, but rather pick the best) -build a model on all of the training data using this max depth Outer again: -run the test data through this model to evaluate the error on the outer fold -repeat on all outer folds, each of which has its own inner loop -then take the average generalization error across the outer folds -build the final model with the best max depth found -inner loops are also usually k=5 or 10 (see the sketch below)
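
A hedged sketch of nested CV in scikit-learn, with GridSearchCV as the inner loop tuning max depth and cross_val_score as the outer loop (the dataset and depth grid are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 3, 4, 5, None]}, cv=5)

# each outer fold reruns the inner search on its training portion only
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())     # estimate of generalization accuracy for the process

# final model: rerun the inner search on all data and keep the best max depth
final_model = inner.fit(X, y).best_estimator_
```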

Sampling Algorithms: SMOTE

SMOTE = Synthetic Minority Over-Sampling Technique -for each minority instance C, find its nearest minority-class neighbor N -create a new minority-class instance R by starting from C's features and moving a random fraction of the way toward N: R.features = C.features + (N.features - C.features) * rand(0,1) -generates fake records in the minority class that are similar to the original ones (see the sketch below)
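
A minimal sketch generating one SMOTE-style synthetic record (the feature vectors C and N are made up):

```python
import random

C = [2.0, 5.0]          # a minority-class instance
N = [3.0, 4.0]          # its nearest minority-class neighbor

lam = random.random()   # rand(0, 1)
R = [c + (n - c) * lam for c, n in zip(C, N)]   # a point on the segment between C and N
print(R)
```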

Error Rate

-fraction of incorrect predictions on the testing set -probability of misclassification

