Data Mining Test 1 (Post Midterm)
Gradient Boosting:
-In each stage, introduce a weak learner to compensate for the shortcomings of existing weak learners. -*In Gradient Boosting, "shortcomings" are identified by gradients.* -Recall that, in AdaBoost, "shortcomings" are identified by high-weight data points. -Both high-weight data points and gradients tell us how to improve our model.
The XOR Problem
Can you implement the XOR gate with the perceptron model? ---The quick answer is NO. Why? ---a perceptron is a linear classifier. Can you draw a single (straight) line to separate the green and red dots?
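Sketch (not from the slides): the perceptron learning rule run on AND and on XOR. The function name, the learning rate, and the epoch limit are assumptions for illustration. AND is linearly separable, so training reaches an epoch with no mistakes; XOR is not, so it never does.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Return True if the perceptron rule reaches an epoch with zero mistakes."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in data:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0  # linear threshold unit
            delta = target - pred
            if delta != 0:
                mistakes += 1
                w1 += lr * delta * x1
                w2 += lr * delta * x2
                b += lr * delta
        if mistakes == 0:
            return True  # a single straight line separates the classes
    return False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND)
xor_converged = train_perceptron(XOR)
```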
Distance for Numeric Data
Euclidean and Manhattan. See Word doc for equations
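Sketch of the two distances using their standard definitions (the Word doc equations are not reproduced here, so treat this as an assumption about what they show): Euclidean is the square root of the summed squared differences, Manhattan is the sum of absolute differences.

```python
import math

def euclidean(a, b):
    # Straight-line distance: sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```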
False Positive Rate =
FP/(FP+TN) FP+TN = total number of negatives AKA False Alarm Rate Want FPR to = 0
Network Architecture?
Feed-forward networks ❖Signals can travel one way only: from input to output. There are no feedback loops. Feedback (recurrent) networks ❖Signals can travel in both directions by introducing loops in the network. ❖Usually designed for language, videos, etc.
Gradient Boosting Cont.
Gradient descent: Minimize a function by moving in the opposite direction of the gradient. For regression with square loss, ---residual ⇔ negative gradient ---fit h to residual ⇔ fit h to negative gradient ---update F based on residual ⇔ update F based on negative gradient So we are actually updating our model using gradient descent!
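Sketch (not from the slides) of the residual-fitting loop for regression with square loss. The weak learner here is a one-threshold regression stump on 1-D inputs; the stump, the learning rate, the number of stages, and the toy data are all assumptions chosen to keep the example short.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: one threshold, two constants."""
    best = None
    for t in sorted(set(xs))[:-1]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages=20, lr=0.5):
    f0 = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        # For loss 1/2 * (y - F)^2 the negative gradient is the residual y - F
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)  # fit h to the residuals
        stumps.append(h)
        preds = [p + lr * h(x) for p, x in zip(preds, xs)]  # update F
    return lambda x: f0 + lr * sum(h(x) for h in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
```

Each stage only has to explain what the earlier stages got wrong, which is why the training error of the boosted model ends up below that of the constant baseline.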
Similarity and Distance
The higher the similarity, the smaller the distance between the points
Parameter Learning: Backpropagation Algorithm
Key idea: adjust the weights using gradient descent so that the error is minimized. Based on the error of the output, you calculate the adjustment for the weights of the hidden layer. See slide no. 18
Cumulative Response Curve
Percentage of Positives Targeted (Y) v. Percentage of Test Instances (X) Basically: Percentage of population that is targeted (predicted as positive)
Learning Curve
Performance v Training Size X axis = # of training instances Y axis = Accuracy/performance of holdout/validation set Determines: need more instances or switch to a better model? v. Fitting curve = complexity v. error
Nearest Neighbors for Predictive Modeling
Predictive modeling: examples in a common region of the space should be similar and get the same prediction. Predictive modeling with similarity: ❖Given a new example, we find similar ones in the training examples and predict the new example's target value based on the nearest neighbors' (known) target values. Similar examples are those that have small distances to the new example.
Q1: in SVM, we can remove some points and only leave the support vectors. Can you remove any points in a k-NN model?
Q1: No, we need all of the points. k-NN does not learn a model; for each classification we identify the nearest neighbors among the stored points.
Q2: What is the training error of a 1-NN model?
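Q2: The training error is 0. Each training point is its own nearest neighbor (assuming no duplicate points with conflicting labels).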
Ensembles: Approaches: Voting
Random Forest, Boosting
Finding the Best K: Cross-validation
•Split the data into training and test sets. •Split the training set into 5 (or 10, or another number of) folds. Each time, use 4 folds to classify the remaining fold. Repeat 5 times. •The best value of k is defined to be the one that results in the smallest average error rate. •Recombine the 5 folds back into one dataset. This value of k and this dataset are then used to classify the test data
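Sketch of the fold loop above, assuming the train/test split has already been done and `train` is the training portion. The helper names, the toy 1-D data, and the candidate values of k are made up for illustration.

```python
from collections import Counter

def knn_predict(train, query, k):
    # train is a list of (features, label); use squared Euclidean distance for ranking
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_error(train, k, n_folds=5):
    folds = [train[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Use the other folds to classify the held-out fold
        rest = [row for j, fold in enumerate(folds) if j != i for row in fold]
        wrong = sum(knn_predict(rest, x, k) != y for x, y in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / n_folds  # average error rate over the folds

# Toy training set: class "a" below 10, class "b" from 10 up, with one noisy label
train = [((float(i),), "a" if i < 10 else "b") for i in range(20)]
train[3] = ((3.0,), "b")

candidates = [1, 3, 5]
errors = {k: cv_error(train, k) for k in candidates}
chosen_k = min(candidates, key=lambda k: errors[k])
# chosen_k and the full training set would then be used to classify the test data
```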
Ensemble:
❖A learning algorithm composed of a set of base learners. The base learners may be organized in some structure ❖The base learners cannot be too correlated (want them to be independent)
Why is ANN taking off now?
❖ANN was invented in the 80s, why is it taking off now? ❖Much more data
Wrapping up ROC
❖An ROC graph encapsulates all information contained in the confusion matrix ❖ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified ❖ROC graphs decouple classifier performance from the conditions under which the classifiers will be used (e.g., no cost information is encoded into a ROC)
Ensemble Methods: Definition
❖An ensemble is a set of classifiers whose individual decisions are combined in some way to classify new examples
Random Forest
❖An ensemble model using many decision tree models ❖Each tree is built using a random subset of observations with a random subset of features ❖At each node: choose some small subset of variables at random and find the best split (Gini index) within these variables ❖When a new input is entered into the system, it is run down all of the trees. Combine the results of all of the terminal nodes that are reached. ❖Can be used for classification (majority vote) or regression (average or weighted average)
Activation Function at a Neuron
❖Artificial neuron models are simplified models based on biological neurons. We usually refer to these artificial neurons as 'perceptrons'. ❖Each neuron is a processing unit where every input has an associated weight which modifies the strength of each input. The neuron simply adds together all the inputs and calculates an output to be passed on.
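Sketch of a single neuron as described above: weight each input, add everything up, and pass the sum through an activation function. The bias term and the two activation functions shown are assumptions, since the notes do not specify them.

```python
import math

def step(z):
    return 1 if z > 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias, then the activation function
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# With these hand-picked weights the step neuron behaves like an AND gate
and_11 = neuron([1, 1], [1.0, 1.0], -1.5, step)
and_10 = neuron([1, 0], [1.0, 1.0], -1.5, step)
soft_output = neuron([0, 0], [1.0, 1.0], 0.0, sigmoid)
```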
Working of Adaboost Classifier
❖At each iteration, weights for each observation will be updated ❖*Weight will be increased for incorrectly classified observation and reduced for correctly classified observation.* ❖The final decision is determined by weighted vote of all base learners where the weights are determined by the error rate for each base learner (higher error rate leads to a lower weight)
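Sketch of one reweighting round. The notes do not give the formulas, so this assumes the standard discrete AdaBoost update: learner weight alpha = 0.5 * ln((1 - err) / err), observation weights multiplied by exp(alpha) if misclassified and exp(-alpha) if correct, then renormalized.

```python
import math

def adaboost_round(weights, correct):
    """weights sum to 1; correct[i] says whether the base learner got example i right."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error rate
    # Requires 0 < err < 1; a higher error rate gives a lower alpha
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
```

In this toy round the single misclassified observation ends up carrying half of the total weight, so the next base learner is pushed to get it right.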
Performance Matrix: ROC (Receiver Operating Characteristics)
❖Confusion matrices and related metrics evaluate classification results from a single cut-off value ❖ROC evaluates the "ranking" of instances, equivalent to evaluating every cut-off value
AUC Interpretation
❖Equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance by the model -Single number -Being close to the diagonal line is bad (random)—want to be next to top left corner -Calculate area of ROC •*Ideal: AUC = 1* •Random: AUC = 0.5
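Sketch of the probability interpretation: compare every positive with every negative and count how often the positive gets the higher score. Counting ties as half a win is a common convention assumed here; the scores and labels are made up.

```python
def auc_by_pairs(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Fraction of (positive, negative) pairs where the positive is ranked ahead
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect_auc = auc_by_pairs([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed_auc = auc_by_pairs([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```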
AdaBoost
❖Examples are given weights. ❖At each iteration, the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong. ❖*A new base learner is added which must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.* ❖During testing, each of the T base learners gets a weighted vote proportional to its accuracy on the training data. *Gives more importance to incorrectly classified examples (vs. random forest, which does not reweight them)*
The Wisdom of Crowds:
❖Guess the weight of an ox ❖Average of people's votes close to true weight ❖Better than most individual members' votes *diverse, independent and decentralized: important characteristics of the crowd*
k-NN: How to Draw a Decision Boundary
❖No explicit boundary is created, but there are implicit regions created by instance neighborhoods. These regions can be calculated by systematically probing points in the instance space, determining each point's classification, and constructing the boundary where classifications change.
Why not use ROC?
❖Not the most intuitive visualization for many business stakeholders => Consider visualization frameworks that might not have all the nice properties of ROC curves, but are more intuitive.
Structure Learning
❖Number of layers ❖Number of nodes per layer ❖Type of activation function ❖Other learning parameters
Restated:
❖Repeat process (sweep) for all training pairs ❖Present data ❖Calculate error ❖Backpropagate error ❖Adjust weights ❖Repeat process multiple times
Evaluating a ranking classifier
❖Sort the test set by some score in decreasing/increasing order ❖score could be f(x) or P+, etc ❖Apply threshold at each unique value of score ❖Compute a confusion matrix Calculate TPR and FPR See slides for practice problem i.e. First: cut-off is one, predict all negatives: ---Decrease the threshold move right ---Increase the threshold move left
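Sketch of the threshold sweep: start with a cut-off above every score (everything predicted negative), then lower it one unique score at a time and record (FPR, TPR). The scores and labels are made up, and the function assumes at least one positive and one negative.

```python
def roc_points(scores, labels):
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above all scores: predict all negatives
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when score >= threshold, then fill in the confusion counts
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points

points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Each time the threshold decreases, the point moves up (a positive was picked up) or right (a negative was picked up), ending at (1, 1).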
AUC (Area Under the ROC Curve)
❖The AUC is useful when a single number is needed to summarize the performance, or when nothing is known about the operating conditions
Boosting
❖Train classifiers (e.g. decision trees) in a sequence. ❖A new classifier should focus on those cases which were incorrectly classified in the last round. ❖Representative models: AdaBoost, Gradient boosting
Distance for Categorical Data
❖Treats the two objects as sets of characteristics (categorical attributes) ❖The proportion of all the characteristics (that either has) that are shared by the two ❖Appropriate for problems where the possession of a common characteristic between two items is important
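Sketch of the measure described above, which matches what is commonly called Jaccard similarity (the name is not in the notes): shared characteristics divided by all characteristics that either object has. The example sets are made up, and returning 1.0 for two empty sets is an assumed convention.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

customer_1 = {"sports", "travel", "cooking"}
customer_2 = {"travel", "cooking", "music"}
similarity = jaccard_similarity(customer_1, customer_2)
```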
Ensemble Methods: Why
❖When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced. ❖Less overfitting because their biases will cancel out. Build lots of different models and combine them into one super model. Collection of experts instead of only one. Ask them and combine their predictions (The wisdom of crowds)
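Sketch of why voting helps, under the idealized assumption that the n base learners are fully independent and each is correct with the same probability p. The majority vote is then correct whenever more than half of them are, which is a binomial tail probability. The function name and the numbers are illustrative only.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that a majority of n independent voters (n odd) is correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(need, n + 1))

single = majority_vote_accuracy(1, 0.6)
three = majority_vote_accuracy(3, 0.6)
eleven = majority_vote_accuracy(11, 0.6)
```

Real base learners are never fully independent, so the gain in practice is smaller than this calculation suggests.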
k-NN Summary
❖Whenever we have a new point to predict, we find its K nearest neighbors from the training data. ❖Produces nonlinear decision boundaries ❖No model is built - simply looking up the nearest neighbors ❖Trivial to train (no training at all except figuring out K), slow to classify ❖Memory and CPU cost ❖Need to re-scale attributes
Decision Boundary of K Nearest Neighbors
❖k is a complexity parameter
Disadvantages
❖long training time ❖difficult to understand the learned function (weights) ❖not easy to incorporate domain knowledge
Advantages
❖prediction accuracy is generally high ❖robust, works when training examples contain errors ❖output may be discrete, real-valued, or a vector of several discrete or real-valued attributes ❖fast evaluation of the learned target function
Issues with Nearest-Neighbor Methods
1.Computation Efficiency ❖Training is fast because it usually involves only storing the instances, however: ❖Need a lot of space to store all examples ❖Takes more time to classify a new example ❖Compare with Decision Trees, SVM, and Logistic Regressions? 2.Difficult to explain the "knowledge" that has been mined from the data
Issues with Nearest-Neighbor Methods (cont)
3.Dimensionality (1)Dominance of attributes ➡Solution: rescaling data to the same range or same distribution (2)Having too many/irrelevant attributes may confuse distance calculations -> curse of dimensionality ❖For example: each instance is described by 100 attributes out of which only 10 are relevant in determining the target variable. In this case, instances that have identical values for the 10 relevant attributes may nevertheless be distant from one another in the 100 dimensional instance space. ➡Solution: Feature selection
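Sketch of attribute dominance and the rescaling fix, using min-max scaling to [0, 1] (one of several possible rescalings). The (age, income) rows are made up: before scaling, income dominates the distance; after scaling, the nearest neighbor of the first row changes.

```python
def min_max_scale(rows):
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0], excluding rows[0] itself."""
    first = rows[0]
    def dist(r):
        return sum((a - b) ** 2 for a, b in zip(first, r)) ** 0.5
    return min(range(1, len(rows)), key=lambda i: dist(rows[i]))

# Columns: age in years, income in dollars
rows = [[25, 50000], [27, 58000], [60, 51000], [40, 90000]]
raw_nearest = nearest_to_first(rows)                    # income differences dominate
scaled_nearest = nearest_to_first(min_max_scale(rows))  # both attributes count
```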
If a classifier randomly predicts positive and negative, what will TPR and FPR look like?
TPR = FPR, e.g. 50% and 50% for a fair coin flip. On graph: a point on the straight diagonal line through the middle
Deep Learning
Adding a lot more layers to learn more complexities low level feature-->mid-level feature-->high level feature-->Trainable classifier So multiple layers make sense
Base Learners
Arbitrary learning algorithm which could be used on its own Want base learners to be independent
What do (0,0), (1,1), (0,1) and diagonal line represent, respectively?
Assuming x = FPR and y = TPR ❖(0,0): declare everything to be negative class ❖(1,1): declare everything to be positive class ❖(0,1): ideal - perfect classifier (all true negs or pos) ❖Diagonal line: random guessing and the probability of predicting positive is the proportion of positive class ❖Below diagonal line: doing worse than random See pic
How to use ANN for modeling?
Data is presented to the network in the form of activations in the input layer Examples ❖Images: Pixel intensity RGB channels ❖Text: words represented as vectors ❖Structured: each input node takes a feature How to represent more abstract data, e.g. a name? ❖Choose a pattern, e.g. ❖0-0-1 for "Chris" 0-1-0 for "Becky"
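Sketch of the name pattern above as a one-hot encoding: one input node per possible value, with a 1 at the position of the value being presented. The third name in the vocabulary is hypothetical, added only so the positions match the 0-0-1 and 0-1-0 patterns in the notes.

```python
def one_hot(value, vocabulary):
    # One input node per known value; exactly one node is switched on
    return [1 if v == value else 0 for v in vocabulary]

names = ["Alex", "Becky", "Chris"]  # "Alex" is a made-up placeholder
chris = one_hot("Chris", names)
becky = one_hot("Becky", names)
```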
Intuition of Gradient Descent
Define a Learning objective and parameters to be learned Example: training a linear regressor, the objective is some equation that's on the slide
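Sketch of gradient descent for a linear regressor y = w*x + b. The slide's objective is not reproduced in these notes, so this assumes mean squared error as the objective; the learning rate, step count, and data are also assumptions.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    w = b = 0.0  # parameters to be learned
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move in the opposite direction of the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```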
Artificial Neural Networks
Information flow ❖ Data is presented to Input layer ❖ Passed on to Hidden Layer (could have multiple hidden layers) ❖ Passed on to Output layer Information processing is parallel ❖Process information much more like the brain than a serial computer
ANN Learning
It's a process by which an ANN adapts itself to a stimulus by making proper parameter adjustments, resulting in the production of desired response. Two kinds of learning ❖Structure Learning:- change in network structure ❖Parameter learning:- connection weights are updated, back propagation algorithm
k-NN
Nearest neighbor algorithm that uses k neighbors ❖k is a complexity parameter: the greater k is, the more the estimates are smoothed out among neighbors Q: What happens if you choose k = n, where n is the size of the entire training set? A: Every new point gets the same prediction, the majority class of the training set (with unweighted voting) ❖Output of k-NN: k and all training data *k-NN is not a model itself*
See practice on slides
No. 13
Gradient Boosting: Intuitions
See Lecture 16 Slides
Demonstration of Backpropagation Algorithm
See slides but also: 1. Initialize with random weights 2. Present a training pattern 3. Adjust weights based on error 4. Repeat many times, each time taking a random training instance and making slight weight adjustments. Algorithms for weight adjustments are designed to make changes that will reduce the error
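Sketch of one backpropagation update on a tiny 2-2-1 network (not the slide's example). Sigmoid activations, the loss 1/2 * (output - target)^2, the fixed starting weights, and the learning rate are all assumptions. The output error is computed first, then passed back to get the adjustment for the hidden-layer weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    # w_hidden: two rows of [w1, w2, bias]; w_out: [v1, v2, bias]
    h = [sigmoid(row[0] * x[0] + row[1] * x[1] + row[2]) for row in w_hidden]
    o = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, o

def loss(w_hidden, w_out, x, t):
    return 0.5 * (forward(w_hidden, w_out, x)[1] - t) ** 2

def backprop(w_hidden, w_out, x, t):
    h, o = forward(w_hidden, w_out, x)
    delta_o = (o - t) * o * (1 - o)  # error signal at the output unit
    grad_out = [delta_o * h[0], delta_o * h[1], delta_o]
    grad_hidden = []
    for j in range(2):
        # Output error passed back through w_out[j] to hidden unit j
        delta_h = delta_o * w_out[j] * h[j] * (1 - h[j])
        grad_hidden.append([delta_h * x[0], delta_h * x[1], delta_h])
    return grad_hidden, grad_out

def train_step(w_hidden, w_out, x, t, lr=0.5):
    grad_hidden, grad_out = backprop(w_hidden, w_out, x, t)
    new_hidden = [
        [w - lr * g for w, g in zip(row, g_row)]
        for row, g_row in zip(w_hidden, grad_hidden)
    ]
    new_out = [w - lr * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out

w_hidden = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]
w_out = [0.7, -0.6, 0.05]
x, t = [1.0, 0.0], 1.0

grad_hidden, grad_out = backprop(w_hidden, w_out, x, t)
loss_before = loss(w_hidden, w_out, x, t)
new_hidden, new_out = train_step(w_hidden, w_out, x, t)
loss_after = loss(new_hidden, new_out, x, t)
```

A full training run would repeat `train_step` over many randomly chosen training patterns, starting from random rather than fixed weights.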
Using ROC for Model Comparison
See word doc
Lecture 17: ANN
see pic in word doc
Similarity
Similarity is at the core of many data mining methods ❖If two things are similar in some ways, they often share other characteristics as well ❖eg. similarity for classification and regression ❖kernels in SVM ❖Group similar items together into clusters ❖unsupervised segmentation
Strengths of k-NN model
Simple to implement and use Robust to noisy data Comprehensible - easy to explain prediction ---e.g., Amazon: "customers with similar searches also bought..."
Q: Is a smaller or greater k likely to overfit the data (i.e., has an overly complicated decision boundary)?
Smaller k = more likely to overfit. Larger k = more stable; a bigger k is robust to noise/outliers
What kind of problems can ANN solve?
Supervised learning: Structured data Unstructured data: audio, visuals, text
True Positive Rate =
TP/(TP+FN) TP+FN = total number of positives AKA recall Want TPR to = 1
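Sketch computing TPR together with the false positive rate FP/(FP+TN) from confusion-matrix counts. The counts are made up.

```python
def rates(tp, fp, tn, fn):
    tpr = tp / (tp + fn)  # TP + FN = all actual positives (recall)
    fpr = fp / (fp + tn)  # FP + TN = all actual negatives (false alarm rate)
    return tpr, fpr

tpr, fpr = rates(tp=40, fp=10, tn=90, fn=10)
```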
If the classifier is not random but carefully trained, so that it makes meaningful predictions, what will TPR and FPR look like?
TPR closer to 1, FPR closer to 0
Nearest Neighbors for Classification
Things we need to consider ❖How many neighbors? (k = ?) ❖How to combine their labels? (e.g., majority vote) ❖Vary the influence of neighbors based on distance?
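Sketch covering the three choices above: k, majority vote, and optional distance weighting. Weighting each vote by 1/distance is one common scheme, assumed here; the training points and labels are made up.

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=False):
    # train is a list of (features, label); rank by Euclidean distance to the query
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = defaultdict(float)
    for features, label in by_distance[:k]:
        d = math.dist(features, query)
        # Closer neighbors get a larger vote when weighted; otherwise one vote each
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

train = [
    ((1, 1), "yes"), ((1, 2), "yes"),
    ((5, 5), "no"), ((6, 5), "no"), ((5, 6), "no"),
]
query = (2, 2)
vote_k3 = knn_classify(train, query, k=3)
vote_k5 = knn_classify(train, query, k=5)
weighted_k5 = knn_classify(train, query, k=5, weighted=True)
```

With k = 5 the plain vote is swamped by the three distant "no" points, while the distance-weighted vote still follows the two close "yes" neighbors.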
ROC Measures:
True Positive Rate v. False Positive Rate
Distances
Typically, distances between data points are used for the determination of similarity Some business cases ❖Predict behavior of a new customer ❖Reasoning from similar cases (medicine, law)
Practice: Which of the following statements does NOT hold true?
a.Activation functions are threshold functions. b.Both forward and backward propagation take place during the training process of a neural network. c.Most of the data preprocessing is carried out in the hidden layers. d.Error is calculated at each layer of the neural network.
Extreme Case
all base learners are the same
Finding the Best K: good rule of thumb
k = square root of n
Ex
looking at nearest neighbors.... how many are yes's and how many are no's? Pick the larger number
ANN: same idea as logistic regression
See pic; the decision boundary would just be a straight line