Data Mining Test 1 (Post Midterm)

Gradient Boosting:

-In each stage, introduce a weak learner to compensate for the shortcomings of existing weak learners. -*In Gradient Boosting, "shortcomings" are identified by gradients.* -Recall that, in AdaBoost, "shortcomings" are identified by high-weight data points. -Both high-weight data points and gradients tell us how to improve our model.

The XOR Problem

Can you implement the XOR gate with the perceptron model? ---The quick answer is NO. Why? ---a perceptron is a linear classifier. Can you draw a single (straight) line to separate the green and red dots?
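
A minimal sketch of why this fails, assuming a single perceptron with a step activation trained by the classic perceptron update rule. The function name `train_perceptron`, the learning rate, and the epoch limit are illustrative choices, not from the course material. The same code reaches zero errors on AND (linearly separable) but can never do so on XOR.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Train one perceptron; return (weights, bias, errors in the last epoch run)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # step activation
            update = lr * (target - pred)
            if update != 0:                      # misclassified: nudge the line
                w += update * xi
                b += update
                errors += 1
        if errors == 0:                          # a separating line was found
            break
    return w, b, errors

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

_, _, and_errors = train_perceptron(X, y_and)
_, _, xor_errors = train_perceptron(X, y_xor)
print(and_errors, xor_errors)
```

An epoch with zero errors would mean a single straight line separates the classes, which is impossible for XOR, so `xor_errors` stays above zero no matter how long training runs.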

Distance for Numeric Data

Euclidean and Manhattan. See word doc for equations.
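
The equations referenced are not reproduced in these notes. As a sketch using the standard textbook definitions (Euclidean: square root of the sum of squared differences; Manhattan: sum of absolute differences), with illustrative function names:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences ("city block" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```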

False Positive Rate =

FP/(FP+TN), where FP+TN = total number of actual negatives. AKA False Alarm Rate. Want FPR to = 0

Network Architecture?

Feed-forward networks ❖Signals can travel one way only, from input to output. There are no feedback loops. Feedback (recurrent) networks ❖Signals can travel in both directions by introducing loops in the network. ❖Usually designed for language, videos, etc.

Gradient Boosting Cont.

Gradient descent: Minimize a function by moving in the opposite direction of the gradient. For regression with square loss, ---residual ⇔ negative gradient ---fit h to residual ⇔ fit h to negative gradient ---update F based on residual ⇔ update F based on negative gradient So we are actually updating our model using gradient descent!
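
A rough sketch of the residual-fitting loop for square loss, assuming one-split regression stumps as the weak learner h, a 1-D feature, and a small learning rate. These choices, and the names `fit_stump` and `gradient_boost`, are illustrative and not from the lecture.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:                  # candidate split thresholds
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def gradient_boost(x, y, n_stages=50, lr=0.1):
    F = np.full_like(y, y.mean(), dtype=float)   # initial model: predict the mean
    losses = []
    for _ in range(n_stages):
        residual = y - F                         # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)               # fit h to the residual
        F = F + lr * h(x)                        # update F based on the residual
        losses.append(float(np.mean((y - F) ** 2)))
    return F, losses

x = np.linspace(0, 6, 40)
y = np.sin(x)
F, losses = gradient_boost(x, y)
print(losses[0], losses[-1])
```

Each stage takes a small step in the direction of the negative gradient, so the training loss shrinks from stage to stage.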

Similarity and Distance

The higher the similarity, the closer the points are in distance

Parameter Learning: Backpropagation Algorithm

Key idea: adjust the weights using gradient descent so that the error will be minimal. Based on the error of the output, you calculate the adjustment for the weights of the hidden layer. See slide no. 18

Cumulative Response Curve

Percentage of Positives Targeted (Y) v. Percentage of Test Instances (X) Basically: Percentage of population that is targeted (predicted as positive)
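
As a sketch of how the curve's points could be computed, assuming instances are ranked by a model score from highest to lowest and labels are 1 for positive and 0 for negative (the function name and toy data are illustrative):

```python
def cumulative_response(scores, labels):
    """Return (fraction of instances targeted, fraction of positives captured) points."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points, captured = [(0.0, 0.0)], 0
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label                        # positives found in the top i instances
        points.append((i / len(ranked), captured / total_pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
curve = cumulative_response(scores, labels)
print(curve)
```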

Learning Curve

Performance v. Training Size. X axis = # of training instances. Y axis = accuracy/performance on the holdout/validation set. Determines: need more instances or switch to a better model? Compare with a fitting curve = complexity v. error

Nearest Neighbors for Predictive Modeling

Predictive modeling: examples in a common region of space should be similar and get the same prediction. Predictive modeling with similarity: ❖Given a new example, we find similar ones in the training examples and predict the new example's target value based on the nearest neighbors' (known) target values. Similar examples are those that have small distances to the new example.

Q1: in SVM, we can remove some points and only leave the support vectors. Can you remove any points in a k-NN model?

Q1: No, we need all of the points. We are not learning any model; with each classification we are just identifying the nearest neighbors

Q2: What is the training error of a 1-NN model?

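Q2: The training error is 0, because each training point is its own nearest neighbor (assuming no duplicate points with conflicting labels)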

Ensembles: Approaches: Voting

Random Forest, Boosting

Finding the Best K: Cross-validation

•Split the data into training and test sets. •Split the training set into 5 (or 10, or another number of) folds. Each time, use 4 folds to classify the remaining fold. Repeat 5 times. •The best value of k is defined to be the one that results in the smallest average error rate. •Recombine the 5 folds back into one dataset. This value of k and this dataset are then used to classify the test data
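
The fold procedure can be sketched as follows, assuming binary 0/1 labels, Euclidean distance, and a majority vote. The helper names, candidate k values, and synthetic data are illustrative, and only the training-set part of the procedure is shown.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority-vote k-NN with Euclidean distance (labels are 0/1, k odd)."""
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(int(y_train[nearest].mean() > 0.5))
    return np.array(preds)

def best_k_by_cv(X, y, candidate_ks, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    avg_error = {}
    for k in candidate_ks:
        errors = []
        for i in range(n_folds):
            val = folds[i]                       # the held-out fold
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            pred = knn_predict(X[train], y[train], X[val], k)
            errors.append(np.mean(pred != y[val]))
        avg_error[k] = float(np.mean(errors))
    return min(avg_error, key=avg_error.get), avg_error

# Synthetic training set: two Gaussian blobs (hypothetical data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

best_k, avg_error = best_k_by_cv(X, y, candidate_ks=[1, 3, 5, 7, 9])
print(best_k, avg_error)
```

The chosen `best_k` would then be used with the full training set to classify the separate test set.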

Ensemble:

❖A learning algorithm composed of a set of base learners. The base learners may be organized in some structure ❖The base learners cannot be too correlated (want them to be independent)

Why is ANN taking off now?

❖ANN was invented in the 80s, why is it taking off now? ❖Much more data

Wrapping up ROC

❖An ROC graph encapsulates all information contained in the confusion matrix ❖ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified ❖ROC graphs decouple classifier performance from the conditions under which the classifiers will be used (e.g., no cost information is encoded into a ROC)

Ensemble Methods: Definition

❖An ensemble is a set of classifiers whose individual decisions are combined in some way to classify new examples

Random Forest

❖An ensemble model using many decision tree models ❖Each tree is built using a random subset of observations with a random subset of features ❖At each node: choose some small subset of variables at random and find the best split (Gini index) within these variables ❖When a new input is entered into the system, it is run down all of the trees. Combine the results of all of the terminal nodes that are reached. ❖Can be used for classification (majority vote) or regression (average or weighted average)

Activation Function at a Neuron

❖Artificial neuron models are simplified models based on biological neurons. We usually refer to these artificial neurons as 'perceptrons'. ❖Each neuron is a processing unit where every input has an associated weight which modifies the strength of each input. The neuron simply adds together all the inputs and calculates an output to be passed on.
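
A small sketch of one neuron, assuming a bias term and two common activation choices (a step function and a sigmoid). The bias and the specific activations are assumptions added for illustration.

```python
import math

def step(z):
    """Threshold activation: fire (1) only if the weighted sum is positive."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Smooth activation that squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weight each input, add everything together, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

print(neuron([1, 0], [0.5, 0.5], -0.2, step))     # weighted sum 0.3 is positive -> 1
print(neuron([0, 0], [0.5, 0.5], 0.0, sigmoid))   # weighted sum 0 -> 0.5
```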

Working of Adaboost Classifier

❖At each iteration, weights for each observation will be updated ❖*Weight will be increased for incorrectly classified observation and reduced for correctly classified observation.* ❖The final decision is determined by weighted vote of all base learners where the weights are determined by the error rate for each base learner (higher error rate leads to a lower weight)
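
One round of the reweighting can be sketched as below. The notes do not give the exact formulas, so this assumes the standard AdaBoost formulation: learner weight alpha = 0.5 * ln((1 - err) / err), where err is the weighted error rate and 0 < err < 1. The function name is illustrative.

```python
import math

def adaboost_round(weights, correct):
    """Update observation weights after one base learner.

    weights: current observation weights (summing to 1)
    correct: True where the base learner classified the observation correctly
    """
    err = sum(w for w, c in zip(weights, correct) if not c)   # weighted error rate
    alpha = 0.5 * math.log((1 - err) / err)                   # higher error -> lower vote
    updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha                # renormalize to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, alpha = adaboost_round(weights, correct)
print(new_weights, alpha)
```

In this toy case the single misclassified observation ends up with weight 0.5, and the three correctly classified ones share the other 0.5.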

Performance Matrix: ROC (Receiver Operating Characteristics)

❖Confusion matrices and related metrics evaluate classification results from a single cut-off value ❖ROC evaluates the "ranking" of instances, equivalent to evaluating every cut-off value

AUC Interpretation

❖Equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance by the model -Single number -Being close to the diagonal line is bad (random)—want to be next to top left corner -Calculate area of ROC •*Ideal: AUC = 1* •Random: AUC = 0.5
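
The probability interpretation can be checked directly by comparing every positive/negative pair. This is a sketch with an illustrative function name; counting a tied pair as half a win is a common convention and an assumption here.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(auc_by_pairs(scores, [1, 1, 0, 1, 0]))  # 5 of 6 pairs ranked correctly
print(auc_by_pairs(scores, [1, 1, 1, 0, 0]))  # perfect ranking: 1.0
```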

AdaBoost

❖Examples are given weights. ❖At each iteration, the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong. ❖*A new base learner is added which must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting.* ❖During testing, each of the T base learners gets a weighted vote proportional to its accuracy on the training data. *Gives more importance to incorrectly classified examples (v. random forest, which does not give them more importance)*

The Wisdom of Crowds:

❖Guess the weight of an ox ❖Average of people's votes close to true weight ❖Better than most individual members' votes *diverse, independent and decentralized: important characteristics of the crowd*

k-NN: How to Draw a Decision Boundary

❖No explicit boundary is created, but there are implicit regions created by instance neighborhoods. These regions can be calculated by systematically probing points in the instance space, determining each point's classification, and constructing the boundary where classifications change.

Why not use ROC?

❖Not the most intuitive visualization for many business stakeholders => Consider visualization frameworks that might not have all the nice properties of ROC curves, but are more intuitive.

Structure Learning

❖Number of layers ❖Number of nodes per layer ❖Type of activation function ❖Other learning parameters

Restated:

❖Repeat process (sweep) for all training pairs ❖Present data ❖Calculate error ❖Backpropagate error ❖Adjust weights ❖Repeat process multiple times

Evaluating a ranking classifier

❖Sort the test set by some score in decreasing/increasing order ❖Score could be f(x) or P+, etc. ❖Apply a threshold at each unique value of the score ❖Compute a confusion matrix at each threshold and calculate TPR and FPR See slides for practice problem, i.e. First: cut-off is one, predict all negatives: ---Decrease the threshold: move right ---Increase the threshold: move left
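
The threshold sweep above can be sketched as follows, assuming higher scores mean "more likely positive", labels are 1/0, and an instance is predicted positive when its score is at or above the threshold (the function name and toy data are illustrative):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per unique score used as a threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]                        # threshold above every score: all negative
    for t in sorted(set(scores), reverse=True):  # lowering the threshold moves right
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / n_neg, tp / n_pos))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
```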

AUC (Area Under the ROC Curve)

❖The AUC is useful when a single number is needed to summarize the performance, or when nothing is known about the operating conditions

Boosting

❖Train classifiers (e.g. decision trees) in a sequence. ❖A new classifier should focus on those cases which were incorrectly classified in the last round. ❖Representative models: AdaBoost, Gradient boosting

Distance for Categorical Data

❖Treats the two objects as sets of characteristics (categorical attributes) ❖The proportion of all the characteristics (that either has) that are shared by the two ❖Appropriate for problems where the possession of a common characteristic between two items is important
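
This description matches what is usually called Jaccard similarity (shared characteristics divided by all characteristics either object has), with the distance taken as one minus the similarity. A sketch under that assumption, with hypothetical example data:

```python
def jaccard_similarity(a, b):
    """Shared characteristics divided by all characteristics either object has."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                               # convention for two empty sets
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

hotel_1 = {"wifi", "pool", "gym"}
hotel_2 = {"wifi", "gym", "spa"}
print(jaccard_similarity(hotel_1, hotel_2))  # 2 shared out of 4 total -> 0.5
print(jaccard_distance(hotel_1, hotel_2))    # 0.5
```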

Ensemble Methods: Why

❖When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced. ❖Less overfitting because their biases will cancel out. Build lots of different models and combine them into one super model. Collection of experts instead of only one. Ask them and combine their predictions (The wisdom of crowds)
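
A quick calculation of the "errors cancel out" effect, under the idealized assumption that the base learners are fully independent and each is correct with the same probability p (real base learners are rarely this independent). The function name is illustrative.

```python
from math import comb

def majority_vote_accuracy(n_learners, p):
    """P(majority is correct) for n independent learners, each correct with prob. p (n odd)."""
    need = n_learners // 2 + 1                   # votes required for a majority
    return sum(
        comb(n_learners, k) * p ** k * (1 - p) ** (n_learners - k)
        for k in range(need, n_learners + 1)
    )

for n in (1, 3, 21):
    print(n, majority_vote_accuracy(n, 0.6))
```

With p = 0.6, three learners already reach 0.648, and the accuracy keeps rising as more independent learners vote. With p below 0.5 the effect reverses, which is why each learner must beat random guessing.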

k-NN Summary

❖Whenever we have a new point to predict, we find its K nearest neighbors from the training data. ❖Produces nonlinear decision boundaries ❖No model is built - simply looking up the nearest neighbors ❖Trivial to train (no training at all except figuring out K), slow to classify ❖Memory and CPU cost ❖Need to re-scale attributes

Decision Boundary of K Nearest Neighbors

❖k is a complexity parameter

Disadvantages

❖long training time ❖difficult to understand the learned function (weights) ❖not easy to incorporate domain knowledge

Advantages

❖prediction accuracy is generally high ❖robust, works when training examples contain errors ❖output may be discrete, real-valued, or a vector of several discrete or real-valued attributes ❖fast evaluation of the learned target function

Issues with Nearest-Neighbor Methods

1.Computation Efficiency ❖Training is fast because it usually involves only storing the instances, however: ❖Need a lot of space to store all examples ❖Takes more time to classify a new example ❖Compare with Decision Trees, SVM, and Logistic Regressions? 2.Difficult to explain the "knowledge" that has been mined from the data

Issues with Nearest-Neighbor Methods (cont)

3.Dimensionality (1)Dominance of attributes ➡Solution: rescaling data to the same range or same distribution (2)Having too many/irrelevant attributes may confuse distance calculations -> curse of dimensionality ❖For example: each instance is described by 100 attributes out of which only 10 are relevant in determining the target variable. In this case, instances that have identical values for the 10 relevant attributes may nevertheless be distant from one another in the 100 dimensional instance space. ➡Solution: Feature selection
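
A sketch of the dominance problem and the rescaling fix, assuming min-max scaling to [0, 1] and a made-up table with age (years) and income (dollars). The data and helper names are hypothetical.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # avoid dividing by zero for constant columns
    return (X - lo) / span

def nearest_index(X, i):
    """Index of the row closest to row i (Euclidean distance), excluding row i itself."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf
    return int(np.argmin(dists))

# Columns: age, income (hypothetical customers).
X = np.array([[25, 50000], [60, 52000], [27, 60000], [45, 120000]])

print(nearest_index(X, 0))                 # 1: income dominates, the 35-year age gap is ignored
print(nearest_index(min_max_scale(X), 0))  # 2: after rescaling, both attributes count
```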

If a classifier randomly predicts positive and negative, what will TPR and FPR look like?

50% to 50% On graph: Straight diagonal line in middle

Deep Learning

Adding a lot more layers to learn more complexities low level feature-->mid-level feature-->high level feature-->Trainable classifier So multiple layers make sense

Base Learners

Arbitrary learning algorithm which could be used on its own Want base learners to be independent

What do (0,0), (1,1), (0,1) and diagonal line represent, respectively?

Assuming x = FPR and y = TPR ❖(0,0): declare everything to be negative class ❖(1,1): declare everything to be positive class ❖(0,1): ideal - perfect classifier (all true negs or pos) ❖Diagonal line: random guessing; the position along the line is set by the probability of predicting positive (e.g. the proportion of the positive class) ❖Below diagonal line: doing worse than random See pic

How to use ANN for modeling?

Data is presented to the network in the form of activations in the input layer Examples ❖Images: Pixel intensity RGB channels ❖Text: words represented as vectors ❖Structured: each input node takes a feature How to represent more abstract data, e.g. a name? ❖Choose a pattern, e.g. ❖0-0-1 for "Chris" 0-1-0 for "Becky"
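
The name pattern above is a one-hot encoding: one input node per possible value, with a single 1. A sketch, assuming a three-name vocabulary where the third name ("Alex") is hypothetical and is placed first so the patterns for "Chris" and "Becky" match the notes:

```python
def one_hot(value, vocabulary):
    """One input node per vocabulary entry; exactly one node is switched on."""
    return [1 if v == value else 0 for v in vocabulary]

vocabulary = ["Alex", "Becky", "Chris"]   # "Alex" is a made-up third name
print(one_hot("Chris", vocabulary))  # [0, 0, 1]
print(one_hot("Becky", vocabulary))  # [0, 1, 0]
```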

Intuition of Gradient Descent

Define a Learning objective and parameters to be learned Example: training a linear regressor, the objective is some equation that's on the slide
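
The slide's objective is not reproduced here. As a sketch, assume the objective is mean squared error for a one-feature linear regressor y = w*x + b, with w and b as the parameters to be learned. The learning rate, step count, and toy data are illustrative.

```python
import numpy as np

def fit_linear_gd(x, y, lr=0.05, steps=2000):
    """Learn w and b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y                # prediction error on every instance
        grad_w = 2 * np.mean(err * x)      # d(MSE)/dw
        grad_b = 2 * np.mean(err)          # d(MSE)/db
        w -= lr * grad_w                   # move opposite to the gradient
        b -= lr * grad_b
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1                              # data generated from w = 2, b = 1
w, b = fit_linear_gd(x, y)
print(w, b)
```

On this noise-free data the learned parameters approach w = 2 and b = 1.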

Artificial Neural Networks

Information flow ❖ Data is presented to Input layer ❖ Passed on to Hidden Layer (could have multiple hidden layers) ❖ Passed on to Output layer Information processing is parallel ❖Process information much more like the brain than a serial computer

ANN Learning

It's a process by which an ANN adapts itself to a stimulus by making proper parameter adjustments, resulting in the production of desired response. Two kinds of learning ❖Structure Learning:- change in network structure ❖Parameter learning:- connection weights are updated, back propagation algorithm

k-NN

Nearest neighbor algorithm that uses k neighbors ❖k is a complexity parameter: the greater k is, the more the estimates are smoothed out among neighbors Q: What happens if you choose k = n, where n is the size of the entire training set? ❖Output of k-NN: k and all the training data *k-NN is not a model itself*
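
A sketch that makes the k = n question concrete, assuming Euclidean distance and an unweighted majority vote (toy data and function name are illustrative). With k = n every training point votes, so the prediction is always the majority class of the training set, whatever the query is.

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """train: list of (point, label). Majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "no"),
         ((5, 5), "yes"), ((5, 6), "yes")]
query = (5, 5.5)

print(knn_classify(train, query, k=3))           # "yes": the local neighbors decide
print(knn_classify(train, query, k=len(train)))  # "no": just the overall majority class
```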

See practice on slides

No. 13

Gradient Boosting: Intuitions

See Lecture 16 Slides

Demonstration of Backpropagation Algorithm

See slides but also: 1. Initialize with random weights 2. Present a training pattern 3. Adjust weights based on error 4. Repeat this many times, each time taking a random training instance and making slight weight adjustments. Algorithms for weight adjustments are designed to make changes that will reduce the error
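
The four steps can be sketched with a tiny network. This assumes one hidden layer of 4 sigmoid units, a sigmoid output, squared error, and XOR as the training patterns. The architecture, learning rate, seed, and iteration count are all illustrative choices, not the slide's example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)        # hidden layer activations
    out = sigmoid(W2 @ h + b2)      # single output value
    return h, out

def total_loss(X, y, params):
    return float(np.mean([(forward(x, *params)[1] - t) ** 2 for x, t in zip(X, y)]))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# 1. Initialize with random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0

loss_before = total_loss(X, y, (W1, b1, W2, b2))
lr = 0.5
for _ in range(20000):                               # 4. Repeat many times
    i = rng.integers(len(X))                         # 2. Present a random training pattern
    x, t = X[i], y[i]
    h, out = forward(x, W1, b1, W2, b2)
    delta_out = (out - t) * out * (1 - out)          # error signal at the output
    delta_hid = delta_out * W2 * h * (1 - h)         # error backpropagated to the hidden layer
    W2 -= lr * delta_out * h                         # 3. Adjust weights based on error
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x)
    b1 -= lr * delta_hid
loss_after = total_loss(X, y, (W1, b1, W2, b2))
print(loss_before, loss_after)
```

Each update is a slight adjustment in the direction that reduces the error on the presented pattern, so the loss over all four patterns should end up lower than where it started.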

Using ROC for Model Comparison

See word doc

Lecture 17: ANN

see pic in word doc

Similarity

Similarity is at the core of many data mining methods ❖If two things are similar in some ways, they often share other characteristics as well ❖eg. similarity for classification and regression ❖kernels in SVM ❖Group similar items together into clusters ❖unsupervised segmentation

Strengths of k-NN model

Simple to implement and use Robust to noisy data Comprehensible - easy to explain prediction ---e.g., Amazon: "customers with similar searches also bought..."

Q: Is a smaller or greater k likely to overfit the data (i.e., has an overly complicated decision boundary)?

Smaller k = more likely to overfit. Larger k = more stable; a bigger k is robust to noise/outliers

What kind of problems can ANN solve?

Supervised learning: Structured data Unstructured data: audio, visuals, text

True Positive Rate =

TP/(TP+FN) TP+FN = total number of positives AKA recall Want TPR to = 1

If the classifier is not random but carefully trained, so that it makes meaningful predictions, what will TPR and FPR look like?

TPR closer to 1, FPR closer to 0

Nearest Neighbors for Classification

Things we need to consider ❖How many neighbors? (k = ?) ❖How to combine their labels? (e.g., majority vote) ❖Vary the influence of neighbors based on distance?
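
For the last consideration, one common option is to let closer neighbors count more. A sketch assuming inverse-distance weights (one choice among several; the function name and toy data are illustrative):

```python
from collections import defaultdict
import math

def weighted_knn(train, query, k, eps=1e-9):
    """Each of the k nearest neighbors votes with weight 1 / distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        votes[label] += 1.0 / (math.dist(point, query) + eps)  # eps guards against distance 0
    return max(votes, key=votes.get)

train = [((0, 0), "a"), ((4, 0), "b"), ((5, 0), "b")]
print(weighted_knn(train, (1, 0), k=3))  # "a": one close neighbor outweighs two far ones
```

A plain majority vote with k = 3 would pick "b" here, since two of the three neighbors carry that label.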

ROC Measures:

True Positive Rate v. False Positive Rate

Distances

Typically, distances between data points are used for the determination of similarity Some business cases ❖Predict behavior of a new customer ❖Reasoning from similar cases (medicine, law)

Practice: Which of the following statements does NOT hold true?

a.Activation functions are threshold functions. b.Both forward and backward propagation take place during the training process of a neural network. c.Most of the data preprocessing is carried out in the hidden layers. d.Error is calculated at each layer of the neural network.

Extreme Case

all base learners are the same

Finding the Best K: good rule of thumb

k = square root of n

Ex

looking at nearest neighbors.... how many are yes's and how many are no's? Pick the larger number

ANN: same idea as logistic regression

see pic would just be a straight line

