Applied Machine Learning


Generalization Errors with Classifiers

A classifier suffering from generalization error:
Has a low error rate on the training set
Has a high error rate when evaluated on a test set
Error = bias + variance

Association Rules - Applications

Market basket analysis
Medical diagnosis
Protein sequences
Census data: education, health, transport, funds, public businesses
CRM of credit card business

Association Mining Applications

Market basket analysis: what items do customers buy together
Recommender systems: a sales manager at an electronics store is talking to a customer who recently purchased a computer and a camera; what should he recommend next?
Customer relationship management: identify preferences of different customer groups
Medical diagnosis: find associations among symptoms and observations to predict a diagnosis

Error correcting code design issues

Minimum codeword length to represent k classes: n = ceil(log2 k)
Can correct up to floor((d - 1)/2) errors, where d is the minimum Hamming distance between codewords
Large row-wise separation: more tolerance for errors
Large column-wise separation: binary classifiers are mutually independent
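As a small illustration, the two design numbers above can be computed directly (the helper names are ours, not from the card):

```python
import math

def min_codeword_length(k):
    # Fewest bits that still give each of the k classes a unique codeword.
    return math.ceil(math.log2(k))

def correctable_errors(d):
    # A code with minimum Hamming distance d corrects up to floor((d-1)/2) bit errors.
    return (d - 1) // 2

print(min_codeword_length(8))  # 3 bits are enough for 8 classes
print(correctable_errors(7))   # a distance-7 code corrects up to 3 errors
```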

True Negative Rate

True Negative Rate: fraction of negative instances correctly predicted TNR = TN/(FP + TN)

One versus All (OVA)

Y = {y1, y2, ..., yK}: the set of class labels
Classifier building: for each yi, create a binary problem such that:
Instances belonging to yi are positive
Instances not belonging to yi are negative
Tuple classification: classify the tuple using each classifier
If classifier i returns a positive label, yi gets one vote
If classifier i returns a negative label, all classes except yi get a vote
Assign the class with the most votes
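A minimal sketch of the OVA voting scheme, assuming each binary classifier is a plain function returning True for its own class (all names and the toy thresholds are hypothetical):

```python
def ova_predict(x, labels, classifiers):
    # labels[i] is the class that classifiers[i] was trained to recognize.
    votes = {y: 0 for y in labels}
    for y, clf in zip(labels, classifiers):
        if clf(x):
            votes[y] += 1                  # positive: yi gets one vote
        else:
            for other in labels:           # negative: every class except yi gets a vote
                if other != y:
                    votes[other] += 1
    return max(votes, key=votes.get)       # assign the class with the most votes

# Toy 1-D problem with three classes separated by thresholds
classifiers = [lambda x: x < 0, lambda x: 0 <= x < 10, lambda x: x >= 10]
print(ova_predict(5, ["low", "mid", "high"], classifiers))  # mid
```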

Bagging Algorithm

1: Let D denote the original training data, k the number of base classifiers, T the test data
2: for i = 1 to k do
3:   Create a bootstrap sample Di from D
4:   Build a base classifier Ci from Di
5: end for
6: for each test record x in T do
7:   C*(x) = Vote(C1(x), C2(x), ..., Ck(x))
8: end for
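The pseudocode can be sketched in Python; `one_nn` is a stand-in base learner added only for illustration:

```python
import random
from collections import Counter

def bagging(D, k, base_learner):
    # D: list of (x, y) records; base_learner: callable that fits a sample
    # and returns a prediction function x -> label.
    models = []
    for _ in range(k):
        Di = [random.choice(D) for _ in range(len(D))]  # bootstrap sample from D
        models.append(base_learner(Di))                 # base classifier Ci
    def C_star(x):
        # Majority vote over C1(x), ..., Ck(x)
        return Counter(C(x) for C in models).most_common(1)[0][0]
    return C_star

def one_nn(Di):
    # Toy 1-nearest-neighbour base learner on 1-D inputs
    return lambda x: min(Di, key=lambda rec: abs(rec[0] - x))[1]

random.seed(0)
D = [(0, "a"), (1, "a"), (2, "a"), (8, "b"), (9, "b"), (10, "b")]
clf = bagging(D, 11, one_nn)
print(clf(1), clf(9))
```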

Ensemble Method Algorithm

1: Let D denote the original training data, k the number of base classifiers, T the test data
2: for i = 1 to k do
3:   Create training set Di from D
4:   Build a base classifier Ci from Di
5: end for
6: for each test record x in T do
7:   C*(x) = Vote(C1(x), C2(x), ..., Ck(x))
8: end for

Candidate Set Generation

Avoid generating too many unnecessary candidates
Ensure the set is complete: no frequent itemset is left out
Do not generate duplicate itemsets
{a, b, c, d} can be generated by merging:
{a, b, c} and {d}
{a, c} and {b, d}
{c} and {a, b, d}

Association Rules Interpretation - Confidence

Confidence: measures the reliability of the implication
The higher the confidence, the more likely Y is present in transactions containing X
Example: a transaction set contains 1000 transactions; 200 transactions contain the items {Milk, Paper} and 250 transactions contain {Milk}
What is the support for {Milk} => {Paper}? What is the confidence for {Milk} => {Paper}?
sigma({Milk, Paper}) = # transactions containing both = 200
s({Milk} => {Paper}) = sigma({Milk, Paper}) / N = 200/1000 = 0.2
c({Milk} => {Paper}) = sigma({Milk, Paper}) / sigma({Milk}) = 200/250 = 0.8
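The worked example can be checked with a small helper (the 750 {Bread} transactions are fillers we added to reach N = 1000; the function name is ours):

```python
def rule_metrics(transactions, X, Y):
    # support s(X => Y) = sigma(X u Y) / N;  confidence c(X => Y) = sigma(X u Y) / sigma(X)
    X, Y = set(X), set(Y)
    n_xy = sum(1 for t in transactions if X | Y <= set(t))
    n_x = sum(1 for t in transactions if X <= set(t))
    return n_xy / len(transactions), n_xy / n_x

# 200 x {Milk, Paper}, 50 x {Milk} (250 with Milk in total), 750 fillers
T = [["Milk", "Paper"]] * 200 + [["Milk"]] * 50 + [["Bread"]] * 750
s, c = rule_metrics(T, ["Milk"], ["Paper"])
print(s, c)  # 0.2 0.8
```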

Class Imbalance - Cost Sensitive Learning

Incorporate cost into the process of building the model
Decision trees:
Select the attribute for the split
Decide whether to prune a subtree
Nearest neighbor: update the decision boundary based on cost

When is ensemble better than base classifiers?

The base classifiers are independent, and each base classifier's error rate is below 0.5 (better than random guessing)

Association Rules - Objective Measures - Limitation of Support

Limitation of support: some items appear infrequently in their normal setting compared to other items, e.g., the number of times eggs are purchased vs. a TV
If we increase the support threshold, patterns containing low-occurrence items (e.g., TV) will not be extracted
If we decrease the support threshold, many uninteresting patterns will be extracted

Methods of the Ensemble

Manipulate the training set: resampling
Manipulate the input features: use a subset of features
Manipulate the class labels:
When there is a large number of classes, partition them into sets
Error correcting output coding
Manipulate the learning algorithm (algorithm specific):
Change the topology in a neural network
Inject randomness into decision tree growing

Sequential Pattern Mining - Candidate Set Generation

Merge two k-sequences s1 and s2 if the subsequences obtained by dropping the first event of s1 and dropping the last event of s2 are identical
s1: <{1} {2 3} {4}>, drop first event: <{2 3} {4}>
s2: <{2 3} {4 5}>, drop last event: <{2 3} {4}>
s1 and s2 can be merged to generate a candidate 5-sequence
If the last two events of s2 belong to the same element, the last event of s2 is added to the last element in the resulting sequence:
s1: <{1} {2 3} {4}>, s2: <{2 3} {4 5}>, result: <{1} {2 3} {4 5}>
If the last two events of s2 belong to different elements, the last event of s2 is added as a separate element in the resulting sequence:
s1: <{1} {2 3} {4}>, s2: <{2 3} {4} {5}>, result: <{1} {2 3} {4} {5}>
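The merge rule can be sketched as a small helper (hypothetical function name; a sequence like <{1} {2 3} {4}> is represented as a list of element lists):

```python
def merge_sequences(s1, s2):
    # Returns the candidate (k+1)-sequence if s1 and s2 satisfy the merge
    # condition, else None.
    def drop_first(s):
        head = s[0][1:]                      # remove the first event of s1
        return ([head] if head else []) + [e[:] for e in s[1:]]

    def drop_last(s):
        tail = s[-1][:-1]                    # remove the last event of s2
        return [e[:] for e in s[:-1]] + ([tail] if tail else [])

    if drop_first(s1) != drop_last(s2):
        return None                          # merge condition not met
    result = [e[:] for e in s1]
    if len(s2[-1]) > 1:
        result[-1].append(s2[-1][-1])        # last two events of s2 share an element
    else:
        result.append([s2[-1][-1]])          # last event of s2 forms its own element
    return result

print(merge_sequences([[1], [2, 3], [4]], [[2, 3], [4, 5]]))    # [[1], [2, 3], [4, 5]]
print(merge_sequences([[1], [2, 3], [4]], [[2, 3], [4], [5]]))  # [[1], [2, 3], [4], [5]]
```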

Multiclass Approaches

One versus All (OVA) One versus One (OVO) Error correcting codes

OVA vs. OVO

One vs All:
Builds k classifiers for a k-class problem
Full training set for each classifier
One vs One:
Builds k(k-1)/2 classifiers
Subset of the training set for each classifier
Sensitive to binary classification errors

Dealing with high bias and variance...

Plot learning curves for different training sizes:
Training set error
Cross-validation set error

Precision

Precision: fraction of records that are truly positive in the set predicted as positive p = TP/(TP + FP)

Classification Applications

Processing loan applications
Screening images for oil slicks
Electricity load forecasting
Diagnosis of machine faults
Marketing and sales
Medical field

Recall

Recall: fraction of positive records correctly predicted (true positive) r = TP/(TP + FN)

Bagging (Bootstrap Aggregating)

Repeatedly creates samples with replacement according to a uniform distribution
Each record is selected with probability 1 - (1 - 1/N)^N
Pick the class that receives the highest number of votes
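Numerically, this selection probability quickly approaches 1 - 1/e, roughly 0.632 (the helper name is ours):

```python
import math

def selection_probability(N):
    # Chance a given record appears at least once in a bootstrap sample of size N
    return 1 - (1 - 1 / N) ** N

print(round(selection_probability(1000), 3))  # 0.632
print(round(1 - 1 / math.e, 3))               # limit for large N: 0.632
```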

Association Mining

Search for patterns recurring in the given data set
Given a set of itemsets or transactions, find rules predicting the occurrence of items based on the occurrences of other items in the transactions

Handling Classifiers that take too long...

Subsampling: do not necessarily use all the data; the learning curve suggests a training size
Distributed approach: how to split the data and combine the results depends on the algorithm
Distributed-computing frameworks: Hadoop, MapReduce, ...

Association Rule Mining - Computational Complexity

Support threshold: a lower support implies:
More frequent itemsets, more candidate itemsets
Larger frequent itemsets (larger k)
Number of items:
More space needed to store support counts
Increases the number of candidate itemsets
Number of transactions:
Increases the time needed for a pass over the data
Transaction width:
Increases the maximum size of frequent itemsets
Increases the number of hash tree traversals

Association Rules Interpretation - Support

Support:
Low support: items may occur together by chance
Used to eliminate uninteresting rules
Example: a transaction set contains 1000 transactions; a single transaction contains the items {Bandaids, TV} and no other transaction contains either item
What is the support for {TV} => {Bandaids}? What is the confidence for {TV} => {Bandaids}?
sigma({TV, Bandaids}) = # transactions containing both = 1
s({TV} => {Bandaids}) = sigma({TV, Bandaids}) / N = 1/1000 = 0.001
c({TV} => {Bandaids}) = sigma({TV, Bandaids}) / sigma({TV}) = 1/1 = 1

True Positive Rate

True Positive Rate: fraction of positive instances correctly predicted TPR = TP/(TP + FN)

Association Rules - Subjective Measures

Visualization: allows human beings to interact with the data mining system and interpret and verify rules
Template-based: allows users to constrain the type of patterns extracted
Subjective interest measures: based on domain information such as a concept hierarchy or profit margins

Association Rules - Evaluation Metrics

Which rules are interesting?
Subjective measures: based on subjective arguments to decide whether a rule reveals interesting information
{Butter} => {Bread}: not interesting
{Diapers} => {Bread}: interesting
Objective measures: based on statistical computation

One versus One (OVO)

Y = {y1, y2, ..., yK}: the set of class labels
Classifier building: for each pair yi and yj, create a binary problem:
Keep instances belonging to yi and yj
Ignore other instances
Tuple classification: classify the tuple using each classifier
The classifier for the pair (yi, yj) gives one vote to whichever of yi and yj it predicts
Assign the class with the most votes

What does high bias look like using learning curves?

High bias: similar errors for both curves as training size increases

What does high variance look like using learning curves?

High variance:
Training error is low and increases slowly with the number of training examples
Cross-validation error starts high and decreases with the number of training examples
A large gap remains between the two errors

Error correcting codes

Idea: add redundancy to increase the chances of detecting errors
Training:
Represent each yi by a unique n-bit codeword
Build n binary classifiers, each to predict one bit
Testing:
Run each classifier on the test instance to predict its bit vector
Assign to the test instance the class whose codeword has the closest Hamming distance to the output codeword
Hamming distance: number of bits that differ
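A minimal decoding sketch; the codewords below are hypothetical, chosen only to show a one-bit error being corrected:

```python
def hamming(a, b):
    # Number of bit positions in which a and b differ
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(output_bits, codewords):
    # Assign the class whose codeword is closest in Hamming distance
    return min(codewords, key=lambda y: hamming(codewords[y], output_bits))

codewords = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
}
print(ecoc_decode((1, 1, 1, 0, 1, 1, 1), codewords))  # A (one bit flipped, still decoded)
```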

Motivation for Sequential Pattern Mining

An online shopping company would like to extract patterns about web pages visited in each session as an attempt to predict customer behavior
Data collected:
<{Homepage} {Electronics} {Cameras and Camcorders} {Digital cameras} {Shopping Cart} {Return to Shopping}>
<{Homepage} {Books} {Programming Algorithms} {Modeling and Simulation}>
Temporal information is not captured by the <session-id, items> model

Classification Error Rate

Error Rate: e = (FP + FN)/Total
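The error rate is the fraction of all misclassified records, i.e. both false positives and false negatives; a quick check with toy confusion-matrix counts:

```python
def error_rate(TP, FP, FN, TN):
    # Misclassified records are the false positives plus the false negatives
    return (FP + FN) / (TP + FP + FN + TN)

print(error_rate(TP=40, FP=5, FN=10, TN=45))  # 0.15
```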

Sequential Pattern Mining - Apriori Principle

Any data sequence that contains a k-sequence also contains all of its (k-1)-subsequences => the Apriori principle holds
Apriori-like algorithm for generating frequent data sequences:
1. Generate frequent 1-sequences
2. Repeat:
   a. Merge pairs of frequent (k-1)-sequences to generate candidate k-sequences
   b. Prune candidates whose (k-1)-subsequences are infrequent
   c. Make a pass over the data set to count the supports of the remaining candidates

General Multiclass Classification Characteristics

Approaches: decision trees, rule-based, nearest neighbors, neural networks, Bayes classifiers
Characteristics:
Prone/robust to noise and overfitting
Different training/testing speeds
Linear vs. non-linear model
Variations:
Multiple classes
Multiple labels
Multiple tasks

General Classification Characteristics

Approaches: decision trees, rule-based, nearest neighbors, artificial neural networks, Bayes classifiers
Characteristics:
Prone/robust to noise and overfitting
Handling missing values
Different training/testing speeds
Linear vs. non-linear model

Association Rule

Association rule: an implication of the form X → Y, where X and Y are disjoint itemsets
Examples: {Bread, Diapers} → {Milk}, {Bread} → {Milk}

Boosting

Adaptively changes the distribution of training examples
Focuses on the examples that are hard to classify:
Assign a weight (for being selected) to each training example
Generate a training set
Generate a classifier based on the training set
Adjust the weights based on the classifier's predictions: higher weights for examples incorrectly classified
Repeat
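One common concrete form of the weight update is the AdaBoost rule, shown here as an illustration (the card describes boosting generically; the function name is ours):

```python
import math

def update_weights(weights, predictions, labels):
    # AdaBoost-style update: eps is the weighted error of the current classifier;
    # misclassified examples are scaled by exp(+alpha), correct ones by exp(-alpha).
    eps = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new = [w * math.exp(alpha if p != y else -alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)                       # normalize so the weights sum to 1
    return [w / z for w in new]

# Four equally weighted examples; the third is misclassified (eps = 0.25)
w = update_weights([0.25] * 4, [1, 1, 0, 0], [1, 1, 1, 0])
print([round(x, 4) for x in w])  # the misclassified example's weight grows to 0.5
```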

Ensemble Method

Combines multiple base classifiers into one Given a test record: output a prediction by taking a vote on predictions of base classifiers

Bagging vs. Boosting

Bagging reduces variance by averaging
Boosting reduces both bias and variance
Boosting might hurt performance on noisy data; bagging does not have this problem
Bagging is easier to parallelize
In practice, both bagging and boosting are powerful techniques

Association Rule Discovery - the Brute Force Approach

Compute the support and confidence of every possible rule
Select only rules satisfying the minsup and minconf thresholds
Possible number of rules: R = 3^d - 2^(d+1) + 1 (d: number of items)
Example:
d = 6: R = 3^6 - 2^7 + 1 = 602
d = 10: R = 3^10 - 2^11 + 1 = 57,002
d = 15: R = 14,283,372
THIS APPROACH IS PROHIBITIVELY EXPENSIVE
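The rule count can be verified directly from the formula:

```python
def num_rules(d):
    # Total possible association rules over d items: R = 3^d - 2^(d+1) + 1
    return 3 ** d - 2 ** (d + 1) + 1

print(num_rules(6))   # 602
print(num_rules(10))  # 57002
print(num_rules(15))  # 14283372
```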

Class Imbalance Issues

In the credit card fraud example: what is the accuracy of a model that classifies ALL transactions as legitimate?
A correct classification of the rare class has greater value than a correct classification of the majority class
Issues:
Performance measures need to be modified
Models for the rare class are highly specialized
Susceptible to noise

Apriori Algorithm

Ck: candidate itemsets of size k (itemsets possibly frequent)
Fk: frequent itemsets of size k
Compute F1: a single pass over the transactions table to count the support of individual items
Iteratively, use Fk-1 to compute Ck and then Fk
Stop when Fk is empty
A pass over the transactions is needed to count the support of every Ck
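The loop can be sketched as follows, with support measured as an absolute count (a simplified sketch, not an optimized implementation; the hash-tree counting step is replaced by a plain scan):

```python
from itertools import combinations

def apriori(transactions, minsup):
    # minsup is a minimum support *count*; returns every frequent itemset as a frozenset.
    T = [set(t) for t in transactions]

    def support(c):
        return sum(1 for t in T if c <= t)

    # F1: a single pass to count the support of individual items
    items = {i for t in T for i in t}
    Fk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent, k = set(Fk), 2
    while Fk:                                   # stop when Fk is empty
        # Ck: merge pairs of frequent (k-1)-itemsets into size-k candidates
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset (Apriori principle)
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        Fk = {c for c in Ck if support(c) >= minsup}  # one pass per level
        frequent |= Fk
        k += 1
    return frequent

F = apriori([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]], minsup=3)
print(len(F))  # 6: three frequent singletons and three frequent pairs
```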

Practical Issues in Classification

Class imbalance
Classifier takes too long
Classifier suffers from overfitting
Classifier suffers from underfitting

Bias - Variance (& noise) Decomposition

Classification error = bias + variance + noise
Bias: the ability of the model to approximate the data; the error of the best classifier
Variance: the stability of the model in response to new training data; the error of the trained classifier with respect to the best classifier

Solutions to Generalization Errors with Classifiers

Get more training examples: works if the classifier has high variance (overfitting)
Try a smaller set of features: works if the classifier has high variance (overfitting)
Obtain new features: works if the classifier has high bias (underfitting)

Association Rule Discovery Problem

Given:
A set of transactions T
A minimum support minsup
A minimum confidence minconf
Find all association rules having:
support > minsup
confidence > minconf

Issues using Market Basket Analysis

Discovering patterns from large transaction data is computationally expensive
Discovery of spurious patterns

F1 measure

F1 = 2rp/(r + p) = 2/(1/r + 1/p) A model can usually maximize one (r or p) but not the other Building a model that maximizes both is difficult
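A quick numeric check of the harmonic-mean formula (toy confusion-matrix counts):

```python
def f1(TP, FP, FN):
    p = TP / (TP + FP)           # precision
    r = TP / (TP + FN)           # recall
    return 2 * r * p / (r + p)   # F1 = 2rp / (r + p)

print(round(f1(TP=80, FP=20, FN=20), 4))  # 0.8
```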

False Negative Rate

False Negative Rate: fraction of positive instances predicted negative FNR = FN/(TP + FN)

False Positive Rate

False Positive Rate: fraction of negative instances predicted positive FPR = FP/(TN + FP)

What is a better association mining approach?

Frequent itemset generation: generate all itemsets satisfying minsup
Rule generation: extract rules satisfying minconf from the frequent itemsets

