Decision Tree Learning (ML 3)
rule post-pruning
1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
greedy algorithm
A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage, with the hope of finding a global optimum.
gain ratio
An alternative measure to information gain, designed to counteract the bias toward attributes that separate the instances into many very small subsets.
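As a formula (the standard definition, with SplitInformation as defined in the split information entry below):
$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$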
use statistics
Apply a statistical test to estimate whether expanding or pruning a particular node is likely to produce an improvement beyond the training set.
split information
A term incorporated into the gain ratio that measures how broadly and uniformly an attribute splits the data, discouraging attributes with many uniformly distributed values.
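In symbols, where attribute A partitions the collection S into subsets S_1 through S_c (one per value of A):
$$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$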
conjunction
AND
ID3's hypothesis space
The set of all decision trees: a complete space of finite discrete-valued functions, relative to the available attributes.
Cost(A)
A cost associated with measuring attribute A, used to weight the information gain so that lower-cost attributes tend to be preferred.
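One cost-sensitive selection measure of this kind (commonly attributed to Tan and Schlimmer) divides the squared information gain by the cost, so that high-gain, low-cost attributes rank highest:
$$\frac{\mathrm{Gain}^2(S, A)}{\mathrm{Cost}(A)}$$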
C4.5
An extension of ID3 that addresses several of its limitations, e.g., through rule post-pruning and support for continuous and missing attribute values.
preference or search bias
An inductive bias for certain hypotheses over others.
restriction bias or language bias
An inductive bias generated by the expressiveness of the hypothesis representation.
two approaches to overfitting
Approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the data, and approaches that allow the tree to overfit the data and then post-prune it.
ID3
Begins by asking which attribute should be tested at the root of the tree. Each instance attribute is evaluated using a statistical test to find the best one. A branch and descendant node are created for each value of the selected attribute, and the entire process is repeated recursively for each descendant using the training examples associated with it.
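A minimal Python sketch of this recursive procedure, using information gain as the statistical test. The list-of-dicts example format and the function names are illustrative assumptions, not notation from the source:

import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target labels of a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy from partitioning the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Return a nested-dict decision tree (leaves are class labels)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                 # all examples share one class: leaf
        return labels[0]
    if not attributes:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree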
inductive learning methods
Can be characterized as searching a space of hypotheses for one that fits the training examples. For ID3, this space is the set of all possible decision trees, searched simple-to-complex (hill-climbing guided by the information gain measure).
incorporating continuous values in decision trees
Define new discrete-valued attributes that partition the continuous attribute's values into a discrete set of intervals, e.g., a Boolean attribute A_c that is true when A < c and false otherwise.
best suited problems characteristics #3
Disjunctive descriptions may be required
entropy and encoding
Entropy specifies the minimum number of bits needed to encode the classification of an arbitrary member of the collection.
Information gain's bias
Favors attributes with many different values as opposed to attributes with very few values.
finding a c threshold
Sort the examples by the continuous attribute and generate a set of candidate thresholds midway between adjacent values at which the target classification changes. These thresholds can then be evaluated by the information gain associated with each.
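A small Python sketch of this candidate-generation step, using illustrative temperature values and labels (the function name is an assumption for illustration):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:          # classification changes across the boundary
            thresholds.append((v1 + v2) / 2.0)
    return thresholds

print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                           ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']))
# -> [54.0, 85.0]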
Definition of decision tree overfitting
Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
handling missing attributes in instances
Assign the missing attribute the value that is most common among the training examples at that node (or among those examples with the same classification), or assign a probability to each possible value and distribute the example fractionally down the corresponding branches.
Favoring low cost attributes over others
ID3 can be modified to bias toward placing low-cost attributes near the top of the tree by incorporating a cost term into the attribute-selection measure.
ID3's search hypotheses
ID3 maintains only a single current hypothesis as it searches through the space of decision trees.
ID3's backtracking
ID3 performs no backtracking in its purest form
best suited problems characteristics #1
Instances are represented by attribute-value pairs
advantage of using all training examples at each step
Much less sensitive to errors in individual training examples
disjunction
OR - distinct alternatives
Occam's razor
Prefer the simplest hypothesis that fits the data
Approximate inductive bias of ID3
Shorter trees are preferred over larger trees. Trees that place high information gain attributes close to the root are preferred over those that do not.
minimum description length
Halt growth of the tree when the combined encoding length of the tree and the training examples is minimized, rather than growing it further.
inductive bias
The set of assumptions by which an algorithm generalizes from the training instances to classify unseen instances.
expected entropy
The sum of the entropies of each subset, weighted by the fraction of examples that belong to that subset.
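In symbols, where S_v is the subset of S for which attribute A has value v:
$$\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$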
ID3 Restrictions
The target attribute whose value is predicted by the learned tree must be discrete.
best suited problems characteristics #2
The target function has discrete output values
best suited problems characteristics #4
The training data may contain errors.
best suited problems characteristics #5
The training data may contain missing attribute values
training and validation set
Use a set of examples, distinct from those used for training, to evaluate the utility of post-pruning nodes from the tree.
preference bias versus restriction bias
Usually better to work with a preference bias, because the complete hypothesis space is sure to contain the target function, whereas a restriction bias might leave the target function inexpressible.
discrete attributes with many values
Will have a very high information gain while separating the training examples into very small subsets, causing extreme overfitting (e.g., a Date attribute that uniquely identifies each example).
decision trees represent
a disjunction of conjunctions of constraints on the attribute values of instances
decision tree learning
a method for approximating discrete-valued target functions, in which the learned function is represented by a tree
entropy is 0
all examples belong to the same class
entropy
characterizes the impurity of an arbitrary collection of examples
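For a collection S containing a proportion p_+ of positive and p_- of negative examples:
$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$
and, for c classes in general, $\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$.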
reduced-error pruning
consider each of the decision nodes in the tree to be candidates for pruning. Pruning a decision node means removing the subtree of that decision node, making it a leaf, and assigning it the most common classification of the training examples affiliated with that node.
entropy is 1
if there are equal numbers of positive and negative examples
information gain
is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.
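In symbols:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$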
classification problems
labeling a specific example as belonging to one of a discrete set of categories
information gain
measures how well a given attribute separates the training examples according to their target classification
Causes of ID3 Overfitting
noise in the data, or a training set too small to produce a representative sample of the true target function
Cause of ID3 inductive bias
only the search strategy; the hypothesis space itself is complete and unrestricted.
best approach to overfitting
allow the tree to overfit and then post-prune it, because of the difficulty of knowing when to stop growing the tree
node
represents some attribute of an instance
branch
some value of that attribute
decision tree classification
starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
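Under the nested-dict tree representation assumed in the ID3 sketch above, this walk is a short loop (the function name is illustrative):

def classify(tree, instance):
    """Follow the branch matching the instance's value for the attribute tested
    at each node until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                   # attribute tested at this node
        tree = tree[attribute][instance[attribute]]    # move down the matching branch
    return tree                                        # leaf value = the classification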
condition for node-removal in reduce-error pruning
the pruned tree performs no worse than the original tree over the validation set