ML2 - Inductive Learning
DT applications
> multivalued attributes: information gain is a poor measure here (it favours attributes with many values); convert to Boolean tests.
> continuous input: identify the split points with the highest information gain (see the sketch below).
> continuous output: build a regression tree whose leaves end in a linear function.
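A minimal sketch of the continuous-input case, assuming NumPy arrays: candidate thresholds are the midpoints between consecutive distinct values of one attribute, and each is scored by information gain. The function names are illustrative, not from the notes.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_point(values, y):
    """Scan the midpoints between consecutive sorted values of one continuous
    attribute and return the threshold with the highest information gain."""
    parent = entropy(y)
    uniq = np.unique(values)
    best_gain, best_thr = -1.0, None
    for thr in (uniq[:-1] + uniq[1:]) / 2:          # candidate split points
        left, right = y[values <= thr], y[values > thr]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - weighted
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```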
Linear classifiers
A linear function can be turned into a linear classifier by thresholding it at zero: the equation w·x + b = 0 defines the decision boundary between two linearly separable classes.
Perceptron
A simple linear classifier with an input function (weighted sum), an activation function (hard threshold) and an output. A neural network is a combination of many perceptrons.
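A minimal sketch of a single perceptron, assuming NumPy arrays and labels in {-1, +1}: the input function is the weighted sum w·x + b, the activation is a hard threshold, and the classic error-driven update rule is used for training. Names are illustrative.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):
    """Train a single perceptron on labels in {-1, +1}.
    Input function: weighted sum w.x + b; activation: hard threshold."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1   # activation (hard threshold)
            if pred != yi:                       # misclassified: nudge the weights
                w += lr * yi * xi
                b += lr * yi
    return w, b

def perceptron_predict(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)
```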
Information gain
An attribute a splits S into subsets S_i, each with its own entropy. The goodness of a is measured by the reduction in entropy: the entropy of S minus the weighted sum of the children's entropies, Gain(S, a) = H(S) - sum_i |S_i|/|S| * H(S_i).
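A small sketch of that calculation for a categorical attribute, assuming NumPy arrays of attribute values and class labels (function names are illustrative):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute_values, y):
    """Gain(S, a) = H(S) - sum_i |S_i|/|S| * H(S_i),
    where the S_i are the subsets induced by each value of attribute a."""
    parent = entropy(y)
    weighted_children = 0.0
    for v in np.unique(attribute_values):
        child = y[attribute_values == v]
        weighted_children += len(child) / len(y) * entropy(child)
    return parent - weighted_children
```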
Random forests
Bagging applied to decision trees in two ways: a set of DTs is learned from bagged datasets, and randomness is also applied at the feature level (the random subspace method), so each split considers only a random subset of attributes.
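A rough sketch of the idea, assuming NumPy and scikit-learn are available: bootstrap sampling over the examples combined with per-split feature subsampling via DecisionTreeClassifier's max_features option. This approximates, but is not identical to, a full random forest implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, rng=None):
    """Bagging over the examples plus a random subset of features at each
    split (max_features), which is roughly the random subspace idea."""
    rng = np.random.default_rng(0) if rng is None else rng
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")   # feature subsampling per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])          # shape (n_trees, n_samples)
    # majority vote per sample; assumes integer class labels >= 0
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```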
Bagging
Bootstrap aggregation. Improves the stability of unstable classifiers such as NNs and DTs. The training set is used to generate m new datasets by sampling with replacement; m models are fitted, one per dataset, and their outputs are combined by averaging (regression) or voting (classification).
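A generic sketch, assuming NumPy arrays and a user-supplied base learner (fit_fn and predict_fn are illustrative names, not from the notes):

```python
import numpy as np

def bagging_fit(X, y, fit_fn, m=50, rng=None):
    """Generate m bootstrap datasets (sampling with replacement) and fit
    one model per dataset with the supplied base learner fit_fn(X, y)."""
    rng = np.random.default_rng(0) if rng is None else rng
    models = []
    for _ in range(m):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        models.append(fit_fn(X[idx], y[idx]))
    return models

def bagging_predict(models, predict_fn, X):
    """Average the m predictions (regression); replace the mean with a
    majority vote for classification."""
    return np.mean([predict_fn(model, X) for model in models], axis=0)
```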
CART
Classification And Regression Trees. Maximises the homogeneity of child nodes by recursively selecting the most discriminative attribute, using the Gini index as the splitting criterion. Post-prunes, starting with the branches with the weakest predictive power.
Linear regression
Estimate the parameter values which minimise the loss function. The weight values can be solved:
> directly (sketched below)
> iteratively (see gradient descent)
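A sketch of the direct route, assuming NumPy arrays and a squared-error loss; the weights come from a least-squares solve of the normal equations.

```python
import numpy as np

def linear_regression_direct(X, y):
    """Directly solve for the weights that minimise the squared-error loss,
    using least squares (the normal equations): w = (X^T X)^{-1} X^T y."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w                                    # w[0] is the intercept
```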
Reduced error pruning
Every branch is considered for pruning: each branch is removed and converted into a leaf with the plurality classification for that leaf. This is then tested and kept if performance is superior to the original (sketched below). This requires:
> training set
> pruning set
> test set
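A sketch of the procedure over a hypothetical tree representation; the Node class and the classify/accuracy helpers are assumptions made for illustration, not part of the notes, and the pruning set is a list of (example, label) pairs.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical tree node: an internal node tests `attribute` and has one
    child per attribute value; `prediction` is the plurality class here."""
    attribute: str = None
    children: dict = field(default_factory=dict)   # attribute value -> Node
    prediction: object = None

def classify(node, example):
    while node.children:
        child = node.children.get(example[node.attribute])
        if child is None:                          # unseen value: stop at this node
            break
        node = child
    return node.prediction

def accuracy(root, dataset):
    return sum(classify(root, x) == y for x, y in dataset) / len(dataset)

def reduced_error_prune(root, node, pruning_set):
    """Bottom-up: tentatively replace each branch by a leaf labelled with its
    plurality class, evaluating the result on the pruning set."""
    for child in node.children.values():
        reduced_error_prune(root, child, pruning_set)
    if node.children:
        before = accuracy(root, pruning_set)
        saved, node.children = node.children, {}    # prune: node becomes a leaf
        if accuracy(root, pruning_set) < before:    # revert if accuracy dropped
            node.children = saved                   # (use <= to demand a strict improvement)
    return root
```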
C4.5
Extends the ideas of ID3 and adds error-based (pessimistic) pruning, whose error estimate uses the upper bound of a confidence interval.
+ handles mixed data types
+ handles missing values
- overfitting is a problem
Gain ratio
Information gain favours attributes with many values. Split information penalises attributes with many values. Gain ratio = information gain / split information, balancing the two.
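A small sketch of the split-information term, assuming NumPy; gain ratio then reuses the information_gain helper sketched under Information gain above.

```python
import numpy as np

def split_information(attribute_values):
    """SplitInfo(S, a) = -sum_i |S_i|/|S| * log2(|S_i|/|S|);
    large when attribute a has many evenly-sized values."""
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Gain ratio balances the two:
# gain_ratio = information_gain(attribute_values, y) / split_information(attribute_values)
```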
Gini impurity
The probability that a randomly selected element would be mislabelled if it were labelled at random according to the class distribution at the node.
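A minimal sketch, assuming a NumPy array of class labels:

```python
import numpy as np

def gini_impurity(y):
    """Probability that a randomly chosen element of y would be mislabelled
    if it were labelled at random according to the class distribution:
    G = 1 - sum_c p_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```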
ID3
Iterative dichotomiser. Top-down induction of decision trees (sketched below):
1. A window is chosen: a randomly selected subset of the training data.
2. A tree is grown on the window until it classifies it with 100% accuracy.
3. The tree is tested on all other training instances.
4. If accuracy == 1.0, done.
5. Else repeat, adding the misclassified patterns to the window.
Uses information gain and no pruning.
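A sketch of the windowing loop, assuming NumPy arrays and using scikit-learn's DecisionTreeClassifier (criterion="entropy", unpruned) as a stand-in for ID3's own tree grower:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def id3_windowing(X, y, window_frac=0.2, rng=None):
    """Grow a tree on a random window, test it on all training instances,
    and keep adding the misclassified ones to the window until done."""
    rng = np.random.default_rng(0) if rng is None else rng
    window = set(rng.choice(len(X), size=max(1, int(window_frac * len(X))),
                            replace=False).tolist())
    while True:
        idx = np.array(sorted(window))
        tree = DecisionTreeClassifier(criterion="entropy")   # information gain, no pruning
        tree.fit(X[idx], y[idx])                             # fits the window exactly
        wrong = np.flatnonzero(tree.predict(X) != y)         # test on all instances
        if len(wrong) == 0:                                  # accuracy == 1.0: done
            return tree
        before = len(window)
        window.update(wrong.tolist())                        # add misclassified patterns
        if len(window) == before:                            # nothing new to add (noisy data)
            return tree
```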
DT overfitting
Low resubstitution error but poor generalisation. Preventions include:
> stop growing the tree early, before it reaches the point of overfitting
> allow the tree to overfit, then prune
Entropy
A measure of the impurity (uncertainty) of a set of examples, H(S) = -sum_c p_c log2 p_c; used to select the attribute that best splits the data.
Gradient descent
Views the loss function as a surface over the weight space and iteratively moves W downhill, against the gradient, seeking the values of W at the global minimum.
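A sketch of batch gradient descent on the mean squared error of a linear model, assuming NumPy arrays; the learning rate and step count are arbitrary illustrative values. This is the iterative counterpart to the direct solution under Linear regression.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, steps=1000):
    """Batch gradient descent on the mean squared error of a linear model:
    repeatedly move W a small step against the gradient of the loss."""
    Xb = np.hstack([np.ones((len(X), 1)), X])    # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = 2 / len(y) * Xb.T @ (Xb @ w - y)  # d/dw of mean((Xw - y)^2)
        w -= lr * grad
    return w
```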
Rule post pruning
A pruning solution for data-limited situations. A rule represents a route from the root to a node. Pruning involves removing the branch and applying the rule "at node v, predict class label x", keeping the change if it improves performance.
+ very flexible
Decision trees
The aim is to find the most compact decision tree consistent with the training examples. This recursively selects the best attribute to split the examples (see the sketch after this list). At a given node we may find:
1. All examples are positive. We are finished.
2. All examples are negative. We are finished.
3. Examples remain mixed: select the best attribute and split.
4. No examples remain: lack of data. You can return:
- default value
- plurality classification: best guess given the parent
- probability
5. Mixed examples and no attributes remain, due to noise or unobservable attributes.
- plurality classification
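A compact sketch of the recursive procedure for categorical attributes, following the cases above; examples are assumed to be (attribute-dict, label) pairs and the attribute-scoring function (e.g. information gain) is passed in. All names are illustrative.

```python
from collections import Counter

def plurality(examples):
    """Most common class among the examples (best guess)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, importance, parent_examples=()):
    """examples: list of (dict_of_attribute_values, class_label) pairs.
    importance(attr, examples) scores an attribute (e.g. information gain)."""
    if not examples:                                  # case 4: no examples left
        return plurality(parent_examples)             #   plurality of the parent
    labels = {label for _, label in examples}
    if len(labels) == 1:                              # cases 1 & 2: all one class
        return labels.pop()
    if not attributes:                                # case 5: attributes exhausted
        return plurality(examples)                    #   (noise / unobservable attributes)
    best = max(attributes, key=lambda a: importance(a, examples))   # case 3
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = dtl(subset, rest, importance, examples)
    return tree
```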
Boosting
This maintains a weight over the training examples, with a higher weight indicating a more difficult example that needs more learning. Classifiers which classify the weighted examples well are given a higher-weighted vote. AdaBoost is a common approach.
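A sketch of the AdaBoost idea with decision stumps, assuming NumPy arrays, labels in {-1, +1}, and scikit-learn available; the weight and vote formulas follow the standard AdaBoost update.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Hard examples get larger weights; accurate stumps get a larger vote."""
    n = len(X)
    w = np.full(n, 1 / n)                      # weight over training examples
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weighted vote of this classifier
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```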
Inductive learning
This simply means learning from examples
Logistic function
This smooths out the hard threshold at the boundary, giving a differentiable output between 0 and 1 and dampening the effect of misclassifications near the boundary.
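A minimal sketch, assuming NumPy:

```python
import numpy as np

def logistic(z):
    """The logistic (sigmoid) function: a smooth, differentiable replacement
    for the hard threshold, squashing w.x + b into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_classifier(X, w, b, threshold=0.5):
    """Soft linear classifier: a probability-like output instead of a hard 0/1 jump."""
    return logistic(X @ w + b) >= threshold
```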
Ensemble methods
This uses N classifiers on the training examples and gets them to vote on the final decision, which can dramatically reduce error.
Regularisation
Guards against overfitting by taking model complexity into account: a complexity penalty is added to the loss so that simpler hypotheses are preferred.
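One common concrete form is L2 (ridge) regularisation, which adds a complexity penalty lam * ||w||^2 to the squared-error loss; a minimal sketch assuming NumPy arrays (lam is an arbitrary illustrative value):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Minimise ||Xw - y||^2 + lam * ||w||^2: the penalty term discourages
    large weights, i.e. it accounts for model complexity.
    Closed form: w = (X^T X + lam * I)^{-1} X^T y (intercept handling omitted)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```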