ML - Chapter 3: Classification
Multiclass/multinomial classification
classifiers that distinguish between more than two classes (e.g., all 10 digits).
Specificity:
Another name for TNR.
One-versus-one (OvO) strategy
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. If there are N classes, you need to train N × (N - 1) / 2 classifiers. Scikit-Learn uses OvO for algorithms that scale poorly with the size of the training set, such as support vector machine classifiers.
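A minimal sketch of forcing OvO in Scikit-Learn (using the small built-in digits dataset as a stand-in for MNIST):

    # Force one-versus-one training; Scikit-Learn builds all pairwise classifiers.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsOneClassifier

    X, y = load_digits(return_X_y=True)            # 10 digit classes
    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
    ovo_clf.fit(X, y)
    print(len(ovo_clf.estimators_))                # 10 * 9 / 2 = 45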
Multilabel Classification
In some cases you may want your classifier to output multiple classes for each instance. Consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? It should attach one tag per person it recognizes. Say the classifier has been trained to recognize three faces, Alice, Bob, and Charlie. Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning "Alice yes, Bob no, Charlie yes"). A classification system that outputs multiple binary tags like this is called a multilabel classification system.
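A minimal multilabel sketch, loosely following the book's two-tag digit example (the tags here are chosen just for illustration):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    y_large = (y >= 7)                     # tag 1: is it a "large" digit?
    y_odd = (y % 2 == 1)                   # tag 2: is it odd?
    y_multilabel = np.c_[y_large, y_odd]   # two binary tags per instance

    knn_clf = KNeighborsClassifier()       # KNN natively supports multilabel
    knn_clf.fit(X, y_multilabel)
    print(knn_clf.predict(X[:1]))          # e.g. [[False False]] for a 0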
Multioutput (and multiclass) Classification
It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).
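A sketch of the book's noise-removal idea, scaled down to the built-in digits dataset (the noise level is a made-up choice): the target is the clean image itself, so there are 64 labels per instance and each label can take any pixel-intensity value.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.neighbors import KNeighborsClassifier

    X, _ = load_digits(return_X_y=True)
    X = X.astype(int)                            # pixel intensities 0-16
    rng = np.random.default_rng(42)
    X_noisy = X + rng.integers(0, 5, X.shape)    # add hypothetical noise

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_noisy, X)                      # labels = clean pixel values
    clean_digit = knn_clf.predict(X_noisy[:1])   # shape (1, 64): one label per pixel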
One-versus-the-rest/all (OvR) strategy
One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score.
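A minimal sketch of forcing OvR instead (same placeholder data as the OvO sketch above):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_digits(return_X_y=True)
    ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
    ovr_clf.fit(X, y)
    print(len(ovr_clf.estimators_))    # one binary classifier per class: 10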
Choosing between ROC and precision/recall curves
Since the ROC curve is so similar to the precision/recall (PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives.
NOTE
The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.
NOTE
The line between classification and regression is sometimes blurry, such as in this example. Arguably, predicting pixel intensity is more akin to regression than to classification. Moreover, multi-output systems are not limited to classification tasks; you could even have a system that outputs multiple labels per instance, including both class labels and value labels.
Recall
Used along with precision as a performance metric. Aka: sensitivity, true positive rate. Eq.: recall = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.
WARNING
When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won't be so lucky.
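Quick check of the attribute (assuming a freshly fitted classifier on the digits data):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    X, y = load_digits(return_X_y=True)
    clf = SGDClassifier(random_state=42).fit(X, y)
    print(clf.classes_)    # [0 1 ... 9]; here index happens to match class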
Cross-Validation [Classification]
a way to evaluate a model (classifier or regressor): split the training set into K folds, then train and evaluate K times, each time holding out a different fold. For classifiers the default score is accuracy, which is misleading on skewed datasets: in binary classification with a rare positive class, a model that always predicts the negative class already scores high accuracy.
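Sketch of accuracy-scored cross-validation on a skewed binary target (about 90% of the digits are "not 5", so accuracy looks flattering no matter what):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)                  # skewed binary target
    scores = cross_val_score(SGDClassifier(random_state=42), X, y_is_5,
                             cv=3, scoring="accuracy")
    print(scores)                      # high even for a mediocre classifier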
Stochastic Gradient Descent (SGD)
has the advantage of handling very large datasets efficiently, in part because it deals with training instances independently, one at a time; available as SGDClassifier in Scikit-Learn.
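Minimal usage sketch on the binary is-it-a-5 task:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X, y_is_5)
    print(sgd_clf.predict(X[:3]))      # boolean predictions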
Receiver operating characteristic (ROC)
another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to 1 - the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity.
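Sketch of computing the curve from cross-validated decision scores (same is-it-a-5 setup as above):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import cross_val_predict

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)
    scores = cross_val_predict(SGDClassifier(random_state=42), X, y_is_5,
                               cv=3, method="decision_function")
    fpr, tpr, thresholds = roc_curve(y_is_5, scores)   # points of the ROC curve
    print(roc_auc_score(y_is_5, scores))               # area under the curve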
Binary classifier
a classifier that distinguishes between just two classes (e.g., 5 and not-5).
Precision/recall trade-off
concerns the decision function and its threshold. Instances are ranked by their classifier score, and those above the chosen decision threshold are considered positive; the higher the threshold, the lower the recall, but (in general) the higher the precision. With SGDClassifier you can get each instance's score from decision_function() and apply your own threshold instead of calling predict().
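Sketch of picking a threshold by hand (the value 2000 is purely hypothetical; a real one would come from inspecting precision and recall at each candidate threshold):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)
    sgd_clf = SGDClassifier(random_state=42).fit(X, y_is_5)

    scores = sgd_clf.decision_function(X[:10])   # raw scores, not class labels
    threshold = 2000                             # hypothetical; default is 0
    y_pred = (scores > threshold)                # fewer positives: precision up, recall down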
Data structure of datasets in Scikit-Learn
a dictionary-like structure that includes: a DESCR key with a description of the dataset, a data key containing an array with one row per instance and one column per feature, and a target key containing an array with the labels.
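Quick look at the structure with a built-in dataset:

    from sklearn.datasets import load_digits

    digits = load_digits()
    print(digits.keys())          # includes 'DESCR', 'data', 'target'
    print(digits.data.shape)      # (1797, 64): rows = instances, columns = features
    print(digits.target.shape)    # (1797,): one label per instance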
Precision
the accuracy of the positive predictions. Eq.: precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
F_1 score
the harmonic mean of precision and recall. The harmonic mean gives much more weight to low values, so a classifier only gets a high F_1 score if both recall and precision are high. Eq.: F_1 = 2 / [(1 / precision) + (1 / recall)] = 2 × (precision × recall) / (precision + recall) = TP / [TP + (FN + FP) / 2].
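All three metrics in one sketch, computed from cross-validated predictions on the skewed is-it-a-5 task:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import f1_score, precision_score, recall_score
    from sklearn.model_selection import cross_val_predict

    X, y = load_digits(return_X_y=True)
    y_is_5 = (y == 5)
    y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y_is_5, cv=3)
    print(precision_score(y_is_5, y_pred))   # TP / (TP + FP)
    print(recall_score(y_is_5, y_pred))      # TP / (TP + FN)
    print(f1_score(y_is_5, y_pred))          # harmonic mean of the two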
Confusion Matrix
the preferred way to evaluate classifiers; the general idea is to count the number of times instances of class A are classified as class B. Each row in a confusion matrix represents an actual class, while each column represents a predicted class. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the row for class 5 and the column for class 3. Typically computed from out-of-sample predictions obtained with the cross_val_predict() function.
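Sketch on the full 10-class digits task, using cross-validated predictions so the matrix is not flattered by training-set memorization:

    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    X, y = load_digits(return_X_y=True)
    y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)
    cm = confusion_matrix(y, y_pred)
    print(cm[5, 3])    # how often actual 5s were predicted as 3s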