AI - TIG117 Lecture 4 (Classification)

Good question in a decision tree

A good question is one that rapidly brings you closer to classifying a new object.

Distance measures

Distance measures are used in machine learning and other fields to quantify the similarity or dissimilarity between two objects or data points. Several distance measures are available, and the choice of measure depends on the type of data and the problem domain.
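As an illustration, here is a minimal sketch of three commonly used distance measures. The card does not name specific measures, so the choice of Euclidean, Manhattan, and Hamming distance (and the function names) is an assumption made for the example.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two numeric vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ;
    # suited to categorical or string data.
    return sum(x != y for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))        # 5.0
print(manhattan(p, q))        # 7.0
print(hamming("red", "rod"))  # 1
```

The two numeric measures give different answers for the same pair of points, which is why the choice of measure matters.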

Gini

The Gini impurity is the probability that a randomly chosen element from a set would be incorrectly classified if it were labeled at random according to the distribution of labels in the set. A set containing only one class has a Gini impurity of 0. When choosing a split in a decision tree, we want the Gini impurity of the resulting groups to be as low as possible.
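The definition above corresponds to the standard formula G = 1 - sum of p_i^2, where p_i is the fraction of elements with label i. A minimal sketch follows; the function name gini_impurity is chosen for this example only.

```python
from collections import Counter

def gini_impurity(labels):
    # G = 1 - sum(p_i ** 2) over the label proportions p_i.
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty set as pure
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # 0.0 (pure set)
print(gini_impurity(["a", "a", "b", "b"]))  # 0.5 (evenly mixed, two classes)
```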

Noise

"Noise" refers to random or unwanted variations or errors in data that can affect the accuracy of a model or algorithm. Noise can arise due to a variety of factors, such as measurement errors, data collection artifacts, or other sources of variability in the data. When data contains noise, it can be difficult to identify the underlying patterns and relationships, and to create a model that generalizes well to new data.

A bad decision tree

A "bad" tree is one that performs poorly in terms of accuracy. It may suffer from high error rates or poor generalization to new data. This can occur if the tree is too simple and cannot capture the underlying patterns in the data (underfitting), or if it is too complex and captures noise in the training data (overfitting).

Confusion matrix

A confusion matrix is a table that summarizes the performance of a classifier by comparing the predicted class labels to the true class labels of a set of test data. It is a useful tool for evaluating the accuracy and quality of a classifier, particularly in problems where the classes are imbalanced or where the cost of misclassification differs between classes. For a binary classifier, each prediction falls into one of four outcomes. True positive (TP): the classifier predicts the positive class and the true class is positive. False positive (FP): the classifier predicts the positive class but the true class is negative. True negative (TN): the classifier predicts the negative class and the true class is negative. False negative (FN): the classifier predicts the negative class but the true class is positive.
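The four outcomes can be counted directly from paired true and predicted labels. This is a minimal sketch for the binary case; the function name, the label encoding (1 = positive, 0 = negative), and the sample data are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count TP, FP, TN, FN for a binary classification problem.
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}

accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
print(accuracy)  # 4 correct out of 6
```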

What is a good classifier?

A good classifier is one that accurately predicts the class labels of unseen data. In machine learning, a classifier is a model that takes input data and assigns it to one of several possible classes based on a set of predefined rules or patterns learned from the data. A good classifier should be able to correctly classify new data that it has not seen before with a high degree of accuracy.

Boolean split

Boolean split is a type of split that can be used in decision tree algorithms. In a boolean split, the data is split into two groups based on the value of a binary (true/false) feature. For example, if we are building a decision tree to predict whether a customer will buy a product, one possible boolean split could be based on whether the customer has purchased a similar product before (true or false).

Decision trees

Decision trees are a type of supervised learning algorithm in machine learning that can be used for both classification and regression tasks. The algorithm builds a tree-like model of decisions and their possible consequences: each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf represents the final decision or prediction.
Advantages: Decision trees are easy to understand and interpret, and their results can be visualized and explained. They can handle both categorical and numerical data, and some implementations can also handle missing values. They can capture non-linear and complex relationships between the features and the target variable.
Disadvantages: Decision trees are prone to overfitting, especially when the tree is too deep or the data is noisy. The choice of splitting criterion and other hyperparameters can greatly affect performance. The algorithm can be biased towards features with more categories or values.
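To make the node, branch, and leaf structure concrete, here is a small hand-built tree and a function that walks it. The dictionary layout, the feature names (sqft, rooms), the thresholds, and the class labels are all hypothetical; a real algorithm would learn the tests from training data.

```python
def predict(node, sample):
    # Internal nodes are dicts holding a test; anything else is a leaf.
    while isinstance(node, dict):
        passed = sample[node["feature"]] >= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node

# Hypothetical tree: the root tests square footage, one child tests rooms.
tree = {
    "feature": "sqft",
    "threshold": 2000,
    "yes": "expensive",
    "no": {"feature": "rooms", "threshold": 3, "yes": "mid", "no": "cheap"},
}

print(predict(tree, {"sqft": 2400, "rooms": 2}))  # expensive
print(predict(tree, {"sqft": 1500, "rooms": 4}))  # mid
print(predict(tree, {"sqft": 900, "rooms": 1}))   # cheap
```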

Binary split

In a binary split, the data is divided into two groups based on a single threshold value of a numerical feature. For example, we can split houses into those with square footage greater than or equal to 2000 and those with square footage less than 2000.

Multiway split

In a multiway split, the data is divided into more than two groups based on multiple threshold values of a numerical feature. For example, we can split houses into three groups based on their square footage: less than 1500, between 1500 and 2500, and greater than 2500.
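The binary and multiway splits described above can be sketched as follows. The function names and the sample square footages are made up for illustration. The card leaves the boundary cases of the multiway example open, so this sketch assumes a value equal to a threshold goes to the upper group, matching the "greater than or equal to" wording of the binary example.

```python
import bisect

def binary_split(values, threshold):
    # Two groups: values below the threshold, and values at or above it.
    below = [v for v in values if v < threshold]
    at_or_above = [v for v in values if v >= threshold]
    return below, at_or_above

def multiway_split(values, thresholds):
    # len(thresholds) + 1 groups; a value equal to a threshold goes to
    # the upper group (an assumption, see the note above).
    thresholds = sorted(thresholds)
    groups = [[] for _ in range(len(thresholds) + 1)]
    for v in values:
        groups[bisect.bisect_right(thresholds, v)].append(v)
    return groups

sqft = [1200, 1500, 1800, 2000, 2500, 3100]  # hypothetical houses
print(binary_split(sqft, 2000))
# ([1200, 1500, 1800], [2000, 2500, 3100])
print(multiway_split(sqft, [1500, 2500]))
# [[1200], [1500, 1800, 2000], [2500, 3100]]
```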

Numerical split

Numerical split is a type of split that can be used in decision tree algorithms. In a numerical split, the data is split into groups based on the value of a numerical feature, either two groups (binary split) or more than two (multiway split). For example, if we are building a decision tree to predict the price of a house, one possible numerical split could be based on the square footage of the house.

Overfitting

Overfitting is a common problem in machine learning, which occurs when a model is too complex and captures the noise in the training data, rather than the underlying patterns that generalize well to new, unseen data. When a model overfits the training data, it fits the data too closely, and as a result, it performs poorly on new data. This is because the model has essentially memorized the training data, and it is not able to generalize to new data that it has not seen before.

The KNNClassifier

The k-nearest neighbors (KNN) classifier is a type of instance-based learning algorithm in machine learning. It is a simple but effective algorithm for classification and regression problems, where the goal is to predict the class label or value of a new data point from the k most similar data points in the training set.
Advantages: KNN is simple and intuitive and has no explicit training phase, since it only stores the training data. It can handle multi-class classification and non-linear decision boundaries. It can be adapted to cope with missing and noisy data, for example by choosing a larger k to smooth out noise.
Disadvantages: KNN can be computationally expensive at prediction time, especially for large datasets, because each new point must be compared with the stored training points. The choice of distance measure and the value of k can greatly affect performance. The algorithm does not provide any insight into the underlying structure of the data.
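A minimal sketch of KNN classification, assuming Euclidean distance and a simple majority vote among the k nearest training points. The function name and the toy data are made up for illustration, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Sort the training points by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Majority vote among the labels of the k closest points.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: two well-separated clusters.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (6, 5)))  # b
```

Every prediction sorts the whole training set, which illustrates why KNN becomes expensive on large datasets.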

Address overfitting

Various techniques can be used to address overfitting. One common technique is regularization, which adds a penalty term to the model's objective function to discourage overly complex models. Another is early stopping, which halts the training process when performance on held-out data starts to get worse. Cross-validation, which divides the data into multiple subsets and uses them in turn to train and evaluate the model, helps detect overfitting and choose settings that generalize well.
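As a sketch of the cross-validation idea, the following function divides sample indices into k folds, with each fold serving once as the evaluation set while the rest are used for training. The function name is chosen for this example, and the indices are not shuffled here for simplicity; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    # Yield k (train, test) index splits; fold sizes differ by at most one.
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

A model would be trained on each train list and scored on the matching test list; a large gap between training and test scores suggests overfitting.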

Applications of classifiers

Will this treatment help that person? Will this person pay back that loan? Will this person like that book? Is this email spam or not? Is this review positive or negative or neutral? What musical genre does this song belong to? What breed of dog does this picture show?

