B455 ML Midterm
Supervised Learning
"learn" a model from input objects with desired output - Regression - Classification
Reinforcement Learning
"enhance" a model for an environment to maximize some cumulative reward - Game - Behavior Selection - Planning
Gaussian Mixture Model (GMM)
A Gaussian mixture model (GMM) is a probabilistic model that represents the probability distribution of a mixture data as a weighted sum of Gaussian distributions, where each Gaussian represents one of the subpopulations in the mixture. In other words, a GMM assumes that the data is generated from a mixture of Gaussian distributions, with each Gaussian having its own mean and covariance matrix.
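As a worked equation, the weighted sum described above is commonly written as follows. The symbols K (number of components), \pi_k (mixing weights), and \mu_k, \Sigma_k (mean and covariance of component k) are notation assumed here, not taken from the notes.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```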
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model by comparing the actual and predicted class labels. A confusion matrix for a two-class classification problem has four entries: - True Positive (TP): The number of samples that are correctly classified as positive. - False Positive (FP): The number of samples that are incorrectly classified as positive. - True Negative (TN): The number of samples that are correctly classified as negative. - False Negative (FN): The number of samples that are incorrectly classified as negative.
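A minimal sketch of how the four entries could be counted from lists of actual and predicted labels. The function name `confusion_counts` and the sample labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a two-class problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Illustrative labels only (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 3 1 3 1
```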
Activation Function
A mathematical function that is applied to the weighted sum of the inputs in a neural network. The activation function introduces non-linearity into the model and helps the model to learn complex relationships between the input and output variables.
Universal Approximation Theorem
A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy, for a broad class of non-linear activation functions. The theorem does not address how an appropriate neural network model can be learned, nor how many neurons are needed. In simpler terms, a network with just one hidden layer can model any such function to any desired precision, as long as it has enough neurons in that hidden layer. This theorem is important because it suggests that neural networks are very flexible tools that can approximate a wide range of functions in a variety of domains.
Accuracy metrics for two-class classification
Accuracy: The overall accuracy of the model is the proportion of correctly classified samples to the total number of samples. The formula for accuracy is: Accuracy = (TP + TN) / (TP + TN + FP + FN) Precision: Precision is the proportion of true positives to the total number of predicted positives. The formula for precision is: Precision = TP / (TP + FP) Recall: Recall is the proportion of true positives to the total number of actual positives. The formula for recall is: Recall = TP / (TP + FN) F1 Score: F1 Score is the harmonic mean of precision and recall, which provides a balanced measure between precision and recall. The formula for F1 score is: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
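The four formulas above can be sketched directly in code. The function name `binary_metrics`, the zero-division fallbacks, and the example counts are assumptions for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero
    (a convention chosen here, not a universal rule).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for a 100-sample test set.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

With these counts accuracy is 85/100 and precision is 40/50, while recall (40/45) is higher than precision because there are fewer false negatives than false positives.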
Computational complexity
The amount of resources (such as time and memory) used by an algorithm. The computational complexity of machine learning algorithms is particularly important when learning on very large datasets.
Bayes Rule
Bayes' rule, also known as Bayes' theorem or Bayes' law, is a fundamental concept in probability theory and statistics. It describes how to update the probability of an event based on new evidence or information. P(A|B) = P(B|A) * P(A) / P(B) where: P(A|B) is the conditional probability of event A given event B has occurred. P(B|A) is the conditional probability of event B given event A has occurred. P(A) is the prior probability of event A. P(B) is the probability of event B.
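A small worked example of the formula, assuming P(B) is expanded with the law of total probability. The function name and the numbers (a 1% prior, a 95% true-positive rate, and a 5% false-positive rate) are invented for illustration.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is computed as P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# A = "has the condition", B = "test is positive" (illustrative numbers).
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior P(A) is small.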
Different Activation Functions
For Regression: 1. Linear: The linear (identity) activation function is often used in the output layer of neural networks for regression problems, where the goal is to predict a continuous output value. It returns the input value as it is, without any transformation. 2. ReLU (Rectified Linear Unit): ReLU returns the input if it is positive, and 0 otherwise. In regression networks it is mostly used in the hidden layers; as an output activation it only suits targets that are non-negative. ReLU has been shown to work well in practice due to its simplicity and computational efficiency. 3. Tanh (Hyperbolic Tangent): The tanh activation function maps any input value to a value between -1 and 1, which can be useful for regression tasks where the output range is bounded. For Multi-Class Classification: 1. Softmax: Softmax is often used in the output layer of neural networks for multi-class classification problems. It normalizes the outputs so that they are non-negative and sum to 1, so each output can be interpreted as the probability of one class. 2. ReLU (Rectified Linear Unit): ReLU is often used in the hidden layers of a classification network to introduce non-linearity. 3. Tanh (Hyperbolic Tangent): Tanh can likewise be used in the hidden layers. Because its outputs lie between -1 and 1, they cannot be read directly as class probabilities, so it is rarely used in the output layer. 4. Sigmoid: Sigmoid can be used in the output layer of neural networks for binary classification problems. It returns a value between 0 and 1, which can be interpreted as the probability of the positive class. For multi-class classification, sigmoid can be used with multiple output neurons, where each neuron represents a binary decision for a specific class.
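The activation functions named above can be sketched in a few lines using only the standard library. The numerically stable forms of `sigmoid` and `softmax` are an implementation choice made here, not something the notes prescribe.

```python
import math


def linear(z):
    """Identity: returns the input unchanged."""
    return z


def relu(z):
    """Returns z if it is positive, otherwise 0."""
    return max(0.0, z)


def tanh(z):
    """Maps any real input into (-1, 1)."""
    return math.tanh(z)


def sigmoid(z):
    """Maps any real input into (0, 1); branches to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(zs):
    """Turns a list of scores into non-negative values that sum to 1."""
    shift = max(zs)  # subtracting the max does not change the result
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), sigmoid(0.0), round(sum(probs), 6))
```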
when to stop learning
Here are some common techniques for deciding when to stop learning: Validation loss: Stop training when the validation loss stops decreasing or starts increasing (early stopping). This indicates that the model is no longer improving on unseen data and is beginning to overfit the training set. Monitoring metrics: Training can be stopped when a chosen metric, such as validation accuracy, reaches a certain threshold. This indicates that the model has achieved a desired level of performance. Cross-validation: Training can be stopped when the average performance across all held-out subsets (folds) reaches a certain level.
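One possible way to implement the validation-loss criterion is a "patience" counter: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The function name, the `patience` parameter, and the loss values below are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Scan per-epoch validation losses and decide where to stop.

    Returns (best_epoch, stopped_epoch): the epoch with the lowest
    validation loss seen, and the epoch at which training would halt.
    """
    best_loss = float("inf")
    best_epoch = -1
    stopped_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stopped_epoch


# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.40]
best_epoch, stopped_epoch = early_stopping_epoch(losses, patience=3)
print(best_epoch, stopped_epoch)  # 3 6
```

In this example training halts at epoch 6 and never sees the lower loss at epoch 7, which shows the trade-off in choosing the patience value.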
Naïve Bayes
Naïve Bayes is a machine learning algorithm based on Bayes' theorem that is used for classification and prediction tasks. The Naïve Bayes algorithm is based on the assumption that the features (or attributes) of a data point are independent of each other, given the class label. This means that the probability of a data point belonging to a particular class can be calculated by multiplying the probabilities of each feature given the class label.
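Written as an equation, the independence assumption gives the following factorization. Here C is the class label and x_1, ..., x_m are the features; the symbols are notation assumed for this sketch.

```latex
P(C \mid x_1, \dots, x_m) \;\propto\; P(C) \prod_{i=1}^{m} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{m} P(x_i \mid C)
```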
Training, Testing, and Validation Sets
In machine learning, it is common practice to split the available data into three sets: training, testing, and validation. The training set is used to train the model by adjusting the weights of the model to minimize the error. The testing set is used to evaluate the final performance of the trained model on unseen data. The validation set is used to tune the hyperparameters of the model, such as learning rate, regularization strength, and number of hidden layers. It is also used to prevent overfitting by selecting the best model based on its performance on the validation set.
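A minimal sketch of such a three-way split, assuming a random shuffle followed by slicing. The function name and the 60/20/20 proportions are illustrative choices, not values given in the notes.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation, and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```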
Probabilistic learning
In probabilistic learning, the goal is to learn a probabilistic model that can predict the probability distribution of the target variable given the input features. The model is trained on a labeled dataset, where the input features and their corresponding target values are known. The model learns the probability distribution of the target variable given the input features by maximizing the likelihood of the training data. There are several types of probabilistic models that are used in probabilistic learning, such as Bayesian networks, Markov random fields, hidden Markov models, and Gaussian mixture models, among others.
back propagation
It works by propagating the errors back through the network and adjusting the weights of the connections between the neurons to minimize the error. Here is a high-level overview of how backpropagation works: 1. Forward pass: The input data is fed into the network, and the output is computed through the layers of the network using the current weights. 2. Error computation: The error between the predicted output and the actual output is computed. 3. Backward pass: The error is propagated back through the network, layer by layer, from the output layer to the input layer. The error for each neuron in a layer is computed based on the weighted sum of errors from the neurons in the next layer. 4. Weight update: The weights of the connections between the neurons are adjusted based on the errors and the learning rate, using an optimization algorithm such as gradient descent. 5. Repeat: Steps 1-4 are repeated for multiple epochs, or until the error is minimized to a satisfactory level. The runtime complexity of backpropagation can be expressed in big O notation as O(NWL), where N is the number of training examples, W is the number of weights in the network, and L is the number of layers in the network. A tighter statement is O(NW) per epoch, because W already counts the weights of all L layers and each weight is used a constant number of times in the forward and backward passes. Backpropagation extends to multiple hidden layers by using the chain rule to compute the errors and gradients for each layer. In this case, the errors are propagated back through each layer in reverse order, and the weights are updated for each layer based on the errors and gradients computed for that layer. When the same procedure is applied to a recurrent neural network (RNN), which has feedback connections, by unrolling the network over its time steps, it is called "backpropagation through time" (BPTT).
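The steps above can be sketched for a network with one hidden layer and a single output. This is a minimal illustration under several assumptions that are not in the notes: sigmoid activations, the error E = 0.5 * (y - t)^2, per-example (sequential) updates, and made-up names, layer sizes, and learning rate.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Step 1: compute hidden activations h and the output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def backward(x, t, h, y, W2):
    """Steps 2-3: gradients of E = 0.5 * (y - t)^2 via the chain rule."""
    delta_out = (y - t) * y * (1.0 - y)           # error at the output unit
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Each hidden error is the output error weighted by its connection.
    delta_hid = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2


def train(data, hidden=3, lr=0.5, epochs=2000, seed=0):
    """Steps 4-5: gradient-descent weight updates, repeated over epochs."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)
            gW1, gb1, gW2, gb2 = backward(x, t, h, y, W2)
            W2 = [w - lr * g for w, g in zip(W2, gW2)]
            b2 -= lr * gb2
            W1 = [[w - lr * g for w, g in zip(row, grow)]
                  for row, grow in zip(W1, gW1)]
            b1 = [b - lr * g for b, g in zip(b1, gb1)]
    return W1, b1, W2, b2


def sum_squared_error(data, params):
    return sum((forward(x, *params)[1] - t) ** 2 for x, t in data)


# Logical OR as a toy training set.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
untrained = train(or_data, epochs=0)
trained = train(or_data, epochs=2000)
print(sum_squared_error(or_data, untrained), sum_squared_error(or_data, trained))
```

Comparing the error before and after training is a quick sanity check that the updates move the weights in the right direction.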
Linear Regression
Linear regression is used for regression problems, where the goal is to predict a continuous numerical output variable based on a set of input variables. During training, the linear regression algorithm learns a set of weights that are used to combine the input variables to produce a predicted output value. The weights are learned by minimizing the sum of squared errors (SSE) between the predicted values and the actual values in the training data. This is done using a technique called least squares regression.
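For a single input variable, the least squares solution has a simple closed form. The sketch below assumes a model y = slope * x + intercept; the function name and data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); this minimizes the SSE.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```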
Connection between Logistic Regression and Perceptron
Logistic regression and the perceptron algorithm are both linear models used for binary classification problems. In fact, the perceptron algorithm is a special case of logistic regression when the logistic function used in logistic regression is replaced with a step function. Both algorithms learn a set of weights that are used to combine the input variables to produce a binary output. However, the perceptron algorithm updates the weights after every misclassified sample, while logistic regression updates the weights based on the likelihood of the observed data given the model. Another difference between the two algorithms is that logistic regression outputs a probability value that represents the likelihood of belonging to a particular class, while the perceptron algorithm outputs a binary value of either 0 or 1.
Logistic Regression
Logistic regression is a supervised learning algorithm used for binary classification problems. In logistic regression, the goal is to predict a binary outcome variable (e.g., 0 or 1) based on one or more input variables. The output of logistic regression is a probability value that represents the likelihood of belonging to a particular class. The logistic function is an S-shaped curve that maps any input value to a value between 0 and 1. During training, the logistic regression algorithm learns a set of weights that maximize the likelihood of the observed data given the model. The algorithm uses maximum likelihood estimation to find the optimal values of the weights.
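A minimal sketch of maximum likelihood training for logistic regression, assuming plain batch gradient ascent on the average log-likelihood. The function names, learning rate, epoch count, and toy data are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive class for one input vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.3, epochs=3000):
    """Batch gradient ascent on the average log-likelihood."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, t in zip(X, y):
            err = t - predict_proba(x, w, b)  # d(log-likelihood)/d(activation)
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad_w)]
        b += lr * grad_b / len(X)
    return w, b


# Toy one-dimensional data: small values are class 0, large values class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba([1.0], w, b), predict_proba([3.5], w, b))
```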
Machine Learning
Making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
Maximum likelihood method
The maximum likelihood (ML) method is a statistical method used to estimate the parameters of a statistical model. It is based on the principle of finding the values of the parameters that maximize the likelihood of the observed data.
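As a concrete case, for data assumed to come from a single Gaussian, the maximum likelihood estimates are the sample mean and the sample variance with divisor n (not n - 1). The function name and data are made up for illustration.

```python
def gaussian_mle(samples):
    """Maximum likelihood estimates (mean, variance) for a 1D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # divisor n, not n - 1
    return mu, var


mu, var = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mu, var)  # 5.0 4.0
```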
Mixture data
Mixture data is a type of data that consists of multiple subpopulations or components, each with its own probability distribution. The data points in each subpopulation are generated from a different probability distribution, and the overall probability distribution of the mixture data is a weighted sum of the probability distributions of the subpopulations.
Overfitting
Overfitting is a common problem in machine learning where a model learns the training data too well, including its noise, and therefore performs poorly on unseen data. This occurs when a model becomes too complex and fits the noise in the training data rather than the underlying patterns, which reduces the generalization capabilities of the network. Overfitting can be prevented by using techniques such as regularization, early stopping, and reducing the complexity of the model.
Posterior Probability
Posterior probability refers to the probability of an event or hypothesis after taking into account the available evidence or data. In Bayesian statistics, the posterior probability is computed using Bayes' rule, which is a mathematical formula that describes how to update the probability of an event or hypothesis based on new evidence or information. The posterior probability is calculated as follows: Posterior probability = Prior probability x Likelihood / Evidence where: Prior probability is the initial probability of the event or hypothesis before any evidence is considered. Likelihood is the probability of the observed evidence given the event or hypothesis. Evidence is the probability of the observed evidence occurring regardless of the event or hypothesis. The posterior probability can be interpreted as an updated estimate of the probability of an event or hypothesis based on the available evidence or data. As more evidence is gathered, the posterior probability is updated accordingly.
Targets/Labels
Provide the correct answers that the algorithm is learning about, needed in supervised learning. The target vector t, with elements tj, where j runs from 1 to the number of output dimensions, n.
Error
Refers to the difference between the predicted output of the model and the true or actual output. The goal of the machine learning algorithm is to minimize the error, that is, the difference between the predicted and true outputs. Error E: a function that computes the inaccuracies of the ML model as a function of the outputs y and targets t. A commonly used error function: Sum Squared Error (SSE).
Selection of Training data for convergence and performance
Sequential algorithm: update the weights using the error of each input individually; Batch algorithm: update the weights using the error of all inputs at once. The batch algorithm tends to converge to a local minimum faster than the sequential algorithm, but the latter is sometimes less likely to get stuck in local minima. Minibatch method: split the training set into random batches, estimate the gradient from one batch, perform a weight update, then use the next batch to estimate a new gradient for the next update, until all of the training set has been used. Stochastic gradient descent (SGD): use just one randomly picked data point to estimate the gradient at each iteration of the algorithm.
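The minibatch method can be sketched as a generator that shuffles the training set once and yields it in chunks. The function name and batch size are illustrative; a batch size of 1 behaves like the sequential case, and a batch size equal to the dataset size gives the batch case.

```python
import random


def minibatches(data, batch_size, seed=0):
    """Yield random, non-overlapping batches that together cover the data."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


dataset = list(range(10))
batches = list(minibatches(dataset, batch_size=4))
# One gradient estimate and one weight update would be made per batch.
print([len(batch) for batch in batches])  # [4, 4, 2]
```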
Not all training data are linearly separable
TRUE; in certain cases the perceptron learning algorithm may not converge, but instead keeps cycling through different wrong solutions. In such cases, more complex classifiers or non-linear transformation techniques may be required to separate the classes effectively.
Expectation-Maximization (EM) Algorithm
The EM (Expectation-Maximization) algorithm is a computational method for finding the maximum likelihood or maximum a posteriori (MAP) estimates of the parameters of a statistical model, when the model involves latent or unobserved variables. The EM algorithm consists of two steps, the E-step and the M-step, which are iteratively repeated until convergence: In the E-step, the algorithm estimates the conditional distribution of the latent variables given the observed data and the current estimates of the model parameters. In the M-step, the algorithm updates the estimates of the model parameters based on the conditional expectations of the latent variables computed in the E-step.
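A minimal sketch of the E-step and M-step for a one-dimensional mixture of two Gaussians, where the latent variable is the component that generated each point. The initialization (means at the data extremes, unit variances, equal weights), the fixed iteration count, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, n_iter=50):
    """Fit a two-component 1D Gaussian mixture with EM."""
    mus = [min(data), max(data)]   # crude initial guesses
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against collapse
    return weights, mus, variances


# Synthetic mixture: half the points near 0, half near 5.
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(5.0, 1.0) for _ in range(200)])
weights, mus, variances = em_gmm_1d(data)
print(weights, mus, variances)
```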
Weights
The parameters of the model that are adjusted during the training process to minimize the error. The weights are initialized randomly, and the algorithm learns the optimal weights during training by adjusting them based on the error. The weights wij are the weighted connections between nodes i and j. For neural networks these weights are analogous to the synapses in the brain. They are arranged into a matrix W.
Connection between Linear Regression and Perceptron
The perceptron algorithm learns a linear decision boundary to separate the input data into two classes, while linear regression models the relationship between the input variables and the output variable as a linear function. Both compute the same weighted sum of the inputs; the perceptron then thresholds that sum to produce a binary output (0 or 1), whereas linear regression outputs it directly. In certain cases, the weights learned by linear regression on 0/1 targets can be used to define the linear decision boundary for classification, by thresholding the regression output.
Probabilistic Interpretation for two classes (vs. logistic regression)
The probabilistic interpretation of a neural network model with two output classes is similar to that of logistic regression. In both cases, the goal is to model the probability of the positive class given the input data. In logistic regression, the output is a scalar value between 0 and 1, which represents the probability of the positive class. The model uses a sigmoid activation function in the output layer, which maps the input to a probability value. The loss function used for training the model is the binary cross-entropy loss, which measures the difference between the predicted probabilities and the true labels. In neural networks, the output layer can also use a sigmoid activation function to produce a probability value between 0 and 1. However, instead of a scalar output, neural networks can have multiple output neurons, each representing the probability of a different class. The loss function used for training the model is typically the binary cross-entropy loss, or the categorical cross-entropy loss for multi-class problems.
Input
The set of data or features that are provided as an input to the model. The input can be in the form of numerical, categorical, or text data. It is a vector written as x, with elements xi, where i runs from 1 to the number of input dimensions, m.
Perceptron training algorithms
The simplest neural network, the perceptron, approximates a single neuron with N binary inputs. It consists of a single layer of nodes, each of which computes a weighted sum of the input features and applies a threshold function to produce a binary output. For each misclassified example, the weights are updated by adding the input multiplied by the learning rate and by the error (target minus output), so the weights move toward inputs that should have fired the neuron and away from those that should not. The learning rate controls the step size of the weight updates and helps prevent overshooting. Given a linearly separable dataset, the Perceptron will converge to a solution that separates the classes after a finite number of iterations. The number of weight updates is bounded by R^2/\gamma^2, where R is the largest norm of any input vector and \gamma is the distance between the separating hyperplane and the closest data point to it. The Perceptron training algorithm runs in O(Tmn) time, with T iterations, n data points, in m dimensions.
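A minimal sketch of perceptron training for a single output node, assuming targets in {0, 1}, a threshold at zero, and a separate bias term. The function names, learning rate, and the logical AND dataset are illustrative choices.

```python
def predict(x, w, b):
    """Threshold the weighted sum: output 1 if it is positive, else 0."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else 0


def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Update the weights on each misclassified example until none remain."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, y):
            err = t - predict(x, w, b)  # +1, -1, or 0 when correct
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b


# Logical AND is linearly separable, so training terminates.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
print([predict(x, w, b) for x in X])  # [0, 0, 0, 1]
```

Each epoch costs O(mn) for n points in m dimensions, which matches the O(Tmn) total for T epochs.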
Structure of MLP
The structure of an MLP consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of a set of nodes (neurons) that are connected to the nodes of the adjacent layers. The input layer of the MLP receives the input data, where the number of nodes in the input layer corresponds to the number of input features. The hidden layers perform computations on the input data and pass the output to the next layer. The number of hidden layers and the number of nodes in each hidden layer are hyperparameters that can be tuned during the training process. Each node in the hidden layers computes a weighted sum of its inputs and passes it through an activation function, such as sigmoid or ReLU, to produce an output. The output of the activation function is then passed to the next layer. The output layer is the final layer of the MLP and it produces the final output. The number of nodes in the output layer depends on the type of problem being solved. For example, for a binary classification problem, there would be one node in the output layer representing the probability of the positive class. During the training process, the weights of the connections between the neurons in the MLP are adjusted based on the error between the predicted output and the actual output. This is done by using backpropagation to compute the gradient of the error with respect to each weight, together with an optimization algorithm such as gradient descent, to minimize the error and improve the accuracy of the predictions.
Perceptron convergence theorem
The theorem states that if the training datasets are linearly separable, then the Perceptron algorithm will find a solution that separates the examples into their respective classes in a finite number of iterations.
Outputs
The variable that the model is trying to predict. It can be categorical, such as class labels in classification, or continuous, such as a numerical value in regression. The vector is y, with elements yj, where j runs from 1 to the number of output dimensions, n.
Weight Space
The weight space in machine learning is the n-dimensional space of all possible weight configurations for a given model. In other words, the weight space is the set of all possible values that the weights of a model can take. The goal of training a machine learning model is to find the optimal set of weights in the weight space that minimizes the error between the predicted and actual outputs.
Regression
also a type of supervised learning problem, but the goal is to predict a continuous output variable or a numerical value. - Predicting the price of a house - The temperature - The amount of rainfall
Unsupervised Learning
discovering patterns in input objects without labels - Clustering - Compression
Classification
supervised learning problem in which the goal is to predict the categorical output variable or class label of a given input data point. - Email (Spam or Not) - Type of Flowers
