Machine Learning

w0, bias unit

w0 refers to the weight of the bias unit, an additional input value, x0, which we set equal to 1

decision tree classifier

we can think of this model as breaking down our data by making a decision based on asking a series of questions

the decision tree algorithm

we start at the tree root and split the data on the feature that results in the largest information gain (IG). in an iterative process, we can then repeat this splitting procedure at each child node until the leaves are pure. in practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting

leaf "purity" in decision trees?

when the leaves are pure, it means that the training examples at each node all belong to the same class

Rosenblatt's perceptron rule

with his perceptron rule, Rosenblatt (1957) proposed an algorithm that would automatically learn the optimal weight coefficients that would then be multiplied with the input features in order to make the decision of whether a neuron fires (transmits a signal) or not.

multinomial logistic regression / softmax regression?

logistic regression can readily be generalized to multi class settings, which is known as multinomial logistic regression or softmax regression

Clustering (UL)

an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships... (sometimes called unsupervised classification)

the advantage of "online learning"

another advantage of SGD is that we can use it for online learning, where our model is trained on the fly as new training data arrives. using online learning, the system can immediately adapt to changes, and the training data can be discarded after updating the model if storage space is an issue

what does "supervised" mean in SL?

"supervised" refers to a set of training examples (data inputs) where the desired output signals (labels) are already known

perceptron algorithm

1. initialize the weights to 0 or small random numbers
2. for each training example, x^(i):
   a. compute the output value, y(hat)
   b. update the weights

5 main steps in training a supervised machine learning algorithm?

1. selecting features and collecting labeled training examples
2. choosing a performance metric
3. choosing a classifier and optimization algorithm
4. evaluating the performance of the model
5. tuning the algorithm

stochastic gradient descent (SGD)

A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a data set to calculate an estimate of the gradient at each step.

the bias-variance tradeoff

As complexity increases, bias decreases but variance increases

Mini-Batch Gradient Descent

Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b

McCulloch-Pitts (MCP) neuron

McCulloch and Pitts (1943) described a neuron as a simple logic gate with binary outputs; multiple signals arrive at the dendrites, they are then integrated into the cell body, and, if the accumulated signal exceeds a certain threshold, an output signal is generated that will be passed on by the axon

The Support Vector Machine (SVM)

SL classification algorithm that uses a boundary to separate the data into two or more categories ("classes").

feature (x)

a column in a data table or data (design) matrix. synonymous with predictor, variable, input, attribute, or covariate

Dimensionality Reduction (UL)

a commonly used approach in feature preprocessing to remove noise from data (which can degrade the predictive performance of certain algorithms) and to compress the data into a smaller dimensional subspace while retaining most of the relevant information

standardization

a feature scaling method which gives our data the properties of a standard normal distribution: zero-mean and unit variance
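
for illustration only (X_train and X_test are assumed, placeholder feature arrays), a minimal standardization sketch using scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
sc.fit(X_train)                     # estimate mean and standard deviation on the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)   # reuse the same parameters for the test data
```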

training example

a row in a table representing the dataset; synonymous with an observation, record, instance, or sample (in most contexts, sample refers to a collection of training examples)

regression analysis (SL)

a subcategory of supervised learning where we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between those variables that allows us to predict an outcome.

the OvA method for multi-class classification

a technique that allows us to extend any binary classifier to multi-class problems. using OvA, we can train one classifier per class, where the particular class is treated as the positive class and the examples from all other classes are treated as negative classes
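
a hedged sketch of OvA using scikit-learn's OneVsRestClassifier (the data arrays are assumed placeholders, and any binary base classifier could be substituted for LogisticRegression):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# fits one binary classifier per class; prediction picks the most confident of them
ova = OneVsRestClassifier(LogisticRegression())
ova.fit(X_train, y_train)
y_pred = ova.predict(X_test)
```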

ADAptive LInear NEuron (Adaline)

a type of single-layer neural network (NN) published by Widrow and Hoff (1960). can be considered an improvement on Rosenblatt's perceptron algorithm.

stochastic gradient descent vs batch gradient descent

although SGD is considered an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. since each gradient is calculated based on a single training example, the error surface is noisier than in gradient descent, which can also have the advantage that SGD can escape shallow local minima more readily

encoding class labels as integers?

although string format works for class labels, using integer labels is a recommended approach to avoid technical glitches and improve computational performance due to a smaller memory footprint; furthermore, encoding class labels as integers is a convention among most machine learning libraries
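
as a small example (the label strings are made up), scikit-learn's LabelEncoder handles this mapping and its inverse:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(['setosa', 'versicolor', 'setosa', 'virginica'])
le = LabelEncoder()
y_int = le.fit_transform(y)    # e.g. array([0, 1, 0, 2])
le.inverse_transform(y_int)    # maps the integers back to the original strings
```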

obtaining satisfying results via SGD?

it is important to present training data in a random order; also, we want to shuffle the training dataset for every epoch to prevent cycles
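
a minimal shuffling sketch, assuming X and y are NumPy arrays and the epoch count is an illustrative choice:

```python
import numpy as np

rng = np.random.RandomState(1)
n_epochs = 10                        # illustrative value

for _ in range(n_epochs):
    perm = rng.permutation(len(y))   # new random order every epoch to prevent cycles
    X_shuf, y_shuf = X[perm], y[perm]
    # ...perform the SGD weight updates on X_shuf, y_shuf one example at a time...
```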

bias in the ML context

bias measures how far off the predictions are from the correct values in general if we rebuild the model multiple times on different training datasets; bias is the measure of the systematic error that is not due to randomness

for-loop calculations vs Numpy (vectorization of arithmetic operations)

by formulating our arithmetic operations as a sequence of instructions on an array, rather than performing a set of operations for each element at a time, we can make better use of our modern central processing unit (CPU) architectures with single instruction, multiple data (SIMD) support
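
to make this concrete, a small comparison (array sizes are arbitrary) of an element-by-element loop versus a single vectorized NumPy call:

```python
import numpy as np

x = np.random.rand(1_000_000)
w = np.random.rand(1_000_000)

# element-by-element Python loop
total = 0.0
for xi, wi in zip(x, w):
    total += xi * wi

# vectorized equivalent: one call that NumPy applies across the whole array
total_vec = np.dot(x, w)
```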

Support Vector Machine (SVM) algorithm and optimization objective

can be considered an extension of the perceptron because we use the perceptron algorithm to minimize misclassification errors, but in SVMs our optimization objective is to maximize the margin

advanced optimization algorithms implemented by scikit-learn?

can be specified via the solver parameter... namely, 'newton-cg', 'lbfgs', 'liblinear', 'sag', and 'saga'

classification (SL)

classification is a subcategory of supervised learning where the goal is to predict the categorical class labels of new instances, based on past observations, such as in the example of email spam filtering.

scikit-learn and the 'liblinear' optimization algorithm?

currently (v 0.21), 'liblinear' is used as the default solver; it cannot handle the multinomial loss and is limited to the OvR scheme for multi-class classification

Reinforcement Learning characteristics?

decision process, reward system, learn series of actions

preprocessing - getting data into shape

feature extraction and scaling, feature selection, dimensionality reduction, sampling... raw data rarely comes in the form and shape that is necessary for the optimal performance of a learning algorithm

determining the learning rate value?

finding an appropriate learning rate requires some experimentation. if the learning rate is too large, the algorithm will overshoot the global cost minimum. if the learning rate is too small, the algorithm will require more epochs until convergence, which can make the learning slow especially for large datasets

the log-likelihood function

firstly, applying the log function reduces the potential for numerical underflow, which can occur if the likelihoods are very small. secondly, we can convert the product of factors into a summation of factors, which makes it easier to obtain the derivative of this function via the addition trick

optimization and the logistic regression loss

for minimizing convex loss functions, such as the logistic regression loss, it is recommended to use more advanced approaches than regular stochastic gradient descent (SGD). since the logistic regression loss is convex, most optimization algorithms should converge to the global loss minimum with ease; however, there are certain advantages to using one algorithm over another

regularization and feature normalization?

for regularization to work properly, we need to ensure that all our features are on comparable scales

overfitting and variance

if a model suffers from overfitting, we also say that the model has a high variance, which can be caused by having too many parameters, leading to a model that is too complex given the underlying data

if the classes aren't linearly separable in the perceptron?

if the two classes can't be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (epochs) and/or a threshold for the number of tolerated misclassifications -- the perceptron would never stop updating the weights otherwise

dimensionality reduction for preprocessing?

in certain cases, dimensionality reduction can also improve the predictive performance of a model if the dataset contains a large number of irrelevant features (or noise); that is, if the dataset has a low signal-to-noise ratio

logistic regression vs. SVMs

in classical classification tasks, they often yield similar results. logistic regression tries to maximize the conditional likelihood of the training data, which makes it more prone to outliers than SVMs, which mostly care about the points that are closest to the decision boundary (support vectors). on the other hand, logistic regression has the advantage that it is a simpler model and can be implemented more easily. furthermore, logistic regression models can be easily updated, which is attractive when working with streaming data

Cross-validation

in cross-validation, we further divide the dataset into training and validation subsets in order to estimate the generalization performance of the model

gradient descent algorithm

in each iteration, we take a step in the opposite direction of the gradient, where the step size is determined by the value of the learning rate, as well as the slope of the gradient
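
a minimal sketch of one such step, assuming an Adaline-style sum-of-squared-errors cost and a linear activation (all names here are illustrative):

```python
import numpy as np

def gradient_descent_step(X, y, w, b, eta=0.01):
    """One batch gradient descent step for an SSE cost with a linear activation."""
    output = X.dot(w) + b            # continuous outputs for all training examples
    errors = y - output
    # move opposite to the gradient; the step size is scaled by the learning rate eta
    w = w + eta * X.T.dot(errors)
    b = b + eta * errors.sum()
    return w, b
```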

scikit-learn future versions and optimization algorithms?

in future versions (v 0.22) the default solver will be changed to 'lbfgs', which stands for the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. it is more flexible in regard to handling multinomial loss and multi-class classification

Reinforcement Learning goal?

in reinforcement learning, the goal is to develop a system (agent) that improves its performance, measured by a reward function, based on interactions with the environment, in order to learn a series of actions that maximizes this reward via an exploratory trial and error approach or deliberative planning

Adaline's objective function?

in the case of Adaline, we can define the cost function, J, to learn the weights as the sum of squared errors (SSE) between the calculated outcome and the true class label

ML's relation to AI?

in the second half of the 20th century, machine learning evolved as a subfield of artificial intelligence involving self-learning algorithms that derive knowledge from data in order to make predictions

biological neurons

interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals

Supervised Learning characteristics?

labeled data, direct feedback, predict outcome/future

logistic regression and conditional probabilities?

logistic regression is a classification model that is very easy to implement and performs very well on linearly separable classes; it models the conditional probability that an example belongs to a particular class given its features. it is one of the most widely used algorithms for classification in industry.

use cases of logistic regression

logistic regression is used in weather forecasting, not only to predict whether it will rain on a particular day but also to report the chance of rain. similarly, logistic regression can be used to predict the chance that a patient has a particular disease given certain symptoms

scaling (preprocessing)

many machine learning algorithms also require that the selected features are on the same scale for optimal performance, which is often achieved by transforming the features to the range [0, 1] or to a standard normal distribution with zero mean and unit variance

training

model fitting; for parametric models, similar to parameter estimation

Unsupervised Learning characteristics?

no labels, no feedback, find hidden structure in data

no free lunch theorem

no single classifier works best across all possible scenarios

loss function

often used synonymously with a cost function. sometimes the loss function is also called an error function. in some literature, the term "loss" refers to the loss measured for a single data point, and the cost is a measurement that computes the loss (average or summed) over the entire dataset

objective function

one of the key ingredients of supervised machine learning algorithms is a defined objective function that is optimized during the learning process. this objective function is often a cost function that we want to minimize

Gaussian kernel

one of the most widely used kernels is the radial basis function (RBF) kernel, which can simply be called the Gaussian kernel. the gamma parameter can be understood as a cutoff parameter for the Gaussian sphere. the gamma parameter also plays an important role in controlling overfitting or variance when the algorithm is too sensitive to fluctuations in the training dataset
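
a hedged scikit-learn sketch (the gamma and C values and the standardized training arrays are just placeholders):

```python
from sklearn.svm import SVC

# larger gamma -> tighter Gaussian "sphere" around each example -> more complex boundary
svm = SVC(kernel='rbf', gamma=0.2, C=1.0)
svm.fit(X_train_std, y_train)
```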

the kernel trick

one problem with the mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data--this is where the kernel trick comes into play. to save the expensive step, we define a kernel function, which can be interpreted as a similarity function between a pair of examples

regularization for handling the bias-variance tradeoff

one way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization. regularization is a very useful method for handling collinearity (high correlation among features), filtering out noise from data, and eventually preventing overfitting

underfitting and bias

our model can suffer from underfitting (high bias), which means that our model is not complex enough to capture the pattern in the training data well and therefore also suffers from low performance on unseen data

overfitting

overfitting means that the model captures the patterns in the training data well but fails to generalize well to unseen data

Reinforcement Learning

reinforcement learning is concerned with solving interactive problems-- learning to choose a series of actions that maximizes the total reward, which could be earned either immediately after taking an action or via delayed feedback

sigmoidal curve

s-shaped... the sigmoid function takes real-number values as input and transforms them into values in the range [0, 1], with an intercept at phi(0) = 0.5
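
for reference, a one-line NumPy version of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    """logistic sigmoid: maps any real-valued net input into (0, 1); sigmoid(0) = 0.5"""
    return 1.0 / (1.0 + np.exp(-z))
```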

caution: scikit-learn and predicting the class of a single example

scikit-learn expects a two-dimensional array as data input; thus, we have to convert a single row slice into such a format first. one way to convert a single row entry into a two-dimensional data array is to use NumPy's reshape method to add a new dimension
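
a minimal sketch, assuming clf is some already-fitted scikit-learn classifier and X_test is a 2-D feature array:

```python
# X_test[0] is a 1-D slice of shape (n_features,); scikit-learn expects (1, n_features)
single_example = X_test[0].reshape(1, -1)
clf.predict(single_example)
```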

stratification (train/test/split context)

stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset
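
for example (the split ratio and random seed are arbitrary choices):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)  # same class proportions in both subsets
```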

target (y)

synonymous with outcome, output, response variable, dependent variable, (class) label, and ground truth

an object-oriented perceptron API

take an object-oriented approach to defining the perceptron interface as a (Python) class, which will allow us to initialize new Perceptron objects (with a given learning rate and number of epochs) that can learn from data via a fit method, and make predictions via a separate predict method.
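
a compact sketch of such an interface (hyperparameter defaults and attribute names are illustrative, and class labels are assumed to be -1/+1):

```python
import numpy as np

class Perceptron:
    """minimal perceptron classifier sketch with a fit/predict interface"""

    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta                    # learning rate
        self.n_iter = n_iter              # number of passes (epochs) over the training set
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = 0.0
        for _ in range(self.n_iter):
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))  # zero if prediction is correct
                self.w_ += update * xi
                self.b_ += update
        return self

    def predict(self, X):
        # unit step function on the net input
        return np.where(np.dot(X, self.w_) + self.b_ >= 0.0, 1, -1)
```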

why is regularization sometimes called "weight decay"?

the "regularization parameter," expressed mathematically with the lambda symbol, is added to the cost function, and will shrink the weights during model training

Adaline algorithm

the Adaline algorithm compares the true class labels with the linear activation function's continuous valued output to compute the model error and update the weights

advantages of mini-batch gradient descent

the advantage over batch gradient descent is that convergence is reached faster via mini-batches because of more frequent weight updates. furthermore, mini-batch learning allows us to replace the for loop over the training examples in SGD with vectorized operations leveraging concepts from linear algebra (for example, implementing a weighted sum via a dot product), which can further improve the computational efficiency of our learning algorithm

Machine Learning

the application and science of algorithms that make sense of data

Kernel methods

the basic idea behind kernel methods to deal with linearly inseparable data is to create nonlinear combinations of the original features to project them onto a higher-dimensional space via a mapping function, phi, where the data becomes linearly separable

conditions for convergence of the perceptron?

the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate (eta) is sufficiently small

information gain (IG) and maximization

the information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities -- the lower the impurities of the child nodes, the larger the information gain
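
a small NumPy sketch of this quantity for a binary split, using entropy as the impurity measure (Gini impurity would work the same way):

```python
import numpy as np

def entropy(labels):
    """impurity of a node: -sum(p * log2(p)) over the class proportions p"""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y_parent, y_left, y_right):
    n = len(y_parent)
    # impurities of the children, weighted by their share of the parent's examples
    children = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - children
```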

Adaline rule (the Widrow-Hoff rule)

the key difference between the Adaline rule and Rosenblatt's perceptron is that the weights are updated based on a linear activation function rather than a unit step function

perceptron and Adaline hyperparameters

the learning rate, eta, as well as the number of epochs, are the tuning parameters

logit function

the logit function takes input values in the range 0 to 1 and transforms them to values over the entire real-number range, which we can use to express a linear relationship between feature values and the log-odds. its inverse is called the logistic sigmoid function, or simply the sigmoid function
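
in code form (a one-liner, valid for p strictly between 0 and 1):

```python
import numpy as np

def logit(p):
    """log-odds: maps a probability p in (0, 1) onto the entire real-number range"""
    return np.log(p / (1.0 - p))
```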

continuous linear activation function vs. the unit step function

the main advantage of this continuous linear activation function, in contrast to the unit step function, is that the cost function becomes differentiable. another nice property of this cost function is that it is convex; thus, we can use the optimization algorithm called gradient descent to find the weights that minimize our cost function

Gradient Descent

the main idea behind gradient descent can be described as climbing down a hill until a local or global cost minimum is reached

Support Vector Machine (SVM) and the "margin"?

the margin is defined as the distance between the separating hyperplane (decision boundary) and the training examples that are closest to this hyperplane, which are so-called support vectors

when to use the multi_class='multinomial' setting in scikit-learn for training a logistic regression model?

the multinomial setting is usually recommended in practice for mutually exclusive classes. here, "mutually exclusive" means that each training example can only belong to a single class (in contrast to multilabel classification, where a training example can be a member of multiple classes)
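
a hedged sketch for the scikit-learn versions discussed here (v 0.21/0.22); the C value and the standardized data arrays are placeholders:

```python
from sklearn.linear_model import LogisticRegression

# 'lbfgs' supports the multinomial (softmax) loss; 'liblinear' does not
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=100.0)
lr.fit(X_train_std, y_train)
lr.predict_proba(X_test_std[:3, :])   # per-class probabilities for the first three test examples
```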

the odds of an event, mathematically expressed?

the odds can be written as p/(1-p) where p stands for the probability of the positive event, which refers to the event that we want to predict

the perceptron system?

the perceptron receives the inputs of an example, x, and combines them with the weights, w, to compute the net input. the net input is then passed on to the threshold function, which generates a binary output of -1 or +1 -- the predicted class label of the example. during the learning phase, this output is used to calculate the error of the prediction and update the weights

SVM maximum margin intuition

the rationale behind having decision boundaries with large margins is that they tend to have a lower generalization error, whereas models with small margins are more prone to overfitting

dealing with a nonlinearly separable case using slack variables?

the slack variable, "xi", introduced by Vapnik in 1995, led to so-called "soft-margin classification." the motivation was that the linear constraints need to be relaxed for nonlinearly separable data in order to allow convergence of the optimization in the presence of misclassifications, under appropriate cost penalization

regression toward the mean

the tendency for extreme or unusual scores to fall back (regress) toward their average.

what does "batch" gradient descent refer to?

the weight update is calculated based on all examples in the training dataset, instead of updating the weights incrementally after each training example

class labels in SL classification

these class labels are discrete, unordered values that can be understood as the group memberships of the instances

improving gradient descent through standardization?

this normalization procedure helps gradient descent learning to converge more quickly; however, it does not make the original dataset normally distributed

"pruning" the decision trees?

to avoid a very deep tree that leads to overfitting, we want to prune the tree by setting a limit for the maximal depth of the tree
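
for example (the depth limit and criterion are illustrative choices, and the training arrays are placeholders):

```python
from sklearn.tree import DecisionTreeClassifier

# pre-pruning: cap the depth so the tree cannot keep splitting until every leaf is pure
tree = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
tree.fit(X_train, y_train)
```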

the concept behind regularization?

to introduce additional information (bias) to penalize extreme parameter (weight) values. the most common form of regularization is so-called L2 regularization (sometimes called L2 shrinkage or weight decay)

Supervised Learning goal?

to learn a model from labeled training data that allows us to make predictions about unseen or future data

kernel SVMs

to solve a nonlinear problem using an SVM, we would transform the training data into a higher-dimensional feature space via a mapping function, phi, and train a linear SVM model to classify the data in this new feature space. then, we would use the same mapping function, phi, to transform new unseen data to classify it using the linear SVM model

logistic regression and "odds"?

to understand the idea behind logistic regression as a probabilistic model for binary classification, look first at the odds: the odds in favor of a particular event. we then use the logit function, which is simply the log-odds or the natural logarithm of the odds.

Unsupervised Learning

using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function

variance in the ML context

variance measures the consistency (or variability) of the model prediction for classifying a particular example if we retrain the model multiple times; we can say that the model is sensitive to the randomness in the training data

vectorization

vectorization means that an elemental arithmetic operation is automatically applied to all elements in an array

the inverse regularization parameter, C

via the variable, C, we can control the penalty for misclassification. large values of C correspond to large error penalties, whereas we are less strict about misclassification errors if we choose smaller values for C. we can then use the C parameter to control the width of the margin and therefore tune the bias-variance tradeoff
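
a brief sketch of how C shows up in scikit-learn (the specific values are arbitrary):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# smaller C -> stronger regularization / wider margin; larger C -> larger error penalties
lr_strong_reg = LogisticRegression(C=0.01)
lr_weak_reg = LogisticRegression(C=100.0)
svm = SVC(kernel='linear', C=1.0)
```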

