Supervised Learning


Lasso

A linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients. Mathematically, it consists of a linear model with an added regularization term. The alpha parameter controls the degree of sparsity of the estimated coefficients.
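
A minimal sketch of this behavior with scikit-learn's Lasso on synthetic data (the alpha value here is just an illustrative choice):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Regression data where only 5 of 20 features are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0)   # alpha controls the degree of sparsity
lasso.fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))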

Logistic Regression

A linear model for classification rather than regression. Also known in the literature as logit regression, maximum-entropy classification or the log-linear classifier. It can fit binary, one-vs-rest or multinomial logistic regression with optional l1, l2 or elastic-net regularization.
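
An illustrative sketch (assuming scikit-learn's LogisticRegression and the built-in iris data) of a multi-class fit with l2 regularization:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3-class problem

# l2-regularized logistic regression; C is the inverse regularization strength
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:2]))     # per-class probabilities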

Elastic-Net

A linear regression model trained with both l1 and l2 norm regularization of the coefficients. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge. Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both. A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.
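
A minimal sketch with scikit-learn's ElasticNet, where the (illustrative) l1_ratio parameter blends the two penalties:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=30, n_informative=10,
                       noise=2.0, random_state=0)

# l1_ratio blends the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print("non-zero coefficients:", (enet.coef_ != 0).sum())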

What does regularization achieve?

A standard least squares model tends to have some variance in it, i.e. this model won't generalize well for a data set different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. So the tuning parameter λ, used in the regularization techniques described above (Ridge Regression and Lasso), controls the impact on bias and variance. As the value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point, this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.
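
To make the trade-off concrete, a small sketch (assuming scikit-learn's Ridge and cross_val_score; the alpha grid is arbitrary) compares cross-validated scores as the penalty grows:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)

# Very small alpha tends to over-fit (high variance), very large alpha
# tends to under-fit (high bias); an intermediate value usually scores best
for alpha in (1e-3, 1.0, 1e3):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:<8} mean CV R^2 = {score:.2f}")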

Perceptron

Another simple classification algorithm suitable for large scale learning. By default:
- it does not require a learning rate
- it is not regularized (penalized)
- it updates its model only on mistakes
The last characteristic implies that the perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.
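
A minimal usage sketch with scikit-learn's Perceptron on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# No learning rate to tune and no penalty by default; weights are
# updated only on misclassified samples
clf = Perceptron(penalty=None, max_iter=1000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))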

Bayesian Regression

Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand. This can be done by introducing uninformative priors over the hyperparameters of the model. The l2 regularization used in Ridge Regression is equivalent to finding a maximum a posteriori estimation under a Gaussian prior over the coefficients w with precision λ⁻¹. Advantages: it adapts to the data at hand, and it can be used to include regularization parameters in the estimation procedure. Disadvantages: inference of the model can be time consuming.
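
A short sketch using scikit-learn's BayesianRidge (an assumed concrete choice of Bayesian regression estimator), where the regularization precisions are estimated from the data:

from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

X, y = make_regression(n_samples=100, n_features=10, noise=3.0, random_state=0)

# The noise precision (alpha_) and weight-prior precision (lambda_) are
# not fixed up front; they are tuned to the data during fitting
reg = BayesianRidge()
reg.fit(X, y)
print("alpha_:", reg.alpha_, "lambda_:", reg.lambda_)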

Ordinary Least Squares

LinearRegression fits a linear model with coefficients w = (w1, ... , wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.
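
A minimal LinearRegression sketch (the data and true coefficients below are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# y is roughly 1 + 2*x1 + 3*x2 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 1.0 + X @ np.array([2.0, 3.0]) + 0.01 * rng.randn(100)

reg = LinearRegression().fit(X, y)
print("coefficients:", reg.coef_, "intercept:", reg.intercept_)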

Linear and Quadratic Discriminant Analysis

Two classic classifiers, with, as their names suggest, a linear and quadratic decision surface, respectively. These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune. LDA can only learn linear boundaries, while QDA can learn quadratic boundaries and is therefore more flexible.
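
A small sketch fitting both classifiers (assuming scikit-learn's discriminant_analysis module and the iris data):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

# Both have closed-form fits, handle the 3 classes directly, and need no tuning
lda = LinearDiscriminantAnalysis().fit(X, y)     # linear decision surface
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # quadratic decision surface
print("LDA accuracy:", lda.score(X, y), "QDA accuracy:", qda.score(X, y))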

Support Vector Machines

a set of supervised learning methods used for classification, regression and outlier detection.
Advantages:
- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
Disadvantages:
- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
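
A minimal SVC sketch (the kernel, C and the synthetic data are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# RBF-kernel SVM; C is the regularization term. probability=True enables
# the (expensive) cross-validated probability estimates mentioned above.
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0)
clf.fit(X, y)
print("support vectors per class:", clf.n_support_)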

Stochastic Gradient Descent - SGD

a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) SVM and Logistic Regression. It is particularly useful when the number of samples (and the number of features) is very large. SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Advantages: efficiency, ease of implementation (lots of opportunities for code tuning). Disadvantages: SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations, and it is sensitive to feature scaling.
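
A sketch using scikit-learn's SGDClassifier with hinge loss (the scaling step reflects the sensitivity to feature scaling noted above; the hyperparameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

# Standardize features first; hinge loss gives a linear SVM, and alpha
# is the regularization hyperparameter
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4,
                                  max_iter=1000, random_state=0))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))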

Shrinkage (Linear and Quadratic Discriminant Analysis)

a tool to improve estimation of covariance matrices in situations where the number of training samples is small compared to the number of features. In this scenario, the empirical sample covariance is a poor estimator. The shrinkage parameter can be set manually between 0 and 1. In particular, a value of 0 corresponds to no shrinkage (which means the empirical covariance matrix will be used) and a value of 1 corresponds to complete shrinkage (which means that the diagonal matrix of variances will be used as an estimate for the covariance matrix).
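
For illustration, a sketch with scikit-learn's LinearDiscriminantAnalysis on data with few samples relative to features (shrinkage requires the lsqr or eigen solver; the value 0.5 is arbitrary):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Few training samples compared to the number of features
X, y = make_classification(n_samples=40, n_features=30, n_informative=5,
                           random_state=0)

# shrinkage=0 would use the empirical covariance, shrinkage=1 the diagonal
# matrix of variances; 0.5 blends the two
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.5)
lda.fit(X, y)
print("training accuracy:", lda.score(X, y))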

Ridge Regression

addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares. The complexity parameter alpha >= 0 controls the amount of shrinkage: The larger the value of alpha, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
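
A small sketch of the shrinkage effect on two nearly collinear features (the data are synthetic and the alpha values arbitrary):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 0.01 * rng.randn(100)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.randn(100)

# A larger alpha shrinks the coefficients and makes them more robust
# to the collinearity between x1 and x2
for alpha in (1e-6, 10.0):
    print("alpha =", alpha, "coefficients:", Ridge(alpha=alpha).fit(X, y).coef_)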

Dimensionality Reduction using Linear Discriminant Analysis

can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes. The dimension of the output is necessarily less than the number of classes, so this is, in general, a rather strong dimensionality reduction, and only makes sense in a multiclass setting.
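
A minimal sketch on the iris data (4 features, 3 classes, so at most 2 output dimensions):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 4 features, 3 classes

# The projected dimension is at most n_classes - 1, i.e. 2 here
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit(X, y).transform(X)
print(X_reduced.shape)                   # (150, 2)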

