TDDE01 Machine Learning

When to consider the Bayesian approach (use of a Bayesian model)?

"I consider Bayesian approach when my data set is not everything that is known about the subject, and want to somehow incorporate that exogenous knowledge into my forecast." When there is uncertainty, and there might be relevant info to get from the prior, which would increase certainty.

logit, a link function for a generalized linear model

(The picture shows a plot of the logit, where x is a probability; it is the link function for the binomial family. It maps probabilities [0,1] to (-infinity, infinity).) The logit is the canonical link function for the Bernoulli distribution. (Use x/(1-x), so that [0,1] becomes [0, infinity); then take the log of this to get (-infinity, infinity): logit(x) = log(x/(1-x)).)
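
A minimal R sketch of the logit and its inverse (the probability values are arbitrary):

```r
# logit maps probabilities in (0,1) to (-Inf, Inf).
p <- c(0.1, 0.5, 0.9)
log(p / (1 - p))    # logit computed by hand
qlogis(p)           # built-in logit (quantile of the logistic distribution)
plogis(qlogis(p))   # the inverse logit recovers the probabilities
```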

The technique behind naive Bayes

1. Use Bayes' theorem to get an expression for the probability of a certain class given the observed values. 2. With the naive Bayes (independence) assumption, the likelihood in the numerator becomes a product of per-feature terms. 3. Since the denominator is constant, the left-hand side is proportional to the numerator of the right-hand side. 4. Maximize the right-hand side over y to get the most probable class.
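
A minimal R sketch, assuming the e1071 package and the built-in iris data (neither is part of the original card):

```r
# Naive Bayes: per-feature likelihoods are multiplied under the
# independence assumption, and the most probable class is returned.
library(e1071)
model <- naiveBayes(Species ~ ., data = iris)
predict(model, iris[c(1, 51, 101), ])                 # predicted classes
predict(model, iris[c(1, 51, 101), ], type = "raw")   # class probabilities
```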

How to estimate the uncertainty of an estimator f? (an estimator of a probabilistic model, f(x, D))

1. If the data D has a known distribution, compute the distribution of the estimator (difficult). 2. Use the bootstrap method.
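
A minimal nonparametric bootstrap sketch in R, with toy data and the sample mean as the estimator:

```r
# Resample the data with replacement, recompute the estimator each time,
# and use the spread of the results as the uncertainty estimate.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)    # toy data set D
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)                       # bootstrap standard error of the mean
```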

Deterministic vs probabilistic model

A deterministic mathematical model is meant to yield a single solution describing the outcome of some "experiment" given appropriate inputs. A probabilistic model is, instead, meant to give a distribution of possible outcomes (i.e. it describes all outcomes and gives some measure of how likely each is to occur).

hyperparameter

A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. For example: lambda in ridge and lasso regression, or the width h of a kernel.

AUC

AUC stands for area under the curve, and is the area under the ROC graph. It can be used to compare classification methods; a higher AUC is better.

probability density function

Any given sample can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample. In other words, while the absolute likelihood for a continuous random variable to take on any particular value is 0 (since there are an infinite set of possible values to begin with), the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would equal one sample compared to the other sample. (Or: at that particular point there is a certain probability per unit.)

What is true about proportionality?

If the left-hand side and the right-hand side are proportional, then the same value of y will maximize/minimize both expressions.

Finding best classification from a classification tree.

The classification probability p_mk = p(Y = k | X ∈ R_m) is estimated for every class k in a node m. The predicted class for the node is the one with the highest estimated probability.
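
A minimal R sketch (the node contents are hypothetical):

```r
# Estimate p_mk as the class proportions in the node and predict the
# class with the highest estimated probability.
classes_in_node <- c("yes", "yes", "no", "yes", "no")
p_mk <- table(classes_in_node) / length(classes_in_node)
p_mk                     # estimated class probabilities in the node
names(which.max(p_mk))   # predicted class: "yes"
```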

Histogram Classification

D = the number of input dimensions. The histogram rule is less accurate at the borders of the cube, because those points are not as well represented by the cube as the ones near the center. (Moving window classification handles this better.)

Basic ML ingredients

Data, Model, Learning procedure, Prediction.

What does p(x|θ)·p(θ) tell us?

Since p(x|θ)·p(θ) is proportional to p(θ|x), it can be maximized over θ to find the most probable parameter value, without having to compute the normalizing constant p(x).

Why does the least squares method require at least as many data points as features?

Example: three features, two data points. The slope of the plane in one of the directions is then not determined by the data, so no unique fit or prediction can be made. With ridge regression, however, this is possible.

Types of learning: Reinforcement learning.

Find suitable actions to maximize the reward. True targets are discovered by trial and error.

maximal margin classifier

Finds a threshold in the middle between observations to divide the data. This classifier has a hard margin and is very sensitive to outliers: it won't allow misclassifications, so a single outlying observation can move the threshold to a non-optimal position. The threshold gets updated when new observations are made.

LDA: projecting the features from a higher-dimensional space to lower dimensions, visualized

From this line, the LDA model is able to estimate the mean and variance of the data for each class, with the help of the assumptions that the data are Gaussian with the same variance. Outliers can skew the statistics used to separate classes in LDA, so it is preferable to remove them. Since LDA assumes that each input variable has the same variance, it is always better to standardize the data before using an LDA model: transform it to have mean 0 and standard deviation 1.

Classifying handwritten digits, example Confusion Matrix

For example: how many images of the digit 2 are predicted as a 5?

stepAIC and model selection

If we are given two models, we prefer the model with the lower AIC value. Hence AIC provides a means for model selection. AIC is only a relative measure among multiple models.

Practical relationship between MLE (maximum likelihood estimation) and the normal distribution

It turns out that when the model is assumed to be Gaussian, the MLE estimates are equivalent to those of the ordinary least squares method.

CART algorithm for finding optimal regression tree. (fitting)

It is a greedy algorithm (the globally optimal tree is not necessarily found): at each step it makes the split that looks best at that moment.

Ridge regression idea

Keeps all the predictors from the least squares model but shrinks their coefficients to make the model less complex (with the help of L2 regularization). Shrinking enables estimation of regression coefficients even if the number of parameters exceeds the number of cases.

How does LDA make a prediction?

LDA makes predictions by estimating the probability that a new set of inputs belongs to each class. The class that gets the highest probability is the output class and a prediction is made. The model uses Bayes Theorem to estimate the probabilities.

why ridge regression instead of least square?

Least squares regression is not even well defined when the number of predictors exceeds the number of observations; it also doesn't differentiate "important" from "less important" predictors in a model, so it includes all of them at full weight. This leads to overfitting and a failure to find unique solutions. Ridge shrinks the coefficients of the less important predictors towards zero, which regularizes the problem and gives a unique solution.

Lasso

Like Ridge but with L1 regularization instead of L2.

How to model normally distributed targets

Linear regression

How to model bernoulli or multinomial targets ?

Logistic regression

When use LDA (linear discriminant analysis) and when use logistic regression for classification?

Logistic regression is intended for two-class (binary) classification problems. If you have more than two classes, then linear discriminant analysis is the preferred linear classification technique. It may still be a good idea to try both logistic regression and linear discriminant analysis.

logistic regression

Logistic regression is useful when predicting binary outcomes from continuous predictor variables. (Example, if a loan is accepted or denied)

What is maximum likelihood estimation for?

MLE can be defined as a method for estimating population parameters (such as the mean and variance for the Normal distribution, or the rate lambda for the Poisson) from sample data, such that the probability (likelihood) of obtaining the observed data is maximized. We can assume that we have a likelihood function L(θ; x), where θ is the distribution parameter vector and x is the set of observations.
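
A minimal R sketch, assuming a normal model and toy data, that maximizes the likelihood numerically with optim():

```r
# Fit the mean and standard deviation of a normal distribution by
# minimizing the negative log-likelihood (equivalent to maximizing L).
set.seed(1)
x <- rnorm(200, mean = 3, sd = 1.5)
negloglik <- function(par) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
fit <- optim(c(0, 0), negloglik)   # par[2] is log(sd) to keep sd positive
fit$par[1]                         # MLE of the mean
exp(fit$par[2])                    # MLE of the standard deviation
```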

MSE criterion

The MSE criterion is a tradeoff between bias and variance. It is especially important when the polynomial has a high degree, in order to avoid overfitting.

ICA (independent component analysis)

Making a linear transformation such that the features become independent of each other, while maximizing the similarity to the original features.

Projection matrix P (hat matrix)

Maps the vector of response values (dependent variable values) to the vector of fitted values (or predicted values). It describes the influence each response value has on each fitted value.

The Gini Index

A measure of the impurity of a node. The Gini index is calculated by subtracting the sum of the squared class probabilities from one (e.g. 1 - (36/70)^2 - (34/70)^2, when 36 of the node's observations were class 'yes' and 34 were 'no' regarding heart disease). It favors larger partitions. When the node is pure, the Gini index is 0. A low Gini index is desired.
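A minimal R sketch reproducing the card's worked example:

```r
# Gini index of a node from its class counts.
gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(36, 34))   # node with 36 'yes' and 34 'no': about 0.4996
gini(c(70, 0))    # a pure node has Gini index 0
```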

Example K-nearest neighbor density estimation

N = the total number of data points. Delta = the volume of the region. K = the number of data points inside the region. The density at x is then estimated as p(x) ≈ K / (N · Delta).

Exponential family of distributions

Distributions that are not in the exponential family include the uniform and Student's t distributions. (The exponential family includes, for example, the Gaussian, Bernoulli, Poisson, and exponential distributions.)

ANN: cost function, C(s,y)

When a value s (the predicted output) has been obtained, it is compared with an expected output (y). This is done with a cost function C(s, y). The cost can be MSE, cross-entropy, or any other cost function. Based on C's value, the model "knows" how much to adjust its parameters in order to get closer to the expected output y. This happens using the back-propagation algorithm.

When is a node pure / impure

A node is pure if all of its observations fall within one class. E.g. if everyone with chest pain -> headache -> high blood pressure has heart disease (falls within the class 'yes'), then that node is pure. Impurity is measured with the Gini index or entropy/deviance.

Why GLM instead of OLS regression?

Ordinary least squares regression provides linear models of continuous variables. However, much data of interest to statisticians and researchers is not continuous, so other methods must be used to create useful predictive models. The glm() command is designed to perform generalized linear models (regressions) on binary outcome data, count data, probability data, proportion data, and many other data types.
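
A minimal glm() sketch on made-up binary data:

```r
# Logistic regression: the binomial family uses the logit link by default.
df <- data.frame(x = 1:6, y = c(0, 0, 1, 0, 1, 1))
fit <- glm(y ~ x, data = df, family = binomial)
predict(fit, newdata = data.frame(x = 3.5), type = "response")  # P(y = 1)
```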

logistic regression inner functioning

The output is binary, while linear regression output lies in (-infinity, infinity). The probabilities must therefore be transformed from [0,1] to (-infinity, infinity), which is done with a link function. If one then solves for p, the following (sigmoid) graph is obtained, and classification can now be done. (This gives better results than not using a family/link function, perhaps because the parameters theta and a are optimized under the correct model?)

PCA in other words

PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized. It creates uncorrelated principal components out of (possibly) correlated features. The goal is for these principal components to describe as much of the data as possible. A principal component is not a feature. PCA removes correlations, but not higher-order dependence; ICA removes correlations and higher-order dependence. In layman's terms, PCA helps to compress data and ICA helps to separate data.

PCA (principal component analysis)

Reduces complexity in data: approximates the data with fewer dimensions by finding the directions that maximize the variance.

Regularization

Regularization works by biasing the estimates towards particular values (such as small values near zero). The bias is achieved by adding a tuning parameter (penalty) that encourages those values. L2 regularization (ridge regression) adds an L2 penalty equal to the square of the magnitude of the coefficients.

Kernel Density Smoothing

Replaces each sample point with a Gaussian-shaped kernel, then obtains the resulting estimate for the density by adding up these Gaussians. The width (h) of the kernels can be determined with cross-validation. Smoothing increases the variance of the estimated distribution.
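
A minimal R sketch using the built-in density() function (toy data; bw plays the role of the width h):

```r
# Each sample point is replaced by a Gaussian of width bw; the sum of
# these Gaussians (normalized) is the density estimate.
x <- c(2.1, 2.5, 3.0, 4.2, 4.3, 5.1)
d <- density(x, kernel = "gaussian", bw = 0.5)
plot(d)   # the smoothed density estimate
```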

The big difference between ridge and lasso regression

Ridge regression can only shrink the slope asymptotically close to zero, while lasso can shrink the slope all the way to zero. So lasso is slightly better when some variables are redundant, and ridge is slightly better when most variables are useful.

sampling with replacement

Samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. (Used in Bootstrap)

Second order dependency vs higher order dependency.

Second-order dependency is dependency between two variables: Cov(x, y) != 0. Higher order means dependency involving more than two variables, beyond what pairwise covariances capture.

Big advantage of Support Vector Machines (SVM) and Logistic Regression over K nearest neighbors

Support vector machines (SVM) and logistic regression try to find a line to separate the data; once the line/plane/hyperplane is found, new data points can be classified very quickly.

TPR vs FPR

TPR (true positive rate) is the number of true positives divided by the total number of positives. FPR (false positive rate) is the number of false positives divided by the total number of negatives.

How find good penalty factor for Ridge?

Try the model with different values of lambda (the penalty factor) on the training data and choose the one with the smallest error (cross-validation).
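
A minimal R sketch, assuming the glmnet package and toy data (alpha = 0 gives ridge, alpha = 1 gives lasso):

```r
# Cross-validate over a grid of lambda values and pick the best one.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)
cv <- cv.glmnet(X, y, alpha = 0)   # 10-fold CV for ridge regression
cv$lambda.min                      # lambda with the smallest CV error
```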

LDA makes some simplifying assumptions about your data.

That each variable is Gaussian. That each attribute has the same variance. With these assumptions, the LDA model estimates the mean and variance from your data for each class.
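
A minimal R sketch with lda() from the MASS package on the built-in iris data:

```r
# LDA estimates per-class means (with a shared variance) and predicts
# the class with the highest posterior probability.
library(MASS)
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris[1:3, ])
pred$posterior   # estimated probability of each class
pred$class       # class with the highest probability
```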

ANN: forward propagation

The "normal" ANN. The equations in the picture form network's forward propagation.

ROC graph

The ROC graph summarizes all of the confusion matrices that the different thresholds produce (in a classification problem). Each dot represents a threshold. The y-axis is the ratio of positives classified as positive; the x-axis is the ratio of negatives classified as positive.

The assumption in naive bayes classifier

The assumption is that the predictors/features are independent; that is, the presence of one particular feature does not affect the others. Hence it is called naive. Another assumption made here is that all the predictors have an equal effect on the outcome.

Kernel function

The kernel function is what is applied on each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable. (kernel trick)

Moving Window Classification

The moving window rule gives equal weight to all the points in the ball.

Ordinary least squares regression: what is RSS?

The sum of all the squared residuals (deviations) is known as the residual sum of squares (RSS) and provides a measure of model-fit for an OLS regression model.

Generalized Linear Model

The term generalized linear model (GLIM or GLM) refers to a larger class of models. It is like linear regression but also takes into account the distribution of the dependent variable and a link function. The link function compensates for the fact that the effect of the predictors on the dependent variable may not be linear in nature. The glm() function is the basic tool for fitting generalized linear models.

What are Kernels used for

To generalize any linear algorithm to use curved shapes. (a function from the low dimensional space into a higher dimensional space.) The goal of the kernel is to make it so that two classes of data points that can only be separated by a curved line in the two-dimensional space can be separated by a flat plane in the three-dimensional space.

How to choose the root node

Choose the one with the lowest total impurity. At every position in the tree, place the node (split) with the lowest impurity.

PCA process

Choose two or more features as axes and plot the data. Create a line that minimizes the distance between the data points and the line; this is PC1. The slope indicates the importance of each feature (e.g. k = 4 means that F1 is much more important than F2). Add a line orthogonal to PC1; this is PC2. The unit vector in PC1's direction is the eigenvector for PC1. The eigenvalue is the sum of all the squared (projected) distances.
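
A minimal R sketch with prcomp() on the built-in iris data:

```r
# PCA on standardized features; the rotation matrix holds the eigenvectors.
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
pc$rotation   # eigenvectors: directions of PC1, PC2, ...
summary(pc)   # proportion of variance explained by each component
```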

Parametric bootstrap

Works even for small samples (new data sets are simulated from the fitted parametric model instead of being resampled from the data).

Example ANN

a = activations, computed by an activation function. z2 = weight matrix * x + bias; a2 = f(z2); z3 = weight matrix * a2 + bias.
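
A minimal R sketch of this forward pass (the input, weights, and biases are random placeholders):

```r
# Forward propagation: z2 = W1 x + b1, a2 = f(z2), z3 = W2 a2 + b2.
sigmoid <- function(z) 1 / (1 + exp(-z))
x  <- c(0.5, -1.2)                # input with 2 features
W1 <- matrix(rnorm(6), nrow = 3)  # weights: 2 inputs -> 3 hidden units
b1 <- rnorm(3)
W2 <- matrix(rnorm(3), nrow = 1)  # weights: 3 hidden units -> 1 output
b2 <- rnorm(1)
z2 <- W1 %*% x + b1
a2 <- sigmoid(z2)                 # hidden-layer activations
z3 <- W2 %*% a2 + b2              # network output (before output activation)
```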

Bayes theorem

a method used to compute posterior probabilities

ANN: what is backpropagation

backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters (differentiate and minimize the cost C). Initialization of the weights in the backpropagation algorithm is crucial.

Nonparametric bootstrap

can be applied to any deterministic estimator

degrees of freedom

degrees of freedom (DF) indicate the number of independent values that can vary in an analysis without breaking any constraints.

cross validation

a test method in which different parts of the data are used as the test set in turn, after which the results are combined.

conjugate distributions

if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. With a conjugate prior, the posterior is thus guaranteed to lie in that same family.

ANN: why not use linear activation functions

in that case the output would be a linear function of the input, meaning that all the layers could be reduced to just one. Use sigmoid, ReLU, or tanh instead.

support vector classifier

like the maximal margin classifier but with a soft margin (also called a soft margin classifier). Bias/variance tradeoff: it allows some misclassifications in order to lower the variance. Observations on or inside the soft margin are called support vectors. The soft margin is determined with cross-validation.

mean squared error (MSE)

measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value.

What is stepAIC

stepAIC is one of the most commonly used search methods for feature selection. We keep minimizing the AIC value to come up with the final set of features. stepAIC does not necessarily improve model performance; rather, it is used to simplify the model without impacting performance much. AIC quantifies the amount of information lost due to this simplification.

support vector machine

support vector classifiers can't handle data that is not easily divided by a single line/plane, but support vector machines can. 1. Begin with data in a lower dimension. 2. Map the data into a higher dimension. 3. Find a support vector classifier that separates the data in the new dimension into two groups (with the help of kernel functions).
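
A minimal R sketch, assuming the e1071 package and the built-in iris data, with a radial (RBF) kernel:

```r
# The kernel implicitly maps the data to a higher dimension, where a
# support vector classifier separates the classes.
library(e1071)
fit <- svm(Species ~ ., data = iris, kernel = "radial")
predict(fit, iris[1:5, ])   # predicted classes for the first observations
```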

Types of learning: Semi-supervised

targets are known only for some observations.

sparse solution

the majority of x's components (weights) are zeros; only a few are non-zero. A sparse solution can help avoid over-fitting.

