Machine learning


Algorithms based on DT

-Random Forest: an ensemble method that generates a set of decision trees with some random criteria and integrates their values into a final result.
-Random criteria: 1) random subsets of data (bagging), 2) random subset of attributes (feature selection), ...
-Integration of results: majority vote (most common class returned by all the trees).
Random Forest is less sensitive to overfitting.
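A minimal sketch of the idea (not the course's reference implementation), assuming scikit-learn's DecisionTreeClassifier is available and integer class labels; `n_trees`, `max_features="sqrt"` and the helper names are illustrative choices:

```python
# Bagging + majority-vote sketch; X and t are numpy arrays, labels are non-negative ints.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, t, n_trees=25, rng=np.random.default_rng(0)):
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # 1) bagging: bootstrap sample of the data
        tree = DecisionTreeClassifier(max_features="sqrt")  # 2) random subset of attributes per split
        tree.fit(X[idx], t[idx])
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    votes = np.stack([tree.predict(X) for tree in trees])   # one row of predicted labels per tree
    # integration of results: majority vote over the trees
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```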

What is achieved by reduced error pruning?

-It produces the smallest version of the most accurate subtree (removing sub-trees added due to coincidental irregularities).
-When the data set is limited, reducing the set of training examples (holding some out as validation examples) can give bad results.

How to avoid overfitting?

-Stop growing when the data split is not statistically significant.
-Grow the full tree, then post-prune.

Determine correct size of tree

-Use a separate set of examples (distinct from the training examples) to evaluate the utility of post-pruning.
-Apply a statistical test to estimate the accuracy of a tree on the entire data distribution.
-Use an explicit measure of the complexity of encoding the examples and the decision tree.

Rule post-pruning

1. Infer the decision tree, allowing for overfitting.
2. Convert the learned tree into an equivalent set of rules.
3. Prune (generalize) each rule independently of the others.
4. Sort the final rules into the desired sequence for use.
The rules have the form (...∧...) ∨ (...∧...); pruning can remove a whole parenthesis (conjunction) or only part of one.

What is the sigmoid function?

1/(1 + exp(-a))

S_i for max likelihood, 2 classes

S_i = 1/N_i (sum over n in class C_i) (x_n − μ_i)(x_n − μ_i)^T, i = 1, 2; the shared covariance is Σ = (N_1/N) S_1 + (N_2/N) S_2

Accuracy

= 1 - error rate

π for max likelihood 2 classes

= N1/N

Error rate

= |errors| / |instances| = (FN+FP) / (TP + TN + FP + FN)

What are we trying to find in multiclass classification with linearly separable data?

(bold variables) Find W~ so that y(x) = W~^T x~ is the K-class discriminant.

Decision boundary

(bold variables) (w~_k − w~_j)^T x~ = 0

Issues in DT learning

-Determining how deeply to grow the DT
-Handling continuous attributes
-Choosing appropriate attribute selection measures
-Handling training data with missing attribute values
-Handling attributes with different costs

Rules of decision tree

A rule is generated for every path from the root to a leaf.

What are two types of families in probabilistic models?

Generative: estimate P(x|Ci) and then compute P(Ci|x) with Bayes' rule.
Discriminative: estimate P(Ci|x) directly.

How does a gaussian noise distribution look?

There is a line, and at every point of the line the target follows a Gaussian distribution centered on the line.

H

Hypothesis space

How does sampling affect linear regression?

If we could sample the target function perfectly, the problem would be trivial.

What are different solvers in logistic regression?

Iterative reweighted least squares (IRLS) with Newton-Raphson iterative optimization: compute the gradient of the error w.r.t. w~ and the Hessian of the error, then take w~ = w~ − H^{-1} grad(E), where grad(E) = X~^T (y − t) and H(w~) = X~^T R(w~) X~, with R diagonal and R_nn = y_n(1 − y_n).
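A minimal numpy sketch of this IRLS/Newton-Raphson update, assuming X already contains the leading column of ones (the x~ form) and t is a 0/1 target vector; `n_iter` is an illustrative choice:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, t, n_iter=10):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X @ w)                 # y_n = sigma(w~^T x~_n)
        grad = X.T @ (y - t)               # grad E = X~^T (y - t)
        R = np.diag(y * (1.0 - y))         # R_nn = y_n (1 - y_n)
        H = X.T @ R @ X                    # Hessian = X~^T R X~
        w = w - np.linalg.solve(H, grad)   # w~ <- w~ - H^{-1} grad E
    return w
```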

How do we usually compare two learning algorithms?

K-fold cross-validation: run both algorithms on every partition.

Methods for learning linear discriminants

Least squares, perceptron, Fisher's linear discriminant, support vector machines.

Least squares

Minimize the sum-of-squares error function (bold variables):
E(W~) = 1/2 Trace((X~W~ − T)^T (X~W~ − T)), where X~W~ is the model's prediction and T holds the targets from the dataset.
Solution: W~ = (X~^T X~)^{-1} X~^T T = pseudoinv(X~) T, so y(x) = W~^T x~ = T^T pseudoinv(X~)^T x~.
Use the learned W~ to compute y(x) = W~^T x~ = (y1(x) ... yK(x))^T and assign class k = argmax_i yi(x).
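A minimal numpy sketch of this closed-form solution, assuming T uses 1-of-K (one-hot) target coding; the helper names are illustrative:

```python
import numpy as np

def lsq_fit(X, T):
    X_tilde = np.hstack([np.ones((len(X), 1)), X])   # x~ = (1, x)
    W_tilde = np.linalg.pinv(X_tilde) @ T            # (X~^T X~)^-1 X~^T T
    return W_tilde

def lsq_predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    Y = X_tilde @ W_tilde                            # y(x) = W~^T x~ for each row
    return Y.argmax(axis=1)                          # class k = argmax_i y_i(x)
```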

Which algorithms are used for linear regression, and what are the conditions?

Maximum likelihood, least squares, sequential learning. Condition: the observations are independent and identically distributed.

Performance metrics for regression

Mean absolute error, mean squared error, root mean squared error.

Can we use multiple binary linear models for classification?

No: you will get some regions that are not defined, because points there are neither C1 nor C2, but that does not necessarily mean they equal C3.

What is the issue with least squares?

Not robust to outliers: they affect the partition a lot, and we get a bad partition.

What is the idea behind support vector machines?

An optimization problem that maximizes the margin between the groups of data; we only look at the points closest to the boundary (the support vectors), which makes it good when the dataset is large and robust against outliers.

Independent events

P(A|B) = P(A)

Probabilistic generative models

P(C1|x) = σ(a), with a = ln [P(x|C1)P(C1) / (P(x|C2)P(C2))], and P(C2|x) = 1 − P(C1|x).
Parametric model: P(C1) = π, P(C2) = 1 − π, P(x|C1) = N(x; μ1, Σ), P(x|C2) = N(x; μ2, Σ) (Gaussian/normal distributions), where we want to estimate π, μ1, μ2, Σ.

P(Ci | x) for more than two classes

P(C_i|x) = exp(a_i) / Σ_j exp(a_j) (the softmax), with a_i = ln P(x|C_i)P(C_i)

Bernoulli distribution

P(X = k; θ) = θ^k (1 − θ)^(1−k). Example: observing heads after flipping a coin.

Binomial distribution

P(X = k; n, θ) = (n choose k) θ^k (1 − θ)^(n−k). Example: flipping a coin n times and observing k heads.

Multinomial distribution

P(X_1 = k_1, ..., X_d = k_d) = n!/(k_1! ... k_d!) · θ_1^(k_1) ... θ_d^(k_d). Example: rolling a d-sided die n times and observing each value k_i times.

P(a|b)

P(a ∧ b) / P(b)

Product rule

P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

Naive bayes assumption

P(a_1, a_2, ..., a_n | v_j, D) = ∏_i P(a_i | v_j, D)

Bayes rule

P(a|b) = P(b|a)P(a)/P(b)

Max likelihood for two classes

P(t | π, μ1, μ2, Σ, D) = ∏_n [π N(x_n; μ1, Σ)]^(t_n) [(1 − π) N(x_n; μ2, Σ)]^(1−t_n)

D?

Training set

What do basis functions do, and what are the different types?

Transformations of the input to make the dataset linearly separable. Types: linear, polynomial, radial basis function, sigmoid.

DT unknown Attribute values

Use the training example anyway and sort it through the tree. If node n tests A:
-assign the most common value of A among the other examples sorted to node n, or
-assign the most common value of A among the other examples with the same target value, or
-assign probability p_i to each possible value v_i of A and pass fraction p_i of the example down each descendant in the tree.

Soft margin constraints

Used when the dataset is almost linearly separable: solve the optimization problem with soft constraints.

Information gain

Measures how well a given attribute separates the training examples according to their target classification; we want the attribute with the highest information gain.
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} |S_v|/|S| · Entropy(S_v)
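A short self-contained sketch of these two quantities, assuming S is represented as a list of (attribute value, class label) pairs; that representation is an illustrative choice:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(S):
    labels = [label for _, label in S]
    gain = entropy(labels)
    for v in set(value for value, _ in S):
        S_v = [label for value, label in S if value == v]
        gain -= (len(S_v) / len(S)) * entropy(S_v)   # subtract weighted entropy of each split
    return gain
```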

Multivariate Bernoulli distribution

P(X_1 = k_1, ..., X_d = k_d) = ∏_i P(X_i = k_i; θ_i). Example: observing heads after flipping a coin and extracting a lime.

Termination conditions for perceptron

-Number of iterations
-Threshold on the change in the loss function
The perceptron will converge if the data is linearly separable and η is sufficiently small, but it does not necessarily reach the global minimum.

Pros and cons of Bayes optimal classifier

provides good results, but not practical when hypothesis space is large

What does Ω denote?

The sample space in probability.

Are least squares and max likelihood for linear regression related?

Yes, they are in a sense the inverse of each other: maximizing the likelihood (under Gaussian noise) corresponds to minimizing the least-squares error.

What types of perceptron algorithms do we have?

They determine how Δw_i is computed:
-Batch mode: consider the whole dataset.
-Mini-batch: take a subset S of D.
-Incremental: choose one sample at a time.
Incremental and mini-batch speed up convergence and are less sensitive to local minima.

Remarks on maximum likelihood estimation?

efficiently solved when analytical solutions are available

How does it work for discriminative models?

Estimate P(Ck | x~, D) = exp(a_k) / Σ_j exp(a_j) directly, with w~* = argmax ln P(t|w~, X), without estimating the generative model parameters.

What is a proposition

An event where an assignment to a random variable holds, e.g. PlayTennis = true, Weather = rain, etc.

Perceptron

A linear combination of the inputs with a sign function.
Unthresholded linear unit: o = w0 + w1x1 + ... = w^T x.
Learn the w_i that minimize the squared error/loss function E(w) = 1/2 Σ_N (t_n − w^T x_n)^2.
Gradient: dE/dw_i = Σ_N (t_n − w^T x_n)(−x_{i,n}).
Update the weights: w_i ← w_i + Δw_i, where Δw_i = −η dE/dw_i and η is a small constant called the learning rate.
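A minimal numpy sketch of these batch gradient-descent updates, assuming X already includes a leading 1 for w0; `eta` and `n_epochs` are illustrative values:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.01, n_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        error = t - X @ w                  # t_n - w^T x_n for every sample
        grad = -(X.T @ error)              # dE/dw_i = sum_n (t_n - w^T x_n)(-x_{i,n})
        w = w - eta * grad                 # delta(w) = -eta * dE/dw
    return w

def predict(w, X):
    return np.sign(X @ w)                  # thresholded output o(x) = sign(w^T x)
```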

Joint probability distribution

A table (matrix) listing all combinations of values of the random variables and the probability of each combination.

Entropy

A measure of the impurity of a set: Entropy(S) = −p_+ log2(p_+) − p_− log2(p_−), where p_+ is the proportion of positive samples and p_− = 1 − p_+. Minimum when p_+ = 1 or 0, maximum when p_+ = 0.5.

What to do for decision trees with continuous-valued attributes?

Create a discrete attribute to test the continuous variable (e.g. a boolean threshold test such as A > c).

Linearly separable

dataset can be partitioned by linear function

Two distributions used for Naive Bayes text classification

Bernoulli and Multinomial distribution

Naive bayes remarks

The conditional independence assumption is often violated, but the classifier still works well a lot of the time.

Decision tree

Each internal node tests an attribute A_i. Each branch denotes a value of an attribute a_{i,j} ∈ A_i. Each leaf node assigns a classification value c ∈ C.

Brute force MAP

For each hypothesis h ∈ H, calculate the posterior probability P(h|D) = P(D|h)P(h)/P(D). Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_h P(h|D).

K-fold cross validation

Partition D into k disjoint sets; use one for testing and the rest for training, and repeat so that every set is used for testing once. δ_i is the error of each run; the overall error = 1/k · Σ_i δ_i, and accuracy = 1 − error.
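A minimal sketch of this procedure, assuming `learner` is any classifier object exposing fit/predict (a hypothetical interface) and the per-fold error δ is the misclassification rate:

```python
import numpy as np

def k_fold_error(learner, X, t, k=10, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)                   # partition D into k disjoint sets
    deltas = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        learner.fit(X[train], t[train])
        delta = np.mean(learner.predict(X[test]) != t[test])  # error on the held-out fold
        deltas.append(delta)
    error = np.mean(deltas)                          # 1/k * sum of deltas
    return error, 1.0 - error                        # (error, accuracy)
```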

Rank the robustness of linear methods

Perceptron, Fisher's linear discriminant, least squares (least squares being the least robust to outliers).

What is logistic regression?

A probabilistic discriminative model based on maximum likelihood. Given a data set D = {(x~_n, t_n)}_{n=1..N} with t_n ∈ {0, 1}, the likelihood function is p(t|w~) = ∏_n y_n^(t_n) (1 − y_n)^(1−t_n), with y_n = P(C1 | x~_n) = σ(w~^T x~_n).

Some ways to improve text classification

Remove stop words like "the", "and", etc. from the document, and replace similar word forms ("like", "likes", "liking") with a single form ("like").

Sequential learning lin reg

Stochastic gradient descent algorithm: ŵ ← ŵ − η ∇E_n, where η is the learning rate, i.e. ŵ ← ŵ + η [t_n − ŵ^T φ(x_n)] φ(x_n); it converges for suitably small η. The point is to move the parameters along the negative gradient so as to decrease the error between the prediction for the data point and the correct value.
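A minimal sketch of this sequential update, using a polynomial basis φ as an illustrative choice; `eta`, `degree` and `n_epochs` are assumptions:

```python
import numpy as np

def phi(x, degree=3):
    return np.array([x ** j for j in range(degree + 1)])   # polynomial basis functions

def sgd_linear_regression(xs, ts, eta=0.01, n_epochs=50):
    w = np.zeros(len(phi(xs[0])))
    for _ in range(n_epochs):
        for x_n, t_n in zip(xs, ts):
            p = phi(x_n)
            w = w + eta * (t_n - w @ p) * p    # w <- w + eta [t_n - w^T phi(x_n)] phi(x_n)
    return w
```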

Recall

ability to avoid false negatives: |true positives| / |actual positives| = TP / (TP + FN)

Precision

ability to avoid false positives | true positives | / | predicted positives | = TP / (TP + FP)
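A small sketch tying together error rate, accuracy, precision and recall from the confusion-matrix counts, assuming binary labels in {0, 1}; the function name is illustrative:

```python
import numpy as np

def binary_metrics(t_true, t_pred):
    t_true, t_pred = np.asarray(t_true), np.asarray(t_pred)
    TP = np.sum((t_pred == 1) & (t_true == 1))
    TN = np.sum((t_pred == 0) & (t_true == 0))
    FP = np.sum((t_pred == 1) & (t_true == 0))
    FN = np.sum((t_pred == 0) & (t_true == 1))
    error = (FP + FN) / (TP + TN + FP + FN)
    return {
        "error rate": error,
        "accuracy": 1 - error,
        "precision": TP / (TP + FP),   # ability to avoid false positives
        "recall": TP / (TP + FN),      # ability to avoid false negatives
    }
```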

Max likelihood for linear regression

argmax_w P({t_1, ..., t_N} | x_1, ..., x_N, w, β), where β is the (noise) precision

Least square error for linear regression

argmin_w E_D(w) = argmin_w 1/2 Σ_N [t_n − w^T φ(x_n)]^2. Can also be written E_D(w) = 1/2 (t − Φw)^T (t − Φw), with t = [t_1 ... t_N]^T and Φ the design matrix with the basis functions φ_0(x_n), ..., φ_{M−1}(x_n) across each row and the data points x_1, ..., x_N down the rows.

h*

best hypothesis consistent with D

How does linear regression work with nonlinear models?

Use a nonlinear mapping function φ(x) so that y = w^T φ(x); for example, polynomial fitting. But there is a danger of overfitting, i.e. a high-order polynomial is fitted to data generated by a lower-order one.

DT attributes with cost

Use a cost-sensitive selection measure.
Tan and Schlimmer: Gain^2(S, A) / Cost(A).
Nunez: (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of the cost.

Prediction with perceptron

Classify x as C_k with k = sign(w^T x), where w is the learned weight vector.

What is optimality condition for lin reg algorithm?

∇E_D = 0 ⇒ Φ^T Φ w = Φ^T t, so w_ML = (Φ^T Φ)^{-1} Φ^T t.
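A minimal numpy sketch of solving this condition (the normal equations), with a polynomial design matrix as an illustrative choice of basis; `degree` is an assumption:

```python
import numpy as np

def design_matrix(xs, degree=3):
    return np.vstack([[x ** j for j in range(degree + 1)] for x in xs])  # Phi_{nj} = phi_j(x_n)

def fit_ml(xs, ts, degree=3):
    Phi = design_matrix(xs, degree)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ ts)   # solves Phi^T Phi w = Phi^T t
```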

Overfitting

h overfits if error_S(h) < error_S(h') and error_D(h) > error_D(h'), where error_S is the error on the training set S and error_D is the error over the entire distribution D.

Maximum a posteriori hypothesis

h_MAP = argmax_h P(D|h)P(h)

Maximum likelihood hypothesis

h_ML = argmax_h P(D|h)

What are some remarks for lin reg regularization

λ also has to be tuned, and too strong regularization can give underfitting.

What are some considerations for choice of eta/learning rate in perceptron

If η is too small, the line will tend to lie close to one of the groups of samples; if it is too big, the updates can jump past the point of minimal error.

ID3 algorithm

Selects the attribute with the highest information gain.
Input: {Examples, Target_attribute, Attributes}
1. Create a root node Root for the tree.
2. If all Examples are positive, return the node Root with label +.
3. If all Examples are negative, return the node Root with label −.
4. If Attributes is empty, return the node Root with label = most common value of Target_attribute in Examples.
Otherwise:
-A ← the "best" decision attribute for Examples; assign A as the decision attribute for Root.
-For each value v_i of A:
--add a new branch from Root corresponding to the test A = v_i
--Examples_vi = subset of Examples that have value v_i for A
--if Examples_vi is empty, then add a leaf node with label = most common value of Target_attribute in Examples
--else add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A})
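A compact recursive sketch of this pseudocode for categorical attributes, where each example is a dict of attribute values plus the target label (an illustrative representation); the empty-branch case is omitted because branches are only created for values that actually occur:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr, target):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for v in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                        # all examples share one label
        return labels[0]
    if not attributes:                               # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    A = max(attributes, key=lambda a: info_gain(examples, a, target))  # "best" attribute
    tree = {A: {}}
    for v in set(e[A] for e in examples):            # one branch per value A = v
        subset = [e for e in examples if e[A] == v]
        tree[A][v] = id3(subset, target, [a for a in attributes if a != A])
    return tree
```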

Reduced error pruning

Split the data into a training set and a validation set.
-Evaluate the impact on the validation set of pruning each possible node (remove the whole subtree and assign the most common classification).
-Greedily remove the node whose pruning most improves validation-set accuracy.

What does ω denote?

A subset of the sample space Ω.

Naive bayes algorithm

Target function f: X → V, with X = A_1 × ... × A_n, V = {v_1, ..., v_k}, data set D, new instance x = {a_1, a_2, ..., a_n}.
For each value v_j in V:
--estimate P̂(v_j | D)
--for each attribute A_k:
----for each attribute value a_i in A_k:
------estimate P̂(a_i | v_j, D)

regularization lin reg

A technique to control over-fitting: argmin_w E_D(w) + λ E_W(w). A common choice is E_W(w) = 1/2 w^T w, where λ > 0 is the regularization factor.
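A one-function numpy sketch of the resulting closed-form solution for this common choice, w = (λI + Φ^T Φ)^{-1} Φ^T t, assuming the design matrix Φ is already built and `lam` is an illustrative value:

```python
import numpy as np

def fit_regularized(Phi, t, lam=1e-3):
    M = Phi.shape[1]
    # regularized normal equations: (lambda*I + Phi^T Phi) w = Phi^T t
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```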

Maximum likelihood solution for parametric models, model M_θ

θ* = argmax_θ ln P(t | θ, X); if M_θ is in the exponential family: w~* = argmax_w~ ln P(t | w~, X)

K-class discriminant of K linear functions

(bold variables) y(x) = (y1(x) ... yK(x))^T = (w~_1^T x~ ... w~_K^T x~)^T = W~^T x~

Difference between thresholded and unthresholded in the perceptron

Unthresholded: o(x) = w^T x; thresholded: o(x) = sign(w^T x).

Multiclass logistic regression

Use the cross-entropy error E(w~_1, ..., w~_K) = −Σ_n Σ_k t_nk ln y_nk.
Gradient: ∇_{w~_j} E(w~_1, ..., w~_K); Hessian: ∇_{w~_k} ∇_{w~_j} E(w~_1, ..., w~_K).

For DT how deal with attribute with many values?

Use GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = −Σ_i |S_i|/|S| · log2(|S_i|/|S|).

Naive bayes classifier

v_NB = argmax_{v_j} P(v_j | D) · ∏_i P(a_i | v_j, D). Here v is the target value (e.g. PlayTennis = yes/no) and the a_i are attribute values (sunny, hot, humid, ...), so the factors look like P(sunny | yes) = ...
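A minimal frequency-count sketch of this classifier for categorical attributes (PlayTennis-style data); the example representation is an illustrative choice, and Laplace smoothing is omitted for brevity:

```python
from collections import Counter, defaultdict

def train_nb(examples, target):
    # examples: list of dicts of attribute values plus a target label
    priors = Counter(e[target] for e in examples)            # counts for P^(v_j | D)
    cond = defaultdict(Counter)                              # counts for P^(a_i | v_j, D)
    for e in examples:
        for attr, value in e.items():
            if attr != target:
                cond[(attr, e[target])][value] += 1
    return priors, cond, len(examples)

def classify_nb(priors, cond, n, instance):
    best_v, best_p = None, -1.0
    for v_j, count_vj in priors.items():
        p = count_vj / n                                     # P^(v_j | D)
        for attr, value in instance.items():
            p *= cond[(attr, v_j)][value] / count_vj         # P^(a_i | v_j, D)
        if p > best_p:
            best_v, best_p = v_j, p
    return best_v                                            # v_NB = argmax ...
```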

Bayes optimal classifier

v_OB = argmax_{v_j} Σ_i P(v_j | x, h_i) P(h_i | D)

w^T * x + w0 = (w0 w)(1 x)^T in compact form

w~ = (w0 w)^T, x~ = (1 x)^T, a_k = w~^T x~

What are we trying to solve in logistic regression?

w~* = argmin_w~ E(w~), where E is the cross-entropy function E(w~) = −ln p(t|w~) = −Σ_N [t_n ln y_n + (1 − t_n) ln(1 − y_n)].

Fisher's linear discriminant for multiple classes

y = W^T x; maximize J(W) = Trace((W S_W W^T)^{-1} (W S_B W^T))

Fisher's linear discriminant for two-class classification

y = w^T x, and classification of new instances is given by checking y ≥ −w_0, where w = S_W^{-1}(m_2 − m_1) and w_0 = w^T m; m_1 and m_2 are the means of C_1 and C_2, and m is the global mean.

Linear model for regression

y(x;w) = w0 +w1x1 + ...+ wdxd= w^Tx

is perceptron robust to outliers?

Yes: when taking sign(w^T x_n), the contribution of a correctly classified outlier becomes zero, so it does not contribute to the updates.

μ1 and μ2 max likelihood 2 classes

μ_1 = 1/N_1 Σ_{n=1}^N t_n x_n, μ_2 = 1/N_2 Σ_{n=1}^N (1 − t_n) x_n

Prediction of new sample max likelihood 2 classes

σ(w^T*x' + w0)
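Pulling together the cards on π, μ_1, μ_2, S_i and this prediction, a minimal numpy sketch of the two-class Gaussian generative model, assuming t is a 0/1 numpy array (1 for C1) and using the standard closed forms w = Σ^{-1}(μ_1 − μ_2) and w_0 = −½ μ_1^T Σ^{-1} μ_1 + ½ μ_2^T Σ^{-1} μ_2 + ln(π/(1 − π)):

```python
import numpy as np

def fit_generative(X, t):
    N, N1 = len(t), t.sum()
    N2 = N - N1
    pi = N1 / N                                      # pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1          # mu1 = 1/N1 sum t_n x_n
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2    # mu2 = 1/N2 sum (1 - t_n) x_n
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2            # shared covariance
    return pi, mu1, mu2, Sigma

def predict_c1(x_new, pi, mu1, mu2, Sigma):
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi / (1 - pi))
    return 1.0 / (1.0 + np.exp(-(w @ x_new + w0)))   # P(C1 | x') = sigma(w^T x' + w0)
```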

