Machine Learning


" PCA algorithms

" Full eigenvalue decomposition of S (slow) 2 E cient eigenvalue decomposition - only M eigenvectors 3 Singular value decomposition of centered data matrix X

" Linear least squares solution

"(X'X)^-1X'T Assumes gaussian conditional distribution, not robust to outliers

" What does a CNN layer consist of?

"A convolution part (kernel and biases) A non linear part (activation function) A pooling part (max-pool)

" What is a kernel

"A distance function k(x,x') >=0 typically symmetric and non-negative linear = x'x poly = (bx'x+g)^d rbf/gauss = e(-b|x-x'|^2) sigmoid = tanh(bx'x+g)

" Markov property

"A- Once the current stat is known, the evolution of the dynamic system does not depend on the history of states, actions and observations - The current state contains all the information needed to predict the future - Future states are conditionally independent of past states and past observations given the current state - The knowledge about the current state makes past, present and future observations statistically independent

" Adaboost pros cons

"Advantages: fast, simple and easy to program no prior knowledge about base learner is required no parameters to tune (except for M) can be combined with any method for finding base learners theoretical guarantees given su cient data and base learners with moderate accuracy Issues: Performance depends on data and the base learners (can fail with insu cient data or when base learners are too weak) Sensitive to noise

" HMM learning parameters Aij, bk(v)

"Aij = |i->j trans| / | i -> * trans | bk(v) = |observe v and state k| / | observe * and state k |

" HMM Backward

"Bt^k = P(z_t+1:T|xt=k) for each state k BT^k = 1 for each time t = T-1...1 for each state k: Bt^K = sum( B_t+1^j Ajk bj(z_t+1) )

" Consistent, Version space

"Consisten(h,D) = (for all x in D) h(x)=c(x) Version space: {h in H| consistent(h,D)}

" HMM interference, filtering, smoothing

"Filtering P(xT=k|z1:T) = aT^K/sum(aT^j) Smoothing P(xT=k|z1:T) = at^K * bt^k/ sum(at^j * bt^j):" HMM Forward "at^k = P(xt = k, z1:t) For each state k a0^k = pi0 * bk(z0) For each time t = 1,, T for each state k at^k = bk(zt) sum(a_t-1^j * Ajk)

" Candidate Elimination algo

"G <- max gen h in H S <- max spec h in H For each train ex d: if d pos: remove fom G any hyp inconsisten with d For each hyp s in S that is not consisten with d: remove s from S Add to S all minimal generalizations h of s such that h consistent with d and some member of G is more general than h if d neg: remove from S any hyp inconsisten with d For ech hyp g in G inconsistent with d: remove g from G add to G all minimal specializations h of g such that h is consistent with d and some member of S is more specific than h remove from G any hyp that is less than general than another hyp in G

HMM setup

"HMM(X,Z,pio) transition model P(xt|xt-1) observation model P(zt|xt) initial distribution pi0 0 P(x0) State transition matrix Aij = P(xt=j|xt-1=i) observation model bk(zt) = P(zt | xt= k)

ID3, characteristics

"ID3 searches a complete hypothesis space in an incomplete way Inductive bias in ID3 is preference for smaller trees Overfitting is an important issue and can be avoided with pruning methods

" Missing attribute decision tree

"If node n tests A - assign most common value of A among other examples sorted to node n - assign most common value of A among other examples with same target value - assign prop pi to each possible value vi of A. Assign fraction pi of each examåle to each descendant in tree

What is concept learning

"Inferring a boolean-valued function from training examples c(x) - target function h(x) - estimation of h over x Goal is to find the best hypothesis h that predicts correct values of h(xj) xj not in D

FIND_S

"Init h to most specific hypothesis in H For each positive training instance x in D For each attr_const ai in h if constrain h in h is satisfied by x do nothin replace ai in h by the next more general constrain that is satisfied by x

" "CNN properties Kernel, Padding, Stride, Receptive Field"

"Kernel - matrix corresponding to filter Depth - number of filters Stride - step of sliding kernel Receptive field - 2D dimension of kernel

" One state MDP Solv

"MDP(X0, A, s, r) r(ai) det and known: pi^* = argmax r(ai) r(ai) det and unknown: try all ai then select highest r(ai) non-det and known: pi^*(x0) = argmax E[r(ai)]

" One versus all, one versus one

"One versus all - C binary classifiers One-versus-One: C(C-1)/2 classifiers

Naive Bayes assumption and classifier

Assumption: P(a1, a2, ..., an | vj, D) = PI_i P(ai | vj, D). Classifier: vNB = argmax_vj P(vj | D) PI_i P(ai | vj, D)

Logistic regression

"P(c1|x) = g(w'x+w0) g(t) = 1/ (1+e^-t)

Naive Bayes Estimation

"P(vj|D) = |{....,vj..,}|/|D| P(ai|vj,D) = |{<...,ai,..vj>}|/ |{....,vj..,}|

" POMDP

"POMPD(X,A,Z,s,r,o) X set of states A set of actions Z set of observations P(x0) prob of initial state s(x,a,x') = p(x'|x,a) prob dist over transitions r(x,a) o(x',a'z') = P(z'|x',a) prob dist over observations

" Q-learning training rule

"Q(s,a) <- Q(s,a) + alpha[ r + gamma * max Q(s',a') - Q(s,a) ] alpha = 1/ ( 1 + visits(x,a)

" SARSA learning rule

"Q(s,a) = ""Q(s,a) <- Q(s,a) + alpha[ r + gamma * Q(s,a) - Q(s,a) ] alpha = 1/ ( 1 + visits(x,a)

Recall, Precision

"Recall = TP / (TP + FN) Precision = TP / (TP + FP)

" PCA steps

"To perform PCA in a M-dimensional projection space, with M < D compute the mean of the data x ̄ compute covariance matrix of the dataset S find M eigenvectors of S corresponding to the M largest eigenvalues

Properties of CNN

"Translation invariant - an object can appear anywhere in the image Not transformation invariant - cannot fix rotation, scaling etc, data augmentation Weight sharing

" Cost function neural net

"Typically negative log likelihood - Maximum likelihood principle J(theta). = - ln ( p(t|x)) with sigmoid: g(t) = 1 / (1+exp(-t) J(theta) = softplus((1-t)a), a=w't + b the softplus only saturates when given correct answer

MDP(X,A, s,r)

"X is a finite set of states A is a finite set of actions s : X x A -> X is a transition function r: X x A -> R is a reward function

" Rmsprop

"dW, db Sdw = B2Sdw + (1-B2)dW^2 Sdb = B2Sdb + (1 - B2)db^2 W = W - lr * dW/sqrt(Sdw + e) b = b - lr * db/sqrt(Sdb + e)

What is PCA used for?

"dimensionality reduction data compression (lossy) data visualization feature extraction

LMS weight update rule

"error(x)= V(x) - Y(x) wi <- wi + c * fi * error(x)

" Knn

"find K nearest neighbours of test input x assign x to the most commont label among the majority of votes

" NB_Learn(A,V,D)

"for each vj in V: p(vj|D) = estimate for each attribute Ak: for each ai in Ak: p(ai|vj,D) = estimate

" backprop

"forward: for k =1 to l: a = W*h + b h = f(a) Backward: g = derJy for k = l, l-1, .. 1: g = g x f'(a) - elementwise derb = g derW = g(h^(k-1))' g = W'g

hmap, hml

"hmap = argmax P(D|h)p(h) hml = argmax P(D|h)

Adaboost

Initialize wn = 1/N. For m = 1, ..., M: train a weak learner ym(x) with the weighted error function Jm = sum_n wn I(ym(xn) != tn); evaluate em = (sum_n wn I(ym(xn) != tn)) / sum_n wn; alpham = ln((1 - em)/em); update the weights wn = wn exp(alpham I(ym(xn) != tn)). Final classifier: YM(x) = sign(sum_m alpham * ym(x))

" Smoothing Kernels

"integral k(x)dx = 1 integral xk(x)dx = 0 integral x^2k(x)dx > 0

" SVM classification

"min 1/2 ||w||^2 + CsumS ti(w'x + wo) >= 1 - S

" Parametric non parametric

"parametric: model has a fixed number of parameters non parametric: number of parametrs grows with amount of data

" Kernelized regression

"primal: w = (X'X-lamdaI)X'T dual: alpha = (XX' + lambdaI)^-1t, w = X'alpha y(x,w) = sum alphai * xi'x = sum alphaik(xi,x) alpha = (K + lambdaI)^-1t

relu, sigmoid, tanh

"relu = max(0,a) sigmoid = 1/(1+exp(-a) tanh = 2sigmoid(2a) -1

Multiclass logistic regression, cross entropy, softmax

"softmax = e(aj)/sumj(e(aj)) Cross ent = - sumn sumk ( tnk * ln(ynk))

" SGD, SGD momentum,

"theta = theta - lr grad J momentum v = gamma*v + lr grad J theta = theta - v

" Nesterov momentum

"v = gamma*v + lr grad J(theta-gamma*v) theta = theta - v

" Fishers criterion

"w = Sw^-1(m2 - m1) wo = w'm Sw = sum (xn-m1)(xn-m1)' + sum(xn-m2)(xn-m2)'

" PCA variance maximization

"xbar = 1/N sum xn variance after projection: (1/N) sum (u1'xn-u1'xbar)^2 = u1'Su1 S = (1/N) sum (xn - xbar)(xn - xbar)' max var subject to u1'u1 = 1 max u1'Su1 + l1(1-u1'u1), der = 0 Su1 = l1u1 -> u1'Su1 = l1

" more_general_than_or_equal_to hk (hj >= hk) iff

(for all X) (hk(x)=1) -> (hj(x)=1)

" Number of parameters in a CNN layer

(fxfxdepth + bias)nc

hj is more_general_than hk (hj > hk) iff

(hj >= hk) and (hk not >= hj)

Inductive learning hypothesis

Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples

Bayes optimal classifier

Bayes optimal classifier: argmax_{vj in V} sum_{hi in H} ( P(vj | hi) * P(hi | D) )

Boosting

Boosting: use a 'weak' (or base) learner to build simple rules, then combine these 'weak' rules into a single prediction rule that is more accurate than each individual rule. YM(x) = sign(sum_m alpham * ym(x))

Classify new instances Candidate Elimination

Classify as positive if every h in S classifies it positive; classify as negative if every h in G classifies it negative

Candidate Elimination algorithm convergence

Converges to the target concept if the training data contain no errors and the target concept is in H

Entropy

E(S) = sum(-pi*log(pi))

What is filtering and smoothing HMM?

Filtering asks about the state at the end of the sequence (the current state); smoothing asks about a state in the middle of the sequence

Information Gain

Gain(S,A) = E(S) - sum_{v in Values(A)} ( |Sv| / |S| * E(Sv) )
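A sketch of both quantities for categorical labels and attribute values (plain Python plus NumPy for the logarithm):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))           # E(S) = sum(-pi log2 pi)

def information_gain(labels, attribute_values):
    """Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v)."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain
```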

" Learning problem

Improve over task T, with respect to performance measure P base don experience E

" Markov process

Is a process that has the markov property

" "Is hmap the most probable classifier for a new instance? "

No it is not, the bayes optimal classifier is.

X is conditionally independent of Y given Z

P(X,Y | Z) = P(X|Y,Z)P(Y|Z)=P(X|Z)P(Y|Z)

Chain rule (probability)

P(X1, ..., Xn) = PI_{i=1}^n P(Xi | X1, ..., Xi-1)

" Conditional probability

P(a|b)=P(a and b)/P(b)= alpha * P(a and b)

" "Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs (true positives), while the rest are cats (false positives). Precision and recall?"

The program's precision is 5/8 while its recall is 5/12

Inductive bias Candidate Elimination

The target concept c is in H

Inductive bias Find-S

The target concept c is in H and that the solution is a maximally specific hypothesis

" Wout CNN

Wout = floor( (Win - f + 2p)/s + 1)

ROC curve

x-axis: false positive rate, y-axis: true positive rate

General EM problem

Observed data X, unobserved Z, parameterised distribution P(Y | h), where yi = xi U zi and h are the parameters. Goal: find h that maximises E[ln P(Y | h)]

Cross-correlation, convolution

Cross-correlation: g(i,j) = sum_{u=-k}^{k} sum_{v=-k}^{k} h(u,v) f(i+u, j+v). Convolution: g(i,j) = sum_{u=-k}^{k} sum_{v=-k}^{k} h(u,v) f(i-u, j-v)
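A direct (slow) sketch of both definitions for an odd-sized square kernel h, keeping only the 'valid' region where the kernel fits entirely inside the image f:

```python
import numpy as np

def cross_correlate(f, h):
    """g(i, j) = sum_u sum_v h(u, v) f(i+u, j+v)."""
    k = h.shape[0] // 2
    H, W = f.shape
    g = np.zeros((H - 2 * k, W - 2 * k))
    for i in range(k, H - k):
        for j in range(k, W - k):
            g[i - k, j - k] = np.sum(h * f[i - k:i + k + 1, j - k:j + k + 1])
    return g

def convolve(f, h):
    """Convolution = cross-correlation with the kernel flipped in both axes."""
    return cross_correlate(f, h[::-1, ::-1])
```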

Logistic regression: iterative reweighted least squares

dE(w) = phi'(y - t); H = phi'Rphi, with R diagonal and Rnn = yn(1 - yn); w = w - H^-1 dE(w)

General EM Algo

Define the likelihood function Q(h', h) = E[ln P(Y | h') | h, X], which uses X and the current parameters h to estimate Z (and hence Y = X U Z). E-step: compute Q(h' | h) using h and X to estimate the probability distribution over Y: Q(h' | h) <- E[ln P(Y | h') | h, X]. M-step: replace h by the h' that maximises Q: h <- argmax_h' Q(h' | h)

" Perceptron training rule

delta wi = lr (td - od) xi,d

What is the expectation maximization algorithm and what is its goal?

Given observed variables X and hidden variables Z with joint distribution P(X, Z | theta), the goal is to maximise the likelihood P(X | theta) = sum_Z P(X, Z | theta). 1. Initialize theta. 2. Calculate the probability of each possible Z given theta. 3. Use the calculated values of Z to better estimate theta. Repeat until convergence.

Expectation maximization means

Initialize h = (u1, ..., uk) randomly. E-step: E[zij] = exp(-(1/2s^2)(xi - uj)^2) / sum_n exp(-(1/2s^2)(xi - un)^2). M-step: uj' = sum_i E[zij] xi / sum_i E[zij]. Replace h with h' and repeat.
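A NumPy sketch of this special case (1-D data, k Gaussians with a known shared variance sigma2; the initialisation and iteration count are arbitrary choices):

```python
import numpy as np

def em_means(x, k=2, sigma2=1.0, iters=50, seed=0):
    """Estimate the k means by EM, keeping sigma2 fixed."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)   # init h = (u1, ..., uk)
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - u_j)^2 / (2 sigma2))
        w = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma2))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: u_j' = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu
```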

" SVM regression

min 1/2 ||w||^2 + C * sum( S- + S+) S >0 ti <= y(xi,w) + e + S+ ti >= y(xi,w) - e - S-

" Bayes Rule

p(a|b) = p(b|a)p(a)/p(b)

Learning as probability estimation

p(h|D) = p(D|h)p(h)/p(D)

Linear regression wml

wml = (phi'phi)^-1phi'T

" softmax

y = softmax(a) = exp(ai)/sum(exp(ai))

" Linear Basis Function Models

y(x,w) = w'phi

" KNN kernel trick

||xa - xb ||^2 = xa'xa + xb'xb - 2xa'xb

