Machine Learning
" PCA algorithms
" Full eigenvalue decomposition of S (slow) 2 E cient eigenvalue decomposition - only M eigenvectors 3 Singular value decomposition of centered data matrix X
" Linear least squares solution
"(X'X)^-1X'T Assumes gaussian conditional distribution, not robust to outliers
" What does a CNN layer consist of?
"A convolution part (kernel and biases) A non linear part (activation function) A pooling part (max-pool)
" What is a kernel
"A distance function k(x,x') >=0 typically symmetric and non-negative linear = x'x poly = (bx'x+g)^d rbf/gauss = e(-b|x-x'|^2) sigmoid = tanh(bx'x+g)
" Markov property
"A- Once the current stat is known, the evolution of the dynamic system does not depend on the history of states, actions and observations - The current state contains all the information needed to predict the future - Future states are conditionally independent of past states and past observations given the current state - The knowledge about the current state makes past, present and future observations statistically independent
" Adaboost pros cons
"Advantages: fast, simple and easy to program no prior knowledge about base learner is required no parameters to tune (except for M) can be combined with any method for finding base learners theoretical guarantees given su cient data and base learners with moderate accuracy Issues: Performance depends on data and the base learners (can fail with insu cient data or when base learners are too weak) Sensitive to noise
" HMM learning parameters Aij, bk(v)
"Aij = |i->j trans| / | i -> * trans | bk(v) = |observe v and state k| / | observe * and state k |
" HMM Backward
"Bt^k = P(z_t+1:T|xt=k) for each state k BT^k = 1 for each time t = T-1...1 for each state k: Bt^K = sum( B_t+1^j Ajk bj(z_t+1) )
" Consistent, Version space
"Consisten(h,D) = (for all x in D) h(x)=c(x) Version space: {h in H| consistent(h,D)}
" HMM interference, filtering, smoothing
"Filtering P(xT=k|z1:T) = aT^K/sum(aT^j) Smoothing P(xT=k|z1:T) = at^K * bt^k/ sum(at^j * bt^j):" HMM Forward "at^k = P(xt = k, z1:t) For each state k a0^k = pi0 * bk(z0) For each time t = 1,, T for each state k at^k = bk(zt) sum(a_t-1^j * Ajk)
" Candidate Elimination algo
"G <- max gen h in H S <- max spec h in H For each train ex d: if d pos: remove fom G any hyp inconsisten with d For each hyp s in S that is not consisten with d: remove s from S Add to S all minimal generalizations h of s such that h consistent with d and some member of G is more general than h if d neg: remove from S any hyp inconsisten with d For ech hyp g in G inconsistent with d: remove g from G add to G all minimal specializations h of g such that h is consistent with d and some member of S is more specific than h remove from G any hyp that is less than general than another hyp in G
HMM setup
"HMM(X,Z,pio) transition model P(xt|xt-1) observation model P(zt|xt) initial distribution pi0 0 P(x0) State transition matrix Aij = P(xt=j|xt-1=i) observation model bk(zt) = P(zt | xt= k)
ID3, characteristics
"ID3 searches a complete hypothesis space in an incomplete way Inductive bias in ID3 is preference for smaller trees Overfitting is an important issue and can be avoided with pruning methods
" Missing attribute decision tree
"If node n tests A - assign most common value of A among other examples sorted to node n - assign most common value of A among other examples with same target value - assign prop pi to each possible value vi of A. Assign fraction pi of each examåle to each descendant in tree
What is concept learning
"Inferring a boolean-valued function from training examples c(x) - target function h(x) - estimation of h over x Goal is to find the best hypothesis h that predicts correct values of h(xj) xj not in D
Find-S
Initialize h to the most specific hypothesis in H
For each positive training instance x in D:
  For each attribute constraint ai in h:
    if the constraint ai is satisfied by x, do nothing
    else replace ai in h by the next more general constraint that is satisfied by x
" "CNN properties Kernel, Padding, Stride, Receptive Field"
"Kernel - matrix corresponding to filter Depth - number of filters Stride - step of sliding kernel Receptive field - 2D dimension of kernel
" One state MDP Solv
"MDP(X0, A, s, r) r(ai) det and known: pi^* = argmax r(ai) r(ai) det and unknown: try all ai then select highest r(ai) non-det and known: pi^*(x0) = argmax E[r(ai)]
" One versus all, one versus one
"One versus all - C binary classifiers One-versus-One: C(C-1)/2 classifiers
Naive Bayes assumption and classifier
Assumption: P(a1, a2, ..., an | vj, D) = PI_i P(ai | vj, D)
Classifier: vNB = argmax_vj P(vj | D) * PI_i P(ai | vj, D)
Logistic regression
"P(c1|x) = g(w'x+w0) g(t) = 1/ (1+e^-t)
Naive Bayes Estimation
"P(vj|D) = |{....,vj..,}|/|D| P(ai|vj,D) = |{<...,ai,..vj>}|/ |{....,vj..,}|
" POMDP
"POMPD(X,A,Z,s,r,o) X set of states A set of actions Z set of observations P(x0) prob of initial state s(x,a,x') = p(x'|x,a) prob dist over transitions r(x,a) o(x',a'z') = P(z'|x',a) prob dist over observations
" Q-learning training rule
"Q(s,a) <- Q(s,a) + alpha[ r + gamma * max Q(s',a') - Q(s,a) ] alpha = 1/ ( 1 + visits(x,a)
" SARSA learning rule
"Q(s,a) = ""Q(s,a) <- Q(s,a) + alpha[ r + gamma * Q(s,a) - Q(s,a) ] alpha = 1/ ( 1 + visits(x,a)
Recall, Precision
"Recall = TP / (TP + FN) Precision = TP / (TP + FP)
" PCA steps
"To perform PCA in a M-dimensional projection space, with M < D compute the mean of the data x ̄ compute covariance matrix of the dataset S find M eigenvectors of S corresponding to the M largest eigenvalues
Properties of CNN
"Translation invariant - an object can appear anywhere in the image Not transformation invariant - cannot fix rotation, scaling etc, data augmentation Weight sharing
" Cost function neural net
"Typically negative log likelihood - Maximum likelihood principle J(theta). = - ln ( p(t|x)) with sigmoid: g(t) = 1 / (1+exp(-t) J(theta) = softplus((1-t)a), a=w't + b the softplus only saturates when given correct answer
MDP(X,A, s,r)
"X is a finite set of states A is a finite set of actions s : X x A -> X is a transition function r: X x A -> R is a reward function
" Rmsprop
"dW, db Sdw = B2Sdw + (1-B2)dW^2 Sdb = B2Sdb + (1 - B2)db^2 W = W - lr * dW/sqrt(Sdw + e) b = b - lr * db/sqrt(Sdb + e)
What is PCA used for?
"dimensionality reduction data compression (lossy) data visualization feature extraction
LMS weight update rule
"error(x)= V(x) - Y(x) wi <- wi + c * fi * error(x)
" Knn
"find K nearest neighbours of test input x assign x to the most commont label among the majority of votes
" NB_Learn(A,V,D)
"for each vj in V: p(vj|D) = estimate for each attribute Ak: for each ai in Ak: p(ai|vj,D) = estimate
" backprop
"forward: for k =1 to l: a = W*h + b h = f(a) Backward: g = derJy for k = l, l-1, .. 1: g = g x f'(a) - elementwise derb = g derW = g(h^(k-1))' g = W'g
hMAP, hML
hMAP = argmax_h P(D|h) P(h)
hML = argmax_h P(D|h)
AdaBoost
Init wn = 1/N
For m = 1, ..., M:
  Train a weak learner ym(x) by minimizing the weighted error function Jm = sum_n wn * I(ym(xn) != tn)
  Evaluate em = sum_n wn * I(ym(xn) != tn) / sum_n wn
  alpham = ln((1 - em) / em)
  Update weights: wn = wn * exp(alpham * I(ym(xn) != tn))
Final classifier: YM(x) = sign(sum_m alpham * ym(x))
" Smoothing Kernels
"integral k(x)dx = 1 integral xk(x)dx = 0 integral x^2k(x)dx > 0
" SVM classification
"min 1/2 ||w||^2 + CsumS ti(w'x + wo) >= 1 - S
" Parametric non parametric
"parametric: model has a fixed number of parameters non parametric: number of parametrs grows with amount of data
" Kernelized regression
"primal: w = (X'X-lamdaI)X'T dual: alpha = (XX' + lambdaI)^-1t, w = X'alpha y(x,w) = sum alphai * xi'x = sum alphaik(xi,x) alpha = (K + lambdaI)^-1t
ReLU, sigmoid, tanh
relu(a) = max(0, a)
sigmoid(a) = 1 / (1 + exp(-a))
tanh(a) = 2*sigmoid(2a) - 1
Multiclass logistic regression, cross entropy, softmax
"softmax = e(aj)/sumj(e(aj)) Cross ent = - sumn sumk ( tnk * ln(ynk))
" SGD, SGD momentum,
"theta = theta - lr grad J momentum v = gamma*v + lr grad J theta = theta - v
" Nesterov momentum
"v = gamma*v + lr grad J(theta-gamma*v) theta = theta - v
" Fishers criterion
"w = Sw^-1(m2 - m1) wo = w'm Sw = sum (xn-m1)(xn-m1)' + sum(xn-m2)(xn-m2)'
" PCA variance maximization
"xbar = 1/N sum xn variance after projection: (1/N) sum (u1'xn-u1'xbar)^2 = u1'Su1 S = (1/N) sum (xn - xbar)(xn - xbar)' max var subject to u1'u1 = 1 max u1'Su1 + l1(1-u1'u1), der = 0 Su1 = l1u1 -> u1'Su1 = l1
" more_general_than_or_equal_to hk (hj >= hk) iff
(for all X) (hk(x)=1) -> (hj(x)=1)
" Number of parameters in a CNN layer
(fxfxdepth + bias)nc
hj is more_general_than hk (hj > hk) iff
(hj >= hk) and not (hk >= hj)
Inductive learning hypothesis
Any hypothesis that approximates the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples
Bayes optimal classifier
Bayes optimal classifier = argmax_vj sum_i( P(vj | x, hi) * P(hi | D) )
Boosting
Boosting: use a 'weak' (or base) learner to build simple rules, then combine these 'weak' rules into a single prediction rule that is more accurate than each single rule. YM(x) = sign(sum_m( alpha_m * ym(x) ))
Classify new instances Candidate Elimination
Classify as positive if it is positive for every h in S; classify as negative if it is negative for every h in G; otherwise the classification is ambiguous
Candidate Elimination algorithm convergence
Converges to the target concept if the training data contains no errors and the target concept is in H
Entropy
E(S) = sum(-pi*log(pi))
What is filtering and smoothing HMM?
Filtering: asks about the state at the end of the process (given the observations so far). Smoothing: asks about a state in the middle of the process (given all observations).
Information Gain
Gain(S, A) = E(S) - sum_{v in Values(A)}( |Sv| / |S| * E(Sv) )
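Entropy and information gain computed straight from these definitions on a toy dataset (the examples and the attribute name are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    labels = [lbl for _, lbl in examples]
    gain = entropy(labels)
    for value in set(ex[attr] for ex, _ in examples):
        subset = [lbl for ex, lbl in examples if ex[attr] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

S = [({"wind": "weak"}, "yes"), ({"wind": "weak"}, "yes"),
     ({"wind": "strong"}, "no"), ({"wind": "strong"}, "yes")]
print(entropy([lbl for _, lbl in S]), information_gain(S, "wind"))
```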
" Learning problem
Improve over task T, with respect to performance measure P base don experience E
" Markov process
Is a process that has the markov property
" "Is hmap the most probable classifier for a new instance? "
No it is not, the bayes optimal classifier is.
X is conditionally independent of Y given Z
P(X,Y | Z) = P(X|Y,Z)P(Y|Z)=P(X|Z)P(Y|Z)
Chain rule of probability
P(X1, ..., Xn) = PI_{i=1}^n P(Xi | X1, ..., X_{i-1})
" Conditional probability
P(a|b)=P(a and b)/P(b)= alpha * P(a and b)
" "Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs (true positives), while the rest are cats (false positives). Precision and recall?"
The program's precision is 5/8 while its recall is 5/12
Inductive bias Candidate Elimination
The target concept c is in H
Inductive bias Find-S
The target concept c is in H, and the solution is a maximally specific hypothesis
" Wout CNN
Wout = floor( (Win - f + 2p)/s + 1)
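The formula as a one-line helper (the example layer sizes are illustrative):

```python
def conv_out(w_in, f, p, s):
    return (w_in - f + 2 * p) // s + 1   # floor((Win - f + 2p)/s) + 1

print(conv_out(w_in=32, f=3, p=1, s=1))  # 32 ("same" padding)
print(conv_out(w_in=32, f=3, p=0, s=2))  # 15
```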
ROC curve
x-axis: false positive rate, y-axis: true positive rate
General EM problem
X, unobs Z, parametrised P(Y| h) where yi = xi U zi and h parameters Goal: find h that maximises E[ln P(Y|h)
Cross-correlation, convolution
Cross-correlation: g(i,j) = sum_{u=-k}^{k} sum_{v=-k}^{k} h(u,v) f(i+u, j+v)
Convolution: g(i,j) = sum_{u=-k}^{k} sum_{v=-k}^{k} h(u,v) f(i-u, j-v)
Logistic regression: iterative reweighted least squares
dE(w) = phi'(y - t)
H = phi' R phi, with R diagonal and Rnn = yn(1 - yn)
w = w - H^-1 dE(w)
General EM Algo
define likelihood function Q(h',h) which calculates Y = X U Z using X and parameters h to estimate Z Q(h',h) = E[ ln P(Y|h) | h, X] E: calc Q(h'|h) using h,X to estimate prob dist over Y Q(h'|h) <- E[ ln P(Y|h)|h,X] M: replace h by h' that maximises Q h <- argmax Q(h'|h)
" Perceptron training rule
delta wi = lr (td - od) xi,d
What is the expectation maximization algorithm and what is its goal?
Given observed variables X and hidden variables Z with joint distribution P(X, Z | theta), the goal is to maximise the likelihood P(X | theta) = sum_Z P(X, Z | theta).
1. Initialise theta
2. Calculate the probability of each possible Z given theta
3. Use the calculated values of Z to better estimate theta
Repeat 2-3 until convergence
Expectation maximization for Gaussian means
Init h = (u1, ..., uk) randomly
E-step: E[zij] = exp(-(1/(2s^2))(xi - uj)^2) / sum_n exp(-(1/(2s^2))(xi - un)^2)
M-step: uj' = sum_i E[zij] xi / sum_i E[zij]
Replace h with h' and repeat
" SVM regression
min 1/2 ||w||^2 + C * sum( S- + S+) S >0 ti <= y(xi,w) + e + S+ ti >= y(xi,w) - e - S-
" Bayes Rule
p(a|b) = p(b|a)p(a)/p(b)
Learning as probability estimation
p(h|D) = p(D|h)p(h)/p(D)
Linear regression wml
wml = (phi'phi)^-1phi'T
" softmax
y = softmax(a) = exp(ai)/sum(exp(ai))
" Linear Basis Function Models
y(x,w) = w'phi
" KNN kernel trick
||xa - xb ||^2 = xa'xa + xb'xb - 2xa'xb