Machine learning
Algorithms based on DT
-Random Forest: ensemble method that generates a set of decision trees with some random criteria and integrates their values into a final result. -Random criteria: 1) random subsets of the data (bagging), 2) random subsets of the attributes (feature selection), ... -Integration of results: majority vote (the most common class returned by all the trees). Random forests are less sensitive to overfitting than single decision trees.
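A minimal sketch of this idea using scikit-learn (assumed available; the dataset and parameter values are illustrative, not from these notes):

```python
# Hypothetical example: bagging + random attribute subsets + majority vote via scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the ensemble
    bootstrap=True,        # each tree sees a random subset of the data (bagging)
    max_features="sqrt",   # each split considers a random subset of the attributes
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # predictions are a majority vote over the trees
```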
What is achieved by reduced error pruning?
-It produces the smallest version of the most accurate subtree (removing sub-trees added due to coincidental irregularities). -When the data set is limited, withholding part of the training examples as validation examples can give bad results.
How to avoid overfitting?
-stop growing when data split not statistically significant -grow full tree, then post-prune
Determine correct size of tree
-use a separate set of examples (distinct from the training examples) to evaluate the utility of post-pruning -apply a statistical test to estimate the accuracy of a tree on the entire data distribution -use an explicit measure of the complexity of encoding the examples and the decision tree
Rule post-pruning
1 Infer the decision tree, allowing for overfitting 2 Convert the learned tree into an equivalent set of rules 3 Prune (generalize) each rule independently of the others 4 Sort the final rules into the desired sequence for use. The rules have the form (...^...) v (...^...); pruning can remove a whole conjunction or individual preconditions inside one.
What is the sigmoid function?
σ(a) = 1/(1 + exp(-a))
Si max likelihood 2 classes
S_i = 1/N_i * sum over n in class i of (x_n - μ_i)(x_n - μ_i)^T, i = 1, 2 (and the ML covariance is Σ = N_1/N * S_1 + N_2/N * S_2)
Accuracy
= 1 - error rate
π for max likelihood 2 classes
= N1/N
Error rate
= |errors| / |instances| = (FN+FP) / (TP + TN + FP + FN)
What are we trying to find in multiclass classification with linearly separable data?
(all bold variables) find W~ so that y(x) = W~^T x~ is the K-class discriminant
Decision boundary
(w~_k - w~_j)^T x~ = 0 (all bold; the boundary between classes k and j)
Issues in DT learning
-Determining how deeply to grow the DT -Handling continuous attributes -Choosing appropriate attribute selection measures -Handling training data with missing attribute values -Handling attributes with different costs
Rules of decision tree
One rule is generated for every path from the root to a leaf
What are two types of families in probabilistic models?
Generative: estimate P(x|Ci) and then compute P(Ci|x) with Bayes' rule. Discriminative: estimate P(Ci|x) directly.
How does a gaussian noise distribution look?
The model assumes t = y(x, w) + ε with Gaussian noise ε: around every point on the regression line there is a Gaussian distribution over the observed target value
H
Hypothesis space
How does sampling affect linear regression?
If we could sample the target function perfectly, the problem would be trivial; noisy and limited samples are what make regression non-trivial
What are different solvers in logistic regression?
Iterative reweighted least squares (IRLS) with Newton-Raphson iterative optimization: find the gradient of the error w.r.t. w~, find the Hessian of the error, and take w~ ← w~ - H^-1 * grad(E), where grad(E) = X~^T (y(w~) - t) and H(w~) = X~^T R(w~) X~ with R diagonal, R_nn = y_n(1 - y_n)
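A minimal numpy sketch of this update; function and variable names are mine, not from the notes:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic_regression(X, t, n_iter=20):
    """Newton-Raphson / IRLS for binary logistic regression.
    X: (N, D) inputs, t: (N,) targets in {0, 1}."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column
    w = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X_tilde @ w)                  # y_n = sigma(w~^T x~_n)
        grad = X_tilde.T @ (y - t)                # grad(E) = X~^T (y - t)
        R = np.diag(y * (1.0 - y))                # R_nn = y_n (1 - y_n)
        H = X_tilde.T @ R @ X_tilde               # H = X~^T R X~
        w = w - np.linalg.solve(H, grad)          # w~ <- w~ - H^-1 grad(E)
    return w
```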
How do we usually compare two learning algorithms?
K-fold cross-validation: run both algorithms on every partition and compare their errors
Methods for learning linear discriminants
Least squares, perceptron, Fisher's linear discriminant, support vector machines
Least squares
Minimize the sum-of-squares error function (all bold variables): E(W~) = 1/2 Trace((X~W~ - T)^T (X~W~ - T)), where X~W~ is the prediction of the model and T holds the targets from the dataset. Solution: W~ = (X~^T X~)^-1 X~^T T = pseudoinv(X~) T, so y(x) = W~^T x~ = T^T pseudoinv(X~)^T x~. Use the learnt W~ to compute y(x) = W~^T x~ = (y_1(x) ... y_K(x))^T and assign class k = argmax_i y_i(x)
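A small numpy sketch of this solution with one-hot targets; the names and the bias-column handling are my own choices:

```python
import numpy as np

def least_squares_classifier(X, labels, n_classes):
    """W~ = pseudoinv(X~) T, with T a one-hot target matrix.
    X: (N, D) inputs, labels: (N,) integer class labels in 0..n_classes-1."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # add the bias feature x_0 = 1
    T = np.eye(n_classes)[labels]                        # one-hot rows of T
    return np.linalg.pinv(X_tilde) @ T                   # (X~^T X~)^-1 X~^T T

def predict_class(W_tilde, x):
    y = W_tilde.T @ np.hstack([1.0, x])                  # y(x) = W~^T x~
    return int(np.argmax(y))                             # k = argmax_i y_i(x)
```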
Which algorithms are used for linear regression, and what are the conditions?
Maximum likelihood, least squares, sequential learning. Condition: the observations are independent and identically distributed
Performance metrics for regression
Mean absolute error, mean squared error, root mean squared error
Can we use multiple binary linear models for classification?
No: you get regions that are not defined, because a point there is not in C1 or C2, but that does not mean it necessarily belongs to C3
What is the issue with least squares?
Not robust to outliers: they strongly affect the partition, so we get a bad decision boundary
What is the idea behind support vector machines?
An optimization problem that maximizes the margin between the groups of data; only the points closest to the boundary (the support vectors) are considered, which is good when the dataset is large and gives robustness against outliers
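A small illustrative sketch with scikit-learn's linear SVM (assumed available; the synthetic data and the C value are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])   # two blobs
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)   # maximum-margin linear classifier
clf.fit(X, y)
print(clf.support_vectors_)         # only the points closest to the boundary define it
```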
Independent events
P(A|B) = P(A)
Probabilistic generative models
P(C1|x) = σ(a), a = ln [P(x|C1)P(C1) / (P(x|C2)P(C2))], P(C2|x) = 1 − P(C1|x). Parametric model: P(C1) = π, P(C2) = 1 − π, P(x|C1) = N(x; μ1, Σ), P(x|C2) = N(x; μ2, Σ) (Gaussian class-conditional densities with shared covariance Σ), where we want to estimate π, μ1, μ2, Σ
P(Ci | x) for more than two classes
P(C_k|x) = exp(a_k) / sum_j exp(a_j) (the softmax function), with a_k = ln P(x|C_k)P(C_k)
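A tiny numpy sketch of this normalized exponential; subtracting the maximum is a numerical-stability trick, not part of the formula:

```python
import numpy as np

def softmax(a):
    """P(C_k|x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())       # shift by max(a) to avoid overflow; the result is unchanged
    return e / e.sum()

# softmax([2.0, 1.0, 0.1]) -> approximately [0.659, 0.242, 0.099]
```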
Bernoulli distribution
P(X = k, theta) = theta^k (1-theta)^(1-k) ex observing head after flipping coin
Binomial distribution
P(X=k, n, theta) = (n over k)theta^k(1-theta)^(n-k) flipping a coin n times and observing k heads
Multinomial distribution
P(X_1 = k_1, ..., X_d = k_d) = n!/(k_1! ... k_d!) * theta_1^(k_1) ... theta_d^(k_d); e.g. rolling a d-sided die n times and observing value i exactly k_i times
P(a|b)
P(a ^ b) /P(b)
Product rule
P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)
Naive bayes assumption
P(a_1, a_2, ..., a_n | v_j, D) = product over i of P(a_i | v_j, D)
Bayes rule
P(a|b) = P(b|a)P(a)/P(b)
Max likelihood for two classes
P(t | π, μ1, μ2, Σ, X) = product over n of [π N(x_n; μ1, Σ)]^(t_n) * [(1 − π) N(x_n; μ2, Σ)]^(1−t_n)
D?
Training set
What do basis functions do, and what are the different types?
Transformations of the input that can make the dataset linearly separable: linear, polynomial, radial basis function, sigmoid
DT unknown Attribute values
Use the training example anyway and sort it through the tree: -if node n tests A, assign the most common value of A among the other examples sorted to node n -or assign the most common value of A among examples with the same target value -or assign probability p_i to each possible value v_i of A and assign fraction p_i of the example to each descendant in the tree
Soft margin constraints
Used when dataset is almost linearly separable, solve the optimization problem with soft constraints
Information gain
Measures how well a given attribute separates the training examples according to their target classification; we want the attribute with the highest information gain. Gain(S, A) = Entropy(S) - sum over v in Values(A) of |S_v|/|S| * Entropy(S_v)
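A short Python sketch of entropy and information gain for discrete attributes; the data layout (a list of dicts plus a parallel label list) is my own assumption:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```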
Bernoulli multivariate distribution
Product over i of P(X_i = k_i; theta_i); e.g. jointly observing heads after flipping a coin and extracting a lime (independent binary events)
Termination conditions for perceptron
A fixed number of iterations, or a threshold on the change in the loss function. The perceptron converges if the data are linearly separable and eta is sufficiently small, but it does not necessarily reach a global minimum
Pros and cons of Bayes optimal classifier
provides good results, but not practical when hypothesis space is large
What does Ω denote?
The sample space in probability
Are least squares and max likelihood for linear regression related?
Yes: under the Gaussian noise assumption, maximizing the likelihood is equivalent to minimizing the sum-of-squares error
What types of perceptron algorithms do we have?
They differ in how Δw_i is computed: batch mode (consider the whole dataset), mini-batch (take a subset S of D), incremental (choose one sample at a time). Incremental and mini-batch speed up convergence and are less sensitive to local minima
Remarks on maximum likelihood estimation?
efficiently solved when analytical solutions are available
How does it work for discriminative models?
Estimate P(C_k|x~, D) = exp(a_k)/sum_j exp(a_j) directly and find w~* = argmax ln P(t|w~, X), without estimating the parameters of a generative model
What is a proposition
An event where an assignment to a random variable holds, e.g. PlayTennis = true, Weather = rain
Perceptron
A linear combination of the inputs with a sign function. Unthresholded linear unit: o = w_0 + w_1 x_1 + ... = w^T x. Learn the w_i that minimise the squared error/loss function E(w) = 1/2 sum_n (t_n - w^T x_n)^2, with gradient dE/dw_i = sum_n (t_n - w^T x_n)(-x_{i,n}). Update the weights: w_i ← w_i + Δw_i, where Δw_i = -eta * dE/dw_i and eta is a small constant called the learning rate
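A minimal numpy sketch of this batch gradient-descent (delta rule) update for the unthresholded linear unit; names and default values are illustrative:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.01, n_epochs=100):
    """Minimise E(w) = 1/2 sum_n (t_n - w^T x_n)^2 by gradient descent.
    X: (N, D) inputs, t: (N,) targets in {-1, +1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 carries the bias w_0
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        o = X @ w                                  # o_n = w^T x_n
        grad = -(t - o) @ X                        # dE/dw = sum_n (t_n - o_n)(-x_n)
        w += -eta * grad                           # w <- w + delta_w, delta_w = -eta * dE/dw
    return w

def perceptron_output(w, x):
    return np.sign(w @ np.hstack([1.0, x]))        # thresholded prediction
```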
Joint probability distribution
A table with one entry for every combination of values of the random variables, giving the probability of each combination
Entropy
A measure of the impurity of a set: Entropy(S) = -p_+ * log2(p_+) - p_- * log2(p_-), where p_+ is the proportion of positive samples and p_- = 1 - p_+. Minimum (0) when p_+ = 0 or 1, maximum (1) when p_+ = 0.5
What do for decision trees with continuous valued attributes?
Create a discrete attribute to test the continuous variable, e.g. a threshold test A < c with c chosen to maximize information gain
Linearly separable
dataset can be partitioned by linear function
Two distributions used for Naive Bayes text classification
Bernoulli and Multinomial distribution
Naive bayes remarks
The conditional independence assumption is often violated, but the classifier still works well much of the time
Decision tree
Each internal node tests an attribute A_i; each branch denotes a value a_{i,j} ∈ A_i of that attribute; each leaf node assigns a classification value c ∈ C
Brute force MAP
For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h)P(h)/P(D). Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_h P(h|D)
K-fold cross validation
Partition D into k disjoint sets; in turn take one set for testing and the rest for training, doing so for all k sets. δ_i is the error on fold i; overall error = 1/k * sum of the δ_i, and accuracy = 1 - error
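A generic numpy sketch of this procedure; train_fn and error_fn are placeholder callables, not functions from the notes:

```python
import numpy as np

def k_fold_error(X, t, train_fn, error_fn, k=10, seed=0):
    """error = 1/k * sum of the per-fold errors; accuracy = 1 - error."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                  # k disjoint partitions of D
    deltas = []
    for i in range(k):
        test = folds[i]                             # one partition for testing
        train = np.concatenate([folds[j] for j in range(k) if j != i])   # the rest for training
        model = train_fn(X[train], t[train])
        deltas.append(error_fn(model, X[test], t[test]))
    return np.mean(deltas)
```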
Rank the robustness of linear methods
From most to least robust: perceptron, Fisher's linear discriminant, least squares
What is logistic regression?
A probabilistic discriminative model based on maximum likelihood. Given a data set D = {(x~_n, t_n)}, n = 1...N, with t_n ∈ {0, 1}, the likelihood function is p(t|w~) = product over n of y_n^(t_n) (1 - y_n)^(1-t_n), with y_n = p(C_1|x~_n) = σ(w~^T x~_n)
Some ways to improve text classification
Remove stop words such as "the" and "and"; replace similar word forms such as "likes" and "liking" with their stem "like"
Sequential learning lin reg
Stochastic gradient descent algorithm: w ← w - η grad(E_n), where η is the learning rate; for the sum-of-squares error this gives w ← w + η [t_n - w^T φ(x_n)] φ(x_n), which converges for suitably small η. The point is to change the parameters along the negative gradient so as to decrease the error between the prediction for the data point and the correct value
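A minimal numpy sketch of this sequential update; the names, learning rate, and epoch count are my own choices:

```python
import numpy as np

def sgd_linear_regression(Phi, t, eta=0.01, n_epochs=50):
    """Sequential learning: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n).
    Phi: (N, M) matrix of basis-function values, t: (N,) targets."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_n, t_n in zip(Phi, t):
            w += eta * (t_n - w @ phi_n) * phi_n   # one data point per update
    return w
```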
Recall
Ability to avoid false negatives: |true positives| / |actual positives| = TP / (TP + FN)
Precision
ability to avoid false positives | true positives | / | predicted positives | = TP / (TP + FP)
Max likelihood for linear regression
argmax_w P({t_1, ..., t_N} | x_1, ..., x_N, w, β), where β is the noise precision (inverse variance)
Least square error for linear regression
argmin_w E_D(w) = argmin_w 1/2 sum_n [t_n - w^T φ(x_n)]^2. Can also be written E_D(w) = 1/2 (t - Φw)^T (t - Φw), with t = [t_1 ... t_N]^T and Φ the N×M design matrix whose rows are [φ_0(x_n) ... φ_{M-1}(x_n)] (one row per data point, one column per basis function)
h*
best hypothesis consistent with D
How does linear regression work with nonlinear models?
Use a nonlinear mapping (basis) function φ(x) so that y = w^T φ(x); for example polynomial fitting. There is a danger of overfitting, e.g. fitting a higher-order polynomial to data generated by a lower-order one
DT attributes with cost
Use Tan and Schlimmer: Gain^2(S, A) / Cost(A), or Nunez: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of the cost
Prediction with perceptron
Classify x as C_k with k = sign(w^T x), where w is the learned weight vector
What is optimality condition for lin reg algorithm?
grad E_D = 0 ⇒ Φ^T Φ w = Φ^T t (the normal equations), so w_ML = (Φ^T Φ)^-1 Φ^T t
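A short numpy sketch of this closed-form solution; the polynomial basis and the noisy sine data are illustrative only:

```python
import numpy as np

def fit_ml_linear_regression(Phi, t):
    """w_ML = (Phi^T Phi)^-1 Phi^T t, computed via the pseudo-inverse for stability."""
    return np.linalg.pinv(Phi) @ t

# Illustrative usage: cubic polynomial fit to noisy samples of sin(2*pi*x).
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
Phi = np.vander(x, 4, increasing=True)     # basis functions 1, x, x^2, x^3
w_ml = fit_ml_linear_regression(Phi, t)
```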
Overfitting
h overfits if there is an alternative h' such that error_S(h) < error_S(h') and error_D(h) > error_D(h'), where error_S is the error on the training set and error_D the error over the whole distribution
Maximum a posteriori hypothesis
h_MAP = argmax_h P(D|h)P(h)
Maximum likelihood hypothesis
h_ml = argmax_h P(D|h)
What are some remarks for lin reg regularization
The regularization factor λ also has to be tuned, and too much regularization can give underfitting
What are some considerations for choice of eta/learning rate in perceptron
If eta is too small, the boundary may end up lying close to one of the groups of samples; if it is too big, the updates can overshoot the point with minimal error
ID3 algorithm
Selects the attribute with the highest information gain. Input: (Examples, Target_attribute, Attributes). 1 Create a root node for the tree 2 If all Examples are positive, return the node Root with label + 3 If all Examples are negative, return the node Root with label − 4 If Attributes is empty, return the node Root with label = most common value of Target_attribute in Examples 5 Otherwise: A ← the "best" decision attribute for Examples; assign A as decision attribute for Root; for each value v_i of A: -add a new branch from Root corresponding to the test A = v_i -Examples_vi = subset of Examples that have value v_i for A -if Examples_vi is empty then add a leaf node with label = most common value of Target_attribute in Examples -else add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A})
Reduced error pruning
Split the data into training and validation sets. -Evaluate the impact on the validation set of pruning each possible node (remove the whole subtree and assign the most common classification) -Greedily remove the node whose removal most improves validation set accuracy
What does ω denote?
A subset of the sample space Ω (an event)
Naive bayes algorithm
Target function f: X → V, with X = A_1 × ... × A_n and V = {v_1, ..., v_k}; data set D; new instance x = {a_1, a_2, ..., a_n}. For each value v_j in V: estimate P^(v_j | D); for each attribute A_k: for each attribute value a_i in A_k: estimate P^(a_i | v_j, D)
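A compact Python sketch of these counting-based estimates plus the v_NB classification step; the data layout and the tiny fallback probability for unseen values are my own assumptions:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate P^(v_j | D) and P^(a_i | v_j, D) by counting.
    examples: list of dicts {attribute: value}, labels: parallel list of target values."""
    n = len(labels)
    priors = {v: c / n for v, c in Counter(labels).items()}
    counts = defaultdict(Counter)              # counts[(attribute, v_j)][value] = frequency
    for ex, v in zip(examples, labels):
        for attr, val in ex.items():
            counts[(attr, v)][val] += 1
    likelihoods = {key: {val: c / sum(cnt.values()) for val, c in cnt.items()}
                   for key, cnt in counts.items()}
    return priors, likelihoods

def classify_naive_bayes(priors, likelihoods, x):
    """v_NB = argmax_vj P^(v_j | D) * prod_i P^(a_i | v_j, D)."""
    def score(v):
        p = priors[v]
        for attr, val in x.items():
            p *= likelihoods.get((attr, v), {}).get(val, 1e-6)   # fallback for unseen values
        return p
    return max(priors, key=score)
```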
regularization lin reg
A technique to control over-fitting: argmin_w E_D(w) + λ E_W(w). A common choice is E_W(w) = 1/2 w^T w, where λ > 0 is the regularization factor
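A one-function numpy sketch of the closed-form solution that follows from this objective with E_W(w) = 1/2 w^T w, namely w = (λI + Φ^T Φ)^-1 Φ^T t; the default λ is a hypothetical choice that would still have to be tuned:

```python
import numpy as np

def fit_regularized_least_squares(Phi, t, lam=1e-2):
    """Minimise E_D(w) + lam * 1/2 w^T w; closed form w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```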
Maximum likelihood solution parametric models, model M_theta
theta* = argmax lnP(t|theta, X) if M_theta is in exponential family: w~* = argmax ln P(t|w~, X)
K-class discriminant of K linear functions
(all bold) y(x) = (y_1(x) ... y_K(x))^T = (w~_1^T x~ ... w~_K^T x~)^T = W~^T x~
Difference between thresholded and unthresholded in perceptron
Unthresholded: o(x) = w^T x; thresholded: o(x) = sign(w^T x)
Multiclass logistic regression
Use the cross-entropy E(w~_1, ..., w~_K) = -sum_n sum_k t_nk * ln y_nk. Gradient: grad_{w~_j} E(w~_1, ..., w~_K); Hessian blocks: grad_{w~_k} grad_{w~_j} E(w~_1, ..., w~_K)
For DT how deal with attribute with many values?
Use GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)
Naive bayes classifier
v_NB = argmax_{v_j} P(v_j | D) * product over i of P(a_i | v_j, D). Here v_j is a target value (e.g. PlayTennis = yes/no) and the a_i are attribute values (sunny, hot, humid, ...), so the factors look like P(sunny | yes), etc.
Bayes optimal classifier
v_OB = argmax_{v_j} sum over i of P(v_j | x, h_i) P(h_i | D)
w^T * x + w0 = (w0 w)(1 x)^T in compact form
w~ = (w0 w)^T, x~ = (1 x)^T, a = w~^T x~
What are we trying to solve in logistic regression?
w~* = argmin E(w~), where E is the cross-entropy function: E(w~) = -ln p(t|w~) = -sum_{n=1}^N [t_n ln y_n + (1 − t_n) ln(1 − y_n)]
Fishers linear discriminant fo multiple classes
y = W^T x; maximize J(W) = Trace{(W S_W W^T)^-1 (W S_B W^T)}, where S_W is the within-class and S_B the between-class covariance matrix
Fishers linear discriminant for two class classification
y = w^T x, and classification of a new instance is given by the test y(x) ≥ -w_0, where w = S_W^-1(m_2 - m_1) and w_0 = -w^T m; m_1 and m_2 are the means of C1 and C2, and m is the global mean (so the projection is compared against the projected global mean)
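A small numpy sketch of the two-class case; using the projected global mean as the threshold is one common choice, not necessarily the course's exact convention:

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """w = S_W^-1 (m2 - m1) for samples X1 from C1 and X2 from C2 (shape (N_i, D))."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    w = np.linalg.solve(S_w, m2 - m1)                         # projection direction
    m = np.vstack([X1, X2]).mean(axis=0)                      # global mean
    threshold = w @ m                                         # compare y = w^T x against w^T m
    return w, threshold

# Classify a new x as C2 if w @ x >= threshold, else C1 (assumed convention).
```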
Linear model for regression
y(x;w) = w0 +w1x1 + ...+ wdxd= w^Tx
is perceptron robust to outliers?
Yes: with the thresholded output sign(w^T x_n), a correctly classified outlier contributes zero error, so it does not affect the solution
μ1 and μ2 max likelihood 2 classes
μ1= 1/N1 (sum from 1 to N) t_n*x_n μ2= 1/N2 (sum from 1 to N) (1-t_n)*x_n
Prediction of new sample max likelihood 2 classes
P(C_1|x') = σ(w^T x' + w_0)