Support Vector Machines
What must hold true for each training example in hard-margin SVM?
Primal feasibility: yi * (w^T * xi + b) - 1 >= 0
Dual feasibility: alphai >= 0
Complementarity: alphai * [yi * (w^T * xi + b) - 1] = 0
What is the margin defined by? Define "margin of the classifier" (gamma) and give equation as well as simplified equation
Two parallel hyperplanes: w^T * x + alpha = 0 and w^T * x + beta = 0
Margin of the classifier = distance between the two hyperplanes that form the boundaries of the separation:
gamma = |alpha - beta| / ||w||
Simplified gamma: without loss of generality, can set alpha = b - 1 and beta = b + 1, so the margin is
gamma = 2 / ||w||
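A minimal Python sketch of computing gamma = 2/||w|| from a fitted linear SVM, assuming scikit-learn, a made-up toy data set, and a very large C to approximate the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                 # learned weight vector
gamma = 2.0 / np.linalg.norm(w)  # margin = 2 / ||w||
print("margin gamma =", gamma)
```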
Define the margin (separation)
the separation between the two data sets achieved by a classifier
What is the distance of a point x to the hyperplane w^T * x + beta = 0?
|w^T * x + beta| / ||w||
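A one-line numpy sketch of this distance formula (w, beta, and x are arbitrary made-up values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the hyperplane w^T x + beta = 0
beta = -5.0
x = np.array([1.0, 2.0])   # query point

dist = abs(w @ x + beta) / np.linalg.norm(w)  # |w^T x + beta| / ||w||
print(dist)  # |3 + 8 - 5| / 5 = 1.2
```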
Write the equation for dual problem and primal problem for SVM
*Dual problem* (a is alpha)
maximize -(sum i = 1..n (sum j = 1..n (ai*aj*yi*yj*xi^T*xj)))/2 + sum i = 1..n (ai)
so that: (these are trivial constraints)
sum i = 1..n (ai*yi) = 0
ai >= 0 for all i = 1..n
*Primal problem*
minimize 1/2 * ||w||^2
so that: (n constraints, 1 for each data point)
yi*(w^T * xi + b) >= 1 for all i = 1..n
Notes: the primal is minimization of a convex quadratic; the dual is maximization of a concave quadratic. In both, complexity = the min/max objective and fit = the "so that" part, i.e. the constraints.
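A minimal sketch of solving the dual as a generic constrained quadratic program with scipy's SLSQP, on a made-up toy set (a dedicated QP solver would normally be used instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (made up)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q_ij = yi*yj*xi^T*xj ; dual objective (to MINIMIZE) = 1/2 a^T Q a - sum(a)
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                               # ai >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum ai*yi = 0

alpha = res.x
w = (alpha * y) @ X   # first-order condition: w = sum_i alphai*yi*xi
print("alpha =", np.round(alpha, 3), "w =", w)
```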
What's the difference between perceptron and SVM?
*Perceptron*:
- penalizes misclassified points only; any correctly classified point contributes zero loss, so it stops at the first separating hyperplane it finds
*SVM*:
- maximizes the margin between the two classes, so among all separating hyperplanes it prefers the one with the largest margin
Pros and Cons of SVMs
*Pros*
- polynomial-time exact optimization rather than approximate methods (unlike decision trees and neural nets)
- kernels allow very flexible hypotheses
- can be applied to very complex data types like graphs, sequences, etc.
*Cons*
- must choose a good kernel and kernel params
- very large problems are computationally intractable: quadratic in the num of examples
- probs with >20k examples are very difficult to solve exactly
What assumption is made in linearly separable data and why?
- Assumption: data is linearly separable; no noise in the data set and the resulting model doesn't require a loss function - Why: to derive a max-margin model
Contrast soft-margin SVM's optimization problem with the regularization function of Ridge Regression
- Ridge Regression minimizes (squared loss) + lambda * ||w||^2; soft-margin SVM minimizes (1/2)*||w||^2 + C * (hinge loss)
- both balance a norm penalty against a data-fit term, but C multiplies the loss rather than the norm, so C acts roughly like 1/lambda: large C = weak regularization, small C = strong regularization
- the loss also differs: squared loss for Ridge vs. hinge loss for the SVM
Why do we bother with the dual problem?
- both primal and dual are convex/ quadratic opt probs with the exact same solution
- the dual, however, has simpler constraints (one equality plus nonnegativity bounds on alpha), so it is easier to solve
- the dual soln is sparse (most alphai = 0), so easier to represent
More explanation: with N training points and d dimensions of feature vector x,
primal prob: w is from domain R^d
dual prob: alpha is from domain R^N
So need to learn d params for the primal, N for the dual. When d >> N (e.g. after a high-dimensional feature transformation), it is more efficient to solve for alpha than for w, so the dual prob is easier to solve.
What does regularization do to the SVM model?
- introduces inductive bias over solutions - controls the complexity of the solution - imposes smoothness restriction on solutions
What are the solution approaches for nonlinear SVM classifiers? Which is best?
1) Explicit transformation
- transform the data to a higher-dimensional feature space
- train a linear SVM classifier in the high-dim. space
- transform the high-dim linear classifier back to the original space to obtain the classifier
BAD b/c very expensive: if have m training features, the size of the transformation grows very fast b/c need the constant term, linear terms, pure quadratic terms, and quadratic cross-terms
2) Kernel trick (the better approach)
- the dual solution only depends on the inner products of the training data, so only need to compute inner products in the higher-dim space
- a kernel relates the inner products in the original and transformed spaces to avoid explicit transformations
- EX: k(x,z) = (x^T * z)^2
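A tiny numpy sketch of the example kernel k(x,z) = (x^T * z)^2: for 2-D inputs it equals the inner product of the explicit quadratic features phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so the transformation never has to be materialized (the x and z values are made up):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Degree-2 polynomial kernel (no constant term)."""
    return (x @ z) ** 2

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

print(phi(x) @ phi(z))  # inner product in the transformed space
print(k(x, z))          # same value, computed in the original space
```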
What are the steps for SVM modeling choices?
1) select the kernel function to use
- in practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
2) select the params of the kernel function and the value of C
- can set apart a validation set to determine the values of the params
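A sketch of this selection procedure with cross-validation in scikit-learn; the data set is synthetic and the grid values are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a real problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "kernel": ["poly", "rbf"],
    "degree": [2, 3],              # only used by the polynomial kernel
    "gamma": [0.01, 0.1, 1.0],     # kernel width parameter
    "C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```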
What happens in the following cases for SVM solutions? (a is alpha) 1) ai = 0, yi * (w^T * xi + b) > 1 2) ai > 0, yi * (w^T * xi + b) = 1 3) ai = 0, yi * (w^T * xi + b) = 1
1) training example is NOT on the margin: it lies strictly outside the margin boundary; lots of the training examples will have ai = 0, so they drop out of the first-order condition on w
2) training example IS on the margin boundary (a support vector); only some training examples will have ai > 0
3) degenerate case: the training example lies exactly ON the margin boundary but is not needed as a support vector (ai = 0)
What does the hard-margin SVM dual problem depend on?
Only on the dual vars (alphai) and the inner products (xi^T * xj) between each pair of training examples
What does the regularization constant C trade off between in the dual problem for soft-margin SVMs? What is special about the dual solution?
- C trades off between the regularization term (bias/ complexity) and the loss term (error/ variance)
- the dual solution depends only on the inner products of the training data, allowing us to extend linear SVMs to learn nonlinear classification functions without explicit transformation (the kernel trick)
What are the first-order conditions for hard-margin SVM?
Differentiate the Lagrangian with respect to the primal vars, w and b:
d/dw L(w,b,alpha) = 0: w = sum i = 1..n (alphai * yi * xi)
d/db L(w,b,alpha) = 0: sum i = 1..n (alphai * yi) = 0
Substitute these first-order optimality conditions into the Lagrangian to eliminate the primal variables.
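A quick numerical check of the first condition, w = sum_i alphai*yi*xi, assuming scikit-learn (its dual_coef_ attribute stores alphai*yi for the support vectors) and a made-up toy set:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds alphai * yi for each support vector
w_from_alphas = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_alphas, clf.coef_))  # True: w = sum_i alphai*yi*xi
```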
For each constraint (each training example (i=1..n)) in the primal problem, what do we introduce?
introduce new variables called *dual variables* or *Lagrange multipliers*: alphai >= 0 - these multipliers give us a mechanism to ensure feasibility (ensure the optimal solns w & b achieve linear separation of the two classes)
What is the primal problem for soft-margin SVMs? How different from hard-margin SVM primal problem?
minimize (||w||^2)/2 + C * sum i=1..n (slack_i)
so that yi*(w^T * xi + b) >= 1 - slack_i and slack_i >= 0 for all i = 1..n
The slack_i are the slack variables: they allow flexibility for margin violations/ misclassifications by the model and thus "soften" the classification constraints. At the optimum, each slack_i equals the hinge loss of example i.
What is the optimization problem/ formulation for soft-margin SVM? How is complexity and fit balanced in the equation? What is the regularization parameter?
minimize over w,b: (1/2)*w^T * w + C * sum i=1..n (max{0, 1 - yi*(w^T * xi + b)})
first part = complexity, second part = fit
- the regularization parameter is C > 0 (sometimes denoted as lambda), which trades off between margin maximization and loss minimization
What is a hard-margin SVM? When is it feasible?
rigid SVM model; doesn't allow flexibility for misclassifications - only feasible when data set is linearly separable
What is bias for SVM?
Bias is towards selecting the classifier with the largest margin
How does C relate to regularization and overfitting?
C is the regularization parameter, so it defines the relative trade-off between the norm (bias/ complexity) and the loss (error/ variance)
- want to find classifiers that minimize (regularization + C * loss)
As C increases, the effect of regularization decreases and the SVM tends to overfit the data
Small C values -> low complexity, high training error (underfit)
Large C values -> high complexity, low training error (overfit)
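A small sketch of this trade-off, assuming scikit-learn and a synthetic noisy data set: training accuracy rises toward overfitting as C grows:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy synthetic data (flip_y adds label noise)
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C:7.2f}  training accuracy={clf.score(X, y):.3f}  "
          f"#support vectors={clf.n_support_.sum()}")
```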
What is the primal problem? What is the consequence of being constrained?
Constrained optimization problem; a convex, quadratic minimization problem guaranteed to have a global min
minimize over w,b: (w^T * w)/2
subject to yi * (w^T * xi + b) >= 1 for i = 1..n
Being constrained means we need additional optimization tools to ensure feasibility of solutions (solns meet the constraints)
What is a linearly separable data set?
Data set separable by a linear classifier/ hyperplane
What is the Lagrangian function?
L(w,b,alpha) = 1/2 (w^T * w) - sum i = 1 .. n (alphai * [yi * (w^T * xi + b) - 1]) - it's a function of the primal variables w & b AND the dual vars alpha - it converts a constrained optimization prob into an unconstrained optimization prob
What is the hard-margin SVM classifier a linear combination of?
Of training examples i = 1 .. n
What is the problem setup for SVMs for linearly separable data?
Find a linear classifier f(x) = w^T * x + b with the largest margin such that sign(f(x)) = +1 when positive example sign(f(x)) = -1 when negative example
How is the solution to Lagrangian function related to the original constrained problem?
If, at the minimizer of the Lagrangian, the penalty term sum i = 1..n (alphai * [yi * (w^T * xi + b) - 1]) equals 0, then the Lagrangian equals the original objective there, so that soln is ALSO the soln to the original constrained problem.
What is the loss function for soft-margin SVM?
L(f(xi), yi) = max(0, 1 - yi*(w^T * xi + b))
so, L = 0 if yi*(prediction) >= 1 and L = 1 - yi*(prediction) if yi*(prediction) < 1
- penalize each margin violation/ misclassification by the size of the violation using the hinge loss
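A minimal numpy sketch of the hinge loss over a batch of examples (w, b, and the data are made-up values):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Per-example hinge loss: max(0, 1 - yi * (w^T xi + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

w = np.array([1.0, -1.0]); b = 0.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [1.0, 2.0]])
y = np.array([1, 1, -1])

print(hinge_loss(w, b, X, y))  # [0.  0.5 0. ] -> only the in-margin point is penalized
```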
Give the following kernels: Linear kernel Polynomial kernel Gaussian kernel Sigmoid kernel
Linear kernel: k(x,z) = <x,z>
Polynomial kernel: k(x,z) = (<x,z> + c)^d , c, d >= 0
Gaussian kernel: k(x,z) = e^(-(||x - z||^2)/d), d > 0
Sigmoid kernel: k(x,z) = tanh(s*<x,z> + theta)
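A numpy sketch of these four kernels (the parameter names and default values are just illustrative):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, width=1.0):
    return np.exp(-np.sum((x - z) ** 2) / width)

def sigmoid_kernel(x, z, s=1.0, theta=0.0):
    return np.tanh(s * (x @ z) + theta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```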
What is the goal for an SVM? Write the goal as an equation
Maximize the margin of a classifier: given linearly separable data (xi, yi), learn a linear classifier w^T * x + b = 0 such that
- all training examples with yi = +1 lie ABOVE the margin
- all training examples with yi = -1 lie BELOW the margin
- the margin is maximized
maximize over w: gamma = 2/||w||, which is equivalent to: minimize over w: (w^T * w)/2
How does the dual problem change when doing the Kernel SVM approach?
Replace the inner products with the kernel function, i.e. replace the inner-products matrix with a kernel matrix (a is alpha):
maximize -(sum i = 1..n (sum j = 1..n (ai*aj*yi*yj*k(xi,xj))))/2 + sum i = 1..n (ai)
so that:
sum i = 1..n (ai*yi) = 0
C >= ai >= 0 for all i = 1..n
** make sure to maintain convexity of the problem: the kernel matrix must remain positive semi-definite so the dual objective stays concave (e.g. k(x,z) = (x^T * z)^2 is a valid kernel)
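A sketch of this route in scikit-learn: build the kernel (Gram) matrix explicitly and pass it with kernel='precomputed'; the data and the choice of a degree-2 polynomial kernel are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# Degree-2 polynomial kernel matrix K_ij = (xi^T xj)^2
K = (X @ X.T) ** 2

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

# To predict, supply the kernel values between test points and the TRAINING points
X_test = np.array([[0.5, 0.5], [2.5, 2.0]])
K_test = (X_test @ X.T) ** 2
print(clf.predict(K_test))
```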
What is the optimal classifier for hard-margin SVM?
Sparse linear combo of the training examples - so, a classifier that depends only on the support vectors - Meaning: if remove all other training examples except the support vectors, the soln will remain unchanged
Contrast hinge loss (SVM) with Perceptron loss function
*Perceptron loss*: max(0, -yi*(w^T * xi + b)); it is zero for every correctly classified point, so any separating hyperplane gives zero total loss and no margin is enforced.
*Hinge loss (SVM)*: max(0, 1 - yi*(w^T * xi + b)); it also penalizes correctly classified points that fall inside the margin, so minimizing it pushes the classifier toward the maximum-margin separator.
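A tiny numpy sketch contrasting the two losses on a correctly classified point that sits inside the margin (the score value is made up):

```python
import numpy as np

def perceptron_loss(score, y):
    return np.maximum(0.0, -y * score)        # 0 for any correct classification

def hinge_loss(score, y):
    return np.maximum(0.0, 1.0 - y * score)   # > 0 if correct but inside the margin

score, y = 0.4, 1                 # correctly classified, but margin < 1
print(perceptron_loss(score, y))  # 0.0 -> perceptron is satisfied
print(hinge_loss(score, y))       # 0.6 -> SVM still pushes for a larger margin
```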
What is a good problem example that SVM solves?
Text categorization: uses a bag-of-words representation to capture the frequency of words (though it loses semantics, word order, etc.)
- each document is a vector of word counts; categorize it e.g. as sports or politics
What is the best linear classifier for SVMs?
The hyperplane that: - achieves max separation between the two data sets (so, maximize the margin)
What is different about soft-margin SVMs?
The problem/ goal has changed: want the largest margin AND the minimization of misclassifications
- also, no longer assume the data is linearly separable, since that assumption isn't valid for real-world apps
What are the support vectors? What makes a hard-margin SVM solution sparse?
The training examples with ai > 0 since they support the classifier; all others have ai=0 and this makes the solution sparse
Can kernel be constructed from other kernels?
Yes! EX:
products of kernels: k(x,z) = k1(x,z)*k2(x,z)
conic combinations: k(x,z) = a1*k1(x,z) + a2*k2(x,z) with a1, a2 >= 0
products of functions: k(x,z) = f(x)*f(z) where f is a real-valued function
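A small numpy sketch checking that a kernel composed by these rules still gives a positive semi-definite kernel matrix (the data and component kernels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K1 = X @ X.T                                              # linear kernel matrix
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))    # Gaussian kernel matrix

K = 2.0 * K1 + 0.5 * K2 + K1 * K2   # conic combination plus elementwise product

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-9)       # True: K is (numerically) positive semi-definite
```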
What is the dual problem for soft-margin SVMs? How different from hard-margin SVM dual problem?
(a is alpha)
maximize -(sum i = 1..n (sum j = 1..n (ai*aj*yi*yj*xi^T*xj)))/2 + sum i = 1..n (ai)
so that:
sum i = 1..n (ai*yi) = 0
C >= ai >= 0 for all i = 1..n
The only difference from the hard-margin dual is that the Lagrange multipliers are upper-bounded by the regularization param C.
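A quick check of the box constraint 0 <= alphai <= C, assuming scikit-learn (|dual_coef_| equals alphai for the support vectors) and a made-up noisy data set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, flip_y=0.15, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()   # alphai for each support vector
print(alphas.min() > 0, alphas.max() <= C + 1e-9)  # True True: 0 < alphai <= C
```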