ML Midterm Prep

Information Gain - Decision Trees

- a mathematical way of capturing how much information one gains by picking an attribute to split on: the reduction in randomness over the labels you have with a set of data, given knowledge of a particular attribute's value
- information gain measures the reduction in entropy that results from partitioning the data on an attribute A; equivalently, it represents how effective an attribute is at classifying the data
- entropy is maximized when p+ = 1/2
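
A minimal Python sketch of these two quantities (the dataset, labels, and attribute values below are made-up illustrations):

```python
# Entropy and information gain for a labeled dataset (toy example).
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -Sum_v p(v) * log2 p(v); 0 when pure, 1 when p+ = 1/2 (binary)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy(S) minus the weighted entropy of each partition induced by A."""
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [y for y, a in zip(labels, attribute_values) if a == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Splitting on this attribute perfectly separates the labels, so the gain
# equals the full entropy of the set (1.0 bit here).
print(information_gain(['+', '+', '-', '-'], ['a', 'a', 'b', 'b']))  # 1.0
```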

Bayes Net / Belief Network

- no arrow from storm to thunder, so thunder is conditionally independent of storm given lightning
- if there were an arrow from storm to thunder, it would be like adding an attribute: we'd have to look at all the joint combinations of storm and lightning - the table size grows exponentially in the number of parents as you add more variables
- arrows don't imply causal relationships - they only imply probability relationships

Backpropagation

- backpropagation is a computationally beneficial organization of the chain rule: it computes the derivatives with respect to all the weights in the network in one convenient pass
- backprop works with any differentiable activation function: if we replace the sigmoid with another function, we can still compute the derivative and move the weights toward correct values
- activation info flows from input to output, and error info flows from output to input - this is how learning takes place
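
A toy sketch of backprop as an organized chain rule on a one-hidden-unit network (all sizes, values, and the squared-error loss are illustrative assumptions):

```python
# Forward pass sends activation input -> output; backward pass sends error
# output -> input via the chain rule, yielding dE/dw for every weight.
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x, y, lr = 0.5, 1.0, 0.1      # input, target, learning rate (made up)
w1, w2 = 0.3, -0.2            # hidden and output weights

for _ in range(1000):
    # Forward pass
    h = sigmoid(w1 * x)
    yhat = sigmoid(w2 * h)
    # Backward pass: for E = 0.5*(yhat - y)^2, dE/dyhat = (yhat - y),
    # and sigmoid'(a) = s*(1 - s) at each unit.
    d_out = (yhat - y) * yhat * (1 - yhat)
    grad_w2 = d_out * h
    d_hidden = d_out * w2 * h * (1 - h)
    grad_w1 = d_hidden * x
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```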

Prevent overfitting in Neural Networks

- cross validation helps us choose the number of nodes and when to stop updating the weights
- even with constant network size, as we keep improving the fit with more and more iterations we can still overfit (cross-validation error starts increasing), because the weights can grow larger with more iterations

KNN and LinReg Time and Space Complexity

- in instance-based learning, learning is fast and querying is slow; in LinReg, learning is slow and querying is easy
- in LinReg we only have to learn once but can query many times, so past enough queries (more than n), instance-based ends up with the worse overall running time
- a 1-NN query is O(lg n) using binary search, because we keep the (1-D) inputs in a sorted list
- k-NN query running time: if k is on the order of n/2, the whole query is O(n); if k is on the order of lg n, it's just O(lg n)
- LinReg learning runtime: a regression involves inverting a matrix, but populating that matrix is just a scan through the list, so it's O(n)

VC Dimensions Quiz Theta Cutoff

- largest set of inputs the hypothesis class can label in all possible ways? *ONE* - for any pair of inputs, a threshold can produce the labelings FF, FT, TT, but not TF

Bayes Rule and Definitions

- Pr(D): prior belief of seeing some particular set of data; ends up being a normalizing term and typically does not matter
- Pr(D|h): given a set of x's, the probability of seeing particular labels; this quantity is much easier to find than Pr(h|D)
- Pr(h): "prior" belief that one hypothesis is likely or unlikely compared to other hypotheses
- if the hypothesis was likely before, it's more likely after we've seen the data

Gradient Descent / Delta Rule

- push the weights in the direction that reduces error by taking the derivative of the error function with respect to each weight
- a is Sum[xi*wi] - not thresholded; we are just trying to minimize the mean squared error
- each weight's update sums, over the data, the difference between target output and activation times that input: Delta(wi) ∝ Sum[(y - a) * xi]
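
A small batch-gradient-descent sketch of this delta rule on an unthresholded linear unit (the data and learning rate are made up):

```python
# Delta rule: a = Sum_i w_i * x_i, and the gradient step for each weight
# accumulates (y - a) * x_i over the training data.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]   # first column is the bias input
Y = [3.0, 0.0, 1.5]                         # consistent with y = 1 + x
w = [0.0, 0.0]
eta = 0.05

for _ in range(500):
    delta = [0.0, 0.0]
    for x, y in zip(X, Y):
        a = sum(wi * xi for wi, xi in zip(w, x))   # not thresholded
        for i, xi in enumerate(x):
            delta[i] += (y - a) * xi               # Sum[(y - a) * x_i]
    w = [wi + eta * d for wi, d in zip(w, delta)]

print(w)   # approaches [1.0, 1.0] for this toy data
```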

Restriction Bias of Neural Network

- restriction bias tells us about representational power and the set of hypotheses we're willing to consider; if there's a great deal of restriction, we're only looking at a subset
- perceptrons: just consider planes
- perceptron networks: same as a simple perceptron but add XOR
- sigmoid and other activation functions: add even more complexity, not much restriction
- a NN with a single hidden layer can represent any continuous (non-jump) function; adding another hidden layer lets us represent any function
- NNs are not very restrictive in their bias as long as the network structure is sufficiently complex - thus we have a danger of overfitting

Boosting: Final Hypothesis

- sgn of the weighted sum of hypotheses: if it's below 0, output -1; if it's above 0, output +1

Bayesian min of squared error

- using Bayesian learning to find the maximum likelihood hypothesis and substituting a Gaussian noise model, we end up with the sum of squared errors
- this is the justification for minimizing the sum of squared errors in other algos
- assumes a true deterministic function with Gaussian noise on the labels; whenever you try to minimize the sum of squared errors, you are implicitly assuming the data has been corrupted by Gaussian noise

VC Dimensions and Shattering

- VC dimension: size of the largest set of inputs that the hypothesis class can label in all possible ways (shatter)
- VC dimension is related to the amount of data needed to learn effectively in the class; as long as the VC dimension is finite, even if the hypothesis class is infinite, we can say things about how much data we need to learn
- a way to talk about PAC learning and the amount of data we need when the hypothesis space H is infinite

VC Dimensions Quiz Range a,b

- the VC dimension can't be 3 or greater (the plus-minus-plus labeling is impossible for an interval), so we can't shatter 3 points in this example
- to prove VC >= 1 or VC >= 2, we just need to come up with one example of points that we can shatter (not all arrangements - e.g., points on top of each other couldn't be shattered in any dimension)
- in the case of 3 points, we must prove no example exists; proving a lower bound (1 and 2) seems easier than proving an upper bound (3)

VC Dimensions - Number of Parameters

- the VC dimension is often the number of parameters
- for any d-dimensional hyperplane concept class (or hypothesis class), the VC dimension is d+1
- the number of params needed to represent a d-dimensional hyperplane is also d+1

Bayesian Learning: Hypothesis in the Version Space

- |VS(H,D)| is the number of hypotheses from H that are consistent with D
- every consistent hypothesis has a posterior probability P(h|D) of 1/|VS|; every inconsistent hypothesis has a posterior probability of 0
- this holds only for noise-free examples, with the concept in the hypothesis space, and a uniform prior over all hypotheses
- no reason to pick one hypothesis over another in the version space - they are all equally likely to be picked; just pick something from the consistent set of hypotheses

Boosting: Good and Evil Distributions

- a good distribution is uniform, since we can pick h1 and it will be a weak learner (h2 and h3 would each have error 1/2)
- an evil distribution puts all its weight on x1 and x2, since h1, h2, and h3 would then all have an error rate of 1/2 - none would be a weak learner

Bias Unit

- always 1
- it's like making the threshold one of the weights, so we can compare the perceptron's weighted sum against 0

VC Dimensions of Finite H

- anything with finite VC dimension is learnable (from the previous bound), and the converse holds as well: if something is learnable, it has finite VC dimension; equivalently, infinite VC dimension means not learnable
- VC dimension captures the notion of PAC learnability

Boosting and Overfitting

- boosting stops once it has done enough rounds or there are no weak classifiers left
- in general, boosting can reduce the bias of (increase the complexity of) a model that is too simple, in contrast to another ensemble method, bagging, which works to reduce the variance of a model that is too complex
- one would think that adding more and more hypotheses to the ensemble would make the model more complex and overfit; however, the lectures showed that boosting acts like an SVM in that it increases confidence, i.e. maximizes the margin, between positive and negative examples - this can help prevent overfitting even though boosting is increasing model complexity
- boosting can still suffer from overfitting in the pathological case where the initial learner perfectly fits the training data, causing the weights to never be updated

Haussler's Theorem

- bounds true error as a function of the number of training examples drawn
- how much data do we need to knock out every hypothesis with true error greater than epsilon?
Pr(not epsilon-exhausted) <= |H| * e^(-epsilon*m)
m >= (1/epsilon) * (ln|H| + ln(1/delta))
- tells us how many training examples suffice to ensure that every hypothesis in the version space (having zero training error) has true error at most epsilon
- m grows linearly in the number of literals, linearly in 1/epsilon, and logarithmically in 1/delta
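
A quick sketch that plugs numbers into this bound (the example values are assumptions; with |H| = 10, epsilon = 0.1, delta = 0.2 it reproduces the m = 40 from the PAC/Haussler quiz later in these notes):

```python
# Haussler's sample-complexity bound: m >= (1/eps) * (ln|H| + ln(1/delta)).
from math import ceil, log

def haussler_m(H_size, epsilon, delta):
    return ceil((1.0 / epsilon) * (log(H_size) + log(1.0 / delta)))

print(haussler_m(10, 0.1, 0.2))   # 40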

Optimizing Weights - Gradient Descent vs Other Methods

- can use momentum, higher-order derivatives, randomized optimization, or a penalty for structure that's too complex
- in a NN, more nodes/layers => more local minima => overfitting
- large weight values can also cause overfitting

KNN Distance Function

- the choice of distance function matters (euclidean and manhattan are good for regression)
- the choice of k really matters: k=n is just an average of the y values, but with a weighted average the points nearest the query dominate and the result looks different
- locally weighted regression: in place of the averaging function you can plug in other algos over the k nearest points (dtree, neural net, just a line, etc.)
- locally weighted linear regression: take all nearby points and fit a line to them, so different points get differently sloped lines; we start with a hypothesis space of lines, but end up able to represent a hypothesis space bigger than the set of lines - a more complicated space
- the power of KNN: take local info and build functions around the nearest points

Decision Tree Greedy

- decides which attribute should be at the root of the tree by looking just one move ahead: it compares all available attributes to find which one classifies the data best, but it doesn't look ahead (or behind) at other attributes to see which combinations of them classify the data best; so while the algorithm will in most cases come up with a good decision tree, a better one may exist
- as the size of the training data and the number of attributes increase, it becomes more likely that ID3 returns a suboptimal tree

Decision Tree / SVM / 1-NN Decision Planes

- decision trees: nested rectangular splits (possibly in higher dimensions)
- a basic SVM finds a linear separator, as does a perceptron or simple neural net
- 1-nearest-neighbor: a Voronoi diagram - breaks the plane into the regions closest to each particular point

Bayes Optimal Classifier - finding the best label

- there's a difference between finding the best hypothesis and finding the best label
- for the most likely hypothesis: for all h in H, compute Pr(h|D) and output the argmax
- for the most likely value/label: take a weighted vote over all h in H, weighted by Pr(h|D)
- Bayes optimal classifier: on average, you cannot do better than a weighted vote

Why Sampling?

- distributions give probabilities and can generate values
- can simulate a complex process
- approximate inference: generate a bunch of samples where thunder is true to see how often storm was also true, instead of doing a complex probability calculation; approximation is faster
- visualization: intuitive understanding

Support Vector Machine (SVM)

- draws an optimal decision boundary that maximizes the margin width between classes
- gutter lines are parallel to the optimal boundary and go through the nearest positive- and negative-class points (which are called support vectors)
- the maximum-margin boundary best prevents overfitting: if the center line is too close to one side, we may be believing the training data too much
- for support vectors x1 and x2 on opposite gutters: wT*x1 + b = 1 and wT*x2 + b = -1; subtracting and projecting onto w gives a margin width of 2/||w||
- ||w|| is inversely proportional to the margin width, and w is always normal to the boundary

Eager vs Lazy Learner

- either do all the work up front with LinReg (eager learner), or put off any work until you have to with KNN/instance-based (procrastination / lazy learner)

PAC Learning / Haussler Example Quiz

- epsilon and delta are given
- hypotheses: one per bit, so |H| = 10 (even though the input space is 2^10, the number of hypotheses is just 10)
- plug into Haussler's theorem and we get m = 40 samples
- 40 / (input space) = 40 / 2^10 ≈ 4% of the possible inputs

Epsilon Exhausted Version Space

- the version space is epsilon-exhausted IFF every hypothesis in it has true error at most epsilon
- a hypothesis can have zero training error but a true error greater than zero
- if epsilon-exhausted, we can uniformly choose any hypothesis from the version space

LinReg find best constant function by deriving RMSE

- write the error function for a constant hypothesis c
- set the derivative of the sum of squared errors to 0; this derives to the constant c equal to the average of the y values (see the sketch below)
- we set the derivative of the error to 0 because that's where the cost function is lowest - a minimum
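
A sketch of that derivation, written out (standard calculus on the squared-error objective):

```latex
E(c) = \sum_{i=1}^{n} (y_i - c)^2
\frac{dE}{dc} = -2 \sum_{i=1}^{n} (y_i - c) = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i = n c
\;\Rightarrow\; c = \frac{1}{n} \sum_{i=1}^{n} y_i \quad \text{(the mean of the } y \text{ values)}
```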

SVM Properties of Alpha Supportiveness Values

- for all non-support-vector points, alpha = 0; if a point is a support vector lying on the gutter, it has alpha > 0
- no alpha is negative - no point contributes negatively to determining the boundary
- the sum of alphas on the positive side equals the sum of alphas on the negative side
- points further from the boundary have lower (zero) alpha, since they don't constrain ||w||
- with a smaller margin, moving a point a little changes the boundary a lot - those points play a greater role in determining the boundary

Bias - Variance Trade-off Diagram

- high variance is a complex model that's overfit; to fix, we can add more training data, decrease model complexity, reduce the number of features, or introduce regularization
- high bias is a simple model that's underfit; to fix, we can add features (e.g., polynomial features) or decrease regularization

Degree of Polynomial Train/CV Graph

- a higher degree fits the training data better at the expense of generalization (overfitting)

Boosting: Axis Aligned Semi Planes

- the hypothesis space H is axis-aligned semi-planes (positive examples on one side, negative on the other)
- if you take 3 such hypotheses and weight them accordingly, you end up with the bottom figure: the line bends around and captures the positive and negative examples perfectly
- a final weighted sum can represent a hypothesis at least as complicated as the original hypotheses, and often more complicated; it is more expressive because we are combining them, and the result is nonlinear because we pass it through the sgn function

Perceptron Rule Linear Separability

- if the data is linearly separable, the perceptron rule finds a set of weights that separates it in a finite number of iterations
- it is not easy to tell whether data is linearly separable

Perceptron Training Rule

- if the output is already correct, there will be no changes to the weights
- if the output is wrong, we move the weights in the appropriate direction
- alpha is the learning rate: how big a step to take
- run weight updates only until the rule classifies the data
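
A minimal sketch of the rule on a toy AND dataset (the data, names, and bias handling are illustrative; stopping once everything is classified mirrors the notes):

```python
# Perceptron training rule with a bias unit fixed at 1:
# w_i <- w_i + alpha * (y - yhat) * x_i, updated only on mistakes.
def train_perceptron(data, alpha=0.1, max_epochs=100):
    w = [0.0, 0.0, 0.0]                      # [bias weight, w1, w2]
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            xb = [1.0] + list(x)             # bias unit always 1
            yhat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if yhat != y:                    # correct output -> no change
                w = [wi + alpha * (y - yhat) * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:                      # stop once data is classified
            return w
    return w

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_data))
```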

Learning Theory: Learner chooses, teacher chooses, nature chooses

- if the teacher chooses the questions, we can find the answer in 1 step
- if the learner chooses the questions, it takes log2|H| steps - log of the hypothesis space size; we want to choose questions with maximum information gain
- the learner should be robust against any type of input choice (nature chooses)

Neural Network Activation Function

- if (sum of weighted inputs >= threshold): y = 1
- else: y = 0

Evolving posterior probabilities

- initially all hypotheses have the same probability; as training data accumulates, the posterior probability of inconsistent hypotheses becomes zero, while the total probability (summing to one) is shared equally among the remaining consistent hypotheses

KNN Domain Knowledge

- knn is flexible, since most of the domain knowledge is left up to the designer, so multiple implementations could end up with completely different answers
- we choose: k, the distance function, and how to break ties

Ensemble Learning

- learn over subsets of the data to get simple rules, then combine them to form one complex rule that works really well
- the simplest way to combine them is averaging

Boosting

- learn over subsets of weighted data; combining learners reduces inductive bias and improves performance by focusing on the 'hardest' examples
- initially assign uniform equal weights to each training example; the weights of examples the previous learner modeled poorly increase, while those of examples modeled correctly decrease (via a weight/distribution update rule)
- final classification: H(x) = SGN[a1*h1(x) + a2*h2(x) + ... + an*hn(x)]
- alpha is the voting power: learners with a low error rate over the distribution are given more voting power; a learner with error rate 1/2 gets voting power 0
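
A hedged sketch of this loop over a fixed pool of weak hypotheses (the pool format, data, and helper names are assumptions; in practice sklearn's AdaBoostClassifier does this for you):

```python
# AdaBoost: reweight the distribution each round, give each weak learner a
# voting power alpha, and output the sign of the weighted sum.
import math

def adaboost(X, Y, hypotheses, rounds):
    """X: inputs, Y: labels in {-1,+1}, hypotheses: callables returning +/-1."""
    n = len(X)
    D = [1.0 / n] * n                        # start with a uniform distribution
    ensemble = []                            # (alpha, h) pairs
    for _ in range(rounds):
        # weak learner for this round = lowest weighted error on D
        h = min(hypotheses,
                key=lambda h: sum(d for d, x, y in zip(D, X, Y) if h(x) != y))
        eps = sum(d for d, x, y in zip(D, X, Y) if h(x) != y)
        if eps >= 0.5:                       # no weak learner left: stop
            break
        if eps == 0.0:                       # perfect learner: take it, stop
            ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # voting power
        ensemble.append((alpha, h))
        # up-weight mistakes, down-weight correct (y * h(x) is +/-1)
        D = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(D, X, Y)]
        Z = sum(D)                           # normalization constant
        D = [d / Z for d in D]
    return ensemble

def final_H(ensemble, x):                    # H(x) = SGN[Sum a_t * h_t(x)]
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```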

Which hypothesis spaces are infinite? Example/quiz

- linear separators: an infinite number of lines; in y = mx + b we can plug an infinite number of values into m and b
- neural nets: infinite range of weights
- decision trees with discrete inputs: finite, assuming a finite number of features
- decision trees with continuous inputs: an infinite number of questions you can ask (is it greater than .1? greater than .11? is it...)
- K-NN: the classifier that comes out of k-nn is based on neighbor points that could be in an infinite arrangement, but Charles argues that if the training set is fixed, it is finite (k-NN is called non-parametric, but it really has an infinite number of potential parameters)

Cost/Error Function

- there are many different error functions, but RMSE/squared error is the easiest because we can use its derivative to minimize it

Decision Tree Continuous Attributes

- many more possible branch choices
- we can reuse a continuous attribute along a path by asking about different ranges (it doesn't make sense to repeat a discrete-valued attribute)
- info gain is difficult on continuous outputs

SVM Quadratic Programming

- maximize the margin subject to the constraint of classifying everything correctly
- we can rewrite: rather than maximizing 2/||w||, we minimize the reciprocal-style objective 0.5*||w||^2
- the minimization problem is easier to solve using quadratic programming
- the QP turns maximizing the margin width 2/||w|| into an equation that determines an alpha for each point: known linear algebra methods calculate each point's "supportiveness value" alpha, from which the optimal boundary's slope and offset follow as w = Sum[alpha(i) * y(i) * x(i)] and b
- points that are not support vectors have alpha = 0; the sum of positive-class alphas equals the sum of negative-class alphas; most alpha values are 0, so only a few of the xi points matter
- alpha values say whether to pay attention to a data point or not

Low vs High Information Gain

-measures the effectiveness of an attribute in classifying training data -info gain => reduction in entropy caused by partitioning the examples according to an attribute

Entropy and Mutual Information

-more bits => higher entropy

Boosting: Distribution/Weight Update

- more weight on the examples it gets wrong, less weight on the ones it gets right
- if we have a weak learner, we assume that whatever the distribution is, it will be correct on some of the ones we got wrong before
- update: D(t+1)(i) = D(t)(i) * e^(-alpha(t) * y(i) * h(t)(x(i))) / Z(t)
- y(i) * h(t)(x(i)) is +1 if they agree, -1 if they disagree
- alpha(t) = (1/2) * ln[(1 - error(t)) / error(t)] is a constant scalar (learning-rate-like), always positive for a weak learner since its error is between 0 and 1/2
- Z(t) is the normalization constant that makes it all work (keeps D a distribution)

Boosting: Weak Learner

-no matter what distribution over data - it will do better than chance - always going to have an error rate that's less than half -always gaining info

Natural Errors in Training data

-noisy data -transcription errors -unmodeled influences

Bayes Rule - Calculate Pr(D | h)

- the probability that all of these happen is a product, so we multiply (1/32 * 1/4 * 1/2 etc. = 1/65536)

Bagging

- train on random subsets drawn with replacement, then average the results at the end
- works well because you're not committing too much to any one subset of the training data

SVM Kernel (Trick)

- a special way to handle data that is not linearly separable
- replace the dot product in our equations with a kernel function, which effectively transforms the space to a higher dimension
- a simple dot product is a linear kernel that draws a linear hyperplane decision surface; a quadratic, RBF, or polynomial kernel can draw a more complicated decision boundary
- the kernel function is a similarity measure: we classify a new point by seeing how similar it is to the positive or negative support vectors
- the kernel injects domain knowledge into the SVM (like k in knn)
- Mercer condition: the technical requirement that the kernel be a well-behaved distance/similarity function
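
A small scikit-learn sketch contrasting a linear kernel with an RBF kernel on XOR-style data (the dataset here is a made-up illustration):

```python
# The linear kernel draws a hyperplane and fails on XOR-style labels;
# the RBF kernel implicitly maps to a higher dimension and separates them.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)     # XOR-style labels

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)           # kernel trick
print("linear accuracy:", linear.score(X, y))   # near chance
print("rbf accuracy:", rbf.score(X, y))         # near 1.0
```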

SVM vs KNN

- SVM is similar to KNN in that it uses only the points closest to the decision region, but SVM has the added advantage of being able to throw away the points that don't matter
- SVM is an "eager" counterpart: it eagerly determines the points near the margin up front, as opposed to KNN, which lazily finds the k nearest points at query time

Epsilon Exhausted Example

- the green rectangles are the training examples we see, so the version space VS = {x1, OR, XOR}
- the 4th example doesn't matter, since it has no probability of showing up; calculate the true error (w.r.t. the distribution) of each hypothesis in VS
- the first hypothesis, x1, has a true error of .5; the max true error in the version space is the smallest epsilon with which we can call the space epsilon-exhausted

Sample Complexity and VC Dimension - Finite vs Infinite

- there is a minimum number of samples for both the infinite-hypothesis case and the finite-hypothesis case
- as VC dimension gets bigger, we need more data

Topological/Acyclic Bayes Net

- a topological ordering exists only if there are no cycles
- the network must be a directed acyclic graph

Computational Learning Theory

- traditional theory of computing (big-O notation) focuses on time and space
- for learning theory we primarily care about sample complexity (size of the data): can we learn well from a small number of samples? The fewer samples the learner needs, the better it is generalizing
- we still care about time and space too
- Sample complexity (batch scenarios): how many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
- Computational complexity: how much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
- Mistake bound (online scenarios): how many training examples will the learner misclassify before converging to a successful hypothesis?

KNN Algorithm

Given:
- training data D
- similarity/distance metric
- number of neighbors k
- query point q
Find the k nearest neighbors - the points in D closest to the query point
- Classification: return a vote of the neighbors' y values
- Regression: return the average of the neighbors' y values
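
A compact sketch of this procedure (Euclidean distance and majority vote/mean are the illustrative designer choices here; the data is made up):

```python
# k-NN: sort the stored (x, y) pairs by distance to the query, take the
# top k, then vote (classification) or average (regression).
from collections import Counter
from math import dist            # Euclidean distance (Python 3.8+)

def knn(D, q, k, regression=False):
    neighbors = sorted(D, key=lambda xy: dist(xy[0], q))[:k]
    ys = [y for _, y in neighbors]
    if regression:
        return sum(ys) / k                       # return the average
    return Counter(ys).most_common(1)[0][0]      # return the vote

D = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(knn(D, (1, 1), k=3))   # 'a'
```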

Instance Based Learning

- with linear regression we learn some function and then throw away the data; instance-based learning stores the (x, y) pairs, and to predict we just look up the value
Pros: fast to train, simple
Cons: no generalization; can overfit by believing the data too much; the same x can map to multiple y's (a lookup returns both results and has to choose one)

MAP vs ML hypothesis

1) Maximum a posteriori (MAP) hypothesis - the largest posterior probability given the prior (distribution over h's): hMAP = argmax[Pr(h|D)] => expand with Bayes rule
2) Maximum likelihood (ML) hypothesis - hML = argmax[Pr(D|h)]
- with a uniform prior, hMAP = hML
- for each candidate hypothesis in H, calculate the Bayes rule probability and take the max (argmax), either ML or MAP; this may take a long time with a large hypothesis space (like the infinite H of linreg or neural nets), but it still provides a standard against which to judge the performance of other algos

Perceptron Rule vs Gradient Descent / Delta Rule

1) Perceptron rule
- thresholded: yhat is thresholded from a
- finite convergence when the data is linearly separable
2) Gradient descent / delta rule
- unthresholded: GD uses just the weighted sum a
- converges in the limit, possibly to a local optimum
- we can't do gradient descent on yhat because it's a step function and we can't differentiate it

Restriction and Preference Bias

1) Restriction bias - the hypothesis set you care about; in decision trees this is the set of all possible decision trees (DTrees are not considering y=2x or quadratic equations - this is not linear regression); instead of looking at an infinite number of hypotheses, we only look at a specific set of decision trees
2) Preference bias - which hypotheses from the hypothesis set we prefer; this is the heart of inductive bias

Training Error and True Error

1) Training error - fraction of training examples misclassified by h (the target concept has training error 0, but a hypothesis h may not)
2) True error - fraction of examples that would be misclassified on samples drawn from distribution D, i.e. the probability of a mismatch on a sample drawn from D; examples we never see can't contribute to the error
- if D is uniform, assigning the same probability to every instance in X, the error of h is the fraction of the total instance space where h and c disagree; if D assigns higher probability to instances where h and c disagree, the error will be higher

KNN Preference Bias

1) Locality - near points are similar to one another; nearness is embedded in whatever distance function we choose
2) Smoothness - by choosing to average, we expect functions to behave smoothly
3) All features matter equally
- a preference bias is the algo's belief about what makes a good hypothesis

Induction vs Deduction

Induction - going from examples to a more general rule (the sun rose each of the last 5 days, so tomorrow it will rise)
Deduction - going from a general rule to specifics; this is what AI originally was

Complementary Distribution

P(A | B) = 1- P(not A | B)

Naive Bayes - MAP Class

P(V | a1, a2, ... an) = Product[P(ai|V)] * P(V) / z
- one way of doing inference
- e.g., V is spam; the attributes are viagra/prince/udacity
- if we have a way of going from value to attributes, Naive Bayes lets us do the reverse: go from attributes to value
- MAP class - the most likely class given the attributes/data we've seen = argmax over V of Product[P(ai|V)] * P(V)
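
A toy sketch of the MAP-class computation (the spam/ham priors and likelihoods below are invented numbers, not real estimates):

```python
# MAP class: argmax over classes V of P(V) * Product[P(a_i | V)].
# The normalizer z can be ignored because it's the same for every class.
def map_class(attributes, priors, likelihoods):
    scores = {}
    for v, prior in priors.items():
        score = prior
        for attr, present in attributes.items():
            p = likelihoods[v][attr]
            score *= p if present else (1 - p)
        scores[v] = score
    return max(scores, key=scores.get)

priors = {'spam': 0.4, 'ham': 0.6}
likelihoods = {'spam': {'viagra': 0.3,   'prince': 0.2,  'udacity': 0.001},
               'ham':  {'viagra': 0.001, 'prince': 0.01, 'udacity': 0.1}}
print(map_class({'viagra': True, 'prince': False, 'udacity': False},
                priors, likelihoods))    # 'spam'
```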

Joint Distribution of Entire Bayes Network

P(a,b,c,d,e) = P(a) * P(b) * P(c|a,b) * P(d|b,c) * P(e|c,d)
- this is a more compact representation: only 14 numbers on the right-hand side, vs the 31 numbers the full joint on the left would need
- if all nodes were independent, it would be only 5 numbers
- a full joint distribution grows exponentially with the number of attributes; here the growth is exponential only in the number of parents:
- no parents: constant growth
- parents: grows exponentially with the number of parents
- the fewer the parents, the more compact the distribution ends up being

Bayes Rule Example Spleentitis

P(a|b) = 1 - P(not a | b)
Pr(+ | s) = 0.98, Pr(- | s) = 0.02
Pr(- | not s) = 0.97, Pr(+ | not s) = 0.03
Pr(s) = 0.008
Pr(s | +) ∝ Pr(+ | s) * Pr(s) = 0.98 * 0.008 = 0.00784
Pr(not s | +) ∝ Pr(+ | not s) * Pr(not s) = 0.03 * 0.992 = 0.02976
0.02976 / (0.02976 + 0.00784) ≈ 80%
- even though the test is positive, it's more likely that the person does not have the disease; the intuition is that a random person showing up at the office is unlikely to have the disease (0.8%)
- the prior Pr(s) really matters in this case
- if we see more data, the prior may change
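
A short check of this arithmetic (numbers copied from the example above):

```python
# Bayes rule with normalization over the two unnormalized numerators.
p_pos_given_s, p_pos_given_not_s, p_s = 0.98, 0.03, 0.008
num_s = p_pos_given_s * p_s                    # 0.00784
num_not_s = p_pos_given_not_s * (1 - p_s)      # 0.02976
print(num_s / (num_s + num_not_s))             # ~0.21 -> P(s | +)
print(num_not_s / (num_s + num_not_s))         # ~0.79 -> P(not s | +), ~80%
```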

Conditional Independence

P(x | y,z) = P(x | z)
P(Th | L,S) = P(Th | L)
- thunder is conditionally independent of storm given lightning
- the equality holds no matter what T/F values the variables take

Joint Distribution Example

Pr(not storm) = .35, Pr(lightning | storm) = .4615
- each time we add a true/false variable, the number of probabilities needed doubles
- if a variable has 5 values (e.g., food types), it multiplies the number of probabilities by 5

Preference Bias of Neural Network

Preference bias - tells us about the algo: given two representations, why it prefers one over the other (e.g., DTrees prefer nodes with high info gain at the top)
- we need somewhere in the infinite weight space to start: small random values, which give us low complexity and not many minima
- NNs *prefer simpler explanations - Occam's razor*; in neural nets there are a lot of unnecessary multiplications, and we get better generalization error with simpler hypotheses

Prior - Domain Knowledge

Prior belief is our domain knowledge (elsewhere, domain knowledge is the similarity metric in knn or the structure of a neural network); in Bayesian learning / Bayes rule, our domain knowledge sits in Pr(h)

Pros and Cons of Naive Bayes

Pros:
- even though inference in general Bayes nets is NP-hard, the naive bayes structure is cheap to infer from
- the number of parameters is linear: 2 probabilities for each attribute, plus 1 for the class
- countable probabilities: in labeled data, just count how often each attribute appears
- connects inference and classification - can generate missing attributes (useful if a decision tree hits an attribute with no value)
- empirically successful - with enough data, it's easy to estimate the probabilities
Cons:
- naive: the bayes-net structure does not represent the real world most of the time; it doesn't model interrelationships between attributes
- naive: assumes the attributes are independent of one another
- one unseen attribute (zero count) spoils the whole product - we can work around this by smoothing the probability estimates

Concept Learning Definitions: Target Concept, Hypothesis Space, Instance, Sample

Target concept - the desired function that maps inputs to outputs, as we want them
Hypothesis space - all hypotheses to consider, determined by the human designer's choice of hypothesis representation
Instance - an individual input (scalar/vector) from the set of inputs
Sample - a set of labeled training data (the training set)
- concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation; choosing a representation restricts (biases) the space that can be searched

VC Dimensions Quiz Linear Separator

VC >= 1? Yes - a line can label a single point in the 2d plane either way
VC >= 2? Yes - we can separate 2 points in any way
VC >= 3? Yes - 3 points in a triangular arrangement can be split in all ways and shattered
VC >= 4? No - can't do this (similar to the neural net XOR problem), so VC = 3
- the VC dimension of "points inside a convex polygon" is infinite

Entropy w/ Decision Trees

Entropy (a measure of the homogeneity of a set, between 0 and 1) is Entropy(S) = Sum over v of [-p(v) * log2 p(v)], where v is a possible value for the attribute (number of classifications) and p(v) is the probability of a sample having value v, derived from all the samples in the current set S
- entropy is 0 if all members belong to the same class
- entropy is 1 when the collection contains an equal number of positive and negative examples

K-Fold Cross Validation

Grid search: for each combination of hyperparameters, perform k-fold cross validation and take the average score over the k folds; the hyperparameter combo with the best/lowest error (or highest accuracy) is the model we choose to use on our test data (e.g., 3-NN has lower CV error than 5-NN)
- model selection uses k-fold cross validation; the test / held-out set is kept separate until the end (see the sketch below)
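
A scikit-learn sketch of this workflow (the dataset is synthetic; GridSearchCV performs the k-fold averaging internally):

```python
# Grid search over k for k-NN with 5-fold CV, then a final score on the
# held-out test set, which is touched only once at the end.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [3, 5, 7]},
                      cv=5)                 # average over k=5 folds
search.fit(X_train, y_train)
print(search.best_params_)                  # hyperparams with best CV score
print(search.score(X_test, y_test))         # only now touch the held-out set
```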

Decision Tree Expressiveness

Hypothesis space is very expressive because there's lots of different functions your tree can represent. But that also means we have to have some clever way to search among them. And it's why we need a smart algorithm to go through and pick out which decision tree - because if we aren't smart about it - we have to look at billions of possible decision trees.

Decision Tree Overfitting/Pruning

ID3 is prone to overfitting, since it splits on attributes until either it classifies the data perfectly or there are no more attributes to split on. Solutions include stopping the tree's growth before it becomes too large, or pruning the tree after it becomes too large. Typically a limit on a decision tree's growth is specified as a max number of layers (depth).
- post-pruning: after the tree has been built, remove nodes whenever doing so improves the validation score

Inferencing Rules - Marginalization, Chain Rule, Bayes Rule

The graph represents the chain rule: P(x,y) = P(y) * P(x|y)
- marginalization: P(x) = Sum over y of P(x,y)
- Bayes rule: P(y|x) = P(x|y) * P(y) / P(x)

Consistent Learner / Version Space Definitions

True hypothesis: c in H
Candidate hypothesis: h in H
Consistent learner: produces an h with h(x) = c(x) for all x in the training sample
Version space: all hypotheses that are consistent with the data we see

PAC (Probably Approximately Correct) Learnability

a learner can *probably* learn a hypothesis that is *approximately correct*
- PAC-learnable: can learn to low error with fairly high confidence in time roughly polynomial in the parameters
- C is PAC-learnable by L using H iff: learner L will, with probability 1-delta, output a hypothesis h in H with TrueError <= epsilon, in time and samples polynomial in 1/epsilon, 1/delta, and n
- error goal: the error of the hypothesis we produce should be no bigger than epsilon
- delta (certainty/confidence goal): with probability 1-delta, the algorithm has to produce TrueError <= epsilon
- perfect error or perfect certainty would need 1/epsilon or 1/delta to go to infinity, so PAC learnability only gives partial guarantees

Decision Tree

a tree structure where each node asks a yes/no question of an instance and the answer determines which branch to proceed down; the leaf node the instance arrives in contains the final classification. A single tree with its set of question nodes is a candidate concept. There can be many variations of trees (questions based on different features, the same questions in a different order, etc.) - just using the same attributes but mixing up the node order can result in a huge number of candidates. Each node/question should narrow the possibilities based on the outcomes of previous nodes.

Version Space Example: XOR

answers: x1, OR, XOR - these are the only hypotheses that are consistent with all the training data

Curse of Dimensionality

as the number of features/dimensions grows, the amount of data we need to generalize accurately grows exponentially
- for each feature we add, we need a lot more data
- this extends to other algos besides KNN
- each x at first covers 1/10 of the space, then in the plane it's 1/100 of the space, then in a cube 1/1000: we need 10^d x's, where d is the number of dimensions
- it's often better to add more data than to add more features

Perceptron Boolean Functions

a single perceptron can represent any linearly separable boolean function (AND, OR, NOT); XOR is not linearly separable, so it needs a network of 2 perceptrons

LinReg Degree / Overfitting (Order of Polynomial)

d=0 => constant, d=1 => line, d=2 => parabola
- higher degrees start to hit every point but can produce wild curves
- if we increase the degree too much, we start to overfit

Independently and Identically Distributed (IID)

a fundamental assumption of supervised learning: all data (train, cross-validation, test) should come from the same distribution - e.g., preserved via the stratify option in sklearn's train_test_split

Minimum Description Length

hMAP = argmin[length(D|h) + length(h)] = (error) + (size of h)
- hMAP is the hypothesis that minimizes error without paying too much of a price for the complexity of the hypothesis - we want the simplest hypothesis that minimizes error: Occam's razor
- misclassification/error and the size of h are inversely correlated (they trade off)
- prefer a smaller decision tree with less depth and fewer nodes
- if a neural net's weights get really big, we need more bits to represent them, so the length of the hypothesis is higher; we get overfitting when the weights get bigger
- length(h) is the number of bits needed to represent the hypothesis

Agnostic Learner

makes no assumption that the target concept is representable by H and simply finds the hypothesis with minimum training error
- slightly worse bounds than if the true concept were in H, but still polynomial in 1/epsilon and 1/delta

Sigmoid function

the primary consideration in choosing an activation function is to ensure it's differentiable
- as a (the weighted sum) goes to -inf, sigmoid goes to 0; as a goes to +inf, it goes to 1
- the derivative of the sigmoid is easy to calculate: Sigma(a) * (1 - Sigma(a))
- the derivative goes to 0 when a is very negative or very positive (saturation)

LinReg solving for weight/coefficients directly

solve for the weights/coefficients directly - these are the weights that minimize squared error (the normal equations; see the sketch below)
- this only fits the training data, so to prevent overfitting we may need to add regularization terms
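
A sketch of the direct solve via the normal equations, w = (X^T X)^(-1) X^T y (the data below is illustrative):

```python
# Solving the normal equations for least-squares weights with numpy.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column = bias
y = np.array([1.0, 3.0, 5.0])

# np.linalg.lstsq is the numerically safer call; solve() shows the math.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)    # [1., 2.] -> yhat = 1 + 2x minimizes squared error here
```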

Inductive Biases of ID3

• Good splits near the top
• Prefers correct over incorrect (given a tree with good splits at the top but the wrong answers, it will not take that one over a tree with worse initial splits that gives the right answers)
• Prefers shorter trees to longer trees (this comes naturally from having good splits at the top - the answer tends to come faster than if you didn't do that)
• ID3 searches a complete hypothesis space, but does so incompletely: once it finds a good hypothesis it stops (it cannot find others)

