CS7641 - Midterm
What might cause overfitting in decision trees?
- when there is noise in the data - when the number of training examples is too small to produce a representative sample of the true target function
what is the probability of success?
1 - delta where delta is the failure rate
How would we represent an OR in a bit string
By having both of the bits as 1s for the two values
How do we adapt the optimization problem to allow misclassifications in SVMs?
By introducing slack variables for each example
Why can we drop P(D) when computing the MAP hypothesis?
Because P(D) doesn't depend on h
What is Kullback-Leibler (KL) Divergence and what are some of its properties?
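It measures how much one probability distribution differs from another. It is always non-negative (zero only when the two distributions are identical), is not symmetric, and doesn't obey the triangle inequality, so it is not a true distance metric.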
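A minimal sketch of KL divergence for discrete distributions, assuming both are given as lists of probabilities over the same outcomes and that q is nonzero wherever p is. The function name and the example distributions are illustrative only.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions given as lists."""
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))                       # identical distributions give 0.0
print(kl_divergence(p, q), kl_divergence(q, p))  # both positive, and not equal
```

The two values on the last line differ, which is why KL divergence is not a symmetric distance.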
What does a high value of C mean in SVMs
Margin (misclassification) errors incur a high penalty
what is the chain rule for P(A ^ B ^ C)
P(A| BC) * P(BC) = P(A|BC) * P(B|C) * P(C)
What is the VC dimension?
The size of the largest set of inputs that the hypothesis space can shatter.
What is fitness proportionate selection
The probability that a hypothesis will be selected is the ratio of its fitness to the total fitness of all members of the current population
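A small sketch of fitness proportionate (roulette wheel) selection, assuming non-negative fitness values. The function names are hypothetical; `random.Random.choices` does the weighted draw.

```python
import random

def selection_probabilities(fitnesses):
    """Probability of picking each individual: its fitness over the total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def select(population, fitnesses, rng):
    """Draw one individual with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

population = ["0011", "1100"]
fitnesses = [1, 3]
print(selection_probabilities(fitnesses))  # [0.25, 0.75]
print(select(population, fitnesses, random.Random(0)))
```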
How can you avoid the feature bias in ID3?
Use a measure like gain ratio.
what does all ones in the bit string mean
We don't care
What is occam's razor in relation to neural networks?
We prefer smaller weights because they represent lower complexity
What does it mean when $ P(a,b) = 0$
a and b are mutually exclusive
How will decreasing the number of leaves affect the bias and variance of a decision tree?
increase bias decrease variance
what does uniform cross over do?
combines bits sampled uniformly from the two parents
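A sketch of uniform crossover on bit strings, assuming each position is taken from either parent with probability 1/2 and that the second child receives the complementary choice. The function name is hypothetical.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Build two children by choosing each bit from one parent or the other."""
    # mask[i] is True when child_1 takes bit i from parent_a.
    mask = [rng.random() < 0.5 for _ in parent_a]
    child_1 = "".join(a if m else b for a, b, m in zip(parent_a, parent_b, mask))
    child_2 = "".join(b if m else a for a, b, m in zip(parent_a, parent_b, mask))
    return child_1, child_2

c1, c2 = uniform_crossover("11111111", "00000000", random.Random(0))
print(c1, c2)
```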
how many rows would there be in the table to compute the joint distribution for 100 binary variables
2^100
For linear separators what is the VC dimension
3 (for lines in two dimensions; d + 1 for a hyperplane in d dimensions)
in a conjunction of literals what is the maximum number of hypotheses
3^k b/c there are three options with k variables
What is the probability that at least one hypothesis is consistent with concept c on m examples?
<= k * (1-epsilon)^m <= |H|(1-epsilon)^m
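Requiring the bound |H|(1-epsilon)^m to be at most delta, and using (1-epsilon) <= e^(-epsilon), gives the sufficient sample size m >= (1/epsilon)(ln|H| + ln(1/delta)). A short sketch of that arithmetic; the function name and the example numbers are illustrative.

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Conjunctions of literals over 10 boolean variables: |H| = 3^10.
m = sample_complexity(3 ** 10, epsilon=0.1, delta=0.05)
print(m)  # 140
```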
How do we calculate the expected size of a message
$\sum_i P(s_i) \cdot \#(s_i)$, where $\#(s_i)$ is the number of bits used to encode symbol $s_i$
What is the probability that one hypothesis is consistent with concept c on m examples?
(1-epsilon)^m
what is the conjugate prior
P(theta) is the conjugate prior for the likelihood function P(data | theta) if the form of P(theta) and P(theta | data) are the same
What are three reasons why a model will misclassify a test instance?
- if instances from different classes are described by the same feature vectors - the model lacks expressivity to exactly represent the target concept (high bias) - variance
Break down the formula for Bayes rule
- $Pr(D|h)Pr(h)$ is just the probability of D and h together - $Pr(D)$ is our prior belief of seeing some set of the data. It is usually just a normalizing term and we don't really care - $Pr(D|h)$ - probability of the data given the hypothesis; it is the likelihood that we would see data D given that the hypothesis was true
What are the limitation of ID3?
- ID3 maintains only a single current hypothesis as it searches through the space of decision trees, unlike Candidate-Elimination, which maintains the set of all hypotheses consistent with the available training examples - ID3 in its pure form performs no backtracking in its search, meaning it is susceptible to the risk of converging to a locally optimal solution
Describe the probability function in simulated annealing? What role does T play?
- A step to a better point is always taken; if it is a giant step down then we likely won't take it, while a small step down is more likely to be taken. - if the temperature is really high we are willing to take downward steps; if the temperature is really low we won't take any downward steps
What are appropriate problems for decision tree learning?
- Instances are represented by attribute-value pairs - the target function has discrete output values - disjunctive descriptions may be required: decision trees naturally represent disjunctive expressions - the training data may contain errors: decision tree learning methods are robust to errors in classification of the training examples and errors in the attribute values that describe these examples - the training data may contain missing attribute values
What are the caveats to boosting?
- The performance of boosting on a particular problem is dependent on the data and the base learner - boosting can fail to perform well given insufficient data, overly complex base classifiers or base classifiers that are too weak - boosting is susceptible to noise
What is the general algorithm for GA
- algorithm operates by iteratively updating a pool of hypotheses called the population - on each iteration all the members of the population are evaluated according to the fitness function. - a new population is then generated by probabilistically selecting the most fit individuals from the current population - some of these selected individuals are carried forward into the next generation population intact; others are used as the basis for creating new offspring by applying genetic operations such as crossover and mutation
What are some approaches to reduce crowding?
- alter the selection method such as using tournament selection or ranking selection - fitness sharing: the measured fitness of an individual is reduced by the presence of other similar individuals in the population - restrict the kinds of individuals allowed to recombine to form offspring: by allowing only the most similar individuals to recombine, we can encourage the formation of clusters of similar individuals or multiple subspecies within the population
What are the difficulties associated with gradient descent?
- converging to a local minimum can be slow - if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum
what are the assumptions in bayes learning?
- d_i are classification labels that are noise free - the true concept c is in the hypothesis space - uniform prior: we don't know anything about the hypothesis
What is the training rule for gradient descent? And break down the constituent parts
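- the rule is $\Delta \mathbf{w} = -\eta \nabla E(\mathbf{w})$, applied as $\mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}$ - eta is a positive constant called the learning rate which determines the step size in the gradient descent search - the negative sign is present because we want to move the weight vector in the direction that decreases $E$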
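A minimal sketch of the update w <- w - eta * grad E(w) for a one-input linear unit, assuming the squared error E = 1/2 * sum((t - o)^2). The data, learning rate, and step count are made up for illustration.

```python
def gradient_descent(xs, ts, eta=0.05, steps=500):
    """Fit o = w0 + w1 * x by batch gradient descent on squared error."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of E = 1/2 * sum((t - o)^2) with respect to w0 and w1.
        g0 = sum(-(t - (w0 + w1 * x)) for x, t in zip(xs, ts))
        g1 = sum(-(t - (w0 + w1 * x)) * x for x, t in zip(xs, ts))
        # Step against the gradient; eta sets the step size.
        w0 -= eta * g0
        w1 -= eta * g1
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]  # generated by t = 1 + 2x
w0, w1 = gradient_descent(xs, ts)
print(w0, w1)  # close to 1 and 2
```

With a much larger eta on this data the iterates overshoot the minimum and diverge, which is the risk of a step size that is too large.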
Why are GAs popular
- evolution is known to be a successful, robust method for adaptation within biological systems - GAs can search spaces of hypotheses containing complex interacting parts where the impact of each part on overall hypothesis fitness may be difficult to model - GAs are easily parallelized and can take advantage of the decreasing costs of powerful computer hardware
Describe hill climbing and its downside
- find the neighbor with the highest f(x); if it is higher than our current point, move to that point, else stop because that means we are at a local maximum - the local maximum found depends on the starting point
What is the chain rule in probability
P(x,y) = P(y|x) P(x)
What is the difference between gradient descent and SGD?
- in standard gradient descent the error is summed over all examples before updating weights; in SGD weights are updated upon examining each training example - summing over multiple examples in standard gradient descent requires more computation per weight step. Because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than SGD - In cases where there are multiple local minima w/ respect to $E(\mathbf{w})$, SGD can sometimes avoid falling into these local minima because it uses the various $\nabla E_d(\mathbf{w})$ rather than $\nabla E(\mathbf{w})$
Why is naive bayes "cool"
- inference is cheap - few parameters - estimate parameters with labeled data - connects inference and classification - empirically successful
What is the inductive bias for backpropagation?
- it largely depends on the interplay between the gradient descent search and the way in which the weight space spans the space of representable functions - smooth interpolation between data points - given two positive training examples with no negative examples between them backpropagation will tend to label points in between them as positive examples as well
why does overfitting tend to occur during later iterations but not during earlier iterations?
- the complexity of the hypotheses that can be reached by backpropagation increases with the number of weight-tuning iterations - given enough weight-tuning iterations, backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data
What are inputs to a GA
- the fitness function for ranking candidate hypotheses - a threshold defining an acceptable level of fitness for terminating the algorithm - the size of the population to be maintained - parameters that determine how successor populations are to be generated: the fraction of the population to be replaced at each generation and the mutation rate
When can gradient descent be applied?
- the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit) - the error can be differentiated with respect to these hypothesis parameters
What would increase the left hand side of Bayes Rule
- the probability of the hypothesis: the prior - also the likelihood (the probability of the data given the hypothesis )
What is tournament selection and what is its benefit?
- two hypotheses are first chosen at random from the current population; with some predefined probability p the more fit of these two is then selected, and with probability (1-p) the less fit hypothesis is selected - this method often leads to a more diverse population than fitness proportionate selection
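A sketch of tournament selection between two randomly drawn individuals, assuming `fitness` is a caller-supplied function and `p` is the probability of keeping the fitter one. All names and the default p are hypothetical.

```python
import random

def tournament_select(population, fitness, rng, p=0.8):
    """Pick two distinct individuals; return the fitter with probability p."""
    a, b = rng.sample(population, 2)
    fitter, weaker = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return fitter if rng.random() < p else weaker

population = [1, 2]
rng = random.Random(0)
print(tournament_select(population, lambda h: h, rng, p=1.0))  # always the fitter: 2
```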
How do we initialize weights in neural networks
- use small random values - random because it gives us variability so that we don't get stuck in local minima as well as gives us variability when we run things multiple times - small because it means we start with low complexity
For bayes P(X|Y) how many parameters would we have to estimate without conditional independence? with?
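$2(2^n - 1)$ without; $2n$ with (for n boolean attributes and boolean Y; in either case one more parameter is needed for the prior P(Y))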
How can we combat overfitting in neural networks?
- weight decay - decrease each weight by some small factor during each iteration - keep weight values small to bias learning against complex decision surfaces - use a validation set - the algorithm monitors the error with respect to the validation set while using the training set to drive the gradient descent search - keep iterating until the error over the validation set becomes significantly higher than the lowest validation error seen so far, then use the weights that gave that lowest error
Under what circumstances does MIMIC do well
- when the optimal values depend on the structure and not optimal point values MIMIC does well - it is not enough to represent the probability distribution of the optima we want to represent everything in between
to be pac learnable what are the bounds on epsilon and delta
0 <= delta <= 1/2; 0 <= epsilon <= 1/2
What is the minimum entropy?
0; when exactly one of the probabilities is one and all the rest are zero
What are the value for most of the alphas in the W(\alpha) equation and what does it mean?
0; the optimal w can be recovered from only a few of the vectors (data points), the support vectors, so only a few x's matter
What is the annealing algorithm
1) Sample a new point x_t in N(x) 2) Jump to the new sample with probability given by an acceptance probability function P(x, x_t, T) 3) Decrease temperature T
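The three steps can be sketched as below for a maximization problem. This assumes the commonly used acceptance probability exp((f(x_t) - f(x)) / T) for worse points and a geometric cooling schedule; the card does not fix either choice, and all names and parameter values are illustrative.

```python
import math
import random

def acceptance_probability(f_x, f_new, temperature):
    """Always accept an improvement; accept a worse point with prob exp(diff / T)."""
    if f_new >= f_x:
        return 1.0
    return math.exp((f_new - f_x) / temperature)

def simulated_annealing(f, x0, neighbors, rng, temperature=10.0, cooling=0.95, steps=500):
    x, best = x0, x0
    for _ in range(steps):
        x_new = rng.choice(neighbors(x))              # 1) sample a neighbor
        if rng.random() < acceptance_probability(f(x), f(x_new), temperature):
            x = x_new                                 # 2) jump with probability P
        if f(x) > f(best):
            best = x                                  # remember the best point seen
        temperature *= cooling                        # 3) decrease T
    return best

f = lambda x: -(x - 7) ** 2                           # single peak at x = 7
neighbors = lambda x: [max(0, x - 1), min(20, x + 1)]
best = simulated_annealing(f, 0, neighbors, random.Random(0))
print(best)
```

At high T almost every move is accepted (a random walk); as T shrinks the acceptance probability for downhill moves goes to zero and the search behaves like hill climbing.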
Approximately how many data points will be left out with bootstrap sampling?
1/3
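A quick check of where the 1/3 comes from: when n points are drawn uniformly with replacement from a set of n, a given point is missed by every draw with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The function name is hypothetical.

```python
import math

def fraction_left_out(n):
    """Probability that a given point never appears in a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

print(fraction_left_out(1000), math.exp(-1))  # both roughly 0.37
```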
If we flip a coin 10 times, how many bits do we need to send the message containing the results.
10
How many data points will you need to obtain an MLE of P(Y) within a few percent of its correct value?
100
For intervals what is the VC dimension
2
What is the equation we are trying to maximize in SVMs and what does it represent
2 / ||w|| while classifying everything correctly and that is the margin
What is overfitting?
A hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).
What is a consistent learner?
A learner that outputs a hypothesis consistent with the training data; that is, its output matches the true concept on all of the training examples
When X is a vector of discrete-values attributes what can Naive Bayes be considered?
A linear classifier
How can Adaboost help identify outliers?
Adaboost focuses its weight on the hardest examples; the examples with the highest weight often turn out to be outliers.
Why does boosting end up increasing the margin (in a quantifiable sense)?
Because it focuses on the hardest examples, it keeps trying to maximize the margin on the training set, which leads to a drop in test error.
Why do perceptrons always compute lines?
Because the unit thresholds a linear combination of its inputs, so its decision boundary is a line (a hyperplane in general)
Why is bagging often used with trees?
Because they have high variance, and averaging many trees reduces it.
Why do we choose parameters that minimize the sum of squared training errors in Linear Regression.
Because this corresponds to the MLE assuming that data is generated from a linear function plus Gaussian noise.
In regression how is the best line defined?
Best fit is defined as the line that has the least squared error.
What does boosting reduce bias or variance?
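Boosting is primarily a bias reduction technique, since it combines weak (high-bias) learners into a stronger one; in practice it often reduces variance as well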
How do we choose the distribution for the data that will be chosen in Adaboost?
By placing the most weight on the examples most often misclassified by the preceding weak rules. This has the effect of forcing the base learner to focus its attention on the "hardest" examples
How do we determine what attribute is the best classifier?
By using information gain. It measures how well a given attribute separates the training examples according to their target classification.
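A small sketch of entropy and information gain for a single discrete attribute, assuming the attribute values and the class labels are given as parallel lists. The function names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = ["+", "+", "-", "-"]
print(information_gain(["a", "a", "b", "b"], labels))  # attribute separates the classes
print(information_gain(["a", "b", "a", "b"], labels))  # attribute tells us nothing
```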
How can we use a validation set to prevent overfitting?
Consider each of the decision nodes in the tree to be candidates for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set. Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.
True or False. The depth of a learned decision tree can be larger than the number of training examples used to create the tree.
False: Each split of the tree must correspond to at least one training example, therefore, if there are n training examples, a path in the tree can have length at most n.
True or False. We learn a classifier f by boosting weak learners h. The functional form of f's decision boundary is the same as h's, but with different parameters. (e.g., if h was a linear classifier, then f is also a linear classifier).
False: For example, the functional form of a decision stump is a single axis-aligned split of the input space, but the functional form of the boosted classifier is linear combinations of decision stumps which can form a more complex (piecewise linear) decision boundary
How do random forests reduce variance?
First, bagging isn't enough we need our models to be uncorrelated. - we use a sample of bootstrapped training data - random vector method: best split at each node is chosen from a random sample of m attributes instead of all attributes
What is generalization error?
Generalization error is the probability of misclassifying a new example, while the test error is the fraction of mistakes on a newly sampled test set. Generalization error is the expected test error.
What is eager learning?
Generalizing beyond the training data before observing the new query.
mistake bounds
How many misclassifications can a learner make over an infinite run?
What does backpropagation do?
It has the interpretation of the inputs flowing forward to the output and the error flowing back toward the input. It learns the weights of a multilayer network by using gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs
What is alpha in the adaboost algorithm
It measures the importance that we assign to each learned classifier, and is set as a function of the error achieved at time t
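A sketch of one AdaBoost round, assuming the common binary formulation with labels and predictions in {-1, +1}, alpha_t = 1/2 * ln((1 - eps_t) / eps_t), and weights rescaled by exp(-alpha_t * y * h(x)) before normalizing. The card does not state these formulas, so they are an assumption here, as is the function name. The weighted error must lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, labels, predictions):
    """Return (alpha, new_weights) for one weak classifier's predictions."""
    # Weighted error: total weight on the misclassified examples.
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified examples (y * h = -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, labels, predictions)]
    z = sum(updated)
    return alpha, [w / z for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # only the last example is misclassified
alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha, new_weights)
```

All misclassified examples are multiplied by the same factor, and after normalization they carry half of the total weight.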
What does it mean to be epsilon-exhausted?
A version space is epsilon-exhausted if everything you might choose from it has an error less than epsilon; if there is anything in the version space that has an error greater than epsilon, it isn't epsilon-exhausted
Consider a learning problem with 2D features. How are the decision tree and 1-nearest neighbor decision boundaries related?
In both cases, the decision boundary is piecewise linear. Decision trees do axis-aligned splits while 1-NN gives a Voronoi diagram
Why does the kernel trick allow us to solve SVMs with high dimensional feature spaces, without significantly increasing the running time?
In the dual formulation of the SVM, features only appear as dot products which can be represented compactly by kernels
What is the gist of simulated annealing?
It is the balance between trying to improve (exploiting) and searching through the space (exploring). In the context of randomized hill climbing, exploiting refers to always trying to climb up the hill as soon as possible and exploring refers to visiting more of the space to find out if there is a better maximum
What is one of the dangers of the MLE in Naive Bayes and what do you do about it?
It will output a zero if the data does not contain any training examples satisfying the condition in the numerator. To avoid this, it is common to use a "smoothed" estimate which effectively adds in a number of additional "hallucinated" examples, assumed to be spread evenly over the possible values of X_i. This corresponds to the MAP estimate if we assume a Dirichlet prior distribution.
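A sketch of the smoothed estimate of P(X_i = x | Y = y), assuming `pseudo` hallucinated examples are added for each of the `num_values` possible values of X_i. The function and parameter names are hypothetical.

```python
def smoothed_estimate(count_xy, count_y, num_values, pseudo=1):
    """(count(X_i = x and Y = y) + pseudo) / (count(Y = y) + pseudo * num_values)."""
    return (count_xy + pseudo) / (count_y + pseudo * num_values)

# A value never seen with class y among 10 examples of that class, binary attribute.
print(smoothed_estimate(0, 10, 2, pseudo=0))  # plain MLE: 0.0
print(smoothed_estimate(0, 10, 2))            # smoothed: 1/12
```

With pseudo=0 this reduces to the MLE, which is where a single zero count can wipe out the whole product of conditional probabilities.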
What are the differences between lazy and eager learning as it relates to computation time.
Lazy learners will require less computation during training but more computation when they must predict the target value for a new query
Describe the differences in the inductive bias of lazy vs eager learners?
Lazy methods may consider the new query instance when deciding how to generalize beyond the training data, while eager methods cannot consider the new query instance because they have already chosen their global approximation to the target function. This affects the accuracy of the learner because the eager learner must commit to a single hypothesis instead of using many different local approximations like the lazy learner.
Describe random hill climbing and its advantages over hill climbing?
Like hill climbing but once a local maximum is reached try again starting from a randomly chosen x. - advantages: multiple tries to find a good starting point, not much more expensive (constant factor)
AdaBoost will eventually give zero training error regardless of the type of weak classifier it uses, provided enough iterations are performed.
False: not if the data in the training set cannot be separated by a linear combination of the specific type of weak classifiers we are using. For example, consider the EXOR example (in hw2 we worked with a rotated EXOR toy dataset) with decision stumps as weak classifiers. No matter how many iterations are performed, zero training error will not be achieved.
why is smoothing important in naive bayes
Otherwise you will overfit because you're believing your data too much; smoothing introduces an inductive bias that counters this
What is bayes rules?
P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|!A) P(!A)]
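A worked example of Bayes rule with the denominator expanded by total probability, P(B) = P(B|A) P(A) + P(B|!A) P(!A). The numbers (a rare condition and an imperfect test) are illustrative, and the function name is hypothetical.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from the prior P(A) and the two conditional probabilities of B."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# P(A) = 0.008, P(B|A) = 0.98, P(B|!A) = 0.03
print(posterior(0.008, 0.98, 0.03))  # about 0.21
```

Even with a positive result the posterior stays low because the prior P(A) is so small.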
How would you calculate P(W | G, H)?
P(W | G, H) = P(W, G, H) / P(G, H)
What is the joint probability of both a and b?
P(a, b) = P(a|b)P(b)
How do you come to bayes theorem through the joint probability formula
P(a|b) P(b) = P(b|a) P(a), so P(a|b) = $\frac{P(b|a)P(a)}{P(b)}$
What is preference bias/search bias?
Preference for certain hypotheses over others, with no hard restriction on the hypotheses that can eventually be enumerated
What is mutual information and what is the formula
The reduction of randomness of a variable given information about another variable. $I(X, Y) = H(Y) - H(Y | X)$
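A sketch that computes H(Y) - H(Y | X) from a joint distribution, assuming the joint is given as a dictionary mapping (x, y) pairs to probabilities. The function name and the two example distributions are illustrative.

```python
import math

def mutual_information(joint):
    """I(X, Y) = H(Y) - H(Y | X) in bits; joint maps (x, y) -> P(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal P(x)
        py[y] = py.get(y, 0.0) + p   # marginal P(y)
    h_y = -sum(p * math.log2(p) for p in py.values() if p > 0)
    # H(Y | X) = -sum over (x, y) of P(x, y) * log P(y | x), with P(y | x) = P(x, y) / P(x).
    h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items() if p > 0)
    return h_y - h_y_given_x

independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
copies = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(independent))  # X tells us nothing about Y
print(mutual_information(copies))       # X determines Y completely
```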
What happens when the value of eta is sufficiently small in SGD?
SGD can be made to approximate true gradient descent arbitrarily closely.
Describe the properties of simulated annealing?
T -> 0: like random hill climbing T -> inf: random walk Decreasing T slowly allows us to explore at the current Temperature before cooling it. When the temperature is high, it doesn't notice big valleys. As the temperature gets cooler we start to break things up into different basins of attraction
How do we combine the weak learners?
Take a (weighted) majority vote of their predictions
What is the effect of removing a node only if the pruned tree performs no worse than the original over the validation set?
That any leaf node added due to coincidental regularities in the training set is likely to be pruned because these coincidences are unlikely to occur in the validation set
What assumption do most models have?
That the data is independent and identically distributed
What is conditional entropy and what is the formula?
The randomness of one variable conditional on another: $H(Y | X) = -\sum_{x, y} P(x, y) \log P(y | x)$
What is the version space?
The space of all the hypotheses that are consistent with the data
What does it mean if two events a and b are such that P(a|b) = P(a)
The two events a and b are independent.
What do kernels do?
They act implicitly as if data is in a higher dimensional space
Briefly explain GAs
They provide a learning method motivated by an analogy to biological evolution. Rather than search from general-to-specific hypotheses or from simple to complex, GAs generate successor hypotheses by repeatedly mutating and recombining parts of the best currently known hypotheses. At each step a collection of hypotheses called the current population is updated by replacing some fraction of the population by offspring of the most fit current hypotheses
true/false A classifier trained on less training data is less likely to overfit.
This is false. A specific classifier (with some fixed model complexity) will be more likely to overfit to noise in the training data when there is less training data, and is therefore more likely to overfit.
What is the essence of regularization? And how does it prevent overfitting?
To come up with a measure for the complexity of an individual hypothesis so instead of just minimizing the training error we minimize the complexity and the training error. This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis
What does the kernel representation allow for?
To use discrete data as long as we have some notion of similarity
true/false In AdaBoost weights of the misclassified examples go up by the same multiplicative factor.
True, follows from the update equation.
true/false In AdaBoost, the weighted training error $\epsilon_t$ of the t-th weak classifier on training data with weights $D_t$ tends to increase as a function of t
True. In the course of boosting iterations the weak classifiers are forced to try to classify more difficult examples. The weights will increase for examples that are repeatedly misclassified by the weak classifiers. The weighted training error $\epsilon_t$ of the t-th weak classifier on the training data therefore tends to increase.
Which bias is better preference or restriction and why?
Typically a preference bias is more desirable than a restriction bias because it allows the learners to work within a complete hypothesis space that is assured to contain the unknown target function. A restriction bias that strictly limits the set of potential hypotheses is bad b/c it introduces the possibility of excluding the unknown target function altogether.
What are we doing with Kernel functions?
We are "projecting" our points into a higher dimensional space where they are linearly separable
When should you use MIMIC?
When the cost of evaluating the fitness function is high
define conditional independence
X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z
What assumption does naive Bayes rely on
X_i are conditionally independent, given Y.
Will the perceptron rule find a half plane if the data is linearly separable
Yes, in a finite number of steps
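A sketch of the perceptron rule on a linearly separable toy problem (logical AND), assuming a 0/1 threshold unit with the bias kept as an extra weight and the update w_i <- w_i + eta * (target - output) * x_i. Names and parameter values are illustrative.

```python
def predict(w, x):
    """Threshold the activation: 1 if w[0] + w[1:] . x >= 0, else 0."""
    activation = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if activation >= 0 else 0

def train_perceptron(examples, eta=0.1, max_epochs=100):
    w = [0.0] * (len(examples[0][0]) + 1)   # w[0] is the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in examples:
            error = target - predict(w, x)  # 0 if correct, +1 or -1 otherwise
            if error != 0:
                mistakes += 1
                w[0] += eta * error
                for i, xi in enumerate(x):
                    w[i + 1] += eta * error * xi
        if mistakes == 0:                   # a full pass with no errors: done
            break
    return w

and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_examples)
print(w, [predict(w, x) for x, _ in and_examples])
```

On data that is not linearly separable (XOR, for instance) the loop would simply run until max_epochs without ever completing an error-free pass.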
To not have to worry about k in KNN, what can we use?
a distance-weighting function.
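A sketch of distance-weighted prediction over all training points, assuming inverse squared Euclidean distance as the weighting function (one common choice; the card does not name one) and a real-valued target. The function name and data are illustrative.

```python
def distance_weighted_predict(train, query):
    """Weighted average of targets, with weight 1 / d^2 for each training point."""
    numerator = 0.0
    denominator = 0.0
    for x, y in train:
        d2 = sum((a - b) ** 2 for a, b in zip(x, query))
        if d2 == 0:
            return y  # the query coincides with a training point
        weight = 1.0 / d2
        numerator += weight * y
        denominator += weight
    return numerator / denominator

train = [((0.0,), 0.0), ((2.0,), 2.0)]
print(distance_weighted_predict(train, (1.0,)))  # 1.0, equidistant from both points
print(distance_weighted_predict(train, (0.5,)))  # pulled toward the nearer point
```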
In KNN, when all the training examples are considered what is it called
a global method
What is an agnostic learner
a learner that makes no assumption that the target concept is representable by H and that simply finds the hypothesis with minimum training error; it is called agnostic because it makes no prior commitment about whether C is a subset of H
In KNN, when only the nearest training examples are considered what is it called
a local method
What does a low value of C mean in SVMs
a low value permits more margin errors in order to achieve a large margin
What is the semantic hypothesis space?
actually different functions that you're representing.
What is the syntactic hypothesis space
all the things that you could actually write. In the case of decision trees w/ discrete inputs, you could write different hypotheses but they wouldn't be meaningfully different
What is the maximum likelihood hypothesis?
any hypothesis that maximizes $P(D|h)$; in the cases when we consider each hypothesis equally probable we only need to consider the term P(D|h) to find the most probable hypothesis
What is stochastic gradient descent?
approximates the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example
How does momentum help when fitting a neural network?
as we are doing gradient descent we don't want to get stuck in a local minimum; we can use momentum to bounce out of it
What are two ways in which training examples are presented
batch where the training set is fixed and online where the training algorithm is given an example one at a time.
Why is it important to think of phi as implicit and not explicit?
because this would make the feature space grow really large really quickly
If likelihood is binomial and the prior is a beta distribution, what is the posterior distribution
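beta distribution: P(theta | data) ~ Beta(\alpha_H + \beta_H, \alpha_T + \beta_T), where \alpha_H, \alpha_T are the observed counts of heads and tails and \beta_H, \beta_T are the parameters of the prior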
Simpler models usually have a higher _ and a lower _
bias; variance
What does Hoeffding theorem do
Hoeffding bounds characterize the deviation between the true probability of some event and its observed frequency over m independent trials. If the training error is measured over a set D containing m randomly drawn examples, then Pr[true error > training error + epsilon] <= e^{-2 m epsilon^2}
How does Naive Bayes reduce the number of parameters that need to be estimated?
by assuming all attributes describing X are conditionally independent given Y
how are hypotheses represented in GAs
by bit strings so that they can be easily manipulated during mutation and crossover
What are you trying to do in maximum likelihood
choose the parameters theta that maximize the probability of the observed data
What is one of the difficulties associated with GA applications?
crowding; which is when some individual that is more highly fit than others in the population quickly reproduces so that copies of this individual and very similar individuals take over a large fraction of the population. crowding reduces the diversity of the population which slows the progress of the GA
What is the vc dimension for d-dimensional hyperplane
d+1
What does the fitness function do?
defines the criterion for ranking potential hypotheses and for probabilistically selecting them for inclusion in the next generation population
What do SVMs do?
directly optimize for the maximum margin separator
What is the formula for the true error
error_D(h) = Pr[c(x) != h(x)]
What is the hypothesis space search in backpropagation
every possible assignment of network weights represents a syntactically distinct hypothesis that in principle can be considered by the learner
what is the true error
fraction of examples that would be misclassified on a sample drawn from a distribution D.
What is the training error?
fraction of training examples misclassified by h
What is a concept?
function that maps inputs (instances) to outputs
Compare gradient descent and the perceptron learning rule
gradient descent rule - moves in the negative direction of the gradient - more robust to data sets that are not linearly separable - converges to a local optimum; perceptron rule - takes the activation, thresholds it, and moves the weights in the direction that corrects the error - has a guarantee of finite convergence when data is linearly separable
What is the restriction bias of perceptrons?
half spaces
sample complexity
how many training examples are needed for a learner to converge (with high probability) to a successful hypothesis (batch)
What happens if eta is too large?
if eta is too large the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it
what does it mean for the data to be linearly separable?
if there is a half plane that separates all of the positives and negatives - we give a learning rate for the weights w but not for the theta - we iterate over the training data and we change $w_i$ by the amount we change the weight by - if the output is correct there's no change to the weights - if the output is too high we move the weights in a negative direction - if the output is too small we increase the weights in a positive direction - the learning rate helps us set how much we want to move in a certain direction
When is C PAC-learnable by L using H
iff learner L will, with probability at least 1 - delta, output a hypothesis h in H such that error_D(h) <= epsilon, in time and samples polynomial in 1/epsilon, 1/delta, and the size of the hypothesis space
How will increasing k in a k-nearest neighbor classifier affect the bias and variance?
increase bias; decrease variance
What is the VC dimension for complex polygons?
infinite
What is the key advantage of lazy learning
instead of estimating the target function once for the entire space, they can estimate it locally and differently for each new instance to be classified
Why do we use the sigmoid function as our activation function?
It is differentiable, and as the activation gets larger the output gets closer to one; the more negative the activation gets, the closer the output goes to zero
What is the perceptron rule?
is how to set the weights of a single unit so that it matches some training set
If the data isn't perfectly linearly separable what does a SVM try to do?
maximize the margin and minimize the number of misclassifications, optimizing the sum ||w||^2 + C * (number of mistakes)
What is the restriction bias of sigmoid?
much more complex; not much restriction
Why does overfitting happen with more complex models?
noise leads the learning astray, and the larger, more complex model is more susceptible to noise than the simpler one because it has more ways to go astray
What does 2-point cross over do?
offspring are created by substituting intermediate segments of one parent into the middle of the second parent string
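A sketch of two-point crossover on bit strings, assuming the two cut points `start` and `end` are given (in a GA they would be drawn at random). The function name is hypothetical.

```python
def two_point_crossover(parent_a, parent_b, start, end):
    """Swap the segment [start, end) between the two parents."""
    child_1 = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_2 = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_1, child_2

print(two_point_crossover("11111111", "00000000", 2, 5))
# ('11000111', '00111000')
```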
Define preference bias
our belief about what makes a good hypothesis.
What does overfitting lead to?
poor generalization
What is occam's razor?
prefer the simplest hypothesis that fits the data.
What does the crossover operator do?
produces two new offspring from two parent strings by copying selected bits from each parent
What are bootstrap samples
samples taken uniformly with replacement
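A bootstrap sample can be drawn with the standard library alone. This sketch assumes the sample has the same size as the original data set, which is the usual convention; the function name is hypothetical.

```python
import random


def bootstrap_sample(data, rng=random):
    """Draw len(data) items uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))
```

Because the draw is with replacement, some items typically appear more than once and others not at all, so different bootstrap samples of the same data set differ from one another.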
In the Maximum a posteriori probability what are you trying to do
seek the estimate of theta that is itself most probable given the observed data, plus background assumptions about its value
What is a hypothesis (class)?
set of all concepts (or functions) that we are willing to entertain.
What does learning consist of in instance-based algorithms?
storing the data in O(n) time
If we allow more errors we need fewer what?
support vectors; C then controls some of the complexity of the SVM and is sometimes referred to as the complexity parameter
What does delta represent in the formula
the delta statement is probabilistic because the sample is a random sample from the underlying distribution and there's always a chance that the sample we see is not representative of the underlying distribution; so delta is the unlikeliness of this happening
What is the activation in neural networks
the dot product of the weights and the inputs
Explain occam's razor in regards to Bayes Rule and
the length of the data given the hypothesis P(D|h) - if the hypothesis generates the data really well then we don't really need the data, the data is free - if it deviates a lot from the hypothesis we will need a long description to describe where the data fits
When trying to classify a set of points with a linear separator, what is the best line?
the line that creates the largest margin
What is the max of the entropy function?
the log() of the number of possible events
What is overfitting
the phenomenon where fitting the observed facts (data) well no longer indicates that we will get a decent out-of-sample error, and may lead to the opposite effect. - occurs when the learning model is more complex than is necessary to represent the target function - can also occur when the hypothesis set contains only functions which are *simpler* than the target function
What does C control in the SVM minimization function?
the relative weighting between the goal of making ||w||^2 small (so that the margin is large) and the goal of ensuring that most examples have functional margin >= 1 (that is, margin maximization against slack minimization)
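For reference, the soft-margin objective that C appears in is commonly written as below. The symbols are assumptions of this sketch rather than notation from these notes: $m$ training examples, labels $y_i \in \{-1, +1\}$, offset $b$, and one slack variable $\xi_i$ per example. The factor of $\frac{1}{2}$ is a convention that does not change the minimizer.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, m
```

A large C makes slack expensive, so the optimizer tolerates a smaller margin to avoid margin errors; a small C does the opposite.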
What is inductive bias?
the set of assumptions that together with the training data, deductively justify the classifications assigned by the learner to future instances
With Gaussian error what should we minimize to maximize the likelihood hypothesis
the sum of squared error
What is the definition of an epsilon exhausted version space
the version space is epsilon-exhausted just in the case that all the hypotheses consistent with the observed training examples (i.e., those with zero training error) happen to have true error less than epsilon
What is the problem addressed by GAs
to search a space of candidate hypotheses to identify the best hypothesis. Where best is the hypothesis that optimizes a predefined numerical measure for the problem at hand called the hypothesis fitness
true/false Given m data points, the training error converges to the true error as m → ∞
true if we assume the data points are iid
How do we inject domain knowledge into SVMs
using the kernel function
More complex models usually have a higher _ and a lower _
variance; bias
Why do we need epsilon in the theorem?
we need epsilon because the distribution may put tiny weight on certain parts of the space. Unless we see a training example from such a part of the space, we won't know what the target function says about it. If there is a part of the space with weight $\epsilon$, then in order to see a point there we need to see on the order of 1/epsilon samples
When does entropy = log(n)
when all of the events have the same probability \frac{1}{n}
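A quick numerical check of this, as a sketch. The function name is chosen for the example and the logarithm is taken base 2, so entropy is measured in bits.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [1 / 4] * 4           # four equally likely events
skewed = [0.7, 0.1, 0.1, 0.1]   # same number of events, unequal probabilities
certain = [1.0]                 # the same outcome every time

# The uniform case reaches log2(4) = 2 bits; the skewed case stays below it.
print(entropy(uniform), entropy(skewed), entropy(certain))
```

The last case matches the message-sending intuition: if the outcome is always the same, no bits are needed.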
How is the accuracy to which the target is approximated specified?
with epsilon
When sending a message how many bits do we need if the output will be the same every time?
0
Does bagging reduce bias or variance?
Bagging is a variance reduction technique.
How do we inject domain knowledge into KNN?
By defining a distance function. There is a best distance function for each problem. We just don't know it. (No Free Lunch)
What is the curse of dimensionality in KNN
Distance between instances is calculated based on all attributes of the instance which can be problematic when only a subset of features are relevant for classification. On the relevant features, they could be close or identical but on others they may be far apart so their distance would be large
What is the Gini Index?
Is the expected error if we label examples in the leaf randomly: positive with probability p and negative with probability 1 - p.
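Worked out for the two-class case, this definition gives $2p(1-p)$: a positive example (probability $p$) is labelled negative with probability $1-p$, and a negative example (probability $1-p$) is labelled positive with probability $p$. A minimal sketch, with a hypothetical function name:

```python
def gini_index(p):
    """Expected error of random labelling in a leaf with positive fraction p."""
    positive_mislabelled = p * (1 - p)   # true positive, labelled negative
    negative_mislabelled = (1 - p) * p   # true negative, labelled positive
    return positive_mislabelled + negative_mislabelled
```

The value is zero for a pure leaf (p = 0 or p = 1) and largest for an evenly mixed leaf (p = 0.5).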
does it make sense to repeat an attribute along a path in the tree?
It depends: if the attribute is discrete, no; if the attribute is continuous, then it does, because you're asking a different question (for example, testing a different threshold)
How is diversity created among the models in the ensemble?
The differences between the bootstrap samples.
Why don't we just try to compute the inverse of X when the starting formula is Xw = Y
because it might not be well behaved so we multiply both sides by the $X^T$
What is the closed formed solution to find the weights in linear regression?
$w = (X^TX)^{-1}X^TY$
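The steps from $Xw = Y$ to the closed form can be laid out as below. This sketch assumes $X^TX$ is invertible, which holds when the columns of $X$ are linearly independent.

```latex
\begin{aligned}
Xw &= Y \\
X^T X w &= X^T Y \\
w &= (X^T X)^{-1} X^T Y
\end{aligned}
```

Multiplying by $X^T$ replaces the generally non-square $X$ with the square matrix $X^TX$, which is what makes taking an inverse meaningful.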
What are the problems with Occam's razor?
- Why should we believe that the small set of hypotheses consisting of decision trees with short descriptions should be any more relevant than the multitude of other small sets of hypotheses we might define? - The size of a hypothesis is determined by the particular representation used internally by the learner. Two learners using different internal representations could therefore arrive at different hypotheses, both justifying their contradictory conclusions by Occam's razor
How can you avoid overfitting in decision trees?
- approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data - approaches that allow the tree to overfit the data and then post-prune the tree
What are some strategies to estimate missing values in decision trees?
- assign the missing attribute value to the value that is most common among training examples at node n. - assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x)
Describe the KNN algorithm.
- assumes all instances correspond to points in n-dimensional space - nearest neighbors are defined in terms of some distance function - training: store each training example - classification: 1) given a query instance, find the k closest training examples 2) find the most common label among those k instances
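The two phases above can be sketched in a few lines. This is an illustration under stated assumptions: Euclidean distance, majority voting, and a training set small enough to sort on every query; the function name and data layout are chosen for the example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """training: list of (point, label) pairs; query: a point (tuple of numbers)."""
    # "Training" is just having the stored examples; all the work happens at query time.
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


training = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
            ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(training, (0.5, 0.5)))  # the three nearest points are all labelled "a"
```

Swapping `math.dist` for another distance function is the place where domain knowledge would enter.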
what are the properties of a nearest-neighbor classifier
- can perfectly classify the training set unless identical instances with different labels are included - by choosing the right examples, it can more or less represent any decision boundary, or at least an arbitrarily close piecewise linear approximation - has low bias but high variance - will overfit if the training data is limited, noisy or unrepresentative
What are the disadvantages of instance-based learning?
- cost of classifying a new instance is high: takes O(n) time to classify an example - considers all the attributes of the instances when attempting to retrieve similar training examples from memory. If the target concept depends on only a few of the many attributes, then the instances that are truly most similar may be a large distance apart
what are the advantages of boosting?
- fast, simple and easy to program - only has one parameter to tune (the number of rounds T) - requires no prior knowledge about the base learner, meaning it can be flexibly combined with any method for finding base classifiers - comes with a set of theoretical guarantees given sufficient data and a base learner that can reliably provide only moderately accurate base classifiers - instead of trying to design a learning algorithm that is accurate over the entire space, we focus on finding base learning algorithms that only need to be better than random.
What is KNN's preference bias?
- locality: near points are similar - smoothness: by choosing to average/vote we are expecting functions to behave smoothly - all features matter equally
How is it possible for one tree to work better on the training examples and perform worse on subsequent examples?
- random error or noise: could cause a more complex tree to be created - when small numbers of examples are associated with leaf nodes: it is possible for coincidental regularities to occur in which some attribute happens to partition the examples very well despite being unrelated to the actual target function
What is the inductive bias of ID3?
- selects in favor of shorter trees over longer ones - selects trees that place the attributes with the highest information gain closest to the root - prefers correct trees over incorrect ones
Where do errors come from?
- sensor error - malicious error: being given bad data - transcription error - unmodeled influences: there could be other factors that are not represented in the model
Why do we use weak learners in boosting?
- speed of fitting the classifier; - accuracy improvement: if you already have a strong learner, the benefits of boosting are less relevant; - avoid overfitting: boosting combines many different hypotheses from the hypothesis space so that we end up with a better final hypothesis. The power of boosting comes from the diversity of the hypotheses it combines. If we use a strong learner, this diversity tends to decrease: after each iteration there won't be many errors (since the model is complex), so boosting won't change the new hypothesis much. With very similar hypotheses, the ensemble will be very similar to a single complex model, which in turn tends to overfit.
why does overfitting tend to occur during later iterations but not during earlier iterations of backpropagation?
- the complexity of the hypotheses that can be reached by backpropagation increases with the number of weight-tuning iterations - given enough weight-tuning iterations, backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data
What are the components of learning?
- there is the input $x$ (customer info) - the unknown target function $f: X \rightarrow Y$ (ideal formula for credit approval), where $X$ is the input space (set of all possible inputs $x$) and $Y$ is the output space (set of all possible outputs, which are just yes or no in this case) - there is a data set $D$ of input examples $(x_1, y_1), ..., (x_N, y_N)$ where $y_n = f(x_n)$ for $n = 1, ..., N$ - a learning algorithm that uses the data $D$ to pick a formula $g: X \rightarrow Y$ that approximates $f$
What is the general algorithm for boosting?
- we start with a method or algorithm for finding the rough rules of thumb (the weak, or base, learning algorithm) - the boosting algorithm calls this weak learning algorithm repeatedly, each time feeding it a different distribution or weighting over the training examples - each time it is called, the base learning algorithm generates a new weak prediction rule - after many rounds, the boosting algorithm must combine these weak rules into a single prediction rule
What is the methodology of incorporating human knowledge into boosting
Have a human expert construct by hand a rule $p$ mapping each instance $x$ to an estimated probability $p(x) \in [0, 1]$ that is interpreted as the guessed probability that instance $x$ will appear with label +1. We try to minimize a loss function and we balance the conditional likelihood of the data against the distance from our model to the human's model.
How can decision trees handle attributes with differing costs? What is one of the caveats with these approaches.
ID3 can be modified to take into account attribute costs by introducing a cost term into the attribute selection measure. We might divide the IG by the cost of the attribute so that lower cost attributes would be preferred. While cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, they do bias the search in favor of low-cost attributes.
What are the benefits of ID3
ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
What bias does ID3 have in selecting attributes?
IG has the natural bias of favoring attributes with many values over those with few values.
Why do tree models have high variance?
If we change the training data sufficiently for another feature to be selected at the root of the tree, then the rest of the tree is likely to be different as well
What is boosting effective in doing? And what is the effect of this?
Increasing the margin of examples even if they are already on the correct side of the decision boundary. Boosting may continue to improve performance on the test set even after the training error has been reduced to zero
Why is RBF a nice middle ground between eager and lazy learners?
It commits to a global approximation to the target function at training time; however it represents the global function as a linear combination of multiple local kernel functions.
What happens to the number of different predictions KNN can make as k increases.
It initially increases then decreases.
What is the main idea of boosting?
It is based on the observation that finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule
What is the benefit of gain ratio.
It penalizes attributes with many values by incorporating a term called split information that is sensitive to how broadly and uniformly the attribute splits the data.
When k is equal to the number of training examples, what will the probability vector be?
It will be equal to the prior distribution over the examples
What is regression?
Mapping a value from some input space to a real value. It refers to fitting a functional form to approximate a set of points.
What is the argument behind Occam's razor
One argument is that because there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data. In contrast, there are often many very complex hypotheses that fit the current training data but fail to generalize correctly to subsequent data
What is an issue with Gain Ratio and how can it be avoided?
One issue with Gain Ratio is that the denominator (the split information) can be zero or very small when one of the splits is nearly as big as the entire training set at that level, which makes the ratio undefined or very large. To avoid selecting attributes purely on this basis we can adopt some heuristic such as first calculating the Gain of each attribute, then applying the Gain Ratio test only to those attributes with above average Gain.
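The heuristic can be sketched as follows. Everything here is illustrative: the function names, the `(attributes_dict, label)` data layout, the use of `>=` for "above average" (so that at least one candidate survives when all gains are equal), and the fallback to plain gain when no attribute has non-zero split information are assumptions of this example.

```python
import math
from collections import Counter


def _entropy(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_and_split_info(examples, attribute):
    """examples: list of (attributes_dict, label). Returns (gain, split information)."""
    labels = [label for _, label in examples]
    values = [attrs[attribute] for attrs, _ in examples]
    gain = _entropy(labels)
    for v in set(values):
        subset = [label for attrs, label in examples if attrs[attribute] == v]
        gain -= len(subset) / len(examples) * _entropy(subset)
    # Split information is the entropy of the attribute's own value distribution.
    return gain, _entropy(values)


def choose_attribute(examples, attributes):
    stats = {a: gain_and_split_info(examples, a) for a in attributes}
    average_gain = sum(g for g, _ in stats.values()) / len(stats)
    # Only attributes with at least average gain (and a usable denominator) compete on ratio.
    candidates = [a for a, (g, s) in stats.items() if g >= average_gain and s > 0]
    if not candidates:
        return max(attributes, key=lambda a: stats[a][0])
    return max(candidates, key=lambda a: stats[a][0] / stats[a][1])
```

With a unique-valued attribute (like an ID) and a binary attribute that both separate the labels perfectly, plain gain ties, while the gain ratio prefers the binary attribute because its split information is smaller.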
What is a solution to the curse of dimensionality in KNN?
Option 1: weight each attribute differently when calculating the distance between two instances. Option 2: completely eliminate the least relevant attributes from the instance space using LOOCV
Which is better: early-stopping or post-pruning and why?
Post-pruning has been found to be more successful in practice due to the difficulty in the first approach of estimating precisely when to stop growing the tree.
What is supervised learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
What is the goal in polynomial regression?
The goal is to find the coefficients c that best approximate the various values of y
How do margins relate to generalization error in Adaboost?
The larger the margin, the better the upper bound on the generalization error.
How can the margin be interpreted?
The measure of confidence in the prediction.
What is classification?
The process of taking some kind of input x and mapping to some discrete label such as true or false.
What is the hypothesis space searched by ID3
The set of possible decision trees. It performs a hill-climbing search through this hypothesis space beginning with the empty tree then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.
How can we incorporate human knowledge into boosting?
We allow the human's rough judgements to be refined, reinforced and adjusted by the statistics of the training data in a manner that does not permit the data to entirely overwhelm human judgements
Why is constructing a different approximation to the target function for each distinct query instance good?
It helps when the target function is very complex but can still be described by a collection of less complex local approximations
What is lazy learning?
When you defer the decision of how to generalize beyond the training data until each new query instance is encountered.
Describe the type of search adaboost is doing?
a kind of steepest descent search to minimize the exponential loss, where the search is constrained at each step to follow coordinate directions (where we identify coordinates with the weights assigned to the base classifiers)
what is bagging?
an ensemble method that creates diverse models on different random samples of the original data set
curse of dimensionality
as the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially
How are instances classified in decision trees?
by starting at the root node of the tree, testing the attribute specified by the node, then moving down the tree branch corresponding to the value of the attribute in the given example
What is the formula for information gain in decision trees and what does it represent?
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$; it is simply the expected reduction in entropy caused by partitioning the examples according to this attribute
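A small sketch of that computation: the entropy of the labels before the split, minus the size-weighted entropy of each partition after it. The function names and the `(attributes_dict, label)` data layout are assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, attribute):
    """examples: list of (attributes_dict, label) pairs."""
    labels = [label for _, label in examples]
    # Partition the labels by the value each example takes on the attribute.
    partitions = {}
    for attrs, label in examples:
        partitions.setdefault(attrs[attribute], []).append(label)
    gain = entropy(labels)
    for subset in partitions.values():
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```

An attribute that splits the examples into pure partitions recovers the full entropy of the labels as gain, while an attribute whose partitions are as mixed as the original set has a gain of zero.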
What happens to the bias and variance of KNN as k increases?
the bias increases and the variance decreases
what is weak learnability
the hypothesis only needs to be slightly better than chance.
What is the target concept?
thing that we are trying to find / answer
What are instances?
vectors of attributes that define what the input space is; set of inputs