ml
When does $\sigma(w^T x)$ equal 0.5? And what does that imply about the decision boundary of a logistic regression model?
$\sigma(w^T x)$ equals 0.5 when $w^T x$ equals 0. This implies that logistic regression has a linear decision boundary (the set of locations where both classes are equally likely) defined by the hyperplane $w^T x = 0$ (a line in two dimensions).
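A minimal sketch in Python (the weight vector is made up for illustration) checking this numerically:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight vector, just for illustration.
w = np.array([2.0, -1.0])

# A point exactly on the hyperplane w^T x = 0 gets probability 0.5.
x_on_boundary = np.array([1.0, 2.0])       # w^T x = 2*1 - 1*2 = 0
print(sigmoid(w @ x_on_boundary))          # 0.5

# Points on either side of the hyperplane tilt toward one class or the other.
print(sigmoid(w @ np.array([2.0, 1.0])))   # ~0.95 > 0.5, predicted positive
print(sigmoid(w @ np.array([0.0, 3.0])))   # ~0.05 < 0.5, predicted negative
```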
Each value in a probability density function must be less than 1.
False. Values in a PDF must be non-negative but can be unbounded above. PDFs do, however, have to integrate to one.
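A quick numerical sketch using scipy's Normal distribution with a small standard deviation: the density exceeds 1 but still integrates to 1.

```python
import numpy as np
from scipy.stats import norm

# A Normal distribution with a small standard deviation has density values well above 1...
narrow = norm(loc=0.0, scale=0.1)
print(narrow.pdf(0.0))                # ~3.99, far greater than 1

# ...but it still integrates to 1 over its support.
xs = np.linspace(-1.0, 1.0, 10001)
print(np.trapz(narrow.pdf(xs), xs))   # ~1.0
```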
When the slack penalty hyperparameter C is set to zero, the soft-margin SVM reverts to a hard-margin SVM.
False. When the penalty is set to zero, the slack variables are free, so the margin can grow infinitely large because margin violations are not penalized in the objective.
The SVM dual formulation shows us that the optimal weight vector always depends on all data points.
False. While the weight vector can be written as a weighted combination of all training points, only the support vectors (points on or inside the margin) receive nonzero weight.
Both discriminative and generative classifiers make decisions according to P(y|x) (where y is the output and x is the input). However, discriminative models learn _________, while generative models learn _____________.
P(y|x) directly; P(x|y) and P(y). Generative models make use of Bayes' Theorem and make decisions according to argmax_y P(x|y)P(y).
You want to predict whether someone will like a particular music artist given a list of artists they already enjoy. You have a dataset of ("list of liked artists", "likes this new artist") pairs. This is an example of:
Supervised Classification (out of: Supervised Regression, Unsupervised Clustering, Supervised Classification, Unsupervised Dimensionality Reduction)
When deriving MLE estimates in lecture, we frequently would write out the likelihood as: $L(\theta) = P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$. What assumption allows us to write the probability of our dataset as a product of the probabilities of each data point?
The IID Assumption: independence between data points lets us write the joint probability of the dataset as the product of the probabilities of the individual points. If A and B are independent random variables, P(A,B) = P(A)P(B).
Write the following equation as a vector operation involving column vectors x and y: $\sum_{i=1}^{d} x_i y_i$
This is a dot product and could be written $\langle x, y \rangle$ or $x^T y$.
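A quick NumPy sketch showing that the sum and the dot product agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# The elementwise sum sum_i x_i * y_i ...
elementwise = np.sum(x * y)

# ... is exactly the dot product x^T y.
dot = x @ y                # equivalently np.dot(x, y)
print(elementwise, dot)    # 32.0 32.0
```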
The Naïve Bayes assumption generally reduces the number of parameters we have to learn in our model. This is because we don't need to learn the full conditional joint distribution $P(x_1, x_2, \ldots, x_d \mid y)$ and can instead learn conditionals for individual input dimensions $P(x_i \mid y)\ \forall i$.
True! The conditional independence assumption in Naïve Bayes lets us assume $P(x_1, x_2, \ldots, x_d \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_d \mid y)$, so we can just learn the individual conditionals (right) rather than the full conditional joint (left). For example, with d binary features, the full conditional joint needs on the order of $2^d$ parameters per class, while Naïve Bayes needs only d.
Linear regression by minimizing the sum of squared error is equivalent to maximizing the likelihood of data under a linear model with Gaussian noise.
True! We spent the second half of the lecture proving this.
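A condensed version of that argument: assuming $y_i = w^T x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the negative log likelihood is, up to terms that don't depend on $w$, proportional to the sum of squared errors:

```latex
-\log L(w) = -\sum_{i=1}^{n} \log \mathcal{N}\!\left(y_i \mid w^T x_i, \sigma^2\right)
           = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - w^T x_i\right)^2 + \frac{n}{2}\log\!\left(2\pi\sigma^2\right)
```

so maximizing the likelihood over $w$ is exactly minimizing $\sum_i (y_i - w^T x_i)^2$.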
We can use kernel functions for any machine learning algorithm based only on dot products between feature vectors.
True. Kernels are a very general idea. We showed them for SVMs and the Perceptron.
Classifying new points for an SVM requires computing dot products between support vectors and the new point.
True. Substituting the definition of the optimal weight vector into $w^T x + b$ shows this fact.
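Concretely, using the dual solution (where only the support vectors have $\alpha_i > 0$):

```latex
w = \sum_{i \in \mathrm{SV}} \alpha_i y_i x_i
\quad\Longrightarrow\quad
f(x) = \operatorname{sign}\!\left(w^T x + b\right)
     = \operatorname{sign}\!\left(\sum_{i \in \mathrm{SV}} \alpha_i y_i \, x_i^T x + b\right)
```

which involves only dot products between the new point $x$ and the support vectors.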
A hard-margin linear SVM only has a solution when the data is linearly separable.
True. The constrained optimization only considers solutions where all datapoints are correctly classified.
What role does a prior play in MAP inference?
A prior is a way for us to encode beliefs about the values of our parameters _before_ seeing any data.
What do the slack variables accomplish in the soft-margin SVM formulation?
Slack variables:
- allow for violations of the margin in the constraints
- are then penalized in the objective to minimize them
(The soft-margin formulation is sketched below.)
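For reference, the soft-margin primal problem with slack variables $\xi_i$ and penalty $C$:

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\!\left(w^T x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i
```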
Explain Logistic Regression / Perceptron / k Nearest Neighbor / Linear Regression
Logistic regression and the perceptron are both linear classifiers, but the perceptron has no probabilistic interpretation. k Nearest Neighbors can form arbitrary decision boundaries based on the stored examples it uses to compute neighbors, but also lacks a direct probabilistic interpretation. Linear regression is a regression model rather than a classifier and can be interpreted as a linear model with Gaussian noise.
Recall / Precision
Recall is the fraction of positive examples correctly predicted as positive. Precision is the fraction of positive predictions that are actually positive examples.
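A small sketch computing both from raw predictions (the labels here are made up):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of the positive predictions, how many were actually positive?
recall    = tp / (tp + fn)  # of the actual positives, how many did we predict as positive?
print(precision, recall)    # 0.75 0.75
```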
What is regularization and what is its relationship to priors?
Regularization refers to including additional terms in a model's objective to encourage simpler models (e.g., smaller weights). This mirrors how priors encode beliefs about what the parameters should be before seeing data. In fact, L2 regularization maps directly to a zero-mean Gaussian prior on the weights, with the prior's variance determining the regularization strength.
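A sketch of that correspondence: assuming a Gaussian likelihood with noise variance $\sigma^2$ and a prior $w \sim \mathcal{N}(0, \tau^2 I)$, the MAP estimate is ridge-regularized least squares:

```latex
\hat{w}_{\mathrm{MAP}}
= \arg\max_w \; \log P(D \mid w) + \log P(w)
= \arg\min_w \; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2 + \frac{1}{2\tau^2}\lVert w \rVert^2
```

so the effective L2 penalty strength is $\lambda = \sigma^2 / \tau^2$.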
Explain what a decision boundary is in classification.
In classification, a decision boundary is the surface in input space that separates the regions assigned to different classes. Which side of the boundary a new data point falls on determines the class the model predicts for it.
In a nutshell
Input: x (images, text, emails...)
Output: y (spam or non-spam...)
(Unknown) Target Function f: X→Y (the "true" mapping / reality)
Data: (x1, y1), (x2, y2), ..., (xN, yN) (our observations of the world)
Model / Hypothesis Class 𝓗 = {g: X→Y} (the space of possible models)
Bayes Error
Irreducible error inherent to the problem. We can't really fix this.
What is Maximum Likelihood Estimation (MLE)?
Maximum Likelihood Estimation (MLE) is a way to fit the parameters of a probabilistic model to data. In MLE, we assume some generative model of our data (i.e. a probabilistic model of how the data is produced) and then find parameters for that model that maximize the likelihood of our observed data. MLE is a general technique, and we've now seen it applied to binary random variables (with a Bernoulli assumption), continuous values (with a Normal assumption), and in linear regression (a conditional Normal assumption).
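A minimal sketch for the Bernoulli case (coin flips): the closed-form MLE is the empirical fraction of heads, which matches a brute-force search over the log likelihood.

```python
import numpy as np

# Observed coin flips (1 = heads, 0 = tails).
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Closed-form Bernoulli MLE: theta_hat = (number of heads) / (number of flips).
theta_mle = data.mean()

# Check against a brute-force search over the log likelihood.
thetas = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(thetas) + (len(data) - data.sum()) * np.log(1 - thetas)
print(theta_mle, thetas[np.argmax(log_lik)])   # 0.7 and ~0.7
```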
How can our techniques for linear regression be used to fit non-linear functions of the input?
Linear regression can be used to fit non-linear functions of the input by augmenting the input with non-linear transforms (or basis functions). The linear regression model then fits a linear model in this non-linear feature space.
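A small sketch fitting a quadratic by running ordinary least squares on polynomial basis features (the data here is synthetic):

```python
import numpy as np

# Noisy samples from a quadratic function.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.3, size=x.shape)

# Augment the scalar input with non-linear basis functions: [1, x, x^2].
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares in the transformed feature space.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly [1.0, -2.0, 0.5]
```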
What does the IID (independently and identically distributed) assumption assume about our data?
That data points do not affect other data points and all data points are generated from the same probabilistic mechanism. The "independent" part implies that data points do not affect each other and the "identically distributed" bit implies they are generated from the same distribution.
Bayes error is the error due to ______________.
inherent uncertainty in the problem. If two examples look identical except they have different outputs, no model can get them both correct.
List three hyperparameters for the k-Nearest Neighbor algorithm.
k: defines how many neighbors you will consider.
Distance metric: defines how we compute distances to neighbors (e.g. Euclidean, Manhattan, Minkowski, or something else).
Weighting function: defines how distances contribute to the output function.
In lecture, we fit the parameters of the logistic regression function by minimizing the _________________.
negative log likelihood using gradient descent. When we computed the gradient of the negative log likelihood for the logistic regression model, we were met with a system of non-linear equations that didn't let us write a closed-form solution. Instead, we used the gradient expression to perform gradient descent to minimize the negative log likelihood.
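A minimal sketch of that procedure on synthetic data (the learning rate and iteration count are arbitrary illustration choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)

# Gradient descent on the (average) negative log likelihood.
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)   # gradient of the NLL with respect to w
    w -= lr * grad
print(w)   # should land near the true weights
```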
Over all possible datasets generated from the true model, an estimator with high bias but low variance will _______________.
produce similar estimates for most datasets, but will not perfectly predict the true parameters. Bias refers to how wrong the provided estimate is on average. Variance refers to how much the predicted estimate varies across different datasets.
The Naïve Bayes assumption is that __________.
the input features are conditionally independent given the class label.
Given random variables X, Y, and Z -- what does it mean for X and Y to be conditionally independent given Z?
Conditioned on the value of Z, X and Y do not provide additional information about each other. That is to say, P(X,Y|Z) = P(X|Z)P(Y|Z).
Optimization Error
Error due to the difficulty of finding optimal models for a dataset during learning. Can be reduced with more computation dedicated to the search.
Modelling Error
Error from a mismatch between our hypothesis set and the real function. Can be reduced with more expressive model classes.
Estimation Error
Error from learning a model from a finite dataset. Can be reduced with more data or with more data-efficient algorithms.
Linear regression by minimizing the sum of squared errors (SSE) is robust to outliers.
False. As we showed in lecture, a single outlier point is sufficient to dramatically change the solution to ordinary least squares.
In the ordinary least squares solution $w = (X^T X)^{-1} X^T y$, the inverse $(X^T X)^{-1}$ always exists.
False. If X is not full column rank, $X^T X$ is singular and this inverse does not exist. That is why we typically use the pseudo-inverse, as described in lecture.
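A quick sketch of the failure mode and the pseudo-inverse workaround (the duplicated column below makes $X^T X$ singular):

```python
import numpy as np

# Design matrix with a duplicated column, so X is not full column rank.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

print(np.linalg.matrix_rank(X.T @ X))   # 2 < 3, so X^T X is singular
# np.linalg.inv(X.T @ X) would raise a LinAlgError here.

# The pseudo-inverse still returns a (minimum-norm) least-squares solution.
w = np.linalg.pinv(X) @ y
print(w)
```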
The standard perceptron learning algorithm is guaranteed to converge even for non-linearly separable data.
False. It will continue to oscillate as zero error will never be reached.
We can use any function to define a kernel.
False. Only certain functions can define a kernel. Specifically, ones which always result in positive semidefinite Gram matrices.
Finding the optimal weight vector for a hard-margin SVM requires solving an unconstrained optimization problem.
False. The optimization includes constraints for the data to be correctly classified.
Computing a quadratic kernel requires quadratic computation in the dimensionality of the input vectors.
False. The quadratic kernel between a and b is (a^Tb+1)^2 and takes only linear time in the dimensionality.
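A small sketch comparing the direct kernel computation, which is linear in the dimensionality d, with the explicit $O(d^2)$ feature map it implicitly corresponds to:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])

# Quadratic kernel computed directly: one dot product plus a couple of scalar
# operations, i.e. linear time in the input dimensionality d.
k_direct = (a @ b + 1) ** 2

def quad_features(x):
    # Explicit feature map for (x^T z + 1)^2: all monomials of degree <= 2,
    # with linear and cross terms scaled by sqrt(2). This has O(d^2) entries.
    d = len(x)
    feats = [1.0]
    feats += list(np.sqrt(2) * x)
    feats += list(x * x)
    feats += [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.array(feats)

k_explicit = quad_features(a) @ quad_features(b)
print(k_direct, k_explicit)   # same value (~30.25) both ways
```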
K-Nearest Neighbors is referred to as a _________ algorithm because no parameters are learned during training.
Non-Parametric (other names are Instance-based, Exemplar)
What is the output function for kNN?
The output function defines, given a set of neighbors, how we compute the output. Ex.) majority vote.
What is meant by overfitting and underfitting?
Overfitting means that the selected hypothesis performs well on the specific training dataset but performs poorly on the test dataset. Underfitting means that the selected hypothesis does not perform well on either the training dataset or the test dataset.
Assuming we are using a Bayesian approach to fitting the parameters of a model to some data, match the names of the following distributions to their written versions.
Posterior = P(parameters | data)
Prior = P(parameters)
Likelihood = P(data | parameters)
Both the SVM primal and dual formulations can be solved using __________.
Quadratic Program Solvers
If a prior distribution A is a conjugate prior to likelihood distribution B, what can I say about the posterior distribution (which is proportional to B times A)?
The posterior will belong to the same family of distributions as A -- which is the definition of a conjugate prior.
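A standard example, assuming a Beta prior and a Bernoulli likelihood: the posterior is again a Beta distribution, with the observed counts added to the prior's parameters.

```latex
\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad
x_1, \ldots, x_n \mid \theta \sim \mathrm{Bernoulli}(\theta)
\;\;\Longrightarrow\;\;
\theta \mid x_{1:n} \sim \mathrm{Beta}\!\left(\alpha + \textstyle\sum_i x_i,\; \beta + n - \textstyle\sum_i x_i\right)
```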
What is a hypothesis set?
The set of possible functions for a machine learning algorithm