Data Science

You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why can it happen?

A predictor or a set of predictors may be collinear with your main predictor, inflating the variance of its coefficient estimate and making it non-significant. Your main predictor may interact with another predictor (e.g. sex) in such a way that the effects within each group cancel each other out. You may be examining a very specific subset of a large population.

What is a Box Cox Transformation?

A Box Cox transformation is a statistical technique that transforms a non-normal dependent variable into an approximately normal shape. Many statistical techniques assume normality.
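As a rough illustration, the one-parameter Box Cox transform of a positive value can be sketched as below. The function name `box_cox` and the choice to pass the exponent `lam` by hand are assumptions for this example; in practice the exponent is usually estimated from the data (for instance, `scipy.stats.boxcox` fits it by maximum likelihood).

```python
from math import log

def box_cox(y, lam):
    """One-parameter Box Cox transform of a single positive value y."""
    if y <= 0:
        raise ValueError("Box Cox requires strictly positive values")
    # lam == 0 is defined as the natural log (the limit of the general formula).
    return log(y) if lam == 0 else (y ** lam - 1) / lam

# lam = 0.5 behaves like a shifted, scaled square root.
transformed = [box_cox(v, 0.5) for v in [1.0, 4.0, 9.0]]
```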

Can you describe what a classification problem is?

A classification problem is when the output variable is a category, such as "red" or "blue" or "disease" and "no disease". A classification model attempts to draw some conclusion from observed values. Given one or more inputs a classification model will try to predict the value of one or more outcomes.

What is a confusion matrix?

A confusion matrix is a 2x2 table that contains the 4 outcomes produced by a binary classifier: true positives, false positives, true negatives, and false negatives.

Why is Linear Regression called linear?

Linear regression is called linear because you model your output variable (let's call it \(f(x)\)) as a linear combination of inputs and weights (let's call them \(x\) and \(w\) respectively). Namely, \(f(x) = \langle w, x \rangle + b = \sum_i w_i x_i + b\).

What are the assumptions of Linear Regression?

Linear relationship between x and y. Multivariate normality. No or little multicollinearity between variables. No auto-correlation: no relationship exists between the values of the error term. Homoscedasticity: the variance of the error term is the same across all values of the independent variables.

What is logistic regression? Give an example.

Logistic Regression (a logit model) is a technique for predicting a binary outcome: it models the probability of the outcome as the logistic (sigmoid) function of a linear combination of predictor variables. Example: You want to predict whether a candidate will win an election or not. Here, the outcome of the prediction is binary: [0, 1] ([Lose, Win]). The predictors here would be something like the amount of money spent on campaigning or the number of hours spent campaigning.

You are given a data set with many variables, some of which are highly correlated. You have been asked to run PCA. Would you remove correlated variables first? Why?

Remove them! Discarding correlated variables can have a substantial effect on PCA because in the presence of correlated variables the variance explained by a particular component gets inflated.

What is a convex hull? (Hint: think SVM)

In the case of linearly separable data, convex hull represents the outer boundaries of the two groups of data points. Once a convex hull is created, we get the maximum margin hyperplane as a perpendicular bisector between two convex hulls.

When is Ridge regression favorable over Lasso regression?

In the presence of few variables with medium/large sized effects, use lasso regression. In the presence of many variables with small/medium sized effects, use ridge regression. Conceptually, we can say that lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage, keeping all the coefficients in the model. In the presence of correlated variables, ridge might be the preferred choice. Ridge also works better in situations where the least squares estimates have higher variance.

What are the advantages of Naïve Bayes Classification?

Interpretable; efficient; compact models; works with small data.

What are the advantages of Decision Trees?

Interpretable; no feature engineering necessary; nonlinear model; provide feature importance weights; prediction is efficient.

What is Ordinary Least Squares (OLS)?

Under the Gauss-Markov assumptions, it is the best linear unbiased estimator (BLUE) of the linear regression coefficients. There are other methods as well, which sometimes work as well as OLS, but they have different drawbacks. The core of OLS lies in minimizing the sum of squared residuals \(\sum_i (y_i - x_i^T b)^2 = \|y - Xb\|^2\), where \(y\) and \(b\) are vectors and \(X\) is a matrix of regressors. Note that both \(y\) and \(X\) are known (measured observations, given data, whatever you want to call them). \(b\) is the unknown, and by minimizing this sum of squares you are minimizing the errors of the model \(y = Xb + \text{error}\).

Describe AUC for ROC

Metric for classification. Area under the ROC curve (which plots true positive rate against false positive rate for different threshold settings of a binary classifier). It equals the probability that a random positive sample scores higher than a random negative sample. AUC close to 1: ideal. 0.5: model no better than random. 0: model does the opposite of perfect classification. It can be read as the percentage chance that the model will correctly distinguish between the positive and negative classes.
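The probabilistic reading of AUC can be checked directly by comparing every positive score against every negative score. The sketch below is a brute-force illustration under that definition, not an efficient implementation; the function name `auc` and the toy scores are made up for the example, and ties are counted as half a win.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = sum(
        (p > n) + 0.5 * (p == n)   # a tie counts as half a win
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for three positives and three negatives.
score = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of the 9 pairs are ordered correctly
```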

How does gradient descent work?

Minimize the loss function with respect to the weights by taking small steps in the opposite direction of the gradient.
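A minimal sketch of that update rule on a one-dimensional toy loss follows. The loss \(L(w) = (w - 3)^2\), the learning rate, and the step count are all arbitrary choices for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)  # move opposite to the gradient
    return w

# Toy loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With too large a learning rate (here, anything above 1.0) the same loop diverges instead of converging, which is why the steps need to be small.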

What are some advantages and disadvantages of Logistic Regression?

Advantages: easy to implement and compute; provides probabilities of outcomes. Disadvantages: high bias.

What are some advantages and disadvantages of Linear Regression?

Advantages: simple and fast. Disadvantages: sensitive to outliers; only models relationships between dependent and independent variables that are linear.

Linear regression models are usually evaluated using the Adjusted \(R^2\) or \(F\) value. How would you evaluate a logistic regression model?

Since logistic regression is used to predict probabilities, we can use the AUC-ROC curve along with a confusion matrix to determine its performance. The analogous metric to adjusted \(R^2\) in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients, meaning we prefer a lower AIC.

You are given a training data set having 1000 columns and 1 million rows for a classification problem. You're asked to reduce the dimensionality of the data so that computation time can be reduced - you also have memory constraints. What could you do?

Since we have memory constraints, we could close all other applications on our machine. We can randomly sample the dataset, say 1000 columns and 300,000 rows. To reduce dimensionality, we can separate the numerical and categorical variables and remove the correlated variables. For numerical variables we'll use correlation, and for categorical variables we can use the chi-squared test. We can use PCA and pick the components that explain the most variance in the data set.

What is Singular Value Decomposition (SVD)?

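Singular Value Decomposition is a matrix factorization that is often used for dimensionality reduction. It factors a matrix \(X\) as \(X = U \Sigma V^T\), where \(U\) and \(V\) have orthonormal columns and \(\Sigma\) is diagonal, holding the singular values. By taking only the largest singular values (and the corresponding vectors of \(U\) and \(V\)), you obtain a low-rank approximation to \(X\) that minimizes the Frobenius norm of the difference between \(X\) and the approximation. The number of non-zero singular values is the rank of \(X\) (the number of linearly independent columns, i.e. the dimension of the space spanned by its rows or columns).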

What are the disadvantages of SVMs?

Slow and large in both training and prediction; don't scale well Not great with multiclass problems

What is soft k-means?

Soft k-means takes into account the fact that there are other, neighboring clusters to which a point may belong. Rather than discarding this information, it incorporates it into its "soft" label, which is basically a list of likelihoods that a point belongs to a particular cluster.

How does k-means clustering work?

Start with k random samples as centroids.
Assign all points to the nearest centroid.
Recompute the centroids (vector averages).
Repeat until stationary.
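The four steps above can be sketched in plain Python as follows. This is an illustrative version only: the function name `kmeans`, the iteration cap, the fixed seed, and the toy points are assumptions for the example, and an empty cluster simply keeps its previous centroid.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means on a list of equal-length tuples (n_iter must be at least 1)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random samples as centroids
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the coordinate-wise mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # step 4: stop once nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```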

What is the difference between supervised and unsupervised machine learning?

Supervised machine learning requires labeled training data. Unsupervised machine learning doesn't require labeled data.

Is it necessary to normalize data before performing PCA?

PCA calculates a new projection of your data set, and the new axes are based on the standard deviation of your variables. So a variable with a high standard deviation will have a higher weight in the calculation of the axes than a variable with a low standard deviation. If you normalize your data, all variables have the same standard deviation, thus all variables have the same weight and your PCA calculates relevant axes.

Explain how the ROC curve works

The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between sensitivity (true positive rate) and false positive rate.

How do precision and recall relate to the ROC curve?

The ROC curve represents a relation between sensitivity (recall) and specificity (not precision) and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall curves give a more representative picture of performance.

What are the advantages of using Singular Value Decomposition (SVD) vs. Eigenvalue Decomposition (EVD)?

The SVD gives you the U matrix (coordinates) and the base (V) while PCA only gives you the coordinates. The base V is really useful in many applications. The SVD doesn't need to compute the covariance matrix so it's numerically more stable than PCA. There exist pathological cases where computing the covariance matrix leads to numerical problems. This doesn't mean that PCA will fail because those cases are very rare but in general SVD is more efficient.

What is the maximal margin classifier? How this margin can be achieved and why is it beneficial?

The best or optimal line that can separate the two classes is the line that has the largest margin. This is called the Maximal-Margin hyperplane. This can be achieved when the data is linearly separable. It maximizes the margin of the hyperplane and gives us our best shot at correctly classifying new data.

What is GINI impurity?

The chance of being incorrect if you randomly assign a label from the set to an example in the same set.
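When the random label is drawn according to the label frequencies in the set, that chance works out to \(1 - \sum_i p_i^2\). A small sketch of the calculation from class counts follows; the function name and the example counts are made up for illustration.

```python
def gini_impurity(counts):
    """Gini impurity 1 - sum(p_i^2) from a list of per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

balanced = gini_impurity([5, 5])   # two equally likely labels: maximally impure for 2 classes
pure = gini_impurity([10, 0])      # a single label: no chance of a wrong guess
```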

What is a kernel? What's the intuition behind the Kernel trick?

A linear SVM makes a prediction for a new input \(x\) using the dot product between \(x\) and each support vector \(x_i\): \(f(x) = B_0 + \sum_i a_i \langle x, x_i \rangle\), where \(B_0\) and \(a_i\) are coefficients. The dot product is called the kernel and can be rewritten as \(K(x, x_i) = \sum_j x_j x_{ij}\), summing over the input dimensions \(j\). The kernel defines the similarity, or a distance measure, between new data and the support vectors. The dot product is the similarity measure used for a linear SVM (a linear kernel) because the distance is a linear combination of the inputs. Kernel trick: replacing the dot product with a more complex kernel allows the classes to be separated by boundaries that are curved or even more complex, without explicitly computing the higher-dimensional features. This in turn can lead to more accurate classifiers.

What's the analytical solution for linear regression?

The equation is \(\theta = (X^T X)^{-1} X^T y\), where \(\theta\) is the vector of coefficients. This can be calculated in numpy with b = inv(X.T.dot(X)).dot(X.T).dot(y). Once the coefficients are calculated, we can use them to predict outcomes given X: yhat = X.dot(b).
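A runnable sketch of the same calculation follows, on made-up noise-free data generated from \(y = 1 + 2x\). It solves the linear system \((X^T X) b = X^T y\) with `np.linalg.solve` instead of forming the inverse explicitly, which is usually the more numerically stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Toy data from y = 1 + 2*x; the first column of ones models the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Normal equation: solve (X^T X) b = X^T y for the coefficients b.
b = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq solves the same least-squares problem and also copes with rank-deficient X.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

yhat = X @ b  # predictions for the training inputs
```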

What is Information Gain?

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding attributes that return the highest information gain.

You think your regression model is suffering from multi-collinearity. How would you check if this is true? Without losing any information, can you build a better model?

To check multi-collinearity, we can create a correlation matrix to identify & remove variables having a correlation above some threshold (like 75%). We can also use VIF (variance inflation factor) to check for the presence of multi-collinearity. A VIF value ≤ 4 suggests no multi-collinearity, whereas a value ≥ 10 implies serious multi-collinearity. In order to retain the information in these variables instead, we can use a penalized regression model like ridge or lasso.

What is a True Negative? A False Negative?

True Negative: a correct negative prediction. False Negative: an incorrect negative prediction.

What is a True Positive? A False Positive?

True Positive: a correct positive prediction. False Positive: an incorrect positive prediction.

What do you understand by Type I and Type II errors?

A Type I error is committed when the null hypothesis is true and we reject it: a False Positive. A Type II error is committed when the null hypothesis is false and we fail to reject it: a False Negative.

What are some techniques for feature selection?

Univariate ranking: choose the best features as ranked by variance, correlation with labels, mutual information, etc.
Forward selection: find the best single feature via cross-validation. Then add each remaining feature, one at a time, to find the best pair of features. Repeat, adding one feature each iteration until "good enough."
Recursive feature elimination: use a model that gives weights to features and train with all features. Remove the lowest-weight features and repeat until satisfied.
Dimensionality reduction: PCA, SVD.
Built-in: some algorithms have feature selection built in, e.g. LASSO and decision trees.

When should we use spectral clustering over k-means? How does it work?

Use spectral when the data is not globular.

What is variance in an algorithm?

Variance is error introduced into your model due to an over-complex algorithm, leading to overfitting. Your model is learning noise, performing well on the training set and poor on the test set.

What cross-validation technique would you use on a time-series dataset?

We can't use k-fold cross-validation: time series are not randomly distributed data, they are inherently in chronological order. In this case, a technique like forward-chaining should be used, where you model on past data and then test on future data:
fold 1: training[1], test[2]
fold 2: training[1, 2], test[3]
fold 3: training[1, 2, 3], test[4]
fold 4: training[1, 2, 3, 4], test[5]
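The fold pattern above can be generated with a short helper. The function name `forward_chaining_splits` and the convention of numbering chronological blocks from 1 are assumptions for this sketch.

```python
def forward_chaining_splits(n_blocks):
    """Yield (train_blocks, test_block) pairs that always train on the past only."""
    for test_block in range(2, n_blocks + 1):
        yield list(range(1, test_block)), test_block

# With five chronological blocks this reproduces the four folds listed above.
splits = list(forward_chaining_splits(5))
```

scikit-learn ships a comparable splitter, `TimeSeriesSplit`, which works on row indices rather than block numbers.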

In k-means or kNN we use euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?

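Manhattan distance sums the absolute differences along each axis, so it measures movement horizontally and vertically only, whereas Euclidean distance measures the straight-line distance in any direction. Both can be computed in any number of dimensions. Standard k-means relies on squared Euclidean distance because the mean is the point that minimizes it; with Manhattan distance the minimizer is the coordinate-wise median, which gives k-medians instead. For kNN, Manhattan distance is a valid alternative.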

What are some methods of dealing with class imbalance?

Weight: give minority instances a higher weight.
Metrics: use precision, recall, F1, normalized accuracy (kappa), AUC, etc.
Resample: upsample small classes or downsample large ones.
Interpolate: generate synthetic samples by interpolation, e.g. SMOTE (randomly-weighted combinations of neighbors).
Bag: divide the majority class into N subsets of the same size as the minority, train N classifiers, and combine them.

When can two matrices not be multiplied?

When the number of columns of the first matrix does not equal the number of rows of the second.

What is pruning in a decision tree? Why do it?

Pruning is when we remove sub-nodes of a decision node. Decision trees are prone to overfitting (as they can become large and complex). Pruning reduces the size of the decision tree by removing parts that do not provide power to classify instances.

What is overfitting? How do you deal with it?

Overfitting is where a model doesn't generalize well to unseen data. Ways of dealing with it: cross-validate; train on more data; feature selection; regularization.

Is \(y= x_1 + x_2 + x_1 \times x_2\) still linear? Why?

Yes, it's multiple linear regression with three variables: \(x_1\), \(x_2\), and \(x_1x_2\)

Are local minima always bad? When would they be bad/not be bad?

You can consider a local minimum \(L\) bad if a) your model does not overfit on \(L\) and b) there's some other minimum \(L'\) which has a significantly lower CV error rate than \(L\). A global minimum in a NN is not usually a bad thing. It is bad only if your model overfits, but you can always use proper regularization and stop early.

Define the silhouette score

\(\frac{\text{nearest cluster dist - within cluster dist}}{\max(\text{nearest cluster dist, within cluster dist})}\)
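For a single point, the formula above can be written out as below. The names `a` (mean distance to the point's own cluster) and `b` (mean distance to the nearest other cluster) are labels chosen for this sketch; the score for a whole clustering is the average of the per-point values.

```python
def silhouette(a, b):
    """Silhouette of one point: a = within-cluster dist, b = nearest-cluster dist."""
    return (b - a) / max(a, b)

well_placed = silhouette(a=1.0, b=4.0)   # much closer to its own cluster
borderline = silhouette(a=2.0, b=2.0)    # equally close to both clusters
misplaced = silhouette(a=4.0, b=1.0)     # closer to the neighboring cluster
```

Values near 1 indicate a well-clustered point, values near 0 a point on a boundary, and negative values a point that probably belongs to the neighboring cluster.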

How is Accuracy calculated from a Confusion Matrix?

\(\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{All Positives} + \text{All Negatives}}\)

How is Error Rate calculated from a Confusion Matrix?

\(\text{Error Rate} = \frac{\text{False Positive} + \text{False Negative}}{\text{All Positives} + \text{All Negatives}}\)

How do you calculate the F1-Score(Harmonic mean of precision and recall) from a Confusion Matrix?

\(\text{F-Score} = \frac{2(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}}\)

How is Precision (Positive predictive value) calculated from a Confusion Matrix?

\(\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\)

How is Sensitivity(Recall or True positive rate) calculated from a Confusion Matrix?

\(\text{Sensitivity} = \frac{\text{True Positive}}{\text{All Positives}}\)

How is Specificity(True negative rate) calculated from a Confusion Matrix?

\(\text{Specificity} = \frac{\text{True Negatives}}{\text{All Negatives}}\)
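The confusion-matrix formulas above can be collected into one helper. The function name and the example counts are invented for illustration; "All Positives" is taken to mean all actual positives (TP + FN) and "All Negatives" all actual negatives (TN + FP).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity, true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 45 actual positives, 55 actual negatives.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```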

Name some problems for which you should use a clustering algorithm

Market segmentation; social network analysis; search result grouping; medical imaging; image segmentation; anomaly detection.

What is a hard SVM and soft SVM?

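A hard SVM would be one with a hyperplane that perfectly separates the data. Real world data is messy and this usually isn't the case. A soft SVM uses a tuning parameter, C, to allow it wiggle room and lets some points violate the separating line. When C is defined as a budget for violations, the larger the C, the more violations we allow. Note that some libraries (scikit-learn, for example) define C as a penalty on violations instead, in which case a larger C allows fewer.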

What is a local optimum?

A local optimum (here, a local minimum) is a point on a cost function that is lower than all nearby points, but not the lowest point on the cost function.

Describe the regression problem. What does it try to do? Is it supervised? Why?

A regression problem is when the output variable is a real or continuous value, such as "salary" or "weight". Many different models can be used, the simplest is the linear regression. It tries to fit data with the best hyper-plane which goes through the points. It is a supervised technique. Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output Y = f(X) . The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.

What is the basis of a vector space?

A set of linearly independent vectors that span the space. The number of vectors in a basis equals the dimension of the space.

How can we check if the regression model fits the data well?

Being the ratio of the regression sum of squares to the total sum of squares, \(R^2\) tells you what percentage of the variability in your dependent variable is explained by the model. Adjusted \(R^2\) can be used to check whether the extra sum of squares brought about by additional predictor(s) is really worth the degrees of freedom they take.

What is bias in an algorithm? What are some high/low bias algorithms?

Bias is error introduced into your model due to over-simplification in your algorithm, leading to underfitting. Low bias algorithms: Decision Trees, K-NN, SVM. High bias algorithms: Linear Regression, Logistic Regression.

What is the bias-variance trade-off? Explain it.

Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data (underfitting). Variance is the algorithm's tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (overfitting). The ideal fit, naturally, is one that captures the regularities in the data enough to be reasonably accurate and generalizable to a different set of points from the same source. Unfortunately, in almost every practical setting, it is nearly impossible to minimize both simultaneously. Therefore, to achieve good performance on data outside the training set, a trade-off must be made. This is referred to as the bias-variance trade-off.

How does hierarchical clustering work?

Bottom up (agglomerative):
Start with all points as their own clusters.
Merge the closest clusters (distance: sum-squared, mean, max).
Repeat until the target number of clusters remains.

What are the advantages of SVMs?

Can model known non-linear relationships with kernels. Can find the "best" model for given hyperparameters, since the error function has a global minimum.

What are some advantages of k-Nearest Neighbors?

Easy to implement for multi-class problems. Can be used for classification and regression. Variety of distance criteria to choose from (Euclidean, Hamming, Manhattan, Minkowski)

What is Entropy?

Entropy is a measure of impurity. It is the sum of the probability of each label times the log probability of that same label. For a binary class with values a and b: Entropy = \(-p(a) \times log(p(a)) - p(b) \times log(p(b))\) It is generalized with: \(H(X) = -\sum_{i=1}^{n} p(x_i) log_b p(x_i)\)

What is collinearity and what can we do to deal with it?

If two or more predictor variables in a multiple regression model are highly correlated, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. PCA and regularization are two methods to resolve this issue.

Is it better to have too many false positives or too many false negatives? In which situations would this differ?

It is application-dependent. Let's say you have a model that performs disease diagnosis — does a given patient have disease X or not. It is better to have false positives here. It is okay to falsely assert that a patient has a disease, and then later realize that the decision was wrong, maybe after more tests / some preliminary medication. However, a false negative here means that someone who had a disease was not provided proper medical treatment, which could be fatal. On the other hand, consider a model that shortlists resumes for a job interview at a company. Assuming you get more promising candidates than the number of positions you want to fill, a false negative amounts to rejecting a good candidate, which is not so much of an issue, given that you will get other such candidates. However, a false positive means you shortlist someone who is not good enough, which will waste company resources in the interview process.

Name some kernel functions

Linear kernel; polynomial kernel; radial basis kernel; sigmoid kernel.

When would you use an SVM and when would you use logistic regression?

Linear SVMs and logistic regression generally perform comparably in practice. Use SVM with a nonlinear kernel if you have reason to believe your data won't be linearly separable (or you need to be more robust to outliers than LR will normally tolerate). Otherwise, just try logistic regression first and see how you do with that simpler model. If logistic regression fails you, try an SVM with a non-linear kernel like a RBF.

One hot encoding increases the dimensionality of a data set but label encoding doesn't....how?

One hot encoding increases dimensionality because it creates a new variable for each level present in a categorical variable. In label encoding, the levels of a categorical variable get encoded as integers (0 and 1 for a binary variable) within the same column, so no new variable is created. It is typically used on binary or ordinal variables.
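The difference in dimensionality is easy to see on a toy column. The `colors` data below is made up, and the encodings are written by hand rather than with a library so that each step is visible.

```python
colors = ["red", "green", "blue", "green"]
levels = sorted(set(colors))  # the distinct levels, in a fixed order

# Label encoding: still one column, each level replaced by an integer code.
label_encoded = [levels.index(c) for c in colors]

# One-hot encoding: one 0/1 column per level, so 1 column becomes len(levels) columns.
one_hot = [[int(c == lvl) for lvl in levels] for c in colors]
```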

What kind of data causes k-means to perform poorly? What else could we use?

Performs poorly when the data is not globular, i.e. when the clusters are not roughly spherical blobs. Spectral clustering is one alternative for such data.

How do you decide the number of components to keep in PCA?

Plot explained variance vs. number of components and pick a point of diminishing returns. The explained variance is the sum of the retained eigenvalues divided by the sum of all eigenvalues.

How do you use AUC-ROC for multi-class classification?

Plot one ROC curve for each class, classified against all the others.

What are the advantages of Gradient Boosted Trees?

Often preferred over Random Forests, since each new tree complements the already built ones. Can be distributed and very fast. Can be used with almost any objective function for which you can write out a gradient.

How is PCA related to SVD?

The right-singular vectors of \(X\) are the principal components: the eigenvectors of the covariance matrix. The component weights (eigenvalues) are the squares of the singular values. That is, given the SVD \(X = U \Sigma V^T\), the principal components are the columns of \(V\). PCA can therefore be computed by SVD. After mean-centering each column, PCA finds the matrix of eigenvectors \(W\) and diagonal matrix of eigenvalues \(D\) of the covariance matrix \(X^TX\) (up to a constant factor), such that \(X^TX = WDW^{-1}\); comparing with the SVD gives \(W = V\) and \(D = \Sigma^2\).
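The relationship can be checked numerically. The sketch below builds synthetic correlated data (the mixing matrix and seed are arbitrary), then compares the SVD of the centered data with the eigendecomposition of \(X^TX\); eigenvectors are only defined up to sign, so absolute values are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing   # synthetic correlated features
Xc = X - X.mean(axis=0)                  # mean-center each column

# Route 1: SVD of the centered data; rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of X^T X (symmetric, so eigh applies).
eigvals, W = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]        # eigh returns eigenvalues in ascending order
eigvals, W = eigvals[order], W[:, order]

# Share of variance explained by each component.
explained = s**2 / np.sum(s**2)
```

Here `s**2` matches `eigvals` and the rows of `Vt` match the columns of `W` up to sign; `explained` is the ratio used when deciding how many components to keep.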

Define the span of a set of vectors

The set of all linear combinations of the vectors.

Do we always need the intercept term? When do we need it and when do we not?

The shortest answer: never drop it, unless you are sure that your linear approximation of the data generating process (linear regression model) is forced, by theory or for some other reason, to go through the origin. If it is not, the other regression parameters will be biased even if the intercept is statistically insignificant. By keeping the intercept term you ensure that the residual term is zero-mean.

What is a concave function? Know an equation for one? Draw one.

A concave function is the negative of a convex function. Take \(-x^2\), for example.

What is a convex function? Know an equation for one? Draw one.

For a convex function, any local minimum is also a global minimum. This is a nice property, as an optimization algorithm won't get stuck in a local minimum that isn't a global minimum. Take \(x^2 - 1\), for example.

What is a non-convex function? Know an equation for one? Draw one.

A non-convex function is wavy: it has some 'valleys' (local minima) that aren't as deep as the overall deepest 'valley' (global minimum). Optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens. Take \(x^4 + x^3 - 2x^2 - 2x\), for example.

What is L1 Regularization and how does it work?

Adding a term for the L1 norm of the weights to the loss function of a model. Penalizes a large number of (non-zero) weights, for feature selection and simpler models. (The L1 norm is a convex approximation to the L0 norm, the number of nonzero weights, which is what you actually want.) In the regression setting this is known as lasso regression.

What is L2 Regularization and how does it work?

Adding a term for the L2 norm of the weights to the loss function of a model. Penalizes large weights to reduce variance and overfitting. Also known as ridge regression; a special form of Tikhonov regularization.

What is regularization, and what problem does it address?

Adding information to a model, often constraints, to make a problem well-defined and/or avoid overfitting. This is most often done by adding a penalty term to the loss function: a constant multiple of a norm of the weight vector. The norm is often either L1 (lasso) or L2 (ridge), but can in actuality be any norm. The model is then fit by minimizing the regularized loss over the training set.

What are "lambdas" in Python?

Anonymous functions, i.e., functions that don't have a name. Syntax: lambda arguments: expression. Example, a cubing lambda: g = lambda x: x*x*x
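A short runnable illustration follows. The names `cube`, `words`, and `by_length` are just for this example; the second use, passing a lambda inline as a sort key, is where lambdas tend to be most useful.

```python
# A lambda is an expression, so it can be bound to a name like any other value.
cube = lambda x: x * x * x
result = cube(3)

# More commonly a lambda is passed inline, for example as a sort key.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda w: len(w))
```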

What are the disadvantages of Naïve Bayes Classification?

Assumes features are independent (given the class). Assumes a distribution for continuous features (usually normal). Does not handle sparse data well. Fixed-sized model; diminishing returns with more data.

Describe how decision trees work

At each node, ask a true/false question. Which questions to ask and when? The goal is to maximize information gain at each node

How does backpropagation work?

Backpropagation iteratively updates the weights of a neural network to minimize the error between the actual and desired outputs. Each weight gets updated in the opposite direction of the derivative of the loss function with respect to that weight (the gradient). The derivatives at each layer depend on the derivatives of all successive layers (between it and the output), so the weight updates are calculated from the output back: backpropagation.

How does a random forest work?

Bagging multiple decision trees. Plus "feature bagging," selecting the split feature from a random subset of all features, to reduce correlation between trees

How can you choose the optimal \(k\) in k-Means?

By plotting an elbow plot. Within Group Sum of Squares (WSS) is used to explain the homogeneity of a cluster. If you plot k against the WSS (the SSE), you will see that the error decreases as k gets larger. The point at which the WSS stops decreasing appreciably (k=6 in the original example) is known as the bending point, or elbow, and is taken as the k in k-means.

You are given a dataset on cancer detection and you've built a classification model that has achieved accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

Cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% might only reflect predicting the majority class correctly, while our class of interest is the minority class (4%): the people who actually have cancer. A better approach would be to use Sensitivity (True Positive Rate), Specificity (True Negative Rate) and the F measure to determine the class-wise performance of the classifier. If the minority class performance is poor, we can do the following. We can use undersampling, oversampling, or SMOTE to make the data more balanced. We can alter the prediction threshold by doing probability calibration and finding an optimal threshold using the AUC-ROC curve. We can assign weights to classes such that the minority classes get larger weights. We can use anomaly detection.

What are some limits to SVMs?

Choosing an appropriate kernel function is not easy. This moves the problem from optimizing the parameters to model selection. Not great for large data sets. Does not provide class probabilities. For multiclass classification, you need one model per class.

Explain boosting

Combining many models to reduce bias. Sequentially train models, weighting mispredicted samples higher in the next model. Combine models weighting by accuracy. Gradient boosting trains each model to predict the error of the previous model (the gradient of the loss function). Usually uses "weak," high-bias models.

Explain bagging

Combining several models to reduce variance. (Bootstrap aggregating) Train multiple models on subsets of the data and combine outputs, by e.g., averaging or majority vote. Usually uses "strong," low-bias models.

How do you evaluate clusterings without ground-truth labels?

Compare within-cluster distances to between-cluster distances. Example: Silhouette Coefficient.

What are exploding gradients?

Exploding gradients are a phenomenon where large error gradients accumulate and result in very large updates to neural network model weights during training, sometimes causing an overflow resulting in NaN values. This makes models unstable and unable to learn from the training set.

How does tree splitting work? How does the tree decide which variable to split at the root node and succeeding nodes? Example: splitting on ends-vowel sends a parent node of [9m,5f] to a left child of [3m,4f] and a right child of [6m,1f].

First we calculate the entropy before splitting: \(\text{Entropy\_before} = -(5/14)\log_2(5/14) - (9/14)\log_2(9/14) = 0.9403\). Next we compare it with the entropy computed after considering the split, by looking at the two child branches. In the left branch of ends-vowel=1, we have \(\text{Entropy\_left} = -(3/7)\log_2(3/7) - (4/7)\log_2(4/7) = 0.9852\), and in the right branch of ends-vowel=0, we have \(\text{Entropy\_right} = -(6/7)\log_2(6/7) - (1/7)\log_2(1/7) = 0.5917\). We combine the left/right entropies using the number of instances down each branch as the weight factor (7 instances went left, and 7 instances went right), and get the final entropy after the split: \(\text{Entropy\_after} = (7/14)\,\text{Entropy\_left} + (7/14)\,\text{Entropy\_right} = 0.7885\). Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature: \(\text{Information\_Gain} = \text{Entropy\_before} - \text{Entropy\_after} = 0.1518\). At each node of the tree this is calculated for every feature, and the feature with the largest information gain is chosen for the split.
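The arithmetic in this worked example can be reproduced with a few lines of Python. The function names are chosen for this sketch; class counts are passed as lists, with [9, 5] standing for the [9m,5f] parent node.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a node, given its per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    after = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - after

# The ends-vowel split: [9m,5f] -> [3m,4f] and [6m,1f].
gain = information_gain([9, 5], [[3, 4], [6, 1]])
```

Running the same function over every candidate feature and picking the largest gain gives the split chosen at a node.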

What are some disadvantages of k-means?

- Having to choose k manually (a "Loss vs. Clusters" plot can help here).
- Dependent on initial values: for a low k you can mitigate this by running k-means with different initial values and picking the best result. For large k, use k-means seeding.
- Clustering data of varying sizes and densities: you may have to tweak the heights/widths of clusters.
- Centroids can be dragged by outliers, or outliers may get their own cluster. Consider removing or clipping them.
- Does not scale well with many dimensions: consider reducing them with PCA.

What do eigenvalues and eigenvectors represent?

If you imagine a box, you can stretch the length, the width and the height all separately without affecting the other directions. However, if you stretch it in some other direction, the result will be a combination of those directions. The eigenvectors (and corresponding eigenvalues) are a way to diagonalize a problem so you can tweak just the parts you want to without disturbing the other parts. In symbols you might see Av = λv, which tells you that v is a special direction: the effect of A on v is merely to stretch it and leave its direction alone. (Av means apply A to v, and the result λv means that the only net effect was a stretch by the factor λ along v.) So eigenvectors loosely correspond to directions corresponding to degrees of freedom of motion, and eigenvalues to how much stretching happens along each of them.

What is Independent Component Analysis (ICA)? What's the difference between ICA and PCA?

In ICA the basis you want to find is the one in which each vector is an independent component of your data. You can think of your data as a mix of signals; the ICA basis will then have a vector for each independent signal. In more practical terms, PCA helps when you want to find a reduced-rank representation of your data, and ICA helps when you want to find a representation of your data as independent sub-elements. In layman's terms, PCA helps to compress data and ICA helps to separate data. Ex: two images → mix them → use ICA to separate them.

What is a Random Forest? How does it work?

In a Random Forest, we grow multiple decision trees. To classify a new object based on its attributes, each tree gives a prediction and the forest chooses the prediction with the most votes; in the case of regression, it averages the outputs of the different trees. It can also be used for dimensionality reduction (via feature importances) and copes reasonably well with missing and outlying values.

Is mean imputation of missing data acceptable practice? Why or why not?

It depends. If only a small amount of data is missing, it's probably okay. It is not the best practice in general:
- If just estimating means: mean imputation preserves the mean of the observed data
- Leads to an underestimate of the standard deviation
- Distorts relationships between variables by "pulling" estimates of the correlation toward zero

What is the definition of log-loss (or cross-entropy)? Why is it used?

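It's a classification metric, most often used for binary classification: it measures how far the predicted probabilities are from the true labels and heavily punishes confident wrong predictions. For binary labels \(y_i\) and predicted probabilities \(p_i\) it is \(-\frac{1}{N}\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]\). It is the error function for logistic regression.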
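
A minimal sketch of binary log-loss, to show how confident wrong predictions are punished. The function name and the clipping constant `eps` are assumptions chosen for illustration (clipping simply avoids taking the log of 0).

```python
from math import log

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob holds predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

The same two predictions with the labels flipped cost roughly twenty times more, which is the "punishes extreme confidence" behaviour when the confidence is misplaced.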

Define F1 Score

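It's the harmonic mean of precision and recall: \(F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\)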
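
F1 is usually defined as the harmonic mean of precision and recall, which differs noticeably from the arithmetic mean when the two are far apart. A short sketch (the function name and the zero-division convention are assumptions):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.8)   # equal inputs give the same value back
lopsided = f1_score(1.0, 0.1)   # the arithmetic mean would be 0.55
```

The harmonic mean stays close to the smaller of the two values, so a model cannot score well on F1 by maximizing only precision or only recall.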

What is Kernel PCA? Why use it?

Kernel PCA performs PCA in a new feature space. It uses the kernel trick to find principal components in a different, possibly high-dimensional, space. Standard PCA always finds linear principal components to represent the data in lower dimensions. Sometimes we need non-linear principal components: on data with non-linear structure (for example, concentric circles), standard PCA fails to find a good representative direction. Kernel PCA (KPCA) addresses this limitation.

How does Stochastic Gradient Descent work?

Like gradient descent, but at each step, compute the gradient and update the weights using only one or a few (mini-batch) training samples. The samples are randomly shuffled or sampled. More time and space efficient.
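
As an illustration, here is a small sketch of SGD fitting a one-variable linear model with squared-error loss, updating on one shuffled sample at a time. The function name, learning rate, epoch count, and toy data are all assumptions picked for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.02, epochs=500, seed=0):
    """Fit y ~ w*x + b by SGD on squared error, one sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)            # visit samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # gradient of error**2 with respect to w and b
            w -= lr * 2 * error * xs[i]
            b -= lr * 2 * error
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
```

Each update touches a single sample, so memory use is independent of the data set size; a mini-batch variant would average the gradient over a few samples before updating.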

What is LDA?

Linear Discriminant Analysis: This method identifies components (i.e., linear combination of the observed variables) that maximize class separation (i.e. between-class variance) when such prior information is available (i.e., supervised). E.g., you have a training set containing a variable specifying the class of each observation.

What are some advantages of Random Forests?

No need for feature scaling. Solid when there is a small amount of data or a large proportion of data is missing. Helps correct the problem of correlated features vs. a single decision tree.

Does gradient descent always converge to an optimum?

Not always. Convergence to the global optimum is only guaranteed if the function is convex (and the learning rate is suitably small); in general it depends on the learning rate, the starting value, and the number of iterations. On non-convex functions, gradient descent is not guaranteed to reach a global optimum, or even a local one: there are points where the gradient is very small that are not optima (inflection points, saddle points). For example, gradient descent can converge to the point \(x=0\) for the function \(f(x)=x^3\).
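
The \(f(x)=x^3\) case can be reproduced with a few lines. This is a sketch with an assumed step size and starting point; `gradient_descent` is a hypothetical helper that takes the gradient as a function.

```python
def gradient_descent(grad, x0, lr=0.1, steps=1000):
    """Plain gradient descent on a 1-D function given its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = x**3 has gradient 3*x**2, which vanishes at x = 0 even though
# 0 is an inflection point, not a minimum (f is unbounded below).
stalled = gradient_descent(lambda x: 3 * x**2, x0=1.0)

# f(x) = x**2 is convex; its gradient 2*x leads to the true minimum at 0.
converged = gradient_descent(lambda x: 2 * x, x0=1.0)
```

Starting from a positive value, the iterate on \(x^3\) creeps toward 0 and effectively stops there, because the gradient shrinks faster than the distance left to travel.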

What is Principal Component Analysis (PCA)?

PCA "rotates" data such that features become linearly uncorrelated and ordered by variance; it is useful for decorrelating features or for dimensionality reduction. Example: if you have a bunch of variables that are related to each other (length, height, weight), you can perform a mathematical transformation on the data to turn them into unrelated variables (x, y, z). These unrelated variables don't have any physical meaning like length, height and weight; they are a mathematical abstraction. The reason you might want to do this is that once you have x, y and z, you may be able to ignore z (or y and z) and use only x if your original variables were strongly related to each other. This reduces the number of variables, which is helpful when you have massive amounts of data and huge numbers of variables. In PCA the basis you want to find is the one that best explains the variability of your data: the first vector of the PCA basis (the principal direction) explains the most variability, the second is the next best explanation and must be orthogonal to the first, and so on.

Explain prior probability, likelihood and marginal likelihood in the context of Naive Bayes

Prior probability is the proportion of each class of the dependent variable in the data set. Ex: the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email would be spam. Likelihood is the probability of observing a given feature value given the class. Ex: the probability that the word FREE is used in a spam message. Marginal likelihood is the probability that the word FREE is used in any message.
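
The three quantities combine through Bayes' rule to give the probability that a message containing FREE is spam. In the sketch below, the 70% prior comes from the example above, while both likelihood values are hypothetical numbers chosen only to make the arithmetic concrete.

```python
prior_spam = 0.7            # P(spam), from the example above
p_free_given_spam = 0.6     # P(FREE | spam): hypothetical value
p_free_given_ham = 0.1      # P(FREE | not spam): hypothetical value

# Marginal likelihood: P(FREE) over all messages
marginal_free = (prior_spam * p_free_given_spam
                 + (1 - prior_spam) * p_free_given_ham)

# Bayes' rule: P(spam | FREE) = P(FREE | spam) * P(spam) / P(FREE)
posterior_spam = prior_spam * p_free_given_spam / marginal_free
```

Naive Bayes applies the same rule with several features at once, multiplying their likelihoods under the assumption that features are independent given the class.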

What are the disadvantages of Decision Trees?

- Prone to overfitting
- High variance
- Hard to learn some simple functions (parity, xor)
- Axis-parallel decision boundaries (not great for smooth boundaries)

Both being tree-based algorithms, how does random forest differ from gradient boosted machines?

Random forest uses bagging techniques while GBM uses boosting techniques. In bagging, n samples are drawn from the data set using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample, in parallel. The resulting predictions are combined using voting or averaging. In boosting, after the first round of predictions, the algorithm weighs misclassified predictions higher, so that they can be corrected in the succeeding round. This sequential process of giving higher weights to misclassified predictions continues until a stopping criterion is reached.

How do Decision Trees work?

Recursively split the data into groups based on the most discriminating feature; each leaf gives a prediction. To build:
- Compute a "purity" metric on the labels for all splits of all features (e.g. Gini impurity, information gain)
- Take the best-scoring split and create child nodes for each subgroup
- Recurse on each child node; stop at some desired level of purity

What's the difference between Ridge/Lasso/Elastic Net regularization?

Ridge performs feature weight updates as if the cost (loss) has an extra term containing the squared L2 norm of the weights vector. There is a penalty hyper-parameter to control how much the regularisation term contributes to the cost. During optimisation this tends to drive the overall size of the weight values down, constraining the variance of the model and reducing overfitting. Lasso performs feature weight updates as if the cost has an extra term containing the L1 norm of the weights vector, again with a penalty hyper-parameter. During optimisation this tends to cause the weights for some features to go all the way down to zero at some point in time. This effectively acts as feature elimination and supposedly eliminates features such that those causing most variance/overfitting are down-weighted and eliminated faster. Elastic Net is the combination of the other two, and the two penalty hyper-parameters also balance between L2 and L1 regularisation. This method should be less aggressive in eliminating features, compensated by having smaller weight values overall.

What is \(AX = B\)? How do we solve it?

Say we have the set of equations below, in \(AX = B\) form:
1: \(2x - 3y = -1\)
2: \(-5x + 5y = 20\)
We can build A by using the coefficients of x and y: \(A = \begin{bmatrix} 2 & -3 \\ -5 & 5 \end{bmatrix}\)
X is the vector of the unknown variables x and y: \(X = \begin{bmatrix} x \\ y \end{bmatrix}\)
And the product of matrix A with vector X is the right-hand-side vector B: \(B = \begin{bmatrix} -1 \\ 20 \end{bmatrix}\)
If A is invertible, the solution is \(X = A^{-1}B\); here that gives \(x = -11\), \(y = -7\).
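
One way to solve this particular 2x2 system is Cramer's rule; the sketch below is an illustration for this small case only (the helper name `solve_2x2` is hypothetical, and larger systems would normally use a linear algebra library instead).

```python
def solve_2x2(a, b):
    """Solve a 2x2 system A X = B with Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("A is singular; no unique solution")
    x = (b[0] * a22 - a12 * b[1]) / det
    y = (a11 * b[1] - a21 * b[0]) / det
    return x, y

# 2x - 3y = -1 and -5x + 5y = 20
x, y = solve_2x2([[2, -3], [-5, 5]], [-1, 20])
```

Substituting back confirms the result: \(2(-11) - 3(-7) = -1\) and \(-5(-11) + 5(-7) = 20\).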

What are some advantages of k-means clustering?

- Scales to large data sets.
- Guarantees convergence (to a local optimum, not necessarily the global one).
- Can warm-start the positions of the centroids.
- Easily adapts to new examples.
- Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.

What are the disadvantages of k-nearest neighbors?

- Sensitive to feature scaling
- Suffers from class imbalance; weighting can help
- Suffers from the curse of dimensionality (all points are basically the same distance apart in high dimensions)
- Model size grows with data; must store all samples

What are the limitations of PCA?

- Sensitive to the scale of features
- Assumes features with less variance are less important
- Assumes (Gaussian) variance accurately characterizes features
- Assumes the principal components are orthogonal
- Only performs linear transformations (but see kernel PCA)
- Only removes linear correlation

What are the support vectors in SVMs?

Support vectors are the data points closest to the hyperplane used to separate the classes. Their distance to the hyperplane is the margin, so they are the points that determine where the hyperplane lies.

What is TF/IDF Vectorization?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It's used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally to the number of times a word appears in a document but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
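
A bare-bones sketch of the idea, using one common unsmoothed form (term frequency times the log of the inverse document frequency). Libraries often use smoothed or normalized variants, so treat the exact formula, the function name, and the toy corpus as assumptions.

```python
from math import log

def tf_idf(term, doc, corpus):
    """TF-IDF of term in doc; doc and corpus entries are token lists."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred"],
]
everywhere = tf_idf("the", corpus[0], corpus)   # in every document
shared = tf_idf("cat", corpus[0], corpus)       # in 2 of 3 documents
rare = tf_idf("purred", corpus[2], corpus)      # in 1 of 3 documents
```

A word that appears in every document gets weight 0, while a word confined to one document gets the highest weight for the same term frequency.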

What is an identity matrix? What effect does it have in matrix multiplication?

The "Identity Matrix" is the matrix equivalent of the number "1": \(I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
- It is "square" (has the same number of rows as columns).
- It has 1s on the diagonal and 0s everywhere else.
- It is a special matrix, because when we multiply by it, the original is unchanged: \(A \times I = A\) and \(I \times A = A\).

Define cosine similarity

The cosine of the angle between two vectors: \(\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}\). It is 1 for parallel vectors, 0 for orthogonal vectors, and -1 for vectors pointing in opposite directions.
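
A minimal sketch of the computation for plain Python lists (the function name is illustrative, and the inputs are assumed to be non-zero vectors of equal length, since a zero vector would cause a division by zero):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1, 0], [2, 0])
orthogonal = cosine_similarity([1, 0], [0, 3])
opposite = cosine_similarity([1, 0], [-4, 0])
```

Because both vectors are normalized, only their direction matters, not their length.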

Define the rank of a matrix

The dimension of the space spanned by the columns; the number of linearly independent vectors among the columns.

Describe orthogonality for vectors

The dot product of the two is zero

Define the eigenvectors and eigenvalues of a matrix

The eigenvectors of a matrix A are the nonzero vectors x such that Ax = λx, i.e., multiplication by A acts like scalar multiplication by λ, the eigenvalue corresponding to x.
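
The definition can be checked numerically on a small example. The matrix and the eigenpairs below are chosen by hand for illustration, and `matvec` is a hypothetical helper for matrix-vector multiplication.

```python
def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

A = [[2, 1], [1, 2]]
v1, lam1 = [1, 1], 3      # A stretches [1, 1] by a factor of 3
v2, lam2 = [1, -1], 1     # A leaves [1, -1] unchanged

Av1 = matvec(A, v1)
Av2 = matvec(A, v2)
```

In both cases multiplying by A gives the same result as multiplying the vector by its eigenvalue.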

What does PCA compute? (in matrix-theoretic terms)

The eigenvectors of the covariance matrix (the principal components).

What is "yield" in Python?

The yield statement suspends a function's execution and sends a value back to the caller, but retains enough state to enable the function to resume where it left off. This allows code to produce a series of values over time rather than computing them all at once and sending them back in something like a list. We should use yield, vs. return, when we want to iterate over a sequence but don't want to store the entire sequence in memory.
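
A short sketch of a generator function, showing the suspend-and-resume behaviour (the function name `squares` is just an example):

```python
def squares(n):
    """Yield 0, 1, 4, ... one value at a time instead of building a list."""
    for i in range(n):
        yield i * i

gen = squares(4)
first = next(gen)     # runs the body until the first yield
rest = list(gen)      # resumes after that yield and drains the rest
```

Only one value exists in memory at a time, which is why generators suit long or unbounded sequences.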

What are the disadvantages of Gradient Boosted Trees?

Training often takes longer because the trees are built sequentially. Prone to overfitting: mitigate with a small learning rate (shrinkage) and shallow tree depth.

Define specificity

True Negative Rate: True Negatives / (True Negatives + False Positives)

Define recall (sensitivity)

True Positive Rate: True Positives / (True Positives + False Negatives)

Define precision

True Positives / All Predicted Positives, i.e. True Positives / (True Positives + False Positives)
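
Precision, recall, and specificity can all be read off the four confusion-matrix counts. A small sketch, with a hypothetical function name and made-up counts:

```python
def classification_rates(tp, fp, tn, fn):
    """Precision, recall and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),     # of predicted positives, how many are right
        "recall": tp / (tp + fn),        # of actual positives, how many are found
        "specificity": tn / (tn + fp),   # of actual negatives, how many are found
    }

# Made-up counts for illustration
rates = classification_rates(tp=40, fp=10, tn=45, fn=5)
```

Precision and specificity both react to false positives, while recall reacts to false negatives.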

How can we use Naive Bayes classifier for categorical features? What if some features are numerical?

We can use a Naive Bayes classifier for categorical variables if we one-hot encode them: if a feature has n categories, we create n-1 dummy variables and add them to our data. If we have numerical features, we can discretize the values into a few categories; for example, we can categorize the marks of students in a class as low, medium or high. Another method is to use a probability density function, assuming that the attribute follows a normal distribution.

You have a model that is suffering from low bias and high variance - what algorithm can you use to tackle this?

We could use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging divides a data set into subsets made with repeated random sampling. Other options:
- Use regularization to penalize high coefficients, lowering model complexity
- Use only the top \(n\) features

What's the difference between feature selection and feature extraction? Provide examples.

Feature extraction and feature engineering: transformation of raw data into features suitable for modeling.
- Texts (n-grams, word2vec, tf-idf, etc.)
- Images (CNNs, texts, Q&A)
- Geospatial data (lat, long, etc.)
- Date and time (day, month, week, year, ...)
- Time series, web, etc.
- Dimensionality reduction techniques
Feature selection: removing unnecessary features.
- Statistical approaches
- Selection by modeling
- Grid search
- Cross-validation

What is the syntax for multiplying numpy matrices?

numpy.dot(a, b, out=None), where a and b are arrays:
- If both a and b are 1-D arrays: the inner product of the two vectors
- If both a and b are 2-D arrays: matrix multiplication
- If either a or b is 0-D (scalar): equivalent to multiplying with a * b
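
A short sketch of the three cases, assuming NumPy is installed; the array values are arbitrary. The `@` operator is shown as well, since for 2-D arrays it gives the same result as `numpy.dot`.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

matrix_product = np.dot(a, b)        # 2-D with 2-D: matrix multiplication
same_with_operator = a @ b           # equivalent for 2-D arrays
inner_product = np.dot(a[0], b[0])   # 1-D with 1-D: inner product
scaled = np.dot(2, a)                # scalar with array: same as 2 * a
```

Note that `a * b` on two 2-D arrays is element-wise multiplication, not the matrix product.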

