ML Study

What are some common hyperparameters?

Batch size, learning rate/LR schedule (momentum), dropout, weight decay, regularization parameter, num epochs, plus architecture changes

What's a Fourier transform?

A mathematical transform which decomposes a function (often a function of time, or a signal) into its constituent frequencies, such as the expression of a musical chord in terms of the volumes and frequencies of its constituent notes. There is also an inverse Fourier transform that mathematically synthesizes the original function from its frequency domain representation. "Given a smoothie, it's how we find the recipe."

What's the trade-off between bias and variance?

Bias is the difference between the average prediction of our model and the correct value. A high-bias model is systematically inaccurate (it underfits), so bias should be as low as possible. Variance measures how much the model's predictions change when it is trained on different training sets. A high-variance model fluctuates widely with the training data (it overfits), so variance should also be low. Trade-off: reducing one of these errors tends to increase the other. If you make the model more complex and add more parameters, you'll reduce bias but gain some variance. To minimize total error you have to trade off bias against variance; you don't want either high bias or high variance in your model.

Describe CNNs. What are they good for?

A convolution is applied on the input data using a convolution filter/kernel to produce a feature map. Apply element-wise matrix multiplication between the kernel and the receptive field and sum the products. You can change the stride and add padding. You can also pool after a convolution by applying a pooling operation like 2x2 max pool with a stride 2. Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. Common nets using CNNs: LeNet ('98), AlexNet ('12), VGG ('14), GoogLeNet/Inception ('14)
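
The convolution and pooling steps described above can be sketched in plain Python. This is a minimal illustration, not how a deep learning library implements it: the 5x5 `image` and the 2x2 `kernel` are made-up values, there is no padding, and (as in most CNN libraries) the kernel is not flipped.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel element-wise with the receptive field and sum.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(fmap, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(fmap[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical input and filter, chosen only for illustration.
image = [
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0],
    [1, 0, 1, 2, 3],
    [3, 1, 0, 1, 2],
    [2, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, 1]]

feature_map = conv2d(image, kernel)   # 4x4 feature map
pooled = max_pool2d(feature_map)      # 2x2 after a 2x2 max pool with stride 2
```

The same kernel weights are reused at every position, which is the "shared weights" idea, and each output value depends only on a small window of the input, which is the "local receptive field" idea.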

What's the difference between a generative and discriminative model?

A generative model learns how the data in each category is distributed (the joint distribution of inputs and labels), while a discriminative model simply learns the distinction (the decision boundary) between different categories of data. Discriminative models will generally outperform generative models on classification tasks.

Difference between a loss, cost, and objective function.

A loss function is a part of a cost function which is a type of an objective function. Loss function is usually a function defined on a data point, prediction and label, and measures the penalty. Cost function is usually more general. It might be a sum of loss functions over your training set plus some model complexity penalty (regularization). Objective function is the most general term for any function that you optimize during training.

Explain the difference between L1 and L2 regularization.

A regression model that uses L1 regularization is called Lasso regression, and a model that uses L2 is called Ridge regression. L1/lasso adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function. L2/ridge adds the "squared magnitude" of the coefficients as a penalty term to the loss function. L1 tends to push weights to exactly 0 while L2 does not, but L2 can still shrink weights close to 0. L2 regularization tends to spread the penalty among all the terms, while L1 is sparser: many coefficients end up exactly 0, so the corresponding features are effectively dropped.

What is the sigmoid function?

A sigmoid function is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. A common example of a sigmoid function is the logistic function, defined by the formula: 1 / (1 + e^-x)
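
As a quick check of the logistic function, sigmoid(x) = 1 / (1 + e^-x), here is a small sketch. The two-branch form is a common numerical-stability trick and is an addition for illustration, not part of the definition.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the algebraically equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)


mid = sigmoid(0.0)       # the curve crosses 0.5 at x = 0
left = sigmoid(-6.0)     # close to 0
right = sigmoid(6.0)     # close to 1
```

The outputs always lie between 0 and 1, and sigmoid(x) + sigmoid(-x) = 1, which is what gives the curve its symmetric "S" shape.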

Explain how an AUC-ROC curve works.

The ROC (Receiver Operating Characteristic) curve shows how the relationship between the true positive rate (recall) and the false positive rate changes as we vary the threshold for identifying a positive in our model. AUC (Area Under the Curve) tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with disease and no disease. Up and to the left is better! TPR/Recall = TP / (TP + FN). FPR = FP / (TN + FP). AUC = 1 is best; AUC = 0.5 means no discrimination.
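
To make the TPR and FPR formulas concrete, the sketch below sweeps the decision threshold over a tiny, made-up set of labels and scores, collects one (FPR, TPR) point per threshold, and integrates under the points with the trapezoidal rule. The function names and the data are assumptions for illustration only.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
area = auc(curve)
```

A model whose scores rank every positive above every negative would reach the top-left corner (FPR = 0, TPR = 1) and an area of 1.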

LSTM vs. Transformer

Advantages of Transformer: * Easier to train, more efficient * Transfer learning works * Can be trained on unlabeled text. LSTM better when: * Sequence length is long or infinite (Transformers are n^2 in sequence length) * Real-time control for robotics or similar * Can't pre-train on large corpus

How would you handle an imbalanced dataset?

An imbalanced dataset is when you have, for example, a classification test and 90% of the data is in one class. That leads to problems: an accuracy of 90% can be skewed if you have no predictive power on the other category of data! Here are a few tactics to get over the hump: 1- Collect more data to even the imbalances in the dataset. 2- Resample the dataset to correct for imbalances. 3- Try a different algorithm altogether on your dataset. What's important here is that you have a keen sense for what damage an unbalanced dataset can cause, and how to balance that.

What's the difference between MAE (mean absolute error), MSE (mean squared error), RMSE (Root MSE)? When would you choose one over another?

As a general guide, use MAE when you aren't too worried about outliers. MSE: because of the squaring, it assigns more weight to the bigger errors before they are summed and averaged. RMSE is obtained by taking the square root of MSE, which puts it back in the same units as the target. MSE and RMSE are really useful when you want to see if outliers are messing with your predictions. RMSE is more useful when large errors are particularly undesirable.
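
A small worked comparison may help here. The two prediction lists below are invented so that they have the same MAE, but one of them concentrates all of its error in a single large miss; the squared metrics pick that up.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical targets and two sets of predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
spread_out = [2.0, 6.0, 3.0, 6.0]    # every prediction is off by 1
one_outlier = [3.0, 5.0, 2.0, 11.0]  # three exact hits, one miss of 4

mae_a, rmse_a = mae(y_true, spread_out), rmse(y_true, spread_out)
mae_b, rmse_b = mae(y_true, one_outlier), rmse(y_true, one_outlier)
```

Both prediction sets have an MAE of 1, but the RMSE doubles for the set with the single large error, which is why RMSE is preferred when large errors are particularly undesirable.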

Explain backpropagation. Why is it fast?

At the heart of backprop is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. 1. Input a set of training examples 2. For each training example: * Feedforward: compute activations (a' =σ(w*a+b) for a given layer) * Output error: Compute loss vectors * Backprop the error 3. Gradient descent What's clever about backprop is that it enables us to simultaneously compute all the partial derivatives ∂C/∂wj using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass http://neuralnetworksanddeeplearning.com/chap2.html

Explain batch normalization.

BN adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and a "mean" parameter (beta) is added. The 3 steps are: 1. Calculate the mean and variance of the layer's input. 2. Normalize the layer inputs using the previously calculated batch statistics. 3. Scale and shift in order to obtain the output of the layer. (Notice that gamma and beta are learned during training along with the original parameters of the network.) The basic formula is x* = (x - E[x]) / sqrt(var(x)), where x* is the new value of a single component, E[x] is its mean within a batch and var(x) is its variance within a batch. BN extends that formula further to x** = gamma * x* + beta, where x** is the final normalized value. gamma and beta are learned per layer. BN reduces the amount by which the hidden unit values shift around (covariate shift). Also, BN allows each layer of a network to learn a little more independently of the other layers. https://towardsdatascience.com/batch-normalization-theory-and-how-to-use-it-with-tensorflow-1892ca0173ad
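
The three steps can be sketched for a single feature across one batch. This is a training-time forward pass only; the small `eps` inside the square root is an assumption added to avoid division by zero (it is not in the formula above), and in a real network gamma and beta would be learned rather than passed in.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    # Step 1: batch statistics.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # Step 2: normalize, x* = (x - E[x]) / sqrt(var(x) + eps).
    normalized = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # Step 3: scale and shift, x** = gamma * x* + beta.
    return [gamma * n + beta for n in normalized]


batch = [1.0, 2.0, 3.0, 4.0]          # hypothetical activations
plain = batch_norm(batch)             # roughly zero mean, unit variance
shifted = batch_norm(batch, gamma=2.0, beta=5.0)
```

With gamma = 1 and beta = 0 the output has mean 0 and variance close to 1; other values let the layer recover whatever scale and offset work best for the next layer.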

When to use classification over regression?

Classification produces discrete values/categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the belongingness of data points in your dataset to certain explicit categories (ex: If you wanted to know whether a name was male or female rather than just how correlated they were with male and female names.)

What is cross-entropy loss?

Cross entropy can be used to define a loss function in machine learning and optimization. The true distribution y' is the true label (for example, a one-hot vector), and the given distribution y is the predicted value of the current model. g(y', y) = - Σ_i y'_i * log(y_i)
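
A direct translation of the formula, assuming the true label is a one-hot vector and the prediction is a list of class probabilities. The `eps` clamp is an added safeguard against log(0) and is not part of the definition.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Return -sum_i y'_i * log(y_i) for one example."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


label = [0.0, 1.0, 0.0]              # hypothetical one-hot label: class 1
confident_right = [0.1, 0.8, 0.1]
confident_wrong = [0.8, 0.1, 0.1]

loss_good = cross_entropy(label, confident_right)
loss_bad = cross_entropy(label, confident_wrong)
```

With a one-hot label only the predicted probability of the true class contributes, so the loss is simply -log of that probability: small when the model is confidently right, large when it is confidently wrong.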

What are the dot and Hadamard products?

Dot product: corresponding elements of two vectors are multiplied together and the products are added, so the result is a scalar. (In matrix multiplication, each output entry is the dot product of a row of the first matrix with a column of the second.) Hadamard product (element-wise multiplication): basically matrix addition except you multiply. Inputs and outputs are all matrices of the same dimensions.
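
The difference is easy to see on small inputs. A minimal sketch with made-up vectors and matrices:

```python
def dot(u, v):
    """Multiply corresponding elements and sum: the result is a scalar."""
    return sum(a * b for a, b in zip(u, v))


def hadamard(A, B):
    """Multiply matrices element by element: the result has the same shape."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


scalar = dot([1, 2, 3], [4, 5, 6])                        # 1*4 + 2*5 + 3*6
elementwise = hadamard([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```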

Name an example where ensemble techniques might be useful.

Ensemble techniques use a combination of learning algorithms to achieve better predictive performance than any single model. They typically reduce overfitting and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods, from bagging to boosting to a "bucket of models" method, and demonstrate how they could increase predictive power.

What's the F1 score? How would you use it?

F1 = 2 * (P * R) / (P + R). The F1 score is a measure of a model's performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst. You would use it in classification tests where true negatives don't matter much.
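
A short sketch of the formula, computed from confusion-matrix counts. The counts are invented, and returning 0 when a denominator is 0 is a convention chosen here, not something the formula dictates.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5; F1 comes out below their simple average of 0.65 because it is pulled toward the lower of the two values. True negatives never appear in the calculation.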

What is representation/feature learning?

Feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

What is feature scaling? Why is that important?

Feature scaling is a method used to normalize the features of data. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. Examples: rescaling (min-max normalization), mean normalization, standardization (Z-score normalization), scaling to unit length

Explain Batch, Mini Batch & Stochastic Gradient Descent.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function where you: 1. Compute the slope (gradient), that is, the first-order derivative of the function at the current point. 2. Move in the direction opposite to the slope from the current point by the computed amount. Batch Gradient Descent: all the training data is taken into consideration to take a single step. We take the average of the gradients of all the training examples and then use that mean gradient to update our parameters. So that's just one step of gradient descent in one epoch. Stochastic Gradient Descent (SGD): we consider just one example at a time to take a single step. Mini-Batch Gradient Descent: we use a batch of a fixed number of training examples, smaller than the full dataset, and call it a mini-batch. Just like with SGD, the cost in mini-batch gradient descent fluctuates from step to step because each gradient is averaged over only a small number of examples.
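
The three variants differ only in how many examples feed each update. The sketch below shows just the batching for one epoch, assuming the data is reshuffled with a fixed seed; `minibatches` is a hypothetical helper name, and the parameter update itself is left out.

```python
import random


def minibatches(data, batch_size, seed=0):
    """Yield shuffled batches that together cover the data once (one epoch)."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(10))                            # stand-in for 10 training examples

batch_gd = list(minibatches(data, len(data)))     # batch GD: one step per epoch
sgd = list(minibatches(data, 1))                  # SGD: one step per example
mini = list(minibatches(data, 4))                 # mini-batch: steps of 4, 4 and 2
```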

Explain gradient descent.

Gradient descent repeatedly applies an update rule. General: v→v′=v−η∇C. Weights: wk→w′k = wk − η(∂C/∂wk). "Partial derivative of the cost (loss) function w.r.t. each variable." By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function.
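
The update rule v′ = v − η∇C can be tried on a toy cost whose minimum is known. The quadratic cost, the learning rate and the step count below are all assumptions picked for illustration.

```python
def gradient_descent(grad, v0, eta=0.1, steps=100):
    """Repeatedly apply v <- v - eta * grad(v)."""
    v = list(v0)
    for _ in range(steps):
        g = grad(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v


def grad_cost(v):
    """Gradient of the toy cost C(v) = (v0 - 3)^2 + (v1 + 1)^2, minimized at (3, -1)."""
    return [2 * (v[0] - 3), 2 * (v[1] + 1)]


v_min = gradient_descent(grad_cost, [0.0, 0.0])
```

Starting from (0, 0), the iterates approach (3, -1). With a learning rate that is too large for the cost surface the same loop would overshoot and diverge instead.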

What are the types of hyperparameter optimization?

* Grid search (parameter sweep) * Random search * Bayesian opt (black box): builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set * Gradient-based opt * Evolutionary opt: uses evolutionary algorithms * Population-based training (PBT): learns both hyperparameter values and network weights. Multiple learning processes operate independently, using different hyperparameters.

Describe these methods of model validation: hold out, k-fold cross validation, LOOCV

Model validation is the process by which we ensure that our models can perform acceptably in "the real world." In holdout validation, we split the data into a training and a testing set. Cross validation is a method of model validation which splits the data in creative ways in order to obtain better estimates of "real world" model performance and minimize validation error. It leverages subsets of our data and an understanding of the bias/variance trade-off in order to improve generalization. K-fold validation is a popular method of cross validation which shuffles the data and splits it into k folds (groups). In general, K-fold validation is performed by taking one group as the test data set and the other k-1 groups as the training data, fitting and evaluating a model, and recording the chosen score. This process is then repeated with each fold (group) as the test data, and all the scores are averaged to obtain a more comprehensive model validation score. Leave One Out Cross Validation (LOOCV) can be considered a type of K-fold validation where k = n, given n is the number of rows in the dataset. Other than that the methods are quite similar. LOOCV takes much longer to run than the other methods, however, because it fits one model per row. https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd
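
The splitting logic behind K-fold and LOOCV can be sketched without any ML library. `k_fold_indices` is a hypothetical helper that only produces row indices; fitting and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1, split them into k folds, and yield (train, test) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


five_fold = list(k_fold_indices(n=10, k=5))   # 5 splits, 2 test rows each
loocv = list(k_fold_indices(n=4, k=4))        # k = n: one test row per split
```

Every row lands in the test set exactly once, so the averaged score uses all of the data for evaluation without ever testing a model on rows it was trained on.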

What cross-validation technique would you use on a time series dataset?

Instead of using standard k-folds cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data: it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn't hold in earlier years! You'll want to do something like forward chaining, where you model on past data and then look at forward-facing data. fold 1: training [1], test [2]; fold 2: training [1 2], test [3]; fold 3: training [1 2 3], test [4]; fold 4: training [1 2 3 4], test [5]; fold 5: training [1 2 3 4 5], test [6]
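
The forward-chaining folds listed above follow a simple pattern that can be generated programmatically. A minimal sketch, assuming the series has already been cut into numbered, chronologically ordered blocks; `forward_chaining_splits` is a hypothetical name.

```python
def forward_chaining_splits(n_blocks):
    """Yield (training blocks, test block) pairs where training always precedes the test block."""
    for i in range(1, n_blocks):
        yield list(range(1, i + 1)), [i + 1]


folds = list(forward_chaining_splits(6))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: training {train}, test {test}")
```

Unlike shuffled k-fold, no fold ever trains on data from after its test period.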

Explain the transformer architecture. What is it good for?

Justification: RNNs are slow and don't handle long sequences well (vanishing gradients). LSTMs are better at longer sequences but are even slower. Inputs must be processed sequentially, because previous inputs/states are needed to make progress on the current state. The Transformer uses an encoder-decoder structure (like RNN-based models), but the inputs can be passed in parallel. The input words are embedded as vectors, and a "Positional Encoder" adds context based on the position of each word in the sentence. The output is word vectors with positional info. These word vectors are passed into an "Encoder Block" with a multi-headed attention layer and a feed-forward layer. Attention: What part of the input should I focus on? Self-attention: attention w.r.t. one's self, i.e. how relevant is a given word in a sentence to all the other words? For every word we have an attention vector that captures contextual relationships. Feed-forward nets take these vectors and translate them into a format that can be used by the decoder block. The decoder gets the embedding of the word and the positional encoding/vector. Decoder has 3 main components. 1. Encoder-Decoder Attention Block

How is KNN different from k-means clustering?

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data that you want to classify an unlabeled point into (thus the nearest neighbor part). K-means clustering requires only a set of unlabeled points and a number of clusters k: the algorithm takes the unlabeled points and gradually learns how to cluster them into groups by repeatedly assigning each point to its nearest cluster mean (centroid) and recomputing the means. The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn't, and is thus unsupervised learning.

Law of Large Numbers? Central Limit Theorem?

LLN: Carry out identical but independent experiments each yielding a random number and average all those numbers. The more experiments you perform, the more likely will the average be close to the expected value (of the experiment). CLT: Carry out identical but independent experiments each yielding a random number and add up all those numbers. If you repeat this process of coming up with a sum of random numbers, the frequencies of resulting sums will approximately follow a normal distribution (i.e. a Gaussian bell curve). The more numbers you sum per experiment (and the more experiments), the better the approximation.

What is Naive Bayes and why is it naive?

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. The assumptions the algorithm makes are virtually impossible to find in real-life data. Conditional probability is calculated as a pure product of individual probabilities of components. This means that the algorithm assumes the presence or absence of a specific feature of a class is not related to the presence or absence of any other feature (absolute independence of features), given the class variable. "A Naive Bayes classifier that figured out that you liked pickles and ice cream would probably naively recommend you a pickle ice cream."

What is Bayes' Theorem? How is it useful in a machine learning context?

P(A | B) = [P(B | A) * P(A)] / P(B) Describes the probability of an event, based on prior knowledge of conditions that might be related to the event. If the risk of developing health problems is known to increase with age, Bayes's theorem allows the risk to an individual of a known age to be assessed more accurately than simply assuming that the individual is typical of the population as a whole. Bayes' Theorem is the basis behind a branch of machine learning that most notably includes the Naive Bayes classifier. That's something important to consider when you're faced with machine learning interview questions.
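
A worked example of the formula, using a diagnostic test. All three input numbers are invented for illustration, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) given P(A), P(B | A) and P(B | not A)."""
    # P(B) = P(B | A) * P(A) + P(B | not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of people have the condition, the test detects it
# 90% of the time, and 5% of healthy people test positive anyway.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
```

Even with a positive result the probability of having the condition is only about 15%, because the prior is so low. Updating a prior with evidence in this way is the step that the Naive Bayes classifier applies for each class.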

What's the difference between pipeline approach and end-to-end approach to machine learning system architecture.

A pipeline for speech recognition could look like: audio -> features -> phonemes -> words -> transcripts. e2e uses a large NN to go from audio -> transcript

What's the difference between probability and likelihood?

Probability quantifies anticipation (of outcome), likelihood quantifies trust (in model). Suppose somebody challenges us to a 'profitable gambling game'. Then probabilities will serve us to compute things like the expected profile of our gains and losses (mean, mode, median, variance, information ratio, value at risk, gambler's ruin, and so on). In contrast, likelihood will serve us to quantify whether we trust those probabilities in the first place, or whether we 'smell a rat'.

What are vanishing gradients? How to solve them?

Problem: As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train. Why: Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause only a small change in the output, so the derivative becomes small. Backpropagation multiplies these small derivatives layer by layer, so the gradient shrinks as it reaches earlier layers. Solutions: The simplest solution is to use other activation functions, such as ReLU, which doesn't cause a small derivative. Residual networks are another solution, as they provide residual connections straight to earlier layers. Finally, batch normalization layers can also mitigate the issue by normalizing the input so |x| doesn't reach the outer edges of the sigmoid function.

Explain reinforcement learning.

RL enables software-defined agents to learn the best actions possible in a virtual environment in order to attain their goals. It unites function approximation and target optimization, mapping state-action pairs to expected rewards, and tries to correlate immediate actions with the delayed returns they produce. The following is done repeatedly: env(a, s) -> s', r; agent(s', r) -> a'. RL is the process of running the agent through sequences of state-action pairs, observing the rewards that result, and adapting the Q function to those rewards until it accurately predicts the best path for the agent to take. That prediction is known as a policy. Agent: takes actions. Action: one of the set of moves the agent can make. Discount factor: multiplied by future rewards as discovered by the agent to dampen these rewards' effect on the agent's choice of action. Designed to make future rewards worth less than immediate rewards. Environment: the world the agent moves in. State: the concrete and immediate situation in which the agent finds itself. Policy (π): the strategy that the agent employs to determine the next action based on the current state. Maps states to actions. Value (V): the expected long-term (LT) return with discount, as opposed to the short-term reward R. Vπ(s) is the expected LT return of the current state under policy π. Q-value/action-value: like value, except it takes an extra parameter, the current action a. Qπ(s, a) refers to the LT return of taking action a under policy π from current state s. Maps state-action pairs to rewards. Trajectory: a sequence of states and the actions that influence those states. Key distinctions: reward is an immediate signal that is received in a given state, while value is the sum of all rewards you might anticipate from that state. Value is a LT expectation, while reward is an immediate pleasure.
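
The role of the discount factor can be shown with a few lines. The sketch below computes the discounted return of one fixed reward sequence; the reward values and the discount factors are made up, and a value function would be the expectation of this quantity over trajectories.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward t steps ahead is multiplied by gamma**t."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (return from the next step).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


steady = discounted_return([1.0, 1.0, 1.0], gamma=0.5)      # 1 + 0.5 + 0.25
delayed = discounted_return([0.0, 0.0, 10.0], gamma=0.9)    # 0.9**2 * 10
```

A reward of 10 that arrives two steps later is worth only about 8.1 now with a discount factor of 0.9, which is how future rewards are made worth less than immediate ones.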

What are RNNs, LSTMs, GRUs?

Recurrent neural networks (RNNs) are a class of artificial neural networks which are often used with sequential data. The 3 most common types of recurrent neural networks are 1. vanilla RNN 2. LSTM 3. GRU. An RNN has a looping mechanism that allows information (the hidden state) to flow from one step to the next. Basically feed-forward NNs rolled out over time.

1. Vanilla RNN
Processes a sequence one element at a time, in vector format. It passes the previous hidden state to the next step of the sequence in addition to the input at that step. Uses tanh activation. Issue is that it has short-term memory caused by vanishing gradients: it backprops through time, so the weight updates attributable to earlier steps get smaller as the sequence gets longer.

    # Pseudocode: RNN and FeedForwardNN stand in for trained layers.
    rnn = RNN()
    ff = FeedForwardNN()
    hidden_state = [...]
    for word in input:
        output, hidden_state = rnn(word, hidden_state)
    prediction = ff(output)  # predict from the last output

2. LSTM
Has the following components: * cell state * forget gate * input gate * output gate. Forget gate decides what information should be kept/thrown away. Pass in the last hidden state and the input and run them through a sigmoid (0-1). Input gate updates the cell state. We pass the last hidden state and the input into sigmoid AND tanh (-1, 1) functions and multiply the outputs of these. The new cell state is the last cell state times the forget gate output, plus the input gate output. Output gate decides what the next hidden state should be and is used for prediction. Pass the last hidden state and the input into a sigmoid and multiply the result by the tanh of the updated cell state. This output is the new hidden state.

    # Pseudocode: the *_layer functions stand in for trained layers.
    def LstmCell(prev_ct, prev_ht, input):
        combine = prev_ht + input            # concatenate hidden state and input
        ft = forget_layer(combine)           # forget gate (sigmoid)
        candidate = candidate_layer(combine) # candidate values (tanh)
        it = input_layer(combine)            # input gate (sigmoid)
        ct = prev_ct * ft + candidate * it   # new cell state
        ot = output_layer(combine)           # output gate (sigmoid)
        ht = ot * tanh(ct)                   # new hidden state
        return ht, ct

    for input in inputs:
        ht, ct = LstmCell(ct, ht, input)     # unpack in the order returned

3. GRU
Newer than LSTM; got rid of the cell state and uses only the hidden state. * Update gate: similar to LSTM's forget and input gates; decides what information to throw away/add. * Reset gate: regulates how much past information we should forget. GRUs have fewer tensor ops, so they can be faster. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

What is selection bias? Survivorship bias?

Selection bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. It is a form of selection bias.

What is attention? What is soft and hard attention? Self-attention?

Soft attention: different parts, different subregions. It is deterministic meaning regions of interest will always be the same given the same inputs. This is because we consider all the regions to produce the ROI. Hard attention: only one subregion. It is a stochastic process which introduces randomness.

What is weight initialization? Are there any special techniques?

Standard approach is to initialize the weights by sampling from a normal/Gaussian distribution with mean 0 and standard deviation 1. We can do better with std dev = 1/sqrt(n), where n = number of input weights to the neuron. (Speeds up learning.)
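
A minimal sketch of the scaled initialization, std dev = 1/sqrt(n), for one fully connected layer. The layer sizes, the fixed seed and the name `init_weights` are assumptions for illustration.

```python
import math
import random


def init_weights(n_in, n_out, seed=0):
    """Return an n_out x n_in weight matrix drawn from N(0, 1/n_in)."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(n_in)   # shrink the spread as the number of inputs grows
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


weights = init_weights(n_in=100, n_out=50)   # each weight has std dev of about 0.1
```

With a standard deviation of 1, the weighted sum of 100 inputs would have a much larger spread, pushing sigmoid-like neurons into their flat, saturated regions where learning is slow; the 1/sqrt(n) scaling keeps the sum in a reasonable range.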

What is dropout regularization?

Start by randomly deleting half the hidden neurons. Then forward and back prop over a mini-batch. Then delete a different half and train with a new mini-batch. At the end, halve the weights outgoing from the hidden neurons, because the net was trained with half the neurons. Prevents over-reliance on certain neurons and prevents overfitting.
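
The procedure can be sketched on a list of hidden activations. This follows the scheme described above (drop at training time, scale down at test time) rather than the "inverted dropout" variant that many libraries use; the function names and the drop probability of 0.5 are illustrative assumptions.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each hidden activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron but scale by (1 - p_drop).

    Scaling the activations is equivalent to halving the outgoing weights when p_drop = 0.5.
    """
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)                       # fixed seed so the run is repeatable
hidden = [1.0] * 10000                       # hypothetical hidden-layer outputs
train_out = dropout_train(hidden, rng=rng)   # a fresh random mask on every call
test_out = dropout_test([2.0, 4.0])
```

On average about half the activations survive during training, so scaling by one half at test time keeps the expected input to the next layer the same.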

What is the difference between supervised and unsupervised machine learning?

Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you'll need to first label the data you'll use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling data explicitly.

What's the "kernel trick" and how is it useful?

The kernel trick involves kernel functions that let us operate in higher-dimensional spaces without explicitly calculating the coordinates of points within that space: instead, kernel functions compute the inner products between the images of all pairs of data in a feature space. This gives them the very useful attribute of working in higher dimensions while being computationally cheaper than the explicit calculation of said coordinates. Many algorithms can be expressed in terms of inner products. Using the kernel trick enables us to effectively run algorithms in a high-dimensional space with lower-dimensional data.

What is the softmax function? What is special about it?

Softmax is a generalization of the logistic function that normalizes an input vector of real values into a vector of values that follows a probability distribution whose total sums up to 1. Unlike sigmoid or ReLU, softmax is not typically used as an activation function between hidden layers; it is usually applied at the output layer of a classifier to turn scores into class probabilities.
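
A minimal sketch of softmax over a list of scores (logits). Subtracting the maximum before exponentiating is a standard numerical-stability step added here; it does not change the result.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])               # hypothetical class scores
huge = softmax([1000.0, 1000.0])               # would overflow without the shift
```

The largest score always gets the largest probability, and the outputs sum to 1, which is what makes them usable as class probabilities.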

Explain Bayes error rate and avoidable bias. How to estimate it?

Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error. A number of approaches to the estimation of the Bayes error rate exist. One method seeks to obtain analytical bounds which are inherently dependent on distribution parameters, and hence difficult to estimate. Another approach focuses on class densities, while yet another method combines and compares various classifiers. Avoidable bias = difference between training error and human-level performance. Variance = difference between development error and training error. Usually, human-level and Bayes error are quite close, especially for natural perception problems, so there is little scope for improvement after surpassing human-level performance, and progress slows down considerably.

How do you ensure you're not overfitting with a model?

This is a simple restatement of a fundamental problem in machine learning: the possibility of overfitting training data and carrying the noise of that data through to the test set, thereby providing inaccurate generalizations. There are three main methods to avoid overfitting: 1- Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. 2- Use cross-validation techniques such as k-folds cross-validation. 3- Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.

What are exploding gradients? How to solve them?

Training a neural network can become unstable given the choice of error function, learning rate, or even the scale of the target variable. Large updates to weights during training can cause a numerical overflow or underflow often referred to as "exploding gradients." The problem of exploding gradients is more common with recurrent neural networks, such as LSTMs given the accumulation of gradients unrolled over hundreds of input time steps. A common and relatively easy solution to the exploding gradients problem is to change the derivative of the error before propagating it backward through the network and using it to update the weights. Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as "gradient clipping." You can also use weight regularization.
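
Both forms of gradient clipping mentioned above can be sketched on a plain list of gradient values. The function names and thresholds are assumptions for illustration; deep learning frameworks ship their own versions of these operations.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]       # same direction, smaller length


def clip_by_value(grads, low, high):
    """Clamp each gradient component into the range [low, high]."""
    return [min(max(g, low), high) for g in grads]


rescaled = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is scaled down to 1
clamped = clip_by_value([-5.0, 0.5, 7.0], -1.0, 1.0)
```

Clipping by norm preserves the direction of the update, while clipping by value can change it, since each component is limited independently.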

What is transfer learning?

Transfer learning makes use of the knowledge gained while solving one problem and applies it to a different but related problem. For example, knowledge gained while learning to recognize cars can be used to some extent to recognize trucks. 1. Pre-training on something like Wikipedia/ImageNet 2. Fine-tuning with the target dataset/problem domain

What's the difference between Type I and Type II error?

Type I error is a false positive, while Type II error is a false negative. Briefly stated, Type I error means claiming something has happened when it hasn't, while Type II error means that you claim nothing is happening when in fact something is. A clever way to think about this is to think of Type I error as telling a man he is pregnant, while Type II error means you tell a pregnant woman she isn't carrying a baby.

What is online ML?

Online ML is when data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring out-of-core algorithms (which use external memory because the data is too big to fit in RAM). It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., stock price prediction.

What is the curse of dimensionality? Are there ways to overcome it?

When your data has too many features. Can be solved by using dimension reduction like PCA. If we have more features than observations, then we run the risk of massively overfitting our model, which would generally result in terrible out-of-sample performance. When we have too many features, observations become harder to cluster: believe it or not, too many dimensions cause every observation in your dataset to appear equidistant from all the others. If the distances are all approximately equal, then all the observations appear equally alike (as well as equally different), and no meaningful clusters can be formed.

What evaluation approaches would you use to gauge the effectiveness of a machine learning model?

You would first split the dataset into training and test sets, or perhaps use cross-validation techniques to further segment the dataset into composite sets of training and test sets within the data. You should then implement a choice selection of performance metrics (the link below has a fairly comprehensive list). You could use measures such as the F1 score, the accuracy, and the confusion matrix. What's important here is to demonstrate that you understand the nuances of how a model is measured and how to choose the right performance measures for the right situations. https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/

What are precision and recall? Explain the trade-off between them?

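precision = TP / (TP + FP). recall = TP / (TP + FN). Trade-off: raising the decision threshold for calling something positive usually increases precision (fewer false positives) but lowers recall (more false negatives), and lowering it does the opposite.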
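
The two formulas can be evaluated at different decision thresholds to see how one metric is traded for the other. The labels and scores below are invented, and treating precision as 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

lenient = precision_recall(y_true, scores, threshold=0.25)  # catches every positive
strict = precision_recall(y_true, scores, threshold=0.7)    # only sure predictions
```

At the lenient threshold recall is perfect but a third of the predicted positives are wrong; at the strict threshold every predicted positive is right but half of the true positives are missed.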

