Tech Questions

When approximating a target function Y = f(X) in machine learning, a key question to ask is:

"How well does the model generalize to new data?"

Two ways to handle Overfitting?

1. Use a resampling technique to estimate model accuracy; the most popular resampling technique is k-fold cross-validation.
2. Hold back a validation dataset.
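
A minimal k-fold cross-validation sketch with scikit-learn; the synthetic dataset and the choice of model are illustrative assumptions, not part of the card:

# Estimate model accuracy with 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# Train on 4 folds, validate on the held-out fold, repeat 5 times.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())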

MSE equation

MSE = (1/N) * Σ(Y_real - Y_predicted)^2

Classification model evaluation metrics

Accuracy, Precision, Recall, F1_score, AUC, Confusion Matrix

RF: The Subspace Sampling Method

After bagging the data, the Random Forest uses the Subspace Sampling Method to further increase variability between the trees in the Random Forest. Although it has a fancy mathematical-sounding name, the Subspace Sampling Method refers to randomly selecting a subset of features to use as predictors for each node when training a decision tree, instead of using all predictors available at each node.
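
A minimal sketch of subspace sampling via scikit-learn's max_features argument; the synthetic dataset and specific settings are illustrative assumptions:

# Each split considers only a random subset of features, decorrelating the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt": roughly sqrt(n_features) predictors are sampled per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)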

Information gain is...

An impurity/uncertainty-based criterion that uses entropy as the impurity measure. _____________________ is the key criterion used by the ID3 classification tree algorithm to construct a Decision Tree. The Decision Tree algorithm always tries to maximize ___________________. The entropy of the dataset is calculated for each attribute, and the attribute showing the highest ____________________ is used to create the split at each node.

Gradient Descent equation

B = A - [𝛾 ∇f(A)]

Adjusted R^2

Because R^2 cannot determine whether coefficient estimates and predictions are biased, use adjusted R^2. Adjusted R^2 goes up only if a new predictor improves the model more than would be expected by chance.

What is Bagging short for?

Bootstrap AGGregation

What does CRISP-DM stand for?

CRoss-Industry Standard Process for Data Mining

Linear Regression optimizations/regularizations

Feature Selection, Scaling/Normalization, Correlation Matrix, Gradient Descent, Polynomials and Interactive Terms, Lasso (L1), Ridge (L2), Cross-Validation

When R^2 = 0.5 for a model...

The model explains half of the variance in the target: it sits halfway between the baseline model that just predicts the mean of the target values each time (R^2 = 0) and a perfect model (R^2 = 1).

F1 Score Definition

The harmonic mean of precision and recall

Accuracy Definition

How frequently is our prediction correct?

Draw the CRISP-DM diagram for different phases. Describe what CRISP-DM is.

It is an open standard process model that describes common approaches used by data mining experts, and it is the most widely used analytics model. Its six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

MAP (Maximum A Posteriori) Estimation for Posterior probability

It's similar to MLE, but we use _____ when we have some idea about the prior probability of the data. We can calculate marginal probabilities or simply use a subjective value for the prior probability in a Bayesian context. The difference between this and MLE is the presence of the prior. Note that when p(θ) is uniform, i.e. giving the same prior probability to all possible values (P(H) in a coin toss is 0.5), MLE and _____ are equivalent, and as the number of observations in X increases, MLE and _____ converge, i.e. they both estimate the underlying parameters.

DT: Maximum Features

Maximum number of features to consider when splitting a node

P > |t|

The p-value for the null hypothesis that the coefficient = 0. If it is less than the significance level, often 0.05, it indicates that there is a statistically significant relationship between the term and the response.

Confusion Matrix

Plots the values for TP, TN, FP, FN

Scaling/Normalization

Prevents the model from putting too much weight on features that are measured in units much larger than those of most other features, which can cause a skewed representation of feature importance.

DT: Minimum Samples Leaf with Split

Restricts the minimum number of samples required in a leaf node (or required to split a node).

What does a Data Analyst do?

Takes data and uses it to help companies make better business decisions.

What is the SVD++ Algorithm?

The ______ algorithm, an extension of SVD taking into account implicit ratings.

MSE definition

The sum of the squared differences between y_predicted and y_actual, divided by the number of observations. It shows how accurate, on average, our predictions are.

Splitting Criteria

The training process of a decision tree can be generalized as "recursive binary splitting". In this procedure all the features are considered and different split points are tried and tested using some cost function; the split with the lowest cost is selected. Two main criteria are used:
- CART (Classification and Regression Trees): uses the Gini Index as the metric.
- ID3 (Iterative Dichotomiser 3): uses the Entropy function and Information Gain as metrics.
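
A minimal sketch contrasting the two criteria with scikit-learn's DecisionTreeClassifier; the iris dataset is an illustrative assumption:

# Compare Gini-based (CART-style) and entropy-based (ID3-style) split quality.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split quality measured with the Gini index (scikit-learn's default).
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Split quality measured with entropy / information gain.
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)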

Accuracy and F1-Score

The two most informative metrics that are often cited to describe the performance of a model are _____1_____ and _____2_____. Let's take a look at each and see what's so special about them. _____1_____ is useful because it allows us to measure the total number of predictions our model got right, including both True Positives and True Negatives; it answers: "Out of all the predictions our model made, what percentage were correct?" ____1____ is the most common metric for classification and provides a solid holistic view of the overall performance of our model. ____2____ combines precision and recall (their harmonic mean), so it cannot be high unless both are high.

The t-statistic value

This is a measure of how statistically significant the coefficient is.

Log Likelihood

When calculating maximum likelihood, we often use the ________________________, as taking the logarithm can simplify calculations. For example, taking the logarithm of a set of products allows us to decompose the problem from products into sums. (You may recall from high school mathematics that x^(a+b) = x^a ∙ x^b. Similarly, taking the logarithm of both sides of a function allows us to transform products into sums.)

Mini Batch GD

divides the training set into different batches and performs an update for each batch (a mix of SGD and Batch GD)

Lasso (L1)

Adds a penalty term (lambda) on the absolute values of the coefficients. This regularization method can help with overfitting and can also be considered a form of feature selection, since it can reduce certain feature coefficients to zero.

Ridge (L2)

Adds a penalty term (lambda) on the squared values of the coefficients. This regularization method can help with overfitting (it shrinks coefficients but does not set them to zero).
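
A minimal sketch of Lasso (L1) and Ridge (L2) in scikit-learn, where alpha plays the role of the lambda penalty term; the synthetic regression data is an assumption:

# L1 can zero out coefficients (feature selection); L2 only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # may drive some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficients but keeps them nonzero

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))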

Correlation matrix

removing features that are highly correlated with each other to reduce noise (multicollinearity).

RMSE

root mean square error; the square root of the MSE, i.e. the standard deviation of the differences between actual and predicted values (the residuals)

RSS

RSS = Σ(Y_real - Y_predicted)^2

Precision Equation

TP / (TP + FP)

Recall Equation

TP / (TP + FN), i.e. TP / (all actual positives)

Accuracy Equation

(TP + TN) / (total observations)
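
A minimal sketch computing these metrics with scikit-learn on hypothetical labels and predictions:

# Accuracy, precision, recall, F1, and the confusion matrix from toy predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall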

Feature Selection

Using RFE or other methods, such as removing features with p-values > 0.05, can improve model performance.

Bias is...

...an error from erroneous assumptions in the learning algorithm. High _____ can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

Maximum Likelihood Estimation primarily deals with....

...deals with determining the parameters that maximize the probability of the data. Such a determination can help us predict the outcome of future experiments, e.g., if we toss the coin one more time, what is the probability of seeing a head?

The Receiver Operator Characteristic curve (ROC curve) which illustrates....

...illustrates the False Positive Rate against the True Positive Rate of our classifier. When training a classifier, we are hoping the ________ curve will hug the upper left corner of our graph. A classifier with 50-50 accuracy is deemed 'worthless'; this is no better than random guessing, as in the case of a coin flip.

Information gain (IG) measures...

...measures how much "information" a feature gives us about the class.

Overfitting happens when...

...when a model models the training data too well. In fact, so well that it is not generalizable. ________ can often increase the variance of the model and make it very vulnerable to changes in the data.

Logistic Regression most used hyperparameters:

1. Penalty term: L1 or L2.
2. C - inverse regularization strength (must be a positive float) - helps minimize the error between y_pred and y_actual. Smaller C values mean higher regularization, hence the "inverse strength".
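
A minimal sketch of these two hyperparameters in scikit-learn's LogisticRegression; the synthetic data and chosen values are assumptions:

# Smaller C means stronger regularization (C is the inverse regularization strength).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# L2 penalty with fairly strong regularization.
clf = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X, y)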

The process for training an ensemble through bootstrap aggregation is as follows:

1. Grab a sizable sample from your dataset, with replacement.
2. Train a classifier on this sample.
3. Repeat until all classifiers have been trained on their own sample from the dataset.
4. When making a prediction, have each classifier in the ensemble make a prediction.
5. Aggregate all predictions from all classifiers into a single prediction.
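
A minimal sketch of this procedure using scikit-learn's BaggingClassifier (which defaults to decision tree base estimators); the synthetic data is an assumption:

# Each estimator is trained on a bootstrap sample; predictions are aggregated by vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:5]))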

Boosting works as follows

1. Train a single weak learner.
2. Figure out which examples the weak learner got wrong.
3. Build another weak learner that focuses on the areas the first weak learner got wrong.
4. Continue this loop until a predetermined stopping condition is met, such as a set number of weak learners having been created, or the model's performance having plateaued.

F1 Equation

2 * (precision * recall) / (precision + recall)

Example of the relationship between precision and recall

A doctor who is overly obsessed with recall will have a very low threshold for declaring someone sick, because they are most worried about sick patients getting by them. Their precision will be quite low, because they classify almost everyone as sick and don't care when they're wrong--they only care about making sure that sick people are identified as sick.

A doctor who is overly obsessed with precision will have a very high threshold for declaring someone sick, because they only declare someone sick when they are absolutely sure they will be correct. Although their precision will be very high, their recall will be incredibly low, because a lot of people who are sick but don't meet the doctor's threshold will be incorrectly classified as healthy.

What is the Minkowski Distance?

A metric in a normed vector space that can be considered a generalization of both the Euclidean distance and the Manhattan distance. A normed vector space is just a fancy way of saying a vector space in which every point has a length (norm) defined by a function. You'll often see _________________________ used as a parameter for distance-based machine learning algorithms in Scikit-Learn.

The Naive Bayes Classifier

A simple algorithm that uses maximum likelihood estimation techniques for classification. It is a predictive statistical modeling algorithm that results from a few probabilistic assumptions. Even with highly complicated datasets, it is suggested to try the Naive Bayes approach first before trying more sophisticated classifiers.

To demonstrate the concept of Naive Bayes classification, consider the following example. Objects can be classified as either GREEN or RED, based on their position on the 2-D plane, i.e. their x and y coordinates. Our task is to classify new cases as they arrive, i.e. decide which class label they belong to, based on the positions of the currently existing objects.

Prior Class Probability: Since there are twice as many GREEN objects as RED, it would seem that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED, and this is the basis of our prior probability. Prior probabilities are based on previous experience. There is a total of 60 objects, 40 of which are GREEN and 20 RED:
Prior Probability of Green = Total Green Objects / Total Objects = 2/3
Prior Probability of Red = Total Red Objects / Total Objects = 1/3

Posterior Probability of Unseen Data: Having formulated our prior probability, we are now ready to classify a new object X (a WHITE circle), i.e. calculate the posterior probability of the new object based on its position on the 2-D plane. Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely the new case belongs to that particular color.

Likelihood of Data: To measure this likelihood, we draw a circle around X which encompasses a number of points (the "vicinity"; its size can be chosen arbitrarily and changed to view the impact on the likelihood), irrespective of their class labels. We then count the points in the circle belonging to each class label and calculate the likelihood:
Likelihood of X given Green = No. of Green Objects in the vicinity of X / Total Green Objects
Likelihood of X given Red = No. of Red Objects in the vicinity of X / Total Red Objects
In this example the circle encompasses 1 GREEN object and 3 RED ones, so the Likelihood of X given GREEN is smaller than the Likelihood of X given RED:
Likelihood of X given Green = 1/40
Likelihood of X given Red = 3/20

Prediction For New Data: Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN as RED), the likelihood indicates otherwise: that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In Bayesian analysis, the final classification is produced by combining both sources of information, i.e. the prior and the likelihood, to form a posterior probability using Bayes' theorem.
Posterior Probability of X being Green ∝ Likelihood of X given Green * Prior Probability of Green = 1/40 * 2/3 = 1/60
Posterior Probability of X being Red ∝ Likelihood of X given Red * Prior Probability of Red = 3/20 * 1/3 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior probability, i.e. the ARGMAX. We have simply maximized the posterior probability of color given the data.

The Random Forest algorithm

A supervised learning algorithm that can be used for classification or regression tasks. If our task is a classification task, then our ___________________ will consist of many different Decision Trees for classification; if regression, then the trees will be built for regression. Decision Trees are the cornerstone of ___________________. Put simply, the ___________________ algorithm is an ensemble of Decision Trees.

Gradient Boosting: Weak Learners

All the models we've learned so far are Strong Learners--models with the goal of doing as well as possible on the classification or regression task they are given. The term Weak Learner refers to simple models that do only slightly better than random chance. Boosting algorithms start with a single weak learner--tree methods are overwhelmingly used here, but technically, any model will do. Boosting works as follows:
1. Train a single weak learner.
2. Figure out which examples the weak learner got wrong.
3. Build another weak learner that focuses on the areas the first weak learner got wrong.
4. Continue this loop until a predetermined stopping condition is met, such as a set number of weak learners having been created, or the model's performance having plateaued.
In this way, each new weak learner is specifically tuned to focus on the weak points of the previous weak learner(s). The more often an example is missed, the more likely it is that the next weak learner will be one that can classify that example correctly. In this way, all the weak learners work together to make up a single strong learner.

B = A - [𝛾 ∇f(A)]

B = next position; A = current position; 𝛾 = learning rate (can't be too big or too small); the minus sign = the minimizing portion of GD; ∇f(A) = the gradient of f at A (the partial derivatives with respect to the parameters), which points in the direction of steepest ascent--subtracting it moves us in the direction of steepest descent.
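
A minimal sketch of this update rule on a hypothetical one-dimensional function f(a) = a^2; the function, starting point, and learning rate are illustrative assumptions:

# Repeatedly apply B = A - gamma * grad_f(A) to walk toward the minimum.
def grad_f(a):
    return 2 * a              # derivative of f(a) = a**2

a = 10.0                      # current position A
gamma = 0.1                   # learning rate
for _ in range(100):
    a = a - gamma * grad_f(a) # next position B becomes the new A

print(a)                      # converges toward the minimum at 0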

Recall Definition

How many of the actual positives did our model capture?

Precision Definition

How many of the predicted positives were accurate?

Adaboost

In __________, each learner is trained on a subsample of the dataset, much like we saw with Bagging. Initially, the bag is randomly sampled with replacement. However, each data point in the dataset has a weight assigned. As learners correctly classify an example, that example's weight is reduced. Conversely, when learners get an example wrong, the weight for that sample increases. In each iteration, these weights act as the probability that an item will be sampled into the "bag" which will be used to train the next weak learner. As the number of learners grows, you can imagine that the examples that are easy to get correct will become less and less prevalent in the samples used to train each new learner.
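
A minimal sketch using scikit-learn's AdaBoostClassifier, which by default boosts shallow decision trees (stumps); the synthetic data and settings are assumptions:

# Each successive weak learner is trained on a reweighted sample that
# emphasizes previously misclassified examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0).fit(X, y)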

Logistic Regression

It is best used on a binary dataset when what you're trying to do is classify your data into one group versus another. EX: if we want to classify each observation in a dataset between people who "earn more than 4k" and people who "earn less than 4k", the model estimates, for each person in the dataset, the probability of belonging to one group versus the other based on their features.

MAP and Naive Bayes Classifier

MAP is the basis of the Naive Bayes (NB) classifier.

Logistic regression classification models use maximum likelihood estimation in their logic

MLE primarily deals with determining the parameters that maximize the probability of the data. Such a determination can help us predict the outcome of future experiments, e.g., if we toss the coin one more time, what is the probability of seeing a head? Our assumption leads us to believe that the 10 flips we observed are governed by the same parameter theta. We now have just one parameter governing the entire sequence of coin flips, and that includes the 11th flip as well. This is how MLE allows us to connect the first 10 coin flips to the 11th coin flip and is the key for inference.

The two assumptions we made are used so often in Machine Learning that they have a special name together as an entity: "the i.i.d. assumption", i.e. Independent and Identically Distributed samples. This means that the 10 flips are independent and identically distributed, which is great as it allows us to explicitly write down the likelihood that we are trying to optimize.

MLE: Maximum likelihood estimation finds the underlying parameters of an assumed distribution to maximize the likelihood of the observations. The method obtains the parameter estimates by finding the parameter values that maximize the likelihood function; the estimates are called maximum likelihood estimates, also abbreviated as MLE. The method of maximum likelihood is used with a wide range of statistical analyses. As an example, suppose that we are interested in the heights of adult female penguins, but are unable to measure the height of every penguin in a population (due to cost or time constraints). Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish that by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable given the normal model.

Logistic regression expands upon our previous example of a binomial variable by investigating the conditional probabilities associated with the various features. For example, when predicting your risk for heart disease, we might consider various factors such as your family history, your weight, diet, exercise routines, blood pressure, cholesterol, etc. When looked at individually, each of these has an associated conditional probability that you have heart disease based on each of these factors.

Logistic Regression Optimizations

Many of the Optimization and feature selection techniques used in Linear Regression can also be applied towards optimizing Logistic regression models.

Gradient Boosted Trees are a more advanced boosting algorithm that makes use of Gradient Descent.

Much like Adaboost, Gradient Boosting starts with a weak learner that makes predictions on the dataset. The algorithm then checks this learner's performance, identifying examples that it got right and wrong. However, this is where the Gradient Boosting algorithm diverges from Adaboost's methodology. The model calculates the residuals for each data point to determine how far off the mark each prediction was, and then combines these residuals with a loss function to calculate the overall loss. Many loss functions are used--what matters most is that the loss function is differentiable, so that we can use calculus to compute the gradient of the loss given the inputs of the model. We then use the gradients and the loss as the targets to train the next tree against! In this way, we can use Gradient Descent to minimize the overall loss.
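
A minimal sketch using scikit-learn's GradientBoostingClassifier; the synthetic data and hyperparameter values are assumptions:

# Each new tree is fit to the gradients/residuals of the ensemble built so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0).fit(X, y)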

RF: Resiliency to Overfitting

Once we've created our target number of trees, we'll be left with a Random Forest filled with a diverse set of Decision Trees that were trained on different samples of data and that look at different subsets of features when making predictions. This amount of diversity among the trees in our forest makes for a model that is extremely resilient to noisy data, thus reducing the chance of overfitting.

DT: Maximum Depth

Reduce the depth of the tree to build a generalized tree. Set the depth of the tree to 3, 5, 10 depending on the performance metrics of the train and test/validation data.

DT: Maximum Leaf Nodes

Reduce the number of leaf nodes

DT: Minimum Leaf Sample Size

The minimum sample size in terminal nodes can be fixed at 30, 100, 300, or 5% of the total.

AUC

Some accuracy scores, such as 80% accuracy, might seem pretty darn good on the first try! What we have to keep in mind is that when predicting a binary classification, we are bound to be right sometimes, even just by random guessing. If you have a skewed dataset with rare events (such as a disease or winning the lottery) where there are only 2 positive cases in 1000, then even a trivial algorithm that classifies everything as 'not a member' will achieve an accuracy of 99.8% (998 out of 1000 times it was correct). So remember that an 80% accuracy must be considered in a larger context. AUC is an alternative comprehensive metric to the confusion matrices we previously examined, and ROC graphs allow us to determine the optimal precision-recall tradeoff specific to the problem we are looking to solve.

K-Nearest Neighbors is a...

Supervised learning algorithm that can be used for both classification and regression tasks. ______ is a distance-based classifier, meaning that it implicitly assumes that the smaller the distance between 2 points, the more similar they are. Since this is a supervised learning algorithm, we must also have the labels for each point in our dataset, or else we can't use this algorithm for prediction.

In ______, each column acts as a dimension. In a dataset with two columns, we can easily visualize this by treating values for one column as X coordinates and the other as Y coordinates. The three main distances that can be used in distance-based algorithms are: Manhattan Distance, Euclidean Distance, and Minkowski Distance.

If K grows too large, the model begins to underfit the data. It's important to try to find the best value for K by iterating over multiple values and comparing performance at each step. Optimization techniques/hyperparameters:
1. K - the number of neighbors used to vote on each observation
2. The distance metric/similarity function - Manhattan/Euclidean/Minkowski
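
A minimal sketch iterating over several values of K with scikit-learn's KNeighborsClassifier and the Minkowski metric; the iris dataset and candidate K values are assumptions:

# Minkowski distance with p=1 is Manhattan distance, p=2 is Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)
    print(k, cross_val_score(knn, X, y, cv=5).mean())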

Since Decision Trees are a Greedy Algorithm that always tries to maximize information gain, how do we ensure that we can create variance in results and prevent every tree in a Random Forest from generating the exact same outcome by using the same features to get maximum information gain?

The answer lies in two clever techniques that the algorithm uses to make sure that each tree focuses on different things--Bagging and the Subspace Sampling Method.

RF: Bagging The Data

The first way to encourage differences among the trees in our forest is to train them on different samples of data. Although more data is generally better, if we gave every tree the entire dataset, we would end up with each tree being exactly the same. Because of this, we instead use Bootstrap Aggregation, or Bagging, to sample a portion of our data with replacement. For each tree, we sample 2/3 of our training data with replacement--this is the data that will be used to build our tree. The remaining data is used as an internal test set to test each tree--this remaining third is referred to as Out-Of-Bag Data, or OOB. For each new tree created, the algorithm uses the remaining 1/3 of data that wasn't sampled to calculate the Out-Of-Bag Error, in order to get a running, unbiased estimate of overall tree performance for each tree in the forest.

Bootstrap Aggregation

The main concept that makes ensembling possible is Bagging. _________ is itself a combination of two ideas--Bootstrap Resampling and Aggregation. You're already familiar with ____________ Resampling from our section on the Central Limit Theorem--______________ping just refers to creating subsets of your dataset by sampling with replacement, much as we did to calculate our sample means when working with the Central Limit Theorem. Aggregation is exactly what it sounds like--the practice of combining all the different estimates to arrive at a single estimate--although the specifics of how we combine them are up to us. A common approach is to treat each classifier in the ensemble's prediction as a "vote" and let our overall prediction be the majority vote. It's also common to see ensembles that take the arithmetic mean of all predictions, or compute a weighted average.

Which Models Are Used in Ensembles?

The most common ones are tree-based ensemble methods, such as Random Forests and Gradient Boosted Trees. However, we can technically use any models in an _________! It's not uncommon to see Model Stacking, also called Meta-Ensembling. In this case, multiple different models are stacked, and their predictions are aggregated. In this case, the more different the models are, the better! The more different the models are, the more likely they may be to pick up on different things. It's not uncommon to see ensembles consisting of multiple Logistic Regressions, Naive Bayesian Classifiers, Tree-Based Models (including ensembles such as Random Forests), and even Deep Neural Networks!

SVD++ Equation

The prediction is:

r̂_ui = μ + b_u + b_i + q_i^T * ( p_u + |I_u|^(-1/2) * Σ_{j ∈ I_u} y_j )

where I_u is the set of items rated by user u, and the y_j terms are a new set of item factors that capture implicit ratings. Here, an implicit rating describes the fact that a user u rated an item j, regardless of the rating value. If user u is unknown, then the bias b_u and the factors p_u are assumed to be zero. The same applies for item i with b_i, q_i and y_i.

Just as for SVD, the parameters are learned using SGD on the regularized squared error objective. Baselines are initialized to 0. User and item factors are randomly initialized according to a normal distribution, which can be tuned using the init_mean and init_std_dev parameters. You have control over the learning rate γ and the regularization term λ; both can be different for each kind of parameter (see below). By default, learning rates are set to 0.007 and regularization terms are set to 0.02.

Parameters:
- n_factors - The number of factors. Default is 20.
- n_epochs - The number of iterations of the SGD procedure. Default is 20.
- init_mean - The mean of the normal distribution for factor vectors initialization. Default is 0.
- init_std_dev - The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
- lr_all - The learning rate for all parameters. Default is 0.007.
- reg_all - The regularization term for all parameters. Default is 0.02.
- lr_bu / lr_bi / lr_pu / lr_qi / lr_yj - The learning rate for b_u, b_i, p_u, q_i, y_j respectively. Each takes precedence over lr_all if set. Default is None.
- reg_bu / reg_bi / reg_pu / reg_qi / reg_yj - The regularization term for b_u, b_i, p_u, q_i, y_j respectively. Each takes precedence over reg_all if set. Default is None.
- random_state (int, RandomState instance from numpy, or None) - Determines the RNG used for initialization. If int, random_state is used as a seed for a new RNG (useful to get the same initialization over multiple calls to fit()). If a RandomState instance, that instance is used as the RNG. If None, the current RNG from numpy is used. Default is None.
- verbose - If True, prints the current epoch. Default is False.

Attributes (only exist if fit() has been called):
- pu: numpy array of size (n_users, n_factors) - The user factors.
- qi: numpy array of size (n_items, n_factors) - The item factors.
- yj: numpy array of size (n_items, n_factors) - The (implicit) item factors.
- bu: numpy array of size (n_users) - The user biases.
- bi: numpy array of size (n_items) - The item biases.
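
A minimal sketch, assuming the parameter listing above refers to the Surprise library's SVDpp implementation; the built-in dataset and train/test split are illustrative:

# Fit SVD++ on MovieLens 100k and report test RMSE.
from surprise import Dataset, SVDpp, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-100k")   # downloads the dataset on first use
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

algo = SVDpp(n_factors=20, n_epochs=20, lr_all=0.007, reg_all=0.02)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)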

Decision Trees

These are rule-based classifiers and belong to the first generation of modern AI. Despite the fact that this algorithm has been used in practice for decades, its simplicity and effectiveness for routine classification tasks is still on par with more sophisticated approaches. A decision tree comprises decisions that originate from a chosen point in sample space. In graph terms (recall the graph section), it is a directed acyclic graph with a root called the "root node" that has no incoming edges. All other nodes have one (and only one) incoming edge. Nodes with outgoing edges are known as internal nodes. All other nodes--those with an incoming edge but no outgoing edges--are called leaves, also known as terminal nodes.

What does the CRISP-DM methodology provide?

This methodology provides a structured approach to planning a data mining project.

Gradient Boosting and Weak Learners

This technique is at the heart of some very powerful, top-of-class ensemble methods currently used in machine learning, such as Adaboost and Gradient Boosted Trees (XGBoost).

Greedy Search

We need to determine the attribute that best classifies the training data and use this attribute at the root of the tree. At each node, we repeat this process, creating further splits until a leaf node is reached, i.e. all data gets classified. This means we are performing a top-down, ______________ through the space of possible decision trees. In order to identify the best attribute for ID3 classification trees, we use the "Information Gain" criterion. Information gain (IG) measures how much "information" a feature gives us about the class. Decision Trees always try to maximize Information Gain, so the attribute with the highest Information Gain will be tested/split first.

Multinomial Naive Bayes

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p_1, ..., p_n), where p_i is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector x = (x_1, ..., x_n) is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the bag-of-words assumption). The likelihood of observing a histogram x is given by

p(x | C_k) = [ (Σ_i x_i)! / Π_i x_i! ] * Π_i p_ki^(x_i)

The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:[2]

log p(C_k | x) ∝ log( p(C_k) * Π_{i=1..n} p_ki^(x_i) )
             = log p(C_k) + Σ_{i=1..n} x_i * log p_ki
             = b + w_k^T x

where b = log p(C_k) and w_ki = log p_ki.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a pseudocount, in all probability estimates such that no probability is ever set to exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.

Rennie et al. discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of tf-idf weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with support vector machines.[2]
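
A minimal sketch of multinomial naive Bayes for document classification with scikit-learn; the toy documents and labels are hypothetical, and alpha=1.0 corresponds to Laplace smoothing:

# Bag-of-words counts fed to MultinomialNB with a pseudocount of 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "meeting at noon", "win money free", "project meeting agenda"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = ham (hypothetical labels)

X = CountVectorizer().fit_transform(docs)  # word-count histograms
clf = MultinomialNB(alpha=1.0).fit(X, labels)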

Entropy and Decision Trees

______2______ aim to tidy the data by separating the samples and re-grouping them in the classes they belong to. We know the target variable since we are using a supervised approach with a training set, so we maximize the purity of the classes as much as possible while making the splits, aiming for clarity in the leaf nodes. Remember, it may not be possible to remove the uncertainty totally, i.e. to fully clean up the data. In the referenced image, the split has not FULLY classified the data, but the resulting data is tidier than it was before the split. Using a series of such splits on different feature variables, we try to clean up the data as much as possible in the leaf nodes. At each step, we want to decrease the entropy, so entropy is computed before and after the split. If it decreases, the split is retained and we can proceed to the next step; otherwise, we must try to split on another feature or stop this branch (calling it the best solution), reaching a terminal node.
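
A minimal sketch computing entropy before and after a hypothetical split, i.e. the information gain; the label arrays are made up for illustration:

# Information gain = entropy(parent) - weighted entropy of the children.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])  # children after a candidate split

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(info_gain)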

The Relationship Between Precision and Recall

__________ and __________ have an inverse relationship. As our _______ goes up, our _______ will go down, and vice versa. If this doesn't seem intuitive, let's examine this through the lens of our disease analogy.

Ensemble Methods

an algorithm that makes use of more than one model to make a prediction. Ensemble Methods are typically more effective than single models for Supervised Learning tasks; most Kaggle competitions are won using Ensemble Methods. Ensemble Methods work off of the idea of the "Wisdom of the Crowd". This phrase refers to the phenomenon that the average of all predictions typically outperforms any single prediction by a statistically significant margin--often quite a large one. Think back to what you've learned about sampling, inferential statistics, and the Central Limit Theorem. The same magic is at work here.

Variance is...

an error from sensitivity to small fluctuations in the training set. High _________ can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

Batch GD

calculates the error for each example in the training set; only after all errors have been calculated does the model get updated (one training epoch). Pro: computationally efficient, produces stable errors, and converges. Con: can get stuck at a local minimum and requires the entire training set, which eats up memory.

Cross-validation

can help improve overall model robustness by decreasing the overfitting/noise caused by the randomness of data selection in the train-test split method.

Polynomials and Interactive Terms

can improve performance by increasing model complexity and exploring hidden relationships between features

A decision tree is a DAG type of classifier where...

each branch node represents a choice between a number of alternatives and each leaf node represents a classification. An unknown (or test) instance is routed down the tree according to the values of the attributes in the successive nodes. When the instance reaches a leaf, it is classified according to the label assigned to the corresponding leaf.

Central limit theorem

establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. (Wikipedia)

Stochastic GD

evaluates the gradient and updates the weights for each training example. Pro: can be faster than Batch GD, and the frequent updates give detailed insight into the rate of improvement. Con: computationally expensive and results in noisy gradients (error rates jumping around).

R-squared

how much of the variance (observed values minus the mean) in the data can be explained by our model. When the MSE of our predictions is 0, R^2 = 1, since the cumulative difference between our model's predicted values and the actual values is zero--meaning our model's equation (intercept, weights, etc.) fully captures the data. R^2 is the percent of variance explained by the model; the remaining percentage represents the variance explained by the error term E, that which the model and predictors fail to grasp.

Gradient Descent

is a popular optimization strategy used when training a model. It uses a convex cost function and iteratively adjusts the model's weights to find the minimum of the function.

What is the parameter for decision trees that we normally tune first?

max_depth. This parameter indicates how deep we want our tree to be. If the tree is too deep, it means we are creating a large number of splits in the parameter space and capturing more information about the underlying data. This may result in overfitting: by learning overly granular information from the given data, we make it difficult for our model to generalize to unseen data. This results in a low training error but a large testing error.
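
A minimal sketch comparing training and cross-validated scores across several max_depth values; the synthetic dataset and candidate depths are assumptions:

# A large gap between train and validation scores at high depths suggests overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for depth in (3, 5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    val_score = cross_val_score(tree, X, y, cv=5).mean()
    train_score = tree.fit(X, y).score(X, y)
    print(depth, round(train_score, 3), round(val_score, 3))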

Linear regression models are used to show or predict...

show or predict the relationship between two variables or factors. The factor that is being predicted (the factor the equation solves for) is called the dependent variable. The factors that are used to predict the value of the dependent variable are called the independent variables (features). EX: you can use linear regression to predict a continuous variable (salary) taking into account variables that may explain it (education, experience, occupation).

F1-Score represents....

the Harmonic Mean of Precision and Recall. In short, this means that the __________ cannot be high without both precision and recall also being high. When a model's F1-Score is high, you know that your model is doing well all around.

AUC

the area under the ROC curve, summarizing the tradeoff between the false positive rate and the true positive rate

Directed Acyclic Graphs

this is the basic idea behind decision trees: every internal node checks a condition and performs a decision, and every terminal/leaf node represents a discrete class. Decision tree induction is closely related to rule induction. In essence, a decision tree is just a series of IF-ELSE statements (rules). Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by combining the decisions along the path to form the antecedent part and taking the leaf's class prediction as the class value.

Tree Pruning

trimming decision trees to optimize the learning process

You can visualize gradient descent with...

visualize it with a graph of iterations vs. cost function values; look at where the cost function converges.

Underfitting happens when...

when a model can neither model the training data nor generalize to new data. Sometimes underfitting can cause bias toward certain features with higher weights.

