2. Machine Learning

Clearly Explained: Mutual Information

Like R², Mutual Information is a numeric value that gives us a sense of how closely related two variables are.

Optimization: Stochastic Gradient Descent

**Important: "One last thing before we're done. In our Gradient Descent example, we only had three data points, so the math didn't take very long... but when you have millions of data points, it can take a long time. So there is a thing called Stochastic Gradient Descent that uses a randomly selected subset of the data at every step rather than the full dataset. This reduces the time spent calculating the derivatives of the Loss Function. That's all. Stochastic Gradient Descent sounds fancy, but it's no big deal."
In this StatQuest video, the concept of Gradient Descent is explained in an intuitive and visual manner. Gradient Descent is a method used for finding the minimum of a function by iteratively adjusting the parameters of the function in the direction of the negative gradient.
The video begins by explaining the basic idea of Gradient Descent using a 2D plot of a function and showing how moving in the opposite direction of the gradient can lead to the minimum point. It then goes on to demonstrate how Gradient Descent can be used to optimize machine learning models.
The video explains the three main variants of Gradient Descent: Batch, Stochastic, and Mini-Batch Gradient Descent. It also explains the concept of learning rate, which controls the step size of each iteration, and the importance of choosing an appropriate learning rate. Finally, the video illustrates how to implement Gradient Descent in Python using NumPy and shows how to use it to minimize a simple quadratic function.
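The idea in the quote above can be sketched in a few lines of plain Python. This is only an illustration, not the video's code: the function name `sgd_fit_line`, the starting values, the learning rate, and the tiny dataset are all assumptions made for the example. Each step estimates the derivatives of the squared-residual Loss Function from a randomly selected subset (`batch_size` points) instead of the full dataset.

```python
import random

def sgd_fit_line(xs, ys, learning_rate=0.01, steps=5000, batch_size=1, seed=0):
    """Fit y = intercept + slope * x with Stochastic Gradient Descent.

    Every step uses only `batch_size` randomly chosen points to estimate
    the derivatives of the sum of squared residuals.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    intercept, slope = 0.0, 1.0        # arbitrary starting guesses
    indices = list(range(len(xs)))
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)
        # Derivatives of the squared residuals, averaged over the batch only.
        d_intercept = sum(-2 * (ys[i] - (intercept + slope * xs[i])) for i in batch) / batch_size
        d_slope = sum(-2 * xs[i] * (ys[i] - (intercept + slope * xs[i])) for i in batch) / batch_size
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

# Hypothetical data that lies exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(sgd_fit_line(xs, ys))
```

Setting `batch_size` to `len(xs)` turns this into ordinary (Batch) Gradient Descent; values in between correspond to Mini-Batch Gradient Descent.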

Regularization Part 1: Ridge (L2) Regression

Bam 1) Least Squares (AKA linear regression) works by finding the line that minimizes the sum of the squared residuals, which works well with many data points. But what if you only have 2 data points? Least Squares then fits the Training Data perfectly, so the fit has high Variance and is overfit
Bam 2) The main idea behind Ridge Regression is to find a New Line that doesn't fit the Training Data as well. In other words, we introduce a small amount of Bias into how the New Line is fit to the data. But in return for that small amount of Bias, we get a significant drop in Variance. In other words, by starting with a slightly worse fit, Ridge Regression can provide better long term predictions
Bam 3) Internal: When Least Squares determines values for the parameters in this equation...
Size = y-axis intercept + slope x weight
... it minimizes the sum of the squared residuals. In contrast, when Ridge Regression determines values for the parameters in the same equation, it minimizes
the sum of the squared residuals + lambda x the slope^2
slope^2 - This part of the equation adds a penalty to the traditional Least Squares method
lambda - determines how severe that penalty is
*** Also remember it's a type of regularization. Essentially, ridge regression is just least squares + (lambda x the slope^2)
Bam 4) Lambda can be any value from 0 to positive infinity and is determined using Cross Validation
Bam 5) Best Scenario: Works best when most of the features are useful, because Ridge Regression will shrink the parameters but will not remove any of them
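A small sketch of the quantity described in Bam 3, assuming one feature (weight) and one slope. The function name `ridge_loss` and the two training points are made up for illustration. With only two points, the Least Squares line has zero residuals, yet a slightly flatter line gets a lower Ridge score once lambda is above 0.

```python
def ridge_loss(intercept, slope, weights, sizes, lam):
    """Sum of the squared residuals + lambda x the slope^2."""
    residuals = [s - (intercept + slope * w) for w, s in zip(weights, sizes)]
    return sum(r * r for r in residuals) + lam * slope ** 2

# Two hypothetical training points: (weight, size) = (1, 1) and (2, 3).
weights, sizes = [1.0, 2.0], [1.0, 3.0]

# The Least Squares line (intercept -1, slope 2) passes through both points.
print(ridge_loss(-1.0, 2.0, weights, sizes, lam=0.0))    # no penalty
print(ridge_loss(-1.0, 2.0, weights, sizes, lam=1.0))    # penalty on the steep slope
# A flatter line fits the Training Data worse but pays a smaller penalty.
print(ridge_loss(-0.25, 1.5, weights, sizes, lam=1.0))
```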

StatQuest: Principal Component Analysis (PCA), Step-by-Step

Bam 1: Motivation: Let's say we want to know the correlation between features. What if we want to plot 4 or more features (I chose 4 because 3 features can still be drawn as a 3-D graph)? Do we draw tons and tons of 2-feature plots and try to make sense of them all? Or draw some crazy graph that has an axis for each cell and makes our brain explode? No, both of those options are just plain silly. Instead we draw a Principal Component Analysis (PCA) plot
Bam 2: Convert: PCA converts the correlations (or lack thereof / Covariance) among all the features into a 2-D graph
Bam 3: Feature Cluster: Features that are highly correlated cluster together
Bam 4: Rank: The axes are ranked in order of importance. Differences along the first principal component axis (PC1) (x-axis) are more important than differences along the second principal component axis (PC2) (y-axis)

Optimization: Gradient Descent

Big Man Steps & Baby Steps - Hunter
Step size - Gradient Descent determines the step size by multiplying the slope (the derivative of the Loss Function at the current parameter value) by a small number called the learning rate
Learning Rate - the small number that scales the slope into a step size
**Important: "We now know how Gradient Descent optimizes two parameters, the Slope and Intercept. If we had more parameters, then we'd just take more derivatives and everything else stays the same. Regardless of which Loss Function you use, Gradient Descent works the same way."
In this StatQuest video, the concept of Gradient Descent is explained in a step-by-step manner. Gradient Descent is a method used for finding the minimum of a function by iteratively adjusting the parameters of the function in the direction of the negative gradient.
The video begins by explaining the basic idea of Gradient Descent and its importance in machine learning. It then goes on to demonstrate how to calculate the gradient of a function, step size, and the direction to move in to minimize the function.
The video explains the three main variants of Gradient Descent: Batch, Stochastic, and Mini-Batch Gradient Descent. It also explains the concept of learning rate, which controls the step size of each iteration, and the importance of choosing an appropriate learning rate. Finally, the video illustrates how to implement Gradient Descent in Python and demonstrates how to use it to minimize a simple quadratic function.
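A minimal sketch of the step-size rule for two parameters (Intercept and Slope), assuming the Loss Function is the sum of squared residuals. The function name, starting guesses, stopping rule, and the three data points are illustrative assumptions, not taken from the card.

```python
def gradient_descent(xs, ys, learning_rate=0.01, max_steps=10000, min_step=1e-9):
    """Fit y = intercept + slope * x by minimizing the sum of squared residuals."""
    intercept, slope = 0.0, 1.0   # arbitrary starting guesses
    for _ in range(max_steps):
        # One derivative per parameter; more parameters would just add more lines here.
        d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
        d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
        # Step size = slope of the Loss Function x learning rate.
        step_intercept = learning_rate * d_intercept
        step_slope = learning_rate * d_slope
        intercept -= step_intercept
        slope -= step_slope
        # Big steps far from the minimum, baby steps close to it: stop when steps get tiny.
        if max(abs(step_intercept), abs(step_slope)) < min_step:
            break
    return intercept, slope

xs = [0.5, 2.3, 2.9]   # hypothetical weights
ys = [1.4, 1.9, 3.2]   # hypothetical heights
print(gradient_descent(xs, ys))
```

If the learning rate is set too high, the steps overshoot the minimum and the values can diverge, which is why the card stresses choosing it carefully.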

StatQuest: Clustering with DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points into clusters based on their density. The algorithm works by identifying core points, which have a minimum number of neighboring points within a specified radius, and then expanding clusters around those core points by including neighboring points that also have enough nearby points within the radius.
The video explains the basic steps of the DBSCAN algorithm using a simple example with 12 data points. The presenter shows how the algorithm identifies core points and expands clusters based on the specified radius and minimum number of neighboring points. The video also explains how DBSCAN can handle noisy data points that do not belong to any cluster, and how it can automatically determine the number of clusters based on the density of the data.
The video concludes by discussing some of the strengths and weaknesses of DBSCAN. One strength is that it can handle data with arbitrary shapes and sizes, and can automatically determine the number of clusters. However, one weakness is that it requires specifying the radius and minimum number of neighboring points, which can be difficult and subjective. Additionally, DBSCAN may not work well with datasets whose clusters have different densities.
Best Use Case: Use when the clusters are nested, including in high-dimensional data where the clusters cannot be seen in a plot
Side Note: Any remaining Non-Core Points that are not close enough to Core Points in either cluster are not added to clusters and are called outliers
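The core-point and cluster-expansion steps can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name, the brute-force neighbor search, and the choice not to count a point as its own neighbor are assumptions (libraries may count neighbors differently).

```python
import math

def dbscan(points, radius, min_neighbors):
    """Return one cluster label per point; -1 marks outliers."""
    def neighbors(i):
        return [j for j in range(len(points))
                if j != i and math.dist(points[i], points[j]) <= radius]

    # Core points have at least `min_neighbors` other points within `radius`.
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_neighbors}
    labels = [-1] * len(points)
    cluster = 0
    for start in sorted(core):
        if labels[start] != -1:
            continue                      # already absorbed into a cluster
        labels[start] = cluster
        frontier = [start]
        while frontier:
            current = frontier.pop()
            for j in neighbors(current):
                if labels[j] == -1:
                    labels[j] = cluster
                    if j in core:
                        frontier.append(j)  # only core points extend the cluster
        cluster += 1
    return labels

# Two hypothetical dense squares plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, radius=1.5, min_neighbors=2))
```

Non-core points can join a cluster when a core point reaches them, but they never pull in further points; anything left with label -1 is an outlier.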

Regularization Part 3: Elastic Net Regression

Elastic Net Regression is a combination of Ridge and Lasso regression that uses both the L1 and L2 penalty terms to balance their strengths and weaknesses. This approach is particularly useful when dealing with high-dimensional data sets that have many features and where the data is noisy.
The video explains how the elastic net regularization term combines the L1 and L2 norms of the coefficient vector. The L1 term promotes sparsity and feature selection, while the L2 term promotes smoothness and encourages coefficients to be small. The relative weight of the two terms is controlled by a hyperparameter called alpha.
The video demonstrates how to use cross-validation to select the best value of alpha for a given model. The optimal value of alpha depends on the specific data set and the tradeoff between bias and variance in the model. The video also highlights the importance of standardizing the features before applying the elastic net regularization, to ensure that all features are on the same scale.
Overall, Elastic Net Regression is a powerful tool for dealing with high-dimensional data sets and can be used to balance the strengths of Ridge and Lasso regression.
Bam 1: So we know to use Ridge Regression when the features matter and Lasso Regression when a lot of features are not important, but what do we do when the dataset has MILLIONS of features and we don't have the time to figure out which features are important and unimportant? Well, that's where Elastic Net Regression comes in
Bam 2: Just like Lasso and Ridge Regression, Elastic-Net Regression starts with Least Squares, then it combines the Lasso Regression Penalty with the Ridge Regression Penalty. Altogether, Elastic Net Regression combines the strengths of Lasso and Ridge Regression
Bam 3: BOTH: The Lasso Regression Penalty and the Ridge Regression Penalty get their own Lambdas
Bam 4: Cross-Validation: We use Cross Validation on different combinations of Lambda 1 and Lambda 2 to find the best values
Bam 5: Shrinking or Removing: By combining Lasso and Ridge Regression, Elastic-Net Regression groups and shrinks the parameters associated with the correlated features and either leaves them in the equation or removes them all at once
Bam 6: Best Case Scenario: The hybrid Elastic-Net Regression is especially good at dealing with situations when there are correlations between parameters. This is because, on its own, Lasso Regression tends to pick just one of the correlated features and eliminate the others
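The "each penalty gets its own lambda" idea can be written out as a single score. This is a sketch under stated assumptions: the function name and the example numbers are invented, and libraries often use a different parameterization (one overall strength plus a mixing ratio) rather than two separate lambdas.

```python
def elastic_net_loss(residuals, coefficients, lambda_1, lambda_2):
    """Sum of squared residuals + Lasso penalty + Ridge penalty."""
    ssr = sum(r * r for r in residuals)
    lasso_penalty = lambda_1 * sum(abs(c) for c in coefficients)   # L1 term
    ridge_penalty = lambda_2 * sum(c * c for c in coefficients)    # L2 term
    return ssr + lasso_penalty + ridge_penalty

residuals = [1.0, -2.0]        # hypothetical residuals
coefficients = [3.0, -4.0]     # hypothetical slopes (the intercept is not penalized)

print(elastic_net_loss(residuals, coefficients, lambda_1=0.0, lambda_2=0.0))  # plain Least Squares
print(elastic_net_loss(residuals, coefficients, lambda_1=1.0, lambda_2=0.0))  # Lasso only
print(elastic_net_loss(residuals, coefficients, lambda_1=0.0, lambda_2=1.0))  # Ridge only
print(elastic_net_loss(residuals, coefficients, lambda_1=1.0, lambda_2=1.0))  # Elastic Net
```

Setting one lambda to 0 recovers Lasso or Ridge Regression, and setting both to 0 recovers Least Squares.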

Optimization: The Chain Rule

Height, Weight, and Shoe Size - Hunter (Also think like dimensional analysis in chemistry) In this StatQuest video, the concept of the Chain Rule in calculus is explained. The Chain Rule allows us to compute the derivative of a composite function, which is a function that is composed of two or more functions. The Chain Rule states that the derivative of a composite function is equal to the product of the derivative of the outer function and the derivative of the inner function. In other words, the derivative of a composite function is found by first taking the derivative of the outer function with respect to its input, and then multiplying that result by the derivative of the inner function with respect to its input. The video provides several examples of how to use the Chain Rule to compute the derivative of composite functions, including functions that involve trigonometric functions, exponential functions, and logarithmic functions. Finally, the video explains how the Chain Rule can be extended to compute the gradient of a composite function with respect to multiple variables.
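As a quick check of the rule described above, the sketch below compares the Chain Rule result with a numerical estimate. The particular functions (sin as the outer function, x squared as the inner function) and the helper names are chosen only for illustration.

```python
import math

def inner(x):
    return x ** 2          # inner function g(x)

def d_inner(x):
    return 2 * x           # g'(x)

def outer(u):
    return math.sin(u)     # outer function f(u)

def d_outer(u):
    return math.cos(u)     # f'(u)

def chain_rule_derivative(x):
    """d/dx f(g(x)) = f'(g(x)) * g'(x)."""
    return d_outer(inner(x)) * d_inner(x)

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate, used here only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
print(chain_rule_derivative(x))
print(numeric_derivative(lambda v: outer(inner(v)), x))
```

The two printed values should agree to several decimal places, much like units cancelling in dimensional analysis: (change in f per change in g) x (change in g per change in x).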

StatQuest: Hierarchical Clustering

Hierarchical clustering is a method for clustering data points into groups based on their similarity. The method starts by considering each data point as its own cluster and then iteratively merges the two most similar clusters until all points are in a single cluster. The video explains two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each data point as its own cluster and then iteratively merges the two most similar clusters until all points are in a single cluster. Divisive clustering starts with all points in a single cluster and then iteratively splits the cluster into smaller and smaller clusters based on their dissimilarity. The video demonstrates how hierarchical clustering works using a simple example with four data points. The presenter shows how the algorithm calculates the distance between the clusters at each iteration and decides which two clusters to merge. The video also explains how the results of hierarchical clustering can be visualized using a dendrogram, which is a tree-like diagram that shows the order in which the clusters were merged. The height of the branches in the dendrogram represents the distance between the clusters at each step. The video concludes by discussing some of the strengths and weaknesses of hierarchical clustering. One strength is that it can reveal hierarchical relationships between clusters at different levels of granularity. However, one weakness is that it can be computationally expensive, particularly for large datasets.

StatQuest: K-means clustering

K-means clustering is a method for grouping data points into clusters based on their similarity. The algorithm works by randomly selecting k initial cluster centroids, assigning each data point to the nearest centroid, and then updating the centroids based on the mean of the points assigned to each cluster. The algorithm iterates between these two steps until the centroids no longer change.
The video explains the basic steps of the K-means clustering algorithm using a simple example with four data points. The presenter shows how the algorithm randomly selects two initial centroids, assigns each point to the nearest centroid, and updates the centroids based on the mean of the points assigned to each cluster.
The video also explains the within-cluster sum of squares (WSS), which measures how tightly the points are grouped within each cluster. For a given k, the algorithm tries to find the assignment of points to clusters that minimizes the WSS. The value of k itself is not learned by the algorithm; it is usually chosen by comparing the WSS across several values of k (for example with an elbow plot).
The video concludes by discussing some of the strengths and weaknesses of K-means clustering. One strength is that it is computationally efficient and can handle large datasets. However, one weakness is that it requires the user to specify the number of clusters k, which can be subjective and difficult to determine. Additionally, K-means clustering may not work well with data that has complex or irregular shapes.
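The assign-then-update loop can be sketched as below. To keep the example repeatable, the initial centroids are passed in explicitly rather than picked at random; that choice, the function names, and the four data points are assumptions made for illustration.

```python
import math

def k_means(points, centroids, max_iters=100):
    """Alternate between assigning points and updating centroids until nothing changes."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of its points (empty clusters stay put).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares for a finished clustering."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids, wss(centroids, clusters))
```

Because the result depends on the starting centroids, a common practice is to run the algorithm from several random starts and keep the clustering with the lowest WSS.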

StatQuest: K-nearest neighbors

K-nearest neighbors (KNN) is a non-parametric algorithm used for classification and regression tasks. The basic idea is to classify or predict the value of a new data point by looking at the k nearest neighbors in the training set and taking a majority vote (for classification) or an average (for regression). The video explains the steps of the KNN algorithm using a simple example with two classes of data points (red and blue). The presenter shows how to classify a new data point by finding the k nearest neighbors in the training set, and how the choice of k can affect the classification decision. The video also discusses some of the strengths and weaknesses of KNN. One strength is that it can work well with datasets that have complex decision boundaries or non-linear relationships. Additionally, KNN is easy to understand and implement. However, one weakness is that it can be sensitive to the choice of k and the distance metric used to measure similarity between data points. Additionally, KNN can be computationally expensive for large datasets. The video concludes by showing some examples of KNN in action, including a classification task with iris flowers and a regression task with house prices. Important: If two categories are the same distance away from the deciding data point, then a vote will be held
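A minimal sketch of KNN classification by majority vote. The function name and the red/blue training points are invented for the example, and the tie-breaking behavior is an assumption of this sketch: when two categories get the same number of votes, `Counter.most_common` returns the one seen first, which here is the category of the closer neighbor.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k):
    """training is a list of (point, label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]
print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (6, 5), k=3))
```

Choosing an odd k is a simple way to avoid tied votes when there are only two categories.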

Regularization Part 2: Lasso (L1) Regression

Lasso regression is another type of regularization. It adds a penalty proportional to the sum of the absolute values of the coefficients to the linear regression loss, instead of the sum of squared coefficients used in ridge regression. The penalty is based on the L1 norm of the coefficients, which encourages sparsity in the solution by setting some of the coefficients to exactly zero.
The video provides a helpful visual analogy to understand the effect of L1 regularization. Imagine you have a budget of $100 to spend on groceries, and you want to buy a variety of items without exceeding your budget. You might choose to buy a few expensive items and several cheaper items, rather than buying lots of mid-priced items. L1 regularization works similarly by encouraging the model to focus on a few important features and ignore the rest.
The video explains how Lasso regression can be used to perform feature selection, by setting the coefficients of unimportant features to zero. This can improve the interpretability of the model by identifying the most important features for predicting the target variable.
The video also highlights the difference between ridge and Lasso regularization. Ridge regression shrinks all the coefficients towards zero but never sets them exactly to zero, while Lasso regression can set some coefficients exactly to zero. As a result, Lasso regression can be more effective for models with a large number of features, where many of the features may be unimportant.
Bam 1: Just like with Ridge Regression, Lambda can be any value from 0 to positive infinity and is determined using Cross Validation
Bam 2: Internal: Exact same equation with one difference:
the sum of the squared residuals + lambda x |the slope|
Instead of squaring the slope, Lasso Regression takes the absolute value of the slope
*** Remember this is a type of Regularization method
Bam 3: Difference: The big difference between Ridge and Lasso Regression is that Ridge Regression can only shrink the slope asymptotically close to 0 (it comes infinitely close but never touches), while Lasso Regression can shrink the slope all the way to 0
Bam 4: Best Case Scenario: Works best if there are lots of features that are not important, because Lasso Regression will shrink the important features but only remove the features that are not important
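The difference in Bam 3 can be seen in closed form for the simplest possible case. The sketch below assumes a single feature and a line through the origin (no intercept), with the loss written as the sum of squared residuals plus lambda times the penalty; under those assumptions the best slope has a short formula for each method. The function names and data are illustrative only.

```python
import math

def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - slope*x)^2) + lam * slope^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)          # shrinks as lam grows, but never reaches 0

def lasso_slope(xs, ys, lam):
    """Slope minimizing sum((y - slope*x)^2) + lam * |slope|."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)   # soft-thresholding: hits exactly 0
    return math.copysign(shrunk, sxy) / sxx

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
for lam in (0.0, 10.0, 28.0, 1000.0):
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

For this data the Lasso slope becomes exactly 0 once lambda reaches 28, while the Ridge slope keeps shrinking but stays positive for every lambda.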

StatQuest: Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised classification method that separates two or more classes of objects or events by finding a linear combination of features that maximizes the separation between the classes. To perform LDA, we start by standardizing the data and computing the mean and covariance matrix for each class. We then use these values to calculate the within-class and between-class scatter matrices. The within-class scatter matrix measures the scatter within each class, while the between-class scatter matrix measures the scatter between the classes. The goal of LDA is to find a linear combination of features that maximizes the ratio of the between-class scatter to the within-class scatter. This linear combination is called the discriminant function and is used to project the data onto a lower-dimensional space. The number of discriminant functions is equal to the number of classes minus one. The discriminant function can be used to classify new data points by projecting them onto the same lower-dimensional space and assigning them to the class with the closest mean value. LDA is similar to Principal Component Analysis (PCA), but while PCA is an unsupervised method that maximizes variance, LDA is a supervised method that maximizes class separation. LDA can be used for binary classification problems as well as multi-class problems. It assumes that the classes have the same covariance matrix, but can be extended to handle cases where this assumption is not valid.

Machine Learning Fundamentals: Sensitivity and Specificity

Look at "Chart" deck for flash card about sensitivity vs specificity

StatQuest: One-Hot, Label, Target and K-Fold Target Encoding (All Categorical Variables)

One-Hot Encoding Analogy: One-Hot encoding is a technique in data science where we represent categorical variables as binary vectors, where each category is represented by a separate binary variable. Here's a simple analogy to help you understand One-Hot encoding: Imagine you are a teacher and you want to assign grades to your students based on their performance in different subjects, such as math, science, and history. You decide to use One-Hot encoding to represent the grades as binary vectors. For example, if a student gets an "A" in math, a "B" in science, and a "C" in history, their grades would be represented as the binary vector [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], where the first four positions represent the math grades (A, B, C, D), the next four positions represent the science grades, and the last four positions represent the history grades.
Similarly, in data science, we can use One-Hot encoding to represent categorical variables as binary vectors. For example, if we have a dataset of car colors, we can represent each color as a binary vector where each position represents a different color, and the position corresponding to the color of the car is set to 1. This allows us to perform numerical calculations on the data and use it in machine learning models.
----------------------
Label Encoding Analogy: Label encoding is a technique in data science where we assign numerical labels to categorical variables, such as colors or types of fruit. Here's a simple analogy to help you understand label encoding: Imagine you are organizing a fruit basket and you have apples, bananas, and oranges. You want to label each piece of fruit with a number so that you can easily count how many of each type of fruit you have. You decide to label apples as "1", bananas as "2", and oranges as "3". Now, you can easily count how many pieces of each fruit you have by looking at the labels.
Similarly, in data science, we can use label encoding to assign numerical labels to categorical variables. For example, if we have a dataset of car colors, we can assign labels like "1" for red, "2" for blue, and "3" for green. This allows us to perform numerical calculations on the data and use it in machine learning models.
----------------------------
Target Encoding Analogy: Target encoding is a technique in data science where we replace categorical variables with the mean (or another aggregation function) of the target variable for each category. Here's a simple analogy to help you understand target encoding: Imagine you are a coach of a basketball team and you want to select the best players for a game. You have a dataset of player statistics, including their height and the number of points they scored in the last game. You want to know whether height is a good predictor of scoring ability. You decide to use target encoding to represent height as the mean number of points scored for each height category. For example, if you group the players by height and find that players who are 6 feet tall scored 15 points on average, players who are 6 feet 2 inches tall scored 20 points on average, and players who are 6 feet 4 inches tall scored 25 points on average, you can use target encoding to represent each height category as its corresponding mean number of points scored.
Similarly, in data science, we can use target encoding to represent categorical variables as the mean (or another aggregation function) of the target variable for each category. For example, if we have a dataset of car manufacturers and their average fuel efficiency, we can use target encoding to represent each manufacturer as its corresponding mean fuel efficiency. This allows us to perform numerical calculations on the data and use it in machine learning models.
----------------------------
K-Fold Target Encoding Analogy: K-Fold Target encoding is a technique in data science where we still use the mean (or another aggregation function) of the target variable for each category, but we split the data into K folds and encode the rows in each fold using means calculated from the other K-1 folds only. That way, a row's own target value never contributes to its own encoding. Here's a simple analogy to help you understand K-Fold Target encoding: Imagine you are a chef and you want to make a cake with chocolate chips. You have a dataset of chocolate chips, including their brand and price, and you want to know which brand of chocolate chips is the best value for your money. You decide to use K-Fold Target encoding to represent the brand of chocolate chips as the mean price for each brand. For example, if you split the data into 5 folds, the rows in fold 1 are encoded with the mean price per brand calculated from folds 2 to 5, the rows in fold 2 are encoded with the means from folds 1, 3, 4 and 5, and so on.
Similarly, in data science, we can use K-Fold Target encoding to represent categorical variables as the mean (or another aggregation function) of the target variable for each category, calculated without the rows being encoded. This helps to reduce overfitting, because the encoding of a row no longer leaks that row's own target value into the features.
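The four encodings can be compared side by side on a tiny made-up dataset. Everything here is an illustrative assumption: the function names, the 0-based label codes (the card uses 1-based labels), the round-robin fold assignment, and the fallback to the overall mean for categories missing from the other folds. The K-fold variant shown encodes each held-out fold using category means computed from the remaining folds, which is the usual way the technique avoids leaking a row's own target into its encoding.

```python
from collections import defaultdict

def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values):
    codes = {c: i for i, c in enumerate(sorted(set(values)))}   # 0-based codes
    return [codes[v] for v in values]

def target_means(values, targets):
    totals, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        totals[v] += t
        counts[v] += 1
    return {v: totals[v] / counts[v] for v in totals}

def target_encode(values, targets):
    means = target_means(values, targets)
    return [means[v] for v in values]

def k_fold_target_encode(values, targets, k):
    overall = sum(targets) / len(targets)   # fallback for categories unseen in the other folds
    encoded = [None] * len(values)
    for fold in range(k):
        held_out = [i for i in range(len(values)) if i % k == fold]
        rest = [i for i in range(len(values)) if i % k != fold]
        means = target_means([values[i] for i in rest], [targets[i] for i in rest])
        for i in held_out:
            encoded[i] = means.get(values[i], overall)
    return encoded

colors = ["red", "blue", "red", "green", "blue", "red"]   # hypothetical car colors
sold = [1, 0, 1, 0, 1, 0]                                  # hypothetical target
print(one_hot(colors)[0])
print(label_encode(colors))
print(target_encode(colors, sold))
print(k_fold_target_encode(colors, sold, k=2))
```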

Clearly Explained: ROC and AUC

ROC (Receiver Operator Characteristic) Graph - Summarizes all of the confusion matrices that the different thresholds produce. It's a great alternative to making a lot of confusion matrices and manually comparing them to see which threshold is best
AUC (Area Under the Curve) - Makes it easy to compare one ROC curve to another: the curve with more area under it is better (e.g. if the red ROC curve represents Logistic Regression and the blue ROC curve represents Random Forest, you can now compare them)
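A rough sketch of how the points on a ROC graph and the AUC could be computed by hand. The function names, the predicted scores, and the labels are hypothetical; each threshold corresponds to one confusion matrix, which is reduced to a (False Positive Rate, True Positive Rate) point.

```python
def roc_points(scores, labels):
    """One (false positive rate, true positive rate) point per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]                       # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # hypothetical predicted probabilities
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive, 0 = negative
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Running the same two functions on the scores from a second model gives a second AUC, and the model with the larger value has the better curve overall.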

Ridge vs Lasso Regression, Visualized!!

Ridge and Lasso regression are two different types of regularization techniques used in linear regression models to prevent overfitting. Ridge regression adds a penalty proportional to the sum of squared coefficients, while Lasso regression adds a penalty proportional to the sum of absolute coefficients.
The video uses a visual analogy to explain the difference between the two types of regularization. Imagine you are trying to fit a line to a set of data points, and you have a choice between a wiggly line and a straight line. The wiggly line can pass through all the points, but it may overfit and perform poorly on new data. The straight line is simpler, but it may underfit and not capture all the important features of the data. Ridge and Lasso regression add a penalty to the wiggly line to make it less attractive to the model.
The video shows how the penalty term affects the coefficients of the linear regression equation in the case of ridge and Lasso regression. Ridge regression shrinks all the coefficients towards zero without setting any of them exactly to zero, while Lasso regression can set some coefficients exactly to zero, leading to sparse solutions.
The video also highlights the difference in the shape of the penalty regions for ridge and Lasso regression. The penalty region for Lasso regression is diamond-shaped, while the penalty region for ridge regression is circular. This leads to different effects on the coefficients and can affect the interpretation of the model.
Overall, both ridge and Lasso regression are useful tools for preventing overfitting in linear regression models, and the choice between the two depends on the specific needs of the model and the tradeoff between accuracy and interpretability.

Machine Learning Fundamentals: Bias and Variance

Straight Line vs Squiggly Line

Clearly Explained: Entropy (for data science)

Surprise = Log of the Inverse of the Probability, log(1 / probability) (think Heads or Tails)
Entropy = the Expected Value of Surprise
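The two definitions above translate directly into code. The use of log base 2 (so that a fair coin has an Entropy of 1) and the function names are assumptions made for this sketch.

```python
import math

def surprise(probability):
    """Surprise = log of the inverse of the probability (base 2 assumed)."""
    return math.log2(1 / probability)

def entropy(probabilities):
    """Entropy = expected Surprise = sum of probability x surprise."""
    return sum(p * surprise(p) for p in probabilities if p > 0)

print(surprise(0.5))          # one fair coin flip
print(entropy([0.5, 0.5]))    # fair coin: Heads or Tails equally likely
print(entropy([0.9, 0.1]))    # biased coin: less surprising on average
```

Outcomes with probability 0 are skipped because they never happen and so contribute nothing to the expected Surprise.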

Machine Learning Fundamentals: The Confusion Matrix

Test - Confusion Matrix (PA) (TF) (Left - Right) (Green Diagonal)

Machine Learning Fundamentals: Cross Validation

Train - Cross Validation Test - Confusion Matrix

StatQuest: t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a machine learning algorithm that is commonly used for visualizing high-dimensional data in a low-dimensional space. It is particularly useful for visualizing complex datasets with many variables. The basic idea behind t-SNE is to create a probability distribution that measures the similarity between pairs of objects in the high-dimensional space, and then create a similar probability distribution in the low-dimensional space. The algorithm then tries to minimize the difference between the two probability distributions, resulting in a visualization that emphasizes the relationships between the objects in the high-dimensional space. One of the key features of t-SNE is that it places more emphasis on preserving the local structure of the data than on preserving the global structure. This means that objects that are close to each other in the high-dimensional space are more likely to be close to each other in the low-dimensional space, even if they are far apart from other objects in the high-dimensional space. In the video, the presenter provides an example of using t-SNE to visualize the expression of genes in different types of cells. They show how t-SNE can reveal clusters of cells with similar gene expression patterns, which can be used to identify different cell types. The video also explains some of the parameters that can be adjusted when using t-SNE, such as the perplexity parameter, which controls the balance between preserving local and global structure, and the learning rate, which controls the speed at which the algorithm converges. The presenter provides some tips for choosing appropriate parameter values based on the size and complexity of the dataset.

