Exam 1 review


What is AdaBoost?

AdaBoost, short for Adaptive Boosting, is a machine learning algorithm that builds an ensemble of weak classifiers to make predictions. It is a meta-algorithm that combines several weak classifiers to create a strong classifier. The basic idea behind AdaBoost is to sequentially train a series of weak classifiers on the data and combine their predictions to make a final prediction. In each iteration, the algorithm assigns weights to the training examples based on their classification error. The examples that are misclassified in the previous iteration are given higher weights, so that the subsequent weak classifiers can focus on those examples.
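
A minimal sketch of this idea, assuming scikit-learn and a synthetic dataset (all settings here are illustrative, not from the course):

```python
# Minimal AdaBoost sketch using scikit-learn (synthetic data; settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the weak learner is a depth-1 decision tree ("stump").
# Misclassified examples receive larger weights, so later stumps focus on them.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```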

What is bagging?

Bagging (short for bootstrap aggregating) is a machine learning ensemble method that involves combining the predictions of multiple models to improve the overall prediction accuracy and reduce overfitting. In bagging, the models are trained on random subsets of the training data with replacement, creating a diverse set of models that can capture different aspects of the data. The basic idea behind bagging is to generate multiple models using random subsets of the training data, and then combine their predictions by taking a weighted average or majority vote. This approach helps to reduce the variance of the model, as the different models are trained on different subsets of the data and have different sources of randomness.
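
A minimal bagging sketch, assuming scikit-learn and synthetic data (the number of estimators is illustrative):

```python
# Minimal bagging sketch with scikit-learn (synthetic data; settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base model (a decision tree by default) is trained on a bootstrap sample
# drawn with replacement; predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```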

What is boosting for regression trees?

Boosting for regression trees is a machine learning algorithm that combines multiple weak regression trees to make more accurate predictions for regression problems. It follows the same boosting idea as AdaBoost but is adapted to regression: the algorithm trains a series of weak regression trees on the data, where each tree is trained to reduce the residual errors left by the trees before it. In other words, each tree is trained to predict the difference between the target value and the current ensemble's prediction.
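
A hand-rolled sketch of this residual-fitting idea, assuming scikit-learn decision trees and synthetic data (the depth, learning rate, and number of trees are illustrative):

```python
# Sketch of boosting with regression trees: each shallow tree is fit to the
# residuals of the current ensemble (synthetic data; settings are illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_trees = 100
prediction = np.zeros_like(y)      # start from a prediction of 0
trees = []

for _ in range(n_trees):
    residual = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                      # fit the next tree to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```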

What is bootstrapping?

Bootstrapping is a statistical technique for estimating the variability of an estimator or model parameter by resampling the original dataset with replacement to create multiple new datasets. The term "bootstrap" refers to the idea of pulling oneself up by one's bootstraps, as the method involves creating new datasets from the original data by randomly sampling from it with replacement. The basic idea behind bootstrapping is to simulate the process of collecting new data from the population by repeatedly resampling from the original dataset. This allows us to estimate the sampling distribution of the estimator or model parameter, which provides information about the variability and uncertainty of the estimate.
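
A minimal sketch of bootstrapping the standard error of a sample mean, assuming NumPy and synthetic data (the number of resamples is illustrative):

```python
# Bootstrap estimate of the standard error of the sample mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # the "original" sample

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample the original data with replacement, same size as the original.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

print("sample mean:", data.mean())
print("bootstrap standard error of the mean:", boot_means.std(ddof=1))
```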

How do we center and scale our data?

Centering and scaling are important preprocessing steps in many machine learning algorithms, particularly those that rely on distance measures or require features to be on the same scale. Centering involves subtracting the mean of each feature from its values, while scaling involves dividing the centered feature values by their standard deviation.
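
A minimal sketch, assuming NumPy and scikit-learn, showing that centering and scaling by hand matches StandardScaler on synthetic data:

```python
# Centering and scaling by hand vs. scikit-learn's StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.5], size=(200, 2))

# By hand: subtract each column's mean, then divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# With scikit-learn (fit on training data, reuse the same transform on new data).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(np.allclose(X_manual, X_scaled))   # True: both give mean 0, std 1 per column
```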

What are the disadvantages of Newton's method?

Computational Complexity: Newton's method requires computing the Hessian matrix, which can be computationally expensive for large-scale problems. Inverting the Hessian can also be challenging, especially if it is ill-conditioned or singular.
Sensitivity to Initial Guess: Newton's method can be sensitive to the choice of initial guess, especially if the Hessian is not positive definite. If the initial guess is far from the true minimum, the algorithm may converge to a local minimum or diverge.
Lack of Robustness: Newton's method is not robust to noise or outliers in the data. It assumes that the objective function is smooth and has a unique minimum, which may not be the case for some problems.
Memory Requirements: The Hessian matrix can be large and require a significant amount of memory, which can be a limiting factor for large-scale problems.
Limited Applicability: Newton's method is not suitable for non-differentiable or non-continuous objective functions, and it may not be directly applicable to optimization problems with constraints.
Need for Second Derivative Information: Newton's method requires second derivative information, which may not always be available or may be difficult to compute accurately.

Describe coordinate descent.

Coordinate descent is an optimization algorithm used to solve unconstrained or constrained optimization problems, particularly in machine learning and data analysis. In this algorithm, the objective function is minimized by successively updating the value of one variable or coordinate at a time while holding the values of the other variables fixed.
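
A minimal sketch of cyclic coordinate descent for ordinary least squares, assuming NumPy and synthetic data (the closed-form per-coordinate update is specific to the squared-error objective):

```python
# Cyclic coordinate descent for minimizing ||y - Xw||^2, one coefficient at a time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.0, 3.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(5)
for sweep in range(100):            # several full passes over the coordinates
    for j in range(5):
        residual = y - X @ w
        # Closed-form update for w_j with all other coordinates held fixed.
        w[j] += X[:, j] @ residual / (X[:, j] @ X[:, j])

print("estimated coefficients:", np.round(w, 3))
```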

What is elastic net?

Elastic Net is a regularization technique for linear regression that combines the L1 and L2 penalty terms of Lasso and Ridge regression, respectively. It is used to overcome some of the limitations of these individual methods and provide a more flexible and robust approach to variable selection in high-dimensional datasets. The Elastic Net method adds a penalty term to the standard linear regression objective function that is a combination of both the L1 and L2 penalties: Elastic Net Penalty = λ1 * L1 penalty + λ2 * L2 penalty
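
A minimal sketch, assuming scikit-learn's ElasticNet and synthetic data; note that scikit-learn parameterizes the combined penalty through alpha (overall strength) and l1_ratio (the mix between the L1 and L2 terms) rather than separate λ1 and λ2:

```python
# Elastic Net sketch with scikit-learn (synthetic data; settings are illustrative).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # half L1, half L2
enet.fit(X, y)
print("coefficients:", np.round(enet.coef_, 3))  # some shrink toward (or to) zero
```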

No matter how my parameters are initialized, so long as the learning rate is small enough, we can safely expect gradient descent to converge to the same solution. True or False? Explain why

False. Even with a small learning rate, the initialization of the parameters can have a significant impact on the final solution obtained by gradient descent. Gradient descent is an iterative optimization algorithm that updates the parameters in the direction of the negative gradient of the loss function, so the updates depend on the initial values of the parameters, the learning rate, and the gradients. Convergence to the same solution regardless of initialization is only guaranteed when the loss function is convex with a unique minimum; for non-convex losses (for example, neural networks), different initializations can converge to different local minima.

What are the advantages of Newton's method?

Fast Convergence: Newton's method can converge much faster than first-order methods because it uses second-order (curvature) information; near the minimum its convergence is quadratic.
Strong Convergence Guarantees: for twice-differentiable convex objectives with a unique minimum, Newton's method converges reliably from a reasonable starting point.
Accurate Minimization: Newton's method can locate the minimum of the objective function with high accuracy, because the second-order derivative information gives a better local model of the objective than first-order methods like gradient descent.
Curvature Awareness: by adapting to the local curvature of the objective function, Newton's method effectively chooses its own step sizes, which helps on ill-conditioned problems.
Can Handle Constraints: Newton's method can be extended to optimization problems with constraints, using methods like Lagrange multipliers or penalty methods.
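
A minimal one-dimensional sketch of the Newton update x ← x − f′(x)/f″(x), using an illustrative function chosen so the second derivative stays positive along the iterates:

```python
# Newton's method in one dimension for f(x) = x**4 - 3*x**3 + 2 (illustrative function).
def f_prime(x):
    return 4 * x**3 - 9 * x**2          # first derivative

def f_double_prime(x):
    return 12 * x**2 - 18 * x           # second derivative (the 1-D "Hessian")

x = 3.0                                  # initial guess
for step in range(10):
    x = x - f_prime(x) / f_double_prime(x)   # Newton update: x <- x - f'(x)/f''(x)
    print(f"step {step + 1}: x = {x:.6f}")
# The iterates converge rapidly to the minimizer x = 2.25.
```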

What are the advantages of using a large learning rate?

Faster convergence: a large learning rate may allow the optimization algorithm to converge faster, especially when the objective function is relatively simple or has a clear and steep gradient.
Jumping out of local minima: a large learning rate may allow the optimization algorithm to jump out of a local minimum and find a better solution in a different area of the search space.
Exploring more of the search space: a large learning rate may help the optimization algorithm explore more of the search space and find a wider range of solutions.
Cheaper experimentation: because a large learning rate can reduce the number of iterations needed, it can make experimentation and hyperparameter tuning less expensive.

What is Gradient Boosting?

Gradient Boosting is a machine learning algorithm that builds an ensemble of weak decision trees to make predictions. It is an iterative algorithm that trains decision trees sequentially, each time trying to correct the errors made by the previous trees. The basic idea behind Gradient Boosting is to combine multiple weak models to create a strong model. A weak model is a model that performs only slightly better than random guessing; in Gradient Boosting it is usually a shallow decision tree (a depth-1 tree is known as a decision stump).

In a support vector machine, what are the dashed lines called?

In a support vector machine (SVM), the dashed lines are the margin boundaries: the lines (or hyperplanes) that run parallel to the solid decision boundary at a distance of one margin width on either side. The support vectors lie on these dashed lines, and the solid line between them is the separating hyperplane (the decision boundary) that separates the data points into different classes.

What are Kernels?

In machine learning, a kernel is a function that takes two inputs and returns a similarity score or distance measure between them. Kernels are commonly used in machine learning algorithms such as support vector machines (SVMs) to transform input data into a higher-dimensional space where it is easier to separate different classes or groups of data points. The basic idea behind using kernels is to find a way to map input data into a higher-dimensional space where it is easier to find a linear or non-linear boundary that separates the different classes of data points. The kernel function is used to calculate the similarity or distance between pairs of data points in this higher-dimensional space, allowing the machine learning algorithm to identify the optimal boundary or decision surface.
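
A minimal sketch of one common kernel, the RBF (Gaussian) kernel, computed by hand and checked against scikit-learn's rbf_kernel (the points and gamma value are illustrative):

```python
# RBF kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2), a similarity score between two points.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.0, 0.0]])
gamma = 0.5

manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
library = rbf_kernel(x1, x2, gamma=gamma)[0, 0]

print(manual, library)   # the two values agree
```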

What is the difference between a global and local minimum?

In optimization, a global minimum is the smallest value of a function over its entire domain, whereas a local minimum is the smallest value of the function in a specific region of its domain.

What is the null hypothesis in simple linear regression?

In simple linear regression, the null hypothesis is that there is no linear relationship between the predictor variable (X) and the response variable (Y). Specifically, the null hypothesis states that the population slope (beta) is equal to zero, or mathematically: H0: beta = 0. This means that there is no effect of X on Y, and any observed association between the two variables in the sample data is due to chance. The alternative hypothesis (Ha) is that there is a linear relationship between X and Y, or more specifically that beta is not equal to zero: Ha: beta ≠ 0.

How can we modify the Maximum Margin Classifier for data that isn't linearly separable?

To modify the MMC for non-linearly separable data, a kernel function can be used to map the input data into a higher-dimensional feature space where the data may become linearly separable. The decision boundary in this higher-dimensional feature space is then represented as a linear combination of the transformed data points, allowing the MMC to be applied to a wider range of data.

When would I use a diminishing learning rate?

Large datasets: when dealing with large datasets, decreasing the learning rate over time can help the optimization algorithm converge more slowly but more accurately.
High-dimensional data: high-dimensional datasets may have complex and irregular loss surfaces, which can cause optimization algorithms to converge slowly or get stuck in local minima. Decreasing the learning rate can help the algorithm explore more of the surface early on and then settle into a good solution.
Stochastic Gradient Descent: in stochastic gradient descent, the parameters are updated after each mini-batch using noisy gradient estimates. Decreasing the learning rate over time helps the iterates settle down and improves convergence, especially when dealing with noisy or uncertain data.
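
A minimal sketch of stochastic gradient descent with a diminishing learning rate on a synthetic least-squares problem; the decay schedule and constants are illustrative:

```python
# SGD with a learning rate that shrinks over time (synthetic least-squares data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
eta0 = 0.1
for t in range(1, 5001):
    i = rng.integers(len(y))                  # pick one example at random
    grad = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (x_i·w - y_i)^2
    eta = eta0 / (1 + 0.01 * t)               # learning rate shrinks over time
    w -= eta * grad

print("estimated w:", np.round(w, 3))
```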

What is Leave One Out Cross Validation (LOOCV)?

Leave One Out Cross Validation (LOOCV) is a technique for evaluating the performance of a machine learning algorithm by training the model on all but one of the available samples and testing the model on the remaining sample. This process is repeated for each sample in the dataset, so that each sample is used once as the test set and the remaining samples are used for training. The results from each iteration are then averaged to produce an estimate of the model's generalization performance. LOOCV is a special case of k-fold cross validation, where the number of folds is equal to the number of samples in the dataset. Unlike k-fold cross validation, which partitions the data into k subsets and trains the model on k-1 subsets while testing on the remaining subset, LOOCV trains the model on all but one sample and tests on that one sample.
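
A minimal sketch, assuming scikit-learn, that runs LOOCV for a linear regression on synthetic data (the scoring choice is illustrative):

```python
# Leave-one-out cross validation with scikit-learn (synthetic data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

# Each of the 50 folds trains on 49 samples and tests on the single held-out sample.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOOCV estimate of MSE:", -scores.mean())
```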

What is the type of problem we can solve with logistic regression?

Logistic regression is a type of supervised learning algorithm that is used for binary classification problems, where the goal is to predict a binary output variable based on one or more input features. In particular, logistic regression is used to model the probability of the output variable belonging to one of two classes, typically represented as 0 and 1.

What is the equation for the odds?

Odds of event occurring = P(event occurring) / P(event not occurring). If P is the probability of the event occurring, then the probability of it not occurring is 1 - P, so the odds can be written as: Odds of event occurring = P / (1 - P)

How do we interpret the odds?

For example, if the probability of winning is 0.25: Odds of winning = probability of winning / probability of not winning = 0.25 / 0.75 = 1:3. Odds of 1:3 mean we expect one win for every three losses; odds below 1 (or 1:1) indicate the event is less likely to happen than not, and odds above 1 indicate it is more likely to happen than not.

What do the Lagrangian multipliers mean for Maximum Margin Classifiers?

The Lagrange multipliers are a key component of the optimization problem for maximum margin classifiers: each training point receives a multiplier, and only the points with nonzero multipliers (the support vectors) determine the position and orientation of the optimal hyperplane in the feature space. By maximizing the margin between the hyperplane and the closest data points, maximum margin classifiers find a decision boundary that generalizes well to new data and can achieve high accuracy on classification tasks.

What is random forest?

Random Forest is a machine learning algorithm that combines multiple decision trees to create a more accurate and robust model for classification and regression problems. In random forest, each tree is built independently using a subset of the training data and a subset of the features, making it less prone to overfitting compared to a single decision tree. The basic idea behind random forest is to build an ensemble of decision trees, where each tree is trained on a different random subset of the data and a random subset of the features. This randomness ensures that each tree is different and captures different aspects of the data.
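
A minimal sketch using scikit-learn's RandomForestClassifier on synthetic data (the number of trees and the max_features setting are illustrative):

```python
# Random forest sketch with scikit-learn (synthetic data; settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the rows and a random subset of the
# features at each split (max_features), which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```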

What is the difference between reducible and irreducible error?

Reducible error is the error that can be reduced by improving the model, for example through data cleaning, feature selection, feature engineering, or algorithmic improvements. It comprises bias (error from the model's simplifying assumptions) and variance (error from sensitivity to the particular training sample), and it can be minimized by improving the model. Irreducible error is the noise inherent in the data itself, such as measurement error or the effect of unmeasured variables, and it cannot be eliminated by any model.

What do p-values tell us when testing our simple linear regression model?

Specifically, if the p-value is small (typically less than a pre-specified significance level, such as 0.05), we reject the null hypothesis and conclude that there is evidence of a linear relationship between X and Y. On the other hand, if the p-value is large, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that X and Y are linearly related.

What is the biggest advantage of lasso regularization?

The biggest advantage of Lasso regularization is that it can perform feature selection by setting some of the regression coefficients to zero. This is achieved by adding an L1 penalty term to the objective function of the linear regression problem, which encourages sparsity in the regression coefficients.

In a support vector machine, what are the dots on the dashed lines called?

The dots on the dashed line are called support vectors, because they are the data points that are closest to the decision boundary and provide the support for the hyperplane.

How can the output of a logistic regression model be used to make predictions?

The output of a logistic regression model is a probability value between 0 and 1, which represents the predicted probability that a given input belongs to a particular class. To make binary classification predictions, a decision threshold is applied to the predicted probability value. If the predicted probability is greater than the threshold, the input is classified as belonging to the positive class, and if it is less than or equal to the threshold, it is classified as belonging to the negative class.
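
A minimal sketch, assuming scikit-learn and synthetic data, showing how predicted probabilities are thresholded into class labels (0.5 is the usual default threshold, but it can be tuned):

```python
# Turning logistic regression probabilities into class labels with a threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)[:, 1]      # P(y = 1 | x) for each sample
threshold = 0.5
y_pred = (proba > threshold).astype(int)  # 1 if above the threshold, else 0

print(np.mean(y_pred == model.predict(X)))  # should match predict() at the 0.5 threshold
```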

What is the regularization term for Lasso?

The regularization term for Lasso (short for "Least Absolute Shrinkage and Selection Operator") is the L1 penalty term, which is proportional to the L1 norm of the regression coefficients: ||w||_1 = |w_1| + |w_2| + ... + |w_p|. The L1 penalty term is added to the standard linear regression objective function to shrink the magnitude of the regression coefficients towards zero and to perform feature selection by setting some of the coefficients to exactly zero.

What is the regularization term for Ridge Regression?

The regularization term for Ridge Regression is the L2 penalty term, which is added to the standard linear regression objective function to shrink the magnitude of the regression coefficients towards zero. The L2 penalty term is proportional to the square of the L2 norm of the regression coefficients, which is given by:
||w||_2^2 = w_1^2 + w_2^2 + ... + w_p^2
where w is the vector of regression coefficients, p is the number of features, and ||.||_2 denotes the L2 (Euclidean) norm.
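
A minimal sketch contrasting the L1 and L2 penalties, assuming scikit-learn and synthetic data (the alpha values are illustrative); with an L1 penalty several coefficients typically end up exactly zero, while the L2 penalty only shrinks them:

```python
# Lasso (L1) vs Ridge (L2) on the same synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 3))  # several typically exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # small but nonzero
```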

If I want to minimize 𝐹(𝑥;𝑤) using gradient descent, how do I compute the update step?

To compute the update step for gradient descent when minimizing a function 𝐹(𝑥;𝑤), we need to compute the gradient of the function with respect to the parameters 𝑤. The gradient of 𝐹 with respect to 𝑤 is denoted by ∇𝑤𝐹(𝑥;𝑤) and is a vector that indicates the direction of steepest ascent of the function at a particular point in the parameter space. The update step for gradient descent is then computed as follows: 𝑤 ← 𝑤 − 𝛼 ∇𝑤𝐹(𝑥;𝑤), where 𝛼 is the learning rate (step size).
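
A minimal sketch of this update applied to a least-squares objective, assuming NumPy and synthetic data (the learning rate and iteration count are illustrative):

```python
# Gradient descent update w <- w - alpha * grad for F = mean((X·w - y)^2).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
alpha = 0.1                                 # learning rate
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of F with respect to w
    w = w - alpha * grad                    # the update step
print("w after gradient descent:", np.round(w, 3))
```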

What do we use to measure the fit of a regression model?

To measure the fit of a regression model, we typically use a metric that quantifies the difference between the predicted values of the model and the actual values of the target variable. Common metrics include:
Mean Squared Error (MSE): the average of the squared differences between the predicted and actual values. It gives more weight to larger errors, which can be important in some applications.
Root Mean Squared Error (RMSE): the square root of the MSE, often preferred as a more interpretable metric because it is in the same units as the target variable.
Mean Absolute Error (MAE): the average of the absolute differences between the predicted and actual values. It is less sensitive to outliers than MSE and RMSE.
R-squared (R2): a measure of how well the model fits the data compared to a baseline model that always predicts the mean of the target variable. It is usually between 0 and 1 (and can be negative for models that fit worse than the baseline), with higher values indicating a better fit.
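
A minimal sketch computing these metrics with scikit-learn on made-up values:

```python
# Common regression fit metrics (the y values here are made up for illustration).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                         # same units as the target variable
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")
```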

What are the predict and true values for TP, FP, TN, FN?

True Positive (TP): the model predicts positive, and the true label is also positive.
False Positive (FP): the model predicts positive, but the true label is negative.
True Negative (TN): the model predicts negative, and the true label is also negative.
False Negative (FN): the model predicts negative, but the true label is positive.

What do the Lagrangian multipliers mean for Support Vector Machines?

The Lagrange multipliers are a key component of the SVM optimization problem, and they are used to maximize the margin between the decision boundary and the closest data points. By enforcing the constraint that the closest data points (the support vectors) lie on the margin boundaries, SVMs find the optimal decision boundary that separates the data into different classes with the maximum possible margin.

What do λ1 and λ2 control in the Elastic Net penalty?

λ1 and λ2 are hyperparameters that control the strength of the L1 and L2 penalties, respectively. The L1 penalty promotes sparsity by setting some of the regression coefficients to zero, while the L2 penalty promotes shrinkage of the regression coefficients towards zero.

