ML Final

F1-Score

Harmonic mean of precision and recall. Used in evaluating classification models.
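A minimal sketch of how precision, recall, and F1 relate, assuming binary classification and confusion-matrix counts as input. The function name `precision_recall_f1` and the example counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

Because the harmonic mean is pulled toward the smaller value, F1 stays low unless both precision and recall are reasonably high.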

Hill Climbing

Heuristic local search: evaluate the neighbors of the current point and find the one with the largest function value; if that neighbor is higher than the current point, move to it, otherwise stop.
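The procedure can be sketched as follows. This is an illustrative version, assuming a maximization problem and a caller-supplied `neighbors` function; the names `hill_climb`, `objective`, and `integer_neighbors` are hypothetical.

```python
def hill_climb(f, start, neighbors):
    """Move to the best neighbor until no neighbor improves on the current point."""
    current = start
    while True:
        best = max(neighbors(current), key=f)
        if f(best) <= f(current):
            return current  # no higher neighbor: a (possibly local) optimum
        current = best


def objective(x):
    # Toy function with a single peak at x = 3
    return -(x - 3) ** 2


def integer_neighbors(x):
    return [x - 1, x + 1]


peak = hill_climb(objective, start=0, neighbors=integer_neighbors)
```

On a function with several peaks, the same loop stops at whichever peak is uphill from `start`, which is why the result depends on the initial solution.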

What are characteristic features of AdaBoost?

-They adjust the weights of misclassified instances so that subsequent models focus on them - They combine weak learners sequentially to form a strong learner

Why might pruning be applied to a decision tree?

-To simplify the model and improve interpretability - To reduce overfitting -To remove branches that provide little to no predictive power

Learning rate

-determines the step size at each iteration while moving toward a minimum of a loss function. -a low learning rate results in more iterations, and vice versa -a learning rate that is too high can overshoot the minimum, so training may oscillate or diverge instead of converging.

NN activation functions

1. Sigmoid 2. ReLU 3. Softmax 4. Hyperbolic tangent (tanh)

Characteristics of Hill Climbing

- local search - converges to local optima - does not remember past - susceptible to CoD - not robust to noise and can be impacted by initial solution - biased towards dominant class in imbalanced data set - picks best fit based on higher neighbor

VC dimension

What is the size of the largest set of inputs that the hypothesis class can label in all possible ways (shatter)? - Vapnik-Chervonenkis

features

data used as input to models

How does the degree of the polynomial relate to the VC dimension?

-Higher-degree polynomials have greater capacity to fit complex patterns, which is reflected in a higher VC dimension. -increasing the degree of a polynomial increases its flexibility, thus increasing the VC dimension

When deciding on a split for a continuous variable in decision trees:

-DT determines threshold value to best separate instances into groups that are as homogeneous as possible. - Determining that threshold often involves sorting the data by the values of the continuous var and finding the threshold that maximizes the chosen purity measure (like gini or info gain) - The split aims to increase the homogeneity of child nodes
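The threshold search described above can be sketched like this. It is a simplified illustration, assuming information gain (entropy) as the purity measure and candidate thresholds at midpoints between consecutive sorted values; `entropy` and `best_threshold` are hypothetical helper names, and the data is made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split on one continuous feature."""
    pairs = sorted(zip(values, labels))  # sort by feature value
    n = len(pairs)
    parent = entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = parent - children
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


threshold, gain = best_threshold(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    ["no", "no", "no", "yes", "yes", "yes"],
)
```

Here the split between 3.0 and 10.0 produces two pure child nodes, so it has the highest gain.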

When using KNN as an instance-based learning algo, which of the following are important considerations? 1. choice of k 2. learning rate 3. architecture of underlying neural network 4. the way distances between instances are calculated 5. handling of ties when multiple classes have the same vote count 6. depth of the decision tree used

- Choice of k - the way distances between instances are calculated -handling of ties when multiple classes have the same vote count

What are challenges of instance-based learning?

- Curse of dimensionality - Sensitivity to noisy data - Storage requirements due to retaining all training instances

What measures are commonly used to determine the best split in a decision tree?

- Gini impurity - Information Gain - Chi-squared

Sample complexity in PAC learning

- To achieve higher confidence or lower error, more samples are typically needed. - It refers to the number of training examples required to ensure that a learned hypothesis meets specific accuracy and confidence requirements - PAC learning gives a theoretical estimate for how much data is needed to learn a good hypothesis
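As a rough illustration, one standard textbook bound says that m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice. This assumes a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data; other settings need different bounds. The function name `pac_sample_bound` is hypothetical.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Sufficient sample size for a consistent learner over a finite hypothesis space.

    epsilon: maximum tolerated error; delta: allowed probability of failure.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Hypothetical setting: 1000 hypotheses, error at most 0.1, confidence 95%
m = pac_sample_bound(hypothesis_count=1000, epsilon=0.1, delta=0.05)
m_stricter = pac_sample_bound(hypothesis_count=1000, epsilon=0.05, delta=0.05)
```

Halving epsilon roughly doubles the bound, while the dependence on delta and on the size of H is only logarithmic.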

In the context of classification, which of the following are reasons to use ensemble methods like random forests or gradient boosting machines?

- combining predictions from multiple models - reduction of model variance because EMs average out errors across models, decreasing variance. - Combining predictions from multiple models can lead to better ability to model complex relationships that a single model might miss.

Optimization Approaches

- generate and test: input x value and solve until peak is reached (requires small input space, helpful for complex functions) - Calculus: requires function has a derivative solvable to 0 - Newton's method: have derivative and have time to iteratively improve (requires function has derivative, able to keep iteratively improving)

Essential components of the PAC learning framework

- hypothesis space from which hypotheses are drawn - A confidence parameter representing the prob that a hypothesis will perform worse than the error measure -An error measure representing the probability that a hypothesis will misclassify a randomly drawn instance -A sample complexity determining the number of examples required to achieve a certain error and confidence level

Implications of a higher VC dimension?

- indicates a model with higher complexity that can shatter more points, which can lead to overfitting. - can be indicative of the model's ability to fit a wider range of functions - A higher VC dimension means the model can represent more complex relationships.

Which of the following algorithms can be used for classification tasks

-KNN - Naive bayes - DTs - SVMs - NNs

Pearson correlation coefficient

It gives you the measure of the strength of association between two variables
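A small sketch of the computation, assuming two equal-length numeric lists with non-zero variance. `pearson_r` is a hypothetical name. The coefficient measures linear association and ranges from -1 to 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing line
r_down = pearson_r([1, 2, 3], [3, 2, 1])        # perfectly decreasing line
```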

Softmax

transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes.
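A minimal sketch of softmax in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the example logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # stabilizes exp() for large inputs
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The largest raw output always gets the largest probability, and every probability is strictly positive.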

Is a very large HS better than a small one in general?

Not in general. A smaller, well-chosen HS can often lead to better generalization: it is easier to search for the best hypothesis, and its hypotheses have less flexibility, so they are less likely to fit noise and fluctuations in the data and instead focus on the broader patterns.

T/F: Random Forests require distance metrics for making predictions

False: RFs use mode or mean of the predictions instead of distance metrics.

T/F: Evidence is equivalent to the likelihood for the most probable hypothesis

False: The evidence takes into account the likelihoods of all hypotheses, not just the most probable one.

Area under the receiver operating characteristics curve (AUC-ROC)

Area under the curve of true positive rate vs. false positive rate at various threshold settings. Used in evaluating classification models.
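One way to compute this area without drawing the curve uses a known equivalence: AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as half. The sketch below assumes labels are 1 (positive) and 0 (negative) and that both classes are present; `auc_roc` is a hypothetical name.

```python
def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


perfect = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc_roc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise loop is quadratic, which is fine for a flashcard-sized example but not for large data sets.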

T/F: VC dimension is the sole factor in determining GE?

False: VC is one factor, but not the sole one.

T/F: AdaBoost involves random feature selection for each learner

False: all features are considered for each learner.

AdaBoost

Boosting algorithm that trains a series of weak learners on data and focuses later learners on data misclassified by earlier learners
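A sketch of one reweighting round, assuming the standard binary (discrete) AdaBoost update, instance weights that sum to 1, and a weak learner whose weighted error is strictly between 0 and 1. `adaboost_round` is a hypothetical name.

```python
import math


def adaboost_round(weights, correct):
    """Return (learner_weight, new_instance_weights) after one boosting round.

    weights: current instance weights (sum to 1)
    correct: booleans, True where the weak learner classified the instance correctly
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)  # more accurate learners get a larger say
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted instances; the weak learner gets the last one wrong
alpha, new_weights = adaboost_round([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After the update, the misclassified instance carries half of the total weight, so the next learner is pushed to get it right.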

T/F: Soft margin SVM doesn't make use of the C parameter (regularization strength)

False: soft margin SVM uses a regularization param (C) to control the trade-off between maximizing the margin and minimizing classification errors

Kernel function

Determines the decision boundaries in feature space.

T/F: Gradient boosting relies heavily on data normalization before training

False

T/F: Gradient boosting uses a completely random subset of data to construct each tree?

False: each tree is not built on a completely random subset of data; it is fit specifically to the errors (residuals) left by the previous trees. Stochastic variants may subsample rows, but the targets still come from the current ensemble's errors.

T/F: The number of parameters is always equal to the number of neurons

False: Each neuron can have multiple params associated with it, including weights for each connection and bias term.
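To make the count concrete, here is a sketch for a fully connected feed-forward network, assuming every layer is dense and every non-input neuron has one bias. `count_parameters` and the layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Total weights and biases in a dense network with the given layer sizes."""
    return sum(
        n_in * n_out + n_out  # one weight per connection, one bias per neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 4 inputs, one hidden layer of 8 neurons, 3 outputs
total_params = count_parameters([4, 8, 3])
neurons = 8 + 3
```

This small network has 11 non-input neurons but 67 parameters (40 in the hidden layer, 27 in the output layer).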

T/F: KNN starts to discard older data as we add more training data

False: KNN always uses all of the data

Kernel methods

Map a non-linearly separable input space to another space which hopefully is linearly separable - The space is usually higher-dimensional - The key element is that they do not actually map features to this space; instead, the kernel returns the inner product (a similarity measure) between elements in this space

R-Squared

Measure of proportion of variance for a dependent var that's explained by an independent var/vars in a regression model
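A minimal sketch of the usual formula, R^2 = 1 - SS_res / SS_tot, assuming the true values are not all identical. `r_squared` is a hypothetical name.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1.0 - ss_res / ss_tot


perfect_fit = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

A model that always predicts the mean scores 0, and a model worse than that can score below 0.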

Accuracy

Measure proportion of true results among total # of cases examined. Used in evaluating classification models.

Are standard SVMs suited for multi-class classification?

No, standard SVMs are binary classifiers. SVMs can be EXTENDED for multi-class classification

Neural Networks are...

Non-linear and multi-layer

Neural Network

Once an input layer (input data) is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output (prediction) compared to other inputs. All inputs are then multiplied by their respective weights and then summed.

Gradient Descent

Optimization algorithm for finding the input to a function that produces the optimal value; iterative
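A sketch of the iteration on a one-dimensional toy problem, assuming the gradient is available as a function. It also shows the role of the learning rate: a small step converges, while a step that is too large overshoots and diverges. The names and values are illustrative.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly step against the gradient to move toward a minimum."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


def grad(x):
    # Derivative of the toy loss (x - 3)^2, which is minimized at x = 3
    return 2 * (x - 3)


x_small_step = gradient_descent(grad, x0=0.0, learning_rate=0.1, steps=100)
x_large_step = gradient_descent(grad, x0=0.0, learning_rate=1.1, steps=100)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with 1.1 it grows by a factor of 1.2 per step, so the iterate moves away from 3.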

PAC learning framework

Probably approximately correct (PAC) learning theory helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier. -relationship between the desired error rate and the number of samples for training a learnable classifier

Recall

Proportion of actual positives that are correctly identified. Used in evaluating classification models.

Precision

Proportion of positive identifications that are actually correct. Used in evaluating classification models.

How does the performance of KNN change as training data increases?

Query time (prediction time) generally increases. The model better generalizes to new data

Bootstrapped samples

Random sampling with replacement. - RFs use bagging (bootstrap aggregating)
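A short sketch of drawing one bootstrap sample with the standard library, assuming the sample has the same size as the original data. `bootstrap_sample` is a hypothetical name, and the seed is fixed only to make the run repeatable.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)
# Items never drawn are "out of bag" for this sample
out_of_bag = [x for x in data if x not in sample]
```

Because sampling is with replacement, some items usually appear more than once and others not at all; bagging trains one model per such sample.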

T/F: The VC dimension of a polynomial classifier is equal to its degree.

False: The VC dimension depends on the number of parameters in the model, which is not always directly equal to the degree of the polynomial.

ReLU

The function returns 0 if it receives any negative input, but for any positive value x, it returns that value back
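In code the definition is a one-liner (a scalar sketch; libraries apply it element-wise to arrays):

```python
def relu(x):
    """Rectified linear unit: 0 for negative input, the input itself otherwise."""
    return max(0.0, x)


outputs = [relu(x) for x in [-2.5, 0.0, 3.0]]
```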

Backpropagation

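The practice of fine-tuning the weights of a neural net based on the error rate (i.e. loss) obtained in the previous epoch (i.e. iteration). The error is propagated backward from the output layer to compute how much each weight contributed to it.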

T/F: # of params is influenced by the size of all layers, not just input layer.

True

T/F: Deep networks with many layers generally have more parameters, as each layer adds additional weights and biases to be learned.

True

T/F: More parameters often result in a more complex model; this increased complexity can result in learning finer details in the training data and lead to overfitting if not controlled.

True

T/F: Regularization methods introduce penalties for complexity, encouraging simpler models that generalize better.

True

T/F: VC dimensions are a measure of model complexity, not computational complexity

True

T/F: A higher VC dimension indicates a more flexible hypothesis class capable of representing more complex functions

True- Increasing flexibility increases the VC dimension

T/F: A lower VC dimension implies a more constrained hypothesis class that might not fit the data as closely.

True: A hypothesis class with a lower VC dimension is less flexible and might not be able to capture complex patterns in the data

T/F: Evidence serves as a normalization factor in Bayes' theorem

True: The evidence normalizes the probability distribution, ensuring the posterior probs sum to 1.

T/F: For optimization, Newton's method is prone to getting stuck at local maximum

True: it converges to a nearby stationary point, so it works poorly on functions with many local maxima and is best suited to functions with a single optimum.

T/F: Having more attributes can make the distance measure in KNN less meaningful?

True: more attributes increases dimensionality, possibly leading to the CoD.

Suppose we are given a decision tree. We generate a training set consistent with that tree, and then apply decision tree learning to build a new tree. As the training set size goes to infinity, the learning algorithm's new tree will be the same as the original tree.

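False as worded. The wording is confusing, BUT the new tree is only guaranteed to be logically equivalent to the original (same predictions). For it to be the same tree, the learner would have to test the same attributes in the same order, and the question doesn't specify that.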

Soft margin SVM

While a hard margin aims to find the line that perfectly separates the data, a soft margin allows for some misclassification.

The output of a boosting algorithm that learns using "decision stumps" (i.e., a decision tree with only one node) can be converted to an equivalent ordinary decision tree in a straightforward way.

With boosting, each stump is created to correct the errors of the last stump (so stump A misclassifies x as y, while stump B corrects that and classifies x as x). The final model is a weighted combo of all the stumps, which cannot be represented as a single straightforward DT bc the combo of multiple models is integral to the boosted model's performance.

Linear Regression

Works best on continuous problems. provides a linear relationship between an independent variable and a dependent variable to predict the outcome of future events

SVM & Kernel Methods: Preprocessing needed?

Yes, missing data needs to be accounted for, along with regularization.

generalization error

a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.

SVM

classify data by finding the optimal decision boundary that maximally separates different classes

ensemble methods

combines multiple models instead of using single model to improve accuracy

Sigmoid

commonly used for models where we have to predict the probability or binary as an output. Since probability of anything exists only between the range of 0 and 1, sigmoid is the right choice because of its range

Support vectors

data points that are closest to the hyperplane and influence the position and orientation of the hyperplane

activation function

decide whether the neuron's input to the network is important or not (activates neuron)

k (number of neighbors)

determines how many neighbors will vote on the classification

model-based learning

include decision trees, linear regression, neural networks, support vector machines These models learn parameters that define relationships between input features and target outputs.

Gini Impurity

measures freq. at which any element will be mislabeled when it's randomly labeled according to distribution of labels in subset.
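A minimal sketch of the formula, Gini = 1 - sum of squared class proportions. `gini_impurity` is a hypothetical name.

```python
from collections import Counter


def gini_impurity(labels):
    """Probability of mislabeling a random element when labeling by the subset's class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


mixed = gini_impurity(["a", "a", "b", "b"])  # evenly split between two classes
pure = gini_impurity(["a", "a", "a", "a"])   # a single class
```

A pure node scores 0, and an even two-class split scores 0.5, the maximum for two classes.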

Gradient

measures the change in error with regard to a change in each of the weights (the vector of partial derivatives of the loss)

Target concept

the true underlying function (mapping from inputs to outputs) that an algo tries to approximate by searching thru its hypothesis space using the training data, so that it can predict results.

Boosting

not about creating a more complex model, but instead combining several simple models to gain a better understanding of specific cases and combining it to a larger understanding of the data.

overfitting

occurs when the model cannot generalize and fits too closely to the training dataset instead

Vanishing gradient problem

phenomenon that occurs during the training of deep neural networks, where the gradients that are used to update the network become extremely small or "vanish" as they are backpropagated from the output layers to the earlier layers. Problem increases with layers.
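A simplified illustration of why depth matters with sigmoid activations: the sigmoid's derivative is at most 0.25, and backpropagation multiplies one such factor per layer. The sketch ignores the weight matrices (an assumption made for clarity), and the function names are hypothetical.

```python
import math


def sigmoid_derivative(x):
    """Derivative of the sigmoid, s(x) * (1 - s(x)); its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def gradient_scale(depth, pre_activation=0.0):
    """Best-case factor by which a gradient shrinks after passing through `depth` sigmoid layers."""
    return sigmoid_derivative(pre_activation) ** depth


shallow = gradient_scale(2)
deep = gradient_scale(10)
```

Even in this best case, ten layers scale the gradient by less than one millionth, so the earliest layers barely update.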

Naive Bayes Classifier

predicts the probability of a certain outcome based on prior occurrences of related events

Evidence (marginal likelihood) in Bayesian Learning

probability of generating the observed sample for all possible values of the parameters. - calculated by considering how likely the observed data is under each possible hypothesis and then summing these probs

Hyperbolic Tangent

produces a zero-centered output, thereby supporting the backpropagation process. mostly used in recurrent neural networks for natural language processing and speech recognition tasks

Gradient boosting

relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model in order to minimize the error.

Pruning

removing parts of the tree that don't provide power to classify instances (gets rid of attributes/features with 0 information gain)

Solution to vanishing gradient problem

replace the activation function of the network. Instead of sigmoid, use an activation function such as ReLU.

Support Vector Machine

seeks a dividing hyperplane for any number of dimensions. using the kernel-trick, you "send" the data into a higher dimensional space where it can be linearly separable (and classified as x or y based on being below or above hyper-plane)

Hypothesis space

set of all hypotheses that can be formulated by an algo

Euclidean distance

signifies the straight-line (shortest) distance between two points

instance-based learning

sometimes referred to as lazy learning methods because they delay processing until a new instance must be classified. - KNN, kernel machines, rbf

loss function

a function that measures the difference between predicted and actual values in a machine learning model; training tries to minimize it

curse of dimensionality

the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor

posterior probability

the revised or updated probability of an event occurring after taking into consideration new information. - calculated by updating the prior probability using Bayes' theorem.
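A small sketch of the update over a discrete set of hypotheses, where the evidence is the sum of prior times likelihood and acts as the normalizer. The function name and all the probabilities in the example are made up for illustration.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem; returns (posterior_by_hypothesis, evidence)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(data), summed over all hypotheses
    return {h: joint[h] / evidence for h in joint}, evidence


# Hypothetical numbers: 1% prior for "disease"; likelihoods of a positive test
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": 0.9, "healthy": 0.05}
post, evidence = posterior(priors, likelihoods)
```

With these numbers the posterior for "disease" is about 0.15: much higher than the 1% prior, but still well below 0.5 because the prior is so small.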

In the algorithms we've shown in class, we never test the same attribute twice along one path in a decision tree. Someone argued that testing the same attribute twice is inefficient and unnecessary. This argument is correct whether our attributes are discrete or continuous

this question comes down to the last sentence. If the attribute is discrete, then once we test it, every example on that path has the same value, so it doesn't make sense to test it again along the same path because we already know the answer. If it's continuous, then it can make sense to test it twice along the same path with a different threshold (e.g. x > 5, then x > 8), because each test only narrows the range of x. So the argument is correct only for discrete attributes.

Cosine similarity

used in KNN as a measurement that quantifies the similarity b/w two+ vectors
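A minimal sketch for two vectors of equal length: the dot product divided by the product of the vector lengths. `cosine_similarity` is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Unlike Euclidean distance, the measure ignores vector length, so [1, 2] and [2, 4] count as maximally similar.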

Decision Trees

used to categorize or make predictions based on how a previous set of questions were answered

Chi-squared

used to test the independence of two events by comparing the observed counts with the counts expected if the events were independent. In feature selection, the two events are the occurrence of the feature and the occurrence of the target class

KNN

uses the k nearest neighbors of a point (nearness determined by a distance measure) to predict the value of the point.
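A sketch of KNN classification, assuming Euclidean distance and a simple majority vote. `knn_predict` and the toy data are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (point, label) pairs, where each point is a tuple of numbers
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # One simple tie rule: among tied labels, the one seen first (nearest) wins
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "a"),
    ((1, 2), "a"),
    ((2, 1), "a"),
    ((5, 5), "b"),
    ((6, 5), "b"),
]
label_near_a = knn_predict(train, query=(1.5, 1.5), k=3)
label_near_b = knn_predict(train, query=(5.5, 5.0), k=1)
```

Nothing is learned ahead of time: all the work happens at query time, which is why prediction slows down as the training set grows.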

Overfitting

when the machine learning model gives accurate predictions for training data but not for new data

Underfitting

where a data model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.

