ML Final
F1-Score
Harmonic mean of precision and recall. Used in evaluating classification models.
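A minimal sketch (not tied to any particular library) of computing precision, recall, and F1 from hypothetical confusion-matrix counts tp, fp, fn:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
# tp, fp, fn are hypothetical counts taken from a confusion matrix.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=40))  # ~0.727
```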
Hill Climbing
Heuristic search; evaluate the neighbors of the current point, and if the best neighbor has a higher function value than the current point, move to it; otherwise stop.
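A rough hill-climbing sketch on an assumed one-dimensional function f(x) = -(x - 3)^2, just to illustrate the neighbor-comparison loop:

```python
import random

# Minimal hill-climbing sketch over a 1-D function (assumed example:
# f(x) = -(x - 3)**2, which has a single maximum at x = 3).
def hill_climb(f, x0, step=0.1, max_iters=1000):
    x = x0
    for _ in range(max_iters):
        neighbors = [x - step, x + step]
        best = max(neighbors, key=f)
        if f(best) <= f(x):      # no neighbor improves: stop (local optimum)
            return x
        x = best                 # move to the better neighbor
    return x

f = lambda x: -(x - 3) ** 2
print(hill_climb(f, x0=random.uniform(-10, 10)))  # approximately 3
```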
What are characteristic features of AdaBoost?
- They adjust the weights of misclassified instances so that subsequent models focus on them - They combine weak learners sequentially to form a strong learner
Why might pruning be applied to a decision tree?
-To simplify the model and improve interpretability - To reduce overfitting -To remove branches that provide little to no predictive power
Learning rate
- determines the step size at each iteration while moving toward a minimum of a loss function - a low learning rate takes smaller steps and needs more iterations, but searches more precisely - too high a learning rate can overshoot the minimum, causing training to oscillate or diverge
NN activation functions
1. Sigmoid 2. ReLU 3. Softmax 4. Hyperbolic tangent (tanh)
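Rough NumPy sketches of these four activations (illustrative implementations, not pulled from any framework):

```python
import numpy as np

# Rough sketches of the four activations listed above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for negatives, identity for positives

def tanh(x):
    return np.tanh(x)                 # zero-centered, range (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))         # shift for numerical stability
    return e / e.sum()                # probabilities summing to 1

print(softmax(np.array([2.0, 1.0, 0.1])))
```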
Characteristics of Hill Climbing
- local search - converges to local optima - does not remember past states - susceptible to the curse of dimensionality (CoD) - not robust to noise and can be impacted by the initial solution - biased towards the dominant class in an imbalanced data set - always moves to the neighbor with the higher function value
VC dimension
The size of the largest set of inputs that the hypothesis class can label in all possible ways (shatter) - Vapnik-Chervonenkis dimension
features
data used as input to models
How does the degree of the polynomial relate to the VC dimension?
-Higher-degree polynomials have greater capacity to fit complex patterns, which is reflected in a higher VC dimension. -increasing the degree of a polynomial increases its flexibility, thus increasing the VC dimension
When deciding on a split for a continuous variable in decision trees:
-DT determines threshold value to best separate instances into groups that are as homogeneous as possible. - Determining that threshold often involves sorting the data by the values of the continuous var and finding the threshold that maximizes the chosen purity measure (like gini or info gain) - The split aims to increase the homogeneity of child nodes
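An illustrative threshold search for a single continuous feature, using entropy-based information gain as the purity measure; x and y below are made-up arrays:

```python
import numpy as np

# Sketch: choose a split threshold for one continuous feature by maximizing
# information gain (entropy-based).
def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_gain, best_t = -1.0, None
    # candidate thresholds: midpoints between consecutive sorted values
    for t in (x[:-1] + x[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - children
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # threshold around 6.5, gain = 1.0
```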
When using KNN as an instance-based learning algo, which of the following are important considerations? 1. choice of k 2. learning rate 3. architecture of underlying neural network 4. the way distances between instances are calculated 5. handling of ties when multiple classes have the same vote count 6. depth of the decision tree used
- Choice of k - the way distances between instances are calculated -handling of ties when multiple classes have the same vote count
What are challenges of instance-based learning?
- Curse of dimensionality - Sensitivity to noisy data - Storage requirements due to retaining all training instances
What measures are commonly used measures to determine the best split in a decision tree?
- Gini impurity - Information Gain - Chi-squared
Sample complexity in PAC learning
- To achieve higher confidence or lower error, more samples are typically needed. - It refers to the number of training examples required to ensure that a learned hypothesis meets specific accuracy and confidence requirements - PAC learning gives a theoretical estimate for how much data is needed to learn a good hypothesis
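A sketch of the standard bound for a finite hypothesis space and a consistent learner, m >= (1/epsilon)(ln|H| + ln(1/delta)); the numbers below are only examples:

```python
import math

# Sketch of the PAC sample-complexity bound for a finite hypothesis space H
# and a consistent learner: m >= (1/epsilon) * (ln|H| + ln(1/delta)).
def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Tighter accuracy (smaller epsilon) or higher confidence (smaller delta)
# requires more examples:
print(pac_sample_bound(h_size=1000, epsilon=0.1, delta=0.05))   # 100
print(pac_sample_bound(h_size=1000, epsilon=0.01, delta=0.05))  # 991
```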
In the context of classification, which of the following are reasons to use ensemble methods like random forests or gradient boosting machines?
- combining predictions from multiple models - reduction of model variance, because ensemble methods average out errors across individual models - a better ability to capture complex relationships that a single model might miss
Optimization Approaches
- Generate and test: try input values and evaluate the function until a peak is found (needs a small input space; helpful for complex functions) - Calculus: requires a function whose derivative can be set to 0 and solved - Newton's method: requires the derivative and time to iteratively improve the estimate of the optimum
Essential components of the PAC learning framework
- A hypothesis space from which hypotheses are drawn - A confidence parameter (delta) bounding the probability that the learned hypothesis performs worse than the error bound - An error parameter (epsilon) bounding the probability that the hypothesis misclassifies a randomly drawn instance - A sample complexity determining the number of examples required to achieve a given error and confidence level
Implications of a higher VC dimension?
- indicates a model with higher complexity that can shatter more points, which can lead to overfitting. - can be indicative of the model's ability to fit a wider range of functions - A higher VC dimension means the model can represent more complex relationships.
Which of the following algorithms can be used for classification tasks
-KNN - Naive bayes - DTs - SVMs - NNs
Pearson correlation coefficient
Measures the strength and direction of the linear association between two variables; ranges from -1 to +1.
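A small sketch computing the coefficient directly from its definition (covariance over the product of standard deviations):

```python
import numpy as np

# Sketch: Pearson correlation as covariance divided by the product of
# standard deviations; -1 = perfect negative linear association, +1 = perfect positive.
def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```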
Softmax
transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the predicted classes.
Is a very large HS better than a small one in general?
Not necessarily. A smaller, well-chosen hypothesis space can often lead to better generalization: it is easier to search for the best hypothesis, and its lower flexibility makes it less likely to fit noise and random fluctuations in the data, focusing instead on the broader patterns.
T/F: Random Forests require distance metrics for making predictions
False: RFs use mode or mean of the predictions instead of distance metrics.
T/F: Evidence is equivalent to the likelihood for the most probable hypothesis
False: The evidence takes into account the likelihoods of all hypotheses, not just the most probable one.
Area under the receiver operating characteristics curve (AUC-ROC)
Area under the curve of true positive rate vs. false positive rate at various threshold settings. Used in evaluating classification models.
T/F: VC dimension is the sole factor in determining GE?
False: VC is one factor, but not the sole one.
T/F: AdaBoost involves random feature selection for each learner
False: all features are considered for each learner.
AdaBoost
Boosting algorithm that trains a series of weak learners on data and focuses later learners on data misclassified by earlier learners
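An illustrative use of an off-the-shelf AdaBoost (scikit-learn's AdaBoostClassifier, which boosts decision stumps by default); the dataset is synthetic and only for demonstration:

```python
# Illustrative use of an off-the-shelf AdaBoost (scikit-learn); by default it
# boosts decision stumps (depth-1 trees), matching the description above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # accuracy on held-out data
```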
T/F: Soft margin SVM doesn't make use of the C parameter (regularization strength)
False: soft margin SVM uses a regularization parameter (C) to control the trade-off between maximizing the margin and minimizing classification errors
Kernel function
Computes the similarity (inner product) between two points in an implicit feature space; the choice of kernel determines the shape of the decision boundaries the model can form in that space.
T/F: Gradient boosting relies heavily on data normalization before training
False: tree-based gradient boosting is insensitive to monotonic feature scaling, so normalization is generally not required.
T/F: Gradient boosting uses a completely random subset of data to construct each tree?
False: it does not use a completely random subset of the data; each new tree is built specifically to correct the errors (residuals) of the previous trees, so the data it emphasizes is determined by those errors.
T/F: The number of parameters is always equal to the number of neurons
False: Each neuron can have multiple params associated with it, including weights for each connection and bias term.
T/F: KNN starts to discard older data as we add more training data
False: KNN always uses all of the data
Kernel methods
Map a non-linearly separable input space to another space which hopefully is linearly separable - the new space is usually higher-dimensional - the key element is that they never explicitly map features to this space; instead they return the inner product (a similarity measure) between elements in that space
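A sketch of one common kernel, the RBF (Gaussian) kernel, which returns a similarity corresponding to an inner product in an implicit feature space; the gamma value here is an assumed example:

```python
import numpy as np

# Sketch of a kernel function: the RBF (Gaussian) kernel returns a similarity
# between two points that equals an inner product in an implicit,
# infinite-dimensional feature space -- the points are never mapped explicitly.
def rbf_kernel(x, z, gamma=0.5):
    x, z = np.asarray(x, float), np.asarray(z, float)
    return np.exp(-gamma * np.sum((x - z) ** 2))

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))   # 1.0 (identical points)
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # ~3.7e-06 (distant points)
```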
R-Squared
Measure of proportion of variance for a dependent var that's explained by an independent var/vars in a regression model
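A minimal sketch computing R^2 = 1 - SS_res / SS_tot from predictions; the numbers are made up:

```python
import numpy as np

# Sketch: R^2 = 1 - SS_res / SS_tot, the fraction of variance in the
# dependent variable explained by the model's predictions.
def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))   # close to 1
```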
Accuracy
Measures the proportion of true (correct) results among the total number of cases examined. Used in evaluating classification models.
Are standard SVMs suited for multi-class classification?
No, standard SVMs are binary classifiers. SVMs can be EXTENDED for multi-class classification
Neural Networks are...
Non-linear and multi-layer
Neural Network
Once an input layer (input data) is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output (prediction) compared to other inputs. All inputs are then multiplied by their respective weights and summed, and the sum is passed through an activation function to produce the neuron's output.
Gradient Descent
Optimization algorithm for finding the input to a function that produces the optimal value; iterative
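A bare-bones sketch on an assumed quadratic loss f(w) = (w - 4)^2, also showing what happens when the learning rate is too large:

```python
# Minimal gradient-descent sketch on an assumed quadratic loss
# f(w) = (w - 4)^2, whose gradient is 2 * (w - 4); minimum at w = 4.
def gradient_descent(lr=0.1, iters=100, w=0.0):
    for _ in range(iters):
        grad = 2 * (w - 4)   # derivative of the loss at the current w
        w -= lr * grad       # step against the gradient, scaled by the learning rate
    return w

print(gradient_descent(lr=0.1))    # converges near 4
print(gradient_descent(lr=1.1))    # too-large learning rate: oscillates and diverges
```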
PAC learning framework
Probably approximately correct (PAC) learning theory helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier. -relationship between the desired error rate and the number of samples for training a learnable classifier
Recall
Proportion of actual positives that are correctly identified. Used in evaluating classification models.
Precision
Proportion of positive identifications that are actually correct. Used in evaluating classification models.
How does the performance of KNN change as training data increases?
Query (prediction) time generally increases, since more stored instances must be searched; with more training data the model typically generalizes better to new data.
Bootstrapped samples
Random sampling with replacement. - RFs use bagging (bootstrap aggregating)
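A small NumPy sketch of drawing one bootstrapped sample (with replacement) from a stand-in dataset:

```python
import numpy as np

# Sketch: a bootstrapped sample draws n items with replacement from an
# n-item dataset, so some rows repeat and some are left out ("out-of-bag").
rng = np.random.default_rng(0)
data = np.arange(10)                          # stand-in for training rows
sample = rng.choice(data, size=len(data), replace=True)
print(sample)                                 # some indices appear more than once
print(set(data) - set(sample))                # the out-of-bag rows
```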
T/F: The VC dimension of a polynomial classifier is equal to its degree.
False: the VC dimension depends on the number of parameters in the model, which is not always directly equal to the degree of the polynomial.
ReLU
The function returns 0 if it receives any negative input, but for any positive value x, it returns that value back
Backpropagation
The practice of fine-tuning the weights of a neural net based on the error rate (i.e., loss) obtained in the previous epoch (i.e., iteration).
T/F: # of params is influenced by the size of all layers, not just input layer.
True
T/F: Deep networks with many layers generally have more parameters, as each layer adds additional weights and biases to be learned.
True
T/F: More parameters often result in a more complex model; this increased complexity can allow the model to learn finer details in the training data and lead to overfitting if not controlled.
True
T/F: Regularization methods introduce penalties for complexity, encouraging simpler models that generalize better.
True
T/F: VC dimensions are a measure of model complexity, not computational complexity
True
T/F: A higher VC dimension indicates a more flexible hypothesis class capable of representing more complex functions
True: increasing flexibility increases the VC dimension.
T/F: A lower VC dimension implies a more constrained hypothesis class that might not fit the data as closely.
True: A hypothesis class with a lower VC dimension is less flexible and might not be able to capture complex patterns in the data
T/F: Evidence serves as a normalization factor in Bayes' theorem
True: The evidence normalizes the probability distribution, ensuring the posterior probs sum to 1.
T/F: For optimization, Newton's method is prone to getting stuck at local maximum
True: it performs poorly on functions with many local maxima; it works best when the function has a single optimum.
T/F: Having more attributes can make the distance measure in KNN less meaningful?
True: more attributes increases dimensionality, possibly leading to the CoD.
Suppose we are given a decision tree. We generate a training set consistent with that tree, and then apply decision tree learning to build a new tree. As the training set size goes to infinity, the learning algorithm's new tree will be the same as the original tree.
False (the wording is confusing): as the training set grows, the learned tree will represent the same function as the original tree (they agree on all inputs), but the learning algorithm is not guaranteed to reproduce the identical tree structure; it might, for example, find a smaller tree consistent with the same data.
Soft margin SVM
While a hard margin aims to find the line that perfectly separates the data, a soft margin allows for some misclassification.
The output of a boosting algorithm that learns using "decision stumps" (i.e., a decision tree with only one node) can be converted to an equivalent ordinary decision tree in a straightforward way.
With boosting, each stump is trained to correct the errors of the previous stumps (if stump A misclassifies x, stump B is weighted to correct it). The final model is a weighted combination of all the stumps, which cannot be converted into a single ordinary decision tree in a straightforward way, because the weighted interaction of multiple models is integral to the boosted model's performance.
Linear Regression
Works best on continuous problems; models a linear relationship between independent variable(s) and a dependent variable to predict the outcome of future events
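An ordinary-least-squares sketch using NumPy's least-squares solver on made-up data (y roughly equal to 2x + 1 plus noise):

```python
import numpy as np

# Sketch: ordinary least squares via NumPy's least-squares solver on a
# made-up continuous dataset (y roughly = 2x + 1 plus noise).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

A = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)                         # approximately 2 and 1
```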
SVM & Kernel Methods: Preprocessing needed?
Yes: missing values need to be handled, and features typically need to be scaled/normalized so that no single feature dominates the distance or kernel computation; the regularization parameter also needs to be chosen.
generalization error
a measure of how accurately an algorithm is able to predict outcome values for previously unseen data.
SVM
classify data by finding the optimal decision boundary that maximally separates different classes
ensemble methods
combine multiple models instead of using a single model, to improve accuracy
Sigmoid
commonly used for models where we have to predict a probability or a binary output. Since probabilities exist only in the range 0 to 1, sigmoid is a natural choice because of its range.
Support vectors
data points that lie closest to the hyperplane and influence its position and orientation
activation function
decides whether a neuron's input to the network is important or not (whether the neuron activates), and introduces non-linearity into the network
k (number of neighbors)
determines how many neighbors will vote on the classification
model-based learning
includes decision trees, linear regression, neural networks, and support vector machines. These models learn parameters that define relationships between input features and target outputs.
Gini Impurity
measures how frequently a randomly chosen element would be mislabeled if it were labeled randomly according to the distribution of labels in the subset.
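A minimal sketch of the formula 1 - sum(p_k^2) over the class proportions in a node:

```python
import numpy as np

# Sketch: Gini impurity of a label subset, 1 - sum(p_k^2). It is 0 for a
# pure node and largest when classes are evenly mixed.
def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))   # 0.0 (pure)
print(gini([0, 0, 1, 1]))   # 0.5 (maximally mixed, two classes)
```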
Gradient
the vector of partial derivatives of the error with respect to the weights; it measures how the error changes as each weight changes
Target concept
the true function or rule the learner is trying to approximate; the algorithm searches through its hypothesis space, using the training data, for a hypothesis close to it that can be used to predict results.
Boosting
not about creating a single more complex model, but about combining several simple models, each focused on the specific cases earlier models got wrong, into a larger, stronger understanding of the data.
overfitting
occurs when the model cannot generalize and fits too closely to the training dataset instead
Vanishing gradient problem
phenomenon that occurs during the training of deep neural networks, where the gradients used to update the network become extremely small or "vanish" as they are backpropagated from the output layers to the earlier layers. The problem gets worse as the number of layers increases.
Naive Bayes Classifier
predicts the probability of a certain outcome based on prior occurrences of related events, under the "naive" assumption that features are conditionally independent given the class
Evidence (marginal likelihood) in Bayesian Learning
probability of generating the observed sample, averaged over all possible hypotheses (parameter values) - calculated by weighting how likely the observed data is under each hypothesis by that hypothesis's prior probability and summing these terms
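A tiny numeric sketch of the evidence as the normalizer in Bayes' theorem; the priors and likelihoods are made-up values for three hypothetical hypotheses:

```python
import numpy as np

# Sketch of Bayes' theorem with the evidence as the normalizer:
# P(h|D) = P(D|h) P(h) / P(D), where P(D) = sum over h of P(D|h) P(h).
priors      = np.array([0.5, 0.3, 0.2])     # P(h) for three hypotheses (made up)
likelihoods = np.array([0.10, 0.40, 0.70])  # P(D|h) for the observed data (made up)

evidence   = np.sum(likelihoods * priors)   # marginal likelihood P(D)
posteriors = likelihoods * priors / evidence
print(evidence)      # 0.31
print(posteriors)    # sums to 1 thanks to the normalization
```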
Hyperbolic Tangent
produces a zero-centered output, thereby supporting the backpropagation process. mostly used in recurrent neural networks for natural language processing and speech recognition tasks
Gradient boosting
relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model in order to minimize the error.
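A hand-rolled sketch of this idea for squared-error loss: each small tree (scikit-learn's DecisionTreeRegressor here) is fit to the current residuals and added in with a shrinkage (learning-rate) factor; the data is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hand-rolled sketch of the gradient-boosting idea for squared error:
# each new small tree is fit to the residuals of the ensemble so far.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

pred, lr, trees = np.zeros_like(y), 0.1, []
for _ in range(100):
    residuals = y - pred                          # what the ensemble still gets wrong
    t = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(t)
    pred += lr * t.predict(X)                     # shrink and add the correction

print(np.mean((y - pred) ** 2))   # training error shrinks as trees are added
```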
Pruning
removing parts of the tree that don't provide power to classify instances (gets rid of attributes/features with little or no information gain)
Solution to vanishing gradient problem
replace the activation function of the network. Instead of sigmoid, use an activation function such as ReLU.
Support Vector Machine
seeks a dividing hyperplane for any number of dimensions. using the kernel-trick, you "send" the data into a higher dimensional space where it can be linearly separable (and classified as x or y based on being below or above hyper-plane)
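An illustrative soft-margin SVM with an RBF kernel via scikit-learn, on the toy make_moons dataset (not linearly separable in the original space); the C and gamma values are just defaults, not recommendations:

```python
# Illustrative soft-margin SVM with an RBF kernel (scikit-learn). C controls
# the margin/misclassification trade-off; gamma controls the kernel width.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # held-out accuracy
```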
Hypothesis space
set of all hypotheses that can be formulated by an algo
Euclidean distance
signifies the shortest distance between two points
instance-based learning
sometimes referred to as lazy learning methods because they delay processing until a new instance must be classified. - KNN, kernel machines, radial basis function (RBF) networks
loss function
a function that quantifies the difference between predicted and actual values in a machine learning model
curse of dimensionality
the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor
posterior probability
the revised or updated probability of an event occurring after taking into consideration new information. - calculated by updating the prior probability using Bayes' theorem.
In the algorithms we've shown in class, we never test the same attribute twice along one path in a decision tree. Someone argued that testing the same attribute twice is inefficient and unnecessary. This argument is correct whether our attributes are discrete or continuous
This question comes down to the last sentence. If the attribute is discrete, testing it again along the same path never makes sense: its value is already known from the earlier test, so the second test adds nothing. If the attribute is continuous, testing it again along the same path with a different threshold can be useful (e.g., first split on x > 1, later on x > 3), so the argument is not correct for continuous attributes.
Cosine similarity
used in KNN as a measurement that quantifies the similarity between two or more vectors
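A minimal sketch of the definition, dot(a, b) / (||a|| ||b||):

```python
import numpy as np

# Sketch: cosine similarity = dot(a, b) / (||a|| * ||b||); 1 means the
# vectors point the same way, 0 orthogonal, -1 opposite.
def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [2, 0]))   # 1.0
print(cosine_similarity([1, 0], [0, 5]))   # 0.0
```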
Decision Trees
used to categorize or make predictions based on how a previous set of questions were answered
Chi-squared
used to test the independence of two events by comparing observed counts against the counts expected under independence; for feature selection or splitting, it tests whether the occurrence of a feature is independent of the target class
KNN
uses the K nearest neighbors of a point (found using a distance metric) to predict the value of that point.
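A bare-bones KNN classifier sketch (Euclidean distance, majority vote among the k nearest stored points); the training points are made up, and ties here simply follow Counter's ordering:

```python
import numpy as np
from collections import Counter

# Bare-bones KNN classifier sketch: Euclidean distance, majority vote among
# the k nearest training points.
def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_query), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = ["a", "a", "b", "b"]
print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))   # "a"
```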
Overfitting
when the machine learning model gives accurate predictions for training data but not for new data
Underfitting
where a data model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.