Intro to ML CS4375: Exam 2
X
Features or input data
𝑓𝐰,𝑏(𝐱)=𝐰⋅𝐱+𝑏
Model prediction with multiple variables (⋅ is dot product)
DataFrame.columns
Returns column names
KNN Algorithm
Classifies a new data point based on the class labels of its k nearest neighbors in the training data. For example, suppose we have a dataset of customer information, including age and income, and we want to predict whether a new customer is likely to make a purchase. We first calculate the distance between the new customer and each customer in the training data, select the k customers closest to the new customer, and classify the new customer by the majority class label among those k neighbors.
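A minimal sketch of this example using scikit-learn's KNeighborsClassifier; the age/income numbers and labels are invented for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: [age, income]; label 1 = made a purchase
X_train = np.array([[25, 40000], [35, 60000], [45, 80000], [22, 30000], [50, 90000]])
y_train = np.array([0, 1, 1, 0, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X_train, y_train)

# Classify a new customer by majority vote among the 3 closest customers
print(knn.predict(np.array([[30, 55000]])))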
F1 Score
Harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). Provides a balanced measure of the two; 0 indicates the model is completely wrong, 1 indicates perfect predictions.
Feature Scaling
Involves dividing the input values by the range (i.e., the maximum value minus the minimum value) of the input variable; this brings features that originally span very different ranges (such as 0-1 and 100-1000) onto a common scale, typically 0 to 1.
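A minimal NumPy sketch of this kind of scaling; the values are invented for illustration:

import numpy as np

x = np.array([15.0, 48.0, 27.0, 90.0, 63.0])    # raw input values (made up)

# Divide the shifted values by the range (max - min) to map them into [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)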
b
the base bias or intercept of the model (usually a scalar)
w
weight of each feature or input variable (usually vector in multi-feature models)
Information Gain Formula
Gain(A) = [−∑ₖ P(vₖ) log₂ P(vₖ)] − ∑ₙ (pₙ+nₙ)/(p+n) · [−∑ₖ P(vₖ) log₂ P(vₖ)] (entropy of the whole set minus the weighted entropy remaining after splitting on A)
Entropy formula
−∑ₖ P(vₖ) log₂ P(vₖ)
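A small NumPy sketch of both formulas; the toy labels and attribute values below are invented for illustration:

import numpy as np

def entropy(labels):
    # H(V) = -sum_k P(v_k) * log2(P(v_k))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    # Entropy of the whole set minus the weighted entropy after splitting
    remainder = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

play    = np.array([1, 1, 0, 0, 1, 0])                          # output variable
outlook = np.array(["sun", "sun", "rain", "rain", "sun", "rain"])
print(information_gain(play, outlook))                          # 1.0: a perfect split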
Multiple-Variable Cost Function
𝐽(𝐰,𝑏) = (1/2𝑚) ∑ᵢ₌₀^(𝑚−1) (𝑓𝐰,𝑏(𝐱^(𝑖)) − 𝑦^(𝑖))²
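A direct NumPy translation of this cost function; the toy data is invented for illustration:

import numpy as np

def compute_cost(X, y, w, b):
    # J(w, b) = (1 / 2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2
    m = X.shape[0]
    predictions = X @ w + b          # dot product w . x + b for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 5.0]])
y = np.array([5.0, 8.0, 14.0])
print(compute_cost(X, y, w=np.array([1.0, 2.0]), b=0.0))    # 0.0: perfect fit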
Model prediction with multiple variables
𝑓𝐰,𝑏(𝐱)=𝑤0𝑥0 + 𝑤1𝑥1 + ... + 𝑤𝑛−1𝑥𝑛−1 + 𝑏
Model prediction of i-th example
𝑓𝐰,𝑏(𝐱^(𝑖))
Adaptive Learning Rates
-- Learning Rates are no longer fixed -- Can be made larger or smaller depending on: size of gradient, speed of learning, value of particular weights, etc.
Logistic Regression
Applies the sigmoid function to the output of linear regression to produce a probability, creating a classification solution; the cost function minimized is binary cross-entropy (log loss) rather than RMSE.
Uses for ML
Fraud detection, web search results, real-time ads, pricing models, spam filtering, etc.
Unsupervised Learning
Model is given unlabeled data with no notion of correct or incorrect outputs; instead it tries to group similar data points based on features, or to derive some correlation between features.
Underfitting
Model performs poorly on training data and is unable to capture relationships between attributes and outcomes.
Overfitting
Model performs well on training data, but poorly on test data because it 'memorized' the training data too much and can't generalize the relationships.
Supervised Learning
Model receives a set of inputs (features) along with the corresponding correct outputs (targets/labels), and learns by comparing its output with correct outputs
Encoding Inputs for NN
Neural networks require numerical inputs, so we must translate all inputs to vectors of fixed size
m
Number of examples in dataset
SGD 2.0
Mini-batch SGD: initialize the weights randomly; loop until convergence: pick a batch of B data points, compute the gradient of the loss averaged over the batch, and update 𝐖 ← 𝐖 − η (1/B) ∑ᵢ ∂Jᵢ(𝐖)/∂𝐖; return the weights.
Information Gain
The expected reduction in the entropy of the output variable on the whole set.
x^i, y^i
i-th example from the dataset
Sequence Modeling Applications
- One-to-one (1:1): binary classification - Many-to-one (N:1): sentiment classification - One-to-many (1:N): image captioning - Many-to-many (N:N): machine translation
Confusion Matrix
- Displays true/false positives and negatives for each prediction outcome - Useful for evaluating classification models
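A quick sketch using scikit-learn's confusion_matrix; the labels are invented for illustration:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]    # model predictions

# For binary labels, rows are actual classes and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))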
Decision Tree Algorithms
- Non-parametric supervised learning method used for classification and regression. - The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features (a series of if-else decisions). - In order to pick which feature to split on, we need a way of measuring how good the split is, which is where information gain and entropy come in.
Sigmoid function
- Used in logistic regression, and necessary for outputs that take discrete values (classification problems). - The sigmoid function maps any real value into the interval (0, 1), so we take our linear regression solution and run it through sigmoid to get a discrete/classification solution. - Also useful in normalization.
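A minimal sketch of sigmoid turning linear-model outputs into class labels; the z values are made up for illustration:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])        # e.g. outputs of a linear model w . x + b
probs = sigmoid(z)
print(probs)                          # approx. [0.018, 0.5, 0.982]
print((probs >= 0.5).astype(int))     # threshold at 0.5 for discrete labels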
Convolutional Neural Network (CNN)
-- A convolutional neural network (CNN) is a type of neural network that is commonly used for image and video recognition, natural language processing, and other tasks involving structured data. -- The network consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
Recurrent Neural Networks (RNN)
-- A recurrent neural network (RNN) is a type of artificial neural network that uses sequential data or time-series data (sequence modeling). -- Used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition, and image captioning (Siri, voice search, Google Translate). -- Distinguished by their memory, as they take information from prior inputs to influence the current input and output.
Stochastic RL Policy
-- A stochastic policy can be represented as a family of conditional probability distributions, πₛ(A∣S), from the set of states, S, to the set of actions, A. -- For every state, you have a probability distribution over the actions to take from that state.
RNN Notes
-- Apply a recurrence relation at every time step to process a sequence -- The same function and set of parameters are used at each step -- RNNs have a state that is updated at each step
Regularization 1: Dropout
-- During training, randomly set some activations to 0 -- Typically drop 50% of activations in a layer -- Forces the network to not rely on any one node
Computing Gradients: Backpropagation
-- Facilitates weight adjustment of the nodes in a neural network -- The output of the neural network is compared to the desired output, and the error is calculated using a loss function. The error is then propagated back through the network, layer by layer, in a process known as backpropagation.
Why do we flip the kernel in a CNN?
-- In a practical sense, flipping the kernel ensures that the convolution operation captures the appropriate relationships between features in the input data and the filters in the kernel. -- If the kernel were not flipped, the resulting feature maps would be mirror images of the desired output.
Policy NN in RL
-- In reinforcement learning, a policy is a function that maps an observed state of the environment to an action to be taken by the agent. -- The goal of the agent is to learn a policy that maximizes some reward signal over time. -- In other words, a policy-based neural network learns to select the best action directly from the input state.
Value NN in RL
-- In reinforcement learning, the value of a state-action pair is the expected total reward that the agent can obtain by following a given policy starting from that state and taking that action. -- A value-based neural network represents the value function as a neural network that takes the state of the environment and the action as inputs and outputs the expected value of that state-action pair.
Characteristics of CNN and FNN
-- Outputs are independent of previous inputs -- Inputs are fixed size
Sequence Modeling
-- Sequence modeling is the task of predicting the next element in the input (word, character, etc.) -- Unlike CNN and FNN, outputs are dependent on previous inputs and input has dynamic size
Setting Learning Rate in Optimization of Loss
-- Small learning rate converges slowly and gets stuck in false local minima -- Large learning rate overshoots and diverges -- Stable learning rate converges smoothly and avoids local minima
K-Means Clustering
-- The most common centroid-based clustering algorithm is k-means: 1.) The algorithm starts by randomly selecting K points in the data to serve as the initial centroids. 2.) Each data point is then assigned to the nearest centroid, based on some distance metric such as Euclidean distance. 3.) The centroids are then recalculated as the mean of the points assigned to each cluster, and the assignment step is repeated until convergence.
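A minimal sketch with scikit-learn's KMeans; the 2-D points are invented to form two obvious clusters:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# n_init random centroid initializations are tried; the best run is kept
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # final centroids (mean of assigned points)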
Deterministic RL Policy
-- a function of the form πd:S→A, that is, a function from the set of states of the environment, S, to the set of actions, A. The subscript d indicates this is a deterministic policy. -- Every state has a clearly defined action (state X will always take action A)
Centroid-based Clustering Algorithm
-- a type of unsupervised machine learning algorithm that is used to group similar data points into clusters based on the distance between the points. -- The goal of these algorithms is to partition the data into K clusters, where K is a pre-defined parameter that is chosen by the user.
General Procedure to Produce a Model
1.) Data acquisition 2.) Data cleaning 3.) Split into test and training data; train & build the model 4.) Test the model and repeat training if necessary 5.) Deploy the model
Solution to Setting Learning Rate
1.) Try lots of different learning rates and see what works 2.) Design an adaptive learning rate that changes dynamically
ID3 Procedure
1.) Divide the data into two opposite groups 2.) Calculate the entropy and information gain of each attribute to find the most dominant one 3.) Set the dominant attribute as a decision node 4.) Repeat for the other attributes until a decisive decision is reached
RNN Characteristics/Requirements for Sequence Modeling
1.) handle variable-length sequences 2.) track long-term dependencies 3.) maintain information about order of sequence 4.) Share same parameters across the sequence
Entropy
A measure of the uncertainty of a random variable; the more information, the less entropy. A random variable with only 1 possible value has no uncertainty and thus its entropy is defined as zero.
Machine Learning
A method of data analysis that automates analytical model building, allowing computers to find hidden insights without explicit programming.
Difference in facilitating RNN weight updates:
Backpropagation Through Time (BPTT)
How are FNN/CNN networks trained/facilitated?
Backpropagation and Gradient Descent
XGBClassifier
Build many decision trees sequentially, where each subsequent tree places a higher weight on misclassified observations (boosting), in an attempt to fix the errors of previous trees
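A minimal usage sketch, assuming the xgboost package is installed; the dataset is synthetic and invented for illustration:

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 4))                    # 100 examples, 4 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # synthetic binary target

# n_estimators = number of boosting rounds (trees built sequentially)
model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))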
Random Forest (CART)
Build many decision trees, each on a different subset (bag) of training data, and then combine predictions into one final prediction
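A minimal scikit-learn sketch on the same kind of synthetic data, invented for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = (X[:, 2] > 0.5).astype(int)

# Each of the 100 trees is trained on a bootstrap sample (bag) of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:5]))           # majority vote across the trees
print(forest.feature_importances_)     # estimated importance of each feature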
Bias-variance tradeoff
By continuing to add more complexity and flexibility to the model we start to overfit, and performance on test data decreases.
Entropy remaining after testing attribute A
Can be subtracted from the entropy of the output variable on the whole set to determine the information gained from attribute A.
Mean Squared Error Loss
Can be used to measure loss in regression models that output continuous real numbers
Disadvantages of Random Forest
- Computational complexity - Difficult to interpret individual trees
Machine Learning is a combination of:
Computer Science, Math & Statistics, Domain Knowledge
Coefficient Interpretation
Continuous/regression: directly interpret each coefficient w in relation to the outcome, e.g., the cost of a house increases by 12000 per unit increase in the number of rooms. Discrete/classification: interpret each coefficient's relation to the probability of the target label/class, e.g., we can expect the odds of passing the test to decrease per unit increase in age.
pairplot()
Generate plots of each attribute against every other attribute
DataFrame.describe()
Gives statistical summary of data (mean, std, min, max, percentiles)
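A small sketch tying these DataFrame inspection calls together (note that pairplot() comes from the seaborn library); the tiny dataset is invented for illustration:

import pandas as pd
import seaborn as sns

df = pd.DataFrame({"age": [25, 35, 45, 22], "income": [40, 60, 80, 30]})

print(df.columns)      # column names
print(df.head())       # first 5 rows by default; df.head(n) for n rows
print(df.describe())   # mean, std, min, max, percentiles
sns.pairplot(df)       # each attribute plotted against every other attribute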
Gradient Descent
Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima. We can use gradient descent on our cost function to minimize the RMSE.
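A minimal sketch of batch gradient descent for one-variable linear regression; the toy data is invented, and the updates step along the derivatives of the squared-error cost:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])     # true relationship: y = 2x + 1

w, b, alpha, m = 0.0, 0.0, 0.01, len(x)
for _ in range(5000):
    error = (w * x + b) - y            # f_{w,b}(x) - y for every example
    w -= alpha * (error @ x) / m       # step along dJ/dw
    b -= alpha * error.sum() / m       # step along dJ/db
print(w, b)                            # converges toward w = 2, b = 1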
Advantages of Random Forest
- Handles large datasets with many features - Robustness to outliers and noisy data - Ability to estimate the importance of each feature
Decision Tree Learning Algorithms
ID3, C4.5, CART, CHAID, MARS
Why do we need Regularization
Improves generalization of our model on unseen data
[Entropy of output variable on X] - [Estimated entropy remaining after testing attribute A]
Information gain
Empirical Loss/Cost Function/Empirical Risk
Measures the total loss over our entire dataset
Benefit of SGD 2.0
Mini-batches lead to fast training: we can parallelize computation and achieve significant speed increases on GPUs
Reinforcement Learning
Model discovers through experience or trial-and-error which actions yield the greatest or most correct rewards
n
Number of features or inputs in each example
CNN Pooling Layer
Pooling layers, also known as downsampling, conduct dimensionality reduction, reducing the number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array. There are two main types of pooling: max & average pooling.
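A NumPy sketch of 2x2 max pooling with stride 2; the feature map values are invented for illustration:

import numpy as np

def max_pool_2x2(fm):
    # Slide a 2x2 window with stride 2 and keep the max in each window;
    # no weights are involved, and height/width are halved
    h, w = fm.shape
    fm = fm[:h - h % 2, :w - w % 2]    # trim odd edges so 2x2 windows fit
    return fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 0, 8]])
print(max_pool_2x2(fmap))              # [[6, 4], [7, 9]]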
Recall
Proportion of correctly predicted positive observations out of all actual positive observations: Recall = TP / (TP + FN). Measures the ability of the model to identify all positive observations in the test set.
Bagging
Randomly sample the training data (with replacement) for the build of each tree. Samples can contain duplicates, and other values can be left unused.
Gradient Descent Algorithms
SGD, Adam, Adadelta, Adagrad, RMSProp
Loss Optimization Formula
𝐖* = argmin_𝐖 (1/n) ∑ᵢ ℒ(f(𝐱^(𝑖); 𝐖), 𝑦^(𝑖)) = argmin_𝐖 𝐽(𝐖)
Euclidean Distance Formula
d(𝐱, 𝐲) = √( ∑ᵢ₌₁ⁿ (xᵢ − yᵢ)² )
Stochastic Gradient Descent (SGD)
Initialize the weights randomly; loop until convergence: pick a single data point i, compute the gradient ∂Jᵢ(𝐖)/∂𝐖 on that point alone, and update 𝐖 ← 𝐖 − η ∂Jᵢ(𝐖)/∂𝐖; return the weights.
𝑔(𝑧)=1/(1+𝑒^(−𝑧))
Sigmoid function
Regularization 2: Early Stopping
Stop training before we overfit.
How do we make a regressive decision w/ decision tree?
Take the average of the predictions of each tree
y
Targets or output data
CNN Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let's assume that the input will be a color image, which is made up of a matrix of pixels in 3D. This means that the input will have three dimensions—a height, width, and depth—which correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter, which will move across the receptive fields of the image, checking if the feature is present. This process is known as a convolution. The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive field. The filter is then applied to an area of the image, and a dot product is calculated between the input pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the entire image. The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or a convolved feature. After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity to the model.
Quantifying Loss
The loss of our neural network measures the cost incurred from incorrect predictions
RNN: BPTT
The principles of BPTT are the same as traditional BP, where the model trains itself by calculating errors from its output layer to its input layer. These calculations allow us to adjust and fit the parameters of the model appropriately. BPTT differs from the traditional approach in that BPTT sums errors at each time step whereas FF networks do not need to sum errors as they do not share parameters across each layer.
CNN Fully-Connected Layer
This layer performs the task of classification based on the features extracted through the previous layers and their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers usually leverage a softmax activation function to classify inputs appropriately, producing a probability from 0 to 1.
How do we make a classification decision w/ decision tree?
Use each tree in the forest to get a prediction, and the label with the most-votes/recurrences is the predicted class
Binary Cross Entropy Loss
Used to measure loss of models that output a probability between 0 and 1
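A direct NumPy sketch of the formula; the labels and probabilities are invented for illustration:

import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # L = -(1/n) * sum[ y*log(y_hat) + (1 - y)*log(1 - y_hat) ]
    eps = 1e-12                                  # avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.4])          # predicted probabilities
print(binary_cross_entropy(y_true, y_pred))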
Iterative Dichotomy (ID3)
Uses information theory or entropy to split on an attribute that gives the highest information gain. Top-down, greedy search
Problem of Long-Term Dependencies
Vanishing gradients
Loss Optimization
We need to find the NN weights that achieve the lowest loss (use gradient descent)
DataFrame.head()
Returns the first n rows of the dataset; n defaults to 5
Does XGBClassifier use a variant of gradient descent?
Yes, to minimize error in each subsequent tree
Mean/Z-Score Normalization
Z-score is a variation of scaling that represents the number of standard deviations away from the mean. It's useful when your ML technique requires normalized data (Gaussian Naive Bayes). x' = (x - μ) / σ
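The same formula in NumPy; the values are invented for illustration:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])
x_norm = (x - x.mean()) / x.std()    # x' = (x - mu) / sigma
print(x_norm)                        # mean 0, standard deviation 1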
Non-parametric statistics
involves methods and techniques that do not make any assumptions about the underlying distribution of the data being analyzed
K Nearest Neighbor Algorithm (KNN)
Non-parametric, supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point.
Precision
Proportion of correctly predicted positive observations out of all predicted positive observations: Precision = TP / (TP + FP). Measures the accuracy of the positive predictions made by the model.
Least Squares Linear Regression
statistical method used to determine a line of best fit for some data points by minimizing the sum of the squares of the residuals. We use linear regression to minimize RMSE.
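A minimal sketch of a least-squares line fit using NumPy's lstsq; the noisy points are invented around y = 2x + 1:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 7.0, 9.2])

A = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum of squared residuals
print(w, b)                                      # slope and intercept of best fit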