Intro to ML CS4375: Exam 2


X

Features or input data

𝑓𝐰,𝑏(𝐱)=𝐰⋅𝐱+𝑏

Model prediction with multiple variables (⋅ is dot product)

DataFrame.columns

Returns column names

KNN Algorithm

Classify a new data point based on the class labels of its k nearest neighbors in the training data. For example, suppose we have a dataset of customer information, including age and income, and we want to predict whether a new customer is likely to make a purchase. We first calculate the distance between the new customer and each customer in the training data, then select the k customers closest to the new customer and classify the new customer by the majority class label among those k neighbors.
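
A minimal scikit-learn sketch of this procedure; the customer ages, incomes, and labels below are made up for illustration:

```python
# Hypothetical customer data: [age, income]; label 1 = purchased, 0 = did not.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[25, 40000], [30, 60000], [45, 80000], [50, 30000], [35, 75000]]
y_train = [0, 1, 1, 0, 1]

# k = 3: classify by majority vote among the 3 nearest training points
# (Euclidean distance by default; scaling features first is usually wise).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[40, 70000]]))  # majority label of the 3 closest customers
```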

F1 Score

Harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). Provides a balanced measure of the two; 0 indicates the model is completely wrong, 1 indicates perfect predictions.

Feature Scaling

Involves dividing the input values by the range (i.e., the maximum value minus the minimum value) of the input variable; this brings all input values into a common range such as 0-1. In full min-max form: x' = (x − min) / (max − min).

b

The bias or intercept of the model (usually a scalar)

w

The weight of each feature or input variable (usually a vector in multi-feature models)

Information Gain Formula

Gain(A) = [−∑ P(vₖ)log₂P(vₖ)] − ∑ₙ (pₙ+nₙ)/(p+n) · [−∑ P(vₙ)log₂P(vₙ)]

(the entropy of the whole set minus the weighted average entropy remaining after splitting on A; the second term alone is the "remainder")

Entropy formula

−∑ₖ P(vₖ) log₂ P(vₖ)
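
A small Python sketch tying the two formulas above together for a binary-labeled set with p positives and n negatives (the example counts are illustrative):

```python
# Entropy and information gain for a binary-labeled split.
from math import log2

def entropy(p, n):
    """H = -sum P(v_k) log2 P(v_k) over the two classes; 0 if one class is empty."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count:
            q = count / total
            h -= q * log2(q)
    return h

def information_gain(p, n, subsets):
    """Gain(A) = H(whole set) - sum over subsets of weighted subset entropy."""
    remainder = sum((pk + nk) / (p + n) * entropy(pk, nk) for pk, nk in subsets)
    return entropy(p, n) - remainder

# Example: 9 positives / 5 negatives, split by an attribute into two subsets.
print(information_gain(9, 5, [(6, 2), (3, 3)]))  # ~0.048
```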

Multiple-Variable Cost Function

𝐽(𝐰,𝑏)=(1/2𝑚) ∑𝑖=0:𝑚−1 (𝑓𝐰,𝑏(𝐱(𝑖))−𝑦(𝑖))^2
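
A minimal NumPy sketch of this cost function; the toy X, y, w, b values are illustrative:

```python
# J(w, b) = (1/2m) * sum_i (f_{w,b}(x^(i)) - y^(i))^2
import numpy as np

def cost(X, y, w, b):
    m = X.shape[0]
    predictions = X @ w + b          # f_{w,b}(x^(i)) for every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy example: 3 examples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.0, 8.0])
print(cost(X, y, w=np.array([1.0, 1.0]), b=1.0))  # 1.25
```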

Model prediction with multiple variables

𝑓𝐰,𝑏(𝐱)=𝑤0𝑥0 + 𝑤1𝑥1 + ... + 𝑤𝑛−1𝑥𝑛−1 + 𝑏

Model prediction of i-th example

𝑓𝐰,𝑏(𝐱^(𝑖))

Adaptive Learning Rates

- Learning rates are no longer fixed
- Can be made larger or smaller depending on: size of the gradient, speed of learning, value of particular weights, etc.

Logistic Regression

Applies the sigmoid function to the output of linear regression to produce a probability in (0, 1), creating a classification solution; the model is trained by minimizing a cost function (binary cross-entropy / log loss rather than RMSE).

Uses for ML

Fraud detection, web search results, real-time ads, pricing models, spam filtering etc.

Unsupervised Learning

Model is given unlabeled data with no notion of correct or wrong outputs; instead it tries to group similar data points based on features or derive some correlation between features

Underfitting

Model performs poorly on training data and is unable to capture relationships between attributes and outcomes.

Overfitting

Model performs well on training data, but poorly on test data because it 'memorized' the training data too much and can't generalize the relationships.

Supervised Learning

Model receives a set of inputs (features) along with the corresponding correct outputs (targets/labels), and learns by comparing its output with correct outputs

Encoding Inputs for NN

Neural networks require numerical inputs, so we must translate all inputs to vectors of fixed size

m

Number of examples in dataset

SGD 2.0

Mini-batch gradient descent: rather than a single point (SGD) or the full dataset, compute the gradient over a small batch B of examples and update W ← W − η · (1/|B|) ∑_{i∈B} ∂J_i(W)/∂W

Information Gain

The expected reduction in the entropy of the output variable on the whole set.

x^i, y^i

ith example from dataset

Sequence Modeling Applications

- 1:1 Binary Classification
- N:1 Sentiment Classification
- 1:N Image Captioning
- N:N Machine Translation

Confusion Matrix

- Displays false/true positives/negatives for each prediction outcome
- Useful for evaluating classification models
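
A quick scikit-learn sketch connecting the confusion matrix to the precision, recall, and F1 cards in this set (the label vectors are toy data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions (toy data)

print(confusion_matrix(y_true, y_pred))  # rows: actual, cols: predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of the two
```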

Decision Tree Algorithms

- Non-parametric supervised learning method used for classification and regression.
- The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features (a series of if-else decisions).
- In order to pick which feature to split on, we need a way of measuring how good the split is, which is where information gain and entropy come in.

Sigmoid function

- Used in logistic regression, and necessary for outputs which have discrete values (classification problems).
- The sigmoid function takes in any value and outputs a value in (0, 1), so we take our linear regression solution and run it through the sigmoid to get a discrete/classification solution.
- Also useful in normalization.

Convolutional Neural Network (CNN)

- A convolutional neural network (CNN) is a type of neural network commonly used for image and video recognition, natural language processing, and other tasks involving structured data.
- The network consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers.

Recurrent Neural Networks (RNN)

- A recurrent neural network (RNN) is a type of artificial neural network which uses sequential data or time-series data (sequence modeling).
- Used for ordinal or temporal problems, such as language translation, natural language processing (NLP), speech recognition, and image captioning (Siri, voice search, Google Translate).
- Distinguished by their memory, as they take information from prior inputs to influence the current input and output.

Stochastic RL Policy

- A stochastic policy can be represented as a family of conditional probability distributions, πₛ(A∣S), from the set of states, S, to the set of actions, A.
- For every state, you have a probability distribution over the actions to take from that state.

RNN Notes

- Apply a recurrence relation at every time step to process a sequence
- The same function and set of parameters are used at each step
- RNNs have a state that is updated at each step

Regularization 1: Dropout

- During training, randomly set some activations to 0
- Typically drop 50% of the activations in a layer
- Forces the network to not rely on any one node
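
A minimal NumPy sketch of this, using the common "inverted dropout" convention that rescales the surviving activations (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob=0.5):
    # Each unit is kept with probability (1 - drop_prob); dropped units -> 0.
    mask = rng.random(activations.shape) >= drop_prob
    # Scale the survivors so the expected activation is unchanged (inverted dropout).
    return activations * mask / (1.0 - drop_prob)

a = np.array([0.2, 1.5, 0.7, 2.1])
print(dropout(a))  # roughly half the entries zeroed at train time
```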

Computing Gradients: Backpropagation

- Facilitates weight adjustment of the nodes in a neural network
- The output of the neural network is compared to the desired output, and the error is calculated using a loss function. The error is then propagated back through the network, layer by layer, in a process known as backpropagation.

Why do we flip the kernel in a CNN?

- In a practical sense, flipping the kernel ensures that the convolution operation captures the appropriate relationships between features in the input data and the filters in the kernel.
- If the kernel were not flipped, the resulting feature maps would be mirror images of the desired output.

Policy NN in RL

- In reinforcement learning, a policy is a function that maps an observed state of the environment to an action to be taken by the agent.
- The goal of the agent is to learn a policy that maximizes some reward signal over time.
- In other words, a policy-based neural network learns to select the best action directly from the input state.

Value NN in RL

- In reinforcement learning, the value of a state-action pair is the expected total reward that the agent can obtain by following a given policy starting from that state and taking that action.
- A value-based neural network represents the value function as a neural network that takes the state of the environment and the action as inputs and outputs the expected value of that state-action pair.

Characteristics of CNN and FNN

- Outputs are independent of previous inputs
- Inputs are fixed size

Sequence Modeling

- Sequence modeling is the task of predicting the next element in the input (word, character, etc.)
- Unlike CNNs and FNNs, outputs are dependent on previous inputs and the input has dynamic size

Setting Learning Rate in Optimization of Loss

- Small learning rate converges slowly and gets stuck in false local minima
- Large learning rate overshoots and diverges
- Stable learning rate converges smoothly and avoids local minima

K-Means Clustering

The most common centroid-based clustering algorithm is K-means:
1.) The algorithm starts by randomly selecting K points in the data to serve as the initial centroids.
2.) Each data point is then assigned to the nearest centroid, based on some distance metric such as Euclidean distance.
3.) The centroids are then recalculated as the mean of the points assigned to each cluster, and the assignment step is repeated until convergence.
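
A minimal scikit-learn sketch of these three steps on toy 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],     # points near (1, 1)
              [8, 8], [8.5, 9], [9, 8]])      # points near (8.5, 8.3)

# n_init=10 restarts with different random centroids and keeps the best run.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids (mean of assigned points)
```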

Deterministic RL Policy

- A function of the form πd: S → A, that is, a function from the set of states of the environment, S, to the set of actions, A. The subscript d indicates this is a deterministic policy.
- Every state has a clearly defined action (state X will always take action A)

Centroid-based Clustering Algorithm

- A type of unsupervised machine learning algorithm that is used to group similar data points into clusters based on the distance between the points.
- The goal of these algorithms is to partition the data into K clusters, where K is a pre-defined parameter chosen by the user.

General Procedure to Produce a Model

1.) Data acquisition
2.) Data cleaning
3.) Split data into train/test sets & build the model
4.) Test the model and repeat training if necessary
5.) Deploy the model

Solution to Setting Learning Rate

1.) Try lots of different learning rates and see what works
2.) Design an adaptive learning rate that changes dynamically

ID3 Procedure

1.) Divide the data into two opposite groups
2.) Calculate the entropy and information gain of each attribute to find the most dominant one
3.) Set the dominant attribute as the decision node
4.) Repeat for the remaining attributes until a decisive decision is reached

RNN Characteristics/Requirements for Sequence Modeling

1.) Handle variable-length sequences
2.) Track long-term dependencies
3.) Maintain information about the order of the sequence
4.) Share the same parameters across the sequence

Entropy

A measure of the uncertainty of a random variable; the more information, the less entropy. A random variable with only 1 possible value has no uncertainty and thus its entropy is defined as zero.

Machine Learning

A method of data analysis that automates analytical model building, allowing computers to find hidden insights without explicit programming.

How are RNN weight updates facilitated (in contrast to FNN/CNN)?

Backpropagation Through Time (BPTT)

How are FNN/CNN networks trained/facilitated?

Backpropagation and Gradient Descent

XGBClassifier

Build many decision trees sequentially, where each subsequent tree places a higher weight on misclassified observations (boosting), in an attempt to fix the errors of previous trees
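
A minimal sketch using XGBoost's scikit-learn wrapper, on a synthetic dataset for illustration:

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is fit to correct the errors of the ones before it.
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```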

Random Forest (CART)

Build many decision trees, each on a different subset (bag) of training data, and then combine predictions into one final prediction
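
A minimal scikit-learn sketch on the same kind of synthetic data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 trees, each trained on a bootstrap sample (bag) of the training data;
# the forest prediction is the majority vote across trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))
print(forest.feature_importances_)  # estimated importance of each feature
```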

Bias-variance tradeoff

By continuing to add more complexity and flexibility to the model we start to overfit, and performance on test data decreases.

Entropy remaining after testing attribute A

Can be subtracted from the entropy of the output variable on the whole set to determine the information gained from attribute A.

Mean Squared Error Loss

Can be used to measure loss in regression models that output continuous real numbers

Disadvantages of Random Forest

- Computational complexity
- Difficult to interpret individual trees

Machine Learning is a combination of:

Computer Science, Math & Statistics, Domain Knowledge

Coefficient Interpretation

Continuous/Regression: directly interpret each coefficient w in relation to the outcome, e.g. the cost of a house increases by 12000 per unit increase in the number of rooms. Discrete/Classification: interpret each coefficient's relation with the probability of the target label/class, e.g. we can expect the odds of passing the test to decrease per unit increase in age.

pairplot()

Generates plots of each attribute against every other attribute (seaborn)

DataFrame.describe()

Gives statistical summary of data (mean, std, min, max, percentiles)
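
The pandas/seaborn inspection calls from these cards in one sketch; the CSV path is a hypothetical placeholder:

```python
import pandas as pd
import seaborn as sns

df = pd.read_csv("housing.csv")   # hypothetical dataset

print(df.columns)      # column names
print(df.head())       # first n rows, n defaults to 5
print(df.describe())   # mean, std, min, max, percentiles

sns.pairplot(df)       # each attribute plotted against every other attribute
```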

Gradient Descent

Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima. We can use gradient descent on our cost function to minimize the RMSE.
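
A minimal NumPy sketch of batch gradient descent on the cost J(w, b) defined earlier; the toy data follows y = 2x + 1:

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, steps=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        error = X @ w + b - y            # f_{w,b}(x^(i)) - y^(i) for all i
        w -= lr * (X.T @ error) / m      # step along -dJ/dw
        b -= lr * error.sum() / m        # step along -dJ/db
    return w, b

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])       # underlying line: y = 2x + 1
print(gradient_descent(X, y, lr=0.05, steps=5000))  # approx w=[2], b=1
```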

Advantages of Random Forest

- Handles large datasets with many features
- Robust to outliers and noisy data
- Can estimate the importance of each feature

Decision Tree Learning Algorithms

ID3, C4.5, CART, CHAID, MARS

Why do we need Regularization?

Improves generalization of our model on unseen data

[Entropy of output variable on X] - [Estimated entropy remaining after testing attribute A]

Information gain

Empirical Loss/Cost Function/Empirical Risk

Measures the total loss over our entire dataset

Benefit of SGD 2.0

Mini-batches lead to fast training: we can parallelize computation and achieve significant speed increases on GPUs

Reinforcement Learning

Model discovers through experience or trial-and-error which actions yield the greatest or most correct rewards

n

Number of features or inputs in each example

CNN Pooling Layer

Pooling layers, also known as downsampling layers, conduct dimensionality reduction, reducing the number of parameters in the input. Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter does not have any weights. Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array. There are two main types of pooling: max and average pooling.
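
A minimal NumPy sketch of 2x2 max pooling with stride 2 on a toy input:

```python
import numpy as np

def max_pool2x2(x):
    # No weights: each 2x2 receptive field is aggregated by max().
    h, w = x.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            out[i // 2, j // 2] = x[i:i + 2, j:j + 2].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 6, 8]])
print(max_pool2x2(x))  # [[6, 4], [7, 9]]
```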

Recall

Proportion of correctly predicted positive observations out of all actual positive observations: Recall = TP / (TP + FN). Measures the ability of a model to identify all positive observations in the test set.

Bagging

Randomly sample the training data (with replacement) for the build of each tree. There can be duplicates, and other values can go unused.

Gradient Descent Algorithms

SGD, Adam, Adadelta, Adagrad, RMSProp

Loss Optimization Formula

W* = argmin_W (1/n) ∑ᵢ L(f(x⁽ⁱ⁾; W), y⁽ⁱ⁾) — find the weights W that minimize the empirical loss over the dataset

Euclidean Distance Formula

d(𝐱, 𝐲) = √( ∑ᵢ (xᵢ − yᵢ)² )

Stochastic Gradient Descent (SGD)

1.) Initialize weights randomly
2.) Loop until convergence:
3.) Pick a single data point i
4.) Compute the gradient ∂Jᵢ(W)/∂W
5.) Update the weights: W ← W − η · ∂Jᵢ(W)/∂W
6.) Return the weights

𝑔(𝑧)=1/(1+𝑒^(−𝑧))

Sigmoid function
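
The same function in NumPy, with the 0.5 threshold used for classification:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), squashing any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))         # ~[0.007, 0.5, 0.993]
print(sigmoid(z) >= 0.5)  # threshold at 0.5 to get a class label
```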

Regularization 2: Early Stopping

Stop training before we overfit.

How do we make a regressive decision w/ decision tree?

Take the average of the predictions of each tree

y

Targets or output data

CNN Convolutional layer

The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires a few components: input data, a filter, and a feature map.

Let's assume that the input is a color image, which is made up of a matrix of pixels in 3D. This means the input has three dimensions (height, width, and depth), which correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter, which moves across the receptive fields of the image, checking if the feature is present. This process is known as a convolution.

The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive field. The filter is applied to an area of the image, and a dot product is calculated between the input pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the entire image.

The final output from the series of dot products is known as a feature map, activation map, or convolved feature. After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity into the model.
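
A minimal NumPy sketch of the sliding dot product described above, followed by ReLU. Strictly speaking this computes cross-correlation; a true convolution would flip the kernel first (see the kernel-flip card). The 5x5 image and edge filter are toy values:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the filter over the image, taking a dot product at each position.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # dot product with the filter
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy intensity ramp
kernel = np.array([[-1., 0., 1.]] * 3)             # simple edge filter
feature_map = np.maximum(conv2d(image, kernel), 0) # ReLU nonlinearity
print(feature_map)  # every position yields 6 for this ramp image
```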

Quantifying Loss

The loss of our neural network measures the cost incurred from incorrect predictions

RNN: BPTT

The principles of BPTT are the same as traditional backpropagation: the model trains itself by propagating errors from its output layer back to its input layer, and these calculations let us adjust and fit the parameters of the model appropriately. BPTT differs from the traditional approach in that it sums errors at each time step, whereas feedforward networks do not need to sum errors because they do not share parameters across layers.

CNN Fully-Connected Layer

This layer performs the task of classification based on the features extracted through the previous layers and their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers usually leverage a softmax activation function to classify inputs appropriately, producing a probability from 0 to 1.

How do we make a classification decision w/ decision tree?

Use each tree in the forest to get a prediction; the label with the most votes is the predicted class

Binary Cross Entropy Loss

Used to measure the loss of models that output a probability between 0 and 1: L = −(1/n) ∑ᵢ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]

Iterative Dichotomiser 3 (ID3)

Uses information theory or entropy to split on an attribute that gives the highest information gain. Top-down, greedy search

Problem of Long-Term Dependencies

Vanishing gradients

Loss Optimization

We need to find the NN weights that achieve the lowest loss (use gradient descent)

DataFrame.head()

Returns the first n rows of the dataset; n defaults to 5

Does XGBClassifier use a variant of gradient descent?

Yes, to minimize error in each subsequent tree

Mean/Z-Score Normalization

Z-score is a variation of scaling that represents the number of standard deviations away from the mean. It's useful when your ML technique requires normalized data (Gaussian Naive Bayes). x' = (x - μ) / σ
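
A NumPy sketch of both scalings from these cards, on toy values:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

min_max = (x - x.min()) / (x.max() - x.min())  # all values land in [0, 1]
z_score = (x - x.mean()) / x.std()             # standard deviations from mean

print(min_max)
print(z_score)
```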

Non-parametric statistics

involves methods and techniques that do not make any assumptions about the underlying distribution of the data being analyzed

K Nearest Neighbor Algorithm (KNN)

Non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point.

Precision

Proportion of correctly predicted positive observations out of all predicted positive observations: Precision = TP / (TP + FP). Measures the accuracy of the positive predictions made by the model.

Least Squares Linear Regression

statistical method used to determine a line of best fit for some data points by minimizing the sum of the squares of the residuals. We use linear regression to minimize RMSE.

