DVGC27 ML


non-response bias/participation bias

Users from certain groups opt out of surveys at different rates than users from other groups.

Selection bias

The selection of individuals, groups, or data for analysis in such a way that proper randomization is not achieved, so that the sample obtained is not representative of the population intended to be analyzed.

What's the difference between F.relu and nn.ReLU?

nn.ReLU is a module (a layer object you can place inside nn.Sequential), whereas F.relu is the same operation exposed as a plain function, typically called inside forward(). Neither has weights or biases; ReLU itself has no parameters.
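
A minimal sketch of the two forms (standard PyTorch; the example tensor is made up):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.tensor([-1.0, 0.0, 2.0])

    relu_layer = nn.ReLU()   # module form, usable inside nn.Sequential
    print(relu_layer(x))     # tensor([0., 0., 2.])
    print(F.relu(x))         # functional form, same result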

False positive/False negative & True positive/True negative

FP - an item classified as positive that is negative in reality. FN - an item classified as negative that is positive in reality. TP - an item classified as positive that is positive in reality. TN - an item classified as negative that is negative in reality.

KNN classifier

K nearest neighbours. An object is classified by a plurality vote of its neighbours.
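
A minimal sketch using scikit-learn (assumed available; the toy points are made up):

    from sklearn.neighbors import KNeighborsClassifier

    X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    y = [0, 0, 0, 1, 1, 1]

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1], by neighbour vote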

Supervised learning

Learning a function that maps an input to an output based on example input-output pairs. The supervised learning algorithm produces an inferred function from the training data which can be used to map new instances of data. The data set contains both the features and the correct labels for the items that we are trying to predict.

Linear vs nonlinear activation function

Linear - backpropagation gives no useful information, since the derivative is a constant, and the network effectively has only one layer: a stack of linear layers collapses into a single linear function. Nonlinear - allows backpropagation to work and allows stacking of layers to build complex neural networks.

Why does SGD use mini-batches?

Mini-batches are used because calculating the gradient over the whole dataset would take too long, while using only a single training item gives a noisy, imprecise gradient estimate. Since SGD is computationally demanding, mini-batches strike a balance between the two and also make good use of GPU parallelism.

Gradient (Gradient descent)

Gradient descent is an optimization algorithm used to find the values of parameters of a function that minimizes a cost function (loss).
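
A minimal sketch for f(w) = (w - 3)^2, whose minimum is at w = 3 (the learning rate and step count are arbitrary):

    w = 0.0
    lr = 0.1
    for _ in range(100):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # step against the gradient
    print(w)                 # close to 3.0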

Hidden layer

Hidden layer(s) are all layers that are not input or output layers.

Backwards pass

"backward pass" refers to process of counting changes in weights (de facto learning), using gradient descent algorithm (or similar). Computation is made from last layer, backward to the first layer.

What reduces (real) performance? (Mention 3 things)

1. Not enough data 2. Noise in the data 3. Biased data 4. Wrong or bad labels

Mention 4 biases

1. automation bias 2. confirmation bias 3. group attribution bias 4. implicit bias 5. in-group bias 6. out-group homogeneity bias 7. reporting bias 8. selection bias

Forward pass

A calculation process in the neural network where it traverses through all neurons from first to last layer. A loss function is calculated from the output values.

Exploding gradients

If the derivatives that get multiplied together in a neural network are too large, the gradient grows too fast as we propagate through the model until it eventually explodes. If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. In this case you get exploding gradients: gradients that are too large to converge. Batch normalization can help prevent exploding gradients, as can lowering the learning rate.

Confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. (Shows TP/FP/TN/FN)

How is a color image represented on a computer?

In a color image, each pixel is represented as a 3-tuple of values, one for each of the basic colors red, green, and blue, each between 0 and 255. E.g. (37, 102, 230).

How is a grayscale image represented on a computer?

In grayscale, each pixel is a number from 0 to 255 representing the intensity of the light, where 0 is the darkest (black) and 255 the brightest (white).

What is the difference between a loss function and a metric?

A loss function exists to drive automated learning (it must respond smoothly to small changes in the weights), while a metric exists to help human understanding of the model's performance.

Convolutional network

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers: convolutional layers, pooling layers, and dense layers.

Loss

The prediction error of a neural net.

Weights

A weight represents the strength of the connection between units.

Why can't we use accuracy as a loss function?

Accuracy only changes when a prediction flips from one class to another, so it is a step function of the weights: its gradient is zero almost everywhere. Small improvements from optimizing the weights are therefore not reflected in the accuracy score, and the optimizer gets no signal to follow.

Hyper-parameter tuning

Altering the parameters of a classifier to improve the algorithm. Only the validation set may be used for this (never the test set).

Semi-supervised learning

An approach to machine learning that combines a small amount of labeled data with a large amount of unlabeled data during training. It combines supervised and unsupervised learning to help label the unlabeled instances.

Inference

Applying the trained model to unlabeled examples

Deep neural network

At least 3 hidden layers + 1 input and 1 output layer

Automation bias

Automation bias is the propensity for humans to favor suggestions from automated decision-making systems and to ignore contradictory information made without automation, even if it is correct.

Bagging/bootstrapping

Bagging (bootstrap aggregating) is a process where, from a (training) set X, one repeatedly draws random samples of size n with replacement, trains a model on each sample, and aggregates their predictions.

Group attribution bias

Believing that an individual's characteristics always follow the beliefs of a group they belong to, or that a group's decisions reflect the feelings of all of its members.

Classification Trees vs Regression Trees.

Classification trees are used to classify data, while regression trees can predict arbitrary numbers, for example house prices.

One-Hot Encoding

Converting string labels into binary vector representations so that certain algorithms can process them.
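
A minimal sketch (the labels are made up):

    labels = ["bird", "cat", "dog"]
    one_hot = {c: [1 if i == j else 0 for j in range(len(labels))]
               for i, c in enumerate(labels)}
    print(one_hot["cat"])   # [0, 1, 0]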

Sampling bias

Data is not collected randomly from the target group

Epoch

One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE.

Principal component method

Principal component analysis (PCA) is one of the main methods used for unsupervised learning. It reduces the dimensionality of the data by projecting it onto the directions of greatest variance.

Weak learner

One which performs relatively poorly: its accuracy is above chance, but just barely.

Reporting bias

Only a selection of results are included in any analysis, which typically covers only a fraction of relevant evidence.

Filter Parameters

Padding - how big is my "frame". Kernel - how big is my filter. Stride - how much do I move.

Universal approximation theorem

Put simply, it states that a neural network with one hidden layer containing a sufficient but finite number of neurons can approximate any continuous function to a reasonable accuracy, under certain conditions for activation functions (namely, that they must be sigmoid-like).

ReLU

Rectified Linear Unit. An activation function defined as the positive part of its argument: max(0, x). A unit can "die" if its input stays below 0. In practice it works better than sigmoid.

Implicit bias

Refers to the attitudes or stereotypes that affect our understanding, actions, and decisions in an unconscious manner.

ResNet

Residual neural network. ResNet makes it possible to train networks of up to hundreds or even thousands of layers that still achieve compelling performance.

Softmax

Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
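
A minimal numpy sketch (the logits are made up):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    print(probs, probs.sum())            # decimal probabilities summing to 1.0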

SGD

Stochastic gradient descent. An iterative method for optimizing an objective function with suitable smoothness properties. It is just a way to train a model, not a model itself.

Activation function

The activation function of a node defines the output of that node given an input or set of inputs.

Max pooling

The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing for assumptions to be made about features contained in the sub-regions binned. Used to reduce size and focus on the most important features

What does the @ operator do in Python?

The symbol is used for class, function, and method decorators when it appears at the beginning of a line, e.g. @classmethod and @property. Otherwise it is the matrix multiplication operator.
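
A minimal sketch of both uses:

    import numpy as np

    class A:
        @staticmethod        # decorator: @ at the beginning of a line
        def hello():
            return "hi"

    a = np.array([[1, 2], [3, 4]])
    b = np.array([[5, 6], [7, 8]])
    print(a @ b)             # matrix multiplication: [[19 22] [43 50]]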

Unsupervised learning

Unsupervised learning is learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. The data set only contains the features and not the labels that we are trying to predict.

Cluster analysis method

Used in unsupervised learning to group, or segment, datasets with shared attributes in order to discover relationships in the data algorithmically.

Augmentation

Used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data

Why do we have to zero gradients?

We need to set the gradients to zero before starting to do backpropragation because PyTorch accumulates the gradients on subsequent backward passes.
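
A minimal sketch of the standard PyTorch step order (the data is made up):

    import torch

    w = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)

    for x, y in [(2.0, 5.0), (3.0, 7.0)]:
        loss = (w * x - y) ** 2
        opt.zero_grad()   # clear gradients left over from the previous step
        loss.backward()   # accumulate fresh gradients into w.grad
        opt.step()        # update w using those gradients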

Broadcasting

When applying operations to two tensors of different rank, the smaller tensor is automatically expanded to have the same size as the one with the larger rank, making it possible to operate on differently sized tensors.
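
A minimal PyTorch sketch: the rank-1 row is expanded to match the rank-2 matrix:

    import torch

    m = torch.tensor([[1., 2., 3.],
                      [4., 5., 6.]])     # shape (2, 3)
    row = torch.tensor([10., 20., 30.])  # shape (3,)
    print(m + row)                       # row is broadcast over both rows of m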

Tree Classification

Such a tree is built through a process known as binary recursive partitioning. This is an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches. At the end of the learning process, a decision tree covering the training set is returned.

How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop?

Using numpy arrays.
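
A minimal sketch: the vectorized expression runs in optimized C instead of a Python loop:

    import numpy as np

    xs = np.arange(1_000_000, dtype=np.float64)
    squares = xs ** 2   # one vectorized operation over a million numbers
    # equivalent but far slower: np.array([x ** 2 for x in xs])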

in-group bias

a pattern of favoring members of one's in-group over out-group members

out-group homogeneity bias

the perception of out-group members as more similar to one another than are in-group members

Explain 3 metrics

F1-score - the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall).
Precision - the fraction of instances predicted positive that are actually positive: TP / (TP + FP). This tells you, when you predict something positive, how many times it was actually positive.
Recall - the fraction of actual positives that you predicted positive: TP / (TP + FN). This tells you, out of the actual positive data, how many times you predicted correctly.
Accuracy - (TP + TN) / (P + N), i.e. (number of correctly labeled instances) / (total number of instances); the proportion of right predictions.
Error rate - 1 - Accuracy; the proportion of wrong predictions.
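
A minimal sketch computing these metrics from made-up confusion-matrix counts:

    tp, fp, fn, tn = 40, 10, 5, 45

    precision = tp / (tp + fp)                   # 0.8
    recall = tp / (tp + fn)                      # ~0.889
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.85
    error_rate = 1 - accuracy                    # 0.15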

RMSE

(L2 norm) Root mean square error: take the mean of the squared differences, then take the square root.
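
A minimal PyTorch sketch (the values are made up):

    import torch

    preds = torch.tensor([2.0, 4.0, 6.0])
    targets = torch.tensor([1.0, 4.0, 8.0])
    rmse = ((preds - targets) ** 2).mean().sqrt()
    print(rmse)   # sqrt((1 + 0 + 4) / 3) ≈ 1.29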

Sigmoid function, what is special about its shape?

The function takes any input value and outputs a value between 0 and 1, which makes it easier for, e.g., SGD to find gradients. The function is always increasing towards 1. The vanishing gradient problem is particularly problematic with sigmoid activation functions: when the sigmoid's value is either very high or very low, its derivative becomes very small, i.e. << 1. This causes vanishing gradients and poor learning for deep networks, and can occur when the weights of our network are initialized poorly, with too-large negative and positive values.
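
A minimal sketch; note how the derivative shrinks toward 0 for large positive or negative inputs (the vanishing-gradient region):

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    for x in [-10.0, 0.0, 10.0]:
        s = sigmoid(x)
        print(x, s, s * (1 - s))   # the derivative of sigmoid is s * (1 - s)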

What information do we have to pass to Learner?

1. Dataloaders 2. A model 3. An optimization function 4. A loss function 5. Metrics
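
A minimal sketch, assuming the fastai API; dls stands for a DataLoaders object built elsewhere, and the model, optimizer, loss, and metric choices here are just examples:

    from fastai.vision.all import *

    learn = Learner(dls, resnet18(), opt_func=SGD,
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)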

What are the seven steps in SGD for machine learning?

1. Initialize the weights. 2. Use the weights to make predictions. 3. Calculate the loss based on the predictions. 4. Calculate the gradient, which measures the impact of a small change in each weight on the loss. 5. Change all the weights according to step 4. 6. Go back to step 2 and repeat. 7. Stop when the model is good enough or you run out of time.

Mention the 4 parts of a perceptron.

1. Input values or One input layer 2. Weights and Bias 3. Net sum 4. Activation Function

Baseline classifier

A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. Any other algorithm should be compared against it.

What are the "bias" parameters in a neural network?

A bias value allows you to shift the activation function to the left or right, which may be critical for successful learning. A simple way to understand the bias: it plays the same role as the constant b in a linear function y = ax + b.

Loss function

A loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.

What is a "rank-3 tensor"?

A tensor with three dimensions (axes): a stack of matrices, like a rectangular box of numbers.

Random Forest classification

A random forest classifier computes several decision trees and takes the majority vote of their leaf nodes to decide a final classification. In a random forest regression you take the mean of the target values in the terminal leaves. RF has built-in feature selection.

How is a random forest classifier built?

A random forest is a group of decision trees whose results are combined (majority vote, or the mean for regression) to compute a final result. It is harder to interpret than a single tree.
1. Draw a random bootstrap sample from the training set.
2. Fit a decision tree to the sample just drawn, except: a. at each leaf node select only a subset of the features (without replacement); b. find the best (feature, split_value) given this subset of features.
3. Repeat steps 1 and 2 N times to grow a random forest of N estimators.
4. When predicting the class of an item, take the majority vote of all the trees in the forest.

Neural networks

A series of algorithms that aim to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Consists of input, hidden, and output layers.

Perceptron

A single layer neural network also known as a linear classifier. Perceptron is usually used to classify data into two parts. Used in supervised learning.

Binary step activation function

A threshold-based activation function: if the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it sends nothing.

Why can't we always use a high learning rate?

A too-high learning rate can overshoot the minimum/optimal point of the loss landscape, preventing a good loss from ever being reached. It can eventually lead to exploding gradients.

VGG16

A top-scoring convolutional neural network model with 16 weight layers.

How is a decision tree classifier built?

A tree is built by splitting the source set, constituting the root node of the tree, into subsets, which constitute the successor children. The splitting is based on a set of splitting rules over the classification features. Prone to overfitting.
1. Loop over each leaf: a. loop over every feature and find the best split value (i.e., the one with the largest information gain) for that feature; b. record the feature, split value, and information gain.
2. Pick the split with the largest information gain and split that leaf into two child leaves accordingly. Label the parent leaf as an inner node; it won't be considered again since it is no longer a leaf.
3. Repeat steps 1 and 2 until a termination condition (for example tree depth) is reached, or all the leaves in the tree are pure or cannot be split any more.

Confirmation bias

The tendency to lean towards information which confirms your own conception of things.

L1 norm

A way of calculating the absolute value of differences, also called the mean absolute difference.

What is a list comprehension?

An in-line expression for building a list: letters = [letter for letter in 'human']

Neural network iteration

A backward and a forward pass together make up one "iteration".

Batch normalization

Batch Normalization is a technique that normalizes layer inputs per mini-batch. It speeds up training, allows the use of higher learning rates, and can act as a regularizer, reducing overfitting.

How do you get the rank from a shape?

By taking the length of the shape.
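
A minimal PyTorch sketch:

    import torch

    t = torch.zeros(2, 3, 4)   # a rank-3 tensor
    print(t.shape)             # torch.Size([2, 3, 4])
    print(len(t.shape))        # 3, the rank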

Convolution

Convolution is typically the first layer used to extract features from an input image. It is a mathematical operation that takes two inputs, such as an image matrix and a filter (kernel).

Cross-validation

Cross-validation splits the entire data set into K subsets and trains the model K times, each time using a different subset as the validation set and the rest for training. This estimates how accurately a predictive model will perform in practice.

Mini-batch neural network

During one "iteration" (one backward and one forward pass), you usually pass a subset of the data set, which is called "mini-batch" (if you pass all data at once, it is called "batch")

Convolution filters

Each filter in a convolution layer produces one and only one output channel. Each filter is a collection of kernels. In convolutional networks, multiple filters slide across the image, each mapping it to an output channel and learning a different aspect of the input image.

In which kind of situations would you emphasize recall over precision as a performance metric of your ml model?

High recall (this tells you, out of the actual positive data, how many times you predicted correctly) is preferred in e.g. classifying whether someone is sick or not. It's better to say that someone is sick and investigate it than to say that they are healthy when they're not. For example, with covid-19 it's better to say that someone has it, so they stay at home and won't spread it, than to say that they are healthy while they meet a lot of people and spread it.

Vanishing gradients

If the derivatives that get multiplied in a neural network are small, the gradient can eventually become vanishingly small, effectively preventing the weights from changing their values. When the gradients vanish toward 0 for the lower layers, these layers train very slowly, or not at all. The ReLU activation function can help prevent vanishing gradients.

The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

To compute and model more complex functions we add more layers. Linear layers can't usefully be stacked directly on top of each other, since they collapse into a single linear layer, so we add nonlinear layers in between. In practice, deeper models with more nonlinearities perform better than a single very wide hidden layer.

Why would you perform feature selection and what does it mean?

In order to optimize algorithms and reduce time complexity. Feature selection is the process of selecting the best or most relevant features from a data set, which are then used to train a classifier.

Reinforcement learning

Loop: 1. Agent takes action (interacts with environment) 2. A reward function is calculated 3. New state + reward is given to Agent

Dead ReLU units

Once the weighted sum for a ReLU unit falls below 0, the ReLU unit can get stuck. It outputs 0 activation, contributing nothing to the network's output, and gradients can no longer flow through it during backpropagation. With a source of gradients cut off, the input to the ReLU may not ever change enough to bring the weighted sum back above 0. Lowering the learning rate can help keep ReLU units from dying.

Overfitting/Underfitting

Overfitting - The production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. Underfitting - The case where a model has "not learned enough" from the training data, resulting in low generalization and unreliable predictions.

In which kind of situations would you emphasize precision over recall as a performance metric of your ml model?

Precision (this tells you, when you predict something positive, how many times it was actually positive) matters when false positives are costly. Email spam filtering, for example: real mail classified as spam (a false positive) is a worse outcome than some spam reaching the inbox, so we want high precision.

Why is random forest less prone to overfit as compared to a single decision tree Classifier?

Random forest averages the results of several trees trained on different bootstrap samples, in contrast to a single decision tree, which performs only one computation to classify. Averaging many decorrelated trees reduces variance, so the forest generalizes better.

Explain Test/Validation/Train set

Test set - Final set to compute accuracy score from. Validation set - Set to use during hyper parameter tuning and testing. The validation set is also used to find when we are overfitting to the training set. Train set - Set to train the algorithm with.

Backpropagation

The Backpropagation algorithm comprises a forward and backward pass through the network. The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs. It works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer.

Explain how the "pixel similarity" approach to classifying digits works.

The pixel similarity approach calculates a grayscale average for every pixel in the "number X" pictures and the "number Y" pictures respectively. When doing the classification, one approach is to calculate the absolute difference of each pixel compared to the average picture of each number. The picture is then classified as the one with the lowest value.

coverage bias

The population represented in the dataset does not match the population that the machine learning model is making predictions about.

Tensor rank

The rank (or order) of a tensor is defined by the number of directions (and hence the dimensionality of the array) required to describe it.

What is the difference between tensor rank and shape?

The rank of a tensor is the number of dimensions it has. The shape of a tensor is the detailed number of components in each dimension.

What does view do in PyTorch?

View in PyTorch is a method that changes the shape of a tensor without changing the content of the tensor.
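
A minimal sketch: the same 6 elements, reinterpreted with a new shape:

    import torch

    t = torch.arange(6)    # shape (6,)
    print(t.view(2, 3))    # shape (2, 3), same underlying data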

Dropout Regularization

Works by randomly "dropping out" unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization. "Dropout" in machine learning refers to the process of randomly ignoring certain nodes in a layer during training. It prevents overfitting by ensuring that no units are codependent.

How does a perceptron work?

a. All the inputs x are multiplied with their weights w. b. Add all the multiplied values; this is the weighted sum. c. Apply the weighted sum to the correct activation function. d. The activation function maps the input to the required range, like (0, 1) or (-1, 1).
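
A minimal sketch of steps a-d (the inputs, weights, and bias are made up):

    inputs = [0.5, -1.0, 2.0]
    weights = [0.4, 0.7, -0.2]
    bias = 0.1

    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    output = 1 if weighted_sum >= 0 else 0   # binary step activation
    print(weighted_sum, output)              # -0.8 -> 0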

Emergence

Properties or behaviors which emerge only when the parts interact within a wider whole.

