Intro to Neural Networks
to get a better model, what CE are you looking for?
a lower CE
how is a perceptron similar to a brain nerve cell?
dendrites are the inputs, the axon is the output; it fires (outputs a signal) if the electrical charge is strong enough
what 2 conditions should be met in order to apply gradient descent? (Check all that apply.)
error function should be: - differentiable - continuous
calculate cross entropy in python
import numpy as np

def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
how can you go from 'and' to 'or' perceptron?
increase weights or decrease magnitude of bias
in a nutshell, what does backpropagation consist of:
- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Using this to update the weights, and get a better model.
- Continuing until we have a model that is good.
The sigmoid function is defined as sigmoid(x) = 1/(1 + e^(-x)). If the score is defined by score = 4x_1 + 5x_2 − 9, then which of the following points has exactly a 50% probability of being blue or red? (Choose all that are correct.) (1,1) (2,4) (5,-5) (-4,5)
(1,1) and (-4,5)
what is the gradient descent algorithm?
1. start with random weights: w_1, ..., w_n, b
2. for every point (x_1, ..., x_n):
   - for i = 1 ... n:
     - update w_i' ← w_i − α(ŷ − y)x_i
   - update b' ← b − α(ŷ − y)
3. repeat until the error is small
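A minimal numpy sketch of these update steps (the toy data, learning rate, and epoch count are made-up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_descent_step(X, y, w, b, alpha=0.1):
    # one pass over the data: w_i <- w_i - alpha * (y_hat - y) * x_i
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(np.dot(w, x_i) + b)
        w = w - alpha * (y_hat - y_i) * x_i
        b = b - alpha * (y_hat - y_i)
    return w, b

# toy data: two positive points, one negative
X = np.array([[1.0, 1.0], [2.0, 4.0], [-1.0, -2.0]])
y = np.array([1, 1, 0])
w, b = np.zeros(2), 0.0
for _ in range(100):  # "repeat until the error is small"
    w, b = gradient_descent_step(X, y, w, b)
```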
Now that you know the equation for the line (2x1 + x2 - 18=0), and similarly the "score" (2x1 + x2 - 18), what is the score of the student who got 7 in the test and 6 for grades?
2
what is the error formula, used in gradient descent?
E = −(1/m) ∑_{i=1}^{m} [y_i ln(ŷ_i) + (1 − y_i) ln(1 − ŷ_i)]
how does the perceptron algorithm work?
Recall that the perceptron step works as follows. For a point with coordinates (p,q), label y, and prediction given by the equation ŷ = step(w_1x_1 + w_2x_2 + b):
- If the point is correctly classified, do nothing.
- If the point is classified positive, but it has a negative label, subtract αp, αq, and α from w_1, w_2, and b respectively.
- If the point is classified negative, but it has a positive label, add αp, αq, and α to w_1, w_2, and b respectively.
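The three cases above, sketched in Python (the points and learning rate are arbitrary examples):

```python
def perceptron_step(points, labels, w1, w2, b, alpha=0.01):
    # one pass of the perceptron trick over 2-D points (p, q)
    for (p, q), label in zip(points, labels):
        pred = 1 if w1 * p + w2 * q + b >= 0 else 0
        if pred == 1 and label == 0:
            # classified positive but labeled negative: subtract
            w1, w2, b = w1 - alpha * p, w2 - alpha * q, b - alpha
        elif pred == 0 and label == 1:
            # classified negative but labeled positive: add
            w1, w2, b = w1 + alpha * p, w2 + alpha * q, b + alpha
        # correctly classified: do nothing
    return w1, w2, b

w1, w2, b = 0.0, 0.0, 0.0
for _ in range(10):
    w1, w2, b = perceptron_step([(1, 1), (0, 0)], [1, 0], w1, w2, b)
```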
what is the basic flow for logistic regression?
- Take your data
- Pick a random model
- Calculate the error
- Minimize the error, and obtain a better model
after all the math is done, what does the gradient turn out to be?
The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction.
Given the table in the video above, what would the dimensions be for input features (x), the weights (W), and the bias (b) to satisfy (Wx + b)?
W: (1×n), x: (n×1), b: (1×1)
a high cross-entropy indicates?
a worse model
how is the bias calculated in a perceptron?
as an extra input node with constant value 1, multiplied by its own weight, then added into the perceptron score
what is the formula for multi-class cross entropy?
CE = −∑_{i=1}^{n} ∑_{j=1}^{m} y_ij ln(p_ij), where n = num doors, m = num animals, and p_ij is the probability of animal i being behind door j
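In numpy this is one sum over one-hot labels Y and predicted probabilities P (the example values are made up):

```python
import numpy as np

def multiclass_cross_entropy(Y, P):
    # CE = -sum_ij y_ij * ln(p_ij); only the true class's log-probability survives
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P))

Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                     # one-hot labels
P = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]   # predicted probabilities
ce = multiclass_cross_entropy(Y, P)                       # -(ln 0.7 + ln 0.4 + ln 0.8)
```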
If a point is well classified, will we get a small or large gradient? And if it's poorly classified, will we get a large or small gradient?
the closer the label is to the prediction, the smaller the gradient; the farther apart they are, the larger the gradient. a well-classified point gives a small gradient, and a poorly classified point gives quite a large one.
what does a network look like for XOR?
a combination of perceptrons: feed the inputs into an OR perceptron and a NAND perceptron (AND followed by NOT), then AND the two results together
how is cross-entropy related to the total probability of an outcome?
cross-entropy is inversely proportional to the total probability of an outcome.
let's define the combination of two new perceptrons as w1*0.4 + w2*0.6 + b. Which of the following values for the weights and the bias would result in a final probability of 0.88 for the point?
don't forget to apply the sigmoid: σ(x) = e^x/(e^x + 1) = 1/(1 + e^(-x))
w1 = 3, w2 = 5, b = −2.2
3·0.4 + 5·0.6 − 2.2 = 2
σ(2) = e^2/(e^2 + 1) ≈ 0.88
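Checking the arithmetic in Python:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# w1=3, w2=5, b=-2.2 applied to the combined outputs 0.4 and 0.6
score = 3 * 0.4 + 5 * 0.6 - 2.2
prob = sigmoid(score)
```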
what function turns every number into a positive number?
exp
how are events and probabilities related to cross-entropy?
given events and their probabilities: how likely is it that those events happened, based on the probabilities? if it's likely, the cross-entropy is small; if unlikely, the cross-entropy is large
what is the cross-entropy for a good and a bad model?
good = low cross-entropy; bad = high cross-entropy
which of the following is true:
a) higher CE => lower probability of event
b) higher CE => higher probability of event
c) no relation between CE and probability of event
a) higher CE => lower probability of event; cross-entropy is inversely proportional to the total probability of an outcome
how does learning rate work?
it scales the update: the weights and bias are adjusted by only a fraction α of the gradient, so the line moves by a smaller amount at each step
how does the step function work for a perceptron?
it returns 1 if the node's value (the score) is ≥ 0 and 0 otherwise (this is an activation function)
how is the activation function applied to the equation for line?
it's composed with it: the score from the line equation is passed through the activation function, ŷ = σ(Wx + b)
what function turns products into sums?
log: log(ab) = log(a) + log(b)
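A quick numeric check of that identity (the values are arbitrary):

```python
import math

a, b = 0.3, 0.5
lhs = math.log(a * b)              # log of the product
rhs = math.log(a) + math.log(b)    # sum of the logs
```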
is a lower cross entropy better or worse model?
lower CE is better model
what do you do to go from a bad model to get to a good one?
minimize the cross-entropy
how are the error function and probability related?
minimizing the error function maximizes the probability
what is a deep neural network?
an MLP (multilayer perceptron) with many hidden layers, giving a highly non-linear boundary
what's maximum likelihood?
pick the model that gives the existing labels the highest probability.
how do you go from discrete to continuous functions?
replace step function with a continuous function like sigmoid
how do you construct an 'and' perceptron?
set the weights and bias so the plotted line separates the points correctly, e.g. w1 = 0.25, w2 = 0.25, b = −0.5 (with step(s) = 1 for s ≥ 0, only the point (1,1) scores ≥ 0)
what happens in the feedforward step?
it calculates the model's prediction: the probability ŷ of the point being positive, which reflects how far the point is from the line
what function do you use if you have 3 or more classes?
softmax
what is the formula used in minimizing the error function? what are the steps?
start with random weights, then apply σ to each point's score; since each point contributes a term to the cross-entropy, correctly classified points give a smaller error, and gradient descent on this error updates the weights
what is cross-entropy?
the sum of the negative logarithms of the probabilities of the events
how is the gradient descent algorithm different from the perceptron algorithm?
they are essentially the same, but:
- GD: ŷ can take continuous values; PA: ŷ is only 0 or 1
- once a point is classified correctly: PA stops moving the line, while GD keeps pushing the line away
what are the fundamental calculations for softmax classification?
use exp (since no negative numbers), and divide by sum of exponents, giving a probability
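A minimal numpy softmax following that recipe (the input scores are arbitrary):

```python
import numpy as np

def softmax(scores):
    # exponentiate (no negative numbers), then divide by the sum of exponents
    exps = np.exp(scores)
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.0]))
```

Real implementations usually subtract max(scores) before exponentiating for numerical stability; it doesn't change the result.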
how to get multi-class classification?
use multiple outputs and softmax
how would you input different values for different categories (such as duck,beaver,dog)?
use one-hot encoding
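A tiny sketch of one-hot encoding for those three categories:

```python
animals = ["duck", "beaver", "dog"]

def one_hot(animal):
    # one slot per category: 1 for the matching class, 0 everywhere else
    return [1 if animal == a else 0 for a in animals]
```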
since the multiplication of many, many probabilities (each in [0,1]) is a very small number, how do you calculate the probability in a better way?
use the log, and since the log of a probability is negative, use −log (cross-entropy)
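A small demonstration of why the log helps (the probabilities are made up):

```python
import math

probs = [0.9] * 1000
product = 1.0
for p in probs:
    product *= p          # shrinks toward 0; with enough factors it underflows
neg_log_sum = sum(-math.log(p) for p in probs)  # stays a comfortable ~105.4
```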
how does backprop work with 2 inputs and one hidden layer?
it uses the error of each point to tell the model to move the line away from or toward the point; the weights are then adjusted accordingly
how is the chain rule used?
using the chain rule, the derivative in backprop is just a product of a bunch of partial derivatives
how do you calculate the gradient of E at point x?
using partial derivatives: ∇E = (∂E/∂w_1, ⋯, ∂E/∂w_n, ∂E/∂b)
how do you set the weights for a 'not' perceptron?
w1=-.33 w2=-.99 bias = .66
what is the calculation for a perceptron with inputs x_i, weights w_i, and bias b?
Wx + b = ∑(w_i · x_i) + b
when would the sigmoid function yield 50%?
when the score (the line equation evaluated at the point) is 0
what's the formula for feedforward using matrices?
ŷ = σ(W^(2) · σ(W^(1) · x))
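A minimal numpy sketch of this two-layer feedforward (the weight matrices are made-up examples, and biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    # y_hat = sigma(W2 . sigma(W1 . x))
    return sigmoid(W2 @ sigmoid(W1 @ x))

W1 = np.array([[1.0, -1.0], [0.5, 0.5]])  # input -> hidden (2x2)
W2 = np.array([[1.0, -2.0]])              # hidden -> output (1x2)
y_hat = feedforward(np.array([1.0, 2.0]), W1, W2)
```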
what is the formula for sigmoid?
σ(x) = 1/(1 + e^(-x)); its derivative is σ′(x) = σ(x)(1 − σ(x))
what is the final formula used for calculating the error gradient?
∇E = −(y − ŷ)(x_1, ..., x_n, 1)
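Sketching this formula in numpy (the point, label, and weights are arbitrary examples):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error_gradient(x, y, w, b):
    # grad E = -(y - y_hat) * (x_1, ..., x_n, 1); the trailing 1 is dE/db
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y - y_hat) * np.append(x, 1.0)

# with w = 0, b = 0 the prediction is 0.5, so the gradient is -0.5 * (x, 1)
grad = error_gradient(np.array([1.0, 2.0]), 1, np.array([0.0, 0.0]), 0.0)
```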