Machine Learning

Define *Backpropagation* for neural networks.

*Backpropagation* is neural-network terminology for *minimizing* our cost function: it computes the gradient of the cost with respect to every weight so that gradient descent (or a similar optimizer) can update them, just as in logistic and linear regression.

How does one experience *high variance* with a large training set size?

*Large training set size*: J train(Θ) increases with training set size and J CV(Θ) continues to decrease without leveling off; also, J train(Θ) < J CV(Θ), but the difference between them remains significant.

How does one experience *high bias* with a large training set size?

*Large training set size*: causes both J train(Θ) and J CV(Θ) to be high, with J train(Θ) ≈ J CV(Θ).

How does one experience *high variance* with a low training set size?

*Low training set size*: J train(Θ) will be low and J CV(Θ) will be high.

How does one experience *high bias* with a low training set size?

*Low training set size*: causes J train(Θ) to be low and J CV(Θ) to be high.

What are the two *main options* to address the issue of overfitting? #4

*Reduce the number of features*: • Manually select which features to keep. • Use a model selection algorithm. *Regularization*: • Keep all the features, but reduce the magnitude of the parameters θⱼ. • Regularization works well when we have a lot of slightly useful features.
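
As a concrete reference, a minimal Octave sketch of the regularized linear-regression cost, assuming X already includes the bias column, theta(1) holds θ₀, and m and lambda are defined:

% regularized cost: squared errors plus a penalty on all parameters except theta_0
J = (1/(2*m)) * (sum((X*theta - y).^2) + lambda * sum(theta(2:end).^2));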

Which 3 separate error values can we calculate, if we break down our dataset as such: • Training set: 60%. • Cross validation set: 20%. • Test set: 20%.

1. Optimize the parameters in Theta (Θ) using the training set for each polynomial degree. 2. Find the polynomial degree d with the least error using the cross validation set. 3. Estimate the generalization error using the test set with J test(Θ⁽ᵈ⁾), where d is the degree of the polynomial with the lowest cross-validation error.
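
A minimal Octave sketch of this procedure, where trainPoly and polyCost are hypothetical helpers that fit and evaluate a degree-d polynomial hypothesis:

for d = 1:10
  theta{d} = trainPoly(X_train, y_train, d);      % step 1: fit on the training set
  err_cv(d) = polyCost(X_cv, y_cv, theta{d}, d);  % step 2: evaluate on the CV set
end
[~, best_d] = min(err_cv);                        % degree with the least CV error
test_err = polyCost(X_test, y_test, theta{best_d}, best_d);  % step 3: generalization estimate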

How does one *train* a neural network? #6

1. Randomly initialize the weights. 2. Implement forward propagation. 3. Implement the cost function. 4. Implement backpropagation. 5. Use gradient checking to confirm that your backpropagation works. 6. Use gradient descent to minimize the cost function with the weights in theta.
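
A minimal Octave sketch of step 5 (gradient checking), assuming the weights are unrolled into a vector theta and a hypothetical helper costFun(theta) returns J(Θ); the numerical estimate is then compared against the backpropagation gradient:

epsilon = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus = theta;   thetaPlus(i)  += epsilon;
  thetaMinus = theta;  thetaMinus(i) -= epsilon;
  % two-sided difference approximates the i-th partial derivative
  gradApprox(i) = (costFun(thetaPlus) - costFun(thetaMinus)) / (2*epsilon);
end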

What issue does a neural network with more parameters pose?

A large neural network with more parameters is *prone to overfitting*.

What issue does a neural network with fewer parameters pose?

A neural network with fewer parameters is *prone to underfitting*.

Give the vectorized implementation for Gradient Descent! (Logistic Regression Model)

A vectorized implementation is: θ := θ − (α/m)·Xᵀ(g(Xθ) − y).
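
In Octave, one such update step might look like this, assuming a sigmoid() helper for g:

theta = theta - (alpha/m) * X' * (sigmoid(X*theta) - y);  % one vectorized step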

Give the vectorized implementation of our simplified cost function! (Logistic Regression Model)

A vectorized implementation is: h = g(Xθ) and J(θ) = (1/m)·(−yᵀ log(h) − (1 − y)ᵀ log(1 − h)).
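
A matching Octave sketch, again assuming a sigmoid() helper:

h = sigmoid(X*theta);                            % hypothesis for all m examples
J = (1/m) * (-y' * log(h) - (1-y)' * log(1-h));  % vectorized logistic cost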

State the *Backpropagation algorithm*.

Given training set {(x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}: • Set Δ⁽ˡ⁾ᵢⱼ := 0 for all l, i, j. • For each training example t = 1 to m: set a⁽¹⁾ := x⁽ᵗ⁾; perform forward propagation to compute a⁽ˡ⁾ for l = 2, …, L; using y⁽ᵗ⁾, compute δ⁽ᴸ⁾ = a⁽ᴸ⁾ − y⁽ᵗ⁾; compute δ⁽ᴸ⁻¹⁾, …, δ⁽²⁾ using δ⁽ˡ⁾ = ((Θ⁽ˡ⁾)ᵀ δ⁽ˡ⁺¹⁾) .* a⁽ˡ⁾ .* (1 − a⁽ˡ⁾); accumulate Δ⁽ˡ⁾ := Δ⁽ˡ⁾ + δ⁽ˡ⁺¹⁾(a⁽ˡ⁾)ᵀ. • Finally, D⁽ˡ⁾ᵢⱼ := (1/m)(Δ⁽ˡ⁾ᵢⱼ + λΘ⁽ˡ⁾ᵢⱼ) if j ≠ 0, and D⁽ˡ⁾ᵢⱼ := (1/m)Δ⁽ˡ⁾ᵢⱼ if j = 0, giving ∂J(Θ)/∂Θ⁽ˡ⁾ᵢⱼ = D⁽ˡ⁾ᵢⱼ.

Why does training an algorithm on very few data points easily produce 0 errors?

Because we can always find a curve (e.g., a quadratic) that passes exactly through that number of points.

How can we *improve* the form of our hypothesis function? (Multivariate Linear Regression)

By making it a *quadratic*, cubic or square root function (or any other form).

How do we measure the *accuracy* of a hypothesis function?

By using a *cost function*, usually denoted by J.

How do we change the form of our binary hypothesis function to be continuous in the range between 0 and 1?

By using the *Sigmoid Function*, also called the *Logistic Function*.

Why do we assume that x0=1 in multivariate linear regression?

By convention; setting x0 = 1 makes the dimensions of θ and x match, so the hypothesis can be written compactly as θᵀx.

Give the derivation for a single example in batch gradient descent! (Gradient Descent For Linear Regression)

∂/∂θⱼ J(θ) = ∂/∂θⱼ ½(hθ(x) − y)² = (hθ(x) − y) · ∂/∂θⱼ (θ₀x₀ + θ₁x₁ + ⋯ + θₙxₙ − y) = (hθ(x) − y)·xⱼ.

How do you implement both feature scaling and mean normalization? #2

Combine them: xᵢ := (xᵢ − μᵢ) / sᵢ, where μᵢ is the average of all the values for feature i (mean normalization) and sᵢ is the range of values (max − min) or the standard deviation (feature scaling).
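
A minimal Octave sketch, assuming each row of X is one example and each column one feature (relies on Octave's automatic broadcasting):

mu = mean(X);            % per-feature averages (mean normalization)
s  = max(X) - min(X);    % per-feature ranges (feature scaling; std(X) also works)
X_norm = (X - mu) ./ s;  % scaled, zero-mean features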

What is the test set error for linear classification?

For classification ~ Misclassification error (aka 0/1 misclassification error): err(hΘ(x), y) = 1 if hΘ(x) ≥ 0.5 and y = 0, or if hΘ(x) < 0.5 and y = 1; otherwise err(hΘ(x), y) = 0.

What is the test set error for linear regression?

For linear regression: J test(Θ) = (1/(2·m_test)) · Σᵢ (hΘ(x_test⁽ⁱ⁾) − y_test⁽ⁱ⁾)², summed over the m_test examples in the test set.

What is the issue with higher-order polynomials in regard to fitting the training data and test data?

Higher-order polynomials (high model complexity) fit the *training data* extremely well and the *test data* extremely poorly.

What bias-variance tradeoff do higher-order polynomials (high model complexity) have?

Higher-order polynomials (high model complexity) have low bias on the training data, but very high variance.

What approach will not generally help much by itself, when a learning algorithm is suffering from high bias?

If a learning algorithm is suffering from high bias, *getting more training data* will not (by itself) help much.

Under which circumstances will *getting more training data* help a learning algorithm to perform better?

If a learning algorithm is suffering from high variance, *getting more training data* is likely to help.

What are the *dendrites* in the model of neural networks?

In our model, our dendrites are like the input features.

What are the *axons* in the model of neural networks?

In our model, the axons are the results of our hypothesis function.

How can you address the overfitting of a large neural network?

In this case you can use regularization (increase λ) to address the overfitting.

What usually *causes* overfitting?

It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

What usually *causes* underfitting?

It is usually caused by a function that is too simple or uses too few features.

What is *multivariate linear regression*?

Linear regression with multiple variables.

What bias-variance tradeoff do lower-order polynomials (low model complexity) have?

Lower-order polynomials (low model complexity) have *high bias* and *low variance*.

What is the multivariate form of a hypothesis function?

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ.

In a basic sense, what are *neurons*?

Neurons are basically computational units that take inputs, called *dendrites*, as electrical signals (called "spikes") that are channeled to outputs, called *axons*.

State the *normal equation formula*!

θ = (XᵀX)⁻¹ Xᵀy.
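
In Octave this is a one-liner:

theta = pinv(X' * X) * X' * y;  % pinv gives a usable theta even if X'X is noninvertible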

What is the notation for equations where we can have any number of input variables? (Multivariate Linear Regression)

• xⱼ⁽ⁱ⁾ = value of feature j in the i-th training example. • x⁽ⁱ⁾ = the input (features) of the i-th training example. • m = the number of training examples. • n = the number of features.

What is *overfitting*?

Overfitting, or *high variance*, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

State the algorithm for *gradient descent*.

Repeat until convergence: θⱼ := θⱼ − α · ∂/∂θⱼ J(θ₀, θ₁), where j = 0, 1 represents the feature index number (both parameters are updated simultaneously).

What is the definition of a *cost function* of a supervised learning problem?

It takes an average of the squared differences between all the results of the hypothesis with inputs from the x's and the actual output y's: J(θ) = (1/2m) Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)².

What is the average test error for the test set?

The average test error for the test set is Test Error = (1/m_test) Σᵢ err(hΘ(x_test⁽ⁱ⁾), y_test⁽ⁱ⁾). This gives us the proportion of the test data that was misclassified.

What does the capital-delta matrix Δ in the backpropagation algorithm do?

The capital-delta matrix Δ is used as an "accumulator" to add up our values as we go along; from it we eventually compute the partial derivatives D⁽ˡ⁾ᵢⱼ = ∂J(Θ)/∂Θ⁽ˡ⁾ᵢⱼ.

Depict an example of *One-versus-all* to classify 3 classes! (Multiclass Classification)

Train one binary classifier per class: hθ⁽ⁱ⁾(x) treats class i as the positive class and lumps the other two classes together as the negative class, giving three decision boundaries, each separating one class from the rest.

Compare gradient descent and the normal equation!

• Gradient descent: needs to choose α; needs many iterations; costs O(kn²); works well even when n is large. • Normal equation: no need to choose α; no iterations; needs to compute (XᵀX)⁻¹, which costs O(n³); slow if n is very large.

Give an example of the implementation of the *OR-function* as a neural network!

The following is an example of the logical operator 'OR', meaning either x1 is true or x2 is true, or both: with weights Θ = [−10 20 20], the unit outputs g(−10 + 20x1 + 20x2), which is ≈1 whenever x1 = 1 or x2 = 1, and ≈0 when both are 0.

Give an example of the implementation of the *AND-function* as a neural network!

The following is an example of the logical operator AND, meaning it is only true if both x1 and x2 are 1: with weights Θ = [−30 20 20], the unit outputs g(−30 + 20x1 + 20x2), which is ≈1 only when x1 = x2 = 1.
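
A small Octave sketch of this AND unit, printing its output for all four input combinations:

sigmoid = @(z) 1 ./ (1 + exp(-z));
Theta = [-30; 20; 20];  % bias and input weights for the AND unit
for x1 = 0:1
  for x2 = 0:1
    % g(-30 + 20*x1 + 20*x2) is ~1 only when x1 = x2 = 1
    printf('%d AND %d -> %.4f\n', x1, x2, sigmoid(Theta' * [1; x1; x2]));
  end
end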

Give an example of a neural network which classifies data into one of four categories!

The inner layers each provide us with some new information, which leads to our final hypothesis function; for four categories the output layer has four units, and the hypothesis returns a vector such as [0 0 1 0]ᵀ to indicate the third class.

What is the *activation* function of a neural network?

The logistic function (as in classification) is also called a *sigmoid (logistic) activation function*.

Given a training set and a test set, what is the new procedure for evaluating a hypothesis?

The new procedure using these two sets is then: 1. Learn Θ and minimize Jtrain(Θ) using the training set. 2. Compute the test set error Jtest(Θ).

Does feature scaling speed up the implementation of the normal equation?

There is *no need* to do feature scaling with the normal equation.

What are the theta-matrices for implementing the logical functions 'AND', 'NOR', and 'OR' as a neural network?

• AND: Θ = [−30 20 20]. • NOR: Θ = [10 −20 −20]. • OR: Θ = [−10 20 20].

How can we approach regularization using the alternate method of the non-iterative normal equation?

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses: θ = (XᵀX + λ·L)⁻¹ Xᵀy, where L is the (n+1)×(n+1) identity matrix with its top-left entry set to 0, so that θ₀ is not regularized.
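
A minimal Octave sketch, assuming X includes the bias column and lambda is defined:

n = columns(X) - 1;            % number of features, excluding the bias column
L = eye(n + 1);  L(1, 1) = 0;  % identity-like matrix; theta_0 is not regularized
theta = pinv(X' * X + lambda * L) * X' * y;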

What is *underfitting*?

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

Give the *vectorization* of the multivariable form of a hypothesis function.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as hθ(x) = θᵀx, with x0 = 1.

What are the *weights* of a neural network?

Using the logistic function, our "theta" parameters are sometimes called "weights".

How can we *simplify* our cost function? (Logistic Regression Model)

We can compress our cost function's two conditional cases into one case: Cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x)).

How can we implement the 'XNOR' operator with a neural network?

We can implement the 'XNOR' operator by using a hidden layer whose two units compute AND and NOR of the inputs, followed by an output unit that ORs those two together.

How can we *speed up* gradient descent?

We can speed up gradient descent by having each of our input values in roughly the same range.

How can we get our discrete 0 or 1 classification from a logistic function?

We can translate the output of the hypothesis function as follows: hθ(x) ≥ 0.5 → y = 1; hθ(x) < 0.5 → y = 0.

Which method randomly initializes our weights for our Theta matrices of a neural network?

We initialize each Θ⁽ˡ⁾ᵢⱼ to a random value in [−ε, ε].
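
For example, in Octave, for a hypothetical 10×11 weight matrix:

INIT_EPSILON = 0.12;                                        % assumed small epsilon
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;  % uniform in [-eps, eps]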

When might it be a good time to go from a normal solution to an iterative process?

When the number of *features* n exceeds roughly *10,000*, because computing (XᵀX)⁻¹ in the normal equation costs O(n³).

What code is implemented if we perform forward *and* back propagation?

When we perform forward and back propagation, we loop over every training example.
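
A skeleton of that loop in Octave, where forwardProp and backProp are hypothetical helpers standing in for the propagation steps:

for t = 1:m
  a = forwardProp(X(t, :), Thetas);  % forward pass for training example t
  d = backProp(a, y(t, :), Thetas);  % backward pass: delta terms per layer
  % ...accumulate the capital-Delta matrices from d here...
end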

How does batch gradient descent differ from gradient descent? (Gradient Descent For Linear Regression)

While gradient descent can be susceptible to local minima in general, batch gradient descent has only one global, and no other local, optima.

What is the complexity of computing the inversion with the normal equation?

With the normal equation, computing the inversion has complexity O(n³).

Which function returns the values for jVal and gradient in a single turn?

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

How can we interpret the output of our logistic function?

hθ(x) for a given input x gives us the probability that our output is 1. For example, hθ(x) = 0.7 means a 70% probability that y = 1.

Given costFunction(), what do we have to do to implement fminunc()?

We can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".
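
A minimal Octave call, assuming costFunction is defined as in the previous card:

options = optimset('GradObj', 'on', 'MaxIter', 100);  % use our gradient, cap iterations
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);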

How can we break down our decision process *deciding what to do next*? #6

• *Getting more training examples*: Fixes high variance. • *Trying smaller sets of features*: Fixes high variance. • *Adding features*: Fixes high bias. • *Adding polynomial features*: Fixes high bias. • *Decreasing lambda*: Fixes high bias. • *Increasing lambda*: Fixes high variance.

What is the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis? #2

• *High bias (underfitting)*: both J train(Θ) and J CV(Θ) will be high. Also, J CV(Θ) is approximately equal to J train(Θ). • *High variance (overfitting)*: J train(Θ) will be low and J CV(Θ) will be much greater than J train(Θ).

What are alternative terms of a Cost Function? #2

• *Squared error function*. • *Mean squared error*.

How does gradient descent converge with a *fixed* step size alpha? #2

• As we approach a local minimum, gradient descent will take smaller steps. • Thus no need to decrease alpha over time.

What is the *Automatic convergence test* in gradient descent? #2

• Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 0.001. • However, in practice it's difficult to choose this threshold value.

What are the ideal ranges of our input variables in gradient descent? #2

• For example a range between minus 1 and 1. • These aren't exact requirements; we are only trying to speed things up.

What is *batch gradient descent*? #2 (Gradient Descent For Linear Regression)

• Gradient descent on the original cost function J. • This method looks at every example in the entire training set on every step.

How can the step parameter alpha in gradient descent cause bugs? #2

• If alpha is too small: slow convergence. • If alpha is too large: may not decrease on every iteration and thus may not converge.

Plot the cost function, if the correct answer for y is 0. #2

• If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. • If our hypothesis approaches 1, then the cost function will approach infinity.

Plot the cost function J, if the correct answer for y is 1.

• If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. • If our hypothesis approaches 0, then the cost function will approach infinity.

What important thing should one keep in mind if one changes the form of a hypothesis function? (Multivariate Linear Regression) #2

• If you create new features when doing polynomial regression then *feature scaling* becomes very important. • For example, if x has range 1 - 1000 then range of x^2 becomes 1 - 1000000.

What is *feature scaling*? #2

• Involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable. • Results in a new range of just 1.

What is *mean normalization*? #2

• Involves subtracting the average value for an input variable from the values for that input variable. • Results in a new average value for the input variable of just zero.

How are the variables L, s of l and K in the cost function of a neural network defined? #3

• L = total number of layers in the network. • s_l = number of units (not counting the bias unit) in layer l. • K = number of output units.

How can you *debug* gradient descent? #3

• Make a plot with the number of iterations on the x-axis. • Now plot the cost function J(θ) over the number of iterations of gradient descent. • If J(θ) ever increases, then you probably need to decrease α.
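
A small Octave sketch of this debugging plot for linear regression, assuming numIters and alpha are chosen:

J_history = zeros(numIters, 1);
for iter = 1:numIters
  theta = theta - (alpha/m) * X' * (X*theta - y);       % one gradient-descent step
  J_history(iter) = (1/(2*m)) * sum((X*theta - y).^2);  % record the cost
end
plot(1:numIters, J_history);                            % should decrease monotonically
xlabel('iterations'); ylabel('J(\theta)');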

What is *gradient descent* for our simplified cost function? (Logistic Regression Model) #2

• Notice that this algorithm is identical to the one we used in linear regression. • We still have to simultaneously update all values in theta.

Give the setup of using a neural network. #4

• Pick a network *architecture*: choose the *layout* of your neural network. • Number of input units: the dimension of the features x⁽ⁱ⁾. • Number of output units: the number of classes. • Number of hidden units per layer: usually, the more the better.

What are common causes for XᵀX to be *noninvertible*? #2

• Redundant features, where two features are very closely related (i.e. they are linearly dependent). • Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".

How do we determine the dimension of the matrices of weights? (Neural Network) #2

• If a network has s_j units in layer j and s_(j+1) units in layer j+1, then Θ⁽ʲ⁾ has dimension s_(j+1) × (s_j + 1). • The +1 comes from the addition of the "bias nodes". • In other words, the output nodes will not include the bias nodes while the inputs will.

What is the *decision boundary* given a logistic function? #2

• The decision boundary is the line that separates the area where y = 0 and where y = 1. • It is created by our hypothesis function.

What is the Gradient Descent for Multiple Variables? #2

• The gradient descent equation itself is generally the same form; we just have to repeat it for our n features. • θⱼ := θⱼ − α·(1/m)·Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾, for j = 0, …, n.

What is the *bias unit* of a neural network? #2

• The input node x0 is sometimes called the "bias unit." • It is always equal to 1.

Why does *feature scaling* speed up gradient descent? #2

• This is because theta will descend quickly on small ranges and slowly on large ranges. • Thus it will oscillate inefficiently down to the optimum when the variables are very uneven.

What is the implementation of *One-versus-all* in Multiclass Classification? #2

• Train a logistic regression classifier hθ⁽ⁱ⁾(x) for each class i to predict the probability that y = i. • To make a prediction on a new x, pick the class i that maximizes hθ⁽ⁱ⁾(x).

Which function do we want to use in Octave when implementing the normal equation? #2

• Use the 'pinv' function rather than 'inv'. • The 'pinv' function will give you a value of θ even if XᵀX is not invertible.

How do we obtain the values for each of the activation nodes, given a single-layer neural network with 3 activation nodes and a 4-dimensional input? #2

• We apply each row of the parameter matrix to our inputs to obtain the value for one activation node. • Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by the second parameter matrix Θ⁽²⁾.

How can we improve our features? (Multivariate Linear Regression) #2

• We can *combine* multiple features into one. • For example, we can combine x1 and x2 into a new feature x3 by taking x1 times x2.

What is the algorithm for implementing gradient descent for *linear regression*? #2

• We can substitute our actual cost function and our actual hypothesis function: repeat until convergence { θ₀ := θ₀ − α·(1/m)·Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾); θ₁ := θ₁ − α·(1/m)·Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾ }. • m is the size of the training set, θ₀ is a constant that changes simultaneously with θ₁, and x⁽ⁱ⁾, y⁽ⁱ⁾ are values of the given training set (data).

What is the intuition of the multivariable form of a hypothesis function in the example of estimating housing prices? #2

• We can think about theta 0 as the basic price of a house, theta 1 as the price per square meter, theta 2 as the price per floor, etc. • x1 will be the number of square meters in the house, x2 the number of floors, etc.

What does the *cost function* for logistic regression look like? #2

• We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. • In other words, it will not be a convex function.

How do we label the hidden layers of a neural network? #2

• We label these intermediate nodes *hidden layer* nodes, e.g. a₁⁽²⁾, …, aₙ⁽²⁾. • The nodes are also called *activation units*.

Depict an example of gradient descent as it is run to minimize a quadratic function. #2

• Shown is the trajectory taken by gradient descent, which was initialized at (48, 30). • The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.

State the *cost function* for neural networks. #3

• The double sum simply adds up the logistic regression costs calculated for each cell in the output layer. • The triple sum simply adds up the squares of all the individual Θ's in the entire network. • The i in the triple sum does not refer to training example i. • In full: J(Θ) = −(1/m)·Σₜ Σₖ [ yₖ⁽ᵗ⁾·log((hΘ(x⁽ᵗ⁾))ₖ) + (1 − yₖ⁽ᵗ⁾)·log(1 − (hΘ(x⁽ᵗ⁾))ₖ) ] + (λ/2m)·Σₗ Σᵢ Σⱼ (Θⱼᵢ⁽ˡ⁾)².

