Machine Learning

% note: if hist() crashes, try [...]

% note: if hist() crashes, try "graphics_toolkit('gnuplot')"

Define *Backpropagation* for neural networks.

*Backpropagation* is neural-network terminology for the algorithm that computes the gradient of the cost function, which is then used for *minimizing* our cost function.

How does one experience *high variance* with a large training set size?

*Large training set size*: J train of theta increases with training set size and J CV of theta continues to decrease without leveling off.

How does one experience *high bias* with a large training set size?

*Large training set size*: causes both J train of theta and J CV of theta to be high with J train approximately equal to J CV.

How does one experience *high variance* with a low training set size?

*Low training set size*: J train of theta will be low and J CV of theta will be high.

How does one experience *high bias* with a low training set size?

*Low training set size*: causes J train of theta to be low and J CV of theta to be high.

What are the two *main options* to address the issue of overfitting? #4

*Reduce the number of features*:
• Manually select which features to keep.
• Use a model selection algorithm.
*Regularization*:
• Keep all the features, but reduce the magnitude of parameters theta j.
• Regularization works well when we have a lot of slightly useful features.

Bias: approximation error (Difference between expected value and optimal value)

- High Bias = UnderFitting (BU) - Jtrain(Θ) and JCV(Θ) both will be high and Jtrain(Θ) ≈ JCV(Θ)

Variance: estimation error due to finite data

- High Variance = OverFitting (VO) - Jtrain(Θ) is low and JCV(Θ) ≫Jtrain(Θ)

Regularization Effects?

- Small values of λ allow the model to become finely tuned to noise, leading to large variance => overfitting.
- Large values of λ pull the weight parameters to zero, leading to large bias => underfitting.

1. Step in PCA?

1. Compute "covariance matrix" This can be vectorized in Octave as: Sigma = (1/m) * X' * X;

What do we need in order to choose the model and the regularization λ?

1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24}).
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error JCV(Θ) using the learned Θ (computed with λ), without regularization (λ = 0).
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it has a good generalization of the problem.
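
The λ loop can be sketched in Octave as follows. This is a minimal illustration, not the course's reference code: the data, the fixed polynomial degree d = 5 (a single model rather than a set of models) and the regularized normal equation used to fit Θ are all assumptions made for the example.

```octave
% Hypothetical 1-D data, already split into training and cross validation sets.
xtr = (0:0.5:4)';        ytr = 1 + 2*xtr + 0.3*sin(5*xtr);
xcv = (0.25:0.5:3.75)';  ycv = 1 + 2*xcv + 0.3*sin(5*xcv);

d = 5;                                % assumed polynomial degree
poly = @(x) x .^ (0:d);               % m x (d+1) features, first column is x0 = 1
lambdas = [0, 0.01 * 2.^(0:10)];      % 0, 0.01, 0.02, ..., 10.24
L = eye(d + 1);  L(1,1) = 0;          % do not regularize the bias term

Xtr = poly(xtr);
Jcv = zeros(size(lambdas));
for i = 1:numel(lambdas)
  % learn Theta with regularization (regularized normal equation)
  theta = (Xtr' * Xtr + lambdas(i) * L) \ (Xtr' * ytr);
  % cross validation error is computed WITHOUT the regularization term
  Jcv(i) = sum((poly(xcv) * theta - ycv).^2) / (2 * numel(ycv));
end

[best_err, idx] = min(Jcv);
best_lambda = lambdas(idx);           % lambda with the lowest CV error
```

The selected Θ and λ would then be evaluated once on a held-out test set to estimate generalization.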

Which 3 separate error values can we calculate, if we break down our dataset as such: • Training set: 60%. • Cross validation set: 20%. • Test set: 20%.

1. Optimize the parameters in Theta (Θ) using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with J test of Θ(d), where d is the degree of the polynomial with the lowest cross validation error.

Model Selection with the validation set?

1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with Jtest(Θ(d)), where d is the degree of the polynomial with the lowest cross validation error.

Model Selection without the validation set?

1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the test set.
3. Estimate the generalization error also using the test set with Jtest(Θ(d)), where d is the degree of the polynomial with the lowest test error.

Stochastic gradient descent algorithm?

1. Randomly 'shuffle' the dataset.
2. For i = 1, ..., m: $\Theta_j := \Theta_j - \alpha \, (h_\Theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$ (for every j = 0, ..., n)
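
A minimal Octave sketch of these two steps for linear regression. The data set, the learning rate and the number of passes are assumptions chosen only so the example converges.

```octave
% Hypothetical noiseless data generated from y = 1 + 2x.
X = [ones(5,1), (1:5)'];        % m x (n+1) design matrix with x0 = 1
y = 2 * (1:5)' + 1;
m = size(X, 1);

theta = zeros(2, 1);
alpha = 0.02;                   % assumed learning rate

for pass = 1:2000               % many passes only because the data set is tiny
  order = randperm(m);          % step 1: shuffle the examples
  for i = order                 % step 2: update using one example at a time
    xi  = X(i, :)';
    err = xi' * theta - y(i);   % h_theta(x^(i)) - y^(i)
    theta = theta - alpha * err * xi;   % updates every theta_j at once
  end
end
```

Each update touches a single example, which is what distinguishes this from batch gradient descent.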

How does one *train* a neural network? #6

1. Randomly initialize the weights. 2. Implement forward propagation. 3. Implement the cost function. 4. Implement backpropagation. 5. Use gradient checking to confirm that your backpropagation works. 6. Use gradient descent to minimize the cost function with the weights in theta.

% element-wise reciprocal

1./v

2. Step in PCA?

2. Compute "eigenvectors" of covariance matrix Σ This can be vectorized in Octave as: [U,S,V] = svd(Sigma);

3. Step in PCA?

3. Take the first k columns of the U matrix and compute z
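
The three PCA steps can be put together in a short Octave sketch. The data matrix and the choice k = 1 are made up for illustration; X is assumed to hold one example per row.

```octave
X = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0];   % hypothetical m x n data
m = size(X, 1);

Xn = X - mean(X);               % mean normalization (pre-processing)

Sigma = (1/m) * Xn' * Xn;       % step 1: covariance matrix
[U, S, V] = svd(Sigma);         % step 2: eigenvectors are the columns of U

k = 1;                          % assumed number of components
Ureduce = U(:, 1:k);            % step 3: first k columns of U
Z = Xn * Ureduce;               % m x k; row i is z^(i) = Ureduce' * x^(i)
```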

What is the *contour line* of a two variable function?

A *contour line* of a two variable function has a *constant value* at all points of the same line.

% element-wise multiplication

A .* B

% element-wise square of each element in A

A .^ 2

% append column vec

A = [A, [100; 101; 102]];

% generates a magic matrix - not much used in ML algorithms

A = magic(3)

What is the definition of *learning from experience* for a computer program?

A computer program is said to *learn from experience* E with respect to some *class of tasks* T and *performance measure* P, if its performance at tasks in T, as measured by P, improves with experience E.

A large lambda [...], which [...].

A large lambda heavily penalizes all the Θ parameters, which greatly simplifies the line of our resulting function and thus causes underfitting.

What issue does a neural network with more parameters pose?

A large neural network with more parameters is *prone to overfitting*.

What issue does a neural network with fewer parameters pose?

A neural network with fewer parameters is *prone to underfitting*.

What is the issue with higher-order polynomials in regard to fitting the training data and test data?

Higher-order polynomials (high model complexity) fit the *training data* extremely well and the *test data* extremely poorly.

What bias-variance tradeoff do higher-order polynomials (high model complexity) have?

Higher-order polynomials (high model complexity) have low bias on the training data, but very high variance.

% 4x4 identity matrix

I = eye(4)

If C is large, then we get [...]

If C is large, then we get higher variance/lower bias

If C is small, then we get [...]

If C is small, then we get lower variance/higher bias

Large Margin Intuition If C is very large, we must choose Θ parameters such that [...]

If C is very large, we must choose Θ parameters such that:

What approach will not generally help much by itself, when a learning algorithm is suffering from high bias?

If a learning algorithm is suffering from high bias, *getting more training data* will not (by itself) help much.

Under which circumstances will *getting more training data* help a learning algorithm to perform better?

If a learning algorithm is suffering from high variance, *getting more training data* is likely to help.

Logistic Regression vs. SVMs If n is large (relative to m), then [...]

If n is large (relative to m), then use logistic regression, or SVM without a kernel (the "linear kernel"). In this case, we don't have enough examples to need a complicated polynomial hypothesis.

Logistic Regression vs. SVMs If n is small and m is intermediate, then [...]

If n is small and m is intermediate, then use SVM with a Gaussian Kernel. In this case, we have enough examples that we may need a complex non-linear hypothesis.

Logistic Regression vs. SVMs If n is small and m is large, then [...]

If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel. In the last case, we want to increase our features so that logistic regression becomes applicable.

If our anomaly detector is flagging too many anomalous examples, then we need to [...]

If our anomaly detector is flagging too many anomalous examples, then we need to decrease our threshold ϵ

If we have [...], then we can reduce C.

If we have outlier examples that we don't want to affect the decision boundary, then we can reduce C.

In PCA, we are taking a number of features x1,x2,...,xn, and finding a [...]. We aren't trying to [...] and we aren't [...]

In PCA, we are taking a number of features x1,x2,...,xn, and finding a closest common dataset among them. We aren't trying to predict any result and we aren't applying any theta weights to the features.

In SVMs, the decision boundary has the special property that it is [...]

In SVMs, the decision boundary has the special property that it is as far away as possible from both the positive and the negative examples.

In cases where [...] it is strongly recommended to run a loop of random initializations.

In cases where K<10 it is strongly recommended to run a loop of random initializations.

In certain cases, an "inferior algorithm," if given [...], can [...]

In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data.

PCA is not linear regression!

In linear regression, we are minimizing the squared error from every point to our predictor line. These are vertical distances. In PCA, we are minimizing the shortest distance, or shortest orthogonal distances, to our data points.

In order for the F Score to be large, [...]

In order for the F Score to be large, both precision and recall must be large.

Trading Off Precision and Recall In order to turn these two metrics into one single number, we can [...]

In order to turn these two metrics into one single number, we can take the F score.
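
As a worked example, the usual F score (F1) is 2PR / (P + R). The confusion-matrix counts below are made up.

```octave
tp = 30;  fp = 10;  fn = 20;       % hypothetical counts on the cross validation set

prec = tp / (tp + fp);             % precision: 30/40 = 0.75
rec  = tp / (tp + fn);             % recall:    30/50 = 0.60
F    = 2 * prec * rec / (prec + rec);   % 2/3
```

If either precision or recall is close to 0, the product in the numerator pulls F toward 0 as well.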

In other words, if x and the landmark are close, then [...], and if x and the landmark are far away from each other, the [...]

In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0.
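
This behaviour can be checked numerically with the Gaussian kernel, exp(-||x - l||^2 / (2σ^2)). The points and σ^2 = 1 are arbitrary illustration values, and `gaussian_sim` is a name introduced here.

```octave
% Gaussian kernel similarity between a point x and a landmark l.
gaussian_sim = @(x, l, sigma2) exp(-sum((x - l).^2) / (2 * sigma2));

x      = [1; 2];
l_near = [1; 2.1];                        % landmark close to x
l_far  = [10; -5];                        % landmark far from x

f_near = gaussian_sim(x, l_near, 1);      % close to 1
f_far  = gaussian_sim(x, l_far, 1);       % close to 0
```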

Choosing the Number of Principal Components?

In other words, the squared projection error divided by the total variation should be less than one percent, so that 99% of the variance is retained.
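
One common way to apply this rule uses the diagonal of the S matrix returned by svd(Sigma): pick the smallest k whose leading diagonal entries sum to at least 99% of the total. The S below is a made-up example, not the output of a real data set.

```octave
S = diag([4.0, 0.9, 0.08, 0.02]);     % hypothetical S from [U,S,V] = svd(Sigma)

s = diag(S);
retained = cumsum(s) / sum(s);        % fraction of variance retained for k = 1..n
k = find(retained >= 0.99, 1);        % smallest k retaining at least 99%
```

Here the retained fractions are 0.80, 0.98, 0.996 and 1.0, so k = 3.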

Another way to choose K is to observe how well k-means performs on a downstream purpose. In other words [...]

In other words, you choose K that proves to be most useful for some goal you're trying to achieve from using these clusters.

What are the *dendrites* in the model of neural networks?

In our model, our dendrites are like the input features.

What are the *axons* in the model of neural networks?

In our model, the axons are the results of our hypothesis function.

A useful way to think about Support Vector Machines is to think of them as [...]

A useful way to think about Support Vector Machines is to think of them as Large Margin Classifiers

Give the vectorized implementation for Gradient Descent! (Logistic Regression Model)

A vectorized implementation is:

Give the vectorized implementation of our simplified cost function! (Logistic Regression Model)

A vectorized implementation is:
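
The formulas for the two cards above did not survive in this set. A standard vectorized form for logistic regression can be sketched in Octave as follows. The data, learning rate and iteration count are assumptions for the example.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 0.5; 1 1.5; 1 2.5; 1 3.5];     % hypothetical data, first column is x0 = 1
y = [0; 0; 1; 1];
m = length(y);

% Vectorized simplified cost: J = (1/m) * (-y'*log(h) - (1-y)'*log(1-h))
cost = @(t) (1/m) * (-y' * log(sigmoid(X*t)) - (1 - y)' * log(1 - sigmoid(X*t)));

theta = zeros(2, 1);
alpha = 0.5;                           % assumed learning rate
J0 = cost(theta);                      % equals log(2) for theta = 0

for iter = 1:5000
  h = sigmoid(X * theta);
  theta = theta - (alpha/m) * X' * (h - y);   % vectorized gradient descent step
end
J1 = cost(theta);
```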

A very common application of anomaly detection is [...]

A very common application of anomaly detection is detecting fraud.

% matrix transpose

A'

% get the 2nd row.

A(2,:)

% indexing is (row,col)

A(3,2)

% Select all elements as a column vector.

A(:)

% get the 2nd col

A(:,2)

% change second column

A(:,2) = [10; 11; 12]

% print all the elements of rows 1 and 3

A([1 3],:)

As the training set gets larger, the error [...] increases.

As the training set gets larger, the error for a quadratic function increases.

State the *Backpropagation algorithm*.

Backpropagation algorithm.

Why does training an algorithm on very few data points easily yield 0 errors?

Because for very few data points (such as 1, 2 or 3) we can always find a quadratic curve that passes through exactly that number of points.

Before we can apply PCA, there is a [...]

Before we can apply PCA, there is a data pre-processing step we must perform

How can we *improve* the form of our hypothesis function? (Multivariate Linear Regression)

By making it a *quadratic*, cubic or square root function (or any other form).

How do we measure the *accuracy* of a hypothesis function?

By using a *cost function*, usually denoted by J.

How do we change the form of our binary hypothesis function to be continuous in the range between 0 and 1?

By using the *Sigmoid Function*, also called the *Logistic Function*.

% same as C = [2 2 2; 2 2 2]

C = 2*ones(2,3)

% concatenating A and B matrices side by side

C = [A B]

% Concatenating A and B top and bottom

C = [A; B]

Using An SVM In practical application, the choices you do need to make are: [...]

Choice of parameter C.
Choice of kernel (similarity function):
- No kernel ("linear" kernel): gives a standard linear classifier. Choose when n is large and m is small.
- Gaussian kernel: need to choose $\sigma^2$. Choose when n is small and m is large.

Algorithm for anomaly detection with gaussian distribution?

1. Choose features $x_i$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$: calculate each $\mu_j$ and $\sigma_j^2$ from the training set.
3. Given a new example x, compute p(x).
4. Anomaly if p(x) < ϵ.
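
A minimal Octave sketch of this procedure, assuming independent Gaussian features so that p(x) is the product of the per-feature densities. The training data, the threshold ϵ = 0.01 and the two query points are made up.

```octave
% Hypothetical training set of normal (non-anomalous) examples, one per row.
X = [5.0 1.0; 5.2 0.9; 4.8 1.1; 5.1 1.0; 4.9 1.0];

mu     = mean(X);              % mu_j for each feature
sigma2 = var(X, 1);            % sigma_j^2, normalized by m (not m - 1)

% p(x) = product over features of the univariate Gaussian density
p = @(x) prod((1 ./ sqrt(2*pi*sigma2)) .* exp(-((x - mu).^2) ./ (2*sigma2)));

epsilon = 0.01;                % assumed threshold
x_ok  = [5.0 1.0];
x_bad = [7.0 0.2];

flag_ok  = p(x_ok)  < epsilon; % false: looks like the training data
flag_bad = p(x_bad) < epsilon; % true: flagged as an anomaly
```

In practice ϵ would be chosen on a labeled cross validation set rather than fixed by hand.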

Different ways we can approach a machine learning problem?

Collect lots of data (for example "honeypot" project but doesn't always work) Develop sophisticated features (for example: using email header data in spam emails) Develop algorithms to process your input in different ways (recognizing misspellings in spam).

Intuition for the bias-variance trade-off: Complex model => [...]

Complex model => sensitive to data => much affected by changes in X => high variance, low bias.

Why do we assume that x0=1 in multivariate linear regression?

Convention.

Give a pictorial representation of what the cost function of a supervised learning problem does.

Cost function of a supervised learning problem.

Data is linearly separable when [...]

Data is linearly separable when a straight line can separate the positive and negative examples.

Give a derivation of for a single example in batch gradient descent! (Gradient Descent For Linear Regression)

Derivation of a single variable in gradient descent.

Doing dimensionality reduction will [...]

Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.

What tradeoff will we have if we want a more confident prediction of two classes using logistic regression?

Doing this, we will have higher precision but lower recall (refer to the definitions in the previous section).

Each landmark gives us [...]

Each landmark gives us the features in our hypothesis:

How do you implement both feature scaling and mean normalization? #2

Feature Scaling and Mean Normalization.

Finally, note that the hypothesis of the Support Vector Machine is not interpreted as [...] (as it is for the hypothesis of logistic regression). Instead, it [...]

Finally, note that the hypothesis of the Support Vector Machine is not interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it outputs either 1 or 0.

What is the test set error for linear classification?

For classification ~ Misclassification error (aka 0/1 misclassification error):

What is the test set error for linear regression?

For linear regression
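
The formulas for these two cards are missing from this set. The usual definitions are the halved mean squared error for linear regression and the average 0/1 misclassification error for classification. A small Octave sketch with made-up predictions and labels:

```octave
% Linear regression: J_test = 1/(2*m_test) * sum((h - y).^2)
h_reg = [2.5; 0.0; 2.1];              % hypothetical predictions
y_reg = [3.0; -0.5; 2.0];
m_test = length(y_reg);
J_test = (1 / (2 * m_test)) * sum((h_reg - y_reg).^2);

% Classification: err = 1 when the thresholded prediction disagrees with y
h_cls = [0.9; 0.2; 0.6; 0.4];         % hypothetical hypothesis outputs
y_cls = [1; 0; 0; 1];
pred  = double(h_cls >= 0.5);
test_err = mean(double(pred ~= y_cls));   % fraction of misclassified examples
```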

Map Reduce and Data Parallelism for Neural Networks?

For neural networks, you can compute forward propagation and back propagation on subsets of your data on many machines. Those machines can report their derivatives back to a 'master' server that will combine them.

What is the definition of a *supervised learning problem*?

Given a training set, learn a function h such that h of an input variable x is a "good" predictor for the corresponding output variable y.

Data preprocessing in PCA?

Given training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$.
Preprocess (feature scaling/mean normalization):
- Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
- If different features are on different scales (e.g., x1 = size of house, x2 = number of bedrooms), scale features to have a comparable range of values.
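
In Octave this pre-processing can be sketched as below. The housing numbers are made up, and dividing by the standard deviation is one possible scaling choice (the range max - min is another).

```octave
X = [2104 3; 1600 3; 2400 4; 1416 2];   % hypothetical: size of house, bedrooms

mu = mean(X);                  % per-feature mean mu_j
s  = std(X);                   % per-feature scale (assumed choice)

Xnorm = (X - mu) ./ s;         % mean normalization, then feature scaling
```

The same mu and s should be reused when transforming new examples.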

PCA Problem formulation

Given two features, x1 and x2, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.

In the cluster assignment step, our goal is to: [...]

Minimize J(...) with respect to c(1),...,c(m) (holding μ1,...,μK fixed)

In the move centroid step, our goal is to: [...]

Minimize J(...) with respect to μ1,...,μK

A typical rule of thumb when running diagnostics is?

- More training examples fixes high variance but not high bias.
- Fewer features fixes high variance but not high bias.
- Additional features fixes high bias but not high variance.
- The addition of polynomial and interaction features fixes high bias but not high variance.
- When using gradient descent, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter).
- When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.

Motivations for Dimensionality Reduction?

Motivation I: Data Compression. Motivation II: Visualization.

What is the multivariate form of a hypothesis function?

Multivariate form of the hypothesis function.

In a basic sense, what are *neurons*?

Neurons are basically computational units that take inputs (*dendrites*) as electrical signals, called "spikes", that are channeled to outputs (*axons*).

State the *normal equation formula*!

Normal Equation Formula.
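
The formula itself is missing from this card. The normal equation is θ = (XᵀX)⁻¹Xᵀy, which in Octave is commonly written with pinv. The data below are made up.

```octave
X = [ones(4,1), [1; 2; 3; 4]];   % hypothetical design matrix with x0 = 1
y = [3; 5; 7; 9];                % generated from y = 1 + 2x

theta = pinv(X' * X) * X' * y;   % normal equation, no learning rate or iterations
```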

What is the notation for equations where we can have any number of input variables? (Multivariate Linear Regression)

Notation.

Note: J will always decrease as K is increased. The one exception is if k-means [...]

Note: J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum.

Note: do perform [...] before using the Gaussian Kernel.

Note: do perform feature scaling before using the Gaussian Kernel.

Note: in dimensionality reduction, we are reducing our features rather than [...].

Note: in dimensionality reduction, we are reducing our features rather than our number of examples.

Note: not all similarity functions are valid kernels. They must satisfy [...] which guarantees that the SVM package's optimizations [...]

Note: not all similarity functions are valid kernels. They must satisfy "Mercer's Theorem" which guarantees that the SVM package's optimizations run correctly and do not diverge.

How do we choose the learning rate α for stochastic gradient descent?

One strategy is to plot the average cost of the hypothesis applied to every 1000 or so training examples. If you increase the number of examples you average over to plot the performance of your algorithm, the plot's line will become smoother. With a very small number of examples for the average, the line will be too noisy and it will be difficult to find the trend.

One way to get the landmarks is to [...]

One way to get the landmarks is to put them in the exact same locations as all the training examples.

What is our optimization objective in k-means?

Our optimization objective is to minimize all our parameters using the above cost function:

What is *overfitting*?

Overfitting, or *high variance*, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

%% Change Octave prompt

PS1('>> ');

Precision: of all patients we predicted where y=1, [...]

Precision: of all patients we predicted where y=1, what fraction actually has cancer?

Predicted: 0, Actual: 1

Predicted: 0, Actual: 1 --- False negative

Predicted: 0, Actual: 0

Predicted: 0, Actual: 0 --- True negative

Predicted: 1, Actual: 0

Predicted: 1, Actual: 0 --- False positive

Predicted: 1, Actual: 1

Predicted: 1, Actual: 1 --- True positive

K-Means Algorithm

1. Randomly initialize two points in the dataset called the cluster centroids.
2. Cluster assignment: assign all examples into one of two groups based on which cluster centroid the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster centroid groups, then move the cluster centroid points to those averages.
4. Re-run (2) and (3) until we have found our clusters.
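
A minimal Octave sketch of the loop for K = 2. The data are made up, the centroids are initialized at two fixed examples (instead of random ones) so the result is reproducible, and empty clusters are not handled.

```octave
X = [1 1; 1.2 0.8; 0.9 1.1; 5 5; 5.1 4.9; 4.8 5.2];   % hypothetical data
K = 2;
centroids = X([1 4], :);        % normally chosen at random from the examples

for iter = 1:10
  % Cluster assignment: index of the closest centroid for every example
  D = zeros(size(X, 1), K);
  for k = 1:K
    D(:, k) = sum((X - centroids(k, :)).^2, 2);   % squared distances
  end
  [~, c] = min(D, [], 2);

  % Move centroid: mean of the points assigned to each cluster
  for k = 1:K
    centroids(k, :) = mean(X(c == k, :), 1);
  end
end
```

A fixed iteration count stands in for "until we have found our clusters". A real implementation would stop when the assignments no longer change.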

What is the "Rationale for large data"?

Rationale for large data: if we have a low bias algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set).

Recall: Of all the patients that actually have cancer, what [...]

Recall: Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?

State the algorithm for *gradient descent*.

Repeat until convergence, where j=0,1 represents the feature index number.

Similarly (to make a support vector machine), we modify the second term of the cost function so that [...]

Similarly (to make a support vector machine), we modify the second term of the cost function so that when z is less than -1, it outputs 0.

Intuition for the bias-variance trade-off: Simple model => [...]

Simple model => more rigid => does not change as much with changes in X => low variance, high bias.

Since SVMs maximize this margin, it is often called a [...]

Since SVMs maximize this margin, it is often called a Large Margin Classifier.

So, PCA has two tasks: [...]

So, PCA has two tasks: figure out u(1),...,u(k) and also to find z1,z2,...,zm.

The recommended approach to solving machine learning problems is?

- Start with a simple algorithm, implement it quickly, and test it early.
- Plot learning curves to decide if more data, more features, etc. will help.
- Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.

Stochastic gradient cost function?

Stochastic gradient descent is written out in a different but similar way:

Stochastic gradient descent will be unlikely to [...] and will instead [...], but usually yields a result that is close enough

Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough

Stochastic gradient descent will usually take [...] passes through your data set to get near the global minimum.

Stochastic gradient descent will usually take 1-10 passes through your data set to get near the global minimum.

Give the *pictorial* process for a supervised learning problem.

Supervised Learning Problem.

What is the definition of a *cost function* of a supervised learning problem?

Takes an average difference of all the results of the hypothesis with inputs from x's and the actual output y's.

Why is it possible that you may get a slightly better solution with stochastic gradient descent with a smaller learning rate?

That is because stochastic gradient descent will oscillate and jump around the global minimum, and it will make smaller random jumps with a smaller learning rate.

The K-Means Algorithm is the most popular and widely used algorithm for [...]

The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.

The Support Vector Machine (SVM) is yet another type of supervised machine learning algorithm. It is sometimes [...]

The Support Vector Machine (SVM) is yet another type of supervised machine learning algorithm. It is sometimes cleaner and more powerful.

How can you address the overfitting of a large neural network?

In this case you can use regularization (increase λ) to address the overfitting.

What is the consequence of selecting a model without the validation set?

In this case, we have trained one variable, d, or the degree of the polynomial, using the test set. This will cause our error value to be greater for any other set of data.

Increasing and decreasing C is similar to [...], and can [...]

Increasing and decreasing C is similar to respectively decreasing and increasing λ, and can simplify our decision boundary.

Mini-Batch Gradient Descent

Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b

What usually *causes* overfitting?

It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

What usually *causes* underfitting?

It is usually caused by a function that is too simple or uses too few features.

It's important to get error results as [...]. Otherwise it is difficult to [...]

It's important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance

The relationship of λ to the training set and the cross validation set is as follows: Intermediate λ:

Jtrain(Θ) and JCV(Θ) are somewhat low and Jtrain(Θ)≈JCV(Θ).

The relationship of λ to the training set and the cross validation set is as follows: Low λ:

Jtrain(Θ) is low and JCV(Θ) is high (high variance/overfitting).

High variance (overfitting):

Jtrain(Θ) will be low and JCV(Θ) will be much greater than Jtrain(Θ).

Just because a learning algorithm fits a training set well, that does not mean [...]

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis.

K-Means Algorithm Our main variables are [...]

K (number of clusters). Training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, where $x^{(i)} \in \mathbb{R}^n$. Note that we will not use the x0=1 convention.

K-means can get stuck [...]. To decrease the chance of this happening, you can [...]

K-means can get stuck in local optima. To decrease the chance of this happening, you can run the algorithm on many different random initializations.

Kernels allow us to [...]

Kernels allow us to make complex, non-linear classifiers using Support Vector Machines.

With high variance Large training set size:

Large training set size: Jtrain(Θ) increases with training set size and JCV(Θ) continues to decrease without leveling off. Also, Jtrain(Θ)<JCV(Θ) but the difference between them remains significant.

With high bias Large training set size: [...]

Large training set size: causes both Jtrain(Θ) and JCV(Θ) to be high with Jtrain(Θ)≈JCV(Θ).

Which algorithms are easily parallelizable?

Linear regression and logistic regression are easily parallelizable.

What is *multivariate linear regression*?

Linear regression with multiple variables.

With high variance Low training set size: [...]

Low training set size: Jtrain(Θ) will be low and JCV(Θ) will be high.

With high bias Low training set size: [...]

Low training set size: causes Jtrain(Θ) to be low and JCV(Θ) to be high.

What bias-variance tradeoff do lower-order polynomials (low model complexity) have?

Lower-order polynomials (low model complexity) have *high bias* and *low variance*.

Model Complexity Effects?

Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently. Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance. In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

What will MapReduce do with the "dispatched jobs"?

MapReduce will take all these dispatched (or 'mapped') jobs and 'reduce' them by calculating:

Clustering is good for [...]

Market segmentation Social network analysis Organizing computer clusters Astronomical data analysis

Trading Off Precision and Recall The lower the threshold, the [...]

The lower the threshold, the greater the recall and the lower the precision.

Typical values for b in mini-batch gradient descent?

Typical values for b range from 2-100 or so.

What is *underfitting*?

Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

Unsupervised learning is contrasted from supervised learning because it [...]

Unsupervised learning is contrasted from supervised learning because it uses an unlabeled training set rather than a labeled one.

Give the *vectorization* of the multivariable form of a hypothesis function.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

What are the *weights* of a neural network?

Using the logistic function, our "theta" parameters are sometimes called "weights".

What is the cost function for k-means?

Using these variables we can define our cost function:

How can we check if our features are gaussian?

We can check that our features are gaussian by plotting a histogram of our data and checking for the bell-shaped curve.

How can we *simplify* our cost function? (Logistic Regression Model)

We can compress our cost function's two conditional cases into one case.

Map Reduce and Data Parallelism

We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel.

If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?

We can do this with the equation: $x_{approx}^{(1)} = U_{reduce} \cdot z^{(1)}$
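
A small Octave sketch of compressing and then reconstructing. The data are made up and lie close to a line, so a single component reconstructs them almost exactly. With one example per row, the equation becomes Z * Ureduce'.

```octave
X  = [1 2; 2 4.1; 3 5.9; 4 8.2];        % hypothetical data, roughly y = 2x
Xn = X - mean(X);                        % mean-normalized data

[U, S, V] = svd((1 / size(X, 1)) * Xn' * Xn);
Ureduce = U(:, 1);                       % keep k = 1 component

Z       = Xn * Ureduce;                  % compress:   2 features -> 1
Xapprox = Z * Ureduce';                  % reconstruct: 1 feature -> 2 (approximate)

rel_err = norm(Xn - Xapprox, 'fro')^2 / norm(Xn, 'fro')^2;   % projection error ratio
```

The reconstruction is of the mean-normalized data. Adding mean(X) back recovers the original scale.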

How can we implement the 'XNOR' operator with a neural network?

We can implement the 'XNOR' operator by using a hidden layer with two nodes (one computing AND, one computing NOR) followed by an OR output node.
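
One common construction uses a single hidden layer with two units (AND and NOR) feeding an OR output unit. The weights below are illustrative values that make each sigmoid unit saturate, not the only possible choice.

```octave
g = @(z) 1 ./ (1 + exp(-z));           % sigmoid activation

Theta1 = [-30  20  20;                 % hidden unit 1: x1 AND x2
           10 -20 -20];                % hidden unit 2: (NOT x1) AND (NOT x2)
Theta2 = [-10  20  20];                % output unit: a1 OR a2

inputs = [0 0; 0 1; 1 0; 1 1];
out = zeros(4, 1);
for i = 1:4
  a1 = [1; inputs(i, :)'];             % add bias unit
  a2 = [1; g(Theta1 * a1)];            % hidden layer activations plus bias
  out(i) = g(Theta2 * a2) >= 0.5;      % XNOR: 1 when x1 == x2
end
```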

How can we *speed up* gradient descent?

We can speed up gradient descent by having each of our input values in roughly the same range.

How can we get our discrete 0 or 1 classification from a logistic function?

We can translate the output of the hypothesis function as follows:
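
The rule is missing from this card. The usual convention is to predict y = 1 when the hypothesis output is at least 0.5 and y = 0 otherwise. A short Octave sketch with made-up parameters:

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [-3; 1];                % hypothetical parameters
X = [1 1; 1 3; 1 5];            % three examples, first column is x0 = 1

h    = sigmoid(X * theta);      % values in (0, 1): about 0.12, exactly 0.5, about 0.88
pred = double(h >= 0.5);        % discrete 0/1 classification
```

Since the sigmoid is at least 0.5 exactly when its input is at least 0, this is the same as checking X * theta >= 0.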

Say we are trying to recommend movies to customers. We can use the following definitions [...]

We can use the following definitions

Use supervised learning when...

We have a large number of both positive and negative examples. In other words, the training set is more evenly divided into classes. We have enough positive examples for the algorithm to get a sense of what new positives examples look like. The future positive examples are likely similar to the ones in the training set.

Use anomaly detection when...

We have a very small number of positive examples (y=1 ... 0-20 examples is common) and a large number of negative (y=0) examples. We have many different "types" of anomalies and it is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.

Which method randomly initializes our weights for our Theta matrices of a neural network?

We initialize each Theta l,i,j to a random value between minus epsilon and epsilon.
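
In Octave this can be sketched with rand. The matrix size (10 x 11) and the value of INIT_EPSILON are assumptions for the example.

```octave
INIT_EPSILON = 0.12;                                    % assumed value
% rand returns values in (0, 1); rescale them to (-INIT_EPSILON, INIT_EPSILON)
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
```

Random values break the symmetry that would otherwise make every hidden unit learn the same function.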

How do we transform the cost function from (regularized) logistic regression for SVMs?

We may transform this into the cost function for support vector machines by substituting cost0(z) and cost1(z):

We may want to reduce the dimension of our features if [...]

We may want to reduce the dimension of our features if we have a lot of redundant data.

We might want a confident prediction of two classes using logistic regression. One way is to [...]

We might want a confident prediction of two classes using logistic regression. One way is to increase our threshold. This way, we only predict cancer if the patient has a 70% chance.

We must choose our features to have [...]. A useful test is: [...]

We must choose our features to have enough information. A useful test is: Given input x, would a human expert be able to confidently predict y?

We want to train precision and recall on the [...]

We want to train precision and recall on the cross validation set so as not to bias our test set

When might it be a good time to go from a normal solution to an iterative process?

When the number of features n exceeds *10,000*, due to the complexity of the normal equation.

What do we call a learning problem, if the target variable is continuous?

When the target variable that we're trying to predict is continuous, the learning problem is also called a *regression problem*.

What code is implemented if we perform forward *and* back propagation?

When we perform forward and back propagation, we loop on every training example.

What do we call a learning problem, if the target variable can take on only a small number of values?

When y can take on only a small number of discrete values, the learning problem is also called a *classification problem*.

How does batch gradient descent differ from gradient descent? (Gradient Descent For Linear Regression)

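While gradient descent can be susceptible to local minima in general, the optimization problem posed for linear regression has only one global optimum and no other local optima, because J is a convex quadratic function. Batch gradient descent therefore converges to the global minimum (assuming the learning rate alpha is not too large).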

With a continuous stream of users to a website, we can [...], where we collect [...] for the features in x to predict some behavior y.

With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y.

With a large σ², the features fi [...]

With a large σ², the features fi vary more smoothly, causing higher bias and lower variance.

With a small σ², the features fi [...]

With a small σ², the features fi vary less smoothly, causing lower bias and higher variance.

With k-means, it is not possible [...]

With k-means, it is not possible for the cost function to sometimes increase. It should always descend.

What is the complexity of computing the inversion with the normal equation?

With the normal equation, computing the inversion has complexity of n cubed.

The most common use of PCA is to [...]

The most common use of PCA is to speed up supervised learning.

The most popular dimensionality reduction algorithm is [...]

The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA)

Given a training set and a test set, what is the new procedure for evaluating a hypothesis?

The new procedure using these two sets is then: 1. Learn Θ and minimize Jtrain(Θ) using the training set 2. Compute the test set error Jtest(Θ)

What is the definition of a *hypothesis*?

The predicting function h.

What are the properties of the similarity function?

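There are a couple of properties of the similarity function: if x is close to the landmark l(i), the similarity is approximately 1; if x is far from l(i), the similarity is approximately 0.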

There are lots of good SVM libraries already written. A. Ng often uses [...]

There are lots of good SVM libraries already written. A. Ng often uses 'liblinear' and 'libsvm'.

How can we tell which parameters Θ to leave in the model (known as "model selection")?

There are several ways to solve this problem: - Get more data (very difficult). - Choose the model which best fits the data without overfitting (very difficult). - Reduce the opportunity for overfitting through regularization.

Does feature scaling speed up the implementation of the normal equation?

There is *no need* to do feature scaling with the normal equation.

What are the theta-matrices for implementing the logical functions 'AND', 'NOR', and 'OR' as a neural network?

Theta Matrices for Neural Network implementation.
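The image with the matrices is missing from this card. The weights below are the ones commonly used to illustrate these gates with a single sigmoid unit; treat them as an assumption rather than the card's original content. A short Octave sketch that checks them against the truth tables:

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Each row is one input: [bias x1 x2]
X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];

% Assumed weights [theta0 theta1 theta2] for each gate
Theta_AND = [-30  20  20];
Theta_NOR = [ 10 -20 -20];
Theta_OR  = [-10  20  20];

% Round the sigmoid output to get a 0/1 prediction per input row
h_and = round(sigmoid(X * Theta_AND'));
h_nor = round(sigmoid(X * Theta_NOR'));
h_or  = round(sigmoid(X * Theta_OR'));
```

With these weights, h_and is 1 only for the last row, h_nor is 1 only for the first row, and h_or is 1 for every row except the first.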

The SVM will separate the negative and positive examples by a large margin. This large margin is only achieved when [...]

This large margin is only achieved when C is very large.

It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm. When does this usually happen?

This usually happens with skewed classes; that is, when our class is very rare in the entire data set. Or to say it another way, when we have a lot more examples from one class than from the other class.

What are the benefits of selecting a model with the validation set?

This way, the degree of the polynomial d has not been trained using the test set. note: be aware that using the CV set to select 'd' means that we cannot also use it for the validation curve process of setting the lambda value

What is the result if we want to get a very safe prediction?

Lowering the threshold gives a very safe prediction, because fewer actual positives are missed. This will cause higher recall but lower precision.

What is the average test error for the test set?

The average test error for the test set is the mean of the 0/1 misclassification error over all test examples. This gives us the proportion of the test data that was misclassified.
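As a small Octave sketch of the misclassification rate, assuming hypothetical vectors of predicted and true labels (the names predictions and y_test are made up for this example):

```octave
% Hypothetical predicted and true labels for five test examples.
predictions = [1; 0; 1; 1; 0];
y_test      = [1; 0; 0; 1; 1];

% The 0/1 error is 1 where the prediction is wrong; its mean is the
% fraction of misclassified test examples.
test_error = mean(double(predictions ~= y_test));
```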

What does the matrix Delta in the Back propagation algorithm do?

The capital-delta matrix Δ is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative.

The [...] is called the margin.

The distance of the decision boundary to the nearest example is called the margin.

Choosing the Number of Clusters The elbow method [...]

The elbow method: plot the cost J and the number of clusters K. The cost function should reduce as we increase the number of clusters, and then flatten out. Choose K at the point where the cost function starts to flatten out. However, fairly often, the curve is very gradual, so there's no clear elbow.

The error of your hypothesis as measured on the data set with which you trained the parameters will be [...]

The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than any other data set.

The error value will [...] after a certain m, or training set size.

The error value will plateau out after a certain m, or training set size.

What is the implementation for the K-Means Algorithm?

The first for-loop is the 'Cluster Assignment' step. The second for-loop is the 'Move Centroid' step, where we move each centroid to the average of its group.
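The two loops can be sketched in Octave as follows. The toy data, the choice of initial centroids, and the fixed count of 10 iterations are assumptions for illustration, not part of the original card.

```octave
% Hypothetical toy data: two well-separated groups in 2-D.
X = [1 1; 1.2 0.8; 0.9 1.1; 5 5; 5.1 4.9; 4.8 5.2];
K = 2;
m = size(X, 1);
centroids = X([1 4], :);   % assumed initialization: two examples from the data
c = zeros(m, 1);           % c(i) = index of the centroid assigned to x(i)

for iter = 1:10
  % Cluster assignment step: assign each example to its closest centroid.
  for i = 1:m
    dists = sum((centroids - X(i, :)) .^ 2, 2);  % squared distance to each centroid
    [~, c(i)] = min(dists);
  end
  % Move centroid step: move each centroid to the mean of its assigned points.
  for k = 1:K
    if any(c == k)   % skip empty clusters to avoid averaging nothing
      centroids(k, :) = mean(X(c == k, :), 1);
    end
  end
end
```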

Why does gradient descent, regardless of the slope's sign, eventually converge to its minimum value? #2

The following graph shows that: • when the slope is negative, the value of theta 1 increases. • when the slope is positive, the value of theta 1 decreases.

Depict an example of *One-versus-all* to classify 3 classes! (Multiclass Classification)

The following image shows how one could classify 3 classes:

Compare gradient descent and the normal equation!

The following is a comparison of gradient descent and the normal equation:

Give an example of the implementation of the *OR-function* as a neural network!

The following is an example of the logical operator 'OR', meaning either x1 is true or x2 is true, or both:

Give an example of the implementation of the *AND-function* as a neural network!

The following is an example of the logical operator AND, meaning it is only true if both x1 and x2 are 1.

What is the function of the Gaussian distribution?

The full function is as follows:
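The formula itself is missing from this card. The standard Gaussian density with mean μ and variance σ² is:

```latex
p(x; \mu, \sigma^{2}) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)
```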

The goal of PCA is to [...]

The goal of PCA is to reduce the average of all the distances of every feature to the projection line. This is the projection error.

Trading Off Precision and Recall: The greater the threshold, the [...]

The greater the threshold, the greater the precision and the lower the recall.

Give an example of a neural network which classifies data into one of four categories!

The inner layers, each provide us with some new information which leads to our final hypothesis function.

What are possible evaluation metrics for anomaly detection?

True positive, false positive, false negative, true negative. Precision/recall. F1 score.
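A short Octave sketch of how these metrics relate, using hypothetical label vectors (y holds the true labels, y_hat the predictions; both are made up for this example):

```octave
% Hypothetical true labels and predictions (1 = anomaly, 0 = normal).
y     = [1; 0; 0; 1; 0; 1; 0; 0];
y_hat = [1; 0; 1; 1; 0; 0; 0; 0];

tp = sum((y_hat == 1) & (y == 1));   % true positives
fp = sum((y_hat == 1) & (y == 0));   % false positives
fn = sum((y_hat == 0) & (y == 1));   % false negatives

precision = tp / (tp + fp);          % of predicted positives, how many are real
recall    = tp / (tp + fn);          % of real positives, how many were found
F1 = 2 * precision * recall / (precision + recall);
```

If the classifier never predicts a positive, tp + fp is 0 and precision (and therefore F1) is undefined.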

Algorithm for choosing k in PCA?

1. Try PCA with k = 1, 2, ... 2. Compute Ureduce, z, x. 3. Check the formula given above that 99% of the variance is retained. If not, go to step one and increase k.
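One way to sketch this in Octave is to use the singular values returned by svd, so that every k can be checked from a single decomposition instead of re-running PCA each time. The toy data below is hypothetical and already mean-normalized.

```octave
% Hypothetical mean-normalized data: almost all variance lies along the first axis.
X = [2 0; -2 0; 0 0.1; 0 -0.1];
m = size(X, 1);

Sigma = (1 / m) * X' * X;      % covariance matrix
[U, S, ~] = svd(Sigma);
s = diag(S);                   % singular values, largest first

% Fraction of variance retained by the first k components, for every k.
retained = cumsum(s) / sum(s);
k = find(retained >= 0.99, 1); % smallest k that retains at least 99%

Ureduce = U(:, 1:k);
Z = X * Ureduce;               % projected data, one row per example
```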

What is the *activation* function of a neural network?

The logistic function (as in classification) is also called a *sigmoid (logistic) activation function*.

Your learning algorithm is MapReduceable if [...]

Your learning algorithm is MapReduceable if it can be expressed as computing sums of functions over the training set

% row, column indices for values matching comparison

[r,c] = find(A>=7)

% val - maximum element of the vector a; ind - index at which the maximum occurs

[val,ind] = max(a)

% checks which values in a are less than 3

a < 3

% Displaying them:

a = pi

% comma-chaining function calls.

a=1,b=2,c=3

% To add the path for the current session of Octave:

addpath('/path/to/function/')

% change axis scale

axis([0.5 1 -1 1]);

The relationship of λ to the training set and the cross validation set is as follows: Large λ

both Jtrain(Θ) and JCV(Θ) will be high (underfitting /high bias)

High bias (underfitting):

both Jtrain(Θ) and JCV(Θ) will be high. Also, JCV(Θ)≈Jtrain(Θ)

Recall some of the parameters we used in our K-means algorithm!

• c(i) = index of cluster (1,2,...,K) to which example x(i) is currently assigned. • μk = cluster centroid k (μk∈ℝn). • μc(i) = cluster centroid of cluster to which example x(i) has been assigned.

% clear command without any args clears all vars

clear q1y

How are the terms cost1(z) and cost0(z) in an SVM defined?

cost1(z) is the cost for classifying when y=1, and cost0(z) is the cost for classifying when y=0

% gives location of elements less than 3

find(a < 3)

Which function returns the values for jVal and gradient in a single turn?

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Octave's functions can return more than one value

function [y1, y2] = squareandCubeThisNo(x)

How can we interpret the output of our logistic function?

h of theta of a given input variable gives us the probability that our output is 1.

% plot histogram using 10 bins (default)

hist(w)

% plot histogram using 50 bins

hist(w,50)

% "hold off" to turn off

hold on;

When is the F1 score not defined?

If an algorithm predicts only negatives, as it does in one of the exercises, the precision is not defined because computing it would require dividing by 0. The F1 score is then undefined as well.

% size of longest dimension

length(v)

%% loading data

load q1y.dat % alternatively, load('q1y.dat')

% functions like this operate element-wise on vecs or matrices

log(v)

% maximum along columns(defaults to columns - max(A,[]))

max(A,[],1)

% maximum along rows

max(A,[],2)

What is the convention when using regularization in SVMs?

Convention dictates that we regularize using a factor C, instead of λ, like so: This is equivalent to multiplying the equation by C=1/λ.
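The formula is missing from this card. A reconstruction in the usual notation, assuming the 1/m factor is dropped (it does not change the minimizing θ) and C takes the place of 1/λ, would be:

```latex
J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
```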

% Matrix inverse (pseudo-inverse)

pinv(A)

% save variable v into file hello.mat

save hello.mat v;

% save as ascii

save hello.txt v -ascii;

% To remember the path for future sessions of Octave, after executing addpath above, also do:

savepath

% number of rows

size(A,1)

% number of cols

size(A,2)

% Divide plot into 1x2 grid, access 1st element

subplot(1,2,1);

% Divide plot into 1x2 grid, access 2nd element

subplot(1,2,2);

% 1x2 matrix: [(number of rows) (number of columns)]

sz = size(A)

%% plotting

t = [0:0.01:0.98];

How do we find the "similarity" of x and some landmark l(i)?

the "similarity" of x and some landmark l(i):
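The formula is missing from this card. Assuming the Gaussian kernel, which is the usual choice for this similarity, f = exp(-||x - l||^2 / (2σ^2)). A minimal Octave sketch (the function handle name and the sample points are made up for illustration):

```octave
% Gaussian kernel: similarity between a point x and a landmark l.
gaussianSimilarity = @(x, l, sigma) exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));

x = [1; 2];
f_same = gaussianSimilarity(x, [1; 2], 1);    % x sits on the landmark
f_far  = gaussianSimilarity(x, [10; 20], 1);  % x is far from the landmark
```

Here f_same is exactly 1 and f_far is very close to 0, matching the two properties of the similarity function.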

Bad use of PCA?

Trying to prevent overfitting. We might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but is not recommended because it does not consider the values of our results y. Using just regularization will be at least as effective.

% same as v + 1 (when v is a column vector)

v + ones(length(v), 1)

% from 1 to 2, with stepsize of 0.1. Useful for plot axes

v = 1:0.1:2

% from 1 to 6, assumes stepsize of 1 (row vector)

v = 1:6

% first 10 elements of q1x (counts down the columns)

v = q1x(1:10);

% if A is matrix, returns max from each column

val = max(A)

% (mean = -6, var = 10) - note: add the semicolon

w = -6 + sqrt(10)*(randn(1,10000));

% 1x3 vector of ones

w = ones(1,3)

% drawn from a uniform distribution

w = rand(1,3)

% drawn from a normal distribution (mean=0, var=1)

w = randn(1,3)

Given costFunction(), what do we have to do to implement fminunc()?

We can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".
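A minimal sketch of the call, assuming a toy cost J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2 whose minimum is at [5; 5]. The cost itself and the MaxIter value are illustrative choices.

```octave
1;  % marks this file as a script so the function below can be defined inline

function [jVal, gradient] = costFunction(theta)
  % Toy cost with its minimum at theta = [5; 5]
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

optTheta should end up close to [5; 5] and functionVal close to 0.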

% list variables in workspace

who

% list variables in workspace (detailed view)

whos

How can we break down our decision process *deciding what to do next*? #6

• *Getting more training examples*: Fixes high variance. • *Trying smaller sets of features*: Fixes high variance. • *Adding features*: Fixes high bias. • *Adding polynomial features*: Fixes high bias. • *Decreasing lambda*: Fixes high bias. • *Increasing lambda*: Fixes high variance.

What is the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis? #2

• *High bias (underfitting)*: both J train(Θ) and J CV(Θ) will be high. Also, J CV(Θ) is approximately equal to J train(Θ). • *High variance (overfitting)*: J train(Θ) will be low and J CV(Θ) will be much greater than J train(Θ).

What are alternative terms of a Cost Function? #2

• *Squared error function*. • *Mean squared error*.

How does gradient descent converge with a *fixed* step size alpha? #2

• As we approach a local minimum, gradient descent will take smaller steps. • Thus no need to decrease alpha over time.

How do we implement an *iteration step* when calculating Gradient Descent in code? #2

• At each iteration j, one should simultaneously update the parameters. • Updating a specific parameter prior to calculating another one on the j-th iteration would yield a wrong implementation.
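A sketch of a simultaneous update for linear regression with two parameters. The toy data (generated from y = 1 + 2x), the learning rate, and the iteration count are assumptions for illustration.

```octave
% Hypothetical data from y = 1 + 2x; the first column of X is the bias feature x0 = 1.
X = [ones(4, 1) (1:4)'];
y = [3; 5; 7; 9];
m = length(y);

theta = zeros(2, 1);
alpha = 0.1;

for iter = 1:2000
  h = X * theta;   % predictions computed once from the OLD theta
  % Compute both new values before assigning either one.
  temp0 = theta(1) - alpha * (1 / m) * sum((h - y) .* X(:, 1));
  temp1 = theta(2) - alpha * (1 / m) * sum((h - y) .* X(:, 2));
  theta = [temp0; temp1];   % simultaneous update
end
```

Assigning theta(1) before computing temp1 from a fresh prediction would mix old and new parameters within one step, which is the mistake the card warns about.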

What is the *Automatic convergence test* in gradient descent? #2

• Declare convergence if J of theta decreases by less than E in one iteration, where E is some small value such as 0.001. • However in practice it's difficult to choose this threshold value.

What are the ideal ranges of our input variables in gradient descent? #2

• For example a range between minus 1 and 1. • These aren't exact requirements; we are only trying to speed things up.

What is *batch gradient descent*? #2 (Gradient Descent For Linear Regression)

• Gradient descent on the original cost function J. • This method looks at every example in the entire training set on every step.

How can the step parameter alpha in gradient descent cause bugs? #2

• If alpha is too small: slow convergence. • If alpha is too large: may not decrease on every iteration and thus may not converge.

Plot the cost function, if the correct answer for y is 0. #2

• If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. • If our hypothesis approaches 1, then the cost function will approach infinity.

Plot the cost function J, if the correct answer for y is 1.

• If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. • If our hypothesis approaches 0, then the cost function will approach infinity.

What important thing should one keep in mind if one changes the form of a hypothesis function? (Multivariate Linear Regression) #2

• If you create new features when doing polynomial regression then *feature scaling* becomes very important. • For example, if x has range 1 - 1000 then range of x^2 becomes 1 - 1000000.

What is *feature scaling*? #2

• Involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable. • Results in a new range of just 1.

What is *mean normalization*? #2

• Involves subtracting the average value for an input variable from the values for that input variable. • Results in a new average value for the input variable of just zero.
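Feature scaling and mean normalization are often combined into one step, x := (x - mean) / range. A small Octave sketch with a hypothetical feature vector:

```octave
% Hypothetical raw feature values (e.g. house sizes).
x = [100; 200; 300; 400; 500];

mu = mean(x);            % average value of the feature
s  = max(x) - min(x);    % range of the feature (max minus min)

x_norm = (x - mu) / s;   % new mean is 0, new range is 1
```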

How are the variables L, s of l and K in the cost function of a neural network defined? #3

• L = total number of layers in the network. • s of l = number of units (not counting bias unit) in layer l. • K = number of output units.

How can you *debug* gradient descent? #3

• Make a plot with number of iterations on the x-axis. • Now plot the cost function J of theta over the number of iterations of gradient descent. • If J of theta ever increases, then you probably need to decrease alpha.

What is *gradient descent* for our simplified cost function? (Logistic Regression Model) #2

• Notice that this algorithm is identical to the one we used in linear regression. • We still have to simultaneously update all values in theta.

Give the setup of using a neural network. #4

• Pick a network *architecture*. • Choose the *layout* of your neural network. • Number of input units; dimension of features x i. • Number of output units; number of classes. • Number of hidden units per layer; usually the more the better.

What are common causes for X Transpose X to be *noninvertible*? #2

• Redundant features, where two features are very closely related (i.e. they are linearly dependent). • Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".

How do we determine the dimension of the matrices of weights? (Neural Network) #2

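• If a network has s of j units in layer j and s of (j+1) units in layer j+1, then Theta j has dimension s of (j+1) by (s of j + 1). • The +1 comes from the addition of the "bias nodes." • In other words, the output nodes will not include the bias nodes while the inputs will.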

What is the *decision boundary* given a logistic function? #2

• The decision boundary is the line that separates the area where y = 0 and where y = 1. • It is created by our hypothesis function.

What is the Gradient Descent for Multiple Variables? #2

• The gradient descent equation itself is generally the same form. • we just have to repeat it for our 'n' features.

What is the *bias unit* of a neural network? #2

• The input node x0 is sometimes called the "bias unit." • It is always equal to 1.

What is a visual interpretation of the cost function? #2

• The training data set is scattered on the X-Y plane. • We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.

Why does *feature scaling* speed up gradient descent? #2

• This is because theta will descend quickly on small ranges and slowly on large ranges. • Thus it will oscillate inefficiently down to the optimum when the variables are very uneven.

Why should we adjust the parameter alpha when using gradient descent? #2

• To ensure that the gradient descent algorithm converges in a reasonable time. • Failure to converge or too much time to obtain the minimum value implies that our step size is wrong.

What is the implementation of *One-versus-all* in Multiclass Classification? #2

• Train a logistic regression classifier h of theta for each class to predict the probability that y = i . • To make a prediction on a new x, pick the class that maximizes h of theta.
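The prediction step can be sketched in Octave as below. The parameter matrix all_theta is hand-picked rather than trained, purely to make the example runnable; each of its rows plays the role of one class's classifier.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical parameters, one row per class: [theta0 theta1 theta2]
all_theta = [  5 -10   0;
              -5  10 -10;
             -15   0  10];

% Three new examples, each row [bias x1 x2]
X = [1 0 0; 1 1 0; 1 1 2];

% One column of probabilities per class; pick the most probable class per row.
probs = sigmoid(X * all_theta');
[~, pred] = max(probs, [], 2);
```

With these values, the three examples are assigned to classes 1, 2, and 3 respectively.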

Which function do we want to use in octave when implementing the normal equation? #2

• Use the 'pinv' function rather than 'inv'. • The 'pinv' function will give you a value of theta even if X Transpose X is not invertible.
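A minimal Octave sketch of the normal equation with pinv, using hypothetical data generated from y = 1 + 2x:

```octave
% Hypothetical data; the first column of X is the bias feature x0 = 1.
X = [ones(4, 1) (1:4)'];
y = [3; 5; 7; 9];

% Normal equation: theta = (X'X)^(-1) X'y, with pinv in place of inv.
theta = pinv(X' * X) * X' * y;
```

theta comes out as approximately [1; 2], the intercept and slope used to generate the data.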

How do we obtain the values for each of the activation nodes, given a single-layer neural network with 3 activation nodes and a 4-dimensional input? #2

• We apply each row of the parameters to our inputs to obtain the value for one activation node. • Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by the parameter matrix theta 2.

How can we improve our features? (Multivariate Linear Regression) #2

• We can *combine* multiple features into one. • For example, we can combine x1 and x2 into a new feature x3 by taking x1 times x2.

What is the algorithm for implementing gradient descent for *linear regression*? #2

• We can substitute our actual cost function and our actual hypothesis function. • m is the size of the training set, theta 0 a constant that will be changing simultaneously with theta 1 and x, y are values of the given training set (data).

What is the intuition of the multivariable form of a hypothesis function in the example of estimating housing prices? #2

• We can think about theta 0 as the basic price of a house, theta 1 as the price per square meter, theta 2 as the price per floor, etc. • x1 will be the number of square meters in the house, x2 the number of floors, etc.

What does the *cost function* for logistic regression look like? #2

• We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. • In other words, it will not be a convex function.

How do we label the hidden layers of a neural network? #2

• We label these intermediate or hidden layer nodes. • The nodes are also called *activation units*.

Depict the graphical implementation of minimizing the cost function using gradient descent. #2

• We put theta 0 on the x axis and theta 1 on the y axis, with the cost function on the vertical z axis. • The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.

Which problems can be solved with unsupervised learning? #2

• approach problems with little or no idea what our results should look like. • derive structure from data where we don't necessarily know the effect of the variables.

Depict an example of gradient descent as it is run to minimize a quadratic function. #2

• Shown is the trajectory taken by gradient descent, which was initialized at (48, 30). • The x's in the figure (joined by straight lines) mark the successive values of theta that gradient descent went through as it converged to its minimum.

State the *cost function* for neural networks. #3

• The double sum simply adds up the logistic regression costs calculated for each cell in the output layer. • The triple sum simply adds up the squares of all the individual thetas in the entire network. • The i in the triple sum does not refer to training example i.

How can we approach regularization using the alternate method of the non-iterative normal equation?

To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
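The equation is missing from this card. Assuming the usual form theta = (X'X + λL)^(-1) X'y, where L is the identity matrix with its top-left entry set to 0 so that theta 0 is not regularized, a short Octave sketch with hypothetical data looks like this:

```octave
% Hypothetical data from y = 1 + 2x; the first column of X is the bias feature.
X = [ones(4, 1) (1:4)'];
y = [3; 5; 7; 9];
lambda = 1;   % illustrative regularization strength

% L is the identity with L(1,1) = 0, so the bias parameter is not penalized.
L = eye(size(X, 2));
L(1, 1) = 0;

theta_reg = pinv(X' * X + lambda * L) * X' * y;
```

With lambda = 1, the slope shrinks from 2 to about 1.67, which shows the regularization term pulling the non-bias parameter toward zero.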

To make a support vector machine, we will modify the first term of the cost function so that [...]

To make a support vector machine, we will modify the first term of the cost function so that when θTx (from now on, we shall refer to this as z) is greater than 1, it outputs 0.

How can we solve the problem of selecting a model with only the training set?

To solve this, we can introduce a third set, the Cross Validation Set, to serve as an intermediate set that we can train d with. Then our test set will give us an accurate, non-optimistic error.

