Lecture 5-7: Intro to Neural Networks

Predicting what letter grade a student will get in Stats 315

a classification problem because it's about predicting letter grades (A, B, C, etc.), which are discrete categories.

Stable learning rates...

converge smoothly and avoid local minima

A small learning rate...

converges slowly and gets stuck in false local minima

A large learning rate...

overshoots the minimum, becomes unstable, and can diverge
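
A toy sketch of how step size changes the behavior of gradient descent. The function f(w) = w**2, the starting point, and the three learning rates are assumptions chosen for illustration; this convex example has no local minima, so it only shows the slow, stable, and unstable regimes.

```python
def run_descent(lr, w=1.0, steps=20):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w  # step against the gradient
    return w

small = run_descent(lr=0.001)  # barely moves away from the start
stable = run_descent(lr=0.1)   # shrinks steadily toward the minimum at 0
large = run_descent(lr=1.5)    # every step overshoots, so |w| keeps growing
```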

When is a softmax activation function used?

In the output layer of multiclass classification models: it turns the raw outputs into a probability distribution over the classes, and is typically paired with cross-entropy loss and one-hot encoded labels
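
A minimal sketch of softmax followed by cross-entropy against a one-hot label, using only the standard library. The logits and the label are made-up values; subtracting the maximum logit is a common numerical-stability trick, not something the notes specify.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Loss is -log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

probs = softmax([2.0, 1.0, 0.1])        # hypothetical logits for 3 classes
loss = cross_entropy(probs, [1, 0, 0])  # true class is the first one
```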

Dropout as Regularization

- During training, randomly set some activations to 0
- Typically drop 50% of the activations in a layer
- Forces the network not to rely on any one node
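
A rough sketch of dropout on a plain list of activations. It assumes the common "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; the notes do not say which variant the course uses, and the function name and parameters are hypothetical.

```python
import random

def dropout(activations, rate=0.5, training=True):
    """Zero each activation with probability `rate` during training only."""
    if not training:
        return list(activations)  # at inference time nothing is dropped
    keep = 1.0 - rate
    # Assumption: inverted dropout, so kept values are scaled by 1/keep.
    return [a / keep if random.random() < keep else 0.0 for a in activations]
```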

When is a sigmoid activation function used?

- In the output layer of binary classification models
- Sigmoid maps the raw output (logit) to the probability of belonging to one of the two classes
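
A small sketch of a sigmoid output unit, assuming a decision threshold of 0.5 (a common default, not stated in the notes). The two-branch form is only there to avoid overflow for large negative logits.

```python
import math

def sigmoid(z):
    """Squash a logit into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe for very negative z
    return e / (1.0 + e)

def predict_label(z, threshold=0.5):
    """Turn the probability into a 0/1 class decision."""
    return 1 if sigmoid(z) >= threshold else 0
```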

Using Mini-batches while training

- More accurate estimate of the gradient than a single example gives
- Smoother convergence, which allows larger learning rates
- Fast training
- Computation can be parallelized, giving significant speed-ups on GPUs

Adaptive learning rates

- Learning rates are no longer fixed
- They can be made larger or smaller depending on:
  1. how large the gradient is
  2. how fast learning is happening
  3. the size of particular weights
  4. etc.

How would you use both ReLU and softmax?

- ReLU is often used as the activation function in the hidden layers
- Softmax is often used in the output layer to compute class probabilities

Predicting the price of a house in Ann Arbor

An example of a regression problem. Regression problems involve predicting a continuous, numerical value as the output

Difference between gradient descent and stochastic gradient descent?

Both compute the gradient of the cost function with respect to the model parameters. Gradient descent uses the entire dataset in each iteration while stochastic gradient descent uses only a single training example.
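
To make the difference concrete, here is a sketch that computes the same gradient on the full dataset, on one random example, and on a mini-batch. The one-parameter model y = w*x, the data, and the batch size of 2 are assumptions for illustration only.

```python
import random

def gradient(w, batch):
    """dJ/dw for J = mean((w*x - y)**2) over the given (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical data generated from y = 3x.
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = 0.0

full = gradient(w, data)                     # gradient descent: every example
single = gradient(w, [random.choice(data)])  # stochastic: one example
mini = gradient(w, random.sample(data, 2))   # mini-batch: a small random subset
```

The single-example estimate is cheap but noisy; the mini-batch estimate sits between the two in both cost and noise.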

Quantifying Loss

Calculate the loss of our network by measuring the cost incurred from incorrect predictions

Mean Squared Error Loss

Can be used with regression models that output continuous real numbers
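
A minimal sketch of mean squared error on plain Python lists; the function name is made up for illustration.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```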

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1
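
A sketch of binary cross-entropy for 0/1 labels and predicted probabilities. The clipping constant `eps` is an assumption added to keep log(0) from occurring; it is not part of the definition.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.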

What is gradient descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a cost function using gradient descent, one takes steps proportional to the negative of the gradient (partial derivative or tangent) of the function at the current point.

Why do we need regularization

Improve the generalization of our model on unseen data

Importance of Activation Functions

Introduce non-linearities into the network
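
A sketch of why the non-linearity matters: two stacked linear layers collapse into a single linear function, while inserting ReLU between them does not. The weights and biases below are arbitrary illustrative values.

```python
def relu(z):
    return max(0.0, z)

def two_layer(x, activation=None):
    """Hidden layer h = 2x - 1, output layer y = -3h + 0.5."""
    h = 2.0 * x - 1.0
    if activation is not None:
        h = activation(h)
    return -3.0 * h + 0.5

# Without an activation the whole network is just y = -6x + 3.5.
linear_outputs = [two_layer(x) for x in (-2.0, 0.0, 3.0)]
# With ReLU the output is flat for x <= 0.5 and sloped afterwards.
relu_outputs = [two_layer(x, relu) for x in (-2.0, 0.0, 3.0)]
```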

What would the output look like without a non-linear activation function?

The network would only be a linear function of its inputs, so the decision boundary is a straight line: the perceptron decides whether an input falls on one side of the line or the other.

Loss Optimization Steps

Randomly pick an initial (w0, w1). Compute the gradient, dJ(W)/dW. Take a small step in the opposite direction of the gradient. Repeat until convergence (gradient descent).
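
The steps above can be sketched as a short loop. This assumes a linear model y = w0 + w1*x with mean squared error as J; the data, learning rate, tolerance, and seed are illustrative choices, not values from the lecture.

```python
import random

def loss_gradient(w0, w1, data):
    """Gradient of J = mean((w0 + w1*x - y)**2) with respect to (w0, w1)."""
    n = len(data)
    g0 = sum(2 * (w0 + w1 * x - y) for x, y in data) / n
    g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data) / n
    return g0, g1

def gradient_descent(data, lr=0.05, tol=1e-8, max_steps=10000, seed=0):
    rng = random.Random(seed)
    w0, w1 = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    for _ in range(max_steps):
        g0, g1 = loss_gradient(w0, w1, data)         # compute the gradient
        if g0 * g0 + g1 * g1 < tol:                  # converged: gradient ~ 0
            break
        w0, w1 = w0 - lr * g0, w1 - lr * g1          # step against the gradient
    return w0, w1

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 1 + 2x
w0, w1 = gradient_descent(data)
```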

What are three common activation functions?

Sigmoid function: tf.nn.sigmoid(z), Hyperbolic tangent: tf.nn.tanh(z), Rectified Linear Unit: tf.nn.relu(z)

Early Stopping Regularization

Stop training once the loss on held-out validation data stops improving, before the model starts to overfit the training data
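
One common way to implement this is to track the validation loss and stop after it has failed to improve for a few epochs. The sketch below assumes that approach; the `patience` parameter and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights we would keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss  # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                # no improvement for `patience` epochs
    return best_epoch

# Validation loss falls, then rises as the model begins to overfit.
val = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best = early_stopping_epoch(val)
```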

What is regularization

Technique that constrains our optimization problem to discourage complex models

Empirical Loss

The empirical loss measures the total loss over our entire dataset, usually reported as the average loss per example

Equation for the Perceptron: Forward Propagation

The output is obtained by applying a non-linear activation function g to the bias plus a linear combination of the inputs: y = g(w0 + sum_i x_i * w_i)
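
A minimal sketch of one perceptron's forward pass. Using tanh as the activation and the specific weights and inputs below are assumptions for illustration; any of the usual activation functions could be passed as `g`.

```python
import math

def perceptron(x, w, w0, g=math.tanh):
    """Forward pass: activation of (bias + weighted sum of inputs)."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return g(z)

# z = 1 + (-1)(3) + (2)(-2) = -6
y = perceptron(x=[-1.0, 2.0], w=[3.0, -2.0], w0=1.0)
```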

The Perceptron

The structural building block of deep learning

Predicting whether or not it will rain today

a binary classification problem (predicting "yes" or "no") and not a regression problem.

Loss Optimization

We want to find the network weights that achieve the lowest loss

The Problem of Overfitting

When the model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

Which step of gradient descent can be computationally expensive?

Computing the gradient, because it requires a pass over the entire dataset on every iteration.

Predicting the type of species a flower is (assuming there are only 3)

a classification problem because it involves predicting a discrete class label (the type of species).

