Lecture 5-7: Intro to Neural Networks
Predicting what letter grade a student will get in Stats 315
a classification problem because it's about predicting letter grades (A, B, C, etc.), which are discrete categories.
Stable learning rates...
converge smoothly and avoid local minima
A small learning rate...
converges slowly and gets stuck in false local minima
A large learning rate...
overshoots, becomes unstable, and diverges
when is a softmax activation function used
in the output layer for multiclass classification tasks: softmax converts the raw outputs into a probability distribution over the classes, and is typically paired with cross-entropy loss and one-hot encoded labels
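The usual definition, for reference (K is the number of classes and z is the vector of raw outputs):

```latex
\operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K
```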
Dropout Regularization
- During training, randomly set some activations to 0
- Typically drop 50% of activations in a layer
- Forces the network to not rely on any one node (see the sketch below)
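A minimal Keras sketch of dropout (the layer sizes and the 0.5 rate are illustrative, not from the lecture):

```python
import tensorflow as tf

# Dropout zeroes a random 50% of the previous layer's activations
# on each training step; it is automatically disabled at inference.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```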
when is a sigmoid classification function used
- In the output layer of binary classification
- Sigmoid maps the raw output (logit) to the probability of belonging to one of the two classes (see the sketch below)
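A tiny illustration (the logit values are made up):

```python
import tensorflow as tf

# Sigmoid squashes each raw logit into (0, 1), interpreted as P(y = 1 | x).
logits = tf.constant([[2.0], [-1.0]])
probs = tf.nn.sigmoid(logits)  # approx. [[0.88], [0.27]]
```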
Using Mini-batches while training
- More accurate estimation of the gradient than single-example SGD
- Smoother convergence allows for larger learning rates
- Mini-batches lead to faster training
- Can parallelize computation and achieve significant speedups on GPUs (see the sketch below)
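A minimal sketch of mini-batching with tf.data (the data, shapes, and batch size of 32 are illustrative):

```python
import tensorflow as tf

# Placeholder data: 1024 examples with 10 features each.
x_train = tf.random.normal([1024, 10])
y_train = tf.random.uniform([1024], maxval=2, dtype=tf.int32)

# Shuffle, then group into mini-batches; each gradient update
# averages the gradient over one batch of 32 examples.
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(1024)
           .batch(32))

for x_batch, y_batch in dataset:
    pass  # one gradient update per mini-batch would go here
```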
Adaptive learning rates
- Learning rates are no longer fixed (see the optimizer sketch below)
- Can be made larger or smaller depending on:
  1. how large the gradient is
  2. how fast learning is happening
  3. the size of particular weights
  4. etc.
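Adaptive optimizers such as Adam implement this idea; a one-line sketch (the 1e-3 value is an illustrative default):

```python
import tensorflow as tf

# Adam keeps running estimates of each weight's gradient and its
# magnitude, so the effective step size adapts per weight over time.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```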
how would you use both relu and softmax
- ReLU is often used as the activation function in the hidden layers
- Softmax is often used in the output layer to compute class probabilities (see the sketch below)
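A minimal sketch combining the two (the layer sizes and the 3 output classes are illustrative):

```python
import tensorflow as tf

# ReLU non-linearities in the hidden layers; softmax on the output
# layer turns the final logits into class probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # e.g. 3 flower species
])
```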
Predicting the price of a house in Ann Arbor
An example of a regression problem. Regression problems involve predicting a continuous, numerical value as the output
Difference between gradient descent and stochastic gradient descent?
Both compute the gradient of the cost function with respect to the model parameters. Gradient descent uses the entire dataset to compute each update, while stochastic gradient descent uses only a single training example.
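A schematic contrast (not lecture code; grad is a hypothetical function returning the gradient dJ/dw on the given data):

```python
import numpy as np

def gd_update(w, X, y, lr, grad):
    # Gradient descent: one update using the ENTIRE dataset.
    return w - lr * grad(w, X, y)

def sgd_epoch(w, X, y, lr, grad):
    # Stochastic gradient descent: one update per single example.
    for i in np.random.permutation(len(X)):
        w = w - lr * grad(w, X[i:i+1], y[i:i+1])
    return w
```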
Quantifying Loss
Calculate the loss of our network by measuring the cost incurred from incorrect predictions
Mean Squared Error Loss
Can be used with regression models that output continuous real numbers
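The standard form over n examples, where f(x; W) is the model's prediction:

```latex
J(W) = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f\left(x^{(i)}; W\right)\right)^{2}
```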
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1
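The standard form for binary labels y in {0, 1}, where f(x; W) is the predicted probability:

```latex
J(W) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y^{(i)} \log f\left(x^{(i)}; W\right) + \left(1 - y^{(i)}\right)\log\left(1 - f\left(x^{(i)}; W\right)\right) \right]
```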
What is gradient descent
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a cost function using gradient descent, one takes steps proportional to the negative of the gradient (the vector of partial derivatives) of the function at the current point.
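The update rule, with η denoting the learning rate:

```latex
W \leftarrow W - \eta \, \frac{\partial J(W)}{\partial W}
```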
Why do we need regularization
Improve the generalization of our model on unseen data
Importance of Activation Functions
Introduce non-linearities into the network
What would the output look like if we didn't apply the non-linear activation function?
The output would be purely linear: the perceptron would take in an input and simply decide whether it falls on the right or the left side of a straight line.
Loss Optimization Steps
1. Initialize: randomly pick an initial (w0, w1)
2. Loop until convergence:
3. Compute the gradient, ∂J(W)/∂W
4. Take a small step in the opposite direction of the gradient
5. Return the weights (gradient descent)
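A minimal sketch of this loop using TensorFlow's automatic differentiation (the toy loss, initialization, learning rate, and iteration count are illustrative):

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal([2]))    # step 1: random initial (w0, w1)
lr = 0.01

def loss_fn(w):
    return tf.reduce_sum((w - 3.0) ** 2)  # toy convex loss, minimum at (3, 3)

for _ in range(1000):                     # step 2: loop "until convergence"
    with tf.GradientTape() as tape:
        loss = loss_fn(w)
    grad = tape.gradient(loss, w)         # step 3: compute the gradient
    w.assign_sub(lr * grad)               # step 4: step opposite the gradient
```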
What are three common activation functions?
- Sigmoid function: tf.nn.sigmoid(z)
- Hyperbolic tangent: tf.nn.tanh(z)
- Rectified linear unit (ReLU): tf.nn.relu(z)
Early Stopping Regularization
Stop training before the model starts to overfit, typically when the validation loss stops improving even though the training loss keeps decreasing (see the sketch below)
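In Keras this is usually done with a callback; a minimal sketch (the monitor, patience, and fit arguments are illustrative):

```python
import tensorflow as tf

# Halt training once validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```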
What is regularization
Technique that constrains our optimization problem to discourage complex models
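One common instance is an L2 penalty on the weights; a minimal Keras sketch (the 0.01 coefficient is an illustrative choice):

```python
import tensorflow as tf

# The L2 term adds 0.01 * sum(w**2) to the loss, discouraging
# large weights and hence overly complex models.
layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))
```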
Empirical Loss
The empirical loss measures the total loss over our entire dataset
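Written out, with L the per-example loss and f(x; W) the network's prediction:

```latex
J(W) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\left(f\left(x^{(i)}; W\right),\, y^{(i)}\right)
```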
Equation for the Perceptron: Forward Propagation
The output is a non-linear activation function applied to the sum of a bias term and a linear combination of the inputs
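In symbols, with g the non-linear activation, w0 the bias, and m inputs:

```latex
\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i\, w_i\right)
```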
The Perceptron
The structural building block of deep learning
Predicting whether or not it will rain today
a binary classification problem (predicting "yes" or "no") and not a regression problem.
Loss Optimization
We want to find the network weights that achieve the lowest loss
The Problem of Overfitting
When the model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
What step in gradient descent can be very computationally expensive?
Computing the gradient in step 3: it requires a pass over the entire dataset on every iteration.
Predicting the type of species a flower is (assuming there are only 3)
a classification problem because it involves predicting a discrete class label (the type of species).