Machine Learning
What is the use of gradient check?
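Gradient checking verifies a backpropagation implementation by comparing its gradients with numerical estimates.
- Do a forward pass with epsilon added to a parameter
- Do a forward pass with epsilon subtracted from the same parameter
- Divide the difference between the two losses by 2 x epsilon. If this number is close to the gradient our program gives us, then we'll know there most likely isn't a bug.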
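A minimal sketch of a gradient check, assuming a centered finite difference. The toy `loss` and its hand-derived `analytic_gradient` are hypothetical stand-ins for a real network and its backpropagation output.

```python
def numerical_gradient(f, params, eps=1e-5):
    """Estimate the gradient of f at params with centered finite differences."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps    # forward pass with epsilon added
        minus[i] -= eps   # forward pass with epsilon subtracted
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


def loss(w):
    # Hypothetical toy loss: w0^2 + 3 * w1
    return w[0] ** 2 + 3 * w[1]


def analytic_gradient(w):
    # Hand-derived gradient of the toy loss (plays the role of backprop)
    return [2 * w[0], 3.0]


w = [1.5, -2.0]
numeric = numerical_gradient(loss, w)
analytic = analytic_gradient(w)
max_diff = max(abs(n - a) for n, a in zip(numeric, analytic))
print("gradients agree:", max_diff < 1e-6)
```

A large `max_diff` would point to a bug in the analytic gradient, although a small one is evidence rather than proof of correctness.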
How would you initialize weights?
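- Don't initialize all weights to zero: with activations such as tanh the gradients will also be zero, so there is no update
- Don't initialize all weights to the same value: units in the same layer would receive identical gradients and stay identical
- Initialize with small random values, e.g. a Gaussian distribution centered at zero, to break the symmetry. We want gradients that stay on roughly the same scale across layers.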
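As an illustration, here is a small sketch of zero-centered Gaussian initialization. The helper name `init_weights`, the standard deviation of 0.01, and the fixed seed are assumptions made for the example, not recommended values.

```python
import random


def init_weights(n_in, n_out, std=0.01, seed=0):
    """Return an n_out x n_in weight matrix drawn from N(0, std^2)."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


W = init_weights(n_in=4, n_out=3)
# Rows differ from one another, so units in the layer do not start out identical
print(W[0] != W[1])
```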
Give two examples of regularization methods.
- L1 regularization: adds the sum of the absolute values of the weights to the objective and encourages weights to be exactly 0. This results in sparse weights.
- L2 regularization: adds the sum of the squared values of the weights to the objective. It shrinks all weights toward zero without forcing them to be exactly zero. For every weight w in the network, we add the term 1/2 (λw^2) to the objective.
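The two penalties can be sketched as follows. The function names and the example values of `weights` and `lam` are chosen only for illustration.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Sum over weights of 1/2 * lam * w^2."""
    return 0.5 * lam * sum(w * w for w in weights)


weights = [0.5, -1.0, 0.0, 2.0]
lam = 0.1
l1 = l1_penalty(weights, lam)
l2 = l2_penalty(weights, lam)
print(l1, l2)
```

Either penalty is added to the data loss before gradients are computed.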
What are some difficulties that may arise in training neural nets?
- Vanishing signal: if the weights are small, values close to 0 get propagated across the network
- Saturated units: the derivative of a saturated activation is close to 0, so gradients get multiplied by roughly 0 and the learning signal is weak
What is Batch Gradient descent?
- Vanilla gradient descent, aka batch gradient descent: each update uses the gradient computed over the entire training set
- Make a small change in the weights in the direction that most rapidly improves task performance
- Good for small datasets
What is a neural network?
A neural network is a machine learning model loosely inspired by the functioning of the human brain. An input vector (the input layer) is transformed into a hidden layer of hidden units, each computed by a linear transformation followed by a non-linearity. We compose such layers over and over until we reach the output layer, which, for classification, gives the class of the input.
What is lambda?
A hyperparameter that controls how much attention the learning process should pay to the regularization penalty. It is commonly chosen between 0.0 (no penalty) and 1.0 (full penalty).
What is an artificial neuron?
An artificial neuron first computes a pre-activation: a linear transformation of the input with a weight vector and a bias. It then applies an activation function to the pre-activation.
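A single neuron can be sketched in a few lines. The function name `neuron` and the choice of tanh as the default activation are assumptions for the example.

```python
import math


def neuron(x, w, b, activation=math.tanh):
    """Compute activation(w . x + b) for one artificial neuron."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation
    return activation(a)


# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and tanh(0) = 0
out = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(out)
```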
What is an activation function?
An activation function is a function that introduces non-linearity into a model. It is applied to the pre-activation "a" of the neuron. Examples: sigmoid, tanh.
What is overfitting?
Creating a model that matches the training data so well that the model fails to make correct predictions on new data. Essentially, the model has learned to describe random error or noise instead of the underlying relationship. e.g. cats/images/sofa
What is early stopping?
Early stopping means training the model while monitoring the validation loss, and stopping when the validation loss starts to increase. Some libraries let you define a patience: the number of additional epochs to keep running without improvement before training actually stops.
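A rough sketch of the stopping rule, assuming patience counts consecutive epochs without a new best validation loss (libraries differ in the exact definition). The function name and the list of losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

In practice one would also keep a copy of the weights from the best epoch and restore them after stopping.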
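What is the difference between the sigmoid and softmax functions?
Both output values in [0, 1], but sigmoid squashes each value independently, so its outputs need not sum to 1. Softmax normalizes across all classes, so its outputs sum to 1 and form a probability distribution.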
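The difference shows up when both functions are applied to the same scores. This sketch uses made-up logits; subtracting the maximum inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sig = [sigmoid(z) for z in logits]  # each value squashed on its own
soft = softmax(logits)              # values normalized across classes
print(sum(sig), sum(soft))
```

The sigmoid outputs sum to more than 1 here, while the softmax outputs sum to 1.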
Why choose SGD over batch gradient descent?
Feeding the entire dataset o
What is gradient descent?
Gradient descent is an iterative optimization algorithm. To find a local minimum of a cost function, one takes steps proportional to the negative of the gradient (the vector of partial derivatives) of the function at the current point. We calculate the descent direction and then update the weights:
Delta = -Grad(loss + lambda x regularization term) = -(Grad of loss + (lambda x Grad of regularization function))
Weight = Weight + (alpha x Delta)
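The update rule can be sketched on a one-dimensional problem. The quadratic loss (w - 3)^2, the regularizer 1/2 w^2, and the values of `alpha`, `lam`, and `steps` are assumptions picked so the minimum is easy to verify by hand.

```python
def gradient_descent(grad_loss, grad_reg, w, alpha=0.1, lam=0.0, steps=100):
    """Repeat: delta = -(grad of loss + lam * grad of regularizer); w += alpha * delta."""
    for _ in range(steps):
        delta = -(grad_loss(w) + lam * grad_reg(w))
        w = w + alpha * delta
    return w


def grad_loss(w):
    return 2 * (w - 3)  # gradient of the toy loss (w - 3)^2


def grad_reg(w):
    return w            # gradient of the regularizer 1/2 * w^2


w_plain = gradient_descent(grad_loss, grad_reg, w=0.0)        # minimum at w = 3
w_reg = gradient_descent(grad_loss, grad_reg, w=0.0, lam=1.0)  # minimum at w = 2
print(w_plain, w_reg)
```

With the penalty switched on, the solution is pulled from 3 toward 0, which is the shrinking effect regularization is meant to have.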
What is the tanh function?
Hyperbolic tangent is an activation function that squashes pre-activations between -1 and 1. h(0)=0
Why do we have to be careful with regularization?
If the penalty is too strong (high bias), the model will underestimate the weights and underfit the problem. If the penalty is too weak, the model will be allowed to overfit the training data.
How is regularization helpful?
It helps reduce overfitting by reducing the effective capacity of the model. Because of the multiplicative interactions between weights and inputs, penalizing large weights encourages the network to use all of its inputs a little rather than some of its inputs a lot.
What is the softmax function?
It is a function used to represent a distribution over classes. It takes the logits from the previous layer and transforms them into probabilities in [0, 1] that sum to 1.
What is gradient descent used for?
It is used for optimization: it finds a local minimum of the loss function, which is the point at which training converges.
What is the sigmoid function?
It squashes pre-activations between 0 and 1. h(0) = 0.5
What does the bias do? (In an artificial neuron)
The bias shifts the position of the ridge that separates the region of low activation from the region of high activation.
Give me an example of unsupervised learning
K-Means
I have a dataset of news articles. I don't have any labels for the news articles, but I'd like to perform an analysis to gain insights into the most popular topics in the dataset. How would you go about that?
K-Means.
- It is a clustering algorithm, so it does not need labels
- A clarifying question to ask: do we have a feature that records how many times an article was viewed?
Walk me through K-means
K-means is an unsupervised learning algorithm that identifies clusters in a dataset.
1. Decide on the number of clusters K
2. Randomly initialize K centroids
3. Calculate the distance from each point X to every centroid (centroid 1, centroid 2, centroid 3, ...)
4. Assign each point to its nearest centroid
5. Calculate the mean of the points in each cluster
6. Move each centroid to the mean of its cluster
7. Measure the variation within each cluster to see how well we're doing
8. Repeat steps 3 to 7 for a few iterations, or until the assignments stop changing
9. K-means then returns the final centroids and the clusters
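A compact sketch of the K-means loop on 2D points. The data, the fixed iteration count, the seed, and the choice to initialize centroids by sampling existing points are all assumptions for the example; it also skips the within-cluster variation measurement.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Return (centroids, clusters) after a fixed number of K-means iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this toy data the two centroids settle on the means of the lower-left and upper-right groups.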
What is the learning rate?
The learning rate is a scalar value that we set. It indicates how big a step we want to take when we update our weights.
Give me an example of supervised learning?
Linear regression
What is linear regression?
Linear regression is a statistical technique in which the value of a variable Y is predicted from the value of a second variable X by fitting a straight line, Y = aX + b.
What is weight decay?
A regularization method corresponding to L2 regularization. Its coefficient is usually tuned on a logarithmic scale, with values between 0 and 0.1 (such as 0.1, 0.001, etc.).
How do you use momentum?
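Momentum accumulates past gradients into the current update.
- Successive gradients with the same sign (negative, then negative): the update moves faster in that direction
- Successive gradients with opposite signs (negative, then positive): the update moves slower in that direction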
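One reading of this note, sketched with classical momentum. The function names, `alpha=0.1`, `beta=0.9`, and the gradient sequences are assumptions for illustration.

```python
def momentum_step(w, velocity, grad, alpha=0.1, beta=0.9):
    """Classical momentum: velocity is a decaying sum of past gradient steps."""
    velocity = beta * velocity - alpha * grad
    return w + velocity, velocity


def total_movement(grads):
    """Apply momentum updates for a sequence of gradients, starting from w = 0."""
    w, v = 0.0, 0.0
    for g in grads:
        w, v = momentum_step(w, v, g)
    return w


same_sign = total_movement([-1.0, -1.0, -1.0, -1.0])    # gradients agree
alternating = total_movement([-1.0, 1.0, -1.0, 1.0])    # gradients flip sign
print(same_sign, alternating)
```

When the gradients agree, the velocity builds up and the weight travels much further than when the signs alternate and partly cancel.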
What causes overfitting?
Overfitting can be caused by:
- A training set without enough variety
- Neurons learning to predict noise
- A model with high capacity
What is ReLU?
Rectified linear unit. It is max(0, a): negative pre-activations become 0 and positive ones pass through unchanged, so the output is never negative.
Why is it good?
- Activations are sparse: many units output exactly 0
- The partial derivative is trivial (1 when a > 0, 0 otherwise), which gives computational advantages
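A short sketch of ReLU and its derivative. Using 0 as the derivative at exactly a = 0 is a convention assumed here; the function is not differentiable at that point.

```python
def relu(a):
    """max(0, a): zero for negative inputs, identity for positive ones."""
    return max(0.0, a)


def relu_grad(a):
    """Derivative of ReLU; 0 is used at a == 0 by convention."""
    return 1.0 if a > 0 else 0.0


pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
activations = [relu(a) for a in pre_activations]
gradients = [relu_grad(a) for a in pre_activations]
print(activations, gradients)
```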
What is the variance/bias tradeoff?
It refers to the tension between two kinds of model:
- A model that fits the training data too closely (overfitting): low bias, high variance. This happens with a noisy dataset or a complex model.
- A model that doesn't learn the underlying relationship at all (underfitting): low variance, high bias. This happens when we use too little data or fit a linear model to nonlinear data.
What is regularization?
Regularization is a collection of techniques that help prevent overfitting by penalizing complexity.
- A network with large weights is unstable: small changes in the input cause big changes in the output
- The goal is to encourage the network to have small weights
Say I wanted to design a predictor for the weather on the next day, based on features a meteorologist would describe to me. What method would you use?
Supervised learning.
- Bayes' theorem gives P(A|B), the probability of A given B
- It is a multiclass classification problem (rainy/sunny/cloudy)
What is the notion of margin in a linear classifier?
The margin is the distance from the decision surface to the closest data point. Equivalently, it is the minimal distance from any training point to the separating hyperplane.
What is the objective of the loss function?
The loss function gives training its objective: minimizing the loss maximizes the probability that the model picks the correct class.
What is backpropagation?
Backpropagation is the process by which the error at the output is sent back through the hidden units, telling each one how it should change so that the network activates the correct output units. We first compute the output gradient before the activation:
- Increase the pre-activation for the correct class
- Decrease the pre-activation for all other classes
We then have to propagate this information back through the preceding layers.
What is underfitting?
Underfitting (or high bias): creating a model that doesn't perform well because it hasn't captured the complexity of our data, e.g. classifying cat or dog with just two layers.
What causes underfitting?
Underfitting can be caused by:
- Training on the wrong features
- Training for too few epochs
- Not having enough hidden layers
- Using a low learning rate alpha
- Using a high regularization rate lambda
How do we reduce underfitting?
Underfitting can be reduced by:
- Training for more epochs
- Using more hidden layers
- Increasing the capacity of the model
- Using a lower regularization rate lambda
How can you initialize bias?
We can initialize the biases to zero.
How do we reduce overfitting?
We can reduce overfitting by:
- Increasing the size of our training set (more variety)
- Increasing our learning rate alpha
- Using a higher regularization rate lambda
- Performing batch normalization
- Performing a random search over hyperparameters (fine-tuning)
How do we use the loss function?
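We minimize the negative log-likelihood of the correct class: Loss = -log f(x)_y, where f(x)_y is the probability the model assigns to the correct class y.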
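A small sketch of this loss for a single example. The name `nll_loss` and the probability vector are made up; in practice `probs` would be the softmax output of the network.

```python
import math


def nll_loss(probs, y):
    """Negative log-likelihood of the correct class y."""
    return -math.log(probs[y])


probs = [0.7, 0.2, 0.1]  # hypothetical model output over three classes
loss_confident = nll_loss(probs, 0)  # correct class got high probability
loss_wrong = nll_loss(probs, 2)      # correct class got low probability
print(loss_confident, loss_wrong)
```

The loss is small when the correct class receives high probability and grows quickly as that probability approaches 0.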
Why use mini-batches in training?
Mini-batches let us use matrix-matrix multiplication: we feed the network a matrix of examples instead of doing many separate matrix-vector multiplications, one per example.