SVM, CNN, NLP, LSTM Midterm Study Guide
Word2Vec
- A way of mapping words to vectors - input is a text (corpus) - output is a vectorized representation of the words in the text - the model learns the probability of every word in the vocabulary appearing in proximity to the input word
Continuous bag of words (CBOW)
- Predicts the central word based on the window of words surrounding it - order does not matter
Skip-gram (SG)
- Predicts the context based on the central word (opposite of CBOW) - order matters - closer words are weighted more
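A minimal sketch using the gensim library (assumed installed, API as of gensim 4.x); the toy corpus and parameter values are illustrative, not part of the study guide:

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict center word from its context window)
# sg=1 -> Skip-gram (predict the context from the center word)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)            # vectorized representation of "cat"
print(skipgram.wv.most_similar("cat")) # words likely to appear in proximity to "cat"
```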
hyperplane
- a flat affine subspace of dimension p-1 - "the line" - dividing p-dimensional space into two halves - if the classes are linearly separable, it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels
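A tiny numpy sketch of the idea (the weights and points are made-up values): the hyperplane is the set of points where w·x + b = 0, and the sign of w·x + b tells us which half of the space a point falls in.

```python
import numpy as np

w = np.array([1.0, -2.0])   # normal vector of the hyperplane (p = 2, so "the line")
b = 0.5

points = np.array([[3.0, 1.0], [-1.0, 2.0]])
sides = np.sign(points @ w + b)
print(sides)  # +1 for one half of the space, -1 for the other
```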
Support Vector Machine (SVM)
- a generalization of a simple and intuitive classifier called the maximal margin classifier - intended for binary classification, where the two classes are separated by a decision boundary - a further extension of the support vector classifier that accommodates non-linear class boundaries - makes use of kernels - can perform multiclass classification (e.g., one vs. one and one vs. all)
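A hedged sketch with scikit-learn (assumed installed): an SVC with an RBF kernel on a toy non-linearly-separable dataset; the data and parameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two classes arranged as concentric circles: not separable by a straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # the kernel handles the non-linear class boundary
clf.fit(X, y)
print(clf.score(X, y))          # training accuracy

# for multiclass data, SVC applies the one-vs-one strategy internally
```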
soft margin
- a hyperplane that almost separates the classes
Why is NLP difficult?
- ambiguity of language - relies on contextual information - machines cannot understand creativity - no direct mapping between the vocabularies of any two languages
support vector classifier
- an extension of the maximal margin classifier that can be applied in a broader range of cases - the generalization of the maximal margin classifier to the non-separable case - slack variables εi measure margin violations: if εi = 0 the ith observation is on the correct side of the margin - if εi > 0 the ith observation is on the wrong side of the margin - if εi > 1 the ith observation is on the wrong side of the hyperplane - the decision boundary is still linear
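A sketch of the soft margin in scikit-learn (assumed installed): the parameter C controls how much slack is tolerated; a small C allows more margin violations, a large C allows fewer. The overlapping two-class dataset is made up for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

loose = SVC(kernel="linear", C=0.01).fit(X, y)   # wide margin, many violations tolerated
tight = SVC(kernel="linear", C=100.0).fit(X, y)  # narrow margin, few violations tolerated

print(len(loose.support_), len(tight.support_))  # observations on or inside the margin
```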
Polynomial Kernel
- can fit a radial dataset (e.g., a degree-two polynomial can capture a circular boundary) - a polynomial of degree one gives a linear separation
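An illustrative scikit-learn comparison (library assumed installed): degree controls the flexibility of the boundary; degree 1 behaves like a linear separator, degree 2 can wrap around a radial dataset.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_like = SVC(kernel="poly", degree=1).fit(X, y)  # effectively a straight-line boundary
quadratic = SVC(kernel="poly", degree=2).fit(X, y)    # can separate the inner circle from the outer

print(linear_like.score(X, y), quadratic.score(X, y))
```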
LSTM (Long Short Term Memory)
- can forget, store, update, and output - forget gate: decides which past information to discard - can regulate the flow of information through its gates - can keep long-term information in the cell state - helps with the vanishing gradient problem: by forgetting and omitting certain inputs it avoids carrying every past computation forward
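A minimal PyTorch sketch (torch assumed installed) of an LSTM layer over a made-up batch of sequences; the forget/input/output gates are handled internally by nn.LSTM.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)          # batch of 4 sequences, 20 time steps, 8 features each
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (4, 20, 16): hidden state at every time step
print(h_n.shape)     # (1, 4, 16): final hidden state (short-term memory)
print(c_n.shape)     # (1, 4, 16): final cell state (the long-term memory the gates regulate)
```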
Kernel trick
- computationally cheaper than explicitly enlarging the feature space - the kernel implicitly enlarges the feature space in a specific way - we only need the observations closest to the hyperplane (the support vectors)
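A numpy sketch of the idea: for the degree-2 polynomial kernel K(x, z) = (x·z + 1)², evaluating K directly gives the same result as explicitly enlarging the feature space and taking a dot product there, at a fraction of the cost. The vectors are made-up values.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input (6 enlarged features)."""
    a, b = v
    return np.array([1.0, np.sqrt(2) * a, np.sqrt(2) * b,
                     a * a, b * b, np.sqrt(2) * a * b])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z + 1.0) ** 2    # computed in the original 2-D space
explicit_value = phi(x) @ phi(z)     # computed in the enlarged 6-D space

print(kernel_value, explicit_value)  # identical results, different computational cost
```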
ReLU
- derivative is 1 for every input > 0 - helps solve the vanishing gradient problem - for any input < 0 the derivative becomes 0, which completely kills the gradient for those units ("dead" neurons); use Leaky ReLU to solve this
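A numpy sketch of ReLU and Leaky ReLU with their derivatives: for inputs > 0 the gradient passes through unchanged, while for inputs < 0 plain ReLU zeroes the gradient and the leaky variant keeps a small slope.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x), relu_grad(x))              # negative inputs: output 0 and gradient 0
print(leaky_relu(x), leaky_relu_grad(x))  # negative inputs keep a small, non-zero gradient
```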
Ranked features
- eliminate the features with the smallest weights - the ranking is static (computed once) - problem: the ranking might change each time a feature is eliminated
Pooling
- ensures that our neural network has spatial invariance (doesn't care where the features are and doesn't care if the features are a bit tilted, different, or relatively closer or farther apart, etc.) - within the feature map, we pick the max value of a 2x2 sub-square (the size can vary); each max value is placed into the pooled feature map - this helps with generalization, so if the max value moves to a different position, it will still be recognized - we are also reducing the size of the map, which helps the work of the final layer - reduces overfitting
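A numpy sketch of 2x2 max pooling with stride 2 on a made-up 4x4 feature map: each 2x2 sub-square contributes its maximum to the pooled feature map, halving each spatial dimension.

```python
import numpy as np

feature_map = np.array([
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
], dtype=float)

h, w = feature_map.shape
# group into 2x2 blocks and take the max of each block
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 8.]
#  [3. 4.]]
```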
Convolution
- going through the image and applying the filter window by window - input image (simplified as just 1s and 0s per pixel) - feature detector (AKA filter; usually 3x3, but it can be larger and can contain negative values too) - feature map (simplified as just 1s and 0s per pixel) - we multiply the current window of the input image element-wise with the feature detector and sum the results to produce one entry of the feature map - the network is trained to recognize a series of features that are stored in a collection of feature maps (the convolutional layer) - for example, one feature map could capture features of the left eye and another could capture features of a nose
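A numpy sketch of a single convolution pass (technically cross-correlation, as in most CNN libraries): slide a 3x3 feature detector over a made-up binary image, multiply element-wise, and sum to fill the feature map.

```python
import numpy as np

image = np.array([
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
], dtype=float)

feature_detector = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

kh, kw = feature_detector.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        window = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(window * feature_detector)  # element-wise product, then sum

print(feature_map)  # 3x3 feature map: high values mark where the pattern matches best
```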
Vanishing gradient
- if many values are < 1, the gradient will vanish towards 0 - short-term dependencies end up with greater weights - the network falls back to short-term memory - one solution: use ReLU (negative values are zeroed out) - an RNN is more likely to incur this problem if its input sequence is very long, since each element of the sequence adds one more computation to backpropagate through (see the sketch after the next entry)
Exploding gradient
- between each step, we perform multiplications using the weight matrix - long sequences involve a very large number of such operations - if many values are > 1, the gradient will explode into a very large number
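A numpy sketch of both failure modes: backpropagation through time multiplies many per-step factors together, so a product of factors mostly below 1 shrinks toward 0 (vanishing), while a product of factors mostly above 1 blows up (exploding). The factors are illustrative values.

```python
import numpy as np

steps = 50
shrinking_factor = 0.9   # e.g. small recurrent weights / saturated activations
growing_factor = 1.1     # e.g. large recurrent weights

vanishing = shrinking_factor ** steps
exploding = growing_factor ** steps

print(vanishing)  # ~0.005: early time steps barely influence the update (short-term memory)
print(exploding)  # ~117: the update becomes unstable
```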
Positional encoding
- necessary to include the position of each word in the text so that the order of the words matters - the attention mechanism of the blocks cannot, on its own, reason about the position of the words
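A numpy sketch of one common scheme, the sinusoidal encoding from the original Transformer paper: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), giving each position a distinct vector that is added to its word embedding.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one encoding vector per position, added to the word embeddings
```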
Maximal Margin Classifier
- optimal separating hyperplane - the separating hyperplane that is farthest from the training observations - the mid-line of the widest "slab" that we can insert between the two classes - the closest training observations to the hyperplane are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin (support vectors) - cannot handle non-separable cases - extremely sensitive to a change in a single observation (can lead to overfitting the training data)
attention
- provides learnable memory access - no backpropagation through time (no BPTT) - long term dependencies can be accessed as easily as short-term dependencies
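A numpy sketch of scaled dot-product attention, the core of the mechanism: attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V. Any position can attend to any other position directly, so long-term dependencies cost no more than short-term ones. The Q, K, V matrices are random made-up values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query with every key
    weights = softmax(scores, axis=-1)  # the learnable "memory access" pattern
    return weights @ V                  # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 positions, dimension 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```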
dropout
- reduces overfitting - lets a layer randomly drop a fraction of its output units during the training process
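A numpy sketch of "inverted" dropout during training: randomly zero a fraction p of a layer's outputs and rescale the survivors so the expected activation stays the same; at test time dropout is simply turned off.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, seed=0):
    if not training:
        return activations                 # dropout is disabled at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(activations.shape) >= p).astype(float)
    return activations * mask / (1.0 - p)  # rescale so the expected value is unchanged

layer_output = np.ones((2, 8))
print(dropout(layer_output, p=0.5))  # roughly half the units dropped, the rest scaled up to 2.0
```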
Pre-trained models
- semi-supervised learning - the model is trained on a very large corpus of text without using labels - this teaches the language to the model
Feature reduction
- want to reduce number of features without decreasing accuracy - ranked features - recursive feature elimination (RFE)
Full connection
- we attach an artificial neural network, using the 1-dimensional vector from the flattening process as its input layer - the hidden layers are called fully connected layers in a CNN - backpropagation will not simply identify the weights to update, but also the "features": the feature detectors (the filters) are also modified based on the error via gradient descent
Flattening
- we convert the matrices into a vector (row by row) - the matrices become 1-dimensional
Recurrent Neural Network (RNN)
- works with sequential data - the order of the input sequence is important
CNN steps
1. Convolution (with ReLU) 2. Max Pooling 3. Flattening 4. Full Connection
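A minimal PyTorch sketch (torch assumed installed) mirroring the four steps above: convolution with ReLU, max pooling, flattening, and a fully connected layer. The layer sizes are made up and assume 28x28 single-channel inputs.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3),  # 1. Convolution -> 8 feature maps
    nn.ReLU(),                                                 #    (with ReLU)
    nn.MaxPool2d(kernel_size=2),                               # 2. Max Pooling (2x2)
    nn.Flatten(),                                              # 3. Flattening to a 1-D vector
    nn.Linear(8 * 13 * 13, 10),                                # 4. Full Connection -> 10 classes
)

x = torch.randn(4, 1, 28, 28)   # batch of 4 made-up grayscale images
print(cnn(x).shape)             # (4, 10)
```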
Recursive feature elimination (RFE)
1. first, generate linear SVM 2. Eliminate feature with lowest weight 3. Generate a new linear SVM based on reduced feature set 4. Eliminate feature with lowest weight 5. Lather, rinse, repeat
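A scikit-learn sketch (library assumed installed): the RFE class automates steps 1-5 above, repeatedly fitting a linear SVM and dropping the lowest-weight feature until the requested number of features remains. The dataset and the target of 5 features are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features kept
print(selector.ranking_)   # rank 1 = kept; higher ranks were eliminated earlier
```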
Encoding bottleneck problem
The entire input is compressed into a single hidden state (the pressure is greatest on the last encoder state, since it must summarize everything) and the compression process can potentially lose important information