SVM, CNN, NLP, LSTM Midterm Study Guide

Word2Vec

- A way of mapping words to vectors
- input is a text (corpus)
- output is a vector representation of each word in the text
- gives the probability of every word in the vocabulary appearing in proximity to the input word

Continuous bags of words (CBOW)

- Predicts the central word based on the window of words surrounding it
- order does not matter

Skip-gram (SG)

- Predicts the context based on the central word (opposite of CBOW)
- order matters
- closer words are weighted more heavily
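
A minimal sketch of both training modes using gensim (assuming gensim 4.x; the toy corpus is mine, purely for illustration):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW: predict the center word from its context window.
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the context words from the center word.
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"][:5])               # first 5 dimensions of the "cat" vector
print(skipgram.wv.most_similar("cat"))  # words whose vectors lie closest to "cat"
```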

hyperplane

- a flat affine subspace of dimension p-1 ("the line" in two dimensions)
- divides the p-dimensional space into two halves
- if the classes are linearly separable, it is possible to construct a hyperplane that separates the training observations perfectly according to their class labels
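
A small numpy sketch (my example, with arbitrary w and b) showing how the sign of w · x + b tells you which half of the space a point falls in:

```python
import numpy as np

# Hyperplane in p = 2 dimensions: w . x + b = 0 (a line).
w = np.array([1.0, -2.0])
b = 0.5

points = np.array([[3.0, 1.0], [0.0, 2.0], [-1.0, -1.0]])

# Positive sign -> one half of the space, negative sign -> the other half.
sides = np.sign(points @ w + b)
print(sides)   # [ 1. -1.  1.]
```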

Support Vector Machine (SVM)

- a generalization of a simple and intuitive classifier called the maximal margin classifier
- intended for the binary classification setting, in which there are two classes
- a further extension of the support vector classifier that accommodates non-linear class boundaries
- makes use of kernels
- can perform multiclass classification (e.g., one-vs-one and one-vs-all)
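
A hedged scikit-learn sketch (the dataset and hyperparameter values are illustrative, not part of the card):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)              # 3 classes -> a multiclass problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# kernel='rbf' accommodates non-linear class boundaries;
# C controls the soft margin (smaller C -> wider, more tolerant margin);
# SVC handles the multiclass case internally via one-vs-one.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))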

soft margin

- a hyperplane that almost separates the classes

Why is NLP difficult?

- ambiguity of the language
- relies on contextual information
- machines cannot understand creativity
- no direct mapping between the vocabularies of any two languages

support vector classifier

- an extension of the maximal margin classifier that can be applied in a broader range of cases
- the generalization of the maximal margin classifier to the non-separable case
- if ε_i = 0, the ith observation is on the correct side of the margin
- if ε_i > 0, the ith observation is on the wrong side of the margin
- if ε_i > 1, the ith observation is on the wrong side of the hyperplane
- the decision boundary is still linear
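
For reference, the ε_i above are the slack variables of the standard soft-margin optimization problem (ISLR-style notation; C is a non-negative tuning parameter that budgets the total slack):

```latex
% Support vector classifier: maximize the margin M while allowing slack.
\max_{\beta_0,\dots,\beta_p,\;\epsilon_1,\dots,\epsilon_n,\;M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p}\beta_j^2 = 1,
\qquad
y_i\bigl(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\bigr) \ge M(1-\epsilon_i),
\qquad
\epsilon_i \ge 0,\quad \sum_{i=1}^{n}\epsilon_i \le C.
```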

Polynomial Kernel

- a polynomial kernel of degree 2 or more can fit a radial (circular) dataset
- a polynomial kernel of degree one gives a linear separation
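
An illustrative scikit-learn sketch (my dataset choice): a degree-2 polynomial kernel separates a circular dataset, while degree 1 behaves like a linear boundary.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_like = SVC(kernel="poly", degree=1).fit(X, y)   # essentially a linear separator
quadratic   = SVC(kernel="poly", degree=2).fit(X, y)   # can carve out the inner circle

print(linear_like.score(X, y))  # roughly chance level on the circles
print(quadratic.score(X, y))    # close to 1.0
```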

LSTM (Long Short Term Memory)

- has gates that can forget, store, update, and output information
- forget gate: decides which past information to discard
- can regulate information flow
- can keep long-term information
- mitigates the vanishing gradient problem: the cell state is carried forward through gated, additive updates, forgetting and omitting certain inputs instead of multiplying everything through at every step
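
A minimal PyTorch sketch (the framework and the sizes are my choices, not the card's):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)          # batch of 4 sequences, 10 time steps, 8 features
out, (h_n, c_n) = lstm(x)

print(out.shape)   # torch.Size([4, 10, 16])  hidden state at every time step
print(h_n.shape)   # torch.Size([1, 4, 16])   final hidden state
print(c_n.shape)   # torch.Size([1, 4, 16])   final cell state (the "long-term" memory)
```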

Kernel trick

- computationally faster than explicitly enlarging the feature space
- implicitly enlarges the feature space in a specific way
- we only need the observations closest to the hyperplane
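
A worked sketch of the idea (my example): a degree-2 kernel computes the same inner product as an explicit quadratic feature map, without ever building that map.

```python
import numpy as np

def phi(v):
    # Explicit feature map for K(x, z) = (x . z)^2 in 2 dimensions.
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

kernel_value   = (x @ z) ** 2        # cheap: one dot product, then square
explicit_value = phi(x) @ phi(z)     # expensive route: enlarge the feature space first

print(kernel_value, explicit_value)  # both 121.0
```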

ReLU

- derivative is 1 for every input > 0
- solves the vanishing gradient problem
- for any value < 0 the derivative becomes 0, which completely kills the gradient for negative inputs (use Leaky ReLU to solve this)
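
A quick numpy sketch (mine) of ReLU, its derivative, and Leaky ReLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)          # 1 for x > 0, 0 otherwise ("dead" for negatives)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope keeps negative inputs alive

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
print(leaky_relu(x))  # [-0.02  -0.005  0.  0.5  2. ]
```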

Ranked features

- eliminate the features with the smallest weights
- static: the ranking is computed once
- problem: the ranking might change each time a feature is eliminated
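
An illustrative scikit-learn sketch (dataset and hyperparameters are mine): rank the features once by the absolute weights of a linear SVM, then drop the smallest ones.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
clf = LinearSVC(C=0.01, dual=False, max_iter=10000).fit(X, y)

ranking = np.argsort(np.abs(clf.coef_[0]))   # smallest-weight features first
print("weakest features:", ranking[:5])
print("strongest features:", ranking[-5:])
```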

Pooling

- ensures that our neural network has spatial invariance (it doesn't care where the features are, or whether they are a bit tilted, different, or relatively closer or farther apart)
- within the feature map, we pick the max value of each 2x2 sub-square (the size can vary); each max value is put into the pooled feature map
- this helps with generalization: if the max value moves to a different position, it will still be recognized
- we are also reducing the size of the map, which helps the work of the final layers
- reduces overfitting
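
A small numpy sketch (my toy feature map) of 2x2 max pooling:

```python
import numpy as np

feature_map = np.array([
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
])

h, w = feature_map.shape
# Group the map into non-overlapping 2x2 sub-squares and keep each square's maximum.
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]
```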

Convolution

- going through the image and applying the filter window by window
- input image (simplified as just 1s and 0s per pixel)
- feature detector (AKA filter; usually 3x3, but it can be larger and can contain negative values too)
- feature map (simplified as just 1s and 0s per pixel)
- we multiply the current window of the input image element-wise with the feature detector and sum all the results together into one entry of the feature map
- the network is trained to recognize a series of features that are then stored in a collection of feature maps (the convolutional layer)
- for example: the first feature map could hold features of a left eye, the second could hold features of a nose
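
A minimal numpy sketch (my toy image and filter) of the sliding-window multiply-and-sum step:

```python
import numpy as np

image = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
])

feature_detector = np.array([   # a simple diagonal filter
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

h = image.shape[0] - 2
w = image.shape[1] - 2
feature_map = np.zeros((h, w), dtype=int)
for i in range(h):
    for j in range(w):
        window = image[i:i+3, j:j+3]                    # current 3x3 window of the image
        feature_map[i, j] = np.sum(window * feature_detector)

print(feature_map)
```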

Vanishing gradient

- if many of the multiplied values are < 1, the gradient vanishes towards 0
- the short-term dependencies end up with greater weights, so the network falls back to short-term memory
- one solution: use ReLU (its derivative does not shrink for positive inputs; negative values are simply zeroed out)
- an RNN is more likely to run into this problem when its sequential input is very long, since each element of the sequence adds another computation step
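
A tiny numeric sketch (mine): multiplying many factors smaller than 1 drives the gradient toward 0.

```python
import numpy as np

local_gradients = np.full(100, 0.9)   # 100 time steps, each contributing a factor of 0.9
print(np.prod(local_gradients))       # ~2.7e-05 -- the early steps barely get updated
```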

Exploding gradient

- in between each step, we perform multiplications using the weight matrix
- this involves a very large number of operations
- if many of the values are > 1, the gradient explodes into a very large number
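
The companion sketch (mine): factors larger than 1 blow the gradient up instead.

```python
import numpy as np

local_gradients = np.full(100, 1.1)   # 100 time steps, each contributing a factor of 1.1
print(np.prod(local_gradients))       # ~13781 -- updates become unstable
```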

Positional encoding

- necessary to include the position of each word in the text so that the order of the words matters
- the attention mechanism of the blocks cannot otherwise reason about the position of the words
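
A sketch of the sinusoidal positional encoding from "Attention Is All You Need" (my implementation, for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model)[None, :]                       # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16) -- added to the token embeddings so order is visible to attention
```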

Maximal Margin Classifier

- the optimal separating hyperplane
- the separating hyperplane that is farthest from the training observations
- the mid-line of the widest "slab" that we can insert between the two classes
- the closest training observations to the hyperplane are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin (the support vectors)
- cannot handle non-separable cases
- extremely sensitive to a change in a single observation (can lead to overfitting the training data)
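
An illustrative sketch (my synthetic data): approximating the maximal margin classifier with a linear SVM and a very large C, then inspecting the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, size=(20, 2)),     # class -1, well separated
               rng.normal(+2, 0.5, size=(20, 2))])    # class +1
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # huge C ~ (almost) hard margin

print(clf.support_vectors_)    # the few points lying on the margin boundaries
```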

attention

- provides learnable memory access
- no backpropagation through time (no BPTT)
- long-term dependencies can be accessed as easily as short-term dependencies
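
A numpy sketch (mine) of scaled dot-product attention: every position attends directly to every other position, so long-range access is as easy as short-range.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of each query with every key
    weights = softmax(scores, axis=-1)    # learnable, differentiable "memory access"
    return weights @ V                    # weighted sum of the values

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, seq_len, d_k))

print(attention(Q, K, V).shape)   # (5, 8)
```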

dropout

- reduces overfitting
- lets a layer randomly drop a few of its output units during the training process
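
A short PyTorch sketch (framework choice is mine): dropout zeroes random units during training only.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))    # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))    # unchanged at evaluation time
```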

Pre-trained models

- semi-supervised learning
- we train the model on a very large corpus of text, without using labels
- this teaches the language to the model

Feature reduction

- we want to reduce the number of features without decreasing accuracy
- ranked features
- recursive feature elimination (RFE)

Full connection

- we attach the artificial neural network, using the 1-dimensional layer from the flattening process as its input layer
- the hidden layers are called fully connected layers in a CNN
- backpropagation will not simply identify the weights to update, but also the "features": the feature detectors (the filters) are also modified based on the error via gradient descent

Flattening

- we convert the matrices into a vector (row by row)
- the matrices become 1-dimensional

Recurrent Neural Network (RNN)

- works with sequential data
- the sequence of the input is important

CNN steps

1. Convolution (with ReLU)
2. Max Pooling
3. Flattening
4. Full Connection
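
The four steps as a tiny PyTorch model (a sketch; the layer sizes are my choices):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # 1. convolution: 8 feature detectors (filters)
    nn.ReLU(),                        #    ...with ReLU applied to each feature map
    nn.MaxPool2d(2),                  # 2. max pooling: 2x2, for spatial invariance
    nn.Flatten(),                     # 3. flattening: feature maps -> one long vector
    nn.Linear(8 * 13 * 13, 10),       # 4. full connection: dense layer to 10 classes
)

x = torch.randn(4, 1, 28, 28)         # batch of 4 grayscale 28x28 images
print(model(x).shape)                 # torch.Size([4, 10])
```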

Recursive feature elimination (RFE)

1. First, generate a linear SVM
2. Eliminate the feature with the lowest weight
3. Generate a new linear SVM based on the reduced feature set
4. Eliminate the feature with the lowest weight
5. Lather, rinse, repeat
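
A hedged scikit-learn sketch of the same loop (dataset and the number of features to keep are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# step=1 -> refit the linear SVM and drop the single lowest-weight feature each round.
selector = RFE(estimator=LinearSVC(dual=False, max_iter=10000),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the 5 surviving features
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```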

Encoding bottleneck problem

The entire input sequence is compressed into a single hidden state (the last one, which therefore bears the most pressure), and the compression process can potentially lose important information.

