CS-7643 Quiz 4
Skip-Gram Objective
- For each position t = 1, ..., T, predict the context words within a window of fixed size m, given the center word w_t. - The objective function is the average negative log-likelihood (written out below).
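A compact way to write this objective, a sketch using θ for the embedding parameters and m for the window size:

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\bigl(w_{t+j} \mid w_t ; \theta\bigr)
```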
Connection between Negative Sampling and Collobert & Weston Algorithm?
- NS is similar to Collobert & Weston's algorithm in that both train a network to distinguish "good" word-context pairs from "bad" ones. - Collobert & Weston uses a margin loss, while NS uses a probabilistic (logistic) objective.
Negative Sampling
- Idea: for each observed (w, c) pair, sample k negative pairs (w, c'). - Maximize the probability that the real outside word appears and minimize the probability that random words appear around the center word. - Negatives are drawn from the unigram distribution raised to the 3/4 power, which makes less frequent words more likely to be sampled than under the raw unigram distribution (see the objective below).
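A sketch of the per-pair objective that is maximized, in the usual word2vec notation (u_w = center-word vector, v_o = real outside-word vector, v_{c_i} = sampled negative-word vectors, σ = sigmoid, P_n = noise distribution):

```latex
J_{(w,o)} = \log \sigma\!\bigl(u_w^{\top} v_o\bigr)
  + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\bigl[\log \sigma\!\bigl(-u_w^{\top} v_{c_i}\bigr)\bigr],
  \qquad P_n(c) \propto \mathrm{freq}(c)^{3/4}
```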
Knowledge Distillation
- Transfers knowledge from a large model to a smaller model without loss of validity, in order to reduce model size. - Provides a way to shrink the model without sacrificing much performance.
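A minimal sketch of a distillation loss in the style of Hinton et al., assuming PyTorch and hypothetical student/teacher logits; the temperature T and mixing weight alpha are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```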
How to calculate probability of context word wrt to center word?
- Take the inner product of the two embedding vectors to measure how likely word w appears with context word o: use the vector from U when w is the center word and the vector from V when o is the context word. - Computing the probability is expensive because the softmax normalizes over the whole vocabulary (formula below).
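In symbols, following this card's convention (u_w from U for the center word, v_o from V for the context word, 𝒱 = vocabulary); the sum over the entire vocabulary in the denominator is what makes it expensive:

```latex
P(o \mid w) = \frac{\exp\!\bigl(u_w^{\top} v_o\bigr)}{\sum_{x \in \mathcal{V}} \exp\!\bigl(u_w^{\top} v_x\bigr)}
```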
what is t-SNE?
is an unsupervised, non-linear technique primarily used for data exploration and visualization of high-dimensional data. In short, it gives you a feel or intuition for how the data is arranged in high-dimensional space.
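A minimal usage sketch with scikit-learn's t-SNE, using random placeholder data in place of real embeddings; the perplexity value is just an illustrative default:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 100)   # 500 points in 100-D (placeholder data)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)              # (500, 2) -- ready to scatter-plot
```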
Other word embeddings
1.) GloVe: Global Vectors 2.) fastText: sub-word embeddings 3.) Contextualized embeddings: ELMo and BERT
What are alternatives to probability of context words?
1.) Hierarchical Softmax 2.) Negative Sampling
How to evaluate word embedding?
1.) Intrinsic Evaluation 2.) Extrinsic Evaluation
Types of Conditional Language Models
1.) Topic-Aware Language Model: c = the topic, s = text 2.) Text Summarization: c = a long document, s = its summary 3.) Machine Translation: c = French text, s = English text 4.) Image Captioning: c = an image, s = its caption 5.) Optical Character Recognition: c = image of a line, s = its content 6.) Speech Recognition: c = a recording, s = its content
Uses of Word Embedding
1.) Features in downstream ML models 2.) Initialization for neural-net models for NLP, like RNNs and LSTMs
Recurrent Neural Network (RNN)
An RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. For example, an input sequence may be a sentence with the outputs being the part-of-speech tag for each word (N-to-N). An input could be a sentence, and the output a sentiment classification of the sentence (N-to-1). An input could be a single image, and the output could be a sequence of words corresponding to the description of the image (1-to-N). At each time step, an RNN calculates a new hidden state ("memory") based on the current input and the previous hidden state. The "recurrent" stems from the fact that at each step the same parameters are used and the network performs the same calculation on different inputs (see the sketch below).
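A minimal sketch of the vanilla RNN recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t + b), with made-up dimensions; note that the same W_hh, W_xh, b are reused at every time step:

```python
import numpy as np

hidden, inp = 8, 4
W_hh = np.random.randn(hidden, hidden) * 0.1
W_xh = np.random.randn(hidden, inp) * 0.1
b = np.zeros(hidden)

def rnn_forward(xs):                          # xs: sequence of input vectors
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b)  # new "memory" from old memory + input
        states.append(h)
    return states

states = rnn_forward([np.random.randn(inp) for _ in range(5)])
```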
Graph Embeddings
A generalization of word embeddings: nodes of a graph are mapped to vectors.
how do RNN and LSTM update rules differ?
In an LSTM, the cell state is updated additively: something is added to its previous value C_{t-1}. This differs from the multiplicative update rule of a vanilla RNN's hidden state (see the equations below).
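Side by side in standard notation (f_t, i_t are the forget and input gates, ⊙ is the elementwise product):

```latex
\text{RNN: } h_t = \tanh\!\bigl(W_{hh} h_{t-1} + W_{xh} x_t\bigr)
\qquad
\text{LSTM: } C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```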
Word2Vec
is an algorithm and tool to learn word embeddings by using a word to predict its context words, the context being a fixed window of size 2 around it. Two different objectives can be used to learn these embeddings: the Skip-Gram objective predicts the context words from the center word, and the CBOW objective predicts the center word from its context.
How does t-SNE conceptually work?
The algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space, then tries to match these two sets of similarities using a cost function. Broken into 3 basic steps: 1.) Place a Gaussian distribution around each data point in the high-dimensional space, measure the density of all other points under that Gaussian, and renormalize to get probabilities (P_ij) proportional to the similarities. Note: the width of the Gaussian is controlled by the perplexity. 2.) In the low-dimensional space, use a Student t-distribution with one degree of freedom (also known as the Cauchy distribution) to get a second set of probabilities (Q_ij). 3.) Make the low-dimensional probabilities (Q_ij) reflect the high-dimensional ones (P_ij) as closely as possible, so the two map structures are similar; the difference between the two probability distributions is measured with the Kullback-Leibler divergence (KL), shown below.
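The cost minimized in step 3, in the standard t-SNE notation:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \ne i} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```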
Gradients for RNNs
The gradient computation involves recurrent multiplication by W (the recurrent weight matrix), and multiplying by W at every step has a bad effect. Think of it like this: if you have a scalar (number) and you multiply the gradient by it over and over again, say 100 times, then if that number is > 1 the gradient explodes, and if it is < 1 the gradient vanishes towards 0 (toy demo below).
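A toy illustration of the scalar analogy only, not of an actual RNN backward pass:

```python
# Repeatedly multiplying a gradient by the same factor either
# explodes it (factor > 1) or vanishes it (factor < 1).
grad = 1.0
for factor in (1.1, 0.9):
    g = grad
    for _ in range(100):
        g *= factor
    print(factor, g)   # 1.1 -> ~1.4e4 (explodes), 0.9 -> ~2.7e-5 (vanishes)
```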
Collobert & Weston Vectors
A word and its context form a positive example; a negative example is made by placing a random word in the context of the original word. Similar to an SVM, the algorithm uses a margin loss to increase the margin between positive and negative examples (see below).
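A sketch of the ranking-style margin loss, where s(·,·) is the network's score for a word in a context and w' is a randomly sampled (negative) word:

```latex
L = \sum_{(w,c)} \max\!\bigl(0,\; 1 - s(w, c) + s(w', c)\bigr)
```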
Distributional Semantics
a word's meaning is given by words that are frequently around it.
Conditional Language Modeling
allows you to determine the probability of a sequence of words conditioned on context c.
Extrinsic Evaluation for word embedding?
e.g. Text Classification 1.) Evaluation on a real task 2.) Can take a long time to compute 3.) Unclear whether the subsystem is the problem or its interaction with other subsystems 4.) If replacing one subsystem with another improves performance, that's a win
Intrinsic Evaluation for word embedding?
e.g. Word Analogy Task 1.) Evaluates a specific/intermediate task 2.) Fast to compute 3.) Helps to understand the system 4.) Not clear if really helpful unless a correlation to a real task is established.
What does t-SNE stand for?
t-Distributed Stochastic Neighbor Embedding
LSTM (Long Short-Term Memory)
LSTM networks were invented to prevent the vanishing gradient problem in Recurrent Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the hidden state of an RNN helps the network efficiently propagate gradients and learn long-range dependencies (gate equations below).
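The standard gating equations (σ = sigmoid, ⊙ = elementwise product, [h_{t-1}, x_t] = concatenation of previous hidden state and current input):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
```

```latex
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C), \quad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad
h_t = o_t \odot \tanh(C_t)
```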
What are word embeddings used for?
Used as initialization of lookup tables in LSTM and RNN models, mapping objects (words) to vectors through a trainable function. (The function can be a CNN, RNN, transformer, or Word2Vec.) See the sketch below.
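A minimal sketch in PyTorch: initializing an embedding lookup table from pre-trained word vectors (the `pretrained` tensor here is a random stand-in for Word2Vec/GloVe vectors) and feeding the looked-up vectors into an LSTM:

```python
import torch
import torch.nn as nn

vocab_size, dim = 10000, 300
pretrained = torch.randn(vocab_size, dim)          # stand-in for Word2Vec/GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # trainable init
lstm = nn.LSTM(input_size=dim, hidden_size=128, batch_first=True)

token_ids = torch.randint(0, vocab_size, (2, 7))   # batch of 2 sentences, 7 tokens each
vectors = embedding(token_ids)                     # (2, 7, 300)
outputs, _ = lstm(vectors)                         # (2, 7, 128)
```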