Chapter 9: RNN and LSTM
RNNs for various NLP tasks
1) Sequence labelling 2) Sequence classification 3) Text generation
LSTM
Addresses the vanishing gradient problem (gradients shrink as they are backpropagated through the hidden layers) and the difficulty of capturing long-range dependencies in sequential data. Specialized neural units (gates) regulate the flow of information through a modular cell design.
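A minimal usage sketch with PyTorch's nn.LSTM (the dimensions and variable names below are illustrative, not from the chapter):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len, batch = 100, 128, 12, 4     # illustrative sizes

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(batch, seq_len, emb_dim)   # a batch of embedded input sequences
outputs, (h_n, c_n) = lstm(x)              # outputs: hidden state at every time step
# outputs: (batch, seq_len, hidden_dim); h_n, c_n: (1, batch, hidden_dim)
# Inside the cell, gates decide what to keep in the context c and expose in h.
```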
Word Embedding Matrix
A lookup table that maps each word in the vocabulary to a dense vector representation. The input vector for a word is the corresponding row of the word embedding matrix.
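A sketch of the lookup with NumPy; indexing the matrix by a word's id returns the same vector as multiplying the word's one-hot vector by the matrix (the toy vocabulary is illustrative):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}       # toy vocabulary (illustrative)
E = np.random.randn(len(vocab), 4)           # embedding matrix: one row per word

word_id = vocab["cat"]
x = E[word_id]                               # input vector = row of the embedding matrix

one_hot = np.zeros(len(vocab))
one_hot[word_id] = 1.0
assert np.allclose(one_hot @ E, x)           # one-hot lookup selects the same row
```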
Text Generation
Used for machine translation, text summarization, and question answering. Words are sampled sequentially, each conditioned on the previous choices, typically from a softmax distribution (autoregressive generation).
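A sketch of autoregressive sampling, assuming a hypothetical model that maps the tokens generated so far to logits over the vocabulary (the interface and token ids are assumptions):

```python
import torch

def generate(model, start_id, end_id, max_len=50):
    """Sample words sequentially, each conditioned on all previous choices."""
    tokens = [start_id]
    for _ in range(max_len):
        logits = model(torch.tensor(tokens).unsqueeze(0))  # (1, len, vocab) - assumed interface
        probs = torch.softmax(logits[0, -1], dim=-1)       # softmax distribution over next word
        next_id = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_id)
        if next_id == end_id:                              # stop at the end-of-sequence token
            break
    return tokens
```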
Elaborate on Task 1 (sequence labelling)
Examples: part-of-speech tagging or named entity recognition. Word embeddings serve as input; tag probabilities are produced as output using a softmax layer, trained with cross-entropy loss.
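A minimal tagger sketch in PyTorch; the layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)    # word embeddings as input
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)   # one score per tag, per token

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.emb(token_ids))
        return self.out(h)                              # logits; softmax is applied in the loss

model = RNNTagger(vocab_size=5000, tagset_size=17)
loss_fn = nn.CrossEntropyLoss()                         # cross-entropy over all token positions
tokens = torch.randint(0, 5000, (2, 6))
gold_tags = torch.randint(0, 17, (2, 6))
loss = loss_fn(model(tokens).reshape(-1, 17), gold_tags.reshape(-1))
```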
Describe RNN computation process
Input is processed sequentially (one token at a time), multiplied by weight matrices, and combined with the previous hidden layer's value through an activation function. There is no fixed-length limit on context, which can extend back to the beginning of the sentence. Training RNNs (backpropagation through time) involves two passes:
- Forward inference to accumulate the loss (cross entropy) and save the hidden-layer values.
- Reverse processing to compute gradients and update the weights.
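A sketch of the forward pass with NumPy, assuming an input weight matrix W, a recurrent matrix U, and a tanh activation (conventional names, not from these notes):

```python
import numpy as np

def rnn_forward(inputs, W, U, b):
    """Process inputs one at a time; each hidden state depends on the previous one."""
    h = np.zeros(U.shape[0])
    hidden_states = []
    for x_t in inputs:                      # one input vector per time step
        h = np.tanh(W @ x_t + U @ h + b)    # combine input with the previous hidden value
        hidden_states.append(h)             # saved for backpropagation through time
    return hidden_states

emb_dim, hidden_dim = 4, 3
W = np.random.randn(hidden_dim, emb_dim)
U = np.random.randn(hidden_dim, hidden_dim)
b = np.zeros(hidden_dim)
states = rnn_forward([np.random.randn(emb_dim) for _ in range(5)], W, U, b)
```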
Issues with Encoder-Decoder Model
The context vector (the encoder's last hidden state) acts as a bottleneck: the amount of information it can store is limited, especially for long sentences.
Attention mechanism
The context vector is derived as a weighted sum of all encoder hidden states, with the weights recomputed for each output token, enabling the model to focus on the relevant parts of the input sequence for each token it generates. Relevance is scored using dot-product similarity or bilinear models and normalized with a softmax.
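A sketch of dot-product attention over encoder hidden states with NumPy (shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_product_attention(dec_state, enc_states):
    """Weighted sum of all encoder states, with weights specific to this decoder step."""
    scores = enc_states @ dec_state      # dot-product similarity, one score per input token
    weights = softmax(scores)            # normalize with softmax
    return weights @ enc_states          # context vector for the current output token

enc_states = np.random.randn(7, 16)      # 7 input tokens, hidden size 16
dec_state = np.random.randn(16)
context = dot_product_attention(dec_state, enc_states)
```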
Stacked RNNs
Multiple RNN layers, where the output sequence of one layer serves as the input to the next. Lower layers capture lower-level features (analogous to edges and textures in vision) and higher layers capture more complex, abstract representations, so stacked networks often outperform single-layer ones. The optimal number of layers depends on the application and the training set; increasing the number of layers increases training cost.
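A stacking sketch in PyTorch, either via the num_layers argument or by feeding one layer's output sequence into the next (dimensions are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 12, 100)                    # (batch, seq_len, emb_dim)

# Built-in stacking: each layer's output sequence is the next layer's input.
stacked = nn.LSTM(input_size=100, hidden_size=128, num_layers=3, batch_first=True)
out, _ = stacked(x)

# The same idea made explicit with two single-layer networks.
layer1 = nn.LSTM(100, 128, batch_first=True)
layer2 = nn.LSTM(128, 128, batch_first=True)
h1, _ = layer1(x)                              # lower-level representation
h2, _ = layer2(h1)                             # higher-level, more abstract representation
```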
Architecture of RNN
A self-supervised model that includes a hidden layer with a recurrent connection, responsible for retaining memory and influencing later decisions.
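In the usual textbook notation (the symbols U, W, V and the activation g are standard names, not taken from these notes), the recurrence is:

h_t = g(U h_{t-1} + W x_t)    (hidden state combines the previous memory with the current input)
y_t = softmax(V h_t)          (output distribution at step t)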
Elaborate on Task 2 (sequence classification)
Examples: sentiment analysis, spam detection, or topic classification. The classifier consists of a feedforward network and a softmax layer combined with cross-entropy loss. Either utilize only the final hidden state for classification, or use pooling to aggregate information from all hidden states in the sequence.
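A sketch of both options in PyTorch: classify from the final hidden state, or mean-pool over all hidden states (sizes are illustrative):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5000, 100)
rnn = nn.RNN(100, 128, batch_first=True)
classifier = nn.Linear(128, 3)                 # feedforward layer; softmax is applied in the loss

tokens = torch.randint(0, 5000, (4, 20))       # (batch, seq_len)
hidden_states, h_n = rnn(emb(tokens))          # hidden_states: (batch, seq_len, 128)

logits_last = classifier(h_n[-1])                     # option 1: final hidden state only
logits_pool = classifier(hidden_states.mean(dim=1))   # option 2: mean pooling over all states

labels = torch.tensor([0, 2, 1, 0])
loss = nn.CrossEntropyLoss()(logits_pool, labels)
```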
Gates
Specialized neural units:
- Input gate: adds new information to the context.
- Output gate: determines what information is passed on to the next hidden state.
- Forget gate: removes irrelevant information from the context.
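A sketch of one LSTM cell step with NumPy, showing how the three gates act on the context vector c and the hidden state h (the stacked weight layout is an assumption for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step; W, U, b stack the parameters of the gates and the candidate layer."""
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)   # forget gate drops old info, input gate adds new info
    h = o * np.tanh(c)                # output gate decides what reaches the next hidden state
    return h, c

d_in, d_h = 4, 3
W = np.random.randn(4 * d_h, d_in)
U = np.random.randn(4 * d_h, d_h)
b = np.zeros(4 * d_h)
h, c = lstm_step(np.random.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```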
Teacher forcing
When training the model, use the gold target sentence (ground truth) as the input to the next time step instead of the output of the current time step.
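A decoder-training sketch in PyTorch; the toy decoder and sizes are assumptions, and the key line is the one that feeds the gold token, not the model's own prediction, into the next step:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
emb = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)
out = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

gold = torch.randint(0, vocab_size, (7,))   # gold target sentence (token ids)
h = torch.zeros(1, hidden_dim)              # e.g. the encoder's context vector
loss = 0.0
prev_token = gold[0].view(1)                # start token
for t in range(1, len(gold)):
    h = cell(emb(prev_token), h)
    logits = out(h)
    loss = loss + loss_fn(logits, gold[t].view(1))
    prev_token = gold[t].view(1)            # teacher forcing: feed the gold token,
                                            # not logits.argmax(), into the next step
```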
Weight tying
Use a single set of embeddings at both the input and the output softmax layer (tying the weights of these layers in the network). This improves model perplexity and reduces the parameter count, and hence redundancy.
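A tying sketch in PyTorch; sharing one parameter tensor between nn.Embedding and the output nn.Linear works because both have shape (vocab_size, dim) (sizes are illustrative):

```python
import torch.nn as nn

vocab_size, dim = 10000, 256
embedding = nn.Embedding(vocab_size, dim)               # input embeddings: (vocab, dim)
output_layer = nn.Linear(dim, vocab_size, bias=False)   # softmax-layer weights: (vocab, dim)

output_layer.weight = embedding.weight                  # weight tying: one shared tensor
# Parameter count for these two layers drops from 2 * vocab * dim to vocab * dim.
```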
Bidirectional RNNs
Utilize information from both the left and right contexts by running two separate RNNs, one left-to-right and one right-to-left. The forward and backward contexts (e.g. the final hidden states) are combined into a single vector by concatenation, element-wise addition, or element-wise multiplication.
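A sketch with PyTorch's bidirectional flag; the library concatenates the forward and backward states, and element-wise addition is shown as an alternative combination (sizes are illustrative):

```python
import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=100, hidden_size=128, batch_first=True, bidirectional=True)
x = torch.randn(4, 12, 100)                  # (batch, seq_len, emb_dim)

outputs, (h_n, _) = birnn(x)                 # outputs: (batch, seq_len, 2 * 128), concatenated
forward_out = outputs[..., :128]             # left-to-right states
backward_out = outputs[..., 128:]            # right-to-left states
summed = forward_out + backward_out          # alternative: element-wise addition

# Final state of each direction, concatenated into a single sentence representation.
sentence_vec = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * 128)
```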