Transformer Models Presentation


Context Window [32]

All models used a context window of 2048 tokens.

Altered CommonCrawl [32]

A filtered version of the CommonCrawl dataset was used as the main training dataset: the downloaded CommonCrawl data was filtered based on similarity to high-quality reference corpora, and known high-quality corpora were added to the training mix to augment it and increase diversity.

Demonstrate bidirectional pre-training [18]

BERT attempts to demonstrate the importance of bidirectional pre-training for language representations, which evaluates text that precedes AND follows the target, as well as to reduce the need for many heavily engineered task-specific architectures by utilizing pre-training.

Transformer Model Architecture [19]

BERT uses an almost identical architecture to the Transformer model presented in "Attention Is All You Need", but it uses only the encoder stack.

Number of attention heads [13]

Increasing the number of attention heads too much can cause overfitting and can diminish translation quality.

Scale [13]

Larger models produce better results.

Unlabeled data [20]

A large corpus of unlabeled data was used to pre-train a large general model, and fine-tuning was then initialized from the pre-trained model as a checkpoint using more selective, task-specific data.

Significantly Larger Models [30]

A larger model (175 Billion Parameters) and in-context learning are used to increase the accuracy of task predictions, while also decreasing the number of examples needed to train.

Checkpoints work well [21]

A pre-trained model is a diamond in the rough. It allows you to hone the model toward a more specific task rather than starting from scratch, and because the model has already been trained and tuned on general data, it requires less fine-tuning data.

Ablation Studies Results [24]

Ablation studies consist of removing parts of the model to measure the influence of the removed modules. The configurations compared were the base model, a model trained without next sentence prediction (NSP), a unidirectional language model with no NSP training, and finally "+BiLSTM", which adds a randomly initialized bidirectional long short-term memory layer on top of the previous constraints.

WebText, Books1, Books2, and Wikipedia [32]

Additionally, an expanded version of the WebText dataset, two internet-based book corpora, and the English-language Wikipedia were also used to train the models.

In-Context Exploration [30]

Although GPT models existed prior to "Language Models are Few-Shot Learners", what this paper did was explore the use of in-context learning on different sized GPT models and compare them to other models and benchmarks.

Design Focus [47]

BERT is a bidirectional encoder and uses masking to understand the context of words and phrases. GPT-3 is a unidirectional decoder trained as a generative model to predict the next word in a sentence given the previous context.

Uses [47]

BERT is designed for natural language understanding, sentiment analysis, text classification, and machine translation, especially when fine-tuned. GPT-3 is designed for text generation, content generation, text completion, and general encyclopedic knowledge.

Fine-tuning [47]

BERT models tend to be used as a checkpoint that is then fine-tuned for specific tasks while GPT-3 is designed to be a more general model that isn't commonly fine-tuned, but instead kept versatile in a wide domain.

Pre-trained using two unsupervised tasks [22]

BERT was pre-trained using two unsupervised tasks: Masked LM and Next Sentence Prediction.

Dropout [13]

Dropout is especially helpful in avoiding overfitting.

Redundancy [33]

Due to the training dataset being so large and sourced from the internet, there is a chance that the model will be trained on some of the benchmark test sets.

Encoder Only [19]

Encoder only models focus on processing input sequences to understand the relationships between tokens. Because we're not trying to generate text, the decoder architecture isn't necessary.

Evaluation [33]

Evaluated 'clean' test sets using GPT-3 and compared the scores to the original scores. If the cleaned score is similar to the original, there is no significant effect on the results; otherwise, it suggests contamination may be inflating the results.

Few-Shot Learning [29]

Few-Shot Learning provides the model with a natural language prompt and k examples of the task being done.
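
As a rough illustration (not taken from the paper), a few-shot prompt is just a task description followed by k solved examples and the query to complete; the task, examples, and formatting below are hypothetical.

```python
# Minimal sketch of assembling a few-shot prompt (hypothetical task and examples):
# a task description, k solved examples, then the query the model should complete.
def build_few_shot_prompt(description, examples, query):
    lines = [description, ""]
    for source, target in examples:          # the k in-context examples
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")              # the model is expected to continue here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")],
    "dog",
)
print(prompt)
```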

GPT-2 [31]

Finally, the model presented in "Language Models are Few-Shot Learners" again increased the parameter count over GPT-2 and explored one-shot and few-shot learning as well.

Fine Tuning [28]

Fine-tuning uses loss calculations and repeated gradient updates from example tasks to push a model in an accurate direction. This is the approach typically used in traditional ML.

Creating 'Clean' test sets [33]

For each evaluation benchmark, a 'clean' test set was produced by removing examples potentially leaked from the training data: any example with a 13-gram or total overlap with anything in the pre-training set was removed. This is done to conservatively flag potential contamination across sets.
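
A minimal sketch of the overlap check, assuming simple whitespace tokenization (the paper's actual normalization and matching pipeline is more involved):

```python
# Sketch of flagging contaminated benchmark examples via 13-gram overlap with
# the training data. Whitespace splitting stands in for real tokenization.
def ngrams(tokens, n=13):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_training_ngrams(training_docs, n=13):
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc.split(), n)
    return seen

def clean_split(benchmark_examples, training_ngrams, n=13):
    """Return (clean, flagged): examples with any n-gram overlap are flagged."""
    clean, flagged = [], []
    for example in benchmark_examples:
        overlap = ngrams(example.split(), n) & training_ngrams
        (flagged if overlap else clean).append(example)
    return clean, flagged
```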

Transformer Based Architecture [31]

GPT-3 based models are decoder-only models that focus on processing input sequences to predict the next token in the sequence.

Evolved from previous models: [31]

GPT-3 based models evolved from previous language models.

Model Size [47]

GPT-3 has a massive model size and requires significant computational resources. BERT varies in size which means its computational strain can be optimized depending on need.

GLUE [23]

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse NLP tasks. Both BERT models outperformed previous models in every category, with the BERT_LARGE model reaching SOTA in all categories.

6 Identical Layers [11]

In this model's architecture, there is a stack of 6 identical layers in both the encoder and decoder.

In-Context Learning [28]

In-context learning is a machine learning technique that uses limited or no labeled data to train a model to make accurate predictions. The idea is to learn more like a human would: you get a description of a task, perhaps see one or a few examples of how to do it, and are then expected to perform the task in a different context.

Input [19]

Inputs to the model are fed as a token sequence of either single sentences or pairs of sentences, such as a question/answer pair. A "sentence" here refers to a span of contiguous text, not a linguistic sentence.

What is multi-head attention? [9]

Instead of performing attention a single time, the attention computations are done multiple times independently but in parallel. The multiple attention heads are then merged to give a combined attention weight.
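
A minimal numpy sketch of the idea, assuming illustrative sizes and randomly initialized weight matrices in place of learned parameters: each head runs scaled dot-product attention on its own projection of the input, and the heads are concatenated and projected back.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product scores
    return softmax(scores) @ V

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # each head attends independently using its own slice of the projections
        heads.append(attention(X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo  # merge heads, then project

# toy example: 5 tokens, model dimension 16, 4 heads (illustrative sizes only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, 4, Wq, Wk, Wv, Wo).shape)  # (5, 16)
```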

The transformer model attempts to improve or eliminate limitations of RNNs [5]

It attempts to improve or remove limitations imposed by Recurrent Neural Networks by increasing parallelization, improving translation quality, decreasing the time to train models, utilizing less powerful GPUs, reducing operations, and solving resolution issues.

Differentiate based on input position [10]

It provides the model with the ability to differentiate inputs based on their positions in the input sequence.

Why sinusoidal? [10]

It was hypothesized that sinusoidal functions would allow the Transformer to easily learn to attend by relative positions. It was a pragmatic choice instead of an optimal one, choosing a simple method because it wasn't the focus of the model.

Why is it important? [7]

It's important because attention weights are vital in gaining insight into patterns to reinforce when training.

Masked LM [22]

Masked Language Model tasks train the model by randomly masking a percentage of input tokens and then predicting those masked tokens.
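
A simplified sketch of building a masked-LM example (BERT's full recipe also replaces some selected tokens with random words or leaves them unchanged; this sketch always substitutes [MASK]):

```python
import random

# Mask ~15% of tokens and record the originals as prediction targets.
def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must predict this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens))
```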

Masked Language Models (MLM) [17]

Masked language models are models that randomly mask some of the tokens from the input, with the intention of making vocabulary predictions based only on context. Using a masked language model meant that context from both the left and right side of a masked token could be used to pre-train a deep bidirectional Transformer model.

Extension of general attention [9]

Multi-head attention is an extension of general attention.

Why do multiple computations? [9]

Multiple computations allow the model to learn different contexts and varying relationships between words. A deeper understanding of a word's contextual information can improve translation quality.

GPT-1 [31]

Next "Language Models are Unsupervised Multitask Learners" used the GPT-1 model but increased the complexity of inputs while using zero-shot learning—resulting in GPT-2.

Next Sentence Prediction [22]

Next Sentence Prediction tasks train the model to learn sentence relationships by supplying two sentences to the model that may or may not be sequential. When sentence B does follow sentence A, the pair is labeled "IsNext"; conversely, if sentence B is a random sentence that does not follow sentence A, the pair is labeled "NotNext".
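
A minimal sketch of constructing NSP pairs from a list of sentences, with made-up example sentences:

```python
import random

# 50% of the time sentence B really follows sentence A ("IsNext"),
# otherwise B is a random sentence ("NotNext").
def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            pairs.append((a, sentences[i + 1], "IsNext"))
        else:
            # random, possibly unrelated sentence (a fuller implementation
            # would avoid accidentally picking the true next sentence)
            pairs.append((a, rng.choice(sentences), "NotNext"))
    return pairs

doc = ["The cat sat on the mat.", "It fell asleep.", "Rain fell all night.", "The river rose."]
for a, b, label in make_nsp_pairs(doc):
    print(label, "|", a, "->", b)
```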

One-Shot Learning [29]

One-Shot Learning is when the model is provided with a natural language prompt and a single example of the task being done.

Parallelism [32]

Parallelism was used to train the large models without running out of memory.

Positional Encoding [6]

Positional Encoding allows the model to maintain information about relative or absolute position of the tokens in the input sequence.

Maintains positional information [10]

Positional encoding is used to maintain information on relative or absolute position of inputs.

Capture context-based relationships on a larger scale [21]

Pre-training can introduce relationships that wouldn't have been captured by a fine-tuned model alone.

Pre-training [17]

Pre-training is the idea that you can train a model on a large corpus of general data before fine-tuning the model for specific tasks or leaving the model generalized.

Recurrent Neural Networks (RNN) [4]

RNNs are sequential models that take the order and context of data into account. RNNs maintain a hidden memory state that serves as a short-term memory for the network to consider previous information when making predictions.

What are RNNs used for? [4]

RNNs have been used in various tasks including Natural Language Processing, time series forecasting, machine translation, etcetera.

Attention key size [13]

Reducing the attention key size reduces translation quality.

Regularization and Dropout [12]

Regularization was handled using residual dropout. Put simply, dropout is a technique that randomly sets a subset of neurons in a layer to zero during each training step, adding noise and preventing overfitting.
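
A minimal numpy sketch of (inverted) dropout, with an illustrative dropout rate rather than the paper's exact configuration:

```python
import numpy as np

# Inverted dropout: during training, zero each activation with probability p and
# rescale the survivors so the expected value is unchanged; at inference time the
# input passes through untouched.
def dropout(x, p=0.1, training=True, rng=np.random.default_rng(0)):
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p      # True for neurons that survive
    return x * keep_mask / (1.0 - p)

activations = np.ones((2, 8))
print(dropout(activations, p=0.25))
```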

How is self-attention calculated? [7]

Self-attention is calculated by taking an input and creating query, key, and value vectors for each token, then mapping them to an attention output. A dot product is taken between the query for a single input token and the keys for all input tokens, returning an attention score for every token. A softmax is applied to the scores to produce attention weights, the weights are multiplied by the value vectors, and the results are summed. The result is an attention output vector corresponding to the input token's query interacting with every key. This process is repeated for all queries to produce a corresponding output for each token.
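
A numpy sketch of those steps for a single attention head, using random placeholder matrices in place of learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # query, key, value vectors per token
scores = Q @ K.T / np.sqrt(K.shape[-1])      # each query dotted with every key, scaled
weights = softmax(scores)                    # softmax turns scores into attention weights
output = weights @ V                         # weighted sum of value vectors per token
print(output.shape)                          # (4, 8)
```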

What is Self-Attention? [7]

Self-attention enables the model to focus on parts of the input sequence based on the similarity between embedded tokens. It compares the input to itself, calculating the relationships between tokens in the input. It allows for the efficient capture of long-term dependencies and performs better than previous models that required sequential processing.

Batching [12]

Sentences were batched together by approximate length.

Limitations of current techniques and unidirectional models. [18]

Techniques such as unidirectional systems restricted the power of pre-trained representations because they could only evaluate text that precedes the target section.

"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al., 2018 [16]

The 2018 paper, developed by the Google AI Language team, focused on transfer learning, which is when a pre-trained model is used on new tasks.

Attention Mechanisms [6]

The Attention Mechanisms allow the model to look at all inputs and have them interact with themselves to find which inputs carry the most meaning or significance.

The Decoder [11]

The Decoder attends to relevant information and generates the outputs.

Embedding [6]

The Embedding mechanisms convert input tokens and output tokens into a multidimensional vector representation to allow for mathematical operations and comparisons.

The Encoder [11]

The Encoder takes the inputs and changes their form to ease the extraction of information.

SWAG Results [24]

The Situations With Adversarial Generations (SWAG) dataset tests the model's ability to choose the most plausible continuation among four choices. BERT_BASE greatly outperformed the other models, and BERT_LARGE achieved SOTA, only underperforming the human result with 5 annotations.

SQuAD v1.1, SQuAD v2.0 [23]

The Stanford Question Answering Datasets (SQuAD) see how well a model can predict answer text spans based on a question and a Wikipedia passage. The version 2.0 set adds the possibility that the answer does not exist in the given passage. BERT outperformed all other systems on the SQuAD benchmarks other than the human benchmark, with BERT_LARGE again performing best in all fields.

Original Transformer [31]

The Transformer from "Attention Is All You Need" developed into the original GPT model through OpenAI. The paper "Improving Language Understanding by Generative Pre-Training" tweaked the model parameters and included unsupervised pre-training and fine-tuning.

What is a Transformer? [5]

The Transformer is a model architecture that separates attention mechanisms from RNN's sequential disposition—allowing for a model that runs on attention without regard for dependencies' distances between input and output sequences.

The Transformer model structure [6]

The Transformer model follows an encoder-decoder structure, using stacked self-attention and point-wise, fully connected layers.

Training Datasets [12]

The Transformer was trained on the standard WMT 2014 English-German dataset consisting of around 4.5 million sentence pairs and the WMT 2014 English-French dataset consisting of 36 million sentences.

Base Model [13]

The base model performed better than all previous models on the English-to-German translation task, and slightly worse on the English-to-French translation task—while having a lower training time than all other models.

Base model parameters [12]

The base model was trained for a total of 12 hours, with each of the 100,000 training steps taking 0.4 seconds.

Feed-Forward Networks [6]

The feed-forward networks work to find complex and non-linear relationships between inputs.

Decoder Information Stream [11]

The first decoder receives an input from the embedded and positionally encoded outputs as well as the output of the final encoder. The following decoder layers get their inputs from the previous decoder layer in addition to the final encoder output.

Encoder Information Stream [11]

The first encoder receives an input from embedding and positional encoding and feeds that into the next encoding layer as its input. Each encoder then receives the output of the previous encoder. The features extracted are then fed into the decoders as inputs.

Large Model [13]

The large model set new records in both English-to-German and English-to-French translation tasks—achieving a BLEU score of 28.4 and 41.8 respectively. In addition to performing the best, the training cost of the large model was lower than previous competitive models.

Large model parameters [12]

The larger models were trained over 3.5 days, with each of the 300,000 training steps taking 1.0 second.

SoftMax and Linear [6]

The linear layer and softmax are used to predict the next-token probabilities: a final linear transformation projects the decoder output to vocabulary logits, and a softmax converts those logits into probabilities.
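
A minimal numpy sketch of this output head, with placeholder sizes and weights:

```python
import numpy as np

# Linear projection to vocabulary logits followed by a softmax that yields
# next-token probabilities (sizes and weights are illustrative placeholders).
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 100
W, b = rng.normal(size=(d_model, vocab_size)), np.zeros(vocab_size)

decoder_output = rng.normal(size=(d_model,))      # vector for the current position
logits = decoder_output @ W + b                   # linear projection to the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: probabilities summing to 1
print(int(probs.argmax()), float(probs.max()))    # most likely next-token id
```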

Five Metrics: [23]

The metrics used to evaluate BERT were GLUE, SQuAD v1.1, SQuAD v2.0, SWAG, and ablation studies.

Hardware [12]

The models were trained on a machine with 8 NVIDIA P100 GPUs.

Characteristics in model? [10]

The positional encoding in this model was fixed, utilizing sine and cosine functions to represent each dimension of the positional encoding.
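
A numpy sketch of the fixed sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

# Fixed sinusoidal positional encoding: sine on even dimensions, cosine on odd
# dimensions, with the wavelength growing geometrically with the dimension index.
def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                              # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                              # cosine on odd dimensions
    return pe

print(positional_encoding(max_len=50, d_model=8).shape)       # (50, 8)
```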

What was it trained on? [22]

The pre-training corpus was the BookCorpus and English Wikipedia.

Limitations of RNNs? [4]

The sequential nature of RNNs makes parallelization infeasible to implement, and without parallelization, longer sequences and memory constraints hamper RNN capabilities.

Two steps: pre-training and fine-tuning [20]

There are two key steps in training a BERT model: pre-training and fine-tuning.

Two Models [19]

There were two models trained for evaluation: BERT_BASE and BERT_LARGE. BERT_BASE consisted of 12 transformer layers, a hidden size of 768, 12 self-attention heads, and 110 million parameters; the size of the base model was chosen to match the original OpenAI GPT model for comparison's sake. BERT_LARGE consisted of 24 transformer layers, a hidden size of 1024, 16 self-attention heads, and 340 million parameters.

Labeled data [20]

Think of reading all the books in a library to learn about books and then reading only mystery novels to become an expert in writing mystery novels.

Model Size [32]

To study the effect of model size on performance, 8 models of different sizes were trained, ranging from 125 million to 175 billion parameters. The largest model received the name "GPT-3".

Sequence Segregation? [19]

Token sequences contain special tokens to identify the start of a sequence, the separation between pairs, and the end of a sequence. Additionally, for sentence pairs, a learned embedding is added to every token to indicate whether the token is part of sentence A or sentence B.
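
A simplified sketch of packing a sentence pair into one input sequence, using whitespace splitting in place of WordPiece tokenization:

```python
# [CLS] marks the start, [SEP] separates and ends the pair, and segment ids mark
# sentence A (0) vs sentence B (1); the example sentences are illustrative.
def pack_pair(sentence_a, sentence_b):
    tokens_a, tokens_b = sentence_a.split(), sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair("the man went to the store", "he bought a gallon of milk")
for tok, seg in zip(tokens, segments):
    print(seg, tok)
```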

Hardware [32]

Trained on V100 GPUs on a high-bandwidth cluster provided by Microsoft.

Transfer Learning [17]

Transfer learning is the idea that you can adapt a model trained for one task to do similar related tasks.

General to specific pipeline [21]

Because you're beginning from a model that's trained on a large set of data, it has the potential to solve problems in different domains without as much fine-tuning.

Zero-Shot Learning [29]

Zero-Shot Learning is the idea that the model is only provided with a natural language prompt and no example of the task to be done.

"Language Models are Few-Shot Learners" by Brown et al., 2020 [27]

•"Language Models are Few-Shot Learners" by Brown et al., 2020 was developed by OpenAI and Johns Hopkins University

