LLM
word embedding
Let S_N = {w_i}_{i=1}^N be a sequence of N input tokens, with w_i being the i-th element. The corresponding word embedding of S_N is denoted E_N = {x_i}_{i=1}^N, where x_i ∈ R^d is the d-dimensional word embedding vector of token w_i, without position information.
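A minimal sketch of this notation, assuming PyTorch; the vocabulary size, token ids, and the tiny d are illustrative assumptions only.

```python
import torch
import torch.nn as nn

vocab_size, d = 50_257, 8                 # d-dimensional embeddings (small d for illustration)
embedding = nn.Embedding(vocab_size, d)   # lookup table: token id -> x_i in R^d

# S_N: a sequence of N token ids w_1..w_N (hypothetical ids)
S_N = torch.tensor([15496, 11, 995])      # N = 3
E_N = embedding(S_N)                      # E_N = {x_i}_{i=1}^N
print(E_N.shape)                          # torch.Size([3, 8]) -- no position information yet
```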
Pile Dataset
The Pile is an 825 GiB diverse, open-source language-modelling dataset made up of 22 smaller, high-quality datasets combined together. Applications? text generation
Abstractive summarization
Abstractive summarization, on the other hand, tries to capture the meaning of the whole text and present it to you. It generates new words and phrases, puts them together in a meaningful way, and weaves in the most important facts found in the text. This makes abstractive summarization techniques more complex, and computationally more expensive, than extractive ones. Usually an encoder-decoder model.
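A hedged sketch of abstractive summarization with the Hugging Face transformers summarization pipeline; the facebook/bart-large-cnn checkpoint and the length limits are assumptions, and any encoder-decoder summarization model would work here.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("The city council approved a new budget on Tuesday. The plan increases "
           "funding for public transit and road repairs while keeping property taxes "
           "flat. Council members said the changes respond to months of resident feedback.")

summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])   # a newly generated (not copied) summary
```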
text classification
Embeddings can help the model to assign labels or categories to texts, based on their meaning and context. For example, embeddings can help the model to classify texts as positive or negative, spam or not spam, news or opinion, etc.
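A sketch of embedding-based text classification, assuming the sentence-transformers and scikit-learn packages; the model name, toy texts, and labels are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # maps each text to a dense embedding

texts = ["I loved this product", "Terrible service, never again",
         "Absolutely fantastic", "Waste of money"]
labels = [1, 0, 1, 0]                               # 1 = positive, 0 = negative

X = encoder.encode(texts)                           # one embedding vector per text
clf = LogisticRegression().fit(X, labels)           # classifier on top of the embeddings
print(clf.predict(encoder.encode(["pretty good overall"])))
```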
text translation
Embeddings can help the model to convert texts from one language to another, while preserving the meaning and the structure of the original texts. For example, embeddings can help the model to translate texts between English and Spanish, French and German, Chinese and Japanese, etc.
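A sketch using the transformers translation pipeline; the t5-small checkpoint (which ships with English-to-German support) is an assumption here.

```python
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("Embeddings help preserve meaning across languages.")
print(result[0]["translation_text"])
```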
image generation
Embeddings can help the model to create images from texts, or vice versa, by converting different types of data into a common representation. For example, embeddings can help the model to generate images such as logos, faces, animals, landscapes, etc.
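One way to see the "common representation" idea is CLIP, which embeds images and text into the same vector space (text-to-image generators build on text embeddings like these); the openai/clip-vit-base-patch32 model id and the dummy red image are assumptions for this sketch.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="red")           # stand-in for a real image
inputs = processor(text=["a red square", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
print(out.logits_per_image.softmax(dim=-1))                 # image/text similarity in the shared space
```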
text generation
Embeddings can help the model to create new and original texts, based on the input or the prompt that the user provides. For example, embeddings can help the model to generate texts such as stories, poems, jokes, slogans, captions, etc. It can even help answer questions, give recommendations, suggest solutions to a problem, or solve coding problems for you.
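A sketch of prompt-conditioned text generation with the transformers text-generation pipeline; gpt2 is just a small illustrative checkpoint and the sampling settings are assumptions.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Write a short slogan for a coffee shop:",
                max_new_tokens=20, do_sample=True, top_p=0.9)
print(out[0]["generated_text"])
```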
GPT-J-6B
Hyperparameter: Value
n_parameters: 6,053,381,344
n_layers: 28*
d_model: 4,096
d_ff: 16,384
n_heads: 16
d_head: 256
n_ctx: 2,048
n_vocab: 50,257 (same tokenizer as GPT-2/3)
position encoding: Rotary position encodings (RoPE)
RoPE dimensions: 64
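A quick way to cross-check most of the table above is to load the published GPT-J config with transformers; the EleutherAI/gpt-j-6b model id is the public checkpoint, and the attribute names follow GPTJConfig.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("EleutherAI/gpt-j-6b")
# n_layer, n_embd, n_head, n_positions, rotary_dim should print 28 4096 16 2048 64
print(cfg.n_layer, cfg.n_embd, cfg.n_head, cfg.n_positions, cfg.rotary_dim)
```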
RLHF
Reinforcement Learning from Human Feedback. Wouldn't it be great if we could use human feedback on generated text as a measure of performance, or go even one step further and use that feedback as a loss to optimize the model? That's the idea of Reinforcement Learning from Human Feedback (RLHF): use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has enabled language models trained on a general corpus of text data to begin to align with complex human values. Applications? ChatGPT. Source: Illustrating Reinforcement Learning from Human Feedback (RLHF) (huggingface.co)
RoPE
Rotary Position Embedding. A type of position embedding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to expand to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
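A minimal NumPy sketch of RoPE: rotate consecutive (even, odd) feature pairs by position-dependent angles. The base of 10000 and the pairing convention follow the usual formulation but are assumptions, not GPT-J's exact code; the final check illustrates the relative-position property (the dot product depends only on the position offset).

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Apply rotary position embedding to a single vector x of even dimension d."""
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-2.0 * np.arange(half) / d)   # one rotation frequency per 2-D pair
    angle = position * theta                       # m * theta_j
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                      # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.randn(64), np.random.randn(64)
# same offset (4) at different absolute positions gives the same q.k score
print(np.allclose(rope(q, 5) @ rope(k, 9), rope(q, 105) @ rope(k, 109)))  # True
```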
Autoencoding models
build a bidirectional representation of a sentence by masking some of the input tokens and trying to predict them from the remaining ones. These LLMs are adept at capturing contextual relationships between tokens quickly and at scale, which makes them great candidates for text classification tasks, for example.
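A sketch of the masked-prediction objective via the transformers fill-mask pipeline; bert-base-uncased is just an illustrative autoencoding checkpoint.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The movie was absolutely [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))   # most likely tokens for the masked slot
```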
text summarization
extract or generate the most important or relevant information from texts, and to create concise and coherent summaries. For example, embeddings can help the model to summarize news articles, product reviews, research papers, etc.
extractive summarization
involves identifying important sections in the original text, and copying those sections into the summary.
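A toy extractive-summarization sketch: score each sentence by its TF-IDF weight and copy the top-scoring ones verbatim; scikit-learn and this particular scoring rule are assumptions, not a reference to any specific system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ("The storm knocked out power across the region. "
        "Crews worked overnight to restore electricity. "
        "Officials said most homes should have power by Friday. "
        "Local shops handed out free coffee to residents.")
sentences = [s.strip() for s in text.split(". ") if s]

tfidf = TfidfVectorizer().fit_transform(sentences)
scores = tfidf.sum(axis=1).A1                       # sentence score = sum of its TF-IDF weights
top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2]
print(" ".join(sentences[i] for i in sorted(top)))  # copied, not rewritten, sentences
```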
BLEU
LLM evaluation metric (BiLingual Evaluation Understudy): measures the n-gram precision overlap between generated text and one or more reference texts; most commonly used for machine translation.
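A sketch of sentence-level BLEU with NLTK; whitespace tokenization and the smoothing choice are assumptions.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # n-gram precision overlap with the reference
```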
ROUGE (max)
LLM evaluation metric (Recall-Oriented Understudy for Gisting Evaluation): measures n-gram and longest-common-subsequence overlap between a generated summary and reference summaries; most commonly used for summarization.
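A sketch of ROUGE with the rouge-score package; taking the maximum over several references is one common reading of the "(max)" qualifier and is shown with plain Python.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
references = ["the cat sat on the mat", "a cat was sitting on the mat"]
candidate = "the cat is on the mat"

best = max(scorer.score(ref, candidate)["rougeL"].fmeasure for ref in references)
print(round(best, 3))   # best ROUGE-L F1 over the references
```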
semantic meaning
meaning derived from the words themselves and how they are arranged into sentences. Semantics is the area of linguistics interested in meaning and the many ways that we can describe it.
Autoregressive (decoder) models
predict the next token in a sentence based on the previous tokens. These LLMs are effective at generating coherent free text following a given context, e.g., GPT.
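A sketch of what "autoregressive" means in practice: feed the running sequence back in and append the most likely next token at each step; gpt2 and greedy decoding are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                                    # generate 5 tokens greedily
        next_id = model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)           # condition on all previous tokens
print(tok.decode(ids[0]))
```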
autoregressive+autoencoding models
such as T5, which use both an encoder and a decoder to be more versatile and flexible in generating text. Such combination models can generate more diverse and creative text in different contexts than pure decoder-based autoregressive models, thanks to their ability to capture additional context using the encoder.
CNN-Daily mail dataset
Task? text summarization. Trained models? used to train the GPT-J model on MLCommons. Source: ccdv/cnn_dailymail · Datasets at Hugging Face
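A sketch of loading the dataset with the Hugging Face datasets library; the "3.0.0" configuration name is the usual one for ccdv/cnn_dailymail, and newer datasets versions may prefer the canonical cnn_dailymail repo instead.

```python
from datasets import load_dataset

ds = load_dataset("ccdv/cnn_dailymail", "3.0.0", split="train[:3]")
for ex in ds:
    print(ex["article"][:80], "->", ex["highlights"][:80])   # article paired with its summary
```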