291K final
(Results) PAL: Program-aided Language Models
In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models.
How does RAG work? What is the name of the neural retriever they use, and how does this neural retriever work with the seq2seq model to generate the output?
The retriever (Dense Passage Retriever [26], henceforth DPR) provides latent documents conditioned on the input, and the seq2seq model (BART [32]) then conditions on these latent documents together with the input to generate the output.
What is unsupervised pre-training and what is the point?
Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective.
What are the benefits of ToT?
(1) Generality. IO, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e. trees of limited depth and breadth; Figure 1). (2) Modularity. The base LM, as well as the thought decomposition, generation, evaluation, and search procedures can all be varied independently. (3) Adaptability. Different problem properties, LM capabilities, and resource constraints can be accommodated. (4) Convenience. No extra training is needed; a pre-trained LM is sufficient.
What is generative question answering?
Generative question answering in machine learning is the creation of answers to questions by synthesizing new responses rather than selecting from a pre-defined set.
How do they implement retrieval in the FiD paper?
They consider two methods: 1.) BM25: passages are represented as bags of words, and the ranking function is based on term and inverse document frequencies. 2.) DPR: passages and questions are represented as dense vector representations, computed using two BERT networks. The ranking function is the dot product between the query and passage representations.
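To make the DPR ranking function concrete, here is a minimal sketch (my own illustration, not the paper's code): it assumes the question and passage embeddings have already been produced by the two encoders, and uses random placeholder vectors in their place.

```python
import numpy as np

def dpr_scores(question_vec: np.ndarray, passage_vecs: np.ndarray) -> np.ndarray:
    """DPR-style ranking: score each passage by the dot product between
    the question embedding and the passage embedding."""
    return passage_vecs @ question_vec

# Toy example with random stand-in embeddings for the two BERT encoders.
rng = np.random.default_rng(0)
q = rng.normal(size=768)            # question embedding (question encoder output)
P = rng.normal(size=(5, 768))       # 5 passage embeddings (passage encoder outputs)
scores = dpr_scores(q, P)
top_k = np.argsort(-scores)[:2]     # indices of the 2 highest-scoring passages
print(top_k, scores[top_k])
```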
What is tool learning?
Tool learning aims to unleash the power of large language models (LLMs) to effectively interact with various tools (APIs) to accomplish complex tasks.
What is the limitation addressed in the REALM paper, and what is the solution to the limitation they came up with?
- Limitation: LM pre-training has been shown to capture a surprising amount of knowledge useful for downstream tasks. However, since this knowledge is stored in the parameters of a neural network, the network must be ever larger to cover more facts. - Solution: They want to capture knowledge in a more modular, interpretable, and scalable way. So, they augment LM pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus (e.g. Wikipedia) DURING pre-training, fine-tuning, and inference. They train the knowledge retriever in an UNSUPERVISED way, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents.
What is backpropagation?
A crucial algorithm that computes gradients of the loss function with respect to the weights of the neural network by propagating the error backward from the output layer to the input layers. It involves a forward pass, where input data is passed through the network to generate predictions and calculate the loss, and a backward pass, where the gradients are calculated from the loss and propagated backward through the network to update the weights.
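A tiny worked example of the idea (my own sketch, not from any paper): a two-layer network on one training example, with the backward pass written out by hand so the chain rule is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))           # input
y = np.array([[1.0]])                 # target
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))

for step in range(100):
    # Forward pass: compute the prediction and the squared-error loss.
    h = np.tanh(W1 @ x)               # hidden activations
    y_hat = W2 @ h                    # prediction
    loss = 0.5 * ((y_hat - y) ** 2).item()

    # Backward pass: propagate the error from the output back to each weight.
    d_yhat = y_hat - y                # dL/dy_hat
    dW2 = d_yhat @ h.T                # dL/dW2
    d_h = W2.T @ d_yhat               # dL/dh
    d_pre = d_h * (1 - h ** 2)        # through the tanh nonlinearity
    dW1 = d_pre @ x.T                 # dL/dW1

    # Gradient descent update of the weights.
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

print(f"final loss: {loss:.6f}")
```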
What is T5?
Every task we consider—including translation, question answering, and classification—is cast as feeding our model text as input and training it to generate some target text. This allows us to use the same model, loss function, hyperparameters, etc. across our diverse set of tasks. It also provides a standard testbed for the methods included in our empirical survey. "T5" refers to our model, which we dub the "Text-to-Text Transfer Transformer".
What is a hybrid model, and what are two examples of hybrid models?
A hybrid model is one that combines parametric memory with non-parametric (external, retrieval-based) memories. These models can address some issues of purely parametric models, since knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted. Two examples are REALM and ORQA, two recently introduced models that combine masked language models with a differentiable retriever.
What are the key advantages of the ToT framework, and what mode of decision making is it meant to emulate?
(1) maintains and explores diverse alternatives for current choices instead of just picking one, and (2) evaluates its current status and actively looks ahead or backtracks to make more global decisions. This is meant to emulate the deliberate, conscious mode ("System 2"), whereas the simple associative token-level choices of LMs resemble "System 1": a fast, automatic, unconscious mode.
(Key Takeaways) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Proposes BLIP-2, a new vision-language pre-training method that leverages frozen pre-trained image encoders and large language models (LLMs). This is more compute-efficient than end-to-end pre-training.
- Introduces a Querying Transformer (Q-Former) that is pre-trained in two stages: (1) vision-language representation learning, (2) vision-to-language generative learning. This bridges the gap between the visual and text modalities.
- Achieves state-of-the-art performance on tasks like VQA, image captioning, and image-text retrieval while using far fewer trainable parameters than previous methods.
- Enables zero-shot instructed image-to-text generation by leveraging the capabilities of LLMs.
Key Findings:
- The two-stage pre-training of Q-Former is critical: representation learning reduces the burden on the LLM to learn vision-language alignment. Without it, performance drops significantly.
- Using a stronger image encoder or LLM leads to better performance, validating BLIP-2 as a generic pre-training approach.
- Adding an image-grounded text generation loss during the representation learning stage further improves image-text retrieval.
Main Points:
- Bootstrap from a frozen image encoder to extract useful visual features
- Bootstrap from frozen LLMs to leverage their text generation capabilities
- Q-Former acts as an information bottleneck between modalities
- Two-stage pre-training strategy to bridge the gap between modalities
- More efficient than end-to-end pre-training
- Enables zero-shot image-to-text generation
Give an example of standard prompting vs chain of thought prompting
1.) Standard Prompting:
Exemplar: Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: The answer is 11.
Test question: Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model Output: A: The answer is 27. (wrong)
2.) CoT Prompting:
Exemplar: Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Test question: Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model Output: A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9. (correct)
Pre-Training BART: What are the different noising schemes used by BART in the pretraining process?
1.) Token Masking: Following BERT, random tokens are sampled and replaced with [MASK] elements. 2.) Token Deletion: Random tokens are deleted from the input. In contrast to token masking, the model must decide which positions are missing inputs. 3.) Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single [MASK] token. 0-length spans correspond to the insertion of [MASK] tokens. Text infilling teaches the model to predict how many tokens are missing from a span. 4.) Sentence Permutation: A document is split into sentences based on full stops, and these sentences are shuffled in a random order. 5.) Document Rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token. This task trains the model to identify the start of the document.
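A very simplified sketch of the text-infilling scheme (my own illustration, assuming a plain token list; the real BART implementation works on subword IDs and controls the overall mask budget more carefully):

```python
import numpy as np

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0, seed=0):
    """Simplified BART-style text infilling: spans with Poisson-distributed
    lengths are replaced with a single [MASK]; a 0-length span inserts a bare
    [MASK] without removing any tokens."""
    rng = np.random.default_rng(seed)
    out, i = [], 0
    n_to_mask = int(len(tokens) * mask_ratio)
    masked = 0
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < 0.2:
            span = int(rng.poisson(poisson_lambda))
            out.append("[MASK]")      # one mask token stands in for the whole span
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```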
How does the Fid paper address open domain question answering?
1.) first retrieving supporting passages using either sparse or dense representations. 2.) a sequence-to-sequence model generates the answer, taking as input the retrieved passages in addition to the question.
AGENTS
AGENTS is an open-source framework designed to democratize the development of autonomous language agents, offering features such as planning, memory, multi-agent communication, and symbolic control with minimal coding required. It addresses limitations of existing frameworks by enabling more consistent behavior, supporting human-agent interaction, and introducing symbolic plans (standard operating procedures, SOPs) for controllable, customizable agents. SOPs allow for detailed, step-by-step behavior guidelines, making agents' actions more predictable and easily adjustable, enhancing user experience and agent performance.
(Abstract) REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS
Abstract: While large language models (LLMs) have demonstrated impressive performance across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources such as knowledge bases or environments. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines in addition to improved human interpretability and trustworthiness. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes prevalent issues of hallucination and error propagation in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generating human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. Furthermore, on two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.
Adapter Modules
Adapter Modules introduce a parameter-efficient approach to transfer learning in NLP by injecting small, randomly initialized layers (adapters) into a pretrained network for specific tasks, keeping the original network's weights frozen. This method contrasts with standard fine-tuning by allowing the original model to remain unchanged and shareable across tasks, focusing the training solely on the new adapter layers. These adapters are designed to be minimal in size and start with a near-identity initialization to facilitate rapid adaptation without overwhelming the existing model architecture.
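A minimal sketch of a bottleneck adapter (my own illustration, not the Houlsby et al. code): project down, apply a nonlinearity, project back up, add a residual connection, and initialize near the identity. The dimensions are placeholder choices.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted into a frozen pretrained network:
    only these small layers are trained for the downstream task."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # Zero-initialize the up-projection so the adapter starts near the identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: freeze the pretrained model's weights and train only the adapter parameters.
hidden = torch.randn(2, 16, 768)      # (batch, seq_len, hidden_dim)
adapter = Adapter(hidden_dim=768)
print(adapter(hidden).shape)          # same shape as the input
```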
more on FLAN
Add stuff here
What is supervised fine tuning?
After training the model with the unsupervised language modeling objective (Eq. 1 in the GPT paper), we adapt the parameters to the supervised target task.
What is multi-task learning?
An alternative approach, called "multi-task learning", is to train the model on multiple tasks at a time.
What is an autoencoder?
An autoencoder is a neural network used for unsupervised learning to encode input data into a lower-dimensional latent space and then reconstruct it back to its original form, comprising two main parts: the encoder, which compresses the input into a latent space representation, and the decoder, which reconstructs the input from this representation. The latent space is a compressed knowledge representation of the input, and the final output is a reconstruction of the original input data, aiming to be as close to the input as possible using the compressed information.
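A minimal sketch of an autoencoder (my own illustration; the layer sizes are arbitrary choices for e.g. flattened 28x28 images):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder compresses the input to a low-dimensional latent vector;
    decoder reconstructs the input from that latent representation."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)           # latent-space representation
        return self.decoder(z)        # reconstruction of the input

model = AutoEncoder()
x = torch.rand(8, 784)
loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss
loss.backward()
print(loss.item())
```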
What is BART?
BART is a denoising autoencoder that maps a corrupted document to the original document it was derived from. It is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder.
What is a key difference between BERT and BART
BERT is an encoder-only model: it uses MLM to better understand the context of the input text and produce contextualized embeddings. The primary function of BERT is to generate deep contextual representations that can be used by downstream tasks. BART, on the other hand, includes both an encoder and a decoder, fitting the seq2seq model architecture. This structure allows BART not only to understand and interpret input text using MLM and the encoder, but also to generate and transform text via the decoder.
BART: Sequence generation tasks
Because BART has an autoregressive decoder, it can be directly fine-tuned for sequence generation tasks such as abstractive question answering and summarization.
How does the prediction work in REALM?
Before making each prediction, the language model uses the retriever to retrieve documents from a large corpus such as Wikipedia, and then attends over those documents to help inform its prediction. Learning this model end-to-end requires backpropagating through a retrieval step that considers an entire corpus of textual knowledge.
What is BiLSTM?
BiLSTM, or Bidirectional Long Short-Term Memory, is a type of recurrent neural network that processes data in both forward and backward directions, improving model understanding of context in sequence data like text or time series.
What is different about the Fid model?
By processing passages independently in the encoder, but jointly in the decoder, this method differs from Min et al. (2020) and Lewis et al. (2020). Processing passages independently in the encoder allows the model to scale to a large number of contexts, since it only performs self-attention over one context at a time. This means that the computation time of the model grows linearly with the number of passages, instead of quadratically. On the other hand, processing passages jointly in the decoder allows the model to better aggregate evidence from multiple passages.
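A rough sketch of the Fusion-in-Decoder idea (my own illustration: generic PyTorch Transformer modules and random tensors stand in for the paper's pretrained T5 encoder/decoder and real token embeddings):

```python
import torch
import torch.nn as nn

hidden_dim, n_passages, seq_len = 64, 4, 16
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True), num_layers=1)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(hidden_dim, nhead=4, batch_first=True), num_layers=1)

# One (question + passage_i) embedding sequence per retrieved passage (placeholders).
pairs = [torch.randn(1, seq_len, hidden_dim) for _ in range(n_passages)]

# Encode each pair independently: self-attention cost grows linearly with n_passages.
encoded = [encoder(p) for p in pairs]

# Concatenate all encoder outputs and decode jointly, so the decoder can aggregate
# evidence across every retrieved passage while generating the answer.
memory = torch.cat(encoded, dim=1)             # (1, n_passages * seq_len, hidden_dim)
answer_prefix = torch.randn(1, 8, hidden_dim)  # placeholder decoder input embeddings
out = decoder(answer_prefix, memory)
print(out.shape)
```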
What is CoT prompting?
Chain-of-thought prompting involves feeding a model prompts that demonstrate a step-by-step reasoning process, encouraging it to generate answers through intermediate steps rather than a direct conclusion, which enhances its ability to tackle complex problems or questions that require multi-step reasoning. This approach can improve performance on tasks like arithmetic reasoning, multi-step word problems, and complex decision-making by making the model's reasoning process more transparent and logically structured.
What are some advantages of CoT prompting?
Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model's computations that support an answer remains an open question). Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting.
What is contrastive learning and what is an example of it?
Contrastive learning is a type of self-supervised representation learning in which a model is trained to pull the representations of related (positive) pairs together while pushing the representations of unrelated (negative) pairs apart. CLIP is an example: it learns image and text encoders by predicting which caption goes with which image, treating the matching (image, text) pairs in a batch as positives and all other pairings as negatives.
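A minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss (my own illustration; the embeddings here are random placeholders for encoder outputs):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Matching (image, text) pairs on the diagonal are positives;
    every other pairing in the batch serves as a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(image_emb.size(0))         # positive pair i <-> i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```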
(Abstract) TOOLLLM: FACILITATING LARGE LANGUAGE MODELS TO MASTER 16000+ REAL-WORLD APIS
Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability.
Directional Stimulus Prompting
Directional Stimulus Prompting is a novel framework designed to steer large language models (LLMs) towards generating specific outcomes by using a tunable policy model, like T5, to craft auxiliary directional stimulus prompts for each input. These prompts act as tailored hints guiding the LLM to include certain keywords or concepts in its output. Unlike methods that augment LLMs with external information, this approach generates prompts based on the input query alone, leveraging supervised fine-tuning and reinforcement learning to optimize the policy model. This technique allows for nuanced control over LLM outputs by providing instance-specific guidance without modifying the LLM itself.
(Key Takeaways) DistilBERT
- DistilBERT is a distilled and smaller version of BERT that is 40% smaller, 60% faster, yet retains 97% of BERT's language understanding capabilities.
- Using knowledge distillation during the pre-training phase, instead of just during fine-tuning for specific tasks, allows creating a smaller general-purpose language model.
- A triple loss function combining language modeling, distillation, and cosine-distance losses is used to transfer inductive biases from the teacher BERT model to the DistilBERT student model.
- DistilBERT performs very well on the GLUE benchmark and downstream tasks like sentiment analysis and question answering compared to BERT.
- DistilBERT has 6 layers compared to BERT's 12 layers, and no token-type embeddings or pooler. This makes the model smaller and faster.
- DistilBERT is a compelling option for on-device inference applications as it is 60% faster than BERT with far fewer parameters.
Key Words and Definitions:
- Knowledge Distillation: Training a small model (student) to reproduce the behavior of a larger model (teacher), transferring knowledge.
- Transformer: The base neural architecture used by models like BERT; uses an attention mechanism rather than recurrence.
- GLUE benchmark: General Language Understanding Evaluation benchmark, a collection of 9 datasets for evaluating language understanding systems.
- Fine-tuning: Adapting a pre-trained model to a downstream task by training the weights on task-specific data, rather than training from scratch.
- On-device inference: Running trained models on client devices rather than servers; enables new applications but requires smaller, faster models.
What step is taken during transfer in this paper?
During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.
Limitation addressed in Fid paper
Factual information can be extracted from large-scale language models trained on vast quantities of data. Coupled with advances in pre-training, this yields generative models for open domain question answering. However, such a model contains billions of parameters, since all the information needs to be stored in the weights. Solution: investigate how much this method could benefit from having access to an external source of knowledge (e.g. Wikipedia) through retrieval, instead of relying purely on internal parametric knowledge.
What is instruction tuning?
Finetuning language models on a collection of datasets described via instructions. Purpose: improve zero-shot performance.
REALM's generative process
For both pre-training and fine-tuning, REALM takes some input x and learns a distribution p(y | x) over possible outputs y. For pre-training, the task is masked language modeling: x is a sentence from a pre-training corpus X with some tokens masked out, and the model must predict the value of those missing tokens, y. For fine-tuning, the task is Open-QA: x is a question, and y is the answer. REALM decomposes p(y | x) into two steps: retrieve, then predict. Given an input x, we first retrieve possibly helpful documents z from a knowledge corpus Z. We model this as a sample from the distribution p(z | x). Then, we condition on both the retrieved z and the original input x to generate the output y, modeled as p(y | z, x). To obtain the overall likelihood of generating y, we treat z as a latent variable and marginalize over all possible documents z.
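Written out, the marginalization described above is (in the paper's notation):

\[ p(y \mid x) \;=\; \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x) \]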
(Abstract) Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires using models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.
What is parametric vs non parametric memory?
In machine learning, parametric memory involves models that summarize data with a fixed number of parameters, regardless of data size, while non-parametric memory models can grow in complexity with the size of the data, not limited by a fixed parameter count.
What is the point of T5?
It is a text-to-text framework that offers a unified approach so that we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
What is meant by "learning this model end-to-end" in REALM?
It means the gradient from the loss function is used to update not just the prediction-related parameters but also those involved in the retrieval and attention mechanisms.
What does Fid stand for?
It stands for the architecture of the model they used (Fusion-in-Decoder): In this model, the question and all of the associated supporting passages are grouped together into pairs and fed into an encoder, which creates encodings. These encodings are then concatenated together and fed to the decoder to generate the answer. - Here is an example of what is fed into the encoder: 1.) (Question + Passage 1) . . . N.) (Question + Passage N)
(Abstract) REALM: Retrieval-Augmented Language Model Pre-Training
Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.
Limitation addressed in (ToT) paper
Language models can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. Solution: - The ToT framework enables exploration over coherent units of text ("thoughts") that serve as intermediate steps toward problem solving. - Most importantly: it allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.
(Abstract) Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, "Tree of Thoughts" (ToT), which generalizes over the popular "Chain of Thought" approach to prompting language models, and enables exploration over coherent units of text ("thoughts") that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
(Abstract) PAL: Program-aided Language Models
Large language models (LLMs) have demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as "chain-of-thought", which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and others. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using CODEX achieves state-of-the-art few-shot accuracy on GSM8K, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1.
(Abstract) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
(Abstract) Training language models to follow instructions with human feedback
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
What is masked language modeling?
Masked language modeling (MLM) is a training technique used in natural language processing where some words in a sentence are randomly masked (hidden) and the model is tasked with predicting the masked words based on the context provided by the unmasked words. This approach, popularized by models like BERT, helps in learning deep bidirectional representations from unlabelled text by enabling the model to understand the context and relationships between words in a sentence.
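A minimal sketch of the masking step (my own illustration; real BERT masking works on subword IDs and uses an 80/10/10 mask/random/keep rule, which this simplification skips):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide a fraction of tokens; the model is trained to predict
    the original token at each masked position from the surrounding context."""
    random.seed(seed)
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)        # prediction target at this position
        else:
            masked.append(tok)
            labels.append(None)       # no loss computed here
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```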
What does it mean for a model to be aligned?
Models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.
What is the reinforcement learning process for InstructGPT?
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, they collect a dataset of labeler demonstrations of the desired model behavior, which is used to fine-tune GPT-3 with supervised learning. They then collect a dataset of human rankings of model outputs, train a reward model on these rankings, and further fine-tune the supervised model against that reward model using reinforcement learning from human feedback (PPO). The resulting models are called InstructGPT.
(Abstract) Improving Language Understanding by Generative Pre-Training
Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
What is open domain question answering?
Open domain question answering is the task of answering general domain questions, in which the evidence is not given as input to the system.
What was a takeaway of the RAG paper?
Our results highlight the benefits of combining parametric and non-parametric memory with generation for knowledge-intensive tasks—tasks that humans could not reasonably be expected to perform without access to an external knowledge source.
Prefix Tuning
Prefix-tuning offers a scalable, lightweight alternative to fine-tuning for natural language generation by optimizing a task-specific vector (prefix) instead of the entire model, significantly reducing storage and computational costs. It enables models to attend to these prefixes as "virtual tokens," achieving comparable or superior performance to fine-tuning with just 0.1% of the parameter updates, especially in low-data scenarios and with novel topics. This method allows for modular training and multiple task support with minimal overhead, making it ideal for personalized applications without data cross-contamination. It keeps the original model weights frozen during the fine-tuning process.
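A conceptual sketch of the idea (my own simplification: the trainable prefix is prepended only to the input embeddings, whereas the actual method prepends trainable activations at every layer; the base model here is a generic placeholder):

```python
import torch
import torch.nn as nn

class PrefixTunedModel(nn.Module):
    """Prepend trainable 'virtual token' embeddings to the input;
    the underlying pretrained model stays frozen and only the prefix is trained."""
    def __init__(self, frozen_model: nn.Module, hidden_dim: int, prefix_len: int = 10):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        self.frozen_model = frozen_model
        for p in self.frozen_model.parameters():
            p.requires_grad = False                  # only the prefix gets gradients

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return self.frozen_model(torch.cat([prefix, input_embeds], dim=1))

# Toy "frozen model": any module mapping (batch, seq, hidden) -> (batch, seq, hidden).
base = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(64, nhead=4, batch_first=True), num_layers=1)
model = PrefixTunedModel(base, hidden_dim=64)
out = model(torch.randn(2, 12, 64))
print(out.shape)   # (2, 22, 64): 10 prefix positions + 12 input positions
```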
What is PAL?
Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter.
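To illustrate, here is what a PAL-style reasoning trace could look like for the tennis-ball word problem from the CoT card (my own example in the spirit of the paper, not taken verbatim from it): the LLM writes the steps as code, and the Python interpreter does the arithmetic.

```python
# "Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
#  Each can has 3 tennis balls. How many tennis balls does he have now?"
tennis_balls = 5
bought_cans = 2
balls_per_can = 3
answer = tennis_balls + bought_cans * balls_per_can
print(answer)   # 11 -- computed by the interpreter, not by the LLM
```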
What does ReAct do?
ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with external environments (e.g. Wikipedia) to incorporate additional information into reasoning (act to reason).
RePLUG
RePlug introduces a retrieval-augmented language modeling framework that enhances black-box large language models (LLMs) with a trainable retriever. This retriever fetches relevant documents based on the input and prepends them to the LLM's input, allowing the LLM to remain frozen while the retriever is fine-tuned to improve relevance. Unlike previous methods that require specialized training of the LLM to incorporate retrieved texts, RePlug's simple, plug-and-play design is compatible with any LLM and retrieval model. This approach not only simplifies the augmentation of LLMs with external knowledge but also leverages the LLM itself to guide the retriever towards documents that enhance prediction accuracy.
(Abstract) Language Models are Few-Shot Learners
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
What is retrieval augmented generation?
Retrieval-augmented generation in machine learning is a technique where a model retrieves relevant information from a database or corpus to inform and enhance the generation process of new content, improving accuracy and contextuality in tasks like text generation or question answering. But in the case of this paper, it is a fine-tuning approach where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
How is InstructGPT made?
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.
What is TF/IDF?
TF/IDF, or Term Frequency/Inverse Document Frequency, is a numerical statistic used to reflect how important a word is to a document in a collection or corpus, often used in information retrieval and text mining to weigh and evaluate words' relevance.
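A small from-scratch computation of TF-IDF (my own sketch; the toy corpus and the plain log IDF formula are illustrative choices, and real systems use smoothed variants):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf_idf(term, doc, corpus):
    """tf-idf = (term frequency in the document) * log(N / document frequency)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("cat", docs[0], docs))   # "cat" appears in 2 of 3 docs -> moderate weight
print(tf_idf("mat", docs[0], docs))   # "mat" appears in only 1 doc -> higher idf
```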
What heuristic is used by the search algorithm in ToT to determine which states to keep exploring and in which order?
The LM itself is used to deliberately reason about the states. Here is the process:
(a) Value each state independently: V(p_θ, S)(s) ∼ p_θ^{value}(v | s) ∀ s ∈ S, where a value prompt reasons about the state s to generate a scalar value v (e.g. 1-10) or a classification (e.g. sure/likely/impossible) that can be heuristically turned into a value. The basis of such evaluative reasoning can vary across problems and thought steps. In this work, we explore evaluation via a few lookahead simulations (e.g. quickly confirm that 5, 5, 14 can reach 24 via 5 + 5 + 14, or that "hot l" can mean "inn" by filling "e" in "_") plus commonsense (e.g. 1, 2, 3 are too small to reach 24, or no word can start with "tzxc"). While the former might promote "good" states, the latter can help eliminate "bad" states. Such valuations do not need to be perfect; they only need to be approximately helpful for decision making.
(b) Vote across states: V(p_θ, S)(s) = 1[s = s*], where a "good" state s* ∼ p_θ^{vote}(s* | S) is voted out based on deliberately comparing different states in S in a vote prompt. When problem success is harder to value directly (e.g. passage coherency), it is natural to instead compare different partial solutions and vote for the most promising one. This is similar in spirit to a "step-wise" self-consistency strategy, i.e. cast "which state to explore" as a multi-choice QA and use LM samples to vote for it.
For both strategies, we can prompt the LM multiple times and aggregate the value or vote results, trading time/resources/cost for more faithful/robust heuristics.
Positional Interpolation
The study introduces Positional Interpolation as a solution to extend the context window of pre-trained large language models (LLMs) like LLaMA, overcoming the limitations of direct fine-tuning and weak extrapolation properties of positional encodings. This method downscales position indices to fit within the original context window limit, enabling the extension of context windows without adding extra weight or altering the model's architecture. Positional Interpolation allows for effective context window extension with minimal fine-tuning, leveraging existing infrastructure and optimization methods.
What generator (seq2seq) model is used in the RAG architecture in the paper?
They mention that the generator component can be modeled using any encoder-decoder, but that they specifically use BART-large, a pre-trained seq2seq transformer with 400M parameters.
(Abstract) FINETUNED LANGUAGE MODELS ARE ZERO-SHOT LEARNERS
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
ToT Thought Generator
Thought generator G(p_θ, s, k). Given a tree state s = [x, z_{1···i}], we consider two strategies to generate k candidates for the next thought step:
(a) Sample i.i.d. thoughts from a CoT prompt (Creative Writing, Figure 4): z^{(j)} ∼ p_θ^{CoT}(z_{i+1} | s) = p_θ^{CoT}(z_{i+1} | x, z_{1···i}) (j = 1 · · · k). This works better when the thought space is rich (e.g. each thought is a paragraph), and i.i.d. samples lead to diversity.
(b) Propose thoughts sequentially using a "propose prompt" (Game of 24, Figure 2; Crosswords, Figure 6): [z^{(1)}, · · · , z^{(k)}] ∼ p_θ^{propose}(z_{i+1}^{(1···k)} | s). This works better when the thought space is more constrained (e.g. each thought is just a word or a line), so proposing different thoughts in the same context avoids duplication.
How does ToT worK?
ToT actively maintains a tree of thoughts, where each thought is a coherent language sequence that serves as an intermediate step toward problem solving (Table 1). Such a high-level semantic unit allows the LM to self-evaluate the progress different intermediate thoughts make towards solving the problem through a deliberate reasoning process that is also instantiated in language. Finally, we combine this language-based capability to generate and evaluate diverse thoughts with search algorithms, such as breadth-first search (BFS) or depth-first search (DFS), which allow systematic exploration of the tree of thoughts with lookahead and backtracking. => The ToT is literally a tree of thoughts, which is explored (lookahead and backtracking) with algorithms like BFS and DFS.
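A pseudocode-level sketch of ToT with breadth-first search (my own illustration; generate_thoughts and evaluate_state stand in for the LM-prompted thought generator and value/vote heuristic described in the cards above):

```python
def tree_of_thoughts_bfs(problem, generate_thoughts, evaluate_state,
                         max_depth=3, breadth=5, keep=2):
    """BFS over partial solutions ("thoughts"): at each depth, expand every kept
    state with candidate next thoughts, score the new states, and keep only the
    most promising ones for further exploration."""
    frontier = [[]]                                   # each state is a list of thoughts
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            for thought in generate_thoughts(problem, state, k=breadth):
                candidates.append(state + [thought])
        # Keep the top-scoring states according to the LM-based evaluator.
        candidates.sort(key=lambda s: evaluate_state(problem, s), reverse=True)
        frontier = candidates[:keep]
    return frontier[0] if frontier else []

# Toy usage with stand-in generator/evaluator functions (not real LM calls).
best = tree_of_thoughts_bfs(
    "toy problem",
    generate_thoughts=lambda p, s, k: [f"step{len(s)+1}.{i}" for i in range(k)],
    evaluate_state=lambda p, s: -len(s[-1]) if s else 0,
)
print(best)
```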
ToolLLM
ToolLLM introduces a comprehensive framework for enhancing open-source large language models (LLMs) with tool-use capabilities, particularly for executing tasks involving external tools like APIs. It features ToolBench, a dataset for instruction-tuning with API use, and employs a depth-first search-based decision tree (DFSDT) algorithm for improved decision-making and reasoning. By fine-tuning LLaMA with ToolBench and integrating a neural API retriever, ToolLLaMA demonstrates superior tool-use abilities, outperforming standard models in handling complex instructions and unseen APIs with minimal manual intervention. The framework significantly advances LLMs' practical utility in real-world applications by enabling them to interact with and utilize external tools effectively.
(Abstract) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
What approach is used in RAG?
We combine a pre-trained retriever (Query Encoder + Document Index) with a pre-trained seq2seq model (Generator) and fine-tune end-to-end. For query x, we use Maximum Inner Product Search (MIPS) to find the top-K documents zi. For final prediction y, we treat z as a latent variable and marginalize over seq2seq predictions given different documents.
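Written out in the paper's RAG-Sequence formulation (as best I recall the notation; p_η is the retriever and p_θ the generator), the marginalization over the top-K retrieved documents is approximately:

\[ p_{\text{RAG-Sequence}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}K\,p_\eta(\cdot \mid x)} p_\eta(z \mid x)\, \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \]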
What is one of the main gains realized in this paper?: "Improving Language Understanding by Generative Pre-Training"
We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
(Key Takeaways) CLIP: Learning Transferable Visual Models From Natural Language Supervision
We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. In earlier related work, the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) were converted into a bag-of-words multi-label classification task, and pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks.
What is RAG?
We endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG). We build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We combine these components in a probabilistic model trained end-to-end.
(Abstract) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking.
What is ToolBench?
We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction.
What is introduced in the "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" paper?
1.) We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. 2.) We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token.
(Abstract) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
What is FLAN?
We take a pretrained language model of 137B parameters and perform instruction tuning—finetuning the model on a mixture of more than 60 NLP datasets expressed via natural language instructions. We refer to this resulting model as FLAN, for Finetuned Language Net.
What is transfer learning?
Where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task.
What is React?
a general paradigm to combine reasoning and acting with language models for solving diverse language reasoning and decision making tasks
More on bart
add here
More on REALM
add stuff
More on that paper
add stuff here
What benchmarks were used to test ReAct?
four diverse benchmarks: 1.) question answering 2.) fact verification 3.) text-based game 4.) webpage navigation
How did they evaluate FLAN's performance?
They group NLP datasets into clusters based on their task types and hold out each cluster for evaluation while instruction-tuning FLAN on all other clusters. This setup ensures that FLAN has not seen any natural language inference tasks during instruction tuning; they then evaluate its ability to perform zero-shot natural language inference.
What are some of the noising approaches used in the BART paper?
randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token
REALM model architecture
two key components: the neural knowledge retriever, which models p(z | x), and the knowledge-augmented encoder, which models p(y|z,x).
What do they add in addition to the LLM in ToolLLM?
we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction.