General ML/LLM
What are the broad risks associated with building AI models in production?
1. Alignment = ensuring AI models understand and follow user intent
2. Bias = mitigating unintended biases in AI-generated content
3. Security = protecting AI models and applications from malicious inputs
4. Safety = preventing damaging and harmful AI-generated outputs
What does a level 5 "good" prompt involve?
1. Description of the high-level goal
2. Detailed bulleted list of sub-tasks
3. Explicit statement asking the LLM to explain its own output
4. Guideline on how the LLM output will be evaluated
5. Few-shot examples
What are the 2 questions to ask when doing few-shot learning?
1. Does the LLM understand the examples given in the prompt? Evaluate this by inputting the same examples and seeing whether the model outputs the same scores.
2. Does the LLM overfit to these few-shot examples? Evaluate the model on separate, held-out examples.
How does next token prediction work?
1. Take the input text and tokenize it into numeric token IDs that are fed into the LLM.
2. The model outputs a probability distribution over the entire vocabulary (all tokens available to the model); each token gets a probability of being the next one in the sequence.
3. Based on those probabilities, sample one token to continue the sequence.
4. Append the sampled token to the input sequence, then repeat the process.
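A minimal sketch of this loop in Python; `model` (returning logits over the vocabulary) and `tokenizer` are hypothetical interfaces, not any specific library:

```python
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    """Next-token prediction loop; `model` and `tokenizer` are hypothetical interfaces."""
    token_ids = list(tokenizer.encode(prompt))               # 1. text -> token ids
    for _ in range(max_new_tokens):
        logits = model(token_ids)                            # 2. one score per vocabulary token
        probs = np.exp((logits - np.max(logits)) / temperature)
        probs /= probs.sum()                                 #    softmax -> probability distribution
        next_id = int(np.random.choice(len(probs), p=probs)) # 3. sample one token
        token_ids.append(next_id)                            # 4. append and repeat
    return tokenizer.decode(token_ids)
```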
Talk-to-your-data workflow 4 steps
1. Organize your internal data into a database (SQL DB, graph DB, vector DB, or text DB).
2. Given input in natural language, convert it into the query language of the internal DB (e.g., for a SQL DB this step produces a SQL query; for an embedding DB it may be an ANN retrieval query; for plain text it extracts a search query).
3. Execute this query against the DB to obtain the query result.
4. Translate the query result back into natural language.
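A minimal sketch of steps 2-4 for a SQL-backed setup, assuming a hypothetical `llm()` text-in/text-out helper and a made-up `orders` schema:

```python
import sqlite3

def answer_from_db(question, conn, llm):
    """llm() is a hypothetical text-in/text-out helper; the schema below is made up."""
    # Step 2: convert the natural-language question into the DB's query language
    sql = llm(f"Write a SQLite query for: {question}\nSchema: orders(id, customer, total, created_at)")
    # Step 3: execute the query against the internal DB
    rows = conn.execute(sql).fetchall()
    # Step 4: translate the raw query result back into natural language
    return llm(f"Question: {question}\nQuery result: {rows}\nAnswer in one sentence:")
```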
What are some methods of prompt optimization?
1. Prompt the model to explain step by step how it arrived at an answer (however, this can increase both latency and cost due to the larger number of output tokens).
2. Generate multiple outputs for the same input and pick the final output by majority vote, or by asking the LLM to pick (see the sketch below).
3. Break one big prompt into smaller, simpler prompts.
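A minimal sketch of option 2 (majority vote over multiple samples), assuming a hypothetical `llm(prompt, temperature)` helper:

```python
from collections import Counter

def self_consistency(llm, prompt, n=5):
    """Generate several outputs for the same input and keep the majority answer."""
    outputs = [llm(prompt, temperature=0.7) for _ in range(n)]  # llm() is a hypothetical helper
    return Counter(outputs).most_common(1)[0][0]                # majority vote
```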
Why use embeddings store?
1. When the data is sufficiently large: computing distances to every embedding for every query becomes too slow or expensive (roughly 10K+ embeddings).
2. When development velocity is important: LLM apps need to support many users across many indexes, and the store handles the data and scaling automatically.
What are the 3 main factors when considering prompting vs finetuning?
1. data availability 2. performance 3. cost
What is prompt tuning?
A cool idea that sits between prompting and finetuning is prompt tuning: you start with a prompt, but instead of changing the prompt text, you programmatically change the embedding of that prompt. For prompt tuning to work, you need to be able to input the prompt's embeddings into your LLM and generate tokens from those embeddings, which currently can only be done with open-source LLMs and not via the OpenAI API. On T5, prompt tuning appears to perform much better than prompt engineering and can catch up with model tuning.
What does cosine similarity measure?
A measure of similarity between 2 non-zero vectors defined in an inner product space: the cosine of the angle between them. It's a judgment of orientation, not magnitude; only the angle between the two vectors matters.
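A small worked example with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||): orientation only, magnitude cancels out
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 1]))  # ~0.707, i.e. cos(45 degrees)
```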
embeddings store?
A software package that abstracts away a lot of operations. Take the set of docs that represents your knowledge base and use the store to embed them with its embedding function. When a query comes in, the same embedding function is called to generate a query embedding, and the embedding store performs the nearest-neighbor search for you, returning the relevant docs for the LLM context window. This simplifies a lot of operations you'd otherwise have to implement yourself.
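A toy in-memory version, just to illustrate what such a store abstracts; real stores add persistence, indexing, and ANN search:

```python
import numpy as np

class EmbeddingStore:
    """Toy in-memory store; embed_fn is any text -> vector function."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.docs, self.vectors = [], []

    def add(self, docs):
        self.docs += docs
        self.vectors += [np.asarray(self.embed_fn(d), dtype=float) for d in docs]

    def query(self, text, k=3):
        q = np.asarray(self.embed_fn(text), dtype=float)      # same embedding function for the query
        sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]                      # nearest-neighbor search
        return [self.docs[i] for i in top]                    # docs to place in the LLM context window
```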
Self-attention
Allows a neural network to understand a word in the context of the words surrounding it.
top-p sampling
An alternative to sampling with temperature, where the model considers only the tokens comprising the top p probability mass. For example, p = 0.1 means only the tokens making up the top 10% of probability mass are considered; p = 0.9 means the model picks the next token from the smallest set of tokens whose cumulative probability reaches 90%.
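A sketch of nucleus (top-p) sampling over a probability vector:

```python
import numpy as np

def top_p_sample(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability mass >= p, then sample."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                        # most to least likely
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]                               # the "top-p" nucleus
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(np.random.choice(nucleus, p=nucleus_probs))
```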
What are the 2 high level components of an ML lifecycle?
An ML lifecycle can be broken up into two main, distinct parts. The first is the training phase, in which an ML model is created or "trained" by running a specified subset of data through the model. ML inference is the second phase, in which the model is put into action on live data to produce actionable output. The data processing by the ML model is often referred to as "scoring," so one can say that the ML model scores the data, and the output is a score.
Attention
The attention mechanism is a neural network structure that allows a text model to look at every single word in the original sentence when deciding how to translate a word in the output sentence. It does so by learning from many examples: from seeing many sentence pairs, it learns the rules needed to make accurate predictions.
How does BERT work?
BERT uses the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes 2 separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. As opposed to directional models, which read the text input sequentially (left to right, or vice versa), the Transformer encoder reads the entire sequence of words at once. It's therefore considered bidirectional, though more accurately it's non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (to the left and right of the word).
BM25
Bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. Improved "normalized" version of TF-IDF.
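A sketch of one common BM25 parameterization (k1 = 1.5 and b = 0.75 are typical defaults), with documents represented as lists of tokens:

```python
import math

def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 with typical default parameters; docs are lists of tokens."""
    N, avgdl = len(corpus), sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)            # documents containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)    # BM25's smoothed IDF
        tf = doc.count(term)                                 # raw term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```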
What does a bi-encoder do?
Bi-encoders produce a sentence embedding for a given sentence. Each sentence is encoded separately and mapped to a common embedding space, where the distances between them can be measured. We pass 2 sentences A and B independently through BERT models, resulting in sentence embeddings u and v, which are then compared using cosine similarity.
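A sketch using the sentence-transformers package (the model name here is just one common choice):

```python
# Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
u = model.encode("A man is eating food.")      # sentence A -> embedding u
v = model.encode("A man is eating a meal.")    # sentence B -> embedding v
print(util.cos_sim(u, v))                      # compare in the shared embedding space
```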
What is a BERT model?
Bidirectional Encoder Representations from Transformers. Its key technological innovation is applying the bidirectional training of Transformer, a popular attention model, to language modeling. Paper showed that an LM that is bidirectionally trained can have a deeper sense of language context and flow than single-direction LMs.
TF-IDF and BM25 are examples of what?
They are classified as sparse vector methods.
Retry behavior for handling API rate limits
Completion with backoff = when we hit a rate limit, the function waits (increasing the delay after each failure) and then retries the request.
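A minimal retry-with-backoff sketch; `create_fn` stands in for whatever API call you're wrapping:

```python
import random
import time

def completion_with_backoff(create_fn, max_retries=6, base_delay=1.0, **kwargs):
    """create_fn is your API call; errors are caught generically here (ideally catch the API's RateLimitError)."""
    for attempt in range(max_retries):
        try:
            return create_fn(**kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.random()  # exponential backoff plus jitter
            time.sleep(delay)                                    # wait, then try the request again
```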
embedding model
Converts documents into numeric representations and stores them in vector DB
Sentence BERT is an example of what?
A dense vector method (it captures the semantics behind the words).
Where is the main cost of embedding models?
For real-time use cases = loading the embeddings into a vector DB for low-latency retrieval.
beam search
Generates multiple candidate sequences to maximize the probability of a sequence of tokens. However, both greedy decoding and beam search have limitations (for example, they tend to produce repetitive, less diverse text).
greedy decoding
Always going with the next token that has the highest probability. Roughly what happens when temperature is set to zero.
Sentence BERT
Built on a BERT transformer. The query is processed by the BERT encoder layers, and documents are processed through the same encoder network, producing dense vectors for each. Cosine similarity between the query and document vectors then measures how similar they are.
Data availability
If you have only a few examples, prompting is quick and easy to get started. There's a limit to how many examples you can include in your prompt due to the maximum input token length.
Why is productionizing prompt engineering challenging?
In prompt engineering, instructions are written in natural language, which is much more flexible than a programming language. This makes for a great user experience but can lead to a bad developer experience.
Positional encoding
Instead of looking at words sequentially, you take each word in the sentence and attach a "number" to it before feeding it to the model, depending on where the word falls in the sentence. You store information about word order in the data itself rather than in the structure of the network, and the neural network learns the importance of word order from the data.
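A sketch of the sinusoidal positional-encoding scheme from the original Transformer paper (learned positional embeddings are another option):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; the result is added to the token embeddings."""
    pos = np.arange(seq_len)[:, None]                          # position of each token
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # odd dimensions: cosine
    return pe
```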
How does output length affect latency?
It significantly affects latency, likely due to output tokens being generated sequentially.
model inference
Machine learning (ML) inference is the process of running live data points into a machine learning algorithm (or "ML model") to calculate an output such as a single numerical score. This process is also referred to as "operationalizing an ML model" or "putting an ML model into production."
Downsides of cross-encoder?
Needs to compute a new encoding for every pair of input sentences, resulting in high computational overhead. Therefore, it's impractical for tasks like information retrieval and clustering, which involve massive pairwise sentence comparisons.
Approximate nearest neighbor
Normally, exact nearest-neighbor search requires you to go through all embeddings and compute the distance to each one: an O(N) operation, where N is the number of embeddings. ANN algorithms exploit the structure of the data and take only about O(log N) operations, i.e., sub-linear. They trade recall (the results might not be the true nearest neighbors) for speed.
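A sketch using the faiss package with an HNSW index, one popular graph-based ANN structure:

```python
# Assumes the faiss package is installed.
import faiss
import numpy as np

d = 128
embeddings = np.random.rand(100_000, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)           # 32 = neighbors per graph node
index.add(embeddings)                        # build the structure once

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)      # sub-linear, approximate top-5 lookup
```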
Using LLMs to generate embeddings?
One direction that I find very promising is to use LLMs to generate embeddings and then build your ML applications on top of these embeddings, e.g. for search and recsys. As of April 2023, the cost for embeddings using the smaller model text-embedding-ada-002 is $0.0004/1k tokens. If each item averages 250 tokens (187 words), this pricing means $1 for every 10k items or $100 for 1 million items.
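The arithmetic behind those figures:

```python
price_per_1k_tokens = 0.0004      # text-embedding-ada-002, April 2023 pricing
tokens_per_item = 250

cost_per_item = tokens_per_item / 1000 * price_per_1k_tokens
print(cost_per_item * 10_000)     # about $1 for 10k items
print(cost_per_item * 1_000_000)  # about $100 for 1M items
```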
What does a cross-encoder do?
Pass 2 sentences A and B simultaneously into a Transformer network. This produces an output value between 0 and 1 indicating the similarity of the input sentence pair, enabling the computation of an accurate classification/relevance score.
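A sketch using the CrossEncoder class from sentence-transformers (the checkpoint name is just one common choice):

```python
# Assumes the sentence-transformers package is installed.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-base")
scores = model.predict([
    ("A man is eating food.", "A man is eating a meal."),
    ("A man is eating food.", "The girl is playing guitar."),
])
print(scores)   # one similarity/relevance score per sentence pair
```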
Few-shot learning
Present a set of high-quality demonstrations, consisting of both input and desired output, for the target task. The model better understands human intention and the criteria for success. However, it costs more because of token consumption and can hit the context length limit when input/output text is long.
Difference between prompting and finetuning
Prompting = for each sample, tell your model explicitly how it should respond. Finetuning = train model on how to respond, so you don't have to specify that in your prompt.
What was used before a transformer and why was it limiting?
Recurrent neural networks. They process words sequentially, one at a time. They're not good at handling long sequences of text (by the end of a long sequence they forget what happened at the beginning). They're also hard to train and don't parallelize well.
Zero-shot learning
Simply feeding the task text to the model and asking for results
Pros/cons of biencoders
Since encoded sentences can be cached and reused, bi-encoding is more efficient. Outputs of bi-encoder can be used off-the-shelf as sentence embeddings for downstream tasks. However, in supervised learning, bi-encoders underperform cross-encoders, since they don't explicitly model interactions between sentences.
TF-IDF
Term frequency - inverse document frequency. An algorithm that uses the frequency of words to determine how relevant those words are to a given document. TF * IDF
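A toy worked example, assuming already-tokenized documents:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran", "fast"]]
term, doc = "cat", docs[2]

tf = doc.count(term) / len(doc)                                # 1/4
idf = math.log(len(docs) / sum(1 for d in docs if term in d))  # log(3/2)
print(tf * idf)   # ~0.10; a word like "the" appears in every doc, so its IDF (and TF-IDF) is 0
```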
What are benefits to finetuning your model?
The benefit of finetuning is twofold:
1. Better model performance: you can use more examples, and those examples become part of the model's internal knowledge.
2. Lower cost of prediction: the more instruction you can bake into the model, the less instruction you have to put into each prompt. Say you can cut 1k tokens from your prompt for each prediction; for 1M predictions on gpt-3.5-turbo, you'd save $2,000.
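The arithmetic behind the $2,000 figure (the per-1k-token price is the one implied by that figure):

```python
price_per_1k_tokens = 0.002       # gpt-3.5-turbo pricing implied by the $2,000 figure
tokens_saved_per_call = 1_000
num_predictions = 1_000_000

savings = tokens_saved_per_call / 1_000 * price_per_1k_tokens * num_predictions
print(savings)                    # 2000.0 dollars saved
```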
General rule of thumb in terms of cost and latency
The more explicit detail and examples you put into your prompt, the better the model performance, and the more expensive your inference will be.
How many examples do you need to finetune a model to a task?
The number of examples you need to finetune a model to your task depends, of course, on the task and the model. In my experience, you can expect a noticeable change in model performance if you finetune on hundreds of examples, though the result might not be much better than prompting; a prompt is worth roughly 100 examples. The general trend is that as you increase the number of examples, finetuning gives you better model performance than prompting. There's no limit to how many examples you can use to finetune a model.
What are the types of flexibility involved with prompt engineering?
There are 2 types of flexibility at work: how users define instructions, and how LLMs respond to those instructions. The ambiguity in LLM-generated responses can lead to 2 issues:
1. Ambiguous output format. Downstream apps built on LLMs expect outputs in a certain format so they can be parsed. Even if we craft prompts to be explicit about the output format, there's no guarantee that outputs will ALWAYS follow it.
2. Inconsistent user experience. LLMs are stochastic, meaning there's no guarantee an LLM will give you the same output for the same input every time (unless you set temperature to 0).
How are input tokens processed?
They are processed in parallel, meaning that input length shouldn't affect latency that much.
Finetuning with distillation
This technique of training a small model to imitate the behavior of a larger model is called distillation. For example, a small model finetuned on outputs generated by text-davinci-003 behaves similarly to it while being a lot smaller and cheaper to run.
To deploy a machine learning inference environment, you need what three main components?
To deploy a machine learning inference environment, you need three main components in addition to the model:
1. One or more data sources
2. A system to host the ML model
3. One or more data destinations
The data sources are typically a system that captures live data from the mechanism that generates it. The host system accepts data from the data sources and feeds it into the ML model. The data destinations are where the host system delivers the output score from the model.
What is a transformer and what are the 3 characteristics that make it so powerful?
Type of neural network architecture. You can parallelize transformers (unlike RNNs). 1. Positional encoding 2. Attention 3. Self-Attention
How to use the LLM to generate sample queries and answer pairs based on documentation?
Prompt the LLM with the existing documentation and the types of information you want extracted from it; the LLM then comes up with a list of questions a user might ask about the docs, along with corresponding answers.
When are bi-encoders used?
Used whenever you need a sentence embedding in a vector space for efficient comparison. Typical use cases are information retrieval, semantic search, and clustering.
When are cross-encoders used?
When you have a pre-defined set of sentence pairs you want to score. For instance, you have 100 sentence pairs and you want to get similarity scores for all 100 pairs.
What is Euclidean distance?
While cosine similarity looks at the angle between vectors (and thus ignores their magnitude), Euclidean distance is like using a ruler to actually measure the distance; it does take magnitude into account. It's the actual distance, the length of the line, between the two vector tips.
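A small example contrasting the two measures:

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([3.0, 3.0])                       # same direction as a, larger magnitude

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)                                     # 1.0 -> identical orientation
print(np.linalg.norm(a - b))                   # ~2.83 -> Euclidean distance still sees the magnitude gap
```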
Benefit of using LLMs to generate embeddings?
While this still costs more than some existing open-source models, this is still very affordable, given that: 1. You usually only have to generate the embedding for each item once. 2. With OpenAI API, it's easy to generate embeddings for queries and new items in real-time.
sentence embedding
a numerical representation of a sentence in the form of a vector of real numbers which encodes meaningful semantic information.
1 sentence summary of GPT4
A transformer-based model pre-trained to predict the next token in a document.
IDF
Inverse document frequency: log(# documents / # documents that contain a specific word). It lets us discount common words like "is" and "the" and focus on rarer words, because those are more likely to be relevant to our query.
TF
Term frequency. Within a given document (a sentence or paragraph), it measures how frequently a query term appears relative to the length of that document.
What is pooling?
The process of converting a sequence of token embeddings into a single sentence embedding, i.e., compressing the granular token-level representations into one fixed-length representation that reflects the meaning of the entire sequence.
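A minimal mean-pooling sketch (mean pooling is one common choice; CLS-token or max pooling are others):

```python
import numpy as np

token_embeddings = np.random.rand(12, 384)     # one 384-dim vector per token in a sentence
attention_mask = np.ones(12)                   # 1 for real tokens, 0 for padding

# Mean pooling: average the (non-padding) token vectors into one fixed-length sentence embedding
sentence_embedding = (token_embeddings * attention_mask[:, None]).sum(0) / attention_mask.sum()
print(sentence_embedding.shape)                # (384,)
```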