Generative AI 1: "Only" Basics - Basic Definitions, Models, Parts, Architecture, Activation Functions [need to move Optimization Algos to own sets in Quizlet +- I made MoMs their own set in Quizlet]


What is skip connection architecture? When do you use it?

Use it when building deep neural networks: a skip (residual) connection adds a layer's input directly onto the output of a later layer, giving gradients a shortcut path and mitigating the vanishing-gradient problem (as in ResNet).
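EXAMPLE A minimal Keras sketch (the sizes and layers are illustrative assumptions, not from the card):

import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(64)(x)
# Skip connection: add the block's input back onto its output,
# giving gradients a direct path around the layers in between
outputs = tf.keras.layers.Activation("relu")(tf.keras.layers.Add()([inputs, x]))
model = tf.keras.Model(inputs, outputs)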

#models #evaluation what is an ROC curve and when is it used? Give an example in sentiment analysis enterprise Generative AI

DEFINITION An ROC (Receiver Operating Characteristic) curve is a classification-evaluation plot: the true positive rate against the false positive rate across all decision thresholds, used to judge a binary classifier independent of any single threshold (often summarized as the area under the curve, AUC). EXAMPLE For an enterprise sentiment-analysis model that flags negative customer messages, the ROC curve shows how well the model separates negative from non-negative messages as you vary the flagging threshold.

TRAINING What is a batch size? What is the deal with batch sizes, how do you set them?

DEFINITION The batch size is the number of training examples processed before the model's weights are updated. HOW TO SET the learning rate and batch size are closely linked: small batch sizes perform best with smaller learning rates, while large batch sizes do best with larger learning rates. EXAMPLE (see the sketch below)
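EXAMPLE A minimal sketch of pairing the two (TensorFlow assumed; the values and x_train/y_train are illustrative):

import tensorflow as tf

batch_size = 32
learning_rate = 1e-4  # with batch_size = 256 you would typically try a larger rate, e.g. 1e-3

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
# model.fit(x_train, y_train, batch_size=batch_size, epochs=5)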

What type of model is the SWITCH transformer? What is the reason the SWITCH transformer model is less costly to operate? And why are these types better overall? What are other examples of this type?

It is a mixture-of-experts (MoE) model - a type of conditional computation where parts of the network are activated on a per-example basis, which dramatically increases model capacity without a proportional increase in computation. HOW IT WORKS a subset of experts is selected on a per-token or per-example basis, thus creating sparsity in the network WHY BETTER such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting DRAWBACKS a poor expert routing strategy can cause certain experts to be under-trained, leaving an expert under- or over-specialized OPTIMIZATIONS DONE this sparsity in the network enabled them to be trained, for the first time, with lower-precision (bfloat16) formats; the authors designed models based on T5-Base and T5-Large (Raffel et al. 2019) RESULTS OF OPTIMIZATION up to 7x increases in pre-training speed with the same computational resources OTHER EXAMPLES GLaM and V-MoE

What is transfer learning? Why is this important to building Enterprise GenAI?

It is taking a pre-trained model and adapting it to a different but related task, typically by reusing its learned representations and fine-tuning on a smaller task-specific dataset. It matters for Enterprise GenAI because you rarely train from scratch: starting from a pre-trained model saves compute, data, and time. EXAMPLE (see the sketch below)
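EXAMPLE A minimal transfer-learning sketch in Keras (the base model choice and the 5-class head are assumptions for illustration):

import tensorflow as tf

# Load a model pre-trained on ImageNet, without its classification head
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained weights

# Attach a new head and train only that head on your task's data
model = tf.keras.Sequential([base, tf.keras.layers.Dense(5, activation="softmax")])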

#scaling What does it mean to scale a model?

See image. Scaling a model means growing it along one or more dimensions: Width (more units per layer), Depth (more layers), Resolution (larger input size, esp. for images).

What is Teacher Forcing? What would happen if you failed to do it? How do you do it? What happens when you need "ground truth" and how do you get it?

Teacher Forcing Ordinarily, during Inference when the model is fully trained, the output sequence from this timestep is used as the input sequence for the next timestep. This allows the model to predict the next word based on the previous words predicted so far. However, if we did that while the model was still learning during Training, any errors made by the model in predicting an output word in this timestep would be carried forward to the next timestep. The model would end up predicting the next word based on an erroneous previous word. Instead, since we have the ground truth captions available to us during training, we use a technique called Teacher Forcing. The correct expected word from the target caption is added to the input sequence for the next timestep rather than the model's predicted word. In this way, we are helping the model by giving it a hint, so to speak, just like a teacher would.

1. Query 2. Key 3. Value What are these? What else are they called? How do these work?

They all get different weights: each is a learned projection of the input embedding with its own weight matrix. (Note they are NOT the token/segment/position embeddings - those are BERT's input embeddings, a separate part of the pipeline.) Intuitively, we can think 1. the Query represents what kind of information we are looking for, 2. the Key represents each token's relevance to the query, and 3. the Value represents the actual contents of the input, so the "original" input, weighted by the "Attention Score" the model makes out of 1. and 2. ATTENTION SCORE - how to make - for each word, the model takes the dot product of its Query with every Key to get the similarity or "attention" weights among the different tokens [usually with matrix multiplication; in NLP, based on the length of the "sequence" (aka the sentence), so the score matrix has as many columns as words in the sentence, same with rows] - then those weights are applied to 3. the Values to give the final attention output.
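EXAMPLE A minimal NumPy sketch of scaled dot-product attention (random toy values; a real model learns the three weight matrices):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = np.random.randn(3, 4)                 # 3 tokens, embedding size 4
Wq, Wk, Wv = (np.random.randn(4, 4) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each token gets its own query, key, value
scores = Q @ K.T / np.sqrt(K.shape[-1])   # 3x3 matrix: every query against every key
weights = softmax(scores, axis=-1)        # one row of attention weights per token
attended = weights @ V                    # weighted sum of values = attention output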

Weights are "what OpenAI needs to protect from theft" - weights, what are they? why are they important? What is their relationship to bais?

____ is the parameter within a neural network that transforms input data within the network's hidden layers. They became very important when the attention mechanism was added in 2018... DEFINITION ____ are the real values attached to each input/feature, and they convey the importance of that corresponding feature in predicting the final output. Features with ____ close to zero are said to have less importance in the prediction than features with larger ____ values. HOW THEY WORK ____ tell the importance of a feature in predicting the target value; ____ describe the relationship between a feature and a target value. WHAT IS THEIR RELATIONSHIP TO BIAS? ____ and bias are both learnable parameters inside the network GIVEN: a trainable neural network will randomize both the ____ and bias values before learning initially begins GIVEN: as training continues, both parameters are adjusted toward the desired values and the correct output. DIFFERENCE TO BIAS: the two parameters differ in the extent of their influence upon the input data. - Bias represents how far off the predictions are from their intended value; biases make up the difference between the function's output and its intended output (a low bias suggests that the network is making more assumptions about the form of the output, whereas a high bias value makes fewer assumptions) - ____, on the other hand, can be thought of as the strength of the connection; ____ affect the amount of influence a change in the input will have upon the output: with a low ____ value a change in the input barely changes the output, and a larger ____ value changes the output more significantly.

What does "playground" mean in AWS sagemaker?

It means "here you can enter prompts for models"

What is RAG?

Retrieval Augmented Generation, also referred to as "grounding the model," requires separating the knowledge base from the tool, then pulling on the knowledge base "in real time" to get answers (but do not use the knowledge base to train the tool)

#TENSORFLOW What are checkpoints? Why do you need them? How do you use them in TF?

SPECIFIC TO TF: GIVEN The persistent state of a TensorFlow model is stored in tf.Variable objects. WHY: The easiest way to manage variables is by attaching them to Python objects, then referencing those objects DEFINITION ___ is an intermediate dump of a model's entire internal state (its weights, current learning rate, etc.) so that the framework can resume the training from this point whenever desired WHY TO HAVE ___ if you get interrupted while training a model, ___ enable you to pick up with training exactly where you left off FYI: IMPORTANT - Since ___s do not contain any description of the computation defined by the model, you only find these useful when the source code (that will use these saved parameter values) is available...
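EXAMPLE A minimal checkpointing sketch in TF 2.x (the model, optimizer, and directory are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "./ckpts", max_to_keep=3)

ckpt.restore(manager.latest_checkpoint)  # a no-op on the very first run
# ... then inside the training loop, periodically:
manager.save()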

What are embedding layers? how do you use them in Generative AI? Why do you care about this in enterprise AI?

____ is a type of hidden layer in a neural network Theoretically, every hidden layer can represent an ___; we can extract an output of any hidden layers and treat it as an embedding vector DEFINITION - maps input information from a high-dimensional to a lower-dimensional space - $$$ Why You Care: it allows the network to learn more about the relationship between inputs and to process the data more efficiently IMPORTANT The point of an ____ is not only to lower the input dimension but also to create a meaningful relationship between them; that is why particular types of neural networks are used only to generate embeddings SPECIFIC _____s Transformer-based models create contextual embeddings; it means that the same word will most likely get a different embedding vector if it appears in a different context
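EXAMPLE A minimal Keras sketch (vocabulary size and dimensions are illustrative):

import tensorflow as tf

# Map a vocabulary of 10,000 token ids down to 64-dimensional vectors
embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=64)

token_ids = tf.constant([[7, 42, 3]])  # a toy batch of one 3-token sequence
vectors = embedding(token_ids)         # shape (1, 3, 64): one dense vector per token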

What is a text embedding model? how do you make one? why use it for an enterprise gen AI tool?

These models map text (words, sentences, documents) to dense numeric vectors that capture semantic meaning. How to make: you will need to precompute embeddings and store them in a managed vector database. Why use for Enterprise GenAI: you can use text embedding as part of your retrieval (RAG) pipeline - semantic search, clustering, and grounding the LLM on your own knowledge base.

What does MRKL stand for, and why is this model efficient AND effective? [AI21 Labs' "MRKL"]

"Modular Reasoning, Knowledge and Language" Why Efficient: - Combines AI modules in a pragmatic plug-and-play fashion - switching back and forth between structured knowledge, symbolic methods and neural models - splitting the workload of understanding the task, executing the computation and formulating the output result between different models

Name three ways to implement sparse MoE models, and define them? COME BACK TO THIS ONE

- k-means clustering: route each token to the expert whose centroid is nearest in embedding space - linear assignment to maximize token-expert affinities: solve an assignment problem that pairs tokens with experts so total affinity is maximized while expert loads stay balanced - hashing: route each token to an expert via a fixed hash of the token, so no learned router is needed

What are "Faithful Reasoning Frameworks" (for Question Answering) and when is it used in generative AI?

- the user first provides one or more examples of the reasoning process as part of the prompt - the LLM "imitates" this reasoning process with new inputs

What is "in-context learning" and why is it important to developing models for enterprise generative AI?

DEFINITION In-context learning is when a model picks up a new task purely from examples supplied in the prompt, without any weight updates. ONE EXPLANATION Massive neural network models -- similar to large language models -- are capable of containing smaller linear models inside their hidden layers, which the large models could train to complete a new task using simple learning algorithms

What does "temperature" mean with respect to parameters?

A low ___ set via your parameters produces repetitive, deterministic responses; increasing the ___ will result in more unexpected or creative responses to queries of the LLM
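EXAMPLE A minimal sketch of what temperature does to the sampling distribution (toy logits; real LLM APIs expose this as a temperature parameter):

import numpy as np

def sample_probs(logits, temperature):
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(sample_probs(logits, 0.2))  # low temperature: probability piles onto the top token
print(sample_probs(logits, 1.5))  # high temperature: flatter, more "creative" sampling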

#MLOps #MLSystems. Continual learning means what?

Data is always streamed, not batch-uploaded, to the models to generate features

How many layers to add into a neural network?

As Yoshua Bengio, Head of Montreal Institute for Learning Algorithms remarks: "Very simple. Just keep adding layers until the test error does not improve anymore." A method recommended by Geoff Hinton is to add layers until you start to overfit your training set. Then you add dropout or another regularization method.

#optimization #training What is the way a dimensionality reduction algo can help you?

Before you feed your data into another machine learning algorithm, you can use one to reduce the number of features (dimensions), which can speed up training and improve performance overall

What is BERT?

Bidirectional Encoder Representations from Transformers

#algo What is the Viterbi Algorithm? Of which of the "most used principles of programming" does this algo use? How does it use it? When do you use it?

DEFINITION a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events MOST USED PRINCIPLE Is an example of dynamic programming HOW uses this principle when it finds ways to compute the most likely sequence of states in an efficient manner WHEN TO USE IN GEN AI Used for finding the most likely sequence of states in Hidden Markov Models (HMMs) for tasks like: - speech recognition METHOD: By finding the most likely string of text given the acoustic signal - part-of-speech tagging (same method) EXAMPLE Take a simple HMM for part-of-speech tagging, and we want to find the most likely sequence of tags for the sentence "He fishes". Our states might be "Noun" and "Verb", and our observations might be words like "He" and "fishes".
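EXAMPLE A minimal Viterbi sketch for the "He fishes" example above (all probabilities are made up for illustration):

import numpy as np

states = ["Noun", "Verb"]
obs = ["He", "fishes"]
start = np.array([0.6, 0.4])              # P(first tag)
trans = np.array([[0.3, 0.7],             # P(next tag | current tag)
                  [0.8, 0.2]])
emit = {"He":     np.array([0.7, 0.1]),   # P(word | tag)
        "fishes": np.array([0.2, 0.6])}

# Dynamic programming: dp[s] = best log-prob of any tag path ending in state s
dp = np.log(start) + np.log(emit[obs[0]])
back = []
for word in obs[1:]:
    scores = dp[:, None] + np.log(trans) + np.log(emit[word])[None, :]
    back.append(scores.argmax(axis=0))
    dp = scores.max(axis=0)

# Walk the backpointers to recover the most likely tag sequence
path = [int(dp.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
print([states[s] for s in path])  # -> ['Noun', 'Verb']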

#Model #architecture What is the best choice for an enterprise generative AI model that is tasked with classifying multi-modal sentiment analysis into positive, neutral, and negative?

FROM THE TOP various architectures are available, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers. Each architecture has its strengths and limitations, so choose the one that best suits your objective and dataset; for multi-modal sentiment classification into positive/neutral/negative, a Transformer-based model is the usual choice, since attention handles text, image, and audio sequences well.

Why beat up your model, early? How do you do this?

If you jump into the role of a mean, adversarial user and stress-test your models early to explore their weak points, you can fix them before too much skin has been put in the game HOW TO DO red-team the model: probe it with hostile, ambiguous, and edge-case prompts, log the failures, and patch them (via guardrails, fine-tuning, or prompt changes) before launch

Attention layer, what is it good for when you are trying to generate text captions for images? What "is" the attention layer - what does it actually calculate from the encoder's outputs?

In image captioning, the attention layer helps the algorithm focus on the most relevant part of the image when generating each word of the output sequence. What is It? The attention layer is a weighted sum of encoder outputs. More Here: https://towardsdatascience.com/image-captions-with-attention-in-tensorflow-step-by-step-927dad3569fa

What type of model is the Switch Transformer?

It is one of the first trillion-parameter models, and here is why it is special: it uses AI sparsity - a mixture-of-experts (MoE) architecture - and other advances to drive performance gains in language processing, and up to 7x increases in pre-training speed.

#sentimentAnalysis What are the three types of sentiment analysis algorithms?

Knowledge-based Statistical Hybrid

What is the use case for Symbolic AI for government clients?

Legal definitions and codes of conduct for governments ARE set in stone. Setting this enterprise knowledge in stone as symbolic rules is an efficient approach to increase precision, because it allows you to control the behavior of the LLM where it is crucial for your client, while still unfolding its power at generating language based on wide external knowledge

Why give away LLMs via open source? What is the reason?

Machine learning's environmental impact is reduced when large pre-trained language models are shared: it reduces the overall compute cost and carbon footprint of community-driven efforts

Can LLMs make images?

Only if they are married up to image encoders and decoders per the nerds at Carnegie Mellon University

Parameters and Hyperparameters, what are they? COME BACK AND FINISH THIS ONE

Parameters (learned by the model during training): Weights, Bias Hyperparameters (set by you before/around training): Initialization technique (random, etc.), Learning rate, Dropout, Data augmentation (for images, esp.), Batch normalization, Weight decay, Autotuning (i.e.: GridSearchCV), Number of layers (like 6 encoders) [FYI - you usually cannot just add another in an RNN], Types of layers, Nodes in a layer, Optimization algorithm (you can use a different one)

What is 1. prompt design and 2. engineering? How about 3. prompt tuning? Why do you care about this in Enterprise generative AI v. Fine-tuning?

Real simple: 1. ___ it is how you query an LLM (or Mix of Minds, MoMs) 2. ____ prompts that were designed by hand by human engineers, also called "hard" prompts 3. ____ ____ - first, no change of weights or retraining of model so is quick - second, using "soft prompts" generated by a small set of learnable parameters Examples: - the prompts can be extra words introduced by humans - or the prompts can be AI-generated numbers that guide the model towards a desired decision or prediction. WHY THIS MATTERS SAVE $$$ - Panda and his colleagues in this paper show that their Multi-Task Prompt Tuning (MPT) method outperformed other methods, and even did better than models fine-tuned on task-specific data; instead of spending thousands of dollars to retrain a 2-billion parameter model for specialized task, MPT lets you customize the model for less than $100 V. FINE TUNING? BURNS $$$ - Fine tuning is adapting a pre-trained AI model to perform better on specific tasks, domains, or applications by training it on a smaller, specialized dataset (which reflects the nuances of the user's target domain or task) allowing the AI model to learn the patterns, terminology, and context unique to that specific use case; Fine tuning requires more computational resources and time than prompt tuning, as it involves retraining the AI model and adjusting its parameters...

#training what are MCMC-free approaches and why use them?

Score Matching (SM) Noise Contrastive Estimation (NCE) WHY USE THEM They let you train energy-based models without the expensive MCMC sampling that maximum-likelihood training normally requires

What are the ways you can think about "stepping" into creating a model for you based on an open-source one?

See image

What is seed for the foundational model Diffusion?

See image. The seed initializes the random noise from which the diffusion model starts denoising, so the same seed with the same prompt and settings reproduces the same output.

What is a densely connected neural network?

See image - every neuron in the input layer is connected to every neuron in the hidden layer

What is a directed graphical model? Why do you care about this?

See attached A directed graphical model (DGM) is a probabilistic model that uses a graph to represent the conditional dependence structure between random variables. In a DGM, each node in the graph represents a variable, and each edge represents a direct probabilistic interaction between two variables. The probability of a set of random variables in a DGM factors into a product of conditional probabilities, one for each node in the graph.

Model of Training Effectiveness: How do you know when you overfit your training set? COME BACK TO THIS ONE

Watch the training vs. validation curves: you are overfitting when training error keeps falling while validation/test error stops improving or starts rising (a widening gap between the two).

TENSORFLOW The phrase "Saving a TensorFlow model" typically means one of two things. What are they?

This means either Checkpoints OR SavedModel. Checkpoints capture only the parameter values and need the original source code to be useful; SavedModel additionally serializes the computation, so the model can be served or reused without the original code.

#multimodal #features What are multimodal features? What are the two categories/kinds?

Using more than one type or "mode" of data to generate features, i.e.: Video + LiDAR+ depth data creates the dataset for self-driving car applications. Two Kinds: 1. Joint Representation 2. Coordinated Representation

#vectors #vectordatabase How does a vector database work?

Vector databases usually use the Approximate Nearest Neighbor (ANN) algorithm to calculate the spatial distance between the query vector and vectors stored in the database. The closer the two vectors are located, the more relevant they are. Then the algorithm finds the top k nearest neighbors and delivers them to the user. https://zilliz.com/blog?tag=39&page=1&utm_source=thenewstack&utm_medium=website&utm_content=inline-mention&utm_campaign=platform

MOVE THIS ONE #MLOps #MLSystems. Real-time monitoring means what?

We do not need batch monitoring solutions, but real-time monitoring: use something like Kafka or Kinesis to securely transport consumers' click streams from the applications, then use a stream processing engine to continually compute the accuracy of predictions, so that as soon as the model is deployed and traffic starts coming in, you can see how the model is performing

Can you make a private ChatGPT?

Yes, here https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec

What is "cosine similarity" and why is it important for enterprise generative AI?

__ __ measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. Why important to generative AI: It is often used to measure document similarity in text analysis, which can speed up computational time when the LLM uses a separate knowledge base
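EXAMPLE A minimal NumPy sketch (toy vectors standing in for document/query embeddings):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_vec = np.array([0.1, 0.9, 0.2])
query_vec = np.array([0.2, 0.8, 0.1])
print(cosine_similarity(doc_vec, query_vec))  # near 1.0 -> pointing the same way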

What is Masking in a TF model? When do you use it?

___ enables you to disregard certain parts of a tensor when executing the forward pass of your neural network - typically those parts of a tensor set to zero
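EXAMPLE A minimal sketch (the toy batch and mask_value=0.0 are illustrative):

import tensorflow as tf

# Padded batch: zeros mark "no data" timesteps
x = tf.constant([[[1.0], [2.0], [0.0]],
                 [[3.0], [0.0], [0.0]]])
masked = tf.keras.layers.Masking(mask_value=0.0)(x)
# Mask-aware downstream layers (e.g. an LSTM) will skip the zeroed timesteps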

What is a MoE model and for what type of use? What are the differences between sparse and dense MoE models? Which ones are better as of 2023?

___ stands for Mixture of Experts and was used for neural networks but can be used for any ML/AI model MoE operates by adopting a number of experts, each as a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s). Dense: all experts activated every step Sparse: a subset of experts activated when routing each token WHY SPARSE: 1. No loss of accuracy 2. Reduced computational cost compared to a dense model 2023 WINNER: SPARSE

#embedding Zero Shot Embedding, what is it and why do you want to use it for Enterprise generative AI?

You can use embeddings for zero-shot classification without any labeled training data. For each class, we embed the class name or a short description of the class. To classify some new text in a zero-shot manner, we compare its embedding to all class embeddings and predict the class with the highest similarity. (See the sketch below.)
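EXAMPLE A minimal sketch; embed() is a hypothetical stand-in for whatever embedding model/API you use:

import zlib
import numpy as np

def embed(text):
    # Hypothetical: deterministic fake vectors so the sketch runs;
    # replace with a real embedding model/API call
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=384)

classes = ["positive", "negative", "neutral"]
class_vecs = [embed(f"This review is {c}") for c in classes]

text_vec = embed("Absolutely loved the product!")
sims = [np.dot(text_vec, v) / (np.linalg.norm(text_vec) * np.linalg.norm(v))
        for v in class_vecs]
print(classes[int(np.argmax(sims))])  # predict the most similar class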

Vectorization: how to do with an open source model?

To make the best use of vector embeddings with vector databases like Milvus and Zilliz Cloud, obtain vectors by removing the last layer and taking the output from the second-to-last layer. The last layer of a neural network usually outputs the model's prediction, so we take the output of the second-to-last layer instead: the vector embedding is the data fed to a neural network's predictive layer. https://thenewstack.io/how-to-get-the-right-vector-embeddings/

#LLMs Why are parameters not a good measure for LLMs?

While a power-law might capture the general growth trend, it may overstate the performance improvements at extremely large scales.

#basics What are autoregressive models? Why use them?

"Like" moving averages but better; can be used to predict "n" steps ahead which means you can use to forecast financial data, etc.

#rl what is Q-Learning?

- It is an algorithm that aims to determine the optimal action based on the current state, by either following the policy set by the human programmer or by deviating from the prescribed policy (developing its own set of rules). Because it may deviate from the given policy, a defined policy is not needed.

#basics What are the "Types" of Generative AI models?

1. Generative Adversarial Networks (GANs) 2. Variational Autoencoders (VAEs) 3. Autoregressive Models 4. Recurrent Neural Networks (RNNs): RNNs are a type of neural network that processes sequential data, such as natural language sentences or time-series data. They can be used for generative tasks by predicting the next element in the sequence given the previous elements. However, RNNs are limited in generating long sequences due to the vanishing gradient problem. More advanced variants of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been developed to address this limitation. 5. Transformer-based Models: Transformers, like the GPT series, have gained significant popularity in natural language processing and generative tasks. They use attention mechanisms to model the relationships between different elements in a sequence effectively. Transformers are parallelizable and can handle long sequences, making them well-suited for generating coherent and contextually relevant text. 6. Reinforcement Learning for Generative Tasks: Reinforcement learning can also be applied to generative tasks. In this setup, an agent learns to generate data by interacting with an environment and receiving rewards or feedback based on the quality of the generated samples. This approach has been used in areas like text generation, where reinforcement learning helps fine-tune generated text based on user feedback. These are just some of the types of generative AI models, and there is ongoing research and development in this field, leading to the emergence of new and more advanced generative models over time.

what is a Discriminative Deep Learning Model?

1. This is a class of models used in Statistical Classification, mainly used for supervised machine learning 2. These types of models are also known as conditional models since they learn the boundaries between classes or labels in a dataset.

#llms ChatGPT and others "learn" in two ways, what are they? What is one you can actually influence for ChatGPT?

1. Via model weights (i.e., fine-tune the model on a training set) 2. Via model inputs (i.e., insert the knowledge into an input message) You can influence 2 if you do this: https://cookbook.openai.com/examples/question_answering_using_embeddings

RAGs + embedding - how does this work?

1. You encode the data you want to expose to your LLM into embeddings and index that data into a vector database. 2. When a user asks a question, it is converted to an embedding, which is used to search for similar embeddings in the database. 3. Once similar embeddings are found, a prompt is constructed with the related data to provide context for the LLM to answer the question. (See the sketch below.)
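EXAMPLE A minimal sketch of the three steps; embed() is a hypothetical stand-in for a real embedding model, and the "database" is just a NumPy array:

import zlib
import numpy as np

def embed(text):
    # Hypothetical: deterministic fake vectors so the sketch runs
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=384)

# 1. Index: embed the knowledge base and store the vectors
docs = ["Refund policy: refunds take 5-7 days.", "Shipping: orders ship in 48 hours."]
index = np.stack([embed(d) for d in docs])

# 2. Retrieve: embed the question, find the most similar document
question = "How long do refunds take?"
scores = index @ embed(question)
context = docs[int(np.argmax(scores))]

# 3. Generate: put the retrieved context into the LLM prompt
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"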

What is Hinton still excited about, the most "beautiful" model he's made (with collaborators)?

Boltzmann machines (partially still in use, as Restricted Boltzmann machines) DEFINITION - Big, densely connected - It was a model where at the top you had a restricted Boltzmann machine - but below that you had a sigmoid belief net, which was something that had been invented many years earlier OUTCOME: - It would learn hidden representations; each artificial synapse only needed to know about the behavior of the two neurons it was directly connected to - and it looked like the kind of thing you should be able to get in a human brain - what we managed to show was a way of learning these deep belief nets so that there's an approximate form of inference that's very fast, in that it just takes a single forward pass ... that was a very beautiful result - And you could guarantee that each time you learned an extra layer of features, you always improved HOW IT WORKED The restricted version has just one layer of hidden features - And since you are learning features each time you use it, you can use those outputs as data and do it again, and again, as many times as you like. We realized that the whole thing could be treated as a single model, but it was a weird kind of model.

Set yourself up for success with gradient descent, what do you need? ID the challenges to overcome and how you overcome them .... to set yourself up for success

Challenges to overcome: - Oscillating between two or more points - Getting trapped in a local minimum - Overshooting and missing the minimum point What you need: - Training data that is independently and identically distributed (IID) - 1. a random start (random initialization, or random restarts, so a bad starting point doesn't doom the search) - 2. a well-chosen learning rate (too high overshoots or oscillates, too low crawls) https://machinelearningmastery.com/a-gentle-introduction-to-gradient-descent-procedure/#:~:text=The%20gradient%20descent%20algorithm%20is,the%20lowest%20mean%20square%20error.

#learning #online When you set up learning for a model, what is minibatching? Code for this?

DEFINITION - Instead of feeding the entire dataset to the model at once, data is divided into smaller subsets or mini-batches. - Can be seen as an approximation of online learning. EXAMPLE (the original left generator_loss, discriminator_loss, and noise_dim undefined and truncated the last line; they are completed here)

import tensorflow as tf

noise_dim = 100  # size of the generator's random input

# Define the generator
def generator_model():
    model = tf.keras.Sequential()
    # ... [add layers]
    return model

# Define the discriminator
def discriminator_model():
    model = tf.keras.Sequential()
    # ... [add layers]
    return model

# Instantiate models
generator = generator_model()
discriminator = discriminator_model()

# Define loss and optimizers
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(fake_output):
    # Generator wants the discriminator to call its fakes "real" (1s)
    return cross_entropy(tf.ones_like(fake_output), fake_output)

def discriminator_loss(real_output, fake_output):
    # Discriminator wants reals labeled 1 and fakes labeled 0
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)

# Training loop with mini-batches
def train(dataset, epochs):
    for epoch in range(epochs):
        for real_images in dataset:  # each `real_images` is one mini-batch
            with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
                # Generate fake images
                noise = tf.random.normal([real_images.shape[0], noise_dim])
                generated_images = generator(noise, training=True)

                # Get discriminator predictions
                real_output = discriminator(real_images, training=True)
                fake_output = discriminator(generated_images, training=True)

                # Compute generator and discriminator loss
                gen_loss = generator_loss(fake_output)
                disc_loss = discriminator_loss(real_output, fake_output)

            # Get gradients and apply them
            gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
            gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
            generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
            discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))

# Example data loading (considering data as images; train_images is assumed loaded, e.g. MNIST)
BUFFER_SIZE = 60000
BATCH_SIZE = 256
train_dataset = tf.data.Dataset.from_tensor_slices(train_images).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

#optimization What is the Late Acceptance algo?

DEFINITION - a variant of Hill Climbing: instead of comparing the new solution with the current solution, it compares the new solution with a solution from a few iterations ago HOW IT WORKS It can escape local optima by accepting solutions that are worse than the current one but better than the one from some iterations back EXAMPLE (pseudocode)

current_solution = initial_solution()
best_solution = current_solution
history = [cost(current_solution)] * history_size
iteration = 0
while not stopping_criteria_met():
    neighbor = random_neighbor(current_solution)
    if cost(neighbor) < history[iteration % history_size]:
        current_solution = neighbor
        if cost(current_solution) < cost(best_solution):
            best_solution = current_solution
    history[iteration % history_size] = cost(current_solution)
    iteration += 1
return best_solution

#HMM What are Hidden Markov Models?

DEFINITION A statistical model where the system being modeled is assumed to be a Markov process with hidden states; you don't observe the states directly - instead, you observe data that is generated from the hidden states. THRESHOLD CRITERIA The "hidden" part comes from the fact that the true state isn't directly visible, but the data (or observation) is.

What Is a Deep Belief Network? What is the difference between deep belief and deep neural networks?

DEFINITION Deep belief networks (DBNs) are a type of deep learning algorithm that addresses the problems associated with classic neural networks. They do this by using layers of stochastic latent variables, which make up the network. These latent variables (feature detectors or hidden units) are binary, and they are known as stochastic because they take on their values with some probability. DIFFERENCES Deep belief networks differ from deep neural networks in that they make connections between layers that are undirected (not pre-determined), thus varying in topology by definition.

Mode Collapse: What is it?

DEFINITION A model generates only a limited variety of outputs because it fails to learn the diversity present in the training data, often due to a lack of data normalization or unstable training. Normalization can help alleviate this by ensuring that the model doesn't get stuck generating only one type of output.

#padding #neuralnetworks #text what does it mean to "pad the inputs?"

DEFINITION For text, when processing sequences with neural networks, the sequences often need to be of a uniform length. If the actual sequences are shorter than the desired length, they are "padded" with extra, typically zero, values to reach that length.
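EXAMPLE A minimal Keras sketch (toy sequences; maxlen=4 is an illustrative choice):

import tensorflow as tf

sequences = [[5, 8, 2], [7, 1], [3]]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=4, padding="post", value=0)
# [[5 8 2 0]
#  [7 1 0 0]
#  [3 0 0 0]]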

#transformers What is positional embedding? Why is it needed? What is a real life example?

DEFINITION In the Transformer architecture position embeddings are added to the word embeddings to provide information about the position of a word within a sequence. WHY NEEDED Transformer architecture (unlike RNNs) does not have any inherent notion of the sequence order, so this is how they "make one" EXAMPLE IN REAL LIFE BERT (can be used for sentiment analysis)

#sparseMOEs. What is K-means clustering? When do you use it with Switch Transformers?

DEFINITION K-means clustering aims to partition observations into k clusters where each observation belongs to the cluster with the nearest mean. WHEN TO USE WITH SWITCH Switch transformers for text generation use discrete latent variables z to control attributes like sentiment, forming a "codebook" ALGO Initialize k cluster centers μ1, μ2, ..., μk - Repeat until convergence: -- For each data point xi, assign it to the cluster j that minimizes the distance between xi and μj: j = argmin_j ‖xi − μj‖² -- Update each cluster center μj as the mean of all data points assigned to cluster j: μj = (1/Nj) Σ_{xi in cluster j} xi, where Nj is the number of data points assigned to cluster j HOW IT IS USED K-means can cluster the latent space into K codebook vectors c1,..,cK. Vector Quantization (VQ) maps each z to its nearest codebook vector c(z). This allows gradient flow through discrete variables during training. At inference, sampling codebook vectors generates diverse, controllable text. WHY USEFUL IN GENERATIVE AI - helps switch transformers effectively model and manipulate discrete latent variables for generative tasks like image generation REFERENCES Esser, Patrick, et al. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. Keskar, Nitish Shirish, et al. "CTRL: A conditional transformer language model for controllable generation." arXiv preprint arXiv:1909.05858 (2019).

#embedding #classification What is Named Entity Recognition (NER)? Why do we learn about this to help us understand embedding?

DEFINITION NER is a subtask of information extraction that classifies named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. WHY LEARN ABOUT THIS RE: EMBEDDING CHOICES "Barack Obama was born in Hawaii." - When tagging this sentence, the exact position of each word is crucial because: - The name "Barack" at position 1 - "Obama" at position 2 - together form the entity "Barack Obama" which is a Person. - The word "Hawaii" at position 6 is a Location. Misplacing these positions or shifting them can result in incorrect entity tagging. Absolute Position Embeddings in NER: When using absolute position embeddings, each word's position in the sequence will get a unique embedding. This helps the model distinguish between "Barack" which is at the beginning of the sequence and "Hawaii" which is towards the end.

#models What are Vector Autoregressive models?

DEFINITION VAR models are multivariate time series models that relate current observations of one variable to past observations of that variable and other variables. THRESHOLD CRITERIA Variable feedback is a characteristic of VAR models, unlike univariate autoregressive models; they can also include exogenous series to analyze their effects on a system's variables. EXAMPLES: - How real GDP affects the policy rate and how the policy rate affects real GDP. - Examining whether a recent tariff has affected several econometric series. - Forecasting the response variables simultaneously.

#optimization What is a Simulated Annealing algo?

DEFINITION ___ is a probabilistic optimization technique inspired by the annealing process in metallurgy* - the algorithm occasionally accepts worse solutions, with the probability of accepting such a solution decreasing over time, helping the search to escape from local optima. EXAMPLE (pseudocode)

current_solution = initial_solution()
best_solution = current_solution
temperature = initial_temperature
while not stopping_criteria_met():
    neighbor = random_neighbor(current_solution)
    delta = cost(neighbor) - cost(current_solution)
    if delta < 0 or random() < exp(-delta / temperature):
        current_solution = neighbor
        if cost(current_solution) < cost(best_solution):
            best_solution = current_solution
    temperature = decrease_temperature(temperature)
return best_solution

*annealing refers to heating a material and then slowly cooling it to remove defects, resulting in a more organized material structure.

#optimization Define gradient descent? What is the point of using this in ML?

DEFINITION an algorithm for finding the minimum of a function USED HOW - use ___ as an optimization algo used as the core of a ML algo - use ___ to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost). HOW TO USE - Concrete Example - take classification tasks, which need a mean square error function to fit a model to the data - use ___ to ID the optimal model parameters that lead to the lowest mean square error FYI: Gradient ascent is used similarly, for problems that involve maximizing a function
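EXAMPLE A minimal sketch minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3):

x = 0.0  # starting point
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (x - 3)         # gradient of the cost at the current x
    x -= learning_rate * grad  # step against the gradient
print(x)  # converges toward the minimum at x = 3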

#SentimentAnalysis What is sentiment analysis? WHY Do you need it in enterprise AI? How do you "do" it?

DEFINITION an automated process of tagging data according to their sentiment, such as positive, negative and neutral WHY YOU CARE for ENTERPRISE GenAI Sentiment analysis allows companies to analyze data at scale, detect insights and automate processes HOW TO DO -Text: use NLP

#embedding What are "absolute position embeddings"?

DEFINITION each position in a sequence (e.g., the first word, the second word, etc.) has a specific, fixed embedding

#optimization What is ridge regression and when do you use it?

DEFINITION ___ is a linear regression technique that includes an L2 regularization term to prevent overfitting HOW IT WORKS The regularization term is proportional to the squared L2 norm of the model's coefficients; the aim is to minimize Cost(w) = MSE(w) + α‖w‖², where w is the model's coefficient vector, MSE(w) is the mean squared error, and α is the regularization parameter that controls the strength of regularization. The derivative of the cost function with respect to w is used in gradient descent to update the model's coefficients, and the derivative of the squared L2 norm term is particularly convenient here: it is simply 2αw (the factor of 2 simplifies the computation). This derivative is added directly to the gradient of the mean squared error term when updating the coefficients during gradient descent.

#embedding What are relative position embeddings?

DEFINITION where the relationship between positions (e.g., "two words apart") would be encoded.

What is a SequentialString Layer? When do you use it? How can you optimize/customize it?

DEFINITION ____ has a TensorFlow default setting, and it will create an internal dictionary mapping each word to a unique integer value:
# EXAMPLE:
# the --> 0
# fat --> 1
# cat --> 2
# in --> 3
# your --> 4
# hat --> 5
WHEN TO USE ____ is a common first layer in a sequential model HOW TO OPTIMIZE IT/CUSTOMIZE IT - use the other output modes: one_hot, multi_hot, count, or tf_idf - which will create other representations of your tokens/words EXAMPLE If you were to use the one_hot output mode, you could directly train your model on these representations without an Embedding layer (see the sketch below)
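NOTE "SequentialString Layer" is not a standard Keras class name; the behavior described above (an integer dictionary plus one_hot/multi_hot/count/tf_idf output modes) matches tf.keras.layers.StringLookup, sketched here as an assumption:

import tensorflow as tf

words = tf.constant([["the", "fat", "cat", "in", "your", "hat"]])
lookup = tf.keras.layers.StringLookup(output_mode="int")  # also: "one_hot", "multi_hot", "count", "tf_idf"
lookup.adapt(words)   # builds the word -> integer dictionary
print(lookup(words))  # integer ids (one slot is reserved for out-of-vocabulary words)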

What type of model is AlphaFold2 and what does it do?

DeepMind, in London, advanced the understanding of proteins, the building blocks of life, using a transformer called _____

#algorithms what is federated learning?

Federated learning (also known as collaborative learning) is a machine learning technique that trains an algorithm via multiple independent sessions, each using its own dataset. This approach stands in contrast to traditional centralized machine learning techniques where local datasets are merged into one training session, as well as to approaches that assume that local data samples are identically distributed. To ensure good task performance of a final, central machine learning model, federated learning relies on an iterative process broken up into an atomic set of client-server interactions known as a federated learning round. Each round of this process consists of transmitting the current global model state to participating nodes, training local models on these local nodes to produce a set of potential model updates at each node, and then aggregating and processing these local updates into a single global update and applying it to the global model.

For Loops and While Loops- why avoid those when training neural nets using Numpy/Python? What are ways to avoid them?

For loops and while loops are not efficient in Python/NumPy because they are executed by the Python interpreter - the interpreter steps through the loop body element by element, which is time-consuming, especially for large datasets. How to avoid: 1. use vectorized operations - functions that operate on entire arrays at once 2. use NumPy's built-in functions to perform common operations on arrays, such as sorting, filtering, and aggregating 3. compose NumPy's built-in functions into custom functions, so complex operations on arrays still run without explicit Python loops (see the sketch below)
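EXAMPLE A minimal before/after sketch:

import numpy as np

data = np.random.rand(1_000_000)

# Slow: an interpreted Python loop over every element
total = 0.0
for value in data:
    total += value * 2

# Fast: one vectorized operation over the whole array
total_vec = (data * 2).sum()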

TRAINING With an Encoder/Decoder (or Transformer/Foundational) Model What is this code and what is it used for? 1. ~ def split_input_target(sequence): 2. ~~~~ input_text = sequence[:-1] 3. ~~~~ target_text = sequence[1:] 4. ~~~~ return input_text, target_text

For training this model to produce text, you'll need a dataset that is in pairs of "input + label" - at each time step the input is the current character and the label or target is the next character. This code is telling the machine that the input always comes first and the label always comes second ... for each and every time step 1. is the function requiring the sequence as input; it duplicates and shifts it to align the input and label for each timestep 2. is saying the input is everything except the last character (sequence[:-1]) 3. is saying the target is everything except the first character (sequence[1:]) - the same sequence shifted one step ahead 4. is telling the form of the output required EXAMPLE If you want to see how this all came together, you would run this command to see ~ split_input_target(list("Tensorflow")) OUTPUT (['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'], ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w']) Just like you required

What are gaussian processes and why use them with deep learning?

GAUSSIAN PROCESSES ARE distributions over functions: any finite collection of function values is jointly Gaussian, defined by a mean function and a covariance (kernel) function. WHY USE WITH DEEP LEARNING? Effectiveness: - Estimates probability of data points - Provides more contextual information than other clustering algorithms - Fits better than k-means: the clustering will mimic the data cloud better and with a smaller k Versatility: - Can model different data types and distributions, including data with multiple peaks or modes and non-spherical clusters - Works well with non-linear datasets: doesn't assume clusters to be of any geometry - Automatically learns subpopulations: doesn't require knowing which subpopulation a data point belongs to Efficiency: can find clusters of Gaussians more efficiently than other clustering algorithms, such as k-means

#embedding #transformers Is it always better to use absolute position v. relative position? Why or why not?

GIVEN Transformers lack inherent notions of sequence order WHEN TO USE WHICH? Absolute: IF exact position in a sequence is crucial (e.g., certain types of sequence tagging) THEN use absolute embeddings might be more appropriate. For tasks that deal with sequences of varying lengths, or where the relationship between tokens regardless of their absolute positions is more important (e.g., certain types of parsing or translation), relative position embeddings might offer advantages. 1. Absolute Position Embeddings: Advantage: They are simple and directly provide information about the position of a token in a sequence. Limitation: Since each position has a specific embedding, the model may struggle to generalize to sequence lengths not seen during training. Example (Conceptual): Imagine a sequence: "I love dogs". If we use absolute position embeddings: - "I" might have an embedding vector corresponding to the 1st position. - "love" to the 2nd position. - "dogs" to the 3rd position. 2. Relative Position Embeddings: Advantage: These embeddings encode the relative distance between tokens, making them potentially more flexible and capable of handling varying sequence lengths. Limitation: They can be more complex to implement and might not always provide clear benefits over absolute embeddings, depending on the specific task and dataset. Example (Conceptual): - The relationship between "I" and "love" might be encoded as "1 position apart". - Between "love" and "dogs" also as "1 position apart". - Between "I" and "dogs" as "2 positions apart".

Forward propagation: Explain it and talk about how it relates to backpropagation

Forward propagation is the pass that computes the network's output: inputs flow layer by layer through weights, biases, and activation functions until a prediction (and then a cost) is produced. Gradients are a key part of the relationship to backpropagation, which aims to minimize the cost function by adjusting the network's weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters, and backpropagation computes those gradients by working backwards through the same layers the forward pass went through, using the chain rule.

#memory #training You are doing offline learning, but you cannot save all the data you need to train the model on your computer, what do you do? What is this called? What TYPE of learning is this?

HOW See image NAME Out-of-Core Learning TYPE OF LEARNING Incremental (so not really "Offline" or "Streaming/Online") https://read.amazon.com/?asin=B0BHCFNY9Q&ref_=kwl_kr_iv_rec_1

#training What is a Hamiltonian Monte Carlo (HMC) method and how is it different from a Markov chain Monte Carlo (MCMC) method

Hamiltonian Monte Carlo (HMC) is a type of Markov chain Monte Carlo (MCMC) method. The key differences between HMC and general MCMC for training energy-based models are: In HMC, samples are generated by simulating the dynamics of a particle moving across the parameter space following Hamilton's equations. This avoids the random walk behavior and correlations of basic MCMC. The energy function is augmented with a kinetic energy term based on auxiliary momentum variables. The negative log-probability density acts as a potential energy. Each HMC iteration involves simulating the particle dynamics for multiple leapfrog steps and accepting the end state with a Metropolis correction. This enables more efficient exploration. The dynamics are simulated via alternating gradient updates on the position (samples) and momentum. The gradient information helps the chain mix faster. HMC requires tuning the leapfrog step size and number of steps per sample to maintain a reasonable acceptance rate. In summary, HMC works by introducing momentum variables and simulating Hamiltonian dynamics across the sample space. It takes advantage of gradient information and mechanisms like Metropolis correction to improve mixing and convergence compared to basic MCMC methods. This enables more effective maximum likelihood training of energy-based models.

#protip You are using one of the "Auto" classes, and not a specialized class, for the same model architecture and configuration - what is the impact? When do you use "specialized" classes?

IN A NUTSHELL NO performance impact - no change in accuracy - speed IS ABOUT - flexibility - code clarity WHY USE "AUTO" CLASSES? FLEXIBILITY: you might switch between different model architectures without changing much of your code, WHEN TO USE "SPECIALIZED" CLASSES? CODE CLARITY: you've already decided on a specific model architecture, using these will make the code more explicit and clear about which model you're using

#learning #hyperparameters What is a "learning rate" and why do you want to be careful when you set it?

It is how fast the model will take in new info while forgetting old... HIGH: forgets fast, adapts quickly LOW: more conservative, less prone to get "distracted" by outliers in the new data

#learning #types What type of learning is this: self-creates labels? What category does this type go into: unsupervised or supervised?

It is self-supervised learning Weirdly, it is classified in Supervised rather than Unsupervised learning.

MASKING - what is it and why do you need it? What is the right amount?

It is what you do to help train the model: removing SOME (see below) of the words in NLP, in order to examine how the model predicts the missing tokens Remove 15% (the usual figure, from BERT) Remove too little: the model is more expensive and harder to train Remove too much: removes so much context that there is not enough left to train the model

Support Vector Machines - what are they?

It is a class of algorithms which have gained popularity due to their ability to handle complex data distributions and high-dimensional spaces HOW USED IN ML classification regression tasks text categorization bioinformatics DEFINITION SVMs find a hyperplane that best separates the data points of different classes while maximizing the margin, which is the distance between the hyperplane and the nearest data points of each class. The data points that are closest to the hyperplane and contribute to defining its position are called support vectors - see image attached. WHEN TO USE It is useful for both linearly separable (hard margin) and non-linearly separable (soft margin) data. It is effective in high-dimensional spaces. It is effective in cases where the number of dimensions is greater than the number of samples. It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient. WHEN NOT TO USE Resource-intensive: picking the right kernel and parameters can be computationally intensive. It also doesn't perform very well when the data set has more noise, i.e. target classes are overlapping. SVMs don't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. https://pub.towardsai.net/support-vector-machine-svm-a-visual-simple-explanation-part-1-a7efa96444f2

#activation functions, linear or ___? unsupervised or supervised? type of data?

Log (logarithmic), supervised

What is the chain rule of differentials and what is the more common term used?

More common term: backpropagation High Level: after each forward pass through a network, backpropagation performs a backward pass while adjusting the model's parameters (weights and biases) How does the algorithm know to adjust a parameter? The level of adjustment is determined by the gradients of the cost function with respect to those parameters. Why compute gradients? To answer this, we first need to revisit some calculus terminology: the gradient of a function C(x_1, x_2, ..., x_m) at a point x is the vector of the partial derivatives of C at x: ∇C = (∂C/∂x_1, ∂C/∂x_2, ..., ∂C/∂x_m). The derivative of a function C measures the sensitivity to change of the function value (output value) with respect to a change in its argument x (input value). In other words, the derivative tells us the direction C is going. The gradient shows how much the parameter x needs to change (in a positive or negative direction) to minimize C.

IMPORTANT!! What are encoder-decoder models more properly called after 2017? What are these models? Why are they better than others? Where did they start their use? Why are these models important to Generative AI?

More properly called foundation or transformer models DEFINITION ____ are neural networks that learn context, and thus meaning, by tracking relationships in sequential data; ____ models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways even distant data elements in a series influence and depend on each other OTHER NAMES USED: Transformers, Foundation Models FYI: Attention was so key to transformers that the Google researchers almost used the term as the name for their 2017 model. "Attention Net didn't sound very exciting," said Vaswani, who started working with neural nets in 2011. It was Jakob Uszkoreit, a senior software engineer on the team, who came up with the name Transformer. WHY IMPORTANT TO GENERATIVE AI 1. They have replaced CNNs and RNNs because they are more effective (70 percent of arXiv papers on AI posted in the last two years mention ____s) 2. And because they are more efficient (time-wise): they can take advantage of parallelism with GPUs 3. Stanford researchers called ____ "foundation models" in an August 2021 paper because they see them driving a paradigm shift in AI; the "sheer scale and scope of foundation models over the last few years have stretched our imagination of what is possible" 4. First described in a 2017 paper from Google, ____ are among the newest and most powerful classes of models invented to date; they're driving a wave of advances in machine learning some have dubbed "Transformer AI" $$$$ Saved - Unsupervised = OK NOW - Before transformers arrived, users had to train neural networks with large, labeled datasets that were costly and time-consuming to produce; by finding patterns between elements mathematically, ____ eliminate that need, making available the trillions of images and petabytes of text data on the web and in corporate databases $$$$ Saved - Parallel processing = OK NOW - the math that ____ use lends itself to parallel processing, so these models can run fast.

How do Encoder-Decoder/Transformer/Foundational Models work, in three steps?

Most neural networks are large encoder/decoder blocks that process data, _____ are like that ... with improvements via small strategic additions to these large encoder/decoder blocks (see diagram) - making them uniquely powerful allowing computers to "see" the same patterns humans see... 1. Add positional encoders to tag data elements coming in and out of the network 2. Then "attention units" follow these tags: these calculate (a kind of) algebraic map of how each element relates to the others - Attention queries are typically executed in parallel by calculating a matrix of equations in what's called multi-headed attention 3. Decoder generates output data from the information extracted by the encoder, predicting the next "token"

#learning #Types What type of learning is when the model applies what was already learned? Two names

Offline Learning AKA Batch Learning - AlphaGo was an example of this

Why is the Hugging Face transformer library so useful/popular?

Pre-trained Models: Transformers library provides a collection of pre-trained models, like BERT, GPT-2, T5, DistilBERT, and many more, which can be easily loaded and fine-tuned on specific tasks, drastically reducing the effort and computational resources required. Model Architectures: Beyond just pre-trained weights, the library also offers implementations of the underlying model architectures. This is useful for researchers or practitioners who want to train a model from scratch or modify the existing architectures. Tokenizers: Alongside models, the library provides tokenizers that can preprocess text data in a way that's compatible with the respective models. High-Level API: The library's interface is designed to be user-friendly. Loading models, making predictions, and fine-tuning are all straightforward processes. Multilingual Support: Many of the models in the Transformers library come in multilingual variants, capable of understanding and generating text in multiple languages.
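EXAMPLE A minimal sketch of the high-level API (downloads a default pre-trained sentiment model on first use):

from transformers import pipeline

# One call loads a pre-trained model plus its matching tokenizer
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers made this card set much easier to build!"))
# -> [{'label': 'POSITIVE', 'score': ...}]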

#multimodal Alignment in multimodal learning means what?

Refers to the task of identifying direct relationships between different modalities. Current research in multimodal learning aims to create modality-invariant representations. This means that when different modalities refer to a similar semantic concept, their representations must be similar/close together in a latent space. For example, the sentence "she dived into the pool", an image of a pool, and the audio signal of a splash sound should lie close together in a manifold of the representation space.

What is RLHF? How does it work? Why is it important for Enterprise AI v. other methods? What is some next level shit that others do that is useful to do and how do you recreate it?

Reinforcement Learning from Human Feedback is part of training of a model to respond in ways that humans prefer in the context given How it works: 1. During the annotation process of training a model, humans are presented with prompts and either write the desired response or rank a series of existing responses 2. These human preferences are directly encoded in the training data 3. And then this "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning what the humans prefer in a given communicative situation. WHY THIS IS IMPORTANT FOR ENTERPRISE Gen AI: This optimizes the model (or Mix of Minds, multiple models) to reflect the human preferences of that enterprise v. self-supervision — which is pre-training LLMs using next-token prediction w/o humans helping NEXT LEVEL SHIT: OpenAI's data for ChatGPT also include human-written responses to prompts that are used to fine-tune the initial LLM HOW TO RE-CREATE THAT? A. you can use simple ratings, simple thumbs up/down labels, or even numbers of upvotes to build out a ___ by taking any internal message boards and using them like ranking datasets [OpenAI uses Reddit or Stackoverflow conversations where answers to questions are rated by users like this] B. You can also allow users to send feedback on what they wanted as answers

#datasets. Stanford Question Answering Dataset- why is this important to include?

SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, ML systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

#optimizations. What is Lasso Regularization? Why use it?

STANDS FOR: Least Absolute Shrinkage and Selection Operator (LASSO) Regularization
DEFINITION: the L1 norm of the coefficients of a linear model is added to the cost function being minimized
HOW IT WORKS: It encourages the model to automatically select a subset of important features while driving some coefficients to exactly zero; this property of driving coefficients to zero helps in simplifying the model and enhancing its interpretability
OUTCOMES EXPECTED: without this regularization, the model might assign non-zero coefficients to all features, including those that might not contribute much to the prediction; this can lead to overfitting and make the model difficult to interpret
WHEN TO USE: You have a large dataset with many potential features, and you want to build a model that not only predicts well but also provides insights into which features are most important for price prediction
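CODE SKETCH: a hedged scikit-learn example on a synthetic dataset where only a few features matter:

from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Toy data: 20 features, but only 5 are actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

# Many coefficients are driven to exactly zero = automatic feature selection
print((lasso.coef_ == 0).sum(), "of 20 coefficients are exactly zero")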

What is an encoder-decoder model? What parts are there?

See image. ML Architecture Parts:
Parameters: Number of learnable variables/values available to the model.
Transformer Layers: Number of Transformer blocks. A transformer block transforms a sequence of word representations into a sequence of contextualized words (numbered representations).
Hidden Size: Layers of mathematical functions, located between the input and output, that assign weights (to words) to produce a desired result.
Attention Heads: The size of a Transformer block.
Processing: Type of processing unit used to train the model.
Length of Training: Time it took to train the model.

What is the importance of Q learning? What does "off policy" mean? And what is a Q Table?

See image. The off-policy approach is when the agent also discards the initial human-programmed instructions and starts deviating from them to find a better way to "do the thing." Q-values -- also known as action values -- are the expected future rewards for an action. Q-tables keep track of all of these Q-values.
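CODE SKETCH: a minimal Q-table update (the classic tabular Q-learning rule; the state/action sizes are arbitrary placeholders):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # the Q-table: one Q-value per (state, action)
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Off-policy: we bootstrap from the greedy action (the max), regardless
    # of which action the behavior policy actually takes next
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)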

#lossfunction How do you choose a loss function - without "overfitting" it? Use the example of the gradient descent function

See image for examples of fitting and overfitting.
Line [a] has lower norms because it has significantly fewer parameters compared to [c]
Line [b] has lower norms because, despite having the same number of parameters, they're all much smaller than [c]
GIVEN: When doing gradient descent we'll update our weights based on the derivative of the loss function. So if we've included a norm in our loss function, the derivative of the norm will determine how the weights get updated (see the sketch below).
1. Choose the simpler functions
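CODE SKETCH: a hedged illustration of how a norm term changes the gradient-descent update (the weights and data-loss gradient are made-up placeholders):

import numpy as np

w = np.array([0.5, -2.0, 1.5])          # current weights
data_grad = np.array([0.1, -0.3, 0.2])  # hypothetical gradient of the data loss
lr, lam = 0.01, 0.1                     # learning rate and penalty strength

# L2 penalty lam * ||w||^2 adds 2 * lam * w to the gradient,
# shrinking large weights more aggressively:
w_l2 = w - lr * (data_grad + 2 * lam * w)

# L1 penalty lam * ||w||_1 adds lam * sign(w),
# pushing small weights toward exactly zero:
w_l1 = w - lr * (data_grad + lam * np.sign(w))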

What is a sigmoid belief net? What makes it better or worse than other things to use in enterprise generative AI?

The attached image is a graphical model, not a neural net. A sigmoid belief network is a directed graphical model of binary variables in fully connected layers. In a sigmoid belief network, only the bottom layer is observed. CHARACTERISTICS: A sigmoid belief network can be faster than a Boltzmann machine. It also has advantages over Boltzmann machines in pattern classification.

#classification #modelsetup By specifying num_labels=2, what does this do?
# Define DistilBERT as our base model:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

The provided code loads the DistilBERT model, which has been pre-trained on a large corpus, with the intention of using it for sequence classification. By specifying num_labels=2, you're indicating that this will be a binary classification task: the model gets a freshly initialized classification head with two output logits, which still needs to be fine-tuned on your labeled data.
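CODE SKETCH: a hedged usage example; note the two-logit head is randomly initialized until you fine-tune it:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2): one logit per class

predicted_class = logits.argmax(dim=-1).item()  # 0 or 1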

What is the play with enterprise generative AI in terms of how it will interact with open source and proprietary models?

There will be an interconnected mix of models, providers, and business models. Large foundation models from the big providers will live alongside locally deployable open-source models from commercial and research organizations, alongside models specialized for particular domains or applications. The transparency, curation, and ownership of models (and tuning data) ...

#models What are Autoregressive models?

They generate data one element at a time - conditioning the generation of each element on previously generated elements - they predict the probability distribution of the next element given the context of the previous elements and then sample from that distribution to generate new data. EXAMPLE: GPT (Generative Pre-trained Transformer), which can generate coherent and contextually appropriate text.

Enterprise: Why do you separate your knowledge base from your LLM? What is the User's Workflow "Look Like? when you do this"

To ensure that users receive accurate answers, we need to do this in order to leverage the semantic understanding of our language model while also providing our users with the most relevant information; all of this happens in real-time, and no model training is required.
USE CASE: The approach for this would be as follows (see the sketch after this list):
1. User asks a question
2. Application finds the most relevant text that (most likely) contains the answer
3. A concise prompt with relevant document text is sent to the LLM
4. User will receive an answer or 'No answer found' response
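CODE SKETCH: the workflow above as hedged pseudocode; embed(), vector_store, and llm are hypothetical stand-ins for your embedding model, vector index, and LLM client:

def answer_question(question, embed, vector_store, llm, top_k=3):
    # 2. Find the most relevant text in the knowledge base
    docs = vector_store.search(embed(question), top_k=top_k)
    if not docs:
        return "No answer found"
    # 3. Send a concise prompt with the relevant document text to the LLM
    context = "\n".join(d.text for d in docs)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say 'No answer found'.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 4. Return the LLM's answer to the user
    return llm.complete(prompt)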

#python #generativeAI What are Class Constructors? Why are they used?

True to Python's object-oriented capabilities, you'll find the class keyword, which allows you to define custom classes that can have attributes for storing data and methods for providing behaviors. Once you have a class to work with, you can start creating new instances, or objects, of that class, which is an efficient way to reuse functionality in your code. Creating and initializing objects of a given class is a fundamental step in object-oriented programming. This step is often referred to as object construction or instantiation. The tool responsible for running this instantiation process is commonly known as a ___ ____.
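CODE SKETCH: a minimal example of the construction/instantiation step (Point is an arbitrary illustrative class):

class Point:
    # __init__ is the initializer Python runs as part of object construction
    # (after __new__ has created the bare instance)
    def __init__(self, x, y):
        self.x = x  # attributes store the object's data
        self.y = y

    def distance_to_origin(self):  # methods provide behavior
        return (self.x ** 2 + self.y ** 2) ** 0.5

p = Point(3, 4)                # instantiation calls Point.__init__(p, 3, 4)
print(p.distance_to_origin())  # 5.0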

#padding #embedding #protip Finish this sentence: for a transformer model pre-trained with absolute position embeddings, you should add the padding tokens to the end (right side) of the sequence rather than at the beginning (left side) because ...

WHY: if you pad on the left = BAD - you would shift the actual content to the right and alter the absolute positions, which might lead the transformer to interpret the sequence differently than intended!!
REAL EXAMPLE (BERT) CODE:
# Encode texts, pad them to the length of the longest sequence in the batch, and pad on the right
encoded_inputs = tokenizer(texts, padding='longest', truncation=True, return_tensors='pt')
EXPLANATION: tokenize_function takes a batch of examples from the IMDb dataset and tokenizes them using the BERT tokenizer. padding='longest' ensures that all sequences in a batch are padded to the length of the longest sequence in that batch. Right-side padding is the default behavior. truncation=True ensures that sequences longer than BERT's maximum sequence length are truncated. batched=True in imdb.map() means that tokenize_function will receive batches of examples, which allows for efficient tokenization and padding to the longest sequence in each batch.
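CODE SKETCH: a fuller, hedged version of the pattern described above, assuming the datasets and transformers libraries and the IMDb dataset on the Hub:

from datasets import load_dataset
from transformers import AutoTokenizer

imdb = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(batch):
    # padding='longest' pads each batch to its longest sequence;
    # right-side padding is the default, preserving absolute positions
    return tokenizer(batch["text"], padding="longest", truncation=True)

tokenized_imdb = imdb.map(tokenize_function, batched=True)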

Hashing Trick

Popularized by the package Vowpal Wabbit, developed at Microsoft. The gist of this trick is that you use a hash function to generate a hashed value of each category. The hashed value will become the index of that category. Because you can specify the hash space, you can fix the number of encoded values for a feature in advance, without having to know how many categories there will be. For example, if you choose a hash space of 18 bits, which corresponds to 2^18 = 262,144 possible hashed values, all the categories, even the ones that your model has never seen before, will be encoded by an index between 0 and 262,143.
One problem with hash functions is collision: two categories being assigned the same index. However, with many hash functions, the collisions are random; new brands can share an index with any of the existing brands instead of always sharing an index with unpopular brands, which is what happens when we use the preceding UNKNOWN category. The impact of colliding hashed features is, fortunately, not that bad. In research done by Booking.com, even with 50% colliding features, the performance loss is less than 0.5%, as shown in Figure 5-4. [Figure 5-4: A 50% collision rate only causes the log loss to increase less than 0.5%. Source: Lucas Bernardi]
You can choose a hash space large enough to reduce collisions. You can also choose a hash function with properties that you want, such as a locality-sensitive hashing function where similar categories (such as websites with similar names) are hashed into values close to each other. Because it's a trick, it's often considered hacky by academics and excluded from ML curricula. But its wide adoption in the industry is a testimonial to how effective the trick is. It's essential to Vowpal Wabbit, and it's part of the frameworks of scikit-learn, TensorFlow, and gensim. It can be especially useful in conti...
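CODE SKETCH: a minimal illustration using Python's built-in hash() (production systems typically use a stable hash such as MurmurHash; Python's string hash is randomized per process):

HASH_SPACE = 2 ** 18  # 262,144 possible indices, fixed in advance

def category_index(category: str) -> int:
    # Any category -- even one never seen in training -- maps to [0, 262143]
    return hash(category) % HASH_SPACE

print(category_index("brand_nike"))               # some index in [0, 262143]
print(category_index("brand_never_seen_before"))  # still in range, no UNKNOWN bucket needed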

#learning #batch Can batch learning/training still lead to a model learning? What are the conditions for this?

Yes. Conditions: the model must be retrained "from scratch" on the full dataset (including any new data) each time; batch/offline systems can't learn incrementally.

What do you do when you have a dataset that is too big to fit into memory?

You split the dataset into many CSV files and then load and process them piece by piece (see the sketch below).
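CODE SKETCH: a hedged pandas example of streaming one large CSV in chunks (big_dataset.csv and its 'value' column are hypothetical):

import pandas as pd

total = 0
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    total += chunk["value"].sum()  # process each chunk, then let it be freed
print(total)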

What is a Token? A Tokenizer? When to Use? Example of One? Types of Tokenizers?

__ are "pieces of words" - before the API processes the prompts, the input is broken down into these; not cut up exactly where the words start or end - they can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths: 1 ~= 4 chars in English 1 ~= ¾ words 100 ~= 75 words DEFINITION of __-izer an algorithm that breaks down input text into smaller units, typically called tokens. These tokens can be as short as one character or as long as one word (sometimes even longer). For example, the text "ChatGPT is great!" might be tokenized into ["Chat", "G", "PT", " is", " great", "!"]. Model-Specific Needs: Different transformer models (e.g., BERT, GPT-2, RoBERTa) have unique tokenization mechanisms. The transformers library offers tokenizers specifically designed for each model to ensure data is correctly preprocessed. Vocabulary Mapping: Once the text is tokenized, each token is mapped to an integer ID according to a pre-defined vocabulary. This step translates words, subwords, or characters into a format that the model can process (i.e., numerical tensors). For instance, in the tokenizer's vocabulary, the word "Chat" might be mapped to the ID 12345, "G" to 6789, and so on. Additional Formatting: Apart from basic tokenization and integer mapping, the tokenizer handles other essential tasks:Adding Special Tokens: Some models require special tokens in the input, like [CLS] and [SEP] for BERT. The tokenizer ensures these are added correctly.Handling Max Length: Ensures sequences are either truncated to a maximum allowable length or padded to meet a required length.Attention Masks: Produces masks to differentiate actual tokens from padding tokens, ensuring the model doesn't pay attention to irrelevant padding. Decoding: Besides converting text to tokens, the to

What is an attention mechanism, and why do you care about this with GenAI?

___ ___ lets the model focus on specific parts of the input sequence. The Encoder-Decoder Architecture usually updates the hidden state with one result of encoding, which is then passed to the decoder, but when you add the ____ ____ you can input more to the decoder, telling it to focus on specific parts of an input sequence.
HOW IT WORKS - See image (and the sketch below):
1. Calculating the attention weights: The ____ ____ assigns weights to different parts of the input sequence, with the most important parts receiving the highest weights
2. Generating the context vector
3. Passing many of those to the decoder
WHY YOU CARE: It is an optimization on Seq2Seq models that produces better results
USED FOR: NLP, Image generation
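CODE SKETCH: a minimal NumPy version of scaled dot-product attention, the core of step 1 (toy shapes; real models add learned projections and multiple heads):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 4, 8  # 4 tokens, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, seq_len, d))

# Attention weights: one row per token, one column per token attended to
weights = softmax(Q @ K.T / np.sqrt(d))  # (seq_len, seq_len)
context = weights @ V                    # context vectors passed to the decoder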

MoE is what type of class of learning method? Name another ensemble learning method and tell me how it works "like" an MoE

___ is one of the ensemble learning methods. Another one: stacked generalization, or stacking. Both ___ and stacking attempt to learn from the output of other, lower-level models - or at least learn how to best combine their outputs - by training a diverse ensemble of machine learning models, then learning a higher-order model to best combine their predictions.

Model Collapse: What are the causes?

___ is the degenerative process that affects generations of generative models.
Causes:
Too much generated data: When generated data pollutes the training set of subsequent models, leading to a misperception of reality
Too much bad data: Data poisoning, in broader terms, refers to any factor that contributes to the creation of data that inaccurately reflects reality

loss functions, what are they, why do you care about them in decision support tools? What loss functions do you use for Classification problems? Give an example of steps in a loss function.

____ help gauge how a machine learning model is performing with its given data, and how well it's able to predict an expected outcome. Many machine learning algorithms use ___ in the optimization process during training to evaluate and improve output accuracy. Also, by minimizing a chosen ___ during optimization, you can help determine the best model parameters for the given data.
For classification:
Binary Cross-Entropy ___ / Log ___
Hinge ___
For regression:
Mean Square Error / Quadratic Loss / L2 ___
Mean Absolute Error / L1 ___
Huber ___ / Smooth Mean Absolute Error
Log-Cosh ___
Quantile ___
EXAMPLE: You want to train a GAN to generate synthetic data points that mimic a linear relationship, y_synthetic = 2 * x_synthetic + 3. Here are the steps in the ___ function (see the sketch below):
1. The generated synthetic data point y_synthetic is compared to the ground-truth data points that follow the same linear relationship.
2. The ___ function quantifies the difference between the generated and real data points.
3. Then, the generator's parameters are updated using backpropagation to minimize this loss.
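CODE SKETCH: a hedged PyTorch version of those three steps for the y = 2x + 3 example, using MSE as the loss (a simplification of a full GAN's adversarial loss):

import torch
import torch.nn as nn

generator = nn.Linear(1, 1)  # learns y = w * x + b
optimizer = torch.optim.SGD(generator.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.rand(32, 1) * 10           # random inputs
y_true = 2 * x + 3                   # ground-truth relationship
y_synthetic = generator(x)           # 1. generate synthetic points

loss = loss_fn(y_synthetic, y_true)  # 2. quantify the difference
optimizer.zero_grad()
loss.backward()                      # 3. backpropagate to update the generator
optimizer.step()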

Normalization, why is doing this with data used in generativeAI?

____ prepares the data for the generative model's learning process by ensuring that the model focuses on learning meaningful patterns rather than getting confused by variations in scales and magnitudes.
1. Faster Convergence: Normalizing the data brings all features to a common scale, making the optimization process more efficient. This allows the model to converge faster to a solution that accurately captures the data distribution.
2. Better Generalization Due to Better Ability to Learn Meaningful Patterns: Normalization helps the model generalize well to new, unseen data. When the input data has consistent scales, the generative model can learn meaningful patterns across features and apply them to generate coherent and realistic samples.
3. Avoiding Mode Collapse: Mode collapse is a common issue in generative models where the model generates a limited variety of outputs, ignoring the diversity present in the training data. Normalization can help alleviate this by ensuring that the model doesn't get stuck generating only one type of output.
4. Enhancing Model Robustness: Normalization can make the model more robust to changes in input data distribution. If the input data changes, but the normalized representation remains consistent, the model's performance is less likely to degrade.
5. Balanced Learning: In some cases, the data might have class imbalances or uneven distributions. Normalization can assist in creating a more balanced learning process for the generative model, which is especially important when trying to capture the complexities of each class or mode.
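CODE SKETCH: two common normalizations with scikit-learn (the feature values are made-up placeholders):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Features on wildly different scales (e.g., age in years, income in dollars)
X = np.array([[25, 40_000.0], [38, 92_000.0], [52, 150_000.0]])

# Z-score normalization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: rescale each feature into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)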

What does this mean? for which models do you use this? confusion_matrix(label, model.predict(features))

____ produces a table that is often used to evaluate the performance of a classification model. It helps you understand how well your model is doing in terms of making correct and incorrect predictions - usually presented as a 2x2 table, and you can use its values to calculate various performance metrics like accuracy, precision, recall, and F1-score (see the sketch below). The confusion matrix typically has four components:
True Positives (TP): The number of data points correctly predicted as positive by the model.
True Negatives (TN): The number of data points correctly predicted as negative by the model.
False Positives (FP): The number of data points incorrectly predicted as positive by the model when they are actually negative.
False Negatives (FN): The number of data points incorrectly predicted as negative by the model when they are actually positive.
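CODE SKETCH: reading the four components out of scikit-learn's confusion_matrix (the labels here are toy placeholders):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary 0/1 labels the table is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)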

#multimodal #features What is Joint representation? Why is this used?

Each individual modality is encoded and then placed into a mutual high-dimensional space (the most direct way; may work well when modalities are of a similar nature). Usually, this method works well when modalities are similar in nature, and it's the one most often used; in practice, when designing multimodal networks, encoders are chosen based on what works well in each area, since more emphasis is given to designing the fusion method.

#multimodal #features What does Coordinated representation mean for features?

each individual modality is encoded irrespective of one another, but their representations are then coordinated by imposing a restriction. For example, their linear projections should be maximally correlated.

#Preparation Stage one, what do you need to do to help build your model?

Exploratory data analysis, prototype creation, training routine implementations. Or, in a stream of consciousness: look at the data's shape; pick your activation and loss functions, etc., based on your use case.

#activation Pass-through functions are also called what? Unsupervised or supervised?

Linear activation function: tf.keras.activations.linear. Supervised, numerical.

#models ANN, used for what?

supervised learning

Why not use a linear activation function in a GenAI model?

Two major problems:
1. It's not possible to use backpropagation, as the derivative of the function is a constant and has no relation to the input x.
2. All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer (see the sketch below).
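CODE SKETCH: a tiny NumPy demo of the collapse, showing that three stacked linear "layers" equal one:

import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(3, 4, 4))  # three layers with linear activation
x = rng.normal(size=4)

deep = W3 @ (W2 @ (W1 @ x))  # stacked linear layers...
W_combined = W3 @ W2 @ W1    # ...equal a single linear layer
print(np.allclose(deep, W_combined @ x))  # True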

Why use non-linear activation functions?

— activation functions:
1. allow backpropagation, because now the derivative function is related to the input, and it's possible to go back and understand which weights in the input neurons can provide a better prediction.
2. allow the stacking of multiple layers of neurons, as the output is now a non-linear combination of the input passed through multiple layers; the output can be represented as a functional computation in a neural network.

#models what are Energy-Based Models?

— is a form of generative model (GM) imported directly from statistical physics to learning. —s provide a unified framework for many probabilistic and non-probabilistic approaches to such learning, and are used in training graphical and other structured models.
—s capture dependencies by associating an unnormalized probability scalar (energy) to each configuration of the combination of observed and latent variables. Inference consists of finding (values of) latent variables that minimize the energy given a set of (values of) the observed variables. Similarly, the model learns a function that associates low energies to correct values of the latent variables, and higher energies to incorrect values.
Traditional —s rely on stochastic gradient-descent (SGD) optimization methods that are typically hard to apply to high-dimension datasets. —s do not require that energies be normalized as probabilities. In other words, energies do not need to sum to 1. Since there is no need to estimate the normalization constant like probabilistic models do, certain forms of inference and learning with —s are more tractable and flexible. Samples are generated implicitly via a Markov chain Monte Carlo approach.

