Generative AI 1: "Only" Basics - Basic Definitions, Models, Parts, Architecture, Activation Functions [need to move Optimization Algos to own sets in Quizlet +- I made MoMs their own set in Quizlet]


What is skip connection architecture? When do you use it?

A skip (residual) connection routes a layer's input around one or more layers and adds it to a deeper layer's output, as in ResNets. Use it for building deep neural networks: the shortcut keeps gradients flowing and mitigates the vanishing-gradient problem.

What does MRKL stand for, and why is this model efficient AND effective? [AI21 Labs' "MRKL"]

"Modular Reasoning, Knowledge and Language"
Why efficient AND effective:
- combines AI modules in a pragmatic plug-and-play fashion
- switches back and forth between structured knowledge, symbolic methods, and neural models
- splits the workload of understanding the task, executing the computation, and formulating the output result between different models

Name three ways to implement sparse MoE models, and define them. COME BACK TO THIS ONE

- k-means clustering: route each token to the expert whose centroid is nearest in the embedding space
- linear assignment: solve an assignment problem that maximizes token-expert affinities while keeping each expert's load balanced
- hashing: route each token to an expert via a fixed hash of the token, so no routing network needs to be learned

What are "Faithful Reasoning Frameworks" (for Question Answering) and when is it used in generative AI?

- the user first provides one or more examples of the reasoning process as part of the prompt
- the LLM "imitates" this reasoning process with new inputs

#basics What are the "Types" of Generative AI models?

1. Generative Adversarial Networks (GANs): a generator network creates samples while a discriminator network tries to tell them apart from real data; the two are trained against each other.
2. Variational Autoencoders (VAEs): an encoder compresses data into a latent distribution and a decoder reconstructs from it; sampling the latent space generates new data.
3. Autoregressive Models: generate data one element at a time, conditioning each element on the previously generated ones.
4. Recurrent Neural Networks (RNNs): RNNs are a type of neural network that processes sequential data, such as natural language sentences or time-series data. They can be used for generative tasks by predicting the next element in the sequence given the previous elements. However, RNNs are limited in generating long sequences due to the vanishing gradient problem. More advanced variants of RNNs, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been developed to address this limitation.
5. Transformer-based Models: Transformers, like the GPT series, have gained significant popularity in natural language processing and generative tasks. They use attention mechanisms to model the relationships between different elements in a sequence effectively. Transformers are parallelizable and can handle long sequences, making them well-suited for generating coherent and contextually relevant text.
6. Reinforcement Learning for Generative Tasks: Reinforcement learning can also be applied to generative tasks. In this setup, an agent learns to generate data by interacting with an environment and receiving rewards or feedback based on the quality of the generated samples. This approach has been used in areas like text generation, where reinforcement learning helps fine-tune generated text based on user feedback.
These are just some of the types of generative AI models; ongoing research and development in this field keeps producing new and more advanced generative models over time.

what is a Discriminative Deep Learning Model?

1. A class of models used in statistical classification, mainly for supervised machine learning.
2. Also known as conditional models, since they learn the boundaries between classes or labels in a dataset.

What is "in-context learning" and why is it important to developing models for enterprise generative AI?

DEFINITION
Massive neural network models - similar to large language models - can contain smaller linear models inside their hidden layers, which the large model can "train" to complete a new task using simple learning algorithms.
WHY IT MATTERS FOR ENTERPRISE
In-context learning lets a model pick up a new task from examples supplied in the prompt, with no retraining or fine-tuning of the model's weights.

What does "temperature" mean with respect to parameters?

A low ___ via your parameters produces repetitive, near-deterministic responses; increasing the temperature results in more unexpected or creative responses to queries of the LLM.
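
A minimal sketch of how temperature reshapes the next-token distribution before sampling (toy logits, not tied to any particular LLM):

import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng(0)):
    # Scale logits by temperature, softmax, then sample one token id.
    scaled = np.asarray(logits) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.1))  # almost always 0 (deterministic)
print(sample_with_temperature(logits, temperature=2.0))  # more varied / "creative"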

#MLOps #MLSystems. Continual learning means what?

Always streaming, not batch uploads, of data to the models to generate features.

How many layers to add into a neural network?

As Yoshua Bengio, head of the Montreal Institute for Learning Algorithms, remarks: "Very simple. Just keep adding layers until the test error does not improve anymore." A method recommended by Geoff Hinton is to add layers until you start to overfit your training set, then add dropout or another regularization method.

#optimization #training In what way can a dimensionality reduction algo help you?

Before you feed your data into another model, you can use it to remove redundant features (dimensions), which cuts training time and compute and can improve performance overall.

What is BERT?

Bidirectional Encoder Representations from Transformers

What is Hinton still excited about - the most "beautiful" model he's made (with collaborators)?

Boltzmann machines (partially still in use as Restricted Boltzmann Machines)
DEFINITION
- Big, densely connected
- A model where at the top you had a restricted Boltzmann machine, but below that you had a sigmoid belief net - something invented many years earlier
OUTCOME
- It would learn hidden representations; each artificial synapse only needed to know about the behavior of the two neurons it was directly connected to - and it looked like the kind of thing you should be able to get in a human brain
- What we managed to show was a way of learning these deep belief nets so that there's an approximate form of inference that's very fast - it just takes a single forward pass ... that was a very beautiful result
- And you could guarantee that each time you learned an extra layer of features, you always improved the model
HOW IT WORKED
- The restricted version has just one layer of hidden features
- Since you are learning features each time you use it, you can treat those outputs as data and do it again, and again, as many times as you like
- We realized that the whole thing could be treated as a single model, but it was a weird kind of model

What Is a Deep Belief Network? What is the difference between deep belief and deep neural networks?

DEFINITION
Deep belief networks (DBNs) are a type of deep learning algorithm that addresses problems associated with classic neural networks. They do this by stacking layers of stochastic latent variables. These latent variables - feature detectors, or hidden units - are binary, and they are called stochastic because they can take on any value within a specific range with some probability.
DIFFERENCES
Deep belief networks differ from deep neural networks in that they make connections between layers that are undirected (not pre-determined), thus varying in topology by definition.

Mode Collapse: What is it?

DEFINITION
Because it cannot learn patterns effectively from the data (often due to a lack of data normalization), the model generates a limited variety of outputs, ignoring the diversity present in the training data. (Normalization can help alleviate this by ensuring that the model doesn't get stuck generating only one type of output.)

#sparseMOEs. What is K-means clustering? When do you use it with Switch Transformers?

DEFINITION
K-means clustering aims to partition observations into k clusters, where each observation belongs to the cluster with the nearest mean.
WHEN TO USE WITH SWITCH
Switch transformers for text generation use discrete latent variables z to control attributes like sentiment, building a "codebook".
ALGO
- Initialize k cluster centers μ1, μ2, ..., μk
- Repeat until convergence:
-- For each data point xi, assign it to the cluster j that minimizes the distance between xi and μj: argmin_j ||xi - μj||^2
-- Update each cluster center μj as the mean of all data points assigned to cluster j: μj = (1/Nj) Σ_{xi assigned to j} xi, where Nj is the number of data points assigned to cluster j
HOW IT IS USED
K-means can cluster the latent space into K codebook vectors c1, ..., cK. Vector Quantization (VQ) maps each z to its nearest codebook vector c(z). This allows gradient flow through discrete variables during training. At inference, sampling codebook vectors generates diverse, controllable text.
WHY USEFUL IN GENERATIVE AI
- helps switch transformers effectively model and manipulate discrete latent variables for generative tasks like image generation
REFERENCES
Esser, Patrick, et al. "Taming Transformers for High-Resolution Image Synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
Keskar, Nitish Shirish, et al. "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv preprint arXiv:1909.05858, 2019.
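
A minimal NumPy sketch of the ALGO above (toy 2-D data standing in for a latent space; assumes no cluster goes empty, which holds for this toy data):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Assign each point to its nearest center, then recompute each center
    # as the mean of its assigned points; repeat until centers stop moving.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                  # argmin_j ||xi - mu_j||
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.1, (50, 2)), rng.normal([3, 3], 0.1, (50, 2))])
codebook, assignments = kmeans(X, k=2)   # the centroids act as a 2-entry "codebook"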

#algo What is the Viterbi Algorithm? Of which of the "most used principles of programming" does this algo use? How does it use it? When do you use it?

DEFINITION
A dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states - called the Viterbi path - that results in a sequence of observed events.
MOST USED PRINCIPLE
It is an example of dynamic programming.
HOW
It uses this principle to compute the most likely sequence of states efficiently, reusing the best sub-paths already computed at each step instead of re-scoring every full path.
WHEN TO USE IN GEN AI
Used for finding the most likely sequence of states in Hidden Markov Models (HMMs) for tasks like:
- speech recognition (finding the most likely string of text given the acoustic signal)
- part-of-speech tagging (same method)
EXAMPLE
Take a simple HMM for part-of-speech tagging, where we want the most likely sequence of tags for the sentence "He fishes". Our states might be "Noun" and "Verb", and our observations are the words "He" and "fishes".
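
A minimal NumPy sketch of Viterbi on the "He fishes" example; the start, transition, and emission probabilities below are illustrative assumptions, not values from the card:

import numpy as np

states = ["Noun", "Verb"]
start = np.log([0.6, 0.4])
trans = np.log([[0.3, 0.7],    # P(next state | Noun)
                [0.8, 0.2]])   # P(next state | Verb)
emit = {"He":     np.log([0.9, 0.1]),   # P(word | Noun), P(word | Verb)
        "fishes": np.log([0.4, 0.6])}

def viterbi(words):
    # Dynamic programming: best log-prob of any path ending in each state,
    # plus backpointers to recover the Viterbi path.
    v = start + emit[words[0]]
    back = []
    for w in words[1:]:
        scores = v[:, None] + trans           # rows = previous state
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + emit[w]
    path = [int(v.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["He", "fishes"]))  # -> ['Noun', 'Verb']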

#SentimentAnalysis What is sentiment analysis? WHY Do you need it in enterprise AI? How do you "do" it?

DEFINITION
An automated process of tagging data according to sentiment, such as positive, negative, and neutral.
WHY YOU CARE FOR ENTERPRISE GenAI
Sentiment analysis allows companies to analyze data at scale, detect insights, and automate processes.
HOW TO DO
- Text: use NLP

#optimization What is ridge regression and when do you use it?

DEFINITION
A linear regression technique that includes an L2 regularization term to prevent overfitting.
HOW IT WORKS
The regularization term is proportional to the squared L2 norm of the model's coefficients; the aim is to minimize the cost function
J(w) = MSE(w) + α ||w||^2
where w is the model's coefficient vector, MSE(w) is the mean squared error, and α is the regularization parameter that controls the strength of regularization.
The derivative of the cost function with respect to w is used in gradient descent to update the model's coefficients, and the derivative of the squared L2 norm term is particularly convenient: it is simply 2αw. The factor of 2 simplifies the computation, and this derivative is added directly to the gradient of the mean squared error term when updating the coefficients during gradient descent.
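
A minimal NumPy sketch of ridge regression by gradient descent, showing the convenient 2αw gradient of the regularization term (toy data and illustrative hyperparameters):

import numpy as np

def ridge_gd(X, y, alpha=0.1, lr=0.01, n_iters=1000):
    # Minimize MSE(w) + alpha * ||w||^2; the L2 term contributes 2*alpha*w.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad_mse = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * (grad_mse + 2.0 * alpha * w)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.5 * X[:, 0] + 0.1 * rng.normal(size=100)
print(ridge_gd(X, y))   # coefficient shrunk slightly toward zero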

#models #evaluation what is an ROC curve and when is it used? Give an example in sentiment analysis enterprise Generative AI

DEFINITION
A Receiver Operating Characteristic (ROC) curve plots a classifier's true positive rate against its false positive rate across decision thresholds; the area under the curve (AUC) summarizes how well the model separates the classes.
WHEN USED
For evaluating classification models.
EXAMPLE
For an enterprise sentiment classifier labeling reviews as positive vs. negative, sweep the decision threshold over the model's scores and plot TPR vs. FPR; an AUC near 1.0 means the model ranks positive reviews above negative ones almost every time.

What is a SequentialString Layer? When do you use it? How can you optimize/customize it?

DEFINITION
____ has a TensorFlow default setting that creates an internal dictionary mapping each word to a unique integer value:
# EXAMPLE:
# the --> 0
# fat --> 1
# cat --> 2
# in --> 3
# your --> 4
# hat --> 5
WHEN TO USE
____ is a common first layer in a sequential model.
HOW TO OPTIMIZE/CUSTOMIZE IT
- use the other output modes - one_hot, multi_hot, count, or tf_idf - which create other representations of your tokens/words
EXAMPLE
If you use the one_hot output mode, you can train your model directly on these representations without an Embedding layer.
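
"SequentialString" is not a public TensorFlow symbol; the behavior described here (a word-to-integer dictionary plus one_hot/multi_hot/count/tf_idf output modes) matches tf.keras.layers.StringLookup, so this minimal sketch assumes that layer:

import tensorflow as tf

lookup = tf.keras.layers.StringLookup()          # default output_mode="int"
lookup.adapt(["the", "fat", "cat", "in", "your", "hat"])
print(lookup(["the", "cat", "in", "your", "hat"]))  # one integer id per word
# (StringLookup reserves id 0 for unknown words by default.)

# Other output modes build other token representations:
onehot = tf.keras.layers.StringLookup(
    vocabulary=lookup.get_vocabulary(include_special_tokens=False),
    output_mode="one_hot")
print(onehot(["cat"]))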

What type of model is AlphaFold2 and what does it do?

DeepMind, in London, advanced the understanding of proteins, the building blocks of life, using a transformer called _____

#Model #architecture What is the best choice for an enterprise generative AI model that is tasked with classifying multi-modal sentiment analysis into positive, neutral, and negative?

FROM THE TOP
Various architectures are available, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformers. Each architecture has its strengths and limitations, so choose the one that best suits your objective and dataset.

For Loops and While Loops- why avoid those when training neural nets using Numpy/Python? What are ways to avoid them?

For loops and while loops are not efficient in Python/NumPy because they run in the Python interpreter: each pass through the loop is dispatched one operation at a time rather than running as compiled machine code. This is time-consuming, especially for large datasets.
How to avoid:
1. Use vectorized operations - functions that operate on entire arrays at once.
2. Use NumPy's built-in functions to perform common operations on arrays, such as sorting, filtering, and aggregating.
3. Compose NumPy's built-in functions into custom functions that perform complex operations on whole arrays.
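
A minimal sketch contrasting an interpreted Python loop with the equivalent vectorized call (timings will vary by machine):

import numpy as np
import time

x = np.random.default_rng(0).normal(size=1_000_000)

# Interpreted loop: the interpreter dispatches every iteration one by one
t0 = time.perf_counter()
total = 0.0
for v in x:
    total += v * v
t_loop = time.perf_counter() - t0

# Vectorized: a single call that runs in optimized C inside NumPy
t0 = time.perf_counter()
total_vec = float(np.sum(x * x))
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.4f}s  same result: {np.isclose(total, total_vec)}")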

TRAINING With an Encoder/Decoder (or Transformer/Foundational) Model: What is this code and what is it used for?

def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

For training this model to produce text, you need a dataset of "input + label" pairs: at each timestep the input is the current character and the label (target) is the next character. This code tells the machine that the input always comes first and the label always comes second, for each and every timestep.
1. The function takes the sequence as input, duplicates it, and shifts it to align the input and label for each timestep.
2. The input is everything except the last character.
3. The target is everything except the first character - the sequence shifted one step forward.
4. Both are returned in the required (input, target) form.
EXAMPLE
To see how this all comes together, run:
split_input_target(list("Tensorflow"))
OUTPUT
(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'], ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])
Just like you required.

What are gaussian processes and why use them with deep learning?

WHAT THEY ARE
A Gaussian process defines a probability distribution over functions: any finite set of points has a jointly Gaussian distribution. (Note: the clustering benefits listed below are usually attributed to Gaussian mixture models, which model data as a mixture of Gaussians.)
WHY USE WITH DEEP LEARNING?
Effectiveness:
- Estimates the probability of data points
- Provides more contextual information than other clustering algorithms
- Fits better than k-means: the clustering mimics the data cloud better and with a smaller k
Versatility:
- Can model different data types and distributions, including data with multiple peaks or modes and non-spherical clusters
- Works well with non-linear datasets: doesn't assume clusters have any particular geometry
- Automatically learns subpopulations: doesn't require knowing which subpopulation a data point belongs to
Efficiency:
- Can find clusters of Gaussians more efficiently than other clustering algorithms, such as k-means

Forward propagation: Explain it and talk about how it relates to backpropagation

Forward propagation is the forward pass: input data flows through the network layer by layer to produce an output, which the cost function compares to the target. Gradients are the key link to backpropagation.
RELATIONSHIP TO BACKPROPAGATION
Backpropagation aims to minimize the cost function by adjusting the network's weights and biases; the level of adjustment is determined by the gradients of the cost function with respect to those parameters. After each forward pass, backpropagation performs a backward pass, computing those gradients from the output layer back toward the input.

TRAINING What is a batch size? What is the deal with batch sizes, how do you set them?

DEFINITION
Batch size is the number of training examples processed before the model's parameters are updated.
HOW TO SET
The learning rate and batch size are closely linked - small batch sizes perform best with smaller learning rates, while large batch sizes do best on larger learning rates.
EXAMPLE
A common heuristic (the linear scaling rule) is to scale the learning rate by the same factor as the batch size: if you double the batch size, double the learning rate as a starting point.

#training What is a Hamiltonian Monte Carlo (HMC) method and how is it different from a Markov chain Monte Carlo (MCMC) method

Hamiltonian Monte Carlo (HMC) is a type of Markov chain Monte Carlo (MCMC) method. The key differences between HMC and general MCMC for training energy-based models:
- In HMC, samples are generated by simulating the dynamics of a particle moving across the parameter space following Hamilton's equations. This avoids the random-walk behavior and correlations of basic MCMC.
- The energy function is augmented with a kinetic energy term based on auxiliary momentum variables. The negative log-probability density acts as a potential energy.
- Each HMC iteration simulates the particle dynamics for multiple leapfrog steps and accepts the end state with a Metropolis correction. This enables more efficient exploration.
- The dynamics are simulated via alternating gradient updates on the position (samples) and momentum. The gradient information helps the chain mix faster.
- HMC requires tuning the leapfrog step size and the number of steps per sample to maintain a reasonable acceptance rate.
In summary, HMC introduces momentum variables and simulates Hamiltonian dynamics across the sample space. It exploits gradient information and mechanisms like the Metropolis correction to improve mixing and convergence compared to basic MCMC, enabling more effective maximum-likelihood training of energy-based models.
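
A minimal NumPy sketch of one HMC transition - momentum sampling, leapfrog steps, Metropolis correction - targeting a standard Gaussian purely for illustration:

import numpy as np

def hmc_step(q, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20,
             rng=np.random.default_rng(0)):
    # One HMC transition: sample momentum, simulate Hamiltonian dynamics
    # with leapfrog steps, then apply the Metropolis correction.
    p = rng.normal(size=q.shape)                      # auxiliary momentum
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # half step on momentum
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new                    # full step on position
        p_new += step_size * grad_log_prob(q_new)     # full step on momentum
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)   # final half step
    # Total energy H = potential (-log p) + kinetic (p.p / 2)
    h_old = -log_prob(q) + 0.5 * p @ p
    h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
    return q_new if rng.random() < np.exp(h_old - h_new) else q

# Usage: sample a standard 2-D Gaussian, where log p(q) = -q.q/2
q = np.zeros(2)
samples = []
for _ in range(1000):
    q = hmc_step(q, lambda x: -0.5 * x @ x, lambda x: -x)
    samples.append(q)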

Why beat up your model, early? How do you do this?

If you jump into the role of a mean, adversarial user and stress-test your models early to explore their weak points, you can fix them before too much skin is in the game.
HOW TO DO
Red-team the model: probe it with edge cases, ambiguous prompts, and deliberately adversarial inputs, and log where it fails so those weaknesses can be fixed before release.

Attention layer, what is it good for when you are trying to generate text captions for images? What "is" the attention layer - what does it actually calculate from the encoder's outputs?

In image captioning, the attention layer helps the algorithm focus on the most relevant part of the image when generating each word of the output sequence.
WHAT IS IT?
The attention layer is a weighted sum of the encoder's outputs, with the weights computed from how relevant each encoder output is to the current decoding step.
MORE HERE
https://towardsdatascience.com/image-captions-with-attention-in-tensorflow-step-by-step-927dad3569fa

What type of model is the SWITCH transformer? What is the reason the SWITCH transformer model is less costly to operate? And why are these types better overall? What are other examples of this type?

It is a mixture-of-experts (MoE) model - a type of conditional computation where parts of the network are activated on a per-example basis, which dramatically increases model capacity without a proportional increase in computation.
HOW IT WORKS
A subset of experts is selected on a per-token or per-example basis, creating sparsity in the network.
WHY BETTER
Such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting.
DRAWBACKS
A poor expert routing strategy can cause certain experts to be under-trained, leaving an expert under- or over-specialized.
OPTIMIZATIONS DONE
This sparsity in the network enables them to be trained, for the first time, with lower-precision (bfloat16) formats; the models are designed off T5-Base and T5-Large (Raffel et al., 2019).
RESULTS OF OPTIMIZATION
Up to 7x increases in pre-training speed with the same computational resources.
OTHER EXAMPLES
GLaM and V-MoE.

What is transfer learning? Why is this important to building Enterprise GenAI?

It is taking a pre-trained model and adapting it to do a different task. It matters for Enterprise GenAI because it reuses representations the model already learned, so you get strong task performance with far less data and compute than training from scratch.

MASKING - what is it and why do you need it? What is the right amount?

It is what you do to help train the model: you remove SOME (see below) of the words/tokens in NLP so the model must predict them, and you examine how well it predicts each missing token.
Remove 15%.
Remove too little: the model is harder to train - each pass gives few prediction targets, so there is little training signal.
Remove too much: you remove so much context that there is not enough left to train the model.

What does "playground" mean in AWS sagemaker?

It means "here you can enter prompts for models"

What type of model is the Switch Transformer?

It is the first trillion-parameter model, and here is why it is special: it uses AI sparsity - a complex mixture-of-experts (MoE) architecture - and other advances to drive performance gains in language processing and up to 7x increases in pre-training speed.

Support Vector Machines - what are they?

It's a class of algorithms that gained popularity for their ability to handle complex data distributions and high-dimensional spaces.
HOW USED IN ML
- classification
- regression tasks
- text categorization
- bioinformatics
DEFINITION
SVMs find a hyperplane that best separates the data points of different classes while maximizing the margin - the distance between the hyperplane and the nearest data points of each class. The data points closest to the hyperplane that define its position are called support vectors (see image attached).
WHEN TO USE
- Useful for both linearly separable (hard margin) and non-linearly separable (soft margin) data.
- Effective in high-dimensional spaces.
- Effective when the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (the support vectors), so it is also memory efficient.
WHEN NOT TO USE
- Resource-intensive: picking the right kernel and parameters can be computationally costly.
- It doesn't perform well when the dataset is noisy, i.e., when the target classes overlap.
- SVMs don't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
https://pub.towardsai.net/support-vector-machine-svm-a-visual-simple-explanation-part-1-a7efa96444f2

#sentimentAnalysis What are the three types of sentiment analysis algorithms?

- Knowledge-based
- Statistical
- Hybrid

What is the use case for Symbolic AI for government clients?

Legal definitions and codes of conduct for governments ARE set in stone. Setting this enterprise knowledge in stone is an efficient approach to increase precision, because it lets you control the behaviour of the LLM where that control is crucial for your client, while still unfolding its power at generating language based on wide external knowledge.

Why give away LLMs via open source? What is the reason?

Machine learning's environmental impact is reduced when large pre-trained language models are shared: it cuts the overall compute cost and carbon footprint of community-driven efforts.

What is the chain rule of differentials and what is the more common term used?

More common term: backpropagation.
HIGH LEVEL
After each forward pass through a network, backpropagation performs a backward pass while adjusting the model's parameters (weights and biases).
HOW DOES THE ALGORITHM KNOW HOW MUCH TO ADJUST A PARAMETER?
The level of adjustment is determined by the gradients of the cost function with respect to those parameters.
WHY COMPUTE GRADIENTS?
Revisit some calculus terminology: the gradient of a function C(x_1, x_2, ..., x_m) at a point x is the vector of the partial derivatives of C at x:
∇C(x) = [∂C/∂x_1, ∂C/∂x_2, ..., ∂C/∂x_m]
The derivative of a function C measures the sensitivity of the function value (output) to a change in its argument x (input). In other words, the derivative tells us the direction C is going. The gradient shows how much the parameter x needs to change (in the positive or negative direction) to minimize C.

IMPORTANT!! What are encoder-decoder models more properly called after 2017? What are these models? Why are they better than others? Where did they start their use? Why are these models important to Generative AI?

More properly called foundation or transformer models.
DEFINITION
____ are neural networks that learn context, and thus meaning, by tracking relationships in sequential data. ____ models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways even distant data elements in a series influence and depend on each other.
OTHER NAMES USED
Transformers, Foundation Models.
WHY IMPORTANT TO GENERATIVE AI
1. They have replaced CNNs and RNNs because they are more effective (70 percent of arXiv papers on AI posted in the last two years mention ____s).
2. They are more efficient (time-wise): they can take advantage of parallelism with GPUs.
3. Stanford researchers called ____ "foundation models" in an August 2021 paper because they see them driving a paradigm shift in AI: the "sheer scale and scope of foundation models over the last few years have stretched our imagination of what is possible".
4. First described in a 2017 paper from Google, ____ are among the newest and most powerful classes of models invented to date; they're driving a wave of advances in machine learning some have dubbed "Transformer AI".
$$$$ SAVED - UNSUPERVISED = OK NOW
Before transformers arrived, users had to train neural networks with large, labeled datasets that were costly and time-consuming to produce; by finding patterns between elements mathematically, ____ eliminate that need, making available the trillions of images and petabytes of text data on the web and in corporate databases.
$$$$ SAVED - PARALLEL PROCESSING = OK NOW
The math that ____ use lends itself to parallel processing, so these models can run fast.
KOOL TO KNOW
Attention was so key to transformers that the Google researchers almost used the term as the name for their 2017 model. "Attention Net didn't sound very exciting," said Vaswani, who started working with neural nets in 2011. It was Jakob Uszkoreit, a senior software engineer on the team, who came up with the name Transformer.

How do Encoder-Decoder/Transformer/Foundational Models work, in three steps?

Most neural networks are large encoder/decoder blocks that process data; ____ are like that, with improvements via small strategic additions to these blocks (see diagram) that make them uniquely powerful, allowing computers to "see" the same patterns humans see:
1. Add positional encoders to tag data elements coming in and out of the network.
2. "Attention units" then follow these tags, calculating a kind of algebraic map of how each element relates to the others. (Attention queries are typically executed in parallel by calculating a matrix of equations, in what's called multi-headed attention.)
3. The decoder generates output data from the information extracted by the encoder, predicting the next "token".

Can LLMs make images?

Only if they are married up to image encoders and decoders per the nerds at Carnegie Mellon University

Parameters and Hyperparameters, what are they? COME BACK AND FINISH THIS ONE

Parameters (learned from the data during training):
- Weights
- Biases
Hyperparameters (set by you before/around training):
- Initialization technique (random, etc.)
- Learning rate
- Dropout rate
- Data augmentation (for images, esp.)
- Batch normalization settings
- Weight decay
- Number of layers (like 6 encoders) [FYI - you usually cannot just add another in an RNN]
- Types of layers
- Nodes in a layer
- Optimization algorithm (you can use a different one)
- Autotuning of the above, e.g., GridSearchCV

What is 1. prompt design and 2. engineering? How about 3. prompt tuning? Why do you care about this in Enterprise generative AI v. Fine-tuning?

Real simple:
1. ____ is how you construct your query to an LLM (or Mix of Minds, MoMs).
2. ____ is crafting prompts by hand - prompts designed by human engineers are also called "hard" prompts.
3. ____ ____: first, no change of weights or retraining of the model, so it is quick; second, it uses "soft prompts" generated by a small set of learnable parameters.
Examples:
- the prompts can be extra words introduced by humans
- or the prompts can be AI-generated numbers that guide the model towards a desired decision or prediction
WHY THIS MATTERS - SAVES $$$
Panda and his colleagues show in this paper that their Multi-Task Prompt Tuning (MPT) method outperformed other methods, and even did better than models fine-tuned on task-specific data; instead of spending thousands of dollars to retrain a 2-billion-parameter model for a specialized task, MPT lets you customize the model for less than $100.
V. FINE-TUNING? BURNS $$$
Fine-tuning is adapting a pre-trained AI model to perform better on specific tasks, domains, or applications by training it on a smaller, specialized dataset (one that reflects the nuances of the user's target domain or task), allowing the model to learn the patterns, terminology, and context unique to that use case. Fine-tuning requires more computational resources and time than prompt tuning, as it involves retraining the model and adjusting its parameters.

#multimodal Alignment in multimodal learning means what?

Refers to the task of identifying direct relationships between different modalities. Current research in multimodal learning aims to create modality-invariant representations. This means that when different modalities refer to a similar semantic concept, their representations must be similar/close together in a latent space. For example, the sentence "she dived into the pool", an image of a pool, and the audio signal of a splash sound should lie close together in a manifold of the representation space.

What is a text embedding model? how do you make one? why use it for an enterprise gen AI tool?

DEFINITION
These models map text to dense numeric vectors (embeddings) that capture semantic meaning, so similar texts get similar vectors.
HOW TO MAKE
You will need to precompute embeddings for your documents and store them in a managed vector database.
WHY USE FOR ENTERPRISE GenAI
You can use text embeddings as part of your retrieval pipeline: embed the user's question at query time, find the most similar stored vectors, and hand that text to the LLM (see the RAG and vector database cards).

What is RLHF? How does it work? Why is it important for Enterprise AI v. other methods? What is some next level shit that others do that is useful to do and how do you recreate it?

Reinforcement Learning from Human Feedback is the part of training a model to respond in ways that humans prefer in the given context.
HOW IT WORKS
1. During the annotation process, humans are presented with prompts and either write the desired response or rank a series of existing responses.
2. These human preferences are directly encoded in the training data.
3. This "redirects" the learning process of the LLM from the straightforward but artificial next-token prediction task towards learning what humans prefer in a given communicative situation.
WHY THIS IS IMPORTANT FOR ENTERPRISE GenAI
This optimizes the model (or Mix of Minds, multiple models) to reflect the human preferences of that enterprise, vs. self-supervision - pre-training LLMs using next-token prediction without humans helping.
NEXT LEVEL SHIT
OpenAI's data for ChatGPT also includes human-written responses to prompts that are used to fine-tune the initial LLM.
HOW TO RE-CREATE THAT?
A. You can use simple ratings, simple thumbs up/down labels, or even numbers of upvotes to build out a ranking dataset, by taking any internal message boards and using them like ranking datasets. [OpenAI uses Reddit or Stack Overflow conversations where answers to questions are rated by users like this.]
B. You can also allow users to send feedback on what they wanted as answers.

What is RAG?

Retrieval Augmented Generation, also referred to as "grounding the model": you separate the knowledge base from the tool, then pull from the knowledge base "in real time" to get answers (but do not use the knowledge base to train the tool).

#TENSORFLOW What are checkpoints? Why do you need them? How do you use them in TF?

SPECIFIC TO TF
GIVEN: the persistent state of a TensorFlow model is stored in tf.Variable objects, and the easiest way to manage variables is by attaching them to Python objects, then referencing those objects.
DEFINITION
A ____ is an intermediate dump of a model's entire internal state (its weights, current learning rate, etc.) so that the framework can resume the training from this point whenever desired.
WHY TO HAVE
If you get interrupted while training a model, ____s enable you to pick up training exactly where you left off.
FYI: IMPORTANT
Since ____s do not contain any description of the computation defined by the model, they are only useful when the source code (that will use these saved parameter values) is available.
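
A minimal sketch using tf.train.Checkpoint and tf.train.CheckpointManager (toy model and directory path):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="./ckpts", max_to_keep=3)

# ... inside/after your training loop:
manager.save()                          # dump the model's entire internal state

# Later, to resume training exactly where you left off
# (requires the same source code that built `model`):
ckpt.restore(manager.latest_checkpoint)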

#datasets. Stanford Question Answering Dataset- why is this important to include?

SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, ML systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

#optimizations. What is Lasso Regularization? Why use it?

STANDS FOR
Least Absolute Shrinkage and Selection Operator (LASSO) Regularization.
DEFINITION
The L1 norm of the coefficients of a linear model is added to the cost function being minimized.
HOW IT WORKS
It encourages the model to automatically select a subset of important features while driving some coefficients to exactly zero; this property helps simplify the model and enhance its interpretability.
OUTCOMES EXPECTED
Without this regularization, the model might assign non-zero coefficients to all features, including those that contribute little to the prediction; this can lead to overfitting and make the model difficult to interpret.
WHEN TO USE
You have a large dataset with many potential features, and you want a model that not only predicts well but also provides insight into which features matter most (e.g., for price prediction).

What is a sigmoid belief net? What makes it better or worse than other things to use in enterprise generative AI?

The attached image is a graphical model, not a neural net. A sigmoid belief network is a directed graphical model of binary variables in fully connected layers. In a sigmoid belief network, only the bottom layer is observed.
CHARACTERISTICS
A sigmoid belief network can be faster than a Boltzmann machine. It also has advantages over Boltzmann machines in pattern classification.

#training what are MCMC-free approaches and why use them?

- Score Matching (SM)
- Noise Contrastive Estimation (NCE)
WHY USE THEM
They train energy-based models without the expensive MCMC sampling that maximum-likelihood training normally requires.

What are the ways you can think about "stepping" into creating a model for you based on an open-source one?

See image.

What is seed for the foundational model Diffusion?

See image.

What is a densely connected neural network?

See image - the network is densely connected: every neuron in the input layer is connected to every neuron in the hidden layer.

What is a directed graphical model? Why do you care about this?

See attached A directed graphical model (DGM) is a probabilistic model that uses a graph to represent the conditional dependence structure between random variables. In a DGM, each node in the graph represents a variable, and each edge represents a direct probabilistic interaction between two variables. The probability of a set of random variables in a DGM factors into a product of conditional probabilities, one for each node in the graph.

What is an encoder-decoder model? What parts are there?

See image.
ML ARCHITECTURE PARTS
- Parameters: number of learnable variables/values available for the model.
- Transformer Layers: number of Transformer blocks. A transformer block transforms a sequence of word representations into a sequence of contextualized words (numbered representations).
- Hidden Size: layers of mathematical functions, located between the input and output, that assign weights (to words) to produce a desired result.
- Attention Heads: the size of a Transformer block.
- Processing: type of processing unit used to train the model.
- Length of Training: time it took to train the model.

#scaling What does it mean to scale a model?

See image:
- Width
- Depth
- Resolution

#lossfunction How do you choose a loss function - without "overfitting" it? Use the example of the gradient descent function

See image for examples of fitting and overfitting.
- Line [a] has lower norms because it has significantly fewer parameters compared to [c].
- Line [b] has lower norms because, despite having the same number of parameters, they're all much smaller than [c].
GIVEN
When doing gradient descent we update our weights based on the derivative of the loss function. So if we've included a norm in our loss function, the derivative of the norm will determine how the weights get updated.
1. Choose the simpler functions.

Model of Training Effectiveness: How do you know when you overfit your training set? COME BACK TO THIS ONE

You overfit when training error keeps improving while validation/test error stops improving or starts getting worse - a growing gap between training and validation performance is the tell.

What is Teacher Forcing? What would happen if you failed to do it? How do you do it? What happens when you need "ground truth" - how do you get it?

Ordinarily, during inference when the model is fully trained, the output sequence from this timestep is used as the input sequence for the next timestep. This allows the model to predict the next word based on the previous words predicted so far.
However, if we did that while the model was still learning during training, any errors made by the model in predicting an output word in this timestep would be carried forward to the next timestep. The model would end up predicting the next word based on an erroneous previous word.
Instead, since we have the ground-truth captions available to us during training, we use a technique called Teacher Forcing: the correct expected word from the target caption is added to the input sequence for the next timestep, rather than the model's predicted word. In this way, we are helping the model by giving it a hint, so to speak, just like a teacher would.

What is the play with enterprise generative AI in terms of how it will interact with open source and proprietary models?

There will be an interconnected mix of models, providers, and business models. Large foundation models from the big providers will live alongside locally deployable open-source models from commercial and research organizations, alongside models specialized for particular domains or applications. The transparency, curation, and ownership of models (and tuning data) ...

1. Query 2. Key 3. Value What are these? What else are they called? How do these work?

They all get different weights.
HOW THEY ARE MADE
Each input token's embedding is multiplied by three learned weight matrices (W_Q, W_K, W_V) to produce its 1. query, 2. key, and 3. value vectors. (Note: token, segment, and position embeddings are the *input* embeddings, e.g. in BERT; they are not other names for query/key/value.)
INTUITIVELY
1. ____ represents what kind of information we are looking for,
2. ____ represents the relevance to the query, and
3. ____ represents the actual contents of the input - the "original" input, weighted by the "attention score" the model computes from 1. ____ and 2. ____.
ATTENTION SCORE - HOW TO MAKE
For each word, the model takes the dot product of its 1. ____ with every token's 2. ____ to get the similarity or "attention" weight among the different tokens (usually with matrix multiplication; in NLP the matrix is sized by the length of the "sequence", aka the sentence, so you have as many columns as you have words in the sentence, same with rows). Then those weights are applied to 3. ____ to give you the final attention output.
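
A minimal single-head NumPy sketch of the computation described above (toy sizes and random weight matrices):

import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key, and value; weight the
    # values by softmax(Q K^T / sqrt(d_k)).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax per query
    return weights @ V                                # attention output

# Usage: 4 tokens with 8-dim embeddings (toy sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (4, 8)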

#models What are Autoregressive models?

They generate data one element at a time, conditioning the generation of each element on previously generated elements: they predict the probability distribution of the next element given the context of the previous elements, then sample from that distribution to generate new data.
EXAMPLES
GPT (Generative Pre-trained Transformer), which can generate coherent and contextually appropriate text.

TENSORFLOW The phrase "Saving a TensorFlow model" typically means one of two things. What are they?

This means either Checkpoints OR SavedModel

Enterprise: Why do you separate your knowledge base from your LLM? What is the User's Workflow "Look Like? when you do this"

To ensure that users receive accurate answers, we need to do this to leverage the semantic understanding of our language model while also providing our users with the most relevant information; all of this happens in real time, and no model training is required.
USER WORKFLOW (see the sketch below)
1. User asks a question.
2. Application finds the most relevant text that (most likely) contains the answer.
3. A concise prompt with the relevant document text is sent to the LLM.
4. User receives an answer or a 'No answer found' response.
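
A minimal sketch of that workflow; embed() and llm() are hypothetical toy stand-ins for a real embedding model and LLM call, and the two-document "knowledge base" is purely illustrative:

import numpy as np

def embed(text):
    # Hypothetical toy embedding (hash-seeded noise); swap in a real model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

def llm(prompt):
    # Hypothetical stand-in for a real LLM call.
    return "[LLM answer grounded in the prompt's context]"

docs = ["Refunds are processed within 5 days.", "Support hours are 9-5 EST."]
index = [(d, embed(d)) for d in docs]            # precomputed knowledge base

def answer(question, k=1):
    q = embed(question)                                       # 1. user asks
    top = sorted(index, key=lambda p: -float(q @ p[1]))[:k]   # 2. find relevant text
    context = "\n".join(text for text, _ in top)
    prompt = ("Answer ONLY from the context; otherwise say 'No answer found'.\n"
              f"Context:\n{context}\nQuestion: {question}")
    return llm(prompt)                                        # 3-4. prompt -> answer

print(answer("How long do refunds take?"))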

#python #generativeAI What are Class Constructors? Why are they used?

At the heart of Python's object-oriented capabilities, you'll find the class keyword, which allows you to define custom classes that can have attributes for storing data and methods for providing behaviors. Once you have a class to work with, you can start creating new instances, or objects, of that class, which is an efficient way to reuse functionality in your code. Creating and initializing objects of a given class is a fundamental step in object-oriented programming, often referred to as object construction or instantiation. The tool responsible for running this instantiation process is commonly known as a ____ ____.
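
A minimal sketch of construction/instantiation; the Model class and its attributes are hypothetical examples:

class Model:
    # __init__ initializes the new instance's attributes right after
    # Python creates the object.
    def __init__(self, name, n_layers=2):
        self.name = name              # attributes store data
        self.n_layers = n_layers

    def describe(self):               # methods provide behaviors
        return f"{self.name} with {self.n_layers} layers"

m = Model("toy-transformer", n_layers=6)   # instantiation runs the constructor
print(m.describe())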

#multimodal #features What are multimodal features? What are the two categories/kinds?

Using more than one type or "mode" of data to generate features, e.g.: video + LiDAR + depth data creates the dataset for self-driving-car applications.
TWO KINDS
1. Joint representation
2. Coordinated representation

#vectors #vectordatabase How does a vector database work?

Vector databases usually use the Approximate Nearest Neighbor (ANN) algorithm to calculate the spatial distance between the query vector and vectors stored in the database. The closer the two vectors are located, the more relevant they are. Then the algorithm finds the top k nearest neighbors and delivers them to the user. https://zilliz.com/blog?tag=39&page=1&utm_source=thenewstack&utm_medium=website&utm_content=inline-mention&utm_campaign=platform

MOVE THIS ONE: #MLOps #MLSystems. Real-time monitoring means what?

We do not need batch monitoring solutions, but real-time monitoring: use something like Kafka or Kinesis to securely transport consumers' click streams from the applications, then use a stream-processing engine to continually compute accuracy of predictions, so that as soon as the model is deployed and traffic comes in, you can see how the model is performing.

Can you make a private ChatGPT?

Yes, here https://levelup.gitconnected.com/training-your-own-llm-using-privategpt-f36f0c4f01ec

What is "cosine similarity" and why is it important for enterprise generative AI?

__ __ measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between the two vectors and determines whether they are pointing in roughly the same direction.
WHY IMPORTANT TO GENERATIVE AI
It is often used to measure document similarity in text analysis, which can speed up computation when the LLM uses a separate knowledge base.
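
A minimal NumPy sketch (toy document-embedding vectors):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means same direction
    return float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = np.array([0.2, 0.9, 0.1])
doc_b = np.array([0.25, 0.85, 0.05])
print(cosine_similarity(doc_a, doc_b))  # close to 1.0 -> very similar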

What is an attention mechanism, and why do you care about this with GenAI?

An ____ ____ lets the model focus on specific parts of the input sequence. An encoder-decoder architecture usually passes the decoder a single hidden-state summary of the encoding, but when you add the ____ ____ you can feed the decoder more, telling it to focus on specific parts of the input sequence.
HOW IT WORKS (see image)
1. Calculating the attention weights: the ____ ____ assigns weights to different parts of the input sequence, with the most important parts receiving the highest weights.
2. Generating the context vector.
3. Passing many of those to the decoder.
WHY YOU CARE
It is an optimization on Seq2Seq models that produces better results.
USED FOR
- NLP
- Image generation

What is Masking in a TF model? When do you use it?

___ enables you to disregard certain parts of a tensor - typically those parts set to zero (padding) - when executing the forward pass of your neural network.
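
A minimal sketch, assuming tf.keras.layers.Masking with zero-padded sequences:

import tensorflow as tf

# A Masking layer tells downstream layers to skip timesteps whose
# features all equal mask_value (here, zero padding).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4, 2)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(8),
])

# Two sequences zero-padded to length 4; the padded steps are ignored
# in the forward pass.
x = tf.constant([[[1., 2.], [3., 4.], [0., 0.], [0., 0.]],
                 [[5., 6.], [0., 0.], [0., 0.], [0., 0.]]])
print(model(x).shape)  # (2, 8)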

MoE is what type of class of learning method? Name another ensemble learning method and tell me how it works "like" an MoE

___ is one of the ensemble learning methods.
Another one: stacked generalization, or stacking.
Both ___ and stacking attempt to learn from the output of other, lower-level models - or at least learn how best to combine those outputs - by:
- training a diverse ensemble of machine learning models
- then learning a higher-order model to best combine the predictions

Model Collapse: What are the causes?

___ is the degenerative process that affects generations of generative models.
CAUSES
- Too much generated data: generated data pollutes the training set of subsequent models, leading to a misperception of reality.
- Too much bad data: data poisoning, in broader terms - any factor that contributes to the creation of data that inaccurately reflects reality.

What is a MoE model and for what type of use? What are the differences between sparse and dense MoE models? Which ones are better as of 2023?

___ stands for Mixture of Experts; it was used for neural networks but can be used with any ML/AI model. MoE operates by adopting a number of experts, each a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s).
- Dense: all experts activated at every step.
- Sparse: only a subset of experts activated when routing each token.
WHY SPARSE
1. No loss of accuracy.
2. Reduced computational cost compared to a dense model.
2023 WINNER: SPARSE
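
A minimal NumPy sketch of sparse top-k gating; the expert and gate weights are random toy values, not any production router:

import numpy as np

def sparse_moe_layer(x, expert_weights, gate_w, k=2):
    # A gating network scores every expert, only the top-k experts run,
    # and their outputs are mixed by the re-normalized gate probabilities.
    gate_logits = x @ gate_w                       # one score per expert
    probs = np.exp(gate_logits - gate_logits.max())
    probs /= probs.sum()
    top_k = np.argsort(probs)[-k:]                 # route to the k best experts
    out = np.zeros_like(x @ expert_weights[0])
    for j in top_k:
        out += (probs[j] / probs[top_k].sum()) * (x @ expert_weights[j])
    return out

# Usage: 4 experts, an 8-dim token, routed to its top 2 experts
rng = np.random.default_rng(0)
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
gate = rng.normal(size=(8, 4))
token = rng.normal(size=8)
print(sparse_moe_layer(token, experts, gate).shape)  # (8,)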

Loss functions, what are they? Why do you care about them in decision support tools? What loss functions do you use for classification problems? Give an example of the steps in a loss function.

____ ____s help gauge how a machine learning model is performing with its given data, and how well it's able to predict an expected outcome. Many machine learning algorithms use ____ ____s in the optimization process during training to evaluate and improve output accuracy; minimizing a chosen ____ ____ during optimization helps determine the best model parameters for the given data.
FOR CLASSIFICATION
- Binary Cross-Entropy ____ / Log ____
- Hinge ____
FOR REGRESSION
- Mean Square Error / Quadratic Loss / L2 ____
- Mean Absolute Error / L1 ____
- Huber ____ / Smooth Mean Absolute Error
- Log-Cosh ____
- Quantile ____
EXAMPLE STEPS
You want to train a GAN to generate synthetic data points that mimic a linear relationship, y_synthetic = 2 * x_synthetic + 3 (see the sketch after this list):
1. The generated synthetic data point y_synthetic is compared to the ground-truth data points that follow the same linear relationship.
2. The ____ ____ quantifies the difference between the generated and real data points.
3. Then the generator's parameters are updated using backpropagation to minimize this loss.
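
The sketch promised above - a toy, hypothetical generator() stands in for the model being trained:

import numpy as np

def generator(x, w=1.5, b=2.0):
    # Hypothetical toy "generator": a linear model whose w and b would be
    # the trainable parameters updated by backpropagation.
    return w * x + b

x_synth = np.linspace(0.0, 1.0, 10)
y_real = 2 * x_synth + 3                  # ground-truth linear relationship
y_synth = generator(x_synth)              # step 1: generate synthetic points

loss = np.mean((y_synth - y_real) ** 2)   # step 2: quantify the difference
print(loss)
# step 3: the gradient of this loss w.r.t. w and b would drive the update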

What are embedding layers? how do you use them in Generative AI? Why do you care about this in enterprise AI?

An ____ ____ is a type of hidden layer in a neural network. Theoretically, every hidden layer can represent an ____: we can extract the output of any hidden layer and treat it as an embedding vector.
DEFINITION
- maps input information from a high-dimensional to a lower-dimensional space
$$$ WHY YOU CARE
It allows the network to learn more about the relationship between inputs and to process the data more efficiently.
IMPORTANT
The point of an ____ is not only to lower the input dimension but also to create a meaningful relationship between inputs; that is why particular types of neural networks are used solely to generate embeddings.
SPECIFIC ____s
Transformer-based models create contextual embeddings: the same word will most likely get a different embedding vector if it appears in a different context.

Weights are "what OpenAI needs to protect from theft" - weights, what are they? Why are they important? What is their relationship to bias?

A ____ is a parameter within a neural network that transforms input data within the network's hidden layers. They became very important when the attention mechanism was added in 2018...
DEFINITION
____ are the real values attached to each input/feature; they convey the importance of that feature in predicting the final output. Features with ____ close to zero are said to have less importance in the prediction than features with larger ____.
HOW THEY WORK
- ____ tell the importance of a feature in predicting the target value.
- ____ tell the relationship between a feature and the target value.
WHAT IS THEIR RELATIONSHIP TO BIAS?
____ and bias are both learnable parameters inside the network.
GIVEN: a trainable neural network randomizes both the ____ and bias values before learning initially begins.
GIVEN: as training continues, both parameters are adjusted toward the desired values and the correct output.
DIFFERENCE FROM BIAS
The two parameters differ in the extent of their influence on the input data.
- Bias represents how far off the predictions are from their intended value; biases make up the difference between the function's output and its intended output. (A low bias suggests the network is making more assumptions about the form of the output, whereas a high bias value makes fewer assumptions.)
- ____, on the other hand, can be thought of as the strength of a connection; a ____ affects the amount of influence a change in the input has on the output - a low ____ value changes the output little, while a larger ____ value changes it more significantly.

Normalization, why is doing this with data used in generativeAI?

____ prepares the data for the generative model's learning process by ensuring that the model focuses on learning meaningful patterns rather than getting confused by variations in scales and magnitudes.
1. Faster Convergence: normalizing the data brings all features to a common scale, making the optimization process more efficient. This allows the model to converge faster to a solution that accurately captures the data distribution.
2. Better Generalization Due to Better Ability to Learn Meaningful Patterns: normalization helps the model generalize well to new, unseen data. When the input data has consistent scales, the generative model can learn meaningful patterns across features and apply them to generate coherent and realistic samples.
3. Avoiding Mode Collapse: mode collapse is a common issue in generative models where the model generates a limited variety of outputs, ignoring the diversity present in the training data. Normalization can help alleviate this by ensuring that the model doesn't get stuck generating only one type of output.
4. Enhancing Model Robustness: normalization can make the model more robust to changes in the input data distribution. If the input data changes but the normalized representation remains consistent, the model's performance is less likely to degrade.
5. Balanced Learning: in some cases, the data might have class imbalances or uneven distributions. Normalization can assist in creating a more balanced learning process for the generative model, which is especially important when trying to capture the complexities of each class or mode.

#embedding Zero Shot Embedding, what is it and why do you want to use it for Enterprise generative AI?

You can use embeddings for zero-shot classification without any labeled training data. For each class, embed the class name or a short description of the class. To classify new text in a zero-shot manner, compare its embedding to all class embeddings and predict the class with the highest similarity.
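
A minimal sketch with made-up 3-d class embeddings; in practice these vectors would come from a real text-embedding model:

import numpy as np

class_embeddings = {
    "positive": np.array([0.9, 0.1, 0.0]),
    "negative": np.array([-0.9, 0.1, 0.0]),
    "neutral":  np.array([0.0, 0.0, 1.0]),
}

def cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(text_embedding):
    # Predict the class whose embedding is most similar to the text's.
    return max(class_embeddings,
               key=lambda c: cos(text_embedding, class_embeddings[c]))

print(zero_shot_classify(np.array([0.8, 0.2, 0.1])))  # -> "positive"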

#multimodal #features What is Joint representation? Why is it used?

Each individual modality is encoded and then placed into a mutual high-dimensional space. This is the most direct way, and it's the method used most often; it usually works well when the modalities are similar in nature. In practice, when designing multimodal networks, encoders are chosen based on what works well in each area, since more emphasis is given to designing the fusion method.

#multimodal #features What does Coordinated representation mean for features?

Each individual modality is encoded irrespective of the others, but their representations are then coordinated by imposing a restriction. For example, their linear projections should be maximally correlated.

#Preparation Stage one, what do you need to do to help build your model?

- exploratory data analysis
- prototype creation
- training routine implementations
Or, in a stream of consciousness: understand your data's shape, then pick your activation and loss functions, etc., based on your use case.

Vectorization: how to do with an open source model?

To make the best use of vector embeddings with vector databases like Milvus and Zilliz Cloud, obtain vectors by removing the last layer and taking the output from the second-to-last layer. The last layer of a neural network usually outputs the model's prediction, so we take the output of the layer just before it: the vector embedding is the data fed to a neural network's predictive layer.
https://thenewstack.io/how-to-get-the-right-vector-embeddings/
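
A minimal Keras sketch of dropping the prediction layer and reading the second-to-last layer's output (toy layer sizes):

import tensorflow as tf

base = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),    # second-to-last layer
    tf.keras.layers.Dense(3, activation="softmax"),  # prediction layer (dropped)
])

# Re-wire a sub-model that stops at the second-to-last layer
embedder = tf.keras.Model(inputs=base.inputs, outputs=base.layers[-2].output)

x = tf.random.normal((5, 32))
vectors = embedder(x)          # 16-dim embeddings ready for a vector database
print(vectors.shape)           # (5, 16)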

#LLMs Why are parameters not a good measure for LLMs?

While a power law might capture the general growth trend, it may overstate the performance improvements at extremely large scales.

