CSC277 Exam 1
Pros and Cons of monorepos
+ code reuse, simplified dependency management, atomic commits, refactoring anywhere in the codebase with fewer problems, facilitates collaboration; - prototyping is harder, potential security risks due to unlimited access
CI/CD pipeline steps
- code is built - tests execute after the build (CI) - pipeline fails if a problem is found - deploy (CD). CI = continuous integration, CD = continuous deployment
Limitations of LLMs
- limited ability to plan and reason - do not learn continually, so they lack up-to-date information - sensitive to prompt wording - cite sources inaccurately
Layer norm
- subtract the mean of each input vector and divide by its standard deviation; invariant to batch size; makes training slower
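A minimal NumPy sketch of this computation (the epsilon and example shapes are illustrative assumptions; learnable scale/shift parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each input vector (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)   # per-vector mean
    std = x.std(axis=-1, keepdims=True)     # per-vector standard deviation
    return (x - mean) / (std + eps)         # independent of batch size

x = np.random.randn(4, 8)                   # batch of 4 vectors, 8 features each
print(layer_norm(x).mean(axis=-1))          # ~0 for every vector
```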
MLP cons
- no memory - poor for images - poor for sequences - cannot handle graphs as input
Transfer learning steps via embedding
1) Get the dataset needed 2) extract features using a model trained on a larger dataset, taking the layer before the output 3) train a linear/softmax/other classifier on those features
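A rough PyTorch sketch of this route, assuming a torchvision ResNet-18 as the model trained on the larger dataset; the feature dimension (512), class count, and data are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# 2) Feature extractor: pretrained model with its output layer removed
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()               # keep the layer before the output
backbone.eval()                           # frozen; no gradient updates

# 3) Train only a linear/softmax classifier on the extracted features
classifier = nn.Linear(512, 10)           # 512-dim embedding -> 10 classes (example)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(16, 3, 224, 224)     # stand-in for the target dataset
labels = torch.randint(0, 10, (16,))

with torch.no_grad():                     # features stay fixed
    feats = backbone(images)
loss = loss_fn(classifier(feats), labels)
loss.backward()
optimizer.step()
```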
Steps to running a CNN trained on ImageNet
1) Resize image to the size used in training, usually 224 x 224 2) If the image is not RGB, e.g. greyscale, make a faux RGB image 3) Make the image compatible: - convert to single precision in the same range as training - subtract the average image/channel computed on ImageNet - apply any other normalization used in training (varies per system) 4) feed into the network 5) the network generates values for all layers; the output predicts one of the 1000 ImageNet categories
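A sketch of that pipeline with torchvision, assuming a ResNet-50 and the standard ImageNet normalization statistics; the resize sizes and file name are illustrative:

```python
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                       # 1) resize to the training size
    transforms.ToTensor(),                            # 3) single precision, [0, 1] range
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),  # ImageNet channel stds
])

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
img = Image.open("example.jpg").convert("RGB")        # 2) faux RGB if greyscale
batch = preprocess(img).unsqueeze(0)                  # add a batch dimension
with torch.no_grad():
    logits = model(batch)                             # 4) feed into the network
print(logits.argmax(dim=1))                           # 5) one of the 1000 ImageNet classes
```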
Byte Pair Encoding steps
1) User specifies the vocabulary size 2) the vocabulary is trained on a corpus: 3) base vocabulary covers all ASCII/Unicode bytes 4) iterate through the data, merging tokens to increase the vocabulary 5) create new tokens for the most frequent byte-pair sequences
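A toy sketch of the training loop on a made-up corpus, starting from characters rather than raw bytes and with an arbitrary target vocabulary size:

```python
from collections import Counter

def bpe_train(corpus, target_vocab_size):
    """Repeatedly merge the most frequent adjacent token pair into a new token."""
    words = [list(w) for w in corpus.split()]          # start from characters
    vocab = {ch for w in words for ch in w}            # base vocabulary
    merges = []
    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # most frequent pair
        merged = "".join(best)
        merges.append(best)
        vocab.add(merged)                              # new token for that pair
        new_words = []
        for w in words:                                # apply the merge everywhere
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(w[i]); i += 1
            new_words.append(out)
        words = new_words
    return vocab, merges

vocab, merges = bpe_train("low lower lowest low low", target_vocab_size=12)
print(merges)   # e.g. [('l', 'o'), ('lo', 'w'), ...]
```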
How to package for release
1) bundle into a lightweight standalone executable that runs seamlessly across hardware 2) deploy somewhere, e.g. a cloud system
Academic AI system dev
1) get a dataset 2) develop a baseline and compare against the state of the art 3) come up with a novel idea - test - modify until it beats the state of the art (tune hyperparameters, improve model/data) - analyze 4) publish
LoRA Steps
1) h = W_0 x + ΔW x = W_0 x + BAx 2) W_0 comes from the original model and stays frozen; only B and A are updated; BA has the same dimensions as W_0 but far fewer parameters 3) B is initialized to a zero matrix, A with a Gaussian init 4) the rank r of BA is tuned during fine-tuning 5) fold BA back into the original weights to reduce compute at inference 6) can work with roughly 20-300 samples
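A minimal PyTorch sketch of a LoRA-adapted linear layer following steps 1-3; the dimensions, rank, and alpha/r scaling follow common LoRA implementations but are assumptions here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)          # original weights, frozen
        self.W0.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero matrix, so delta W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # h = W_0 x + B A x   (only A and B receive gradient updates)
        return self.W0(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=768, d_out=768, r=8)
h = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288, far fewer than 768 * 768
```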
Transfer learning via fine tuning
1) initialize weights from an already-trained network 2) remove the last layer and replace it with a new output layer 3) retrain the whole network, perhaps with a lower learning rate
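A PyTorch sketch of these three steps, assuming a torchvision ResNet-18; the class count and learning rate are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

# 1) Initialize weights from an already-trained network
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 2) Remove the last layer and replace it with a new output layer
model.fc = nn.Linear(model.fc.in_features, 10)    # 10 = class count of the new task

# 3) Retrain the whole network, perhaps with a lower learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)              # stand-in for the new dataset
labels = torch.randint(0, 10, (8,))
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```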
Steps for pretrain
1) make a small dataset 2) turn off augmentation and regularization 3) have the optimizer reduce the loss to near zero; if it cannot, the optimizer/model setup is probably bad
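A sketch of this overfit check with a toy model and data standing in for the real system; the step count and learning rate are arbitrary:

```python
import torch
import torch.nn as nn

# 1) Tiny dataset (a handful of samples)
x = torch.randn(8, 20)
y = torch.randint(0, 3, (8,))

# 2) No augmentation, no regularization (no dropout, no weight decay)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.0)
loss_fn = nn.CrossEntropyLoss()

# 3) Loss should reach ~0; if not, the optimizer/model setup is suspect
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")   # expect a value close to 0
```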
InstructGPT steps (RLHF)
1) train an autoregressive LLM 2) labelers provide demonstrations of desired behavior; the pretrained model is fine-tuned on them 3) a dataset of comparisons between model outputs is used to train a reward model 4) the LLM from (2) is fine-tuned again using RL against the reward model from (3) 5) optionally, repeat
Model Capacity
Approximately the number of parameters, more = overfit, less = underfit
Inductive bias
Assumptions built into the learning algorithm
Workflow orchestration
Automating workflows; use a flow or pipeline for training/testing; examples include Prefect, Kubeflow, MLflow, ClearML
Logistic Sigmoid output activation function, paired with loss function ____
Binary Cross entropy loss, paired with output activation function ___
Monorepos
Central repository for a company or a division of a company; many projects, short-lived branches, version control (uses Git)
Examples of pre train tests
Check the shape of the output versus the labels - check output ranges - ensure a gradient step on a batch decreases the loss - make sure weights change in every layer - ensure gradient flow - ensure the untrained system performs at chance (random) levels - check for label leakage between training/val sets (overlap in labels)
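A few of these checks written as pytest-style tests; the model, shapes, and learning rate are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 5))

def test_output_shape_matches_labels():
    model, x = make_model(), torch.randn(4, 10)
    assert model(x).shape == (4, 5)           # one logit row per label

def test_gradient_step_decreases_loss():
    model, x, y = make_model(), torch.randn(16, 10), torch.randint(0, 5, (16,))
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    before = loss_fn(model(x), y)
    before.backward(); opt.step()
    after = loss_fn(model(x), y)
    assert after.item() < before.item()       # a gradient step on a batch lowers the loss

def test_every_layer_gets_gradients():
    model, x, y = make_model(), torch.randn(16, 10), torch.randint(0, 5, (16,))
    nn.CrossEntropyLoss()(model(x), y).backward()
    for name, p in model.named_parameters():
        assert p.grad is not None and p.grad.abs().sum() > 0, name   # gradient flow
```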
Byte Pair Encoding
Common tokenizer, ensures common words are single token, rare words broken down into subwords.
Docker
Container system, "source code" for containers, speeds up deployment by providing drivers, os version, images (where code lives), like version control for entire setup. useful for reproducibility
Softmax for mut. exclusive as output activation function, paired with loss function___
Cross entropy loss, paired with output activation function ___
Autoregressive
Forecast the future based on the past
Chain of thought prompting
Generate a sequence of short sentences; ask the model to reason step by step; works very well for large LLMs
Jailbreaking LLM
Getting past safeguards to elicit disallowed/bad answers
GNN output
Graph with same topology as input
Perplexity
How well a model predicts the next word; measures the "surprise" level of the next word; a lower score is better; = e^{-(1/t) Σ_i log p_θ(x_i | x_{<i})}, the exponential of the cross-entropy loss; flawed: larger for shorter text, and punctuation marks affect the score
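A short sketch computing perplexity from per-token log-probabilities, matching the formula above; the probabilities are made up:

```python
import math

def perplexity(token_log_probs):
    """PPL = e^{-(1/t) * sum_i log p(x_i | x_<i)} -- the exponential of cross-entropy."""
    t = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / t)

# Hypothetical per-token probabilities from a language model
log_probs = [math.log(p) for p in [0.25, 0.10, 0.60, 0.05]]
print(perplexity(log_probs))   # lower means the model is less "surprised"
```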
If a transformer gets N inputs, it produces ___ outputs
If a transformer receives N outputs, it was given ___ inputs
Artifact
Incorrect label, label noise
Post train tests
Invariance - apply perturbations that should not matter and ensure the output stays consistent. Robustness - apply severe perturbations and confirm the output is still acceptable. Directional expectation - define changes that should have a known impact on outputs and test for them. Minimum functionality - measure performance against different groups and subgroups.
Deep Learning Risks
- Job loss due to automation - surveillance state and censorship - predictive policing (Robocop?) - media manipulation - automated weapons and military applications - can't trust anything you read, see, or hear on the web (spam, fake dating profiles/chat bots, deepfake images/video, AI-generated articles) - democratization of skills
Linear activation as output activation function, paired with loss functions ___
L2 and Huber loss, paired with activation function ___
Norm functions
Layer norm, batch norm, minibatch norm
Unsupervised Learning
Learning from data that contains no labels
Experiment versioning
Lock down experimental configs to reproduce
Experiment tracking
Log config, data, outputs, etc.; Weights & Biases (W&B) is good for academia
Integration test
Longer-running tests that observe higher-level behaviors; use multiple parts of the codebase
Issues with regulation of AI
Regulations are loosely defined, and compliance makes it hard for start-ups to compete
LoRA
Low-Rank Adaptation; fine-tunes LLMs by limiting the amount of plasticity in the network to a small number of weights; keeps the original weights fixed but modifies chosen weight matrices to include two new low-rank matrices; has expensive activation memory consumption
Regularization
Methods for reducing overfitting
Reinforcement learning
Model is not given direction, but gets scalar reward for actions
Self Supervised learning
Model creates its own labels from the data and learns the underlying structure; used for foundation models
Relative positional encodings
Modifies self-attention to incorporate positional encoding; encodes relative positional information as a sub-component of the values matrix
Greedy prediction
Pick the most likely next token; performs poorly, a side effect of cross-entropy/softmax training yielding a heavy-tailed output distribution
Input of GNN
N d-dimensional vectors plus an adjacency matrix; each node has a vector of features
Absolute positional encoding
From the original Transformer paper; a d-dimensional vector of sinusoids added to each token embedding; limits the number of tokens
Principles of Production level MLOPS for AI
Orchestration, explainability, reproducibility, software engineering
How to reduce hockey stick loss curves
Initialize the output layer bias from the data, e.g. for regression set it to the mean of the output vectors
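A PyTorch sketch of data-dependent output bias initialization for regression; the model and the synthetic targets are illustrative:

```python
import torch
import torch.nn as nn

# Synthetic regression targets whose mean is far from zero
y_train = torch.randn(1000, 1) * 2.0 + 50.0

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

# Set the output layer's bias to the mean of the targets so the first
# predictions already sit in the right range and the loss curve skips the
# initial "hockey stick" drop.
with torch.no_grad():
    model[-1].bias.fill_(y_train.mean().item())
print(model[-1].bias)   # ~50
```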
Regression
Predicting continuous variables
Kubernetes
Production-focused container orchestration system, in contrast to Docker; good for large numbers of containers; has load balancing but is more difficult to deploy
Grouped Query Attention
Query heads are split into G groups; each group shares a single key and value head
Direct sampling prediction
Randomly sample the next token in proportion to the probability predicted by the softmax
Data deduplication
Removing repeated data to reduce overfitting and speed up training. Meta uses D4: embed documents, cluster with k-means, find duplicated substrings, MinHash
Supervised ML
See many examples, pick a mapping that replicates input-output pattern of examples.
Foundation models
Self-supervised learners; generative AI such as GPT-3, CLIP, DALL-E; enable a huge number of downstream tasks via transfer learning
When activation functions are omitted
Situations where activation functions are omitted; examples are CNN architectures (due to filter size) and transformer architectures (due to attention)
Batch norm
Subtract the per-channel mean over all samples in the batch and divide by the per-channel standard deviation; transfers worse to other datasets and should be avoided
MLOps
System for organizing the development and deployment of ML systems; sits at the intersection of ML, DevOps, and Data Engineering. - increased automation - facilitates reproducibility - provenance of datasets - monitoring of systems in production - tracking of business metrics
Linear Filter in CNN
Takes an input tensor and outputs a feature map; has spatial dimensions; padding is used to preserve dimensionality; part of a convolutional layer in a CNN
Unit Test
Test of a single part of the codebase; runs quickly during development to ensure that chunk works
Regression test
Test to ensure changes such as bug fixes or updates do not cause problems
Sanity test
Test to ensure outputs reasonably consistent with expected behavior
Positional Interpolation for RoPE
A model trained with a fixed context window cannot adapt to other context windows; positional interpolation scales the position indices of the new, longer context down to match the original context window size, then fine-tunes the model for ~1000 steps
How to deal with small amounts of data, one way
Transfer learning, thanks to its ability to reuse basic features shared across datasets, i.e. universal visual features
Rotary positional encoding
Unifies the absolute and relative approaches; positional encoding applied inside self-attention, called RoPE; embeddings are treated as complex numbers and positions as pure rotations; shifting query and key by the same amount changes their absolute positions but not their relative positions, so the dot product stays the same
Three main types of tests (plus bonus two)
Unit, regression, integration (sanity and smoke)
mini batched gradient descent
Update parameters after each batch of m instances, i.e. n/m updates per pass (m = batch size, n = total data)
Full gradient descent
Updates parameters after all instances in training set
Continual Learning
Updating models with new data
GELU activation
Used in most transformer models and many CNNs; no dying-ReLU problem; smoother near 0; differentiable everywhere; allows small gradients in the negative range; performs poorly with small NNs
Examples of inductive bias
Convolution uses the same filter for the whole image (stationarity assumption). Weight sharing: transformers/RNNs use the same weights for each term in a sequence, GNNs for each graph node
Dataset Versioning
Version data for reproducibility purposes
Model versioning
Version specific model architecture
Pro of transformers in regard to how they view sequences
They view sequences as permutation invariant; all inputs share the same weights, so a sequence is treated more like a set
Transfer Initialization
Weights from another NN
Multi-query attention
All heads share the key and value projections
Key vectors
determine attention weights
softmax equation
softmax(x_i) = e^{x_i} / Σ_{k=1}^{n} e^{x_k}
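A numerically stable version of this formula; subtracting the max before exponentiating is an implementation detail that does not change the result:

```python
import numpy as np

def softmax(x):
    """softmax(x_i) = e^{x_i} / sum_k e^{x_k}, computed stably."""
    z = x - np.max(x)     # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities that sum to 1
```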
Embedding
encoding of an input into simpler output, captures salient/discriminative aspects of input
In context learning
Give the LLM instructions/examples inside the prompt showing how to do the task; the model adapts without weight updates
few shot learning
Give the LLM a few examples, then ask for the output
self consistency sampling (llm)
Give the same prompt many times, sample multiple outputs, and keep the most consistent answer
Scraped datasets
Large, commonly used datasets, e.g. Wikipedia, The Pile (books, academia, code, much more), StarCoder, books, etc.
Flash attention
Makes attention IO-aware using tiling and recomputation; FlashAttention-2 is the version in use and supports multi-query and grouped-query attention
BERT
Model that produces embeddings for downstream tasks; a self-supervised, non-generative encoder
value vector
multiplied by attention weights
Why to use experiment and model versioning
Never lose knowledge; model reproducibility; debugging; auditing
Zero shot learning
Prompt with no examples or added context
Problems with RNNs
Require sequential data to be processed in order, so they are hard to parallelize; hard to train due to vanishing/exploding gradients; hard to capture long-term relations/dependencies
Nucleus (top-p) prediction sampling
Sample from the smallest set of top tokens whose cumulative probability exceeds p
Top k sampling prediction
Sample from the top k highest-probability tokens; use temperature T to control peakiness (can end up too low-information or too random): p_i = e^{z_i/T} / Σ_j e^{z_j/T}
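A sketch contrasting greedy, direct, and top-k sampling with temperature on a toy logit vector; the logits, k, and T are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])    # toy next-token logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy(logits):
    return int(np.argmax(logits))                  # most likely token

def direct_sample(logits):
    return int(rng.choice(len(logits), p=softmax(logits)))

def top_k_sample(logits, k=3, T=0.8):
    top = np.argsort(logits)[-k:]                  # k highest-probability tokens
    probs = softmax(logits[top] / T)               # p_i = e^{z_i/T} / sum_j e^{z_j/T}
    return int(rng.choice(top, p=probs))

print(greedy(logits), direct_sample(logits), top_k_sample(logits))
```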
expert prompting
tell llm "give answers as if u were an expert"
Smoke test
test to reveal simple but severe failures
<eos> token
Token that marks the end of a sequence and tells the transformer to stop generating
Automatic prompt design
Treat the prompt as a trainable parameter; construct input-output pairs first; OPRO is one method
SGD
Updates parameters after each (randomly chosen) instance, i.e. N updates per pass over the data
RLHF
Reinforcement learning from human feedback; used to control harmful generations and bias from LLMs
Aspects unique to transformers
Variable-length inputs, residual connections, no recurrence, self-attention
query vector
The vector we try to find context for in multi-head attention (MHA)