CSC277 Exam 1

Pros and Cons of monorepos

+ code reuse, simplified dependencies, atomic commits, refactoring anywhere with fewer problems, facilitates collaboration - prototyping is harder, potential security risks due to unlimited access

CI/CD pipeline steps

- code is built - tests execute after the build (CI) - pipeline fails if a problem is found - deploy (CD). CI/CD = Continuous Integration and Continuous Deployment

Limitations of LLMs

- limited ability to plan and reason - do not learn continually, lack up-to-date information - sensitive to prompt wording - cite sources inaccurately

Layer norm

- subtract the mean of each input vector and divide by its standard deviation; invariant to batch size; makes training slower
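A minimal NumPy sketch of the per-vector normalization described above (the learnable gain and bias are omitted); the function name and `eps` value are illustrative assumptions, not from the course:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Minimal layer norm sketch: normalize each input vector by its own
    mean and standard deviation (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Statistics are per vector, so the result is the same for any batch size.
x = np.random.randn(4, 8)           # batch of 4 vectors, 8 features each
print(layer_norm(x).mean(axis=-1))  # ~0 for every vector
```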

MLP cons

- no memory - poor for images - poor for sequences - cannot handle graphs as input

Transfer learning steps via embedding

1) Get the dataset needed 2) extract features using a model trained on a larger dataset, taking the layer before the output 3) train a linear/softmax/other classifier on those features
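A minimal sketch of the three steps, assuming scikit-learn is available; here `extract_features` is a hypothetical stand-in for the penultimate-layer activations of a model pretrained on a larger dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for step 2: in practice these would be the activations of the
# layer just before the output of the pretrained model.
def extract_features(images):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(images), 512))  # fake 512-d embeddings

images, labels = list(range(100)), np.repeat([0, 1], 50)       # 1) toy dataset
features = extract_features(images)                             # 2) embeddings
clf = LogisticRegression(max_iter=1000).fit(features, labels)   # 3) linear classifier
print(clf.score(features, labels))
```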

Steps to running CNN trained on image net

1) Resize the image to the size used in training, usually 224 x 224 2) If the image is not RGB, e.g. greyscale, make faux RGB 3) Make the image compatible with training: convert to single precision with the same range as training, subtract the average image/channel computed on ImageNet, and apply any other normalization used in training (varies per system) 4) Feed it into the network 5) The network generates values for all layers; the output predicts one of the 1000 categories from ImageNet
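A hedged sketch of these steps using a recent torchvision; the mean/std values and ResNet-18 are common defaults rather than what any particular system used, and `example.jpg` is a placeholder path:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Typical ImageNet preprocessing; exact crop size and statistics vary per system.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                 # 1) resize to the trained size
    transforms.ToTensor(),                      # 3) single precision, [0, 1] range
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # subtract per-channel stats
])

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
img = Image.open("example.jpg").convert("RGB")  # 2) faux RGB if greyscale
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))  # 4) feed into the network
print(logits.argmax(dim=1))                       # 5) one of the 1000 ImageNet classes
```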

Byte Pair Encoding steps

1) User specifies the vocab size 2) vocab is trained on a corpus: 3) base vocab covers all ASCII/Unicode characters 4) iterate through the data, merging tokens to increase the vocab size 5) create new tokens for the most frequent byte-pair sequences
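A toy sketch of the merge loop at the heart of BPE (individual characters as the base vocab, five merges); real tokenizers record the merge rules and run over a whole corpus:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # base vocab: individual characters
for _ in range(5):                  # grow the vocab by 5 merges
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)
```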

How to package for release

1) Bundle into a lightweight standalone executable that runs seamlessly across hardware 2) deploy it somewhere, e.g. a cloud system

Academic AI system dev

1) Get a dataset 2) develop a baseline, compare against the state of the art 3) come up with a novel idea - test it - modify until it beats the state of the art (tune hyperparameters, improve the model/data) - analyze 4) publish

LoRA Steps

1) h = W_0 x + ΔW x = W_0 x + BAx 2) W_0 comes from the original model; only B and A are updated - BA has the same dimensions as W_0 but far fewer parameters 3) B initialized as a zero matrix, A with Gaussian init 4) choose the rank r of BA 5) fold the weights into the original model to reduce compute 6) can work with as few as 20-300 samples
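A minimal PyTorch sketch of a LoRA-style layer, assuming a plain linear layer as the frozen W_0; the class name and initialization scale are illustrative. After fine-tuning, `W0.weight + B @ A` could be folded back into a single matrix:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen pretrained weight W0 plus a trainable
    low-rank update BA, so h = W0 x + B A x."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)
        self.W0.weight.requires_grad_(False)                            # original weights stay fixed
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)    # Gaussian init
        self.B = nn.Parameter(torch.zeros(out_features, rank))          # zero init

    def forward(self, x):
        return self.W0(x) + x @ self.A.t() @ self.B.t()

layer = LoRALinear(16, 8, rank=2)
print(layer(torch.randn(3, 16)).shape)   # torch.Size([3, 8])
```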

Transfer learning via fine tuning

1) Initialize weights from an already-trained network 2) remove the last layer and replace it with a new output layer 3) retrain the whole network, perhaps with a lower learning rate
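A short torchvision sketch of these steps; ResNet-18, the 10 output classes, and the learning rate are arbitrary example choices:

```python
import torch
import torch.nn as nn
from torchvision import models

# 1) initialize from ImageNet-pretrained weights
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# 2) remove the old output layer, replace with a new one for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)
# 3) retrain the whole network, with a lower learning rate than training from scratch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```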

Steps for pretrain

1) Make a small dataset 2) turn off augmentation and regularization 3) have the optimizer reduce the loss to near zero; if it cannot, the optimizer is probably bad

InstructGPT steps (RLHF)

1) Train an autoregressive LLM 2) labelers provide demonstrations of desired behavior; the pretrained model is fine-tuned on them 3) a dataset of comparisons is collected and a reward model is trained on it 4) the LLM from (2) is fine-tuned again using RL against the reward model from (3) 5) optional: repeat

Model Capacity

Approximately the number of parameters; more capacity risks overfitting, less risks underfitting

Inductive bias

Assumptions built into the learning algorithm

Workflow orchestration

Automating workflows; use a flow or pipeline for training/testing. Examples include Prefect, Kubeflow, MLflow, ClearML

Logistic Sigmoid output activation function, paired with loss function ____

Binary Cross entropy loss, paired with output activation function ___

Monorepos

Central repository for a company / division of a company; many projects, short-lived branches, version control (typically Git)

Examples of pre train tests

Check the shape of the output versus the labels - Check output ranges - Ensure a gradient step on a batch decreases the loss - Make sure weights change in every layer - Ensure gradient flow - Ensure the system starts at untrained (random) levels of performance - Check for label leakage between training/validation sets (overlap in labels)
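A sketch of two of these checks written as a single test, using a toy linear model as a stand-in for the real network; the names and the single-step setup are assumptions:

```python
import torch
import torch.nn as nn

def test_output_shape_and_loss_decreases():
    """Pre-train checks: output shape matches the labels, and one gradient
    step on a single batch decreases the loss."""
    model = nn.Linear(8, 3)                      # stand-in for the real model
    x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
    out = model(x)
    assert out.shape == (16, 3)                  # shape of output vs labels

    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_before = loss_fn(model(x), y)
    loss_before.backward()
    opt.step()
    loss_after = loss_fn(model(x), y)
    assert loss_after < loss_before              # gradient step decreases the loss

test_output_shape_and_loss_decreases()
```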

Byte Pair Encoding

Common tokenizer, ensures common words are single token, rare words broken down into subwords.

Docker

Container system; a Dockerfile is the "source code" for containers. Speeds up deployment by providing drivers, the OS version, and images (where code lives); like version control for an entire setup. Useful for reproducibility

Softmax for mut. exclusive as output activation function, paired with loss function___

Cross entropy loss, paired with output activation function ___

Autoregressive

Forecast the future based on the past

Chain of thought prompting

Generate a sequence of short sentences; ask for step-by-step reasoning; works very well for large LLMs

Jailbreaking LLM

Getting past an LLM's safeguards to elicit disallowed or harmful answers

GNN output

Graph with same topology as input

Perplexity

How well a model predicts the next word; measures the "surprise" level of the next word; lower is better. PPL = e^{-(1/t) Σ_i log p_θ(x_i | x_{<i})}, the exponential of the cross-entropy loss. Flawed: larger for shorter text, and punctuation marks affect performance
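A small sketch of the formula, assuming per-token log-probabilities are already available; the values below are made up:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability,
    i.e. the exponential of the cross-entropy loss."""
    t = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / t)

# log p_theta(x_i | x_<i) for a 4-token sequence (made-up values)
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.1)]))
```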

If a transformer gets N inputs, it produces ___ outputs

If a transformer receives N outputs, it was given ___ inputs

Artifact

Incorrect label, label noise

Post train tests

Invariance - make permutations of the input, ensure the output stays consistent. Robustness - make severe perturbations, confirm the output is still acceptable. Directional expectation - define changes that should have a known impact on the outputs, then test them. Minimum functionality - measure performance against different groups and subgroups

Deep Learning Risks

Job loss due to automation - surveillance state and censorship - predictive policing (Robocop?) - media manipulation - automated weapons and military applications - can't trust anything you read, see, or hear on the web (spam, fake dating profiles / chat bots, deep-fake images/video, AI-generated articles) - democratizes skills

Linear activation as output activation function, paired with loss functions ___

L2 and Huber loss, paired with activation function ___

Norm functions

Layer norm, batch norm, minibatch norm

Unsupervised Learning

The training data contains no labels

Experiment versioning

Lock down experimental configs to reproduce

Experiment tracking

Log config, data, outputs, etc. W&B (Weights & Biases) is good for academia

Integration test

Longer-running tests that observe higher-level behaviors; use multiple aspects of the codebase

Issues with regulation of AI

Regulations are loosely defined; compliance makes it hard for start-ups to compete

LoRA

Low-Rank Adaptation; fine-tunes LLMs by limiting the amount of plasticity in the network to a small number of weights. Keeps the original weights fixed but modifies chosen weight matrices to include two new low-rank matrices. Expensive activation memory consumption

Regularization

Methods for reducing overfitting

Reinforcement learning

The model is not given direction but receives a scalar reward for its actions

Self Supervised learning

The model makes up its own labels and learns the underlying structure; used for foundation models

Relative positional encodings

Modifies self-attention to ensure positional encoding; encodes relative positional information as a sub-component of the values matrix

Greedy prediction

Pick the most likely next token; performs poorly - a side effect of cross-entropy/softmax training is a heavy-tailed output distribution

Input of GNN

N d-dimensional vectors plus an adjacency matrix; each node has a vector of features

Absolute positional encoding

From the original transformer paper: a d-dimensional vector of sinusoids added to each token embedding; limits the number of tokens
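A NumPy sketch of the sinusoidal encodings from the original paper; `num_tokens` and `d_model` here are arbitrary example values:

```python
import numpy as np

def sinusoidal_positions(num_tokens, d_model):
    """d-dimensional sinusoidal encodings: even dims use sin, odd dims use cos,
    with geometrically spaced frequencies."""
    pos = np.arange(num_tokens)[:, None]              # (num_tokens, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((num_tokens, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

print(sinusoidal_positions(num_tokens=16, d_model=8).shape)  # (16, 8)
```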

Principles of Production level MLOPS for AI

Orchestration, explainability, reproducibility, software engineering

How to reduce hockey stick loss curves

Initialize the output layer bias from the data, e.g. for regression set it to the mean of the output vectors

Regression

Predicting continuous Variables

Kubernetes

Production-focused container orchestration system, often contrasted with Docker; good for a large number of containers, has load balancing, but more difficult to deploy

Grouped Query Attention

Query heads split into G groups that share key and value

Direct sampling prediction

Randomly sample the next token proportionally to the probabilities predicted by the softmax

Data deduplication

Removing repeated data to reduce overfitting and speed up training. Meta uses D4: embed documents, cluster with k-means, find duplicated substrings, MinHash

Supervised ML

See many examples, pick a mapping that replicates input-output pattern of examples.

Foundation models

Self-supervised learners; generative AI such as GPT-3, CLIP, DALL-E. Support a huge number of downstream tasks via transfer learning

When activation functions are omitted

Situations where the activation function is left out; examples are some CNN architectures (due to filter size) and transformer architectures (due to attention)

Batch norm

Subtract the channel mean over all samples in the batch and divide by the channel standard deviation; worse for transfer to other datasets, should be avoided

MLOps

System for organizing the development and deployment of ML systems; sits at the intersection of ML, DevOps, and Data Engineering. - increased automation - facilitates reproducibility - provenance of datasets - monitoring of systems in production - tracking business metrics

Linear Filter in CNN

Takes an input tensor and outputs a feature map; has spatial dimensions; padding is used to preserve dimensionality; part of a convolutional layer in a CNN

Unit Test

Test of a singular part of the codebase; runs quickly during development to ensure a chunk of code works

Regression test

Test to ensure changes such as bug fixes or updates do not cause problems

Sanity test

Test to ensure outputs reasonably consistent with expected behavior

Positional Interpolation for RoPE

A model trained with one context window cannot adapt to other context windows; positional interpolation scales the new position indices to match the original context window size, then fine-tunes the model for ~1000 steps

How to deal with small amounts of data, one way

Transfer learning, due to its ability to exploit basic features shared across datasets, i.e. universal visual features

Rotary positional encoding

Unifies the absolute and relative approaches; positional encoding inside self-attention, called RoPE. Embeddings are treated as complex numbers and positions as pure rotations; shifting query and key by the same amount changes absolute position but not relative position, ensuring the dot product stays the same

Three main types of tests (plus bonus two)

Unit, regression, integration (sanity and smoke)

mini batched gradient descent

Update parameters after every m instances (batch size m), giving n/m updates per pass, where n is the total amount of data
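A toy NumPy sketch: with n = 100 and batch size m = 20, one pass over the data performs n/m = 5 updates (linear regression with squared error, made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)   # n = 100 data points
w, m, lr = np.zeros(3), 20, 0.1                          # m = 20 -> 5 updates per pass

for start in range(0, len(X), m):
    xb, yb = X[start:start + m], y[start:start + m]      # one mini-batch
    grad = 2 * xb.T @ (xb @ w - yb) / m                  # gradient of squared error
    w -= lr * grad                                       # one update per mini-batch
print(w)
```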

Full gradient descent

Updates parameters after all instances in training set

Continual Learning

Updating models with new data

GELU activation

Used in most transformer models and many CNNs; no dying-ReLU problem; smoother near 0, differentiable everywhere, allows small gradients in the negative range. Poor with small NNs

Examples of inductive bias

Using convolution: the same filter is used for the whole image (stationarity assumption). Weight sharing: transformers / RNNs reuse weights for each term in a sequence, GNNs for each graph node

Dataset Versioning

Version data for reproducibility purposes

Model versioning

Version specific model architecture

Pro of transformer in regards to how they view sequences

Views sequences as permutation invariant: all inputs get the same weights, so a sequence is treated more like a set

Transfer Initialization

Weights from another NN

Multi - query

all heads share k, v

Key vectors

determine attention weights

softmax equation

softmax(x_i) = e^{x_i} / Σ_{k=1}^{n} e^{x_k}
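The same formula as a small NumPy function; subtracting max(x) before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of scores."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1
```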

Embedding

encoding of an input into simpler output, captures salient/discriminative aspects of input

In context learning

Give the LLM instructions or examples for the task within the prompt itself

few shot learning

Give the LLM a few examples, then ask for the output

self consistency sampling (llm)

Give the same prompt many times and check the outputs for consistency

Scraped datasets

Large, commonly used datasets, e.g. Wikipedia, The Pile (books, academia, code, much more), StarCoder, books, etc.

Flash attention

Makes attention IO-aware using tiling and recomputation; FlashAttention-2 is commonly used and supports multi-query and grouped-query attention

BERT

Model that produces embeddings for downstream tasks; a self-supervised, non-generative encoder

value vector

multiplied by attention weights

Why to use experiment and model versioning

Never lose knowledge - model reproducibility - debugging - auditing

Zero shot learning

A prompt with no examples or context

Problems with RNNs

Require sequential data to be processed in order, so they are hard to parallelize; hard to train due to vanishing/exploding gradients; hard to capture long-term relations/dependencies

Nucleus (top p ) prediction sampling

Sample from the smallest set of top tokens whose cumulative probability reaches p

Top k sampling prediction

Sample from the top k highest-probability tokens; use temperature T to control peakiness (can be too low-information or too random): p_i = e^{z_i/T} / Σ_j e^{z_j/T}
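A NumPy sketch combining temperature and top-k truncation; the function name and the k and temperature values are illustrative:

```python
import numpy as np

def sample_top_k(logits, k=5, temperature=1.0, rng=np.random.default_rng()):
    """Keep only the k highest-probability tokens, rescale with temperature,
    and sample; lower temperature makes the distribution peakier."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]               # indices of the k most likely tokens
    top_probs = probs[top] / probs[top].sum()  # renormalize over the top k
    return rng.choice(top, p=top_probs)

logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0, 0.1])
print(sample_top_k(logits, k=3, temperature=0.7))
```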

expert prompting

Tell the LLM to "give answers as if you were an expert"

Smoke test

test to reveal simple but severe failures

<eos> token

Token that marks the end of a sequence; tells the transformer to stop generating

Automatic prompt design

Treat the prompt as a trainable parameter; construct input-output pairs first. OPRO is one method

SGD

Updates parameters after each (randomly chosen) instance

RLHF

used to control harmful generations from LLMs, bias

Aspects unique to transformers

Variable-length inputs, residual connections, no recurrence, self-attention

query vector

what we try to learn context for in MHA

