DLQ4

Scaled dot-product attention

Main attention mechanism in the "Attention Is All You Need" (AIAYN) paper: the dot products of queries and keys are computed, divided by sqrt(d_k), passed through a softmax, and the resulting weights are applied to the values.
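A minimal NumPy sketch of this computation (function and variable names are mine, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len) compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of values
```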

Additive attention

AKA Bahdanau attention. The compatibility score between query and key is computed with a small feed-forward network, a softmax turns the scores into probabilities, and those probabilities are used to form a weighted sum of the values.
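A rough NumPy sketch of Bahdanau-style scoring, assuming learned projections W_q, W_k and a score vector v (all names are illustrative):

```python
import numpy as np

def additive_attention(query, keys, values, W_q, W_k, v):
    """query: (d_q,); keys: (n, d_k); values: (n, d_v).
    W_q, W_k project query and keys into a shared hidden space; v scores each position."""
    scores = np.tanh(query @ W_q + keys @ W_k) @ v         # (n,) compatibility scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax probabilities
    return weights @ values                                # weighted sum of values
```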

Value (transformer)

The actual content vectors for items in the dataset or sequence. In a regression analogy, keys are the locations where past data was observed and values are the observed values themselves; attention returns a weighted sum of the values.

Neural Attention types

Additive attention, dot-product attention, scaled dot-product attention, self-attention, multi-head attention, masked multi-head attention, cross-attention, restricted self-attention

Word2vec/skip-gram conditional probability

Basic form: for a sequence of words w_1 ... w_T and window size c, maximize (1/T) * sum over t of sum over -c <= j <= c, j != 0 of log P(w_{t+j} | w_t), i.e. the log-probability of the surrounding context words given the center word. The goal is to adjust the word vectors to maximize this conditional probability.
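A small NumPy sketch of one term of this objective, assuming a full-softmax skip-gram model with input/output embedding matrices W_in and W_out (illustrative names; real implementations typically use negative sampling or hierarchical softmax):

```python
import numpy as np

def skipgram_log_prob(center_id, context_id, W_in, W_out):
    """log P(w_context | w_center) under a full-softmax skip-gram model.
    W_in, W_out: (vocab_size, dim) input/output embedding matrices."""
    logits = W_out @ W_in[center_id]               # dot product of center vector with every output vector
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())   # log of the softmax normalizer (numerically stable)
    return logits[context_id] - log_z

# Training maximizes the sum of skipgram_log_prob over all (center, context) pairs
# within a window of size c around each position t.
```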

What is byte-pair encoding?

Compression algorithm adapted for NLP to represent a large vocabulary with a small set of subword units. Starting from individual characters/bytes, iteratively find the most frequent pair of adjacent symbols, merge it into a new subword unit, add that unit to the vocabulary, and update the pair counts. Repeat until the specified vocabulary size is reached.
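A toy Python sketch of the merge loop, assuming the corpus is given as word frequencies and each word starts as a sequence of characters (names and structure are illustrative, not a production tokenizer):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """word_freqs: dict {word: frequency}. Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                      # count adjacent pairs, weighted by word frequency
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():                # rewrite every word with the new subword unit
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# e.g. bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
```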

What are conditional language models?

Conditional language models generate output conditioned on some prompt or pre-condition (e.g. a source sentence). Teacher forcing is a training method that feeds the ground-truth previous token to the decoder instead of the model's own output; it speeds up and stabilizes training but can reduce stability at inference time, because the model never sees its own (possibly erroneous) predictions. Student forcing (scheduled sampling) begins like teacher forcing but gradually mixes in model-generated outputs, so that by the end of training the inputs to the recurrent cells are mostly the model's own predictions. This improves consistency between training and inference. See the sketch below.
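A schematic Python sketch of the idea, with a stand-in decoder_step function; teacher_forcing_prob=1.0 gives pure teacher forcing, and annealing it toward 0 approximates student forcing / scheduled sampling (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(prev_token, state):
    """Stand-in for one decoder step; returns (predicted_token, new_state). Purely illustrative."""
    new_state = np.tanh(state + prev_token)
    return int(np.argmax(new_state)) % 10, new_state

def run_decoder(target_tokens, teacher_forcing_prob=1.0):
    """teacher_forcing_prob = 1.0 -> pure teacher forcing; anneal toward 0 for student forcing."""
    state = np.zeros(4)
    prev = 0                                               # start-of-sequence token
    outputs = []
    for gold in target_tokens:
        pred, state = decoder_step(prev, state)
        outputs.append(pred)
        # Teacher forcing feeds the ground-truth token; student forcing feeds the model's own prediction.
        prev = gold if rng.random() < teacher_forcing_prob else pred
    return outputs
```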

computation of self-attention matrix size

Embedding size: d_model. Projection matrices W_Q, W_K, W_V: each d_model x d_model, so Q, K, V are seq_len x d_model. Scaled dot-product attention matrix QK^T: seq_len x seq_len (seq_len = number of tokens in the input). Output of self-attention: seq_len x d_model.
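A quick NumPy shape check of these sizes (the concrete values are arbitrary, for illustration only):

```python
import numpy as np

seq_len, d_model = 7, 16
X = np.random.randn(seq_len, d_model)          # embedded input sequence
W_Q = np.random.randn(d_model, d_model)        # projection matrices, each d_model x d_model
W_K = np.random.randn(d_model, d_model)
W_V = np.random.randn(d_model, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each (seq_len, d_model)
A = Q @ K.T / np.sqrt(d_model)                 # attention score matrix: (seq_len, seq_len)
W = np.exp(A - A.max(axis=-1, keepdims=True))
W /= W.sum(axis=-1, keepdims=True)             # softmax over the key dimension
out = W @ V                                    # output of self-attention: (seq_len, d_model)
print(A.shape, out.shape)                      # (7, 7) (7, 16)
```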

What is the encoder-decoder model for seq2seq? How does it work?

Encoder: transforms the input sequence into a fixed-size hidden representation. Decoder: maps that hidden representation to the output sequence, generating it step by step. Used for e.g. machine translation. Classically implemented with RNNs, LSTMs, or GRUs. Forms the basis for attention models, transformers, GPT models, and BERT.
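A bare-bones NumPy sketch of the idea, with a vanilla-RNN encoder and a decoder unrolled from the encoder's final state (no attention or input feeding; all names are illustrative):

```python
import numpy as np

def rnn_encoder(inputs, W_x, W_h):
    """Fold a sequence of input vectors into a single fixed-size hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h                                   # fixed-size representation of the whole sequence

def rnn_decoder(h, W_h, W_out, max_len):
    """Unroll the decoder from the encoder's final state, emitting one output per step."""
    outputs = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h)
        outputs.append(W_out @ h)              # e.g. logits over the output vocabulary
    return outputs
```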

What is beam search?

Heuristic algorithm that explores the space of output token sequences to find the most likely outputs. At each time step, instead of choosing the single most likely token (greedy decoding), it keeps track of the k most likely sequences so far, where k is the 'beam width' hyperparameter; these k sequences are the "beams". At step t+1, the model extends each beam with every possible next token, then prunes the expanded set back down to the k most likely sequences according to their cumulative probabilities. It repeats until a stop token is generated or the maximum length is reached. A practical middle ground between greedy decoding and exhaustive search for language models.
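A compact Python sketch, assuming a step_log_probs(seq) callable that returns log-probabilities for the next token given the sequence so far (names are illustrative):

```python
import math

def beam_search(step_log_probs, k, max_len, stop_token=None):
    """Keeps the k highest-scoring sequences (beams) by cumulative log-probability."""
    beams = [([], 0.0)]                                    # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == stop_token:
                candidates.append((seq, score))            # finished beam carries over unchanged
                continue
            for tok, lp in step_log_probs(seq).items():    # extend with every possible next token
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]   # prune to top k
        if all(seq and seq[-1] == stop_token for seq, _ in beams):
            break
    return beams

# Toy usage with a fixed next-token distribution:
# beam_search(lambda seq: {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)},
#             k=2, max_len=5, stop_token="<eos>")
```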

Multi-head attention

Instead of performing a single attention function with d_model-dimensional keys, the queries, keys, and values are projected h times with different learned projections to d_k, d_k, and d_v dimensions. The attention function is performed on each projection in parallel, and the head outputs are concatenated and projected once more. Allows the model to jointly attend to information from different representation subspaces.
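A NumPy sketch assuming the common setup d_k = d_v = d_model / h (function and matrix names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (seq_len, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model).
    Splits the projections into h heads of size d_model // h, attends per head, then concatenates."""
    seq_len, d_model = X.shape
    d_head = d_model // h
    Q = (X @ W_Q).reshape(seq_len, h, d_head).transpose(1, 0, 2)   # (h, seq_len, d_head)
    K = (X @ W_K).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    V = (X @ W_V).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)            # (h, seq_len, seq_len)
    heads = softmax(scores) @ V                                    # (h, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)    # concatenate the heads
    return concat @ W_O                                            # final learned projection
```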

Key (transformer)

Learned vector used by the attention mechanism; each item (e.g. word) in the input is associated with one. Keys indicate which items are most relevant to the current query's context.

Convolutional complexity (n=seq len, d = representation dimension, k=kernel size, r = neighborhood size in RSA)

O(k*n*d^2) complexity, O(1) seq ops, O(log_k(n)) max path length (for a stack of dilated convolutions)

Recurrent complexity (n=seq len, d = representation dimension, k=kernel size, r = neighborhood size in RSA)

O(n*d^2) complexity, O(n) seq ops, O(n) max path length

Self-attention complexity (n=seq len, d = representation dimension, k=kernel size, r = neighborhood size in RSA)

O(n^2*d) complexity, O(1) seq ops, O(1) max path length

Restricted self-attention complexity (n=seq len, d = representation dimension, k=kernel size, r = neighborhood size in RSA)

O(r*n*d) complexity, O(1) seq ops, O(n/r) max path length

How do RNN and LSTM Update rules differ?

RNN cells use a simple squashed linear combination of the previous hidden state and the new input. RNNs are simpler, but susceptible to vanishing gradients. LSTM cells use an input gate, forget gate, and output gate (plus a cell state) to regulate gradient 'flow' and mitigate the vanishing gradient problem, at a higher computational cost.
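A side-by-side NumPy sketch of the two update rules (biases omitted; the weight-dictionary names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_cell(x, h, W):
    """Vanilla RNN: new state is a squashed linear combination of previous state and input."""
    return np.tanh(W["x"] @ x + W["h"] @ h)

def lstm_cell(x, h, c, W):
    """LSTM: input (i), forget (f), and output (o) gates regulate what enters, stays in,
    and leaves the cell state c, which lets gradients flow over long spans."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h)     # input gate
    f = sigmoid(W["xf"] @ x + W["hf"] @ h)     # forget gate
    o = sigmoid(W["xo"] @ x + W["ho"] @ h)     # output gate
    g = np.tanh(W["xg"] @ x + W["hg"] @ h)     # candidate cell update
    c = f * c + i * g                          # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c
```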

Query (Transformer)

Representation of the current item being processed, e.g. the word or phrase being interpreted in the context of the surrounding text

Dot-product attention

Dot product of queries and keys, with a softmax applied to the scores; the same as scaled dot-product attention but without the 1/sqrt(d_k) scaling factor

What is t-SNE?

t-distributed Stochastic Neighbor Embedding: a dimensionality-reduction technique that models the probability of one point being selected as a neighbor of another in both the high- and low-dimensional spaces. It computes pairwise similarities between all points in the high-D space using a Gaussian kernel (farther points have a lower probability of being selected as neighbors), then maps the points into the low-D space so that these neighbor similarities, rather than variance (as in PCA), are preserved.
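A minimal usage sketch, assuming scikit-learn's TSNE is available (the data here is a random placeholder):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(200, 50)                   # 200 points in a 50-D space (placeholder data)
X_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(X_2d.shape)                              # (200, 2): low-D embedding preserving neighbor structure
```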

