CS 4740

Transformer

- Two embeddings for every word make up the input: a standard word embedding plus a position embedding
- Uses self-attention
- Why use transformers? We can train larger networks because of improved parallelizability; everything is attention and FFNNs
- Components: 6 encoders, 6 decoders
- Each encoder layer has self-attention and a FFNN
- Each decoder layer has self-attention, encoder-decoder attention, and a FFNN
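A minimal NumPy sketch of the scaled dot-product self-attention that each encoder/decoder layer relies on (matrix names and sizes are illustrative; multi-head attention, masking, and the FFNN are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)        # each position attends over all positions
    return weights @ V                        # (seq_len, d_k) contextualized representations

# toy usage: 5 tokens, d_model=8, d_k=4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```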

WordNet

- A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
- Covers lexical relations of nouns, verbs, adjectives, and adverbs
- A collection of synsets (sets of synonymous words) and unique lemmas (unique words)
- A lemma is polysemous if it participates in multiple synsets
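A short illustration using NLTK's WordNet interface (assumes `nltk` is installed and the WordNet data has been downloaded):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# All synsets the lemma "bass" participates in -- more than one, so "bass" is polysemous
for synset in wn.synsets('bass'):
    print(synset.name(), '-', synset.definition())

# The synonymous lemmas grouped in one synset
print([lemma.name() for lemma in wn.synsets('bass')[0].lemmas()])
```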

Multinomial NB Classification

- Assumes conditional independence of features given the class, i.e., P(x1,...,xn | c) = P(x1|c) * P(x2|c) * ... * P(xn|c)

Useful Features for Linear classification

- BOW is typically good for document-level sentiment / high-level classification
- Additional features to consider: bags of bigrams, position information, syntactic features, features based on a sentiment lexicon (especially for small training sets)

Linear Classification

- The classification decision is based on a weighted sum of individual feature counts
- Methodology:
1. Assign a weight to each (word, label) pair in the vocabulary that measures the word's compatibility with that label. Example: boy, whale, said get positive scores for label FICTION; spectrometer gets a negative score for label FICTION. This produces a weight vector θ.
2. Predict the label ŷ given the BOW representation x and the weight vector θ. For each label y in Y, compute a score Ψ(x, y) that measures the compatibility between document x and label y:
Ψ(x, y) = θ · f(x, y) = Σ_j θj * fj(x, y)
where f(x, y) is a feature function that returns a feature vector: the counts from x placed in the positions associated with label y's weights, and zeros in the positions for all other labels. E.g., f(x, y = NEWS) returns the word counts paired with label NEWS.
3. For any document x, predict the label ŷ with the highest score:
ŷ = argmax_{y in Y} Ψ(x, y)
(see the sketch below)
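A minimal sketch of the scoring and argmax prediction above, using a sparse dictionary for θ and for f(x, y) (the weights in the toy usage are made up):

```python
from collections import Counter

def feature_function(bow_counts, label):
    """f(x, y): the document's word counts, keyed by (label, word) so each label
    gets its own block of weights; implicitly zero for all other labels."""
    return {(label, word): count for word, count in bow_counts.items()}

def score(theta, bow_counts, label):
    """Psi(x, y) = theta . f(x, y)"""
    return sum(theta.get(feat, 0.0) * val
               for feat, val in feature_function(bow_counts, label).items())

def predict(theta, bow_counts, labels):
    """y_hat = argmax_{y in Y} Psi(x, y)"""
    return max(labels, key=lambda y: score(theta, bow_counts, y))

# toy usage with made-up weights
theta = {('FICTION', 'whale'): 2.0, ('FICTION', 'boy'): 1.0, ('NEWS', 'spectrometer'): 1.5}
doc = Counter("the boy saw the whale".split())
print(predict(theta, doc, ['FICTION', 'NEWS']))  # FICTION
```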

Bag of Words Representation (BOW)

- The document representation is just a vector of word counts x = [0, 3, 5, 0, ...], where xi is the count of vocab word i in document d
- Does not keep track of position in the document
- Effective for text classification/labeling problems (see the sketch below)
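A minimal sketch of building the count vector, assuming a small fixed vocabulary (the vocabulary and document are illustrative):

```python
from collections import Counter

vocab = ['the', 'whale', 'spectrometer', 'boy', 'said']   # fixed vocabulary (illustrative)
doc = "the boy said the whale said hello"

counts = Counter(doc.split())
x = [counts.get(word, 0) for word in vocab]   # position i holds the count of vocab word i
print(x)   # [2, 1, 0, 1, 2]
```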

linear interpolation

- Draws on the n-gram hierarchy
- Constructs a linear combination of the multiple probability estimates
- Choose the lambdas based on the validation set
P(wn | wn-2 wn-1) = λ1 * P(wn | wn-2 wn-1) + λ2 * P(wn | wn-1) + λ3 * P(wn), where Σ λi = 1
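A minimal sketch of the interpolated estimate, assuming the component n-gram probabilities are stored in dictionaries (the lambda values and toy estimates are made up, not tuned):

```python
def interpolated_trigram_prob(w2, w1, w, unigram_p, bigram_p, trigram_p,
                              lambdas=(0.5, 0.3, 0.2)):
    """P(w | w2 w1) = l1*P_tri(w | w2 w1) + l2*P_bi(w | w1) + l3*P_uni(w).
    The lambdas sum to 1 and would be tuned on the validation set."""
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((w2, w1, w), 0.0)
            + l2 * bigram_p.get((w1, w), 0.0)
            + l3 * unigram_p.get(w, 0.0))

# toy usage with made-up estimates
uni = {'cake': 0.01}; bi = {('the', 'cake'): 0.2}; tri = {('ate', 'the', 'cake'): 0.6}
print(interpolated_trigram_prob('ate', 'the', 'cake', uni, bi, tri))  # 0.362
```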

ELMo

- Embeddings from Language Models
- A deep LSTM language model trained on ~30 million sentences
- Each word is embedded with a combination of character embeddings
- Each word embedding is passed through a 2-layer bi-LSTM
- The top layer of the forward LSTM predicts the next word; the top layer of the backward LSTM predicts the previous word
- *** It uses contextualized word embeddings

Collocational features

- Encode info about words in specific positions located to the left and right of the word to be disambiguated, e.g., the word itself and its part of speech

Evaluation

- General methodology:
1. Divide the available data into a training set and a test set (5%-20%).
2. Train statistical parameters (counts) on the training set; use them to compute probabilities on the test set.
3. The test set is further divided into a test set and a validation set (~10% of the original data).
4. Decide which model is best and tune smoothing parameters on the validation set. THE VALIDATION SET IS MEANT FOR TUNING.

word sense disambiguation

- Methods for determining the sense of an ambiguous word based on the context - Given a fixed set of senses associated with a lexical item (word) -> Determine which sense applies to a particular instance of the lexical item in the context - You can obtain the sense from dictionary entries

Naive Bayes Classification

- NB takes a probabilistic approach: it maximizes the joint probability of the training set of labeled docs
- MLE: θ^ = argmax_θ P(x, y; θ)
- c_MAP (most likely class) = argmax_c P(c | f) = argmax_c P(f | c) * P(c) = argmax_c P(x1,...,xn | c) * P(c)
where f = the document represented as a set of words/features, and P(c) is estimated from the number of instances of class c in the training set
(see the sketch below)
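A minimal sketch of the c_MAP prediction in log space, assuming priors and per-class word likelihoods have already been estimated (the values below are made up, and the 1e-6 floor stands in for proper smoothing):

```python
import math

def nb_predict(words, class_priors, word_likelihoods, classes):
    """c_MAP = argmax_c P(c) * prod_i P(x_i | c), computed in log space to avoid underflow.
    class_priors[c] and word_likelihoods[c][w] would come from (smoothed) training counts."""
    def log_joint(c):
        return (math.log(class_priors[c])
                + sum(math.log(word_likelihoods[c].get(w, 1e-6)) for w in words))
    return max(classes, key=log_joint)

# toy usage with illustrative estimates
priors = {'POS': 0.5, 'NEG': 0.5}
likelihoods = {'POS': {'great': 0.05, 'movie': 0.02},
               'NEG': {'terrible': 0.05, 'movie': 0.02}}
print(nb_predict(['great', 'movie'], priors, likelihoods, ['POS', 'NEG']))  # POS
```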

NLP Tasks that fit Sequence Tagging

- POS tagging, NER (Named Entity Recognition), chunking, opinion extraction, metaphor detection

Viterbi algorithm

- Sweep forward, one word at a time, finding the most likely (highest-scoring) tag sequence ending with each possible tag, e.g., with 4 possible POS tags we keep the 4 best tag sequences (one per tag) at each position
- With the right bookkeeping, we can then "read off" the most likely tag sequence once we reach the end of the sentence
- Watch: https://www.youtube.com/watch?v=mHEKZ8jv2SY
- ACTUAL algorithm: see the sketch below
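A minimal sketch of Viterbi decoding for an HMM tagger, assuming log-probability dictionaries `log_trans`, `log_emit`, and `log_init` have already been estimated (the names and dictionary representation are illustrative, not from the course):

```python
def viterbi(words, tags, log_trans, log_emit, log_init):
    """Most likely tag sequence under an HMM.
    log_trans[(t_prev, t)], log_emit[(t, w)], log_init[t] hold log probabilities;
    unseen pairs default to -inf here (real code would smooth them)."""
    NEG_INF = float('-inf')
    # score[i][t] = best log score of any tag sequence for words[:i+1] ending in tag t
    score = [{t: log_init.get(t, NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
              for t in tags}]
    bptr = [{}]  # back pointers for reading off the best sequence
    for i in range(1, len(words)):
        score.append({})
        bptr.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: score[i-1][p] + log_trans.get((p, t), NEG_INF))
            score[i][t] = (score[i-1][best_prev]
                           + log_trans.get((best_prev, t), NEG_INF)
                           + log_emit.get((t, words[i]), NEG_INF))
            bptr[i][t] = best_prev
    # follow back pointers from the best final tag
    best_last = max(tags, key=lambda t: score[-1][t])
    seq = [best_last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(bptr[i][seq[-1]])
    return list(reversed(seq))
```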

Text/Doc Classification-based methods (Alternative to LMs)

- Takes into account preceding history / arbitrary features. Example: Jack helped himself/herself? to the cake. * Takes into account the gender of the subject and the preceding part of speech *

Evaluation Metric

- Tells us how well our model performed on the test set

lexical semantics

- The meanings of words and the meaning relationships among words. Examples: - Word sense disambiguation - WordNet - WSD as classification

encoder-decoder Seq2Seq model

- The power of this model lies in the fact that it can map sequences of different lengths to each other
- Encoder: learns to represent the input sequence
- Decoder: given the encoder representation, learns to produce the output sequence word by word, autoregressively (using the previous step to predict the next step)
- The decoder's hidden state is initialized from the encoder's final representation (the context vector)
- *Deficiencies* of the encoder-decoder:
1. Makes no use of the encoder states other than the last one; the decoder only uses a single vector (the context vector) from the input.
2. This context vector is only seen once, at the start of decoding.
3. The representation of the input (via the context vector) is the same across decoder timesteps.
ATTENTION FIXES ALL 3 OF THESE PROBLEMS WITH ENCODER-DECODERS

1. What are we trying to compute/why do we have a probabilistic approach? - Why is estimating the probability directly intractable, why do we make Markov assumptions?

- We have probabilistic models to mimic how language is formed: they explicitly model the process of producing, or generating, language.
- Estimating the probability of a full word sequence directly is intractable because most long histories are never seen in training (data sparsity), so we make Markov assumptions to condition on a short history rather than on all possible full histories.

Regular word embeddings vs. contextualized word embeddings

Regular word embeddings: easy to use, fast to use, lots of research around their interpretability
vs.
Contextualized word embeddings: substantial performance improvements, more accurately model language, may be better in low-resource settings

GRU

- Inputs and outputs need to be the same length

RNN

- Inputs and outputs need to be the same length - suffers from short-term memory (due to vanishing gradients)

Deficiencies of word embeddings

- One vector is assigned for each word - this vector ignores things like polysemy and homonymy - the same vector is assigned to a word in all contexts

Why Viterbi Algorithm?

- Searching through all possible tag sequences is inefficient (tagset size T, sentence length m: T^m total possible tag sequences)
- Allows for efficient search for the most likely sequence

Naive Bayes vs. Perceptron

1. If the data is linearly separable, the perceptron will fit it perfectly, i.e., find a correct weight vector θ.
2. NB is a closed-form calculation that starts and stops immediately; the perceptron is iterative and needs a good stopping condition.
In high dimensions, data is typically linearly separable.
Perceptron vs. NB:
1. Computing θ: NB - closed form; Perceptron - iterative (updates θ depending on whether the prediction was right or wrong)
2. Assumptions: NB - conditional independence of features; Perceptron - no assumptions
3. Margin: NB - no explicit margin cutoff; Perceptron - treats answers as correct or not (linearly separable)

Why is NLP hard?

1. Ambiguity
a. Syntactic (prepositional attachment, e.g., I ate pizza with friends / with mushrooms)
b. Discourse (pronoun resolution sometimes requires understanding of multi-sentence context, e.g., who is the "he" that was talked about long ago)
c. Pragmatics (meaning depends on context, e.g., "Do you want to find some lunch?" "I just came from CTB")
2. Variability (I ate pizza with friends; friends and I shared pizza; etc.)
3. Sparsity (most interactions are not seen in training)
4. Discrete nature / representational challenges (language constructs an infinite variety of expressions of thought from a few dozen discrete elements, e.g., no inherent relationship between pizza, hamburger, pasta)
5. Compositionality (the sum of the parts is not always a good approach to composition in language, e.g., "it rained cats and dogs" is not the same as "it rained cats, it rained dogs")
6. Hard to write rules or encode basic knowledge (like common-sense reasoning)

Evaluation Typical methodology

1. Fix/train n-gram model probabilities on the training set
2. Choose the model + hyperparameters to maximize the probability of the validation data
3. Apply the selected model + hyperparameters to your test set
4. Report the results on the test set

BOW

BOW: treats the context as a pseudo-document. The feature function looks at every word in the surrounding context and, for each candidate label (sense), separately records those word counts.

Co-Occurence Features:

Co-occurrence features: encode info about IMPORTANT neighboring words that co-occur with this specific label/sense, ignoring exact position.
- Select the n most frequent content words from a set of "bass" sentences drawn from a large corpus
- Select a window size, e.g., 50 word tokens
- Example (WSJ corpus): fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band

Extrinsic(BEST) vs. Intrinsic(ALTERNATIVE) Evaluation

Extrinsic- Embed LM in an application/task and measure its performance on the application/task with and without the language model. Intrinsic- Measure the quality of the model independent of any application.

Relation Extraction Methods

Handwritten patterns: the earliest & still common approach. Depends on lexico-syntactic patterns, e.g., "Noun phrase such as noun phrase" implies a hyponym semantic relationship.
Supervised ML: find all pairs of named entities; for each pair, design two classifiers: 1. determine if the two entities are related, 2. classify the relation.
Semi-supervised: bootstrap from a small set of seed patterns/tuples.
Unsupervised: known as open information extraction; employed when you have no labeled training data and no list of relations. One approach: parse each sentence, extract all subject-verb-object tuples, and use additional heuristics to determine which s-v-o tuples to return as extracted relations.

Homonyms vs. Polysemy

Homonyms: same pronunciation, different spellings, different meanings (knight and night). Polysemy: one word, one spelling, different meanings.

Observations

How likely a word is to be selected if we randomly select from a given category, i.e., the emission probability P(word | tag) in the HMM

word sense disambiguation(WSD) as classifier

Input: 1. a text snippet d ("An electric guitar and bass player stand off to one side, not really part of the scene, just as ...") 2. the classes of senses for the word bass, Y = {fish, musical}
Output: ŷ = musical
The feature vector representation here is with respect to the target (the word to be disambiguated): it encodes info from the context (the surrounding text carries the important info for determining the sense of the target word).

Feature vector representation options for WSD

The labeled training feature function usually returns the concatenation of: 1. BOW features & collocational features, OR 2. co-occurrence features & collocational features (a sketch of option 1 is below)
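A minimal sketch of option 1 (BOW features concatenated with collocational features) for a target word; the feature names and window size are illustrative:

```python
def wsd_features(tokens, target_index, window=2):
    """One possible concatenation of BOW and collocational features for the word
    at target_index, returned as a sparse feature dictionary."""
    features = {}
    # BOW features: which words occur anywhere in the context, ignoring position
    for tok in tokens:
        if tok != tokens[target_index]:
            features[f'bow={tok}'] = features.get(f'bow={tok}', 0) + 1
    # Collocational features: words at specific offsets left/right of the target
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = target_index + offset
        if 0 <= j < len(tokens):
            features[f'word_at[{offset}]={tokens[j]}'] = 1
    return features

tokens = "an electric guitar and bass player stand off to one side".split()
print(wsd_features(tokens, tokens.index('bass')))
```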

LSTM

Long Short-Term Memory networks were invented to prevent the vanishing gradient problem in Recurrent Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the hidden state in an RNN, we help the network efficiently propagate gradients and learn long-range dependencies. - Inputs and outputs need to be the same length

Why is "Naive" Bayes naive?

Naive Bayes (NB) is 'naive' because it makes the assumption that the features of a measurement are independent of each other. This is naive because it is (almost) never true, yet NB works well in practice anyway.

Hidden Markov model (HMM)

P(observable, hidden) = P(w1,...,wn, t1,...,tn) = P(w1,...,wn | t1,...,tn) * P(t1,...,tn) = ∏ P(wi | ti) * ∏ P(ti | ti-1)

Perplexity

Perplexity = confusion - The higher the (estimated) probability of the word sequence, the lower the perplexity. - Lower perplexity is better, i.e., less confusion. - Perplexity must be computed on data the model has no knowledge of (e.g., compute it on the validation set). (see the sketch below)
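A minimal sketch of the computation, assuming the model supplies a per-word log probability for each word of the held-out sequence (the numbers are made up):

```python
import math

def perplexity(log_probs):
    """Perplexity of a held-out word sequence given per-word log probabilities:
    PP = exp(-(1/N) * sum_i log P(w_i | history)). Lower is better."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

print(perplexity([math.log(0.2), math.log(0.1), math.log(0.05)]))  # ~10.0
```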

States

Possible lexical categories S = {N, V, D}

BERT

Pre-training of deep bidirectional transformers for language understanding
1. Pre-trained on 3.3 billion words
2. Uses a new language modeling objective: masked language modeling
3. Bidirectional modeling
4. Subword-level features (not character and not full word)
5. VERY deep (12/24 layers)
- Fully bidirectional, instead of ELMo which uses separate left-to-right and right-to-left models
- And a masked language model

Perceptron

Predict using θ. If wrong, increment θ by f(xi, yi) - f(xi, ŷ); otherwise, keep θ unchanged. (A sketch of one training pass is below.)
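A minimal sketch of one training pass over labeled examples, assuming the sparse `feature_function`, `predict`, and `theta` dictionary sketched under Linear Classification above:

```python
def perceptron_epoch(theta, data, labels, feature_function, predict):
    """One pass of the perceptron update rule over labeled examples (x_i, y_i):
    if the prediction y_hat is wrong, add f(x_i, y_i) - f(x_i, y_hat) to theta;
    otherwise leave theta unchanged."""
    for x, y in data:
        y_hat = predict(theta, x, labels)
        if y_hat != y:
            for feat, val in feature_function(x, y).items():
                theta[feat] = theta.get(feat, 0.0) + val      # boost true-label features
            for feat, val in feature_function(x, y_hat).items():
                theta[feat] = theta.get(feat, 0.0) - val      # penalize predicted-label features
    return theta
```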

Classification vs. LM

Pros of classification: - The feature function can be expanded to include useful clues (bigram counts; unigram/bigram/trigram presence or absence; length of the doc; # of word types in the doc; etc.) Cons: - The NB independence assumption is implausible

N-Gram language Models Pros and Cons

Pros: 1. easy to understand 2. cheap 3. good enough for machine translation and speech recognition Cons: 1. Markov assumptions are linguistically inaccurate (the independence assumption means the future state depends only on the current state, not on all previous states; unigrams are obviously the worst because they assume full independence) 2. data sparseness (although smoothing helps) 3. out-of-vocabulary problem

RNN vs. GRU vs. LSTM

RNN has the simplest cells: basically a multiplication of the input (xt) and the previous output (ht-1) passed through an activation function, with NO GATES.
- GRU has a memory unit: an UPDATE gate is introduced which decides whether to pass the previous output (ht-1) to the next cell as ht or not, and a RESET gate which decides how much info to forget. * GRU has fewer tensor operations, so it is speedier to train *
- LSTM has additional gates (2 more), so it has the update gate, the forget gate, and the output gate.
ALL RNNs STILL HAVE THESE PROBLEMS: - sequential, so no parallelism - no explicit model for long/short-range dependencies - distance between positions is linear

Subjectivity vs. Sentiment

Subjective: sentences express private states (speculations, beliefs, emotions, evaluations, goals, opinions, judgments), e.g., I HATE BILL. Sentiment: a type of subjective expression depicting positivity or negativity (+ or -), typically positive or negative emotions/judgments/evaluations.

Text/ Doc classification

Task examples: - opinion spam detection - assigning subject categories/genres - authorship identification - age/gender identification - language identification - sentiment analysis
Input: 1. a document d 2. a fixed set of classes C = {c1, c2, ..., cj}
More specifically: 1. a fixed set of features F = {f1, f2, ..., fn} 2. a fixed set of classes C = {c1, c2, ..., cj} 3. a training set of m examples, each encoded as a feature vector paired with its class: (f1, c1), ..., (fm, cm)
Output: the predicted class c in C. More specifically, a learned classifier y^: f -> c
- Classification USES an ML method for probabilistic classification
- A feature vector (unigram counts, bigram counts, POS tags, etc.) is the document representation

Markov assumption

The assumption we make that a future event can be predicted using a relatively short history

Zipf's Law

The observation that a word's frequency in a corpus is inversely proportional to its rank in the frequency table: a small number of words are extremely frequent, while most words occur rarely. (see the sketch below)
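A minimal sketch of how one could eyeball this on a corpus (the function name is illustrative; a meaningful check needs a large token list, not a toy sentence):

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Rank-frequency table for a corpus; under Zipf's law, frequency * rank stays
    roughly constant, so a handful of words dominate and most words are rare."""
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        print(rank, word, freq, freq * rank)

# apply to a reasonably large token list, e.g. a full corpus read from disk
```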

attention

The inputs to the decoder at a given timestep are: - the previous hidden state - the previous word in the translation - the context vector, i.e., ht = RNN(ht-1, yt-1, ct), where ht = decoder hidden state, ei = encoder hidden states, and ct = a weighted sum over ALL encoder hidden states (instead of just the last one). (see the sketch below)
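A minimal sketch of computing the context vector as an attention-weighted sum of encoder states, using dot-product scoring for simplicity (the scoring function and array shapes are assumptions, not the course's exact formulation):

```python
import numpy as np

def context_vector(decoder_state, encoder_states):
    """c_t as a weighted sum of ALL encoder hidden states, with weights given by a
    softmax over alignment scores between the decoder state and each encoder state."""
    scores = encoder_states @ decoder_state          # one alignment score per encoder position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # attention distribution alpha_t
    return weights @ encoder_states                  # c_t, same size as an encoder state

# toy usage: 4 encoder positions, hidden size 3
enc = np.random.default_rng(1).normal(size=(4, 3))
dec = np.zeros(3)
print(context_vector(dec, enc))   # with a zero query, the weights are uniform
```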

Smoothing

Add-k smoothing:
Unigram: P(wn) = (Count(wn) + k) / (N + V*k)
Bigram: P(wn | wn-1) = (Count(wn-1 wn) + k) / (Count(wn-1) + V*k)
(see the sketch below)
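A minimal sketch of the add-k bigram estimate above, assuming simple count dictionaries (the counts and vocabulary size in the toy usage are made up):

```python
def add_k_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram estimate:
    P(w | w_prev) = (Count(w_prev, w) + k) / (Count(w_prev) + k * V).
    With k = 0 this reduces to the unsmoothed MLE estimate."""
    return ((bigram_counts.get((w_prev, w), 0) + k)
            / (unigram_counts.get(w_prev, 0) + k * vocab_size))

# toy usage with made-up counts
bigrams = {('the', 'cat'): 2}
unigrams = {'the': 10}
print(add_k_bigram_prob('the', 'cat', bigrams, unigrams, vocab_size=1000))  # 3/1010
```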

MLE

Unigram: P(wn) = Count(wn) / N
Bigram: P(wn | wn-1) = Count(wn-1 wn) / Count(wn-1)
Trigram: P(wn | wn-2 wn-1) = Count(wn-2 wn-1 wn) / Count(wn-2 wn-1)
Accuracy increases as n (the n-gram order) increases.

Unknown Word handling (UNK replacement) and Unseen ngram handling (Smoothing)

Unknown word handling: dealing with words that were not in the training data (e.g., change the first occurrence of each of the 50 least frequent words to UNK).
Unseen n-gram handling: the words are in the training set, but combinations of those words have not been seen; handled with smoothing. Smoothing is meant to make the distribution more uniform so it better fits new data.
(a sketch of UNK replacement is below)
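A minimal sketch of the UNK replacement described above, assuming a plain token list (the cutoff of 50 is the card's example value and would be a tunable choice):

```python
from collections import Counter

def replace_rare_with_unk(tokens, rare_count=50):
    """Replace the first occurrence of each of the rare_count least frequent word
    types with UNK, leaving later occurrences unchanged."""
    counts = Counter(tokens)
    rare = {w for w, _ in sorted(counts.items(), key=lambda kv: kv[1])[:rare_count]}
    out, seen = [], set()
    for tok in tokens:
        if tok in rare and tok not in seen:
            seen.add(tok)
            out.append('UNK')
        else:
            out.append(tok)
    return out
```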

Params vs. hyperparams

We train parameters on our model. Params are: unigram, bigram, trigram counts, etc. Hyperparams are the choices made on top of those counts, such as how UNKs are handled and the smoothing settings; they are tuned on the validation set.

Transition probability matrix

The probability (a bigram over states/tags) of moving from one state to another

Viterbi algorithm Time/Space Complexity

Where c = # of lexical categories/tags and n = number of words:
Space: two c x n matrices, O(c*n)
1. SCORE(i, t): matrix to keep track of the score of the best sequence ending in tag i at time step t
2. BPTR(i, t): matrix to keep track of the back pointer for quick, time-saving backtracking at the end
Time: O(c^2 * n) for the forward pass, O(n) for the backward pass, vs. O(c^n) for brute force

