NLP Exam 2
Distributional Hypothesis
"If A and B have almost identitical environments we say that they are synonyms"
Semantic Similarity "fast" is similar to ...
"rapid" and "speed"
Skip-Gram (Context Embedding)
A column in the output matrix.
Lexeme
A pairing of a particular word form with its sense
Homonymy
A relation between concepts/senses that share a word form but are unrelated in meaning. The word "bat" is a homonym because it can denote an implement used to hit a ball or a nocturnal flying mammal.
Metonymy
A subtype of polysemy: one aspect of a concept is used to refer to other aspects of the concept (or the concept itself). BUILDING <-> ORGANIZATION, ANIMAL <-> MEAT.
Encoder
Also uses only the final output vector y_n; however, the final vector is treated as an encoding of the information in the sequence and is used as additional information together with other signals. For example, an extractive document summarization system may first run over the document with an RNN, resulting in a vector y_n summarizing the entire document. Then, y_n will be used together with other features in order to select the sentences to be included in the summary.
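A minimal NumPy sketch of this encoder use (names and sizes are illustrative, not the course's actual code): only the final state y_n is kept, and it is combined with other features rather than being the prediction itself.

import numpy as np

def rnn_encode(vectors, W_h, W_x, b):
    # run a simple Elman-style RNN over the sequence and keep only the final state y_n
    h = np.zeros(W_h.shape[0])
    for x in vectors:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

rng = np.random.default_rng(0)
document = [rng.normal(size=4) for _ in range(20)]   # toy word vectors
W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), np.zeros(8)
y_n = rnn_encode(document, W_h, W_x, b)              # encoding of the entire document

# y_n is not used as the prediction directly; it is combined with other signals,
# e.g. hand-crafted sentence features, and fed to a downstream scorer
sentence_features = rng.normal(size=3)
scorer_input = np.concatenate([y_n, sentence_features])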
Distributional Semantic Model
Any matrix M such that each row represents the distribution of a term x across contexts, together with a similarity measurement.
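A minimal sketch of such a model, using a made-up term-context count matrix and cosine similarity as the similarity measurement (words and counts are invented for illustration):

import numpy as np

# rows = target terms, columns = context words (toy counts)
terms = ["fast", "rapid", "banana"]
contexts = ["car", "train", "fruit", "yellow"]
M = np.array([
    [10, 8, 0, 0],   # "fast"
    [ 9, 7, 0, 1],   # "rapid"
    [ 0, 0, 6, 5],   # "banana"
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# similar distributional environments -> high similarity
print(cosine(M[0], M[1]))  # fast vs. rapid: close to 1
print(cosine(M[0], M[2]))  # fast vs. banana: close to 0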
ELMo (Embeddings from Language Models)
Computes contextualized word representations that can be used as a stand-in for static word embeddings, by training bidirectional RNN language models that predict each token based on its left and right contexts; during training, the forward and backward (log-)likelihoods of each token are summed.
BERT (Bidirectional Encoder Representations from Transformers)
Consider the task of sequence tagging over a sentence x_1, ..., x_n. An RNN allows us to compute a function of the ith word x_i based on the past: the words x_{1:i} up to and including it. However, the following words x_{i+1:n} may also be useful for prediction, as is evident by the common sliding-window approach in which the focus word is categorized based on a window of k words surrounding it. Much like the RNN relaxes the Markov assumption and allows looking arbitrarily far back into the past, the biRNN relaxes the fixed window size assumption, allowing to look arbitrarily far at both the past and the future within the sequence.
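A minimal NumPy sketch of the bidirectional idea described above (a simple RNN cell stands in for the actual architecture; names and sizes are illustrative): one RNN reads left-to-right, another right-to-left, and word i is represented by the concatenation of its past and future summaries.

import numpy as np

def run_rnn(xs, W_h, W_x, b):
    # return the hidden state after reading each prefix of xs
    h, states = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

def birnn_states(xs, fwd_params, bwd_params):
    fwd = run_rnn(xs, *fwd_params)               # summaries of x_1..x_i
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # summaries of x_i..x_n
    # word i is represented by both its past and its future in the sequence
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(1)
sentence = [rng.normal(size=4) for _ in range(6)]
params = lambda: (rng.normal(size=(5, 5)), rng.normal(size=(5, 4)), np.zeros(5))
reps = birnn_states(sentence, params(), params())  # each rep would feed a tagger, e.g. an MLP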
Word Senses
Different word senses of the same word can denote different concepts.
Vauquois Triangle
Each of semantics, syntax, and phrases (morphology) is related to the other two; changing one can change the others.
Word Alignment: Step 1
Figuring out word-to-word translations
Noisy Channel Model for MT
Given an observed sentence in the source language, recover the target-language sentence most likely to have produced it through the (noisy) translation channel.
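In the usual notation (f the observed source-language sentence, e a candidate target-language sentence), the decoder searches for

e* = argmax_e P(e | f) = argmax_e P(e) * P(f | e)

where P(e) is the target-side language model (fluency) and P(f | e) is the translation model (faithfulness).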
Computational Semantics
How do we compute language meaning from word meaning?
Skip-Gram model
Input is a single word in one-hot representation; output is the probability of seeing any single word as a context word.
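A minimal NumPy sketch of the skip-gram forward pass described here (vocabulary size, dimensions, and names are illustrative): the one-hot input selects the target embedding, and a softmax over the output matrix gives the probability of every vocabulary word appearing as a context word.

import numpy as np

V, d = 10, 3                      # vocabulary size, embedding dimension
rng = np.random.default_rng(2)
W_in = rng.normal(size=(V, d))    # target embeddings: one row per word
W_out = rng.normal(size=(d, V))   # context embeddings: one column per word

def context_probs(word_id):
    x = np.zeros(V); x[word_id] = 1.0   # one-hot representation of the target word
    h = x @ W_in                        # picks out row word_id: the target embedding
    scores = h @ W_out                  # one score per possible context word
    e = np.exp(scores - scores.max())
    return e / e.sum()                  # softmax: P(context word | target word)

p = context_probs(4)   # probability of seeing each vocabulary word near word 4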
Stacked/Deep RNNs
Input is output from a previous RNN
Multi-Head Attention
Multi-head attention is a module for attention mechanisms that runs an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. All layers have output dimensionality d_model, but each linear projection can produce a different dimensionality for d_k and d_v.
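A minimal NumPy sketch of that description (single sequence, no masking; projection sizes are illustrative, with d_v = d_k): each head applies its own linear projections, runs scaled dot-product attention, and the concatenated head outputs are linearly transformed back to d_model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    # heads: list of (W_q, W_k, W_v) projection matrices, one triple per head
    outs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_o   # concatenate, then project to d_model

d_model, d_k, n_heads, n_tokens = 8, 4, 2, 5
rng = np.random.default_rng(3)
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(n_tokens, d_model))         # one row per token
Y = multi_head_attention(X, heads, W_o)          # shape (n_tokens, d_model)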
Polysemy
Multiple semantically related concepts correspond to the same word form. The verb "get" is a good example of polysemy: it can mean "procure," "become," or "understand."
Word embeddings
Representations of words in a low dimensional dense vector space; should capture relationship between word and the context in running text
MT challenges in Syntax and Morphology
Sentence word order, word order in phrases, prepositions, and case marking.
Lexical Sample task
• Small pre-selected set of target words.
• Inventory of senses for each word.
Zipfian Distribution
A few terms appear extremely often while most terms are very rare (frequency is roughly inversely proportional to rank).
GPT fine tuning
Train GPT on the language modeling objective. In the process, it learns a representation for each token in the context of the tokens to its left. The representation of the last token has seen the whole sentence, so it can serve as a representation of the entire sentence. We then take this pretrained model and fine-tune it on a specific downstream task, e.g. predicting sentiment.
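A minimal sketch of that fine-tuning recipe (the pretrained model here is a random placeholder and all names are invented; this is not the actual GPT code): take the last token's representation as a summary of the sentence, since it has seen all tokens to its left, and train a small classifier head, e.g. for sentiment, on top of it.

import numpy as np

def pretrained_lm(token_ids):
    # placeholder for a pretrained left-to-right language model:
    # returns one contextual vector per token (random here, purely for illustration)
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=(len(token_ids), 16))

def sentence_representation(token_ids):
    # in a left-to-right model, only the last token has seen the whole sentence
    return pretrained_lm(token_ids)[-1]

# fine-tuning: train a small head (here a single logistic-regression layer)
# on top of that representation for the downstream task, e.g. sentiment
w, b = np.zeros(16), 0.0

def predict_sentiment(token_ids):
    z = w @ sentence_representation(token_ids) + b
    return 1.0 / (1.0 + np.exp(-z))   # probability of the "positive" class

print(predict_sentiment([12, 7, 3, 99]))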
Supervised Learning
Training data consists of training examples (x_1, y_1), ..., (x_n, y_n), where x_i is an input example (a d-dimensional vector of attribute values) and y_i is the label.
Lexical substitution
Two lexemes are synonyms if they can be substituted for each other in a sentence, such that the sentence retains its meaning (truth conditions).
Synonyms
Two lexemes refer to the same concept.
• couch/sofa
• vomit/throw up
• car/automobile
• hazelnut/filbert
• water/H2O
Acceptor
We decide our output only based on the final output vector y_n. For example, consider training an RNN to read the characters of a word one by one and then use the final state to predict the part-of-speech of that word. Typically, the RNN's output vector y_n is fed into a fully connected layer or an MLP, which produces a prediction. The error gradients are then backpropagated through the rest of the sequence.
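A minimal NumPy sketch of this acceptor setup (character vectors, sizes, and names are illustrative; training and backpropagation are omitted): the RNN reads the characters one by one and a linear layer over the final state scores the part-of-speech tags.

import numpy as np

def rnn_final_state(char_vectors, W_h, W_x, b):
    # read the characters one by one; only the final state is used
    h = np.zeros(W_h.shape[0])
    for x in char_vectors:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

def predict_pos(char_vectors, W_h, W_x, b, W_out):
    y_n = rnn_final_state(char_vectors, W_h, W_x, b)
    scores = W_out @ y_n            # one score per POS tag
    return int(np.argmax(scores))   # training would backpropagate through the whole sequence

rng = np.random.default_rng(4)
word = [rng.normal(size=6) for _ in range(4)]   # 4 characters, toy 6-dim char vectors
W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 6)), np.zeros(8)
W_out = rng.normal(size=(5, 8))                 # 5 POS tags
tag = predict_pos(word, W_h, W_x, b, W_out)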
Lexical Semantics
What is the meaning of individual words?
WordNet
WordNet is a lexical database containing English word senses and their relations.
• Represents word senses as synsets: sets of lemmas that have synonymous lexemes in one context.
Semantic Similarity
Words that can be substituted for one another.
Semantic Relatedness
Words that occur near each other, but are not necessarily similar.
Parallel Corpora
A type of multilingual corpus that consists of two or more texts in different languages that are aligned at the sentence or phrase level.
Gloss
A dictionary definition of a word sense.
Good translation needs to be...
faithful and fluent
Attention as Lookup
Find the key that best matches the query; if multiple keys match, take a weighted combination of their values. In encoder-decoder attention, the queries are the hidden states of the decoder, and the keys and values are the hidden states of the encoder.
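One common way to write this soft lookup is

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Here each row of Q (a query) is scored against every row of K (the keys), the softmax turns the scores into weights, and the output is the corresponding weighted combination of the rows of V (the values).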
Recurrent neural networks ...
take the entire history into account
Skip-Gram (Target Embedding)
Takes as input a word from a text corpus and tries to predict the surrounding words within a certain window of context. The model is trained on a large corpus of text and learns to represent each word as a vector in a shared dense vector space. The target embedding is a row in the input matrix.
Zeugma
When a single word is used with two other parts of a sentence but must be understood in a different sense in relation to each. "Does United serve breakfast and JFK?" "He lost his gloves and his temper."
The problem with RNNs for MT:
• For long phrases, the fixed-length encoded representation becomes an information bottleneck.
• Not everything in the input sequence is equally important to predict each word in the decoder.
Extensions to Lesk Algorithm
• Often the available definitions and examples do not provide enough information; overlap is 0.
• Different approaches to extending definitions:
  • "Corpus-Lesk": use a sense-tagged example corpus and add context from example sentences.
  • Extended Gloss Overlap (Banerjee & Pedersen 2003): add glosses from related words (hypernyms, meronyms, ...).
  • Use embedded representations for each word in the definition; choose the sense with the highest average vector similarity to the context.
Hyponymy
• One sense is a hyponym (or subordinate) of another sense if the first sense is more specific, denoting a subclass of the other (IS-A relationship).
• dog is a hyponym of mammal.
• mammal is a hyponym of animal.
• desk is a hyponym of furniture.
• sprint is a hyponym of run.
• The inverse relation is called hypernymy, so furniture is a hypernym (or superordinate) of desk.
Meronymy
• Part-whole relationship.
• A meronym is a part of another concept.
• leg is a meronym of chair.
• wheel is a meronym of car.
• cellulose is a meronym of paper (substance meronymy).
• The inverse relation is holonymy: car is a holonym of wheel.
Antonyms
• Senses are opposites with respect to one specific feature of their meaning.
• Otherwise, they are very similar!
• dark / light (level of luminosity)
• short / long (length)
• hot / cold (temperature)
• rise / fall (direction)
• front / back (relative position)
Antonyms typically describe opposite ends of a scale, or opposite direction/position with respect to some landmark (reversives).
Simplified Lesk Algorithm
• Use dictionary glosses for each sense.
• Choose the sense that has the highest word overlap between gloss and context (ignore function words).
Example context: "The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities."
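A minimal sketch of the simplified Lesk procedure on that example (the glosses and stopword list are hand-written stand-ins; a real system would take glosses from WordNet and use a proper stopword list):

STOPWORDS = {"the", "a", "an", "of", "in", "it", "will", "can", "that", "because", "to"}

SENSES = {
    "bank_financial": "a financial institution that accepts deposits and channels money into lending",
    "bank_river": "sloping land beside a body of water",
}

def simplified_lesk(context_sentence, senses):
    context = {w for w in context_sentence.lower().split() if w not in STOPWORDS}
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        gloss_words = {w for w in gloss.lower().split() if w not in STOPWORDS}
        overlap = len(context & gloss_words)   # count shared content words
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

sentence = ("The bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable-rate mortgage securities")
print(simplified_lesk(sentence, SENSES))   # -> bank_financial (shared word: "deposits")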