EXAM 1: Word2Vec, Language Models, RNNs


Feed Forward Neural Network Cons

- Fixed window: the window is still small - Increasing the window size increases the number of parameters to be estimated - We need a network architecture that can process a sequence of any length - While traditional feedforward neural networks are effective for tasks like classification and regression, they are limited in their ability to handle sequential data.

1. Words to One-Hot Vectors: Embedding Layer

- Given a sequence of words, such as "the students opened their," each word is converted into a one-hot vector representation. Each word in the vocabulary has a corresponding index, and the one-hot vector has a 1 at the index of the word and 0s elsewhere.
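
A minimal sketch of this lookup in Python (toy vocabulary, random embedding matrix; all values are illustrative): multiplying a one-hot vector by the embedding matrix E is equivalent to selecting a row of E.

```python
import numpy as np

# Toy vocabulary and embedding matrix (values are illustrative, not trained).
vocab = ["the", "students", "opened", "their"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 3                 # vocabulary size, embedding dimension
E = np.random.rand(V, d)             # embedding matrix: one row per word

def one_hot(word):
    vec = np.zeros(V)
    vec[word_to_idx[word]] = 1.0
    return vec

x = one_hot("students")
embedding = x @ E                    # identical to E[word_to_idx["students"]]
```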

4. Combining Embeddings:

Once all character n-gram embeddings are retrieved, FastText combines them to calculate the embedding for the entire word. - This combination typically involves summing up the embedding vectors for all character n-grams. - Additionally, FastText may include the embedding vector for the whole word itself as part of the sum. - The bucket embedding table has size B × d (where B is the number of buckets and d is the embedding size), but because the combination is a sum of d-dimensional vectors, the resulting embedding for the word "eating" is a single d-dimensional vector.
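
A minimal sketch of the combination step, assuming hypothetical bucket ids for the n-grams of "eating" and random vectors; since the combination is a sum of d-dimensional vectors, the result is d-dimensional.

```python
import numpy as np

d, B = 4, 10_000                                  # embedding size, number of buckets (illustrative)
ngram_embeddings = np.random.rand(B, d)           # one d-dimensional vector per bucket
whole_word_vector = np.random.rand(d)             # optional vector for the whole word "eating"

def fasttext_word_vector(ngram_bucket_ids, whole_word_vec=None):
    """Sum the d-dimensional vectors of all character n-gram buckets (plus the word vector)."""
    vec = ngram_embeddings[ngram_bucket_ids].sum(axis=0)
    if whole_word_vec is not None:
        vec += whole_word_vec
    return vec                                    # d-dimensional result

# Hypothetical bucket ids for "<ea", "eat", "ati", "tin", "ing", "ng>"
vec = fasttext_word_vector([17, 93, 5, 402, 77, 12], whole_word_vector)
```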

One-Hot Encoded

One-hot vectors represent words as discrete symbols - embedding dimension = vocab size - the vectors are far too large - the vectors know nothing about meaning

2. Concatenated Word Embeddings:

Each one-hot vector is then mapped to a dense word embedding vector. - These embeddings capture the semantic and syntactic information of the words in the context. - The embeddings for all words in the fixed-size window are concatenated into a single vector.

4c. Probability Distribution

Finally, we obtain the probability distribution for the next token given the input context "I saw a cat on a ___". This probability distribution, denoted P(* | I saw a cat on a), provides the likelihood of each token in the vocabulary being the next word in the sequence, based on the learned representation of the input context.

3b. Linear Layer

The vector representation h of the context is then passed through a linear layer. This layer applies a linear transformation to the vector h, mapping it to a vector of size |V|, where |V| is the size of the vocabulary. Each element in this vector corresponds to the logit (raw score) for each token in the vocabulary.

Dense Vectors Advantages

They require fewer parameters to tune in machine learning models. They can generalize better and capture more complex relationships between words. They are more effective at capturing semantic similarity, such as synonyms.

Purpose of W2V

This approach essentially learns word embeddings indirectly by optimizing a classifier for a specific task related to word context prediction. While the binary prediction task itself might not be of direct interest, the learned embeddings can still effectively capture semantic relationships between words and be used for downstream natural language processing tasks.

Latent Semantic Analysis (LSA)

The word-document co-occurrence matrix is often used in techniques like LSA, where the matrix is subjected to dimensionality reduction to uncover latent semantic relationships between words and documents.

word embeddings

Word embeddings are numerical representations of words in a continuous vector space. Each word is mapped to a high-dimensional vector, where words with similar meanings are closer together in this space.

fixed window neural language model

type of neural network-based language model that operates on fixed-size windows of words from the input text. Like traditional n-gram models, the context size is fixed, but the model uses continuous vector representations (embeddings) of words and learns to predict the probability distribution of the next word given a fixed-size window of previous words. - fixed window length - not the model we will ultimately use - an RNN can process a sequence of any length

Singular Value Decomposition (SVD)

method for dimensionality reduction on the co-occurrence matrix X. - To reduce the dimensionality of the data while preserving the most important information, we can retain only the first k singular values and corresponding singular vectors. - selecting smaller values of k - lower dimensional representation - This reduction helps in generalizing the data and removing noise or irrelevant features, making the resulting representation more compact and efficient.
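
A minimal numpy sketch of truncated SVD on a tiny co-occurrence matrix (the matrix values are illustrative):

```python
import numpy as np

# X: toy word-word co-occurrence matrix (rows and columns index words).
X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

U, S, Vt = np.linalg.svd(X)                        # full SVD: X = U @ diag(S) @ Vt

k = 2                                              # keep only the k largest singular values
word_vectors = U[:, :k] * S[:k]                    # k-dimensional word representations
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation of X
```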

Word2Vec

popular technique used to generate word embeddings. - It's a self-supervised learning approach because it learns from the data without explicit supervision, typically by predicting the context of a word given its neighboring words in a large corpus of text. - trains very fast - code is available on the web - predict rather than count

Markov Assumption

simplifies the modeling of sequences by stating that the probability of a word at position t (y_t) depends only on a fixed number of previous words (y_{t-1}, y_{t-2}, ..., y_{t-n+1}), rather than the entire history of preceding words.
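
In symbols (a sketch consistent with the card's notation):

```latex
P(y_t \mid y_1, \dots, y_{t-1}) \approx P(y_t \mid y_{t-n+1}, \dots, y_{t-1})
```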

Solution - Frequent Words

subsampling: - addresses the frequent-word problem - selectively removes instances of frequent words from the training data to reduce their disproportionate influence and improve the quality of learned word embeddings - Frequent words are considered for removal: the probability of retaining a word in the corpus is calculated using a formula that depends on the word's frequency. - When training the word embeddings model, instances of words that have been subsampled are excluded from the context windows of neighboring words.

GloVe Objective Function

the GloVe model optimizes an objective function that balances the importance of different co-occurrence counts using a weighting function f(x). This weighting function down-weights very rare (noisy) co-occurrences and caps the influence of very frequent ones, leading to word embeddings that effectively capture the semantic relationships between words across the entire corpus.
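
For reference, the standard form of the GloVe objective, where w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are biases, X_{ij} is the co-occurrence count, and α is typically 3/4:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```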

Tf-idf

(Term Frequency-Inverse Document Frequency) and one-hot encodings are methods used to represent text data. Both of these representations result in sparse vectors, where most elements are zero. - vectors are long and sparse because they represent the occurrence of words in a document collection, with each dimension corresponding to a unique word in the vocabulary. - long (length |V|= 20,000 to 50,000) sparse (most elements are zero)

3. embedding look up associated with bucket for FastText

- After the character n-gram is assigned to a bucket, FastText retrieves the corresponding embedding vector associated with that bucket. - Each bucket has an associated embedding vector, and these vectors are learned during the training process. - During inference or training, when FastText encounters a character n-gram, it uses the hashing process to determine the bucket and retrieves the corresponding embedding vector for that n-gram.

Local Context Window Methods

- At the same time, GloVe also incorporates the local context window approach used in methods like Skip-Gram and CBOW. - It considers the context words surrounding each target word within a certain window, which helps capture local syntactic and semantic information.

randomly shrinking context window - context position weighting

- By randomly shrinking the size of the context window, we introduce variability into the selection of context words for each target word. - If the original window size is 5, we may randomly select a smaller window size (e.g., 3) with equal probability. - This means that words closer to the target word have a higher chance of being selected as positive samples, because they have a higher probability of being included in the context window when the window size is reduced. - This weighting towards nearby words can help the model capture more local context information and potentially improve the quality of learned word embeddings, especially for tasks where local context is important.
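
A minimal sketch of window shrinking, mirroring how Word2Vec-style implementations sample an effective window size uniformly between 1 and the maximum (function and variable names are illustrative):

```python
import random

def sample_context(tokens, center_idx, max_window=5):
    """Draw a window size uniformly from 1..max_window, then return the context words.
    Words near the center end up in the context more often than distant ones."""
    b = random.randint(1, max_window)
    left = max(0, center_idx - b)
    right = min(len(tokens), center_idx + b + 1)
    return [tokens[i] for i in range(left, right) if i != center_idx]

tokens = "i saw a cute grey cat playing in the garden".split()
context = sample_context(tokens, tokens.index("cat"), max_window=5)
```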

GloVe (Global Vectors) model

- Combines the advantages of the two major model families in the literature: 1. global matrix factorization and 2. local context window methods - It leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. - not biased towards frequent words or sparse regions of the matrix - By combining global matrix factorization with local context window methods, GloVe achieves a balance between capturing global semantic relationships and local contextual information. - It produces word embeddings that are effective for various NLP tasks, including word similarity, analogy detection, and text classification.

GloVe Overview

- Count-based model using aggregated global word co-occurrence statistics from the entire corpus - Context Window: fixed size, used to build the co-occurrence matrix, but statistics are aggregated across the entire corpus - Cannot handle OOV words without additional methods - Morphological Richness: Poor (does not consider subword information) - Semantic Relationships: Good (captures both global co-occurrence and local context semantics) - Best for: tasks requiring understanding of global word usage patterns - Strengths: captures global statistics efficiently, good for word analogy and similarity - Weaknesses: requires significant memory for the co-occurrence matrix, less effective for morphologically rich languages

7. Training Update

- During training, the model adjusts the embedding vectors to minimize the difference between the sigmoid outputs for actual context words and negative samples, moving them closer to the desired targets of 1 and 0, respectively. - This is typically done using techniques like gradient descent, where the gradients of the loss with respect to the embedding vectors are calculated and used to update the embeddings iteratively.

what is updated in FastText?

- Embedding Vectors: The embedding vectors for both the target words and the context words (positive samples) are updated. Each word in the vocabulary has its own embedding vector, and these vectors are adjusted during training to better capture the semantic relationships between words. - Embedding Lookup Table: FastText maintains an embedding lookup table, where each word in the vocabulary is associated with its corresponding embedding vector. During training, the values in this lookup table are updated based on the gradients calculated during the optimization process. - Negative Sampling Parameters: In negative sampling, negative samples (randomly chosen words from the vocabulary) are used to train the model to distinguish between positive and negative examples. Parameters related to negative sampling, such as the choice of negative samples and the number of negative samples per positive example, may also be adjusted during training. - Model Weights: Other model weights and parameters, such as those used in the scoring function (e.g., weights in the dot product calculation) or the parameters of the sigmoid activation function, may also be updated during training to improve the model's performance.

Cons of FastText

- FastText can be slower to train than regular skip-gram. - Subwords do not improve representations of common words (e.g., "the") and may sometimes lead to slightly worse representations - already well-represented words may not benefit. - The process of generating and hashing character n-grams increases training time, especially for large datasets and vocabularies.

5. Center Word Embedding

- Finally, the calculated sum of embedding vectors represents the embedding for the center word. This embedding captures the morphological and semantic information of the word based on its constituent character n-grams.

Word Embeddings Cultural Bias

- In this case, if "father" is closely associated with "doctor", the model might associate "mother" with occupations traditionally associated with women, such as "nurse". - If "man" is closely associated with "computer programmer" and not "homemaker", the model might associate "woman" with occupations stereotypically associated with women, such as "homemaker".

Issues with Increasing Window Size in fixed window neural language model

- Increasing the window size results in more weights to be estimated (from the output layer to the input layer), which increases the complexity of the model. - The number of inputs to the model depends on the number of words in the fixed-size window, which can lead to scalability issues with large window sizes. - The fixed window size restricts the model to a fixed sequence length, which may limit its ability to capture long-range dependencies in the data. - Increasing the window size can help to some extent but may lead to sparsity issues and computational challenges.

2. Word- Document

- Instead of considering local context windows, this approach focuses on the co-occurrence of words within entire documents. - A co-occurrence matrix is constructed, where rows represent words and columns represent documents. - Each entry in the matrix represents the frequency of a word occurring in a particular document. - This method is useful for capturing more general topics or themes present in documents. - For example, if multiple sports-related terms frequently co-occur in the same documents, they will have similar entries in the matrix, indicating a general association with the topic of sports.

Count-Based Methods

- LSA (Latent Semantic Analysis), - HAL (Hyperspace Analogue to Language), - COALS (Correlated Occurrence Analogue to Lexical Semantics) - These methods are based on analyzing co-occurrence statistics within a corpus. - They use matrix factorization techniques, such as Singular Value Decomposition (SVD), to capture latent semantic relationships between words. - Count-based methods are efficient in training and utilize corpus statistics effectively. - They primarily focus on capturing word similarity and semantic relationships. - However, they may suffer from a disproportionate emphasis on large counts, potentially overlooking important patterns in low-frequency terms.

Cons of W2Vec & GloVe

- OOV Words: Word2Vec and GloVe can't create embeddings for words not in their training data, a challenge for the ever-changing vocabulary of human languages. - Polysemy: Both assign one vector per word, unable to differentiate meanings for words like "bank," which can signify a river's edge or a financial institution. - Static Representations: Word2Vec and GloVe produce fixed embeddings, not adapting to varying word meanings in different contexts. - Word Order Sensitivity: These models miss the word order, so phrases like "free drug" and "drug free" might get similar representations despite contrasting meanings.

Selecting Negative Samples Based on Unigram Frequency P(w)

- One approach to selecting negative samples is to choose words according to their unigram frequency P(w). This means that more common words are more likely to be selected as negative samples.

Word2Vec Overview

- Predictive model using local context windows - Context Window: fixed size - Cannot handle OOV words - Morphological Richness: Poor (does not consider subword information) - Semantic Relationships: Good (especially syntactic analogies) - Best for: general NLP tasks, syntactic relationships - Strengths: efficient, well-researched, good at capturing syntactic relationships - Weaknesses: struggles with OOV words, does not use global statistics

FastText Overview

- Predictive model using local context windows and subword information - Context Window: fixed size, but includes character n-grams - Can handle OOV words through subword information - Morphological Richness: Excellent (captures morphological information) - Semantic Relationships: Good (captures morphological semantics) - Best for: morphologically rich languages, text classification - Strengths: handles OOV and morphologically complex words, fast predictions - Weaknesses: larger model size, potential noise from subwords

Direct-Prediction Methods

- Skip-Gram and CBOW: These are models developed as part of Word2Vec, which utilize neural networks for language modeling. - NNLM (Neural Network Language Model), HLBL (Hierarchical Log-Bilinear Language Model), and RNN (Recurrent Neural Network) models also fall under this category. - These methods scale well with corpus size and can handle large datasets efficiently. - They may not utilize corpus statistics as effectively as count-based methods, but they can capture complex patterns beyond word similarity, supporting tasks such as text classification, sentiment analysis, and machine translation. - However, they can be computationally intensive and may require significant resources for training, especially for large-scale datasets. - scales with corpus size - makes inefficient use of corpus statistics

Feedback Loop

- The key idea behind recurrent neurons and RNNs is the feedback loop that allows information to persist over time. This feedback loop enables the network to maintain a memory of past inputs and use that information to influence future predictions or decisions. - By feeding the output of the network back into itself as an input at the next time step, recurrent neurons create a mechanism for capturing temporal dependencies and processing sequential data.

sampling rate

- The sampling rate formula adjusts the probability of keeping a word based on its frequency in the corpus. - More frequent words have lower probabilities of being kept, while less frequent words have higher probabilities. - The provided probabilities for the top 3 most frequent words illustrate this principle: "the" has the lowest probability of being kept, followed by "of" and "and," reflecting their decreasing frequencies in the corpus.
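
A small sketch of the keep probability, using the subsampling formula from the Word2Vec paper, P(discard w) = 1 − sqrt(t / f(w)) with threshold t ≈ 1e-5; the relative frequencies below are illustrative, not real corpus statistics.

```python
import math

def keep_probability(word_freq, t=1e-5):
    """Probability of retaining an occurrence of a word with relative frequency word_freq."""
    return min(1.0, math.sqrt(t / word_freq))

# Illustrative relative frequencies: frequent words are kept rarely, rare words always.
for word, freq in [("the", 0.05), ("of", 0.03), ("and", 0.02), ("apricot", 1e-6)]:
    print(word, round(keep_probability(freq), 4))
```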

2. Many to One Mapping

- This mapping is typically done using a modulo operation. The hash code is divided by the total number of buckets, and the remainder determines which bucket the character n-gram will be assigned to. - For example, if FastText has 1,000 buckets and the hash code is 3,542, the modulo operation would result in the remainder 542, and the character n-gram would be assigned to bucket 542. - multiple different character n-grams may be mapped to the same bucket due to collisions. - For each character n-gram, FastText retrieves its corresponding embedding vector from the bucket it's mapped to.
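
A minimal sketch of the bucket assignment; Python's built-in hash() stands in for FastText's actual hash function (an assumption made purely for illustration):

```python
def bucket_for_ngram(ngram, num_buckets=1000):
    """Map a character n-gram to a bucket id via hashing followed by modulo."""
    return hash(ngram) % num_buckets   # hash() is a stand-in for FastText's real hash function

# Different n-grams can collide in the same bucket; colliding n-grams share an embedding vector.
for ng in ["<ea", "eat", "ati", "tin", "ing", "ng>"]:
    print(ng, bucket_for_ngram(ng))
```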

1. Training Data: Labeled Data for Specific Tasks

- This type of data consists of labeled examples relevant to a particular task, such as sentiment analysis for movie reviews. - The dataset is relatively small, with around 2,000 labeled reviews and approximately 1.5 million words. - This labeled data is used to train a supervised model, where the goal is to learn a mapping from input features (e.g., words in movie reviews) to output labels (e.g., sentiment labels). - The supervised model learns to make predictions based on the labeled examples, and the training process requires manual labeling of data for each task.

Embeddings & Analogies

- We start with the vector representation of "king" and subtract the vector representation of "man" (removing the male gender component) and add the vector representation of "woman" (adding the female gender component). - The resulting vector is expected to represent a word that shares a similar relationship with "woman" as "king" does with "man", which in this case is "queen". - In this analogy, we're exploring the relationship between cities and their respective countries. We start with the vector representation of "Paris" and subtract the vector representation of "France" (removing the association with France) and add the vector representation of "Italy" (adding the association with Italy). - The resulting vector is expected to represent a city that shares a similar relationship with "Italy" as "Paris" does with "France", which in this case is "Rome".
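
A toy sketch of solving analogies with vector arithmetic and cosine similarity; the embedding table below is random and purely illustrative (with real trained vectors, the nearest word to king − man + woman tends to be "queen"):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, embeddings):
    """Solve a : b :: c : ?  by finding the word whose vector is closest to b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Random vectors stand in for trained embeddings here.
emb = {w: np.random.rand(50) for w in
       ["king", "queen", "man", "woman", "paris", "france", "italy", "rome"]}
print(analogy("man", "king", "woman", emb))   # with trained vectors this should return "queen"
```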

Cons of SVD

- While SVD provides an effective way to reduce dimensionality, it can be computationally expensive for large matrices, especially when dealing with very large datasets or high-dimensional feature spaces. - Computing the full SVD requires a significant amount of memory and computation time, making it impractical for large-scale applications.

semantic ambiguity

- Word2Vec may struggle with words that have multiple meanings or senses (polysemy) and fail to capture their diverse semantic nuances. - For example, "bank" can refer to a financial institution or the side of a river, and Word2Vec may not accurately represent both meanings. - Contextualized word embedding models like BERT or ELMo may handle this better.

Recurrent Neuron

- a neural network unit that has a connection to itself. - This means that the output of the neuron at time step t can be used as an input for the same neuron at the next time step t+1. - When processing sequential data, such as text or time series, recurrent neurons allow the network to maintain a memory of previous inputs. - This memory is captured by the hidden state of the neuron, which gets updated over time as the sequence is processed.

1. Window-Based

- a sliding window of fixed size is moved across the text corpus. - For each word in the corpus, the co-occurrence counts with other words within the window are recorded. - This method captures both syntactic and semantic information, as words that frequently co-occur within the same window are likely to have some semantic or syntactic relationship. - It's similar to how Word2Vec operates, where the model learns from the context words within a window around each target word.

benefits of subsampling

- allowing the model to focus more on learning representations for less frequent and potentially more informative words. - helps generalization

Intrinsic Evaluation of Word Embeddings

- assessing their performance on specific linguistic tasks that directly measure the quality of the embeddings in capturing semantic and syntactic similarities and relationships between words. - WordSim-353: compares the cosine similarity between the embeddings of word pairs to human similarity scores. - SimLex-999: rates the similarity of word pairs based on their context-independent similarity rather than relatedness. - Stanford Contextual Word Similarity: contains word pairs within sentences, and the task is to predict the similarity of the words in the given context. - TOEFL dataset: given a word, the task is to identify the word most similar in meaning from a set of options (synonym identification), e.g., "levied" is closest in meaning to: imposed, believed, requested, correlated.

How is P(w) calculated?

- Apply the chain rule and estimate each conditional probability from counts: the count of occurrences of the entire sequence in the corpus / the count of occurrences of the sequence without the target word. - P(the | its water is so transparent that) = count(its water is so transparent that the) / count(its water is so transparent that) - Multiply all the conditional probabilities together.
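
A tiny sketch of this count-based (maximum likelihood) estimate on a toy corpus; the helper function is hypothetical and written for illustration only.

```python
def conditional_probability(context, word, tokens):
    """MLE estimate: count(context followed by word) / count(context followed by anything)."""
    n = len(context)
    ngrams = [tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n)]
    contexts = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n)]
    return ngrams.count(tuple(context) + (word,)) / contexts.count(tuple(context))

corpus = "its water is so transparent that the water is so transparent".split()
print(conditional_probability(["is", "so", "transparent"], "that", corpus))
```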

Probabilistic Language Modeling

- compute the probability of a given sentence or sequence of words, denoted as P(W) = P(w_1, w_2, w_3, w_4, w_5, ..., w_n). - Additionally, estimate the probability of an upcoming word given the preceding context, expressed as P(w_n | w_1, w_2, w_3, ..., w_{n-1}). - learning the probability distributions of word sequences - by calculating the probability of different word sequences, the model can estimate the likelihood of a given sentence being grammatically correct and semantically coherent - helps systems prefer better grammar - language models are a standard subcomponent of many NLP systems

why can't we increase n?

- leads to a sparsity problem - it's common for many possible n-grams to never occur, especially as n increases. - This means that the numerator (count of a specific n-gram) or denominator (count of the context) in the probability estimation formula can be zero for many n-grams. - When the numerator or denominator is zero, it results in zero probability estimates for those n-grams, which can lead to underfitting and poor generalization. - This sparsity problem becomes more severe as n increases, because the likelihood of encountering every possible n-gram in the training data decreases exponentially with larger n.

what happens when your training data is not diverse?

- leverage dense embeddings - the data may not be enough to train embeddings from scratch - a pretrained model like W2V or GloVe may know relationships between words but is not task specific - use pre-trained word embeddings (e.g., trained on Wikipedia) --> initialize with a pre-trained model and fine-tune on the task - Apply transfer learning by fine-tuning the pre-trained word embeddings on the task-specific data (to adapt and learn task-specific nuances and patterns). - provides a strong foundation for understanding language semantics and syntax

Benefit of W2V

- leveraging running text as implicitly supervised training data - the occurrences of words near "apricot" act as implicit labels or "gold 'correct answer'" to the question of whether other words are likely to show up near "apricot." - avoid the need for hand-labeled supervision - you can utilize the vast amount of text available on the internet or in corpora as a source of training data. - where the model learns from the inherent structure of the data itself

out-of-vocabulary (OOV) Words

- misspellings - domain specific terms - Techniques like subword tokenization (e.g., Byte Pair Encoding or WordPiece) or using character-level embeddings can mitigate this issue by representing OOV words based on subword units or character sequences.

Benefits of FastText

- more useful for learning embeddings for rare words and can deal with OOV (out-of-vocabulary) words - Its ability to capture morphological variations through sub-word embeddings enhances its performance on tasks involving syntactic relationships between words (for morphologically rich languages like Czech and German) - However, FastText has degraded performance on semantic analogy tasks compared to Word2Vec.

Skip-Gram Intuition

- predict the context words given a central target word. - Given a sequence of words in a sentence, the model takes the central word (e.g., "cat") and predicts the surrounding context words (e.g., "cute", "grey", "playing", "in") one at a time. - Each word in the vocabulary has its own vector representation (embedding). The model learns these embeddings by maximizing the probability of predicting the context words given the central word. - effective in capturing the syntactic and semantic relationships between words by learning from the co-occurrence patterns of words in a large corpus. - used when dataset is large & diverse

Dense Vectors

- preferred in some cases because short vectors may be easier to use as features in machine learning because they typically require fewer weights to tune. - may generalize better than sparse vectors because they can capture more complex relationships between words. - can better capture semantic similarity between words, such as synonyms like "car" and "automobile," which may not be captured well in sparse representations.

Cons of Word2Vec

- relies on pre-defined vocabularies, meaning it may struggle with out-of-vocabulary (OOV) words not present in its training data - doesn't capture morphology - doesn't work well with semantic ambiguity - works well for moderate-sized vocabularies but becomes inefficient for large vocabularies

Global Matrix Factorization:

- similar to count-based methods like LSA and HAL. - It decomposes the word-word co-occurrence matrix into low-dimensional word vectors, capturing latent semantic relationships between words in the vector space.

RNN

- specifically designed to work with sequential data. They excel at tasks where the order of inputs matters, such as time series analysis, natural language processing, language models - Unlike feedforward networks, which process each input independently, RNNs maintain an internal state that evolves over time as they process each element of the sequence. - we need the neuron to know about its previous history of outputs - One easy way to do this is to simply feed its output back into itself as an input!

storage problem in n gram language models

- storage can become a significant issue due to the need to store counts for all observed n-grams in the corpus. - the model needs to store a count for every individual n-gram; as n increases, the number of n-grams increases - this increases memory usage and model size - increasing the corpus size also increases model size

skip gram model with negative sampling

- uses a small subset of randomly selected negative samples instead of computing against all other word embedding representations

Continuous Bag of Words (CBOW) Intuition

- the objective is to predict the central target word given the surrounding context words. - Instead of predicting context words from a central target word, CBOW predicts the central target word based on the sum of the embeddings of its surrounding context words. For example, given the context words "cute", "grey", "playing", "in", CBOW predicts the central word "cat". - computationally efficient and tends to work well with smaller datasets. - It's particularly useful when the context words provide a strong signal about the central target word.

Morphology

- Word2Vec treats each word as a discrete entity and doesn't inherently capture morphological variations (e.g., verb tense, plurals). - "eat", "eats", "eaten", "eater", and "eating" are treated as distinct words, which may not fully leverage their morphological relationships. - More advanced models like FastText address this limitation by considering subword information, allowing them to better handle morphological variations.

2. Training Data: General Text Data

- vast amount of text collected from sources like Wikipedia, the web, and books, amounting to trillions of words. - this data is not annotated with task-specific labels. - Instead, it represents a broad and diverse range of linguistic patterns and concepts. - used to train word embeddings or word vectors, through self-supervised learning. - In self-supervised training, the model learns to predict certain aspects of the input text (e.g., predicting the context words given a target word) without the need for manually labeled data. - By leveraging the rich contextual information present in the text data, self-supervised learning enables the model to capture semantic and syntactic relationships between words in a distributed representation space.

Improvements to the Fixed Window Neural Language Model

- because the model does not depend on counts, it avoids the sparsity and storage problems of n-gram models

Problem - Frequent Words

1. Lack of Discriminative Context: Common words like "the", "and", "is", etc., often appear in the context of many different words. A pair like ("fox", "the") doesn't provide much discriminative information about the meaning of "fox", because "the" is a very common word that can occur in various contexts. As a result, such frequent words may not contribute much to learning meaningful word embeddings. 2. Imbalance in Training Samples: we end up with far more training pairs involving "the" than we need. - Since frequent words like "the" occur very frequently in the text corpus, there will be a disproportionate number of training samples involving these words compared to less frequent words. - This leads to an imbalance in the training data, where the model is exposed to an excessive number of instances of common words, potentially overshadowing the importance of learning representations for less frequent words.

Capturing co-occurrence counts directly

1. Window-based Co-occurrence 2. Word-Document Co-occurrence

4. Skip-Gram: Embedding Extraction

After training the classifier, the weights of the classifier are used as the word embeddings. These embeddings represent the learned representations of words in a continuous vector space, where similar words have similar embeddings.

Selecting Negative Samples Based on Pα​(w):

Alternatively, it's common to select negative samples according to a modified distribution called P_α(w). - α is a hyperparameter that controls the skewness of the distribution. - The purpose of using P_α(w) is to give rare words a slightly higher probability of being selected as negative samples compared to the unigram frequency distribution. - This can help the model better distinguish between true context words and noise words.
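
A small sketch of the modified distribution, with α = 0.75 (the value commonly used with Word2Vec) and illustrative counts:

```python
import numpy as np

def negative_sampling_distribution(unigram_counts, alpha=0.75):
    """P_alpha(w) = count(w)^alpha / sum over all words of count^alpha.
    alpha < 1 boosts rare words relative to the raw unigram distribution."""
    counts = np.array(list(unigram_counts.values()), dtype=float)
    weights = counts ** alpha
    return dict(zip(unigram_counts, weights / weights.sum()))

counts = {"the": 1_000_000, "cat": 5_000, "apricot": 50}   # illustrative counts
print(negative_sampling_distribution(counts))
```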

Synonyms: How are they better than TFIDF

By considering word contexts, Word2Vec and GloVe are better at associating synonyms with similar vectors, while TF-IDF treats all words as independent

Embeddings: Similarity depends on window size C

C = ±2 The nearest words to Hogwarts: ◦Sunnydale ◦Evernight C = ±5 The nearest words to Hogwarts: ◦Dumbledore ◦Malfoy ◦halfblood

6. Scoring and Sigmoid Activation Function

For each context word and negative sample, the dot product between their embedding vectors and the embedding vector for "eating" is calculated. - This dot product represents a score indicating the similarity between the word "eating" and the context word/negative sample. - The sigmoid activation function is then applied to the score to obtain a probability value between 0 and 1. - For actual context words ("am" and "food"), the goal is to bring the sigmoid output close to 1, indicating a high similarity between "eating" and the context word. - For negative samples ("paris" and "earth"), the goal is to bring the sigmoid output close to 0, indicating a low similarity between "eating" and the negative sample.
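
A minimal numerical sketch of this scoring step, using random vectors in place of learned embeddings (all names and shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(0)
target = rng.normal(size=d)                  # embedding for "eating"
positives = rng.normal(size=(2, d))          # context words, e.g. "am", "food"
negatives = rng.normal(size=(2, d))          # negative samples, e.g. "paris", "earth"

pos_probs = sigmoid(positives @ target)      # training pushes these toward 1
neg_probs = sigmoid(negatives @ target)      # training pushes these toward 0

# Negative-sampling loss: -sum log sigma(pos score) - sum log sigma(-neg score)
loss = -np.log(pos_probs).sum() - np.log(sigmoid(-(negatives @ target))).sum()
```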

negative sampling

In SGNS, negative sampling is used as an alternative approach to approximate the softmax function. Instead of predicting all context words, negative sampling randomly samples a small number of negative context words (words that are not in the context) and trains the model to distinguish them from the true context words.

hidden state

In an RNN, each neuron (or unit) has a hidden state that represents its memory or internal state. - This hidden state evolves over time as the network processes each element of the sequence. - At each time step, the hidden state is updated based on the current input and the previous hidden state. - This allows the network to capture dependencies and patterns in the sequential data. - RNNs can capture dynamic context because their hidden state evolves over time as it processes each input token
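
A minimal sketch of a single recurrent (Elman-style) update, h_t = tanh(W_x x_t + W_h h_{t-1} + b), run over a short random input sequence (all weights are illustrative placeholders):

```python
import numpy as np

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))    # hidden-to-hidden (recurrent) weights
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    """Update the hidden state; h_prev carries the memory of everything seen so far."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_h)                        # initial hidden state
for x_t in rng.normal(size=(5, d_in)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h)
```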

Extrinsic Evaluation of Word Embeddings

Instead of evaluating the embeddings directly on word similarity tasks, extrinsic evaluation measures how well the embeddings contribute to the performance of a downstream task, such as classification with predefined labels. - By incorporating word embeddings as features, the classification model can leverage the rich semantic information encoded in the embeddings. - This also tests how well the embeddings generalize.

Chain Rule

The chain rule states that the joint probability of a sequence of events can be decomposed into the product of conditional probabilities of each event given the previous events in the sequence.
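
In symbols:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```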

Sparse Vectors

Sparse vectors contain mostly zero values, while dense vectors have non-zero values for most elements. - one hot encoded variables - TF-IDF

3. Hidden Layer: Linear Layer

The concatenated word embeddings are fed into a hidden layer of neurons. This layer applies a linear transformation (multiplying by a weight matrix W and adding a bias vector b_1) followed by a non-linear activation function f. The output of this layer is the hidden representation h of the input window. - linear layer followed by nonlinearity - The input word embeddings are passed through a feedforward neural network. - This neural network consists of one or more hidden layers of neurons, which apply linear transformations and non-linear activation functions to the input embeddings. - The output of the neural network is a vector representation of the context, denoted as h. - This vector h captures the learned representation of the input context in a d-dimensional space.

4. Output Distribution:

The hidden representation h is then fed into an output layer. This layer applies another linear transformation (multiplying by a weight matrix U and adding a bias vector b_2) to produce the logits for the output distribution. - # neurons in the last linear layer = # outputs - # outputs = vocabulary size - cross-entropy loss is used to train the model
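
A compact sketch of the full fixed-window forward pass (concatenate embeddings → hidden layer → logits → softmax); all sizes, weights, and word ids below are illustrative placeholders, not trained parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

V, d, n, d_h = 10_000, 50, 4, 128    # vocab size, embedding dim, window size, hidden dim
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))          # embedding matrix
W = rng.normal(size=(d_h, n * d)); b1 = np.zeros(d_h)
U = rng.normal(size=(V, d_h));     b2 = np.zeros(V)

def forward(window_word_ids):
    e = np.concatenate([E[i] for i in window_word_ids])   # concatenated embeddings
    h = np.tanh(W @ e + b1)                                # hidden representation
    logits = U @ h + b2                                    # one score per vocabulary word
    return softmax(logits)                                 # probability of the next word

y_hat = forward([17, 204, 999, 3])   # hypothetical ids for "the students opened their"
```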

5. Training:

The model is trained using a training corpus with labeled data. The parameters (weights and biases) of the model are adjusted iteratively to minimize a loss function, typically the cross-entropy loss, which measures the difference between the predicted probability distribution ŷ and the actual labels.

number of buckets

The number of buckets is a hyperparameter chosen by the user and typically depends on factors such as the size of the dataset, the desired memory usage, and the computational resources available.

4b. Softmax

The output of the linear layer is passed through a softmax function. The softmax function normalizes the logits into a probability distribution over the vocabulary. Each element in the resulting probability distribution represents the probability of the corresponding token being the next word in the sequence, given the input context. - The logits are then passed through a softmax function to compute the probability distribution ŷ over the vocabulary.

Sparse Embeddings

Traditional sparse word representations, like one-hot encodings, fail to capture the semantic meaning of words efficiently. Word2Vec and other dense embedding techniques aim to address this issue by representing words in a continuous vector space where similar words have similar representations.

Orthonormal

a set of vectors that are both orthogonal and normalized. - Two vectors are orthogonal if their dot product is zero, meaning they are perpendicular to each other. Mathematically, for two vectors u and v, if u·v = 0, then they are orthogonal. - Normalized: A vector is normalized if its length (magnitude) is equal to 1. Mathematically, for a vector u, if ||u|| = 1, then it is normalized.

Self Supervised training

a type of machine learning that does not require explicit external labeling of data. Instead, it generates its own supervisory signal based on the input data. - examples: language models, Word2Vec

Probabilistic language models

assign a probability to a given sentence or sequence of words. - machine translation, spell correction, speech recognition, summarization, question-answering - P(high winds tonite) > P(large winds tonite) - P(about fifteen minutes from) > P(about fifteen minuets from) - P(I saw a van) >> P(eyes awe of an) - fundamental benchmark task in natural language processing (NLP) to help measure our progress for understanding language - subcomponent of many NLP tasks

context window size

determines the number of neighboring words considered as context words for a given target word. - For example, if the window size is set to 5, the context window includes the 5 words to the left and 5 words to the right of the target word, resulting in a total window size of 11 words.

1. Skip-Gram: Positive Examples

each word in the text corpus is treated as a target word, and its neighboring context words within a certain window size are considered positive examples. - These pairs of target and context words are treated as positive examples because they co-occur in the text. - the context words (e.g., "am" and "food") are considered positive examples.

FastText

employs sub-word generation using character n-grams to address the limitations of Word2Vec, particularly in handling out-of-vocabulary words and capturing morphological variations. - Instead of treating each word as a single entity, FastText breaks down words into smaller units called character n-grams. - These character n-grams capture morphological information and help in representing both known and unknown words effectively.

2. Skip-Gram: Negative Sampling

generated by randomly sampling words from the lexicon that are not in the context of the target word. These negative samples are used to train the model to distinguish between true context words and random words - while negative samples (e.g., "paris" and "earth") are randomly chosen words from the vocabulary.

1. Hashing

instead of assigning a unique embedding vector to each unique character n-gram, FastText uses a hash function to convert each character n-gram into a hash code based on its contents. This hash code is an integer between 1 and B (the number of buckets) - maps each code to a predefined bucket - The hash code itself is determined by the hash function applied to the character n-gram, which aims to produce a unique code for each n-gram while evenly distributing codes across the range of possible values.

self-supervised learning

learning from the data without explicit supervision; the training signal is derived from the structure of the data itself

Issues n-gram Language Models

limiting n may make the predictions less accurate. - "students opened their ___" vs. "as the proctor started the clock, students opened their ___" - "books" vs. "exams" - the model chose "books" since it is used more frequently, but the correct answer was "exams" - it could have known this if "proctor" had been included in the context

3. Skip-Gram: Classifier Training

uses logistic regression or another binary classifier to train on the positive and negative examples. - The classifier is trained to predict whether a given word-context pair is a positive or negative example based on the features extracted from the target word and its context. - the decision threshold can be adjusted based on the specific requirements of the task or the characteristics of the dataset. - depending on factors such as the class distribution, the trade-off between precision and recall, and the specific objectives of the application.

Skip-gram with negative sampling (SGNS)

variant of the Word2Vec model that aims to learn word embeddings by predicting context words from target words. - This approach is particularly efficient for large-scale datasets. - takes each word in the text as a target word and tries to predict the context words surrounding it. The context words are typically defined by a fixed window size around the target word. - For each target-context word pair, the model is trained to maximize the probability of observing the context word given the target word. In other words, the model learns to predict context words based on the target word. - negative sampling

Distributional Hypothesis

words that occur in similar contexts tend to have similar meanings. In the context of word embeddings, the Distributional Hypothesis implies that words with similar distributions in text (i.e., occurring frequently in similar contexts) are likely to have similar meanings. - Word embeddings like Word2Vec are built upon this hypothesis. They learn to represent words in a continuous vector space such that words with similar distributions in the training data are closer together in this space. This allows the embeddings to capture semantic relationships between words, such as synonymy (words with similar meanings) and semantic relatedness (words that are conceptually similar).

Intuition behind W2V

◦Instead of counting how often each word w occurs near "apricot" ◦Train a classifier on a binary prediction task: ◦Is w likely to show up near "apricot"?

Word Relationships: How are they better than TFIDF?

◦These models can capture complex relationships between words, such as analogies ("king" is to "queen" as "man" is to "woman"), which TF-IDF cannot do.

Dimensionality: How are they better than TFIDF

◦Word2Vec and GloVe embeddings are typically of a much lower dimension than TF-IDF, making them more computationally efficient while retaining more useful information.

Semantically: How are they better than TFIDF?

◦Word2Vec and GloVe embeddings capture deep semantic meaning by considering the context in which words appear. This allows words with similar meanings to have similar embeddings, which isn't the case with TF-IDF.

