Natural Language Processing (CS4990)

Natural Language Tool Kit (NLTK)

A free and open-source (FOSS) Python library for building programs that work with natural language. It can perform operations such as tokenization, stemming, classification, parsing, tagging and semantic reasoning.

Recall

% of correct items that have been labelled. R = TP/(TP+FN)

Distributional semantics (aka distributional hypothesis)

- A word's meaning is given by the words that frequently appear close by. "You shall know a word by the company it keeps" (J. R. Firth). "Words are similar if they occur in a similar context" (Harris). - One of the most successful ideas of modern statistical NLP! Count-based word vectors: count the co-occurrences of words (co-occurrence matrix), which captures the contexts in which a word appears; the context is modelled using a window over the words.

Advanced Optimisers

- Stochastic Gradient Descent (SGD) has trouble with: • Ravines: surfaces that are much steeper in one direction than another, which are common around local optima. • Saddle points (i.e. points where one dimension slopes up and another slopes down): they are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to 0 in all dimensions. - More advanced optimisers have been proposed: RMSProp, AdaGrad, AdaDelta, Adam, Nadam. - These methods usually train faster than SGD, but the solutions they find are often not as good as SGD's. - The performance of SGD is very reliant on a robust initialisation and annealing schedule. Possible solution: first train with Adam, then fine-tune with SGD. Overview: http://ruder.io/optimizing-gradient-descent/index.html

Homonyms

2 distinct words having the same spelling.

Soft clustering

A document can belong to more than one cluster. - It makes more sense for apps like creating browsable hierarchies. - You may want to put a pair of sneakers in 2 clusters: sports apparel and shoes.

Dropout

A regularisation method that approximates training a large number of NN with different architectures in parallel. This is done by dropping a unit out, i.e., temporarily removing it from the network, along with all its incoming and outgoing connections. It has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs. It can be used with most types of neural architectures, such as dense fully connected layers, CNNs and RNNs. - Dropout rate (PyTorch): the probability of dropping out a node, going from 0 (no dropout) to 1 (drop all nodes). A good value for dropout in a hidden layer is between 0.2 and 0.5. - Use a larger network: a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout. - Weight constraint: large weight values can be a sign of an unstable network. To counter this effect a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to be below a specified value (e.g. 3-4).
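A minimal sketch of dropout in PyTorch, with hypothetical layer sizes; the dropout layer is only active in training mode.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(300, 100),   # hypothetical input/hidden sizes
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout rate: probability of zeroing a unit
    nn.Linear(100, 2),
)

model.train()                      # dropout is active during training
out = model(torch.randn(8, 300))   # noisy forward pass
model.eval()                       # dropout is disabled (and rescaled) at test time
```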

One-hot vector

A sparse representation of (the meaning of) a word. The dimensionality of a vector = size of the vocabulary (e.g. 300k). Problem, by example: in a web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel". But motel = [0, 0, 0, 1, 0, ..., 0] and hotel = [0, 1, 0, 0, 0, ..., 0]. These 2 vectors are orthogonal → every pair of words is orthogonal, and hence all pairs of words have the same "distance" to each other.

Lemma

Canonical/dictionary/citation form of a set of words

Hard clustering

Each document belongs to exactly one cluster (more common and easier to do)

Batch normalisation

For each mini-batch, normalise the activations (layer inputs) to an ideal range (e.g. 0 mean and unit variance) to help ensure the gradients do not vanish or explode.

Continuous Bag-of-Words (CBoW)

Given a context, predict the missing word. Examples: - same procedure ____ every year - please stay ____ you are

F measure

How to use just 1 number to measure the performance, taking both P and R into account: the weighted harmonic mean F = 1 / [α/P + (1-α)/R]. With α = 0.5 (equally weighting P and R): F1 = 2 ∗ P ∗ R / (P + R) = 2 ∗ Precision ∗ Recall / (Precision + Recall)

Batch Learning

Objective: minimise ∑i[E(y_i, t_i; W)] Compute the gradient based on all data points: W -= α ∑i=1:n[∇_{W}E(y_i, t_i; W)] Cons: - The need to calculate the gradients for the whole dataset to perform just one update, hence it's very slow and expensive (in terms of both memory and computation). - It doesn't allow us to update our model online, i.e. with new examples on-the-fly.

Online Learning

Objective: minimise ∑i[E(y_i, t_i; W)] Compute the gradient based on each data point: for i = 1, 2, ..., n: W -= α∇_{W}E(y_i, t_i; W) Con: the gradient is estimated on each single data point; hence the gradient variations can be very high (aka "gradient fluctuation"). The gradient fluctuation, on one hand, enables it to jump to new and potentially better local minima

Context of a word

Set of words that appear nearby (within a fixed-size window) of a word.

Shuffling and Curriculum Learning

Shuffling: avoid providing the training examples in a meaningful order to our model as this may bias the optimisation algorithm Curriculum learning: for some cases where we aim to solve progressively harder problems, supplying the training examples in a meaningful order may lead to improved performance and better convergence

Markov assumption

A simplifying assumption: only consider a few words before the target word. E.g. P(know | As far as I) ≈ P(know | I) or P(know | As far as I) ≈ P(know | as I)

Learning low-dimension word vectors

Target: - Build a dense vector for each word, chosen so that it's similar to the vectors of words that appear in similar contexts - Dense: in contrast to co-occurrence-matrix word vectors, which are sparse (high-dimensional). - Word vectors are sometimes called word embeddings or word representations. They're a distributed representation.

Ranked IR

Target: return an ordering of the documents in the collection with respect to the query. When a system produces a ranked result, large results sets are not an issue, as we can just show the top (e.g. 10) results. How to rank documents: assign a score - say in [0, 1] - to each document (this score measures how well a document and a query "match").

Hyper-parameter optimisation

The random layout (random search) usually works best, but it's not guaranteed to be better than the grid layout (grid search).

vocabulary (of a text)

The set of unique tokens used (i.e. list of types).

Complex Neural Networks

Those can be built by using: - different connections - different activation functions - multiple layers - different loss functions - ...

80/20 rule

Top ~20% types (in the frequency distribution) account for ~80% of the tokens.

Parameters (in DL)

What ANNs will optimise

Co-occurrence matrix

|V|*|V| matrix where V is the vocabulary. Assumption: if we collect co-occurrence counts over thousands of sentences, the vectors for "enjoy" and "like" will have similar vector representations. - Like the document-term matrix used in IR: • Doc-term matrix: represent a doc by words appearing in the doc. • Word co-occurrence matrix: represent a word by its context. Problem: Vectors become very large with real data (workaround: apply Dimensionality Reduction like truncated SVD).

Sigmoid activation

σ(x) = 1 / [exp(-ax)+1] (centred at 0.5) - Differentiable, but the gradients are killed when |x| is large - Also, expensive to compute

Word2vec

A popular framework for learning dense word vectors. Idea: - Input: a large corpus of text - Output: every word in a fixed vocabulary is represented by a vector Major steps: 1. Go through each token in the text, to get a centre word c and its context ("outside") words o. 2. Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa). 3. Keep adjusting the word vectors to maximise this probability. Approaches: - CBoW - Skip-gram Note: the order information isn't preserved, i.e. we don't distinguish whether a word is more likely to occur before or after the word. It's one of the simplest word-embedding models, can be trained on much more data and is much faster than its neural predecessors (especially with the sigmoid + negative-sampling trick)

Linear activation

σ(x) = ax + b

Softplus activation

σ(x) = ln(1 + exp x) Pros: - Differentiable everywhere - Gradients don't die in the positive region Cons: - Gradients vanish in the negative region when |x| is large - Not 0 centred - Computationally expensive

K-Median

- Instead of recomputing the group centre points using the mean, we use the median vector of the group. - It's less sensitive to outliers. - It's much slower for larger datasets as sorting is required on each iteration when computing the median vector.

WordNet

A thesaurus containing lists of synonym sets and hypernyms ("is a" relationships). Problem: can we compare the synonyms of words to measure their similarity? This method fails badly in practice, because: - it's great as a resource but misses nuance (e.g. "proficient" is listed as a synonym for "good", which is only correct in some contexts), - it misses new meanings of words (e.g. "wicked", "badass", ...), - it's subjective, - it requires human labour to create and adapt, - it can't compute accurate word similarity

Precision

% of labelled items that are correct Pr = TP/(TP+FP)

Weight initialisation

- Initialising all weights to 0 doesn't usually work well because "all neurons behave in the same way", aka the "symmetry problem" (https://arxiv.org/pdf/1206.5533.pdf). Also, the gradient may be 0 depending on the activation function (it may be undefined if the function is not continuous at 0). Work-around: use softplus' derivative at 0. - Using small random numbers (i.e. real numbers uniformly randomly drawn from [-0.01, 0.01]) works ok, but only for small networks (a few layers, each with a few activations). In deeper networks, the activations become very close to 0 in the deeper layers (i.e. layers far from the input and close to the output layer). Gradients also become close to 0, because the gradient of the j-th layer involves z^(j-1), which is close to 0. - Using large random numbers (i.e. real numbers uniformly randomly drawn from [-100, 100]): the activations may become very large but the gradients may be (close to) 0 (consider sigmoid and tanh). - Glorot initialisation: W_ij = U[-1/√n, 1/√n], where n is the size of the previous layer (aka Xavier uniform initialization). It's a reasonable initialisation (derived for linear activations, works for tanh; it needs adaptation for ReLU because ReLU is not 0 centred).
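A short sketch of these initialisations in PyTorch, with hypothetical layer sizes. Note that PyTorch's Xavier uniform uses both fan-in and fan-out, a slight variation of the 1/√n form above; the ReLU adaptation mentioned above corresponds to Kaiming (He) initialisation.

```python
import torch.nn as nn

layer = nn.Linear(256, 128)             # hypothetical layer sizes
nn.init.xavier_uniform_(layer.weight)   # Glorot/Xavier uniform initialisation
nn.init.zeros_(layer.bias)

# Adaptation for ReLU networks (Kaiming/He initialisation):
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
```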

NLP limits & outlook

- Language problems are hard - for most of them, there's still no fully accurate solution (like Physics, History and Psychology).

Multi-Layer Perceptron (MLP)

- An MLP with 1 hidden layer is a universal approximator, i.e. it can represent or approximate any continuous function. - More hidden layers can still be useful for learning an actual task. - A good organisation of hidden layers can make learning much faster. Formally, a NN with H-1 hidden layers is a function f: ℝ^n→ℝ^m, with parameter matrices W^(1), W^(2), ..., W^(H) and non-linearities σ_1, σ_2, ..., σ_H, where: z^(1) = x ⋅ W^(1) (pre-activations of the second layer), z^(2) = σ_1(z^(1)) ⋅ W^(2), ..., z^(H) = σ_{H-1}(z^(H-1)) ⋅ W^(H), y = σ_H(z^(H))
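A minimal PyTorch sketch of this formulation with hypothetical dimensions (one hidden layer, i.e. H = 2); the final softmax is usually folded into the loss function.

```python
import torch
import torch.nn as nn

# hypothetical dimensions: 300 inputs, 64 hidden units, 3 output classes
mlp = nn.Sequential(
    nn.Linear(300, 64),   # W^(1)
    nn.Tanh(),            # sigma_1
    nn.Linear(64, 3),     # W^(2); sigma_2 (softmax) is applied inside the loss
)
y = mlp(torch.randn(16, 300))
print(y.shape)            # torch.Size([16, 3])
```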

Word embedding models

- Word2vec: the more (semantically) similar words w_1 and w_2 are → the higher e(w_1) ⋅ u(w_2). https://code.google.com/archive/p/word2vec/ - GloVe: the more often words w_1 and w_2 appear in the same document → the higher e(w_1) ⋅ u(w_2)

Hapax (Hapax legomenon)

A word that occurs only once in a text/corpus.

Language model for training word vectors

Bengio's NNLM (Neural Network based Language Model) vs word2vec: - Word2vec is "simpler" • It has no "hidden" layer • Simpler models can be trained and run faster • It can train on billions of word tokens in hours/days - Uni-directional vs bi-directional • NNLM predicts the next word using the preceding words (left-to-right prediction). • Word2vec predicts both preceding words and succeeding words (or using context words to predict the centre word, in CBoW), hence is bi-directional prediction

Aggressive classifier

Classifier that: - tends to label more items - is high-recall, low-precision - is useful when you don't want to miss any spam; suitable for first-round filtering (shortlisting)

Skip-gram

Given a word, predict the context words. Examples: ____ ____ as ____ ____. If the window size is 2, we aim to predict: (w, c-2), (w, c-1), (w, c1) and (w, c2). Concept: For each word w, we learn 2 vectors: - When w is the centre word: E(w) - When w is the context word: U(w) - The final vector of w: mean[E(w), U(w)] - Empirically, learning 2 vectors yields better embeddings. w(t): 1-hot vector of the centre word; w(t±x): 1-hot vector of a context word; E: |V|*d (d: dimension of the vector, given by the designer); U: |V|*d. Training: - Learning parameters: E and U - Logit matrix: we measure the similarity of a pair of words by their vectors' dot product - Learning objective: minimise the cross-entropy loss between p and t: J(E, U) = -∑i[t(i) * log p(i; E, U)] - Obtain a large corpus - Tokenise the text in the corpus (maybe apply other text pre-processing techniques) - Decide the window size K (hyper-parameter) - Build training mini-batches: centre word and its context words. E.g. the corpus only has one sentence (K=2): All the world's a stage and all the men and women merely players . Mini-batch 1: (All, the), (All, world) Mini-batch 2: (the, All), (the, world), (the, 's) Mini-batch 3: (world, All), (world, the), (world, 's), (world, a) - Update E and U with each mini-batch, to minimise J(E, U). Practical realisation: Loss function to minimise: J(E, U) = -∑i[t(i) * log p(i; E, U)], with p(i; E, U) = exp(u_i ⋅ e_c) / ∑w∈V[exp(u_w ⋅ e_c)]. |V| is usually a very large number, hence the softmax computation is very expensive. To avoid the expensive normalisation computation in softmax, use the sigmoid function instead of softmax: p_hat(i; E, U) = σ(u_i ⋅ e_c), J_hat(E, U) = -∑i[t(i) * log p_hat(i; E, U)]. The potential problem of using sigmoid: - If the vectors in E and U are of large magnitude, the input of the sigmoid is always a large number, even for words that aren't in the context. - Given a centre word c, we want a probability vector in which the true context words of c have high probabilities and the non-context words of c have low probabilities. Change the loss function to J'_hat(E, U) = -∑i∈C(c)[log p_hat(i; E, U)] + ∑j∉C(c)[log p_hat(j; E, U)], where c is the centre word and C(c) is the set of context words of c. The problems with this updated loss function: j ranges over far more words than i, giving highly unbalanced training data. - The solution, negative sampling: only use a small number of non-context words in the loss function. For each centre word c: 1. Find its context words C(c). 2. Randomly sample the same number of non-context words from the corpus, N(c) (|N(c)| = |C(c)|). 3. Compute the gradient of the loss function J"_hat(E, U) = -∑i∈C(c)[log p_hat(i; E, U)] + ∑j∈N(c)[log p_hat(j; E, U)]
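A minimal sketch of skip-gram with negative sampling in PyTorch, with hypothetical vocabulary/embedding sizes and made-up word ids; it uses the standard negative-sampling form, where the negative term is log(1 - p_hat(j)) rather than +log p_hat(j).

```python
import torch
import torch.nn as nn

V, d = 10_000, 100        # hypothetical vocabulary and embedding sizes
E = nn.Embedding(V, d)    # centre-word vectors
U = nn.Embedding(V, d)    # context-word vectors

def skipgram_ns_loss(c, context_ids, negative_ids):
    """Negative-sampling loss for a single centre word c."""
    e_c = E(c)                                  # (d,)
    pos = torch.sigmoid(U(context_ids) @ e_c)   # p_hat for true context words C(c)
    neg = torch.sigmoid(U(negative_ids) @ e_c)  # p_hat for sampled non-context words N(c)
    return -(torch.log(pos).sum() + torch.log(1 - neg).sum())

c = torch.tensor(5)                       # hypothetical centre-word id
ctx = torch.tensor([3, 7])                # C(c)
neg = torch.randint(0, V, (2,))           # N(c), with |N(c)| = |C(c)|
skipgram_ns_loss(c, ctx, neg).backward()  # gradients w.r.t. E and U
```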

Probabilistic Language Models (PLM)

Goal: assigning a probability to a sentence (P(W) = P(w_1, w_2, ..., w_n)). Why is it important? - Machine translation: P(high winds tonight) > P(large winds tonight) - Spell correction: P(15 minutes walk) > P(15 minuets walk) - Speech recognition: P(I like the pink flower) > P(I like the pink flour). Related task: the probability of an upcoming word: P(w_5|w_1, ..., w_4). A model that computes either P(W) or P(w_n|w_1, ..., w_n-1) is called a language model. It can be viewed as a probabilistic grammar.

cosine similarity

Instead of directly measuring the angle between vectors, we compute the cosine value of the angle. ∴ larger cosine -> smaller angle -> higher relevancy.

SVD + co-occurrence matrix

Instead of using the high-dimensional original co-occurrence matrix M, use U_t (dimension t is given by the user). Cons: - High computational cost for large datasets - It can't be dynamically updated with new words - It didn't work too well in practice. Idea: instead of reducing high-dimensional vectors to low dimensional vectors: directly learn low-dimensional vectors

Learning rate adaptation (annealing)

Intuition: high learning rate ⇒ many fast changes to parameter vector which may prevent a convergence. Decrease the learning rate during each iteration (epoch). Initially a large learning rate for exploring the error surface. Afterwards, in the vicinity of local optimum: smaller changes. E.g. α_t = α_0 / (1+t) ("power scheduling").

P(W) computation

Intuition: rely on the chain rule of probability, which follows from P(A|B) = P(A, B) / P(B): - P(A, B) = P(A) ∗ P(B|A) - P(A, B, C) = P(A) ∗ P(B|A) ∗ P(C|A, B) - P(x_1, x_2, ..., x_n) = P(x_1) ∗ P(x_2|x_1) ∗ ... ∗ P(x_n|x_1, ..., x_n-1) = ∏i[P(x_i|x_1, ..., x_i-1)] It's impossible in practice to just count and divide, i.e. P(W) ≈ Count(W) / Total_Token_Num, as there are too many possible sentences.

Gradient Descent

It's an algorithm to improve a hypothesis function with respect to some cost function. - Neural Network learning: • NN: y = F(x; W) • Target: W* = arg min E(y, t; W), where E is the loss function - To find W*, we set ∇_W E(y, t; W) = 0 and solve for the corresponding weights - We can approximate W* by batch/online/mini-batch learning

Multinomial Logistic Regression (MLR)

Logistic Regression is among the simplest yet most widely used classification models, also known as Maximum Entropy (MaxEnt), Multiclass LR, Softmax Regression or Multinomial Logit. Idea: predict the probability of the input falling into each category. Components (from the lecture figure): - The vectorizer can be TF-IDF/BoW or an ANN (in DL) - v_i is a 1*D real-valued vector - Function F (a linear function or deep model): • Input: vector v, dimension 1*D (D: dimension of the vector) • Output: vector u, dimension 1*K (K: number of categories) • Learning parameters: a matrix W of dimension D*K • F: v ⋅ W - The output after Softmax is the vector (P(C_1|d), P(C_2|d), ..., P(C_K|d)) - Softmax function: • transfers the outputs of Function F into probabilities • Sigmoid is a special case of Softmax (with 2 outputs) - t_i1, ..., t_iK: the given true label; every bit is 0 except the one corresponding to the true class. Function F and Softmax together form a LR model
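A minimal sketch of this setup in PyTorch, with hypothetical dimensions: a single linear map (Function F) followed by softmax.

```python
import torch
import torch.nn as nn

D, K = 500, 4                    # hypothetical: 500-dim document vectors, 4 classes
F = nn.Linear(D, K, bias=False)  # the D*K weight matrix W

v = torch.randn(1, D)            # vectorised document (e.g. TF-IDF)
u = F(v)                         # Function F: v . W  -> 1*K scores
p = torch.softmax(u, dim=1)      # (P(C_1|d), ..., P(C_K|d))
print(p.sum().item())            # 1.0
```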

Inflection (in grammar)

Modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender and mood. It expresses 1+ grammatical categories with a prefix, suffix, infix or another internal modification such as vowel change. Examples of such affixes: -ed, -ize, -s (suffixes) and de-, mid- (prefixes).

Early stopping

Monitor error on a validation set during training and stop (with some patience) if the validation error does not improve enough.

Word tokenizer (from NLTK)

NLTK's standard tokenizer. Pros: it successfully tokenizes punctuation and splits hashtags into separate tokens (e.g. #70thRepublic_Day into "#" and "70thRepublic_Day"). Cons: it fails to identify widely used symbol combinations (e.g. ":)" is split into 2 symbols)
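A quick sketch of the standard tokenizer (the example string is made up; the punkt download is only needed once, and the resource name may vary slightly across NLTK versions).

```python
import nltk
nltk.download('punkt', quiet=True)   # tokenizer model, needed once
from nltk.tokenize import word_tokenize

print(word_tokenize("Happy #70thRepublic_Day :)"))
# the hashtag becomes '#' + '70thRepublic_Day', and ':)' is split into ':' and ')'
```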

Mini-Batch Learning

Objective: minimise ∑i[E(y_i, t_i; W)] Compute the gradient by a small number of points (intermediate solution between batch and online learning): W -= α ∑i=1:k[∇_{W}E(y_i, t_i; W)] where k is the mini-batch size (k→n: batch learning; k→1: online learning). Common mini-batch sizes range between 8 and 256 but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a NN and the term "Stochastic Gradient Descent" (SGD) usually is employed when mini-batches are used.
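A minimal mini-batch SGD loop in PyTorch on made-up data (sizes and hyper-parameters are illustrative only).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X, t = torch.randn(1000, 20), torch.randint(0, 2, (1000,))              # toy data (assumption)
loader = DataLoader(TensorDataset(X, t), batch_size=32, shuffle=True)   # k = 32

model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for xb, tb in loader:        # one parameter update per mini-batch
        opt.zero_grad()
        loss_fn(model(xb), tb).backward()
        opt.step()
```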

Unigram model

P(w_1 w_2 ... w_n) ≈ ∏i[P(w_i)] - It multiplies the probabilities of the individual tokens - The word-sequence information is completely ignored (e.g. P(As far as I know) = P(far as as know I)) To compute P(w_i): - count the word w_i's frequency in the corpus, - use FreqDist() from NLTK
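A small sketch of unigram probabilities with NLTK's FreqDist, over a made-up toy corpus.

```python
from nltk import FreqDist

tokens = "as far as I know the cat sat on the mat".split()   # toy corpus (assumption)
fd = FreqDist(tokens)

def unigram_prob(w):
    return fd[w] / fd.N()        # count(w) / total number of tokens

print(unigram_prob("as"))        # 2 / 11
```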

Chunking

Process of extracting phrases from unstructured text. E.g. Extracting "South Africa" as a single token instead of "South" and "Africa" as separate tokens. Why use it? - For entity detection • proper names (e.g. Monty Python) • Definite noun phrases (e.g. the knights who say "ni") • Sometimes also indefinite nouns or noun chunks (e.g. every student or cats) - Help multiple NLP tasks • Information retrieval (search engines) • Text classification • Sentence simplification/paraphrase • Summarisation Example of type: Noun Phrase (NP) Chunking

Tweet tokenizer (from NLTK)

Pros: correctly handles hashtags and mentions (e.g. @someone). Cons: it fails at abbreviations (e.g. "U.K.")

Text cleaning & normalisation

Remove useless information (e.g. email headers) and extract useful information (e.g. words, word sequences, verbs, nouns, adjectives, names, locations, orgs, ...). 1. Tokenization (sentence, words) 2. Stemming / Lemmatization 3. Stop-words removal

Type Token Ratio (TTR)

The number of different words (types) in the sample divided by the total number of words (tokens) in the sample. It's often used to measure Lexical Diversity (i.e. to describe the range of someone's vocabulary)

Latent Semantic Analysis (LSA)

The simplest topic modelling method. Input: doc-term matrix M (BoW or TF-IDF) Rationale: - M is very sparse, noisy and redundant across its many dimensions. - To get the most salient information out of the noisy, sparse and redundant matrix M, the most natural idea is to reduce its dimensionality. - Truncated SVD (Singular Value Decomposition) is a standard method for reducing matrix dimensionality

Loss function

True labels ("truth") t = (t_1, ..., t_m) The network's predictions y = (y_1, ..., y_m) Note: there's m output units - indexed by j As before, the number of samples is N: {(x_1, t_1), ..., (x_N, t_N)} (indexed by i). Popular ones: - Square loss (Mean Squared Error): MSE(t, y) = ∑j(t_j - y_i)², widely used in regression tasks. - Cross entropy loss CE(t, y) = -∑j[t_j * log(y_i)], widely used in classification tasks. - Hinge loss: HL(t, y) = ∑j max(0, y_j-y_t + 1), where t is the index of the "hot" bit in t. Also widely used in classification tasks.
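These losses are available directly in PyTorch; a quick sketch with made-up values (MultiMarginLoss is PyTorch's multi-class hinge loss).

```python
import torch
import torch.nn as nn

y = torch.tensor([[2.0, -1.0, 0.5]])   # network outputs for one sample (hypothetical)
t = torch.tensor([0])                  # index of the "hot" bit

print(nn.CrossEntropyLoss()(y, t))     # cross-entropy (applies softmax internally)
print(nn.MSELoss()(y, torch.tensor([[1.0, 0.0, 0.0]])))   # squared loss vs one-hot target
print(nn.MultiMarginLoss()(y, t))      # hinge-style multi-class margin loss
```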

GloVe

Unlike word2vec (which only relies on local context information to learn word vectors), this one also exploits co-occurrence information at a longer range. It applies an additional trick: the more often words w_1 and w_2 appear in the same document → the higher e(w_1) ⋅ u(w_2)

Bag of Words (BoW)

View a document as a bag of words. It counts the appearance of each word or term frequency.

Hyper-parameters

Values that the algorithm designer chooses and that are kept fixed while the algorithm runs (they are not learned).

Perceptron

σ(z) = z ≥ 0 ? 1 : 0 (Step function) Con: (single-layer) perceptrons can only separate linearly separable data points

ReLU (Rectified Linear Unit) activation

σ(x) = max(0, x) Pros: - Gradients don't die in a positive region - Computationally efficient - Experimentally: convergence is faster - The most popular activation function in NLP. Cons: - Kills gradients in the negative region - Not 0 centred The one to try before sigmoid and the step function.

Leaky ReLU (and Parametric ReLU)

σ(x) = max(0.01x, x) (Leaky ReLU) More generally (Parametric ReLU): σ(x) = max(αx, x) It addresses ReLU's dying-gradient problem in the negative region; Leaky ReLU is the special case of Parametric ReLU with α = 0.01. Pros: - Gradients don't die in either the positive or the negative region - Computationally efficient - Experimentally: convergence is faster Cons: - Need to decide α (hyperparameter) - Not 0 centred It's one of the most widely used activations.

Maxout Neuron

σ(x) = max(w_1*x, w_2*x) Pros: - Generalises Parametric ReLU - Provides more flexibility by allowing different w_1 and w_2 - Gradients don't die Cons: Doubles the number of parameters It's an extension of Leaky ReLU and one of the most widely used activations.

Tanh activation

σ(x) = tanh x (centred at 0) - Differentiable, but the gradients are killed when |x| is large - Also, expensive to compute

Exponential Linear Units (ELU)

σ(x) = x > 0 ? x : α[exp(x) - 1] Pros: - Gradients don't die in either the positive or the negative region - Experimentally: faster convergence - Closer to 0-mean outputs Cons: - Computationally expensive (exp) If α = 1, the function is smooth at x = 0 (the derivative is 1 on both sides); otherwise it isn't. It's one of the most widely used activations.

Step activation

σ(x) = x ≥ 0 ? 1 : 0 - Historically the first specification - Not differentiable everywhere (undefined at 0) - The derivative is 0 wherever it is defined, so gradients can't flow through it

Bigram model

Condition on the last word P(w_1 w_2 ... w_n) ≈ ∏i[P(w_i|w_i-1)] To compute P(w_i|w_i-1), example: - P(far|as) = Count(as, far) / Count(as). "as far" is a bigram (or 2-gram)

Named Entity

Definite noun phrases that refer to specific types of individuals, such as organisations, persons, dates, ...

Type

Element in the vocabulary. Also known as the form or spelling of the token (including words and punctuation) independently of its specific occurrences in a text.

Accuracy

Of all predictions, how many are correct. Acc = (TP+TN)/(TP+FP+FN+TN) = True / All For multi-labels: Acc = diagonal / All Cons: when the label distribution is highly unbalanced, you can be easily fooled by the accuracy (so check the label distribution of the training data).

Lemmatization

Process of reducing the inflection in words to their dictionary root forms (called lemma), ensuring that the root word belongs to the language. Pros: the derived root word is meaningful Cons: it's more expensive (hence slow) to run
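A quick NLTK sketch (the wordnet download is needed once; the example words are made up).

```python
import nltk
nltk.download('wordnet', quiet=True)
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("leaves"))              # 'leaf'  (noun by default)
print(lem.lemmatize("running", pos="v"))    # 'run'   (lemmatized as a verb)
```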

Rand Index

External evaluation for clustering. - It measures agreement between pair decisions. - It's like accuracy, but computed over pairs of points rather than on the raw data, which makes it invariant to renaming clusters. RI = (A+D) / (A+B+C+D) Precision equivalent: P = A / (A+B) Recall equivalent: R = A / (A+C)

Syntactic parsing

The task of recognising a sentence and assigning a syntactic structure to it. It's important for: - grammar checking - understand the subject/main verb/object of a sentence; useful in downstream tasks, e.g. question answering, information extraction.

(Overall) Purity

External evaluation for clustering. - Percentage of the majority class in each cluster. - High purity is easy to achieve when the number of clusters is large - in particular, purity is 1 if each document gets its own cluster. Thus, we can't use purity to trade off the quality of the clustering against the number of clusters.

Information Retrieval (IR)

Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). Or in short: a search engine, given a query, find the most relevant documents

Similarity measure

Euclidean distance isn't suitable because it takes the vector length into account (i.e. larger distance → less relevancy, even for documents about the same topic). Instead of measuring the Euclidean distance between vectors, we can measure the angle between them (∴ larger angle -> less relevancy).

Regular Expression (RegEx or RegExp)

A sequence of characters that define a search pattern. Usually, such patterns are used by string searching algorithms for "find" and "find and replace" operations on strings or for input validation. A key characteristic of it is its economical script (especially with the Kleene closures with + and *). In Python: - re.search(pattern, text) - re.split(split_term, phrase) - re.findall(regex, phrase)

Token

Instance of a type in a text, which is a sequence of characters that is treated as a single group (i.e. words and punctuation). E.g. To be or not to be - 2x to, be - 1x or, not

Data splits

Use separate datasets (if possible) for training, development and test. Otherwise, divide one large dataset into 3.

Single-link

sim(c_i, c_j) = max_x∈c_i, y∈c_j [sim(x, y)] Pro: it encourages the growth of elongated clusters Con: it's very sensitive to noise

Complete-link

sim(c_i, c_j) = min_x∈c_i, y∈c_j [sim(x, y)] Pro: it encourages compact clusters Con: It doesn't work well on elongated clusters

Agglomerative clustering

The dendrogram is built from the bottom level by: - merging the most similar (or nearest) pair of clusters - stopping when all the data points have been merged into a single cluster. - When taking the first merging step, it does not consider the global structure of the data, but only looks at the pairwise structure - It's faster to compute, in general

Topic

- A grouping of words that are likely to appear in the same context. - A hidden structure that helps determine what words are likely to appear in a corpus. - long-range context (rather than local dependencies like n-grams, syntax).

Mean-Shifting Clustering (MS)

- A sliding-window-based algorithm that attempts to find dense areas of data points. - A centroid-based algorithm: the goal is to locate the centre points of each cluster, which works by updating candidates for centre points to be the mean of the points within the sliding-window. - These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centre points and their corresponding groups. - It doesn't need the user to define K. Algorithm: 1. Begin with a circular sliding window centred at point C (randomly selected) and having radius r (given by the designer) as the kernel. 2. At every iteration, the sliding window is shifted towards regions of higher density by shifting the centre point to the mean of the points within the window (hence the name). 3. Continue shifting the sliding window according to the mean until there is no direction at which a shift can accommodate more points inside the kernel. 4. When multiple sliding windows overlap the window containing the most points is preserved. Pro: there's no need to select the number of clusters as MS automatically discovers this. Con: the selection of the window size/radius r isn't trivial

Hierarchical clustering

- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. - Clusters are obtained by cutting the dendrogram at the desired level: each connected component forms a cluster. Types: - Divisive (top-down) clustering - Agglomerative (bottom-up) clustering: starts from individual data points and merges upwards until only a single cluster (i.e., the root cluster) remains. Pros: - It doesn't require the number of clusters to be specified; we can even select which number of clusters looks best since we are building a tree. - It's not sensitive to the choice of distance metric. - A particularly good use case of hierarchical clustering is when the underlying data has a hierarchical structure and you want to recover the hierarchy. Con: lower efficiency (O(n³) time complexity)

Feature types

- Categorical features: • True-or-false type of features (e.g. whether the text contains xxx) • Use different bits to represent different categories • For each category, the element value is either 0 (false) or 1 (true) • Why not use numeric values: magnitude does not matter. - Numerical features • Number-counting type of features (e.g. how many xxx the text contains). • Magnitude matters: does a feature with value 2 mean twice as much as one with value 1?

POS Tagging problem

- Determining the POS tag for an instance of a word. - The collection of POS tags used is called tagset. - Words often have more than one POS tag (making that difficult, even for humans). Example with "back": - The back door → JJ (adjective) - On my back → NN (noun) - Win the voters back → RB (adverb) - Promised to back the bill → VB (verb)

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

- Extends Mean-Shift by identifying outliers. Algorithm: 1. Begin with an arbitrary starting data point that has not been visited. The neighbourhood of this point is extracted using a distance epsilon ε (all points within a distance of ε are neighbours). 2. If there is a sufficient number of points (according to minPoints, a pre-defined hyperparameter) within this neighbourhood, then the clustering process starts and the current data point becomes the first point in the new cluster. Otherwise, the point is labelled as noise (later this noisy point might still become part of a cluster). In both cases that point is marked as "visited". 3. For the 1st point in the new cluster, the points within its ε-distance neighbourhood also become part of the same cluster. This procedure of making all points in the ε neighbourhood belong to the same cluster is then repeated for all of the new points that have just been added to the cluster. 4. Repeat the previous 2 steps until all points in the cluster are determined, i.e. all points within the ε neighbourhood of the cluster have been visited and labelled. 5. Once we're done with the current cluster, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. 6. Repeat this process until all points are marked as visited. Since at the end all points have been visited, each point will have been marked as either belonging to a cluster or being noise. Pros: - It doesn't require K (the number of clusters) - It identifies outliers as noise, unlike MS which simply throws them into a cluster even if the data point is very different. - It can find arbitrarily sized and arbitrarily shaped clusters quite well (it's good at handling non-spherical data). Cons: - It doesn't perform as well as others when the clusters are of varying density (reason: the distance threshold ε and minPoints for identifying neighbourhood points would need to vary from cluster to cluster when the density varies)

Truncated SVD (Singular Value Decomposition)

- Factorises any matrix M into the product of 3 separate matrices: M = U ∗ S ∗ V, where S is a diagonal matrix of the singular values of M. - Select only the t largest singular values, keeping only the first t columns of U and the first t rows of V, where t is a hyperparameter that reflects the number of topics we want to find. m: number of docs; n: vocabulary size; t: number of topics wanted/given; M: m*n doc-term matrix; U: m*t doc-topic matrix; V: t*n topic-term matrix. U may replace M in IR systems to represent documents; V can be viewed as word vectors. Pros: - Easy to understand and implement - Quick and efficient Cons: - Lack of interpretable embeddings (we don't know what the topics are, and the components may be arbitrarily positive/negative). - Needs many documents to get accurate results
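A small sklearn sketch of LSA via truncated SVD on a made-up toy corpus (the document strings and t = 2 are illustrative).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell sharply"]
M = TfidfVectorizer().fit_transform(docs)        # m x n doc-term matrix
svd = TruncatedSVD(n_components=2)               # t = 2 topics
U_t = svd.fit_transform(M)                       # m x t doc-topic matrix
print(U_t.shape, svd.components_.shape)          # (3, 2) and (2, n) topic-term matrix
```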

Naïve Bayes

- Given an input document, a classifier assigns a probability to each class, P(c|d), and selects the most likely one: c = arg max_c_i P(c_i|d) - By Bayes' rule, we have: P(c|d) = P(d|c)P(c) / P(d) - To select the most likely class for a given document, we can omit the denominator, hence c = arg max_c_i P(d|c_i)P(c_i) P(c_i): - the general probability of class c_i (e.g. among all emails received by RHUL, how many are spam and how many are not) - It can be obtained by counting the proportion in a large corpus - also known as the prior P(d|c_i): - The document d can be viewed as a set of words {w_1, ..., w_n} - NB assumption: given a class, the appearance probability of each word is independent of the appearance of the other words. - P(d|c_i) = P({w_1, ..., w_n}|c_i) ≈ ∏j[P(w_j|c_i)] ∝ ∑j log[P(w_j|c_i)] - also known as the likelihood
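A quick Naïve Bayes sketch with scikit-learn on made-up spam/ham examples (documents and labels are illustrative only).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["you are selected to win money", "meeting at noon tomorrow",
        "win money now", "lunch tomorrow?"]               # toy training set (assumption)
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())   # BoW counts + NB
clf.fit(docs, labels)
print(clf.predict(["win a free lunch"]))
```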

Term Frequency (TF)

- Raw form: the number of times term t appears in document d. (a document with 10 occurrences of a term is more relevant than another document with just 1 occurrence, but not 10x more relevant). Intuition: the relevancy increases with term occurrences, but not linearly/proportionally. - Log form: TF(t, d) = raw_tf(t, d) > 0 ? 1 + log[raw_tf(t, d)] : 0

Rule-Based classification

- Rules based on combinations of words or other features • Spam: whether it mentions my name, money or feature phrases like "you are selected" or "this is not a spam" • POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly), upper case letters, 35-year. • Sentiment analysis: "rock", "appreciation", "marvel", "masterpiece" - Accuracy can be high (if the features are well designed and selected by experts) - However, building and maintaining these rules is expensive!

Why Python?

- Shallow learning curve - Good string handling - Combines OO, aspect-oriented and FP paradigms - Extensive standard libraries (e.g. NLTK) - Great support for Deep Learning

Word level ambiguity

- Spelling (e.g. colour vs color) - Pronunciation • 1 word can have multiple pronunciations (e.g. abstract, desert) • Multiple words can share the same pronunciation (e.g. flower/flour) - Meaning (1 word can have multiple meanings, i.e. homonyms; e.g. date, crane, leaves)

(IR Evaluation:) Precision@K

- Standard precision in ML: what proportion of positive identifications was actually correct? - Of the top-K returned results, how many of them are indeed relevant to the query (or, how many are labelled as 'relevant' by human annotators). - Decide K by your application (you want to make sure the user finds at least one relevant result on the first page). - ∀ M queries, compute the average: Performance = mean(Precision@K for results on query_i), i=1, ..., M

Data types (based on structures)

- Structured data - Semi-structured data - Unstructured data

Words

- Token - Type

NLP tasks & applications

- Writing assistance (spell/grammar/style checking, auto completion). - Text classification (spam detection, sentiment analyses, fake news/propaganda detection, news topic classification, customer reviews category classification). - Information retrieval (search engine) - NL Understanding (argumentation mining, question-answering, NL inference, humorous/ironic/metaphoric language analysis). - NL generation (document summarisation, machine translation, sentence paraphrasing/simplification, dialogue/exercise generation)

Corpus (=body)

A large body of text. It usually contains raw text and any metadata associated with the text (e.g. timestamp, source, index, ...). It's also known as a dataset

Cross Entropy Loss (CE)

A loss function which defines the distance between the model output and the true output. It can be used for classification as it measures the divergence of a probability distribution P2 from a reference distribution P1: CE(P1, P2) = -∑i[P1(x_i) * log P2(x_i)] CE isn't symmetric: CE(P1, P2) ≠ CE(P2, P1) The model output (for a 2-class softmax over input x): P(c_1) = p_1 = exp(w_1 ⋅ x) / [exp(w_1 ⋅ x) + exp(w_2 ⋅ x)] P(c_2) = p_2 = exp(w_2 ⋅ x) / [exp(w_1 ⋅ x) + exp(w_2 ⋅ x)] True labels: y_1/y_2: the probability of class 1/2; y_1 = 1, y_2 = 0 if class 1 is the wanted output. Loss function: CE[(y_1, y_2), (p_1, p_2)] = l = -y_1 ∗ log p_1 - y_2 ∗ log p_2

Natural Language Processing (NLP)

A subfield of linguistics, CompSci, Information Engineering and AI concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyse large amounts of natural language data

K-Means

Assumes documents are real-valued vectors. Clusters are based on centroids (aka the centre of gravity or mean) of the points in a cluster C: μ(C) = 1/|C| * ∑x∈C (x) It reassigns the instances to clusters based on their distance to the current cluster centroids. Algorithm: Input: an integer K>0, a set of documents. Select K random docs {s_1, s_2, ..., s_K} as seeds (i.e. initial centroids). Until the clustering converges (or another stopping criterion is met): 1. For each doc d_i: assign d_i to the cluster c_j such that dist(d_i, s_j) is minimal. 2. Update the seeds to the centroid of each cluster, i.e. for each cluster c_j: s_j = μ(c_j) - Possible termination conditions: • A fixed number of iterations. • Document partition unchanged. • Centroid positions don't change. - K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm (known to converge to a local optimum). The number of iterations could be large (but usually isn't in practice). Time complexity: O(IKNM) for I iterations. - Distance computation: O(M), where M is the dimensionality of the vectors. - Reassigning clusters: O(KN) distance computations, i.e. O(KNM). - Computing centroids: each doc gets added once to some centroid: O(NM). - Hence, the overall computational complexity of each iteration is O(KNM). The goodness measure of cluster j is the sum of squared Euclidean distances from the cluster centroid: G_j = ∑i(d_i - c_j)², where i ranges over the docs in cluster j; G = ∑k G_k, k=1, 2, ..., K. Reassignments monotonically decrease G since each vector is assigned to the closest centroid. ∇G_j = −∑_i[2(d_i − c_j)]. With ∇G_j = 0, we have ∑_i(d_i) = m_j ∗ c_j, hence c_j = 1/m_j * ∑_i(d_i), where m_j is the number of documents in cluster j. The results can vary based on the selected seeds, so try out multiple starting points (or initialise with the results of another method). Pros: - Simple and easy to understand and implement. - Efficient: O(IKNM) time complexity. - Since both K and I are usually small, K-means is considered a linear algorithm. - It's the most popular clustering algorithm. Cons: - The user needs to specify K. - The algorithm is sensitive to outliers (decide to remove them after a few iterations, or perform random sampling). - It performs poorly with non-spherical data.
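A short scikit-learn sketch on a made-up toy corpus (K = 2 and n_init are illustrative; sklearn's KMeans handles seed selection and restarts).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats and dogs", "dogs chase cats", "stocks fell", "markets rallied"]  # toy corpus
X = TfidfVectorizer().fit_transform(docs)               # documents as real-valued vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0)    # K = 2, 10 different seedings
print(km.fit_predict(X))                                # cluster id for each document
```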

Micro-average

All the relevant TPs/FPs/FNs/TNs are added up across classes before computing the metric, so the average is dominated by the majority class.

Macro-average

Average of each class' metric

Dependency

Binary relation between a head (or: governor) and its dependents. The head of a sentence is usually the finite verb. Every other word in the sentence depends on it either directly or through a path of dependencies.

Conservative classifier

Classifier that: - tends to label fewer items; it only labels the very certain ones - is high-precision, low-recall - is useful when you don't want any false alarms; suitable for second-round filtering

Conditional Frequency Distribution

Collection of Frequency Distributions, one for each "condition" (e.g. a category).

Tagset

Collection of all POS tags used for a specific corpus and specific language, and what the tags mean (e.g. VBD = verb in past tense). - Tags are usually uppercase (e.g. DT, ADJ, VBD). - Similar tags often share a prefix (e.g. V... = related to verbs) - Many tagsets exist; the Penn Treebank Tagset (for English) and the Universal Tagset (for most languages) are two widely used ones.

Text classification

Definition: - Input: a document d, a fixed set of classes C = {c_1, c_2, ..., c_M} (plus, if supervised, a training set of N hand-labelled documents {(d^1, c^1), ..., (d^N, c^N)}) - Output: the predicted class c ∈ C for d Applications: - Assigning subject categories, topics or genres - Spam detection - Authorship identification - Age/gender identification - Language identification - Sentiment analysis - ...

Sentence tokenization

For long documents, we may not be interested in words but instead in sentences therein: - Check whether a sentence's sentiment is positive or negative. - Check whether a sentence contains propaganda content. - Check the grammatical correctness of a sentence - ...

TF-IDF

Given a collection of N documents and a vocabulary containing D words, the value of the (i, j) element in the doc-term matrix is: TFIDF(t_j, d_i) = TF(t_j, d_i) * IDF(t_j). Pros: - Easy to understand and implement - Yields surprisingly good performance in many text classification and text clustering tasks, usually better than BoW and Jaccard. Cons: - Superficial matching: texts with similar/different words can deliver very different/similar meanings - Sparsity: if the vocabulary size is large (e.g. 30K common words in English), each text vector has many 0s and very few non-zero entries. As a result, similar sentences can have a low cosine similarity. It can be harmful due to: • over-parameterization, overfitting • increases in both computational and representational expense • the introduction of many "noisy features" affecting the performance (especially when raw TF/IDF values are used). Sparsity/vocabulary size mitigations: - Remove extremely common words, e.g. stop words and punctuation. - Remove extremely uncommon words, i.e. words that only appear in very few documents. - Select the top-TF words because they are more representative. - Select the top-IDF words because they are more informative. - Select the top TF-IDF words to strike a balance. - Decide your vocabulary at training time, and keep it fixed at test time, i.e. use the vocabulary built at training time (because the i-th position in the vector should correspond to the same feature at training and at test time, and the model doesn't understand what each feature means)
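A short scikit-learn sketch: build TF-IDF vectors on a made-up toy corpus and compare two of them with cosine similarity; note that the vocabulary is fixed when fit is called.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Seattle hotel deals", "Seattle motel", "Paris hotel"]   # toy corpus (assumption)
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)              # vocabulary decided here, at "training" time
print(cosine_similarity(X[0], X[1]))     # similarity between doc 0 and doc 1
```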

Inverse Document Frequency (IDF)

If a type appears in many documents, it's less important. Raw form: IDF(t) = (num. of docs) / (num. of docs containing term t). A few problems with this: - If a word does not appear in any document in the corpus (this can happen when a given vocabulary is used), the above definition is ill-defined (division by 0). - When a word appears in very few docs (i.e. the denominator is very small) and there are many documents (i.e. the numerator is very large), the IDF is very large. To tackle these problems, people usually use: IDF(t) = log[num. of docs / (γ + num. of docs containing term t)], where γ can be any positive real number (usually set to 1)

Logistic Regression

Input: vector (e.g. TF-IDF), x ∈ ℝ^n. Output: probability distribution over the possible classes, p ∈ ℝ^k. Parts: Function F: ℝ^n→ℝ^k, then the Softmax function. Parameters to learn: W ∈ ℝ^(n*k) (weights). In standard (linear) LR, the functions σ_1, σ_2, ..., σ_k are linear functions: v_i = σ_i(x, w_i) = x ⋅ w_i, with W = [w_1, w_2, ..., w_k]. The linear function F in LR can be viewed as K neurons (K is the number of classes); x ⋅ w_i is the pre-activation, the input of the activation function.

Constituency parsing

It aims to extract a constituency-based parse tree from a sentence that represents its syntactic structure according to a phrase structure grammar. It's widely used.

Dependency parsing

It focuses on how words relate to other words. It's widely used. Caveat: there are multiple theories for dependency parsing that may yield different results.

(IR Evaluation:) MAP

Mean Average Precision. - AP@K: the average of the precision value obtained for the top K results: AP@K = mean(Precision@i), with i=1, 2, ..., K - Performance of the system: average each query's AP to get the MAP: Performance = mean(AP@K for query_i), i=1, ..., M - MAP vs Precision@K: no clear winner, usually both are reported.

Partitioning algorithms

Method: construct a partition of n documents into a set of K clusters. Given a set of documents (and for some algorithms, the number K), find a partition of K clusters that optimizes the chosen partitioning criterion (e.g., the average similarity within each cluster is maximized)

Feature number

More features usually allow for more flexible curves, hence better performance at training time. But using more features also usually increases the chance of overfitting. Avoid over-parameterization: if featureNum > sampleNum at training time, each feature could effectively correspond to a single sample, hence achieving perfect performance at training time (while generalising poorly).

Stemming

Process of reducing the inflection in words to their root forms, e.g. mapping a group of words to the same stem even if the stem itself isn't a valid word in the language. NLTK includes 2 widely used stemmers: the Porter Stemmer and the Lancaster Stemmer (younger and more aggressive); both treat the input text as a single word. Pros: quick to run (because it's based on simple rules) and suitable for processing a large amount of text Cons: the resulting words may not carry any meaning (or be actual words)
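A quick comparison of the two NLTK stemmers (the example words are made up).

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for w in ["friendly", "quickly", "organizes"]:
    print(w, porter.stem(w), lancaster.stem(w))   # Lancaster is usually more aggressive
```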

Tokenization

Process of splitting sentences into their constituents, i.e. tokens (generally done by white-space or punctuation character separation in English), which are meaningful segments.

Markov property

Refers to the memoryless property of a stochastic process. A stochastic process has that property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it.

Stop-words

Small, unimportant words in an NL query (such as "am", "the", "to" and "are") that support other words and help to construct sentences, but don't affect the meaning of the sentence much.

Topic Model

Something that: - automatically group topically-related words in "topics". - associate tokens and documents with those topics - Unsupervised Learning - Goal: uncover the latent variables - topics - that shape the meaning of our document and corpus Models: Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), ... All of which are based on the same basic assumptions: - each document consists of a mixture of topics, and - each topic consists of a collection of words.

Simple tokenization

Split with white-space (for English texts). Pros: simple and natively supported by Python. Cons: it fails to tokenize punctuation and hyphenated words (e.g. "state-of-the-art").

Constituent

Sub-sentence that behaves as a unit and can appear in different places. E.g. "John talked [to the children] [about drugs]." / "John talked [about drugs] [to the children]." Substitution/expansion/pro-forms: "I sat [on the box / right on top of the box / there]"

Named Entity Recognition (NER)

The goal of such a system is to identify all textual mentions of named entities. Sub-goals: - Identify the boundaries of the NE (e.g. by NP-chunking). - Identify the type of the NE (e.g. by Naïve Bayes classification). It's useful for extracting information and answering specific questions (e.g. "Who is the inventor of Linux?"; the answer would just be "Linus Torvalds" instead of a whole sentence). Challenges: - Simple word lookup incorrectly identifies words as NEs. - Lists of people names or organisations have poor coverage (hard to keep up with new people or orgs). - NE terms are ambiguous (e.g. May and North are DATE and LOCATION, but can also be PERSON). - Further challenge: multi-word terms (e.g. Royal Holloway University). spaCy can be used for this in Python.
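A minimal spaCy sketch (assumes the small English model has been installed with `python -m spacy download en_core_web_sm`; the example sentence is made up).

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Linus Torvalds created Linux while studying in Helsinki.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # entity text and type, e.g. PERSON, GPE
```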

Clustering

The process of grouping a set of objects into classes of similar objects. - Documents within a cluster should be similar. - Documents from different clusters should be dissimilar. It's the commonest form of Unsupervised Learning and an important task that finds many applications in IR and NLP. Why? - Whole corpus analysis/navigation: better UI (search without typing). - For improving performance in search applications (put documents into different clusters then search only in the relevant clusters). - For speeding up vector space retrieval (cluster-based retrieval gives faster search). Internal criterion: A good clustering will produce high-quality clusters in which: the intra-cluster similarity is high and the inter-class similarity is low External criteria for cluster quality: - assesses a clustering with respect to labelled data - assume documents with C gold standard classes (ground truth partitions), while our clustering algorithms produce K clusters (machine-generated partitions): c_1, c_2, ..., c_K, each with n_i members.

Part of Speech Tagging

The process of marking up the words in a text as corresponding to a particular part of speech based on a word's definition and context of its use. Popular ones: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection. It's useful for text-to-speech, finding phrases (det adj* n -> noun phrases) and as an input for downstream NLP tasks (e.g. parsing, chunking, named entity recognition). Applications: - Ambiguity: words with the same spelling might have a different part of speech tags depending on the context. - Extracting phrases
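A quick NLTK sketch of POS tagging (downloads are needed once, and resource names may vary slightly across NLTK versions; tags come from the Penn Treebank tagset).

```python
import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

tokens = nltk.word_tokenize("Promised to back the bill")
print(nltk.pos_tag(tokens))   # list of (token, tag) pairs using Penn Treebank tags
```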

Latent Dirichlet Allocation (LDA)

To generate a document: - Randomly choose a distribution over topics - For each word in the doc: • Randomly choose a topic from the distribution over topics • Randomly choose a word from the chosen topic. Formally: - φ_1:K are the topics, where each φ_k is a distribution over the vocabulary - θ_d are the topic proportions for document d - θ_d,k is the topic proportion for topic k in document d - z_d,n is the topic assignment for word n in document d - w_d,n is the observed word n for document d The joint distribution is p(φ_1:K, θ_1:D, z_1:D, w_1:D) = ∏k[p(φ_k)] ∗ ∏d[p(θ_d) ∗ ∏n[p(z_d,n|θ_d) ∗ p(w_d,n|φ_1:K, z_d,n)]]. It can be pictorially represented by a plate diagram: - only the words are observed (shaded) - α and β are the parameters of the Dirichlet priors - plates indicate repetitions - Topic modelling: maximise the likelihood p(φ_1:K, θ_1:D, z_1:D, w_1:D) - Need to learn the latent (hidden) variables φ_1:K, θ_1:D, z_1:D, plus the parameters α and β. - Dirichlet: "distributions over distributions" Input: K (the number of topics), a document-term matrix of the corpus Output: topics (distributions over words) Pros: - Works better than LSA and probabilistic LSA (pLSA) - Generalises to new documents easily Cons: - Expensive computation: Expectation-Maximization (EM) algorithm or Gibbs-Sampling-based posterior estimation. - Performance is sensitive to hyperparameters: number of topics and iterations
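A small gensim sketch of LDA on a made-up, already-tokenised toy corpus (the texts and K = 2 are illustrative).

```python
from gensim import corpora, models

texts = [["cats", "dogs", "pets"], ["stocks", "markets", "shares"],
         ["dogs", "chase", "cats"]]                 # toy tokenised corpus (assumption)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]     # document-term counts

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                           # each topic = distribution over words
```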

Divisive clustering

Top-down clustering: 1. Starts with all data points in one cluster, the root. 2. Splits the root into a set of child clusters; each cluster is then recursively divided further. 3. Stops when only singleton clusters of individual data points remain, i.e. each cluster contains only a single point. - When taking the first step (the split), it has access to all data points, hence it can find the best possible split at the highest level - Less "blind" to the global structure of the data

Human language

The ultimate interface for interaction and communication, but hard for machines to understand, because it's: - highly ambiguous at all levels - complex, with a subtle use of context to convey meaning - fuzzy and probabilistic Understanding a language requires domain knowledge, discourse knowledge, world knowledge and linguistic knowledge

Closest pair of clusters

Variants: - Single-link: similarity of the (closest point) most cosine-similar. - Complete-link: similarity of the (furthest point) least cosine-similar. - Centroid: clusters whose centroids (centres of gravity) are the most cosine-similar. - Average-link: average cosine between pairs of elements.

Jaccard Coefficient

View the query and each document as a set of words. jaccard(A, B) = |A ∩ B| / |A ∪ B| jaccard(A, A) = 1 jaccard(A, B) = 0 if A ∩ B = ∅ A and B don't have to be the same size. It always assigns a number between 0 and 1. Pros: - Easy to implement and to understand - the baseline method you should consider ("if your proposed method performs worse than the Jaccard coefficient, throw it away" - lecturer). Cons: - It doesn't consider term frequency - It ignores word sequences (e.g. "Alex is taller than Bob" and "Bob is taller than Alex" are regarded as the same). This can be mitigated by considering n-grams

N-Gram Language Model

We can extend to trigrams, 4-grams, 5-grams, ... P(w_1 w_2 ... w_n) ≈ ∏i[P(w_i|w_i-1, ..., w_i-k)], where k controls the gram number. P(w_i|w_i-1, ..., w_i-k) = C(w_i-k, ..., w_i-1, w_i) / C(w_i-k, ..., w_i-1) The larger the k value, the smaller these counts will be; hence it requires more data to estimate the P(w_i|w_i-1, ..., w_i-k) values. Limitations: - It only considers words in n-width windows and ignores long-distance dependencies between words. - But in practice, 2/3-gram LMs strike a good tradeoff between cost and performance. Building one: 0. Tokenise the text and extract the n-grams. 1. Build the LM by "count and divide": P(w_i|w_i-1, ..., w_i-k) = C(w_i-k, ..., w_i-1, w_i) / C(w_i-k, ..., w_i-1), or use ConditionalFreqDist from NLTK. Applications: Given a LM, you can: - with 2 sentences, estimate which is more likely to appear (i.e. which has the higher probability). - with a few words, generate the following words: P(new word | existing n-1 words)
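A minimal bigram LM built with NLTK's ConditionalFreqDist over a made-up toy corpus.

```python
from nltk import ConditionalFreqDist, bigrams

tokens = "as far as i know as far as i can tell".split()   # toy corpus (assumption)
cfd = ConditionalFreqDist(bigrams(tokens))                 # counts C(w_{i-1}, w_i)

# P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})
print(cfd["as"].freq("far"))   # 0.5: "as" is followed by "far" in 2 of its 4 occurrences
```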

