Natural Language Processing
sentiment analysis: Positive Frequency Dictionary
"I am happy because I am learning NLP" "I am Happy" vocabulary: I am happy because learning nlp sad not 3 3 2 1 1 1 0 0
sentiment analysis: Negative Frequency Dictionary
"I am sad, I am not learning NLP" "I am Sad" vocabulary: I am happy because learning nlp sad not 3 3 0 0 1 1 2 1
Generating Candidates
# splits with a list comprehension
word = 'dearz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
# ('', 'dearz') ('d', 'earz') ('de', 'arz') ('dea', 'rz') ('dear', 'z') ('dearz', '')
# deletes with a list comprehension (drop the first char of the right half)
deletes = [L + R[1:] for L, R in splits if R]
# ['earz', 'darz', 'derz', 'deaz', 'dear']
Hash Table
A data structure where the calculated value is used to mark the position in the table where the data item should be stored, enabling it to be accessed directly, rather than forcing a sequential search. Hash Function( Vector) -> Hash Value
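A minimal sketch (illustrative, not from the notes above) of a basic hash table that buckets integer values with hash value = value % n_buckets:

# basic hash table: hash value = value % n_buckets (illustrative choice of hash function)
def basic_hash(value, n_buckets):
    return value % n_buckets

def make_hash_table(values, n_buckets):
    table = {b: [] for b in range(n_buckets)}      # one list (bucket) per hash value
    for v in values:
        table[basic_hash(v, n_buckets)].append(v)  # store each item at its hash position
    return table

print(make_hash_table([100, 10, 14, 17, 97], 10))
# {0: [100, 10], 1: [], 2: [], 3: [], 4: [14], 5: [], 6: [], 7: [17, 97], 8: [], 9: []}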
Euclidean distance
A method of distance measurement using the straight-line distance between two points.
euc. dist. = sqrt( (X2-X1)**2 + (Y2-Y1)**2 + ... + (D2-D1)**2 )
euc. dist. = np.linalg.norm(v2 - v1)
CON: vector magnitude biases the result, so it favors vectors of similar magnitude (cosine similarity avoids this).
N-grams
A representation of word sequences. The length of a sequence varies from 1 to n: a one-word sequence is a unigram, a two-word sequence is a bigram, a three-word sequence is a trigram, and so on. Word order matters.
Corpus: "I am happy because I am learning"   (m = 7: w1 = I, w2 = am, ..., wm = learning)
Notation: w_1^m = w1 w2 ... wm ; a subsequence w_start^stop = w_start ... w_stop
unigrams: { I, am, happy, because, learning }
P(I) = 2/7 , P(happy) = 1/7
bigrams: { I am, am happy, happy because, because I, am learning }
P(am | I) = C(I am)/C(I) = 2/2
P(happy | I) = C(I happy)/C(I) = 0/2
P(learning | am) = C(am learning)/C(am) = 1/2
trigrams: { I am happy, am happy because, ... }
P(w3 | w_1^2) = C(w_1^2 w3)/C(w_1^2)   or in general   P(wN | w_1^(N-1)) = C(w_1^(N-1) wN)/C(w_1^(N-1))
REQUIRED: conditional probabilities given the previous n-1 words
P(w1, w2, ..., wn) = P(w1) * P(w2 | w1) * ... * P(wn | w(n-1) ... w1)
Basically predicts the most probable next word.
CONS: uses a lot of memory. Large N-grams capture dependencies between distant words but need a lot of space and RAM; hence we resort to different types of alternatives.
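A minimal sketch (illustrative) of the bigram conditional probabilities P(wn | w(n-1)) = C(w(n-1) wn) / C(w(n-1)) for the tiny corpus above:

from collections import Counter

tokens = "I am happy because I am learning".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    # C(prev_word word) / C(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("I", "am"))         # 2/2 = 1.0
print(bigram_prob("am", "learning"))  # 1/2 = 0.5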
Feature Extraction: Sparse Representation
A representation that contains mostly zeros, e.g. a vector of 1's and 0's where each entry marks whether the corresponding vocabulary word appears in the text. CONS: the feature vector is as large as the vocabulary, which can lead to long training and prediction times.
Approximate Nearest Neighbor
A nearest-neighbor technique that matches or classifies a new observation by finding similarities ("nearness") to the existing data, trading a little accuracy for speed by searching only within hash buckets instead of the whole data set.
A) Start by creating multiple sets of random planes and applying a locality-sensitive hash function, e.g. #dimensions = 2, #planes = 3: P = random_matrix_planes = np.random.normal(size=(3, 2))
B) take a vector V
C) check the sign of the dot product P • V.T to see on which side of each plane V lands
CBOW Structure
ANN (shallow neural network)
X = context vectors, size V x m (V = vocabulary size, m = # of examples): X = [x^(1) ... x^(m)]
Input-to-hidden activation function: ReLU
z1 = W1 X + b1 : (N x V)(V x m) + (N x m) = N x m
h = hidden layer vector, size N = word embedding size
h = ReLU(z1) = N x m
Hidden-to-output activation function: softmax
z2 = W2 h + b2 : (V x N)(N x m) + (V x m) = V x m = logits
Y = center word vectors, size V x m
Y' = softmax(z2) = V x m
(softmax, not logistic regression, which only distinguishes between two classes)
Extracting embeddings:
W1 has shape N x V; each column of W1 is the embedding column vector of a word of the vocabulary: W1 = [w^(1) w^(2) ... w^(V)]
or W2 has shape V x N; each row of W2 is the embedding row vector of the corresponding word
or average the two, where each column is again the embedding of a vocabulary word: W3 = (W1 + W2.T)/2, shape N x V
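A minimal sketch of the CBOW forward pass with illustrative sizes (V = 8 vocabulary words, N = 4 embedding dims, m = 2 examples); the weights here are random, not trained:

import numpy as np

V, N, m = 8, 4, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0))          # subtract the max for numerical stability
    return e / e.sum(axis=0)

X = rng.random((V, m))                      # averaged one-hot context vectors, V x m
h = relu(W1 @ X + b1)                       # hidden layer, N x m
y_hat = softmax(W2 @ h + b2)                # predicted center-word distribution, V x m

embeddings = (W1 + W2.T) / 2                # N x V; column i = embedding of word i
print(y_hat.shape, embeddings.shape)        # (8, 2) (4, 8)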
Missing N-Grams in Training - Smoothing
Add-one smoothing (Laplacian smoothing) - add 1 to the numerator and the vocabulary size to the denominator:
P = (C + 1) / (N*1 + Σj Cj)
P(wn | w(n-1)) = ( C(w(n-1), wn) + 1 ) / ( |V| + C(w(n-1)) )
Add-k smoothing - add k to the numerator and k*|V| to the denominator:
P = (C + k) / (N*k + Σj Cj)
P(wn | w(n-1)) = ( C(w(n-1), wn) + k ) / ( |V|*k + C(w(n-1)) )
This smooths the probabilities of previously unseen n-grams. The downside is that n-grams not seen in the training data can get too high a probability.
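A minimal sketch (illustrative) of the add-k smoothed bigram probability P(wn | w(n-1)) = (C(w(n-1), wn) + k) / (C(w(n-1)) + k*|V|):

def smoothed_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev_word, word, k=1.0):
    numerator = bigram_counts.get((prev_word, word), 0) + k
    denominator = unigram_counts.get(prev_word, 0) + k * vocab_size
    return numerator / denominator

# even an unseen bigram now gets a small non-zero probability
counts_2 = {("I", "am"): 2}
counts_1 = {"I": 2}
print(smoothed_bigram_prob(counts_2, counts_1, vocab_size=5, prev_word="I", word="happy"))  # 1/7 ≈ 0.143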
Logistic Regression
An algebraic function used to relate the independent variables to the expected dependent variable.
INPUT: X^(i) = feature column vector, e.g. [[1],[8],[11]]
PARAMETERS: θ
PREDICTION: sigmoid(θ.T X)
LABEL Y, e.g. positive sentiment = 1, negative sentiment = 0
PREDICTED LABEL Y'
COST FUNCTION TO MINIMIZE: L(Y, Y')
Gradient descent update: θ = θ - α * gradient
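A minimal sketch (illustrative) of the logistic-regression prediction and one gradient-descent step on a tiny batch; the rows of X are assumed to be [bias, pos_freq, neg_freq] features:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1.0, 3.0, 5.0],
              [1.0, 8.0, 11.0]])        # shape (m, 3)
y = np.array([[1.0], [0.0]])            # labels, shape (m, 1)
theta = np.zeros((3, 1))                # parameters, shape (3, 1)
alpha = 0.01                            # learning rate (hyperparameter)

h = sigmoid(X @ theta)                  # predictions Y', shape (m, 1)
grad = (X.T @ (h - y)) / len(y)         # gradient of the cross-entropy cost
theta = theta - alpha * grad            # one gradient-descent update
print(theta.ravel())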
Applications of Naïve Bayes
Author identification: P(author1 | book) / P(author2 | book)
Spam filtering: P(spam | email) / P(non-spam | email)
Information retrieval: P(document k | query) ~ Πi P(query i | document k); retrieve document k if P(document k | query) > threshold
Word disambiguation: P(context 1 | ambiguous word) / P(context 2 | ambiguous word), e.g. P("RIVER" | "BANK") / P("MONEY" | "BANK")
Deep Learning, Contextual Embedding
BERT (Google), ELMo (Allen Institute for AI), GPT-2 (OpenAI), T-NLG (Microsoft Project Turing) — based on pretrained models. You need to train a deep neural network to learn these word embeddings. Massive deep learning language models (LMs).
Viterbi algorithm Step Forward Pass
The C and D matrices are populated column by column.
a(k,i) = entry of the transition matrix; b(i, wordj) = emission-matrix probability of word j given tag i
C(i,j) = max over all k of [ C(k, j-1) * a(k,i) * b(i, wordj) ] — keep the max value
Because probabilities can be really small, it is best to use logs instead:
log C(i,j) = max over all k of [ log C(k, j-1) + log a(k,i) + log b(i, wordj) ]
D(i,j) = argmax over all k of [ C(k, j-1) * a(k,i) * b(i, wordj) ] — keep the index k of the best previous tag
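A minimal sketch (illustrative) of the forward pass in log space; A = transition probabilities (tags x tags), B = emission probabilities (tags x words), pi = start probabilities, and `sentence` holds word indices into B's columns:

import numpy as np

def viterbi_forward(pi, A, B, sentence):
    n_tags, n_words = A.shape[0], len(sentence)
    C = np.full((n_tags, n_words), -np.inf)     # best log probabilities
    D = np.zeros((n_tags, n_words), dtype=int)  # backpointers (best previous tag index)

    C[:, 0] = np.log(pi) + np.log(B[:, sentence[0]])   # initialization step
    for j in range(1, n_words):
        for i in range(n_tags):
            scores = C[:, j - 1] + np.log(A[:, i]) + np.log(B[i, sentence[j]])
            C[i, j] = scores.max()
            D[i, j] = scores.argmax()
    return C, D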
Singular Value Decomposition (SVD)
Closely related to principal components analysis, it reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).
Transform Vector
Converts a vector from one vector space to the corresponding word vector in a different vector space.
CBOW - continuous bag of words
Corpus -> transformation -> CBOW
C = context half-size = given a center word, the # of words before and after it
Window size = C + center word + C = 2C + 1
Sentence: I am happy because I am learning
Training transformation (C = 2):
Context words (input)        Center word (output)
I am because I               happy
am happy I am                because
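A minimal sketch (illustrative) of generating the (context, center) training pairs with a context half-size C on either side of the center word:

def get_windows(words, C):
    for i in range(C, len(words) - C):
        center = words[i]
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, center

sentence = "i am happy because i am learning".split()
for context, center in get_windows(sentence, C=2):
    print(context, "->", center)
# ['i', 'am', 'because', 'i'] -> happy
# ['am', 'happy', 'i', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> i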
Probability Matrix
Holds the conditional probabilities of the n-grams (the row sums give the marginal counts). Take the count matrix and divide each value by the sum of its row.
Example bigram (N=2) with corpus: <s> I am strong I am brave <e>
row = w(i-1), col = w(i)
          <s>  <e>   I   am  strong  brave | row sum (counts)
<s>        0    0    1    0    0       0   |   1
<e>        0    0    0    0    0       0   |   0
I          0    0    0    1    0       0   |   2
am         0    0    0    0   .5      .5   |   2
strong     0    0    1    0    0       0   |   1
brave      0    1    0    0    0       0   |   1
Viterbi algorithm Step Backward Pass
Extract the best path from C and D:
1) go to the last word's column of C and find the row index of the POS tag with the largest probability
2) use that index to look up D, which contains the index of the best POS tag in the previous word's column; walk backwards column by column to the first word
Feature Extraction Preprocessing: Stop Words
Frequently used words that are part of a sentence but don't add value, such as conjunctions, as well as punctuation.
Locality Sensitive Hashing
The hash value is calculated to reflect how close vectors are to each other in the vector space. Hash Function(Vector) -> Hash Value
1) divide the data with hyperplanes; the sign of the dot product tells you on which side of a plane a vector lands: side = np.sign(np.dot(P, v.T))
Multiple planes P1, P2, P3 and a vector V, example:
P1 • V.T = 3  -> sign = 1  -> hash1 = h1 = 1
P2 • V.T = 5  -> sign = 1  -> hash2 = h2 = 1
P3 • V.T = -1 -> sign = -1 -> hash3 = h3 = 0
...
Hash = 2^0 * h1 + 2^1 * h2 + 2^2 * h3 = 1*1 + 2*1 + 4*0 = 3
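A minimal sketch (illustrative) of computing the locality-sensitive hash value of a vector: the sign of the dot product with each random plane gives one bit of the bucket id:

import numpy as np

def hash_value_of_vector(v, planes):
    # planes: (n_planes, n_dims), v: (1, n_dims)
    dots = np.dot(planes, v.T).ravel()          # one dot product per plane
    h = (dots >= 0).astype(int)                 # sign -> 0/1 per plane
    return int(sum(h[i] * 2**i for i in range(len(h))))

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 2))                # 3 planes in a 2-dimensional space
v = np.array([[2.0, 1.5]])
print(hash_value_of_vector(v, planes))          # bucket id in [0, 2^3)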
Smoothing
In general, you decrease every seen entry's probability by a little bit so that the zero probabilities become non-zero. This assumes there are more non-zero entries than zeros, which is usually the case.
P = (C + ϵ) / (N*ϵ + Σj Cj)
Emission Matrix
In hidden Markov models you make use of emission probabilities, which give you the probability of going from one state (POS tag) to a specific observed word. Note that the sum of each row of your B matrix has to be 1.
B (POS -> WORD probabilities), e.g.:
        going   to    eat   ...
NN       .5     .1    .1
VB       .3     .1    .5
O        .3     .5    .68
C(t(i), w) = count of the # of times word w is tagged as t(i)
row sum for t(i) = Σj C(t(i), w(j))
P(w(i) | t(i)) = C(t(i), w(i)) / Σj C(t(i), w(j))
To avoid dividing by zero, apply smoothing over the N columns of a row:
P(w(i) | t(i)) = ( C(t(i), w(i)) + ϵ ) / ( N*ϵ + Σj C(t(i), w(j)) )
vocabulary
List of unique words in a document
Cost
Mean loss over a batch, where m is the batch size (number of columns):
J = -1/m Σ_{i=1}^{m} Σ_j y_j^(i) ln( y'_j^(i) )
Levenshtein distance
Measuring the edit distance using the three edits insert, delete, and replace, with costs 1, 1, and 2 respectively, is known as Levenshtein distance.
insert - adds a letter, cost = 1
delete - removes a letter, cost = 1
replace - replaces a letter with an entirely new letter, cost = 2
Error Analysis
Mistakes on individual examples are noted and categorized by type:
1) semantic meaning lost in preprocessing (removing stop words, punctuation, ...) — resolution: keep the original text for verification
2) word-order relevance — resolution: keep the original text for verification
3) trickiness of word phrasing and language quirks, adversarial attacks: humor/irony/etc., e.g. "Basterd", "Niga" used amongst friends
new text sources
Now that you have a vocabulary array, you will use it when processing new text sources. A new text will have words that do not appear in the current vocabulary.
Transform Vector (Matrix)
One method is to construct a 2-dimensional matrix R that is applied by matrix multiplication to a vector: X • R = Y = np.dot(X, R)
To find R, the goal is to minimize the distance between X • R and Y:
a) collect a subset of words, where each word embedding is a row of a matrix X
b) collect a target matrix Y of translated word embeddings whose rows match the meaning of the rows of X
Initialize R, then LOOP:
1) Loss = || X • R - Y ||_F^2 = how far apart the transformed vectors are
2) gradient g = d/dR Loss = (2/m) X.T • (X • R - Y)
3) R = R - α g
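A minimal sketch (illustrative) of learning the transform matrix R by gradient descent on the Frobenius-norm loss || X R - Y ||_F^2; the data here is a toy example with a known linear map:

import numpy as np

def train_transform(X, Y, steps=200, alpha=0.01):
    m = X.shape[0]
    R = np.random.default_rng(0).random((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        diff = X @ R - Y
        loss = np.sum(diff ** 2) / m          # squared Frobenius norm, averaged over m
        gradient = (2 / m) * (X.T @ diff)
        R = R - alpha * gradient
    return R, loss

X = np.random.default_rng(1).random((10, 4))   # 10 "source" word vectors
Y = X @ np.eye(4) * 2.0                        # toy target: a known linear map
R, final_loss = train_transform(X, Y)
print(round(final_loss, 4))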
Conditional Probability Chain Rule
P (A, B, C, D)= P(A)P(B|A)P(C | A,B)P(D | A,B,C)
Bayes Rule
P(Positive | word) P(word) = P(Positive ∩ word)
P(word | Positive) P(Positive) = P(Positive ∩ word)
=> P(word | Positive) P(Positive) = P(Positive | word) P(word)
P(X | Y) = P(Y | X) P(X) / P(Y)
Prior Ratio
P(pos)/P(neg); in a balanced data set, P(pos)/P(neg) = 1
Vector Manipulation
RELATIONSHIP = subtracting 2 vectors tells you the offset (difference in dimensions) that relates one vector to another, e.g. RELATIONSHIP = USA - NY.
To see if other vectors have a similar relationship, add that difference to another vector, e.g. RUSSIA + RELATIONSHIP = MOSCOW-LIKE.
Then compare vectors to MOSCOW-LIKE using cosine similarity and select the closest one.
Feature Extraction Preprocessing: Stemming
Reducing words to their base/stem form, removing tense, e.g. tuning -> tun, tune -> tun, tuned -> tun; this reduces vocabulary size.
Relu function
ReLU(x) = max(0, x) -> [0, infinity)
Language Model Evaluation - Split Data
SMALL DATA SETS TRAIN 80%/ Validation 10% (Hyper tuning)/ TEST 10% (Accuracy evaluation) LARGE DATA SET TRAIN 98%/ Validation 1% (Hyper tuning)/ TEST 1% (Accuracy evaluation)
Auto Correction
STEPS
1) identify the misspelled word, e.g. "deah" — check whether the word is in the dictionary
2) find strings n edit distance away: "deah_", "_eah", "d_ah", "de_h", "dea_" -> n = 1 (edit operations)
3) filter candidates: keep real words — check whether the candidates are in the dictionary: candidates = set(vocab).intersection(set(edits))
4) calculate word probabilities and pick the most likely candidate
Generative Language Model
Text generation from scratch or from text hints:
1) randomly choose among the sentence-start bigrams { <s>, ? } in the corpus, selecting ? with the highest probability
2) choose the next bigram { ?, next? } starting with the previous word, again with the highest probability
3) repeat 2 until next? = <e>
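A minimal sketch (illustrative) of greedy bigram text generation: starting from <s>, repeatedly pick the most probable next word until <e> is produced:

from collections import Counter, defaultdict

corpus = [["<s>", "lyn", "drinks", "chocolate", "<e>"],
          ["<s>", "john", "drinks", "tea", "<e>"],
          ["<s>", "lyn", "eats", "chocolate", "<e>"]]

next_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        next_counts[prev][nxt] += 1

word, generated = "<s>", []
while word != "<e>":
    word = next_counts[word].most_common(1)[0][0]   # most probable continuation
    generated.append(word)
print(" ".join(generated))    # e.g. "lyn drinks chocolate <e>"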
Approximation of Sequence Probabilities
The longer the sequence, the less likely it is that all of its words occur adjacent to each other in that exact order, so approximate using only the last N-1 words.
Corpus: the teacher drinks tea
bigram (N=2): P(tea | the teacher drinks) ~ P(tea | drinks), because only the last N-1 = 1 word is used
In general: P(wn | w_1^(n-1)) ~ P(wn | w_(n-N+1)^(n-1))
trigram (N=3): P(w9 | w1, w2, ..., w8) ~ P(w9 | w_(9-3+1)^(9-1)) = P(w9 | w7 w8)
One-hot encoding Vectors = Words as Vectors
The process by which categorical variables are converted into binary form (0 or 1) for machine reading. It is one of the most common methods for handling categorical features in text data.
A binary encoding doesn't imply any relationship between words; each entry is simply true or false.
Each column is one word's vector: # rows = all the words in the vocabulary, set to zero except the actual word's entry, which is set to 1.
CONS: a) HUGE and sparse b) word meaning is lost
Log-Likelihood Function
The sum of the log-likelihoods, where the log-likelihood for each observation is the log of the density of the dependent variable given the explanatory variables; the log-likelihood function is viewed as a function of the parameters to be estimated.
It simplifies the Naive Bayes inference:
NBI = (P(pos)/P(neg)) * Πm ratio(wm) = (P(pos)/P(neg)) * Πm P(wm | pos)/P(wm | neg)
Taking logs: log(NBI) = log(prior ratio) + log(likelihood)
where λ(w) = log(ratio(w)) = log( P(w | pos)/P(w | neg) )
log(NBI) = log( P(pos)/P(neg) ) + Σm λ(wm)
So generate a column of λ values, one per word.
Count Matrix
This captures the number of occurrences of the relevant n-grams, for
P(wn | w_(n-N+1)^(n-1)) = C(w_(n-N+1)^(n-1), wn) / C(w_(n-N+1)^(n-1))
DETERMINE the numerator C(w_(n-N+1)^(n-1), wn):
rows = unique (N-1)-grams, columns = unique corpus words
Example bigram (N=2) with corpus: <s> I am strong I am brave <e>
row = w_(i-N+1)^(i-1), col = w_i
          <s>  <e>   I   am  strong  brave
<s>        0    0    1    0    0       0
<e>        0    0    0    0    0       0
I          0    0    0    2    0       0
am         0    0    0    0    1       1
strong     0    0    1    0    0       0
brave      0    1    0    0    0       0
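A minimal sketch (illustrative) of building the bigram count matrix for the corpus above as a numpy array indexed by a word-to-index vocabulary map; dividing each row by its sum gives the probability matrix:

import numpy as np

tokens = "<s> I am strong I am brave <e>".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)), dtype=int)
for prev, word in zip(tokens, tokens[1:]):
    counts[idx[prev], idx[word]] += 1          # rows = w_(i-1), cols = w_i

row_sums = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, row_sums, out=np.zeros_like(counts, dtype=float), where=row_sums > 0)
print(vocab)
print(probs.round(2))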
Extrinsic evaluation of the quality of the vectors
To test your word embeddings on external tasks, e.g. named entity recognition or parts-of-speech tagging.
Evaluation occurs on a test set using accuracy or F1 score, and it evaluates the embedding together with the classification task.
Evaluates the actual usefulness of the embeddings.
More time-consuming to evaluate than intrinsic evaluation.
Hard to debug, because the performance metric provides no information about which specific part of the end-to-end process is responsible.
Dot Product of Vectors
U • V = u1v1 + u2v2 = |U| |V| cos θ — related to the projection of one vector onto another.
Missing N-Grams in Training
What is the probability of an N-gram that is constructed from words in the corpus but is not actually present in the corpus as an N-gram?
Laplacian smoothing
Used when P(w | pos) = 0 or P(w | neg) = 0.
V = count of unique words in the vocabulary
Σw P(w | class) = 1
P(word | class) = ( freqs(word, class) + 1 ) / ( Σw freqs(w, class) + V )
Co-occurrence Matrix
Word by word: # of times 2 words appear close together within a distance k.
sentence 1: I like simple data
sentence 2: I prefer simple raw data
With k = 2, the row for the word "data" over common words of the sentences: simple: 2, raw: 1, like: 1, I: 0
Word by doc: # of times a word appears within a category. In 3 document categories (Entertainment, ML, Economy):
the word "data" appears Entertainment: 500, ML: 9320, Economy: 6620
the word "film" appears Entertainment: 7000, ML: 1000, Economy: 4000
Assumptions of Naïve Bayes
Words in a sentence are assumed independent.
Bad: words can be used together to describe/reference another word in a sentence and are not necessarily stand-alone, e.g. "sunny and hot" or "cold and snowy".
Relies on the data distribution of the training set. Good training sets have equal frequencies of the classes; in practice, bias is present in the sentiments of the training tweets, for example.
Softmax function
Maps values to [0, 1], giving a vector of probabilities:
softmax(z) = e^z / Σ e^z
e_z = np.exp(z); softmax = e_z / np.sum(e_z)
Principal Component Analysis (PCA)
A dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set. This allows you to plot vectors better: you take uncorrelated features and project the data onto a lower dimension while losing as little information as possible.
1) mean-normalize the data: xi = (xi - mean(xi)) / σ(xi)
2) get the covariance matrix
3) perform singular value decomposition (SVD)
RESULT matrices:
a) eigenvectors as columns of U = [U1, U2, ...]
b) eigenvalues on the diagonal of the matrix S: S00, S11, S22, ...
They are used to project the vectors: X' = np.dot( X, U[:, 0:desired dimensions] )
% of variance retained in the new vector space = Σ_{i < desired dims} Sii / Σ_{j < original dims} Sjj
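A minimal sketch (illustrative) of PCA via mean-normalization, the covariance matrix, and SVD:

import numpy as np

def compute_pca(X, n_components=2):
    X_demeaned = X - X.mean(axis=0)                 # mean-normalize each feature
    covariance = np.cov(X_demeaned, rowvar=False)   # features are the columns
    U, S, _ = np.linalg.svd(covariance)             # eigenvectors U, eigenvalues S
    X_reduced = X_demeaned @ U[:, :n_components]    # project onto the first components
    retained = S[:n_components].sum() / S.sum()     # fraction of variance retained
    return X_reduced, retained

X = np.random.default_rng(0).random((50, 5))        # 50 samples, 5 features
X_2d, retained = compute_pca(X, n_components=2)
print(X_2d.shape, round(retained, 2))               # (50, 2) and the retained variance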
Viterbi algorithm
A graph algorithm.
Given the transition matrix (POS -> POS), starting at the start token 𝝅 you multiply the transition probabilities from t(i-1) to t(i) along the tag positions of the sentence; this yields several possible paths t(i-1) -> t(i), and you choose the path with the MAX probability for the next word in the sentence.
Given the emission matrix (POS -> WORD), you also multiply in the probability of each word given its tag.
The total probability is the product of all transition and emission probabilities along the entire sentence.
Example
Transition matrix (POS -> POS), starting at token 𝝅: P(𝝅,O) = .3, P(O,NN) = .5, P(O,VB) = .5, P(VB,O) = .2
Emission matrix (POS -> WORD): P(O, I) = .5, P(VB, love) = .5, P(NN, love) = .1, P(O, to) = .4, P(VB, learn) = .2
Sentence: <s> I love to learn
P(𝝅 -> O, I)      = P(𝝅,O) * P(O,I)      = .3 * .5 = .15
P(O -> ?, love)   = max( P(O,VB)*P(VB,love), P(O,NN)*P(NN,love) ) = max(.5*.5, .5*.1) = .25
P(VB -> O, to)    = P(VB,O) * P(O,to)     = .2 * .4 = .08
P(O -> VB, learn) = P(O,VB) * P(VB,learn) = .5 * .2 = .10
Total probability of the sequence 𝝅 -> O -> VB -> O -> VB = .15 * .25 * .08 * .10 = .0003
STEPS 1) Initialization step 2) Forward pass 3) Backward pass
N = # of POS tags, i.e. rows; K = # of unique words, i.e. columns
Matrix C = POS tags x word columns = optimal probabilities
Matrix D = POS tags x word columns = indices of visited states
Hidden Markov Model
a model in which there is a graph of states with probabilistic transitions between states, in which the state that the system is in cannot be observed directly. States are hidden or not directly observable
transition matrix
A set of probabilities that determine what happens from one time step to the next, i.e. it gives you the probabilities of going from one state to another.
Q = states = { q1, q2, q3, ..., qN }
A = transition matrix of (N+1) x N probabilities = { a(1,1), ..., a(N+1,N) } (the extra row is the start token 𝝅)
Note that the sum of each row of your A matrix has to be 1.
Count the occurrences of POS tag pairs: count the number of times tags t(i-1), t(i) show up next to each other and divide by the total number of times t(i-1) shows up.
C(t(i-1), t(i)) = count of times tag (i-1) shows up before tag i, counted across the entire corpus of sentences after adding a start token 𝝅 to each sentence.
row sum = Σj C(t(i-1), t(j)) for each t(i-1)
P(t(i) | t(i-1)) = C(t(i-1), t(i)) / Σj C(t(i-1), t(j))
To avoid dividing by zero, apply smoothing over the N columns of a row:
P(t(i) | t(i-1)) = ( C(t(i-1), t(i)) + ϵ ) / ( N*ϵ + Σj C(t(i-1), t(j)) )
POS -> POS probability
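A minimal sketch (illustrative) of building a smoothed transition matrix from POS-tagged sentences; "pi" stands in for the start token added to every sentence:

import numpy as np
from collections import Counter

tagged_sentences = [["O", "VB", "O", "VB"],
                    ["O", "NN"]]
tags = ["pi", "O", "VB", "NN"]

pair_counts = Counter()
for sent in tagged_sentences:
    prev = "pi"                        # start token before every sentence
    for tag in sent:
        pair_counts[(prev, tag)] += 1
        prev = tag

epsilon, N = 0.001, len(tags) - 1      # N columns = real tags (no transition into pi)
A = np.zeros((len(tags), N))
for i, prev in enumerate(tags):
    row_counts = np.array([pair_counts[(prev, t)] for t in tags[1:]])
    A[i] = (row_counts + epsilon) / (N * epsilon + row_counts.sum())
print(A.round(3))                      # each row sums to 1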
Probabilistic Model
A statistical model applicable when a variable (e.g. product demand) is not known exactly but can be specified by means of a probability distribution; useful for auto-correction as well as web search suggestions.
Markov Chain
a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
CBOW - Transformation of Center Words to Vectors
a) Corpus -> vocabulary -> one-hot vector matrix
b) Transform the context words into a single vector: the average of the individual one-hot vectors of the context.
Example: context "I am because I" (4 words) for center word "happy":
             I   am  because  I   | sum/4
am           0    1     0     0   |  .25
because      0    0     1     0   |  .25
happy        0    0     0     0   |  0
I            1    0     0     1   |  .5
learning     0    0     0     0   |  0
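A minimal sketch (illustrative) of averaging the one-hot context vectors into the single CBOW input vector for the context "I am because I":

import numpy as np

vocab = ["am", "because", "happy", "I", "learning"]
idx = {w: i for i, w in enumerate(vocab)}

def word_to_one_hot(word):
    v = np.zeros(len(vocab))
    v[idx[word]] = 1
    return v

context = ["I", "am", "because", "I"]
context_vector = np.mean([word_to_one_hot(w) for w in context], axis=0)
print(context_vector)    # [0.25 0.25 0.   0.5  0.  ]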
CBOW - Cleaning and Tokenization
a) lowercase
b) punctuation represented as a single character, the empty string, etc.
c) numbers might be dropped if insignificant, or replaced with a tag
d) special characters may need to be replaced by the empty string
e) special "words" like emoji and hashtags treated as a single word
nltk.download('punkt') — a predefined way of handling punctuation/tokenization for English
sentiment analysis
An automated process of analyzing and categorizing social media to determine the amount of positive, negative, and neutral online comments a brand receives. Looking at the vocabulary, you can create a positive frequency and a negative frequency associated with every word in the vocabulary.
intrinsic evaluation of the quality of the vectors
Assess how well the word embeddings capture the semantic or syntactic relationships between words.
Semantics refers to the meaning of words: fill in the missing word in semantic analogies such as "France" is to "Paris" as "Italy" is to <?>.
Syntax refers to grammar, such as plurals, tenses, and comparatives: syntactic analogies such as "seen" is to "saw" as "been" is to <?>.
Testing relationships via analogies is not perfect when more than one analogy applies.
Clustering algorithms can group similar word embeddings; visualization also helps.
FastText
Based on the skip-gram model; takes into account the structure of words by representing words as n-grams of characters. Supports out-of-vocabulary words; word vectors can be averaged together.
Viterbi algorithm Step Initialization Step
Create a matrix C and initialize its first column with the start transitions times the emissions: C(i,1) = a(1,i) * b(i, word1).
Create a matrix D and initialize its first column with 0: D(i,1) = 0, because no POS tags have been traversed yet.
defaultdict collection
dict subclass that calls a factory function to supply missing values they are a special kind of dictionaries that return the "zero" value of a type if you try to access a key that does not exist. Since you want the frequencies of words, you should define the defaultdict with a type of int. from collections import defaultdict freq = defaultdict(int)
backtrace
Displays which steps were taken; in the edit-distance table it tells you which tabular path you used to get to a cell.
Optimizing Minimum Levenshtein Distance with Dynamic Programming (Tabular)
Example: source = play -> target = stay
Min edit distance D[i,j] = cost of editing source[:i] -> target[:j], e.g. D[2,3] = pl -> sta ; D[4,4] = play -> stay
D[0,0] = 0 — no change
D[0,1] = 1 : "" -> s : insert ; D[1,0] = 1 : p -> "" : delete
D[1,1] = min of (p -> ps -> s), (p -> "" -> s), or replace(p,s) on the diagonal
D[1,1] = min( D[0,1] + delete cost, D[1,0] + insert cost, D[0,0] + replace cost ) = min( 1+1, 1+1, 0+2 ) = 2
First column (deletes): D[2,0] = pl -> "" = 2, D[3,0] = pla -> "" = 3 ; D[i,0] = D[i-1,0] + delete cost
First row (inserts): D[0,2] = "" -> st = 2, D[0,3] = "" -> sta = 3 ; D[0,j] = D[0,j-1] + insert cost
General cell:
D[i,j] = min( D[i-1,j] + delete cost,
              D[i,j-1] + insert cost,
              D[i-1,j-1] + replace cost (2, or 0 if the characters are the same) )
e.g. replace applies in D[1,1] (p vs s) and D[2,2] (l vs t), where the characters differ.
Full table (minimum edit distance = D[4,4] = 4):
        ""   S   T   A   Y
  ""     0   1   2   3   4
  P      1   2   3   4   5
  L      2   3   4   5   6
  A      3   4   5   4   5
  Y      4   5   6   5   4
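A minimal sketch (illustrative) of the tabular minimum edit distance with costs insert = 1, delete = 1, replace = 2 (0 when the characters already match):

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost            # delete all of source[:i]
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost            # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,   # delete
                          D[i][j - 1] + ins_cost,   # insert
                          D[i - 1][j - 1] + r)      # replace (or keep)
    return D[m][n]

print(min_edit_distance("play", "stay"))   # 4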
Eigenvalue
explained variance - The amount of information retained by each feature - Represents the total variance explained by each feature pcaTr.explained_variance_
Global Vectors (GloVe)
Factorizes the logarithm of the corpus's word co-occurrence matrix, which is similar to the count matrix you've used before. GloVe doesn't use neural networks at all.
Word by Word Vectors
generate a Co-occurrence Matrix of word to other words that produces a Vector. # of times 2 words appear close together within a distance k
Word by Doc
Generate a co-occurrence matrix of a word against related categories, which produces a vector: the # of times a word appears within a category.
RNN (Recurrent Neural Networks) Feedforward explained
h<t0> + x<t1> -> y'<t1>, h<t1>
h<t1> + x<t2> -> y'<t2>, h<t2>
....
h<T-1> + x<T> -> y'<T>
=========================================
where Wh • [h<t-1>, x<t>] + bh = Whh • h<t-1> + Whx • x<t> + bh
Wh in the first formula denotes the horizontal concatenation of Whh and Whx from the second formula.
# Option 1: concatenate - horizontal
w_h1 = np.concatenate((w_hh, w_hx), axis=1)
# Option 2: hstack
w_h2 = np.hstack((w_hh, w_hx))
[h<t-1>, x<t>] = vertical concatenation (vertical stack) of h<t-1> on top of x<t>
# Option 1: concatenate - vertical
ax_1 = np.concatenate((h_t_prev, x_t), axis=0)
# Option 2: vstack
ax_2 = np.vstack((h_t_prev, x_t))
g is an activation function, so that
h<t> = g( Wh • [h<t-1>, x<t>] + bh ) = g( Whh • h<t-1> + Whx • x<t> + bh )
shape-wise: (4x1) = (4x14)•(14x1) + (4x1) = (4x4)•(4x1) + (4x10)•(10x1) + (4x1)
The current hidden state then yields the prediction:
y'<t> = g( Wyh • h<t> + by )
ADVANTAGES:
1) capture dependencies within a short range
2) take up less RAM than other n-gram models
CONS:
1) struggle to capture long-term dependencies; accuracy drops as the relevant words get farther apart
2) prone to vanishing or exploding gradients
3) no parallel computation: hidden states are computed in sequence
Recall that the hidden state at step t is defined as
h<t> = g( Whh • h<t-1> + Whx • x<t> + bh )
where g is an activation function (usually sigmoid). So you can use the chain rule to get the partial derivative:
∂h<t>/∂h<t-1> = Whh.T diag( g'( Whh h<t-1> + Whx x<t> + bh ) )
∏_{t ≥ i > k} ∂h<i>/∂h<i-1> = ∏_{t ≥ i > k} Whh.T • diag( g'( Whh h<i-1> + Whx x<i> + bh ) )
# Interval [0, 1]
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Gradient interval [0, 0.25]
def sigmoid_gradient(x):
    return sigmoid(x) * (1 - sigmoid(x))
def prod(k):
    p = np.eye(h.shape[0])
    for i in range(t - 1, k - 2, -1):
        p = p @ W_hh.T @ np.diag(sigmoid_gradient(W_hh @ h[:, i] + W_hx @ x[:, i] + b_h))
    return p
where diag converts that vector into a diagonal matrix
Vector Space Models
Identify whether a pair of sentences is similar by representing words as vectors that capture their relative meaning. Each pair of words in a co-occurrence matrix forms the axes of a 2-dimensional vector space where their individual co-occurrence frequencies form a vector used for comparison. In a vectorized representation of your data, equal sequence length allows more efficient batch processing.
Word Embedding Vectors - Words as Vectors - Adding Meaning
Instead of using 0 or 1 you can use negative and positive numbers to encode negative and positive connotations, e.g. rage = -2.52, excited = 2.31.
You can add another dimension to represent abstract vs. concrete definitions, e.g. snake (-5.3, 4.1), thought (0.03, -0.93).
Word embedding vectors: a) low dimension (less than V) b) embed meaning, e.g. analogies, semantic distance.
Hyperparameter example: a) word embedding size.
Learning embeddings is self-supervised because it aims to learn representations without requiring human-annotated labels and then use those representations on downstream tasks. It is both:
unsupervised — in the sense that the input data, the corpus, is unlabeled;
supervised — in the sense that the data itself provides the necessary context, which would ordinarily make up the labels.
Missing N-Grams in Training - BackOff
A model generalization method that leverages information from lower-order n-grams when information about the higher-order n-gram is missing: if an N-gram is missing, use the (N-1)-gram conditional probability.
Besides the trigram ('John', 'drinks', 'chocolate') we also use the bigram ('drinks', 'chocolate') and the unigram ('chocolate'):
probability of the trigram ('John', 'drinks', 'chocolate') — not found
probability of the bigram ('drinks', 'chocolate') — found
e.g. the probability of the trigram ('are', 'you', 'happy') is estimated as the probability of the corresponding bigram * a constant.
Corpus:
<s> Lyn drinks chocolate <e>
<s> John drinks tea <e>
<s> Lyn eats chocolate <e>
Trigram N=3: P(chocolate | John drinks) = P(wi | wi-2 wi-1) = 0, missing
Stupid backoff to the next lowest order:
~P(chocolate | John drinks) ≈ 0 * P(chocolate | John drinks) + c * P(chocolate | drinks) + 0 * P(chocolate)
Bigram N=2: skipping P(chocolate | John drinks), use ~P(chocolate | drinks) = P(wi | wi-1)
Language Model
A tool that calculates the probabilities of sentences. You can think of a sentence as a sequence of words.
Speech recognition -> pick the word sequence that fits the grammar
Spelling correction -> swap in the most probable word
Augmentative communication -> word prediction based on hand gestures
Use the probability matrix to get a sentence probability, e.g. bigram sentence <s> I am brave <e>:
P(sentence) = P(I | <s>) P(am | I) P(brave | am) = (1)(1)(.5)
This involves multiplying many small probabilities, increasing numerical error, so store log(P(sentence)) instead:
log(P(I | <s>)) + log(P(am | I)) + log(P(brave | am))
Back propagation
An algorithm that calculates the partial derivatives (gradient) of the cost with respect to the weights and biases of the neural network:
∂J/∂W1 = 1/m ReLU( W2.T (Y' - Y) ) X.T
∂J/∂b1 = 1/m ReLU( W2.T (Y' - Y) ) 1.T
∂J/∂W2 = 1/m (Y' - Y) H.T
∂J/∂b2 = 1/m (Y' - Y) 1.T
Gradient descent adjusts the weights and biases of the neural network using the gradient to minimize the cost:
W1 = W1 - α ∂J/∂W1
W2 = W2 - α ∂J/∂W2
b1 = b1 - α ∂J/∂b1
b2 = b2 - α ∂J/∂b2
Corpus
An entire set of sentences. Creating a frequency table and summing it up tells you:
V - total size of the corpus (total word count)
C(word) - number of times the word appears in the corpus
P(word) - probability of the word in the corpus = C(word)/V
Loss Function - Cross Entropy
Loss function: cross-entropy loss J = - Σk yk ln(y'k); minimize J.
WRONG prediction: y' for the true class ~ 0, so J ~ infinity
CORRECT prediction: y' for the true class ~ 1, so J ~ 0
cosine similarity
A measure of similarity between vectors that measures the cosine of the angle between them; used to measure the similarity between documents by representing them as vectors. It focuses on the cosine of the angle between the vectors and ignores magnitudes.
norm or magnitude = sqrt(sum(v**2)) = np.linalg.norm(v)
cos(β) = dot(v1, v2) / (norm(v1) * norm(v2))
β = 90° — orthogonal (maximally different) => cos(β) = 0
β = 0° — same direction => cos(β) = 1
Language Model Evaluation - Perplexity Metrics
A measure of the complexity in a sample of text. It can tell whether a set of sentences was created by humans or by a machine; text written by humans has the lowest perplexity. We are measuring uncertainty, much the same as entropy. The higher the probability of the test set W, the lower the perplexity.
PP(W) = P(s1, s2, ..., sm)^(-1/m)
For a bigram model: PP(W) = ( Πi 1/P(wi | w(i-1)) )^(1/m)
or log(PP(W)) = -1/m Σi log2( P(wi | w(i-1)) )
where wi is the i-th word of the test set (sentences ending in <e>), and m is the # of words in the test set W, including <e> but not <s>.
Good language models have PP(W) between 20 and 60 for English, sometimes even lower; PP(character-level models) < PP(word-level models).
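A minimal sketch (illustrative) of the log-perplexity of a bigram model, log2 PP(W) = -1/m Σi log2 P(wi | w(i-1)):

import math

def log_perplexity(bigram_probs, sentence):
    log_sum, m = 0.0, 0
    for prev, word in zip(sentence, sentence[1:]):
        log_sum += math.log2(bigram_probs[(prev, word)])
        m += 1
    return -log_sum / m

probs = {("<s>", "I"): 1.0, ("I", "am"): 1.0, ("am", "brave"): 0.5, ("brave", "<e>"): 1.0}
lp = log_perplexity(probs, ["<s>", "I", "am", "brave", "<e>"])
print(lp, 2 ** lp)    # 0.25 and perplexity ≈ 1.19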
minimum edit distance algorithm
The minimum number of operations it would take to convert one word into another.
shance -> chance :: dist = 2 -> replace(s,c)
cimb -> climb :: dist = 1 -> insert(l)
play -> stay :: dist = 4 -> replace(p,s), replace(l,t)
Edits: insert, delete, replace (= insert + delete), swap (two adjacent letters).
The minimum edit distance depends only on the editing costs and the two words being compared, not on any corpus or vocabulary.
numpy A@B
numpy.matmul(A, B) — matrix multiplication
Sigmoid
Prediction h(x, θ) = sigmoid(θ.T x); when plotted, h(x, θ) is on the vertical axis and θ.T x on the horizontal axis.
Naive Bayes Classifier for sentiment Analysis
Predicts the probability of a certain outcome based on prior occurrences of related events. Assumes that the variables (words) are independent.
For each word in the vocabulary you can create a positive frequency and a negative frequency:
V = count of unique words in the vocabulary
N_pos = Σw freqs(w, 1), N_neg = Σw freqs(w, 0)
For each word calculate a new table:
P(word | pos) = freqs(word, 1) / N_pos
P(word | neg) = freqs(word, 0) / N_neg
Σw P(w | pos) = 1 and Σw P(w | neg) = 1
Any word where P(w | pos) = P(w | neg) is neutral and doesn't add to the sentiment.
Actively avoid P(w | pos) = 0 or P(w | neg) = 0 (use Laplacian smoothing).
Power words are widely skewed: P(w | pos) >> P(w | neg) or P(w | pos) << P(w | neg).
TRAINING STEPS:
1) annotate tweets as positive or negative
2) preprocess the tweets: lowercase; remove punctuation, URLs, names; remove stop words; stemming; tokenize sentences
3) get the columns Σw freqs(w, 1) pos. and Σw freqs(w, 0) neg.
4) get the columns P(w | pos), P(w | neg)
5) get the column λ(w) = log(ratio(w))
6) calculate the log prior ratio = log( P(pos)/P(neg) )
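A minimal sketch (illustrative) of Naive Bayes sentiment scoring with Laplacian smoothing: score = log prior + sum of λ(w) over the words of the tweet; the toy tweets here are assumed already preprocessed:

import math
from collections import Counter

pos_tweets = [["i", "am", "happy"], ["i", "am", "learning", "nlp"]]
neg_tweets = [["i", "am", "sad"], ["i", "am", "not", "learning"]]

pos_freqs = Counter(w for t in pos_tweets for w in t)
neg_freqs = Counter(w for t in neg_tweets for w in t)
vocab = set(pos_freqs) | set(neg_freqs)
n_pos, n_neg, V = sum(pos_freqs.values()), sum(neg_freqs.values()), len(vocab)

# lambda(w) = log( P(w|pos) / P(w|neg) ), both smoothed with +1 in the numerator and +V in the denominator
lam = {w: math.log((pos_freqs[w] + 1) / (n_pos + V)) -
          math.log((neg_freqs[w] + 1) / (n_neg + V)) for w in vocab}
log_prior = math.log(len(pos_tweets) / len(neg_tweets))   # 0 for a balanced set

def score(tweet):
    return log_prior + sum(lam.get(w, 0) for w in tweet)  # > 0 => positive sentiment

print(score(["i", "am", "happy"]), score(["i", "am", "sad"]))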
Eigenvector
Principal components — the uncorrelated features of your data; how well-connected are those I'm connected to, by direction.
(Rotation matrix) = pcaTr.components_
R = [[cos(angle), -sin(angle)], [sin(angle), cos(angle)]]
The principal components contained in the rotation matrix are sorted in decreasing order of explained variance. This usually means that the first components retain most of the power of the data to explain the patterns that generalize the data.
part-of-speech tagging (pos)
provides information on - the word class, e.g. noun, adjective, auxiliary verb - inflections, e.g. plural noun, verb in the past participle form, superlative form of an adjective - other aspects, e.g. type of conjunction, comparative after-determiner (more, less, ...); existential there, ... Purposes : 1) Identifying named entities 2) Coreference Resolution (Identify pronouns) 3) Speech Recognition
Ratio of Probabilities
ratio(w) = P(w | pos)/P(w | neg) POSITIVE = P(w | pos)/P(w | neg) => ∞ NEUTRAL = P(w | pos)/P(w | neg) = 1 NEGATIVE = P(w | pos)/P(w | neg) => 0
Naïve Bayes Inference condition Rule for Binary Classification
sentence m= "I am happy today; I am Learning" Likelihood = Πm P(wm | pos)/P(wm | neg) Naïve Bayes Inference= (Prior Ratio) (Likelihood ) NBI = (P(pos)/P(neg) )*(Πm P(wm | pos)/P(wm | neg)) in m words of the Sentence Likelihood = Product of Word Ratios > 1 = Positive Likelihood = Product of Word Ratios < 1 = Negative
Probabilities
the likelihood that something will happen
Markov assumption
the probability of a word depends only on the previous N words
Conditional Probability
the probability that one event happens given that another event is already known to have happened. All the other givens are ignored P(X|Y)
Word as Vectors Use Cases
use cases for word embeddings include machine translation systems, information extraction, question answering Semantic Analogies Semantic Analysis Classification of Customer Feedback
Missing N-Grams in Training - Interpolation
Use a weighted combination of probabilities, e.g. for a trigram:
~P(wn | wn-2 wn-1) = λ1 * P(wn | wn-2 wn-1) + λ2 * P(wn | wn-1) + λ3 * P(wn)
where Σj λj = 1 (the λ are percentage weights)
Cost
Usually graphed over iterations:
J = -1/m Σi [ y^(i) log(h(x^(i), θ)) + (1 - y^(i)) log(1 - h(x^(i), θ)) ]
y = 1 and h(x, θ) = 0 -> infinity
y = 0 and h(x, θ) = 1 -> infinity
Text Corpus -> N-Gram Model
When breaking a sentence into n-grams, the first word has no predecessors, so add N-1 <s> start tags:
bigram: <s> w1 w2 ...
trigram: <s> <s> w1 w2 ...
n-gram: <s> <s> ... <s> w1 w2 ...
but only one end tag <e> per sentence.
Frobenius Norm
where A = [[2, 2], [2, 2]]
||A||_F = sqrt( 2^2 + 2^2 + 2^2 + 2^2 )
A_squared = np.square(A)
A_Frobenius = np.sqrt(np.sum(A_squared))
It is easier to work with the LOSS as the square of the Frobenius norm, which gets rid of the square root; since the goal is to minimize, you can use calculus on || X • R - Y ||_F^2 to get the gradient formula.
Feature Extraction: Frequencies of Words
Using freq(word, sentiment class), the feature vector is Xm = [ 1 (bias), Σw freqs(w, 1) pos., Σw freqs(w, 0) neg. ]
vocab:  I  am  happy  because  learning  nlp  sad  not
pos:    3   3    2       1        1       1    0    0
neg:    3   3    0       0        1       1    2    1
For the tweet "I am sad, I am not learning NLP":
Σ freqs(w, 1) pos. = I:3 + am:3 + sad:0 + not:0 + learning:1 + NLP:1 = 8
Σ freqs(w, 0) neg. = I:3 + am:3 + sad:2 + not:1 + learning:1 + NLP:1 = 11
X has shape (m, 3), one row per sample:
X1 = [1, 3, 5]
X2 = [1, 5, 4]
...
Xm = [1, 8, 11]
Word2Vec
word2vec is an algorithm and tool to learn word embeddings by trying to predict the context of words in a document. The resulting word vectors have some interesting properties, for example vector('queen') ~= vector('king') - vector('man') + vector('woman').
Two different objectives can be used to learn these embeddings:
A) The skip-gram objective tries to predict the context from a word: the model learns to predict the words surrounding a given input word.
B) The CBOW (continuous bag of words) objective tries to predict a word from its context.
word2vec uses shallow neural networks.
out of vocabulary words
Words the model hasn't seen during training. Smoothing may be an option, or tag them <UNK>.
Closed vocabulary = fixed set of words. Open vocabulary = can encounter words outside the vocabulary.
Building the vocabulary from a corpus:
A) replace every word that has less than a minimum frequency with <UNK>; the vocabulary then becomes the unique remaining words that are not <UNK>
B) predetermine the maximum vocabulary size and append the most frequent words until that size is met, treating everything else as <UNK>
Use <UNK> sparingly so it doesn't get a high probability in your sequences: if there are many <UNK> replacements in your train and test sets, you may get a very low perplexity even though the model itself is not very useful.
Document Vectors
You can represent a document as a vector by adding up the word vectors of the words inside the document.
document_embedding = np.zeros(embedding_dim)
for word in words_in_document:
    document_embedding += word_embedding.get(word, 0)
Accuracy
Σi (pred^(i) == y^(i))/m
VANISHING GRADIENT PROBLEM
∏_{t ≥ i > k} ∂h<i>/∂h<i-1> => 0
The vanishing gradient problem arises in very deep neural networks, typically recurrent neural networks, that use activation functions whose gradients tend to be small (in the range 0 to 1). Because these small gradients are multiplied during backpropagation, they tend to "vanish" through the layers, preventing the network from learning long-range dependencies.
Common ways to counter this problem:
- use activation functions like ReLU that do not suffer from small gradients, along with initializing weights to the identity matrix; this essentially copies the previous hidden state plus information from the current inputs and replaces any negative values with zero
- use batch normalization and proper initialization of weights through normalization so the gradients keep healthy norms
- use architectures like LSTMs that explicitly combat vanishing gradients
- use skip connections: as layers get deeper, it may not even be possible for the benefits of early identity-matrix weights to reach the deeper layers
The opposite of this problem is called the exploding gradient problem.
Sufficient condition for vanishing gradients:
1) the derivative of the activation function is bounded by some constant c
2) the absolute value of the largest eigenvalue of the weight matrix Whh is lower than 1/c