Natural Language Processing


sentiment analysis: Positive Frequency Dictionary

"I am happy because I am learning NLP" "I am Happy" vocabulary: I am happy because learning nlp sad not 3 3 2 1 1 1 0 0

sentiment analysis: Negative Frequency Dictionary

"I am sad, I am not learning NLP" "I am Sad" vocabulary: I am happy because learning nlp sad not 3 3 0 0 1 1 2 1

Generating Candidates

# splits with a list comprehension
word = 'dearz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
# [('', 'dearz'), ('d', 'earz'), ('de', 'arz'), ('dea', 'rz'), ('dear', 'z'), ('dearz', '')]

# deletes with a list comprehension
deletes = [L + R[1:] for L, R in splits if R]
# ['earz', 'darz', 'derz', 'deaz', 'dear']

Hash Table

A data structure where a value calculated from the data item (the hash) marks the position in the table where the item is stored, enabling it to be accessed directly rather than by a sequential search. Hash Function(Vector) -> Hash Value

Euclidean distance

A method of distance measurement using the straight-line distance between two points.
euc. dist. = sqrt( (X2-X1)**2 + (Y2-Y1)**2 + ... + (D2-D1)**2 )
euc. dist. = np.linalg.norm(v2 - v1)
The magnitude of larger vectors biases results toward vectors of similar magnitude (unlike cosine similarity).

N-grams

A representation of word sequences. The length of a sequence varies from 1 to N: a one-word sequence is a unigram, a two-word sequence is a bigram, a three-word sequence is a trigram, and so on. Word order matters.
Corpus: "I am happy because I am learning"   m = 7   w1 = I, w2 = am, ..., wm = learning
Notation: w1^m = w1 w2 ... wm; a sub-sequence w_start^stop = w_start ... w_stop
unigrams: { I, am, happy, because, learning }   P(I) = 2/7, P(happy) = 1/7
bigrams: { I am, am happy, happy because, because I, am learning }
P(am | I) = C(I am)/C(I) = 2/2    P(happy | I) = C(I happy)/C(I) = 0/2    P(learning | am) = C(am learning)/C(am) = 1/2
trigrams: { I am happy, am happy because, ... }
P(w3 | w1^2) = C(w1^2 w3)/C(w1^2), or in general P(wN | w1^(N-1)) = C(w1^(N-1) wN)/C(w1^(N-1))
REQUIRED: conditional probabilities given the previous n-1 words:
P(w1, w2, ..., wn) = P(w1) x P(w2 | w1) x ... x P(wn | wn-1 ... w1)
Basically predicts the most probable next word.
CONS: uses a lot of memory. Large n-grams capture dependencies between distant words but need a lot of space and RAM, hence we resort to different alternatives.
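
A minimal sketch of estimating these bigram probabilities by counting, using the corpus above; the variable names and the use of collections.Counter are illustrative, not taken from any particular course notebook.

from collections import Counter

corpus = "I am happy because I am learning".split()

# unigram and bigram counts over the corpus
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    # P(word | prev_word) = C(prev_word word) / C(prev_word)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("I", "am"))         # 2/2 = 1.0
print(bigram_prob("am", "learning"))  # 1/2 = 0.5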

Feature Extraction: Sparse Representation

A representation that contains mostly zeros, e.g. a vector of 1s and 0s, each entry indicating whether a vocabulary word appears. CONS: features are as large as the vocabulary, which results in longer training time and longer prediction time.

Approximate Nearest Neighbor

A supervised learning technique that classifies a new observation by finding similarities ("nearness") between it and the existing data, trading exactness for speed. A) Start by creating multiple sets of random planes and applying a locality-sensitive hash function, e.g. #dimensions = 2, #planes = 3, P = np.random.normal(size=(3, 2)). B) Take a vector V. C) Check the sign of the dot product P • V.T to see on which side of each plane V lies.

CBOW Structure

An ANN with one hidden layer.
X = context vectors of size V x m (V = vocabulary size, m = number of examples): X = [x^(1) ... x^(m)]
Input to hidden (ReLU activation): z1 = W1 X + b1, shapes (N x V)(V x m) + (N x m) = N x m
h = hidden layer vector, of size N = word embedding size: h = ReLU(z1), shape N x m
Hidden to output: z2 = W2 h + b2, shapes (V x N)(N x m) + (V x m) = V x m = logits
Y = center word vectors of size V x m; output activation is softmax: Y' = softmax(z2), shape V x m
(Not logistic regression, which is solely for predicting between two classes.)
Extracting embeddings:
W1 has shape N x V; each column of W1 is the embedding vector of a vocabulary word: W1 = [w^(1) w^(2) ... w^(V)]
or W2 has shape V x N; each row of W2 is the embedding vector of the corresponding word
or average the two, where each column is the embedding of a vocabulary word: W3 = (W1 + W2.T)/2, shape N x V
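
A minimal NumPy sketch of this forward pass with the shapes spelled out; V, N, m and the random initialization are placeholder values, so this only illustrates the shape bookkeeping, not a trained model.

import numpy as np

V, N, m = 8, 4, 3                       # vocabulary size, embedding size, batch size (illustrative)
W1, b1 = np.random.rand(N, V), np.random.rand(N, 1)
W2, b2 = np.random.rand(V, N), np.random.rand(V, 1)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))   # shift for numerical stability
    return e / np.sum(e, axis=0, keepdims=True)

X = np.random.rand(V, m)                # averaged one-hot context vectors, V x m
z1 = W1 @ X + b1                        # N x m
h = relu(z1)                            # N x m hidden layer
z2 = W2 @ h + b2                        # V x m logits
Y_hat = softmax(z2)                     # V x m predicted center-word distribution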

Missing N-Grams in Training - Smoothing

Add-one smoothing (Laplacian smoothing) - add 1 to the numerator and |V| to the denominator:
P(wn | w(n-1)) = ( C(w(n-1), wn) + 1 ) / ( C(w(n-1)) + |V| )
Add-k smoothing - add k to the numerator and k*|V| to the denominator:
P(wn | w(n-1)) = ( C(w(n-1), wn) + k ) / ( C(w(n-1)) + k*|V| )
General form: P = (C + k) / ( Σj Cj + k*N ), where N is the number of possible outcomes.
This smooths the probabilities of previously unseen n-grams. The downside is that n-grams not seen in the training data can get too high a probability.
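
A small sketch of add-k smoothing for bigrams, assuming the counts are kept in plain dictionaries (bigram_counts keyed by (prev_word, word) tuples, unigram_counts keyed by word); the function name is made up for illustration.

def smoothed_bigram_prob(prev_word, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    # add-k smoothing: (C(prev_word, word) + k) / (C(prev_word) + k * |V|)
    numerator = bigram_counts.get((prev_word, word), 0) + k
    denominator = unigram_counts.get(prev_word, 0) + k * vocab_size
    return numerator / denominator

# with k = 1 this reduces to add-one (Laplacian) smoothing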

Logistic Regression

An algebraic function used to relate the independent variables to the expected dependent variable.
INPUT: X^(i) = column vector, e.g. [[1], [8], [11]]
PARAMETERS: θ; prediction = sigmoid(θ.T X)
LABEL: Y, e.g. positive sentiment = 1, negative sentiment = 0
PREDICTED LABEL: Y'
COST FUNCTION TO MINIMIZE: L(Y, Y')
Gradient descent update: θ := θ - α * gradient
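
A minimal sketch of the prediction and one gradient-descent update, assuming X holds one example per row with the bias column already included; variable names are illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(X, y, theta, alpha):
    # X: (m, 3) features with bias term, y: (m, 1) labels, theta: (3, 1) parameters
    m = X.shape[0]
    h = sigmoid(X @ theta)              # predicted labels Y'
    gradient = (X.T @ (h - y)) / m      # dJ/dtheta
    return theta - alpha * gradient     # theta := theta - alpha * gradient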

Applications of Naïve Bayes

Author identification: P(author1 | book) / P(author2 | book)
Spam filtering: P(spam | email) / P(non-spam | email)
Information retrieval: P(document k | query) ~ Πi P(query_i | document k); retrieve document k if P(document k | query) > threshold
Word disambiguation: P(context1 | ambiguous word) / P(context2 | ambiguous word) = P("RIVER" | "BANK") / P("MONEY" | "BANK")

Deep Learning, Contextual Embedding

BERT (Google), ELMo (Allen Institute for AI), GPT-2 (OpenAI), T-NLG (Microsoft Project Turing). Based on pretrained models: you need to train a deep neural network to learn the word embeddings. Massive deep learning language models (LMs).

Viterbi algorithm Step Forward Pass

The C and D matrices are then populated column by column.
a(k,i) = entry of the transition matrix; b(i, wordj) = emission probability of word j given tag i
C(i,j) = max over all previous tags k of C(k, j-1) * a(k,i) * b(i, wordj)
D(i,j) = argmax over all previous tags k of C(k, j-1) * a(k,i) * b(i, wordj)
Because probabilities can be really small, it is best to use logs instead:
log C(i,j) = max over k of [ log C(k, j-1) + log a(k,i) + log b(i, wordj) ]
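
A sketch of this forward pass in log space, assuming column 0 of C (holding log probabilities) and D has already been initialized and that A and B contain smoothed, non-zero probabilities; the function signature is illustrative.

import numpy as np

def viterbi_forward(A, B, word_ids, C, D):
    # A: transition probabilities (tags x tags), B: emission probabilities (tags x vocab)
    # C, D: (tags x words) matrices with the first column already filled in
    num_tags = A.shape[0]
    for j in range(1, len(word_ids)):            # populate column by column
        for i in range(num_tags):                # each POS tag i
            # log C[k, j-1] + log a[k, i] + log b[i, word_j] for every previous tag k
            scores = C[:, j - 1] + np.log(A[:, i]) + np.log(B[i, word_ids[j]])
            C[i, j] = np.max(scores)             # best log probability
            D[i, j] = np.argmax(scores)          # index k of the best previous tag
    return C, D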

Singular Value Decomposition (SVD)

Closely related to principal components analysis, it reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).

Transform Vector

Converts a vector from one vector space into the corresponding word vector in a different space.

CBOW - continuous bag of words

Corpus -> Transformation -> CBOW
C = context half-size = the number of words before and after the center word
Window size = C + center word + C
Sentence: I am happy because I am learning
Training transformation (C = 2):
Context words (input)      Center word (output)
I am because I             happy
am happy I am              because

Probability Matrix

Displays the conditional probabilities of n-grams. Take the count matrix and divide each value by the sum of its row.
Example bigram (N=2) with corpus: <s> I am strong I am brave <e>    row = w(i-1), col = w(i)
          <s>  <e>   I    am  strong  brave | row count
<s>        0    0    1    0    0      0     |  1
<e>        0    0    0    0    0      0     |  0
I          0    0    0    1    0      0     |  2
am         0    0    0    0    .5     .5    |  2
strong     0    0    1    0    0      0     |  1
brave      0    1    0    0    0      0     |  1
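
A small sketch of turning a count matrix into this probability matrix by dividing each row by its sum; the epsilon is an illustrative add-epsilon smoothing that also keeps an all-zero row such as <e> from dividing by zero.

import numpy as np

def probability_matrix(count_matrix, epsilon=1e-9):
    # count_matrix: rows = w(i-1), columns = w(i)
    counts = count_matrix.astype(float) + epsilon
    row_sums = counts.sum(axis=1, keepdims=True)   # C(w(i-1)) per row
    return counts / row_sums                       # P(w(i) | w(i-1))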

Viterbi algorithm Step Backward Pass

Extract the path from C and D: 1) go to the last word's column of C and find the row index of the POS tag with the largest probability; 2) use that index in D, which contains the index of the best POS tag in the previous word's column; repeat back to the first word.

Feature Extraction Preprocessing: Stop Words

Frequently used words that are part of a sentence but add no value, such as conjunctions; punctuation is usually removed along with them.

Locality Sensitive Hashing

The hash value is calculated to reflect how close vectors are to each other in vector space. Hash Function(Vector) -> Hash Value
1) divide the data with hyperplanes; the sign of the dot product tells you on which side of a plane a vector lands: side = np.sign(np.dot(P, v.T))
Multiple planes P1, P2, P3, given vector V, example:
P1 • V.T = 3,  sign = 1  -> hash 1 = h1 = 1
P2 • V.T = 5,  sign = 1  -> hash 2 = h2 = 1
P3 • V.T = -1, sign = -1 -> hash 3 = h3 = 0
Hash = 2^0 * h1 + 2^1 * h2 + 2^2 * h3 = 1*1 + 2*1 + 4*0 = 3
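
A sketch of combining the per-plane signs into a single bucket number; the helper name is made up, and a dot product of exactly 0 is counted as the positive side here, which is one common convention.

import numpy as np

def hash_value_of_vector(v, planes):
    # planes: (n_planes, n_dims), v: (1, n_dims)
    dots = np.dot(planes, v.T)                               # one dot product per plane
    h = (np.sign(dots) >= 0).astype(int).flatten()           # side of each plane as 0/1
    return int(sum(2**i * h_i for i, h_i in enumerate(h)))   # hash = sum 2^i * h_i

planes = np.random.normal(size=(3, 2))   # 3 planes, 2 dimensions
v = np.array([[2.0, 1.5]])
print(hash_value_of_vector(v, planes))   # an integer bucket in [0, 7]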

Smoothing

In general, you decrease every non-zero entry by a little bit so that the zero probabilities become non-zero. This assumes there are more non-zero entries than zeros, which is usually the case. P = (C + ϵ) / ( Σj Cj + N*ϵ )

Emission Matrix

In hidden Markov models you make use of emission probabilities, which give you the probability of going from one state (POS tag) to a specific word. Note that the sum of each row of the B matrix has to be 1.
B = {        going   to    eat   ...
     NN      .5      .1    .1
     VB      .3      .1    .5
     O       .3      .5    .68  }
POS -> WORD probability
C(t(i), w(i)) = count of the number of times word w(i) is tagged as t(i)
Row sum for t(i) = Σj C(t(i), w(j))
P(w(i) | t(i)) = C(t(i), w(i)) / Σj C(t(i), w(j))
To avoid dividing by zero, apply smoothing over the N columns of a row:
P(w(i) | t(i)) = ( C(t(i), w(i)) + ϵ ) / ( Σj C(t(i), w(j)) + N*ϵ )

vocabulary

List of unique words in a document

Cost

Mean loss over a batch, where m is the batch size (number of columns): J = -1/m Σ(i=1..m) Σj yj^(i) ln(y'j^(i))

Levenshtein distance

Measuring the edit distance using the three edits insert, delete, and replace, with costs 1, 1, and 2 respectively, is known as Levenshtein distance. insert - adds a letter, cost = 1; delete - removes a letter, cost = 1; replace - replaces a letter with an entirely new letter, cost = 2.

Error Analysis

Misclassified examples are noted and categorized by type: 1) semantic meaning lost in preprocessing (removing stop words, punctuation, ...) - resolution: keep the original text for verification; 2) word order relevance - resolution: keep the original text for verification; 3) trickiness of word phrasing and language quirks, adversarial attacks: humor, irony, sarcasm, e.g. offensive terms used jokingly amongst friends.

new text sources

Now that you have a vocabulary array, you will use it when processing new text sources. A new text will have words that do not appear in the current vocabulary.

Transform Vector (Matrix)

One method is to construct a 2-dimensional matrix R that is applied to a vector by matrix multiplication: X • R = Y, i.e. np.dot(X, R). To find R, the goal is to minimize the distance between X • R and Y.
a) collect a subset of words, where each word vector is a row of a matrix X
b) collect a target matrix Y of translated word vectors whose rows match the meanings of the rows of X
Initialize R, then LOOP:
1) Loss = || X • R - Y ||_F^2 = how far apart the transformed vectors are from the targets
2) gradient g = d(Loss)/dR = (2/m) X.T • (X • R - Y)
3) R = R - α g
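
A minimal sketch of that loop, assuming X and Y store one word vector per row and that the loss is averaged over the m rows; the step count and learning rate are placeholders.

import numpy as np

def align_embeddings(X, Y, steps=100, alpha=0.003):
    # X: (m, d) source-language vectors, Y: (m, d) matching target-language vectors
    m, d = X.shape
    R = np.random.rand(d, d)                  # initialize R
    for _ in range(steps):
        diff = X @ R - Y
        loss = np.sum(diff ** 2) / m          # squared Frobenius norm, averaged over m
        gradient = (2 / m) * (X.T @ diff)     # d(loss)/dR
        R -= alpha * gradient                 # R = R - alpha * g
    return R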

Conditional Probability Chain Rule

P (A, B, C, D)= P(A)P(B|A)P(C | A,B)P(D | A,B,C)

Bayes Rule

P(Positive | word) P(word) = P(Positive ∩ word)
P(word | Positive) P(Positive) = P(Positive ∩ word)
=> P(word | Positive) P(Positive) = P(Positive | word) P(word)
In general: P(X | Y) = P(Y | X) P(X) / P(Y)

Prior Ratio

P(pos)/P(neg); in a balanced data set, P(pos)/P(neg) = 1

Vector Manipulation

RELATIONSHIP = the difference of 2 vectors tells you which dimensions relate one vector to another, e.g. RELATIONSHIP = NY - USA (city minus country). To see whether other vectors have a similar relationship, add that difference to another vector, e.g. RUSSIA + RELATIONSHIP = MOSCOW-LIKE, then compare candidate vectors to MOSCOW-LIKE using cosine similarity and select the closest.

Feature Extraction Preprocessing: Stemming

Reducing words to their base stem, removing tense, e.g. tuning -> tun, tune -> tun, tuned -> tun; this reduces vocabulary size.

Relu function

ReLU(x) = max(0, x) -> [0, ∞)

Language Model Evaluation - Split Data

SMALL DATA SETS: train 80% / validation 10% (hyperparameter tuning) / test 10% (accuracy evaluation)
LARGE DATA SETS: train 98% / validation 1% (hyperparameter tuning) / test 1% (accuracy evaluation)

Auto Correction

STEPS
1) Identify the misspelled word, e.g. "deah": check whether the word is in the dictionary
2) Find strings n edit distance away, e.g. for n = 1 edit operation: deah_, _eah, d_ah, de_h, dea_
3) Filter candidates: keep real words by checking whether the candidates are in the dictionary, candidates = set(vocab).intersection(set(edits))
4) Calculate word probabilities
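
A toy sketch of steps 2 and 3 for edit distance n = 1, with a made-up mini vocabulary; only deletes, inserts, and replaces are generated here.

def edit_one_letter(word, letters="abcdefghijklmnopqrstuvwxyz"):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    inserts = [L + c + R for L, R in splits for c in letters]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    return set(deletes + inserts + replaces)

vocab = {"dear", "dead", "dean", "deer"}     # toy dictionary
edits = edit_one_letter("deah")
candidates = vocab.intersection(edits)       # keep only real words
print(candidates)                            # e.g. {'dead', 'dean', 'dear'}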

Generative Language Model

Text generation from scratch or from text hints: 1) randomly choose a sentence-start bigram { <s>, ? } from the corpus of sentences, selecting ? with the highest probability; 2) choose the next bigram { ?, next? } starting with the previous word and having the highest probability; 3) repeat 2 until next? = <e> with the highest probability.

Approximation of Sequence Probabilities

The longer the sequence, the less likely it is that those exact words occur adjacent to each other in that exact order, so condition on only the last N-1 words.
Corpus: the teacher drinks tea
bigram (N=2): P(tea | the teacher drinks) ~ P(tea | drinks), because we keep only the last N-1 = 1 words
In general: P(wn | w1^(n-1)) ~ P(wn | w(n-N+1)^(n-1))
trigram (N=3): P(w9 | w1, w2, ..., w8) ~ P(w9 | w(9-3+1)^(9-1)) = P(w9 | w7 w8)

One-hot encoding Vectors = Words as Vectors

The process by which categorical variables are converted into binary form (0 or 1) for machine reading; one of the most common methods for handling categorical features in text data. The binary representation doesn't imply any relationship between words; entries are either true or false. One column per actual word; the number of rows is the size of the vocabulary, with every row set to zero except the actual word's row, which is set to 1. CONS: a) huge and sparse b) word meanings are lost.

Log-Likelihood Function

The sum of the log-likelihoods, viewed as a function of the parameters to be estimated. It simplifies the Naive Bayes inference
NBI = (P(pos)/P(neg)) * Πm ratio(wm) = (P(pos)/P(neg)) * Πm P(wm | pos)/P(wm | neg)
as follows: log(NBI) = log(prior ratio) + log(likelihood), where λ(w) = log(ratio(w)) = log(P(w | pos)/P(w | neg)), so
log(NBI) = log(P(pos)/P(neg)) + Σm λ(wm)
So generate a column of λ values, one for each word.

Count Matrix

This captures the number of occurrences of the relevant n-grams: for P(wn | w(n-N+1)^(n-1)) = C(w(n-N+1)^(n-1), wn) / C(w(n-N+1)^(n-1)), it determines the numerator C(w(n-N+1)^(n-1), wn).
Rows = unique (N-1)-grams, columns = unique corpus words.
Example bigram (N=2) with corpus: <s> I am strong I am brave <e>    row = w(i-1), col = w(i)
          <s>  <e>   I    am  strong  brave
<s>        0    0    1    0    0      0
<e>        0    0    0    0    0      0
I          0    0    0    2    0      0
am         0    0    0    0    1      1
strong     0    0    1    0    0      0
brave      0    1    0    0    0      0

Extrinsic evaluation of the quality of the vectors

To test your word embeddings on external tasks, e.g. named entity recognition or part-of-speech tagging. Evaluation occurs on a test set using accuracy or the F1 score and evaluates the embedding and the classification task together. Evaluates the actual usefulness of embeddings; more time consuming than intrinsic evaluation; harder to debug, because the performance metric provides no information about which specific part of the end-to-end process is responsible.

Dot Product of Vectors

U • V = u1v1 + u2v2 = |U| |V| cos θ; geometrically, it is |V| times the length of the projection of U onto V.

Missing N-Grams in Training

What is the probability of an n-gram constructed from words that appear in the corpus but is not itself present in the corpus as an n-gram?

Laplacian smoothing

Used when P(w | pos) = 0 or P(w | neg) = 0. With V = count of unique words in the vocabulary: P(word | class) = ( freqs(word, class) + 1 ) / ( Σw freqs(w, class) + V ), which keeps Σw P(w | class) = 1.

Co-occurrence Matrix

Word by word: the number of times 2 words appear close together within a distance k.
Sentence 1: I like simple data    Sentence 2: I prefer simple raw data
With k = 2, the co-occurrence counts for the word "data" are: simple: 2, raw: 1, like: 1, I: 0
Word by document: the number of times a word appears within a category.
In 3 document categories (Entertainment, ML, Economy), "data" appears Entertainment: 500, ML: 9320, Economy: 6620; "film" appears Entertainment: 7000, ML: 1000, Economy: 4000

Assumptions of Naïve Bayes

Words in a sentence are assumed independent. Bad: words can be used together to describe or reference another word in a sentence and are not necessarily standalone, e.g. "sunny and hot" or "cold and snowy". Also relies on the data distribution of the training set: good training sets have equal frequencies of the classes, but bias is present in the sentiments of real training tweets, for example.

Softmax function

Maps a vector to values in [0, 1] that form a vector of probabilities: t = np.exp(z); softmax(z) = t / np.sum(t)

Principal Component Analysis (PCA)

A dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information of the large set. This allows you to plot vectors more easily. You obtain uncorrelated features and then project the data onto a lower dimension while losing as little information as possible.
1) Mean-normalize the data: xi = (xi - mean(xi)) / std(xi)
2) Get the covariance matrix
3) Perform singular value decomposition (SVD)
RESULT matrices:
a) eigenvectors as columns of U = [U1, U2, ...]
b) eigenvalues on the diagonal of S: S00, S11, S22, ...
To project vectors: X' = np.dot(X, U[:, 0:desired dimensions])
% of variance retained in the new vector space = Σ(i up to desired dims) Sii / Σ(j over original dims) Sjj
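
A compact sketch of those three steps with NumPy; it assumes X has one sample per row and, for simplicity, only subtracts the mean rather than dividing by the standard deviation.

import numpy as np

def pca(X, n_components=2):
    X_demeaned = X - np.mean(X, axis=0)               # 1) mean-normalize
    cov = np.cov(X_demeaned, rowvar=False)            # 2) covariance matrix
    U, S, Vt = np.linalg.svd(cov)                     # 3) SVD: eigenvectors in U, eigenvalues in S
    X_reduced = X_demeaned @ U[:, :n_components]      # project onto the first components
    retained = S[:n_components].sum() / S.sum()       # fraction of variance retained
    return X_reduced, retained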

Viterbi algorithm

A graph algorithm. Given the transition matrix of POS -> POS starting at the start token 𝝅, you multiply the probabilities from t(i-1) to t(i) along the tag positions in the sentence; the emission matrix of POS -> WORD then gives the probability of each word given its tag. This results in several paths t(i-1) -> t(i); choose the path with MAX probability for the next word in the sentence. The total probability is the product of each probability along the path through the entire sentence.
Example
Transition matrix (POS -> POS, starting at token 𝝅): P(𝝅,O) = .3, P(O,NN) = .5, P(O,VB) = .5, P(VB,O) = .2
Emission matrix (POS -> WORD): P(O,I) = .5, P(VB,love) = .5, P(NN,love) = .1, P(O,to) = .4, P(VB,learn) = .2
Sentence: <s> I love to learn
P(I)     = P(𝝅,O) * P(O,I) = .3 * .5 = .15
P(love)  = max( P(O,VB) * P(VB,love), P(O,NN) * P(NN,love) ) = max(.5*.5, .5*.1) = .25
P(to)    = P(VB,O) * P(O,to) = .2 * .4 = .08
P(learn) = P(O,VB) * P(VB,learn) = .5 * .2 = .10
Total probability of sequence 𝝅 -> O -> VB -> O -> VB = .15 * .25 * .08 * .1 = .0003
STEPS: 1) Initialization step 2) Forward pass 3) Backward pass
N = # of POS tags (rows), K = # of unique words (columns)
Matrix C = POS tags x word columns = optimal probabilities
Matrix D = POS tags x word columns = indices of visited states

Hidden Markov Model

a model in which there is a graph of states with probabilistic transitions between states, in which the state that the system is in cannot be observed directly. States are hidden or not directly observable

transition matrix

A set of probabilities that determine what happens from one time step to the next in a state space, i.e. the probabilities of going from one state to another.
Q = states = { q1, q2, q3, ..., qN }
A = transition matrix of size (N+1) x N, probabilities = { a(1,1), ..., a(N+1,N) }. Note that the sum of each row of A has to be 1.
Count the occurrences of POS tag pairs: count the number of times tags t(i-1), t(i) show up next to each other and divide by the total number of times t(i-1) shows up.
C(t(i-1), t(i)) is the count of times tag (i-1) shows up before tag i. Take each sentence and add a start token 𝝅; the count is across the entire corpus of sentences.
Row sum = Σj C(t(i-1), t(j)) for each t(i-1)
P(t(i) | t(i-1)) = C(t(i-1), t(i)) / Σj C(t(i-1), t(j))
To avoid dividing by zero, apply smoothing over the N columns of a row:
P(t(i) | t(i-1)) = ( C(t(i-1), t(i)) + ϵ ) / ( Σj C(t(i-1), t(j)) + N*ϵ )
POS -> POS probability
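
A sketch of building the smoothed transition matrix from pair counts; the dictionary layout and the alpha value are illustrative assumptions, not the only way to organize the counts.

import numpy as np

def transition_matrix(transition_counts, tags, alpha=0.001):
    # transition_counts: dict {(prev_tag, tag): count}; tags: ordered list of POS tags
    N = len(tags)
    A = np.zeros((N, N))
    for i, prev_tag in enumerate(tags):
        row_sum = sum(transition_counts.get((prev_tag, t), 0) for t in tags)
        for j, tag in enumerate(tags):
            count = transition_counts.get((prev_tag, tag), 0)
            A[i, j] = (count + alpha) / (row_sum + alpha * N)   # smoothed P(tag | prev_tag)
    return A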

Probabilistic Model

A statistical model applicable when a variable is not known exactly but can be specified by means of a probability distribution; useful for auto-correction as well as web search suggestions.

Markov Chain

a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

CBOW - Transformation of Context Words to Vectors

a) Corpus -> vocabulary -> one-hot vector matrix
b) Transform the context words into a single vector: the average of their individual one-hot vectors, e.g. context "I am because I" (4 words) around the center word "happy":
            I   am  because  I      average
am          0   1   0        0   -> 1/4 = .25
because     0   0   1        0   -> 1/4 = .25
happy       0   0   0        0   -> 0/4 = 0
I           1   0   0        1   -> 2/4 = .5
learning    0   0   0        0   -> 0/4 = 0

CBOW - Cleaning and Tokenization

a) lowercase
b) punctuation represented as a single character, empty string, etc.
c) numbers might be dropped if insignificant, or replaced with a tag such as <NUMBER>
d) special characters may need to be replaced by the empty string
e) special words like emojis or hashtags are kept as a single word
nltk.download('punkt') provides a predefined way of handling punctuation and tokenization for English

sentiment analysis

An automated process of analyzing and categorizing text (e.g. social media) to determine the amount of positive, negative, and neutral comments a brand receives. Looking at the vocabulary, you can create a positive frequency and a negative frequency associated with every word in the vocabulary.

intrinsic evaluation of the quality of the vectors

Assess how well the word embeddings capture the semantic or syntactic relationships between words.
Semantics refers to the meaning of words, e.g. missing-word semantic analogies: "France" is to "Paris" as "Italy" is to <?>
Syntax refers to grammar, such as plurals, tenses, and comparatives, e.g. syntactic analogies: "seen" is to "saw" as "been" is to <?>
Testing relationships through analogies is not perfect when more than one analogy applies.
Other intrinsic tests: clustering algorithms over similar word embeddings, and visualization.

FastText

Based on the skip-gram model; takes into account the structure of words by representing each word as an n-gram of characters. Supports out-of-vocabulary words, since character n-gram vectors can be averaged together.

Viterbi algorithm Step Initialization Step

Create a matrix C and initialize its first column with the transition from the start token times the emission of the first word: C(i,1) = a(1,i) * b(i, word1). Create a matrix D and initialize its first column with 0: D(i,1) = 0, because no POS tags have been traversed yet.

defaultdict collection

dict subclass that calls a factory function to supply missing values they are a special kind of dictionaries that return the "zero" value of a type if you try to access a key that does not exist. Since you want the frequencies of words, you should define the defaultdict with a type of int. from collections import defaultdict freq = defaultdict(int)

backtrace

Displays the sequence of steps that were taken; in the tabular (dynamic programming) approach it tells you which path through the table you used to get to a cell.

Optimizing Minimum Levenshtein Distance with Dynamic Programming (Tabular)

Example: source = play -> target = stay
D[i,j] = minimum edit distance for source[:i] -> target[:j], e.g. D[2,3] = "pl" -> "sta" and D[4,4] = "play" -> "stay"
Initialization:
D[0,0] = 0 (no change)
D[i,0] = D[i-1,0] + delete cost, e.g. D[1,0] = 1 (p -> ""), D[2,0] = 2 (pl -> ""), D[3,0] = 3 (pla -> "")
D[0,j] = D[0,j-1] + insert cost, e.g. D[0,1] = 1 ("" -> s), D[0,2] = 2 ("" -> st), D[0,3] = 3 ("" -> sta)
Recurrence:
D[i,j] = min( D[i-1,j] + delete cost, D[i,j-1] + insert cost, D[i-1,j-1] + replace cost (2, or 0 if the characters are the same) )
e.g. D[1,1] (p -> s) = min( D[0,1] + 1, D[1,0] + 1, D[0,0] + 2 ) = 2
Filled table (rows = source, columns = target):
       ""  S  T  A  Y
  ""    0  1  2  3  4
  P     1  2  3  4  5
  L     2  3  4  5  6
  A     3  4  5  4  5
  Y     4  5  6  5  4
Minimum edit distance = D[4,4] = 4
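
A short dynamic-programming sketch of this table fill; costs default to insert = 1, delete = 1, replace = 2 as above, and the function name is illustrative.

import numpy as np

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    n, m = len(source), len(target)
    D = np.zeros((n + 1, m + 1), dtype=int)
    D[:, 0] = np.arange(n + 1) * del_cost            # source prefix -> "" by deletes
    D[0, :] = np.arange(m + 1) * ins_cost            # "" -> target prefix by inserts
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i, j] = min(D[i - 1, j] + del_cost,    # delete
                          D[i, j - 1] + ins_cost,    # insert
                          D[i - 1, j - 1] + r)       # replace (or keep if same char)
    return D[n, m], D

dist, D = min_edit_distance("play", "stay")
print(dist)   # 4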

Eigenvalue

explained variance - The amount of information retained by each feature - Represents the total variance explained by each feature pcaTr.explained_variance_

Global Vectors (GloVe)

Factorizes the logarithm of the corpus's word co-occurrence matrix, which is similar to the count matrix you've used before. GloVe doesn't use neural networks at all.

Word by Word Vectors

Generate a co-occurrence matrix of each word against other words, which produces a vector: the number of times 2 words appear close together within a distance k.

Word by Doc

Generate a co-occurrence matrix of each word against related categories, which produces a vector: the number of times a word appears within a category.

RNN (Recurrent Neural Networks) Feedforward explained

h<t0> + x<t1> -> y'<t1>, h<t1>
h<t1> + x<t2> -> y'<t2>, h<t2>
...
h<T-1> + x<T> -> y'<T>
=========================================
Wh • [h<t-1>, x<t>] + bh = Whh • h<t-1> + Whx • x<t> + bh
where Wh in the first form denotes the horizontal concatenation of Whh and Whx from the second form:
# Option 1: concatenate - horizontal
w_h1 = np.concatenate((w_hh, w_hx), axis=1)
# Option 2: hstack
w_h2 = np.hstack((w_hh, w_hx))
and [h<t-1>, x<t>] is the vertical concatenation (vertical stack) of h<t-1> and x<t>:
# Option 1: concatenate - vertical
ax_1 = np.concatenate((h_t_prev, x_t), axis=0)
# Option 2: vstack
ax_2 = np.vstack((h_t_prev, x_t))
g is an activation function, so that h<t> = g( Wh • [h<t-1>, x<t>] + bh ) = g( Whh • h<t-1> + Whx • x<t> + bh )
Shape-wise: (4x1) = (4x14)•(14x1) + (4x1) = (4x4)•(4x1) + (4x10)•(10x1) + (4x1)
The current hidden state then yields the prediction: y'<t> = g( Wyh • h<t> + by )
ADVANTAGES: 1) capture dependencies within a short range 2) take up less RAM than other n-gram models
CONS: 1) struggle to capture long-term dependencies (accuracy degrades for distant tokens) 2) prone to vanishing or exploding gradients 3) no parallel computation, since hidden states are computed in sequence
Recall that the hidden state at step t is defined as h<t> = g( Whh • h<t-1> + Whx • x<t> + bh ), where g is an activation function (usually sigmoid). Using the chain rule you get the partial derivative:
𝛿h<t>/𝛿h<t-1> = Whh.T • diag( g'( Whh • h<t-1> + Whx • x<t> + bh ) )
∏(t ≥ i > k) 𝛿h<i>/𝛿h<i-1> = ∏(t ≥ i > k) Whh.T • diag( g'( Whh • h<i-1> + Whx • x<i> + bh ) )
# Sigmoid, interval [0, 1]
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Sigmoid gradient, interval [0, 0.25]
def sigmoid_gradient(x):
    return sigmoid(x) * (1 - sigmoid(x))
def prod(k):
    # product of the W_hh.T @ diag(g'(...)) factors from step t-1 down to step k
    p = np.eye(W_hh.shape[0])
    for i in range(t - 1, k - 2, -1):
        p = p @ (W_hh.T @ np.diag(sigmoid_gradient(W_hh @ h[:, i] + W_hx @ x[:, i] + b_h)))
    return p
where np.diag converts the gradient vector into a diagonal matrix

Vector Space Models

Identify whether a pair of sentences is similar by representing words as vectors that capture their relative meaning. Each pair of words in a co-occurrence matrix forms the axes of a 2-dimensional vector space, where their co-occurrence frequencies form vectors used for comparison. In a vectorized representation of your data, equal sequence length allows more efficient batch processing.

Word Embedding Vectors - Words as Vectors - Adding Meaning

Instead of using 0 or 1, you can use negative and positive numbers to encode negative and positive connotations, e.g. rage = -2.52, excited = 2.31. You can add another dimension to represent abstract vs. concrete meanings, e.g. snake (-5.3, 4.1), thought (0.03, -0.93).
Word embedding vectors: a) low-dimensional (less than V) b) embed meaning, e.g. analogies and semantic distance
Hyperparameter example: a) word embedding size
Learning embeddings is self-supervised: it aims to learn representations without requiring human-annotated labels and then uses those learned representations on downstream tasks. It is both: unsupervised, in the sense that the input data (the corpus) is unlabeled; and supervised, in the sense that the data itself provides the necessary context that would ordinarily make up the labels.

Missing N-Grams in Training - BackOff

A model generalization method that leverages information from lower-order n-grams when information about a higher-order n-gram is missing. If an N-gram is missing, use the (N-1)-gram for the conditional probability: besides the trigram ('John', 'drinks', 'chocolate') we also use the bigram ('drinks', 'chocolate') and the unigram ('chocolate'). If the trigram probability is not found but the bigram probability is, the trigram probability is estimated as the bigram probability times a constant ("stupid backoff").
Corpus: <s> Lyn drinks chocolate <e>  <s> John drinks tea <e>  <s> Lyn eats chocolate <e>
Trigram N=3: P(chocolate | John drinks) = P(wi | wi-2 wi-1) = 0, missing
Stupid backoff to the next lowest order:
~P(chocolate | John drinks) = c * P(chocolate | drinks), i.e. the bigram P(wi | wi-1), skipping the missing trigram

Language Model

A tool that calculates the probabilities of sentences; you can think of a sentence as a sequence of words.
Applications: speech recognition (fix words using grammar), spelling correction (swap in the most probable word), augmentative communication (word prediction from hand gestures).
Use the probability matrix to get a sentence probability, e.g. bigram sentence <s> I am brave <e>:
P(sentence) = P(I | <s>) P(am | I) P(brave | am) = (1)(1)(.5)
This involves multiplying many small probabilities, increasing numerical error, so store log(P(sentence)) instead:
log(P(I | <s>)) + log(P(am | I)) + log(P(brave | am))

Back propagation

An algorithm that calculates the partial derivatives (gradient) of the cost with respect to the weights and biases of the neural network:
∂J/∂W1 = 1/m ReLU( W2.T (Y' - Y) ) X.T
∂J/∂b1 = 1/m ReLU( W2.T (Y' - Y) ) 1.T
∂J/∂W2 = 1/m (Y' - Y) H.T
∂J/∂b2 = 1/m (Y' - Y) 1.T
Gradient descent adjusts the weights and biases using the gradient to minimize the cost:
W1 = W1 - α ∂J/∂W1    W2 = W2 - α ∂J/∂W2    b1 = b1 - α ∂J/∂b1    b2 = b2 - α ∂J/∂b2

Corpus

An entire set of sentences/documents. Creating a frequency table and summing it up tells you: V = total size of the corpus (total word count); C(word) = number of times a word appears in the corpus; P(word) = probability of a word in the corpus = C(word)/V.

Loss Function - Cross Entropy

Loss function: cross entropy loss J = -Σk yk ln(y'k); minimize J. WRONG prediction: y' for the correct class ≈ 0, so J → infinity. CORRECT prediction: y' for the correct class ≈ 1, so J ≈ 0.

cosine similarity

A measure of similarity between vectors based on the cosine of the angle between them; used to measure similarity between documents by representing them as vectors. It focuses on the angle between vectors and ignores their magnitudes.
norm (magnitude) = sqrt(sum(v**2)) = np.linalg.norm(v)
cos(β) = dot(v1, v2) / ( norm(v1) * norm(v2) )
β = 90°: orthogonal (maximally different) => cos(β) = 0
β = 0°: same direction => cos(β) = 1

Language Model Evaluation - Perplexity Metrics

A measure of the complexity of a sample of text; it can tell whether a set of sentences was written by humans or by a machine (human text has the lowest perplexity). It measures uncertainty, much the same as entropy: the higher the probability of the test set W, the lower the perplexity.
PP(W) = P(s1, s2, ..., sm)^(-1/m), i.e. the m-th root of 1 / ∏i P(wi | w(i-1))
or log(PP(W)) = -1/m Σi log2( P(wi | w(i-1)) )
where wi is the i-th word of the test set and m is the number of words in the test set W, including <e> but not <s>.
Good language models have PP(W) between 20 and 60 for English, sometimes even lower; character-level models have lower perplexity than word-level models.
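
A small sketch of the log form of this metric, assuming you already have the conditional probability of each word in the test set; the probability values below are made up for illustration.

import numpy as np

def log_perplexity(word_probs):
    # word_probs: P(w_i | w_{i-1}) for every word in the test set (including <e>, excluding <s>)
    m = len(word_probs)
    return -np.sum(np.log2(word_probs)) / m

probs = [1.0, 1.0, 0.5, 0.25]            # illustrative conditional probabilities
print(2 ** log_perplexity(probs))        # perplexity of this toy test set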

minimum edit distance algorithm

The minimum number of operations it would take to convert one word into another. shance -> chance :: dist = 2 -> replace(s,c); cimb -> climb :: dist = 1 -> insert(l); play -> stay :: dist = 4 -> replace(p,s), replace(l,t). Operations: insert (add a letter), delete (remove a letter), replace (= insert + delete), switch (swap two adjacent letters). The minimum edit distance depends only on the edit costs and the two words being considered, not on any corpus or vocabulary.

numpy A@B

Equivalent to numpy.matmul(A, B), i.e. matrix multiplication.

Sigmoid

Prediction h(x, θ) = sigmoid(θ.T x) = 1 / (1 + e^(-θ.T x)), where x is a (vertical) column vector and θ.T is a (horizontal) row vector.

Naive Bayes Classifier for sentiment Analysis

Predicts the probability of a certain outcome based on prior occurrences of related events. Assumes that the variables (words) are independent.
For each word in the vocabulary you can create positive and negative frequency counts:
V = count of unique words in the vocabulary; Npos = Σw freqs(w, 1); Nneg = Σw freqs(w, 0)
For each word calculate the conditional probabilities:
P(word | pos) = freqs(word, 1) / Npos    P(word | neg) = freqs(word, 0) / Nneg
Σw P(w | pos) = 1 and Σw P(w | neg) = 1
Words where P(w | pos) = P(w | neg) are neutral and don't add to the sentiment. Actively avoid P(w | pos) = 0 or P(w | neg) = 0. Power words are widely skewed: P(w | pos) >> P(w | neg) or P(w | pos) << P(w | neg).
Pipeline:
1) Annotate tweets as positive or negative
2) Preprocess tweets: lowercase; remove punctuation, URLs, names; remove stop words; stemming; tokenize sentences
3) Get the columns Σw freqs(w, 1) pos. and Σw freqs(w, 0) neg.
4) Get the columns P(w | pos) and P(w | neg)
5) Get the column λ(w) = log(ratio(w))
6) Calculate the log prior ratio = log( P(pos)/P(neg) )
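
A tiny sketch of scoring a preprocessed tweet with the λ column and log prior from steps 5 and 6; the λ values below are invented for illustration, not computed from real data.

def naive_bayes_score(words, lambdas, log_prior):
    # lambdas: {word: log(P(w | pos) / P(w | neg))}; unseen words contribute 0
    return log_prior + sum(lambdas.get(w, 0.0) for w in words)

lambdas = {"happi": 2.2, "sad": -2.1, "learn": 0.1}   # illustrative, post-stemming keys
log_prior = 0.0                                        # balanced data set
score = naive_bayes_score(["i", "am", "happi"], lambdas, log_prior)
print("positive" if score > 0 else "negative")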

Eigenvector

Principal components: uncorrelated features of your data; the directions are given by the rotation matrix = pcaTr.components_, e.g. R = [[cos(angle), -sin(angle)], [sin(angle), cos(angle)]]. The principal components contained in the rotation matrix are sorted in decreasing order of explained variance; the first components usually retain most of the power of the data to explain the patterns that generalize it.

part-of-speech tagging (pos)

provides information on - the word class, e.g. noun, adjective, auxiliary verb - inflections, e.g. plural noun, verb in the past participle form, superlative form of an adjective - other aspects, e.g. type of conjunction, comparative after-determiner (more, less, ...); existential there, ... Purposes : 1) Identifying named entities 2) Coreference Resolution (Identify pronouns) 3) Speech Recognition

Ratio of Probabilities

ratio(w) = P(w | pos)/P(w | neg) POSITIVE = P(w | pos)/P(w | neg) => ∞ NEUTRAL = P(w | pos)/P(w | neg) = 1 NEGATIVE = P(w | pos)/P(w | neg) => 0

Naïve Bayes Inference condition Rule for Binary Classification

For a sentence with m words, e.g. "I am happy today; I am learning":
Likelihood = Πm P(wm | pos)/P(wm | neg)
Naïve Bayes inference = (prior ratio) x (likelihood): NBI = (P(pos)/P(neg)) * Πm P(wm | pos)/P(wm | neg) over the m words of the sentence
Product of word ratios > 1 => positive; product of word ratios < 1 => negative

Probabilities

the likelihood that something will happen

Markov assumption

the probability of a word depends only on the previous N words

Conditional Probability

the probability that one event happens given that another event is already known to have happened. All the other givens are ignored P(X|Y)

Word as Vectors Use Cases

Use cases for word embeddings include machine translation systems, information extraction, question answering, semantic analogies, semantic analysis, and classification of customer feedback.

Missing N-Grams in Training - Interpolation

Use a weighted combination of probabilities, e.g. for a trigram: ~P(wn | wn-2 wn-1) = λ1 * P(wn | wn-2 wn-1) + λ2 * P(wn | wn-1) + λ3 * P(wn), where Σj λj = 1 and each λ is a weight fraction.

Cost

Usually graphed over iterations: J = -1/m Σi [ y^(i) log(h(x^(i), θ)) + (1 - y^(i)) log(1 - h(x^(i), θ)) ]. If y = 1 and h(x, θ) ≈ 0, the loss goes to infinity; if y = 0 and h(x, θ) ≈ 1, the loss goes to infinity.

Text Corpus -> N-Gram Model

When breaking a sentence into n-grams, the first word has no predecessors, so add N-1 <s> start tags: bigram = <s> w1 w2 ..., trigram = <s> <s> w1 w2 ..., n-gram = <s> <s> ... <s> w1 w2 ...; but only one end tag <e> per sentence.

Frobenius Norm

Where A = [[2, 2], [2, 2]], ||A||_F = sqrt(2^2 + 2^2 + 2^2 + 2^2). A_squared = np.square(A); A_Frobenius = np.sqrt(np.sum(A_squared)). It is easier to work with the loss as the square of the Frobenius norm, which removes the square root and makes it easier to derive the gradient when minimizing || X • R - Y ||_F^2.

Feature Extraction: Frequencies of Words

Features: freq(word, sentiment class). Xm = [ 1 (bias), Σw freqs(w, 1) pos., Σw freqs(w, 0) neg. ]
vocab:  I  am  happy  because  learning  nlp  sad  not
pos:    3  3   2      1        1         1    0    0
neg:    3  3   0      0        1         1    2    1
Tweet: "I am sad, I am not learning NLP"
Σ freqs(w, 1) pos. = I:3 + am:3 + sad:0 + not:0 + learning:1 + NLP:1 = 8
Σ freqs(w, 0) neg. = I:3 + am:3 + sad:2 + not:1 + learning:1 + NLP:1 = 11
X has shape (m, 3): X1 = [1, 3, 5], X2 = [1, 5, 4], ..., Xm = [1, 8, 11], one row per sample.

Word2Vec

word2vec is an algorithm and tool to learn word embeddings by trying to predict the context of words in a document. The resulting word vectors have interesting properties, e.g. vector('queen') ≈ vector('king') - vector('man') + vector('woman'). Two different objectives can be used to learn the embeddings: A) the skip-gram objective tries to predict the context from a word (the model learns to predict the words surrounding a given input word); B) the CBOW (continuous bag of words) objective tries to predict a word from its context. word2vec uses shallow neural networks.

out of vocabulary words

Words the model hasn't seen during training; smoothing may be an option, or use the tag <UNK>.
Closed vocabulary = fixed set of words. Open vocabulary = can encounter words outside of the vocabulary.
Building the vocabulary from a corpus:
A) replace any word with less than a minimum frequency with <UNK>; the vocabulary then becomes the remaining unique words that are not <UNK>
B) predetermine a maximum vocabulary size and append the most frequent words until that size is met, treating everything else as <UNK>
Use <UNK> sparingly so it doesn't get a high probability in your sequences: if there are many <UNK> replacements in your train and test sets, you may get a very low perplexity even though the model itself isn't good.

Document Vectors

you can represent a document as a vector by adding up the word vectors for the words inside the document. for word in words_in_document: document_embedding += word_embedding.get(word,0)

Accuracy

Σi (pred^(i) == y^(i))/m

VANISHING GRADIENT PROBLEM

∏(t ≥ i > k) 𝛿h<i>/𝛿h<i-1> => 0
The vanishing gradient problem arises in very deep neural networks, typically recurrent neural networks, that use activation functions whose gradients tend to be small (in the range 0 to 1). Because these small gradients are multiplied during backpropagation, they tend to "vanish" through the layers, preventing the network from learning long-range dependencies.
Common ways to counter the problem:
- use activation functions like ReLU that do not suffer from small gradients, along with initializing weights to the identity matrix; this essentially copies the previous hidden state plus information from the current input and replaces any negative values with zero
- use batch normalization and proper initialization of weights, so that the gradients have healthy norms
- use architectures like LSTMs that explicitly combat vanishing gradients
- use skip connections, since as the layers get deeper it may not even be possible for the benefit of early identity-matrix weights to reach the deeper layers
The opposite problem is called the exploding gradient problem.
Sufficient conditions for vanishing gradients: 1) the derivative of the activation function is bounded by some constant c, and 2) the absolute value of the largest eigenvalue of the weight matrix Whh is lower than 1/c.

