Natural Language Processing

Define lemmatization.

Lemmatization is a scalpel that brings words down to their root forms. For example, NLTK's savvy lemmatizer knows "am" and "are" are related to "be."

List the common tasks involved in text preprocessing. HINT: Ned Turns Normal Sticks Lovely.

1.) Noise Removal 2.) Tokenization 3.) Normalization 4.) Stemming 5.) Lemmatization

Provide each of the abbreviations for each part of speech when it comes to part-of-speech tagging.

1.) Noun = NN 2.) Verb = VB 3.) Adverb = RB 4.) Adjective = JJ 5.) Determiner = DT

If .match() finds a match that starts at the _______ of the string, it will return a _______. The _______ lets you know what piece of text the regular expression matched, and at what index the match begins and ends. If there is no match, .match() will return _______.

beginning, match object, match object, None
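
The behavior described on this card can be checked with a short sketch using Python's re module. The pattern and sample strings are made up for illustration.

```python
import re

# .match() only succeeds when the pattern matches at the start of the string.
hit = re.match(r"Toto", "Toto and Dorothy")
miss = re.match(r"Toto", "Dorothy and Toto")

print(hit)          # a match object describing the matched text and its span
print(hit.group())  # the matched text
print(hit.span())   # (start index, end index) of the match
print(miss)         # None, because "Toto" is not at the beginning
```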

Based on the previous notecard, identify a noun phrase from the part-of-speech tagged text below: [('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]

(('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN')) (('the', 'DT'), ('east', 'NN')) (('bondage', 'NN'))

Using a regular expression, match the phrases puppies are my favorite!, and kitty cats are my favorite!, but not the phrases deer are my favorite!, and otters are favorite!

(puppies|kitty cats) are my favorite!
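
One way to check this answer is to run the pattern against all four phrases. This is a minimal sketch; using fullmatch() is a choice made here so that the whole phrase has to match.

```python
import re

pattern = re.compile(r"(puppies|kitty cats) are my favorite!")

phrases = [
    "puppies are my favorite!",
    "kitty cats are my favorite!",
    "deer are my favorite!",
    "otters are favorite!",
]

# fullmatch() requires the entire phrase to match the pattern.
results = {phrase: bool(pattern.fullmatch(phrase)) for phrase in phrases}
for phrase, matched in results.items():
    print(phrase, "->", matched)
```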

What are some of the use cases for a bag-of-words language model?

- determining topics in a song - filtering spam from your inbox - finding out if a tweet has positive or negative sentiment - creating a word cloud

What is a bag-of-words(BoW) language model?

Bag-of-words (BoW) is a statistical language model based on word count.

List the two most common verb structures.

1.) The first structure begins with a verb VB of any tense, followed by a noun phrase, and ends with an optional adverb RB of any form. 2.) The second structure switches the order of the verb and the noun phrase, but also ends with an optional adverb.

What is a Bag-of-Words model?

A bag of words model is a simplified representation of text data that disregards grammar and word order, focusing only on the frequency of individual words in a document.

What is a feature vector?

A feature vector is a numeric representation of an item's important features. Each feature has its own column. If the feature exists for the item, you could represent that with a 1. If the feature does not exist for that item, you could represent that with a 0. Ex: Five fantastic fish flew off to find faraway fish. Feature Vector: five: 1 , fantastic: 1, fish: 2, flew: 1, off: 1, to: 1 , find: 1, faraway:1

Define term frequency-inverse document frequency (tf-idf).

A term-weighting approach, commonly used in topic modeling, that deprioritizes the most common words and prioritizes less frequently used terms as topics. This way, words that contribute little to the overall topic carry little weight.
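
A rough sketch of the idea, assuming one common unsmoothed formulation: term frequency multiplied by log(N / document frequency). Libraries often use smoothed or normalized variants, so treat the exact formula as an assumption. The toy corpus and the tf_idf helper are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
documents = [
    "the witch flew over the house",
    "the lion walked down the road",
    "the witch melted",
]
tokenized = [doc.split() for doc in documents]

def tf_idf(term, doc_tokens, all_docs):
    # Term frequency: share of the document's tokens equal to the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for tokens in all_docs if term in tokens)
    # Inverse document frequency (assumed unsmoothed form: log(N / df)).
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

# "the" appears in every document, so its weight drops to zero;
# "flew" appears in only one document, so it scores higher.
score_the = tf_idf("the", tokenized[0], tokenized)
score_flew = tf_idf("flew", tokenized[0], tokenized)
print(score_the, score_flew)
```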

What is alternation?

Alternation, performed in regular expressions with the pipe symbol, |, allows us to match either the characters preceding the | OR the characters after the |. Ex: Stone|Scout would match "I love Stone" or "I love Scout".

Define Topic Modeling.

An area of NLP dedicated to uncovering latent, or hidden, topics within a body of language.

What is an n-gram model?

An n-gram model is a way of understanding language that looks at sequences of n words in a sentence or text to predict what comes next. Simple Explanation(Optional): Imagine reading a story and noticing that some words often come together, like "once upon a time." An n-gram model is like a super smart friend who remembers which words like to be together, and can guess what word might come next based on the words that came before! For example, if you have the words "I love to eat," the model would guess that the next word might be "pizza" because a lot of times people say "I love to eat pizza."
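
As a minimal sketch of the "guess the next word" idea, the snippet below counts which word follows each word (a bigram model, n = 2) and predicts the most frequent follower. The training text and the predict_next helper are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny training text, invented for illustration.
training_text = "i love to eat pizza . i love to eat pasta . i love to eat pizza ."
tokens = training_text.split()

# Count which word follows each word.
following = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("eat"))   # "pizza" follows "eat" more often than "pasta"
print(predict_next("love"))
```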

Bag-Of-Words Section

Define tokenization.

Breaking text into individual words.

What are character sets?

Character sets, denoted by a pair of brackets [], let us match one character from a series of characters, allowing for matches with incorrect or different spellings. Ex: The regex con[sc]en[sc]us will match consensus, as well as misspellings such as concensus.

Define normalization.

Cleaning text data in any way beyond noise removal and tokenization (for example, stemming and lemmatization).

How can we represent a vector in Python?

In Python, we can use a list to represent a vector. Each index in the list will correspond to a word and be set to its count.

Define stemming.

Stemming is a blunt axe that chops off word prefixes and suffixes. "booing" and "booed" become "boo", but "computer" may become "comput" and "are" would remain "are."

What is the goal of NLP?

Enabling computers to interpret, analyze, and approximate the generation of human languages

We can only match alphabetical characters using literals. T/F

False. Ex: "3" would match "3"

Within a character set we can match multiple characters. T/F

False. Within any character set [] we only match one character.

What are fixed quantifiers?

Fixed quantifiers, denoted with curly braces {}, let us indicate the exact quantity of a character we wish to match, or allow us to provide a quantity range to match on. Ex: \w{4,7} will match at minimum 4 word characters and at maximum 7 word characters. The regex roa{3}r will match the characters ro followed by 3 as, and then the character r, such as in the text roaaar.
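
Both examples from this card can be tried directly. This is a small sketch; the non-matching strings are added here for contrast.

```python
import re

# {3} requires exactly three of the preceding character.
exact = re.fullmatch(r"roa{3}r", "roaaar")
too_few = re.fullmatch(r"roa{3}r", "roaar")

# {4,7} accepts between four and seven word characters.
in_range = re.fullmatch(r"\w{4,7}", "wizard")
too_short = re.fullmatch(r"\w{4,7}", "oz")

print(bool(exact), bool(too_few), bool(in_range), bool(too_short))
```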

Define grouping.

Grouping, denoted with the open parenthesis ( and the closing parenthesis ), lets us group parts of a regular expression together, and allows us to limit alternation to part of the regex. Ex: The regex I love (baboons|gorillas) will match the text I love and then match either baboons or gorillas.

Define Named Entity Recognition.

Helps identify the proper nouns (e.g., "Natalia" or "Berlin") in a text. This can be a clue as to the topic of the text.

What is language accessibility?

How accessible a language or other areas of a language are to those that are disabled.

What is Part-of-Speech Tagging?

Identifies parts of speech (verbs, nouns, adjectives, etc.).

Define text preprocessing.

Text preprocessing is usually the first step you'll take when faced with an NLP task. It refers to the process of cleaning, organizing, and preparing raw text data for analysis or machine learning applications.

Explain the following piece of chunk grammar: chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

The chunk can start with an optional determiner (DT), followed by zero or more adjectives (JJ), then a noun (NN), followed by a verb of any tense (VB.*) and optionally an adverb (RB.?). This pattern is structured to capture phrases like "the cat ran quickly" or "a curious dog barked loudly".

Explain the following piece of chunk grammar: chunk_grammar = """NP: {<.*>+} }<VB.?|IN>+{"""

It essentially says that an NP can consist of one or more of any type of word or part of speech. The inverted brackets }{ indicate which parts of speech you want to filter from the chunk. <VB.?|IN>+ will filter out any verbs or prepositions.

What is a Dependency Grammar Tree?

It helps you understand the relationship between the words in a sentence.

What is a statistical language model?

It is a way for computers to make sense of language based on probability. Ex: "Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?" A statistical language model focused on the starting letter for words might take this text and predict that words are most likely to start with the letter "f" because 11 out of 15 words begin that way.

What is language prediction?

It is an application of NLP concerned with predicting text given preceding text.

What is chunk filtering?

It lets you define what parts of speech you do not want in a chunk and remove them.

What does the parse method do?

It takes a list of part-of-speech tagged words as an argument, and identifies where such chunks occur in the sentence. Ex: scaredy_cat = chunk_parser.parse(pos_tagged_oz[282])

What does the match method do?

It takes a string of text as an argument and looks for a single match to the regular expression that starts at the beginning of the string. Ex: result = regular_expression_object.match("Toto")

What does the findall method do?

It returns a list of all non-overlapping matches of the regular expression in the string. Ex: list_of_matches = re.findall(r"\w{8}", text) Output: ['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']
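
A self-contained sketch of the same call. The sample text is invented here, since the card does not show the text it searched.

```python
import re

# Sample text invented for illustration.
text = "Dorothy met friendly Munchkins."

# Every non-overlapping run of exactly 8 word characters, scanning left to right.
# "Dorothy" has only 7 letters, so it is skipped; "Munchkins" yields its first 8.
list_of_matches = re.findall(r"\w{8}", text)
print(list_of_matches)
```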

What is the Kleene plus?

The Kleene plus, denoted by the plus sign +, matches the preceding character 1 or more times.

What is latent Dirichlet allocation (LDA)?

LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts.

Language Parsing Section

What is language smoothing? What is it used for?

Language smoothing is a technique used in natural language processing to handle the issue of unseen or rare words in a dataset. It adjusts the probabilities of words to ensure that even if a word hasn't been seen before, it still has a small chance of occurring, preventing the model from assigning zero probability to unseen words.
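
The card does not name a specific technique. As one hedged illustration, the sketch below uses add-one (Laplace) smoothing, a simple and widely taught option. The training tokens and helper names are invented.

```python
from collections import Counter

# Toy training tokens, invented for illustration.
training_tokens = "the witch saw the lion".split()
counts = Counter(training_tokens)
vocabulary = set(training_tokens) | {"scarecrow"}  # include one unseen word

def unsmoothed(word):
    # Plain relative frequency: unseen words get probability 0.
    return counts[word] / len(training_tokens)

def add_one(word):
    # Add 1 to every count; the denominator grows by the vocabulary size
    # so the probabilities over the vocabulary still sum to 1.
    return (counts[word] + 1) / (len(training_tokens) + len(vocabulary))

print(unsmoothed("scarecrow"), add_one("scarecrow"))
```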

Regular expressions work from ______ to ______ with a piece of text.

Left, Right

What is Levenshtein Distance?

Levenshtein distance is a way to measure how different two words are by counting the minimum number of single-character changes needed to turn one word into the other.
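
A minimal dynamic-programming sketch of the distance, where a "single-character change" means an insertion, a deletion, or a substitution. The function name is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```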

When using an n-gram model with language prediction, you'll likely have to rely on _______.

Markov Chains

What are Markov Chains?

Markov chains are a way to understand and predict what happens next in a sequence of events, based on the current situation, without considering the past.

What are Naive Bayes Classifiers?

Naive Bayes classifiers are supervised machine learning algorithms that leverage a probabilistic theorem (Bayes' theorem) to make predictions and classifications.

What are optional quantifiers?

Optional quantifiers, indicated by the question mark ?, allow us to indicate a character in a regex is optional, or can appear either 0 times or 1 time. For example, the regex humou?r matches the characters humo, then either 0 occurrences or 1 occurrence of the letter u, and finally the letter r

Explain the following piece of chunk grammar: chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

Optionally, there can be a determiner (DT), followed by zero or more adjectives (JJ), and finally, a noun (NN). This grammar pattern is designed to capture phrases like "the big brown dog" or "a lovely sunny day" where there may be optional words like determiners or multiple adjectives before the noun.
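
To make the pattern concrete without depending on nltk, the sketch below hand-codes the same idea: an optional DT, any number of JJ tags, then an NN. It is an illustration of the pattern only, not how nltk's RegexpParser is implemented, and the np_chunks helper is hypothetical.

```python
def np_chunks(tagged):
    """Find non-overlapping chunks matching <DT>?<JJ>*<NN>, left to right."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        # Optional determiner.
        if j < len(tagged) and tagged[j][1] == "DT":
            j += 1
        # Zero or more adjectives.
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        # The chunk must end with a noun.
        if j < len(tagged) and tagged[j][1] == "NN":
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1
    return chunks

tagged_words = [
    ("the", "DT"), ("wicked", "JJ"), ("witch", "NN"),
    ("of", "IN"), ("the", "DT"), ("east", "NN"),
]
chunks = np_chunks(tagged_words)
print(chunks)
```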

What is parsing?

Parsing is an NLP process concerned with segmenting text based on syntax.

Define phonetic similarity.

Phonetic similarity in NLP means figuring out if words sound alike or similar, even if they are spelled differently.

The first step to language prediction is _______.

Picking a language model

What does the caret (^) symbol do?

Placed at the front of a character set, the ^ negates the set, matching any character that is not stated. These are called negated character sets. Ex: Thus the regex [^cat] will match any character that is not c, a, or t.
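
A quick sketch of a negated character set in action. The sample string is made up.

```python
import re

# [^cat] matches any single character other than c, a, or t.
leftovers = re.findall(r"[^cat]", "cat nap")
print(leftovers)
```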

What are language models?

Probabilistic computer models of language. We build and use these models to figure out the likelihood that a given sound, letter, word, or phrase will be used.

What library do we use for Regex parsing?

Python's 're' library.

What are ranges?

Ranges allow us to specify a range of characters in which we can make a match without having to type out each individual character. The - character indicates that we are interested in matching a range of characters. Ex: The regex [abc], which matches any one of the characters a, b, or c, is equivalent to the regex range [a-c].

To use the chunk grammar defined, you must create an nltk _______ object and give it a piece of _______ as an argument.

RegexpParser, chunk grammar
Ex:
from nltk import RegexpParser
chunk_parser = RegexpParser(chunk_grammar)

Regular Expressions Section

What are shorthand character classes?

Shorthand character classes represent common ranges (such as digits or whitespace), and they make writing regular expressions much simpler.

Define text similarity.

Text similarity in NLP means figuring out how much two pieces of text are alike or similar to each other.

What is the Kleene Star?

The Kleene star, denoted with the asterisk *, is also a quantifier, and matches the preceding character 0 or more times. This means that the character doesn't need to appear, can appear once, or can appear many many times.

Define semantic similarity.

The degree to which documents contain similar meaning or topics.

Define lexical similarity.

The degree to which texts use the same vocabulary and phrases.

What does the escape character look like and what does it do?

The escape character is a backslash, "\". Placed in front of a character that has a special meaning in regular expressions, it makes the regex match that character literally. Ex: .....\. would match "stone."
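
The difference between an escaped and an unescaped period can be seen in a short sketch. The second test string is added here for contrast.

```python
import re

# Five wildcards followed by an escaped period, which matches a literal ".".
escaped = re.compile(r".....\.")
print(bool(escaped.fullmatch("stone.")))
print(bool(escaped.fullmatch("stones")))

# Without the backslash, the final "." is a wildcard and matches any character.
unescaped = re.compile(r"......")
print(bool(unescaped.fullmatch("stones")))
```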

What does the pos_tag function do?

The function takes one argument, a list of words in the order they appear in a sentence, and returns a list of tuples, where the first entry in the tuple is a word and the second is the part-of-speech tag. Ex: part_of_speech_tagged_sentence = pos_tag(word_sentence) [('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'), ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'), ('heart', 'NN'), ('?', '.')]

What are the negated shorthand character classes?

These shorthands will match any character that is NOT in the regular shorthand classes.

What is a literal?

This is where our regular expression contains the exact text that we want to match. The regex a, for example, will match the text a. We can additionally match just part of a piece of text. For example, "monkeys" would match "monkeys" in "I love monkeys".

What does the compile method do?

This method takes a regular expression pattern as an argument and compiles the pattern into a regular expression object, which you can later use to find matching text. Ex: regular_expression_object = re.compile("[A-Za-z]{4}") This will match exactly 4 uppercase or lowercase letters.

If the filtered parts of speech are in the middle of a chunk, it will split the chunk into two separate chunks. T/F

True

Quantifiers will match the greatest quantity of characters they possibly can. T/F

True. They will match the greatest quantity of characters they possibly can. For example, the regex mo{2,4} will match the text moooo in the string moooo, and not return a match of moo
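
A small sketch of this greedy behavior. The second, longer string is added here to show that the upper bound of 4 still applies.

```python
import re

# The quantifier takes as many o's as it is allowed to (up to 4).
greedy = re.search(r"mo{2,4}", "moooo")
print(greedy.group())

# With more than four o's available, the match still stops at four.
longer = re.search(r"mo{2,4}", "mooooooo")
print(longer.group())
```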

What does the following code do?
def text_to_bow(some_text):
    bow_dictionary = {}
    tokens = preprocess_text(some_text)
    for token in tokens:
        if token in bow_dictionary:
            bow_dictionary[token] += 1
        else:
            bow_dictionary[token] = 1
    return bow_dictionary

We define a function called text_to_bow with a parameter of some_text. We create an empty BoW dictionary and set tokens to the preprocessed version of some_text, a list of individual words. Then we iterate through each token and check whether it is already in the BoW dictionary. If so, we increment its value to update the count for that word. If not, we add it with a value of 1. Finally, we return the BoW dictionary.
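
The same counting can be written more compactly with collections.Counter. This is an alternative sketch, not the card's code. The preprocess_text helper below is a hypothetical stand-in that only lowercases and splits on letters, since the card does not show its own version.

```python
import re
from collections import Counter

def preprocess_text(some_text):
    # Hypothetical stand-in for the card's preprocess_text helper:
    # lowercases the text and keeps runs of letters as tokens.
    return re.findall(r"[a-z]+", some_text.lower())

def text_to_bow(some_text):
    # Counter tallies each token, replacing the explicit if/else loop.
    return dict(Counter(preprocess_text(some_text)))

bow = text_to_bow("Five fantastic fish flew off to find faraway fish.")
print(bow)
```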

How do we know which vector index corresponds to which word?

When building BoW vectors, we generally create a features dictionary that maps every word in the vocabulary of our training data to a vector index. Ex: "Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?" Output: {'five': 0,'fantastic': 1,'fish': 2,'fly': 3,'off': 4,'to': 5,'find': 6,'faraway': 7,'function': 8,'maybe': 9,'another': 10}

What are wildcards?

Wildcards will match any single character (letter, number, symbol or whitespace) in a piece of text. They are represented with a period (.). They are useful when we do not care about the specific value of a character, but only that a character exists! Ex: ..... would match "stone"

What does the search method do?

The .search() method looks left to right through an entire piece of text and returns a match object for the first match to the regular expression given. If no match is found, .search() will return None.
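
A short sketch contrasting .search() with .match() on the same text. The sample sentence is invented for illustration.

```python
import re

text = "Dorothy and Toto went to Oz. Toto barked."

# search() scans the whole text and reports the first occurrence only.
found = re.search(r"Toto", text)
print(found.group(), found.span())

# match() fails here because the text does not start with "Toto".
print(re.match(r"Toto", text))

# search() returns None when the pattern appears nowhere in the text.
print(re.search(r"Glinda", text))
```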

Define chunking.

With chunking in nltk, you can define a pattern of parts-of-speech tags using a modified notation of regular expressions. You can then find non-overlapping matches, or chunks of words, in the part-of-speech tagged sentences of a text. Ex: chunk_grammar = "AN: {<JJ><NN>}" (The above chunk grammar matches any adjective followed by a noun)

What is word embedding?

Word embedding is a way to turn words into numeric vectors that computers can work with. It helps computers figure out how words relate to each other, which is really useful for understanding language.

Using the prior dictionary, we can convert new documents into vectors using a vectorization function. For example, we can take a brand new sentence "Another five fish find another faraway fish." — test data — and convert it to a vector that looks like: ?

[1, 0, 2, 0, 0, 0, 1, 1, 0, 0, 2] The word 'another' appeared twice in the test data. If we look at the feature dictionary for 'another', we find that its index is 10. So when we go back and look at our vector, we'd expect the number at index 10 to be 2.
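
The vector above can be reproduced with a short sketch. The features dictionary is copied from the earlier card. The preprocess_text helper is a hypothetical stand-in (lowercasing and splitting only, no lemmatization), and the check that skips unknown words is an assumption added here.

```python
import re

# Features dictionary copied from the earlier card.
features_dictionary = {
    "five": 0, "fantastic": 1, "fish": 2, "fly": 3, "off": 4, "to": 5,
    "find": 6, "faraway": 7, "function": 8, "maybe": 9, "another": 10,
}

def preprocess_text(some_text):
    # Hypothetical stand-in: lowercase and keep runs of letters as tokens.
    return re.findall(r"[a-z]+", some_text.lower())

def text_to_bow_vector(some_text, features_dictionary):
    bow_vector = [0] * len(features_dictionary)
    tokens = preprocess_text(some_text)
    for token in tokens:
        # Assumption: words missing from the vocabulary are skipped.
        if token in features_dictionary:
            bow_vector[features_dictionary[token]] += 1
    return bow_vector, tokens

vector, tokens = text_to_bow_vector(
    "Another five fish find another faraway fish.", features_dictionary
)
print(vector)
```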

List all of the negated shorthand character classes.

\W: the "non-word character" class represents the regex range [^A-Za-z0-9_], matching any character that is not included in the range represented by \w
\D: the "non-digit character" class represents the regex range [^0-9], matching any character that is not included in the range represented by \d
\S: the "non-whitespace character" class represents the regex range [^ \t\r\n\f\v], matching any character that is not included in the range represented by \s

Match the following phrases using regular expressions: 1 duck for adoption?, 5 ducks for adoption?

\d ducks? for adoption\?
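
A sketch that checks this answer. Note that the first ? makes the s optional, while the escaped \? matches a literal question mark. The third test string is added here for contrast.

```python
import re

pattern = re.compile(r"\d ducks? for adoption\?")

singular = pattern.fullmatch("1 duck for adoption?")
plural = pattern.fullmatch("5 ducks for adoption?")
# Without the trailing question mark, the escaped \? has nothing to match.
no_question_mark = pattern.fullmatch("5 ducks for adoption")

print(bool(singular), bool(plural), bool(no_question_mark))
```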

Write a regular expression that would match the phrases 5 sloths, 8 llamas, 7 hyenas, but not the phrases one bird, and two owls.

\d\s\w\w\w\w\w\w

List all of the shorthand character classes.

\w: the "word character" class represents the regex range [A-Za-z0-9_], and it matches a single uppercase character, lowercase character, digit or underscore
\d: the "digit character" class represents the regex range [0-9], and it matches a single digit character
\s: the "whitespace character" class represents the regex range [ \t\r\n\f\v], matching a single space, tab, carriage return, line break, form feed, or vertical tab
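
A brief sketch of the three classes on a made-up string. One caveat: for ordinary Python 3 string patterns these classes are Unicode-aware by default, so \w also matches letters outside A-Z unless the re.ASCII flag is used.

```python
import re

sample = "Oz_1 is\t5 km"

words = re.findall(r"\w+", sample)   # runs of word characters
digits = re.findall(r"\d", sample)   # single digits
spaces = re.findall(r"\s", sample)   # single whitespace characters

print(words, digits, spaces)
```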

Match the following phrase using a regular expression without matching "king penguins are cooler than regular expressions": penguins are cooler than regular expressions

^penguins are cooler than regular expressions$
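
The anchors ^ and $ pin the pattern to the start and end of the text, which is what excludes the longer phrase. A quick sketch:

```python
import re

pattern = re.compile(r"^penguins are cooler than regular expressions$")

exact = pattern.search("penguins are cooler than regular expressions")
# The extra "king " means the text no longer starts with "penguins".
prefixed = pattern.search("king penguins are cooler than regular expressions")

print(bool(exact), bool(prefixed))
```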

How can we create a features dictionary using code?

def create_features_dictionary(documents):
    features_dictionary = {}
    merged = ' '.join(documents)
    tokens = preprocess_text(merged)
    index = 0
    for token in tokens:
        if token not in features_dictionary:
            features_dictionary[token] = index
            index += 1
    return features_dictionary, tokens

Now we can create a BoW vector! How can we do this in code?

def text_to_bow_vector(some_text, features_dictionary):
    bow_vector = [0] * len(features_dictionary)
    tokens = preprocess_text(some_text)
    for token in tokens:
        # Skip words that are not in the features dictionary to avoid a KeyError.
        if token in features_dictionary:
            feature_index = features_dictionary[token]
            bow_vector[feature_index] += 1
    return bow_vector, tokens

A popular form of noun phrase begins with a _______ , which specifies the noun being referenced, followed by any number of _______ , which describe the noun, and ends with a _______.

determiner(DT), adjectives(JJ), noun(NN)

One of the most common ways to implement the BoW model in Python is as a _______ with each _______ set to a word and each _______ set to the number of times that word appears.

dictionary, key, value

Usually, we need to prepare our text data by breaking it up into _______.

documents (shorter strings of text, generally sentences).

Turning text into a BoW vector is known as _______.

feature extraction or vectorization.

Match the following phrases using regular expressions without matching the word "hot": hoot hooooooot hoooooooot

hoo+t

A popular method for performing chunk filtering is to chunk an entire sentence together and then _______.

indicate which parts of speech are to be filtered out

What is the sole concern of BoW?

Its sole concern is word count: how many times each word appears in a document.

How can you access the matched text returned from the match object?

result.group(0)

Using regular expressions, match the following phrases: squeaaak, squeaaaak, squeaaaaak.

squea{3,5}k

Define noise removal.

stripping text of formatting (e.g., HTML tags).

For statistical models, we call the text that we use to build the model our _______.

training data

BoW is referred to as the _______.

unigram model. It's technically a special case of another statistical model, the n-gram model, with n (the number of words in a sequence) set to 1.

What is word2vec?

word2vec can map out your topic model results spatially as vectors so that similarly used words are closer together.

