LIN 120 Final *Flipped Side*
an inventory of what there is; like a dictionary, but gives more information, such as structure, attributes, and semantic roles. an example of an ontology is PropBank. it defines the word "buy" as "purchase" and defines roles like buyer, thing bought, seller, price paid, etc. it also gives "accept as truth" as a definition, with the roles believer and thing believed
Ontology
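For illustration, a minimal sketch of how such an entry could be stored using the Python dictionaries from this course; the roleset names "buy.01"/"buy.02" and the field names are assumptions, not PropBank's actual file format:
ontology = dict([])
ontology["buy.01"] = {"definition": "purchase",
                      "roles": ["buyer", "thing bought", "seller", "price paid"]}
ontology["buy.02"] = {"definition": "accept as truth",
                      "roles": ["believer", "thing believed"]}
print(ontology["buy.01"]["roles"])   # ['buyer', 'thing bought', 'seller', 'price paid']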
Coreference
What do these sentences demonstrate? The first sentence has information for the next: Paul was poor. John bought him a car.
use mixed 2-grams: if a sentence contains the mixed 2-gram "Verb you" (e.g. let you, did you), it's not an ODP; if a sentence contains the mixed 2-gram "Verb me" (e.g. send me, tell me), it's an ODP
an example of how to distinguish if a sentence is an ODP
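A minimal sketch of this rule in Python; the tiny VERBS list is an assumption that stands in for real part-of-speech tags:
import re
VERBS = ["let", "did", "send", "tell", "give", "call", "keep"]   # toy stand-in for a POS tagger
def is_odp(sentence):
    tokens = re.findall(r"\w+", str.lower(sentence))
    for first, second in zip(tokens, tokens[1:]):
        if first in VERBS and second == "me":
            return True    # mixed 2-gram "Verb me": ODP
        if first in VERBS and second == "you":
            return False   # mixed 2-gram "Verb you": not an ODP
    return False
print(is_odp("Call me on my cell later"))   # True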
0
given the word Fenster in German, whose plural form is Fenster, what is the gold class?
machine translation, automatic speech recognition, handwriting recognition
uses of N-gram models
sample = sample.title()
capitalize the first character of the words in the string sample
words written the same but with different meanings
homograph
reduce useless variation in results
what is the goal of an activation function?
Garden Path Sentence
what is this sentence an example of? The horse raced past the barn fell.
[]
["John", "Mary", "Sue"][1:1]
'Mary'
["John", "Mary", "Sue"][1]
words pronounced the same but with different meanings
homophone
a process in which a computer can generalize from seen data
machine learning
demonstrates that NLP tools can be used to study interesting questions about how people use language
the significance of studying ODP of female and male superiors
all letters of the alphabet, letters of other alphabets, digits, and the underscore; not whitespace and not special characters
what does \w match
a linear function of its input, which can be represented graphically using a line that separates the two classes of data points
what does a perceptron compute
Bow of a present, bow and arrow, beau as in male beloved (all pronounced the same)
what is an example of 3-way ambiguity in spoken language?
Bow of a ship, bow to the king, bow of a present, bow and arrow (all spelled the same)
what is an example of 4-way ambiguity in written language?
import re
def digits(string):
    return re.findall(r"[0-9]+", string)
what is another way to write this code:
import re
def digits(string):
    return re.findall(r"\d+", string)
def print_first_last(n):
    print(hamlet[:n])
    print(hamlet[-n:])
Write a small custom function print_first_last that prints the first n and last n words of hamlet
['b', 'c', 'd', 'e']
list = ["a", "b", "c", "d", "e", "f"]
print(list[1:5])
['My', 'phone', 'number', 'is', '555-123-4567']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\S+", string)
    return token_list
tokenize("My phone number is 555-123-4567")
['Stalag', '17', 'might', 'be', 'Billy', "Wilder's", 'best', 'movie!']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\S+", string)
    return token_list
tokenize("Stalag 17 might be Billy Wilder's best movie!")
['True', 'music', 'aficionados', 'listen', 'to', 'Taylor,', 'Harry,', 'and', 'Drake...']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\S+", string)
    return token_list
tokenize("True music aficionados listen to Taylor, Harry, and Drake...")
['My', 'phone', 'number', 'is', '555', '123', '4567']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\w+", string)
    return token_list
tokenize("My phone number is 555-123-4567")
['Stalag', '17', 'might', 'be', 'Billy', 'Wilder', 's', 'best', 'movie']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\w+", string)
    return token_list
tokenize("Stalag 17 might be Billy Wilder's best movie!")
['True', 'music', 'aficionados', 'listen', 'to', 'Taylor', 'Harry', 'and', 'Drake']
what is the output of this code?
import re
def tokenize(string):
    token_list = re.findall(r"\w+", string)
    return token_list
tokenize("True music aficionados listen to Taylor, Harry, and Drake...")
a very simple but effective machine learning algorithm; easy to implement in Python; the basis for neural networks and deep learning; we can interpret the models (the sequences of weights) learned by perceptrons
what is the perceptron
['antler', 'beast', 'cat', 'deer', '👍']
word_list = ["cat", "antler", "👍", "deer", "beast"]
print(sorted(word_list))
1. Yes, 2. No, 3. No, 4. Yes, 5. No, 6. Yes, 7. Yes, 8. Yes
Are these examples of ODP?
1. "Please give me your views ASAP."
2. a student emails a teacher: "I need my grade today."
3. "can you believe this bloody election?"
4. "can you please keep me in the loop"
5. "Enjoy the rest of your week!"
6. "I need the answer ASAP"
7. "Would you work on that"
8. "Call me on my cell later"
requests that create constraints on the response so you can't say no; the person making the request must be higher up. for example: "I need the report today." vs. "Do you think you can send the report today?"
ODP (Overt Display of Power)
words that are the same: same spelling and same pronunciation, but different meanings
What do these sentences demonstrate? John bought the car. John bought the story.
homophone (and also a homograph, since the spelling is identical as well)
What do these sentences demonstrate? John went to the bank to deposit money. John drank from the river bank.
Cognitive state: how confident the speaker is
What do these sentences demonstrate? John will leave tomorrow. Mary says John will leave tomorrow. I hope John will leave tomorrow.
polysemy
What do these sentences demonstrate? The book fell on the floor. The book tells the story of world war 2.
homograph
What do these sentences demonstrate? The bow of the ship was torn. The ribbon was tied in a bow.
homograph and homophone
What do these sentences demonstrate? The ribbon was tied in a bow. Katniss used a bow and arrow.
implicature: we know you bought two, not three, because if you had bought three, you would have said three
What does this sentence demonstrate? I bought two pencils.
implicature: we know Sandy is not a lover or spouse, because if she were, it would have been said (demonstrates Grice's maxims: acquaintance < friend < lover < spouse)
What does this sentence demonstrate? Sandy is a friend.
matching anything that is not matched by \d
\D
matching anything that is not matched by \s
\S
matching anything that is not matched by \w
\W
matches digits
\d
matches whitespace (spaces, tabs, newlines)
\s
matches word characters (letters, digits, the underscore)
\w
vase_entry = dict([])
vase_entry["POS"] = "noun"
vase_entry["definition"] = "A container for flowers."
vase_entry["plural"] = "vases"
english_dictionary["vase"] = vase_entry
add the word "vase" to the dictionary along with its POS, definition, and plural
english_dictionary = dict([])
there are not so many possible part-of-speech n-grams because there are far fewer parts of speech than words. So we can have longer n-grams (4-grams, 5-grams) and still do computation on them.
advantage of part-of-speech n-grams
in the sentence "the boy saw a girl with a telescope", the phrase "with a telescope" can either be connected to the verb "saw" or to the noun phrase "a girl"
an example of ambiguity using trees
meaning of all sentences in text + common sense/background knowledge + inference
aspects required for deep understanding
import re
def clean_up(reply):
    reply = re.sub(r"[\.\?!,;]", r"", reply)
    return reply.lower()
create a function that cleans the reply of the user
import re
def tokenize(the_string):
    new_string = re.sub(r"(?=[\.\?!,;])", r" ", the_string)
    token_list = re.findall(r"\S+", new_string)
    return token_list
create a function that tokenizes "Sue, stop!" as ["Sue", ",", "stop", "!"], and "Sue and Bill..." as ["Sue", "and", "Bill", ".", ".", "."]
BOS the = BOS Det
the dog = Det Noun-sg
dog barked = Noun-sg Verb-past
barked at = Verb-past Prep
at the = Prep Det
the black = Det Adj
black cat = Adj Noun-sg
cat . = Noun-sg Punc
. EOS = Punc EOS
create the part-of-speech 2-grams for "the dog barked at the black cat."
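A small sketch that builds these part-of-speech 2-grams in Python, assuming the sentence has already been tagged (the tag sequence is copied from the answer above):
tags = ["BOS", "Det", "Noun-sg", "Verb-past", "Prep", "Det", "Adj", "Noun-sg", "Punc", "EOS"]
pos_2grams = list(zip(tags, tags[1:]))   # pair each tag with the next one
print(pos_2grams)   # [('BOS', 'Det'), ('Det', 'Noun-sg'), ('Noun-sg', 'Verb-past'), ...]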
import re
def digits(string):
    return re.findall(r"\d", string)
define a function that returns a list of all the individual digits in a string, e.g. string = "James Madison had 0 sons and fought the war of 1812." => ['0', '1', '8', '1', '2']
import re
def digits(string):
    return re.findall(r"\d+", string)
define a function that returns a list of all the numbers in a string, e.g. string = "James Madison had 0 sons and fought the war of 1812." => ['0', '1812']
import re
def tokenize(string):
    return re.findall(r"\w+", string)
define a function that tokenizes a string by word characters
import re
def tokenize(string):
    token_list = re.findall(r"\S+", string)
    return token_list
define a function that tokenizes a string on whitespace
import re
from collections import Counter
tokens = re.findall(r"\w+", str.lower(string))
counts_tokens = Counter(tokens)
del counts_tokens["the"]
delete the stop word "the" from the token counts of a string
A.
dictionary = dict([])
dictionary["a"] = "A."
dictionary["b"] = "B."
dictionary["c"] = "C."
dictionary["d"] = "D."
print(dictionary["a"])
{1: 'The number one.', 2: 'The number two.', 3: 'The number three.', 4: 'The number four.'}
dictionary = dict([])
dictionary[1] = "The number one."
dictionary[2] = "The number two."
dictionary[3] = "The number three."
dictionary[4] = "The number four."
print(dictionary)
1. cannot do morphology: even though "dogs" is just the plural of "dog", they would have to be represented as different vectors
2. no representation of meaning: "travel" and "voyage" are similar, but the vectors cannot represent that
3. huge vectors in real life: we can only consider so many words, or the vectors get too large
disadvantages of one-hot vectors
['box', 'pan', 'vase']
english_dictionary = dict([])
english_dictionary["vase"] = "A container into which we can put flowers, and water to keep them fresh."
english_dictionary["pan"] = "A flat container used to cook food in."
english_dictionary["box"] = "An enclosed container, usually of wood or cardboard."
print(sorted(english_dictionary))
[('box', 'An enclosed container, usually of wood or cardboard.'), ('pan', 'A flat container used to cook food in.'), ('vase', 'A container into which we can put flowers, and water to keep them fresh.')]
english_dictionary = dict([])
english_dictionary["vase"] = "A container into which we can put flowers, and water to keep them fresh."
english_dictionary["pan"] = "A flat container used to cook food in."
english_dictionary["box"] = "An enclosed container, usually of wood or cardboard."
print(sorted(english_dictionary.items()))
b
['a']
['b']
['c', 'd', 'e', 'f']
['f']
[]
['a', 'b', 'c', 'd', 'e', 'f']
example_list = ["a", "b", "c", "d", "e", "f"]
print(example_list[1])
print(example_list[:1])
print(example_list[1:2])
print(example_list[2:])
print(example_list[5:6])
print(example_list[10:])
print(example_list[0:100])
final syllable (unaccented schwa, unaccented closed, ...), gender (masculine, feminine, neuter), alveolar (true or false)
features used in decision tree training
Counter({'is': 2, 'a': 2, 'this': 1, 'sentence': 1, 'and': 1, 'that': 1, 'tree': 1})
[('a', 2), ('and', 1), ('is', 2), ('sentence', 1), ('that', 1), ('this', 1), ('tree', 1)]
from collections import Counter
word_count = Counter(["this", "is", "a", "sentence", "and", "that", "is", "a", "tree"])
print(word_count)
print(sorted(word_count.items()))
a sentence that is expected to end but then has a verb at the end; difficult for both humans and computers to understand
garden path sentence
dictionary["pan"]["plural"]
given a dictionary with words, their POS, definition, and plural, get the plural form of the word pan
(0, 1, 0, 0, 0, 1, 0, 0)
given the word Frage in German: its final syllable is represented as (0, 1, 0, 0), its gender as (0, 1, 0), and its alveolar as (0). what is the full input representation?
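In Python, the full input is just the concatenation of the feature tuples, as in this small sketch (note that the one-element tuple for alveolar is written (0,)):
final_syllable = (0, 1, 0, 0)   # Frage
gender = (0, 1, 0)              # feminine
alveolar = (0,)                 # one-element tuple
print(final_syllable + gender + alveolar)   # (0, 1, 0, 0, 0, 1, 0, 0)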
1
given the word Frage in German, whose plural form is Fragen, what is the gold class?
dog = (1, 0, 0, 0)
cat = (0, 1, 0, 0)
bark = (0, 0, 1, 0)
run = (0, 0, 0, 1)
dog + bark = (1, 0, 1, 0)
cat + run = (0, 1, 0, 1)
dog + run = (1, 0, 0, 1)
sum = (2, 1, 1, 2)
given the words dog, cat, bark, run: use one-hot vectors to find the vectors for dog bark, cat run, and dog run, and their sum
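A minimal sketch of the same computation in Python; the helpers one_hot and add_vectors are written here for illustration, not taken from a library:
vocabulary = ["dog", "cat", "bark", "run"]
def one_hot(word):
    # a 1 in the word's position, 0 everywhere else
    return tuple(1 if w == word else 0 for w in vocabulary)
def add_vectors(*vectors):
    return tuple(sum(values) for values in zip(*vectors))
dog_bark = add_vectors(one_hot("dog"), one_hot("bark"))   # (1, 0, 1, 0)
cat_run = add_vectors(one_hot("cat"), one_hot("run"))     # (0, 1, 0, 1)
dog_run = add_vectors(one_hot("dog"), one_hot("run"))     # (1, 0, 0, 1)
print(add_vectors(dog_bark, cat_run, dog_run))            # (2, 1, 1, 2)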
for word in stopwords:
    del counts_hamlet[word]   # assumes every stop word actually occurs in counts_hamlet
given this list of stop words, remove all of them from counts_hamlet
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they"]
combine multiple perceptrons into a complex architecture, with the output from one perceptron becoming the input to the next; this is called neural machine learning
how can we extend perceptrons to do more complex tasks?
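A minimal sketch of the chaining idea, with hand-picked weights chosen purely for illustration (a real network would learn them):
def perceptron(inputs, weights, bias):
    raw = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if raw > 0 else 0
hidden = perceptron((1, 0, 1), weights=(0.5, 0.5, 0.5), bias=-0.6)
output = perceptron((hidden,), weights=(1.0,), bias=-0.5)   # one perceptron's output is the next one's input
print(hidden, output)   # 1 1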
1. annotate the data: identify the desired predictions you want the system to learn
2. find features of the data that may help the system generalize
3. training: run on the given data
4. test performance on new data
5. make changes *rinse and repeat*
how does machine learning work
prob(BOS the) = count of BOS the / count of BOS *anything*
prob(the dog) = count of the dog / count of the *anything*
prob(dog ran) = count of dog ran / count of dog *anything*
prob(ran faster) = count of ran faster / count of ran *anything*
prob(faster .) = count of faster . / count of faster *anything*
prob(. EOS) = count of . EOS / count of . *anything*
(multiplied together)
how to calculate prob(BOS the dog ran faster . EOS)
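A minimal sketch of this computation using Counters over 2-grams; the two-sentence corpus is made up for illustration:
from collections import Counter
corpus = [["BOS", "the", "dog", "ran", "faster", ".", "EOS"],
          ["BOS", "the", "cat", "ran", ".", "EOS"]]
bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    bigram_counts.update(zip(sentence, sentence[1:]))
    unigram_counts.update(sentence[:-1])   # count of word *anything*
def sentence_probability(sentence):
    prob = 1.0
    for bigram in zip(sentence, sentence[1:]):
        prob *= bigram_counts[bigram] / unigram_counts[bigram[0]]
    return prob
print(sentence_probability(["BOS", "the", "dog", "ran", "faster", ".", "EOS"]))   # 0.25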
Female superiors use fewer ODP than male superiors in interactions with their subordinates
hypothesis of ODP in terms of gender
sample = sample.capitalize()
make the first letter of the string sample uppercase
sample = sample.lower()
make the string sample lowercase
instead of naming the part of speech for closed-class parts of speech, just list the word itself, e.g. for "the dog", the mixed 2-gram would be: the Noun-sg
mixed 2-grams
how the input is represented in machine learning; for instance, gender: masculine = (1, 0, 0), feminine = (0, 1, 0), neuter = (0, 0, 1)
one-hot encoding
given a set of vocabulary words, give each a vector.
one-hot vectors
open-class parts of speech (nouns, verbs, adjectives, adverbs): new ones are constantly created; ever-changing. closed-class parts of speech (prepositions, determiners, pronouns): permanent; new ones are not created
open-class parts of speech vs closed-class parts of speech
words that are identical but have multiple related "versions" of meaning; think book: the physical object vs. the story
polysemy
for character in string:
    print(character)
print each character in a string one line at a time
import re
from collections import Counter
tokens = re.findall(r"\w+", str.lower(string))
counts_tokens = Counter(tokens)
print(counts_tokens.most_common(10))
print the 10 most common tokens of a string, along with the counter of the frequencies of each token
print(sum(test_counter.values()) / len(test_counter))
print the average number of word tokens per type given: test_counter = Counter(["a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "c"])
print(sorted(numbers_list))
print the sorted list
numbers_list = [23, 8, 98, -2, 1330]
import re
from collections import Counter
tokens = re.findall(r"\w+", str.lower(string))
print(Counter(tokens))
print the tokens of a string, along with the counter of the frequencies of each token
1. the input data (a word) is represented by a sequence of numbers, created using the word's features
2. the data has a sequence of corresponding weights, which always start at 0 and are then adjusted
3. raw prediction value = sum of (input value * corresponding weight) + a constant bias
4. if raw > 0, the normalized prediction is 1; if raw <= 0, the normalized prediction is 0
5. there is a gold label: the goal value (in our example, the gold label was whether or not the plural of a word ends in -n)
6. if prediction = gold, do nothing; if prediction = 0 and gold = 1, increase the "1" weights by 0.01; if prediction = 1 and gold = 0, decrease the "1" weights by 0.01
procedure for the perceptron
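A minimal sketch of this procedure in Python; the two training examples follow the Frage/Fenster cards above, and the Fenster feature vector is an assumption made for illustration:
training_data = [
    ((0, 1, 0, 0, 0, 1, 0, 0), 1),   # Frage -> Fragen, plural ends in -n
    ((1, 0, 0, 0, 0, 0, 1, 0), 0),   # Fenster -> Fenster (assumed features)
]
weights = [0.0] * 8   # the weights always start at 0
bias = 0.0            # constant bias
for epoch in range(10):
    for features, gold in training_data:
        raw = sum(f * w for f, w in zip(features, weights)) + bias
        prediction = 1 if raw > 0 else 0
        if prediction == 0 and gold == 1:
            weights = [w + 0.01 * f for w, f in zip(weights, features)]   # increase the "1" weights
        elif prediction == 1 and gold == 0:
            weights = [w - 0.01 * f for w, f in zip(weights, features)]   # decrease the "1" weights
print(weights)   # only the Frage features end up with positive weights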
import re
sample = re.sub(r"\.", r"", sample)
remove every period in the string sample
import re
sample = re.sub(r"[abcd]", r"", sample)
remove the letters abcd from the string sample
import re
sample = re.sub(r"\w", r"*", sample)
replace all letters with a * in the string sample
import re
sample = re.sub(r"!", r"?", sample)
replace every ! with ? in the string sample
import re
sample = re.sub(r"[abD!\?]", r"X", sample)
replace every a, b, D, !, or ? character with an X in the string sample
import re
sample = re.sub(r".", "?", sample)
replace every character with ? in the string sample
import re
sample = re.sub(r"[?!]+", r".", sample)
replace every sequence of punctuation marks !?!?!!!!!??? with a period in the string sample
print(list[0:2])
show the first two elements of a list
using machine learning, be able to label whether or not a sentence contains an ODP
significance of studying ODP
Counter({'d': 4, 'c': 3, 'b': 2, 'a': 1})
import re
from collections import Counter
string = "a b b c c c d d d d"
tokens = re.findall(r"\w+", str.lower(string))
print(Counter(tokens))
['she', 'liked', 'seafood', 'and', 'her', 'husband', 'liked', 'beef']
import re
test_string = "She liked seafood, and her husband liked beef."
print(re.findall(r"\w+", str.lower(test_string)))
if (word in english_dictionary.keys()):
the if statement to determine if a word is in the dictionary