Python-NLP Oral Exam


Tagging

•a task to label every word in a given text with the appropriate POS (part-of-speech) tags •POS tags def: •no def from a computational standpoint •Benchmark datasets are the definition (gold standard) of what the correct tags are •the correct POS depends on the context, not only the word itself •POS tags examples from the Universal Tagset: •DET (determiner: word that modifies nouns/noun phrases) •NOUN (noun) •ADJ (adjective) •VERB (verb) •PRON (pronoun) •ADP (adposition: prepositions and postpositions) •AUX (auxiliary: function word -> often a verb) •ADV (adverb) •PUNCT (punctuation) •SYM (symbol) •X (other: word does not fall into a real POS tag) •token: The atoms of interest, in our case the words of a text. •corpus: A list of tokens •tagset: A finite (usually small) set of symbols, which are linguistically defined properties of tokens. •labeled corpus: A list of (token, tag) pairs, where the tags are from a given tagset. •tagging: Take a given (unlabeled) corpus and tagset. Tagging is the task of assigning tags to tokens. •approaches: rule based (based on manually created rules) and statistical (based on machine learning) •annotating: The human labor of making the gold dataset: manually processing sentences and labeling them correctly. •gold data: Annotated corpus, usually annotated with serious effort and for a specific task. •silver data: Lower-quality or automatically labeled data. •evaluation: per-token/per-sentence/unknown word accuracy •NP (noun phrase) chunking/shallow parsing def: a task to find noun phrases: •refer to (not necessarily named) entities or things •correspond to grammatical roles (subject, object...) in the sentence •example: [He] reckons [the current account deficit] will narrow to [only £1.8 billion] in [September] . where [] marks an NP •example: [The little yellow dog] barked at [the cat] •NER (named entity recognition) def: a task to find/mark real-world entities in a given text/sentence (people, places, organizations, etc.) •[Jim] bought 200 shares of [Apple] in [2015]. (person, organization, time) •Naive ways and their limitations (non-unique POS tag, capital letters in NER) •The tag (label) of a word is not a function of the word itself. •words can have several part-of-speech tags •In English noun-verbs are common: work, talk, walk •Named entities are not always proper nouns •Nor do they always start with capital letters •Counterexamples: the States, von Neumann •Sentences start with capital letters, irrespective of whether the first word is a named entity •There is no comprehensive list of all named entities •Such a list will never be comprehensive, since a well-known thing can be mentioned in an unusual or abbreviated form •And the list would become obsolete very soon, since new movies, books, famous people and firms appear •Still, lists provide a good starting point for statistical models and the creation of silver standard corpora •Supervised learning def: the machine learning task of learning a function that maps an input to an output based on example input-output pairs •Sequential data: •window approach: the POS tag depends on a fixed window of the n previous tokens •problem with long-distance relationships: part of natural language, but not supported by a Markov model. A POS tag of a word could depend on a word that is out of range (farther than n away) •Simple POS-tagger: •assign the most frequently seen POS tag in a given context of words; default: NOUN
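A minimal sketch (not from the original notes) of the most-frequent-tag baseline mentioned above: count tags per word in a labeled corpus and fall back to NOUN for unseen words. The helper names train_baseline and tag are illustrative only.
    from collections import Counter, defaultdict

    def train_baseline(labeled_corpus):
        # labeled_corpus: list of (token, tag) pairs
        counts = defaultdict(Counter)
        for token, pos in labeled_corpus:
            counts[token][pos] += 1
        # keep the most frequent tag seen for each word
        return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}

    def tag(tokens, model, default="NOUN"):
        return [(tok, model.get(tok, default)) for tok in tokens]

    model = train_baseline([("the", "DET"), ("dog", "NOUN"), ("barked", "VERB"), ("the", "DET")])
    print(tag(["the", "cat", "barked"], model))
    # [('the', 'DET'), ('cat', 'NOUN'), ('barked', 'VERB')]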

Context managers

•type of managed resource •resource acquisition and release are automatically done •no need for manual resource management •example: memory •the with keyword opens a resource •keeps it open until the execution leaves with's scope •releases the resource regardless of whether an exception is raised or not •defining context managers •any class can be a context manager if it implements: •__enter__: runs at the beginning of the with. Returns the resource. •__exit__: runs after the with block. Releases the resource. •__exit__ takes 3 extra arguments that describe the exception: exc_type, exc_value, traceback
•example:
    class DummyContextManager:
        def __init__(self, value):
            self.value = value

        def __enter__(self):
            print("Dummy resource acquired")
            return self.value

        def __exit__(self, *args):
            print("Dummy resource released")

    with DummyContextManager(42) as d:
        print("Resource: {}".format(d))
output:
    Dummy resource acquired
    Resource: 42
    Dummy resource released
•another example (inspecting the exception in __exit__):
    class DummyContextManager:
        def __init__(self, value):
            self.value = value

        def __enter__(self):
            print("Dummy resource acquired")
            return self.value

        def __exit__(self, exc_type, exc_value, traceback):
            if exc_type is not None:
                print("{0} with value {1} caught\n"
                      "Traceback: {2}".format(exc_type, exc_value, traceback))
            print("Dummy resource released")

    with DummyContextManager(42) as d:
        print(d)
same output

Dependency Grammars

•Features; the dependency tree •"ordered" tree •No phrasal nodes •Difference from PSG •Captures direct relations between words in a sentence •No phrasal nodes •word-based instead of constituent-based •dependency grammars are more minimal (fewer nodes) •Parsing: •arc-factored parsing: train an ML system to score the edges of a dependency graph •for each new sentence, find the tree with the largest total score •features and objective •represent edges with features -> assign a weight to each feature -> score each parse tree by the total weight of its edges •learn a scoring function for edges that assigns the best total score to the most likely analysis •for each sentence in the training data, compare: •the top-scoring analysis A •with the gold analysis G •update weights to increase the relative likelihood of the gold analysis: •increase the weight of features in F(G) \ F(A) •decrease the weight of features in F(A) \ F(G) •example: Eisner algorithm •transition-based parsing: build graphs by adding one word at a time •train an ML system to predict the most probable next step for any intermediate configuration •based on a technique called shift-reduce parsing •Generally, the parser: •reads the sentence once, word by word, from left to right ⇒ O(n) •builds the parse incrementally, without backtracking •uses two data structures: •buffer: the list of words not yet read - starts with the whole sentence •stack: stores words that are not part of the parse yet - starts empty •at each step, chooses one of two actions: •shift: read the next word from the buffer and push it onto the stack •reduce: pop words from the stack and add the construct that covers them to the parse •shift-reduce parsers •arc-standard parsing: a shift-reduce parser •it has two reduce actions: •LeftArc: add an arc w1 → w2 and pop w2 •RightArc: add an arc w2 → w1 and pop w1 •(where w1 and w2 are the two topmost words on the stack) •An ML model is trained to predict the most likely next step for any configuration: this is known as the Guide •the Guide: a model responsible for selecting the most likely next step in the parsing process •trained on gold-standard trees •configurations are represented by features •word form, POS tag of the words on the stack and/or the last words on the buffer •dependency relations of the top word on the stack •combinations of the above •ML models used to implement guides: •those that can handle a large number of sparse features: support vector machines (SVMs) •deep learning methods: multi-layer perceptron (MLP) •pros: •Efficient: time complexity is linear in the number of words •Transparent: guides can be trained with informative features •cons: •may not be able to yield the best tree •risk of error propagation (consider book → me in the sentence Book me a flight) •Universal Dependencies •language-independent dependency relations •annotations consistent across 60+ languages •100+ treebanks publicly available •Shared tasks in multilingual dependency parsing (2017, 2018)
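A minimal sketch (not from the original notes) of the arc-standard transition system described above, with w1 = top of the stack and w2 = the word below it; the "guide" here is simply a hard-coded list of actions instead of a trained classifier.
    def parse(words, actions):
        stack, buffer, arcs = [], list(words), []
        for action in actions:
            if action == "shift":                 # push the next buffer word onto the stack
                stack.append(buffer.pop(0))
            elif action == "left_arc":            # add w1 -> w2 and pop w2
                w1, w2 = stack[-1], stack[-2]
                arcs.append((w1, w2))
                del stack[-2]
            elif action == "right_arc":           # add w2 -> w1 and pop w1
                w1, w2 = stack[-1], stack[-2]
                arcs.append((w2, w1))
                stack.pop()
        return arcs                               # list of (head, dependent) pairs

    # "the dog barked": dog -> the, barked -> dog
    print(parse(["the", "dog", "barked"],
                ["shift", "shift", "left_arc", "shift", "left_arc"]))
    # [('dog', 'the'), ('barked', 'dog')]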

Finite state morphology

•Finite state: •FSA: Finite state automaton •internal state: one from a finite set of possible states •input tape •reads the tape one symbol at a time and may move to a new state after each •FST: Finite state transducer •an FSA with an additional output tape •outputs a symbol after each state transition •mathematical description •The mathematical model of a finite state automaton is a 5-tuple •Q: the set of states •Σ: the input alphabet •q0 ∈ Q: the initial state •F ⊆ Q: the accepting states. The FSA accepts an input if it stops in one of these after reading the input •δ(q, w): the state transition function (δ ⊆ Q × Σ × Q) •the model of an FST is a 6-tuple, with these changes: •Γ: the output alphabet •δ(q, w, w'): the state transition function (δ ⊆ Q × Σ × Γ × Q) •The set of symbol sequences (Σ*) a machine accepts is its language (L). FSTs have input and output languages •Languages accepted by FSAs are regular languages. FSTs implement regular relations or functions •regular expressions also generate the regular languages •the two formalisms (FSA and regular expressions) are equivalent •some regular expression engines, e.g. grep, are implemented as FSAs •Regular languages are memory-less, and therefore very limited •the language a*b* is regular, but a^i b^i is not •these restrictions make FSAs/FSTs simple and fast •Operations on FSTs: inversion, composition, projection •inversion: the direction of the transduction can be changed by swapping the input and output tapes •composition: FSTs can be used in cascade: •Lex: Lexicon FSA •MT: Morphotactics FST •Orto: Orthography FST •Since they are relations (or functions), they can be composed into a single FST: Morph = Lex º MT º Orto •Caveat: the resulting transducer might be much larger than its parts •projection: the lower or upper language (and associated FSA) of an FST •example: lower: cat, cow, dog | upper: MEOW, MOO, WOOF •Morphology: •FSTs as morphological analyzers •The three components of a morphological analyzer can be implemented as FSAs / FSTs: •Lex: Lexicon FSA / FST, MT: Morphotactics FST, Orto: Orthography FST •These three can be used in cascade, or composed into a single morphological FST Morph = Lex º MT º Orto •Lower side: surface form (string of characters) •Upper side: morpheme structure (characters (e.g. lemma) + morphological features) •analysis / generation •Applied "up" (lower -> upper): analysis •Applied "down" (upper -> lower): generation •backtracking: allows FSTs to enumerate all candidates •lexc: used to define FSTs •Specifically designed for writing lexicons •Related tasks: •spell checking: a spell checker finds words that are not spelled correctly •The lower projection of a lexical transducer enumerates the existing word forms •lemmatization: the task of finding the lemma ("dictionary form") of a word •Same as morphological analysis without the morphological tags on the upper side •With FSTs: Morph º DelTags, where DelTags is an FST that deletes tags •tokenization: the task of dividing a string into its component tokens (words) •rule-based (FST) solution •A simple solution: make the (lower projection of the) analyzer circular: Morph^+ •how a morphological FST can be used for these tasks
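A minimal sketch (not from the original notes) of the FSA 5-tuple above, implemented as plain Python data; this particular machine accepts the regular language a*b*.
    states = {"q0", "q1"}
    alphabet = {"a", "b"}
    initial = "q0"
    accepting = {"q0", "q1"}
    # transition function delta as a dict: (state, symbol) -> next state
    delta = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}

    def accepts(word):
        state = initial
        for symbol in word:                  # read the input tape one symbol at a time
            if (state, symbol) not in delta:
                return False                 # no transition defined: reject
            state = delta[(state, symbol)]
        return state in accepting            # accept iff we stop in an accepting state

    print(accepts("aabbb"), accepts("aba"))  # True False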

Statistical machine translation

•Mathematical description of statistical MT: •learns a probabilistic model and tries to find the most probable translation •mathematically •given the foreign language sentence F = f(1), f(2), ..., f(m) •we are looking for the best English sentence Ê = e(1), e(2), ..., e(l) •Ê = argmax_E P(E|F) •the noisy channel model approach •a paradigm borrowed from information theory •two phases: •Training: learning the probabilistic model from training data •Decoding: using the trained system to translate a sentence •1. we are talking with someone in English •2. the channel between us is noisy, and everything comes out of it in Spanish •3. translation is the task of restoring the original signal •mathematically (using Bayes' theorem) •Ê = argmax_E P(F|E)P(E) •P(F|E) is the (backward) translation model •P(E) is the (English) language model •language and translation models; fidelity and fluency •language model: a probability distribution over a sequence of words •n-gram models (a 4-gram model predicts the 4th word based on the first three) •Recurrent Neural Networks (RNNs) encode the whole history in their state •translation model: based on the idea of word alignment •word alignment: a mapping between the words of the source and target sentences •can be 1 : n, n : 1 or n : n •the two models correspond to two important properties of a good translation •Fidelity: how faithful the translation is to the source language and content •Fluency: how natural the resulting sentence is in the target language •phrase-based: the unit of translation is the phrase •main steps: •1. Group the source words into I phrases •2. Translate the phrases into target language phrases one-by-one •3. Reorder the phrases •Very similar to direct translation (phrases instead of words) •Training and decoding (parallel corpora, decoding as search) •training: the process of learning the probabilistic model from training data •MT models are trained on parallel corpora (literature, movie subtitles, etc.) •The sentences in the parallel texts are aligned using sentence aligner programs •features used: sentence length, bilingual dictionary •no gold standard data •decoding: the act of using the trained system to actually translate a sentence •implemented as a search problem in the space of translations •the translation is built incrementally •translate a single word at a time •we execute the next step according to some strategy •Greedy search: executes the step with the highest trained probability •Beam search: keeps the most probable n paths and always continues the top one •Best-first search: like beam search, but also uses heuristics beside the trained probability •Evaluation: •no undisputed method of choice for evaluation •Manual evaluation (what and how): relies on the judgement of human raters •what: •Fidelity: •Adequacy: whether the translation contains the information in the source sentence •Informativeness: is the information in the translation enough to complete a task, e.g. answer certain questions about the text •Fluency: •Clarity, Naturalness, Style •how: •Ratings: rate aspects of the translation e.g. on a 5-point scale •Psycholinguistic tasks •The cloze task: some words are replaced by an underscore and raters must guess them. Correlates with fluency •Multiple-choice questions: good for evaluating informativeness •Edit cost: how much effort it takes to convert the MT output into a good translation •Overview of BLEU (modified precision, brevity penalty, micro average) •Bilingual Evaluation Understudy •Scores a sentence translation candidate based on several reference translations (ranges from 0 to 1) •modified precision: •usually the ratio of overlapping n-grams between candidate and references •modified: •clip the n-gram count at the maximum reference value •brevity penalty •incomplete translations pose another problem •BLEU penalizes candidates shorter than the reference •BP = 1 if c > r, else e^(1−r/c) (c and r are the lengths of the candidate and the reference) •micro average •the final score for the whole text is the micro average of the sentence scores •the average of averages
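A rough sketch (not from the original notes) of BLEU's two ingredients for a single candidate sentence: clipped (modified) n-gram precision and the brevity penalty. Real BLEU combines several n-gram orders with a geometric mean; this only shows one order.
    import math
    from collections import Counter

    def modified_precision(candidate, references, n=1):
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand = ngrams(candidate)
        max_ref = Counter()
        for ref in references:                      # clip counts at the maximum reference count
            for gram, cnt in ngrams(ref).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        return clipped / sum(cand.values())

    def brevity_penalty(c, r):
        return 1.0 if c > r else math.exp(1 - r / c)

    cand = "the the the cat".split()
    refs = ["the cat sat".split()]
    print(modified_precision(cand, refs))           # 0.5 ('the' clipped to 1, plus 'cat', out of 4)
    print(brevity_penalty(len(cand), len(refs[0]))) # 1.0 (candidate is not shorter than the reference)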

An HMM POS-tagger

•POS-tagging def: a task to label every word in a given text with the appropriate POS (part-of-speech) tag •The Viterbi algorithm (k=2,3) •an algorithm for finding the most likely sequence of hidden states •Two phases: forward and backward •forward: goes through the sentence from beginning to end and fills the probability and backpointer tables •backward: follows the backpointers to find the most probable state sequence •Backtracking •trace backwards through the backpointers to get the most likely path
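A minimal sketch (not from the original notes) of the Viterbi forward pass and backtracking for a bigram (k=2) HMM tagger; start_p, trans and emit are assumed to be precomputed probability dictionaries.
    def viterbi(words, states, start_p, trans, emit):
        # forward: fill the probability table V and the backpointer table
        V = [{s: start_p[s] * emit[s].get(words[0], 1e-12) for s in states}]
        backptr = [{}]
        for t in range(1, len(words)):
            V.append({})
            backptr.append({})
            for s in states:
                prev, p = max(((ps, V[t - 1][ps] * trans[ps][s]) for ps in states),
                              key=lambda x: x[1])
                V[t][s] = p * emit[s].get(words[t], 1e-12)
                backptr[t][s] = prev
        # backward: follow the backpointers from the best final state
        best = max(V[-1], key=V[-1].get)
        path = [best]
        for t in range(len(words) - 1, 0, -1):
            path.append(backptr[t][path[-1]])
        return list(reversed(path))      # most probable tag sequence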

Parsing and PCFG

•Parsing •Top-down parsing: start from the root (S) -> apply rules until only terminals remain •bottom-up parsing: start from the sentence -> apply rules backwards to reach a single tree (S) •Challenges: nondeterminism and ambiguity (global/local); examples •nondeterminism: more than one way to expand rules •build the tree using BFS, DFS, etc. •each node represents a partially completed parse tree •if the parser reaches a node that is not compatible with the sentence, it backtracks •ambiguity: a sentence may have more than one possible parse •example: One morning I shot an elephant in my pajamas (the PP can attach to the VP or the NP) •global: the whole sentence is ambiguous (multiple possible parses) •local: a constituent is ambiguous in itself, but not when part of the sentence (induces backtracking in the parser) •Dynamic programming; the CKY algorithm •a group of algorithms that: •break the problem into smaller subproblems •solve each subproblem only once •store the solution to (re)use it later •CKY algorithm •bottom-up parsing algorithm •requires the grammar to be in CNF •runs in O(n^3) time •description: •for a sentence of n words, CKY fills the upper triangle of an n^2 matrix •each cell corresponds to a number of consecutive words •cells record all possible constituents that cover those words •a parse is successful if the top right cell contains an S •PCFG (Probabilistic Context-Free Grammars) •Motivation •aims to solve the problem of ambiguity by: •assigning a probability to each production rule •the probability of a parse tree is the product of the rule probabilities •the "correct" parse is the most probable tree: argmax_i P(Tree_i | S) •mathematical formulation: P(Tree) = ∏(i=1 to n) P(rule_i) •Training and testing: treebanks and PARSEVAL •treebanks: A text corpus annotated by linguists (contains the syntax tree for each sentence) •The rule probabilities are obtained from the treebank (relative frequency of each rule) •example: Penn Treebank for English •PARSEVAL: the standard evaluation measure •measures how many of the constituents are correct •Correctness is given by the F1-score. The state of the art is 92-93% •Problems and solutions: •subcategorization (not supported) and lexicalization (solution) •lexicalization: annotate each constituent with its lexical head •Introduces the sparsity problem •Smoothing: interpolate with the raw probability •independence (problem) and annotation (solution) •a rule can be applied whenever its left-hand nonterminal is present, irrespective of context •annotation: annotate constituents with grammatical features, and propagate them up the tree (mildly context sensitive) •example: A sentence is grammatical iff the features agree (dog and was are both [Sg])
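A minimal sketch (not from the original notes) of CKY recognition for a grammar in CNF, represented as two dictionaries: binary rules {(B, C): {A, ...}} and lexical rules {word: {A, ...}}.
    def cky(words, binary, lexical):
        n = len(words)
        table = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, w in enumerate(words):                       # cells covering single words
            table[i][i + 1] = set(lexical.get(w, set()))
        for span in range(2, n + 1):                        # longer spans, bottom-up
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):                   # every split point
                    for B in table[i][k]:
                        for C in table[k][j]:
                            table[i][j] |= binary.get((B, C), set())
        return "S" in table[0][n]                           # success iff S covers the whole sentence

    binary = {("NP", "VP"): {"S"}, ("DET", "N"): {"NP"}}
    lexical = {"the": {"DET"}, "dog": {"N"}, "barked": {"VP"}}
    print(cky("the dog barked".split(), binary, lexical))   # True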

Relation Extraction, Sentiment Analysis

•Relation extraction (RE) •obtaining structured information from language data. •example: parsing CVs to build a database for recruitment •domain: •professional profile information based on CVs, Linkedin pages •product specifications based on e-commerce, manufacturers' websites •stock price information based on news articles •rule-based approach •handwritten rules for each relation •example: X dropped by Y points -> drop_by(X, Y) •example:X, CEO of Y -> ceo_of(X, Y) | X was born in Y -> born_in(X, Y) •pros: simple and effective, yields fast results, high precision •cons: low recall, limited, no capacity for generalization, real-life systems may contain thousands of templates, and require continuous development by experts, many companies still depend on such systems •supervised learning approach •Use parsed and annotated text to train text classifiers •features: headwords, word/POS ngrams with position information, NER/Chunk tags, Paths in parse trees among candidates •pros: Effective if large training sample is available and target texts are similar •cons: requires a fair amount of annotated data (costly to produce), doesn't generalize well across genres, domains •semi-supervised methods •a few annotated examples -> some patterns -> more examples -> more patterns •a few patterns -> some examples -> more patterns -> more examples •example: author(William_Shakespeare, Hamlet) •extracted patterns from found instances: •X's Y, the X play Y, Y by X, Y is a tragedy written by X, etc •Sentiment analysis (SA) •detecting attitudes and opinions in text •example: user reviews of movies, products, etc •task definition, examples •supervised learning approach •Use training data, extract standard features: •bag-of-words •ngrams •emoticons •numbers, dates •Use these to train standard classifiers •Naive Bayes, SVM, MaxEnt, etc.
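A minimal sketch (not from the original notes) of the rule-based relation extraction idea above: one handwritten pattern per relation, applied with regular expressions. The patterns and the example sentence are illustrative only.
    import re

    patterns = {
        "ceo_of": re.compile(r"(\w+), CEO of (\w+)"),
        "born_in": re.compile(r"(\w+) was born in (\w+)"),
    }

    def extract(text):
        found = []
        for relation, pattern in patterns.items():
            for x, y in pattern.findall(text):
                found.append((relation, x, y))
        return found

    print(extract("Smith, CEO of Acme, said that Jones was born in Boston."))
    # [('ceo_of', 'Smith', 'Acme'), ('born_in', 'Jones', 'Boston')]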

Word meaning

•Representations: distributional and ontology-based approaches, pros and cons of each. •distributional approach •Two words are similar in meaning if they appear in similar contexts •represent words using real-valued vectors •the Euclidean distance between vectors reflects the similarity of contexts (similar contexts -> small distance) •pros: •Robust, can be constructed from large unannotated data, which is available nowadays •Has proven useful in virtually all NLP tasks •cons: •non-transparent representation: we cannot truly understand the meaning of a representation (e.g. the meaning of dimension i) •cannot handle rare words - and a large part of any data is rare words! •Ontology-based approaches •A single ontology is used as a global reference model in the system •two words are similar in meaning if they are synonyms (belong to the same synset) •pros: •can combine data from multiple sources •cons: •the effectiveness of ontology-based data integration is closely tied to the consistency and expressivity of the ontology •Simple word relationships: •synonymy: pairs of words that mean roughly the same thing (synonyms) •hypernymy: a word is a hypernym of another if it is a broader or more general concept of which the other is a special case •example: rectangle is the hypernym of square, square is a hyponym of rectangle •Common lexical resources: •WordNet: widely used lexical database •groups words into sets of synonyms (synsets) •models semantic relationships among them •available for dozens of languages •FrameNet: a resource based on Frame Semantics •Frames are script-like structures that represent a situation, event or object, and list its typical participants or props, which are called event roles •Measuring word similarity: approaches and issues •Distributional approaches •cosine similarity of word vectors is expected to be proportional to semantic similarity •Ontology-based approaches •Distance between words in lexical graphs such as WordNet is also used as a source of semantic similarity •Issues •Not a precise definition •Datasets are created based on the human intuition of hundreds of annotators •Not as reliable with less frequent words
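A minimal sketch (not from the original notes) of the distributional similarity measure above: cosine similarity between word vectors. The vectors here are made up for illustration; in practice they come from a trained embedding model.
    import numpy as np

    def cosine(u, v):
        # cosine similarity: dot product normalized by the vector lengths
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    dog = np.array([0.8, 0.1, 0.5])
    cat = np.array([0.7, 0.2, 0.6])
    car = np.array([0.1, 0.9, 0.0])
    print(cosine(dog, cat), cosine(dog, car))   # dog is closer to cat than to car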

Syntax: theory

•Syntax (grammar) def: the set of rules that govern the structure of sentences, i.e. how words combine into phrases and sentences •Syntax concerns itself with grammaticality (whether something is grammatical or not); meaning is irrelevant •POS: Part of speech (open & closed) •a category of words with similar grammatical distribution •example: noun, verb, adjective, adverb, preposition, determiner, etc. •open: new words can be added •closed: membership is fixed (prepositions and determiners) •grammatical relations (predicate, arguments, adjuncts) •predicate: the center of a sentence (always a verb in English) •arguments: expressions that complete a predicate's meaning (subject and object in English) •adjuncts: phrases that give extra information, but are optional •subcategorization (valency, transitivity) •valency: the property of verbs that specifies (limits) the number and types of their arguments •transitivity: •intransitive: no object (Ryan sleeps) •transitive: only a direct object (Ryan loves me) •ditransitive: direct and indirect objects (Ryan gave Didi a flower) •constituency (tests) •constituent/phrase: a group of words that acts as a unit and fulfills one of the grammatical roles in a sentence •major constituents: noun phrase, verb phrase, prepositional phrase •tests to see if a group of words is a constituent •substitution test: noun phrases •topicalization test: prepositional phrases can be moved to the front •Phrase-structure grammars (PSGs): •Components: production rules, terminals, nonterminals •production rules: rewrite rules that expand a nonterminal into a sequence of symbols •terminals: part of the generated language (Ryan, to) •nonterminals: represent word classes or constituents (NP, Nominal) •rules can generate sentences •the language defined by the grammar is the set of sentences it generates •Derivation: the sequence of rule applications •start with S -> pick a nonterminal -> replace it using a rule -> continue until only terminals remain •the parse tree: provides a graphical view of the derivation process •inverted tree, inner nodes = nonterminals, leaves = terminals, edges = rules •Chomsky Normal Form: a grammar where every rule takes one of the two forms: A -> B C or A -> a •produces binary parse trees •every PSG can be converted into CNF •Converting to CNF •production rules that do not conform fall into two groups •right-hand side too long and unit productions •right-hand side too long: split into two by introducing a new nonterminal (recursively) •unit productions: deleted; their right-hand side is pulled up into all rules that contain the left-hand nonterminal on their right
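A minimal sketch (not from the original notes) of the first CNF conversion step above: splitting an overlong right-hand side by introducing fresh nonterminals. The naming scheme (VP_1, VP_2, ...) is illustrative; unit productions are not handled here.
    def binarize(lhs, rhs):
        rules, orig = [], lhs
        while len(rhs) > 2:
            new_nt = "{}_{}".format(orig, len(rules) + 1)   # fresh nonterminal
            rules.append((lhs, [rhs[0], new_nt]))           # keep the first symbol, defer the rest
            lhs, rhs = new_nt, rhs[1:]
        rules.append((lhs, rhs))
        return rules

    print(binarize("VP", ["V", "NP", "PP", "ADVP"]))
    # [('VP', ['V', 'VP_1']), ('VP_1', ['NP', 'VP_2']), ('VP_2', ['PP', 'ADVP'])]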

Classical machine translation

•Translation divergences: systematic, idiosyncratic and lexical divergences; examples •cases where a literal translation goes wrong because the source and target languages use different linguistic structures •systematic: can be modeled generally for many languages •Morphological differences •Syntax: •Sentence word order: •SVO (Subject-Verb-Object): English •SOV: Japanese •Argument marking: •Head-marking: Hungarian •Dependent-marking: English (the man-'s house) •Marking direction of motion: •Verb-framed: Japanese •Satellite-framed: Germanic languages (The bottle floated out) •idiosyncratic: arbitrary and must be dealt with one by one •range from syntactic to purely punctuational •Noun phrase order: the green witch vs la bruja verde •POS subdivisions: Japanese has two types of adjectives •Date and time: MM/DD/YEAR vs DD/MM/YEAR •lexical: languages make different distinctions between concepts •ex: brother vs younger brother (Hungarian, Japanese) •The classical model •Translation steps and approaches; the Vauquois triangle •1. Analysis: parse the input sentence into some representation •2. Transformation: between source- and target-language representations •3. Generation: target-language text from target-language structures •approaches: •Direct translation: •word-by-word translation using a bilingual dictionary •nominal analysis and generation steps •Transfer: •the input is parsed at some level(s) •and transformation rules convert source-language parses to target-language parses •Interlingua: •the input is analyzed into a language-agnostic abstract meaning representation •no transformation step •The Vauquois triangle helps visualize these approaches •Direct translation, transfer, interlingua. Advantages and disadvantages of each •Direct translation •Steps •1. Morphological analysis •2. Lexical transfer (word-by-word) using a dictionary •3. Word reordering •4. Morphological generation •Pros: no words will be missing from the translation •Cons: •Words' meanings depend on context •A single word usually has several translations •Even more confusion with homonymous and polysemous words •cannot reorder phrases •transfer •Levels •Syntactic transfer: maps between source- and target-language tree or dependency structures •Semantic transfer: handles the systematic differences between the grammars of the two languages (subcategorization, lexicalization) •Pros: •can handle phrases •does not need all concepts to be present •Cons: •for n languages, needs on the order of n^2 systems •interlingua •uses a language-agnostic abstract meaning representation (AMR) •1. Analyze the input into the AMR using a source language pipeline •2. No transfer is involved •3. Generate the output from the AMR using target language tools •Pros: •No transfer stage is involved: no need for bilingual dictionaries •needs one resource chain per language •for n languages, n systems •Cons: •Defining an AMR is very difficult •Full conceptual analysis / generation is hard •All concepts from all languages need to be present (younger brother) •Problems with the classical model •It needs translation-specific resources (transfer rules, AMR) •Creating these requires a huge amount of work •No guidance on how to choose between available rules

Introduction

•What is Jupyter? •formerly known as IPython Notebook, a web application that allows you to create and share documents with live code, equations, visualizations etc. •Jupyter notebooks are JSON files with the extension .ipynb •can be converted to HTML, PDF, LaTeX etc. •can render images, tables, graphs, LaTeX equations •large number of extensions •content is organized into cells •cell types •code cell: Python/R/Lua/etc. code •raw cell: raw text •markdown cell: formatted text using Markdown •cell magic •Special commands can modify a single cell's behavior •example: %%time or %%timeit or %%writefile •%lsmagic gives the complete list of magic commands •kernel •each notebook is run by its own kernel (Python interpreter) •can be interrupted or restarted through the Kernel menu •all cells share a single namespace •cells can be run in arbitrary order; the execution count is helpful •short history of Python •Python started as a hobby project of Dutch programmer Guido van Rossum in 1989 •Python 1.0 in 1994 •Python 2.0 in 2000 •cycle-detecting garbage collector •Unicode support •Python 3.0 in 2008 •backward incompatible •Python 2 End-of-Life (EOL) date was postponed from 2015 to 2020 •Python community and development •Python Software Foundation: nonprofit organization based in Delaware, US •managed through PEPs (Python Enhancement Proposals) •strong community inclusion •large standard library •very large third-party module repository called PyPI (Python Package Index) •pip installer •Pythonista/Pythonist: a good Python programmer

Packaging

•Why is it good to package your code? •allows you to share your code with other people •makes it easy for other developers to download and install your package •professional packages also include LICENSE, MANIFEST.in, setup.cfg and README.rst •example layout:
    example_package/
        example_package/
            __init__.py    # may be empty
        setup.py           # describes how the package should be installed
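A minimal setup.py sketch (not from the original notes) for the layout above; the metadata values are placeholders. With this in place the package can be installed with pip install .
    from setuptools import setup, find_packages

    setup(
        name="example_package",
        version="0.1.0",
        description="Short description of the package",
        packages=find_packages(),        # finds example_package/ via its __init__.py
        install_requires=[],             # list runtime dependencies here
    )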

Alignment in machine translation

•Word alignment; examples and arity •a mapping between the words of the source and target sentences •example: [Mary] {did not} [slap] {the} [green] {witch} -> [Maria] {no} [dió una bofetada] {a la} {bruja} [verde] •arity: can be 1 : n, n : 1 or n : n (phrasal alignment) •The NULL word can model the appearance of spurious words in the output •Phrase acquisition for phrase-based MT •the unit of translation is the phrase •1. Group the source words into I phrases: E = ē_1, ē_2, ..., ē_I •2. Translate the phrases into the target language phrases one-by-one: each ē to an f̄ •3. Reorder the phrases •Very similar to direct translation, only with phrases instead of words •Two important subtasks during training: •Finding the phrases ē, f̄ using word alignments •Building a phrase translation table •IBM Model 1: algorithm and mathematical formulation •IBM Model 1 only handles 1 : n alignments •uses: •word alignment directly as the translation model •P(F | E) = ∑_A P(F, A | E) •E is the source sentence, F is the translated sentence •only word-word translation probabilities: P(f_i | e_j) •the probability that the jth English word translates to the ith foreign word •the only trainable part of the model is the translation table •The alignment probabilities are assumed to be uniform •steps in detail •Choose a length K for the Spanish sentence: F = f_1, f_2, ..., f_K •K is chosen with the small constant probability ε •Choose a 1 : n alignment between E and F: A = a_1, a_2, ..., a_K •P(A | E) = ε / (J + 1)^K •Fill each position k in F with the translation of the English word that is aligned to it •Given A, let •e_a(k) be the English word that is aligned to f_k •P(f_x | e_y) is the probability of translating e_y to f_x •probability of the translated sentence: •P(F | E, A) = ∏(k=1 to K) P(f_k | e_a(k)) •Put together: •P(F, A | E) = P(F | E, A) × P(A | E) = ε / (J + 1)^K × ∏(k=1 to K) P(f_k | e_a(k)) •to get the probability of the translation, we sum over all alignments •P(F | E) = ∑_A ε / (J + 1)^K × ∏(k=1 to K) P(f_k | e_a(k))
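A minimal sketch (not from the original notes) of the IBM Model 1 probability above. It relies on the standard observation that the sum over all 1 : n alignments factorizes into a product of per-position sums; the translation table t and its values are made up for illustration.
    def ibm1_prob(F, E, t, epsilon=1.0):
        E = [None] + E                               # position 0 is the NULL word
        prob = epsilon / len(E) ** len(F)            # epsilon / (J + 1)^K
        for f in F:
            # sum of t(f | e) over all English positions the foreign word could align to
            prob *= sum(t.get((f, e), 0.0) for e in E)
        return prob

    t = {("la", "the"): 0.7, ("casa", "house"): 0.8, ("casa", "the"): 0.05}
    print(ibm1_prob(["la", "casa"], ["the", "house"], t))   # ~0.066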

Object oriented programming II.

•assignment •the assignment operator (=) cannot be overridden •it performs reference binding instead of copying •tightly bound to the garbage collector •There are 3 types of assignment and copying: •1. the assignment operator (=) creates a new reference to the same object •2. copy performs a shallow copy •3. deepcopy recursively deepcopies everything •shallow copy •constructs a new compound object and then (to the extent possible) inserts references into it to the objects found in the original •deep copy •constructs a new compound object and then, recursively, inserts copies into it of the objects found in the original •object introspection •the ability to determine the type of an object at runtime •Python supports full object introspection •class decorators: most common features •static methods •class methods •property •static methods •defined inside a class but not bound to an instance (no self parameter) •class methods •bound to the class instead of an instance of the class •the first argument is the class itself (cls by convention) •typical usage: factory methods for the class •properties •attributes with getters, setters and deleters •property works as both a built-in function and as separate decorators
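A short sketch (not from the original notes) of the three class decorators listed above; the Temperature class is an illustrative example only.
    class Temperature:
        def __init__(self, celsius):
            self._celsius = celsius

        @staticmethod
        def about():                      # not bound to an instance, no self
            return "Stores a temperature"

        @classmethod
        def from_fahrenheit(cls, f):      # factory method; receives the class as cls
            return cls((f - 32) * 5 / 9)

        @property
        def celsius(self):                # attribute with a getter...
            return self._celsius

        @celsius.setter
        def celsius(self, value):         # ...and a setter
            self._celsius = value

    t = Temperature.from_fahrenheit(212)
    print(Temperature.about(), t.celsius)   # Stores a temperature 100.0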

Strings

•character encodings: Unicode vs. UTF-8, encoding, decoding •Unicode •strings are immutable sequences of Unicode code points •provides a mapping from characters to code points (numbers) •every character has a unique code point •actual text needs to be stored as a byte array/sequence (byte strings) •character encoding: the code point - byte array correspondence •encoding: Unicode code points -> byte sequence •decoding: byte sequence -> Unicode code points •UTF-8 •a Unicode byte encoding •the most popular encoding •variable number of bytes per character •example: ASCII/English characters take one byte, other characters may take up to four bytes •continuation bytes are flagged so that the decoder knows they belong to the same character •extra: Python 2 vs. Python 3 encodings •Python 3 automatically encodes and decodes files •Python 2 does not (must specify) •.encode and .decode •a Python 2 string is a byte string •a Python 3 string is a Unicode string •common string operations •lower, upper, title •most sequence operations are supported •strip, rstrip, lstrip •split, join •startswith, endswith •istitle, isspace, isdigit •string formatting (at least two solutions) •str.format •"My name is {0} and I'm {1} years old. I turned {1} last December".format(name, age) •"My name is {} and I'm {} years old.".format(name, age) •% operator •"My name is %s and I'm %d years old" % (name, age) •string interpolation/f-strings •f"My name is {name} and I'm {age} years old {age2}"
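A short sketch (not from the original notes) of encoding and decoding between Unicode strings and byte strings in Python 3; the example word is arbitrary.
    text = "árvíztűrő"                       # str: a sequence of Unicode code points
    data = text.encode("utf-8")              # encode: code points -> byte sequence
    print(type(data), len(data), len(text))  # <class 'bytes'> 13 9  (multi-byte characters)
    print(data.decode("utf-8") == text)      # decode: byte sequence -> code points; True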

Object oriented programming I.

•data attributes and methods: •created upon assignment •can be assigned anywhere (not just in __init__) •attributes: •attributes can be added to instances (will not affect other instances) •Method attributes: •functions inside the class definition •explicitly take the instance as the first parameter (self) •class attributes •class-global attributes •class attributes can be accessed via instances •and via the class object •they can only be set via the class object •assigning via an instance does not change the class attribute (it creates an instance attribute instead) •inheritance •class NewStyleClass(object) •implicitly subclasses object •Method inheritance •inherited and overridden in the usual way •data attributes are only inherited if the code in the base class' method is called •since __init__ is not a constructor, the base class' __init__ is not called automatically if the subclass overrides it •Multiple inheritance •the method resolution order (MRO) defines the way methods are inherited •since every class subclasses object, the diamond problem is present •super •allows the base class's methods to be called •duck typing •an object's suitability is determined by the presence of certain methods and properties (with appropriate meaning), rather than the actual type of the object •allows polymorphism without abstract base classes •magic methods •mechanism to implement advanced OO features •dunder methods •operator overloading •operators are mapped to magic functions •defining these functions defines/overrides operators •ex: abs, eq, gt, lt, add, len
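A short sketch (not from the original notes) of inheritance, calling the base class' __init__ via super(), overriding and a magic (dunder) method; the Animal/Dog classes are illustrative only.
    class Animal(object):
        def __init__(self, name):
            self.name = name

        def speak(self):
            return "..."

    class Dog(Animal):
        def __init__(self, name, breed):
            super().__init__(name)        # call the base class' __init__ explicitly
            self.breed = breed

        def speak(self):                  # override the inherited method
            return "Woof"

        def __eq__(self, other):          # operator overloading: == maps to __eq__
            return isinstance(other, Dog) and self.name == other.name

    print(Dog("Rex", "puli").speak(), Dog("Rex", "puli") == Dog("Rex", "vizsla"))  # Woof True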

Decorators

•decorator def: a decorator is a function that takes a function as an argument and returns a wrapped version of the function •decorator syntax is just syntactic sugar (shorthand) for: func = decorator(func)
•example:
    @add_noise
    def informal_greeter():
        print("Yo")
•@wraps decorator •solution to function metadata being lost
•example:
    from functools import wraps

    def add_noise(func):
        @wraps(func)
        def wrapped_with_noise():
            print("Calling {}...".format(func.__name__))
            func()
            print("{} finished.".format(func.__name__))
        wrapped_with_noise.__name__ = func.__name__
        return wrapped_with_noise

    @add_noise
    def greeter():
        """function that says hello"""
        print("Hello")

    print(greeter.__name__)
    print(greeter.__doc__)
•decorators with parameters •they have to return a decorator without parameters - a decorator factory
•example:
    def decorator_with_param(param1, param2=None):
        print("Creating a new decorator: {0}, {1}".format(param1, param2))

        def actual_decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                print("Wrapper function {}".format(func.__name__))
                print("Params: {0}, {1}".format(param1, param2))
                return func(*args, **kwargs)
            return wrapper
        return actual_decorator

    @decorator_with_param(42, "abc")
    def personal_greeter(name):
        print("Hello {}".format(name))

    @decorator_with_param(4)
    def personal_greeter2(name):
        print("Hello {}".format(name))

    print("\nCalling personal_greeter")
    personal_greeter("Mary")
output:
    Creating a new decorator: 42, abc
    Creating a new decorator: 4, None

    Calling personal_greeter
    Wrapper function personal_greeter
    Params: 42, abc
    Hello Mary
•extra: classes as decorators •__call__ implements the wrapped function
•example:
    class MyDecorator(object):
        def __init__(self, func):
            self.func_to_wrap = func
            wraps(func)(self)

        def __call__(self, *args, **kwargs):
            print("before func {}".format(self.func_to_wrap.__name__))
            res = self.func_to_wrap(*args, **kwargs)
            print("after func {}".format(self.func_to_wrap.__name__))
            return res

    @MyDecorator
    def foo():
        print("bar")

    foo()
output:
    before func foo
    bar
    after func foo

Morphology: theory

•def: the study of how words are formed in terms of minimal meaning-bearing units •morphemes: minimal meaning-bearing units •tags/features: signify the syntactic role of the morpheme •analysis: parses a surface form into morphological features •generation: given a list of features, reproduce the surface form •Why do we need morphology: tasks •spell checking •lemmatization •information retrieval (IR) •linguistic tasks: syntax and semantics •Phenomena: •free and bound morphemes •Free morphemes can stand alone in the sentence (quick, brown, fox) •Bound morphemes only occur attached to a stem (est in smallest) •affix placement, affix types •types: derivational and inflectional •derivational: change the part-of-speech or meaning of the word •example: [un]prepared[ness]; computer[ize] •inflectional: fill syntactic functions •example: look[ed] ([Pst]); sea[s] ([Pl]) •placement: prefix, suffix, circumfix, infix •prefix example: [extra]terrestrial, [dis]appear •suffix example: listen[ing], short[est] •circumfix example: [ge]hab[t] (have -> had) •infix example: abso-[f***ing]-lutely •compounding, clitics •compounding: words (usually nouns) with more than one stem •examples: football, blackboard, redhead •clitics: words reduced in form that are phonologically dependent on another word •examples: we'll, it's •non-concatenative morphology: grammatical / semantic changes are reflected by modifications to the stem rather than affixation •reduplication: the stem (or part of it) is repeated •examples: bye-bye, night-night, yamayama (mountains) •templates: a consonant template is filled with vowels •examples: Arabic: k-t-b -> kitāb: book, kutub: books, kātib: writer •ablaut: stem sounds change to convey grammatical information •examples: foot -> feet, run -> ran •language types: isolating, analytic, and synthetic •Isolating languages: no inflection to speak of (Chinese) •Analytic languages: some inflection, but prefer word order and adpositions (English) •Synthetic languages: inflected, high morpheme-per-word ratio •Fusional: use a single suffix that encodes several grammatical functions (Romance and Slavic languages) •Agglutinative: highly concatenative, separate "slots" for different grammatical functions (Japanese) •Polysynthetic: "sentence-words" (Inuit languages) •Computational morphology: •the view that morphemes and their intra-word interactions can be formalized as rules •components: lexicon, morphotactics, orthography •lexicon: lists the morphemes with basic information (e.g. part of speech) •morphotactics: specifies morpheme ordering and their interactions, usually on an abstract level (e.g. sheep doesn't take the regular plural) •orthography: models how the morphemes are mapped to letters / sounds (run + ing -> running) •analysis: parses a surface form into morphological features •generation: given a list of features, reproduce the surface form

List comprehension

•example: l = [2*i+1 for i in range(10)] == [1, 3, 5, 7, 9, 11, 13, 15, 17, 19] •general form: [<expression> for <element> in <sequence>] •with conditional: [<expression> for <element> in <sequence> if <condition>] •if-else example: [int(n / abs(n)) if n != 0 else 0 for n in l] •traversing 2 lists: [(i, j) for i in l1 for j in l2] •nested list comp: [[e*e for e in row] for row in matrix] •generator expressions •a generalization of list comprehensions; their type is generator •example: even_numbers = (2*n for n in range(10)) # first 10 even numbers •a generator expression produces one item at a time •use a for loop to print it •the syntax of a generator expression is similar to that of a list comprehension •the square brackets are replaced with round parentheses •calling next() on an exhausted generator raises a StopIteration exception •a generator expression is much more memory efficient than an equivalent list comprehension •extra: iteration protocol, writing an iterator class (see the sketch below) •A class satisfies the iteration protocol if: •it has an __iter__ function that returns an iterator, which •has a __next__ function (this function is called next in Python 2), •raises a StopIteration after a certain number of iterations •set and dict comprehension •Sets and dictionaries can be instantiated via generator expressions too •A generator expression between curly brackets instantiates a set •if the expression in the generator is a key-value pair separated by a colon, it instantiates a dictionary •set example: fruit_list = ["apple", "plum", "apple", "pear"]; fruits = {fruit.title() for fruit in fruit_list} •dict example: word_list = ["apple", "plum", "pear", "apple", "apple"]; word_length = {word: len(word) for word in word_list} •yield keyword •if a function uses yield instead of return, it becomes a generator function •yield pauses the function, saving all its state, and later continues from there on successive calls •yield temporarily gives back the execution to the caller •then the generator function continues
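A minimal sketch (not from the original notes) of the iteration protocol above: an iterator class for the first n square numbers.
    class Squares:
        def __init__(self, n):
            self.n, self.i = n, 0

        def __iter__(self):               # returns an iterator (here: the object itself)
            return self

        def __next__(self):               # called next() in Python 2
            if self.i >= self.n:
                raise StopIteration       # signals that the iteration is over
            self.i += 1
            return self.i ** 2

    print(list(Squares(4)))               # [1, 4, 9, 16]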

Functions and generators

•functions: •defined using the def keyword •positional, named, or keyword arguments •keyword arguments must follow positional arguments •arguments can have default values •default arguments must follow non-default arguments •Default arguments need not be specified when calling the function •If more than one value has default arguments, either can be skipped •return statement •functions may return more than one value •a tuple of the values is returned •without an explicit return statement None is returned •an empty return statement returns None •args, kwargs •both positional and keyword arguments can be captured in arbitrary numbers using the * and ** operators •positional arguments are captured in a tuple •keyword arguments are captured in a dictionary •we usually capture both •default arguments •be careful with mutable default arguments •best not to use mutable defaults •if argument not specified the default argument will be applied •lambda functions •unnamed functions •may take parameters •can access local scope •can use any callable as the key •example: sorted(l, key=lambda x: abs(x)+y) or sorted(l, key=abs) •generators, yield statement •a generator is a function that returns an object (iterator) which we can iterate over (one value at a time) •simple way of creating iterators •all the overhead are automatically handled by generators •a normal function with yield statement instead of a return statement •while a return statement terminates a function entirely, yield statement pauses the function saving all its states and later continues from there on successive calls
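A short sketch (not from the original notes) of a generator function: yield pauses the function and hands execution back to the caller, and the function resumes on the next call.
    def countdown(n):
        while n > 0:
            yield n                       # execution is handed back to the caller here
            n -= 1

    gen = countdown(3)
    print(next(gen))                      # 3
    print(list(gen))                      # [2, 1]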

Exception handling

•keywords: •try •except •can have more than one except clause, from most to least specific •More than one type of exception can be handled in the same except clause •without specifying a type, except catches everything, but all information about the exception is lost •else •try-except blocks may have an else clause that only runs if no exception was raised •finally •the finally block is guaranteed to run regardless of whether an exception was raised or not •raise •raise throws/raises an exception •defining exception classes •any type that subclasses Exception (BaseException to be exact) can be used as an exception object •ValueError •TypeError •except Exception as e: type(e) gives you the type of the exception
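A short sketch (not from the original notes) of try/except/else/finally with a user-defined exception class; NegativeError and checked_sqrt are illustrative names.
    class NegativeError(Exception):       # any subclass of Exception can be raised
        pass

    def checked_sqrt(x):
        try:
            if x < 0:
                raise NegativeError("x was {}".format(x))
            root = x ** 0.5
        except NegativeError as e:
            print("caught:", type(e).__name__, e)
            return None
        else:                             # runs only if no exception was raised
            return root
        finally:                          # runs in both cases
            print("done")

    print(checked_sqrt(9), checked_sqrt(-1))   # 3.0 None (after the caught/done messages)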

Evaluation

•labeled corpus def: a list of (token, tag) pairs, where the tags are from a given tagset •train/test set •best to use a gold standard corpus •the dataset is divided into a train set (typically the larger part, e.g. 70-80%) and a test set (the rest) •use the train set to extract features and fit a model •once the model is trained, the model is used on the test set •The predicted and gold labels are compared to each other •The performance of a tagger can be measured several ways •word, sentence, out-of-vocabulary word accuracy •seen/unseen data •similar to train/test set •the train set will contain the seen data •the test set may contain seen and/or unseen data •word accuracy •per-token accuracy •# correct labels / # words •sentence accuracy •# sentences with all correct labels / # sentences •Binary classification metrics (TP/TN/FP/FN table, precision/recall, F-score) •TP = true positive •TN = true negative •FP = false positive •FN = false negative •TP/TN/FP/FN table (confusion matrix) •ideally only the diagonal contains non-zero counts •precision •fraction of relevant instances among the retrieved instances •TP / (TP + FP) •ranges from 0 to 1 (worst to best) •1 means every item labeled as a particular class is actually in that class •does not say anything about the number of items in that class that were incorrectly labeled (recall) •recall •the fraction of relevant instances that have been retrieved over the total amount of relevant instances •TP / (TP + FN) •ranges from 0 to 1 (worst to best) •1 means every item in a class was labeled as that class •does not say anything about the number of incorrect labels assigned to that class (precision) •F-score •a measure of a test's accuracy •considers both the precision and the recall •2 * precision * recall / (precision + recall) •ranges from 0 to 1 (worst to best)
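A short sketch (not from the original notes) of the binary classification metrics above, computed from gold and predicted label sequences; the label names are illustrative.
    def evaluate(gold, pred, positive="NP"):
        tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
        fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
        fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        fscore = 2 * precision * recall / (precision + recall)
        return precision, recall, fscore

    gold = ["NP", "O", "NP", "NP", "O"]
    pred = ["NP", "NP", "NP", "O", "O"]
    print(evaluate(gold, pred))           # (0.666..., 0.666..., 0.666...)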

Sequence types

•list vs. tuple •list is a mutable sequence type •tuple is an immutable sequence type •tuples contain immutable references, however, the referenced objects may be mutable •neither requires its elements to be of the same type •both can be indexed the same way •operators •in •not in •addition •multiplication •s[i] or s[i:j] or s[i:j:k] •len •min or max •s.index(x[, i[, j]]) # index of the first occurrence of x in s (at or after index i and before index j) •s.count(x) # total number of occurrences of x in s •advanced indexing •example: l[start:end:step] •any of them can be left blank •indices can be negative •example: l[-4:] takes the last four elements (from the 4th-to-last element to the end); l[-4] is the single 4th-to-last element •reverse a list: l[::-1] •extra: time complexity of basic operations (lookup, insert, remove, append etc.) •list •lookup: O(1) •insert: O(n) •remove: O(n) •append: O(1) •dict •lookup: O(1) •insert: O(1) •remove: O(1) •all worst cases are O(n) •set operations •implemented as: methods or overloaded operators •& | - •return new sets •issubset and issuperset (< >)

numpy

•main object (ndarray) •shape: tuple of the array's dimensions •gets the current shape of an array, but may also be used to reshape the array in-place by assigning a tuple of array dimensions to it •dtype: type of the elements •dimension: len(shape) •elementwise operations (exp, sin, + - * /, pow or **) •indexing •reshape •The shape of an array can be modified with reshape, as long as the number of elements remains the same •matrix operations: •dot: standard matrix product •transpose •reductions: •sum along a given axis •broadcasting •One can calculate with arrays of different shapes if their shapes satisfy certain requirements •denote the broadcast dimensions with a None index •(2,) + (3, 2, 2) -> (None, None, 2) + (3, 2, 2) = (3, 2, 2) •example: •row-wise normalization •new_matrix = a / row_sums[:, numpy.newaxis] •one-liner for a complex grid: •np.arange(5)[:, None] + 1j * np.arange(5)[None, :] •some matrix constructors •ones: array filled with ones •zeros: array filled with zeros •eye: identity matrix, only 2D •diag: construct or extract a diagonal
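A short runnable sketch (not from the original notes) of the row-wise normalization example above, showing reshape, a reduction along an axis and broadcasting.
    import numpy as np

    a = np.arange(6).reshape(3, 2).astype(float)   # shape (3, 2)
    row_sums = a.sum(axis=1)                       # reduction along axis 1 -> shape (3,)
    normalized = a / row_sums[:, np.newaxis]       # (3, 2) / (3, 1): broadcasts over columns
    print(normalized.sum(axis=1))                  # [1. 1. 1.]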

Functional Python

•map •applies a function to each element of a sequence •example: list(map(lambda x: x * 2, [2, 3, "abc"])) == [4, 6, 'abcabc'] •filter •creates a list of elements for which a function returns true •example: list(filter(lambda x: x % 2 == 0, range(8))) == [0, 2, 4, 6] •reduce •applies a rolling computation to a sequence •the first argument of reduce is a two-argument function •the second argument is the sequence •an initial value for the accumulator may be supplied •example: l = [1, 2, -1, 4]; reduce(lambda x, y: x*y, l) == -8 •example: reduce(lambda x, y: x*y, l, 10) == -80
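A short runnable version of the examples above (not from the original notes); note that in Python 3 reduce lives in the functools module.
    from functools import reduce

    l = [1, 2, -1, 4]
    print(reduce(lambda x, y: x * y, l))                   # -8
    print(reduce(lambda x, y: x * y, l, 10))               # -80
    print(list(map(lambda x: x * 2, [2, 3, "abc"])))       # [4, 6, 'abcabc']
    print(list(filter(lambda x: x % 2 == 0, range(8))))    # [0, 2, 4, 6]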

Type system

•static vs. dynamic typing •static: type checking is performed at compile time •variables' types are static •once you set a variable to a type, you cannot change it •dynamic: type checking is performed at run time •no need to declare variables •the = operator binds a reference to any arbitrary object •Python is strongly typed •most implicit conversions are disallowed •conversions between numeric types are OK •built-in types (numeric, boolean), operators •boolean operators •three boolean operators: and, or and not •boolean type •two boolean values: True and False (must be capitalized) •Numeric types •three numeric types: int, float and complex •an object's type is derived from its initial value •implicit conversion between numeric types is supported in arithmetic operations •the resulting type is the one with less data loss •integers have unlimited precision •floats are usually implemented using C's double •complex numbers use two floats for their real and imaginary parts •Arithmetic operators •addition, subtraction and product work the same as in C/C++ •/ computes the float quotient even if the operands are both integers •explicit floor quotient operator: // •Comparison operators •< > <= >= == != •operators can be chained •Other operators for numeric types •remainder: % •power: ** •absolute value: abs(x) •round: round(x) •Explicit conversions between numeric types •float(x) or int(x) •math and cmath •additional operations •cmath is for complex numbers •mutability •instances of mutable types can be modified in place •immutable objects have the same value during their lifetime •numeric types: immutable •common integers are preallocated and kept in memory (-5 to 256) •booleans: singleton immutable objects •lists: mutable •tuples: immutable •tuples contain immutable references, however, the referenced objects may be mutable •strings: immutable •sets: mutable •frozensets: immutable •dicts: mutable

Question answering

•task description: the process of generating adequate answers to user's questions, based on some knowledge of the world •existing systems: Siri (Apple), Alexa (Amazon), Google Now, IBM Watson, cleverbot •major approaches and their limitations •IR-based: handle questions as search queries (e.g. Watson, Google) •Knowledge-based: convert question into a Relation Extraction task (e.g. Watson, Siri, Wolfram Alpha) •Limitations: will yield several answer candidates that need to be ranked •Limitations: Create semantic representation of query -> must understand what is being asked •The IR-based approach: •question processing •detect question type (through handwritten rules) •goal is to extract a number of pieces of information from the question •answer type (person, location, time) •query (specifies the keywords that should be used in searching for documents) •focus (string of words in the question that are likely to be replaced by the answer in any answer string found) •(relations), etc •example: Which US state capital has the largest population? •city •US state capital, largest, population •state capital •query generation •generate search queries from questions •the task of creating from the question a list of keywords that form a query that can be sent to an information retrieval system •retrieval •retrieve ranked documents •extract relevant passages, rerank •extract answer candidates •filter out passages in the returned documents that don't contain potential answers and then rank the rest according to how likely they are to contain an answer to the question •answer ranking •rank answers •rank them using a classifier (Answer type match, Pattern match, Number of matched question keywords) •approaches •pointwise •pairwise •listwise

Chomsky hierarchy

•the hierarchy; rule types allowed •classifies formal languages by the rules their grammars can contain •0 Turing complete (rules allowed: α → β) ex: LFG, Python •1 Context sensitive (rules allowed: αAβ → αγβ) ex: •- Mildly context sensitive (ex: CCG) •2 Context free (rules allowed: A → γ) ex: PSG •3 (Right) regular (rules allowed: A → aB or A → a) ex: FSA, regex •a = a terminal, A and B = non-terminals, α, β, γ = strings of terminal and non-terminal symbols •Lower hierarchy levels are subsets of those above •Regular languages and FSA equivalence •Two types: right and left regular grammars •right: A → a, A → aB, A → ε •left: A → a, A → Ba, A → ε •The two are completely equivalent, but the two types of rules cannot be mixed •FSAs can simply be transcribed into right-regular grammars by equating FSA states with non-terminals •Examples for language classes in linguistics: •Regular languages and morphology •Center embedding and CFGs •cannot be expressed with a regular expression •can be expressed with a CFG •A linguistic phenomenon outside CFG •cross-serial dependencies (in Swiss German)

