Natural Language Processing

Give a simple way to encode affix lexicons

Pair each affix with an encoding of its syntactic/semantic effect. E.g.:

- ed PAST_VERB
- ed PSP_VERB
- s PLURAL_NOUN

A lexicon of irregular forms is also necessary. One approach is just a triple consisting of inflected form, 'affix information' and stem, where 'affix information' corresponds to whatever encoding is used for the regular affix. E.g.:

- began PAST_VERB begin
- begun PSP_VERB begin

This approach can be used for generation as well as analysis.
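
A minimal sketch of this encoding as Python data, for illustration only (the tag names follow the examples above; the `analyse` helper is hypothetical):

```python
# Regular affixes paired with an encoding of their syntactic/semantic effect,
# plus irregular forms as (inflected form, affix information, stem) triples.
AFFIXES = {
    "ed": ["PAST_VERB", "PSP_VERB"],  # ambiguous: simple past or past participle
    "s": ["PLURAL_NOUN"],
}
IRREGULAR = [
    ("began", "PAST_VERB", "begin"),
    ("begun", "PSP_VERB", "begin"),
]

def analyse(form):
    """Return possible (affix information, stem) analyses for a surface form."""
    analyses = [(info, stem) for (infl, info, stem) in IRREGULAR if infl == form]
    for affix, infos in AFFIXES.items():
        if form.endswith(affix):
            stem = form[:-len(affix)]
            analyses.extend((info, stem) for info in infos)
    return analyses

print(analyse("began"))   # [('PAST_VERB', 'begin')]
print(analyse("walked"))  # [('PAST_VERB', 'walk'), ('PSP_VERB', 'walk')]
```

The same tables can be read in the other direction (stem plus affix information to inflected form) for generation.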

Define cue phrases

Phrases which indicate particular rhetorical relations.

Define agreement

The requirement for two phrases to have compatible values for grammatical features such as number and gender. For instance, in English, dogs bark is grammatical but dog bark and dogs barks are not.

Define context

The situation in which an utterance occurs: includes prior utterances, the physical environment, background knowledge of the speaker and hearer(s), and so on. It has nothing to do with the 'context' in context-free grammar.

What sort of formalism do spelling rules map to?

Finite state transducers

Why are corpora needed in NLP?

Firstly, we have to evaluate algorithms on real language: corpora are required for this purpose for any style of NLP. Secondly, corpora provide the data source for many machine-learning approaches.

Discuss the 5 general problems in evaluating NLP solutions

_Training data and test data_ The assumption in NLP is always that a system should work on novel data, so test data must be kept unseen. For machine learning approaches, such as stochastic POS tagging, the usual technique is to split a data set into 90% training and 10% test data. Care needs to be taken that the test data is representative. For an approach that relies on significant hand-coding, the test data should be literally unseen by the researchers. Development cycles involve looking at some initial data, developing the algorithm, testing on unseen data, revising the algorithm and testing on a new batch of data. The seen data is kept for regression testing.

_Baselines_ Evaluation should be reported with respect to a baseline, which is normally what could be achieved with a very basic approach, given the same training data. For instance, a baseline for POS tagging with training data is to choose the most common tag for a particular word on the basis of the training data (and to simply choose the most frequent tag of all for unseen words).

_Ceiling_ It is often useful to try to compute some sort of ceiling for the performance of an application. This is usually taken to be human performance on that task, where the ceiling is the percentage agreement found between two annotators (interannotator agreement). For POS tagging, this has been reported as 96% (which makes existing POS taggers look impressive, since some perform at higher accuracy). However, this raises many questions: relatively untrained human annotators working independently often have quite low agreement, but trained annotators discussing results can achieve much higher performance (approaching 100% for POS tagging). Human performance varies considerably between individuals, and fatigue can cause errors even with very experienced annotators. In any case, human performance may not be a realistic ceiling on relatively unnatural tasks, such as POS tagging.

_Error analysis_ The error rate on a particular problem will be distributed very unevenly. For instance, a POS tagger will never confuse the tag PUN with the tag VVN (past participle), but might confuse VVN with AJ0 (adjective) because there's a systematic ambiguity for many forms (e.g., given). For a particular application, some errors may be more important than others. For instance, if one is looking for relatively low-frequency cases of denominal verbs (that is, verbs derived from nouns, e.g., canoe, tango, fork used as verbs), then POS tagging is not directly useful in general, because a verbal use without a characteristic affix is likely to be mistagged. This makes POS tagging less useful for lexicographers, who are often specifically interested in finding examples of unusual word uses. Similarly, in text categorisation, some errors are more important than others: e.g., treating an incoming order for an expensive product as junk email is a much worse error than the converse.

_Reproducibility_ If at all possible, evaluation should be done on a generally available corpus so that other researchers can replicate the experiments.

How would pleonastic pronouns such as in 'It barked' or 'It rained' be represented in logical representation?

For referential 'it' ('It barked'): $\exists x (bark'(x) \land PRON(x))$. For pleonastic 'it' ('It rained'), the pronoun contributes nothing and the representation is simply $rain'$.

What equation is used on occurrences of a word in a corpus to determine probability in a bigram model?

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)}$ i.e. the count of a particular bigram, normalised by dividing by the total number of bigrams starting with the same word (which is equivalent to the total number of occurrences of that word, except in the case of the last token, a complication which can be ignored for a reasonable size of corpus).
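
A minimal sketch of this maximum-likelihood estimate over a toy tokenised corpus (the function name and data are illustrative):

```python
from collections import Counter

def bigram_probability(tokens, w_prev, w):
    """P(w | w_prev) = C(w_prev w) / sum over w' of C(w_prev w')."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Denominator: all bigrams starting with w_prev (roughly the count of w_prev).
    denom = sum(c for (first, _), c in bigrams.items() if first == w_prev)
    return bigrams[(w_prev, w)] / denom if denom else 0.0

tokens = "the dog barked and the dog slept".split()
print(bigram_probability(tokens, "the", "dog"))  # 1.0: 'the' is always followed by 'dog' here
```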

Discuss why atomic category CFGs are insufficient

- *Lack of subject-verb (or any) agreement*: so, for instance, *it fish is allowed by the grammar as well as they fish. We could, of course, allow for agreement by increasing the number of atomic symbols in the CFG, introducing NP-sg, NP-pl, VP-sg and VP-pl, for instance, but this approach would soon become very tedious. Note that we have to expand out the symbols even when there's no constraint on agreement, since we have no way of saying that we don't care about the value of number for a category (e.g., past tense verbs).
- *No account of subcategorisation*: a verb like adore is transitive and relates two entities ('Kim adores' is odd but 'Kim adores Sandy' is not), while give is ditransitive ('Kim gives Sandy an apple'). A CFG with atomic categories fails to encode this: the example grammar in lectures allows 'they fish fish it', i.e. (S (NP they) (VP (V fish) (VP (V fish) (NP it)))). We could expand the grammar to deal with this too, but it becomes excessively cumbersome.
- *Long-distance dependencies*: a sentence like 'Which problem did you say you don't understand?' is traditionally modelled as having a gap where the object would usually go: 'Which problem did you say you don't understand _?' Doing this in standard CFGs is possible, but extremely verbose, potentially leading to trillions of rules.

Instead of having simple atomic categories in the CFG, we want to allow features on the categories, which can have values indicating things like plurality. As the long-distance dependency examples indicate, the features need to be complex-valued. For instance, *'what kid did you say _ were making all that noise?' is not grammatical, as the verb doesn't agree in number with the gap ('which kids did you say _ were making all that noise?' and 'what kid did you say _ was making all that noise?' are fine). The analysis needs to be able to represent the information that the gap corresponds to a plural noun phrase.

What sort of lexical information is needed for full, high precision morphological processing

- Affixes, plus the associated information conveyed by the affix
- Irregular forms, with associated information similar to that for affixes
- Stems with syntactic categories (plus more information if derivational morphology is to be treated as productive)

Why is POS tagging useful?

- Basic, coarse-grained sense disambiguation
- Easier extraction of information for NLP or linguistic research
- As the basis for more complex forms of annotation
- Named entity recognisers are generally run on tagged corpora
- As preprocessors to full parsing
- As part of a method for dealing with words not in a parser's lexicon

What different methods of context weighting exist in distributional semantics?

- Binary model: if context c co-occurs with word w, the value of vector w for dimension c is 1, otherwise it is 0.
- Basic frequency model: the value of vector w for dimension c is the number of times that context c co-occurs with w.
- Characteristic model: the weights given to the vector components express how characteristic a given context is for w. There are various functions used, such as:
> Pointwise Mutual Information, with or without a discounting factor: $pmi_{wc} = \log\frac{f_{wc} \, f_{total}}{f_w \, f_c}$ where $f_{wc}$ is the frequency with which word $w$ occurs in context $c$, $f_{total}$ is the total frequency of all the possible contexts, $f_w$ is the frequency of the word $w$ and $f_c$ is the overall frequency of the context item. I.e., if we use words as dimensions, $f_{wc}$ is the frequency with which word $w$ and word $c$ co-occur, $f_c$ is the overall frequency of word $c$ and $f_{total}$ is the total frequency of all the context words.
> Positive PMI: as PMI, but 0 if PMI < 0
> Derivatives such as Mitchell and Lapata's (2010) weighting function (PMI without the log)
> Etc.

NB: PMI is one of the measures used to find collocations. A minimal PPMI sketch is given below.
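
A minimal sketch of the PMI and positive-PMI weightings over a toy co-occurrence table (the counts are invented for illustration):

```python
import math
from collections import Counter

# Toy (word, context) co-occurrence frequencies.
counts = Counter({("dog", "bark"): 10, ("dog", "the"): 50,
                  ("cat", "bark"): 1, ("cat", "the"): 40})

f_total = sum(counts.values())      # total frequency of all contexts
f_w, f_c = Counter(), Counter()
for (w, c), f in counts.items():
    f_w[w] += f                     # frequency of word w
    f_c[c] += f                     # overall frequency of context c

def pmi(w, c):
    """pmi_wc = log(f_wc * f_total / (f_w * f_c))."""
    f_wc = counts[(w, c)]
    if f_wc == 0:
        return float("-inf")
    return math.log((f_wc * f_total) / (f_w[w] * f_c[c]))

def ppmi(w, c):
    """Positive PMI: clamp negative PMI values to zero."""
    return max(0.0, pmi(w, c))

print(round(ppmi("dog", "bark"), 3))  # 'bark' is characteristic of 'dog', so PPMI > 0
```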

What possible ways exist for regeneration systems to work by?

- Constructing coherent sentences from a partially ordered set of words, e.g. in statistical MT
- Paraphrasing
- Text compression
- Summarisation
- Article construction from text fragments
- Text simplification
- Mixed generation and regeneration systems can use sequence-to-sequence neural models

Compare dependency structures to CFG-based syntax trees

- Dependency structures have no intermediate nodes...
- ...and no notion of constituency...
- ...and are intuitively closer to meaning than parse trees...
- ...and are more neutral to word order, which is particularly important in languages such as Russian where word order is extremely flexible, and in building systems which work more generally cross-linguistically.
- Transition-based dependency parsing is deterministic, unlike chart parsers for CFGs.
- However, dependency structures miss some syntactic and semantic phenomena.
- Dependency structures have no notion of 'grammar', and are instead usually produced by machine learning on dependency-annotated corpora.

What options are there for deciding on a semantic space for distributional semantics?

- Entire vocabulary. The dimensions may however include noise.
- Top $n$ words with highest frequencies. This is efficient and only 'real' words will be included; however, it may miss infrequent but relevant contexts.
- Singular Value Decomposition and other dimensionality reduction techniques.
- Several other variants.

Discuss optimisations on chart parsing, other than packing

- In the pseudo-code given in lectures, the order of addition of edges to the chart was determined by the recursion. In general, chart parsers make use of an agenda of edges, so that the next edges to be operated on are the ones that are first on the agenda. Different parsing algorithms can be implemented by making this agenda a stack or a queue, for instance.
- So far, we've considered bottom-up parsing: an alternative is top-down parsing, where the initial edges are given by the rules whose mother corresponds to the start symbol.
- Some efficiency improvements can be obtained by ordering the search space appropriately, though which version is most efficient depends on properties of the individual grammar.

What possible starting points are there for generating text in Natural Language Generation?

- Logical form or other sentence meaning representation. The opposite of deep parsing; sometimes called realisation when part of a bigger system.
- Formally defined data such as databases, knowledge bases, semantic web ontologies, etc.
- Semi-structured data (tables, graphs etc.)
- Numerical data (e.g. weather reports)
- User input in assistive communication

List some uses of finite state techniques in NLP

- Morpheme analysis/generation
- Grammars for simple dialogue systems
- Partial grammars for named entity recognition
- Dialogue models for spoken dialogue systems (SDS). SDS use dialogue models for a variety of purposes, including controlling the way that the information acquired from the user is instantiated (e.g., the slots that are filled in an underlying database) and limiting the vocabulary to achieve higher recognition rates. FSAs can be used to record possible transitions between states in a simple dialogue.

What sort of quantifiers cannot be represented easily/at all in FOPC?

- Numbers
- Most
- Perhaps
- Maybe
- Modal verbs like can

There are known logics which provide a suitable formalisation for the above, but inference is not always tractable, and the different logics aren't always compatible. A bigger problem: bare plurals. Examples:

- Dogs are mammals
- Birds fly
- Ducks lay eggs
- Voters rejected AV in 2011

Give examples of problems in NLP which can be treated as classification problems

- Pronoun resolution
- Sentiment classification
- Word sense disambiguation
- POS tagging, but there we take the tag sequence of highest probability rather than each individual tag

What uses exist for hyponymy relations?

- Semantic classification: e.g. for selectional restrictions (e.g. the object of eat has to be something edible)
- Shallow inference: 'X murdered Y' implies 'X killed Y', etc.
- Back-off to semantic classes in some statistical approaches (e.g. WordNet classes can be used in document classification)
- Word-sense disambiguation
- Query expansion for information retrieval: if a search doesn't return enough results, one option is to replace an over-specific term with a hypernym

Why do we want to use prediction in NLP?

- Some machine learning systems can be trained using prediction on general text corpora in a way that also makes them useful on other tasks where there is limited training data.
- Prediction is also a fundamental part of human language understanding.
- There are applications like text prediction on phone keyboards, handwriting recognition, spelling correction and text segmentation for languages such as Chinese, which are conventionally written without explicit word boundaries.
- Prediction is also important in estimation of entropy, including estimations of the entropy of English. The notion of entropy is important in language modelling because it gives a metric for the difficulty of the prediction problem.
- The most important use of prediction is as a form of language modelling for automatic speech recognition (ASR). Speech recognisers cannot accurately determine a word from the sound signal for that word alone, and they cannot reliably tell where each word in an utterance starts and finishes. For instance, 'have an ice Dave', 'heaven ice day' and 'have a nice day' could easily be confused. For traditional ASR, an initial signal processing phase produces a lattice of hypotheses of the words uttered, which are then ranked and filtered using the probabilities of the possible sequences according to the language model. The language models which were traditionally most effective work on the basis of n-grams (a type of Markov chain), where the sequence of the prior n−1 words is used to derive a probability for the next.

What difficulties exist in performing inference on a knowledge base for compositional semantics?

- The mapping must be adequate for any way a user may choose to ask a question. Acquiring such a mapping is difficult, though machine learning can help, given the right sort of training data.
- The main limitation is representing the knowledge base itself. The approach is applicable where there is a clear boundary around the domain to be reasoned about, but it breaks down for common-sense reasoning and knowledge, and it is questionable whether regarding human thinking as operating this way is useful.

What are problems with using logical representations

- Unrepresentable quantifiers
- Bare plurals
- Some phrases are neither multiword expressions nor compositional, but somewhere in between
- Coverage is still very incomplete
- Suggested analyses often don't mesh with the goals of broad-coverage parsing, which requires limiting ambiguity
- Many unresolved puzzles remain, so even approaches using logical representations are using approximations. It may be better to view the logical representation as annotation that captures some, but not all, meaning.

Summarise different approaches to word-sense disambiguation

- Until the 1990s, hand-coded rules.
- Supervised learning is possible, but corpora are small, require huge time investments, and show poor agreement between annotators.
- So unsupervised/semi-supervised approaches are preferable.
- Initial ML/statistical approaches trained on homonyms, which are easier than other polysemes, and claimed a 95% disambiguation rate, though this doesn't indicate accuracy.
- For experimentation a standard was required, so WordNet was used, but it drew criticism since its sense granularity is fine-grained and overlapping.
- It is now unfashionable to treat WSD as its own NLP task: it appears unhelpful to consider it separately from applications. (E.g. it's important in speech synthesis, but only where a sense distinction indicates different pronunciation; and for MT, disambiguation depends on the language pair, as some languages distinguish senses where others don't.)
- Sense induction has also been attempted: determining clusters of usages in texts that correspond to senses. Whilst in principle this is a good idea, since the notion of a word sense is inherently fuzzy, it has not been explored as a standalone task due to difficulty in evaluation. Arguably, though, it is an inherent part of the statistical approach to MT and, indeed, of deep learning approaches which incorporate distributions and embeddings.

What is the specific research problem used to measure the effectiveness of sentiment analysis

- Uses a corpus of movie reviews where the rating associated with each review is known, so there is an objective measure of whether a review was positive or negative.
- Pang et al. balanced the corpus so it had 50% positive reviews and 50% negative.
- The research problem is to assign sentiment automatically to each document in the corpus so as to agree with the known ratings.

How may a bag of words technique for sentiment analysis be improved

- Using a form of compositional semantics had positive results.
- Neural network methods have shown high performance; it is hypothesised that they induce some internal representation of syntactic and/or semantic structure.
- Many 'obvious' techniques, such as accounting for negating words, turn out not to be so good, especially as deeper parsing is necessary to determine the scope of negation correctly.

What notion of contexts can exist in distributional semantics?

- Word windows (unfiltered): n words on either side of the item under consideration (unparsed text).
- Word windows (filtered): n words on either side of the item under consideration (unparsed text). Some words will be in a stoplist and not considered part of the context. It is common to put function words and some very frequent content words in the stoplist. The stoplist may be constructed manually, but often the corpus is POS-tagged and only certain POS tags are considered part of the context.
- Lexeme windows: as above, but a morphological processor is applied first that converts the words to their stems.
- Dependencies: syntactic or semantic. The corpus is converted into a list of directed links between heads and dependants (see §5.1). The context for a lexeme is constructed based on the dependencies it belongs to. The length of the dependency path varies according to the implementation.
- Etc.

Aside from the RTE Challenge, how else can we determine textual entailment?

1. By matching dependency structures: a similarity metric can be used to relate different lexemes (without explicit meaning postulates). The similarity could be extracted from a corpus using distributional semantics, or from a lexical resource.
2. More robust methods which require no parsing also exist, e.g. bag of words: if there is sufficient overlap between the words of the text and the hypothesis, entailment is taken to go through. How well the method works depends on H, which must be sufficiently complex for good overlap to be informative. A minimal overlap sketch is given below.
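
A minimal sketch of the bag-of-words overlap idea (the 0.8 threshold is an invented illustration, not from the notes):

```python
def entails_bow(text, hypothesis, threshold=0.8):
    """Crude entailment test: does enough of H's vocabulary appear in T?"""
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    overlap = len(t_words & h_words) / len(h_words)
    return overlap >= threshold

print(entails_bow("the cat sat on the mat", "the cat sat"))  # True
```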

List the tasks involved in Natural Language Generation

1. Content determination: deciding what information to convey. This often involves selecting information from a body of possible pieces of information.
2. Discourse structuring: the broad structure of the text or dialogue. E.g. scientific articles have abstract, introduction, methods, results, comparison, conclusion.
3. Aggregation: deciding how to split the information into sentence-sized chunks.
4. Referring expression generation: deciding when to use pronouns, how many modifiers to include, and so on.
5. Lexical choice: deciding which lexical items to use to convey a given concept.
6. Surface realization: mapping from a meaning representation for an individual sentence (or a detailed syntactic representation) to a string or speech output. Includes morphological generation.
7. Fluency ranking: a large grammar generates many strings for any given meaning representation. Usually, the grammar overgenerates to produce ungrammatical sentences as well as grammatical ones. A fluency ranking component (at its simplest based on n-grams) is used to rank the outputs.

Nowadays, some of these subtasks can be done via statistical approaches.

What factors influence discourse interpretation?

1. Cue phrases: sometimes unambiguous, but they usually are ambiguous.
2. Punctuation (or the way the sentence is said: intonation etc.) and text structure.
3. Real-world content: further description of the event in the real world can clarify what sort of discourse is intended.
4. Tense and aspect.

Outline an abstractive summarisation system to produce a set of propositions. How well does it perform?

1. For each input sentence, parse it and create a set of propositions.
2. Use argument overlap to decide which propositions to attach to a coherence tree, which represents the document as a whole.
3. Remove and store some existing propositions (provisionally 'forget' them).
4. Proceed to the next sentence.
5. If step 2 failed to attach propositions coherently, recover temporarily forgotten propositions so that the new propositions can be attached.
6. Rank propositions by the number of cycles that they survived, and generate (realise) the summary from the highest-ranked propositions.

In the limit, the attachment of the incoming propositions requires arbitrary world knowledge and reasoning. However, Fang and Teufel (2016) demonstrate the effectiveness of a WordNet-based system for deriving lexical chains, which is not described in detail here. Evaluation of this system (e.g., Fang et al. 2016) suggests it is a promising alternative to extractive summarisation.

When producing a Natural Language Generation system, in the 'Representing the data' phase, what considerations are there?

1. Granularity of information
2. Abstraction: we need to generalise over instances
3. Faithfulness to source versus closeness to natural language
4. Inferences over data
5. Which formalism to use

What sorts of polysemy exist?

1. Homonymy: a word with two unrelated senses
2. A word with two related but distinct senses

What are the two broad approaches to inference?

1. Inference on an explicit, formally-represented knowledge base
2. Language-based inference

How can we do machine learning for natural language generation from existing text?

1. Obtain pairings of data and text.
2. Annotate the text with appropriate data entities. This can be done by hand (time-consuming and expensive), or rules can be written to create the training set automatically, using numbers and proper names as links.
3. Treat content selection as a classification problem: all possible factoids are derived from the data source and each is classified as in or out, based on the training data. The factoids can be categorised into classes and grouped. Often, though, this returns 'meaningless' factoids; techniques exist to avoid this, but the approach can become too task-specific.
4. Discourse structuring requires generalising over batches of the text to see where particular information types are presented. Fluency ranking and predicate choice could straightforwardly use the collected texts.

Which effects indicate a 'soft' preference for resolving a particular pronoun to a particular antecedent?

1. RECENCY: more recent antecedents are preferred; only relatively recently referred-to entities are accessible.
2. GRAMMATICAL ROLE: subjects > objects > everything else.
3. REPEATED MENTION: entities referred to more often are preferred.
4. PARALLELISM: entities which share the same role as the pronoun in the same sort of sentence are preferred.
5. COHERENCE EFFECTS: the pronoun resolution may depend on the rhetorical/discourse relation that is inferred. For example, an Explanation inference may cause different behaviour from a Narration inference.

All of this can be overridden by world-knowledge effects: humans use understanding of, and inference about, the world to resolve anaphora as well.

Give examples of context-dependent situations

1. Referring expressions: pronouns, definite expressions etc.
2. Universe of discourse: 'every dog barked' doesn't mean every dog in the world, but those within some explicit or implicit context.
3. Responses to questions.
4. Implicit relationships between events ('Max fell. John pushed him': the second sentence is usually understood as providing a causal explanation).

Why is the concept of hypo/hypernyms, whilst rigorous in logic, difficult in NLP?

1. What classes of noun can be categorised by hyponymy? E.g., how do you categorise a noun like 'truth'? Some abstract nouns can be categorised by hyponymy too, but much less clearly than concrete nouns, which makes the relation far less useful for them.
2. Do differences in quantisation and individuation matter? For instance, is chair a hyponym of furniture? Is beer a hyponym of drink? Is coin a hyponym of money?
3. Is multiple inheritance allowed? Intuitively, multiple parents might be possible: e.g. coin might be under metal (or object?) and also under money. Artefacts in general can often be described either in terms of their form or their function.
4. What should the top of the hierarchy look like? The best answer seems to be to say that there is no single top but that there are a series of hierarchies.

What visualisation techniques exist for analysing neural networks in NLP?

1. t-SNE is an important approach for visualising high-dimensional datasets.
2. Heatmaps can demonstrate the effect of different intensifiers on individual nodes in a NN.

Define Meronymy

A 'part of' relationship. For example, an arm is a part of a body; a steering wheel is a meronym of car.

Define 'Context Free Grammar'?

A CFG has four components, described here as they apply to grammars of natural languages:

1. a set of non-terminal symbols (e.g., S, VP), conventionally written in uppercase;
2. a set of terminal symbols (i.e., the words), conventionally written in lowercase;
3. a set of rules (productions), where the left hand side (the mother) is a single non-terminal and the right hand side is a sequence of one or more non-terminal or terminal symbols (the daughters);
4. a start symbol, conventionally S, which is a member of the set of non-terminal symbols.

What is a bigram model?

A bigram model assigns a probability to a word based on the previous word alone: i.e., $P(w_n|w_{n−1})$ (the probability of $w_n$ conditional on $w_{n−1}$) where $w_n$ is the $n^{th}$ word in some string.

Define 'corpus'

A body of text that has been collected for some purpose.

What notions of similarity does a semantic space tend to capture in distributional semantics?

A broad range. Includes synonyms, near synonyms, hyponyms, taxonomical siblings, antonyms, etc. It does correlate with a psychological reality. We can calculate the rank correlation between a distributional similarity system and human judgements (and note human similarity results can be replicated with a strong correlation coefficient). A good distributional similarity system can have a correlation of 0.8 or better with human data (though reported results may be unreasonably high as the data has been used in so many experiments.)

How is a 'chart' in chart parsing designed?

A chart is a collection of edges, usually implemented as a vector of edges, indexed by edge identifiers. In the simplest version of chart parsing, each edge records a rule application and has the following structure: [id, left vertex, right vertex, mother category, daughters] A vertex is an integer representing a point in the input string (between words). Mother category refers to the rule that has been applied to create the edge. Daughters is a list of the edges that acted as the daughters for this particular rule application: it is there purely for record keeping so that the output of parsing can be a labelled bracketing.
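
A minimal sketch of this edge record as a Python structure (the field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    id: int          # edge identifier: index into the chart vector
    left: int        # left vertex (position between words in the input)
    right: int       # right vertex
    mother: str      # mother category of the rule applied to create the edge
    daughters: list = field(default_factory=list)  # ids of the daughter edges

chart = []  # the chart: a vector of edges indexed by edge identifier
chart.append(Edge(id=0, left=0, right=1, mother="NP", daughters=[]))
```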

How many results does a chart parser return?

A chart parser is designed to be complete: it returns all valid trees (though there are some minor caveats e.g., concerning rules which can apply to their own output).

Where are the probabilites in a bigram model obtained from?

A corpus

Define 'balanced corpus'

A corpus which contains texts representing different genres (newspapers, fiction, textbooks, parliamentary reports, cooking recipes, scientific papers, etc.). Early examples were the Brown corpus (US English, 1960s) and the Lancaster-Oslo-Bergen (LOB) corpus (British English, 1970s), each about 1 million words; the more recent British National Corpus (BNC, 1990s) contains approximately 100 million words, including about 10 million words of spoken English.

What does it mean for a grammar to be 'bidirectional'?

A deep grammar which does not overgenerate (much) is said to be bidirectional.

Describe how the graph-based matching of referring expressions can be used to match a description to an entity

A description is a graph with unlabelled nodes; it matches the knowledge base if it is a subgraph isomorph of the knowledge-base graph. A DISTINGUISHING GRAPH is one which can only be placed over the knowledge-base graph in one way (i.e. refers to only one entity). If the graph can be placed over the knowledge-base graph in many ways, the entities other than the one being referred to are DISTRACTORS. In general, there will be multiple distinguishing graphs, but we are looking for the one with the lowest cost. What distinguishes algorithms is essentially their definition of cost.

What is an embedding? When does a RNN use them?

A distributional model for words with dimensionality reduction created on-the-fly, via prediction. Used in recurrent neural networks. The embeddings may be externally created (from another corpus) or learned as part of the task. If an embedding is to be learned as part of prediction, the input is a one-hot vector (i.e. there is an input corresponding to each word), and the embedding is created in the first layer of the network

What is a chart parser?

A dynamic programming parser for CFGs. A chart is the data structure used to record partial results as parsing continues.

Define 'Distributional Semantics'

A family of techniques for representing word and phrase meaning based on linguistic contexts of use. In such models, meaning is seen as a space with dimensions corresponding to elements in the context (features).

Define multiword expression

A fixed phrase with a non-compositional meaning; A conventional phrase that has something idiosyncratic about it and therefore might be listed in a dictionary.

Define constraint-based grammar

A formalism which describes a language using a set of independently stated constraints, without imposing any conditions on processing or processing order.

Define 'left-associative grammar' and 'right-associative grammar'

A grammar in which all nonterminal daughters are the leftmost daughter in a rule (i.e., where all rules are of the form X → Y a∗), is said to be left-associative. A grammar where all the nonterminals are rightmost is right-associative.

What is a 'deep grammar'?

A grammar that produces some explicit representation of compositional semantics

Define collocation

A group of two or more words that occur together more often than would be expected by chance. They include multiword expressions, i.e. conventional phrases that might be listed in a dictionary (e.g. bass guitar and striped bass), and phrases like 'heavy rain'. Sometimes 'collocation' is restricted to situations where there is a syntactic relationship between the words; J&M define collocation as a position-specific relationship (in contrast to bag-of-words, where position is ignored), but this is not a standard definition.

Define AI-Complete

A half-joking term, applied to problems that would require a solution to the problem of representing the world and acquiring world knowledge (lecture 1).

How does fluency ranking work in Natural Language Generation?

A large grammar will overgenerate. A fluency ranking attempts to choose the best, grammatical sentence. The simplest solution is based on n-grams. In the extreme case, no grammar is used and the fluency ranking is performed on a partially ordered set of words.

Define 'lexeme'

A lexeme is an abstract unit of morphological analysis in linguistics

Define full-form lexicon

A lexicon where all morphological variants are explicitly listed

What is a "full form lexicon"?

A list of all inflected forms, treating derivational morphology as non-productive. Since the vast majority of words in English have regular morphology, a full-form lexicon can be regarded as a form of compilation: it is redundant to have to specify each inflected form as well as the stem.

What is the difference between a long-term and a long-distance dependency?

A long-term dependency is any relationship between elements in the sequence that are separated by more than a very small number of other elements. Note that this is not the same as long distance dependency in linguistics. The example above can be analysed syntactically as coordination of VPs without any form of 'gap'. Note that, if we were using a dependency parser, at the point where we might shift shook, the stack could contain 'she decided so', hence the 'long-term' nature refers to the surface text, not the syntactically/semantically structured information.

Describe the graph-based generation of referring expression in natural language generation

A meta-algorithm for generating referring expressions. The approach involves describing situations by predicates in a knowledge base, which can be thought of as arcs on a graph with nodes corresponding to entities. The algorithm starts at the node of the entity being described and expands adjacent edges. The cost function is given by a positive number associated with each edge. We want the cheapest graph without distractors. We explore the search space, never considering graphs which are more expensive than the best one we have found. If we put an upper bound K on the number of edges in a distractor, the complexity is $O(n^K)$. Different algorithms correspond to different weights; an example of such an algorithm is the full-brevity algorithm.

Define affix

A morpheme which can only occur in conjunction with other morphemes

Define stem

A morpheme which is a central component of a word

What is a cue phrase?

A phrase (e.g. 'because') which links two rhetorically related sentences. Theories of discourse/rhetorical relations try to reify this intuition using link types such as Explanation and Narration

Define verb phrase

A phrase headed by a verb.

Define noun phrase

A phrase which has a noun as syntactic head

Define utterance

A piece of speech or text (sentence or fragment) generated by a speaker in a particular context.

Define referent

A real world entity that some piece of text or speech refers to

Define relative clause

A relative clause gives more information about a noun or modifies it, as in the following example: 'The man who bought our house has just won the lottery.' Relative clauses contain relative adverbs or pronouns, which do not need to be overt, for example in a ZERO RELATIVE CLAUSE or a REDUCED RELATIVE CLAUSE.

Define restrictive relative clause

A restrictive relative clause is one which limits the interpretation of a noun to a subset: e.g. the students who sleep in lectures are obviously overworking refers to a subset of students. Contrast non-restrictive, which is a form of parenthetical comment: e.g. the students, who sleep in lectures, are obviously overworking means all (or nearly all) are sleeping.

What is discourse?

A sentence within a context, e.g. between other sentences

Define n-gram

A sequence of n words

What is the "Recognising Textual Entailment" challenge? Why is it useful?

A series of shared tasks. The task presents a sentence without context, and a hypothesis which may or may not be entailed by it. Human annotators label the pairs as TRUE if the entailment holds, or FALSE otherwise. Examples of this sort can be handled using a logical form generated from a grammar with compositional semantics, combined with inference. Whilst the specifics are tedious, we can show which meaning postulates (e.g. $find'(x,y,z) \implies discover'(x,y,z)$) are valid.

Why do statistical pronoun resolvers often work well despite classification of features being very complex and, as a result, unreliable?

A statistical classifier is somewhat robust to erroneous classifications if the training data features have been assigned by the same mechanism as used in the test system. For example, if the grammatical role assignment is unreliable, the weight assigned to that feature might be less than if it were perfect.

Define dependency structure

A syntactic or semantic representation that links words via relations

What does 'two level morphology' mean?

A system which is good for both analysing and generating morphemes

What is "stemming"?

A technique in traditional information retrieval systems. Involves reducing all morphologically complex forms to a canonical form. The canonical form may not be the linguistic stem, despite the name of the technique. The most commonly used algorithm is the Porter stemmer, which uses a series of simple rules to strip endings.

Define aspect

A term used to cover distinctions such as whether a verb suggests an event has been completed or not (as opposed to tense, which refers to the time of an event). For instance, she was writing a book vs she wrote a book.

What is word2vec?

A two-layer neural network used to create embeddings. It is referred to as a 'predict' model, in contrast to other distributional models which are sometimes described as 'count' models.

Describe the formation of spelling rules

AKA orthographic rules. In such rules, the mapping is always given from the 'underlying' form to the surface form: the mapping is shown to the left of the slash and the context to the right, with the underscore indicating the position in question. Example (e-insertion): ε → e / {s} ^ _ s i.e. insert e after a stem-final s and the morpheme boundary ^, before a suffix s (as in glass^s → glasses).
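
A minimal sketch of applying this e-insertion rule with a regular expression; a crude approximation to the FST treatment, assuming '^' marks the morpheme boundary in the underlying form:

```python
import re

def e_insertion(underlying):
    """Surface form for the rule ε → e / {s} ^ _ s:
    insert e between a stem-final s, the boundary ^, and a suffix s."""
    surface = re.sub(r"(s)\^(s)", r"\1e\2", underlying)
    return surface.replace("^", "")  # delete any remaining boundary markers

print(e_insertion("glass^s"))  # glasses
print(e_insertion("dog^s"))    # dogs
```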

Why are basic restricted Boltzmann machine-based neural networks not so great for NLP?

Although effective for many tasks, combined RBMs and similar architectures cannot handle sequence information well when the sequences are of arbitrary length (we can pass them sequences encoded as vectors, but the input vectors are fixed length). So a different architecture is needed for sequences, and hence for most language and speech problems. Recurrent neural network (RNN) describes a class of architectures that can handle sequences. This includes Long Short-Term Memory (LSTM) models, a development of basic RNNs which has been found to be more effective for at least some language applications.

Define 'lexical ambiguity'

Ambiguities from different lexical analyses. Example: fish can be a verb and a noun.

Define lexical ambiguity

Ambiguity caused because of multiple senses for a word.

Define local ambiguity

Ambiguity that arises during analysis etc., but which will be resolved when the utterance is completely processed.

Define hyponymy

An 'IS-A' relationship (Lecture 7). More general terms are hypernyms; more specific terms are hyponyms.

Define suffix

An affix that follows the stem.

Define prefix

An affix that precedes the stem.

Define Wizard of Oz experiment

An experiment where data is collected, generally for a dialogue system, by asking users to interact with a mock-up of a real system, where some or all of the 'processing' is actually being done by a human rather than automatically.

What is hyponymy?

An is-a relationship in a taxonomy (a child in a taxonomic tree)

What is a hypernym?

An is-the-category-of relationship, i.e. parent/ancestor in a taxonomy

What is "lemmatization"?

Another name for morphological analysis.

Define expletive pronoun

Another term for pleonastic pronoun

What problems does polysemy have for NLP systems?

Any NLP system dealing with semantics must deal with polysemy, either implicitly or explicitly. For systems operating in limited domains, polysemy is relatively rare, and the few cases where it is a problem can be disambiguated by domain knowledge; but for broad-coverage NLP this is still difficult.

Define mumble input

Any unrecognised input in a spoken dialogue system

How is prediction using n-grams used in automatic speech recognition?

As a form of language modelling. Speech recognisers cannot accurately determine a word from the sound signal for that word alone, and they cannot reliably tell where each word in an utterance starts and finishes. For instance, have an ice Dave, heaven ice day and have a nice day could easily be confused. For traditional ASR, an initial signal processing phase produces a lattice of hypotheses of the words uttered, which are then ranked and filtered using the probabilities of the possible sequences according to the language model. The language models which were traditionally most effective work on the basis of n-grams (a type of Markov chain), where the sequence of the prior n − 1 words is used to derive a probability for the next.

How are CFGs and labelled dependencies related?

As used in NLP, dependencies are usually weakly equivalent to a CFG, but more powerful variants exist. In fact, dependency formalisms can be encoded in sufficiently rich feature structure frameworks.

Define part of speech tagging

Automatic assignment of syntactic categories to the words in a text. The set of categories used is actually generally more fine-grained than traditional parts of speech.

Define referring expressions

Bits of language used to perform reference by a speaker.

How is the similarity between two words determined in a semantic space?

By calculating the distance between the vectors. Cosine similarity: $\frac{\sum_k v1_k \, v2_k}{\sqrt{\sum_k v1_k^2} \, \sqrt{\sum_k v2_k^2}}$ This measure calculates the angle between two vectors and is therefore length-independent. This is important as frequent words have longer vectors than less frequent ones.
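
A minimal sketch of the measure, assuming numpy and toy count vectors:

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: length-independent."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

v_dog = np.array([10.0, 2.0, 0.0])  # invented context counts for 'dog'
v_cat = np.array([8.0, 1.0, 1.0])   # invented context counts for 'cat'
print(cosine_similarity(v_dog, v_cat))
```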

How can we modify the lambda calculus representation of semantics to represent a greater range of sentences, e.g. adverbial modifiers?

By reifying the event, i.e. introducing an extra variable for all verbs to represent the event. Example: 'Rover barked loudly.' $\exists e [bark'(e, r) \land loud'(e)]$ This roughly translates to 'there exists an event in which Rover barked, and that event was loud'.

What are the main word2vec architectures? Which one is most frequently used?

- CBOW: given some context words, predict the target word.
- Skip-gram: given a target word, predict the context words.

Skip-gram is more commonly used.
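
A minimal usage sketch, assuming the gensim library (4.x parameter names) and a toy corpus:

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "barked"], ["the", "cat", "miaowed"]]

# sg=1 selects skip-gram (predict contexts from the target); sg=0 selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["dog"][:5])  # first few dimensions of the learned embedding
```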

Define probabilistic CFGs

CFGs with probabilities associated with rules

What is the tagging system used in lectures?

CLAWS 5 (C5)

What possible features exist for computational distributional semantics?

Co-occurrence with a word in a window; co-occurrence with a word as a syntactic dependent; occurrence in the same paragraph; co-occurrence in the same document; etc.

Why is it problematic that algorithms for generating referring expressions from a knowledge-base graph have no association with syntax of natural languages?

- Consider alternative terms like earlier and before: in a domain about diaries, these lexemes might be considered to map to the same KB predicate, but they actually differ in their syntax.
- Where two entities are required, we need to be able to generate good referring expressions for both entities.
- Referring expression generation is also needed in generation contexts without a limited domain, and hence where there is no KB. E.g., after simplifying a text by replacing relative clauses with new sentences, it is often necessary to reformulate the referring expressions in the original text.

How can the binary relation between sentences suggest a basic technique for text summarisation? What limitations does this have?

Considering the nucleus and satellite of each relation: if we remove satellites and keep just the nuclei, the result is usually reasonably coherent and can act as a summary. This only works for rhetorical relations with a satellite and nucleus; relations giving equal weight to each branch, such as Narration, cannot be treated this way. Genre-specific cues, such as those found in scientific papers, can help improve performance within a genre.

Define selectional restrictions

Constraints on the semantic classes of arguments to verbs etc (e.g., the subject of think is restricted to being sentient). The term selectional preference is used for non-absolute restrictions.

Define realisation

Construction of a string from a meaning representation for a sentence or a syntax tree

What might be suggested by the fact that forming referring expressions from a graph-based knowledge base is so problematic?

Corpus-based approaches to referring expression generation may be preferable.

What is a dependency?

Dependencies relate words (tokens) in the sentence directly via unlabelled or labelled relationships

Discuss the performance of dependency parsers

Dependency parsing can be very fast. The greedy algorithm described can go wrong, but usually has reasonable accuracy. (Note that humans process language incrementally and (mostly) deterministically.) There is no notion of grammaticality, so parsing is robust to typos and Yodaspeak, but will not directly indicate ungrammaticality. The oracle's decisions are sensitive to case, agreement etc because of the features. For instance, when analyzing example 6, den Mann beißt der Hund, the choice between LeftArcSubj and LeftArcObj is conditioned on the case of the noun phrase as well as its position.

How is dependency parsing usually done?

Dependency parsing is nearly always treated as a type of machine learning, with parsers trained on dependency banks (sometimes constructed by automatically converting CFG-based treebanks). There is thus no grammar as such. There is a stack, a list of words to be processed, and a record of actions. For now, we assume an oracle which chooses the correct action each time (this is the part that needs machine learning, as discussed below). For the unlabelled case, the oracle chooses between only three actions: LeftArc, RightArc or Shift. Informally:

1. At the start, the stack just contains the ROOT symbol.
2. Parsing is deterministic: at each step either SHIFT a word from the list onto the stack, or link the top two items on the stack (via LeftArc or RightArc).
3. Only the head word is left on the stack after an arc is added.
4. All added arcs are recorded.
5. The algorithm terminates when there is nothing in the word list and only ROOT on the stack.

We now consider the oracle. This is treated as a machine learning classifier: given the stack and the word list, the classifier must choose an action at each step. This is a form of supervised machine learning, trained by extracting parsing actions from correctly annotated data. The features extracted from the training instances can be the actual word forms, the morphological analysis, parts of speech etc. In fact, feature templates are used: these are automatically instantiated to give a huge number of actual features. A wide range of different machine learning models are possible (MaxEnt, SVMs), and recently deep learning approaches have become popular. A transition-loop sketch follows below.
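
A minimal sketch of the unlabelled transition loop; the hard-coded oracle is a stub standing in for the trained classifier:

```python
def parse(words, oracle):
    """Arc-standard sketch: oracle(stack, buffer) returns 'SHIFT',
    'LEFTARC' or 'RIGHTARC'; arcs are (head, dependent) pairs."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":      # top of stack heads the item below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHTARC":     # item below heads the top of stack
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

# Hard-coded 'oracle' for 'she sleeps'; in reality a classifier chooses.
actions = iter(["SHIFT", "SHIFT", "LEFTARC", "RIGHTARC"])
print(parse(["she", "sleeps"], lambda s, b: next(actions)))
# [('sleeps', 'she'), ('ROOT', 'sleeps')]
```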

How are dependency structures formed?

Dependency structures have arcs from a head (e.g., the verb likes) to its dependents (the subject and direct object). It is usual to identify one node as the ROOT, typically the main verb in an utterance.

What is derivational morphology?

Derivational affixes, such as un-, re-, anti- etc., have a broader range of semantic possibilities (there seems to be no principled limit on what they can mean) and don't fit into neat paradigms. Inflectional affixes, in contrast, may be combined (though not in English), but there are always obvious limits to this, since once all the possible slot values are 'set', nothing else can happen.

What does coherence have to do with discourse?

Discourses have to have connectivity to be coherent. Adding context can make sentences coherent.

Define case

Distinctions between nominals indicating their syntactic role in a sentence. In English, some pronouns show a distinction: e.g., she is used for subjects, while her is used for objects. e.g., she likes her vs *her likes she. Languages such as German and Latin mark case much more extensively.

What do distributions in distributional semantics actually represent?

Distributions are a usage representation: they are corpus-dependent, culture-dependent and register-dependent. Synonyms with different registers often don't have a very high similarity. For example, the similarity between policeman and cop is 0.23, and the reason for this relatively low number becomes clear if one examines the highest weighted features: they are used in different contexts, cop generally being derogatory and policeman not. Further, in general, true synonymy does not correspond to higher similarity scores than near-synonymy. Antonyms have high similarity as well, as they are often paired (e.g. dead/alive, cold/hot). Some perfect synonyms (e.g. aubergine and eggplant) display low cosine similarity, though often this is down to frequency differences.

How are anaphors usually used in a discourse?

Entities are introduced in a discourse (technically, evoked) by indefinite noun phrases or proper names. Demonstratives (e.g. this) and pronouns are generally anaphoric. Definite noun phrases are often anaphoric, but often used to bring a mutually known and uniquely identifiable entity into the current discourse.

Why is it not necessarily good that referring expressions from a knowledge-base graph are designed to be brief?

Experiments suggest the algorithms do not give results similar to humans'. Verbosity isn't always a bad thing:

- We don't want the user to have to make complex inferences to determine a referent
- Sometimes longer phrases make a salient point
- Sometimes longer phrases just sound better
- Longer phrases are more likely to be polite
- Longer phrases often improve the chances of a user understanding the speech synthesiser

So later work on referring expressions has relaxed the requirement to minimise noun phrase modifiers.

Give some examples of rhetorical relation link types

- Explanation
- Narration
- Result
- Contrast
- Justification

What does it mean for a summarisation system to be extractive or abstractive?

- Extractive: the system chooses the 'best' sentences from a piece of text to produce the summary.
- Abstractive: the system (partially) analyses the text and uses that analysis to build a summary.

What difficulty exists in FOPC representation of sentences regarding quantifiers? How do NLP systems currently deal with this?

FOPC forces quantifiers to be in a particular scopal relationship, but this information is not usually overt in natural language sentences. Current NLP systems may construct a representation which underspecifies scope and is neutral between the readings. There are many different ways of representing this underspecification.

Why are FSTs more useful than FSAs for morpheme analysis

FSAs can be used to recognise certain patterns but don't by themselves allow for any analysis of word forms. Hence for morphology we use FSTs, which allow the surface structure to be mapped into the list of morphemes. FSTs are useful for both analysis and generation, since the mapping is bidirectional. This approach is known as "two-level morphology".

Define lemmatization

Finding the stem and affixes for words

What is the relationship between dependency trees and CFGs in terms of their power in generating languages?

For NLP purposes, we assume dependency trees which are projective and weakly-equivalent to CFGs.

What is a 'Wizard of Oz' experiment?

For interface applications in particular, collecting a corpus requires a simulation of the actual application: this has often been done by a Wizard of Oz experiment, where a human pretends to be a computer.

Define complement

For the purposes of this course, an argument other than the subject.

Define grammar

Formally, in the generative tradition, the set of rules and the lexicon.

Describe English morphological structure

Generally concatenative

Discuss sense disambiguation in distributional semantics

Generally this is not performed, since for homonyms the contexts of the different senses are generally going to be distinct. The distribution may also contain many contexts arising from multiword expressions of various types.

What sort of rhetorical relation is usually indicated by text in parenthesis

Generally, explanation

Define synonymy

Having the same meaning

What are nucleus and satellite phrases?

If we consider a discourse relation as a relationship between two phrases, we get a binary branching tree structure for the discourse. In many relationships, one phrase depends on the other (e.g. Explanation). We can get rid of subsidiary phrases and still have a reasonably coherent discourse. In such contexts, the main phrase is the 'nucleus' and the subsidiary one the 'satellite'.

What considerations exist in the lexical selection phase of natural language generation?

If we were using a deep grammar for the generation, we must have some mapping between the input representation and the grammar. Realistic systems have multiple mapping rules and this process may require refinement of aggregation.

Define ontology

In NLP and AI, a specification of the entities in a particular domain and (sometimes) the relationships between them. Often hierarchically structured.

Define discourse

In NLP, a piece of connected text.

How is doc2vec trained?

In an analogous fashion to the way skip-gram is trained by predicting context word vectors given an input word, distributed bag of words (dbow) is trained by predicting context words given a document vector. In dbow, the order of the document words is ignored, but there is also dmpv, which is analogous to cbow and is sensitive to document word order. Lau and Baldwin (2016) undertook a careful empirical investigation which addressed some of the initial difficulties in replication of doc2vec results. One important point they emphasize is the status of the word vectors. In principle, one could start with random word vector initialization. Lau and Baldwin found empirically that it was preferable to do one initial run of skip-gram first, and that starting with word embeddings pretrained on some large corpus was even better.

What optimisations are generally used on HMM POS taggers?

In fact, POS taggers generally use trigrams rather than bigrams (the relevant equations are given in J&M, 5.5.4). As with word prediction, backoff (to bigrams) and smoothing are crucial for reasonable performance because of sparse data. When a POS tagger sees a word which was not in its training data, we need some way of assigning possible tags to the word. One approach is simply to use all possible open-class tags, with probabilities based on the unigram probabilities of those tags. A better approach is to use a morphological analyser (without a lexicon) to restrict the candidates: e.g., words ending in -ed are likely to be VVD (simple past) or VVN (past participle), but can't be VVG (-ing form). A sketch of this heuristic follows below.
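
A minimal sketch of the unknown-word heuristic; the lexicon layout and the open-class tag list are illustrative assumptions (tag names are from C5):

```python
OPEN_CLASS_TAGS = ["NN1", "VVB", "AJ0", "AV0"]  # illustrative open-class subset

def candidate_tags(word, lexicon):
    """Possible tags for a word, using suffix heuristics for unknown words."""
    if word in lexicon:
        return lexicon[word]
    if word.endswith("ing"):
        return ["VVG"]            # -ing form
    if word.endswith("ed"):
        return ["VVD", "VVN"]     # simple past or past participle, never VVG
    return OPEN_CLASS_TAGS        # back off to all open-class tags

print(candidate_tags("glorped", {}))  # ['VVD', 'VVN']
```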

Define nominal

In grammar terminology, noun-like

How could we augment dependency formalisms?

In principle, one could add additional power to dependency formalisms by adding a notion of feature in order to represent agreement, for instance. In fact, dependency formalisms can be encoded in sufficiently rich feature structure frameworks. However, as with probabilistic CFGs, it is more normal in NLP to allow the probabilities and features to (partially) model case, agreement and so on.

Why may some dependency structures not be trees?

In some varieties of dependency grammar, it is possible to have non-tree structures, which are required to represent control, for instance. Alternatively, a word may be regarded as having dependencies from two other words.

In a generative grammar, what is a constituent?

In syntactic analysis, a constituent is a word or a group of words that function(s) as a single unit within a hierarchical structure

Define head

In syntax, the most important element of a phrase.

Define argument

In syntax, the phrases which are lexically required to be present by a particular word (prototypically a verb). This is as opposed to adjuncts, which modify a word or phrase but are not required. For instance, in: Kim saw Sandy on Tuesday Sandy is an argument but on Tuesday is an adjunct. Arguments are specified by the subcategorization of a verb etc. Also see the IGE.

Describe the full brevity algorithm for producing referring expressions from a knowledge-base graph in natural language generation

In the meta-algorithm, it is equivalent to giving each edge a weight of 1. It is guaranteed to produce the shortest possible expression (in terms of logical form rather than the string). Dale (1992) also describes a greedy heuristic, which can be emulated by assuming that the edges are ordered by discriminating power. This gives a smaller search space.

Why may increasing the size of the tag set not necessarily degrade POS tagger performance?

Increasing the size of the tagset does not necessarily result in decreased performance: some additional tags could be assigned more-or-less unambiguously and more fine-grained tags can increase performance. For instance, suppose we wanted to distinguish between present tense verbs according to whether they were 1st, 2nd or 3rd person. With the C5 tag set, and the stochastic tagger described, this would be impossible to do with high accuracy, because all pronouns are tagged PRP, hence they provide no discriminating power. On the other hand, if we tagged I and we as PRP1, you as PRP2 and so on, the n-gram approach would allow some discrimination. In general, predicting on the basis of classes means we have less of a sparse data problem than when predicting on the basis of words, but we have less discriminating power. There is also something of a trade-off between the utility of a set of tags and their effectiveness in POS tagging. For instance, C5 assigns separate tags for the different forms of be, which is redundant for many purposes, but helps make distinctions between other tags in tagging models such as the HMM described here where the context is given by a tag sequence alone (i.e., rather than considering words prior to the current one).

Define meaning postulates

Inference rules that capture some aspects of the meaning of a word.

How does language based inference work?

Inferring validity purely on the basis of natural language phrases as assessed by human intuition, rather than any particular logic. Logic may be used to model such inferences, but the basic notion of correctness comes down to human judgement.

Define the difference between inflectional and derivational morphology?

Inflectional morphology can be thought of as setting values of slots in some paradigm (i.e. there is a fixed set of slots which can be thought of as being filled with simple values). Inflectional morphology concerns properties such as tense, aspect, number, person, gender and case. Derivational morphology has a broader range of semantic possibilities, in that there seems to be no principled limit on what derivational affixes can mean, and derived forms don't fit into neat paradigms.

What is inflectional morphology?

Inflectional morphology can be thought of as setting values of slots in some paradigm (i.e., there is a fixed set of slots which can be thought of as being filled with simple values). Inflectional morphology concerns properties such as tense, aspect, number, person, gender, and case, although not all languages code all of these: English, for instance, has very little morphological marking of case and gender.

Describe the skip-gram word2vec architecture

Input -> Projection -> Output. The dimensionality of the vectors produced is a parameter of the system: usually a few hundred.
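
As an illustration, a minimal skip-gram training sketch using the gensim library (toy corpus; hyperparameter values are illustrative, with sg=1 selecting skip-gram rather than cbow):

```python
# Sketch of training skip-gram embeddings with gensim on a toy corpus.
# vector_size sets the dimensionality; a few hundred is typical.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["dogs", "bark", "at", "cats"]]

model = Word2Vec(sentences, vector_size=300, sg=1, window=5,
                 negative=5, min_count=1)

print(model.wv["cat"].shape)  # (300,)
```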

Define homonymy

Instances of polysemy where the two senses are unrelated

What does it mean for a certain linguistic construct to be "productive"?

It can be applied to new words: for instance, a productive affix can attach to newly coined stems.

Describe in detail a basic pronoun resolution algorithm, discussing limitations and difficulties

It can be formulated as a simple classification problem, amenable to one of the standard machine learning approaches to supervised classification (Naive Bayes, perceptron, k-nearest neighbour, etc). We just need a suitable set of training data. The problem is to classify whether each candidate antecedent and non-pleonastic pronoun is an actual antecedent-pronoun pair. We can assume that the candidate antecedents are all noun phrases in a window of e.g. 5 sentences around the pronoun (excluding pleonastic pronouns). Decisions to be made include what to do about possessives and whether to exclude cataphors.

For each such candidate pairing we build a feature vector using features corresponding to some of the factors and properties associated with pronouns, for example:
- Cataphoric?
- Number agrees?
- Gender agrees?
- Same verb (binding theory)?
- Sentence distance (number of sentences between pronoun and antecedent)
- Grammatical roles match? Parallel?
- Linguistic form: proper, definite, indefinite or pronoun?

Modelling repeated mention with a classifier-based system may be tricky: it requires that we keep track of the coreferences that have been assigned, and thus that we maintain a model of the discourse as individual pronouns are resolved. Coherence effects and real-world knowledge effects are very difficult (AI-complete), so these are excluded from the feature set. Realistic systems would use more features and values, and can approximate some partial world knowledge via classification of named entities, for example.

Implementing the classifier requires some knowledge of the syntactic structure, though full parsing may be unnecessary: we could approximately determine noun phrases and grammatical role by means of a series of regexes over POS-tagged data instead of using a full parser. Even if a full syntactic parser is available, it must be augmented to detect pleonastic pronouns.

The training data consists of the feature vectors (not the words themselves), each classified as true or false. If we use a Naive Bayes classifier, for each feature vector $\bar{f}$ we calculate the probability that it is classified in class $c \in \{ true, false \}$:
$P(c|\bar{f}) = \frac{P(\bar{f}|c)P(c)}{P(\bar{f})}$
We can ignore the denominator since it is constant, hence:
$\hat{c} = argmax_{c \in C} P(\bar{f} | c)P(c)$
Treating the features as independent means taking the product of the probabilities of the individual features in $\bar{f}$ for the class:
$\hat{c} = argmax_{c \in C} P(c) \prod_{i = 1}^n P(f_i | c)$
In practice the Naive Bayes model is often found to perform well even with a set of features that are clearly not independent.

Clearly, this approach has severe limitations in treating classification of individual pronoun-antecedent pairs rather than building a discourse model including all coreferences:
- Inability to implement repeated mention
- Inability to use information gained from one linkage in resolving further pronouns (for example, mismatching coreferences is unlikely but would go undetected)
One approach to solving this is to run a simple classifier initially to acquire probabilities of links, and to use those results as the input to a second system which clusters the entities to find an optimal solution.
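
As an illustration of the classification step only, here is a sketch using scikit-learn's Bernoulli Naive Bayes (the feature encoding and all values are invented; a real system would extract these features from POS-tagged or parsed text):

```python
# Sketch of the Naive Bayes classification step over binary feature
# vectors. Each row encodes hypothetical features for one candidate
# antecedent-pronoun pair: [cataphoric, number_agrees, gender_agrees,
# same_verb]. All values are invented for illustration.
from sklearn.naive_bayes import BernoulliNB

X_train = [[0, 1, 1, 0],   # true antecedent-pronoun pair
           [0, 0, 1, 0],   # not a pair (number mismatch)
           [1, 1, 0, 0]]   # not a pair (cataphoric, gender mismatch)
y_train = [1, 0, 0]

clf = BernoulliNB()
clf.fit(X_train, y_train)

# Classify a new candidate pair.
print(clf.predict([[0, 1, 1, 0]]))        # predicted class
print(clf.predict_proba([[0, 1, 1, 0]]))  # link probability, usable as
                                          # input to a clustering phase
```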

Discuss how similarity ratings from a semantic space relate to human notions of word similarity

It does correlate with a psychological reality. We can calculate the rank correlation between a distributional similarity system and human judgements (and note that human similarity results can be replicated with a strong correlation coefficient, even 0.97 in some experiments). A good distributional similarity system can have a correlation of 0.8 or better with human data (though reported results may be unreasonably high, as the data has been used in so many experiments).
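
A sketch of how such a rank correlation might be computed, using Spearman's rho from scipy (the scores below are invented):

```python
# Rank correlation between system similarity scores and human
# judgements for the same word pairs (all scores invented).
from scipy.stats import spearmanr

system_scores = [0.80, 0.10, 0.55, 0.30]
human_scores = [9.1, 1.2, 6.0, 4.5]

rho, p_value = spearmanr(system_scores, human_scores)
print(rho)
```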

What is the current state-of-the-art in research into inferring rhetorical relations?

It is clearly a very difficult problem, but recent research simply using cue phrases and punctuation is quite promising. This can be done by hand-coding a series of finite state patterns, or by supervised learning. Within genres, we can use genre-specific cues, for instance in scientific texts.

Why are some logical statements in compositional semantics written with a ' symbol?

It is conventional for a predicate corresponding to a lexeme to be written with an appended ' (prime): e.g., dog' for the lexeme dog.

What heuristics can be employed in distributional semantics to distinguish between antonyms and near synonyms?

It is possible to automatically distinguish antonyms from (near-)synonyms using corpus-based techniques, but this requires additional heuristics. For instance, it has been observed that antonyms are frequently coordinated while synonyms are not:
• a selection of cold and hot drinks
• wanted dead or alive
• lectures, readers and professors are invited to attend
Similarly, it is possible to acquire hyponymy relationships from distributions, but this is much less effective than looking for explicit taxonomic relationships in Wikipedia text.

Can POS taggers be built without hand-tagged corpuses?

It is possible to build POS taggers that work without a hand-tagged corpus, but they don't perform as well as a system trained on even a very small manually-tagged corpus and they still require a lexicon associating possible tags with words. Completely unsupervised approaches also exist, where no lexicon is used, but the categories induced do not correspond to any standard tagset.

Why does limiting natural language interface systems to limited domains make it a relatively easy problem?

It removes a lot of ambiguity, e.g. LUNAR (the lunar rock sample database querying system) only dealt with "rock" in the sense of the material, never the music.

What is a perceptron?

It simply computes the dot product of an input vector $\vec{x}$ and a weight vector $\vec{w}$, compared to a threshold $\theta$. Learning (by a form of gradient descent) is very fast. There are no hidden layers, so this is just a linear classifier. The output is simply the summation: later systems use more complex activation functions, e.g. a sigmoid (which can output probabilities).
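
A minimal sketch of the perceptron decision rule and the classic error-driven training loop (toy data; the learning rate and epoch count are arbitrary):

```python
# Perceptron sketch: output is 1 iff dot(w, x) exceeds the threshold.
# Training nudges the weights (and threshold) whenever the output is
# wrong. The AND function below is linearly separable, so training
# converges within a few epochs.
import numpy as np

def perceptron_train(X, y, epochs=10, lr=0.1):
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            out = 1 if np.dot(w, x) > theta else 0
            err = target - out
            w += lr * err * x    # move weights towards the target
            theta -= lr * err    # move the threshold the opposite way
    return w, theta

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])       # AND function
w, theta = perceptron_train(X, y)
print([1 if np.dot(w, x) > theta else 0 for x in X])  # [0, 0, 0, 1]
```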

Why is WordNet the main resource for lexical semantics for English

Its large coverage and the fact that it's freely available.

Discuss what sort of neural net architecture has been found to be good at processing sequences of arbitrary length NLP input, and why?

Long short-term memory (LSTM) models, a development of basic RNNs, have been found to be more effective for at least some language applications. The claim is that LSTMs are better at capturing long-term dependencies: i.e., any relationship between elements in the sequence that are separated by more than a very small number of other elements. (1) She shook her head. (2) She decided she did not want any more tea, so shook her head when the waiter reappeared. The point of this example is that, in the use here, the object of the verb shake is a possessive NP which agrees with the subject. We cannot say she shook the head with this meaning. Hence her can be predicted by looking at the subject, which may be textually quite distant. Note that this is not the same as long-distance dependency in linguistics: the example above can be analysed syntactically as coordination of VPs without any form of 'gap'. Note also that, if we were using a dependency parser, at the point where we might shift shook, the stack could contain 'she decided so'; hence the 'long-term' nature refers to the surface text, not the syntactically/semantically structured information. LSTMs are now standard for speech recognition (after decades where no approach could beat HMMs in any practical situation), but there is currently lots of experimentation for other language applications.

Why did many mainstream linguists discount the use of corpuses?

Mainstream linguists in the past mostly dismissed their use in favour of reliance on intuitive judgements about whether or not an utterance is grammatical (a corpus can only (directly) provide positive evidence about grammaticality). However, many linguists do now use corpora.

How do computational techniques typically model distributional semantics?

Meaning is seen as a vector space, with dimensions corresponding to elements in the context. The terms 'semantic space model' and 'vector space model' are sometimes used instead of 'distributional semantics'. Distributions are vectors in a multidimensional semantic space, that is, objects with a magnitude (length) and a direction. The semantic space has dimensions which correspond to possible contexts. For our purposes, a distribution can be seen as a point in that space (the vector being defined with respect to the origin of that space), e.g., cat [...dog 0.8, eat 0.7, joke 0.01, mansion 0.2, zebra 0.1...], where 'dog', 'joke' and so on are the dimensions.
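
As a sketch, the standard way to compare two points in such a space is cosine similarity (the 'cat' vector follows the example above; the 'kitten' vector and its counts are invented):

```python
# Distributions as vectors in a semantic space whose dimensions are
# context words. Cosine of the angle between two vectors is the usual
# similarity measure.
import numpy as np

dims = ["dog", "eat", "joke", "mansion", "zebra"]
cat = np.array([0.8, 0.7, 0.01, 0.2, 0.1])
kitten = np.array([0.6, 0.5, 0.10, 0.1, 0.1])   # invented

cosine = np.dot(cat, kitten) / (np.linalg.norm(cat) * np.linalg.norm(kitten))
print(cosine)
```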

What has been the main use of distributional models of semantics?

Measuring similarity between pairs of words. Similarity measurements allow clustering of words so can be used as a way of getting at unlabelled semantic classes.

Define morpheme

Minimal information carrying units within a word

What is an affix?

Morphemes which can only occur in conjunction with other morphemes, e.g. words that are made up of a stem and zero or more affixes.

What is a Boltzmann Machine? What is a Restricted Boltzmann Machine?

NNs with one or more hidden layers (i.e., layers between input and output) are theoretically capable of approximating any continuous function (mapping reals to reals) to an arbitrary degree of accuracy. However, there is no guarantee they can be effectively trained. A Boltzmann machine has a hidden layer and arbitrary interconnections between units. This is not effectively trainable in general. A Restricted Boltzmann Machine (RBM) has one input layer and one hidden layer and no intra-layer links. The RBM is trainable: the structure allows for efficient implementation since the weights can be described by a matrix.

How effective are universal dependencies?

No single set of dependencies is useful for all languages, and some languages do not exhibit some of the UDs. There is a tension between universality and meaningful dependencies, something acknowledged by the project. Factors which need balancing include ease of human annotation, efficiency in use for parsing, developing a linguistically sound scheme, and more. It has also been necessary to add vague 'catch-all' dependency types such as 'MARK', since words like English infinitival 'to' are not easily classified. It is also important to note that some linguistic concepts cannot be captured using dependencies alone, though a relational notion is part of most modern linguistic methodology. Nevertheless, UD has been highly successful in NLP use cases.

How well do distributional semantics systems perform on Test of English as a Foreign Language synonym tests? Comment on this

Non-native English speakers (US college applicants) are reported to average around 65% on this test; the best corpus-based results are 100% (Bullinaria and Levy, 2012). But note that the authors who obtained this result suggest the test is not very reliable: one reason is probably that the data includes some extremely rare words.

Define pleonastic

Non-referring (esp. of pronouns)

Define domain

Not a precise term, but I use it to mean some restricted set of knowledge appropriate for an application.

What are the advantages of language based inference?

Not tied to a particular knowledge base or domain. We can proceed without having a perfect meaning representation for a natural language statement, but must be prepared for the fact some incorrect decisions may be made.

Why don't we attach e.g. fish directly to a VP node, but instead to a unary V node?

Notice that fish could have been entered in the lexicon directly as a VP, but that this would cause problems if we were doing inflectional morphology, because we want to say that suffixes like -ed apply to Vs. Making rivers etc NPs rather than nouns is a simplification I've adopted here just to keep the example grammar smaller.

What must pronouns agree on? When does this simple rule get confused?

Number and gender with their antecedents.
This is not quite a strong rule when:
- they refers to a single individual (gender-neutral they)
- they is used with everybody
- group nouns (the team played well but they are all tired)
- conjunctions (Kim and Sandy are asleep: they are very tired)
- discontinuous sets

Define overgenerate

Of a grammar, to produce strings which are invalid, e.g., because they are not grammatical according to human judgements.

How can you test word2vec? Comment on the effectiveness and meaningfulness of these tests

On similarity datasets (although note that the hyperparameters have been tuned for high performance on the standard similarity datasets). It can also be used for clustering, as can any model giving a notion of similarity. Mikolov et al introduced a new task with word2vec: a form of analogy. The idea is that one solves puzzles such as: man is to woman as king is to ? where the correct answer is supposed to be queen. One derives the vector between the pair of words man and woman and combines it with king, and the nearest word to the region of vector space that results should be the answer to the analogy. It should be pointed out that the space is very sparse and that there are many word pairs for which this does not work.
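
A sketch of the analogy task using gensim's built-in vector-offset method (this assumes a trained Word2Vec model, e.g. the one from the skip-gram sketch earlier; a model trained on a toy corpus will not actually solve real analogies, and the score shown is invented):

```python
# king - man + woman should land near queen, when it works at all.
result = model.wv.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71)] for a model trained on real data
```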

Describe stochastic part of speech tagging using hidden markov models

One form of POS tagging uses a technique known as Hidden Markov Modelling (HMM). It involves an n-gram technique, but in this case the n-grams are sequences of POS tags rather than of words. The most common approaches depend on a small amount of manually tagged training data from which POS n-grams can be extracted. The idea of stochastic POS tagging is that the tag can be assigned based on consideration of the lexical probability (how likely it is that the word has that tag), plus the sequence of prior tags (for a bigram model, the immediately prior tag). This is more complicated than prediction because we have to take into account both words and tags. We wish to produce a sequence of tags which have the maximum probability given a sequence of words.

How is word2vec trained?

One interesting aspect of word2vec training is the use of negative sampling instead of softmax (which is computationally very expensive). word2vec is trained using logistic regression to discriminate between real and fake words: in outline, whenever considering a word-context pair, the network is also given some contexts which are not the actual observed word. The negative contexts are sampled from the vocabulary (in a manner such that the probability of sampling something more frequent in the corpus is higher). The number of negative samples used affects the results. Another interesting aspect is the use of subsampling. Instead of considering all words in the sentence, it is transformed by randomly removing words from it. For example, the previous sentence might become: considering all sentence transform randomly words. The subsampling function makes it more likely to remove a frequent word. word2vec does not use a stop list. Subsampling affects the window size around the target (so the word2vec window size is not fixed). This approach can be emulated in count models, of course. An additional complication is that the weights of elements in the context window vary, so that closer words are given higher weights (assuming they haven't been removed by subsampling). Again, this is something that can be replicated in count-based systems. Although word2vec is usually used with unparsed data, it can be modified for use with dependencies, as described by Levy and Goldberg.
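
The two sampling tricks can be sketched as follows (the 3/4 exponent and the threshold t are the defaults published with word2vec; the corpus counts below are invented):

```python
# Negative samples are drawn from the unigram distribution raised to
# the 3/4 power, which boosts the chance of sampling rarer words a
# little; frequent words are randomly discarded (subsampled) before
# training. All counts are invented.
import random

counts = {"the": 1000, "cat": 50, "sat": 40, "zebra": 2}
total = sum(counts.values())

# Negative-sampling distribution: P(w) proportional to count(w)^0.75.
weights = {w: c ** 0.75 for w, c in counts.items()}
z = sum(weights.values())
neg_dist = {w: v / z for w, v in weights.items()}

def keep(word, t=1e-5):
    """Subsampling: keep word with probability sqrt(t / f(w)), capped at 1."""
    f = counts[word] / total
    return random.random() < min(1.0, (t / f) ** 0.5)
```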

How do deep learning architectures based on Restricted Boltzmann Machines work?

One popular deep learning architecture can be described as a combination of RBMs, so the output from one RBM is the input to the next. In principle, the RBMs can be trained separately and then fine-tuned in combination. The intuition is that the layers allow for successive approximations to concepts.

For what reasons may we prefer dependency formalisms over CFG-based approaches?

One reason to prefer dependency formalisms over CFG-based approaches is their higher degree of neutrality to word order variation which varies substantially cross-linguistically. English word order is predominantly subject verb object (SVO) and 'who did what to whom' meaning is indicated by order. For example the dog bites that man means something different from that man bites the dog. In the right context, topicalization (OSV order) is possible: "That man, the dog bites." Passive has a different structure where the syntactic subject is the 'underlying' or 'deep' object. Many languages allow freer word order. Sometimes 'who did what to whom' is indicated by case. Some languages have very free word order and can't be categorized by the scheme of breaking words into S, V and O categories. Because of word order variability, CFG rules do not work in all languages. While dependency representations (especially as used in NLP) do not entirely avoid word-order problems, they are less problematic than CFGs for many languages.

What does it mean for a generative system to "overgenerate"?

One that generates invalid output (as well as valid output).

Define open class

Opposite of closed class.

Why is high-quality anaphora resolution important in machine translation?

Other languages, e.g. German, have gender-specific words for gender-neutral pronouns in English like 'it'. To determine which gender to translate to, the anaphora must be resolved.

What concerns exist in producing referring expressions in natural language generation?

Overall: given some information about an entity, how do we choose to refer to it? Pertinent aspects: 1. Do we use ellipses or coordination? 2. Which grammatical category of expression should we use: pronouns, proper names, or definite expressions, for example. Anaphors must agree if used. 3. If we choose to use a full noun phrase we must select which attributes to express. We require enough modifiers to distinguish the expression from possible distractors (the dog vs the big dog vs the big dog in the basket). This is a whole topic in itself.

How do you evaluate POS tagger performance?

POS tagging algorithms are evaluated in terms of percentage of correct tags. The standard assumption is that every word should be tagged with exactly one tag, which is scored as correct or incorrect: there are no marks for near misses. Generally there are some words which can be tagged in only one way, so are automatically counted as correct. Punctuation is generally given an unambiguous tag.

Why are PP attachment ambiguities problematic? How do humans avoid this?

PP attachment ambiguities are a major headache in parsing, since sequences of four or more PPs are common in real texts and the number of readings increases as the Catalan series, which is exponential. Other phenomena have similar properties: for instance, compound nouns (e.g. long-stay car park shuttle bus). Humans disambiguate such attachments as they hear a sentence, but they're relying on the meaning in context to do this, in a way we cannot currently emulate, except when the sentences are restricted to a very limited domain.

What is 'entropy' of a language?

Prediction is important in estimation of entropy, including estimations of the entropy of English. The notion of entropy is important in language modelling because it gives a metric for the difficulty of the prediction problem. For instance, speech recognition is vastly easier in situations where the speaker is only saying two easily distinguishable words (e.g., when a dialogue system prompts by saying answer 'yes' or 'no') than when the vocabulary is unlimited: measurements of entropy can quantify this.

What are the different types of affixes? Which occur in English?

Prefix, suffix, infix, circumfix. English has prefixes and suffixes.

Define named entity recognition

Recognition and categorisation of person names, names of places, dates, etc.

What is a recurrent neural network? Examples?

Recurrent neural network (RNN) describes a class of architectures that can handle sequences. Examples include long short-term memory (LSTM) models, a development of basic RNNs, which have been found to be more effective for at least some language applications.

Define smoothing

Redistributing observed probabilities to allow for sparse data, especially to give a non-zero probability to unseen events

Define corefer

Referring expressions which all refer to the same entity

Define closed class

Refers to parts of speech, such as conjunction, for which all the members could potentially be enumerated (lecture 3).

What is the difference between Language Generation systems and Regeneration Systems

Regeneration systems start from text and produce reformulated text.

Define distributional semantics

Representing word meaning by context of use

Discuss why the choice of using corpuses to train distributional semantic models, especially large models, may have flaws

Research suggests we want as much data as we can get. But more data is unrealistic from a psycholinguistic perspective. It is estimated we see perhaps 50,000 words per day so the BNC, which is small by the standards of current experiments, corresponds to 5 years' exposure. Further, humans can get a good idea of a word's meaning from a small number of examples, something current techniques cannot.

What are proposed ways in which NLP could deal with bare plurals in compositional semantics using logical representations? Examples: Birds fly; Ducks lay eggs; Voters rejected AV in 2011

Researchers have argued that the correct interpretation must involve non-monotonicity or defaults. Example: Birds fly unless they are penguins, ostriches, babies, etc. No fully satisfactory solution has been found yet.

Define orthographic rules

Same as spelling rules

What are dependency semantics? How is it written out?

Semantics based on the relations between words. A leading underscore shows that a predicate corresponds to some lexeme, and an _x suffix gives a broad indication of its sense (e.g. _v for verb).

How can similarity scores be used across a wide range of other NLP techniques?

Similarity measures can be applied as a type of backoff technique in a range of tasks. For instance, in sentiment analysis (discussed in lecture 1), an initial bag of words acquired from the training data can be expanded by including distributionally similar words.

Discuss corpus choice when performing distributional semantics

Some research suggests we want as many words as we can get. Examples: - British National Corpus (100m words) - Wikipedia dump used in Wikiwoods (897m words) - UKWac (obtained from web-crawling): 2bn words The domain must also be considered.

Define speaker

Someone who makes an utterance

Define denominal

Something derived from a noun: e.g., the verb tango is a denominal verb.

Define deverbal

Something derived from a verb: e.g., the adjective surprised.

Define modifier

Something that further specifies a particular entity or event

Define stemming

Stripping affixes

Comment on the apparently high success rates of POS taggers?

Success rates of about 97% are possible for English POS tagging (performance seems to have reached a plateau, probably partly because of errors in the manually-tagged corpora) but the baseline of choosing the most common tag based on the training set often gives about 90% accuracy. Some POS taggers return multiple tags in cases where more than one tag has a similar probability.

Discuss Machine Learning approaches to Word Sense Disambiguation

Supervised learning for WSD can be performed with a sense-tagged corpus. But sense-tagging is extremely time consuming, existing corpora are too small, and agreement between annotators was poor. Instead, semi-supervised or unsupervised learning is preferable. Initial ML techniques were evaluated on homonyms, which are relatively easy to disambiguate, with claims of 95% disambiguation (which says nothing about whether high precision was achieved on all words).

How does WordNet work? Alternative approaches?

Synonym sets are constructed, and nouns organised by hyponymy. Taxonomies have also been extracted from machine-readable dictionaries, e.g. in Microsoft's MindNet. There has also been considerable work on extracting taxonomic relationships from corpora.

How may multiword expressions be problematic for compositional semantics parsers using logical representations?

Take 'red tape': it's not really compositional in the meaning of its words. But sometimes it is:
- The deputy head of department was dismayed at the endless red tape
vs
- She tied red tape round the parcel
This can be accounted for somewhat in the grammar, e.g. red tape represented as ambiguous between a MWE and a compositional phrase, but some phrases are half-way between MWE and compositional. Example: real pleasure (it was a real pleasure meeting you): it is used as a MWE but is compositional enough not to be listed independently in most dictionaries. A distributional approach may work here, but it's still an area of active research.

Define meronymy

The 'part-of' lexical semantic relation

Define indirect object

The beneficiary in verb phrases like give a present to Sandy or give Sandy a present. In this case the indirect object is Sandy and the direct object is a present.

Define interannotator agreement

The degree of agreement between the decisions of two or more humans with respect to some categorisation

Discuss context-free grammars with empty productions

The formal description of a CFG generally allows productions with an empty right hand side (e.g., Det → ε). It is convenient to exclude these however, since they complicate parsing algorithms, and a weakly-equivalent grammar can always be constructed that disallows such empty productions.

Discuss the difficulties in grounding and how it relates to the techniques covered in the course

The idea of grounding is simple and intuitive, but relates to some serious and complex philosophical debate concerning Artificial Intelligence. So far in this course, our notion of meaning has been purely symbolic: we expect our systems to behave correctly based purely on manipulating symbols of natural language, or logical or other abstract symbols which we have introduced. We can only talk about meaning in terms of relationships between terms. If we did, somehow, manage to completely define the word table, this would still not mean that a system with that knowledge could recognise a real table.

Define compositionality

The idea that the meaning of a phrase is a function of the meaning of its parts. compositional semantics is the study of how meaning can be built up by semantic rules which mirror syntactic structure (lecture 6).

What is a rhetorical relation?

The implicit relationship between two sentences. AKA discourse relation.

What are the limitations of language based inference?

The inferences made are out of context, though there will be a requirement to include some form of modelling of the entities referred to (e.g. anaphora resolution). The limitations generally mean that any explicit labelling of the meaning of sentences from this approach is best viewed as annotation, rather than as a complete model.

Define reflexive

The informal and inadequate generalisation is that reflexive pronouns must be co-referential with a preceding argument of the same verb (i.e. something it subcategorises for) while non-reflexive pronouns cannot be

Discuss the dimensions produced by word2vec — their meaning and the reason they are useful

The intuition is that the dimensionality reduction captures meaningful generalizations, but the dimensions are not directly interpretable: it is impossible to look into 'characteristic contexts' as we can with the count models. Of course, the advantage of the smaller vectors is greater efficiency. As outlined below, there are visualization techniques which allow one to examine the closeness of different words.

What is doc2vec useful for?

The learned document vector is effective for various tasks, including sentiment analysis (note that there is a large space of hyperparameters to investigate).

Define subcategorisation

The lexical property that tells us how many arguments a verb etc can have.

What is the most difficult part of making a statistically based word prediction system, particularly one that is designed to help users input words quicker?

The main difficulty with using statistical prediction models in such applications is in finding enough data: to be useful, the model really has to be trained on an individual speaker's output, but of course very little of this is likely to be available. Training a conversational aid on newspaper text can be worse than using a unigram model from the user's own data.

How do most other grammar/parsing systems work in NLP/CL?

The majority use some form of statistical ranking of parses alongside a symbolic component. The symbolic grammar may be written by hand by a linguist or automatically generated from a manually-annotated treebank. An alternative approach with treebanks is to construct them by manually selecting from alternative analyses generated by manually encoded symbolic grammars, thereby training a statistical ranking component. Work in the area of unsupervised learning is generally aimed at understanding human acquisition of language, rather than at practical usage. Grammars vary greatly in the level of granularity and depth they provide. Some grammars are bidirectional, making them useful for both generating and parsing text. Such grammars are usually confined to a small subset of a natural language, making their usage limited. Their coverage on arbitrary text will not be 100%, and some other method of enforcing further robustness is probably required. The level of overgeneration depends on the usage and implementation.

What is the assumption behind compositional semantics? How is this enforced?

The meaning of each whole phrase must be a function of the meaning of its parts. To enforce this, each syntactic rule is paired with a semantic rule showing how the meaning of the mother is built from the meanings of the daughters. This is usually done using the lambda calculus in linguistics.

What is a morpheme?

The minimal information carrying unit of a word

Define lexicon

The part of an NLP system that contains information about individual words

Define anaphora

The phenomenon of referring to something that was mentioned previously in a text. An anaphor is an expression which does this, such as a pronoun

Define polysemy

The phenomenon of words having different senses

Define generation

The process of constructing text (or speech) from some input representation

What sort of information does a dependency structure normally encode?

The relationships encoded are usually syntactic (although see next lecture for semantic dependencies): syntactic dependency structures can be seen as an alternative to parse trees which have the advantage of capturing meaningful relationships more directly. Dependency parsers produce such representations directly, but in NLP dependencies are also constructed by converting another representation such as a parse tree (though such conversions are generally imperfect/lossy).

Define logical form

The semantic representation constructed for an utterance

Given a constituent X, what does X' mean?

The semantics of X. (E.g. NP' is the semantics of a noun phrase)

What is compositional semantics?

The study of how the structure of a sentence conveys its meaning.

Define binding-theory

The study of intra-sentential anaphora: constraints on coreference within a sentence (e.g. for reflexive pronouns).

What is morphology?

The study of the structure of words.

Define antecedent

The text initially evoking a referent

Why do we need CFGs rather than FSAs to parse natural languages? Discuss the merits of this argument

The usual answer is that the syntax of natural languages cannot be described by an FSA, even in principle, due to the presence of centre-embedding, i.e. structures which map to: A → αAβ and which generate languages of the form $a^n b^n$. For instance: the students the police arrested complained has a centre-embedded structure. However, this is not entirely satisfactory, since humans have difficulty processing more than two levels of embedding: ? the students the police the journalists criticised arrested complained. If the recursion is finite (no matter how deep), then the strings of the language could be generated by an FSA. So it's not entirely clear whether an FSA might not suffice, despite centre embedding. There's a fairly extensive discussion of the theoretical issues in J&M, but there are two essential points for our purposes:
1. Grammars written using finite state techniques alone may be very highly redundant, which makes them difficult to build and slow to run.
2. Without internal structure, we can't build up good semantic representations.
These are the main reasons for the use of more powerful formalisms from an NLP perspective (in the next section, I'll discuss whether simple CFGs are inadequate for similar reasons). However, FSAs are very useful for partial grammars. In particular, for information extraction, we need to recognise named entities: e.g. Professor Smith, IBM, 101 Dalmatians, the White House, the Alps and so on. Although NPs are in general recursive (the man who likes the dog which bites postmen), relative clauses are not generally part of named entities. Also the internal structure of the names is unimportant for IE. Hence FSAs can be used, with sequences such as 'title surname', 'DT0 PNP', etc. CFGs can be automatically compiled into approximately equivalent FSAs by putting bounds on the recursion. This is particularly important in speech recognition engines.

Define generative grammar

The family of approaches to linguistics where a natural language is treated as governed by rules which can produce all and only the well-formed utterances.

Comment on how developing word-sense disambiguation is no longer fashionable

There are several reasons, chiefly that it seems unhelpful to consider the problem in isolation. WSD is important in speech synthesis, for example, but only for a relatively small number of words where the sense distinction indicates a difference in pronunciation (e.g., bass but not bank or plant). In SMT, the necessary disambiguation happens as part of the general model rather than being a separate step, even conceptually. In any case, for MT, the word senses (or uses) which need to be distinguished depend on the language pair, and may well not be ambiguous to a native speaker. For instance, German Wald may be translated as forest or wood, but the distinction is not seen as an ambiguity by a native German speaker.

Discuss constraint-based frameworks for general NLP usage

There have been a number of frameworks developed, aiming to be linguistically useful across a number of languages whilst remaining computationally tractable. Two such very active frameworks are Lexical Functional Grammar (LFG) and Head-driven Phrase Structure Grammar (HPSG). Much of the recent academic work involves two international collaborations, PARGRAM for LFG and DELPH-IN for HPSG. They allow for the specification of formal grammars (including morphology, syntax and compositional semantics), for both parsing and generation, and are used across different languages (though mainly English). Through these systems, efficient parsing mechanisms have been developed and generalisations across languages have been produced. In fact their performance in some use-cases is comparable to automatically-induced treebank-based approaches. DELPH-IN has produced the largest English grammar specification ever, the English Resource Grammar (ERG). These schemes often aim to produce open source feature structure (FS) grammars for a variety of languages.

How can inference on a knowledge base for compositional semantics be implemented?

There is an explicit, formally-represented knowledge base. Then, we have a mapping between the domain covered and constants and predicates in the knowledge base. Example: C <=> Chase, H <=> Happy, U <=> Unhappy. Then, after converting the natural language phrase into such a representation, we can infer from the knowledge base by matching or logical inference.

Discuss the use of dimensionality reduction techniques in producing a semantic space

This can be very efficient (200-500 dimensions are often used) and it should capture generalisations in the data, but the SVD matrices end up being uninterpretable. Arguably this is a theoretical problem, but more importantly it is a practical one: using SVD makes it impossible to debug the feature space by manual inspection, so is unsuitable for initial experiments.

Discuss sense induction in word-sense disambiguation

This is an attempt to determine clusters of usages of a word in texts that correspond to senses. In principle this is a great idea since the whole notion of a word sense is fuzzy (you can even argue word senses are artefacts of dictionary publishing). However, sense induction has not been explored as a standalone task due to the difficulty of evaluation - that said it can be considered an inherent part of statistical approaches to MT and, indeed, of deep learning approaches which incorporate distributions/embeddings.

Why can we parse a CFG without backtracking?

This works for parsing with CFGs because the rules are independent of their context: a VP can always expand as a V and an NP regardless of whether or not it was preceded by an NP or a V, for instance. (In some cases we may be able to apply techniques that look at the context to cut down the search space, because we can tell that a particular rule application is never going to be part of a sentence, but this is strictly a filter: we're never going to get incorrect results by reusing partial structures.)

Define taxonomy, in particular vs ontology

Traditionally, the scheme of classification of biological organisms. Extended in NLP to mean a hierarchical classification of word senses. The term ontology is sometimes used in a rather similar way, but ontologies tend to be classifications of domain-knowledge, without necessarily having a direct link to words, and may have a richer structure than a taxonomy.

Why is unsupervised learning often preferred to supervised learning?

Training data is very expensive, difficult to obtain and doesn't generalise across domains

Describe a finite state transducer

Transducers map between two representations, so each transition corresponds to a pair of characters. As with the spelling rule, we use the special character 'ε' to correspond to the empty character and 'ˆ' to correspond to an affix boundary. The abbreviation 'other : other' means that any character not mentioned specifically in the FST maps to itself. As with the FSA example, we assume that the FST only accepts an input if the end of the input corresponds to an accept state (i.e., no 'left-over' characters are allowed).
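
A toy sketch of the FST idea as a transition table, where (state, input character) maps to (output character, next state). This toy transducer only copies its input and deletes the affix boundary; a real spelling-rule FST would need additional states for rules such as e-insertion:

```python
# Minimal FST sketch. "other" is the fallback for any character with
# no specific transition; "same" means the character maps to itself.
transitions = {
    (0, "other"): ("same", 0),   # copy any unlisted character
    (0, "^"): ("", 0),           # delete the affix boundary
}
accept_states = {0}

def transduce(s):
    state, out = 0, []
    for ch in s:
        key = (state, ch) if (state, ch) in transitions else (state, "other")
        emit, state = transitions[key]
        out.append(ch if emit == "same" else emit)
    # only accept if we end in an accept state
    return "".join(out) if state in accept_states else None

print(transduce("box^s"))  # 'boxs': no e-insertion rule in this toy FST
```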

What useful additions can be made to FSAs?

Transition probabilities

What sort of dependency structures do parsers limit themselves to?

Trees, as otherwise, standard parsing approaches fail and complexity is too high. In NLP, it is also usual to restrict attention to dependency trees which are projective. I will not go through the formal definition of projectivity, but non-projective structures can easily be seen because they have to be drawn with crossing arcs. Non-projective structures are fairly rare in English, but they arise more often in free word order languages.

Why is knowing trivial information both a difficult and essential problem for information retrieval systems?

Trivial information is critical for human reasoning but tends not to be explicitly stated anywhere, since humans find it trivial.

What does it mean for two grammars to be weakly equivalent?

Two grammars are said to be weakly-equivalent if they generate the same strings

What does it mean for two grammars to be strongly equivalent?

Two grammars are strongly-equivalent if they assign the same structure to all strings they generate.

Define Antonymy

Two words with opposite meanings. Mostly discussed with respect to adjectives, but only relevant for some classes of adjectives.

Define Synonymy

Two words with the same, or nearly the same, meaning

Define genre

Type of text: e.g., newspaper, novel, textbook, lecture notes, scientific paper. Note the difference to domain (which is about the type of knowledge): it's possible to have texts in different genre discussing the same domain (e.g., discussion of human genome in newspaper vs textbook vs paper).

What differences and similarities exist between dependency structures and syntax trees?

Unlike the syntax tree, the dependency structure has no intermediate nodes such as VP. There is no direct notion of constituency in dependency structures (although some notions of constituency can be recovered). Lack of intermediate node labels is actually helpful for generality in NLP annotation in that notions of constituency vary a lot between different approaches. However it means that we can't model some syntactic and semantic phenomena so directly/easily. Dependency structures are intuitively closer to meaning than the parse trees. They are also more neutral to word order variations, as discussed below. In fact, dependency formalisms can be encoded in sufficiently rich feature structure frameworks.

Define bag of words

Unordered collection of words in some text.

Discuss the earliest attempts at word sense disambiguation

Until the 1990s, WSD was done by hand-constructed rules. The approaches depended on frequency, collocations, and selectional restrictions/preferences. These later proved useful knowledge sources for subsequent approaches (alongside machine-readable dictionaries and Wikipedia sense disambiguation pages).

Define bidirectional

Usable for both analysis and generation

How can you make chart parsing go in cubic time?

Using 'packing'. The modification is to change the daughters value on an edge to be a set of lists of daughters, and to make an equality check before adding an edge so we don't add one that's equivalent to an existing one. That is, if we are about to add an edge:
[id, left vertex, right vertex, mother category, daughters]
and there is an existing edge:
[id-old, left vertex, right vertex, mother category, daughters-old]
we simply modify the old edge to record the new daughters:
[id-old, left vertex, right vertex, mother category, daughters-old ⊔ daughters]
There is no need to recurse with this edge, because we couldn't get any new results: once we've found we can pack an edge, we always stop that part of the search. Thus packing saves computation and in fact leads to cubic time operation, though I won't go through the proof of this.
Example: we want to add an edge E1: [1, 1, 3, VP, {(2 9)}]. Suppose there already exists an edge E2: [2, 1, 3, VP, {(2 6)}]. Instead of inserting E1 and recursing on it, we append (2 9) onto the set of daughters for E2: [2, 1, 3, VP, {(2 6), (2 9)}]

How do recurrent neural networks describe embeddings?

Using a model trained on a very large corpus to predict the next word in a sequence. The vector for the word at time t is concatenated with the vector output from the context layer at t − 1. Words could be encoded using a one-hot vector: one dimension per word (i.e., a simple index). This would be analogous to HMMs, which use the words themselves for prediction. However, better performance can be obtained using input embeddings. These are, in effect, a distributional model with dimensionality reduction, created on-the-fly via prediction.

Define backoff

Usually used to refer to techniques for dealing with data sparseness in probabilistic systems: using a more general classification rather than a more specific one. For instance, using unigram probabilities instead of bigrams; using word classes instead of individual words (lecture 3).

Show how the compositional semantics of `S -> NP VP` could be represented

VP'(NP')
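
A toy sketch of this function application in Python (the predicate and constant names are invented, following the prime convention used in these notes):

```python
# The semantics of the VP is a function applied to the semantics of
# the NP: S' = VP'(NP').
vp_sem = lambda x: f"bark'({x})"   # VP' for "barks"
np_sem = "kim'"                    # NP' for "Kim"

s_sem = vp_sem(np_sem)             # S' = VP'(NP')
print(s_sem)                       # bark'(kim')
```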

What algorithm is used to find bigrams in speech recognition. Give a very, very high overview of this algorithm.

We can regard bigrams as comprising a simple deterministic weighted FSA. The Viterbi algorithm, a dynamic programming technique for efficiently applying n-grams in speech recognition and other applications to find the highest probability sequence (or sequences), is usually described in terms of an FSA.

What difficulties exist to evaluating pronoun resolution?

We can't just require that every (non-pleonastic) pronoun is linked to an antecedent and measure the accuracy of the links found compared to test data:
- Some pronouns may be pleonastic, and others may refer to concepts which aren't expressed as noun phrases in the text.
- Identification of the target noun phrases is an issue, with embedded noun phrases a particular problem. We could treat this as a separate problem and assume we're given data with the non-pleonastic pronouns and the candidate antecedents identified, but this isn't fully realistic.
- Sometimes an anaphor is resolved as referring to another pronoun, which in turn refers to the correct noun phrase. While it's easy to account for the transitive closure of references, doing so unfairly rewards algorithms that link all the pronouns together into one cluster.
Hence, it has been difficult to develop agreed metrics for evaluation.

What must be consider when performing attribute selection in natural language generation?

We need to include enough modifiers to distinguish the expression from possible DISTRACTORS in the discourse context.

What is a problem of generating bigram data from corpuses, especially small ones? What solution is used?

We often end up with bigrams with 0 probability, because the estimate is based purely on whether examples of a pair of words occur in the corpus. We use smoothing to mitigate some of this, which simply means that we make some assumption about the 'real' probability of unseen or very infrequently seen events and distribute that probability appropriately. A common approach is simply to add one to all counts: this is add-one smoothing, which is not theoretically sound but is simple to implement. A better approach in the case of bigrams is to back off to the unigram probabilities: i.e., to distribute the unseen probability mass so that it is proportional to the unigram probabilities. This sort of estimation is extremely important to get good results from n-gram techniques.
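
A sketch of add-one smoothing for bigram estimates (the counts and vocabulary size below are invented):

```python
# Add-one (Laplace) smoothing: every bigram count is incremented by
# one, with the vocabulary size V added to the denominator so the
# probabilities still sum to one.
def add_one_bigram_prob(bigram_count, unigram_count, vocab_size):
    return (bigram_count + 1) / (unigram_count + vocab_size)

V = 10000
print(add_one_bigram_prob(0, 250, V))   # unseen bigram, non-zero prob
print(add_one_bigram_prob(12, 250, V))  # seen bigram
```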

How can you use a bigram model to determine the likelihood of an entire string of words?

We require the probability of some string of words $P(w_1^n)$ which is approximated by the product of the bigram probabilities: $P(w_1^n ) \approx \prod_{k = 1}^{n} P(w_k | w_{k − 1})$ This assumes independence of the individual probabilities, which is clearly wrong, but the approximation nevertheless works reasonably well. Note that, although the n-gram probabilities are based only on the preceding words, the effect of this combination of probabilities is that the choice between possibilities at any point is sensitive to both preceding and following contexts.
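
A sketch of this product, computed as a sum of log probabilities to avoid numerical underflow on long strings (the bigram probabilities are invented):

```python
# P(w_1..w_n) approximated as the product of bigram probabilities,
# accumulated in log space.
import math

bigram_prob = {("<s>", "the"): 0.2, ("the", "cat"): 0.05,
               ("cat", "sat"): 0.1}

words = ["<s>", "the", "cat", "sat"]
logp = sum(math.log(bigram_prob[(w1, w2)])
           for w1, w2 in zip(words, words[1:]))
print(math.exp(logp))  # 0.2 * 0.05 * 0.1 = 0.001
```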

Derive the equations used in Hidden Markov Model for POS tagging

We want an estimate of the most likely tag sequence $\hat{t}^n_1$ for a word sequence $w^n_1$, obtained by maximising:
$\hat{t}^n_1 = argmax_{t^n_1} P(t^n_1 | w^n_1)$
(NB the hat symbol indicates an estimate). We use Bayes' theorem:
$P(t^n_1 | w^n_1) = \frac{P(w^n_1 | t^n_1)P(t^n_1)}{P(w^n_1)}$
Since we are tagging a particular given string, the denominator $P(w^n_1)$ is the same for every candidate tag sequence, so it can be ignored in the argmax. Hence:
$\hat{t}^n_1 = argmax_{t^n_1} P(w^n_1 | t^n_1)P(t^n_1)$
Now we need to estimate $P(w^n_1 | t^n_1)$ and $P(t^n_1)$. If we make the bigram assumption, the probability of a tag depends only on the previous tag, so the tag sequence probability is estimated as a product of bigram probabilities:
$P(t^n_1) = \prod^n_{i = 1} P(t_i | t_{i - 1})$
We also assume that the probability of a word is independent of the words and tags around it and depends only on its own tag:
$P(w^n_1 | t^n_1) = \prod^n_{i = 1} P(w_i | t_i)$
These values can be estimated from corpus frequencies. We therefore get the final expression:
$\hat{t}^n_1 = argmax_{t^n_1} \prod^n_{i = 1} P(w_i | t_i)P(t_i | t_{i - 1})$
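
For completeness, a minimal Viterbi sketch that finds the highest-probability tag sequence under this bigram model (all probability tables are invented toy values; unseen events are given a tiny floor probability as a crude stand-in for smoothing):

```python
# Viterbi for bigram HMM tagging: for each position and tag, keep the
# best-scoring path and a backpointer, then trace back from the best
# final tag.
def viterbi(words, tags, p_word_given_tag, p_tag_given_prev):
    # best[i][t] = (score, backpointer) for tagging words[i] with t
    best = [{t: (p_tag_given_prev.get(("<s>", t), 1e-6) *
                 p_word_given_tag.get((words[0], t), 1e-6), None)
             for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][pt][0] *
                 p_tag_given_prev.get((pt, t), 1e-6) *
                 p_word_given_tag.get((words[i], t), 1e-6), pt)
                for pt in tags)
            col[t] = (score, prev)
        best.append(col)
    # Follow backpointers from the best-scoring final tag.
    t = max(tags, key=lambda tag: best[-1][tag][0])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = best[i][t][1]
        path.append(t)
    return list(reversed(path))

p_tag = {("<s>", "DT"): 0.8, ("DT", "NN"): 0.9, ("NN", "VB"): 0.7}
p_word = {("the", "DT"): 0.6, ("dog", "NN"): 0.01, ("barks", "VB"): 0.05}
print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VB"], p_word, p_tag))
# ['DT', 'NN', 'VB']
```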

What is a probabilistic CFG and why is it important? How are they generated?

Weights can be manually assigned to rules and lexical entries in a manually constructed grammar. However, since the beginning of the 1990s, a lot of work has been done on automatically acquiring probabilities from a corpus annotated with syntactic trees (a treebank), either as part of a general process of automatic grammar acquisition, or as automatically acquired additions to a manually constructed grammar. Probabilistic CFGs (PCFGs) can be defined quite straightforwardly, if the assumption is made that the probabilities of rules and lexical entries are independent of one another (of course this assumption is not correct, but the orderings given seem to work quite well in practice). The importance of this is that we rarely want to return all parses in a real application, but instead we want to return those which are top-ranked: i.e., the most likely parses. This is especially true when we consider that realistic grammars can easily return many tens of thousands of parses for sentences of quite moderate length (20 words or so). If edges are prioritised by probability, very low priority edges can be completely excluded from consideration if there is a cut-off such that we can be reasonably certain that no edges with a lower priority than the cut-off will contribute to the highest-ranked parse. Limiting the number of analyses under consideration is known as beam search (the analogy is that we're looking within a beam of light, corresponding to the highest probability edges). Beam search is linear rather than exponential or cubic. Just as importantly, a good priority ordering from a parser reduces the amount of work that has to be done to filter the results by whatever system is processing the parser's output.

Define cataphora

When a pronoun appears before its referent is introduced by a proper name or definite description.

Define 'structural ambiguity'

Where a string can be parsed correctly to produce two parse trees of different structures

What is 'part of speech tagging'?

Where the words in a corpus are associated with a tag indicating some syntactic information that applies to that particular use of the word.

What is 'beam search'?

Where we limit the number of analyses under consideration when parsing a string using a chart parser or similar, based on the probability of a certain parse occurring.

When producing a Natural Language Generation system, in the 'Content Selection' phase, what considerations are there?

Which content is important and relevant when there is so much which could be included or inferred? We can use machine learning to determine this.

What approaches exist in generating referring expressions using anaphora in natural language generation?

Whilst there are bidirectional approaches to anaphora, an alternative is to generate multiple options and test them using a standard anaphora resolution algorithm.

What is WordNet?

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

After all heuristics are applied to resolve anaphora, what may override all the rules?

World knowledge. Humans routinely resolve anaphora by using logical inference on world knowledge, where the meaning of anaphora resolutions would otherwise not make sense.

Give pseudocode for a chart parser

```
Parse:
    Initialise the chart (i.e., clear previous results)
    For each word word in the input sentence,
      let from be the left vertex, to be the right vertex,
      and daughters be (word)
        For each category category that is lexically associated with word
            Add new edge from, to, category, daughters
    Output results for all spanning edges
      (i.e., ones that cover the entire input and whose mother
      corresponds to the root category)

Add new edge from, to, category, daughters:
    Put edge in chart: [id, from, to, category, daughters]
    For each rule in the grammar of form lhs -> cat1 ... catn-1, category
        Find the set of lists of contiguous edges
          [id1, from1, to1, cat1, daughters1] ...
          [idn-1, fromn-1, from, catn-1, daughtersn-1]
          (such that to1 = from2, etc.)
          (i.e., find all edges that match the rule)
        For each list of edges,
            Add new edge from1, to, lhs, (id1 ... idn-1, id)
            (i.e., apply the rule to the edges)
```
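
The pseudocode translates fairly directly into runnable code. Below is a minimal Python sketch using my own representations (edge tuples for the chart, a toy grammar and lexicon invented for the example); it is not the notes' implementation.

```python
GRAMMAR = [("S", ["NP", "VP"]),      # (lhs, daughters)
           ("NP", ["DT", "NN"]),
           ("VP", ["VB", "NP"]),
           ("VP", ["VB"])]
LEXICON = {"the": ["DT"], "dog": ["NN"], "cat": ["NN"],
           "chases": ["VB"], "barks": ["VB"]}

def parse(words, root="S"):
    chart = []   # edges: (id, from, to, category, daughters)

    def add_edge(frm, to, cat, daughters):
        edge_id = len(chart)
        chart.append((edge_id, frm, to, cat, daughters))
        # try every rule whose *last* daughter matches the new edge
        for lhs, cats in GRAMMAR:
            if cats[-1] != cat:
                continue
            for ids, start in matches(cats[:-1], frm):
                add_edge(start, to, lhs, tuple(ids) + (edge_id,))

    def matches(cats, end):
        """Yield (edge-id list, start vertex) for contiguous edges
        of categories cats finishing at vertex end."""
        if not cats:
            yield [], end
            return
        for (i, f, t, c, _) in chart:
            if c == cats[-1] and t == end:
                for ids, start in matches(cats[:-1], f):
                    yield ids + [i], start

    for pos, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            add_edge(pos, pos + 1, cat, (word,))

    # spanning edges with the root category
    return [e for e in chart
            if e[1] == 0 and e[2] == len(words) and e[3] == root]

print(parse("the dog chases the cat".split()))
```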

Define treebank

a corpus annotated with syntactic trees

What is doc2vec?

a modification of word2vec so a vector is learned to represent a 'document' (i.e., any collection of words, including a sentence, paragraph or short document).
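
For illustration, here is a minimal usage sketch with gensim's Doc2Vec implementation (assumes gensim 4.x is installed; the tiny corpus is invented):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "dog", "barks"], tags=["d0"]),
        TaggedDocument(words=["the", "cat", "miaows"], tags=["d1"])]

model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=40)
vec = model.infer_vector(["a", "dog", "barked"])   # vector for unseen text
print(model.dv.most_similar([vec], topn=1))        # nearest training doc
```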

What are universal dependencies?

a set of dependency relations intended to work cross-linguistically. This work draws on attempts to define a 'universal' set of POS tags. There are now UD depbanks for over 50 languages (though most are small).

What is transition-based dependency parsing?

a variant of shift-reduce parsing. It is deterministic, in strong contrast to chart parsing, although other dependency parsing algorithms exist which do incorporate search.
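
A minimal sketch of the arc-standard transition system in Python; the action sequence here is supplied by hand (in a real parser a classifier chooses the next action from the current stack/buffer configuration), and all names are my own.

```python
def parse_transitions(words, actions):
    """Apply SHIFT / LEFT-ARC / RIGHT-ARC deterministically.
    Returns (head, dependent) arcs as indices into words."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for act in actions:
        if act == "SHIFT":          # move the next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":     # second-top is a dependent of the top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":    # top is a dependent of the second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

words = ["the", "dog", "barks"]
for head, dep in parse_transitions(
        words, ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC"]):
    print(words[dep], "<-", words[head])   # the <- dog, then dog <- barks
```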

How can you find ambiguities in a chart from a chart parser?

local ambiguities correspond to situations where a particular span has more than one associated edge.

What do dependency parsers do about non-tree dependency structures?

standard dependency parsing approaches will not work, and the computational complexity of parsing is higher. It is therefore usual within NLP to either ignore this problem entirely, or to regard parsing as a two-phase process, and add non-tree arcs to the dependency structure in a second phase.

Give examples where the idea that morphology is purely concatenative breaks down

unkempt - kempt is no longer a word feed - could be fee -ed but fee is a noun corpus - there is no such single "corpu"

How does word2vec relate to other distributional models?

word2vec is sometimes called a 'predict' model, in contrast to other distributional models which are 'count' models. Omer Levy et al. found that many of word2vec's hyperparameters could usefully be applied to count models. Once trained, word2vec is more-or-less equivalent to a count model with dimensionality reduction, and the essential reason for its better performance on some tasks was the improved hyperparameters. It is also very efficient computationally compared to most count approaches and therefore easy to incorporate in experiments. word2vec gives dense vectors which are very effective for many tasks: its hyperparameters have been tuned to give good performance on some of the standard similarity datasets. However, Levy et al. found some tasks for which count vectors without dimensionality reduction were more effective.

How is word2vec training efficiency improved?

word2vec training efficiency is improved by the use of negative sampling instead of the full softmax (which is computationally very expensive): rather than normalising over the whole vocabulary, the model only has to distinguish the observed context word from a small number of randomly sampled 'negative' words.
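
For concreteness, the skip-gram negative-sampling objective from Mikolov et al. (standard notation, not from these notes) for a word/context pair $(w, c)$ with $k$ negative samples is:

$\log \sigma(v_c \cdot v_w) + \sum^k_{i = 1} E_{c_i \sim P_n(c)}[\log \sigma(-v_{c_i} \cdot v_w)]$

where $\sigma$ is the logistic sigmoid, $v_w$ and $v_c$ are the word and context vectors, and the negative contexts $c_i$ are drawn from a noise distribution $P_n$. Each update therefore needs only $k + 1$ dot products, instead of a normalisation over the entire vocabulary.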

