Natural Language Processing


What equation is applied to occurrence counts of words in a corpus to determine probabilities in a bigram model?

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)}$, i.e. the count of a particular bigram, normalised by dividing by the total number of bigrams starting with the same word (which is equivalent to the total number of occurrences of that word, except in the case of the last token, a complication which can be ignored for a reasonable size of corpus).
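A minimal sketch of this maximum-likelihood estimate in Python; the toy corpus, the tokenisation and the function name bigram_probabilities are illustrative assumptions.

```
# A minimal sketch of maximum-likelihood bigram estimation from raw counts.
from collections import Counter

def bigram_probabilities(tokens):
    """Return P(w_n | w_{n-1}) estimated as C(w_{n-1} w_n) / C(w_{n-1})."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens[:-1])  # counts of words as bigram starters
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}

tokens = "the dog barks and the dog sleeps".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "dog")])    # 1.0: every 'the' in this toy corpus precedes 'dog'
print(probs[("dog", "barks")])  # 0.5
```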

How may a bag of words technique for sentiment analysis be improved?

- Using a form of compositional semantics has had positive results.
- Neural network methods have shown high performance; it is hypothesised that they induce some internal representation of syntactic and/or semantic structure.
- Many 'obvious' techniques, such as accounting for negating words, turn out not to be so good, especially as deeper parsing is necessary to determine the scope of negation correctly.

Why don't we attach e.g. fish directly to a VP node, but instead to a unary V node?

Notice that fish could have been entered in the lexicon directly as a VP, but that this would cause problems if we were doing inflectional morphology, because we want to say that suffixes like -ed apply to Vs. Making rivers etc NPs rather than nouns is a simplification I've adopted here just to keep the example grammar smaller.

Describe stochastic part-of-speech tagging using hidden Markov models

One form of POS tagging uses a technique known as Hidden Markov Modelling (HMM). It involves an n-gram technique, but in this case the n-grams are sequences of POS tags rather than of words. The most common approaches depend on a small amount of manually tagged training data from which POS n-grams can be extracted. The idea of stochastic POS tagging is that the tag can be assigned based on consideration of the lexical probability (how likely it is that the word has that tag), plus the sequence of prior tags (for a bigram model, the immediately prior tag). This is more complicated than prediction because we have to take into account both words and tags. We wish to produce a sequence of tags which have the maximum probability given a sequence of words.

What does it mean for a generative system to "overgenerate"?

A system that generates invalid outputs (as well as valid ones).

How do you evaluate POS tagger performance?

POS tagging algorithms are evaluated in terms of percentage of correct tags. The standard assumption is that every word should be tagged with exactly one tag, which is scored as correct or incorrect: there are no marks for near misses. Generally there are some words which can be tagged in only one way, so are automatically counted as correct. Punctuation is generally given an unambiguous tag.

Why are PP attachment ambiguities problematic? How do humans avoid this?

PP attachment ambiguities are a major headache in parsing, since sequences of four or more PPs are common in real texts and the number of readings increases as the Catalan series, which is exponential. Other phenomena have similar properties: for instance, compound nouns (e.g. long-stay car park shuttle bus). Humans disambiguate such attachments as they hear a sentence, but they're relying on the meaning in context to do this, in a way we cannot currently emulate, except when the sentences are restricted to a very limited domain.

Give a simple way to encode affix lexicons

Pair each affix with an encoding of its syntactic/semantic effect. E.g.:

ed PAST_VERB
ed PSP_VERB
s PLURAL_NOUN

A lexicon of irregular forms is also necessary. One approach is just a triple consisting of inflected form, 'affix information' and stem, where 'affix information' corresponds to whatever encoding is used for the regular affix. E.g.:

began PAST_VERB begin
begun PSP_VERB begin

This approach can be used for generation as well as analysis.
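As a rough illustration of this encoding, a small affix lexicon and irregular-form list could be held in plain Python structures; the data layout and the analyse helper below are assumptions for illustration, not a prescribed format.

```
# A minimal sketch of the affix and irregular-form encoding described above.
AFFIXES = {
    "ed": ["PAST_VERB", "PSP_VERB"],
    "s": ["PLURAL_NOUN"],
}

# Irregular forms as (inflected form, affix information, stem) triples.
IRREGULAR = [
    ("began", "PAST_VERB", "begin"),
    ("begun", "PSP_VERB", "begin"),
]

def analyse(word):
    """Return candidate (stem, affix information) analyses for a word."""
    analyses = [(stem, info) for form, info, stem in IRREGULAR if form == word]
    for affix, infos in AFFIXES.items():
        if word.endswith(affix) and len(word) > len(affix):
            analyses += [(word[:-len(affix)], info) for info in infos]
    return analyses

print(analyse("began"))   # [('begin', 'PAST_VERB')]
print(analyse("walked"))  # [('walk', 'PAST_VERB'), ('walk', 'PSP_VERB')]
```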

Define cue phrases

Phrases which indicate particular rhetorical relations.

What is 'entropy' of a language?

Prediction is important in estimation of entropy, including estimations of the entropy of English. The notion of entropy is important in language modelling because it gives a metric for the difficulty of the prediction problem. For instance, speech recognition is vastly easier in situations where the speaker is only saying two easily distinguishable words (e.g., when a dialogue system prompts by saying answer 'yes' or 'no') than when the vocabulary is unlimited: measurements of entropy can quantify this.
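To make the metric concrete, here is a minimal sketch using the standard definition of per-word entropy, $H = -\sum_w P(w)\log_2 P(w)$, contrasting the two-word yes/no situation with a large uniform vocabulary; the distributions are illustrative assumptions, not measurements of English.

```
# Entropy of a word distribution: the yes/no prompt is far easier to predict
# than an unrestricted (here, uniform 10,000-word) vocabulary.
import math

def entropy(distribution):
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

yes_no = {"yes": 0.5, "no": 0.5}
large_vocab = {f"word{i}": 1 / 10000 for i in range(10000)}

print(entropy(yes_no))       # 1.0 bit per word
print(entropy(large_vocab))  # ~13.3 bits per word
```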

What are the different types of affixes? Which occur in English?

Prefix, suffix, infix and circumfix. English has prefixes and suffixes.

Define agreement

The requirement for two phrases to have compatible values for grammatical features such as number and gender. For instance, in English, dogs bark is grammatical but dog bark and dogs barks are not.

Define context

The situation in which an utterance occurs: it includes prior utterances, the physical environment, background knowledge of the speaker and hearer(s), etc. Nothing to do with context-free grammar.

What is morphology?

The study of the structure of words.

Why do we need CFGs rather than FSAs to parse natural languages? Discuss the merits of this argument

The usual answer is that the syntax of natural languages cannot be described by an FSA, even in principle, due to the presence of centre-embedding, i.e. structures which map to:

A → αAβ

and which generate strings of the form $a^n b^n$. For instance, 'the students the police arrested complained' has a centre-embedded structure. However, this is not entirely satisfactory, since humans have difficulty processing more than two levels of embedding:

? the students the police the journalists criticised arrested complained

If the recursion is finite (no matter how deep), then the strings of the language could be generated by an FSA. So it's not entirely clear whether an FSA might not suffice, despite centre embedding.

There's a fairly extensive discussion of the theoretical issues in J&M, but there are two essential points for our purposes:

1. Grammars written using finite state techniques alone may be very highly redundant, which makes them difficult to build and slow to run.
2. Without internal structure, we can't build up good semantic representations.

These are the main reasons for the use of more powerful formalisms from an NLP perspective (in the next section, I'll discuss whether simple CFGs are inadequate for similar reasons).

However, FSAs are very useful for partial grammars. In particular, for information extraction, we need to recognise named entities: e.g. Professor Smith, IBM, 101 Dalmatians, the White House, the Alps and so on. Although NPs are in general recursive (the man who likes the dog which bites postmen), relative clauses are not generally part of named entities. Also the internal structure of the names is unimportant for IE. Hence FSAs can be used, with sequences such as 'title surname', 'DT0 PNP' etc.

CFGs can be automatically compiled into approximately equivalent FSAs by putting bounds on the recursion. This is particularly important in speech recognition engines.

Why are FSTs more useful than FSAs for morpheme analysis

FSAs can be used to recognise certain patterns but don't by themselves allow for any analysis of word forms. Hence for morphology we use FSTs, which allow the surface form to be mapped onto a list of morphemes. FSTs are useful for both analysis and generation since the mapping is bidirectional. This approach is known as "two-level morphology".

What sort of formalism do spelling rules map to?

Finite state transducers

Why are corpuses needed in NLP?

Firstly, we have to evaluate algorithms on real language: corpora are required for this purpose for any style of NLP. Secondly, corpora provide the data source for many machine-learning approaches.

Discuss the 5 general problems in evaluating NLP solutions

_Training data and test data_: The assumption in NLP is always that a system should work on novel data, therefore test data must be kept unseen. For machine learning approaches, such as stochastic POS tagging, the usual technique is to split a data set into 90% training and 10% test data. Care needs to be taken that the test data is representative. For an approach that relies on significant hand-coding, the test data should be literally unseen by the researchers. Development cycles involve looking at some initial data, developing the algorithm, testing on unseen data, revising the algorithm and testing on a new batch of data. The seen data is kept for regression testing.

_Baselines_: Evaluation should be reported with respect to a baseline, which is normally what could be achieved with a very basic approach, given the same training data. For instance, a baseline for POS tagging with training data is to choose the most common tag for a particular word on the basis of the training data (and to simply choose the most frequent tag of all for unseen words).

_Ceiling_: It is often useful to try and compute some sort of ceiling for the performance of an application. This is usually taken to be human performance on that task, where the ceiling is the percentage agreement found between two annotators (interannotator agreement). For POS tagging, this has been reported as 96% (which makes existing POS taggers look impressive, since some perform at higher accuracy). However this raises lots of questions: relatively untrained human annotators working independently often have quite low agreement, but trained annotators discussing results can achieve much higher performance (approaching 100% for POS tagging). Human performance varies considerably between individuals. Fatigue can cause errors, even with very experienced annotators. In any case, human performance may not be a realistic ceiling on relatively unnatural tasks, such as POS tagging.

_Error analysis_: The error rate on a particular problem will be distributed very unevenly. For instance, a POS tagger will never confuse the tag PUN with the tag VVN (past participle), but might confuse VVN with AJ0 (adjective) because there's a systematic ambiguity for many forms (e.g., given). For a particular application, some errors may be more important than others. For instance, if one is looking for relatively low frequency cases of denominal verbs (that is, verbs derived from nouns, e.g., canoe, tango, fork used as verbs), then POS tagging is not directly useful in general, because a verbal use without a characteristic affix is likely to be mistagged. This makes POS tagging less useful for lexicographers, who are often specifically interested in finding examples of unusual word uses. Similarly, in text categorisation, some errors are more important than others: e.g. treating an incoming order for an expensive product as junk email is a much worse error than the converse.

_Reproducibility_: If at all possible, evaluation should be done on a generally available corpus so that other researchers can replicate the experiments.
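A minimal sketch of the 90%/10% split and the most-common-tag baseline described above; the toy tagged corpus and the C5-style tag names are illustrative assumptions.

```
# 90/10 train/test split plus the most-common-tag POS-tagging baseline.
from collections import Counter, defaultdict

tagged = [("the", "AT0"), ("dog", "NN1"), ("barks", "VVZ"),
          ("the", "AT0"), ("fish", "NN1"), ("fish", "VVB"),
          ("dogs", "NN2"), ("bark", "VVB"), ("a", "AT0"), ("fish", "NN1")]

split = int(0.9 * len(tagged))
train, test = tagged[:split], tagged[split:]

# Most common tag per word in the training data; the most frequent tag overall
# is used for unseen words.
tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1
default_tag = Counter(tag for _, tag in train).most_common(1)[0][0]

def baseline_tag(word):
    return tag_counts[word].most_common(1)[0][0] if word in tag_counts else default_tag

accuracy = sum(baseline_tag(w) == t for w, t in test) / len(test)
print(accuracy)
```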

Give pseudocode for a chart parser

```
Parse:
    Initialise the chart (i.e., clear previous results)
    For each word in the input sentence,
        let from be the left vertex, to be the right vertex and daughters be (word)
        For each category that is lexically associated with word
            Add new edge from, to, category, daughters
    Output results for all spanning edges
        (i.e., ones that cover the entire input and which have a mother
         corresponding to the root category)

Add new edge from, to, category, daughters:
    Put edge in chart: [id, from, to, category, daughters]
    For each rule in the grammar of form lhs -> cat1 ... catn-1, category
        Find set of lists of contiguous edges
            [id1, from1, to1, cat1, daughters1] ... [idn-1, fromn-1, from, catn-1, daughtersn-1]
            (such that to1 = from2 etc)
            (i.e., find all edges that match a rule)
        For each list of edges,
            Add new edge from1, to, lhs, (id1 ... id)
            (i.e., apply the rule to the edges)
```
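As a rough, runnable companion to this pseudocode, here is a minimal bottom-up chart parser in Python on a toy fish grammar; the grammar encoding, the edge tuple layout and the helper names (add_edge, find_sequences, parse) are illustrative assumptions rather than the course's reference implementation.

```
# Minimal bottom-up chart parser: each new edge is tried as the rightmost
# daughter of every rule, as in the pseudocode above.
GRAMMAR = [("S", ["NP", "VP"]), ("VP", ["V", "NP"]), ("VP", ["V"]),
           ("NP", ["they"]), ("NP", ["fish"]), ("V", ["fish"])]

LEXICON, RULES = {}, []
for lhs, rhs in GRAMMAR:
    if len(rhs) == 1 and rhs[0].islower():       # word -> lexical entry
        LEXICON.setdefault(rhs[0], []).append(lhs)
    else:
        RULES.append((lhs, rhs))

chart = []  # each edge: (id, from, to, category, daughters)

def add_edge(frm, to, category, daughters):
    edge_id = len(chart)
    chart.append((edge_id, frm, to, category, daughters))
    # Try every rule whose rightmost daughter matches the new edge.
    for lhs, rhs in RULES:
        if rhs[-1] != category:
            continue
        # Contiguous edges matching rhs[:-1], ending where the new edge starts.
        for ids, start in find_sequences(rhs[:-1], frm):
            add_edge(start, to, lhs, ids + [edge_id])

def find_sequences(cats, end):
    """Yield (edge ids, leftmost vertex) for contiguous edges matching cats,
    finishing at vertex end."""
    if not cats:
        yield [], end
        return
    for eid, frm, to, cat, _ in list(chart):
        if to == end and cat == cats[-1]:
            for ids, start in find_sequences(cats[:-1], frm):
                yield ids + [eid], start

def parse(words, root="S"):
    chart.clear()
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            add_edge(i, i + 1, category, [word])
    # Spanning edges: cover the whole input with the root category as mother.
    return [e for e in chart if e[1] == 0 and e[2] == len(words) and e[3] == root]

print(parse(["they", "fish"]))  # one spanning S edge
```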

Give examples where the idea that morphology is purely concatenative breaks down

- unkempt: kempt is no longer a word
- feed: could be fee + -ed, but fee is a noun
- corpus: there is no such form "corpu" for the -s to attach to

Discuss why atomic category CFGs insufficient?

- *Lack of subject-verb (or any) agreement*: so, for instance, *it fish is allowed by the grammar as well as they fish. We could, of course, allow for agreement by increasing the number of atomic symbols in the CFG, introducing NP-sg, NP-pl, VP-sg and VP-pl, for instance. But this approach would soon become very tedious. Note that we have to expand out the symbols even when there's no constraint on agreement, since we have no way of saying that we don't care about the value of number for a category (e.g., past tense verbs).
- *Does not account for subcategorisation*: consider how a verb like 'adore' is transitive and relates two entities ('Kim adores' is weird but 'Kim adores Sandy' is not), while a verb like 'give' is ditransitive ('Kim gives Sandy an apple'). A CFG with atomic categories fails to encode this. The example grammar in lectures allows for: they fish fish it (S (NP they) (VP (V fish) (VP (V fish) (NP it)))). We could expand the grammar to also deal with this, but this becomes excessively cumbersome.
- *Long-distance dependencies*: a sentence like 'Which problem did you say you don't understand?' is traditionally modelled as having a gap where the object would usually go: 'Which problem did you say you don't understand _?' Doing this in standard CFGs is possible, but extremely verbose, potentially leading to trillions of rules.

Instead of having simple atomic categories in the CFG, we want to allow for features on the categories, which can have values indicating things like plurality. As the long-distance dependency examples should indicate, the features need to be complex-valued. For instance, '* what kid did you say _ were making all that noise?' is not grammatical, because the gap does not agree in number with the verb ('which kids did you say _ were making all that noise?' and 'what kid did you say _ was making all that noise?' are both fine). The analysis needs to be able to represent the information that the gap corresponds to a plural noun phrase.

What sort of lexical information is needed for full, high precision morphological processing

- Affixes, plus the associated information conveyed by the affix
- Irregular forms, with associated information similar to that for affixes
- Stems with syntactic categories (plus more information if derivational morphology is to be treated as productive)

Why is POS tagging useful?

- Basic, coarse-grained sense disambiguation
- Easier to extract information for NLP or linguistic research
- As the basis for more complex forms of annotation
- Named entity recognisers are generally run on tagged corpuses
- As preprocessors to full parsing
- As part of a method for dealing with words not in a parser's lexicon

Discuss optimisations on chart parsing, other than packing

- In the pseudo-code given in lectures, the order of addition of edges to the chart was determined by the recursion. In general, chart parsers make use of an agenda of edges, so that the next edges to be operated on are the ones that are first on the agenda. Different parsing algorithms can be implemented by making this agenda a stack or a queue, for instance.
- So far, we've considered bottom-up parsing: an alternative is top-down parsing, where the initial edges are given by the rules whose mother corresponds to the start symbol.
- Some efficiency improvements can be obtained by ordering the search space appropriately, though which version is most efficient depends on properties of the individual grammar.

List some uses of finite state techniques in NLP

- Morpheme analysis/generation
- Grammars for simple dialogue systems
- Partial grammars for named entity recognition
- Dialogue models for spoken dialogue systems (SDS). SDS use dialogue models for a variety of purposes: including controlling the way that the information acquired from the user is instantiated (e.g., the slots that are filled in an underlying database) and limiting the vocabulary to achieve higher recognition rates. FSAs can be used to record possible transitions between states in a simple dialogue.

Define full-form lexicon

A lexicon where all morphological variants are explicitly listed

Why do we want to use prediction in NLP?

- Some machine learning systems can be trained using prediction on general text corpora in a way that also makes them useful on other tasks where there is limited training data.
- Prediction is also a fundamental part of human language understanding.
- There are also applications like text prediction on phone keyboards, handwriting recognition, spelling correction and text segmentation for languages such as Chinese, which are conventionally written without explicit word boundaries.
- Prediction is also important in estimation of entropy, including estimations of the entropy of English. The notion of entropy is important in language modelling because it gives a metric for the difficulty of the prediction problem.
- The most important use of prediction is as a form of language modelling for automatic speech recognition (ASR). Speech recognisers cannot accurately determine a word from the sound signal for that word alone, and they cannot reliably tell where each word in an utterance starts and finishes. For instance, have an ice Dave, heaven ice day and have a nice day could easily be confused. For traditional ASR, an initial signal processing phase produces a lattice of hypotheses of the words uttered, which are then ranked and filtered using the probabilities of the possible sequences according to the language model. The language models which were traditionally most effective work on the basis of n-grams (a type of Markov chain), where the sequence of the prior n − 1 words is used to derive a probability for the next.

What is the specific research problem used to measure the effectiveness of sentiment analysis

- Uses a corpus of movie reviews where the rating associated with each review is known, so there is an objective measure of whether a review was positive or negative.
- Pang et al. balanced the corpus so it had 50% positive reviews and 50% negative.
- The research problem is to assign sentiment automatically to each document in the corpus so that it agrees with the known ratings.

What is a "full form lexicon"?

A list of all inflected forms, treating derivational morphology as non-productive. Since the vast majority of words in English have regular morphology, a full-form lexicon can be regarded as a form of compilation: it is redundant to have to specify the inflected forms as well as the stem.

Define 'Context Free Grammar'?

A CFG has four components, described here as they apply to grammars of natural languages:

1. a set of non-terminal symbols (e.g., S, VP), conventionally written in uppercase;
2. a set of terminal symbols (i.e., the words), conventionally written in lowercase;
3. a set of rules (productions), where the left hand side (the mother) is a single non-terminal and the right hand side is a sequence of one or more non-terminal or terminal symbols (the daughters);
4. a start symbol, conventionally S, which is a member of the set of non-terminal symbols.

What is a bigram model?

A bigram model assigns a probability to a word based on the previous word alone: i.e., $P(w_n|w_{n−1})$ (the probability of $w_n$ conditional on $w_{n−1}$) where $w_n$ is the $n^{th}$ word in some string.

Define 'corpus'

A body of text that has been collected for some purpose.

How is a 'chart' in chart parsing designed?

A chart is a collection of edges, usually implemented as a vector of edges, indexed by edge identifiers. In the simplest version of chart parsing, each edge records a rule application and has the following structure: [id,left vertex, right vertex,mother category, daughters] A vertex is an integer representing a point in the input string (between words). Mother category refers to the rule that has been applied to create the edge. Daughters is a list of the edges that acted as the daughters for this particular rule application: it is there purely for record keeping so that the output of parsing can be a labelled bracketing.

How many results does a chart parser return?

A chart parser is designed to be complete: it returns all valid trees (though there are some minor caveats e.g., concerning rules which can apply to their own output).

Where are the probabilites in a bigram model obtained from?

A corpus

Define 'balanced corpus'

A corpus which contains texts representing different genres (newspapers, fiction, textbooks, parliamentary reports, cooking recipes, scientific papers, etc.): early examples were the Brown corpus (US English: 1960s) and the Lancaster-Oslo-Bergen (LOB) corpus (British English: 1970s), which are each about 1 million words; the more recent British National Corpus (BNC: 1990s) contains approximately 100 million words, including about 10 million words of spoken English.

What is a chart parser?

A dynamic programming parser for CFGs. A chart is the data structure used to record partial results as parsing continues.

Define constraint-based grammar

A formalism which describes a language using a set of independently stated constraints, without imposing any conditions on processing or processing order.

Define 'left-associative grammar' and 'right-associative grammar'

A grammar in which all nonterminal daughters are the leftmost daughter in a rule (i.e., where all rules are of the form X → Y a*) is said to be left-associative. A grammar where all the nonterminals are rightmost is right-associative.

Define AI-Complete

A half-joking term, applied to problems that would require a solution to the problem of representing the world and acquiring world knowledge (lecture 1).

Define 'lexeme'

A lexeme is an abstract unit of morphological analysis in linguistics

Define affix

A morpheme which can only occur in conjunction with other morphemes

Define dependency structure

A syntactic or semantic representation that links words via relations

What does 'two level morphology' mean?

A system which is good for both analysing and generating morphemes.

What is "stemming"?

A technique in traditional information retrieval systems. Involves reducing all morphologically complex forms to a canonical form. The canonical form may not be the linguistic stem, despite the name of the technique. The most commonly used algorithm is the Porter stemmer, which uses a series of simple rules to strip endings.

Define aspect

A term used to cover distinctions such as whether a verb suggests an event has been completed or not (as opposed to tense, which refers to the time of an event). For instance, she was writing a book vs she wrote a book.

Describe the formation of spelling rules

Also known as orthographic rules. In such rules, the mapping is always given from the 'underlying' form to the surface form; the mapping is shown to the left of the slash and the context to the right, with the underscore indicating the position in question. Example (e-insertion): $\epsilon \to e / \{s\}\ \textasciicircum\ \_\ s$ (e.g. kiss^s maps to kisses).
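As a rough illustration, the rule can be applied in the generation direction with a regular expression; handling only stem-final 's' (as in the rule above) is a simplifying assumption, and apply_e_insertion is a hypothetical helper, not a full two-level implementation.

```
# e-insertion in the generation direction: an underlying form with a '^' affix
# boundary is mapped to a surface form.
import re

def apply_e_insertion(underlying):
    # Insert 'e' between a stem-final 's' and the suffix 's', then drop '^'.
    surface = re.sub(r"(s)\^(s)", r"\1e\2", underlying)
    return surface.replace("^", "")

print(apply_e_insertion("kiss^s"))  # kisses
print(apply_e_insertion("dog^s"))   # dogs
```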

Define 'lexical ambiguity'

Ambiguities from different lexical analyses. Example: fish can be a verb and a noun.

What is "lemmatization"?

Another name for morphological analysis.

Define expletive pronoun

Another term for pleonastic pronoun

How is prediction using n-grams used in automatic speech recognition?

As a form of language modelling. Speech recognisers cannot accurately determine a word from the sound signal for that word alone, and they cannot reliably tell where each word in an utterance starts and finishes. For instance, have an ice Dave, heaven ice day and have a nice day could easily be confused. For traditional ASR, an initial signal processing phase produces a lattice of hypotheses of the words uttered, which are then ranked and filtered using the probabilities of the possible sequences according to the language model. The language models which were traditionally most effective work on the basis of n-grams (a type of Markov chain), where the sequence of the prior n − 1 words is used to derive a probability for the next.

What is the tagging system used in lectures?

CLAWS 5 (C5)

What is derivational morphology?

Derivational affixes, such as un-, re-, anti- etc, have a broader range of semantic possibilities (there seems no principled limit on what they can mean) and don't fit into neat paradigms. Inflectional affixes may be combined (though not in English). However, there are always obvious limits to this, since once all the possible slot values are 'set', nothing else can happen.

Define case

Distinctions between nominals indicating their syntactic role in a sentence. In English, some pronouns show a distinction: e.g., she is used for subjects, while her is used for objects. e.g., she likes her vs *her likes she. Languages such as German and Latin mark case much more extensively.

What is a 'Wizard of Oz' experiment?

For interface applications in particular, collecting a corpus requires a simulation of the actual application: this has often been done by a Wizard of Oz experiment, where a human pretends to be a computer.

Define complement

For the purposes of this course, an argument other than the subject.

Describe English morphological structure

Generally concatenative

Define discourse

In NLP, a piece of connected text.

What optimisations are generally used on HMM POS taggers?

In fact, POS taggers generally use trigrams rather than bigrams; the relevant equations are given in J&M, 5.5.4. As with word prediction, backoff (to bigrams) and smoothing are crucial for reasonable performance because of sparse data. When a POS tagger sees a word which was not in its training data, we need some way of assigning possible tags to the word. One approach is simply to use all possible open class tags, with probabilities based on the unigram probabilities of those tags. A better approach is to use a morphological analyser (without a lexicon) to restrict the candidates: e.g., words ending in -ed are likely to be VVD (simple past) or VVN (past participle), but can't be VVG (-ing form).

In a generative grammar, what is a constituent?

In syntactic analysis, a constituent is a word or a group of words that function(s) as a single unit within a hierarchical structure

Define argument

In syntax, the phrases which are lexically required to be present by a particular word (prototypically a verb). This is as opposed to adjuncts, which modify a word or phrase but are not required. For instance, in: Kim saw Sandy on Tuesday Sandy is an argument but on Tuesday is an adjunct. Arguments are specified by the subcategorization of a verb etc. Also see the IGE.

Why may increasing the size of the tag set not necessarily degrade POS tagger performance?

Increasing the size of the tagset does not necessarily result in decreased performance: some additional tags could be assigned more-or-less unambiguously and more fine-grained tags can increase performance. For instance, suppose we wanted to distinguish between present tense verbs according to whether they were 1st, 2nd or 3rd person. With the C5 tag set, and the stochastic tagger described, this would be impossible to do with high accuracy, because all pronouns are tagged PRP, hence they provide no discriminating power. On the other hand, if we tagged I and we as PRP1, you as PRP2 and so on, the n-gram approach would allow some discrimination. In general, predicting on the basis of classes means we have less of a sparse data problem than when predicting on the basis of words, but we have less discriminating power. There is also something of a trade-off between the utility of a set of tags and their effectiveness in POS tagging. For instance, C5 assigns separate tags for the different forms of be, which is redundant for many purposes, but helps make distinctions between other tags in tagging models such as the HMM described here where the context is given by a tag sequence alone (i.e., rather than considering words prior to the current one).

Define the difference between inflectional and derivational morphology?

Inflectional morphology can be thought of as setting values of slots in some paradigm (i.e. there is a fixed set of slots which can be thought of as being filled with simple values). Inflectional morphology concerns properties such as tense, aspect, number, person, gender and case. Derivational affixes have a broader range of semantic possibilities, in that there seems no principled limit on what they can mean, and don't fit into neat paradigms.

What is inflectional morphology?

Inflectional morphology can be thought of as setting values of slots in some paradigm (i.e., there is a fixed set of slots which can be thought of as being filled with simple values). Inflectional morphology concerns properties such as tense, aspect, number, person, gender, and case, although not all languages code all of these: English, for instance, has very little morphological marking of case and gender.

What does it mean for a certain linguistic construct to be "productive"?

It applies to new words; for example, a productive affix can attach to newly coined stems.

Can POS taggers be built without hand-tagged corpuses?

It is possible to build POS taggers that work without a hand-tagged corpus, but they don't perform as well as a system trained on even a very small manually-tagged corpus and they still require a lexicon associating possible tags with words. Completely unsupervised approaches also exist, where no lexicon is used, but the categories induced do not correspond to any standard tagset.

Why does limiting natural language interface systems to limited domains make it a relatively easy problem?

It removes a lot of ambiguity, e.g. LUNAR (the lunar rock sample database querying system) only dealt with "rock" in the sense of the material, never the music.

Why did many mainstream linguists discount the use of corpuses?

Mainstream linguists in the past mostly dismissed their use in favour of reliance on intuitive judgements about whether or not an utterance is grammatical (a corpus can only (directly) provide positive evidence about grammaticality). However, many linguists do now use corpora.

What is an affix?

Morphemes which can only occur in conjunction with other morphemes: words are made up of a stem and zero or more affixes.

Define domain

Not a precise term, but I use it to mean some restricted set of knowledge appropriate for an application.

Define closed class

Refers to parts of speech, such as conjunction, for which all the members could potentially be enumerated (lecture 3).

Define distributional semantics

Representing word meaning by context of use

Define denominal

Something derived from a noun: e.g., the verb tango is a denominal verb.

Define deverbal

Something derived from a verb: e.g., the adjective surprised.

Comment on the apparently high success rates of POS taggers?

Success rates of about 97% are possible for English POS tagging (performance seems to have reached a plateau, probably partly because of errors in the manually-tagged corpora) but the baseline of choosing the most common tag based on the training set often gives about 90% accuracy. Some POS taggers return multiple tags in cases where more than one tag has a similar probability.

Discuss context-free grammars with empty productions

The formal description of a CFG generally allows productions with an empty right hand side (e.g., Det → ε). It is convenient to exclude these however, since they complicate parsing algorithms, and a weakly-equivalent grammar can always be constructed that disallows such empty productions.

Define compositionality

The idea that the meaning of a phrase is a function of the meaning of its parts. Compositional semantics is the study of how meaning can be built up by semantic rules which mirror syntactic structure (lecture 6).

What is the most difficult part of making a statistically based word prediction system, particularly one that is designed to help users input words quicker?

The main difficulty with using statistical prediction models in such applications is in finding enough data: to be useful, the model really has to be trained on an individual speaker's output, but of course very little of this is likely to be available. Training a conversational aid on newspaper text can be worse than using a unigram model from the user's own data.

What is a morpheme?

The minimal information carrying unit of a word

Define anaphora

The phenomenon of referring to something that was mentioned previously in a text. An anaphor is an expression which does this, such as a pronoun

Why can we parse a CFG without backtracking?

This works for parsing with CFGs because the rules are independent of their context: a VP can always expand as a V and an NP regardless of whether or not it was preceded by an NP or a V, for instance. (In some cases we may be able to apply techniques that look at the context to cut down the search space, because we can tell that a particular rule application is never going to be part of a sentence, but this is strictly a filter: we're never going to get incorrect results by reusing partial structures.)

Describe a finite state transducer

Transducers map between two representations, so each transition corresponds to a pair of characters. As with the spelling rule, we use the special character 'ε' to correspond to the empty character and 'ˆ' to correspond to an affix boundary. The abbreviation 'other : other' means that any character not mentioned specifically in the FST maps to itself. As with the FSA example, we assume that the FST only accepts an input if the end of the input corresponds to an accept state (i.e., no 'left-over' characters are allowed).

What useful additions can be made to FSAs?

Transition probabilities

Why is knowing trivial information both a difficult and essential problem for information retrieval systems?

Trivial information is critical for human reasoning but tends not to be explicitly stated anywhere, since humans find it trivial.

What does it mean for two grammars to be weakly equivalent?

Two grammars are said to be weakly-equivalent if they generate the same strings

What does it mean for two grammars to be strongly equivalent?

Two grammars are strongly equivalent if they assign the same structure to all strings they generate.

Define bag of words

Unordered collection of words in some text.

Define bidirectional

Usable for both analysis and generation

How can you make chart parsing go in cubic time?

Using 'packing'. The modification is to change the daughters value on an edge to be a set of lists of daughters and to make an equality check before adding an edge so we don't add one that's equivalent to an existing one. That is, if we are about to add an edge:

[id, left vertex, right vertex, mother category, daughters]

and there is an existing edge:

[id-old, left vertex, right vertex, mother category, daughters-old]

we simply modify the old edge to record the new daughters:

[id-old, left vertex, right vertex, mother category, daughters-old ⊔ daughters]

There is no need to recurse with this edge, because we couldn't get any new results: once we've found we can pack an edge, we always stop that part of the search. Thus packing saves computation and in fact leads to cubic time operation, though I won't go through the proof of this.

Example: we want to add an edge E1: [1, 1, 3, VP, {(2, 9)}]. Suppose there already exists an edge E2: [2, 1, 3, VP, {(2, 6)}]. Instead of inserting E1 and recursing on it, we append (2, 9) to the set of daughters for E2: [2, 1, 3, VP, {(2, 6), (2, 9)}].
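A minimal sketch of the packing check, assuming edges are stored as dictionaries keyed by the usual fields; the layout and the name add_edge_packed are illustrative assumptions.

```
# Packing: an edge equivalent to an existing one just contributes its daughters.
def add_edge_packed(chart, frm, to, category, daughters):
    for edge in chart:
        if (edge["from"], edge["to"], edge["category"]) == (frm, to, category):
            # Equivalent edge exists: pack the new daughters and stop searching.
            edge["daughters"].append(daughters)
            return None
    new_edge = {"id": len(chart), "from": frm, "to": to,
                "category": category, "daughters": [daughters]}
    chart.append(new_edge)
    return new_edge  # the caller recurses only on genuinely new edges
```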

Define backoff

Usually used to refer to techniques for dealing with data sparseness in probabilistic systems: using a more general classification rather than a more specific one. For instance, using unigram probabilities instead of bigrams; using word classes instead of individual words (lecture 3).

What algorithm is used to find bigrams in speech recognition. Give a very, very high overview of this algorithm.

We can regard bigrams as comprising a simple deterministic weighted FSA. The Viterbi algorithm, a dynamic programming technique for efficiently applying n-grams in speech recognition and other applications to find the highest probability sequence (or sequences), is usually described in terms of an FSA.

What is a problem of generating bigram data from corpuses, especially small ones? What solution is used?

We often end up with bigrams that have probability 0, because the estimate is based purely on pairs of words actually observed in the corpus. We use smoothing to mitigate some of this, which simply means that we make some assumption about the 'real' probability of unseen or very infrequently seen events and distribute that probability appropriately. A common approach is simply to add one to all counts: this is add-one smoothing, which is not sound theoretically, but is simple to implement. A better approach in the case of bigrams is to back off to the unigram probabilities: i.e., to distribute the unseen probability mass so that it is proportional to the unigram probabilities. This sort of estimation is extremely important to get good results from n-gram techniques.
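A minimal sketch of add-one smoothing for bigram estimates; the toy corpus, the vocabulary handling and the function name are illustrative assumptions, and as noted above backing off to unigram probabilities is usually preferable.

```
# Add-one (Laplace) smoothing: unseen bigrams get a small non-zero probability.
from collections import Counter

def add_one_bigram_probability(prev, word, bigram_counts, unigram_counts, vocab_size):
    """P(word | prev) with one added to every bigram count."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

tokens = "the dog barks and the dog sleeps".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])
vocab_size = len(set(tokens))

print(add_one_bigram_probability("dog", "barks", bigram_counts, unigram_counts, vocab_size))
print(add_one_bigram_probability("dog", "the", bigram_counts, unigram_counts, vocab_size))  # unseen, but non-zero
```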

How can you use a bigram model to determine the likelihood of an entire string of words?

We require the probability of some string of words $P(w_1^n)$ which is approximated by the product of the bigram probabilities: $P(w_1^n ) \approx \prod_{k = 1}^{n} P(w_k | w_{k − 1})$ This assumes independence of the individual probabilities, which is clearly wrong, but the approximation nevertheless works reasonably well. Note that, although the n-gram probabilities are based only on the preceding words, the effect of this combination of probabilities is that the choice between possibilities at any point is sensitive to both preceding and following contexts.
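A minimal sketch of scoring a word string with bigram probabilities; summing log probabilities (rather than multiplying raw probabilities) avoids underflow, and the probability table and the floor value for unseen bigrams are illustrative assumptions.

```
# Approximate log P(w_1^n) as the sum of log P(w_k | w_{k-1}).
import math

def string_log_probability(words, probs):
    return sum(math.log(probs.get((prev, w), 1e-10))
               for prev, w in zip(words, words[1:]))

probs = {("they", "fish"): 0.2, ("fish", "here"): 0.05}
print(string_log_probability(["they", "fish", "here"], probs))
```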

Derive the equations used in Hidden Markov Model for POS tagging

We want to find the most probable tag sequence $\hat{t}^n_1$ for a given word sequence $w^n_1$, i.e. we maximise: $\hat{t}^n_1 = argmax_{t^n_1} P(t^n_1 | w^n_1)$ (NB the hat symbol indicates an estimate). We use Bayes' theorem: $P(t^n_1 | w^n_1) = \frac{P(w^n_1 | t^n_1)P(t^n_1)}{P(w^n_1)}$. Since we are considering a particular given string, $P(w^n_1)$ is constant over all candidate tag sequences and can be dropped from the maximisation. Hence, $\hat{t}^n_1 = argmax_{t^n_1} P(w^n_1 | t^n_1)P(t^n_1)$.

Now, we also need to estimate $P(w^n_1 | t^n_1)$ and $P(t^n_1)$. If we make the bigram assumption, the probability of a tag depends only on the previous tag, so the tag sequence probability is estimated as a product: $P(t^n_1) = \prod^n_{i = 1} P(t_i | t_{i - 1})$. We also assume that the probability of a word is independent of the words and tags around it and depends only on its own tag: $P(w^n_1 | t^n_1) = \prod^n_{i = 1} P(w_i | t_i)$. These values can be estimated from corpus frequencies.

We therefore get the final expression: $\hat{t}^n_1 = argmax_{t^n_1} \prod^n_{i = 1} P(w_i | t_i)P(t_i | t_{i - 1})$
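A minimal sketch of computing the argmax of this final expression with the Viterbi algorithm (mentioned elsewhere in these notes); the toy probability tables and the '<s>' start marker are illustrative assumptions rather than values estimated from a corpus, and missing table entries are treated as probability 0.

```
# Viterbi search for the best tag sequence under the bigram HMM above.
def viterbi(words, tags, p_word_given_tag, p_tag_given_prev):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (p_tag_given_prev.get(("<s>", t), 0.0)
                * p_word_given_tag.get((words[0], t), 0.0), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            prob, path = max(
                (best[prev][0] * p_tag_given_prev.get((prev, t), 0.0)
                 * p_word_given_tag.get((word, t), 0.0), best[prev][1] + [t])
                for prev in tags)
            new_best[t] = (prob, path)
        best = new_best
    return max(best.values())[1]

tags = ["PNP", "NN1", "VVB"]
p_word_given_tag = {("they", "PNP"): 0.3, ("fish", "NN1"): 0.1, ("fish", "VVB"): 0.2}
p_tag_given_prev = {("<s>", "PNP"): 0.5, ("PNP", "NN1"): 0.2, ("PNP", "VVB"): 0.4}
print(viterbi(["they", "fish"], tags, p_word_given_tag, p_tag_given_prev))  # ['PNP', 'VVB']
```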

What is a probabilistic CFG and why is it important? How are they generated?

Weights can be manually assigned to rules and lexical entries in a manually constructed grammar. However, since the beginning of the 1990s, a lot of work has been done on automatically acquiring probabilities from a corpus annotated with syntactic trees (a treebank), either as part of a general process of automatic grammar acquisition, or as automatically acquired additions to a manually constructed grammar. Probabilistic CFGs (PCFGs) can be defined quite straightforwardly, if the assumption is made that the probabilities of rules and lexical entries are independent of one another (of course this assumption is not correct, but the orderings given seem to work quite well in practice). The importance of this is that we rarely want to return all parses in a real application, but instead we want to return those which are top-ranked: i.e., the most likely parses. This is especially true when we consider that realistic grammars can easily return many tens of thousands of parses for sentences of quite moderate length (20 words or so). If edges are prioritised by probability, very low priority edges can be completely excluded from consideration if there is a cut-off such that we can be reasonably certain that no edges with a lower priority than the cut-off will contribute to the highest-ranked parse. Limiting the number of analyses under consideration is known as beam search (the analogy is that we're looking within a beam of light, corresponding to the highest probability edges). Beam search is linear rather than exponential or cubic. Just as importantly, a good priority ordering from a parser reduces the amount of work that has to be done to filter the results by whatever system is processing the parser's output.
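One straightforward way to obtain the rule probabilities is sketched below, under the assumption that a rule's probability is its treebank count normalised by the count of its mother category; the toy counts are illustrative.

```
# PCFG rule probabilities as relative frequencies of rules per mother category.
from collections import Counter, defaultdict

rule_counts = Counter({("S", ("NP", "VP")): 100,
                       ("VP", ("V", "NP")): 60,
                       ("VP", ("V",)): 40})

mother_counts = defaultdict(int)
for (mother, _), count in rule_counts.items():
    mother_counts[mother] += count

rule_probs = {rule: count / mother_counts[rule[0]]
              for rule, count in rule_counts.items()}
print(rule_probs[("VP", ("V", "NP"))])  # 0.6
```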

Define 'structural ambiguity'

Where a string can be parsed in more than one way, producing parse trees with different structures.

What is 'part of speech tagging'?

Where the words in a corpus are associated with a tag indicating some syntactic information that applies to that particular use of the word.

What is 'beam search'?

Where we limit the number of analyses under consideration when parsing a string using a chart parser or similar, based on the probability of a certain parse occurring.

How can you find ambiguities in a chart from a chart parser?

Local ambiguities correspond to situations where a particular span has more than one associated edge.

