NLP
POS Tagging
Assigning the correct part of speech (POS) to each word in a corpus. Steps: tokenization, then POS tagging
N-grams Brown Corpus
Brown Corpus - famous, 1 million words
Use and Tool
Closed class word - can be enumerated, used frequently, stop list word
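Because closed class words can be enumerated, a stop list can be written out by hand. The following is a minimal sketch of stop-word filtering; the `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions built from the closed class examples in these notes, not a standard list.

```python
# Hypothetical stop list, assembled from closed class examples in these notes
# (articles, conjunctions, prepositions, pronouns).
STOP_WORDS = {"the", "a", "and", "or", "in", "by", "i", "she", "he"}

def remove_stop_words(tokens):
    """Keep only tokens that are not on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "She read a book by the river".split()
content_words = remove_stop_words(tokens)
print(content_words)
```

What remains after filtering are mostly open class words (nouns, verbs), which tend to carry the content of the sentence.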
Word classes
Nouns, verbs
Tokenizer
Recognizes words and their root words, e.g. playing -> play
Syntactic ambiguity
The same word sequence can have different meanings, e.g. "Flying planes can be dangerous". Meaning also has an impact on structure, e.g. "I washed the shirt with soap" (shirt and soap go together)
N-Grams
Sentences are sequences of words; rules govern how the words can be ordered.
N-grams Continued 2
Simplest case: any word can follow any other word. Frequency-based: how often does each word occur?
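The frequency-based view can be sketched as a unigram model, where each word's probability is simply its count divided by the total number of words. The toy sentence and variable names below are assumptions for illustration only.

```python
from collections import Counter

# Toy corpus (an assumption for illustration), split on whitespace.
tokens = "the cat sat on the mat".split()

counts = Counter(tokens)   # how often each word occurs
total = len(tokens)        # total number of words in the corpus

# Relative frequency of each word: count(word) / total
unigram_prob = {word: count / total for word, count in counts.items()}

for word, prob in unigram_prob.items():
    print(f"{word}: {prob:.3f}")
```

In this model the previous words play no role at all, which is what "any word can follow any other word" amounts to.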
Morphology
Single words are made up of morphemes: stems + affixes. Example: test -> tests, testing
N-grams In Information Systems
Spell check, text analysis in business
Assigning structure to text
Step 1: POS tagging: rule-based, stochastic, or transformation-based learning. Step 2: CFG (context-free grammar): rules on how to combine chunks
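As a rough illustration of the rule-based approach to Step 1, the sketch below looks each word up in a small lexicon and falls back on suffix rules. The tag names, the `LEXICON` entries, and the suffix rules are all hypothetical simplifications; real rule-based taggers use far larger lexicons and context-sensitive rules.

```python
# Hypothetical lexicon of closed class words plus a few adjectives/adverbs
# taken from the examples in these notes.
LEXICON = {
    "the": "DET", "a": "DET",
    "and": "CONJ", "or": "CONJ",
    "in": "PREP", "by": "PREP",
    "i": "PRON", "she": "PRON", "he": "PRON",
    "very": "ADV", "old": "ADJ", "blue": "ADJ",
}

def rule_based_tag(tokens):
    """Tag each token using the lexicon first, then simple suffix rules."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tag = LEXICON[word]
        elif word.endswith("ly"):
            tag = "ADV"    # crude rule: -ly suggests an adverb
        elif word.endswith("ing") or word.endswith("ed"):
            tag = "VERB"   # crude rule: -ing / -ed suggests a verb form
        else:
            tag = "NOUN"   # default guess for unknown open class words
        tagged.append((token, tag))
    return tagged

print(rule_based_tag("She painted the old fence".split()))
```

The default-to-noun fallback is a design choice of this sketch: unknown words are most likely open class, and nouns are a common guess.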
Words NLP
Text = sequence of characters; the task is to recognize the words in the sequence
Morphology
Two classes of morphemes: stem = main meaning ("cat"); affix = additional meaning (the "-s" in "cats"). Parsing = assigning structure
Parsing
assigning a structure to a text that fits the rules
Conjunction
closed class: and, or
Preposition
closed class; comes before a noun or phrase (in, by)
Pronoun
closed class; refers to a person/entity: I, she, he
Syntax
combining words according to a set of rules
Semantics
connecting linguistic elements to non-linguistic knowledge of the world (meaning of words)
Words lemma
describes the set of lexical forms with the same stem: cats, cat
Words tokenizer
extracts tokens using rules that split up a sequence of characters. Basic rule: split on whitespace
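A minimal sketch of two tokenizers: the basic whitespace rule, and a slightly richer regular-expression rule that also separates punctuation. The function names and the regex pattern are assumptions for illustration, not a standard tokenizer.

```python
import re

def whitespace_tokenize(text):
    """Basic rule: split on runs of whitespace."""
    return text.split()

def regex_tokenize(text):
    """Words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

sentence = "The cat sat on the mat, didn't it?"
print(whitespace_tokenize(sentence))
print(regex_tokenize(sentence))
```

The whitespace rule leaves punctuation attached to words ("mat," and "it?"), while the regex rule emits the punctuation marks as separate tokens.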
Regular Expressions
formula to describe/specify a string
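For example, a regular expression can specify "any word ending in -ing". This is a small sketch using Python's built-in `re` module; the pattern name and sample sentence are assumptions for illustration.

```python
import re

# Pattern: word boundary, one or more letters, the suffix "ing", word boundary.
ING_WORD = re.compile(r"\b[A-Za-z]+ing\b")

text = "She was playing while he kept testing the string parser."
matches = ING_WORD.findall(text)
print(matches)
```

Note that "string" matches as well: the expression describes the form of a string, not its meaning or part of speech.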
Morphology stemming
reduce a word to its base morpheme, e.g. elected -> elect
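A naive suffix-stripping sketch shows the idea. This is not the Porter Stemmer; the suffix list, the minimum stem length, and the `naive_stem` name are assumptions chosen to keep the example short.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no suffix matched: the word is treated as its own stem

for w in ["elected", "tests", "testing", "sing"]:
    print(w, "->", naive_stem(w))
```

The minimum-length check keeps short words such as "sing" from being reduced to a meaningless fragment; real stemmers use more careful conditions.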
Parts of Speech
lexical tagset = POS. Provides information about a word, its neighbors, and its pronunciation
Pragmatics
meaning in context. Language as a means of communication needs three elements: (1) the linguistic expression, (2) what the expression refers to, (3) the context. Syntax uses 1; semantics uses 1-2; pragmatics uses 1-2-3
Word tokens
total number of word occurrences in a text (every appearance of a word counts)
Word types
number of distinct words
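The difference between tokens and types can be checked on a toy sentence. This is a minimal sketch that assumes whitespace tokenization and an already lowercased text.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

counts = Counter(tokens)     # occurrences per distinct word

num_tokens = len(tokens)     # every occurrence counts
num_types = len(counts)      # each distinct word counts once

print("tokens:", num_tokens)
print("types:", num_types)
```

Here "the" contributes three tokens but only one type, so the token count is always at least as large as the type count.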
Adjective
open class word, properties and qualities (old, blue)
Adverb
open class word: modifies something (very, extremely)
Verbs
open class words (draw). Auxiliaries: closed class words that mark tense and mood (can, to do)
Morphological parsing
rewriting a word as its base components. Porter Stemmer: the most famous stemmer; rewrites suffixes to get the stems of words, e.g. relational -> relate
Morpheme
smallest meaning-bearing unit
Basic structure of syntax
subject - predicate
Natural Language Processing (NLP)
Increasing availability of text. Technical: UI, information retrieval and extraction. Domains: business (Google advertising), computer science (voice recognition)
Trigram Model
Counts every sequence of three consecutive words; predictions are based on the previous two words
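A minimal sketch of trigram counting and prediction, assuming a toy corpus and whitespace tokenization. The helper names `train_trigrams` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_trigrams(tokens):
    """Map each pair of consecutive words to counts of the word that follows."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

def predict_next(model, w1, w2):
    """Return the most frequent follower of (w1, w2), or None if unseen."""
    followers = model.get((w1, w2))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "i like green tea and i like green apples and i like green tea".split()
model = train_trigrams(tokens)
print(predict_next(model, "like", "green"))
```

In this corpus "like green" is followed by "tea" twice and "apples" once, so the prediction is "tea". An unseen pair of words gives no prediction at all, which is one reason practical models add smoothing or back off to shorter n-grams.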
Bigram Model
Counts every pair of consecutive words; predictions are based on the previous word
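With bigram counts, the probability of a word given the previous word can be estimated as count(previous, word) / count(previous). The sketch below assumes a toy corpus and uses no smoothing; `bigram_prob` is a hypothetical helper name.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()

# Count every pair of consecutive words.
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Count each word in a position where it has a follower.
context_counts = Counter(tokens[:-1])

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))
print(bigram_prob("cat", "sat"))
```

In this corpus "the" is followed by "cat" twice and "mat" once, so "cat" is the more likely next word after "the".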
Closed class word
Determiners: that, there, the. Article = subclass of determiner: the, a
Tools and Resources for NLP
For example, Python's NLTK (Natural Language Toolkit)
Sequence of words
N-gram: a sequence of n consecutive words, useful for building language models for word prediction
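Extracting n-grams from a token sequence takes only a sliding window. This is a minimal sketch; the `ngrams` function here is a hand-written helper for illustration, not a library call.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so the six tokens above give five bigrams and four trigrams; a sequence shorter than n gives none.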
Nouns
Open class words. Proper nouns: names, e.g. the name of a place. Common nouns: mass nouns (group) and count nouns, which are countable (pear)