Naive Bayes and Sentiment Classification
byte n-grams
where instead of using the multibyte Unicode character representations called codepoints, we just pretend everything is a string of raw bytes
binary multinomial naive bayes
whether a word occurs or not seems to matter more than its frequency. Thus it often improves performance to clip the word counts in each document at 1
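The clipping step can be sketched in a few lines; the helper name here is hypothetical, not from the text:

```python
from collections import Counter

def binarized_counts(doc_tokens):
    # Binary multinomial naive Bayes: clip each word's count in a
    # document at 1, so only presence/absence matters.
    return Counter(set(doc_tokens))

clipped = binarized_counts(["great", "great", "plot", "great"])
# "great" appeared 3 times but is clipped to a count of 1
```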
feature selection
winnow down to the most informative 7000 final features
effect size
δ(x); a bigger δ means that classifier A appears to be much better than classifier B, while a small δ means classifier A appears to be only a little better
cross-validation
we randomly choose a training and test set division of our data, train our classifier, and then compute the error rate on the test set
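A minimal k-fold sketch of this idea, assuming a generic `train_fn`/`eval_fn` interface (both names are illustrative):

```python
def kfold_indices(n, k):
    # Partition indices 0..n-1 into k roughly equal, disjoint folds.
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_fn, eval_fn):
    # For each fold: train on the other folds, compute error on this fold,
    # then report the average error over all k folds.
    folds = kfold_indices(len(data), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        test = [data[i] for i in test_idx]
        train = [data[i] for i in range(len(data)) if i not in held_out]
        model = train_fn(train)
        errors.append(eval_fn(model, test))
    return sum(errors) / k
```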
Classification
Deciding what letter, word, or image has been presented to our senses, recognizing faces or voices, sorting mail, assigning grades to homeworks; these are all examples of assigning a category to an input
development test set
a dataset distinct from the training and test sets, used during development to tune parameters and make design decisions without relying on the test set
stop words
very frequent words like the and a
Bayesian inference
The idea that our estimate of the probability of an outcome is determined by the prior probability (our initial belief) and the likelihood (the extent to which the available evidence is consistent with the outcome).
macroaveraging
we compute the performance for each class, and then average over classes
supervised machine learning
we have a data set of input observations, each associated with some correct output (a 'supervision signal'). The goal of the algorithm is to learn how to map from a new observation to a correct output.
confusion matrix
a table for visualizing how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), and each cell labeling a set of possible outcomes
bag-of-words
an unordered set of words with their position ignored, keeping only their frequency in the document
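A bag-of-words representation is easy to build with a counter; this sketch uses whitespace tokenization and lowercasing as simplifying assumptions:

```python
from collections import Counter

def bag_of_words(text):
    # Map each word to its frequency, discarding word order entirely.
    return Counter(text.lower().split())

bow = bag_of_words("the plot the plot the acting")
```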
Spam detection
another important commercial application, the binary classification task of assigning an email to one of the two classes spam or not-spam
bootstrap test
can apply to any metric, from precision, recall, or F1 to the BLEU metric used in machine translation
microaveraging
collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table
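The contrast between macro- and microaveraging can be sketched for precision (recall is analogous); the per-class tuples of true/false positive counts here are illustrative:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def macro_precision(per_class):
    # Macroaveraging: compute precision per class, then average,
    # weighting every class equally.
    return sum(precision(tp, fp) for tp, fp, _ in per_class) / len(per_class)

def micro_precision(per_class):
    # Microaveraging: pool all counts into one confusion table first,
    # so frequent classes dominate the result.
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    return precision(tp, fp)
```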
authorship attribution
determining a text's author; a task also relevant to the digital humanities, social sciences, and forensic linguistics
model card
documents a machine learning model with information like: a) training algorithms and parameters b) training data sources, motivation, and preprocessing c) evaluation data sources, motivation, and preprocessing d) intended use and users e) model performance across different demographic or other groups and environmental situations
representational harms
harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them
naive Bayes assumption
this is the conditional independence assumption that the probabilities P(fi|c) are independent given the class c and hence can be 'naively' multiplied
Naive Bayes unknown words
ignore them: remove them from the test document and include no probability for them at all
Discriminative classifiers
like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes. While discriminative systems are often more accurate and hence more commonly used, generative classifiers still have a role.
Generative classifiers
like naive Bayes build a model of how a class could generate some input data. Given an observation, they return the class most likely to have generated the observation.
sentiment lexicons
lists of words that are pre-annotated with positive or negative sentiment
Recall
measures the percentage of items actually present in the input that were correctly identified by the system
Precision
measures the percentage of the items that the system detected (i.e., the system labeled as positive) that are in fact positive (i.e., are positive according to the human gold labels)
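Both metrics follow directly from the confusion-matrix counts; a small sketch:

```python
def precision_recall(tp, fp, fn):
    # tp: system said positive, gold is positive
    # fp: system said positive, gold is negative
    # fn: system said negative, gold is positive
    precision = tp / (tp + fp)  # of what we detected, how much was right
    recall = tp / (tp + fn)     # of what was there, how much we found
    return precision, recall
```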
prior probability
our initial belief about the probability of an outcome
bootstrapping
refers to repeatedly drawing large numbers of smaller samples with replacement (called bootstrap samples) from an original larger sample
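A paired bootstrap test on accuracy can be sketched as follows, using the common shifted-null formulation (counting bootstrap samples where δ exceeds twice the observed δ); the function name and 0/1 correctness-vector interface are assumptions:

```python
import random

def paired_bootstrap_pvalue(correct_a, correct_b, b=1000, seed=0):
    # correct_a / correct_b: per-example 0/1 correctness for systems A and B.
    rng = random.Random(seed)
    n = len(correct_a)
    delta = (sum(correct_a) - sum(correct_b)) / n  # observed delta(x)
    count = 0
    for _ in range(b):
        # Draw n examples with replacement (one bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if d >= 2 * delta:
            count += 1
    return count / b  # estimated p-value
```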
F-measure
the simplest metric that incorporates aspects of both precision and recall, defined as their weighted harmonic mean
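The weighted harmonic mean is a one-liner; beta = 1 gives the balanced F1 score:

```python
def f_measure(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall;
    # beta > 1 favors recall, beta < 1 favors precision.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```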
multinomial naive Bayes classifier
so called because it is a Bayesian classifier that makes the simplifying (naive) assumption about how features interact
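Training and classification can be sketched end to end; this version uses add-1 (Laplace) smoothing, works in log space to avoid underflow, and ignores unknown test words, but the function names and data layout are illustrative rather than the book's exact code:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, class) pairs.
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    vocab = set()
    for tokens, c in docs:
        class_docs[c] += 1
        class_words[c].update(tokens)
        vocab.update(tokens)
    n = len(docs)
    log_prior = {c: math.log(k / n) for c, k in class_docs.items()}
    log_like = {}
    for c, counts in class_words.items():
        # Add-1 smoothing: every vocab word gets at least count 1.
        total = sum(counts.values()) + len(vocab)
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, log_prior, log_like, vocab):
    # argmax over classes of log P(c) + sum_i log P(w_i | c);
    # words not in the vocabulary are simply skipped.
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```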
null hypothesis
supposes that δ(x) is actually negative or zero, meaning that A is not better than B
probabilistic classifier
a classifier that additionally tells us the probability of the observation being in each class
sentiment analysis
the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object
likelihood
the probability of the observed evidence given an outcome; the extent to which the available evidence is consistent with the outcome
language id
the task of determining what language a text is written in; the first step in most language processing pipelines
gold labels
the human-defined labels for each document that we are trying to match
character n-grams
features built from sequences of characters; for some tasks (like language id) the most effective naive Bayes features are not words at all, but character or byte n-grams
p-value
the probability, assuming the null hypothesis H0 is true, of seeing a δ(x) as large as the one we observed, or even larger
text categorization
the task of assigning a label or category to an entire text or document
toxicity detection
the task of detecting hate speech, abuse, harassment, or other kinds of toxic language
statistically significant
the δ we saw has a probability that is below the threshold and we therefore reject this null hypothesis