160exam2
Why text is difficult
- "unstructured" - linguistic structure is intended for human communication and not computers - word order matters sometimes - text can be dirty (abbreviations, synonyms, grammar, punctuation, etc...) - context matters
How does a business ensure that it gets the most from the wealth of data?
(i) the firm's management must think data-analytically, and (ii) the management must create a culture where data science, and data scientists, will thrive.
N-gram Sequences
- in some cases, word order is important and you want to preserve some information about it in the representation
- the next step up in complexity is to include sequences of adjacent words as terms (individual words, adjacent word pairs, adjacent word triples)
- useful when particular phrases are significant but their component words may not be
- advantage of n-grams: they are easy to generate (see the sketch below); disadvantage: they greatly increase the size of the feature set
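A minimal base-R sketch of generating such terms; the ngrams helper and the example sentence are just illustrative:

text   <- "new york is a big apple"
tokens <- unlist(strsplit(tolower(text), "\\s+"))

# build all sequences of n adjacent tokens, joined with "_"
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(1:(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"))
}

ngrams(tokens, 1)  # individual words
ngrams(tokens, 2)  # adjacent word pairs, e.g. "new_york"
ngrams(tokens, 3)  # adjacent word triples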
Topic Models
- inserts an additional layer between the document and the model: the topic layer
- the main idea is first to model the set of topics in a corpus separately. As before, each document constitutes a sequence of words, but instead of the words being used directly by the final classifier, the words map to one or more topics. The topics are also learned from the data (often via unsupervised data mining). The final classifier is defined in terms of these intermediate topics rather than words.
Named Entity Extraction
- can process raw text and extract phrases annotated with terms like "person" or "organization"
- knowledge intensive: to work well, extractors have to be trained on a large corpus or hand-coded
Pre-processing of Text
- case normalized to lower case
- words are stemmed (suffixes removed; noun plurals reduced to singular forms)
- stop-words removed (very common words like the, a, and, etc.)
- numbers are commonly regarded as unimportant, but the purpose of the representation should decide this
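A rough base-R sketch of these steps; the document, the stop-word list, and the use of the SnowballC package for stemming are all illustrative assumptions:

library(SnowballC)                     # assumed available, for wordStem()

doc <- "The 3 Traders traded 500 shares, trading quickly!"
doc <- tolower(doc)                    # case normalization
doc <- gsub("[[:punct:]]", " ", doc)   # strip punctuation
doc <- gsub("[[:digit:]]+", " ", doc)  # drop numbers (task-dependent)
tokens <- unlist(strsplit(doc, "\\s+"))
tokens <- tokens[tokens != ""]
stops  <- c("the", "a", "and", "of")   # tiny illustrative stop-word list
tokens <- tokens[!tokens %in% stops]
wordStem(tokens)                       # stemming, e.g. "trading" -> "trade"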
Expected Value Framework and Business Problem
- the expected value framework decomposes a business problem and recomposes the solution pieces (it provides a helpful decomposition of possibly complicated business problems into subproblems that we understand better how to solve)
- we can use expected value as a framework for structuring our approach to engineering a solution to the problem
TFIDF
- the product of term frequency (TF) and inverse document frequency (IDF): TFIDF(t, d) = TF(t, d) × IDF(t)
- the TFIDF value is specific to a single document (d), whereas IDF depends on the entire corpus
- *Each document thus becomes a feature vector, and the corpus is the set of these feature vectors* (see the worked sketch after the IDF card below)
Term frequency (TF)
- a step up from bag of words: uses the frequency of a word in the document instead of just a zero or one
- purpose is to represent the relevance of a term to a document
- raw frequencies are normalized in some way, such as by dividing each by the total number of words in the document (normalized term frequency)
Bag of words
- treats every document as just a collection of individual words
- ignores grammar, word order, sentence structure, and (usually) punctuation
- treats every word in a document as a potentially important keyword of the document
- straightforward and inexpensive to generate, and tends to work well for many tasks (one if a token is present, zero if it is not)
Inverse Document Frequency (IDF)
- useful words are not too common (too-common words don't distinguish anything) and not too rare (too-rare words won't be the basis for a meaningful cluster)
- can put both upper and lower limits on the number (or fraction) of documents in which a word may occur
- measures the sparseness of term t
- can be thought of as the boost a term gets for being rare
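A toy sketch tying the four cards above together; it assumes one common IDF formulation, IDF(t) = 1 + log(total documents / documents containing t), and a made-up three-document corpus:

corpus <- list(d1 = c("jazz", "jazz", "blues"),
               d2 = c("jazz", "rock"),
               d3 = c("rock", "blues", "rock"))
vocab <- unique(unlist(corpus))

tf  <- function(t, d) sum(d == t) / length(d)   # normalized term frequency
idf <- function(t)                              # boost for rarity
  1 + log(length(corpus) / sum(sapply(corpus, function(d) t %in% d)))

# each column is a document's TFIDF feature vector over the vocabulary
tfidf <- sapply(corpus, function(d) sapply(vocab, function(t) tf(t, d) * idf(t)))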
Errors a model makes can be characterized by three factors
1. Inherent randomness
2. Bias
3. Variance
F-Measure
2 · Precision · Recall / (Precision + Recall)
Example of ranking by likelihood
A common situation is where you have a budget for actions, such as a fixed marketing budget for a campaign, and so you want to target the most promising candidates. If one is going to target the highest expected value cases using costs and benefits that are constant for each class, then ranking cases by likelihood of the target class is sufficient. The budget must be small enough that the actions do not go into negative expected-value territory.
Curve in ROC space
A ranking model produces a set of points (a curve) in ROC space. A ranking model can be used with a threshold to produce a discrete (binary) classifier: if the classifier output is above the threshold, the classifier produces a Y, else an N.
Superior Data Scientists
A top-notch data scientist is well connected to other data scientists throughout the data science community, is a master of some area of technical expertise, and is familiar with many others.
How to measure generalization performance
Accuracy, Precision, Recall, F-measure
Naive-Naive Bayes
Assuming full feature independence, the Naïve Bayes classifier becomes a Naïve-Naïve Bayes (probability as a product of evidence lifts)
Challenges of web scraping
- At the mercy of the website
- Many sites are old, not up to date on current design standards
- Data validation can be difficult and time consuming
- Need some basic knowledge of HTML
When is web scraping worthwhile
- Best when scraping many pages
- Particularly when web addresses are not structured (follow links)
- Useful when data need to be updated
Proposal Review Questions
Business and Data Understanding:
- If unsupervised, is there an "exploratory data analysis" path well defined? (That is, where is the analysis going?)
- If supervised, is the target variable defined?
Data Preparation:
- Are the data being drawn from a population similar to the one to which the model will be applied?
- If there are discrepancies, are the selection biases noted clearly? Is there a plan for how to compensate for them?
Modeling:
- Is the choice of model appropriate for the choice of target variable?
Evaluation and Deployment:
- Does the evaluation use holdout data?
Skewed
Class distribution is unbalanced because the unusual or interesting class is rare among the general population
Lift example
Consider a list of 100 customers, half of whom churn (positive instances) and half who do not (negative instances). If you scan down the list and stop halfway (representing 0.5 targeted), how many positives would you expect to have seen in the first half? If the list were sorted randomly, you would expect to have seen only half the positives (0.5), giving a lift of 0.5/0.5 = 1. If the list had been ordered by an effective ranking classifier, more than half the positives should appear in the top half of the list, producing a lift greater than 1. If the classifier were perfect, all positives would be ranked at the top of the list so by the midway point we would have seen all of them (1.0), giving a lift of 1.0/0.5 = 2.
TFIDF representation
Cosine similarity is commonly used in text classification to measure the distance between documents.
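A minimal sketch of cosine similarity between two feature vectors; the vectors here are made up:

cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine_sim(c(0.5, 0.2, 0), c(0.4, 0.3, 0.1))  # ~0.95; 1 means same direction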
Strengths of web scraping
- Data is relatively easily obtained
- Structured process for obtaining data
- Can be easily updated
Caveats to web scraping
- Do not want to repeatedly perform expensive bandwidth operations (we are using the website's bandwidth)
- Better to scrape once, then run only to update the data
- Some websites prohibit scraping in their terms of service (TOS)
Combine evidence
For any particular collection of evidence E, we probably have not seen enough cases with exactly that same collection of evidence to be able to infer the probability of class membership with any confidence. Therefore, what we will do is to consider the different pieces of evidence separately, and then combine evidence.
What are some general guidelines for good baselines?
For classification tasks, one good baseline is the majority classifier, a naive classifier that always chooses the majority class of the training dataset. For regression problems we have a directly analogous baseline: predict the average value over the population (usually the mean or median).
Sustaining Competitive Advantage with Data Science
- Historical Advantage
- Unique Intellectual Property
- Unique Intangible Collateral Assets (such as company culture)
- Superior Data Scientists
- Superior Data Science Management
When to use profit curves for visualizing model performance
If both class priors and cost-benefit estimates are known and are expected to be stable, profit curves may be a good choice for visualizing model performance.
Bad positives and harmless negatives
In discussing classifiers, we often refer to a bad outcome as a "positive" example, and a normal or good outcome as "negative." It is useful to think of a positive example as one worthy of attention or alarm, and a negative example as uninteresting or benign.
Questions that should be asked both in formulating proposals for projects and in evaluating them:
- Is the business problem well specified?
- Does the data science solution solve the problem?
- Is it clear how we would evaluate a solution?
- Would we be able to see evidence of success before making a huge investment in deployment?
- Do we have the data assets we need? (For example, for supervised modeling, are there actually labeled training data? Will the organization invest in the assets it does not yet have?)
Naive
Models each feature's effect on the target independently, so it takes no feature interactions into account
polyline
Multi-segment line, often used to approximate curves
Uses of Naive Bayes
Naive Bayes is included in nearly every data mining toolkit and serves as a common baseline classifier against which more sophisticated methods can be compared.
Data Science Maturity
One dimension that is very important for strategic planning is the firm's "maturity," specifically, how systematic and well founded are the processes used to guide the firm's data science projects.
Better points in ROC space
One point in ROC space is superior to another if it is to the northwest of it: tp rate is higher with fp rate no worse, fp rate is lower with tp rate no worse, or both are better.
Joint probability
P(AB): the probability that both A and B will occur. In the case of independent events: p(AB) = p(A) · p(B)
PBSmapping and maptools
PBSmapping: R processing tools used to manage map data
maptools: additional tools that make PBSmapping functions easier to use
Distribution that fits the arrival times of tweets on a given topic
Poisson distribution
The Poisson distribution has a characteristic shape that would be described as
Positively (right) skewed
discriminative methods
Prior chapters presented modeling techniques that basically asked the question: *"What is the best way to distinguish (segment) target values?"* Classification trees and linear equations both create models this way, trying to minimize loss or entropy, which are functions of discriminability. These are termed discriminative methods, in that they try directly to discriminate different targets.
when to use cumulative response, lift
When costs and benefits cannot be specified with confidence, but the class mix will likely not change, a cumulative response or lift graph is useful. Both show the relative advantages of classifiers, independent of the value (monetary or otherwise) of the advantages.
pros and cons of Naive Bayes
Pros:
- Very simple classifier
- Efficient in terms of both storage space and computation time
- Performs well in many real-world applications
- Incremental learner (good for online applications with high velocity)
Cons:
- Inaccurate class probability estimation
Selenium
Browser-automation tool, available in R via the RSelenium package
Building tree in random forest
- Sample N examples at random, with replacement (where N is the training set size)
- Select a subset of features at random
- Grow the tree as large as possible without pruning
- Decide on the number of features using a validation set
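A minimal sketch using the randomForest package; the parameter values and the built-in iris data are just for illustration:

library(randomForest)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # number of trees that will vote
                   mtry  = 2)     # features sampled at random at each split
predict(rf, newdata = iris[1:3, ])  # majority vote across the trees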
Ranking
A strategy for making decisions that ranks a set of cases by their scores and takes action on the cases at the top of the ranked list. Instead of deciding each case separately, we may decide to take the top n cases (or, equivalently, all cases that score above a given threshold).
Specificity
TN / (TN + FP) = True negative rate = 1 - False positive rate
Sensitivity
TP / (TP + FN) = True positive rate
Recall
TP/(TP+FN)
Precision
TP/(TP+FP)
Goal of text representation
Take a set of documents and turn it into our familiar feature-vector form
require()
Attempts to load a package like library(), but returns TRUE/FALSE instead of throwing an error when the package is unavailable, so it can be used to test whether a package can be loaded
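A typical idiom built on this behavior (stringr is just an example package):

if (!require(stringr)) {
  install.packages("stringr")
  library(stringr)
}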
ROC graphs and curves
The diagonal line connecting (0, 0) to (1, 1) represents the policy of guessing a class.
FPs vs. FNs
The number of FPs may dominate, though the cost of each FN will be higher. (e.g. the consequences of telling someone with cancer that they don't have cancer is more harmful than the other way around)
Point of analytical engineering
The point of analytical engineering is not to develop complex solutions by addressing every possible contingency. Rather, the point is to promote thinking about problems data analytically so that the role of data mining is clear, the business constraints, cost, and benefits are considered, and any simplifying assumptions are made consciously and explicitly. This increases the chance of project success and reduces the risk of being blindsided by problems during deployment.
How to succeed with Data Science
- Thinking data-analytically
- Supporting a culture where data science and data scientists thrive
- Recruiting top data science teams
- Considering data and the ability to conduct data science as strategic assets (can they be leveraged to achieve a competitive advantage?)
Generative methods
This chapter introduced a new family of methods that essentially turns the question around and asks: *"How do different targets generate feature values?"* They attempt to model how the data were generated. In the use phase, when faced with a new example to be classified, they apply Bayes' Rule to their models to answer the question: "Which class most likely generated this example?" Thus, in data science this approach to modeling is called generative, and it forms the basis for a large family of popular methods known as Bayesian methods.
Confusion Matrix abbreviations
True classes: p (positive) and n (negative). Predicted classes: Y(es) and N(o), e.g. the model says "Y(es), it is a positive" or "N(o), it is not a positive".
Attributes of good data science managers
- Understand and appreciate the needs of the business
- Able to communicate with and be respected by both "techies" and "suits"
- Can coordinate technically complex activities
- Able to anticipate outcomes of data science projects
Difference between ROC and cumulative response/lift curves
Unlike for ROC curves, these curves assume that the test set has exactly the same target class priors as the population to which the model will be applied.
expected rates (probability matrix)
Used in the expected value calculation. Normalize the confusion matrix to rates by dividing each count by the total number of instances:
p(Y, p)  p(Y, n)
p(N, p)  p(N, n)
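A small sketch with made-up counts; prop.table() divides each cell by the total:

conf <- matrix(c(56, 7, 5, 42), nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("Y", "N"),
                               actual    = c("p", "n")))
rates <- prop.table(conf)  # expected rates: p(Y,p), p(Y,n), p(N,p), p(N,n)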
Ranking confusion matrix
With a ranking classifier, a classifier plus a threshold produces a single confusion matrix. Whenever the threshold changes, the confusion matrix may change as well because the numbers of true positives and false positives change. As the threshold is lowered, instances move up from the N row into the Y row of the confusion matrix: an instance that was considered a negative is now classified as positive, so the counts change. Technically, each different threshold produces a different classifier, represented by its own confusion matrix.
discrete classifier
a classifier that outputs only a class label (as opposed to a ranking). Each discrete classifier produces an (fp rate, tp rate) pair corresponding to a single point in ROC space.
decision stump
a decision tree with only one internal node, the root node. A tree limited to one internal node simply means that the tree induction selects the single most informative feature to make a decision.
Confusion Matrix
a matrix of counts with the columns labeled with actual classes and the rows labeled with predicted classes. It separates out the decisions made by the classifier, making explicit how one class is being confused for another. (The errors of the classifier are the FPs and FNs.)
cost benefit matrix
a matrix of the benefits (a negative benefit is a cost) for each classification:
b(Y, p)  b(Y, n)
b(N, p)  b(N, n)
Receiver Operating Characteristics (ROC) graph
a method that can accommodate uncertainty by showing the entire space of performance possibilities. A two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis; it depicts the relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives). A common tool for visualizing model performance for classification, class probability estimation, and scoring.
Document
a piece of text, no matter how large or small (e.g. a 100-page report or a comment on a blog post), composed of individual tokens or terms (words)
ROC graph random classifier
a random classifier will produce a ROC point that moves back and forth on the diagonal based on the frequency with which it guesses the positive class. In order to get away from this diagonal into the upper triangular region, the classifier must exploit some information in the data. Note that no classifier should be in the lower right triangle of a ROC graph. This represents performance that is worse than random guessing.
list
an object that contains other data objects, and those objects may be a variety of different modes/types.
Problems with accuracy as a metric
as skew increases, evaluation based on accuracy breaks down. Accuracy makes no distinction between false positive and false negative errors; by counting them together, it makes the tacit assumption that both errors are equally important. (These are typically very different kinds of errors with very different costs, because the classifications have consequences of differing severity.)
Unicode
binary representation of characters in most of the world's written languages
Critical conditions underlying the profit calculation
- class priors
- costs and benefits
Corpus
collection of documents
Shapefiles
files containing geographic data used to draw maps
Random forests
Each tree votes for its classification output (majority wins). Benefits:
- reduces the chance of overfitting
- typically produces higher model performance
AUC equivalences
- equivalent to the Mann-Whitney-Wilcoxon measure
- equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
Naïve Bayes Classifier
estimates the probability that the example belongs to each class, and reports the class with the highest probability. Since p(E) is the same regardless of which class c is computed, we don't need to calculate it to determine the highest-probability class.
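A minimal sketch using the e1071 package's naiveBayes(); the built-in iris data is just for illustration:

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
predict(nb, iris[1:3, ])                # class with the highest probability
predict(nb, iris[1:3, ], type = "raw")  # per-class probability estimates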
web scraping
extracting text from web pages
Write for loop in R that outputs the sequence of numbers 1, 1/2, 1/3, 1/4, 1/5, ... 1/1000.
for (i in 1:1000) { print(1/i) }
corpus
from the Latin word meaning "body," a word that text analysts use to refer to a body of text material, often consisting of one or more documents
Base rate of a class
how well a classifier would perform by simply choosing that class for every instance
Independent events
Events A and B are independent if knowing about one of them tells you nothing about the likelihood of the other. The typical example used to illustrate independence is rolling a fair die: knowing the value of the first roll tells you nothing about the value of the second.
When ranking by likelihood is not sufficient
if individual cases have different costs and benefits
incremental learner
induction technique that can update its model one training example at a time; does not need to reprocess all past training examples when new training data become available
p(E)
likelihood of the evidence: how common is the feature representation E among all examples? This might be calculated from the data as the percentage occurrence of E among all examples.
Permissive classifier decisions
lowering the threshold
Expected profit using class priors
Expected profit = p(p) · [p(Y | p) · b(Y, p) + p(N | p) · b(N, p)] + p(n) · [p(N | n) · b(N, n) + p(Y | n) · b(Y, n)]
Note: the true positive rate and the false negative rate refer to cases where the instance is actually positive; the true negative rate and the false positive rate refer to cases where the instance is actually negative.
Accuracy
number of correct decisions made / total number of decisions made = (TP + TN) / (TP + FP + TN + FN). Equal to 1 - error rate. A common metric that is easy to measure, but it is simplistic and has some known problems.
Joint probability using conditional probability (takes care of dependencies between events)
p(AB) = p(A) · p(B | A): the probability of A and B is the probability of A times the probability of B given A
Issues of Bayes rule for classification
p(E|C=c) and p(E) are difficult if not impossible to measure (or may simply be 0 when E is not in the training data) *Bayesian methods deal with this issue by making assumptions of probabilistic independence*
Expected profit using cost/benefit matrix and matrix of probability
p(Y, p) · b(Y, p) + p(N, p) · b(N, p) + p(N, n) · b(N, n) + p(Y, n) · b(Y, n)
cumulative response curve
plots the tp rate (y axis) as a function of the percentage of the population that is targeted (x axis).
One reliable and highly predictive indicator of the success of a research project
prior success of the investigator
pros and cons: Profit graph
Pro: may be easy to comprehend for stakeholders who are not data scientists, since profit graphs reduce model performance to the basic "bottom line" cost or profit
Con: requires that operating conditions be known and specified exactly
Pros and Cons of ROC graphs
Pro: they decouple classifier performance from the conditions under which the classifiers will be used. *Specifically, they are independent of the class proportions as well as the costs and benefits*, so the positions and relative performance of the classifiers will not change.
Con: not the most intuitive visualization for many business stakeholders
Ensemble methods
Provide an advantage by leveraging model diversity and combining many models to reduce variance. Common methods include bagging, boosting, and random forests.
Expected Value
provides a framework that is extremely useful in organizing thinking about data-analytic problems decomposes data-analytic thinking into (i) the structure of the problem, (ii) the elements of the analysis that can be extracted from the data, and (iii) the elements of the analysis that need to be acquired from other sources (e.g., business knowledge of subject matter experts).
stringr
provides a set of string manipulation functions
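A few representative calls (a small illustrative sample, not the full API):

library(stringr)
str_detect("data science", "sci")     # TRUE: is the pattern present?
str_replace_all("a,b,,c", ",+", " ")  # "a b c"
str_trim("  padded  ")                # "padded"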
twitteR
provides an extremely simple interface for downloading a list of tweets directly from the Twitter service into R
mashup
refers to bringing together various sources of data to create a new product with unique value.
Bayes' Rule
says that we can compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probabilities of the hypothesis and the evidence.
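In symbols, using the notation of the surrounding cards:
p(H | E) = p(E | H) · p(H) / p(E)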
ROC curves in R
see lecture 18Ch8b
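One common way to plot them, sketched with the ROCR package (the lecture may use a different approach; the scores and labels here are made up):

library(ROCR)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)  # classifier scores
labels <- c(1, 1, 0, 1, 0, 0)              # true classes
pred <- prediction(scores, labels)
plot(performance(pred, "tpr", "fpr"))      # the ROC curve
abline(0, 1, lty = 2)                      # random-guessing diagonal
performance(pred, "auc")@y.values[[1]]     # AUC as a single number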
Profit curve
shows the expected cumulative profit for a classifier as progressively larger proportions of the consumer base are targeted. note: adding a budgetary constraint causes not only a change in the operating point (i.e. targeting 8% of the population instead of 50%) but also a change in the choice of classifier to do the ranking.
p(C = c)
the "prior" probability of the class, i.e., the probability we would assign to the class before seeing any evidence could come from several places: i. subjective prior ii. prior belief based on previous application iii. unconditional probability inferred from data (e.g. using as the class prior the base rate of c - the prevalence of c in the population as a whole)
The Area Under the ROC Curve (AUC)
the area under a classifier's curve expressed as a fraction of the unit square. (ranges from zero to one) good general summary statistic of the predictiveness of a classifier Though a ROC curve provides more information than its area, the AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions.
Conservative classifier decisions
the classifier should have high certainty before taking the positive action (using a high threshold)
Lift
the lift of a classifier represents the advantage it provides over random guessing. The lift is the degree to which it "pushes up" the positive instances in a list above the negative instances. The lift curve is essentially the value of the cumulative response curve at a given x point divided by the diagonal line (y=x) value at that point. The diagonal line of a cumulative response curve becomes a horizontal line at y=1 on the lift curve. (2x lift means model's targeting is twice as good as random)
p(E|C = c)
the likelihood of seeing the evidence E when the class C = c. This likelihood might be calculated from the data as the percentage of examples of class c that have feature vector E.
p(C|E)
the probability of C given E, or the probability of C conditioned on E
Bayes' Rule for classification
the probability that the target variable C takes on the class of interest c after taking the evidence E (the vector of feature values) into account. This is called the posterior probability. (what we want)
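In symbols:
p(C = c | E) = p(E | C = c) · p(C = c) / p(E)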
class priors
the proportion of positive and negative instances in the target population, also known as the base rate (usually referring to the proportion of positives).
Expected value formula
The weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence:
EV = p(o1) · v(o1) + p(o2) · v(o2) + p(o3) · v(o3) + ...
where each oi is a possible decision outcome, p(oi) is its probability, and v(oi) is its value.
*Note that in binary cases, the probability of one outcome is (1 - the probability of the other)
Dealing with text
To apply data mining to text, either:
- engineer the data representation to match the tools (representation engineering), or
- build new tools to match the data
Why compare model performance against a baseline?
- to understand whether you indeed are improving performance
- to demonstrate to stakeholders that mining the data has added value
tp and fp
The tp rate is sometimes referred to as the hit rate: what percent of the actual positives does the classifier get right. The fp rate is sometimes referred to as the false alarm rate: what percent of the actual negative examples does the classifier get wrong (i.e., predict to be positive).
True or false: The expected value framework provides a helpful decomposition of possibly complicated business problems into subproblems that we understand better how to solve.
true
How to choose a proper threshold
we determine the threshold where our expected profit is above a desired level (usually zero).
ad impression
when an ad is displayed somewhere on a page, regardless of whether a user clicks it.
Selection Bias
when data is not a random sample from the population to which you intend to apply the model
Generates a word cloud
wordcloud()
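A minimal sketch using the wordcloud package; the words and frequencies are made up:

library(wordcloud)
words <- c("data", "science", "model", "text", "mining")
freq  <- c(50, 40, 25, 20, 10)
wordcloud(words, freq, min.freq = 1, random.order = FALSE)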