CSE Exam 2
A ROC graph has false negative rate on the x-axis and true positive rate on the y-axis. T/F
False
A collection of documents is called a library. T/F
False
A profit curve is helpful when you don't know the prior probabilities of each of the classes. T/F
False
Business problems are usually straightforward classification problems, regression problems, or clustering problems. T/F
False
Case-based reasoning can be applied to legal cases but not to medical cases. T/F
False
Correlation implies that there must be some cause and effect relationship between the two variables. T/F
False
Expected profit is the profit that is expected per customer that receives the targeted marketing offer. T/F
False
If the threshold is lowered, the confusion matrix will not change because the numbers of true positives and false positives stay the same. T/F
False
Like its location, the font size of each word in a word cloud is random. T/F
False
Lower variance models tend to have lower bias, and vice versa. T/F
False
Naive Bayes calculates the exact probability that an instance belongs to each class and then assigns the instance to the class that has the highest predicted probability. T/F
False
Referring to inverse document frequency, the more documents in which a term occurs, the more significant it likely is to be to the documents it does occur in. T/F
False
Referring to term frequency, the importance of a term in a document should decrease with the number of times that term occurs. T/F
False
The Pythagorean theorem tells us that the distance between A and B is given by the length of the shortest side of a triangle, and is equal to the square root of the summed squares of the lengths of the other two longer sides of the triangle. T/F
False
The expected value framework lets us determine the best classification method for a business task, but we cannot use it in a situation that would require multiple classification or regression models. T/F
False
The word "and" should have a higher IDF score than "banana" in a general purpose corpus. T/F
False
Using n-grams greatly decreases the size of the feature set. T/F
False
Sensors produce clean time series data (i.e., with equally spaced time intervals) making them useful for time series analysis. T/F
False -- Sensor readings often are not equally spaced, can have missing data. We need to process that data to create a time series.
A confusion matrix is a matrix with the columns labeled with actual classes and the rows labeled with predicted classes. The values in the matrix represent the fraction of instances that fall within each combination of categories. T/F
False -- The values in the matrix represent the **count** of instances that fall within each combination of categories.
Positive instances classified as negative are called:
False negatives
TFIDF
TFIDF(t, d) = TF(t, d) x IDF(t)
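A minimal Python sketch of the TFIDF product on a toy corpus. It assumes raw counts for TF and IDF(t) = 1 + ln(N / number of documents containing t); other IDF variants exist, and the corpus and function names here are made up for illustration.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency: how many times the term occurs in this document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Assumed IDF variant: 1 + ln(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    """TFIDF(t, d) = TF(t, d) x IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical four-document corpus, already tokenized.
corpus = [
    "the banana and the apple".split(),
    "the cat and the dog".split(),
    "the banana bread".split(),
    "the dog and the bone".split(),
]

print(idf("and", corpus))                  # common word, low IDF
print(idf("banana", corpus))               # rarer word, higher IDF
print(tfidf("banana", corpus[0], corpus))  # weight of "banana" in document 0
```

In this corpus "and" appears in three of four documents and "banana" in two, so "banana" gets the higher IDF, consistent with the "and" vs. "banana" card above.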
Specificity
TN / (TN + FP)
Evaluation metric: Precision formula
TP / (TP + FP)
text representation
Take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form
A ranking classifier plus a threshold produces a single confusion matrix. T/F
True
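To illustrate why a ranking classifier plus a threshold gives exactly one confusion matrix, and why moving the threshold changes it, here is a small Python sketch. The scores and labels are invented, and treating `score >= threshold` as a positive prediction is an assumed convention.

```python
def confusion_matrix(scores, labels, threshold):
    """Return counts (tp, fp, fn, tn); scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

high = confusion_matrix(scores, labels, threshold=0.75)
low = confusion_matrix(scores, labels, threshold=0.50)
print(high)  # (2, 0, 1, 3)
print(low)   # (3, 1, 0, 2)
```

Lowering the threshold from 0.75 to 0.50 turns one false negative into a true positive and one true negative into a false positive, so the matrix does change.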
An important concern for data scientists is selection or survivor bias, since in such cases the data does not match the expected use of the model. T/F
True
Classification and regression trees tend to have high variance. T/F
True
Ensembles have been observed to improve generalization performance in many situations. T/F
True
Good data journalism employs methods from this course to engage and involve readers to discover knowledge in data. T/F
True
If we had a classifier with an AUC of .25, we could invert it to get a classifier with an AUC of .75. T/F
True
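A small Python sketch of the inversion argument. AUC is computed here as the fraction of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as one half; the data are invented so that the original AUC is exactly 0.25.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical scores from a worse-than-random classifier.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [0, 0, 1, 1, 0, 0, 1, 1]

original = auc(scores, labels)
# Inverting the classifier = reversing its ranking, e.g. by negating the scores.
inverted = auc([-s for s in scores], labels)
print(original, inverted)  # 0.25 0.75
```

Reversing the ranking flips every pairwise comparison, so (absent ties) the inverted AUC is 1 minus the original.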
In addition to the words themselves, context is also important when analyzing data. T/F
True
In topic modeling, the terms most associated with the topic, and the term weights, are learned by the topic modeling process. T/F
True
Initialization of K-Means may affect the clustering results. T/F
True
Modeling involves making some simplifying assumptions to keep the problem tractable. T/F
True
Predicting stock price based on historical data is an example of time-series analysis. T/F
True
Stemming removes suffixes and transforms plural nouns to their singular forms. T/F
True
Stop words are typically very common words and include functional words like prepositions. T/F
True
Term frequency (TF) is document specific. T/F
True
Text is just another form of data, but dealing with text sometimes requires pre-processing and specific expertise. T/F
True
The Jaccard distance measure treats the two objects as sets of characteristics. T/F
True
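As an illustration, Jaccard distance can be sketched in Python as 1 minus the size of the intersection over the size of the union. The feature sets below are made up, and returning 0 for two empty sets is an assumed convention.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|, treating each object as a set of characteristics."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention: two empty sets are identical
    return 1 - len(a & b) / len(a | b)

# Hypothetical characteristics of two hotels.
hotel_x = {"wifi", "pool", "gym"}
hotel_y = {"wifi", "gym", "spa", "bar"}

d = jaccard_distance(hotel_x, hotel_y)
print(d)  # 2 shared out of 5 distinct characteristics, so about 0.6
```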
The expected value is the weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence. T/F
True
We saw that n-grams can be useful in recognizing the author of a document. T/F
True
base rate classifier
a classifier that does no learning and always predicts the majority class of the training data; often used as a minimum baseline for performance
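A minimal Python sketch of such a baseline. The class name, its `fit`/`predict` interface, and the label counts are hypothetical, chosen only to show how the baseline accuracy arises.

```python
from collections import Counter

class BaseRateClassifier:
    """Ignores the features and always predicts the majority training class."""

    def fit(self, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_instances):
        return [self.majority_] * n_instances

# Hypothetical imbalanced data: 90% "no", 10% "yes" in training.
train_labels = ["no"] * 90 + ["yes"] * 10
test_labels = ["no"] * 8 + ["yes"] * 2

clf = BaseRateClassifier().fit(train_labels)
predictions = clf.predict(len(test_labels))
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(clf.majority_, accuracy)  # no 0.8
```

Any learned model should beat this number before its accuracy is considered meaningful.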
corpus
a collection of documents
How many neighbors should be used in k-nn? a. no simple answer b. 10 c. 3 d. 5
a; no simple answer
AUC
area under a classifier's ROC curve, expressed as a fraction of the unit square -- values range from 0 to 1. AUC is useful when a single number is needed to summarize performance or when nothing is known about operating conditions
goal of machine learning
estimate as well as possible the function that maps the input data to the output variable; we want to achieve low bias and low variance, leading to good predictive performance
machine learning errors
bias error, variance error, irreducible error/inherent randomness
bias error
error introduced by the simplifying assumptions a model makes so that the target function is easier to learn
What visual presentation of data can directly identify potential outliers?
boxplot
Which of the following is always true? a. P(AB) = P(A)/(P(B) + P(A)) b. P(AB) = P(A)P(A|B) c. P(AB) = P(A)P(B|A) d. P(AB) = P(A)P(B)
c. P(AB) = P(A)P(B|A)
expected value
the expected value computation provides a framework that is useful in organizing thinking about data-analytic problems; it decomposes data-analytic thinking into: (1) the structure of the problem, (2) the elements of the analysis that can be extracted from the data, and (3) the elements of the analysis that need to be acquired from other sources
When comparing classification model performance, the model with the highest ______________________ should be used. a. precision b. f-measure c. accuracy d. depends on the situation e. recall
d. depends on the situation
Which of the following is not one of the three factors that characterize the errors a model could make? a. inherent randomness b. bias c. variance d. human error
d. human error
ROC graph
decouples classifier performance from the conditions under which the classifiers will be used; independent of the class proportions as well as the costs and benefits; not the most intuitive visualization for many stakeholders
PCA and topic modeling: a. help the data scientist explore and understand the data b. none of these are correct c. have the ability to extract latent dimensions from data d. both can operate on the term-document frequency matrix e. all of these are correct
e. all of these are correct
random forest
each tree votes for its classification output (majority wins); reduces the chance of overfitting and typically has higher model performance
Mann-Whitney Wilcoxon measure
equivalent to AUC, and to the Gini coefficient after a simple algebraic transformation (Gini = 2 x AUC - 1); AUC equals the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance
Naive Bayes Classifier
estimates the probability that the example belongs to each class, and reports the class with the highest probability
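A compact Python sketch of a Naive Bayes text classifier in this spirit. It assumes a multinomial model over word counts with add-one (Laplace) smoothing, which the card does not specify, and the tiny training set is invented. Scores are kept as log probabilities up to a shared constant, so they rank the classes rather than give calibrated probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return (best class, log score per class) for a tokenized document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        log_score = math.log(class_counts[c] / total_docs)  # log prior p(C = c)
        total_words = sum(word_counts[c].values())
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                # Add-one smoothing (an assumption) avoids zero probabilities.
                p = (word_counts[c][t] + 1) / (total_words + len(vocab))
                log_score += math.log(p)
        scores[c] = log_score
    return max(scores, key=scores.get), scores

# Hypothetical training documents.
docs = [
    ["cheap", "pills", "now"],
    ["cheap", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
prediction, scores = predict_nb(model, ["cheap", "meeting", "offer"])
print(prediction)  # spam
```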
not independent
example: die 2 is likely to roll the same as die 1; p(AB) = p(A) x p(B | A)
low bias
few assumptions about the form of the target function: decision trees, k-nearest neighbors, support vector machines
evaluating classifiers with a confusion matrix
a confusion matrix for a problem involving n classes is an n x n matrix of counts from the test set, with columns labeled with the actual classes and rows labeled with the predicted classes
F-measure combines precision and recall ...
giving equal weight to precision and recall.
Bias-variance tradeoff
high bias leads to failing to fully model the data (underfitting); high variance causes the model to fit the specifics of the training data (overfitting); high bias tends to go with low variance, and high variance with low bias
variance error
how the learned function would change if different training data is used
independence
knowing whether one event occurred does not help with the probability of the other event; p(AB) = p(A) x p(B)
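The difference between the independent case, p(AB) = p(A) x p(B), and the general case, p(AB) = p(A) x p(B | A), can be checked by enumerating two fair dice. This is a Python sketch with events chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def sum_at_least_11(o):
    return o[0] + o[1] >= 11

# Independent events: the second die does not care about the first.
p_a = prob(first_is_six)
p_b = prob(second_is_six)
p_ab = prob(lambda o: first_is_six(o) and second_is_six(o))
print(p_ab == p_a * p_b)  # True

# Dependent events: a high sum is more likely once the first die shows 6.
p_c = prob(sum_at_least_11)
p_ac = prob(lambda o: first_is_six(o) and sum_at_least_11(o))
p_c_given_a = p_ac / p_a
print(p_ac == p_a * p_c)          # False: the simple product fails
print(p_ac == p_a * p_c_given_a)  # True: the general rule always holds
```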
low variance
linear and logistic regression
high bias
more assumptions about the form of the target function: linear regression, logistic regression
unigrams
one word sequence term
prior probability
p(C = c) before seeing evidence
posterior probability
p(C=c | E)
Expected profit equation
p(Y, p) * b(Y, p) + p(N, p) * b(N, p) + p(N, n) * b(N, n) + p(Y, n) * b(Y, n)
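A short Python sketch of this sum. Each p(h, a) is estimated as a confusion-matrix count divided by the total, and each b(h, a) comes from a cost-benefit matrix. The counts and dollar values are hypothetical.

```python
def expected_profit(counts, benefits):
    """Sum of p(h, a) * b(h, a) over the four confusion-matrix cells."""
    total = sum(counts.values())
    return sum(counts[cell] / total * benefits[cell] for cell in counts)

# Keys are (predicted, actual): Y/N = predicted class, p/n = actual class.
counts = {("Y", "p"): 30, ("Y", "n"): 20, ("N", "p"): 10, ("N", "n"): 40}

# Hypothetical cost-benefit matrix: a targeted responder nets 99,
# a targeted non-responder costs 1, and not targeting costs nothing.
benefits = {("Y", "p"): 99.0, ("Y", "n"): -1.0, ("N", "p"): 0.0, ("N", "n"): 0.0}

profit = expected_profit(counts, benefits)
print(round(profit, 2))  # 29.5
```

The result is an average over every instance the model scores, not only over those that receive the offer.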
expected value equation
p(x1) * v(x1) + p(x2) * v(x2) ...
joint probability
probability both event A and event B occur p(AB)
p(C|E) stands for the:
probability of an event C happening given E happened
conditional probability
probability of two different events both occurring is equal to the probability of one of them occurring times the probability of the other occurring if we know the first occurs. p(x,y) = p(y) * p(x | y)
ensemble methods
provide an advantage by leveraging model diversity, combining many models to reduce variance; ex: bagging, boosting, random forests
Inverse Document Frequency
some words are more popular across documents than others -- useful words are neither too common nor too rare; the more common a word is across documents, the less value it gives
high variance
strongly influenced by the specifics of the training data; examples: decision trees, k-nearest neighbors, SVM
pre-processing of text
the case is normalized -- every term is in lowercase; words are stemmed -- suffixes removed; stop-words are removed
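These three steps can be sketched in Python as follows. The stop-word list is deliberately tiny, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm such as Porter's, so its output is only illustrative.

```python
import re

STOP_WORDS = {"the", "and", "of", "a", "to", "in", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop-words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

result = preprocess("The Dogs jumped and the cats are jumping")
print(result)  # ['dog', 'jump', 'cat', 'are', 'jump']
```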
profit curves
two critical conditions underlie the profit calculation. The class priors: the proportion of positive and negative instances in the target population. The costs and benefits: the expected profit is sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit matrix
trigram
three word sequence term
document is composed of:
tokens or terms
"bag of words"
treat every document as just a collection of individual words; ignore grammar, word order, punctuation; treat every term as a potentially important keyword
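A minimal Python sketch of the bag-of-words idea: the document becomes a mapping from term to count, and order and punctuation are discarded. The tokenizing regular expression and the lowercasing step are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each term to its count, ignoring word order and punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The dog chased the cat. The cat ran!")
print(bag)

# Word order does not matter: a reshuffled document gives the same bag.
same = bag == bag_of_words("cat the ran, the dog the cat chased")
print(same)  # True
```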
bigrams
two word sequence term
term frequency
use the word count in the document instead of just a 0 or 1
stop-word
very common word that is considered content-free such as "the", "and", "of"
Advantages & Disadvantages of Naive Bayes
Advantages: very simple classifier; efficient in terms of storage space and computation time; performs well in real-world applications; incremental learner -- good for online applications with high velocity. Disadvantage: class probability estimates are not accurate
N-gram sequences
used when word order is important and you want to preserve some information about it in the representation; they greatly increase the size of the feature set
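To see how n-grams preserve local word order and enlarge the feature set, here is a small Python sketch. The sentence is invented and the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences, each joined into a single term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "exceeded analyst expectations this quarter".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# ['exceeded analyst', 'analyst expectations', 'expectations this', 'this quarter']

# Using unigrams, bigrams, and trigrams together more than doubles the terms here.
all_features = set(unigrams) | set(bigrams) | set(trigrams)
print(len(unigrams), len(all_features))  # 5 12
```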
Word Clouds (tag clouds)
words that are larger correspond to more important words in source material
steps for generating word cloud
1. import 2 libraries: tm and wordcloud; 2. create a corpus for analysis; 3. clean it; 4. build a term-document matrix; 5. extract terms and their frequencies; 6. generate the word cloud
Evaluation metric: F-measure formula
2 x Precision x Recall / (Precision + Recall)
What is the (minimum) Levenshtein metric between: Godly Goodbye
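3 (insert "o", substitute "b" for "l", append "e") when a substitution counts as one edit; 4 if a substitution is counted as a deletion plus an insertion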
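The distance can be checked with the standard dynamic-programming algorithm. In this Python sketch the `substitution_cost` parameter is an added assumption: with the usual cost of 1 the Godly/Goodbye distance comes out as 3, while a cost of 2 (substitution treated as delete plus insert) gives 4.

```python
def levenshtein(a, b, substitution_cost=1):
    """Minimum total cost of inserts, deletes, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else substitution_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Godly", "Goodbye"))                       # 3
print(levenshtein("Godly", "Goodbye", substitution_cost=2))  # 4
```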
Evaluation metric: Accuracy formula
(TP + TN) / (TP + FP + FN + TN)
Bayes rule
P(A | B) = P(B | A) * P(A) / P(B)
Given the formula P(C|E) = P(C)*P(E|C)/P(E), what is the "prior" probability?
P(C)
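A worked numeric example of Bayes rule in Python. The probabilities are invented, and P(E) is expanded with the law of total probability over C and not-C, which is an added step not spelled out on the cards.

```python
def posterior(prior, likelihood, likelihood_given_not_c):
    """P(C | E) = P(E | C) * P(C) / P(E), with P(E) summed over C and not-C."""
    evidence = likelihood * prior + likelihood_given_not_c * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% of customers churn (the prior P(C)); 90% of churners
# called support (P(E | C)); 5% of non-churners did as well (P(E | not C)).
p = posterior(prior=0.01, likelihood=0.90, likelihood_given_not_c=0.05)
print(round(p, 4))  # 0.1538
```

Even with strong evidence, the posterior stays modest because the prior is small.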
The point of analytical engineering is to:
Promote thinking about problems data-analytically
Evaluation metric: Recall formula
Recall = Sensitivity = TP / (TP + FN)
TN/(TN+FP) = True negative rate = 1-False positive rate = ________________
Specificity
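The evaluation-metric formulas in this set (accuracy, precision, recall, specificity, F-measure) can all be computed from the four confusion-matrix counts. A minimal Python sketch, with a hypothetical function name and made-up counts:

```python
def metrics(tp, fp, fn, tn):
    """Evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # true negative rate = 1 - false positive rate
        "f_measure": 2 * precision * recall / (precision + recall),
    }

m = metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.700, precision: 0.800, recall: 0.667,
# specificity: 0.750, f_measure: 0.727
```

The sketch does not guard against zero denominators, for example when the classifier predicts no positives at all.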