CSE 160 Exam 2
Equation for F-measure
2 x precision x recall / (precision + recall)
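The formula is a harmonic mean, so it is dragged toward the smaller of the two inputs. A minimal R sketch (the function name `f_measure` is made up for illustration):

```r
# F-measure: harmonic mean of precision and recall.
f_measure <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

f_measure(0.5, 1.0)   # 2 * 0.5 * 1 / 1.5 = 0.6667
f_measure(0.9, 0.1)   # 0.18: one weak component pulls the score down
```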
If the probability of a burrito having cheese on it is 7/8 and the probability of a burrito having beans on it is 1/4, what is the probability of a burrito having both beans and cheese on it, assuming these two events are independent of each other?
7/8 * 1/4 = 7/32
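For independent events the joint probability is the product of the individual probabilities. A quick check of the arithmetic in R (variable names are illustrative):

```r
p_cheese <- 7 / 8
p_beans  <- 1 / 4

# Independent events: P(A and B) = P(A) * P(B)
p_both <- p_cheese * p_beans
p_both   # 0.21875, i.e. 7/32
```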
Equation for Accuracy
Accuracy = (TP+TN)/(P+N) = (TP + TN)/all
Define IDF
IDF, or inverse document frequency, is an inverse function of the number of documents in which a term occurs. A term has a high IDF score if it appears in few documents in the corpus, and a low IDF score if it appears in many of them.
How can you identify conservative and permissive models on a ROC graph?
Conservative models are located towards the lower left, and permissive models are located towards the upper right.
(T/F) ROC graphs are advantageous because anyone can interpret them with ease.
False
(T/F) As class distribution becomes more skewed, evaluation based on accuracy improves
False, as class distribution becomes more skewed, evaluation based on accuracy BREAKS DOWN
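A small illustration of why accuracy breaks down, using a made-up test set with 1% positives and a classifier that always predicts the majority class:

```r
# Hypothetical test set: 990 negatives, 10 positives.
actual <- c(rep("neg", 990), rep("pos", 10))

# A useless classifier that always predicts the majority class.
predicted <- rep("neg", length(actual))

accuracy <- mean(predicted == actual)   # (TP + TN) / all
recall   <- sum(predicted == "pos" & actual == "pos") / sum(actual == "pos")

accuracy   # 0.99
recall     # 0
```

The classifier scores 99% accuracy without finding a single positive.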
(T/F) A profit curve cannot go negative
False. A profit curve goes negative wherever the expected profit at that threshold is below zero, meaning the entity is losing money.
(T/F) The rbind() function in R determines whether the R object is bound or not
False. rbind() combines its arguments by rows, stacking them into a single matrix or data frame (the row-wise counterpart of cbind(), which combines by columns).
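A short sketch of the difference, using two made-up data frames with the same columns:

```r
# Two small data frames with the same columns (values are invented).
a <- data.frame(id = 1:2, score = c(90, 85))
b <- data.frame(id = 3:4, score = c(70, 95))

stacked <- rbind(a, b)   # appends the rows of b below the rows of a
side    <- cbind(a, b)   # for contrast: places the columns side by side

dim(stacked)   # 4 rows, 2 columns
dim(side)      # 2 rows, 4 columns
```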
(T/F) A "false positive" is when a classifier correctly assigns a negative classification.
False, that is a "true negative". A false positive is a negative instance that the classifier incorrectly labels as positive.
(T/F) The value of the AUC (area under the ROC curve) can be greater than 1.
False, the area varies between 0 and 1.
(T/F) It is more important to have good data scientists than it is to have good management of the data science team
False, they are both important
(T/F) With topic models, we perform supervised classification of documents
False, we perform unsupervised classification
What makes a model "permissive"? What are the advantages and disadvantages of a permissive model?
Models that do not require as much evidence to classify something as positive are considered permissive. Permissive models have a high true positive rate, but as a consequence they also have a high false positive rate.
What makes a model "conservative"? What are the advantages and disadvantages of a conservative model?
Models that require strong evidence to classify something as positive are considered conservative. Conservative models have a low false positive rate, but in turn they also have a low true positive rate. (In the planet talk, they use a conservative model to distinguish whether or not a planet is orbiting a star)
Name a learning algorithm we have seen in class that can operate incrementally (i.e. can be updated easily without re-examining the rest of the training data)
Naive Bayes/kNN
Equation for Precision/Positive Predictive Value
Precision = TP/(TP+FP)
Define TF
TF, or term frequency, is the normalized count of the number of times a term appears in a document. A word has a high TF score if it appears frequently in a document, and a low TF score if it appears rarely in a document
Define TF-IDF as a combination of TF and IDF as defined above
TF-IDF = TF * IDF
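A minimal sketch of the product on a tiny made-up corpus. The cards do not give an IDF formula, so the variant used here, 1 + log(N / number of documents containing the term), is an assumption; the helper names are also illustrative.

```r
# Tiny invented corpus: each document is a vector of already-tokenized words.
docs <- list(
  d1 = c("the", "pitch", "was", "fast", "the", "slugger", "missed"),
  d2 = c("the", "chef", "made", "the", "spaghetti"),
  d3 = c("she", "saw", "the", "game")
)

# TF: count of the term in the document, normalized by document length.
tf <- function(term, doc) sum(doc == term) / length(doc)

# IDF (assumed variant): 1 + log(total documents / documents containing the term).
# Terms that appear in no document are not handled in this sketch.
idf <- function(term, docs) {
  n_containing <- sum(sapply(docs, function(d) term %in% d))
  1 + log(length(docs) / n_containing)
}

tf_idf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

tf_idf("the", docs$d1, docs)     # in every document: IDF is 1, so the score is TF = 2/7
tf_idf("pitch", docs$d1, docs)   # in one document: (1/7) * (1 + log(3))
```

Even though "the" occurs twice as often in d1, "pitch" ends up with the higher score because it is rare across the corpus.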
Equation for True Negative Rate/Specificity
TN/(TN+FP) - true negatives / all real negatives
Equation for Recall/True Positive Rate/Sensitivity
TP/(TP+FN)
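The rate equations on these cards can all be computed from the four confusion-matrix counts. A sketch with invented counts:

```r
# Hypothetical confusion-matrix counts.
TP <- 40; FN <- 10   # 50 actual positives
TN <- 45; FP <- 5    # 50 actual negatives

precision <- TP / (TP + FP)                    # 40/45
tpr       <- TP / (TP + FN)                    # true positive rate (recall): 0.8
tnr       <- TN / (TN + FP)                    # true negative rate: 0.9
accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 0.85
```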
What does it mean for a word to have a high TF-IDF score?
That word is important to the topic of a document. Examples of words with a high TF-IDF score would be "pitch" and "slugger" in a document about baseball.
(T/F) A profit curve is the plot of expected profit at each possible threshold that can be applied to a ranking classifier
True
(T/F) A shapefile contains the points and contours of a geographical region
True
(T/F) Classifier accuracy is equal to 1 - Error rate.
True
(T/F) Expected value is a valuable tool when making business decisions because it can predict how much money could be earned or lost if certain actions are taken
True
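As a sketch, expected value is the sum over outcomes of probability times value. The response probability and payoffs below are made up for illustration:

```r
# Hypothetical targeting decision.
p_respond     <- 0.05
value_respond <- 99    # profit if the customer responds
value_ignore  <- -1    # cost of the offer if they do not

# Expected value = sum over outcomes of probability * value.
ev_target     <- p_respond * value_respond + (1 - p_respond) * value_ignore
ev_not_target <- 0     # not targeting earns nothing and costs nothing

ev_target   # 4.95 - 0.95 = 4
```

Targeting is worthwhile here because its expected value exceeds that of not targeting.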
(T/F) Predictive modeling enables the enterprise to learn to anticipate and prepare for the future
True
(T/F) ROC graphs decouple classifier performance from the conditions under which the classifiers were used
True
(T/F) Selenium enables R to extract information from interactive websites
True
(T/F) Text is an example of unstructured data
True
(T/F) The AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions
True
(T/F) The Naive-Bayes assumption is that events are conditionally independent, and thus can be multiplied together to estimate the probability of a combination of events
True
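A minimal sketch of the multiplication for a two-class problem. All probabilities are invented, and the two word features are assumed to be conditionally independent given the class:

```r
# Made-up numbers for a spam filter with two word features.
prior <- c(spam = 0.3, ham = 0.7)

# P(word present | class), assumed conditionally independent given the class.
p_free    <- c(spam = 0.6, ham = 0.1)
p_meeting <- c(spam = 0.1, ham = 0.5)

# Unnormalized score for a message containing both words:
# P(class) * P(free | class) * P(meeting | class)
score <- prior * p_free * p_meeting

# Normalize so the two class probabilities sum to 1.
posterior <- score / sum(score)
posterior
```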
(T/F) Two events are independent if the value of one does not affect the value of the other
True
(T/F) When considering Facebook "Likes", we saw that they could be quite predictive of how people score on intelligence tests
True
(T/F) While topic modeling can be informative, it can take a lot of time for a computer to perform
True
(T/F) With a ranking classifier, a classifier plus a threshold produces a single confusion matrix
True
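To illustrate, the same set of scores yields a different confusion matrix at each threshold. The scores and labels below are invented, and `confusion_at` is a hypothetical helper:

```r
# Made-up scores from a ranking classifier and the true labels.
scores <- c(0.95, 0.80, 0.70, 0.55, 0.40, 0.20)
actual <- c("pos", "pos", "neg", "pos", "neg", "neg")

confusion_at <- function(threshold) {
  predicted <- ifelse(scores >= threshold, "pos", "neg")
  # Fixing the factor levels keeps the table 2x2 even if one class is never predicted.
  table(predicted = factor(predicted, levels = c("pos", "neg")),
        actual    = factor(actual,    levels = c("pos", "neg")))
}

confusion_at(0.5)   # 3 TP, 1 FP, 0 FN, 2 TN
confusion_at(0.9)   # stricter threshold: 1 TP, 0 FP, 2 FN, 3 TN
```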
(T/F) The lift of a classifier is calculated as the classifier's performance divided by the performance of a randomly guessing classifier
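True (note: a random classifier has a lift of 1, and the worst classifier, which ranks every positive last, has a lift of 0 at the top of the list. The lift of a "perfect" classifier depends on the class distribution: with 50% positives it is 2 when targeting the top half of the ranking)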
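A sketch of one common way to compute lift at a given targeting fraction: the positive rate among the top-ranked instances divided by the overall positive rate. The data is invented, with half of the instances positive and a perfect ranking; `lift_at` is a hypothetical helper.

```r
# Made-up scores; half of the instances are positive (base rate 0.5).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c(1,   1,   1,   1,   0,   0,   0,   0)   # a perfect ranking

lift_at <- function(scores, actual, fraction) {
  n_top <- ceiling(fraction * length(scores))
  top   <- order(scores, decreasing = TRUE)[seq_len(n_top)]
  # Positive rate among the top-ranked instances, relative to the overall rate
  # (the overall rate is what random targeting achieves on average).
  mean(actual[top]) / mean(actual)
}

lift_at(scores, actual, 0.5)   # 2: the entire top half is positive
lift_at(scores, actual, 1.0)   # 1: targeting everyone is no better than random
```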
(T/F) The R language provides many useful libraries, including ones to plot on maps (bonus: what is this library called?)
True, and the library is called "maps"
If a model is created for the classification of students as freshmen, sophomores, juniors, or seniors, what would be the maximum dimensions of the confusion matrix? a) 4x4 b) 4x2 c) 4x1 d) 2x2 e) It depends on what the data model is built on
a) 4x4
Which of these organizations would have the most challenge in applying supervised predictive modeling? a) A business school that wants to start a new Master's degree program in Business Analytics, and would like to estimate the likely number of applicants b) A grocery store that is trying to identify which of its loyalty-card-carrying customers will spend more than $100 next month c) A city government that is trying to predict which neighborhoods will see the most new businesses open up next quarter d) An online marketing company that wants to estimate the number of clicks that the ads it serves will receive when shown to a particular population e) All of the above are equally challenging
a) A business school that wants to start a new Master's degree program in Business Analytics, and would like to estimate the likely number of applicants. Possible explanation: (b) and (c) have access to past data, and (d) can deploy a test fairly easily. (a) would have to go to an external source, for example a competing business school with a competing program, for its data, and the competing school likely would not turn over the data.
The points on a model's ROC graph a) Represent the performance of different thresholds b) Represent different rankings of examples c) Represent the cost of different classifications d) All of the above e) None of the above
a) Represent the performance of different thresholds
Confusion matrices can be used to analyze the performance of which of the following models? a) Decision Trees b) Naive Bayes c) k-Nearest Neighbors d) All of the above e) (a) and (c) only
d) All of the above
When word order is important, which text mining strategy should we use? a) Bag of Words b) N-grams c) Named Entity Extraction d) Topic Models e) All of the above are valid models that respect word order
b) N-grams
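A small sketch of why n-grams retain word order where a bag of words does not. The `ngrams` helper is hypothetical and uses simple whitespace tokenization:

```r
# Build word n-grams from a sentence.
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  sapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("dog bites man", 2)   # "dog bites" "bites man"
ngrams("man bites dog", 2)   # "man bites" "bites dog"
```

The two sentences contain exactly the same individual words, so a bag of words cannot tell them apart, but their bigrams differ.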
The point of analytical engineering is to a) Develop complex solutions by addressing every possible contingency b) Promote thinking about problems data analytically c) Analyze engineers d) All of the above e) None of the above
b) Promote thinking about problems data analytically
Which of the following is not a step in generating an ROC curve? a) Sort the test set by the model predictions b) Start with the cutoff = min(prediction) c) Decrease cutoff, and then count the number of true positives and false positives d) Calculate the TP rate and the FP rate e) Plot current number of TP/P as a function of the current FP/N
b) Start with the cutoff = min(prediction)
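The correct procedure starts at the highest cutoff and works downward. A sketch of the steps on invented data, assuming no tied scores:

```r
# Made-up model scores and true labels (1 = positive, 0 = negative).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
actual <- c(1,   1,   0,   1,   0,   0)

# Sort by prediction, highest first, so lowering the cutoff admits one instance at a time.
ord    <- order(scores, decreasing = TRUE)
actual <- actual[ord]

# Running counts of true and false positives as the cutoff moves down from max(scores).
tp <- cumsum(actual == 1)
fp <- cumsum(actual == 0)

# Rates, with the (0, 0) point for a cutoff above every score.
tpr <- c(0, tp / sum(actual == 1))
fpr <- c(0, fp / sum(actual == 0))

data.frame(fpr, tpr)
```

Calling `plot(fpr, tpr, type = "s")` on these vectors draws the curve itself.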
In the context of preparing for textual analysis, what is meant by "stop words"? a) Words that have the same meaning as the word "stop" b) Words such as "and", "or", and "the" c) Words specific to the topic of a corpus, such as "rebound" or "offense" when talking about a basketball game d) The words at the end of each sentence e) None of the above
b) Words such as "and", "or", and "the"
In a marketing environment, the expected benefit of not targeting is typically a) A negative value b) Zero c) A positive value d) All of the above e) None of the above
b) Zero
In order for data points to be taken as input to most data mining programs, they must be represented as a) text documents b) feature vectors c) dependent variables d) targets e) all of the above
b) feature vectors
Which of the following terms will have the lowest IDF score in a typical (general purpose) corpus? a) bug b) car c) she d) spaghetti e) vertebrae
c) she. Explanation: Words with a high TF score occur frequently in a certain document, and words with a high IDF score appear infrequently across a set of documents. Therefore, a word with a high TF-IDF score is a good indicator as to the topic of a document. Being a pronoun, "she" is a word that appears frequently across many documents and is not a good indicator as to what the document is about.
Give two examples of words that can be "stemmed" and what their stems would be.
cats, catlike, catty --> cat; fishing, fished, fisherman --> fish
The bag of words model a) Treats every document as a collection of individual words b) Ignores grammar, word order, and sentence structure c) Is a straightforward representation that is inexpensive to generate d) All of the above e) None of the above
d) All of the above
The Area Under the ROC Curve (AUC) is a) a fraction of the unit square, ranging from 0 to 1 b) useful when a single number is needed to summarize performance c) equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance d) all of the above e) none of the above
d) all of the above
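Option (c) can be checked directly by comparing every positive score with every negative score. A sketch on invented data:

```r
# Made-up model scores and true labels (1 = positive, 0 = negative).
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
actual <- c(1,   1,   0,   1,   0,   0)

pos <- scores[actual == 1]
neg <- scores[actual == 0]

# Compare every positive score with every negative score.
# Ties count as half a win, a common convention.
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc  <- mean(wins)

auc   # 8 of the 9 positive/negative pairs are ordered correctly: 0.8889
```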
The area under the ROC curve is not a) equal to the Mann-Whitney-Wilcoxon statistic b) a measure of the quality of a model's probability estimates c) likely to be at least 0.5 d) larger when false positive errors cost more e) actually it is all of the above
d) larger when false positive errors cost more
Which of the following is not an advantage of the Naive Bayes classifier? a) Very simple implementation b) Efficient in terms of storage space c) Efficient in terms of compute time d) Performs well in many real-world applications e) Generally accurate class probability estimation
e) Generally accurate class probability estimation
R provides a mechanism to allow computation to continue, even when an error (e.g. some kind of exception) has occurred. What is that function's name?
try()
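A short sketch of try() keeping a loop alive after an error. The inputs are made up; sqrt() fails on the character value.

```r
inputs  <- list(4, "oops", 9)
results <- vector("list", length(inputs))

for (i in seq_along(inputs)) {
  # sqrt("oops") raises an error; try() catches it so the loop keeps going.
  # silent = TRUE suppresses the printed error message.
  results[[i]] <- try(sqrt(inputs[[i]]), silent = TRUE)
}

# Failed calls come back as objects of class "try-error".
failed <- sapply(results, inherits, what = "try-error")
failed   # FALSE TRUE FALSE
```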
Write a valid for loop in R that sums the numbers 1 through 100
total <- 0; for (i in 1:100) { total <- total + i }  # total is 5050; named total to avoid masking the built-in sum()