CSE 160 Exam 2

Equation for F-measure

2 x precision x recall / (precision + recall)
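
As a minimal sketch, the formula above can be written as a small Python function. The function name, the zero-division convention, and the input values are assumptions made only for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as in the formula above."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: a permissive model with perfect recall but low precision.
score = f_measure(0.5, 1.0)
print(round(score, 3))
```

Because it is a harmonic mean, the F-measure is pulled toward the smaller of the two inputs.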

If the probability of a burrito having cheese on it is 7/8 and the probability of a burrito having beans on it is 1/4, what is the probability of a burrito having both beans and cheese on it, assuming these two events are independent of each other?

7/8 * 1/4 = 7/32
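
The product rule for independent events can be checked with exact fractions. This is only a sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_cheese = Fraction(7, 8)
p_beans = Fraction(1, 4)

# For independent events, P(A and B) = P(A) * P(B).
p_both = p_cheese * p_beans
print(p_both)  # 7/32
```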

Equation for Accuracy

Accuracy = (TP+TN)/(P+N) = (TP + TN)/all

Define IDF

IDF, or inverse document frequency, is an inverse function of the number of documents in which a term occurs. A word has a high IDF score if it appears in few documents of the corpus, and a low IDF score if it appears in many of them

How can you identify conservative and permissive models on a ROC graph?

Conservative models are located towards the lower left, and permissive models are located towards the upper right.

(T/F) ROC graphs are advantageous because anyone can interpret them with ease.

False

(T/F) As class distribution becomes more skewed, evaluation based on accuracy improves

False. As class distribution becomes more skewed, evaluation based on accuracy BREAKS DOWN, because a classifier that always predicts the majority class already scores a high accuracy
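
A small sketch of why accuracy is misleading on skewed data. The class counts are hypothetical and the "classifier" simply predicts the majority class every time.

```python
# Hypothetical test set: 990 negatives (0) and 10 positives (1).
actual = [0] * 990 + [1] * 10
# A "classifier" that always predicts the majority (negative) class.
predicted = [0] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent even though not a single positive example is found.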

(T/F) A profit curve cannot go negative

False. A profit curve goes negative wherever expected costs exceed expected benefits, which means the entity is losing money at that threshold

(T/F) The rbind() function in R determines whether the R object is bound or not

False, rbind() combines vectors, matrices, or data frames by rows, stacking them on top of each other (it is the row-wise counterpart of cbind(), which combines by columns)

(T/F) A "false positive" is when a classifier correctly assigns a negative classification.

False, that is a "true negative"

(T/F) The value of the AUC (area under the ROC curve) can be greater than 1.

False, the area varies between 0 and 1.

(T/F) It is more important to have good data scientists than it is to have good management of the data science team

False, they are both important

(T/F) With topic models, we perform supervised classification of documents

False, we perform unsupervised classification

What makes a model "permissive"? What are the advantages and disadvantages of a permissive model?

Models that do not require as much evidence to classify something as positive are considered permissive. Permissive models have a high true positive rate, but as a consequence they also have a high false positive rate.

What makes a model "conservative"? What are the advantages and disadvantages of a conservative model?

Models that require strong evidence to classify something as positive are considered conservative. Conservative models have a low false positive rate, but in turn they also have a low true positive rate. (In the planet talk, they use a conservative model to distinguish whether or not a planet is orbiting a star)

Name a learning algorithm we have seen in class that can operate incrementally (i.e. can be updated easily without re-examining the rest of the training data)

Naive Bayes/kNN

Equation for Precision/Positive Predictive Value

Precision = TP/(TP+FP)

Define TF

TF, or term frequency, is the normalized count of the number of times a term appears in a document. A word has a high TF score if it appears frequently in a document, and a low TF score if it appears rarely in a document

Define TF-IDF as a combination of TF and IDF as defined above

TF-IDF = TF * IDF
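
A minimal sketch of TF-IDF on a toy corpus. The corpus is hypothetical, and the specific IDF form used here, 1 + log(total documents / documents containing the term), is one common choice and an assumption, since the cards only say that IDF is an inverse function of document count.

```python
import math

# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["the", "pitch", "was", "a", "fast", "pitch"],
    ["the", "slugger", "hit", "the", "pitch"],
    ["the", "chef", "cooked", "the", "pasta"],
]

def tf(term, doc):
    """Term count normalized by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Assumed form: 1 + log(N / n); requires the term to occur in some document."""
    n_containing = sum(term in doc for doc in docs)
    return 1 + math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("pitch", corpus[0], corpus) > tf_idf("the", corpus[0], corpus))  # True
```

In the first document, "pitch" outscores "the" because it is both more frequent there and present in fewer documents overall.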

Equation for True Negative Rate/Specificity

TN/(TN+FP) - true negatives / all real negatives

Equation for Recall/True Positive Rate/Sensitivity

TP/(TP+FN)
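
The rate formulas on these cards all come from the same four confusion-matrix counts. The sketch below computes them together; the function name and the counts are hypothetical.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the evaluation rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),           # TP / all predicted positives
        "true_positive_rate": tp / (tp + fn),  # recall: TP / all real positives
        "true_negative_rate": tn / (tn + fp),  # TN / all real negatives
    }

# Hypothetical counts for illustration only.
r = rates(tp=40, fp=10, tn=45, fn=5)
for name, value in r.items():
    print(name, round(value, 3))
```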

What does it mean for a word to have a high TF-IDF score?

That word is important to the topic of a document. Examples of words with a high TF-IDF score would be "pitch" and "slugger" in a document about baseball.

(T/F) A profit curve is the plot of expected profit at each possible threshold that can be applied to a ranking classifier

True

(T/F) A shapefile contains the points and contours of a geographical region

True

(T/F) Classifier accuracy is equal to 1 - Error rate.

True

(T/F) Expected value is a valuable tool when making business decisions because it can predict how much money could be earned or lost if certain actions are taken

True

(T/F) Predictive modeling enables the enterprise to learn to anticipate and prepare for the future

True

(T/F) ROC graphs decouple classifier performance from the conditions under which the classifiers were used

True

(T/F) Selenium enables R to extract information from interactive websites

True

(T/F) Text is an example of unstructured data

True

(T/F) The AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions

True

(T/F) The Naive-Bayes assumption is that events are conditionally independent, and thus can be multiplied together to estimate the probability of a combination of events

True

(T/F) Two events are independent if the value of one does not affect the value of the other

True

(T/F) When considering Facebook "Likes", we saw that they could be quite predictive of how people score on intelligence tests

True

(T/F) While topic modeling can be informative, it can take a lot of time for a computer to perform

True

(T/F) With a ranking classifier, a classifier plus a threshold produces a single confusion matrix

True
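
A sketch of how one threshold applied to ranked scores yields one confusion matrix. The scores, labels, and function name are hypothetical.

```python
def confusion_matrix(scores, labels, threshold):
    """Classify as positive when score >= threshold, then count outcomes."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            counts["TP"] += 1
        elif predicted_positive and label == 0:
            counts["FP"] += 1
        elif label == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(confusion_matrix(scores, labels, 0.75))  # higher (more conservative) threshold
print(confusion_matrix(scores, labels, 0.65))  # lower (more permissive) threshold
```

Lowering the threshold turns a false negative into a true positive here, and in general it can only add positives, true or false.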

(T/F) The lift of a classifier is calculated as the classifier's performance divided by the performance of a randomly guessing classifier

True (note: a random classifier has a lift of 1; when the classes are balanced 50/50 and half of the examples are targeted, a "perfect" classifier has a lift of 2 and the worst classifier has a lift of 0)
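
A rough sketch of a lift calculation, assuming lift is measured as the positive rate among the top-ranked fraction divided by the overall positive rate. The function name and data are hypothetical, with balanced classes.

```python
def lift(labels_ranked, top_fraction):
    """labels_ranked: true labels (1 = positive), sorted by descending model score."""
    n_top = int(len(labels_ranked) * top_fraction)
    rate_in_top = sum(labels_ranked[:n_top]) / n_top
    base_rate = sum(labels_ranked) / len(labels_ranked)  # what random guessing achieves
    return rate_in_top / base_rate

# Hypothetical balanced data: 50 positives and 50 negatives.
perfect = [1] * 50 + [0] * 50   # every positive ranked first
worst = [0] * 50 + [1] * 50     # every positive ranked last

print(lift(perfect, 0.5))  # 2.0
print(lift(worst, 0.5))    # 0.0
```

With a different class balance or a different targeted fraction, the maximum achievable lift changes.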

(T/F) The R language provides many useful libraries, including ones to plot on maps (bonus: what is this library called?)

True, and the library is called "maps"

If a model is created for the classification of students as freshmen, sophomores, juniors, or seniors, what would be the maximum dimensions of the confusion matrix? a) 4x4 b) 4x2 c) 4x1 d) 2x2 e) It depends on what the data model is built on

a) 4x4

Which of these organizations would have the most challenge in applying supervised predictive modeling? a) A business school that wants to start a new Master's degree program in Business Analytics, and would like to estimate the likely number of applicants b) A grocery store that is trying to identify which of its loyalty-card-carrying customers will spend more than $100 next month c) A city government that is trying to predict which neighborhoods will see the most new businesses open up next quarter d) An online marketing company that wants to estimate the number of clicks that the ads it serves will receive when shown to a particular population e) All of the above are equally challenging

a) A business school that wants to start a new Master's degree program in Business Analytics, and would like to estimate the likely number of applicants. Possible explanation: (b) and (c) have access to past data, and (d) can deploy a test fairly easily. (a) would have to go to an external source for its data, for example a competing business school with a similar program, and the competing school likely would not turn over the data.

The points on a model's ROC graph a) Represent the performance of different thresholds b) Represent different rankings of examples c) Represent the cost of different classifications d) All of the above e) None of the above

a) Represent the performance of different thresholds

Confusion matrices can be used to analyze the performance of which of the following models? a) Decision Trees b) Naive Bayes c) k-Nearest Neighbors d) All of the above e) (a) and (c) only

d) All of the above

When word order is important, which text mining strategy should we use? a) Bag of Words b) N-grams c) Named Entity Extraction d) Topic Models e) All of the above are valid models that respect word order

b) N-grams
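
A short sketch of why n-grams preserve word order where a bag of words does not. The sentences and the helper name are illustrative.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

print(sorted(a) == sorted(b))        # True: identical bags of words
print(ngrams(a, 2) == ngrams(b, 2))  # False: the bigrams differ
print(ngrams(a, 2))                  # [('dog', 'bites'), ('bites', 'man')]
```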

The point of analytical engineering is to a) Develop complex solutions by addressing every possible contingency b) Promote thinking about problems data-analytically c) Analyze engineers d) All of the above e) None of the above

b) Promote thinking about problems data-analytically

Which of the following is not a step in generating an ROC curve? a) Sort the test set by the model predictions b) Start with the cutoff = min(prediction) c) Decrease cutoff, and then count the number of true positives and false positives d) Calculate the TP rate and the FP rate e) Plot current number of TP/P as a function of the current FP/N

b) Start with the cutoff = min(prediction)
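
The correct procedure starts from a cutoff above the highest prediction and lowers it. The sketch below follows those steps; the function name and data are hypothetical, and tied scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points as the cutoff is lowered one example at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sort the test set by model prediction, highest score first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # cutoff above max(prediction): nothing is classified positive
    for _, label in ranked:
        # Lowering the cutoff past this example classifies it as positive.
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(points)
```

Each point is one threshold; plotting TP rate against FP rate for all of them gives the ROC curve.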

In the context of preparing for textual analysis, what is meant by "stop words"? a) Words that have the same meaning as the word "stop" b) Words such as "and", "or", and "the" c) Words specific to the topic of a corpus, such as "rebound" or "offense" when talking about a basketball game d) The words at the end of each sentence e) None of the above

b) Words such as "and", "or", and "the"

In a marketing environment, the expected benefit of not targeting is typically a) A negative value b) Zero c) A positive value d) All of the above e) None of the above

b) Zero
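
A sketch of the comparison behind this answer, assuming the usual expected-value framing in which not targeting a customer costs and earns nothing. The response probability and dollar values are hypothetical.

```python
def expected_benefit_of_targeting(p_response, value_response, value_no_response):
    """Probability-weighted sum of the two possible outcomes of targeting."""
    return p_response * value_response + (1 - p_response) * value_no_response

EXPECTED_BENEFIT_NOT_TARGETING = 0.0  # no offer sent: no cost, no revenue

# Hypothetical numbers: 5% respond, $99 profit per response, $1 lost per non-response.
ev = expected_benefit_of_targeting(0.05, 99.0, -1.0)
should_target = ev > EXPECTED_BENEFIT_NOT_TARGETING

print(round(ev, 2), should_target)
```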

In order for data points to be taken as input to most data mining programs, they must be represented as a) text documents b) feature vectors c) dependent variables d) targets e) all of the above

b) feature vectors

Which of the following terms will have the lowest IDF score in a typical (general purpose) corpus? a) bug b) car c) she d) spaghetti e) vertebrae

c) she. Explanation: Words with a high TF score occur frequently in a certain document, and words with a high IDF score appear in few documents of the corpus. Therefore, a word with a high TF-IDF score is a good indicator of the topic of a document. Being a pronoun, "she" appears frequently across many documents, so it has a low IDF score and is not a good indicator of what a document is about.

Give two examples of words that can be "stemmed" and what their stems would be.

cats, catlike, catty --> cat
fishing, fished, fisherman --> fish

The bag of words model a) Treats every document as a collection of individual words b) Ignores grammar, word order, and sentence structure c) Is a straightforward representation that is inexpensive to generate d) All of the above e) None of the above

d) All of the above

The Area Under the ROC Curve (AUC) is a) a fraction of the unit square, ranging from 0 to 1 b) useful when a single number is needed to summarize performance c) equivalent to the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance d) all of the above e) none of the above

d) all of the above
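
Option (c) can be checked directly on a small example by comparing every positive with every negative. This is a sketch on hypothetical data, and counting ties as half a win is a common convention assumed here.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention for ties
    return wins / (len(pos) * len(neg))

auc = auc_by_pairs([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, so the result is 0.75.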

The area under the ROC curve is not a) equal to the Mann-Whitney-Wilcoxon statistic b) a measure of the quality of a model's probability estimates c) likely to be at least 0.5 d) larger when false positive errors cost more e) actually it is all of the above

d) larger when false positive errors cost more

Which of the following is not an advantage of the Naive Bayes classifier? a) Very simple implementation b) Efficient in terms of storage space c) Efficient in terms of compute time d) Performs well in many real-world applications e) Generally accurate class probability estimation

e) Generally accurate class probability estimation

R provides a mechanism to allow computation to continue, even when an error (e.g. some kind of exception) has occurred. What is that function's name?

try()

Write a valid for loop in R that sums the numbers 1 through 100

total <- 0            # named "total" to avoid masking R's built-in sum()
for (i in 1:100) {
  total <- total + i  # add each number from 1 to 100
}

