CSE Exam 2


A ROC graph has false negative rate on the x-axis and true positive rate on the y-axis. T/F

False

A collection of documents is called a library. T/F

False

A profit curve is helpful when you don't know the prior probabilities of each of the classes. T/F

False

Business problems are usually straightforward classification problems, regression problems, or clustering problems. T/F

False

Case-based reasoning can be applied to legal cases but not to medical cases. T/F

False

Correlation implies that there must be some cause and effect relationship between the two variables. T/F

False

Expected profit is the profit that is expected per customer that receives the targeted marketing offer. T/F

False

If the threshold is lowered, the confusion matrix will not change because the numbers of true positives and false positives stay the same. T/F

False

Like its location, the font size of each word in a word cloud is random. T/F

False

Lower variance models tend to have lower bias, and vice versa. T/F

False

Naive Bayes calculates the exact probability that an instance belongs to each class and then assigns the instance to the class that has the highest predicted probability. T/F

False

Referring to inverse document frequency, the more documents in which a term occurs, the more significant it likely is to be to the documents it does occur in. T/F

False

Referring to term frequency, the importance of a term in a document should decrease with the number of times that term occurs. T/F

False

The Pythagorean theorem tells us that the distance between A and B is given by the length of the shortest side of a triangle, and is equal to the square root of the summed squares of the lengths of the other two longer sides of the triangle. T/F

False
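
The statement is false because the distance is the hypotenuse, the longest side, computed from the two shorter sides. A minimal sketch of that Euclidean distance follows; the function name `euclidean` and the sample points are illustrative assumptions, not part of the card.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences along each dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A 3-4-5 right triangle: the two shorter sides are 3 and 4.
distance = euclidean((0, 0), (3, 4))
print(distance)
```

The same function works unchanged for points with more than two dimensions.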

The expected value framework lets us determine the best classification method for a business task, but we cannot use it in a situation that would require multiple classification or regression models. T/F

False

The word "and" should have a higher IDF score than "banana" in a general purpose corpus. T/F

False

Using n-grams greatly decreases the size of the feature set. T/F

False

Sensors produce clean time series data (i.e., with equally spaced time intervals) making them useful for time series analysis. T/F

False -- Sensor readings often are not equally spaced, can have missing data. We need to process that data to create a time series.

A confusion matrix is a matrix with the columns labeled with actual classes and the rows labeled with predicted classes. The values in the matrix represent the fraction of instances that fall within each combination of categories. T/F

False -- The values in the matrix represent the **count** of instances that fall within each combination of categories.

Positive instances classified as negative are called:

False negatives

TFIDF

TFIDF(t, d) = TF(t, d) x IDF(t)
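
A minimal sketch of the formula above. It assumes raw counts for TF and IDF(t) = log(N / n_t), where N is the number of documents and n_t the number containing t; the card does not say which IDF variant the course used, and the toy corpus is made up.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw count of the term in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """log(N / n_t); assumes the term occurs in at least one document."""
    n_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_t)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["the", "banana", "and", "the", "apple"],
    ["the", "car", "and", "the", "road"],
    ["the", "dog", "and", "the", "cat"],
]

idf_and = idf("and", corpus)        # occurs in every document
idf_banana = idf("banana", corpus)  # occurs in one document
banana_score = tfidf("banana", corpus[0], corpus)
```

Under this variant a term that appears in every document gets an IDF of zero, which is why "and" scores lower than "banana".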

Specificity

TN / (TN + FP)

Evaluation metric: Precision formula

TP / (TP + FP)

text representation

Take a set of documents -- each of which is a relatively free-form sequence of words -- and turn it into our familiar feature-vector form

A ranking classifier plus a threshold produces a single confusion matrix. T/F

True
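
As a sketch of why this holds, the function below turns ranking scores plus one threshold into the four confusion-matrix counts. The scores, labels, and the "score >= threshold means positive" rule are illustrative assumptions.

```python
def confusion_matrix(scores, labels, threshold):
    """Return (TP, FP, FN, TN), predicting positive when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, actual in zip(scores, labels):
        predicted = score >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical scores from a ranking classifier; 1 = positive, 0 = negative.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

high = confusion_matrix(scores, labels, 0.7)
low = confusion_matrix(scores, labels, 0.35)
```

Each threshold yields exactly one matrix, and lowering the threshold moves instances into the predicted-positive row, so the counts of true and false positives change.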

An important concern for data scientists is selection or survivor bias, since in such cases the data does not match the expected use of the model. T/F

True

Classification and regression trees tend to have high variance. T/F

True

Ensembles have been observed to improve generalization performance in many situations. T/F

True

Good data journalism employs methods from this course to engage and involve readers to discover knowledge in data. T/F

True

If we had a classifier with an AUC of .25, we could invert it to get a classifier with an AUC of .75. T/F

True
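
A small sketch of the inversion, assuming AUC is computed as the fraction of positive/negative pairs in which the positive is ranked higher (ties count half). The scores and labels are made up so that the AUC comes out to 0.25.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier that ranks most negatives above positives.
scores = [0.9, 0.6, 0.5, 0.2]
labels = [0, 1, 0, 1]

original_auc = auc(scores, labels)

# Inverting the classifier: negate every score, which reverses the ranking.
inverted_auc = auc([-s for s in scores], labels)
```

Reversing the ranking turns every lost pair into a won pair, so without ties the inverted AUC is 1 minus the original.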

In addition to the words themselves, context is also important when analyzing data. T/F

True

In topic modeling, the terms most associated with the topic, and term weights, are learned by the topic modeling process. T/F

True

Initialization of K-Means may affect the clustering results. T/F

True
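
A toy illustration, assuming a plain one-dimensional version of the usual assign-then-recompute loop. The data points, the two starting center pairs, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(points, centers, iterations=20):
    """Repeat: assign each point to its nearest center, then move each
    center to the mean of its cluster. Returns the final clusters."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

points = [1, 2, 9, 10, 17, 18]

from_left = kmeans_1d(points, centers=[1, 2])
from_right = kmeans_1d(points, centers=[17, 18])
```

Starting from the left, the middle pair joins the right-hand cluster; starting from the right, it joins the left-hand cluster. Both results are stable, so the algorithm never escapes them.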

Modeling involves making some simplifying assumptions to keep the problem tractable. T/F

True

Predicting stock price based on historical data is an example of time-series analysis. T/F

True

Stemming removes suffixes and transforms plural nouns to their singular forms. T/F

True

Stop words are typically very common words and include functional words like prepositions. T/F

True

Term frequency (TF) is document specific. T/F

True

Text is just another form of data, but dealing with text sometimes requires pre-processing and specific expertise. T/F

True

The Jaccard distance measure treats the two objects as sets of characteristics. T/F

True
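
A minimal sketch of Jaccard distance as 1 minus the size of the intersection over the size of the union. The amenity sets are invented for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# Hypothetical objects described by sets of characteristics.
hotel_a = {"wifi", "pool"}
hotel_b = {"wifi", "gym", "spa"}

distance = jaccard_distance(hotel_a, hotel_b)
```

Here the sets share one characteristic out of four distinct ones, so the distance is 1 - 1/4.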

The expected value is the weighted average of the values of the different possible outcomes, where the weight given to each value is its probability of occurrence. T/F

True

We saw that n-grams can be useful in recognizing the author of a document. T/F

True

base rate classifier

a classifier that does no learning and always predicts the majority class of the training data; often used as a minimum baseline for performance

corpus

a collection of documents

How many neighbors should be used in k-nn? a. no simple answer b. 10 c. 3 d. 5

a. no simple answer

AUC

area under a classifier's ROC curve, expressed as a fraction of the unit square; values range from 0 to 1. AUC is useful when a single number is needed to summarize performance or when nothing is known about the operating conditions

goal of machine learning

best estimate the function for the output variable given the input data; want to achieve low bias and low variance, leading to good predictive performance

machine learning errors

bias error, variance error, irreducible error/inherent randomness

bias error

error introduced by the simplifying assumptions a model makes so that the target function is easier to learn

What visual presentation of data can directly identify potential outliers?

boxplot

Which of the following is always true? a. P(AB) = P(A)/(P(B) + P(A)) b. P(AB) = P(A)P(A|B) c. P(AB) = P(A)P(B|A) d. P(AB) = P(A)P(B)

c. P(AB) = P(A)P(B|A)

expected value

the expected value computation provides a framework that is useful in organizing thinking about data-analytic problems; it decomposes data-analytic thinking into: the structure of the problem, the elements of the analysis that can be extracted from the data, and the elements of the analysis that need to be acquired from other sources

When comparing classification model performance, the model with the highest ______________________ should be used. a. precision b. f-measure c. accuracy d. depends on the situation e. recall

d. depends on the situation

Which of the following is not one of the three factors that characterize the errors a model could make? a. inherent randomness b. bias c. variance d. human error

d. human error

ROC graph

decouples classifier performance from the conditions under which the classifiers will be used; independent of the class proportions as well as costs and benefits; not the most intuitive visualization for many stakeholders

PCA and topic modeling: a. help the data scientist explore and understand the data b. none of these are correct c. have the ability to extract latent dimensions from data d. both can operate on the term-document frequency matrix e. all of these are correct

e. all of these are correct

random forest

each tree votes for its classification output (majority wins); reduces the chance of overfitting; typically has higher model performance

Mann-Whitney Wilcoxon measure

equivalent to AUC, and to the Gini coefficient after a minor algebraic transformation (Gini = 2 x AUC - 1); equals the probability that a randomly chosen positive instance will be ranked ahead of a randomly chosen negative instance

Naive Bayes Classifier

estimates the probability that the example belongs to each class, and reports the class with the highest probability
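
A compact sketch of that behavior for text, assuming word-count features, log probabilities, and add-one (Laplace) smoothing. None of those choices come from the card, and the four training documents are made up.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    """Score each class as log prior plus summed log likelihoods; report the best."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the score.
            score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [
    ("buy cheap pills".split(), "spam"),
    ("cheap offer now".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch at noon".split(), "ham"),
]
priors, word_counts, vocab = train(docs)

label_1 = predict("cheap pills now".split(), priors, word_counts, vocab)
label_2 = predict("noon meeting".split(), priors, word_counts, vocab)
```

The scores are only estimates because the words are treated as independent given the class, but the ranking of classes is often still right.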

not independent

example: die 2 is likely to roll the same as die 1; p(AB) = p(A) x p(B | A)

low bias

few assumptions about the form of the target function: decision trees, k-nearest neighbors, support vector machines

evaluating classifiers with a confusion matrix

for a problem involving n classes, an n x n matrix of counts from the test set, with columns labeled with the actual classes and rows labeled with the predicted classes

F-measure combines precision and recall ...

giving equal weight to precision and recall.

Bias-variance tradeoff

high bias leads to a failure to fully model the data (underfitting); high variance causes the model to fit the training data too closely (overfitting); high bias = low variance; high variance = low bias

variance error

how the learned function would change if different training data is used

independence

knowing whether one event occurred tells you nothing about the probability of the other event; p(AB) = p(A) x p(B)

low variance

linear and logistic regression

high bias

more assumptions about the form of the target function: linear regression, logistic regression

unigrams

one word sequence term

prior probability

p(C = c) before seeing evidence

posterior probability

p(C=c | E)

Expected profit equation

p(Y, p) * b(Y, p) + p(N, p) * b(N, p) + p(N, n) * b(N, n) + p(Y, n) * b(Y, n)

expected value equation

p(x1) * v(x1) + p(x2) * v(x2) ...
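
A short sketch of the weighted sum, applied to the four-cell expected profit calculation as one instance. The probabilities and benefit values are hypothetical, chosen only so the arithmetic is easy to follow; the first letter is the predicted class (Y/N) and the second the actual class (p/n).

```python
def expected_value(outcomes):
    """outcomes: (probability, value) pairs whose probabilities sum to 1."""
    return sum(p * v for p, v in outcomes)

# Hypothetical joint probabilities and benefits for each confusion-matrix cell.
profit_terms = [
    (0.25, 100.0),  # p(Y, p), b(Y, p): true positive
    (0.25, -40.0),  # p(N, p), b(N, p): false negative
    (0.25, 0.0),    # p(N, n), b(N, n): true negative
    (0.25, -20.0),  # p(Y, n), b(Y, n): false positive, a cost
]

expected_profit = expected_value(profit_terms)  # 25 - 10 + 0 - 5
```

Costs enter as negative benefits, so the same function covers both equations.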

joint probability

probability both event A and event B occur p(AB)

p(C|E) stands for the:

probability of an event C happening given E happened

conditional probability

probability of two different events both occurring is equal to the probability of one of them occurring times the probability of the other occurring if we know the first occurs. p(x,y) = p(y) * p(x | y)

ensemble methods

provide an advantage by leveraging model diversity and combining many models to reduce variance; ex: bagging, boosting, random forests

Inverse Document Frequency

some words are more popular across documents than others; useful words are neither too common nor too rare; the more common a word is, the less value it gives

high variance

strongly influenced by the specifics of the training data: decision trees, k-nearest neighbors, SVM

pre-processing of text

the case is normalized -- every term is in lowercase; words are stemmed -- suffixes removed; stop-words removed

profit curves

two critical conditions underlie the profit calculation: (1) the class priors: the proportion of positive and negative instances in the target population; (2) the costs and benefits: the expected profit is sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit matrix

trigram

three word sequence term

document is composed of:

tokens or terms

"bag of words"

treat every document as just a collection of individual words; ignore grammar, word order, punctuation; treat every term as a potentially important keyword
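
A minimal sketch of the representation, assuming lowercasing and a simple regular-expression tokenizer (both are illustrative choices). The counts it produces are the term frequencies rather than a 0/1 indicator.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, drop punctuation, and count each remaining word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The cat sat. The cat ran!")

# Word order is ignored: a reordered document gives the same bag.
same_bag = bag_of_words("Ran the cat; sat the cat")

# Turn the bag into a feature vector over a fixed, sorted vocabulary.
vocabulary = sorted(bag)
vector = [bag[term] for term in vocabulary]
```

Sorting the vocabulary just fixes a column order so that every document maps to a vector of the same shape.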

bigrams

two word sequence term

term frequency

use the word count in the document instead of just a 0 or 1

stop-word

very common word that is considered content-free such as "the", "and", "of"

Advantages & Disadvantages of Naive Bayes

Advantages: very simple classifier; efficient in terms of storage space and computation time; performs well in real-world applications; incremental learner -- good for online applications with high velocity. Disadvantage: inaccurate class probability estimates

N-gram sequences

useful when word order is important and you want to preserve some information about it in the representation; they greatly increase the size of the feature set
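
A brief sketch of n-gram extraction that also shows the growth in feature count. The sentence and the helper name `ngrams` are illustrative assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences, each joined into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Using all three as features more than doubles the feature set here.
features = set(unigrams) | set(bigrams) | set(trigrams)
```

Five words yield five unigrams, four bigrams, and three trigrams, so twelve features instead of five.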

Word Clouds (tag clouds)

words that are larger correspond to more important words in source material

steps for generating word cloud

1. import 2 libraries: tm and wordcloud; 2. create a corpus for analysis; 3. clean it; 4. build a term-document matrix; 5. extract terms and their frequencies; 6. generate the word cloud

Evaluation metric: F-measure formula

2 x Precision x Recall / (Precision + Recall)

What is the (minimum) Levenshtein metric between: Godly Goodbye

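4 if a substitution counts as two edits (a delete plus an insert); 3 under the standard definition, where insert, delete, and substitute each cost 1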
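
The answer depends on what a substitution costs, which the card does not state. The sketch below is a standard dynamic-programming edit distance with a `sub_cost` parameter, a name introduced here for illustration. With unit cost the two strings are 3 edits apart (insert "o", change "l" to "b", append "e"); with a substitution cost of 2 they are 4 apart.

```python
def levenshtein(a, b, sub_cost=1):
    """Minimum total cost of inserts and deletes (cost 1 each) and
    substitutions (cost sub_cost) needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

unit_cost = levenshtein("Godly", "Goodbye")
double_cost = levenshtein("Godly", "Goodbye", sub_cost=2)
```

Only the previous row of the table is kept, which is enough to compute the distance but not to recover the edit sequence.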

Evaluation metric: Accuracy formula

(TP + TN) / (TP + FP + FN + TN)

Bayes rule

P(A | B) = P(B | A) * P(A) / P(B)
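
A worked numeric sketch of the rule, with the evidence term expanded over C and not-C. All of the probabilities are made-up inputs.

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(C | E) = P(E | C) * P(C) / P(E)."""
    return likelihood * prior / evidence

# Hypothetical inputs.
p_c = 0.5               # prior P(C)
p_e_given_c = 0.5       # likelihood P(E | C)
p_e_given_not_c = 0.25  # P(E | not C)

# Total probability of the evidence: P(E|C)P(C) + P(E|not C)P(not C).
p_e = p_e_given_c * p_c + p_e_given_not_c * (1 - p_c)

posterior = bayes_posterior(p_c, p_e_given_c, p_e)
```

Because the evidence is twice as likely under C as under not-C, seeing it raises the probability of C from 1/2 to 2/3.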

Given the formula P(C|E) = P(C)*P(E|C)/P(E), what is the "prior" probability?

P(C)

The point of analytical engineering is to:

Promote thinking about problems data analytically

Evaluation metric: Recall formula

Recall = Sensitivity = TP / (TP + FN)
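
The evaluation-metric cards in this set (precision, recall, specificity, accuracy, F-measure) all derive from the four confusion-matrix counts. A minimal sketch follows; the counts are hypothetical and the functions assume nonzero denominators.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts from a test set of 100 instances.
tp, fp, fn, tn = 30, 10, 30, 30

p = precision(tp, fp)
r = recall(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, fp, fn, tn)
f1 = f_measure(p, r)
```

With these counts precision is 30/40 and recall is 30/60. The F-measure lands between the two, closer to the lower value.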

TN/(TN+FP) = True negative rate = 1-False positive rate = ________________

Specificity

