CSE 160 Quizzes Review
Euclidean distance is limited to only two dimensions. True False
F
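A minimal sketch of why the answer is False: the Euclidean formula sums squared differences over however many dimensions the vectors have. The helper name `euclidean` is made up for illustration and is not a base R function.

```r
# Euclidean distance between two numeric vectors of equal length.
# `euclidean` is a hypothetical helper, not part of base R.
euclidean <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

d2 <- euclidean(c(0, 0), c(3, 4))              # two dimensions: 5
d4 <- euclidean(c(1, 2, 3, 4), c(2, 4, 6, 8))  # four dimensions: sqrt(30)

# Base R's dist() gives the same result for the rows of a matrix.
d4_base <- as.numeric(dist(rbind(c(1, 2, 3, 4), c(2, 4, 6, 8))))
```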
Logical vectors in R cannot be a part of arithmetic operations. True False
F
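A short illustration of why this is False: in arithmetic, R coerces TRUE to 1 and FALSE to 0. The vector `passed` is invented example data.

```r
# Logical vectors are coerced to numbers (TRUE = 1, FALSE = 0) in arithmetic.
passed <- c(TRUE, FALSE, TRUE, TRUE)   # made-up example data

n_passed  <- sum(passed)    # counts the TRUE values: 3
pass_rate <- mean(passed)   # proportion of TRUE values: 0.75
bonus     <- passed * 5     # element-wise: 5 0 5 5
```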
When we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth? a. Decrease K b. Increase K c. None of these choices will affect the smoothness of the decision boundary d. Use a distance measure other than Euclidean
b
help(function_name)
documents what a function does
example(function_name)
shows how to use a function
A collection that is impure means that it is homogeneous with respect to the target variable. True False
F
A complex tree with many leaves is the kind of tree that will best prevent overfitting. True False
F
By definition, node construction in decision trees always results in binary trees. True False
F
In many binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, more boundaries and islands will appear. True False
F
Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points. True False
F
The creation of models from data is known as model deduction. True False
F
According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________. a. decision, model b. instance, dataset c. variable, model d. variable, dataset e. instance, species
a
Cross-validation is a best-practice method to estimate generalization performance. True False
T
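One way to picture cross-validation is the sketch below, which runs 5-fold cross-validation by hand on the built-in `mtcars` data. The choice of model (`mpg ~ wt`), the number of folds, and the seed are all assumptions made for illustration.

```r
# 5-fold cross-validation by hand: every row is held out exactly once.
set.seed(160)                       # arbitrary seed so the folds are repeatable
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))   # random fold label for each row

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]     # build the model on k - 1 folds
  test  <- mtcars[folds == i, ]     # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

cv_rmse <- mean(rmse)               # estimate of generalization error
```

Averaging over the folds gives a steadier estimate than a single train/test split, and the spread of `rmse` hints at how much that estimate varies.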
Datasheets for datasets can be valuable to all of the following EXCEPT a. ALL of these may find datasheets for datasets valuable b. dataset consumers c. policy makers d. investigative journalists e. dataset creators
a
Different distance measures have different properties. Suppose we want to ignore differences in scale across instances: technically, we want to ignore the magnitude of the vectors. Which of the following shall we select? a. Cosine distance b. All of the others ignore differences in scale c. Manhattan distance d. Jaccard distance e. Euclidean distance
a
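A small sketch of why cosine distance is the answer: scaling a vector changes its magnitude but not its direction, so the cosine distance stays at zero while the Euclidean distance grows. `cosine_distance` is a hypothetical helper name, and the vectors are made up.

```r
# Cosine distance = 1 - cosine similarity; it depends only on direction.
# `cosine_distance` is a hypothetical helper, not part of base R.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

x <- c(1, 2, 3)
y <- 10 * x                        # same direction, ten times the magnitude

d_cos <- cosine_distance(x, y)     # zero, up to floating-point error
d_euc <- sqrt(sum((x - y)^2))      # clearly nonzero: sensitive to scale
```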
How many neighbors should be used in k-nn? a. No simple answer b. 3 c. 5 d. 10
a
Similarity can be used in each of the following except: a. None of the others are exceptions (all can utilize similarity) b. Performing regression c. Classifying things d. Recommending things e. Retrieving and ranking things from a collection
a
The most appropriate type of attribute that should be used to encode letter grade is called a. Ordinal b. Numeric c. Calendrial d. Nominal e. Gradual
a
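In R, an ordinal attribute can be represented as an ordered factor. The sketch below assumes a five-letter grading scale (F through A); the scale and the sample grades are illustrative only.

```r
# Letter grades have a natural order but no meaningful numeric spacing,
# so an ordered factor is a reasonable encoding.
grades <- factor(c("B", "A", "C", "A"),
                 levels = c("F", "D", "C", "B", "A"),  # assumed scale, low to high
                 ordered = TRUE)

b_below_a <- grades[1] < grades[2]   # comparisons respect the level order
best      <- max(grades)             # highest level present in the data
```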
When would you NOT want to increase the complexity of your model? a. When your model is overfitting b. When your model is underfitting c. All of these cases (i.e., for all of them you would not want to increase complexity) d. None of these cases (i.e., for all of them you would want to increase complexity) e. When you believe a new attribute could provide a useful signal
a
Which of the below characters starts a commented line in R? a. # b. / c. // d. -- e. % f. /*
a
How does data leakage negatively impact model building? a. The model will return an error b. The model will see information which is not available when making decisions c. Data leakage positively affects model building d. Leakage means missing data and missing data is problematic e. The model will not see enough information
b
Tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. It is easy to understand b. It is definitely the most accurate model one can produce from a particular data set c. It is easy to implement d. It is computationally inexpensive
b
When measuring distance between elements of a dataset, what are the units used for Euclidean Distance? a. Depends on the input b. None c. Centimeters d. Inches
b
When we analyze a set of data with a defined target in mind, what kind of model are we building? a. Data mining b. Supervised c. Machine learning d. Analytical e. Unsupervised
b
Which of the following is NOT necessarily part of the data mining process presented in class? a. Data Preparation b. Interviewing potential customers c. Business Understanding d. Data Understanding e. Deployment f. Modeling g. Evaluation
b
The data mining procedure that attempts to characterize the typical behavior of an individual or population. a. Co-occurrence grouping b. Classification c. Regression d. Similarity matching e. Profiling
e
The data mining procedure that produces a model which determines the category to which an individual sample belongs. a. Link prediction b. Causal Modeling c. Data Reduction d. Clustering e. Classification
e
Which of the following is NOT true about Big Data? a. Characteristics of today's Big Data are often described with "V" words b. Big Data technologies can be used to support data-driven decision making c. Big data is defined as datasets that are too large for traditional data processing systems d. Big Data is considered to be "at the foundation of all the megatrends that are happening today" e. Data science is the study of Big Data and not about Small Data
e
All data mining procedures have the tendency to overfit to some extent. True False
T
An instance represents a fact or a data point. True False
T
Data science is the discipline of making data useful. True False
T
Decomposing a data analytics problem into recognized tasks is a critical skill. True False
T
Information gain measures how much an attribute improves/decreases entropy due to new information being added. True False
T
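A worked sketch of information gain: the parent set's entropy minus the weighted average entropy of the child sets produced by splitting on an attribute. The helper names `entropy` and `info_gain` and the toy data are assumptions for illustration.

```r
# Entropy of a vector of class labels, in bits.
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]                      # avoid log2(0)
  -sum(p * log2(p))
}

# Information gain from splitting `target` by the values of `attribute`.
info_gain <- function(target, attribute) {
  children <- sapply(split(target, attribute), entropy)
  weights  <- table(attribute) / length(attribute)
  entropy(target) - sum(as.numeric(weights[names(children)]) * children)
}

target  <- c("yes", "yes", "yes", "no", "no", "no")   # toy target variable
perfect <- c("a", "a", "a", "b", "b", "b")            # splits classes cleanly
noisy   <- c("a", "b", "a", "b", "a", "b")            # barely informative

ig_perfect <- info_gain(target, perfect)   # 1 bit: children are pure
ig_noisy   <- info_gain(target, noisy)     # small: children stay mixed
```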
Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid). True False
T
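For comparison, a sketch of Manhattan distance next to Euclidean distance on the same pair of points. `manhattan` is a hypothetical helper name; base R's `dist()` supports the same measure through `method = "manhattan"`.

```r
# Manhattan distance: sum of absolute differences along each axis.
# `manhattan` is a hypothetical helper, not part of base R.
manhattan <- function(a, b) sum(abs(a - b))

a <- c(1, 1)
b <- c(4, 5)

d_man  <- manhattan(a, b)            # 3 blocks east + 4 blocks north = 7
d_euc  <- sqrt(sum((a - b)^2))       # straight-line distance = 5
d_base <- as.numeric(dist(rbind(a, b), method = "manhattan"))
```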
Test data should be strictly independent of model building so that we can get a good estimate of model accuracy. True False
T
A learning curve is defined as: a. A plot of the performance of the classifier on the y axis on a growing test set on the x axis with a fixed size training set b. A plot of the performance of the classifier on the y axis on a growing test set as the size of the training data grows on the x axis c. A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis d. None of these reflect the definition of a learning curve e. A plot of the performance of the classifier on the y axis versus the complexity of the model on the x axis
c
If you are told that a set with binary elements has an entropy of 0, what do you know? a. That the set is all TRUE b. You don't know anything about the set c. None of the others are correct d. That the set is all FALSE e. That the set is perfectly mixed
c
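A quick check of why none of the specific choices is right: an entropy of 0 tells you the set is pure, but not which value it holds. The `entropy` helper below is a hypothetical name written for this illustration.

```r
# Entropy of a vector of labels, in bits.
entropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]                       # avoid log2(0)
  -sum(p * log2(p))
}

e_all_true  <- entropy(c(TRUE, TRUE, TRUE, TRUE))      # pure set: 0
e_all_false <- entropy(c(FALSE, FALSE, FALSE, FALSE))  # also pure: 0
e_mixed     <- entropy(c(TRUE, TRUE, FALSE, FALSE))    # perfectly mixed: 1
```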
Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable. a. independent b. feature c. target d. attribute e. vivacious
c
What is the name for a list of vectors where each vector has the exact same number of elements as the others? a. A dataset b. A list c. A dataframe d. A mode
c
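The sketch below shows that a data frame is a list underneath, with every column the same length. The column names and values are invented for the example.

```r
# A data frame is a list of equal-length vectors (its columns).
grades <- data.frame(
  student = c("Ana", "Ben", "Cy"),    # made-up example data
  score   = c(91, 78, 85),
  passed  = c(TRUE, TRUE, TRUE)
)

frame_is_list <- is.list(grades)          # TRUE
col_lengths   <- sapply(grades, length)   # each column has 3 elements

# Columns whose lengths cannot be recycled to match are rejected.
mismatch <- tryCatch(data.frame(a = 1:3, b = 1:2),
                     error = function(e) "error")
```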
What type of data mining task would be used to predict the gas mileage of a car? a. Classification b. Profiling c. Regression d. Similarity Matching e. Clustering
c
AMBIGUOUS/malformed -- all answers will get credit. Which of the following is NOT used to describe the data for which we have values of the target variable but is used for building the model? a. training data b. test data c. validation data d. holdout data
d
Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what? a. Specialization b. Generalization c. Underfitting d. Overfitting e. The correct answer is none of these.
d
Name the function in R that reveals the structure of a data object: a. display() b. summary() c. print() d. str() e. content()
d
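A brief example of `str()` on a small data frame. The data are invented, and `capture.output()` is used only so the printed text can be stored in a variable.

```r
# str() prints a compact description: class, dimensions, and the type
# and first few values of each column.
cars_df <- data.frame(
  model = c("A", "B", "C"),           # made-up example data
  mpg   = c(21.0, 22.8, 18.7)
)

str(cars_df)                          # prints the structure to the console

# Capture the same text as a character vector, one element per line.
structure_lines <- capture.output(str(cars_df))
```

By contrast, `summary()` reports per-column statistics rather than the object's structure.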
Which of the following is NOT a reason to perform attribute selection? a. Better predictions b. Better explanations and more tractable models c. Reduced computational and/or storage cost d. All of these choices are reasons for attribute selection e. Faster predictions
d
help.search("descriptive_word")
searches database for function when you do not know the name
apropos("function_name")
searches database for function by name or a partial name
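A short sketch of `apropos()` in use; the search term "mean" is just an example. `help.search()` is shown only in a comment because its results depend on which help pages are installed.

```r
# apropos() returns the names of objects on the search path that match
# a pattern; by default the match is case-insensitive.
hits <- apropos("mean")

found_mean     <- "mean" %in% hits       # exact name is included
found_colMeans <- "colMeans" %in% hits   # partial, case-insensitive match

# Not run here: keyword search across installed help pages.
# help.search("standard deviation")
```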
