CSE 160 Exam 1
Euclidean distance is limited to only two dimensions.
False
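A minimal Python sketch of why this is false: the Euclidean formula sums squared differences over however many dimensions the points have. The function name `euclidean_distance` is illustrative, not from the course material.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The same function works in 2 dimensions and in 4 dimensions.
print(euclidean_distance([0, 0], [3, 4]))              # 5.0
print(euclidean_distance([0, 0, 0, 0], [1, 1, 1, 1]))  # 2.0
```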
Similarity can be used in each of the following except:
None of the others are exceptions (all can utilize similarity)
Which of the below characters starts a commented line in R?
#
Different distance measures have different properties. Suppose we want to ignore differences in scale across instances; technically, we want to ignore the magnitude of the vectors. Which of the following should we select?
Cosine distance
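A minimal Python sketch, assuming cosine distance is defined as 1 minus cosine similarity. It shows that scaling a vector by a positive constant leaves the distance essentially unchanged. The function name and the sample vectors are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]   # same direction as a, ten times the magnitude
c = [3.0, 2.0, 1.0]      # different direction

print(cosine_distance(a, b))  # very close to 0: magnitude is ignored
print(cosine_distance(a, c))  # clearly greater than 0
```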
A collection that is impure means that it is homogeneous with respect to the target variable.
False
A complex tree with many leaves is the kind of tree that will best prevent overfitting.
False
By definition, node construction in decision trees always results in binary trees.
False
In general, if two classes are linearly separable, there is exactly one linear discriminant.
False
In many binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, more boundaries and islands will appear.
False
Logical vectors in R cannot be a part of arithmetic operations.
False
Logistic regression is misnamed because it does not use a log function.
False
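A minimal Python sketch of where the log comes in: logistic regression is usually described as a linear model of the log-odds, and the logistic function maps a log-odds value back to a probability. The helper names `log_odds` and `logistic` are illustrative.

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of log_odds: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(log_odds(0.5))            # 0.0: even odds
print(logistic(log_odds(0.8)))  # approximately 0.8 (round trip)
```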
Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points.
False
The creation of models from data is known as model deduction.
False
A loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance.
True
A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.
True
All data mining procedures have the tendency to overfit to some extent.
True
An instance represents a fact or a data point.
True
Cross-validation is a best-practice method to estimate generalization performance.
True
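A minimal Python sketch of how k-fold cross-validation partitions the data: every instance appears in exactly one test fold and is used for training in the other folds. The function `k_fold_indices` is hypothetical and, for simplicity, does not shuffle the data first.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

A model would be fit on each `train` split and scored on the matching `test` split; the average of the k scores is the generalization estimate.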
Data science is the discipline of making data useful.
True
Decomposing a data analytics problem into recognized tasks is a critical skill.
True
Information gain measures how much an attribute improves (decreases) entropy due to the new information it adds.
True
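A minimal Python sketch of the calculation, assuming the usual definition: parent entropy minus the size-weighted entropy of the child segments produced by splitting on an attribute. The labels and the split below are made-up illustration data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5
# Hypothetical split on some attribute into two purer segments.
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(entropy(parent))                     # 1.0: maximally impure
print(information_gain(parent, children))  # about 0.278
```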
Linear functions can represent nonlinear models, if we include more complex features.
True
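A minimal Python sketch of the idea: the model below is linear in its weights, but because one feature is x squared, the decision it makes is nonlinear in x. The weights and bias are hypothetical values chosen by hand, not fitted.

```python
def add_squared_feature(x):
    """Expand a single input x into the features [x, x^2]."""
    return [x, x * x]

def linear_model(weights, bias, features):
    """Ordinary linear function: bias + sum of weight * feature."""
    return bias + sum(w * f for w, f in zip(weights, features))

# Hypothetical weights: f(x) = -4 + 0*x + 1*x^2, positive when |x| > 2.
weights = [0.0, 1.0]
bias = -4.0

for x in [-3, -1, 0, 1, 3]:
    score = linear_model(weights, bias, add_squared_feature(x))
    print(x, "positive" if score > 0 else "negative")
```

No single threshold on x alone separates these points, but a linear function of the expanded features does.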
Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid).
True
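A minimal Python sketch of Manhattan distance as the sum of absolute per-dimension differences, like counting blocks east-west plus blocks north-south. The function name is illustrative.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences along each dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

# Three blocks across and four blocks up: seven blocks of street travel.
print(manhattan_distance([0, 0], [3, 4]))  # 7
```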
Test data should be strictly independent of model building so that we can get a good estimate of model accuracy.
True
When would you NOT want to increase the complexity of your model?
When your model is overfitting
Which loss function specifies a loss proportional to the distance from the boundary?
absolute error
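A minimal Python sketch contrasting absolute error, which grows in proportion to the size of the error, with squared error, which grows faster. Squared error is included only for comparison and is an addition to the card; the function names are illustrative.

```python
def absolute_error(actual, predicted):
    """Loss proportional to the size of the error."""
    return abs(actual - predicted)

def squared_error(actual, predicted):
    """Loss that grows with the square of the error."""
    return (actual - predicted) ** 2

for predicted in [1.0, 2.0, 4.0]:
    print(predicted,
          absolute_error(0.0, predicted),
          squared_error(0.0, predicted))
```

Doubling the error doubles the absolute error but quadruples the squared error.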
According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________.
decision, model
What does help(function_name) do?
Displays the documentation for a function
Datasheets for datasets can be valuable to all of the following EXCEPT
ALL of these may find datasheets for datasets valuable
The data mining procedure that produces a model which determines the category to which an individual sample belongs.
Classification
When measuring distance between elements of a dataset, what are the units used for Euclidean Distance?
None
Name the function in R that reveals the structure of a data object:
str()
Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable.
target
What is the name for a list of vectors where each vector has the exact same number of elements as the others?
A dataframe
A learning curve is defined as:
A plot of classifier performance on a fixed test set (y-axis) against the size of the training data (x-axis)
Which of the following is NOT a reason to perform attribute selection?
All of these choices are reasons for attribute selection (Reduced computational and/or storage cost, Better predictions, Better explanations and more tractable models, Faster predictions)
Which of the following is NOT true about Big Data?
Data science is the study of Big Data and not about Small Data
What does example(function_name) do?
Runs the examples from a function's documentation to show how it is used
When we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth?
Increase K
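A minimal Python sketch of majority-vote k-nearest neighbors on made-up 2-D data. One "B" point sits inside the "A" cluster: with k = 1 it creates a small island of "B" predictions, while a larger k outvotes it and smooths the boundary. The function `knn_predict` and the data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"), ((1, 1), "A"),
    ((0.5, 0.5), "B"),                      # lone B inside the A cluster
    ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
]
query = (0.6, 0.6)

print(knn_predict(train, query, k=1))  # B: the lone point wins
print(knn_predict(train, query, k=5))  # A: the island is voted away
```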
Which of the following is NOT necessarily part of the data mining process presented in class?
Interviewing potential customers
Tree induction has been a very popular data mining procedure for all reasons EXCEPT:
It is definitely the most accurate model one can produce from a particular data set
I am _______________ confident with a classification close to the decision boundary than one far from the boundary.
Less
Different learning algorithms build different decision boundaries. If we have two attributes, we can visualize the decision boundaries (as we do below for a three class problem). Which of the following are NOT possible sources of the model?
Logistic regression or Support vector machine
How many neighbors should be used in k-NN?
No simple answer
The most appropriate type of attribute that should be used to encode letter grade is called
Ordinal
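A minimal Python sketch of an ordinal encoding for letter grades: the integer codes preserve the ordering of the grades without claiming that the gaps between them are equal. The grade list and the helper `encode_grade` are assumptions for illustration.

```python
# Grades listed from lowest to highest; position gives the ordinal rank.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}

def encode_grade(grade):
    """Map a letter grade to its ordinal rank (0 = lowest)."""
    return GRADE_RANK[grade]

print(encode_grade("A") > encode_grade("B"))             # True
print(sorted(["B", "A", "F", "C"], key=encode_grade))    # ['F', 'C', 'B', 'A']
```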
Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?
Overfitting
The data mining procedure that attempts to characterize the typical behavior of an individual or population.
Profiling
What type of data mining task would be used to predict the gas mileage of a car?
Regression
When we analyze a set of data with a defined target in mind, what kind of model are we building?
Supervised
If you are told that a set with binary elements has an entropy of 0, what do you know?
That the set is all FALSE or all TRUE
How does data leakage negatively impact model building?
The model will see information that will not be available when decisions are actually made
What does help.search("descriptive_word") do?
Searches the help system for functions matching a keyword when you do not know the function's name
Logistic regression models produce a ____________ estimate for a _____________ target variable.
numeric, categorical
What does apropos("function_name") do?
Searches for functions by name or partial name
