CSE 160 Quizzes (Fall 2022)

Test data should be strictly independent of model building so that we can get a good estimate of model accuracy.

True

Which of the below characters starts a commented line in R? a. /* b. # c. // d. % e. / f. --

b. #

Name the function in R that reveals the structure of a data object: a. display() b. content() c. print() d. str() e. summary()

d. str()

When we analyze a set of data with a defined target in mind, what kind of model are we building? a. Machine learning b. Analytical c. Data mining d. Unsupervised e. Supervised

e. Supervised

A collection that is impure means that it is homogeneous with respect to the target variable.

False

By definition, node construction in decision trees always results in binary trees.

False

In general, if two classes are linearly separable, there is exactly one linear discriminant.

False

In most binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. In k-nearest neighbor, however, more boundaries and islands appear as k gets larger.

False

Logical vectors in R cannot be a part of arithmetic operations.

False

Logistic regression is misnamed because it does not use a log function.

False

Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points.

False

The creation of models from data is known as model deduction.

False

A decision tree with multiple interior nodes is a linear classifier.

False

A loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance.

True

A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

True

All data mining procedures have the tendency to overfit to some extent.

True

An instance represents a fact or a data point.

True

Cross-validation is a best-practice method to estimate generalization performance.

True
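
As a rough illustration of how cross-validation partitions data, the sketch below (in Python, with the hypothetical helper name `k_fold_indices`) splits instance indices into k folds so that every instance is held out for testing exactly once. It assumes no shuffling or stratification, which a real workflow would usually add.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fit on each `train_idx` and scored on the matching `test_idx`; averaging the k scores gives the estimate of generalization performance.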

Decomposing a data analytics problem into recognized tasks is a critical skill.

True

Information gain measures how much an attribute improves (decreases) entropy due to new information being added.

True
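
A minimal sketch of the calculation, assuming the usual definitions: entropy is the sum of -p * log2(p) over the classes, and information gain is the parent's entropy minus the size-weighted entropy of the children. The function names and the toy "yes"/"no" labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5             # perfectly mixed: entropy 1.0
split = [["yes"] * 4 + ["no"],                # mostly "yes"
         ["yes"] + ["no"] * 4]                # mostly "no"
gain = information_gain(parent, split)
print(round(gain, 3))
```

A split that produced two pure children would recover the full 1.0 bit of entropy, while a split that left both children perfectly mixed would gain nothing.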

Linear functions can represent nonlinear models, if we include more complex features.

True
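
One way to picture this: the sketch below scores a single input x with a function that is linear in the features (1, x, x^2), yet the resulting decision regions in x are not separable by a single threshold. The weights are hypothetical, chosen so that the score equals x^2 - 1.

```python
def features(x):
    """Expand a single numeric input into the features (1, x, x^2)."""
    return [1.0, x, x * x]


# Hypothetical weights: score = -1 + 0*x + 1*x^2, linear in the features.
WEIGHTS = [-1.0, 0.0, 1.0]


def linear_score(x):
    return sum(w * f for w, f in zip(WEIGHTS, features(x)))


def classify(x):
    return "+" if linear_score(x) > 0 else "-"


for x in (-2.0, 0.0, 2.0):
    print(x, classify(x))
```

The positive class sits on both sides of the negative class, which no linear function of x alone could represent.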

Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid).

True
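
For comparison, here is a small sketch of Manhattan and Euclidean distance between two points; the function names and the sample points are illustrative only.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (grid or 'street' distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


p, q = (1, 2), (4, 6)
print(manhattan_distance(p, q))   # 3 blocks across + 4 blocks up
print(euclidean_distance(p, q))   # length of the diagonal
```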

A learning curve is defined as: a. A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis b. A plot of the performance of the classifier on the y axis on a growing test set on the x axis with a fixed size training set c. A plot of the performance of the classifier on the y axis versus the complexity of the model on the x axis d. A plot of the performance of the classifier on the y axis on a growing test set as the size of the training data grows on the x axis e. None of these reflect the definition of a learning curve

a. A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis

I am _______________ confident with a classification close to the decision boundary than one far from the boundary. a. Less b. More c. The same

a. Less

When measuring distance between elements of a dataset, what are the units used for Euclidean Distance? a. None b. Centimeters c. Inches d. Depends on the input

a. None

When would you NOT want to increase the complexity of your model? a. When your model is overfitting b. When you believe a new attribute could provide a useful signal c. All of these cases (i.e., for all of them you would not want to increase complexity) d. None of these cases (i.e., for all of them you would want to increase complexity) e. When your model is underfitting

a. When your model is overfitting

Logistic regression models produce a ____________ estimate for a _____________ target variable. a. numeric, categorical b. categorical, numeric c. categorical, categorical d. numeric, numeric

a. numeric, categorical
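
A minimal sketch of why the estimate is numeric: a logistic regression model computes a linear score, treats it as the log-odds of class membership, and converts it to a probability. The weights, bias, and feature values below are hypothetical, not fitted to any data.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Linear score (log-odds) passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


weights, bias = [0.8, -0.5], -0.2          # hypothetical, not learned
p = predict_probability([2.0, 1.0], weights, bias)
log_odds = math.log(p / (1.0 - p))         # recovers the linear score
label = "positive" if p >= 0.5 else "negative"
print(round(p, 3), round(log_odds, 3), label)
```

The categorical prediction only appears in the last step, when a threshold is applied to the numeric estimate.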

What type of data mining task would be used to predict the gas mileage of a car? a. Clustering b. Regression c. Profiling d. Similarity Matching e. Classification

b. Regression

help(function_name) →

documents what a function does

Which of the following is NOT necessarily part of the data mining process presented in class? a. Interviewing potential customers b. Data Preparation c. Data Understanding d. Evaluation e. Business Understanding f. Deployment g. Modeling

a. Interviewing potential customers

Which of the following is NOT true about Big Data? a. Big Data technologies can be used to support data-driven decision making b. Data science is the study of Big Data and not about Small Data c. Characteristics of today's Big Data are often described with "V" words d. Big Data is considered to be "at the foundation of all the megatrends that are happening today" e. Big data is defined as datasets that are too large for traditional data processing systems

b. Data science is the study of Big Data and not about Small Data

The data mining procedure that attempts to characterize the typical behavior of an individual or population. a. Regression b. Co-occurrence grouping c. Similarity matching d. Profiling e. Classification

d. Profiling

What is the name for a list of vectors where each vector has the exact same number of elements as the others? a. A dataframe b. A mode c. A list d. A dataset

a. A dataframe

example(function_name) →

shows how to use a function

Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?

Overfitting

Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable. a. vivacious b. feature c. attribute d. independent e. target

e. target

The data mining procedure that produces a model which determines the category to which an individual sample belongs. a. Clustering b. Data Reduction c. Classification d. Causal Modeling e. Link prediction

c. Classification

If you are told that a set with binary elements has an entropy of 0, what do you know? a. That the set is all TRUE b. You don't know anything about the set c. None of the others are correct d. That the set is perfectly mixed e. That the set is all FALSE

c. None of the others are correct

The most appropriate type of attribute that should be used to encode letter grade is called a. Calendrial b. Gradual c. Ordinal d. Numeric e. Nominal

c. Ordinal
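
To make the idea concrete, the sketch below encodes letter grades as ranks so that their order is preserved. The grade scale and the names `GRADE_ORDER` and `GRADE_RANK` are assumptions for illustration.

```python
# Assumed grade scale, listed from lowest to highest.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}

grades = ["B", "A", "F", "C"]
sorted_grades = sorted(grades, key=GRADE_RANK.get)
print(sorted_grades)
```

The ranks support comparisons such as "A is higher than C", but the gaps between them carry no numeric meaning, which is what separates an ordinal attribute from a numeric one.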

Which of the following is NOT a reason to perform attribute selection? a. Better explanations and more tractable models b. Better predictions c. Faster predictions d. All of these choices are reasons for attribute selection e. Reduced computational and/or storage cost

d. All of these choices are reasons for attribute selection

When we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth? a. Use a distance measure other than Euclidean b. Decrease K c. None of these choices will affect the smoothness of the decision boundary d. Increase K

d. Increase K
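
A toy sketch of the effect, using made-up one-dimensional training data with a single mislabeled-looking point at x = 2. With k = 1 that point creates its own island of class "b"; with k = 5 its neighbors outvote it and the boundary smooths out.

```python
from collections import Counter

# Hypothetical training data: (x, class). The point at x = 2 is an outlier.
TRAINING = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "a"),
            (6, "b"), (7, "b"), (8, "b"), (9, "b"), (10, "b")]


def knn_predict(query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(TRAINING, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(2.1, k=1))   # follows the single nearest point
print(knn_predict(2.1, k=5))   # averaged over a wider neighborhood
```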

Tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. It is computationally inexpensive b. It is easy to implement c. It is easy to understand d. It is definitely the most accurate model one can produce from a particular data set

d. It is definitely the most accurate model one can produce from a particular data set

How does data leakage negatively impact model building? a. Leakage means missing data and missing data is problematic b. The model will not see enough information c. Data leakage positively affects model building d. The model will see information which is not available when making decisions e. The model will return an error

d. The model will see information which is not available when making decisions

Which loss function specifies a loss proportional to the distance from the boundary? a. absolute error b. zero-one loss c. squared error d. logistic loss e. hinge loss

e. hinge loss
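
A short sketch contrasting hinge loss with zero-one loss, assuming labels coded as +1 / -1 and a hypothetical model score f(x) whose sign gives the predicted class and whose magnitude reflects distance from the boundary.

```python
def hinge_loss(y, score):
    """Zero beyond the margin on the correct side; grows linearly otherwise."""
    return max(0.0, 1.0 - y * score)


def zero_one_loss(y, score):
    """1 for any wrong (or on-boundary) prediction, 0 otherwise."""
    return 1 if y * score <= 0 else 0


for score in (2.0, 0.5, -1.0, -3.0):
    print(score, hinge_loss(+1, score), zero_one_loss(+1, score))
```

For a positive instance, scores of -1.0 and -3.0 both count as one error under zero-one loss, while hinge loss penalizes the second more heavily because it lies farther on the wrong side of the boundary.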

apropos("function_name") →

lists functions and other objects whose names match a full or partial name

help.search("descriptive_word") →

searches the help system for functions related to a keyword when you do not know the function's name

