CSE 160 Exam 1

Euclidean distance is limited to only two dimensions.

False
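
As an illustration of why this is false, here is a minimal sketch of Euclidean distance for points with any number of dimensions. The function name `euclidean_distance` is an assumption made for this example, not something from the course material.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The same formula works in two dimensions...
print(euclidean_distance([0, 0], [3, 4]))
# ...and in four (or any other number of) dimensions.
print(euclidean_distance([1, 2, 3, 4], [1, 2, 3, 6]))
```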

Similarity can be used in each of the following except:

None of the others are exceptions (all can utilize similarity)

Which of the below characters starts a commented line in R?

#

Different distance measures have different properties. Suppose we want to ignore differences in scale across instances—technically, we want to ignore the magnitude of the vectors. Which of the following shall we select?

Cosine distance
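
A small sketch of why cosine distance ignores magnitude, assuming the common definition of one minus the cosine of the angle between the vectors. The name `cosine_distance` and the sample vectors are illustrative only.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); depends on direction, not length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
scaled = [10 * x for x in v]  # same direction, ten times the magnitude

# Scaling a vector leaves the cosine distance at (approximately) zero.
print(cosine_distance(v, scaled))
# Perpendicular vectors are at cosine distance 1.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```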

A collection that is impure means that it is homogeneous with respect to the target variable.

False

A complex tree with many leaves is the kind of tree that will best prevent overfitting.

False

By definition, node construction in decision trees always results in binary trees.

False

In general, if two classes are linearly separable, there is exactly one linear discriminant.

False

In many binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, more boundaries and islands appear.

False

Logical vectors in R cannot be a part of arithmetic operations.

False

Logistic regression is misnamed because it does not use a log function.

False

Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points.

False

The creation of models from data is known as model deduction.

False

A loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance.

True

A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

True

All data mining procedures have the tendency to overfit to some extent.

True

An instance represents a fact or a data point.

True

Cross-validation is a best-practice method to estimate generalization performance.

True
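
To make the idea concrete, here is a rough sketch of how k-fold cross-validation splits instance indices into training and test folds. The helper `k_fold_splits` is hypothetical, and this version does not shuffle the data, which a real procedure usually would.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each instance lands in the test fold exactly once across the k rounds.
for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Averaging the model's performance over the test folds gives the estimate of generalization performance.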

Data science is the discipline of making data useful.

True

Decomposing a data analytics problem into recognized tasks is a critical skill.

True

Information gain measures how much an attribute reduces entropy, based on the new information it adds.

True
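
A minimal sketch of the calculation, assuming the usual definition: parent entropy minus the size-weighted entropy of the child segments produced by splitting on an attribute. The function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]

# A split that yields pure segments removes all of the entropy.
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
# A split that leaves each segment as mixed as the parent gains nothing.
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))
```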

Linear functions can represent nonlinear models, if we include more complex features.

True
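
One way to picture this: the function below stays linear in its weights, but because a squared feature is added, it can represent a curve in the original variable. The helper names and the chosen weights are assumptions for this sketch.

```python
def expand_features(x):
    """Map a single input to the features [x, x^2]."""
    return [x, x ** 2]

def linear_function(features, weights, bias):
    """A plain weighted sum: linear in the features and in the weights."""
    return bias + sum(w * f for w, f in zip(weights, features))

# With weight 0 on x and weight 1 on x^2, the linear function traces a parabola.
weights = [0.0, 1.0]
bias = 0.0
for x in [-2.0, 0.0, 3.0]:
    print(x, linear_function(expand_features(x), weights, bias))
```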

Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid).

True
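
A short sketch of the grid analogy: Manhattan distance adds up the absolute differences along each dimension. The name `manhattan_distance` is chosen here for illustration.

```python
def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (city-block distance)."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

# Three blocks east plus four blocks north is seven blocks of street travel,
# even though the straight-line (Euclidean) distance is only five.
print(manhattan_distance([0, 0], [3, 4]))
```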

Test data should be strictly independent of model building so that we can get a good estimate of model accuracy.

True

When would you NOT want to increase the complexity of your model?

When your model is overfitting

Which loss function specifies a loss proportional to the distance from the boundary?

absolute error
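
For comparison, a small sketch of absolute error next to squared error. The function names and the sample values are assumptions for this example.

```python
def absolute_error(actual, predicted):
    """Loss grows in proportion to the size of the error."""
    return abs(actual - predicted)

def squared_error(actual, predicted):
    """Loss grows with the square of the error, so large errors cost more."""
    return (actual - predicted) ** 2

# Doubling the error (2 -> 4) doubles absolute error but quadruples squared error.
for predicted in [8, 6]:
    print(predicted, absolute_error(10, predicted), squared_error(10, predicted))
```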

According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________.

decision, model

What does help(function_name) do?

Displays the documentation for a function

Datasheets for datasets can be valuable to all of the following EXCEPT

ALL of these may find datasheets for datasets valuable

The data mining procedure that produces a model which determines the category to which an individual sample belongs.

Classification

When measuring distance between elements of a dataset, what are the units used for Euclidean Distance?

None

Name the function in R that reveals the structure of a data object:

str()

Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable.

target

What is the name for a list of vectors in which every vector has exactly the same number of elements?

A data frame

A learning curve is defined as:

A plot of classifier performance on a fixed test set (y-axis) against the size of the training data (x-axis)

Which of the following is NOT a reason to perform attribute selection?

All of these choices are reasons for attribute selection (Reduced computational and/or storage cost, Better predictions, Better explanations and more tractable models, Faster predictions)

Which of the following is NOT true about Big Data?

Data science is the study of Big Data and not about Small Data

What does example(function_name) do?

Runs the examples from a function's documentation to show how it is used

When we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth?

Increase K
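
A toy sketch of the smoothing effect, assuming a plain majority-vote k-NN over one-dimensional points. The data, labels, and the function `knn_predict` are made up for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# One 'B' point at x=3 sits inside a region that is otherwise class 'A'.
training = [((1.0,), "A"), ((2.0,), "A"), ((3.0,), "B"), ((4.0,), "A"),
            ((5.0,), "A"), ((7.0,), "B"), ((8.0,), "B"), ((9.0,), "B")]

query = (3.1,)
print(knn_predict(training, query, k=1))  # follows the single nearby point
print(knn_predict(training, query, k=3))  # the lone point is outvoted
```

With k=1 the lone point creates its own island in the decision boundary; a larger k averages over more neighbors and the island disappears.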

Which of the following is NOT necessarily part of the data mining process presented in class?

Interviewing potential customers

Tree induction has been a very popular data mining procedure for all reasons EXCEPT:

It is definitely the most accurate model one can produce from a particular data set

I am _______________ confident in a classification close to the decision boundary than in one far from the boundary.

Less

Different learning algorithms build different decision boundaries. If we have two attributes, we can visualize the decision boundaries (as we do below for a three class problem). Which of the following are NOT possible sources of the model?

Logistic regression or Support vector machine

How many neighbors should be used in k-nn?

No simple answer

The most appropriate type of attribute that should be used to encode letter grade is called

Ordinal

Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?

Overfitting

The data mining procedure that attempts to characterize the typical behavior of an individual or population.

Profiling

What type of data mining task would be used to predict the gas mileage of a car?

Regression

When we analyze a set of data with a defined target in mind, what kind of model are we building?

Supervised

If you are told that a set with binary elements has an entropy of 0, what do you know?

That the set is all FALSE or all TRUE
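
A quick check of this, assuming the standard Shannon entropy in bits. The function name `entropy` is chosen for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure set has no uncertainty, so its entropy is zero.
print(entropy([True, True, True, True]))
# An evenly mixed binary set has the maximum entropy of one bit.
print(entropy([True, True, False, False]))
```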

How does data leakage negatively impact model building?

The model sees information during training that will not be available when it is used to make decisions

What does help.search("descriptive_word") do?

Searches the help system for functions when you do not know the function's name

Logistic regression models produce a ____________ estimate for a _____________ target variable.

numeric, categorical
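
A minimal sketch of this idea: a linear function gives the log-odds, the logistic (sigmoid) function turns that into a numeric probability, and a threshold turns the probability into a class. The weights, threshold, and class labels below are hypothetical.

```python
import math

def predict_probability(features, weights, bias):
    """Numeric estimate: probability of the positive class."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_class(features, weights, bias, threshold=0.5):
    """Categorical prediction obtained by thresholding the probability."""
    p = predict_probability(features, weights, bias)
    return "positive" if p >= threshold else "negative"

weights, bias = [1.0, 1.0], 0.0
print(predict_probability([0.0, 0.0], weights, bias))  # log-odds of 0
print(predict_class([2.0, 1.0], weights, bias))
print(predict_class([-2.0, -1.0], weights, bias))
```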

What does apropos("function_name") do?

Searches for functions by full or partial name

