CSE 160 Quizzes Review

Euclidean distance is limited to only two dimensions. True False
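To check the claim, here is a minimal R sketch. The helper name euclidean_distance is illustrative and not from the course. The same formula, the square root of the summed squared differences, is applied to vectors of two and of four elements.

```r
# Illustrative helper (hypothetical name): Euclidean distance between
# two numeric vectors of equal length, in any number of dimensions.
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

d2 <- euclidean_distance(c(0, 0), c(3, 4))              # two dimensions
d4 <- euclidean_distance(c(1, 2, 3, 4), c(1, 2, 3, 6))  # four dimensions
print(c(d2, d4))
```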

Logical vectors in R cannot be a part of arithmetic operations. True False
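A short R sketch to test this statement. The variable names are made up for illustration. R coerces TRUE to 1 and FALSE to 0 when a logical vector appears in arithmetic.

```r
# Logical values are coerced to numbers in arithmetic: TRUE -> 1, FALSE -> 0.
passed <- c(TRUE, FALSE, TRUE, TRUE)

n_passed <- sum(passed)       # counts the TRUE values
share_passed <- mean(passed)  # proportion of TRUE values
shifted <- passed + 1         # element-wise arithmetic on the coerced values
print(list(n_passed, share_passed, shifted))
```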

When we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth? a. Decrease K b. Increase K c. None of these choices will affect the smoothness of the decision boundary d. Use a distance measure other than Euclidean

help(function_name)

documents what a function does

example(function_name)

shows how to use a function

A collection that is impure means that it is homogeneous with respect to the target variable. True False

A complex tree with many leaves is the kind of tree that will best prevent overfitting. True False

By definition, node construction in decision trees always results in binary trees. True False

In many binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, more boundaries and islands will appear. True False

Overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points. True False

The creation of models from data is known as model deduction. True False

According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________. a. decision, model b. instance, dataset c. variable, model d. variable, dataset e. instance, species

Cross-validation is a best-practice method to estimate generalization performance. True False
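One way k-fold cross-validation could look in base R is sketched below. The built-in mtcars data, the linear model of mpg on wt, the choice of 5 folds, and RMSE as the error measure are all assumptions made for illustration, not the course's setup.

```r
# Minimal k-fold cross-validation sketch (assumed setup: mtcars, lm, RMSE).
set.seed(1)
k <- 5
# Assign each row to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]   # build the model on k - 1 folds
  test  <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit <- lm(mpg ~ wt, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

cv_rmse <- mean(fold_rmse)  # averaged held-out error across folds
print(cv_rmse)
```

Each row is used for testing exactly once and never in the same round it is used for training, which is what makes the averaged error an estimate of performance on unseen data.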

Datasheets for datasets can be valuable to all of the following EXCEPT a. ALL of these may find datasheets for datasets valuable b. dataset consumers c. policy makers d. investigative journalists e. dataset creators

Different distance measures have different properties. Suppose we want to ignore differences in scale across instances; technically, we want to ignore the magnitude of the vectors. Which of the following should we select? a. Cosine distance b. All of the others ignore differences in scale c. Manhattan distance d. Jaccard distance e. Euclidean distance
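To see how a measure can ignore vector magnitude, the sketch below compares two vectors that point in the same direction but differ in scale by a factor of 10. It uses one common definition of cosine distance (1 minus cosine similarity). The helper name cosine_distance is illustrative.

```r
# Illustrative helper (hypothetical name): 1 - cosine similarity.
cosine_distance <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

x <- c(1, 2, 3)
y <- 10 * x   # same direction as x, ten times the magnitude

cos_xy <- cosine_distance(x, y)   # essentially zero: direction is identical
euc_xy <- sqrt(sum((x - y)^2))    # large: Euclidean distance reflects scale
print(c(cos_xy, euc_xy))
```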

How many neighbors should be used in k-nn? a. No simple answer b. 3 c. 5 d. 10

Similarity can be used in each of the following except: a. None of the others are exceptions (all can utilize similarity) b. Performing regression c. Classifying things d. Recommending things e. Retrieving and ranking things from a collection

The most appropriate type of attribute that should be used to encode letter grade is called a. Ordinal b. Numeric c. Calendrial d. Nominal e. Gradual

When would you NOT want to increase the complexity of your model? a. When your model is overfitting b. When your model is underfitting c. All of these cases (i.e., for all of them you would not want to increase complexity) d. None of these cases (i.e., for all of them you would want to increase complexity) e. When you believe a new attribute could provide a useful signal

Which of the below characters starts a commented line in R? a. # b. / c. // d. -- e. % f. /*

How does data leakage negatively impact model building? a. The model will return an error b. The model will see information which is not available when making decisions c. Data leakage positively affects model building d. Leakage means missing data and missing data is problematic e. The model will not see enough information

Tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. It is easy to understand b. It is definitely the most accurate model one can produce from a particular data set c. It is easy to implement d. It is computationally inexpensive

When measuring distance between elements of a dataset, what are the units used for Euclidean Distance? a. Depends on the input b. None c. Centimeters d. Inches

When we analyze a set of data with a defined target in mind, what kind of model are we building? a. Data mining b. Supervised c. Machine learning d. Analytical e. Unsupervised

Which of the following is NOT necessarily part of the data mining process presented in class? a. Data Preparation b. Interviewing potential customers c. Business Understanding d. Data Understanding e. Deployment f. Modeling g. Evaluation

The data mining procedure that attempts to characterize the typical behavior of an individual or population. a. Co-occurrence grouping b. Classification c. Regression d. Similarity matching e. Profiling

The data mining procedure that produces a model which determines the category to which an individual sample belongs. a. Link prediction b. Causal Modeling c. Data Reduction d. Clustering e. Classification

Which of the following is NOT true about Big Data? a. Characteristics of today's Big Data are often described with "V" words b. Big Data technologies can be used to support data-driven decision making c. Big data is defined as datasets that are too large for traditional data processing systems d. Big Data is considered to be "at the foundation of all the megatrends that are happening today" e. Data science is the study of Big Data and not about Small Data

All data mining procedures have the tendency to overfit to some extent. True False

An instance represents a fact or a data point. True False

Data science is the discipline of making data useful. True False

Decomposing a data analytics problem into recognized tasks is a critical skill. True False

Information gain measures how much an attribute improves/decreases entropy due to new information being added. True False
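A small worked sketch of information gain in R, assuming the usual definition: entropy of the parent set minus the size-weighted entropy of the subsets produced by splitting on an attribute. The function names and the toy data are illustrative.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(x) {
  p <- table(x) / length(x)   # only observed classes appear, so every p > 0
  -sum(p * log2(p))
}

# Information gain from splitting `target` by the values of `attribute`.
information_gain <- function(target, attribute) {
  groups <- split(target, attribute)
  weights <- sapply(groups, length) / length(target)
  entropy(target) - sum(weights * sapply(groups, entropy))
}

target  <- c("yes", "yes", "no", "no")
perfect <- c("a", "a", "b", "b")   # separates the classes completely
useless <- c("a", "b", "a", "b")   # leaves each subset as mixed as the parent

ig_perfect <- information_gain(target, perfect)
ig_useless <- information_gain(target, useless)
print(c(ig_perfect, ig_useless))
```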

Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid). True False
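The grid picture can be sketched in R as follows. The helper name manhattan_distance and the two points are made up for illustration. The Manhattan distance sums the absolute differences along each axis, while the Euclidean distance is the straight-line length.

```r
# Illustrative helper (hypothetical name): sum of absolute coordinate differences.
manhattan_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(abs(a - b))
}

from <- c(1, 1)
to   <- c(4, 5)

street   <- manhattan_distance(from, to)  # 3 blocks one way + 4 blocks the other
straight <- sqrt(sum((from - to)^2))      # straight-line (Euclidean) distance
print(c(street, straight))
```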

Test data should be strictly independent of model building so that we can get a good estimate of model accuracy. True False

A learning curve is defined as: a. A plot of the performance of the classifier on the y axis on a growing test set on the x axis with a fixed size training set b. A plot of the performance of the classifier on the y axis on a growing test set as the size of the training data grows on the x axis c. A plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis d. None of these reflect the definition of a learning curve e. A plot of the performance of the classifier on the y axis versus the complexity of the model on the x axis

If you are told that a set with binary elements has an entropy of 0, what do you know? a. That the set is all TRUE b. You don't know anything about the set c. None of the others are correct d. That the set is all FALSE e. That the set is perfectly mixed
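To reason about this question, it can help to compute the entropy of a few small binary sets. The sketch below assumes the standard base-2 entropy formula. The helper name entropy and the example sets are illustrative.

```r
# Entropy (in bits) of a vector of labels.
entropy <- function(x) {
  p <- table(x) / length(x)   # only observed values appear, so every p > 0
  -sum(p * log2(p))
}

all_true  <- entropy(c(TRUE, TRUE, TRUE, TRUE))
all_false <- entropy(c(FALSE, FALSE, FALSE, FALSE))
mixed     <- entropy(c(TRUE, TRUE, FALSE, FALSE))
print(c(all_true, all_false, mixed))
```

Note that the two pure sets produce the same entropy value, so the value alone does not say which of the two classes the set contains.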

Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable. a. independent b. feature c. target d. attribute e. vivacious

What is the name for a list of vectors where each vector has the exact same number of elements as the others? a. A dataset b. A list c. A dataframe d. A mode
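A brief R sketch of the structure this question describes. The column names and values are invented for illustration. Each column is a vector, all columns have the same length, and the object as a whole is also a list.

```r
# Three equal-length vectors combined column-wise (toy data).
scores <- data.frame(
  student = c("A", "B", "C"),
  quiz1   = c(8, 9, 7),
  passed  = c(TRUE, TRUE, FALSE)
)

is_a_list      <- is.list(scores)         # the object is a list of columns
column_lengths <- sapply(scores, length)  # every column has the same length
print(is_a_list)
print(column_lengths)
```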

What type of data mining task would be used to predict the gas mileage of a car? a. Classification b. Profiling c. Regression d. Similarity Matching e. Clustering

AMBIGUOUS/malformed -- all answers will get credit. Which of the following is NOT used to describe the data for which we have values of the target variable but is used for building the model? a. training data b. test data c. validation data d. holdout data

Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what? a. Specialization b. Generalization c. Underfitting d. Overfitting e. The correct answer is none of these.

Name the function in R that reveals the structure of a data object: a. display() b. summary() c. print() d. str() e. content()
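For comparison, the sketch below calls two of the listed functions that exist in base R on a small, made-up data frame. The data are illustrative only.

```r
scores <- data.frame(quiz = c(8, 9, 7), passed = c(TRUE, TRUE, FALSE))

str(scores)      # compact view: object class, dimensions, column names and types
summary(scores)  # per-column summary statistics, shown for contrast

structure_text <- capture.output(str(scores))  # same output, kept as text
```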

Which of the following is NOT a reason to perform attribute selection? a. Better predictions b. Better explanations and more tractable models c. Reduced computational and/or storage cost d. All of these choices are reasons for attribute selection e. Faster predictions

help.search("descriptive_word")

searches database for function when you do not know the name

apropos("function_name")

searches database for function by name or a partial name
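The lookup functions on these cards might be used as sketched below, with mean as an arbitrary example. The interactive calls are left as comments because they open or print documentation in the console. Only apropos() is run, since it simply returns matching names.

```r
# Interactive lookups, shown as comments:
# help("mean")             # documents what mean() does
# example("mean")          # runs the examples from mean()'s help page
# help.search("average")   # searches help pages by a descriptive word

# apropos() returns the names of objects that match a full or partial name.
matches <- apropos("mean")
print(head(matches))
```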
