CSE160 Data Science Exam 1

Understanding data science does not mean that you will be able to tell whether a data mining project will succeed. (T/F)

True

Understanding data science is important because data analysis is so critical to business strategy, and because data analytics projects reach into all business units. (T/F)

True

We should study industries like online advertising for hints about big data and data science that subsequently will be adopted by other industries. (T/F)

True

Tree induction has been a very popular data mining procedure for all reasons EXCEPT: Select one: a. It is definitely the most accurate model one can produce from a particular data set b. It is computationally inexpensive c. It is easy to understand d. It is easy to implement

a. It is definitely the most accurate model one can produce from a particular data set

When we analyze a set of data with a defined target in mind, what kind of model are we building? a. Supervised b. Unsupervised

a. Supervised

The data mining procedure that produces a model that, given an individual, determines the category to which that individual belongs: Select one: a. Causal Modeling b. Classification c. Link prediction d. Data Reduction e. Clustering

b. Classification

What is NOT TRUE about the k-means algorithm? Select one: a. Initial points can have a large influence on the result b. K-means always converges to the same final result c. K is the number of clusters d. K-means is only applicable with a defined mean

b. K-means always converges to the same final result
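A minimal sketch of why option (a) holds and option (b) does not, assuming one-dimensional data and hand-picked starting centers. The function name `kmeans_1d` and the toy data are illustrative only and do not come from the course material.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Plain k-means on 1-D data; returns the final clusters, sorted."""
    centers = list(initial_centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(sorted(c) for c in clusters)


points = [0, 1, 9, 10, 20, 21]
result_a = kmeans_1d(points, initial_centers=[0, 1])
result_b = kmeans_1d(points, initial_centers=[20, 21])
print(result_a)
print(result_b)
```

Both runs use the same data and the same K, yet they settle on different clusterings, which is why k-means is usually restarted from several initializations.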

According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________. Select one: a. variable, model b. decision, model c. variable, dataset

b. decision, model

Name the function in R that combines data elements together into a vector

c()

The process of repeatedly drawing a subset from a population is called ___________, and the end result of doing lots of this is a _____________. Select one: a. "grabbing", population graph b. "extraction", sampling distribution c. "sampling", sampling distribution d. "extraction", scatterplot

c. "sampling", sampling distribution
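As a rough illustration of the answer above, the following sketch draws many samples from a toy population and collects their means; that collection approximates a sampling distribution of the mean. The population, sample size, and function name are assumptions chosen for the example.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Repeatedly sample without replacement and record each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


population = list(range(1, 101))  # toy population with mean 50.5
sample_means = sampling_distribution(population, sample_size=10, n_samples=2000)

# The sample means cluster around the population mean.
print(statistics.mean(population), statistics.mean(sample_means))
```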

What is the (minimum) Levenshtein metric between "Godly" and "Goodbye"? Select one: a. 4 b. 12 c. 3 d. 7

c. 3
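The answer can be checked with the standard dynamic-programming formulation of Levenshtein (edit) distance. This is a generic sketch, not code from the course; one shortest edit path is: insert "o", substitute "l" with "b", insert "e".

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Godly", "Goodbye"))  # 3
```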

When measuring distance between elements of a dataset, what are the units used for Euclidean Distance? Select one: a. Inches b. Depends on the input c. None d. Centimeters

c. None

How confident are you with a classification close to the decision boundary? Select one: a. Very confident b. Reasonably confident c. Not very confident

c. Not very confident

In the approach called _______________, the parameters of a model are tuned so that it fits the data as well as possible. Select one: a. classification b. curve plotting c. parametric modeling d. parametric plotting

c. parametric modeling

Data comes from the Latin word "datum", meaning: a. day and time b. the tomb c. a thing taken d. a thing given

d. a thing given

The data mining procedure that attempts to find associations between entities based on transactions involving them: a. Classification b. Similarity Matching c. Clustering d. Co-occurrence grouping e. Profiling

d. Co-occurrence grouping

Name the R command that creates a new function

function()

A main purpose of creating _________ regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

homogeneous

The situation in which a variable collected in historical data gives information on the target variable that is not actually available when the decision has to be made is called what?

leak

Name the R command that takes two lists and returns values that are in each

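match() (returns the positions at which elements of the first vector occur in the second; intersect() returns the shared values themselves)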

A collection that is _________ means that it is homogeneous with respect to the target variable.

pure

Name the function in R that reveals the structure of a data object

str()

Name the R command that counts occurrences of integer-valued data in a vector

tabulate()

Name the R command that creates a list of unique values in a vector

unique()

Data scientists play active roles in the four A's of data: data architecture, data acquisition, data analysis, and data archiving. (T/F)

True

Decomposing a data analytics problem into recognized tasks is a critical skill. (T/F)

True

Exporting a data set in CSV format as opposed to a spreadsheet format can sometimes help to cut down on the work necessary to clean and prepare the data for analysis. (T/F)

True

In R, as long as we give them credit, we can use other people's functions by installing their packages and calling the library() function to make the contents of a package available. (T/F)

True

Information gain measures how much an attribute reduces (or increases) entropy when the data are split on the new information it adds. (T/F)

True
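A small sketch of the calculation behind this card, assuming the usual definitions: entropy is the sum of -p * log2(p) over the class proportions, and information gain is the parent's entropy minus the size-weighted entropy of the child segments. The labels and function names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * log2(n / total) for n in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4

# An attribute that splits the data into pure segments gains the most.
pure_split = [["yes"] * 4, ["no"] * 4]
# An attribute whose segments look just like the parent gains nothing.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, pure_split))
print(information_gain(parent, useless_split))
```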

Jaccard distance treats the two objects as sets of characteristics. (T/F)

True
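To make the set view concrete, here is a minimal sketch of Jaccard distance, assuming the common definition of one minus the size of the intersection over the size of the union. The example sets are invented.

```python
def jaccard_distance(x, y):
    """1 - |X ∩ Y| / |X ∪ Y|, treating both objects as sets of characteristics."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1 - len(x & y) / len(union)


a = {"wine", "cheese", "bread"}
b = {"cheese", "bread", "olives"}
print(jaccard_distance(a, b))  # 0.5 (2 shared items out of 4 distinct items)
```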

Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid). (T/F)

True
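The grid-walking idea can be sketched as a sum of absolute coordinate differences. The function name and points are illustrative.

```python
def manhattan_distance(p, q):
    """Sum of absolute differences along each dimension (grid or street distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Three blocks east and four blocks north means seven blocks of walking.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```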

R is case-sensitive. (T/F)

True

Walmart data miners found that strawberry pop tarts sell at _______ times their normal rate ahead of a hurricane.

7

Instead of the equal sign, in R, what is the operator that is used to assign a value to a variable?

<-

Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?

Overfitting

At a high level, data mining is a set of fundamental principles that guide the extraction of knowledge from data. (T/F)

False

In general, if two classes are linearly separable, there is exactly one linear discriminant. (T/F)

False

Logical vectors in R cannot be a part of arithmetic operations. (T/F)

False

Similarity can be used for classification but not regression. (T/F)

False
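One way similarity supports regression is nearest-neighbor averaging: predict a numeric target as the mean target of the k most similar training instances. The sketch below assumes Euclidean distance and an unweighted average; the data and the name `knn_regress` are hypothetical.

```python
import math


def knn_regress(train, query, k=3):
    """Predict a numeric target as the mean target of the k nearest neighbors.

    train is a list of (feature_tuple, target_value) pairs.
    """
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in neighbors) / len(neighbors)


train = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]
print(knn_regress(train, query=(1.5,), k=2))  # 15.0
```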

The SVM's objective function incorporates the idea that a thinner bar is better. (T/F)

False

The creation of models from data is known as model deduction. (T/F)

False

There is only one manual online for R. (T/F)

False

A linear regression model outputs a class probability estimate. (T/F)

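False (a linear regression model outputs a numeric value; a logistic regression model outputs a class probability estimate)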

The Euclidean distance measure is closely related to the _________ Theorem from Geometry.

Pythagorean
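The connection can be seen in a short sketch: Euclidean distance is the square root of the summed squared differences, which in two dimensions is the hypotenuse of a right triangle. The function name is illustrative.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared differences along each dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Legs of length 3 and 4 give a hypotenuse of 5, the classic 3-4-5 triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```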

Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the __________ variable.

Target

An instance represents a fact or a data point. (T/F)

True

