CSE 160
The term "loss" is used across data science as a general term for error penalty.
True
Tree induction will keep growing the tree to fit the training data until it creates pure leaf nodes.
True
Understanding data science does not mean that you will be able to tell whether a data mining project will succeed.
True
Understanding data science is important because data analysis is so critical to business strategy, and because data analytics projects reach into all business units.
True
Using the list of U.S. states illustrates how a non-normal distribution has a normal sampling distribution of means.
True
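The idea in the card above can be checked with a short simulation. This is a minimal sketch: the skewed population is synthetic (drawn from an exponential distribution), not the U.S. states data, and the names `sample_means`, `population`, and `means` are illustrative assumptions.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

# Assumption: a synthetic, strongly right-skewed population stands in for real data.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(5000)]

means = sample_means(population, sample_size=30, n_samples=2000)

pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(means)       # lands close to the population mean
spread_of_means = statistics.stdev(means)    # much narrower than the population spread
pop_spread = statistics.pstdev(population)
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is skewed.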
We should study industries like online advertising for hints about big data and data science that subsequently will be adopted by other industries.
True
Tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. It is always right b. It is computationally inexpensive c. It is easy to understand d. It is easy to implement
a. It is always right
Name the function in R that concatenates data elements together into a vector:
c()
What is the name for a list of vectors where each vector has the exact same number of elements as the others?
data frame
The 25% of cases with the smallest values are known as the:
first quartile
What kind of graph or curve shows the accuracy of a model as a function of complexity?
fitting graph
A main purpose of creating _______ regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.
homogeneous
The situation in which a variable collected in historical data gives information on the target variable but is not actually available when the decision has to be made is called what?
leak
What is the name of the value that occurs most often in a sample of data?
mode
What is the term that Gauss used to describe the common bell-shaped distribution?
normal
The distinction between classification and regression is whether the target variable is categorical or
numeric
Entities that R creates and manipulates are called:
objects
Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?
overfitting
The definition for ______ is homogeneous with respect to the target variable.
pure
The general method for reining in model complexity to avoid overfitting is called model
regularization
Name the R function that can repeat an activity. (Do not include parentheses, just the name of the function)
replicate
The larger the sample size, the _________ the standard error.
smaller
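The relationship in the card above follows from the usual formula for the standard error of the mean, the standard deviation divided by the square root of the sample size. A minimal sketch, with `standard_error` as an illustrative helper name:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation divided by sqrt(sample size)."""
    return sd / math.sqrt(n)

# Same standard deviation, growing sample size: the standard error shrinks.
se_small_sample = standard_error(10, 25)
se_large_sample = standard_error(10, 100)
```

Quadrupling the sample size halves the standard error, because the sample size enters through a square root.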
Name the function in R that reveals the structure of a data object:
str()
Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the
target variable
In certain fields of statistics and econometrics, the bare model with unspecified parameters is called:
the model
Data comes from the Latin word "datum", meaning:
thing given
A natural measure of impurity for numeric values is
variance
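For a numeric target, a split can be scored by how much it reduces variance, in the same way entropy is used for categorical targets. The sketch below uses made-up numbers and a hypothetical helper `weighted_variance`; it is an illustration, not a prescribed procedure.

```python
import statistics

def weighted_variance(groups):
    """Size-weighted average of the population variance of each group."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * statistics.pvariance(g) for g in groups)

# Illustrative numeric target values before and after a candidate split.
parent = [1, 2, 9, 10]
children = [[1, 2], [9, 10]]

parent_impurity = statistics.pvariance(parent)          # impurity before the split
child_impurity = weighted_variance(children)            # impurity after the split
variance_reduction = parent_impurity - child_impurity   # larger means a better split
```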
Data science can apply to the farming profession.
True
Data scientists play active roles in the four A's of data: data architecture, data acquisition, data analysis, and data archiving.
True
Decomposing a data analytics problem into recognized tasks is a critical skill.
True
Information gain measures how much an attribute improves entropy due to new information being added.
True
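Information gain can be computed as the parent's entropy minus the size-weighted entropy of the children produced by a split. A minimal sketch with toy labels; the function names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]

# A split that yields pure children removes all of the uncertainty.
pure_split_gain = information_gain(parent, [["yes", "yes"], ["no", "no"]])

# A split whose children mirror the parent's class mix adds no information.
useless_split_gain = information_gain(parent, [["yes", "no"], ["yes", "no"]])
```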
R is case-sensitive.
True
Walmart data miners found that strawberry pop tarts sell at ________ times their normal rate ahead of a hurricane.
7
A simplified representation of reality created for a specific purpose
Model
Select the mathematician that did not work on the ideas of "the law of large numbers" and the central limit theorem.
Archimedes
The data mining procedure that produces a model that, given a new individual, determines the category to which that individual belongs:
Classification
The data mining procedure that attempts to find associations between entities based on transactions involving them:
Co-occurrence grouping
At a high level, data mining is a set of fundamental principles that guide the extraction of knowledge from data.
False
Cross-validation specifies a systematic way of splitting up a single dataset such that it generates one single performance measure.
False
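One reading of why the card above is marked False: k-fold cross-validation produces one performance estimate per fold, so it yields several measures (from which a mean and a variance can be computed) rather than a single one. The sketch below only shows the splitting step; `k_fold_indices` is a hypothetical helper, and the interleaved fold assignment is an assumption made for simplicity.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds over n instances."""
    # Assumption: instance i is assigned to fold i % k (no shuffling).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
# Each of the 5 splits would be used to train and evaluate a model once,
# giving 5 separate performance estimates.
```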
Deduction is a term from philosophy that refers to generalizing from specific cases to general rules.
False
If you run a statistical process a large number of times, it does not converge on a stable result.
False
In data science, prediction means to forecast a future event.
False
In data science, the key to success is to "follow the money"
False
SVM stands for separate vector machines.
False
The SVM's objective function incorporates the idea that a thinner bar is better.
False
The primary purpose of descriptive modeling is to predict a future event.
False
There is only one manual online for R.
False
What is the actual last name of the person who invented the Student's t-Test?
Gosset
An instance represents a fact or a data point.
True
Which of the following is NOT necessarily part of the data mining process presented in class?
Interviewing potential customers
Name the function in R that returns the average:
mean()
If the "tail" on the high side is slightly longer than it should be, then we have a:
Rightward Skew
What is the name of the bank that spun out the Capital One credit card company?
Signet
The data mining procedure that produces a model that, given a new individual, finds those individuals that are most like the new individual:
Similarity Matching
When we analyze a set of data with knowledge of the correct prediction for each item, what kind of model are we building?
Supervised