CSE160 Data Science Exam 1
Understanding data science does not mean that you will be able to tell whether a data mining project will succeed. (T/F)
True
Understanding data science is important because data analysis is so critical to business strategy, and because data analytics projects reach into all business units. (T/F)
True
We should study industries like online advertising for hints about big data and data science that subsequently will be adopted by other industries. (T/F)
True
Tree induction has been a very popular data mining procedure for all of the following reasons EXCEPT: Select one: a. It is definitely the most accurate model one can produce from a particular data set b. It is computationally inexpensive c. It is easy to understand d. It is easy to implement
a. It is definitely the most accurate model one can produce from a particular data set
When we analyze a set of data with a defined target in mind, what kind of model are we building? a. Supervised b. Unsupervised
a. Supervised
The data mining procedure that produces a model that, given an individual, determines the category to which that individual belongs: Select one: a. Causal Modeling b. Classification c. Link prediction d. Data Reduction e. Clustering
b. Classification
What is NOT TRUE about the k-means algorithm? Select one: a. Initial points can have a large influence on the result b. K-means always converges to the same final result c. K is the number of clusters d. K-means is only applicable with a defined mean
b. K-means always converges to the same final result
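The sensitivity to initial points can be seen in a minimal sketch, written here in Python for illustration. The function name `kmeans_1d`, the toy data, and the two starting configurations are all assumptions made for this example; it implements the usual assign-then-update loop on one-dimensional data and is not taken from the exam or any particular library.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center (ties go to the first).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the loop has converged
            break
        centers = new_centers
    return sorted(centers)

# Hypothetical toy data with three natural groups, but k = 2.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]

print(kmeans_1d(data, [1, 2]))    # [2.0, 17.0]
print(kmeans_1d(data, [22, 23]))  # [7.0, 22.0]
```

Both runs converge, but to different final centers, which is why option b is the statement that is not true.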
According to Data Science for Business, in terms of nearest neighbor methods, two unique aspects of overall intelligibility are the justification of a specific _____________ and the intelligibility of an entire ______________. Select one: a. variable, model b. decision, model c. variable, dataset
b. decision, model
Name the function in R that combines data elements together into a vector
c()
The process of repeatedly drawing a subset from a population is called ___________, and the end result of doing lots of this is a _____________. Select one: a. "grabbing", population graph b. "extraction", sampling distribution c. "sampling", sampling distribution d. "extraction", scatterplot
c. "sampling", sampling distribution
What is the (minimum) Levenshtein distance between "Godly" and "Goodbye"? Select one: a. 4 b. 12 c. 3 d. 7
c. 3
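The answer can be checked with a short dynamic-programming sketch, shown here in Python. The function name `levenshtein` is an assumption for this example; it counts the minimum number of single-character insertions, deletions, and substitutions, treating upper and lower case as different characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Godly", "Goodbye"))  # 3
```

One edit sequence of length 3: insert "o", substitute "l" with "b", insert "e".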
When measuring distance between elements of a dataset, what are the units used for Euclidean Distance? Select one: a. Inches b. Depends on the input c. None d. Centimeters
c. None
How confident are you with a classification close to the decision boundary? Select one: a. Very confident b. Reasonably confident c. Not very confident
c. Not very confident
In the approach called _______________, the parameters of a model are tuned so that it fits the data as well as possible. Select one: a. classification b. curve plotting c. parametric modeling d. parametric plotting
c. parametric modeling
Data comes from the Latin word "datum", meaning: a. day and time b. the tomb c. a thing taken d. a thing given
d. a thing given
The data mining procedure that attempts to find associations between entities based on transactions involving them: a. Classification b. Similarity Matching c. Clustering d. Co-occurrence grouping e. Profiling
d. Co-occurrence grouping
Name the R command that creates a new function
function()
A main purpose of creating _________ regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.
homogeneous
The situation in which a variable collected in historical data gives information on the target variable that is not actually available when the decision has to be made is called what?
leak
Name the R command that takes two lists and returns values that are in each
match()
A collection that is _________ means that it is homogeneous with respect to the target variable.
pure
Name the function in R that reveals the structure of a data object
str()
Name the R command that counts occurrences of integer-valued data in a vector
tabulate()
Name the R command that creates a list of unique values in a vector
unique()
Data scientists play active roles in the four A's of data: data architecture, data acquisition, data analysis, and data archiving. (T/F)
True
Decomposing a data analytics problem into recognized tasks is a critical skill. (T/F)
True
Exporting a data set in CSV format as opposed to a spreadsheet format can sometimes help to cut down on the work necessary to clean and prepare the data for analysis. (T/F)
True
In R, as long as we give them credit, we can use other people's functions by installing their packages and using the library() function to make the contents of the package available. (T/F)
True
Information gain measures how much an attribute improves/decreases entropy due to new information being added. (T/F)
True
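As a rough illustration of the idea, the sketch below (in Python) computes entropy and information gain for a hypothetical split. The function names `entropy` and `information_gain` and the toy labels are assumptions for this example; information gain is taken as the parent's entropy minus the size-weighted entropy of the children.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical target labels: 5 "yes" and 5 "no" (maximally impure, entropy 1 bit).
parent = ["yes"] * 5 + ["no"] * 5

# An attribute that separates the classes perfectly produces pure children.
pure_split = [["yes"] * 5, ["no"] * 5]

# An attribute that barely separates the classes leaves the children mixed.
mixed_split = [
    ["yes", "yes", "no", "no", "yes"],
    ["no", "no", "yes", "yes", "no"],
]

print(information_gain(parent, pure_split))   # full 1 bit of gain
print(information_gain(parent, mixed_split))  # small gain, close to 0
```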
Jaccard distance treats the two objects as sets of characteristics. (T/F)
True
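A minimal sketch of the set-based view, in Python: the function name `jaccard_distance` and the example sets are assumptions for this illustration. The distance is one minus the ratio of shared characteristics to all characteristics present in either object.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A intersect B| / |A union B| for two sets of characteristics."""
    if not a and not b:
        return 0.0  # convention assumed here: two empty sets count as identical
    return 1 - len(a & b) / len(a | b)

# Hypothetical objects described by the items they contain.
x = {"bread", "milk", "eggs"}
y = {"milk", "eggs", "beer"}

print(jaccard_distance(x, y))  # 0.5 (2 shared items out of 4 distinct items)
```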
Manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid). (T/F)
True
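The grid intuition can be sketched in a few lines of Python. The function name `manhattan` and the sample points are assumptions for this example; the distance is the sum of absolute differences along each dimension.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences (distance along the grid)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points: 3 blocks east-west plus 4 blocks north-south.
print(manhattan((1, 2), (4, 6)))  # 7
```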
R is case-sensitive. (T/F)
True
Walmart data miners found that strawberry Pop-Tarts sell at _______ times their normal rate ahead of a hurricane.
7
Instead of the equal sign, in R, what is the operator that is used to assign a value to a variable?
<-
Looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what?
Overfitting
At a high level, data mining is a set of fundamental principles that guide the extraction of knowledge from data. (T/F)
False
In general, if two classes are linearly separable, there is exactly one linear discriminant. (T/F)
False
Logical vectors in R cannot be a part of arithmetic operations. (T/F)
False
Similarity can be used for classification but not regression. (T/F)
False
The SVM's objective function incorporates the idea that a thinner bar is better. (T/F)
False
The creation of models from data is known as model deduction. (T/F)
False
There is only one manual online for R. (T/F)
False
A linear regression model outputs a class probability estimate. (T/F)
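False (a linear regression model outputs a numeric value; a logistic regression model outputs a class probability estimate)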
The Euclidean distance measure is closely related to the _________ Theorem from Geometry.
Pythagorean
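The connection is that the straight-line distance between two points is the hypotenuse of a right triangle whose legs are the coordinate differences. A minimal Python sketch, with the function name `euclidean` and the sample points assumed for this example:

```python
from math import sqrt

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points whose differences form a 3-4-5 right triangle.
print(euclidean((1, 2), (4, 6)))  # 5.0
```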
Supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the __________ variable.
Target
An instance represents a fact or a data point. (T/F)
True