Exam 1
a classification tree is ________ to a rule set
equivalent
the creation of models from data is known as model deduction (true or false)
false
prediction
an estimate of the target
the most appropriate type of attribute that should be used to encode letter grade is called a. calendrial b. numeric c. ordinal d. gradual e. nominal
c. ordinal
name the function in R that reveals the structure of a data object: a. summary() b. content() c. str() d. print() e. display()
c. str()
using decision trees helps to create _________ _________
decision boundaries
what does the function do? example(function_name)
shows how to use a function
squared error
specifies a loss proportional to the square of the distance from the boundary (cares about distance but has the effect of greatly penalizing predictions that are grossly wrong)
predictive model
a "formula" for estimating the unknown value of interest (aka the target)
laplace correction
a form of smoothing which adds a count to both classes: p(c) = (n+1)/((n+1)+(m+1)) = (n+1)/(n+m+2)
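A minimal sketch of the correction in Python, assuming n is the number of instances of class c at a leaf and m is the number of instances of the other class. The function names are illustrative, not from the course material.

```python
def frequency_estimate(n, m):
    """Raw frequency-based estimate: the fraction of instances in class c."""
    return n / (n + m)


def laplace_estimate(n, m):
    """Laplace-corrected estimate: add one to the count of each class."""
    return (n + 1) / (n + m + 2)


# With only two instances, both in class c, the raw estimate claims certainty,
# while the corrected estimate is pulled toward 0.5.
print(frequency_estimate(2, 0), laplace_estimate(2, 0))

# With more evidence the correction matters much less.
print(frequency_estimate(20, 0), laplace_estimate(20, 0))
```

The smoothing has the largest effect on leaves with few instances, which is where raw frequencies are least trustworthy.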
what is logistic regression
a linear model that corresponds to log odds and computes the distance from the decision boundary
what is the result of supervised data mining
a model that given data predicts some quantity
model
a simplified representation of reality created to serve a purpose
what is the name for a list of vectors where each vector has the exact same number of elements as the others? a. a dataframe b. a mode c. a dataset d. a list
a. a dataframe
what type of data mining task would be used to predict the gas mileage of a car? a. regression b. similarity matching c. profiling d. clustering e. classification
a. regression
when would you NOT want to increase the complexity of your model? a. when your model is overfitting b. when you believe a new attribute could provide a useful signal c. all of these cases (i.e., for all of them you would not want to increase complexity) d. when your model is underfitting e. none of these cases (i.e., for all of them you would want to increase complexity)
a. when your model is overfitting
what is big data?
an all encompassing term for a collection of data so large it is difficult to process using traditional single-machine techniques
what are you doing in supervised data mining
applying a rule to a variable to make a prediction
zero-one loss
asks if a mistake was made (and only cares that a mistake has been made)
which of the below characters starts a commented line in R? a. /* b. # c. // d. / e. % f. --
b. #
which of the following is NOT true about Big Data? a. characteristics of today's Big Data are often described with "V" words b. data science is the study of Big Data and not about Small Data c. Big Data is considered to be "at the foundation of all the megatrends that are happening today" d. Big Data technologies can be used to support data-driven decision making e. Big data is defined as datasets that are too large for traditional data processing systems
b. data science is the study of Big Data and not about Small Data
when measuring distance between elements of a dataset, what are the units used for Euclidean Distance? a. inches b. depends on the input c. none d. centimeters
b. depends on the input
different learning algorithms build different decision boundaries. If we have two attributes, we can visualize the decision boundaries (as we do below for a three class problem). Which of the following are NOT possible sources of the model? a. support vector machine b. none of the other choices are possible models. c. classification tree d. logistic regression e. linear classifier
b. none of the other choices are possible models
logistic regression models produce a ____________ estimate for a _____________ target variable. a. numeric, numeric b. numeric, categorical c. categorical, numeric d. categorical, categorical
b. numeric, categorical
the data mining procedure that attempts to characterize the typical behavior of an individual or population. a. similarity matching b. profiling c. co-occurrence grouping d. classification e. regression
b. profiling
hinge loss
based on the margin - zero for points on the correct side of the boundary and outside the margin; positive for points within the margin or on the wrong side of the boundary - it penalizes points more the farther they are from the correct margin
which of the following is NOT a reason to perform attribute selection? a. better predictions b. better explanations and more tractable models c. faster predictions d. all of these choices are reasons for attribute selection e. reduced computational and/or storage cost
d. all of these choices are reasons for attribute selection
when we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth? a. decrease K b. none of these choices will affect the smoothness of the decision boundary c. use a distance measure other than Euclidean d. increase K
d. increase K
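A small Python sketch of why a larger K smooths the boundary, assuming one-dimensional data and simple majority voting. The training set and the function name knn_predict are made up for illustration.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (1-D data)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: class "a" on the left, class "b" on the right,
# plus one stray "b" at 3.0 sitting inside the "a" region.
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "a"),
         (8.0, "b"), (9.0, "b"), (10.0, "b")]

print(knn_predict(train, 3.1, k=1))  # follows the single nearest point
print(knn_predict(train, 3.1, k=3))  # the stray point is outvoted by its neighbors
```

With K = 1 the stray point creates its own island in the boundary; with K = 3 the island disappears.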
tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. it is easy to understand b. it is computationally inexpensive c. it is easy to implement d. it is definitely the most accurate model one can produce from a particular data set
d. it is definitely the most accurate model one can produce from a particular data set
linear model vs tree induction
depends on the amount of data available - use a tree if there is a lot of data available
4 types of data analytics
descriptive, diagnostic, predictive, prescriptive
loss function
determines how much penalty should be assigned to an instance based on the error in the model's predicted value
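The loss functions in this set can be compared side by side with a short Python sketch. It uses common textbook forms and assumes class labels are encoded as +1 and -1, that score is the signed distance f(x) from the decision boundary, and that the hinge margin is 1. These conventions are assumptions and may differ from the ones used in class.

```python
import math


def zero_one_loss(y, score):
    """1 if the instance is misclassified, 0 otherwise (ignores distance)."""
    return 1 if y * score <= 0 else 0


def hinge_loss(y, score):
    """Zero beyond the margin on the correct side, then grows linearly."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Never exactly zero, but small for confident correct predictions."""
    return math.log(1.0 + math.exp(-y * score))


def absolute_error(predicted, actual):
    """Penalty proportional to the distance between prediction and target."""
    return abs(predicted - actual)


def squared_error(predicted, actual):
    """Penalty proportional to the square of that distance."""
    return (predicted - actual) ** 2


# Classification losses for a positive instance (y = +1) at several scores.
for score in (-2.0, -0.5, 0.5, 2.0):
    print(score, zero_one_loss(1, score), hinge_loss(1, score),
          round(logistic_loss(1, score), 3))

# Regression losses: squared error punishes the large miss far more heavily.
print(absolute_error(3.0, 1.0), squared_error(3.0, 1.0))
```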
what does the function do? help(function_name)
documents what a function does
logistic loss
does not assign zero penalty to any points but gives less penalty to points correctly classified with high confidence
a loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance (true or false)
true
a main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into (true or false)
true
all data mining procedures have the tendency to overfit to some extent (true or false)
true
an instance represents a fact or a data point (true or false)
true
cross-validation is a best-practice method to estimate generalization performance (true or false)
true
data science is the discipline of making data useful (true or false)
true
decomposing a data analytics problem into recognized tasks is a critical skill (true or false)
true
information gain measures how much an attribute improves/decreases entropy due to new information being added (true or false)
true
linear functions can represent nonlinear models, if we include more complex features (true or false)
true
manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid) (true or false)
true
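A quick Python comparison of Manhattan and Euclidean distance on the same pair of points. The function names and the sample points are chosen for illustration.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute differences along each dimension (grid travel)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


start, end = (0, 0), (3, 4)
print(manhattan_distance(start, end))  # 3 blocks across plus 4 blocks up
print(euclidean_distance(start, end))  # the diagonal of a 3-by-4 rectangle
```

Both distances are expressed in whatever units the input attributes use.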
test data should be strictly independent of model building so that we can get a good estimate of model accuracy (true or false)
true
unsupervised methods of data mining
used if my results fall into different groups and there is no objective target stated
given that the equation of a line in two dimensions is y = mx + b, match the following m is the _____ b is the ______
slope, y-intercept
absolute error
specifies a loss proportional to the absolute distance from the boundary (cares how far away from correct the point is)
linear discriminant
the rule of what is above and below the boundary and results in a parameterized model
linear classifier difference from decision trees
this method creates mathematical functions out of the attributes
confusion matrix
used to evaluate classifiers such as decision trees by tabulating predicted classes against actual classes
what is one way to do multivariate supervised segmentation?
using tree structured models
factors
vectors whose elements can draw from a specific (distinct) set of values (internal numbering is done alphabetically by default)
what are the 4 V's of big data?
volume, variety, velocity, veracity
log(P₊(x)/(1-P₊(x))) = f(x)
w₀ + w₁x₁ + w₂x₂ + ...
P₊(x)
1/(1+e^(-f(x)))
general linear model
f(x) = w₀ + w₁x₁ + ... + wₙxₙ
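The relationship between the linear function f(x), the log-odds, and the class probability can be checked with a short Python sketch. The weights and function names below are hypothetical, chosen only for illustration.

```python
import math


def linear_score(x, w0, w):
    """f(x) = w0 + w1*x1 + ... + wn*xn."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def p_plus(score):
    """Logistic function: turns f(x) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def log_odds(p):
    """log(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical weights and instance.
score = linear_score([1.0, 2.0], w0=0.5, w=[1.0, -0.5])
probability = p_plus(score)
print(score, probability)

# Taking the log-odds of the probability recovers f(x) (up to rounding).
print(log_odds(probability))
```

A score of 0 lies exactly on the decision boundary and maps to a probability of 0.5.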
tree-structured models
- no two parents share a descendant - the branches always point downwards - every example always ends up at a leaf node with some specific class
data mining
- provides the analytical modeling toolkits - an approach for extracting knowledge from data
how do you create a classification tree?
- the divide and conquer approach - take each subset and recursively apply attribute selection to find the best attribute to partition on - stop when the nodes are pure, there are no more variables, or earlier to avoid overfitting
supervised methods of data mining
- used if there is a quantifiable target we want to predict - we need data from a highly related phenomenon (doesn't have to be exact)
entropy equations for 2 classes
-(P₁log₂(P₁) + P₂log₂(P₂)); the number of classes determines the base of the log (in this example, there are 2 classes, so the log base is 2)
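A minimal Python sketch of the two-class entropy formula, using log base 2 and treating 0 * log(0) as 0 (a standard convention, assumed here). The function name is illustrative.

```python
import math


def entropy(proportions):
    """Entropy in bits of a set, given the proportion of each class."""
    # Classes with proportion 0 contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in proportions if p > 0)


print(entropy([0.5, 0.5]))  # perfectly mixed set: maximum disorder
print(entropy([0.9, 0.1]))  # mostly one class: low disorder
print(entropy([1.0, 0.0]))  # pure set: no disorder
```

A pure set has entropy 0 regardless of which class it contains.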
what is the range of logistic regression
0 to 1
what is supervised segmentation
segmenting the population with respect to something that we would like to predict or estimate (ie which customers are/aren't likely to buy something)
what is multivariate supervised segmentation
selecting multiple attributes, each giving some information gain, and putting them together
which loss function specifies a loss proportional to the distance from the boundary? a. logistic loss b. squared error c. absolute error d. zero-one loss e. hinge loss
c. absolute error
looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what? a. the correct answer is none of these. b. specialization c. overfitting d. underfitting e. generalization
c. overfitting
ways to conduct supervised segmentation
classification trees, logistic regression, other
when using the tree/rule task, we use a _______ model when our target is _______
classification, categorical
a learning curve is defined as: a. none of these reflect the definition of a learning curve b. a plot of the performance of the classifier on the y axis versus the complexity of the model on the x axis c. a plot of the performance of the classifier on the y axis on a growing test set as the size of the training data grows on the x axis d. a plot of the performance of the classifier on the y axis on a growing test set on the x axis with a fixed size training set e. a plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis
e. a plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis
the data mining procedure that produces a model which determines the category to which an individual sample belongs. a. link prediction b. causal Modeling c. clustering d. data Reduction e. classification
e. classification
which of the following is NOT necessarily part of the data mining process presented in class? a. data preparation b. evaluation c. deployment d. modeling e. interviewing potential customers f. business understanding g. data understanding feedback
e. interviewing potential customers
When we analyze a set of data with a defined target in mind, what kind of model are we building? a. data mining b. analytical c. machine learning d. unsupervised e. supervised
e. supervised
supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable. a. feature b. vivacious c. attribute d. independent e. target
e. target
if you are told that a set with binary elements has an entropy of 0, what do you know? a. that the set is perfectly mixed b. none of the others are correct c. that the set is all TRUE d. you don't know anything about the set e. that the set is all FALSE
b. none of the others are correct (an entropy of 0 means the set is pure, but it could be either all TRUE or all FALSE)
how does data leakage negatively impact model building? a. the model will not see enough information b. the model will return an error c. leakage means missing data and missing data is problematic d. data leakage positively affects model building e. the model will see information which is not available when making decisions
e. the model will see information which is not available when making decisions
why use trees?
easy to understand and use, computationally cheap, and usually built into data mining tools
information gain equation
entropy(parent) - [p(left child) x entropy(left child) + p(right child) x entropy(right child)]
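The equation can be worked through in Python as a sketch. It assumes class labels are given as plain lists and that each child is weighted by its share of the parent's instances. The function names and sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child)
                   for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly removes all of the disorder.
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))

# A split whose children are as mixed as the parent gains nothing.
mixed = ["yes", "yes", "no", "no"]
print(information_gain(mixed, [["yes", "no"], ["yes", "no"]]))
```

The children list can hold more than two segments, so the same function covers splits that are not binary.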
a collection that is impure means that it is homogeneous with respect to the target variable (true or false)
false
by definition, node construction in decision trees always results in binary trees (true or false)
false
in general, if two classes are linearly separable, there is exactly one linear discriminant (true or false)
false
in most binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, the more boundaries and islands will appear (true or false)
false
logical vectors in R cannot be a part of arithmetic operations (true or false)
false
logistic regression is misnamed because it does not use a log function (true or false)
false
overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points (true or false)
false
when is squared error loss usually used
for regression
selecting ______ attributes is an important part of supervised segmentation
informative
why is logistic regression misnamed
it is a class probability estimation model (it estimates a numeric probability of membership in a categorical class) and not a regression model
I am _______________ confident with a classification close to the decision boundary than one far from the boundary. a. less b. more c. the same
a. less
support vector machines
linear discriminants, effective, use "hinge loss"
what does entropy measure
measures the general disorder of a set
what is information gain
measures the change in entropy due to any amount of new information being added by the split
frequency-based estimate
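nodes represent probabilities: the class probability at a leaf is estimated as the fraction of instances in that leaf that belong to the class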
classification trees are very prone to ________
overfitting
probability estimation tree
predicting probability based on classification trees
data warehouses
provide access to historical data
in a numeric function, we use a _______ model when our target is _______
regression, numeric
instance
represents a fact or a data point, is described by a set of attributes that can be represented in a vector
what does the function do? apropos("function_name")
searches database for function by name or a partial name
what does the function do? help.search("descriptive_word")
searches database for function when you do not know the function name
