Exam 1

a classification tree is ________ to a rule set

equivalent

the creation of models from data is known as model deduction (true or false)

false

prediction

an estimate of the target

the most appropriate type of attribute that should be used to encode letter grade is called a. calendrial b. numeric c. ordinal d. gradual e. nominal

c. ordinal

name the function in R that reveals the structure of a data object: a. summary() b. content() c. str() d. print() e. display()

c. str()

using decision trees helps to create _________ _________

decision boundaries

what does the function do? example(function_name)

shows how to use a function

squared error

specifies a loss proportional to the square of the distance from the boundary (cares about distance but has the effect of greatly penalizing predictions that are grossly wrong)

predictive model

a "formula" for estimating the unknown value of interest (aka the target)

laplace correction

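a form of smoothing that adds one to the count of each class: p(c) = (n + 1) / ((n + 1) + (m + 1)) = (n + 1) / (n + m + 2), where n is the number of instances of class c at the leaf and m is the number of instances of the other class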
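
A minimal Python sketch of the Laplace correction for a two-class leaf, next to the raw frequency estimate n / (n + m). The function names are illustrative, not taken from the course material; n counts the instances of the class of interest at the leaf and m counts the rest.

```python
def frequency_estimate(n: int, m: int) -> float:
    """Raw frequency-based estimate: fraction of instances in class c."""
    return n / (n + m)


def laplace_estimate(n: int, m: int) -> float:
    """Laplace-corrected estimate: add one to the count of each class."""
    return (n + 1) / (n + m + 2)


# A leaf with only 2 instances, both of class c:
print(frequency_estimate(2, 0))  # 1.0
print(laplace_estimate(2, 0))    # 0.75

# With 20 instances of class c the correction matters much less:
print(laplace_estimate(20, 0))   # 21/22, about 0.95
```

The correction pulls estimates from small leaves toward 0.5, and its effect fades as the leaf holds more instances.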

what is logistic regression

a linear model whose output f(x) is the log-odds of class membership; the magnitude of f(x) grows with the distance from the decision boundary

what is the result of supervised data mining

a model that given data predicts some quantity

model

a simplified representation of reality created to serve a purpose

what is the name for a list of vectors where each vector has the exact same number of elements as the others? a. a dataframe b. a mode c. a dataset d. a list

a. a dataframe

what type of data mining task would be used to predict the gas mileage of a car? a. regression b. similarity matching c. profiling d. clustering e. classification

a. regression

when would you NOT want to increase the complexity of your model? a. when your model is overfitting b. when you believe a new attribute could provide a useful signal c. all of these cases (i.e., for all of them you would not want to increase complexity) d. when your model is underfitting e. none of these cases (i.e., for all of them you would want to increase complexity)

a. when your model is overfitting

what is big data?

an all-encompassing term for a collection of data so large that it is difficult to process using traditional single-machine techniques

what are you doing in supervised data mining

applying a rule to a variable to make a prediction

zero-one loss

asks if a mistake was made (and only cares that a mistake has been made)

which of the below characters starts a commented line in R? a. /* b. # c. // d. / e. % f. --

b. #

which of the following is NOT true about Big Data? a. characteristics of today's Big Data are often described with "V" words b. data science is the study of Big Data and not about Small Data c. Big Data is considered to be "at the foundation of all the megatrends that are happening today" d. Big Data technologies can be used to support data-driven decision making e. Big data is defined as datasets that are too large for traditional data processing systems

b. data science is the study of Big Data and not about Small Data

when measuring distance between elements of a dataset, what are the units used for Euclidean Distance? a. inches b. depends on the input c. none d. centimeters

b. depends on the input

different learning algorithms build different decision boundaries. If we have two attributes, we can visualize the decision boundaries (as we do below for a three class problem). Which of the following are NOT possible sources of the model? a. support vector machine b. none of the other choices are possible models. c. classification tree d. logistic regression e. linear classifier

b. none of the other choices are possible models

logistic regression models produce a ____________ estimate for a _____________ target variable. a. numeric, numeric b. numeric, categorical c. categorical, numeric d. categorical, categorical

b. numeric, categorical

the data mining procedure that attempts to characterize the typical behavior of an individual or population. a. similarity matching b. profiling c. co-occurrence grouping d. classification e. regression

b. profiling

hinge loss

based on the margin: zero penalty for points on the correct side of the boundary and outside the margin; a positive penalty for points inside the margin or on the wrong side of the boundary, increasing linearly the farther they are from the correct margin

which of the following is NOT a reason to perform attribute selection? a. better predictions b. better explanations and more tractable models c. faster predictions d. all of these choices are reasons for attribute selection e. reduced computational and/or storage cost

d. all of these choices are reasons for attribute selection

when we use the K-nearest neighbors method for predictive modeling, how can we make the decision boundary estimate more smooth? a. decrease K b. none of these choices will affect the smoothness of the decision boundary c. use a distance measure other than Euclidean d. increase K

d. increase K
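
A minimal sketch of why increasing K smooths the boundary, assuming a plain majority-vote k-nearest-neighbors classifier with Euclidean distance. The data set and the function name `knn_predict` are made up for illustration: one stray "B" point sits inside a cluster of "A" points.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 1.0), "A"),
    ((0.5, 0.5), "B"),  # a stray "B" inside the "A" cluster
    ((5.0, 5.0), "B"), ((6.0, 5.0), "B"), ((5.0, 6.0), "B"),
]

query = (0.6, 0.6)
print(knn_predict(train, query, k=1))  # B: the stray point creates an island
print(knn_predict(train, query, k=5))  # A: the island is voted away
```

With k=1 every training point claims its own region, so a single outlier carves out an island; with a larger k the outlier is outvoted by its neighbors.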

tree induction has been a very popular data mining procedure for all reasons EXCEPT: a. it is easy to understand b. it is computationally inexpensive c. it is easy to implement d. it is definitely the most accurate model one can produce from a particular data set

d. it is definitely the most accurate model one can produce from a particular data set

linear model vs tree induction

depends on the amount of data available - use a tree if there is a lot of data available

4 types of data analytics

descriptive, diagnostic, predictive, prescriptive

loss function

determines how much penalty should be assigned to an instance based on the error in the model's predicted value

what does the function do? help(function_name)

documents what a function does

logistic loss

does not assign zero penalty to any points but gives less penalty to points correctly classified with high confidence
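
The loss functions in this set can be compared side by side in a short Python sketch. It assumes the common convention that the true class y is +1 or -1 and that `score` is the model's signed output f(x), so y * score is positive for a correct classification; the function names are illustrative.

```python
import math


def zero_one_loss(y, score):
    """1 if the instance is misclassified, 0 otherwise."""
    return 0.0 if y * score > 0 else 1.0


def hinge_loss(y, score):
    """0 beyond the margin on the correct side, then linear in the distance."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Never exactly 0, but small for confident correct classifications."""
    return math.log(1.0 + math.exp(-y * score))


def squared_error(target, prediction):
    """Penalty grows with the square of the error (typical for regression)."""
    return (target - prediction) ** 2


def absolute_error(target, prediction):
    """Penalty grows in proportion to the size of the error."""
    return abs(target - prediction)


for score in (2.0, 0.5, -1.0):
    print(score, zero_one_loss(1, score), hinge_loss(1, score),
          round(logistic_loss(1, score), 3))
```

For a positive instance, a score of 2.0 costs nothing under zero-one and hinge loss, a score of 0.5 is correct but still inside the margin (hinge loss 0.5), and a score of -1.0 is a mistake under all three.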

a loss function is used to determine how much penalty should be assigned to an instance based on the error in the model's predicted value for that instance (true or false)

true

a main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into (true or false)

true

all data mining procedures have the tendency to overfit to some extent (true or false)

true

an instance represents a fact or a data point (true or false)

true

cross-validation is a best-practice method to estimate generalization performance (true or false)

true
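
A minimal sketch of how k-fold cross-validation partitions the data, using only instance indices. The helper `k_fold_indices` is hypothetical; it deals instances into folds round-robin, whereas real tools usually shuffle first.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10, 5):
    # Build the model on train, evaluate it on test, then average the scores.
    print("train:", train, "test:", test)
```

Every instance is held out exactly once, so each performance estimate comes from data the model did not see during building.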

data science is the discipline of making data useful (true or false)

true

decomposing a data analytics problem into recognized tasks is a critical skill (true or false)

true

information gain measures how much an attribute improves/decreases entropy due to new information being added (true or false)

true

linear functions can represent nonlinear models, if we include more complex features (true or false)

true

manhattan distance is called this because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points (if the data were plotted on this grid) (true or false)

true
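
A short Python sketch contrasting Manhattan and Euclidean distance on the same pair of points. The function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Grid ("street") distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(manhattan_distance((0, 0), (3, 4)))  # 7
```

Both results are in whatever units the input attributes use, which is why the units of a distance depend on the input.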

test data should be strictly independent of model building so that we can get a good estimate of model accuracy (true or false)

true

unsupervised methods of data mining

used when there is no specific target to predict, for example when looking for natural groupings in the data

given that the equation of a line in two dimensions is y = mx + b, match the following m is the _____ b is the ______

slope, y-intercept

absolute error

specifies a loss proportional to the absolute distance from the boundary (cares how far away from correct the point is)

linear discriminant

a linear function of the attributes that classifies an instance by which side of the decision boundary it falls on (above or below); fitting it results in a parameterized model

linear classifier difference from decision trees

this method creates mathematical functions out of the attributes

confusion matrix

used to evaluate decision trees

what is one way to do multivariate supervised segmentation?

using tree structured models

factors

vectors whose elements can draw from a specific (distinct) set of values (internal numbering is done alphabetically by default)

what are the 4 V's of big data?

volume, variety, velocity, veracity

log(P₊(x) / (1 - P₊(x))) = f(x)

w₀ + w₁x₁ + w₂x₂ + ...

P₊(x)

1 / (1 + e^(-f(x)))

general linear model

f(x) = w₀ + w₁x₁ + ... + wₙxₙ
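
The three formulas above fit together as follows: the linear function f(x) gives the log-odds, and the logistic function turns f(x) into the probability P₊(x). A minimal Python sketch, with made-up weights and illustrative function names:

```python
import math


def linear_score(w0, weights, x):
    """f(x) = w0 + w1*x1 + ... + wn*xn"""
    return w0 + sum(w * xi for w, xi in zip(weights, x))


def logistic(f):
    """P+(x) = 1 / (1 + e^(-f(x))), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-f))


def log_odds(p):
    """log(p / (1 - p)), the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


f = linear_score(-1.0, [2.0, 0.5], [1.0, 2.0])  # -1 + 2*1 + 0.5*2 = 2.0
p = logistic(f)                                  # about 0.88
print(f, p, log_odds(p))                         # log_odds(p) recovers f
```

A score of 0 lies exactly on the decision boundary and maps to a probability of 0.5; larger positive scores map to probabilities closer to 1.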

tree-structured models

- no two parents share a descendant - the branches always point downward - every example ends up at a leaf node with a specific class

data mining

- provides the analytical modeling toolkits - an approach for extracting knowledge from data

how do you create a classification tree?

- the divide and conquer approach - take each subset and recursively apply attribute selection to find the best attribute to partition to - stop when the nodes are pure, there are no more variables, or earlier to avoid overfitting

supervised methods of data mining

- used if there is a quantifiable target we want to predict - we need data from a highly related phenomenon (doesn't have to be exact)

entropy equations for 2 classes

-(P₁log₂(P₁) + P₂log₂(P₂)); the log is conventionally base 2, so entropy is measured in bits and, for 2 classes, ranges from 0 (pure set) to 1 (evenly mixed set)
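
A minimal Python sketch of the entropy formula using log base 2. The function name and the convention of skipping zero proportions (treating 0 · log₂(0) as 0) are assumptions of this sketch.

```python
import math


def entropy(proportions):
    """Entropy in bits of a set with the given class proportions."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


print(entropy([0.5, 0.5]))  # 1.0  (evenly mixed: maximum disorder)
print(entropy([1.0, 0.0]))  # 0.0  (pure set: no disorder)
print(entropy([0.7, 0.3]))  # about 0.881
```

An entropy of 0 only says the set is pure; it does not say which of the two classes the set contains.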

what is the range of logistic regression

0 to 1

what is supervised segmentation

segmenting the population with respect to something that we would like to predict or estimate (e.g., which customers are or are not likely to buy something)

what is multivariate supervised segmentation

selecting multiple attributes, each giving some information gain, and putting them together

which loss function specifies a loss proportional to the distance from the boundary? a. logistic loss b. squared error c. absolute error d. zero-one loss e. hinge loss

c. absolute error

looking too hard at a set of data might result in finding something that does not generalize to unseen data. This is called what? a. the correct answer is none of these. b. specialization c. overfitting d. underfitting e. generalization

c. overfitting

ways to conduct supervised segmentation

classification trees, logistic regression, other

when using the tree/rule task, we use a _______ model when our target is _______

classification, categorical

a learning curve is defined as: a. none of these reflect the definition of a learning curve b. a plot of the performance of the classifier on the y axis versus the complexity of the model on the x axis c. a plot of the performance of the classifier on the y axis on a growing test set as the size of the training data grows on the x axis d. a plot of the performance of the classifier on the y axis on a growing test set on the x axis with a fixed size training set e. a plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis

e. a plot of the performance of the classifier on the y axis on a fixed test set as the size of the training data grows on the x axis

the data mining procedure that produces a model which determines the category to which an individual sample belongs. a. link prediction b. causal Modeling c. clustering d. data Reduction e. classification

e. classification

which of the following is NOT necessarily part of the data mining process presented in class? a. data preparation b. evaluation c. deployment d. modeling e. interviewing potential customers f. business understanding g. data understanding feedback

e. interviewing potential customers

When we analyze a set of data with a defined target in mind, what kind of model are we building? a. data mining b. analytical c. machine learning d. unsupervised e. supervised

e. supervised

supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the _______ variable. a. feature b. vivacious c. attribute d. independent e. target

e. target

if you are told that a set with binary elements has an entropy of 0, what do you know? a. that the set is perfectly mixed b. none of the others are correct c. that the set is all TRUE d. you don't know anything about the set e. that the set is all FALSE

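b. none of the others are correct (an entropy of 0 means the set is pure, so it is either all TRUE or all FALSE, but the entropy alone does not say which)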

how does data leakage negatively impact model building? a. the model will not see enough information b. the model will return an error c. leakage means missing data and missing data is problematic d. data leakage positively affects model building e. the model will see information which is not available when making decisions

e. the model will see information which is not available when making decisions

why use trees?

easy to understand and use, computationally cheap, and usually built into data mining tools

information gain equation

entropy(parent) - [p(left child) x entropy(left child) + p(right child) x entropy(right child)]
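
A minimal Python sketch of the information gain equation, working from lists of class labels. The label values and function names are made up for illustration; p(child) is taken to be the fraction of the parent's instances that fall into that child.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly gains the full 1 bit.
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))  # 1.0

# A split that leaves each child 80/20 mixed gains less (about 0.278).
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
print(information_gain(parent, [left, right]))
```

Because the children are passed as a list, the same function also covers splits with more than two branches.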

a collection that is impure means that it is homogeneous with respect to the target variable (true or false)

false

by definition, node construction in decision trees always results in binary trees (true or false)

false

in general, if two classes are linearly separable, there is exactly one linear discriminant (true or false)

false

in most binary classification models, the decision boundary is a single, contiguous dividing line or curve between two classes. However, in k-nearest neighbor, as k gets larger, the more boundaries and islands will appear (true or false)

false

logical vectors in R cannot be a part of arithmetic operations (true or false)

false

logistic regression is misnamed because it does not use a log function (true or false)

false

overfitting is the tendency of data mining procedures to create models that generalize to previously unseen data points (true or false)

false

when is squared error loss usually used

for regression

selecting ______ attributes is an important part of supervised segmentation

informative

why is logistic regression misnamed

it is a class probability estimation model, not a regression model: its target is a categorical class, and the numeric quantity it estimates is the probability of membership in that class

I am _______________ confident with a classification close to the decision boundary than one far from the boundary. a. less b. more c. the same

less

support vector machines

linear discriminants, effective, use "hinge loss"

what does entropy measure

measures the general disorder of a set

what is information gain

measures the change in entropy due to any amount of new information being added by the split

frequency-based estimate

the class probability at a leaf node is estimated as the fraction of training instances at that leaf that belong to the class

classification trees are very prone to ________

overfitting

probability estimation tree

predicting probability based on classification trees

data warehouses

provide access to historical data

in a numeric function, we use a _______ model when our target is _______

regression, numeric

instance

represents a fact or a data point, is described by a set of attributes that can be represented in a vector

what does the function do? apropos("function_name")

searches database for function by name or a partial name

what does the function do? help.search("descriptive_word")

searches database for function when you do not know the function's name

