Systems Midterm
data mining
Business understanding, Data Understanding, Deployment, Data Preparation, Modeling and Evaluation are all important components of the ______________ process
true
The more complex a model is, the more likely it is to overfit T or F
As a model gets more complex it is allowed to pick up harmful spurious correlations. These correlations are idiosyncrasies of the specific training set used and do not represent characteristics of the population in general
Why is overfitting bad?
Big Data 2.0
____________ era: firms take advantage of the interactive nature of the Web. The changes brought on by this shift in thinking are pervasive; the most obvious are the incorporation of social networking components, and the rise of the "voice" of the individual consumer
Big Data 1.0
_____________ era: firms are busying themselves with building the capabilities to process large data, largely in support of their current operations
Linear Classifier
a (classification tree/linear classifier) places a single decision surface through the entire space. It has great freedom in the orientation of the surface, but it is limited to a single division into two segments
classification tree
a (classification tree/linear classifier) uses decision boundaries that are perpendicular to the instance space axes
complexity
a fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model ______
false
complex models always generalize better to the target population than simple models do T or F
True
in a classification tree, decision nodes can only ask questions about the attributes of the examples we want to classify T or F
model
in data science, a predictive _______ is a formula for estimating the unknown value of interest: the target
objective function
represents the overall goal in data mining when choosing a model's parameters: it can be calculated for a particular set of weights and a particular set of data, and the optimal weights are found by maximizing or minimizing it
data science
the ultimate goal of ______ is improving decision making for a business
Data Mining
used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value
logistic regression
- for probability estimation, it uses the same linear model as do our linear discriminants for classification and linear regression for estimating numeric target values - output is interpreted as log-odds of class membership, which can be translated directly into probability of class membership
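A minimal sketch of the log-odds idea on this card, assuming a two-attribute model. The weights, bias, and attribute values are made up for illustration, and the helper names are hypothetical. The linear function produces a log-odds score, and the logistic function translates that score into a probability.

```python
import math


def linear_score(weights, bias, attributes):
    """Linear model f(x) = bias + w1*x1 + w2*x2 + ...; read here as log-odds."""
    return bias + sum(w * x for w, x in zip(weights, attributes))


def log_odds_to_probability(log_odds):
    """Translate log-odds into a probability: p = 1 / (1 + e^(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical weights, bias, and example; not taken from any real data set.
weights = [0.8, -0.5]
bias = -0.2
score = linear_score(weights, bias, [1.0, 2.0])
probability = log_odds_to_probability(score)
print(score, probability)
```

A log-odds of 0 corresponds to a probability of 0.5; negative log-odds give probabilities below 0.5 and positive log-odds give probabilities above it.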
Big data
- essentially refers to datasets that are too large for traditional data processing systems - used in data engineering - used for implementing data mining techniques - best known for being used for data processing in support of data mining techniques and other data science activities
true
We can reduce or avoid overfitting in a tree induction by setting a minimum number of training examples that must exist at each leaf node T or F
supervised
When there is a specific target or purpose for a particular data set, the data mining problem is referred to as ________ "Can we find groups of customers who have particularly high likelihoods of cancelling their service soon after their contracts expire?"
data mining
______ differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying
linear classifier
a (classification tree/linear classifier) can use decision boundaries of any direction or orientation
classification tree
a (classification tree/linear classifier) is a "piecewise" classifier that segments the instance space recursively when it has to, using a divide-and-conquer approach. It can cut up the instance space arbitrarily finely into very small regions
false
a fitting curve for a model plots the true positive rates vs the false positive rates T or F
Support Vector Machine (SVM)
a linear discriminant whose objective function incorporates the idea that a wider bar between the different classes of data is better (maximize the margin)
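A small sketch of the "wider bar is better" idea, assuming four made-up 2-D points and two hand-picked candidate boundaries. Both candidates separate the classes perfectly, but they leave different amounts of room around the boundary. This only compares two candidates; it is not an SVM solver, and neither line is claimed to be the maximum-margin one.

```python
import math


def distance_to_boundary(weights, bias, point):
    """Distance from a point to the line w.x + b = 0."""
    score = bias + sum(w * x for w, x in zip(weights, point))
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(score) / norm


def margin(weights, bias, points):
    """Distance from the boundary to the closest training point."""
    return min(distance_to_boundary(weights, bias, p) for p in points)


# Hypothetical data: the first two points are one class, the last two the other.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 3.0)]

margin_a = margin([1.0, 0.0], -3.0, points)  # candidate A: the line x1 = 3
margin_b = margin([1.0, 0.0], -2.5, points)  # candidate B: the line x1 = 2.5
print(margin_a, margin_b)
```

Under the margin criterion, candidate A is preferred over candidate B because its closest point is farther from the boundary.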
cross validation
a more sophisticated holdout training and testing procedure specifies a systematic way of splitting up a single data set such that it generates multiple performance measures. These values tell the data scientist what average behavior the model yields as well as the variation to expect
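A rough sketch of the splitting scheme this card describes, assuming 10 examples and 5 folds. The function names are hypothetical, the fold accuracies are invented numbers used only to show the mean and spread calculation, and shuffling is left out to keep the sketch short.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) so each example is tested exactly once."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def mean_and_std(scores):
    """Average behavior and the variation to expect across folds."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5


splits = list(k_fold_splits(10, 5))

# Invented per-fold accuracies; a real run would train and test a model per fold.
fold_accuracies = [0.80, 0.90, 0.85, 0.75, 0.95]
average, spread = mean_and_std(fold_accuracies)
print(average, spread)
```

Each fold serves as the test set once while the remaining folds form the training set, which is what yields multiple performance measures from a single data set.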
data mining
a successful _____________ project involves an intelligent compromise between what the data can do (ie what they can predict and how well) and the project goals. For this reason it is important to keep in mind how the results will be used, and use this to inform the process itself.
False
according to an assigned reading, the ad vetting model used by AdBlock Plus is endorsed broadly by industry marketing groups, including the IAB T or F
segment
an intuitive way of thinking about extracting patterns from data in a supervised manner is to try to _______ the population into subgroups that have different values for the target variable
Data science
at a high level, _________ is a set of fundamental principles that guide the extraction of knowledge from data
Data mining
at a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. ___________ is the extraction of knowledge from data, via technologies that incorporate these principles
true
cross-validation is used to estimate generalization performance T or F
classification
data mining procedure attempts to predict, for each individual in a population, which of a set of classes this individual belongs to; usually the classes are mutually exclusive "among all the customers of MegaTelCo, which are likely to respond to a given offer?"
profiling
data mining procedure attempts to characterize the typical behavior of an individual, group, or population "What is the typical cell phone usage of this customer segment?"
regression
data mining procedure attempts to estimate or predict, for each individual, the numerical value of some variable for that individual "how much will a given customer use the service?"
co-occurrence grouping
data mining procedure attempts to find associations between entities based on transactions involving them "what items are commonly purchased together?"
Clustering
data mining procedure attempts to group individuals in a population together by their similarity, but not driven by a specific purpose "do our customers form natural groups or segments?"
causal modeling
data mining procedure attempts to help us understand what events or actions actually influence others
similarity matching
data mining procedure attempts to identify similar individuals based on data known about them
link prediction
data mining procedure attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link "Since you and Karen share 10 friends, maybe you would like to be Karen's friend?"
data reduction
data mining procedure attempts to take a large set of data and replace it with a smaller set of data that contains much of the important info of the larger set
model
generally speaking, a(n) __________ is a simplified representation of reality created to serve a purpose - it is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on constraints on information or tractability
tree induction
incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes
- stop growing the tree before it gets too complex - grow the tree until it is too large, then prune it back reducing its size
list the two techniques used to avoid overfitting in tree induction
logistic regression
often thought of as simply a model for the probability of class membership; it is used widely to estimate quantities like the probability of default on credit, the probability of response to an offer, the probability of fraud on an account, the probability that a document is relevant to a topic
true
one way to avoid overfitting when using tree induction is to reduce the size of the tree by cutting off branches and replacing them with leaves T or F
Data-driven decision making
refers to the practice of basing decisions on the analysis of data, rather than purely on intuition ex - a marketer could select advertisements based on her long experience in the field and her eye for what will work, or she could base her selection on analysis of data regarding how consumers react to differing ads
linear discriminant
separates instances into different classes; the function that defines the decision boundary is a linear combination (weighted sum) of the attributes
logistic regression
uses an ordinary linear function, except that its output is interpreted as the log-odds of the event of interest
false
studies indicate that companies using a DDD approach achieve greater productivity, although the significance of such productivity gains is marginal at best T or F
target
supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the __________ variable
uncertainty
the better the information, the more ____________ is reduced
induction
the creation of models from data is known as model ___________. This is a term from philosophy that refers to generalizing from specific cases to general rules. The input data for these algorithms are called the training data
categorical or numeric
the distinction between classification and regression is whether the value for the target variable is _____________
Generalization
the property of a model or modeling process whereby the model applies to data that were not used to build the model; that is, the ability of a model built from a sample to apply to the overall population that the sample represents
leaf
the simplest method to limit tree size is to specify a minimum number of instances that must be present in a _______
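A minimal sketch of the minimum-instances rule, assuming a single numeric attribute and a hypothetical helper named try_split. A candidate split is rejected when either resulting leaf would hold fewer than the required number of training instances, which stops the tree from carving out tiny regions.

```python
def try_split(values, threshold, min_leaf_size):
    """Split values at threshold; return None if a leaf would be too small."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    if len(left) < min_leaf_size or len(right) < min_leaf_size:
        return None  # stop growing here: this node stays a leaf
    return left, right


# Hypothetical attribute values for six training instances.
values = [1, 2, 3, 4, 5, 6]
balanced = try_split(values, threshold=3, min_leaf_size=2)
too_small = try_split(values, threshold=1, min_leaf_size=2)
print(balanced, too_small)
```

The second call returns None because the left side would contain a single instance, fewer than the minimum of two.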
overfitting
the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points
overfitting
there is a fundamental trade-off between model complexity and _____
data mining
to a business manager, the ______ process is useful as a framework for analyzing a data mining project or proposal
true
we can build unsupervised data mining models when we lack labels for the target variable in the training data T or F
unsupervised
when there is no specific target or a specific purpose for a data set, the data mining problem is referred to as _____________ "Do our customers naturally fall into specific groups?"
true
you can make your macro run faster by telling the macro not to update your screen display T or F
false
SVM attempts to minimize the margin between two classes T or F
true
SVM is a useful approach when data is not linearly separable T or F
true
SVMs are based on supervised learning T or F
true
There are many different "perfect" ways a set of training data can be separated using a linear discriminant T or F
segmentation
If the _________ is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable "Middle-aged professionals who reside in NYC -----> on average have a churn rate of 5%"
true
In VBA, objects can inherit properties and methods from a container (parent) object T or F
true
In an assigned reading, the concept of an item being a toy likely means that it is not useful for serious business productivity T or F
prediction
In data science, ______ more generally means to estimate an unknown value. This value could be something in the future, but it could also be something in the present or in the past. These models are often built and tested using events from the past
unsupervised
Is co-occurrence grouping a supervised or unsupervised task
unsupervised
Is profiling a supervised or unsupervised task
supervised
Is regression a supervised or unsupervised task
segmentation
Our goal with supervised ________ is to separate the data into regions with different values of the target variable
false
Data science techniques are typically used to help with three primary types of decisions T or F
Sub FixThis()
A VBA macro called "FixThis", whether recorded or created from scratch, will begin in which of the following ways? a) Sub FixThis() b) Call Macro(FixThis) c) Macro FixThis() d) Start(FixThis)
false
A logistic regression represents the odds of class membership as a linear function of the attributes T or F
true
A regression model is a typical example of a supervised data mining application T or F
true
According to an assigned reading, the FANG group of companies has emerged as the de facto head of the current network hierarchy T or F
false
Classification trees attempt to create the widest margin between members of the various classes T or F
false
Decision nodes are used in linear regression T or F
False
Estimating the probability of a fraudulent transaction is an example of data mining T or F
true
Excel's INDEX function essentially lets you reference data in a table by location T or F
false
Excel's MATCH function is similar to the VLOOKUP function in that it returns the value contained in the referenced worksheet cell T or F
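A rough Python analogy for the INDEX and MATCH cards above, not Excel's own implementation. The function names and the small table are hypothetical, and only the exact-match case (MATCH with match_type 0) is mimicked. The point is that MATCH returns a position, while INDEX returns the value stored at a position.

```python
def excel_match(lookup_value, lookup_list):
    """Return the 1-based position of lookup_value, like an exact-match MATCH."""
    return lookup_list.index(lookup_value) + 1


def excel_index(table, row, column):
    """Return the value at a 1-based (row, column) location, like INDEX."""
    return table[row - 1][column - 1]


# Hypothetical two-column table: product name and price.
table = [
    ["apple", 0.50],
    ["banana", 0.25],
    ["cherry", 3.00],
]
names = [row[0] for row in table]

position = excel_match("banana", names)  # a position, not the cell's value
price = excel_index(table, position, 2)  # the value at that location
print(position, price)
```

Combining the two gives a lookup similar to VLOOKUP: MATCH finds where the item is, and INDEX retrieves what is stored there.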
true
Excel's Solver not only allows you to optimize functions by manipulating multiple variables, it also allows you to select the approach that should be used to solve the problem T or F
true
Finding something in data that doesn't apply to the outside world (ie beyond that particular dataset) is typically a result of overfitting T or F
false
Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task T or F
true
Increasing the complexity of a model generally increases its performance on the training set T or F
false
Induction reasons from general knowledge to specific facts T or F
supervised
Is causal modeling a supervised or unsupervised task
supervised
Is classification a supervised or unsupervised task
unsupervised task
Is clustering a supervised or unsupervised task
true
Logistic regression can be used to predict the probability of membership in a certain class T or F
true
Logistic regression requires a categorical target variable in training data T or F
true
Logistic regression requires numeric attributes, so categorical attributes need to be converted to numeric attributes before analysis True or False
false
More complex models are generally easier to interpret T or F
True
One key aspect of DDD is greater reliance on data, rather than intuition T or F
false
One key benefit of cross-validation to evaluate induction models is that it is much quicker to compute than other holdout methods T or F
true
One way to avoid overfitting when using tree induction is to specify a maximum number of nodes that the tree is allowed to contain T or F
true
Pruning is a technique for reducing complexity in a decision tree T or F