Systems Midterm

data mining

Business understanding, Data Understanding, Deployment, Data Preparation, Modeling and Evaluation are all important components of the ______________ process

true

The more complex a model is, the more likely it is to overfit T or F

As a model gets more complex it is allowed to pick up harmful spurious correlations. These correlations are idiosyncrasies of the specific training set used and do not represent characteristics of the population in general

Why is overfitting bad?

Big Data 2.0

____________ era: firms take advantage of the interactive nature of the Web. The changes brought on by this shift in thinking are pervasive; the most obvious are the incorporation of social networking components, and the rise of the "voice" of the individual consumer

Big Data 1.0

_____________ era: firms are busying themselves with building the capabilities to process large data, largely in support of their current operations

Linear Classifier

a (classification tree/linear classifier) places a single decision surface through the entire space. It has great freedom in the orientation of the surface, but it is limited to a single division into two segments

classification tree

a (classification tree/linear classifier) uses decision boundaries that are perpendicular to the instance space axes

complexity

a fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model ______

false

complex models always generalize better to the target population than simple models do T or F

True

in a classification tree, decision nodes can only ask questions about the attributes of the examples we want to classify T or F

model

in data science, a predictive _______ is a formula for estimating the unknown value of interest: the target

objective function

represents the overall goal in data mining when choosing the parameters (weights) of a model: the quantity that is optimized over a particular set of data to find the best solution

data science

the ultimate goal of ______ is improving decision making for a business

Data Mining

used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value

logistic regression

for probability estimation, it uses the same linear model as our linear discriminants for classification and linear regression for estimating numeric target values; its output is interpreted as the log-odds of class membership, which can be translated directly into the probability of class membership
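The translation from log-odds to probability mentioned in this card can be sketched as follows. This is a minimal illustration, not a full logistic regression: the function names, weights, and attribute values are hypothetical and chosen only for the example.

```python
import math


def linear_log_odds(weights, bias, attributes):
    """Linear model: the log-odds are a weighted sum of the attributes plus a bias."""
    return bias + sum(w * x for w, x in zip(weights, attributes))


def log_odds_to_probability(log_odds):
    """Translate log-odds of class membership into a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical weights and attribute values, for illustration only.
score = linear_log_odds([0.5, -1.0], 0.25, [2.0, 1.0])
probability = log_odds_to_probability(score)
```

Log-odds of 0 correspond to even odds, so the probability is 0.5; positive log-odds give probabilities above 0.5 and negative log-odds give probabilities below it.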

Big data

essentially refers to datasets that are too large for traditional data processing systems; used in data engineering; used for implementing data mining techniques; best known for data processing in support of data mining techniques and other data science activities

true

We can reduce or avoid overfitting in a tree induction by setting a minimum number of training examples that must exist at each leaf node T or F

supervised

When there is a specific target or purpose for a particular data set, the data mining problem is referred to as ________ "Can we find groups of customers who have particularly high likelihoods of cancelling their service soon after their contracts expire?"

data mining

______ differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying

linear classifier

a (classification tree/linear classifier) can use decision boundaries of any direction or orientation

classification tree

a (classification tree/linear classifier) is a "piecewise" classifier that segments the instance space recursively when it has to, using a divide-and-conquer approach. It can cut up the instance space arbitrarily finely into very small regions

false

a fitting curve for a model plots the true positive rates vs the false positive rates T or F

Support Vector Machine (SVM)

a linear discriminant objective function incorporates the idea that a wider bar in between different segments of data is better (maximize the margin)

cross validation

a more sophisticated holdout training and testing procedure specifies a systematic way of splitting up a single data set such that it generates multiple performance measures. These values tell the data scientist what average behavior the model yields as well as the variation to expect
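The splitting procedure described in this card can be sketched in plain Python. This is a minimal sketch assuming simple k-fold splitting without shuffling; the helper names `k_fold_indices` and `summarize` are hypothetical, and the scores passed to `summarize` would come from whatever model is being evaluated.

```python
from statistics import mean, stdev


def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs so each example is held out exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx


def summarize(scores):
    """Return the average performance and its variation across folds."""
    return mean(scores), stdev(scores)


splits = list(k_fold_indices(10, 5))
```

Each of the k folds serves once as the test set while the remaining folds form the training set, which yields k performance measures to average and compare.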

data mining

a successful _____________ project involves an intelligent compromise between what the data can do (ie what they can predict and how well) and the project goals. For this reason it is important to keep in mind how the results will be used, and use this to inform the process itself.

False

according to an assigned reading, the ad vetting model used by AdBlock Plus is endorsed broadly by industry marketing groups, including the IAB T or F

segment

an intuitive way of thinking about extracting patterns from data in a supervised manner is to try to _______ the population into subgroups that have different values for the target variable

Data science

at a high level, _________ is a set of fundamental principles that guide the extraction of knowledge from data

Data mining

at a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. ___________ is the extraction of knowledge from data, via technologies that incorporate these principles

true

cross-validation is used to estimate generalization performance T or F

classification

data mining procedure attempts to predict, for each individual in a population, which of a set of classes this individual belongs to; usually the classes are mutually exclusive "among all the customers of MegaTelCo, which are likely to respond to a given offer?"

profiling

data mining procedure attempts to characterize the typical behavior of an individual, group, or population "What is the typical cell phone usage of this customer segment?"

regression

data mining procedure attempts to estimate or predict, for each individual, the numerical value of some variable for that individual "how much will a given customer use the service?"

co-occurrence grouping

data mining procedure attempts to find associations between entities based on transactions involving them "what items are commonly purchased together?"

Clustering

data mining procedure attempts to group individuals in a population together by their similarity, but not driven by a specific purpose "do our customers form natural groups or segments?"

causal modeling

data mining procedure attempts to help us understand what events or actions actually influence others

similarity matching

data mining procedure attempts to identify similar individuals based on data known about them

link prediction

data mining procedure attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link "Since you and Karen share 10 friends, maybe you would like to be Karen's friend?"

data reduction

data mining procedure attempts to take a large set of data and replace it with a smaller set of data that contains much of the important info of the larger set

model

generally speaking, a(n) __________ is a simplified representation of reality created to serve a purpose - it is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on constraints on information or tractability

tree induction

incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes

- stop growing the tree before it gets too complex - grow the tree until it is too large, then prune it back reducing its size

list the two techniques used to avoid overfitting in tree induction

logistic regression

often thought of as simply a model for the probability of class membership; it is used widely to estimate quantities like the probability of default on credit, the probability of response to an offer, the probability of fraud on an account, the probability that a document is relevant to a topic

true

one way to avoid overfitting when using tree induction is to reduce the size of the tree by cutting off branches and replacing them with leaves T or F

Data-driven decision making

refers to the practice of basing decisions on the analysis of data, rather than purely on intuition ex - a marketer could select advertisements based on her long experience in the field and her eye for what will work, or she could base her selection on analysis of data regarding how consumers react to differing ads

linear discriminant

separates instances into different classes; the function that defines the decision boundary is a linear combination of the attributes
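As a rough illustration of this card, a linear discriminant can be written as a weighted sum of the attributes compared against zero. The function name, class labels, weights, and bias below are hypothetical values for the example, not learned from any data.

```python
def linear_discriminant(weights, bias, attributes):
    """Classify by which side of the linear decision boundary the instance falls on."""
    score = bias + sum(w * x for w, x in zip(weights, attributes))
    return "+" if score > 0 else "-"


# Hypothetical boundary: x1 - x2 = 0.
above = linear_discriminant([1.0, -1.0], 0.0, [2.0, 1.0])
below = linear_discriminant([1.0, -1.0], 0.0, [1.0, 3.0])
```

Changing the weights rotates the boundary and changing the bias shifts it, which is why a linear classifier can place its single decision surface in any orientation.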

logistic regression

similar to an ordinary linear function, except that its output is interpreted as the log-odds of the event of interest

false

studies indicate that companies using DDD approach achieve greater productivity, although the significance of such productivity gains is marginal at best T or F

target

supervised learning is model creation where the model describes a relationship between a set of selected variables and a predefined variable called the __________ variable

uncertainty

the better the information, the more ____________ is reduced

induction

the creation of models from data is known as model ___________. This is a term from philosophy that refers to generalizing from specific cases to general rules. The input data for these algorithms are called the training data

categorical or numeric

the distinction between classification and regression is whether the value for the target variable is _____________

Generalization

the property of a model or modeling process whereby the model applies to data that were not used to build the model; that is, the ability of a model built from a sample to apply to the overall population that the sample represents

leaf

the simplest method to limit tree size is to specify a minimum number of instances that must be present in a _______
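The minimum-instances rule in this card can be sketched as a check performed before a tree node is split. This is a minimal sketch under the assumption of a binary split; the function name `split_allowed` and the threshold value are hypothetical.

```python
def split_allowed(left_count, right_count, min_leaf_size):
    """Permit a split only if both resulting leaves keep enough training instances."""
    return left_count >= min_leaf_size and right_count >= min_leaf_size


# With a hypothetical minimum of 5 instances per leaf:
too_small = split_allowed(12, 3, 5)   # one side would hold only 3 instances
large_enough = split_allowed(12, 8, 5)
```

When the check fails, tree growth stops at that node, which keeps the tree from carving out tiny regions that fit idiosyncrasies of the training set.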

overfitting

the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points

overfitting

there is a fundamental trade-off between model complexity and _____

data mining

to a business manager, the ______ process is useful as a framework for analyzing a data mining project or proposal

true

we can build unsupervised data mining models when we lack labels for the target variable in the training data T or F

unsupervised

when there is no specific target or a specific purpose for a data set, the data mining problem is referred to as _____________ "Do our customers naturally fall into specific groups?"

true

you can make your macro run faster by telling the macro not to update your screen display T or F

false

SVM attempts to minimize the margin between two classes T or F

true

SVM is a useful approach when data is not linearly separable T or F

true

SVMs are based on supervised learning T or F

true

T or F There are many different "perfect" ways a set of training data can be separated using a linear discriminant

segmentation

If the _________ is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable "Middle-aged professionals who reside in NYC -----> on average have a churn rate of 5%"

true

In VBA, objects can inherit properties and methods from a container (parent) object T or F

true

In an assigned reading, the concept of an item being a toy likely means that it is not useful for serious business productivity T or F

prediction

In data science, ______ more generally means to estimate an unknown value. This value could be something in the future, but it could also be something in the present or in the past. These models are often built and tested using events from the past

unsupervised

Is co-occurrence grouping a supervised or unsupervised task

unsupervised

Is profiling a supervised or unsupervised task

supervised

Is regression a supervised or unsupervised task

segmentation

Our goal with supervised ________ is to separate the data into regions with different values of the target variable

false

Data science techniques are typically used to help with three primary types of decisions T or F

Sub FixThis()

A VBA macro called "FixThis", whether recorded or created from scratch, will begin in which of the following ways? a) Sub FixThis() b) Call Macro(FixThis) c) Macro FixThis() d) Start(FixThis)

false

A logistic regression represents the odds of class membership as a linear function of the attributes T or F

true

A regression model is a typical example of a supervised data mining application T or F

true

According to an assigned reading, the FANG group of companies has emerged as the de facto head of the current network hierarchy T or F

false

Classification trees attempt to create the widest margin between members of the various classes T or F

false

Decision nodes are used in linear regression T or F

False

Estimating the probability of a fraudulent transaction is an example of data mining T or F

true

Excel's INDEX function essentially lets you reference data in a table by location T or F

false

Excel's MATCH function is similar to the VLOOKUP function in that it returns the value contained in the referenced worksheet cell T or F

true

Excel's Solver not only allows you to optimize functions by manipulating multiple variables, it also allows you to select the approach that should be used to solve the problem T or F

true

Finding something in data that doesn't apply to the outside world (ie beyond that particular dataset) is typically a result of overfitting T or F

false

Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task T or F

true

Increasing the complexity of a model generally increases its performance on the training set T or F

false

Induction reasons from general knowledge to specific facts T or F

supervised

Is causal modeling a supervised or unsupervised task

supervised

Is classification a supervised or unsupervised task

unsupervised task

Is clustering a supervised or unsupervised task

true

Logistic regression can be used to predict the probability of membership in a certain class T or F

true

Logistic regression requires a categorical target variable in training data T or F

true

Logistic regression requires numeric attributes, so categorical attributes need to be converted to numeric attributes before analysis True or False

false

More complex models are generally easier to interpret T or F

True

One key aspect of DDD is greater reliance on data, rather than intuition T or F

false

One key benefit of cross-validation to evaluate induction models is that it is much quicker to compute than other holdout methods T or F

true

One way to avoid overfitting when using tree induction is to specify a maximum number of nodes that the tree is allowed to contain T or F

true

Pruning is a technique for reducing complexity in a decision tree T or F

