Data Science for Business Test 1

Ace your homework & exams now with Quizwiz!

clustering analysis and association rules

Unsupervised methods

- Used as a catchall term for the computation of particular numeric values of interest from data - To denote the field of study that goes by that name; it provides us with a huge amount of knowledge that underlies analytics, and can be thought of as a component of the larger field of Data Science.

Uses of statistics in business analytics

Brynjolfsson's study

What demonstrated the benefits of data-driven decision-making?

big data technologies

are used for data processing in support of the data mining techniques and other data science activities

-p1*logbase2(p1) -p2*logbase2(p2)

entropy formula

p(c) = (n+1)/(n+m+2)

equation for binary class probability estimation with Laplace correction

Data science

involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data

Big data

means datasets that are too large for traditional data processing systems, and therefore require new processing technologies.

Data-driven decision-making (DDD)

refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.

leaks

One very general and important concern during data preparation is to beware of _______.

leak

A _______ is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.

query

A _______ is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system.

variance

A natural measure of impurity for numeric values is _______.

instance or example

A(n) _______ represents a fact or a data point

Clustering

An example _______ question would be: "Do our customers form natural groups or segments?"

regression

An example _______ question would be: "How much will a given customer use the service?"

profiling

An example _______ question would be: "What is the typical cell phone usage of this customer segment?"

co-occurrence

An example_______ question would be: What items are commonly purchased together?

feature vector

An instance is also sometimes called a _______, because it can be represented as a fixed-length ordered collection (vector) of feature values.

The Cross Industry Standard Process for Data Mining

CRISP-DM stands for _______.

- Business Understanding - Data Understanding - Data Preparation - Modeling - Evaluation - Deployment

CRISP-DM steps

There must be a target and data on the target

Conditions for supervised learning

dependent variable

The target variable, whose values are to be predicted, is commonly called the _______ in statistics.

- Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets. - Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages. - From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest. - If you look too hard at a set of data, you will find something—but it might not generalize beyond the data you're looking at. - Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.

Fundamental principles of data science

Overfitting a dataset

If you look too hard at a set of data, you will find something—but it might not generalize beyond the data you're looking at.

supervised

In _______ learning, there is a specific target defined.

Classification

In _______, a model takes the instance and determines which class the instance belongs to.

class probability estimation

In _______, or "scoring," a model is applied to the instance that produces a numbered score that represents the probability that the instance belongs to each class, rather than a binary target variable that is produced in classification.

descriptive modeling

In _______, the primary purpose is not to estimate a value but instead to gain insight into the underlying phenomenon or process.

unsupervised learning

In _______, there is no specific purpose or target that has been specified for the data.

Web 1.0

In ________, businesses busied themselves with getting the basic internet technologies in place, so that they could establish a web presence, build electronic commerce capability, and improve the efficiency of their operations.

Web 2.0

In ________, new systems and companies began taking advantage of the interactive nature of the Web.

predictive model

In data science, a _______ is a formula for estimating the unknown value of interest: the target.

Evaluation

In the _______ step of the CRISP-DM process, assess the data mining results rigorously and gain confidence that they are valid and reliable. See if the model serves the business goal. If not the next iteration may use another model that does.

Data Preparation

In the _______ step of the CRISP-DM process, different modeling techniques require data to be different forms. Typically tabular format

Business Understanding

In the _______ step of the CRISP-DM process, it is vital to understand the problem to be solved, this step itself is an iterative process of discovery that requires creativity and high level knowledge of the fundamentals.

Modeling

In the _______ step of the CRISP-DM process, running the dataset through a model that is based on the data and business question being asked.

Data Understanding

In the _______ step of the CRISP-DM process, the data set has strengths and limitations that must be understood. The data also has a cost associated with it that must be understood.

Deployment

In the _______ step of the CRISP-DM process, the results of data mining are put into real use in order to realize some return of investment.

entropy(parent)-[p(c1) x entropy(c1) + p(c2) x entropy(c2)]

Information Gain Formula

Independent variables / predictors / explanatory variables

Statisticians refer to _______ as the attributes supplied as input.

classification and regression

Supervised methods

data mining

The _______ process should not be viewed as a software development cycle.

model induction

The creation of models from data is known as _______.

Data Mining

The field of _______ started as an offshoot of Machine Learning

training data

The input data for the induction algorithm, used for inducing the model, are called the _______.

information gain

The most common splitting criterion is called _______.

economist Prasanna Tambe of NYU's Stern School

Who's study examined the extent to which big data technologies seem to help firms?

Profiling

_______ (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.

Database queries

_______ are appropriate when an analyst already has an idea of what might be an interesting subpopulation of the data, and wants to investigate this population or confirm a hypothesis about it.

Training data

_______ are called labeled data because the value for the target variable (the label) is known.

Machine Learning

_______ as a field of study arose as a subfield of Artificial Intelligence, which was concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent's experience in the world.

Clustering

_______ attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.

Causal modeling

_______ attempts to help us understand what events or actions actually influence others.

Similarity matching

_______ attempts to identify similar individuals based on data known about them.

Link prediction

_______ attempts to predict connections between data items

Data reduction

_______ attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.

Data warehouses

_______ collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database.

Query tools

_______ generally have the ability to execute sophisticated logic

CRISP-DM

_______ is a codification of the data mining process.

purity measure

_______ is a formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable.

Entropy

_______ is a measure of disorder that can be applied to a set, such as one of our individual segments.

Induction

_______ is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).

Link prediction

_______ is common in social networking systems: "Since you and Karen share 10 friends, maybe you'd like to be Karen's friend?"

Machine Learning

_______ is concerned with issues of agency and cognition—how will an intelligent agent use learned knowledge to reason and act in its environment.

Machine Learning

_______ is concerned with many types of performance improvement. It includes subfields such as robotics and computer vision.

Profiling

_______ is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems.

Data mining

_______ is the extraction of knowledge from data via technologies that incorporate data science principles.

Data-Analytic thinking

_______ is the process of ​extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages​.

information gain

_______ measures the change in entropy due to any amount of new information being added.

On-line Analytical Processing (OLAP)

_______ provides an easy-to-use GUI to query large data collections, for the purpose of facilitating data exploration.

Deduction

_______ starts with general rules and specific facts, and creates other specific facts from them.

Tree induction

_______ takes a divide-and-conquer approach, starting with the whole dataset and applying variable selection to try to create the "purest" subgroups possible using the attributes.

Class Probability Estimation

_______ would be able to evaluate each individual customer and produce a score of how likely each is to respond to the offer.

Classification

_______ would be useful in answering the question, "Among all the customers of MegaTelCo, which are likely to respond to a given offer?"

Co-occurrence grouping

_______, also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.

Regression

_______, or "value estimation," attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.


Related study sets

MANA3335 MindTap Learn It: Chapter 04: Managing Decision Making

View Set

Cultura Prehispánica y su legado

View Set

Diesel II Chapter 9-Circuit Control Devices

View Set

Ch. 1 The Science of Psychology |Quiz

View Set