Chapter 1
Typical examples of data preparation are:
- converting data to tabular format,
- removing or inferring missing values, and
- converting data to different types (e.g., numeric to text, text to numeric, or boolean to binary).
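A minimal sketch of these preparation steps in pandas; the records, column names, and fill strategy are illustrative assumptions, not from the text:

```python
import pandas as pd

# Hypothetical raw records, converted to tabular format.
records = [
    {"customer_id": "C1", "age": "34", "is_member": True, "spend": 120.5},
    {"customer_id": "C2", "age": "51", "is_member": False, "spend": None},
]
df = pd.DataFrame(records)

# Convert types: text to numeric, boolean to binary (0/1).
df["age"] = pd.to_numeric(df["age"])
df["is_member"] = df["is_member"].astype(int)

# Remove or infer missing values: here, infer missing spend from the median.
df["spend"] = df["spend"].fillna(df["spend"].median())

print(df)
```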
The most intensive applications of data science and data mining include:
- direct marketing,
- online advertising,
- credit scoring,
- financial trading,
- help-desk management,
- fraud detection,
- search ranking, and
- product recommendation.
The data mining process is grounded in science, but it also
- requires a great deal of creativity, and
- demands some application of common sense.
During the deployment stage (stage 6) of the CRISP-DM Process:
- The results of the data mining, and sometimes the data mining techniques themselves, are put into real use in order to realize a return on investment.
The data understanding stage (stage 2) of the CRISP-DM Process entails:
1) The data are the "raw" material from which the solution will be built. 2) Know the strengths and limitations of the data. 3) Identify the different data sources (e.g., is the data primary, secondary, or syndicated?)
The business decisions being evaluated using data analytics mainly fall into two types:
1.) decisions for which "discoveries" need to be made within the data (i.e., an unsupervised approach), and 2.) decisions that repeat, especially at a massive scale.
Steps (or phases) of Data Mining
1.) define the problem, 2.) identify the data needed to solve the problem, 3.) prepare and pre-process the data, 4.) model the data, 5.) train and test the data model, and 6.) verify and deploy the model.
The CRISP-DM process helps managers
1.) gain an understanding of the business, 2.) gain an understanding of the data, 3.) prepare the data for analysis, 4.) create a model from the data, 5.) evaluate the results of the model, and 6.) deploy the data model for decision making.
Two main reasons for deploying the data mining system itself rather than models produced by a data mining system are:
1.) the world may change faster than the data science team can adapt (as with fraud and intrusion detection), and 2.) a business may have too many modeling tasks for its data science team to curate each model manually.
Leakage
A situation where a variable collected in historical data gives information on the target variable that would not actually be available when the model is used. Example: when predicting whether a customer will be a "big spender," including a variable that is only known after the spending has occurred leaks the target.
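A hedged illustration of excluding a leak during feature selection; the column names are hypothetical stand-ins for the "big spender" example:

```python
import pandas as pd

# Hypothetical historical data. "total_spent" is only known after the outcome
# we want to predict, so keeping it as a feature would leak the target.
history = pd.DataFrame({
    "age": [34, 51, 29],
    "visits_last_month": [2, 8, 1],
    "total_spent": [40.0, 950.0, 15.0],  # leaky: reflects the outcome itself
    "big_spender": [0, 1, 0],            # target variable
})

X = history.drop(columns=["big_spender", "total_spent"])  # drop target and leak
y = history["big_spender"]
```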
CRISP-DM
Cross Industry Standard Process for Data Mining
Syndicated data
Data available for a fee from commercial research firms such as Information Resources Inc. (IRI), National Purchase Diary Panel, and ACNielsen.
Normalizing Data
Divides one numeric attribute by another in order to minimize differences in values based on the size of areas or number of features in each area
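A hedged sketch of this kind of normalization (the area names and counts are made up): dividing a count by the size of each area makes the values comparable.

```python
import pandas as pd

# Hypothetical area-level data.
areas = pd.DataFrame({
    "area": ["North", "South"],
    "num_incidents": [500, 500],
    "population": [100_000, 20_000],
})

# Divide one numeric attribute by another so that areas of different
# sizes can be compared on an equal footing.
areas["incidents_per_1000"] = areas["num_incidents"] / areas["population"] * 1_000

print(areas)  # the same raw count corresponds to very different rates
```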
Strategist
Seizes envisioned opportunities, brings creativity, evaluates the promise of new ideas, and designs data science projects
Tabular format
The presentation of information such as text and numbers in tables.
Once firms have become capable of processing massive data in a flexible fashion (Big Data 1.0), they should begin the Big Data 2.0 phase by asking:
What can I now do that I couldn't do before, or do better than I could do before?
An "in vivo" evaluation is
a live system which randomly applies the model to some customers while keeping other customers as a control group.
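A minimal sketch of how such a random split might be assigned; the function name and customer IDs are hypothetical, and real systems would do this inside the serving infrastructure:

```python
import random

def assign_group(customer_id: str, treatment_fraction: float = 0.5) -> str:
    """Route a customer either to the model ("treatment") or to the control group."""
    rng = random.Random(customer_id)  # deterministic per customer, random across customers
    return "treatment" if rng.random() < treatment_fraction else "control"

for cid in ["C1001", "C1002", "C1003"]:
    print(cid, assign_group(cid))
```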
a predictive model abstracts away most of the complexity of the world, focusing on
a particular set of indicators that correlate in some way with a quantity of interest.
Extracting useful knowledge from data to solve business problems can be treated systematically by following
a process with reasonably well-defined stages.
Tambe's study of the extent to which big data technologies help firms found that (after controlling for various confounding factors) using big data technologies is associated with:
additional productivity growth. Specifically, one standard deviation higher utilization of these technologies is associated with 1 to 3% higher productivity than the average firm, while one standard deviation lower utilization is associated with 1 to 3% lower productivity.
Step 2 of data mining, identifying the necessary data, requires data scientists to:
assess what data is needed, and then collect and gain an understanding of that data.
Data-driven decision making (DDD) refers to the practice of
basing decisions on the analysis of data, rather than purely on intuition.
We can think of ourselves as being in the era of Big Data 1.0, in that firms are busying themselves with
building the capabilities to process large volumes of data, largely in support of their current operations (for example, to improve efficiency).
Data and Data Science are strategic assets. "Too many businesses regard data analytics as pertaining to realizing value from existing data, often without careful regard to whether the
business has the appropriate analytical talent. ... Data and data science are complementary."
Whether the deployment of a data model is successful or not, the process is returned to the
business understanding phase in order to generate further insight.
data processing
collecting, organizing, analyzing, and summarizing data
The data preparation stage (stage 3) of the CRISP-DM Process entails:
converting the data into an appropriate format that will yield better results; this stage often proceeds hand in hand with the data understanding stage.
The data modeling stage (stage 4) of the CRISP-DM Process is the primary place where
data mining techniques are applied to the data.
Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, will help to
envision opportunities for improving data-driven decision-making and to recognize data-oriented competitive threats.
A critical part of the data understanding phase is
estimating the costs and benefits of each data source and deciding whether further investment is merited.
data processing technologies are very important for many data-oriented business tasks that do not involve
extracting knowledge or data-driven decision-making; examples include efficient transaction processing, modern web system processing, and online advertising campaign management.
data scientist
extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information
data mining is used for
general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value
Step 1 of data mining, defining the problem, requires data scientists to:
identify both the business and data mining goals
"Big data" technologies (such as Hadoop, HBase, & MongoDB) are used for
implementing data mining techniques, data engineering, and (more frequently) data processing in support of data mining and other data science activities.
The ultimate goal of data science applications to business is to
improve the quality of decision-making.
Primary data
information collected for the specific purpose at hand
secondary data
information that already exists somewhere, having been collected for another purpose
From a large mass of data, information technology can be used to find:
informative descriptive attributes of entities of interest.
During the business understanding stage (stage 1) of the CRISP-DM Process:
Understanding the problem to be solved is vital, and creativity and skill play an important role. The design team thinks carefully about the problem to be solved and about how the results will be used.
Overfitting the data can occur when you
look too hard at a set of data: you will find something, but it might not generalize beyond the data you are looking at.
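A small sketch of this effect on made-up data: the target is pure noise, yet a flexible enough model still "finds" a pattern in the training sample that fails on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y is noise around a constant, so there is nothing real to find.
x_train = np.linspace(0, 1, 10)
y_train = 5 + rng.normal(scale=1.0, size=10)
x_test = np.linspace(0, 1, 100)
y_test = 5 + rng.normal(scale=1.0, size=100)

for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# The degree-9 fit drives training error toward zero but typically does much
# worse on the held-out data, i.e., it does not generalize.
```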
In data understanding, we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then
match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Firms with sophisticated data science teams wisely build testbed environments that
mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.
data science involves
principles, processes, and techniques for understanding phenomena via the (automated) analysis of data.
It has been shown statistically that the more data-driven a firm is, the more
productive it is, even after controlling for a wide range of possible confounding factors. In fact, one standard deviation higher on the DDD scale is associated with a 4 to 6% increase in overall firm productivity.
Step 4 of data mining, modeling the data, requires data scientists to:
select the appropriate algorithms and build predictive models.
Step 3 of data mining, preparing and pre-processing the data, requires data scientists to:
select the required data, and then cleanse and/or format the data as necessary.
As a term, "data science" is often applied more broadly than the traditional use of "data mining," but data mining techniques provide
some of the clearest illustrations of the principles of data science.
Brynjolfsson and his colleagues developed a measure of DDD that rates firms as to how:
strongly they use data to make decisions across the company.
The data-analytic thinker needs to consider whether he or she expects the data to have
sufficient value to justify the investment.
The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without
suitable data science talent.
Data science needs access to data, and it often benefits from sophisticated data engineering that data processing technologies may facilitate, but these technologies are not data science per se. They are critical in:
supporting data science
data science differs from data mining, in that data science is a set of fundamental principles that guide the extraction of knowledge from data; whereas, data mining is
the extraction of knowledge from data, via technologies that incorporate these principles.
churn is
the number of consumers who stop using a product or service, divided by the average number of consumers of that product or service
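A tiny worked example of this ratio, with invented numbers:

```python
def churn_rate(customers_lost: int, average_customers: int) -> float:
    """Consumers who stopped using the product, divided by the average consumer count."""
    return customers_lost / average_customers

# Hypothetical month: 50 cancellations against an average base of 1,000 customers.
print(f"{churn_rate(50, 1_000):.1%}")  # 5.0%
```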
Moore's Law
the observation that computing power roughly doubles every two years.
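A rough arithmetic sketch of what "doubles every two years" implies over time (the two-year doubling period is the usual simplification, not a precise law):

```python
def power_multiple(years: float) -> float:
    """Relative computing power after `years`, assuming a doubling every two years."""
    return 2 ** (years / 2)

for years in (2, 10, 20):
    print(f"after {years} years: ~{power_multiple(years):,.0f}x")  # 2x, 32x, 1,024x
```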
Formulating data mining solutions and evaluating the results involves
thinking carefully about the context in which they will be used.
Big Data essentially means datasets that are
too large or complex for traditional data processing systems, and therefore require new processing technologies.
Step 5 of data mining, training and testing the model, requires data scientists to:
train the model with sample data and, then test and iterate the results.
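A minimal sketch of this step using scikit-learn on synthetic data; the dataset, split sizes, and model choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled data.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Train on one portion of the data, hold out the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# In practice this fit/evaluate loop is iterated as features and models are refined.
```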
A collaborator in a data-centric project can
translate between the business problem and its technical execution.
Step 6 of data mining, verifying and deploying the finalized model, requires data scientists to:
verify the final data model, prepare visualizations, and deploy the model.
Commonality of data refers to the
volume and variety of data
During the evaluation stage (stage 5) of the CRISP-DM Process:
• The purpose is to assess the data mining results rigorously and to gain confidence in their validity.
• The evaluation stage ensures the model satisfies the original business goals.
• It involves both quantitative and qualitative assessment.
Managing a data-mining project requires:
• understanding the potential of the project,
• the ability to evaluate the proposal and execute the output requirements, and
• the ability to interview, listen to the needs of the project, and deliver outcomes to a wide variety of people.