Chapter 1
Typical examples of data preparation are:
- converting data to tabular format,
- removing or inferring missing values, and
- converting data to different types (e.g., numeric to text, text to numeric, or boolean to binary).
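A minimal sketch of these preparation steps in pandas; the records, column names, and fill strategy are illustrative assumptions, not from the text:

```python
import pandas as pd

# Hypothetical raw records, converted to tabular format.
records = [
    {"customer_id": "C1", "age": "34", "is_member": True, "spend": 120.5},
    {"customer_id": "C2", "age": "51", "is_member": False, "spend": None},
]
df = pd.DataFrame(records)

# Convert types: text to numeric, boolean to binary (0/1).
df["age"] = pd.to_numeric(df["age"])
df["is_member"] = df["is_member"].astype(int)

# Remove or infer missing values: here, infer missing spend from the median.
df["spend"] = df["spend"].fillna(df["spend"].median())

print(df)
```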
The most intensive applications of data science and data mining include:
- direct marketing,
- online advertising,
- credit scoring,
- financial trading,
- help-desk management,
- fraud detection,
- search ranking, and
- product recommendation.
The data mining process is grounded in science, but it also
- requires a great deal of creativity, and
- demands some application of common sense.
During the deployment stage (stage 6) of the CRISP-DM Process:
- The results of the data mining, and sometimes the data mining techniques themselves, are put into real use in order to realize a return on investment.
The data understanding stage (stage 2) of the CRISP-DM Process entails:
1) The data are the "raw" material from which the solution will be built. 2) Know the strengths and limitations of the data. 3) Identify the different data sources (e.g., is the data primary, secondary, or syndicated?)
The business decisions being evaluated using data analytics mainly fall into two types:
1.) decisions for which "discoveries" need to be made within the data (i.e., an unsupervised approach), and 2.) decisions that repeat, especially at a massive scale.
Steps (or phases) of Data Mining
1.) define the problem, 2.) identify the data needed to solve the problem, 3.) prepare and pre-process the data, 4.) model the data, 5.) train and test the data model, and 6.) verify and deploy the model.
The CRISP-DM process helps managers
1.) gain an understanding of the business, 2.) gain an understanding of the data, 3.) prepare the data for analysis, 4.) create a model from the data, 5.) evaluate the results of the model, and 6.) deploy the data model for decision making.
Two main reasons for deploying the data mining system itself rather than models produced by a data mining system are:
1.) the world may change faster than the data science team can adapt (as with fraud and intrusion detection), and 2.) a business may have too many modeling tasks for its data science team to curate each model manually.
Leakage
A situation where a variable collected in historical data gives information on the target variable that would not actually be available when the model is used. Example: when predicting whether a customer will be a "big spender," including a variable that is only known after the spending has occurred leaks the target.
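A hedged illustration of excluding a leak during feature selection; the column names are hypothetical stand-ins for the "big spender" example:

```python
import pandas as pd

# Hypothetical historical data. "total_spent" is only known after the outcome
# we want to predict, so keeping it as a feature would leak the target.
history = pd.DataFrame({
    "age": [34, 51, 29],
    "visits_last_month": [2, 8, 1],
    "total_spent": [40.0, 950.0, 15.0],  # leaky: reflects the outcome itself
    "big_spender": [0, 1, 0],            # target variable
})

X = history.drop(columns=["big_spender", "total_spent"])  # drop target and leak
y = history["big_spender"]
```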
CRISP-DM
Cross Industry Standard Process for Data Mining
Syndicated data
Data available for a fee from commercial research firms such as Information Resources Inc. (IRI), National Purchase Diary Panel, and ACNielsen.
Normalizing Data
Divides one numeric attribute by another in order to minimize differences in values based on the size of areas or number of features in each area
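A hedged sketch of this kind of normalization (the area names and counts are made up): dividing a count by the size of each area makes the values comparable.

```python
import pandas as pd

# Hypothetical area-level data.
areas = pd.DataFrame({
    "area": ["North", "South"],
    "num_incidents": [500, 500],
    "population": [100_000, 20_000],
})

# Divide one numeric attribute by another so that areas of different
# sizes can be compared on an equal footing.
areas["incidents_per_1000"] = areas["num_incidents"] / areas["population"] * 1_000

print(areas)  # the same raw count corresponds to very different rates
```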
Strategist
Seizes envisioned opportunities, brings creativity, evaluates the promise of new ideas, and designs data science projects
Tabular format
The presentation of information such as text and numbers in tables.
Once firms have become capable of processing massive data in a flexible fashion (Big Data 1.0), they should begin the Big Data 2.0 phase by asking:
What can I now do that I couldn't do before, or do better than I could do before?
An "in vivo" evaluation is
a live system which randomly applies the model to some customers while keeping other customers as a control group.
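A minimal sketch of how such a random split might be assigned; the function name and customer IDs are hypothetical, and real systems would do this inside the serving infrastructure:

```python
import random

def assign_group(customer_id: str, treatment_fraction: float = 0.5) -> str:
    """Route a customer either to the model ("treatment") or to the control group."""
    rng = random.Random(customer_id)  # deterministic per customer, random across customers
    return "treatment" if rng.random() < treatment_fraction else "control"

for cid in ["C1001", "C1002", "C1003"]:
    print(cid, assign_group(cid))
```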
a predictive model abstracts away most of the complexity of the world, focusing on
a particular set of indicators that correlate in some way with a quantity of interest.
Extracting useful knowledge from data to solve business problems can be treated systematically by following
a process with reasonably well-defined stages.
Tambe's study of the extent to which big data technologies help firms found that (after controlling for various confounding factors) using big data technologies is associated with:
additional productivity growth. Specifically, one standard deviation higher utilization of these technologies is associated with 1 to 3% higher productivity than the average firm, while one standard deviation lower utilization is associated with 1 to 3% lower productivity.
Step 2 of data mining, identifying the necessary data, requires data scientists to:
assess what data is needed, and then collect and gain an understanding of that data.
Data-driven decision making (DDD) refers to the practice of
basing decisions on the analysis of data, rather than purely on intuition.
We can think of ourselves as being in the era of Big Data 1.0, in that firms are busying themselves with
building the capabilities to process large volumes of data, largely in support of their current operations (for example, to improve efficiency).
Data and Data Science are strategic assets. "Too many businesses regard data analytics as pertaining to realizing value from existing data, often without careful regard to whether the
business has the appropriate analytical talent. ... Data and data science are complementary."
Whether the deployment of a data model is successful or not, the process is returned to the
business understanding phase in order to generate further insight.
data processing
collecting, organizing, analyzing, and summarizing data
The data preparation stage (stage 3) of the CRISP-DM Process entails:
converting the data into an appropriate format that will yield better results; this stage often proceeds hand in hand with the data understanding stage.
The data modeling stage (stage 4) of the CRISP-DM Process is the primary place where
data mining techniques are applied to the data.
Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, will help to
envision opportunities for improving data-driven decision-making and to recognize data-oriented competitive threats.
A critical part of the data understanding phase is
estimating the costs and benefits of each data source and deciding whether further investment is merited.
data processing technologies are very important for many data-oriented business tasks that do not involve
extracting knowledge or data-driven decision-making; examples include efficient transaction processing, modern web system processing, and online advertising campaign management.
data scientist
extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information
data mining is used for
general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value
Step 1 of data mining, defining the problem, requires data scientists to:
identify both the business and data mining goals
"Big data" technologies (such as Hadoop, HBase, & MongoDB) are used for
implementing data mining techniques, data engineering, and (more frequently) data processing in support of data mining and other data science activities.
The ultimate goal of data science applications to business is to
improve the quality of decision-making.
Primary data
information collected for the specific purpose at hand
secondary data
information that already exists somewhere, having been collected for another purpose
From a large mass of data, information technology can be used to find:
informative descriptive attributes of entities of interest.
During the business understanding stage (stage 1) of the CRISP-DM Process:
Understanding the problem to be solved is vital, and creativity and skill play an important role. The design team thinks carefully about the problem to be solved and about how the results will be used.
Overfitting the data can occur when you
look too hard at a set of data: you will find something, but it might not generalize beyond the data you are looking at.
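A small sketch of this effect on made-up data: the target is pure noise, yet a flexible enough model still "finds" a pattern in the training sample that fails on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y is noise around a constant, so there is nothing real to find.
x_train = np.linspace(0, 1, 10)
y_train = 5 + rng.normal(scale=1.0, size=10)
x_test = np.linspace(0, 1, 100)
y_test = 5 + rng.normal(scale=1.0, size=100)

for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# The degree-9 fit drives training error toward zero but typically does much
# worse on the held-out data, i.e., it does not generalize.
```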
In data understanding, we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then
match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Firms with sophisticated data science teams wisely build testbed environments that
mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.
data science involves
principles, processes, and techniques for understanding phenomena via the (automated) analysis of data.
It has been shown statistically that the more data-driven a firm is, the more
productive it is, even after controlling for a wide range of possible confounding factors. In fact, one standard deviation higher on the DDD scale is associated with a 4 to 6% increase in overall firm productivity.
Step 4 of data mining, modeling the data, requires data scientists to:
select the appropriate algorithms and build predictive models.
Step 3 of data mining, preparing and pre-processing the data, requires data scientists to:
select the required data, and then cleanse and/or format the data as necessary.
As a term, "data science" is often applied more broadly than the traditional use of "data mining," but data mining techniques provide
some of the clearest illustrations of the principles of data science.
Brynjolfsson and his colleagues developed a measure of DDD that rates firms as to how:
strongly they use data to make decisions across the company.
The data-analytic thinker needs to consider whether he or she expects the data to have
sufficient value to justify the investment.
The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without
suitable data science talent.
Data science needs access to data, and it often benefits from sophisticated data engineering that data processing technologies may facilitate, but these technologies are not data science per se. They are critical in:
supporting data science
data science differs from data mining, in that data science is a set of fundamental principles that guide the extraction of knowledge from data; whereas, data mining is
the extraction of knowledge from data, via technologies that incorporate these principles.
churn is
the number of consumers who stop using a product or service, divided by the average number of consumers of that product or service
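A tiny worked example of this ratio, with invented numbers:

```python
def churn_rate(customers_lost: int, average_customers: int) -> float:
    """Consumers who stopped using the product, divided by the average consumer count."""
    return customers_lost / average_customers

# Hypothetical month: 50 cancellations against an average base of 1,000 customers.
print(f"{churn_rate(50, 1_000):.1%}")  # 5.0%
```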
Moore's Law
the observation that computing power roughly doubles every two years.
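A rough arithmetic sketch of what "doubles every two years" implies over time (the two-year doubling period is the usual simplification, not a precise law):

```python
def power_multiple(years: float) -> float:
    """Relative computing power after `years`, assuming a doubling every two years."""
    return 2 ** (years / 2)

for years in (2, 10, 20):
    print(f"after {years} years: ~{power_multiple(years):,.0f}x")  # 2x, 32x, 1,024x
```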
Formulating data mining solutions and evaluating the results involves
thinking carefully about the context in which they will be used.
Big Data essentially means datasets that are
too large or complex for traditional data processing systems, and therefore require new processing technologies.
Step 5 of data mining, training and testing the model, requires data scientists to:
train the model with sample data and, then test and iterate the results.
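A minimal sketch of this step using scikit-learn on synthetic data; the dataset, split sizes, and model choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical labeled data.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Train on one portion of the data, hold out the rest for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# In practice this fit/evaluate loop is iterated as features and models are refined.
```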
A collaborator in a data-centric project can
translate between the business problem and its technical execution.
Step 6 of data mining, verifying and deploying the finalized model, requires data scientists to:
verify the final data model, prepare visualizations, and deploy the model.
Commonality of data refers to the
volume and variety of data
During the evaluation stage (stage 5) of the CRISP-DM Process:
• The purpose is to assess the data mining results rigorously and to gain confidence in their validity.
• The evaluation stage ensures the model satisfies the original business goals.
• It involves both quantitative and qualitative assessment.
Managing a data-mining project requires:
• understanding the potential of the project,
• the ability to evaluate the proposal and execute the output requirements, and
• the ability to interview, listen to the needs of the project, and deliver outcomes to a wide variety of people.