Chapter 2

In many business analytics projects, we want to find

"correlations" between a particular variable describing an individual and other variables.

Regression (value estimation) attempts to

"estimate or predict, for each individual, the numerical value of some variable for that individual. (I.e. a technique that is used to predict a dependent variable with one or more independent variables.)

A vital part in the early stages of the data mining process is:

(1.) to decide whether the line of attack will be supervised or unsupervised, and (2.) if supervised, to produce a precise definition of a specific, quantifiable target variable.

Clustering is used as input to decision-making processes focused on questions such as:

- What products should we offer or develop?
- How should our customer care teams (or sales teams) be structured?

Automated statistical modeling (e.g. advanced regression)

- is considered data mining,
- is usually based on linear models, &
- massive databases now allow non-linear alternatives.

Data mining (also known as KDD) is an offshoot of machine learning, and the two are closely related. Machine learning differs from data mining in that:

- it focuses on performance improvement and includes several subfields, such as robotics, that are not part of data mining,
- it is concerned with issues of agency and cognition (how an intelligent agent will use learned knowledge to reason and act in its environment), which are not concerns of data mining,
- it is less concerned with applications than the KDD community, &
- the KDD community is more concerned with the overall process of data analytics: data preparation, model learning, evaluation, and so on.

Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target:

- regression involves a numeric target, while
- classification involves a categorical (often binary) target.

The knowledge of Statistics in data science helps us:

- understand different data distributions,
- determine what statistics are appropriate as summaries,
- understand how to use data to test hypotheses, &
- understand how to estimate the uncertainties of conclusions.

There is an important distinction that needs to be understood with regard to data mining:

1.) Mining the data to find patterns and build models.
    a.) Produces the (probability estimation) model.
    b.) It is a craft.
2.) Using the results of data mining.
    a.) The results should influence and inform the data mining process itself.

supervised data mining methods include

1.) Classification, 2.) Regression, & 3.) Causal Modeling.

Unsupervised data mining methods include:

1.) Clustering, 2.) Co-occurrence grouping, & 3.) Profiling.

Regression analysis asks questions such as:

1.) Who are the most profitable customers? 2.) Is there really a difference between the profitable customers and the average customer? 3.) Who really are these customers, and can we characterize them? 4.) Will some particular new customer be profitable? How much revenue should I expect this customer to generate?

The most commonly used data analysis techniques are:

1.) classification, 2.) regression, 3.) similarity matching, 4.) clustering, 5.) co-occurrence grouping, 6.) profiling, 7.) link prediction, 8.) data reduction, & 9.) causal modeling. (Note: Classification & Regression are the most commonly used methods for analyzing data.)

unsupervised data mining

A form of data mining whereby the analysts do not create a model or hypothesis before running the analysis. Instead, they apply the data mining technique to the data and observe the results. With this method, analysts create hypotheses after the analysis to explain the patterns found.

Querying and reporting

Data manipulation language - commands used to add, delete, change, and retrieve data from the database. (Examples include SQL, Excel, QBE, & other GUI-based querying software.)
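
A minimal illustration of querying in Python (the sqlite3 module and the table are illustrative assumptions): the analyst retrieves a specific subpopulation rather than discovering any pattern.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "NE", 120.0), (2, "SW", 80.0), (3, "NE", 200.0)])

# A query: a specific request for a subset of the data, posed to the database system.
rows = conn.execute(
    "SELECT id, revenue FROM customers WHERE region = 'NE' AND revenue > 100"
).fetchall()
print(rows)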

Classification predicts WHETHER something will happen, whereas regression predicts

HOW MUCH something will happen.

examples of questions that regression analysis can answer include:

How much will she use the service? How much is she willing to pay (opportunity cost)? (The variable could be a numeric data point such as service usage, billing data, customer take.)

Classification and Class Probability Estimation

This method attempts to predict, for each individual in a population, which of a (small) set of classes that individual belongs to. For example: among a group of customers, which are likely to respond to an advertising campaign? The classes are those who will respond and those who won't respond.
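
A minimal sketch of classification with class probability estimation (scikit-learn and the toy data are assumptions), echoing the advertising-response example:

from sklearn.linear_model import LogisticRegression

# Hypothetical customer attributes and whether each responded to a past campaign.
X = [[25, 1], [40, 0], [35, 1], [50, 0], [29, 1], [61, 0]]
y = [1, 0, 1, 0, 1, 0]   # 1 = responded, 0 = did not respond

clf = LogisticRegression().fit(X, y)
print(clf.predict([[33, 1]]))        # predicted class for a new customer
print(clf.predict_proba([[33, 1]]))  # class probability estimates (scoring)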

supervised data mining

a form of data mining in which data miners develop a model prior to the analysis and apply statistical techniques to data to estimate values of the parameters of the model

Classification and scoring are closely related in that:

a model designed to do one can easily be modified to do the other.

The data mining process breaks up the overall task of finding patterns from data into

a set of well-defined subtasks.

Data warehousing is not always necessary for data mining, but most firms that decide to invest in data warehouses often can

apply data mining more broadly and more deeply in the organization.

The main difference between data mining and other analytical techniques is that data mining focuses on the

automated search for knowledge, patterns, or regularities from data.

Similarity Matching, Link Prediction, & Data Reduction methods can be

both supervised and unsupervised.

For business applications, we often want a numerical prediction over a

categorical target.

Profiling (also known as behavioral description) attempts to

characterize the typical behavior of individuals, groups, or populations. For example: What does "normal" behavior look like as opposed to "fraudulent" behavior?
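
A minimal profiling sketch (plain NumPy; the numbers are fabricated): characterize "normal" behavior by its mean and spread, then flag observations that deviate strongly from that norm.

import numpy as np

# Hypothetical daily transaction amounts for one account.
amounts = np.array([42.0, 39.5, 44.1, 41.2, 40.8, 43.0, 310.0])

mu, sigma = amounts.mean(), amounts.std()
z_scores = (amounts - mu) / sigma

# Amounts far outside the established behavioral norm are candidate anomalies.
print(amounts[np.abs(z_scores) > 2])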

Similarity measures underlie certain solutions to other data mining tasks, such as:

classification, regression, and clustering.

Data Warehouses

collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database.

Link Prediction attempts to predict:

connections between data items, accompanied by estimations of the strength of the link. Commonly used by social media platforms to make friend suggestions to users.
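
A minimal link-prediction sketch (plain Python on a made-up friendship graph): score each unconnected pair by how many neighbors they share; a higher count suggests a stronger candidate link.

from itertools import combinations

# Hypothetical friendship graph: user -> set of friends.
friends = {
    "ann":  {"bob", "cara", "dan"},
    "bob":  {"ann", "cara"},
    "cara": {"ann", "bob", "dan"},
    "dan":  {"ann", "cara", "eve"},
    "eve":  {"dan"},
}

# Common-neighbor score for pairs that are not yet connected.
for u, v in combinations(friends, 2):
    if v not in friends[u]:
        print(u, v, len(friends[u] & friends[v]))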

A critical skill in data science is the ability to

decompose a data analytics problem into pieces such that each piece matches a known task for which tools are available.

A subtask that will likely be part of the solution to any churn problem is to:

estimate from historical data the probability of a customer terminating their contract shortly after it has expired.

Co-occurrence Grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to

find associations between entities based upon transactions involving them. In other words, it asks the question: what items are commonly purchased together? (For example: a motorcycle helmet purchased together with a motorcycle.)
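
A minimal co-occurrence sketch (plain Python; the baskets are invented): count how often pairs of items appear in the same transaction, the raw material for support and "surprise" measures such as lift.

from itertools import combinations
from collections import Counter

# Hypothetical market baskets (transactions).
baskets = [
    {"motorcycle", "helmet"},
    {"motorcycle", "helmet", "gloves"},
    {"gloves", "jacket"},
    {"motorcycle", "helmet"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs that are most frequently purchased together.
print(pair_counts.most_common(3))

Dividing a pair's count by the number of baskets gives its support; comparing that to what independent purchasing would imply gives a rough sense of how surprising the co-occurrence is.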

Profiling is often used to establish behavioral norms for anomaly detection applications such as:

fraud detection and monitoring for computer system intrusions.

Behavior can be described:

generally over a population, or down to the level of small groups or even individuals.

clustering attempts to

group individuals in a population together by their similarity, but not driven by any specific purpose. For example: Do my customers form natural groups?
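
A minimal clustering sketch (scikit-learn KMeans on synthetic points; all values here are assumptions): group individuals purely by similarity, with no target variable involved.

from sklearn.cluster import KMeans

# Hypothetical customers described by [age, annual_spend].
X = [[25, 500], [27, 520], [45, 2100], [48, 2300], [33, 900], [31, 880]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # the natural group each customer falls into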

causal modeling attempts to

help us understand what events or actions actually influence others. For example: Why are customers leaving?

Machine Learning is concerned with methods for

improving the knowledge or performance of an intelligent agent over time, in response to the agent's experience in the world. (Such improvement involves analyzing data from the environment and making predictions about unknown quantities.)

In relation to data mining, hypothesis testing can help determine whether an observed pattern is:

likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset.
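
A minimal hypothesis-testing sketch (SciPy; the samples are fabricated): check whether a difference surfaced by mining, say between two customer groups, looks like a general regularity or a chance fluctuation.

from scipy import stats

# Hypothetical monthly revenue for two customer groups identified by mining.
group_a = [120, 135, 128, 140, 125, 132]
group_b = [118, 121, 117, 123, 119, 120]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value)   # a small p-value suggests the difference is not just chance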

Similarity Matching attempts to

identify similar individuals based on information known about them; it underlies one of the most popular methods for making product recommendations.
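
A minimal similarity-matching sketch (scikit-learn NearestNeighbors on invented feature vectors): find the known individuals most similar to a new one, the basic move behind many recommendation approaches.

from sklearn.neighbors import NearestNeighbors

# Hypothetical customers described by a few numeric attributes.
X = [[5, 1, 0], [4, 1, 1], [1, 5, 3], [0, 4, 4], [5, 0, 1]]

nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors([[5, 1, 1]])
print(indices)   # indices of the most similar known individuals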

Supervised tasks require different techniques than unsupervised tasks do and the results are often

much more useful.

On-Line Analytical Processing (OLAP)

performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
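
A minimal sketch of the multidimensional slice-and-dice that OLAP supports (pandas pivot table on fabricated sales rows): aggregate a measure across two dimensions.

import pandas as pd

# Hypothetical sales records: two dimensions (region, product) and one measure (revenue).
sales = pd.DataFrame({
    "region":  ["NE", "NE", "SW", "SW", "NE"],
    "product": ["A",  "B",  "A",  "B",  "A"],
    "revenue": [100,  150,  90,   200,  120],
})

cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns="product", aggfunc="sum")
print(cube)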

The data mining process is a well-understood process that:

places structure on the problem, allowing reasonable consistency, repeatability, and objectivity.

Techniques for causal modeling include those involving a substantial investment in data, such as

randomized controlled experiments (e.g., A/B tests), as well as sophisticated methods for drawing causal conclusions from observational data.

A query is a

specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system.

The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include

statistics on the frequency of the co-occurrence and an estimate of how surprising it is.

In collaboration with business stakeholders, data scientists decompose a business problem into

subtasks. The solutions to the subtasks can then be composed to solve the overall problem.

Clustering is useful in preliminary domain exploration to see which natural groups exist because these groups in turn may:

suggest other data mining tasks or approaches.

data reduction attempts to

take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. (The smaller dataset may be easier to deal with or to process. Moreover, the smaller dataset may better reveal the information.)
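
A minimal data-reduction sketch (scikit-learn PCA on synthetic data): replace a wider set of attributes with a few components that retain most of the important information.

from sklearn.decomposition import PCA

# Hypothetical individuals described by four correlated attributes.
X = [[2.5, 2.4, 1.0, 0.5],
     [0.5, 0.7, 0.2, 0.1],
     [2.2, 2.9, 1.1, 0.6],
     [1.9, 2.2, 0.9, 0.4],
     [3.1, 3.0, 1.3, 0.7]]

pca = PCA(n_components=2)
X_small = pca.fit_transform(X)
print(X_small.shape)                  # (5, 2): a smaller dataset
print(pca.explained_variance_ratio_)  # how much information each component keeps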

Summary statistics should be chosen with close attention to the business problem to be solved and also with attention to:

the distribution of the data they are summarizing.

The value for the target variable for an individual is often called

the individual's label

Scoring, or class probability estimation, uses a model that, when applied to an individual, produces, instead of a class prediction, a score representing:

the probability (or some other quantification of likelihood) that the individual belongs to each class.

While clustering looks at similarity between objects based on the objects' attributes, co-occurrence grouping considers similarity of objects based on:

their appearing together in transactions.

The activity of querying differs fundamentally from data mining in that:

there is no discovery of patterns or models.

Database queries are appropriate when an analyst already has an idea of

what might be an interesting subpopulation of the data, and wants to investigate this population or confirm a hypothesis about it.

For a classification task, a data mining procedure produces a model that, given a new individual, determines

which class that individual belongs to.

Each data-driven business decision-making problem is unique

• Goals, desires, constraints, and even personalities differ from problem to problem.

The Business Process needs to be understood

• Helps structure data mining projects. • The analysis becomes systematic rather than a heroic endeavor that depends on luck, chance, and individual acumen.

Traditional statistical analysis

• Mainly based on hypothesis testing or on estimation and/or quantification of uncertainty, & • Should be used to follow up on data mining's hypothesis generation.

An important skill for a business analyst is to be able to recognize what sort of analytics technique is appropriate for addressing a particular problem, such as when to use

• Statistics, • Database Querying (SQL, GUI, OLAP, QBE), • Data Warehousing (ETL), • Regression, • Forecasting, & • Machine Learning and Data Mining.

analytical skills include the ability to

• formulate problems well, • prototype solutions quickly, • make reasonable assumptions in the face of ill-structured problems, • design experiments that represent good investments, & • analyze the results of an analysis.

