Example classification question
Among all the customers of the company, which customers are likely to respond to a given offer? In this example, the two classes could be called "will respond" and "will not respond."
Example clustering question
"Do our customers form natural groups or segments?" Clustering also is used as input to decision-making processes focusing on questions such as: What products should we offer or develop?
Example regression question
"How much will a customer use a service?" The property (variable) to be predicted here is service usage.
Example link prediction question
"Since you and Becky share 10 friends, maybe you'd like to be Becky's friend?"
Example profiling question
"What is the typical cell phone usage of this customer segment?"
Laplace correction
"smoothed" version of the frequency-based estimate, the purpose of which is to moderate the influence of leaves with only a few instances
Avoiding overfitting with tree induction
(i) stop growing the tree before it gets too complex, and (ii) grow the tree until it is too large, then "prune" it back, reducing its size (and thereby its complexity).
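A minimal sketch of both strategies, assuming scikit-learn is available (the dataset, depth cap, and pruning strength are illustrative choices, not the book's):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (i) Stop early: cap tree depth so the tree cannot grow too complex.
pre_pruned = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# (ii) Grow fully, then prune back via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))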
Vital parts in the early stages of the data mining process
(i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining.
Information gain formula
IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + ⋯]
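A short self-contained sketch (mine, not the book's) that computes this formula, where p(c1) and p(c2) are the proportions of parent instances going to each child:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ['yes'] * 10 + ['no'] * 10          # entropy(parent) = 1.0
children = [['yes'] * 8 + ['no'] * 2,        # child c1 after the split
            ['yes'] * 2 + ['no'] * 8]        # child c2
print(information_gain(parent, children))    # about 0.28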
Overfitting
If you look hard enough at a set of data, you will find some sort of pattern, but it might not generalize beyond the data you are looking at. The need to detect and avoid this is one of the most important concepts when applying data mining to real problems.
Entropy in charts
In a chart, the highest possible entropy corresponds to the entire area being shaded; the lowest possible entropy corresponds to the entire area being white.
Example co-occurrence grouping question
What items are commonly purchased together?
Data analytic thinking
When faced with a business problem, you should be able to assess whether and how data can improve performance
Clustering vs. co-occurrence grouping
While clustering looks at similarity between objects based on the objects' attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
Predictive model
a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule; often it is a hybrid of the two. In parametric modeling, the formula is a partially specified equation: a numeric function of the data attributes, with some unspecified numeric parameters. The task of the data mining procedure is then to "fit" the model to the data by finding the best set of parameters, in some sense of "best."
Entropy
a measure of disorder: how mixed (impure) the segment is with respect to the properties of interest. For example, a mixed-up segment with lots of write-offs and lots of non-write-offs would have high entropy; a pure segment has zero entropy.
Linear discriminant function
a numeric classification model
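In the book's notation, such a function is a weighted sum of the attribute values, for example f(x) = w0 + w1 × x1 + w2 × x2 + ⋯, with the class decided by whether f(x) falls above or below a threshold (typically zero).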
Data Science
a set of fundamental principles that guide the extraction of knowledge from data. Data science involves principles, processes, and techniques for understanding phenomena via the analysis of data; its ultimate goal is to improve decision making.
Classification and class probability estimation
attempts to predict, for each individual in a population, which of a (small) set of classes that individual belongs to. For a classification task, a data mining procedure produces a model that, given a new individual, determines which class that individual belongs to; for class probability estimation, the model produces a score representing the probability that the individual belongs to each class.
Profiling (behavior description)
attempts to characterize the typical behavior of an individual, group, or population. Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions into computer systems. For example, if we know what kind of purchases a person typically makes on a credit card, we can determine whether a new charge on the card fits that profile or not.
Regression (value estimation)
attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. For a regression task, a data mining procedure produces a model that, given an individual, estimates the value of the particular variable specific to that individual.
Co-occurrence grouping
attempts to find associations between entities based on transactions involving them. The result is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
Clustering
attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. Clustering is useful in preliminary domain exploration to see which natural groups exist, because these groups, in turn, may suggest other data mining tasks or approaches.
Causal modeling
attempts to help us understand what events or actions actually influence others. (Did A cause B or did B cause A?)
Similarity matching
attempts to identify similar individuals based on data known about them. Similarity matching is the basis for one of the most popular methods for making product recommendations.
Link prediction
attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
Data reduction
attempts to take a large set of data and replace it with a smaller set that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or to process. Data reduction usually involves loss of information; what is important is the trade-off for improved insight.
Steps of the data mining process
business understanding, data understanding, data preparation, modeling, evaluation, deployment
Purity
homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure. If at least one member of the group has a different value for the target variable than the rest, then the group is impure.
Supervised segmentation
how can we segment the population with respect to something that we would like to predict or estimate? We would like to find knowable attributes that correlate with the target of interest, that is, attributes that reduce our uncertainty in it.
Frequency-based estimate of class membership probability
if a leaf contains n positive instances and m negative instances, the probability of any new instance being positive may be estimated as n/(n + m).
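A quick worked example: a leaf with 3 positive and 1 negative instances gives an estimated probability of 3/(3 + 1) = 0.75 that a new instance reaching that leaf is positive. When n + m is small this estimate is unreliable, which is what the Laplace correction above smooths.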
Data preparation
in this stage, data are manipulated and converted into forms that yield better results. Typical examples are converting data to tabular format, removing or inferring missing values, and converting data to different types. Beware of leaks (defined below).
Modeling
in this stage, data mining techniques are applied to the data
Data understanding
in this stage, it is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem Some data will be available virtually for free while others will require effort to obtain. Some data may be purchased. Other data simply won't exist and will require collection. A critical part of this stage is estimating the costs and benefits of each data source and deciding whether further investment is merited.
Business understanding
in this stage, the design team should think carefully about the use scenario. What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models?
Deployment
in this stage, the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
Regression vs. Classification
informally, classification predicts whether something will happen, whereas regression predicts how much something will happen. Regression involves a numeric target; classification involves a categorical (often binary) target.
Model
is a simplified representation of reality created to serve a purpose
What does it mean for a variable to be informative
it reduces uncertainty about something
Variance
a measure of impurity for numeric target variables. If the set has all the same values for the numeric target variable, then the set is pure and the variance is zero. If the numeric target values in the set are very different, then the set will have high variance.
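A tiny illustration (mine) using Python's standard library:

import statistics
print(statistics.pvariance([10, 10, 10]))   # 0.0    -> pure set
print(statistics.pvariance([1, 10, 100]))   # 1998.0 -> impure set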
Information gain (IG)
measures how much an attribute improves (decreases) entropy over the whole segmentation it creates. Strictly speaking, it measures the change in entropy due to any amount of new information being added.
Supervised learning
model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function of the features. So, for our churn-prediction problem we would like to build a model of the propensity to churn as a function of customer account attributes, such as age, income, length of time with the company, etc.
Unsupervised methods
Has no specific target. Clustering, co-occurrence grouping, and profiling are generally unsupervised.
Importance of understanding data science
Data analysis is critical to business strategy. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking not only will allow one to interact competently, but will help to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats. Note: Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.
4 A's of data science
data architecture, data acquisition, data analysis, data archiving
Fundamental concept
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages. The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-DM Project, 2000), is one codification of this process. Keeping such a process in mind provides a framework to structure our thinking about data analytics problems.
Data mining
Extraction of knowledge from data, via technologies that incorporate the principles of data science
Supervised methods
has a specific purpose for grouping: predicting the target. Note that there must be data on the target. Classification, regression, and causal modeling are generally solved with supervised methods.
Evaluation
in this stage, the goal is to assess the data mining results rigorously and to gain confidence that they are valid, reliable, and comprehensible before moving on. Evaluation also helps to ensure that the model satisfies the original business goals.
Some applications of data mining
Marketing: targeted marketing, online advertising, recommendations for cross-selling. Customer relationship management: analyzing customer behavior in order to manage attrition and maximize expected customer value. Finance: credit scoring, trading, and fraud detection.
Data-driven decision-making (DDD)
refers to the practice of basing decisions on the analysis of data rather than purely on intuition. DDD is correlated with higher return on assets, return on equity, and market value.
Methods that could be either supervised or unsupervised
Similarity matching, link prediction, and data reduction
Statistics and data
Statistics helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions. In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset. We often also want to provide confidence intervals around estimates.
Support Vector Machine (SVM)
The SVM's objective function incorporates the idea that a wider bar is better. Once the widest bar is found, the linear discriminant is the center line through the bar. The distance between the bar's parallel edges (the dashed parallel lines in the book's figures) is called the margin around the linear discriminant, and thus the objective is to maximize the margin.
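A minimal sketch, assuming scikit-learn (the synthetic dataset and C value are illustrative): fit a linear SVM and recover the margin width, which for a linear discriminant with weights w is 2 / ||w||.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

# Width of the "bar" around the linear discriminant: 2 / ||w||.
w = clf.coef_[0]
print('margin width:', 2 / np.linalg.norm(w))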
Churn
customers switching from one company to another
Holdout data
data for which we know the value of the target variable, but which will not be used to build the model. These are not the actual use data, for which we ultimately would like to predict the value of the target variable. Instead, creating holdout data is like creating a "lab test" of generalization performance. We will simulate the use scenario on these holdout data: we will hide from the model (and possibly the modelers) the actual values for the target on the holdout data. The model will predict the values. Then we estimate the generalization performance by comparing the predicted values with the hidden true values.
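A minimal sketch, assuming scikit-learn (the dataset and split fraction are illustrative): hold out 30% of the labeled data, train only on the rest, and score on the holdout to estimate generalization performance.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# The model never sees y_holdout during training.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)
print('estimated generalization accuracy:', model.score(X_holdout, y_holdout))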
Data as an asset
data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets. Once we view data as a business asset, we should think about whether and how much we are willing to invest in it.
Big Data
datasets that are too large for traditional data processing systems, and therefore require new processing technologies. Use of big data technologies is associated with significant additional productivity growth.
Entropy formula
entropy = -p1 × log2(p1) - p2 × log2(p2) - ⋯ Each pi is the probability (the relative percentage) of property i within the set: pi = 1 when all members of the set have property i, and pi = 0 when no members of the set have property i.
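A worked example with two classes: a 50/50 mixed set has entropy = -(0.5 × log2(0.5) + 0.5 × log2(0.5)) = 1, the maximum for two classes; a pure set has entropy = -(1 × log2(1)) = 0.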
Purity measure
evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable.
Classification tree/decision tree induction
recursive process of divide and conquer, where the goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible with respect to the target variable. We perform this partitioning recursively, splitting further and further until we are done. We choose the attributes to split upon by testing all of them and selecting whichever yields the purest subgroups.
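A minimal sketch, assuming scikit-learn (the iris dataset and depth cap are illustrative): induce a tree whose splits are chosen by entropy-based purity, then print the recursive partitioning it found.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3).fit(X, y)
print(export_text(tree))   # the splits, printed as nested if/else rules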
Regularization
reining in model complexity to avoid overfitting
Learning curve
shows model performance on testing data plotted against the amount of training data used.
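A minimal sketch, assuming scikit-learn (the dataset, model, and training-size grid are illustrative choices, not the book's):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())
sizes, _, test_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(n, 'training examples ->', round(score, 3))   # performance vs. data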
Fitting graph
shows the accuracy of a model on training and holdout data as a function of model complexity
Leak
situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
Linear modeling techniques
SVMs, logistic regression, linear regression
Model induction
the creation of models from data. Note: the procedure that creates the model from the data is called the induction algorithm, and the input data used for inducing the model are called the training data.
Parameter learning/parametric modeling
the goal of data mining is to tune the parameters so that the model fits the data as well as possible
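A tiny sketch (mine): least-squares fitting of the two parameters in the linear model y = w1 × x + w0.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
w1, w0 = np.polyfit(x, y, 1)          # the "best" parameters, by least squares
print('fitted parameters:', w1, w0)   # close to 2 and 0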
Label
the name for the value of the target variable for an individual
Descriptive model
the primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process. (can tell us what something typically looks like)
Generalization
the property of a model or modeling process, whereby the model applies to data that were not used to build the model