MIS 373 PRED ANALYTICS & DATA MINING
Drawbacks of Content-Based Approaches
1. For some customers, important predictors of the customer's experience may be opaque. If important predictors for a given consumer are not represented in the training data, it will not be possible to produce accurate predictions for this consumer.
2. Some predictors may be known, but tricky to quantify in a way that captures what drives the customer's preferences.
Regression
A predictive model that predicts the value of a numerical (real-value) variable
Induction
A process by which a pattern is extracted from factual data (experience)
Confidence
A rule's strength is measured by its confidence: how strongly the condition (IF part) implies the conclusion (THEN part)
Data set
A set of examples
The Long Tail
A substantial portion of the demand pertains to a large number of items, for each of which there is relatively little demand
ROC (Receiver Operating Characteristic) Curve
Plots the true positive rate (recall) on the y-axis against the false positive rate on the x-axis; the area under this curve is the AUC
Collaborative Filtering
Aims to produce recommendations that tap into tacit drivers of personal taste and judgment. Predicts consumers' preferences without having to understand and gather information about the underlying drivers of those preferences. It does this by finding like-minded neighbors.
Unsupervised Learning
All modeling tasks which are not used to predict/estimate an unknown value (Clustering/segmentation)
Subtree
Branching from a node. Captures predictive patterns that fit a sub-population
N-Fold Cross-validation
CV is an evaluation methodology (*not* a model induction process):
1. Randomly partition the data into N equally sized sets (folds)
2. Perform N repeated experiments of model building and evaluation. In each experiment:
   a) Hold out one fold as the test set
   b) Induce a model from the remaining (N-1) folds
   c) Evaluate the performance of the model on the held-out fold
3. Average the performance across the N experiments
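The steps above can be sketched in Python (stdlib only). This is a minimal sketch, not course code: `induce` and `evaluate` are hypothetical caller-supplied functions.

```python
import random

def n_fold_cv(examples, n, induce, evaluate):
    """N-fold cross-validation of a model-induction procedure.

    induce(train) -> model; evaluate(model, test) -> accuracy in [0, 1].
    Both are hypothetical caller-supplied functions.
    """
    data = examples[:]
    random.shuffle(data)                        # 1. random partition
    folds = [data[i::n] for i in range(n)]      # n roughly equal folds
    scores = []
    for i in range(n):                          # 2. n experiments
        test = folds[i]                         # a) hold out one fold
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = induce(train)                   # b) induce from n-1 folds
        scores.append(evaluate(model, test))    # c) evaluate on held-out fold
    return sum(scores) / n                      # 3. average performance
```

For example, plugging in a trivial majority-class "model" gives an averaged out-of-sample score rather than a single lucky (or unlucky) train/test split.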
Detecting Overfitting
Cannot be detected if we evaluate the model using the training data. We must evaluate performance on a representative test sample.
Information Gain
Captures how informative the attribute is: Information Gain = Impurity(parent) - weighted average Impurity(children)
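A minimal Python sketch of this formula, using entropy as the impurity measure (as in the Entropy card); the function names are mine:

```python
from math import log2

def entropy(labels):
    """Impurity of a group: -sum(p * log2(p)) over the class proportions p."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_avg = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_avg
```

A perfect split of a 50/50 parent into two pure children yields a gain of 1.0 bit.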
Support
Captures the significance of the association: What proportion of the transactions reflect this rule?
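Support, together with the confidence measure defined above, can be computed directly from a list of transactions. A minimal sketch, assuming each transaction is a Python set; the function names are mine:

```python
def support(transactions, items):
    """Proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, condition, result):
    """Of the transactions matching `condition`, the fraction that also
    contain `result` -- how strongly the condition implies the result."""
    both = set(condition) | set(result)
    return support(transactions, both) / support(transactions, condition)
```
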
Learning Curve
Characterizes how test accuracy (y-axis) improves as the training set size (x-axis) increases. Using only a portion of the data set to train a model for evaluation may yield an overly pessimistic evaluation of the model.
Lift Chart
Characterizes targeting performance with the model:
- y-axis: the number (or percent) of responses
- x-axis: the number of solicitations (or percent of solicitations out of the total number of customers)
The higher the lift, the better.
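The y-axis values of such a chart can be computed from examples ranked best-first by model score. A minimal sketch; the helper name is mine:

```python
def cumulative_responses(ranked_labels):
    """y-axis of a lift chart: cumulative number of responses after each
    solicitation, for examples ranked best-first by the model's score.
    Labels: 1 = responded, 0 = did not respond."""
    points, total = [], 0
    for label in ranked_labels:
        total += label
        points.append(total)
    return points
```

A model with high lift concentrates the 1s early in the ranking, so the curve rises steeply at first.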
Classification
Class prediction
Recall
Considering all the examples from the positive class: what proportion of these examples does the model classify correctly? (Predicted Bad Risk and is Bad Risk / All Actual Bad Risk)
Training Data
Data used to induce (train) a model
Confusion matrix
Details the different types of errors that the model makes and their frequency
Nodes
Each "non-terminal" node represents a test on an attribute
How to extract rules from a classification tree model
Each path from the root of the tree (top node) to a leaf node constitutes a rule: IF (refund = yes) & (Marital Status = Married) THEN "NO"
Classification Model
Includes a set of IF (condition) THEN (class) rules
Linear Regression
Is an induction algorithm
Random Forest
Like bagging, except for one thing: when constructing each tree, not all available attributes are considered. Rather, only a subset of randomly selected attributes is considered.
Test Accuracy
Model's predictive performance on (out-of-sample) test data
Overfitting
Occurs when a model captures not only the regularities in the data, but also the peculiarities in the training data. When a model overfits the training data, it fits patterns that would undermine its predictive performance on test data. If the training data is used for evaluation, the model that overfits most has higher accuracy (yet, this model's test/generalization accuracy is worse!).
Underfitting
Occurs when the model is too simple to capture the complex patterns in the data. Both training and test errors are large. As the model grows more complex, performance on the validation set improves.
Minimum Confidence Threshold
Only rules with at least the specified minimum confidence are presented
Clustering
Partition data into cohesive groups. Provides a high-level understanding of consumers by grouping them by similar usage behavior.
What examples should be used to evaluate the model?
Performance should be evaluated on examples: a) that were not used to induce the model, and b) whose true class is known
Classification Accuracy Rate
Proportion of examples whose class is predicted accurately by the model. S = number of examples accurately classified by the model; N = total number of examples. Classification accuracy rate = S/N
Bagged Classification Decision Trees
Pros:
- Similar to classification trees
- Can capture complex patterns
- Predictions are less likely to be undermined by overfitting, since bagging "filters" outliers and diminishes their adverse effects on modeling
Cons:
- A less simple model (not as comprehensible as a single tree)
Entropy
Quantifies the level of impurity (or uncertainty) in a group of examples. Entropy = - sum over classes of (proportion x log2(proportion)). (High entropy = impure/unpredictable; 0 entropy = 100% predictability)
How to avoid overfitting
Separate the training data into training and validation sets (in addition to the test set), or specify a minimum number of examples per leaf node
Sum of Squared Errors (SSE)
Sum of the squared distances between each example and its cluster's centroid
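A minimal sketch of this computation, assuming clusters are lists of numeric tuples; the function name is mine:

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for points in clusters:
        dims = range(len(points[0]))
        # Centroid: coordinate-wise mean of the cluster's points
        centroid = [sum(p[d] for p in points) / len(points) for d in dims]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in dims)
    return total
```
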
Leaves
Terminal nodes - a prediction on a classification tree
Minimum Support Threshold
The minimum number of cases in which the rule must hold
Base Rate
The proportion of examples from the majority class
Predictive Model
A model used to estimate/predict the unknown value of a target (dependent) variable; when the target variable is discrete (categorical), the task is classification
Precision
When the model tells us that an example belongs to the positive class, how often is it correct? (Predicted Bad Risk and is Bad Risk/ All Predicted bad Risk)
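Precision, and the recall measure defined above, can both be read off the confusion-matrix counts. A minimal sketch using the deck's "Bad Risk" example as the positive class; the function name is mine:

```python
def precision_recall(actual, predicted, positive="Bad Risk"):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)
```
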
Recursive Partitioning
With each partition the examples are split into subgroups that have "increasingly more pure" class distribution (used for classification trees)
Class probability estimation (CPE)
the probability with which an example belongs to a certain class
Challenges of Collaborative Filtering
- Works well once a "critical mass" of data is available: requires a very large number of consumer ratings over a relatively large number of products
- CF requires the availability of ratings, which are costly to acquire
- Consumer input is difficult to get
- Cold start: a new customer with no prior ratings, or a new product/service with no prior ratings
What is a model?
- A concise description of a pattern (relationship) that exists in data
- Also referred to as a theory
- A general pattern induced from data
Classification trees
- Easy to understand the relationships in the data captured by the model - Computationally fast to induce from data - Constructed by recursively partitioning the examples in the data
Supervised learning
- Objective is to estimate/predict an unknown value - Model captures a relationship between a set of independent attributes (predictors) and a dependent attribute (target)
Training accuracy
Evaluation of the model's predictive performance on the training examples used to induce the model
Sequence Analysis
Find patterns in time-stamped data
Link Analysis: Association Rules
Finds relations among attributes in the data that frequently co-occur
Profit Chart
For evaluating the benefits (profits) of strategies that rely on the model's ranking of examples, when the costs and benefits of correct/incorrect targeting are known
Content Based Approach
Generating a predictive model induced for a given customer: the model uses restaurant characteristics to predict the customer's ratings (target variable)
Naive Bayes
IF: A story contains many words that are often found in negative stories, and which are infrequent in positive stories, THEN: Classify the story as negative
Clustering/Segmentation Analysis
Identifies distinct groups or clusters of "similar" instances
