MIS 373 PRED ANALYTICS & DATA MINING


Drawbacks of Content-Based Approaches

1. For some customers, important predictors of the customer's experience may be unknown (opaque). If important predictors for a given consumer are not represented in the training data, it will not be possible to produce accurate predictions for that consumer.
2. Some predictors may be known, but tricky to quantify in a way that captures what drives the customer's preferences.

Regression

A predictive model that predicts the value of a numerical (real-value) variable

Induction

A process by which a pattern is extracted from factual data (experience)

Confidence

A rule's strength is measured by its confidence: how strongly the rule's condition (antecedent) implies its conclusion (consequent).

Data set

A set of examples

The Long Tail

A substantial portion of the demand pertains to a large number of items, for each of which there is relatively little demand

ROC (Receiver Operating Characteristic) Curve

Plots the true positive rate (recall) on the y-axis against the false positive rate on the x-axis. AUC (Area Under the Curve) summarizes the ROC curve as a single number.

Collaborative Filtering

Aims to produce recommendations that tap into tacit drivers of personal taste and judgment. Predicts consumers' preferences without having to understand and gather information about the underlying drivers of those preferences. It does this by finding like-minded neighbors.
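The neighbor-based idea above can be sketched in a few lines. This is a minimal, illustrative user-based collaborative-filtering sketch, not a production recommender; the users, items, and ratings are hypothetical.

```python
# Minimal user-based collaborative filtering sketch (illustrative only).
# Ratings: user -> {item: rating}; all names here are hypothetical.
from math import sqrt

ratings = {
    "ana":  {"pizza": 5, "sushi": 1},
    "ben":  {"pizza": 4, "sushi": 2, "tacos": 5},
    "carl": {"pizza": 1, "sushi": 5, "tacos": 2},
}

def cosine_sim(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of the neighbors' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine_sim(ratings[user], r)
        num += s * r[item]
        den += abs(s)
    return num / den if den else None
```

Because "ana" rates items much like "ben" and unlike "carl", the prediction for an item ana has not rated leans toward ben's rating, without ever modeling *why* ana likes what she likes.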

Unsupervised Learning

All modeling tasks which are not used to predict/estimate an unknown value (Clustering/segmentation)

Subtree

Branching from a node. Captures predictive patterns that fit a sub-population

N-Fold Cross-validation

CV is an evaluation methodology (*not* a model induction process):
1. Randomly partition the data into N equally sized sets (folds)
2. Perform N repeated experiments of model building and evaluation. In each experiment:
   a) Hold out one fold as the test set
   b) Induce a model from the remaining (N-1) folds
   c) Evaluate the performance of the model on the held-out fold
3. Average the performance of the N experiments
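The steps above can be sketched as a small function. This is a minimal sketch assuming the caller supplies an `induce(train)` function and an `evaluate(model, test)` function (both hypothetical names), rather than any particular learning algorithm.

```python
import random

def n_fold_cv(examples, n, induce, evaluate, seed=0):
    """Evaluate a modeling procedure with N-fold cross-validation.

    induce(train)          -> builds a model from the training examples
    evaluate(model, test)  -> returns a performance score on the test examples
    """
    data = examples[:]
    random.Random(seed).shuffle(data)          # 1. random partition
    folds = [data[i::n] for i in range(n)]     #    into N roughly equal folds
    scores = []
    for i in range(n):                         # 2. N repeated experiments
        test = folds[i]                        #    a) hold out one fold
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = induce(train)                  #    b) induce from N-1 folds
        scores.append(evaluate(model, test))   #    c) evaluate on held-out fold
    return sum(scores) / n                     # 3. average the N scores
```

Note that every example is used for testing exactly once, and for training N-1 times.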

Detecting Overfitting

Cannot be detected if we evaluate the model using the training data. We must evaluate performance on a representative test sample.

Information Gain

Captures how informative the attribute is: Information gain = impurity(parent) - weighted average impurity(children)
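The formula above can be made concrete with a short sketch, using entropy as the impurity measure (as elsewhere in this course); the example labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a group: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Impurity(parent) minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted
```

A split that produces pure children yields the maximum gain; a split that leaves the class mix unchanged yields zero gain.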

Support

Captures the significance of the association: What proportion of the transactions reflect this rule?
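Support, together with the confidence measure defined earlier, can be computed directly from a list of transactions. A minimal sketch, with hypothetical market-basket transactions:

```python
# Support and confidence of an association rule A -> B over transactions.
# The transactions and items below are hypothetical.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset):
    """Proportion of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """How strongly the antecedent implies the consequent:
    support(A union B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)
```

Minimum support and minimum confidence thresholds simply filter the candidate rules by these two numbers.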

Learning Curve

Characterizes how test accuracy (y-axis) improves as the training set size (x-axis) increases. Using only a portion of the data set to train a model for evaluation may yield an overly pessimistic evaluation of the model.

Lift Chart

Characterizes targeting performance with the model:
- y-axis: the number (or percent) of responses
- x-axis: the number of solicitations (or percent of solicitations out of the total number of customers)
The higher the lift, the better.

Classification

Class prediction

Recall

Considering all the examples from the positive class: what proportion of these examples does the model classify correctly? (Predicted Bad Risk and is Bad Risk / All actual Bad Risk)

Training Data

Data used to induce (train) a model

Confusion matrix

Details the different types of errors that the model makes and their frequency

Nodes

Each "non-terminal" node represents a test on an attribute

How to extract rules from a classification tree model

Each path from the root of the tree (top node) to a leaf node constitutes a rule: IF (Refund = Yes) AND (Marital Status = Married) THEN "No"

Classification Model

Includes a set of IF (condition) THEN (class) rules

Linear Regression

Is an induction algorithm

Random Forest

Like bagging, except for one thing: When constructing a tree, not all available attributes are being considered in each tree. Rather, only a subset of randomly selected attributes are considered.

Test Accuracy

Model's predictive performance on (out-of-sample) test data

Overfitting

Occurs when a model captures not only the regularities in the data, but also the peculiarities in the training data. When a model overfits the training data, it fits patterns that would undermine its predictive performance on test data. If the training data is used for evaluation, the model that overfits most has higher accuracy (yet, this model's test/generalization accuracy is worse!).

Underfitting

Occurs when the model is too simple to capture the complex patterns in the data. Both training and test errors are large. As the model grows (becomes more complex), performance on the validation set improves.

Minimum Confidence Threshold

Only rules with at least the specified minimum confidence are presented

Clustering

Partition data into cohesive groups. Provides high-level understanding of consumer by grouping them by similar usage behavior.

What examples should be used to evaluate the model?

Performance should be evaluated on examples: a) that were not used to induce the model, and b) whose true class is known.

Classification Accuracy Rate

Proportion of examples whose class is predicted accurately by the model. S = number of examples accurately classified by the model; N = total number of examples. Classification accuracy rate = S/N
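The S/N formula is a one-liner in code. A minimal sketch (the function name is hypothetical):

```python
def classification_accuracy(actual, predicted):
    """S / N: the proportion of examples the model classifies correctly."""
    s = sum(1 for a, p in zip(actual, predicted) if a == p)  # S: correct predictions
    return s / len(actual)                                   # N: total examples
```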

Bagged Classification Decision Trees

Pros:
- Similar to classification trees
- Can capture complex patterns
- Predictions are less likely to be undermined by overfitting, by "filtering" outliers and diminishing their adverse effects on modeling
Cons:
- Less simple model (not as comprehensible as a single tree)

Entropy

Quantifies the level of impurity (or uncertainty) in a group of examples. Entropy = -sum over classes of (proportion x log2 proportion). (High entropy = bad; 0 entropy = 100% predictability)

How to avoid overfitting

Separate the training data into training and validation sets (in addition to the test set), or specify a minimum number of examples per leaf node.

Sum of Squared Errors (SSE)

Sum of the distance between each example and its cluster's centroid
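The SSE definition above (squared distances to each cluster's centroid) can be sketched for 2-D points; the clustering itself is assumed to have been done already.

```python
def sse(clusters):
    """Sum over all clusters of the squared distances from each point
    to its cluster's centroid. Points are (x, y) tuples; clusters is a
    list of point lists."""
    total = 0.0
    for points in clusters:
        cx = sum(p[0] for p in points) / len(points)  # centroid x
        cy = sum(p[1] for p in points) / len(points)  # centroid y
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points)
    return total
```

Lower SSE means tighter (more cohesive) clusters, which is what k-means-style clustering minimizes.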

Leaves

Terminal nodes; each leaf represents a prediction of the classification tree

Minimum Support Threshold

The minimum number of cases in which the rule must hold

Base Rate

The proportion of examples from the majority class

Predictive Model

A model that predicts/estimates the value of a target (dependent) variable; in classification tasks, the target variable is discrete (categorical).

Precision

When the model tells us that an example belongs to the positive class, how often is it correct? (Predicted Bad Risk and is Bad Risk/ All Predicted bad Risk)
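Precision and the recall measure defined earlier both come from the same confusion-matrix counts. A minimal sketch, with hypothetical labels matching the Bad Risk example:

```python
def precision_recall(actual, predicted, positive="bad"):
    """Precision = TP / all predicted positive; Recall = TP / all actual positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)
```

Precision asks "of the examples flagged Bad Risk, how many really are?"; recall asks "of the actual Bad Risks, how many did we catch?".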

Recursive Partitioning

With each partition the examples are split into subgroups that have "increasingly more pure" class distribution (used for classification trees)

Class probability estimation (CPE)

the probability with which an example belongs to a certain class

Challenges of Collaborative Filtering

- Works well once a "critical mass" of data is available: requires a very large number of consumer ratings over a relatively large number of products
- CF requires the availability of ratings, which are costly to acquire
- Consumer input is difficult to get
- Cold start: a new customer with no prior ratings, or a new product/service with no prior ratings

What is a model?

- A concise description of a pattern (relationship) that exists in data
- Also referred to as a theory
- A general pattern induced from data

Classification trees

- Easy to understand the relationships in the data captured by the model - Computationally fast to induce from data - Constructed by recursively partitioning the examples in the data

Supervised learning

- Objective is to estimate/predict an unknown value - Model captures a relationship between a set of independent attributes (predictors) and a dependent attribute (target)

Training accuracy

Evaluation of the model's predictive performance on the training examples used to induce the model

Sequence Analysis

Find patterns in time-stamped data

Link Analysis: Association Rules

Finds relations among attributes in the data that frequently co-occur

Profit Chart

For evaluating the benefits (profits) from strategies that rely on the model's ranking of examples, when the costs and benefits of correct/incorrect targeting are known.

Content Based Approach

A predictive model is induced for a given customer: the model uses restaurant characteristics to predict the customer's ratings (target variable).

Naive Bayes

IF: A story contains many words that are often found in negative stories, and which are infrequent in positive stories, THEN: Classify the story as negative

Clustering/Segmentation Analysis

Identifies distinct groups or cluster of "similar" instances
