Data Mining Test 1
Specificity
% of actual c0s correctly classified
Sensitivity
% of actual c1s correctly classified
False negative rate
% of predicted c0s that were not c0s
False positive rate
% of predicted c1s that were not c1s
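A minimal Python sketch of how all four measures come out of a 2x2 confusion matrix; the counts here are hypothetical, and c1 is the important class:

    # Hypothetical confusion-matrix counts; c1 is the important class
    tp = 80   # actual c1, predicted c1
    fn = 20   # actual c1, predicted c0
    tn = 850  # actual c0, predicted c0
    fp = 50   # actual c0, predicted c1

    sensitivity = tp / (tp + fn)          # % of actual c1s correctly classified
    specificity = tn / (tn + fp)          # % of actual c0s correctly classified
    false_positive_rate = fp / (fp + tp)  # % of predicted c1s that were not c1s
    false_negative_rate = fn / (fn + tn)  # % of predicted c0s that were not c0s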
Obtaining Data: Sampling
-Algorithms & models are typically applied to a sample from a database -Once you develop and select a final model, you use it to score the observations in the larger database
Unsupervised Learning
-No target to predict or classify -Segment data into meaningful segments; detect patterns -Association rules -Data reduction -Data exploration -Visualization
Cross-market analysis
-Associations/correlations between product sales -Prediction based on the association information -Sell bundled products and services
Statistical summary of data: common metrics
-Average -Median -Minimum -Maximum -Standard deviation -Counts and percentages
Graphs for data exploration
-Basic plots -Line graphs -Bar charts -Scatterplots -Distribution plots -Box plots -Histograms
Lift vs. Decile charts
-Both embody concept of moving down through the records, starting with the most probable -Decile chart does this in decile chunks of data -Lift chart shows continuous cumulative results
Supervised Learning
-Classification -Prediction -Predict a single target or outcome variable -Training data -Score to data where value is known
Naive rule
-Classify all records as belonging to the most prevalent class -Often used as a benchmark -Equivalent to predicting everyone has a value equal to the mean -EXCEPTION: when the goal is to identify high-value but rare outcomes, a model may do well (find the rare cases) while doing worse than the naive rule on overall error
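As a quick illustration, a minimal sketch (with made-up class labels) of using the naive rule as a benchmark error rate:

    import numpy as np

    actual = np.array([0, 0, 0, 1, 0, 0, 1, 0])  # made-up class labels

    # Naive rule: classify every record as the most prevalent class (here, 0)
    naive_pred = np.full(actual.shape, np.bincount(actual).argmax())

    # Benchmark error rate that any real model should beat
    benchmark_error = np.mean(naive_pred != actual)  # 0.25 here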
How to use lift and decile charts
-Compare lift to no model baseline -In lift: compare step function to straight line -In decile: compare to ratio of 1
Numeric
-Continuous -Integer -May need to bin into categories
Pivot Tables
-Counts and percentages are useful for summarizing categorical data -Averages are useful for summarizing grouped numerical data
How does PCA work?
-Creates new variables that are linear combinations of the original variables -Linear combinations are uncorrelated, only a few contain most of the original information
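A minimal sketch using scikit-learn's PCA on toy data (the library choice and data are assumptions; the idea is the same in any implementation):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 6)                 # toy data: 100 records, 6 numeric variables
    X_std = StandardScaler().fit_transform(X)  # standardize so no variable dominates

    pca = PCA(n_components=2)             # keep the 2 combinations with most information
    scores = pca.fit_transform(X_std)     # new, uncorrelated variables
    print(pca.explained_variance_ratio_)  # share of original variance each retains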
What's the buzz about data mining?
-Data is already being produced -Data is being warehoused/stored -Computing power is available and affordable -Electronic storage media and capacity are available and affordable -Complex algorithms are now accessible via commercial software
Drowning in data; Starving for knowledge
-Data is collected on almost everything -Credit/debit transactions -Grocery purchases -Electronic ticketing
Data Exploration
-Data sets are typically large, complex, & messy -Need to review the data to help refine the task -Use techniques of reduction and visualization
Pre-processing variable types
-Determine the types of pre-processing needed, based on the algorithms to be used -Categorical vs. numeric
Unsupervised: Data reduction
-Distillation of complex/large data into simpler/smaller data -Reducing the number of variables/columns (principal components) -Reducing the number of records
Misclassification error
-Error -Error rate
Principal components analysis (PCA)
-Goal: Reduce a set of numerical variables -Idea: Remove the overlap of information between variables -Product: A smaller number of numerical variables that contain most of the information
Lift and decile charts
-Goal: useful for assessing performance in terms of identifying the most important class -Compare performance of DM model to 'no model, pick randomly' -Measures ability of DM model to identify the important class, relative to its average prevalence -Give explicit assessment of results over a large number of cutoffs
Unsupervised: Data Visualization
-Graphs and plots of data -Histograms, boxplots, bar charts, scatterplots -Useful to examine relationships between pairs of variables
Visualizations
-Help reveal interesting trends that might otherwise go unseen -Help quickly and efficiently convey what you would like others to concentrate on
Separation of records
-High separation of records means that using the predictor variables attains low error -Low separation of records means that using the predictor variables does not improve much on a naive rule
Identifying customer requirements
-Identifying the best products for different customers -Use prediction to find what factors will attract new customers
Detecting Outliers
-Important step -An outlier is an observation that is extreme, being distant from the rest of the data -Outliers have disproportionate influence on models -Need to determine whether it is an error or truly extreme -Anomaly detection (airport security screening)
Questions data mining can answer
-Loyal customer vs. jump ship (churn) -Products to be marketed -Offer responses -Best telemarketing script -Next branch location -Next product/service
Data explosion problem
-Mature database technology has led to tremendous amounts of data being stored in databases, data warehouses, and other information repositories -Higher record complexity
Some measures of error
-Mean absolute error/deviation (MAE/D) -Average error -Mean absolute percentage error (MAPE) -Root mean squared error (RMSE) -Total sum of squared errors (Total SSE)
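A minimal sketch computing each measure on made-up actual/predicted values:

    import numpy as np

    actual = np.array([10.0, 12.0, 8.0, 15.0])     # made-up values
    predicted = np.array([11.0, 10.0, 9.0, 14.0])
    err = actual - predicted

    mae = np.mean(np.abs(err))                  # mean absolute error/deviation
    avg_err = np.mean(err)                      # average error (signs can cancel out)
    mape = np.mean(np.abs(err / actual)) * 100  # mean absolute percentage error
    rmse = np.sqrt(np.mean(err ** 2))           # root mean squared error
    total_sse = np.sum(err ** 2)                # total sum of squared errors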
Competition assessment
-Monitor competitors and market directions -Group customers into classes and a class-based pricing procedure -Set pricing strategy in a highly competitive market
Jittering
-Moving markers by a small random amount -Uncrowds the data by allowing more markers to be seen
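A minimal matplotlib sketch, with toy integer data chosen so that markers overlap:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.randint(1, 5, 200)  # integer-valued data: many markers coincide
    y = np.random.randint(1, 5, 200)

    # Jitter: move each marker by a small random amount so all can be seen
    jx = x + np.random.uniform(-0.15, 0.15, x.size)
    jy = y + np.random.uniform(-0.15, 0.15, y.size)
    plt.scatter(jx, jy, alpha=0.5)
    plt.show()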
Measuring predictive error
-Not the same as goodness of fit -Want to know how well the model predicts new data
Handling missing data
-Omission of the data -Imputation
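A minimal pandas sketch of both options on a made-up column:

    import pandas as pd

    df = pd.DataFrame({"income": [50.0, None, 75.0, None, 60.0]})  # made-up data

    # Omission: drop the records with missing values
    dropped = df.dropna()

    # Imputation: replace missing values with a reasonable substitute (here, the median)
    df["income"] = df["income"].fillna(df["income"].median())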
Categorical
-Ordered (low, med., high) -Unordered (male, female) -Naive Bayes can use as-is -Most other algorithms require binary dummies
Supervised: Classification
-Predict categorical target variable -Purchase/no purchase -Each row is a case -Each column is a variable -Target variable is binary
Supervised: Prediction
-Predict numerical target variable -Ex: sales & revenue -Row= Case -Column= variable -Target variable is numerical -Taken together, classification & prediction constitute 'predictive analytics'
Unsupervised: Association Rules
-Produce rules that define 'what goes with what' -Ex: X was purchased, Y was also purchased -Rows= Transactions -Used in recommender systems -AKA affinity analysis
Rare event oversampling
-Sampling may yield too few interesting cases to effectively train a model -Solution: oversample the rare cases to obtain a more balanced training set
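A minimal pandas sketch of oversampling a made-up rare class with replacement:

    import pandas as pd

    # Made-up training set where the important class (y = 1) is rare
    train = pd.DataFrame({"x": range(1000), "y": [1] * 50 + [0] * 950})

    rare = train[train["y"] == 1]
    common = train[train["y"] == 0]

    # Oversample the rare cases (with replacement) for a balanced training set
    rare_over = rare.sample(n=len(common), replace=True, random_state=1)
    balanced = pd.concat([rare_over, common])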
Alternate accuracy measures
-Sensitivity -Specificity -False positive rate -False negative rate
Reducing categories: create dummies
-A single categorical variable with m categories is typically transformed into m-1 dummy variables -Can end up with too many variables -Reduce by combining categories that are close to each other -Use pivot tables to decide which categories to combine
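A minimal pandas sketch of the m-categories-to-(m-1)-dummies transformation:

    import pandas as pd

    df = pd.DataFrame({"size": ["low", "med", "high", "low"]})  # m = 3 categories

    # drop_first=True yields m - 1 dummies and avoids the redundant column
    dummies = pd.get_dummies(df["size"], prefix="size", drop_first=True)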
How to compute lift charts
-Sort records from most likely to least -Compute lift: accumulate the correctly classified important-class records -Naive benchmark: compare to the mean number of important-class records expected under random selection
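A minimal sketch of those three steps on made-up validation scores:

    import numpy as np

    # Made-up validation results: predicted probability and actual class
    prob = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
    actual = np.array([1, 1, 0, 1, 0, 0, 1, 0])

    order = np.argsort(prob)[::-1]            # 1. sort most likely to least
    cum_important = np.cumsum(actual[order])  # 2. accumulate important-class records

    # 3. Naive benchmark: the mean prevalence accumulated at the same rate
    baseline = np.arange(1, len(actual) + 1) * actual.mean()
    lift = cum_important / baseline           # > 1 means better than 'no model'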
Overfitting
-Statistical models can produce highly complex explanations of relationships between variables -Models of great complexity often fit the training data well but fare poorly on new data -The fit of a simple model to new data is often better than the fit of a complex model to new data
Core Ideas
-Supervised learning -Unsupervised learning
Market Analysis and Management
-Target marketing -Determine customer purchasing patterns over time -Cross-market analysis -Customer profiling -Identifying customer requirements -Competition assessment
Causes of overfitting
-Too many predictors -Trying many different models -Consequence: Deployed model will not work as well as expected with completely new data
Normalizing (Standardizing) Data
-Variables with the largest scales would dominate & skew results -Puts variables on same scale -Subtract mean & divide by standard deviation -Or scale to 0-1 by subtracting minimum and dividing by the range
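A minimal sketch of both rescalings on a made-up variable:

    import numpy as np

    x = np.array([120.0, 80.0, 200.0, 150.0])  # made-up values

    z = (x - x.mean()) / x.std()                # subtract mean, divide by std deviation
    unit = (x - x.min()) / (x.max() - x.min())  # subtract min, divide by the range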
Steps in data mining
1. Define & understand purpose 2. Obtain data 3. Explore, clean, pre-process data 4. Reduce the data; if supervised DM, partition it 5. Specify task (classification, clustering, etc.) 6. Choose the techniques (regression, CART, neural networks) 7. Iterative implementation and tuning 8. Assess results: compare models 9. Deploy best model
Error
Classifying a record as belonging to one class when it belongs to another class
Heat maps
Color conveys information; useful for visualizing correlations and missing data
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Customer profiling
Data mining can tell you what types of customers buy what products (clustering or classification)
Distribution Plots
Display how many of each value occur in a data set or, for continuous data or data with many possible values, how many values fall in each of a series of ranges or bins
Scatterplot
Displays relationship between two numerical variables
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits
Decile chart
A decile lift of 2 means the records the model ranks as most probable are twice as likely to contain the important class as a random selection of the same size
Training
Partition to develop the model
Validation
Partition to implement the model and evaluate its performance on new data
Error rate
Percent of misclassified records out of total records
Imputation
Replace missing values with reasonable substitutes
Partitioning the Data
Separate the data into two parts -Training -Validation
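A minimal sketch using scikit-learn's train_test_split (an assumed helper; a random row shuffle does the same job):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({"x": range(100), "y": [0, 1] * 50})  # made-up data

    # e.g. 60% to develop the model, 40% held out to evaluate it on new data
    train, valid = train_test_split(df, test_size=0.4, random_state=1)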
Matrix plot
Shows scatterplots for variable pairs
Histogram
Shows the distribution of the outcome variable
Boxplot
Side-by-side boxplots are useful for comparing subgroups
Omission
Works when only a small number of records have missing values
Lift chart: cumulative performance
Tells us how much improvement our model provides over selecting cases at random
What is data mining?
The process of identifying hidden patterns and relationships within data sets; automating information discovery
What is data mining good for?
Turning information into useful knowledge