GSB 530 Data Mining
Opportunistic Data (Data mining data)
purpose: operational, value: commercial, generation: passive, size: massive, hygiene: dirty, state: dynamic
accuracy rate
(correct successes + correct non-successes) / total data points
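The accuracy formula above can be sketched in code; the confusion-matrix counts below are illustrative, not from the notes:

```python
# Accuracy rate: correct predictions of BOTH classes over total data points.
def accuracy_rate(true_pos, true_neg, false_pos, false_neg):
    correct = true_pos + true_neg                     # correct successes + correct non-successes
    total = true_pos + true_neg + false_pos + false_neg
    return correct / total

print(accuracy_rate(40, 50, 5, 5))  # (40 + 50) / 100 = 0.9
```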
granularity (of a fact table)
- level of detail -transactional (finest) -aggregated (summarized) -finer grain gives better market basket analysis, but produces more rows -click = finest granularity for web-based data
Areas where analytics are used often
- new customer acquisition, customer loyalty, cross-sell, pricing tolerance, supply optimization, financial forecasting, churn, product placement, insurance rate setting, fraud
Areas when analytics are not helpful
- snap decisions -novel approaches (no historical data) -most salient factors are rare (making decisions to work around unlikely obstacles/miracles) -expert analysis suggests a particular path -metrics are inappropriate (quantifying love, quantifying athlete salaries?) -naive implementation of analytics (only looking at one variable) -confirming what you already know (ignoring important variables)
Discovery = Unsupervised classification
-cluster analysis -association (market basket analysis) -dimension reduction -affinity analysis -no target/outcome variable = unsupervised
Why do we need data warehousing?
-company wide view of high-quality info -separation of operational vs informational system (analytical)
Normalizing
-create associative entity (ex. helper table) -sometimes natural hierarchy between dimensions -design options (single dimension tables or nested 1:M)
Problems with company wide view
-inconsistent key structures -synonyms -freeform vs structured -inconsistent data values -missing data
Predictive modeling = supervised classification
-linear regression, logistic regression, decision trees -used in classification & prediction -requirements: 1. has a known target variable 2. values for the target are available 3. data set must be large enough for partitioning -Key difference: target variable present = supervised (we know what we are looking for); no target variable = unsupervised
Best model
-low error rate combined with not too much complexity -Overfitting: the model captures noise factors (ex. drinking milk or taking the bus before an exam) instead of underlying patterns that generalize
Slowly Changing Dimension
-maintain knowledge of the past -Kimball's approach: Type 1: overwrite and lose old data; Type 2: add a new row each time the dimension changes (most common); Type 3: add a new attribute, multivalued (create current and old fields)
Problems with Opportunistic Data
-not collected with data analysis in mind -stored in multiple silo-based data systems
2 Technique to deal with missing values:
-omission (delete rows with missing values) or imputation (fill in a reasonable value)
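Both techniques can be sketched in a few lines; the rows and the mean-fill choice below are illustrative assumptions:

```python
from statistics import mean

# Hypothetical rows where "income" may be missing (None).
rows = [{"age": 34, "income": 52000},
        {"age": 41, "income": None},
        {"age": 29, "income": 61000}]

# Omission: delete any row with a missing value.
complete = [r for r in rows if r["income"] is not None]

# Imputation: fill missing values with a reasonable stand-in (here, the mean).
fill = mean(r["income"] for r in complete)
imputed = [dict(r, income=r["income"] if r["income"] is not None else fill)
           for r in rows]
```

Omission keeps only 2 of the 3 rows; imputation keeps all 3 but invents a value (here 56500) for the missing cell.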
Real time Data Warehouse
-real-time ETL -ODS (operational data store) and data warehouse are the same -data marts are not separate databases in this warehouse but instead VIEWS of the warehouse (easier to create a new data mart)
Enterprise Data Warehouse
-single ETL for Enterprise Data Warehouse (all goes to one place then splits to little data marts from EDW) - dependent data marts then loaded from EDW
cutoff probability
A cutoff of .02 means a data point is put in the success class if its predicted probability of success is at least 2%; a cutoff of .98 would require a 98% chance.
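The cutoff rule is a simple threshold on the predicted probability; the probabilities below are illustrative:

```python
# Classify a data point by comparing its predicted probability of success
# to the cutoff: at or above the cutoff -> success class.
def classify(prob_success, cutoff):
    return "success" if prob_success >= cutoff else "non-success"

# A low cutoff (.02) classifies almost anything as success;
# a high cutoff (.98) only classifies near-certain cases as success.
print(classify(0.30, 0.02))  # success
print(classify(0.30, 0.98))  # non-success
```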
What 3 things do Data Scientists need?
1) training in IS/IT & problem solving 2) curiosity (most important) 3) storytelling ability
3 Consequences of Data Deluge
1. every problem will generate data 2. every company will need analytics 3. everyone will need analytics -being proactive is good
Variations of Star Schema
1. Multiple fact tables (better performance, different combos of dimensions) 2. Factless fact table = no non-key data; used for tracking events, inventory coverage -Conformed dimension = a dimension associated with multiple fact tables
Data Partition
1. Training data = build the model; the algorithm learns 2. Validation data = applied to see how well the model does on data it has never seen; used to select the best model (supervised only) 3. Test data (sometimes used) = measure performance of the selected model (not used in model selection)
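A three-way partition can be sketched as below; the 40/30/30 proportions are an assumption for illustration (the notes only fix training at 40% elsewhere):

```python
import random

# Shuffle the records, then cut them into training / validation / test sets.
def partition(records, seed=42):
    rng = random.Random(seed)           # fixed seed for a repeatable split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.4)              # 40% to build the model
    n_valid = int(n * 0.3)              # 30% to select the best model
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:] # remainder: final performance check
    return train, valid, test

train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))  # 40 30 30
```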
Size of fact table depends on
1. grain 2. number of dimensions (dimension tables)
Two Types of major types of data mining techniques
1. supervised 2. unsupervised
Duration of a Database
13 months or 5 quarters
Incremental response modeling 4 Quadrant Groups
4 groups: 1. Sure Thing (buy whether or not they receive a coupon) 2. Persuadable (buy only if they get a coupon) 3. Do Not Disturb (buy only if they don't receive a coupon) 4. Lost Cause (not going to buy either way) -the goal of incremental response modeling is to find the Persuadable group
How much data is useless for insight?
90%
what percent of the data is in the fact table?
90%
Data warehouse (analytical)
A subject-oriented (customers, etc), integrated (consistent, formatted, from multiple sources), time-variant, non-updatable (read-only) collection of data used in support of management decision-making processes vs operational database constantly updated
Data Mining (DM)
Advanced methods for exploring and modeling relationships in large amounts of data. It is a component of business analytics, drawing on business problems, information technology, machine learning, statistics, etc.
Curse of Dimensionality
As the number of dimensions increases, the number of cases needed to fill the space increases.
Why have a test data set in data partitioning?
By chance, if we build 3 models, model 2 might work better on the validation data than the others without being better in the real world. That's why we hold out a test data set (3rd partition).
Knowledge is _____
Coming up with an actionable, consistent business strategy
What stage of the industry are we at for data?
Contagion stage (growing)
Time frame of prediction vs profiling data?
Prediction model: input variables come from an earlier time frame than the target. Profiling model: input variables and target variable may all be from the same time frame.
Why now? for Data Mining
Data Deluge = too much data. Info overload.
What type of approach are we using? (Data driven or theory driven)
Data driven - we use the whole population, not a sample as in statistics
Is data mining all about complicated queries?
Data mining is not just complicated queries - those are OLAP (online analytical processing), MOLAP (multidimensional OLAP), or business intelligence problems. A data mining problem: What are the important pre-operative predictors of excessive length of stay? Domain expertise is important.
Advanced analytics examples
Decision optimization, predictive modeling (pattern recognition), forecasting (relationship finding)
How to determine fitness of a profile using a target profile
Divide the given profile's % by the target profile's % for each characteristic, then average the ratios. The closer the average is to 1, the better the fit.
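The divide-and-average rule can be sketched as follows; the characteristics and percentages are illustrative:

```python
# Fitness of a candidate profile vs. a target profile:
# average the per-characteristic ratios (candidate % / target %).
def profile_fitness(profile, target):
    ratios = [profile[k] / target[k] for k in target]
    return sum(ratios) / len(ratios)

target = {"age_25_34": 0.40, "urban": 0.50}
candidate = {"age_25_34": 0.40, "urban": 0.55}  # ratios 1.0 and 1.1
print(profile_fitness(candidate, target))       # averages to ~1.05, a close fit
```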
Reporting (basic)
Dynamic reporting, ad hoc reporting, basic reporting
ETL
EXTRACT, TRANSFORM, LOAD (ETL) -separate ETLs for independent data mart -data access complexity due to multiple data marts
Proactive Analytical Investigation
Examples: inferential statistics, experimentation, empirical validation, forecasting, optimization
T/F: Organizational data from different business units is generally well-organized and in a form that is ready for analysis.
FALSE
Methodology in Business Analytics: P-value yes or no?
Forget about the p-value (we don't want conclusions based only on theory); use the ability to predict a sample instead, and retire the model if it stops predicting well. The p-value is only a rough guide.
how to do data reduction?
Grouping/Clustering
Response modeling
Improve response rates by identifying prospects who are more likely to respond to a direct solicitation. Techniques: descriptive modeling, predictive modeling (decision tree, neural network, regression), pattern recognition.
What type of analysis is the beer/diapers analysis?
Market Basket Analysis (MBA) (a type of Descriptive Analytics)
Is data mining a linear process?
NO
Outliers
Outliers = more than 3 standard deviations from the mean. If mean = 1800 and SD = 200: 1800 ± (3 × 200), so values > 2400 or < 1200 would be outliers.
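The 3-SD rule with the notes' mean/SD example can be written directly:

```python
# Flag a value as an outlier if it lies more than 3 standard deviations
# from the mean (defaults use the example: mean = 1800, SD = 200).
def is_outlier(x, mean=1800, sd=200):
    return abs(x - mean) > 3 * sd   # outside the interval (1200, 2400)

print(is_outlier(2500))  # True: above 2400
print(is_outlier(1700))  # False: within 3 SD of the mean
```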
Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store.
How much data do we need?
Rules: 1. Need 10 records for each predictor variable (10 predictor variables × 10 = 100 records) 2. 6 × M (outcome classes, ex. buy/hold/sell = 3) × P (predictor variables) = minimum number of records needed (ex. 180). Must account for data partitioning: if 180 are needed above, that covers only the 40% training partition, so the actual total needed = 180 / 0.40.
SEMMA Methodology
Sample: Take a sample from the dataset; partition into training, validation, and test datasets. Explore: Examine the dataset statistically and graphically. Modify: Transform the variables and impute missing values. Model: Fit predictive models. Assess: Compare models using a validation dataset
sensitivity
Sensitivity = TP / (TP + FN)
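The formula translates directly to code; the counts below are illustrative:

```python
# Sensitivity (true positive rate): share of actual positives the model catches.
def sensitivity(tp, fn):
    return tp / (tp + fn)

print(sensitivity(80, 20))  # 0.8: the model catches 80% of actual positives
```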
Prediction - supervised
Similar to classification except that we are trying to predict the value of a numerical variable. Input variables: numeric or categorical. KEY DIFFERENCE: target variable is numeric only.
Drill Down
Starting with summary data, users can obtain details for particular cells.
T/F: Simple reporting is an important part of business analytics even though it only shows a snapshot of the past.
True
10 rules of dimensional modeling
1. Use atomic facts 2. Create single-process fact tables 3. Include a date dimension for each fact table 4. Enforce consistent grain 5. Disallow null keys in fact tables 6. Honor hierarchies 7. Decode dimension tables 8. Use surrogate keys 9. Conform dimensions 10. Balance requirements with actual data
Rule of 5
With a random sample of 5, we can be 93.75% (~94%) sure that the population median lies between the smallest and largest values in the sample.
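The ~94% figure follows from a short probability argument: each observation independently has a 1/2 chance of falling above (or below) the median, so the only way the median escapes the sample range is if all 5 land on the same side:

```python
# Rule of Five: P(median is between sample min and max) for a sample of 5.
p_all_one_side = 2 * 0.5 ** 5       # all 5 above the median OR all 5 below
p_median_inside = 1 - p_all_one_side
print(p_median_inside)              # 0.9375, i.e. 93.75%
```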
Y/N It is important for team members on an analytical team to try to identify the potential problems with an analytical approach.
Yes
What are the changes in the analytical landscape? (Front office now?)
analytics are front office and DIRECTLY impacting company performance. Success will breed desire for more models.
What does the piano tuner example prove?
analytics is not about an exact number (we are trying to estimate); it is a state of mind, not a matter of having the right software
Why Have a Methodology?
avoid results that are not true; avoid results that are not useful (already known, should have known); create stable models; avoid mistakes; develop and learn
Basic Analytics
basic statistical analysis, reporting with early warning
Discrete Target
binary
Data Stalemate
business analytics should be introduced incrementally; many companies have data that they do not use but sell to third parties (who may even sell it back to them)
Transient Data
changes to existing records are written over previous records, DESTROYING previous data vs periodic data
Descriptive modeling
uses characteristics (age, gender, education) of current customers to find good prospects
Index based scoring
an index is meaningless unless divided by the actual statistic; another issue is that all characteristics may not carry the same weight (ex. when deciding between customer profiles)
Ensemble model
combining results from different techniques to get a stronger "super model"
Characteristics of Methodology ( in Business Analytics)
computer-intensive adhockery; multidisciplinary lineage
Operational Database
daily operations of the company; data pushed/pulled many times a day; a system used to run a business in real time, based on current data; also called a system of record. Purpose: run the business currently; data type: current; users: clerks, admins; scope: narrow, planned, simple queries; design goal: performance & availability; volume: constant updates
TECHNIQUE: Oversampling
deliberately sample more yes cases than no cases so that the yes class appears as a larger proportion of the data set (ex. 5,000 Yes and 5,000 No). For a rare but important class, this gives the model more information to work with.
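Oversampling can be sketched as sampling the rare class with replacement until the classes are balanced; the 50-vs-5,000 counts are illustrative:

```python
import random

# Boost the rare "yes" class by sampling it with replacement until it
# matches the size of the "no" class.
def oversample(yes_rows, no_rows, seed=1):
    rng = random.Random(seed)
    extra = [rng.choice(yes_rows) for _ in range(len(no_rows) - len(yes_rows))]
    return yes_rows + extra, no_rows

yes, no = oversample(list(range(50)), list(range(5000)))
print(len(yes), len(no))  # 5000 5000: balanced classes
```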
Data Mart is _____ data (reconciled or derived)?
derived
People who buy beer at a grocery store also tend to buy ________________.
diapers
Star Schema
-dimension tables = descriptions about subjects of the business (ex. product, customer, time, location); make sure to have a time-period dimension -fact tables = contain factual or quantitative data (unit price, sales) -all relationships are 1:M from dimension (1) to fact table (M) -dimension tables are good for ad hoc queries, bad for online transaction processing -allows us to slice and query easily
Snowflake schema
dimension tables have multiple tables linked to them in hierarchies
Event Data
event = database action that creates/updates/deletes resulting from a transaction (vs status data)
Pattern Recognition
ex. cluster analysis: put customers into different clusters and look for clusters with a high response rate
the purpose of predictive modeling
generalization
Analytical database (Informational system)
historical data; want a large data set; demographic info, prior purchases; millions of rows/hundreds of columns; accessed a couple of times a day; computationally intensive. A system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications. Purpose: support management; data type: historical snapshots and predictions; users: managers, analysts, customers; scope: broad, ad hoc, complex analysis; goal: ease of use; volume: periodic batch updates
Time series forecasting
identifies seasonal patterns and trends in data
Lift
improvement in analytics over no analytics
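One common way to express lift numerically is the targeted response rate divided by the baseline rate; the rates below are illustrative:

```python
# Lift: response rate when targeting with the model, relative to the
# baseline response rate with no model.
def lift(model_response_rate, baseline_response_rate):
    return model_response_rate / baseline_response_rate

print(lift(0.20, 0.05))  # 4.0: the targeted group responds at 4x the baseline
```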
Types of data warehouses
independent data mart, dependent data mart, operational data store (ODS), logical data mart, real-time data warehouse, 3 layer architecture ALL INVOLVE some form of EXTRACT, TRANSFORM, LOAD (ETL)
intelligent key
key created using given attributes
Characteristics of Data ( in Business Analytics)
massive, operational, opportunistic
Undercoverage
not looking at entire population with model
Derived Data
objective: ease of use, fast query response, ad hoc query support, data mining. Characteristics: detailed periodic data, aggregate, derived. MOST COMMON DATA MODEL = dimensional model (star schema)
OLAP
online analytical processing The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
2 types of databases
operational analytical
Predictive modeling
our classes focus
Classification - supervised
predicting a class: purchase vs. no purchase, buy/hold/sell a stock, approve/deny/need more info. Purchase/no purchase = binary variable. Input variables: numeric or categorical. Target variable: categorical.
Experimental Data (research purpose data)
purpose: research, value: scientific, generation: actively controlled, size: small, hygiene: clean, state: static
surrogate key
system-generated key used to represent each record -better than an intelligent key because it doesn't get messed up if an attribute changes -simpler, shorter
Data Warehouse is ____ Data? (reconciled or derived)
reconciled
Data Mart
small-scale data warehouse (less complex, easier to use): organic, unplanned, one subject, few sources, restrictive, short life, starts small, decentralized. vs. Data warehouse: centralized, planned, historical, multiple subjects, many sources, flexible, long life, large, single structure
Broad or specific goals?
specific goals better
attrition modeling
what customer is likely to leave us
Sequence analysis
uses the sequence of a customer's activities to predict which product the customer is most likely to want next (ex. Amazon gives college students free Prime, predicting the next stage of life will bring more money to buy things)
Information is _____
when we discover something from data set
Survival analysis
which channel is likely to get best customers