GSB 530 Data Mining


Opportunistic Data (Data mining data)

purpose: operational, value: commercial, generation: passive, size: massive, hygiene: dirty, state: dynamic

accuracy rate

(correctly classified successes + correctly classified non-successes) / total data points
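A minimal sketch of that calculation (the function name and confusion counts are hypothetical):

```python
def accuracy_rate(tp, tn, fp, fn):
    """Accuracy = (correct successes + correct non-successes) / total data points."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts: 40 TP, 50 TN, 5 FP, 5 FN
print(accuracy_rate(40, 50, 5, 5))  # 0.9
```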

granularity (of a fact table)

- level of detail
- transactional (finest)
- aggregated (summarized)
Finer grains give better market basket analysis, but mean more rows. A click is the finest granularity for web-based data.

Areas where analytics are used often

- new customer acquisition, customer loyalty, cross-sell, pricing tolerance, supply optimization, financial forecasting, churn, product placement, insurance rate setting, fraud

Areas when analytics are not helpful

- snap decisions
- novel approaches (no historical data)
- most salient factors are rare (making decisions to work around unlikely obstacles/miracles)
- expert analysis suggests a particular path
- metrics are inappropriate (quantifying love, quantifying athlete salaries?)
- naive implementation of analytics (only looking at one variable)
- confirming what you already know (ignoring important variables)

Discovery = Unsupervised classification

-cluster analysis
-association (market basket analysis)
-dimension reduction
-affinity analysis
No target variable (no outcome variable) = unsupervised

Why do we need data warehousing?

-company-wide view of high-quality info -separation of operational and informational (analytical) systems

Normalizing

-create an associative entity (e.g., a helper table) -sometimes a natural hierarchy exists between dimensions -design options (single dimension tables or nested 1:M)

Problems with company wide view

-inconsistent key structures -synonyms -freeform vs structured -inconsistent data values -missing data

Predictive modeling = supervised classification

-linear regression, logistic regression, decision trees
-used in classification & prediction, and requires: 1. a known target variable 2. values for the target available 3. data large enough for partitioning
Key difference: if there is a target variable = supervised (we know what we are looking for); no target variable = unsupervised

Best model

-low error rate combined with not-too-high complexity -Overfitting: the model learns noise factors (drinking milk, taking the bus before an exam) instead of the underlying patterns that generalize

Slowly Changing Dimension

-maintain knowledge of the past -Kimball's approach: Type 1: replace the value and lose old data; Type 2: add a new row each time the dimension changes (most common); Type 3: add a new attribute, multivalued (create current and old fields)

Problems with Opportunistic Data

-not collected with data analysis in mind, in multiple silo-based data system

2 Techniques to deal with missing values:

-omission (delete rows with missing values) or imputation (fill in a reasonable value)

Real time Data Warehouse

-real-time ETL -ODS (operational data store) and the data warehouse are the same -data marts are not separate databases in this warehouse but instead VIEWS of the warehouse (easier to create a new data mart)

Enterprise Data Warehouse

-single ETL for the Enterprise Data Warehouse (all data goes to one place, then splits into little data marts from the EDW) -dependent data marts are then loaded from the EDW

cutoff probability

The probability threshold for assigning a record to the success class; e.g., a cutoff of 0.98 means a data point must have at least a 98% predicted chance of being a success before it is put in the success class.
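As a sketch (the function name and probabilities are illustrative, not from the course):

```python
def classify(prob_success, cutoff):
    """Assign the success class only if the predicted probability meets the cutoff."""
    return "success" if prob_success >= cutoff else "non-success"

print(classify(0.85, cutoff=0.98))  # non-success: 85% is below the 98% threshold
print(classify(0.99, cutoff=0.98))  # success
```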

What 3 things do Data Scientists need?

1) training in IS/IT & problem solving 2) curiosity (most important) 3) storytelling ability

3 Consequences of Data Deluge

1. every problem will generate data 2. every company will need analytics 3. everyone will need analytics; being proactive is good

Variations of Star Schema

1. Multiple fact tables (better performance, different combos of dimensions)
2. Factless fact table = no non-key data; used for tracking events, inventory coverage
Conformed dimension = a dimension associated with multiple fact tables

Data Partition

1. Training data = build the model; the algorithm learns
2. Validation data = applied to see how well the model does (the model has never seen the validation data); used to select the best model
3. Test data (sometimes used) = measures performance of the selected model (not used in model selection)
Partitioning is only used in supervised data mining.
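The three-way split can be sketched as follows (the 50/30/20 proportions and the function are illustrative assumptions, not course specifics):

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=42):
    """Shuffle the records, then cut them into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rows = list(records)
    rng.shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # 50 30 20
```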

Size of fact table depends on

1. grain 2. number of dimensions (dimension tables)

Two Types of major types of data mining techniques

1. supervised 2. unsupervised

Duration of a Database

13 months or 5 quarters

Incremental response modeling: 4 quadrant groups

4 groups: 1. Sure Thing (buys whether or not they receive a coupon) 2. Persuadable (buys only if they get a coupon) 3. Do Not Disturb (buys only if they don't receive a coupon) 4. Lost Cause (not going to buy either way). The point of incremental response modeling is to find the Persuadable group.

How much data is useless for insight?

90%

what percent of the data is in the fact table?

90%

Data warehouse (analytical)

A subject-oriented (customers, etc.), integrated (consistent, formatted, from multiple sources), time-variant, non-updatable (read-only) collection of data used in support of management decision-making processes; vs. an operational database, which is constantly updated.

Data Mining (DM)

Advanced methods for exploring and modeling relationships in large amounts of data. It is a component of business analytics, drawing on business problems, information technology, machine learning, statistics, etc.

Curse of Dimensionality

As the number of dimensions increases, the number of cases needed to fill the space increases.

Why have a test data set in data partitioning?

There is a chance aspect: if we make 3 models, model 2 might work better on the validation data than the others without being better in the real world. That's why we make a test data set (the 3rd partition).

Knowledge is _____

Coming up with the actionable consistent business strategy

What stage of the industry are we at for data?

Contagion stage (growing)

Time frame of prediction vs profiling data?

Prediction model: uses old variables to explain current targets. Profiling model: the variables and the target variable may all be from the same time frame.

Why now? for Data Mining

Data Deluge = too much data. Info overload.

What type of approach are we using? (Data drive or theory)

Data Driven - we use the whole population not a sample like in statistics

Is data mining all about complicated queries?

Data mining is not just complicated queries - those are OLAP (online analytical processing), MOLAP (multidimensional online analytical processing), or business intelligence problems. Data mining problems look like: what are the important pre-operative predictors of excessive length of stay? Domain expertise is important.

Advanced analytics examples

Decision optimization, predictive modeling (pattern recognition), forecasting (relationship finding)

How to determine fitness of a profile using a target profile

Divide the given profile % by the target % for each characteristic, then average the ratios to see fitness. Closer to 1 is a better fit.
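A sketch of that calculation (the percentages below are hypothetical):

```python
def profile_fitness(profile_pct, target_pct):
    """Average of (profile % / target %) across characteristics; closer to 1 = better fit."""
    ratios = [p / t for p, t in zip(profile_pct, target_pct)]
    return sum(ratios) / len(ratios)

# Hypothetical segment vs. target percentages for three characteristics
print(round(profile_fitness([30, 20, 50], [30, 25, 50]), 2))  # 0.93
```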

Reporting (basic)

Dynamic reporting, ad hoc reporting, basic reporting

ETL

EXTRACT, TRANSFORM, LOAD (ETL) -separate ETLs for each independent data mart -data access complexity due to multiple data marts

Proactive Analytical Investigation

Examples: inferential statistics, experimentation, empirical validation, forecasting, optimization

T/F: Organizational data from different business units is generally well-organized and in a form that is ready for analysis.

FALSE

Methodology in Business Analytics: P-value yes or no?

Forget about the p-value (we don't want things based only on theory). Use predictive value (the ability to predict on a sample), and retire the model if it doesn't predict well. The p-value is only a rough guide.

how to do data reduction?

Grouping/Clustering

Response modeling

Improve response rates by identifying prospects who are more likely to respond to a direct solicitation. Techniques: descriptive modeling, predictive modeling (decision tree, neural network, regression), pattern recognition.

What type of analysis is the beer/diapers analysis?

Market Basket Analysis (MBA) (a type of Descriptive Analytics)

Is data mining a linear process?

NO

Outliers

Outliers = more than 3 standard deviations away from the mean. If mean = 1800 and SD = 200: 1800 +/- (3*200), so values >2400 or <1200 would be outliers.
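The worked example above, as a small sketch (function name is my own):

```python
def outlier_bounds(mean, sd, k=3):
    """Values more than k standard deviations from the mean count as outliers."""
    return mean - k * sd, mean + k * sd

low, high = outlier_bounds(1800, 200)
print(low, high)  # 1200 2400 -> values below 1200 or above 2400 are outliers
```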

Periodic Data

Periodic data are never physically altered or deleted once they have been added to the store.

How much data do we need?

Rules: 1. Need 10 records for each predictor variable (10 predictor variables * 10 = 100 records) 2. 6 * M (outcome classes, e.g., buy/hold/sell = 3) * P (predictor variables) = minimum number of records needed (e.g., 180). Must account for data partitioning: if 180 are needed above, that covers only the training partition (40%), so more data is needed for validation - thus 180/0.40 = actual total needed.
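The 6 * M * P rule plus the partitioning gross-up can be sketched as (the function name is my own):

```python
def min_records(outcome_classes, predictors, training_fraction=0.40):
    """6 * M outcome classes * P predictors gives the training minimum;
    divide by the training fraction to get the total records needed."""
    training_minimum = 6 * outcome_classes * predictors
    return training_minimum / training_fraction

# Card's example: 3 outcome classes (buy/hold/sell), 10 predictors
print(min_records(3, 10))  # 180 for the 40% training partition -> 450 total
```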

SEMMA Methodology

Sample: Take a sample from the dataset; partition into training, validation, and test datasets. Explore: Examine the dataset statistically and graphically. Modify: Transform the variables and impute missing values. Model: Fit predictive models. Assess: Compare models using a validation dataset

sensitivity

Sensitivity = TP / (TP + FN)
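A one-line sketch of the formula, with hypothetical counts:

```python
def sensitivity(tp, fn):
    """Sensitivity (true positive rate) = TP / (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical: 40 successes caught, 10 missed
print(sensitivity(40, 10))  # 0.8
```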

Prediction - supervised

Similar to classification except that we are trying to predict the value of a numerical variable. Input: numeric. KEY DIFFERENCE: the target variable is numeric only.

Drill Down

Starting with summary data, users can obtain details for particular cells.

T/F: Simple reporting is an important part of business analytics even though it only shows a snapshot of the past.

True

10 rules of dimensional modeling

- Use atomic facts
- Create single-process fact tables
- Include a date dimension for each fact table
- Enforce consistent grain
- Disallow null keys in fact tables
- Honor hierarchies
- Decode dimension tables
- Use surrogate keys
- Conform dimensions
- Balance requirements with actual data

Rule of 5

With a random sample of just 5, we get a large reduction in uncertainty: we can be about 94% sure (93.75%) that the population median lies between the smallest and largest values in the sample.
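The 94% comes from a simple probability: the median is outside the sample's range only if all 5 random points land on the same side of it, each with probability 0.5. A sketch:

```python
# P(all 5 below the median) + P(all 5 above) = 2 * 0.5**5 = 0.0625
p_median_inside_range = 1 - 2 * 0.5 ** 5
print(p_median_inside_range)  # 0.9375, i.e., about 94%
```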

Y/N It is important for team members on an analytical team to try to identify the potential problems with an analytical approach.

Yes

What are the changes in the analytical landscape? (Front office now?)

Analytics are now front office and DIRECTLY impact company performance. Success will breed desire for more models.

What does the piano tuner example prove?

Analytics is not about an exact number (we are trying to estimate); it is all about a state of mind, not about having the right software.

Why Have a Methodology?

Avoid findings that are not true; avoid findings that are not useful (already known, should have known); create stable models; avoid mistakes; develop and learn.

Basic Analytics

basic statistical analysis, reporting with early warning

Discrete Target

binary

Data Stalemate

Business analytics should be introduced incrementally; many companies have data that they do not use, or that they sell to third parties (who may even sell it back to them).

Transient Data

changes to existing records are written over previous records, DESTROYING previous data vs periodic data

Descriptive modeling

Characteristics (age, gender, education) of current customers are used to find good prospects.

Index based scoring

An index is meaningless unless divided by the actual (baseline) statistic; another issue is that all characteristics may not carry the same weight (e.g., when deciding between customer profiles).

Ensemble model

combining results from different techniques to get super model

Characteristics of Methodology ( in Business Analytics)

computer intensive adhockery, multidisciplinary lineage

Operational Database

Daily operations of the company; data pushed/pulled many times a day. A system used to run a business in real time, based on current data; also called a system of record. Purpose: run the business currently, data type: current, users: clerks, admins, scope: narrow, planned, simple queries, design goal: performance & availability, volume: constant updates.

TECHNIQUE: Oversampling

Deliberately sample more "yes" cases than "no" cases so that the yes class appears to be a larger proportion of the entire data set (e.g., get 5,000 Yes and 5,000 No). For a rare but important class, this gives the model more information to work with.
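A sketch of oversampling by duplicating rare-class rows (the field names and counts are illustrative):

```python
import random

def oversample(rows, label_key, rare_label, target_count, seed=1):
    """Duplicate-sample the rare class until it reaches target_count rows."""
    rng = random.Random(seed)
    rare = [r for r in rows if r[label_key] == rare_label]
    common = [r for r in rows if r[label_key] != rare_label]
    extra = [rng.choice(rare) for _ in range(target_count - len(rare))]
    return common + rare + extra

# 50 rare "yes" responders among 5,000 "no" rows
data = [{"y": "yes"}] * 50 + [{"y": "no"}] * 5000
balanced = oversample(data, "y", "yes", 5000)
print(sum(1 for r in balanced if r["y"] == "yes"))  # 5000
```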

Data Mart is _____ data (reconciled or derived)?

derived

People who buy beer at a grocery store also tend to buy ________________.

diapers

Star Schema

- Dimension tables = descriptions of the subjects of the business (e.g., product, customer, time, location) - make sure to have a time period dimension
- Fact tables = contain factual or quantitative data (unit price, sales)
- ALL relationships are 1:M from dimension (1) to fact table (M)
- Dimension tables are good for ad hoc queries, bad for online transaction processing
- Allows us to slice and query easily

Snowflake schema

dimension tables have multiple tables linked to them in hierarchies

Event Data

event = database action that creates/updates/deletes resulting from a transaction (vs status data)

Pattern Recognition

ex. cluster analysis: put customers to different clusters and look for high response rate

the purpose of predictive modeling

generalization

Analytical database (Informational system)

Historical data; we want a large data set: demographic info, prior purchases, millions of rows/hundreds of columns, accessed a couple of times a day, computationally intensive. A system designed to support decision making based on historical point-in-time and prediction data, for complex queries or data-mining applications. Purpose: support management, data type: historical snapshots and predictions, users: managers, analysts, customers, scope: broad, ad hoc, complex analysis, goal: ease of use, volume: batch updates, periodic.

Time series forecasting

identifies seasonal patterns and trends in data

Lift

improvement in analytics over no analytics
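As a sketch with hypothetical response rates (the 12% and 4% figures are made up for illustration):

```python
def lift(model_rate, baseline_rate):
    """Lift = response rate using the model / response rate without it."""
    return model_rate / baseline_rate

# Hypothetical: targeted group responds at 12% vs. a 4% overall baseline
print(round(lift(0.12, 0.04), 2))  # 3.0 -> three times better than no model
```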

Types of data warehouses

independent data mart, dependent data mart, operational data store (ODS), logical data mart, real-time data warehouse, 3 layer architecture ALL INVOLVE some form of EXTRACT, TRANSFORM, LOAD (ETL)

intelligent key

key created using given attributes

Characteristics of Data ( in Business Analytics)

massive, operational, opportunistic

Undercoverage

not looking at entire population with model

Derived Data

Objective: ease of use, fast response to queries, ad hoc query support, data mining. Characteristics: detailed periodic data, aggregated, derived. MOST COMMON DATA MODEL = dimensional model (star schema).

OLAP

Online analytical processing: the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

2 types of databases

operational, analytical

Predictive modeling

our class's focus

Classification - supervised

Predicting a class: purchase vs. no purchase; buy/hold/sell a stock; approve/deny/need more info. Purchase/no purchase = binary variable. Input variables: numeric or categorical. Target variable: categorical.

Experimental Data (research purpose data)

purpose: research, value: scientific, generation: actively controlled, size: small, hygiene: clean, state: static

surrogate key

A system-generated key used to represent each record -better than an intelligent key because it doesn't get messed up if an attribute changes -simpler, shorter

Data Warehouse is ____ Data? (reconciled or derived)

reconciled

Data Mart

Small-scale data warehouse (less complex, easier to use) - organic, unplanned, one subject, few sources, restrictive, short life, starts small, decentralized; vs. a data warehouse, which is centralized, planned, historical, multiple subjects, many sources, flexible, long life, large, single structure.

Broad or specific goals?

specific goals better

attrition modeling

which customers are likely to leave us

Sequence analysis

what sequence of activities to predict which product customer most likely to want (ex. amazon gets college students free prime, predicts next sequence in life will have more money to buy things)

Information is _____

when we discover something from a data set

Survival analysis

models the time until an event (e.g., how long before a customer leaves); can show which channel is likely to yield the best, longest-lasting customers

