Data Mining Test 1

Specificity

% of actual c0s correctly classified

Sensitivity

% of actual c1s correctly classified

False negative rate

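% of predicted c0s that were not c0s (that is, were actually c1s). Note: many references instead define this rate as the % of actual c1s classified as c0.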

False positive rate

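% of predicted c1s that were not c1s (that is, were actually c0s). Note: many references instead define this rate as the % of actual c0s classified as c1.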
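
The four measures above can be sketched from the counts in a confusion matrix. This is a minimal illustration, assuming c1 is the important class and using the definitions on these cards (the two error rates are taken as shares of the predicted class). The function name and the sample labels are made up for the example, and every denominator is assumed to be nonzero.

```python
def classification_rates(actual, predicted, positive="c1"):
    """Return sensitivity, specificity, and the two predicted-class error rates."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # c1 predicted as c1
    tn = sum(a != positive and p != positive for a, p in pairs)  # c0 predicted as c0
    fp = sum(a != positive and p == positive for a, p in pairs)  # c0 predicted as c1
    fn = sum(a == positive and p != positive for a, p in pairs)  # c1 predicted as c0
    return {
        "sensitivity": tp / (tp + fn),          # % of actual c1s correctly classified
        "specificity": tn / (tn + fp),          # % of actual c0s correctly classified
        "false_positive_rate": fp / (tp + fp),  # % of predicted c1s that were not c1s
        "false_negative_rate": fn / (tn + fn),  # % of predicted c0s that were not c0s
    }

# Hypothetical labels, for illustration only
actual    = ["c1", "c1", "c1", "c0", "c0", "c0", "c0", "c0"]
predicted = ["c1", "c1", "c0", "c0", "c0", "c0", "c1", "c0"]
rates = classification_rates(actual, predicted)
```

Here there are 2 true positives, 1 false negative, 1 false positive, and 4 true negatives, so sensitivity is 2/3 and specificity is 4/5.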

Obtaining Data: Sampling

-Algorithms & models are typically applied to a sample from a database -Once you develop and select a final model, you use it to score the observations in the larger database

Unsupervised Learning

-No target to predict or classify -Segment data into meaningful segments; detect patterns -Methods: association rules, data reduction, data exploration, visualization

Cross-market analysis

-Associations/correlations between product sales -Prediction based on the association information -Sell bundled products and services

Statistical summary of data: common metrics

-Average -Median -Minimum -Maximum -Standard deviation -Counts and percentages

Graphs for data exploration

-Basic plots -Line graphs -Bar charts -Scatterplots -Distribution plots -Box plots -Histograms

Lift vs. Decile charts

-Both embody concept of moving down through the records, starting with the most probable -Decile chart does this in decile chunks of data -Lift chart shows continuous cumulative results

Supervised Learning

-Classification -Prediction -Predict a single target or outcome variable -Train on data where the target value is known -Score data where the target value is not known

Naive rule

-Classify all records as belonging to the most prevalent class -Often used as a benchmark -Equivalent to predicting everyone has a value equal to the mean -EXCEPTION: when the goal is to identify high-value but rare outcomes, a model may do well at that goal even though its overall accuracy is worse than the naive rule
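
A small sketch of the naive rule used as a benchmark. The helper names and the 95/5 class split are assumptions made for illustration.

```python
from collections import Counter

def naive_rule(train_labels):
    """Return the most prevalent class in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

def error_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical data: 5% rare "yes" outcomes
labels = ["no"] * 95 + ["yes"] * 5
majority = naive_rule(labels)
benchmark_error = error_rate(labels, [majority] * len(labels))
```

The benchmark error here is only 5%, yet the naive rule finds none of the rare "yes" records, which is why a model with a higher overall error can still be more useful when the rare class is the one that matters.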

How to use lift and decile charts

-Compare lift to no model baseline -In lift: compare step function to straight line -In decile: compare to ratio of 1

Numeric

-Continuous -Integer -May need to bin into categories

Pivot Tables

-Counts and percentages are useful for summarizing categorical data -Averages are useful for summarizing grouped numerical data
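
What a pivot table computes can be sketched in plain Python: counts and percentages per category, and the average of a numerical column within each group. The field names (`region`, `sales`) and values are hypothetical.

```python
from collections import defaultdict

def pivot(rows, group_key, value_key):
    """Summarize value_key by group_key: count, percent of rows, and average."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    total = len(rows)
    return {
        group: {
            "count": len(values),
            "percent": 100 * len(values) / total,
            "average": sum(values) / len(values),
        }
        for group, values in groups.items()
    }

# Hypothetical records, for illustration only
rows = [
    {"region": "east", "sales": 100.0},
    {"region": "east", "sales": 200.0},
    {"region": "east", "sales": 300.0},
    {"region": "west", "sales": 250.0},
]
summary = pivot(rows, "region", "sales")
```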

How does PCA work?

-Creates new variables that are linear combinations of the original variables -The linear combinations are uncorrelated, and only a few of them contain most of the original information
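
A minimal sketch of the idea for just two numerical variables, where the principal components can be found in closed form from the 2x2 covariance matrix. Real PCA on many variables would use a linear algebra library; the function name and data here are assumptions for illustration.

```python
import math
import statistics

def pca_2d(x, y):
    """Return (pc1 scores, pc2 scores, share of total variance carried by pc1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    xc = [v - mx for v in x]  # center each variable
    yc = [v - my for v in y]
    n = len(x)
    sxx = sum(a * a for a in xc) / (n - 1)
    syy = sum(b * b for b in yc) / (n - 1)
    sxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    var1, var2 = half_trace + spread, half_trace - spread
    # Angle of the first principal axis; the second axis is perpendicular to it
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    w1 = (math.cos(angle), math.sin(angle))
    w2 = (-math.sin(angle), math.cos(angle))
    # Each component is a linear combination of the centered original variables
    pc1 = [w1[0] * a + w1[1] * b for a, b in zip(xc, yc)]
    pc2 = [w2[0] * a + w2[1] * b for a, b in zip(xc, yc)]
    return pc1, pc2, var1 / (var1 + var2)

# Hypothetical, strongly overlapping variables
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2, share = pca_2d(x, y)
```

Because `x` and `y` carry nearly the same information, the first component holds almost all of the variance and the second could be dropped.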

What's the buzz about data mining?

-Data is already being produced -Data is being warehoused/stored -Computing power is available and affordable -Electronic storage media and capacity are available and affordable -Complex algorithms are now accessible via commercial software

Drowning in data; Starving for knowledge

-Data is collected on almost everything -Credit/debit transactions -grocery purchases -Electronic ticketing

Data Exploration

-Data sets are typically large, complex, & messy -Need to review the data to help refine the task -Use techniques of reduction and visualization

Pre-processing variable types

-Determine the types of pre-processing needed and algorithms used -Categorical vs. numeric

Unsupervised: Data reduction

-Distillation of complex/large data into simpler/smaller data -Reducing the number of variables/columns (principal components) -Reducing the number of records

Misclassification error

-Error -Error rate

Principal components analysis (PCA)

-Goal: Reduce a set of numerical variables -Idea: Remove the overlap of information between variables -Product: A smaller number of numerical variables that contain most of the information

Lift and decile charts

-Goal: useful for assessing performance in terms of identifying the most important class -Compare performance of DM model to 'no model, pick randomly' -Measures ability of DM model to identify the important class, relative to its average prevalence -Give explicit assessment of results over a large number of cutoffs

Unsupervised: Data Visualization

-Graphs and plots of data -Histograms, boxplots, bar charts, scatterplots -Useful to examine relationships between pairs of variables

Visualizations

-Help reveal interesting trends that might otherwise go unseen -Help quickly and efficiently convey what you would like others to concentrate on

Separation of records

-High separation of records means that using predictor variables attains low error -Low separation of records means that using predictor variables does not improve much on the naive rule

Identifying customer requirements

-Identifying the best products for different customers -Use prediction to find what factors will attract new customers

Detecting Outliers

-Important step -Observation that is extreme, being distant from the rest of the data -Have disproportionate influence on models -Required to determine if it is an error or truly extreme -Anomaly Detection (airport security screening)

Questions data mining can answer

-Loyal customer vs. jump ship (churn) -Products to be marketed -Offer responses -Best telemarketing script -Next branch location -Next product/service

Data explosion problem

-Mature database technology has led to tremendous amounts of data being stored in databases, data warehouses, and other information repositories -Higher record complexity

Some measures of error

-Mean absolute error/deviation (MAE/D) -Average error -Mean absolute percentage error (MAPE) -Root mean squared error (RMSE) -Total sum of squared errors (Total SSE)
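
The measures listed above can be sketched as follows. The example assumes the convention error = actual minus predicted and that no actual value is zero (MAPE divides by the actual value); the function name and numbers are made up for illustration.

```python
import math

def error_measures(actual, predicted):
    """Common prediction-error summaries, with error = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,   # mean absolute error
        "average_error": sum(errors) / n,         # signed, shows over/under-prediction
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "RMSE": math.sqrt(sse / n),               # root mean squared error
        "total_SSE": sse,                         # total sum of squared errors
    }

# Hypothetical values, for illustration only
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
measures = error_measures(actual, predicted)
```

The errors are -10, 20, and 0, so MAE is 10 and the total SSE is 500.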

Competition assessment

-Monitor competitors and market directions -Group customers into classes and a class-based pricing procedure -Set pricing strategy in a highly competitive market

Jittering

-Moving markers by a small random amount -Uncrowds the data by allowing more markers to be seen
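
Jittering can be sketched as adding a small uniform random offset to each plotted coordinate. The `amount` parameter and the fixed seed are assumptions for the example; plotting tools usually offer their own jitter option.

```python
import random

def jitter(values, amount=0.1, seed=1):
    """Shift each value by a random offset in [-amount, amount]."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]

# Five markers that would otherwise sit exactly on top of each other
ratings = [3, 3, 3, 3, 3]
jittered = jitter(ratings)
```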

Measuring predictive error

-Not the same as goodness of fit -Want to know how well the model predicts new data

Handling missing data

-Omission of the data -Imputation

Categorical

-Ordered (low, medium, high) -Unordered (male, female) -Naive Bayes can use as-is -For most other algorithms, must create binary dummies

Supervised: Classification

-Predict categorical target variable -Purchase/no purchase -Each row is a case -Each column is a variable -Target variable is binary

Supervised: Prediction

-Predict numerical target variable -Ex: sales & revenue -Row= Case -Column= variable -Target variable is numerical -Taken together, classification & prediction constitute 'predictive analytics'

Unsupervised: Association Rules

-Produce rules that define 'what goes with what' -Ex: X was purchased, Y was also purchased -Rows= Transactions -Used in recommender systems -AKA affinity analysis
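
A rough sketch of measuring "if X was purchased, Y was also purchased" over rows of transactions. The terms support and confidence are the usual names for these two quantities and are not taken from this card; the baskets are hypothetical, and at least one transaction is assumed to contain X.

```python
def rule_strength(transactions, x, y):
    """Return (support, confidence) for the rule 'x was purchased -> y was also purchased'."""
    with_x = [t for t in transactions if x in t]
    with_both = [t for t in with_x if y in t]
    support = len(with_both) / len(transactions)  # share of all transactions with x and y
    confidence = len(with_both) / len(with_x)     # share of x-transactions that also have y
    return support, confidence

# Each row is one transaction (hypothetical baskets)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence = rule_strength(baskets, "bread", "milk")
```

Bread appears in 4 of the 5 baskets and milk is in 3 of those, so the support is 0.6 and the confidence is 0.75.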

Rare event oversampling

-Sampling may yield too few interesting cases to effectively train a model -Solution: oversample the rare cases to obtain a more balanced training set
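
One simple way to oversample, sketched under assumptions: rare records are drawn again with replacement until the two classes are the same size. The card does not say which scheme is intended, so treat the function and its parameters as illustrative.

```python
import random

def oversample_rare(records, label_of, rare_label, seed=1):
    """Duplicate randomly chosen rare records until both classes are equally large."""
    rng = random.Random(seed)
    rare = [r for r in records if label_of(r) == rare_label]
    common = [r for r in records if label_of(r) != rare_label]
    # Draw extra rare records with replacement to close the gap
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra

# Hypothetical records: (id, label), 2 rare cases out of 20
records = [(i, "rare" if i < 2 else "common") for i in range(20)]
balanced = oversample_rare(records, label_of=lambda r: r[1], rare_label="rare")
```

Oversampling is applied to the training set only; performance should still be judged on data that reflects the original class proportions.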

Alternate accuracy measures

-Sensitivity -Specificity -False positive rate -False negative rate

Reducing categories: create dummies

-Single categorical variable with m categories is typically transformed into m-1 dummy variables -Can end up with too many variables -Reduce by combining categories that are close to each other -Use pivot tables to assess which categories are similar
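
The m-1 dummy transformation can be sketched like this. The first category in sorted order is dropped as the reference level, which is an arbitrary choice made for the example; the `is_` column prefix is also an assumption.

```python
def make_dummies(values, drop_first=True):
    """Turn one categorical column into binary dummy columns (m - 1 by default)."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories  # drop one reference category
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]

# Hypothetical column with m = 3 categories
sizes = ["small", "large", "medium", "small"]
dummy_rows = make_dummies(sizes)
```

With three categories only two dummies are produced; a record with all dummies equal to 0 belongs to the dropped reference category ("large" here).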

How to compute lift charts

-Sort records from most likely to least likely to belong to the important class -Compute lift: Accumulate the correctly classified important class records -Naive model: Compare to mean number of important class records
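
The steps above can be sketched as two curves: the cumulative count of important-class records after sorting by score, and the straight-line baseline that accumulates the mean rate. The labels (1 = important class) and scores are hypothetical.

```python
def cumulative_gains(actual, scores):
    """Cumulative count of important-class records, most likely records first."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    cumulative, running = [], 0
    for _, label in ranked:
        running += label
        cumulative.append(running)
    return cumulative

def naive_baseline(actual):
    """Expected cumulative count when records are picked at random."""
    rate = sum(actual) / len(actual)  # mean number of important records per record
    return [rate * (i + 1) for i in range(len(actual))]

# Hypothetical validation records
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.20, 0.90, 0.70, 0.65, 0.10, 0.60, 0.40, 0.30, 0.05]
gains = cumulative_gains(actual, scores)
baseline = naive_baseline(actual)
```

After the top three records the model has found 3 important cases, while random selection would be expected to find 1.2; the gap between the two curves is the lift.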

Overfitting

-Statistical models can produce highly complex explanations of relationships between variables -Models of great complexity often fare quite poorly -Fit of simple model to new data is better than the fit of complex models to new data

Core Ideas

-Supervised learning -Unsupervised learning

Market Analysis and Management

-Target marketing -Determine customer purchasing patterns over time -Cross-market analysis -Customer profiling -Identifying customer requirements -Competition assessment

Causes of overfitting

-Too many predictors -Trying many different models -Consequence: Deployed model will not work as well as expected with completely new data

Normalizing (Standardizing) Data

-Variables with the largest scales would dominate & skew results -Puts variables on same scale -Subtract mean & divide by standard deviation -Or scale to 0-1 by subtracting minimum and dividing by the range
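
Both rescaling options can be sketched with the standard library. The example assumes the sample standard deviation (n - 1 denominator) and a column that is not constant; the income values are made up.

```python
import statistics

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale to 0-1 by subtracting the minimum and dividing by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variable on a large scale
incomes = [20000, 40000, 60000, 80000, 100000]
standardized = z_scores(incomes)
rescaled = min_max(incomes)
```

After either transformation a variable measured in dollars no longer dominates one measured in, say, years.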

Steps in data mining

1. Define & understand purpose 2. Obtain data 3. Explore, clean, pre-process data 4. Reduce the data; if supervised DM, partition it 5. Specify task (classification, clustering, etc.) 6. Choose the techniques (regression, CART, neural networks) 7. Iterative implementation and tuning 8. Assess results: compare models 9. Deploy best model

Business Intelligence Pyramid

Chapter 1 Slide 6

Error

Classifying a record as belonging to one class when it belongs to another class

Heat maps

Color conveys information -Useful for visualizing correlations -Useful for visualizing missing data

Determine customer purchasing patterns over time

Conversion of single to a joint bank account: marriage, etc.

Customer profiling

Data mining can tell you what types of customers buy what products (clustering or classification)

Distribution Plots

Display how many of each value occur in a data set or, for continuous data or data with many possible values, how many values fall in each of a series of ranges or bins

Scatterplot

Displays relationship between two numerical variables

Target marketing

Find clusters of model customers who share the same characteristics: interest, income level, spending habits

Decile chart

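Example reading: a bar of height 2 in the 'most probable' (top) decile means the model is twice as likely as random selection to identify the important class in that decile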
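
A sketch of how the bars of a decile chart could be computed: sort by score, cut into ten equal chunks, and divide each chunk's rate of important-class records by the overall rate. It assumes the number of records divides evenly into the bins; the data is hypothetical.

```python
def decile_lift(actual, scores, n_bins=10):
    """Lift per bin: rate of important records in the bin / overall rate."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    overall_rate = sum(ranked) / len(ranked)
    size = len(ranked) // n_bins  # assumes the record count divides evenly
    lifts = []
    for b in range(n_bins):
        chunk = ranked[b * size:(b + 1) * size]
        lifts.append((sum(chunk) / len(chunk)) / overall_rate)
    return lifts

# Hypothetical validation set: 20 records, already listed from highest to lowest score
scores = [i / 20 for i in range(20, 0, -1)]
actual = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
lifts = decile_lift(actual, scores)
```

The top decile has a lift of 2: both of its records belong to the important class, against an overall rate of one half. A bar at 1 means the model does no better than picking at random in that decile.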

Training

Partition to develop the model

Validation

Partition to implement the model and evaluate its performance on new data

Error rate

Percent of misclassified records out of total records

Imputation

Replace missing values with reasonable substitutes
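
A minimal sketch of imputation, assuming missing values are stored as `None` and that the median of the observed values is a reasonable substitute. The card does not prescribe a particular substitute, so the median is only one option.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries
ages = [34, None, 29, 41, None, 38]
filled = impute_median(ages)
```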

Partitioning the Data

Separate the data into two parts -Training -Validation
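
A random partition can be sketched as a shuffle followed by a cut. The 60/40 split and the fixed seed are assumptions chosen for the example.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into a training part and a validation part."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = records[:]      # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record ids
records = list(range(100))
train, valid = partition(records)
```

Every record lands in exactly one partition, so the validation records are new to a model developed on the training records.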

Matrix plot

Shows scatterplots for variable pairs

Histogram

Shows the distribution of a numerical variable, such as the outcome variable

Boxplot

Side-by-side boxplots are useful for comparing subgroups

Omission

Drop records with missing values; appropriate when only a small number of records are affected

Lift chart: cumulative performance

Tells us how much improvement our model provides over selecting cases at random

What is data mining?

The process of identifying hidden patterns and relationships within data sets; automating information discovery

What is data mining good for?

Turning information into useful knowledge

