454 Exam 1
predictor/input variable/feature/independent variable/field
a variable (X) used as an input into a predictive model
response/dependent variable/output variable/target variable/outcome variable
the variable (Y) being predicted in supervised learning
variable
any measurement on the records, including both input (X) variables and the output (Y) variable
supervised learning
classification and prediction; the value of the output variable is known in the training data
core ideas in data mining
classification, prediction, predictive analytics, association rules, data reduction, data exploration, data visualization
predictive analytics =
classification + prediction
sample
collection of observations
categorical variable handling
-Naive Bayes can use categorical variables as-is
-In most other algorithms, must create binary dummies (number of dummies = number of categories - 1); see the sketch below
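A minimal pandas sketch of dummy creation (the "color" column is hypothetical; drop_first=True yields categories - 1 dummies):

```python
import pandas as pd

# Hypothetical data: one categorical predictor with 3 categories
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# drop_first=True creates (number of categories - 1) binary dummies,
# dropping the redundant baseline column
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)  # columns: color_green, color_red ("blue" is the dropped baseline)
```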
data exploration
-Need to review large, complex, messy data sets to help refine the task
-Use reduction and visualization
consequence of overfitting
Deployed model will not work as well as expected with completely new data
elements of visualization
Marker (shape), size, color
BA relevant to industries with
large volumes of data, large numbers of customers, high level of competition (need for differentiation)
strategic applications of BA
long-term operational planning, long-term financial planning
operational applications of BA
manufacturing, supply chain, customers, financial operations, etc
special-circumstance applications of BA
mergers, acquisitions, product recalls
2 kinds of numeric variables
continuous, integer
where does visualization fit in
during data preprocessing; note that visuals can be affected by missing data
variable selection
selecting a subset of predictors (rather than including them all)
partitioning data
separate the data into 2 or 3 parts
2 types of data mining techniques
supervised learning, unsupervised learning
omission
-If a small number of records have missing values, can omit them
-If many records are missing values on a small set of variables, can drop those variables
-If many records have missing values, omission is not practical
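A minimal pandas sketch of both omission strategies (the "age"/"income" columns are hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [34, np.nan, 45, np.nan],
                   "income": [50.0, 62.0, 58.0, 71.0]})

# Omit records (rows) that have any missing values
df_rows = df.dropna()

# Or drop variables (columns) that contain missing values (here, "age")
df_cols = df.dropna(axis=1)
```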
rules of thumb for overfitting
-At least 10 records for every predictor variable
-For classification: at least 6 * number of outcome classes * number of variables records; ex) 3 classes and 10 variables call for at least 6 * 3 * 10 = 180 records
exhaustive search
-All possible subsets of predictors assessed (single, pairs, triplets, etc.)
-Computationally intensive
-Judge by adjusted R-squared
histograms
-Best for: distribution of a single continuous variable
-May be used for: anomaly detection, distribution assumption checking, bin decisions
unsupervised: data reduction
-Turn complex/large data into simpler/smaller data
-Reducing the number of variables (columns)
-Reducing the number of records (rows)
logistic regression
-Extends the idea of linear regression to situations where the outcome variable is categorical
-Widely used, particularly where a structured model is useful to explain (=profiling) or to predict
-ex) how does performance differ between male and female CEOs?
-We focus on binary classification (Y=0 or Y=1)
-ex) whether students will perform well on a test
supervised: prediction
-Goal: predict a numerical target (outcome) variable
-ex) sales, revenue, performance
-Each row is a case
-Each column is a variable
unsupervised: association rules
-Goal: produce rules that define "what goes with what"
-ex) if X was purchased, Y was also purchased
-Rows are transactions
-Used in recommendation settings ("you may also like this")
-Also called "affinity analysis"
the logit
-Instead of Y as the outcome variable (like in linear regression), we use a function of Y called the logit
-The logit can be modeled as a linear function of the predictors
-The logit can be mapped back to a probability, which, in turn, can be mapped to a class
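In symbols, the standard formulation (p is the probability that Y = 1; the betas are the fitted coefficients):

```latex
\text{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```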
overfitting
-More is not necessarily better
-Statistical models can produce highly complex explanations of relationships between variables
business analytics
-No standard definition
-Exploration and analysis of large quantities of data to discover meaningful trends, patterns, and rules to help improve decision-making in business
-ex) targeted advertising, personalized recommendations, supplier/customer relationship management, credit scoring, fraud detection, workforce management
normalizing (standardizing) data
-Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner)
-Alternative function: scale 0-1 by subtracting the minimum and dividing by the range; useful when the data contains both dummies and numeric variables
-Puts all variables on the same scale
-Used when variables with the largest scales would dominate and skew results (sketch below)
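A minimal NumPy sketch of both functions (the values are hypothetical):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical variable

# Z-score: subtract the mean and divide by the standard deviation
z = (x - x.mean()) / x.std()

# 0-1 scaling: subtract the minimum and divide by the range
scaled = (x - x.min()) / (x.max() - x.min())
```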
outliers
-An outlier is "extreme": distant from the rest of the data
-Rule of thumb: over 3 standard deviations away from the mean
-Can have disproportionate influence on models (a problem if it is spurious)
-Domain knowledge is required to determine if it is an error or truly extreme
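A minimal sketch of the 3-standard-deviation screen, on synthetic data with one planted extreme value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 100), 120.0)  # synthetic data plus one extreme value

# Flag values more than 3 standard deviations from the mean
z = np.abs(x - x.mean()) / x.std()
outliers = x[z > 3]  # the planted 120.0 is flagged; domain knowledge decides error vs. truly extreme
```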
reasons for variable selection
-Parsimony: a simpler model is more robust
-Including inputs uncorrelated with the response reduces predictive accuracy
-Multicollinearity: redundancy in inputs (two or more predictors carry much the same information) can cause unstable results
imputation
-Replace missing values with reasonable substitutes
-Lets you keep the record and use the information you have
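A minimal pandas sketch, imputing the median as a common reasonable substitute (the "income" column is hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [52.0, np.nan, 61.0, 48.0]})  # hypothetical data

# Replace the missing value with the median of the observed values
df["income"] = df["income"].fillna(df["income"].median())
```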
role of visualization
-Data exploration stage: anomaly detection (missing values, errors, duplications), bin size, pattern recognition
-Knowledge/data presentation
rare event oversampling
-The event of interest is rare
-Solution: oversample the rare cases to obtain a more balanced training set (sketch below)
-Later adjust the results for the oversampling
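A minimal pandas sketch, assuming a hypothetical "fraud" outcome as the rare event:

```python
import pandas as pd

# Hypothetical training data: 'fraud' is the rare event of interest
train = pd.DataFrame({"amount": [10, 15, 12, 900, 14, 11],
                      "fraud":  [0, 0, 0, 1, 0, 0]})

rare = train[train["fraud"] == 1]
common = train[train["fraud"] == 0]

# Oversample the rare cases (with replacement) to balance the training set
rare_up = rare.sample(n=len(common), replace=True, random_state=1)
balanced = pd.concat([common, rare_up])
# Remember to adjust the results later for the oversampling
```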
causes of overfitting
-Too many predictors
-A model with too many parameters
-Trying many different models
3 types of partitions
-Training partition to develop various models
-Validation partition to assess the performance of each model and pick the best one
-Test partition (optional) to assess the performance of the chosen model with new data
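A minimal scikit-learn sketch of a 60/20/20 split on synthetic data (two calls, since train_test_split makes one split at a time):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 records, 3 predictors, binary outcome
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Carve off the 20% test partition, then split the rest 75/25 (= 60/20 overall)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)
```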
test partition
-When a model is developed on training data, it can overfit the training data (hence the need to assess on validation data)
-Assessing multiple models on the same validation data can overfit the validation data
-Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data
-Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
why data mining
-huge amounts of data (census, GPS, banking info)
-advances in computational speed and storage
-information "hidden in the data that is not immediately apparent"
numeric variable handling
-Most algorithms in XLMiner can handle numeric data as-is
-May occasionally need to "bin" a numeric variable into categories (sketch below)
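A minimal pandas sketch of binning (the age data and cut points are hypothetical):

```python
import pandas as pd

ages = pd.Series([23, 37, 45, 62, 18])  # hypothetical numeric variable

# Bin the numeric variable into 3 categories at the chosen cut points
age_bins = pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])
```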
steps of data mining
1) Define/understand the purpose
2) Obtain data
3) Explore, clean, and pre-process data
4) Reduce the data; if supervised DM, partition it
5) Specify the task (classification, prediction, clustering, etc.)
6) Choose the techniques (regression, CART, neural networks)
7) Iterative implementation and "tuning"
8) Assess results - compare models
9) Deploy the best model
box plot
Best for: side-by-side comparisons of subgroups on a single continuous variable
Used for: outlier detection, comparing distributions across groups, etc.
scatterplot
Best for: displaying relationship between two continuous variables
matrix plot
Best for: displaying scatterplots for pairs of continuous variables
other visualization techniques
Chart types: line/bar/pie charts, bubble charts, tree maps, geo-maps
two steps of logistic regression
1) The first step yields estimates of the probability of belonging to each class (class 1 or class 0)
2) Classify each case into one of the classes by applying a cutoff to these probabilities (default cutoff = 0.5); see the sketch below
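A minimal scikit-learn sketch of both steps on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one predictor, binary outcome
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Step 1: estimated probabilities of belonging to class 1
probs = model.predict_proba(X)[:, 1]

# Step 2: classify each case using a cutoff on the probabilities (0.5 here)
classes = (probs >= 0.5).astype(int)
```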
stepwise
Like forward selection, except at each step also consider dropping non-significant predictors
supervised: classification
Predict a categorical target (outcome) variable
-Binary: yes/no, girl/boy, purchase/no purchase, fraud/no fraud
-Each row is a case
-Each column is a variable
dimensions
Qualitative data used as headers/groups; ex) state or date
measures
Quantitative data used as axes/values; ex) sales, # of clicks
Data mining software
SAS, SQL, XLMiner, Teradata
backward elimination
-Start with all predictors
-Successively eliminate the least useful predictors one by one
-Stop when all remaining predictors have statistically significant contributions
forward selection
-Start with no predictors
-Add them one by one (add the one with the largest contribution)
-Stop when the addition is not statistically significant
-Alternative criteria include: adjusted R-squared (penalizes complex models), the F statistic, AIC (Akaike's information criterion), etc. (sketch below)
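A minimal sketch of the forward-selection loop on synthetic data, using adjusted R-squared as the criterion (statsmodels; a significance-test stopping rule would work similarly):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: 5 candidate predictors, numeric outcome driven by two of them
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)

def adj_r2(cols):
    """Fit OLS on the chosen predictor columns and return adjusted R-squared."""
    Xc = sm.add_constant(X[:, cols])
    return sm.OLS(y, Xc).fit().rsquared_adj

selected, remaining = [], list(range(5))
best = -np.inf
while remaining:
    # Try adding each remaining predictor; keep the one with the largest contribution
    scores = {j: adj_r2(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best:
        break  # stop when no addition improves the criterion
    best = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
```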
obtaining data: sampling
Take a smaller portion to analyze out of the huge database
parsimonious model
a model that accomplishes a desired level of explanation or prediction with as few predictor variables as possible
score
a predicted value or class, obtained by applying the developed model to new data
validation/test/holdout set
a sample of data not used in fitting a model, but instead used to assess the performance of that model
profile
a set of measurements on an observation (height, weight, age)
algorithm
a specific procedure used to implement a particular data mining technique
model
an algorithm as applied to a dataset, complete with settings
unsupervised learning (book def)
an analysis in which one attempts to learn patterns in the data other than predicting an output value of interest
unsupervised learning
-data reduction, data exploration, data visualization
-You don't know what you will find
-Goal is to segment the data into meaningful segments
what is data mining
exploration and analysis of large quantities of data in order to discover meaningful patterns; also called knowledge discovery in databases (KDD)
popular variable selection algorithms
forward selection, backward elimination, stepwise, exhaustive search (not used much)
types of variables in preprocessing data
numeric, categorical
2 solutions for missing data
omission, imputation
3 types of applications of business analytics and what they should provide
operational, strategic, special circumstances; should provide insights, intelligence, and value
categorical variables
ordered (low, medium, high), unordered (male, female)
benefits of BA
productivity (avoiding overproduction), optimization of processes, return on assets (ex) machine maintenance), return on equity, asset utilization, market value
logistic regression preprocessing
Similar to linear regression:
-Missing values
-Categorical inputs
-Nonlinear transformations of inputs
-Variable selection (including avoiding multicollinearity)
success class
the class of interest in a binary outcome; ex) with yes/no, "yes" is the success class
confidence
-In association rules: the conditional probability that C will occur given that A and B have occurred
-In statistics: the degree of error in an estimate that results from selecting one sample as opposed to another
test data/ test set
the portion of data used only at the end of the model building and selection process to assess how well the final model might perform on new data
validation data/set
the portion of data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried
training data
the portion of data used to fit a model
prediction/estimation
the prediction of the numerical value of a continuous output variable
supervised learning (book def)
the process of providing an algorithm with records in which an output variable of interest is known, so that the algorithm "learns" how to predict this value for new records where the output is unknown
observation/instance/sample/example/case/record/pattern/row
the unit of analysis on which the measurements are taken