454 Exam 1

predictor/input variable/ feature/ independent variable/ field

a variable (X) used as an input into a predictive model

response/dependent variable/output variable/ target variable/ outcome variable

the variable (Y) being predicted in supervised learning

variable

any measurement on the records, including both input (X) variables and the output (Y) variable

supervised learning

classification, prediction; the output variable is known

core ideas in data mining

classification, prediction, predictive analytics, association rules, data reduction, data exploration, data visualization

predictive analytics =

classification + prediction

sample

collection of observations

categorical variable handling

-Naive Bayes can use as-is -In most other algorithms, must create binary dummies (number of dummies = number of categories - 1)
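
A minimal sketch of the dummy rule above, using only the Python standard library. The function name `make_dummies` and the color data are made up for illustration; dropping the alphabetically first category as the baseline is an assumption, since any one category can be dropped.

```python
def make_dummies(values):
    """Return (dummy_names, rows) using (number of categories - 1) dummies."""
    categories = sorted(set(values))
    # Drop the first category; it becomes the all-zeros baseline.
    kept = categories[1:]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


colors = ["red", "green", "blue", "green"]  # made-up example data
names, rows = make_dummies(colors)
print(names)
for color, row in zip(colors, rows):
    print(color, row)
```

Three categories give two dummy columns; a record in the dropped category has 0 in both.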

data exploration

-Need to review large, complex, messy data sets to help refine the task -Use reduction and visualization

consequence of overfitting

Deployed model will not work as well as expected with completely new data

elements of visualization

Marker (shape), size, color

BA relevant to industries with

large volumes of data, a large number of customers, a high level of competition (need for differentiation)

strategic applications of BA

long-term operational planning, long-term financial planning

operational applications of BA

manufacturing, supply chain, customers, financial operations, etc

special circumstances of Applications of BA

mergers, acquisitions, product recalls

2 kinds of numeric variables

continuous, integer

where does visualization fit in

during data preprocessing; note that visuals can be affected by missing data

variable selection

selecting a subset of predictors (rather than including them all)

partitioning data

separate the data into 2 or 3 parts

2 types of data mining techniques

supervised learning, unsupervised learning

omission

-If a small number of records have missing values, can omit them -If many records are missing values on a small set of variables, can drop those variables -If many records have missing values, omission is not practical

rules of thumb for overfitting

-At least 10 records for every predictor variable -At least 6 * number of outcome classes * number of predictor variables
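
The two rules of thumb are simple arithmetic. The sketch below computes each minimum record count separately; the function names and the example of 8 predictors with 2 outcome classes are illustrative assumptions.

```python
def records_per_predictor_rule(num_predictors):
    """First rule: 10 records for every predictor variable."""
    return 10 * num_predictors


def classes_times_predictors_rule(num_classes, num_predictors):
    """Second rule: 6 * number of outcome classes * number of variables."""
    return 6 * num_classes * num_predictors


# Hypothetical task: 8 predictors, binary outcome (2 classes).
print(records_per_predictor_rule(8))
print(classes_times_predictors_rule(2, 8))
```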

exhaustive search

-All possible subsets of predictors assessed (single, pairs, triplets, etc) -Computationally intensive -Judge by "adjusted R squared"
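
To see why exhaustive search is computationally intensive, the sketch below lists every non-empty subset of a predictor list (there are 2^p - 1 of them) and shows the standard adjusted R squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). The predictor names and the R squared value are made up; fitting the regressions themselves is left out.

```python
from itertools import combinations


def all_subsets(predictors):
    """Every non-empty subset: singles, pairs, triplets, and so on."""
    return [
        subset
        for size in range(1, len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]


def adjusted_r2(r2, n, p):
    """Adjusted R squared for n records and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


subsets = all_subsets(["age", "income", "tenure"])  # hypothetical predictors
print(len(subsets))  # 2**3 - 1 candidate models
print(adjusted_r2(0.80, n=101, p=5))
```

With 20 predictors the same enumeration would already produce over a million candidate models.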

histograms

-Best for: distribution of a single continuous variable -May be used for: Anomaly detection, distribution assumption checking, bin decisions

unsupervised: data reduction

-Turn complex/large data into simpler/smaller data -Reducing the number of variables/columns -Reducing the number of records/rows

logistic regression

-Extends idea of linear regression to situation where outcome variable is categorical -Widely used, particularly where a structured model is useful to explain (=profiling) or to predict -ex) how is the performance of male vs female CEOs -We focus on binary classification (Y=0 or Y=1) -ex) if students will perform well on a test

supervised: prediction

-Goal: predict numerical target (outcome) variable -ex) sales, revenue, performance -Each row is a case -Each column is a variable

unsupervised: association rules

-Goal: produce rules that define "what goes with what" -ex) if X was purchased, Y was also purchased -Rows are transactions -Used in recommendation settings ("you may also like this") -Also called "affinity analysis"

the logit

-Instead of Y as outcome variable (like in linear regression), we use a function of Y called logit -Logit can be modeled as a linear function of the predictors -The logit can be mapped back to a probability, which, in turn, can be mapped to a class
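
A small sketch of the mapping described above, assuming the usual definition logit(p) = log(p / (1 - p)). The coefficients `b0` and `b1` and the input `x` are hypothetical values chosen only to show the round trip from linear function to probability.

```python
import math


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Map a logit value back to a probability."""
    return 1 / (1 + math.exp(-z))


b0, b1 = -1.5, 0.8  # hypothetical fitted coefficients
x = 3.0             # hypothetical predictor value
z = b0 + b1 * x     # the logit is linear in the predictor
p = inverse_logit(z)
print(z, p)
```

A logit of 0 corresponds to a probability of 0.5; positive logits map above 0.5 and negative logits below it.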

overfitting

-More is not necessarily better -Statistical models can produce highly complex explanations of relationships between variables -A very complex model may fit the training data closely yet perform poorly on new data

business analytics

-No standard definition -Exploration and analysis of large quantities of data to discover meaningful trends, patterns, and rules to help improve decision-making in business -ex) targeted advertising, personalized recommendations, supplier/customer relationship management, credit scoring, fraud detection, workforce management

normalizing (standardizing) data

-Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner) -Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range; useful when the data contain both dummies and numeric variables -Puts all variables on the same scale -Used when variables with the largest scales would otherwise dominate and skew results
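
Both scaling functions can be sketched in a few lines of standard-library Python. The function names and the sample values are illustrative; using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption.

```python
import statistics


def z_scores(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Scale to 0-1: subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [10, 20, 30]  # made-up values
print(z_scores(incomes))
print(min_max(incomes))
```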

outliers

-Outlier - "extreme", being distant from the rest of the data -Over 3 standard deviations away from the mean -Can have disproportionate influence on models (a problem if it is spurious) -Domain knowledge is required to determine if it is an error, or truly extreme
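
The 3-standard-deviation rule can be sketched as below. The data are invented, and the function name `flag_outliers` is hypothetical. The rule only flags candidates; deciding whether a flagged value is an error or truly extreme still takes domain knowledge.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]


# Made-up measurements: nineteen values near 11 and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 11, 95]
print(flag_outliers(readings))
```

With very small samples no point can reach 3 standard deviations, so the rule needs a reasonable number of records to be useful.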

reasons for variable selection

-Parsimony: simpler model is more robust -Including inputs uncorrelated to response reduces predictive accuracy -Multicollinearity: redundancy in inputs (two or more predictors share the same relationship with the response) can cause unstable results

imputation

-Replace missing values with reasonable substitutes -Lets you keep the record and use the information you have
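
One common choice of "reasonable substitute" is the median of the observed values. The sketch below assumes missing values are stored as `None`; the median is an illustrative choice, not the only option.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


ages = [3, None, 5, 10]  # made-up column with one missing value
print(impute_median(ages))
```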

role of visualization

-data exploration stage - anomaly detection (missing values, errors, duplications), bin size, pattern recognition -knowledge/data presentation

rare event oversampling

-The event of interest is rare -Solution: oversample the rare cases to obtain a more balanced training set -Later adjust the results for the oversampling
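
A rough sketch of one way to balance a training set: keep every record and add randomly chosen copies of the rare cases until the classes are equal in size. The field name `fraud`, the 95/5 split, and the fixed seed are assumptions for illustration; the later adjustment for oversampling is not shown.

```python
import random


def oversample(records, label_key, rare_label, seed=0):
    """Duplicate rare cases (sampling with replacement) until classes balance."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rare = [r for r in records if r[label_key] == rare_label]
    common = [r for r in records if r[label_key] != rare_label]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra


# Made-up data: 95 ordinary transactions and 5 fraudulent ones.
records = [{"fraud": 0} for _ in range(95)] + [{"fraud": 1} for _ in range(5)]
balanced = oversample(records, "fraud", 1)
print(len(balanced), sum(r["fraud"] for r in balanced))
```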

causes of overfitting

-Too many predictors -A model with too many parameters -Trying many different models

3 types of partitions

-Training partition to develop various models -Validation partition to assess performance of each model and pick the best one -Test partition (optional) to assess performance of the chosen model with new data
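
A minimal sketch of a random three-way split. The 50/30/20 proportions and the seed are assumptions; the cards do not fix particular percentages.

```python
import random


def partition(records, train=0.5, valid=0.3, seed=1):
    """Shuffle, then cut into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, test


training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Each record lands in exactly one partition, so the test partition stays unseen until the final model is chosen.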

test partition

-When a model is developed on training data, it can overfit the training data (hence need to assess on validation) -Assessing multiple models on same validation data can overfit validation data -Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data -Solution: final selected model is applied to a test partition to give unbiased estimate of its performance on new data

why data mining

-huge amounts of data (census, gps, banking info) -advances in computational speed and storage -information "hidden in the data that is not immediately apparent"

numeric variable handling

-most algorithms in XLMiner can handle numeric data -May occasionally need to "bin" into categories

steps of data mining

1) Define/understand purpose 2) Obtain data 3) Explore, clean, pre-process data 4) Reduce the data; if supervised DM, partition it 5) Specify task (classification, prediction, clustering, etc.) 6) Choose the techniques (regression, CART, neural networks, etc.) 7) Iterative implementation and "tuning" 8) Assess results - compare models 9) Deploy best model

box plot

Best for: side-by-side comparisons of subgroups on a single continuous variable -Used for: outlier detection, comparing group means, etc.

scatterplot

Best for: displaying relationship between two continuous variables

matrix plot

Best for: displaying scatterplots for pairs of continuous variables

other visualization techniques

Chart types: line/bar/pie charts, bubble charts, tree maps, geo-maps

two steps of logistic regression

1) Estimate the probability of belonging to each class (class 1 or class 0) 2) Classify each case into one of the classes based on a cutoff value on these probabilities (the midpoint, 0.5, is the usual cutoff)
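
The second step can be sketched as follows. The probabilities are made up, and sending a case whose probability equals the cutoff to class 1 is an assumption.

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 when the estimated probability reaches the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


probs = [0.10, 0.45, 0.50, 0.80]  # made-up step-one estimates
print(classify(probs))              # default midpoint cutoff
print(classify(probs, cutoff=0.4))  # a lower cutoff labels more cases as class 1
```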

stepwise

Like forward selection, except that at each step it also considers dropping non-significant predictors

supervised: classification

Predict categorical target (outcome) variable -Binary: yes/no, girl/boy, purchase/no purchase, fraud/ no fraud -Each row is a case -Each column is a variable

dimensions

Qualitative data; used as headers/groups; ex) state or date

measures

Quantitative data; used as axes/values; ex) sales, # of clicks

Data mining software

SAS, SQL, XLMiner, Teradata

backward elimination

-Start with all predictors -Successively eliminate the least useful predictors one by one -Stop when all remaining predictors have a statistically significant contribution

forward selection

-Start with no predictors -Add them one by one (add the one with the largest contribution) -Stop when the addition is not statistically significant -Alternative criteria may include: adjusted R-square (penalizes complex models), F statistics, AIC (Akaike's information criterion), etc.
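
The greedy loop behind forward selection can be sketched without fitting real regressions. Here `score` is a stand-in for whatever criterion is used (adjusted R-square, for instance), and a fixed `min_gain` threshold replaces the statistical significance test; both are simplifying assumptions, as are the predictor names and their toy contributions.

```python
def forward_select(candidates, score, min_gain=0.01):
    """Add the predictor with the largest gain until no addition helps enough."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score each remaining predictor added to the current set.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(trials)
        if top_score - best < min_gain:
            break  # stand-in for "addition is not statistically significant"
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected


# Toy criterion: each predictor adds a fixed, made-up amount to the score.
contribution = {"income": 0.5, "age": 0.2, "zip": 0.001}


def toy_score(predictors):
    return sum(contribution[p] for p in predictors)


print(forward_select(["age", "zip", "income"], toy_score))
```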

obtaining data: sampling

Take a smaller portion to analyze out of the huge database

parsimonious model

a model that accomplishes a desired level of explanation or prediction with as few predictor variables as possible

score

a predicted value or class using the model developed to predict output variables for new data

validation/test/holdout set

a sample of data not used in fitting a model, but instead used to assess the performance of that model

profile

a set of measurements on an observation (height, weight, age)

algorithm

a specific procedure used to implement a particular data mining technique

model

an algorithm as applied to a dataset, complete with settings

unsupervised learning (book def)

an analysis in which one attempts to learn patterns in the data other than predicting an output value of interest

unsupervised learning

-Includes data reduction, data exploration, data visualization -You don't know what you will find -Goal is to segment the data into meaningful segments

what is data mining

exploration and analysis of large quantities of data in order to discover meaningful patterns; also called knowledge discovery in data (KDD)

popular variable selection algorithms

forward selection, backward elimination, stepwise, exhaustive search (not used much)

types of variables in preprocessing data

numeric, categorical

2 solutions for missing data

omission, imputation

3 types of applications of business analytics and what they should provide

operational, strategic, special circumstances; they should provide insights, intelligence, and value

categorical variables

ordered (low, medium, high), unordered (male, female)

benefits of BA

productivity (not overproducing), optimization of processes, return on assets (machine maintenance), return on equity, asset utilization, market value

logistic regression preprocessing

similar to linear regression -Missing values -Categorical inputs -Nonlinear transformations of inputs -Variable selection (including avoiding multicollinearity)

success class

the class of interest in a binary outcome; ex) for a yes/no outcome, the class you want to identify is "yes"

confidence

-In association rules: the conditional probability that C will occur given that A and B have occurred -In statistics: the degree of error in an estimate that results from selecting one sample as opposed to another
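
For the association-rule sense, confidence can be estimated from transaction counts: among transactions containing A and B, the share that also contain C. The transactions below are invented, and the function name is hypothetical.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also have the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return None  # the rule never applies in this data
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
]  # made-up transactions
print(confidence(baskets, {"A", "B"}, {"C"}))
```

Three baskets contain both A and B, and two of those also contain C, so the estimated confidence of "if A and B then C" is 2/3.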

test data/ test set

the portion of data used only at the end of the model building and selection process to assess how well the final model might perform on new data

validation data/set

the portion of data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried

training data

the portion of data used to fit a model

prediction/estimation

the prediction of the numerical value of a continuous output variable

supervised learning (book def)

the process of providing an algorithm with records in which an output variable of interest is known, so that the algorithm "learns" how to predict this value for new records where the output is unknown
