DATA 403 Final
SAS Enterprise Miner Analytic Strengths
Pattern Discovery - SAS is the leader in pattern discovery
Predictive Modeling - e.g., decision trees
Data Mining: Two Broad Areas
Pattern Discovery/Exploratory Analysis (Unsupervised Learning)
-There is no target variable, and some form of analysis is performed to do the following:
--identify or define homogeneous groups, clusters, or segments
--find links or associations between entities, as in market basket analysis
Prediction (Supervised Learning)
-A target variable is used, and some form of predictive or classification model is developed.
-Input variables are associated with values of a target variable, and the model produces a predicted target value for a given set of inputs.
Missing Values and Regression Modeling
Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.
-Empty cells mean missing data; the only usable cases are those with no missing values.
Consequence: Missing values can significantly reduce your amount of training data for regression modeling!
Missing Values and the Prediction Formula
Problem 2: Prediction formulas cannot score cases with missing values.
Text Mining Applications
Successful results are easier to verify for predictive modeling applications, where free-form textual data can be used to derive new types of input variables. Predictive modeling requires data labeled with a known target (outcome) variable. Most analysts agree that predictive modeling is where the "big payoff" in data mining is. Predictive modeling takes most advantage of the integrated environment in SAS Enterprise Miner, which provides powerful predictive modeling tools (regression, decision trees, neural nets, and so on).
Data Partitioning
Partition available data into training and validation sets.
Pruning Two Splits
-Similarly, this is done for subsequent models.
-Prune two splits from the maximal tree,...
-...rate each subtree using validation assessment, and...
-...select the subtree with the best assessment rating.
Estimate Optimization: Squared Error
(target - estimate)^2
Minimize squared error: the squared difference between target and prediction
-measures how far off you are; lower is better
-usually averaged over cases (average squared error)
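A quick Python sketch of this computation (the targets and estimates below are made-up values):

```python
# Hypothetical targets and model estimates (illustrative values only)
targets = [1, 0, 1, 1, 0]
estimates = [0.8, 0.2, 0.6, 0.9, 0.4]

# Squared error per case is (target - estimate)^2; averaging over
# cases gives the average squared error used for model assessment
ase = sum((t - e) ** 2 for t, e in zip(targets, estimates)) / len(targets)
```

Lower values indicate estimates closer to the targets.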
The Modeling Sample
+ Similar predictive power with smaller case count
− Must adjust assessment statistics and graphics
− Must adjust prediction estimates for bias
Neural Networks -Stopped Training
-Other predictive models select an optimal model from a sequence of possible models; our results showed one model with three hidden units.
-For neural nets, Enterprise Miner treats each iteration as a separate model.
-The iteration with the smallest fit statistic is the final model.
-The name stopped training comes from the fact that the final model is selected as if training were stopped on the optimal iteration.
-Detecting when this optimal iteration occurs (while actually training) is somewhat problematic. To avoid stopping too early, the Neural Network tool continues to train until convergence on the training data or until reaching the maximum iteration count, whichever comes first.
Novelty detection - segment profile
- seek unique or previously unobserved data patterns. - Applications in business, science, and engineering - Business applications include fraud detection, warranty claims analysis, and general business process monitoring.
History of Data Mining
-1960s - "Data Fishing" or "Data Dredging" - what data mining was called first
-Early 1990s - "Data Mining"
-Intersection of statistics, machine learning, data management and databases, pattern recognition, and artificial intelligence into a new discipline
-Born from the digital data age and increased storage technologies
-Data is already collected
-Data miners typically play little or no role in the data collection strategies
-Large data sets:
---housekeeping issues
---fundamental issues
-Knowledge Discovery in Databases (KDD)
Market Basket Analysis
-Also called association rule discovery or affinity analysis, is used to analyze streams of transaction data for combinations of items that occur (or do not occur) more (or less) commonly than expected. -Goal is to determine the strength of all the association rules among a set of items
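The strength of an association rule is usually measured by support and confidence. A small Python sketch with made-up baskets (the diapers-and-beer rule echoes the case study mentioned later in these notes):

```python
# Toy transaction data (hypothetical baskets) to illustrate support and
# confidence for the rule {diapers} -> {beer}
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"beer"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    # Fraction of all transactions containing every item in the itemset
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    # Of the transactions containing the left-hand side, the fraction
    # that also contain the right-hand side
    return support(lhs | rhs, baskets) / support(lhs, baskets)
```

For these baskets, {diapers} appears in 3 of 5 transactions and {diapers, beer} in 2 of 5, so the rule's confidence is 2/3.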
Separate Sampling
-Also called: oversampling, balanced sampling, choice-based sampling, case-control sampling, etc.
-If you do not adjust for separate sampling, the following occurs:
•Prediction estimates reflect target proportions in the training sample, not the population from which the sample was drawn.
•Score Rankings plots are inaccurate and misleading.
•Decision-based statistics related to misclassification or accuracy misrepresent the model performance on the population.
---Fortunately, it is easy to adjust for separate sampling in SAS Enterprise Miner. However, you must rerun the models that you created, which takes time.
-Target-based samples are created by considering the primary outcome cases separately from the secondary outcome cases.
-secondary - select some cases
-primary - select all cases
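One common way to map predictions from a balanced sample back to the population is a prior-probability correction. A hedged Python sketch (this is the standard adjustment formula, not necessarily SAS Enterprise Miner's exact internals; the 5% response rate mirrors the charity example later in these notes):

```python
def adjust_for_oversampling(p, pop_prior, sample_prior):
    """Rescale a predicted primary-outcome probability from an
    oversampled training sample back to the population scale.
    pop_prior: primary-outcome proportion in the population.
    sample_prior: primary-outcome proportion in the training sample."""
    num = p * pop_prior / sample_prior
    den = num + (1 - p) * (1 - pop_prior) / (1 - sample_prior)
    return num / den

# Example: 5% population response rate, 50/50 balanced training sample.
# A "neutral" prediction of 0.5 maps back to the 5% base rate.
adjusted = adjust_for_oversampling(0.5, pop_prior=0.05, sample_prior=0.5)
```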
Targeted Marketing
-Cases = customers, prospects, suspects, households -Inputs = geo/demo/psycho‐graphics -Target = response to a past or test solicitation -Action = target high‐responding segments of customers in future campaigns
Attrition Prediction/ Defection Detection
-Cases = existing customers -Inputs = payment history, product/service usage, demographics -Target = churn, brand‐switching, cancellation, defection -Action = customer loyalty promotion
Credit Scoring
-Cases = past applicants -Inputs = application information, credit bureau reports -Target = default, charge‐off, serious delinquency, repossession, foreclosure -Action = accept or reject future applicants for credit
Fraud Detection
-Cases = past transactions or claims -Inputs = particulars and circumstances -Target = fraud, abuse, deception -Action = impede or investigate suspicious cases
Validation Assessment
-Choose the simplest model with the highest validation assessment.
-What are appropriate validation assessment ratings?
Predictive Model Sequence
-Create a sequence of models with increasing complexity from the training data. -1 - least complex -5 - most complex -A maximal tree is the most complex model in the sequence.
Summary Statistics Summary
-Decisions
---Accuracy (largest)
---Misclassification (smallest)
---Profit/Loss (largest/smallest)
---Inverse prior threshold
-Rankings
---ROC Index (concordance) (largest)
---Gini coefficient (largest)
-Estimates
---Average squared error (smallest)
---SBC (smallest)
---Likelihood (largest)
Text Mining Dictionaries
-Inclusion Dictionary Contains only relevant terms to be used in the analysis Called a start list by SAS Text Miner -Exclusion Dictionary Contains irrelevant or low-information terms that will be ignored Called a stop list by SAS Text Miner -Synonym Dictionary -Multi-word Term Dictionary -Topic Dictionary
Redundancy
-Input x2 has the same information as input x1.
-Example: x1 is household income and x2 is home value.
-closely correlated inputs
Neural Nets: Beyond the Prediction Formula
-Manage missing values. Imputation -Handle extreme or unusual values. Helped somewhat by the hyperbolic tangent function -Use non-numeric inputs. Not as complicated as with regressions -Account for nonlinearities. Easily accommodated -Interpret the model. Can be difficult
Neural Networks
-Mysterious and powerful - the most typical form is a natural extension of a regression model
-When properly trained, neural nets can model any association between inputs and targets.
-Cost: input selection is not easy
-Cost offset somewhat by "stopped training" - stopped training can reduce the chances of overfitting, even in the presence of redundant and irrelevant inputs.
-Neural nets can be thought of as a regression model on a set of derived inputs, called hidden units.
-The hidden units can be thought of as regressions on the original inputs.
-The hidden unit "regressions" include a default link function (in neural network language, an activation function), the hyperbolic tangent.
-The hyperbolic tangent is a shift and rescaling of the logistic function introduced in Module 4.
-# of input nodes depends on the number and the type of attributes in the data set
-# of hidden layers (and nodes in each hidden layer) are configurable by the user
-There can be more than one output/target layer
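A minimal Python sketch of this structure: each hidden unit is a tanh "regression" on the inputs, and the output is a logistic regression on the hidden units. All weights below are made up for illustration, not trained values:

```python
import math

def hidden_unit(inputs, weights, bias):
    # A hidden unit is a "regression" on the original inputs passed
    # through the hyperbolic tangent activation function
    return math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))

def predict(inputs, hidden_params, output_weights, output_bias):
    # The output is a logistic regression on the derived hidden units
    h = [hidden_unit(inputs, w, b) for w, b in hidden_params]
    z = output_bias + sum(w * u for w, u in zip(output_weights, h))
    return 1 / (1 + math.exp(-z))  # predicted primary-outcome probability

# Illustrative weights for 2 inputs and 3 hidden units
params = [([0.5, -1.0], 0.1), ([1.2, 0.3], -0.2), ([-0.7, 0.8], 0.0)]
p = predict([1.0, 2.0], params, [0.4, -0.6, 1.1], 0.2)
```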
Modeling Essentials
-Predict new cases - prediction rules - in order to make decisions
----Decide, rank, and estimate.
----how with the model
-Select useful inputs. - split search - select useful variables
----lots of data, lots of inputs
----Eradicate redundancies and irrelevancies.
-Optimize complexity. - pruning
----Tune models with validation data.
----to compensate for noisy data
Model Essentials: Neural Networks
-Predict new cases. ---prediction formula
-Select useful inputs. ---none
-Optimize complexity. ---stopped training
Model Essentials: Dmine Regression
-Predict new cases. --prediction formula
-Select useful inputs. --forward selection
-Optimize complexity. --stop R-square
Model Essentials: DMNeural
-Predict new cases. --stagewise prediction formula -Select useful inputs. --principal component -Optimize complexity. --max stage
Model Essentials: Regressions
-Predict new cases. -Prediction formula -Select useful inputs. - Sequential selection -Optimize complexity. Best model from sequence
Model Implementation
-Predictions can be added to a data source inside (or outside of) SAS Enterprise Miner.
-After you train and compare models, select the "winning model."
-Put the model to use!
-Two options:
1. Internally scored data sets are created by combining the Score tool with a data set identified for scoring.
•A copy of the scored data set is stored on the SAS Foundation server assigned to your project.
2. Scoring code modules are used to generate predicted target values in environments outside of SAS Enterprise Miner.
•Miner can create scoring code in the SAS, C, and Java programming languages.
•The SAS language code can be embedded directly into a SAS Foundation application to generate predictions. The C and Java language code must be compiled.
•Use if the data set to be scored is very large.
Irrelevancy
-Predictions change with input x4 but much less with input x3.
-Example: Target is the response to direct mail solicitation, x3 is religious affiliation, and x4 is the response to previous solicitations.
-x3 = irrelevant
-an irrelevant input includes no added info
Score Node
-The Score node creates predictions using the model deemed BEST by the Model Comparison node -If you want to create predictions using a specific model, either delete the connection to the Model Comparison node of the models that you do not want to use, or connect the Score node directly to the desired model and continue as described.
Pruning One Split
-The next model in the sequence is formed by pruning one split from the maximal tree.
-Each subtree's predictive performance is rated on validation data.
-The subtree with the highest validation assessment is selected.
Profiling - SOM/Kohonen
-A by-product of reduction methods such as cluster analysis.
-Create rules that isolate clusters or segments, often based on demographic or behavioral measurements.
-A marketing analyst might develop profiles of a customer database to describe the consumers of a company's products.
Market basket analysis - association
-Also called association rule discovery; used to analyze streams of transaction data (for example, market baskets) for combinations of items that occur (or do not occur) more (or less) commonly than expected.
-Retailers can use this as a way to identify interesting combinations of purchases or as predictors of customer segments.
-Diapers and beer case study
-Target (and other retailers) coupons
Sequence analysis - path analysis
-An extension of market basket analysis to include a time dimension to the analysis.
-Transaction data is examined for sequences of items that occur (or do not occur) more (or less) commonly than expected.
-A Webmaster might use sequence analysis to identify patterns or problems of navigation through a Web site.
Charity Direct Mail Demonstration
Analysis goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing-donor responses from an earlier campaign to predict future lapsing-donor responses.
-modeling behavior of the now non-givers
-what are the reasons people stop giving
-Analysis data:
•extracted from previous year's campaign
•sample balances response and non-response rates
•actual response rate of approximately 5%
-modeling the future based on a recent sampling (1-3 years)
Which of the following are true about neural networks in SAS Enterprise Miner?
-neural networks are universal approximators. -Neural networks have no internal, automated process for selecting useful inputs.
Predictive Modeling Tools
-primary --decision tree --regression --neural networks -specialty --autoneural --Dmine regression --gradient boosting --MBR -- Partial least squares --rule induction --LARS --model input -multiple --ensemble --two stage
Logistic Regression
-response is qualitative (categorical) with two possible outcomes
ex. Coupon usage (Y/N); heart attack risk (has had a heart attack, has not had a heart attack); credit scoring (default? Y/N)
Regression
-response is quantitative (interval) with many outcomes ex. Weight vs Height; GPA vs ACT/SAT scores
SEMMA Tools Palette
-sample
-explore
-modify
-model
-assess
CRISP-DM is similar to this
a data source in SAS EM differs from a raw data file because a data source has additional metadata attached. this metadata includes which of the following?
-the variable measurement levels -the data table role -the variable roles
split search algorithm
-used when working with decision trees
-uses maximum logworth to determine the best split
-interval inputs can be split at any unique value; categorical inputs are split using the average target value within each level
-the splits help in building maximal and optimal trees and discovering the best possible model
Decision Tree Split Search
1. Select an input for partitioning the available data
-if interval: each unique value is a potential split point
-if categorical: average value of the target is taken within each categorical level (the average becomes a unique value)
2. Two groups generated
-Cases with input values less than the split point = branch left
-Cases with input values greater than the split point = branch right
3. Groups form a 2x2 contingency (two-way) table
4. Pearson chi-squared statistic used to quantify the independence of counts in the table's columns. Large difference in outcome proportions = good split
5. Get the p-value from the Pearson chi-squared statistic
-Large data sets have p-values close to 0 (zero)
6. Report logworth = -log(chi-squared p-value)
----large data sets all have significant p-values
-Calculate the logworth of every partition on input x1.
-Select the partition with the maximum logworth.
-Repeat for input x2.
-Compare partition logworth ratings.
-Create a partition rule from the best partition across all inputs.
-Repeat the process in each subset.
-Create a second partition rule.
-Repeat to form a maximal tree (the biggest tree).
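Steps 3-6 can be sketched in Python. The logworth here uses -log10 of the chi-squared p-value (a 2x2 table has 1 degree of freedom); the branch/outcome counts are made up:

```python
import math

def logworth(table):
    """Logworth = -log10(p-value) of the Pearson chi-squared test on a
    2x2 left/right-branch by primary/secondary-outcome table."""
    (a, b), (c, d) = table              # rows: branches, cols: outcomes
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = row[i] * col[j] / n   # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    # Chi-squared tail probability with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return -math.log10(p_value)

# A split with very different outcome proportions in each branch
lw = logworth([(80, 20), (30, 70)])
```

A split whose branches have identical outcome proportions has logworth 0; larger logworth means a better split.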
k‐means Clustering Algorithm
1. Select inputs. 2. Select kcluster centers. 3. Assign cases to closest center. 4. Update cluster centers. 5. Reassign cases. 6. Repeat steps 4 and 5until convergence.
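The six steps above can be sketched in Python (an illustrative implementation, not the exact SAS Enterprise Miner one):

```python
import math
import random

def kmeans(cases, k, iters=20, seed=0):
    """Minimal k-means sketch: pick k centers, then alternate between
    assigning cases to the closest center and updating each center to
    the mean of its assigned cases."""
    rng = random.Random(seed)
    centers = rng.sample(cases, k)               # step 2: pick k centers
    for _ in range(iters):                       # steps 3-6: iterate
        clusters = [[] for _ in range(k)]
        for x in cases:                          # assign to closest center
            i = min(range(k), key=lambda i: math.dist(x, centers[i]))
            clusters[i].append(x)
        centers = [                              # update centers to means
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated groups of made-up 2D cases
cases = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(cases, 2)
```

A fixed iteration count stands in for a convergence check to keep the sketch short.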
Sequential Selection -Stepwise
1. Start with a baseline model 2. The one-input model that improves baseline is selected 3. Evaluate the statistical significance of included input (p-value) 4. Keep input and move on to select another input OR remove input ...Sequence terminates when all inputs available for inclusion have p-values > entry cutoff AND all inputs included in the model have p-values < stay cutoff.
Sequential Selection -Forward
1. Start with a baseline model
2. The one-input model that improves baseline is selected
3. The two-input model that improves the previous step is selected
4. The three-input model that improves the previous step is selected
...Sequence terminates when no significant improvement is made.
-an input with a p-value less than the entry cutoff enters
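A skeleton of the forward search in Python. The real procedure compares p-values against an entry cutoff; here a generic model-score improvement stands in for that test, and the toy score function is made up:

```python
def forward_selection(inputs, score, min_improvement=1e-6):
    """At each step, add the candidate input that most improves the
    model score; stop when no candidate improves it enough."""
    selected, best = [], score([])
    while True:
        candidates = [x for x in inputs if x not in selected]
        if not candidates:
            break
        step = max(candidates, key=lambda x: score(selected + [x]))
        new = score(selected + [step])
        if new - best < min_improvement:
            break                       # no significant improvement
        selected, best = selected + [step], new
    return selected

# Toy score: suppose x1 and x3 are the only informative inputs
value = {"x1": 0.30, "x2": 0.0, "x3": 0.15, "x4": 0.0}
chosen = forward_selection(["x1", "x2", "x3", "x4"],
                           lambda s: sum(value[x] for x in s))
```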
Sequential Selection -Backward
1. Start with a saturated model
2. Remove the input that least reduces the overall model fit statistic (the input with the highest p-value)
...Sequence terminates when all remaining inputs have a p-value less than the predetermined stay cutoff.
-end result: the inputs with p-values less than the stay cutoff remain
give two examples of real world business applications that can benefit from data mining
1. In a boutique, data mining could help you figure out the number of returning customers; from this you could start figuring out ways to work on customer retention.
2. A utility company could use data mining to be better informed about future gas/utility usage, gauging use during different parts of the year and in different areas. This would help the company be better prepared operationally and also prepare financially for low-usage months.
In what decade were the first forms of data mining, known as data fishing or data dredging used?
1960s
Outcome Overrepresentation
A common predictive modeling practice is to build models from a sample with a primary outcome proportion that is different from the original population.
Decision Predictions
A predictive model uses input measurements to make the best decision for each case.
-simplest
-categorical target
-primary or secondary
-binary
-but dependent on target
Estimate Predictions
A predictive model uses input measurements to optimally estimate the target value.
Ranking Predictions
A predictive model uses input measurements to optimally rank each case.
-order based on input values
-ex: credit score
polynomial regression
Add polynomial terms to a regression either by hand or by an autonomous exhaustive search.
Odds ratio:
Amount the odds change with a unit change in the input.
-pivot point around 1: >1 positive association, <1 negative association
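For a logistic regression, the odds ratio for an input is exp(coefficient): the multiplicative change in the odds for a one-unit increase in that input. A tiny Python sketch with hypothetical coefficients:

```python
import math

beta = 0.7                            # hypothetical fitted coefficient
odds_ratio = math.exp(beta)           # > 1: positive association

beta_neg = -0.2                       # hypothetical coefficient, another input
odds_ratio_neg = math.exp(beta_neg)   # < 1: negative association
```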
Sequence Analysis
An extension of market basket analysis to include a time dimension to the analysis.
-Transaction data is examined for sequences of items that occur (or do not occur) more (or less) commonly than expected.
-A Webmaster might use sequence analysis to identify patterns or problems of navigation through a Web site.
Variable Roles
Assessment Censor Classification Cost Cross ID Decision Frequency ID Input Label Prediction Referrer Rejected Residual Segment Sequence Target Text Text Location Time ID Web Address
Measurement Levels
Categorical (Class, Qualitative) -Unary -Binary -Nominal -Ordinal Numeric (Quantitative) -Interval -Ratio ** All methods that accommodate an interval measurement scale in SAS Enterprise Miner also support a ratio scale.
model comparison
Compare model summary statistics and statistical graphics.
Selecting the Best Tree
Compare validation assessment between tree complexities.
replacement
Consolidate levels of a nonnumeric input using the Replacement Editor window.
Subsequent Pruning
Continue pruning until all subtrees are considered.
Text Mining: Training
Corpus (original documents) → text mining training → scores
neural network
Create a multi‐layer perceptron on selected inputs. Control complexity with stopped training and hidden unit count
input data
Create decision data; add prior probabilities and profit matrices.
regression
Create linear and logistic regression models. Select inputs with a sequential selection method and appropriate fit statistic. Interpret models with odds ratios.
Predictive Modeling Applications
Database marketing Financial risk management Fraud detection Process monitoring Pattern detection
The Analytic Workflow
Define analytic objective
Select cases
Extract input data
Validate input data
Repair input data
Transform input data
Apply analysis
Generate deployment methods
Integrate deployment
Gather results
Assess observed results
Refine analytic objective
-iterative process
-sequence of steps required to complete our analytical process
Doubling amount:
How much does an input need to change to double the odds? -not as common
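Since a change d in an input multiplies the odds by exp(beta * d), solving exp(beta * d) = 2 gives d = ln(2) / beta. A one-line sketch with a hypothetical coefficient:

```python
import math

beta = 0.7                           # hypothetical logistic coefficient
doubling_amount = math.log(2) / beta  # input change that doubles the odds
```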
Missing Value Causes
Non-applicable measurement
No match on merge
Non-disclosed measurement
How Are the Scores Used?
Information retrieval - Querying using linear algebra - Discovery through topic identification Text categorization - Clustering or grouping documents (unsupervised classification) - Automatic categorizing of documents (supervised classification) Analytics support - Derive additional features or attributes to be used for exploration or predictive modeling
Text Mining Applications: Unsupervised
Information retrieval - finding documents with relevant content of interest - used for researching medical, scientific, legal, and news documents such as books and journal articles Document categorization for organizing - clustering documents into naturally occurring groups - extracting themes or concepts Anomaly detection - identifying unusual documents that might be associated with cases requiring special handling such as unhappy customers, fraud activity, and so on
Regressions: Beyond the Prediction Formula
Manage missing values. Interpret the model. Account for nonlinearities. Handle extreme or unusual values. Use nonnumeric inputs.
Missing Value Issues
Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.
Problem 2: Prediction formulas cannot score cases with missing values.
Text Mining Applications: Supervised
Many typical predictive modeling or classification applications can be enhanced by incorporating textual data in addition to traditional input variables. - churning propensity models that include customer center notes, website forms, emails, and Twitter messages - hospital admission prediction models incorporating medical records notes as a new source of information - insurance fraud modeling using adjustor notes - sentiment categorization from customer comments - stylometry or forensic applications that identify the author of a particular writing sample
Score Code Modules
Model deployment usually occurs outside of SAS Enterprise Miner and sometimes even outside of SAS. To accommodate this need, SAS Enterprise Miner is designed to provide score code modules to create predictions from properly prepared tables. In addition to the prediction code, the score code modules include all the transformations that are found in your modeling process flow. You can save the code as a SAS, C, or Java program.
stat explore
Obtain means and other statistics on data source variables
Analysis Element Organization
Projects - hold everything
Libraries and Diagrams - hold the analysis
Process Flows - perform the analysis
Nodes
Model Performance Assessment
Rate model performance using validation data. -Select the simplest model with the highest validation assessment.
Table Roles
Raw Training Validation Test Score Transaction
transform variables
Regularize distributions of inputs. Typical transformations control for input skewness via a log transformation.
impute
Replace missing values for interval (means) and categorical data (mode). Create a unique replacement indicator.
replacement
Replace unwanted values of an input variable in analysis data.
Defining a Data Source module 9
Select a table. Define variable roles. Define measurement levels. Define the table role.
Signal versus Noise in Predictive Modeling
Target = Signal + Noise Signal = Systematic Variation = Predictable Noise = Random Variation = Unpredictable
Text Mining
Text mining as presented here has the following characteristics:
-operates with respect to a corpus of documents
-uses one or more dictionaries or vocabularies to identify relevant terms
-accommodates a variety of algorithms and metrics to quantify the contents of a document relative to the corpus
-derives a structured vector of measurements for each document relative to the corpus
-uses analytical methods that are applied to the structured vector of measurements based on the goals of the analysis (for example, groups documents into segments)
Predictive Modeling
The Essence of Data Mining "Most of the big payoff [in data mining] has been in predictive modeling." -Herb Edelstein
Text Mining Preliminaries
The core ingredient of any text mining solution is a well-defined process to turn unstructured text into a set of numbers. Experimenting with search engine software illustrates how text mining algorithms work, and provides insight into the history of text mining. The primary application of search engine software is information retrieval. Text mining has many applications that go beyond information retrieval. Fortunately, information retrieval technology generalizes to address many other applications.
Text Mining Tools
The following Enterprise Miner text mining nodes are discussed: Text Cluster Text Filter Text Import Text Parsing Text Profile Text Rule Builder Text Topic HP Text Miner
Implication
The interpretation of the implication (⇒) in association rules is precarious.
•High confidence and support does not imply cause and effect.
•The rule is not necessarily interesting. The two items might not even be correlated.
•The term confidence is not related to the statistical usage; therefore, there is no repeated-sampling interpretation.
clustering
The purpose of clustering is often description - Segmenting existing customers into groups and associating a distinct profile with each group might help future marketing strategies. - No guarantee that the resulting clusters will be meaningful or useful.
How Are the Scores Judged?
The quality of a set of scores depends on the purpose of scoring. For supervised text categorization, several popular assessment measures are used.
-Precision - the percentage of selected documents that are correctly classified
-Recall - the percentage of all documents in the requested category that are selected
-F1 - the harmonic mean of precision and recall
-Misclassification Rate - the percentage of incorrectly classified documents
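Precision, recall, and F1 are easy to compute from the sets of selected and truly relevant documents. A Python sketch with made-up document IDs:

```python
# Hypothetical text-categorization results: 'selected' = documents the
# model assigned to a category, 'relevant' = documents truly in it
selected = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}

precision = len(selected & relevant) / len(selected)  # correct among selected
recall = len(selected & relevant) / len(relevant)     # selected among relevant
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean
```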
Text Scores
The value or values associated with a document can be:
-segment identifiers related to text categorization or more general predictive modeling
-cluster identifiers related to grouping documents based on similarity of content
-probabilities of membership in segments or clusters
-numeric values representing document content based on weighted averages of transformed word frequencies
Which statement below is true about transformations of input variables in a regression analysis?
They are performed to reduce the bias in model predictions.
Predictive Modeling Training Data
Training data case: categorical or numeric input and target measurements -supervised learning -observations = training case -target = mainly binary
neural networks regression decision trees
Tune models with average squared error or an appropriate profit matrix.
Working with Text Mining Data Sources
When documents are stored in separate files in the same directory, or subdirectories under the same directory, then the Text Import node can be used to create an appropriate SAS data set for text mining. When documents are stored together (for example, one document per row in a Microsoft Excel spreadsheet), then the Import Data Wizard or File Import node can be used to create a text mining data set. Sometimes special SAS programming might be required if you are combining text data with other data.
Two supported types of text mining data:
-The data set contains at least one variable with the role Text, and documents can be stored completely as a SAS character variable (limited to 32K).
-The data set contains at least one variable with the role Text Location. (This is used in the situation where a document size exceeds 32K.)
--The location must be the full pathname of the document with respect to the Text Miner server.
--An additional variable with the role Web Address can include the path to an unfiltered version of the document to be displayed in an interactive viewer such as the Interactive Filter Viewer.
Additional data sources:
-Dictionaries - start lists - stop lists
-Synonym tables
-Multi-word term tables
-Topic tables
Segmentation Analysis
When no clusters exist, use the k-means algorithm to partition cases into contiguous groups.
Other Enterprise Miner Nodes
You also use other Enterprise Miner nodes for various purposes, such as predictive modeling and scoring new cases. Data Partition node Decision Tree node Regression node Memory-Based Reasoning node Score node
Text Analytics
You use the terms text analytics, text data mining, and text mining synonymously in this course. Text analytics uses natural language processing techniques and other algorithms for turning free-form text into data that can then be analyzed by applying statistical and machine learning methods. Text analytics encompasses many subareas, including stylometry, entity extraction, sentiment analysis, content filtering, and content categorization. Text analytics spans many fields of study, including information retrieval, web mining, search engine technology, and document classification.
Predictive Model
a concise representation of the input and target association
when using decision trees in predictive modeling, we are interested in decision optimization. Which of the following describes this maximization of accuracy?
count of the true positives and true negatives
curse of dimensionality
deals with the number of variables in the data. We learned about 1D, 2D, and 3D dimensionalities, meaning how many variables are in the data set. Big data sets have an abundance of variables, which makes analysis difficult.
fraud detection
decision
voice recognition
decision
Three Prediction Types
decisions - simplest; action/classification - accuracy/misclassification
rankings - concordance/discordance
estimates - squared error, as low as possible
loss reserving
estimate
revenue forecasting
estimate
Data miners typically play a big role in data collection strategies
false
Of the three sequential selection methods discussed here, step‐wise selection always results in the "best" model.
false
Tools on the Sample tab change the number of columns of the data, whereas tools on the Modify tab change the number of rows.
false
the three sequential selection methods for building regression models can never lead to the same model for the same set of data
false
Unsupervised Classification
grouping of cases based on similarities in input values
Unsupervised classification may be useful as a step in predictive modeling.
-Customers can be clustered into homogeneous groups based on sales of different items.
-Build a model to predict the cluster membership based on more easily obtained input variables.
Binary Targets
primary outcome 1s secondary outcome 0s
The ROC chart
illustrates a tradeoff between a captured response fraction and a false positive fraction. Each point on the ROC chart corresponds to a specific fraction of cases, ordered by their predicted value. For example, one point on the ROC chart corresponds to the 40% of cases with the highest predicted values. The y-coordinate shows the fraction of primary outcome cases captured in the top 40% of all cases. The x-coordinate shows the fraction of secondary outcome cases captured in the top 40% of all cases. Repeat for all selection fractions.
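One ROC point can be computed as sketched below; the predicted values and targets are made up, and the 40% fraction matches the example above:

```python
# Hypothetical (predicted value, actual target) pairs, target 1 = primary
cases = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
         (0.4, 1), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]

def roc_point(cases, fraction):
    # Sort by predicted value and take the top fraction of cases
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    top = ranked[:int(round(fraction * len(ranked)))]
    primaries = sum(t for _, t in cases)
    secondaries = len(cases) - primaries
    y = sum(t for _, t in top) / primaries        # captured response fraction
    x = sum(1 - t for _, t in top) / secondaries  # false positive fraction
    return x, y

x, y = roc_point(cases, 0.4)
```

Repeating this for every selection fraction traces out the full ROC curve.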
Ranking Optimization: Concordance
Maximize concordance: proper ordering of primary and secondary outcomes
target=0 → low score - secondary
target=1 → high score - primary
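Concordance can be computed over all primary/secondary pairs: the fraction of pairs where the primary case gets the higher score (ties counted as half). A Python sketch with made-up scores:

```python
from itertools import product

def concordance(cases):
    """Fraction of primary/secondary pairs in which the primary-outcome
    case receives the higher score (ties counted as half)."""
    primary = [s for s, t in cases if t == 1]
    secondary = [s for s, t in cases if t == 0]
    pairs = list(product(primary, secondary))
    wins = sum(1.0 if p > s else 0.5 if p == s else 0.0 for p, s in pairs)
    return wins / len(pairs)

# Hypothetical (score, target) pairs: one primary case is mis-ranked
c = concordance([(0.9, 1), (0.8, 1), (0.6, 0), (0.5, 1), (0.7, 0)])
```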
the impute tool is found in the ______ portion of the SEMMA process
modify
the replacement tool is found in the _________ portion of the SEMMA process
modify
Predictions
output of the predictive model given a set of input measurements -end results
credit scoring
ranking
risk profiling
ranking
Assessment Statistics
ratings depend on the target measurement (binary, continuous, and so on) and the prediction type (decisions, rankings, estimates)
a(n) ______ input does not give any new information that was not already explained by other inputs
redundant
The Data Partition tool is found in the ____ portion of the SEMMA process
sample
Accuracy/Misclassification
tally the correct or incorrect prediction decisions
Decision Optimization: Accuracy
target = 1, prediction = primary → true positive; target = 0, prediction = secondary → true negative. Maximize accuracy: agreement between target and prediction.
Decision Optimization: misclassification
target = 1, prediction = secondary → false negative; target = 0, prediction = primary → false positive. Minimize misclassification: disagreement between target and prediction (e.g., the target is a yellow dot but you predict blue).
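Tallying the four cells gives both statistics at once; accuracy and misclassification always sum to 1. A small sketch with names of my own choosing:

```python
def confusion_counts(targets, decisions):
    """Tally agreement/disagreement between target and prediction.
    accuracy = (TP + TN) / n, misclassification = (FP + FN) / n."""
    tp = sum(1 for t, d in zip(targets, decisions) if t == 1 and d == 1)
    tn = sum(1 for t, d in zip(targets, decisions) if t == 0 and d == 0)
    fp = sum(1 for t, d in zip(targets, decisions) if t == 0 and d == 1)
    fn = sum(1 for t, d in zip(targets, decisions) if t == 1 and d == 0)
    n = len(targets)
    return {"accuracy": (tp + tn) / n, "misclassification": (fp + fn) / n}
```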
Ranking Optimization: Discordance
Minimize discordance: improper ordering of primary and secondary outcomes - target = 0 → high score - target = 1 → low score
Inverse Prior Threshold
the Kolmogorov‐Smirnov statistic describes the ability of the model to separate the primary and secondary outcomes.
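The Kolmogorov-Smirnov statistic can be read as the largest vertical gap between the cumulative score distributions of the two outcome groups. A minimal sketch (my own function name, not SAS's computation):

```python
def ks_statistic(targets, scores):
    """Max distance between cumulative score distributions of primary
    (target=1) and secondary (target=0) outcomes; 1.0 means the model
    separates the groups perfectly, 0.0 means no separation."""
    prim = [s for t, s in zip(targets, scores) if t == 1]
    sec = [s for t, s in zip(targets, scores) if t == 0]
    best = 0.0
    for c in sorted(set(scores)):
        f1 = sum(1 for s in prim if s <= c) / len(prim)
        f0 = sum(1 for s in sec if s <= c) / len(sec)
        best = max(best, abs(f1 - f0))
    return best
```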
Data reduction - cluster
the most ubiquitous application: exploiting patterns in data to create a more compact representation of the original. Though vastly broader in scope, data reduction includes analytic methods such as cluster analysis.
What is data mining?
the process of extracting useful information from data. Models and analytics are used to find the best results and new information that would otherwise go undiscovered. Data mining uses past data to help you learn about the future.
A data source in SAS Enterprise Miner differs from a raw data file because a data source has additional attached metadata. This metadata includes which of the following?
the variable roles the variable measurement levels the data table role
EM will not force the percentages among the training, validation, and test data sets to add up to 100%
true
In practice, modelers often use several tools, sometimes both graphical and numerical, to choose a best model.
true
the _________ data set is used to fine-tune and adjust the model
validation
Average Squared Error
was used to tune many of the models fit in earlier chapters
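As a reminder of the computation itself, average squared error is just the mean of the squared target-prediction differences (a sketch; the name is mine):

```python
def average_squared_error(targets, predictions):
    """Mean squared difference between target values and predictions."""
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n
```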
Gini Coefficient
(for binary prediction) equals 2 × (ROC Index − 0.5)
Estimation Methods
‐Avoid a one-size-fits-all approach ‐Provide tailored imputations for each case with missing values ‐View missing data as a prediction problem ‐When an input value is unknown, use this model to predict the unknown missing value ‐Best for missing values due to lack of knowledge (no-match or nondisclosure) ‐Uses the tools of regression
Synthetic Distribution Methods
‐One-size-fits-all approach for missing values ‐Missing input measurement replaced with a fixed number ‐Use a point mass of the input's distribution (e.g., the mean)
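The contrast between the two imputation styles above can be sketched on a single numeric column. This is an illustration only, not the Impute node's algorithm; both function names are my own, and `None` stands in for a missing value.

```python
def mean_impute(values):
    """Synthetic distribution method: replace every missing value with
    one fixed number, the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

def regression_impute(x, y):
    """Estimation method: treat the missing input as a prediction
    problem. Fit y ~ a + b*x on complete cases, then predict each
    missing y from its x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mx = sum(xi for xi, _ in pairs) / n
    my = sum(yi for _, yi in pairs) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    a = my - b * mx
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]
```

Mean imputation gives every incomplete case the same fill value; regression imputation tailors the fill to each case's other inputs.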
SBC/Likelihood
‐Schwarz's Bayesian Criterion (SBC) is a penalized likelihood statistic. The likelihood statistic was used to estimate regression and neural network model parameters and can be thought of as a weighted average squared error. •SBC is provided only for regression and neural network models and is calculated only on training data.
ROC Index
‐similar to concordance; equals the percent of concordant cases plus one-half times the percent of tied cases. •Recall that a pair of cases, consisting of one primary outcome and one secondary outcome, is concordant if the primary outcome case has a higher rank than the secondary outcome case. •By contrast, if the primary outcome case has a lower rank, that pair is discordant. •If the two cases have the same rank, they are said to be tied.
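The pair-counting definition above, together with the Gini relationship (Gini = 2 × (ROC Index − 0.5)), fits in a short sketch; function names are my own:

```python
def roc_index(targets, scores):
    """Percent of concordant (primary, secondary) pairs plus one-half
    the percent of tied pairs."""
    prim = [s for t, s in zip(targets, scores) if t == 1]
    sec = [s for t, s in zip(targets, scores) if t == 0]
    conc = sum(1 for p in prim for s in sec if p > s)
    tied = sum(1 for p in prim for s in sec if p == s)
    return (conc + 0.5 * tied) / (len(prim) * len(sec))

def gini(targets, scores):
    """For binary prediction, Gini = 2 x (ROC Index - 0.5)."""
    return 2 * (roc_index(targets, scores) - 0.5)
```

A random-ranking model gives an ROC index of 0.5 and therefore a Gini of 0.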
SEMMA: Sample Tab
•Append - add on data sets •Data Partition - separate data •File Import •Filter - filter out data •Input Data •Merge •Sample - draw a sample from one or more tables
SEMMA: Explore Tab
•Association •Cluster •DMDB •Graph Explore •Link Analysis •Market Basket •Multiplot •Path Analysis •SOM/Kohonen •StatExplore •Variable Clustering •Variable Selection - searching for relationships, anomalies, and trends
SEMMA: Model Tab
•AutoNeural •Decision Tree •Dmine Regression •DMNeural •Ensemble •Gradient Boosting •Least Angle Regression •MBR •Model Import •Neural Network •Partial Least Squares •Regression •Rule Induction •Two Stage - use analytical tools to model the relationship
Beyond SEMMA: Utility Tab
•Control Point •End Groups •Ext Demo •Metadata •Open-Source Integration •Register Model •Reporter •SAS Code •SAS Viya Code •Save Data •Score Code Export •Start Groups - clean up the process - change the role of the data
SEMMA: Assess Tab
•Cutoff •Decisions •Model Comparison •Score •Segment Profile - compare the different models you have created and find the best one
Basic Terms
•Data set - collection of cases •Cases - observations •Variables - predictors •Inputs -predictors, help to explain (the x's) •Quantitative -continuous, discrete •Categorical -nominal, binary, ordinal •Target -response, variable of interest (the y) •Training, validation, test data sets •Model -assessment and implementation •Action - what is your action plan
SEMMA: Modify Tab
•Drop •Impute - used when you have missing data •Interactive Binning •Principal Components •Replacement •Rules Builder •Transform Variables - transforming nonlinear data improves fit - modify the data
Exploratory Data Analysis
•Goal = explore the data without any clear idea of what you are looking for •Use both graphs and numerical summaries •High dimensionality may make it difficult •Bad data is exposed
Beyond SEMMA: HPDM Tab high performance
•HP BN Classifier •HP Cluster •HP Data Partition •HP Explore •HP Forest •HP GLM •HP Impute •HP Neural •HP Principal Components •HP Regression •HP SVM •HP Text Miner •HP Transform •HP Tree •HP Variable Selection
Beyond SEMMA: Applications Tab
•Incremental Response •Survival - survival analysis
Partial Least Squares Predictions
•Input combinations (factors) that optimally account for both predictor and response variation are successively selected. •Factor count with a minimum validation PRESS statistic is selected. •Inputs with small VIP are rejected for subsequent diagram nodes.
Least Angle Regression Predictions
•Inputs are selected using a generalization of forward selection. •An input combination in the sequence with optimal, penalized validation assessment is selected by default.
Dmine Regression Predictions
•Interval inputs binned, categorical inputs grouped •Forward selection choices from binned and original inputs
analysis data
•Link existing analysis data sets to SAS Enterprise Miner. •Set variable metadata. •Explore variable distribution characteristics.
Pattern Discovery Caution
•Poor data quality - comes in many guises: inaccuracies (measurement or recording errors); missing, incomplete, or outdated values; and inconsistencies (changes of definition). Patterns found in false data are fantasies.
•Opportunity - transforms the possible into the perceived. - Hand refers to this as the problem of multiplicity, or the law of truly large numbers. - Example: The odds of a person winning the lottery in the United States are extremely small, and the odds of that person winning it twice are fantastically so. However, the odds of someone in the United States winning it twice (in a given year) are actually better than even. - Example: Search the digits of pi for "prophetic" strings such as your birthday or significant dates in history and you will usually find them, given enough digits (www.angio.net/pi/piquery).
•Interventions - taking action on the process that generates a set of data can destroy or distort detected patterns. - Example: Fraud detection techniques lead to preventative measures, but the fraudulent behavior often evolves in response to this intervention.
•Separability - separating the interesting from the mundane is not always possible, given the information found in a data set. - Example: Despite the many safeguards in place, it is estimated that credit card companies lose $0.18 to $0.24 per $100 in online transactions (Rubinkam 2006).
•Obviousness - obvious discovered patterns reduce the perceived utility of an analysis. - Example: Among the patterns discovered through automatic detection algorithms, you find that there are almost as many married men as married women. - Example: Ovarian cancer occurs primarily in women, and check fraud occurs most often for customers with checking accounts.
•Non-stationarity - occurs when the process that generates a data set changes of its own accord. - In such circumstances, patterns detected from historic data can simply cease to hold.
- As Eric Hoffer states, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists."
Defining a Data Source
•Select a table - define what the table's role is. •Define variable roles - identify which variables are inputs (X) and which is the target (Y). •Define measurement levels. •Define the table role. - Data sets should be ready for SAS.
MBR Prediction Estimates
•Sixteen nearest training data cases predict the target for each point in the input space. •Scoring requires training data and the PMBR procedure.
Beyond SEMMA: Time Series Tab
•TS Correlation •TS Data Preparation •TS Decomposition •TS Dimension Reduction •TS Exponential Smoothing •TS Similarity
DMNeural Predictions
•Up to three PCs with the highest target R square are selected. •One of eight continuous transformations is selected and applied to the selected PCs. •The process is repeated three times with residuals from each stage.
Rule Induction Predictions
•[Rips create prediction rules.] •A binary model sequentially classifies and removes correctly classified cases. •[A neural network predicts remaining cases.] - Three stages: a decision tree to find concentrations of cases, a filter to remove them, and a neural network to predict the remaining cases.