Chapter 3 Modeling & Evaluation
Target:
an expected attribute or value that we want to evaluate ex. fraud score, interest rate
Co-occurrence grouping (U)
used to discover associations between individuals based on transactions involving them ex. match vendors by geographic region ex. "frequently bought together"
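A minimal sketch of the "frequently bought together" idea: count how often each pair of items appears on the same transaction. The `baskets` data and the `co_occurrence_counts` helper are hypothetical, made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items on one order.
baskets = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"paper", "stapler"},
    {"printer", "ink", "stapler"},
]

def co_occurrence_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # sorted() gives every pair one canonical order, e.g. ("ink", "printer").
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

pair_counts = co_occurrence_counts(baskets)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with the highest counts are the candidates for a "frequently bought together" association.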
Decision trees
used to divide data into smaller groups.
Data-Reduction: (U)
used to reduce the amount of information that needs to be considered to focus on the most critical items (highest cost, highest risk) ex. simplify vendors into obvious categories (ex. wholesale or retail)
Filtering:
narrows the data to suspicious items ex. whole-dollar amounts (more likely to be made up or fraudulent), Square payments (anyone can create a Square payments account), duplicate invoice payments
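A short sketch of two of these filters, assuming payments are stored as (invoice number, vendor, amount) tuples. The `payments` records are hypothetical.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical payments: (invoice number, vendor, amount).
payments = [
    ("INV-101", "Acme", Decimal("1250.00")),
    ("INV-102", "Birch", Decimal("318.47")),
    ("INV-103", "Acme", Decimal("912.30")),
    ("INV-102", "Birch", Decimal("318.47")),  # same invoice paid twice
    ("INV-104", "Cedar", Decimal("5000.00")),
]

# Filter 1: whole-dollar amounts (nothing after the decimal point).
whole_dollar = [p for p in payments if p[2] % 1 == 0]

# Filter 2: identical invoice/vendor/amount records that appear more than once.
occurrences = Counter(payments)
duplicates = [p for p, n in occurrences.items() if n > 1]

print(whole_dollar)
print(duplicates)
```

Using `Decimal` instead of `float` avoids rounding surprises when testing for cents.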
Pruning
removes branches from a decision tree to avoid overfitting the model.
Data Reduction Steps
1. Identify the attribute you would like to reduce or focus on. 2. Filter the results 3. Interpret the results. 4. Follow up on results.
Classification Steps
1. Identify the classes you wish to predict. 2. Manually classify an existing set of records. 3. Select a set of classification models. 4. Divide your data into training and testing sets. 5. Generate your model. 6. Interpret the results and select the "best" model.
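The steps above can be sketched with a very simple model. This example uses a nearest-centroid classifier, chosen only because it fits in a few lines; step 3 would normally compare several models. The `records`, the two attributes, and the 75/25 split are all assumptions made for illustration.

```python
import random
from statistics import mean

# Steps 1-2: hypothetical records, manually classified as "approve" or "reject".
# Each record is (days_late, dispute_count, class).
records = [
    (1, 0, "approve"), (2, 1, "approve"), (0, 0, "approve"),
    (3, 0, "approve"), (2, 0, "approve"), (1, 1, "approve"),
    (20, 4, "reject"), (25, 5, "reject"), (18, 3, "reject"),
    (30, 6, "reject"), (22, 4, "reject"), (27, 5, "reject"),
]

# Step 4: divide the data into training (75%) and testing (25%) sets.
rng = random.Random(0)  # fixed seed so the split is repeatable
shuffled = records[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.75)
train, test = shuffled[:cut], shuffled[cut:]

# Step 5: generate the model (one centroid per class, from training data only).
def fit_centroids(rows):
    centroids = {}
    for label in {r[2] for r in rows}:
        members = [r for r in rows if r[2] == label]
        centroids[label] = (mean(m[0] for m in members),
                            mean(m[1] for m in members))
    return centroids

def predict(centroids, days_late, disputes):
    """Return the class whose centroid is closest to the record."""
    def sq_dist(center):
        return (center[0] - days_late) ** 2 + (center[1] - disputes) ** 2
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

model = fit_centroids(train)

# Step 6: interpret the results by measuring accuracy on the test set.
correct = sum(predict(model, d, n) == label for d, n, label in test)
accuracy = correct / len(test)
print(f"accuracy on test data: {accuracy:.2f}")
```

The test records are never used to build the centroids, which is what makes the accuracy figure a fair check of the model.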
Regression Analysis Steps
1. Identify the variables that might predict an outcome. 2. Determine the functional form of the relationship. 3. Identify the parameters of the model.
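A minimal sketch of the three steps, assuming one predictor and a straight-line functional form (y = intercept + slope × x) fitted by ordinary least squares. The `x` and `y` figures are hypothetical.

```python
from statistics import mean

# Step 1: hypothetical predictor (x) and outcome (y), one pair per period.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Step 2: assume a linear functional form, y = intercept + slope * x.
# Step 3: identify the parameters with ordinary least squares.
def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_line(x, y)
predicted = intercept + slope * 6.0  # predicted outcome when x = 6
print(round(intercept, 2), round(slope, 2), round(predicted, 2))
```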
Profiling Steps
1. Determine the types of profiling you want to perform. 2. Set boundaries or thresholds for the activity. 3. Interpret the results and monitor the activity and/or generate a list of exceptions. 4. Follow up on exceptions.
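One way to sketch these steps is with Z-scores: profile each vendor's shipping time, set a threshold, and list the exceptions. The `days_to_ship` data and the 1.5 threshold are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# Step 1: hypothetical profile, average days to ship per vendor.
days_to_ship = {"Acme": 4, "Birch": 5, "Cedar": 6, "Dune": 5, "Elm": 15}

# Step 2: set a threshold (assumed here: more than 1.5 standard deviations).
Z_THRESHOLD = 1.5

values = list(days_to_ship.values())
mu = mean(values)
sigma = pstdev(values)  # population standard deviation

# Z-score: how many standard deviations a vendor is from the mean.
z_scores = {vendor: (days - mu) / sigma for vendor, days in days_to_ship.items()}

# Step 3: generate the list of exceptions to follow up on (step 4).
exceptions = sorted(v for v, z in z_scores.items() if abs(z) > Z_THRESHOLD)
print(exceptions)
```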
Fuzzy match:
A computer-assisted technique of finding matches that are less than 100 percent perfect by finding correspondences between portions of the text of each potential match. Ex. address matching
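A small sketch of address matching using `difflib.SequenceMatcher` from the Python standard library, which scores two strings from 0 to 1 based on their matching portions. The addresses and the 0.8 cutoff are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-1 score; 1.0 means the strings match exactly (ignoring case)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical data: one vendor address compared against employee addresses.
vendor_address = "123 Main Street, Suite 4"
employee_addresses = [
    "123 Maine Street Suite 4",
    "77 Ocean Avenue",
]

CUTOFF = 0.8  # assumed threshold for "close enough to review"

matches = [
    (addr, round(similarity(vendor_address, addr), 2))
    for addr in employee_addresses
    if similarity(vendor_address, addr) >= CUTOFF
]
print(matches)
```

A lower cutoff catches more near matches but also produces more false positives to review.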
Causal modeling:
A data approach similar to regression, but used when it is hypothesized that the independent variables cause or are associated with the dependent variable.
similarity matching:
A data approach used to identify similar individuals based on data known about them.
Test Data
A set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data
unsupervised approach/method
Approach used for data exploration when you don't have a specific question and are simply looking for potential patterns of interest in the data.
Training data:
Existing data that have been manually evaluated and assigned a class, which assists in classifying the test data. Used to train the model.
5 Approaches Used Most Frequently in Accounting and Auditing
Profiling, data reduction, regression, classification, and clustering
Classification Model
a data approach used to assign each unit in a population into a few categories potentially to help with prediction ex. you can predict whether a new vendor belongs to one class or another based on the behavior of the others
Clustering: (U)
a data approach used to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way; used to find natural groupings within the data ex. vendors that fall into three natural clusters
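A minimal sketch of finding natural groupings with k-means on a single attribute. The algorithm choice, the `invoice_totals` data, and the simple way of picking starting centers are assumptions for illustration; the function expects k of at least 2.

```python
from statistics import mean

def k_means_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly moving each center
    to the mean of the values closest to it. Assumes k >= 2."""
    ordered = sorted(values)
    # Simple deterministic start: spread the centers across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers, clusters

# Hypothetical attribute: average invoice total per vendor (in thousands).
invoice_totals = [2, 3, 4, 20, 22, 21, 50, 52, 51]
centers, clusters = k_means_1d(invoice_totals, k=3)
print(centers)
print(clusters)
```

No classes are assigned in advance; the groups come from the data itself, which is why clustering is an unsupervised approach.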
Regression:
a data approach used to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model ex. predict a specific value to answer a question based on the activity we have observed from other vendors
link prediction
a data approach used to predict a relationship between two data items ex. identify seller and customer fraud ex. mutual friends on social media to look for relationships between two parties
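A sketch of the "mutual friends" idea: two parties with no recorded relationship but several connections in common are candidates for a predicted link. The `edges` data and the `min_common` cutoff are hypothetical.

```python
from itertools import combinations

# Hypothetical known relationships (undirected), e.g. parties that have transacted.
edges = [("Ann", "Bob"), ("Ann", "Cal"), ("Bob", "Dee"),
         ("Cal", "Dee"), ("Dee", "Eve")]

# Build a lookup of each party's direct connections.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def predict_links(neighbors, min_common=2):
    """Return unlinked pairs that share at least min_common connections."""
    candidates = []
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # already linked, nothing to predict
        common = neighbors[a] & neighbors[b]
        if len(common) >= min_common:
            candidates.append((a, b, sorted(common)))
    return candidates

predicted_links = predict_links(neighbors)
print(predicted_links)
```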
Support Vector Machine
a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.
Class
a manually assigned category applied to a record based on an event. Ex. "Rejected", "Fraud"
Benford's Law
an observation about the frequency of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. If the distribution of transactions for an account like "sales revenue" is substantially different from what Benford's law would predict, then we would investigate the sales revenue account further
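Under Benford's law the expected share of numbers with leading digit d is log10(1 + 1/d), so about 30.1% start with 1 and about 4.6% start with 9. The sketch below compares observed leading digits against that expectation. The `amounts` data is hypothetical, and the helper assumes nonzero amounts.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected share of values whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

def leading_digit(amount):
    """First nonzero digit of a nonzero number, e.g. 0.05 -> 5, 4520 -> 4."""
    digits = str(abs(amount)).lstrip("0.")
    return int(digits[0])

def digit_deviations(amounts):
    """Observed share minus Benford share for each leading digit."""
    counts = Counter(leading_digit(a) for a in amounts)
    n = len(amounts)
    return {d: counts[d] / n - benford_expected(d) for d in range(1, 10)}

# Hypothetical account activity with suspiciously many amounts starting with 9.
amounts = [900, 950, 9100, 920, 130, 9800, 97, 940]
deviations = digit_deviations(amounts)
for d, diff in deviations.items():
    print(d, f"{diff:+.3f}")
```

A large positive deviation for a digit, as with 9 here, is the kind of difference that would prompt further investigation of the account.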
Profiling (U)
an unsupervised method that is used to discover patterns of behavior ex. the higher a vendor's Z-score (the farther from the mean), the more likely that vendor will have a delayed shipment. We use profiling to explore the attributes of vendors that we may want to avoid in the future.
Supervised Approach
approach used to learn more about the basic relationship between independent and dependent variables that are hypothesized to exist. Used when there is a specific outcome to predict; historical data are used to predict the future outcome
Post-pruning:
evaluates the complete model and discards branches after the fact.
Gap Detection:
looking for a missing check number in a sequence of checks
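A short sketch of gap detection: list every number between the lowest and highest check that does not appear in the register. The `checks` data is hypothetical.

```python
def missing_numbers(check_numbers):
    """Return the numbers absent from the sequence between its min and max."""
    present = set(check_numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

# Hypothetical check register.
checks = [1001, 1002, 1004, 1005, 1008]
gaps = missing_numbers(checks)
print(gaps)  # [1003, 1006, 1007]
```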
Decision boundaries
mark the split between one class and another