Chapter 3 Modeling & Evaluation

Target:

An expected attribute or value that we want to evaluate. Ex. fraud score, interest rate.

Co-occurrence grouping (U)

Used to discover associations between individuals based on transactions involving them. Ex. matching vendors by geographic region; "frequently bought together" recommendations.
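The "frequently bought together" idea can be sketched by counting how often two items appear in the same transaction. This is a minimal illustration, not a full association-rule method; the baskets and the function name `pair_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items on one order.
transactions = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"paper", "stapler"},
    {"ink", "paper"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
```

Pairs with the highest counts are the candidates for "bought together" associations.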

Decision trees

used to divide data into smaller groups.

Data-Reduction: (U)

used to reduce the amount of information that needs to be considered to focus on the most critical items (highest cost, highest risk) ex. simplify vendors into obvious categories (ex. wholesale or retail)

Filtering:

Narrowing the data to suspicious items. Ex. whole-dollar amounts, which have a greater likelihood of being made up or fraudulent; payments to Square accounts, which anyone can create; duplicate invoice payments.
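Two of these filters are easy to sketch: whole-dollar amounts and duplicate invoice payments. The payment records and function names below are hypothetical, and a "duplicate" is assumed to mean the same vendor, invoice number, and amount appearing more than once.

```python
from collections import Counter
from decimal import Decimal

# Hypothetical payment records: (vendor, invoice_number, amount)
payments = [
    ("Acme", "INV-001", Decimal("1250.00")),
    ("Acme", "INV-002", Decimal("318.47")),
    ("Bolt", "INV-104", Decimal("99.95")),
    ("Acme", "INV-002", Decimal("318.47")),  # same invoice paid twice
    ("Cora", "INV-310", Decimal("5000.00")),
]

def whole_dollar(records):
    """Return payments whose amount has no cents."""
    return [r for r in records if r[2] == r[2].to_integral_value()]

def duplicate_payments(records):
    """Return the records that appear more than once (assumed definition)."""
    counts = Counter(records)
    return [record for record, n in counts.items() if n > 1]
```

Either list is only a starting point for follow-up; a whole-dollar payment is not fraudulent by itself.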

Pruning

removes branches from a decision tree to avoid overfitting the model.

Data Reduction Steps

1. Identify the attribute you would like to reduce or focus on. 2. Filter the results. 3. Interpret the results. 4. Follow up on results.

Classification Steps

1. Identify the classes you wish to predict. 2. Manually classify an existing set of records. 3. Select a set of classification models. 4. Divide your data into training and testing sets. 5. Generate your model. 6. Interpret the results and select the "best" model.
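The steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions: the records are hypothetical, only one model is tried (a nearest-centroid classifier, chosen here for brevity), and the split simply holds out every fourth record.

```python
# Steps 1-2: hypothetical records already classified by hand.
# Each row is (days_late, disputed_invoices, label).
records = [
    (1, 0, "approve"), (2, 1, "approve"), (0, 0, "approve"), (3, 0, "approve"),
    (15, 4, "reject"), (20, 6, "reject"), (18, 5, "reject"), (25, 7, "reject"),
]

def split(rows, every=4):
    """Step 4: hold out every `every`-th row for testing, train on the rest."""
    test = rows[every - 1::every]
    train = [r for i, r in enumerate(rows) if (i + 1) % every != 0]
    return train, test

def fit_centroids(train):
    """Step 5: 'generate the model' by averaging the features of each class."""
    sums = {}
    for *features, label in train:
        total, n = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], n + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, test):
    """Step 6: share of test records whose predicted class matches the label."""
    hits = sum(predict(centroids, row[:-1]) == row[-1] for row in test)
    return hits / len(test)

train, test = split(records)
model = fit_centroids(train)
```

In practice several models would be fitted on the same training set and compared on the test set before choosing the "best" one.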

Regression Analysis Steps

1. Identify the variables that might predict an outcome. 2. Determine the functional form of the relationship. 3. Identify the parameters of the model.
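As a sketch of these three steps, assume one predictor, a linear functional form (y = intercept + slope * x), and ordinary least squares to find the parameters. The data and the names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted outcome for a new value of the predictor."""
    return intercept + slope * x

# Step 1: hypothetical predictor (units ordered) and outcome (shipping cost).
units = [1, 2, 3, 4, 5]
cost = [3, 5, 7, 9, 11]

# Steps 2-3: assume a straight line and estimate its two parameters.
slope, intercept = fit_line(units, cost)
```

Here the made-up data lie exactly on a line, so the fit recovers it; real data would leave residuals to inspect.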

Profiling Steps

1. Determine the types of profiling you want to perform. 2. Set boundaries or thresholds for the activity. 3. Interpret the results and monitor the activity and/or generate a list of exceptions. 4. Follow up on exceptions.

Fuzzy match:

A computer-assisted technique of finding matches that are less than 100 percent perfect by finding correspondences between portions of the text of each potential match. Ex. address matching.
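Address matching can be sketched with Python's standard `difflib.SequenceMatcher`, which scores how much of two strings overlaps. The addresses, the 0.75 threshold, and the function names are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between 0 and 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def best_match(address, candidates, threshold=0.75):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    if not candidates:
        return None
    score, candidate = max((similarity(address, c), c) for c in candidates)
    return candidate if score >= threshold else None

# Hypothetical vendor master file addresses.
vendor_addresses = ["123 Main St.", "456 Oak Avenue", "78 Harbor Rd"]
```

For example, `best_match("123 Main Street", vendor_addresses)` pairs the employee-style address with "123 Main St." even though the two strings are not identical.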

Causal modeling:

A data approach similar to regression, but used when it is hypothesized that the independent variables cause or are associated with the dependent variable.

similarity matching:

A data approach used to identify similar individuals based on data known about them.

Test Data

A set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data

unsupervised approach/method

Approach used for data exploration when you don't have a specific question and are simply looking for potential patterns of interest in the data.

Training data:

Existing data that have been manually evaluated and assigned a class, which assists in classifying the test data. Used to train the model.

5 Approaches Used Most Frequently in Accounting and Auditing

Profiling, data reduction, regression, classification, and clustering.

Classification Model

A data approach used to assign each unit in a population to one of a few categories, potentially to help with prediction. Ex. predicting whether a new vendor belongs to one class or another based on the behavior of the others.

Clustering: (U)

A data approach used to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way. Used to find natural groupings within the data. Ex. vendors that fall into three natural clusters.
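One common way to find such groupings is k-means. The sketch below is a one-dimensional version under stated assumptions: the vendor amounts are hypothetical, the number of clusters k is chosen by the analyst (k must be at least 2 here), and the starting centroids are simply evenly spaced points in the sorted data.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly assigning each to its nearest centroid."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    ordered = sorted(values)
    # Assumption: start from k evenly spaced points in the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments have stopped changing
        centroids = updated
    return clusters

# Hypothetical average invoice amounts for nine vendors.
amounts = [10, 12, 11, 50, 52, 49, 100, 98, 103]
groups = kmeans_1d(amounts, 3)
```

No labels are supplied; the three groups emerge from the data alone, which is what makes the approach unsupervised.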

Regression:

A data approach used to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model. Ex. predicting a specific value to answer a question based on the activity we have observed from other vendors.

link prediction

A data approach used to predict a relationship between two data items. Ex. identifying seller and customer fraud; using mutual friends on social media to look for relationships between two parties.
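The "mutual friends" example corresponds to a simple common-neighbors score: two parties with many shared connections but no direct link are candidates for a predicted relationship. The graph and function names below are hypothetical, and common neighbors is only one of several possible scoring rules.

```python
from itertools import combinations

# Hypothetical connection graph: each party maps to the parties it already deals with.
graph = {
    "Ann": {"Bob", "Cy"},
    "Bob": {"Ann", "Cy", "Dee"},
    "Cy": {"Ann", "Bob", "Dee"},
    "Dee": {"Bob", "Cy"},
}

def common_neighbors(g, a, b):
    """Parties connected to both a and b."""
    return g.get(a, set()) & g.get(b, set())

def predict_links(g):
    """Unconnected pairs ranked by number of shared connections (most first)."""
    scored = []
    for a, b in combinations(sorted(g), 2):
        if b in g[a]:
            continue  # already linked, nothing to predict
        shared = len(common_neighbors(g, a, b))
        if shared:
            scored.append((a, b, shared))
    return sorted(scored, key=lambda item: -item[2])
```

In a fraud setting, a seller and a customer with no recorded relationship but several shared connections would be worth a closer look.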

Support Vector Machine

a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.

Class

A manually assigned category applied to a record based on an event. Ex. "Rejected", "Fraud".

Benford's Law

An observation about the frequency of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. If the distribution of transactions for an account like "sales revenue" is substantially different from what Benford's law would predict, then we would investigate the sales revenue account further.
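Benford's law gives the expected share of each leading digit d as log10(1 + 1/d), so 1 leads about 30.1% of the time and 9 about 4.6%. The sketch below compares observed leading digits with that expectation; the function names are hypothetical, and the largest absolute gap is used only as a rough screen, not a formal statistical test.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected proportion of numbers whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

def leading_digit(amount):
    """First non-zero digit of a non-zero amount."""
    # Scientific notation puts the leading digit first, e.g. '3.1847...e+02'.
    return int(f"{abs(amount):.15e}"[0])

def observed_distribution(amounts):
    """Observed proportion of each leading digit, skipping zero amounts."""
    counts = Counter(leading_digit(a) for a in amounts if a != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def largest_gap(amounts):
    """Largest absolute difference between observed and expected proportions."""
    observed = observed_distribution(amounts)
    return max(abs(observed[d] - benford_expected(d)) for d in range(1, 10))
```

A large gap on a sizeable set of transactions would be a reason to look more closely at the account; small samples will deviate by chance.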

Profiling (U)

An unsupervised method that is used to discover patterns of behavior. Ex. the higher the Z-score (farther away from the mean), the more likely a vendor will have a delayed shipment. We use profiling to explore the attributes of that vendor that we may want to avoid in the future.
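A Z-score profile can be sketched with the standard `statistics` module: compute each vendor's distance from the mean in standard deviations, then list those beyond a chosen threshold as exceptions. The delay figures, the 1.5 threshold, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical average shipping delay, in days, per vendor.
delays = {"A": 2, "B": 3, "C": 2, "D": 3, "E": 15}

def z_scores(values):
    """Z-score of each entry: (value - mean) / population standard deviation."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    return {name: (v - mu) / sigma for name, v in values.items()}

def exceptions(values, threshold=1.5):
    """Names whose Z-score exceeds the threshold (assumed cut-off)."""
    return [name for name, z in z_scores(values).items() if z > threshold]

flagged = exceptions(delays)
```

The flagged vendors form the exception list to follow up on; the threshold is a judgment call and should be set before looking at the results.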

Supervised Approach

Approach used to learn more about the basic relationship between independent and dependent variables that is hypothesized to exist. Used when there is a specific outcome in mind: historical data are used to predict the future outcome.

Post-pruning:

evaluates the complete model and discards branches after the fact.

Gap Detection:

looking for a missing check number in a sequence of checks
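A minimal sketch of gap detection, assuming check numbers are integers and the sequence is expected to be continuous between the lowest and highest number on file. The function name is hypothetical.

```python
def missing_numbers(check_numbers):
    """Return the numbers absent between the smallest and largest check number."""
    present = set(check_numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]

gaps = missing_numbers([1001, 1002, 1004, 1005, 1008])
```

Each missing number is an item to trace: it may be a voided check, or a check that was issued but never recorded.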

Decision boundaries

mark the split between one class and another

