Fraud Chapter 9- Transforming Data Into Evidence (Part 2)
Daubert factors
(1) Can be tested with scientific method? (2) Subject to peer review and publication? (3) Known or potential rate of error? (4) Generally accepted?
Two common Excel add-ins are
Analysis Toolpak and ActiveData for Excel
Benford's Law used to identify patterns in Income Tax
Christian and Gupta 1993 identified a tendency among individuals to claim additional deductions for the purpose of decreasing taxable income to fall within the next-lowest tax bracket. Nigrini 1996 examined interest income and interest expense reported on tax returns and found evidence of understatement in the former and overstatement in the latter.
Two basic charts that can be used for a variety of purposes are
pie charts and bar graphs
Data sets for financial transactions such as accounts payable, accounts receivable, and sales are
positively skewed and not normally distributed
According to Benford's law the distribution of first digits is
positively skewed, or more heavily weighted toward smaller numbers; that is, the first (left-most) digit is more often low than high
The ultimate goal of comparative analysis of data is
prediction of the likelihood that a deviant observation is the result of some external influence such as error or fraud and not attributable to mere chance
data mining can serve as a
preventative or detective role in fraud investigations
Skewness
a measure of the degree of asymmetry of a data distribution around its Mean
Negative relationship data
data moving in opposite directions
The value of RSF is that it provides a
numeric measure that can be compared to a benchmark or tracked over time
Discrete distribution
observations are countable, and there is a discrete "jump" between successive values
Every observation should be included in only _____ interval
one
Applications of Benford's law
-a number series must approximately follow a geometric sequence in which each successive number is calculated as a fixed percentage increase over the previous number
Examples of credit card transaction frauds/patterns
-abrupt shifts in the curves or changes in the spending slope indicate fraud -customers using specific cards for specific types of purchases -a fraudster usually spends as much as possible on the card in a short amount of time before the theft is discovered -transactions of first-time users are usually less frequent than those of long-time users -certain transactional patterns are red flags: frequent purchases of small electronics or jewelry, which can be resold on the black market, and usage across a wide geographic area
Data mining methods to detect potentially fraudulent transactions are based on
-customer usage patterns -expected usage patterns -patterns that are known to be associated with fraud
Trade-off between Type I and Type II errors
-decreasing the occurrence of one increases the occurrence of the other
Advantages of basic data analysis programs
-ease of use -flexibility of application -various functions are available
Useful tasks from sorting
-identifying duplicate entries -identifying transactions with round numbers -identify gaps in the data sequence (such as dates, check numbers, or invoice numbers) -identify matches in data fields (such as employees and vendors with the same name or contact information) -Compute category totals (such as total payments made to a specific vendor or employee or total payments for specific expense category) -Highlight blanks (or lack of data) in a particular data field (such as employees without a social security number or vendors without an address) -identify inconsistencies among data fields (such as incompatible telephone numbers and addresses or back-dated checks)
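Several of the sorting tasks above can be sketched in a few lines of Python. This is only an illustration: the payment records, the field names, and the choice of multiples of 1,000 as "round" amounts are hypothetical assumptions, not part of the chapter.

```python
from collections import Counter, defaultdict

# Hypothetical payment records (field names are assumptions).
payments = [
    {"check_no": 1004, "vendor": "Acme Supply", "amount": 1250.00},
    {"check_no": 1001, "vendor": "Acme Supply", "amount": 1250.00},
    {"check_no": 1005, "vendor": "Delta LLC", "amount": 5000.00},
    {"check_no": 1002, "vendor": "Baker Co", "amount": 482.17},
]

# Sort by check number so sequence gaps become visible.
ordered = sorted(payments, key=lambda p: p["check_no"])

# Duplicate entries: same vendor and same amount appearing more than once.
pair_counts = Counter((p["vendor"], p["amount"]) for p in ordered)
duplicates = [p for p in ordered if pair_counts[(p["vendor"], p["amount"])] > 1]

# Round numbers: here, exact multiples of 1,000 (an assumed cutoff).
round_numbers = [p for p in ordered if p["amount"] % 1000 == 0]

# Gaps in the check-number sequence.
numbers = [p["check_no"] for p in ordered]
present = set(numbers)
gaps = [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]

# Category totals: total paid to each vendor.
totals = defaultdict(float)
for p in ordered:
    totals[p["vendor"]] += p["amount"]

print("duplicates:", [p["check_no"] for p in duplicates])
print("round numbers:", [p["check_no"] for p in round_numbers])
print("missing check numbers:", gaps)
print("vendor totals:", dict(totals))
```

Items flagged this way are only candidates for closer review; a duplicate or a round amount may have an innocent explanation.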
Advantages of Visual exhibits and methods of displaying data
-images are more effective than words in conveying ideas, especially complex ideas -information can be communicated more efficiently in visual form: in less time and with more precision
Important features of the normal distribution
-it is symmetric around its Mean; it has zero skewness -because it is symmetric, its mean, median, and mode are all equal -it is completely described by its mean and standard deviation; graphing the distribution does not require knowledge of the individual data points, just the mean and standard deviation -the curve is bell-shaped; the normal distribution is often called a "bell curve"
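The point that a normal distribution is fully described by its mean and standard deviation can be illustrated with Python's standard-library `statistics.NormalDist`. The parameters below (mean 100, standard deviation 15) are hypothetical and chosen only for the sketch.

```python
from statistics import NormalDist

# Hypothetical parameters: no individual data points are needed.
dist = NormalDist(mu=100, sigma=15)

# Symmetry: half of the area under the curve lies below the mean.
below_mean = dist.cdf(100)

# Share of observations expected within one standard deviation of the mean.
within_one_sd = dist.cdf(115) - dist.cdf(85)

print(f"area below the mean: {below_mean:.2f}")
print(f"area within one standard deviation: {within_one_sd:.4f}")
```

Roughly 68% of the area falls within one standard deviation of the mean, regardless of which mean and standard deviation are chosen.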
Examples of data mining applications
-marketing research: predicting customer demand and sales -drug research: predicting the effectiveness of drugs and the likelihood of side effects -credit scoring: predicting the likelihood of default or bankruptcy -operations management: predicting input usage and productive efficiency -insurance analysis: predicting life expectancies and probabilities of other insurable events -fraud detection: predicting the likelihood that irregular transactions reflect unlawful practices
Commonly employed ratios
-ratio of the largest value to the smallest value: a larger ratio indicates greater variation in the data set -ratio of the largest value to the second-largest value: known as the Relative Size Factor (RSF); a large RSF indicates an outlier in the data set -ratio of the smallest value to the second-smallest value: identifies outliers, but on the opposite side of the distribution -ratio of the largest or smallest value to the mean: a means of identifying outliers using a different reference point
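The ratios above are simple to compute once the data are sorted. The following is a minimal sketch, assuming positive values and at least two observations; the function name and the sample invoice amounts are hypothetical.

```python
def outlier_ratios(values):
    """Compute the common screening ratios for a list of positive values."""
    ordered = sorted(values)
    smallest, second_smallest = ordered[0], ordered[1]
    largest, second_largest = ordered[-1], ordered[-2]
    mean = sum(ordered) / len(ordered)
    return {
        "largest_to_smallest": largest / smallest,
        "relative_size_factor": largest / second_largest,
        "smallest_to_second_smallest": smallest / second_smallest,
        "largest_to_mean": largest / mean,
        "smallest_to_mean": smallest / mean,
    }

# Hypothetical invoice amounts for one vendor; 1450 looks out of line.
invoices = [120, 150, 135, 160, 140, 1450]
ratios = outlier_ratios(invoices)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Here the largest invoice is about nine times the second largest, which is the kind of RSF that would prompt a closer look. In practice the RSF is usually computed per subgroup (for example, per vendor) and compared against a benchmark or tracked over time.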
Ways to manipulate graphs
-scale (should always be included) -inclusions or lack thereof of labels including the title of the graph, labels of the horizontal and vertical axes, and labels of individual data points
Common types of credit card fraud
-stolen card: unauthorized usage of a stolen card -counterfeit card: duplicating credit cards for the purpose of fraudulent transactions -cardholder-not-present fraud: unauthorized usage of credit card information for transactions via phone, Internet, or mail -Application fraud: opening a credit account using another person's personal information
Data profiles may reflect
-the past behavior of the system being studied -may be extrapolated from other similar systems -may be the product of complex models that consider multiple factors
Disadvantages of Microsoft Excel and Access
-they allow data to be altered, intentionally or unintentionally, without a record -errors can be easily introduced, most commonly via formulas, copy and paste, incorrect cell references, and improperly defined cell ranges -they sometimes cannot accommodate data in certain formats; the data must then be converted, a step in which it could be further compromised
Data profiles are often defined in terms of
-trends over time (time series models, data distributions) -changes in expected trends or observations that fall outside the expected distributions suggest the need for closer investigation
USDA high-tech strategies against fraud
-working with social media firms and using data mining techniques -data is collected from LINK terminals and reviewed for suspicious transactions -uses the Anti-Fraud Locator Using EBT Retailer Transactions (ALERT) system
Two common examples of GAS
1. Audit Command Language (ACL) 2. Interactive Data Extraction and Analysis (IDEA)
Categories for Descriptive measures
1. Measures of central tendency (where observations are concentrated) 2. Measures of variability (how the observations are dispersed)
Three Common Measures of Variability
1. Range 2. Variance 3. Standard deviation
Major disadvantages of GAS programs
1. higher cost compared to basic programs 2. extensive training required to use them effectively
Total area under the curve is
1.00 or 100%
Close conformity to Benford's Law requires a large data set, often defined as at least
1000 observations with numbers having at least four digits
Benford's Law used to identify patterns in reported net income
Carslaw 1988 found evidence that companies with net income below a certain threshold have a tendency to round the income number up
Data Analysis tests used in Benford's Law to identify irregularities in data sets
First-Digit test, First-Two-Digits test, and Last-Two-Digits test
GAS
Generalized Audit Software developed for use in auditing and fraud investigation engagements
data mining
The goal is to find individual items of value. The accountant uses it to reduce a large number of observations to a smaller number that can be examined more closely. It allows the analyst to screen all the observations in a data set instead of relying on a sample.
For a symmetric distribution (with no skewness), the
Mean, Median, and Mode are all equal
Two basic data analysis programs
Microsoft Excel and Microsoft Access
Benford's Law used to identify patterns in Fraud detection
Nigrini 1994 was the first to use digital analysis for fraud detection
Benford's Law formula
P(d) = log10[1 + (1/d)], where P(d) is the probability (expected frequency) that the first digit is d, and d is an integer from 1 to 9
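The formula can be evaluated directly to produce the expected first-digit profile. This is a minimal sketch; the function name `benford_first_digit` is a hypothetical label used only for this example.

```python
import math

def benford_first_digit(d: int) -> float:
    """Expected frequency of leading digit d (1-9) under Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("d must be an integer from 1 to 9")
    return math.log10(1 + 1 / d)

# Expected profile for the nine possible first digits.
profile = {d: benford_first_digit(d) for d in range(1, 10)}
for d, p in profile.items():
    print(f"{d}: {p:.3f}")
```

The nine probabilities sum to 1, and each digit is expected less often than the one before it: about 30.1% of numbers should begin with 1, but only about 4.6% with 9.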
Benford's Law
A pattern that describes the expected frequencies of digits in numbers: the probability that any given digit in a number will take a certain value. It is a mathematical algorithm, or series of formulas, that accurately predicts that, for many data sets, the first digit of the numbers in a random sample will be 1 more often than 2, 2 more often than 3, 3 more often than 4, and so on. It predicts the percentage of the time each digit will appear in a sequence of numbers.
Descriptive Statistics
Purpose: to describe data using various numerical measures and graphical depictions -measures that describe samples of data -can be used in any engagement that involves analysis of a numerical data set; the larger the data set, the more valuable the summary measures
SNAP
Supplemental Nutrition Assistance Program -a federally funded benefit program that assists low-income individuals and families with purchasing eligible food items.
Variance
The average of the squared deviations of the observations from the mean
Mean
The average value, calculated by adding all the observations and dividing by the number of observations
Median
The center point of the data set, when the observations are ordered by magnitude. This can be a single observation or point between two observations
Range
The difference between the values of the largest and smallest observations
Standard deviation
The square root of the variance
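The measures defined above (variance, mean, median, range, and standard deviation) can be computed with Python's standard-library `statistics` module. The data set is hypothetical. Note the assumption in the sketch: because variance is defined here as the average of the squared deviations, it uses `pvariance` and `pstdev`, which divide by the number of observations; `statistics.variance` and `statistics.stdev` divide by one less than that and give slightly larger values.

```python
import statistics

observations = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

mean = statistics.mean(observations)
median = statistics.median(observations)      # midpoint of the two middle values
value_range = max(observations) - min(observations)
variance = statistics.pvariance(observations)  # average squared deviation from the mean
std_dev = statistics.pstdev(observations)      # square root of the variance

print(f"mean={mean}, median={median}, range={value_range}")
print(f"variance={variance}, standard deviation={std_dev}")
```

For this data set the variance is 4 (in squared units) and the standard deviation is 2 (in the original units), which is why the standard deviation is the easier of the two to interpret.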
Benford's Law used to identify patterns in Earnings per share
Thomas 1989 found that EPS numbers in the U.S. displayed unusually high frequencies of multiples of 5 and 10 cents; this provided additional evidence of rounding numbers up
In statistics, a false positive is called a
Type I error
A failure to identify a true signal is called a
Type II error
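The trade-off between the two error types can be seen by moving a flagging threshold. In this sketch the risk scores, the fraud labels, and the thresholds are all hypothetical; a transaction is flagged when its score is at or above the threshold.

```python
# Hypothetical (risk score, actually fraudulent) pairs.
cases = [
    (0.10, False), (0.20, False), (0.35, False), (0.40, True),
    (0.55, False), (0.60, True), (0.80, True), (0.90, True),
]

def error_counts(cases, threshold):
    """Return (Type I count, Type II count) for a given flagging threshold."""
    # Type I error: flagged, but not actually fraudulent (false positive).
    type_i = sum(1 for score, fraud in cases if score >= threshold and not fraud)
    # Type II error: fraudulent, but not flagged (missed signal).
    type_ii = sum(1 for score, fraud in cases if score < threshold and fraud)
    return type_i, type_ii

for threshold in (0.3, 0.5, 0.7):
    type_i, type_ii = error_counts(cases, threshold)
    print(f"threshold {threshold}: Type I = {type_i}, Type II = {type_ii}")
```

Raising the threshold removes false positives but lets more true signals slip through; lowering it does the opposite.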
SNAP Fraud
USDA defines it as the exchange of SNAP benefits for cash, also known as trafficking or discounting; prohibited by federal law -most is committed by retailers exchanging benefits for cash, or by people selling or trading their LINK cards in the open market, often through websites -smaller stores are usually more likely to participate in fraud
Histogram
a graph of the frequencies of grouped data: a bar chart in which each bar represents a single interval and its height represents the frequency of observations in that interval; it is an important data analysis tool because it illustrates the shape of the data distribution, which is a key determinant of the analytic methods that can be applied
Negatively Skewed
a distribution that extends farther to the left than to the right; the distribution is more heavily weighted toward larger numbers
A measure that describes a population is?
a parameter
To perform data comparisons you must determine
a) what the data set actually looks like b) what it should look like
When can data be sorted?
after the data has been compiled into a spreadsheet with various data fields
Specialized programs ability to record the analytics
allows the forensic accountant to review the analysis that has already been completed, to guide future efforts and inform future engagements; a complete record of the data analysis process provides essential context for the results of the analysis
Using a larger number of intervals provides
a more detailed picture of the data distribution, but may be misleading if the observations are heavily weighted in only a few intervals *between 7 and 12 intervals is sufficient
What is the most common way that graphs are biased
by manipulating the scale
Data mining cannot detect fraudulent transactions with
certainty; it is limited to identifying irregular transactions that have a higher likelihood of being fraudulent
Almost all natural numbers display geometric tendency including
city populations, sizes of geologic objects, and accounting numbers (stock prices, company revenues, trading volume)
For a fraud scheme to be eligible for data analysis, the data must be
collected, recorded, stored, and organized -bribery, kickbacks, and other forms of corruption do not create such data
First-Two-Digits Test
compares the first two digits of a data set with Benford's profile for the first two digits; 90 total combinations (10-99); the expected profile has a slightly steep slope; this test offers more precision than the first-digit test
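The expected first-two-digits profile comes from the same logarithmic formula as the first-digit profile, applied to the 90 combinations from 10 to 99. The sketch below is a minimal illustration; the helper name `first_two_digits` and the way it extracts digits (via scientific notation) are assumptions made for this example.

```python
import math

def first_two_digits(x: float) -> int:
    """Return the first two significant digits of a nonzero number as 10-99."""
    # Scientific notation, e.g. 1234.5 -> "1.234500000000000e+03".
    # Note: a single-digit value such as 7 is read as 70; many analysts
    # exclude amounts below 10 from this test instead.
    s = f"{abs(x):.15e}"
    return int(s[0] + s[2])

# Expected Benford frequency for each two-digit combination 10-99.
expected = {d: math.log10(1 + 1 / d) for d in range(10, 100)}

print(f"combinations: {len(expected)}")
print(f"expected share for 10: {expected[10]:.4f}")
print(f"expected share for 99: {expected[99]:.4f}")
print(f"first two digits of 1234.5: {first_two_digits(1234.5)}")
```

The expected shares decline steadily from about 4.1% for 10 to about 0.4% for 99, so a spike at a particular combination (for example, 49 when approvals are required at 5,000) stands out more clearly than it would in the first-digit test.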
Last-Two-Digit Test
compares the last two digits of a data set with Benford's profile for the last two digits; 100 possible combinations (00-99); each combination has the same probability of occurrence (1%); this test is useful for identifying round or whole numbers, which are red flags for invented numbers
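A last-two-digits profile can be sketched as follows. The example assumes whole-dollar amounts, so the "last two digits" are the tens and units digits; which digits are appropriate depends on the data, and both the function name and the payment amounts are hypothetical.

```python
from collections import Counter

def last_two_digits_profile(amounts):
    """Observed share of each last-two-digits combination (0-99)."""
    counts = Counter(abs(int(a)) % 100 for a in amounts)
    n = len(amounts)
    return {k: counts[k] / n for k in range(100)}

# Hypothetical whole-dollar payments; many end in 00.
payments = [1500, 2300, 4712, 800, 1999, 2500, 6100, 3342, 900, 1200]
profile = last_two_digits_profile(payments)

expected_share = 1 / 100  # every combination is expected 1% of the time
print(f"share ending in 00: {profile[0]:.2f} (expected {expected_share:.2f})")
```

With only ten observations the comparison is illustrative at best; the test becomes meaningful on large data sets, where a share far above 1% for endings such as 00 suggests rounded or invented numbers.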
data profiles
created from patterns in existing data that reflect expected or normal experience, so that they can be compared to new data
Five dimensions Banks use to evaluate transactions
customer, account, product, geography, and time
In Access each row in the table contains
data for a single record (a transaction); fields are columns, and each field can store only one type of data for all the records
Positive relationship data
data moving in the same direction
Two ____ _____ can have the same number of observations and the same Mean, Median, and mode but have different variability
data sets
Examples of numbers that do not follow Benford's Law
data sets with built-in maximums or minimums, and assigned numbers (such as Social Security numbers, account numbers, and zip codes); some unnatural (external) influence on the numbers stifles the development of a geometric pattern
What is the first step in creating a histogram?
defining the intervals, which is a matter of judgement for the analyst
Measures of Variability
describe how the observations are dispersed around the Mean
The challenge in applying the last-two-digits test is
determining which last-two digits are appropriate for the analysis
Both _____ and ____ distributions can be graphed as histograms
discrete; continuous
Pie charts are used to
display categories of data that sum to a total; slices of the pie represent the percentage of the total contained within each category
For a right-skewed distribution, the Mean is
greater than the Median, which is greater than the Mode
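The ordering of the three measures for a right-skewed distribution can be checked on a small example. The data set is hypothetical, and the skewness figure is computed with one common moment-based formula (third central moment divided by the 1.5 power of the second), which is an assumption of this sketch rather than a formula given in the chapter.

```python
import statistics

# Hypothetical right-skewed data: many small values, a few large ones.
data = [1, 1, 1, 2, 2, 3, 4, 5, 8, 13]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Moment-based skewness: positive when the right tail is longer.
n = len(data)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5

print(f"mean={mean}, median={median}, mode={mode}")
print(f"skewness={skewness:.2f}")
```

The few large observations pull the mean above the median, while the mode stays at the most common small value; the positive skewness figure reflects the same asymmetry.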
The magnitude of an observation can
have a negative value, a zero value, or a positive value
The key function of statistics is defining
imprecision, which is represented by terms such as the error rate, significance level, and confidence level
Analysis Toolpak
included with Excel and can be accessed through a simple loading process; includes analysis tools such as descriptive statistics, histogram, and sampling
Compared to the variance, the standard deviation is more easily
interpreted because it is stated in terms of basic units rather than squared units
What kind of relationship exists between size of the intervals and the number of intervals for a given data set?
inverse: the smaller the intervals, the larger the number of intervals
digital analysis
is founded on the counterintuitive observation that individual digits of multidigit numbers are not random, but follow a pattern known as Benford's Law
Normal distribution
is the most prominent continuous distribution
Defining the intervals for a histogram is a_____ process
iterative, where the analyst considers various alternatives before selecting one that is most appropriate for the specific purpose of the analysis
In an embezzlement scenario, a fraudster may deliberately issue payments
just below some threshold, or may transpose digits to make a payment seem like an innocent error (12,323 vs. 21,323)
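A simple screen for payments clustered just below an approval threshold might look like the sketch below. The 5,000 threshold, the 5% band, and the payment amounts are hypothetical assumptions; the appropriate values depend on the organization's actual controls.

```python
def just_below(amounts, threshold, band=0.05):
    """Return amounts within `band` (as a fraction) below the threshold."""
    floor = threshold * (1 - band)
    return [a for a in amounts if floor <= a < threshold]

# Hypothetical payments and a hypothetical approval threshold of 5,000.
payments = [4990.00, 1200.50, 4875.00, 5000.00, 4700.00, 4999.99]
flagged = just_below(payments, threshold=5000)

print("payments just below the threshold:", flagged)
```

A cluster of payments in this band is not proof of fraud, only an irregularity with a higher likelihood of being deliberate.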
The most basic form of presenting quantitative data is a
listing of the value for each individual observation; this is not feasible for large data sets, so it is useful to summarize the data in some way (such as by value intervals, time intervals, or categories) and present the summary measures in tabular format
With Access it is always clear whether you are
looking at data (input) or results (output)
Numbers with fewer digits display a slightly higher bias toward
lower digits
The advantage of the Mean is that it considers the
magnitude of all the observations, representing the point where their mass (or weight) is concentrated
The Median does not consider the
magnitude of each observation, only whether it is located in the upper or lower half of the distribution; for this reason, it is not affected by extreme observations (outliers)
ActiveData for Excel
more than 100 tools available that must be purchased and installed; offered in two versions
Experts can be challenged in deposition or cross-examination about
probabilities related to their conclusions; an effective response requires knowledge of whether such probabilities can be determined and, if not, the ability to explain why
Known or potential error rate of the method
probability concept
Inferential Statistics
purpose: to draw conclusions (inferences) about a population based on information obtained from a sample -limited -requires that a sample be drawn randomly from a population
Observations can be compiled in a
single point in time (a cross section) or over some period of time (a time series); different statistical tools are needed for each
It is more difficult to identify significant observations from Benford's profile in
small data sets
What is a straightforward form of data mining?
sorting
A measure that describes a sample is called a
statistic
Financial data that is normally distributed includes
stock prices, rates of return, profits, commodity or currency prices
A sample is a
subset of observations selected from the population
A key difference between Microsoft Excel and Microsoft Access is
that Access forces some structure on the data analysis project while excel allows more flexibility
The key measures of central tendency are
the Mean, Median, and Mode; of these three, the Mean is the most commonly recognized
A key advantage of data mining is
the ability to examine all of the data, not just samples
The standard deviation and the variance have
the advantage of considering all the observations in the data set
Absolute frequency
the count of observations within an interval
A population is
the entire group of observations in which we are interested
First Digit Test
the first-digit profile of a data set is compared to Benford's first-digit profile.
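A first-digit comparison can be sketched as below. The helper names are hypothetical, and the test data are a geometric series (each value 5% larger than the previous one), used here because such series are expected to follow Benford's profile approximately. The sketch only tabulates differences; it does not apply a formal significance test.

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    """Return the leading significant digit of a nonzero number."""
    # Scientific notation places the leading digit first, e.g. "1.2345...e+03".
    return int(f"{abs(x):.15e}"[0])

def first_digit_test(values):
    """Compare observed first-digit frequencies with Benford's profile."""
    nonzero = [v for v in values if v != 0]  # zero has no leading digit
    counts = Counter(first_digit(v) for v in nonzero)
    rows = []
    for d in range(1, 10):
        observed = counts[d] / len(nonzero)
        expected = math.log10(1 + 1 / d)
        rows.append((d, observed, expected, observed - expected))
    return rows

# Hypothetical data: a geometric series growing 5% per step.
amounts = [100 * 1.05 ** k for k in range(1000)]
rows = first_digit_test(amounts)
for d, observed, expected, diff in rows:
    print(f"{d}: observed {observed:.3f}  expected {expected:.3f}  diff {diff:+.3f}")
```

On real data, digits whose observed share departs noticeably from the expected share identify the subsets of transactions that merit closer review.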
For a left-skewed distribution
the median is larger than the mean
Mode
the most frequently occurring value
The most basic feature of a data set is
the number of observations -it is important because it determines the scope of the analysis -it determines what methods can be applied to the data, what technological resources are needed, and how long the analysis will take
Low variability indicates that
the observations are located closer to the Mean
Relative Frequency
the percentage of the total number of observations that fall within an interval
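Absolute and relative frequencies for a histogram can be tabulated as in the sketch below. The interval boundaries are an analyst's choice; the equal-width, half-open intervals used here (lower bound included, upper bound excluded) are an assumption that guarantees each observation falls in exactly one interval. The function name and data are hypothetical.

```python
def frequency_table(values, lower, width, n_intervals):
    """Return (interval start, interval end, absolute, relative) rows."""
    counts = [0] * n_intervals
    for v in values:
        index = int((v - lower) // width)  # half-open intervals [start, end)
        if not 0 <= index < n_intervals:
            raise ValueError(f"{v} falls outside the defined intervals")
        counts[index] += 1
    total = len(values)
    return [
        (lower + i * width, lower + (i + 1) * width, count, count / total)
        for i, count in enumerate(counts)
    ]

# Hypothetical observations grouped into four intervals of width 25.
observations = [5, 12, 18, 25, 25, 31, 44, 47, 52, 99]
table = frequency_table(observations, lower=0, width=25, n_intervals=4)
for start, end, absolute, relative in table:
    print(f"[{start}, {end}): absolute={absolute}, relative={relative:.0%}")
```

Because relative frequencies are standardized by the total number of observations, they can be compared across data sets of different sizes; halving the width would double the number of intervals needed to cover the same span.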
Measures of Central Tendency
the tendency for quantitative data to cluster around certain values -clustering often (but not always) occurs near the center of the data distribution
The efficiency of data mining can be evaluated in terms of
the true signals it identifies relative to false signals, also described as false positives or noise
Advantages of Relative frequencies
they are standardized or described relative to a standard quantity---the total number of observations
Advantages of specialized software
-they can process data in a wide variety of formats, which eliminates the need for conversion -they analyze the database as read-only to avoid altering data -they can record the analytics that have been performed, creating an audit trail
What is the purpose of statistics?
to summarize data, analyze them, and draw meaningful inferences that lead to improved decisions
Second important feature in a data set is
two dimensions: 1. time 2. magnitude (amount)
Continuous Distribution
values can be measured to an infinitesimally small degree of accuracy; values are continuous, with no discrete jump between them; ex: time, weight, and distance
Specialized software programs address the
weaknesses of Microsoft Excel and Access
Positively skewed
when a distribution extends farther to the right than to the left; more heavily weighted toward smaller numbers