Chapter 3

Standard Deviation

=STDEV() -The variability or spread of the data from the mean; a larger standard deviation means a wider spread away from the mean

Diagnostic

profiling, clustering

Predictive

regression, classification, link prediction

Pruning

removes branches from a decision tree to avoid overfitting the model, reduces the number of times we split the groups of data into smaller groups

LO 3-1

understand the four categories of Data Analytics

1. Identify the objects or activity you want to profile

what data do you want to evaluate? Sales transactions? Customer data? Imagine a manager wants to track sales volume for each store in a retail chain. She might evaluate total sales dollars, asset turnover, use of promotions and discounts, and/or employee incentives

2. Determine the types of profiling you want to perform

what is your goal? Do you want to set a benchmark for minimum activity, such as monthly sales? Have you set a budget that you wish to follow? Are you trying to reduce fraud risk? In the retail store scenario, the manager would likely want to compare each store to the others to identify which ones are underperforming or overperforming

Evaluating classifiers

when classifiers wrongly classify an observation, they are penalized. The larger the penalty (error), the less accurate the model is at predicting a future value, or classification
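
One common way to express this penalty is to count the wrong classifications and report them as an error rate. The sketch below is only an illustration: the function name `confusion_counts` and the fraud labels are made up, not taken from the chapter.

```python
def confusion_counts(actual, predicted):
    """Tally correct and incorrect predictions for a yes/no class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1      # fraud, and flagged as fraud
        elif not a and not p:
            counts["TN"] += 1      # not fraud, and not flagged
        elif p:
            counts["FP"] += 1      # flagged, but not actually fraud
        else:
            counts["FN"] += 1      # actual fraud that was missed
    return counts

# Hypothetical labels: True means "fraud"
actual    = [True, False, True, False, False, True, False, False]
predicted = [True, False, False, False, True, True, False, False]

counts = confusion_counts(actual, predicted)
error_rate = (counts["FP"] + counts["FN"]) / len(actual)
accuracy = 1 - error_rate
```

Here two of the eight observations are misclassified, so the error rate is 0.25 and the accuracy is 0.75.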

LO 3-4

Understand predictive analytics, including regression and classification

3. Identify the parameters of the model

What are the relative weights of each variable or the thresholds of each branch in a classification?

Support vector machine

a discriminating classifier defined by a separating hyperplane; it works first to find the widest margin (or biggest pipe) between the classes

Class

a manually assigned category applied to a record based on an event. For example, if the credit department has rejected a credit line for a customer, the credit department assigns the class "Rejected" to the customer's master record. Likewise, if the internal auditors have confirmed that fraud has occurred, they would assign the class "fraud" to that transaction

Test data

a set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data

Unsupervised approach

approach used for data exploration looking for potential patterns of interest

Machine learning and artificial intelligence

are learning models or intelligent agents that adapt to new external data to recommend a course of action. For example, an artificial intelligence model may observe opinions given by an audit partner and adjust the model to reflect changing levels of risk appetite and regulation

Diagnostic Analytics (understand why it happened)

are procedures that explore the current data to determine why something happened the way it has, typically comparing the data to a benchmark. As an example, diagnostic analytics allow users to drill-down in the data and see how it compares to a budget, competitor or trend

Prescriptive Analytics (make recommendations for a course of action)

are procedures that model data to enable recommendations for what should be done in the future. These typically include developing more advanced machine learning and artificial intelligence models to recommend a course of action based on a current problem

Predictive Analytics (estimate a future value or category)

are procedures used to generate a model that can be used to determine what is likely to happen in the future. Examples of predictive analytics include regression analysis, forecasting, classification, and other predictive modeling

Decision support systems

are rule-based systems that gather data and recommend actions based on the input. Tax preparation software, investment advice tools, and auditing tools recommend courses of actions based on data that are input as part of an interview or interrogation process

Quartile

=QUARTILE() -The value that divides a quarter of the data from the rest; indicates skewness of the data

Decision trees

tool used to divide data into smaller groups

Minimum

=MIN() -the smallest value

Data Reduction approach steps

1. Identify the attribute you would like to reduce or focus on 2. Filter the results 3. Interpret the results 4. Follow up on results

Classification steps

1. Identify the classes you wish to predict 2. Manually classify an existing set of records 3. Select a set of classification models 4. Divide your data into training and testing sets 5. Generate your model 6. Interpret the results and select the "best" model
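
The steps above can be sketched with a deliberately tiny model: a one-split decision tree (a "stump") that learns a single threshold. Everything here is an assumption for illustration, including the loan data, the debt-to-income feature, and the helper names `train_stump` and `predict`. Only one model is tried, so step 3 (comparing several models) is skipped.

```python
def train_stump(records):
    """Pick the threshold that misclassifies the fewest training records.
    records: list of (value, label) pairs where label is True or False."""
    best_threshold, best_errors = None, len(records) + 1
    for threshold, _ in records:
        errors = sum((value >= threshold) != label for value, label in records)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, value):
    """Classify as True (default) when the value reaches the threshold."""
    return value >= threshold

# Steps 1-2: hypothetical, manually classified loans
# (debt-to-income ratio, defaulted?)
labeled = [
    (0.10, False), (0.15, False), (0.22, False), (0.30, False),
    (0.35, True), (0.42, True), (0.50, True),
    (0.55, True), (0.28, False), (0.45, True),
]

# Step 4: divide the data into training and testing sets
training, testing = labeled[:7], labeled[7:]

# Step 5: generate the model from the training data only
threshold = train_stump(training)

# Step 6: interpret the results on data the model has not seen
correct = sum(predict(threshold, value) == label for value, label in testing)
accuracy = correct / len(testing)
```

With these made-up records the stump settles on a threshold of 0.35 and classifies all three test records correctly. Real data is rarely this clean.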

Data profiling steps

1. Identify the objects or activity you want to profile 2. Determine the types of profiling you want to perform 3. Set boundaries or thresholds for the activity 4. Interpret the results and monitor the activity and/or generate a list of expectations 5. Follow up on expectations

Regression steps

1. Identify the variables that might predict an outcome 2. Determine the functional form of the relationship 3. Identify the parameters of the model 4. Evaluate the goodness of fit

Mean

=AVERAGE() -The center value; sum of all observations divided by the number of observations

Correlation Coefficient

=CORREL() -How closely two datasets are correlated or predictive of one another
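
As a rough sketch of what =CORREL() computes, the Pearson correlation coefficient can be written out by hand. The function name `correl` and the sample figures are illustrative assumptions.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Hypothetical data: sales rise in lockstep with advertising spend
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
r = correl(ad_spend, sales)
```

A value near 1 means the two datasets move together, a value near -1 means they move in opposite directions, and a value near 0 means little linear relationship.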

Count

=COUNT() -The number of observations

Frequency

=FREQUENCY() -The number of observations in each of a series of numerical categorical buckets
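
A minimal sketch of the bucket counting that =FREQUENCY() performs, assuming each bucket is defined by its upper bound and that one extra bucket catches everything above the last bound. The function name `frequency` and the invoice amounts are made up for illustration.

```python
def frequency(values, upper_bounds):
    """Count how many values fall into each bucket.
    Bucket i holds values up to and including upper_bounds[i];
    a final extra bucket holds values above the last bound."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        for i, bound in enumerate(upper_bounds):
            if v <= bound:
                counts[i] += 1
                break
        else:
            counts[-1] += 1   # larger than every bound
    return counts

# Hypothetical invoice amounts, bucketed as <=100, 101-500, >500
invoice_amounts = [45, 120, 300, 80, 510, 95, 250, 1000]
bucket_counts = frequency(invoice_amounts, [100, 500])
```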

Maximum

=MAX() -The largest value

Median

=MEDIAN() -The middle value that divides the top half of the data from the bottom half

Machine Learning and Artificial Intelligence

AI models work similarly in that they learn from the inputs and corrections to improve decision making

Decision support system

An information system that supports decision-making activity within a business by combining data and expertise to solve problems and perform calculations

4. Evaluate the goodness of fit

Calculate the correlation coefficient or RSQ value to determine whether the data are close to the line or not. In general, the better the fit (closer to 1; at least 0.8 is good), the more accurate the prediction will be

LO 3-2

Describe some descriptive analytics approaches, including summary statistics and data reduction

LO 3-5

Describe the use of prescriptive analytics, including decision support systems, machine learning and artificial intelligence

1. Identify the attribute you would like to reduce or focus on

Example: an employee may commit fraud by creating a fake vendor and submitting fake invoices. Rather than evaluate every employee, an auditor may be interested only in employee records that have addresses that match vendor addresses

LO 3-3

Explain the diagnostic approach to Data Analytics, including profiling and clustering

2. Determine the functional form of the relationship

Is it a linear relationship where each input plots to another? Are you trying to divide the records into different groups or classes?

Sum

=SUM() -the total value of all numerical values

Descriptive

Summary statistics, data reduction or filtering

2. Filter the results

This could be as simple as using filters in Excel, or using the WHERE phrase in a SQL query. May also involve a more complicated calculation. Ex: employees who create fake vendors will often use addresses that are similar, but not exactly the same, as their own address to foil basic SQL queries. Here the auditor should use a tool that allows fuzzy matching, which uses probability to identify likely similar addresses.
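
A minimal sketch of fuzzy address matching, assuming Python's standard `difflib` module is an acceptable stand-in for a dedicated fuzzy-matching tool. Its `ratio()` is a 0-to-1 similarity score rather than a true probability. The IDs, addresses, and the 0.8 cutoff are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-to-1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical master data (illustrative only)
employees = {"E100": "123 Main Street Apt 4", "E200": "77 Oak Avenue"}
vendors = {"V900": "123 Main St. Apt 4", "V901": "500 Industrial Parkway"}

THRESHOLD = 0.8  # assumed cutoff; raise to narrow the list, lower to broaden it

matches = []
for emp_id, emp_addr in employees.items():
    for ven_id, ven_addr in vendors.items():
        score = similarity(emp_addr, ven_addr)
        if score >= THRESHOLD:
            matches.append((emp_id, ven_id, round(score, 2)))
```

An exact comparison, such as a SQL WHERE clause testing equality, would miss the "Street" versus "St." pair, while the similarity score still flags it for follow-up.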

4. Follow up on results

at this point, you will continue to build a model or use the results as a targeted sample for follow-up. The auditor should review company policy and follow up with each employee who appears in this reduced list as it represents risk

Overfitting

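be wary of classifiers that are too accurate on the training data. A model that fits the training data almost perfectly has often learned its noise and will classify new data poorly. You want a good amount of accuracy without being too perfect.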

Prescriptive analytics

decision support systems, machine learning and artificial intelligence

Summary statistics

describe a set of data in terms of their location (mean, median), range (standard deviation, minimum, maximum), shape (quartile) and size (count)
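
For readers working outside Excel, the functions on these cards have close counterparts in Python's standard `statistics` module. This is a sketch with made-up sales figures. The `method="inclusive"` setting is chosen on the assumption that it interpolates the way =QUARTILE() does, so treat the quartile values as approximate equivalents.

```python
import statistics

# Hypothetical monthly sales figures (illustrative values only)
sales = [120, 135, 150, 110, 160, 145, 155, 130]

summary = {
    "count": len(sales),                  # =COUNT()
    "sum": sum(sales),                    # =SUM()
    "min": min(sales),                    # =MIN()
    "max": max(sales),                    # =MAX()
    "mean": statistics.mean(sales),       # =AVERAGE()
    "median": statistics.median(sales),   # =MEDIAN()
    "stdev": statistics.stdev(sales),     # =STDEV(), the sample standard deviation
    # Three cut points: first quartile, median, third quartile
    "quartiles": statistics.quantiles(sales, n=4, method="inclusive"),
}
```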

Co-occurrence grouping

discovers associations between individuals based on common events, such as transactions they are involved in. Amazon might use this to sell another item to you by knowing what items are "frequently bought together"
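
One simple way to find "frequently bought together" items is to count how often each pair of items appears in the same order. The orders below are invented for illustration, and a real retailer's method is certainly more elaborate than this sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders, each a set of items bought together
orders = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "laptop bag"},
]

pair_counts = Counter()
for order in orders:
    # Sort so the same pair always appears in the same order
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

# Pairs seen together in at least two orders
frequent_pairs = sorted(p for p, n in pair_counts.items() if n >= 2)
```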

Regression

estimates or predicts the numerical value of a dependent variable based on the slope and intercept of a line and the value of an independent variable. R-squared value indicates how closely the line fits to the data used to calculate the regression. Example: given a balance of total accounts receivable held by a firm, what is the appropriate level of allowance for doubtful accounts for bad debts?
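
The slope, intercept, and R-squared can be worked out by hand with ordinary least squares. The sketch below uses invented receivable and allowance balances (in thousands of dollars), and the function name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical history: accounts receivable vs. allowance for doubtful accounts
receivables = [100, 200, 300, 400, 500]   # independent variable
allowance = [3, 5, 10, 11, 16]            # dependent variable

slope, intercept, r_squared = fit_line(receivables, allowance)

# Predict the allowance for a new receivables balance of 600
predicted = intercept + slope * 600
```

With these figures the line is roughly allowance = 0.032 × receivables − 0.6, and R-squared is about 0.97, so the line fits the historical data closely.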

Post-pruning

evaluates the complete model and discards branches after the fact

Training data

existing data that have been manually evaluated and assigned a class, which assists in classifying the test data

Clustering

helps identify groups (or clusters) of individuals (such as customers) that share common underlying characteristics--in other words, identifying groups of similar data elements and the underlying drivers of those groups. Ex: clustering might be used to segment customers into a small number of groups for additional analysis and risk assessment

4. Interpret the results and monitor the activity and/or generate a list of expectations

here is where dashboards come into play. Management can use dashboards to quickly see multiple sets of profiled data and make decisions that would affect behavior. As you evaluate the results, try to understand what a deviation from the defined boundary represents. Is it a risk? Is it a fraud? Is it something to keep an eye on? To evaluate her stores, the retail chain manager may review a summary of the sales indicators and quickly identify underperforming and overperforming stores. She is likely to be more concerned with underperforming stores because they represent major challenges for the chain. Overperforming stores may provide insight into marketing efforts or customer base.

Profiling

identifies the "typical" behavior of an individual, group, or population by compiling summary statistics about the data (including mean, standard deviations, etc.) and comparing individuals to the population. By understanding the typical behavior, we'll be able to identify abnormal behavior more easily. Profiling might be used in accounting to identify transactions that might warrant some additional investigation (Ex: outlier travel expenses or potential fraud)

Similarity matching

is a grouping technique used to identify similar individuals based on data known about them.

Target

is an expected attribute or value that we want to evaluate. For example: if we are trying to predict whether a transaction is fraudulent, the target might be a specific "fraud score". If we're trying to predict an interest rate, the target would be "interest rate"

Data reduction or filtering

is used to reduce the number of observations to focus on relevant items (Ex: highest cost, highest risk, largest impact). It does this by taking a large set of data (ex: population) and reducing it to a smaller set that has the vast majority of the critical information of the larger set

Pre-pruning

occurs during the model generation. The model stops creating new branches when the information usefulness of an additional branch is low

5. Follow up on expectations

once a deviation has been identified, management should have a plan to take a course of action to validate, correct, or identify the causes of the abnormal behavior. When the retail chain manager notices a store that is underperforming compared to its peers, she may follow up with the individual store manager to understand his concerns or offer a local promotion to stimulate sales.

3. Interpret the results

once you have eliminated irrelevant data, take a moment to see if the results make sense. Calculate the summary statistics. Have you eliminated any obvious entries? Looking at the list of matching employees, the auditor might tweak the probability in the fuzzy match to be more or less precise to narrow or broaden the number of employees who appear

profiling is used to discover

patterns in behavior

"P" in IMPACT

performing the test plan

Classification

predicts a class or category for a new observation based on the manual identification of classes from previous observations. Membership of a class may be binary in the case of decision trees or indicate the distance from a decision boundary. Some examples of classification include predicting which loans are likely to default, credit applications that are expected to be approved, the classification of an operating or financial lease, or identification of suspicious transactions. In each of these cases, prior data must be manually identified as belonging to each class to build the predictive model.

Link prediction

predicts a relationship between two data items, such as members of a social media platform. For example, if two individuals have mutual friends on social media and both attended the same university, it is likely that they know each other and the site may make a recommendation for them to connect. Example: friend recommendations on Facebook. Link prediction in an accounting setting might use social media to look for relationships between related parties that are not otherwise disclosed, in order to identify related party transactions
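
A very simple link-prediction heuristic is to count mutual friends: the more connections two unlinked people share, the more likely they know each other. The network and the function name `suggest` below are hypothetical, and real platforms use far richer signals than this sketch.

```python
# Hypothetical friendship network (each link is listed in both directions)
friends = {
    "Ana": {"Ben", "Cara"},
    "Ben": {"Ana", "Cara", "Dev"},
    "Cara": {"Ana", "Ben", "Dev"},
    "Dev": {"Ben", "Cara", "Eli"},
    "Eli": {"Dev"},
}

def suggest(person):
    """Rank people not yet linked to `person` by number of mutual friends."""
    scores = []
    for other in friends:
        if other == person or other in friends[person]:
            continue  # skip the person and their existing friends
        mutual = len(friends[person] & friends[other])
        if mutual > 0:
            scores.append((other, mutual))
    # Most mutual friends first; names break ties alphabetically
    return sorted(scores, key=lambda item: (-item[1], item[0]))

suggestions = suggest("Ana")
```

Ana and Dev are not linked but share two friends (Ben and Cara), so Dev is the predicted link. An auditor could apply the same idea to directors, vendors, and employees to surface possible undisclosed related parties.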

Descriptive analytics (understand what happened)

procedures that summarize existing data to determine what has happened in the past. ex: count, min, max, average and median

Profiling done primarily using

structured data (data that are organized and reside in a fixed field within a record or a file; generally contained in a relational database or spreadsheet and readily searchable by search algorithms)

when using historical data to predict a future outcome, you will use a

supervised approach

Decision boundaries

technique used to mark the split between one class and another

1. Identify the variables that might predict an outcome

the inputs are called independent variables, where the output is a dependent variable

3. Set boundaries or thresholds for the activity

this is a benchmark that may be manually set, such as a budgeted value, or automatically set, such as a statistical mean, quartile, or percentile. The retail chain manager may define underperforming stores as those whose sales activity falls below the 20th percentile of the group and overperforming stores as those whose sales activity is above the 80th percentile. These thresholds are automatically calculated based on the total activity of the stores, so the benchmark is dynamic
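
The percentile thresholds in the retail example can be sketched as follows. The store names and sales figures are invented, and `statistics.quantiles` with `n=5` is used here as one reasonable way to get the 20th and 80th percentile cut points.

```python
import statistics

# Hypothetical monthly sales per store, in thousands of dollars
store_sales = {
    "S01": 100, "S02": 110, "S03": 120, "S04": 130, "S05": 140, "S06": 150,
    "S07": 160, "S08": 170, "S09": 180, "S10": 190, "S11": 200,
}

# n=5 returns the 20th, 40th, 60th, and 80th percentile cut points
cuts = statistics.quantiles(store_sales.values(), n=5, method="inclusive")
low_cut, high_cut = cuts[0], cuts[-1]

# Thresholds are recalculated from the data, so the benchmark is dynamic
underperforming = [s for s, v in store_sales.items() if v < low_cut]
overperforming = [s for s, v in store_sales.items() if v > high_cut]
```

Because the cut points come from the stores' own activity, rerunning the calculation next month moves the benchmark automatically.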

Cluster analysis

works to identify groups of similar data elements and the underlying relationships of those groups. More specifically, clustering techniques are used to group data/observations into a specific number of clusters or groups so that all the data within any cluster are similar, while data across clusters are different. Cluster analysis works by calculating the minimum distance between each observation and the center of each cluster
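
The "minimum distance to the center of each cluster" idea is the core of k-means clustering. The card does not name a specific algorithm, so k-means is an assumption here, shown in one dimension with made-up purchase amounts and hand-picked starting centers.

```python
def k_means_1d(values, centers, iterations=10):
    """Group values around the nearest center, then move each center
    to the mean of its group. Repeat a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign the observation to the closest cluster center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical average purchase amounts for nine customers
purchases = [10, 12, 11, 50, 52, 49, 95, 100, 98]
centers, clusters = k_means_1d(purchases, centers=[0, 50, 100])
```

The nine customers fall into three segments (low, medium, and high spenders), which could then be profiled separately.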

