ch3: master the data
Place the steps of classification into order.
- identify the classes you wish to predict
- manually classify an existing set of records
- select a set of classification models
- divide your data into training and testing sets
- generate your model
- interpret the results and select the "best" model
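The classification steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the loan records, the debt-to-income feature, and the single-threshold "model" are all hypothetical and far simpler than the models a real analysis would compare.

```python
# Steps 1-2: classes ("approve"/"reject") assigned manually to existing records.
# Each record is (debt_to_income_ratio, label); all values are hypothetical.
records = [
    (0.10, "approve"), (0.15, "approve"), (0.20, "approve"), (0.25, "approve"),
    (0.45, "reject"), (0.50, "reject"), (0.55, "reject"), (0.60, "reject"),
]

# Step 4: divide the data into training and testing sets
# (every fourth record is held out for testing).
test = [r for i, r in enumerate(records) if i % 4 == 0]
train = [r for i, r in enumerate(records) if i % 4 != 0]


def predict(threshold, ratio):
    """Step 3: the chosen model type is a single cutoff on one attribute."""
    return "reject" if ratio > threshold else "approve"


def accuracy(threshold, data):
    """Share of records whose predicted class matches the manual class."""
    correct = sum(predict(threshold, ratio) == label for ratio, label in data)
    return correct / len(data)


# Step 5: generate the model by trying each training value as the cutoff
# and keeping the one with the best training accuracy.
candidates = sorted(ratio for ratio, _ in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Step 6: interpret the result on data the model has not seen.
test_accuracy = accuracy(best_threshold, test)
print(best_threshold, test_accuracy)
```

Evaluating on the held-out test records, rather than on the training records, is what shows whether the model generalizes or merely memorized the training data.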
Place the steps of profiling in order, from 1 through 5.
- identify the objects or activity you want to profile
- determine the types of profiling you want to perform
- set boundaries or thresholds for the activity
- interpret the results and monitor the activity and/or generate a list of exceptions
- follow up on exceptions
______ include both unsupervised exploratory analysis and supervised model generation to provide insight and predictive foresight into the business and decisions made by accountants and auditors.
Machine learning and artificial intelligence
After you have identified the classes you wish to predict, what is the next step?
Manually classify an existing set of records.
Decision support systems are an example of ______.
Prescriptive analytics
In the example of profiling for management accounting regarding Advanced Environmental Recycling Technologies, what are they looking for significant variances in?
Standard Cost
Using a ___ model, you can predict whether a new vendor belongs to one class or another based on the behavior of others.
classification
When evaluating classifiers, you need to be careful to strike a balance between what two things?
complexity of the model and accuracy of the classification
In the example provided in the text regarding employee turnover, the analyst is trying to predict employee turnover based on current professional salaries, health of the economy (GDP), and salaries offered by other accounting firms. In this scenario, select the explanatory variable(s). Select all that apply.
current professional salaries; health of the economy (GDP); salaries offered by other accounting firms
______ might be used to identify areas where there is a lack of controls, changes in procedures, or individuals more willing to spend excessively in potential types of T&E expenses which might be associated with higher risk.
profiling
What is the terminology for removing branches from a decision tree to avoid overfitting the model?
pruning
Structured data is stored in a database or spreadsheet and is readily ___
searchable
XBRL is used to facilitate the exchange of financial reporting information between the company and the ______.
Securities and Exchange Commission (SEC)
In the profiling example regarding T&E Expenses, which of the following is NOT one of the areas that the analyst would try to uncover?
significant variances in standard cost
____ data are existing data that have been manually evaluated and assigned a class. ____ data are existing data used to evaluate the model.
training; test
A decision _______ is a tool used to divide data into smaller groups. Decision ____ mark the split between one class and another.
tree; boundaries
The null hypothesis assumes the hypothesized relationship does not exist.
true
A(n) _________ approach is used when you don't have a specific question and are simply exploring the data for potential patterns of interest.
unsupervised
A manually assigned category applied to a record based on an event.
a class
An expected attribute or value that we want to evaluate in a dataset.
a target
Any transaction that has a Z-score of ______ or above would represent abnormal transactions.
3
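A z-score measures how many standard deviations a value lies from the mean, so the threshold of 3 can be applied as in the sketch below. The transaction amounts are hypothetical, and the use of the population standard deviation and of the absolute z-score (to catch unusually low values as well as high ones) are assumptions of this example.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99,
           100, 102, 98, 101, 99, 100, 103, 97, 100, 5000]

mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation (assumed here)


def z_score(x):
    """Distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma


# Flag any transaction whose z-score is 3 or more in either direction.
flagged = [x for x in amounts if abs(z_score(x)) >= 3]
print(flagged)
```

Note that in a very small dataset a single extreme value inflates the standard deviation so much that it cannot reach a z-score of 3, which is why the example uses 20 amounts.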
Select the appropriate definition for regression:
A method used to predict specific values
Select the correct definition of a target.
An expected attribute or value that we want to evaluate.
______ is an observation about the frequency of leading digits in many real-life sets of numerical data.
Benford's law
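Benford's law gives the expected share of each leading digit d as log10(1 + 1/d), so about 30.1% of values start with 1 and under 5% start with 9. The sketch below compares those expected shares with the leading digits of a set of hypothetical invoice amounts; the amounts and function names are made up for illustration.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])


# Hypothetical invoice amounts in whole dollars.
amounts = [1200, 1850, 132, 2400, 310, 1999, 118, 4700, 960, 15]
observed = Counter(leading_digit(a) for a in amounts)

for d in range(1, 10):
    expected_share = round(benford_expected(d), 3)
    observed_share = observed[d] / len(amounts)
    print(d, expected_share, observed_share)
```

A large gap between the observed and expected shares is a reason to look closer, not proof of a problem, especially with a sample as small as this one.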
______ are designed to be interactive and adapt to the information collected by the user.
Decision support systems
After you have identified the objects or activity you wish to profile, what should you do next?
Determine the types of profiling you want to perform.
An example of time series analysis would be a prediction of future earnings based on past sales.
False; An example of time series analysis would be a prediction of future earnings based on past earnings.
What is the purpose of regression analysis?
It allows analysts to develop models to predict expected outcomes.
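As a concrete case, a simple linear regression fits a line through known pairs of an explanatory variable and a dependent variable, then uses that line to predict new values. The sketch below uses ordinary least squares on hypothetical data; the variable names and numbers are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical data: x is the explanatory variable, y is the dependent variable.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

x_bar, y_bar = mean(x), mean(y)

# Ordinary least squares for a single explanatory variable.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar


def predict(new_x):
    """Predicted value of the dependent variable for a new observation."""
    return intercept + slope * new_x


print(slope, intercept, predict(6))
```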
Which of the following is true regarding the profiling approach?
It is generally performed on data that is readily available.
Which of the following is true regarding the Data Reduction approach?
It primarily uses structured data that is readily searchable.
In the example regarding the LendingClub data in which the analyst is researching loan rejection, they identified three possible indicators for why a loan would be rejected, the debt-to-income ratio, length of employment, and credit [risk] score. Which of the following is/are the explanatory variable(s)?
Length of employment; credit [risk] score; debt-to-income ratio
What is the terminology for the items that are useful for ranking observations rather than simply predicting class probability?
Linear classifiers
What is the purpose of clustering?
To identify groups of similar data elements and the underlying relationship of these groups.
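One common way to find such groups is k-means, which repeatedly assigns each value to its nearest group center and then moves each center to the mean of its group. The sketch below is a minimal one-dimensional version; the purchase amounts, the starting centers, and the function name are all hypothetical, and k-means is only one of several clustering techniques.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group values around the nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign the value to the closest current centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical purchase amounts with two natural groupings.
amounts = [10, 12, 11, 95, 100, 98]
clusters, centroids = kmeans_1d(amounts, centroids=[0, 50])
print(clusters, centroids)
```

No class labels are supplied anywhere, which is what makes this an unsupervised method: the groups emerge from the distances between the values themselves.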
What is the purpose of classification?
To predict which class an observation that we know little about will belong to.
What is the purpose of Data Reduction?
To reduce the amount of detailed information considered to focus on the most interesting or abnormal items.
In the following scenario, what would be the target? Given a set of customer data, we are trying to predict the total transaction amount based on a variety of attributes.
Transaction amount
A class is a manually assigned ________ applied to a record based on an event.
category
A method for simplifying large datasets into obvious categories.
data reduction
What types of analytics summarizes existing data to determine past performance?
descriptive analytics
Variance analysis, a common practice in management accounting, is an example of ______ analytics.
diagnostic
In the example provided in the text regarding employee turnover, the analyst is trying to predict employee turnover based on current professional salaries, health of the economy (GDP), and salaries offered by other accounting firms. In this scenario, what is the dependent variable?
employee turnover
______ is an unsupervised method that is used to discover patterns of behavior, based on the distance of z-scores from the mean.
profiling
Dependent variables can only be explained by a maximum of one independent variable.
false
Diagnostic analytics forecast future performance.
false
True or false: When clustering works well, observations within a cluster should be different, and the data across clusters should be very similar.
false
Time series analysis is a predictive analytics technique used to predict future values based on past values of other variables.
false; Time series analysis is a predictive analytics technique used to predict future values based on past values of the same variable.
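A very simple time series forecast is a moving average: the next value is predicted as the mean of the most recent values of the same variable. The sketch below assumes hypothetical quarterly earnings and a three-period window; it is one basic technique, not the only form of time series analysis.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Hypothetical past earnings for six quarters (same variable we forecast).
quarterly_earnings = [10.0, 12.0, 11.0, 13.0, 14.0, 15.0]
next_quarter = moving_average_forecast(quarterly_earnings)
print(next_quarter)
```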
True or false: Classification requires that we know a great deal about the observation that we're attempting to place in a class.
false; the power of Classification is that we can know very little about a given observation in order to predict which class it will belong to.
After you have identified the attribute you would like to reduce or focus on, what is the next step?
filter the results
A specific type of data profiling that is used to look for correspondences between portions, or segments, of text for potential matches is called __ match.
fuzzy
______ looks for similarities between portions, or segments, of the text of each potential match.
fuzzy match
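Python's standard library offers one way to score this kind of similarity: `difflib.SequenceMatcher` returns a ratio between 0 and 1 based on matching segments of two strings. The vendor names below are hypothetical, and lowercasing before comparison is an assumption of this sketch rather than part of any standard fuzzy match definition.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings from 0 (nothing shared) to 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical vendor names: the first two may be the same vendor.
close_pair = similarity("Acme Corp", "ACME Corporation")
far_pair = similarity("Acme Corp", "Zenith Ltd")
print(close_pair, far_pair)
```

Pairs that score above a chosen cutoff can then be reviewed manually as potential matches, for example to find duplicate vendor records.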
Clustering is an unsupervised method that is used to find ___ of similar data elements and the underlying relationships of those groups.
groups
Place the steps of data reduction in order.
- identify the attribute you would like to reduce or focus on
- filter the results
- interpret the results
- follow up on results
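The data reduction steps above can be sketched as a simple filter. The expense records, the 1,000 threshold, and the field names are hypothetical and chosen only to show the flow from the full dataset to a short follow-up list.

```python
# Hypothetical expense records.
expenses = [
    {"id": 1, "employee": "A", "amount": 120.00},
    {"id": 2, "employee": "B", "amount": 4300.00},
    {"id": 3, "employee": "C", "amount": 75.50},
    {"id": 4, "employee": "A", "amount": 1250.00},
    {"id": 5, "employee": "D", "amount": 980.00},
]

# Step 1: identify the attribute to focus on (here, the amount).
threshold = 1000.00  # assumed cutoff for "interesting" items

# Step 2: filter the results.
exceptions = [e for e in expenses if e["amount"] > threshold]

# Step 3: interpret the results, largest amounts first.
exceptions.sort(key=lambda e: e["amount"], reverse=True)

# Step 4: follow up on the remaining items.
follow_up_ids = [e["id"] for e in exceptions]
print(follow_up_ids)
```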
In the example regarding the LendingClub data in which the analyst is researching loan rejection, they identified three possible indicators for why a loan would be rejected, the debt-to-income ratio, length of employment, and credit [risk] score. Which is the dependent variable?
loan rejection
Classification predicts a class for a new observation based on the ___ identification of classes from previous observations.
manual
Generally, the more complex and complete the model, the higher the degree of the model ______ the data.
overfitting
Profiling is used to discover ___ of behavior, based on the distance of z-scores from the mean.
patterns
Machine learning, artificial intelligence, and decision support systems are all examples of ______ analytics.
prescriptive
Which analytics type works to identify the best possible options given constraints or changing conditions?
prescriptive analytics
Which of the following data approaches are associated with diagnostic analytics?
profiling
Benford's law states that in many naturally occurring collections of numbers, the significant leading digit is likely to be ______
small
Clustering is a/an ___ method that is used to find natural groupings within the data.
unsupervised
Regression is a/an ______ method used to predict specific values given an explanatory variable (or variables).
supervised
A/an ___ approach is used when you are performing analysis that uses historical data to predict a future outcome based on a specific question.
supervised
What is XBRL used for?
to facilitate the exchange of financial reporting information between a company and the SEC.
Knowing the mean and standard deviation, and assuming a normal distribution, one can compute which statistic that can be used to identify abnormal transactions?
z-score