test 1
volume refers to
size
the purpose of loading data is to
put the data into the appropriate tool for analysis
most immediate/significant effect of database technology on accounting
quicker access and greater use of accounting info for decision making
when obtaining data yourself, you should do all of the following except
identify any errors or issues from the extraction
steps of data reduction
identify attributes to reduce or focus on, filter results, interpret results, follow up on results
examples of clustering
identify groups of transactions that may indicate risk or fraud
step 4: address and refine results
identify issues with the analyses, possible issues and refine the model
data analytics
process of evaluating data with intent of drawing conclusions to address business questions
which approach to data analytics would you use to identify fraud or transactions that might warrant additional investigation
profiling
5 most frequently used approaches
profiling, data reduction, regression, classification and clustering
ADS (audit data standards) aim
provide a guide to standardize audit data requests and the format in which data from the company is provided to the auditor; provide an opportunity for standardization
the IMPACT cycle is _________ in nature and suggests that _____
recursive; as questions are addressed, new important questions may emerge that can be addressed in a similar way
vlookup: col_index_num
refers to the column in selected table_array that contains data you wish to view; indicates what you want the function to return
Given a balance of total A/R held by a firm, what is the appropriate level of allowance for doubtful accounts for bad debts? is an example of which data analytics approach
regression
structured data should be stored in a normalized
relational database
type of database you are most likely to come across when extracting and using financial data
relational database
relational databases and enforcing business rules
relational databases can be designed to aid in the placement and enforcement of internal controls and business rules in ways that flat files cannot
foreign key creates the ____________ between tables
relationship
link prediction key word
relationship/related
cleaning the data
remove headings or subtotals, clean leading zeros and nonprintable characters, format negative numbers, correct inconsistencies
how to clean the data
remove headings or subtotals, clean leading zeros and nonprintable characters, format negative numbers, correct inconsistencies
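The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names, the sample rows, and the two negative-number formats handled (parentheses and a trailing minus sign) are all hypothetical, and real extracts may need other rules.

```python
def clean_text(value):
    """Drop non-printable characters and surrounding whitespace."""
    return "".join(ch for ch in value if ch.isprintable()).strip()

def strip_leading_zeros(value):
    """'000123' -> '123'; an all-zero string collapses to '0'."""
    return value.lstrip("0") or "0"

def parse_amount(value):
    """Turn '(1,234.50)', '75.00-' or '-5' style negatives into negative floats."""
    text = clean_text(value).replace(",", "")
    negative = (
        (text.startswith("(") and text.endswith(")"))
        or text.startswith("-")
        or text.endswith("-")
    )
    number = float(text.strip("()-"))
    return -number if negative else number

# Hypothetical raw extract: a heading row, two data rows, and a subtotal row.
raw_rows = [
    ["Vendor", "Amount"],
    ["000123\x00", "(1,234.50)"],
    ["000456", "75.00-"],
    ["Subtotal", "(1,309.50)"],
]

# Skip headings and subtotals, then clean each remaining field.
cleaned = [
    (strip_leading_zeros(clean_text(vendor)), parse_amount(amount))
    for vendor, amount in raw_rows
    if clean_text(vendor) not in {"Vendor", "Subtotal"}
]
```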
pruning
removes branches from a decision tree to avoid overfitting the model
data visualization and data reporting
report results of analysis in an accessible way to each varied decision maker and their specific needs
data analytics provides a way
to search through large (un)structured data to identify unknown relationships or patterns
best describes the purpose of a relational database
to support business processes across the organization
VLookup
tool for looking up data from two separate tables and matching them based on a matching primary/foreign key relationship
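For readers who think in code rather than spreadsheets, an exact-match VLOOKUP can be approximated as below. The function `vlookup` and the two small tables are hypothetical. The parameter names mirror Excel's arguments, and the sketch covers only the exact-match case (range_lookup = FALSE).

```python
def vlookup(lookup_value, table_array, col_index_num):
    """Search the first column of table_array for lookup_value (exact match only)."""
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]  # col_index_num is 1-based, as in Excel
    return None  # Excel would display #N/A here

# Hypothetical customer table: the first column holds the primary key.
customers = [
    (101, "Acme Co", "Dallas"),
    (102, "Birch LLC", "Tulsa"),
]

# Hypothetical invoice rows: (invoice_id, customer_id as foreign key, amount).
invoices = [(9001, 102, 250.00), (9002, 101, 75.50)]

# Look up each invoice's customer name via its foreign key.
names = [vlookup(customer_id, customers, 2) for _, customer_id, _ in invoices]
```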
a digital dashboard would be used during which step of the impact cycle
track outcomes
co-occurrence grouping key word
transactions
support vector machines have ___ decision boundaries
two
two linked tables do not necessarily have
two foreign keys
profiling key word
typical
the simpler the model, the greater the chances of
underfitting the model
step 1: identify the questions
understand the business problems that need to be addressed
every column in a table must be
unique and relevant to the purpose of the table
unsupervised approach
used for data exploration looking for potential patterns
decision trees
used to divide data into smaller groups
data reduction is used to
filter results
step 6: track outcomes
follow up on the results of the analysis
link rows in one table to rows in another table
foreign key
vlookup: lookup_value
foreign key you wish to look up; single cell reference
velocity refers to
frequency
profiling relies on
gathering summary statistics and identifying outliers
visual example of classification
graph with different symbols and colors
visual example of clustering
graph with dots separated by color
clustering is used to identify
groups of similar data elements and the underlying drivers of those groups
three types of columns
primary keys, foreign keys and descriptive attributes
logical data model
abstract representation of a database's contents
each process-specific schema is a piece of
a greater whole, combining to form one integrated database
classification key words
classes, categories
SQL allows you to
extract only a portion of the data
supervised data mining
used when you are trying to predict a future outcome based on historical data; a form of data mining in which data miners develop a model prior to the analysis and apply statistical techniques to data to estimate values of the parameters of the model
linear classifiers
useful for ranking items rather than simply predicting the class probability
the purpose of transforming the data is to
validate for completeness and integrity
3 Vs of big data
volume, velocity and variety
join clause
way to extract data from more than one table in SQL
for each attribute, we learn:
what kind of key it is, what data is required, what data can be stored in it, how much data is stored
post-pruning occurs
when model is completed; evaluates completed model and discards branches after the fact
unsupervised data mining
when you don't have a specific question; a form of data mining where the analysts do not create a model or hypothesis before running the analysis. instead, they apply the data mining technique to the data and observe the results. with this method, analysts create hypotheses after the analysis to explain the patterns found.
physical views of database
where the data is physically arranged and sorted
In most cases, you need to know
which tables and attributes contain the relevant data
can a foreign key be null
yes
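A small check of this rule, using Python's built-in sqlite3 module: a NULL foreign key is accepted, while a value that points at no existing primary key is rejected. The table and column names are hypothetical, and SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, so other database systems may behave differently by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE invoice ("
    " invoice_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customer(customer_id))"
)

conn.execute("INSERT INTO customer VALUES (1, 'Acme Co')")
conn.execute("INSERT INTO invoice VALUES (10, 1)")     # points at an existing customer
conn.execute("INSERT INTO invoice VALUES (11, NULL)")  # NULL foreign key is allowed

try:
    conn.execute("INSERT INTO invoice VALUES (12, 99)")  # no customer 99 exists
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

invoice_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
```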
visual example of profiling
z score
order of SQL code to create a query
1) Select * 2) From (table) 3) Inner Join (other table) 4) On
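The clause order can be tried out with Python's built-in sqlite3 module. The two tables and their rows below are hypothetical, and the ORDER BY line is an addition that only keeps the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Acme Co'), (2, 'Birch LLC');
    INSERT INTO invoice VALUES (10, 1, 100.0), (11, 1, 40.0), (12, 2, 75.0);
""")

# SELECT, then FROM, then INNER JOIN, then ON (the shared key fields).
query = """
    SELECT *
    FROM invoice
    INNER JOIN customer
    ON invoice.customer_id = customer.customer_id
    ORDER BY invoice.invoice_id
"""
rows = conn.execute(query).fetchall()
```

Each result row carries the invoice columns followed by the matching customer columns.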
5 steps of ETL
1) determine purpose/scope of data request 2) obtain the data 3) validate data for completeness and integrity 4) clean the data 5) load the data
Extraction
1) determine the purpose and scope of the data request 2) obtain the data
skills that analytic-minded accountant should possess
1) develop an analytic mindset 2) data scrubbing and data preparation 3) data quality 4) descriptive data analysis 5) data analysis through data manipulation 6) define and address problems through statistical analysis 7) data visualization and data reporting
5 steps of requesting data
1. determine purpose and scope of data request 2. obtain data 3. validate data 4. clean the data 5. load the data
a transaction with a z score of ___ or above would represent abnormal transactions
3
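A minimal sketch of that z-score rule, assuming a hypothetical list of transaction amounts and the population standard deviation. Using the absolute value, so that unusually low amounts are flagged too, is a choice made here and is not stated on the card.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts: twenty ordinary ones and one large outlier.
amounts = [95.0] * 10 + [105.0] * 10 + [500.0]

mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation

# z score = how many standard deviations a value sits from the mean.
z_scores = [(x - mu) / sigma for x in amounts]

# Flag anything 3 or more standard deviations away from the mean.
flagged = [x for x, z in zip(amounts, z_scores) if abs(z) >= 3]
```

With very small samples no value can reach a z score of 3, so the rule needs a reasonable number of transactions to be useful.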
composite keys
A composite key is a combination of two or more foreign keys in a table to create a Primary Key; the use of more than one column of data to uniquely identify each row in a relational database table.
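As an illustration, a hypothetical invoice_line table can use the pair (invoice_id, product_id) as its primary key. The sketch below uses Python's sqlite3 module. Neither column is unique on its own, but the combination must be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice_line (
        invoice_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        quantity   INTEGER,
        PRIMARY KEY (invoice_id, product_id)  -- composite key
    )
""")

conn.execute("INSERT INTO invoice_line VALUES (10, 501, 2)")
conn.execute("INSERT INTO invoice_line VALUES (10, 502, 1)")  # same invoice, new product

try:
    # Repeats the (10, 501) pair, so the composite key rejects it.
    conn.execute("INSERT INTO invoice_line VALUES (10, 501, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM invoice_line").fetchone()[0]
```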
IMPACT Cycle
Identify questions, Master the data, Perform the test, Address and refine results, Communicate insights, Track outcomes
an example of classification
Of all the loans a bank has offered, which are most likely to default? Which loan applications are expected to be approved?
if you have direct access to a data warehouse, you can use
SQL and other tools to pull the data yourself
SQL example in excel
VLookup
questions for determining the purpose and scope of the data request
What is the purpose of the data request? what do you need the data to solve? what business problems will it address? what risks exist in data integrity? what is the mitigation plan? what other information will impact the nature, timing and extent of the analysis?
similarity matching
an attempt to identify similar individuals based on data known about them
foreign keys
a column or group of columns used to represent relationships. values of the foreign key are attributes that point to a primary key in another table
The primary key in the second table is
a combination of primary keys in the first table
the primary key in the first table is
a foreign key in the second table
use SQL to
combine data from one or more tables and organize it in a way that is more intuitive than how it is stored in a relational database
integration
combining databases
clustering
an attempt to divide individuals into groups (clusters) in a useful or meaningful way; identifying groups of similar data elements and the underlying drivers of those groups
regression
an attempt to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model; predict specific values
class examples
accept/reject, fraud/not fraud
descriptive attributes provide
actual business information
asking colleagues what they think of the analysis would be considered to be a part of which stage of the IMPACT cycle
address and refine results
slicing/dicing the data, finding correlations, revising and rerunning the analysis are part of which stage of impact cycle
address and refine results
classification
an attempt to assign each unit (or individual) we know very little about in a population into a few categories
profiling
an attempt to characterize the "typical" behavior of an individual, group or population by generating summary statistics about the data (mean, standard deviations, etc) so that we can more easily identify abnormal behavior (anomalies)
co-occurrence grouping
an attempt to discover associations between individuals based on transactions involving them
link prediction
an attempt to predict relationships between two data items; social media
target
an expected attribute or value that we want to evaluate
examples of profiling
analyzing travel and entertainment expenses; comparing variances from target ranges; Benford's law
accountants should be able to
articulate business problems, communicate with data scientists, draw conclusions, present results, develop an analytical mindset
profiling is typically used to
assess data quality and internal controls
test data is used to
assess the degree and strength of a relationship
data reduction
attempts to reduce the amount of information that needs to be considered to focus on the most critical items by taking a large set of data and reducing it with a smaller set that has the vast majority of the critical information from the larger set
step 5: communicate insights
communicate effectively using clear language and visualizations
after revising and rerunning the analysis, what comes next in the IMPACT cycle
communicate insights
one of the biggest differences between a flat file and a relational database is
how many tables there are; relational databases have multiple tables
validating data for completeness and integrity
compare the number of records and descriptive statistics for numeric fields; validate date and time fields; compare string limits
logical view of database
how the data is conceptually organized/understood
how to ensure data is valid for completeness and integrity
compare the number of records, compare descriptive stats, validate date/time fields, compare string limits for text fields
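These four checks might look roughly like this in Python. The function name, the field names, the ISO date format, and the control totals are all assumptions made for the example. In practice the expected count and total would come from the source system.

```python
from datetime import datetime

def validate_extract(rows, expected_count, expected_total, max_name_len):
    """Return a list of problems found; an empty list means every check passed."""
    issues = []
    # 1) compare the number of records
    if len(rows) != expected_count:
        issues.append("record count mismatch")
    # 2) compare descriptive statistics (here: the sum of a numeric field)
    if abs(sum(r["amount"] for r in rows) - expected_total) > 0.005:
        issues.append("amount total mismatch")
    for r in rows:
        # 3) validate date fields (assumed format: YYYY-MM-DD)
        try:
            datetime.strptime(r["date"], "%Y-%m-%d")
        except ValueError:
            issues.append(f"bad date: {r['date']}")
        # 4) compare string limits for text fields
        if len(r["name"]) > max_name_len:
            issues.append(f"name too long: {r['name']}")
    return issues

# Hypothetical extract: one record is missing and one date is impossible.
rows = [
    {"name": "Acme Co", "date": "2023-01-15", "amount": 100.0},
    {"name": "Birch LLC", "date": "2023-02-30", "amount": 50.0},
]
issues = validate_extract(rows, expected_count=3, expected_total=150.0, max_name_len=20)
```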
when evaluating classifiers, need to strike a balance between
complexity of the model and accuracy of the classification
how does data analytics affect financial reporting
better estimates of collectability and write downs, better understanding of business environment through social media, identifying risks and opportunities through analysis of internet searches
use of data warehouse in decision making
business intelligence
how do users retrieve data stored in a database
by executing a query
profiling regarding T&E expenses, which is not one of the areas that the analyst would try to uncover A) lack of controls B) change in procedures C) significant variances in standard cost D) individuals more willing to spend excessively
c
use a flowchart to
identify an appropriate approach
the purpose of extracting data is to
identify and obtain data from the appropriate source
the goal of ETL is to
identify and obtain the data needed for solving a problem
linear discriminants use mathematical equations to
draw the line that separates the two classes
pre-pruning occurs
during model generation
unsupervised approaches
clustering, profiling, co-occurrence grouping, data reduction
attempting to sell additional items by suggesting "customers who bought this also liked..." or "frequently bought together" is an example of which approach to data analytics
co-occurrence grouping
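One simple way to surface "frequently bought together" pairs is to count how often two items share a transaction. The baskets below are hypothetical, and pair counting is only the most basic form of this approach.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"printer", "ink"},
    {"printer", "ink", "paper"},
    {"paper", "stapler"},
]

# Count every unordered pair of items that appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
```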
relational databases and redundancy
each element of data is stored in only one place
clustering algorithms
calculate the minimum distance of all observations and group those elements
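The card does not name a specific algorithm. As one common example, here is a sketch of k-means on one-dimensional data, where each observation joins the nearest cluster center and the centers are then recomputed. The amounts and the starting centers are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for 1-D data; returns (centroids, clusters)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each observation to the center at minimum distance.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical purchase amounts with two obvious groups.
amounts = [10, 12, 11, 95, 100, 98]
centroids, clusters = kmeans_1d(amounts, centroids=[10.0, 100.0])
```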
data dictionary
centralized repository of descriptions for all of the data attributes of a data set; contains information about the structure of the database
Which transactions should a credit card company flag as potentially fraudulent and deny payment? is an example of which data analytics approach
classification
skill not emphasized that analytic-minded accountants should have
classification of test approaches; data and systems analysis and design
supervised approaches
classification, regression, similarity matching, link prediction, causal modeling
segmenting a customer into a small number of groups for additional analysis and marketing activities is an example of which approach to data analytics
clustering
vlookup: range_lookup
either FALSE or TRUE; false indicates you want an exact match
how does data analytics affect auditing
enhance audit quality, expand services, add value to clients and allow auditors to stay engaged beyond the audit
in a well structured relational database
every table should be related to at least one other table; every column in a row must be single valued
first assumption in normalization approach
everything initially stored in one large table
training data
existing data that has been manually evaluated and assigned to a class
test data
existing data used to evaluate the model; data that exists (for example, in a database) before a test is executed, and that affects or is affected by the component or system under test.
T/F: a data dictionary will be more robust and have more attributes to keep track of for a dataset stored as a flat file
false
data reduction key word
filter
______________ is the metadata that describes each attribute in a database.
data dictionary
storing data in a normalized relational database ensures that
data is complete, not redundant and that business rules are enforced; aids in communication and integration across business processes
not a benefit of using a normalized relational database
data is stored in one place
asking accountant to identify customers who might be candidates
data mining
profiling is used to assess
data quality and internal controls
when a manager wants to gather info about employees, use which language
data query language
data analytics that suggests new ways to highlight which transactions do not need the same level of vetting as the other transactions is an example of which approach to data analytics
data reduction
structured data
data that is organized and resides in a fixed field with a record or file
big data refers to
datasets that are too large and complex to be analyzed traditionally
not a benefit of database approach
decentralized management of data
linear classifiers identify
decision boundaries; ranks
formula for regression
dependent variable = f(independent variables)
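For the simplest case, f is a straight line: dependent = intercept + slope × independent. The sketch below fits that line by ordinary least squares. The A/R balances and write-offs are hypothetical and deliberately tidy (write-offs are exactly 5% of the balance), so the result is easy to check by hand.

```python
# Hypothetical history: A/R balance (independent) and bad-debt write-offs (dependent).
ar_balance = [100.0, 200.0, 300.0, 400.0]
write_offs = [5.0, 10.0, 15.0, 20.0]

n = len(ar_balance)
mean_x = sum(ar_balance) / n
mean_y = sum(write_offs) / n

# Ordinary least squares for one independent variable.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ar_balance, write_offs))
s_xx = sum((x - mean_x) ** 2 for x in ar_balance)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(balance):
    """Estimated write-offs for a given A/R balance."""
    return intercept + slope * balance

estimate = predict(500.0)
```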
primary keys are rarely
descriptive
critical data but not necessary to build the data model
descriptive attributes
after you have identified the objects/activity you want to profile, what should you do next?
determine the types of profiling you want to perform
linear classifiers are useful for
determining the really important values
ETL-- what other info will impact data analysis
determining the scope/purpose of data request
variety refers to
different types
support vector machine
discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin and then to find the middle line
steps of classification
identify the classes you wish to predict, manually classify an existing set of records, select a set of classification models, divide data into training and testing sets, generate model, interpret results and select the best model
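The later steps (split the data, generate a model, evaluate it) can be sketched with a deliberately tiny model: a single decision boundary on one variable. The loan records, the debt-ratio feature, and the threshold rule are all hypothetical. Real classifiers such as decision trees or support vector machines are more involved.

```python
def accuracy(records, threshold):
    """Share of records where 'default if ratio > threshold' matches the label."""
    hits = sum((ratio > threshold) == (label == "default") for ratio, label in records)
    return hits / len(records)

def fit_threshold(records):
    """Try midpoints between neighbouring ratios; keep the most accurate one."""
    ratios = sorted(ratio for ratio, _ in records)
    candidates = [(a + b) / 2 for a, b in zip(ratios, ratios[1:])]
    return max(candidates, key=lambda t: accuracy(records, t))

# Hypothetical, manually classified loans: (debt ratio, class).
training = [
    (0.1, "repay"), (0.2, "repay"), (0.3, "repay"),
    (0.6, "default"), (0.7, "default"), (0.8, "default"),
]
testing = [(0.15, "repay"), (0.75, "default"), (0.5, "default")]

threshold = fit_threshold(training)            # model generated from training data
test_accuracy = accuracy(testing, threshold)   # evaluated on data it has not seen
```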
data dictionaries help analysts
identify the data they need to use
steps of regression
identify variables that might predict an outcome, determine the functional form of the relationship, identify the parameters of the model
steps of profiling
identify what you want to profile, the type of profiling you want to perform, set boundaries/thresholds, interpret results, follow up on exceptions
the ETL process begins with
identifying what data you need
relational databases should be designed to support business processes which results in
improved communication across functional areas and more integrated business processes
independent variables
inputs; x axis
examples of targets
interest rate, fraud score
relational databases ensure that data:
is complete, is not redundant, follows business rules and internal controls, and aids communication
descriptive attribute
it is an attribute that is used to describe or record information about the 'relationship'; includes everything else
why are relational databases preferred
its ability to store and maintain data integrity; "one version of the truth" across multiple data elements
step 2: master the data
know what data is available and how it relates to the problems
similarity matching key words
known data, similar
visual example of regression
line of best fit on graph
suggesting friends to add on social media based on mutual friends is an example of which approach to data analytics
link prediction
examples of data reduction
locating payments made to specific vendors, using XBRL to filter specific tags
data dictionaries help administrators
maintain databases
class
manually assigned category applied to a record based on an event
after you have identified classes you wish to predict, what is the next step
manually classify an existing set of records
decision boundaries
mark the split between one class and another
can primary keys be null
no
process of first developing a relational database and then breaking the table down into smaller tables
normalization
ETL-- where is data located in systems
obtain the data
data warehouse data storage
often fed by a variety of sources, and data is analyzed centrally
Unified Modeling Language (UML) is
one way to understand databases
the primary key is typically made of ___ column
one, but it can occasionally be made of multiple columns
problem to normalization
only having one primary key
dependent variable
output; y axis
the more complex the model, the greater the chance of
overfitting the model
not found in a data dictionary
physical location of data
the goal of classification is to
predict whether an individual we know very little about will belong to one class or another
regression key words
predict, numerical values
causal modeling
predicting an outcome by identifying its relationship with one or more other factors; independent variables cause or are associated with dependent variables
examples of regression
predicting employee turnover; determining the appropriateness of allowance accounts
a data warehouse
primarily used for analysis rather than transaction processing
structured data is readily
searchable
by 2020, 1.7 megabytes of new information will be created every
second
clustering key words
segments, similar
step 3: perform the test plan
select an appropriate model to find a target variable
Attempting to identify seller and customer fraud based on various characteristics known about them to see if they are similar to known fraud cases is an example of which data analytics approach
similarity matching
benford's law states that in many naturally occurring collections of numbers, the significant leading digit is likely to be
small
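More precisely, Benford's law gives the expected share of leading digit d as log10(1 + 1/d), so 1 leads about 30% of the time and 9 under 5%. The sketch below computes those expected shares and counts leading digits in a hypothetical list of invoice amounts. The sample is far too small for a real test and only shows the mechanics.

```python
import math
from collections import Counter

# Expected share of each leading digit d under Benford's law.
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(amount):
    """First non-zero digit of a non-zero number, e.g. 0.042 -> 4, 1830 -> 1."""
    digits = f"{abs(amount):.10e}"  # scientific notation, e.g. '4.2000000000e-02'
    return int(digits[0])

# Hypothetical invoice amounts, invented only to show the counting step.
amounts = [1830, 1200, 245.10, 19.99, 3100, 0.042, 118, 902]
observed = Counter(leading_digit(a) for a in amounts)
```

Comparing `observed` shares with `expected` is what a profiling test based on Benford's law does at scale.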
flat file
stores data in one place as opposed to multiple tables, such as a relational database
profiling is done primarily using
structured data; data that is readily available
SQL
structured query language; used to create, update and delete records and tables in databases, extract data, select precise attributes and records that fit criteria of analysis goal
vlookup: table_array
table that contains the corresponding primary key; always looks in the first column
consider when obtaining data
tables that contain info you need (data dictionary/relationship model), identify which attributes hold the info you need, identify how the tables relate to each other
how does data analytics affect taxes
tax strategy and planning, understanding tax consequences of international transactions/investments/M&A, better organization of tax tables and other tax data
models associated with regression and classification do not have
test data
regression allows
the accountant to develop models to predict outcomes
the ETL process ends when
the clean data is loaded into the appropriate format into the tool to be used for analysis
primary keys
the column in a database that uniquely identifies each row; unique identifiers
ETL
the extract, transform and load process that is integral to mastering the data
the first argument in a vlookup is
the foreign key
business intelligence
the practice of monitoring customers, competitors and suppliers to better understand opportunities and threats
the model you should use depends on
the questions you are trying to answer
primary and foreign keys facilitate
the structure of a relational database
joins rely on
the structure of normalized relational databases that have tables related through primary and foreign keys
when you need to extract data from more than one table in a SQL query, what do you need to identify to properly join tables
the two fields that the tables have in common
schemas do not represent
their own separate databases