ACCT 3130 Midterm
An observation about the frequency of leading digits in many real-life sets of numerical data is called: leading digits hypothesis. Moore's law. Benford's law. clustering.
Benford's law.
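As a rough illustration of the card above: Benford's law is usually stated as P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The sketch below compares that expected share with the shares observed in a small list. The function names and the invoice amounts are hypothetical and exist only for this example.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(value):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the first significant digit at index 0.
    return int(f"{abs(value):.15e}"[0])

def observed_shares(values):
    """Share of nonzero values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical invoice amounts, for illustration only.
amounts = [120.50, 1840.00, 23.10, 310.75, 1999.99, 45.00, 102.30, 9.99]
observed = observed_shares(amounts)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

A sample this small will not track the expected shares closely; the comparison only becomes meaningful on large, naturally occurring data sets.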
While accountants don't need to become data scientists, they must know how to do the following except: Comprehend the process needed to clean and prepare the data before analysis Build a data repository Communicate with the data scientists about specific data needs and understand the underlying quality of the data Clearly articulate the business problem the company is facing
Build a data repository
__________ mark the split between one class and another. Decision trees Identified questions Decision boundaries Linear classifiers
Decision boundaries
What are attributes that exist in a relational database that are neither primary nor foreign keys? Nondescript attributes Descriptive attributes Composite key Relational table attributes
Descriptive attributes
Which of these is not included in the five steps of the ETL process? Determine the purpose and scope of the data request. Obtain the data. Validate the data for completeness and integrity. Scrub the data.
Scrub the data.
Data profiling is used to assess data quality and internal controls and typically involves the following steps except: Identify the objects or activity you want to profile. Set boundaries or thresholds for the activity. Filter the results. Determine the types of profiling you want to perform.
Filter the results.
IMPACT: Track Outcomes
Follow up on the results of the analysis
IMPACT Model
1. Identify the questions 2. Master the data 3. Perform test plan 4. Address and refine results 5. Communicate insights 6. Track outcomes
Which of the following best describes an independent variable? Application Output Input Operation
Input
Why is Supplier ID considered to be a primary key for a Supplier table? It contains a unique identifier for each supplier. It is a 10-digit number. It can either be for a vendor or miscellaneous provider. It is used to identify different supplier categories.
It contains a unique identifier for each supplier.
IMPACT: Master the Data
Know what data are available and how they relate to the problem *ETL
Which of these is not included in the five steps of the ETL process? Learn what data is available in the data warehouse. Determine the purpose and scope of the data request. Obtain the data. Validate the data for completeness and integrity.
Learn what data is available in the data warehouse.
Which approach to Data Analytics attempts to predict a relationship between two data items? Profiling Classification Link prediction Regression
Link prediction
Which approach to data analytics attempts to predict a relationship between two data items? Similarity matching Classification Link prediction Co-occurrence grouping
Link prediction
The advantages of storing data in a relational database include which of the following? A. Help in enforcing business rules B. Increased information redundancy C. Integrating business processes D. All of the above E. Only A and B F. Only B and C G. Only A and C
Only A and C
Which attribute is required to exist in each table of a relational database and serves as the "unique identifier" for each record in a table? Foreign key Unique identifier Primary key Key attribute
Primary key
Which approach to data analytics attempts to characterize the typical behavior of an individual, group or population by generating summary statistics about the data? Classification Profiling Link prediction Regression
Profiling
Which approach to Data Analytics attempts to identify similar individuals based on data known about them? Classification Regression Similarity matching Data reduction
Similarity matching
_________ is a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line. Linear classifier Support vector machine Decision tree Multiple regression
Support vector machine
___________ is a set of data used to assess the degree and strength of a predicted relationship. Training data Unstructured data Structured data Test data
Test data
fuzzy matching
locates approximate matches. Useful for identifying relationships in imperfect data.
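One way to see fuzzy matching in practice is Python's standard difflib module, which scores how similar two strings are. This is only a sketch: the vendor names and the 0.8 cutoff are assumptions chosen for the example, not values from the course material.

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical vendor master list.
vendors = ["Acme Supply Co.", "Baker Industrial", "Northwind Traders"]

# A misspelled name from an invoice still finds its closest master record.
match = get_close_matches("Acme Suply Co", vendors, n=1, cutoff=0.8)
print(match)
print(round(similarity("Acme Suply Co", "Acme Supply Co."), 2))
```

Raising the cutoff reduces false matches but misses more misspellings; lowering it does the opposite.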
decision boundaries
mark split between one class and another
overfitting
models that fit the training data too closely are actually worse at predicting future observations. *want to maximize accuracy on the test data without overfitting
Gold, silver, and bronze medals would be examples of: nominal data. ordinal data. structured data. test data.
ordinal data.
In general, the more complex the model, the greater the chance of: overfitting the data. underfitting the data. pruning the data. a more accurate prediction of the data.
overfitting the data.
How does data analytics affect financial reporting?
-better estimates of collectability -better understand business environment -identify risks and opportunities
Patterns discovered from ________ enable businesses to identify opportunities and risks and better plan for ________. A. past archives; today B. current data; the future C. current data; today D. past archives; the future
past archives; the future
Big Data is often described by the three Vs, or volume, velocity, and variability. volume, velocity, and variety. volume, volatility, and variability. variability, velocity, and variety.
volume, velocity, and variety.
Accountants need to be able to:
•Articulate business problems. •Communicate with data scientists. •Draw appropriate conclusions. •Present results in an accessible manner. •Develop an analytics mindset.
Accountants need to be comfortable with:
•Data scrubbing and data preparation •Data quality •Descriptive data analysis •Data analysis through data manipulation •Define and address problems through statistical analysis •Data visualization and data reporting
How does data analytics affect tax?
-better tax planning strategies -understand tax consequences for international transactions, investments, mergers, etc. -aid compliance
How does data analytics affect auditing?
-enhances audit quality -expanded services -added value to clients -auditors engaged beyond audit
relational database
-ensures data are complete -avoids redundant data -enforces business rules -aids communication and integration of business processes
5 Steps to Requesting data
1. Determine the purpose and scope of the data request 2. Obtain the data (yourself or through IT dept) 3. Validate the data for completeness and integrity 4. Clean the data 5. Load the data for data analysis
make sure data is valid
1.Compare the number of records 2.Compare descriptive statistics for numeric fields 3.Validate Date/Time fields 4.Compare string limits for text fields
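The four validation checks above can be sketched as a single function. The field names (vendor, amount, date), the date format, and the control totals are assumptions made for this example; in practice they would come from the source system.

```python
from datetime import datetime

def validate_extract(rows, expected_count, expected_total, max_text_len):
    """Return a list of problems found; an empty list means all checks passed."""
    issues = []
    # 1. Compare the number of records with the source system's count.
    if len(rows) != expected_count:
        issues.append(f"record count {len(rows)} != {expected_count}")
    # 2. Compare a descriptive statistic (here, the total) for a numeric field.
    total = round(sum(r["amount"] for r in rows), 2)
    if total != round(expected_total, 2):
        issues.append(f"amount total {total} != {expected_total}")
    for r in rows:
        # 3. Validate date fields by parsing them.
        try:
            datetime.strptime(r["date"], "%Y-%m-%d")
        except ValueError:
            issues.append(f"invalid date: {r['date']}")
        # 4. Compare text length with the field's limit.
        if len(r["vendor"]) > max_text_len:
            issues.append(f"vendor too long: {r['vendor']}")
    return issues

# Hypothetical extract; the second row carries an impossible date.
rows = [
    {"vendor": "Acme Supply", "amount": 100.00, "date": "2024-01-15"},
    {"vendor": "Baker Industrial", "amount": 250.50, "date": "2024-02-30"},
    {"vendor": "Northwind", "amount": 75.25, "date": "2024-03-01"},
]
issues = validate_extract(rows, expected_count=3, expected_total=425.75, max_text_len=20)
print(issues)
```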
Steps of Classification
1.Identify the classes you wish to predict. 2.Manually classify an existing set of records. 3.Select a set of classification models. 4.Divide your data into training and testing sets. 5.Generate your model. 6.Interpret the results and select the "best" model.
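The classification steps above can be sketched with a deliberately simple model. A nearest-centroid classifier is used here only because it is short; it is not a model named in these cards. The features (days past due, balance in thousands), the class labels, and every record are hypothetical.

```python
import math

def centroid(points):
    """Average position of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(records):
    """Generate the model: one centroid per manually assigned class."""
    by_class = {}
    for features, label in records:
        by_class.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_class.items()}

def predict(model, features):
    """Assign the class whose centroid is closest to the new observation."""
    return min(model, key=lambda label: math.dist(model[label], features))

def accuracy(model, records):
    """Share of records whose predicted class matches the manual class."""
    hits = sum(predict(model, features) == label for features, label in records)
    return hits / len(records)

# Manually classified records, already divided into training and testing sets.
training = [
    ((5, 1.0), "collect"), ((10, 2.0), "collect"), ((8, 1.5), "collect"),
    ((90, 8.0), "write_off"), ((120, 9.5), "write_off"), ((100, 7.0), "write_off"),
]
testing = [((12, 1.2), "collect"), ((95, 8.5), "write_off")]

model = train(training)
print(accuracy(model, testing))
```

Selecting the "best" model would mean repeating this with several candidate models and comparing their accuracy on the same testing set.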
Profiling steps
1.Identify the objects or activity you want to profile. 2.Determine the types of profiling you want to perform. 3.Set boundaries or thresholds for the activity. 4.Interpret the results and monitor the activity and/or generate a list of exceptions. 5.Follow up on exceptions.
Steps of regression
1.Identify the variables that might predict an outcome. 2.Determine the functional form of the relationship. 3.Identify the parameters of the model.
Common corrections when cleaning data
1.Remove headings or subtotals 2.Clean leading zeroes and nonprintable characters 3.Format negative numbers 4.Correct inconsistencies across data, in general
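The common corrections above might look like this in code. Everything here is an assumption for illustration: the column layout, the helper names, and the idea that leading zeroes in the invoice field are padding rather than meaningful digits.

```python
def is_heading_or_subtotal(row):
    """True for repeated header rows and subtotal/total rows in a report export."""
    first = row[0].strip().lower()
    return first == "vendor" or first.startswith(("subtotal", "total"))

def clean_text(text):
    """Drop nonprintable characters, then surrounding whitespace."""
    return "".join(ch for ch in text if ch.isprintable()).strip()

def strip_leading_zeroes(text):
    """Remove padding zeroes (assumes the zeroes carry no meaning)."""
    return text.lstrip("0") or "0"

def parse_amount(text):
    """Turn '(1,234.50)' or '1,234.50-' into a negative float."""
    s = text.strip().replace(",", "").replace("$", "")
    negative = (s.startswith("(") and s.endswith(")")) or s.startswith("-") or s.endswith("-")
    value = float(s.strip("()-"))
    return -value if negative else value

# Hypothetical report export with a header row, a subtotal row, and messy values.
raw_rows = [
    ["Vendor", "Invoice", "Amount"],
    ["Acme\x00 Supply", "000123", "(1,234.50)"],
    ["Subtotal", "", "(1,234.50)"],
    ["Baker Industrial", "000456", "250.00-"],
]
cleaned = [
    [clean_text(vendor), strip_leading_zeroes(invoice), parse_amount(amount)]
    for vendor, invoice, amount in (r for r in raw_rows if not is_heading_or_subtotal(r))
]
print(cleaned)
```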
Which approach to Data Analytics attempts to assign each unit in a population into a small set of classes (or groups) where the unit best fits? Regression Similarity matching Co-occurrence grouping Classification
Classification
Which approach to data analytics attempts to assign each unit in a population into a small set of classes where the unit belongs? Classification Regression Similarity matching Co-occurrence grouping
Classification
Which skills were not emphasized that analytic-minded accountants should have? Develop an analytics mindset Data scrubbing and data preparation Classification of test approaches Define and address problems through statistical data analysis
Classification of test approaches
As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned after extraction and validation? Remove headings and subtotals. Format negative numbers. Clean up trailing zeroes. Correct inconsistencies across data.
Clean up trailing zeroes.
Correcting inconsistencies across data is an example of which of the following? Validating the data for integrity Validating the data for completeness Cleaning the data Obtaining the data
Cleaning the data
Retail stores often request customers' zip codes at the end of a sales transaction. This is an example of which data approach? Clustering Similarity matching Classification Regression
Clustering
IMPACT: Communicate Insights
Communicate effectively using clear language and visualizations
Which skills were not emphasized that analytic-minded accountants should have? Data quality Descriptive data analysis Data visualization Data and systems analysis and design
Data and systems analysis and design
The metadata that describes each attribute in a database is which of the following? Composite primary key Data dictionary Descriptive attributes Flat file
Data dictionary
Which of these terms is defined as being a central repository of descriptions for all of the data attributes of the dataset? Big Data Data warehouse Data dictionary Data Analytics
Data dictionary
The objective of data extraction is: To validate the data for completeness and integrity To identify and obtain the data from the appropriate source To identify which approach to data analytics should be used To load the data into the appropriate tool for analysis
To identify and obtain the data from the appropriate source
The objective of loading data is: To identify and obtain the data from the appropriate source To validate the data for completeness and integrity To identify which approach to data analytics should be used To load the data into the appropriate tool for analysis
To load the data into the appropriate tool for analysis
Which of the following best describes the purpose of relational databases? To ensure that business rules are enforced To support business processes across the organization To provide business information to data analysts To increase information redundancy in the organization
To support business processes across the organization
Data analytics is the process of evaluating data with the purpose of drawing conclusions to address business questions. (T/F)
True
IMPACT: Identify the questions
Understand the business problems that need to be addressed
UML
Unified Modeling Language: a way of visualizing the relationships between classes in a program.
foreign key
attributes that point to a primary key in another table
composite key
combination of two foreign keys used for line items
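The key types on these cards can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table and column names are hypothetical, and SQLite only enforces foreign keys when the PRAGMA shown below is turned on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE supplier (
    supplier_id   INTEGER PRIMARY KEY,   -- primary key: unique identifier
    supplier_name TEXT NOT NULL          -- descriptive attribute
);
CREATE TABLE purchase_order (
    po_id       INTEGER PRIMARY KEY,
    supplier_id INTEGER NOT NULL REFERENCES supplier(supplier_id),  -- foreign key
    po_date     TEXT
);
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    description TEXT
);
CREATE TABLE po_line (
    po_id      INTEGER NOT NULL REFERENCES purchase_order(po_id),
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER,
    PRIMARY KEY (po_id, product_id)      -- composite key for line items
);
""")

conn.execute("INSERT INTO supplier VALUES (1, 'Acme Supply')")
conn.execute("INSERT INTO purchase_order VALUES (100, 1, '2024-01-15')")
conn.execute("INSERT INTO product VALUES (7, 'Widget')")
conn.execute("INSERT INTO po_line VALUES (100, 7, 5)")

# A purchase order pointing at a supplier that does not exist is rejected.
try:
    conn.execute("INSERT INTO purchase_order VALUES (101, 99, '2024-01-16')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# A second line with the same (po_id, product_id) pair is rejected.
try:
    conn.execute("INSERT INTO po_line VALUES (100, 7, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(fk_rejected, duplicate_rejected)
```

The two rejected inserts show how keys help a relational database enforce business rules.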
The IMPACT cycle includes all except the following process: data preparation. communicate insights. address and refine results. perform test plan.
data preparation.
test data
existing data set aside to evaluate the model
Summary Statistics
describe a set of data in terms of their location (mean, median), range (standard deviation, minimum, maximum), shape (quartile), and size (count).
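A short sketch of the summary statistics listed above, using Python's standard statistics module. The order totals are hypothetical, and the standard deviation shown is the sample version.

```python
import statistics

def summarize(values):
    """Location, range, shape, and size statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }

# Hypothetical order totals.
orders = [10, 20, 30, 40, 50, 60, 70]
summary = summarize(orders)
for name, value in summary.items():
    print(name, value)
```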
Variety
different types
Support vector machine
discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.
Regression
estimates or predicts the numerical value of a dependent variable based on the slope and intercept of a line and the value of an independent variable.
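For a single independent variable, the slope and intercept can be estimated with ordinary least squares. The sketch below assumes a linear functional form; the advertising and sales figures are hypothetical and chosen so the fit is exact.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted value of the dependent variable for a given input."""
    return intercept + slope * x

# Hypothetical data: advertising spend (input) and sales (output).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept, predict(slope, intercept, 6))
```

Here the slope and intercept are the model's parameters, and ad spend is the independent variable used to predict sales.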
training data
existing data that have been manually evaluated and assigned a class
Mastering the data can also be described via the ETL process. The ETL process stands for: extract, total, and load data. enter, transform, and load data. extract, transform, and load data. enter, total, and load data.
extract, transform, and load data.
ETL
extraction, transformation, and loading
Velocity
frequency
Co-occurrence grouping
discovers associations between individuals based on common events, such as transactions they are involved in.
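A minimal sketch of co-occurrence grouping: count how often two items appear in the same transaction. The baskets below are hypothetical, and counting pairs is only one simple way to surface associations.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

# Hypothetical sales transactions.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
counts = pair_counts(transactions)
for pair, n in sorted(counts.items()):
    print(pair, n)
```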
Similarity matching
grouping technique used to identify similar individuals based on data known about them.
With a goal to give organizations the information they need to make sound and timely business decisions, data analytics often involves all of the following except: technologies. growth. databases. statistics.
growth.
profiling
identifies the "typical" behavior of an individual, group, or population by compiling summary statistics about the data (including mean, standard deviations, etc.) and comparing individuals to the population. Typically used with structured data. ex: z-score
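The z-score mentioned above measures how many standard deviations an individual sits from the population mean. In this sketch the expense figures, the employee IDs, and the threshold of 2 are all assumptions chosen for illustration.

```python
import statistics

def flag_exceptions(amounts, threshold=2.0):
    """Return {id: z-score} for individuals beyond the threshold."""
    values = list(amounts.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    flagged = {}
    for key, value in amounts.items():
        z = (value - mean) / stdev
        if abs(z) > threshold:
            flagged[key] = round(z, 2)
    return flagged

# Hypothetical monthly expense claims by employee.
claims = {
    "E1": 100, "E2": 110, "E3": 95, "E4": 105, "E5": 90,
    "E6": 100, "E7": 98, "E8": 102, "E9": 500,
}
flagged = flag_exceptions(claims)
print(flagged)
```

The flagged list is the set of exceptions to follow up on; the threshold is the boundary set in the profiling steps.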
Which of the following describes part of the goal of the ETL process: identify which approach to data analytics should be used. load the data into a relational database for storage. communicate the results and insights found through the analysis. identify and obtain the data needed for solving the problem.
identify and obtain the data needed for solving the problem.
clustering
identify groups (or clusters) of individuals (such as customers) that share common underlying characteristics—in other words, identifying groups of similar data elements and the underlying drivers of those groups.
IMPACT: Address and refine results
identify issues and possible issues with the analyses, then refine the model -ask further questions -explore the data -rerun analyses
Descriptive attributes
attributes that are neither primary nor foreign keys (everything else)
Machine learning and artificial intelligence
learning models or intelligent agents that adapt to new external data to recommend a course of action.
Which of the following best describes the goal of descriptive data analysis: demonstrate ability to sort, rearrange, merge, and reconfigure data in a manner that allows enhanced analysis comprehend the process needed to clean and prepare the data before analysis recognize what is meant by data quality, be it completeness, reliability or validity perform basic analysis to understand the quality of the underlying data and its ability to address the business question
perform basic analysis to understand the quality of the underlying data and its ability to address the business question
Classification
predicts a class or category for a new observation based on the manual identification of classes from previous observations.
Link prediction
predicts a relationship between two data items, such as members of a social media platform.
4 types of attributes
primary keys, foreign keys, composite keys, descriptive attributes
Diagnostic Analytics
procedures that explore the current data to determine why something has happened the way it has, typically comparing the data to a benchmark. ex: profiling, similarity matching, co-occurrence grouping, & clustering
Prescriptive Analytics
procedures that model data to enable recommendations for what should be done in the future. ex: decision support systems & Machine learning and artificial intelligence
Descriptive Analytics
procedures that summarize existing data to determine what has happened in the past. ex: summary stats & data reduction or filtering
Predictive Analytics
procedures used to generate a model that can be used to determine what is likely to happen in the future. ex: regression, link prediction, & classification
Data Analytics
process of evaluating data with the purpose of drawing conclusions to address business questions. provides a way to search through large structured and unstructured data to identify unknown patterns or relationships
Data reduction or filtering
reduces the number of observations to focus on relevant items (i.e., highest cost, highest risk, largest impact, etc.). It does this by taking a large set of data (perhaps the population) and reducing it to a smaller set that has the vast majority of the critical information of the larger set.
pruning
removes branches from decision tree to avoid overfitting the model
Decision support systems
rule-based systems that gather data and recommend actions based on the input.
By the year 2020, about 1.7 megabytes of new information will be created every: week. second. minute. day.
second.
IMPACT: Perform the test plan
select an appropriate model to find a target variable ex: classification, regression, similarity matching, clustering, co-occurrence grouping, profiling
Volume
size
Data that are organized and reside in a fixed field within a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms. The term matching this definition is: training data. unstructured data. structured data. test data.
structured data.
Models associated with regression and classification data approaches have all except this important part: identifying which variables (we'll call these independent variables) might help predict an outcome (we'll call this the dependent variable). the functional form of the relationship (linear, nonlinear, etc.). the numeric parameters of the model (detailing the relative weights of each of the variables associated with the prediction). test data.
test data.
Big Data
datasets that are too large and complex to be analyzed traditionally
The purpose of transforming data is: to validate the data for completeness and integrity. to load the data into the appropriate tool for analysis. to obtain the data from the appropriate source. to identify which data are necessary to complete the analysis.
to validate the data for completeness and integrity.
In general, the simpler the model, the greater the chance of: overfitting the data. underfitting the data. pruning the data. the need to reduce the amount of data considered.
underfitting the data.
primary key
unique identifier
Decision tree
used to divide data into smaller groups
Linear classifiers
useful for ranking items rather than simply predicting class probability. useful for determining the really important values, such as valuable customers, or which transactions are most likely fraudulent
The IMPACT cycle includes all except the following process: perform test plan. visualize the data. master the data. track outcomes.
visualize the data.