Data Analytics for Accounting - Midterm Set
Which skills were not emphasized that analytic-minded accountants should have? Multiple Choice A. Statistical data analysis competency B. Classification of test approaches C. Developed an analytics mindset D. Data scrubbing and data preparation
B. Classification of test approaches
In which areas were skills not emphasized for analytic-minded accountants? Multiple Choice A. Descriptive data analysis B. Data and systems analysis and design C. Data visualization and data reporting D. Data quality
B. Data and systems analysis and design
Which of the following is not a typical example of nominal data? A. Gender B. SAT scores C. Hair color D. Ethnic group
B. SAT scores
Match the desired visualization for quantitative data to the following chart types: 5. Data trends for net income over the past eight quarters
Bar Charts
An observation about the frequency of leading digits in many real-life sets of numerical data is called:
Benford's Law
Match the desired visualization for quantitative data to the following chart types: 1. Useful for showing quartiles, medians, and outliers
Box and whisker plots
As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned after extraction and validation? Multiple Choice A. Remove headings and subtotals. B. Format negative numbers. C. Clean up trailing zeroes. D. Correct inconsistencies across data.
C. Clean up trailing zeroes.
What data analytics test will predict which firms will go bankrupt and which firms will not go bankrupt?
Classification
Which approach to data analytics attempts to assign each unit in a population into a small set of classes where the unit belongs?
Classification
What data analytics test will segment all of the company's customers into groups that will allow further specific analysis?
Clustering
What data analytics test will identity customers who buy product X will be most likely to be also interested in product Y.
Co-occurrence Grouping
Which of the following describes part of the goal of the ETL process? Multiple Choice A. Identify which approach to data analytics should be used. B. Load the data into a relational database for storage. C. Communicate the results and insights found through the analysis. D. Identify and obtain the data needed for solving the problem.
D. Identify and obtain the data needed for solving the problem.
The IMPACT cycle specifically includes all except the following steps: Multiple Choice A. address and refine results. B. perform test plan. C. communicate insights. D. data preparation.
D. data preparation.
Models associated with regression and classification data approaches have all have these important parts except: A. identifying which variables (we'll call these independent variables) might help predict an outcome (we'll call this the dependent variable). B. the functional form of the relationship (linear, nonlinear, etc.). C. the numeric parameters of the model (detailing the relative weights of each of the variables associated with the prediction). D. test data.
D. test data.
The IMPACT cycle includes all except the following steps: Multiple Choice A. perform test plan. B. master the data. C. track outcomes. D. visualize the data.
D. visualize the data.
What data analytics test will use stratified sampling to focus audit effort on transactions with greatest risk.?
Data Reduction
Why is identifying the question such a critical first step in the IMPACT process cycle?
Data analysis is most effective when a question is identified that needs to be addressed. That will focus the analysis on which data and which test method might be most effective in addressing or answering the question.
The metadata that describes each attribute in a database is called:
Data dictionary
3. Summary statistics
Descriptive analytics
4. What were the total taxes paid in the past 5 years?
Descriptive analytics
8. Which product sold the most last month?
Descriptive analytics
9. Data reduction or filtering
Descriptive analytics
What are attributes that exist in a relational database that are neither primary nor foreign keys?
Descriptive attributes
1. Clustering
Diagnostic analytics
10. Profiling
Diagnostic analytics
2. What was the price and quantity variance associated with the production of chicken at Tyson?
Diagnostic analytics
6. Co-occurrence grouping
Diagnostic analytics
7. Our refunds seem to be high. Are they fraudulent?
Diagnostic analytics
8. Similarity matching
Diagnostic analytics
Identify the behavior, error, or fraudulent scheme that could be detected when you apply Benford's Law to the following accounts: Vendor payments
Duplicate checks
Identify the behavior, error, or fraudulent scheme that could be detected when you apply Benford's Law to the following accounts: Travel and entertainment
Expense approval circumvention
Identify the behavior, error, or fraudulent scheme that could be detected when you apply Benford's Law to the following accounts: Sales Records
Ficitious sales transactions
Match the desired visualization for quantitative data to the following chart types: 3. Distribution of sales across states or countries
Filled geographic maps
Even though it is preferable to store data in a relational database, storing data across separate tables can make data analysis cumbersome. Describe three reasons it is worth the trouble to store data in a relational database.
Flat files make maintenance cumbersome due to redundancy, inaccuracies, and completeness. In a relational database, there is only one table where you would update the supplier's address. This ensures that all staff have the same accurate information, without having to type in the updated address in every table where the supplier's location is linked.
Among the advantages of using a relational database is enforcing business rules. Based on your understanding of how the structure of a relational database helps prevent data redundancy and other advantages, how does the primary key/foreign key relationship structure help enforce a business rule that indicates that a company shouldn't process any purchase orders from suppliers who don't exist in the database?
For purchase orders, the primary key would be the PO number- while one of the foreign keys required would be a supplier ID. The supplier ID is a foreign key in the PO table, which is a primary key in the supplier table. A relational database is able to enforce supplier registration, by making the supplier ID a required data point. If the supplier has not been registered, the individual setting up the PO will be unable to move forward with issuing a PO- since the ID is required.
Why is Supplier ID considered to be a primary key for a Supplier table?
It contains a unique identifier for each supplier.
Match the desired visualization for quantitative data to the following chart types: 6. Data trends for stock price over the past five years
Line charts
What data analytics test will look for relationships between related parties that are not otherwise disclosed.?
Link Prediction
Which approach to data analytics attempts to predict a relationship between two data items?
Link prediction
Identify the behavior, error, or fraudulent scheme that could be detected when you apply Benford's Law to the following accounts: Purchases
Potential kickback schemes
Identify the behavior, error, or fraudulent scheme that could be detected when you apply Benford's Law to the following accounts: Sales returns
Potential kickback schemes
1. What are the expected stock returns to our investment in Facebook stock?
Predictive analytics
11. Regression
Predictive analytics
2. Classification
Predictive analytics
3. What are the cash needs and projections over the next 3 months?
Predictive analytics
5. Link prediction
Predictive analytics
6. Should we ship by truck, rail, or air given the expected increase in fuel expenses?
Prescriptive
4. Decision support systems
Prescriptive analytics
5. If we expect our Asian sales to increase, where should we produce them?
Prescriptive analytics
7. Machine learning and artificial intelligence
Prescriptive analytics
Which attribute is required to exist in each table of a relational database and serves as the "unique identifier" for each record in a table?
Primary key
What data analytics test will Work to understand normal behavior, to then be able to identify abnormal behavior (such as fraud)?
Profilling
What data analytics test will predict the relationship between an investment in advertising expenditures and subsequent operating income?
Regression
Match the desired visualization for quantitative data to the following chart types: 2. Correlation between two variables
Scatter plots
Match the desired visualization for quantitative data to the following chart types: 4. Visualize the line of best fit
Scatter plots
What data analytics test will predict which new customers resemble the company's best customers?
Similarity Matching
Box and whisker plots (or box plots) are particularly adept at showing extreme observations and outliers. In what situations would it be important to communicate these data to a reader? Any particular accounts on the balance sheet or income statement?
Since Box Plots are useful for outlier detection, they are useful for visualizing exceptions to internal controls. Plotting transaction accounts (such as purchase card transactions) allow auditors to quickly observe whether there is unusual user activity and allow them to focus their effort on those high risk transactions.
The Big Four accounting firms (Deloitte, EY, KPMG, and PwC) dominate the audit and tax market in the United States. What chart would you use to show which accounting firm dominates in each state in terms of audit revenues? 1. Area Chart 2. Line Chart 3. Column Chart 4. Histogram 5. Bubble Chart 6. Stacked Column Chart 7. Stacked Area Chart 8. Pie Chart 9. Waterfall Chart 10. Symbol Chart
Since we are comparing relative sizes for each state the following charts would be effective: pie chart, stacked column chart, bubble map, or symbol map. The other choices are better choices for numerical data depicting trends. 5. Bubble Chart 6. Stacked Column Chart 8. Pie Chart 10. Symbol Chart
These data are organized and reside in a fixed field with a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms. The term matching this definition is:
Structured Data
is a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.
Support vector machine
is a set of data used to assess the degree and strength of a predicted relationship.
Test Data
In the ETL process, if the analyst does not have the security permissions to access the data directly, then he or she will need to fill out a data request form. While this doesn't necessarily require the analyst to know extraction techniques, why does the analyst still need to understand the raw data very well in order to complete the data request?
The analyst needs to understand the data very well in order to request the correct and complete data needed for the extraction. Requesting data can be a time-intensive and iterative process: the more prepared an analyst is in understanding the raw data, the more they can save on time and labor costs of the IT team.
What is the purpose of a data dictionary? Identify four different attributes that could be stored in a data dictionary, and describe the purpose of each.
The data dictionary is a centralized repository of descriptions for all of the data attributes of the data set 1. variable name 2. brief description 3. whether the field is made up of numbers or text or alphanumerics 4. the size 5. whether it serves as a primary or foreign key 6. notes, etc.
Based on the data from datavizcatalogue.com, what are some major flaws of using word clouds to communicate the frequency of words in a document?
The main flaws of word clouds include: long words are emphasised over short words and words whose letters contain many ascenders and descenders may receive more attention.
To address the question "Will I receive a loan from LendingClub?" we had available data to assess the relationship among (1) the debt-to-income ratios and number of rejected loans, (2) the length of employment and number of rejected loans, and (3) the credit (or risk) score and number of rejected loans. What additional data would you recommend to further assess whether a loan would be offered? Why would they be helpful?
There are many other potential predictors of whether the LendingClub would pay a loan. Here are a few possibilities: What other debt do they have? How much is their disposable income? Do they have a clean criminal record? Have they had a loan with LendingClub before and did they repay it? Do they rent or own their house?
What is the difference between training datasets and test (or testing) datasets?
Training datasets are used to help teach the algorithm certain classifying skillsets and can be introduced first in the process, while testing data must be kept separate in order to properly access the applicability of the algorithm's modeling.
Match the data examples to one of the following data types: (Interval Data, Ratio Data, Ordinal Data, Nominal Data, Structured Data, Unstructured Data) Data Examples: 1. GMAT Score 2. Total Sales 3. Blue Ribbon, Yellow Ribbon, Red Ribbon 4. Company Use of Cash Basis vs. Accrual Basis 5. Depreciation Method (Declining Balance, Straight-Line, etc.) 6. Management Discussion and Analysis 7. Income Statement 8. Inventory Method (FIFO, LIFO, etc.) 9. Blogs 10. Total Liabilities
1. GMAT Score = Interval data 2. Total Sales = Ratio data 3. Blue Ribbon, Yellow Ribbon, Red Ribbon = Ordinal data 4. Company Use of Cash Basis vs. Accrual Basis = Nominal data 5. Depreciation Method (Declining Balance, Straight-Line, etc.) = Nominal data 6. Management Discussion and Analysis = Unstructured data 7. Income Statement = Structured data 8. Inventory Method (FIFO, LIFO, etc.) =Nominal data 9.Blogs = Unstructured data 10. Total Liabilities = Ratio data
Identify the order sequence from least sophisticated (1) to most sophisticated data type (4). Interval Data Ordinal Data Nominal Data Ratio Data
1. Nominal Data 2. Ordinal Data 3. Interval Data 4. Ratio Data
What is the difference between a supervised and an unsupervised approach?
A supervised approach means that the analyst is starting with a business question to theory, and is selecting the relevant test area. Unsupervised approaches involve data exploration, for example, cluster mapping allows an analyst to look for potential patterns in the data.
"Customers" CustomerID (PK) FirstName LastName City State Phone_Number "Sales_Orders" Sales_Order_ID (PK) InventoryID Quantity_Sold Price CustomerID "Inventory" InventoryID (PK) Inventory_Description Price Write SQL queries for the following: A. The description and prices of all items with unit prices of at least $1,000. B. The average price of all inventory items. Rename the column in the output Avg_Price. C. The total number of orders for each state with more than one order. The aggregate should be renamed Total_Orders and the information should be presented alphabetically by state.
A. SELECT: Inventory_Description, Price FROM: Inventory WHERE: Price>=1000 B. SELECT: AVG (Price) AS Avg_Price FROM: Inventory C. SELECT: State, COUNT (Sales_Order_ID) AS Total_Orders FROM: Sales_Orders INNER JOIN: Customers ON: Sales_Orders.CustomerID=Customers.CustomerID GROUP BY: State HAVING: COUNT (Sales_Order_ID) > 1 ORDER BY: State ASC
In general, the more complex the model, the greater the chance of: Multiple Choice A. overfitting the data. B. underfitting the data. C. pruning the data. D. a more accurate prediction of the data.
A. overfitting the data.
The purpose of transforming data is: A. to validate the data for completeness and integrity. B. to load the data into the appropriate tool for analysis. C. to obtain the data from the appropriate source. D. to identify which data are necessary to complete the analysis.
A. to validate the data for completeness and integrity.
In the ETL process, one important step to process when transforming the data is to work with NULL, N/A, and zero values in the dataset. If you have a field of quantitative data (e.g., number of years each individual in the table has held a full-time job), what would be the effect of the following? 1. Transforming NULL and N/A values into blanks 2. Transforming NULL and N/A values into zeroes 3. Deleting records that have NULL and N/A values from your dataset
Transforming NULL and N/A values into blanks.The COUNT and AVERAGE functions would not include these fields in their computation for these variables. Transforming NULL and N/A values into zeroes.The COUNT and AVERAGE functions would incorporate these zeroes and would be included in their computation for these variables. It would have an impact particularly on the computation of the average since it would have the value of zero. Deleting records that have NULL and N/A values from your dataset.The COUNT and AVERAGE functions would not include these fields in their computation for these variables. If they are deleted all of the other fields and variables would be deleted as well, thus having a bigger impact on the overall dataset.
Based on the data from datavizcatalogue.com, a line graph is best at showing trends, relationships, compositions, or distributions?
Trends over time.
Identify the order sequence in the ETL process as part of mastering the data (i.e., 1 is first; 5 is last). a. Validate the data for completeness and integrity. selected b. Sanitize the data. c. Obtain the data. d. Load the data in preparation for data analysis e. Determine the purpose and scope of the data request.
e c a b d
Mastering the data can also be described via the ETL process. The ETL process stands for:
extract, transform, and load data
The Fahrenheit scale of temperature measurement would best be described as an example of:
interval data.