Midterm Study Guide
Steps in data reduction
1. Identify the attribute you would like to reduce or focus on. 2. Filter the results. 3. Interpret the results. 4. Follow up on results.
Steps in Classification
1. Identify the classes you wish to predict. 2. Manually classify an existing set of records. 3. Select a set of classification models. 4. Divide your data into training and testing sets. 5. Generate your model. 6. Interpret the results and select the "best" model.
Steps of profiling
1. Identify the objects or activity you want to profile. 2. Determine the types of profiling you want to perform. 3. Set boundaries or thresholds for the activity. 4. Interpret the results and monitor the activity and/or generate a list of exceptions. 5. Follow up on exceptions.
Steps in Regression
1. Identify the variables that might predict an outcome.
The IMPACT cycle includes all except the following process: a. Visualize the data. b. Identify the questions. c. Master the data. d. Track outcomes.
A
Classification
An attempt to assign each unit (or individual) in a population into a few categories.
Co-occurrence grouping
An attempt to discover associations between individuals based on transactions involving them
Clustering
An attempt to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way.
Regression
An attempt to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model.
Similarity matching
An attempt to identify similar individuals based on data known about them.
The IMPACT cycle includes all except the following process: a. Communicate insights. b. Data preparation. c. Address and refine results. d. Perform test plan.
B
Which of the following is not a typical example of nominal data? a. Gender b. SAT scores c. Hair color d. Ethnic group
B
When it comes to visually representing qualitative data, the charts most frequently considered for depicting qualitative data are:
Bar charts. Pie charts. Stacked bar chart.
The observation about the frequency of leading digits in many real-life sets of numerical data is called:
Benford's law.
Which skills were not emphasized that analytic-minded accountants should have? a. Develop an analytics mindset b. Data scrubbing and data preparation c. Classification of test approaches d. Define and address problems through statistical data analysis
C
Which approach to Data Analytics attempts to assign each unit in a population into a small set of classes where the unit belongs?
Classification
As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned after extraction and validation? a. Remove headings and subtotals. b. Format negative numbers. c. Clean up trailing zeroes. d. Correct inconsistencies across data.
Clean up trailing zeroes.
When evaluating classifiers, you need to be careful to strike a balance between what two things?
Complexity of the model and accuracy of the classification
Which of these is not included in the five steps of the ETL process? a. Determine the purpose and scope of the data request. b. Obtain the data. c. Validate the data for completeness and integrity. d. Scrub the data.
D
Which skills were not emphasized that analytic-minded accountants should have? a. Data quality b. Descriptive data analysis c. Data visualization d. Data and systems analysis and design
D
The metadata that describes each attribute in a database is which of the following?
Data dictionary
Which of these terms is defined as being a central repository of descriptions for all of the data attributes of the dataset?
Data dictionary
Big Data
Datasets that are too large and complex for businesses' existing systems to capture, store, manage, and analyze using their traditional capabilities.
__________ mark (marks) the split between one class and another.
Decision boundaries
What are attributes that exist in a relational database that are neither primary nor foreign keys?
Descriptive attributes
Skills of analytic-minded accountants
Develop an analytics mindset; Data scrubbing and data preparation; Data quality; Descriptive data analysis; Data analysis through data manipulation; Define and address problems through statistical data analysis; Data visualization and data reporting
Mastering the data can also be described via the ETL process. The ETL process stands for:
Extract, transform, and load data.
Asking questions like "Are our customers paying us in a timely manner" would be the first step in which of the following processes?
IMPACT cycle
The goal of the ETL process is to:
Identify and obtain the data needed for solving the problem.
The IMPACT cycle (the process used in Data Analytics)
Identify the questions; Master the data; Perform test plan; Address and refine results; Communicate insights; Track outcomes
Unsupervised approach
If you don't have a specific question and are simply exploring the data for potential patterns of interest
The Fahrenheit scale of temperature measurement would best be described as an example of:
Interval data.
Why is Supplier ID considered to be a primary key for a Supplier table?
It contains a unique identifier for each supplier.
Join Clause
Joins rely on the structure of normalized relational databases that have tables related through primary keys and foreign keys
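A join can be sketched with Python's built-in sqlite3 module. The Supplier and PurchaseOrder tables, their columns, and the sample rows below are hypothetical and exist only to show a foreign key being matched to a primary key.

```python
import sqlite3

# Hypothetical tables: Supplier (primary key SupplierID) and
# PurchaseOrder (foreign key SupplierID pointing back to Supplier).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Supplier (
        SupplierID INTEGER PRIMARY KEY,
        SupplierName TEXT
    );
    CREATE TABLE PurchaseOrder (
        OrderID INTEGER PRIMARY KEY,
        SupplierID INTEGER REFERENCES Supplier(SupplierID),
        Amount REAL
    );
    INSERT INTO Supplier VALUES (1, 'Acme'), (2, 'Birch Co');
    INSERT INTO PurchaseOrder VALUES (100, 1, 250.0), (101, 2, 75.5), (102, 1, 40.0);
""")

# The JOIN matches each order's foreign key to the supplier's primary key.
rows = conn.execute("""
    SELECT PurchaseOrder.OrderID, Supplier.SupplierName, PurchaseOrder.Amount
    FROM PurchaseOrder
    JOIN Supplier ON PurchaseOrder.SupplierID = Supplier.SupplierID
    ORDER BY PurchaseOrder.OrderID
""").fetchall()
conn.close()

for order_id, supplier_name, amount in rows:
    print(order_id, supplier_name, amount)
```

Each order row picks up the supplier name from the other table, which is why the foreign key has to be identified before extracting data from more than one table.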
Which approach to data analytics attempts to predict a relationship between two data items?
Link prediction
Tax Compliance deals primarily with filing tax returns. In contrast, tax planning primarily helps...
Minimize the amount of taxes paid
_______ data would be considered the least sophisticated type of data.
Nominal
Exhibit 4-8 gives chart suggestions for what data you'd like to portray. Those options include all of the following except:
Normalization
Gold, silver, and bronze medals would be examples of:
Ordinal data.
In general, the more complex the model, the greater the chance of:
Overfitting the data.
Which attribute is required to exist in each table of a relational database and serves as the "unique identifier" for each record in a table?
Primary key
Line charts are not recommended for what type of data?
Qualitative data
Word clouds
Qualitative; If you are working with text data instead of categorical data, you can represent them in a word cloud.
Tree maps and heat maps:
Qualitative; These are similar types of visualizations, and they both use size and color to show proportional size of values.
Symbol maps
Qualitative; These are geographic maps, so they should be used when expressing qualitative data proportions across geographic areas such as states or countries.
Filled geographic maps
Quantitative; As opposed to symbol maps, a filled geographic map is used to illustrate data ranges for quantitative data across different geographic areas such as states or countries.
Line charts
Quantitative; Show similar information to what a bar chart shows, but line charts are good for showing data changes or trend lines over time
Scatter plots
Quantitative; Useful for identifying the correlation between two variables or for identifying a trend line or line of best fit.
Box and whisker plots
Quantitative; Useful for when quartiles, median, and outliers are required for analysis and insights.
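The numbers behind a box and whisker plot can be sketched with Python's statistics module. The sample values are hypothetical, and the 1.5 x IQR rule for marking outliers is a common convention assumed here, not something stated in this guide.

```python
from statistics import median, quantiles

# Hypothetical invoice amounts; 60 is deliberately far from the rest.
data = [12, 15, 14, 10, 13, 14, 15, 11, 13, 60]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "median:", q2, "Q3:", q3)
print("outliers:", outliers)
```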
_______ data would be considered the most sophisticated type of data.
Ratio
Which approach to Data Analytics attempts to predict a relationship between two data items?
Link prediction
What is the most appropriate chart when showing a relationship between two variables (according to Exhibit 4-8)?
Scatter chart
By the year 2020, about 1.7 megabytes of new information will be created every:
Second
Which approach to Data Analytics attempts to identify similar individuals based on data known about them?
Similarity matching
In the late 1960s, Ed Altman developed a model to predict if a company was at severe risk of going bankrupt. He called his statistic Altman's Z-score, now a widely used score in finance. Based on the name of the statistic, which statistical distribution would you guess this came from?
Standardized normal distribution
If you are comparing two datasets that follow the normal distribution, even if the two datasets have very different means, you can still compare them by ________ the distributions with _____
Standardizing; z-scores
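A minimal sketch of standardizing, assuming the usual formula z = (value - mean) / standard deviation. The two score lists are hypothetical, and using the population standard deviation (pstdev) is an assumption made for the illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    return [(v - mu) / sigma for v in values]

# Two hypothetical datasets on very different scales.
exam_a = [60, 70, 80, 90, 100]
exam_b = [400, 500, 600, 700, 800]

z_a = z_scores(exam_a)
z_b = z_scores(exam_b)

# After standardizing, both sets have mean 0 and standard deviation 1,
# so a score in one set can be compared directly with a score in the other.
print([round(z, 3) for z in z_a])
print([round(z, 3) for z in z_b])
```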
Steps of ETL
Step 1: Determine the purpose and scope of the data request. Step 2: Obtain the data. Step 3: Validate the data for completeness and integrity. Step 4: Clean the data. Step 5: Load the data for data analysis.
SQL
Structured Query Language
Data that are organized and reside in a fixed field with a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms. The term matching this definition is:
Structured data.
_________ is a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.
Support vector machine
_______ is a set of data used to assess the degree and strength of a predicted relationship.
Test data
Models associated with regression and classification data approaches have all except this important part:
Test data.
Data Analytics
The process of evaluating data with the purpose of drawing conclusions to address business questions. Indeed, effective Data Analytics provides a way to search through large structured and unstructured data to identify unknown patterns or relationships
Justin Zobel suggests that revising your writing requires you to "be egoless—ready to dislike anything you have previously written," suggesting that it is ________ you need to please:
The reader
The purpose of transforming data is:
To validate the data for completeness and integrity.
True or false: Data analytics can impact financial accounting by helping evaluate estimates and valuations
True
In general, the simpler the model, the greater the chance of:
Underfitting the data.
UML
Unified Modeling Language (UML)
Big Data is often described by the three Vs:
Volume, velocity, and variety.
Flat file
a means of storing data in one place, such as an Excel spreadsheet
Relational database
a means of storing data that ensures the data are complete and not redundant, helps enforce business rules, and aids communication and integration of business processes across an organization
Data request form
a method for obtaining data if you do not have access to obtain the data directly yourself
Any transaction that has a Z-score of 3 or above would represent...
abnormal transactions that might be associated with higher risk
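A minimal sketch of flagging such transactions. The amounts are hypothetical, the cutoff of 3 comes from the card above, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts: nineteen ordinary ones and one very large one.
amounts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
           99, 100, 102, 98, 100, 101, 99, 100, 100, 500]

mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation (assumed here)

# Keep only the transactions whose z-score is 3 or above.
flagged = [a for a in amounts if (a - mu) / sigma >= 3]

print("flagged for follow-up:", flagged)
```

With only a handful of records a single extreme value cannot reach a z-score of 3, so the sketch uses twenty records.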
Qualitative data
are categorical data - can be split into nominal and ordinal data
Discrete data
are data that are represented by whole numbers
Continuous data
are data that can take on any value within a range
Training data
are existing data that have been manually evaluated and assigned a class
Test data
are existing data used to evaluate the model.
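Dividing labeled records into training and test sets might look like the sketch below. The function name, the 80/20 split, the fixed seed, and the sample records are all assumptions made for the illustration.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split off a test portion."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records that have already been manually assigned a class.
labeled = [({"amount": 10 * i}, "late" if i % 3 == 0 else "on_time")
           for i in range(10)]

training, testing = train_test_split(labeled)
print(len(training), "training records,", len(testing), "test records")
```

The model is generated from the training records only; the test records are held back to evaluate it.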
Quantitative data
are more complex than qualitative data because not only can they be counted and grouped just like qualitative data, but the differences between each data point are meaningful
Declarative visualizations
are the product of wanting to "declare" or present your findings to an audience
Decision trees
are used to divide data into smaller groups
Data reduction
attempts to reduce the amount of detailed information considered to focus on the most critical, interesting, or abnormal items
Nominal data
can be counted and categorized
Ordinal data
can be counted and categorized like nominal data but can go a step further—the categories can also be ranked
A _______ is a manually assigned category applied to a record based on an event
class
structured data
data that are stored in a database or spreadsheet and are readily searchable
Primary key
ensures that each row in the table is unique, so it is often referred to as a "unique identifier."
Benford's law
frequency of leading digits
The advantages of storing data in a relational database include which of the following? a. Help in enforcing business rules. b. Increased information redundancy. c. Integrating business processes. d. All of the above are advantages of a relational database. e. Only A and B. f. Only B and C. g. Only A and C.
g
The clustering data approach works to identify...
groups of similar data elements and the underlying drivers of those groups
[range_lookup]
has two options: FALSE (find an exact match) or TRUE (find an approximate match)
Linear classifiers are used to...
identify a decision boundary.
Data dictionary
is paramount in helping database administrators maintain databases and analysts identify the data they need to use.
Lookup_value
is the foreign key you wish to look up
Table_array
is the table that contains the corresponding primary key
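The lookup arguments on these cards can be imitated in Python. This is only a rough sketch of an exact-match lookup (range_lookup set to FALSE), not how Excel implements it; the vlookup function and the supplier rows are hypothetical.

```python
def vlookup(lookup_value, table_array, col_index_num):
    """Exact-match lookup, similar in spirit to range_lookup = FALSE.

    table_array is a list of rows whose first column holds the primary key.
    col_index_num is 1-based, as in the spreadsheet function.
    """
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]
    return None  # a spreadsheet would show an error value instead

# Hypothetical supplier table: (SupplierID, name, state).
suppliers = [
    ("S001", "Acme", "Ohio"),
    ("S002", "Birch Co", "Texas"),
    ("S003", "Cedar LLC", "Utah"),
]

# "S002" plays the role of the foreign key found in another table.
print(vlookup("S002", suppliers, 2))
```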
data profiling
is typically used to assess data quality and internal controls
XBRL (eXtensible Business Reporting Language)
is used to facilitate the exchange of financial reporting information between the company and the Securities and Exchange Commission (SEC)
3 V's of Big Data
its Volume (the sheer size of the dataset), Velocity (the speed of data processing), and Variety (the number of types of data).
Decision boundaries
mark the split between one class and another.
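A linear decision boundary can be sketched as a weighted score compared with zero. The two inputs, the weights, the bias, and the class names are all hypothetical; in practice the weights would be learned from training data.

```python
def classify(debt_ratio, late_payments, weights=(4.0, 1.0), bias=-5.0):
    """Score a record with a linear function; the boundary is where score == 0."""
    score = weights[0] * debt_ratio + weights[1] * late_payments + bias
    label = "high_risk" if score > 0 else "low_risk"
    return label, score

# Records on opposite sides of the boundary get different classes.
print(classify(0.5, 1))
print(classify(0.9, 3))
```

Because the function also returns the score, records can be ranked by how far they sit from the boundary rather than only assigned a class.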
Standard normal distribution
mean = 0
Standard deviation of the standard normal distribution
standard deviation = 1
Foreign key
must be identified when mastering the data from a relational database in order to extract the data correctly from more than one table
Use of fuzzy match looks for correspondences between...
portions, or segments, of the text of each potential match
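One way to sketch a fuzzy match is with difflib.SequenceMatcher from Python's standard library, which scores two strings by the segments they share. The vendor names, the helper functions, and the 0.8 threshold are assumptions; dedicated audit tools may score matches differently.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-to-1 score based on matching segments of the two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing is close enough."""
    score, candidate = max((similarity(name, c), c) for c in candidates)
    return candidate if score >= threshold else None

# Hypothetical vendor master list.
vendors = ["ACME Supply Company", "Birch Lumber", "Apex Supplies"]

print(best_match("Acme Supply Co", vendors))
print(best_match("Zenith", vendors))
```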
Regressions allow the accountant to develop models to...
predict expected outcomes
The goal of classification is to...
predict whether an individual we know very little about will belong to one class or another.
Predictive analytics
predicting the future! (in tax: helps to predict what will happen rather than reacting to what just did happen)
Linear classifiers are useful for...
ranking items rather than simply predicting class probability
Col_Index_Num
refers to the column in your selected table_array that contains the data you wish to view.
Regression is a(n) ________ method used to...
supervised; predict specific values
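Predicting a specific value can be sketched with a one-variable least-squares line. The advertising and sales figures are hypothetical, and a straight-line relationship is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (predictor) and sales (outcome).
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6 + intercept  # expected sales at a spend of 6
print(slope, intercept, predicted)
```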
Classification is a(n) ___________ method that can be used to...
supervised; predict the class of a new observation.
A ________ is an expected attribute or value that we want to evaluate
target
Data Analytics also expands auditors' capabilities in services like...
testing for fraudulent transactions and automating compliance-monitoring activities
Data Analytics to scan the environment
that is, scanning Google searches and social media (such as Instagram and Facebook) to identify potential risks and opportunities to the firm
Exploratory visualization
used when the questions from step I (identify the questions) have not already been answered before you begin working with the data in the visualization software
Normal distribution
the data should have equal median, mean, and mode, with half of the observations falling below the mean and the other half falling above the mean
Benford's law states that in many naturally occurring collections of numbers...
the leading significant digit is likely to be small.
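Benford's law is usually written as P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The sketch below compares that expectation with the leading digits of a stand-in dataset; powers of 2 are used only because they are easy to generate, and the helper names are hypothetical.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(number):
    """First digit of a nonzero integer (assumes integer input)."""
    return int(str(abs(number))[0])

# Stand-in data: the first 100 powers of 2, not real transactions.
amounts = [2 ** k for k in range(1, 101)]

counts = Counter(leading_digit(a) for a in amounts)
observed = {d: counts[d] / len(amounts) for d in range(1, 10)}
expected = {d: benford_expected(d) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

A large gap between the observed and expected shares for a digit is a reason to follow up, not proof of a problem.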
the process of Data Analytics aims...
to transform raw data into knowledge to create value.
Profiling is a(n) _______ method that is used to...
unsupervised; discover patterns of behavior
Clustering is a(n) ________ method that is used to ...
unsupervised; find natural groupings within the data.