Midterm Study Guide

Steps in data reduction

1. Identify the attribute you would like to reduce or focus on. 2. Filter the results. 3. Interpret the results. 4. Follow up on results.

Steps in Classification

1. Identify the classes you wish to predict. 2. Manually classify an existing set of records. 3. Select a set of classification models. 4. Divide your data into training and testing sets. 5. Generate your model. 6. Interpret the results and select the "best" model.

Steps of profiling

1. Identify the objects or activity you want to profile. 2. Determine the types of profiling you want to perform. 3. Set boundaries or thresholds for the activity. 4. Interpret the results and monitor the activity and/or generate a list of exceptions. 5. Follow up on exceptions.

Steps in Regression

1. Identify the variables that might predict an outcome.

The IMPACT cycle includes all except the following process: a. Visualize the data. b. Identify the questions. c. Master the data. d. Track outcomes.

A

Classification

An attempt to assign each unit (or individual) in a population into a few categories.

Co-occurrence grouping

An attempt to discover associations between individuals based on transactions involving them

Clustering

An attempt to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way.

Regression

An attempt to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model.

Similarity matching

An attempt to identify similar individuals based on data known about them.

The IMPACT cycle includes all except the following process: a. Communicate insights. b. Data preparation. c. Address and refine results. d. Perform test plan.

B

Which of the following is not a typical example of nominal data? a. Gender b. SAT scores c. Hair color d. Ethnic group

B

When it comes to visually representing qualitative data, the charts most frequently considered for depicting qualitative data are:

Bar charts. Pie charts. Stacked bar chart.

The observation about the frequency of leading digits in many real-life sets of numerical data is called:

Benford's law.
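
As a rough illustration of Benford's law, the expected share of numbers with leading digit d is log10(1 + 1/d), so smaller digits appear more often. The sketch below compares that expectation with observed shares; the invoice amounts and the helper names (`benford_expected`, `leading_digit`, `observed_shares`) are made up for this example and do not come from the study set.

```python
import math
from collections import Counter

def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 through 9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(value: float) -> int:
    """First non-zero digit of a non-zero number."""
    # Scientific notation always starts with the leading digit, e.g. '9.72e+01'.
    return int(f"{abs(value):.15e}"[0])

def observed_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical invoice amounts, for illustration only.
amounts = [120.50, 1890.00, 23.75, 310.00, 1450.10, 97.20, 1020.00, 18.99]
shares = observed_shares(amounts)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(shares[d], 3))
```

A sample this small will not track the expected shares closely; in practice the comparison is only meaningful on a large population of transactions.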

Which of the following skills was not emphasized as one that analytic-minded accountants should have? a. Develop an analytics mindset b. Data scrubbing and data preparation c. Classification of test approaches d. Define and address problems through statistical data analysis

C

Which approach to Data Analytics attempts to assign each unit in a population into a small set of classes where the unit belongs?

Classification

As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned after extraction and validation? a. Remove headings and subtotals. b. Format negative numbers. c. Clean up trailing zeroes. d. Correct inconsistencies across data.

Clean up trailing zeroes.

When evaluating classifiers, you need to be careful to strike a balance between what two things?

Complexity of the model and accuracy of the classification

Which of these is not included in the five steps of the ETL process? a. Determine the purpose and scope of the data request. b. Obtain the data. c. Validate the data for completeness and integrity. d. Scrub the data.

D

Which of the following skills was not emphasized as one that analytic-minded accountants should have? a. Data quality b. Descriptive data analysis c. Data visualization d. Data and systems analysis and design

D

The metadata that describes each attribute in a database is which of the following?

Data dictionary

Which of these terms is defined as being a central repository of descriptions for all of the data attributes of the dataset?

Data dictionary

Big Data

Datasets that are too large and complex for businesses' existing systems to handle with their traditional capabilities to capture, store, manage, and analyze data.

__________ mark (marks) the split between one class and another.

Decision boundaries

What are attributes that exist in a relational database that are neither primary nor foreign keys?

Descriptive attributes

Skills of analytic-minded accountants

Develop an analytics mindset; Data scrubbing and data preparation; Data quality; Descriptive data analysis; Data analysis through data manipulation; Define and address problems through statistical data analysis; Data visualization and data reporting

Mastering the data can also be described via the ETL process. The ETL process stands for:

Extract, transform, and load data.

Asking questions like "Are our customers paying us in a timely manner" would be the first step in which of the following processes?

IMPACT cycle

The goal of the ETL process is to:

Identify and obtain the data needed for solving the problem.

Data Analytics uses a process called the IMPACT cycle

Identify the questions; Master the data; Perform test plan; Address and refine results; Communicate insights; Track outcomes

Unsupervised approach

If you don't have a specific question and are simply exploring the data for potential patterns of interest

The Fahrenheit scale of temperature measurement would best be described as an example of:

Interval data.

Why is Supplier ID considered to be a primary key for a Supplier table?

It contains a unique identifier for each supplier.

Join Clause

Joins rely on the structure of normalized relational databases that have tables related through primary keys and foreign keys
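
To make the primary key / foreign key relationship concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `Supplier` and `PurchaseOrder` tables, their columns, and the sample rows are assumptions invented for the example; only the idea of joining on a shared key comes from the card above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE Supplier (
    SupplierID   INTEGER PRIMARY KEY,   -- unique identifier for each supplier
    SupplierName TEXT
);
CREATE TABLE PurchaseOrder (
    PurchaseOrderID INTEGER PRIMARY KEY,
    SupplierID      INTEGER REFERENCES Supplier(SupplierID),  -- foreign key
    Amount          REAL
);
INSERT INTO Supplier VALUES (1, 'Acme Paper'), (2, 'Bolt Supply');
INSERT INTO PurchaseOrder VALUES (100, 1, 250.0), (101, 2, 75.5), (102, 1, 40.0);
""")

# The JOIN matches each order's foreign key to the supplier's primary key.
rows = conn.execute("""
SELECT po.PurchaseOrderID, s.SupplierName, po.Amount
FROM PurchaseOrder AS po
JOIN Supplier AS s ON po.SupplierID = s.SupplierID
ORDER BY po.PurchaseOrderID
""").fetchall()
print(rows)
conn.close()
```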

Which approach to data analytics attempts to predict a relationship between two data items?

Link prediction

Tax Compliance deals primarily with filing tax returns. In contrast, tax planning primarily helps...

Minimize the amount of taxes paid

_______ data would be considered the least sophisticated type of data.

Nominal

Exhibit 4-8 gives chart suggestions for what data you'd like to portray. Those options include all of the following except:

Normalization

Gold, silver, and bronze medals would be examples of:

Ordinal data.

In general, the more complex the model, the greater the chance of:

Overfitting the data.

Which attribute is required to exist in each table of a relational database and serves as the "unique identifier" for each record in a table?

Primary key

Line charts are not recommended for what type of data?

Qualitative data

Word clouds

Qualitative; If you are working with text data instead of categorical data, you can represent them in a word cloud.

Tree maps and heat maps:

Qualitative; These are similar types of visualizations, and they both use size and color to show proportional size of values.

Symbol maps

Qualitative; These are geographic maps, so they should be used when expressing qualitative data proportions across geographic areas such as states or countries.

Filled geographic maps

Quantitative; As opposed to symbol maps, a filled geographic map is used to illustrate data ranges for quantitative data across different geographic areas such as states or countries.

Line charts

Quantitative; Show similar information to what a bar chart shows, but line charts are good for showing data changes or trend lines over time

Scatter plots

Quantitative; Useful for identifying the correlation between two variables or for identifying a trend line or line of best fit.

Box and whisker plots

Quantitative; Useful for when quartiles, median, and outliers are required for analysis and insights.
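
The numbers behind a box and whisker plot can be sketched with the standard library. The example below computes the quartiles and median and then flags outliers with the common 1.5 x IQR rule; that rule, the function names, and the days-to-pay figures are assumptions for illustration, not something the study set specifies.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)  # three quartile cut points
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical days-to-pay for ten customer invoices.
days_to_pay = [21, 25, 28, 30, 30, 31, 33, 35, 38, 120]
summary = five_number_summary(days_to_pay)
outliers = iqr_outliers(days_to_pay)
print(summary)
print(outliers)
```

Different tools compute quartiles slightly differently, so the exact cut points may not match a given charting package.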

_______ data would be considered the most sophisticated type of data.

Ratio

Which approach to Data Analytics attempts to predict a relationship between two data items?

Link prediction

What is the most appropriate chart when showing a relationship between two variables (according to Exhibit 4-8)?

Scatter chart

By the year 2020, about 1.7 megabytes of new information will be created every:

Second

Which approach to Data Analytics attempts to identify similar individuals based on data known about them?

Similarity matching

In the late 1960s, Ed Altman developed a model to predict if a company was at severe risk of going bankrupt. He called his statistic Altman's Z-score, now a widely used score in finance. Based on the name of the statistic, which statistical distribution would you guess this came from?

Standardized normal distribution

If you are comparing two datasets that follow the normal distribution, even if the two datasets have very different means, you can still compare them by ________ the distributions with _____

Standardizing; z-scores
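
A small sketch of standardizing: each value is converted to z = (value - mean) / standard deviation, which puts datasets with different means and spreads on the same scale. The two score lists below are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption made for the example.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / stdev for v in values]

# Two hypothetical datasets with very different means and spreads.
exam_a = [60, 70, 80, 90, 100]
exam_b = [10, 12, 14, 16, 18]

z_a = z_scores(exam_a)
z_b = z_scores(exam_b)
for a, b in zip(z_a, z_b):
    print(round(a, 3), round(b, 3))
```

Once standardized, the two lists line up value for value, so a score of 90 on the first scale and 16 on the second can be seen to sit at the same relative position.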

Steps of ETL

Step 1: Determine the purpose and scope of the data request. Step 2: Obtain the data. Step 3: Validate the data for completeness and integrity. Step 4: Clean the data. Step 5: Load the data for data analysis.
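
The cleaning step can be sketched in a few lines. The example below handles three cleanup tasks this guide lists as common (removing headings and subtotals, formatting negative numbers, and correcting inconsistencies). The row layout, the accounting-style parentheses for negatives, and the function names are assumptions chosen for illustration.

```python
def parse_amount(text: str) -> float:
    """Convert '1,234.50', '$300.00' or '(150.25)' to a float.

    Parentheses are treated as the accounting notation for a negative number.
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -float(cleaned[1:-1])
    return float(cleaned)

def clean_rows(raw_rows):
    """Drop heading/subtotal rows and normalize labels and amounts."""
    cleaned = []
    for region, amount in raw_rows:
        label = region.strip()
        if not label or label.lower() in {"region", "subtotal", "total"}:
            continue  # heading, subtotal, or blank row: not real data
        cleaned.append((label.title(), parse_amount(amount)))
    return cleaned

# Hypothetical extract with a heading row, a subtotal, and inconsistent labels.
raw = [
    ("Region", "Amount"),
    ("north ", "1,200.00"),
    ("NORTH", "(150.25)"),
    ("Subtotal", "1,049.75"),
    ("south", "$300.00"),
]
result = clean_rows(raw)
print(result)
```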

SQL

Structured Query Language

Data that are organized and reside in a fixed field with a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms. The term matching this definition is:

Structured data.

_________ is a discriminating classifier that is defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line.

Support vector machine

_______ is a set of data used to assess the degree and strength of a predicted relationship.

Test data

Models associated with regression and classification data approaches have all except this important part:

Test data.

Data Analytics

The process of evaluating data with the purpose of drawing conclusions to address business questions. Indeed, effective Data Analytics provides a way to search through large structured and unstructured data to identify unknown patterns or relationships

Justin Zobel suggests that revising your writing requires you to "be egoless - ready to dislike anything you have previously written," suggesting that the person you need to please is:

The reader

The purpose of transforming data is:

To validate the data for completeness and integrity.

True or false: Data analytics can impact financial accounting by helping evaluate estimates and valuations

True

In general, the simpler the model, the greater the chance of:

Underfitting the data.

UML

Unified Modeling Language (UML)

Big Data is often described by the three Vs:

Volume, velocity, and variety.

Flat file

a means of storing data in one place, such as an Excel spreadsheet

Relational database

a means of storing data that ensures the data are complete and not redundant, helps enforce business rules, and aids communication and integration of business processes across an organization

Data request form

a method for obtaining data if you do not have access to obtain the data directly yourself

Any transaction that has a Z-score of 3 or above would represent...

abnormal transactions that might be associated with higher risk
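
As a hedged sketch of this rule, the function below flags any amount whose z-score is 3 or above. The transaction amounts are invented, and using the population standard deviation is an assumption; a real test would be run on a much larger population.

```python
import statistics

def flag_abnormal(amounts, threshold=3.0):
    """Return the amounts whose z-score is at or above `threshold`."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)  # population standard deviation (assumed)
    if stdev == 0:
        return []  # all amounts identical, so nothing stands out
    return [a for a in amounts if (a - mean) / stdev >= threshold]

# Hypothetical transaction amounts: nineteen routine values and one large one.
transactions = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
                101, 99, 100, 104, 96, 100, 101, 99, 100, 1000]
flagged = flag_abnormal(transactions)
print(flagged)
```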

Qualitative data

are categorical data - can be split into nominal and ordinal data

Discrete data

are data that are represented by whole numbers

Continuous data

are data that can take on any value within a range

Training data

are existing data that have been manually evaluated and assigned a class

Test data

are existing data used to evaluate the model.
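
One way to picture the training/test division is a simple random split of records that already carry a class. The sketch below is illustrative only: the 80/20 proportion, the fixed seed, the function name, and the "late"/"on_time" labels are all assumptions.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction as test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical records that have already been manually assigned a class.
labeled = [(amount, "late" if amount > 500 else "on_time")
           for amount in range(100, 1100, 100)]

train, test = train_test_split(labeled)
print(len(train), len(test))
```

The model would be generated from `train` only, and `test` would be used afterward to evaluate how well it classifies records it has not seen.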

Quantitative data

are more complex than qualitative data because not only can they be counted and grouped just like qualitative data, but the differences between each data point are meaningful

Declarative visualizations

are the product of wanting to "declare" or present your findings to an audience

Decision trees

are used to divide data into smaller groups

Data reduction

attempts to reduce the amount of detailed information considered to focus on the most critical, interesting, or abnormal items

Nominal data

can be counted and categorized

Ordinal data

can be counted and categorized like nominal data but can go a step further—the categories can also be ranked

A _______ is a manually assigned category applied to a record based on an event

class

structured data

data that are stored in a database or spreadsheet and are readily searchable

Primary key

ensures that each row in the table is unique, so it is often referred to as a "unique identifier."

Benford's law

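the observation that leading digits in many real-life sets of numerical data occur with predictable, unequal frequencies, with small digits appearing most often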

The advantages of storing data in a relational database include which of the following? a. Help in enforcing business rules. b. Increased information redundancy. c. Integrating business processes. d. All of the above are advantages of a relational database. e. Only A and B. f. Only B and C. g. Only A and C.

g

The clustering data approach works to identify...

groups of similar data elements and the underlying drivers of those groups

[range_lookup]

has two options, either FALSE or TRUE

Linear classifiers are used to...

identify a decision boundary.

Data dictionary

is paramount in helping database administrators maintain databases and analysts identify the data they need to use.

Lookup_value

is the foreign key you wish to look up

Table_array

is the table that contains the corresponding primary key

data profiling

is typically used to assess data quality and internal controls

XBRL (eXtensible Business Reporting Language)

is used to facilitate the exchange of financial reporting information between the company and the Securities and Exchange Commission (SEC)

3 V's of Big Data

its Volume (the sheer size of the dataset), Velocity (the speed of data processing), and Variety (the number of types of data).

Decision boundaries

mark the split between one class and another.

Standard normal distribution

mean = 0 and standard deviation = 1

Foreign key

must be identified when mastering the data from a relational database in order to extract the data correctly from more than one table

Use of fuzzy match looks for correspondences between...

portions, or segments, of the text of each potential match
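
A fuzzy match can be approximated with `difflib.SequenceMatcher` from the Python standard library, which scores how much of two strings lines up in matching segments. The vendor names and the 0.75 threshold below are assumptions for illustration; audit software may use a different similarity measure.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score from 0 to 1 based on matching segments of the two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_matches(name, candidates, threshold=0.75):
    """Return the candidates whose similarity to `name` meets the threshold."""
    return [c for c in candidates if similarity(name, c) >= threshold]

# Hypothetical vendor names, for example when checking for duplicate vendors.
vendors = ["ACME Paper Company", "Apex Plumbing", "Acme Paper Co"]
matches = fuzzy_matches("Acme Paper Co.", vendors)
print(matches)
```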

Regressions allow the accountant to develop models to...

predict expected outcomes
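
A minimal sketch of such a model is a one-variable least-squares line. The advertising and sales figures below are hypothetical, and ordinary least squares is an assumed choice of technique; the card does not name a specific regression method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (x) and sales (y), both in thousands.
ad_spend = [1, 2, 3, 4, 5]
sales = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = intercept + slope * 6  # expected outcome for a new spend level
print(round(slope, 2), round(intercept, 2), round(predicted_sales, 2))
```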

The goal of classification is to...

predict whether an individual we know very little about will belong to one class or another.

Predictive analytics

predicting the future! (in tax: helps to predict what will happen rather than reacting to what just did happen)

Linear classifiers are useful for...

ranking items rather than simply predicting class probability

Col_Index_Num

refers to the column in your selected table_array that contains the data you wish to view.
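
The lookup arguments described in this guide (lookup_value, table_array, col_index_num, and the optional range_lookup) are those of the spreadsheet VLOOKUP function. The sketch below is a rough Python analogue written for illustration, not the spreadsheet's actual implementation; the supplier rows and the bracket rates are made up.

```python
from bisect import bisect_right

def vlookup(lookup_value, table_array, col_index_num, range_lookup=False):
    """Rough analogue of a spreadsheet VLOOKUP.

    table_array   -- list of rows; the first column holds the key
    col_index_num -- 1-based number of the column to return
    range_lookup  -- False for an exact match; True for an approximate match,
                     which assumes the first column is sorted ascending
    Returns None when nothing matches (a spreadsheet would show #N/A).
    """
    if not range_lookup:
        for row in table_array:
            if row[0] == lookup_value:
                return row[col_index_num - 1]
        return None
    keys = [row[0] for row in table_array]
    position = bisect_right(keys, lookup_value)  # count of keys <= lookup_value
    if position == 0:
        return None
    return table_array[position - 1][col_index_num - 1]

# Exact match: look up a foreign key against a table keyed by SupplierID.
suppliers = [(101, "Acme Paper", "Tulsa"), (102, "Bolt Supply", "Omaha")]
supplier_name = vlookup(102, suppliers, 2)

# Approximate match: find the bracket a value falls into (hypothetical rates).
brackets = [(0, 0.10), (10000, 0.12), (40000, 0.22)]
rate = vlookup(25000, brackets, 2, range_lookup=True)
print(supplier_name, rate)
```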

Regression is a(n) ________ method used to...

supervised; predict specific values

Classification is a(n) ___________ method that can be used to...

supervised; predict the class of a new observation.

A ________ is an expected attribute or value that we want to evaluate

target

Data Analytics also expands auditors' capabilities in services like...

testing for fraudulent transactions and automating compliance-monitoring activities

Data Analytics to scan the environment

that is, scanning Google searches and social media (such as Instagram and Facebook) to identify potential risks and opportunities to the firm

Exploratory visualization

a visualization built when the questions from step I of the IMPACT cycle (identify the questions) have not already been answered before working with the data in the visualization software

Normal distribution

the data should have equal median, mean, and mode, with half of the observations falling below the mean and the other half falling above the mean

Benford's law states that in many naturally occurring collections of numbers...

the leading significant digit is likely to be small.

the process of Data Analytics aims...

to transform raw data into knowledge to create value.

Profiling is a(n) _______ method that is used to...

unsupervised; discover patterns of behavior

Clustering is a(n) ________ method that is used to ...

unsupervised; find natural groupings within the data.
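
To show what finding natural groupings can look like, here is a small one-dimensional k-means sketch. K-means is one common clustering technique chosen for this example; the study set does not name a specific algorithm, and the customer spend values and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data.

    Repeats two steps: assign each value to its nearest center, then move
    each center to the mean of the values assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer; no classes are assigned in advance.
spend = [20, 22, 25, 180, 200, 210]
centers, clusters = k_means_1d(spend, centers=[0, 100])
print(centers)
print(clusters)
```

No labels are supplied, which is what makes the approach unsupervised: the two groups emerge from the data alone.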

