ACIS 3504 Exam 2
Which of the following is NOT one of the responsibilities of auditors in detecting fraud according to SAS No. 99?
Catching the perpetrators in the act of committing the fraud.
Examples of concealment efforts include
Charge a stolen asset to an expense account or to an account receivable that is about to be written off; create a ghost employee who receives an extra paycheck; lapping; kiting.
The principle of simplification techniques include
Color, Quantity, distance, orientation
The principle of emphasis techniques include
Color, highlighting, weighting, ordering
•Opportunity is the opening or gateway that allows an individual to do what three things?
Commit the fraud; conceal the fraud; convert the proceeds
What type of computer fraud involves tampering with software, illegally copying software, using software in an unauthorized manner, or creating software to carry out unauthorized activities?
Computer instructions fraud
The highest-level DFD, which provides a summary-level view of the system and depicts a data processing system and the external entities that are the sources of its input and the destinations of its output. The process symbol is numbered with a "0".
Context diagram
Employees who steal inventory or equipment and then sell the items or otherwise convert them to cash are an example of
Convert the theft or misrepresentation to personal gain
•place a heavy emphasis on the logical aspects of a system.
DFDs
Which of the following statements is FALSE?
DFDs help convey the timing of events.
Catalina was reviewing the data imputation formula for missing values in a customer's credit score. She found that two lines of data had the exact same customer name, birthdate, address, and buying history, but they had two different social security numbers. Assuming there was no fraud by the customer, what best describes what Catalina likely found
Data Contradiction Error
Santiago reviewed a recent extract of data about customer credit limits. He noticed that one company had a credit limit of $1,000,000,000 USD, whereas the next highest credit limit was $10,000 USD. What might Santiago have discovered?
Data Threshold Violation
•the process of updating data to be consistent, accurate, and complete.
Data cleaning
•the combining of data from two or more fields into a single field.
Data concatenation
the principle that every value in a field should be stored in the same way.
Data consistency
•errors that exist when the same entity is described in two conflicting ways and need to be investigated and resolved appropriately
Data contradiction errors
The entity that receives data produced by a system
Data destination
When an employee makes a mistake typing data into the system, it is called a _______.
Data entry error
•types of errors that come from inputting data incorrectly. They often occur in human data entry and can also be introduced by the computer system. May be indistinguishable from data formatting and data consistency errors in an output data file
Data entry errors
the process of removing records or fields of information from a data source
Data filtering
The movement of data among processes, stores, sources and destinations
Data flow
A graphical description of the flow of data within an organization, including data sources/destinations, data flows, transformation processes, and data storage
Data flow diagram (DFD)
Data flow diagram symbol that is represented by an arrow
Data flows
What do Data Flow Diagrams (DFD) focus on?
Data flows, processes, sources and destinations of the data, data stores
What type of computer fraud is illegally using, copying, browsing, searching, or harming company data
Data fraud
The intentional arranging of visualization items in a way to produce emphasis. Can be used in ascending, descending order, random, or alphabetically
Data ordering
The entity that produces or sends the data entered into a system
Data source
Data flow diagram symbol that is represented by a square
Data sources and destinations
The place or medium where system data is stored
Data store
Data flow diagram symbol that is represented by two parallel lines
Data stores
Computer systems are vulnerable to computer crimes because
Databases can be huge, and access privileges can be difficult to create and enforce; organizations want employees, customers, suppliers, and others to have access to their system; computer programs only need to be altered once
What are some of the reasons for fraudulent financial statements
Deceive investors or creditors; increase a company's stock price; meet cash flow needs; hide company losses or other problems
________ often make use of exploratory data analytic techniques, while _______ make use of machine learning techniques.
Descriptive analytics, predictive analytics
-goes beyond examining what happened to try to answer the question, "why did this happen?"
Diagnostic
A company wants to determine how to decrease employee turnover. In order to do this, they test whether paying off an employee's student debt will cause fewer employees to leave. The analytic testing whether paying off an employee's student debt causes lower turnover is an example of which type of analytic?
Diagnostic
When confirmatory data analysis techniques are used, what type of analytic is likely being computed?
Diagnostic analytic
Which of the following is a technique to simplify data presentations?
Distance
Illustrate the flow of documents and data among areas of responsibility within an organization, from cradle to grave; shows where each document originates, its distribution, its purpose and its ultimate disposition
Document flowcharts
-assuring the most important message is easily identifiable.
Emphasis
Process represented by a small bolded circle
End
refers to avoiding the intentional or unintentional use of deceptive practices that can alter the user's understanding of the data being presented.
Ethical presentation
An internal auditor validates the daily changes in customer's accounts receivable balances against daily sales made on account less cash collected on receivables
Example of Advanced Testing Techniques
A computer engineer performs a complicated merge of data from five different accounting systems of company subsidiaries. To check her work, she randomly selects 50 transactions from each system to validate to make sure the merge worked correctly.
Example of Audit a Sample
Ashton has 100,000 records. Ashton chooses to audit 1,000 records, or 1% of the total number of records. If Ashton finds 70 errors in those 1,000 records, Ashton can compute a 7% error rate.
Example of Audit sample
A CFO receives a spreadsheet file for review that contains annual pay raises for all company employees. The CFO examines the minimum, maximum, average, and median to make sure the data looks correct before making the final approval for pay increases.
Example of Basic Statistical Tests
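The CFO's check in the card above can be sketched in a few lines of Python. The raise amounts below are made-up illustration data, and `summarize` is a hypothetical helper name, not something from the course material:

```python
import statistics

def summarize(values):
    """Return the basic statistics a reviewer might scan before approving data."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

# Hypothetical annual raises in dollars; the last value is a likely typo.
raises = [1500, 2000, 2000, 2500, 90000]
summary = summarize(raises)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A maximum and mean far above the median, as here, would be a reason to look at the underlying records before approving them.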
Wearing masks does not prevent COVID, but a company concluded that masks do prevent COVID, so people are wearing masks for no reason
Example of Type 1 Error
Wearing a mask does prevent COVID, but the company concluded that masks do not prevent COVID, so people are not wearing masks to keep from spreading the virus
Example of Type 2 Error
If an alarm goes off while there is no fire
Example of Type I error
If an alarm doesn't go off and there is a fire
Example of Type II error
Which of the following reasons describes why transforming data is necessary?
All of the above: data aggregated at different levels need to be joined; data within a field has various formats; multiple data values contained in the same field need to be separated
Which of the following can be used to present data unethically?
All of the above: selectively presenting only part of a viz; with an axis, showing the most recent time closest to the origin; truncating or stretching the axes
At what point in the ETL process should data validation take place?
All of these: during data cleaning; during data structuring; during data standardization
a record accurately indicates a person lives in Nauvoo, Illinois but mistakenly lists the zip code as 26354, but the actual zip code for Nauvoo, Illinois is 62354
Example of Violated attribute dependencies
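One way to catch a violated attribute dependency like the Nauvoo example is to compare each record against a reference table. This is only a sketch: `KNOWN_ZIP_CODES` is a hypothetical one-entry lookup built from the card, and a real check would need a complete reference source (some cities have many zip codes):

```python
# Hypothetical reference table: (city, state) -> expected zip code.
KNOWN_ZIP_CODES = {("Nauvoo", "IL"): "62354"}

def violates_dependency(record):
    """True when the zip code disagrees with the known value for the city/state."""
    expected = KNOWN_ZIP_CODES.get((record["city"], record["state"]))
    # Records for locations missing from the table cannot be judged here.
    return expected is not None and record["zip"] != expected

bad_record = {"city": "Nauvoo", "state": "IL", "zip": "26354"}
good_record = {"city": "Nauvoo", "state": "IL", "zip": "62354"}
print(violates_dependency(bad_record))
print(violates_dependency(good_record))
```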
A terrorist group launches a computer virus aimed at corrupting all transaction data for a corporation by randomly changing the currency of transactions. An internal auditor scans the company's database to see if the transactions appear to be in multiple different currencies.
Example of Visual Inspection
paying employees more decreases the likelihood of employees leaving the company
Example of alternative hypothesis
consulting firm may keep track of positions in the organization such as partner, senior consultant, and research analyst by entering into the database the number 1 for partner, 2 for senior consultant, and 3 for research analyst.
Example of cryptic data values
each of these date formats represents the same date: April 3, 1982; 3 April 1982; 03/04/82; and 04/03/82. A single format should be chosen and used for all dates in a field and typically all dates contained in a file
Example of data consistency
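The four date strings in the card can be standardized to one format once the source format of each value is known. The sketch below assumes that knowledge is available (a string like 03/04/82 is ambiguous on its own), and `to_iso` is a hypothetical helper name:

```python
from datetime import datetime

def to_iso(value, source_format):
    """Parse a date string in a known format and return it as YYYY-MM-DD."""
    return datetime.strptime(value, source_format).date().isoformat()

# Each value is paired with the format it is assumed to have been recorded in.
samples = [
    ("April 3, 1982", "%B %d, %Y"),
    ("3 April 1982", "%d %B %Y"),
    ("03/04/82", "%d/%m/%y"),  # assumed day/month/year
    ("04/03/82", "%m/%d/%y"),  # assumed month/day/year
]
standardized = [to_iso(value, fmt) for value, fmt in samples]
print(standardized)
```

Storing every date in a single format such as ISO 8601 removes the ambiguity for later users of the file.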
Milton Armstrong's telephone number on line 16 is different than his phone number on all other lines. Due to the contradiction error in Milton's phone number, we do not know the true value. The phone number should be corrected so that Milton's phone number is the same throughout the dataset.
Example of data contradiction errors
a system may fail to record the first two digits of a year, and so it is not clear if the date is meant to be 1910 or 2010
Example of data entry errors
The office manager of a Wall Street law firm sold information to friends and relatives about prospective mergers and acquisitions found in Word files. They made several million dollars trading the securities.
Example of data fraud
A data set is created showing the number of hours employees work a week. Sam Howell's data was entered as "400" hours per week. This would be an example of
Example of data threshold violation
a field capturing the number of children a taxpayer claims as dependents in which the taxpayer lists the value of "300."
Example of data threshold violation
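A threshold check like the ones in these two cards can be sketched as a simple range test. The field names and the second employee are hypothetical; the upper bound of 168 is the number of hours in a week, so any value above it is impossible:

```python
def find_threshold_violations(records, field, low, high):
    """Return the records whose value for `field` falls outside [low, high]."""
    return [r for r in records if not (low <= r[field] <= high)]

# Hypothetical timesheet data; Sam Howell's entry comes from the card above.
timesheet = [
    {"employee": "Sam Howell", "hours_per_week": 400},
    {"employee": "Example Employee", "hours_per_week": 40},
]
violations = find_threshold_violations(timesheet, "hours_per_week", 0, 168)
print([r["employee"] for r in violations])
```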
if a column of information listing customer names shows that all names were recorded fully capitalized, one might want to change the formatting such that only the first letter of the first and last name are capitalized.
Example of data validation
A virus pandemic causes governments to shut down all restaurants. A national restaurant chain creates an analysis to see how long their cash reserves can continue to pay employees before the company runs out of money.
Example of descriptive data
A company wants to examine social media data to see if people are saying positive or negative things about the company. Counting positive and negative social media mentions, or using text analysis software to give a numerical score of the tone of the tweets, is an example of
Example of descriptive data analytics
An internal auditor notices an increase in a company's inventory shrinkage (i.e., inventory being stolen). The internal auditor creates a data model that explains what types of inventory are being stolen.
Example of diagnostic data
A large quantity of low gross margin products was sold, and the data analyst finds that the marketing department advertised these products heavily in the last quarter. He wants to know why the marketing department focused on those products.
Example of diagnostic data analytics
Railroad employees entered data to scrap more than 200 railroad cars. They removed the cars from the railway system, repainted them, and sold them.
Example of input fraud
If a data field for city contains the country name Germany, the data values are misfielded. The value Germany should be entered in a data field for country.
Example of misfielded data values
paying employees more will have no effect on their likelihood of leaving the company
Example of null hypothesis
An employee scans a company paycheck, uses desktop publishing software to erase the payee and amount, and prints fictitious paychecks.
Example of output fraud
A tax accountant prepares analyses that shows what will happen to the customers of his client if the country adopts a new tax law.
Example of predictive data
Match.com uses sophisticated prediction algorithms that consider users' stated preferences and their browsing and searching activities in order to match each client with potentially successful future love interests
Example of predictive data Analytics
Amazon.com uses customer purchasing and search patterns to predict (and then display) other products the customer might be interested in purchasing.
Example of predictive data analytics
A corporate accountant designs a cook scheduling system based on past data for meal preparation. The new system should assure that there are always enough cooks scheduled for peak demand at the restaurant.
Example of prescriptive data
United Parcel Service (UPS) designed a real-time solution. The program and subsequent updates optimize drivers' delivery routes to save time, minimize driving distance, reduce emissions, increase safety, and ultimately boost the bottom
Example of prescriptive data analytics
An insurance company installed software to detect abnormal system activity and found that employees were using company computers to run an illegal gambling website.
Example of processor fraud
if a column should have numeric values, sorting will show if there are also characters contained in some entries in the column.
Example of visual inspection
the spread of the data about a prediction inherent in a model.
Failing to consider the variation
Which of the following is NOT an example of computer fraud?
Failure to perform preventive maintenance on a computer
•distinct from other types of fraud in that the individuals who commit the fraud are not the direct beneficiaries.
Financial Statement fraud
Which of the following statements is FALSE regarding flowcharts?
A system flowchart is a narrative representation of an information system
Flowchart symbols that indicate the flow of data, where flowcharts begin or end, where decisions are made, and how to add explanatory notes to flowcharts
Flow and miscellaneous symbols
An analytical technique that describes some aspect of an information system in a clear, concise, and logical manner. Use a set of standard symbols to depict processing procedures and the flow of data
Flowchart
All of the following are recommended guidelines for making flowcharts more readable, clear, concise, consistent, and understandable EXCEPT:
Flowchart all data flows, especially exception procedures and error routines
place more emphasis on the physical characteristics of the system.
Flowcharts
Intentional or reckless conduct, whether by act or omission, that results in materially misleading financial statements. "Cooking the books" (booking fictitious revenue, overstating assets)
Fraudulent financial reporting
Which type of fraud is associated with 50% of all auditor lawsuits?
Fraudulent financial reporting
a process is represented by a rounded-edge rectangle. An explanation of the activity is placed inside the rectangle.
Activity in a process
•the presentation of data in a summarized form.
Aggregate data
In a document flowchart you want to identify
All departments, documents, and processes
Wearing masks does decrease the chances of catching COVID
Alternative hypothesis example
Information that helps explain a business process is entered in the BPD and, if needed, a bolded dashed arrow is drawn from the explanation to the symbol.
Annotation information
Which chart type is best for depicting trends over time?
Area chart
How are data sources and destinations represented in a data flow diagram?
As a square
Pie charts are the most over-used type of charts. This is because they are often used to show comparison. Select which chart type is best for making comparisons
Bar charts
What is the term used for a data flow diagram where there is an inflow of data but no outflow of data
Black hole
Which of the following is NOT a good reason to visualize data?
Building visualizations does not take as much time as writing a report.
The intent is that all business users can easily understand the process from a standard notation. Can show the organizational unit performing the activity.
Business Process Modeling Notation (BPMN)
receiving an order, checking customer credit, verifying inventory availability, and confirming customer order acceptance, shipping the goods ordered, billing the customer, and collecting customer payments are all examples used in a
Business process diagram
A visual way to describe the different steps or activities in a business process, providing a reader with an easily understood pictorial view of what takes place in a business process
Business process diagram (BPD)
A general rule of thumb is that a visualization should only have 3-5 groups in the data area. Putting in more or less than this amount violates which principle?
Goldilocks principle
What pattern do a system flowchart and a process flowchart follow?
Identify the inputs; each input is followed by a process (steps performed on the data); the process is followed by outputs (the resulting new information)
Flowchart symbol that shows input to or output from a system
Input/output symbols
In a document flowchart what does each department get
Its own column
Factors that allow opportunity include:
Lack of internal controls; failure to enforce controls (the most prevalent reason); excessive trust in key employees; incompetent supervisory personnel; inattention to details; inadequate staff
•a projection of the process on the Context diagram. It is like opening up that process and looking inside to see how it works to show the internal sub-processes. You repeat the external entities but you also expand the main process into its subprocesses. Also, data stores will appear at this level.
Level 0 diagram
-Theft of company assets by employees which can include physical assets (e.g., cash, inventory) and digital assets (e.g., intellectual property such as protected trade secrets, customer data)
Misappropriation of assets
Analyn spent the entire day entering information about suppliers into the company database. She did not make a single spelling mistake in any of the entries. However, at the end of the day, Analyn notices that she entered the state into the country field for all of the data. The mistaken data values in the country field are best described as which of the following?
Misfielded data values
Data values that are correctly formatted but not listed in the correct field
Misfielded data values
All of the following are guidelines that should be followed in naming DFD data elements EXCEPT:
Name only the most important DFD elements
When one bar on a bar chart is displayed as much thicker than the others, the thicker bar appears to be much more important because of its increased visual weight. This is an example of
Non-proportional display of data
-a proposed explanation worded as a statement of equality meaning that one of the two concepts, ideas, or groups will be no different from the other concept, idea or group
Null hypothesis
Wearing masks doesn't affect the likelihood of catching COVID
Null hypothesis example
Compute the minimum, maximum, mean, median, and sum for numeric fields to see if the dataset contains a complete set of all the original transactions.
Numeric values in basic statistical tests
Ashton has 100,000 records. Ashton chooses to audit 1,000 records, or 1% of the total number of records. If Ashton finds 70 errors in those 1,000 records, Ashton can compute a 7% error rate. Ashton can assume that
Out of the 100,000 records there could be 7,000 total errors
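The extrapolation in this card is proportion arithmetic: 70 / 1,000 = 7%, and 7% of 100,000 is 7,000. A minimal Python sketch using the card's numbers follows; `extrapolate_errors` is a hypothetical name, and the projection assumes the sample is representative of the full population:

```python
def extrapolate_errors(population_size, sample_size, errors_in_sample):
    """Project a sample error rate onto the full population."""
    error_rate = errors_in_sample / sample_size
    projected_errors = round(error_rate * population_size)
    return error_rate, projected_errors

rate, projected = extrapolate_errors(100_000, 1_000, 70)
print(f"Error rate: {rate:.1%}")
print(f"Projected errors in population: {projected}")
```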
Data analytics techniques to detect fraud include
Outlier detection, anomaly detection using trends and patterns, regression analysis, semantic modeling, and Benford's Law.
What type of computer fraud is stealing, copying, or misusing computer printouts or displayed information
Output Fraud
Which of the following control procedures is most likely to deter lapping?
Periodic rotation of duties
Part to Whole uses what two types of visualizations
Pie chart, treemap
Making sure to use separate training datasets and test datasets is especially important for creating what type of analytic?
Predictive analytic
Indicate which option orders the type of analytic from the one that provides the most value added to an organization to the least value added to the organization.
Prescriptive, predictive, diagnostic, descriptive
These three conditions must be present for fraud to occur
Pressure, opportunity, rationalization
Action that transform data into other data or information
Processes
Flowchart symbol that shows data processing, either electronically or by hand
Processing symbols
illustrates the sequence of logical operations performed in a computer program
Program
•illustrate the sequence of logical operations performed by a computer in executing a program; describes the specific logic to perform a process shown on a system flowchart
Program Flowchart
The documentation skills that accountants require vary with their job function. However, all accountants should at least be able to do which of the following?
Read documentation to determine how the system works
How do accountants use documentation?
Read documentation to understand how a system works (auditors assess risk); evaluate strengths and weaknesses of an entity's internal controls; prepare documentation to demonstrate how a proposed system would work or demonstrate their understanding of a system of internal controls
Requires that auditors understand the automated and manual procedures an entity uses. This understanding can be gleaned through documenting the internal control system—a process that effectively exposes the strengths and weaknesses of the system.
SAS-94
Legislation intended to prevent financial statement fraud, make financial reports more transparent, provide protection to investors, strengthen internal controls at public companies, and punish executives who perpetrate fraud.
Sarbanes-Oxley Act
•requires management to assess internal controls and auditors to evaluate the assessment
Sarbanes-Oxley Act (SOX)
Correlation uses what two types of visualizations
Scatterplot, heat map
refers to making a visualization easy to interpret and understand
Simplification
a process is represented by a small circle.
Start/Begin
Flowchart symbol that shows where data is stored
Storage symbols
Which of the following statements is FALSE about fraud criminals
The psychological profiles of white-collar criminals are significantly different from those of the general public.
Data flow diagram symbol that is represented by a circle
Transformation processes
Chibuzo creates a chart to show the percentage of activities in the accounting function that have been automated over time. She wants to stress the slow rate of change by the department to adopt automation. What is the purpose of Chibuzo's visualization and what type of chart would be best for this purpose?
Trend evaluation, line chart
A DFD consists of the following four basic elements: data sources and destinations, data flows, transformation processes, and data stores. Each is represented on a DFD by a different symbol.
True
Documentation methods such as DFDs, BPDs, and flowcharts save both time and money, adding value to an organization.
True
Making an item in the data area of a viz larger to increase emphasis is an example of using which principle?
Weighting
In a document flowchart what does it show for documents
Where each document originated from and its final disposition
Researchers found significant differences between what two types of people
White collar criminals, violent criminals
The flow of data or information is indicated by
an arrow
help focus on a trend rather than individual values, and are useful when trying to show a progression over time.
area chart
puts the categorical data variable on the x-axis (or on the y-axis if the chart is rotated) and then plots the numerical value on the other axis.
bar chart
draws a line at the median value for a numeric variable and then shows another line for the upper quartile and lower quartile (the connection of these lines forms the box).
boxplot
adds a "bullet" or a small line by each bar that indicates an important benchmark
bullet graph
Any type of fraud that requires computer technology to perpetrate. It leaves little evidence, making it more difficult to detect, and perpetrators can steal more of something, in less time, with less effort
computer fraud
When data is joined together it is called _________, when it is split apart it is called ________.
data concatenation, data parsing
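Concatenation and parsing are inverse operations, which a short Python sketch can make concrete. The function names are hypothetical, and splitting on the first space is an assumption that only holds for simple two-part names:

```python
def concatenate(first_name, last_name):
    """Data concatenation: combine two fields into a single field."""
    return f"{first_name} {last_name}"

def parse_full_name(full_name):
    """Data parsing: split one field into two (assumes 'First Last')."""
    first_name, last_name = full_name.split(" ", 1)
    return first_name, last_name

full_name = concatenate("Milton", "Armstrong")
parsed = parse_full_name(full_name)
print(full_name)
print(parsed)
```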
A graphical depiction of information designed with or without an intent to deceive, that may create a belief about the message and/or its components, which varies from the actual message
data deception
the process is represented by a diamond. An explanation of the decision is placed inside the symbol.
decision
A company uses a boxplot in a visualization. What is likely the purpose of the visualization?
distribution
•shows the flow of documents and data between departments or units, useful in evaluating internal controls
document
Which of the following flowcharts illustrates the flow of data among areas of responsibility in an organization?
document flowcharts
Whether or not someone is registered to vote could be an example of a
dummy variable
A pharmaceutical company is trying to develop a drug that will help cure the most people with a serious disease. To choose the drug that can cure the most people, the data analyst should look at what?
effect size
A level 0 diagram and context diagram should both have the same
external entities with the same flows to and from those entities.
Pressures That Lead To Employee Fraud include
financial, emotional, lifestyle
A DFD is a representation of which of the following?
flow of data in an organization
Changes in the physical characteristics of the process do affect the ___________ but have little or no impact on the ___________
flowchart, DFD
Any means a person uses to gain an unfair advantage over another person.
fraud
shows colors that relate to the magnitude of the different entries.
heat map
a single numeric value is divided into equal-sized bins, and the bin sizes are listed on the x-axis. Then, a bar is used to show the count of each value that falls into the bins.
histogram
Which of the following causes the majority of computer security problems?
human errors
What type of computer fraud is alteration or falsifying input
input fraud
Data flow diagram symbol that is represented by an orange triangle
internal control
Creating cash using the lag between the time a check is deposited and the time it clears the bank.
kiting
Former and current employees, who are much more likely than non-employees to perpetrate frauds (and big ones) against companies, are also called
knowledgeable insiders
Which of the following is a fraud in which later payments on accounts receivable are used to pay off earlier payments that were stolen?
lapping
the x-axis is an ordered unit such as days, months, or years.
line chart
What is the term used in a data flow diagram where there is an outflow of data but no inflow of data
miracle
Every data flow diagram must have
one data inflow and one data outflow
The condition or situation that allows a person or organization to commit and conceal a dishonest act and convert it to a personal gain
opportunity
show which items make up the parts of a total. Appropriate when showing percentages that sum up to 100%
pie chart
A person's incentive or motivation for committing fraud
pressure
What type of computer fraud includes unauthorized system use, including theft of computer time and services
processor fraud
Which of the following conditions is/are usually necessary for fraud to occur? Please select ALL of the correct answers.
rationalization, pressure, opportunity
recasting actions as "morally acceptable" behaviors to maintain self image
rationalizations
In a system flowchart a process will almost always be represented by a
rectangle
where a numeric variable is listed on the x-axis, a different numeric variable is listed on the y-axis, and the values of each are plotted in the data area.
scatterplot
•depicts the data processing cycle for a process; describes the relationship between inputs, processing, and outputs
system
•depicts the relationship among the inputs, processes, and outputs of an AIS.
system flowchart
A program flow chart is drawn for each rectangle in
the system flowchart
A subset of data used to train a model for future prediction
training dataset
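Keeping training and test data separate can be sketched with a simple shuffled split. This is an illustration only: `train_test_split` here is a hypothetical helper written with the standard library, and the 20% test fraction and fixed seed are arbitrary assumptions:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into training and test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

train, test = train_test_split(range(100))
print(len(train), len(test))
```

The model is fit on `train` only, and `test` is held back to check how well the predictions generalize to data the model has not seen.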
nested rectangles to show the amount that each group or category contributes
treemaps
You co-own a theme park. You believe that the longer customers stay in the park, the hungrier they will be which would increase the amount they spend on food. Your co-owner believes that the longer customers stay in the park, the more likely they are to feel nauseated which would decrease the amount they spend on food. Both of you gather data and find some evidence supporting your belief. If the true relation is that there is no relation between time in the park and food sales, what type of error did your co-owner make?
type 1 error
the amount of attention an element attracts
visual weighting
any visual representation of data, such as a graph, diagram, or animation; called a viz for short
visualization
Fraud is a
white collar crime
Researchers found few differences between what two types of people
white collar criminals, the general public
Typically, businesspeople who commit fraud. Usually resort to trickery or cunning and their crimes usually involve a violation of trust or confidence
white-collar criminals
Threats to AIS include
•Natural and political disasters •Software errors and equipment malfunctions •Unintentional acts •Intentional acts
The auditor's responsibilities under SAS No. 99 include
•Understand fraud •Discuss the risks of material fraudulent misstatements •Obtain information •Identify, assess, and respond to risks •Evaluate the results of their audit tests •Document and communicate findings •Incorporate a technology focus
Guidelines for creating a DFD include
•Understand the system that you are trying to represent. •A DFD is a simple representation, meaning that you need to consider what is relevant and what needs to be included. •Start with a high level (context diagram) to show how data flows between outside entities and inside the system. Use additional DFDs at the detailed level to show how data flows within the system. •Identify and group all the basic elements of the DFD. •Name data elements with descriptive names, and use action verbs for processes (e.g., update, edit, prepare, validate, etc.). •Give each process a sequential number to help the reader navigate from the abstract to the detailed levels. •Edit/review/refine your DFD to make it easy to read and understand.
Guidelines for Drawing Flowcharts include
•Understand the system you are trying to represent. •Identify business processes, documents, data flows, and data processing procedures. •Organize the flowchart so that it reads from top to bottom and left to right. •Clearly label all symbols. •Use page connectors (if it cannot fit on a single page). •Draw a rough sketch of the flowchart. •Edit/review/refine to make it easy to read and understand. •Draw a final copy of the flowchart.
There are only 7 unique job positions at a company but 9 different positions are attributed to employees
Violation of validity
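A validity check compares the values found in a field against the set of values that are allowed. In the sketch below the seven job titles are invented for illustration (the card does not name them), and `find_invalid_positions` is a hypothetical helper:

```python
# Hypothetical list of the 7 valid job positions at the company.
VALID_POSITIONS = {
    "Analyst", "Associate", "Manager", "Senior Manager",
    "Director", "Vice President", "Partner",
}

def find_invalid_positions(assigned_positions):
    """Return the distinct positions that are not in the valid set."""
    return sorted(set(assigned_positions) - VALID_POSITIONS)

# Positions attributed to employees: the 7 valid ones plus 2 that are not valid.
assigned = sorted(VALID_POSITIONS) + ["Managr", "Intern", "Analyst"]
print(len(set(assigned)))
print(find_invalid_positions(assigned))
```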
Which of the following techniques is most likely to discover an error where a data analyst did not correctly parse data from one field into two fields?
Visual Inspection
•process of examining data using human vision to see if there are problems.
Visual inspection
Correct; free of error; accurately represents events and activities
Accuracy
possible with a deeper understanding of the content of data.
Advanced testing techniques
Suzette sends Jimmy a flat file with a list of all sales transactions the company made during the last year. Each line contains all the information about a single sale. Jimmy prepares a report that shows three different views of the data (1) the total sales for each quarter, (2) the total sales by customer, and (3) the total sales for the entire year. To make this report, Jimmy had to do which of the following to the data Suzette sent?
Aggregate the data
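Example (a minimal sketch, not part of the original card): the three views in Jimmy's report can be produced by aggregating the flat file. The record layout (date, customer, amount) and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flat file: one row per sale (date, customer, amount).
sales = [
    (date(2023, 1, 15), "Acme", 100.0),
    (date(2023, 2, 20), "Birch", 250.0),
    (date(2023, 5, 3), "Acme", 75.0),
    (date(2023, 11, 9), "Birch", 50.0),
]

by_quarter = defaultdict(float)
by_customer = defaultdict(float)
for sale_date, customer, amount in sales:
    quarter = (sale_date.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
    by_quarter[quarter] += amount             # view 1: total sales per quarter
    by_customer[customer] += amount           # view 2: total sales per customer

total_for_year = sum(amount for _, _, amount in sales)  # view 3: yearly total
```

Each view collapses many detail rows into fewer summary rows, which is why the individual sales can no longer be seen in the report.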
Comparison uses what two types of visualizations?
Bar chart, bullet graph
Does not omit aspects of events or activities; of enough breadth and depth
Completeness
Tests a hypothesis and provides statistical measures of the likelihood that the evidence (data) refutes or supports a hypothesis.
Confirmatory data analysis
Presented in same format over time
Consistency
Data items that have no meaning without understanding a coding scheme.
Cryptic data values
Joleen queried the company database and returned 23 columns of information for her report. In examining the data, she noticed that one column only had values half of the time. Joleen decided to delete this column from her report. This is an example of which of the following?
Data filtering
Among the following statements, which is likely to be detected using visual inspection? The setting: a company extracts data from one system, transforms the data into a new format, and then loads it into a new system. The visual inspection validation tests are performed on a portion of the data in the new setting.
Data from two fields was not concatenated into one field during the transformation process.
process of analyzing data to make certain the data has the properties of high-quality data
Data validation
When a field contains only two different responses, typically 0 or 1, this field is called
Dummy/dichotomous variable
Distribution uses what two types of visualizations?
Histogram, boxplot
When data is aggregated, some of the detailed information is lost. Which of the following is needed if you want to show both the aggregated and disaggregated data together?
Joining the aggregated data with the disaggregated data
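Example (a minimal sketch, not part of the original card): to show totals next to the detail they came from, aggregate first and then join the totals back onto each detail row. The customer names and amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail rows: (customer, amount).
detail = [("Acme", 100.0), ("Birch", 250.0), ("Acme", 75.0)]

# Step 1: aggregate to one total per customer.
totals = defaultdict(float)
for customer, amount in detail:
    totals[customer] += amount

# Step 2: join the aggregate back to every detail row on the customer key.
joined = [(customer, amount, totals[customer]) for customer, amount in detail]
```

Every row in `joined` keeps the individual sale and also carries the customer total.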
answers the question "what should be done?"
Prescriptive
In what different forms can data be presented?
Static graphics, tables, videos, static and dynamic models
Provided in time for decision makers to make decisions
Timely
Incorrect rejection of a true null hypothesis.
Type I error
Failure to reject a false null hypothesis.
Type II error
Data measures what it is intended to measure; conforms to syntax rules and to requirements
Validity
Billy-Bob Barker bakes big, beautiful brownies. However, Billy-Bob notices that the recipe he printed from the company database correctly states that the recipe needs flour but lists the approved salt vendor in place of the approved flour vendor. This is an example of which of the following?
Violated Attribute dependency
Errors that occur when a secondary attribute in a row of data does not match the primary attribute.
Violated attribute dependencies
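Example (a minimal sketch, not part of the original card): one way to find violated attribute dependencies is to compare each row's secondary attribute with a lookup keyed on the primary attribute. The ingredient and vendor names are hypothetical.

```python
# Hypothetical reference table: primary attribute -> expected secondary attribute.
approved_vendor = {"flour": "Mill Co", "salt": "Sea Salt Inc"}

# Hypothetical recipe rows: (ingredient, vendor listed in the database).
recipe_rows = [("flour", "Sea Salt Inc"), ("salt", "Sea Salt Inc")]

# A row violates the dependency when its vendor is not the one approved
# for its ingredient.
violations = [
    (ingredient, vendor)
    for ingredient, vendor in recipe_rows
    if approved_vendor.get(ingredient) != vendor
]
```

Here the flour row is flagged because it lists the salt vendor, as in the brownie recipe question.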
A sale occurred on December 27 but is recorded as occurring the following year on January 4.
Violation of accuracy
An annual evaluation of vendor performance only contains 7 months of data
Violation of completeness
A company switches the denomination of amounts regularly (thousands to millions)
Violation of consistency
Customer purchasing metrics are 2 years old
Violation of timeliness
As part of the data standardization process, items contained in different fields for the same record often need to be combined into a single field. This process is called:
data concatenation
A data point, or a few data points, that lie an abnormal distance from other values in the data.
outlier
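Example (a minimal sketch, not part of the original card): a common rule of thumb flags values more than 1.5 times the interquartile range beyond the quartiles. The 1.5 multiplier and the sample credit limits are assumptions made for illustration.

```python
from statistics import quantiles

# Hypothetical credit limits with one suspicious value.
credit_limits = [2000, 3000, 5000, 6000, 7000, 8000, 10000, 1_000_000_000]

q1, _, q3 = quantiles(credit_limits, n=4)  # quartile cut points
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences lie an abnormal distance from the rest.
outliers = [v for v in credit_limits if v < lower_fence or v > upper_fence]
```

A flagged value is only a candidate for review; it may be a data entry error or a genuine extreme value.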
A proposed explanation worded as a statement of inequality, meaning that one of the two concepts, ideas, or groups will be greater or less than the other concept, idea, or group.
Alternative hypothesis
Using which of the following data validation techniques can the validator estimate a likely error rate in the population of data?
Audit Of A Sample
One of the best techniques for assuring data quality
Audit a sample
select a sample of data items from the original data sources and make sure all those items are listed in the final dataset.
Audit a sample
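Example (a minimal sketch, not part of the original card): auditing a sample means drawing items from the original source, checking each against the final dataset, and using the sample's error rate as an estimate for the population. The record IDs, sample size, and seed are hypothetical.

```python
import random

source_ids = list(range(1, 1001))              # hypothetical original record IDs
final_ids = set(source_ids) - {17, 250, 999}   # final dataset is missing three records

rng = random.Random(42)                        # fixed seed so the sample is repeatable
sample = rng.sample(source_ids, 100)           # audit 100 of the 1,000 source records

# An item is an error if it exists in the source but not in the final dataset.
missing = [record_id for record_id in sample if record_id not in final_ids]
estimated_error_rate = len(missing) / len(sample)
```

The estimate depends on which records happen to be sampled, so a larger sample gives a more reliable error rate.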
if the field captures data about whether a vendor is a preferred vendor or not, the value of 1 would suggest they are a preferred vendor and 0 that they are not. With dummy variables, best practice is to give them a meaningful name rather than a generic name.
Example of dichotomous variable
Data that is inconsistent, inaccurate, or incomplete.
Dirty data
Aggregating data, data joining, and data pivoting are examples of which of the following?
Examples of data structuring
Process of estimating a value that is beyond the data used to create the model.
Extrapolation beyond the range of data
Which of the following techniques is most likely to discover a very large data threshold violation in a dataset containing 10 billion transactions?
Basic Statistical Test
Performed to validate the data
Basic statistical tests
A construction company classifies their projects into one of seven different types. To keep track of project classification, the clerk enters a number from 1 to 7 in the ProjectType field. The values 1 to 7 are best described as ________.
Cryptic data values
the process of analyzing data and removing two or more records that contain identical information.
Data de-duplication
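Example (a minimal sketch, not part of the original card): de-duplication keeps the first copy of each record and drops later records with identical information. The sample records are hypothetical.

```python
# Hypothetical customer records: (name, birthdate).
records = [
    ("Ana Lopez", "1990-01-01"),
    ("Ben Carter", "1985-06-30"),
    ("Ana Lopez", "1990-01-01"),  # exact duplicate of the first row
]

seen = set()
unique_records = []
for record in records:
    if record not in seen:        # keep only the first occurrence
        seen.add(record)
        unique_records.append(record)
```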
The process of replacing a null or missing value with a substituted value. This process only works with numeric data.
Data imputation
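Example (a minimal sketch, not part of the original card): one simple form of imputation replaces each missing value with the mean of the known values. Using the mean is an assumption; the card does not say which substitute value is used. The credit scores are hypothetical.

```python
from statistics import mean

# Hypothetical credit scores; None marks a missing value.
scores = [700, None, 650, 720, None]

known = [s for s in scores if s is not None]
fill_value = mean(known)  # substitute value computed from the known scores

imputed = [fill_value if s is None else s for s in scores]
```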
The process of combining different data sources
Data joining
When a model is designed to fit training data very well but does not predict well when applied to other datasets.
Data overfitting
Involves separating data from a single field into multiple fields.
Data parsing
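Example (a minimal sketch, not part of the original card): parsing splits one field into several. The field names and the assumption that the first space separates first name from last name are hypothetical.

```python
# Hypothetical single field holding two pieces of information.
full_names = ["Ana Lopez", "Ben Carter"]

parsed = []
for full_name in full_names:
    # Split on the first space only, so multi-word last names stay together.
    first_name, last_name = full_name.split(" ", 1)
    parsed.append({"first_name": first_name, "last_name": last_name})
```

Data concatenation is the reverse operation: joining `first_name` and `last_name` back into a single field.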
A technique that rotates data from rows to columns
Data pivoting
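Example (a minimal sketch, not part of the original card): pivoting turns the distinct values of one field into columns. Here each quarter, stored in rows, becomes a column under its customer. The sample rows are hypothetical.

```python
# Hypothetical "long" rows: one row per (customer, quarter, amount).
long_rows = [
    ("Acme", "Q1", 100),
    ("Acme", "Q2", 75),
    ("Birch", "Q1", 250),
]

# Pivot: one entry per customer, with a column for each quarter.
pivoted = {}
for customer, quarter, amount in long_rows:
    pivoted.setdefault(customer, {})[quarter] = amount
```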
The process of standardizing the structure and meaning of each data element so it can be analyzed and used in decision making.
Data standardization
The process of changing the organization and relationships among data fields to prepare the data for analysis.
Data structuring
data errors that occur when a data value falls outside an allowable level.
Data threshold violations
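Example (a minimal sketch, not part of the original card): a threshold check compares each value with an allowable range. The limits of 0 to 80 weekly hours and the employee IDs are hypothetical.

```python
# Hypothetical allowable range for weekly hours worked.
MIN_HOURS, MAX_HOURS = 0, 80

weekly_hours = {"E1": 40, "E2": 95, "E3": -5}

# A value violates the threshold when it falls outside the allowable range.
threshold_violations = {
    employee: hours
    for employee, hours in weekly_hours.items()
    if not MIN_HOURS <= hours <= MAX_HOURS
}
```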
In which step of the data transformation process would you analyze whether the data has the properties of high-quality data?
Data validation
Computations that address the basic question of "what happened?"
Descriptive
Go a step further than diagnostic analytics to answer the question "what is likely to happen in the future?"
Predictive
The benefits of visualizing data relative to reading are
•Processed faster than written or tabular information •Easier to use: users need less guidance to find information with visualized data •Supports the dominant learning style of visual learning, because most people are visual learners
A subset of data not used for the development of a model but used to test how well the model predicts the target outcome
Test dataset
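Example (a minimal sketch, not part of the original card): a test dataset is set aside before the model is built. The 80/20 split, the seed, and the stand-in rows are assumptions made for illustration.

```python
import random

rows = list(range(100))        # hypothetical dataset of 100 records

rng = random.Random(0)         # fixed seed so the split is repeatable
shuffled = rows[:]             # copy, so the original order is untouched
rng.shuffle(shuffled)

split_point = int(len(shuffled) * 0.8)
training_rows = shuffled[:split_point]  # used to develop the model
test_rows = shuffled[split_point:]      # held out to test its predictions
```

A model that scores well on `training_rows` but poorly on `test_rows` is a sign of data overfitting.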
Counting the number of total records or the number of distinct records present before and after a data transformation. It is also possible to compute the length of each value and compare this amount to pre-transformation lengths to see if there are changes.
Text fields in basic statistical tests
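Example (a minimal sketch, not part of the original card): the three text-field checks can be run by comparing the data before and after a transformation. The upper-casing step stands in for a real transformation and is hypothetical.

```python
# Hypothetical text field before and after a transformation.
before = ["Ana Lopez", "Ben Carter", "Ana Lopez"]
after = [value.upper() for value in before]  # stand-in transformation

checks = {
    # Total number of records should not change.
    "total_records_match": len(before) == len(after),
    # Number of distinct records should not change.
    "distinct_records_match": len(set(before)) == len(set(after)),
    # Length of each value should match its pre-transformation length.
    "value_lengths_match": [len(v) for v in before] == [len(v) for v in after],
}
```

A failed check does not say what went wrong, only that the transformation changed something it should not have.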
Julie knows that her report printout should have only two columns of information from the database and that each column should have a dummy variable in it. The report she receives from IT has a single column, and some examples of the values in the column are "10", "01", "11", and "00". What does Julie surmise is the likely problem?
The IT department improperly concatenated the data.
What are the five main purposes of visualization?
comparison, correlation, distribution, trend evaluation, and part-to-whole.
Adi queries the company database to return all values from the field "FullAddress." Adi reviews the information and finds that half of the time the values store the city and country values before the street address and the other half of the time the street address is listed before the city and country. What type of error did Adi find in the database?
data consistency error
Approach that explores data without testing formal models or hypotheses.
Exploratory data analysis