Analytics - Midterm 1
What are the attributes that exist in a relational database that are neither primary nor foreign keys?
descriptive attributes
completeness
ensures that all data required for a business process are included in the dataset
If your data analysis project is more declarative than exploratory, it is more likely that you will perform your data visualization to communicate results in
Excel
training data
existing data that have been manually evaluated and assigned a class
target
expected attribute or value that we want to evaluate
Mastering the data can be described via the ETL process. The ETL process stands for
extract, transform, and load
Which of the following describes part of the goal of the ETL process
identify and obtain the data needed for solving the problem
decision support systems
information systems that support decision-making activity within a business by combining data and expertise to solve problems and perform calculations
support vector machines
a discriminating classifier defined by a separating hyperplane; it first finds the widest margin between the classes and then finds the middle line of that margin
XBRL
is used to facilitate the exchange of financial reporting information between the company and the SEC
quantitative data charts
line, box and whisker plot, scatter plot, filled geographic maps
class
manually assigned category applied to a record based on an event
decision boundaries
mark the split between one class and another
Benford's law
observation about the frequency of leading digits in many real life sets of numerical data
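A minimal sketch of how the expected frequencies can be compared with observed data, assuming the standard formulation P(d) = log10(1 + 1/d) for leading digit d. The function names and the sample amounts are hypothetical and only for illustration.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """First nonzero digit of a nonzero number."""
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '1.205...e+02'
    return int(digits[0])

def observed_shares(values):
    """Observed share of each leading digit in a list of amounts (zeros skipped)."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical invoice amounts, for illustration only.
amounts = [120.50, 1890.00, 23.10, 310.75, 1450.00, 96.20, 178.30, 2400.00]
expected = benford_expected()
observed = observed_shares(amounts)
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

With a sample this small the observed shares will not track the expected ones closely; in practice the comparison is only meaningful on large sets of naturally occurring numbers.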
In general, the more complex the model, the greater the chance of
overfitting the data
declarative visualizations
product of wanting to declare or present your findings to an audience
data dictionary
provides descriptions for all of the data attributes of the dataset
pruning
removes branches from a decision tree to avoid overfitting the model
discrete data
represented by whole numbers
What is the most appropriate chart when showing a relationship between two variables
scatter chart
By the year 2020, about 1.7 megabytes of new information will be created every
second
In the late 1960s, Ed Altman developed a model to predict if a company was at severe risk of going bankrupt. He called his statistic Altman's Z-score, now a widely used score in finance. Based on the name of the statistic, which statistical distribution would you guess this came from?
standardized normal distribution
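In the general statistical sense, a z-score expresses a value as a number of standard deviations from the mean. The sketch below shows only that standardization; it is not Altman's bankruptcy formula, and the function name and sample values are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / stdev for v in values]

# Hypothetical data, for illustration only.
scores = z_scores([10, 12, 14, 16, 18])
print([round(z, 2) for z in scores])
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, which is what makes them comparable to the standardized normal distribution.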
models associated with regression and classification data approaches have all except this important part
test data
The purpose of transforming data is
to validate the data for completeness and integrity
charts that show proportions
tree/heat maps, symbol maps, word clouds
composite primary key
two foreign keys from the tables that it is linking combine to make up a unique identifier
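A minimal sketch of a linking table whose primary key is made of two foreign keys, using Python's built-in sqlite3 module. The table and column names (orders, products, order_lines) are hypothetical; quantity is an example of a descriptive attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE orders   (order_id   INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
-- Linking table: the two foreign keys together form the primary key.
CREATE TABLE order_lines (
    order_id   INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO orders   VALUES (1, '2024-01-05');
INSERT INTO products VALUES (10, 'Widget'), (11, 'Gadget');
INSERT INTO order_lines VALUES (1, 10, 3), (1, 11, 1);
""")

def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO order_lines VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert((1, 10, 5))  # same (order_id, product_id) pair already exists
orphan_ok = try_insert((2, 10, 1))     # order 2 does not exist in orders
print(duplicate_ok, orphan_ok)
```

Neither key is unique on its own in order_lines (order 1 appears twice), but the combination identifies each row.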
decision trees
used to divide data into smaller groups
profiling
- attempt to characterize the typical behavior of an individual, group, or population by generating summary statistics about the data
- primarily uses structured data
- discover patterns of behavior
- assess data quality and internal controls
qualitative data
- categorical
- split into nominal and ordinal
big data
- datasets that are too large and complex for a business's existing systems to capture, store, manage, and analyze with their traditional capabilities
- characterized by volume, velocity, and variety
Five steps of the ETL process in order
- determine the purpose and scope of the data request
- obtain the data
- validate the data for completeness and integrity
- clean the data
- load the data for data analysis
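A minimal sketch of the five steps on a tiny in-memory extract, ordered so that validation and cleaning happen before the data are loaded for analysis. The field names, the control total, and the sample rows are all hypothetical.

```python
import csv
import io

# Step 1: purpose and scope. Hypothetical request: invoice IDs and amounts.
NEEDED_FIELDS = ["invoice_id", "amount"]
EXPECTED_ROWS = 3  # assumed control total supplied by the data owner

# Step 2: obtain the data (an in-memory CSV stands in for the real extract).
RAW = 'invoice_id,amount\n1001,"1,250.00"\n1002,$300.50\n1003,75\n'
rows = list(csv.DictReader(io.StringIO(RAW)))

# Step 3: validate the data for completeness and integrity.
if len(rows) != EXPECTED_ROWS:
    raise ValueError("record count does not match the control total")
if not all(r[f] for r in rows for f in NEEDED_FIELDS):
    raise ValueError("missing values found")

# Step 4: clean the data (strip currency symbols and thousands separators).
def clean_amount(text):
    return float(text.replace("$", "").replace(",", "").strip())

cleaned = [
    {"invoice_id": int(r["invoice_id"]), "amount": clean_amount(r["amount"])}
    for r in rows
]

# Step 5: load the data for analysis (here, a list of records ready to use).
total = sum(r["amount"] for r in cleaned)
print(total)
```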
data analytic skills needed by analytic minded accountants
- develop an analytics mindset
- data scrubbing and data preparation
- data quality
- descriptive data analysis
- data analysis through data manipulation
- define and address problems through statistical data analysis
- data visualization and data reporting
test data
- existing data used to evaluate the model
- set of data used to assess the degree and strength of a predicted relationship
ratio data
- has a meaningful zero
- zero means the absence of the thing being measured
- example: money
- most sophisticated type of data
data reduction steps
- identify the attribute you would like to reduce or focus on
- filter the results
- interpret the results
- follow up on results
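A minimal sketch of the data reduction steps as a simple filter. The journal entries, the field names, and the 10,000 cutoff are hypothetical assumptions chosen only to illustrate the idea.

```python
# Hypothetical journal entries, for illustration only.
entries = [
    {"id": 1, "amount": 250.00, "approved_by": "lee"},
    {"id": 2, "amount": 18000.00, "approved_by": None},
    {"id": 3, "amount": 9400.00, "approved_by": "kim"},
    {"id": 4, "amount": 52000.00, "approved_by": "lee"},
]

# Step 1: identify the attribute to focus on (amount); the cutoff is assumed.
CUTOFF = 10_000

# Step 2: filter the results down to the most critical items.
large = [e for e in entries if e["amount"] >= CUTOFF]

# Steps 3-4: interpret and follow up, e.g. large entries with no approver.
follow_up = [e["id"] for e in large if e["approved_by"] is None]
print([e["id"] for e in large], follow_up)
```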
classification steps
- identify the classes you wish to predict
- manually classify an existing set of records
- select a set of classification models
- divide your data into training and testing sets
- generate your model
- interpret the results and select the best model
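A minimal sketch of the classification steps using only the standard library. The "model" here is a deliberately simple threshold rule standing in for a real classifier; the records, the 70/30 split, and the candidate thresholds are all hypothetical assumptions.

```python
import random

# Steps 1-2: classes ("late", "ok") and manually labeled records (days_overdue, class).
records = [(d, "late" if d > 30 else "ok") for d in range(0, 61, 2)]

# Step 4: divide the data into training and testing sets (70/30 split, assumed).
rng = random.Random(42)
shuffled = records[:]
rng.shuffle(shuffled)
cut = int(len(shuffled) * 0.7)
train, test = shuffled[:cut], shuffled[cut:]

def accuracy(threshold, data):
    """Share of records where 'value > threshold' predicts the labeled class."""
    hits = sum(("late" if x > threshold else "ok") == label for x, label in data)
    return hits / len(data)

# Steps 3 and 5: candidate models are simple thresholds; choose using training data only.
candidates = range(0, 61, 5)
best = max(candidates, key=lambda t: accuracy(t, train))

# Step 6: interpret the result on test data the model has not seen.
print(best, accuracy(best, train), accuracy(best, test))
```

Keeping the test set out of model selection is the point of the split: accuracy on the training data alone can look better than the model deserves.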
IMPACT cycle
- identify the questions
- master the data
- perform test plan
- address and refine results
- communicate insights
- track outcomes
exploratory visualization
- lines between steps P, A, and C of the IMPACT cycle are not clearly divided
- aligns with performing the test plan within visualization software
- example: Tableau, gaining insight while working with the data
quantitative data
- more complex than qualitative data
- differences between data points are meaningful
- can be counted, ranked, and averaged, and a standard deviation can be calculated
- split into interval and ratio
interval data
- no meaningful zero
- example: temperature
data analytics
- process of evaluating data with the purpose of drawing conclusions to address business questions
- aims to transform raw data into knowledge to create value
nominal data
- simplest form of data
- can only be counted
Slicing and dicing the data, finding correlations, revising and re-running the analysis would be considered to be part of which stage of the IMPACT cycle
Address and refine results
When evaluating classifiers, you need to be careful to strike a balance between what two things
Complexity of the model and accuracy of the classification
Which of the following is not a step for cleaning the data
Deleting any results that are unfavorable to the results you were hoping to retrieve
Which of the following is not one of the considerations for determining the purpose and scope of the data request
Determining how the data will be cleaned
Variance analysis, a common practice in management accounting, is an example of _____ analytics
Diagnostic
In which format do analysts typically prefer to analyze data
Flat file
______ looks for similarities between portions, or segments, of the text of each potential match
Fuzzy match
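A minimal sketch of a fuzzy match using `difflib.SequenceMatcher` from the Python standard library, which scores two strings by their matching segments. The helper names, the 0.8 cutoff, and the vendor names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1] based on matching segments of the two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(name, candidates, cutoff=0.8):
    """Return the most similar candidate at or above the cutoff, else None."""
    scored = [(similarity(name, c), c) for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

# Hypothetical vendor names, for illustration only.
vendors = ["Acme Supply Co.", "Baker Industries", "Northwind Traders"]
print(best_match("ACME Supply Company", vendors))
print(best_match("Zenith Labs", vendors))
```

The cutoff is a judgment call: a lower value catches more variant spellings but also produces more false matches that need manual review.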
When is a foreign key required?
If two tables are related in a relational database, one of the two must have a foreign key
The four benefits of storing data in a relational database are completeness of data, no redundant data, business rules are enforced, and communication and ________ of business processes
Integration
Which of the following is not one of the means of cleaning the data after extraction and validation
Load the data into the software program in preparation for analysis
Which of the following is not an existing Audit Data Standard
Manufacturing subledger
Machine learning, artificial intelligence, and decision support systems are all examples of _____ analytics
Prescriptive
The extraction process requires two steps. One of the steps is determine ___________ of the data request
Purpose and scope
Data Analytics may use what source to assess the probability of a goodwill write-down, warranty claims, or the collectibility of bad debts
Social media
Which of the following is not one of the means of cleaning the data after extraction and validation
Transform the data into a usable form
McKinsey Global Institute estimates that Data Analytics could generate up to $3_______ in value each year
Trillion
classification
attempt to assign each unit in a population into a few categories
co-occurrence grouping
attempt to discover associations between individuals based on transactions involving them
clustering
attempt to divide individuals into groups in a useful or meaningful way
similarity matching
attempt to identify similar individuals based on data known about them
link prediction
attempt to predict a relationship between two data items
qualitative data charts
bar, pie, stacked bar
ordinal data
can be counted, categorized, and ranked
continuous data
can take on any value within a range
As mentioned in the chapter, which of the following is not a common way that data will need to be cleaned after extraction
clean up trailing zeroes
data reduction
data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items
regression
data approach used to predict a specific dependent variable value based on independent variable inputs using a statistical model
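A minimal sketch of regression with one independent variable, fitted by ordinary least squares. The function name and the spend/sales figures are hypothetical; real projects would typically use a statistics package and more than one predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (independent) and sales (dependent).
spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]
intercept, slope = fit_line(spend, sales)

# Predict the dependent variable for a new input value.
predicted = intercept + slope * 6
print(intercept, slope, predicted)
```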
structured data
data that are stored in a database or spreadsheet and are readily searchable