B6 M4-M6
transforming data: cleaning data involves what
- determine the desired output - deduplicate data points, remove inaccurate data, and account for outliers - address missing fields - remove unnecessary attributes - ensure the data is accurate and complete after the cleaning process - remove sensitive information if it is not needed for analysis - split data for analysis - ensure data points are properly formatted
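A minimal sketch of several of these cleaning steps, using only the Python standard library. The records, the field names (`id`, `name`, `ssn`, `amount`), and the choice to drop records with a missing amount are assumptions made for illustration.

```python
# Hypothetical raw records; field names are illustrative only.
raw = [
    {"id": 1, "name": " Ann ", "ssn": "111-11-1111", "amount": "100.50"},
    {"id": 1, "name": " Ann ", "ssn": "111-11-1111", "amount": "100.50"},  # duplicate
    {"id": 2, "name": "Bob", "ssn": "222-22-2222", "amount": None},        # missing field
    {"id": 3, "name": "cy", "ssn": "333-33-3333", "amount": "75"},
]

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen:        # deduplicate on the identifier
            continue
        seen.add(rec["id"])
        if rec["amount"] is None:    # address missing fields (here: drop the record)
            continue
        cleaned.append({
            "id": rec["id"],
            "name": rec["name"].strip().title(),  # consistent formatting
            "amount": float(rec["amount"]),       # text converted to a number
            # "ssn" is deliberately left out: sensitive and not needed for analysis
        })
    return cleaned

cleaned = clean(raw)
```

Dropping incomplete records is only one way to address missing fields; filling in a default or requesting the value from the source are alternatives.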
define relational database
allows data to be stored in different tables, with the tables linked through relationships using key fields
relational database concepts: define data dictionary
also referred to as metadata, provides information about the data in a database
transforming data: what are common manipulations of data
appending demographic and socioeconomic data; creating new variables that are a function of existing variables; creating new variables that classify or categorize existing variables
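A small sketch of two of these manipulations: a new variable computed from existing variables, and a new variable that categorizes an existing one. The field names and the 1000 revenue cutoff are hypothetical.

```python
# Hypothetical customer rows.
customers = [
    {"id": 1, "revenue": 1200.0, "orders": 4},
    {"id": 2, "revenue": 300.0, "orders": 3},
]

def enrich(row):
    out = dict(row)  # copy so the original record is left untouched
    # New variable that is a function of existing variables.
    out["avg_order_value"] = row["revenue"] / row["orders"]
    # New variable that classifies an existing variable (cutoff is an assumption).
    out["tier"] = "high" if row["revenue"] >= 1000 else "standard"
    return out

enriched = [enrich(c) for c in customers]
```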
relational database concepts: define foreign keys
attributes in one table that are also primary keys in another table; a primary key in one table paired with a foreign key in another table is what creates a relationship between tables
define data extraction
automated process, semiautomated process or manual extraction
define the 4 parts of Big Data governance
big data confidentiality; big data privacy; big data ethics; governance responsibility
define customer and marketing analytics
build consumer profiles and analyze spending preferences; allows organizations to optimize their marketing strategies
define continuous data
can take on any value (including decimal values) within a given (finite or infinite) interval
define ordinal data
categorical and not quantitative but it can be ranked in a meaningful way
relational database concepts: define data types
category of data set or data point ex: numerical or text
what is included into transforming data
cleaning data; validating data; manipulating data
what are column charts effective at showing
comparisons
define big data
corporate accumulation of massive amounts of data that can be used for analysis, commonly referred to as data analytics
relational database concepts: define database keys
creates relationships within relational databases
what are the different uses of data analytics (6)
customer and marketing analytics; managerial and operational analytics; risk and compliance analytics; financial analytics; audit analytics; tax analytics
define big data privacy
customer and patient data must be safeguarded from unauthorized access to meet consumer privacy expectations as well as regulatory requirements
loading the data: define the data storage attribute - relationship between elements include validity, completeness, accuracy
validity: data is entered in the correct manner; completeness: no required data is missing; accuracy: data entered is true and free from errors
what are the two steps in data extraction
data identification; obtaining the data
define structured data
defined organizational format that has specific parameters
loading the data: define the data storage attribute - relevance
defining the purpose helps users understand a repository's relevance
define symbol maps
demonstrate data on a geographic map through the use of symbols to help users compare and contrast values
define scatter plot
demonstrates the relationship between two variables; a simple trendline can be added as a form of simple regression to provide information on correlation
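The trendline on a scatter plot is typically an ordinary least-squares fit. The sketch below computes the slope, intercept, and Pearson correlation coefficient by hand; the `trendline` helper and the sample points are illustrative assumptions.

```python
from math import sqrt

def trendline(xs, ys):
    """Least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # correlation between the two variables
    return slope, intercept, r

# Hypothetical points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept, r = trendline(xs, ys)
```

A value of `r` close to 1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak one.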
explain geographic maps
demonstrate values on a geographical map and are typically colored or shaded in a manner to signify numeric values
relational database concepts: explain attributes (columns)
describe the characteristics or properties desired to be known about each entity ex: last name
descriptive analytics =
describing or explaining what has occurred; backward looking
diagnostic analytics =
diagnosing or explaining why it occurred; backward looking
quantitative data =
discrete or continuous
relational database concepts: explain records (rows)
each record contains information about one entity within the table ex: information about a single customer
transforming data: define validating data
ensure data is not lost or inappropriately modified in the cleaning process; may be a visual review, and basic statistical tests (max, min, averages) may be required
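One way to apply basic statistical tests is to compare summary statistics for a field before and after cleaning. In this sketch the amounts, the expected number of removed rows, and the particular checks are all assumptions.

```python
def summarize(values):
    """Basic statistics used to compare a field before and after cleaning."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

before = [100.0, 250.0, 250.0, 75.0]  # hypothetical amounts before cleaning
after = [100.0, 250.0, 75.0]          # after removing one known duplicate
expected_removed = 1

stats_before = summarize(before)
stats_after = summarize(after)

checks = {
    "row_count_as_expected": stats_before["count"] - stats_after["count"] == expected_removed,
    "min_unchanged": stats_before["min"] == stats_after["min"],
    "max_unchanged": stats_before["max"] == stats_after["max"],
}
validation_passed = all(checks.values())
```

The average does shift here, which is expected when a duplicate is removed; a shift that cannot be explained by a known cleaning step would be worth a closer look.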
define data management
ensuring that the data is maintained and stored appropriately; key for every organization
loading the data: define full refresh loading
entire data set is loaded, replacing the previous load
loading the data: what are the data storage requirements and define them (2)
entity integrity: each table must have a unique primary key as a record identifier; referential integrity: a change to a primary key in one table must also cause a change to any related foreign key in a linked table
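Both requirements can be observed in a small SQLite database using Python's built-in `sqlite3` module. The table and column names are hypothetical. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued, and `ON UPDATE CASCADE` is one of several possible ways to propagate a key change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL)"
)
conn.execute("""
    CREATE TABLE sales_order (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customer(customer_id) ON UPDATE CASCADE
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Lee')")
conn.execute("INSERT INTO sales_order VALUES (10, 1)")

# Entity integrity: a second row with the same primary key is rejected.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Kim')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Referential integrity: an order pointing at a nonexistent customer is rejected.
try:
    conn.execute("INSERT INTO sales_order VALUES (11, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Changing the primary key cascades to the related foreign key.
conn.execute("UPDATE customer SET customer_id = 2 WHERE customer_id = 1")
fk_after_update = conn.execute(
    "SELECT customer_id FROM sales_order WHERE order_id = 10"
).fetchone()[0]
```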
relational database concepts: explain tables
establish columns and rows to store specific types of data records; ex: customer table
ETL stands for
extract, transform, and load; used for data analytics
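The three ETL stages can be sketched end to end with the standard library. The inline CSV text stands in for a real source export, and the `sales` table and the rule of dropping rows with a missing amount are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export, inlined so the example is self-contained.
SOURCE = "id,amount\n1,100.50\n2,\n3,75\n"

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and convert text to numbers."""
    return [(int(r["id"]), float(r["amount"])) for r in rows if r["amount"]]

def load(rows, connection):
    """Load: write the cleaned rows into a storage location for analysis."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()

etl_conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE)), etl_conn)
row_count, total = etl_conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales"
).fetchone()
```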
what are the 5 dimensions of big data
five Vs of big data 1. volume 2. velocity 3. variety 4. veracity 5. value
define boxplot
graphical displays that show lower and upper extremes, lower and upper quartiles, as well as the median data point
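The five values a boxplot draws can be computed with the standard `statistics` module. Quartile conventions differ between tools; this sketch assumes the "inclusive" method, and the sample values are made up.

```python
import statistics

def five_number_summary(values):
    """Return the five values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),   # lower extreme
        "q1": q1,             # lower quartile
        "median": median,
        "q3": q3,             # upper quartile
        "max": max(values),   # upper extreme
    }

summary = five_number_summary([2, 4, 4, 5, 7, 9, 10, 12, 15])
```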
define semi-structured data
hybrid of unstructured and structured data; a common example is a CSV file (a file of comma-separated values)
to leverage the power of evolving big data, companies must
identify a data point, then capture it, store it, protect it, and eventually dispose of it (if needed)
loading the data: what are the types of loading (3)
initial (full) loading; incremental loading; full refresh loading
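The difference between the three loading types can be sketched with a dictionary standing in for the data repository. The function names and records are hypothetical, and this simplified incremental load does not handle records deleted from the source.

```python
def initial_load(source):
    """Initial (full) load: copy the entire data set into an empty repository."""
    return dict(source)

def incremental_load(repository, source):
    """Incremental load: add only records that are new or changed."""
    changes = {k: v for k, v in source.items() if repository.get(k) != v}
    repository.update(changes)
    return changes

def full_refresh(source):
    """Full refresh: discard the previous load and reload the entire data set."""
    return dict(source)

repo = initial_load({1: "Ann", 2: "Bob"})
delta = incremental_load(repo, {1: "Ann", 2: "Bobby", 3: "Cy"})
refreshed = full_refresh({3: "Cy"})
```

After the incremental load, `delta` holds only the changed and new records, while `refreshed` contains nothing from the earlier loads.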
five Vs of big data: define value
insights the big data can yield; important to understand the question or business problem that needs to be solved
relational database concepts: define fields
intersection of a column and a row; the information inside the fields is known as data values
define nominal data
is the simplest form of data that cannot be ordered or ranked
transforming data: define manipulating data
it can be supplemented, enhanced, or otherwise manipulated in a way that adds value to the existing data points
define audit analytics
key to an audit: assessing risk; providing assurance around certain operations; establishing thresholds and expectations; improving the quality of the audit by testing full populations
extraction: obtaining the data - explain automated extraction
likely use an application programming interface (API) so extraction is just a matter of a user application accessing the API to obtain the source data
define loading the data
load the data into a software program for analysis or into a data storage location
relational database concepts: what are the different database views
logical database view; physical database view
define big data ethics
make sure authorized personnel are granted the minimum level of access to the data necessary to perform their job functions
define flow charts
map out a process that has a beginning and ending steps and a series of steps in between
extraction: obtaining the data - explain manual extraction
may have to use specialized data mining software or write customized queries to obtain the data
define financial analytics
monitor financial performance through data mining and ratio analysis on a continuous basis
define risk and compliance analytics
monitor their transactions through continuous auditing, continuous monitoring, and continuous reporting
loading the data: define data mart
much like a data warehouse but is more focused on a specific purpose such as marketing or logistics and is often a subset of a data warehouse
qualitative data =
nominal and ordinal data
loading the data: define load verification
once the data is loaded into the data repository, it is vital to validate it to ensure no data was lost in the process
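A simple load verification compares a row count and a checksum of the source rows against the loaded rows. The `fingerprint` helper and the rows are illustrative assumptions; real pipelines often use the verification features of their loading tool instead.

```python
import hashlib

def fingerprint(rows):
    """Row count plus an order-independent SHA-256 digest of the rows."""
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so that load order does not matter
        digest.update(repr(row).encode("utf-8"))
    return len(rows), digest.hexdigest()

source_rows = [(1, 100.5), (2, 75.0)]
loaded_rows = [(2, 75.0), (1, 100.5)]  # same rows, different order
partial_rows = [(1, 100.5)]            # one row lost during the load

load_verified = fingerprint(source_rows) == fingerprint(loaded_rows)
loss_detected = fingerprint(source_rows) != fingerprint(partial_rows)
```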
loading the data: define incremental loading
only the differences between existing data and new data are added to the data repository
loading the data: what are the different data storages
operational data store (ODS); data warehouse; data mart; data lake
predictive analytics =
predicting what will occur; forward looking; forecasts future data points by transforming insight into foresight, projecting what will happen based on historical data
prescriptive analytics =
prescribing what could or should occur; forward looking; how to achieve a desired event
what are the different database keys
primary and foreign
define data analytics
process of taking raw data, identifying trends, and then transforming that knowledge into insights that can help solve complex business problems
what are the types of data
qualitative data; quantitative data
five Vs of big data: define volume
quantity or amount of data points; may also factor in the size of the data
five Vs of big data: define variety
range of data types being processed or analyzed: structured data; semi-structured data; unstructured data
extraction: obtaining the data - explain requesting the data
recipient of the request must be provided with full details on what is needed, including the data file type, format, time period, and required attributes
what is the most efficient and effective method for storing data
relational database
loading the data: what are the different data storage attributes
relevance; elements to be included and excluded; relationship between elements (includes validity, completeness, and accuracy)
define veracity
reliability, quality, or integrity of data; processes should be implemented so that data is cleansed of irregularities, including duplicate fields, missing fields, incorrect formats or characters, transposed fields, or incorrect labeling
loading the data: define data lake
repository similar to a data warehouse but it contains both structured and unstructured data, with data mostly being in its natural or raw format
relational database concepts: define physical database view
represents how data is actually physically stored, processed and/or accessed within a database
relational database concepts: define logical database view
represents the type of data that is stored in a database and is intended to explain the contents as well as the logical structure of a database to users
extraction: what are the different ways to obtain data
requesting the data; automated extraction; manual extraction; the source can be internal or external to the organization
relational database concepts: define relationships
result from a link between a primary key in one table and a foreign key in another table
define big data confidentiality and what it includes
data must be safeguarded to protect it from unauthorized access and exploitation; includes copyrights, patents, trademarks, and trade secrets
define governance responsibility for big data
should be led by a designated individual, like a chief privacy officer, corporate compliance officer, or an equivalent job role; should have input from leaders across the organization, and the program should be periodically updated as necessary
define pie charts
show respective proportions of a whole
define waterfall chart
show the cumulative effect of a series of data points that make up a whole
relational database concepts: what are data queries and reports based on
some form of structured query language (SQL)
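A short sketch of a SQL-based report, run through Python's built-in `sqlite3` module. The tables, columns, and values are hypothetical; the query joins two tables through a primary key and foreign key and totals the orders per customer.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        last_name   TEXT
    );
    CREATE TABLE sales_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Lee'), (2, 'Kim');
    INSERT INTO sales_order VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Report: number of orders and total amount per customer, largest total first.
report = db.execute("""
    SELECT c.last_name, COUNT(*) AS orders, SUM(o.amount) AS total
    FROM customer AS c
    JOIN sales_order AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.last_name
    ORDER BY total DESC
""").fetchall()
```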
five Vs of big data: define velocity
speed of data accumulation or data processing
for big data privacy, to maintain compliance organizations must implement
strong governance practices surrounding what type of data can be collected, what disclosures to make as the data is collected, and what controls must be in place to protect that data
define transforming data
taking the often-unstructured raw data, cleaning it, manipulating it, and validating it to ensure it is accurate and ready for analysis
define dot plot
two-dimensional mapping of observations onto a coordinate plane
extraction: define data identification
understand the issue the business is trying to address to ensure the data request has the proper scope to resolve it
relational database concepts: define primary key
unique identifiers for a specific row within a table, made up of one or more attributes; each row must have a unique primary key; ex: social security numbers
define tax analytics
use this to organize tax information and guidelines, improve tax planning, and monitor tax performance indicators
define managerial and operational analytics
usually run in real time to maximize efficiencies and production within an organization
once the ETL process has been performed, data analytics can be utilized for a variety of tasks, including
validation, planning, insights, risk mitigation, and decision support
when is a stacked column chart effective
very effective when you want to have total comparisons as well as percentage breakdowns of the whole; each column is stratified to show additional details
loading the data: define data warehouse
very large data repositories that are centralized and utilized for reporting and analysis rather than for transaction purposes
extraction: data identification involves determining what 3 things
what attributes to analyze; what time span to use; what risks exist in the data
when are line charts best used
when showing quantitative trends over time; can help users discover hidden trends
when is a pyramid most helpful
when the bottom layer represents an action or a target that must first be achieved before the next layer up can take place; use when needing to understand underlying foundations or building blocks
loading the data: define initial (full) loading
when the entire data set is loaded into a repository
loading the data: define the data storage attribute - elements to be included and excluded
specifying which attributes are included outlines the universe of data points housed within a repository
define discrete data
whole numbers and can only have certain values
what does data extraction dictate
will dictate the tools needed for designing the overall process of extraction
loading the data: define operational data store (ODS)
a repository of transactional data from multiple sources and is often a source for data warehouses
define data
a fact, occurrence, instance, or an otherwise measurable observation; after raw data is organized, it adds value
define unstructured data
a format that does not have predefined parameters and generally lacks organization