ISA401 exam 1
EAI tools
enable event-driven (real-time) data to be captured and used in an integrated way across disparate source systems; can be used to feed ETL tools
timely
except for real-time data warehousing, data need not be real time, but they must be current enough that decision makers can react in a timely manner
operational data are...
transient, normalized, generally restricted in scope, and often of poor quality
● How are changes in dimension values handled?
○ Add new column
● What is granularity? Why is it important?
○ Granularity relates to the fact table
○ Tells us the level of information and determines the size of the fact table
○ Fine grain: more data, more ways to slice & dice
● What is a surrogate key?
○ Key that doesn't have a business meaning; a system-generated unique identifier
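A minimal sketch of how surrogate keys are often assigned when loading a dimension table; the table and column names here are hypothetical, not from the course:

```python
# Hypothetical dimension load: the business key (customer_code) is kept,
# but rows get a meaningless, system-generated surrogate key to join on.
customers = [
    {"customer_code": "ACME-01", "name": "Acme Corp"},
    {"customer_code": "GLOBX-07", "name": "Globex Inc"},
]

dim_customer = [
    # enumerate(..., start=1) hands out 1, 2, 3, ... as surrogate keys
    {"customer_key": key, **row}
    for key, row in enumerate(customers, start=1)
]
```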
logical vs physically stored data
○ Logical means the data mart references the data warehouse
○ Physical means the data is physically stored in the data mart
quality controlled
reconciled data must be of unquestioned quality and integrity
comprehensive data
reconciled data reflect an enterprise wide perspective whose design conforms to the enterprise data model
cleanse process
role of this ETL process is to identify erroneous data, not fix it
what are different sources of data
spreadsheet, database, sensors, social media
what is BI
the application of knowledge gained from business analytics to make informed business decisions
normalized
the data are fully normalized, which provides greater flexibility than denormalized data
log capture
the most commonly used method of incremental extraction; only images that have been logged after the last capture are selected from the log
joining
the process of combining data from various sources into a single table or view
selection
the process of partitioning data according to predefined criteria
aggregation
the process of transforming data from a detailed level to a summary level
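The three transformations above (joining, selection, aggregation) map directly onto basic pandas operations; a minimal sketch with made-up tables:

```python
import pandas as pd

sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "region": ["East", "West", "East"],
    "amount": [100.0, 250.0, 75.0],
})
products = pd.DataFrame({"product_id": [1, 2], "category": ["Tools", "Toys"]})

# Joining: combine data from two sources into a single view
joined = sales.merge(products, on="product_id")

# Selection: partition the data according to a predefined criterion
east = joined[joined["region"] == "East"]

# Aggregation: roll detailed rows up to a summary level
summary = east.groupby("category", as_index=False)["amount"].sum()
```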
what are the differences between transactional and informational systems
transactional: executing and completing transactions for business processes; does not have all the information you need (an "island of data")
informational: data mining, analytics
MUST USE BOTH SYSTEMS TO HAVE A FULL VIEW OF THE COMPANY
descriptive analytics
what happened in the past; this is the most commonly used by businesses; it's the simplest and most basic form
data analytics can help answer questions such as...
what has happened in the past?
why did it happen?
what could happen in the future? with what certainty?
what actions can we take now to support or prevent certain events from happening in the future?
prescriptive analytics
what should we do?
predictive analytics
what will happen in the future, and with what certainty?
is the fact table normalized?
yes
is data analytics interdisciplinary
yes, it involves computer science, mathematics/statistics, and domain knowledge
Snowflake Schema
■ An extension of star schema where the dimensions are decomposed into multiple related tables
Factless Fact Table
■ No nonkey data in a factless fact table; only FKs for the associated dimensions
■ Used for tracking events and possible occurrences
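A minimal sketch of a factless fact table, using hypothetical attendance-tracking tables (not from the course):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only foreign keys, no numeric measures: counting rows answers
    -- event questions like "how many attendances per class?"
    CREATE TABLE fact_attendance (
        date_key    INTEGER,  -- FK to a date dimension
        student_key INTEGER,  -- FK to a student dimension
        class_key   INTEGER,  -- FK to a class dimension
        PRIMARY KEY (date_key, student_key, class_key)
    );
""")
```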
Galaxy Schema (Multiple FACT TABLES )
■ Often used to store facts for different processes (with different combinations of dimensions)
■ Has conformed DIMENSIONS
○ Difference between DBA and DWA
■ Security and privacy is a pressing issue in DW administration
■ Safeguarding data
■ Government regulations
■ Profile management (who can access what data)
○ Data warehouse administrator (DWA)
■ Technical knowledge
■ Business knowledge
■ Process knowledge
■ Communication skills
● What are the variations of star schema?
○ 2 variations
○ Galaxy
○ Snowflake
● How many steps are there in ETL? What is the sequence?
○ 4 steps: Extract, clean, transform, load
● Hosted data warehouses (via cloud computing)
○ Alternative to in-house development
○ Less investment in infrastructure
● Two types of Data Marts
○ Dependent (data, ETL): here you do have a central data warehouse
○ Independent (data, ETL): here you do not have a central data warehouse
● If data marts are not linked together, data silos can result
● Independent data marts are easier to deploy but harder to maintain
● Dependent data marts get their data from the EDW and feed it to end users
● A star schema is for which data layer?
○ Derived data
○ Two kinds of tables: fact and dimension
● Data warehouse development approaches
○ EDW approach (top-down)
○ Data mart approach (bottom-up)
● Traditional versus active DW
○ Active DW: enabling real-time data updates for real-time analysis and real-time decision making
○ Ex: banks need to update withdrawal information every second
■ Most businesses that make less urgent decisions from data update their information nightly
● How is fact table different from dimension tables?
○ Fact table participates in one-to-many relationships with dimension tables
○ Fact table is on the MANY side
○ Fact table holds QUANTITATIVE data that you can slice & dice
● What types of activities fall under data cleaning/scrubbing/wrangling?
○ Finding errors and notifying the source system of those errors
successful BI implementation
success of a BI implementation is measured by its widespread usage for better (data-driven) decision making
the typical BI user community includes:
- strategic level: all levels of management
- tactical level: operational users
OLAP (informational systems)
- provide a place for data to be stored and prepared for analytical purposes
- hold large volumes of historical data: archived, aggregated data that may have come from a transactional system originally
- data are extracted from other systems and loaded into the informational system periodically, possibly in summarized form
- support managerial decision making
2 phases of data reconciliation
1) during an initial load, when the EDW is first created
2) during subsequent updates, to keep the EDW current
5 steps of data reconciliation
1) mapping and metadata management
2) capture
3) scrub
4) transform
5) load and index
4 characteristics of a data warehouse
1) subject oriented: organized by subject, such as sales, product, or customers, and contains information related only to that specific subject
2) time variant: time is an important element of analytics
3) integrated: data are integrated in one place for the purpose of analysis
4) non-volatile: after data are entered into the database, they cannot be changed
● Data warehouse
○ A system that extracts, cleans, conforms, and consolidates data periodically from the source systems into a dimensional data store
○ Usually keeps years of history
○ It is expensive to refresh the system → it puts a strain on transactional systems
● What are the three layers of data in a DW data architecture? What is the purpose of each layer?
○ Derived data (star schemas / data marts built for analysis)
○ Reconciled data (detailed, quality-controlled, enterprise-wide data)
○ Operational data (data in the source systems)
self service BI
PROS: allows BI users to become more self-reliant and less dependent on the IT organization
CONS: can lead to chaotic reporting; IT should work with the business to determine key metrics and agree on vocabulary
● What types of data extraction are there?
STATIC and INCREMENTAL
● What is always a dimension?
TIME
● A data mart may contain several star schemas. (T/F?)
TRUE
● Granularity of FACT table
What level of detail do we need? Most important design step.
ERP
a fully integrated system to allow functional business areas to share data
static extract
a method of capturing a snapshot of the required source data at a point in time
data scrubbing (cleansing)
a process of using pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before transforming and moving the data to the data warehouse
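A minimal scrub-step sketch, assuming a hypothetical ZIP-code field; consistent with the notes, it only identifies bad records rather than fixing them:

```python
import re

# Pattern recognition on a hypothetical field: a US ZIP code is 5 digits,
# optionally followed by a 4-digit extension.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

records = [
    {"id": 1, "zip": "45056"},
    {"id": 2, "zip": "OXFORD"},  # erroneous: not a ZIP code
]

# Flag the bad rows so they can be reported back to the source system
errors = [r for r in records if not ZIP_PATTERN.match(r["zip"])]
```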
data analytics
accessing and cleaning data, transforming it, and analyzing it
refresh mode
an approach to filling a data warehouse that involves bulk rewriting of the target data at periodic intervals
update mode
an approach to filling a data warehouse in which only changes in the source data are written to the data warehouse
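A minimal sketch contrasting the two load modes against a hypothetical SQLite target table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

def refresh_load(rows):
    # Refresh mode: bulk-rewrite the target table at periodic intervals
    conn.execute("DELETE FROM dw_sales")
    conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)

def update_load(changed_rows):
    # Update mode: write only the changes captured in the source data
    conn.executemany("INSERT OR REPLACE INTO dw_sales VALUES (?, ?)", changed_rows)

refresh_load([(1, 100.0), (2, 250.0)])  # initial full load
update_load([(2, 260.0), (3, 75.0)])    # later, only the deltas
```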
● Periodic (non-volatile) data
are never physically altered or deleted once they have been added to the data store
transactional systems
- availability close to 100%
- updateable
- process transactions quickly
- store current information
- support the organization's business functions
incremental extract
captures only the changes that have occurred in the source data since the last capture
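A minimal incremental-extract sketch using a last-capture timestamp as a watermark; the row layout is hypothetical:

```python
from datetime import datetime

last_capture = datetime(2024, 1, 1)  # when the previous extract ran

source_rows = [
    {"id": 1, "updated_at": datetime(2023, 12, 30)},  # unchanged: skipped
    {"id": 2, "updated_at": datetime(2024, 1, 15)},   # changed: extracted
]

# Select only the rows that changed since the last capture
delta = [r for r in source_rows if r["updated_at"] > last_capture]

# Advance the watermark so the next run picks up where this one left off
if delta:
    last_capture = max(r["updated_at"] for r in delta)
```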
● Transient (volatile) data:
changes to existing records are written over previous records, thus destroying the previous data content
PowerPivot
column-oriented, so data access is much quicker when forming queries
clustering
customer profiling, descriptive analytics
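A minimal customer-profiling sketch with k-means from scikit-learn; the two features and the data values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy customer profiles: [annual spend, store visits per month]
X = np.array([[100, 1], [120, 2], [900, 10], [950, 12], [80, 1]])

# Descriptive analytics: group similar customers without any labels
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # cluster assignment per customer, e.g. low vs. high spenders
```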
big data (three v's)
data sources (variety)
data storage (volume)
processing and analyzing (velocity)
detailed data
data are detailed rather than summarized which provides maximum flexibility in terms of how data can be structured
historical data
data are periodic to provide a historical perspective
structured data
data organized such that it can be read and used by people and computers
- strings, numerics, dates
- stored in cells or fields of fixed length
- spreadsheets: data integrity might be a concern, file size is limited, easy to use
- flat files: .csv, .txt
what is the difference between data analytics and data science
data science is about creating models, data analytics is about using the models
● STAR SCHEMA
has a FACT table surrounded by several DIMENSION tables through 1:M relationships
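A minimal star-schema sketch in SQLite, with hypothetical sales tables: each dimension sits on the ONE side and the fact table on the MANY side of a 1:M relationship:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: denormalized reference data
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, cal_date TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT,
                              category TEXT);  -- folded in, not snowflaked

    -- Fact table: grain is one row per product per day; the measures
    -- (units_sold, revenue) are the quantitative data to slice & dice
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER,
        revenue     REAL,
        PRIMARY KEY (date_key, product_key)
    );
""")
```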
OLAP
online analytical processing systems are used to extract information from data stored by OLTP systems
- routine sales reports by product, region, salesperson
- often built on top of a data warehouse, where the data are not transactional
OLTP
online transaction processing systems are constantly involved in handling updates (add/edit/delete) to operational databases
- databases are geared towards OLTP
- the main goal of OLTP is transaction accuracy
- ERP systems are OLTP systems that generate and store transaction data
churn classification
predicting who is likely to defect from your product/offering
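A minimal churn-classification sketch with scikit-learn; the features and labels are toy values, not real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features: [months as a customer, support tickets filed]
X = np.array([[24, 0], [2, 5], [36, 1], [3, 7], [12, 2]])
y = np.array([0, 1, 0, 1, 0])  # 1 = defected (churned), 0 = stayed

# Predictive analytics: fit on past behavior, then score a current customer
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[4, 6]])[0, 1])  # estimated churn probability
```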
● Are relationships between dimensions allowed?
○ There are relationships between the dimension and fact tables
○ Relationships from dimension to dimension ARE NOT ALLOWED
○ One exception: when you have a hierarchy of tables (you can have a relationship between product and product category)
● Relationships between dimensions typically allowed?
NOT ALLOWED
dimension tables?
denormalized
what are applications of analytics
fraud detection, demand forecasts
data transformation
involves converting data from the format of the source operational system to the format of the enterprise data warehouse
● Most common data model
is dimensional model, which is usually implemented as a star schema
data reconciliation
is the process responsible for transforming operational data into reconciled data; the differences between the two are so great that data reconciliation is typically the most difficult part of building a data warehouse
traditional pivot tables
limitations: the data source must be Excel; limited number of tables; size of data (Excel can only hold around 1 million rows)
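A minimal sketch of the same slice-and-dice in pandas, which sidesteps the Excel limitations above; the sales data are made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 200, 150, 50],
})

# Equivalent of an Excel pivot table, but the source need not be an
# Excel sheet and is not capped at ~1 million rows
pivot = sales.pivot_table(index="region", columns="product",
                          values="amount", aggfunc="sum")
```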
● The simplest dimensional (derived data)
model where FACTS, MEASURES, DATA VALUES are separated from DIMENSION & REFERENCE data
● Measures:
money, sales volume, head count, inventory, profit, actual versus forecast
● Finer grain→
more refined analysis
● Dimensions
products, salespeople, market segments, business units, geographical locations, distribution channels, country, or industry
● The purpose of ETL
○ Operational data
■ Transient
■ Normalized
■ Restricted in scope
■ Can be of poor quality
○ Reconciled data
■ Detailed
■ Historical
■ Normalized
■ Comprehensive
■ Quality controlled
○ 80-90% of the cost of running a data warehouse comes from ETL (getting operational data into reconciled form)
○ ETL is also known as harmonizing data
● What is the purpose of an operational data store
○ Short-term data that you retrieve from data sources and put in interim storage, where you can do ETL (extract, transform, load) before loading it into the data warehouse
● What types of activities fall under data transformation?
○ Single- & multi-field transformations
○ Aggregations
● What types of data load are there
○ Static extract = full/refresh load
○ Incremental extract = update load
● DW Scalability can be in terms of
○ The amount of data
○ The number of concurrent users
○ The complexity of user queries
○ Business functions supported
○ Impact of cloud computing → easier to get a scalable system
● What is the main difference between a data warehouse and a data mart?
○ The main difference is SCOPE
○ Data mart is for a specific business unit (accounting, sales)
○ Data warehouse is for an entire organization