DRM TEST 2 Study Guide
Updating DW 2 Stratedgies
1. Audit columns in the transactional tables like date inserted or date updated so that we know which rows to extract in the next update. 2. Using Triggers on inserts and updates.
Staging Server Reasoning (2 reasons)
1. Reduce the connection time for ETL packages to the source database. This can be a very important issue for large data volumes, creating significant pressure on the source databases while the package is running. 2. Provide a solid restart point in the event of a failure. Not only we avoid accessing the source database again, but we also have a consistent image as of the extract date and time!!
Star Schema
A dimensional model is made up of a central fact table (or tables) and its associated dimensions. The dimensional model is also called a star schema because it looks like a star with the fact table in the middle and the dimensions serving as the points on the star.
Fact Table - Facts or measures
A record in a fact table is a series of measurements, or numeric values such as quantity ordered or sale amount. These numbers are called facts (or measures in Analysis Services).
Dimensional model vs the TPS model
Because of the de-normalization involved in creating the dimensions. Queries against the SQL Server relational database generally perform better — often far better — against a dimensional structure than against a fully normalized structure.
ETL 3 uses
Business Analytics ETL is the reason why more than 50% of corporate business intelligence initiatives fail. BIG Data No big data processing without knowledge of ETL Operational Systems No efficiency without ETL
Data Cleaning Process (transforming Data) 3 ways
Column screens: Test the data within a single column, for nulls, data type violations, or value violations. Structure screens: Test data relationships, for example referential integrity failures. Business rule screens: Test business logic, for example requiring that a Platinum customer is at least five years with the company and has made a purchase in the last two years.
Data Latency
Data latency describes how quickly the transactional data must be delivered to the DW/BI system. Most DW/BI systems process data on a daily basis
Dimensions
Dimensions are implemented as tables in the dimensional model. You can spot dimensions in conversation with the operational people because they are often the "by" words in a report request. For example, a user wants to see sales by month by product by date By promotion
Data warehouse bus matrix (4)
Each row in the matrix is a business process. Each column is a dimension that participates in multiple business processes. The processes down the left side of the matrix follow the organization's value chain. The bus matrix is essentially the enterprise dimensional data architecture.
Operational work: job scheduler
Every enterprise data warehouse needs a robust ETL scheduler. ETL processes should run automatically and independent of human intervention especially if we work with multiple updates a day
ETL
Extraction Transformation loading It involves accessing data from multiple heterogeneous and geographically dispersed resources, manipulating this data, and moving it to the destination system.
Fact Table - 95% of Data
For most transaction-driven organizations, fact tables are the largest tables in the data warehouse database, often making up 95 percent or more of the total relational database size.
Calculations documentation
Fully documenting calculations for functions, allocations, adjustments, and derivations.
many to many between dimension tables
In a banking example, a given account can have one or more customers as signatories, and any given customer can have one or more accounts. The bank might choose to analyze data by customer or account.
Designing the Dimensional Model (4)
It constitutes the blueprint for the development of the DW database. It is the plan on which we reflect for changes and improvements. We use it in our communication with the operational users. We go through multiple iterations of its desing
Error Tables (transforming data)
It might be useful to create error tables to hold rows that have failed a critical data screening.
Transforming Data
Manipulating data through a series of ETL packages to improve quality and merging data from multiple heterogeneous sources
slowly changing dimension (scd)
Most attribute values like date of birth are fixed. However, other attributes like an employee's title might change over time. Changing attributes allow us to understand the dynamics of the business and constitute one of the major reasons for the existence of the DW/BI system. Type 1 and type 2 SCDs. Type 1, we just replace values. Type 2, we keep historical records. It is a business decision to decide which SCD is type 1 or type 2.
working with duplicates(transforming data)
Often records for the dimension tables are derived from several heterogeneous sources. This is common for organizations that have many customer-facing TPS systems that create and manage separate customer data. We should have a strategy in place to identify duplicate records
Loading Data
Physically loading the data into the DW target dimensional models.
ETL 3 KEY points
Politics within the organization. Cooperation with the transactional DBAs. Access to data sources. 80% of the production cost of a DW.Analysts estimate that almost 80% of the capital required for a data warehouse project will be spent on data quality and data transfer projects. 55% of the operational cost for a DW.55% of the total warehouse cost at run time is due to ETL costs alone.
Data quality at TPS level
Push back hard at the transaction systems operations to clean data at the source before it enters the ETL stream
operational work: ETL Security
The ETL system and staging tables should be off limits to all business users
Proposal for a Staging Server
The extract system should be decoupled (separate - independent) from the transactional (source) and destination systems. We should create separate ETL packages for the extract that simply pull the appropriate data from the source systems and stage it, untransformed, in a staging database
Fact Table - Level of detail in the fact table
The level of detail contained in the fact table is called the grain. It is recommended to build fact tables with the lowest level of detail possible. Detailed fact tables provide complete flexibility to roll up the data to any level of summary needed across any dimension
Fact Table - Primary Key
The primary key to the fact table is usually a composite key made up of a subset of the foreign keys from each dimension table
Prioritizing Business Requirements (3)
The prioritization process is a planning meeting involving the DW/BI team, the DW/BI project business sponsors, and other senior managers from across the organization. There needs to be a conclusive consensus of what is needed by the organization. Prioritizing requirements is one of the most important tasks you have.
operational work: recovery and restart system
The role of staged dataAfter your ETL system is in production, failures can occur for countless reasons. You need staged data sets as a basis for restarting the system.
Many to many between fact and dimension tables
The standard relationship between a dimension table and fact table is one-to-many. This means one row in the dimension table will join to many rows in the fact table. For example a product has many sales. However, we can have a many-to-many relationship between the fact table and a dimension. For example when multiple sales reps handle the same order.
Surrogate keys (pro and con)
They protect the DW/BI system from changes in the keys coming from the source system. They allow the DW/BI system to integrate the same data, such as customer, from multiple source systems where they have different keys. The real cost of using surrogate keys is the burden it places on the ETL system.
4 Assumptions of DW
We have a transactional/operational system in place. We understand that a data warehouse is not an archiving system. We have specific goals and we are looking for specific information from the data warehouse initiative. There is not a standard road to follow when building the DW. It will depend on the corporate reporting needs.
Developer Skills:
We need to get the best. It is imperative to have experienced ETL developers as part of a team and know how to communicate in business terms with the operational and DW/BI users SQL and Scripting Excellent SQL skills are required as well!! Every production quality SSIS ETL system will require knowledge of SQL.
Lineage Documentation
We should be able to look at a number in a report and know exactly how and when it entered the data warehouse (or other data destination), and what transformations occurred to it along the way
Data profiling
auditing from the operational system conclusions from operational data
determining important Processes and business requirements
conduct a set of interviews to identify Processes and BI needs
Data Warehouse
it is a database we develop and use for business analytics of a corp.
Data Profiling
the analysis of the source (mostly transactional) data to describe its content and structure.
3 Bullets (operational people)
we must listen and understand the operational people. communication is a top priority we first need to understand the data in the operations/transnational system. we can achieve this by using sql the next step is to define the business requirements