DM - Data Warehouse and Data Mart Modeling
comparing dimensional modeling and ER modeling as data warehouse/data mart design techniques
-ER modeling can be used as a conceptual data warehouse/data mart design technique, followed by relational modeling as logical data warehouse/ data mart design technique -dimensional modeling can be used both for conceptual data warehouse/data mart design and logical data warehouse/data mart design
surrogate key
-in a star schema, all dimension tables are given a simple, non-composite system-generated key -values for these are typically simple auto-increment integer values -values have no meaning or purpose except to give each dimension a new column that serves as a primary key within the dimensional model instead of the operational key
dimensionally modeled data warehouse development
1. a set of commonly used dimensions known as conformed dimensions is designed first 2. fact tables corresponding to the subjects of analysis are then subsequently added 3. a set of dimensional models is created where each fact table is connected to multiple dimensions, and some of the dimensions are shared by more than one fact table 4. in addition to the originally created set of conformed dimensions, additional dimensions are included as needed 5. the result is a data warehouse that is a collection of intertwined dimensionally modeled data marts, i.e. a constellation of stars
for every dimensions under consideration, two must questions must be answered
1. can the dimension table be useful for the analysis of the chosen subject? 2. can the dimension table be created based on the existing data sources?
three major components of the data warehouse
1. source systems 2. extraction-transformation-load (ETL) infrastructure 3. the data warehouse itself and the front-end applications
Azure
Microsoft azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers
snowflake model
star schema that contains the dimensions that are normalized -usually not used in dimensional modeling: non-normalized (not snowflaked) dimensions provide for simpler analysis, normalization is usually not necessary for analytical databases
in a star schema
the chosen subject of analysis is represented by a fact table -designing this involves considering which dimensions to use with the fact table representing the chosen subject
typical dimension in a star schema contains
-attributes whose values do not change such as store size and customer gender -attributes whose values change occasionally and sporadically over time, such as customer zip and employee salary
data warehouse/data mart modeling approaches
-normalized data warehouse -dimensionally modeled data warehouse -independent data marts
characteristics of dimensions and facts
-typical dimension contains relatively static data, while in a typical fact table, records are added continually, and the table rapidly grows in size -in a typical dimensionally modeled analytical database, dimension tables have orders of magnitude few records than in fact tables
operational data
typically reflects detailed data -typically represents the current state of affairs in the real world
dimensional modeling
a modeling technique tailored specifically for analytical database design purposes -regularly used in practice for modeling data warehouses and data marts
ER modeling
a predominant technique for visualizing database requirements, used extensively for conceptual modeling of operational databases -can be used during the development of data warehouses and data marts
analytical data
both detailed and summarized -can represent both the current situation and snapshots from the past
dimensionally modeled data warehouse as a source
can be used as a source for dependent data marts and other views, subsets, and/or extracts
type 1
changes the value in the dimension's record - new value replaces the old -no history is preserved -simplest approach, used most often when a change in a dimension is the result of an error
row indicator
column that provides a quick indicator of whether the record is currently valid
timestamps
columns that indicate the time interval for which the values in the records are applicable
fact tables
contain measures relating to the subject of analysis and the foreign keys (associating fact tables with dimension tables) -measures in this are typically numeric and are intended for mathematical computation and quantitative analysis
dimension tables (dimensions)
contains descriptions of the business, organization, or enterprise to which the subject of analysis belongs -columns in dimension tables contain descriptive information that is often textual but can also be numeric -provides a basis for analysis of the subject
type 2
creates a new additional dimension records using a new value for the surrogate key every time a value in a dimension record changes -used in cases where history should be preserved -can be combined with the use of timestamps and row indicators
primary purpose of a data warehouse
data analysis
dimensional modeling use
data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts -in addition to using the regular relational concepts (PKs, FKs, integrity constraints, etc) this distinguishes between two types of tables -dimensions -facts
data mart
data store based on the same principles as a data warehouse, but with a more limited scope
dimensionally modeled data warehouse
data warehouse as a collection of dimensionally modeled intertwined data marts (i.e. constellation of dimensional models) that integrates analytically useful information from the operational data sources -same as the normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process
granularity of the fact tables
describes what is depicted by one row in the fact table -coarser granularity aggregated fact tables are quicker to query than detailed fact talbes -coarser granularity tables are limited in terms of what information can be retrieved from them
slowly changing dimension
dimension that contains attributes whose values can change -three types
dependent data mart
does not have its own source systems -data comes from the data warehouse
detailed fact table
each record refers to a single fact
transaction-level detailed fact table
each row represents a particular transaction
multiple facts in a dimensional model
when multiple subjects of analysis can share the same dimensions, a dimensional model contains more than one fact table -referred to as a constellation or galaxy of starts -enables quicker development of analytical databases for multiple subjects of analysis, because dimensions are re-used instead of duplicated -straightforward cross-fact analysis
aggregated fact table
each record summarizes multiple facts -has less rows than a detailed fact table
line-item detailed fact table
each row represents a line item of a particular transaction
normalized data warehouse
envisions a data warehouse as an integrated analytical database modeled by using the traditional database modeling techniques of ER modeling and relational modeling, resulting in a normalized relational database schema -populated with the analytically useful data from the operational data sources via the ETL process -serves as a source of data for dimensionally modeled data marts and for any other non-dimensional analytically useful data sets
additional possible fact attributes
fact table contains: -FKs connecting the fact table to the dimension tables -the measures related to the subject of analysis -in certain cases, fact tables can contain other attributes that are not measures -two of the most typical additional attributes are: 1. transaction identifier 2. transaction time
detailed vs aggregated fact tables
fact tables in a dimensional model can contain either detailed data or aggregated data
detailed fact tables granularity
have a fine level of granularity because each record represents a single fact
Relational modeling
standard method for logical modeling of operational databases -can be used during the development of data warehouses and data marts
aggregated fact tables granularity
have a coarser level of granularity than detailed fact tables as records in aggregated fact tables always represent summarizations of multiple facts
operational systems
have a shorter time horizon of data than analytical systems
type 3
involves creating a "previous" and "current" column in the dimension table for each column where changes are anticipated -applicable in cases in which there is a fixed number of changes possible per column of a dimension, or in cases when only a limited history is recorded -can be combined with the use of timestamps
end users of analytical databases
only able to retrieve data, and updates of the data by the end users are not allowed
data in operational systems
regularly updated by the users
star schema
result of dimensional modeling is a dimensional schema containing facts and dimensions
independent data mart
stand-alone data mart, created in the same fashion as the data warehouse. -has its own source systems and ETL infrastructure
independent data marts
stand-alone marts are created by various groups within the organization, independent of other stand-alone data marts in the organization -consequently, multiple ETL systems are created and maintained