INFS CH 8
Dimensional modeling
is a data design methodology used for designing subject-oriented analytical databases (e.g., data warehouses or data marts). The result of relational dimensional modeling is a dimensional relational schema containing facts and dimensions, shown in Figure 8.1. The dimensional schema is often referred to as the star schema.
Data Warehouse (Data Mart) Modeling Approaches
Contemporary methodologies offer several data modeling options for designing analytical databases, such as data warehouses and data marts. The following are three of the most common data warehouse and data mart modeling approaches: • normalized data warehouse • dimensionally modeled data warehouse • independent data marts These approaches differ in how they utilize the data modeling techniques we introduced in the earlier chapters in this book. In the remainder of this chapter, we will examine and illustrate the basic concepts of these approaches, as well as their differences and similarities.
Independent Data Marts
Even though the majority of the discussions about data modeling approaches for data warehouses involve the two above-listed approaches (the normalized and dimensionally modeled data warehouses and their variations), there is a third approach that should also be acknowledged and discussed. This approach involves the creation of so-called independent data marts, as illustrated by Figure 8.35. In this method, stand-alone data marts are created by various groups within the organization, independent of other stand-alone data marts in the organization. Consequently, multiple ETL systems are created and maintained. There is a consensus among virtually all members of the data warehousing community about the inappropriateness of using the independent data marts approach as a strategy for designing an enterprise-wide analytical database. There are obvious reasons why independent data marts are considered an inferior strategy. First, this strategy does not result in a data warehouse but rather in a collection of unrelated independent data marts. While the independent data marts within one organization may end up as a whole containing all the necessary analytical information, such information is scattered and difficult or even impossible to analyze as one unit. The inability for straightforward analysis across the enterprise is a major weakness of this approach. Another shortcoming of the independent data marts approach is the existence of multiple unrelated ETL infrastructures. ETL, as we mentioned in Chapter 7, is usually the most time- and resource-consuming part of the data warehouse or data mart project. Having multiple unrelated ETL efforts virtually guarantees the unnecessary duplication of similar expensive processes, which of course results in unnecessary waste. In spite of these obvious disadvantages, a significant number of corporate analytical data stores are developed as a collection of independent data marts. The reason for this seeming paradox lies in the lack of initial enterprise-wide focus when data analysis is addressed. In such a case, a number of departments within an organization take a "go-at-it-alone" approach to developing the data marts for their analytical needs. This is commonly due to the "turf" culture in organizations, where individual departments put more value on their independence than on cross-department collaboration. In those cases, the existence of independent data marts is more of a symptom of the problems with the organization's leadership and culture than a result of deliberately adopting an inferior data warehousing approach.
Type 1 Approach
The Type 1 approach is the simplest and is used most often when a change in a dimension is the result of an error. The Type 1 approach simply changes the value in the dimension's record, where the new value replaces the old value. If the Type 1 approach is used, no history is preserved. The following simple example illustrates the Type 1 approach
Type 3 Approach
The Type 3 approach is applicable in cases in which there is a fixed number of changes possible per column of a dimension, or in cases when only a limited history is recorded. The Type 3 approach involves creating a "previous" and "current" column in the dimension table for each column where changes are anticipated. The Type 3 approach would be appropriate in this scenario if customers' tax brackets can change only once, or if only the two latest values for the customers' Tax Bracket attribute are ever needed for analysis in this organization.
Dimension tables (dimensions)
contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs. Columns in dimension tables contain descriptive information that is often textual (e.g., product brand, product color, customer gender, customer education level), but can also be numeric (e.g., product weight, customer income level). This information provides a basis for analysis of the subject. For example, if the subject of the business analysis is sales, it can be analyzed by product brand, customer gender, customer income level, and so on.
detailed fact tables
each record refers to a single fact
Surrogate key
Typically, in a star schema all dimension tables are given a simple, non-composite, system-generated key, also called a..... For example, instead of using the primary key ProductID from the PRODUCT table in the operational database as the primary key of the PRODUCT dimension, a new surrogate key column ProductKey is created. One of the main reasons for creating a surrogate primary key (e.g., ProductKey), and not using the operational primary key (e.g., ProductID) as a primary key of the dimension is to enable the handling of so-called slowly changing dimensions.
line-item detailed fact table
each row represents a line item of a particular transaction. For example, the table SALES shown in Figures 8.15 and 8.22 is a line-item detailed fact table because each row represents a line item of a sales transaction. The finest granularity fact available about the subject sales in the underlying data source (the ZAGI Retail Company Sales Department Database) is represented by the single line item of a sales transaction.
transaction-level detailed fact table
each row represents a particular transaction. Consider the example of a ZippyZoom Car Rental Agency, where each rental transaction is performed in one branch and involves one car and one customer. An operational database for ZippyZoom Car Rental Agency would record for each car rental transaction the unique RentalTID and the amount taken in addition to the information about the date of the rental, the car rented, and the customer.
granularity
s we mentioned in Chapter 3, the granularity of the table describes what is depicted by one row in the table. Therefore, the granularity of the fact table describes what is depicted by one row in the fact table. Detailed fact tables have a fine level of granularity because each record represents a single fact. Aggregated fact tables have a coarser level of granularity than detailed fact tables, as records in aggregated fact tables always represent summarizations of multiple facts.
slowly changing dimension
A typical dimension in a star schema contains either attributes whose values do not change (or change extremely rarely), such as store size and customer gender, or attributes whose values change occasionally and sporadically over time, such as customer zip and employee salary. A dimension that contains attributes whose values can change is often referred to as a slowly changing dimension. There are several different approaches to dealing with slowly changing dimensions. The most common approaches are referred to as Type 1, Type 2, and Type 3.
aggregated fact tables
each record summarizes multiple facts
constellation (galaxy) of stars
A dimensional model with multiple fact tables (such as the one shown in Figure 8.19) is referred to as a constellation (galaxy) of stars. This approach allows for quicker development of analytical databases for multiple subjects of analysis, because dimensions are reused instead of duplicated. Also, due to the sharing of dimensions, this approach enables straightforward cross-fact analysis, such as a comparison of the average daily number of products sold vs. the average daily number of defective products found and removed per store, region, quarter, product category, and so on.
Snowflake Model
If the dimensions are normalized within the dimensional model, such models are referred to as snowflake models. Figure 8.29 shows a snowflake version (i.e., normalized version) of the star-schema model shown in Figure 8.15. In practice, snowflaking is usually not used in dimensional modeling. One reason for not using snowflaking in dimensional modeling is that not-normalized (i.e., not snowflaked) dimensions provide for simpler analysis. A normalized star schema results in a larger number of tables to consider during the analysis, which adds to the complexity of the process. Compare schemas in Figures 8.15 and 8.29. If both schemas were implemented, they would both end up containing the same data. However, an analyst using the schema in Figure 8.15 could perform sales-related analytical tasks using four dimensions, while an analyst using the schema in Figure 8.29 would have to use seven dimensions to perform the same analytical tasks. Another reason for not snowflaking in dimensional modeling is that normalization is usually not necessary for analytical databases. Recall from Chapter 4 that the primary reason for normalizing operational databases is to avoid update anomalies. Also recall from Chapter 7 that analytical databases, such as data warehouses and data marts, are not subject to update anomalies, since they are "append and read only" databases. Since dimensional modeling is used primarily to design analytical business intelligence databases that cannot be affected by update anomalies, normalization as a prevention of update anomalies is simply not necessary.
Fact tables
contain measures related to the subject of analysis. In addition, fact tables contain foreign keys associating them with dimension tables. The measures in the fact tables are typically numeric and are intended for mathematical computation and quantitative analysis. For example, if the subject of the business analysis is sales, one of the measures in the fact table sales could be the sale's dollar amount. The sale amounts can be calculated and recalculated using different mathematical functions across various dimension columns. For example, the total and average sale can be calculated per product brand, customer gender, customer income level, and so on.
Normalized Data Warehouse
One option for modeling data warehouses envisions a data warehouse as an integrated analytical database modeled by using ER modeling and relational modeling, resulting in a normalized relational database schema. The normalized data warehouse is populated with the analytically useful data from the operational data sources via the ETL process, and it serves as a source of data for dimensionally modeled data marts and for any other non-dimensional, analytically useful data sets. The data warehouse as a normalized integrated analytical database was first proposed by Bill Inmon, and hence the normalized data warehouse approach is often referred to as the Inmon approach.3 Once a data warehouse is completed and populated with the data from the underlying sources via the ETL infrastructure, various analytically useful views, subsets, and extracts are possible based on this fully integrated database. One of the primary types of analytical data sets resulting from the normalized data warehouse is indeed a dimensionally modeled data mart, which can then be queried using OLAP/ BI tools (to be covered in Chapter 9). Such data marts, which result from the larger data warehouse either as views (virtual tables) or as physical extracts, are known as dependent data marts.
degenerate dimension
The approach of including a transaction identifier in the fact table is often referred to in the literature and practice as a degenerate dimension where the term "degenerate" signifies "mathematically simpler than." The term merely indicates that it is simpler to include an event identifier within the fact table than to create a separate dimension for it.
Dimensionally Modeled Data Warehouse
Another approach views a data warehouse as a dimensional model that integrates analytically useful information from the operational data sources. As Figure 8.33 illustrates, this approach is the same as the normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process. The difference is the technique used for modeling the data warehouse. In this approach, a set of commonly used dimensions known as conformed dimensions is designed first. For example, in a retail company, conformed dimensions such as CALENDAR, PRODUCT, and STORE can be designed first, as they will be commonly used by subjects of analysis. Fact tables corresponding to the subjects of analysis are then subsequently added. A set of dimensional models is created in which each fact table is connected to multiple dimensions, and some of the dimensions are shared by more than one fact table. In addition to the originally created set of conformed dimensions, other dimensions are included as needed. The result is a data warehouse that is a dimensional model with multiple fact tables, such as the one shown in Figure 8.19. In Figure 8.19, a data warehouse designer could have first created dimensions CALENDAR, CUSTOMER, STORE, and PRODUCT as conformed dimensions, anticipating that they would be useful for multiple subjects represented by fact tables. Then, based on the requirements that called for analysis of two subjects, sales and defects, two fact tables, SALES and DEFECTS, were created. These fact tables conveniently had the already-made conformed dimensions CALENDAR, CUSTOMER, STORE, and PRODUCT ready to connect to. Once the fact tables SALES and DEFECTS were created and connected to the dimensions (SALES to CALENDAR, CUSTOMER, STORE, and PRODUCT; and DEFECTS to CALENDAR, STORE, and PRODUCT), the data warehouse model was in place.
Type 2 Approach
The Type 2 approach is used in cases where history should be preserved. The Type 2 approach creates a new additional dimension record using a new value for the surrogate key every time a value in a dimension record changes. The same example used to illustrate the Type 1 approach will be used to illustrate the Type 2 approach. The Type 2 approach is the most commonly used approach for dealing with slowly changing dimensions. It enables straightforward handling of multiple changes of dimension attributes values. The Type 2 approach is often combined with the use of additional columns in the dimensions, called timestamps. Timestamps indicate the time interval for which the values in the record were applicable. Timestamp columns indicating the start and end dates can be added to each column subject to a change of values in the dimension record. Another column entitled row indicator can also be added to the entire table. The role of this column is to provide a quick indicator of which records are currently valid. Assuming the Customer table contains the data since the date 1.1.2010, and also assuming that Susan joined the High tax bracket on the date 1.1.2020, the following illustrates the use of timestamps and a row indicator by the Type 2 approach: