INFS CH 7
Data Update Difference
Data in operational systems is regularly updated by the users. In operational databases, in addition to issuing queries for the retrieval of data, users routinely insert, modify, and delete data. On the other hand, the end users of analytical databases are only able to retrieve data, and updates of the data by the end users are not allowed.
Data Warehouse Components
Every data warehousing system has three major components at its core: source systems, extraction-transformation-load (ETL) infrastructure, and the data warehouse itself. Most data warehousing systems also have the front-end applications component.
Functional Differences
Functional differences are defined as differences in the rationale for the use of operational and analytical data. Two particular functional differences that we will describe relate to the audience and the orientation of operational and analytical data.
Data Warehouse Front-End (BI) Applications
Just like operational databases, data warehousing systems usually also have front-end applications that allow straightforward access to their functionalities for users who are engaging in indirect use. This is illustrated by Figure 7.6. Data warehouse front-end applications are also referred to as BI applications.
The Data Warehouse Definition
Now that we have examined the differences between the analytical and operational information, we will examine the definition of the most fundamental analytical data repository: the data warehouse.
Technical Differences
Technical differences are defined as the differences in the way that operational and analytical data is handled and accessed by the DBMS and applications. In particular, we describe here the technical differences related to the queried amounts of data and frequency of queries, the update of data, and the redundancy of operational and analytical information.
Structured Repository
The data warehouse is a database containing analytically useful information. Any database is a structured repository, with its structure represented in its metadata. Hence, the data warehouse, being a database, is also a structured repository. In other words, the data warehouse is not a disorganized random mass of data.
Enterprise-Wide
The data warehouse is a repository of analytically useful information in the organization. The term "enterprise-wide" refers to the fact that the data warehouse provides an organization-wide view of the analytically useful information it contains. For example, if one of the subjects of the data warehouse is cost, then all of the analytically useful data regarding cost within the operational data sources throughout the entire organization will be brought into the data warehouse.
Developing Front-End (BI) Applications
The process of developing front-end (BI) applications refers to designing and creating applications for indirect use by the end users. Front-end applications are included in most data warehousing systems and are often referred to as business intelligence (BI) applications. Data warehouse front-end applications usually contain interfaces, such as forms and reports accessible via a navigation mechanism, such as a menu.
Data Orientation Difference
The term "orientation" in the context of the discussion about the operational and analytical information refers to the purpose for which the information is organized. Operational databases are typically created to support an application serving one or more business operations and processes. For example, operational databases, such as a shipping company order-entry database or a dental office appointment management database, serve to facilitate operations by providing a data store for applications supporting those operations. That is why operational databases are referred to as application-oriented. On the other hand, analytical databases are created for the analysis of one or more specific business subject areas, such as sales, returns, cost, or profit. Analytical databases are organized around their subjects of analysis, and that is why they are referred to as subject-oriented.
Subject-Oriented
The term "subject-oriented" refers to the fundamental difference in the purpose of an operational database system and a data warehouse. An operational database system is developed in order to support a specific business operation. On the other hand, a data warehouse is developed to analyze specific business subject areas.
operational information (transactional information)
refers to the information collected and used in support of day-to-day operational needs. Since operational information results from individual transactions, such as an ATM withdrawal or purchase of an airline ticket, operational information is also sometimes referred to as transactional information.
Retrieval of Analytical Information
A data warehouse is developed for the retrieval of analytical information, and it is not meant for direct data entry by the users. The only functionality available to the users of the data warehouse is retrieval. The data in the data warehouse is not subject to modifications, insertions, or deletions by the end users. New data in the data warehouse is periodically loaded from the operational data sources and appended to the existing data, in an automatic fashion. The data that eventually gets older than the required time horizon is automatically purged from the data warehouse (and possibly archived and/or summarized). However, the data in the data warehouse is not subject to changes. That is why the data in the data warehouse is referred to as nonvolatile, static, or read-only.
ETL
For any data warehouse, the infrastructure that facilitates the retrieval of data from operational databases into the data warehouses is known as an extraction-transformation-load (ETL) infrastructure. ETL includes the following tasks: • Extracting analytically useful data from the operational data sources; • Transforming such data so that it conforms to the structure of the subject-oriented target data warehouse model while ensuring the quality of the transformed data through processes such as data cleansing or scrubbing; • Loading the transformed and quality assured data into the target data warehouse. These three components—source systems, ETL, and data warehouse—form the central part of any data warehousing system.
Data Audience Difference
Operational data supports day-to-day operations of businesses and organizations and, as such, is used by all types of employees, customers, and other users for various tactical purposes. For example, a sales clerk uses operational information to charge the appropriate amount to the customer for his or her purchase, or a traveler uses operational information to book a flight to his or her destination. In contrast to the widespread constituency of operational data, analytical data is used by a narrower set of users for decision-making purposes. For example, the CEO of a retail chain may use a complex and voluminous set of analytical information to determine which existing stores to close or where to open new stores.
Requirements Collection, Definition, and Visualization
The first and most critical step in the development of data warehouses is the requirements collection, definition, and visualization. If this step is successful, the remaining steps have a great chance of success. However, if this step is done incorrectly, all of the remaining steps, and consequently the entire project, will be futile. This step is highlighted in Figure 7.8 with a gray background in order to underscore its critical nature.
Data Warehouse Modeling
The first step following requirement collection, definition, and visualization is data warehouse modeling. We use the term "data warehouse modeling" to refer to the creation of the data warehouse models that are implementable by the data management software. As we mentioned in Chapter 1, this type of modeling is also known as logical data modeling or implementational data modeling, as opposed to conceptual data modeling, which is simply a visualization of the requirements, independent of the logic on which a particular data management software is based.
Historical
When compared to operational information, analytical information has a longer time horizon. Given that a data warehouse contains analytical information, its time horizon is longer (usually substantially so) than the time horizon in the operational databases. The term "historical" refers to the larger time horizon in the data warehouses than in the operational databases. For example, many traditional operational databases have a time horizon of 60 to 90 days, where it is quite common for data warehouses to contain multiple years worth of data.
Data Warehouse Deployment
After the data warehouse and its associated front-end applications are implemented, the next step is data warehouse deployment. This step involves releasing the data warehouse and its front-end applications for use to the end users. Typically, prior to this step, an initial load is executed to populate the implemented data warehouse with an initial set of data from the operational data sources via the ETL infrastructure.
Steps in Development of Data Warehouses
As was the case with regular database systems, data warehousing systems are developed through various steps that should be taken in a certain order. This is illustrated by Figure 7.8, which depicts the core development activities during the life cycle of a data warehousing (DWH) system.
Data Redundancy Difference
As we described in detail in the earlier chapters of this book, reducing the redundancy of information is a major goal of operational databases. One of the main reasons for minimizing the redundancy in operational databases is to eliminate the possibility of update anomalies. However, the updating of data by the end users is not allowed in analytical databases, and therefore there is no particular danger of update anomalies. Hence, eliminating redundancy is not as critical in analytical databases as it is in operational databases.
The Next Version of the Data Warehouse
Typically, a data warehouse in use is followed by another iteration of the data warehouse. The new version of the data warehouse should be created following the same development steps. As with the initial version of the data warehouse, the development of subsequent versions of the data warehouse will start with the requirements collection, definition, and visualization step. Of course, unlike the initial version, in the subsequent versions not all requirements will be collected from scratch. Original requirements provide the starting point for additions and alterations. Many of the additions and modifications result from observations by the end users during the use of the previous version indicating the ways in which the data warehouse can be improved or expanded. Other new requirements may stem from changes in the operational data sources or changes in underlying technology. As with any other information system, new subsequent versions are practically inevitable in data warehousing.
Data Warehouse
The data warehouse itself is the repository of the analytical data integrated from the source systems. The data warehouse is sometimes referred to as the "target system," to indicate the fact that it is a destination for the data from the source systems. A typical data warehouse periodically retrieves selected analytically useful data from the operational data sources. In so-called "active" data warehouses, the retrieval of data from operational data sources is continuous.
Data Marts
A data mart is a data store based on the same principles as a data warehouse, but with a more limited scope. While a data warehouse contains data about multiple subjects of analysis retrieved from operational databases across an entire organization, a data mart is typically smaller, containing data about one subject, and does not necessarily have an enterprise-wide focus. The table in Figure 7.7 summarizes the differences between the terms "data warehouse" and "data mart." There are two major categories of data marts: independent data marts and dependent data marts. An independent data mart is a stand-alone data mart, created in the same fashion as the data warehouse. It has its own source systems and ETL infrastructure. The difference between an independent data mart and a data warehouse is in the factors listed in Figure 7.7: a single subject, fewer data sources, smaller size, shorter implementation, and often a narrower focus. A dependent data mart does not have its own source systems. Instead, its data comes from the data warehouse, which explains the term "dependent data mart." Dependent data marts are simply a way to provide users with a subset of the data from the data warehouse, in cases when users or applications do not want, need, or are not allowed to have access to all the data in the entire data warehouse.
Source Systems
In the context of data warehousing, source systems are operational databases and other operational data repositories (in other words, any sets of data used for operational purposes) that provide analytically useful information for the data warehouse's subjects of analysis. Every operational data store that is used as a source system for the data warehouse has two purposes: • the original operational purpose; • as a source system for the data warehouse. In its original purpose, the operational data store serves operational users in its operational function. That same operational data store can also provide the data for the data warehouse as one of the data warehouse's source systems. It is important to note that both of those functions are done simultaneously. In other words, once it becomes a source system for a data warehouse, an operational data store continues its operational functions in the same way it did before it was a source system for a data warehouse.
Creating ETL Infrastructure
The process of creating ETL infrastructure refers to creating necessary procedures and code for: • automatic extraction of relevant data from the operational data sources; • transformation of the extracted data so that its quality is assured and its structure conforms to the structure of the modeled and implemented data warehouse; • the seamless load of the transformed data into the data warehouse. ETL infrastructure has to account for and reconcile all of the differences in the metadata and the data between the operational sources and the target data warehouses. In many cases, organizations have multiple separate operational sources with overlapping information. In such cases, the process of creating the ETL infrastructure involves deciding how to bring in such information without creating misleading duplicates (i.e., how to bring in all the useful information while avoiding the uniqueness data quality problem). Due to the large number of details that have to be considered, creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process.
Data Warehouse Use
Once the data warehouse is deployed, end users engage in data warehouse use. Data warehouse use involves the retrieval of the data contained within the data warehouse. The data warehouse can be used by the end users indirectly via the front-end applications. Data warehouses can also be accessed directly by using the language of the data management software used to host the data warehouse. For example, relational data warehouses can be queried directly by issuing SQL statements. However, a much more common type of direct data warehouse use occurs when end users engage in an ad-hoc analytical querying of data warehouses via the so-called online analytical processing (OLAP) tools, which are also known as business intelligence (BI) tools.
Data Warehouse Administration and Maintenance
To support end users engaging in data warehouse use, various data warehouse administration and maintenance activities are undertaken. Similar to operational database administration and maintenance activities, data warehouse administration and maintenance activities include dealing with technical issues (such as providing security for the information contained in the data warehouse), ensuring sufficient hard-drive space for the data warehouse content, and implementing backup and recovery procedures.
analytical information
refers to the information collected and used to support decisions involving tasks that require data analysis. As we noted in Chapter 1, an example of analytical information is information showing a pattern of use of ATM machines (such as what hours of the day have the highest and lowest number of withdrawals) or information revealing sales trends in the airline industry (such as which routes in the United States have the most and fewest sales). It is important to note that analytical information is based on operational information. For example, the analytical information showing a pattern of use of ATM machines at different times of the day is created by combining numerous instances of transactional information resulting from individual ATM withdrawals.
Detailed and/or Summarized Data
A data warehouse, depending on its purpose, may include detailed data or summary data, or both. The detailed data is also called atomic data or transaction-level data. For example, a table in which each ATM transaction is recorded as a separate record contains detailed data (i.e., atomic or transaction-level data). On the other hand, a table in which a record represents calculations based on multiple instances of transaction-level data contains summarized data at a coarser level of detail. For example, a summarized data record could represent the total amount of money withdrawn in one month from one account via an ATM. A data warehouse that contains the data at the finest level of detail is the most powerful, because all summaries can be calculated from it and stored if they are to be repeatedly used. However, there are cases when storing all of the analytical data at the transaction level of detail for large time horizons is cost prohibitive. For example, for some organizations the financial resources necessary for the specialized hardware and software required to store and process such enormous data sets may be out of reach. Also, in some other cases, organizations may decide that for some subjects of analysis, summarizations are adequate and transaction-level detail data is not necessary. In cases when the company is not able or does not want to have all of its required analytical information at the finest level of detail, the data for some or all (depending on the particular situation) of the subjects of analysis in the data warehouse is kept only at a certain level of summarization.
Data Makeup Differences
Data makeup differences are differences in the characteristics of the data that comprises the operational and analytical information. Three particular data makeup differences between the operational and analytical data that we describe here are data time-horizon, data level-of-detail, and data time-representation.
Creating the Data Warehouse
Once the data warehouse model is created, the next step is creating the data warehouse. This step involves using database management software and implementing the data warehouse model as the actual data warehouse that is initially empty. Typically, data warehouses are modeled as relational databases. For this reason, they are implemented using a relational DBMS. Regular relational DBMS packages, such as Microsoft SQL Server or Oracle, can be used for implementing both operational databases and data warehouses. However, other relational DBMS packages, such as Teradata, that are specialized for processing large amounts of data typically found in data warehouses are also available for implementation of data warehouses. Creating a data warehouse is a straightforward process that involves data warehouse developers using the functionalities and capabilities of a DBMS to implement the data warehouse model as an actual functioning analytical database.
Data Level-of-Detail Difference
Operational data typically reflects detailed data. In other words, the details of every individual transaction are recorded. Such level of detail is also often referred to as fine detail level. For example, each phone call is recorded in a telecom company's operational database. Usually the calling number, called number, the time of the call, and the call duration are recorded. Summarized data, such as the total duration of all of the customer's phone calls this week, can be calculated from the detailed data. Such summarized values are typically not stored in operational databases because they can usually be calculated almost as quickly as they could be retrieved if they were stored. Also, the summarized data is frequently subject to changes, in which case storing the summarized data becomes pointless. For example, every time the customer makes another phone call, the value of the total duration of weekly calls must change. Instead of being stored, the summaries are usually implemented in the operational databases as derived attributes using formulas. Therefore, every time the customer makes another phone call, the total duration of weekly calls is automatically recalculated and updated in the front-end application. On the other hand, analytical databases can contain both detailed and summarized data that is physically stored. Summarizations, such as a customer's weekly durations of phone calls for all weeks in the past 10 years, are often precomputed and physically stored in analytical databases. Physically storing such summarizations in analytical databases is feasible because these summarizations are not subject to change, due to their historical nature. For example, a summary of weekly durations of phone calls in the previous year is not going to change. If analysts need such summarized information frequently, it makes sense to store the result of such a calculation rather than recalculating it every time it is needed. Recalculations are usually appropriate in operational databases due to the dynamic nature of the data, the relatively small amounts of data stored, and the relatively short time horizon. However, in analytical databases containing huge amounts of data due to long-time horizons, storing summarized data, instead of recalculating it every time it is needed, provides a significant time saver for analytical queries.
Data Time-Representation Difference
Operational data typically represents the current state of affairs in the real world, while analytical data can represent both the current situation and snapshots from the past. For example, a bank customer will have one value in an operational database representing his or her checking account balance. This value will represent the current balance. When the customer makes a withdrawal or a deposit, the checking account balance value will change accordingly, and the old balance will be replaced by the new current balance. An analytical data repository can contain the client's current balance, but it can also contain values over time, called snapshots, that show, for example, the client's checking account balances at the end of each month for the entire time horizon. These snapshots would be calculated once and then physically stored for the repeated use of the analysts, instead of recalculating the snapshots every time they are needed.
Difference in Queried Amounts of Data and Frequency of Queries
Operational queries typically process much smaller amounts of data than analytical queries. On the other hand, operational queries are typically issued much more often and by more users than analytical queries. It is usually possible to configure the system to optimally process frequent queries requiring smaller amounts of data or to optimally process less frequent queries requesting large amounts of data. However, it is usually not possible to technically configure and optimize the system for both of those types of scenarios. This is yet another reason why data warehouses are physically separate data stores.
Data Time-Horizon Difference
Operational systems have a shorter time horizon of data than analytical systems. In many operational systems the data time horizon is 60 to 90 days because most queries that are requested for operational purposes require the data from a relatively short time horizon. After that time period, the data is removed from the operational system and (typically) archived. On the other hand, analytical data repositories typically must be able to provide trend and pattern analysis over years of data. Consequently, analytical databases have a much longer time horizon than operational databases. The time horizon for an analytical database often spans years of data. The need for different time horizons of data is one of the main reasons for the physical separation of operational and analytical databases. Operational databases with shorter time horizons serve as sites for the operational use of data, while a separate analytical repository of data with a longer time horizon serves as a site for analytical queries.
Integrated
The idea behind a data warehouse is to create a repository of analytical information in the organization. This repository is physically separate from the existing operational databases in the organization. The data warehouse integrates the analytically useful data from those various operational databases. Integration refers to the process of bringing the data from multiple operational databases into a single data warehouse. Of course, in this process no data is actually removed from the operational sources. Instead, the analytically useful data from various operational databases is copied and brought into the data warehouse.
Time-Variant
The term "time-variant" refers to the fact that a data warehouse contains slices or snapshots of data from different periods of time across its time horizon. With these data slices, the user can create reports for various periods of time within the time horizon. For example, if the subject of analysis is cost and the time horizon is a number of years, we can analyze and compare the cost for the first quarter from a year ago versus the cost for the first quarter from two years ago.