BIS 4: The Data Warehouse
Describe why data warehouse is integrated
- Constructed by integrating multiple, heterogeneous data sources, for example relational databases, or flat files and spread-sheets - Data cleaning and harmonisation prior to storage to Ensure data quality and consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
Describe why data warehouse is subject-oriented
- Organised around major subjects, such as: customer, product, sales. - Provides a simple and concise view around its particular subject issues by excluding data that are not useful in the decision support process. - Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
Describe the first data warehouse layer
1. The bottom tier is a warehouse database server that is often a relational database system. - Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources. - These tools and utilities perform data extraction, cleaning, and transformation, as well as load and refresh functions to update the data warehouse (ETL). - The data are extracted using application program interfaces known as gateways and allow client programs to generate SQL code to be executed at a server. Examples include ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity). - This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
Describe the second data warehouse layer
2. The middle tier is an OLAP server that is typically implemented using: - Relational OLAP (ROLAP), that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or / and: - Multidimensional OLAP (MOLAP), that is, a special-purpose server that directly implements multidimensional data and operations.
Describe the third data warehouse layer
3. The top tier is a front-end client and server layer. It contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, ...). - Report engines facilitate designing standardised business reports. Defined user groups receive such reports on a regular basis or based on trigger events (alert) - Dashboards also provide clear defined information in a static manner. They focus on KPIs and their visualisation in a user-friendly manner. - Query and OLAP tools facilitate an interactive access to the data store. Users can browse through business data along pre- defined dimensions and change the level of aggregation - Data mining comprises algorithms for a semi- automated analysis of large data sources and identification of novel, relevant patterns.
Define data warehouse
A data warehouse is a subject-oriented, integrated, time- variant, and nonvolatile collection of data in support of management's decision-making process.
Name the reason for data storage
Analytical data is stored for the decision support.
Define metadata
Defines DWH objects - Metadata are created for the data names, definitions, and descriptions of the given DWH - Additional metadata are created and captured for time-stamping any extracted data, the source of the data, and missing fields that have been added by data cleaning or integration processes - Furthermore, the algorithms used for transformation of data can be specified
Describe the Multidimensional OLAP (MOLAP) and state two examples
Deploy a specialised DB that physically stores data in multidimensional form. Examples: Oracle Express, Microsoft SQL - Stores data using a multidimensional structure: an array - Application layer provides the multidimensional view - Efficient storage and processing - Complexity hidden from the user - Analysis using pre-aggregated summaries and pre- calculated measures
State the three types of data warehouse models
Enterprise Data Warehouse: Collects all information about subjects, spanning the entire organization (corporate memory). Fact constellation schema is commonly used, since it can model multiple, interrelated subjects. Data Mart: A subset of corporate-wide data that is of value to a specific groups of users (e.g., department). Star or snowflake schema are commonly used, since both are geared toward modelling single subjects. Hybrid solutions exists (often sensible).
Describe why data warehouse is time variant
Every data item in the data warehouse is associated with an element of time, explicitly or implicitly. The data warehouse provides historical data and in comparison to the operational database which stores current data stores more data.
state HR
Human Resources
Describe the Relational OLAP (ROLAP) and state one example
Implement the logical dimensional model by using a star or snowflake schema in a relational database. Example: MySQL - Warehouse stores data in relational table form (RDBM) - Organise tables using star, snowflake, fact constellation schema - Presentation layer provides multidimensional view - Application layer generates complex SQL queries for OLAP operations on those multidimensional views
Describe the difference between independent and dependent data marts and draw the corresponding diagrams.
Independent data marts are sourced from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise DWH.
Describe the structured approach to architecture development in data warehouse models.
Independent data marts are sourced from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise DWH. Second: Independent data marts can now be implemented in parallel with the enterprise warehouse based on the same corporate data model set as above. Third: Distributed data marts can be constructed to integrate different data marts. Finally: An enterprise data warehouse is constructed where the enterprise warehouse is the sole custodian of all warehouse data, which is then distributed to the various dependent data marts.
Describe the difference between the heterogenous database vs the data warehouse. Draw a diagram if necessary.
The data warehouse is a collective storage space for heterogenous databases.
Describe why data warehouse is non volatile
The data warehouse is a physically separate store of data transformed from the operational environment. Does not require transaction processing, recovery, and concurrency control mechanisms. Requires only two main operations in data accessing: Loading of novel/additional data and Reading of data
Describe the galaxy schema and draw a basic example for Sales and Shipping
The galaxy schema consists of many star schemas that contains multiple fact tables that share many dimension tables..
Define Data Warehousing
The process of constructing and using a data warehouse.
Describe the snowflake schema to implement a data cube.
The snowflake schema is a variation of the star schema, featuring "normalization" of dimension tables. Normalization splits up data to avoid redundancy (= duplication of data) by moving commonly repeating groups of data into a new table. It reduces the space required to hold the data and the number of places where it needs to be updated if the data changes. Normalisation tends to increase the number of tables that need to be joined in order to perform a given query. This "snowflaking" by normalization only affects the dimension tables, not the fact tables.
Describe the star schema to implement a data cube.
The star schema is a fact table in the center, connected to a set of dimension tables. The star schema is the simplest relational data warehouse schema. It consists of a single fact table referencing any number of dimension tables. Motivation: The "facts", which the DWH helps to analyse, are classified along different "dimensions". The fact table should the main data, while the (usually smaller) dimension tables describe each value of a dimension, and can be joined to fact tables as needed. Simplicity from the users point of view: queries are never complex because joins and conditions involve only one fact table and a single level of dimension tables.
State the main difference between the snowflake and star schema in regards to the data implementation
The star scheme is faster whilst the snowflake system is more efficient. The snowflake model may be kept in normalised form to reduce redundancies as it is easy to maintain and saves storage space. However, this saving of space is often negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the effectiveness of querying, since more joins will be needed to execute a query. Consequently, the system performance may be adversely impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.
Describe the Hybrid OLAP (HOLAP) data storage
User flexibility: HOLAP allows to store part of the data in the MOLAP store and another part of the data in the ROLAP store. It is the best breed approach.
state CRM
customer relationship management
Define metadata
data that provides information about other data
state PPS
production planning system
state SCM
supply chain management