CIS355
De-Duplication
The identification, merging and removal of duplicate names and addresses from a database. We do not want to count the same person twice in our system.
ETL staging layer objective
ETL staging layer objective is to focus on the E, the extraction rather than the transformation. - We want to push the transformation into the data warehousing environment, not have the source applications themselves be part of the overall picture.
ETL: Transformation
Two overarching goals: 1. uniformity: we need to transform data to look alike. apples to apples 2. Restructuring: pulling data into staging layer and throwing area into the air and it comes down into a very structured way.
Component Based (Architected)
Along the architected path, we either could put data warehouses and data marts together and have them interoperate with one another. OR We have build an environment of data marts only.
Alternative platform for data warehouses and data marts?
Cubes: cubes= Multidimensional database (MDBMS) - It is a specialized database that has inherent awareness - A leading alternative for 1st generation DWs (relational databases) Example: Express Space, Powerplay, all of which are hybrid by data warehousing products and are built on cubes instead of RDMS - Cubes are usually used for smaller scale data warehouses and data marts vs larger enterprise scale systems. What they are not: - A relational database (RDBMS) - Specialized "dimensionally-aware" database
Dependent data marts (architected)
Classic depended data mart architecture, which resembles kind of the supplier to wholesaler to retailer paradigm. And we can see that dependent data marts are one form of that data warehouse plus data mart model. May run across a variation of the dependent data mart approach that's called the Corporate Information Factory, or CIF.
Four major incremental ETL patterns
- Append (data gets added into DW) - In-place update (going into existing rows of data and you're making changes.) - Complete replacement pattern (even if you have only one little small piece of data that's changed or new, you're going to go into a portion of the data warehouse and totally overwrite whatever's there.) - Rolling append (make a conscious decision that you're only going to maintain a certain duration of history.) Example: 36 weeks or two years ETL today - Append or In-place update
What to bring into data warehouse
- Data definitely needed for BI and analytics - Data probably needed for BI and analytics - Historical Data
Data virtualization use cases
- Simple transformations - Smaller number of data sources - Relaxed time (flexibility) Business Intelligence + Data warehousing and/or Data Lakes and/or Data Virtualization = Best portfolio Helps us make data driven decisions through analytics .
Advantages of cubes/multi dimensional databases
- Tend to be best for "modest" data volumes with the definition or ground rules for modest changing over time. - Very fast query response time
Disadvantages of cubes
- Tend to be less flexible than when we use relational database management system. - Give us a very rigid organizational structure for our data that is built for that fast performance for the queries and reports. - Tend to be far more complex and more time consuming than with the relational database management system. - Far more vendor variation with cubes than you do with relational database management systems.
Compare a Data Warehouse to Data Virtualization
-Data Virtualization is a read-only (vs. read-write which was the original way) DDBMS. - Unlike data warehousing, we do not copy data into a separate database. Instead, we access it from its original. locations at the time we need to do so for reports and analytics. - Many names over the years such as virtual data warehousing, enterprise information integration (EII), enterprise data access (EDA), data virtualization.
Rules for data warehouse
-Integrated: Data from a number different source systems are sent to the warehouse . - Subject oriented: We need to reorganize data by subject -Time variant: It also contains historical data not only recent data. -Non Volatile: We will periodically update data in batches. Copy data often and it stays as stables in between refreshes so we can do strategic planning. -Improvement
Operational Data Stores environment vs. Data warehouse environment
-Just like a data warehouse in ODS integrates data from multiple sources into a single place for one stop shopping. - However, unlike a data warehouse, ODS focuses primarily or even exclusively on current operational data rather than historical data (strategic) - We find real-time feeds from our source applications to an oods unlike the batch oriented ETL feeds of a data warehouse. For example, when a transaction occurs in a point sale system, the data from that point sale system is transmitted immediately to the odds rather than being queued up for the next data warehouse refresh.
Steps to choosing data warehouses
1. Do we want to create a centralized data warehouse or component-based data warehouse? - If we decide a centralized data warehouse we have some advantages such as Default option, "One stop shopping", modern technology. - If we decide a centralized data warehouse we also have some distavanges such as a high cross-org cooperation, high data governance( the maturity and rules that go along with our data), ripple effects(you make some small changes and you typically affect other portions of your overall environment. If you make changes in your source systems by adding a new application that could ripple into the entire data warehousing environment rather than just a single piece of it.)
Steps to selecting a data warehouse
1. Do we want to create a centralized data warehouse or component-based data warehouse? If we choose a component-based data warehouse: we have decomposition(bring benefits when it comes to isolating portions of the environment away from changes in other portions.), mix-and match technology, "bolt together" components, and overcome org. challenges. Some challenges that can come with a component based approach: -Often inconsistent data - Difficult to cross-integrate
Reasons why we build a data warehouse
1. Make data-driven decisions: To support making decisions in a data driven manner, rather than having to rely solely on experience or intuition. 2. One stop shopping: The data that we need is all in a single location rather than scattered among the transactional and operational applications where we get that data from.
Question 2: Congratulations! You have successfully built a data warehouse, and now it's time to put the data warehouse to use. Which related discipline is the one that will be used for your reporting and analytics?
BI (business intelligence) BI is the discipline that makes use of data warehouse-produced data after the data is already brought into the data warehouse.
Question 3: To support your organization's analytics, you are going to use several different approaches in concert with one another. Specifically, you will be building a data warehouse; also building a data lake; and building data virtualization capabilities. Which of those disciplines does NOT make copies of original data for usage in analytics?
Data Virtualization because it reaches directly into the source applications and accesses data at the time its needed for reporting and analytics, rather than pre-staging.
What is Datawarehouse?
Datawarehouse refers to a central repository of data where the data is assembled from multiple sources of information. Those data are consolidated, transformed and made available for the mining as well as online processing. Warehouse data also have a subset of data called Data Marts. (Datawarehouse is build on the top of a database) Data is copied NOT moved
Data type and size unification
If one system has more characters then we want to make sure that they are both the same. We do not want to have present the users two different tables with different data types and sizes for the essentially same data type. We unify type and size using the largest (doesn't always have the be the largest) characters.
Emphasis on "enterprise"
If we are building a data warehouse with a centralized approach and leaning towards an EDW (enterprise data warehouse). We can think of it as the default of a centralized DW. An enterprise data warehouse is essentially trying to satisfy the analytical needs of perhaps much or even all of our enterprise. With a centralized data warehouse we can also build data lakes which are sort of next generation data warehouses. For these a dash line --- is used rather than a solid line around the data.
CIF (Corporate Information Factory)
The CIF model was originally proposed by Bill Inman, who's the same fellow who gave us our original four rules of data warehousing: integrated, subject oriented, time variance and non-volatile. CIF is essentially a dependent data mart architecture, but with very specific rules for how the CDW component is architected and who is allowed to have access to which components.
Question 4: In the supplier-wholesaler-retailer analogy that describes the flow of data from sources to data marts, which would be considered the "data retailers" for the users in your organization for their reporting and analytics?
The data marts because they are the places where users would most often go for specific subsets of the overall analytical data that they need.
Why do we build this?
To support data driven decision making
Expanding our data warehousing architecture
We look at that architecture picture with data flowing from the source systems into the data warehouse where it's being used for decision making, there's actually an extra stop that needs to be made. It doesn't just show up in the data warehouse and then it's available for decision making. It makes its way into the staging layer. The appropriate transformations and sort of finishing touches are made to the data to put it into the user access layer in a form that's ready to support the decision making.
What is our default for data warehouse?
relational database management systems (RDMS) is typically the default form for our data warehouses and data marts.
Role of User Access Layer (Data warehouse)
**We want to understand the role of this alongside the staging layer** - This is where the users go. From the user perspective, the user access layer really is the data warehouse or the data mart, since the same architecture appliers to both a data warehouse or data mart, since the same architecture appliers to both a data warehouse or a data mart. -Dimensional modeling. So that would be staff schema, snowflake schemas, fact tables and dimension tables and all the rest So a data warehouse really is synonymous with user access layer from the user perspective.
common transformation models
- Data value unification - Data type and size unification -De-duplication - Dropping columns (vertical slicing) - Value based row filtering (horizontal slicing) - Correcting known errors
What is an extract file?
- Extract files are like mini data warehouses pulling data from other data from one or more applications without any coordination or rules. More time was spent organizing than analysing.
Incremental ETL
- Incrementally "refreshes" the data warehouse - New data: employees, customers, products,... - Modified data: employee promotions, product price change, etc. - Special handling for deleted data: customer drops from a subscription plan,... BRING THE DATA WAREHOUSE UP TO DATE.
Two variations of ETL
- Initial - Incremental
ODSs Today
- Less popular - "Faster" and more current DWs - Superseded by big data (V=Velocity): we can blast data into a big data environment pretty much real time, then we can essentially do a lot of our operational analytics and operational business intelligence right out of a big data environment. - ODS components within data lake - "Traditional" ODSs still sometimes used for mission-critical situations.
Non-Persistent Staging Layer
- if we have a non-persistent layer in a data warehouse it is most-likely empty. - once we are done using the data it will be erased. Onde it makes its way through Faculty system (data sources)--> Staging layer (data warehouse) --> User Access Layer (BI) then we an empty out the staging layer.
Dependent data marts
-Their existence depends on the existence of a data warehouse. No data warehouse = No data marts. -Because they can't be supplied with data otherwise. -Sourced from data warehouse - (Mostly) uniform data across marts -Architecturally straightforward
Staging layer table layout options
1. You can only integrate East & West Campus if you have identical software, so identical systems. -"Many to 1" Feed: Multiple instances of identical source applications - (mostly) uniform data across marts 2. If yo -"1 to 1" Feed: Multiple customized instances of source applications - Different vendor applications for same functions
Compare a Data Warehouse to a Data Lake
A data warehouse is often built on the top of a relational database such as Microsoft SQL Server or Oracle. Relational databases are also used for transactional systems applications, not just data warehousing.
Persistent Staging Layer pros & cons
Advantages: - Rebuild user layer without source systems. - Data quality assurance: compare staging with user layer Disadvantages: - More storage space - Risk of "ungoverned" access
Non-persistent layer pros & cons
Advantages: -Less storage space -Data already moved to user layer Disadvantages: - Need to go back to source systems to rebuild user layer - Data QA (quality assurance) also required source systems.
Federated EDW or Federated collection of data marts (Non-architectural )
Agree to disagree about the business rules, the data models and structures, and pretty much everything that would be required to build either a centralized data warehouse or that data warehouse bus approach. So instead, what we do is build a collection of independent data marts, as we've seen already, and essentially draw a dotted line around those independent data marts and just sort of declare. We're going to consider all of you data marts to be part of a data warehouse, even though we are likely to have different answers to the same queries and reports coming out of those multiple data marts you should think of, especially nowadays, this independent data mart or federated data warehousing approach really only as a last resort as you embark on data warehousing.
When did data warehousing and business intelligence come to be?
Around 1990
Data warehouse vs independent data mart
Data warehouse: - Many sources (dozens ) - ETL from different sources directly into the DW you will probably have very large data volumes in the DW. -Probably large data volumes - Dimensionally organized data Independent Data Mart - One or more sources (1 and up to 6) - ETL from sources -Possibly large data volumes - Dimensionally organized data
Emphasis on Enterprise-Level Strategy 1
Data warehousing --> centralized --> Data Lake <Hadoop, Specialized database o other (AWS s3, etc).
Emphasis on Enterprise-Level Strategy
Data warehousing --> centralized --> EDW < Rational or Specialized database. - Specialized large scale databases, columns, databases and other things that are typically known as data warehousing appliances.
ETL vs ELT
ETL: Extract, Transform, Load Extract: - Quickly pull data from source applications - Traditionally done in "batches" - Raw data.. errors and all -Land in data warehouse staging layer Transform: - "Apples to apples": bring data from different sources and even though they might have the same function or have similar data, they might actually be stored and represented differently. - Prepare data from staging layer for uniform data in user access layer. - Can be very complex Load: - Final stop along the data pathway - Store uniform data in user access layer for business intelligence and analytics. Challenges with traditional ETL: - significant business analysis before storing data. - significant data modeling before storing data. Change the order ELT: Extract, Load, Transform - Make the transformations at the point where you actually need them. - "Blast" data from various source application and external data sources into big data environment - Raw form in Hadoop, HDFS, AWS, S3, etc: whether its unstructured, structured, or semi structured. - Use big data environment computing power to transform when needed. -"Schema on read" vs. "schema on write" --> schema on write means that at the point you create the data and make it available for analytics. Local usage, the schema, the structure for that data has already been set, so the schema is there. Then you write the data. ETLis known as a batch oriented data flow process.
Inside the Staging Layer
Essentially, inside the staging layer, what we're going to do is create mirror images of what's coming out of the source application. - our objective is to pull the data as quickly as possible and as non intrusively as possible. - The closer our staging layer data structures are to those of the source environment, the more straightforward that data flow is going to be. STANGINGLAYER IS ONLY FOR INTERNAL DATA WAREHOUSE USAGE.
What is ETL?
Extract, Transform, Load or alternatively, extraction transformation and loading.
Initial loading ETL
Initial ETL: - Normally one time only - Right before the data warehouse goes live - All relevant data necessary for BI and analytics - Can redo the initial ETL if data warehouse "blows up" In a data warehousing environment you do not bring all possible source data in a big data environment in theory you can do that by blasting in data from all your different data sources but not in data warehousing.
Front-End Data Marts
Instead of having the data marts behind the data warehouse, just like the name would apply where the front end data mart they exist in fronts. For example: SAP implementation that was for your North America operations and another SAP implementation for international operations outside of North America, you could essentially build a data mart. Which is something other than a full-scale data warehouse for each of those SAP implementations. That is where the primary analytics and business intelligence occur. Not necessarily all of the data is sent out from each of those downstream into a data warehouse. You reverse the order of your data.
Data Warehouse (DW)
It doesn't necessarily need to be the end point in the flow of data from our source application into some sort of an analytical environment to support decision making. - Sometimes data is sent further downstream into these additional environments that we typically would call data marts.
Independent data marts
It doesn't need a data warehouse. Instead, each independent data mart draws data directly from one or more source applications. Need to be organized dimensionally. - A small data warehouse with its data organized internally the same way. - Sourced directly from applications and systems - Little to no uniformity across marts. -"Spaghetti" architecture
Operational Data Stores (ODS)
It is a type of database that's often used as an interim logical area for a data warehouse. Designed to integrate data from multiple sources for lightweight data processing activities such as operational reporting and real-time analysis. - Integrates data from multiple sources - Emphasis on current operational data - Often real-time source --> ODS data feeds - "Tell me what is happening right now" - Popular late 1990's/early 2000's
Role of Staging Layer (Data Warehouse)
Most Data Warehouses have: 1. Staging Layer (Two variations) - "Landing Zone" for the incoming data form the source application (data is coming from elsewhere). - "E" within ETL : Extraction Get the data as quickly as possible from the source applications, especially if those applications are very finely tuned environments that are built for very high transaction volumes. What we're trying to do then is be as non-intrusive as possible while we copy the data that we need.
RDMS & Cubes used together
Some companies implement the use of both. We have a data warehouse built on top of a relational database management system platform, but each of the data marts are built on the top of the cube.
Sometimes also "cubes"
Sometimes we also build a data warehouse on top of a multidimensional database that's typically known as a cube. (Think of it as a special purpose database not just a relational one).
What are some BI utilization tools?
Tableau and Microsoft Power BI
Data Mart Only (architected)
The data warehouse dimensional buss that was defined by Ralph Kimball, who was one of the pioneers of data warehousing. The idea behind the data warehouse bus is that all of the data marts follow a principle that known as conformed dimensions, which essentially means that we have an apples to apples view about our key subject areas, such as customers, products, employees, geographies, students, faculty and many other subject areas across all of our data marts. Even though they're built differently and also potentially even on different platforms.
Persistent Staging Layer
There is still some data inside the staging layer and the user access layer/ Difference is that. we do not empty out the staging layer once we are done. If we want to add new data we just add the new data while still keeping the old and it goes through the staging and user access layer.
Architectural options: Centralized Data Warehouse
Using a single database as the home for your data to support business intelligence and analytics. - Your default go to architecture. -Single database and everything from all of your sources will feed into that single database. - One stop shopping. All the data that we need for reporting, business, intelligence and analytics is all together in a single place, sort of like a data supermarket.
Differences: the 3 V's of Big Data
Volume - Big data and its usage in data lakes helps us manage extremely large volumes of data larger than we typically would include in the largest data warehousing. Volume is one of the traditional three V's of big data. Velocity - Big data also supports very rapid intake of new and change data much more rapidly than we typically do with traditional data warehousing. Variety - Our traditional warehouse mostly work with structured data numbers, dates, shorter length characters. (Is supported by structured data: messages, textes, emails but also by unstructured data: Audio or video)
Comparing two types of staging layers (Non-persistent & Persistent)
We have a staging layer that serves as a landing zone for our incoming data, which is then transformed and integrated and loaded into the user access layer.
Decisions of component based data warehouse
We have two options when deciding. Architected or non-architected approach.
Making data-driven decisions
We need to look at -Past -Present -Future - The unknown: turning our analytics lose to see what the data says. This is BI (Business intelligence )
Data value unification
When we unify data values you select and rank the data into one consistent system.
One stop shopping
With data warehousing, though, we integrate all of the data into one place and provide the aforementioned one stop shopping. Concentrate on the analysis of data.
Question 1: Suppose that your organization decides to build a data warehouse for the first time. You have many different operational applications, but right now all of your reporting and analytics comes directly out of those applications. You decide that you need a data warehouse to support integrated reporting that will combine data from some or many of those source applications. After you successfully build and deploy your data warehouse, which of the following statements is most true about the data in your operational systems and your new data warehouse?
Your data will be copied from the source systems and applications to the data warehouse, but also retained in those original sources for operational purposes.
What is a data mart?
a data collection, smaller scale or more narrowly focused data warehouse, that addresses the needs of a particular department or functional area of the business