Week 4
Data Load Errors
-Common cause is mismatched data, e.g. a text field going into a numeric field
-Data package key and data package dimension assist in recovery when errors occur
-No update, no reporting
-Update valid records, no reporting
-Update valid records, reporting possible
-Roll back to the state prior to the load
-Delete the data load, repair the rogue data, reload
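A minimal sketch of catching mismatched data before updating the target, assuming a pandas-style load step; the DataFrame and column names are hypothetical and only illustrate the "update valid records, repair rogue data, reload" idea:

import pandas as pd

# Hypothetical incoming data package: "amount" should be numeric,
# but one record carries text (mismatched data)
package = pd.DataFrame({"order_id": [1, 2, 3],
                        "amount": ["100.50", "N/A", "75.00"]})

# Coerce the numeric field; rogue values become NaN instead of aborting the load
package["amount"] = pd.to_numeric(package["amount"], errors="coerce")

valid = package[package["amount"].notna()]    # update valid records
rejected = package[package["amount"].isna()]  # repair rogue data, then reload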
Rank
-Combines sorting and filtering
-Top N, Top %, Bottom N, Bottom %
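A short pandas sketch of ranking; the data and the 20% threshold are made up for illustration:

import pandas as pd

sales = pd.DataFrame({"customer": ["A", "B", "C", "D", "E"],
                      "revenue": [500, 120, 980, 310, 75]})

top_3 = sales.nlargest(3, "revenue")       # Top N
bottom_2 = sales.nsmallest(2, "revenue")   # Bottom N

# Top %: keep customers whose revenue falls in the top 20% of values
cutoff = sales["revenue"].quantile(0.80)
top_20_pct = sales[sales["revenue"] >= cutoff].sort_values("revenue", ascending=False)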
Shortcomings of Star Schema
-Duplication of dimensional data
-No support for multiple languages
-Slow joins due to alphanumeric keys
-No historization (no tracking of changing dimensional data)
-Hierarchies can cause anomalies, e.g. Sales_territory & Country
Filter
-Filter by label for textual fields
-Filter by value
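A small pandas illustration of the two filter types, using hypothetical columns:

import pandas as pd

orders = pd.DataFrame({"region": ["North", "South", "North", "West"],
                       "amount": [200, 150, 90, 400]})

by_label = orders[orders["region"] == "North"]   # filter by label (textual field)
by_value = orders[orders["amount"] > 100]        # filter by value (numeric field)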
(1) Multidimensional Modeling - Star Schema
-Includes one fact table and multiple dimension tables
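A minimal sketch of a star schema query in pandas, with made-up fact and dimension tables, showing how the single fact table joins to each dimension it needs:

import pandas as pd

# Fact table: transactional measures keyed by dimension IDs (hypothetical columns)
fact_sales = pd.DataFrame({"customer_id": [1, 2, 1],
                           "product_id":  [10, 10, 20],
                           "quantity":    [3, 1, 5],
                           "revenue":     [30.0, 10.0, 75.0]})

# Dimension tables: descriptive master data
dim_customer = pd.DataFrame({"customer_id": [1, 2], "customer": ["Acme", "Globex"]})
dim_product = pd.DataFrame({"product_id": [10, 20], "product": ["Widget", "Gadget"]})

# A report joins the fact table to its dimensions
report = (fact_sales
          .merge(dim_customer, on="customer_id")
          .merge(dim_product, on="product_id"))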
Data Warehouse Structure
-Information system
-Takes large quantities of data and stores them so that the data can be accessed easily
-A relational database that has been denormalized, meaning some redundant data is added back into the database to reduce the number of tables; this speeds up queries
-Because data warehouses are "read only", modification anomalies are not relevant
-Provides a "single source of truth" for analytics
InfoCubes - Virtual Cube (Virtual Provider)
-Only represents a logical view of a dataset
-Data resides in the data source, not in the virtual cube
InfoCubes - Standard Cube
-Physically contains data in the database
-They are data targets (data can be loaded into the cubes)
Queries
-Prefilters
-Presorts
-Restricted key figures
-Calculated key figures
Star Schema - Fact Table
-Stores measures/key figures
-Represents transactional data
InfoCubes
A self-contained dataset for a business area from a reporting viewpoint
-Queries can be defined and executed on the InfoCube
Calculation
Adding fields based on other fields
Techniques - Sort
Ascending, descending, alphabetical, numerical, chronological, nested
Data Reformatting
Changing the format of the data from the data source to the target requirements
Transformation rules:
-Algebraic rules - converting weights or currencies
-String rules - relate to alphanumeric data, such as combining or separating a name
-Date and time rules - change time and date formats
-Programmatic or algorithm-based rules - use transformation programming languages, e.g. stripping off XML tags
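A hedged pandas sketch of the first three rule types; the source fields and conversion factor are illustrative assumptions:

import pandas as pd

src = pd.DataFrame({"name": ["Doe, Jane"],
                    "weight_lb": [220.0],
                    "order_date": ["03/15/2024"]})

tgt = pd.DataFrame(index=src.index)
# Algebraic rule: convert pounds to kilograms
tgt["weight_kg"] = src["weight_lb"] * 0.453592
# String rule: separate a combined name into last and first name
tgt[["last_name", "first_name"]] = src["name"].str.split(", ", expand=True)
# Date and time rule: reformat MM/DD/YYYY to ISO YYYY-MM-DD
tgt["order_date"] = pd.to_datetime(src["order_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")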
Chapter 3: Dimensional Data Modeling
Chapter 4: Extraction, Transformation, and Loading (ETL)
Chapter 5: Slicing & Dicing
Data Consolidation
Combining and integrating data from data sources, done by data wranglers who:
-Realign mismatched data
-Harmonize keys and records
-Combine or split data
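A small pandas sketch of consolidation, assuming two hypothetical source systems whose keys need harmonizing before the records are combined:

import pandas as pd

crm = pd.DataFrame({"cust_no": ["C001", "C002"], "name": ["Acme", "Globex"]})
erp = pd.DataFrame({"cust_no": ["c001", "c002"], "credit": [5000, 8000]})

# Harmonize the key format before combining
erp["cust_no"] = erp["cust_no"].str.upper()

# Combine the records into one consolidated customer view
consolidated = crm.merge(erp, on="cust_no", how="left")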
Dynamic Data
Data that are updated, e.g. a prospective customer becoming a buying customer
Data that have different values depending on the user:
-Languages
-Currency
-Unit
Hierarchies
Drill down and roll up
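A quick pandas illustration of rolling up and drilling down along a hypothetical country/region hierarchy:

import pandas as pd

sales = pd.DataFrame({"country": ["US", "US", "DE"],
                      "region":  ["East", "West", "South"],
                      "revenue": [100, 150, 80]})

rolled_up = sales.groupby("country")["revenue"].sum()                 # roll up to country
drilled_down = sales.groupby(["country", "region"])["revenue"].sum()  # drill down to region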
Slice and Dice Mistakes
-Incorrect aggregations, such as adding together ending inventory for each month, or adding apples and oranges such as different currencies
-Treating null as zero
-Errors in data acquisition
-Errors in calculations
Slicing and Dicing Tools
-Microsoft Excel tables and pivot tables
-IBM Cognos
-SAP Analytics
-Oracle Hyperion
-Tableau
-Microsoft PowerPivot
Exceptional Aggregations
Occur when it doesn't make sense to aggregate by summing. Instead we want the result as a point in time, such as the number of employees at the end of the year. We can't add the year-end employee counts across all years to get a total; it won't work. Think of these as balance sheet accounts in financial statements.
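A pandas sketch contrasting a (wrong) sum with a last-value exceptional aggregation, using made-up headcount data:

import pandas as pd

headcount = pd.DataFrame({"year": [2021, 2021, 2022, 2022],
                          "month": [6, 12, 6, 12],
                          "employees": [100, 110, 115, 120]})

# Wrong: summing a point-in-time figure across periods
wrong_total = headcount["employees"].sum()   # 445, a meaningless number

# Exceptional aggregation: take the last value per year (end-of-year headcount)
end_of_year = (headcount.sort_values(["year", "month"])
                        .groupby("year")["employees"].last())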
Conditions (Conditional Formatting)
Predefined rules
Data Cleansing
Removing noisy data, replacing missing values, and realigning formats
-Identification of signal and noise
-Getting rid of the noise and "dirty" or missing data
-Null values can be misleading - nulls can be ignored or replaced if we have the knowledge to do so
-Identification and elimination of outliers
-Elimination of redundant data
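A minimal pandas sketch of these cleansing steps; the data, the median replacement, and the 3-standard-deviation outlier rule are illustrative assumptions:

import pandas as pd

raw = pd.DataFrame({"customer": ["Acme", "Acme", "Globex", "Initech"],
                    "revenue":  [100.0, 100.0, None, 9_000_000.0]})

clean = raw.drop_duplicates().copy()                                    # eliminate redundant data
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())   # replace missing values

# Identify outliers: here, values more than 3 standard deviations from the mean
z = (clean["revenue"] - clean["revenue"].mean()) / clean["revenue"].std()
outliers = clean[z.abs() > 3]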
Star Schema - Dimension Tables
Represents master data and provides a more detailed view of the facts
Examples:
-Time dimension
-Customer dimension
-Product dimension
-Sales Org dimension
-Region dimension
Standard Aggregations
Standard aggregations are SUM, MIN, MAX, AVG, COUNT.
Aggregations
Sum, Avg, Min, Max, Count
Slicing and Dicing
The manipulation of data to be able to see it from many different angles
Slicing: a way to filter a large data set down to a smaller data set of interest
Dicing: further dividing the sliced data set into even more granularly defined data sets
-Multidimensional data examined from many angles
-Larger data sets filtered down to areas of interest
-Trends and outliers identified via sorting
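A small pandas sketch of slicing, dicing, and sorting a hypothetical cube-like table:

import pandas as pd

cube = pd.DataFrame({"year":    [2023, 2023, 2024, 2024],
                     "region":  ["East", "West", "East", "West"],
                     "product": ["Widget", "Widget", "Gadget", "Widget"],
                     "revenue": [100, 200, 150, 250]})

# Slice: filter the large set down to one dimension value of interest
slice_2024 = cube[cube["year"] == 2024]

# Dice: carve the slice further by additional dimensions
dice = slice_2024[(slice_2024["region"] == "West") & (slice_2024["product"] == "Widget")]

# Sort to surface trends and outliers
ranked = cube.sort_values("revenue", ascending=False)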
Pivot Charts
The chart changes when the data is pivoted
Field controls allow for filters and sorts
(2) Multidimensional Modeling - Snowflake Schema
The dimension tables are split into more tables
-Reduces data redundancy, which reduces storage space but increases the number of joins
-Supports multiple languages, currencies, and units of measure (UoM)
-Quicker joins via conversion of text keys to numbers
-Supports historization - changing dimensional data
-Supports cube administration
-A dimension is said to be snowflaked when the low-cardinality columns in the dimension have been removed to separate normalized tables that then link back into the original dimension table.
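A pandas sketch of snowflaking a hypothetical product dimension by moving a low-cardinality text column into its own normalized table with a numeric key:

import pandas as pd

# Star-style product dimension with a repeated, low-cardinality category text
dim_product = pd.DataFrame({"product_id": [1, 2, 3],
                            "product":    ["Widget", "Gadget", "Gizmo"],
                            "category":   ["Hardware", "Hardware", "Toys"]})

# Snowflaking: give the category its own table with a numeric surrogate key
dim_category = (dim_product[["category"]].drop_duplicates()
                .reset_index(drop=True)
                .rename_axis("category_id").reset_index())

# The product dimension now links back to dim_category via category_id only
dim_product_snow = (dim_product
                    .merge(dim_category, on="category")
                    .drop(columns="category"))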
Loading
The movement of data from source systems to their intermediate or final destinations
-Delta loads and full loads
-Data transfer process (DTP)
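A minimal pandas sketch contrasting a full load with a delta load; the key-based delta detection here is an illustrative assumption, not the DTP mechanism itself:

import pandas as pd

target = pd.DataFrame({"order_id": [1, 2], "revenue": [100, 200]})
source = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [100, 200, 300]})

# Full load: replace the target with everything in the source
full_load = source.copy()

# Delta load: move only the records not yet present in the target
delta = source[~source["order_id"].isin(target["order_id"])]
target = pd.concat([target, delta], ignore_index=True)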
Extraction
The process of identifying data sources and source fields and acquiring, or sourcing, the data required for analysis
-The data source is a collection of fields from the source system
-Extractors are programs that pull the data from the source system and bring them into the data warehousing system
-Extracted data are staged in a persistent storage area (PSA)
Transformation
The process of mapping and harmonizing data - that is, making certain the data are consistent, cleansed, and reliable - from their sources to the targets
ETL
The process of populating data structures such as InfoCubes, data marts, and data warehouses from one or many sources
Transformation - Data Harmonization
When data are from multiple sources, they need to be made more consistent (harmonized)
-Harmonization involves mapping fields from source to target systems (e.g. an InfoCube)
Steps:
1. Data consolidation
2. Data cleansing
3. Data reformatting
Pivot Table
Creates what is called a cross-tabulated structure, or crosstab
The crosstab displays aggregated and summarized data based on which way you arrange the columns and rows
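A pandas sketch of a crosstab built with pivot_table on made-up sales data:

import pandas as pd

sales = pd.DataFrame({"region":  ["East", "East", "West", "West"],
                      "product": ["Widget", "Gadget", "Widget", "Gadget"],
                      "revenue": [100, 150, 200, 250]})

# Cross-tabulated structure: rows = region, columns = product, values aggregated by sum
crosstab = sales.pivot_table(index="region", columns="product",
                             values="revenue", aggfunc="sum", margins=True)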