CIS 4610 Quiz 4

Ace your homework & exams now with Quizwiz!

Subject oriented database

(not all data from operational DB is migrated because we don't need all of it): analytical DB (subject to be analyzed)

dimension tables (dimensions)

contain descriptions of the business, organization or enterprise to which the subject of analysis belongs

Steps in the development of data warehouse: they use the retrieval of the data in the data WH:

o Indirect use - via the front-end BI applications o Direct use- via the DMBS and the OLAP BI tools

Data warehouse components:

o Source systems o Extraction transformation load infrastructure o Data warehouse o Front end applications

loading

this is the last step of the ETL process, it loads data into the target data WH DB and the load process should be optimized for performance.

transformation

this process adds value and changes data such that insightful BI reports can be generated. In this step you apply a set of functions on extracted data. it can perform customized operations on data

A data warehouse is created within an organization as a separate data store whose primary purpose is

to support data analysis

ETL tools:

-marklogic -oracle -amazon redshift

Which of the following data represent detailed and current data and the typical time-horizon is ' Days/Months'?

Operational Data

The ETL process requires

active inputs from various stakeholders including developers, analysts, top executives and it technically challenging

The ETL process

o Capture/extract o Scrub or data cleansing o Transform o Load and index

Application-oriented database

operational DB

Which of the following is (are) characteristics of data warehouse (DW)?

DW is a structured repository DW contains integrated data DW data are subject-oriented

relational modeling

standard method for logical modeling of operational DBs

analytical information

the information collected and used in support of Analytical tasks -Analytical information is based on operational (transactional) information -Once data is in analytical database it cannot be modified

Scrubbing or Data Cleansing happens during

transformation phase

surrogate key

typically in a star schema all dimension tables are given a simple non composite system generate key

operational information

(transactional information) - the information collected and used in support of day to day operational needs in business and other organizations -data base records are being created, modified all the tim

in case of load failure

recover mechanisms should be configured to restart from the point of failure w/o data integrity loss

star schema

the result of dimensional modeling is a dimensional schema containing facts and dimensions -the chosen subject of analysis is represented by a fact table

Scrub/Cleanse:

uses pattern recognition and AI techniques to upgrade data quality -fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies -also: deceasing, reformatting etc.

Retrieval of analytical information

-A data warehouse is developed for the retrieval of analytical information, and it not meant for direct data entry by the users - The only functionality available to the users of the data warehouse is retrieval - The data in the data warehouse is not subject to changes - The data in the data warehouse is referred to as non-volatile, static or read

data integrity problems:

-Different spelling of the same person like Jon, John, etc. -There are multiple ways to denote company name like Google, Google Inc. -Use of different names like Cleaveland, Cleveland. -There may be a case that different account numbers are generated by various applications for the same customer. -In some data required files remains blank -Invalid product collected at POS as manual entry can lead to mistakes.

multi-field transformation

-Many sources to one target -One source to many targets

characteristics of dimensions and facts

-a typical dimension contain relatively static data, while in a typical face table records are added continually and the table rapidly grows in size -in a typical dimensionally modeled analytical DB, dimension tables have orders of magnitude fewer records than fact tables

load verificaiton

-ensures that the key field data is neither missing nor null -test modeling views based on the target tables -checks that combined values and calculated measures -data checks in dimension table as well as history table -checks the BI report on the loaded fact and dimension table

the extraction methods:

-full extraction -partial extraction (without update notification) -partial extraction (with update notification)

types of loading

-initial load: population all DW tables -incremental load: applying ongoing changes as when needed periodically -full refresh: erasing the contents of one or more tables and reloading with fresh data

best practices of ETL process:

-never try to cleanse all the data -always cleans something -determine the cost of cleansing the data -to speed up query processing have auxiliary views and indexes

validations that are done during extraction

-reconcile records w/ the source data -make sure that no spam/unwanted data loaded -data type check -remove all types of duplicate/fragmented data -check whether all the keys are in place or not

operational data

-typical time-horizon: day/months -detailed -current -small amounts used in a process -high frequency of access -can be updated -non redundant

analytical data

-typical time-horizon: years -summarized and or detailed -values over time (snapshots) -large amounts used in a process -low/modest frequency of access -read only -redundancy not an issue

amazon redshift

Amazon Redshift is Data warehouse tool. It is a simple and cost-effective tool to analyze all types of data using standard SQL and existing BI tools. It also allows running complex queries against petabytes of structured data.

Which of the following is (are) applicable to data warehouse (DW)?

DW is the repository of analytical data A data warehouse can store detailed and/or summarized data

The infrastructure that facilitates the retrieval of data from operational databases into the data warehouses is called:

ETL Infrastructure

ETL stands for:

Extraction, Transformation, & Loading

oracle

Oracle is the industry-leading database. It offers a wide range of choice of Data Warehouse solutions for both on-premises and in the cloud. It helps to optimize customer experiences by increasing operational efficiency.

During ETL process, capturing a snapshot of the source data at a point in time is called:

Static extract

Which of the following is (are) applicable to data warehouse (DW)?

The data in the data warehouse is not subject to changes. The data in the data warehouse is referred to as non-volatile, static, or read-only

what do we need before the data is extracted?

a logical data map that describes the relationship between sources and target data

ER modeling

a predominant technique for visualizing DB requirements, used extensively for conceptual modeling of operational DBs

ETL

a process that extracts the data from different source systems, that transforms the data (like applying calculations, concatenations, etc.) and finally loads the data into the data warehouse system.

snow flake model

a star schema that contains the dimensions that are normalized

ETL is a recurring

activity of a data warehouse systems and needs to be agile, automated and well documented

extraction should not

affect performance and response time of the source systems, these source systems are live production databases

fact tables

contain measures related to the subject of analysis and the FKs (associating fact tables with dimension tables)

extraction

data is extracted from the source system into the staging area, staging areas gives an opportunity to validate extracted data before it moves into the data WH

Data marts:

o A data store based on the same principals as a data warehouse, but with a more limited scope

Detailed and/or summarized data

o A data warehouse, depending on its purpose, may include the detailed data or summary data or both o A data warehouse that contains the data at the finest level of detail is the most powerful o Data warehouse database is a relational DB but the structure is different, it keeps unnormalized data.

Dependent data mart:

o Does not have its own source system o The data comes from the data WH

Independent data mart:

o Stand-alone data mart, created in the same fashion as the data warehouse, not live communication with data WH o Independent data mart has its own source systems and ETL infrastructure o Has its own systems, is created from scratch

§ ETL infrastructure

o The infrastructure that facilitates the retrieval of data from operational databases into the data warehouses o ETL includes the following tasks: -Extracting analytically useful data from the operational data sources -Transforming such data so that it conforms to the structure of the subject-oriented target data warehouse model (while ensuring the quality of the transformed data) -Loading the transformed and quality assured data into the target data warehouse

Datawarehouse front-end (BI) applications

o Used to provide access to the data warehouse for users who are engaging in indirect use. Uses data from the data WH and creates figures or reports or whatever you need.

validations that are done during the transformation stage:

-Filtering - Select only certain columns to load -Using rules and lookup tables for Data standardization -Character Set Conversion and encoding handling -Conversion of Units of Measurements like Date Time Conversion, currency conversions, numerical conversions, etc. -Data threshold validation check. For example, age cannot be more than two digits. -Data flow validation from the staging area to the intermediate tables. -Required fields should not be left blank. -Cleaning (for example, mapping NULL to 0 or Gender Male to "M" and Female to "F" etc.) -Split a column into multiples and merging multiple columns into a single column. -Transposing rows and columns, -Use lookups to merge data -Using any complex data validation (e.g., if the first two columns in a row are empty then it automatically reject the row from processing)

why we need ETL:

-it helps companies to analyze their business data for making critical business decisions -transactional data bases cannot answer complex business questions that can be answered by ETL -a data WH provides a common data repository -it provides a method of moving the data from various sources into a data WH -as data sources change, the data WH will automatically update -well-designed and documented ETL system is almost essential to the success of data WH projects -it allows verification of data transformation, aggregation and calculations rules -it allows sample data comparisons between the source and the target system -is can perform complex transformations and requires the extra area to store the data -it helps to migrate data into the data WH -it is a predefined process for accessing and manipulating source data into the target DB -it offers deep historical context for the business -it helps to improve productivity because it codifies and reuses without a need for technical skills

marklogic

MarkLogic is a data warehousing solution which makes data integration easier and faster using an array of enterprise features. It can query different types of data like documents, relationships, and metadata.

Transform:

convert data from format of operational system to format of data warehouse -record level: selection-data portioning, joining-data combining, aggregation-data summarization -field-level: single-field-form one field to one field, multi-field-from many fields to one, or one field to many

A data store based on the same principles as a data warehouse, but with a more limited scope is called:

data mart

a direct move or pass through data

data that does not require any transformation

granularity of the face tables

detailed fact tables have fine level of granularity because each record represents a single fact -aggregated tables have a courser level of granularity than detailed fact tables because they always represent summarizations of data

Single-field transformation:

in general, some transformation function translates data from old form to new form, -Algorithmic transformation uses formular or logical expression -Table lookup- another approach, uses a separate table keyed by source record code

Data warehouse

is a structured repository of integrated (captured from one or more operational DBs) subject-oriented, enterprise-wide (gives enterprise wide view), historical (contains time lagged data), and time variant- data (to be able to create trend analysis). The purpose of the data warehouse is the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.

dimensional modeling

modeling technique tailored specifically for analytical DB design purposes, regularly used in practice for modeling data WH and data marts

Capture/Extract:

obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse, 2 types: -static extract = capturing a snapshot of the source data at a point in time -incremental extract= capturing changes that have occurred since the last static extract

Load/Index:

place transformed data into the warehouse and create indexes -refresh mode: bulk rewriting of target data at periodic intervals -update mode: only change in source data are written to data WH


Related study sets

3rd attempt on the guaranteed exam

View Set

Nurs 380: Helpful Pediatric Questions from Prep U (Final Exam)

View Set

Pharm Exam 1: Ch. 54 Upper Respiratory Drugs

View Set

World Geography: Chapter 9 Section 4

View Set