ISA401 exam 1

EAI tools

enable event driven (real time) data to be captured and used in an integrated way across disparate source systems - can be used to feed ETL tools

timely

except for real time data warehousing, data need not be real time but data must be current enough that decision making can react in a timely manner

operational data are...

transient, not normalized, generally restricted in scope, poor quality

● How are changes in dimension values handled?

○ Add new column
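
The "add new column" answer above can be sketched in a few lines. This is a minimal illustration, not a full slowly-changing-dimension implementation; the row fields (`customer_sk`, `region`, `prior_region`) are hypothetical names chosen for the example.

```python
# A customer dimension row before a change in "region".
row = {"customer_sk": 1, "region": "East"}

# Handle the changed dimension value by adding a new column:
# keep the new value in "region" and preserve the old one in
# an added "prior_region" column.
row["prior_region"], row["region"] = row["region"], "West"

print(row)  # → {'customer_sk': 1, 'region': 'West', 'prior_region': 'East'}
```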

● What is granularity? Why is it important?

○ Granularity relates to the fact table
○ Tells us the level of information and determines the size of the fact table
○ Fine grain: more data, more ways to slice & dice

● What is a surrogate key?

○ Key that doesn't have a business meaning, it is a randomly generated unique identifier
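
A minimal sketch of assigning surrogate keys: a sequential counter generates identifiers with no business meaning, independent of the natural (business) key. The table and field names (`customers`, `customer_code`, `customer_sk`) are hypothetical.

```python
import itertools

# Hypothetical customer rows identified by a business (natural) key.
customers = [
    {"customer_code": "ACME-01", "name": "Acme Corp"},
    {"customer_code": "GLOBX-07", "name": "Globex Inc"},
]

# Assign a surrogate key: a meaningless sequential identifier that
# stays stable even if the business key changes later.
next_key = itertools.count(start=1)
for row in customers:
    row["customer_sk"] = next(next_key)

print([r["customer_sk"] for r in customers])  # → [1, 2]
```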

logical vs physically stored data

○ Logical means the data mart references the data warehouse
○ Physical means the data is physically stored in the data mart

quality controlled

reconciled data must be of unquestioned quality and integrity

comprehensive data

reconciled data reflect an enterprise wide perspective whose design conforms to the enterprise data model

cleanse process

the role of this ETL process is to identify erroneous data, not to fix it

what are different sources of data

spreadsheet, database, sensors, social media

what is BI

the application of knowledge gained from business analytics to make informed business decisions

normalized

the data are fully normalized which provides greater flexibility than denormalized data

log capture

the most commonly used method of capture; only images that are logged after the last capture are selected from the log

joining

the process of combining data from various sources into a single table or view
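
A toy sketch of joining: order rows from one source are combined with customer rows from another into a single view, matched on a shared key. All names (`customers`, `orders`, `customer_id`) are hypothetical.

```python
# Customer data from one "source system", keyed by customer_id.
customers = {101: {"name": "Acme"}, 102: {"name": "Globex"}}

# Order data from another source system.
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 250.0},
    {"order_id": 2, "customer_id": 102, "amount": 75.0},
]

# Join: each output row combines an order with its customer's attributes.
joined = [
    {**o, "customer_name": customers[o["customer_id"]]["name"]}
    for o in orders
]

print(joined[0]["customer_name"])  # → Acme
```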

selection

the process of partitioning data according to predefined criteria

aggregation

the process of transforming data from a detailed level to a summary level
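
Aggregation can be sketched as rolling detail rows up to a summary level, here summing sale amounts per region. The data and field names are made up for the example.

```python
from collections import defaultdict

# Detail level: one row per individual sale.
sales = [
    {"region": "East", "amount": 100},
    {"region": "West", "amount": 40},
    {"region": "East", "amount": 60},
]

# Summary level: one total per region.
totals = defaultdict(float)
for row in sales:
    totals[row["region"]] += row["amount"]

print(dict(totals))  # → {'East': 160.0, 'West': 40.0}
```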

what are the differences between transactional and informational systems

transactional: executing and completing transactions for business processes; does not have all the information you need, so it is an "island of data"
informational: data mining, analytics
MUST USE BOTH SYSTEMS TO HAVE A FULL VIEW OF THE COMPANY

descriptive data

what happened in the past; this is most commonly used by businesses; it's the simplest and most basic

data analytics can help answer questions such as...

what has happened in the past? why did it happen? what could happen in the future? with what certainty? what actions can we take now to support or prevent certain events from happening in the future?

prescriptive data

what should we do

predictive data

what will happen in the future and with what certainty

is the fact table normalized?

yes

is data analytics interdisciplinary

yes, it involves computer science, mathematics/statistics, and domain knowledge

Snowflake Schema

■ An extension of star schema where the dimensions are decomposed into multiple related tables

Factless Fact Table

■ No nonkey data in _____ ; only FKs for associated dimensions
■ Used for tracking events and possible occurrences

Galaxy Schema (Multiple FACT TABLES )

■ Often used to store facts for different processes (with different combinations of dimensions)
■ Has conformed DIMENSIONS

○ Difference between DBA and DWA

■ Security and privacy is a pressing issue in DW administration
■ Safeguarding data
■ Government regulations
■ Profile management (who can access what data)

○ Data warehouse administrator (DWA)

■ Technical knowledge
■ Business knowledge
■ Process knowledge
■ Communication skills

● What are the variations of star schema?

○ 2 variations ○ Galaxy ○ Snowflake

● How many steps are there in ETL? What is the sequence?

○ 4 steps: Extract, clean, transform, load

● Hosted data warehouses (via cloud computing)

○ Alternative to development ○ Less investment in infrastructure for cloud computing

● Two types of Data Marts

○ Dependent (data, ETL): here you do have a central data warehouse
○ Independent (data, ETL): here you do not have a central data warehouse
● If data marts are not linked together, data silos can result
● Independent data marts are easier to deploy but harder to maintain
● Dependent data marts get their data from the EDW and feed their data to end users

● A star schema is for which data layer?

○ Derived data
○ Two kinds of tables: fact and dimension

● Data warehouse development approaches

○ EDW approach (top-down) ○ Data mart approach (bottom-up)

● Traditional versus active DW

○ Enabling real-time data updates for real-time analysis and real-time decision making
○ Ex: banks need to update withdrawal information every second
■ Most businesses that use data to make decisions with less urgency will update their information nightly

● How is fact table different from dimension tables?

○ Fact table is in a one-to-many relationship with each dimension table
○ Fact table is on the MANY side
○ Fact table holds QUANTITATIVE data that you can slice & dice

● What types of activities fall under data cleaning/scrubbing/wrangling?

○ Finding errors and notifying the source system of those errors

successful BI implementation

success of a BI implementation is measured by its widespread usage for better (data-driven) decision making
the typical BI user community includes:
- strategic level: all levels of management
- tactical level: operational users

OLAP (informational systems)

- provide a place for data to be stored and prepared for analytical purposes
- hold large volumes of historical data - archived, aggregated data which may have come from a transactional system originally
- data are extracted from other systems and loaded into the informational system periodically, possibly in summarized form
- support managerial decision making

2 phases of data reconciliation

1) during an initial load when the EDW is first created 2) during subsequent updates to keep the EDW current

5 steps of data reconciliation

1) mapping and metadata management 2) capture 3) scrub 4) transform 5) load and index

4 characteristics of a data warehouse

1) subject oriented: organized by subject area such as sales, product, customers, and contains information only related to that specific subject
2) time variant: time is an important element of analytics
3) integrated: data is integrated in one place for the purpose of analysis
4) non volatile: after data are entered into the database they cannot be changed

● Data warehouse

: a system that extracts, cleans, conforms, and consolidates data periodically from the source systems into a dimensional data store
○ Usually keeps years of history
○ It is expensive to refresh the system → it puts a strain on transactional systems

● What are the three layers of data in a DW data architecture? What is the purpose of each layer?

○ Derived data
○ Reconciled data
○ Operational data

self service BI

PROS: allows BI users to become more self-reliant and less dependent on the IT organization
CONS: can lead to chaotic reporting; IT should work with the business to determine key metrics and agree on vocabulary

● What types of data extraction are there?

STATIC and INCREMENTAL

● What is always a dimension?

TIME

● A data mart may contain several star schemas. (T/F?)

TRUE

● Granularity of FACT table

What level of detail do we need? Most important design step.

ERP

a fully integrated system to allow functional business areas to share data

static extract

a method of capturing a snapshot of the required source data at a point in time

data scrubbing (cleansing)

a process of using pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before transforming and moving the data to the data warehouse

data analytics

accessing and cleaning data, transforming it, and analyzing it

refresh mode

an approach to filling a data warehouse that involves bulk rewriting of the target data at periodic intervals

update mode

an approach to filling a data warehouse in which only changes in the source data are written to the data warehouse
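
The two modes above can be contrasted in a tiny sketch: refresh rewrites the whole target from the source, while update applies only the captured changes. The dictionaries stand in for tables; both paths end at the same result here by construction.

```python
# Current source data and a stale copy of the target.
source = {1: "Acme", 2: "Globex Ltd", 3: "Initech"}
stale  = {1: "Acme", 2: "Globex"}

# Refresh mode: periodically rewrite the entire target from the source.
refreshed = dict(source)

# Update mode: write only the rows changed/added since the last load.
changes = {2: "Globex Ltd", 3: "Initech"}
updated = {**stale, **changes}

print(refreshed == updated)  # → True
```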

● Periodic (non-volatile) data

are never physically altered or deleted once they have been added to the data source

transactional systems

- availability close to 100%
- updateable
- process transactions quickly
- store current information
- support the organization's business functions

incremental extract

captures only the changes that have occurred in the source data since the last capture
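
An incremental extract can be sketched as a timestamp filter: only rows modified after the last capture are selected. The `updated` field and dates are hypothetical.

```python
from datetime import datetime

# Source rows carry a last-modified timestamp.
rows = [
    {"id": 1, "updated": datetime(2024, 1, 1)},
    {"id": 2, "updated": datetime(2024, 3, 5)},
    {"id": 3, "updated": datetime(2024, 3, 9)},
]

last_capture = datetime(2024, 2, 1)

# Select only the rows changed since the last capture.
delta = [r for r in rows if r["updated"] > last_capture]

print([r["id"] for r in delta])  # → [2, 3]
```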

● Transient (volatile) data:

changes to existing records are written over previous records, thus destroying the previous data content

Powerpivot

column-oriented - data access is much quicker when forming queries

clustering

customer profiling, descriptive analytics

big data (three v's)

data sources (variety) data storage (volume) processing and analyzing (velocity)

detailed data

data are detailed rather than summarized which provides maximum flexibility in terms of how data can be structured

historical data

data are periodic to provide a historical perspective

structured data

data organized such that it can be read and used by people and computers
- string, numeric, dates
- stored in cells or fields of fixed length
- spreadsheets: data integrity might be a concern, file size is limited, easy to use
- flat files: .csv, .txt

what is the difference between data analytics and data science

data science is about creating models, data analytics is about using the models

● STAR SCHEMA

has a FACT table surrounded by several DIMENSION tables through 1:M relationships
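
A miniature star schema as plain Python structures: dimension tables hold descriptive attributes, the fact table holds only foreign keys plus measures, and a query slices the facts through the dimensions. All table and column names are invented for the sketch.

```python
# Dimension tables: descriptive reference data, keyed by surrogate key.
dim_product = {10: {"name": "Widget", "category": "Hardware"}}
dim_date    = {20240105: {"year": 2024, "month": 1}}

# Fact table: one row per sale, holding only FKs and measures.
fact_sales = [
    {"product_key": 10, "date_key": 20240105, "units": 3, "revenue": 29.97},
]

# A slice & dice query: total revenue for Hardware products in 2024.
total = sum(
    f["revenue"]
    for f in fact_sales
    if dim_product[f["product_key"]]["category"] == "Hardware"
    and dim_date[f["date_key"]]["year"] == 2024
)

print(total)  # → 29.97
```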

OLAP

online analytical processing systems are used to extract information from data stored by OLTP systems
- routine sales reports by product, region, salesperson
- often built on top of a data warehouse where the data is not transactional

OLTP

online transaction processing systems are constantly involved in handling updates (add/edit/delete) to operational databases
- databases are geared toward OLTP
- the main goal of OLTP is transaction accuracy
- ERP systems are OLTP systems that generate and store transaction data

churn classification

predicting who is likely to defect from your product/offering

● Are relationships between dimensions allowed?

○ There are relationships between the dimension and fact tables
○ Relationships from dimension to dimension ARE NOT ALLOWED
○ One exception: when you have a hierarchy of tables (you can have a relationship between product and product category)

● Relationships between dimensions typically allowed

NOT ALLOWED

dimension tables?

denormalized

what are applications of analytics

fraud detection, demand forecasts

data transformation

involves converting data from the format of the source operational system to the format of the enterprise data warehouse

● Most common data model

is dimensional model, which is usually implemented as a star schema

data reconciliation

is the process responsible for transforming operational data into reconciled data - the differences between them are so sharp that data reconciliation is the hardest part of building a data warehouse

traditional pivot tables

limitations: source of data MUST be Excel; number of tables; size of data (Excel can only hold around 1 million rows)

● The simplest dimensional (derived data)

model where FACTS, MEASURES, DATA VALUES are separated from DIMENSION & REFERENCE data

● Measures:

money, sales volume, head count, inventory, profit, actual versus forecast

● Finer grain→

more refined analysis

● Dimensions

products, salespeople, market segments, business units, geographical locations, distribution channels, country, or industry

● The purpose of ETL

○ Operational data
■ Transient
■ Normalized
■ Restricted in scope
■ Can be of poor quality
○ Reconciled data
■ Detailed
■ Historical
■ Normalized
■ Comprehensive
■ Quality controlled
○ 80-90% of the cost of running a data warehouse comes from the ETL of operational data
○ ETL is also known as harmonizing data

● What is the purpose of an operational data store

○ Short-term data that you retrieve from data sources and put in interim storage, where you can do ETL (extract, transform, load) before you load it into the database

● What types of activities fall under data transformation?

○ Single & multi field transformations ○ Aggregations
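
The two transformation types above can be sketched as small functions: a single-field transformation reformats one source field, while a multi-field transformation derives one target field from several source fields. Both functions and their formats are hypothetical examples.

```python
# Single-field transformation: reformat one field (US date -> ISO date).
def to_iso(us_date: str) -> str:
    m, d, y = us_date.split("/")
    return f"{y}-{m.zfill(2)}-{d.zfill(2)}"

# Multi-field transformation: derive one field from several.
def full_name(first: str, last: str) -> str:
    return f"{last}, {first}"

print(to_iso("1/5/2024"))            # → 2024-01-05
print(full_name("Ada", "Lovelace"))  # → Lovelace, Ada
```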

● What types of data load are there

○ Static extract = full/refresh load ○ Incremental extract = update load

● DW Scalability can be in terms of

○ The amount of data
○ The number of concurrent users
○ The complexity of user queries
○ Business functions supported
○ Impact of cloud computing → easier to get a scalable system

● What is the main difference between a data warehouse and a data mart?

○ The main difference is SCOPE
○ Data mart is for a specific business unit (accounting, sales)
○ Data warehouse is for an entire organization

