Module 12 Data warehousing
A small data warehouse designed for a strategic business unit or a dept.
Independent data mart
SQL
(Structured Query Language) is a data definition and manipulation language that allows for the creation of database tables, querying of data across multiple tables, and the generation of transaction-based reporting of relational databases.
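The three uses named above (creating tables, querying across multiple tables, reporting) can be sketched with Python's built-in sqlite3 module. The table and column names here are made up for illustration and do not come from the course material.

```python
import sqlite3

# Hypothetical two-table schema; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Query across both tables: total order amount per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(totals)
```

CREATE TABLE is the data definition side; INSERT and SELECT are the data manipulation side.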
An evolving tool space that promises real-time data integration from a variety of sources, such as relational or multidimensional databases, web services, etc.
Enterprise information integration (EII)
ETL stands for
Extract, transform, and load
The technologies that come with Big Data are
Hadoop, MapReduce, NoSQL, and Hive
data warehousing
creates a well-planned information management solution to enable analytical and informational processing despite platform, application, organizational, and other barriers.
Data integration uses three things:
data access, data federation (integration of business views across multiple data stores), and change capture (based on the identification, capture, and delivery of changes made to enterprise data sources).
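Change capture can be pictured as comparing two snapshots of a source table. This is only a minimal sketch: the function name and the sample records are hypothetical, and real tools often read database logs rather than compare snapshots.

```python
def capture_changes(previous, current):
    """Return the keys that were inserted, updated, or deleted between two snapshots."""
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}

# Hypothetical snapshots of a customer table, keyed by customer id.
previous = {1: {"name": "Ada", "city": "London"}, 2: {"name": "Grace", "city": "NYC"}}
current = {1: {"name": "Ada", "city": "Paris"}, 3: {"name": "Edsger", "city": "Austin"}}

changes = capture_changes(previous, current)
print(changes)
```

Only the identified changes, not the whole table, would then be delivered downstream.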
data warehouse parts
data warehouse itself, data acquisition (back-end), client (front-end).
A star schema contains a central fact table surrounded by and connected to several _.
dimension tables
Dependent Data Mart
is a subset that is created directly from the data warehouse. It has the advantage of using a consistent data model and providing quality data. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies.
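One way to picture a dependent data mart is as a view defined over a warehouse table, so the mart always shows the same version of the data. This is a sketch under that assumption, using sqlite3; the sales table and marketing_mart view are hypothetical names.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (department TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('marketing', 'campaign-a', 100.0),
        ('marketing', 'campaign-b', 250.0),
        ('finance',   'audit',       75.0);
    -- The mart is defined over the warehouse table, so it cannot drift from it.
    CREATE VIEW marketing_mart AS
        SELECT product, revenue FROM sales WHERE department = 'marketing';
""")

mart_rows = warehouse.execute(
    "SELECT product, revenue FROM marketing_mart ORDER BY product"
).fetchall()
print(mart_rows)
```

A physical mart would copy the subset instead of using a view, but the source would still be the warehouse.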
Business statistics
is the science of good decision making in the face of uncertainty and is used in many disciplines such as financial analysis, econometrics, auditing, production and operations including services improvement, and marketing research.
relational database
it uses a series of logically related two-dimensional tables or files to store information in the form of a database
Data warehousing is used primarily to help
make informed decisions.
17.lack of data standards
managers need to perform cross-functional analysis using data from all departments, which differed in granularities, formats and levels
Dimensional modeling
is a retrieval-based system that supports high-volume query access.
data warehouse
multidimensional, layers of rows & columns
13.Four common characteristics of big data
The 4 V's: variety, veracity, volume, velocity
OLAP tools
provide data access to end users. allow a user to "drill-down" into their data to view it at whatever level of detail they need.
integrity constraints
rules that help ensure the quality of the information
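Integrity constraints can be seen in action with sqlite3. In this sketch the database itself rejects a row with a non-positive quantity and a row that points to a product that does not exist. The schema is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE sale (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL CHECK (quantity > 0)
    )
""")
conn.execute("INSERT INTO product VALUES (1, 'Widget')")
conn.execute("INSERT INTO sale VALUES (1, 1, 3)")  # valid row

rejected = []
# First row breaks the CHECK rule, second breaks the foreign key rule.
for row in [(2, 1, 0), (3, 99, 5)]:
    try:
        conn.execute("INSERT INTO sale VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

print(rejected)
```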
star schema
simplest form of dimensional modeling. Contains a central fact table surrounded by and connected to several dimension tables. The fact table contains a large number of rows that correspond to observed facts and external links.
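A tiny star schema can be sketched with sqlite3: one fact table whose keys link out to two dimension tables. The table names (fact_sales, dim_product, dim_store) and the data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, region   TEXT);
    -- Central fact table: one row per observed fact, with links to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        units       INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Toys'), (2, 'Books');
    INSERT INTO dim_store   VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales  VALUES
        (1, 1, 2, 20.0), (1, 2, 1, 10.0), (2, 1, 4, 60.0), (2, 2, 3, 45.0);
""")

# A typical query joins the fact table to one dimension and aggregates.
revenue_by_region = conn.execute("""
    SELECT s.region, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(revenue_by_region)
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.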
18.poor data quality
the data, if available, were often incorrect or incomplete. so users could not rely on the data to make decisions
operational databases
the databases that support OLTP; inside these operational databases is valuable information that forms the basis for business intelligence
Distributed database management system
would pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. The islands-of-data problem still existed.
Dimension
A particular attribute of information
Zhao described five levels of metadata management maturity:
Ad hoc, discovered, managed, optimized, and automated.
10.Data warehouse
A logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks
12.Extraction, transformation, and loading (ETL)
A process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse
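The three ETL steps can be sketched in a few lines. The two source systems, their field layouts, and the target sales table below are hypothetical; the point is only that differently formatted sources are converted to one common definition before loading.

```python
import sqlite3

# Extract: two hypothetical source systems that record the same facts differently.
crm_rows = [{"cust": "ada", "amt_usd": "12.50", "date": "2024-01-05"}]
erp_rows = [("GRACE", 730, "05/01/2024")]  # amount in cents, date as DD/MM/YYYY

# Transform: convert each source to the common definition
# (capitalized name, amount in dollars, ISO date).
def transform_crm(row):
    return (row["cust"].title(), float(row["amt_usd"]), row["date"])

def transform_erp(row):
    name, cents, dmy = row
    day, month, year = dmy.split("/")
    return (name.title(), cents / 100, f"{year}-{month}-{day}")

staged = [transform_crm(r) for r in crm_rows] + [transform_erp(r) for r in erp_rows]

# Load: write the standardized rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount_usd REAL, sale_date TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

loaded = warehouse.execute("SELECT * FROM sales ORDER BY customer").fetchall()
print(loaded)
```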
Down-Flow
Aging. To archive data into storage hierarchy
Data mart
Contains a subset of data warehouse information
_ tools enable access to data warehouse.
Middleware
data dictionary
contains the logical structure for the information in a database
DBMS 5 parts
1. DBMS engine 2. Data definition subsystem 3. Data manipulation subsystem 4. Application generation subsystem 5. Data administration subsystem
Support for mobile users
: Many users who are relatively mobile (users who spend most of their time out of the office and use laptops or mobile devices, such as a Blackberry, to access office‐based computing resources) have to perform business intelligence functions when they're out of the office. In one model, mobile users can dial in or otherwise connect to a report server or an OLAP server, receive a download of the most recent data, and then (after detaching and working elsewhere) work with and manipulate that data in a standalone, disconnected manner. In another model, mobile users can leverage Wi‐Fi network connectivity or data networks, such as the Blackberry network, to run business intelligence reports and analytics that they have on the company intranet on their mobile device.
dashboard
A business intelligence _________ is a graphical interface that displays the current status of performance metrics and key performance indicators (KPIs) for an enterprise.
1. Big Data
A collection of large, complex data sets, including structured and unstructured data, which cannot be analyzed using traditional database methods and tools and includes the following four common characteristics: variety, veracity, volume, velocity
OLAP cube
A(n) __________ is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications.
OLTP
A(n) ___________ system is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for these systems is put on very fast query processing. These systems are optimized for single-step queries, not detailed analysis.
Real‐time intelligence:
Accessing real‐time, or almost real‐time, information for business intelligence (rather than having to wait for traditional batch processes) is becoming more commonplace. In these situations, an application must be capable of "pushing" information, as opposed to the traditional method of "pulling" the data through a report or query. Like with traditional data‐extraction services (described in Chapter 7), business intelligence tools must detect when new data is pushed into its environment and, if necessary, update measures and indicators that are already on a user's screen. (In most of today's business intelligence tools, on‐screen results are "frozen" until the user requests new data by issuing a new query or otherwise explicitly changing what appears on the screen.)
Out flow:
Accessing: to obtain data by consumers, ad hoc and routine. Delivery: to render data by the warehouse via publish-and-subscribe mechanisms.
Teradata did what first?
Built the first data warehousing appliance. a combination of hardware and software to solve the data warehousing needs of many.
7.Data Mining Techniques
CEAC: classification, estimation, affinity grouping, clustering
EDW's are used to provide data for many types of DSS including:
CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product life-cycle management (PLM), revenue management, and sometimes even Knowledge Management Systems (KMS).
20.Structured Data
Contains a defined length, type, and format and includes numbers, dates, or strings machine-generated or human-generated
2.Cube
Common term for the representation of multidimensional information
Data warehousing depends on:
DBMS, Extraction and conversion tools, internetworking techniques, front-end analysis tools, graphics
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
Data Warehouse
Integration that comprises 3 major processes: data access, data federation, and change capture.
Data integration
A departmental small-scale Data Warehouse that stores only limited/relevant data.
Data mart
What are the three main types of data warehouses?
Data marts, operational data store (ODS), and enterprise data warehouses (EDW)
21.Three organizational methods for analyzing big data
Data mining, big data analytics, data visualization
Major components of the data warehousing process:
Data sources, ETL, data loading, comprehensive database, metadata, middleware tools
The four major components of the data warehousing process
Data sources, data extraction (using custom-written or commercial software called ETL), data loading (data loaded to a staging area), comprehensive database, and metadata (used by IT personnel and users).
A subset that is created directly from a data warehouse.
Dependent data mart
A retrieval-based system that supports high-volume query access
Dimensional Modeling
EDW stands for
Enterprise Data Warehouse
A technology that provides a vehicle for pushing data from source systems into a data warehouse.
Enterprise application integration (EAI)
Types of integration technologies that enable data and metadata integration:
Enterprise application integration (EAI, a vehicle that pushes data from source systems to the data warehouse) and enterprise information integration (EII, which promises real-time data integration).
A data warehouse for the enterprise
Enterprise data warehouse (EDW)
Hub-and-spoke architecture
Famous data warehousing architecture today. Focus on building a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts. Allows for easy customization of user interfaces and reports. Lacks a holistic enterprise view, and may lead to data redundancy and data latency.
Agent technology:
In a growing trend, intelligent agents are used as part of a business intelligence environment. An intelligent agent might detect a major change in a key indicator, for example, or detect the presence of new data and then alert the user that he or she should check out the new information.
9.Data Visualization
Infographics, analysis paralysis, data visualization, data visualization tools, business intelligence dashboards, data artist
This approach emphasizes top-down development, employing established db development methodologies and tools, such as ERD
Inmon model, EDW approach
"Plan big, build small" approach, bottom-up. It is a scaled down DW that focuses on requests from specific depts
Kimball model, Data Mart approach
Types of analytical processing
MOLAP (multidimensional online analytical processing) indexes directly into a multidimensional database and is an alternative to ROLAP. ROLAP (relational online analytical processing) works on top of a relational database and is an alternative to MOLAP. HOLAP (hybrid online analytical processing) is a combination of ROLAP and MOLAP. SQL (pronounced "ess-que-el") stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.
Data about data. In DW, it describes the contents and the manner of its acquisition and use.
Metadata
Inmon model
Model, also known as the EDW approach, emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach.
kimball model
Model, also known as the data mart approach, is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales.
OLAP implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time
Multidimensional OLAP (MOLAP)
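The "ahead of time" idea in MOLAP can be sketched with a plain dictionary acting as the cube: every combination of dimension values, including an "all" level, is summed once up front. The data and the "*" marker for "all" are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical transactions: (region, product, units).
transactions = [("East", "Toys", 5), ("East", "Books", 3), ("West", "Toys", 4)]

ALL = "*"  # marker meaning "summed over every value of this dimension"

# Precompute every (region, product) cell, including the summary levels.
cube = defaultdict(int)
for region, product, units in transactions:
    for r in (region, ALL):
        for p in (product, ALL):
            cube[(r, p)] += units

# Answering a question is now a lookup, not a scan of the transactions.
print(cube[("East", ALL)], cube[(ALL, "Toys")], cube[(ALL, ALL)])
```

ROLAP would instead compute these sums from the relational tables when the query is issued.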
The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions).
Multidimensionality
22.Unstructured data
Not defined, does not follow a specified format, and is typically freeform text such as emails, Twitter tweets, text messages
Operational data store
ODS. Provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Used for short-term decisions: it holds only recent information, not data for long-term use, whereas a data warehouse stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data.
most common analysis technique in data warehouse?
OLAP online analytical processing.
Types of analytical processing activities:
Online analytical processing (OLAP), data mining, querying, reporting, and other decision-support applications.
OLAP vs OLTP
Online analytical processing vs. online transaction processing. OLTP captures and stores data for day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth. It is not suited to ad hoc and complex queries that deal with a number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.
A type of database often used as an interim area for a data warehouse.
Operational data stores (ODS)
An ODS is a
Operational data store. A type of customer-information-file database that is often used as a staging area for a data warehouse.
The importance of datawarehouses
Organizes - merchandising, advertising, distribution, sales, marketing, production, service, and accounts receivable
Data mining analysis methods
POFR: prediction, optimization, forecasting, regression
Server‐based functionality:
Rather than have most or all of the data manipulation performed on users' desktops, server‐based software (known as a report server) handles most of these tasks after receiving a request from a user's desktop tool. After the task is completed, the result is made available to the user, either directly (a report is passed back to the client, for example) or by posting the result on the company intranet.
Ralph Kimball founded
Red Brick Systems in 1986. Software company aimed at improving data access.
RDBMS stands for
Relational Database Management System
The implementation of an OLAP database on top of an existing relational database
Relational OLAP (ROLAP)
Centralized data warehouse
Similar to the hub-and-spoke architecture, except that there are no dependent data marts. Instead, one big enterprise data warehouse serves the needs of all organizational units. Gives a more holistic view.
Slice And Dice
Slice and dice refers to a strategy for segmenting, viewing, and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts and repeating this process until arriving at the right level of detail for analysis. Slicing and dicing helps provide a closer view of data for analysis and presents data in new and diverse perspectives.
Logical arrangement of tables in a multidimensional db in such a way that the ERD resembles a snowflake.
Snowflake Schema
The most commonly used and the simplest style of dimensional modeling.
Star Schema
Characteristics of Data Warehousing include
Subject oriented (data organized by detailed subject, such as sales or customers), integrated (consistent format), time variant (maintains historical data), nonvolatile (users can't change data; changes are recorded as new data).
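Time variant and nonvolatile can be pictured together as an append-only history: a change is stored as a new dated row, and older rows stay as they were. The helper names record and as_of are hypothetical, and this is only a sketch of the idea.

```python
from datetime import date

history = []  # append-only list of (customer, city, effective_date)

def record(customer, city, effective):
    """Record a change as a new row; existing rows are never modified."""
    history.append((customer, city, effective))

def as_of(customer, when):
    """Return the city in effect for a customer on a given date, or None."""
    rows = [r for r in history if r[0] == customer and r[2] <= when]
    return max(rows, key=lambda r: r[2])[1] if rows else None

record("Ada", "London", date(2023, 1, 1))
record("Ada", "Paris", date(2024, 6, 1))  # a move is added, not overwritten

print(as_of("Ada", date(2023, 12, 31)), as_of("Ada", date(2024, 6, 1)))
```

Because nothing is overwritten, the warehouse can still answer questions about any earlier moment in time.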
Meta-flow:
System modeling: to define structure of legacy systems, synthesizing to create valued, regulating to create modules for capturing.
Isle of Capri (a gaming company) solution to meet demands?
Teradata (a company) and IBM Cognos for Business Intelligence.
ETL(Extract, Transform and Load)
The __________ comprises the processing layer in a data warehousing model. It is responsible for pulling data out of the source systems and placing it into a data warehouse.
5.Data mining
The process of analyzing data to extract information not offered by the raw data alone
Data and information
The term data refers to factual information, especially that used for analysis and based on reasoning or calculation. Data itself has no meaning, but becomes information when it is interpreted. Information is a collection of facts or data that is communicated.
Data mart bus architecture
This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Not optimal for complex data queries.
3 tiers of data warehousing architecture. ( a 2 tier is more economical where the last two work together but not great for large companies).
Tier 1: Client workstation. Tier 2: Application server. Tier 3: Database server.
An oper mart is an operational data mart.
True
Additional data warehouse characteristics include:
Web based, relational/multidimensional, client/server (for easy access by end users), real time (newer data warehouses provide real-time or active data-access and analysis capabilities), metadata (data about data: how it's all organized, how to use it, etc.).
database
a collection of information that you organize and access according to the logical structure of that information
data warehouse
a logical collection of information gathered from many different operational databases used to create business intelligence that supports business analysis activities and decision making tasks
In an OLAP a cube is
a multidimensional data structure (actual or virtual) that allows fast analysis of data. It provides the capability of efficiently manipulating and analyzing data from multiple perspectives and is aimed at overcoming a limitation of relational databases. An analyst can navigate through the database and screen for a particular subset of the data by changing the data's orientations and defining analytical calculations. It is not as well suited to very large amounts of data as a standard relational format is.
Hans Peter Luhn
a researcher at IBM, introduced the idea of business intelligence in 1958 as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action to a desired goal."
dimensional modeling is
a retrieval based system that supports high-volume query access.
Independent Data Mart
a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW.
A data warehouse is
a specially constructed data repository where data are organized so that they can be easily accessed by end users for several applications.
OLAP(Online analytical processing)
a(n) ___________ system is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. These systems manipulate aggregated, historical data, stored in multi-dimensional schemas (usually star schema like the cube).
drill down
access data that is in a lower level of a hierarchically structured database.
Middleware tools enable
access to the data warehouse. Power users such as analysts may write their own SQL queries.
19.primary purpose of a data warehouse
aggregate information throughout an organization into a single repository for decision-making purposes
Relational DBMS
allow multiple access queries.
Active Data warehousing (as opposed to traditional data warehousing)
allows for large numbers of users and operational staff. An active data warehouse is a repository of any form of captured transactional data so that they can be used for the purpose of finding trends
data mining
analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
Oper marts
are created when operational data needs to be analyzed multidimensionally. The data for an oper mart come from an ODS.
In-Flow DS flow
capturing data from legacy system, validating to test data for reality, repairing to examine and build data, transforming for consolidation, applying to move and load data.
A data mart...
contain data on one topic (e.g., marketing). A data mart can be a replication of a subset of data in the data warehouse. Data marts are a less expensive solution that can be replaced by or can supplement a data warehouse. Data marts can be independent of or dependent on a data warehouse.
3.
databases are 2D: rows (entities) and columns (attributes)
A web-server is backed by both a
data warehouse and an application server. used for ease of access, platform independence, and lower cost.
The federated data warehouse
architecture involves integrating disparate systems and analytical resources from multiple sources to meet changing needs or business conditions.
15.inconsistent data definitions
every department had its own method for recording data, so when trying to share info, data did not match and users did not get the data they really needed
The ETL process consists of
extraction (reading data from one or more dbs), transformation (converting the extracted data from its previous form into the form needed for the DW), and load (putting the data into the DW)
11.Data Warehousing
fixes the following problems: inconsistent data definitions, lack of data standards, poor data quality, inadequate data usefulness, ineffective direct data access
DBMS (database management system)
helps you specify the logical organization for a database and access and use the information within a database
multidimensional cube is
inflexible and does not support the ad hoc creation of multidimensional views of the products, services, and customers. Can't handle more than 30 gigabits of data.
Inmon vs kimball
Inmon: top-down, enterprise-wide, complex, subject-driven, low end-user accessibility, aimed at IT professionals. Kimball: bottom-up, simple method, data marts, process-oriented, dimensional modeling, high end-user accessibility.
The data warehouse is a collection of _, _ databases designed to support DSS functions, where each unit of data is _ and relevant to some moment in time.
integrated, subject-oriented, non-volatile
Federated data warehouse
integrates analytical resources from multiple sources to meet changing needs or business conditions.
snowflake schema
is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape.
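A snowflake schema can be sketched as a star schema whose dimension is split further: here the product dimension points to its own category table, so a category-level query needs two joins. All table names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The product dimension is normalized: category lives in its own table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        units       INTEGER
    );
    INSERT INTO dim_category VALUES (1, 'Toys'), (2, 'Books');
    INSERT INTO dim_product  VALUES (10, 'Robot', 1), (11, 'Puzzle', 1), (12, 'Atlas', 2);
    INSERT INTO fact_sales   VALUES (10, 2), (11, 5), (12, 1), (10, 3);
""")

# Fact -> product -> category: the extra hop is the "branch" of the snowflake.
units_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(units_by_category)
```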
Enterprise information integration (EII)
is a mechanism for pulling data from source systems to satisfy a request for information. It is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
Relational Databases are not well suited for
manipulating records. They support a lot of data and dynamic joining of data, and they are a proven technology, but their performance is less than optimal and they cannot be used for purely optimized processing.
16.Ineffective Direct Data Access
most data stored in operational databases did not allow users direct access, users had to wait to have their queries or questions answered by MIS professionals who code SQL
How are data warehouses different from operational databases
operational databases are more product oriented, and data warehouses use subject orientation to give a more comprehensive view of the organization.
Data mining takes analysis further by sifting through a large amount of data to find info using algorithms such as:
predictive modeling, database segmentation, link analysis, deviation detection.
OLAP Operations:
slice, dice, drill down/up, roll up, pivot
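Four of these operations can be sketched on a small list of facts in plain Python (pivot is left out). The data and the roll_up helper are hypothetical, and a real OLAP tool would work against a cube rather than a list.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, units).
facts = [
    ("East", "Toys", "Q1", 5), ("East", "Toys", "Q2", 7),
    ("East", "Books", "Q1", 3), ("West", "Toys", "Q1", 4),
    ("West", "Books", "Q2", 6),
]

def roll_up(rows, *dims):
    """Sum units over the given dimension positions (0=region, 1=product, 2=quarter)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

# Slice: fix one dimension to a single value.
q1_slice = [r for r in facts if r[2] == "Q1"]

# Dice: restrict two or more dimensions to chosen values.
diced = [r for r in facts if r[0] == "East" and r[1] == "Toys"]

# Roll up: summarize to the region level.
by_region = roll_up(facts, 0)

# Drill down: break the regional totals out by product.
by_region_product = roll_up(facts, 0, 1)

print(by_region)
print(by_region_product)
```

Drill down and roll up are the same aggregation run at finer or coarser levels of detail.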
OLTP (online transaction processing)
the gathering of input information, processing that information and updating existing information to reflect the gathered and processed information
online analytical processing
the manipulation of information to support decision making
Key performance indicators
the most essential and important quantifiable measures used in analytics initiatives to monitor success of a business activity
multidimensional databases lack
the scalability and flexibility for DSS
analytics
the science of fact based decision making
performing extensive ETL (extraction, transformation, load)
to move data to the data warehouse may be a sign of poorly managed data and a fundamental lack of a coherent data management strategy.
AI (artificial intelligence)
tools such as neural networks and fuzzy logic form the basis of information discovery and build business intelligence in OLAP
8.Data-mining tools
use a variety of techniques to find patterns and relationships in large volumes of information
14.inadequate data usefulness
users could not get the data they needed, what was collected was not always useful for intended purposes
Multidimensional Database
usually contain a star model. designed for slice and dice and drill down analysis. highly indexed databases. provides data mining and drill down capabilities.