Module 12 Data warehousing

A small data warehouse designed for a strategic business unit or a dept.

Independent data mart

SQL

(Structured Query Language) is a data definition and manipulation language that allows for the creation of database tables, querying of data across multiple tables, and the generation of transaction-based reporting of relational databases.

An evolving tool space that promises real-time data integration from a variety of sources, such as relational or multidimensional databases, web services, etc

Enterprise information integration (EII)

ETL stands for

Extract, transform, and load

The technologies that come with Big Data are

Hadoop, MapReduce, NoSQL, and Hive

data warehousing

creates a well-planned information management solution to enable analytical and informational processing despite platform, application, organizational, and other barriers.

Data integration uses three things:

data access, data federation (integration of business views across multiple data stores), and change capture (based on the identification, capture, and delivery of changes made to enterprise data sources).
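The change-capture idea above can be illustrated with a minimal sketch. It assumes two in-memory snapshots of a source table keyed by primary key; the function name `capture_changes` is hypothetical. Production change-capture tools usually read database logs rather than diffing snapshots, so treat this as a simplified stand-in for the identify, capture, and deliver steps.

```python
def capture_changes(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and list the changes."""
    changes = []
    # Rows that are new or whose contents differ from the earlier snapshot.
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    # Rows that existed before but are gone now.
    for key, row in previous.items():
        if key not in current:
            changes.append(("delete", key, row))
    return changes


# Hypothetical snapshots of a customer table taken at two points in time.
yesterday = {1: "Ann, Boston", 2: "Bob, Austin", 3: "Cy, Reno"}
today = {1: "Ann, Boston", 2: "Bob, Denver", 4: "Di, Tampa"}
delta = capture_changes(yesterday, today)
```

Only the rows in `delta` would need to be delivered to the warehouse, instead of reloading the whole table.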

data warehouse parts

data warehouse itself, data acquisition (back-end), client (front-end).

A star schema contains a central fact table surrounded by and connected to several _.

dimension tables

Dependent Data Mart

is a subset that is created directly from the data warehouse. It has the advantage of using a consistent data model and providing quality data. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies.

Business statistics

is the science of good decision making in the face of uncertainty and is used in many disciplines such as financial analysis, econometrics, auditing, production and operations including services improvement, and marketing research.

relational database

it uses a series of logically related two-dimensional tables or files to store information in the form of a database

Data warehousing used primarily to help

make informed decisions.

17.lack of data standards

managers need to perform cross-functional analysis using data from all departments, which differed in granularities, formats and levels

Dimensional modeling

modeling is a retrieval-based system that supports high-volume query access.

data warehouse

multidimensional, layers of rows & columns

13.Four common characteristics

of big data (the 4 V's): variety, veracity, volume, and velocity

OLAP tools

provide data access to end users. allow a user to "drill-down" into their data to view it at whatever level of detail they need.

integrity constraints

rules that help ensure the quality of the information
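As a rough illustration of integrity constraints, the sketch below uses Python's built-in `sqlite3` module. The `orders` table, its columns, and the helper names are assumptions made up for this example; the rules shown (primary key, NOT NULL, CHECK) are only a few of the constraint types a DBMS can enforce.

```python
import sqlite3


def make_orders_table():
    """Create an in-memory table whose constraints reject bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,                  -- no duplicate ids
            quantity INTEGER NOT NULL CHECK (quantity > 0) -- must be positive
        )
        """
    )
    return conn


def try_insert(conn, order_id, quantity):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, quantity))
        return True
    except sqlite3.IntegrityError:
        return False
```

For example, inserting `(1, 5)` succeeds, while `(2, -3)` fails the CHECK rule and a second row with `order_id` 1 fails the primary-key rule.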

star schema

The simplest form of dimensional modeling. Contains a central fact table surrounded by and connected to several dimension tables. The fact table contains a large number of rows that correspond to observed facts and external links.
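A minimal sketch of a star schema, using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_region`) and the sample rows are invented for illustration; the point is only that the fact table holds foreign keys to each dimension table and that queries join outward from the center.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript(
    """
    -- Dimension tables: one row per product / region.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);

    -- Central fact table: one row per observed sale, keyed to the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0),
                                   (2, 1, 75.0),  (1, 1, 25.0);
    """
)

# Join the fact table to its dimensions and total sales per product and region.
star_rows = star.execute(
    """
    SELECT p.name, r.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY p.name, r.name
    ORDER BY p.name, r.name
    """
).fetchall()
```

Each dimension is a single table one join away from the facts, which is what keeps star-schema queries simple.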

18.poor data quality

the data, if available, were often incorrect or incomplete. so users could not rely on the data to make decisions

operational databases

The databases that support OLTP. Inside these operational databases is valuable information that forms the basis for business intelligence.

Distributed database management system

would pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. The islands-of-data problem still existed.

Dimension

A particular attribute of information

Zhao described five levels of metadata management maturity:

Ad hoc, discovered, managed, optimized, and automated.

10.Data warehouse

A logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks

12.Extraction, transformation, and loading (ETL)

A process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse
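The three ETL steps can be sketched as follows. Everything here is a made-up example: the two source systems (`crm_rows`, `erp_rows`), their field names, and the target table `customer_revenue` are assumptions, and the "common set of enterprise definitions" is reduced to one country-code mapping and one numeric format.

```python
import sqlite3

# Hypothetical source systems that record the same facts in different ways.
crm_rows = [{"cust": "Ann", "country": "usa", "revenue": "1,200.50"}]
erp_rows = [{"customer_name": "Bob", "country_code": "US", "rev_usd": 300}]

# Shared enterprise definition: one canonical code per country.
COUNTRY_CODES = {"usa": "US", "us": "US"}


def extract():
    """Read rows from every source, tagging each with where it came from."""
    for row in crm_rows:
        yield "crm", row
    for row in erp_rows:
        yield "erp", row


def transform(source, row):
    """Map a source-specific row onto the common (name, country, revenue) form."""
    if source == "crm":
        name, country = row["cust"], row["country"]
        revenue = float(row["revenue"].replace(",", ""))
    else:
        name, country = row["customer_name"], row["country_code"]
        revenue = float(row["rev_usd"])
    country = COUNTRY_CODES.get(country.lower(), country.upper())
    return name, country, revenue


def load(conn, records):
    """Write the cleaned records into the warehouse table."""
    conn.executemany("INSERT INTO customer_revenue VALUES (?, ?, ?)", records)


def run_etl():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_revenue (name TEXT, country TEXT, revenue REAL)"
    )
    load(conn, [transform(source, row) for source, row in extract()])
    return conn.execute(
        "SELECT name, country, revenue FROM customer_revenue ORDER BY name"
    ).fetchall()
```

Calling `run_etl()` returns both customers in the same shape, regardless of which system they came from.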

Down-Flow

Aging. To archive data into storage hierarchy

Data mart

Contains a subset of data warehouse information

_ tools enable access to data warehouse.

Middleware

data dictionary

the logical structure for the information in a database

DBMS 5 parts

1. DBMS engine 2. Data definition subsystem 3. Data manipulation subsystem 4. Application generation subsystem 5. Data administration subsystem

Support for mobile users

Many users who are relatively mobile (users who spend most of their time out of the office and use laptops or mobile devices, such as a Blackberry, to access office‐based computing resources) have to perform business intelligence functions when they're out of the office. In one model, mobile users can dial in or otherwise connect to a report server or an OLAP server, receive a download of the most recent data, and then (after detaching and working elsewhere) work with and manipulate that data in a standalone, disconnected manner. In another model, mobile users can leverage Wi‐Fi network connectivity or data networks, such as the Blackberry network, to run business intelligence reports and analytics that they have on the company intranet on their mobile device.

dashboard

A business intelligence _________ is a graphical interface that displays the current status of performance metrics and key performance indicators (KPIs) for an enterprise.

1. Big Data

A collection of large, complex data sets, including structured and unstructured data, which cannot be analyzed using traditional database methods and tools and includes the following four common characteristics Variety Veracity Volume Velocity

OLAP cube

A(n) __________ is a multidimensional database that is optimized for data warehouse and online analytical processing (OLAP) applications.

OLTP

A(n) ___________ system is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for these systems is put on very fast query processing. These systems are optimized for single-step queries, not detailed analysis.

Real‐time intelligence:

Accessing real‐time, or almost real‐time, information for business intelligence (rather than having to wait for traditional batch processes) is becoming more commonplace. In these situations, an application must be capable of "pushing" information, as opposed to the traditional method of "pulling" the data through a report or query. Like with traditional data‐extraction services (described in Chapter 7), business intelligence tools must detect when new data is pushed into its environment and, if necessary, update measures and indicators that are already on a user's screen. (In most of today's business intelligence tools, on‐screen results are "frozen" until the user requests new data by issuing a new query or otherwise explicitly changing what appears on the screen.)

Out flow:

Accessing: to obtain data for consumers through ad hoc and routine requests. Delivery: to render data from the warehouse via publish-and-subscribe mechanisms.

Teradata did what first?

Built the first data warehousing appliance. a combination of hardware and software to solve the data warehousing needs of many.

7.Data Mining Techniques

CEAC: classification, estimation, affinity grouping, clustering

EDW's are used to provide data for many types of DSS including:

CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product life-cycle management (PLM), revenue management, and sometimes even Knowledge Management Systems (KMS).

20.Structured Data

Contains a defined length, type, and format and includes numbers, dates, or strings machine-generated or human-generated

2. Cube

Common term for the representation of multidimensional information

Data warehousing depends on:

DBMS, Extraction and conversion tools, internetworking techniques, front-end analysis tools, graphics

A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.

Data Warehouse

Integration that comprises 3 major processes: data access, data federation, and change capture.

Data integration

A departmental small-scale Data Warehouse that stores only limited/relevant data.

Data mart

What are the three main types of data warehouses?

Data marts, operational data store (ODS), and enterprise data warehouses (EDW)

21.Three organizational methods for analyzing big data

Data mining, big data analytics, data visualization

Major components of the data warehousing process:

Data sources, ETL, data loading, comprehensive database, metadata, middleware tools

The four major components of the data warehousing process

Data sources, data extraction (using custom-written or commercial software called ETL), data loading (data are loaded into a staging area), comprehensive database, and metadata (used by IT personnel and users).

A subset that is created directly from a data warehouse.

Dependent data mart

A retrieval-based system that supports high-volume query access

Dimensional Modeling

EDW stands for

Enterprise Data Warehouse

A technology that provides a vehicle for pushing data from source systems into a data warehouse.

Enterprise application integration (EAI)

Types of integration technologies that enable data and metadata integration:

Enterprise application integration (EAI, a vehicle for pushing data from source systems into the data warehouse) and enterprise information integration (EII, which promises real-time data integration).

A data warehouse for the enterprise

Enterprise data warehouse (EDW)

Hub-and-spoke architecture

A widely used data warehousing architecture today. Focuses on building a scalable and maintainable infrastructure that includes a centralized data warehouse and several dependent data marts. Allows for easy customization of user interfaces and reports. Lacks a holistic enterprise view and may lead to data redundancy and data latency.

Agent technology:

In a growing trend, intelligent agents are used as part of a business intelligence environment. An intelligent agent might detect a major change in a key indicator, for example, or detect the presence of new data and then alert the user that he or she should check out the new information.

9.Data Visualization

Infographics, analysis paralysis, data visualization, data visualization tools, business intelligence dashboards, data artist

This approach emphasizes top-down development, employing established db development methodologies and tools, such as ERD

Inmon model, EDW approach

"Plan big, build small" approach, bottom-up. It is a scaled down DW that focuses on requests from specific depts

Kimball model, Data Mart approach

Types of analytical processing

MOLAP (multidimensional online analytical processing) is an alternative to ROLAP (relational OLAP) technology; it indexes directly into a multidimensional database. ROLAP (relational online analytical processing) is an alternative to MOLAP. HOLAP (hybrid online analytical processing) is a combination of ROLAP and MOLAP. SQL (pronounced "ess-que-el") stands for Structured Query Language and is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems.

Data about data. In DW, it describes the contents and the manner of its acquisition and use.

Metadata

Inmon model

Model, also known as the EDW approach, emphasizes top-down development, employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach.

kimball model

Model, also known as the data mart approach, is a "plan big, build small" approach. A data mart is a subject-oriented or department-oriented data warehouse. It is a scaled-down version of a data warehouse that focuses on the requests of a specific department, such as marketing or sales.

OLAP implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time

Multidimensional OLAP (MOLAP)
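The "summarizes ahead of time" part of MOLAP can be sketched in plain Python. This is an illustrative toy, not how any particular MOLAP product stores data: the function `build_cube`, the sample `facts`, and the dictionary layout are all assumptions. It precomputes the total of one measure for every combination of dimensions, so later lookups need no scanning.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(facts, dims, measure):
    """Precompute totals of `measure` for every subset of `dims`."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            totals = defaultdict(float)
            for row in facts:
                # Key each total by the row's values for the chosen dimensions.
                totals[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical transaction-level facts.
facts = [
    {"region": "East", "product": "Widget", "sales": 100.0},
    {"region": "East", "product": "Gadget", "sales": 80.0},
    {"region": "West", "product": "Widget", "sales": 60.0},
]
cube = build_cube(facts, ("region", "product"), "sales")

# Answering a question is now a dictionary lookup, not a scan of the facts.
east_total = cube[("region",)][("East",)]
```

The trade-off is that the number of precomputed views doubles with each added dimension, which is one reason multidimensional stores can struggle to scale.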

The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions).

Multidimensionality

22.Unstructured data

Not defined, does not follow a specified format, and is typically freeform text such as emails, Twitter tweets, text messages

Operational data store

ODS. Provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Used for short-term decisions; it holds only recent information and is not meant for long-term use, whereas a data warehouse stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data.

most common analysis technique in data warehouse?

OLAP online analytical processing.

Types of analytical processing activities:

Online analytical processing (OLAP), data mining, querying, reporting, and other decision-support applications.

OLAP vs OLTP

Online analytical processing vs. online transaction processing. OLTP captures and stores data for day-to-day business functions such as ERP, CRM, SCM, point of sale, and so forth. It is not meant for ad hoc and complex queries that deal with a number of data items. OLAP, on the other hand, is designed to address this need by providing ad hoc analysis of organizational data much more effectively and efficiently. OLAP and OLTP rely on each other: OLAP uses the data captured by OLTP, and OLTP automates the business processes that are managed by decisions supported by OLAP.

A type of database often used as an interim area for a data warehouse.

Operational data stores (ODS)

An ODS is a

Operational data store. A type of customer-information-file database that is often used as a staging area for a data warehouse.

The importance of datawarehouses

Organizes - merchandising, advertising, distribution, sales, marketing, production, service, and accounts receivable

Data mining analysis methods

POFR Prediction Optimization Forecasting Regression

Server‐based functionality:

Rather than have most or all of the data manipulation performed on users' desktops, server‐based software (known as a report server) handles most of these tasks after receiving a request from a user's desktop tool. After the task is completed, the result is made available to the user, either directly (a report is passed back to the client, for example) or by posting the result on the company intranet.

Ralph Kimball founded

Red Brick Systems in 1986. Software company aimed at improving data access.

RDBMS stands for

Relational Database Management System

The implementation of an OLAP database on top of an existing relational database

Relational OLAP (ROLAP)

Centralized data warehouse

Similar to the hub-and-spoke one. except no dependent data marts, rather a big enterprise data warehouse that serves the needs of all organizational units. More holistic view. No data marts.

Slice And Dice

Slice and dice refers to a strategy for segmenting, viewing, and understanding data in a database. Users slice and dice by cutting a large segment of data into smaller parts and repeating this process until arriving at the right level of detail for analysis. Slicing and dicing helps provide a closer view of data for analysis and presents data in new and diverse perspectives.

Logical arrangement of tables in a multidimensional db in such a way that the ERD resembles a snowflake.

Snowflake Schema

The most commonly used and the simplest style of dimensional modeling.

Star Schema

Characteristics of Data Warehousing include

Subject oriented (data organized by detailed subject, such as sales or customers), integrated (consistent format), time variant (maintains historical data), and nonvolatile (users can't change data; changes are recorded as new data).
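The time-variant and nonvolatile properties can be sketched with an append-only structure. The class `WarehouseTable` and its methods are hypothetical names for this example; a real warehouse would do this with dated rows in tables, but the behavior is the same idea: nothing is overwritten, and any past state can be looked up.

```python
from datetime import date


class WarehouseTable:
    """Append-only store: changes arrive as new dated rows, never as edits."""

    def __init__(self):
        self._rows = []  # each row is (key, value, as_of_date)

    def record(self, key, value, as_of):
        # Nonvolatile: existing rows are left untouched.
        self._rows.append((key, value, as_of))

    def history(self, key):
        # Time variant: the full dated history stays available.
        return [(as_of, value) for k, value, as_of in self._rows if k == key]

    def value_as_of(self, key, when):
        """Return the value that was current for `key` on the date `when`."""
        known = [(as_of, value) for as_of, value in self.history(key) if as_of <= when]
        if not known:
            return None
        return max(known, key=lambda pair: pair[0])[1]


# Hypothetical example: a customer's city changes over time.
table = WarehouseTable()
table.record("cust-1", "Boston", date(2020, 1, 1))
table.record("cust-1", "Denver", date(2022, 6, 1))
```

Here `table.value_as_of("cust-1", date(2021, 1, 1))` still reports the earlier city, because the later change was added as a new row rather than replacing the old one.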

Meta-flow:

System modeling: to define structure of legacy systems, synthesizing to create valued, regulating to create modules for capturing.

Isle of Capri (a gaming company) solution for meeting demands?

Teradata (a company) and IBM Cognos for business intelligence.

ETL(Extract, Transform and Load)

The __________ comprises the processing layer in a data warehousing model. It is responsible for pulling data out of the source systems and placing it into a data warehouse.

5.Data mining

The process of analyzing data to extract information not offered by the raw data alone

Data and information

The term data refers to factual information, especially that used for analysis and based on reasoning or calculation. Data itself has no meaning, but becomes information when it is interpreted. Information is a collection of facts or data that is communicated.

Data mart bus architecture

This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Not optimal for complex data queries.

3 tiers of data warehousing architecture. ( a 2 tier is more economical where the last two work together but not great for large companies).

Tier 1: Client workstation. Tier 2: Application server. Tier 3: Database server.

Oper marts are an operational data mart.

True

Additional data warehouse characteristics include:

Web based, Relational/multidimensional, Client/Server (for easy access to end-users), Real time (newer data warehouses provide real-time or active data-access and analysis capabilities) Metadata (data about data, how its all organized and how to use them, etc).

database

a collection of information that you organize and access according to the logical structure of that information

data warehouse

a logical collection of information gathered from many different operational databases used to create business intelligence that supports business analysis activities and decision making tasks

In an OLAP a cube is

A multidimensional data structure (actual or virtual) that allows fast analysis of data and provides the capability of efficiently manipulating and analyzing data from multiple perspectives. It is aimed at overcoming a limitation of relational databases. An analyst can navigate through the database and screen for a particular subset of the data by changing the data's orientations and defining analytical calculations. It does not handle very large amounts of data as well as a standard relational format does.

Hans Peter Luhn

a researcher at IBM, introduced the idea of business intelligence in 1958 as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action to a desired goal."

dimensional modeling is

a retrieval based system that supports high-volume query access.

Independent Data Mart

a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW.

A data warehouse is

a specially constructed data repository where data are organized so that they can be easily accessed by end users for several applications.

OLAP(Online analytical processing)

a(n) ___________ system is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. These systems manipulate aggregated, historical data, stored in multi-dimensional schemas (usually star schema like the cube).

drill down

access data that is in a lower level of a hierarchically structured database.

Middleware tools enable

access to the data warehouse. Power users such as analysts may write their own SQL queries.

19.primary purpose of a data warehouse

aggregate information throughout an organization into a single repository for decision-making purposes

Relational DBMS

allow multiple access queries.

Active Data warehousing (as opposed to traditional data warehousing)

Supports a large number of users, including operational staff. An active data warehouse is a repository of any form of captured transactional data so that they can be used for the purpose of finding trends.

data mining

analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.

Oper marts

are created when operational data needs to be analyzed multidimensionally. The data for an oper mart come from an ODS.

In-Flow DS flow

capturing data from legacy system, validating to test data for reality, repairing to examine and build data, transforming for consolidation, applying to move and load data.

A data mart...

contains data on one topic (e.g., marketing). A data mart can be a replication of a subset of data in the data warehouse. Data marts are a less expensive solution that can be replaced by or can supplement a data warehouse. Data marts can be independent of or dependent on a data warehouse.

3.

data bases are 2D - rows (entities) and columns (attributes)

A web-server is backed by both a

data warehouse and an application server. used for ease of access, platform independence, and lower cost.

The federated data warehouse

data warehouse architecture involves integrating disparate systems and analytical resources from multiple sources to meet changing needs or business conditions.

15.inconsistent data definitions

every department had its own method for recording data, so when trying to share info, data did not match and users did not get the data they really needed

The ETL process consists of

extraction (reading data from one or more databases), transformation (converting the extracted data from its previous form into the form needed by the data warehouse), and load (putting the data into the data warehouse)

11.Data Warehousing

fixes the following problems: inconsistent data definitions, lack of data standards, poor data quality, inadequate data usefulness, and ineffective direct data access

DBMS (database management system)

helps you specify the logical organization for a database and access and use the information within a database

multidimensional cube is

inflexible and does not support the ad hoc creation of multidimensional views of products, services, and customers. Can't handle more than 30 gigabits of data.

Inmon vs kimball

Inmon: top-down, enterprise-wide, complex, subject-driven, low end-user accessibility, aimed at IT professionals. Kimball: bottom-up, simple method, data marts, process-oriented, dimensional modeling, high end-user accessibility.

The data warehouse is a collection of _, _ databases designed to support DSS functions, where each unit of data is _ and relevant to some moment in time.

integrated, subject-oriented, non-volatile

Federated data warehouse

integrates analytical resources from multiple sources to meet changing needs or business conditions.

snowflake schema

is a logical arrangement of tables in a multidimensional database in such a way that the entity relationship diagram resembles a snowflake in shape.
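A minimal sketch of a snowflake arrangement, again using Python's built-in `sqlite3` module. The tables and rows are invented for illustration. The difference from a star layout is that the product dimension is itself normalized: category details live in their own table, so reaching them takes a second join and the diagram branches outward.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript(
    """
    -- Outer branch: categories are split out of the product dimension.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);

    -- Inner dimension: each product points at its category.
    CREATE TABLE dim_product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );

    -- Central fact table.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );

    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product  VALUES (1, 'Widget', 1), (2, 'Gadget', 1), (3, 'App', 2);
    INSERT INTO fact_sales   VALUES (1, 10.0), (2, 20.0), (3, 5.0), (3, 7.5);
    """
)

# Two joins are needed to get from a sale to its category.
snowflake_rows = snow.execute(
    """
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

Normalizing the dimension avoids repeating the category name on every product row, at the cost of the extra join.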

Enterprise information integration (EII)

is a mechanism for pulling data from source systems to satisfy a request for information. It is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.

Relational Databases are not well suited for

manipulating records. support a lot of data. supports dynamic joining of data. proven technology. performance less than optimal cannot be used for purely optimized processing.

16.Ineffective Direct Data Access

most data stored in operational databases did not allow users direct access, users had to wait to have their queries or questions answered by MIS professionals who code SQL

How are data warehouses different from operational databases

Operational databases are more product oriented, while data warehouses use subject orientation to give a more comprehensive view of the organization.

Data mining takes analysis further by sifting through a large amount of data to find information using algorithms such as:

predictive modeling, database segmentation, link analysis, deviation detection.

OLAP Operations:

slice, dice, drill down/up, roll up, pivot
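The operations listed above can be sketched on a tiny in-memory data set. The function names, the three dimensions, and the sample `cells` are assumptions chosen for this illustration; real OLAP tools run these operations against a cube or a warehouse rather than a Python list.

```python
from collections import defaultdict

DIMS = ("region", "product", "quarter")

# Hypothetical cells: (region, product, quarter, sales).
cells = [
    ("East", "Widget", "Q1", 100),
    ("East", "Widget", "Q2", 120),
    ("East", "Gadget", "Q1", 80),
    ("West", "Widget", "Q1", 60),
    ("West", "Gadget", "Q2", 90),
]


def slice_cells(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [c for c in cells if c[i] == value]


def dice(cells, **allowed):
    """Dice: keep cells whose values fall in the allowed set for each dimension."""
    return [
        c for c in cells
        if all(c[DIMS.index(d)] in values for d, values in allowed.items())
    ]


def roll_up(cells, *keep):
    """Roll up: total sales over every dimension not listed in `keep`.

    Listing more dimensions gives finer detail, which is a drill down.
    """
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for c in cells:
        totals[tuple(c[i] for i in idx)] += c[3]
    return dict(totals)


def pivot(cells, row_dim, col_dim):
    """Pivot: lay totals out as rows of one dimension by columns of another."""
    table = defaultdict(dict)
    for (row, col), total in roll_up(cells, row_dim, col_dim).items():
        table[row][col] = total
    return dict(table)
```

For instance, `roll_up(cells, "region")` gives one total per region, and `roll_up(cells, "region", "product")` drills down to show how each region's total breaks out by product.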

oltp's (online transaction processing)

the gathering of input information, processing that information and updating existing information to reflect the gathered and processed information

online analytical processing

the manipulation of information to support decision making

Key performance indicators

the most essential and important quantifiable measures used in analytics initiatives to monitor success of a business activity

multidimensional databases lack

the scalability and flexibility for DSS

analytics

the science of fact based decision making

performing extensive ETL (extraction, transformation, load)

to move data to the data warehouse may be a sign of poorly managed data and a fundamental lack of a coherent data management strategy.

AI (artificial intelligence)

tools such as neural networks and fuzzy logic that form the basis of information discovery and build business intelligence in OLAP

8.Data-mining tools

use a variety of techniques to find patterns and relationships in large volumes of information

14.inadequate data usefulness

users could not get the data they needed, what was collected was not always useful for intended purposes

Multidimensional Database

usually contain a star model. designed for slice and dice and drill down analysis. highly indexed databases. provides data mining and drill down capabilities.

