Chapter 9 Data Warehousing
NoSQL
A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
Operational data store (ODS)
An integrated, subject-oriented, continuously updatable, current-valued (with recent history), enterprise-wide, detailed database designed to serve operational users as they perform decision support processing.
rule discovery
Association - looking for patterns where one event is connected to another event
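A minimal sketch of how an association between two events can be measured, using the common support and confidence measures. The basket data and the function names `support` and `confidence` are hypothetical, chosen only for illustration.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule under test: "bread -> milk"
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

A rule is usually considered interesting only when both values exceed thresholds chosen by the analyst.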
data-mining techniques
Data mining is sorting through data to identify patterns and establish relationships.
data scrubbing
Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. In data warehousing, it uses pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before the data are transformed and moved to the data warehouse.
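A small rule-based sketch of scrubbing, assuming records are dictionaries with hypothetical `name` and `email` fields. It uses plain normalization and validity rules rather than the pattern recognition or AI techniques mentioned above, so treat it as a simplified stand-in.

```python
def scrub(records):
    """Return cleaned records: trimmed, normalized, complete, and de-duplicated."""
    seen = set()
    clean = []
    for rec in records:
        # Amend formatting: trim whitespace and normalize case.
        name = (rec.get("name") or "").strip().title()
        email = (rec.get("email") or "").strip().lower()
        # Remove incomplete or improperly formatted rows.
        if not name or "@" not in email:
            continue
        # Remove duplicates, keyed on the normalized email.
        if email in seen:
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean

# Illustrative raw data (hypothetical).
raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},    # duplicate
    {"name": "Grace Hopper", "email": None},                   # incomplete
    {"name": "", "email": "nobody@example.com"},               # incomplete
    {"name": "alan turing", "email": "alan-at-example.com"},   # badly formatted
]
cleaned = scrub(raw)
```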
ETL
Extract, transform, load (ETL) is the process of extracting, transforming, and loading data during database use, but particularly during data storage use. Its sub-processes include retrieving data from external data storage or transmission sources. Extraction and loading happen periodically.
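The three steps can be sketched with Python's built-in `sqlite3` module. The table names (`orders`, `fact_orders`), the columns, and the two in-memory databases are assumptions made for this example, not part of any particular ETL tool.

```python
import sqlite3

def extract(source):
    """Extract: read raw rows from the operational source."""
    return source.execute(
        "SELECT order_id, amount_cents, country FROM orders"
    ).fetchall()

def transform(rows):
    """Transform: convert cents to currency units and normalize country codes."""
    return [(oid, cents / 100.0, country.strip().upper())
            for oid, cents, country in rows]

def load(warehouse, rows):
    """Load: write the transformed rows into the warehouse table."""
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    warehouse.commit()

# Hypothetical operational source with slightly messy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 1250, " us"), (2, 990, "de ")])

# Hypothetical warehouse target.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country TEXT)")

load(warehouse, transform(extract(source)))
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
```

In practice the same pipeline would be run on a schedule, since extraction and loading happen periodically.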
MySQL
MySQL is an open source relational database management system. Information in a MySQL database is stored in the form of related tables. MySQL databases are typically used for web application development (often accessed using PHP).
on-line analytical processing (OLAP)
OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view. It performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
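Viewing the same data "from different points of view" amounts to aggregating one set of facts along different dimensions. The sketch below shows this in plain Python; the `sales` rows and the `roll_up` helper are hypothetical, and a real OLAP tool would do far more.

```python
from collections import defaultdict

# Hypothetical sales facts with three dimensions and one measure.
sales = [
    {"region": "East", "product": "Widget", "quarter": "Q1", "units": 10},
    {"region": "East", "product": "Gadget", "quarter": "Q1", "units": 4},
    {"region": "West", "product": "Widget", "quarter": "Q1", "units": 7},
    {"region": "West", "product": "Widget", "quarter": "Q2", "units": 3},
]

def roll_up(facts, dimensions, measure="units"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Two different views of the same data.
by_region = roll_up(sales, ["region"])
by_product_quarter = roll_up(sales, ["product", "quarter"])
```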
PHP
PHP is a scripting language and interpreter that is freely available and used primarily on Linux Web servers. PHP, originally derived from Personal Home Page Tools, now stands for PHP: Hypertext Preprocessor, which the PHP FAQ describes as a "recursive acronym."
logical data mart:
RFID GPS
case-based reasoning
Reasoning that adapts previous solutions to similar problems in order to solve the new problem at hand.
GPS
The GPS (Global Positioning System) is a "constellation" of approximately 30 well-spaced satellites that orbit the Earth and make it possible for people with ground receivers to pinpoint their geographic location.
data visualisation
The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
Enterprise data warehouse (EDW)
a centralized, integrated data warehouse that is the control point and single source of all data made available to end users for decision support applications
logical data mart
a data mart created by a relational view of a data warehouse
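A logical data mart can be sketched as an SQL view over a warehouse table, here using Python's built-in `sqlite3`. The table `warehouse_sales`, the view `east_sales_mart`, and the sample rows are assumptions for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical warehouse table.
db.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
db.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("East", "Widget", 100.0), ("West", "Widget", 80.0), ("East", "Gadget", 40.0)],
)

# The logical data mart: a view, so no data are physically copied.
db.execute("""
    CREATE VIEW east_sales_mart AS
    SELECT product, SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE region = 'East'
    GROUP BY product
""")

mart_rows = db.execute(
    "SELECT product, total_amount FROM east_sales_mart ORDER BY product"
).fetchall()
```

Because the mart is only a view, it always reflects the current contents of the warehouse.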
dependent data mart
a data mart filled exclusively from an enterprise data warehouse and its reconciled data.
independent data mart
a data mart filled with data extracted from the operational environment, without the benefit of a data warehouse.
RFID
Radio-frequency identification (RFID) generates massive amounts of event data, creating opportunities for real-time data warehousing coupled with real-time analytics that greatly reduce the latency between event data capture and the appropriate action being taken.
real-time data warehouse
an enterprise data warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near-real-time relays business rules to the warehouse and systems of record so that immediate action can be taken in response to business events.
snowflake schema
an expanded version of a star schema in which dimension tables are normalized into several related tables.
fact tables
contain factual or quantitative data (measurements that are numerical, continuously valued, and additive) about a business, such as units sold, orders booked, and so on.
transient data
data in which changes to existing records are written over previous records, thus destroying the previous data content.
periodic data
data that are never physically altered or deleted once they have been added to the store.
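The contrast between transient and periodic data can be sketched with two small in-memory stores. The customer key, the balances, and the `as_of` dates are hypothetical.

```python
from datetime import date

# Transient data: an update overwrites the record, destroying the old value.
transient = {"cust-1": {"balance": 100}}
transient["cust-1"] = {"balance": 75}   # the balance of 100 is gone

# Periodic data: rows are only ever appended, each stamped with a date,
# so earlier values remain available.
periodic = [{"key": "cust-1", "balance": 100, "as_of": date(2024, 1, 1)}]
periodic.append({"key": "cust-1", "balance": 75, "as_of": date(2024, 2, 1)})

# The full history can still be read from the periodic store.
history = [row["balance"] for row in periodic if row["key"] == "cust-1"]
```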
derived data
data that have been selected, formatted, and aggregated for end-user decision support applications.
event data
Data that represent transactions; they are stored for a defined period and then deleted or archived to save storage. A data warehouse likely contains a history of snapshots of status data or a summary of transaction (event) data.
Corporate Information Factory (CIF)
dependent data mart and operational data store architecture
reconciled data
detailed, current data intended to be the single, authoritative source for all decision support applications
dimension
Dimension tables hold descriptive data (context) about the subjects of the business. They are usually the source of attributes used to qualify, categorize, or summarize facts in queries, reports, or graphs.
status data
Data that describe the state of an entity at a point in time; most of the data stored in databases are status data.
conformed dimension
one or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with each fact table.
operational systems
An operational system is used to run a business in real time, based on current data; it is also called a system of record.
grain
the level of detail in a fact table, determined by the intersection of all the components of the primary key, including all foreign keys and any other primary key elements.
duration of the database
the natural duration is about 13 months or 5 calendar quarters, which is sufficient to see annual cycles in the data. some businesses, such as financial institutions, have a need for longer durations.
limitations of independent data mart
- a separate ETL process for each data mart, leading to redundant data and processing
- inconsistency between data marts
- difficulty drilling down for related facts between data marts
- excessive scaling costs as more applications are built
- high cost of obtaining consistency between marts
DSS schema
A decision support system (DSS) is a computerized information system used to support decision-making in an organization or a business. A DSS lets users sift through and analyze massive reams of data and compile information that can be used to solve problems and make better decisions.
data mart
a data warehouse that is limited in scope, whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract, transform, and load processes from source data systems.
star schema
a simple database design in which dimensional data are separated from fact or event data. a dimensional model is another name for a star schema.
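A minimal star schema can be sketched with Python's built-in `sqlite3`: one fact table surrounded by dimension tables, queried by joining on the dimension keys. All table names, column names, and rows below are hypothetical. The grain of this fact table is one row per product per quarter.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables hold descriptive context.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_period  (period_key  INTEGER PRIMARY KEY, quarter TEXT);

-- The fact table holds the additive measure, keyed by its dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    period_key  INTEGER REFERENCES dim_period(period_key),
    units_sold  INTEGER,
    PRIMARY KEY (product_key, period_key)
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_period  VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 10), (2, 1, 4), (3, 1, 2), (1, 2, 6);
""")

# Dimension attributes qualify and summarize the facts.
units_by_category = db.execute("""
    SELECT p.category, d.quarter, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_period  d ON d.period_key  = f.period_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
```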
informational systems
a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications.
clusters in data mining
Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
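One common way to form such groups is k-means clustering, shown here in one dimension as an illustrative technique that the definition above does not name. The spending figures, the starting centroids, and the helper `k_means_1d` are all assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, then re-center; repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical monthly spend per customer; two segments are visible by eye.
spend = [12, 15, 14, 90, 95, 88]
centroids, segments = k_means_1d(spend, centroids=[0.0, 100.0])
```

Here the two resulting segments correspond to low-spending and high-spending customers, a simple form of market segmentation.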
data warehousing
the process whereby organizations create and maintain data warehouses and use the data in them to extract meaning and inform decision making.
artificial intelligence
the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.