Chapter 8: Big Data, Data Warehouses, and Business Intelligence Systems
data mart
A collection of data that is smaller than the data warehouse and addresses a particularly component or functional area of the business. Users obtain data from the data warehouse that pertain to a particular business function. They do not have the expertise of data warehouse employees, but they are knowledgable analysts for a given business function.
RFM Analysis
A data collection and processing analyzes and ranks customers according to their purchasing patterns. It is a simple customer classification technique that considers how recently (R) a customer orders, how frequently (F) a customer orders, and how much money (M) the customer spends per order.
dimensional database
A database design that is used for data warehouses and is designed for efficient queries and analysis. It contains a central fact table connected to one or more dimension tables.
operational systems
A database system in use for the operations of an enterprise, typically an OLTP system.
data warehouse
A database system that has data, programs and personnel that specialize in the preparation of data for BI processing.
distributed database
A database that is stored and processed on more than one computer
NoSQL (or Not only SQL) Database
A distributed, replicated database used where this type of DBMS is needed to support large datasets.
server cluster
A group of associated servers
replication
A means of distributing a database by storing copies of the database on multiple computers;
snowflake schema
A more complex version of the star schema. In this schema, each dimension of the table is normalized, which may created additional tables attached to the dimension table.
1. Read and process data existing in the operational database but they use the operational DBMS to obtain this data, but do not insert, modify or delete operational data 2. Process data extracted from operational databases, in this situation, they manage the extracted database using a BI DBMS which may be the same as or different from the operational DBMS 3. BI systems read data purchased from data vendors
From where do BI systems obtain data? And how do they process the obtained data?
Business Intelligence (BI) systems
Information systems that assist managers and other professionals in the analysis of current and past activities and in the prediction of future events. Unlike transaction processing systems, they do not support operational activities, such as the recording and processing of orders. Instead they are used to support management assessment, analysis, planning, control and ultimately, decision-making.
star schema
Model for dimensional databases where fact table is the center with connected dimension tables, there is usually a date or time dimension to track changes ver time.
partitioning
One way of distributing a database which means breaking down the database into pieces and storing the pieces on multiple computers.
OLAP reports
Online Analytical Processing report, which usually contains the results of OLAP, which provides the ability to sum, count, average and perform other simple artithmetic operations on groups of data. Also called an OLAP cube.
Map Reduce Process
Process used to break a large analytical task into smaller tasks, assign each smaller task to a separate computer in the cluster, gather the results of each of those tasks and combine them in the final product of the original tasks.
measures (of business activities)
Quantitative or factual data about the entity represented by the fact table.
slowly changing dimension
Rare changes in data information (such as a customer moving to a different city or state, or changed address) that would affect the data in a data mart
column families
The combination of columns and supercolumns that result to form the database storage equivalent of RDBMS tables
-"dirty data" problematic data such as "G" for a alue of gender, "213" for age, part of a color "rd", etc. -missing values -inconsistent data (data that has changed, such as a customer's phone number or address) -nonintegrated data (data from two or more sources that need to be combined) -incorrect format -too much data
What are some of the problems of using operational data for BI processing?
1. Used for unstructured analytical data processing 2. Current and historical data are used 3. Data are loaded and updated systematically, not by users
What are the 3 main characteristics of a dimensional database?
-Key-Value--Dynamo and MemcacheDB -Document--Couchbase and MongoDB -Column Family- Apache Cassandra and HBase -Graph--Neo4J and AllegroGraph
What are the Four Categories of NoSQL Databases?
supercolumns
When columns are grouped into sets , i.e Customer Name ______ consists of a FirstName column and a LastName column and which stores the CustomerName "FirstName LastName"
object persistence
When using an OOP, the properties of the object are created and stored in main memory. This is the term for storing the values of properties of an object.
keyspace
Where all the column families are contained; this provides the set of RowKey values that can be used in the data store.
1. Querying data for BI applications can place a substantial burden on the DBMS and unacceptably slow the performance of operational applications 2. The creation and maintenance of BI systems require application programs, facilities and expertise that are not normally available from operations 3. Operational data have problems that limit their use for BI applications
Why is operational data difficult to use?
1. Most data mining applications have only a few users and those users have sophisticated computer skills 2. The results of a data mining analysis are usually incorporated into some other report, analysis or information system
Why is report delivery not as important for data mining systems as it is for reporting systems? (2 main things)
dirty read
A read of data that have been changed but not yet committed to a database. Such changes may later be rolled back and removed from the database.
Online Analytical Processing (OLAP)
A technique for analyzing data values, called measures, aganist characteristics associated with those data values, called dimensions. They are sometimes used to ease the task of report production.
object-oriented programming (OOP)
A technique for designing and writing computer programs. Today most new program development is done using languages Java, C++, C# and Visual Basic.NET
Hadoop Distributed File System (HDFS)
An Apache software which provides standard file services to clustered servers so that their file systems can function as one distributed file system. It was originally part of Casandra but the project spun off a nonrelational datastore of its own called HBase and query language named Pig.
Online Transaction processing (OLTP) System
An operational database system available for, and dedicated to, transaction processing or the ongoing stream of businesses transactions. Also known as transactional system.
methods
Computer programs that perform some task
object-relational database
Basically add-on features and functions for a DBMS product which facilitate object persistence
data warehouse
Can be thought of like a distributor in a supply chain. Takes data from the manufacturers (operational systems and purchased data), cleans and processes them and locates that data. The people who work in a data warehouse are experts at data mgmt, cleaning, transformation and the like. They are usually not experts at a given business function.
properties
Data items particular to an object
objects
Data structures that have both methods and properties
data warehouse metadata database (or simply, "data warehouse")
The place where metadata concerning the data's source, format, assumptions, constraints and other facts are kept.
enterprise data warehouse (EDW) architecture
The system design of a corporation's BI data. In one configuration, a data warehouse maintains all enterprise BI data and acts as an authoritative source for data extracts provided to the data marts. Data marts receive all data from the data warehouse.
Big Data
The term for the enormous datasets generated by the Web applications such as search tools and social networks.
extract, transform and load (ETL) system
This system cleans and prepares data for BI processing. Problematic operational data that may have been cleaned in this system can also be used to update the operational system to fix the original data problems.
data mining applications
This type of BI system performs sophisticated analyses on data, analyses that usually involve complex statistical and mathematical processing. They are used for: what-if analyses, predictions, decisions. The results are often incorporated into some other report or system.
reporting systems
This type of BI system sorts, filters, groups and makes elementary calculations on operational data. Summarizes current status to past or predicted status. Classify entities (customers, products, employees, etc.) and report delivery crucial.
drill down (into data)
To further divide data into more detail