CIS 330 Chapters 9, 10, 11 (Final Exam)
In general, certain trends in organizations encourage the need for data warehousing; these trends include the following:
-No single system of record -Multiple systems are not synchronized -Organizations want to analyze the activities in a balanced way -Customer relationship management -Supplier relationship management
What are the characteristics of quality data?
-Uniqueness -Accuracy -Consistency -Completeness -Timeliness -Currency (the degree to which data are recent enough to be useful) -Conformance (whether data are stored, exchanged, or presented in the format specified by their metadata) -Referential Integrity (data that refer to other data must be unique and satisfy existence requirements)
What are the characteristics of Data After ETL?
1) Detailed 2) Historical 3) Normalized (data are fully normalized) 4) Comprehensive (data reflect an enterprise-wide perspective) 5) Timely 6) Quality controlled
What are the two stages of which data reconciliation occurs?
1) During an initial load, when the EDW is first created 2) During subsequent updates (normally performed on a periodic basis) to keep the EDW current and/or to expand it
What are the goals of data mining?
1) Explanatory 2) Confirmatory 3) Exploratory
What are the four main types of NoSQL?
1) Key-value stores 2) Document stores 3) Wide-column stores 4) Graph databases
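To make the four models concrete, here is a minimal Python sketch of how the same customer record might look under each model; every name and structure is illustrative, not any particular NoSQL product's API.

```python
# Illustrative sketch: the same customer record under the four NoSQL models.
# (All names and structures here are hypothetical.)

# Key-value store: an opaque value retrieved by a single key.
kv_store = {}
kv_store["customer:42"] = '{"name": "Ada", "city": "Eugene"}'

# Document store: the value is a structured document the store can query.
doc_store = {"customers": [{"_id": 42, "name": "Ada", "city": "Eugene"}]}

# Wide-column store (conceptually): rows hold sparse, named column families.
wide_column = {("customers", 42): {"profile:name": "Ada", "profile:city": "Eugene"}}

# Graph database (conceptually): nodes plus typed edges between them.
graph = {
    "nodes": {42: {"label": "Customer", "name": "Ada"}},
    "edges": [(42, "PLACED", 1001)],  # customer 42 placed order 1001
}
```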
The need to separate operational and informational systems is based on three primary factors:
1. A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them readily available for decision support applications 2. A properly designed data warehouse adds value to data by improving their quality and consistency 3. A separate data warehouse eliminates much of the contention for resources that results when informational applications are confounded with operational processing
A subject-oriented, integrated, time-variant, nonupdateable collection of data used in support of management decision-making processes
Data Warehouse
knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics
Data mining
a process of using pattern recognition and other artificial intelligence techniques to upgrade the quality of raw data before transforming and moving the data to the warehouse. Also called data cleansing
Data scrubbing
the component of data reconciliation that converts data from the format of the source operational systems to the format of the enterprise data warehouse
Data transformation
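A minimal Python sketch of the idea, assuming a hypothetical source record whose field names and date format differ from the warehouse's target layout:

```python
from datetime import datetime

# Hypothetical source record in the operational system's format.
source = {"cust_nm": "ada lovelace", "ord_dt": "12/31/2024", "amt": "199.99"}

def to_warehouse_format(rec):
    """Convert one source record into the (assumed) EDW layout."""
    return {
        "customer_name": rec["cust_nm"].title(),                        # standardize case
        "order_date": datetime.strptime(rec["ord_dt"], "%m/%d/%Y").date(),
        "amount": float(rec["amt"]),                                    # string -> numeric
    }

print(to_warehouse_format(source))
# {'customer_name': 'Ada Lovelace', 'order_date': datetime.date(2024, 12, 31), 'amount': 199.99}
```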
The process whereby organizations create and maintain data warehouses and extract meaning from the data in them to inform decision making
Data warehousing
The process of extracting data from existing operational systems, cleansing and transforming them for decision making, and loading them into a data warehouse
Extract-Transform-Load (ETL)
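A compact Python sketch of the three ETL stages; the source rows, cleansing rule, and target list are stand-ins for a real operational system and a real warehouse:

```python
# Hypothetical operational data with the usual quality problems.
operational_rows = [
    {"id": 1, "region": " west ", "sales": "100"},
    {"id": 2, "region": "WEST", "sales": None},   # missing measure
]

def extract():
    """Extract: capture relevant rows from the source."""
    return list(operational_rows)

def transform(rows):
    """Cleanse and transform: standardize codes, drop unusable rows."""
    clean = []
    for r in rows:
        if r["sales"] is None:
            continue                              # scrub rows failing quality checks
        clean.append({"id": r["id"],
                      "region": r["region"].strip().upper(),
                      "sales": float(r["sales"])})
    return clean

def load(rows, warehouse):
    """Load: write reconciled rows into the (stand-in) warehouse."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)   # [{'id': 1, 'region': 'WEST', 'sales': 100.0}]
```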
is a file system designed for managing a large number of potentially very large files in a highly distributed environment
HDFS or Hadoop Distributed File System
An open source implementation framework of MapReduce
Hadoop
a method of capturing only the changes that have occurred in the source data since the last capture
Incremental extract
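A minimal Python sketch, assuming each source row carries a last-modified timestamp that can serve as the capture watermark:

```python
from datetime import datetime

# Hypothetical source table with a last-modified timestamp per row.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 5)},
    {"id": 2, "updated_at": datetime(2024, 3, 9)},
]

def incremental_extract(rows, last_capture):
    """Capture only rows changed since the previous extract."""
    return [r for r in rows if r["updated_at"] > last_capture]

print(incremental_extract(source_rows, datetime(2024, 2, 1)))
# only row 2 is captured, since it changed after the last capture
```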
The data housed in the data warehouse are defined using consistent naming conventions, formats, encoding structures, and related characteristics gathered from several internal systems of record and also often from sources external to the organization. This means that the data warehouse holds the one version of "the truth"
Integrated
The question "What will happen?" refers to what?
Predictive analytics
The question "How can we make it happen?" refers to what?
Prescriptive Analytics
Data in the data warehouse contain a time dimension so that they may be used to study trends and changes
Time-variant
the process of partitioning data according to predefined criteria
selection
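A tiny Python illustration, partitioning a hypothetical order table on its status field (the predefined criterion):

```python
# Partition order rows by a predefined criterion (order status).
orders = [{"id": 1, "status": "open"}, {"id": 2, "status": "closed"}]

open_orders   = [o for o in orders if o["status"] == "open"]
closed_orders = [o for o in orders if o["status"] == "closed"]

print(open_orders)    # [{'id': 1, 'status': 'open'}]
print(closed_orders)  # [{'id': 2, 'status': 'closed'}]
```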
What are some ways to consolidate data?
-Application Integration -Business Process Integration -User Interaction Integration
Hortonworks specifies three characteristics of a data lake, what are they?
-Collect everything -Dive in anywhere -Flexible access
Data quality is important to:
-Minimize IT project risk -Make timely business decisions -Ensure regulatory compliance -Expand the customer base
system that allows managers to measure, monitor, and manage key activities and processes to achieve organizational goals
Business Performance Management and Dashboards
a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information
Business intelligence
technique that indicates which data have changed since the last data integration activity
Changed data capture (CDC)
an executive-level position accountable for all data-related activities in the enterprise
Chief data officer (CDO)
a technique for data integration that provides a virtual view of integrated data without actually creating one centralized database
Data federation
a large integrated repository for internal and external data that does not follow a predefined schema
Data lake
The question "What happened?" refers to what?
Descriptive analytics
a centralized, integrated data warehouse that is the control point and single source of all data made available to end users for decision support applications
Enterprise data warehouse (EDW)
an algorithm for massive parallel processing of various types of computing tasks
MapReduce
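The classic word-count example, sketched in plain Python; a real framework would run the map tasks in parallel across many nodes, but the data flow is the same:

```python
from collections import defaultdict

# Classic word count expressed as map and reduce phases.
documents = ["big data big", "data lake"]

def map_phase(doc):
    """Map: emit (word, 1) pairs from one input split."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per word (the shuffle step groups by key)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))   # {'big': 2, 'data': 2, 'lake': 1}
```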
disciplines, technologies, and methods used to ensure the currency, meaning, and quality of reference data within and across various subject areas
Master data management (MDM)
OLAP tools that load data into an intermediate structure, usually a three- or higher-dimensional array
Multidimensional OLAP (MOLAP)
a category of recently introduced data storage and retrieval technologies that are not based on the relational model
NoSQL
Data in the data warehouse are loaded and refreshed from operational systems but cannot be updated by end users
Nonupdateable
the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Online analytical processing (OLAP)
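A toy Python sketch of two common OLAP operations, slice and roll-up, on a hand-built three-dimensional "cube"; the dimensions and figures are invented:

```python
# A tiny "cube": sales indexed by (product, region, quarter).
cube = {
    ("widgets", "west", "Q1"): 120,
    ("widgets", "east", "Q1"): 90,
    ("gadgets", "west", "Q1"): 75,
}

# Slice: fix one dimension (region = "west") and view what remains.
west_slice = {k: v for k, v in cube.items() if k[1] == "west"}
print(west_slice)

# Roll up: aggregate away the product and quarter dimensions.
by_region = {}
for (product, region, quarter), sales in cube.items():
    by_region[region] = by_region.get(region, 0) + sales
print(by_region)   # {'west': 195, 'east': 90}
```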
an approach to filling a data warehouse that involves bulk rewriting of the target data at periodic intervals
Refresh mode
OLAP tools that view the database as a traditional relational database in either a star schema or other normalized or denormalized set of tables
Relational OLAP (ROLAP)
Schoenborn identified four specific infrastructure capabilities that are required for big data and advanced analytics. What are they?
-Scalability (ability to add capacity) -Parallelism (being able to do multiple things at the same time) -Low latency (high speed in various processing) -Data optimization (skills needed to design optimal storage and processing structures)
a method of capturing a snapshot of the required source data at a point in time
Static extract
A data warehouse is organized around the key subjects (or high-level entities) of the enterprise
Subject-oriented
the process of discovering meaningful information algorithmically based on computational analysis of unstructured textual information
Text mining
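A minimal Python sketch of the first step in most text mining: reducing unstructured text to term frequencies. The sample sentence is invented:

```python
import re
from collections import Counter

# Unstructured text reduced to term frequencies.
text = "Data warehousing supports decisions; data mining discovers patterns."

tokens = re.findall(r"[a-z]+", text.lower())    # tokenize and normalize case
print(Counter(tokens).most_common(3))
# [('data', 2), ('warehousing', 1), ('supports', 1)]
```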
1) a business requires an integrated, company-wide view of high-quality information 2) the information systems department must separate informational from operational systems to improve performance dramatically in managing company data
Two major factors drive the need for data warehousing in most organizations today:
an approach to filling a data warehouse in which only changes in the source data are written to the data warehouse
Update mode
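A small Python sketch contrasting update mode with refresh mode (defined earlier) on a stand-in target table; the row layouts are hypothetical:

```python
# Stand-in warehouse target table.
target = [{"id": 1, "sales": 100}]

def refresh_mode(new_snapshot):
    """Refresh: bulk-rewrite the entire target from a fresh snapshot."""
    return list(new_snapshot)

def update_mode(target_rows, changes):
    """Update: write only the changed source rows into the target."""
    by_id = {r["id"]: r for r in target_rows}
    for change in changes:
        by_id[change["id"]] = change   # insert or overwrite changed rows
    return list(by_id.values())

print(refresh_mode([{"id": 1, "sales": 140}, {"id": 2, "sales": 60}]))
print(update_mode(target, [{"id": 2, "sales": 60}]))
```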
the process of transforming data from a detailed level to a summary level
aggregation
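A minimal Python sketch, rolling hypothetical order lines up from detail level to a per-customer summary:

```python
# Detail-level order lines rolled up to a summary level (per customer).
detail = [
    {"customer": "Ada", "amount": 40.0},
    {"customer": "Ada", "amount": 10.0},
    {"customer": "Bob", "amount": 25.0},
]

summary = {}
for row in detail:
    summary[row["customer"]] = summary.get(row["customer"], 0.0) + row["amount"]

print(summary)   # {'Ada': 50.0, 'Bob': 25.0}
```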
high-level organizational groups and processes that oversee data stewardship across the organization. It usually guides data quality initiatives, data architecture, data integration and master data management, data warehousing and business intelligence, and other data-related matters
data governance
a data warehouse that is limited in scope, whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract, transform, and load processes from source data systems
data mart
a person assigned the responsibility of ensuring that organizational applications properly support the organization's enterprise goals for data quality
data steward
a data mart filled exclusively from an enterprise data warehouse and its reconciled data
dependent data mart
What is the oldest form of analytics?
descriptive analytics
describes the past status of the domain of interest using a variety of tools through techniques such as reporting, data visualization, dashboards, and scorecards
descriptive analytics
capturing the relevant data from the source files and databases used to fill the EDW
extracting
a data mart filled with data extracted from the operational environment, without the benefit of a data warehouse
independent data mart
a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
informational system
the process of combining data from various sources into a single table or view
joining
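A minimal Python sketch, joining hypothetical customer and order rows on their shared key:

```python
# Combining customer and order rows from two sources into one view.
customers = [{"cust_id": 1, "name": "Ada"}]
orders    = [{"order_id": 10, "cust_id": 1, "amount": 99.0}]

joined = [
    {**c, **o}                            # merge matching rows into one record
    for c in customers
    for o in orders
    if c["cust_id"] == o["cust_id"]       # match on the shared key
]
print(joined)   # [{'cust_id': 1, 'name': 'Ada', 'order_id': 10, 'amount': 99.0}]
```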
Converts data from one or more source fields to one or more target fields
multifield transformation
a system that is used to run a business in real time, based on current data. Also called a system of record.
operational system
applies statistical and computational methods and models to data regarding past and current events to predict what might happen in the future
predictive analytics
uses results of predictive analytics together with optimization and simulation tools to recommend actions that will lead to a desired outcome
prescriptive analytics
Converts data from a single source field to a single target field
single-field transformation
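A minimal Python sketch covering both this card and the multifield card above; the lookup table and field names are hypothetical:

```python
# Single-field: one source field -> one target field (a code lookup).
STATE_NAMES = {"OR": "Oregon", "WA": "Washington"}   # hypothetical lookup table

def single_field(rec):
    return {"state_name": STATE_NAMES[rec["state_cd"]]}

# Multifield: several source fields combined into one target field.
def multifield(rec):
    return {"full_name": f'{rec["first_nm"]} {rec["last_nm"]}'}

print(single_field({"state_cd": "OR"}))                        # {'state_name': 'Oregon'}
print(multifield({"first_nm": "Ada", "last_nm": "Lovelace"}))  # {'full_name': 'Ada Lovelace'}
```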