Information Technology Management, Chapter 3
Factors That Determine the Performance of a DBMS
* Data latency
* Ability to handle the volatility of the data
* Query response time
* Data consistency
* Query predictability
Three technologies involved in preparing raw data for analytics
- ETL
- change data capture (CDC)
- data deduplication ("deduping the data")
Hadoop
- Places no conditions on the structure of the data it can process.
- Distributes computing problems across a number of servers.
Business value categories
- Making more informed decisions at the time they need to be made
- Discovering unknown insights, patterns, or relationships
- Automating and streamlining or digitizing business processes
Four factors contributing to increased use of BI.
- Smart Devices Everywhere have created demand for effortless 24/7 access to insights.
- Data are Big Business when they provide insight that supports decisions and action.
- Advanced BI and Analytics help to ask questions that were previously unknown and unanswerable.
- Cloud-Enabled BI and Analytics are providing low-cost and flexible solutions.
Benefits of centralized database
1. Better control of data quality. Data consistency is easier when data are kept in one physical location because data additions, updates, and deletions can be made in a supervised and orderly fashion.
2. Better IT security. Data are accessed via the centralized host computer, where they can be protected more easily from unauthorized access or modification.
BI governance program
1. Clearly articulate business strategies.
2. Deconstruct the business strategies into a set of specific goals and objectives—the targets.
3. Identify the key performance indicators (KPIs) that will be used to measure progress toward each target.
4. Prioritize the list of KPIs.
5. Create a plan to achieve goals and objectives based on the priorities.
6. Estimate the costs needed to implement the BI plan.
7. Assess and update the priorities based on business results and changes in business strategy.
Text Analytics Steps
1. Exploration. 2. Preprocessing. 3. Categorizing and Modeling.
Data moved from databases to a warehouse are:
1. Extracted from designated databases.
2. Transformed by standardizing formats, cleaning the data, and integrating them.
3. Loaded into a data warehouse (sketched below).
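A minimal ETL sketch in Python (illustrative only; the sqlite3 source, the customers table, and the dim_customer warehouse table are hypothetical):

    import sqlite3

    def extract(conn):
        # 1. Extract rows from a designated source database.
        return conn.execute("SELECT name, email, signup_date FROM customers").fetchall()

    def transform(rows):
        # 2. Transform: standardize formats and clean the data.
        cleaned = []
        for name, email, signup_date in rows:
            if not email:          # drop incomplete records
                continue
            cleaned.append((name.strip().title(), email.strip().lower(), signup_date))
        return cleaned

    def load(warehouse, rows):
        # 3. Load the cleaned rows into a warehouse table.
        warehouse.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
        warehouse.commit()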
MapReduce Stages
1. Map stage: MapReduce breaks up the huge dataset into smaller subsets, then distributes the subsets among multiple servers where they are partially processed.
2. Reduce stage: the partial results from the map stage are recombined and made available for analytic tools (see the sketch below).
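A single-process word-count sketch of the two stages in Python (illustrative; a real MapReduce job distributes the map work across servers, and the dataset here is hypothetical):

    from collections import Counter
    from functools import reduce

    documents = ["big data tools", "big analytics", "data data data"]

    # Map stage: break the dataset into subsets and partially process each one.
    partial_counts = [Counter(doc.split()) for doc in documents]

    # Reduce stage: recombine the partial results into one final answer.
    total_counts = reduce(lambda a, b: a + b, partial_counts, Counter())
    print(total_counts)   # Counter({'data': 4, 'big': 2, ...})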
Three general data principles relate to the data life cycle
1. Principle of diminishing data value. 2. Principle of 90/90 data use. 3. Principle of data in context.
Four V's of Data Analytics
1. Variety
2. Volume
3. Velocity
4. Veracity
Principle of Integrity.
A recordkeeping program will be able to reasonably guarantee the authenticity and reliability of records and data.
Principle of Accountability.
An organization will assign a senior executive to oversee a recordkeeping program; adopt policies and procedures to guide personnel; and ensure program auditability.
Data security:
Check and control data integrity over time.
Data integrity and maintenance:
Correct, standardize, and verify the consistency and integrity of the data.
Enterprise data warehouses (EDW)
Data warehouses that pull together data from disparate sources and databases across an entire enterprise
HDFS
Hadoop Distributed File System
Data synchronization:
Integrate, match, or link data from disparate sources.
electronic records management (ERM) system
Keeps most records in electronic format and maintains them throughout their life cycle, from creation to final archiving or destruction
Volume:
Large volumes of structured and unstructured data are analyzed.
Eventual consistency
means that not all query responses will reflect data changes uniformly
OLTP
Online transaction processing systems
The highest-ranking enterprise DBMSs in mid-2014
Oracle's MySQL, Microsoft's SQL Server, PostgreSQL, IBM's DB2, and Teradata Database
Data filtering and profiling:
Inspect the data for errors, inconsistencies, redundancies, and incomplete information so they can be processed and stored efficiently.
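A toy profiling sketch in Python (the records and the email field are hypothetical):

    records = [
        {"id": 1, "email": "a@x.com"},
        {"id": 2, "email": ""},          # incomplete value
        {"id": 3, "email": "a@x.com"},   # redundant value
    ]

    emails = [r["email"] for r in records if r["email"]]
    missing = len(records) - len(emails)          # incomplete information
    duplicates = len(emails) - len(set(emails))   # redundancies
    print(f"{missing} missing and {duplicates} duplicate email values")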
Data Access
Provide authorized access to data in both planned and ad hoc ways within an acceptable response time
Principle of Retention.
Records and data will be maintained for an appropriate time based on legal, regulatory, fiscal, operational, and historical requirements.
Principle of Availability.
Records will be maintained in a manner that ensures timely, efficient, and accurate retrieval of needed information.
Sentiment Analysis
Social commentary and social media are being mined to understand consumer intent
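A toy lexicon-based scorer in Python to show the idea (the word lists are hypothetical; production sentiment analysis uses trained models rather than tiny lexicons):

    POSITIVE = {"love", "great", "fast"}
    NEGATIVE = {"hate", "slow", "broken"}

    def sentiment(text: str) -> int:
        # Positive words add to the score, negative words subtract.
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(sentiment("love the new app but checkout is slow"))  # 0 (mixed)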
Velocity:
Speed of access to reports that are drawn from data defines the difference between effective and ineffective analytics.
Variety:
The analytic environment has expanded from pulling data from enterprise systems to include big data and unstructured sources.
Principle of data in context.
The capability to capture, process, format, and distribute data in near real time or faster requires a huge investment in data architecture and infrastructure to link remote POS systems to data storage, data analysis systems, and reporting apps. The investment can be justified on the principle that data must be integrated, processed, analyzed, and formatted into "actionable information."
Principle of Transparency.
The processes and activities of an organization's recordkeeping program will be documented in an understandable manner and available to all personnel and appropriate parties.
Principle of Protection.
The recordkeeping program will be constructed to ensure a reasonable level of protection to records and information that are private, confidential, privileged, secret, or essential to business continuity.
Principle of Compliance.
The recordkeeping program will comply with applicable laws, authorities, and the organization's policies.
Principle of diminishing data value.
The value of data diminishes as they age. This is a simple, yet powerful principle. Most organizations cannot operate at peak performance with blind spots (lack of data availability) of 30 days or longer. Global financial services institutions rely on near-real-time data for peak performance.
Veracity:
Validating data and extracting insights that managers and workers can trust are key factors of successful analytics. Trust in analytics has grown more difficult with the explosion of data sources.
Text mining
a broad category that involves interpreting words and concepts in context; it helps companies tap into the explosion of customer opinions expressed online
Principle of 90/90 data use.
a majority of stored data, as high as 90 percent, is seldom accessed after 90 days (except for auditing purposes). That is, roughly 90 percent of data lose most of their value after 3 months
distributed database system
allows applications on computers and mobile devices to access data from both local and remote databases
Queries
are ad hoc (unplanned) user requests for specific data
Databases
are collections of datasets or records stored in a systematic way
Data marts
are lower-cost, scaled-down versions of data warehouses that can be implemented in a much shorter time, for example, in less than 90 days. They serve a specific department or function, such as finance, marketing, or operations
Master data entities
are the main entities of a company, such as customers, products, suppliers, employees, and assets.
Data and text mining
are used to discover knowledge that you did not know existed in the databases
ERM systems
consist of hardware and software that manage and archive electronic documents and image paper documents; then index and store them according to company policy
Record examples
contracts, research and development, accounting source documents, memos, customer/client communications, hiring and promotion decisions, meeting minutes, social posts, texts, e-mails, website content, database records, and paper and electronic files
Business analytics
describes the entire function of applying technologies, algorithms, human expertise, and judgment
SQL Server
ease of use, availability, and Windows operating system integration make it an easy choice for firms that choose Microsoft products for their enterprises
Data mining software
enables users to analyze data from various dimensions or angles, categorize them, and find correlations or patterns among fields in the data warehouse
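A sketch of one basic mining primitive, checking whether two fields move together (the data are hypothetical; statistics.correlation requires Python 3.10+):

    from statistics import correlation

    ad_spend = [10, 20, 30, 40, 50]
    sales    = [12, 24, 33, 41, 55]
    # Pearson correlation close to 1.0 suggests a strong linear pattern.
    print(round(correlation(ad_spend, sales), 3))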
Data ownership problems
exist when there are no policies defining responsibility and accountability for managing data
ETL
extract, transform, and load
GIGO
garbage in, garbage out
BI tools
integrate and consolidate data from various internal and external sources and then process them into information to make smart decisions
Data warehouses
integrate data from multiple databases and data silos, and organize them for complex analysis, knowledge discovery, and to support decision making
Database, data warehouse, big data, and business intelligence (BI) technologies
interact to create a new biz-tech ecosystem
Online transaction processing (OLTP)
is a database design that breaks down complex information into simpler data tables to strike a balance between transaction-processing efficiency and query efficiency
Operating margin
is a measure of the percent of a company's revenue left over after paying for its variable costs, such as wages and raw materials
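For example (hypothetical figures): a firm with $500,000 in revenue and $350,000 in variable costs keeps $150,000, for an operating margin of 150,000 / 500,000 = 30 percent.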
The data life cycle
is a model that illustrates the way data travel through an organization
SQL
is a standardized query language for accessing databases
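A minimal sketch using Python's built-in sqlite3 module (the orders table and its data are hypothetical). Note the declarative style: the query states what rows are wanted, not how to retrieve them.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, "East", 250.0), (2, "West", 400.0), (3, "East", 100.0)])

    # Total order amounts per region.
    for region, total in conn.execute(
            "SELECT region, SUM(amount) FROM orders GROUP BY region"):
        print(region, total)   # East 350.0 / West 400.0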
online analytical processing (OLAP)
is a term used to describe the analysis of complex data from the data warehouse
A data entity
is anything real or abstract about which a company wants to collect and store data.
A record
is documentation of a business event, action, decision, or transaction
Big data analytics
is not just about managing more or varied data. Rather, it is about asking new questions, formulating new hypotheses, exploration and discovery, and making data-driven decisions.
Latency
is the elapsed time (or delay) between when data are created and when they are available for a query or report
PostgreSQL
is the most advanced open source database; it is often used by online gaming applications and by companies such as Skype, Yahoo!, and MySpace
Market share
is the percentage of total sales in a market captured by a brand, product, or company
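For example (hypothetical figures): a brand with $2 million in sales in a market with $10 million in total sales has a market share of 2 / 10 = 20 percent.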
DB2
is widely used in data centers and runs on Linux, UNIX, Windows, and mainframes
Immediate consistency
means that as soon as data are updated, responses to any new query will return the updated value.
Fault tolerance
means that no single failure results in any loss of service
Scalability
means the system can increase in size to handle data growth or the load of an increasing number of concurrent users
CDC
minimizes the resources required for ETL processes by only dealing with data changes
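A sketch of the idea in Python (the customers table, updated_at column, and watermark value are hypothetical):

    import sqlite3

    def capture_changes(conn: sqlite3.Connection, last_sync: str):
        # Extract only rows modified since the stored watermark, so the
        # ETL workload is proportional to the changes, not the whole table.
        return conn.execute(
            "SELECT id, email, updated_at FROM customers WHERE updated_at > ?",
            (last_sync,),
        ).fetchall()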
ETL processes
move data from databases into data warehouses or data marts, where the data are available for access, reports, and analysis.
OLAP
online analytical processing systems
Dirty data
poor-quality data
master data management (MDM)
processes integrate data from various sources or enterprise applications to create a more complete (unified) view of a customer, product, or other entity
Deduping
processes remove duplicates and standardize data formats, which helps to minimize storage and data synchronization overhead
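A minimal deduping sketch in Python (the email field is hypothetical):

    def dedupe(records):
        seen, unique = set(), []
        for rec in records:
            key = rec["email"].strip().lower()   # standardized comparison key
            if key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    print(dedupe([{"email": "A@x.com"}, {"email": "a@x.com "}]))  # first record survives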
Relational database management systems (RDBMSs)
provide access to data using a declarative language—structured query language (SQL)
A decision model
quantifies the relationship between variables, which reduces uncertainty
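A toy decision model in Python (all coefficients are hypothetical) that quantifies how modeled profit depends on price, so candidate prices can be compared rather than guessed at:

    def profit(price, unit_cost=4.0, base_demand=1000, price_sensitivity=80):
        # Assumed relationship: demand falls linearly as price rises.
        demand = max(base_demand - price_sensitivity * price, 0)
        return (price - unit_cost) * demand

    # Search a small range of candidate prices for the modeled best one.
    best = max(range(5, 13), key=profit)
    print(best, profit(best))   # 8 1440.0 under these assumptions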
active data warehouse (ADW)
real-time data warehousing and analytics
Volatile
refers to data that change frequently
Declarative languages
simplify data access by requiring that users only specify what data they want to access without defining how access will be achieved
business-driven development approach
starts with a business strategy and works backward to identify the data sources and the data that need to be acquired and analyzed
Data Warehouses
store data from various source systems and databases across an enterprise in order to run analytical queries against huge datasets collected over long time periods. They are the primary source of cleansed data for analysis, reporting, and BI
Relational databases
store data in tables consisting of columns and rows, similar to the format of a spreadsheet
Centralized database
stores data at a single location that is accessible from anywhere. Searches can be fast because the search engine does not need to check multiple distributed locations to find responsive data
Data
the driving force behind any successful business
Business intelligence (BI)
tools and techniques process data and do statistical analysis for insight and discovery—that is, to discover meaningful relationships in the data, keep informed in real time, detect trends, and identify opportunities and risks
MySQL
which was acquired by Oracle in January 2010, powers hundreds of thousands of commercial websites and a huge number of internal enterprise applications
The major ERM tools
workflow software, authoring tools, scanners, and databases
Functions performed by a DBMS
• Data filtering and profiling
• Data integrity and maintenance
• Data synchronization
• Data security
• Data access
Data warehouses are:
• Designed and optimized for analysis and quick response to queries.
• Nonvolatile. This stability is important for analyzing the data and making comparisons. Once stored, data might never be changed or deleted, so that trend analysis and comparisons with newer data remain possible.
• OLAP systems.
• Subject-oriented, which means that the data captured are organized to have similar data linked together.
Databases are:
• Designed and optimized to ensure that every transaction gets recorded and stored immediately.
• Volatile because data are constantly being updated, added, or edited.
• OLTP systems.
An ERM can help a business to become more efficient and productive by:
• Enabling the company to access and use the content contained in documents.
• Cutting labor costs by automating business processes.
• Reducing the time and effort required to locate information the business needs to support decision making.
• Improving the security of content, thereby reducing the risk of intellectual property theft.
• Minimizing the costs associated with printing, storing, and searching for content.
Advantages of NoSQL
• Higher performance
• Easy distribution of data on different nodes, which enables scalability and fault tolerance
• Greater flexibility
• Simpler administration
HDFS Stages
• Loads data into HDFS.
• Performs the MapReduce operations.
• Retrieves results from HDFS.
Cost of Poor Quality Data
• Lost business.
• Time spent preventing errors.
• Time spent correcting errors.
Data Warehouses support
• Marketing and sales.
• Pricing and contracts.
• Forecasting.
• Sales.
• Financial.
Generally Accepted Recordkeeping Principles
• Principle of Accountability.
• Principle of Transparency.
• Principle of Integrity.
• Principle of Protection.
• Principle of Compliance.
• Principle of Availability.
• Principle of Retention.
• Principle of Disposition.