Chapter 6 Foundations of Business Intelligence: Databases and Information Management
Databases and the web
A customer with a Web browser wants to search an online retailer's database for pricing information. Because many back-end databases cannot interpret commands written in HTML, the Web server passes these requests for data to software that translates HTML commands into SQL so that they can be processed by the DBMS working with the database.
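The translation step can be sketched in miniature: a function stands in for the middleware, taking a product name from a Web request and turning it into a SQL query for the DBMS. The table, product names, and prices here are invented for illustration.

```python
import sqlite3

# Stand-in for the retailer's database (names and prices are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("widget", 9.99), ("gadget", 19.99)])

def price_lookup(product_name):
    """Translate a browser's query-string parameter into a SQL query.

    The ? placeholder keeps user input out of the SQL text itself,
    which also guards against SQL injection.
    """
    row = conn.execute("SELECT price FROM products WHERE name = ?",
                       (product_name,)).fetchone()
    return row[0] if row else None

print(price_lookup("widget"))  # 9.99
```

In a real deployment this function would sit in middleware or an application server between the Web server and the DBMS.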
Database Management Systems
A database management system (DBMS) is software that permits an organisation to centralise data, manage them efficiently, and provide access to the stored data by application programs. The DBMS acts as an interface between application programs and the physical data files. When the application program calls for a data item, such as gross pay, the DBMS finds this item in the database and presents it to the application program.
In the past, BAE Systems had used manual paper processes to reconcile its inconsistent and redundant data and to assemble data for management reporting. This approach was extremely time-consuming and costly, and prevented the company's information technology department from performing higher-value work.
A more appropriate solution was to install new hardware and software to create an enterprise-wide repository for business information that would support a more streamlined set of business applications. The new software included enterprise software that was integrated with an up-to-date database management system that could supply data for enterprise-wide reporting. The company had to reorganise its data into a standard company-wide format, eliminate redundancies, and establish rules, responsibilities, and procedures for updating and using the data.
Analytical tools: relationships, patterns, trends
Once data have been captured and organized using the business intelligence technologies we have just described, they are available for further analysis using software for database querying and reporting, multidimensional data analysis (OLAP), and data mining.
Capabilities of Database Management Systems
Querying and Reporting - A DBMS includes tools for accessing and manipulating information in databases. Most DBMS have a specialised language called a data manipulation language that is used to add, change, delete, and retrieve the data in the database. The most prominent data manipulation language today is Structured Query Language, or SQL.
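The four manipulation operations can be sketched with SQL run through Python's built-in sqlite3 module; the employee table and figures below are invented for illustration.

```python
import sqlite3

# A throwaway in-memory database to demonstrate SQL data manipulation.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (name TEXT, gross_pay REAL)")

# add a record
db.execute("INSERT INTO employees VALUES ('Ada', 5200.0)")
# change a record
db.execute("UPDATE employees SET gross_pay = 5500.0 WHERE name = 'Ada'")
# retrieve a record
pay = db.execute(
    "SELECT gross_pay FROM employees WHERE name = 'Ada'").fetchone()[0]
# delete a record
db.execute("DELETE FROM employees WHERE name = 'Ada'")

print(pay)  # 5500.0
```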
Lack of flexibility
A traditional file system can deliver routine scheduled reports after extensive programming efforts, but it cannot deliver ad hoc reports or respond to unanticipated information requirements in a timely fashion. The information required by ad hoc requests is somewhere in the system but may be too expensive to retrieve. Several programmers might have to work for weeks to put together the required data items in a new file.
Business Intelligence infrastructure
Suppose you wanted concise, reliable information about current operations, trends, and changes across the entire company. If you worked in a large company, the data you need might have to be pieced together from separate systems, such as sales, manufacturing, and accounting, and even from external sources, such as demographic or competitor data. Increasingly, you might need to use big data. A contemporary infrastructure for business intelligence has an array of tools for obtaining useful information from all the different types of data used by businesses today, including semi-structured and unstructured big data in vast quantities. These capabilities include data warehouses and data marts, Hadoop, in-memory computing, and analytical platforms. Some of these capabilities are available as cloud services.
Operations of a Relational DBMS
• select - creates a subset of all records meeting stated criteria
• join - combines relational tables to provide the user with more information than is available from individual tables
• project - creates a subset consisting of columns in a table, permitting the user to create new tables containing only the desired information
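The three operations above map directly onto clauses of a SQL query, which can be tried with Python's sqlite3 module. The PART and SUPPLIER tables and their contents here are hypothetical.

```python
import sqlite3

# Two small relational tables invented to illustrate select, join, project.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE part (part_number INTEGER, part_name TEXT, supplier_number INTEGER)")
db.execute("CREATE TABLE supplier (supplier_number INTEGER, supplier_name TEXT)")
db.execute("INSERT INTO part VALUES (137, 'Door latch', 4058), (152, 'Door handle', 2038)")
db.execute("INSERT INTO supplier VALUES (4058, 'CBM Inc.'), (2038, 'Ace Inc.')")

# select: a subset of rows meeting stated criteria (the WHERE clause)
rows = db.execute("SELECT * FROM part WHERE part_number = 137").fetchall()

# join: combine the two tables on their common column
joined = db.execute("""SELECT part_name, supplier_name
                       FROM part JOIN supplier
                         ON part.supplier_number = supplier.supplier_number""").fetchall()

# project: a subset of columns (the column list after SELECT)
names = db.execute("SELECT part_name FROM part").fetchall()
```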
Capabilities of Database Management Systems
A DBMS includes capabilities and tools for organizing, managing, and accessing the data in the database. The most important are its data definition language, data dictionary, and data manipulation language. DBMS have a data definition capability to specify the structure of the content of the database. It would be used to create database tables and to define the characteristics of the fields in each table. This information about the database would be documented in a data dictionary. A data dictionary is an automated or manual file that stores definitions of data elements and their characteristics.
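A small sketch of the data definition capability: CREATE TABLE specifies field names, types, and constraints, and the DBMS's own catalog can then be queried for those definitions, playing the role of an automated data dictionary. The table and field names below are invented; SQLite's PRAGMA table_info is used here as the catalog query.

```python
import sqlite3

# Data definition: specify the structure of the database content.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE employee (
                employee_id INTEGER PRIMARY KEY,
                last_name   TEXT NOT NULL,
                gross_pay   REAL)""")

# Query the DBMS catalog for each field's name and declared type,
# a small-scale automated data dictionary.
dictionary = [(row[1], row[2]) for row in
              db.execute("PRAGMA table_info(employee)")]
print(dictionary)  # [('employee_id', 'INTEGER'), ('last_name', 'TEXT'), ('gross_pay', 'REAL')]
```

A full data dictionary would also record business meanings, ownership, and usage rules, which a catalog query alone cannot supply.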
How a DBMS Solves the Problem of the Traditional File Environment
A DBMS reduces data redundancy and inconsistency by minimising isolated files in which the same data are repeated. The DBMS may not enable the organisation to eliminate data redundancy entirely, but it can help control redundancy. Even if the organisation maintains some redundant data, using a DBMS eliminates data inconsistency because the DBMS can help the organisation ensure that every occurrence of redundant data has the same values. The DBMS enables the organisation to centrally manage data, their use, and security. Data sharing throughout the organisation is easier because the data are presented to users as being in a single location rather than fragmented in many different systems and files.
What are the problems of managing data resources in a traditional file environment?
An effective information system provides users with accurate, timely, and relevant information. Accurate information is free of errors. Information is timely when it is available to decision makers when it is needed. Information is relevant when it is useful and appropriate for the types of work and decisions that require it. Many businesses don't have timely, accurate, or relevant information because the data in their information systems have been poorly organised and maintained. That's why data management is so essential.
In-Memory Computing
Another way of facilitating big data analysis is to use in-memory computing, which relies primarily on a computer's main memory (RAM) for data storage. (Conventional DBMS use disk storage systems.) Users access data stored in system primary memory, thereby eliminating bottlenecks from retrieving and reading data in a traditional, disk-based database and dramatically shortening query response times. In-memory processing makes it possible for very large sets of data, amounting to the size of a data mart or small data warehouse, to reside entirely in memory. Complex business calculations that used to take hours or days are able to be completed within seconds, and this can even be accomplished on handheld devices. (See the Interactive Session on Technology.)
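SQLite offers a small-scale analog of this idea: a database held entirely in RAM rather than in a disk file. The sensor-readings table below is invented; the point is only that the whole dataset resides in main memory while being queried.

```python
import sqlite3

# ":memory:" creates a database that lives entirely in RAM -- no disk file.
ram_db = sqlite3.connect(":memory:")
ram_db.execute("CREATE TABLE readings (sensor INTEGER, value REAL)")
ram_db.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(i % 10, float(i)) for i in range(100_000)])

# An aggregate query over 100,000 rows is answered from main memory,
# with no disk retrieval step at all.
total = ram_db.execute("SELECT SUM(value) FROM readings").fetchone()[0]
print(total)  # 4999950000.0
```

Production in-memory platforms add columnar storage and compression on top of this basic idea, which is what makes data-mart-sized datasets fit in RAM.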
What are the principal tools and technologies for accessing information from databases to improve business performance and decision making?
Businesses use their databases to keep track of basic transactions, such as paying suppliers, processing orders, keeping track of customers, and paying employees. But they also need databases to provide information that will help the company run the business more efficiently and help managers and employees make better decisions. If a company wants to know which product is the most popular or who is its most profitable customer, the answer lies in the data.
The challenge of big data: Most data collected by organizations used to be transaction data that could easily fit into the rows and columns of relational database management systems. Today, however, there is an explosion of data from Web traffic, e-mail messages, and social media content (tweets, status messages), as well as machine-generated data from sensors (used in smart meters, manufacturing sensors, and electrical meters) and from electronic trading systems. These data may be unstructured or semi-structured and thus are not suitable for relational database products that organize data in the form of columns and rows. We now use the term big data to describe datasets with volumes so huge that they are beyond the ability of typical DBMS to capture, store, and analyze. Big data doesn't refer to any specific quantity, but usually refers to data in the petabyte and exabyte range - in other words, billions to trillions of records, all from different sources.
Relational DBMS
Contemporary DBMS use different database models to keep track of entities, attributes, and relationships. The most popular type of DBMS today for PCs as well as for larger computers and mainframes is the relational DBMS. Relational databases represent data as two-dimensional tables (called relations).
Data mining
Data mining is more discovery-driven. Data mining provides insights into corporate data that cannot be obtained with OLAP by finding hidden patterns and relationships in large databases and inferring rules from them to predict future behaviour. The patterns and rules are used to guide decision making and forecast the effect of those decisions. The types of information obtainable from data mining include associations, sequences, classifications, clusters, and forecasts.
• Associations are occurrences linked to a single event. For instance, a study of supermarket purchasing patterns might reveal that, when corn chips are purchased, a cola drink is purchased 65 percent of the time, but when there is a promotion, cola is purchased 85 percent of the time. This information helps managers make better decisions because they have learned the profitability of a promotion.
• In sequences, events are linked over time. We might find, for example, that if a house is purchased, a new refrigerator will be purchased within two weeks 65 percent of the time, and an oven will be bought within one month of the home purchase 45 percent of the time.
• Classification recognises patterns that describe the group to which an item belongs by examining existing items that have been classified and by inferring a set of rules. For example, businesses such as credit card or telephone companies worry about the loss of steady customers. Classification helps discover the characteristics of customers who are likely to leave and can provide a model to help managers predict who those customers are so that the managers can devise special campaigns to retain such customers.
• Clustering works in a manner similar to classification when no groups have yet been defined. A data mining tool can discover different groupings within data, such as finding affinity groups for bank cards or partitioning a database into groups of customers based on demographics and types of personal investments.
• Although these applications involve predictions, forecasting uses predictions in a different way. It uses a series of existing values to forecast what other values will be. For example, forecasting might find patterns in data to help managers estimate the future value of continuous variables, such as sales figures.
These systems perform high-level analysis of patterns or trends, but they can also drill down to provide more detail when needed. There are data mining applications for all the functional areas of business, and for government and scientific work. One popular use for data mining is to provide detailed analyses of patterns in customer data for one-to-one marketing campaigns or for identifying profitable customers.
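The association idea from the corn chips example can be reduced to a few lines: count how often cola appears in the same basket as corn chips. The baskets below are invented, so the resulting percentage differs from the 65 percent figure in the text.

```python
# Toy association discovery: confidence of the rule "corn chips -> cola".
# These five supermarket baskets are made up for illustration.
baskets = [
    {"corn chips", "cola", "salsa"},
    {"corn chips", "cola"},
    {"corn chips", "bread"},
    {"milk", "bread"},
    {"corn chips", "cola", "milk"},
]

# Of the baskets containing corn chips, what fraction also contain cola?
with_chips = [b for b in baskets if "corn chips" in b]
confidence = sum("cola" in b for b in with_chips) / len(with_chips)
print(f"cola bought with corn chips {confidence:.0%} of the time")  # 75%
```

Real association-rule miners (e.g., the Apriori algorithm) apply this same confidence calculation to millions of baskets and all candidate item pairs at once.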
data redundancy and inconsistency
Data redundancy is the presence of duplicate data in multiple data files so that the same data are stored in more than one place or location. Data redundancy occurs when different groups in an organisation independently collect the same piece of data and store it independently of each other. Data redundancy wastes storage resources and also leads to data inconsistency, where the same attribute may have different values. For example, in instances of the entity COURSE illustrated in Figure 6.1, the Date may be updated in some systems but not in others. The same attribute, Student_ID, may also have different names in different systems throughout the organisation. Some systems might use Student_ID and others might use ID, for example.
What are the major capabilities of database management systems (DBMS) and why is a relational DBMS so powerful?
Database technology cuts through many of the problems of traditional file organisation. A more rigorous definition of a database is a collection of data organised to serve many applications efficiently by centralising the data and controlling redundant data. Rather than storing data in separate files for each application, data appears to users as being stored in only one location. A single database services multiple applications.
Problems with the traditional file environment
In most organisations, systems tended to grow independently without a company-wide plan. Accounting, finance, manufacturing, human resources, and sales and marketing all developed their own systems and data files. Each application, of course, required its own files and its own computer program to operate. For example, the human resources functional area might have a personnel master file, a payroll file, a medical insurance file, a pension file, a mailing list file, and so forth until tens, perhaps hundreds, of files and programs existed. In the company as a whole, this process led to multiple master files created, maintained, and operated by separate divisions or departments. As this process goes on for 5 or 10 years, the organisation is saddled with hundreds of programs and applications that are very difficult to maintain and manage. The resulting problems are:
- data redundancy and inconsistency
- program-data dependence
- inflexibility
- poor data security
- an inability to share data among applications.
OLAP (online analytical processing)
Manipulation of information to create business intelligence in support of strategic decision making. Suppose your company sells four different products—nuts, bolts, washers, and screws—in the East, West, and Central regions. If you wanted to ask a fairly straightforward question, such as how many washers sold during the past quarter, you could easily find the answer by querying your sales database. But what if you wanted to know how many washers sold in each of your sales regions and compare actual results with projected sales? To obtain the answer, you would need online analytical processing (OLAP). Each aspect of information—product, pricing, cost, region, or time period—represents a different dimension. So, a product manager could use a multidimensional data analysis tool to learn how many washers were sold in the East in June, how that compares with the previous month and the previous June, and how it compares with the sales forecast. OLAP enables users to obtain online answers to ad hoc questions such as these fairly rapidly, even when the data are stored in very large databases, such as sales figures for multiple years.
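A miniature of the washers example: sales are recorded along the product, region, and time dimensions, and a GROUP BY query slices along one dimension while filtering on the others. All figures below are invented.

```python
import sqlite3

# Sales facts with three dimensions: product, region, month (invented data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, region TEXT, month TEXT, units INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("washers", "East",    "June", 120),
    ("washers", "West",    "June",  90),
    ("washers", "Central", "June",  60),
    ("washers", "East",    "May",  100),
    ("bolts",   "East",    "June", 200),
])

# How many washers sold in each region in June?
# Filter on the product and time dimensions, group by the region dimension.
by_region = dict(db.execute(
    """SELECT region, SUM(units) FROM sales
       WHERE product = 'washers' AND month = 'June'
       GROUP BY region"""))
```

A true OLAP engine precomputes such aggregates across every combination of dimensions (the "cube") so that each slice comes back instantly rather than being recomputed per query.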
Hadoop
Relational DBMS and data warehouse products are not well-suited for organizing and analyzing big data or data that do not easily fit into columns and rows used in their data models. For handling unstructured and semi-structured data in vast quantities, as well as structured data, organizations are using Hadoop. Hadoop is an open source software framework managed by the Apache Software Foundation that enables distributed parallel processing of huge amounts of data across inexpensive computers. It breaks a big data problem down into sub-problems, distributes them among up to thousands of inexpensive computer processing nodes, and then combines the result into a smaller data set that is easier to analyze. You've probably used Hadoop to find the best airfare on the Internet, get directions to a restaurant, do a search on Google, or connect with a friend on Facebook.
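The divide-and-combine pattern Hadoop uses (MapReduce) can be shown in a few lines: a word-count job is split into chunks, each chunk is "mapped" independently (on a cluster these would run on separate nodes), and the partial results are "reduced" into one small result set. This is a single-process sketch of the idea, not Hadoop's actual API.

```python
from collections import defaultdict

# Two text chunks standing in for data blocks spread across a cluster.
chunks = [
    "big data big problems",
    "big clusters small problems",
]

def map_chunk(text):
    """Map step: count words within one chunk, independently of the others."""
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-chunk partial counts into one result."""
    totals = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            totals[word] += n
    return dict(totals)

word_counts = reduce_counts(map_chunk(c) for c in chunks)
print(word_counts["big"])  # 3
```

Hadoop adds what this sketch omits: distributing the chunks across thousands of nodes, scheduling the map tasks near the data, and recovering when nodes fail.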
Data Warehouse and Data Marts
The traditional tool for analysing corporate data for the past two decades has been the data warehouse. A data warehouse is a database that stores current and historical data of potential interest to decision makers throughout the company. The data originate in many core operational transaction systems, such as systems for sales, customer accounts, and manufacturing, and may include data from Web site transactions. The data warehouse extracts current and historical data from multiple operational systems inside the organization. These data are combined with data from external sources and transformed by correcting inaccurate and incomplete data and restructuring the data for management reporting and analysis before being loaded into the data warehouse. The data warehouse makes the data available for anyone to access as needed, but the data cannot be altered. Companies often build enterprise-wide data warehouses, where a central data warehouse serves the entire organization, or they create smaller, decentralized warehouses called data marts. A data mart is a subset of a data warehouse in which a summarized or highly focused portion of the organization's data is placed in a separate database for a specific population of users. For example, a company might develop marketing and sales data marts to deal with customer information. Bookseller Barnes & Noble used to maintain a series of data marts—one for point-of-sale data in retail stores, another for college bookstore sales, and a third for online sales.
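The extract-transform-load flow described above can be sketched end to end: rows are pulled from two hypothetical operational systems with inconsistent formats, cleaned into one standard shape, and loaded into a warehouse table kept separate from operations. All table names and figures are invented.

```python
import sqlite3

# Two operational source systems with inconsistent formats, plus a warehouse.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE store_sales  (cust TEXT, amount TEXT);    -- point-of-sale system
    CREATE TABLE online_sales (customer TEXT, total REAL); -- Web system
    CREATE TABLE warehouse    (customer TEXT, amount REAL, channel TEXT);
    INSERT INTO store_sales  VALUES ('ada', '19.99'), ('bob', 'N/A');
    INSERT INTO online_sales VALUES ('Ada', 5.0);
""")

# Extract + transform: standardize names, correct bad amounts, tag the source.
for cust, amount in db.execute("SELECT cust, amount FROM store_sales").fetchall():
    try:
        value = float(amount)
    except ValueError:
        value = 0.0  # correct incomplete/invalid operational data
    db.execute("INSERT INTO warehouse VALUES (?, ?, 'store')",
               (cust.title(), value))
for cust, total in db.execute("SELECT customer, total FROM online_sales").fetchall():
    db.execute("INSERT INTO warehouse VALUES (?, ?, 'online')", (cust, total))

rows = db.execute("SELECT customer, amount, channel FROM warehouse").fetchall()
```

Production ETL tools do the same three steps, but with scheduling, change tracking, and far richer data-quality rules than this try/except.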
Text Mining and Web Mining
Unstructured data, most in the form of text files, is believed to account for over 80 percent of useful organisational information and is one of the major sources of big data that firms want to analyse. E-mail, call centre transcripts, survey responses, legal cases, patent descriptions, and service reports are all valuable for finding patterns and trends that will help employees make better business decisions. Text mining tools are now available to help businesses analyse these data. These tools are able to extract key elements from unstructured big data sets, discover patterns and relationships, and summarise the information. Businesses might turn to text mining to analyse transcripts of calls to customer service centres to identify major service and repair issues or to measure customer sentiment about their company. Sentiment analysis software is able to mine text comments in an e-mail message, blog, social media conversation, or survey form to detect favourable and unfavourable opinions about specific subjects. The Web is another rich source of unstructured big data for revealing patterns, trends, and insights into customer behaviour. The discovery and analysis of useful patterns and information from the World Wide Web is called Web mining. Businesses might turn to Web mining to help them understand customer behaviour, evaluate the effectiveness of a particular Web site, or quantify the success of a marketing campaign. For instance, marketers use the Google Trends and Google Insights for Search services, which track the popularity of various words and phrases used in Google search queries, to learn what people are interested in and what they are interested in buying. Web content mining is the process of extracting knowledge from the content of Web pages, which may include text, image, audio, and video data. Web structure mining examines data related to the structure of a particular Web site.
For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. Web usage mining examines user interaction data recorded by a Web server whenever requests for a Web site's resources are received. The usage data record the user's behavior when the user browses or makes transactions on the Web site and are collected in a server log. Analyzing such data can help companies determine the value of particular customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns.
Program-data dependence
The coupling of data stored in files and the specific programs required to update and maintain those files, such that changes in programs require changes to the data. Every traditional computer program has to describe the location and nature of the data with which it works. In a traditional file environment, any change in a software program could require a change in the data accessed by that program. One program might be modified to handle nine-digit instead of five-digit zip codes. If the original data file were changed from five-digit to nine-digit zip codes, then other programs that required the five-digit zip code would no longer work properly. Such changes could cost millions of dollars to implement properly.