Chapter 6: Business Intelligence Big Data and Analytics
Benefits achieved from BI and Analytics
- detect fraud - improve forecasting - increase sales - optimize operations - reduce costs
Variety (Big Data)
-Structured data: format is known in advance. -Unstructured data: most of the data in the organization. --Not organized in any predefined manner.
Four categories of NoSQL databases
1. Key-value NoSQL databases -Two columns ("key" and "value") 2. Document NoSQL databases -store, retrieve, and manage document-oriented information 3. Graph NoSQL databases -well-suited for analyzing interconnections 4. Column NoSQL databases -store data in columns
Conversion funnel
A graphical representation that summarizes the steps a consumer takes in making the decision to buy your product and become a customer.
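A minimal sketch of the numbers behind a conversion funnel. The stage names and counts are hypothetical, chosen only to show how each step's conversion rate is derived from the step before it.

```python
# Hypothetical counts of consumers reaching each funnel step.
funnel = [
    ("Visited site", 10000),
    ("Viewed product", 4000),
    ("Added to cart", 1000),
    ("Purchased", 250),
]

def step_conversion(stages):
    """Return (stage, count, share of the previous stage) for every stage."""
    rows = []
    previous = None
    for name, count in stages:
        # The first stage has no predecessor, so its rate is 100%.
        rate = 1.0 if previous is None else count / previous
        rows.append((name, count, rate))
        previous = count
    return rows

rows = step_conversion(funnel)
# Overall conversion: buyers divided by everyone who entered the funnel.
overall_rate = funnel[-1][1] / funnel[0][1]

for name, count, rate in rows:
    print(f"{name:<15} {count:>6} {rate:6.1%}")
print(f"Overall conversion: {overall_rate:.1%}")
```

The step with the lowest rate is the one where the most prospective customers drop out.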
Data Warehouse
A logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks
Scenario analysis
A process for predicting future values based on certain potential events.
Cross-Industry Process for Data Mining (CRISP-DM)
A six-phase structured approach for the planning and execution of a data mining project -Business understanding, data understanding, data preparation, modeling, evaluation, deployment
Linear programming
A technique for finding the optimum value (largest or smallest, depending on the problem) of a linear expression (called the objective function) that is calculated based on the value of a set of decision variables that are subject to a set of constraints.
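A small worked example, assuming a made-up two-variable problem: maximize 3x + 5y subject to x <= 4, 2y <= 12, 3x + 2y <= 18, and x, y >= 0. The sketch relies on the fact that a bounded linear program reaches its optimum at a corner of the feasible region, so it simply checks every corner. Production solvers use other methods (for example simplex); this is only an illustration.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
constraints = [
    (1, 0, 4),     # x <= 4
    (0, 2, 12),    # 2y <= 12
    (3, 2, 18),    # 3x + 2y <= 18
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]
objective = (3, 5)  # maximize 3x + 5y

def solve_lp_2d(objective, constraints, tol=1e-9):
    """Maximize a linear objective by evaluating every feasible corner point."""
    best_point, best_value = None, None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        # Keep the point only if it satisfies every constraint.
        if all(a * x + b * y <= c + tol for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if best_value is None or value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value

best_point, best_value = solve_lp_2d(objective, constraints)
print(best_point, best_value)
```

Here x and y are the decision variables, 3x + 5y is the objective function, and the five inequalities are the constraints.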
Time series analysis
A type of forecast in which data relating to past demand are used to predict future demand.
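One simple way to turn past demand into a forecast is a moving average. The sketch below assumes hypothetical monthly demand figures and a three-period window; it is one of many time series techniques, not the only one.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next period as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window

# Hypothetical units sold in each of the last six months.
demand = [120, 130, 125, 140, 150, 145]
forecast = moving_average_forecast(demand, window=3)
print(forecast)  # mean of 140, 150 and 145
```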
Word cloud
A visual depiction of a set of words that have been grouped together because of the frequency of their occurrence.
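The frequencies that drive a word cloud can be sketched with a simple counter. The sample text and the stop-word list are made up for illustration; drawing the cloud itself (sizing each word by its count) is left to a plotting tool.

```python
import re
from collections import Counter

# Illustrative stop words that are too common to be informative.
STOP_WORDS = frozenset({"the", "a", "and", "is", "of", "to"})

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

reviews = "Great battery and great screen. The battery life is great."
freqs = word_frequencies(reviews)
top = freqs.most_common(2)
print(top)
```

Words with higher counts would be drawn larger in the cloud.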
NoSQL database
A way to store and retrieve data that is modeled using some means other than the simple two-dimensional tabular relations used in relational databases. -have the capability to spread data over multiple servers so that each server contains only a subset of the total data
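A rough sketch of how a key-value store might spread data over several servers so that each holds only a subset. The three "servers" are plain in-memory dictionaries and the hash-based placement rule is an assumption for illustration; real NoSQL products use their own partitioning schemes.

```python
import hashlib

def shard_for(key, num_servers):
    """Pick a server index from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

# Three hypothetical servers, each modeled as a dictionary.
servers = [dict() for _ in range(3)]

def put(key, value):
    servers[shard_for(key, len(servers))][key] = value

def get(key):
    return servers[shard_for(key, len(servers))].get(key)

for i in range(12):
    put(f"customer:{i}", {"id": i})

print([len(s) for s in servers])  # how many keys landed on each server
print(get("customer:7"))
```

Because the same key always hashes to the same server, a lookup only needs to contact one server.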
An organization's collection of useful data
Archives, Documents, Data from business apps, social media, sensor data, media, machine log data and public data
Online Transaction Processing (OLTP)
Capturing transaction and event information using technology to process, store, and update it -OLTP systems do not support the data analysis required today -Data warehouses and data marts allow organizations to access OLTP data and support decision making more effectively
Technologies used to manage and process big data
Data warehouses, extract transform load process, data marts, data lakes, NoSQL databases, Hadoop, In-Memory databases
Extract Transform Load (ETL) process
Extracts data from a variety of sources, edits and transforms data into a data warehouse format, loads data into the warehouse
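The three ETL steps can be sketched with the Python standard library. The CSV contents, the `orders` table, and the cleaning rules below are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,not_a_number
"""

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, convert types, and drop rows that fail validation."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount is not numeric
        cleaned.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT customer, amount FROM orders ORDER BY order_id").fetchall()
print(loaded)
```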
MapReduce program
a composite program that consists of a Map procedure that performs filtering and sorting and a Reduce method that performs a summary operation.
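A single-machine sketch of the Map and Reduce steps, using word counting as the example task. It only imitates the idea; it does not use Hadoop, and the function names and sample documents are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group the emitted values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, add them up)."""
    return key, sum(values)

documents = ["big data big insights", "data lakes store raw data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster, each document (or block of data) would be mapped on a different server and the grouped pairs would be reduced in parallel.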
Hadoop Distributed File System (HDFS)
a system used for data storage that divides the data into subsets and distributes the subsets onto different servers for processing.
Genetic algorithm
a technique that employs a natural selection-like process to find approximate solutions to optimization and search problems -typically implemented as a computer simulation
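A toy genetic algorithm, assuming a deliberately simple goal: evolve a string of bits toward all 1s. The population size, mutation rate, and selection rule are arbitrary choices for illustration, and the result is an approximate solution rather than a guaranteed optimum.

```python
import random

def fitness(bits):
    """Fitness is the number of 1s; the ideal individual is all 1s."""
    return sum(bits)

def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Selection: only the fitter half of the population may reproduce.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: carry the current best forward unchanged.
        children = [parents[0][:]]
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)           # crossover point
            child = mom[:cut] + dad[cut:]
            # Mutation: occasionally flip a bit.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, initial_best

best, initial_best = evolve()
print(fitness(best), "vs. initial best", initial_best)
```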
Hadoop
an open-source software framework that includes several software modules that provide a means for storing and processing extremely large data sets. -Includes HDFS and a MapReduce program -Limitation: can only perform batch processing
Big Data
describes data collections that are so enormous and complex that traditional data management software, hardware, and analytics processes are incapable of dealing with them.
Data scientist
extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information
Self-Service Analytics
includes training, techniques, and processes that empower end users to perform their own analyses using an endorsed set of tools.
Regression Analysis
involves determining the relationship between a dependent variable and one or more independent variables. Ex. Pharmaceutical company uses this to predict drug shelf life to meet FDA regulations and identify a suitable expiration date for the drug.
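A sketch of simple linear regression in the spirit of the shelf-life example. The potency measurements and the 90% threshold are invented for illustration and are not real stability data or an actual regulatory limit. Months in storage is the independent variable and potency is the dependent variable.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements: months in storage vs. % of labeled potency.
months = [0, 6, 12, 18, 24]
potency = [100.0, 97.1, 93.9, 91.0, 88.0]

slope, intercept = fit_line(months, potency)

# Assumed threshold: estimate when the fitted line falls to 90% potency.
threshold = 90.0
shelf_life_months = (threshold - intercept) / slope
print(round(slope, 3), round(intercept, 2), round(shelf_life_months, 1))
```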
Descriptive analysis
preliminary data processing stage used to identify patterns in the data and answer questions about who, what, where, when, and to what extent. -identifies data patterns -Includes visual analytics and regression analysis
Visual analytics
presentation of data pictorially or graphically. -Word cloud and Conversion funnel
Text analysis
process for extracting value from large quantities of unstructured text data such as consumer comments, social media postings, and customer reviews.
Video analysis
process of obtaining information or insights from video footage.
Monte Carlo Simulation
simulation that enables you to see a spectrum of thousands of possible outcomes, considering not only the many variables involved, but also the range of potential values for each of those variables.
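A minimal Monte Carlo sketch, assuming a hypothetical profit model with two uncertain variables (units sold and unit cost), each drawn uniformly from an assumed range. The price, fixed cost, and ranges are made up; the point is the spread of outcomes rather than any single number.

```python
import random

def simulate_profit(trials=10000, seed=42):
    """Run many trials, sampling each uncertain variable from its assumed range."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    price, fixed_cost = 12.0, 3000.0
    outcomes = []
    for _ in range(trials):
        units = rng.uniform(800, 1200)     # uncertain demand
        unit_cost = rng.uniform(6.0, 9.0)  # uncertain cost per unit
        outcomes.append(units * (price - unit_cost) - fixed_cost)
    return sorted(outcomes)

outcomes = simulate_profit()
mean_profit = sum(outcomes) / len(outcomes)
p5 = outcomes[int(0.05 * len(outcomes))]    # 5th percentile outcome
p95 = outcomes[int(0.95 * len(outcomes))]   # 95th percentile outcome
loss_probability = sum(1 for o in outcomes if o < 0) / len(outcomes)

print(f"mean={mean_profit:.0f}  5%={p5:.0f}  95%={p95:.0f}  P(loss)={loss_probability:.1%}")
```

Reporting the 5th and 95th percentiles alongside the mean shows the range of plausible results instead of one point estimate.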
In-memory database (IMDB)
stores the entire database in random access memory (RAM) -faster access to data --Access rates are much faster than when data is stored on secondary storage
Data Mart
subset of a data warehouse in which only a focused portion of the data warehouse information is kept -used by small and medium-sized businesses and departments within large companies
Data Lake
takes a "store everything" approach to big data -saves all data in its raw and unaltered form
Predictive analytics
techniques to analyze current data. Identifies future probabilities and trends and makes predictions about the future.
Data mining
the process of analyzing data to extract information not offered by the raw data alone -explores large amounts of data for hidden patterns -Association analysis, neural computing, and case-based reasoning.
Veracity (Big Data)
the uncertainty of data, including biases, noise, and abnormalities
Volume (Big Data)
volume of data that exists in the digital universe is about 16.1 zettabytes. -one zettabyte = one trillion gigabytes
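The unit conversion can be checked with a few lines of arithmetic, assuming decimal (SI) prefixes: one gigabyte is 10^9 bytes and one zettabyte is 10^21 bytes.

```python
BYTES_PER_GB = 10**9    # decimal (SI) gigabyte
BYTES_PER_ZB = 10**21   # decimal (SI) zettabyte

# Gigabytes in one zettabyte: 10**12, i.e. one trillion.
gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB

digital_universe_zb = 16.1  # figure quoted in the definition above
digital_universe_gb = digital_universe_zb * gb_per_zb

print(f"{gb_per_zb:,} GB per ZB")
print(f"{digital_universe_gb:.3e} GB in total")
```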
Key characteristics of Big Data
volume, velocity, value, variety, and veracity
Business intelligence (BI)
wide range of applications, practices, and technologies -extracts, transforms, integrates, visualizes, analyzes, interprets, and presents data