Mid Term MISY 4390
Data Warehouse Development Approaches: 2. The Kimball Model:
The Data Mart Approach (bottom-up) a. "plan big, build small" approach b. subject-oriented or department-oriented data warehouse c. focuses on the requests of a specific department
Data Warehouse Development Approaches: 1. The Inmon Model:
The EDW Approach (top-down) a. Employing established database development methodologies and tools, such as entity-relationship diagrams (ERD), and an adjustment of the spiral development approach
How Data Mining Works
Using existing and relevant data, datamining builds models to identify patterns among the attributes presented in the data set. Models are the mathematical representations (simple linear relationships and/or complex highly nonlinear relationships) that identify the patterns among the attributes of the objects (ex. Customers) described in the data set. 1. Associations: find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis 2. Predictions: tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl 3. Clusters: identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors 4. Sequential Relationships: discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year
Star Schema
most commonly used and simplest style of dimensional modeling • Contain a fact table surrounded by and connected to several dimension tables
Web Content Mining
the extraction of useful information from Web pages
Four Major Components of BI
1. Data Warehouse - with its source data 2. Business Analytics - a collection of tools for manipulating, mining, and analyzing the data in the data warehouse 3. Business Performance Management (BPM) - for monitoring and analyzing performance 4. User Interface - ex. Dashboard
Data Warehousing Architecture Three-Tier Architecture
1. Data acquisition software (back-end) 2. The data warehouse that contains the data and software 3. Client (front-end) software that allows users to access and analyze data from the warehouse
Two-Tier Architecture
1. Data acquisition software (back-end) AND the data warehouse that contains the data and software 2. Client (front-end) software that allows users to access and analyze data from the warehouse
The V's that Define Big Data
1. Volume - is obviously the most common trait of Big Data 2. Variety - data today comes in all types of formats-ranging from traditional databases to hierarchical data stores created by the end users and OLAP systems to text documents, e-mail, XML, meter-collected, and sensor-captured data, to video, audio, and stock ticker data 3. Velocity - both how fast data is being produced and how fast the data must be processed to meet the need or demand 4. Veracity - accuracy, quality, truthfulness, or trustworthiness of the data 5. Variability - in addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks 6. Value Proposition - analyzing large and feature-rich data, organizations can gain greater business value that they may not have otherwise
Using a Lexicon
A lexicon is essentially the catalog of words, their synonyms, and their meanings for a given language. In addition to lexicons for many other languages, there are several general-purpose lexicons created for English. Often general-purpose lexicons are used to relate a variety of special-purpose lexicons for use in sentiment analysis projects. (ex. WordNet; a popular general-purpose lexicon created at Princeton University)
Association Rule Mining
Association rule mining (also known as affinity analysis or market-basket analysis) is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience. • Input: the simple point-of-sale transaction data • Output: Most frequent affinities among items
Task 3: Extract The Knowledge
Classification - supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior Clustering - partitioning a database into segments in which the members of a segment share similar qualities Association - a category of data mining algorithm that establishes relationships about items that occur together in a given record Trend Analysis - the collecting of information and attempting to spot a pattern, or trend, in the information
Data Marts
Data Mart - a departmental data warehouse that stores only relevant data; usually smaller and focuses on a particular subject or department Dependent Data Mart - a subset that is created directly from a data warehouse Independent Data Mart - a small data warehouse designed for a strategic business unit (SBU) or department
Big Data for Health Sciences
Modern era medical devices (ex. Electrocardiograms and equipment that measures blood pressure, blood oxygen level, blood sugar level, body temperature, and so on) are capable of producing invaluable streaming diagnostic/sensory data at a very fast rate. Harnessing this data and analyzing it in real time offers benefits - the kind that we often call "life and death" - unlike any other field. In addition to helping healthcare companies become more effective and efficient, stream analytics is also improving patient conditions and saving lives. These systems are meant to help human decision makers make faster and better decisions by being exposed to a multitude of information as soon as it becomes available.
OLAP
Online Analytical Processing (OLAP) - an information system that enables the user, while at a PC, to query the system, conduct an analysis, and so on. The result is generated in seconds.
OLTP
Online Transaction Processing (OLTP) - transaction system that is primarily responsible for capturing and storing data related to day-to-day business functions
Real-Time, On-Demand BI
The demand for instant, on-demand access to dispersed information has grown as the need to close the gap between the operational data and strategic objectives has become more pressing. As a result, a category of products called real-time BI applications has emerged. The introduction of new data-generating technologies, such as radio-frequency identification (RFID), is only accelerating this growth and the subsequent need for real-time BI. Traditional BI systems use a large volume of static data that has been extracted, cleansed, and laded into a data warehouse to produce reports and analyses.
petabytes (PB)
The highest mass of data that used to be called petabytes (PB) has left its place to zettabytes (ZB), which is a trillion gigabytes (GB) or a billion terabytes (TB)
Business Intelligence (BI)
a conceptual framework for decision support; an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies (pg. 8-9)
Extraction, Transformation, and Load (ETL)
a data warehousing process that consists of extraction, transformation, and load • An integral component in any data-centric project • ETL process typically consumes 70% of the time in a data-centric project • Extraction: reading data from one or more databases • Transformation: converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database • Load: putting the data into the data warehouse
Data Warehouse (DW)
a pool of data produced to support decision making; a physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
Data Mining
a process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases; used to describe discovering or "mining" knowledge from large amounts of data
Dimensional Modeling
a retrieval-based system that supports high-volume query access
Operational Data Store (ODS)
a type of database often used as an interim area for a data warehouse, especially for customer information files • The contents of an ODS are updated throughout the course of business operations • Used for short term decisions rather that medium or long term • Consolidates data from multiple source systems and provides a near-real time, current data
Metadata
data about data; in a data warehouse, metadata describes the contents of a data warehouse and the manner of its use • Generally defined in terms of usage as technical or business metadata • Pattern is another way to view metadata
Big Data
data that exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span "Big Data" = massive volumes of data • The Vs that define Big Data o Volume o Variety o Velocity o Veracity o Variability o Value Proposition
Middleware Tools
enable access to the data warehouse. Power users such as analysts may write their own SQL queries. Others may employ a managed query environment, such as Business Objects, to access data. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools.
Taxonomy of Data
• Categorical data - represent the labels of multiple classes used to divide a variable into specific groups • Ordinal data - contain codes assigned to objects or events as labels that also represent the rank order among them • Numeric data - represent the numeric values of specific variables • Interval data - variables that can be measured on interval scales • Ratio data - include measurement variables commonly found in the physical sciences and engineering
Data Mining Software Tools
• Commercial o IBM SPSS Modeler (formerly Clementine) o SAS - Enterprise Miner o IBM - Intelligent Miner o StatSoft - Statistica Data Miner • Free and/or Open Source o KNIME o RapidMiner o Weka
Parallel Processing
• Enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability • Data warehouse designers need to decide whether the database tables will be partitioned for access efficiency and what the criteria will be
High-Performance Computing
• In-memory analytics: solves complex problems in near-real-time with highly accurate insights by allowing analytical computations and Big Data to be processed in-memory and distributed across a dedicated set of nodes • In-database analytics: speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database so you won't have to move or convert data repeatedly • Grid computing: promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources • Appliances: brings together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis
Variations of OLAP
• Multidimensional OLAP (MOLAP) o OLAP implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time • Relational OLAP (ROLAP) o The implementation of an OLAP database on top of an existing relational database o Database OLAP and Web OLAP (DOLAP and WOLAP); Desktop OLAP,...
OLAP Operations
• Slice - a subset of a multidimensional array • Dice - a slice on more than two dimensions • Drill Down/Up - navigating among levels of data ranging from the most summarized (up) to the most detailed (down) • Roll Up - computing all of the data relationships for one or more dimensions • Pivot - used to change the dimensional orientation of a report or an ad hoc query-page display
Characteristics of Data Warehousing
• Subject Oriented: enables users to determine not only how their business is performing, but why • Integrated: place data from different sources into a consistent format; they must deal with naming conflicts and discrepancies among units of measure • Time Variant (time series): time is the one important dimension that all data warehouses must support • Nonvolatile: users cannot change or update the data • Web Based •Relational/Multidimensional • Client/Server • Real Time • Include Metadata
Text Mining Lingo
• Unstructured data - has a predetermined format • Corpus - a large and structured set of texts prepared for the purpose of conducting knowledge discovery • Terms - a single word extracted directly from the corpus of a specific domain by means of natural language processing (NLP) • Concepts - features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology • Stemming - the process of reducing inflected words to their stem form • Stop words - words that are filtered out prior to or after processing of natural language data • Synonyms and polysemes - syntactically different words with identical or at least similar meanings; polysemes are similar words with different meanings • Tokenizing - a categorized block of text in a sentence • Term dictionary - a collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus • Word frequency - the number of times a word is found in a specific document • Part-of-speech tagging - the process of marking up the words in a text as corresponding to a particular part of speech based on a world's definition and the context in which it is used • Morphology - a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words • Term-by-document matrix - a common representation schema of the frequency-based relationship between the terms and documents in tabular format • Singular-value decomposition - a dimensionality reduction method used to transform the term-by-document matrix to a manageable size
Big Data and Data Warehousing
• What is the impact of Big Data on DW? o Big Data and RDBMS do not go nicely together o Will Hadoop replace data warehousing/RDBMS? • Use Cases for Hadoop o Hadoop as the repository and refinery o Hadoop as the active archive • Use Cases for Data Warehousing o Data warehouse performance o Integrating data that provides business value o Interactive BI tools
Real-time/Active Data Warehousing (RDW/ADW)
• the process of loading and providing data via a data warehouse as they become available o Push vs. Pull (of data) • Concerns about real-time BI o Not all data should be updated continuously o Mismatch of reports generated minutes apart o May be cost prohibitive o May also be infeasible
