Data Analytics First Exam
1. What is the difference between Supervised Learning and Unsupervised Learning?
•Supervised Methods - Involve input variables and a defined output variable
  •Prediction
  •Classification
•Unsupervised Methods - Have input variables only (no output variable defined)
  •Association Rules
  •Clustering
1. What is Time Series Forecasting? Give examples.
•Time Series Forecasting - the use of mathematical modeling to predict future values, based on previously observed values
•Examples: sales over time, stock performance over time, temperature over time, etc.
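A minimal sketch (not from the course materials, hypothetical sales figures) of one very simple forecasting approach - predicting the next value from a 3-period moving average of previously observed values:

# Forecast next month's sales from the average of the last 3 observed months
sales = [120, 135, 128, 150, 162, 158, 171]  # hypothetical monthly sales

window = 3
forecast = sum(sales[-window:]) / window  # mean of the 3 most recent values

print(f"Forecast for next month: {forecast:.1f}")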
1. Why are data for Supervised Models divided among Training, Validation, and Testing Partitions?
•Training Partition - used to "train" the model with patterns in historical data
•Validation Partition - also historical data, used to evaluate how well the model "learned" from the Training Partition
•Test Partition - second level of validation, often done with a second set of historical data
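A minimal sketch of partitioning, assuming scikit-learn is available (the X and y values below are hypothetical historical inputs and a known output variable):

from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]                 # hypothetical input variable
y = [0 if i < 50 else 1 for i in range(100)]  # hypothetical output variable

# First split off the test partition, then carve a validation partition
# out of the remaining training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 60 / 20 / 20 split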
1. What concepts and technologies will define the future of DWs?
•Web, Social Media, and Big Data (lots of unstructured and semi-structured data, & huge volumes of data!)
•Open Source Software - R, Python, NoSQL, Hadoop
•Software as a Service (SaaS) - Hosted DWs
•Cloud Computing
•Data Lakes
What are the 4 main components of BI Architecture?
1. Data Warehouse
2. Analytics Tools
3. Business Performance Management (BPM)
4. User Interface
1. What are the 5 DW Architectures? What are the advantages and disadvantages of each?
1. Independent Data Marts - "Data Islands" or "Stove-Pipes" (not desirable)
2. Data Mart Bus - "Bottom-Up" approach (Ralph Kimball)
3. Hub-and-Spoke - "Top-Down" approach (Bill Inmon)
4. Centralized - "Operational Copy" (separate DW, but no Dimensional Modeling)
5. Federated - "Patchwork" approach (query the "live" operational systems)
1. What are the five main Data Mining Tasks and what is involved in each?
1. Prediction - forecasting the future, based on patterns in historical data
2. Classification - determine behavior and assign classes based on patterns in historical data
3. Clustering - partition a collection of things into segments of natural groupings, where each group shares similar characteristics
4. Associations - determining "what goes with what" based on past patterns of behavior
5. Visualization and Time-Series Forecasting - help to explore and uncover patterns in historical data
What terms were the predecessors of Analytics? What terms were the predecessors of Data Science?
•Analytics: decision support systems, management information systems, business intelligence, data analytics
•Data Science: AI, expert systems, data mining, predictive analytics
1. What is a Snowflake Schema? How does it differ from a Star Schema?
Snowflake Schema:
•A variation on the Star Schema
•Dimension tables are normalized into a hierarchy of detail (in a Star Schema, each dimension is a single, denormalized table)
•Resembles a Snowflake shape
1. What is the formal definition of Data Mining (DM)? What are other names for Data Mining?
a. The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data found in structured databases
b. Patterns in structured data (data that can be put into a table)
c. The goal of these patterns is to predict the future by looking at what happened in the past
d. Ex: Adventure Works, selling bikes, might look at past sales history to find out which bikes are likely to sell in the future and which ones should be promoted
e. Ex: weather - past data can be used to predict what the weather will be like in the future
f. Other names: Predictive Analytics, Knowledge Extraction, Pattern Analysis, Data Archaeology, Information Harvesting, Pattern Searching, Data Dredging
What is the difference between a Data Analyst and a Data Scientist?
•Data Analyst - performs data compilation, cleaning, reporting, and visualization. Skills: MS-Excel, SQL, and Reporting
•Data Scientist - performs predictive analysis and statistical analysis, and uses advanced analytical tools and algorithms. Skills: Statistics, Mathematical Modeling, Java, Python, and R
1. What is Data Preprocessing? What are the main steps?
•Data Preprocessing - the steps necessary to convert raw, real-world data into a well-defined form for analytics
•Main steps:
  i. Consolidate data - data warehouses and data lakes can help with this
  ii. Clean data - remove duplicates and take care of missing values
  iii. Transform data - convert data from its operational-system layout into a different layout before loading it into the data warehouse
  iv. Reduce data - if there is too much data, sample it or keep only certain variables
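A minimal sketch of the four steps using pandas (the source tables and values below are hypothetical):

import pandas as pd

# i. Consolidate: combine data pulled from two hypothetical source systems
sales_a = pd.DataFrame({"customer": ["Ann", "Bob"], "amount": [100.0, None]})
sales_b = pd.DataFrame({"customer": ["Bob", "Ann"], "amount": [250.0, 100.0]})
data = pd.concat([sales_a, sales_b], ignore_index=True)

# ii. Clean: remove duplicate rows and fill in missing values
data = data.drop_duplicates()
data["amount"] = data["amount"].fillna(data["amount"].mean())

# iii. Transform: derive an analytics-ready column
data["high_value"] = data["amount"] > 150

# iv. Reduce: keep a sample of rows (and only needed columns) if the data set is too large
sample = data[["customer", "amount"]].sample(frac=0.5, random_state=1)
print(sample)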
1. Explain the difference between Data Visualization and Visual Analytics?
•Data Visualization - aimed at "What happened?" or "What is happening?"
•Visual Analytics - aimed at "Why is it happening?" and "What is more likely to happen next?"
What are examples of analytics used in Health Care? In Retail Sales?
•Health Care - Humana Insurance: predict higher-risk patients and place them in preventative care programs
•Retail Sales - Descriptive, Predictive, and Prescriptive Analytics all along the Retail Value Chain: Inventory Optimization, Market Basket Analysis, Shopper Insight, Churn Analysis, Store Layout, et al. (Page 34)
What are examples of Software Certifications and Professional Certifications for Analytics?
•Software Certifications: IBM-Cognos & IBM Watson, Microsoft, Oracle, SAS, Tableau, Teradata, and others
•Professional Certifications: INFORMS Certified Analytics Professional (www.informs.org); Professional Certifications are NOT software-specific
1. What is an OLAP Cube (or Multidimensional Cube)? What ways can a cube be navigated?
•A Cube (or Multidimensional Cube) is the data repository (i.e., file) for OLAP
•A Cube contains a static (i.e., snapshot) copy of the data
•Cubes must be refreshed with new data in order to be current
•Cubes contain summarized totals, for faster analysis
•Navigating Cubes is faster than querying the data source (or DW)
•Cubes can be Sliced, Diced, Drilled-Down, Drilled-Up, Rolled-Up, and Pivoted
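A minimal sketch of cube-style navigation using a pandas pivot table on hypothetical sales data - the pivot table stands in for a cube of summarized totals that can be "sliced" by filtering and "pivoted" by swapping rows and columns:

import pandas as pd

sales = pd.DataFrame({
    "year":   [2023, 2023, 2024, 2024],
    "region": ["East", "West", "East", "West"],
    "amount": [100, 150, 120, 170],
})

cube = pd.pivot_table(sales, values="amount", index="region",
                      columns="year", aggfunc="sum")
print(cube)              # summarized totals by region and year
print(cube.loc["East"])  # "slice" for a single region
print(cube.T)            # "pivot": swap rows and columns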
What is the definition of Analytics? What are the 3 levels of Analytics?
•Analytics - the process of developing actionable decisions or recommendations for actions, based on insights gathered from historical data
•The term "Analytics" has replaced previous decision-support terminology, including "Business Intelligence"
•3 Levels of Analytics:
  1. Descriptive - What has happened / is happening?
  2. Predictive - What will happen?
  3. Prescriptive - What should we do?
1. What are Association Rules and why are they classified as Unsupervised Learning?
•Association Rules - determine "what goes with what" based on past patterns of behavior
•Also called Affinity Analysis and Market Basket Analysis
•Considered Unsupervised because no output variable is specified
•The result of the algorithm is a series of "IF . . . THEN" rules
•The Apriori algorithm is the most commonly used
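A minimal sketch (pure Python, hypothetical market-basket transactions) of the two measures behind an "IF {diapers} THEN {beer}" rule - support and confidence:

transactions = [
    {"beer", "diapers", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk", "beer"},
]

n = len(transactions)
has_diapers = [t for t in transactions if "diapers" in t]
has_both = [t for t in has_diapers if "beer" in t]

support = len(has_both) / n                    # how often both items appear together
confidence = len(has_both) / len(has_diapers)  # IF diapers, how often beer as well

print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.75 and 1.00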
1. What is Business Performance Management (BPM) and its 4 main processes?
•BPM (Business Performance Management) - business processes, methodologies, metrics, and technologies used to measure, monitor, and manage business performance
•4 Main Processes:
  1. Strategize - Where do we want to go?
  2. Plan - How do we get there?
  3. Monitor/Analyze - How are we doing?
  4. Act/Adjust - What do we need to do differently?
1. What is a Balanced Scorecard (BSC)? How does a BSC help an organization manage performance from 4 different perspectives?
•Balanced Scorecard - a performance measurement and management methodology that helps translate an organization's financial, customer, internal process, and learning and growth objectives into a set of actionable initiatives
•4 Perspectives:
  1. Customer
  2. Financial
  3. Learning and Growth
  4. Internal Business Process
1. What are the differences between the Bill Inmon and Ralph Kimball approaches to DW design?
•Bill Inmon
  •Hub-and-Spoke Architecture ("Top-Down")
  •Enterprise Data Warehouse (EDW): build the all-encompassing DW first, then spin off DMs
  •"Cargo-Ship" Approach
•Ralph Kimball
  •Data Mart Bus Architecture ("Bottom-Up")
  •Start small, then add components over time
  •Build DMs first, then connect them to form the DW
  •"Freight-Train" Approach
What is Business Intelligence (BI) and what is the overall process of BI?
•Business Intelligence (BI) is an umbrella term that combines architecture, tools, databases, analytical tools, applications, and methodologies.
•The overall process of BI is based on the transformation of data to information, then to decisions, then to actions.
1. What is CRISP-DM and what are the steps involved? What other DM methodologies are mentioned?
•CRISP-DM - Cross-Industry Standard Process for Data Mining
•CRISP-DM Steps:
  1. Business Understanding - know what the study is for
  2. Data Understanding - identify the most appropriate data for the study
  3. Data Preparation - prepare the data for analysis
  4. Model Building - select the most appropriate DM model for the study
  5. Test and Evaluation - assess the various models and pick the best one
  6. Deployment - apply the model to the problem and form recommendations
•Other DM methodologies mentioned: SEMMA (from SAS) and KDD (Knowledge Discovery in Databases)
Give examples for each of the following: Categorical Data, Nominal Data, Ordinal Data, Numeric Data, Interval Data, and Ratio Data.
•Categorical - text data
  •Nominal - labels that divide data into specific groups, with no rank order (e.g., Automobile Manufacturer = Ford, Chevy, Toyota, Honda, etc.)
  •Ordinal - labels that represent a rank order (e.g., Satisfaction = Not Satisfied, Somewhat Satisfied, Satisfied, Very Satisfied)
•Numeric - numeric data
  •Interval - variables measured on an interval scale, with equal distances between each value but no true zero point (e.g., Celsius temperature: 0⁰ C does not mean "no heat," so 100⁰ C is not "twice as hot" as 50⁰ C)
  •Ratio - same as Interval, except a "0" value means "zero amount" of something (e.g., a 0⁰ angle means "no angle," or 0 Kelvin means "no heat")
1. What is Classification? What is an example of Classification?
•Classification - machine-learning technique that learns patterns from historical data, and uses that learning to place new instances into their respective groups or classes
•Classification Algorithms: Bayes Classifier, Decision Trees, k-Nearest Neighbors, Neural Networks
•Examples: weather prediction, credit approval, target marketing
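A minimal sketch of classification, assuming scikit-learn is available (the credit-approval data below is hypothetical): a decision tree learns from labeled history, then assigns a class to a new applicant.

from sklearn.tree import DecisionTreeClassifier

# Each row is [income in $1000s, years at current job]
X_train = [[25, 1], [40, 3], [60, 5], [80, 10], [30, 2], [90, 12]]
y_train = ["Deny", "Deny", "Approve", "Approve", "Deny", "Approve"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

new_applicant = [[55, 4]]
print(model.predict(new_applicant))  # predicted class for the new instance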
1. What are Clustering techniques and why are they classified as Unsupervised Learning?
•Clustering - partition a collection of things into segments of natural groupings, where each group shares similar characteristics
•Considered Unsupervised because no output variable is specified (all data values are considered input)
•The natural groupings in the historical data determine how the items should be grouped (i.e., similar items are grouped together)
•k-Means is a common clustering algorithm
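A minimal sketch of k-Means with scikit-learn on hypothetical customer data - no output variable is given; the algorithm forms the groups on its own:

from sklearn.cluster import KMeans

# Each row is [annual spend in $, visits per year] -- all values are inputs
customers = [[100, 2], [120, 3], [110, 2], [900, 30], [950, 28], [880, 25]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "center" of each natural grouping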
1. What is a Confusion Matrix and how does it help evaluate the accuracy of a Classification model?
•Confusion Matrix - table used to evaluate the accuracy of a Classification model by comparing actual classes to predicted classes
•The cells on the diagonal from top left to bottom right are the "correct" classifications
•The off-diagonal cells (top right to bottom left) are the "incorrect" classifications
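A minimal sketch, assuming scikit-learn (the actual/predicted labels are hypothetical): counts on the main diagonal are the correct classifications.

from sklearn.metrics import confusion_matrix

y_actual    = ["Yes", "Yes", "No", "No", "Yes", "No"]
y_predicted = ["Yes", "No",  "No", "No", "Yes", "Yes"]

cm = confusion_matrix(y_actual, y_predicted, labels=["Yes", "No"])
print(cm)
# [[2 1]   2 "Yes" classified correctly, 1 "Yes" misclassified as "No"
#  [1 2]]  1 "No" misclassified as "Yes", 2 "No" classified correctly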
1. What is a Data Warehouse Administrator (DWA)? What skills does a DWA need?
•DWA - responsible for administration and management of a DW
•Skills needed:
  •Strong database and SQL skills
  •Dimensional Modeling skills
  •Familiarity with software, hardware, and networking technologies
  •Knowledge of data security
  •Knowledge of end-user analytics and decision-support tools
1. What is a Dashboard? What are the most common dashboard characteristics?
•Dashboards provide a visual display of important information that is consolidated and arranged on a single screen, so that information can be digested at a single glance and easily "drilled into" for more detail
•Dashboard characteristics:
  •Use visual components
  •Navigation is transparent to the user
  •Combine data from a variety of systems into a single view
  •Allow drill-down and drill-through to underlying detail
  •Present information in "real-time," or near real-time
  •Require little or no custom coding to implement, deploy, and maintain
What is the difference among data, information, and knowledge?
•Data - raw facts that are meaningless by themselves (e.g., names, numbers)
  •Can be small or large
  •Can be text, numeric, date, or other format (e.g., audio, video, image)
  •Can be structured, semi-structured, or unstructured
•Information - data that has meaning, or context (e.g., Customer Name, Annual Salary)
•Knowledge - information that can be used; information in action
  •Explicit Knowledge - easy to communicate, store, and distribute (e.g., the formula for calculating profit)
  •Tacit Knowledge - difficult to communicate, store, and distribute (e.g., personal experience)
What are other types (i.e., Domains) of Analytics that are not described in Chapter 1?
•Data Analytics
•Business Analytics
•Geographic/Location Analytics
•Real-Time Analytics
•Social Media Analytics
•Sentiment Analytics
•Web Analytics
•Text Analytics
•Big Data Analytics
1. What is Data Integration? How does Data Integration make organization-wide data available to users?
•Data Integration - allows data to be Extracted from multiple source DBs, Transformed into an analytics-ready form, and Loaded into a central Data Warehouse repository
•Data Integration typically involves the ETL (Extract, Transform, & Load) process and tools
•Data Integration brings together all of the organization's data (from various systems) into one central repository, making it available to users across the organization
1. What is a Data Lake (DL)? How is a DL different from a DW? How can a DL complement a DW?
•Data Lake - large data storage repository that can hold structured data (like a DW), but also semi-structured and unstructured data
•Data are typically stored in a Data Lake in their raw form (i.e., not stored in structured, relational tables)
•The ability to handle semi-structured and unstructured data makes DLs ideal for storing Social Media and "Big Data"
•So far, DLs have been a complement to DWs, rather than a replacement
1. What is a Data Mart (DM)? What is the difference between a Dependent DM and an Independent DM?
•Data Mart - smaller subset of a Data Warehouse that is typically focused on one subject area or department (e.g., a Sales DM)
•2 types of Data Marts:
  1. Dependent DM - a DM created directly from a larger Data Warehouse
  2. Independent DM - a smaller, standalone DM (not recommended, because independent DMs become isolated "data islands" or "stove-pipes")
1. How does Data Mining/Predictive Analytics differ from other types of Data Analytics?
•Data Mining uses mathematical modeling and algorithms
•Data Mining uses statistics and statistical concepts
•Data Mining uses Machine Learning to "learn" patterns in data and apply that learning to future data
•Many data mining algorithms mimic the learning process of the human brain (e.g., Neural Networks)
•Data Mining can detect trends and patterns in data that humans may not know exist (e.g., the "Beer and Diapers" story)
•Data Mining is sometimes referred to as "the unasked question"
What is a Data Warehouse (DW)?
•Data Warehouse - large repository of structured data (simple definition)
•More complex definition: a large, clean, standardized, integrated, time-stamped, and subject-oriented data repository
•The "Walmart of data!" The one-stop shop for all of an organization's (structured) data
1. According to William Inmon, what are the characteristics of a Data Warehouse (DW)?
•Data Warehouse - subject-oriented, integrated, time-variant, nonvolatile collection of data that support managerial decision making •Once again, the "Walmart of an Organization's Data"
What are the functions of a BI Competency Center?
•Demonstrates how BI is linked to Corporate Strategy
•Encourages interaction between Business Users and the IS Organization
•Serves as a repository of BI Best Practices
•Advocates and encourages BI Standards of Excellence throughout the company
•Can be a resource to the IS Organization on Business User needs
•Can help the IS Organization adapt the DW to changing business requirements
•Can help Executives and other Stakeholders recognize the value of BI in the company
1. Distinguish among Descriptive, Predictive, and Prescriptive Analytics. What are examples of each?
•Descriptive Analytics - What is happening now? What has happened?
  •Reporting, Dashboards, Scorecards, & Data Warehousing
•Predictive Analytics - What will happen? Why will it happen?
  •Data Mining, Text Mining, Web/Media Mining, & Forecasting
•Prescriptive Analytics - What should I do? Why should I do it?
  •Optimization, Simulation, Decision Modeling, & Expert Systems
1. What is Dimensional Modeling? How does it reorganize data into a user's view?
•Dimensional Modeling - a retrieval-based data querying system that supports high-volume, high-speed access to subsets of data
•Dimensional Modeling accomplishes 3 things:
  1. Makes data retrieval faster and more efficient
  2. Transforms operational data into a "business view" of the organization
  3. Allows analytics from an end-user, "business perspective"
•Dimensional Modeling is typically achieved by transforming the data into "Star Schemas" (i.e., star-shaped layouts)
1. What is Dispersion and what is the difference among Range, Variance, and Standard Deviation?
•Dispersion - mathematical methods used to estimate or describe the degree of variability among the values in a sample of observations
•Measures of Dispersion:
  •Range - difference between the largest and smallest value in a given data set
  •Variance - measure of deviation of all values in a data set from the Mean of the data set (σ²)
  •Standard Deviation - square root of the Variance; also measures deviation of a data set's values around the Mean (σ)
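A minimal sketch of the three measures using Python's statistics module on a hypothetical sample:

import statistics

values = [4, 8, 6, 5, 3, 7, 9]

data_range = max(values) - min(values)     # Range: largest minus smallest
variance   = statistics.pvariance(values)  # population variance (sigma squared)
std_dev    = statistics.pstdev(values)     # population standard deviation (sigma)

print(data_range, round(variance, 2), round(std_dev, 2))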
1. What is an Enterprise Data Warehouse (EDW)?
•EDW - large scale Data Warehouse that is used across an enterprise for decision support. EDWs encompass ALL of the organization's data. •"Walmart Super Center"
1. What is ETL and how is it used to populate and update a DW?
•ETL - Extract, Transform, and Load
•Extract - reading data from one or more source databases
•Transform - converting data from its original form into an analytics-ready form
•Load - copying the transformed data into a central repository (i.e., the Data Warehouse)
•ETL is often explained with the Laundromat analogy
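A minimal ETL sketch using pandas and SQLite (the source data and the "warehouse.db" table name are hypothetical): extract rows from a source, transform them into an analytics-ready form, and load them into a warehouse table.

import sqlite3
import pandas as pd

# Extract: read from a source (here a DataFrame stands in for a source system)
orders = pd.DataFrame({"order_date": ["2024-01-05", "2024-01-06"],
                       "amount_usd": ["100.50", "75.25"]})

# Transform: convert raw text values into proper dates and numbers
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["amount_usd"] = orders["amount_usd"].astype(float)

# Load: copy the transformed data into the central repository
warehouse = sqlite3.connect("warehouse.db")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()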
1. Explain an n-Tier Architecture. What makes up each tier?
•In an n-Tier Architecture, each server (or group of servers) is a separate tier
•All end-user (i.e., client) PCs are grouped together as one tier
•Example: a three-tier architecture with a client tier, an application-server tier, and a database-server tier
What factors have contributed to the growth of Analytics?
•Group communication and collaboration - collaboration from anywhere
•Improved data management - data everywhere and in every format
•Managing giant Data Warehouses and Big Data - more and more flexible data repositories
•Analytical support - more access to human expertise
•Overcoming cognitive limits in processing/storing information - technology expands the power of the human mind
•Knowledge Management - capture of human expertise (Explicit & Tacit knowledge)
•Anywhere, anytime support - 24/7 wireless connectivity
1. What are the most common software packages for Data Mining?
•IBM SPSS Modeler (formerly Clementine)
•SAS Enterprise Miner
•Dell Statistica
•RapidMiner
•XLMiner
1. What is In-Memory Data Storage? What advantages does In-Memory storage have over traditional data storage?
•In-Memory Storage / In-Memory Computing - data are kept in RAM instead of being stored on a physical drive
•Accessing data from RAM is faster than querying a traditional DB
•In-Memory tools also allow the data to be restructured before storing it in RAM (i.e., Dimensional Modeling)
•MS Power BI, Tableau, and Qlik all use In-Memory technology
1. What are the common mistakes made in Data Mining projects?
•Incorrect identification of the problem - not all problems need data mining
•Not managing expectations about data mining
•Beginning without an end result in mind
•Not having access to the necessary data
•Not properly preparing the data
•Not drilling down into detail/outliers
•More on page 240
1. What is a KPI? What are a KPI's distinguishing features?
•KPI - Key Performance Indicator
•A measure of performance against a strategic objective and goal
•KPI features:
  •Embody a strategic objective
  •Measure performance against specific targets
  •Have targets with performance ranges
  •Performance ranges are typically represented in software as Red, Yellow, Green ("traffic lights")
  •Have targets with assigned time frames
  •Targets are measured against baselines or benchmarks
1. What is Master Data Management? Why is it critical in modern organizations?
•Master Data Management - organization-wide, agreed-upon definitions for common variables and subject matters (e.g., What is a "customer"? What is a "student"?)
•Not having Master Data Management can lead to wrong answers, which ultimately leads to inaccurate and untimely decisions
•Organizations want "a single version of the truth"
List and define the most common Descriptive Statistics.
•Mean (a.k.a. Average) - the sum of all of the values in a sample divided by the number of observations in the sample
•Median - the middle value in a sorted order of the observations in a sample
•Mode - the most frequently occurring value in a sample of observations
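A minimal sketch using Python's statistics module on a hypothetical sample:

import statistics

values = [3, 7, 7, 2, 9, 4, 7]

print(statistics.mean(values))    # sum of values / number of observations
print(statistics.median(values))  # middle value of the sorted sample
print(statistics.mode(values))    # most frequently occurring value (7)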
1. What are example uses of Data Mining across various industries? Pg 204
•Medical/Healthcare - discover better diagnoses and treatments
•Pharmaceuticals - discover new and improved drugs
•Retail - identify customer buying patterns, and new customers/markets
•Banking/Credit Card - identify fraudulent activities
•Others?
1. What is Metadata? What are the two main forms?
•Metadata - "data about data"
•Additional information that describes individual data elements
•2 forms:
  1. Technical Metadata - information about processes, tools, repositories, and the "physical layer" of data (e.g., source tables and columns, run times, etc.)
  2. Business Metadata - business definitions; how the data elements are used in the business (e.g., business terms, business formulas, business process flows, etc.)
What is a Normal Distribution? What is the difference between Skewness and Kurtosis?
•Normal Distribution - data values that are perfectly symmetrical on both sides of the Mean, creating a Normal or "Bell" Curve
•Any data set can be "Normalized" (or Standardized) to fit a Normal Curve by computing the z (or Standard) score: z = (x − μ) / σ
•Skewness - measure of the degree to which a data set is offset (i.e., skewed) from the Mean of the data set
•Kurtosis - measure of the height of the peak of a Normal Curve
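A minimal sketch of standardizing one value with the z-score formula, z = (x − μ) / σ (the mean and standard deviation below are hypothetical):

mu, sigma = 70.0, 10.0  # hypothetical sample mean and standard deviation
x = 85.0                # an individual observation

z = (x - mu) / sigma
print(z)  # 1.5 -> the value lies 1.5 standard deviations above the mean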
1. What is an Operational Data Store (ODS)? How does an ODS differ from a DW?
•ODS - data repository that is typically used as an interim staging area for a Data Warehouse
•Used for short-term decisions (e.g., Daily or Weekly Reporting)
•Has very detailed information, but covers a short time frame (unlike a DW)
1. What is OLAP? How does OLAP differ from OLTP?
•OLAP - Online Analytical Processing
•A specialized way of storing data that allows the data to be analyzed in multiple dimensions
•Leverages the benefits of Dimensional Modeling
•The data storage file is called a Cube, or Multidimensional Cube
•OLTP (Online Transaction Processing) systems capture transactions from operational systems, whereas OLAP is read-only data, specifically structured for analysis
What is OLTP? What is the difference between OLTP systems and BI/Analytics systems?
•OLTP (Online Transaction Processing) - systems that handle day-to-day business operations for a company. OLTP systems are data-capturing systems.
•Analytics Systems - systems that allow data to be queried for analysis and decision-making. Analytics systems are data-retrieving systems.
•OLAP (Online Analytical Processing) systems are an example of Analytics Systems. OLAP involves Multidimensional Cubes.
1. What data types are typically analyzed in Data Mining? What are the differences among the data types?
•Quantitative Data - numeric data
•Qualitative Data
  •Nominal Data - has finite, nonordered values (e.g., Ford, Chevy, Toyota)
  •Ordinal Data - has finite, ordered values (e.g., Strongly Agree, Agree, Disagree, Strongly Disagree)
•Independent Variables (Input Variables, or Predictors) - varied or manipulated in an experiment
•Dependent Variables (Output Variables, or Results) - determined by the manipulation of the Independent Variable(s)
1. What is Regression and how does it differ from Correlation?
•Regression - statistical technique used to model the dependence of one variable (the Dependent or Output variable) on one or more other variables (the Independent or Input variables). Linear Regression allows prediction of the Dependent Variable (y), based on changing values of the Independent Variable (x).
•Correlation - measures the level of association between two variables, but does not assume that one variable's value is dependent on another (i.e., there is no Dependent or Independent Variable). Therefore, Correlation cannot be used for Prediction.
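A minimal sketch contrasting the two with NumPy on hypothetical data: correlation only measures the strength of the association, while the fitted regression line can be used to predict y for a new x.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)  # independent / input variable
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])     # dependent / output variable

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient (association only)
slope, intercept = np.polyfit(x, y, 1)  # simple linear regression fit

print(round(r, 3))            # strength of the association
print(slope * 6 + intercept)  # prediction of y when x = 6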
1. What is a Report? What are the most common types of reports?
•Report - any communication artifact prepared with the specific intent of conveying information in a digestible form to whoever needs it, whenever and wherever they need it
•Most common types of reports:
  •Metric Management - convey the state of KPIs (Key Performance Indicators)
  •Dashboard-Type - provide a range of KPIs on one screen
  •Balanced Scorecard-Type - provide KPIs within the context of agreed-upon thresholds of performance (e.g., Green, Yellow, Red "traffic lights")
1. How do Lift Charts/ROC Curves help evaluate Data Mining models?
•Lift Charts and ROC Curves show how effective a data mining model is at Classifying or Predicting
•The straight (diagonal) line represents the baseline, with no model being used
•The curved line represents the performance of the data mining model
•The larger the area between the two lines, the better the model's performance
•Often used to choose among various candidate models
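A minimal sketch, assuming scikit-learn (the labels and model scores below are hypothetical): the ROC curve plots the model's trade-off points, and the area under it (AUC) summarizes how far the curve lies above the diagonal "no model" baseline.

from sklearn.metrics import roc_curve, roc_auc_score

y_actual = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]  # model probabilities

fpr, tpr, thresholds = roc_curve(y_actual, y_scores)
print(list(zip(fpr, tpr)))                # points on the ROC curve
print(roc_auc_score(y_actual, y_scores))  # 0.5 = chance baseline, 1.0 = perfect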
1. What is Six Sigma? How many defects does Six Sigma allow per million opportunities (DPMO)?
•Six Sigma - a performance management methodology aimed at reducing the number of defects in a business process to as close to zero as possible
•DPMO - Defects Per Million Opportunities
•Six Sigma compliance allows 3.4 DPMO
1. What is a Star Schema? What are examples of each type of table in a Star Schema?
•Star Schema - the most commonly used and simplest style of Dimensional Modeling
•Separates Transactions from Non-Transactions
•Transactions are placed in a central Fact table, surrounded by multiple Dimension tables
•Allows business queries of the form: "Show me [Fact/Transaction], by [Dimension]"
•Resembles a Star shape
•Example tables: Fact table = Sales transactions; Dimension tables = Product, Customer, Store, Date
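A minimal sketch of a "show me [Fact], by [Dimension]" query using pandas (the fact and dimension tables below are hypothetical):

import pandas as pd

# Central Fact table: one row per sales transaction
fact_sales = pd.DataFrame({"product_key": [1, 2, 1, 3],
                           "amount":      [100, 250, 120, 80]})

# Dimension table: descriptive attributes about each product
dim_product = pd.DataFrame({"product_key": [1, 2, 3],
                            "category":    ["Bikes", "Bikes", "Accessories"]})

# Join the Fact table to its Dimension, then summarize the fact by the dimension
joined = fact_sales.merge(dim_product, on="product_key")
print(joined.groupby("category")["amount"].sum())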
1. What is Statistics? What are the two main categories of statistics?
•Statistics - a collection of mathematical techniques to characterize and interpret data
•Two main categories of Statistics:
  1. Descriptive Statistics - describe a sample of data
  2. Inferential Statistics - draw inferences or conclusions about a population of data by analyzing a sample of that data
What is the difference between structured and unstructured data? What are examples of each?
•Structured Data - data that have a specific, predetermined data type and format (usually tables, with rows and columns) for computers to easily understand and process. For example, a spreadsheet of payroll information.
•Unstructured/Semi-structured Data - data that do not have a specific, predetermined data type and format. Unstructured data can be stored in many formats, such as text, graphic images, audio, and video.
1. Who is credited with inventing the modern chart? What information did it show?
•William Playfair in 1786 •Showed debts owed to, and by, England from 1700 - 1780