Business Analytics Test 1 - Chp 3
What to look for in a dashboard
• They use visual components (e.g., charts, performance bars, sparklines, gauges, meters, stoplights) to highlight, at a glance, the data and exceptions that require action.
• They are transparent to the user, meaning that they require minimal training and are extremely easy to use.
• They combine data from a variety of systems into a single, summarized, unified view of the business.
• They enable drill-down or drill-through to underlying data sources or reports, providing more detail about the underlying comparative and evaluative context.
• They present a dynamic, real-world view with timely data refreshes, enabling the end user to stay up-to-date with any recent changes in the business.
• They require little, if any, customized coding to implement, deploy, and maintain.
Simple Taxonomy Data
[Data Analytics]
-Structured Data
  -Categorical: Nominal, Ordinal
  -Numerical: Interval, Ratio
-Unstructured/Semistructured Data
  -Textual
  -Multimedia: image, audio, video
  -XML/JSON
Emergence of Visual analytics
Magic Quadrant for BI and Analytics Platforms (axes: Ability to Execute vs. Completeness of Vision; quadrants: Challengers, Leaders, Niche Players, Visionaries)
-Companies in the Leaders and Visionaries quadrants are either relatively recently founded information visualization companies (e.g., Tableau Software, QlikTech) or well-established large analytics companies (e.g., Microsoft, SAS, IBM, SAP, MicroStrategy, Alteryx) that are increasingly focusing their efforts on information visualization and visual analytics.
Data richness
-This means that all required data elements are included in the data set. In essence, richness (or comprehensiveness) means that the available variables portray a rich enough dimensionality of the underlying subject matter for an accurate and worthy analytics study. It also means that the information content is complete (or near complete) to build a predictive and/or prescriptive analytics model.
Data consistency
-This means that the data are accurately collected and combined/merged.
-If the data integration/merging is not done properly, values belonging to different subjects could end up combined in the same record.
Data currency/data timeliness
-This means that the data should be up-to-date (or as recent/new as they need to be) for a given analytics model. It also means that the data are recorded at or near the time of the event or observation so that the time delay-related misrepresentation (incorrectly remembering and encoding) of the data is prevented.
Data relevancy
-This means that the variables in the data set are all relevant to the study being conducted. Relevancy is not a dichotomous measure (whether a variable is relevant or not); rather, it has a spectrum of relevancy from least relevant to most relevant. -Based on the analytics algorithms being used, one can choose to include only the most relevant information (i.e., variables) or, if the algorithm is capable enough to sort them out, can choose to include all the relevant ones regardless of their levels. -One thing that analytics studies should avoid is including totally irrelevant data as it can lead to inaccurate results
Data granularity
-This requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data. -attributes and values of data defined at correct level of detail for intended use (quality characteristic)
Data source reliability
-This term refers to the originality and appropriateness of the storage medium where the data are obtained—answering the question "Do we have the right confidence and belief in this data source?"
-One should always look for the original source/creator of the data to eliminate/mitigate the possibility of data misrepresentation and data transformation caused by mishandling of the data as they moved.
Data preprocessing tasks and subtasks
Main Task -> Subtasks
-Data consolidation: Access and collect the data; Select and filter the data; Integrate and unify the data
-Data cleaning: Handle missing values in the data; Identify and reduce noise in the data; Find and eliminate erroneous data
-Data transformation: Normalize the data; Discretize or aggregate the data; Construct new attributes
-Data reduction: Reduce number of attributes; Reduce number of records; Balance skewed data
time series
-A time series is a sequence of data points of the variable of interest, measured and represented at successive points in time spaced at uniform time intervals. Examples of time series include monthly rain volumes in a geographic area and the daily closing values of stock market indexes.
What is a Business Report?
-A written document that contains information regarding business matters.
-Purpose: to improve managerial decisions
-Source: data from inside and outside the organization (via the use of ETL)
-Format: text + tables + graphs/charts
-Distribution: in-print, email, portal/intranet
-Data acquisition -> Information generation -> Decision making -> Process management
Difference between Correlation and Regression
-Correlation gives an estimate of the degree of association between the variables; correlation is interested in the relationship between two individual variables.
-Regression attempts to describe the dependence of a response variable on one (or more) explanatory variables, where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. Regression is concerned with the relationship between all of the explanatory variables and the response variable.
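-A minimal sketch in Python (assuming NumPy and a small made-up data set, not from the text) contrasting the two ideas: correlation only quantifies association between two variables, while regression fits the dependence of the response on the explanatory variable.

```python
import numpy as np

# Hypothetical sample: advertising spend (x) and sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Correlation: a symmetric measure of association (no dependence assumed)
r = np.corrcoef(x, y)[0, 1]

# Simple regression: models y as a function of x (one-way dependence)
b1, b0 = np.polyfit(x, y, deg=1)  # slope (b1) and intercept (b0)

print(f"correlation r = {r:.3f}")
print(f"regression: y = {b0:.2f} + {b1:.2f}x")
```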
Performance Dashboards (cont.)
-Dashboard design: The fundamental challenge of dashboard design is to display all the required information on a single screen, clearly and without distraction, in a manner that can be assimilated quickly.
-According to Eckerson (2006), the most distinctive feature of a dashboard is its three layers of information:
1. Monitoring: Graphical, abstracted data to monitor key performance metrics.
2. Analysis: Summarized dimensional data to analyze the root cause of problems.
3. Management: Detailed operational data that identify what actions to take to resolve a problem.
Information Dashboards
-Dashboards: provide visual displays of important information that is consolidated and arranged on a single screen so that the information can be digested at a single glance and easily drilled in and further explored. -Performance dashboards are commonly used in BPM software suites and BI platforms
Types of business reports (3 major categories)
-Metric Management Reports: Help manage business performance through metrics (SLAs (service-level agreements) for externals; KPIs (key performance indicators) for internals). Can be used as part of Six Sigma and/or total quality management (TQM)
-Dashboard-Type Reports: Graphical presentation of several performance indicators in a single page using dials/gauges
-Balanced Scorecard-Type Reports: Include financial, customer, business process, and learning & growth indicators
Nature of Data
-Data (datum is the singular form of data): a collection of facts usually obtained as the result of experiences, observations, or experiments.
-Data are the source for any BI, data science, and business analytics initiative. They can be viewed as the raw material (source) for what decision technologies produce—information, insight, and knowledge.
-Data today are considered among the most valuable assets of an organization and can be used to create insight to better understand customers, competitors, and business processes.
-Data can be small or very large. They can be structured (nicely organized for computers to process) or unstructured (e.g., text that is created for humans and hence not readily understandable/consumable by computers). Data may consist of numbers, words, images, etc., and can come in small batches continuously or pour in all at once as a large batch.
-Data are the lowest level of abstraction (from which information and knowledge are derived).
-Data quality and data integrity are critical to analytics; a typical analytics continuum is data to analytics to actionable information.
Data security and data privacy
-Data security means that the data are secured to allow only those people who have the authority and the need to access them and to prevent anyone else from reaching them.
characteristics (metrics) that define readiness of data for analysis
-Data source reliability -Data content accuracy -Data accessibility -Data security and data privacy -Data richness -Data consistency -Data currency/data timeliness -Data granularity -Data validity and data relevancy
Data Visualization
-Data visualization (or more appropriately, information visualization) has been defined as "the use of visual representations to explore, make sense of, and communicate data" -Although the name that is commonly used is data visualization, usually what this means is information visualization. Because information is the aggregation, summarization, and contextualization of data (raw facts), what is portrayed in visualizations is the information, not the data. -Data visualization is closely related to the fields of information graphics, information visualization, scientific visualization, and statistical graphics.
How do we know if the model is good enough?
-For the numerical assessment, three statistical measures are often used in evaluating the fit of a regression model: R2 (R-squared), the overall F-test, and the root mean square error (RMSE).
-R2: The value of R2 ranges from 0 to 1 (corresponding to the amount of variability explained, in percentage terms), with 0 indicating that the relationship and the prediction power of the proposed model are not good, and 1 indicating that the proposed model is a perfect fit.
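-A minimal sketch in Python (assuming NumPy and hypothetical actual/predicted values, not from the text) of how R2 and RMSE could be computed from a model's predictions:

```python
import numpy as np

y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # observed values (hypothetical)
y_pred   = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])  # model predictions (hypothetical)

residuals = y_actual - y_pred
sse = np.sum(residuals ** 2)                     # sum of squared errors
sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares

r_squared = 1 - sse / sst                 # share of variability explained (0 to 1)
rmse = np.sqrt(np.mean(residuals ** 2))   # root mean square error

print(f"R^2 = {r_squared:.3f}, RMSE = {rmse:.3f}")
```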
Specialized Charts and Graphs
-Gantt chart: a special case of horizontal bar chart used to portray project timelines, project task/activity durations, and overlap among the tasks/activities. By showing start and end dates/times of tasks/activities and the overlapping relationships, Gantt charts provide an invaluable aid for the management and control of projects.
-PERT chart (also called a network diagram): developed primarily to simplify the planning and scheduling of large and complex projects. A PERT chart shows precedence relationships among project activities/tasks. It is composed of nodes (represented as circles or rectangles) and edges (represented with directed arrows).
-Geographic chart: When the data set includes any kind of location data (e.g., physical addresses, postal codes, state names or abbreviations, country names, latitude/longitude, or some type of custom geographic encoding), it is better and more informative to see the data on a map.
-Bullet graph: often used to show progress toward a goal. It compares a primary measure (e.g., year-to-date revenue) to one or more other measures (e.g., annual revenue target) and presents it in the context of defined performance metrics (e.g., sales quotas). A bullet graph can illustrate how the primary measure is performing against overall goals.
-Heat map: a great visual to illustrate the comparison of continuous values across two categories using color.
-Highlight table: takes the heat map one step further. In addition to showing how data intersect by using color, highlight tables add a number on top to provide additional detail. That is, they are two-dimensional tables with cells populated with numerical values and gradients of colors.
-Tree map: displays hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing subbranches. A leaf node's rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
Simple Regression versus Multiple Regression
-If the regression equation is built between one response variable and one explanatory variable, then it is called simple regression. -Multiple regression is the extension of simple regression when the explanatory variables are more than one.
descriptive statistics for descriptive analytics
-In business analytics, descriptive statistics plays a critical role—it allows us to understand and explain/present our data in a meaningful manner using aggregated numbers, data tables, or charts/graphs.
-Descriptive statistics helps us convert our numbers and symbols into meaningful representations for anyone to understand and use.
-Understanding these representations allows analytics professionals and data scientists to characterize and validate the data for other, more sophisticated analytics tasks. Descriptive statistics allows analysts to identify data concentration, unusually large or small values (i.e., outliers), and unexpectedly distributed data values for numeric variables.
Logistic Regression
-Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. Developed in the 1940s. -Logistic regression is similar to linear regression in that it also aims to regress to a mathematical function that explains the relationship between the response variable and the explanatory variables using a sample of past observations -Logistic regression differs from linear regression with one major point: its Output/Target variable is a binomial (binary classification) variable as opposed to a numerical variable. That is, whereas linear regression is used to estimate a continuous numerical variable, logistic regression is used to classify a categorical variable. -logistic regression models are used to develop probabilistic models between one or more explanatory/predictor variables (which can be a mix of both continuous and categorical in nature) and a class/response variable
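-A minimal sketch in Python (assuming scikit-learn and a tiny made-up hours-studied/pass-fail data set, not from the text) of fitting a logistic regression classifier for a binary response and obtaining a class probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) vs. pass/fail outcome (y = 1/0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Predicted probability of the positive class for a new observation (2.2 hours)
prob_pass = model.predict_proba([[2.2]])[0, 1]
print(f"P(pass | 2.2 hours) = {prob_pass:.2f}")
```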
Measures of Dispersion (measures of spread or decentrality)
-Range: the difference between the largest and the smallest values in a given data set; Range = Max - Min
-Variance: a method used to calculate the deviation of all data points in a given data set from the mean. The larger the variance, the more the data are spread out from the mean and the more variability one can observe in the data sample. Variance equation: S^2 = [ Σ(i=1..n) (Xi - X̄)^2 ] / (n - 1), where n is the number of observations, X̄ is the sample mean, and Xi is the i-th value in the data set.
-Standard Deviation: also a measure of the spread of values within a set of data; it is calculated by simply taking the square root of the variance. Equation: S = √( [ Σ(i=1..n) (Xi - X̄)^2 ] / (n - 1) )
-Mean Absolute Deviation (MAD): calculated by taking the absolute values of the differences between each data point and the mean, summing them, and dividing the sum by n. This provides a measure of spread without being specific about whether a data point is lower or higher than the mean.
-Quartile: a quarter of the number of data points in a data set. Quartiles are determined by first sorting the data and then splitting the sorted data into four disjoint smaller data sets. Quartiles are a useful measure of dispersion because they are much less affected by outliers or skewness in the data set.
-The box-and-whiskers plot (or box plot): a graphical illustration of several descriptive statistics about a given data set. It shows the centrality (the median, and sometimes also the mean) as well as the dispersion (the density of the data within the middle half—drawn as a box between the first and third quartiles). Reading the plot from bottom to top: the lower whisker marks the smallest value (excluding outliers), the bottom of the box is the lower (first) quartile (25% of the data fall below this point), the line inside the box is the median (the second quartile—50% of the data fall below it; the mean may also be marked), the top of the box is the upper (third) quartile, and the upper whisker marks the largest value (excluding outliers). Outliers are plotted beyond the whiskers.
-Histogram: a frequency bar chart used to show the frequency distribution of one or more variables. In a histogram, the x-axis often shows the categories or ranges, and the y-axis shows the measures/values/frequencies. Histograms show the distributional shape of the data.
-Skewness: a measure of asymmetry (sway) in a distribution of data that portrays a unimodal structure—only one peak exists in the distribution. Because the normal distribution is a perfectly symmetric unimodal distribution, it has no skewness; its skew is equal to zero.
-Kurtosis: another measure used to characterize the shape of a unimodal distribution; it focuses on the peaked/tall/skinny nature of the distribution. Specifically, it measures the degree to which a distribution is more or less peaked than the normal distribution, which has a kurtosis of 3.
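-A minimal sketch in Python (assuming NumPy and a small made-up sample) computing the dispersion measures listed above:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42], dtype=float)  # hypothetical sample

data_range = data.max() - data.min()              # Range = Max - Min
variance = data.var(ddof=1)                       # sample variance, divides by n - 1
std_dev = data.std(ddof=1)                        # sample standard deviation
mad = np.mean(np.abs(data - data.mean()))         # mean absolute deviation
q1, q2, q3 = np.percentile(data, [25, 50, 75])    # quartiles (q2 is the median)

print(data_range, variance, std_dev, mad, q1, q2, q3)
```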
Statistical Modeling for Business Analytics
-Statistics is usually considered part of descriptive analytics, and descriptive analytics has two main branches: statistics and online analytical processing (OLAP).
1. OLAP - the term used for analyzing, characterizing, and summarizing structured data stored in organizational databases (often stored in a data warehouse or a data mart) using cubes. This branch has also been called business intelligence.
2. Statistics - helps to characterize the data, either one variable at a time or all variables together, using either descriptive or inferential methods.
-Descriptive statistics - is all about describing the sample data on hand (describing the data as it is).
-Inferential statistics - is about drawing inferences or conclusions about the characteristics of a population based on a sample.
Data Quality
-The holistic quality of data, including their accuracy, precision, completeness, and relevance.
The Art and Science of Data Preprocessing
-Real-world data are dirty, misaligned, overly complex, and inaccurate, and are not ready for analytics as-is.
-A tedious and time-demanding process (so-called data preprocessing) is necessary to convert the raw real-world data into a well-refined form for analytics algorithms.
-The time spent on data preprocessing (perhaps the least enjoyable phase in the process) is significantly longer than the time spent on the rest of the analytics tasks.
Interval Data
-These are variables that can be measured on interval scales. there is not an absolute zero value, just a relative zero. Temperature is an example of interval data
Ordinal Data
-These contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high. Other examples include age group (child, young adult, middle-aged, elderly) or education level
Ratio Data
-These include measurement variables commonly found in the physical sciences and engineering. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales.
-The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind.
-The distinguishing feature of a ratio scale is the possession of a nonarbitrary zero value (an absolute zero), a point below which the data cannot possibly go. In the Kelvin temperature scale, 0 K is the lowest possible temperature (-273.15 °C), as matter at this temperature has no kinetic energy.
Categorical data
-These represent the labels of multiple classes used to divide a variable into specific groups. Examples of categorical variables include race, sex, age group, and educational level. -also called discrete data, implying that they represent a finite number of values with no continuum between them. Even if the values used for the categorical (or discrete) variables are numeric, these numbers are nothing more than symbols and do not imply the possibility of calculating fractional values. -This data can be subdivided into 2 groups: Nominal data and Ordinal data
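-A minimal sketch in Python (assuming pandas and made-up label values) of how nominal versus ordinal categorical variables might be represented, so that rank order is preserved only where it actually exists:

```python
import pandas as pd

# Nominal: labels with no inherent order (hypothetical marital status values)
marital = pd.Categorical(["single", "married", "divorced", "married"])

# Ordinal: labels with a rank order (hypothetical credit score bands)
credit = pd.Series(pd.Categorical(["low", "high", "medium", "low"],
                                  categories=["low", "medium", "high"],
                                  ordered=True))

print(marital.codes)               # numeric codes are just symbols for nominal data
print(credit.min(), credit.max())  # ordering is meaningful only for ordinal data
```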
Numeric Data
-These represent the numeric values of specific variables. -Numeric values representing a variable can be integers (whole numbers) or real (fractional numbers). The numeric data can also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values. -Can be subdivided into interval or ratio data
Data validity
-This is the term used to describe a match/mismatch between the actual and expected data values of a given variable. As part of data definition, the acceptable values or value ranges for each data element must be defined.
Time-Series Forecasting
-Time-series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values.
-Time-series plots/charts look and feel very similar to simple linear regression in that, as in simple linear regression, there are two variables: the response variable and the time variable, presented in a scatter plot. Beyond this appearance similarity, there is hardly any other commonality between the two. Whereas regression analysis is often employed in testing theories to see if current values of one or more explanatory variables explain (and hence predict) the response variable, time-series models are focused on extrapolating the time-varying behavior of the variable to estimate its future values.
-The most popular techniques are perhaps the averaging methods, which include the simple average, moving average, weighted moving average, and exponential smoothing.
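-A minimal sketch in Python (assuming pandas and a made-up monthly demand series) of the averaging-based methods named above: simple average, moving average, weighted moving average, and exponential smoothing.

```python
import pandas as pd

# Hypothetical monthly demand series
y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
              index=pd.period_range("2023-01", periods=10, freq="M"))

simple_avg = y.mean()                      # simple average of all observations
moving_avg = y.rolling(window=3).mean()    # 3-period moving average
weights = [0.2, 0.3, 0.5]                  # heavier weight on the most recent period
weighted_ma = sum(w * v for w, v in zip(weights, y.iloc[-3:]))
exp_smooth = y.ewm(alpha=0.3).mean()       # exponential smoothing with alpha = 0.3

# The last smoothed/averaged value can serve as a naive forecast for the next period
print(moving_avg.iloc[-1], weighted_ma, exp_smooth.iloc[-1])
```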
How do we develop linear regression models?
-To understand the relationship between two variables, the simplest thing that one can do is to draw a scatter plot where the y-axis represents the values of the response variable and the x-axis represents the values of the explanatory variable.
-The regression line tries to find the signature of a straight line passing right between the plotted dots (representing the observation/historical data) in such a way that it minimizes the distance between the dots and the line.
-Ordinary least squares (OLS) method: aims to minimize the sum of squared residuals (squared vertical distances between the observations and the regression line) and leads to a mathematical expression for the estimated value of the regression line.
-For simple linear regression, the aforementioned relationship between the response variable (y) and the explanatory variable (x) can be shown as a simple equation: y = b0 + b1x, where b0 is called the intercept and b1 is called the slope. The sign and the value of b1 also reveal the direction and the strength of the relationship between the two variables.
-Equation for multiple linear regression: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
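-A minimal sketch in Python (assuming statsmodels/NumPy and synthetic data, not from the text) of estimating the intercept and slopes with OLS; with two explanatory variables it corresponds to the multiple-regression equation above, and with one it reduces to simple regression:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two explanatory variables (x1, x2) and a response (y)
rng = np.random.default_rng(0)
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y = 1.5 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 0.3, 8)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept term b0
model = sm.OLS(y, X).fit()                      # ordinary least squares fit

print(model.params)    # estimated b0, b1, b2
print(model.rsquared)  # R^2 of the fitted model
```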
Types of Data for data taxonomy
-Unstructured data/semistructured data are composed of any combination of textual, imagery, voice, and Web content. -Structured data are what data mining algorithms use and can be classified as categorical or numeric.
Nominal Data
-contain measurements of simple codes assigned to objects as labels, which are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced. -can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad) or multinomial values having three or more possible values (e.g., brown/green/blue)
Unstructured data/semistructured data
-data types including textual, spatial, multimedia, imagery, video, and audio(voice), need to be converted into some form of categorical or numeric representation before they can be processed by analytics methods (data mining algorithms; Delen, 2015). -Data can also be classified as static or dynamic
Visual Analytics
-is the combination of visualization and predictive analytics. information visualization is aimed at answering "What happened?" and "What is happening?" and is closely associated with BI (routine reports, scorecards, and dashboards), visual analytics is aimed at answering "Why is it happening?" and "What is more likely to happen?" and is usually associated with business analytics (forecasting, segmentation, correlation analysis). -Visual analytics is a recently coined term that is often used loosely to mean nothing more than information visualization.
Correlation
-makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate on the degree of association between the variables.
Data content accuracy
-means that data are correct and are a good match for the analytics problem—answering the question of "Do we have the right data for the job?" -The data should represent what was intended or defined by the original source of the data.
Data accessibility
-means that the data are easily and readily obtainable— answering the question of "Can we easily get to the data when we need to?"
Regression Modeling for Inferential Statistics
-Regression is essentially a relatively simple statistical technique used to model the dependence of a variable (the response or output variable) on one (or more) explanatory (input, independent) variables. It is part of inferential statistics and the most widely known and used analytics technique in statistics. It can be used for hypothesis testing (explanation) and forecasting (prediction).
Business Reporting Definitions and Concepts
-Report: any communication artifact prepared with the specific intention of conveying information. Report = Information => Decision
-Functions of reports:
1. To ensure proper departmental functioning
2. To provide information
3. To provide the results of an analysis
4. To persuade others to act
5. To create an organizational memory
-Business reporting (also called OLAP or BI) is an essential part of the larger drive toward improved, evidence-based, optimal managerial decision making.
-The foundation of these business reports is various sources of data coming from both inside and outside the organization.
-Creation of these reports involves extract, transform, and load (ETL) procedures in coordination with a data warehouse and then using one or more reporting tools.
-Keys to any successful report are clarity, brevity, completeness, and correctness.
Data Preprocessing steps, tasks, and popular methods
1. Data consolidation - relevant data are collected from the identified sources, the necessary records and variables are selected (based on an intimate understanding of the data, the unnecessary information is filtered out), and the records coming from multiple data sources are integrated/merged.
2. Data cleaning (data scrubbing)
-Handle missing values in the data: fill in missing values (imputation) with the most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as "ML"; remove the record with the missing value; or do nothing.
-Identify and reduce noise in the data: identify outliers with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
-Find and eliminate erroneous data: identify erroneous values in the data (other than outliers), such as odd values, inconsistent class labels, and odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
3. Data transformation - the data are transformed for better processing.
-Normalize the data: reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques.
-Discretize or aggregate the data: if needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
-Construct new attributes: derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
4. Data reduction - too much data can be a problem. In the simplest sense, the data commonly used in predictive analytics projects can be visualized as a flat file consisting of two dimensions: variables (the number of columns) and cases/records (the number of rows). It is generally harder to reduce the number of rows because there are usually far more records than variables (columns).
-Reduce number of attributes: use principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
-Reduce number of records: perform random sampling, stratified sampling, or expert knowledge-driven purposeful sampling.
-Balance skewed data: oversample the less represented or undersample the more represented classes.
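-A minimal sketch in Python (assuming pandas/NumPy and a small made-up raw data set) of a few of the subtasks above: mean imputation for missing values, smoothing of extreme values, and min-max normalization.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and an extreme value
df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 41.0, 38.0, 290.0],
                   "income": [48000.0, 52000.0, 61000.0, np.nan, 58000.0, 75000.0]})

# Data cleaning: impute missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Data cleaning: smooth extreme values by capping them at mean +/- 3 standard deviations
means, stds = df.mean(), df.std()
df = df.clip(lower=means - 3 * stds, upper=means + 3 * stds, axis=1)

# Data transformation: min-max normalization to the 0-1 range
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```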
Regression modeling assumptions
1. Linearity - This assumption states that the relationship between the response (dependent) variable and the explanatory (independent) variables is linear. That is, the expected value of the response variable is a straight-line function of each explanatory variable while holding all other explanatory variables fixed. Also, the slope of the line does not depend on the values of the other variables.
2. Independence (of errors) - This assumption states that the errors of the response variable are uncorrelated with each other. This independence of the errors is weaker than actual statistical independence, which is a stronger condition and is often not needed for linear regression analysis.
3. Normality (of errors) - This assumption states that the errors of the response variable are normally distributed. That is, they are supposed to be totally random and should not represent any nonrandom patterns.
4. Constant variance (of errors) - This assumption, also called homoscedasticity, states that the response variables have the same variance in their error regardless of the values of the explanatory variables. In practice, this assumption is invalid if the response variable varies over a wide enough range/scale.
5. Multicollinearity - This assumption states that the explanatory variables are not correlated (i.e., do not replicate the same but provide a different perspective of the information needed for the model). Multicollinearity can be triggered by having two or more perfectly correlated explanatory variables presented to the model.
-There are statistical techniques developed to identify the violation of these assumptions and techniques to mitigate them. The most important part for a modeler is to be aware of their existence and to put in place the means to assess the models to make sure that they are compliant with the assumptions they are built on.
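-A minimal sketch in Python (assuming statsmodels and synthetic predictors, not from the text) of one common diagnostic for the multicollinearity assumption: variance inflation factors (VIF), where values well above roughly 5-10 usually signal correlated predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1 (collinear)
x3 = rng.normal(size=100)                        # unrelated predictor

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)  # expect large VIFs for x1 and x2, and a value near 1 for x3
```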
Measures of Central tendency (represents center point or typical value of a dataset)
1. Mean - the arithmetic mean (or simply mean or average) is the sum of all the values/observations divided by the number of observations in the data set. It is the most popular measure of central tendency and is used with continuous or discrete numeric data.
2. Median - the measure of the center value in a given data set. It is the number in the middle of a given set of data that has been arranged/sorted in order of magnitude (either ascending or descending). The median is meaningful and calculable for ratio, interval, and ordinal data types.
3. Mode - the observation that occurs most frequently (the most frequent value in the data set). It is most useful for data sets with a relatively small number of unique values and can be useless if the data have too many unique values. It is not a very good representation of centrality and therefore should not be used as the only measure of central tendency for a given data set.
-Use the mean when the data are not prone to outliers and there is no significant level of skewness; use the median when the data have outliers and/or are ordinal in nature; use the mode when the data are nominal. Perhaps the best practice is to use all three together so that the central tendency can be captured and represented from three perspectives.
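-A minimal sketch in Python (assuming pandas and a hypothetical sample containing an outlier) showing mean, median, and mode, and why the median can be the better choice when outliers are present:

```python
import pandas as pd

values = pd.Series([35, 38, 40, 41, 41, 43, 45, 500])  # hypothetical data with an outlier

print("mean:  ", values.mean())          # pulled upward by the outlier (500)
print("median:", values.median())        # robust to the outlier
print("mode:  ", values.mode().iloc[0])  # most frequent value (41)
```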
Story best practices
1. Think of your analysis as a story—use a story structure. 2. Be authentic—your story will flow. 3. Be visual—think of yourself as a film editor. 4. Make it easy for your audience and you. 5. Invite and direct discussion.
Dispersion
Degree of variation in a given variable. One of the main reasons the dispersion/spread of data values is important is that it gives us a framework within which we can judge the central tendency: it indicates how well the mean (or other centrality measures) represents the sample data.
Preparing Data
Predictive algorithms generally require a flat file with a target variable, so making data analytics ready for prediction means that data sets must be transformed into a flat-file format and made ready for ingestion into those predictive algorithms.
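-A minimal sketch in Python (assuming pandas and made-up column names, not from the text) of the flat-file layout most predictive algorithms expect: one row per case, one column per variable, plus a target column.

```python
import pandas as pd

# Hypothetical analytics-ready flat file: rows are cases, columns are variables,
# and "churned" is the target variable the predictive algorithm will learn.
flat = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "tenure_months": [12, 3, 48],
    "monthly_charges": [70.5, 29.9, 99.0],
    "churned": [0, 1, 0],  # target variable
})

X = flat.drop(columns=["customer_id", "churned"])  # explanatory variables
y = flat["churned"]                                # target variable
```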
Process flow for developing regression models (goes from top to bottom)
1. Tabulated Data
2. Data Assessment - scatter plot, correlations
3. Model Fitting - transform data, estimate parameters
4. Model Assessment - test assumptions, assess model fit
5. Deployment - one-time use, recurrent use
Best practices in Dashboard design
•Benchmark Key Performance Indicators (KPIs) with Industry Standards
•Wrap the Metrics with Contextual Metadata - make sure the following questions are addressed:
-Where did you source these data?
-While loading the data warehouse, what percentage of the data was rejected or encountered data quality problems?
-Is the dashboard presenting "fresh" information or "stale" information?
-When was the data warehouse last refreshed?
-When is it going to be refreshed next?
-Were any high-value transactions that would skew the overall trends rejected as a part of the loading process?
•Validate the Design by a Usability Specialist - up-front validation of the dashboard design by a usability specialist can mitigate this risk.
•Prioritize and Rank Alerts and Exceptions - because there are tons of raw data, having a mechanism by which important exceptions/behaviors are proactively pushed to the information consumers is important. A business rule can be codified to detect the alert pattern of interest.
•Enrich the Dashboard with Business-User Comments - when the same dashboard information is presented to multiple business users, a small text box can be provided to capture comments from an end user's perspective.
•Present Information in Three Different Levels - information can be presented in three layers depending on its granularity: the visual dashboard level, the static report level, and the self-service cube level.
•Pick the Right Visual Constructs - pick visuals that best display/fit the data you are trying to present.
•Provide for Guided Analytics - the capability of the dashboard can be used to guide the "average" business user to access the same navigational path as that of an analytically savvy business user.
History of Data Visualization
•Data visualization dates back to the second century AD
•Most developments have occurred in the last two and a half centuries
•Until recently it was not recognized as a discipline
•Today's most popular visual forms date back a few centuries
•William Playfair is widely credited as the inventor of the modern chart, having created the first line and pie charts.
•Charles Joseph Minard's Decimation of Napoleon's Army During the 1812 Russian Campaign is arguably the most popular multidimensional chart
•The future of data/information visualization is very hard to predict. We can only extrapolate from what has already been invented: more three-dimensional visualization, immersive experience with multidimensional data in a virtual reality environment, and holographic visualization of information.
Emergence of Data visualization and Visual analytics
•Emergence of new companies - Tableau, Spotfire, QlikView, ...
•Increased focus by the big players:
-MicroStrategy improved Visual Insight
-SAP launched Visual Intelligence
-SAS launched Visual Analytics
-Microsoft bolstered PowerPivot with Power View
-IBM launched Cognos Insight
-Oracle acquired Endeca