ACC307 Chapter 7- Data Analytics and Presentation
Nominal/ Categorical
Categorize (equals and does not equal): Yes. Order (greater than, less than, equal): No. Calculation: No. Permissible statistics: mode. More examples: ZIP code, telephone number, eye color. Discrete.
Ordinal
Categorize: Yes. Order: Yes. Calculation: No. Permissible statistics: mode, median, percentile. More examples: education level, letter grade. Discrete.
Interval
Categorize: Yes. Order: Yes. Calculation: Some. Permissible statistics: mode, median, percentile, mean, st. deviation, correlation, regression. More examples: calendar date, calendar year. Continuous.
Ratio
Categorize: Yes. Order: Yes. Calculation: All. Permissible statistics: mode, median, percentile, mean, st. dev., correlation, regression, and others. More examples: age, height, etc. Continuous.
Interval Data
Continuous data without a meaningful zero point. Can be categorized, ordered, and calculated in most ways. Examples: temperature, Likert scale, SAT score, GMAT score, credit score. The distance between two consecutive data points is the same, so values can be added and subtracted, but without a meaningful zero point they cannot be multiplied or divided.
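A minimal Python sketch of why interval data supports differences but not ratios, using temperature as on the card. The function name c_to_f and the sample temperatures are illustrative assumptions.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32

# Ratios depend on the arbitrary zero point, so they are not meaningful:
ratio_c = 20 / 10                  # looks like "twice as hot" in Celsius
ratio_f = c_to_f(20) / c_to_f(10)  # same temperatures, different ratio

# Differences stay consistent: equal steps in Celsius are equal steps in Fahrenheit.
step_1 = c_to_f(20) - c_to_f(10)
step_2 = c_to_f(30) - c_to_f(20)

print(ratio_c, ratio_f)
print(step_1, step_2)
```

The two ratios disagree while the two steps agree, which is the practical meaning of "cannot multiply or divide".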
Ratio Data
Continuous data with a meaningful zero point. Can be categorized, ordered, and calculated in all manners. Examples: age, price, size, length.
Outliers
(Descriptive analytics; a type of data structure) Definition: observations that are radically different from the rest. There is no standard definition. They result from measurement error, data entry error, or other causes. The main purpose of identifying them is to "call attention to data values that require additional review," not to represent the data. Why identify outliers? -they might be errors/ fraud -they might be more interesting/ useful -statistical assumptions: some statistical methods are sensitive to outliers.
Nominal (Categorical) Variables
Discrete and categorical data. Can be categorized or counted; cannot be ordered or calculated. Examples: brands, jobs, job titles, gender, cities, ID numbers. Numeric "coding" (HSV = 1, Dec = 2) is not a measurement; the numbers themselves are meaningless. Special case of nominal: binary/ dichotomous (only two levels of data value), which supports a proportion or probability. With 1 = Male and 0 = Female, a mean of 0.5 means 50% male and 50% female.
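The binary special case can be sketched in Python: when a two-level variable is coded 0/1, its mean equals the proportion of observations coded 1. The sample values below are made up for illustration.

```python
from statistics import mean

# Hypothetical sample using the card's coding: 1 = Male, 0 = Female
gender = [1, 0, 1, 0, 0, 1, 1, 0]

# The mean of a 0/1 variable is the proportion of 1s
proportion_male = mean(gender)
print(proportion_male)
```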
Ordinal Variables
Discrete and ranked data. Can be categorized and ordered; cannot be calculated. Examples: car preferences (Tesla > Lexus > Toyota), age group ("How old are you?"), education, income group. The distance between two consecutive data points is not the same.
continuous
Interval & Ratio
discrete
Nominal and Ordinal
Diagnostic Analytics
a large manufacturer of farm equipment continuously analyzes data sent from engine sensors to understand how load, temperature, and other factors influence engine failure
Predictive Analytics
a shipyard company runs a computer simulation of how a tsunami would damage its shipyards, computing damages in terms of destruction and lost production time.
descriptive analytics
a small tax services business provides its financial statements to a bank to get a loan so it can buy a new building to grow its business.
bullet graph
adds a "bullet" or small line by each bar that indicates an important benchmark. linking data only when a logical relationship exists
visualization (AKA "Vis" Visual Analytics/ Infographics)
any static or dynamic representation of data -visualized data is processed faster than written or tabular information -visualizations are easier to use -visualization supports the dominant learning style of the population because most learners are visual learners. Goals for visual communication: decorative, indicative, informative
measures
are numerical values of metrics ex: 200 lbs BMI- 20.0
Pie of Pie Chart
comparing 6-10 components
Bar of Pie Chart
comparing 6-15 components
ratio
continuous data with a meaningful zero point
interval
continuous data without a meaningful zero point
discrete metrics
countable data gender (F/M) On time or not on time (Y/N) Number of on time deliveries
Nominal (categorical)
discrete and categorical data
ordinal
discrete and ranked data
histogram and boxplot
distribution (spread)
impression management
earnings management; the process through which financial information preparers try to control or manipulate the impressions other people form of them through disclosures and financial graphs
orientation
easier to understand if oriented appropriately change chart orientation when labels are inevitably long present or sort data meaningfully
bar charts
for one series with a few periods. Grouped bar charts: only when "within group" comparisons are important; otherwise use line charts
line charts
for time series with many periods for multiple series
Quantity
Goldilocks principle of containing not too much and not too little, but just the right amount of data -informative titles -succinct but sufficient legends and axes -avoid information overload: divide complicated graphs into separate ones, adjust the granularity/ quantity of information -gridlines or data values whenever appropriate -multiple formats ("combo chart") only when necessary -use 3D only when necessary, that is, when the third dimension carries information
funnel charts
highlight the orders
distance
how far apart related information is presented removing distance aids in understanding and removes other unnecessary information
distribution
how often values in the data occur - choose approp. stat. methods -common distributions -Benford's Law
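A minimal Python sketch of a Benford's Law check, assuming the usual statement of the law: the expected share of leading digit d is log10(1 + 1/d). The amounts list and helper names are hypothetical, and a real test would need far more observations.

```python
import math

# Expected share of each leading digit 1..9 under Benford's Law
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """Return the first nonzero digit of a nonzero number."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation starts with that digit

def first_digit_shares(values):
    """Observed share of each leading digit among the nonzero values."""
    nonzero = [v for v in values if v != 0]
    counts = {d: 0 for d in range(1, 10)}
    for v in nonzero:
        counts[leading_digit(v)] += 1
    return {d: counts[d] / len(nonzero) for d in counts}

# Hypothetical transaction amounts, for illustration only
amounts = [1200, 1850, 230, 310, 4500, 118, 96, 1430, 720, 2999]
observed = first_digit_shares(amounts)

for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d], 3))
```

Large gaps between the expected and observed columns are a reason to look closer, not proof of a problem.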
map chart
ideal to present geographic information
Metric
is used to quantify performance ex: lb/ kg/ EPS/ BMI
continuous metrics
measured on a continuum delivery time package weight purchase price
bar chart and bullet chart
numeric and categorical
scatterplot and heatmap
numeric and categorical
doughnut chart
occasionally used to compare two pie charts. Not as good as 100% stacked column/ bar charts
pie chart and treemap
part to whole proportion
radar chart
profile 2-6 variables
indicative visual
purpose: To provoke action examples: colors, animations, sounds principles: separate, divide, distance, contrast
decorative visual
purpose: to evoke feelings examples: shapes, symbols, colors, fonts Principles: do not interfere with the clarity of other elements
informative visual
purpose: to promote understanding examples: headings, titles, subtitles, highlights, summary, executive summary principles: Clear, filter out all but relevant details represent generalizations better than specifics
waterfall charts
reconcile changes over time/ between numbers
scatter plots
show relations between/ among data between two continuous variables
stacked area chart
show the contribution of each set to the total (i.e. values)
100% stacked area chart
show the proportion of each set to the total
measurement
the act of obtaining data associated with a metric
weighting
the amount of attention an element attracts
causation
the criteria to establish causality/ causation, or fundamental criteria to judge if a theory fits observations -covariation: association does not imply causation -reciprocal causality: A causes B, B causes A -absence of plausible rival hypotheses
ordering
the intentional arranging of visualization items to produce emphasis. present or sort data meaningfully
standard deviation
the square root of the variance -easier to interpret than the variance -a popular measure of risk or uncertainty -influenced by outliers? Yes. Z-score example: mean 60, score 90, st. dev. 15: (90 - 60) / 15 = 2
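The z-score arithmetic on this card can be reproduced in Python. The scores list is a made-up sample chosen so that its mean is 60 and its population standard deviation is 15; pstdev and pvariance are the population versions in the standard statistics module.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical scores with mean 60 and population st. dev. 15
scores = [45, 75, 45, 75]

avg = mean(scores)
variance = pvariance(scores)
st_dev = pstdev(scores)        # square root of the variance

def z_score(value, mean_value, st_dev_value):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean_value) / st_dev_value

z = z_score(90, avg, st_dev)
print(avg, variance, st_dev, z)
```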
correlation
to determine the size, direction and strength of relationships between variables
box plot or box & whisker plot
to show more information of numeric variables
histogram
to show the distribution of a numeric variable
Line Chart and Area Chart
trend Evaluation Changes over time
Column Chart/ Rotated Bar Chart/ Horizontal Bar Chart
- categorical data variable on the y-axis and numeric data on the x- axis -ideal when long labels -Excel calls these "Bar Chart"
Predictive Analytics
-Goes beyond examining the past to answer the question, "What is likely to happen in the future?" -Build on Descriptive and Diagnostic Analytics -Basic assumption- History Repeats itself!
Prescriptive Analytics
An online retail company tracks past customer purchases. Based on the amount customers previously spent, the program automatically computes purchase discounts for current customer purchases to build loyalty.
bubble chart
show by size, similar to scatterplot
100% stacked bar/ Area Charts
using less space to tell better stories
variance
the mean of the squared deviations of a variable from its expected value or mean -has several statistical advantages over other measures -influenced by outliers -frequently used; it is the square of the standard deviation
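A short Python sketch of the definition above, computing the population variance by hand and showing how a single extreme value inflates it. The data values are illustrative assumptions.

```python
def variance(values):
    """Population variance: the mean of squared deviations from the mean."""
    avg = sum(values) / len(values)
    return sum((x - avg) ** 2 for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample
with_outlier = data + [50]           # same sample plus one extreme value

print(variance(data))
print(variance(with_outlier))
```

Because deviations are squared, the one extreme value dominates the second result.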
data deception
a graphical depiction of information designed with or without an intent to deceive, that may create a belief about the message or its components, which varies from the actual message. proportional trend left to right completeness: present complete data given the context.
data mining
a process of discovering patterns involving methods at the intersection of machine learning, statistics, and database systems.
predictive analytics
an airline downloads weather data for the past 10 years to help build a model that will estimate future fuel usage for flights.
prescriptive analytics
an all you can eat restaurant uses automated conveyor belts to bring cold food to the chefs for preparation. The conveyor belts bring the food to the chefs based on algorithms that monitor the number of people entering and leaving the restaurant.
Machine Learning
an application of artificial intelligence that allows computer systems to improve and update prediction models without explicit programming
2D Pie Chart:
comparing 2-5 components
highlighting
using colors, contrasts, callouts, labeling, fonts, arrows, and other devices that bring attention to an item -highlight meaningfully -use colors carefully: within a culture, colors can have natural meanings -use monochrome patterns for color-blind readers -gradients are used to indicate progressions from low to high, whereas distinct colors represent categories
Diagnostic Analytics
"Backward looking analytics" Build upon descriptive analytics to determine causal relationships why did this happen? more contextual information, hypothesis testing
Descriptive Analytics
"Backward looking": focuses on the past and examines data to understand the past. Questions: What happened? What is happening? Examples: financial statements, scatterplots, correlations. Basic and frequently used.
Prescriptive Analytics
"Forward Looking analytics" Provide a recommendation of what should happen what should be done? Find the optimum solutions ex: highest profit/ lowest cost/ lowest risks/ highest return
Predictive Analytics
"Forward looking analytics": applies assumptions and focuses on predicting the future. Question: What might happen in the future? Techniques: regression, time-series analysis, data mining, alternatives.
Pie Charts
(Aka Circle Chart) a circular statistical graphic divided into slices to illustrate numerical proportion. explode a pie chart only when necessary
mode
(Descriptive Analytics) -The most frequently occurring value in the sample -The only descriptive stats for variables with NOMINAL scale. - Not influenced by Outliers
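Since the mode only needs counting, it works on nominal labels. A small Python illustration using the standard statistics module; the eye-color sample is made up.

```python
from statistics import mode

# Hypothetical nominal data: labels can be counted but not ordered or averaged
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

most_common = mode(eye_colors)
print(most_common)
```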
Percentile
(Descriptive analytics) -value of a variable below which a certain percent of observations fall -Median = 50th percentile, Min = 0th percentile, Max = 100th percentile
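A minimal Python sketch of one way to compute a percentile. Several conventions exist; this one assumes the nearest-rank method, and the function name and data are illustrative only.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest value with at least p percent of observations at or below it."""
    ordered = sorted(values)
    if p <= 0:
        return ordered[0]                        # 0th percentile is the minimum
    rank = math.ceil(p / 100 * len(ordered))     # 1-based position in the sorted list
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]                      # hypothetical observations

print(percentile_nearest_rank(data, 0))
print(percentile_nearest_rank(data, 50))
print(percentile_nearest_rank(data, 100))
```

Spreadsheet functions typically interpolate between observations, so their results can differ slightly from this method.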
overfitting
(Predictive Analytics) a model overly captures random errors or noises, instead of describing underlying relationships
Median
(descriptive Analytics) -The numerical value separating the higher half of a dataset from the lower half (50% above, 50% below, 50th percentile) -For Ratio, Interval, and Ordinal Variables Not influenced by outliers
range
(descriptive analytics) -the difference between the maximum value and the minimum value in the dataset. -influenced by outliers
Quartile
(descriptive analytics) the three points that divide the data set into four equal groups -1st quartile: 25th percentile -2nd quartile: median/ 50th percentile -3rd quartile: 75th percentile -not influenced by outliers
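The three quartile cut points sit at the 25th, 50th, and 75th percentiles. A short Python sketch using statistics.quantiles (Python 3.8 or later) with its default interpolation method; the data is a made-up sample, and other tools may use slightly different interpolation rules.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7]            # hypothetical observations

# n=4 asks for the three cut points that split the data into four groups
q1, q2, q3 = quantiles(data, n=4)

print(q1, q2, q3)
print(q2 == median(data))               # the 2nd quartile is the median
```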
correlation
(descriptive analytics) any statistical relationship between two random variables or bivariate data common measure- pearson correlation coefficient a measure of the linear association between two variables
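A hand-written Python sketch of the Pearson correlation coefficient, which ranges from -1 to +1. The function name and the two small series are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation scaled by both variables' spread."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (spread_x * spread_y)

# Hypothetical data: the second series rises in lockstep with the first
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]

r_positive = pearson_r(ad_spend, sales)          # close to +1
r_negative = pearson_r(ad_spend, sales[::-1])    # close to -1
print(r_positive, r_negative)
```

The sign gives the direction and the absolute value gives the strength of the linear association.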
mean
(descriptive analytics) the average amount -the sum of the observations divided by the number of observations - For Interval and Ratio variables -Influenced by the values of outliers
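A brief Python comparison of how one extreme value moves the mean much more than the median. The salary figures (in thousands) are hypothetical.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]          # hypothetical, in thousands
with_outlier = salaries + [500]          # add one extreme observation

mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print(mean_shift)     # large change
print(median_shift)   # small change
```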
confirmatory modeling
(predictive analytics) -to fit historical data closely -the entire dataset is used for estimating the best-fit model, to maximize the amount of information that we have about the hypothesized relationship in the population -might overfit: captures all relationships in the historical dataset, including non-recurring events.
predictive modeling
(predictive Analytics) to best predict the future partitioned datasets are used, where training dataset is used to estimate the model and validation dataset to assess this model's performance on new, unobserved data.
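A minimal Python sketch of the partitioning idea, assuming a simple one-variable least-squares line as the model and a 70/30 split. The data is simulated, and every name here (fit_line, mean_abs_error, history) is hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of the fitted line."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Simulated history: y is roughly 3x + 10 plus random noise
history = [(x, 3 * x + 10 + rng.gauss(0, 2)) for x in range(40)]
rng.shuffle(history)

split = len(history) * 7 // 10            # 70% training, 30% validation
train, valid = history[:split], history[split:]

model = fit_line(*zip(*train))            # estimate on training data only
train_error = mean_abs_error(model, *zip(*train))
valid_error = mean_abs_error(model, *zip(*valid))  # performance on unseen data
print(model, train_error, valid_error)
```

A validation error far above the training error would suggest the model is fitting noise rather than the underlying relationship.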
Diagnostic Analytics
- goes beyond examining "what happened" to answer the question "Why did this happen" -build on Descriptive Analytics using logic and basic tests to reveal relationships and explain historical events or associations -can be formal or informal- Hypothesis Testing
Prescriptive Analytics
-Offers recommendations to take or programmed actions, just like doctors recommend a substance or action -utilizes artificial intelligence, machine learning, and other statistical methods to make predictions. Common techniques: linear programming, an optimization technique for a system of linear constraints and a linear objective function. Example: self-driving cars
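A small Python sketch of linear programming for two decision variables. It assumes the feasible region is bounded, so the best solution lies at a corner where two constraint lines cross, and it simply checks every corner. The product-mix numbers are a textbook-style illustration, not from the chapter; real problems would normally use a solver.

```python
from itertools import combinations

# Each constraint (a, b, c) means a*x + b*y <= c
constraints = [
    (1, 0, 4),     # x <= 4
    (0, 2, 12),    # 2y <= 12
    (3, 2, 18),    # 3x + 2y <= 18
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]
profit = (3, 5)    # objective: maximize 3x + 5y

def intersection(c1, c2):
    """Point where two constraint lines cross, or None if they are parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(point, tol=1e-9):
    """True if the point satisfies every constraint."""
    x, y = point
    return all(a * x + b * y <= c + tol for a, b, c in constraints)

def value(point):
    return profit[0] * point[0] + profit[1] * point[1]

corners = []
for c1, c2 in combinations(constraints, 2):
    point = intersection(c1, c2)
    if point is not None and feasible(point):
        corners.append(point)

best = max(corners, key=value)
best_value = value(best)
print(best, best_value)
```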
Predictive Analytics
-Recurring Events -non- recurring events (Noises) -two different/ conflicting goals using historical events to predict the future: confirmatory modeling and Predictive Modeling
basic visualization design principles
-Simplification: making a visualization easy to interpret and understand -emphasis: assuring the most important message is easily identifiable -ethical data presentation- avoiding the intentional / unintentional use of deceptive practices that can alter the user's understanding of the data being presented.
How to Identify Outliers
-Sort the records -Examine max and min -Compare mean and median -Create scatterplots -Perform conditional formatting -Perform cluster analysis for complicated data. How to treat outliers: depends on the true cause of the outlier. If it is a typo, correct it; if not, remember that keeping the outliers influences the outcomes.
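The first few checks in this list can be sketched in Python. The z-score cutoff of 2 is an assumption for illustration only, since there is no standard definition of an outlier, and the invoice amounts are made up.

```python
from statistics import mean, median, pstdev

# Hypothetical invoice amounts; 9800 is a deliberately planted extreme value
amounts = [120, 135, 128, 9800, 131, 125, 140, 122]

ordered = sorted(amounts)                     # sort the records
lowest, highest = ordered[0], ordered[-1]     # examine min and max
gap = mean(amounts) - median(amounts)         # a large gap hints at extreme values

def flag_by_z(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    avg, spread = mean(values), pstdev(values)
    return [v for v in values if abs(v - avg) / spread > threshold]

flagged = flag_by_z(amounts)
print(lowest, highest, gap, flagged)
```

Flagged values are candidates for review; whether to correct or keep them still depends on the cause.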
Bar Chart/ Bar graph/ Bar Plot/ Vertical Bar Chart
-categorical data variable on the x-axis and numeric data on the y-axis -ideal to show trend fewer than 12 periods -excel calls these "Column Chart"
treemaps
-nested rectangles to show the amount that each group or category contributes -used to highlight hierarchy among data elements
heatmap
-show by colors, looks like a data table but uses colors to show the magnitude of the different entries. -easily created by using conditional formatting in excel
SEC's Plain English Disclosure
1. Short Sentences 2. definite, concrete, everyday language 3. active voice 4. bullet lists 5. no legal jargon 6. no double negatives
area chart
A line chart with the areas below the lines filled with colors. highlight one portion of the line
Descriptive Analytics
A self driving car company uses artificial intelligence to help clean its historic social media data so they can analyze trends
Descriptive Analytics
Address the question "What Happened?" Historical Applies exploratory data analysis to: find mistakes in the data to understand the structure of the data to check the assumptions required to determine the size, direction, and strength of relationships between variables (Correlation)
Diagnostic Analytics
An accounting firm is trying to understand if its external audit fees are appropriate. They compute a regression using public data from all companies in their industry to understand the factors associated with higher audit fees.