Data Visualization
outliers in IQR
values more than 1.5 IQR below the 25th percentile or more than 1.5 IQR above the 75th percentile
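A minimal sketch of the 1.5 IQR rule, assuming quartiles are computed with Python's `statistics.quantiles` and its "inclusive" method (other quartile conventions can shift the fences slightly). The function name `iqr_outliers` and the sample data are hypothetical, chosen only for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper, outliers)
```

These fences are the same ones a box plot conventionally uses to decide which points to draw individually beyond the whiskers.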
position attributes
-2D position -motion
what dashboards are not
-a display that is primarily used for data exploration and analysis -a portal -a scorecard -a report that people use to look up specific facts
rules to follow
-always label your axes for scale and variable encodings -always keep your geometry in check -always include your sources -always consider your audience -always avoid 3D effects -always avoid intercept deceptions
time series conventions
-always use the horizontal axis for time scale and the vertical axis for quant. scale -vertical bars when you want to emphasize individual values or compare categories of values, rather than overall pattern -lines only when you want to emphasize the pattern of change over time -points only when values were collected at irregular intervals of time or there is a little short term up and down fluctuation in the values -vertical box plots when you want to display how a distribution changes through time
serif
-appropriate font for data labels to speed processing -when there is a lot of text to read
characteristics of lines
-color hue and intensity -unless in black and white -points and gridlines are useful to compare values on different lines
interactions (few 2009)
-compare -sort -add variables -filter -highlight -aggregate -re-express -zoom/pan -re-scale -access details -annotate -bookmark
visualization amplifies cognition as it
-conveys meaning -increases working memory -facilitates search -facilitates discovery -supports inference -enhances detection hierarchy, relational, temporal, spatial
spatial
-dot distribution, graduated symbols, cartogram, choropleth
graphical inference (wickham)
-exploratory analysis may combine graphical methods, data transformations, and statistics -use questions to uncover more questions -formal methods may be used to confirm, sometimes on held-out or new data -visualization can further aid assessment of fitted statistical models
representation summary
-facilitate cognitive processing -some representations are innately better than others
common kernel functions for density estimation
-gaussian -rectangular -triangular -epanechnikov -cosine
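The five kernels can be sketched as plain functions of a scaled distance u. This uses the common textbook forms, each integrating to 1; statistical packages may rescale them so that the bandwidth equals the kernel's standard deviation. The names `KERNELS` and `kde` and the data are hypothetical.

```python
import math

# Each kernel K(u) integrates to 1; all but the Gaussian are zero outside |u| <= 1.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "rectangular": lambda u: 0.5 if abs(u) <= 1 else 0.0,
    "triangular": lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "cosine": lambda u: (math.pi / 4) * math.cos(math.pi * u / 2) if abs(u) <= 1 else 0.0,
}

def kde(x, data, bandwidth, kernel="gaussian"):
    """Kernel density estimate at x: the average of scaled kernels centred on each point."""
    k = KERNELS[kernel]
    return sum(k((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

data = [1.0, 1.5, 2.0, 4.0]  # hypothetical observations
for name in KERNELS:
    print(name, round(kde(2.0, data, bandwidth=1.0, kernel=name), 4))
```

In practice the choice of bandwidth usually changes the look of the estimate far more than the choice of kernel.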
why interaction?
-give control to the user -guide the user through your story -handle too much data or too many variables -allow for data exploration and new questions
color attributes
-hue -intensity
the roles of text
-label -introduce -explain -reinforce -highlight -sequence -recommend -inquire
form attributes
-length -width -orientation -shape -size -enclosure -curvature -added marks
common conventions
-like colors mean like things -color saturation indicates higher and lower values -categories are arranged and plotted from one extreme to another
temporal
-lines and motion (time)
common ways to summarize data are
-measures of average -measures of variation -measures of correlation -measures of ratio
visualization lies
-no zero line -dual axes -flipped y axis -doesn't add up -limited scope -strategic binning -problems with area/dimension
relationships among categorical items
-nominal -ordinal -interval -hierarchical
sorting
-often uncovers much more meaning in data -provide extremely quick and easy means to re-sort data in different ways -provide the means to link multiple graphs and easily sort the data in each graph the same way -provide the means to sort items in a graph based on various values, especially the values that are featured in the graph
dashboard design best practices
-organize information to support meaning and use -maintain consistency to enable quick and accurate interpretation -put supplementary information within reach -make the experience aesthetically pleasing -expose lower-level alerts -keep viewers in the loop -when needed, accommodate real-time monitoring
characteristics of bars
-orientation -proximity -fills -borders -base value
LOESS curves
-performs multiple local regressions that place higher weighting on closer points -provides a richer visual representation of the trend and doesn't require an a priori model -doesn't provide a simple regression function to describe the trend
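A rough sketch of the idea in pure Python: for each x, fit a weighted straight line to the nearest fraction `alpha` of the points, using tricube weights so closer points count more. This is a simplified illustration, not the exact algorithm of any particular library; the function name `loess` and the data are hypothetical.

```python
def loess(xs, ys, alpha=0.5):
    """Locally weighted linear regression, evaluated at each x in xs.

    alpha is the fraction of the points used in each local fit.
    """
    n = len(xs)
    k = max(2, int(round(alpha * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the local window width.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        # Tricube weights: 1 at x0, falling to 0 at the edge of the window.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

xs = list(range(10))
ys = [1.0, 3.2, 4.8, 7.1, 9.0, 10.7, 13.3, 15.1, 16.8, 19.2]  # hypothetical noisy trend
print([round(v, 2) for v in loess(xs, ys, alpha=0.5)])
```

The output is a list of fitted values rather than a single equation, which is the trade-off the card describes.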
relationships among quantitative values
-rankings -ratios -correlations
process and provenance (heer and shneiderman)
-record analysis histories for revisitation, review and sharing -annotate patterns to document findings -share views and annotations to enable collaboration -guide users through analysis task or stories
view manipulation (heer and shneiderman)
-select items to highlight, filter, or manipulate them -navigate to examine high-level patterns and low-level detail -coordinate views for linked, multi-dimensional exploration -organize multiple windows and workspaces
characteristics of points
-shape -fill -color
quantitative stories always feature relationships
-simple associations between quantitative values and categorical items -more complex associations among multiple sets of quantitative values
relational
-suggests patterns of connections -heatmap, chord diagram, sankey diagram
hierarchy
-suggests relationship direction -stacked schemes (vertical, horizontal, center/periphery relationship) -nested schemes (treemap)
which graphs to use with multivariate datasets
-tableplots -scatterplot matrices -star graph -icon solutions
use tables when
-the display will be used to look up individual values -precise values are required -the quantitative values include more than one measure -both detail and summary values are included -the display will be used to compare individual values
use graphs when
-the message is contained in the shape of the values -the display will be used to reveal relationships among whole sets of values
common components of a chart/graph
-title/subtitle -data region -data label -data encoding -annotation -legend -grid lines -note -x/y axis -tick mark
where is the boundary
-to show relationships consider linking graphical representations of data objects using lines or ribbons of colors -consider putting related information inside a closed contour -color or texture can be used to define regions that have more complex shapes
secondary data component design
-trend lines -reference lines -annotations -scales -tick marks -grid lines -legends
properties of representation
-understanding without training -resistance to alternative conclusions -cross culture validity -immediacy (hard wired)
fundamental usage requirement features
-update frequency -user expertise -audience size -technology platform -screen type -data types
treemap
-use of color and form to represent two quantitative values -may also use relative position to represent nested/hierarchical data -precision not important -boxes represent entities -better for large data sets where smallest category still relatively significant
cartograms
-use of size of predefined units (states) to represent distribution of variable values -topology maintained -explains via familiar units
ranking designs basics
-use one axis for categorical items and use the other axis for a quantitative scale -bars are almost always preferred -except when the quantitative scale doesn't begin at zero, then use points -sorting is key to effectively communicate a ranking
aesthetics
-use subdued colors over bright colors -use off-whites instead of stark whites in background -align content and follow good layout principles -use legible font
data and view specification (heer and shneiderman)
-visualize data by choosing visual encodings -filter out data to focus on relevant items -sort items to expose patterns -derive values or models from source data
font sizing and spacing
1 inch = 72 points; 1 pica = 12 points; 12 points = 16 pixels
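The conversions can be checked with a few lines of arithmetic. The 12 points = 16 pixels figure holds under the assumption of 96 pixels per inch (the CSS reference density); the function names are hypothetical.

```python
POINTS_PER_INCH = 72
POINTS_PER_PICA = 12
PIXELS_PER_INCH = 96  # assumption: CSS reference pixel density

def points_to_pixels(points, ppi=PIXELS_PER_INCH):
    """Convert a size in points to pixels at the given pixels-per-inch."""
    return points * ppi / POINTS_PER_INCH

def points_to_picas(points):
    return points / POINTS_PER_PICA

def points_to_inches(points):
    return points / POINTS_PER_INCH

print(points_to_pixels(12), points_to_picas(12), points_to_inches(72))
```

On a display with a different pixel density, pass a different `ppi` and the pixel figure changes accordingly.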
visualization principles
1. adopt novel approaches to visualization only when anticipated benefits are greater than the cost of learning + cost of inconsistency 2. when two visualizations can support the same task, adopt the tool that is innately more effective 3. visualization tool development cost is less than benefits from visualization tool
13 common mistakes in dashboard design (few 2013)
1. exceeding the boundaries of a single screen 2. supplying inadequate context for the data 3. displaying excessive detail or precision 4. choosing inappropriate media of display 5. expressing measures indirectly 6. introducing meaningless variety 7. using poorly designed display media 8. encoding quantitative data inaccurately 9. arranging the data poorly 10. ineffectively highlighting what's important 11. cluttering the screen with useless decoration 12. misusing or overusing color 13. designing an unappealing visual display
what should a good chart do?
1. show the data 2. induce the viewer to think about the substance rather than about methodology, graphic design, technology of graphic production 3. avoid distorting what the data has to say 4. present many numbers in a small space 5. make large data sets coherent 6. encourage the eye to compare different pieces of data 7. reveal the data at several levels of detail, from a broad overview to the fine structure 8. serve a reasonably clear purpose: description, exploration, tabulation, or decoration 9. be closely integrated with the statistical and verbal descriptions of a data set
tufte's principles of graphic integrity
1. the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented 2. clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. write out explanations of the data on the graphic itself. label important events in the data. 3. show data variation, not design variation 4. in time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units 5. the number of information carrying (variable) dimensions depicted should not exceed the number of dimensions in the data 6. graphics must not quote data out of context
tufte's principles of graphical excellence
1. the well-designed presentation of interesting data- a matter of substance, statistics, and design 2. consists of complex ideas communicated with clarity, precision, and efficiency 3. that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space 4. nearly always multivariate 5. requires telling the truth about the data
performance monitoring process
1. update high-level situation awareness 2. identify and focus on particular items that need attention: update awareness of this item in greater detail and determine whether an action is required 3. if action is required, access additional information that is needed, if any, to determine an appropriate response 4. respond
5 things to know about how people perceive charts
1. we don't go in order 2. we see first what stands out 3. we see only a few things at once 4. we seek meaning and make connections 5. we rely on conventions and metaphors
data-ink ratio (tufte 1983)
= data ink / total ink used to print the graphic; the proportion of a graphic's ink devoted to the non-redundant display of data-information; equivalently, 1.0 minus the proportion of a graphic that can be erased without loss of data-information
data density
= number of entries in data array/area of data graphic
the lie factor (tufte 1983)
= size of effect shown in graphic / size of effect in data; avoid confounding design variation with data variation; the scale of the graphic should always correspond to changes in the data being represented
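The three Tufte measures above (data-ink ratio, data density, lie factor) are simple ratios. The sketch below assumes "size of effect" means the relative change between two values; the function names and the numbers are illustrative only.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of the graphic's ink that displays data (1.0 is the ideal)."""
    return data_ink / total_ink

def data_density(n_entries, graphic_area):
    """Number of data entries shown per unit area of the graphic."""
    return n_entries / graphic_area

def effect_size(first, second):
    """Relative change from the first value to the second."""
    return (second - first) / first

def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)

# Illustrative numbers: the data rise from 18 to 27.5, but the
# graphic draws them as lengths of 0.6 and 5.3 units.
print(round(lie_factor(0.6, 5.3, 18, 27.5), 1))
```

A lie factor near 1 means the graphic represents the change faithfully; values well above or below 1 indicate exaggeration or understatement.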
"Data viz is often the most effective way to describe, explore, and summarize a set of numbers by looking at a picture of those numbers... well-designed data graphics are usually the simplest and at the same time the most powerful"
Edward Tufte (Visual display of quant info)
"The greatest value of a picture is when it forces us to notice what we never expected to see"
John Tukey
bean plot
a more complete way to represent a distribution that shows the smoothed density of points, with the smoothing controlled by a window called the 'bandwidth'; combines a box plot, a density plot, and a rug in the middle
gestalt principles of perception
a psychological theory of perception that suggests the mind understands external stimuli as whole rather than the sum of their parts. the wholes are structured and organized using grouping laws
rows
a series of flowlines that create horizontal divisions of space on a page
dashboard
a visual display of the most important information needed to achieve one or more objectives, consolidated and arranged on a single screen so the information can be monitored at a glance (Few 2013)
LOESS curve bandwidth
aka the smoothing parameter (alpha), the trend line can look different depending on this
flowlines
alignments that break the space into horizontal bands
no chart junk
all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information (tufte)
isopleth
an isoline on a graph showing the occurrence or frequency of a phenomenon as a function of two variables; a quantitative value is used to impose isolines (lines of constant value)
where should legends appear on the graph
anywhere they fit as long as they don't interfere with more important components of the graph
numbers that summarize
better communicate your quantitative message by reducing large datasets to a few numbers that summarize the data
emphasize size
bigger objects, words, and numbers
should legends have borders
borders don't add any meaning and draw attention away from the data. avoid borders
ordinal data
categories with order -very happy to very sad -use a color scale
fundamental aspects of design
color, typography, layout and composition
analogous color palette
colors next to each other on the color wheel
split complementary color palette
a base color plus the two colors next to the one opposite it on the color wheel (triangle)
complementary color palette
colors opposite each other on the color wheel
design
communication and perception
boxes (box and whisker plots)
comparing distributions across categories
continuous data
connecting the dots warning- line implies continuity and the line between the dots may not be appropriate depending on the context
attention is drawn to
contrasts, similarity
alpha
controls the flexibility of the LOESS regression function; larger values produce the smoothest functions, which wiggle the least in response to fluctuations in the data
CMYK
cyan- 100% magenta- 0% yellow- 36.1% black- 0%
scripts
decorative, thin, and wide fonts are generally hard to read and should be used sparingly and only if they truly add to the design of the visualization
interquartile range
difference between the 75th and 25th percentiles
distribution quantitative message
displays the way in which one or more sets of quantitative values are distributed across their full quantitative range, from lowest to the highest and everything in between
rug plot
draws a small vertical tick at each observation in a histogram
purposes of data visualizations
explore: confirm and analyze; explain: inform and persuade
part-to-whole graph
features how individual values that make up the whole of something compare to each other and the whole
deviation relationships
features how one or more sets of quantitative values differ from a reference set of values
linear trend lines
fit the data with the best line using least squares regression; these trend lines provide easy-to-interpret equations describing the linear trend
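A minimal sketch of an ordinary least squares fit for a single predictor, written out by hand so the slope and intercept formulas are visible. The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimising squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # hypothetical data lying exactly on y = 2x + 1
slope, intercept = least_squares_line(xs, ys)
print(f"y = {slope:g}x + {intercept:g}")
```

The resulting equation is what makes a linear trend line easy to report alongside the chart.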
the four key pre-attentive attributes
form, color, position, motion
anscombe's quartet
four datasets with the same properties (mean, sample variance, correlation) but very different graphs
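The shared summary statistics can be checked directly. The values below are the quartet as commonly reproduced from Anscombe (1973), typed in by hand, so they are worth verifying against a trusted copy before reuse. The helper `pearson` is a hypothetical name for a plain Pearson correlation.

```python
import math
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

for name, (xs, ys) in QUARTET.items():
    print(name,
          round(statistics.mean(xs), 2),
          round(statistics.mean(ys), 2),
          round(statistics.variance(ys), 2),
          round(pearson(xs, ys), 3))
```

The four rows print nearly identical numbers, yet scatterplots of the four datasets look completely different, which is the argument for graphing data before summarizing it.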
ranking quantitative message
graph that displays how a set of quantitative values relate to each other sequentially, sorted in ascending or descending order
statistics
graphical data analysis
the shrink principle (tufte 1983)
graphics can be shrunk way down
scatterplot matrices
great way to roughly determine if you have a linear correlation between multiple variables
spatial zones
groups of modules that cross multiple rows and columns
"simplify, simplify, simplify"
henry david thoreau
HSL
hue- 158° saturation- 100% lightness- 50%
when can you eliminate a legend
if categorical variables are encoded using color, shape, etc. a legend can be used to label them
modules
individual units of space created from intersecting rows and columns
when designing a set of glyphs to represent quantity
mapping to any of the following glyph attributes will be effective: size (length/area), lightness, saturation, vertical position -never use volume of a 3D glyph to represent quantity
choropleth
maps with color -use of color to represent distribution of 'standardized' variable values using pre-defined boundaries -loses gradient subtlety -explains via familiar units
ratio
meaningful zero
ratio
measure the relationship between a single pair of values and can be expressed in four ways 1. statement 2. fraction 3. rate 4. percentage
nominal data
multiple categorical states -urban/suburb/rural
interval data
numbers with quantifiable differences -dates
emphasize color intensity
objects, words, and numbers that are darker or brighter than the norm
emphasize enclosure
objects, words, and numbers that are enclosed by lines or background fill colors
emphasize hue
objects, words, and numbers that have a hue that is different from the norm
pre-attentive processing
occurs below the level of consciousness at an extremely high speed and is tuned to detect specific visual attributes
list
one categorical variable
monochromatic color palette
one color
bar chart
one continuous and one categorical variable
pie chart
one continuous and one categorical variable
boxplot
one continuous variable
histogram
one continuous variable
nominal v. ordinal
order your categories if -the categorical variable has a specific order
visual information seeking mantra
overview, zoom and filter, then details-on-demand
voronoi diagram
partitioning of a plane into regions based on distances to points in a specific subset of the plane
the cleveland dotplot
plots of points that each belong to one of several categories. the bars are replaced by dots at the values associated with each category aka strip charts
icons
provide a universal communication mechanism and create visual interest in the reader
standard deviation
provides a single value that measures variation of a set of data values relative to the mean
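A small sketch of the calculation, showing both the population form (divide by n) and the sample form (divide by n - 1). The function name and the data are hypothetical.

```python
import math

def standard_deviation(values, sample=True):
    """Square root of the average squared distance from the mean."""
    n = len(values)
    mean = sum(values) / n
    squared = sum((v - mean) ** 2 for v in values)
    # The sample form divides by n - 1; the population form divides by n.
    return math.sqrt(squared / (n - 1 if sample else n))

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
print(standard_deviation(values, sample=False))
print(round(standard_deviation(values, sample=True), 3))
```

Python's built-in `statistics.pstdev` and `statistics.stdev` compute the same two quantities.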
quantitative stories include two types of values
quantitative and categorical
RGB
red- 0 green- 255 blue- 163
HEX codes
red- 00 green- FF blue- A3
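The CMYK, HSL, RGB, and HEX cards all describe the same color. The sketch below derives the other three notations from the RGB values using the standard-library `colorsys` module, which returns hue as a fraction of the color wheel (converted here to degrees). The CMYK conversion is the naive formula that ignores color profiles, and the function name `describe_color` is hypothetical.

```python
import colorsys

def describe_color(r, g, b):
    """Express an 8-bit RGB color as HEX, HSL, and (naive) CMYK."""
    rf, gf, bf = r / 255, g / 255, b / 255
    hex_code = f"#{r:02X}{g:02X}{b:02X}"
    # colorsys returns (hue, lightness, saturation), each in the range 0..1.
    h, l, s = colorsys.rgb_to_hls(rf, gf, bf)
    hsl = (round(h * 360), round(s * 100), round(l * 100))  # degrees, %, %
    k = 1 - max(rf, gf, bf)
    if k == 1:
        cmyk = (0.0, 0.0, 0.0, 100.0)  # pure black
    else:
        cmyk = tuple(round(100 * v, 1) for v in (
            (1 - rf - k) / (1 - k),
            (1 - gf - k) / (1 - k),
            (1 - bf - k) / (1 - k),
            k,
        ))
    return hex_code, hsl, cmyk

print(describe_color(0, 255, 163))
```

Print workflows normally rely on color profiles rather than this formula, so treat the CMYK figures as approximate.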
correlation
relationships between variables. displayed in a graph when it is designed to show whether two paired sets of quantitative values vary in relation to one another
bubblechart
scatterplot with third attribute represented by pre-attentive selection of form (size)
time series data
series of quantitative values that show how something has changed over time
frequency distributions
show the number of times something occurs within consecutive intervals over the entire quantitative range
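A minimal sketch of building a frequency distribution by counting values in consecutive, equal-width intervals, which is the table a histogram draws. It assumes `start` is no larger than the smallest value; the function name and the data are hypothetical.

```python
def frequency_distribution(values, bin_width, start=None):
    """Count values in consecutive intervals [low, low + bin_width).

    Assumes start <= min(values). Returns a list of (low, high, count).
    """
    start = min(values) if start is None else start
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + i * bin_width, start + (i + 1) * bin_width, c)
            for i, c in enumerate(counts)]

values = [1, 2, 2, 3, 5, 7, 8, 8, 9]  # hypothetical data
for low, high, count in frequency_distribution(values, bin_width=3, start=0):
    print(f"[{low}, {high}): {'#' * count}")
```

Changing `bin_width` or `start` can change the apparent shape of the distribution, which is why strategic binning appears among the ways a visualization can mislead.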
violin plot
similar to box plot with a symmetrical rotated kernel density plot on each side
white space
similar to tufte's idea of maximizing the data ink and removing chart junk
nominal/categorical comparison
simplest of all and typically least interesting -goal: display a set of discrete quantitative values so they can be easily read and compared
emphasize orientation
slanted words and numbers
kernel density plot
a smoothed line over a histogram, based on the declared 'bandwidth': a kernel function (a weighting function used in non-parametric estimation) estimates a density around each observation, and all of the density estimates are then added together
median
sort values in order and then find the value that falls in the middle of the set
gutters
space that separates rows and columns or two facing pages
perception in a nutshell
specialized neurons extract features involuntarily --> rapid assembly of information into significance; object identification (and action) --> visual working memory (using three chunks) with active conscious attention
variation
spread, IQR, standard deviation
step chart
a standard line chart implies steady change from point a to point b; use a step chart when the values don't change in between time points
spread
subtract the lowest value from the highest value
mean
sum of all values divided by the number of values
glyphs
symbols meant to represent a numerical attribute of an entity, representing through the use of our building blocks of form, color, position, and motion
margin
the space that separates the content from the edge of the page
rule of thirds
the subject isn't centered in the image and draws the viewer's eye into the composition instead of glancing at the center
triad color palette
three colors evenly spaced around the color wheel
bubble
three continuous variables or two continuous variables and one categorical variable
xref list
two categorical variables
scatterplot
two continuous variables
line graph
two continuous variables with one being time -doesn't need a zero
binary data
two states -positive negative
when can you eliminate tick marks?
unnecessary for axes with categorical variables; most useful when knowing the precise locations of values matters; the longer the scale line, the more tick marks it should contain
heatmap
use of color to represent a quantitative value imposed upon an area/location (vs. leveraging predefined boundaries agnostic to the mapped values); can also be used in a highlight table
to make symbols in a set maximally distinctive
use redundant coding wherever possible, for example make symbols differ in both shape and color
when developing glyphs
use small, closed shapes to represent data entities, and use color, shape and size of those shapes to represent attributes of those entities
emphasize width
use thicker lines (including words and numbers that are boldfaced)
stacked bars
useful when there are subcategories and the sum of the subcategories is meaningful for your quantitative message
time series visualizations
useful when your quantitative message includes -change -rise -increase -fluctuate -grow -decline -decrease -or trend
midrange
value midway between the highest and lowest value in the set (different than median)
mode
value that occurs the most in the set of values- if no value appears more than once- there's no mode
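The measures of average and spread defined in these cards (mean, median, midrange, mode, spread) can be compared on one small dataset. The helper names `mode_or_none` and `midrange` and the data are hypothetical; the mode helper returns an empty list when no value appears more than once, following the rule on this card.

```python
from collections import Counter
import statistics

def mode_or_none(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [v for v, c in counts.items() if c == top]

def midrange(values):
    """Value midway between the lowest and highest values."""
    return (min(values) + max(values)) / 2

values = [3, 5, 5, 8, 20]  # hypothetical data with one large value
print("mean", statistics.mean(values))
print("median", statistics.median(values))
print("midrange", midrange(values))
print("mode", mode_or_none(values))
print("spread", max(values) - min(values))
```

The single large value pulls the mean and the midrange upward while the median and mode stay put, which is why the choice of average matters for skewed data.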
bullet chart
variation of a bar graph developed by Few. features a single primary measure and compares that measure to one or more measures to enrich its meaning. it's displayed in the context of qualitative ranges of performance (poor to good) with varying intensities of a single hue to make them discernible
coxcomb chart
variation of a pie chart that represents numbers using the area of the circle segments instead of the radius (polar)
columns
vertical divisions of space on a page
how should you arrange labels in legends
vertical or horizontal is fine, whichever best fits your design
how visible should legends be
visible and legible but not as prominent as the actual data
Data visualization
visual display of quantitative information through the use of -points -lines -coordinate systems -numbers -symbols -words -shading -color
principle of enclosure
we perceive objects as belonging together when they are enclosed in a way that appears to create a boundary around them
principle of proximity
we perceive objects close together as belonging to a group
principle of continuity
we perceive objects that are connected as part of the same group
principle of closure
we perceive open, incomplete, and unusual forms as closed, whole, and regular
principle of similarity
we tend to group objects together that are similar
motion attribute
when animation is used in a visualization, aim for motion in the range of 0.5 to 4 degrees/second of a visual angle
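To turn a speed in degrees of visual angle into on-screen pixels, the viewing geometry has to be assumed. The sketch below uses size = 2 × distance × tan(angle / 2), with a hypothetical viewing distance of 60 cm and a hypothetical display density of 96 pixels per inch; both are assumptions to adjust for a real setup.

```python
import math

def visual_angle_to_pixels(degrees, viewing_distance_cm=60.0, pixels_per_cm=96 / 2.54):
    """On-screen length in pixels that subtends `degrees` of visual angle."""
    size_cm = 2 * viewing_distance_cm * math.tan(math.radians(degrees) / 2)
    return size_cm * pixels_per_cm

# The recommended speed range, expressed in pixels per second under these assumptions.
slow = visual_angle_to_pixels(0.5)
fast = visual_angle_to_pixels(4.0)
print(round(slow, 1), round(fast, 1))
```

Under these assumptions the range works out to roughly 20 to 160 pixels per second; a viewer sitting farther away or using a denser display would need different pixel speeds for the same perceived motion.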
sans-serif
when there is a shorter amount of text: captions, text in charts, headings
"a graphic display has many purposes, but it achieves its highest value when it forces us to see what we are not expecting"
william cleveland