Visualisation and Exploratory Analysis (CS4250)

Tufte's Design Principles

- Clear, detailed and thorough labelling and appropriate scales.
- The size of the graphic effect should be directly proportional to the numerical quantities ("lie factor").
- Maximise the data-ink ratio.
- Avoid chart junk.

Sequential colour scale

- Constrain hue; vary luminance/saturation.
- Map higher values to darker colours.

Bad ways to display data

- Display as little information as possible.
- Obscure what you do show (with chart junk).
- Use pseudo-3D and colour gratuitously.
- Make a pie chart (preferably in colour and 3D).
- Use a poorly chosen scale.

Visualising numerical variables

- Dot plot: useful when individual values are of interest.
- Histogram: provides a view of the data density and is especially convenient for describing the shape of the data distribution (note: the bin size is important).
- Boxplot: especially useful for displaying the median, quartiles and IQR, as well as unusual observations.
- Density plot: useful for smoother distributions.

Variable types

- Identifier variables: variables (typically factors, integers or dates) that we have set up, e.g. the chick identifier, the date, the diet fed. They are identifiers rather than measurements.
- Measurement variables: the measurements (typically floating-point numbers) we make, e.g. the chick's weight.

Design critique

- Identify data, tasks/intentions.
- Identify marks and channels.
- Is the effectiveness principle followed?
- Is the expressiveness principle followed?
- Scales.
- Context.
- Would derived data be better?
- Is the design visually appealing/aesthetically pleasing?
- Is it immediately understandable? If not, is it understandable after a short period of study?
- Does it provide insight or understanding that was not obtainable with the original representation (text, table, etc.)?
- Does it provide insight or understanding better than some alternative visualisation would? Or does it require excessive cognitive effort? What kind of visualisation might have been better?
- Does the visualisation reveal trends, patterns, gaps and/or outliers? Can the viewer make effective comparisons?
- Does the visualisation successfully highlight important information, while providing context for that information?
- Does it distort the information? If it transforms it in some way, is this misleading or helpfully simplifying?
- Does it omit important information?
- Is it memorable?
- How can we see any interactions?
  • Is it a pattern or by chance (statistical significance)? Could the apparent interaction be just luck?
  • Do we have enough observations to reliably identify an interaction?
  • What is the explanation of the interaction (confounding factors)?

Visualisation types

- Infographics (short for "information graphics")
  • Infographics are visual representations of facts, events or numbers.
  • Visual patterns and trends are used in such a way that human cognition is enhanced.
  • Newspapers are one common place where you find a lot of these, usually used to show weather, maps and poll statistics.
  • Infographics communicate a subjective narrative or overview of a topic using illustrations to drive visual storytelling.
- Scientific visualisation
  • Visualising results of simulations, experiments or observations.
  • Data is frequently multi-dimensional.
  • Usually involves data associated with a physical domain, e.g. fluid visualisation.
- Data visualisation (static or interactive)
  • Used for searching for interesting phenomena.
  • Consisting purely of data and design, successful data visualisation helps clarify information by giving an overview of the implications of your research.

Visual cognition

- It requires conscious inspection.
- It's attentive.
- It's cognitive.
- Top-down information.

Why does visualisation matter

- Large size of data makes it necessary to provide summaries.
- People prefer to look at figures rather than numbers.
- Aid model construction and check plausibility of model assumptions.
  • Pattern discovery: clusters, outliers, trends.
  • Contextual knowledge: dataset expectations, explanations for patterns.
  • Action: humans learn and take action.
But we also have to design for humans and their limitations.

Which logarithm

- Log base 10: 10^log10(x) = x. Conventional; you can see the powers of 10.
- Log base e (natural log): e^log(x) = x. Small changes in log(x) (e.g. +0.05) correspond roughly to changes in x by the same percentage (5%).
- Log base 2: 2^log2(x) = x. Can be more intuitive than log base 10 (a doubling time is a smaller unit than a time to increase by a factor of 10).
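A minimal sketch in base R of the same geometric series under the three bases:

x <- 2^(0:10)    # 1, 2, 4, ..., 1024
log10(x)         # powers of 10: 0, 0.30, 0.60, ..., ~3.01
log(x)           # natural log: a +0.05 step in log(x) is ~5% growth in x
log2(x)          # powers of 2: 0, 1, 2, ..., 10 (one unit per doubling)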

Outlier detection

- Look critically at the maximum and minimum values of all variables.
- An outlier may indicate a sample peculiarity, a data entry error or another problem.
  • In a scatter plot, outliers are points that fall outside the overall pattern.
  • Normally distributed data should not have observations many standard deviations (kσ) from the mean.
- Key questions about outliers:
  • Is the outlier a mistake or a legitimate point?
  • Is the outlier part of the population of interest?
- "Fix why you have an outlier. Don't just delete."
Deleting outliers prior to fitting?
- It can yield better models, e.g. if these points correspond to measurement error.
- It can yield worse models, e.g. if you are simply deleting points which are not explained by your simple model.

Histograms

- Not the same as a bar chart.
- The area under a bar is the count; bars can be of variable width (but usually aren't).
- How many bars?
  • Rule of thumb: the standard error of a bar's count is √(number of elements in the bar).
  • There are no firm rules about how many bars, as distributions are non-uniform (better to have a few groups than many).
  • The appearance of the histogram changes greatly as you change the width and arrangement of the bars.
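A short ggplot2 sketch of how bin width changes the apparent shape (the faithful dataset ships with R; the bin widths are illustrative choices):

library(ggplot2)
ggplot(faithful, aes(x = eruptions)) + geom_histogram(binwidth = 0.05)  # narrow bins: noisy
ggplot(faithful, aes(x = eruptions)) + geom_histogram(binwidth = 1)     # wide bins: over-smoothed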

Parallel coordinates plot

- A plot that represents multidimensional data using lines.
- A vertical line represents each dimension or attribute.
- The maximum and minimum values of that dimension are usually scaled to the upper and lower points on these vertical lines.
- A polyline of n−1 segments, connected to each vertical line at the appropriate dimensional value, represents an n-dimensional point.
Analysis:
- A positive correlation between 2 adjacent variables: almost all segments are parallel to each other.
- Clusters in some variable space: several trace lines that are near each other and have a similar pattern.
- Outliers: trace lines that have an unusual pattern or fall outside the common plot area.
Problems:
- Trace lines overlap each other → difficult to find patterns, difficult to follow a specific trace line.
- Analysis depends heavily on the order of variables (correlation, clusters) → a proper reordering may improve the analysis.
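A hedged sketch using GGally::ggparcoord (assumes the GGally add-on package is installed; iris ships with R):

library(GGally)
# One vertical axis per numeric column; each flower becomes a trace line.
# Reordering 'columns' changes which correlations are visible between neighbours.
ggparcoord(iris, columns = 1:4, groupColumn = "Species", scale = "uniminmax")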

Pie charts

- Poor accuracy.
- Only suitable for part-of-whole relationships (not a sequence of values).
- Ideally 2 slices, no more than 5.
- Only suitable for very clear data with large differences in proportions.
- Multiple pie charts are hard to compare (same problems as stacked bar charts - but worse).

Colour scales

- Sequential: a range of values from small to large (e.g. a greyscale).
- Qualitative: different colours distinguish different categories, with no special ordering.
- Diverging: suitable when large negative values are interesting, values near 0 are boring, and large positive values are interesting. More generally, when there are extreme values in 2 directions, but the centre is to be emphasised less.

Colour map

- Specifies a mapping between colour and values, sometimes called a transfer function.
- Categorical vs ordered.
- Sequential vs diverging.
- Segmented (discrete) vs continuous.
- Uni-variate vs bi-variate.
Expressiveness: match the colour map to the attribute type characteristics.
Guidelines:
- Ordered colour maps should vary in saturation or luminance.
- Bi-variate colour maps are difficult to interpret unless at least one variable is binary.
- Categorical colours are easier to remember if they are nameable.
- The number of hues, and their distribution on the colour map, should be related to which, and how many, structures in the data to emphasise.
- Saturation and hue aren't separable in small regions (in small regions use bright, highly saturated colours).
- Saturation interacts strongly with size:
  • more difficult to perceive in small regions,
  • for points and lines use just 2 saturation levels.
- Higher saturation makes large areas look bigger (use low-saturation pastel colours for large regions and backgrounds).
- Luminance and saturation are most effective for ordinal data because they have an inherent ordering.
- Hue is great for categorical data because it has no inherent ordering (but limit the number of hues to 6-12 for distinguishability).
Hints for the colourist:
- Use only a few colours (ideally 6).
- Colours should be distinctive and nameable.
- Strive for colour harmony (natural colours?).
- Use cultural conventions; appreciate the symbolism.
- Beware of bad interactions (e.g. red/blue).
- Get it right in black and white.
- Respect the colour blind.

Quartiles and IQR

- The 1st quartile (aka 25th percentile), Q1, is the value for which 25% of the observations are smaller and 75% are larger.
- The median (aka 50th percentile), Q2, is the value for which 50% of the observations are smaller and 50% are larger.
- The 3rd quartile (aka 75th percentile), Q3, is the value for which 75% of the observations are smaller and 25% are larger.

How the eye works

- The eye isn't a camera.
- Attention is selective (filtering).
- Cognitive processes.
- Psychophysics: concerned with establishing quantitative relations between physical stimulation and perceptual events.

How do we see colour?

- The human eye has over 100 million rod cells but only about 6 million cone cells.
- 3 types of cones: L, M and S (Long, Middle, Short), each sensitive to a different range of wavelengths.
- Integration with the input stimulus.
- All spectra can be reduced to precisely 3 values without loss of information with respect to the visual system.
- Spectra that stimulate the same LMS response are indistinguishable.
- We possess 3 independent channels for conveying colour information.
- Tri-stimulus devices:
  • Computer displays
  • Digital scanners
  • Digital cameras

Tufte's principles of graphical integrity

- The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
- Clear, detailed and thorough labelling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
- Show data variation, not design variation.
- In time-series displays of money, deflated and standardised units of monetary measurement are nearly always better than nominal units.
- The number of information-carrying (variable) dimensions depicted shouldn't exceed the number of dimensions in the data.
- Graphics must not quote data out of context.

Errors in hypothesis testing

- Type I error: committed when H_0 is rejected when in reality it's true.
- Type II error: committed when H_0 isn't rejected when in reality it's false. The probability of a Type II error is called β, but the value of β typically depends on which particular alternative hypothesis is true.
Just as in a court trial, a Type I error is considered the more serious type of error (e.g. convicting an innocent man), so we try to minimise the probability of committing a Type I error. However, in trying to minimise the probability of a Type I error, β will increase (i.e. the power is reduced), as the probabilities of the Type I and Type II errors are inversely related. As a compromise we therefore specify a maximum tolerable Type I error probability, called the significance level and denoted by α, such that the probability of a Type I error is (at most) equal to α. α is conventionally set to 0.05 or 0.01.

Recommendations

- Use the full range of space available; re-scale if necessary.
- Use minimum ink. Prefer dots to bars, or lines that can be made into dots.
- Select, simplify and summarise: full scatter plots are great for your own exploration and checking, but for presentation don't make the reader do the work of selecting and summarising.
- Figure out what comparisons you want to show; choose graphs that can be read accurately for these comparisons and this information.
- Don't show a categorical variable (e.g. country name) in alphabetical or a meaningless order: order it meaningfully!
- More than one plot may be needed.
- Use faceting to show comparisons between subsets of the data.

Conjunction of features

- Viewers can't rapidly and accurately determine whether the target (red circle) is present or absent when the target has 2 or more features, each of which is present in the distractors.
- Viewers must search sequentially.

Hypothesis testing

1. Specify the null (H_0) and alternative (H_a) hypotheses. These 2 hypotheses are mutually exclusive and exhaustive, so that one is true to the exclusion of the other.
2. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic.
3. Using the known distribution of the test statistic, calculate the p-value: "If the null hypothesis is true, what is the probability that we'd observe a test statistic more extreme in the direction of the alternative hypothesis than the one we did?"
4. If the p-value calculated in step 3 is high, the sample is likely under H_0 and we have no evidence against it. If the probability is low, there are 2 possibilities:
  • we observed a very unusual event, or
  • our assumption is wrong.
The underlying logic is: Assume A. If A, then B. Not B. Therefore, not A. But throw in a bit of uncertainty: if A, then probably B...
Problems:
- p < 0.05 is an arbitrary probability (why not p < 0.06?).
- The size of the effect isn't expressed.
- The variability of this effect isn't expressed.
- Induction/deduction - reproducibility.
A worked example follows.
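A minimal worked example with base R's t.test on the built-in sleep data (H_0: the two drug groups have equal mean extra sleep):

result <- t.test(extra ~ group, data = sleep)
result$p.value   # if below the chosen significance level (e.g. 0.05), reject H_0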

Chart junk

A term coined by Edward Tufte, who considers anything in a chart that doesn't represent data (and isn't a scale or label) not just unnecessary but harmful: extraneous visual elements that distract from the message (e.g. heavy or dark gridlines, ornamented chart axes and display frames, pictures or icons within data graphs, and ornamental shading).

RGB

A very common colour space but not perceptually uniform. The colour cube axes range either from 0-255 or 0.0-1.0. In principle, it can use colours to represent 3 numbers - but not very accurately.

Visual perception

Ability to interpret the surrounding environment by processing information that is contained in visible light.
- 70% of the body's sense receptors reside in the eyes.
- It's important to understand how visual perception works in order to effectively design visualisations.
- It requires no conscious effort.
- It's pre-attentive.
- Bottom-up information.
It's where graphs are the most effective to operate (to some extent).

ggplot

An R implementation of the Grammar of Graphics.
Components:
- Data: a data.frame to visualise.
- Aesthetics: mapping data variables to aesthetic features of the graph (e.g. x-axis, y-axis, size, shape, colour, fill, transparency, ...).
- Scales: mapping values of the data to graphical output (e.g. position, colour, fill and shape scales).
- Statistical transformations: representations of our data to aid understanding (e.g. binning, smoothing, ...).
- Coordinate system: maps the position of objects onto the plane of the plot (e.g. Cartesian, polar, map projection, ...).
- Facet: display split data in multiple panels (aka conditioning).
- Theme: control non-data visual elements (e.g. title, axes, ticks, ...).
It organises graphs in geometric layers. Each layer is dedicated to a single type of geometry, with data, an aesthetic mapping and statistics for relevant data transformation. Layers can be stacked - use the + operator for this. By appending layers we can connect the "how" (aesthetics) to the "what" (geometric objects). But for a plot to work, some consistency must be kept:
- Key data must be kept constant across layers.
- Don't map different variables to the same aesthetic.
- Typically no change of scales is allowed between layers, but some transformation is possible.
Every graph can be described as a combination of independent building blocks:
- data: data frame
- aesthetic mapping of variables into visual properties: size, colour, x, y
- geometric objects: points, lines, bars, areas, arrows, ...
- coordinate system: Cartesian, log, polar, map, ...
- statistical transformations - data summaries: mean, sd, binning & counting, ...
- scales: legends, axes to allow reading data from a plot
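A small sketch of two stacked layers sharing one data set and aesthetic mapping (mtcars ships with R):

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                  # layer 1: raw data as points
  geom_smooth(method = "lm")      # layer 2: linear-model smoother (a statistical transformation)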

Colour issues

- Approximately 10% of the population can't distinguish red from green.
- Red and white have different symbolism in Western and Asian cultures. In some Chinese subcultures red = life and white = death.
- Blue vs black can be very difficult to distinguish on some display phosphors.

Data compatibility

Data needs to be carefully massaged to make "apples to apples" comparisons:
- unit conversions
- number / character code representations (e.g. 32-bit vs 64-bit floats)
- name unification
- time/date unification
- financial unification

Missing data

Data may be unavailable due to:
- equipment malfunction
- data not entered due to a misunderstanding (e.g. the death year of a living person)
- certain data not being considered important at the time of entry
An important aspect of data cleaning is properly representing missing data. Setting such values to 0 is generally wrong.
Types:
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)
Approaches:
- Listwise deletion: if an observation is missing on any attribute, it's dropped from the analysis (biased unless the data is MCAR).
- Pairwise deletion: correlations/covariances are computed using all available pairs of data.
- Imputation of missing data values.
- Model-based use of complete data (Expectation-Maximisation approach).

Boxplot

Displays standard quantiles of the distribution in a standard diagram form. However, the convention used can differ slightly between packages: some use the 10th and 90th percentiles as whiskers while others use ±1.5 IQR.
Pros:
- It takes less space (useful when comparing distributions).
- It conveys the key values, what the outliers are, the symmetry of the data, how tightly grouped the data is, and whether there's any skewness and in which direction.
Con: this plot assumes the distribution is unimodal (i.e. has one peak).
When notches are used, they indicate the confidence interval around the median, normally based on: median ± 1.58 × IQR/√n
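A minimal notched-boxplot sketch in ggplot2 (the mpg dataset ships with ggplot2; non-overlapping notches suggest the medians differ):

library(ggplot2)
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(notch = TRUE)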

Tail of a distribution

Extreme values at one end. Example use cases:
- longest waiting times
- tallest people
- wealth of the richest people
However, we need a large sample to get meaningful information about the tail. In a light-tailed distribution, each new record exceeds the last by smaller and smaller amounts. In an exponential distribution, each new record exceeds the last by (on average) the same amount. In a heavy-tailed distribution, each new record exceeds the last by (on average) greater amounts.

Weber's Law

Given some physical quantity - the length of a line, say x - what's the change in x that we can detect when we compare 2 separated lines? For a given probability p of telling the difference, and for given experimental conditions, the difference we can detect is between x and (1+ε)x for some more-or-less fixed ε over a wide range of values of x (the Just Noticeable Difference). In other words, we can estimate or compare quantities with a fixed percentage accuracy over a wide range of x. Subjectively, equal percentage changes are perceived as equal changes. This holds for length/area/volume/weight/time/brightness/power of sound... It's why we need rulers and scales: they are devices that allow us to beat Weber's law. (Image: the importance of distance.) This law is about accuracy, which can be objectively measured: the % of correct judgements of "is A bigger than B?". If we estimate A by looking at it in isolation, we'll only be accurate to within some percentage.

Statistical graphics

Graphs intended to show the structure of data, and to support statistical inferences. They may need much concentration to understand and many users may be confused.

Infographics

Graphs that present the data as a story or as an aesthetic object. They may be informationally useless, but users love them.

Grey-scale and colour encoding

Not only is our judgement of grey-level poor and inaccurate; it can be biased by contrast with surrounding shades. (Image: two circles of the same shade of grey appear different because of the surrounding context.)

Exploratory Data Analysis (EDA)

Techniques for summarising, visualising and reviewing data, building an intuition for how the underlying process that generated the data works. We do it to determine trends and features that may be present in the data.
Basic approach:
- Generate questions about your data.
- Search for answers by visualising, transforming and modelling your data.
- Use what you learn to refine your questions and/or generate new ones.
It's fundamentally a creative process and a critical first step in analysing the data. It's often described as a philosophy, and there are no hard-and-fast rules for how you approach it. Its methods can be numerical (i.e. descriptive statistics), graphical or tabular. Graphical methods make it very easy to discover trends and patterns in a data set.

p-value

The probability of obtaining a value of the test statistic as or more extreme than the observed test statistic when the null hypothesis is true. I.e., how probable is an observation if the null hypothesis is true?
- The rejection region is determined by α, the desired level of significance: the probability of committing a Type I error, i.e. of falsely rejecting H_0.
- The p-value is the smallest α at which we would reject H_0.
- Reporting the p-value associated with a test gives an indication of how common or rare the computed value of the test statistic is, given that H_0 is true.

Missing at random (MAR)

The probability that a feature is missing is independent of the value of the feature itself, but may depend on the values of the other features. The MAR assumption can never be verified from the data. In principle, the missing values can be predicted from the other features.

Missing completely at random (MCAR)

The probability that a feature is missing is independent of the value of the feature and of the values of any other features. In this situation, using only the complete data (cases having no missing values) will give an unbiased result, but with a smaller sample size than if no data were missing.

Lie factor

The size of the graphic effect should be directly proportional to the numerical quantities. Lie factor = (size of the effect shown in the graphic) / (size of the effect in the data); it should be close to 1.

Gestalt principle: continuation

We complete hidden objects into simple, familiar shapes. It occurs when the eye is compelled to move through one object and continue to another object.

Gestalt principle: closure

We see incomplete shapes as complete.

Probabilistic re-scaling

What if we are plotting probabilities? Some of them may be very low (e.g. 1/1,000 or 1/10,000), others close to 1 (0.9, 0.999, ...). p = 1/1,000 can be very different from p = 1/10,000, and p = 0.99 may be very different from p = 0.9999! In this case, try the log-odds: log(p / (1-p)).
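Base R's qlogis computes exactly this log-odds (logit) transform:

p <- c(0.0001, 0.001, 0.5, 0.99, 0.9999)
qlogis(p)   # log(p/(1-p)): spreads out probabilities near 0 and near 1 symmetrically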

Gestalt principle: similarity

What patterns dominate? Things that look like each other (size, color, shape) are related

Quantile plot

A plot where x is the quantile fraction and y is the corresponding data value.
Pro: useful when comparing distributions.
Con: it may be unfamiliar to an audience, so explanations might be needed.

Data wrangling

A process of iterative data exploration and transformation that enables analysis to generate valid insights. Its goal is to make data useful:
- map data to a readable form for downstream tools (visualisation, modelling, ...),
- identify, document and (where possible) address data quality issues.
Visualisations allow us to see data quality issues and can be an input device for such transformations.
Steps: Obtain → Understand → Explore → Transform → Augment → Visualise

Channels

A visual variable that controls the appearance of marks.

Gradient

An often interesting and important quantity. Definition: gradient = [f(b) - f(a)] / (b - a). Velocity is the gradient of a distance-vs-time graph. The gradient can indicate a growth rate or rate of change. Unfortunately, when we estimate gradients from a plot, we perceive angles instead.

Gestalt principle: grouping

It helps us organise data

How close vertically are 2 curves?

On a stacked area chart it's hard to judge the vertical distance between curves; a stacked bar chart may mitigate this bias. The same applies to judging the vertical gap between regression lines on scatter plots.

Least effective for Quantitative/Ordinal data

Colour (hue) - see slide 217 on the all-slides PDF.

Not missing at random (NMAR)

The probability that a feature is missing can depend on the value of the feature. Not much can be done in this situation.

Statistical power

The probability of finding an effect if it's there (the probability of not making a Type II error), 1 − β. When we design studies, we typically aim for a power of 80% (allowing a false-negative rate, or Type II error rate, of 20%).

Type I error rate (or significance level)

The probability of finding an effect that isn't real (False Positive). If we require p-value < 0.05 for statistical significance, this means that 1/20 times we will find a positive result just by chance.

Type II error rate

The probability of missing an effect (False Negative).

Anscombe's quartet

Four datasets that have the same simple descriptive statistics (mean, variance, correlation, regression line, ...), yet appear very different when graphed.

Empirical Cumulative Distribution Function (ECDF)

The fraction of the population ≤ x: ECDF(x) = (number of observations ≤ x) / n.

Plot

A mapping between data and visual elements; it consists of layers that share some common properties.

Lin-log scaling

y = A·e^(bx)  ⇒  log y = b·x + log A
Try this if the reference comparison is exponential.

Log-log scaling

y = A·x^b  ⇒  log y = b·log x + log A
Shows power laws as straight lines. Watch out for additive constants.
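A small ggplot2 sketch: on log-log axes a power law plots as a straight line of slope b (carat vs price in the built-in diamonds data is roughly a power law):

library(ggplot2)
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10()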

Radar plot / star graph

"Parallel" dimensions in polar coordinate space. It works better if the same unit applies to each axis.
Problems:
- It's difficult to judge orientations.
- The number of dimensions is very limited.
- Radar charts placed close together are easier to compare.
- Perception is strongly affected by the order in which the axes are read.

Inference of causes

- "Cause" is a difficult word.
- Very complex interacting causes: planning restrictions, low interest rates, population increase, conflicting interests of owners and renters, and many more.
- No single study can disentangle this. You may be only marginally more informed after the analysis than before.

Colour receptor sensitivities

- 3 types of cone cell in the eye.
- Their sensitivities overlap.
- Subjective brightness varies with wavelength.
- We can't have 3 pure wavelengths that give pure R, G, B for the colour receptors.
- Significant numbers of people have different types of colour vision.

Visualisation critique

1. Consider the purpose of the visualisation and who the intended audience is.
2. Ascertain your initial reaction.
3. Examine the visualisation in detail.
4. Answer questions like those in the design critique list.

Graphical representation of ratios

1. Represent quantities as lengths on a common scale, and let people judge (OK but inaccurate).
2. Represent quantities on a common log scale; now equal ratios are equal distances on the scale (requires sophistication from viewers).
3. Calculate the relevant ratios and represent them as quantities (this works).

Visualisation research goals

1. Understand how visualisations convey information.
  • What do people perceive/comprehend?
  • How do visualisations correspond with mental models?
2. Develop principles and techniques for creating effective visualisations and supporting analysis.
  • Amplify perception and cognition.
  • Strengthen the tie between visualisation and mental models.

Advice on displaying numbers with colours

1. Use the correct colour scale.
2. Rescale your data values (using an appropriate value of gamma) so that you show the interesting differences in values clearly. I.e., set the contrast and brightness so that the picture shows what is interesting.

Luminance contrast

A difference in the intensity of illumination at adjacent retinal locations

Split-apply-combine pattern

A general strategy for working with big data.
- Split: divide the problem into smaller pieces.
- Apply: work on each piece independently (usually aggregate/transform/filter).
- Combine: recombine the pieces.
In:
- Python: map(), filter(), reduce()
- MapReduce
- R: split, *apply, aggregate, ... (or the plyr package)
Transform/summarise a data frame df by:
1. Splitting df by a factor or a combination of factors.
2. Applying a function to each of the resulting data frames to create new ones.
3. Combining the new data frames back together with the factors.
A base R sketch follows.
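A base R sketch of the pattern (mean mpg per cylinder count in the built-in mtcars data):

pieces  <- split(mtcars, mtcars$cyl)                  # split by a factor
applied <- lapply(pieces, function(d)
  data.frame(cyl = d$cyl[1], mean.mpg = mean(d$mpg))) # apply to each piece
do.call(rbind, applied)                               # combine the results

aggregate(mpg ~ cyl, data = mtcars, FUN = mean)       # one-line equivalent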

Selection bias

A huge influence on what we choose to investigate and what conclusions we will retain, accept and publish.

Mosaic plot

A simple and effective visualisation technique for contingency tables. "A graphical representation of a contingency table. The plot is divided into rectangles so that the area of each rectangle is proportional to the number of cases in the corresponding cell."
The perceptual basis for this plot:
- It's tempting to dismiss this plot because it represents counts as rectangular areas and so provides a distorted encoding.
- In fact, the important encoding is length.
- At each stage, the comparison of interest is of the lengths of the sides of pieces of the most recently split rectangle.
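A one-line base R example on a built-in table (counts become areas):

mosaicplot(HairEyeColor, shade = TRUE, main = "Hair vs eye colour")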

Contingency table

A table of counts where rows and columns are labelled with values of categorical variables (factors). The value in each cell is the count of the number of cases in that combination of categories. We are, nearly always, interested in interactions or correlations between the variables. The table in the image suggests that drug Z might cure people, though not completely reliably.
There are 3 ways to represent such tables in R (slide 357):
- Case form: a data frame with 2 factors (e.g. Treatment and Result), one row per case.
- Frequency form: a data frame with the 2 factors plus a count column (e.g. Treatment, Result and Count).
- Table form: a labelled multidimensional array (needed for mosaic plots).
A conversion sketch appears below.
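A conversion sketch in base R, using a small hypothetical treatment/result data frame:

cases <- data.frame(Treatment = c("Z", "Z", "Placebo", "Placebo"),
                    Result    = c("Cured", "Not cured", "Cured", "Not cured"))
tab  <- table(cases$Treatment, cases$Result)  # case form -> table form
freq <- as.data.frame(tab)                    # table form -> frequency form (Var1, Var2, Freq)
xtabs(Freq ~ Var1 + Var2, data = freq)        # frequency form -> table form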

ggplot2

Advantages:
- consistent underlying grammar of graphics
- plot specification at a high level of abstraction
- very flexible
- theme system for polishing plot appearance
- mature and complete graphics system
- many users, active mailing list
Not doable (use other packages):
- 3D graphics (see the rgl package)
- graph-theory type graphs (nodes/edges layout; see the igraph package)
- interactive graphics (see the ggvis package)

Qualitative colour scale

All colours well separated, so that they can be distinguished

Expressiveness principle

All data from the dataset and nothing more should be shown. Do encode ordered data in an ordered fashion (don't encode categorical data in a way that implies an ordering).

Violin plot

Better than a boxplot when the distribution is (clearly) bimodal or multi-modal (two or more peaks).

Grammar of graphics

Big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want (solid, creative and meaningful visualisation).
- An abstraction which makes thinking/reasoning about and communicating graphics easier.
- A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.
It was developed by Leland Wilkinson, particularly in the book "The Grammar of Graphics" (1999/2005):
- graphics = distinct layers of grammatical elements
- meaningful plots through aesthetic mapping

Eyes vs cameras

Cameras:
- Good optics.
- Single focus, white balance, exposure.
- "Full image capture".
Eyes:
- Relatively poor optics.
- Constantly scanning (saccades).
- Constantly adjusting focus.
- Constantly adapting (white balance, exposure).
- Mental reconstruction of the image (sort of).

Kernel Density Estimate

Construct a smoothed estimate of density using a weighted moving average of the number of observations. Produce an estimate of the density at any value x by:
- centring the kernel function at x,
- adding up the weights of points according to the kernel function.
If we do this for successive values of x, we get a nice smooth curve. Beware of the bandwidth selection problem (cf. slide 292):
- a small bandwidth (e.g. h = 0.01) leads to a noisy/overfitted density plot;
- a large bandwidth (e.g. h = 0.2) leads to a generic/underfitted density plot.
Note: a KDE plot is a "visual lie": there's a lovely smooth curve which looks as if it should be accurate - but in fact the KDE is highly uncertain unless you have a lot of data. This plot also has all the problems bar graphs have.
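A base R sketch of the bandwidth problem, using the h values from above (faithful ships with R):

x <- faithful$eruptions
plot(density(x, bw = 0.01))   # too small: noisy, overfitted
plot(density(x, bw = 0.2))    # larger: smoother, possibly underfitted
plot(density(x))              # default: automatic bandwidth selection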

Simultaneous contrast

Distortion of the appearance of a patch of colour in a way that increases the difference between a colour and its surroundings.

Gestalt principle: figure / ground

Elements are perceived as either figures or background

Why reduce dimensionality?

For automated use by computers:
- Saves the cost of observing the features.
- Takes less memory, storage and transmission time.
- Reduces the number of parameters.
- Simpler models are more robust on small datasets.
For use by humans:
- More interpretable and simpler explanations.
- Data visualisation (structure, groups, outliers, etc.) if plotted in 2/3 dimensions.

Statistical inference

Generating conclusions about a population from a sample.
- Estimation: point estimation, confidence intervals.
- Hypothesis testing: concerned with making decisions using data.
- Prediction.

Dimension Reduction (DR)

Goals:
- Extract information hidden in the data.
- Detect variables relevant for a specific task, and how variables interact with each other → reformulate the data with fewer variables.
- With minimal loss of information.

Gradient perception

Gradient = tan(θ). We can judge θ with roughly constant accuracy over its whole range, but tan(θ) is:
- small for small angles: hard to judge because of the compressed scale;
- ≈1 when θ = 45°: the gradient can be accurately judged in this region;
- approaching ∞ as θ approaches 90°: the gradient depends sensitively on θ and is impossible to judge accurately, or at all.
Gradient perception is less accurate than the perception of position on a scale.

Snooping

It occurs in predictive modelling: if you look at what you're trying to predict, you'll unconsciously choose prediction rules that predict what you saw. This can be very difficult to avoid, and it can invalidate conclusions!

Faceting

Key idea: applying the same analysis to many comparable subsets of the data, and putting the results side-by-side, or in a table, so that it's easy to spot differences or trends by comparing similar simple charts.
ggplot2 syntax:
p + facet_grid(row_variable ~ column_variable)
where:
- p is the plot that was constructed,
- *_variable are categorical variables not yet used in the plot p.
E.g.:
ggplot(movies) + geom_histogram(aes(x=rating, y=..density..)) + facet_wrap(~ decade)

Data mining

Looking for some patterns in a large collection of data. "If you torture the data long enough, it will confess to anything"

Visual illusions

Misperceptions of visual stimuli.
- People don't perceive length, area, angle and brightness the way they "should".
- The visual system does some really unexpected things.
- There are many different types of illusions:
  • sizes and shapes of objects
  • interpretation of objects
  • perceived colours and contrasts
  • depth perception
  • movement of objects
  • afterimages
  • ...

Common shapes of distributions

Note: long tail = right skew => use a log scale

Stevens' Power Law

Our personal or "subjective" scales aren't linear. We can ask questions about "equal ratios" of lengths, areas, volumes or other quantities. Let x be the quantity, and let p(x) be our subjective perception of it:
p(x) ∝ x^β
β varies with the perceptual attribute and with the experiment. Typically:
- length: β ∈ [0.9, 1.1]
- area: β ∈ [0.6, 0.9]
- volume: β ∈ [0.5, 0.8]
So the estimation of ratios of areas and volumes may be systematically biased; small areas and volumes may seem subjectively too large.

Ratio re-scaling

Ratios can be tricky. Reciprocal: is your ratio the right way up for your purpose? E.g. relating car weight to miles per gallon (mpg): mpg goes down as weight goes up, but it can't be negative, so mpg can't be linearly related to car weight. But gallons per mile = 1/mpg is much more suitable.

Pre-attentive processing

Subconscious accumulation of information from the environment, done by the brain before filtering and processing what's important. Certain basic visual properties are detected immediately by the low-level visual system ("pop-out" vs serial search). It's important for designing visualisations:
- What can be perceived immediately?
- What properties are good discriminators?
- What can mislead viewers?
Examples:
- Colour selection
- Shape selection
- Conjunction of features

Barycentric coordinates

Suppose we're interested in the relative proportions of 3 quantities: R, G and B. We can scale them so that R+G+B=1 and then plot the values on the blue triangle. In fact, we can draw the triangle in the plane.

Data-ink ratio

Term describing how many visual items (how much "ink") representing data there are in a chart in relation to how much there is overall.

Scatterplot

The most fundamental plot; it relates 2 variables x and y. It shows a large number of (x, y) pairs at once, in such a way that they can be compared.
- Checking data:
  • What is the story behind the data?
  • Does the data make sense?
  • Is there data missing or obviously incorrect?
  • Are there obvious questions that we want to ask?
  • Are there outliers? Are they errors, or the most interesting and important cases?
- Questions about the relationship between x and y:
  • Is there a relationship? What type of relationship could there be? Linear/curved? How accurate is it? Can we find a "law"?
  • What are sensible scalings to use?
  • If there are different groups of data, do they have the same distribution?

Effectiveness principle

The most important attributes should be the most salient.
- Salience: how noticeable something is.
- How do the channels discussed measure up?
- How was this determined?

Dimensionality

The number of measurements available for each observation/instance in a dataset.

Distribution

The observed data came from some process that produces different observations according to some parametric distribution. The data may be:
- a set of values or measurements of named entities (e.g. countries, states, genes), where there's no statistical population (we can't sample another random country);
- a set of independent samples from some (larger) population;
- a set of dependent samples (e.g. average temperatures of successive years).

Randomised intervention study

The point of randomised allocation is that, for large enough control and treatment groups, their statistical composition must be (nearly) the same. All confounders are well controlled and present in the same proportions in the treatment and control groups. Although randomisation of the experimental group eliminates confounders that are characteristics of the subjects, it is also necessary to eliminate observational confounders.
- Double-blinding eliminates observational bias:
  • placebo controls,
  • difficult for the effects of surgery!
- Intention-to-treat analysis requires that all subjects are included in the analysis and that there is an investigation of why subjects drop out.

Cognition

The processing of information, applying knowledge. - Recognising objects - Relations between objects - Conclusion drawing - Problem solving - Learning ...

Graphical perception

The visual decoding of information encoded on graphs.
- When we draw a graph, we encode a numerical value as a graphical attribute.
- When we look at a graph, the aim is to decode the graphical attributes and extract information about the numbers which were encoded.
- To design effective graphs we must know which graphical attributes are most easily decoded.
- We need a selection of possible graphical attributes and an ordering of their ease of decoding.

Layer

A thing that consists of data, an aesthetic mapping, scales, geometry, a statistical transformation and a coordinate system. Together these determine how the plot will look.

Multivariate visualisation

Visualisation of datasets that have more than 3 variables. Strategies:
- Avoid "over-encoding".
- Use the space and small multiples intelligently.
- Reduce the problem space.
- Use interaction to generate relevant views.
Remember: there's rarely a single visualisation that answers all questions. Instead, the ability to generate appropriate visualisations quickly is key.

Data cleaning

Where most (80%) of the effort goes in any data science project, consisting of:
1. Gaining an understanding of how the dataset was constructed, and what the different numbers in it mean.
2. Formatting the data appropriately for the analysis.
3. Identifying and correcting errors, by exhaustive visualisation and checking. Exceptional data items should be identified and then checked.

Causation

Where we are interested in P(Y ∈ A | set X = x). It's about active interventions.

Experimental investigation

- An experiment is designed, and data is specially collected.
- Experiments are designed to eliminate confounders, usually by randomisation.
- Expensive and time-consuming, because carefully controlled observations must be specially made.
- Typically "small data": the main problem is deciding when enough observations have been made and the experiment can stop.
- Main question: have I got enough data to be sure the pattern I observe is not caused by chance?

tall format

Pro: easier to process

Observational investigation

- A process that may use existing observations: data that has been found, or recorded for other purposes.
- Confounders aren't controlled: the causes of correlations have to be guessed.
- Data may be plentiful and cheap.
- Main question: what may have caused the correlations observed?
Investigation plan - repeat:
0. Think of a possible confounder.
1. Investigate the possible confounder.
until you can't think of any more confounders or you have no relevant data to investigate them.
Significance tests: if the result isn't statistically significant, then the correlation/effect may not be there and more evidence is needed. Otherwise, the question is always: "What may have caused it, and what evidence can I find about that?"

A/B testing

- A randomised controlled trial as used by technology companies and web applications.
- Typical use case: comparing versions of a web page. Which has a higher conversion rate?
- Examples of where it's used:
  • Tech companies improving their UI / algorithms.
  • E-commerce sites improving their design, personalising content or offering promotions.
  • Optimising ads.
Design challenges:
- Deciding what to test:
  • What metrics do you want to improve?
  • What treatments might improve them?
- Running valid tests:
  • Usually all visitors to the site represent the desired population.
  • But sometimes we want to target a subpopulation to get a more focused inference.
  • Common mistakes: testing with too many variations; changing experiment settings in the middle of a test.
- Interpreting test results.

plyr

- An R package for splitting, applying and combining data.
- It handles 3 common data structures beyond vectors: arrays (matrices), data frames and lists.
- Its successor, dplyr, provides a function for each basic verb of data manipulation:
  • filter() to select cases based on their values.
  • arrange() to reorder the cases.
  • select() and rename() to select variables based on their names.
  • mutate() and transmute() to add new variables that are functions of existing variables.
  • summarise() to condense multiple values to a single value.
  • sample_n() and sample_frac() to take random samples.
plyr provides a ddply function with 3 parts:
ddply(datafm, .(Subject), function(df) data.frame(RT.mean=mean(df$RT)))
1. The data, datafm.
2. The splitting variable, .(Subject) (.() is a utility function which quotes a list of variables or expressions).
3. The function to apply to the individual pieces.
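For comparison, a dplyr sketch equivalent to the ddply call above (datafm, Subject and RT are the hypothetical names from the example):

library(dplyr)
datafm %>%
  group_by(Subject) %>%
  summarise(RT.mean = mean(RT))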

Why create visualisations?

- Answer questions (or discover them)
- Make decisions (e.g. stock market)
- See data in context (e.g. a map)
- Expand memory (e.g. multiplication tables)
- Find patterns (e.g. astronomy data, transactions)
- Present arguments or tell a story
- Inspire (e.g. textbooks)

3 E's of displaying data

- Effectively: ensure readers can easily find your data alongside any text or verbal cues that refer to it. Take care to reinforce your written text or content with your data visualisations. Don't replace or repeat information that would be best explained in a visual.
- Ethically: always be honest with your readers in your visual data. Avoid inflating trends, data points, results or scale with visual tools.
- Efficiently: depending on where your data is being displayed, use colour judiciously. Also ensure that you use white space and page layout efficiently.

Marks

- Graphical elements in an image representing items or links.
- Basic geometric elements.
- Classified according to the number of spatial dimensions required.

Logarithmic re-scaling

- Is the variable always positive?
- Are we more interested in percentage changes than in absolute changes?
If yes, then try log(variable).

Gestalt theory of visual perception

- Patterns that transcend the visual stimuli that produced them.
- They translate into useful sets of rules for visual design.
- This idea of seeing the whole before the parts - and, even more, the whole becoming more than the sum of its parts - is Gestalt.
- There are a variety of principles.
- From the 1st group of psychologists to systematically study perceptual organisation, around the 1920s in Germany.
Principles:
- Proximity
- Grouping
- Similarity
- Closure
- Continuation
- Figure / ground
Tips:
- Arrange data logically and systematically at every opportunity (slide 238).
- Remove additional cognitive load (e.g. colours).
- Distinctive objects create a focal point.
- Objects placed close to one another are perceived as a group.

Visualisation goals

- To explore: nothing is known; visualisation is used for data exploration.
- To analyse: there are hypotheses; visualisation is used for verification or falsification.
- To present: "everything" is known about the data; visualisation is used for communicating results.
Good data visualisation:
- makes data accessible
- combines the strengths of humans and computers
- enables insight
- communicates (accurate things)
- has a single and clear purpose

Purpose of colour

- To label
- To measure
- To represent and imitate
- To enliven and decorate

Common mistakes

- Too many variables
  • Problem: hard to see relationships between more than three variables (two positions plus one other channel).
  • Solution: warn the user and suggest alternatives such as faceting.
- Overplotting
  • Problem: prompts incorrect conclusions about the distribution.
  • Solution: supplement the plot with contours, or colour by density.
- Alphabetical ordering
  • Problem: categorical variables are often ordered alphabetically.
  • Solution: order by some property of the data that is more useful.
- Polar coordinates
  • Problem: humans are better at judging length than angle or area, and it's difficult to judge an angle for objects with a small radius.

Data dimensionality

- Univariate: measurements made on one variable per subject.
- Bivariate: measurements made on two variables per subject.
- Multivariate: measurements made on many variables per subject.

Diverging colour scale

- Useful when the data has a meaningful "midpoint".
- Use a neutral colour for the midpoint.
- Use saturated colours for the endpoints.
- Limit the number of colour steps to 3-9.

Report writing

1. Gather and construct your hard and precise information (graphics, tables, technical definitions and equations).
2. Write your text around the hard information (before and after), commenting on it to make your argument. Why? If you write the text first, you will waffle a description of your graphics/equations/tables/technical definitions in ordinary language - and ordinary language isn't good for this!
3. Revise and re-organise if needed.

Distribution of a variable

A list of the possible values of a variable together with the frequency of each value (note: probabilities can be given instead of frequencies). It's critical for assessing the probability of events.

Correlation

A measure of the relationship between two variables. If A and B are correlated (or, more generally, if there is some pattern relating them), then either:
1. changes in A may cause changes in B;
2. changes in B may cause changes in A; or
3. there's some confounder (or confounding variable) C, such that changes in C cause changes in both A and B.

Correlation is not causation

A perceived relationship between two variables does not mean that one caused the other. P(Y ∈ A | X = x) ≠ P(Y ∈ A | set X = x)

HSV

A perceptual colour space which is more natural for user interaction and corresponds to the artistic concepts of tint, shade and tone. A change in the amount of a colour value should produce an equivalent visual change. Colours are represented in the H, S, V parametrised space (commonly modelled as a cone):
- H(ue): dominant colour wavelength (the colour type: red, green, blue, ...).
- S(aturation): the amount of hue present (the "vibrancy"/purity of the colour).
- V(alue): brightness (luminance) of the colour.

Natural experiment

A situation in which people (or subjects) are exposed to experimental and control conditions by some special circumstances that are arguably similar to a random assignment. In situations where no randomised trial is possible, natural experiments may be all that there is. Example:
- Bus drivers and bus conductors: drivers and conductors come from similar social backgrounds but take different amounts of exercise; conductors were healthier.

Describing distributions of numerical variables

Always mention the:
- shape: skewness, modality;
- centre: an estimate of a typical observation in the distribution (mean, median, mode, etc.);
- spread: a measure of variability in the distribution (e.g. SD, IQR, range);
- unusual observations: observations that stand out from the rest of the data and may be suspected outliers.

Visualisation

Conversion of numbers → images:
- humans are generally poor at raw numerical data analysis;
- human visual reasoning allows robust analysis of visual stimuli (convert numerical analysis into visual analysis).
What do you want your visualisation to show about your data?
- distribution: how a variable or variables in the dataset distribute over a range of possible values;
- relationship: how the values of multiple variables in the dataset relate;
- composition: how the dataset breaks down into subgroups;
- comparison: how trends in multiple variables or datasets compare.
When we make a visualisation from our data, we are telling a story. There is information buried in our data, and we are finding the best way to make it accessible. A successful visualisation has to have something to say or a question to answer, which means its creator (you!) needs to know what that story or question is:
- What are you trying to say about the data?
- What question are you trying to ask of the data?
There are many ways to think about the different types of data: entities exist in relationships with one another, and can have attributes, which can be comprised of multiple dimensions.

Gestalt principle: proximity

How is the data organised? Things that are visually close to each other are related

Visual attribute ranking

How well do people decode visual cues? From most to least accurate:
1. Position along a common scale (scatter plot)
2. Position on identical but non-aligned scales (multiple scatter plots)
3. Length (bar chart)
4. Angle & slope (pie chart)
5. Area (bubbles)
6. Volume, density and colour saturation (heatmap)
7. Colour hue
Thus the most important variables should be those on the x and y axes (position).

Perception

Identification and interpretation of sensory information, from the physical stimulus to recognised information.
- Eye, optic nerve, visual cortex.
- First processing (edges, planes).
- Not conscious.
- Reflexes.

Scaling: general principles

Identify a reference comparison: e.g. should the relationship be linear? Exponential? Logarithmic? y = 1/x? f(y) = g(x)? Should one term be a constant? Transform the data so that the reference comparison is as simple as possible. You may need to do this in several steps. It may be useful/necessary to combine variables (e.g. BMI = weight/height²).

Scaling to a reference comparison

If it would be "natural" that f(y) = g(x) for 2 functions f and g, then plot f(y) against g(x), or f(y) − g(x) against x, or f(y)/g(x) against x. If it would be "natural" that y = g(x) + δ(x) for some natural relationship g and some interesting difference δ(x), then plot y − g(x) versus x. Example: consider data on the weights and heights of a sample of people. A natural reference comparison is that weight = A·height³, for some constant A. So plot, for example, weight/height³ against height.

Banking

If you must read a gradient from a line graph, vary the aspect ratio to make gradients at different scales close to 45°.

States of reality and decisions made

In decision-making, there's the possibility of committing an error, which could either be an error of Type I or an error of Type II.

%>% (Pipe operator)

Operator that allows you to pipe the output of one function into the input of another. The idea of piping is to read the functions from left to right (instead of having to nest functions). E.g.:
diamonds2 <- diamonds %>%
  mutate(price_per_carat = price/carat)
by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n = n(),
    mean = mean(price),
    lq = quantile(price, 0.25),
    uq = quantile(price, 0.75)
  )

reshape(/reshape2)

Packages designed to help reshape data between wide and tall formats.
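A minimal reshape2 sketch (assumes the reshape2 package is installed; airquality ships with R):

library(reshape2)
aq_tall <- melt(airquality, id.vars = c("Month", "Day"))  # wide -> tall
aq_wide <- dcast(aq_tall, Month + Day ~ variable)         # tall -> wide again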

QQ (Quantile-Quantile) plot

Plot the quantiles of one sample against the quantiles of the other. We can tell which group has bigger quantiles, based on whether the points are above or below the reference line y = x. In the image example, we can see that the birth weights are bigger when smoke == no, but the difference goes away at the lower quantiles (i.e. premature births).
Pro:
- It's easier to see whether 2 samples are different and in which quantiles they differ specifically.
Con:
- Potentially harder to understand than the quantile plot.
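A minimal base R sketch with simulated data (the shift of 0.5 is an illustrative choice):

x <- rnorm(200)
y <- rnorm(200, mean = 0.5)
qqplot(x, y)           # points lie above y = x where y's quantiles are larger
abline(0, 1, lty = 2)  # reference line y = x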

wide format

Pro: easier for people to view and comprehend.

Dredging

The process of continuing to look for patterns or regularities after your initial analysis has failed - don't do this. It can invalidate conclusions!

Visual encoding

The way in which data is mapped into visual structures, upon which we build the images. Encoding challenge: how to pick the best encoding (or mapping) from many possibilities? Consider:
- importance ordering: encode the most important information in the most perceptually accurate way;
- expressiveness: depict all the data, and only the data;
- consistency: the properties of the image (visual attributes) should match the properties of the data.

Prediction

Where we are interested in P(Y ∈ A | X = x). It (association/correlation) is about passive observation.

Tabular data

Where we expect each record/observation to represent a set of measurements of a single object or event. Each type of measurement is called a variable or an attribute of the data (e.g. height, radius and "Do I like it?" are variables/attributes). The number of attributes is called the dimension of the data. We expect each table to contain a set of records/observations of the same kind of object or event (e.g. our table above contains observations of cylinders).

Randomisation in experimental design

Why randomise? Randomisation is defensive: it prevents unintentionally introducing confounding variables. Why randomise in sampling? It avoids unintentionally selecting a biased sample.

Re-scaling

Why?
- To spread data out, to improve visibility: don't have all the data bunched together in a blob in one corner of the plot.
- Finding a "law" by finding a straight line (or better, a horizontal line).
- Plotting data relative to a "reference comparison" (a "law" we don't believe, but which makes a useful point of comparison).
- Before doing any ML or linear modelling, re-scale input variables to compact distributions.

Imputing missing values

With enough training data, one might drop all records with missing values, but we may want to use the model on records with missing fields. Often, it's better to estimate or impute missing values instead of leaving them blank (e.g. a reasonable guess for a missing year of death is year of birth + 80).
Methods:
- Mean value imputation: leaves the mean the same.
- Random value imputation: repeatedly selecting random values permits statistical evaluation of the impact of imputation.
- Imputation by interpolation: using linear regression to predict missing values; this works well if few fields are missing per record.
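A minimal sketch of mean-value imputation on one numeric vector (the values are made up):

x <- c(2.1, NA, 3.5, 4.0, NA, 2.8)
x[is.na(x)] <- mean(x, na.rm = TRUE)   # fill the gaps with the observed mean
x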

